Here are some profiling results. splitvertical1 is the original code, splitvertical2 is the same code with some slight locality improvements, and splitvertical3 is the new O(log n) memcpy code.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 49.44      0.88     0.88     1063     0.83     0.83  gradient_splitvertical1
 47.19      1.72     0.84     1063     0.79     0.79  gradient_splitvertical2
  2.81      1.77     0.05     1063     0.05     0.05  gradient_splitvertical3
I also tested this with 'time', drawing 1000 gradients: the new code used approximately half the user time and finished 10 seconds quicker. So yeah, it's magical and works well.
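
For anyone wondering what "O(log n) memcpy" means, here is a minimal sketch of the general doubling trick. This is not the actual gradient_splitvertical3 code; fill_by_doubling and its parameters are made-up names for illustration. The idea is that you write one element by hand, then each memcpy copies everything filled so far, so the filled region doubles every pass and you only need about log2(n) calls.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from the real gradient code).
 * Fills `count` elements of `elem_size` bytes each in `dst` with copies
 * of element 0, which the caller must already have written.
 */
static void fill_by_doubling(void *dst, size_t elem_size, size_t count)
{
    unsigned char *base = dst;
    size_t filled = 1;              /* element 0 is already in place */

    while (filled < count) {
        /* Copy as much as is already filled, clamped to what remains. */
        size_t n = filled;
        if (n > count - filled)
            n = count - filled;

        /* Source [0, n) and destination [filled, filled + n) never
         * overlap because n <= filled, so plain memcpy is fine. */
        memcpy(base + filled * elem_size, base, n * elem_size);
        filled += n;
    }
}
```

To use it, set the first pixel (or the first row, with elem_size set to the row size in bytes) and call fill_by_doubling on the buffer. The same number of bytes gets written either way; the win presumably comes from making a handful of large memcpy calls instead of looping over every pixel.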