] - and getting much better results (blocking on only one of the buffers, or on all of them, to read the final values back at the end does not seem to make a difference):
with 302 items --> GPU: 276ms / CPU: 228ms
Here's the code I am using to invoke the kernel: http://goo.gl/297a3
One weird thing I've noticed: if I don't block on any buffer, the computation only takes 1ms ... which makes me think something is horribly wrong. I'm trying to find a way to verify.
As mentioned in the previous post, ideally I would like at this stage to get a 2-dimensional array out at the end (at least for one of the buffers), holding the values for each step of the loop I moved into the kernel, so that I can do some plotting and check that the computation is actually happening.
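To make the layout I have in mind concrete, here is a rough pure-Python sketch (no OpenCL involved) of a flat buffer that holds one value per step and per item, indexed as step * n_items + gid. Everything in it is an assumption made for illustration: the names run_work_item, run_all and history are made up, the step count is arbitrary, and the update rule is a placeholder rather than the real computation. In an actual kernel the same index would presumably be computed with get_global_id(0) in place of gid.

```python
# Emulates, on the host, a kernel loop that records every step into a flat
# buffer laid out as [step][item], so it can be read back as a 2D array.
# All names and the update rule are hypothetical placeholders.

N_ITEMS = 302  # item count taken from the timing above
N_STEPS = 5    # assumed number of loop iterations moved into the kernel


def run_work_item(gid, n_items, n_steps, history):
    """Emulate one work-item; gid stands in for get_global_id(0)."""
    value = float(gid)  # placeholder initial state
    for step in range(n_steps):
        value = 0.5 * value + 1.0  # placeholder update, not the real maths
        # Row = step, column = item: each work-item writes its own column.
        history[step * n_items + gid] = value


def run_all(n_items=N_ITEMS, n_steps=N_STEPS):
    """Run every emulated work-item and return the history as a list of rows."""
    history = [0.0] * (n_steps * n_items)  # flat buffer, as a device buffer would be
    for gid in range(n_items):
        run_work_item(gid, n_items, n_steps, history)
    # Slice the flat buffer into one row per step for plotting.
    return [history[s * n_items:(s + 1) * n_items] for s in range(n_steps)]


if __name__ == "__main__":
    rows = run_all()
    print(len(rows), "steps x", len(rows[0]), "items")
```

With this layout the output buffer needs n_steps * n_items elements, and a single blocking read at the end would bring back all intermediate values at once, which should also make it obvious whether the kernel really ran.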
Any help on that appreciated!