I guess by this you mean the constants that I am passing down at every cycle? This makes a lot of sense, but I am not sure how to do it.
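To check whether I've got the right idea, is it just hoisting anything that doesn't change between cycles out of the per-cycle path? A plain-C sketch of my understanding (the names `dt`, `coeff` and the update rule are all made up for illustration):

```c
#include <assert.h>

/* Hypothetical per-cycle update. The coefficient depends only on
 * values that never change across cycles, so it is computed once
 * before the loop instead of being recomputed (or re-passed)
 * every cycle. */
double run(double x0, double dt, int cycles) {
    double coeff = 1.0 - dt * 0.5;   /* hoisted: computed once */
    double x = x0;
    for (int i = 0; i < cycles; i++) {
        x *= coeff;                  /* each cycle just reuses it */
    }
    return x;
}
```

On the GPU side I'd guess the analogue is setting such values as kernel arguments (or constant memory) once up front rather than shipping them down every launch, but I'd appreciate confirmation.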
Wow, swapping the arguments for the I/O pipelines is a great suggestion; I didn't know this could be done at all (remember, I am a n00b!). Would you be able to point me to an example? I will also definitely follow your advice to move the loop into the kernel. At the moment I am pulling the data out to plot it, but once this gets plugged into the bigger picture I have in mind I won't need that. As for the bug, I don't think it is one, because the plots show the curves I expect. (This is a dumb example: effectively I am running the same thing N times in parallel just to test things out, so there is a lot of redundancy. In theory the inputs will be different for each item, but at the moment they are all the same.)
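While I hunt for an example, let me write down my understanding of the argument swap so you can correct me: it's a ping-pong scheme where each pass's output buffer becomes the next pass's input, so nothing is ever copied between passes. A plain-C sketch (the kernel body is invented; on the GPU I assume the swap would be done by re-setting the kernel arguments rather than swapping host pointers):

```c
#include <assert.h>
#include <string.h>

#define N 4

/* one pass of a hypothetical kernel: out[i] = in[i] + 1 */
static void pass(const double *in, double *out) {
    for (int i = 0; i < N; i++) out[i] = in[i] + 1.0;
}

void run_passes(double *data, int passes) {
    double a[N], b[N];
    memcpy(a, data, sizeof a);
    double *in = a, *out = b;
    for (int p = 0; p < passes; p++) {
        pass(in, out);
        /* ping-pong: last pass's output is next pass's input */
        double *tmp = in;
        in = out;
        out = tmp;
    }
    /* after the final swap, 'in' points at the latest results */
    memcpy(data, in, sizeof a);
}
```

Is that roughly the trick, minus the device-side details?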
This sounds like another clever trick. Are you suggesting I store results as I go along in some buffer in the GPU's shared memory, and then run another kernel at the end to harvest the results from that buffer?
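If I've understood, the point is to replace a readback per iteration with a single readback at the end. Here's the bookkeeping as I picture it, in plain C (the array standing in for a device-resident buffer and the per-iteration formula are both invented for illustration):

```c
#include <assert.h>

#define ITERS 5

/* results[] stands in for a buffer that stays on the GPU:
 * each iteration writes its value into the next slot instead
 * of being read back to the host immediately. */
static double results[ITERS];

static void iteration(int i) {
    results[i] = i * 2.0;   /* hypothetical per-iteration result */
}

/* single "harvest" at the end, standing in for one readback
 * (or one reduction kernel) over the whole buffer */
static double harvest_sum(void) {
    double s = 0.0;
    for (int i = 0; i < ITERS; i++) s += results[i];
    return s;
}
```

If that's the shape of it, it would neatly remove all the intermediate transfers from my loop.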
Thanks a lot for the awesome code-tweak suggestions. I will now try to follow your steps and see how they translate into performance gains on my GPU. Will post back with results.
The only thing that still worries me a bit is that my GPU baseline (ATI HD4850) is around 13000ms, while yours is 1875ms and Michael's is 3000ms. Trying to make sense of that, I suspect it's because mine is the 'Mobility' version of the card (I am on a 27'' iMac), but would that alone explain such a huge difference?