On scaling up:
302 items --> GPU: 13332ms / CPU: 1428ms
3002 items --> GPU: 26979ms / CPU: 16245ms
300002 items --> GPU: 170071ms / CPU: 147497ms
which basically confirms that scaling up resolves the CPU vs GPU issue (and that my GPU sucks).
With all the output buffers non-locking except Vout:
302 items --> GPU: 5567ms / CPU: 1471ms
With all the output buffers non-locking:
302 items --> GPU: 5346ms / CPU: 757ms
In this case I noticed a couple of things: 1) if I add .finish() at the end, the performance gets a bit worse (why?); 2) if I plot Vout, the plot is a bit messed up, but only when I run on the GPU (why?). In general, I'd be glad if you could point me to resources where I can learn more about what exactly locking does/means (I have an idea, but I'd like to understand it better).
Non-locking all + changing the arrays in the kernel to float4:
302 items --> GPU: 4801ms / CPU: 673ms
Next thing I am gonna do is move the loop into the kernel and have all the intermediate results stored along the way as an output for plotting (maybe optionally populated via another parameter), and I hope to get closer to the awesome results @notz reported (I'd be happy with < 100ms on my crappy GPU). I'll post the results back here (maybe tomorrow).
The only thing I am not too sure about at this point is how I am going to return "the results", since that means returning 2-dimensional float arrays: I need to know the values for each item at each point of the computation (in order to do any plotting). So I am back to a problem already discussed on this forum [http://forum.jogamp.org/Passing-array-of-arrays-to-OpenCL-via-JOCL-tp2922911p2922911.html]. I seem to understand I cannot just return a float**, and this time I cannot flatten it out as I did for the inputs because I do not know how many time steps the computation is gonna simulate (that's gonna be a parameter too). Ideas?
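One option that might work: the number of time steps is a parameter, so it is known on the host before the kernel is launched, even if it is not known at compile time. The output could then still be one flat buffer of items * steps floats, indexed as step * items + item. Below is a rough host-side sketch of that layout in plain Java. FlatResults and its methods are names I made up for illustration, they are not part of JOCL, and wrapping the buffer into a CL buffer is left out.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

// Hypothetical helper (not a JOCL class): a flat [steps x items] result layout.
// Assumption: the kernel would write each value with something like
//   results[step * get_global_size(0) + get_global_id(0)] = v;
final class FlatResults {
    final int items;
    final int steps;
    final FloatBuffer data; // direct buffer in native byte order

    FlatResults(int items, int steps) {
        this.items = items;
        this.steps = steps;
        // Sized at run time, once the number of time steps is known.
        this.data = ByteBuffer.allocateDirect(items * steps * Float.BYTES)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
    }

    // Position of (step, item) in the flat buffer.
    int index(int step, int item) {
        return step * items + item;
    }

    float get(int step, int item) {
        return data.get(index(step, item));
    }

    void set(int step, int item, float value) {
        data.put(index(step, item), value);
    }

    // All values of one item over time, e.g. to plot a single trace.
    float[] trace(int item) {
        float[] out = new float[steps];
        for (int s = 0; s < steps; s++) {
            out[s] = get(s, item);
        }
        return out;
    }
}
```

With this layout all the items of a given time step sit next to each other in memory, and trace(item) pulls out the series for one item when plotting. The catch is memory: 300002 items over many steps gets big quickly, so storing only every Nth step might be needed.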
Thanks again for the awesome help, you guys rock.