So I have the following problem: I have about 10k objects that each contain a float array attribute of size 64 (float[] myArray = new float[64]), and the kernel reduces each array to one float. The thing I am doing now with JOCL is this:
...and all this works. The only thing that is bugging me is that with this approach the CLBuffer allocates an NIO buffer on the host in addition to the buffer on the device (which is a waste, because I don't need the NIO buffer).
How can I get my data to the GPU as fast as possible? Any suggestions? Is there any way to create a GPU-only buffer and fill it with the arrays (forming one big array that the kernel will access)?
The reason I am asking is that I only discovered this JOCL version recently (was using the one at jocl.org before). When I reimplemented my program with this library it was a bit slower...so I am trying to fix this since I really like how readable the code is with this lib :-)
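To make the "one big array" layout from the question concrete, here is a minimal sketch of the host-side packing step only. It uses plain java.nio and no JOCL calls; the class name `HostPacking` and the method `pack` are hypothetical names chosen for illustration, and the assumption is that all arrays have the same length (64 in the question). The arrays are copied back to back into a single direct, native-order buffer, so object i occupies indices i * length to i * length + length - 1. How that buffer is then attached to a CLBuffer and transferred depends on the JOCL version and is not shown.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

// Hypothetical helper, not part of JOCL: packs many small arrays into one buffer.
class HostPacking {

    /**
     * Copies equally sized float arrays back to back into one direct buffer.
     * Array i ends up at indices [i * length, (i + 1) * length).
     */
    static FloatBuffer pack(float[][] arrays, int length) {
        FloatBuffer packed = ByteBuffer
                .allocateDirect(arrays.length * length * Float.BYTES)
                .order(ByteOrder.nativeOrder()) // native order avoids byte swapping on transfer
                .asFloatBuffer();
        for (float[] a : arrays) {
            if (a.length != length) {
                throw new IllegalArgumentException("expected length " + length + ", got " + a.length);
            }
            packed.put(a); // bulk copy of one object's array
        }
        packed.flip(); // position = 0, limit = number of floats written
        return packed;
    }

    public static void main(String[] args) {
        // Tiny stand-in for the 10k x 64 case described in the question.
        float[][] data = { {1f, 2f}, {3f, 4f} };
        FloatBuffer packed = pack(data, 2);
        System.out.println("floats packed: " + packed.remaining());
    }
}
```

Allocating this buffer once and refilling it for each run (instead of allocating a new one every time) is one thing that may reduce the host-side overhead, though whether it accounts for the slowdown described here would need to be measured.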