So I have the following problem... I have about 10k objects, each holding a float array attribute of size 64 (float[] myArray = new float[64]), and each array reduces to a single float in the kernel. What I am doing now with JOCL is this:
CLBuffer<FloatBuffer> buffer = context.createFloatBuffer(64 * 10000, READ_ONLY);
CLBuffer<FloatBuffer> output = context.createFloatBuffer(10000, WRITE_ONLY);
for (MyObject object : objects) {
    buffer.getBuffer().put(object.getMyArray());
}
buffer.getBuffer().rewind();
kernel.putArgs(buffer, output).putArg(GLOBAL);
queue.putWriteBuffer(buffer, false)
.put1DRangeKernel(kernel, 0, GLOBAL, 256)
.putReadBuffer(output, true);
...and all this works. The only thing bugging me is that with this approach the CLBuffer allocates memory twice: a direct NIO buffer on the host and the actual buffer on the device. The host-side NIO copy is a waste, because I don't need the data on the host.
How can I get my data to the GPU as fast as possible? Any suggestions? Is there any way to create a GPU-only buffer and fill it with the arrays (forming one big array that the kernel will access)?
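For context, the only host-side workaround I can think of is to pack everything into one primitive array before touching JOCL at all, so the 10k small puts become a single bulk put. A minimal sketch (plain Java, no JOCL calls; MyObject and the 64-element layout are just stand-ins for my actual classes):

```java
import java.nio.FloatBuffer;

public class PackDemo {
    static final int ARRAY_SIZE = 64;

    // Stand-in for my actual object holding a float[64].
    static class MyObject {
        final float[] myArray = new float[ARRAY_SIZE];
        float[] getMyArray() { return myArray; }
    }

    // Pack all per-object arrays into one contiguous float[].
    static float[] pack(MyObject[] objects) {
        float[] packed = new float[objects.length * ARRAY_SIZE];
        for (int i = 0; i < objects.length; i++) {
            System.arraycopy(objects[i].getMyArray(), 0,
                             packed, i * ARRAY_SIZE, ARRAY_SIZE);
        }
        return packed;
    }

    public static void main(String[] args) {
        MyObject[] objects = new MyObject[3];
        for (int i = 0; i < objects.length; i++) {
            objects[i] = new MyObject();
            for (int j = 0; j < ARRAY_SIZE; j++) {
                objects[i].getMyArray()[j] = i * 100f + j;
            }
        }
        float[] packed = pack(objects);
        // With JOCL this would become one bulk put instead of 10k small ones:
        // buffer.getBuffer().put(packed).rewind();
        FloatBuffer fb = FloatBuffer.wrap(packed);
        System.out.println(fb.get(ARRAY_SIZE + 1)); // element 1 of object 1
    }
}
```

That only collapses the Java-side copies, though; the data still flows through the CLBuffer's host NIO buffer, which is exactly the allocation I would like to skip.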
The reason I am asking is that I only discovered this JOCL version recently (I was using the one from jocl.org before). When I reimplemented my program with this library, it was a bit slower... so I am trying to fix that, since I really like how readable the code is with this lib :-)