Optimal GPU data transfer


So I have the following problem... I have about 10k objects, each holding a float array attribute of size 64 (float[] myArray = new float[64]); each array reduces to a single float in the kernel. What I am doing now with JOCL is this:

    CLBuffer<FloatBuffer> buffer = context.createFloatBuffer(64 * 10000, READ_ONLY);
    CLBuffer<FloatBuffer> output = context.createFloatBuffer(10000, WRITE_ONLY);
    for (MyObject object : objects) {
        buffer.getBuffer().put(object.myArray); // copy each array into the host-side NIO buffer
    }
    buffer.getBuffer().rewind();
    kernel.putArgs(buffer, output).putArg(GLOBAL);
    queue.putWriteBuffer(buffer, false)
         .put1DRangeKernel(kernel, 0, GLOBAL, 256)
         .putReadBuffer(output, true);

...and all this works. The only thing bugging me is that with this approach CLBuffer allocates an NIO buffer on the host in addition to the memory on the device (which is a waste, because I don't need the NIO copy).

How can I get my data to the GPU as fast as possible? Any suggestions? Is there any way to create a GPU-only buffer and fill it directly with the arrays (forming one big array that the kernel will access)?
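In case it helps to show what I mean by "one big array": the host-side packing step on its own looks roughly like this (plain java.nio, no JOCL involved; MyObject here is just a stand-in for my real class):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.util.ArrayList;
import java.util.List;

public class PackDemo {
    // stand-in for my real object type
    static class MyObject {
        final float[] myArray = new float[64];
    }

    // flatten all per-object arrays into one contiguous direct buffer,
    // laid out exactly as the kernel would index it: object i owns
    // elements [i * 64, (i + 1) * 64)
    static FloatBuffer pack(List<MyObject> objects) {
        FloatBuffer fb = ByteBuffer.allocateDirect(objects.size() * 64 * Float.BYTES)
                                   .order(ByteOrder.nativeOrder())
                                   .asFloatBuffer();
        for (MyObject o : objects) {
            fb.put(o.myArray);
        }
        fb.rewind(); // so a subsequent write starts at element 0
        return fb;
    }

    public static void main(String[] args) {
        List<MyObject> objects = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            MyObject o = new MyObject();
            java.util.Arrays.fill(o.myArray, i);
            objects.add(o);
        }
        FloatBuffer fb = pack(objects);
        System.out.println(fb.remaining()); // 192
        System.out.println(fb.get(0));      // 0.0
        System.out.println(fb.get(64));     // 1.0
        System.out.println(fb.get(128));    // 2.0
    }
}
```

Ideally I would skip even this intermediate direct buffer and write the arrays straight into device memory, which is what my question is about.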

The reason I am asking is that I only discovered this JOCL version recently (I was using the one from jocl.org before). When I reimplemented my program with this library it was a bit slower... so I am trying to fix that, since I really like how readable the code is with this lib :-)