So I have the following problem: I have about 10k objects, each containing a float array attribute of size 64 (float[] myArray = new float[64]); each array reduces to one float in the kernel. What I am doing now with JOCL is this:
    CLBuffer<FloatBuffer> buffer = context.createFloatBuffer(64 * 10000, READ_ONLY);
    CLBuffer<FloatBuffer> output = context.createFloatBuffer(10000, WRITE_ONLY);

    for (MyObject object : objects) {
        buffer.getBuffer().put(object.getMyArray());
    }
    buffer.getBuffer().rewind();

    kernel.putArgs(buffer, output).putArg(GLOBAL);
    queue.putWriteBuffer(buffer, false)
         .put1DRangeKernel(kernel, 0, GLOBAL, 256)
         .putReadBuffer(output, true);
...and all this works. The only thing that is bugging me is that with this approach the CLBuffer creates an NIO buffer on the host in addition to the buffer on the device (which is a waste, because I don't need the NIO buffer).
How can I get my data to the GPU as fast as possible? Any suggestions? Is there any way to create a GPU-only buffer and fill it with the arrays (forming one big array that the kernel will access)?
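For clarity, here is a plain-Java model of the memory layout I'm describing: one flat array of 64 * 10000 floats, where work item gid reads floats [gid * 64, gid * 64 + 64) and writes one output float. The actual per-chunk operation in my kernel isn't shown here; a simple sum is assumed purely for illustration.

```java
// Hypothetical CPU-side model of the flattened layout described above.
// The real work happens in an OpenCL kernel; the per-chunk reduction
// (a sum here) is an assumption for the sketch.
public class LayoutSketch {
    static final int CHUNK = 64; // floats per object

    static float[] reduceChunks(float[] flat) {
        float[] out = new float[flat.length / CHUNK];
        for (int gid = 0; gid < out.length; gid++) { // gid ~ get_global_id(0)
            float acc = 0f;
            for (int i = 0; i < CHUNK; i++) {
                acc += flat[gid * CHUNK + i];
            }
            out[gid] = acc;
        }
        return out;
    }

    public static void main(String[] args) {
        // Two chunks: first all ones, second all twos.
        float[] flat = new float[2 * CHUNK];
        java.util.Arrays.fill(flat, 0, CHUNK, 1f);
        java.util.Arrays.fill(flat, CHUNK, 2 * CHUNK, 2f);
        float[] out = reduceChunks(flat);
        System.out.println(out[0] + " " + out[1]); // 64.0 128.0
    }
}
```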
The reason I am asking is that I only discovered this JOCL version recently (I was using the one at jocl.org before). When I reimplemented my program with this library it was a bit slower, so I am trying to fix that, since I really like how readable the code is with this lib :-)