I have implemented a JOCL program using putMapBuffer(...). The putMapBuffer call itself takes approximately 0 milliseconds (with OpenCL executing on the CPU). After this call, however, I have to copy data from my application buffers into the returned pointers (the ByteBuffers h_data1 and h_data2 in the code below), and this copy is expensive in terms of time.
After the kernel is invoked, I copy the results back using the pointer h_data3. Copying the data back to the host buffer is fine and efficient in terms of time.
Am I following the correct steps?
Is there an efficient way to copy data from application buffers to the pointers returned by putMapBuffer?
h_data1 = queue.putMapBuffer(clBufferA,WRITE,true);
h_data2 = queue.putMapBuffer(clBufferB,WRITE,true);
//input copy time (very expensive in terms of time)
long time1 = nanoTime();
time1 = nanoTime() - time1;
Are you timing one run or multiple/subsequent invocations? Mapping memory is probably lazy, i.e. the actual mapping only happens when you first access it, which makes the first access slow.
Also note that the code below doesn't actually do any copying; it just creates a new pointer which associates CPU memory with the GPU object. All you're timing is a little bit of Java code and a new().
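One way to check whether only the first access is slow is to time the same copy several times in a row. The following is a minimal sketch, not JOCL code: a direct ByteBuffer stands in for the buffer returned by putMapBuffer, and the class and method names (CopyTiming, timedCopy, timeRuns) are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class CopyTiming {
    // Copies src into dst with one bulk put and returns the elapsed nanoseconds.
    public static long timedCopy(float[] src, ByteBuffer dst) {
        FloatBuffer view = dst.asFloatBuffer(); // fresh view starting at dst's position
        long start = System.nanoTime();
        view.put(src);
        return System.nanoTime() - start;
    }

    // Repeats the copy `runs` times and returns the duration of each run.
    public static long[] timeRuns(float[] src, ByteBuffer dst, int runs) {
        long[] nanos = new long[runs];
        for (int i = 0; i < runs; i++) {
            nanos[i] = timedCopy(src, dst);
        }
        return nanos;
    }

    public static void main(String[] args) {
        float[] appData = new float[1 << 20];
        // Assumption: a direct buffer used as a stand-in for the mapped pointer.
        ByteBuffer mapped = ByteBuffer.allocateDirect(appData.length * Float.BYTES)
                                      .order(ByteOrder.nativeOrder());
        long[] nanos = timeRuns(appData, mapped, 5);
        for (int i = 0; i < nanos.length; i++) {
            System.out.println("run " + i + ": " + (nanos[i] / 1_000) + " us");
        }
    }
}
```

If run 0 is much slower than the later runs, the cost is likely first-touch or warm-up overhead rather than the copy itself.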
//Data copy in App buffer (very efficient in terms of time)
long time2 = nanoTime();
clBufferC = clBufferC.cloneWith(h_data3.asFloatBuffer());
In general you seem to be doing excessive copies anyway. You already have h_data3, so why copy it to clBufferC? And why copy buffer A into the mapped copy of A?
And from what I can tell from the spec, you need to unmap the buffer before executing the kernel. The spec wording is a little convoluted (spec 1.1, section 220.127.116.11), but it states that mapped memory cannot be accessed from kernels.
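If a copy into the mapped buffer is unavoidable, a single bulk put through a FloatBuffer view is typically cheaper than writing one float at a time. The sketch below uses plain java.nio only; the direct ByteBuffer stands in for what putMapBuffer returns (an assumption), and BulkCopy, copyElementwise and copyBulk are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class BulkCopy {
    // Element-wise copy: one absolute, bounds-checked putFloat per element.
    public static void copyElementwise(float[] src, ByteBuffer dst) {
        for (int i = 0; i < src.length; i++) {
            dst.putFloat(i * Float.BYTES, src[i]);
        }
    }

    // Bulk copy: a single put on a float view over the same bytes.
    // The view has its own position, so dst's position is left unchanged.
    public static void copyBulk(float[] src, ByteBuffer dst) {
        dst.asFloatBuffer().put(src);
    }

    public static void main(String[] args) {
        float[] appData = new float[1024];
        for (int i = 0; i < appData.length; i++) {
            appData[i] = i * 0.5f;
        }
        int bytes = appData.length * Float.BYTES;
        // Assumption: direct buffers used as stand-ins for mapped pointers.
        ByteBuffer a = ByteBuffer.allocateDirect(bytes).order(ByteOrder.nativeOrder());
        ByteBuffer b = ByteBuffer.allocateDirect(bytes).order(ByteOrder.nativeOrder());
        copyElementwise(appData, a);
        copyBulk(appData, b);
        System.out.println("same contents: " + a.equals(b));
    }
}
```

Both methods produce the same bytes; only the number of calls into the buffer differs.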