putMapBuffer - data In/Out performance issue

putMapBuffer - data In/Out performance issue

suleman
Hi,

 I have implemented a JOCL program using putMapBuffer(...). The putMapBuffer call itself takes approximately 0 milliseconds (on CPU-based OpenCL execution). But after this call I have to copy data from my application buffers into the mapped pointers (the ByteBuffers h_data1 and h_data2 in the code below). This copying is expensive in terms of time.
After the kernel invocation I copy the results back through the pointer h_data3. Copying the data back to the host buffer is efficient.

Am I doing the correct steps or not?
Is there a more efficient way to copy data from application buffers into the pointers returned by putMapBuffer?

Many thanks

CODE EXAMPLE
------------------------------
...
h_data1 = queue.putMapBuffer(clBufferA, WRITE, true);
h_data2 = queue.putMapBuffer(clBufferB, WRITE, true);
...
// Input copy (very expensive in terms of time)
long time1 = nanoTime();
h_data1.clear();
h_data1.asFloatBuffer().put(clBufferA.getBuffer());
h_data2.clear();
h_data2.asFloatBuffer().put(clBufferB.getBuffer());
time1 = nanoTime() - time1;

// Kernel execution
queue.put1DRangeKernel(kernel, 0, globalWorkSize, localWorkSize);
queue.finish();

// Output putMapBuffer
h_data3 = queue.putMapBuffer(clBufferC, READ, true);

// Copy data back into the application buffer (very efficient in terms of time)
long time2 = nanoTime();
clBufferC = clBufferC.cloneWith(h_data3.asFloatBuffer());
time2 = nanoTime() - time2;

Re: putMapBuffer - data In/Out performance issue

notzed
Are you timing one run or multiple/subsequent invocations? Mapping memory is probably lazy, i.e. the actual mapping only happens when you access it, which makes the first access slow.

Also note that the code below doesn't actually do any copying; it just creates a new pointer that associates CPU memory with the GPU object. All you're timing is a little bit of Java code and a new().

//Data copy in App buffer  (very efficient in terms of time)
long time2 = nanoTime();
clBufferC = clBufferC.cloneWith(h_data3.asFloatBuffer());

In general you seem to be doing excessive copies anyway. You already have h_data3, so why copy it into clBufferC? And why copy bufferA into the mapped copy of A?
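One way to cut the input copy is to generate the data directly into the mapped region's FloatBuffer view, rather than staging it in a separate application buffer first and bulk-copying it over. A minimal sketch of that pattern, using a plain direct ByteBuffer as a stand-in for the pointer returned by putMapBuffer (the class name, producer loop, and sizes are made up for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class MappedFill {

    // Fill a (stand-in) mapped region directly through its float view,
    // skipping the intermediate application buffer and its bulk copy.
    static ByteBuffer fillDirect(int n) {
        ByteBuffer mapped = ByteBuffer.allocateDirect(n * 4)
                                      .order(ByteOrder.nativeOrder());
        FloatBuffer view = mapped.asFloatBuffer();
        for (int i = 0; i < n; i++) {
            // Write each value straight into the mapped view as it is produced.
            view.put(i, i * 2.0f);
        }
        return mapped;
    }

    public static void main(String[] args) {
        FloatBuffer view = fillDirect(4).asFloatBuffer();
        System.out.println(view.get(3)); // prints 6.0
    }
}
```

If your producer can write into that view directly, the separate "input copy" step disappears entirely.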

And from what I can tell from the spec, you need to unmap the buffer before executing the kernel. The spec wording is a little convoluted (OpenCL 1.1, section 5.4.2.1), but it states that mapped memory cannot be accessed from kernels.
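The sequence the spec seems to call for would look roughly like the fragment below. putUnmapMemory is the JOCL call I'd expect to use here, but treat the exact method names and arguments as assumptions; this is a sketch against a live queue and device, not runnable on its own:

```java
// Map, fill, then unmap BEFORE the kernel runs
// (mapped memory must not be accessed by kernels, OpenCL 1.1, 5.4.2.1).
ByteBuffer h_data1 = queue.putMapBuffer(clBufferA, WRITE, true);
h_data1.asFloatBuffer().put(inputData);   // fill while mapped
queue.putUnmapMemory(clBufferA, h_data1); // hand the region back to the device

queue.put1DRangeKernel(kernel, 0, globalWorkSize, localWorkSize);
queue.finish();

// Map the result read-only, consume it, unmap again.
ByteBuffer h_data3 = queue.putMapBuffer(clBufferC, READ, true);
FloatBuffer results = h_data3.asFloatBuffer();
// ... use results ...
queue.putUnmapMemory(clBufferC, h_data3);
```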

The AMD OpenCL programming guide, section 4.4, covers a lot of this in detail for AMD's implementation, much of which I imagine is applicable to other GPUs. http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf

FWIW I only use putWriteBuffer/putReadBuffer, as I find that model more intuitive. I haven't had any performance issues related to it.
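For comparison, that explicit-transfer model looks something like this in JOCL; again a fragment under the same assumptions about the surrounding setup (buffers created, kernel args set), and the method chaining is just how JOCL's CLCommandQueue happens to read:

```java
// Explicit transfer model: no mapping, just enqueue writes and reads.
queue.putWriteBuffer(clBufferA, false)  // upload host buffer A (non-blocking)
     .putWriteBuffer(clBufferB, false)  // upload host buffer B (non-blocking)
     .put1DRangeKernel(kernel, 0, globalWorkSize, localWorkSize)
     .putReadBuffer(clBufferC, true);   // blocking read of the result
```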