putMapBuffer - data In/Out performance issue

putMapBuffer - data In/Out performance issue

suleman
Hi,

 I have implemented a JOCL program using putMapBuffer(...). The putMapBuffer call itself takes approximately 0 milliseconds (on CPU-based OpenCL execution). But after this call I have to copy data from my application buffers into the mapped pointers (the ByteBuffers h_data1 and h_data2 in the code below). This copying is expensive in terms of time.
After the kernel invocation I copy the results back through the pointer h_data3. Copying the data back to the host buffer is efficient.

Am I doing the correct steps or not?
Is there a more efficient way to copy data from application buffers into the pointers returned by putMapBuffer?

Many thanks

CODE EXAMPLE
------------------------------
...
h_data1 = queue.putMapBuffer(clBufferA, WRITE, true);
h_data2 = queue.putMapBuffer(clBufferB, WRITE, true);
...
// Input copy (very expensive in terms of time)
long time1 = nanoTime();
h_data1.clear();
h_data1.asFloatBuffer().put(clBufferA.getBuffer());
h_data2.clear();
h_data2.asFloatBuffer().put(clBufferB.getBuffer());
time1 = nanoTime() - time1;

// Kernel execution
queue.put1DRangeKernel(kernel, 0, globalWorkSize, localWorkSize);
queue.finish();

// Output putMapBuffer
h_data3 = queue.putMapBuffer(clBufferC, READ, true);

// Copy data back into the application buffer (very efficient in terms of time)
long time2 = nanoTime();
clBufferC = clBufferC.cloneWith(h_data3.asFloatBuffer());
time2 = nanoTime() - time2;

Re: putMapBuffer - data In/Out performance issue

notzed
Are you timing one run or multiple/subsequent invocations? Mapping memory is probably lazy, i.e. the actual mapping only happens when you access it, which makes the first access slow.

Also note that the code below doesn't actually do any copying; it just creates a new pointer that associates CPU memory with the GPU object. All you're timing is a little bit of Java code and a new().

//Data copy in App buffer  (very efficient in terms of time)
long time2 = nanoTime();
clBufferC = clBufferC.cloneWith(h_data3.asFloatBuffer());

In general you seem to be doing excessive copies anyway. You already have h_data3, so why copy it into clBufferC? And why copy bufferA into the mapped copy of A?
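One way to cut the input copy is to generate the data directly into the mapped region's FloatBuffer view, rather than staging it in a separate application buffer first and bulk-copying it over. A minimal sketch of that pattern, using a plain direct ByteBuffer as a stand-in for the pointer returned by putMapBuffer (the class name, producer loop, and sizes are made up for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class MappedFill {

    // Fill a (stand-in) mapped region directly through its float view,
    // skipping the intermediate application buffer and its bulk copy.
    static ByteBuffer fillDirect(int n) {
        ByteBuffer mapped = ByteBuffer.allocateDirect(n * 4)
                                      .order(ByteOrder.nativeOrder());
        FloatBuffer view = mapped.asFloatBuffer();
        for (int i = 0; i < n; i++) {
            // Write each value straight into the mapped view as it is produced.
            view.put(i, i * 2.0f);
        }
        return mapped;
    }

    public static void main(String[] args) {
        FloatBuffer view = fillDirect(4).asFloatBuffer();
        System.out.println(view.get(3)); // prints 6.0
    }
}
```

If your producer can write into that view directly, the separate "input copy" step disappears entirely.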

And from what I can tell from the spec, you need to unmap the buffer before executing the kernel. The spec wording is a little convoluted (OpenCL 1.1, section 5.4.2.1), but it states that mapped memory cannot be accessed from kernels.
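The sequence the spec seems to call for would look roughly like the fragment below. putUnmapMemory is the JOCL call I'd expect to use here, but treat the exact method names and arguments as assumptions; this is a sketch against a live queue and device, not runnable on its own:

```java
// Map, fill, then unmap BEFORE the kernel runs
// (mapped memory must not be accessed by kernels, OpenCL 1.1, 5.4.2.1).
ByteBuffer h_data1 = queue.putMapBuffer(clBufferA, WRITE, true);
h_data1.asFloatBuffer().put(inputData);   // fill while mapped
queue.putUnmapMemory(clBufferA, h_data1); // hand the region back to the device

queue.put1DRangeKernel(kernel, 0, globalWorkSize, localWorkSize);
queue.finish();

// Map the result read-only, consume it, unmap again.
ByteBuffer h_data3 = queue.putMapBuffer(clBufferC, READ, true);
FloatBuffer results = h_data3.asFloatBuffer();
// ... use results ...
queue.putUnmapMemory(clBufferC, h_data3);
```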

The AMD OpenCL programming guide, section 4.4, covers a lot of this in detail for AMD's implementation, much of which I imagine is applicable to other GPUs. http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf

FWIW I only use putWriteBuffer/putReadBuffer, as I find that model more intuitive. I haven't had any performance issues related to it.
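For comparison, that explicit-transfer model looks something like this in JOCL; again a fragment under the same assumptions about the surrounding setup (buffers created, kernel args set), and the method chaining is just how JOCL's CLCommandQueue happens to read:

```java
// Explicit transfer model: no mapping, just enqueue writes and reads.
queue.putWriteBuffer(clBufferA, false)  // upload host buffer A (non-blocking)
     .putWriteBuffer(clBufferB, false)  // upload host buffer B (non-blocking)
     .put1DRangeKernel(kernel, 0, globalWorkSize, localWorkSize)
     .putReadBuffer(clBufferC, true);   // blocking read of the result
```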