Global Memory - Temporary CLBuffer in device memory
I have a general question about OpenCL memory management. I want to perform Gaussian downsampling of an image in two passes (vertical and horizontal). Before you ask: much more, such as derivatives and other tasks, will follow, so OpenCL seems to be the right tool for the job.
I want to store the result of step 1 in a temporary CLBuffer. So far so good: it works as long as I transfer the result back into main memory using commandQueue.putReadBuffer(). However, since I don't need the data in my application (only for the second step), I replaced commandQueue.putReadBuffer() with commandQueue.finish(). Unfortunately my code then stops working: only zeros are read back at the end. The following code shows the problem:
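For reference, the two passes can be sketched on the CPU as follows. This is only an illustration, not the kernels used here: the 5-tap binomial weights, the factor-of-2 downsampling, the clamped borders, and all class and method names are assumptions. The intermediate array plays the role of the temporary buffer.

```java
/** CPU sketch of a separable (two-pass) Gaussian downsampling. All names are hypothetical. */
final class SeparableDownsampleSketch {

    // Assumed 5-tap binomial approximation of a Gaussian; the weights sum to 1.
    static final float[] WEIGHTS = {1f / 16, 4f / 16, 6f / 16, 4f / 16, 1f / 16};

    // Clamp an index into [0, max] (assumed border handling).
    static int clamp(int v, int max) {
        return Math.max(0, Math.min(v, max));
    }

    /** Vertical pass: halves the height, keeps the width. */
    static float[] verticalPass(float[] in, int width, int height) {
        int outHeight = height / 2;
        float[] tmp = new float[width * outHeight];
        for (int y = 0; y < outHeight; y++) {
            for (int x = 0; x < width; x++) {
                float sum = 0f;
                for (int k = -2; k <= 2; k++) {
                    int sy = clamp(2 * y + k, height - 1);
                    sum += WEIGHTS[k + 2] * in[sy * width + x];
                }
                tmp[y * width + x] = sum;
            }
        }
        return tmp;
    }

    /** Horizontal pass: halves the width, keeps the height. */
    static float[] horizontalPass(float[] tmp, int width, int height) {
        int outWidth = width / 2;
        float[] out = new float[outWidth * height];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < outWidth; x++) {
                float sum = 0f;
                for (int k = -2; k <= 2; k++) {
                    int sx = clamp(2 * x + k, width - 1);
                    sum += WEIGHTS[k + 2] * tmp[y * width + sx];
                }
                out[y * outWidth + x] = sum;
            }
        }
        return out;
    }

    /** Both passes; the intermediate array corresponds to the temporary buffer. */
    static float[] downsample(float[] image, int width, int height) {
        float[] tmp = verticalPass(image, width, height);
        return horizontalPass(tmp, width, height / 2);
    }
}
```

Comparing the kernel output against such a reference on a small image makes it easy to tell whether a wrong result comes from the kernels or from the buffer handling.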
// STEP 1
vPassKernel.rewind(); // reset the argument index; the kernel might be reused
vPassKernel.putArg(inBuffer)   // input image
           .putArg(tmpBuffer); // result of the first pass is stored here (read_write, no host ptr)
commandQueue.put1DRangeKernel(vPassKernel, 0, vPassGlobalWorkSize, vPassLocalWorkSize)
            //.putReadBuffer(tmpBuffer, true); <-- slow, but works
            .finish(); // <-- problem! When tmpBuffer is reused in the next step it contains only zeros :(
// STEP 2
//tmpBuffer.getBuffer().rewind(); // only needed when using "putReadBuffer"
hPassKernel.putArg(tmpBuffer)    // use as input
           .putArg(clOutBuffer); // final output that is transferred into main memory
commandQueue.put1DRangeKernel(hPassKernel, 0, hPassGlobalWorkSize, hPassLocalWorkSize);
Why is the call to "putReadBuffer" needed? Is that how the OpenCL memory model works? Am I missing something?
BTW: Switching between CPU and GPU does not change anything, so I guess it must be my fault. The kernels aren't special; they just compute a weighted sum. I know that I could call the second kernel from the first kernel, but in the future I will need the temporary buffer in other ("non-sequential") situations. (I'm running OS X 10.6 with a slow ATI 6490M.)
Re: Global Memory - Temporary CLBuffer in device memory
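You are enqueueing a writeBuffer command in the second step. This writes the contents of host memory to device memory (NIO -> CL), overwriting your temporary results. That is presumably also why putReadBuffer appeared to help: it first copied the results into the host buffer, so the later write uploaded the same data again instead of zeros. Remove the write command, use finish() again, and it should work.
To speed things up, you might consider using events instead of finish() to decouple the application thread from the queue.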
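The effect can be reproduced with a small toy model in plain Java. This is only a sketch: ToyBuffer and ToyQueue are hypothetical stand-ins that imitate an in-order queue with separate host and device storage, they are not the JOCL classes, and the two "kernels" are placeholder multiplications.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Queue;

/** Toy model only: imitates an in-order command queue, it is not JOCL. */
final class QueueOrderSketch {

    /** Hypothetical buffer with separate host-side and device-side storage. */
    static final class ToyBuffer {
        final float[] host;
        final float[] device;
        ToyBuffer(int size) {
            host = new float[size];   // zero-initialised, like a fresh NIO buffer
            device = new float[size];
        }
    }

    /** Hypothetical in-order queue: finish() runs pending commands in enqueue order. */
    static final class ToyQueue {
        private final Queue<Runnable> pending = new ArrayDeque<>();

        ToyQueue putWriteBuffer(ToyBuffer b) { // host -> device
            pending.add(() -> System.arraycopy(b.host, 0, b.device, 0, b.host.length));
            return this;
        }
        ToyQueue putReadBuffer(ToyBuffer b) {  // device -> host
            pending.add(() -> System.arraycopy(b.device, 0, b.host, 0, b.host.length));
            return this;
        }
        ToyQueue putKernel(Runnable kernel) {
            pending.add(kernel);
            return this;
        }
        ToyQueue finish() {
            while (!pending.isEmpty()) pending.poll().run();
            return this;
        }
    }

    static float[] run(boolean readTmpAfterStep1, boolean writeTmpBeforeStep2) {
        final int n = 4;
        ToyBuffer in = new ToyBuffer(n), tmp = new ToyBuffer(n), out = new ToyBuffer(n);
        Arrays.fill(in.host, 1f);
        ToyQueue queue = new ToyQueue();

        // STEP 1: upload the input, run the first placeholder kernel (tmp = 2 * in)
        queue.putWriteBuffer(in)
             .putKernel(() -> { for (int i = 0; i < n; i++) tmp.device[i] = 2f * in.device[i]; });
        if (readTmpAfterStep1) queue.putReadBuffer(tmp);
        queue.finish();

        // STEP 2: optionally re-upload tmp from the host, then run out = 3 * tmp
        if (writeTmpBeforeStep2) queue.putWriteBuffer(tmp);
        queue.putKernel(() -> { for (int i = 0; i < n; i++) out.device[i] = 3f * tmp.device[i]; })
             .putReadBuffer(out)
             .finish();
        return out.host;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run(false, true)));  // [0.0, 0.0, 0.0, 0.0]
        System.out.println(Arrays.toString(run(true, true)));   // [6.0, 6.0, 6.0, 6.0]
        System.out.println(Arrays.toString(run(false, false))); // [6.0, 6.0, 6.0, 6.0]
    }
}
```

The first call matches the reported symptom: the host copy of the temporary buffer is still all zeros, and the write uploads it over the first pass's results. The third call is the suggested fix, with the temporary data never leaving the device.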