I have a general question about OpenCL memory management. I want to perform Gaussian downsampling of an image in two passes (vertical and horizontal). Before you ask: many more steps, such as derivatives and other tasks, will follow, so OpenCL seems to be the right tool for the job.
I want to store the result of step 1 in a temporary CLBuffer. So far so good: it works as long as I transfer the result back into main memory using commandQueue.putReadBuffer(). However, since my application does not need that data on the host (it is only needed for the second step), I replaced commandQueue.putReadBuffer() with commandQueue.finish(). Unfortunately my code then stops working: the final output contains only zeros. The following code shows the problem:
// STEP 1
vPassKernel.rewind(); // the kernel might be reused
vPassKernel.putArg(inBuffer) // input image
.putArg(tmpBuffer); // result of the first pass is stored here (read_write, no host ptr)
commandQueue.put1DRangeKernel(vPassKernel, 0, vPassGlobalWorkSize, vPassLocalWorkSize)
//.putReadBuffer(tmpBuffer, true); <-- slow, but works
.finish(); // <-- problem! When tmpBuffer is reused in the next step, it contains only zeros :(
// STEP 2
//tmpBuffer.getBuffer().rewind(); // only needed when using "putReadBuffer"
hPassKernel.putArg(tmpBuffer) // use as input
.putArg(clOutBuffer); // final output that is transferred into main memory
commandQueue.put1DRangeKernel(hPassKernel, 0, hPassGlobalWorkSize, hPassLocalWorkSize);
Why is the call to "putReadBuffer" needed? Is that how the OpenCL memory model works? Am I missing something?
BTW: Switching between CPU and GPU does not change anything, so I guess it must be my fault. The kernels aren't special; they just compute a weighted sum. I know that I could call the second kernel from the first kernel, but in the future I will need the temporary buffer in other ("non-sequential") situations. (I'm running OS X 10.6 with a slow ATI 6490M.)
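For reference, here is a minimal plain-Java sketch of the data flow the two passes are meant to implement, with an intermediate array playing the role of tmpBuffer. It is not the OpenCL code itself: the class and method names, the 3-tap (1, 2, 1)/4 weights, the clamped borders and the factor-2 decimation are all assumptions made for illustration.

```java
import java.util.Arrays;

// CPU reference for a separable two-pass blur with 2x downsampling.
// All names and the filter weights are hypothetical, chosen for illustration only.
class TwoPassDownsample {

    // Assumed 3-tap binomial weights; the real kernels may use a wider Gaussian.
    static final float[] W = {0.25f, 0.5f, 0.25f};

    // Clamp an index to [0, n-1] so that border pixels are repeated.
    static int clamp(int i, int n) {
        return Math.max(0, Math.min(n - 1, i));
    }

    // Pass 1: weighted sum along y, keeping every second row.
    // 'in' is a row-major width*height image; the result is width*(height/2).
    static float[] vPass(float[] in, int width, int height) {
        int outH = height / 2;
        float[] tmp = new float[width * outH];
        for (int y = 0; y < outH; y++) {
            for (int x = 0; x < width; x++) {
                float sum = 0f;
                for (int k = -1; k <= 1; k++) {
                    sum += W[k + 1] * in[clamp(2 * y + k, height) * width + x];
                }
                tmp[y * width + x] = sum;
            }
        }
        return tmp;
    }

    // Pass 2: weighted sum along x, keeping every second column.
    // 'tmp' is a row-major width*height image; the result is (width/2)*height.
    static float[] hPass(float[] tmp, int width, int height) {
        int outW = width / 2;
        float[] out = new float[outW * height];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < outW; x++) {
                float sum = 0f;
                for (int k = -1; k <= 1; k++) {
                    sum += W[k + 1] * tmp[y * width + clamp(2 * x + k, width)];
                }
                out[y * outW + x] = sum;
            }
        }
        return out;
    }

    // Step 1 writes the intermediate result, step 2 consumes it.
    static float[] downsample(float[] in, int width, int height) {
        float[] tmp = vPass(in, width, height); // plays the role of tmpBuffer
        return hPass(tmp, width, height / 2);   // plays the role of clOutBuffer
    }

    public static void main(String[] args) {
        float[] img = new float[4 * 4];
        Arrays.fill(img, 1f); // constant image: the weights sum to 1, so the output stays 1
        System.out.println(Arrays.toString(downsample(img, 4, 4)));
        // prints [1.0, 1.0, 1.0, 1.0]
    }
}
```

A reference like this gives non-zero values to compare against the contents of clOutBuffer, which makes it easier to tell whether the zeros come from the first pass or the second.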