Global Memory - Temporary CLBuffer in device memory
— by felix felix
Hi,

I have a general question about OpenCL memory management. I want to perform Gaussian downsampling of an image in two passes (vertical and horizontal). Before you ask: much more, such as derivatives and other tasks, will follow, so OpenCL seems to be the right tool for the job.
I want to store the result of step 1 in a temporary CLBuffer. So far so good: it works as long as I transfer the result back into main memory using commandQueue.putReadBuffer(). However, since I don't need the data in my application (only for the second step), I replaced commandQueue.putReadBuffer() with commandQueue.finish(). Unfortunately, my code then stops working: the final read returns only zeros. The following code shows the problem:

// STEP 1
vPassKernel.rewind(); // the kernel might be reused
vPassKernel.putArg(inBuffer) // input image
    .putArg(tmpBuffer); // result of the first pass is stored here (read_write, no host ptr)
commandQueue.putWriteBuffer(inBuffer, false)
    .put1DRangeKernel(vPassKernel, 0, vPassGlobalWorkSize, vPassLocalWorkSize)
    //.putReadBuffer(tmpBuffer, true); <-- slow, but works
    .finish(); // <-- problem! When tmpBuffer is reused in the next step, it contains only zeros :(

// STEP 2
//tmpBuffer.getBuffer().rewind(); // only needed when using "putReadBuffer"
hPassKernel.rewind();
hPassKernel.putArg(tmpBuffer) // use as input
    .putArg(clOutBuffer) // final output that is transferred into main memory
    .putArg(hPassElementCount);
commandQueue.putWriteBuffer(tmpBuffer, false)
    .put1DRangeKernel(hPassKernel, 0, hPassGlobalWorkSize, hPassLocalWorkSize)
    .putReadBuffer(clOutBuffer, true);

Why is the call to "putReadBuffer" needed? Is that how the OpenCL memory model works? Am I missing something?
BTW: Switching between CPU and GPU does not change anything, so I guess it must be my fault. The kernels aren't special; they just compute a weighted sum. I know that I could call the second kernel from the first kernel, but in the future I will need the temporary buffer in other ("non-sequential") situations. (I'm running OS X 10.6 with a slow ATI 6490M.)

Thanks, Felix
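
A possible explanation, offered as an assumption rather than a confirmed diagnosis: putWriteBuffer(tmpBuffer, false) in step 2 uploads the host-side contents of tmpBuffer to the device. Unless putReadBuffer filled that host side first, those contents are still zeros, so the result of step 1 would be overwritten before hPassKernel runs. The toy model below is plain Java, not JOCL. TwoPassModel, ToyBuffer, kernel and run are hypothetical names that only mimic the host/device copy semantics of the code above.

```java
import java.util.Arrays;

/** Toy model of the two-step pipeline above. Plain Java, no OpenCL involved. */
final class TwoPassModel {

    /** Hypothetical stand-in for a CLBuffer: one host-side copy, one device-side copy. */
    static final class ToyBuffer {
        final float[] host;   // stands in for the host-side NIO buffer
        final float[] device; // stands in for the allocation in device memory

        ToyBuffer(int size) {
            host = new float[size];   // zero-initialised
            device = new float[size]; // zero-initialised
        }

        /** Models an enqueued write: copies host -> device. */
        void write() { System.arraycopy(host, 0, device, 0, host.length); }

        /** Models an enqueued read: copies device -> host. */
        void read() { System.arraycopy(device, 0, host, 0, host.length); }
    }

    /** Stand-in for a kernel: out[i] = 2 * in[i], touching device memory only. */
    static void kernel(ToyBuffer in, ToyBuffer out) {
        for (int i = 0; i < in.device.length; i++) {
            out.device[i] = 2f * in.device[i];
        }
    }

    /**
     * Runs both steps and returns the host-side output.
     * readBackTmp models putReadBuffer(tmpBuffer, true) after step 1.
     * rewriteTmp  models putWriteBuffer(tmpBuffer, false) before step 2.
     */
    static float[] run(boolean readBackTmp, boolean rewriteTmp) {
        ToyBuffer in = new ToyBuffer(4);
        ToyBuffer tmp = new ToyBuffer(4);
        ToyBuffer out = new ToyBuffer(4);
        Arrays.fill(in.host, 1f);

        // STEP 1
        in.write();
        kernel(in, tmp);
        if (readBackTmp) tmp.read();

        // STEP 2
        if (rewriteTmp) tmp.write(); // uploads whatever the host side currently holds
        kernel(tmp, out);
        out.read();
        return out.host;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run(true, true)));   // read back, then re-upload
        System.out.println(Arrays.toString(run(false, true)));  // re-upload stale host data
        System.out.println(Arrays.toString(run(false, false))); // data stays on the device
    }
}
```

In this model the first and third calls print fours and the second prints zeros. If JOCL behaves the same way, dropping putWriteBuffer(tmpBuffer, false) from step 2 might be enough to make the finish() variant work, since tmpBuffer would then never leave device memory.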