I have a triangle-ray intersection kernel that worked fine using only global memory. After I added a local (__local) kernel argument, it throws an exception when I try to read back the intersection point, which is returned in a buffer of 1 float.
It throws the exception at the last putReadBuffer() call:
Exception in thread "main" com.jogamp.opencl.CLException$CLOutOfResourcesException: can not enqueue read-buffer: CLBuffer [id: 422414000 buffer: java.nio.DirectFloatBufferU[pos=0 lim=1 cap=1]] with
cond.: null events: null [error: CL_OUT_OF_RESOURCES]
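CL_OUT_OF_RESOURCES can be hard to debug. Kernel launches are asynchronous, so a failure inside the kernel often only surfaces at the next blocking call, which may be why you see it at the read. Have you checked the amount of local memory available on your device to make sure you're not exceeding it? The spec only guarantees 32 KB (16 KB in OpenCL 1.0), so don't assume more than that without querying CL_DEVICE_LOCAL_MEM_SIZE. You might try reducing the amount of work you're enqueuing; if that works, you're probably exceeding a hardware limit that you aren't checking for.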
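To make the local memory check concrete, here is a rough sketch in plain Java of the arithmetic involved. It is not taken from your code: the class and method names are made up, and the figure of 9 floats per work-item (one triangle, 3 vertices of 3 floats) is only an assumption about what your local argument holds. The device limits are passed in as plain numbers; in a real program they would come from querying CL_DEVICE_LOCAL_MEM_SIZE and CL_DEVICE_MAX_WORK_GROUP_SIZE on the device.

```java
public class LocalMemBudget {

    // Bytes of local memory one work-group needs if every work-item
    // stores floatsPerWorkItem floats in the __local argument.
    static long localBytesNeeded(int workGroupSize, int floatsPerWorkItem) {
        return (long) workGroupSize * floatsPerWorkItem * Float.BYTES;
    }

    // Halve the work-group size, starting from the device maximum,
    // until the local memory request fits within the device limit.
    static int largestFittingWorkGroup(long localMemBytes,
                                       int floatsPerWorkItem,
                                       int maxWorkGroupSize) {
        int size = maxWorkGroupSize;
        while (size > 1 && localBytesNeeded(size, floatsPerWorkItem) > localMemBytes) {
            size /= 2;
        }
        return size;
    }

    public static void main(String[] args) {
        // Hypothetical device: 32 KB of local memory, max work-group size 1024.
        long localMemBytes = 32 * 1024;
        int maxWorkGroupSize = 1024;
        int floatsPerWorkItem = 9; // assumption: one triangle per work-item

        int size = largestFittingWorkGroup(localMemBytes, floatsPerWorkItem, maxWorkGroupSize);
        System.out.println("work-group size: " + size
                + ", local bytes: " + localBytesNeeded(size, floatsPerWorkItem));
    }
}
```

If the size you get from a check like this is smaller than the local work size you are currently passing when you enqueue the kernel, that would be consistent with the error you are seeing.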
So are the multiple GPUs cooperating somehow in this algorithm, or are they supposed to be operating separately on separate buffers? If they're cooperating, I can see how there might be a problem, since you have to use clFlush()/clFinish() to synchronize between different command queues (I assume you have both devices in one context, but a separate command queue for each one). There's also the issue of copying a single buffer to two devices and then getting separate sets of results back to the host without one set clobbering the other.
If your devices are supposed to be operating separately, and all the buffers and host memory are separate, then I'm not sure what's going on.
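On the point about getting separate result sets back without clobbering, one way to keep the host side safe is to give each device its own non-overlapping slice of a single direct buffer. The sketch below uses only java.nio; the class and method names are hypothetical, and it assumes the results divide evenly between devices. Each slice would then be the host memory behind that device's read, and each queue would still need its own finish before the host looks at the data.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class PerDeviceResults {

    // Allocate a direct float buffer in native byte order,
    // the kind of host memory OpenCL bindings usually expect.
    static FloatBuffer allocate(int floats) {
        return ByteBuffer.allocateDirect(floats * Float.BYTES)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
    }

    // Split one host buffer into deviceCount disjoint views.
    // Writes through one view never touch another view's range.
    static FloatBuffer[] splitResults(FloatBuffer all, int deviceCount) {
        int perDevice = all.capacity() / deviceCount;
        FloatBuffer[] slices = new FloatBuffer[deviceCount];
        for (int d = 0; d < deviceCount; d++) {
            FloatBuffer view = all.duplicate(); // shares content, own position/limit
            view.position(d * perDevice);
            view.limit((d + 1) * perDevice);
            slices[d] = view.slice();
        }
        return slices;
    }

    public static void main(String[] args) {
        FloatBuffer all = allocate(8);
        FloatBuffer[] perDevice = splitResults(all, 2);
        perDevice[0].put(0, 1.5f); // stands in for device 0's result
        perDevice[1].put(0, 2.5f); // stands in for device 1's result
        System.out.println(all.get(0) + " " + all.get(4));
    }
}
```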