I've run this on several different single GPU configurations (both AMD and NVidia), and the results are always correct.
However, when it's been run on multi-GPU configurations (both AMD and NVidia), the results are randomly incorrect.
(about 95% correct, 5% incorrect).
There is nothing within the CLTask above that can "bleed through" to other CLTasks except possibly the CL calls.
Obviously, an A-B comparison and an A-C comparison could be running at the same time on different devices.
So the question is: Am I doing this correctly?
So are the multiple GPUs cooperating somehow in this algorithm, or are they supposed to be operating separately on separate buffers? If they're cooperating, I can see how there might be a problem, since you have to use clFlush()/clFinish() to synchronize between different command queues (I assume you have both devices in one context, but a separate command queue for each one). There's also the issue of copying a single buffer to two devices and then getting separate sets of results back to the host without clobbering one.
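The synchronization point matters here. A plain-Java analogue (not the JOCL API) of the one-context/two-queues situation: each "command queue" is a worker thread, and Thread.join() plays the role of clFinish() by establishing a happens-before edge before the host reads the results back.

```java
// Plain-Java analogue (not JOCL API): each "queue" is a worker thread,
// and Thread.join() stands in for clFinish(), giving the host thread a
// happens-before edge before it reads each device's results.
public class QueueSyncAnalogue {
    static int[] resultsA = new int[4];
    static int[] resultsB = new int[4];

    public static void runBothQueues() throws InterruptedException {
        Thread queueA = new Thread(() -> {          // "device A's command queue"
            for (int i = 0; i < resultsA.length; i++) resultsA[i] = i * i;
        });
        Thread queueB = new Thread(() -> {          // "device B's command queue"
            for (int i = 0; i < resultsB.length; i++) resultsB[i] = i + 1;
        });
        queueA.start();
        queueB.start();
        queueA.join();   // analogous to queueA.finish()
        queueB.join();   // without these, the host may observe stale results
    }

    public static void main(String[] args) throws InterruptedException {
        runBothQueues();
        System.out.println(resultsA[3] + " " + resultsB[3]);
    }
}
```

The same rule applies per queue on the OpenCL side: the host must not touch a result buffer until that device's queue has been finished (or a blocking read has returned).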
If your devices are supposed to be operating separately, and all the buffers and host memory are separate, then I'm not sure what's going on.
Each comparison computation is completely independent.
I'm not currently using a clFlush/clFinish for each individual CLTask, as that doesn't seem appropriate.
However, I am doing a CLCommandQueuePool.flushQueues() & finishQueues(), but it doesn't appear to make a difference.
The CLCommandQueuePool class acts like a threaded job queue, where each job (CLTask) is scheduled as soon as a device is available.
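A pool of that shape can be sketched in plain Java (hypothetical names and structure, not the real CLCommandQueuePool internals): one queue-context slot per device, with each submitted job grabbing a free device as soon as one is available.

```java
import java.util.*;
import java.util.concurrent.*;

// Hypothetical sketch of a per-device job pool: one context slot per
// device, and each job takes a free device's slot, runs, and returns it.
public class DevicePoolSketch {
    public static List<Integer> runTasks(int deviceCount, int taskCount) throws Exception {
        // Stand-ins for per-device queue contexts, handed out to jobs.
        BlockingQueue<Integer> freeDevices = new LinkedBlockingQueue<>();
        for (int d = 0; d < deviceCount; d++) freeDevices.add(d);

        ExecutorService pool = Executors.newFixedThreadPool(deviceCount);
        List<Future<Integer>> futures = new ArrayList<>();
        for (int t = 0; t < taskCount; t++) {
            final int taskId = t;
            futures.add(pool.submit(() -> {
                Integer device = freeDevices.take();  // wait for a free device
                try {
                    return taskId * taskId;           // stand-in for CLTask.execute(ctx)
                } finally {
                    freeDevices.add(device);          // release the device
                }
            }));
        }
        List<Integer> results = new ArrayList<>();
        for (Future<Integer> f : futures) results.add(f.get());
        pool.shutdown();
        return results;
    }
}
```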
(Benchmarks that I've done have shown the implementation to be very efficient!)
The CLTask.execute() method passes a CLSimpleQueueContext argument, which varies for each device.
So each CLTask has an associated CLContext & CLCommandQueue.
I see two potential sources of error:
1) A CLTask is performed on a given device before the previous CLTask for that device is complete.
2) There is some issue with copying the same memory to multiple contexts simultaneously.
However, recalling that the computations are performed perfectly when using a single device, the first seems unlikely.
The second scenario would occur somewhat randomly, which is the observed behavior.
Since there is no exception thrown, further troubleshooting will require a substantial amount of debugging...
On 01/26/2014 01:48 AM, The.Scotsman [via jogamp] wrote:
Can you provide a 'smallest' self-contained [unit] test
and attach it to a new bug report?
So it sounds like you're copying identical data to the two devices, and each device reads the data but doesn't modify it? As far as I know this should work correctly, as long as any output goes to two separate buffers so the results don't collide.
As Sven mentioned a small test case might be helpful here, to see if anyone else sees the same result (I can't test this personally, since I only have one GPU, but some of Sven's cluster machines might have more than one).
I was afraid you were going to say something like that...
(was hoping more for the "you screwed up HERE" type response)
First, I'm going to do some focused debugging to try to figure out exactly where the thing goes bad.
And possibly try to synchronize some things at my end.
Especially since there apparently isn't much multi-GPU testing capability amongst the development team.
(We need to get you guys a bigger budget!)
After a long hiatus, I'm back on this project.
If anyone is still interested, here's an update.
First thing I did was update to the latest jogamp release, but the problem still occurred.
Then I wired up some detailed debug output, which can be sorted and compared over multiple runs.
This did not succeed in localizing the problem, but did show that the error rate was larger than previously indicated, as there were a number of less serious errors that weren't apparent before.
Next thing I did was shuffle the input list: instead of an ordered comparison of A to B, A to C, A to D, etc., the object comparisons are now random, resulting in many fewer instances where multiple GPUs are reading data from the same objects simultaneously.
As a result of this simple change, the error rate dropped to less than 1%, and performance actually improved a bit.
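The shuffle step itself is tiny; a sketch with illustrative names (the real pair-building code is of course different):

```java
import java.util.*;

public class ShufflePairs {
    // Build all ordered comparison pairs, then shuffle the submission
    // order so that concurrently running tasks rarely share a source object.
    public static List<int[]> shuffledPairs(int n, long seed) {
        List<int[]> pairs = new ArrayList<>();
        for (int a = 0; a < n; a++)
            for (int b = a + 1; b < n; b++)
                pairs.add(new int[]{a, b});   // ordered: A-B, A-C, A-D, B-C, ...
        Collections.shuffle(pairs, new Random(seed));
        return pairs;
    }
}
```

Note this only reduces the probability of a collision; it cannot eliminate it, which is consistent with the error rate dropping rather than vanishing.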
Finally, I manually synchronized both objects at the CLTask.execute() level:
...execute 3 comparison kernels...
With this, the error rate dropped to zero, although performance dropped about 35%.
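The manual synchronization can be sketched like this (hypothetical; the key point is taking the two objects' monitors in a fixed global order, so two tasks comparing (A,B) and (B,A) cannot deadlock):

```java
// Hypothetical sketch of per-task locking: acquire both objects'
// monitors in a canonical order, then run the kernels inside the
// critical section.
public class PairLocking {
    public static void compareLocked(Object a, Object b, Runnable kernels) {
        // Fix a global lock order via identity hash (hash ties are rare;
        // a dedicated tie-breaker lock would be needed for full correctness).
        Object first = System.identityHashCode(a) <= System.identityHashCode(b) ? a : b;
        Object second = (first == a) ? b : a;
        synchronized (first) {
            synchronized (second) {
                kernels.run();   // ...execute the 3 comparison kernels...
            }
        }
    }
}
```

Locking this way serializes any two tasks that share an object, which would account for the ~35% throughput drop observed.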
So I'm still pretty confident the problem is a result of some non-thread-safe code somewhere within JogAmp/OpenCL (CLCommandQueue.putWriteBuffer?).
The effort required to create an independent test case is substantial - many days - and it doesn't sound like anyone there has a rig to test it with anyway.
So I don't know if I will have the opportunity to address this further.
For the behavior you describe, this sounds like the canonical example of multithreaded access to common data, especially because:
- The error rate decreases if you randomize the order
- The errors disappear if you force serial execution
- The errors disappear if you use only one GPU
There's got to be some part of your data structure that's being shared and accessed in a non-thread-safe manner. Are you completely certain that there's no buffer sharing in your code between GPUs? I couldn't quite understand from your pseudocode what buffers were allocated where.
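The classic symptom can be reproduced in a few lines, completely outside JOCL: unsynchronized read-modify-write on shared data loses updates nondeterministically, while an atomic version is always exact. This is only an illustration of the failure mode, not the actual code path here.

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal illustration of the failure mode: several threads updating one
// shared value without synchronization lose updates at random, exactly
// the "mostly right, sometimes wrong" pattern described above.
public class SharedDataRace {
    static long unsafeCount = 0;                        // racy shared state
    static final AtomicLong safeCount = new AtomicLong();

    public static long[] run(int threads, int iters) throws InterruptedException {
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < iters; j++) {
                    unsafeCount++;                      // unguarded read-modify-write
                    safeCount.incrementAndGet();        // atomic, always exact
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        return new long[]{unsafeCount, safeCount.get()};
    }
}
```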
One debugging possibility might be to try running on the Intel CPU-only OpenCL implementation. If the errors go away, that would indicate that they're due to the fact that device and host memory are separate and incoherent on GPUs, but merely different regions in the same coherent memory system on CPUs.
Another debugging possibility might be to replace the GPU kernel calls with multithreaded function calls that run on your CPUs, and replace the OpenCL buffers with simple arrays (i.e. route around JOCL completely, but leave your code as much the same as possible). Then you would be able to tell whether the race condition is in your data structure or inside JOCL somewhere. That's kind of drastic, but these sorts of bugs are often difficult to track down :)
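That route-around could be sketched like this (the compare function here is a hypothetical stand-in for the real kernels): OpenCL buffers become plain float arrays, and the kernel becomes a method run from a thread pool, keeping the surrounding task structure the same.

```java
import java.util.*;
import java.util.concurrent.*;

public class CpuHarness {
    // Hypothetical stand-in for one comparison kernel: sum of squared differences.
    static float compare(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Run the same comparisons the GPU version would, but on plain arrays
    // from a thread pool; if errors persist here, the race is in the data
    // structures around the tasks rather than inside JOCL.
    public static float[] compareAllAgainstFirst(float[][] objects) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Float>> futures = new ArrayList<>();
        for (int i = 1; i < objects.length; i++) {
            final int idx = i;
            futures.add(pool.submit(() -> compare(objects[0], objects[idx])));
        }
        float[] results = new float[futures.size()];
        for (int i = 0; i < results.length; i++) results[i] = futures.get(i).get();
        pool.shutdown();
        return results;
    }
}
```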