Re: Multi-GPU processing inconsistent.

Posted by Wade Walker on Oct 08, 2014; 12:58am
URL: https://forum.jogamp.org/Multi-GPU-processing-inconsistent-tp4031306p4033284.html

From the behavior you describe, this sounds like the canonical example of unsynchronized multithreaded access to shared data, especially because:

- The error rate decreases if you randomize the order
- The errors disappear if you force serial execution
- The errors disappear if you use only one GPU

There's got to be some part of your data structure that's being shared and accessed in a non-thread-safe manner. Are you completely certain that there's no buffer sharing in your code between GPUs? I couldn't quite understand from your pseudocode what buffers were allocated where.
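One cheap way to check for accidental sharing is to record which device each buffer is handed to and flag any buffer that shows up under two devices. The following is only a sketch: `BufferOwnership`, `claim`, and the device labels are hypothetical names, not part of JOCL, and it assumes you can add a call at every point where your code passes a buffer to a device.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical debugging helper (not a JOCL class): tracks which device
// first used each buffer object, compared by identity rather than equals().
public class BufferOwnership {
    private final Map<Object, String> owners = new IdentityHashMap<>();

    // Records that `device` uses `buffer`. Returns false if a different
    // device has already claimed the same buffer object.
    public synchronized boolean claim(Object buffer, String device) {
        String previous = owners.putIfAbsent(buffer, device);
        return previous == null || previous.equals(device);
    }
}
```

Calling `claim(buffer, "gpu0")` wherever a buffer is bound to a kernel or queued for a transfer, and logging a stack trace whenever it returns false, should point at the spot where two GPUs end up touching the same buffer.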

One debugging possibility might be to try running on the Intel CPU-only OpenCL implementation. If the errors go away, that would suggest they stem from device and host memory being separate and incoherent on GPUs, whereas on CPUs they are merely different regions of the same coherent memory system.

Another debugging possibility might be to replace the GPU kernel calls with multithreaded function calls that run on your CPUs, and replace the OpenCL buffers with simple arrays (i.e. route around JOCL completely, but leave your code as much the same as possible). Then you would be able to tell whether the race condition is in your data structure or inside JOCL somewhere. That's kind of drastic, but these sorts of bugs are often difficult to track down :)
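To make the second suggestion concrete, here is a minimal sketch of what such a CPU-only stand-in could look like. Everything in it is an assumption for illustration: `CpuStandIn`, `kernel`, and the chunking scheme are made up, and your real kernel and data layout would go in their place. Each simulated "device" is a thread that writes only to its own local array, and the result is compared against a serial run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CpuStandIn {

    // Hypothetical stand-in for the GPU kernel: a pure per-element function.
    public static float kernel(float x) {
        return x * x + 1.0f;
    }

    // Serial reference result to compare against.
    public static float[] runSerial(float[] input) {
        float[] out = new float[input.length];
        for (int i = 0; i < input.length; i++) {
            out[i] = kernel(input[i]);
        }
        return out;
    }

    // One task per simulated device. Each task reads its slice of the input
    // and writes only to its own local array, so nothing mutable is shared.
    public static float[] runParallel(float[] input, int devices)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(devices);
        try {
            int chunk = (input.length + devices - 1) / devices;
            List<Future<float[]>> parts = new ArrayList<>();
            for (int d = 0; d < devices; d++) {
                final int start = Math.min(d * chunk, input.length);
                final int end = Math.min(start + chunk, input.length);
                parts.add(pool.submit(() -> {
                    float[] local = new float[end - start];
                    for (int i = start; i < end; i++) {
                        local[i - start] = kernel(input[i]);
                    }
                    return local;
                }));
            }
            // Gather the per-device results in order on the calling thread.
            float[] out = new float[input.length];
            int pos = 0;
            for (Future<float[]> part : parts) {
                float[] local = part.get();
                System.arraycopy(local, 0, out, pos, local.length);
                pos += local.length;
            }
            return out;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        float[] input = new float[1000];
        for (int i = 0; i < input.length; i++) {
            input[i] = i;
        }
        boolean same = Arrays.equals(runSerial(input), runParallel(input, 4));
        System.out.println(same ? "parallel matches serial" : "MISMATCH");
    }
}
```

If a version like this, built around your actual data structure, still produces mismatches, the race is in your own code; if it is clean over many runs, that makes the JOCL or driver side look more suspect.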