Re: How to queue in a multidevice environment

Posted by Arnold on Feb 26, 2017; 9:33pm
URL: https://forum.jogamp.org/How-to-queue-in-a-multidevice-environment-tp4037674p4037676.html

Thanks for the example links, and that worked! Well, partially. Using finish () is what I was looking for, and the subbuffers seem to work. So the last lines of the code now look like this:

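          // Enqueue all work for every device first; none of these calls blocks.
          for (int i = 0; i < nDevices; i++)
          {
                  CLDevice device = devices [i];
                  q [i] = device.createCommandQueue ();
                  q [i]
                          .putWriteBuffer (CLSubArrayA [i], false)   // non-blocking upload of this device's part of A
                          .putWriteBuffer (CLSubArrayB [i], false)   // non-blocking upload of this device's part of B
                          .put1DRangeKernel (kernel, 0, globalWorkSize, localWorkSize)
                          .putReadBuffer (CLSubArrayC [i], false);   // non-blocking download of this device's part of C
          } // for

          // Only now wait until each queue has completed everything enqueued on it.
          for (int i = 0; i < nDevices; i++)
          {
                  q [i].finish ();
          } // for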
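
The post does not show how the subbuffers are sized, so here is a minimal sketch of one way the per-device ranges could be computed in plain Java. The class name SubRangeSplit and the method split are hypothetical and not part of JOCL. It only does the arithmetic; real OpenCL sub-buffers additionally require the byte offset to respect the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN, which this sketch does not check.

```java
public class SubRangeSplit {

    /**
     * Splits totalElements into nDevices contiguous ranges.
     * Returns one {offset, length} pair per device (in elements, not bytes).
     * Hypothetical helper for illustration; alignment rules are NOT checked.
     */
    static int[][] split(int totalElements, int nDevices) {
        if (nDevices <= 0) {
            throw new IllegalArgumentException("nDevices must be positive");
        }
        if (totalElements < 0) {
            throw new IllegalArgumentException("totalElements must not be negative");
        }
        int[][] ranges = new int[nDevices][2];
        int base = totalElements / nDevices;      // elements every device gets
        int remainder = totalElements % nDevices; // leftovers go to the first devices
        int offset = 0;
        for (int i = 0; i < nDevices; i++) {
            int length = base + (i < remainder ? 1 : 0);
            ranges[i][0] = offset;
            ranges[i][1] = length;
            offset += length;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Vector size taken from the benchmark summary in this post.
        int[][] ranges = split(20_480_000, 2);
        for (int i = 0; i < ranges.length; i++) {
            System.out.println("device " + i + ": offset=" + ranges[i][0]
                    + " length=" + ranges[i][1]);
        }
    }
}
```

An even split is only a starting point: if one device is clearly faster than the other, equal halves leave the faster one idle while the slower one finishes.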
       
There is just one small detail: the devices lumped together take longer instead of less time. At 23149 ms, the combined run needs almost twice the 12748 ms that the two devices take when run one after the other (I list the benchmark results below). I'll try to figure out tomorrow where the problem is. I saw a reference to CLCommandThreadPool by Michael Bien. Well, that's a nice exercise for tomorrow :-D
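
One thing to try is giving each device its own host thread, so that enqueueing and waiting for one device never holds up the other. The sketch below shows only that general idea with the standard java.util.concurrent classes. It is not the CLCommandThreadPool API, the names PerDeviceWorkers and runOnAllDevices are hypothetical, and whether it improves the timings in this post is untested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntFunction;

public class PerDeviceWorkers {

    /**
     * Runs workForDevice once per device index, each on its own thread,
     * and returns the results in device order. In a real program the work
     * would enqueue the commands for device i and then call finish ()
     * on that device's queue (assumption, not shown here).
     */
    static <T> List<T> runOnAllDevices(int nDevices, IntFunction<T> workForDevice) {
        ExecutorService pool = Executors.newFixedThreadPool(nDevices);
        try {
            List<Future<T>> futures = new ArrayList<>();
            for (int i = 0; i < nDevices; i++) {
                final int deviceIndex = i;
                Callable<T> task = () -> workForDevice.apply(deviceIndex);
                futures.add(pool.submit(task)); // starts without waiting
            }
            List<T> results = new ArrayList<>();
            for (Future<T> future : futures) {
                results.add(future.get()); // blocks until that device's task is done
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in work: each "device" just reports its own index.
        List<String> done = runOnAllDevices(2, i -> "device " + i + " finished");
        done.forEach(System.out::println);
    }
}
```

Timing each task separately inside workForDevice would also show whether one device dominates the combined run or whether both slow down when they work at the same time.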

I did consider switching to the llb's, but the hlb's are really well coded and I am rather particular about how code should look. You guys really did a nice job with the hlb's. The books I am reading now are indeed C(++), but until now I have been able to translate everything to hlb one way or another.

1 OpenCL GPU platform(s) present
1 OpenCL CPU platform(s) present
Platform AMD Accelerated Parallel Processing contains 2 devices
   Device: CLDevice [id: 139956276461280 name: Ellesmere type: GPU profile: FULL_PROFILE]
   Device: CLDevice [id: 139956278518384 name: Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz type: CPU profile: FULL_PROFILE]
Running benchmark for 2 devices.
Benchmarking Ellesmere           : 0 1 2 3 4 4983 ms
Benchmarking Intel(R) Xeon(R) CPU: 0 1 2 3 4 7765 ms
Benchmarking all devices together: 0 1 2 3 4 23149 ms
Benchmarking plain old Java: 12101 ms

Summary of computing vectors with 20,480,000 elements (of type double)
Device                     VectorAdd   VectorMul   VectorDiv   VectorTri
Ellesmere                        227         280         242         245
Intel(R) Xeon(R) CPU             355         365         356         475
All devices together            1120        1124        1142        1242
Plain old Java                    45          43          79       11934

Anyhow, it's time to sleep. And thanks again for your great help and patience!