How to queue in a multidevice environment


Arnold
I am trying to run the HelloWorld example in a multi-device environment. As I understand it, I set up a context for the devices I want to run the kernels on, create a program from that context, create the kernels from the program, and create a buffer for each kernel. Next I enumerate all devices and, for each device, create a sub-buffer and a command queue and run the lot (see the sample code below; the questions are in comments near the bottom).

I have four questions:
1. Is this understanding correct?
2. If so, how do I create a command queue without running it?
3. How do I start all command queues at the same moment?
4. How do I wait for the results?

I apologize for all these questions, but the documentation about CLCommandQueue is somewhat scanty.

Thanks in advance for your time and for all your patience so far :-)

Code:

   // Collect all relevant devices in CLDevice [] devices;
   context = CLContext.create (devices);
   program = context.createProgram (MultiBench.class.getResourceAsStream ("VectorFunctions.cl"));
   program.build ("", devices);
   kernels = program.createCLKernels ();
   for (Map.Entry<String, CLKernel> entry: kernels.entrySet ())
   {
    CLKernel kernel = entry.getValue ();
     // Now it's getting tricky, creating buffers and sub-buffers;
     // not sure whether this is done correctly

          int elementCount = 20000000;
          int localWorkSize = min (devices [0].getMaxWorkGroupSize(), 256);  // Local work size dimensions
          int globalWorkSize = roundUp (localWorkSize, elementCount);   // rounded up to the nearest multiple of the localWorkSize
          int nDevices = devices.length;
          int sliceSize = elementCount / nDevices;
          int extra = elementCount - nDevices * sliceSize;
          CLCommandQueue q [] = new CLCommandQueue [nDevices];
         
          CLSubBuffer<DoubleBuffer> [] CLSubArrayA = new CLSubBuffer [nDevices];
          CLSubBuffer<DoubleBuffer> [] CLSubArrayB = new CLSubBuffer [nDevices];
          CLSubBuffer<DoubleBuffer> [] CLSubArrayC = new CLSubBuffer [nDevices];

          // A, B are input buffers, C is for the result
          CLBuffer<DoubleBuffer> clBufferA = context.createDoubleBuffer(globalWorkSize, READ_ONLY);
          CLBuffer<DoubleBuffer> clBufferB = context.createDoubleBuffer(globalWorkSize, READ_ONLY);
          CLBuffer<DoubleBuffer> clBufferC = context.createDoubleBuffer(globalWorkSize, WRITE_ONLY);
         
          for (int i = 0; i < nDevices; i++)
          {
                  int size = sliceSize;
                  if (i == nDevices - 1) size += extra;
                  CLSubBuffer<DoubleBuffer> sbA = clBufferA.createSubBuffer (i * sliceSize, size, READ_ONLY);
                  CLSubArrayA [i] = sbA;
                  CLSubBuffer<DoubleBuffer> sbB = clBufferB.createSubBuffer (i * sliceSize, size, READ_ONLY);
                  CLSubArrayB [i] = sbB;
                  CLSubBuffer<DoubleBuffer> sbC = clBufferC.createSubBuffer (i * sliceSize, size, WRITE_ONLY);   // result slice, so WRITE_ONLY
                  CLSubArrayC [i] = sbC;
          } // for

          kernel.putArgs(clBufferA, clBufferB, clBufferC).putArg(elementCount);
         
          for (int i = 0; i < nDevices; i++)
          {
                  CLDevice device = devices [i];
                  q [i] = device.createCommandQueue ();
                  // asynchronous write of data to GPU device,
                  // followed by blocking read to get the computed results back.
                  q [i]
                                  .putWriteBuffer (CLSubArrayA [i], false)
                                  .putWriteBuffer (CLSubArrayB [i], false)
                                  .put1DRangeKernel (kernel, 0, globalWorkSize, localWorkSize)
                                  .putReadBuffer (CLSubArrayC [i], true);
          // I know the queueing commands above are not correct, as you are not supposed
          // to block while more enqueueing commands follow. How do I do this correctly?
         } // for
// How do I start all the queues, and how do I wait for the results?

   } // for

Re: How to queue in a multidevice environment

Wade Walker
Administrator
Hi Arnold,

For these kinds of questions, the ultimate resource is probably the OpenCL spec (https://www.khronos.org/registry/OpenCL/specs/opencl-2.1.pdf) and the official OpenCL docs (https://www.khronos.org/registry/OpenCL/sdk/2.1/docs/man/xhtml/). They go into far more detail than I could in a forum post :)

For examples of using JOCL's object-oriented features, the tests at https://github.com/WadeWalker/jocl/tree/master/test/com/jogamp/opencl and the demos at https://github.com/WadeWalker/jocl-demos are the best resource. Or, if you want to program OpenCL in a way similar to how it's used from C/C++, you can do it like in the "low-level bindings" test (https://github.com/WadeWalker/jocl/blob/master/test/com/jogamp/opencl/LowLevelBindingTest.java).

The JOCL low-level bindings are just a thin wrapper around the C functions, so they should work just as explained in the official OpenCL docs, except for the use of ByteBuffers in place of array and pointer arguments. This way of programming is a little more painful than using JOCL's object-oriented features, but since it's more like the non-Java use of OpenCL, it's easier to find help for it in places like StackOverflow and to compare it with the official Khronos docs.

Lastly, don't be discouraged if OpenCL programming seems difficult to understand at first. It's like that for everyone :) It just takes practice and exposure to get used to.

Re: How to queue in a multidevice environment

Arnold
Thanks for the example links, and that worked! Well, partially. Using finish () is what I was looking for, and the sub-buffers seem to work. So the last lines of the code are now:

          for (int i = 0; i < nDevices; i++)
          {
                  CLDevice device = devices [i];
                  q [i] = device.createCommandQueue ();
                  q [i]
                                  .putWriteBuffer (CLSubArrayA [i], false)
                                  .putWriteBuffer (CLSubArrayB [i], false)
                                  .put1DRangeKernel (kernel, 0, globalWorkSize, localWorkSize)
                                  .putReadBuffer (CLSubArrayC [i], false);

          } // for
         
          for (int i = 0; i < nDevices; i++)
          {
                  q [i].finish ();
          } // for
       
There is just a small detail: the devices lumped together take twice as much time instead of half the time (I list the benchmark results below). I'll try to figure out tomorrow where the problem is. I saw a reference to CLCommandThreadPool by Michael Bien. Well, that's a nice exercise for tomorrow :-D

I did consider switching to the llb's, but the hlb's are really well coded and I am rather particular about how code should look. You guys really did a nice job with the hlb's. The books I am reading now are indeed C(++), but until now I have been able to translate this to the hlb's one way or another.

1 OpenCL GPU platform(s) present
1 OpenCL CPU platform(s) present
Platform AMD Accelerated Parallel Processing contains 2 devices
   Device: CLDevice [id: 139956276461280 name: Ellesmere type: GPU profile: FULL_PROFILE]
   Device: CLDevice [id: 139956278518384 name: Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz type: CPU profile: FULL_PROFILE]
Running benchmark for 2 devices.
Benchmarking Ellesmere           : 0 1 2 3 4 4983 ms
Benchmarking Intel(R) Xeon(R) CPU: 0 1 2 3 4 7765 ms
Benchmarking all devices together: 0 1 2 3 4 23149 ms
Benchmarking plain old Java: 12101 ms

Summary of computing vectors with 20,480,000 elements (of type double)
Device                      VectorAdd   VectorMul   VectorDiv   VectorTri
Ellesmere                         227         280         242         245
Intel(R) Xeon(R) CPU              355         365         356         475
All devices together             1120        1124        1142        1242
Plain old Java                     45          43          79       11934
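
My first guess (and it is only a guess until I have tested it) is that every queue enqueues the full globalWorkSize, so each device may be processing all elements instead of just its slice. What I want to try tomorrow looks roughly like the sketch below; the createCLKernel (name) call and the kernelName variable are assumptions on my part, standing in for whichever kernel is being benchmarked.

          // Untested sketch: one kernel instance per device, arguments pointing
          // at that device's sub-buffers, and an NDRange covering only its slice.
          for (int i = 0; i < nDevices; i++)
          {
                  int size = sliceSize + (i == nDevices - 1 ? extra : 0);
                  int sliceGlobal = roundUp (localWorkSize, size);   // pad to a multiple of localWorkSize

                  CLKernel sliceKernel = program.createCLKernel (kernelName);
                  sliceKernel.putArgs (CLSubArrayA [i], CLSubArrayB [i], CLSubArrayC [i]).putArg (size);

                  q [i] = devices [i].createCommandQueue ();
                  q [i]
                                  .putWriteBuffer (CLSubArrayA [i], false)
                                  .putWriteBuffer (CLSubArrayB [i], false)
                                  .put1DRangeKernel (sliceKernel, 0, sliceGlobal, localWorkSize)
                                  .putReadBuffer (CLSubArrayC [i], false);
          } // for

          for (int i = 0; i < nDevices; i++)
          {
                  q [i].finish ();
          } // for

The kernel would then have to bounds-check its global id against the slice size instead of elementCount.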

Anyhow, it's time to sleep. And thanks again for your great help and patience!


Re: How to queue in a multidevice environment

Arnold
In reply to this post by Wade Walker
I think I now understand enough of the intricacies of multi-device OpenCL to have some examples running. What I don't understand is how each task is associated with a certain device. I create a CLMultiContext from the relevant devices and use (a descendant of) CLTask to distribute the work over the devices. I found the example in one of the tests you mentioned. Its execute method reads as follows:

      public Buffer execute (final CLSimpleQueueContext qc)
      {
         final CLCommandQueue queue = qc.getQueue ();
         final CLContext context = qc.getCLContext ();
         final CLKernel kernel = qc.getKernel (kernelName);
         out.println ("starting #" + index + ", offset = " + offset + ", sliceHeight = " + sliceHeight + " " +
               context.getMaxFlopsDevice ().getName ());
// etc.

In the printout I see that most tasks run on the CPU (assuming this is the correct way to determine which device the context is assigned to). From previous benchmarks I know that I had better distribute the tasks evenly between the CPU and the GPU. How can I select a device for each task? And out of curiosity: how does the CLMultiContext framework assign a device to a certain task?
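
One thing I still want to verify: getMaxFlopsDevice () presumably returns the fastest device in the whole context rather than the device this particular queue was created on. Assuming CLCommandQueue has a getDevice () accessor (I have not checked that yet), a line like the following inside execute () should show the real assignment:

      out.println ("task #" + index + " runs on " + queue.getDevice ().getName ());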

Re: How to queue in a multidevice environment

Wade Walker
Administrator
Hi Arnold,

I think at this point you've gone beyond my understanding of CLMultiContext -- since I'm not the original author, I don't have that knowledge at my fingertips; I'd have to look it up in the code, the same as you :)

In general, distributing tasks over devices is difficult, unless you know in advance the sizes of the tasks and the relative speeds of your CPU and GPU. To get a good balance across devices, people often do dynamic dispatching, where they don't do a static schedule ahead of time, but instead dispatch jobs to devices at runtime depending on the current utilization of each device and the expected size of each job.
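
To make the idea concrete, here's a very rough sketch (I haven't run this; enqueueSlice () is a placeholder for whatever non-blocking writes, kernel launch and read-back each slice needs, and nSlices is just a tuning knob): each device gets its own thread that keeps claiming the next slice index from a shared counter, so a faster device naturally ends up taking more slices than a slower one.

        // Untested sketch of dynamic dispatching (uses java.util.concurrent.atomic.AtomicInteger).
        final int nSlices = 64;                                  // split the work finer than the device count
        final AtomicInteger nextSlice = new AtomicInteger (0);   // shared work counter
        Thread [] workers = new Thread [devices.length];

        for (int d = 0; d < devices.length; d++)
        {
                final CLDevice device = devices [d];
                workers [d] = new Thread (() ->
                {
                        CLCommandQueue queue = device.createCommandQueue ();
                        int slice;
                        while ((slice = nextSlice.getAndIncrement ()) < nSlices)
                        {
                                enqueueSlice (queue, slice);     // placeholder: writes, kernel, read-back for this slice
                        }
                        queue.finish ();                         // wait for everything this device was given
                });
                workers [d].start ();
        }

        for (Thread w: workers)
        {
                try { w.join (); } catch (InterruptedException e) { Thread.currentThread ().interrupt (); }
        }

The static split you have now is the simplest form of this; the thread-per-device version only pays off when the slices are small enough that the faster device can "steal" work the slower one would otherwise be stuck with.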

People who are very concerned with fine-grained load balancing across multiple devices (for example in the high-performance computing arena) might explore task-based programming (for example OpenMP tasks), which lets you specify dependencies between tasks and then let the runtime choose which ones get run where. I'm not sure OpenCL has such a facility, but you could write something like it based on OpenCL if you were so inclined :)

Re: How to queue in a multidevice environment

Arnold
I am already studying the code, thanks to your earlier hints ;-) I noticed that most of the hlb code was written by Michael Bien. In one of the forum articles he points to the CLMultiContext class. Maybe I have reached the "not invented here" threshold and should start coding things myself :-).

I am not sure yet whether I need all that fine-grained-ness of the tasks. Typical things I do are fractals and matrix multiplication. These are typical simd tasks.  I am just building a personal supercomputer (ahem) just for fun: 2 Xeons and three GPU's should get me some computer power. As you will understand, multi device balancing is part of the job. And openCL is full of subtleties. I thought that the multi-device version was scaling badly, but after some tests this morning I found out that scheduling Mandelbrot row-wise is a bad thing to do. It's about as inefficient as serialized Java! (which is much more efficient than I expected for an interpreted language) So back to the good old drawing board :-D