How to queue in a multidevice environment

Posted by Arnold on Feb 25, 2017; 2:34pm
URL: https://forum.jogamp.org/How-to-queue-in-a-multidevice-environment-tp4037674.html

I am trying to run the HelloWorld example in a multi-device environment. As I understand it, I create a context for the devices I want to run the kernels on, create a program from that context, create the kernels from the program, and create a buffer for each kernel. Then I loop over all devices, create a sub-buffer and a command queue for each one, and run the lot (see the sample code below; my questions are in comments near the bottom).

I have four questions:
1. Is this understanding correct?
2. If so, how do I create a command queue without running it?
3. How do I start all command queues at the same moment?
4. How do I wait for the results?

I apologize for all these questions, but the documentation for CLCommandQueue is rather sparse.

Thanks in advance for your time and for all your patience so far :-)

Code:

   // Collect all relevant devices in CLDevice [] devices;
   context = CLContext.create (devices);
   program = context.createProgram (MultiBench.class.getResourceAsStream ("VectorFunctions.cl"));
   program.build ("", devices);
   kernels = program.createCLKernels ();
   for (Map.Entry<String, CLKernel> entry: kernels.entrySet ())
   {
    CLKernel kernel = entry.getValue ();
    // Now it's getting tricky: creating buffers and sub-buffers.
    // I am not sure whether this is done correctly.

          int elementCount = 20000000;
          int localWorkSize = min (devices [0].getMaxWorkGroupSize(), 256);  // Local work size dimensions
          int globalWorkSize = roundUp (localWorkSize, elementCount);   // rounded up to the nearest multiple of the localWorkSize
          int nDevices = devices.length;
          int sliceSize = elementCount / nDevices;
          int extra = elementCount - nDevices * sliceSize;
          CLCommandQueue q [] = new CLCommandQueue [nDevices];
         
          CLSubBuffer<DoubleBuffer> [] CLSubArrayA = new CLSubBuffer [nDevices];
          CLSubBuffer<DoubleBuffer> [] CLSubArrayB = new CLSubBuffer [nDevices];
          CLSubBuffer<DoubleBuffer> [] CLSubArrayC = new CLSubBuffer [nDevices];

          // A, B are input buffers, C is for the result
          CLBuffer<DoubleBuffer> clBufferA = context.createDoubleBuffer(globalWorkSize, READ_ONLY);
          CLBuffer<DoubleBuffer> clBufferB = context.createDoubleBuffer(globalWorkSize, READ_ONLY);
          CLBuffer<DoubleBuffer> clBufferC = context.createDoubleBuffer(globalWorkSize, WRITE_ONLY);
         
          for (int i = 0; i < nDevices; i++)
          {
                  int size = sliceSize;
                  if (i == nDevices - 1) size += extra;
                  CLSubBuffer<DoubleBuffer> sbA = clBufferA.createSubBuffer (i * sliceSize, size, READ_ONLY);
                  CLSubArrayA [i] = sbA;
                  CLSubBuffer<DoubleBuffer> sbB = clBufferB.createSubBuffer (i * sliceSize, size, READ_ONLY);
                  CLSubArrayB [i] = sbB;
                  // C is the result buffer (parent is WRITE_ONLY), so its sub-buffer is WRITE_ONLY too
                  CLSubBuffer<DoubleBuffer> sbC = clBufferC.createSubBuffer (i * sliceSize, size, WRITE_ONLY);
                  CLSubArrayC [i] = sbC;
          } // for

          kernel.putArgs(clBufferA, clBufferB, clBufferC).putArg(elementCount);
         
          for (int i = 0; i < nDevices; i++)
          {
                  CLDevice device = devices [i];
                  q [i] = device.createCommandQueue ();
                  // asynchronous write of data to GPU device,
                  // followed by blocking read to get the computed results back.
                  q [i]
                                  .putWriteBuffer (CLSubArrayA [i], false)
                                  .putWriteBuffer (CLSubArrayB [i], false)
                                  .put1DRangeKernel (kernel, 0, globalWorkSize, localWorkSize)
                                  .putReadBuffer (CLSubArrayC [i], true);
                  // I know the queuing command above is not correct, as you are not supposed
                  // to block while more enqueuing commands follow. How do I do this correctly?
         } // for
         // How do I start all queues, and how do I wait for the results?

   } // for
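
The slicing arithmetic in the code above can be checked on its own, without any OpenCL calls. Below is a minimal plain-Java sketch of it. The class name SlicePlan and the methods sliceOffsets and sliceSizes are hypothetical names introduced only for illustration, and roundUp is an assumed implementation, since the post calls it but does not show it.

```java
import java.util.Arrays;

public class SlicePlan {

    // Assumed helper (not shown in the post): rounds globalSize up to the
    // nearest multiple of groupSize.
    static int roundUp(int groupSize, int globalSize) {
        int r = globalSize % groupSize;
        return (r == 0) ? globalSize : globalSize + groupSize - r;
    }

    // Hypothetical helper: start offset (in elements) of each device's slice.
    static int[] sliceOffsets(int elementCount, int nDevices) {
        int sliceSize = elementCount / nDevices;
        int[] offsets = new int[nDevices];
        for (int i = 0; i < nDevices; i++) {
            offsets[i] = i * sliceSize;
        }
        return offsets;
    }

    // Hypothetical helper: element count of each slice. As in the post,
    // the last device also takes the remainder of the integer division.
    static int[] sliceSizes(int elementCount, int nDevices) {
        int sliceSize = elementCount / nDevices;
        int extra = elementCount - nDevices * sliceSize;
        int[] sizes = new int[nDevices];
        Arrays.fill(sizes, sliceSize);
        sizes[nDevices - 1] += extra;
        return sizes;
    }

    public static void main(String[] args) {
        int elementCount = 20000000;
        int nDevices = 3;        // example value, not taken from the post
        int localWorkSize = 256; // upper bound used in the post

        int[] offsets = sliceOffsets(elementCount, nDevices);
        int[] sizes = sliceSizes(elementCount, nDevices);
        for (int i = 0; i < nDevices; i++) {
            // Work size covering only this device's slice, rounded up to a
            // multiple of the local work size.
            int global = roundUp(localWorkSize, sizes[i]);
            System.out.println("device " + i + ": offset=" + offsets[i]
                    + " size=" + sizes[i] + " globalWorkSize=" + global);
        }
    }
}
```

If each device is meant to process only its own slice, the work size passed to its queue would presumably be the per-slice value computed here rather than the one for the whole array. The rounded value can exceed the slice size, so the kernel would need a bounds check against the slice's element count.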