jogamp - How to queue in a multidevice environment

Posted by Arnold on Feb 25, 2017; 2:34pm
URL: https://forum.jogamp.org/How-to-queue-in-a-multidevice-environment-tp4037674.html

I am trying to run the HelloWorld example in a multi-device environment. My understanding is as follows: I set up a context for the devices I want to run the kernels on, create a program from that context, create the kernels from the program, and create buffers for each kernel. Next I iterate over all devices, create a sub-buffer and a command queue for each one, and run the lot (see the sample code below; my questions appear as comments near the bottom).
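The per-device split described above comes down to a little integer arithmetic, which can be checked without any OpenCL device. Below is a minimal, library-free sketch of it. The names `SlicePlan`, `Slice` and `partition` are made up for illustration and are not part of JOCL. `roundUp` is assumed to round up to the next multiple of the work-group size. The sketch also assumes that each device would get a global work size derived from its own slice rather than from the full element count.

```java
import java.util.ArrayList;
import java.util.List;

public class SlicePlan {
    /** Offset and length (in elements) of the part of the data handled by one device. */
    public static final class Slice {
        public final int offset;
        public final int length;

        Slice(int offset, int length) {
            this.offset = offset;
            this.length = length;
        }

        @Override
        public String toString() {
            return "Slice[offset=" + offset + ", length=" + length + "]";
        }
    }

    /** Splits elementCount into nDevices contiguous slices; the last slice absorbs the remainder. */
    public static List<Slice> partition(int elementCount, int nDevices) {
        if (elementCount < 0 || nDevices <= 0) {
            throw new IllegalArgumentException("need elementCount >= 0 and nDevices > 0");
        }
        int sliceSize = elementCount / nDevices;
        int extra = elementCount - nDevices * sliceSize;
        List<Slice> slices = new ArrayList<>();
        for (int i = 0; i < nDevices; i++) {
            int length = (i == nDevices - 1) ? sliceSize + extra : sliceSize;
            slices.add(new Slice(i * sliceSize, length));
        }
        return slices;
    }

    /** Rounds globalSize up to the nearest multiple of groupSize. */
    public static int roundUp(int groupSize, int globalSize) {
        int r = globalSize % groupSize;
        return (r == 0) ? globalSize : globalSize + groupSize - r;
    }

    public static void main(String[] args) {
        int localWorkSize = 256;
        for (Slice s : partition(20_000_000, 3)) {
            // Per-slice global work size, rounded up to a multiple of the local work size.
            System.out.println(s + " -> global work size " + roundUp(localWorkSize, s.length));
        }
    }
}
```

Whether a given offset is acceptable as a sub-buffer origin also depends on the device's base address alignment, which this sketch does not check.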

I have four questions:
1. Is this understanding correct?
2. If so, how do I create a command queue without running it?
3. How do I start all command queues at the same moment?
4. How do I wait for the results?

I apologize for all these questions, but the documentation about CLCommandQueue is somewhat scanty.

Thanks in advance for your time and for all your patience so far :-)

Code:

// Collect all relevant devices in CLDevice [] devices;
context = CLContext.create (devices);
program = context.createProgram (MultiBench.class.getResourceAsStream ("VectorFunctions.cl"));
program.build ("", devices);
kernels = program.createCLKernels ();
for (Map.Entry<String, CLKernel> entry: kernels.entrySet ())
{
    CLKernel kernel = entry.getValue ();
    // Now it’s getting tricky: creating buffers and sub-buffers.
    // I am not sure whether this is done correctly.

    int elementCount = 20000000;
    int localWorkSize = min (devices [0].getMaxWorkGroupSize(), 256); // Local work size dimensions
    int globalWorkSize = roundUp (localWorkSize, elementCount); // rounded up to the nearest multiple of the localWorkSize
    int nDevices = devices.length;
    int sliceSize = elementCount / nDevices;
    int extra = elementCount - nDevices * sliceSize;
    CLCommandQueue q [] = new CLCommandQueue [nDevices];

    CLSubBuffer<DoubleBuffer> [] CLSubArrayA = new CLSubBuffer [nDevices];
    CLSubBuffer<DoubleBuffer> [] CLSubArrayB = new CLSubBuffer [nDevices];
    CLSubBuffer<DoubleBuffer> [] CLSubArrayC = new CLSubBuffer [nDevices];

    // A, B are input buffers, C is for the result
    CLBuffer<DoubleBuffer> clBufferA = context.createDoubleBuffer(globalWorkSize, READ_ONLY);
    CLBuffer<DoubleBuffer> clBufferB = context.createDoubleBuffer(globalWorkSize, READ_ONLY);
    CLBuffer<DoubleBuffer> clBufferC = context.createDoubleBuffer(globalWorkSize, WRITE_ONLY);

    for (int i = 0; i < nDevices; i++)
    {
        int size = sliceSize;
        if (i == nDevices - 1) size += extra; // the last device takes the remainder
        CLSubBuffer<DoubleBuffer> sbA = clBufferA.createSubBuffer (i * sliceSize, size, READ_ONLY);
        CLSubArrayA [i] = sbA;
        CLSubBuffer<DoubleBuffer> sbB = clBufferB.createSubBuffer (i * sliceSize, size, READ_ONLY);
        CLSubArrayB [i] = sbB;
        // The result sub-buffer must match its WRITE_ONLY parent
        CLSubBuffer<DoubleBuffer> sbC = clBufferC.createSubBuffer (i * sliceSize, size, WRITE_ONLY);
        CLSubArrayC [i] = sbC;
    } // for

    kernel.putArgs(clBufferA, clBufferB, clBufferC).putArg(elementCount);

    for (int i = 0; i < nDevices; i++)
    {
        CLDevice device = devices [i];
        q [i] = device.createCommandQueue ();
        // asynchronous write of data to GPU device,
        // followed by blocking read to get the computed results back.
        q [i]
            .putWriteBuffer (CLSubArrayA [i], false)
            .putWriteBuffer (CLSubArrayB [i], false)
            .put1DRangeKernel (kernel, 0, globalWorkSize, localWorkSize)
            .putReadBuffer (CLSubArrayC [i], true);
        // I know the queueing command above is not correct, as you are not supposed
        // to block while more enqueueing commands follow. How do I do this correctly?
    } // for
    // How do I start all queues, and how do I wait for the results?

} // for
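
A note on the two questions in the comments above. One common OpenCL pattern is to enqueue every command non-blocking on every queue first, then flush each queue so that all devices have their work submitted, and only then call finish on each queue to wait. In JOCL the corresponding CLCommandQueue calls would be putWriteBuffer (..., false), put1DRangeKernel (...), putReadBuffer (..., false), flush () and finish (). The sketch below only demonstrates that call order. `DeviceQueue` and `RecordingQueue` are hypothetical stand-ins, not JOCL types, so that it runs without an OpenCL device. It is not a tested multi-device implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class MultiQueueOrder {
    /** Hypothetical stand-in for a per-device command queue; NOT a JOCL type. */
    interface DeviceQueue {
        void enqueueNonBlocking(String command);
        void flush();   // submit everything enqueued so far to the device
        void finish();  // block until everything enqueued so far has completed
    }

    /** Records each call in a shared log so the resulting order can be inspected. */
    static final class RecordingQueue implements DeviceQueue {
        private final int id;
        private final List<String> log;

        RecordingQueue(int id, List<String> log) {
            this.id = id;
            this.log = log;
        }

        @Override public void enqueueNonBlocking(String command) { log.add("q" + id + " enqueue " + command); }
        @Override public void flush() { log.add("q" + id + " flush"); }
        @Override public void finish() { log.add("q" + id + " finish"); }
    }

    /** Runs the enqueue-all, flush-all, finish-all pattern and returns the call log. */
    public static List<String> runOnAll(int nDevices) {
        List<String> log = new ArrayList<>();
        List<DeviceQueue> queues = new ArrayList<>();
        for (int i = 0; i < nDevices; i++) {
            queues.add(new RecordingQueue(i, log));
        }
        // 1. Enqueue everything without blocking, including the read of the result.
        for (DeviceQueue q : queues) {
            q.enqueueNonBlocking("write A");
            q.enqueueNonBlocking("write B");
            q.enqueueNonBlocking("kernel");
            q.enqueueNonBlocking("read C");
        }
        // 2. Flush every queue so no device waits for another queue's host-side calls.
        for (DeviceQueue q : queues) {
            q.flush();
        }
        // 3. Wait for every queue; after this loop all results are available on the host.
        for (DeviceQueue q : queues) {
            q.finish();
        }
        return log;
    }

    public static void main(String[] args) {
        for (String line : runOnAll(2)) {
            System.out.println(line);
        }
    }
}
```

With this ordering no queue is "started" explicitly: creating a queue runs nothing, and a device may begin executing as soon as commands are submitted to it. Flushing all queues before the first wait is what keeps one device's blocking call from delaying the others.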