Hi,
I'm very new to OpenCL and JOCL. I'm using a dual-GPU HD7990 on Win7 64-bit with Catalyst 14.12, and I'm having problems getting reliable concurrent execution across both GPUs.
If I run the JOCLMultiDeviceSample I can see it enqueue work to both GPUs; however, the work fully executes on one GPU before the work on the second GPU commences. Note that since the sample runs over all devices, I can also see the CPU work occurring concurrently with the GPU work.
My own launcher and kernels behave the same way: if I schedule the same work to both GPUs and use clWaitForEvents(...) to wait for both jobs to complete, they execute serially, taking twice the time of one job. When I run a thread per device (each with a separate context) AND force the work on the first core to be scheduled before that on the second core, both jobs execute concurrently and complete in the same time as a single job on one GPU.
Any pointers as to whether this is an issue with the dual card / driver or a synchronisation problem in the JOCL wrapper?
Thanks, Joe |
Actually, it simply works concurrently if I split the GPU work across separate threads (with independent contexts); it doesn't matter whether gpu1 schedules before gpu2. But it always runs serially when both jobs are enqueued from the same thread and that thread waits on multiple events.
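
The thread-per-device arrangement described above can be sketched in plain Java roughly as follows. This only illustrates the threading structure and is not the poster's launcher: `runJobOnDevice` is a hypothetical placeholder that sleeps instead of creating a context, enqueuing a kernel and waiting on its event.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PerDeviceThreads {

    // Hypothetical stand-in for the per-device work. In a real launcher this
    // is where one thread would create its own context and queue for the
    // device, enqueue the kernel, and wait for its event.
    static String runJobOnDevice(int deviceIndex) throws InterruptedException {
        Thread.sleep(200); // pretend the kernel takes 200 ms
        return "device " + deviceIndex + " ran on " + Thread.currentThread().getName();
    }

    // Starts one thread per device and waits for all of them to finish.
    static List<String> runOnAllDevices(int deviceCount) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(deviceCount);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (int i = 0; i < deviceCount; i++) {
                final int device = i;
                pending.add(pool.submit(() -> runJobOnDevice(device)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : pending) {
                results.add(f.get()); // blocks until that device's job is done
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();
        List<String> results = runOnAllDevices(2);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        results.forEach(System.out::println);
        System.out.println("elapsed ms: " + elapsedMs);
    }
}
```

Because each job runs on its own thread, the total elapsed time should be close to that of a single job rather than the sum of both.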
|
Administrator
|
In reply to this post by devmonkey
Hi Joe,
Looking at the code for clWaitForEvents(), all it does is check the buffer arguments to make sure they're direct buffers, and then call the native OpenCL function, so there shouldn't be any waiting in the JOCL wrapper. Same goes for clEnqueueNDRangeKernel().
On Stack Overflow and other forums I've seen people saying that clEnqueueNDRangeKernel() is a blocking call on Nvidia hardware, even though the OpenCL spec doesn't require this. The Stack Overflow guy even got a response from Nvidia telling him to enqueue in separate threads :) So it looks like this is an Nvidia thing.
http://stackoverflow.com/questions/11562543/clenqueuendrange-blocking-on-nvidia-hardware-also-multi-gpu
https://devtalk.nvidia.com/default/topic/415023/launch-kernels-in-parallel-/ |
Hi Wade,
Thanks, but this is occurring with AMD GPUs, and I've proven that the enqueues do not block. I wonder if someone could run the JOCL sample? You need to wrap the loop inside the kernel in another loop to make it take a significant amount of time; as it stands, the sample executes so quickly that there is no way to tell whether it executes serially or not. Joe |
Administrator
|
Sorry, that's what I get for reading too quickly -- didn't notice you were on AMD :)
Have you tried using clFlush() right after each clEnqueueNDRangeKernel()? This should ensure that the commands in the queues are actually issued to the devices before you call clWaitForEvents(). The docs are unclear, but it may be that clWaitForEvents() waits for events serially, performing an implicit clFlush() before each one, which would result in the serial execution you see. I can't actually run this test, since I've only got one GPU. But if you're doing things like in JOCLMultiDeviceSample.java, the CL commands you're calling are just a thin wrapper around the C functions, so there shouldn't be any JOCL-specific weirdness going on. |
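
The hypothesis above can be illustrated with a toy model in plain Java. This is only a sketch under stated assumptions and contains no OpenCL: `LazyQueue`, `enqueue`, `flush` and `runBoth` are hypothetical names, not JOCL API, and the "wait" step simply assumes the flush-then-wait-per-event behaviour that the post speculates about. It is not a statement of how any driver actually works.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

class FlushModel {

    // Hypothetical model of a command queue that only hands work to its
    // "device" (a single worker thread) when flushed. Not JOCL API.
    static class LazyQueue {
        private final ExecutorService device = Executors.newSingleThreadExecutor();
        private final List<Runnable> pending = new ArrayList<>();

        // Plays the role of clEnqueueNDRangeKernel: records the command only.
        <T> CompletableFuture<T> enqueue(Supplier<T> job) {
            CompletableFuture<T> event = new CompletableFuture<>();
            pending.add(() -> event.complete(job.get()));
            return event;
        }

        // Plays the role of clFlush: submits everything recorded so far.
        void flush() {
            pending.forEach(device::execute);
            pending.clear();
        }

        void release() { device.shutdown(); }
    }

    // Returns true if both jobs were observed running at the same time.
    static boolean runBoth(boolean flushAfterEnqueue) {
        LazyQueue q0 = new LazyQueue();
        LazyQueue q1 = new LazyQueue();
        CountDownLatch bothStarted = new CountDownLatch(2);
        Supplier<Boolean> job = () -> {
            bothStarted.countDown();
            try {
                // true only if the other job starts while this one is running
                return bothStarted.await(300, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        };
        CompletableFuture<Boolean> e0 = q0.enqueue(job);
        if (flushAfterEnqueue) q0.flush();
        CompletableFuture<Boolean> e1 = q1.enqueue(job);
        if (flushAfterEnqueue) q1.flush();

        // Assumed wait behaviour: flush and wait for each event in turn.
        q0.flush();
        boolean first = e0.join();
        q1.flush();
        boolean second = e1.join();

        q0.release();
        q1.release();
        return first && second;
    }

    public static void main(String[] args) {
        System.out.println("without flush, concurrent: " + runBoth(false));
        System.out.println("with flush, concurrent:    " + runBoth(true));
    }
}
```

In this model the jobs only overlap when each queue is flushed right after its enqueue; without the explicit flush, the second queue's job is not submitted until the first event has completed.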
Hi Wade,
You nailed it. I updated JOCLMultiDeviceSample.java to call clFlush after each enqueue and now it does indeed run on both gpus concurrently. Do you think this is just a problem with my particular card or a general problem with the sample code?
The change I made to the kernel in order to make it run long enough is:

    private static String programSource =
        "__kernel void sampleKernel(__global const float *input,"+
        " __global float *output, " +
        " int size)"+
        "{"+
        " int gid = get_global_id(0);"+
        " output[gid] = 0;" +
        " for (int j=0;j<100000;j++) {" +
        " for (int i=0; i<size; i++) " +
        " output[gid] += input[i];" +
        "}}";

Cheers, Joe |
Administrator
|
Hi Joe,
I think this is a problem with the sample code. Technically it doesn't promise that the jobs will execute simultaneously on the separate GPUs, but this is surely implied :) I'll update it to add clFlush() after each enqueue. Thanks for reporting this! |
Administrator
|
In reply to this post by devmonkey
Haha, oops -- JOCLMultiDeviceSample is actually from jocl.org, not from the JOCL that's at JogAmp.org, so I can't check in a fix for it since it's in another project's code :) I'm glad to have helped you find the bug, though.
|
Administrator
|
It would be better if JOCL was JogAmp's registered trademark in order to avoid any confusion.
Julien Gouesse
|