All,

The exact same code runs fine on Snow Leopard but throws the following error on Lion:

com.jogamp.opencl.CLException$CLInvalidWorkGroupSizeException: can not enqueue 1DRange CLKernel [id: 140699706121896 name: IntegrateHHStep] with gwo: null gws: {256} lws: {256} cond.: null events: null [error: CL_INVALID_WORK_GROUP_SIZE]

I am using the following code to define the local work-group size and the global work size for the I/O buffers:

    // Length of arrays to process
    int elementCount = models.size();
    // Local work size dimensions for the selected device
    int localWorkSize = min(device.getMaxWorkGroupSize(), 256);
    // rounded up to the nearest multiple of the localWorkSize
    int globalWorkSize = roundUp(localWorkSize, elementCount);
    // results buffers are bigger as we are capturing every value for every item for every time-step
    int globalWorkSize_Results = roundUp(localWorkSize, elementCount*timeConfiguration.getTimeSteps());

In a Twitter conversation, @mbien suggested I set localWorkSize to 0 so that the driver picks a work size automatically. But how can I declare the buffers of size globalWorkSize and globalWorkSize_Results without knowing what to round up to (do I just not round up)?

Thanks! |
Perfect, it's much easier to answer here than via Twitter :)

First, let's list a few rules about the sizes: LWS is limited by the device/driver-specific maximum and can even depend on the N of your NDRange. Furthermore, GWS must be a multiple of LWS.

However, in many cases you don't care about LWS; all you want is for all elements of your GWS to be processed, and it doesn't matter how the range is subdivided. In this case you can simply set LWS=0 and GWS="problem size" (so you are right, you don't have to round up in this case). The driver then has to figure out the values by itself, and it might end up with a configuration that is not optimal.

The code you posted tries to be smart and will only work if your kernel does an overflow check (if (workItemID >= size) return;). It first tries to set a supported LWS and then rounds GWS up to a multiple of LWS. In your case GWS = LWS, which is a bit unusual, but it should work.

Also check out getMaxWorkItemSizes()[0]; maybe it's smaller than getMaxWorkGroupSize() on your system.

--
http://michael-bien.com |
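The padding rule and the overflow check described above can be sketched in plain Java. This is only an illustration, not JOCL code: the class name WorkSizeSketch and the method simulateKernel are hypothetical, the values 300 and 256 are made up, and the loop merely emulates on the host what a padded 1D range does on the device.

```java
// Hypothetical helper, not part of JOCL: shows why a padded GWS needs the overflow check.
class WorkSizeSketch {

    /** Rounds globalSize up to the nearest multiple of groupSize (groupSize must be > 0). */
    static int roundUp(int groupSize, int globalSize) {
        int remainder = globalSize % groupSize;
        return remainder == 0 ? globalSize : globalSize + groupSize - remainder;
    }

    /**
     * Emulates a 1D range on the host: one loop iteration per work-item.
     * Returns how many work-items passed the overflow check and did real work.
     */
    static int simulateKernel(int globalWorkSize, int size) {
        int processed = 0;
        for (int workItemId = 0; workItemId < globalWorkSize; workItemId++) {
            if (workItemId >= size) {
                continue; // plays the role of "if (workItemID >= size) return;" in the kernel
            }
            processed++;
        }
        return processed;
    }

    public static void main(String[] args) {
        int elementCount = 300;  // assumed problem size
        int localWorkSize = 256; // assumed to be supported by the device and the kernel
        int globalWorkSize = roundUp(localWorkSize, elementCount);
        System.out.println("padded GWS = " + globalWorkSize);
        System.out.println("elements processed = " + simulateKernel(globalWorkSize, elementCount));
    }
}
```

With LWS=0 the padding step is skipped entirely and the unpadded element count is passed as GWS, so the surplus work-items never exist.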
BTW, there is also CLWork, which tries to abstract all that rounding away:
http://michael-bien.com/mbien/entry/many_little_improvements_made_it

best regards,
michael |
Nice explanation - thanks!

I tried on Snow Leopard with LWS = 0 and without rounding up, and it works. As expected, though, it is many times slower (up to 10 times in my case), presumably because the size the driver picks is not optimal.

Will give it a shot tomorrow on Lion.

What's the best way to figure out the optimal LWS on a generic system using some reasonable assumptions? (I considered WGS=256 a reasonable assumption ... I must assume the Lion drivers somehow broke that assumption.)

I copied that code from one of your examples, and my kernel does indeed do the overflow check (as in your examples).

Thanks again.

P.S. Thanks for the tip on CLWork. I didn't notice the update; I need to stay on top of your blog posts! |
On 08/31/2011 01:44 AM, Giovanni Idili [via jogamp] wrote:
> Thanks a lot for the explanation!
>
> I tried on snow leopard with LWS = 0 and not rounding up and it works
> but it is many times slower as expected (because it may be not
> optimal, up to 10 times slower in my case).
>
> Will give it a shot tomorrow on Lion.
>
> What's the best way to figure out the optimal LWS on a generic system
> using some reasonable assumptions (I considered WGS=256 a reasonable
> assumption ... I must assume Lion drivers somehow screwed up that
> assumption)?

[...] values for certain hardware (max WG size and for getMaxWorkItemSizes). That's what I do too. CL is still new and it's hard to make those assumptions right now, especially if you want to run on every type of hardware (CPU, GPU, etc.).

So: don't hardcode those values, don't use anything larger than those (device-dependent) max values, and hope for the best :) |
After putting it away for months I am now back to fighting with this.

I posted something on the Khronos boards [http://www.khronos.org/message_boards/viewtopic.php?f=28&t=4521] and David Garcia pointed me to clGetKernelWorkGroupInfo(..., CL_KERNEL_WORK_GROUP_SIZE, ...) in the C bindings (even though I do not fully understand how the kernel could influence the work-group size of the device).

Question: is there something on the CLKernel object that wraps clGetKernelWorkGroupInfo? CLKernel.getWorkGroupSize(CLDevice device) looks promising. Am I on the right path, and if so, am I supposed to use that work-group size instead of the one I get from:

    int localWorkSize = min(device.getMaxWorkGroupSize(), 256);

Thanks for your support!

Best,
Giovanni |
It's actually pretty simple: there are fixed limits per processor, such as register counts, local memory, and so on. The more of these resources a given thread uses, the fewer threads can be executed concurrently. This might affect the work-group size limit (or not, depending on the vendor's design). |
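As a rough illustration of that point, here is a toy model in Java. Everything in it is an assumption made for the example: the class and method names are hypothetical, the register counts are invented, and real drivers apply vendor-specific rules that also involve local memory and other resources. That is exactly why the limit should be queried per kernel rather than computed by hand.

```java
// Toy model only: numbers and names are made up, real hardware rules are vendor specific.
class ResourceLimitSketch {

    /**
     * Estimates how many work-items fit on one processor when each work-item
     * needs registersPerWorkItem registers, capped by the device-wide maximum.
     */
    static int maxWorkItems(int registersPerProcessor,
                            int registersPerWorkItem,
                            int deviceMaxWorkGroupSize) {
        int fitByRegisters = registersPerProcessor / registersPerWorkItem;
        return Math.min(deviceMaxWorkGroupSize, fitByRegisters);
    }

    public static void main(String[] args) {
        // A light kernel is capped by the device limit, a register-hungry one by the registers.
        System.out.println(maxWorkItems(16384, 16, 512));
        System.out.println(maxWorkItems(16384, 80, 512));
    }
}
```

In this model a kernel that needs 80 registers per work-item ends up with a lower limit than the device maximum, which is one way a per-kernel work-group size can be smaller than the per-device one.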
Thanks for the explanation.

I changed this line of code:

    int localWorkSize = min(device.getMaxWorkGroupSize(), 256);

into:

    int localWorkSize = min((int)kernel.getWorkGroupSize(device), 256);

and now it's working fine.

One last question though: what's the difference between CLKernel.getWorkGroupSize and CLKernel.getCompileWorkGroupSize? |
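Putting the limits mentioned in this thread together, the choice of a local work size could be sketched as follows. The class LocalSizeSketch and its method are hypothetical; the three limits are passed in as plain numbers so the sketch runs without an OpenCL device. In real code they would presumably come from kernel.getWorkGroupSize(device), device.getMaxWorkGroupSize() and device.getMaxWorkItemSizes()[0].

```java
// Hypothetical helper, not part of JOCL: clamps a preferred size to all known limits.
class LocalSizeSketch {

    /**
     * Returns the preferred local work size, reduced to the smallest of the
     * kernel-specific limit, the device limit and the limit for dimension 0.
     * Never returns less than 1.
     */
    static int chooseLocalWorkSize(long kernelWorkGroupSize,
                                   long deviceMaxWorkGroupSize,
                                   long maxWorkItemSizeDim0,
                                   int preferred) {
        long limit = Math.min(kernelWorkGroupSize,
                Math.min(deviceMaxWorkGroupSize, maxWorkItemSizeDim0));
        return (int) Math.max(1, Math.min(limit, preferred));
    }

    public static void main(String[] args) {
        // Assumed values: the device allows 1024, but this kernel only allows 128.
        System.out.println(chooseLocalWorkSize(128, 1024, 1024, 256));
    }
}
```

The preferred value (256 in the thread) then acts only as an upper bound for tuning, never as a value the device is assumed to support.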
On 11/19/2011 05:30 AM, Giovanni Idili [via jogamp] wrote:
> Thanks for the explanation.

Compile WG size is the unintuitive name for the (optional) size you declared in your kernel.

https://github.com/mbien/jocl/blob/master/test/com/jogamp/opencl/CLProgramTest.java#L280
http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clGetKernelWorkGroupInfo.html

regards,
michael

--
http://michael-bien.com/ |
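To make the distinction concrete: in OpenCL C the optional size is declared with the reqd_work_group_size attribute, and the clGetKernelWorkGroupInfo documentation linked above states that the compile work-group size is reported as (0, 0, 0) when no such attribute is present. The sketch below is a hedged illustration in plain Java. The kernel source, the class CompileSizeSketch and its methods are made up for this example, and representing the compile size as a long[] of three values is an assumption about how a binding would return it.

```java
// Hypothetical example, not taken from the thread or from JOCL.
class CompileSizeSketch {

    // OpenCL C source held in a Java string; the attribute declares a fixed 64x1x1 work-group.
    static final String KERNEL_SOURCE =
            "__kernel __attribute__((reqd_work_group_size(64, 1, 1)))\n"
          + "void scale(__global float* data, const int size) {\n"
          + "    int id = get_global_id(0);\n"
          + "    if (id >= size) return;\n"
          + "    data[id] *= 2.0f;\n"
          + "}\n";

    /** True if the kernel declared a size, i.e. the reported triple is not all zeros. */
    static boolean hasCompileWorkGroupSize(long[] compileSize) {
        for (long dim : compileSize) {
            if (dim != 0) {
                return true;
            }
        }
        return false;
    }

    /** Uses the declared size if there is one, otherwise falls back to the queried limit. */
    static long localSizeDim0(long[] compileSize, long kernelWorkGroupSize, long preferred) {
        return hasCompileWorkGroupSize(compileSize)
                ? compileSize[0]
                : Math.min(kernelWorkGroupSize, preferred);
    }
}
```

A kernel compiled with such an attribute has to be enqueued with exactly that local size, so the declared value takes priority over any preferred value in this sketch.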