Re: Concurrent kernel execution in OpenCL implementations

Posted by Michael Bien on Apr 19, 2011; 2:41am
URL: https://forum.jogamp.org/Concurrent-kernel-execution-in-OpenCL-implementations-tp2827852p2837302.html

  AFAIK CKE is not supported in the APP SDK yet. However, I believe I
remember reading about it somewhere in the release notes of the NV drivers,
but it will probably only work on Fermi and later.

-michael

On 04/16/2011 03:14 PM, Wibowit [via jogamp] wrote:

> Hi, this is a repost from another forum (as it's wasteful to write the
> same thing twice):
>
> I'm developing my implementation of Bitonic Sort in OpenCL. Basically I
> would like to have concurrent kernel execution (CKE) because that would
> allow me to fully utilize resources. If I have a few kernels, one LDS
> heavy, one Global Memory heavy, one ALU heavy, etc., then interleaving them
> would lead to much better utilization of resources than running them
> serially - if one is stalled by LDS access, another can do ALU work or
> Global Memory transfers.
>
> I have done some research and it turned out that AMD hasn't yet
> implemented CKE in their APP SDK. On the other hand, nVidia does not
> officially support OpenCL 1.1 yet - OpenCL 1.1 brings Out-of-Order queues,
> events, some extensions, etc. nVidia provides OpenCL 1.1 enabled SDKs if
> you register somewhere. I don't have a GeForce so I haven't googled that.
>
> I've asked on the AMD Developer Forum about a timeline for CKE support but
> haven't received a satisfactory answer so far. My post is here:
> http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=149524&enterthread=y
>
> I have prepared a program that tests whether the OpenCL implementation
> you've installed supports CKE. It's here:
> http://www28.zippyshare.com/v/17487664/file.html
>
> It requires three parameters: <iterations> <multipleQueues> <threads>
>
> iterations is a long value,
> multipleQueues is a boolean value,
> threads is an int value.
>
> For example parameters could look like: 1234567 false 10
>
> I have a Juniper XT (i.e. Radeon HD 5770) which has 10 compute units (10
> SIMD arrays), so theoretically it should be able to run 10 different
> kernels at once if there's enough logic to manage the kernels. Sadly, the
> current OpenCL implementation from AMD executes only one kernel at a time.
>
> Here's my terminal log:
> piotrek@piotrek-pc:~/Pulpit/cke-test$ ./lin64.sh 1234567 true 1
> Total kernels execution time: 286
> Computations results (should be identical for identical number of
> iterations):
> 3531952812427241749
>
> piotrek@piotrek-pc:~/Pulpit/cke-test$ ./lin64.sh 1234567 true 10
> Total kernels execution time: 2666
> Computations results (should be identical for identical number of
> iterations):
> 3531952812427241749
> 3273905855313247543
> 3015858898199253337
> 2757811941085259131
> 2499764983971264925
> 2241718026857270719
> 1983671069743276513
> 1725624112629282307
> 1467577155515288101
> 1209530198401293895
>
> piotrek@piotrek-pc:~/Pulpit/cke-test$
>
> As you can see, running 10 different kernels takes roughly 10 times as
> long (2666 vs. 286), which clearly shows that the kernels are not running
> in parallel.
>
> All kernels run by my program consist of a single work-item, so each
> occupies only one SIMD array.
>
> Currently I've coded Bitonic Sort for blocks 4096 items wide. It is heavily
> limited by LDS bandwidth. I don't know exactly by how much, but with fast
> LDS my algorithm would probably perform four times faster. That bottleneck
> could be hidden if CKE were supported - while one (quarter-) wavefront
> waits for LDS, another could do an ALU heavy task (e.g. encoding) or Global
> Memory transfers.
>
> Maybe someone here has fresher info than I do and can tell me something
> about CKE?
>
> Anyway, I would be happy if someone played with my program and posted the
> results.
>
> My program requires Java. It has a better chance of running if your Java
> has the same bitness as your operating system, i.e. you should use 64-bit
> Java on a 64-bit OS, and of course it requires an OpenCL driver. I've
> developed this program on a computer with Catalyst 11.3 drivers and AMD APP
> SDK version 2.4.
>
> Additionally, there's a NetBeans project containing the sources:
> http://www49.zippyshare.com/v/67517804/file.html
>
> End of repost. Maybe my code is wrong, maybe something could be done
> without events (so it would then be OpenCL 1.0 compatible), etc. If you
> have suggestions about the code, please write them there.
>
> Or maybe if you have an idea where I could repost this so more experts
> would look at my problem, please post the addresses of such places :)
>
http://michael-bien.com/
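
The conclusion drawn from the quoted terminal log (1 kernel: 286, 10 kernels: 2666) can be checked with a little arithmetic. The following is only a minimal sketch, not part of the posted test program: the class name `CkeTimingCheck`, its method names, and the "more than half the kernel count" threshold are all assumptions made for illustration, and the log does not state the time unit.

```java
import java.util.Locale;

// Hypothetical helper, not taken from the posted cke-test sources.
public class CkeTimingCheck {

    /** Ratio of the multi-kernel run time to the single-kernel run time. */
    public static double slowdown(long singleKernelTime, long multiKernelTime) {
        if (singleKernelTime <= 0) {
            throw new IllegalArgumentException("singleKernelTime must be positive");
        }
        return (double) multiKernelTime / singleKernelTime;
    }

    /**
     * Assumed heuristic: if running N kernels costs more than N/2 times the
     * single-kernel time, treat the kernels as effectively serialized.
     * The N/2 cut-off is arbitrary and only serves this illustration.
     */
    public static boolean looksSerialized(long singleKernelTime,
                                          long multiKernelTime,
                                          int kernels) {
        return slowdown(singleKernelTime, multiKernelTime) > kernels / 2.0;
    }

    public static void main(String[] args) {
        long single = 286;   // "Total kernels execution time" with 1 kernel
        long multi = 2666;   // "Total kernels execution time" with 10 kernels
        int kernels = 10;

        System.out.println(String.format(Locale.ROOT,
                "slowdown: %.2fx for %d kernels",
                slowdown(single, multi), kernels));
        System.out.println("looks serialized: "
                + looksSerialized(single, multi, kernels));
    }
}
```

With the numbers from the log the ratio is about 9.3 for 10 kernels, which is consistent with the kernels running one after another; with real concurrent execution across the 10 compute units one would expect a ratio much closer to 1.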