Concurrent kernel execution in OpenCL implementations
— by Wibowit Wibowit
Hi, this is a repost from another forum (it would be wasteful to write the same thing twice):

I'm developing my own implementation of Bitonic Sort in OpenCL. I would like to have concurrent kernel execution (CKE), because that would let me fully utilize the hardware. If I have a few kernels, one LDS heavy, one Global Memory heavy, one ALU heavy, etc., then interleaving them would lead to much better utilization of resources than running them serially: if one is stalled on LDS access, another can do ALU work or Global Memory transfers.

I have done some research and it turned out that AMD hasn't yet implemented CKE in their APP SDK. On the other hand, nVidia does not officially support OpenCL 1.1 yet (OpenCL 1.1 brings Out-of-Order queues, events, some extensions, etc.). nVidia provides OpenCL 1.1 enabled SDKs if you register somewhere. I don't have a GeForce, so I haven't looked into that.

I've asked on the AMD Developer Forum whether and when CKE will be supported, but haven't received a satisfactory answer so far. My post is here: link

I have prepared a program that tests whether the OpenCL implementation you have installed supports CKE. It's here: link

It requires three parameters: <iterations> <multipleQueues> <threads>

iterations is a long value,
multipleQueues is a boolean value,
threads is an int value.

For example, the parameters could look like this: 1234567 false 10
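To make the parameter types concrete, here is a minimal sketch of how three such arguments could be parsed in Java. It is not taken from the actual program: the class name `CkeArgs` and its validation rules are hypothetical, and the field comments reflect my reading of the parameter names rather than anything confirmed by the sources.

```java
/** Hypothetical holder for the three command-line parameters (not the program's real code). */
final class CkeArgs {
    final long iterations;        // presumably the loop count executed by each kernel
    final boolean multipleQueues; // presumably: use a separate command queue per kernel
    final int threads;            // presumably the number of kernels to launch

    CkeArgs(long iterations, boolean multipleQueues, int threads) {
        this.iterations = iterations;
        this.multipleQueues = multipleQueues;
        this.threads = threads;
    }

    /** Parses "<iterations> <multipleQueues> <threads>" and rejects malformed input. */
    static CkeArgs parse(String[] args) {
        if (args.length != 3) {
            throw new IllegalArgumentException(
                "usage: <iterations> <multipleQueues> <threads>");
        }
        // Long.parseLong / Integer.parseInt throw NumberFormatException
        // (a subclass of IllegalArgumentException) on non-numeric input.
        long iterations = Long.parseLong(args[0]);
        // Boolean.parseBoolean yields true only for "true" (case-insensitive).
        boolean multipleQueues = Boolean.parseBoolean(args[1]);
        int threads = Integer.parseInt(args[2]);
        if (iterations < 0 || threads < 1) {
            throw new IllegalArgumentException(
                "iterations must be >= 0 and threads must be >= 1");
        }
        return new CkeArgs(iterations, multipleQueues, threads);
    }

    public static void main(String[] args) {
        CkeArgs parsed = parse(new String[] {"1234567", "false", "10"});
        System.out.println(parsed.iterations + " " + parsed.multipleQueues
            + " " + parsed.threads);
    }
}
```

Note that with `Boolean.parseBoolean` any value other than "true" silently becomes false, so a typo in the second parameter would not be reported.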

I have a Juniper XT (i.e. Radeon HD 5770), which has 10 compute units (10 SIMD arrays), so theoretically it should be able to run 10 different kernels at once, provided there is enough logic to manage them. Sadly, the current OpenCL implementation from AMD executes only one kernel at a time.

Here's my terminal log:
piotrek@piotrek-pc:~/Pulpit/cke-test$ ./lin64.sh 1234567 true 1
Total kernels execution time: 286
Computations results (should be identical for identical number of iterations): 
3531952812427241749

piotrek@piotrek-pc:~/Pulpit/cke-test$ ./lin64.sh 1234567 true 10
Total kernels execution time: 2666
Computations results (should be identical for identical number of iterations): 
3531952812427241749
3273905855313247543
3015858898199253337
2757811941085259131
2499764983971264925
2241718026857270719
1983671069743276513
1725624112629282307
1467577155515288101
1209530198401293895

piotrek@piotrek-pc:~/Pulpit/cke-test$ 
As you can see, running 10 different kernels takes about 9.3 times as long as running one (2666 vs 286), which clearly shows that the kernels are not running in parallel.
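The reasoning behind that conclusion can be written down as a small check. The sketch below is only an illustration, with hypothetical names (`CkeTiming`, `slowdown`, `looksSerialized`) and an arbitrary threshold of half the kernel count that I picked myself: if the kernels overlapped, the total time for N kernels should stay close to the time for one, and if they are serialized it should approach N times that.

```java
import java.util.Locale;

/** Hypothetical helper for interpreting the two timings from the log above. */
final class CkeTiming {
    /** How many times longer the multi-kernel run took than the single-kernel run. */
    static double slowdown(long singleKernelTime, long multiKernelTime) {
        return (double) multiKernelTime / singleKernelTime;
    }

    /**
     * Heuristic (an assumption, not a rule from any specification): call the run
     * serialized when the slowdown exceeds half the number of kernels launched.
     */
    static boolean looksSerialized(long singleKernelTime, long multiKernelTime,
                                   int kernels) {
        return slowdown(singleKernelTime, multiKernelTime) > kernels / 2.0;
    }

    public static void main(String[] args) {
        // Timings taken from the terminal log: 1 kernel -> 286, 10 kernels -> 2666.
        double ratio = slowdown(286, 2666);
        System.out.println(String.format(Locale.ROOT, "slowdown: %.2f", ratio));
        System.out.println("serialized: " + looksSerialized(286, 2666, 10));
    }
}
```

For the numbers in my log this gives a slowdown of about 9.32 for 10 kernels, so the check reports the run as serialized.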

All kernels run by my program consist of a single work-item, so each of them occupies only one SIMD array.

Currently I've coded Bitonic Sort for blocks 4096 items wide. It is heavily limited by LDS bandwidth; I don't know exactly by how much, but with a fast enough LDS my algorithm would probably run four times faster. That bottleneck could be hidden if CKE were supported: while one (quarter-) wavefront waits for LDS, another could do an ALU heavy task (e.g. encoding) or Global Memory transfers.

Maybe someone here has fresher info than I do and can tell me something about CKE?

Anyway, I would be happy if someone played with my program and posted the results.

My program requires Java and, of course, an OpenCL driver. It has a better chance of running if your Java has the same bitness as your operating system, i.e. you should use 64-bit Java on a 64-bit OS. I developed this program on a computer with Catalyst 11.3 drivers and AMD APP SDK version 2.4.

Additionally, there's a NetBeans project containing the sources: link

End of repost. Maybe my code is wrong, maybe something could be done without events (so that it would be OpenCL 1.0 compatible), etc. If you have suggestions about the code, please write them there.

Or, if you have an idea where I could repost this so that more experts would look at my problem, please post the addresses of such places :)