First of all I apologize for my ignorance. I am starting with OpenCL and JOCL and I am still trying to come to terms with a few basic concepts.
Here we go:
1) Can I load more than one kernel at a time onto a given device, assigning different work-groups (or queues) to each kernel so that they execute concurrently, or do I need one device per kernel?
2) If so, can I exchange data between workgroups running different kernels and somehow synchronize them?
Just to give a bit of background: I would like to use JOCL + OpenCL for a multiscale simulation (different kernels required), and I am trying to understand what my options are.
Could I do this entirely by concurrently loading kernels onto one or more devices, or is higher-level code necessary to orchestrate everything, pull results out of the different queues, and synchronize the different simulations (some parts may be biology, others physics, with different time scales, etc.) as needed?
I would like to minimize I/O operations to and from the devices, as I understand that is where the bottleneck lies.
Any JOCL examples related to points 1) and/or 2) appreciated!
Any pointer that could help me understand more about this - concurrency/synchronization and if it is possible to orchestrate this stuff from JOCL - would also be highly appreciated.
Basically, registers and LDS (Local Data Share) have the same lifetime as the kernel that uses them, and they formally cannot be shared between work-groups (registers are even private to a work-item). The only way to exchange data is through global memory on the video card. But since there is no portable way to synchronize different work-groups, you have no guarantee about the order of computations, so you are left with a scheme where one kernel invocation waits for the completion of another.
You can use out-of-order queues and/or events. Events serve to impose a partial order on kernel invocations; they are also used for retrieving profiling data.
Thanks, I was having a look at your post - in a sense you're trying to do something quite similar to what I am describing.
Since CKE (concurrent kernel execution) is not fully supported yet, it sounds like in order to leverage concurrency I will probably have to assign different kernels to different devices (from JOCL, with a different queue for each device) and then pull the results out of the queues and do my synchronization business in higher-level code (Java).
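To make the one-queue-per-device idea concrete, here is a hedged JOCL sketch. The device enumeration calls are the standard JOCL API; the per-device kernel and buffer setup is elided, and the overall structure is just one assumption about how the orchestration could be arranged from Java:

```java
import org.jocl.*;
import static org.jocl.CL.*;

public class MultiDeviceSketch
{
    // Sketch only: creates one context + one command queue per GPU device,
    // so different kernels can run on different devices concurrently.
    static void runOnEachDevice(cl_platform_id platform)
    {
        int[] numDevices = new int[1];
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, null, numDevices);
        cl_device_id[] devices = new cl_device_id[numDevices[0]];
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, devices.length, devices, null);

        for (cl_device_id device : devices)
        {
            cl_context context = clCreateContext(
                null, 1, new cl_device_id[]{ device }, null, null, null);
            cl_command_queue queue = clCreateCommandQueue(context, device, 0, null);

            // ... build a different program/kernel per device and enqueue work here ...

            clFinish(queue); // synchronize this device's work from the Java side
            clReleaseCommandQueue(queue);
            clReleaseContext(context);
        }
    }
}
```

The `clFinish` call is the simplest Java-side synchronization point; results would then be read back and combined in ordinary Java code.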
"The only way you can exchange data is by Global Memory on video card." <-- any pointers and or examples (with jocl) on using this kind of shared memory?
Currently, from what I know, CKE is theoretically possible on AMD's hardware, as they keep copies of context registers or something - you can read about it in AMD's forums; just follow the link I provided. On the other hand, the Fermi whitepaper concludes that G80 and GT200 don't support CKE in hardware, but Fermi does. In fact Fermi is a very compute-oriented architecture - it has many features that help achieve high utilization of compute resources. But I don't know whether they implemented CKE in their OpenCL driver - I don't have a GeForce Fermi to test that. Probably not, as they are more focused on CUDA than on OpenCL. It seems that both nVidia and AMD have very immature OpenCL drivers. Not sure about the Intel and IBM ones.
You've asked how to exchange data between different kernels or kernel invocations. I've said: just use global memory - global memory is simply the off-die memory on the graphics card, e.g. on a Radeon HD 5770 1 GiB there is 1 GiB of global memory. As of now, though, AMD's OpenCL driver only allows allocating 512 MiB of memory on the GPU. I don't think you need specific examples; there are many OpenCL kernels to be found via Google.
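For what it's worth, a minimal JOCL sketch of sharing a global-memory buffer between two kernel invocations might look like this. The `producer`/`consumer` kernels and the argument indices are illustrative; the usual context/program boilerplate is assumed to exist:

```java
import org.jocl.*;
import static org.jocl.CL.*;

public class GlobalMemorySketch
{
    // Sketch: a cl_mem object lives in the device's global memory, so two
    // kernel invocations can exchange data through it without touching the host.
    static void shareViaGlobalMemory(cl_context context, cl_command_queue queue,
                                     cl_kernel producer, cl_kernel consumer, int n)
    {
        cl_mem shared = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                       (long) Sizeof.cl_float * n, null, null);

        // Both kernels receive the same buffer (argument index 0 is illustrative)
        clSetKernelArg(producer, 0, Sizeof.cl_mem, Pointer.to(shared));
        clSetKernelArg(consumer, 0, Sizeof.cl_mem, Pointer.to(shared));

        long[] globalWorkSize = { n };
        // On an in-order queue the consumer is guaranteed to run after the
        // producer, so it sees the producer's writes to global memory.
        clEnqueueNDRangeKernel(queue, producer, 1, null, globalWorkSize, null, 0, null, null);
        clEnqueueNDRangeKernel(queue, consumer, 1, null, globalWorkSize, null, 0, null, null);
        clFinish(queue);

        clReleaseMemObject(shared);
    }
}
```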
You cannot reliably exchange data between two kernels running simultaneously. The only way to reliably exchange data is to wait for one kernel invocation to finish and then issue another. From the Java side, your task is first to define CLBuffers, issue one kernel invocation, wait for it to finish (if you're using an in-order queue it's guaranteed that kernels execute in the order they were issued, so the waiting is automatic; with an out-of-order queue you must use events to build a dependency graph), and then issue a new kernel invocation that accepts as input the output buffer from the previous invocation.
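A hedged sketch of that chaining pattern with events on an out-of-order queue (kernel names and argument indices are illustrative, and the usual JOCL setup is assumed):

```java
import org.jocl.*;
import static org.jocl.CL.*;

public class KernelChainSketch
{
    // Sketch: 'second' consumes the buffer that 'first' produced; the event
    // wait list enforces the ordering even on an out-of-order queue.
    static void chainKernels(cl_command_queue queue, cl_kernel first, cl_kernel second,
                             cl_mem intermediate, long n)
    {
        // 'first' writes its result into 'intermediate' (indices are illustrative)
        clSetKernelArg(first, 1, Sizeof.cl_mem, Pointer.to(intermediate));
        // 'second' reads 'intermediate' as its input
        clSetKernelArg(second, 0, Sizeof.cl_mem, Pointer.to(intermediate));

        long[] globalWorkSize = { n };
        cl_event firstDone = new cl_event();
        clEnqueueNDRangeKernel(queue, first, 1, null, globalWorkSize, null,
                               0, null, firstDone);
        // The wait list makes 'second' depend on 'first' having completed
        clEnqueueNDRangeKernel(queue, second, 1, null, globalWorkSize, null,
                               1, new cl_event[]{ firstDone }, null);
        clReleaseEvent(firstDone);
    }
}
```

On an in-order queue the wait list can simply be dropped, since ordering is implicit.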
Pushing data through PCI Express is rather slow (5 GiB/s on my system versus 70 GiB/s of memory bandwidth on the GPU), so it's better to avoid it.
If your kernels have different bottlenecks and are short, i.e. the time to complete one kernel invocation is low (by kernel invocation I mean one enqueueNDRangeKernel), then you can combine them into one uber-kernel (similar to an uber-shader - google it): partition the work-items into wavefronts (AMD) or warps (nVidia), merge several kernels into one, and write dispatch code that selects a different kernel depending on the wavefront/warp number. This way you execute many different computations at once and keep performance high, since each SIMD unit executes a single branch. Of course the kernels combined into an uber-kernel must be independent.
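As a rough illustration of the dispatch idea, here is a toy uber-kernel embedded as a Java string the way JOCL program sources usually are. Branching on `get_group_id` is used as a portable stand-in for the wavefront/warp number, since a work-group maps onto whole wavefronts; the two sub-kernel bodies are purely made up:

```java
// Hypothetical uber-kernel source: two independent sub-kernels selected
// per work-group, so each SIMD unit executes only one of the branches.
private static final String UBER_KERNEL_SRC =
    "__kernel void uber(__global float *a, __global float *b)\n" +
    "{\n" +
    "    int gid = get_global_id(0);\n" +
    "    if (get_group_id(0) % 2 == 0)\n" +
    "    {\n" +
    "        a[gid] = a[gid] * 2.0f;   // sub-kernel 1 (e.g. the 'physics' part)\n" +
    "    }\n" +
    "    else\n" +
    "    {\n" +
    "        b[gid] = b[gid] + 1.0f;   // sub-kernel 2 (e.g. the 'biology' part)\n" +
    "    }\n" +
    "}\n";
```

The string would then be built with clCreateProgramWithSource/clBuildProgram as usual.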
I'm planning to use such uber-kernel scheme soon for my OpenCL BWT-based data compressor.