Re: Concurrent Kernels + Synchronization + JOCL
Posted by Wibowit on Apr 21, 2011; 1:01pm
URL: https://forum.jogamp.org/Concurrent-Kernels-Synchronization-JOCL-tp2843784p2846872.html
From what I know, CKE (concurrent kernel execution) is theoretically possible on AMD's hardware, as they keep copies of context registers or something like that - you can read about it on AMD's forums, just follow the link I provided. On the other hand, the Fermi whitepaper concludes that G80 and GT200 don't support CKE in hardware, but Fermi does. In fact Fermi is a very compute-oriented architecture - it has many features that help achieve high utilization of compute resources. But I don't know if they have implemented CKE in their OpenCL driver - I don't have a GeForce Fermi to test that. Probably not, as they are more focused on CUDA than on OpenCL. It seems that both nVidia and AMD have very immature OpenCL drivers. I'm not sure about the Intel and IBM ones.
You've asked how to exchange data between different kernels or kernel invocations. As I said, just use Global Memory - Global Memory is simply the off-die memory on the graphics card, e.g. a Radeon HD 5770 1 GiB has 1 GiB of Global Memory. As of now, though, AMD's OpenCL driver only allows allocating 512 MiB of memory on the GPU. I don't think you need specific examples - there are plenty of OpenCL kernels to be found on Google.
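If it helps, here is a rough JOCL sketch (my own illustration, not tested against every driver; the accessor names are from the JOCL high-level API as I remember it and the buffer size is just a placeholder) that queries the device limits and allocates a buffer in Global Memory:

import com.jogamp.opencl.CLBuffer;
import com.jogamp.opencl.CLContext;
import com.jogamp.opencl.CLDevice;
import com.jogamp.opencl.CLMemory.Mem;
import java.nio.FloatBuffer;

public class GlobalMemoryInfo {
    public static void main(String[] args) {
        CLContext context = CLContext.create();            // context over the default platform's devices
        try {
            CLDevice device = context.getMaxFlopsDevice(); // pick the "fastest" device
            // Global Memory = the off-die memory of the card; a single allocation is capped separately.
            System.out.println("global mem size:  " + device.getGlobalMemSize()    / (1024 * 1024) + " MiB");
            System.out.println("max single alloc: " + device.getMaxMemAllocSize()  / (1024 * 1024) + " MiB");

            // A plain device buffer like this lives in Global Memory (placeholder size: 64 MiB of floats).
            CLBuffer<FloatBuffer> data = context.createFloatBuffer(16 * 1024 * 1024, Mem.READ_WRITE);
            System.out.println("allocated " + data.getCLSize() / (1024 * 1024) + " MiB in Global Memory");
        } finally {
            context.release();                             // frees all buffers/queues created from this context
        }
    }
}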
You cannot reliably exchange data between two kernels running simultaneously. The only reliable way to exchange data is to wait for one kernel invocation to finish and then issue the next one. On the Java side, the task is: create the CLBuffers, issue one kernel invocation, wait for it to finish (if you're using an in-order queue, commands are executed in the order they were enqueued, so the waiting is automatic; with an out-of-order queue you must use events to build the dependency graph), and then issue a new kernel invocation that takes the output buffer of the previous invocation as its input buffer.
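On the Java side it could look roughly like this (a sketch only - the two kernels stageA/stageB and the buffer sizes are made up; the tmp buffer stays in Global Memory on the GPU and the in-order queue does the waiting for you):

import com.jogamp.opencl.*;
import com.jogamp.opencl.CLMemory.Mem;
import java.nio.FloatBuffer;

public class KernelChain {
    public static void main(String[] args) {
        CLContext context = CLContext.create();
        try {
            // default queue is in-order: commands complete in the order they were enqueued
            CLCommandQueue queue = context.getMaxFlopsDevice().createCommandQueue();

            // hypothetical program with two kernels "stageA" and "stageB"
            CLProgram program = context.createProgram(
                  "kernel void stageA(global const float* in,  global float* tmp) {"
                + "    int i = get_global_id(0); tmp[i] = in[i] * 2.0f; }"
                + "kernel void stageB(global const float* tmp, global float* out) {"
                + "    int i = get_global_id(0); out[i] = tmp[i] + 1.0f; }").build();

            int n = 1 << 20;
            CLBuffer<FloatBuffer> in  = context.createFloatBuffer(n, Mem.READ_ONLY);
            CLBuffer<FloatBuffer> tmp = context.createFloatBuffer(n, Mem.READ_WRITE); // never leaves the GPU
            CLBuffer<FloatBuffer> out = context.createFloatBuffer(n, Mem.WRITE_ONLY);
            for (int i = 0; i < n; i++) in.getBuffer().put(i, i);

            CLKernel stageA = program.createCLKernel("stageA").putArgs(in, tmp);
            CLKernel stageB = program.createCLKernel("stageB").putArgs(tmp, out); // reuses tmp, no host copy

            queue.putWriteBuffer(in, false)                 // host -> GPU once
                 .put1DRangeKernel(stageA, 0, n, 0)         // first invocation writes tmp
                 .put1DRangeKernel(stageB, 0, n, 0)         // in-order queue: stageB sees the finished tmp
                 .putReadBuffer(out, true);                 // GPU -> host once, blocking

            System.out.println("out[42] = " + out.getBuffer().get(42));
        } finally {
            context.release();
        }
    }
}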
Pushing data through PCI-Express is rather slow (5 GiB/s on my system versus 70 GiB/s of memory bandwidth on the GPU), so it's better to avoid it and keep the data on the card between kernel invocations.
If your kernels have different bottlenecks and are short, i.e. the time to complete one kernel invocation is low (by kernel invocation I mean one enqueueNDRangeKernel), then you can combine them into one uber-kernel (a similar idea to an uber-shader - google it): partition the work-items into wavefronts (AMD) or warps (nVidia), merge several kernels into one, and add dispatch code that selects a different kernel body depending on the wavefront/warp number. This way you execute many different computations at once and still get high performance, because each SIMD unit executes only one branch. Of course the kernels combined into an uber-kernel must be independent of each other.
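A rough sketch of such a dispatch (the two sub-computations and the work split are made up, and the work-group size of 64 is the AMD wavefront size - use 32 for nVidia warps; the kernel source is embedded as a JOCL program string):

import com.jogamp.opencl.*;
import com.jogamp.opencl.CLMemory.Mem;
import java.nio.FloatBuffer;

public class UberKernelSketch {

    // Hypothetical uber-kernel: two independent sub-computations, selected by work-group id.
    // With the work-group size equal to the wavefront/warp size, each SIMD follows only one branch.
    static final String SRC =
          "kernel void uber(global float* a, global float* b, const int groupsForA) {\n"
        + "    int grp = get_group_id(0);\n"
        + "    int lid = get_local_id(0);\n"
        + "    if (grp < groupsForA) {\n"
        + "        int i = grp * get_local_size(0) + lid;                 // 'sub-kernel' A\n"
        + "        a[i] = a[i] * 2.0f;\n"
        + "    } else {\n"
        + "        int i = (grp - groupsForA) * get_local_size(0) + lid;  // 'sub-kernel' B\n"
        + "        b[i] = b[i] + 1.0f;\n"
        + "    }\n"
        + "}\n";

    public static void main(String[] args) {
        final int wavefront = 64;               // placeholder: 64 on AMD, 32 (warp) on nVidia
        final int groupsA = 512, groupsB = 512; // how many work-groups each sub-kernel gets
        CLContext context = CLContext.create();
        try {
            CLCommandQueue queue = context.getMaxFlopsDevice().createCommandQueue();
            CLBuffer<FloatBuffer> a = context.createFloatBuffer(groupsA * wavefront, Mem.READ_WRITE);
            CLBuffer<FloatBuffer> b = context.createFloatBuffer(groupsB * wavefront, Mem.READ_WRITE);
            CLKernel uber = context.createProgram(SRC).build().createCLKernel("uber")
                                   .putArgs(a, b).putArg(groupsA);
            // one enqueue executes both sub-computations at once on the device
            queue.putWriteBuffer(a, false).putWriteBuffer(b, false)
                 .put1DRangeKernel(uber, 0, (groupsA + groupsB) * wavefront, wavefront)
                 .putReadBuffer(a, false).putReadBuffer(b, true);
        } finally {
            context.release();
        }
    }
}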
I'm planning to use such an uber-kernel scheme soon in my OpenCL BWT-based data compressor.