Posted by Michael Bien on May 22, 2010; 9:48pm
URL: https://forum.jogamp.org/Looking-over-JOCL-tp835533p837034.html
On 05/22/2010 08:54 PM, jcpalmer [via jogamp] wrote:
Michael,
Couple of refinements based on your reply:
> Just look at an OpenCL example and count the function calls for a
> simple thing like sorting a buffer. It's like glBegin/glEnd 18 years ago.
Let me refine my overhead tolerances. Calls that are outside what I
call the "Kernel Loop" should be written for ease of use, maintenance,
and upgrading to any newer version of the spec. Enqueuing and arg
setting are show time.
I think JavaCL has a little unnecessary Java overhead enqueuing
kernels. Nothing a code optimizer like ProGuard cannot get rid of,
though, for production.
sure. What I actually tried to say was that OpenCL will not stay as it
is. Just wait until the first wave of extensions is in core. We will
subdivide devices at runtime, and more... Concurrent kernel execution
is very young, technically available for only a few weeks on high-end
hardware.
setArgs + enqueueKernel + waitForEvents are like an async function
call, which should be as fast as possible. No unnecessary overhead in
the binding code. That's the goal.
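As a concrete illustration (my sketch, not JOCL documentation; the
kernel, queue, buffers and work sizes are assumed to be created once
during setup), the per-iteration hot path stays close to the raw CL
calls:

    // Hot path only: arg setting + enqueue + sync, no per-iteration object churn.
    // Assumes one-time setup elsewhere: CLCommandQueue queue, CLKernel kernel,
    // CLBuffer<FloatBuffer> input/output, globalWorkSize, localWorkSize.
    for (int pass = 0; pass < passes; pass++) {
        kernel.setArg(0, input)   // forwards more or less directly to clSetKernelArg
              .setArg(1, pass);
        queue.put1DRangeKernel(kernel, 0, globalWorkSize, localWorkSize)
             .putReadBuffer(output, true); // blocking read doubles as the sync point
    }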
Again, I don't make any assumptions about how the binding will be used
by client code, or where the bottleneck may be today or in the future,
since it's not possible to look into the future.
Since there is, as of today, no faster way to call a function from Java
than through a thin JNI layer... we are using a thin JNI layer :). For
consistency we use it for all projects: JOGL, JOCL, JOAL and OpenMAX.
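To illustrate what "thin" means here (illustrative only; JOCL's real
low-level binding is generated by GlueGen, and the names below are
hypothetical): each CL function maps 1:1 onto a static native method,
with no logic in between:

    // Hypothetical 1:1 mapping of one CL function onto a native method.
    // The JNI side does nothing but forward the call to clSetKernelArg.
    public final class CL {
        static { System.loadLibrary("jocl"); } // hypothetical library name
        // int clSetKernelArg(cl_kernel kernel, cl_uint arg_index,
        //                    size_t arg_size, const void *arg_value)
        public static native int clSetKernelArg(long kernel, int index,
                                                long size, java.nio.Buffer value);
    }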
> (no constructors ...)
Same for JavaCL. I understand why. People like me just have to run from
the source code instead of overriding.
>> JavaCL has methods to wait for events both in its CLEvent & CLQueue.
> I thought about that, but it was too dangerous from a concurrency
> perspective. CLCommandQueues are not thread-safe in JOCL's concurrency
> model. This is by design, since I expect that in most situations they
> will only be used from one producer thread.
> JOCL forces you to use a queue to do any CLEvent work, which forces you
> to think about it... But maybe I will weaken this in the future. That
> was the idea behind it.
FYI, research around events combined with release during garbage
collection is currently my highest priority. I had a little accident
converting my code base to look like JavaCL: I called clWaitForEvents,
but did not call clReleaseEvent. I got a 60% time reduction. It also
happened with enqueueWaitForEvents. If this is for real, I could just
let the garbage collector call it in its own thread (I am running the
concurrent one). This was on Win7.
hehe, so you are relying on finalizers :)
please don't do that... finalizers are like Thread.stop()... even worse
-> completely unspecified and therefore implementation dependent.
The only reason they are still available is backwards compatibility
with Java 1.1.
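For comparison, the deterministic alternative in JOCL-style code (a
sketch; queue and kernel setup are assumed, and CLEventList is JOCL's
container for CL events): release the events explicitly as soon as the
wait returns, instead of leaving it to the GC:

    // Explicit, deterministic release instead of relying on finalizers.
    CLEventList events = new CLEventList(1);
    queue.put1DRangeKernel(kernel, 0, globalWorkSize, localWorkSize, events);
    queue.putWaitForEvents(events, true); // blocks until the kernel has finished
    events.release(); // clReleaseEvent happens now, not at some unspecified GC time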
Before I got too excited, I wanted to see if this was also the case on
Linux (thanks for the help). Now that Linux is up, I can get to the
bottom of this. The few NVIDIA samples that use events do not even
bother to release them.
no problem. I was wondering why nobody else noticed this. We have been
using LD_PRELOAD in production since around December.
> one context per device?
> I tried to prevent that... It's basically a hack, since OpenCL 1.0
> implementations are not ready at this point. N queues per device is
> much cleaner.
> Just think about memory sharing...
The thing that was the decider against the multi-device context was
kernel arg setting. You cannot specify a command queue when setting
them. The upshot is that you need a separate kernel for each device,
whether they are all in one context or each in their own. I want each
context to have its own, unshared host memory so it can process
asynchronously, with Java's ThreadPoolExecutor running the show and
each context completely unaware that multiple devices exist. If the
arg setting thing gets fixed, I'll consider pooling command queues
instead of contexts. See NVIDIA's simpleMultiGPU sample.
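A sketch of that layout under the stated assumptions (JOCL-style names;
program building and work decomposition are left as placeholders): one
unshared context per device, each driven by exactly one pool thread:

    // Assumes: import com.jogamp.opencl.*; import java.util.concurrent.*;
    // One context per device, each touched by exactly one executor thread.
    CLDevice[] devices = CLPlatform.getDefault().listCLDevices();
    ExecutorService pool = Executors.newFixedThreadPool(devices.length);
    for (final CLDevice device : devices) {
        pool.submit(new Runnable() {
            public void run() {
                CLContext context = CLContext.create(device); // unshared context
                try {
                    CLCommandQueue queue = device.createCommandQueue();
                    // ... build program, create kernel, allocate this context's
                    // own host buffers, enqueue work; no other device is visible ...
                } finally {
                    context.release(); // releases queue/program/buffers with it
                }
            }
        });
    }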
right. I'll wait until the API and implementations stabilize before I
go down this road. I don't want to add public APIs which deal with
multiple contexts now just to work around implementation issues.
For production systems I usually go this road, trying to solve the
issue by using:
1.) multiple queues + multiple kernel instances (sketched below)
2.) multiple program instances
3.) multiple contexts (luckily it never went this far)
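For reference, a sketch of 1.) under assumed JOCL names: one shared
context, but each producer thread gets its own queue and its own kernel
instance, so no kernel arg state is ever shared:

    // Step 1: one shared context, one (queue, kernel) pair per producer thread.
    CLContext context = CLContext.create();
    CLProgram program = context.createProgram(source).build(); // 'source' assumed
    for (CLDevice device : context.getDevices()) {
        CLCommandQueue queue = device.createCommandQueue();
        CLKernel kernel = program.createCLKernel("process"); // hypothetical name
        // hand this (queue, kernel) pair to exactly one producer thread
    }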
thanks again,
great discussion
-michael
Jeff