Looking over JOCL

Looking over JOCL

jcpalmer
Michael,
I wanted to check out what you have done for JOCL.  The whole JogAmp site looks very professionally done (get rid of the under-construction thing).  Integrating everything but the actual media/repositories into one site works well.  The similar color schemes between Nabble & GitHub were fortuitous; for a minute I did not even realize I had left the site.  I have a mouse button for back & one on the browser, so who needs one on the page.

With 2 young guys working on different implementations of something I need, I cannot afford to skip due diligence in case one or the other loses interest (usually due to career or women).  The metric I am using to evaluate JOCL IS NOT based on call overhead, since this is not like OpenGL where you might need to do 1000 calls to produce 1 frame.  The metric is how fast I could switch between them.

I do not actually need a distribution to do an evaluation, having written my own OO wrappers on top of OpenCL4Java and almost done a conversion to JavaCL. Looking at the latest source, this looks like it handles all my needs as-is.

The major difference of JOCL is the same difference between my home-grown wrappers and JavaCL: how big a role is played by the "command queue" class.  You and I gave it a big role, with its methods being passed memory & kernel objects.  JavaCL gave its memory & kernel objects those methods, with the command queue being passed.  This is pretty much just different sides of the same coin.  JavaCL has methods to wait for events in both its CLEvent & CLQueue.

One difference I have from both is that mine is multi-level.  I had the general-use classes with a sub-set of only the methods I use, and a higher level that integrated the lower level with java.util.concurrent.ThreadPoolExecutor.  The application only calls the higher level (just 3 classes:  ContextPool, Context, & OpenCLCapabilities, which actually instantiates the single-device Context[] for ContextPool).  The higher level allows for multiple GPUs.  You state how many kernel & mem objects you are going to need in the ContextPool constructor, and only reference them by index in methods, to keep the application from touching the lower level. This higher level would be tough to make general-use without having a wrapper for every method in the lower level.  I just do the ones I need.

The reason I mentioned the high level is that it is key to switching between the two.  Only 3 classes need to change.  In fact, I did a conversion to JavaCL in about 4 days. I still felt insecure not having source I could modify easily, so I took another day and re-worked my lower-level wrappers to emulate JavaCL.  I can go back and forth just by switching 3 import statements & a jar.  I could probably emulate a big part of JOCL at the same time, made easier by the class names for the "command queue" being different.

I did see that there are no finalizers in classes ensuring OpenCL resources are cleaned up, should the developer fail to do so. This might be by design, because garbage collection is only concerned with its own memory.  However, OpenCL release calls return an error if they are attempted more than once (actually, after the ref count == 0). CLProgram does have a member that indicates it has been released, but it is not used to stop release from being attempted more than once.  It kind of confuses me that CLProgram.release calls the release of the CLKernels that are based on it.  In truth, I do not think it was even worth the effort of having both Program & Kernel in the spec.

I recommend eliminating the tiny interface CLResource, unless there is a class that implements CLResource that is NOT a subclass of CLObject.  Add to CLObject:
    protected volatile boolean releasedAlready; // keep release from getting exception if attempted more than once
    @Override
    public void finalize(){ release(); }
    public void release(){}  // placeholder for those CLObjects which do not require releasing, e.g. CLDevice
   
In the case of CLKernel, have:
    @Override
    public void release(){
        if (!releasedAlready){
            releasedAlready = true;            // set the guard first, so a second call becomes a no-op
            int ret = cl.clReleaseKernel(ID);
            program.onKernelReleased(this);
            if(ret != CL.CL_SUCCESS) {
                throw newException(ret, "cannot release "+this);
            }
        }
    }

You could even say CLObject implements CLResource, if you wanted to keep it.  That is really the only thing that jumps out.  Args could have JavaDoc entries, but having them is not going to allow someone to use this without reading the OpenCL spec.  Good job!

Jeff

Re: Looking over JOCL

Michael Bien
Hi Jeff,

On 05/22/2010 12:52 AM, jcpalmer [via jogamp] wrote:
> Michael,
> I wanted to checkout what you have done for JOCL.  The whole JogAmp
> site looks very professionally done (get rid of the under construction
> thing).  Integrating everything but the actual media/repositories into
> one site works well.  Similar color schemes between Nabble & github
> was fortuitous.  I did not even realize for a minute I left the site.
>  I have a mouse button for back & one on the browser, so who needs one
> on the page.
Thank you. It was a long fight to create a site which meets all the non-professional expectations of the whole JogAmp team (consisting exclusively of hackers) :) Now we have a very good designer on the team who is currently working on demos, logos, etc.

Thanks for the tips, will fix it soon.

>
> With 2 young guys working on different implementations of something I
> need, I cannot afford to skip due diligence in case one or the other
> loses interest(usually due to career or women).  The metric I am using
> to evaluate JOCL IS NOT based on call overhead, since this is not like
> OpenGL where you might need to do 1000 calls to produce 1 frame.  The
> metric is how fast could I switch between them.
A young API (OpenCL) should not make those assumptions about its userbase :) (especially if it doesn't have to).

18 years ago everyone was happy with glBegin/End (+ 50 other functions) and a handful of triangles on the screen. After people realised this does not scale, they switched to "server-side objects + asynchronous RPC" to reduce call overhead. OpenCL has had this paradigm from day 0 in its APIs, but in a few years we will have the same bottleneck at the same point again... history repeats itself.

Just look at an OpenCL example and count the function calls for a simple thing like sorting a buffer. It's like glBegin/End 18 years ago.

But actually it doesn't matter. GlueGen is not forced to emit JNI code. We could probably emit direct JNA code just by removing 90% of the GlueGen JNI-emitter code today, but we have no reason to do this.

You never know how people will use your API...

>
> I do not actually need a distribution to do an evaluation, after
> writing my own oo wrappers on top of OpenCL4Java, and almost doing a
> conversion to JavaCL. Looking at the latest source, this looks like it
> handles all my needs as is.
>
> The major difference of JOCL is the same difference between my home
> grown wrappers and JavaCL.  That is how big a role is played by the
> "command queue" class.  You and I gave it a big role with it's methods
> being passed memory & kernel objects.  JavaCL gave it's memory &
> kernel objects those methods with the command queue being passed.
>  This is pretty much just different sides of the same coin.
- - -
Just a few words before we start: nativelibs4java's and JogAmp's licenses are not compatible. That's why I never looked at JavaCL's codebase and even declined Olivier's invitation to subscribe to his mailing lists. I take this topic seriously. It's a one-way road: LGPL projects like nativelibs4java can use our BSD code, but we cannot look at LGPL code :)
- - -

I did it this way for consistency reasons. The queue is the main communication facility in the CL runtime model, so you and I gave it a central role :).

I basically had the following goals when designing the API:
- reduce method argument count to a minimum without losing flexibility
- map CL objects to Java objects (this one was easy, OpenCL already simulates objects)
- make it self-explanatory
- be as fast as possible (the high-level bindings are even faster than the LLB in some concurrent scenarios; thread locality etc.)
- prevent, by design, all the common mistakes I made while learning OpenCL in C :)

You can see some of these points by just looking at the samples or the JUnit tests (no constructors, builder pattern, fluent interface, utility APIs ...).

>  JavaCL has methods to wait for events both in it's CLEvent & CLQueue.
I thought about that, but it was too dangerous from a concurrency perspective. CLCommandQueues are not thread-safe in JOCL's concurrency model. This is by design, since I expect that in most situations they will only be used from one producer thread.

JOCL forces you to use a queue to do any CLEvent work, which forces you to think about it... Maybe I will weaken this in the future, but that was the idea behind it.


>
> One difference I have between both is that my am multi-level.  I had
> the general use classes with a sub-set of only the methods I use, and
> a higher level that integrated the lower level with
> java.util.concurrent.ThreadPoolExecutor.  The application only calls
> the higher level (just 3 classes:  ContextPool, Context, &
> OpenCLCapabilities which actually instances the single device
> Context[] for ContextPool).  The higher level allows for multi-GPU's.
>  You state how many kernel & mem objects you are going to need in the
> ContextPool constructor, and only reference them by index in methods,
> to avoid touching of the lower level by the application. This higher
> level would be tough to make general use without having a wrapper for
> every method in the lower level.  I just do the ones I need.
Multi-level..
You can mix the HLB with the LLB if you like:
CL cl = anyHLBObject.getContext().getLowLevelInterface();
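For illustration, a minimal sketch of what that mixing could look like (class names from the HLB; the exact import of the low-level CL interface is an assumption):

    import com.jogamp.opencl.CL;        // assumption: package/location of the low-level interface
    import com.jogamp.opencl.CLContext;
    import com.jogamp.opencl.CLDevice;

    public class MixedLevels {
        public static void main(String[] args) {
            CLContext context = CLContext.create();        // HLB: context over the default platform's devices
            try {
                CLDevice device = context.getMaxFlopsDevice();
                System.out.println("HLB sees: " + device);

                // drop down to the LLB for anything the HLB does not cover yet
                CL cl = context.getLowLevelInterface();     // accessor as quoted in the line above
                // ... raw clXxx() calls via 'cl' go here ...
            } finally {
                context.release();                          // frees everything created from the context
            }
        }
    }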

One context per device?
I tried to prevent that... It's basically a hack, since OpenCL 1.0 implementations are not ready at this point. N queues per device is much cleaner.
Just think about memory sharing...
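A rough sketch of that shared-context shape, just to make the memory-sharing point visible (method names as in the samples; treat the exact signatures as approximations):

    import com.jogamp.opencl.CLBuffer;
    import com.jogamp.opencl.CLCommandQueue;
    import com.jogamp.opencl.CLContext;
    import com.jogamp.opencl.CLDevice;
    import java.nio.FloatBuffer;

    public class SharedContext {
        public static void main(String[] args) {
            // one context spanning all devices: buffers are allocated once and visible to every queue
            CLContext context = CLContext.create();
            try {
                CLBuffer<FloatBuffer> data = context.createFloatBuffer(1 << 20);

                // N queues per device instead of N contexts; no host-side copies needed to "share" data
                for (CLDevice device : context.getDevices()) {
                    CLCommandQueue queue = device.createCommandQueue();
                    // queue.putWriteBuffer(data, false), put1DRangeKernel(...), etc. would go here
                }
            } finally {
                context.release();
            }
        }
    }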


>
> The reason I mentioned the high level is it is key to switching
> between the 2.  Only 3 classes need to change.  In fact, I did a
> conversion to JavaCL in about 4 days. I still felt insecure not having
> source I could modify easily, so I took another day and re-worked my
> lower level wrappers to emulate JavaCL.  I can go back and forth just
> switching 3 import statements & a jar.  I could probably emulate a big
> part of JOCL at the same time, made easier with the class names for
> the "command queue" being different.
Again, I can't help here since I don't look at JavaCL, sorry. Maybe you could build an abstraction on top of both APIs, or use Jackpot like we did to provide offline refactoring between incompatible APIs (JOGL1 -> JOGL2 in our case):
http://github.com/mbien/jogl/tree/master/tools/jackpotc

I know people who can switch between LWJGL and JOGL, which are incompatible by design too. Obviously there is always the common-denominator issue..

>
> I did see that there are no finalizers in classes ensuring OpenCL
> resources are cleaned up, should the developer fail to do so. This
> might be by design, because Garbage collection is only concerned about
> it's own memory.  However, OpenCL release calls return an error if
> they are attempted more than once (actual after the ref count == 0).
> CLProgram does have a member that indicates it has been released, but
> it not used to stop from attempting release more than once.  It kind
> of confuses me that CLProgram.release calls the release of CLKernels
> that are based on it.  In truth, I do not think it was even worth the
> effort of having both Program & Kernel in the spec.
I already started this work but forgot about it later. It's an exception which is thrown when release() is called too often, so it's OK for now, since it's probably a bug in the client application.

Regarding finalizers:
Yes, it's wrong to assume that native resources are in any relation to the Java heap. That's why there are no finalizers or similar hacks in JOCL. They also have implications on GC behaviour, especially with concurrent GCs. Finalizers are deprecated from a JVM perspective, just nobody cared to put it into the spec :)
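So the intended pattern is plain explicit release by the client code; a minimal sketch:

    import com.jogamp.opencl.CLContext;

    public class ExplicitRelease {
        public static void main(String[] args) {
            CLContext context = CLContext.create();
            try {
                // ... create queues, programs, kernels, buffers and do the work here ...
            } finally {
                context.release();   // deterministic cleanup, independent of GC timing and finalizer semantics
            }
        }
    }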


>
> Reccommend to eliminate the tiny interface CLResource, unless there is
> a class that implements CLResource that is NOT a subclass of CLObject.
>  Add to CLObject:
>     protected volatile boolean releasedAlready; // keep release from
> getting exception if attempted more than once
>     @Override
>     public void finalize(){ release(); }
>     public void release(){}  // placeholder for those CLObjects which
> do not require releasing, e.g. CLDevice
>
> In the case of CLKernel have:
>     @Override
>     public void release(){
>         if (!releasedAlready){
>             int ret = cl.clReleaseKernel(ID);
>             program.onKernelReleased(this);
>             if(ret != CL.CL_SUCCESS) {
>                 throw newException(ret, "can not release "+this);
>             }
>         }
>     }
You won't believe it, but that's intended too :)
Imagine an application has a fancy sorting algorithm which, for performance reasons, has native resources associated with it.
This fancy sorting object could simply implement CLResource to indicate it should be released by the client API... (@see jocl-demos)

But.. I will think about that. This part of the API isn't final yet. You will even notice that I also implement JDK7's Disposeable to be forward compatible with ARM... so there are other non-obvious reasons why it is an interface.
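A minimal sketch of that use case, assuming CLResource declares just a release() method and lives in the main package (the sorter internals are hypothetical):

    import com.jogamp.opencl.CLBuffer;
    import com.jogamp.opencl.CLContext;
    import com.jogamp.opencl.CLKernel;
    import com.jogamp.opencl.CLResource;
    import java.nio.FloatBuffer;

    // Application object that is NOT a CLObject subclass but still owns native CL resources.
    // Implementing CLResource lets client code release it through the same mechanism as the binding's own objects.
    public class FancySorter implements CLResource {

        private final CLKernel bitonicKernel;            // hypothetical cached kernel
        private final CLBuffer<FloatBuffer> scratch;     // hypothetical scratch buffer

        public FancySorter(CLContext context, CLKernel kernel) {
            this.bitonicKernel = kernel;
            this.scratch = context.createFloatBuffer(1 << 16);
        }

        @Override
        public void release() {                          // called once by the client (or a resource registry)
            bitonicKernel.release();
            scratch.release();
        }
    }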

"releasedAlready"
.. will do (probably). Its not that easy since there is LLB too...

We will also provide various ways to find those kinds of bugs (and performance bottlenecks) via GlueGen's composable pipeline mechanism... more on that point in a few weeks.
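To give an idea of the concept, here is a hand-rolled approximation using a dynamic proxy around the LLB interface (the generated pipelines will look different; this assumes CL is exposed as a plain Java interface):

    import com.jogamp.opencl.CL;     // assumption: the low-level binding is an interface named CL
    import java.lang.reflect.InvocationHandler;
    import java.lang.reflect.Method;
    import java.lang.reflect.Proxy;

    public class TraceCL {
        // Wraps a CL instance so every call is logged before being delegated - the same idea
        // as a composable trace/debug pipeline, only done by hand via java.lang.reflect.Proxy.
        public static CL wrap(final CL delegate) {
            return (CL) Proxy.newProxyInstance(CL.class.getClassLoader(), new Class<?>[]{ CL.class },
                new InvocationHandler() {
                    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
                        System.out.println("CL call: " + method.getName());
                        return method.invoke(delegate, args);
                    }
                });
        }
    }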

>
> You could even say CLObject implements CLResource, if you wanted to
> keep it.  That is really the only thing that jumps out.  Args could
> have JavaDoc entries, but having that is not going to allow someone to
> use this without reading the OpenCL spec.  Good Job!


Thank you very much for the feedback, Jeff. Much appreciated.

best regards,
michael bien

>
> Jeff

Re: Looking over JOCL

jcpalmer
Michael,
A couple of refinements based on your reply:

>Just look at an OpenCL example and count the function calls for a simple
>thing like sorting a buffer. Its like glbegin/end 18 years ago.

Let me refine my overhead tolerances.  Calls that are outside what I call the "Kernel Loop" should be written for ease of use / maintenance / upgrading to any newer version of the spec.  Enqueuing / arg setting is show time.
I think JavaCL has a little unnecessary Java overhead enqueuing kernels. Nothing a code optimizer like ProGuard cannot get rid of for production, though.

> (no constructors ...)
Same for JavaCL.  I understand why.  People like me just have to run source code, instead of overriding.

>>  JavaCL has methods to wait for events both in it's CLEvent & CLQueue.
>i thought about that, but it was to dangerous from a concurrency
>perspective. CLCommandQueues are not thread save in JOCL's concurrency
>model. This is by design since I expect that in most situations they
>will be only used from one producer thread.

>JOCL forces you to use a queue to do any CLEvent work which forces you
>to think about it... But maybe I will weaken this in future. But that
>was the idea behind that.

FYI, research around events combined with release during garbage collection is currently my highest priority.  I had a little accident converting my code base to look like JavaCL.  I called clWaitForEvents, but did not call clReleaseEvent.  I got a 60% time reduction.  It also happened with enqueueWaitForEvents.  If this is for real, I could just let the garbage collector call it in its own thread (I am running the concurrent one).  This was on Win7.  Before I got too excited, I wanted to see if this was also the case on Linux (thanks for the help).  Now that Linux is up I can get to the bottom of this.  The few nVidia samples that use events do not even bother to release them.
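In JOCL terms the pattern I am timing would look roughly like this (a sketch; I am assuming CLEventList and putWaitForEvents exist with roughly these signatures):

    import com.jogamp.opencl.CLCommandQueue;
    import com.jogamp.opencl.CLEventList;
    import com.jogamp.opencl.CLKernel;

    public class EventRelease {
        // queue and kernel setup omitted; work sizes are placeholders
        static void runOnce(CLCommandQueue queue, CLKernel kernel, int globalWorkSize, int localWorkSize) {
            CLEventList events = new CLEventList(1);
            queue.put1DRangeKernel(kernel, 0, globalWorkSize, localWorkSize, events);
            queue.putWaitForEvents(events, true);   // host blocks until the kernel has finished
            events.release();                       // the clReleaseEvent I forgot - this is the call under suspicion
        }
    }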

>one context per device?
>I tried to prevent that... Its basically a hack since OpenCL 1.0
>implementations are not ready at this point. N Queues per device is much
>cleaner.
>Just think about memory sharing...

The thing that was the decider against the multi-device context was kernel arg setting.  You cannot specify a command queue when setting them.  The upshot is you need to have a separate kernel for each device, whether they are all in one context or each in their own.  I want each context to have its own, unshared, host memory to be able to process asynchronously, with Java's ThreadPoolExecutor running the show & each context completely unaware that multiple devices exist.  If the arg-setting thing gets fixed, I'll consider pooling command queues instead of contexts. See nVidia's simpleMultiGPU sample.
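To make that concrete, the workaround looks roughly like this in JOCL terms (a sketch; I am assuming program.createCLKernel() can simply be called once per device, with the other names as in the samples):

    import com.jogamp.opencl.CLCommandQueue;
    import com.jogamp.opencl.CLContext;
    import com.jogamp.opencl.CLDevice;
    import com.jogamp.opencl.CLKernel;
    import com.jogamp.opencl.CLProgram;

    public class PerDeviceKernels {
        public static void main(String[] args) {
            CLContext context = CLContext.create();
            try {
                CLProgram program = context.createProgram(
                        "kernel void noop() { }").build();

                // clSetKernelArg is not queue-scoped, so one shared CLKernel instance would race
                // when two devices need different arguments; give each device its own instance.
                for (CLDevice device : context.getDevices()) {
                    CLCommandQueue queue = device.createCommandQueue();
                    CLKernel kernel = program.createCLKernel("noop");   // one kernel object per device/queue
                    queue.put1DRangeKernel(kernel, 0, 64, 64);
                }
            } finally {
                context.release();
            }
        }
    }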

Jeff

Re: Looking over JOCL

Michael Bien
On 05/22/2010 08:54 PM, jcpalmer [via jogamp] wrote:
> Michael,
> A couple of refinements based on your reply:

>Just look at an OpenCL example and count the function calls for a simple
>thing like sorting a buffer. Its like glbegin/end 18 years ago.

> Let me refine my overhead tolerances.  Calls that are outside what I call the "Kernel Loop" should be written for ease of use / maintenance / upgrading to any newer version of the spec.  Enqueuing / arg setting is show time.
> I think JavaCL has a little unnecessary Java overhead enqueuing kernels. Nothing a code optimizer like ProGuard cannot get rid of for production, though.
Sure. What I actually tried to say was that OpenCL will not stay as it is. Just wait until the first wave of extensions is in core. We will subdivide devices at runtime, and more... Concurrent kernel execution is very young, technically available for only a few weeks now on high-end hardware.

setArgs + enqueueKernel + waitForEvents are like an async function call which should be as fast as possible. No unnecessary overhead in the binding code. That's the goal.
Again, I don't make any assumptions about how the binding will be used by client code or where the bottleneck may be, today or in the future, since it's not possible to look into the future.

Since there is, as of today, no faster way to call a function from Java than with a thin JNI layer... we are using a thin JNI layer :). For consistency we use it for all projects: JOGL, JOCL, JOAL and OpenMAX.
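For reference, the hot path we both mean, as a minimal HLB sketch (fluent method names as in the current samples; kernel source, sizes and exact signatures are approximations):

    import com.jogamp.opencl.CLBuffer;
    import com.jogamp.opencl.CLCommandQueue;
    import com.jogamp.opencl.CLContext;
    import com.jogamp.opencl.CLKernel;
    import java.nio.FloatBuffer;

    public class KernelLoop {
        public static void main(String[] args) {
            CLContext context = CLContext.create();
            try {
                CLCommandQueue queue = context.getMaxFlopsDevice().createCommandQueue();
                CLBuffer<FloatBuffer> data = context.createFloatBuffer(1024);
                CLKernel kernel = context.createProgram(
                        "kernel void scale(global float* a, float f) { a[get_global_id(0)] *= f; }")
                        .build().createCLKernel("scale");

                // the "kernel loop": set args, enqueue, read back - the only place call overhead matters
                for (int i = 0; i < 100; i++) {
                    kernel.putArg(data).putArg(0.5f).rewind();     // rewind resets the arg index for the next pass
                    queue.putWriteBuffer(data, false)
                         .put1DRangeKernel(kernel, 0, 1024, 64)
                         .putReadBuffer(data, true);               // blocking read ends one iteration
                }
            } finally {
                context.release();
            }
        }
    }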


> (no constructors ...)
> Same for JavaCL.  I understand why.  People like me just have to run source code, instead of overriding.

>>  JavaCL has methods to wait for events both in it's CLEvent & CLQueue.
>i thought about that, but it was to dangerous from a concurrency
>perspective. CLCommandQueues are not thread save in JOCL's concurrency
>model. This is by design since I expect that in most situations they
>will be only used from one producer thread.

>JOCL forces you to use a queue to do any CLEvent work which forces you
>to think about it... But maybe I will weaken this in future. But that
>was the idea behind that.

> FYI, research around events combined with release during garbage collection is currently my highest priority.  I had a little accident converting my code base to look like JavaCL.  I called clWaitForEvents, but did not call clReleaseEvent.  I got a 60% time reduction.  It also happened with enqueueWaitForEvents.  If this is for real, I could just let the garbage collector call it in its own thread (I am running the concurrent one).  This was on Win7.
Hehe, so you are relying on finalizers :)

Please don't do that... finalizers are like Thread.stop()... even worse -> completely unspecified and therefore implementation-dependent. The only reason why they are still available is backwards compatibility with Java 1.1.


> Before I got too excited, I wanted to see if this was also the case on Linux (thanks for the help).  Now that Linux is up I can get to the bottom of this.  The few nVidia samples that use events do not even bother to release them.
No problem. I was wondering why nobody else noticed this. We have been using LD_PRELOAD in production since around December.


>one context per device?
>I tried to prevent that... Its basically a hack since OpenCL 1.0
>implementations are not ready at this point. N Queues per device is much
>cleaner.
>Just think about memory sharing...

> The thing that was the decider against the multi-device context was kernel arg setting.  You cannot specify a command queue when setting them.  The upshot is you need to have a separate kernel for each device, whether they are all in one context or each in their own.  I want each context to have its own, unshared, host memory to be able to process asynchronously, with Java's ThreadPoolExecutor running the show & each context completely unaware that multiple devices exist.  If the arg-setting thing gets fixed, I'll consider pooling command queues instead of contexts. See nVidia's simpleMultiGPU sample.
Right. I will wait until the API and implementations stabilize before I go down this road. I don't want to add public APIs which deal with multiple contexts now just to work around implementation issues.

For production systems I usually go this road and try to solve the issue by using:
1.) multiple queues + multiple kernel instances
2.) multiple program instances
3.) multiple contexts (luckily it has never gone this far)

thanks again,
great discussion
-michael


> Jeff

