Hi Michael,
I wonder why you've opted for a fixed-size CLEventList? It brings a lot of disadvantages. Why didn't you just extend List<CLEvent> and add some implementation-specific functions? I could write a wrapper event list and wrappers for the functions requiring CLEventList, but that would be unnecessary bloat. Is there a chance for a normal, dynamically-sized collection? |
Hello,
it's because of performance. The list uses a directly allocated ByteBuffer as event ID storage. I thought about that already but decided to go the 100% transparent route (regarding behind-the-scenes buffer allocation, synchronization etc.) for all runtime APIs. If it were a plain old long array I would simply make it expandable...

However, what we could do is add a CLEventList.create(CachedBufferFactory factory, boolean expandable) factory method. Every expand operation would trigger a copy from the old buffer to a new buffer; shrinking the list could be difficult to implement.

Do you think the current static allocation leads to bad code style? How hard is it for you in your application to predict the event list size?

Thank you very much for the feedback.

best regards,
michael
|
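A minimal sketch of the expand-by-copy idea described above, using only java.nio. The class name ExpandableHandleList and its methods are hypothetical and are not part of JOCL; it assumes event IDs are stored as 64-bit values and that doubling the capacity is an acceptable growth policy.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical growable list of 64-bit handles kept in a direct buffer (not JOCL code). */
final class ExpandableHandleList {
    private ByteBuffer storage;
    private int size;

    ExpandableHandleList(int initialCapacity) {
        storage = allocate(Math.max(1, initialCapacity));
    }

    private static ByteBuffer allocate(int capacity) {
        return ByteBuffer.allocateDirect(capacity * Long.BYTES).order(ByteOrder.nativeOrder());
    }

    void add(long handle) {
        if (size == capacity()) {
            grow(); // every expand operation copies the old buffer into a new one
        }
        storage.putLong(size * Long.BYTES, handle);
        size++;
    }

    long get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
        return storage.getLong(index * Long.BYTES);
    }

    int size() {
        return size;
    }

    int capacity() {
        return storage.capacity() / Long.BYTES;
    }

    private void grow() {
        ByteBuffer bigger = allocate(capacity() * 2);
        // Copy only the bytes in use; a duplicate keeps the original's position untouched.
        ByteBuffer used = storage.duplicate();
        used.position(0);
        used.limit(size * Long.BYTES);
        bigger.put(used);
        bigger.clear(); // reset position/limit; all element access is absolute
        storage = bigger;
    }
}
```

Shrinking is deliberately left out, matching the remark that it could be difficult to implement. The old buffer is simply dropped on growth, so its native memory is only reclaimed whenever the JVM gets around to it.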
Well, premature optimization is a sign of a bad designer :)
Did you compare the performance? My test code: http://pastie.org/1693672

Console output (from NetBeans):

run:
Control value: 3854175008845418994
Arrays testing time: 425 ns
Control value: 3854175008845418994
Buffers testing time: 4178 ns
BUILD SUCCESSFUL (total time: 4 seconds)

A million allocations in 4 seconds gives 250k allocations per second on a Core2 Duo E8400. I doubt anyone will even get close. If you add a pool of small unused Buffers then the difference will be much smaller. Additionally (or alternatively) you can create a bunch of small sub-Buffers by slicing one preallocated big Buffer.

There are many possibilities to reduce the number of direct (de-)allocations, so you can safely implement a dynamically-sized list, and when someone complains about poor Event list performance you can add some logic/strategies that reduce the number of allocations - or he can use the fixed-size lists.

There is a second thing I would like you to change. Every method that returns an Event expects a CLEventList although it returns only one Event. I suggest creating a class CLEventHolder that will have a field of type CLEvent and will be used to pass the resulting Event to the calling routine. Additionally you could add a method addFrom(CLEventHolder) to class CLEventList that will append the Event from that holder to the list.

With the aid of WeakReferences one could even do automatic releasing of Events (and other objects) - i.e. integrate it with garbage collection.

PS: I'm only starting to write OpenCL applications and I don't have much experience with it, but I insist on flexibility of solutions.
|
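The suggestion of slicing one preallocated big Buffer into small sub-Buffers could look roughly like the sketch below. SlicedBufferPool and take are hypothetical names, not JOCL API. This version only hands out slices and never recycles them, so it shows the slicing step only, not a complete pool.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper that carves small direct sub-buffers out of one big block. */
final class SlicedBufferPool {
    private final ByteBuffer block;

    SlicedBufferPool(int totalBytes) {
        block = ByteBuffer.allocateDirect(totalBytes);
    }

    /** Returns a sub-buffer of the requested size that shares memory with the big block. */
    ByteBuffer take(int bytes) {
        if (bytes > block.remaining()) {
            // Block exhausted: fall back to a fresh direct allocation.
            return ByteBuffer.allocateDirect(bytes).order(ByteOrder.nativeOrder());
        }
        int start = block.position();
        block.limit(start + bytes);          // restrict the window to the requested size
        ByteBuffer slice = block.slice().order(ByteOrder.nativeOrder());
        block.limit(block.capacity());       // restore the full window
        block.position(start + bytes);       // advance past the bytes just handed out
        return slice;
    }

    int remaining() {
        return block.remaining();
    }
}
```

Only one native allocation happens up front; each take call is plain index arithmetic plus a small Java object for the slice.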
On 03/20/2011 10:44 PM, Wibowit [via jogamp] wrote:
> Well, premature optimization is a sign of a bad designer :)

In an application, maybe... sometimes. But this is not the rule in an API-binding library in the HPC sector. It's also not an optimization; it's just the simplest possible implementation for this particular CL feature, which is at the same time one of the fastest.

> Did you compare the performance?

Of course. That's the reason why I implemented the CachedBufferFactory (name might change in future ;)). In essence: small buffer allocations are more memory intensive than they should be and can cause fragmentation in the perm gen (direct mem is not relocatable).

[cut]

> There are many possibilities to reduce the number of direct (de-)
> allocations so you can safely implement a dynamically-sized list and when
> someone will complain about poor Events lists performance then you can add
> some logic/ strategies that reduces the number of allocations - or he can
> use the fixed size lists.
>
> There is a second thing I would want you to change. Every method that return
> an Event expects a CLEventList although it returns only one Event. I suggest
> creating a class CLEventHolder that will have a field of type CLEvent and
> will be used to pass the resulting Event to calling routine. Additionally
> you could add a method addFrom(CLEventHolder) to class CLEventList that will
> append the Event from that holder to the list.

CLEvent implements CLEventHolder
CLEventList implements CLEventHolder
... understood, but how often will it happen that you need one single event, and if so, what is wrong with using a list of size 1? I see how you would implement it but can't see the motivation behind it; please elaborate.

> With the aid of WeakReferences one could even do automatic releasing of
> Events (and other objects) - ie integrate it with Garbage Collection.

WeakReferences, finalizers and friends would not help in this context, since you would be assuming that the heap size/used heap size is somehow correlated to the direct memory size/used direct memory size. For the same reason there is no GC-based collector for native CL resources in JOCL (though one was implemented in an early version).

> PS: I'm only starting writing OpenCL application, I don't have much
> experience with it but I insist on flexibility of solutions.

Thanks for your feedback, but I would still be curious about the answers to my questions from the previous post:

> Do you think the current static allocation leads to bad code style? How
> hard is it for you in your application to predict the event list size?

best regards,
-michael
|
Well, the HPC sector still uses Fortran :) It sometimes gives them the highest performance. And it's relatively unlikely that the HPC sector will use JOCL, as Java isn't particularly suited for that kind of thing.

The problem is that Java users are used to allocating many objects and then forgetting them (which is a good thing - managing global state is painful not only for the user but also for the machine), and you're making a library that performs poorly in such scenarios. Additionally you're forcing users to manually deallocate objects - if you don't call clReleaseEvent then the event will stay in memory - a potential source of memory leaks.

It is possible to release Events automatically: make a parent object for clones of the same Event, and store a reference to that parent together with each Event. In the parent's finalize method we can release the Event.

Too bad that the low-level OpenCL API accepts only arrays of cl_event objects. If it also accepted arrays of pointers to cl_event objects then managing them (at least from Java or C++) would be much easier - as far as I understand (or imagine), the low-level API copies the cl_event arrays immediately on method invocation, so providing pointers shouldn't cause any problem. Maybe I should write about it on the Khronos forum?

> CLEvent implements CLEventHolder
> CLEventList implements CLEventHolder

No. I meant a separate CLEventHolder class with only one field of type CLEvent and accessors for it.

> Do you think the current static allocation leads to bad code style? How
> hard is it for you in your application to predict the event list size?

It certainly does. And, well, if you're making a standalone application then it may be possible to predict list sizes accurately enough, but what if I make a framework based on JOCL? It would require adding a lot of code to compute the sizes of the lists to allocate. I haven't written a full application yet, but I plan to use graphs of Events - for example, many commands will wait for a buffer to be uploaded, and some commands will wait for multiple Events. The current JOCL implementation doesn't facilitate that.

Edit: I've looked at the OpenCL headers. I don't know C well, but the statement:
typedef struct _cl_event * cl_event;
looks like cl_event is in fact a pointer to some struct (sorry that I didn't figure it out earlier). That invalidates some of my previous statements. I will rethink the problem later.
|
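A rough sketch of the "shared parent releases the Event" idea from the post above. Instead of finalize (deprecated since Java 9) it swaps in java.lang.ref.Cleaner, which is a different mechanism with the same intent. SharedEventHandle is a hypothetical name, and the counter merely stands in for a native clReleaseEvent call, which is not performed here. The objection raised earlier in the thread still applies: the GC reacts to heap pressure, not to native memory usage, so the timing of an automatic release is unpredictable.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical shared parent for all Java-side copies of one native event handle. */
final class SharedEventHandle implements AutoCloseable {
    private static final Cleaner CLEANER = Cleaner.create();

    /** Release action; it must not hold a reference to the SharedEventHandle itself. */
    private static final class Release implements Runnable {
        private final long handle;
        private final AtomicInteger releaseCounter;

        Release(long handle, AtomicInteger releaseCounter) {
            this.handle = handle;
            this.releaseCounter = releaseCounter;
        }

        @Override
        public void run() {
            // Stand-in for releasing the native event identified by 'handle'.
            releaseCounter.incrementAndGet();
        }
    }

    private final long handle;
    private final Cleaner.Cleanable cleanable;

    SharedEventHandle(long handle, AtomicInteger releaseCounter) {
        this.handle = handle;
        // Runs at most once: on close(), or after this object becomes unreachable.
        this.cleanable = CLEANER.register(this, new Release(handle, releaseCounter));
    }

    long handle() {
        return handle;
    }

    /** Explicit, deterministic release; safe to call more than once. */
    @Override
    public void close() {
        cleanable.clean();
    }
}
```

Every Java-side clone of an event would hold a reference to the same SharedEventHandle, so the native release happens once, either explicitly through close() or some time after the last clone becomes unreachable.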
Sorry for late reply.
I think I'll end up with my own Collection wrappers that will create lots of CLEventLists, or I'll fork JOCL (much less probable). Event objects are so small (on the client side they're just pointers) that copying them is essentially free. If I were you I would just create a 10 kB (or 100 kB) ThreadLocal direct ByteBuffer and use it for receiving/sending events from/to the OpenCL driver (if there is more data, I would just allocate another temporary buffer). I would store Events in an ordinary collection implementing the standard List interface. I would also integrate Events with the garbage collector, i.e. automatic clReleaseEvent.
|
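The ThreadLocal staging-buffer proposal above might be sketched as follows. EventStaging, toNative and fromNative are hypothetical names, and the sketch assumes event handles are 64-bit pointers kept in an ordinary List<Long>. No actual driver call is made.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.LongBuffer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical staging helper: handles live in a normal List, direct memory is only transient. */
final class EventStaging {
    static final int STAGING_BYTES = 10 * 1024; // the "10 kB" per-thread buffer

    private static final ThreadLocal<ByteBuffer> STAGING = ThreadLocal.withInitial(
            () -> ByteBuffer.allocateDirect(STAGING_BYTES).order(ByteOrder.nativeOrder()));

    /** Copies the handles into direct memory, ready to be handed to native code. */
    static LongBuffer toNative(List<Long> handles) {
        int bytes = handles.size() * Long.BYTES;
        ByteBuffer target = bytes <= STAGING_BYTES
                ? STAGING.get()
                // Larger lists get a temporary buffer of their own.
                : ByteBuffer.allocateDirect(bytes).order(ByteOrder.nativeOrder());
        target.clear();
        LongBuffer view = target.asLongBuffer();
        for (long handle : handles) {
            view.put(handle);
        }
        view.flip(); // position 0, limit = number of handles written
        return view;
    }

    /** Copies handles written by native code back into an ordinary list. */
    static List<Long> fromNative(LongBuffer buffer) {
        List<Long> handles = new ArrayList<>(buffer.remaining());
        for (int i = buffer.position(); i < buffer.limit(); i++) {
            handles.add(buffer.get(i));
        }
        return handles;
    }
}
```

The buffer returned by toNative is only valid until the next toNative call on the same thread, which is the price of reusing one staging area.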
Sure, setting up a small framework on top of it should be easy. However, bringing the heap and the GC strategy into correlation with native resources is not recommended.

best regards,
michael
|
I am finally doing some work on topic and have a basic question: how to use CLEventLists?
I have some code: http://pastebin.com/X5NYX2h9

The problematic line is 71. AFAIU the condition parameter should be a list of events to wait on for completion, and the event parameter should be a list to store the resulting event. Extracting profiling info from the event works fine. When I run my program in NetBeans the output is:

run:
1
com.jogamp.opencl.CLException$CLInvalidEventWaitListException: can not enqueue 1DRange CLKernel [id: 140154717645264 name: sortChunks] with gwo: null gws: {8388500} lws: {250} cond.: com.jogamp.opencl.CLEventList[CLEvent [id: 140154720214544 name: NDRANGE_KERNEL status: QUEUED]] events: com.jogamp.opencl.CLEventList[] [error: CL_INVALID_EVENT_WAIT_LIST]
Enqueue-to-submit delay: 2105842 ns
Submit-to-run delay: 8432622 ns
Total running time: 8650142 ns
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.3/ http://www.mediawiki.org/ <adeiiiklmmnswx________________.//aegiiiklmorwx________________"//::=hipssttwwx________________-LSaacceehimnnst________________"//:=hinoptttwww________________ -.//03ehlmoprtx________________.//egiik
BUILD SUCCESSFUL (total time: 2 seconds)

What's wrong with that event wait list?
|
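For context on the error code in the output above: the OpenCL specification for clEnqueueNDRangeKernel lists CL_INVALID_EVENT_WAIT_LIST for a null wait list with a non-zero count, a non-null wait list with a zero count, or invalid event objects in the list. The sketch below mirrors those rules in plain Java. WaitListCheck is a hypothetical helper, not JOCL code. The non-zero-handle test and the bounds check are rough approximations added here, and nothing in it identifies the actual cause of the exception reported in this thread.

```java
/** Hypothetical pre-flight check mirroring the CL_INVALID_EVENT_WAIT_LIST conditions. */
final class WaitListCheck {

    /**
     * @param waitList native event handles, or null when there is nothing to wait for
     * @param count    the number of events that would be passed alongside the list
     */
    static boolean isValidWaitList(long[] waitList, int count) {
        if (waitList == null) {
            return count == 0;   // null list with count > 0 is invalid
        }
        if (count == 0) {
            return false;        // non-null list with count == 0 is invalid
        }
        if (count < 0 || count > waitList.length) {
            return false;        // our own bounds check, not from the specification
        }
        for (int i = 0; i < count; i++) {
            if (waitList[i] == 0L) {
                return false;    // a null handle cannot be a valid event (approximation)
            }
        }
        return true;
    }
}
```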
I only took a quick look at your code but it looks good so far, e.g. events and conditions are not mixed up, etc.

BTW, many methods accept varargs. Instead of
createCommandQueue(new Mode[]{Mode.OUT_OF_ORDER_MODE, Mode.PROFILING_MODE});
you can just write
createCommandQueue(Mode.OUT_OF_ORDER_MODE, Mode.PROFILING_MODE);
or even
createCommandQueue(OUT_OF_ORDER_MODE, PROFILING_MODE);
if you static-import the Mode enum.

You should also use the Buffers.newFooBuffer methods to make your code platform independent, if this is a requirement.

I can't find anything wrong in JOCL's high-level API. If you are still having trouble, please provide a self-contained test reproducing this issue to speed things up.

best regards,
-michael

--
http://michael-bien.com
|
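The varargs remark above is a general Java feature, shown here with stand-in types. VarargsDemo, its Mode enum and queueModes are hypothetical and only borrow the constant names from the post; they are not JOCL's classes or signatures.

```java
import java.util.EnumSet;

/** Hypothetical demo of a varargs parameter accepting an array or a plain argument list. */
final class VarargsDemo {
    /** Stand-in enum, named after the constants mentioned in the post. */
    enum Mode { OUT_OF_ORDER_MODE, PROFILING_MODE }

    /** A 'Mode...' parameter accepts an explicit array, a list of arguments, or nothing. */
    static EnumSet<Mode> queueModes(Mode... modes) {
        EnumSet<Mode> result = EnumSet.noneOf(Mode.class);
        for (Mode mode : modes) {
            result.add(mode);
        }
        return result;
    }

    public static void main(String[] args) {
        // Both calls compile to the same thing: the compiler builds the array for the second.
        System.out.println(queueModes(new Mode[] {Mode.OUT_OF_ORDER_MODE, Mode.PROFILING_MODE}));
        System.out.println(queueModes(Mode.OUT_OF_ORDER_MODE, Mode.PROFILING_MODE));
    }
}
```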
In reply to this post by Wibowit
On 03/21/2011 01:27 PM, Wibowit [via jogamp] wrote:
> Well, a HPC sector still uses Fortran :) It gives them highest performance
> sometimes. And it's relatively unlikely that HPC sector will use JOCL as
> Java isn't particularly suited for that kind of things.

A big company doing spaceship tech in Europe is right now switching away from Fortran for their simulations, since they can't find anyone who still wants to code in this language while being as productive as with easier languages (like Java).

If you want to compute everything on the CPU under realtime requirements, Java may not be the best choice, fully agreed. But we are talking about different things here. OpenCL moves the bottleneck to the GPU (or any supported device) the same way GL does/did. The host application is only the "provider"; the devices are the "consumers" in the HPC scenario. The provider should never slow down the consumers in this pattern. If you are doing CL and have the bottleneck on the CPU side, you really did something wrong (or the problem is not distributable by nature), since your program wouldn't scale with load.

There is no reason why you couldn't use OpenCL in Java the same way you could use it in C (same project requirements etc.). Of course there are exceptions, as always, but I disagree with you if you say that this is the general case.

If you read JOCL's project goals you find points like
 * high performance, cross platform, high and low level OpenCL JNI bindings
 * GC friendly - no weak references, finalizers or other cheats
 * ...

This is not only marketing stuff; those really are the goals, which also affected API design and technology choice (the JNI vs JNA discussion, for example).

best regards,
michael
|
In reply to this post by Michael Bien
Here is the sample:
http://pastebin.com/5PXvDc5V (Note: I'm using 120 as the line width, not the standard 80.) It has two kernels: the first one writes a ROT13-ciphered string to a buffer, the second one deciphers it and saves the result to another buffer. Finally the last buffer is transferred back to the host and written to standard output. Unfortunately this example also works properly when I delete the event wait list, so we can't observe whether OpenCL ignores the event wait list or not. I'll prepare better sample code when you fix the exceptions. |
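For checking the output of such a two-kernel sample on the host side, a plain Java reference implementation of ROT13 can be handy. This is an illustrative sketch and not the code from the pastebin link. Rot13 is a hypothetical helper name, and it only rotates ASCII letters.

```java
/** Hypothetical host-side reference for validating the cipher/decipher kernels. */
final class Rot13 {

    /** Rotates ASCII letters by 13 places; all other characters pass through unchanged. */
    static String rot13(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 'a' && c <= 'z') {
                out.append((char) ('a' + (c - 'a' + 13) % 26));
            } else if (c >= 'A' && c <= 'Z') {
                out.append((char) ('A' + (c - 'A' + 13) % 26));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String ciphered = rot13("Hello, World!");
        System.out.println(ciphered);          // what the first kernel should produce
        System.out.println(rot13(ciphered));   // what the second kernel should restore
    }
}
```

Because ROT13 is its own inverse, the second kernel's output must equal the original input. If the deciphering kernel ran before the ciphering kernel had finished, the final buffer would generally not match.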
In reply to this post by Michael Bien
Well, at least to me JOCL looks very low-level. It's almost a 1:1 copy of the original OpenCL API. It certainly doesn't benefit much from being object-oriented. If your goal is to be low-level then it does make some sense.

Unfortunately, I have too little time to develop a developer-friendly OpenCL binding, so I'm left with what is available now.

You said that programmers switched from Fortran to Java because Java is more productive. I say that programmers will switch from low-level bindings to high-level bindings (when they appear) to be more productive. A little additional overhead won't make a big difference to program efficiency. Remember - hardware is getting cheaper but professionals are becoming more precious. Today's developers have trouble with simple problems like: http://www.codinghorror.com/blog/2007/02/why-cant-programmers-program.html

If you have ever tried creating a game in OpenGL (or DirectX) then you've probably read about the speed of state changes, i.e. the maximum achievable number of state changes per second. Generally GPU power grows much quicker than state-change speed. Does it mean that OpenGL or DirectX aren't scalable? I think yes. Setting up the GPU consumes a lot of time; the more complex GPU architectures become, the more complex state changes are.

With my solution (i.e. ThreadLocal buffers for copying events between Java and the driver) you're only forced to allocate a buffer when your event list is bigger than the standard buffer size (e.g. 100 KiB). Copying 100 KiB of data probably takes more time than allocating a direct buffer, so in the end my solution will be almost as fast as the current solution or even faster (because with my solution you're not forced to allocate so many small direct buffers - instead we would allocate many Java objects and copy data).

GC friendly? No, I think you're not. You're using direct buffers. Direct buffers are not GC friendly. And you additionally force users to allocate many such buffers - I do not see any way to change events that are already in an event list.
|
In reply to this post by Wibowit
Okay, I've prepared a slightly better example:
http://pastebin.com/r4zPMcZB This one at least always shows wrong data when I comment out the copy kernel invocation at line 58. This sample also has a null event wait list. |
I believe I found the issue. A fix is in progress.
-michael
|
In reply to this post by Wibowit
The bug is fixed in
https://github.com/mbien/jocl/commit/6612391c7ad8309ebd315cdf2a91a71f11793a61
I also added a regression test. The builds are running right now.
-michael
|
In reply to this post by Wibowit
Wibowit, the HPC sector already uses Fortran, CUDA, etc. IBM researchers in this sector use Java too. It's time to forget your prejudices about Java.
Julien Gouesse
|
Well, OpenCL is very immature, so there aren't many samples to play with. I could prove that automatic collection of events isn't perceptibly slower than manual management, but from what I recently read, AMD doesn't support out-of-order queues, so playing with events is a pure waste of time right now - events are simply ignored.

As to "prejudices": Java is my platform of choice. I do claim that it has many disadvantages and is badly designed from a number of perspectives, but it's still fast enough, portable and pretty high-level. The HotSpot developers have implemented a number of optimizations, so high-level Java code with many layers of indirection performs almost as well as hand-optimized code. There's dynamic inlining, escape analysis, generational GC, etc. I've chosen Java over C++ mainly because Java does many things automatically. If I wanted complete control over everything I would use C++. Furthermore, it's possible to integrate native libraries with Java thanks to JNA or JNI, or we can use inter-process communication. Forcing Java code to be low-level is IMO very bad.
|