Revisiting the "Manilla Benchmark" using Bridj

Revisiting the "Manilla Benchmark" using Bridj

Michael Bien
-------- Original Message --------
Subject: Revisiting the "Manilla Benchmark" using Bridj
Date: Mon, 19 Jul 2010 15:45:57 -0400
From: Jeff Palmer [hidden email]
To: Michael Bien [hidden email], Olivier Chafik [hidden email]

Michael & Olivier,
The JavaCL-Bridj preview is out.  Having already written a benchmark with
timings preserved on a webpage, it was logical to revisit this.

There are no longer low-level bindings, so I changed it to JavaCL proper.
If I compare the old OpenCL4Java code with the new BridJ version, I get a
66.5% reduction.  My problem is that when I compare the absolute timings to
those on the web page, the baseline is now 3x larger, so the reduction just
puts JavaCL back where it was.
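The arithmetic behind "back where it was" can be sketched with made-up numbers (the 100 ms figure is my placeholder, not a timing from the benchmark): a baseline that grew 3x, then reduced by 66.5%, lands almost exactly on the original figure.

```java
public class BaselineMath {
    // Apply a 66.5% reduction to a baseline that is 3x the old one.
    static double bridjMs(double oldBaselineMs) {
        double newBaselineMs = 3.0 * oldBaselineMs;     // baseline now 3x larger
        return newBaselineMs * (1.0 - 0.665);           // 66.5% reduction
    }

    public static void main(String[] args) {
        // 3 * 0.335 = 1.005, i.e. essentially back to the old timing.
        System.out.printf("BridJ: %.1f ms vs old baseline %.1f ms%n",
                bridjMs(100.0), 100.0);
    }
}
```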

I am finding this a little hard to understand.  I could almost put it down
to the fact that I am now running driver 257.21, versus whatever was
installed back in early March.  The ICD could also be different.

If I look closely at my first post, however, I say my avg. ms per loop is
from my Mac, but in the rules I give NVIDIA version numbers.  I re-ran the
original code on OS X with beta4 OpenCL4Java, and I get a 2.5x increase.
Even here, the date on the OpenCL "DLL" is now 06/10/10.  I do not know how
I got a file that new.

Trying to keep my sanity, I thought I would just run the JOCL side.  If it
also was much higher, then I could just ignore the old data.  Problem is, I
cannot get it to run.  The source is missing an import, no biggie, but I
built a NetBeans library with:
- jocl.jar
- jocl-natives-windows-amd64.jar
- jocl-natives-windows-i586.jar
- gluegen-rt.jar

I am getting an UnsatisfiedLinkError: no jocl in java.library.path.  What
am I missing?
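For what it's worth, here is a quick diagnostic sketch for that UnsatisfiedLinkError (the class name and approach are mine, not from the thread): print the directories the JVM actually searches for native libraries, since the unpacked jocl natives must end up in one of them or be supplied via `-Djava.library.path`.

```java
public class LibPathCheck {
    // Directories the JVM searches when loading native libraries.
    static String[] searchDirs() {
        return System.getProperty("java.library.path")
                     .split(java.io.File.pathSeparator);
    }

    public static void main(String[] args) {
        // "no jocl in java.library.path" means the native jocl library is in
        // none of these directories; either copy the unpacked natives into
        // one of them, or start the JVM with -Djava.library.path=/path/to/natives.
        for (String dir : searchDirs()) {
            System.out.println(dir);
        }
    }
}
```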




Re: Revisiting the "Manilla Benchmark" using Bridj

Michael Bien
I left out that I took my new JavaCL version of the program with beta 4 and compared it to the OpenCL4Java version, and got virtually the same result.  All my own OpenCL4Java code is now retired, so I have no problem if you wish to use the conversion as a reason to cease exposing it.  Maybe some kind of table in the Assembler Optimizations section of the Design Wiki would help; I was confused by all the faster-slower verbiage.

I have since remembered that I must be missing the DLLs that would have been made by a build.  They are not in the automated-builds directories.  I guess I am kind of spoiled now, not having used JOGL for almost a year, and even then I used the NetBeans plug-in.  I would put everything needed for each platform in its own directory, along with directions, soon.  Maybe it is just me, but I like to have the source of libraries without any interest in having to build them from source.  Having to build JOGL just so you can build JOCL makes this pretty unattractive.
Yes, you are right.  Thanks for reminding me about this point.  It's a trivial compile-time dependency which is only there for public-API convenience reasons.  I'll update the script to allow compiling against the JOGL jars (automatic download etc.) soon.
 Have fun in L.A. , but stay away from the Grand Canyon unless you have “your papers” ;-)
haha, thanks!

My window on this diversion has closed.  I have moved on.  One of my final tasks in the OpenCL part of my product is to test out my final production, asynchronous HTTP interface.  The newer driver screwed this up, so I went back to 197 today to complete my testing.  I re-ran the benchmark code, same result.  See it if interested.

Either way, I think I have been pretty consistent in saying that bindings overhead was not important.  In a way I still believe that, but in my 5th-generation kernels (GLSL for generations 1 and 2) I have actually moved the kernel loop right into the GPU.  I needed to thread the needle in kernel design to get the equivalent of the global-work-set sync that external kernel calls would otherwise provide.  I also need to throw all the possible combos of arguments up into a global buffer beforehand.

Actually, this is to avoid much more than bindings overhead: the kernel and argument-setting overhead, plus the bindings overhead, just vanishes.  I could do everything with just one enqueueKernel().  In order to do multi-GPU, I only do 400 kernels per external enqueue.
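A toy cost model (all numbers invented for illustration, not measurements from this thread) shows why batching many kernels per enqueue makes the per-call host overhead nearly vanish:

```java
public class EnqueueOverhead {
    // Total wall time for running totalKernels kernel iterations when
    // kernelsPerEnqueue of them are batched behind each host-side enqueue.
    static double totalMs(int totalKernels, int kernelsPerEnqueue,
                          double overheadMsPerEnqueue, double workMsPerKernel) {
        int enqueues = (int) Math.ceil(totalKernels / (double) kernelsPerEnqueue);
        return enqueues * overheadMsPerEnqueue + totalKernels * workMsPerKernel;
    }

    public static void main(String[] args) {
        // Hypothetical: 40,000 iterations, 0.05 ms host overhead per enqueue,
        // 0.01 ms of GPU work per iteration.
        System.out.printf("1 kernel per enqueue:    %.0f ms%n",
                totalMs(40_000, 1, 0.05, 0.01));     // overhead dominates
        System.out.printf("400 kernels per enqueue: %.0f ms%n",
                totalMs(40_000, 400, 0.05, 0.01));   // overhead amortized away
    }
}
```

With one kernel per enqueue the host overhead dominates; batched 400-deep, the same work runs in a fraction of the time, which is the effect described above.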