Negating some JNI call overhead by doing transformations in Java

Negating some JNI call overhead by doing transformations in Java

GiGurra
Please excuse me if this is already implemented or if there are important reasons why it can't be used :) (but please do explain why, I would really appreciate it).

I'm writing this somewhat high performance 2d application using jogl 2 with lots of relative transformations from object A -> object B -> object C. I'm guessing one of the difficult parts with java<->opengl is making sure the jni call overhead doesn't become a big performance hog, especially when lining up a lot of variable translate/rotate/scale operations.

Now I've benchmarked a few glPop/push/loadMatrix/translate/rotate/scale operations here on my laptop (2.4 GHz i5 w intel gpu) and it seems like these operations take somewhere in between 0.5 - 7 microseconds to complete (The loadmatrix is by far the slowest, and the others are around 1 microsecond each).

This does feel like quite a large overhead per call compared to for example a native C program (though I cannot say I have any numbers to compare with, I'm just doing some estimations on what the implications are), and it does seem to match what JNI calls are supposed to take.

Now: would it go faster if I used a wrapper for them? Let's say I have 10-20 (extreme case, but still) transformations that I want to do, and I apply them to a temporary matrix in Java on the CPU, and then make just one JNI call (load matrix) to transfer the result to OpenGL. Do you think there would be a performance benefit to this?

Re: Negating some JNI call overhead by doing transformations in Java

gouessej
Administrator
Hi

Most modern scenegraphs compute transforms on the CPU side and call glLoadMatrix and co. only in a very few cases (even never when using only shaders), have a look at Ardor3D, JMonkeyEngine, Aviatrix3D, 3DzzD, etc... It is a good approach.
Julien Gouesse | Personal blog | Website

Re: Negating some JNI call overhead by doing transformations in Java

Demoscene Passivist
Administrator
>Most modern scenegraphs compute transforms on the CPU side and call
>glLoadMatrix and co. only in a very few cases (even never when using only shaders)

I guess Julien means "GPU side" not "CPU side". And he's absolutely right, as API call overhead is immense in general regardless of the programming language or graphics API u use (Direct3D has the same problem as OpenGL here). So JNI call overhead or Java isn't the problem here. As Julien already pointed out, a good approach is to offload as many calculations as possible to the GPU and only set a couple of uniforms and let a GLSL vertex/geometry shader do the transforms.

Re: Negating some JNI call overhead by doing transformations in Java

GiGurra
I'm not sure I understand you here. JNI is a factor for my application.
Calling glRotatef for example in jogl would probably take a factor 10 longer than from a C program (probably more).
Let's say I have 10 of those. Why not just make a single transformation matrix of that in java and THEN make one JNI glLoadMatrixf call to opengl through jogl? Surely that must be an advantage if the application has a significant number of transformations. (That is, if it turns out that one java rotation matrix multiplication is faster than one call to glRotatef.)

Let's for the moment skip shaders because I don't even know what they are (well, I've done some OpenCL and just guess that OpenGL shaders might be some kind of mini apps running on the gpu) :). OpenGL beginner here, but I very much need to optimize performance in my simple but somewhat big application.

Re: Negating some JNI call overhead by doing transformations in Java

Demoscene Passivist
Administrator
>I'm not sure I understand you here. JNI is a factor for my application.
>Calling glRotatef for example in jogl would probably take a factor 10 longer than from a
>C program (probably more).

Are u absolutely sure about ur observation that JNI is the bottleneck? Have u done any benchmarks? I'm asking because usually most people "just think" that JNI has to be slow and blame every performance bottleneck on the JNI call overhead. There is a performance impact, sure, but usually it's around 10% and not a 10x overhead compared to native C :)

Maybe if u don't believe me let me quote Kenneth Russell (the guy that ported Quake 2 to JOGL): "We did fairly extensive experiments a couple of years ago with the Jake2 Quake II port to Java and were not able to isolate JNI overhead as being a culprit on any modern CPU. We have had good success in past years in translating several moderately-sized C++ demos and animation engines to Java and have uniformly achieved 90-100% of the speed of C, and sometimes even better than 100% because HotSpot generates machine code specific to the processor you're currently running on while many C binaries are compiled for a least-common-denominator processor." See here for the full discussion back in the day.


Re: Negating some JNI call overhead by doing transformations in Java

Demoscene Passivist
Administrator
One more thing:
Maybe u experience a significant overhead coz u are not using direct buffers to "transfer" ur matrices ? "Wrong" memory management while using JOGL/JNI definitely has a performance impact.
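
For readers unfamiliar with direct buffers, here is a minimal sketch of what is meant, using only `java.nio`. The class name `DirectMatrixBuffer` and its `fill` method are made up for illustration and are not part of JOGL; the idea is only that the matrix lives in off-heap memory in native byte order, so the binding can hand its address to the driver without copying.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

// Hypothetical helper, not a JOGL class.
final class DirectMatrixBuffer {
    // Allocate once and reuse every frame; allocating a direct buffer per call is expensive.
    private final FloatBuffer buffer = ByteBuffer
            .allocateDirect(16 * Float.BYTES)  // off-heap storage for 16 floats
            .order(ByteOrder.nativeOrder())    // match the platform's byte order
            .asFloatBuffer();

    /** Copies a 16-element column-major matrix into the direct buffer and prepares it for reading. */
    FloatBuffer fill(float[] columnMajor4x4) {
        if (columnMajor4x4.length != 16) {
            throw new IllegalArgumentException("expected 16 floats");
        }
        buffer.clear();
        buffer.put(columnMajor4x4);
        buffer.flip();
        return buffer;
    }

    public static void main(String[] args) {
        float[] identity = {1, 0, 0, 0,  0, 1, 0, 0,  0, 0, 1, 0,  0, 0, 0, 1};
        FloatBuffer fb = new DirectMatrixBuffer().fill(identity);
        System.out.println("direct=" + fb.isDirect() + ", floats=" + fb.remaining());
        // On a JOGL GL2 context the buffer would then be passed to gl.glLoadMatrixf(fb).
    }
}
```

Whether this makes a measurable difference for a single 16-float matrix is something to benchmark rather than assume.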

Re: Negating some JNI call overhead by doing transformations in Java

Michael Bien
In reply to this post by GiGurra

On 07/21/2011 02:28 PM, GiGurra [via jogamp] wrote:

>
> I'm not sure I agree with you here. JNI is a factor.
> Calling glRotatef for example in jogl would probably take a factor 10 longer
> than from a C program (probably more).
> Let's say I have 10 of those. Why not just make a single transformation
> matrix of that in java and THEN make one glLoadMatrixf to opengl through
> jogl? surely that must be an advantage if the application has a significant
> amount of transformations.
>
> (let's for the moment skip shaders cause i dont even know what they are :),
> opengl beginner here but I very much need to optimize performance)
>
>
what you want are shaders :)
glRotatef, for example, no longer exists in the OpenGL 4 core profile. All
those transformations are done either in a shader and/or, as you said, on
the host using matrices. The whole intention behind most GL features
introduced in the last 15 years was the reduction of API calls:
VertexArrays, DisplayLists, VBOs, shaders, removal of the fixed function
pipeline etc.

if you can, do as much as possible on the GPU.

regards,
michael

--
http://michael-bien.com/


Re: Negating some JNI call overhead by doing transformations in Java

Wade Walker
Administrator
In reply to this post by Demoscene Passivist
I benchmarked JNI call rate a while back when someone asked me about this on my blog (benchmark code at http://wadeawalker.wordpress.com/2010/10/17/tutorial-faster-rendering-with-vertex-buffer-objects/#comment-108). The result was I could make anywhere from ~300,000 to ~4,000,000 JNI calls per second on a 2.4GHz machine, depending on the amount of loop unrolling performed by the JVM. I didn't test to see how much was "extra" due to JNI (compared to just a normal Java call).
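
A rough sketch of how such a call-rate measurement can be set up is below. It is not the benchmark from the blog; `CallTimer`, `nanosPerCall` and the `Math.sqrt` stand-in workload are made up for illustration. The points it tries to show are the warm-up phase (so the JIT has compiled the loop before timing starts) and a `volatile` sink so the work is not optimized away.

```java
// Hypothetical timing harness, not the benchmark linked above.
final class CallTimer {
    // Written by the workload so the JIT cannot discard it as dead code.
    static volatile double sink;

    /** Returns the average nanoseconds per invocation of task, measured after a warm-up phase. */
    static double nanosPerCall(Runnable task, int warmupCalls, int measuredCalls) {
        for (int i = 0; i < warmupCalls; i++) {
            task.run();  // give the JIT time to compile the hot path
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredCalls; i++) {
            task.run();
        }
        long elapsed = System.nanoTime() - start;
        return (double) elapsed / measuredCalls;
    }

    public static void main(String[] args) {
        // Stand-in workload; replace the lambda with the call to be measured,
        // for example a glRotatef on a current GL context.
        double ns = nanosPerCall(() -> sink = Math.sqrt(sink + 2.0), 200_000, 2_000_000);
        System.out.printf("%.1f ns per call, about %.0f calls per second%n", ns, 1e9 / ns);
    }
}
```

When timing GL calls this way, keep in mind that a driver may queue commands and return before the work is done, so the number reflects the cost of issuing the call, not necessarily of executing it.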

Re: Negating some JNI call overhead by doing transformations in Java

gouessej
Administrator
In reply to this post by GiGurra
GiGurra, you overestimate the overhead due to JNI in my humble opinion. The ancestor of my first person shooter was written in C++ and I have never observed such differences of performance between C/C++ and Java when using OpenGL. I agree with Wade. If you don't want to use shaders, you should at least avoid using immediate rendering and look at Wade's tutorial about VBOs.

Demoscene Passivist, I meant "CPU". I looked at the source code of JMonkeyEngine 3.0, which is heavily shader-based, and I asked some people to explain to me some things I didn't understand. Most of the transforms are done on the CPU side and then sent to the shader. Momoko_Fan, one of the authors of JMonkeyEngine, explained this to me there:
http://www.java-gaming.org/index.php/topic,23544.0.html

A shader-oriented engine does not use the predefined OpenGL uniforms such as gl_ModelViewProjectionMatrix, or the glPushMatrix/PopMatrix calls.

Instead the matrices are composed by the engine before the object is rendered and then those matrices are available through custom-defined uniforms to the shader.

Someone else wrote that:
When I was profiling old versions of my graphics engine, I found out that because of JNI overhead, it is actually slower to use glPush and glPop and make multiple glMatrix commands to set up the transform for each object.  It's faster to combine a scene graph's transforms in memory and track a final model transform for each object.  Then in the fixed pipeline it's only one graphics call, and in a shader engine, you only need to have one mat4 for the model.  HTH and I can explain better if you have questions.
Julien Gouesse | Personal blog | Website

Re: Negating some JNI call overhead by doing transformations in Java

Michael Bien
In reply to this post by Wade Walker
  4 mil JNI calls/s are not much. but who knows what the driver does if
you ask him for the version (maybe it gets parsed from the readme file
or so :P )

the last time i checked the average JNI call took between 11 and 15ns,
benchmarked on ubuntu/i7/64bit server VM. The overhead was largely
independent of the argument count (compared to JNA). But in practice a
JNI function isn't empty...

one reason for example why JOCL allows in the runtime API only direct
allocated buffers is simplified binding code to reduce call overhead to
a minimum.

On 07/21/2011 03:29 PM, Wade Walker [via jogamp] wrote:
> I benchmarked JNI call rate a while back when someone asked me about this on
> my blog (benchmark code at
> http://wadeawalker.wordpress.com/2010/10/17/tutorial-faster-rendering-with-vertex-buffer-objects/#comment-108).
> The result was I could make anywhere from ~300,000 to ~4,000,000 JNI calls
> per second on a 2.4GHz machine, depending on the amount of loop unrolling
> performed by the JVM. I didn't test to see how much was "extra" due to JNI
> (compared to just a normal Java call).
>


Re: Negating some JNI call overhead by doing transformations in Java

GiGurra
In reply to this post by gouessej
Thanks everyone for your answers. Assuming I move to a setup (still without custom shaders) where I only send one transformation matrix before, for example, glDrawElements (which is how I transmit vertices), what would be the fastest method of transmitting it from host to gpu? Just a standard glLoadMatrixf with a direct FloatBuffer?

The reason I ask all this is that I have relatively low complexity models but a high number of transforms, on up to 30 or so displays (I know, sounds crazy but required :)), each updating at 50-60 Hz.

Re: Negating some JNI call overhead by doing transformations in Java

gouessej
Administrator
Compose the transform matrix of each object on the CPU side (by combining all transforms concerning each object) and then call glLoadMatrixf once per object.
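
A minimal sketch of that composition in plain Java follows. `Mat4` and its methods are hypothetical names for illustration, not taken from Ardor3D, JMonkeyEngine or JOGL. It stores the matrix column-major (the layout glLoadMatrixf expects) and post-multiplies, so a chain of calls reads in the same order as the equivalent glTranslatef/glRotatef sequence.

```java
import java.util.Arrays;

// Hypothetical host-side matrix, for illustration only.
final class Mat4 {
    /** Column-major storage: element (row, col) lives at index col * 4 + row. */
    final float[] m = {1, 0, 0, 0,  0, 1, 0, 0,  0, 0, 1, 0,  0, 0, 0, 1};

    /** this = this * o, the same post-multiply order as the fixed-function matrix calls. */
    private Mat4 mul(float[] o) {
        float[] r = new float[16];
        for (int col = 0; col < 4; col++) {
            for (int row = 0; row < 4; row++) {
                float sum = 0f;
                for (int k = 0; k < 4; k++) {
                    sum += m[k * 4 + row] * o[col * 4 + k];
                }
                r[col * 4 + row] = sum;
            }
        }
        System.arraycopy(r, 0, m, 0, 16);
        return this;
    }

    Mat4 translate(float x, float y, float z) {
        return mul(new float[] {1, 0, 0, 0,  0, 1, 0, 0,  0, 0, 1, 0,  x, y, z, 1});
    }

    Mat4 rotateZ(float degrees) {
        float c = (float) Math.cos(Math.toRadians(degrees));
        float s = (float) Math.sin(Math.toRadians(degrees));
        return mul(new float[] {c, s, 0, 0,  -s, c, 0, 0,  0, 0, 1, 0,  0, 0, 0, 1});
    }

    Mat4 scale(float x, float y, float z) {
        return mul(new float[] {x, 0, 0, 0,  0, y, 0, 0,  0, 0, z, 0,  0, 0, 0, 1});
    }

    public static void main(String[] args) {
        // Three transforms composed on the host, ready to be sent in a single call.
        Mat4 model = new Mat4().translate(10f, 0f, 0f).rotateZ(90f).translate(1f, 0f, 0f);
        System.out.println(Arrays.toString(model.m));
        // With JOGL, on a GL2 context: gl.glLoadMatrixf(model.m, 0); then draw the object.
    }
}
```

This version allocates a temporary array per operation to keep the code short; a performance-minded implementation would reuse scratch arrays or update elements in place.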
Julien Gouesse | Personal blog | Website

Re: Negating some JNI call overhead by doing transformations in Java

GiGurra
Thanks everyone. Everything is now clear. I will attempt to compose a transformation matrix on the cpu, then send it with glLoadMatrixf, and then call glDrawElements to get my vertices to the gpu. Maybe in the future I'll also look into shaders.


Holy crap jni call between 11-15 ns like Michael says? Woa I must be out somewhere way off ^^.
I guess gone are the days when they took microseconds. (all benchmarks i could find online said it would
take a microsecond or so, or more, so when my gl calls took 1 microsecond I assumed it was all jni overhead :D)

Re: Negating some JNI call overhead by doing transformations in Java

gouessej
Administrator
If you need some help for composing the transform matrices, look at the source code of Ardor3D and its JOGL renderer. For shaders, rather look at JMonkeyEngine 3 (my JOGL 2.0 renderer for this engine is still on its SVN repository as far as I know).

If your geometry does not change, I advise you to use static VBOs. They give an additional speedup on drivers supporting them correctly (on others, dynamic and static VBOs are as fast as plain vertex arrays). Display lists would be interesting if their implementations on some modern graphics cards were less poor :(

Lots of wrong things are said about Java and JNI. Java has a bad reputation when it comes to performance, which is not fair. Michael is right about JNA's performance. OpenTK (a set of bindings for .NET containing an OpenGL binding) uses a mechanism similar to the one used by JNA to call native code on Windows (Microsoft Platform Invoke), and the result is that OpenTK is twice as slow as JOGL.
Julien Gouesse | Personal blog | Website

Re: Negating some JNI call overhead by doing transformations in Java

GiGurra
Thanks. Might give VBOs a try. Composing matrices wont be a problem :). Shaders...err.....sometime later ^^

I'm aware that java is very fast. What surprises me are Michael's numbers on JNI calls of >11 ns. That is really good; the sources I found online hinted that the call overhead should be on the order of microseconds, which is why I was concerned. Maybe I'll run some experiments on this later. But if it's just 10-20 ns, then it surely is *nothing* :)

Btw if there is any interest my project is an "instrument loader" for (arbitrary) flight sims.

Basically u inject an implementation specific data io module into the application and it grabs data from (in theory) any sim/game.
Looks something like this (running here 60fps on both the game and the jogl screen to the left). Cpu usage is low and good :).

http://www.gigurra.se/gear/test11.png

Re: Negating some JNI call overhead by doing transformations in Java

gouessej
Administrator
You will have to take care of the gimbal lock in this case. That is a subject I know :)
Julien Gouesse | Personal blog | Website

Re: Negating some JNI call overhead by doing transformations in Java

Sven Gothel
Administrator
In reply to this post by gouessej
On Thursday, July 21, 2011 10:33:12 PM gouessej [via jogamp] wrote:
>
> Compose the transform matrix of each object in the CPU side (by combining all
> transforms concerning each object) and then call glLoadMatrixf one time per
> object.

http://jogamp.org/deployment/jogamp-next/javadoc/jogl/javadoc/com/jogamp/opengl/util/PMVMatrix.html

You can use the PMVMatrix, which implements all matrix methods of the fixed function pipeline (FFP).
We use it for our FFP emulation, as well as for some ES2 demos.

~Sven

Re: Negating some JNI call overhead by doing transformations in Java

GiGurra
Cool! Thanks

So if I understand you right, PMVMatrix objects are entirely handled on the CPU? Or do some of its functions send stuff to the GPU, or do I handle that myself?

Basically, I create a PMVMatrix (is this an on-cpu matrix stack then?), I transform it as I want, then I use the standard gl.glLoadMatrixf to load the finished PMVMatrix? Or does absolutely everything go through PMVMatrix like I would do FFP programming normally and I just sit back and relax? :)

EDIT:
Got it. It is just an on host matrix stack.

EDIT2: Trying the PMVMatrix class here a bit, but it seems to be significantly slower than standard ffp calls, so either I make my own implementation or I stick with standard ffp calls. Regardless, I can use the PMVMatrix math for help should I need it :). So thx.
(With -server and letting the jit compiler have a warmup period of a couple of thousand frames, the ffp calls are faster by a factor of 2-4. Initially they are faster by a factor of 20 or so, but that goes away after a few seconds and should be expected.)

I will proceed to write my own much simpler implementation, as I assume the PMVMatrix class was written more for robustness and compatibility than for pure speed.

EDIT3:
Using my own transformation class (currently only a matrix class, so it doesn't do push/pop) I am able to speed up the actual transformation times 20x compared to native ffp calls. I am also using the fact that I'm only drawing 2d to limit the number of ops (meaning translate/rotate do not call a matrix multiply, but instead directly affect only specific elements).
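
To illustrate the kind of 2d shortcut described here, a sketch follows. This is not the actual class from the project; `Affine2D` and its method names are made up. A 2d affine transform only needs six numbers, so translate touches two of them and rotate four, with no general 4x4 multiply. The result is expanded to a column-major 4x4 array only when it is time to hand it to glLoadMatrixf.

```java
import java.util.Arrays;

// Hypothetical 2d transform, for illustration only.
final class Affine2D {
    // Represents the matrix
    // | a  c  tx |
    // | b  d  ty |
    // | 0  0  1  |
    float a = 1f, b = 0f, c = 0f, d = 1f, tx = 0f, ty = 0f;

    /** Post-multiplies by a translation: only tx and ty change. */
    Affine2D translate(float x, float y) {
        tx += a * x + c * y;
        ty += b * x + d * y;
        return this;
    }

    /** Post-multiplies by a counter-clockwise rotation: only a, b, c, d change. */
    Affine2D rotate(float radians) {
        float cos = (float) Math.cos(radians);
        float sin = (float) Math.sin(radians);
        float na = a * cos + c * sin;
        float nb = b * cos + d * sin;
        c = c * cos - a * sin;  // uses the old a
        d = d * cos - b * sin;  // uses the old b
        a = na;
        b = nb;
        return this;
    }

    /** Post-multiplies by a scale: each column is scaled by its factor. */
    Affine2D scale(float sx, float sy) {
        a *= sx; b *= sx;
        c *= sy; d *= sy;
        return this;
    }

    /** Writes the transform into a 16-float column-major array and returns it. */
    float[] toColumnMajor4x4(float[] out) {
        Arrays.fill(out, 0f);
        out[0] = a;   out[1] = b;
        out[4] = c;   out[5] = d;
        out[10] = 1f;
        out[12] = tx; out[13] = ty;
        out[15] = 1f;
        return out;
    }

    public static void main(String[] args) {
        float[] scratch = new float[16];  // reused, so no allocation per object
        Affine2D t = new Affine2D().translate(10f, 0f).rotate((float) Math.toRadians(90)).translate(1f, 0f);
        System.out.println(Arrays.toString(t.toColumnMajor4x4(scratch)));
        // With JOGL, on a GL2 context: gl.glLoadMatrixf(scratch, 0);
    }
}
```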

About native ffp calls: glLoadMatrix seems to take as much time as 2 or 3 glRotate calls, so it is only worthwhile to make your own stuff if you need more than that. So unless I suddenly need 10 transformations per object I won't gain much by using host transformations, because the loadmatrix will cancel out the benefit anyway.