
Performance Issues with VBOs


bgroenks96


I'm sure this is an issue that's come up before, and I'm sure I'm doing something horrendously wrong. Basically, I can't match let alone exceed the performance of immediate mode drawing with VBOs.

The library I'm working on is set up as a higher-level abstraction around the direct JOGL function calls.

Where I'm having the issues right now is in the code for drawing textured quads with VBOs and GL_TRIANGLE_STRIP. Everything draws correctly, but the performance is lagging behind immediate mode functions by nearly 50%.

The process is broken down into three main components, each of which I will show separately to make things clearer: allocation, modification, and drawing. There are two unique things being rendered: a static background image colored with glColor4f(0.5f, 0.5f, 0.6f, 1), and a quad with a texture that displays a semi-transparent circle (blending is enabled). The background quad is drawn once; the textured quad is drawn 100 times per frame, with its location offset by a random number on each frame (this is just for testing).

The buffers are initialized with this method:

// rectBuffInd is an int[] to hold the buffer IDs for rectangle VBOs
// ind is just an int specifying the current index to store the id in
gl.glGenBuffers(1, rectBuffInd, ind);
gl.glBindBuffer(GL_ARRAY_BUFFER, rectBuffInd[ind]);
if(textured) {
    // 4 vertices (x, y) followed by 4 texture coordinates (s, t)
    gl.glBufferData(GL_ARRAY_BUFFER, 16 * Buffers.SIZEOF_FLOAT,
            null, storeType.usage);
} else {
    // 4 vertices (x, y) only
    gl.glBufferData(GL_ARRAY_BUFFER, 8 * Buffers.SIZEOF_FLOAT,
            null, storeType.usage);
}

gl.glBindBuffer(GL_ARRAY_BUFFER, 0);

// this probably looks weird but it only runs once, and I need it to map IDs properly
int id = rectBuffInd[ind];
Arrays.sort(rectBuffInd);
ind = Arrays.binarySearch(rectBuffInd, id);
return ind;

"storeType" is an enum whose "usage" field holds the GL usage flag. This is GL_STREAM_DRAW for the textured quads that are updated each frame and GL_STATIC_DRAW for the background image. Likewise, texBound and texEnabled are only true for the textured quads.

The next part actually stores the vertex data. This runs every frame for the texture quads and only once for the background image:

int buffSize = 8 * Buffers.SIZEOF_FLOAT;
if(texBound && texEnabled)
    buffSize *= 2;
gl.glBindBuffer(GL_ARRAY_BUFFER, rectBuffId);
gl.glBufferData(GL_ARRAY_BUFFER, buffSize, null, storeType.usage); // discard previous data
ByteBuffer buff = gl.glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY); // upload via write-only mapping
FloatBuffer floatBuff = buff.order(ByteOrder.nativeOrder()).asFloatBuffer();
floatBuff.put(x); floatBuff.put(y);
floatBuff.put(x); floatBuff.put(y + ht);
floatBuff.put(x + wt); floatBuff.put(y);
floatBuff.put(x + wt); floatBuff.put(y + ht);
if(texBound && texEnabled)
    floatBuff.put(texCoords);
gl.glUnmapBuffer(GL_ARRAY_BUFFER);
gl.glBindBuffer(GL_ARRAY_BUFFER, 0);

Finally, the drawing code:

// draw all quads in vertex buffer
gl.glBindBuffer(GL_ARRAY_BUFFER, buffId);
gl.glEnableClientState( GL_VERTEX_ARRAY);
gl.glVertexPointer(2, GL_FLOAT, 0, 0);
if(texBound && texEnabled) {
    gl.glEnableClientState(GL_TEXTURE_COORD_ARRAY);
    gl.glTexCoordPointer(2, GL_FLOAT, 0, 8 * Buffers.SIZEOF_FLOAT);
}
gl.glDrawArrays(GL_TRIANGLE_STRIP, 0, nverts);

gl.glBindBuffer(GL_ARRAY_BUFFER, 0);
gl.glDisableClientState(GL_VERTEX_ARRAY);
if(texBound && texEnabled)
    gl.glDisableClientState(GL_TEXTURE_COORD_ARRAY);

Again, everything looks correct (it's a blob of randomly moving circles), but the performance is the issue.

With this procedure I get ~650 FPS.
With the equivalent immediate mode functions I get ~1100 FPS.

Now this I find REALLY weird. If I change GL_WRITE_ONLY to GL_READ_WRITE, I get ~900 FPS. Much better (although still not matching immediate mode :/ )

How does that make sense? Shouldn't GL_READ_WRITE be slower? It's setting up the mapping as a two-way bus right? I don't need to read anything off of the VBO, I just need to keep uploading new data.

I would greatly appreciate any suggestions!

gouessej

Re: Performance Issues with VBOs

Administrator

Hi

Don't use glMapBuffer and glUnmapBuffer, rather create a direct NIO buffer and call glBufferSubData or glBufferData.

Keep in mind that retained mode (vertex arrays, VBOs, ...) is faster than immediate mode only when you draw more than a handful of primitives; it depends on the hardware too.

Julien Gouesse

bgroenks96

Re: Performance Issues with VBOs


Hi

Could you give me a brief example of using glBufferSubData in place of the mapping procedure I'm using here? When I tried it, the data seemed to be lost on most frames, as though it was being drawn on the wrong display buffer.

Also, when you say few primitives, do you mean a small number of objects or objects with few vertices? Should retained mode still usually match the performance of immediate mode?

Thanks for your time.

bgroenks96

Re: Performance Issues with VBOs

In reply to this post by gouessej

OK, I ditched the buffer mapping and used glBufferSubData instead. This time, I got it to work. I think the earlier problems were due either to FloatBuffer byte ordering or to a missing call to Buffer.flip():

int buffSize = 8 * Buffers.SIZEOF_FLOAT;
if(texBound && texEnabled)
    buffSize *= 2;
gl.glBindBuffer(GL_ARRAY_BUFFER, rectBuffId);
gl.glBufferData(GL_ARRAY_BUFFER, buffSize, null, usage.usage);
FloatBuffer floatBuff = Buffers.newDirectByteBuffer(buffSize).asFloatBuffer();
floatBuff.put(x); floatBuff.put(y);
floatBuff.put(x); floatBuff.put(y + ht);
floatBuff.put(x + wt); floatBuff.put(y);
floatBuff.put(x + wt); floatBuff.put(y + ht);
if(texBound && texEnabled)
    floatBuff.put(texCoords);
floatBuff.flip();
gl.glBufferSubData(GL_ARRAY_BUFFER, 0, buffSize, floatBuff);

This gets me to about 740 FPS, which is better than the original mapping code but worse than when GL_READ_WRITE was used.
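For anyone hitting the same two suspects (byte order and the missing flip), here is a small sketch using only java.nio, no GL context needed. The class and method names (BufferPitfalls, remainingAfterWrite, defaultOrder) are made up for illustration. It shows that a freshly allocated ByteBuffer always starts out big-endian, and that a buffer that has been filled but not flipped has nothing left between its position and its limit, which is the region a consumer of the buffer normally reads.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

/** Hypothetical demo class; not part of JOGL or of the code in this thread. */
final class BufferPitfalls {

    /** Floats readable from the current position after writing n floats. */
    static int remainingAfterWrite(int n, boolean flip) {
        FloatBuffer fb = ByteBuffer.allocateDirect(n * Float.BYTES)
                .order(ByteOrder.nativeOrder()) // set the order BEFORE creating the float view
                .asFloatBuffer();
        for (int i = 0; i < n; i++) {
            fb.put((float) i);
        }
        if (flip) {
            fb.flip(); // limit = position, position = 0
        }
        return fb.remaining();
    }

    /** Byte order of a new ByteBuffer when order(...) is never called. */
    static ByteOrder defaultOrder() {
        return ByteBuffer.allocateDirect(4).order();
    }

    public static void main(String[] args) {
        System.out.println("default order:  " + defaultOrder());
        System.out.println("native order:   " + ByteOrder.nativeOrder());
        System.out.println("without flip(): " + remainingAfterWrite(8, false));
        System.out.println("with flip():    " + remainingAfterWrite(8, true));
    }
}
```

Buffers.newDirectByteBuffer in the code above should already hand back a native-ordered buffer, so the flip() is the more likely culprit of the two, but that is a guess from the symptoms rather than something verified here.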

I'm going to try rewriting some of the abstraction code to allow for more control over the buffer allocation process. The code above is being called every frame 100 times (100 objects) with each object also having the 'draw' method called. It seems to me because of how VBOs work that maybe it would be more efficient to allocate a big VBO that can hold all of the vertex data for all 100 objects, then draw them all at once.
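A rough sketch of what that "one big VBO" idea could look like on the CPU side, assuming every quad has the same width and height and ignoring texture coordinates for brevity. QuadBatch and pack are hypothetical names, and the GL upload is only indicated in a comment because it needs a live context. The vertex order per quad matches the triangle-strip order used in the code earlier in the thread.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

/** Hypothetical helper: packs many quads into one buffer for a single upload. */
final class QuadBatch {
    static final int FLOATS_PER_QUAD = 8; // 4 vertices * (x, y)

    /** xs[i], ys[i] is the corner of quad i; wt and ht are shared by all quads. */
    static FloatBuffer pack(float[] xs, float[] ys, float wt, float ht) {
        FloatBuffer buf = ByteBuffer
                .allocateDirect(xs.length * FLOATS_PER_QUAD * Float.BYTES)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
        for (int i = 0; i < xs.length; i++) {
            float x = xs[i], y = ys[i];
            buf.put(x).put(y);
            buf.put(x).put(y + ht);
            buf.put(x + wt).put(y);
            buf.put(x + wt).put(y + ht);
        }
        buf.flip(); // make the written floats readable from position 0
        return buf;
    }

    public static void main(String[] args) {
        FloatBuffer buf = pack(new float[] {0f, 10f}, new float[] {0f, 5f}, 2f, 3f);
        System.out.println(buf.remaining() + " floats, "
                + buf.remaining() * Float.BYTES + " bytes");
        // With a current GL context, one upload per frame would replace 100 small ones, e.g.:
        // gl.glBufferData(GL_ARRAY_BUFFER, buf.remaining() * Float.BYTES, buf, GL_STREAM_DRAW);
    }
}
```

In a real renderer the direct buffer would presumably be allocated once and reused with clear() each frame rather than reallocated, since direct allocation is comparatively expensive.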

jmaasing

Re: Performance Issues with VBOs

bgroenks96 wrote

I'm going to try rewriting some of the abstraction code to allow for more control over the buffer allocation process. The code above is being called every frame 100 times (100 objects) with each object also having the 'draw' method called. It seems to me because of how VBOs work that maybe it would be more efficient to allocate a big VBO that can hold all of the vertex data for all 100 objects, then draw them all at once.

I'm playing around with making a little game. The code isn't very optimized, but the meshes are very small, so I had a decent frame rate until I started drawing more objects (not many though, around 50 per frame). That hit my FPS pretty hard, so I batched all meshes that use the same shader into one big array and now draw them with gl.glMultiDrawArrays.

That really made a difference, almost back to 60 FPS now :-) So yeah, in my scenario the draw calls were really costly.

bgroenks96

Re: Performance Issues with VBOs

Interesting. What is glMultiDrawArrays? Why did you use that instead of just glDrawArrays?

jmaasing

Re: Performance Issues with VBOs

It renders multiple sets of primitives from array data. The assumption is that several objects have the same shader and you draw them all with one draw call.

http://www.opengl.org/wiki/Vertex_Rendering#Multi-Draw
http://www.opengl.org/sdk/docs/man/xhtml/glMultiDrawArrays.xml
http://www.opengl.org/wiki/Vertex_Specification_Best_Practices
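
To make the difference from glDrawArrays concrete: glMultiDrawArrays takes an array of start indices and an array of vertex counts, and behaves like one glDrawArrays call per (first, count) pair. For quads stored back to back as 4-vertex triangle strips, a single glDrawArrays over the whole buffer would join neighbouring quads into one long strip, while the multi-draw keeps each strip separate. A minimal sketch of building those two arrays follows; MultiDrawRanges, firsts and counts are hypothetical names, and the JOGL call in the comment assumes the int[] overload with offsets.

```java
import java.util.Arrays;

/** Hypothetical helper: builds the first/count arrays for glMultiDrawArrays. */
final class MultiDrawRanges {

    /** Start vertex of each object when objects are stored back to back. */
    static int[] firsts(int objectCount, int vertsPerObject) {
        int[] first = new int[objectCount];
        for (int i = 0; i < objectCount; i++) {
            first[i] = i * vertsPerObject;
        }
        return first;
    }

    /** Vertex count of each object (all equal in this sketch). */
    static int[] counts(int objectCount, int vertsPerObject) {
        int[] count = new int[objectCount];
        Arrays.fill(count, vertsPerObject);
        return count;
    }

    public static void main(String[] args) {
        int[] first = firsts(3, 4);
        int[] count = counts(3, 4);
        System.out.println(Arrays.toString(first));
        System.out.println(Arrays.toString(count));
        // With a current GL context and the VBO bound (assumed overload):
        // gl.glMultiDrawArrays(GL_TRIANGLE_STRIP, first, 0, count, 0, first.length);
    }
}
```

Since the ranges only change when objects are added or removed, the two arrays can be built once and reused every frame.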

bgroenks96

Re: Performance Issues with VBOs

Thank you jmaasing!

I rewrote the abstraction and implemented the multi-object VBO with glMultiDrawArrays as you suggested, and now I'm getting about a 45% increase in FPS over immediate mode.

I'm using glMapBuffer, however. I got it to work properly and get the best performance with GL_READ_WRITE.

I tried glBufferSubData but was having trouble getting it to handle the multiple object data correctly, and the performance wasn't nearly as good.

Does anyone know why GL_READ_WRITE is faster than GL_WRITE_ONLY?