
Re: Parallel computation on CPU faster than GPU!

Posted by notzed on May 21, 2011; 1:35am
URL: https://forum.jogamp.org/Parallel-computation-on-CPU-faster-than-GPU-tp2963506p2967779.html

Giovanni Idili wrote
First of all thanks for taking the time to help on this!
I've been knee-deep in this stuff for the last few months (or similar stuff for years) and I spotted a few obvious, simple things that were easy to address.  Might be a bit off-topic for the forum, but it's a bit quiet otherwise.
Giovanni Idili wrote
notzed wrote
1. Copy any CPU-initialised values to the GPU once (or even initialise using a kernel if the data is large and generated algorithmically).
I guess by this you mean the constants that I am passing down at every cycle? This makes a lot of sense, but I am not sure how to do it.
Actually I meant the input values, vin, etc.  The non-memory arguments are handled by the driver (kernel.setArg() etc); they are potentially cached driver-side anyway, and in any event are too small to matter in most cases.  It's the memory buffer synchronisation that slows things down.
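
For illustration, here's a minimal sketch using JogAmp's high-level JOCL API (the kernel, names and sizes are all made up): upload the inputs once before the loop, and only enqueue the kernel inside it.

    import java.nio.FloatBuffer;
    import com.jogamp.opencl.*;
    import static com.jogamp.opencl.CLMemory.Mem.*;

    public class UploadOnce {
        // Trivial stand-in kernel - yours would do the real work.
        private static final String SRC =
            "kernel void step(global const float* vin, global float* vout, int n) {" +
            "    int i = get_global_id(0);" +
            "    if (i < n) vout[i] = vin[i] + 1.0f;" +
            "}";

        public static void main(String[] args) {
            CLContext context = CLContext.create();
            try {
                CLCommandQueue queue = context.getMaxFlopsDevice().createCommandQueue();
                CLKernel kernel = context.createProgram(SRC).build().createCLKernel("step");

                int n = 1024;
                CLBuffer<FloatBuffer> vin  = context.createFloatBuffer(n, READ_ONLY);
                CLBuffer<FloatBuffer> vout = context.createFloatBuffer(n, READ_WRITE);
                for (int i = 0; i < n; i++) vin.getBuffer().put(i, 1.0f);  // CPU-side init

                queue.putWriteBuffer(vin, false);           // one-off host-to-device copy
                kernel.putArg(vin).putArg(vout).putArg(n);  // bind once; buffers persist on the device

                for (int iter = 0; iter < 1000; iter++) {
                    // no putWriteBuffer/putReadBuffer in here -
                    // the data already lives on the device
                    queue.put1DRangeKernel(kernel, 0, n, 0);
                }
                queue.putReadBuffer(vout, true);            // read results back once, at the end
                System.out.println("vout[0] = " + vout.getBuffer().get(0));
            } finally {
                context.release();
            }
        }
    }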

Giovanni Idili wrote
Wow, swapping the arguments for I/O pipelines is a great suggestion - I didn't know this could be done at all (remember, I am n00b!). Would you be able to point me to an example?
I don't have a specific example handy, but the one I gave should be simple enough.  Perhaps think of allocations on the device like any other memory allocations: they stay around, with their contents, until you free them, at least as far as any kernel is concerned.  In CPU code you don't (normally) copy an array just to call a function that works on it, and a GPU is no different.

Your code was already basically doing this, but it was copying the same data to the CPU and then back to the GPU every loop - to no purpose or effect.
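
In sketch form the swap looks something like this (carrying on from the hypothetical setup above; assume the kernel takes its input buffer in argument 0 and its output in argument 1):

    // Two persistent device buffers whose roles alternate each iteration,
    // instead of round-tripping the data through the host.
    CLBuffer<FloatBuffer> a = context.createFloatBuffer(n, READ_WRITE);
    CLBuffer<FloatBuffer> b = context.createFloatBuffer(n, READ_WRITE);
    queue.putWriteBuffer(a, false);       // initial state uploaded once

    for (int iter = 0; iter < iterations; iter++) {
        kernel.setArg(0, a);              // this pass reads from a ...
        kernel.setArg(1, b);              // ... and writes into b
        queue.put1DRangeKernel(kernel, 0, n, 0);

        CLBuffer<FloatBuffer> tmp = a;    // swap: the last output becomes the next input
        a = b;
        b = tmp;
    }
    queue.putReadBuffer(a, true);         // after the final swap the latest state is in 'a'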

Giovanni Idili wrote
notzed wrote
3. Copy any sample results out on the GPU using another simple kernel/bit of code tacked onto the end.  You could just pass the iteration count to tell it where to write the answer.  Is this only for debugging anyway?
This sounds like another clever trick - are you suggesting I store results as I go along in some buffer in the shared memory of the GPU, then run another kernel at the end to harvest results from that buffer?
More or less, but just write the results to global memory - local memory only exists for the duration of the kernel execution, and is only shared amongst work-items in the same workgroup (I think this is how your original code was treating the global memory).  Once the results have been written by the kernel you can simply read that memory from the CPU.  I mentioned an additional kernel in case you didn't want to put it into the main kernel and just wanted to grab some samples from the 'current result'.

But since in this case it's only for debugging it probably isn't too important how it's done.
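
For the record though, the 'pass the iteration count' version is only a few lines (buffer and argument positions invented for the example): work-item 0 drops one sample per pass into a global results buffer, and the host reads the lot back in a single transfer at the end.

    // Kernel side: one work-item records a sample per iteration into global memory.
    //     kernel void step(global float* state, global float* samples, int iter) {
    //         int i = get_global_id(0);
    //         ... real work on state[i] ...
    //         if (i == 0) samples[iter] = state[0];   // one debug sample per pass
    //     }

    // Host side: harvest everything with a single read after the loop.
    CLBuffer<FloatBuffer> samples = context.createFloatBuffer(iterations, WRITE_ONLY);
    kernel.setArg(1, samples);
    for (int iter = 0; iter < iterations; iter++) {
        kernel.setArg(2, iter);           // tells the kernel where to write this pass's sample
        queue.put1DRangeKernel(kernel, 0, n, 0);
    }
    queue.putReadBuffer(samples, true);   // all samples in one transfer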

Giovanni Idili wrote
The only thing that still worries me a bit is that my baseline on GPU (ATI HD4850) is around 13000ms while yours is 1875ms and Michael's is 3000ms. Trying to make sense of that, I suspect it's the 'Mobility' version of the card (I am on a 27'' iMac), but would that explain the huge difference?
It could easily explain such a difference.  Mobile devices usually have slower memory/IO and sometimes fewer transistors (i.e. fewer processors) at lower clock speeds.  And even apart from that, there are so many different devices with varying capabilities and generations.

On the flip side, you might actually have more to gain from better code that more fully utilises the processors and lowers the memory bandwidth requirements.  Even on a slow card, registers are fast.
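
To make the register point concrete, a contrived pair of kernels (as Java source strings, matching the sketches above - the compiler may well do this for you, but it shows the kind of rewrite I mean): keep the working value in a register and touch global memory once at each end.

    // Global read + write on every step of the inner loop:
    String slow =
        "kernel void slow(global float* v, int steps) {" +
        "    int i = get_global_id(0);" +
        "    for (int s = 0; s < steps; s++)" +
        "        v[i] = v[i] * 0.5f + 1.0f;" +   // hits global memory twice per step
        "}";

    // One global read, the loop runs in a register, one global write:
    String fast =
        "kernel void fast(global float* v, int steps) {" +
        "    int i = get_global_id(0);" +
        "    float x = v[i];" +
        "    for (int s = 0; s < steps; s++)" +
        "        x = x * 0.5f + 1.0f;" +
        "    v[i] = x;" +
        "}";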

 Z