Reply – Re: Parallel computation on CPU faster than GPU!
In Reply To
Re: Parallel computation on CPU faster than GPU!
— by Giovanni Idili
First of all thanks for taking the time to help on this!

notzed wrote
1. Copy any cpu initialised values to the gpu once (or even initialise using a kernel if the data is large and generated algorithmically).
I guess by this you mean the constants that I am passing down at every cycle? This makes a lot of sense, but I am not sure how to do it.

notzed wrote
2. Run your kernel multiple times without any cpu synchronisation, and just swap the arguments for the input/output pipeline, e.g.:
  setargs(0, in);
  setargs(1, out);
  setargs(0, out);
  setargs(1, in);
  queuekernel(OUT, IN);

From a cursory look at the algorithm you could probably run the loop itself entirely on the gpu anyway - I haven't looked closely, but it appears each kernel calculates a value independently of all other kernels, and every time the same one (by iGid) will always be working on the values it calculated last time. If that is the case you could also just use the same memory for input and output and simplify memory management as a bonus (in this case the kernel would also have to dump out sample results as described in the next point). I noticed you have a bug anyway - after the first loop it's just using Vout for both Vin and Vout (or maybe that isn't a bug, but if it isn't you're doing even more redundant copying).
Wow, swapping the arguments for the I/O pipeline is a great suggestion - I didn't know this could be done at all (remember, I am a n00b!). Would you be able to point me to some example? I will also definitely follow your advice to move the loop into the kernel; at the moment I am getting the data out to plot it, but when this gets plugged into the bigger picture I have in mind I won't need that. As for the bug, I don't think it is one, because the results I am plotting show the curves I expect. This is a dumb example: effectively I am running the same thing N times in parallel just to test it out, so there is a lot of redundancy. In theory the inputs will be different for each of the items, but at the moment they are all the same.

notzed wrote
3. Copy any sample results out on the gpu using another simple kernel/bit of code tacked onto the end.  You could just pass the iteration count to tell it where to write the answer.  Is this only for debugging anyway?
This sounds like another clever trick - are you suggesting I store results as I go along in some buffer in the GPU's memory, then run another kernel at the end to harvest the results from that buffer?

notzed wrote
There are also some simple code tweaks.
Thanks a lot for the awesome code-tweak suggestions - I will now try to follow your steps and see how they translate into performance gains on my GPU. Will post back results.

The only thing that still worries me a bit is that my baseline on the GPU (ATI HD 4850) is around 13000ms, while yours is 1875ms and Michael's is 3000ms. Trying to make sense of that, I suspect it's because I have the 'Mobility' version of the card (I am on a 27'' iMac), but would that explain such a huge difference?