Reply – Re: Parallel computation on CPU faster than GPU!
Your Name
or Cancel
In Reply To
Re: Parallel computation on CPU faster than GPU!
— by notzed notzed
Just to second what Michael said and to expand on it some - you are doing some things quite wrong, but they are easily fixed.

1. You're copying all your data from cpu->gpu running the code then copying it all back again for every loop.  At most you're taking a few samples from one buffer, why copy them all to/fro?  On CL/CPU  these copies are presumably a complete NO-OP, on the GPU each is a full memory copy across devices - all you're really timing are lots of memory copies.
2. Every copy you're copying from the gpu->cpu is synchronous.  Think of putting out a fire with a chain of people with buckets - it still takes just as long in transfer time to move a bucket from one end to the other, but if you already have a spare one to go you don't sit around waiting for work  In your case you're waiting for the whole line of buckets to empty before starting the next lot.
3. Your collating a tiny fraction of the data on the CPU whilst the GPU remains completely idle - even if it isn't very efficient code if you collate on the GPU it will run much faster.  From what I can tell you could very easily collate this on the gpu as well anyway.

To fix:

1. Copy any cpu initialised values to the gpu once (or even initialise using a kernel if the data is large and generated algorithmically).
2. Run your kernel multiple times without any cpu synchronisation, and just swap the arguments for the input/output pipeline, e.g.:
  setargs(0, in);
  setargs(1, out);
  setargs(0 ,out);
  setargs(1, in);
  queuekernel(OUT, IN);

From a cursory look at the algorithm you could probably run the loop itself entirely on the gpu anyway - i haven't looked closely but it appears each kernel calculates a value independently of all other kernels, and every time the same one (by iGid) will always be working on the values it calculated last time.  If that is the case you could also just use the same memory for input and output as well and simplify memory management as a bonus (in this case the kernel would also have to dump out sample results as described in the next point).  I noticed you have a bug anyway - after the first loop it's just using Vout for both Vin and Vout (or maybe that isn't a bug, but if it isn't you're doing even more redundant copying).

3. Copy any sample results out on the gpu using another simple kernel/bit of code tacked onto the end.  You could just pass the iteration count to tell it where to write the answer.  Is this only for debugging anyway?

4. Retrieve the results in one go, and only use 'blocking=true' on the last buffer - this will at least batch up all the copies and do them together rather than waiting for each to complete before moving on.  Assuming you have a defaultly[sic] configured queue, they will guarantee execution order so you can assume that if the last is done, all are.  Or use finish() as suggested.

There are also some simple code tweaks.

For one, i'm not sure how well the compiler will go at registerising arrays - and in any event it is compiler dependent.  Might be better done using 3 floats, or especially for a cpu/cell implementation, vector types.  With opencl 1.1 you can use float3, with opencl 1.0 you need to use float4 but the result should be the same.

float tau[3];
for (int i = 0; i < 3; i++) {
tau[i] = 1 / (alpha[i] + beta[i]);

 float4 tau;
 tau = 1 / (alpha + beta);

If the final result only uses then the compiler will throw away the 4th element calculation on non-SIMD processors.

Since you have 4 separate arrays and always operate on the same item in each you could store all of them in a single float4 array - which will affect performance one way or another (may be better, may be not).

Again from a very cursory look at it, i'm surprised it's taking more than a handful of milliseconds to calculate such a small amount of work and such a simple algorithm.

I don't think you have to worry about using multiple queues and threads - but they're not really very complicated either if you had to.

Since I was curious and had a bit of spare time I got a bit side-tracked and tried all of the above - although I didn't verify the results are correct so I might have made a mistake along the way.

- 1875ms - baseline on my gpu (nvidia gtx480)
- 631ms - removed the unnecessary array copies and only copy Vout from gpu to cpu inside each loop.

Next I removed the data downloading from the loop entirely - this means it isn't retrieving plot results but it could be added with a simple kernel/final step in this kernel which shouldn't take a lot of time.  At least this gives you an idea on the minimum bound.

- 130ms - This is more in line with what i'd expect as a baseline for the amount of work you're doing and shows all you're really timing is the device-host memory copies.

I then tried registerising/vectorising the code.

- 133ms - well at least the nvidia compiler must be doing this already.

And finally I put the entire loop on the gpu, it reads the n, m, h and V data points only once at the start of the kernel, iterates over all t and then writes them out once.   I also hard-coded the loop size using #defines (mostly just because it was simpler).

 - 54ms

And lastly just to make it overly GPU specific, I tried a local worksize of 128 rather than 256, but now this is really splitting hairs for this example.  Maybe it isn't splitting hairs in general though - you're only processing 302 items, so you're only using a maximum of 2 SM units if you use a local worksize of 256.

 - 50ms

I don't have a CPU driver installed, but some of those might benefit the CPU implementation as well.  The CPU compiler is probably already vectorising the arguments, but the loop changes should make a difference.

And to test scaling, I tried 30 002 items as Michael did:

 - 350ms - this is where it should really cane any CPU if the above doesn't do it already.

Luckily you have a problem that fits the gpu compute model about as ideally as is possible.