Login  Register

Re: Parallel computation on CPU faster than GPU!

Posted by Michael Bien on May 20, 2011; 12:03am
URL: https://forum.jogamp.org/Parallel-computation-on-CPU-faster-than-GPU-tp2963506p2963758.html

  my quick benchmark (all runs on same system):

CLContext [id: 140645219742960, platform: Intel(R) OpenCL, profile:
FULL_PROFILE, devices: 1]
CLDevice [id: 1078036064 name: Intel(R) Core(TM) i7 CPU         940  @
2.93GHz type: CPU profile: FULL_PROFILE]
2328ms

CLContext [id: 1107974928, platform: ATI Stream, profile: FULL_PROFILE,
devices: 1]
CLDevice [id: 140532002411056 name: Intel(R) Core(TM) i7 CPU        
940  @ 2.93GHz type: CPU profile: FULL_PROFILE]
2471ms

CLContext [id: 1108558544, platform: NVIDIA CUDA, profile: FULL_PROFILE,
devices: 2]
CLDevice [id: 139701329395616 name: GeForce GTX 295 type: GPU profile:
FULL_PROFILE]
3000ms

now lets take a look how it scales...
  -  public static int ELEM_COUNT = 302;
  +  public static int ELEM_COUNT = 30002;

NV/GPU
8308ms

Intel driver/CPU
20351ms

AMD driver/CPU
21845ms

looks like you didn't put enough load on the GPU :-). If its not enough
only a small part of the compute elements will be used the rest runs
idle. This problem does not happen on the CPU since CPU parallelism is
tiny compared to a modern GPU. (also see the concurrent kernel execution
thread).


a few suggestions before i go to bed :)
- use vector types in the kernel instead of small arrays. float3 for example
- the loop on the host is not optimal since it does something like

loop{
upload
execute
download and block
copy from CLBuffer to heap for visualizations
}

the download part:
.putReadBuffer(V_out_Buffer, false)
.putReadBuffer(x_n_out_Buffer, false)
.putReadBuffer(x_m_out_Buffer, false)
.putReadBuffer(x_h_out_Buffer, false)
.finish();
... should be faster as sequential blocking reads (tested with events
but finish seems to be faster in my case)

further potential for optimizations:
  - figure out how to remove the blocking commands in the loop
... i have to think about that. Out of order queues would make it
unnecessary complex. Two threads, double buffering.., to late for me :)

btw i bet you will like the CLCommandQueuePool:
https://github.com/mbien/jocl/blob/master/test/com/jogamp/opencl/util/concurrent/CLMultiContextTest.java#L109

best regards,
michael

On 05/20/2011 12:29 AM, John_Idol [via jogamp] wrote:

>
> When I started running the sample and looking at how long the computation
> takes, I noticed that in the HelloJOCL sample my CPU takes around 12ms while
> the GPU takes around 18ms, but since I was just getting hte thing to work I
> did not ask myself too many questions.
>
> Now that I put together a sample to run neuronal simulations (Hodking-Huxley
> model) I am noticing disturbing performance differences: the CPU takes 4
> seconds while the GPU takes 13 seconds on average.
>
> Here is the sample code: https://gist.github.com/981935
> and here is the kernel: https://gist.github.com/981938
>
> The structure of my code is built on top of the example (so I have around
> 300 elements and I am just populating queues and sending them down for
> processing), with the significant difference that I am looping over a number
> of time steps and running my kernel in parallel at each timestep.
>
> My CPu is an i7 QuadCore, while the GPU is an ATI HD 4XXX card (512MB RAM).
>
> I am thinking either my CPu is exceptionally fast and my GPU is crap or I am
> doing something very wrong in my code (such as repeating operations that I
> could do only once in setting up the kernel).
>
> Any help appreciated!
>
> _______________________________________________
> If you reply to this email, your message will be added to the discussion below:
> http://forum.jogamp.org/Parallel-computation-on-CPU-faster-than-GPU-tp2963506p2963506.html
> To start a new topic under jogamp, email [hidden email]
> To unsubscribe from jogamp, visit
http://michael-bien.com/