Reply – Re: Parallel computation on CPU faster than GPU!
In Reply To: Re: Parallel computation on CPU faster than GPU!
— by Michael Bien
My quick benchmark (all runs on the same system):

CLContext [id: 140645219742960, platform: Intel(R) OpenCL, profile:
FULL_PROFILE, devices: 1]
CLDevice [id: 1078036064 name: Intel(R) Core(TM) i7 CPU         940  @
2.93GHz type: CPU profile: FULL_PROFILE]

CLContext [id: 1107974928, platform: ATI Stream, profile: FULL_PROFILE,
devices: 1]
CLDevice [id: 140532002411056 name: Intel(R) Core(TM) i7 CPU        
940  @ 2.93GHz type: CPU profile: FULL_PROFILE]

CLContext [id: 1108558544, platform: NVIDIA CUDA, profile: FULL_PROFILE,
devices: 2]
CLDevice [id: 139701329395616 name: GeForce GTX 295 type: GPU profile:
FULL_PROFILE]
Now let's take a look at how it scales...
  -  public static int ELEM_COUNT = 302;
  +  public static int ELEM_COUNT = 30002;


Intel driver/CPU

AMD driver/CPU

Looks like you didn't put enough load on the GPU :-). If the workload is too
small, only a fraction of the compute elements is used while the rest sits
idle. This problem does not show up on the CPU, since CPU parallelism is
tiny compared to a modern GPU's. (See also concurrent kernel execution.)
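To get a feel for the imbalance, here is a back-of-the-envelope sketch. The lane counts are assumptions for illustration only (roughly 8 hardware threads on an i7 940, and on the order of 240 cores per GPU on a GTX 295); they are not values queried from the devices:

```java
// Rough work-per-lane arithmetic; the device numbers are assumptions,
// not values queried via OpenCL.
public class UtilizationSketch {

    // How many work items each parallel lane gets on average.
    static double itemsPerLane(int workItems, int lanes) {
        return (double) workItems / lanes;
    }

    public static void main(String[] args) {
        int cpuThreads = 8;   // i7 940: 4 cores x 2-way SMT (assumed)
        int gpuCores = 240;   // GTX 295: ~240 cores per GPU (assumed)

        // With 302 elements the CPU lanes are well fed, but each GPU
        // core gets barely one item -- no way to hide memory latency.
        System.out.printf("302 items:   %.1f per CPU thread, %.2f per GPU core%n",
                itemsPerLane(302, cpuThreads), itemsPerLane(302, gpuCores));

        // With 30002 elements each GPU core finally has enough work
        // queued up to stay busy.
        System.out.printf("30002 items: %.1f per CPU thread, %.2f per GPU core%n",
                itemsPerLane(30002, cpuThreads), itemsPerLane(30002, gpuCores));
    }
}
```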

A few suggestions before I go to bed :)
- use vector types in the kernel instead of small arrays (e.g. float3)
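The vector-type idea in OpenCL C could look like the sketch below. The kernel and argument names are hypothetical, not taken from the posted code; note that float3 requires OpenCL 1.1, so float4 is shown as the portable variant:

```c
// Hypothetical before/after sketch (kernel and argument names invented).

// Before: three separate scalar arrays per point.
__kernel void step_scalar(__global float* x, __global float* y,
                          __global float* z, float dt) {
    int i = get_global_id(0);
    x[i] += dt; y[i] += dt; z[i] += dt;   // three scalar loads/stores
}

// After: one vector array; the compiler can emit vector loads/stores.
// (float3 needs OpenCL 1.1; float4 works everywhere.)
__kernel void step_vector(__global float4* p, float dt) {
    int i = get_global_id(0);
    p[i] += dt;                           // scalar dt is broadcast
}
```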
- the loop on the host is not optimal, since each iteration does something like:
  - download and block
  - copy from CLBuffer to heap for visualization

The download part:
.putReadBuffer(V_out_Buffer, false)
.putReadBuffer(x_n_out_Buffer, false)
.putReadBuffer(x_m_out_Buffer, false)
.putReadBuffer(x_h_out_Buffer, false)
...should be faster than sequential blocking reads (I tested with events,
but finish() seems to be faster in my case).

Further potential for optimization:
  - figure out how to remove the blocking commands in the loop
... I have to think about that. Out-of-order queues would make it
unnecessarily complex. Two threads and double buffering... too late for me :)
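The two-thread, double-buffering idea can be sketched with plain Java, assuming two buffers cycling between a producer and a consumer; compute() and visualize() are hypothetical stand-ins for the kernel launch + download and the heap copy for visualization, not JOCL calls:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Double-buffering sketch with plain Java threads (no JOCL types).
// While the producer fills one buffer, the host consumes the other,
// so neither side waits for the full download+copy round trip.
public class DoubleBufferSketch {
    static final int SIZE = 4;
    static final int STEPS = 10;

    static void compute(float[] buf, int step) {
        for (int i = 0; i < buf.length; i++) buf[i] = step + i; // fake result
    }

    static float visualize(float[] buf) {
        float sum = 0;
        for (float v : buf) sum += v; // fake consumer
        return sum;
    }

    static float run() throws InterruptedException {
        BlockingQueue<float[]> free = new ArrayBlockingQueue<>(2);
        BlockingQueue<float[]> ready = new ArrayBlockingQueue<>(2);
        free.put(new float[SIZE]); // the two buffers being swapped
        free.put(new float[SIZE]);

        Thread producer = new Thread(() -> {
            try {
                for (int step = 0; step < STEPS; step++) {
                    float[] buf = free.take(); // grab a drained buffer
                    compute(buf, step);
                    ready.put(buf);            // hand it to the consumer
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        float total = 0;
        for (int step = 0; step < STEPS; step++) {
            float[] buf = ready.take();
            total += visualize(buf);
            free.put(buf);                     // return the buffer for reuse
        }
        producer.join();
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("total = " + run());
    }
}
```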

BTW, I bet you will like the CLCommandQueuePool:

best regards,

On 05/20/2011 12:29 AM, John_Idol [via jogamp] wrote:

> When I started running the sample and looking at how long the computation
> takes, I noticed that in the HelloJOCL sample my CPU takes around 12ms while
> the GPU takes around 18ms, but since I was just getting the thing to work I
> did not ask myself too many questions.
> Now that I put together a sample to run neuronal simulations (Hodgkin-Huxley
> model) I am noticing disturbing performance differences: the CPU takes 4
> seconds while the GPU takes 13 seconds on average.
> Here is the sample code:
> and here is the kernel:
> The structure of my code is built on top of the example (so I have around
> 300 elements and I am just populating queues and sending them down for
> processing), with the significant difference that I am looping over a number
> of time steps and running my kernel in parallel at each timestep.
> My CPU is an i7 quad-core, while the GPU is an ATI HD 4XXX card (512MB RAM).
> I am thinking either my CPU is exceptionally fast and my GPU is crap, or I am
> doing something very wrong in my code (such as repeating operations that I
> could do only once when setting up the kernel).
> Any help appreciated!