Posted by
Michael Bien on
May 20, 2011; 12:03am
URL: https://forum.jogamp.org/Parallel-computation-on-CPU-faster-than-GPU-tp2963506p2963758.html
my quick benchmark (all runs on the same system):
CLContext [id: 140645219742960, platform: Intel(R) OpenCL, profile:
FULL_PROFILE, devices: 1]
CLDevice [id: 1078036064 name: Intel(R) Core(TM) i7 CPU 940 @
2.93GHz type: CPU profile: FULL_PROFILE]
2328ms
CLContext [id: 1107974928, platform: ATI Stream, profile: FULL_PROFILE,
devices: 1]
CLDevice [id: 140532002411056 name: Intel(R) Core(TM) i7 CPU
940 @ 2.93GHz type: CPU profile: FULL_PROFILE]
2471ms
CLContext [id: 1108558544, platform: NVIDIA CUDA, profile: FULL_PROFILE,
devices: 2]
CLDevice [id: 139701329395616 name: GeForce GTX 295 type: GPU profile:
FULL_PROFILE]
3000ms
now let's take a look at how it scales...
- public static int ELEM_COUNT = 302;
+ public static int ELEM_COUNT = 30002;
NV/GPU
8308ms
Intel driver/CPU
20351ms
AMD driver/CPU
21845ms
looks like you didn't put enough load on the GPU :-). If the workload is
too small, only a fraction of the compute units is used and the rest sits
idle. This problem does not show up on the CPU, since CPU parallelism is
tiny compared to a modern GPU's. (also see the concurrent kernel execution
thread).
a few suggestions before I go to bed :)
- use vector types in the kernel instead of small arrays, e.g. float3
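For example, something like this (a sketch with hypothetical buffer names, not your actual kernel; note that in OpenCL a __global float3 element is aligned like a float4, i.e. 16 bytes):

```c
/* before: state split into separate component arrays (or small per-element arrays) */
__kernel void step_components(__global float *vx, __global float *vy, __global float *vz) {
    int i = get_global_id(0);
    vx[i] += 0.1f; vy[i] += 0.1f; vz[i] += 0.1f;
}

/* after: one float3 load/store per element; the compiler can map this
   directly onto the hardware's vector registers */
__kernel void step_vectors(__global float3 *v) {
    int i = get_global_id(0);
    v[i] += (float3)(0.1f, 0.1f, 0.1f);
}
```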
- the loop on the host is not optimal since it does something like
loop {
    upload
    execute
    download and block
    copy from CLBuffer to heap for visualization
}
the download part:
    .putReadBuffer(V_out_Buffer, false)
    .putReadBuffer(x_n_out_Buffer, false)
    .putReadBuffer(x_m_out_Buffer, false)
    .putReadBuffer(x_h_out_Buffer, false)
    .finish();
...should be faster than four sequential blocking reads (I tested with
events, but finish() seems to be faster in my case)
further potential for optimization:
- figure out how to remove the blocking commands in the loop
... I have to think about that. Out-of-order queues would make it
unnecessarily complex. Two threads, double buffering... too late for me :)
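The two-threads/double-buffering idea could look roughly like this (a minimal plain-Java sketch with a stand-in compute() instead of real JOCL calls: while the "kernel" for step N runs, a second thread drains the readback of step N-1):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class DoubleBuffered {

    // stand-in for the kernel; in real code this would be
    // putWriteBuffer(...) + put1DRangeKernel(...) on the device queue
    static int compute(int step) {
        return step * step;
    }

    // overlaps the "download + copy to heap" of step N-1 with the
    // "compute" of step N, ping-ponging between two logical buffers
    static List<Integer> run(int steps) throws Exception {
        ExecutorService downloader = Executors.newSingleThreadExecutor();
        List<Integer> results = new ArrayList<>();
        Future<?> pending = null;
        for (int step = 0; step < steps; step++) {
            int value = compute(step);          // "execute" into buffer A
            if (pending != null) pending.get(); // previous readback (buffer B) must finish
            pending = downloader.submit(() -> results.add(value)); // async readback
        }
        if (pending != null) pending.get();
        downloader.shutdown();
        return results;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(4)); // prints [0, 1, 4, 9]
    }
}
```

The point is only the overlap pattern; with real buffers you would also need two CLBuffers so the kernel never writes the buffer the downloader thread is still reading.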
btw I bet you will like the CLCommandQueuePool:
https://github.com/mbien/jocl/blob/master/test/com/jogamp/opencl/util/concurrent/CLMultiContextTest.java#L109

best regards,
michael
On 05/20/2011 12:29 AM, John_Idol [via jogamp] wrote:
>
> When I started running the sample and looking at how long the computation
> takes, I noticed that in the HelloJOCL sample my CPU takes around 12ms while
> the GPU takes around 18ms, but since I was just getting the thing to work I
> did not ask myself too many questions.
>
> Now that I put together a sample to run neuronal simulations (Hodgkin-Huxley
> model) I am noticing disturbing performance differences: the CPU takes 4
> seconds while the GPU takes 13 seconds on average.
>
> Here is the sample code: https://gist.github.com/981935
> and here is the kernel: https://gist.github.com/981938
> The structure of my code is built on top of the example (so I have around
> 300 elements and I am just populating queues and sending them down for
> processing), with the significant difference that I am looping over a number
> of time steps and running my kernel in parallel at each timestep.
>
> My CPU is an i7 QuadCore, while the GPU is an ATI HD 4XXX card (512MB RAM).
>
> I am thinking either my CPU is exceptionally fast and my GPU is crap or I am
> doing something very wrong in my code (such as repeating operations that I
> could do only once in setting up the kernel).
>
> Any help appreciated!
>
> _______________________________________________
> If you reply to this email, your message will be added to the discussion below:
>
http://forum.jogamp.org/Parallel-computation-on-CPU-faster-than-GPU-tp2963506p2963506.html