CLContext [id: 140645219742960, platform: Intel(R) OpenCL, profile: FULL_PROFILE, devices: 1]
    CLDevice [id: 1078036064 name: Intel(R) Core(TM) i7 CPU 940 @ 2.93GHz type: CPU profile: FULL_PROFILE]
CLContext [id: 1107974928, platform: ATI Stream, profile: FULL_PROFILE, devices: 1]
    CLDevice [id: 140532002411056 name: Intel(R) Core(TM) i7 CPU 940 @ 2.93GHz type: CPU profile: FULL_PROFILE]
CLContext [id: 1108558544, platform: NVIDIA CUDA, profile: FULL_PROFILE, devices: 1]
    CLDevice [id: 139701329395616 name: GeForce GTX 295 type: GPU profile: FULL_PROFILE]
Now let's take a look at how it scales...
- public static int ELEM_COUNT = 302;
+ public static int ELEM_COUNT = 30002;
Looks like you didn't put enough load on the GPU :-). If the workload is too small, only a small part of the compute elements will be used and the rest runs idle. This problem does not show up on the CPU, since CPU parallelism is tiny compared to a modern GPU. (Also see the concurrent kernel execution ...)
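As a rough back-of-the-envelope illustration of why ~300 work-items barely occupy a GPU (the 240-core figure for a single GTX 295 GPU is my assumption, not from this thread):

```java
import java.util.Locale;

public class OccupancySketch {
    public static void main(String[] args) {
        int gpuCores = 240;              // stream processors on one GTX 295 GPU (assumed figure)
        int small = 302, large = 30002;  // the two ELEM_COUNT values from the diff above
        // With roughly one work-item per core there is nothing left to hide
        // memory latency; GPUs want many times more work-items than cores.
        System.out.println(String.format(Locale.ROOT, "%.1f work-items per core", (double) small / gpuCores));
        System.out.println(String.format(Locale.ROOT, "%.1f work-items per core", (double) large / gpuCores));
    }
}
```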
A few suggestions before I go to bed :)
- use vector types in the kernel instead of small arrays, float3 for example
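For illustration only, a hedged sketch of what the vector-type suggestion could look like in OpenCL C (the kernel and variable names are invented, not taken from the gist; float3 requires OpenCL 1.1):

```c
// Hypothetical per-element update. Instead of a small private array ...
//     float dv[3];
//     for (int i = 0; i < 3; i++) dv[i] = v[i] * dt;
// ... use a vector type, which maps onto the hardware's vector units:
__kernel void step(__global float3* state, const float dt) {
    int gid = get_global_id(0);
    float3 s = state[gid];
    s += s * dt;        // component-wise multiply-add in one operation
    state[gid] = s;
}
```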
- the loop on the host is not optimal, since it does something like:
      download and block
      copy from CLBuffer to heap for visualization
  The download part:
      ...
  should be faster than sequential blocking reads (tested with events, but finish() seems to be faster in my case).
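For illustration, the enqueue-then-finish idea could look roughly like this with JOCL's fluent CLCommandQueue API (buffer and kernel names are placeholders; untested sketch, not code from the gist):

```java
// One time step: enqueue the kernel and all downloads without blocking...
queue.put1DRangeKernel(kernel, 0, globalWorkSize, localWorkSize)
     .putReadBuffer(vBuffer, false)    // non-blocking read
     .putReadBuffer(mBuffer, false)    // non-blocking read
     .putReadBuffer(nBuffer, false);   // non-blocking read
// ...and synchronize once, instead of one blocking read per buffer.
queue.finish();
```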
Further potential for optimization:
- figure out how to remove the blocking commands in the loop
  ... I have to think about that. Out-of-order queues would make it unnecessarily complex. Two threads, double buffering... too late for me :)
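Since the mail only gestures at the two-threads/double-buffering idea, here is a minimal plain-Java sketch of the pattern (no OpenCL involved, all names invented): one thread fills a buffer with the next result while the main thread consumes the previously filled one, and the two buffers are recycled through queues.

```java
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class DoubleBufferingSketch {
    public static void main(String[] args) throws InterruptedException {
        // Two reusable buffers: while one is being filled, the other can be
        // consumed (e.g. copied to the visualization).
        BlockingQueue<float[]> free  = new ArrayBlockingQueue<>(2);
        BlockingQueue<float[]> ready = new ArrayBlockingQueue<>(2);
        free.put(new float[4]);
        free.put(new float[4]);

        final int steps = 10;

        Thread producer = new Thread(() -> {
            try {
                for (int t = 0; t < steps; t++) {
                    float[] buf = free.take();   // grab a recycled buffer
                    Arrays.fill(buf, t);         // stand-in for a kernel-result download
                    ready.put(buf);              // hand it to the consumer
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        float sum = 0;
        for (int t = 0; t < steps; t++) {
            float[] buf = ready.take();          // consume while the next step is produced
            sum += buf[0];
            free.put(buf);                       // recycle the buffer
        }
        producer.join();
        System.out.println(sum);                 // 0+1+...+9 = 45.0
    }
}
```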
BTW, I bet you will like the CLCommandQueuePool:
On 05/20/2011 12:29 AM, John_Idol [via jogamp] wrote:
> When I started running the sample and looking at how long the computation
> takes, I noticed that in the HelloJOCL sample my CPU takes around 12ms while
> the GPU takes around 18ms, but since I was just getting the thing to work I
> did not ask myself too many questions.
> Now that I put together a sample to run neuronal simulations (Hodgkin-Huxley
> model) I am noticing disturbing performance differences: the CPU takes 4
> seconds while the GPU takes 13 seconds on average.
> Here is the sample code: https://gist.github.com/981935
> and here is the kernel: https://gist.github.com/981938
> The structure of my code is built on top of the example (so I have around
> 300 elements and I am just populating queues and sending them down for
> processing), with the significant difference that I am looping over a number
> of time steps and running my kernel in parallel at each timestep.
> My CPU is an i7 QuadCore, while the GPU is an ATI HD 4XXX card (512MB RAM).
> I am thinking either my CPU is exceptionally fast and my GPU is crap or I am
> doing something very wrong in my code (such as repeating operations that I
> could do only once in setting up the kernel).
> Any help appreciated!