Parallel computation on CPU faster than GPU!


Giovanni Idili
Hi everyone, I am back with another typical n00b question.

When I started running the HelloJOCL sample [http://goo.gl/iI8t8] and looking at how long the computation was taking, I noticed that with my CPU the computation took around 12ms to complete while with the GPU it took around 18ms, but since I was just getting the thing to work I did not ask myself too many questions.

Now that I put together a sample to run neuronal simulations (Hodgkin-Huxley model) I am noticing disturbing performance differences: the CPU takes 4 seconds while the GPU takes 13 seconds on average.

Here is the sample code: https://gist.github.com/981935 
and here is the kernel: https://gist.github.com/981938

The structure of my code is built on top of the HelloJOCL example (I have around 300 elements and I am just populating queues and sending them down for processing), with the significant difference that I am looping over a number of time steps and running my kernel in parallel at each time-step (not just once).

My CPU is a quad-core i7, while the GPU is an ATI HD 4XXX card (512MB RAM).

I am thinking either my CPU is exceptionally fast and my GPU is crap, or I am doing something very wrong in my code (such as repeating operations that I could do just once when setting up the kernel).

Any help appreciated!

Re: Parallel computation on CPU faster than GPU!

Michael Bien
my quick benchmark (all runs on the same system):

CLContext [id: 140645219742960, platform: Intel(R) OpenCL, profile:
FULL_PROFILE, devices: 1]
CLDevice [id: 1078036064 name: Intel(R) Core(TM) i7 CPU         940  @
2.93GHz type: CPU profile: FULL_PROFILE]
2328ms

CLContext [id: 1107974928, platform: ATI Stream, profile: FULL_PROFILE,
devices: 1]
CLDevice [id: 140532002411056 name: Intel(R) Core(TM) i7 CPU        
940  @ 2.93GHz type: CPU profile: FULL_PROFILE]
2471ms

CLContext [id: 1108558544, platform: NVIDIA CUDA, profile: FULL_PROFILE,
devices: 2]
CLDevice [id: 139701329395616 name: GeForce GTX 295 type: GPU profile:
FULL_PROFILE]
3000ms

now let's take a look at how it scales...
  -  public static int ELEM_COUNT = 302;
  +  public static int ELEM_COUNT = 30002;

NV/GPU
8308ms

Intel driver/CPU
20351ms

AMD driver/CPU
21845ms

looks like you didn't put enough load on the GPU :-). If the load is too small,
only a small part of the compute elements will be used and the rest runs
idle. This problem does not happen on the CPU, since CPU parallelism is
tiny compared to a modern GPU (also see the concurrent kernel execution
thread).


a few suggestions before i go to bed :)
- use vector types in the kernel instead of small arrays, e.g. float3
- the loop on the host is not optimal since it does something like

loop {
    upload
    execute
    download and block
    copy from CLBuffer to heap for visualizations
}

the download part:
.putReadBuffer(V_out_Buffer, false)
.putReadBuffer(x_n_out_Buffer, false)
.putReadBuffer(x_m_out_Buffer, false)
.putReadBuffer(x_h_out_Buffer, false)
.finish();
... should be faster than sequential blocking reads (I tested with events,
but finish() seems to be faster in my case).

further potential for optimizations:
  - figure out how to remove the blocking commands in the loop
... i have to think about that. Out-of-order queues would make it
unnecessarily complex. Two threads, double buffering... too late for me :)

btw i bet you will like the CLCommandQueuePool:
https://github.com/mbien/jocl/blob/master/test/com/jogamp/opencl/util/concurrent/CLMultiContextTest.java#L109

best regards,
michael



Re: Parallel computation on CPU faster than GPU!

notzed
In reply to this post by Giovanni Idili
Just to second what Michael said and to expand on it some - you are doing some things quite wrong, but they are easily fixed.

1. You're copying all your data from cpu->gpu, running the code, then copying it all back again on every loop.  At most you're taking a few samples from one buffer, so why copy them all to and fro?  On CL/CPU these copies are presumably a complete no-op; on the GPU each one is a full memory copy across devices - all you're really timing is lots of memory copies.
2. Every copy you make from gpu->cpu is synchronous.  Think of putting out a fire with a chain of people with buckets - it still takes just as long in transfer time to move a bucket from one end to the other, but if you already have a spare one to go you don't sit around waiting for work.  In your case you're waiting for the whole line of buckets to empty before starting the next lot.
3. You're collating a tiny fraction of the data on the CPU whilst the GPU remains completely idle - even if the code isn't very efficient, if you collate on the GPU it will run much faster.  From what I can tell you could very easily collate this on the gpu as well anyway.

To fix:

1. Copy any cpu initialised values to the gpu once (or even initialise using a kernel if the data is large and generated algorithmically).
2. Run your kernel multiple times without any cpu synchronisation, and just swap the arguments for the input/output pipeline (see the JOCL sketch after this list), e.g.:
  setargs(0, in);
  setargs(1, out);
  queuekernel();
  setargs(0 ,out);
  setargs(1, in);
  queuekernel(OUT, IN);

From a cursory look at the algorithm you could probably run the loop itself entirely on the gpu anyway - i haven't looked closely but it appears each kernel calculates a value independently of all other kernels, and every time the same one (by iGid) will always be working on the values it calculated last time.  If that is the case you could also just use the same memory for input and output as well and simplify memory management as a bonus (in this case the kernel would also have to dump out sample results as described in the next point).  I noticed you have a bug anyway - after the first loop it's just using Vout for both Vin and Vout (or maybe that isn't a bug, but if it isn't you're doing even more redundant copying).

3. Copy any sample results out on the gpu using another simple kernel/bit of code tacked onto the end.  You could just pass the iteration count to tell it where to write the answer.  Is this only for debugging anyway?

4. Retrieve the results in one go, and only use 'blocking=true' on the last buffer - this will at least batch up all the copies and do them together rather than waiting for each to complete before moving on.  Assuming you have a default-configured queue, it will guarantee execution order, so you can assume that if the last read is done, all of them are.  Or use finish() as suggested.
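
A rough JOCL sketch of the ping-pong scheme from point 2 (illustrative only and untested here - bufA, bufB, steps, globalSize and localSize are placeholder names, and the argument indices depend on the actual kernel signature):

CLBuffer<FloatBuffer> in  = bufA;   // pre-filled device buffers
CLBuffer<FloatBuffer> out = bufB;
for (int t = 0; t < steps; t++) {
    kernel.setArg(0, in);                                  // previous state
    kernel.setArg(1, out);                                 // next state
    queue.put1DRangeKernel(kernel, 0, globalSize, localSize);
    CLBuffer<FloatBuffer> tmp = in; in = out; out = tmp;   // swap roles for the next step
}
// after the final swap the newest data sits in 'in'; read it back once
queue.putReadBuffer(in, true);

No host copies happen inside the loop; the only transfer is the single read at the end.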

There are also some simple code tweaks.

For one, i'm not sure how well the compiler will go at registerising arrays - and in any event it is compiler dependent.  Might be better done using 3 floats, or especially for a cpu/cell implementation, vector types.  With opencl 1.1 you can use float3, with opencl 1.0 you need to use float4 but the result should be the same.

e.g.
float tau[3];
for (int i = 0; i < 3; i++) {
    tau[i] = 1 / (alpha[i] + beta[i]);
}

becomes:

float4 tau = 1 / (alpha + beta);

If the final result only uses tau.xyz/tau.s012 then the compiler will throw away the 4th element calculation on non-SIMD processors.

Since you have 4 separate arrays and always operate on the same item in each you could store all of them in a single float4 array - which will affect performance one way or another (may be better, may be not).
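
For illustration (again just a sketch - `count` and the V/n/m/h arrays stand in for the data the gist sets up), the four arrays could be interleaved on the host with com.jogamp.common.nio.Buffers so the kernel sees one float4 per element:

FloatBuffer packed = Buffers.newDirectFloatBuffer(count * 4);
for (int i = 0; i < count; i++) {
    packed.put(V[i]).put(n[i]).put(m[i]).put(h[i]);   // x=V, y=n, z=m, w=h
}
packed.rewind();
// on the device this maps to a __global float4* argument indexed by iGid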

Again from a very cursory look at it, i'm surprised it's taking more than a handful of milliseconds to calculate such a small amount of work and such a simple algorithm.

I don't think you have to worry about using multiple queues and threads - but they're not really very complicated either if you had to.

Since I was curious and had a bit of spare time I got a bit side-tracked and tried all of the above - although I didn't verify the results are correct so I might have made a mistake along the way.

- 1875ms - baseline on my gpu (nvidia gtx480)
- 631ms - removed the unnecessary array copies and only copy Vout from gpu to cpu inside each loop.

Next I removed the data downloading from the loop entirely - this means it isn't retrieving plot results, but that could be added with a simple kernel/final step in this kernel which shouldn't take a lot of time.  At least this gives you an idea of the minimum bound.

- 130ms - This is more in line with what i'd expect as a baseline for the amount of work you're doing and shows all you're really timing is the device-host memory copies.

I then tried registerising/vectorising the code.

- 133ms - well at least the nvidia compiler must be doing this already.

And finally I put the entire loop on the gpu, it reads the n, m, h and V data points only once at the start of the kernel, iterates over all t and then writes them out once.   I also hard-coded the loop size using #defines (mostly just because it was simpler).

 - 54ms

And lastly just to make it overly GPU specific, I tried a local worksize of 128 rather than 256, but now this is really splitting hairs for this example.  Maybe it isn't splitting hairs in general though - you're only processing 302 items, so you're only using a maximum of 2 SM units if you use a local worksize of 256.

 - 50ms

I don't have a CPU driver installed, but some of those might benefit the CPU implementation as well.  The CPU compiler is probably already vectorising the arguments, but the loop changes should make a difference.

And to test scaling, I tried 30002 items as Michael did:

 - 350ms - this is where it should really cane any CPU if the above doesn't do it already.

Luckily you have a problem that fits the gpu compute model about as ideally as is possible.

 Z

Re: Parallel computation on CPU faster than GPU!

Giovanni Idili
In reply to this post by Michael Bien
Michael Bien wrote
looks like you didn't put enough load on the GPU :-)
Argh - the thought crossed my mind a couple of times but I didn't act on it - it makes a lot of sense.

Thanks for this and the rest of the advice: I will swap float3 in and see what happens. As for the blocking, I do need to get the stuff out for now ... but I am thinking of following @notzed's advice (I will reply to him shortly in more detail) and moving the loop into the kernel, since at the moment I am only plotting stuff to make sure that the logic is correct - so in that case I could use .finish() I guess.

Once again thanks for all the help!

Re: Parallel computation on CPU faster than GPU!

Giovanni Idili
In reply to this post by notzed
First of all thanks for taking the time to help on this!

notzed wrote
1. Copy any cpu initialised values to the gpu once (or even initialise using a kernel if the data is large and generated algorithmically).
I guess by this you mean the constants that I am passing down at every cycle? This makes a lot of sense, but I am not sure how to do it.

notzed wrote
2. Run your kernel multiple times without any cpu synchronisation, and just swap the arguments for the input/output pipeline, e.g.:
  setargs(0, in);
  setargs(1, out);
  queuekernel();
  setargs(0 ,out);
  setargs(1, in);
  queuekernel(OUT, IN);

From a cursory look at the algorithm you could probably run the loop itself entirely on the gpu anyway - i haven't looked closely but it appears each kernel calculates a value independently of all other kernels, and every time the same one (by iGid) will always be working on the values it calculated last time.  If that is the case you could also just use the same memory for input and output as well and simplify memory management as a bonus (in this case the kernel would also have to dump out sample results as described in the next point).  I noticed you have a bug anyway - after the first loop it's just using Vout for both Vin and Vout (or maybe that isn't a bug, but if it isn't you're doing even more redundant copying).
Wow, swapping the arguments for I/O pipelines is a great suggestion - I didn't know this could be done at all (remember, I am a n00b!). Would you be able to point me to some example? Also, I will definitely follow your advice to move the loop into the kernel; at the moment I am getting the stuff out to plot it, but when this gets plugged into the bigger picture I have in mind I won't need that. As for the bug, I don't think it's a bug, because the results I am plotting show the curves I expect (this is a dumb example and effectively I am running the same thing N times in parallel just to test it out, so there is a lot of redundancy; in theory the inputs will be different for each of the items, but at the moment they are all the same).

notzed wrote
3. Copy any sample results out on the gpu using another simple kernel/bit of code tacked onto the end.  You could just pass the iteration count to tell it where to write the answer.  Is this only for debugging anyway?
This sounds like another clever trick - are you suggesting I store results as I go along on some buffer in the shared memory of the GPU then run another kernel at the end to harvest results from that buffer?

notzed wrote
There are also some simple code tweaks.
Thanks a lot for the awesome suggestions in terms of code tweaks - I will now try to follow your steps and see how this translates into performance gains on my GPU. Will post back results.

The only thing that still worries me a bit is that my baseline on the GPU (ATI HD4850) is around 13000ms while yours is 1875ms and Michael's is 3000ms. Trying to make sense of that, I suspect it's because mine is the 'Mobility' version of the card (I am on a 27'' iMac), but would that explain such a huge difference?

Re: Parallel computation on CPU faster than GPU!

notzed
Giovanni Idili wrote
First of all thanks for taking the time to help on this!
I've been knee deep in this stuff for the last few months (or similar stuff for years) and I spotted a few obvious simple things that were easy to address.  Might be a bit off topic for the forum, but it's a bit quiet otherwise.
Giovanni Idili wrote
notzed wrote
1. Copy any cpu initialised values to the gpu once (or even initialise using a kernel if the data is large and generated algorithmically).
I guess by this you mean the constants that I am passing down at every cycle? This makes a lot of sense, but I am not sure how to do it.
Actually I meant the input values, vin, etc.  The non-memory arguments are handled by the driver (kernel.setArg() etc), and are potentially cached on the server anyway and in any event are too small to matter in most cases.  It's the memory buffer synchronisation that slows things down.

Giovanni Idili wrote
Wow swapping the arguments for I/O pipelines is a great suggestion  - I didn't know this could be done at all (remember, I am n00b!). Would you be able to point me to some example?
I don't have a specific example handy, but the example i gave should be simple enough.  Perhaps think of the allocations on the device like any other memory allocations: they stay around with their content until you free them, as far as any kernel is concerned at least.  If you have an array in cpu code you don't (normally) copy the array just to call a function that works on it, and a gpu is no different.

Your code was already basically doing it, but it was copying the same data to the cpu and then back to the gpu every loop - for no purpose or effect.

Giovanni Idili wrote
notzed wrote
3. Copy any sample results out on the gpu using another simple kernel/bit of code tacked onto the end.  You could just pass the iteration count to tell it where to write the answer.  Is this only for debugging anyway?
This sounds like another clever trick - are you suggesting I store results as I go along on some buffer in the shared memory of the GPU then run another kernel at the end to harvest results from that buffer?
More or less, but just write the results to global memory - local memory only exists for the duration of the kernel execution, and is only shared amongst work-items in the same workgroup (I think this is how your original code was treating the global memory).  If they've been written in the kernel you can then just read that memory from the cpu.  I mentioned an additional kernel in case you didn't want to put it into the main kernel and just grabbed some samples from the 'current result'.

But since in this case it's only for debugging it probably isn't too important how it's done.

Giovanni Idili wrote
The only thing that still worries me a bit is that my baseline on GPU (ATI HD4850) is around 13000ms while yours is 1875ms and Michael's one is 3000ms. Trying to make sense of that, I suspect it's the 'Mobility' version of the card (I am on a 27'' iMac), but would that explain the huge difference?
It could easily explain such a difference.  Mobile devices usually have slower memory/IO and sometimes fewer transistors (i.e. fewer processors) at lower clock speeds.  And even apart from that there are so many different devices with varying capabilities/generations.

On the flip side you might actually have more to gain by better code which more fully utilises the processors and lowers the memory bandwidth requirements.   Even on a slow card registers are fast.

 Z

Re: Parallel computation on CPU faster than GPU!

Giovanni Idili
OK I did some work on this, here's what I found:

On scaling up:
302 items --> GPU: 13332ms / CPU: 1428ms
3002 items --> GPU: 26979ms / CPU: 16245ms
300002 items --> GPU: 170071ms / CPU: 147497ms

which basically confirms that scaling up resolves the CPU vs GPU issue (and that my GPU sucks).

With non-blocking reads for all the output buffers except Vout:
302 items --> GPU: 5567ms / CPU: 1471ms

Non-blocking reads for all the output buffers:
302 items --> GPU: 5346ms / CPU: 757ms

In this case I noticed a couple of things: 1) if I add .finish() at the end the performance gets a bit worse (why?) 2) if I plot Vout the plot is a bit messed up, but only if I run on the GPU (why?). In general I'd be glad if you could point me to resources where I can learn more about what blocking exactly does/means (I have an idea but I'd like to understand it better).

Non-blocking reads for all buffers + changing the arrays in the kernel to float4:
302 items --> GPU: 4801ms / CPU: 673ms

Next thing I am gonna do is move the loop into the kernel and have all the results stored along the way as an output for plotting (maybe optionally populated via another parameter), and I hope to get closer to the awesome results @notzed reported (if I get under 100ms on my crappy GPU I'd be happy). I'll post results back here (maybe tomorrow).

The only thing I am not too sure about at this point is how I am going to return "the results", since that means returning 2-dimensional float arrays - I need to know the value for each item at each point of the computation (in order to do any plotting). So I am back to a problem already discussed on this forum [http://forum.jogamp.org/Passing-array-of-arrays-to-OpenCL-via-JOCL-tp2922911p2922911.html]: I seem to understand I cannot just return a float**, and this time I cannot flatten it out as I did for the inputs because I do not know in advance how many time steps the computation is going to simulate (that's going to be a parameter too). Ideas?
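
One idea I am toying with (just a sketch - numSteps, elementCount and the argument indices below are made-up names): even though the number of time steps is a parameter, the host does know it before the buffers are created, so the whole history could be flattened into a single buffer indexed by time step and element:

// one flat buffer: numSteps * elementCount floats, written by the kernel
// as vHistory[t * elementCount + iGid] at each time step t
CLBuffer<FloatBuffer> vHistory =
        context.createFloatBuffer(numSteps * elementCount, Mem.WRITE_ONLY);
kernel.setArg(4, vHistory);   // argument indices depend on the kernel signature
kernel.setArg(5, numSteps);

// after the kernel run: one read, then slice into a 2D array for plotting
queue.putReadBuffer(vHistory, true);
float[][] plot = new float[numSteps][elementCount];
for (int t = 0; t < numSteps; t++) {
    vHistory.getBuffer().position(t * elementCount);
    vHistory.getBuffer().get(plot[t], 0, elementCount);
}
vHistory.getBuffer().rewind();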

Thanks again for the awesome help, you guys rock.

Re: Parallel computation on CPU faster than GPU!

Giovanni Idili
In reply to this post by Giovanni Idili
I moved the loop to the kernel [http://goo.gl/ga6pt] - and I am getting much better results (blocking only one of the buffers, or all of them, to get the final values out at the end does not seem to make a difference):

with 302 items --> GPU: 276ms / CPU: 228ms

Here's the code I am using to invoke the kernel: http://goo.gl/297a3

One weird thing I've noticed: if I don't block on any buffer, the computation only takes 1ms ... which makes me think something is horribly wrong. I am trying to find a way to verify.

As mentioned in the previous post, ideally I would like at this stage to get a 2-dimensional array out (at least for one of the buffers) at the end with values for each step of the loop I moved into the kernel, so that I can do some plotting and check that the computation is actually happening.

Any help on that appreciated!


Re: Parallel computation on CPU faster than GPU!

Michael Bien

If you don't block, your Java program will not wait for any results - it
just sends the commands to the device and exits. The command queue is
asynchronous by default: you only enqueue the commands, so you have to
wait for the results somehow.

OpenCL provides many options for how you can wait for results, or for certain
events in general.

  - finish() waits for everything (all previously enqueued commands) to
complete
  - a blocking command (the boolean on most putFoo methods) waits for
that command to complete (in an in-order queue it's very similar to finish())
  - and events
  - (+ barriers, but i leave them out for now)

events give you a lot of control over the execution. Most commands allow
passing a condition list and an event list as method parameters. Every
command produces a CLEvent and can wait for other CLEvents (the condition list).

the host can wait for events too... queue.putWaitForEvents(events, true);

In JOCL you can only use events as a list, since there are often many of them.
CLEventList readEvents = new CLEventList(2);

...
queue.putReadBuffer(a, false, null /*this would be the condition list*/, readEvents);
queue.putReadBuffer(b, false, null /*this would be the condition list*/, readEvents);

queue.putWaitForEvents(readEvents, true);

after you release the list with readEvents.release() you can reuse the list.

(but events would be unnecessary in your usecase right now IMO)

-michael



Re: Parallel computation on CPU faster than GPU!

Giovanni Idili
Michael Bien wrote
If you don't block, your Java program will not wait for any results.
Thanks Michael, that explains it - I probably should have guessed, really, since it says it's an asynchronous operation!

Re: Parallel computation on CPU faster than GPU!

Abimael
In reply to this post by Giovanni Idili

Excuse me, but how do you run OpenCL code on the CPU?
Or are you comparing Java code on the CPU vs OpenCL code on the GPU?

Re: Parallel computation on CPU faster than GPU!

Michael Bien

AMD and Intel both have OpenCL implementations for the CPU. AMD's implementation
works on all x86 CPUs, whereas Intel's is locked to Intel CPUs.

regards,
michael

--
http://michael-bien.com/


Re: Parallel computation on CPU faster than GPU!

Abimael

@Michael

I had heard about it, but how can I do that? I mean, how can I run OpenCL on the CPU to compare?

I recall reading somewhere that in either createProgram or build (I do not recall which one) I could pass an argument to target the GPU or the CPU.

Is that the way to run OpenCL on the CPU, or do I need to install a proper implementation (or libraries) to run on the CPU?

thanks

Re: Parallel computation on CPU faster than GPU!

Giovanni Idili
// an array with available opencl devices detected on your system
CLDevice[] devices = context.getDevices();

for (int i = 0; i < devices.length; i++) {
    out.println("device-" + i + ": " + devices[i]);
}

// have a look at the output and select a device
CLDevice device = devices[0];

Hope it helps!


Re: Parallel computation on CPU faster than GPU!

Michael Bien
And to create a context, simply pass the device to the factory method:

CLContext context = CLContext.create(device);
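
If I remember the factory overloads correctly, you can also pick a device by type directly (this assumes a CPU OpenCL implementation is installed):

CLContext cpuContext = CLContext.create(CLDevice.Type.CPU);
System.out.println(cpuContext);   // shows which platform/device was selected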




Re: Parallel computation on CPU faster than GPU!

notzed
In reply to this post by Abimael
Abimael wrote
Is that the way to run OpenCL on CPU , or should I need to install a proper implementation (or libraries) to run on CPU ?
Yes, of course - you will need to install an OpenCL implementation that targets CPUs.

e.g. the AMD one: http://developer.amd.com/sdks/amdappsdk/downloads/Pages/default.aspx

Re: Parallel computation on CPU faster than GPU!

dunarel
Hi all,

I am new to OpenCL. I tried to install jocl from jogamp but never could (I never found a jar library and .so files that work; it always asked for other dependencies).

Instead I converted the example to the jocl.org format. It seems to work, but I have doubts about the results:
it gives me a graph where everything is almost 1, around 0.98 for the entire timeframe.

Is this what it is supposed to be?

Thanks.

Re: Parallel computation on CPU faster than GPU!

paultim374
notzed wrote
... i haven't looked closely but it appears each kernel calculates a value independently of all other kernels, and every time the same one (by iGid) will always be working on the values it calculated last time.  If that is the case you could also just use the same memory for input and output as well and simplify memory management as a bonus (in this case the kernel would also have to dump out sample results as described in the next point).  I noticed you have a bug anyway - after the first loop it's just using Vout for both Vin and Vout ...

Giovanni Idili wrote
As mentioned in the previous post, ideally I would like at this stage to get a 2-dimensional array out (at least for one of the buffers) at the end with values for each step of the loop I moved into the kernel, so that I can do some plotting and check that the computation is actually happening...???





usman