Reply – Re: Parallel computation on CPU faster than GPU!
In Reply To
Re: Parallel computation on CPU faster than GPU!
— by Giovanni Idili
OK I did some work on this, here's what I found:

On scaling up:
302 items --> GPU: 13332ms / CPU: 1428ms
3002 items --> GPU: 26979ms / CPU: 16245ms
300002 items --> GPU: 170071ms / CPU: 147497ms

which basically confirms that scaling up narrows the CPU vs GPU gap (and that my GPU sucks).

Non-locking all the output buffers except Vout:
302 items --> GPU: 5567ms / CPU: 1471ms

Non-locking all the output buffers:
302 items --> GPU: 5346ms / CPU: 757ms

In this case I noticed a couple of things: 1) if I add .finish() at the end, performance gets a bit worse (why?); 2) if I plot Vout, the plot is a bit messed up, but only when I run on the GPU (why?). In general I'd be glad if you could point me to resources where I can learn more about what locking exactly does/means (I have an idea, but I'd like to understand it better).
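For reference, the blocking vs non-blocking (a.k.a. "locking") distinction can be sketched without any OpenCL at all. In the toy simulation below, a plain Python thread stands in for the device queue: a non-blocking read returns immediately, possibly before the kernel has written its output, while waiting (the equivalent of finish()) guarantees the data is there. This is only an illustration of the semantics, not real OpenCL code, but it may be related to the garbled Vout plot: reading an output buffer before the queue has finished can observe stale data.

```python
import threading
import time

# Host-side output buffer; the "kernel" below writes into it asynchronously.
buffer = [0.0] * 4

def kernel():
    time.sleep(0.05)            # pretend the device is busy for a while
    for i in range(4):
        buffer[i] = float(i + 1)

t = threading.Thread(target=kernel)
t.start()                       # "enqueue" the kernel; returns immediately

snapshot_early = list(buffer)   # non-blocking read: sees stale zeros
t.join()                        # finish(): block until the device is done
snapshot_after = list(buffer)   # safe read after synchronisation

print(snapshot_early, snapshot_after)
```

The synchronisation itself costs time, which is consistent with finish() making the numbers look slightly worse: the timing now includes the wait for the device to actually complete, rather than just the enqueue.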

Non-locking all + changing the arrays in the kernel to float4:
302 items --> GPU: 4801ms / CPU: 673ms
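On the float4 change: since a float4 holds four floats, an item count like 302 does not divide evenly, so the host-side data typically gets zero-padded up to a multiple of 4 before being handed to the kernel. A minimal sketch of that packing step in plain Python (pack_float4 is a hypothetical helper, not part of the actual code):

```python
def pack_float4(values):
    """Zero-pad a flat float list to a multiple of 4 and group it into
    rows of 4, each row corresponding to one float4 on the device."""
    pad = (-len(values)) % 4                 # elements needed to reach a multiple of 4
    padded = list(values) + [0.0] * pad      # zero-pad the tail
    return [padded[i:i + 4] for i in range(0, len(padded), 4)]

# 302 items pad to 304 floats, i.e. 76 float4 values
quads = pack_float4([float(i) for i in range(302)])
print(len(quads), quads[-1])
```

The padded tail elements are inert zeros; the kernel just has to avoid treating them as real items.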

Next thing I am gonna do is move the loop into the kernel and have all the intermediate results stored along the way as an output for plotting (maybe optionally populated via another parameter), and I hope to get closer to the awesome results @notz reported (< 100ms on my crappy GPU would make me happy). I'll post results back here (maybe tomorrow).

The only thing I am not too sure about at this point is how I am going to return "the results", since that means returning 2-dimensional float arrays: I need the value for each item at each step of the computation (in order to do any plotting). So I am back to a problem already discussed on this forum [], and I seem to understand I cannot just return a float**. This time I cannot flatten it out as I did for the inputs, because I do not know how many time steps the computation is going to simulate (that is going to be a parameter too). Ideas?
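One possible approach: flattening still works when the step count is only known at run time, because the buffer can be sized as steps * n_items just before the kernel is enqueued, with the kernel writing slot step * n_items + item (row-major). A minimal sketch in plain Python, where flat_index and the nested loop are hypothetical stand-ins for the kernel body:

```python
def flat_index(step, item, n_items):
    """Row-major index into a flat buffer of shape (steps, n_items)."""
    return step * n_items + item

n_items, steps = 3, 4                 # both known only at run time
flat = [0.0] * (steps * n_items)      # single flat 1-D output buffer

# stand-in for the kernel: record item*10 + step at every timestep
for step in range(steps):
    for item in range(n_items):
        flat[flat_index(step, item, n_items)] = item * 10.0 + step

# host side: recover the per-item trace for plotting
trace_item1 = [flat[flat_index(s, 1, n_items)] for s in range(steps)]
print(trace_item1)  # [10.0, 11.0, 12.0, 13.0]
```

This sidesteps float** entirely: the kernel only ever sees one contiguous 1-D array plus the two integer dimensions passed as parameters.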

Thanks again for the awesome help, you guys rock.