I started programming in OpenCL. My system is a Core i7-920 with 12 GB of RAM and an NVidia GTX 1060 with 6 GB. I started experimenting with the HelloWorld example and have it run 3 kernels: a+b=c, a*b=c and a/b=c. Each vector contains 20 million elements, just to make the benchmark results meaningful.
I got the following results (in ms, mileage may vary):
+ 159 159
* 152 179
/ 156 153
Well, there go my dreams of having built a small supercomputer: the graphics card is about as fast as the CPU. But it gets worse: I built a small benchmark subroutine that computes a/b=c 20 million times, and it does so in 78 ms!
The results change when I throw in trigonometric functions like sin(a)/cos(b)=c (the benchmark takes 4 times as much computation time). Is this normal? I hope not, because I find it somewhat discouraging. Note that I only time the queue.put... call. I can post the code if desired, but maybe one of you experts knows the answer beforehand.
One of the questions I have is about the number of processors. This is 8 for the i7-920, which is ok. But the GTX 1060 reports just 10, while the specs tell me it has 1280 stream processors. Does anyone know how those numbers relate?
This all sounds pretty normal, and there are many possible reasons for what you're seeing.
1. To execute on the GPU, your data has to be copied over the bus from CPU to GPU, then the results copied back. If you're only doing one multiply, the execution time of your program will be dominated by the data copy time. To get the full benefit of a GPU, you have to move data over to it, then do *lots* of computation before copying results back.
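To see why the copy can dominate, here is a back-of-envelope sketch in plain Java. The numbers are assumptions, not measurements: three 20-million-element double vectors crossing the bus (a and b to the GPU, c back) at a hypothetical ~6 GB/s of effective PCIe 2.0 x16 bandwidth, which is roughly what an i7-920-era board delivers.

```java
// Back-of-envelope estimate of bus-transfer time for the vector benchmark.
// All figures are illustrative assumptions, not measured values.
public class TransferEstimate {
    // elements per vector, bytes per element, vectors crossing the bus,
    // and assumed effective bandwidth in GB/s
    public static double transferMillis(long elements, int bytesPerElement,
                                        int vectorsOnBus, double gbPerSecond) {
        double bytes = (double) elements * bytesPerElement * vectorsOnBus;
        return bytes / (gbPerSecond * 1e9) * 1000.0; // milliseconds
    }

    public static void main(String[] args) {
        // 20 million doubles x 3 vectors over ~6 GB/s
        double ms = transferMillis(20_000_000L, Double.BYTES, 3, 6.0);
        System.out.printf("Estimated bus time: %.0f ms%n", ms); // ~80 ms
    }
}
```

That estimate lands in the same ballpark as the ~150 ms measured above, which is consistent with the runtime being mostly data movement rather than arithmetic.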
2. Trig functions are more difficult than multiply and divide. A modern GPU's execution units can complete a multiply every clock cycle, for every thread in a warp. But trig functions are usually done by "special function units", which are fewer in number than multipliers, and are also slower. So anything that you want to be fast, find a way to take the trig out of it :)
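The same cost gap is easy to see on the CPU with a toy comparison in plain Java — the absolute numbers are meaningless, but the ratio illustrates why sin/cos-heavy kernels run slower than multiply-only ones:

```java
// Toy comparison: one multiply per element vs. one sine per element.
// Timing here is crude (no JIT warm-up); it only illustrates the trend.
public class TrigCost {
    static double[] mulPass(double[] a) {
        double[] c = new double[a.length];
        for (int i = 0; i < a.length; i++) c[i] = a[i] * a[i]; // one multiply
        return c;
    }

    static double[] sinPass(double[] a) {
        double[] c = new double[a.length];
        for (int i = 0; i < a.length; i++) c[i] = Math.sin(a[i]); // one sine
        return c;
    }

    public static void main(String[] args) {
        int n = 2_000_000;
        double[] a = new double[n];
        for (int i = 0; i < n; i++) a[i] = i * 1e-6;

        long t0 = System.nanoTime();
        mulPass(a);
        long mulNs = System.nanoTime() - t0;

        t0 = System.nanoTime();
        sinPass(a);
        long sinNs = System.nanoTime() - t0;

        System.out.println("multiply: " + mulNs / 1_000_000
                + " ms, sin: " + sinNs / 1_000_000 + " ms");
    }
}
```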
3. You also have to set up the problem just right. There are a bunch of things about the way you write OpenCL code that could affect the performance. It'll just take reading and practice to get good at this.
Is there any specific problem you're trying to solve, or are you just learning OpenCL in general?
Thanks WadeWalker for your explanation. I had the naive impression that the OpenCL library functions would always be faster. I now know better :-) Just for fun, the benchmark results are below.
Running benchmark for 3 devices.
Summary of computing vectors with 20,000,000 elements (time in ms)
Device                VectorAdd.cl  VectorMul.cl  VectorDiv.cl  VectorTri.cl
GeForce GTX 1060 6GB           160           157           156           182
Intel(R) Core(TM) i7           154           159           154           336
Ellesmere                      125           104            89           105
Plain Java                      41            41            80          2768
What I want to do is fractal computation and linear algebra, like matrix multiplications for neural networks. I parallelised the Mandelbrot set in a "classic" way using threads, and that works. I found the OpenCL demo code and got it running, but I don't understand it. That's why I am learning OpenCL in general now; somewhere down the road to understanding OpenCL I should start to understand the Mandelbrot code :-) (after stripping it of all the GL stuff).
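For reference, the core of the threaded version boils down to a per-point escape-time loop like the sketch below (names are illustrative). Each pixel depends only on its own coordinates, which is exactly why the Mandelbrot set maps so well onto an OpenCL kernel with one work-item per pixel.

```java
// Minimal escape-time iteration for a single Mandelbrot point.
// Illustrative sketch; the real demo code adds coloring and GL display.
public class MandelPoint {
    // Returns how many iterations z = z^2 + c took to escape |z| > 2,
    // capped at maxIter (points that never escape are "in" the set).
    public static int iterations(double cr, double ci, int maxIter) {
        double zr = 0.0, zi = 0.0;
        int i = 0;
        while (i < maxIter && zr * zr + zi * zi <= 4.0) {
            double t = zr * zr - zi * zi + cr; // real part of z^2 + c
            zi = 2.0 * zr * zi + ci;           // imaginary part of z^2 + c
            zr = t;
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        System.out.println(iterations(0.0, 0.0, 100)); // origin never escapes -> 100
        System.out.println(iterations(2.0, 2.0, 100)); // escapes on the first step -> 1
    }
}
```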
My understanding fails at the data types. The book I use to learn OpenCL (OpenCL Parallel Programming Development Cookbook by Raymond Tay) explains buffers the C way. I cannot "translate" a struct with four ints as user data to a CLBuffer<type> without having to implement it as an extended Buffer class. I might use an IntBuffer instead, but that requires indexing and would be hopeless for user data with mixed data types. Is there a simpler way to define one's own data structure in Java?
I wrote a device lister that lists each device and its capabilities, and this benchmark program that runs a benchmark on each device. I can contribute these to the demo package; just let me know.
One thing you might do if you wanted to benchmark the actual compute that's happening on the card instead of just the data copying, would be to run two benchmarks: your current one, and another where the kernel is the same but with twice as much floating-point math in it (making sure all the math feeds into the results of the kernel, or the compiler may optimize it away). Then the difference in execution time between the two kernels will be floating-point execution time only (since the data copied is the same in both cases).
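A sketch of what that kernel pair could look like, held as Java source strings the way the demos do it (kernel names and the extra-math choice are mine, not from the demo code). The second kernel moves exactly the same data but does roughly twice the floating-point work, all of it feeding into c so the compiler cannot discard it:

```java
// Two hypothetical OpenCL kernels for the subtraction trick: identical data
// movement, roughly double the arithmetic in the second one. The runtime
// difference between them approximates pure compute time.
public class KernelPair {
    static final String MUL_ONCE =
        "__kernel void mulOnce(__global const double* a,\n" +
        "                      __global const double* b,\n" +
        "                      __global double* c) {\n" +
        "    int i = get_global_id(0);\n" +
        "    c[i] = a[i] * b[i];\n" +
        "}\n";

    static final String MUL_TWICE =
        "__kernel void mulTwice(__global const double* a,\n" +
        "                       __global const double* b,\n" +
        "                       __global double* c) {\n" +
        "    int i = get_global_id(0);\n" +
        "    c[i] = (a[i] * b[i]) * (a[i] * b[i]);\n" +
        "}\n";

    public static void main(String[] args) {
        System.out.println(MUL_ONCE);
        System.out.println(MUL_TWICE);
    }
}
```

Note that kernels using double need the device to support the cl_khr_fp64 extension, which matters given the doubles mentioned below.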
Any programming in Java or C# (or any language that doesn't have pointers :)) will require the Buffer code you mention. This is just a by-product of the fact that if you don't have pointers in the language, you need some workaround to interface with APIs like OpenCL that receive raw memory buffers as input. However, for efficiency any fractal or neural network code you write will probably only be passing one huge buffer to the graphics card, so you shouldn't have to define that many custom types.
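One workable pattern, sketched below with made-up field names: pack the struct's fields into a direct ByteBuffer in native byte order, laid out exactly as the kernel-side struct expects, and hand that to the buffer-creation call. This avoids writing a custom Buffer subclass for every record type.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Packing a C-style struct of four ints from Java without a custom Buffer
// subclass. The struct and its field names are hypothetical examples.
public class StructBuffer {
    // Kernel-side counterpart (illustrative):
    // struct Params { int width; int height; int maxIter; int flags; };
    public static ByteBuffer pack(int width, int height, int maxIter, int flags) {
        ByteBuffer buf = ByteBuffer.allocateDirect(4 * Integer.BYTES)
                                   .order(ByteOrder.nativeOrder()); // match device
        buf.putInt(width).putInt(height).putInt(maxIter).putInt(flags);
        buf.rewind(); // reset position so the whole 16 bytes get copied
        return buf;
    }

    public static void main(String[] args) {
        ByteBuffer b = pack(1920, 1080, 256, 0);
        System.out.println(b.remaining()); // 16 bytes, ready to wrap in a CLBuffer
    }
}
```

For mixed field types the same idea works with putFloat/putDouble and so on, but you then have to reproduce the kernel compiler's struct alignment and padding by hand, which is one reason a single large buffer of one primitive type is usually the saner design.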
Your remark about copying implies that OpenCL isn't worth using for simple computations. It's the same when using threads: one should always check whether the extra overhead outweighs the computation. It should pay off for the Mandelbrot set (little copying indeed) and for neural networks (matrix multiplications with complex computations). Thanks for the tip about how to copy the data to the device, that makes sense. I'll adjust the examples from the book with that in mind.
I replaced c = a * b with c = a * b * a and did not detect any difference. Well, as usual, benchmarking requires insight into *all* the factors behind a computation. If you look at the benchmark results you see something funny: in the VectorTri benchmark, plain Java takes about 3 seconds while the OpenCL code takes about 300 ms. Exactly the same platform, but OpenCL is roughly 10 times as fast! I forgot to mention that I use double instead of float (I need double for most of my computations).