MulJogamp Timings

MulJogamp Timings

gmseed
Hi

On the Tutorial page is a link to a paper that compares jogamp-jocl, jocl and javacl:

http://jogamp.org/wiki/index.php/JOCL_Tutorial

When comparing against "normal" Java, the author includes the filling of the arrays in the CPU timings. I took the time to implement this test case (as I'm new to JOCL) and factored out the array filling from the computation:

       
    // Fills both arrays with seeded random floats so runs are repeatable.
    private void fillJavaArrays(float[] matA, float[] matB, int seedA, int seedB)
    {
        Random randA = new Random(seedA);
        Random randB = new Random(seedB);
        final int n = matA.length;
        for (int i = 0; i < n; i++)
        {
            matA[i] = randA.nextFloat();
            matB[i] = randB.nextFloat();
        }
    }

    // Element-wise product C[i] = A[i] * B[i], matching the OpenCL kernel.
    public void normalMatMulCalc(float[] matA, float[] matB, float[] C)
    {
        final int n = matA.length;
        for (int i = 0; i < n; i++)
        {
            C[i] = matA[i] * matB[i];
        }
    }

and now compare apples with apples:

...
    // normal Java calculation
    float[] matA = new float[n];
    float[] matB = new float[n];
    float[] C = new float[n];
    fillJavaArrays(matA, matB, seedA, seedB);

    time = nanoTime();
    normalMatMulCalc(matA, matB, C);
    time = nanoTime() - time;
...
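As a benchmarking aside of my own (not part of the original measurement): a single nanoTime() pass around the first call also measures interpreter and JIT-compilation cost, not just the arithmetic. A minimal warmup harness, using only the standard library and a smaller illustrative n, might look like:

```java
import java.util.Random;

public class WarmupTiming {
    // Element-wise product, matching normalMatMulCalc above.
    static void mul(float[] a, float[] b, float[] c) {
        for (int i = 0; i < a.length; i++) {
            c[i] = a[i] * b[i];
        }
    }

    public static void main(String[] args) {
        int n = 1 << 20; // illustrative size, smaller than the test above
        float[] a = new float[n], b = new float[n], c = new float[n];
        Random rand = new Random(42);
        for (int i = 0; i < n; i++) {
            a[i] = rand.nextFloat();
            b[i] = rand.nextFloat();
        }

        // Warm up so the JIT compiles the hot loop before we measure.
        for (int i = 0; i < 10; i++) {
            mul(a, b, c);
        }

        // Report the best of several timed runs to reduce noise.
        long best = Long.MAX_VALUE;
        for (int i = 0; i < 5; i++) {
            long t = System.nanoTime();
            mul(a, b, c);
            best = Math.min(best, System.nanoTime() - t);
        }
        System.out.println("best run: " + (best / 1_000_000.0) + " ms");
    }
}
```

With warmup the CPU number typically drops further, which would only widen the gap you observed.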

From the PDF I'm a bit confused as to whether n is 1444777 or 14447777, but using the bigger 14447777 my timing results are:

created: CLContext [id: 375806496, platform: NVIDIA CUDA, profile: FULL_PROFILE, devices: 1]
using CLDevice [id: 375806416 name: Quadro K1000M type: GPU profile: FULL_PROFILE]
local: 256
global: 14447872
used device memory: 173MB
A*B=C results snapshot:
0.29194298, 0.23210067, 0.6739147, 0.5184218, 0.53693414, 0.0102392025, 0.2038985, 0.10943726, 0.16293794, 0.018490046, ...; 14447862 more
computation on GPU took: 52 ms
0.29194298, 0.23210067, 0.6739147, 0.5184218, 0.53693414, 0.0102392025, 0.2038985, 0.10943726, 0.16293794, 0.018490046, ...; 14447862 more
computation on CPU took: 16 ms

illustrating that the "normal" Java computation is 52/16 = 3.25 times faster than the GPU.

I'm interested to hear if other people have run this test.

Thanks

Graham

Re: MulJogamp Timings

Wade Walker
Administrator
Hi Graham,

I haven't run this particular test, but the result is not surprising. For a simple test like this, running on a desktop GPU with a separate memory system, I would expect the overhead of copying the arrays out to GPU memory to dominate the timing (so the CPU should be faster overall, as you observed).

Usually if you're going to offload computation to a GPU, there must be enough FP operations per byte of input to justify the copying overhead (definitely more than just 1 multiply per 8 bytes of float data as in this test). The kernels that really shine on the GPU are those with hundreds or thousands of FP operations per copied 4-byte operand.
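To put a rough number on that (my own back-of-the-envelope sketch, not a figure from the paper or this post): the element-wise kernel performs one multiply per element while moving three 4-byte floats per element (read A, read B, write C), giving an arithmetic intensity of about 1/12 FLOP per byte:

```java
public class ArithmeticIntensity {
    // FLOPs performed per byte moved between host and device.
    static double intensity(long flops, long bytesMoved) {
        return (double) flops / bytesMoved;
    }

    public static void main(String[] args) {
        long n = 14_447_872L;     // global work size from the log above
        long flops = n;           // one multiply per element
        long bytes = n * 3 * 4;   // read A, read B, write C, 4 bytes each
        System.out.printf("%.4f FLOP/byte%n", intensity(flops, bytes));
        // Roughly 0.083 FLOP/byte -- far below what a discrete GPU needs
        // to hide the PCIe transfer cost.
    }
}
```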

Re: MulJogamp Timings

gmseed
OK. It's just that the jogamp-jocl tutorial page cites a paper comparing it against jocl and javacl, and the basis and results of that test seem questionable.

Re: MulJogamp Timings

Sven Gothel
Administrator
On 10/31/2013 11:22 AM, gmseed [via jogamp] wrote:
> OK. It's just that the jogamp-jocl tutorial page is citing a paper comparing
> itself against jocl and javacl, in which the basis and results of the test are
> questionable.

Thank you for mentioning this ..

Maybe Wade (or you) are able to correct such statements,
which were written by the previous maintainer.

Thank you.

~Sven




Re: MulJogamp Timings

gouessej
Administrator
In reply to this post by gmseed
gmseed wrote:
> OK. It's just that the jogamp-jocl tutorial page is citing a paper comparing itself against jocl and javacl, in which the basis and results of the test are questionable.
JNA-based bindings are intrinsically slower than JNI-based bindings, and personally I'm not sure that the previous maintainer is wrong. Feel free to show us why his results are questionable.
Julien Gouesse | Personal blog | Website

Re: MulJogamp Timings

Wade Walker
Administrator
In reply to this post by gmseed
gmseed wrote:
> OK. It's just that the jogamp-jocl tutorial page is citing a paper comparing itself against jocl and javacl, in which the basis and results of the test are questionable.
I think the point of that paper was to compare the overheads of the three Java OpenCL bindings, not to compare OpenCL with the CPU. For that purpose we're not interested in how long the kernel itself takes, only in how long it takes to start the kernel and how long it takes to be notified when it's done (since that's where any binding overhead lies).
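One way to structure such a measurement (my sketch; the phase bodies below are placeholders, not real JOCL calls from the paper's harness): time each phase of a launch separately rather than wrapping one wall-clock span around the whole thing, so binding overhead is visible independently of kernel runtime.

```java
public class PhaseTiming {
    // Times a single phase and returns elapsed nanoseconds.
    static long timePhase(Runnable phase) {
        long t = System.nanoTime();
        phase.run();
        return System.nanoTime() - t;
    }

    public static void main(String[] args) {
        // Stand-ins for the real phases; with an actual binding these would
        // be the buffer writes, the non-blocking kernel enqueue, and the
        // blocking read or event wait.
        long copyOut = timePhase(() -> { /* host -> device copy */ });
        long launch  = timePhase(() -> { /* enqueue kernel */ });
        long wait    = timePhase(() -> { /* wait for completion */ });

        System.out.printf("copy=%dns launch=%dns wait=%dns%n",
                copyOut, launch, wait);
    }
}
```

Comparing the launch and wait phases across bindings would isolate exactly the overhead the paper was after.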