Understanding float and double


Arnold
I am trying to rebuild the Mandelbrot example. I now have a very simple program that creates a Mandelbrot fractal, and I compare it with a simple (serial) and a parallel (threaded) implementation that I use for benchmark results. The results are really great: my serial implementation runs in 2229 ms, the parallel one in 432 ms, and the OpenCL one in 9 ms. An improvement of almost a factor of 50!

To "dive" deep into a Mandelbrot fractal one needs doubles, else you reach too quickly the resolution of the floats. I noticed that the openCL solution used floats, while my NVidia GTX 1060 and the Intel core i7 920 both have a setting of cl_khr_fp64 = true. The Mandelbrot.cl kernel has a neat way of dealing with floats, making it dependent on the floating point setting. I set it explicitly to double but that did not help. I have listed the kernel below. In my java program I exclusive use double.

Does anyone have an idea how to get the kernel to use double variables?

#ifdef DOUBLE_FP
    #ifdef AMD_FP
        #pragma OPENCL EXTENSION cl_amd_fp64 : enable
    #else
        #pragma OPENCL EXTENSION cl_khr_fp64 : enable
    #endif
    typedef double varfloat;
#else
    typedef float varfloat;
#endif

/**
 * For a description of this algorithm please refer to
 * http://en.wikipedia.org/wiki/Mandelbrot_set
 * @author Michael Bien
 */
kernel void Mandelbrot
    (
        const int width,        
        const int height,
        const int maxIterations,
        const double x0,      
        const double y0,
        const double stepX,  
        const double stepY,
        global int *output  
    )
{

    unsigned int ix = get_global_id (0);
    unsigned int iy = get_global_id (1);

    double r = x0 + ix * stepX;
    double i = y0 + iy * stepY;

    double x = 0;
    double y = 0;

    double magnitudeSquared = 0;
    int iteration = 0;

    while (magnitudeSquared < 4 && iteration < maxIterations)
    {
        varfloat x2 = x*x;
        varfloat y2 = y*y;
        y = 2 * x * y + i;
        x = x2 - y2 + r;
        magnitudeSquared = x2+y2;
        iteration++;
    }

    output [iy * width + ix] = iteration;
}
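(As a quick illustration of the precision limit mentioned at the top: the following small Java sketch, with values chosen purely for illustration, computes the same pixel coordinate r = x0 + ix * stepX once in double and once in float; at a deep zoom level the float results collapse onto a single value.)

public class FloatVsDouble {
    public static void main(String[] args) {
        // Illustrative deep-zoom values: the per-pixel step is far below
        // float's ~7 significant decimal digits at this coordinate magnitude.
        double x0    = -0.743643887037151;
        double stepX = 1.0e-12;

        float x0f    = (float) x0;
        float stepXf = (float) stepX;

        for (int ix = 0; ix < 3; ix++) {
            double rd = x0 + ix * stepX;   // distinct coordinate per pixel
            float  rf = x0f + ix * stepXf; // rounds to the same float every time
            System.out.printf("ix=%d  double=%.17f  float=%.9f%n", ix, rd, rf);
        }
    }
}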

Re: Understanding float and double

Wade Walker
Administrator
Hmm, not sure why that doesn't work. Are you actually passing doubles into the kernel when you invoke it? If you were still passing floats as arguments, you would get the same results even with doubles used inside the kernel.
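For what it's worth, here is a minimal sketch of what the host-side argument setup might look like with the JogAmp JOCL API (the variable names are placeholders, not taken from your program): the putArg overload you call has to match the kernel's parameter types, so the coordinate and step arguments must go in as double once the kernel declares them as double.

import com.jogamp.opencl.CLBuffer;
import com.jogamp.opencl.CLKernel;
import com.jogamp.opencl.CLProgram;
import java.nio.IntBuffer;

class MandelbrotArgs {
    // Placeholder helper: pass the coordinate/step arguments as double so they
    // match the kernel signature shown above.
    static CLKernel setArgs(CLProgram program,
                            int width, int height, int maxIterations,
                            double x0, double y0, double stepX, double stepY,
                            CLBuffer<IntBuffer> output) {
        CLKernel kernel = program.createCLKernel("Mandelbrot");
        kernel.putArg(width).putArg(height).putArg(maxIterations)
              .putArg(x0).putArg(y0)            // doubles, not floats
              .putArg(stepX).putArg(stepY)
              .putArg(output);                  // global int *output
        return kernel;
    }
}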

Re: Understanding float and double

Arnold
Hi Wadewalker,

Thanks for your remark. It pointed me to an error I had overlooked: I had forgotten to replace varfloat with double inside the Mandelbrot loop. When I did that, everything worked. But that implicitly meant that varfloat had been defined as float rather than double, so I wanted to see which branch of the #ifdefs was actually taken. I adjusted the header of the kernel as follows:

#ifdef DOUBLE_FP
    #ifdef AMD_FP
        #pragma OPENCL EXTENSION cl_amd_fp64 : enable
        #error cl_amd_fp64 detected
    #else
        #pragma OPENCL EXTENSION cl_khr_fp64 : enable
        #error cl_khr_fp64 detected
    #endif
    typedef double varfloat;
#else
    typedef float varfloat;
    #error no DOUBLE_FP detected
#endif

The compilation of the kernel now fails with "no DOUBLE_FP detected" (see the last lines below). That is strange, because when I declare doubles explicitly, doubles are clearly being used. Below is the compilation output of the program together with a list of all properties of the device the kernel was built for. In this example it is the GTX 1060, but I get the same results when I choose the Intel Core i7.

***CLDevice [id: 520350624 name: GeForce GTX 1060 6GB type: GPU profile: FULL_PROFILE]
   --- Properties of device GeForce GTX 1060 6GB
  CL_DEVICE_NAME = GeForce GTX 1060 6GB
  CL_DEVICE_TYPE = GPU
  CL_DEVICE_EXTENSIONS = [cl_khr_global_int32_base_atomics, cl_khr_fp64, cl_nv_compiler_options, cl_khr_byte_addressable_store, cl_nv_copy_opts, cl_khr_global_int32_extended_atomics, cl_khr_icd, cl_nv_pragma_unroll, cl_nv_d3d10_sharing, cl_nv_device_attribute_query, cl_khr_local_int32_extended_atomics, cl_nv_d3d11_sharing, cl_khr_gl_sharing, cl_khr_d3d10_sharing, cl_nv_d3d9_sharing, cl_khr_local_int32_base_atomics]
  CL_DEVICE_AVAILABLE = true
  CL_DEVICE_VERSION = OpenCL 1.2 CUDA
  CL_DRIVER_VERSION = 372.90
  CL_DEVICE_MAX_WORK_GROUP_SIZE = 1024
  CL_DEVICE_ENDIAN_LITTLE = true
  CL_DEVICE_VENDOR_ID = 4318
  CL_DEVICE_OPENCL_C_VERSION = OpenCL C 1.2
  CL_DEVICE_ADDRESS_BITS = 64
  CL_DEVICE_GLOBAL_MEM_SIZE = 6442450944
  CL_DEVICE_LOCAL_MEM_SIZE = 49152
  CL_DEVICE_HOST_UNIFIED_MEMORY = false
  CL_DEVICE_MAX_SAMPLERS = 32
  CL_DEVICE_HALF_FP_CONFIG = []
  CL_DEVICE_LOCAL_MEM_TYPE = LOCAL
  cl_khr_icd = true
  CL_DEVICE_MAX_COMPUTE_UNITS = 10
  CL_DEVICE_MAX_CLOCK_FREQUENCY = 1708
  CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT = 1
  CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR = 1
  CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT = 1
  CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG = 1
  CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT = 1
  CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE = 1
  CL_DEVICE_NATIVE_VECTOR_WIDTH_CHAR = 1
  CL_DEVICE_NATIVE_VECTOR_WIDTH_SHORT = 1
  CL_DEVICE_NATIVE_VECTOR_WIDTH_INT = 1
  CL_DEVICE_NATIVE_VECTOR_WIDTH_LONG = 1
  CL_DEVICE_NATIVE_VECTOR_WIDTH_HALF = 0
  CL_DEVICE_NATIVE_VECTOR_WIDTH_FLOAT = 1
  CL_DEVICE_NATIVE_VECTOR_WIDTH_DOUBLE = 1
  CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = 3
  CL_DEVICE_MAX_WORK_ITEM_SIZES = [1024, 1024, 64]
  CL_DEVICE_MAX_PARAMETER_SIZE = 4352
  CL_DEVICE_MAX_MEM_ALLOC_SIZE = 1610612736
  CL_DEVICE_MEM_BASE_ADDR_ALIGN = 4096
  CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE = 65536
  CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE = 128
  CL_DEVICE_GLOBAL_MEM_CACHE_SIZE = 163840
  CL_DEVICE_MAX_CONSTANT_ARGS = 9
  CL_DEVICE_IMAGE_SUPPORT = true
  CL_DEVICE_MAX_READ_IMAGE_ARGS = 256
  CL_DEVICE_MAX_WRITE_IMAGE_ARGS = 16
  CL_DEVICE_IMAGE2D_MAX_WIDTH = 16384
  CL_DEVICE_IMAGE2D_MAX_HEIGHT = 32768
  CL_DEVICE_IMAGE3D_MAX_WIDTH = 16384
  CL_DEVICE_IMAGE3D_MAX_HEIGHT = 16384
  CL_DEVICE_IMAGE3D_MAX_DEPTH = 16384
  CL_DEVICE_PROFILING_TIMER_RESOLUTION = 1000
  CL_DEVICE_EXECUTION_CAPABILITIES = [EXEC_KERNEL]
  CL_DEVICE_SINGLE_FP_CONFIG = [DENORM, INF_NAN, ROUND_TO_NEAREST, ROUND_TO_INF, ROUND_TO_ZERO, FMA]
  CL_DEVICE_DOUBLE_FP_CONFIG = [DENORM, INF_NAN, ROUND_TO_NEAREST, ROUND_TO_INF, ROUND_TO_ZERO, FMA]
  CL_DEVICE_GLOBAL_MEM_CACHE_TYPE = READ_WRITE
  CL_DEVICE_QUEUE_PROPERTIES = [OUT_OF_ORDER_MODE, PROFILING_MODE]
  CL_DEVICE_COMPILER_AVAILABLE = true
  CL_DEVICE_ERROR_CORRECTION_SUPPORT = false
  cl_khr_fp16 = false
  cl_khr_fp64 = true
  cl_khr_gl_sharing | cl_APPLE_gl_sharing = true
  CL_DEVICE_PROFILE = FULL_PROFILE
  CL_DEVICE_VENDOR = NVIDIA Corporation
created CLContext [id: 586574304, platform: NVIDIA CUDA, profile: FULL_PROFILE, devices: 1]
using CLDevice [id: 520350624 name: GeForce GTX 1060 6GB type: GPU profile: FULL_PROFILE]
Exception in thread "JavaFX Application Thread" com.jogamp.opencl.CLException$CLBuildProgramFailureException:
CLDevice [id: 520350624 name: GeForce GTX 1060 6GB type: GPU profile: FULL_PROFILE] build log:
<kernel>:12:6: error: no DOUBLE_FP detected
    #error no DOUBLE_FP detected
     ^ [error: CL_BUILD_PROGRAM_FAILURE]


Re: Understanding float and double

Wade Walker
Administrator
Are you passing "-D DOUBLE_FP" when you build the kernel? If not, that would explain the behavior you describe.
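(For anyone reading along later: with JOCL that option goes into the program build step, roughly like the sketch below; the exact call is an assumption based on the JOCL API, not taken from the original program.)

import com.jogamp.opencl.CLContext;
import com.jogamp.opencl.CLProgram;

class BuildWithDoubles {
    // Hand the preprocessor define to the OpenCL compiler so that the kernel's
    // "#ifdef DOUBLE_FP" branch typedefs varfloat as double.
    static CLProgram build(CLContext context, String kernelSource) {
        return context.createProgram(kernelSource).build("-D DOUBLE_FP");
    }
}

On an AMD device that only exposes cl_amd_fp64 one would add "-D AMD_FP" the same way.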

Re: Understanding float and double

Arnold
Sorry for my late answer; I was busy building a new (OpenCL-enabled) computer. I had indeed missed that part of the kernel build step. I added it and it works. Sorry for bothering you with something that trivial, and thanks for your patience!

Re: Understanding float and double

Wade Walker
Administrator
Hey, I'm just glad it works :)