Broken output from my algorithm on nVidia OpenCL implementation

Wibowit
Hi,

I've developed an OpenCL implementation of ST5 (Schindler's Sort Transform of order 5). On my system (Radeon HD 5770, APP SDK 2.4, CCC 11.3, Ubuntu 64-bit or Windows 7 64-bit) it behaves correctly and produces output identical to that of a verified CPU-based implementation. Unfortunately, it doesn't behave the same on nVidia cards, as reported by inikep here: http://encode.ru/threads/1275-OpenCL-ST5-implementation

The program and test data are here: http://www12.zippyshare.com/v/44761190/file.html

enwik16MiB is the input data
enwik16MiB.st5.bak is the correct output
bsc_st5.exe is the "reference" encoder
StreamPacker.tar.gz is my OpenCL implementation - it takes two parameters: input file name and output file name
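
For anyone testing, a byte-for-byte comparison against enwik16MiB.st5.bak shows where the output first goes wrong, which may help narrow down the bug. Below is a minimal sketch; the class name CompareOutputs and its methods are made up for illustration and are not part of StreamPacker. It needs Java 9 or newer for Arrays.mismatch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of StreamPacker.
public class CompareOutputs {
    /**
     * Returns -1 if both arrays hold the same bytes, otherwise the index of
     * the first differing byte (or the shorter length if one array is a
     * prefix of the other).
     */
    static int firstMismatch(byte[] expected, byte[] actual) {
        return Arrays.mismatch(expected, actual);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: CompareOutputs <reference-output> <opencl-output>");
            return;
        }
        // Both files are read fully into memory; fine for a 16 MiB test file.
        byte[] expected = Files.readAllBytes(Paths.get(args[0]));
        byte[] actual = Files.readAllBytes(Paths.get(args[1]));
        int offset = firstMismatch(expected, actual);
        System.out.println(offset < 0
                ? "outputs are identical"
                : "first difference at byte offset " + offset);
    }
}
```

Run it with the reference output as the first argument and the file written by the OpenCL implementation as the second.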

Could someone test it on an nVidia card? I would be glad if someone could help me hunt down the bug. I'm counting on you, Michael :)

Re: Broken output from my algorithm on nVidia OpenCL implementation

Michael Bien
hello Piotr,

works fine with AMD drivers on x86_64
CLContext [id: 139933794781904, platform: ATI Stream, profile: FULL_PROFILE, devices: 1]
CLDevice [id: 139933794718016 name: Intel(R) Core(TM) i7 CPU         940  @ 2.93GHz type: CPU profile: FULL_PROFILE]
driver: 2.0
Compiling kernel.
(...)

however the kernel does not compile cleanly for my NV GPU:
CLContext [id: 139763470826864, platform: NVIDIA CUDA, profile: FULL_PROFILE, devices: 1]
CLDevice [id: 139763471694336 name: GeForce GTX 295 type: GPU profile: FULL_PROFILE]
driver: 270.41.06
Exception in thread "main" com.jogamp.opencl.CLException$CLInvalidBinaryException:
CLDevice [id: 140152635904016 name: GeForce GTX 295 type: GPU profile: FULL_PROFILE] build log:
ptxas error   : Entry function 'sort16PairsPlusLocal' uses too much shared data (0x4030 bytes + 0x10 bytes system, 0x4000 max) [error: CL_INVALID_BINARY]
        at com.jogamp.opencl.CLException.newException(CLException.java:78)
        at com.jogamp.opencl.CLProgram.build(CLProgram.java:381)
        at com.jogamp.opencl.CLProgramBuilder.build(CLProgramBuilder.java:247)
        at com.jogamp.opencl.CLProgramBuilder.build(CLProgramBuilder.java:224)
        at streampacker.Main.main(Main.java:67)

why don't you start a github project? Makes sharing and collaboration easier.

regards,
michael

Re: Broken output from my algorithm on nVidia OpenCL implementation

Michael Bien
btw my CLInfo in case you need the device properties:

HOST_JRE: 1.6.0_24-b07
HOST_JVM: Java HotSpot(TM) 64-Bit Server VM
HOST_ARCH: amd64
HOST_NUM_CORES: 8
HOST_OS: Linux
HOST_LITTLE_ENDIAN: true
CL_BINDING_UNAVAILABLE_FUNCTIONS: [clCreateEventFromGLsyncKHR, clIcdGetPlatformIDsKHR]

CL_PLATFORM_NAME: ATI Stream
CL_PLATFORM_VERSION: OpenCL 1.1 ATI-Stream-v2.2 (302)
CL_PLATFORM_PROFILE: FULL_PROFILE
CL_PLATFORM_VENDOR: Advanced Micro Devices, Inc.
CL_PLATFORM_ICD_SUFFIX_KHR: AMD
CL_PLATFORM_EXTENSIONS: [cl_khr_icd, cl_amd_event_callback]

 - CL_DEVICE_NAME: Intel(R) Core(TM) i7 CPU         940  @ 2.93GHz
 - CL_DEVICE_TYPE: CPU
 - CL_DEVICE_ENDIAN_LITTLE: true
 - CL_DEVICE_VERSION: OpenCL 1.1 ATI-Stream-v2.2 (302)
 - CL_DEVICE_PROFILE: FULL_PROFILE
 - CL_DEVICE_VENDOR: GenuineIntel
 - CL_DEVICE_EXTENSIONS: [cl_amd_device_attribute_query, cl_khr_byte_addressable_store, cl_khr_int64_extended_atomics, cl_khr_local_int32_extended_atomics, cl_amd_fp64, cl_amd_printf, cl_khr_local_int32_base_atomics, cl_khr_int64_base_atomics, cl_khr_global_int32_base_atomics, cl_khr_gl_sharing, cl_khr_global_int32_extended_atomics, cl_ext_device_fission]
 - CL_DEVICE_MAX_COMPUTE_UNITS: 8
 - CL_DEVICE_MAX_CLOCK_FREQUENCY: 2934
 - CL_DEVICE_VENDOR_ID: 4098
 - CL_DEVICE_OPENCL_C_VERSION: OpenCL C 1.1
 - CL_DRIVER_VERSION: 2.0
 - CL_DEVICE_ADDRESS_BITS: 64
 - CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT: 8
 - CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR: 16
 - CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT: 4
 - CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG: 2
 - CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT: 4
 - CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE: 0
 - CL_DEVICE_NATIVE_VECTOR_WIDTH_CHAR: 16
 - CL_DEVICE_NATIVE_VECTOR_WIDTH_SHORT: 8
 - CL_DEVICE_NATIVE_VECTOR_WIDTH_INT: 4
 - CL_DEVICE_NATIVE_VECTOR_WIDTH_LONG: 2
 - CL_DEVICE_NATIVE_VECTOR_WIDTH_HALF: 0
 - CL_DEVICE_NATIVE_VECTOR_WIDTH_FLOAT: 4
 - CL_DEVICE_NATIVE_VECTOR_WIDTH_DOUBLE: 0
 - CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024
 - CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
 - CL_DEVICE_MAX_WORK_ITEM_SIZES: [1024, 1024, 1024]
 - CL_DEVICE_MAX_PARAMETER_SIZE: 4096
 - CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1073741824
 - CL_DEVICE_GLOBAL_MEM_SIZE: 3221225472
 - CL_DEVICE_LOCAL_MEM_SIZE: 32768
 - CL_DEVICE_HOST_UNIFIED_MEMORY: true
 - CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 65536
 - CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE: 64
 - CL_DEVICE_GLOBAL_MEM_CACHE_SIZE: 32768
 - CL_DEVICE_MAX_CONSTANT_ARGS: 8
 - CL_DEVICE_IMAGE_SUPPORT: false
 - CL_DEVICE_MAX_READ_IMAGE_ARGS: 0
 - CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 0
 - CL_DEVICE_IMAGE2D_MAX_WIDTH: 0
 - CL_DEVICE_IMAGE2D_MAX_HEIGHT: 0
 - CL_DEVICE_IMAGE3D_MAX_WIDTH: 0
 - CL_DEVICE_IMAGE3D_MAX_HEIGHT: 0
 - CL_DEVICE_IMAGE3D_MAX_DEPTH: 0
 - CL_DEVICE_MAX_SAMPLERS: 0
 - CL_DEVICE_PROFILING_TIMER_RESOLUTION: 1
 - CL_DEVICE_EXECUTION_CAPABILITIES: [EXEC_KERNEL, EXEC_NATIVE_KERNEL]
 - CL_DEVICE_HALF_FP_CONFIG: []
 - CL_DEVICE_SINGLE_FP_CONFIG: [DENORM, INF_NAN, ROUND_TO_NEAREST, ROUND_TO_INF, ROUND_TO_ZERO]
 - CL_DEVICE_DOUBLE_FP_CONFIG: []
 - CL_DEVICE_LOCAL_MEM_TYPE: GLOBAL
 - CL_DEVICE_GLOBAL_MEM_CACHE_TYPE: READ_WRITE
 - CL_DEVICE_QUEUE_PROPERTIES: [PROFILING_MODE]
 - CL_DEVICE_AVAILABLE: true
 - CL_DEVICE_COMPILER_AVAILABLE: true
 - CL_DEVICE_ERROR_CORRECTION_SUPPORT: false
 - cl_khr_fp16: false
 - cl_khr_fp64: false
 - cl_khr_gl_sharing | cl_APPLE_gl_sharing: true

CL_PLATFORM_NAME: NVIDIA CUDA
CL_PLATFORM_VERSION: OpenCL 1.0 CUDA 4.0.1
CL_PLATFORM_PROFILE: FULL_PROFILE
CL_PLATFORM_VENDOR: NVIDIA Corporation
CL_PLATFORM_ICD_SUFFIX_KHR: NV
CL_PLATFORM_EXTENSIONS: [cl_khr_icd, cl_khr_byte_addressable_store, cl_nv_compiler_options, cl_nv_pragma_unroll, cl_nv_device_attribute_query, cl_khr_gl_sharing]

 - CL_DEVICE_NAME: GeForce GTX 295
 - CL_DEVICE_TYPE: GPU
 - CL_DEVICE_ENDIAN_LITTLE: true
 - CL_DEVICE_VERSION: OpenCL 1.0 CUDA
 - CL_DEVICE_PROFILE: FULL_PROFILE
 - CL_DEVICE_VENDOR: NVIDIA Corporation
 - CL_DEVICE_EXTENSIONS: [cl_khr_icd, cl_khr_byte_addressable_store, cl_khr_fp64, cl_khr_local_int32_extended_atomics, cl_khr_local_int32_base_atomics, cl_nv_compiler_options, cl_nv_pragma_unroll, cl_nv_device_attribute_query, cl_khr_global_int32_base_atomics, cl_khr_gl_sharing, cl_khr_global_int32_extended_atomics]
 - CL_DEVICE_MAX_COMPUTE_UNITS: 30
 - CL_DEVICE_MAX_CLOCK_FREQUENCY: 1242
 - CL_DEVICE_VENDOR_ID: 7088510129506619614
 - CL_DEVICE_OPENCL_C_VERSION: com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info string [error: CL_INVALID_VALUE]
 - CL_DRIVER_VERSION: 270.41.06
 - CL_DEVICE_ADDRESS_BITS: 32
 - CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT: 1
 - CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR: 1
 - CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT: 1
 - CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG: 1
 - CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT: 1
 - CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE: 1
 - CL_DEVICE_NATIVE_VECTOR_WIDTH_CHAR: com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info value [error: CL_INVALID_VALUE]
 - CL_DEVICE_NATIVE_VECTOR_WIDTH_SHORT: com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info value [error: CL_INVALID_VALUE]
 - CL_DEVICE_NATIVE_VECTOR_WIDTH_INT: com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info value [error: CL_INVALID_VALUE]
 - CL_DEVICE_NATIVE_VECTOR_WIDTH_LONG: com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info value [error: CL_INVALID_VALUE]
 - CL_DEVICE_NATIVE_VECTOR_WIDTH_HALF: com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info value [error: CL_INVALID_VALUE]
 - CL_DEVICE_NATIVE_VECTOR_WIDTH_FLOAT: com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info value [error: CL_INVALID_VALUE]
 - CL_DEVICE_NATIVE_VECTOR_WIDTH_DOUBLE: com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info value [error: CL_INVALID_VALUE]
 - CL_DEVICE_MAX_WORK_GROUP_SIZE: 512
 - CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
 - CL_DEVICE_MAX_WORK_ITEM_SIZES: [512, 512, 64]
 - CL_DEVICE_MAX_PARAMETER_SIZE: 4352
 - CL_DEVICE_MAX_MEM_ALLOC_SIZE: 234831872
 - CL_DEVICE_GLOBAL_MEM_SIZE: 939327488
 - CL_DEVICE_LOCAL_MEM_SIZE: 16384
 - CL_DEVICE_HOST_UNIFIED_MEMORY: com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info value [error: CL_INVALID_VALUE]
 - CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 65536
 - CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE: 0
 - CL_DEVICE_GLOBAL_MEM_CACHE_SIZE: 0
 - CL_DEVICE_MAX_CONSTANT_ARGS: 9
 - CL_DEVICE_IMAGE_SUPPORT: true
 - CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
 - CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8
 - CL_DEVICE_IMAGE2D_MAX_WIDTH: 4096
 - CL_DEVICE_IMAGE2D_MAX_HEIGHT: 32768
 - CL_DEVICE_IMAGE3D_MAX_WIDTH: 2048
 - CL_DEVICE_IMAGE3D_MAX_HEIGHT: 2048
 - CL_DEVICE_IMAGE3D_MAX_DEPTH: 2048
 - CL_DEVICE_MAX_SAMPLERS: 16
 - CL_DEVICE_PROFILING_TIMER_RESOLUTION: 1000
 - CL_DEVICE_EXECUTION_CAPABILITIES: [EXEC_KERNEL]
 - CL_DEVICE_HALF_FP_CONFIG: []
 - CL_DEVICE_SINGLE_FP_CONFIG: [INF_NAN, ROUND_TO_NEAREST, ROUND_TO_INF, ROUND_TO_ZERO, FMA]
 - CL_DEVICE_DOUBLE_FP_CONFIG: [DENORM, INF_NAN, ROUND_TO_NEAREST, ROUND_TO_INF, ROUND_TO_ZERO, FMA]
 - CL_DEVICE_LOCAL_MEM_TYPE: LOCAL
 - CL_DEVICE_GLOBAL_MEM_CACHE_TYPE: NONE
 - CL_DEVICE_QUEUE_PROPERTIES: [OUT_OF_ORDER_MODE, PROFILING_MODE]
 - CL_DEVICE_AVAILABLE: true
 - CL_DEVICE_COMPILER_AVAILABLE: true
 - CL_DEVICE_ERROR_CORRECTION_SUPPORT: false
 - cl_khr_fp16: false
 - cl_khr_fp64: true
 - cl_khr_gl_sharing | cl_APPLE_gl_sharing: true

 - CL_DEVICE_NAME: GeForce GTX 295
 - CL_DEVICE_TYPE: GPU
 - CL_DEVICE_ENDIAN_LITTLE: true
 - CL_DEVICE_VERSION: OpenCL 1.0 CUDA
 - CL_DEVICE_PROFILE: FULL_PROFILE
 - CL_DEVICE_VENDOR: NVIDIA Corporation
 - CL_DEVICE_EXTENSIONS: [cl_khr_icd, cl_khr_byte_addressable_store, cl_khr_fp64, cl_khr_local_int32_extended_atomics, cl_khr_local_int32_base_atomics, cl_nv_compiler_options, cl_nv_pragma_unroll, cl_nv_device_attribute_query, cl_khr_global_int32_base_atomics, cl_khr_gl_sharing, cl_khr_global_int32_extended_atomics]
 - CL_DEVICE_MAX_COMPUTE_UNITS: 30
 - CL_DEVICE_MAX_CLOCK_FREQUENCY: 1242
 - CL_DEVICE_VENDOR_ID: 7088510129506619614
 - CL_DEVICE_OPENCL_C_VERSION: com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info string [error: CL_INVALID_VALUE]
 - CL_DRIVER_VERSION: 270.41.06
 - CL_DEVICE_ADDRESS_BITS: 32
 - CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT: 1
 - CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR: 1
 - CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT: 1
 - CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG: 1
 - CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT: 1
 - CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE: 1
 - CL_DEVICE_NATIVE_VECTOR_WIDTH_CHAR: com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info value [error: CL_INVALID_VALUE]
 - CL_DEVICE_NATIVE_VECTOR_WIDTH_SHORT: com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info value [error: CL_INVALID_VALUE]
 - CL_DEVICE_NATIVE_VECTOR_WIDTH_INT: com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info value [error: CL_INVALID_VALUE]
 - CL_DEVICE_NATIVE_VECTOR_WIDTH_LONG: com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info value [error: CL_INVALID_VALUE]
 - CL_DEVICE_NATIVE_VECTOR_WIDTH_HALF: com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info value [error: CL_INVALID_VALUE]
 - CL_DEVICE_NATIVE_VECTOR_WIDTH_FLOAT: com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info value [error: CL_INVALID_VALUE]
 - CL_DEVICE_NATIVE_VECTOR_WIDTH_DOUBLE: com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info value [error: CL_INVALID_VALUE]
 - CL_DEVICE_MAX_WORK_GROUP_SIZE: 512
 - CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
 - CL_DEVICE_MAX_WORK_ITEM_SIZES: [512, 512, 64]
 - CL_DEVICE_MAX_PARAMETER_SIZE: 4352
 - CL_DEVICE_MAX_MEM_ALLOC_SIZE: 234700800
 - CL_DEVICE_GLOBAL_MEM_SIZE: 938803200
 - CL_DEVICE_LOCAL_MEM_SIZE: 16384
 - CL_DEVICE_HOST_UNIFIED_MEMORY: com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info value [error: CL_INVALID_VALUE]
 - CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 65536
 - CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE: 0
 - CL_DEVICE_GLOBAL_MEM_CACHE_SIZE: 0
 - CL_DEVICE_MAX_CONSTANT_ARGS: 9
 - CL_DEVICE_IMAGE_SUPPORT: true
 - CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
 - CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8
 - CL_DEVICE_IMAGE2D_MAX_WIDTH: 4096
 - CL_DEVICE_IMAGE2D_MAX_HEIGHT: 32768
 - CL_DEVICE_IMAGE3D_MAX_WIDTH: 2048
 - CL_DEVICE_IMAGE3D_MAX_HEIGHT: 2048
 - CL_DEVICE_IMAGE3D_MAX_DEPTH: 2048
 - CL_DEVICE_MAX_SAMPLERS: 16
 - CL_DEVICE_PROFILING_TIMER_RESOLUTION: 1000
 - CL_DEVICE_EXECUTION_CAPABILITIES: [EXEC_KERNEL]
 - CL_DEVICE_HALF_FP_CONFIG: []
 - CL_DEVICE_SINGLE_FP_CONFIG: [INF_NAN, ROUND_TO_NEAREST, ROUND_TO_INF, ROUND_TO_ZERO, FMA]
 - CL_DEVICE_DOUBLE_FP_CONFIG: [DENORM, INF_NAN, ROUND_TO_NEAREST, ROUND_TO_INF, ROUND_TO_ZERO, FMA]
 - CL_DEVICE_LOCAL_MEM_TYPE: LOCAL
 - CL_DEVICE_GLOBAL_MEM_CACHE_TYPE: NONE
 - CL_DEVICE_QUEUE_PROPERTIES: [OUT_OF_ORDER_MODE, PROFILING_MODE]
 - CL_DEVICE_AVAILABLE: true
 - CL_DEVICE_COMPILER_AVAILABLE: true
 - CL_DEVICE_ERROR_CORRECTION_SUPPORT: false
 - cl_khr_fp16: false
 - cl_khr_fp64: true
 - cl_khr_gl_sharing | cl_APPLE_gl_sharing: true

Re: Broken output from my algorithm on nVidia OpenCL implementation

Wibowit
In reply to this post by Michael Bien
Quite surprising that it uses more LDS memory than I requested on the nVidia platform. I'll probably make a version with reduced LDS memory usage. Thanks for testing. However, it compiles and runs on a GTX 460 (as reported on that other forum), but produces broken output. Fermi can be configured as either 16 KiB L1 / 48 KiB LDS or 48 KiB L1 / 16 KiB LDS, so it may be that the configuration was wrong.
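
To put numbers on the ptxas message quoted earlier in the thread (0x4030 bytes + 0x10 bytes system, 0x4000 max), here is a small arithmetic sketch. The class and method names are made up for illustration. The 64 bytes of overhead used in the second calculation is an assumption read off this one build log, not a documented constant; a plausible but unconfirmed explanation is that pre-Fermi nVidia hardware also passes kernel arguments through shared memory.

```java
// Hypothetical helper for checking a kernel's local-memory budget.
public class LocalMemBudget {
    /** Bytes by which the kernel exceeds the limit; zero or negative means it fits. */
    static long overshoot(long kernelBytes, long systemBytes, long deviceLimit) {
        return kernelBytes + systemBytes - deviceLimit;
    }

    /** How many elements of the given size fit after subtracting reserved bytes. */
    static long maxElements(long deviceLimit, long reservedBytes, int elementSize) {
        return (deviceLimit - reservedBytes) / elementSize;
    }

    public static void main(String[] args) {
        long kernelBytes = 0x4030; // from the ptxas log: 16432 bytes
        long systemBytes = 0x10;   // from the ptxas log: 16 bytes
        long limit = 0x4000;       // 16384, matches CL_DEVICE_LOCAL_MEM_SIZE on the GTX 295

        // 16432 + 16 - 16384 = 64 bytes over the limit.
        System.out.println("over by " + overshoot(kernelBytes, systemBytes, limit) + " bytes");

        // Assumption: the same 0x30 + 0x10 = 64 bytes of overhead applies again,
        // and the local buffer holds 4-byte ints.
        System.out.println("ints that fit: " + maxElements(limit, 0x30 + 0x10, 4));
    }
}
```

Rather than hard-coding the limit, the host side could query CL_DEVICE_LOCAL_MEM_SIZE at run time and size the local buffers with some headroom below it.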

Why haven't I started a GitHub project? Well, until yesterday, when NetBeans 7.0 was released, there was no official Git integration in NetBeans. Also, I don't know which licenses I could use and still be allowed to host on a free GitHub account. I know that programs have to be open source, but further details can vary. For example, I would want to make my program free for personal use and paid for commercial use.