Wednesday, December 8, 2010

CUDA vs OpenCL – hands-on experiences:

- CUDA kernel code is much nicer: it takes fewer lines and is much more understandable (see the sketch below). The cost is that it requires a proprietary compiler (nvcc)
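
For illustration, here is a minimal vector-add kernel in both dialects (a hypothetical sketch, not taken from any particular project); note that the OpenCL version needs an explicit address-space qualifier on every pointer argument:

    /* CUDA kernel: thread index built from built-in variables */
    __global__ void vec_add(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    /* OpenCL kernel: every global pointer needs a __global qualifier */
    __kernel void vec_add(__global const float *a, __global const float *b,
                          __global float *c, int n)
    {
        int i = get_global_id(0);
        if (i < n)
            c[i] = a[i] + b[i];
    }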

- CUDA kernel-to-PTX compilation is performed when the project is built; in OpenCL it is performed at runtime. The resulting initialization overhead can be reduced by caching the compiled binaries (see the sketch below)
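
A rough sketch of the caching idea, assuming ctx, dev and the kernel source string src are already set up, and ignoring error handling: after the first clBuildProgram, the device binary is retrieved with clGetProgramInfo, saved to disk, and on later runs reloaded with clCreateProgramWithBinary:

    /* First run: compile from source, then save the binary to disk */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);

    size_t bin_size;
    clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES, sizeof(bin_size),
                     &bin_size, NULL);
    unsigned char *bin = malloc(bin_size);
    clGetProgramInfo(prog, CL_PROGRAM_BINARIES, sizeof(bin), &bin, NULL);
    FILE *f = fopen("kernel.bin", "wb");
    fwrite(bin, 1, bin_size, f);
    fclose(f);

    /* Later runs: skip source compilation by loading the cached binary */
    cl_program cached = clCreateProgramWithBinary(ctx, 1, &dev, &bin_size,
                            (const unsigned char **)&bin, NULL, &err);
    clBuildProgram(cached, 1, &dev, NULL, NULL, NULL);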

- There are two major differences between CUDA and OpenCL:
  OpenCL is an industry standard, while CUDA is NVIDIA's platform;
  OpenCL is a regular C library, while CUDA is in addition an extension to the C language (see the sketch below)
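
The practical consequence shows up at kernel launch. A sketch, assuming device buffers d_a, d_b, d_c and a kernel object have already been created, and that n is a multiple of the work-group size:

    /* CUDA: the language extension makes a launch look like a function call */
    vec_add<<<blocks, threads>>>(d_a, d_b, d_c, n);

    /* OpenCL: plain C library, so every argument is bound via an API call */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_a);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_b);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_c);
    clSetKernelArg(kernel, 3, sizeof(int), &n);
    size_t global = n, local = 256;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
                           0, NULL, NULL);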

- Unlike CUDA, OpenCL requires environment setup on the host (CPU) before kernels can be launched on the GPU. This process also includes compiling the kernels. The setup procedure is similar for all GPU kernels (see the sketch after this list)
- Compiling OpenCL source code into an intermediate representation takes most of the time during initialization
- No off-line kernel compiler is available in NVIDIA's OpenCL platform; the documentation suggests that developers should implement this themselves
- For the time being, it's probably better to stay with NVIDIA's CUDA than to move to OpenCL
- ATI is expected to improve its OpenCL compiler soon
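
A minimal sketch of the host-side setup referred to above (single GPU device, error handling omitted, src assumed to hold the kernel source string); the sequence of steps is essentially the same for any OpenCL application:

    cl_platform_id platform;
    cl_device_id dev;
    cl_int err;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, dev, 0, &err);

    /* Unlike CUDA, the kernel source is compiled here, at runtime */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "vec_add", &err);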

OpenCL summary
Developing applications for OpenCL is not a simple task yet. The lack of any development environment that can emulate, and hence debug, code on the underlying hardware platform leaves the programmer basically stuck with a text editor and print statements. Print statements are often insufficient; furthermore, unlike CUDA, OpenCL does not allow printing directly from the device. Instead, the data must first be transferred to the host, a time-consuming and exceedingly tedious task. Another issue that easily arises without a debugger is system crashes. In CUDA, out-of-bounds memory accesses are easily detected when emulating the GPU. As no such possibility exists in OpenCL, the developer needs to run the application to see whether it performs as expected, and it is not uncommon that an out-of-bounds memory access results in a system crash. This lack of support does not help developer productivity.
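
In practice one often falls back on a "debug buffer" pattern (a hypothetical workaround, not an official API): the kernel writes intermediate values into an extra __global buffer, and the host reads it back for inspection. A rough sketch:

    /* Kernel side: stash the value under suspicion, one slot per work-item */
    __kernel void step(__global float *data, __global float *dbg)
    {
        int i = get_global_id(0);
        float x = data[i] * 2.0f;   /* the computation being debugged */
        dbg[i] = x;                 /* "print" by writing to the debug buffer */
        data[i] = x;
    }

    /* Host side: blocking read of the debug buffer, then print it */
    clEnqueueReadBuffer(queue, dbg_buf, CL_TRUE, 0, n * sizeof(float),
                        host_dbg, 0, NULL, NULL);
    for (int i = 0; i < n; i++)
        printf("item %d: %f\n", i, host_dbg[i]);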

Developing programs for OpenCL and CUDA has some similarities regarding parallelizing and writing C code, with a few notable exceptions. First, general optimization guidelines for the various hardware platforms (GPUs, CPUs, Cell-like processors and similar) do not exist yet, as pointed out in the OpenCL specification: "It is anticipated that over the coming months and years, experience will produce a set of best practices that will help foster a uniformly favourable experience on a diversity of computing devices." Hence, if one is writing code that is to be optimized not just for one GPU architecture, but for several GPU architectures, CPU architectures, Cell-like processors, DSPs and more, one is basically in the dark; this issue requires serious research work and breaking new ground. NVIDIA has, however, released a "Best Practices Guide" for writing OpenCL code for NVIDIA GPUs, and the general ideas likely fall in line with similar recommendations for AMD GPUs.

The coding itself is not much different from writing ordinary C code, except for the lack of tools to debug one's code with anything more than print statements. The OpenCL API calls are less obvious than their CUDA counterparts, though the terminology is in many cases more suitable; a short comparison follows.
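
For a flavour of the difference, compare typical memory-management calls; the pairing below is a hypothetical one-to-one mapping with error checking omitted, not an exhaustive correspondence:

    /* CUDA */
    float *d_buf;
    cudaMalloc((void **)&d_buf, bytes);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_buf);

    /* OpenCL: the buffer/command-queue terminology is arguably clearer,
       but each call carries noticeably more parameters */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, bytes, h_buf, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, bytes, h_buf, 0, NULL, NULL);
    clReleaseMemObject(buf);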
It is unfortunate that AMD/ATI brings what it claims to be a multi-teraflop chip to the market, but declines to publish even open-source example code as simple as matrix-matrix multiplication that would illustrate the real-life performance and a recommended programming style for this chip.
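
For reference, the kind of example meant is nothing more elaborate than a naive kernel like the following sketch (C = A * B for square N-by-N row-major matrices, no blocking or local-memory optimizations):

    __kernel void matmul(__global const float *A, __global const float *B,
                         __global float *C, int N)
    {
        int row = get_global_id(1);
        int col = get_global_id(0);
        float sum = 0.0f;
        for (int k = 0; k < N; k++)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }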

Optimizing OpenCL applications for a particular architecture faces the same challenges. Furthermore, optimizing a given OpenCL code for several architectures is a much more demanding issue, perhaps even an impossible one; this explains why many authors claim that OpenCL does not provide performance portability. This, along with the fact that GPUs are quickly evolving in complexity, has made tuning numerical libraries for a given platform a challenging task.