Wednesday, December 8, 2010

CUDA vs OpenCL – hands-on experiences:

- CUDA kernel code is generally cleaner: it takes fewer lines and is easier to read. The cost is that it requires a proprietary compiler (nvcc)

- CUDA kernel-to-PTX compilation is performed when the project is built. In OpenCL it is performed at runtime. This "initialization overhead" can be reduced by caching the compiled kernels

- There are two major differences between CUDA and OpenCL:
  OpenCL is an industry standard while CUDA is NVIDIA's platform,
  OpenCL is a regular C library, while CUDA is additionally an extension to the C language

- Unlike CUDA, OpenCL requires explicit environment setup on the host (CPU) before kernels can be launched on a GPU. This setup also includes compiling the kernels. The setup process is similar for all GPU kernels
- Compiling OpenCL source code into an intermediate representation takes the most time during initialization
- No off-line kernel compiler is available in NVIDIA’s OpenCL platform. Documentation suggests that this should be implemented by the developers
- If you are considering OpenCL, for the time being it is probably better to stay with NVIDIA’s CUDA
- ATI is expected to improve the OpenCL compiler soon
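
The runtime compilation cost mentioned in the list above can be amortized by caching the compiled kernel binary on disk. Below is a minimal Python sketch of that idea, not an actual OpenCL binding: `compile_fn` is a hypothetical stand-in for the real build step (in the OpenCL C API this would roughly mean building the program, reading it back with `clGetProgramInfo` and `CL_PROGRAM_BINARIES`, and later reloading it with `clCreateProgramWithBinary`). The device name and cache layout are assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def cache_key(source: str, device_name: str, build_options: str) -> str:
    """Hash everything that can change the compiled binary."""
    h = hashlib.sha256()
    for part in (source, device_name, build_options):
        h.update(part.encode("utf-8"))
        h.update(b"\0")  # separator so ("ab", "c") differs from ("a", "bc")
    return h.hexdigest()


def load_or_build(source, device_name, build_options, compile_fn, cache_dir):
    """Return (binary, was_cached).

    compile_fn(source, build_options) -> bytes is a hypothetical stand-in
    for the expensive runtime compilation step.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / (cache_key(source, device_name, build_options) + ".bin")
    if path.exists():
        return path.read_bytes(), True   # cache hit: skip compilation
    binary = compile_fn(source, build_options)
    path.write_bytes(binary)             # cache miss: compile once and store
    return binary, False


# Demo with a fake compiler that records how often it is invoked.
calls = []


def fake_compile(source, options):
    calls.append(source)
    return b"BINARY:" + source.encode("utf-8")


kernel_src = "__kernel void noop(__global float *x) {}"
demo_dir = tempfile.mkdtemp()
first = load_or_build(kernel_src, "hypothetical-gpu", "-cl-fast-relaxed-math",
                      fake_compile, demo_dir)
second = load_or_build(kernel_src, "hypothetical-gpu", "-cl-fast-relaxed-math",
                       fake_compile, demo_dir)
```

The key includes the device name and build options because a binary built for one device or option set is generally not valid for another. A real cache would probably also need to include the driver version.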

OpenCL summary
Developing applications for OpenCL is not yet a simple task. There is no development environment that can emulate, and hence debug, code for the underlying hardware platform, which leaves the programmer basically stuck with a text editor and print statements. Print statements are often insufficient, and unlike CUDA, OpenCL does not allow printing directly from the device; instead the data must first be transferred to the host, a time-consuming and exceedingly boring task. Another issue that easily arises without a debugger is system crashes. In CUDA, out-of-bounds memory accesses are easily detected when emulating the GPU. As no such possibility exists in OpenCL, the developer needs to run the application to see if it performs as expected, and it is not uncommon for out-of-bounds memory accesses to result in a system crash. This lack of support does not help developer productivity.

Developing programs for OpenCL and CUDA is similar as far as parallelizing and writing C code are concerned, with a few notable exceptions. First, general optimization guidelines for various hardware platforms such as GPUs, CPUs, Cell-like processors and similar do not exist, as pointed out in the OpenCL specifications: "It is anticipated that over the coming months and years, experience will produce a set of best practices that will help foster a uniformly favourable experience on a diversity of computing devices." Hence, if one is writing code that is to be optimized not just for one GPU architecture, but for several GPU architectures, CPU architectures, Cell-like processors, DSPs and more, one is basically in the dark. This issue needs serious research work and the breaking of new ground. NVIDIA has, however, released a "Best practices guide" for writing OpenCL code for NVIDIA GPUs. The general ideas likely also fall in line with similar recommendations for AMD GPUs.

The coding itself is not much different from writing ordinary C code, except for the lack of debugging tools beyond print statements. The OpenCL API calls are less obvious than their CUDA counterparts, though the terminology is in many cases more suitable.
It is unfortunate that AMD/ATI brings what it claims to be a multi-teraflop chip to the market, but refuses to publish open-source example code, even something as simple as matrix-matrix multiplication, that would illustrate the real-life performance of this chip and a recommended programming style for it.

Optimizing OpenCL applications for a particular architecture faces the same challenges. Furthermore, optimizing a given piece of OpenCL code for several architectures is a much more demanding problem, perhaps even an impossible one, which explains why many authors claim that OpenCL does not provide performance portability. This, along with the fact that GPUs are quickly evolving in complexity, has made tuning numerical libraries for a given platform a challenging task.

Thursday, November 11, 2010

RoCEE aka IBoE

RoCEE (RDMA over Converged Enhanced Ethernet) is a standard that carries the InfiniBand transport over Ethernet. It is a very lightweight transport, layered directly over Ethernet L2. Think of it as the FCoE equivalent for high-performance IPC traffic. RDMA reduces endpoint latency by transferring data directly from the memory of one computer to the memory of another, without being slowed by the operating system or the NIC memory. RoCEE should bring endpoint latency down from the roughly 4.5 us that is typical with Ethernet to about 1.3 us. RoCEE layers InfiniBand’s network and transport protocols on top of Ethernet’s physical and MAC layers.

RoCEE packets are standard Ethernet frames with an IEEE-assigned Ethertype, a GRH (Global Route Header), unmodified IB transport headers and payload. IB subnet management and SA (Subnet Administration) services are not required for RoCEE operation; Ethernet management practices are used instead. RoCEE encodes IP addresses into its GIDs and resolves MAC addresses using the host IP stack. For multicast GIDs, standard IP to MAC mappings apply.
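
As a rough illustration of the address handling described above, here is a small Python sketch. The IPv4 multicast to MAC mapping is the standard one (prefix `01:00:5e` plus the low 23 bits of the IP address). The GID encoding shown, an IPv4-mapped IPv6 value, is an assumption made for illustration; the exact encoding a given RoCEE stack uses may differ. Both function names are hypothetical.

```python
import ipaddress


def ipv4_multicast_mac(addr: str) -> str:
    """Map an IPv4 multicast address to its Ethernet MAC address:
    prefix 01:00:5e followed by the low 23 bits of the IP address."""
    ip = ipaddress.IPv4Address(addr)
    if not ip.is_multicast:
        raise ValueError(f"{addr} is not an IPv4 multicast address")
    low23 = int(ip) & 0x7FFFFF
    octets = [0x01, 0x00, 0x5E,
              (low23 >> 16) & 0xFF, (low23 >> 8) & 0xFF, low23 & 0xFF]
    return ":".join(f"{o:02x}" for o in octets)


def ipv4_to_gid(addr: str) -> str:
    """ASSUMPTION: fill the 128-bit GID with the IPv4-mapped IPv6 form
    (::ffff:a.b.c.d) of the address. Illustrative only."""
    ip = ipaddress.IPv4Address(addr)
    gid = ipaddress.IPv6Address((0xFFFF << 32) | int(ip))
    return gid.exploded


mac_a = ipv4_multicast_mac("224.1.2.3")
mac_b = ipv4_multicast_mac("239.129.2.3")  # differs only in the dropped bits
gid = ipv4_to_gid("192.168.1.10")
```

Because only 23 bits of the group address survive the mapping, several multicast groups share one MAC address, so receivers still need to filter on the full address.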

The OFA RDMA Verbs API is syntactically unmodified. The CMA is adapted to support RoCEE ports, allowing existing RDMA applications to run over RoCEE with no changes.

Wednesday, November 10, 2010

What is HPET?

HPET stands for “High Precision Event Timer”. HPETs were introduced in 2005 to replace the time stamp counter (read with the RDTSC instruction) on x86 computers. HPETs run at 10 or 100 MHz, providing a time “tick” of either 100 or 10 ns respectively.
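
The tick periods quoted above follow directly from the timer frequency (period = 1 / frequency). A tiny Python sketch of the arithmetic, with hypothetical helper names:

```python
def tick_period_ns(frequency_hz: int) -> float:
    """Length of one timer tick in nanoseconds."""
    return 1e9 / frequency_hz


def ticks_to_ns(ticks: int, frequency_hz: int) -> int:
    """Convert a raw tick count to nanoseconds using integer arithmetic."""
    return ticks * 1_000_000_000 // frequency_hz


period_10mhz = tick_period_ns(10_000_000)     # 10 MHz timer
period_100mhz = tick_period_ns(100_000_000)   # 100 MHz timer
elapsed = ticks_to_ns(250, 10_000_000)        # 250 ticks at 10 MHz
```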

The HPET was introduced to serve multimedia applications with a high-rate clocking mechanism to ensure smooth playback of video. Linux kernels can reference HPET time through various system calls. Lacking a precision time source, developers in finance resort to using HPET to time system events, but HPET shortcomings have become problematic as applications which demand real-world time now span several cores on different processors. Although HPETs provide a fine-grained tick time, they suffer from serious failings when used for system performance timing in financial applications, as they were primarily designed for multimedia.
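
On Linux, one way to see which hardware timer backs the system clock is to read the clocksource entries under sysfs. The sketch below does that from Python and also takes two monotonic timestamps. The sysfs files are only present on Linux, so the helper returns `None` elsewhere; whether `hpet` appears in the list depends on the machine and kernel configuration.

```python
import time
from pathlib import Path

CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksource(name: str):
    """Return the contents of a clocksource sysfs file, or None if the
    file is unavailable (non-Linux system, restricted container, ...)."""
    try:
        return (CLOCKSOURCE_DIR / name).read_text().strip()
    except OSError:
        return None


current = read_clocksource("current_clocksource")      # e.g. "tsc" or "hpet"
available = read_clocksource("available_clocksource")  # space-separated list

# Resolution the OS reports for its monotonic clock, plus two samples.
info = time.get_clock_info("monotonic")
t0 = time.monotonic_ns()
t1 = time.monotonic_ns()

print("current clocksource:", current)
print("available clocksources:", available)
print("reported resolution (s):", info.resolution)
print("back-to-back read delta (ns):", t1 - t0)
```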

HPETs are largely only accurate on a “per core” basis due to CPU bus synchronization issues. Developers wishing to access this timing directly are constrained to either pinning all threads within a process to a single core or, worse yet, binding multiple processes with multiple threads to a single core to achieve equitably accurate clock times. Moreover, widely varied and constantly changing HPET configurations force development teams to constantly struggle to maintain instrumentation within applications, distracting them from the core business logic.

Tick to Trade latency measurement

Last night I met up for drinks with an old friend who is currently working for a mid-size market making firm, and we started discussing tick to trade latency: the period of time from the moment a particular market data tick message appears, through the full order book update, algo processing and order execution, to the fill acknowledgment. We discussed how one accurately measures tick to trade latency when the fastest systems on the street today are under roughly 10 usec. Passive network taps have become essential in low latency system monitoring. Off-host processing of event metrics minimizes the impact of code instrumentation and enables measurement of end-to-end latency across applications and infrastructure components.
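
Once a tap has produced timestamps for both ends of the chain, the latency calculation itself is simple. The Python sketch below assumes, for illustration only, that each tick and each fill acknowledgment carries a shared event ID and a nanosecond timestamp taken from the same clock; the function names and sample data are hypothetical.

```python
import math


def tick_to_trade_latencies(tick_ts_ns: dict, fill_ts_ns: dict) -> list:
    """Pair timestamps by event ID and return sorted latencies in ns.
    Ticks that never produced a fill are ignored."""
    return sorted(fill_ts_ns[eid] - ts
                  for eid, ts in tick_ts_ns.items()
                  if eid in fill_ts_ns)


def percentile(sorted_values: list, pct: float):
    """Nearest-rank percentile of an already sorted list."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


# Hypothetical capture: event ID -> timestamp in nanoseconds.
ticks = {1: 1_000, 2: 2_000, 3: 3_000, 4: 4_000}
fills = {1: 9_000, 2: 9_500, 3: 15_000}      # event 4 produced no order

latencies = tick_to_trade_latencies(ticks, fills)
median_ns = percentile(latencies, 50)
p99_ns = percentile(latencies, 99)
```

Reporting percentiles rather than an average matters here, since a handful of slow outliers is usually what hurts in practice.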

A slew of new application-level tools are available that can time the various components of your tick to trade process chain using deterministic timers with 100 ns accuracy and little to no jitter. Such systems should offer highly efficient, minimally invasive API calls that do not require context switches or kernel-mode execution.

Reliable Multicast Messaging

Reliable Multicast Messaging (RMM) is a high-throughput low-latency transport fabric designed for one-to-many data delivery or many-to-many data exchange, in a message-oriented middleware publish/subscribe fashion. RMM exploits the IP multicast infrastructure to ensure scalable resource conservation and timely information distribution.

Reliability and traffic control are added on top of the standard multicast networking. RMM takes this one step further and enables the support of highly available multicast data distribution, by implementing a number of stream failover policies that allow seamless migration of multicast transmission from failed to backup processes. 
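
One common way to add reliability on top of plain multicast is for each receiver to track sequence numbers and request retransmission of whatever it missed. The Python sketch below shows only the gap-detection part. It is a generic illustration under that assumption, not a description of RMM's actual protocol, and it ignores sequence number wraparound.

```python
class GapDetector:
    """Tracks per-stream sequence numbers and reports missing packets."""

    def __init__(self):
        self.next_expected = None   # unknown until the first packet arrives
        self.missing = set()        # sequence numbers still outstanding

    def on_packet(self, seq: int) -> list:
        """Record an arriving packet; return newly detected missing
        sequence numbers (candidates for a retransmission request)."""
        if self.next_expected is None:
            self.next_expected = seq + 1
            return []
        if seq >= self.next_expected:
            new_gaps = list(range(self.next_expected, seq))
            self.missing.update(new_gaps)
            self.next_expected = seq + 1
            return new_gaps
        # Late or retransmitted packet: fills a gap, or is a duplicate.
        self.missing.discard(seq)
        return []


detector = GapDetector()
events = [detector.on_packet(seq) for seq in (1, 2, 5, 3)]
```

After this run the detector has reported 3 and 4 as missing when packet 5 arrived, and only 4 remains outstanding once packet 3 shows up late.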

Log management in low latency trading systems.

Low latency logging can only be implemented efficiently if it is asynchronous. Using solid state drives is irrelevant with regard to latency and doesn't provide any edge. The general advice still applies: eliminate memory allocations, data copies, lock contention and context switching. Write to a ring buffer in shared memory between the thread that logs and the thread that dispatches the message to the kernel, and if you are really skillful you could possibly implement it lock-free.
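
To make the ring buffer idea concrete, here is a minimal single-producer/single-consumer sketch in Python. It only illustrates the structure: the slots are preallocated, the logging thread alone advances `_head`, and the dispatcher thread alone advances `_tail`. It relies on CPython's global interpreter lock for safe publication; a real low latency implementation would be written in C or C++ with atomic loads and stores. The class and function names are hypothetical.

```python
import threading
import time


class SpscRing:
    """Fixed-capacity single-producer/single-consumer ring buffer.
    None marks "empty", so None itself cannot be stored."""

    def __init__(self, capacity: int):
        if capacity < 1 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two")
        self._slots = [None] * capacity   # preallocated once, never resized
        self._mask = capacity - 1
        self._head = 0                    # next write index (producer only)
        self._tail = 0                    # next read index (consumer only)

    def try_push(self, item) -> bool:
        head = self._head
        if head - self._tail > self._mask:    # buffer is full
            return False
        self._slots[head & self._mask] = item
        self._head = head + 1                 # publish after the slot is filled
        return True

    def try_pop(self):
        tail = self._tail
        if tail == self._head:                # buffer is empty
            return None
        item = self._slots[tail & self._mask]
        self._slots[tail & self._mask] = None
        self._tail = tail + 1                 # release the slot
        return item


ring = SpscRing(8)
written = []


def dispatcher(expected: int):
    """Consumer thread: drains the ring; the append stands in for the
    actual write to the kernel."""
    while len(written) < expected:
        msg = ring.try_pop()
        if msg is None:
            time.sleep(0)                     # yield while the ring is empty
        else:
            written.append(msg)


consumer = threading.Thread(target=dispatcher, args=(100,))
consumer.start()
for i in range(100):
    while not ring.try_push(f"event {i}"):
        time.sleep(0)                         # ring full: yield and retry
consumer.join()
```

The demo retries when the ring is full so that no message is lost. A latency-sensitive logger might prefer to drop the message and bump a counter rather than stall the logging thread.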

Tuesday, November 9, 2010

Welcome new readers!

I've started a new technology blog on blogger.com to cut down on the administrative costs associated with maintaining a dedicated server, MySQL, PHP, WordPress, etc. Hopefully I will have a slew of updates that many of you will find useful.