Thursday, November 11, 2010

RoCEE aka IBoE

RoCEE (RDMA over Converged Enhanced Ethernet) is a standard that runs the InfiniBand transport over Ethernet. It is a highly lightweight transport, layered directly over Ethernet L2; think of it as the FCoE equivalent for high-performance IPC traffic. RDMA reduces endpoint latency by transferring data directly from the memory of one computer into the memory of another, bypassing the operating system and intermediate buffer copies. RoCEE should bring endpoint latency down from the roughly 4.5 us typical of standard Ethernet to about 1.3 us. In effect, RoCEE layers InfiniBand’s layer 2/3 protocols on top of Ethernet’s physical and MAC layers.

RoCEE packets are standard Ethernet frames with an IEEE-assigned Ethertype, a GRH, and unmodified IB transport headers and payload. IB subnet management and SA services are not required for RoCEE operation; standard Ethernet management practices are used instead. RoCEE encodes IP addresses into its GIDs and resolves MAC addresses using the host IP stack. For multicast GIDs, the standard IP-to-MAC mappings apply.
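The two address mappings above are mechanical enough to sketch. Below is a minimal illustration, assuming the common convention of carrying an IPv4 address in a GID as an IPv4-mapped IPv6 address, plus the standard IPv4 multicast-to-Ethernet mapping (01:00:5e prefix followed by the low 23 bits of the group address). The function names are mine, not from any RoCEE library.

```python
import ipaddress

def ipv4_to_gid(ip: str) -> str:
    """Encode an IPv4 address as an IPv4-mapped IPv6 GID (::ffff:a.b.c.d)."""
    v4 = ipaddress.IPv4Address(ip)
    gid = ipaddress.IPv6Address(b"\x00" * 10 + b"\xff\xff" + v4.packed)
    return str(gid)

def ipv4_mcast_to_mac(ip: str) -> str:
    """Map an IPv4 multicast group to its Ethernet MAC:
    fixed 01:00:5e prefix + the low 23 bits of the group address."""
    b = ipaddress.IPv4Address(ip).packed
    return "01:00:5e:%02x:%02x:%02x" % (b[1] & 0x7F, b[2], b[3])
```

Because only 23 bits of the group address survive the mapping, 32 IPv4 multicast groups share each MAC address, which is why receivers still filter on the full IP header.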

The OFA RDMA Verbs API is syntactically unmodified. The CMA (RDMA connection manager) is adapted to support RoCEE ports, allowing existing RDMA applications to run over RoCEE with no changes.

Wednesday, November 10, 2010

What is HPET?

HPET stands for “High Precision Event Timer” – HPETs were introduced around 2005 to supplement the time stamp counter (read via RDTSC) on x86 computers. HPETs run at 10 or 100 MHz, providing a time “tick” of either 100 ns or 10 ns respectively.

The HPET was introduced to give multimedia applications a high-rate clocking mechanism to ensure smooth playback of video. Linux kernels can reference HPET time through various system calls. Lacking a better precision time source, developers in finance resort to using HPET to time system events, but HPET's shortcomings have become problematic as applications that demand real-world time now span several cores on different processors. Although HPETs provide a fine-grained tick, they suffer from serious failings when used for system performance timing in financial applications, as they were primarily designed for multimedia.

HPETs are largely only accurate on a “per core” basis due to CPU bus synchronization issues. Developers wishing to access this timing directly are constrained to either pinning all threads within a process to a single core, or worse yet, binding multiple processes with multiple threads to a single core, to achieve consistently accurate clock times. Moreover, widely varied and constantly changing HPET configurations force development teams into a constant struggle to maintain instrumentation within their applications, distracting them from core business logic.
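The single-core pinning workaround can be sketched in a few lines. This is an assumption-laden illustration, not a production pattern: it uses Linux-only os.sched_setaffinity, and time.monotonic_ns stands in for whatever clock source the kernel has selected (HPET, TSC, or otherwise) rather than a direct HPET register read.

```python
import os
import time

def pinned_monotonic_ns(cpu: int = 0) -> int:
    """Pin the calling process to a single CPU before timestamping,
    so successive reads come from the same core's view of the timer.
    Linux-only: os.sched_setaffinity is not available elsewhere."""
    os.sched_setaffinity(0, {cpu})   # 0 == the calling process
    return time.monotonic_ns()
```

Pinning the whole process this way is exactly the cost the post describes: every thread in the process is now competing for one core just to get mutually comparable timestamps.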

Tick to Trade latency measurement

Last night I met up with an old friend for drinks who is currently working for a mid-size market-making firm, and we started discussing tick-to-trade latency -- the period of time from the moment a particular market data tick message appears, through the full order book update, algo processing, and order execution, to the fill acknowledgment. We discussed how to accurately measure tick-to-trade latency when the fastest systems on the street today are sub-10 usec. Passive network taps have become essential in low-latency system monitoring -- off-host processing of event metrics minimizes the impact of code instrumentation and enables measurement of end-to-end latency across applications and infrastructure components.

A slew of new application-level tools is available that can time the various components of your tick-to-trade process chain with deterministic timers offering 100 ns accuracy and little to no jitter. Such systems should expose highly efficient, minimally invasive API calls that require neither context switches nor kernel-mode execution.
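Once a passive tap has timestamped each stage of the chain, the per-stage breakdown is simple arithmetic. A minimal sketch, with hypothetical capture-point names of my own invention standing in for whatever a real tap or tool reports:

```python
from dataclasses import dataclass

@dataclass
class CaptureRecord:
    """Hypothetical nanosecond timestamps from a passive tap, one per stage."""
    tick_rx_ns: int       # market data tick seen on the wire
    book_update_ns: int   # order book update completed
    algo_done_ns: int     # strategy decision made
    order_tx_ns: int      # order seen leaving the host

def stage_latencies_us(r: CaptureRecord) -> dict:
    """Break tick-to-trade latency into per-stage deltas, in microseconds."""
    return {
        "feed_to_book": (r.book_update_ns - r.tick_rx_ns) / 1e3,
        "book_to_algo": (r.algo_done_ns - r.book_update_ns) / 1e3,
        "algo_to_wire": (r.order_tx_ns - r.algo_done_ns) / 1e3,
        "tick_to_trade": (r.order_tx_ns - r.tick_rx_ns) / 1e3,
    }
```

The point of keeping this arithmetic off-host is that the trading process itself only ever emits (or is tapped for) raw timestamps; all aggregation happens elsewhere.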

Reliable Multicast Messaging

Reliable Multicast Messaging (RMM) is a high-throughput, low-latency transport fabric designed for one-to-many data delivery or many-to-many data exchange in a message-oriented middleware publish/subscribe fashion. RMM exploits the IP multicast infrastructure to ensure scalable resource conservation and timely information distribution.

Reliability and traffic control are added on top of standard multicast networking. RMM takes this a step further and supports highly available multicast data distribution by implementing a number of stream failover policies that allow seamless migration of multicast transmission from failed processes to backup processes.
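The reliability layer on top of raw multicast typically amounts to sequence numbering plus gap detection, with the receiver asking the sender to retransmit anything it missed (a NAK-based scheme). A toy sketch of the receiver side, under that assumption -- this is a generic illustration, not RMM's actual protocol:

```python
class GapDetector:
    """Track per-stream sequence numbers on a multicast receiver and
    report any holes, so the receiver can NAK (request retransmission
    of) the missed packets."""

    def __init__(self):
        self.next_seq = 0   # next sequence number we expect

    def on_packet(self, seq: int) -> list:
        """Return the sequence numbers to NAK: every hole before seq."""
        missing = list(range(self.next_seq, seq))   # empty when in order
        self.next_seq = max(self.next_seq, seq + 1)
        return missing
```

A real implementation also needs retransmission buffers on the sender, NAK suppression so one loss doesn't trigger a NAK storm from every receiver, and the failover policies mentioned above for migrating a stream to a backup sender.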

Log management in low latency trading systems

Low-latency logging can only be implemented efficiently asynchronously; solid-state drives are irrelevant with regard to latency and don't provide any edge. The general advice still applies: eliminate memory allocations, data copies, lock contention, and context switching. Write to a ring buffer in shared memory between the thread that logs and the thread that dispatches messages to the kernel, and if you are really skillful you could possibly implement it lock-free.
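The core of that design is a single-producer/single-consumer ring buffer. A minimal sketch of the index discipline follows; treat it as pseudocode for the real thing, which would live in shared memory with atomic head/tail indices in C or C++ (Python is only used here to keep the example short and runnable):

```python
class SpscLogRing:
    """Single-producer/single-consumer ring buffer: the hot trading
    thread pushes log messages, a background thread drains them to the
    kernel. It can be lock-free because the head index is written only
    by the consumer and the tail index only by the producer."""

    def __init__(self, capacity: int = 1024):
        self.buf = [None] * capacity
        self.capacity = capacity
        self.head = 0   # next slot to read;  written only by the consumer
        self.tail = 0   # next slot to write; written only by the producer

    def try_push(self, msg) -> bool:
        """Producer side: drop the message rather than block the hot path."""
        nxt = (self.tail + 1) % self.capacity
        if nxt == self.head:        # full (one slot is sacrificed)
            return False
        self.buf[self.tail] = msg
        self.tail = nxt
        return True

    def try_pop(self):
        """Consumer side: return the oldest message, or None if empty."""
        if self.head == self.tail:
            return None
        msg = self.buf[self.head]
        self.head = (self.head + 1) % self.capacity
        return msg
```

Note the deliberate choice in try_push: when the buffer is full the hot thread drops the message instead of waiting, which is usually the right trade-off for trading-path logging.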

Tuesday, November 9, 2010

Welcome new readers!

I've started a new technology blog on blogger.com to cut down on the administrative costs of maintaining a dedicated server, MySQL, PHP, WordPress, etc. I should have a slew of updates that many of you will find useful.