Applications of Computing in Industry:
What is Low Latency All About?
eFX – January 2014
Divyakant Bengani
Undergrad degree in Management and IT from Manchester
Vice President at CS, responsible for eFX Core Technologies
Working in the banking industry since 2003 & CS for ~3 years
eFX – What do we do?
Cash FX Only
Spot, Forwards and Swaps
Continuous Publication of Prices
Streaming Executable Rates
Response to Request for Quotes
Acceptance and Booking of Trades
Key Statistics
~200 Currency Pairs (e.g. EURUSD, GBPJPY)
3 billion prices broadcast a day
60,000 trades a day
>200 client connections
Technologies Used
C# for UIs
GWT for Web UIs
Oracle Coherence
Oracle DB
Derby DB
Azul Zing JVM
Low Latency Fix Engine
Socket Connections
Asynchronous JMS
Java RMI
Google Protobuf
Fixed Length Byte Arrays
FIX - Industry Standard
JMS Map Messages
Java Serialization
eFX – Overall Architecture
Service Discovery
Zero Conf
Dynamically add and remove services
Applications do not need to know about each other - just pick up what’s available
Automated Testing
Code Quality Analysis
Continuous Integration
How to Achieve Low Latency
Daniel Nolan-Neylan
Graduated from UCL in 2004
Started working at Credit Suisse in 2006
− First, networking for 4 years
− Now, Application Developer in FX IT
Different projects:
− Distributed caching system for static data
− Simplified credit checking library
− Pricing and trading gateway (now team lead)
Corporate Design, HCBC 1
November 2011
Wait a second!
1 second is:
− 1,000 milliseconds
− 1,000,000 microseconds
− 1,000,000,000 nanoseconds
Latency Numbers Every Programmer Should Know (by Jeff Dean)
L1 cache reference                      0.5 ns
Branch mispredict                       5 ns
L2 cache reference                      7 ns             (14x L1 cache)
Mutex lock/unlock                       25 ns
Main memory reference                   100 ns           (20x L2 cache, 200x L1 cache)
Compress 1K bytes with Zippy            3,000 ns
Send 1K bytes over 1 Gbps network       10,000 ns        (0.01 ms)
Read 4K randomly from SSD*              150,000 ns       (0.15 ms)
Read 1 MB sequentially from memory      250,000 ns       (0.25 ms)
Round trip within same datacenter       500,000 ns       (0.5 ms)
Read 1 MB sequentially from SSD*        1,000,000 ns     (1 ms; 4x memory)
Disk seek                               10,000,000 ns    (10 ms; 20x datacenter round trip)
Read 1 MB sequentially from disk        20,000,000 ns    (20 ms; 80x memory, 20x SSD)
Send packet CA->Netherlands->CA         150,000,000 ns   (150 ms)
FX Trading – Latency Numbers
250ms – A human responding to price update
30ms – Bank accepting trade
10ms – Credit checking client
9ms – JVM Garbage Collecting
5ms – Persisting a trade to disk
2ms – JMS networking round-trip
1ms – Raw socket networking round-trip
0.5ms – Max wire-to-wire pricing latency
0.05ms – Min pricing latency
0.005ms – Writing price to FIX engine
Optimization Quotes
Michael A. Jackson:
“The First Rule of Program Optimization: Don't do it.
The Second Rule of Program Optimization (for experts only!): Don't do
it yet.”
Rob Pike:
“Bottlenecks occur in surprising places, so don't try to second guess
and put in a speed hack until you have proven that's where the
bottleneck is.”
Where to Optimize? Use a Profiler
Measuring Milliseconds and Nanoseconds in Java
Measure time taken for operations and log:
− System.currentTimeMillis()
Good for taking a time/date that can be compared against other
systems. Accuracy depends on OS, but 1ms accuracy achievable on
modern Unix-based OS (Linux)
Bad if more precise measurements are required
− System.nanoTime()
Good for sub-millisecond measurements
Bad if comparable time with other systems required
− Realistically, need to use both
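The two clocks above can be combined in one probe: a minimal sketch (class and method names are illustrative, not from the talk) that wall-clock-stamps an operation with currentTimeMillis() for cross-system comparison, while timing its duration with nanoTime():

```java
// Sketch: currentTimeMillis() gives a timestamp comparable across systems (~1ms),
// nanoTime() gives a monotonic, sub-millisecond duration. Names are illustrative.
public class LatencyProbe {
    public static long timeNanos(Runnable op) {
        long wallClock = System.currentTimeMillis(); // when it happened (cross-system)
        long start = System.nanoTime();              // monotonic start mark
        op.run();
        long elapsed = System.nanoTime() - start;    // precise duration in ns
        System.out.printf("op at %d took %d ns (%.3f ms)%n",
                wallClock, elapsed, elapsed / 1_000_000.0);
        return elapsed;
    }

    public static void main(String[] args) {
        long ns = timeNanos(() -> {
            long sum = 0;                            // stand-in for the operation under test
            for (int i = 0; i < 1_000_000; i++) sum += i;
        });
        assert ns >= 0;
    }
}
```

Note that nanoTime() values are only meaningful as differences within one JVM; the millisecond timestamp is what you log for correlating against other systems.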
Quote Journalling – log latency of every price
Our Soak Test Harness
…and the graphs it can produce
Removing Millisecond Delays
Identify the longest-running tasks
− Usually I/O delays
– Database activity
– Synchronous logging
– Writing files
– Calling network services
– Remote services far away (e.g. across the Atlantic, ~50ms)
Removing Millisecond Delays (2)
Analyze whether delays can be eliminated
− Disk
Database activity -> Use a cache
Synchronous logging -> Use asynchronous logging
Writing files -> Use buffers and write asynchronously
− Network
Calling network services -> Cache where possible
Remote services far away -> Co-locate in same place
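The "synchronous logging -> asynchronous logging" swap above can be sketched with a queue and a background writer thread (a hypothetical minimal version, not the bank's logger): the hot path only enqueues, and the slow I/O happens on a dedicated thread.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of asynchronous logging: the trading thread pays only the cost of
// an enqueue; a daemon thread absorbs the disk latency. Names are illustrative.
public class AsyncLogger {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    public AsyncLogger() {
        Thread writer = new Thread(() -> {
            try {
                while (true) {
                    String line = queue.take(); // blocks the writer, not the hot path
                    System.out.println(line);   // stand-in for slow file/disk I/O
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "async-logger");
        writer.setDaemon(true);
        writer.start();
    }

    // Hot path: no disk latency on the calling thread.
    public void log(String line) {
        queue.offer(line);
    }

    public int pending() {
        return queue.size();
    }
}
```

In production you would bound the queue and decide on a drop-or-block policy, otherwise a slow disk just converts latency into unbounded memory growth.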
FX Trading – RFQ Example
E.g. an incoming request for a price; target response time is 10ms
− Need to:
Validate request parameters
Internally subscribe for prices
Obtain a globally unique transaction ID
Perform a credit check
How to get all this done in just 10ms?
FX Trading – RFQ Example (2)
Credit check
− Old one took 30-200ms
− New one takes 5-10ms
Using Caching and Co-location
Parallelize all validation
Pre-cache prices
− by opening up price streams in advance of being required
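"Parallelize all validation" can be sketched with CompletableFuture: run the independent RFQ steps concurrently so the total time approaches the slowest step (the 5-10ms credit check) rather than the sum. All service and method names here are hypothetical stand-ins.

```java
import java.util.concurrent.CompletableFuture;

// Sketch: the four RFQ steps from the slides, launched concurrently.
// Total latency ~ max(step latencies), not their sum. Names illustrative.
public class RfqHandler {
    public String handle(String request) throws Exception {
        CompletableFuture<Boolean> validated =
                CompletableFuture.supplyAsync(() -> validate(request));
        CompletableFuture<Boolean> credit =
                CompletableFuture.supplyAsync(() -> creditCheck(request));    // 5-10ms target
        CompletableFuture<String> txnId =
                CompletableFuture.supplyAsync(() -> nextTransactionId());
        CompletableFuture<Double> price =
                CompletableFuture.supplyAsync(() -> subscribePrice(request)); // pre-cached stream

        if (!validated.get() || !credit.get()) return null; // reject the request
        return txnId.get() + "@" + price.get();
    }

    // Stand-ins for the real services:
    private boolean validate(String r)       { return !r.isEmpty(); }
    private boolean creditCheck(String r)    { return true; }
    private String  nextTransactionId()      { return "TXN-1"; }
    private double  subscribePrice(String r) { return 1.2345; }
}
```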
Don’t Optimize Too Soon
− Only optimize what you need to optimize
− Remove longest delays first
No point removing micros if you still have delays of millis or worse
− Always measure your operations carefully
Determine what minimum, maximum, mean, standard deviation, and
other percentiles are (99%, 99.9%, etc)
− Watch for jitter and solve separately
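The statistics listed above (min, max, mean, 99% and 99.9% percentiles) can be computed from a batch of latency samples with a few lines; a minimal sketch, with illustrative names:

```java
import java.util.Arrays;

// Sketch: summarising latency samples with min, max, mean and high percentiles.
// The percentile method expects a sorted array. Names are illustrative.
public class LatencyStats {
    // Nearest-rank percentile: p in (0, 100].
    public static double percentile(long[] sortedNanos, double p) {
        int idx = (int) Math.ceil(p / 100.0 * sortedNanos.length) - 1;
        return sortedNanos[Math.max(0, idx)];
    }

    public static void summarise(long[] nanos) {
        long[] sorted = nanos.clone();
        Arrays.sort(sorted);
        double mean = Arrays.stream(sorted).average().orElse(0);
        System.out.printf("min=%d max=%d mean=%.1f p99=%.0f p99.9=%.0f%n",
                sorted[0], sorted[sorted.length - 1], mean,
                percentile(sorted, 99), percentile(sorted, 99.9));
    }
}
```

The mean alone hides jitter: a system with a 0.05ms mean and occasional 9ms GC pauses looks fine on average, which is why the slides insist on the 99% and 99.9% tails.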
Removing Microsecond Delays
Intra-process delays
− Unbalanced / slow queues
− Slow algorithms
Expensive loops repeated many times
Poor use of object creation / memory allocation
Contended memory guarded by locks
Wasted effort calculating unwanted results
FX Trading – Pricing Example
Achieving wire-to-wire latencies of 50μs
− Google protobuf parsers replaced with low-garbage creating versions
each GC stops the JVM for 9,000μs (i.e. 9ms)
− LMAX Disruptors used instead of queues
Busy spin consumer threads / single-write principle
− “PriceBigDecimal” class to replace Java BigDecimal class
BigDecimal slow to instantiate and impossible to mutate
− No synchronous logging or network calls
− Pre-cache static data before starting price stream
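The internals of "PriceBigDecimal" are not shown in the talk, but the idea behind it can be sketched: a mutable fixed-point price held as a scaled long, updated in place so that parsing each tick allocates nothing (and therefore generates no garbage for the 9ms GC pauses mentioned above). This is an assumption-laden illustration, not the bank's class.

```java
// Sketch of a mutable, low-garbage price type (illustrative; non-negative
// values only for brevity): a long scaled by 10^scale, mutated in place.
public class MutablePrice {
    private long units; // value scaled by 10^scale, e.g. 123450 with scale 5 = 1.23450
    private int scale;  // number of decimal places

    public void set(long units, int scale) { // reuse the instance: no allocation per tick
        this.units = units;
        this.scale = scale;
    }

    public double toDouble() {
        double d = units;
        for (int i = 0; i < scale; i++) d /= 10.0;
        return d;
    }

    @Override public String toString() {
        StringBuilder sb = new StringBuilder(Long.toString(units));
        while (sb.length() <= scale) sb.insert(0, '0'); // pad so the point fits
        sb.insert(sb.length() - scale, '.');
        return sb.toString();
    }
}
```

Contrast with BigDecimal: every price update would construct a new immutable object (often several, once arithmetic is involved), which is exactly the allocation pressure the slide is avoiding.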
Disruptor or Blocking Queues?
Java BigDecimal or use Low Latency replacement?
Removing Nanoseconds?
Use specialist hardware (such as FPGA)
Understand low-level CPU interconnectivity with memory, and how CPU
caching works (including cache-lines)
eFX – No need to pursue this level of performance at the moment
Latency vs Throughput
Latency - time taken (typically mean, percentile or worst case) to
complete a task
Throughput – the number of tasks completed in a given time period
(typically, per second)
Throughput is 1/latency (per pipeline)
Increasing Throughput
Identify delays
− Throughput constrained by latency
− Blocking I/O calls delay unprocessed messages
Data bursts
− What’s the peak throughput required?
− What’s the gap typically between bursts?
Techniques to Increase Throughput
− Sometimes latent calls are unavoidable
− Batching strips the overhead of making one call per transaction
− The cost of batching is the delay incurred waiting for new items to add to
the batch
− It is also harder to accurately measure delay per item when multiple items
are in a batch
FX Trading – Batching Example
Legacy global server in London
Regional trade acceptance components
Latency between New York and London - 50ms
Per thread: 1/0.05 = 20 trades per second max
How to increase?
− More threads
− Add batching per thread
Now, with a batch size of 5: 100 trades per second per thread
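The arithmetic above (one 50ms round-trip per batch of 5 instead of per trade, so 20/s becomes ~100/s per thread) can be sketched as a simple drain-to-batch loop; names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of per-thread batching: the sender thread waits for at least one
// trade, then grabs whatever else is already queued, and pays the 50ms
// round-trip once for the whole batch. Names are illustrative.
public class TradeBatcher {
    private static final int MAX_BATCH = 5;
    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();

    public void submit(String trade) {
        pending.offer(trade);
    }

    // Called in a loop by the sender thread.
    public List<String> nextBatch() throws InterruptedException {
        List<String> batch = new ArrayList<>(MAX_BATCH);
        batch.add(pending.take());              // block until at least one trade exists
        pending.drainTo(batch, MAX_BATCH - 1);  // then take what else is ready, no waiting
        return batch;                           // one 50ms remote call ships all of these
    }
}
```

Because drainTo never waits, a quiet period still sends single-trade batches immediately; the per-item delay cost only appears under load, when batches actually fill.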
Techniques to Increase Throughput (2)
Use Asynchronous callbacks
− Synchronous calls:
boolean doCall()
Wait for response
Can be delayed for varying time
− Asynchronous calls:
void doCall(Callback callback)
Do not wait and keep processing more events
Can additionally overlay timeouts to improve resilience
FX Trading – Asynchronous Callbacks
Submission of a trade to the price service for verification was originally
synchronous
Call blocks for 50ms – max 20 trades per second per thread
After converting to asynchronous callbacks, the only delay is putting
packets on the network buffer (μs), so effectively no delay – the max number
of trades per second is very high!
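The synchronous-to-asynchronous conversion described above can be sketched as follows (interface and class names are hypothetical): the caller hands over a callback and returns immediately to processing the next trade, instead of blocking ~50ms on the wire.

```java
import java.util.concurrent.CompletableFuture;

// Sketch of converting a blocking verification call into a callback style.
// Names are illustrative stand-ins for the real trade/price services.
public class AsyncVerifier {
    public interface Callback { void onResult(boolean accepted); }

    // Synchronous style: blocks the calling thread for the full round-trip.
    public boolean verifyBlocking(String trade) {
        return remoteCheck(trade);
    }

    // Asynchronous style: returns immediately; the result arrives later
    // on another thread via the callback.
    public void verify(String trade, Callback callback) {
        CompletableFuture.runAsync(() -> callback.onResult(remoteCheck(trade)));
    }

    private boolean remoteCheck(String trade) { // stand-in for the ~50ms remote call
        return trade != null;
    }
}
```

As the slides note, a timeout should be overlaid on each outstanding callback (e.g. CompletableFuture.orTimeout on Java 9+) so a lost response cannot strand a trade forever.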