Team 3 Fault Tolerant EXchange 18-749: Fault Tolerant Distributed Systems Ernest Chan

advertisement
Team 3
Fault Tolerant EXchange
18-749: Fault Tolerant Distributed Systems
Ernest Chan
Yong Khong Chong
Han Chun Lim
Imran Sulaiman
Yik Shing Yip
1/31
Team Members
Ernest Chan (eechan)
Han Chun Lim (hanl)
2
Yong Khong Chong (ykc)
Imran Sulaiman (ais)
Yik Shing Yip
(Arnold) (ysy)
Baseline Application

Description

3
A fault-tolerant and highperformance market making
stock exchange. (i.e.
NASDAQ, Island,
Archipelago). Matches buy
and sell order of multiple
clients.
FTEX Screenshots
4
Baseline Application

Software Used






5
Java SDK 1.4.2
MySQL (on mahjongg)
Xdoclet
Ant
Jboss 3.2.3
Eclipse
Baseline Application

High-level Components
 Clients
 One Server
 One Database
Clients
6
Fault-Tolerant Goals




Passive replication of servers
Servers are stateless
Can have any number of replicas
Sacred machines


7
mahjongg (database)
go (global JNDI, replication manager)
Fault-Tolerant Goals

Elements of FT framework

Replication Manager



8
Detects faults on servers
Automatically recovers servers
Fault Injector
FT-Baseline Architecture
Clients
9
Mechanisms for Fail-Over


10
Clients get server names from Global
JNDI at startup
Clients move to the next server during an
exception
Fault-Tolerance Experimentation

4 invocated methods experimented






11
putOrder()
getAccount()
login()
getTransactions()
48 configurations
10000 invocations per client per experiment
Fault-Tolerance Experimentation

12
The magical 1%
Fault-Tolerance Experimentation

13
For the outliers, the majority of the latency is caused by
delays in the middleware
Fault-Tolerance Experimentation
14
Fault-Tolerance Experimentation
15
Fail-Over Measurements
Original Latency with failover
8.000000
Latency (seconds)
7.000000
6.000000
5.000000
4.000000
3.000000
2.000000
1.000000
0.000000
1
86
171 256 341 426
511 596 681 766
851 936 1021 1106 1191 1276 1361 1446 1531 1616 1701 1786 1871 1956 2041 2126 2211 2296 2381 2466
Request #
Latency
16
Failover
RT Fail-Over Measurements
Breakdown of original failover latency
Processing
14%
Fault Detection
7%
Failover
Fault Detection
Processing
Failover
79%
17
RT-FT-Performance Strategy


79% of the time is spent on fail-over
Proposed strategies:




18
Pre-resolve server names
Pre-establish connections on servers
Replication Manager tells client when primary dies
Replication Manager has a TCP connection to
each server
RT-FT-Baseline Architecture
Client
19
Bounded RT Fail-Over
Measurements
Changes in Failover Time
Failover Time (seconds)
4.000000
3.784534
3.500000
3.000000
2.500000
2.000000
1.500000
1.000000
0.500000
0.061474
0.000000
Original
0.010984
Prefetch server names
Preestablish
connections
Modification
Modifications
20
Failover Time (Seconds)
% change
Original
3.784534
0
Prefetch server names
0.061474
-98.38%
Preestablish connections
0.010984
-82.13%
Bounded RT Fail-Over
Measurements
Latency with Failover with pre-fetching server names
8.000000
Latency (Seconds)
7.000000
6.000000
5.000000
4.000000
3.000000
2.000000
1.000000
0.000000
1
86
171 256 341 426
511 596 681 766
851 936 1021 1106 1191 1276 1361 1446 1531 1616 1701 1786 1871 1956 2041 2126 2211 2296 2381 2466
Request #
Latency
21
Failover
Bounded RT Fail-Over
Measurements
Breakdown of failover latency with pre-fetching server names
Failover
9%
Fault Detection
25%
Processing
66%
Failover
Fault Detection
Processing
22
Bounded RT Fail-Over
Measurements
Breakdown of failover latency with pre-established connections
Failover
1%
Fault Detection
28%
Processing
71%
Failover
Fault Detection
Processing
23
Other Ideas

Communication between client and
Replication Manager

Java Programming



24
Dealing with multiple threads
Opening sockets, sending messages..
Keep TCP connections between Replication
Manager and Servers
Open Issues

Additional features we want


Active Replication
Elimination of single-point failure



25
Replication Manager
Database
Load balancing
Conclusions



26
Lessons learned
Accomplishments
Things we would have done differently
Lessons of the Project

Communication



Data analysis


Maximum / Average may not be the best measure
Bottleneck of our system

27
Within our team / with the TAs
Requirements and Expectations
Only one server and one sole database
handling every request
Lessons of the Project


Replication techniques and issues involved
Methods of data presentation



Random errors in experimental data



28
Box plots
3D scatter plots
Server load
Database size / load
Scripting
Accomplishments




29
Successfully completed all our objectives and
features set forth in the initial project
description sent to Priya
Developed a fault-tolerant stock-exchange
system
Gathered matrices to measure the system
We are done !
If we could re-do our project ...


30
Distributed the work better (avoid bottlenecks
in development)
Setup our own database server
Thank You!

31
Any Questions?
Download