Team 3 Fault Tolerant EXchange 18-749: Fault Tolerant Distributed Systems Ernest Chan

Team 3 Fault Tolerant EXchange 18-749: Fault Tolerant Distributed Systems Ernest Chan Yong Khong Chong Han Chun Lim Imran Sulaiman Yik Shing Yip 1/31 Team Members Ernest Chan (eechan) Han Chun Lim (hanl) 2 Yong Khong Chong (ykc) Imran Sulaiman (ais) Yik Shing Yip (Arnold) (ysy) Baseline Application  Description  3 A fault-tolerant and highperformance market making stock exchange. (i.e. NASDAQ, Island, Archipelago). Matches buy and sell order of multiple clients. FTEX Screenshots 4 Baseline Application  Software Used       5 Java SDK 1.4.2 MySQL (on mahjongg) Xdoclet Ant Jboss 3.2.3 Eclipse Baseline Application  High-level Components  Clients  One Server  One Database Clients 6 Fault-Tolerant Goals     Passive replication of servers Servers are stateless Can have any number of replicas Sacred machines   7 mahjongg (database) go (global JNDI, replication manager) Fault-Tolerant Goals  Elements of FT framework  Replication Manager    8 Detects faults on servers Automatically recovers servers Fault Injector FT-Baseline Architecture Clients 9 Mechanisms for Fail-Over   10 Clients get server names from Global JNDI at startup Clients move to the next server during an exception Fault-Tolerance Experimentation  4 invocated methods experimented       11 putOrder() getAccount() login() getTransactions() 48 configurations 10000 invocations per client per experiment Fault-Tolerance Experimentation  12 The magical 1% Fault-Tolerance Experimentation  13 For the outliers, the majority of the latency is caused by delays in the middleware Fault-Tolerance Experimentation 14 Fault-Tolerance Experimentation 15 Fail-Over Measurements Original Latency with failover 8.000000 Latency (seconds) 7.000000 6.000000 5.000000 4.000000 3.000000 2.000000 1.000000 0.000000 1 86 171 256 341 426 511 596 681 766 851 936 1021 1106 1191 1276 1361 1446 1531 1616 1701 1786 1871 1956 2041 2126 2211 2296 2381 2466 Request # Latency 16 Failover RT Fail-Over Measurements Breakdown of original failover latency Processing 14% Fault Detection 7% Failover Fault Detection Processing Failover 79% 17 RT-FT-Performance Strategy   79% of the time is spent on fail-over Proposed strategies:     18 Pre-resolve server names Pre-establish connections on servers Replication Manager tells client when primary dies Replication Manager has a TCP connection to each server RT-FT-Baseline Architecture Client 19 Bounded RT Fail-Over Measurements Changes in Failover Time Failover Time (seconds) 4.000000 3.784534 3.500000 3.000000 2.500000 2.000000 1.500000 1.000000 0.500000 0.061474 0.000000 Original 0.010984 Prefetch server names Preestablish connections Modification Modifications 20 Failover Time (Seconds) % change Original 3.784534 0 Prefetch server names 0.061474 -98.38% Preestablish connections 0.010984 -82.13% Bounded RT Fail-Over Measurements Latency with Failover with pre-fetching server names 8.000000 Latency (Seconds) 7.000000 6.000000 5.000000 4.000000 3.000000 2.000000 1.000000 0.000000 1 86 171 256 341 426 511 596 681 766 851 936 1021 1106 1191 1276 1361 1446 1531 1616 1701 1786 1871 1956 2041 2126 2211 2296 2381 2466 Request # Latency 21 Failover Bounded RT Fail-Over Measurements Breakdown of failover latency with pre-fetching server names Failover 9% Fault Detection 25% Processing 66% Failover Fault Detection Processing 22 Bounded RT Fail-Over Measurements Breakdown of failover latency with pre-established connections Failover 1% Fault Detection 28% Processing 71% Failover Fault Detection Processing 23 Other Ideas  Communication between client and Replication Manager  Java Programming    24 Dealing with multiple threads Opening sockets, sending messages.. Keep TCP connections between Replication Manager and Servers Open Issues  Additional features we want   Active Replication Elimination of single-point failure    25 Replication Manager Database Load balancing Conclusions    26 Lessons learned Accomplishments Things we would have done differently Lessons of the Project  Communication    Data analysis   Maximum / Average may not be the best measure Bottleneck of our system  27 Within our team / with the TAs Requirements and Expectations Only one server and one sole database handling every request Lessons of the Project   Replication techniques and issues involved Methods of data presentation    Random errors in experimental data    28 Box plots 3D scatter plots Server load Database size / load Scripting Accomplishments     29 Successfully completed all our objectives and features set forth in the initial project description sent to Priya Developed a fault-tolerant stock-exchange system Gathered matrices to measure the system We are done ! If we could re-do our project ...   30 Distributed the work better (avoid bottlenecks in development) Setup our own database server Thank You!  31 Any Questions?

Team 3 Fault Tolerant EXchange 18-749: Fault Tolerant Distributed Systems Ernest Chan

Related documents

Products

Support

Team 3 Fault Tolerant EXchange 18-749: Fault Tolerant Distributed Systems Ernest Chan

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib