Carnegie Mellon

Got Predictability? Experiences with FT Middleware
Tudor Dumitraş, Priya Narasimhan
Carnegie Mellon University

Who Needs Predictability?
- Service-level agreements
- Problem determination, fingerpointing
- Self-management, autonomic computing
- FT-middleware protects the critical parts of IT infrastructures
  - Higher predictability requirements

© 2007 Tudor Dumitraş. Got Predictability? Experiences with Fault-Tolerant Middleware

Predictability of Fault-Tolerant Middleware
- Faults are inherently unpredictable
- What about the fault-free case?
  - Reportedly, max(response time) >> average(response time)

Empirical Data Collected
- MEAD Trace: micro-benchmark (client-server)
  - Middleware for Embedded Adaptive Dependability
  - Fault-Tolerant CORBA implementation
  - 1200 configurations
- FTDS Trace: 7 macro-benchmarks (3-tier applications)
  - Developed during Fault-Tolerant Distributed Systems class
  - Enterprise applications: online gaming, e-commerce
  - Use CORBA or EJB
  - 336 configurations
- Available at: http://www.ece.cmu.edu/~tdumitra/FT_traces/

Fault-Free vs. Faulty Unpredictability (MEAD Trace)
[Figure: average recovery time and maximum fault-free latency. y-axis: Recovery Time [s] (0 to 2); x-axis: Number of Clients (1 to 22).]

Fault-Free vs. Faulty Unpredictability (FTDS Trace)
[Figure: recovery time per project, with series Fault Detection & Fail-over, Fault Detection, Fail-over, Request Processing, and Max Fault-Free Latency. y-axis: Recovery Time [s] (0 to 1.2, with one value annotated 13.6 s); x-axis: Project (1 to 7).]

Outline
- Can we predict the maximum latency of FT middleware?
- When do high latencies occur and how high are they?
- How common are the high latencies?
- Do most requests have bounded latencies?

MEAD Architecture
[Diagram: client and server stacks, each layered as CORBA, Replicator, Group Communication, Host OS, and connected through Networking. The Replicator consists of an interface to the application / CORBA (modified system calls), tunability (tunable mechanisms: replication style, #replicas), replicated state, and an interface to Group Communication.]
- Active replication: all replicas process requests
- Passive replication: primary replica processes requests

Applications from the FTDS Trace
1. Su-Duel-Ku: competitive Sudoku
2. Blackjack: online casino
3. FTEX: electronic stock exchange
4. eJBay: online auctioning
5. Mafia: online game
6. Park’n Park: parking-lot management
7. Ticket Center: online ticketing
[Legend: EJB, CORBA]

Architecture of FTDS Applications
[Diagram]

Example of Unpredictability
[Figures: latency [μs] over time [s] (0 to 35 s); probability density (PDF) of latency [μs].]
- Maximum latency can be orders of magnitude larger than the average

Unpredictability in the MEAD Trace
[Figures: average latency [μs] and maximum latency [μs], each on a log scale from 10^3 to 10^7, plotted over two parameter axes with ticks 16, 256, 4096, 65536 and 0 to 5000.]

Average and Maximum Latency
[Figures: maximum latency [s] vs. average latency [s]. Left: MEAD, linear axes (0 to 4 s). Right: MEAD, SuDuelKu, FTEX, Park’n Park, and Ticket Center, log axes.]

Statistical Analysis of Unpredictability
[Figure: probability density (PDF) of latency [μs], annotated with Mean, Max, and Z max.]

Correlation with Message Size (MEAD)
[Figure: percentage of outliers (0% to 1.5%) and maximum z-score (0 to 300) vs. size of reply messages [bytes] (16, 256, 4096, 16384, 65536).]

Time in Kernel and User Mode (MEAD)
- 25% kernel mode: 16 KB and 64 KB
- 10% kernel mode: 16 B, 256 B and 4 KB
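
The two quantities plotted in the statistical analysis above, the percentage of outliers and the maximum z-score, can be sketched as follows. The slides do not state the outlier rule, so the threshold of 3 standard deviations is an assumption, as are the function names and the synthetic sample; none of these come from the traces.

```python
from statistics import mean, pstdev


def z_scores(latencies):
    """Return (x - mean) / std for every latency (population std)."""
    mu = mean(latencies)
    sigma = pstdev(latencies)
    if sigma == 0:
        # All latencies identical: no spread, so every z-score is 0.
        return [0.0 for _ in latencies]
    return [(x - mu) / sigma for x in latencies]


def outlier_stats(latencies, threshold=3.0):
    """Return (percentage of outliers, maximum z-score).

    ASSUMPTION: an outlier is a latency whose z-score exceeds `threshold`;
    the value 3.0 is illustrative, not taken from the slides.
    """
    zs = z_scores(latencies)
    outliers = sum(1 for z in zs if z > threshold)
    return 100.0 * outliers / len(latencies), max(zs)


# Synthetic latencies in microseconds (invented for illustration):
# 99 fast requests and one slow request.
sample = [1000.0] * 99 + [20000.0]
pct_outliers, z_max = outlier_stats(sample)
print(f"outliers: {pct_outliers:.1f}%  max z-score: {z_max:.2f}")
```

With a single slow request among 100, the maximum latency is 20 times the typical latency, yet only 1% of the requests count as outliers, which is the pattern the talk examines.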

Number and Size of Outliers (FTDS)
[Figure: percentage of outliers (0% to 3%) and maximum z-score (0 to 150) per FTDS project: SuDuelKu, FTEX, eJBay, Mafia, Ticket Center, Blackjack, Park’n Park.]

Correlation with Number of Clients (FTDS)
[Figure, SuDuelKu: percentage of outliers (0% to 6%) and maximum z-score (0 to 60) vs. number of clients (1, 4, 7, 10).]

Correlation with Request Rate (FTDS)
[Figure, FTEX: percentage of outliers (0% to 6%) and maximum z-score (0 to 60) vs. request rate [req/s] (5 to 25).]

Outlier Distribution (MEAD)
[Figure: histogram of the number of experiments (0 to 1200) vs. outliers per experiment (0% to 6%).]

Outlier Distribution (Comparison)
[Figure: probability density (0 to 1) of outliers per experiment (0% to 6%) for Ticket Center, eJBay, Park’n Park, Blackjack, FTEX, Mafia, and SuDuelKu.]

Isolating the Unpredictability (MEAD)
- The “haircut” effect of removing 1% of the highest latencies
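
The "haircut" can be illustrated with a small sketch that drops the highest 1% of latencies and compares what remains with the raw maximum. The nearest-rank percentile definition, the function names, and the synthetic sample are assumptions made for illustration; the slides do not say how the percentile was computed.

```python
def percentile_nearest_rank(values, p):
    """Nearest-rank percentile for an integer p in 1..100.

    Returns the smallest value such that at least p% of the data
    is less than or equal to it.
    """
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # integer ceiling of p*n/100
    return ordered[max(rank, 1) - 1]


def haircut(values, percent=1):
    """Drop the highest `percent`% of the values (rounded down)."""
    ordered = sorted(values)
    drop = len(ordered) * percent // 100
    return ordered[:len(ordered) - drop]


# Synthetic latencies in seconds (invented for illustration):
# 198 requests between 1.0 s and about 3.0 s, plus two extreme requests.
sample = [1.0 + 0.01 * i for i in range(198)] + [50.0, 400.0]

trimmed = haircut(sample)
p99 = percentile_nearest_rank(sample, 99)

print("raw maximum:        ", max(sample))
print("maximum after cut:  ", max(trimmed))
print("99th percentile:    ", p99)
```

In this sample the raw maximum is set entirely by the two extreme requests, while the maximum after the cut coincides with the 99th percentile and stays close to the bulk of the data.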

The Magical 1%
- Unpredictability seems to be confined to 1% of the remote invocations.
[Figure panels: SuDuelKu, Mafia, Blackjack, Park’n Park, FTEX, MEAD. Each panel shows latency [s] per experiment, with series for average latency, 99th percentile, and maximum latency.]

Bounds for the 99th Percentile
[Figure: z-scores of latency (0 to 240) for MEAD, Ticket Center, Park’n Park, Mafia, eJBay, FTEX, Blackjack, and SuDuelKu, showing the latency range, the 99th percentiles, and the confidence interval; the 99% z-score bound is marked at 10.]

Trends for the 99th Percentile (MEAD)
[Figure: 99% latency (log scale, 10^3 to 10^7) vs. request size [bytes] (16 to 65536) and request rate [req/s] (0 to 5000).]

Summary
- Can we predict the maximum latency of FT middleware?
  - Not always; maximum usually not correlated with average
- When do high latencies occur and how high are they?
  - Usually not correlated with configuration parameters, OS metrics
  - Comparable with recovery time after crash faults
- How common are the high latencies?
  - Confined to 1% of remote invocations
- Do most requests have bounded latencies?
  - 99% of requests have a z-score < 10

Implications of the Magical 1%
- Predictable maximum latencies are hard to achieve
  - Cannot eliminate high latencies by carefully configuring the system
- Statistical predictability is easy to achieve
  - 99th percentile latency bounded with high confidence
- Confirmed for different
  - Applications
  - Programming languages
  - Middleware technologies
  - Replication mechanisms
  - Operating systems
- Not confirmed for WANs, wireless networks
- Statistical predictability is relevant for many enterprise applications

Thank You!
For more information: http://www.ece.cmu.edu/~tdumitra

MEAD Trace vs. FTDS Trace

                        MEAD                               FTDS
Programming language    C++                                Java
Middleware              CORBA                              EJB, CORBA
Tiers                   2 (client, server)                 3 (client, business logic, DB)
Replication mechanisms  ORB-level, transparent             Application-level
Recovery coordination   Distributed (group communication)  Centralized (replication manager)
Operating system        TimeSys Linux                      SUSE Linux
Environment             Isolated experiments               Shared cluster

Experimental Setup
- MEAD test bed
  - Emulab
  - 100 Mb/s LAN
  - Pentium III at 850 MHz
  - Parameters varied
    - Replication style: active, passive
    - Replication degree: 1, 2, 3 replicas
    - Number of clients: 1 to 22
    - Think time: 0, 0.5, 2, 8, 32 ms
    - Reply size: 16 B, 256 B, 4 KB, 16 KB, 64 KB
- FTDS test bed
  - Undergraduate cluster
  - 100 Mb/s LAN
  - Pentium IV at 2.4 GHz
  - Parameters varied
    - Clients: 1, 4, 7, 10
    - Think time: 0, 20, 40 ms
    - Reply size: original, 256 B, 512 B, 1 KB
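
As a rough cross-check, the parameters listed in the experimental setup can be enumerated as a full grid and counted. Two points are assumptions rather than statements from the slides: that the MEAD client counts are 1, 4, 7, ..., 22 (suggested only by the tick marks of the MEAD plots), and that every FTDS project was run over the complete FTDS grid. The dictionary keys and the `configurations` helper are hypothetical names.

```python
from itertools import product

# MEAD parameters as listed in the setup. ASSUMPTION: the client counts
# step by 3 from 1 to 22; the slide only says "1 to 22".
mead_grid = {
    "replication_style": ["active", "passive"],
    "replicas": [1, 2, 3],
    "clients": list(range(1, 23, 3)),  # 1, 4, 7, 10, 13, 16, 19, 22
    "think_time_ms": [0, 0.5, 2, 8, 32],
    "reply_size_bytes": [16, 256, 4096, 16384, 65536],
}

# FTDS parameters. ASSUMPTION: each of the 7 projects ran the full grid.
ftds_grid = {
    "project": ["SuDuelKu", "Blackjack", "FTEX", "eJBay",
                "Mafia", "Park'n Park", "Ticket Center"],
    "clients": [1, 4, 7, 10],
    "think_time_ms": [0, 20, 40],
    "reply_size": ["original", 256, 512, 1024],
}


def configurations(grid):
    """Enumerate every combination of parameter values as a dict."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*grid.values())]


mead_configs = configurations(mead_grid)
ftds_configs = configurations(ftds_grid)
print(len(mead_configs), "MEAD configurations")
print(len(ftds_configs), "FTDS configurations")
```

Under these assumptions the counts come out to 1200 and 336, which agrees with the totals reported for the two traces, although the agreement does not by itself confirm how the experiments were laid out.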

Sources of Unpredictability
[Diagram: request and reply path between client and server through the Application, ORB, Replicator (interc_hi, interc_lo), and Group Communication layers, with "in" and "out" points marked at each layer.]

Passive Replication
[Diagram: a passively replicated client object and a passively replicated server object, each with a primary replica, per-replica ORBs, and state. Request, response, and state transfer flow between the client group and the server group.]

Active Replication
[Diagram: an actively replicated client and an actively replicated server with per-replica ORBs. Duplicate invocations and duplicate responses are suppressed between the client group and the server group.]

Correlation with Number of Clients (MEAD)
[Figure: percentage of outliers (0% to 1%) and maximum z-score (0 to 400) vs. number of clients (1 to 22).]

Minor Page Faults (MEAD)
[Figure]