The Mystery Machine: End-to-end performance analysis of large-scale Internet services
Michael Chow, David Meisner, Jason Flinn, Daniel Peek, Thomas F. Wenisch

Internet services are complex
[Figure: a single request's execution spans many tasks over time]
• Scale and heterogeneity make Internet services complex

Analysis Pipeline
• Step 1: Identify segments
• Step 2: Infer causal model
• Step 3: Analyze individual requests
• Step 4: Aggregate results

Challenges
• Previous methods derive a causal model by
  – Instrumenting the scheduler and communication
  – Building the model through human knowledge
• Need a method that works at scale with heterogeneous components

Opportunities
• Component-level logging is ubiquitous
  – Tremendous detail about a request's execution
• A large number of requests is handled
  – Coverage of a large range of behaviors

The Mystery Machine
1) Infer causal model from a large corpus of traces
   – Identify segments
   – Hypothesize all possible causal relationships
   – Reject hypotheses with observed counterexamples
2) Analysis
   – Critical path, slack, anomaly detection, what-if

Step 1: Identify segments

Define a minimal schema
[Figure: a task is divided into segments, each delimited by logged events]
• Each log entry records: request identifier, machine identifier, timestamp, task, event
• Aggregate existing logs using the minimal schema

Step 2: Infer causal model

Types of causal relationships
• Happens-before: A always completes before B begins
  – Counterexample: B observed starting before A finishes
• Mutual exclusion: A and B never overlap (either A then B, or B then A)
  – Counterexample: A and B observed overlapping
• Pipeline: if task t1 processes A, B, C in order, task t2 processes A', B', C' in the same order
  – Counterexample: t2 observed processing C', B', A'

Producing causal model
[Figure: hypothesized relationships among segments S1, N1, C1, S2, N2, C2 are pruned as counterexamples appear in Trace 1 and Trace 2]

Step 3: Analyze individual requests

Critical path using causal model
[Figure: the causal model applied to Trace 1 (segments S1, N1, S2, C1, N2, C2) identifies which segments lie on the critical path]

Step 4: Aggregate results

Inaccuracies of Naïve Aggregation
• Need a causal model to correctly understand latency
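To make Steps 2 and 3 concrete, the sketch below (in Python) shows one way the hypothesize-and-reject idea could be coded for happens-before relationships, followed by a backward walk that recovers a trace's critical path. It is an illustrative approximation under assumed trace and segment representations, not the authors' implementation, which also tests mutual-exclusion and pipeline hypotheses over a far larger log corpus.

from itertools import permutations

def infer_happens_before(traces):
    """traces: list of dicts mapping a segment name to (start_ts, end_ts)."""
    segments = set().union(*(t.keys() for t in traces))
    # Hypothesize that every ordered pair (a, b) is a happens-before relationship.
    hypotheses = set(permutations(segments, 2))
    for trace in traces:
        for a, b in list(hypotheses):
            # Counterexample: b started before a finished, so "a happens
            # before b" cannot be a true causal constraint.
            if a in trace and b in trace and trace[b][0] < trace[a][1]:
                hypotheses.discard((a, b))
    return hypotheses  # surviving edges form the causal model

def critical_path(trace, model):
    """Walk backward from the segment that finishes last, at each step moving
    to the causal predecessor that finished most recently."""
    path, seen = [], set()
    current = max(trace, key=lambda s: trace[s][1])
    while current is not None and current not in seen:
        seen.add(current)
        path.append(current)
        preds = [a for (a, b) in model if b == current and a in trace]
        current = max(preds, key=lambda s: trace[s][1], default=None)
    return list(reversed(path))

# Example: two traces over server (S1), network (N1), and client (C1) segments.
traces = [
    {"S1": (0, 5), "N1": (5, 8), "C1": (8, 12)},
    {"S1": (0, 4), "N1": (4, 9), "C1": (9, 15)},
]
model = infer_happens_before(traces)    # {('S1','N1'), ('S1','C1'), ('N1','C1')}
print(critical_path(traces[0], model))  # ['S1', 'N1', 'C1']

Because hypotheses are only ever removed when a counterexample is observed, the surviving model can only tighten as more traces are processed, which is why the large trace corpus matters.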
High variance in critical path
[Figure: CDFs of the percentage of the critical path attributable to server, network, and client]
• Breakdown in critical path shifts drastically
  – Server, network, or client can dominate latency
• In 20% of requests, the server contributes 10% of latency
• In 20% of requests, the server contributes 50% or more of latency

Diverse clients and networks
[Figure: critical-path breakdowns into server, network, and client for several client and network types]

Differentiated service
• Slack in server generation time: the server can produce data more slowly and end-to-end latency stays the same
• No slack in server generation time: producing data faster decreases end-to-end latency
• Deliver data when it is needed and reduce average response time

Additional analysis techniques
• Slack analysis
  [Figure: slack shown on a trace of segments S1, N1, S2, C1, N2, C2]
• What-if analysis
  – Use natural variation in the large data set

What-if questions
• Does server generation time affect end-to-end latency?
• Can we predict which connections exhibit server slack?

Server slack analysis
[Figure: end-to-end latency versus server generation time for requests with slack < 25 ms and requests with slack > 2.5 s]
• Slack < 25 ms: end-to-end latency increases as server generation time increases
• Slack > 2.5 s: server generation time has little effect on end-to-end latency

Predicting server slack
• Predict slack at the receipt of a request
• Past slack is representative of future slack
[Figure: slack of a connection's second request versus slack of its first request]
• Classifies 83% of requests correctly (Type I error: 8%, Type II error: 9%)

Conclusion

Questions
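To experiment with the result on the "Predicting server slack" slides, here is a minimal Python sketch of the idea that past slack is representative of future slack. The 25 ms threshold mirrors the bucket shown in the slack-analysis slides, but the prediction rule, data layout, and names are assumptions for illustration, not the authors' classifier.

def predict_slack(history, threshold_ms=25.0):
    """history: list of (first_slack_ms, second_slack_ms) pairs, one per connection.
    Predict that a connection's later request has server slack iff its earlier
    request did, then report how often the prediction matches what was measured."""
    if not history:
        return {"accuracy": 0.0, "type_i": 0.0, "type_ii": 0.0}
    correct = type_i = type_ii = 0
    for first, second in history:
        predicted = first >= threshold_ms   # prediction made at request receipt
        actual = second >= threshold_ms     # slack measured on the later request
        if predicted == actual:
            correct += 1
        elif predicted:                     # predicted slack, but none was present
            type_i += 1
        else:                               # slack was present but went unpredicted
            type_ii += 1
    n = len(history)
    return {"accuracy": correct / n, "type_i": type_i / n, "type_ii": type_ii / n}

# Example: three connections; the first request's slack predicts the second's.
pairs = [(400.0, 380.0), (5.0, 12.0), (3000.0, 8.0)]
print(predict_slack(pairs))  # accuracy 2/3, one Type I error, no Type II errors

A single per-connection threshold keeps the rule as simple as the slides suggest; the 83% accuracy (8% Type I, 9% Type II) reported above is the deck's measured result, not something this sketch reproduces.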