End-to-end performance analysis of large

advertisement
The Mystery Machine:
End-to-end performance analysis of
large-scale Internet services
Michael Chow
David Meisner, Jason Flinn, Daniel Peek, Thomas F.
Wenisch
Internet services are complex
Tasks
Time
Scale and heterogeneity make Internet services complex
Michael Chow
2
Internet services are complex
Tasks
Time
Scale and heterogeneity make Internet services complex
Michael Chow
3
Analysis Pipeline
Michael Chow
4
Step 1: Identify segments
Michael Chow
5
Step 2: Infer causal model
Michael Chow
6
Step 3: Analyze individual requests
Michael Chow
7
Step 4: Aggregate results
Michael Chow
8
Challenges
• Previous methods derive a causal model
– Instrument scheduler and communication
– Build model through human knowledge
Need method that works at scale with heterogeneous
components
Michael Chow
9
Opportunities
• Component-level logging is ubiquitous
Tremendous detail about a request’s execution
• Handle a large number of requests
Coverage of a large range of behaviors
Michael Chow
10
The Mystery Machine
1) Infer causal model from large corpus of traces
– Identify segments
– Hypothesize all possible causal relationships
– Reject hypotheses with observed counterexamples
2) Analysis
– Critical path, slack, anomaly detection, what-if
Michael Chow
11
Step 1: Identify segments
Michael Chow
12
Define a minimal schema
Task
Segment 1
Event
Segment 2
Event
Michael Chow
Event
13
Define a minimal schema
Task
Segment 1
Event
Segment 2
Event
Event
Request identifier
Machine identifier
Timestamp
Task
Event
Aggregate existing logs using minimal schema
Michael Chow
14
Step 2: Infer causal model
Michael Chow
15
Types of causal relationships
Relationship
Counterexample
Happens-Before
A
B
B
A
Mutual Exclusion
A
B
A
OR
B
B
A
Pipeline
t1
t
2
A
B
A’
t1
C
B’
C’
t
A
B
C’
C
B’
A’
2
Michael Chow
16
Producing causal model
S1
N1
C1
S2
N2
C2
Causal Model
Michael Chow
17
Producing causal model
S1
N1
C1
S2
N2
C2
Causal Model
Trace 1
S1
N1
C1
S2
N2
C2
Time
Michael Chow
18
Producing causal model
S1
N1
C1
S2
N2
C2
Causal Model
Trace 1
S1
N1
C1
S2
N2
C2
Time
Michael Chow
19
Producing causal model
S1
N1
C1
S2
N2
C2
Causal Model
Trace 2
S1
N1
S2
C1
N2
C2
Time
Michael Chow
20
Producing causal model
S1
N1
C1
S2
N2
C2
Causal Model
Trace 2
S1
N1
S2
C1
N2
C2
Time
Michael Chow
21
Producing causal model
S1
N1
C1
S2
N2
C2
Causal Model
Trace 2
S1
N1
S2
C1
N2
C2
Time
Michael Chow
22
Step 3: Analyze individual requests
Michael Chow
23
Critical path using causal model
Trace 1
S1
N1
C1
S2
N2
C2
S1
N1
S2
C1
N2
C2
Time
Michael Chow
24
Critical path using causal model
Trace 1
S1
N1
C1
S2
N2
C2
S1
N1
S2
Michael Chow
C1
N2
C2
25
Critical path using causal model
Trace 1
S1
N1
C1
S2
N2
C2
S1
N1
S2
Michael Chow
C1
N2
C2
26
Step 4: Aggregate results
Michael Chow
27
Inaccuracies of Naïve Aggregation
Michael Chow
28
Inaccuracies of Naïve Aggregation
Michael Chow
29
Inaccuracies of Naïve Aggregation
Need a causal model to correctly understand latency
Michael Chow
30
High variance in critical path
Client
Network
cdf
Server
Percent of critical path
Percent of critical path
Percent of critical path
• Breakdown in critical path shifts drastically
– Server, network, or client can dominate latency
Michael Chow
31
High variance in critical path
cdf
Server
Client
Network
20% of requests,
server contributes 10%
of latency
Percent of critical path
Percent of critical path
Percent of critical path
• Breakdown in critical path shifts drastically
– Server, network, or client can dominate latency
Michael Chow
32
High variance in critical path
Server
Client
Network
cdf
20% of requests, server
contributes 50% or more of
latency
Percent of critical path
Percent of critical path
Percent of critical path
• Breakdown in critical path shifts drastically
– Server, network, or client can dominate latency
Michael Chow
33
Diverse clients and networks
Server
Network
Client
Server
Network
Client
Server
Network
Client
Michael Chow
34
Diverse clients and networks
Server
Network
Client
Server
Network
Client
Server
Network
Client
Michael Chow
35
Diverse clients and networks
Server
Network
Client
Server
Network
Client
Server
Network
Client
Michael Chow
36
Differentiated service
Slack in server generation time
Produce data slower
End-to-end latency stays same
No slack in server generation time
Produce data faster
Decrease end-to-end latency
Deliver data when needed and reduce average response time
Michael Chow
37
Additional analysis techniques
• Slack analysis
S1
N1
S2
C1
N2
C2
Time
• What-if analysis
– Use natural variation in large data set
Michael Chow
38
What-if questions
• Does server generation time affect end-to-end
latency?
• Can we predict which connections exhibit
server slack?
Michael Chow
39
Server slack analysis
Slack < 25ms
Slack > 2.5s
Michael Chow
40
Server slack analysis
Slack < 25ms
Slack > 2.5s
End-to-end latency increases
as server generation time
increases
Server generation time
has little effect on endto-end latency
Michael Chow
41
Predicting server slack
Second slack (ms)
• Predict slack at the receipt of a request
• Past slack is representative of future slack
First slack (ms)
Michael Chow
42
Predicting server slack
• Predict slack at the receipt of a request
• Past slack is representative of future slack
Second slack (ms)
Type II Error 9%
Classifies 83%
of requests
correctly
Type I Error 8%
First slack (ms)
Michael Chow
43
Conclusion
Michael Chow
44
Questions
Michael Chow
45
Download