Diagnosing performance changes
by comparing request flows

Raja Sambasivan
Alice Zheng★, Michael De Rosa✚, Elie Krevat,
Spencer Whitman, Michael Stroucken,
William Wang, Lianghong Xu, Greg Ganger

Carnegie Mellon, Microsoft Research★, Google✚

Perf. diagnosis in distributed systems

•  Very difficult and time consuming
•  Root cause could be in any component

•  Request-flow comparison
•  Helps localize performance changes
•  Key insight: changes manifest as mutations in request timing/structure

Perf. debugging a feature addition

[Figure: a distributed storage system serving many clients, with an
NFS server, a metadata server, and storage nodes]

Perf. debugging a feature addition

•  Before addition:
•  Every file access needs a MDS access

[Figure: the numbered request path (1)-(6): a client request to the
NFS server triggers a metadata server access and storage node accesses
before the reply returns]

Perf. debugging a feature addition

•  After: Metadata prefetched to clients
•  Most requests don't need MDS access

[Figure: same system; client requests (1)-(3) now go through the NFS
server and storage nodes only, skipping the metadata server]

Perf. debugging a feature addition

•  Adding metadata prefetching reduced performance instead of improving it (!)
•  How to efficiently diagnose this?

Request-flow comparison will show

[Figure: precursor vs. mutation request-flow graphs for an NFS Lookup.
Precursor: NFS Lookup Call →10μs→ MDS DB Lock →20μs→ MDS DB Unlock
→50μs→ NFS Lookup Reply. Mutation: the same trace points, but with
350μs between MDS DB Lock and MDS DB Unlock]

Root cause localized by showing how mutation and precursor differ

Request-flow comparison

•  Identifies distribution changes
•  Distinct from anomaly detection
– E.g., Magpie, Pinpoint, etc.
•  Satisfies many use cases
•  Performance regressions/degradations
•  Eliminating the system as the culprit

Contributions

•  Heuristics for identifying mutations and precursors, and for ranking them
•  Implementation in Spectroscope
•  Use of Spectroscope to diagnose
•  Unsolved problems in Ursa Minor
•  Problems in Google services

Spectroscope workflow

[Pipeline: non-problem-period graphs and problem-period graphs →
categorization → response-time mutation identification and structural
mutation identification → ranking → UI layer]

Graphs via end-to-end tracing

•  Used in research & production systems
•  E.g., Magpie, X-Trace, Google's Dapper
•  Works as follows:
•  Tracks trace points touched by requests
•  Request-flow graphs obtained by stitching together the trace points accessed (sketch below)
•  Yields < 1% overhead w/req. sampling
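
To make the stitching step concrete, here is a minimal Python sketch. The record fields (request_id, timestamp_us, trace_point) are illustrative assumptions, not the schema of any particular tracing system; real request-flow graphs are DAGs that capture fan-outs and joins, while this sketch only handles linear paths.

```python
# Sketch: stitching per-request trace points into a request-flow graph.
# Field names are assumptions for illustration only.
from collections import defaultdict

def build_request_flow_graphs(trace_records):
    """Group trace points by request; emit nodes plus edge latencies."""
    by_request = defaultdict(list)
    for rec in trace_records:
        by_request[rec["request_id"]].append(rec)

    graphs = {}
    for req_id, recs in by_request.items():
        recs.sort(key=lambda r: r["timestamp_us"])
        nodes = [r["trace_point"] for r in recs]
        # Edge latency = time between consecutive trace points.
        edges = [
            (a["trace_point"], b["trace_point"],
             b["timestamp_us"] - a["timestamp_us"])
            for a, b in zip(recs, recs[1:])
        ]
        graphs[req_id] = {"nodes": nodes, "edges": edges}
    return graphs
```
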
Example: Graph for a striped read

[Figure: request-flow graph for a striped read. NFS Read Call →100μs→
Cache Miss, which fans out →10μs→ to SN1 Read Call and SN2 Read Call;
→1,000μs→ work on SN1 (3,500μs) and on SN2 (5,500μs); →1,500μs→ SN1
Reply and SN2 Reply join →10μs→ into NFS Read Reply. Response time:
8,120μs, set by the critical path through SN2]

Nodes show trace points & edges show latencies

Spectroscope workflow (recap; next: categorization)

[Pipeline: non-problem-period graphs and problem-period graphs →
categorization → response-time mutation identification and structural
mutation identification → ranking → UI layer]

Categorization step

•  Necessary since it is meaningless to compare individual request flows
•  Groups together similar request flows
•  Categories: basic unit for comparisons
•  Allows for mutation identification by comparing per-category distributions (sketch below)
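
A minimal sketch of the binning, assuming the graph representation from the earlier stitching sketch: requests whose node sequences match exactly land in the same category, and each category accumulates its response-time samples for later comparison.

```python
# Sketch: bin request-flow graphs into categories by identical structure.
# Assumes the {"nodes": [...], "edges": [(src, dst, latency_us), ...]}
# shape from the earlier sketch.
from collections import defaultdict

def categorize(graphs):
    categories = defaultdict(list)
    for g in graphs.values():
        structure = tuple(g["nodes"])  # path only; timings ignored
        response_time = sum(lat for _, _, lat in g["edges"])
        categories[structure].append(response_time)
    return categories
```
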
Choosing what to bin into a category

•  Our choice: identically structured reqs
•  Uses same path/similar cost expectation
•  Same path/similar costs notion is valid
•  For 88-99% of Ursa Minor categories
•  For 47-69% of Bigtable categories
– Lower value due to sparser trace points
– Lower value also due to contention

Aside: categorization can be used to localize problematic sources of variance (sketch below)
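
On the variance aside, one plausible sketch: flag categories whose response times vary far more than their mean predicts. The squared coefficient of variation and the threshold here are assumptions for illustration, not values from the slides.

```python
# Sketch: flag high-variance categories as places to investigate.
# C^2 = variance / mean^2; the 1.0 threshold is an illustrative choice.
import statistics

def high_variance_categories(categories, threshold=1.0):
    flagged = []
    for structure, times in categories.items():
        if len(times) < 2:
            continue  # need at least two samples for a variance
        mean = statistics.mean(times)
        c2 = statistics.variance(times) / (mean * mean)
        if c2 > threshold:
            flagged.append((structure, c2))
    return sorted(flagged, key=lambda item: -item[1])
```
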
Spectroscope workflow (recap; next: mutation identification)

[Pipeline: non-problem-period graphs and problem-period graphs →
categorization → response-time mutation identification and structural
mutation identification → ranking → UI layer]

Type 1: Response-time mutations

•  Requests that:
•  are structurally identical in both periods
•  have larger problem-period latencies
•  Root cause localized by…
•  identifying interactions responsible

Categories w/response-time mutations

•  Identified via use of a hypothesis test (sketch below)
•  Sets apart natural variance from mutations
•  Also used to find interactions responsible

[Figure: response-time histograms for two categories. Category 1's
problem-period distribution shifts right, so it is ID'd as containing
mutations; Category 2's does not change, so it is not ID'd]
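The slides do not name the specific hypothesis test, so treat the following as one concrete instantiation: a two-sample Kolmogorov-Smirnov test per category, flagging categories whose problem-period response times both differ significantly and are slower on average.

```python
# Sketch: per-category hypothesis test for response-time mutations.
# ks_2samp is one possible test; the test choice and alpha are assumptions.
from statistics import mean
from scipy.stats import ks_2samp

def response_time_mutations(np_categories, p_categories, alpha=0.05):
    """Both inputs map structure -> response-time samples for one period."""
    mutated = []
    for structure, np_times in np_categories.items():
        p_times = p_categories.get(structure)
        if not p_times or len(np_times) < 2:
            continue  # need samples from both periods
        _, p_value = ks_2samp(np_times, p_times)
        if p_value < alpha and mean(p_times) > mean(np_times):
            mutated.append(structure)
    return mutated
```

The same comparison, applied per edge rather than to whole response times, is what points at the interaction responsible.
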
Response-time mutation example

[Figure: NFS Read Call →avg. 10μs→ SN1 Read Start → SN1 Read End
→avg. 80μs→ NFS Read Reply; the SN1 Read Start → SN1 Read End edge
averages 20μs in the non-problem period but 1,000μs in the problem
period. Avg. non-problem response time: 110μs; avg. problem response
time: 1,090μs]

Problem localized by ID'ing the interaction responsible

Type 2: Structural mutations

•  Requests that:
•  take different paths in the problem period
•  Root cause localized by…
•  identifying their precursors
– Likely path during the non-problem period
•  ID'ing how mutation & precursor differ

ID'ing categories w/structural mutations

•  Assume similar workloads executed
•  Categories with more problem-period requests contain mutations
•  Reverse true for precursor categories
•  Threshold used to differentiate natural variance from categories w/mutations (sketch below)
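
A minimal sketch of the thresholding, assuming per-category request counts from the two periods; the fixed threshold value is illustrative, since the deck does not give one.

```python
# Sketch: split categories into structural-mutation candidates (request
# count grew in the problem period) and precursor candidates (it shrank).
# The threshold separates real shifts from natural variance in how
# requests fall into categories.
def structural_candidates(np_counts, p_counts, threshold=50):
    mutations, precursors = [], []
    for structure in set(np_counts) | set(p_counts):
        delta = p_counts.get(structure, 0) - np_counts.get(structure, 0)
        if delta > threshold:
            mutations.append(structure)
        elif delta < -threshold:
            precursors.append(structure)
    return mutations, precursors
```
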
Mapping mutations to precursors

[Figure: precursor candidates "Read" (NP: 1,000 → P: 150) and "Write"
(NP: 480 → P: 460), and a structural-mutation "Read" category
(NP: 300 → P: 800); which precursor does the mutation map to?]

•  Accomplished using three heuristics
•  See paper for details (a simplified sketch follows)
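
The sketch below is a simplified stand-in for the paper's three heuristics, not a reproduction of them: it pairs a mutation with the precursor candidate that shares the longest common prefix of trace points, breaking ties by how much the candidate's request count dropped.

```python
# Sketch: map a structural mutation to its most plausible precursor.
# A simplified stand-in for the paper's heuristics (see the paper).
def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def map_to_precursor(mutation, precursors, np_counts, p_counts):
    def score(prec):
        drop = np_counts.get(prec, 0) - p_counts.get(prec, 0)
        return (common_prefix_len(mutation, prec), drop)
    return max(precursors, key=score) if precursors else None
```
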
Example structural mutation

[Figure: the precursor vs. mutation NFS Lookup graphs from earlier.
Precursor: NFS Lookup Call →10μs→ MDS DB Lock →20μs→ MDS DB Unlock
→50μs→ NFS Lookup Reply. Mutation: 350μs between MDS DB Lock and
MDS DB Unlock]

Root cause localized by showing how mutation and precursor differ

Ranking of categories w/mutations

•  Necessary for two reasons
•  There may be more than one problem
•  One problem may yield many mutations
•  Rank based on:
•  # of reqs affected × Δ in response time (sketch below)
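
A minimal sketch of the ranking, under the same per-category representation as the sketches above; for structural mutations the "before" distribution would come from the mapped precursor category.

```python
# Sketch: rank mutated categories by estimated impact =
# (# of problem-period requests affected) * (change in avg response time).
from statistics import mean

def rank_mutations(mutated, np_categories, p_categories):
    def impact(structure):
        p_times = p_categories[structure]
        # Fall back to a zero baseline if the structure never appeared
        # in the non-problem period (illustrative simplification).
        np_times = np_categories.get(structure) or [0.0]
        return len(p_times) * abs(mean(p_times) - mean(np_times))
    return sorted(mutated, key=impact, reverse=True)
```
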
Spectroscope workflow (recap; next: the UI layer)

[Pipeline: non-problem-period graphs and problem-period graphs →
categorization → response-time mutation identification and structural
mutation identification → ranking → UI layer]

Putting it all together: The UI

[Figure: UI mock-up. A ranked list (1. Structural, 2. Response time,
…, 20. Structural) sits beside the selected structural mutation's
precursor and mutation graphs; the mutation's MDS DB Lock → MDS DB
Unlock edge shows 350μs versus the precursor's 20μs]

Putting it all together: The UI

[Figure: the same ranked list beside a response-time mutation's graph
(NFS Read Call → SN1 Read Start → SN1 Read End → NFS Read Reply), with
average edge latencies annotated: 10μs, 20μs grown to 100μs, and 80μs]

Outline

•  Introduction
•  End-to-end tracing & Spectroscope
•  Ursa Minor & Google case studies
•  Summary

Ursa Minor case studies

•  Used Spectroscope to diagnose real performance problems in Ursa Minor
•  Four were previously unsolved
•  Evaluated ranked list by measuring
•  % of top 10 results that were relevant
•  % of total results that were relevant

Quantitative results

[Figure: bar chart, 0-100%, of "% top 10 relevant" and "% total
relevant" for each case study: prefetch, config., RMWs, creates,
500μs delay, and spikes; one case's top-10 score is N/A]

Ursa Minor 5-component config

•  All case studies use this configuration

[Figure: the client, NFS server, metadata server, and storage nodes,
with the numbered request path (1)-(6) from the earlier example]

Case 1: MDS configuration problem

•  Problem: Slowdown in key benchmarks seen after large code merge

Comparing request flows

•  Identified 128 mutation categories:
•  Most contained structural mutations
•  Mutations and precursors differed only in the storage node accessed
•  Localized bug to unexpected interaction between MDS & the data storage node
•  But, unclear why this was happening

Further localization

•  Thought mutations may be caused by changes in low-level parameters
•  E.g., function call parameters, etc.
•  Identified parameters that separated the 1st-ranked mutation from its precursors (a sketch follows)
•  Showed changes in encoding params
•  Localized root cause to two config files

Root cause: Change to an obscure config file
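
One way to realize the parameter-separation step (the learner here is an assumption; the deck doesn't name one): train a shallow decision tree to separate the mutation's requests from its precursor's by their low-level parameters, then read the separating parameters off the top splits.

```python
# Sketch: find the low-level parameters that best separate a mutation's
# requests from its precursor's. scikit-learn's decision tree is an
# illustrative choice of learner, not necessarily Spectroscope's.
from sklearn.tree import DecisionTreeClassifier, export_text

def separating_parameters(precursor_rows, mutation_rows, feature_names):
    X = precursor_rows + mutation_rows       # each row: parameter values
    y = [0] * len(precursor_rows) + [1] * len(mutation_rows)
    tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
    return export_text(tree, feature_names=feature_names)
```
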
Inter-cluster perf. at Google

•  Load tests run on same software in two datacenters yielded very different perf.
•  Developers said perf. should be similar

Comparing request flows

•  Revealed many mutations
•  High-ranked ones were response-time mutations
•  Responsible interactions found both within the service and in dependencies
•  Led us to suspect root cause was an issue with the slower load test's datacenter

Root cause: Shared Bigtable in slower load test's datacenter was not working properly

Summary

•  Introduced request-flow comparison as a new way to diagnose perf. changes
•  Presented algorithms for localizing problems by identifying mutations
•  Showed utility of our approach by using it to diagnose real, unsolved problems

Related work (I)

[Abd-El-Malek05]: Ursa Minor: versatile cluster-based storage.
Michael Abd-El-Malek, William V. Courtright II, Chuck Cranor,
Gregory R. Ganger, James Hendricks, Andrew J. Klosterman,
Michael Mesnier, Manish Prasad, Brandon Salmon, Raja R.
Sambasivan, Shafeeq Sinnamohideen, John D. Strunk, Eno
Thereska, Matthew Wachs, Jay J. Wylie. FAST, 2005.

[Barham04]: Using Magpie for request extraction and workload
modelling. Paul Barham, Austin Donnelly, Rebecca Isaacs,
Richard Mortier. OSDI, 2004.

[Chen04]: Path-based failure and evolution management. Mike
Y. Chen, Anthony Accardi, Emre Kiciman, Jim Lloyd, Dave
Patterson, Armando Fox, Eric Brewer. NSDI, 2004.

Related work (II)

[Cohen05]: Capturing, indexing, and retrieving system history.
Ira Cohen, Steve Zhang, Moises Goldszmidt, Terrence Kelly,
Armando Fox. SOSP, 2005.

[Fonseca07]: X-Trace: a pervasive network tracing framework.
Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker,
Ion Stoica. NSDI, 2007.

[Hendricks06]: Improving small file performance in object-based
storage. James Hendricks, Raja R. Sambasivan, Shafeeq
Sinnamohideen, Gregory R. Ganger. Technical Report
CMU-PDL-06-104, 2006.

Related work (III)

[Reynolds06]: Detecting the unexpected in distributed systems.
Patrick Reynolds, Charles Killian, Janet L. Wiener, Jeffrey C.
Mogul, Mehul A. Shah, Amin Vahdat. NSDI, 2006.

[Sambasivan07]: Categorizing and differencing system
behaviours. Raja R. Sambasivan, Alice X. Zheng, Eno
Thereska. HotAC II, 2007.

[Sambasivan10]: Diagnosing performance problems by
visualizing and comparing system behaviours. Raja R.
Sambasivan, Alice X. Zheng, Elie Krevat, Spencer Whitman,
Michael Stroucken, William Wang, Lianghong Xu, Gregory R.
Ganger. Technical Report CMU-PDL-10-103, 2010.

Related work (IV)

[Sambasivan11]: Automation without predictability is a recipe
for failure. Raja R. Sambasivan and Gregory R. Ganger.
Technical Report CMU-PDL-11-101, 2011.

[Sigelman10]: Dapper, a large-scale distributed systems tracing
infrastructure. Benjamin H. Sigelman, Luiz André Barroso,
Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver,
Saul Jaspan, Chandan Shanbhag. Google Technical Report
dapper-2010-1, 2010.

[Thereska06]: Stardust: tracking activity in a distributed
storage system. Eno Thereska, Brandon Salmon, John
Strunk, Matthew Wachs, Michael Abd-El-Malek, Julio Lopez,
Gregory R. Ganger. SIGMETRICS, 2006.