Problem Diagnosis
• Distributed Problem Diagnosis
• Sherlock
• X-Trace

Troubleshooting Networked Systems
• Hard to develop, debug, deploy, troubleshoot
• No standard way to integrate debugging, monitoring, diagnostics

Status quo: device centric
[Figure: separate per-device log excerpts from the Firewall, Load Balancer, Web 1, Web 2, and Database]
• Determining paths:
– Join logs on time and ad-hoc identifiers
• Relies on
– well synchronized clocks
– extensive application knowledge
• Requires all operations logged to guarantee complete paths

Examples
[Figure: User, Proxy, DNS Server, Web Server]

Approaches to Diagnosis
• Passively learn the relationships
– Infer problems as deviations from the norm
• Actively instrument the stack to learn relationships
– Infer problems as deviations from the norm

Sherlock – Diagnosing Problems in the Enterprise
Srikanth Kandula

Well-Managed Enterprises Still Unreliable
[Figure: response time of a Web server (ms, 10 to 10000) vs. fraction of requests; 85% Normal, 10% Troubled, 0.7% Down]
• 10% responses take up to 10x longer than normal
• How do we manage evolving enterprise networks?

Sherlock
• Instead of looking at the nitty-gritty of individual components, use an end-to-end approach that focuses on user problems

Challenges for the End-to-End Approach
E.g., Web Connection: SQL Backend, Auth.
Server, DNS, Web Server, Client
• Don’t know what user’s performance depends on
– Dependencies are distributed
– Dependencies are non-deterministic
• Don’t know which dependency is causing the problem
– Server CPU 70%, link dropped 10 packets, but which affected user?

Sherlock’s Contributions
• Passively infers dependencies from logs
• Builds a unified dependency graph incorporating network, server and application dependencies
• Diagnoses user problems in the enterprise
• Deployed in a part of the Microsoft Enterprise

Sherlock’s Architecture
[Figure: Dependency Graph + User Observations, fed to the Inference Engine = List of Troubled Components; inputs come from the Network, Servers, and Clients; observation labels shown: Web1, Web2, File1, 30ms, 1000ms, Timeout]
• Sherlock works for various client-server applications

How do you automatically learn such distributed dependencies?
[Figure: DNS, Video Server, Data Store]
• Strawman: Instrument all applications and libraries
– Not practical
• Sherlock exploits timing info
– If my client talks to B whenever it talks to C, the connections are dependent
– False dependence: when B is accessed very often, some access to B falls near every access to C by chance [timeline: B B B B C B B B]
– Dependent iff t << inter-access time
– As long as this occurs with probability higher than chance

Sherlock’s Algorithm to Infer Dependencies
• Infer dependent connections from timing
• Infer topology from traceroutes & configurations
[Figure: dependency graph for "Bill Watches Video" with components Bill’s Client, DNS, Video, Store and the pairs Video/Store, Bill/DNS, Bill/Video]
• Works with legacy applications
• Adapts to changing conditions

But hard dependencies are not enough…
[Figure: the "Bill watches Video" graph with edge probabilities p1 = 10%, p2 = 100%, and p3]
• If Bill caches the server’s IP: DNS down, but Bill gets video
• Need probabilities
• Sherlock uses the frequency with which a dependence occurs in logs as its edge probability

How do we use the dependency graph to diagnose user problems?

Diagnosing User Problems
[Figure: dependency graph covering "Bill Watches Video", "Bill Sees Sales", and "Paul Watches Video2"; components DNS, Sales, Video, Video2, Store, Bill’s Client]
• Which components caused the problem? Need to disambiguate!!
• Use correlation to disambiguate!!
• Disambiguate by correlating
– Across logs from same client
– Across clients
• Prefer simpler explanations

Will Correlation Scale?
[Figure: Microsoft Internal Network: Corporate Core, Campus Core, Building Network, Data Center]
• O(100,000) client desktops
• O(10,000) servers
• O(10,000) apps/services
• O(10,000) network devices
• Dependency graph is huge
• Can we evaluate all combinations of component failures?
– The number of fault combinations is exponential!
– Impossible to compute!

Scalable Algorithm to Correlate
• Only few nodes change state
• Only a few faults happen concurrently
– But how many is few?
– Evaluate enough to cover 99.9% of faults
– For MS network, at most 2 concurrent faults: 99.9% accurate
– Exponential → Polynomial
• Re-evaluate only if an ancestor changes state
– Reduces the cost of evaluating a case by 30x-70x

Results

Experimental Setup
• Evaluated on the Microsoft enterprise network
• Monitored 23 clients, 40 production servers for 3 weeks
– Clients are at MSR Redmond
– Extra host on server’s Ethernet logs packets
• Busy, operational network
– Main Intranet Web site and software distribution file server
– Load-balancing front-ends
– Many paths to the data-center

What Do Web Dependencies in the MS Enterprise Look Like?
[Figure: dependency graphs for "Client Accesses Portal" and "Client Accesses Sales", including the Auth. Server]
• Sherlock discovers complex dependencies of real apps.

What Do File-Server Dependencies Look Like?
[Figure: dependency graph for "Client Accesses Software Distribution Server": File Server, Auth. Server, WINS, DNS, Proxy, Backend Servers 1 to 4; edge probabilities shown: 8%, 100%, 10%, 6%, 5%, 2%, 5%, 1%, .3%]
• Sherlock works for many client-server applications

Sherlock Identifies Causes of Poor Performance
[Figure: component index vs. time (days)]
• Dependency (inference) graph: 2565 nodes; 358 components that can fail
• 87% of problems localized to 16 components
• Corroborated the three significant faults

Sherlock Goes Beyond Traditional Tools
• SNMP-reported utilization on a link flagged by Sherlock
• Problems coincide with spikes
• Sherlock identifies the troubled link but SNMP cannot!
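
The step from exponential to polynomial in "Scalable Algorithm to Correlate" can be illustrated with a small sketch. It is not Sherlock's implementation: the function names and the four example component names are assumptions made for illustration, and only the bound of 2 concurrent faults and the count of 358 failure-prone components come from the slides.

```python
from itertools import combinations
from math import comb


def fault_hypotheses(components, max_faults=2):
    """Yield every set of at most `max_faults` components failing together."""
    for k in range(max_faults + 1):
        for failed in combinations(components, k):
            yield frozenset(failed)


def hypothesis_count(n_components, max_faults=2):
    """Number of hypotheses: sum of C(n, k) for k = 0..max_faults."""
    return sum(comb(n_components, k) for k in range(max_faults + 1))


# Illustrative component names (assumed, not taken from the deployment).
components = ["DNS", "Video", "Store", "Auth"]
hypotheses = list(fault_hypotheses(components))
print(len(hypotheses))        # 1 + 4 + 6 = 11 hypotheses

# With the 358 failure-prone components reported in the evaluation:
print(hypothesis_count(358))  # 64262 hypotheses, versus 2**358 for all subsets
```

Capping the number of concurrent faults at k makes the hypothesis space grow as O(n^k) instead of 2^n, which is why evaluating every candidate explanation becomes feasible.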
X-Trace
• X-Trace records events in a distributed execution and their causal relationship
• Events are grouped into tasks
– Well-defined starting event and all that is causally related
• Each event generates a report, binding it to one or more preceding events
• Captures full happens-before relation

X-Trace Output
[Figure: task graph for an HTTP request through a proxy: HTTP Client, HTTP Proxy, HTTP Server; TCP 1 Start/End, TCP 2 Start/End; IP and IP Router hops]
• Task graph capturing task execution
– Nodes: events across layers, devices
– Edges: causal relations between events

Basic Mechanism
[Figure: the same task graph with events labeled a through n; metadata [T, a] and [T, g] carried along the path; example X-Trace report: TaskID: T, EventID: g, Edge: from a, f]
• Each event uniquely identified within a task: [TaskId, EventId]
• [TaskId, EventId] propagated along execution path
• For each event create and log an X-Trace report
– Enough info to reconstruct the task graph

X-Trace Library API
• Handles propagation within app
• Threads / event-based (e.g., libasync)
• Akin to a logging API:
– Main call is logEvent(message)
• Library takes care of event id creation, binding, reporting, etc.
• Implementations in C++, Java, Ruby, JavaScript

Task Tree
• X-Trace tags all network operations resulting from a particular task with the same task identifier
• Task tree is the set of network operations connected with an initial task
• Task tree can be reconstructed after collecting trace data with reports

An example of the task tree
• A simple HTTP request through a proxy

X-Trace Components
• Data
– X-Trace metadata
• Network path
– Task tree
• Report
– Reconstruct task tree

Propagation of X-Trace Metadata
• The propagation of X-Trace metadata through the task tree

The X-Trace metadata
Field        Usage
Flags        Bits that specify which of the three optional components are present
TaskID       A unique integer ID
TreeInfo     ParentID, OpID, EdgeType
Destination  Specifies the address that X-Trace reports should be sent to
Options      Mechanism to accommodate future extensions

X-Trace Report Architecture

X-Trace-like in Google/Bing/Yahoo
• Why?
– Own large portion of the ecosystem
– Use RPC for communication
– Need to understand time for user request and resource utilization by request

Sherlock vs. X-Trace
• Overhead vs. accuracy
• Deployment issues
– Invasiveness
– Code modification

Conclusions
• Sherlock passively infers network-wide dependencies from logs and traceroutes
• It diagnoses faults by correlating user observations
• X-Trace actively discovers network-wide dependencies
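
The X-Trace mechanism described earlier (each event logs a report that binds it to its preceding events, and the reports are enough to rebuild the task graph) can be sketched as follows. This is a minimal illustration, not the real X-Trace library: the `Tracer` class, `log_event`, `build_task_graph`, the `Report` fields, and the event messages are all hypothetical names chosen for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Report:
    """One X-Trace-style report: an event plus edges to the events before it."""
    task_id: str
    event_id: str
    parents: tuple  # EventIds of the preceding events
    message: str


class Tracer:
    """Hypothetical stand-in for a tracing library (not the real X-Trace API)."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.reports = []  # stands in for the report collection backend
        self._ids = count()

    def log_event(self, message, *parents):
        """Create an event, bind it to its parents, and log a report.

        Returns the (TaskId, EventId) pair that a caller would propagate
        to the next hop or layer.
        """
        event_id = f"e{next(self._ids)}"
        parent_ids = tuple(eid for _task, eid in parents)
        self.reports.append(Report(self.task_id, event_id, parent_ids, message))
        return (self.task_id, event_id)


def build_task_graph(reports, task_id):
    """Rebuild the causal edges (parent -> children) of one task from its reports."""
    children = defaultdict(list)
    for report in reports:
        if report.task_id != task_id:
            continue
        for parent in report.parents:
            children[parent].append(report.event_id)
    return dict(children)


# A simple HTTP request through a proxy, traced as task "T".
tracer = Tracer("T")
client = tracer.log_event("HTTP client sends request")
tcp1 = tracer.log_event("TCP 1 carries request", client)
proxy = tracer.log_event("HTTP proxy forwards request", client, tcp1)
server = tracer.log_event("HTTP server handles request", proxy)

graph = build_task_graph(tracer.reports, "T")
for parent, kids in graph.items():
    print(parent, "->", kids)
```

Because every report carries its own parent EventIds, the graph is rebuilt without synchronized clocks or log joins, which is the contrast with the device-centric status quo.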