Detecting, Managing, and Diagnosing Failures with FUSE
John Dunagan, Juhan Lee (MSN), Alec Wolman
WIP

Goals & Target Environment
- Improve the ability of large internet portals to gain insight into failures
- Non-goals: masking failures; using machine learning to infer abnormal behavior

MSN Background
- Messenger, www.msn.com, Hotmail, Search, and many other "properties"
- Large scale: more than 100 million users
- Sources of complexity:
  - multiple data centers
  - a large number of machines
  - complex internal network topology
  - diversity of applications and software infrastructure

The Plan
- Detecting, managing, and diagnosing failures
- Review MSN's current approaches
- Describe our solution at a high level

Detecting Failures
- Monitor system availability with heartbeats
- Monitor application availability and quality of service using synthetic requests
- Customer complaints (telephone, email)
- Problems:
  - These approaches provide limited coverage; failures that don't affect every request are harder to catch
  - Data on detected failures often lacks the detail needed to suggest a remedy: which front end is flaky? which application component caused the end-user failure?

Managing Failures
- Definition: when server "x" fails, what is the impact of this failure?
- What this enables:
  - the ability to prioritize failures
  - detecting component service degradation
  - characterizing application stability
  - capacity planning
  - better use of operations and engineering resources
- Current approach: no systematic attempt to provide this functionality

Our Solution (in 2 Steps): Detecting and Managing Failures
- Step 1: Instrument applications to track user requests across the "service chain"
  - Each request is tagged with a unique id
  - The service chain is composed on the fly with the help of application instrumentation
  - For each request:
    - collect per-hop performance information
    - collect per-request failure status
  - Centralized data collection

What Kinds of Failures?
- We can handle: machine failures, network connectivity problems
- We can handle most: misconfiguration, application bugs
- But not all: application errors where the application itself doesn't detect that there is a problem

Diagnosing Failures
- Assigning responsibility to a specific hardware or software component
- Insight into the internals of a component and into cross-component interactions
- Current approach: instrument applications with app-specific log messages
- Problems:
  - High request rates lead to log rollover
  - Perceived overhead means detailed logging is enabled during testing but disabled in production

FUSE Background
- FUSE (OSDI 2004): lightweight agreement on only one thing: whether or not a failure has occurred
- Lack of a positive ack => failure

Step 2: Conditional Logging
- Implement "conditional logging" to significantly reduce the overhead of collecting detailed logs across the machines in the service chain
- Step 1 provides the ability to identify a request across all participants in the service chain; FUSE provides agreement on failure status across that chain
- While the fate of a request is undecided, detailed log messages are stored in main memory, so the common-case overhead of logging is vastly reduced
- Once the fate of the service chain is decided, we discard application logs for successful requests and save logs for failed ones
- The quantity of data generated is manageable when most requests succeed
- (A code sketch of this idea follows the example below)

Example
- A request flows Client -> Server1 -> Server2 -> Server3, and a failure (X) occurs somewhere along the chain
- Benefits:
  - FUSE allows monitoring of real transactions: all of them, or a sampled subset to control overhead
  - When a request fails, FUSE provides an audit trail: how far did the request get? how long did each step take? plus any additional application-specific context
  - FUSE can be deployed incrementally
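As a concrete illustration of conditional logging, here is a minimal C# sketch. It is not the MSN implementation; the ConditionalLogger type, its method names, and the failures.log destination are all hypothetical. Log messages are buffered in memory per request id; once FUSE decides the fate of the service chain, the buffer is dropped for successes and persisted for failures.

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;

    // Hypothetical sketch of conditional logging: detailed log messages
    // are kept in main memory until the fate of the request's service
    // chain is known, then discarded on success or persisted on failure.
    public class ConditionalLogger
    {
        // Per-request in-memory log buffers, keyed by the unique request id.
        private readonly ConcurrentDictionary<Guid, List<string>> buffers =
            new ConcurrentDictionary<Guid, List<string>>();

        // Record a detailed log message; cheap in the common (success)
        // case because nothing is written to disk yet.
        public void Log(Guid requestId, string message)
        {
            var buffer = buffers.GetOrAdd(requestId, _ => new List<string>());
            lock (buffer)
            {
                buffer.Add($"{DateTime.UtcNow:o} {message}");
            }
        }

        // Called once FUSE has decided the fate of the service chain.
        public void OnFateDecided(Guid requestId, bool failed)
        {
            if (buffers.TryRemove(requestId, out var buffer))
            {
                if (failed)
                {
                    // Only failed requests pay the cost of durable logging.
                    foreach (var line in buffer)
                        System.IO.File.AppendAllText(
                            "failures.log", line + Environment.NewLine);
                }
                // On success, the buffer is simply dropped.
            }
        }
    }

A production version would also need the overload policy discussed next: bounding buffer memory and shedding work gracefully during bursts of failures, so that the logging itself does not induce more failures.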
Issues
- Overload policy: we need to handle bursts of failures without inducing more failures
- How much effort does it take to make applications FUSE-enabled? Are the right components FUSE-enabled?
- Identifying and filtering false positives
- Tracking request flow is non-trivial with network load balancers

Status
- We've implemented FUSE for MSN, integrated with the ASP.NET rendering engine
- Testing is in progress
- Roll-out is planned for the end of summer

Backup Slides

FUSE is Easy to Integrate
- Example of current code on a front end:

    ReceiveRequestFromClient(…) {
        …
        SendRequestToBackEnd(…);
    }

- Example code on a front end using FUSE:

    ReceiveRequestFromClient(…, FUSEinfo f) {
        // default value of f = null
        if (f != null)
            JoinFUSEGroup(f);
        …
        SendRequestToBackEnd(…, f);
    }

- The current implementation is in C# and consists of 2400 lines of code
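To flesh out the slide above, here is a hedged sketch of what the front-end side of the integration might look like. FUSEinfo and JoinFUSEGroup appear in the slide; everything else (CreateFUSEGroup, the fate callback, the class bodies) is an illustrative assumption, not the actual FUSE API.

    using System;

    // Hypothetical sketch of the front-end side of FUSE integration.
    // The real FUSE API (OSDI 2004) differs; this only shows the flow.
    public class FUSEinfo
    {
        public Guid GroupId { get; } = Guid.NewGuid();  // unique per-request id
    }

    public class FrontEnd
    {
        public void ReceiveRequestFromClient(string request)
        {
            // Create a FUSE group for this request; every downstream
            // participant that receives f can join the same group.
            FUSEinfo f = CreateFUSEGroup(
                onFateDecided: failed =>
                    Console.WriteLine(failed
                        ? "request failed: save detailed logs"
                        : "request succeeded: discard detailed logs"));

            SendRequestToBackEnd(request, f);
        }

        private FUSEinfo CreateFUSEGroup(Action<bool> onFateDecided)
        {
            // In FUSE, group members exchange liveness checks, and the
            // absence of a positive ack is treated as failure; here we
            // only mark where the fate-notification hook would live.
            var f = new FUSEinfo();
            // ... register onFateDecided with the FUSE runtime ...
            return f;
        }

        private void SendRequestToBackEnd(string request, FUSEinfo f)
        {
            // ... forward the request to the next hop, tagged with f ...
        }
    }

The design point the slide is making is that the per-application change stays small: accept an optional FUSEinfo parameter, join the group if it is present, and forward it on every downstream call.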