
Detecting, Managing, and Diagnosing Failures with FUSE
John Dunagan, Juhan Lee (MSN), Alec Wolman
WIP
Goals & Target Environment
• Improve the ability of large internet portals to gain insight into failures
• Non-goals:
  - masking failures
  - using machine learning to infer abnormal behavior
MSN Background
• Messenger, www.msn.com, Hotmail, Search, many other “properties”
• Large (> 100 million users)
• Sources of complexity:
  - multiple data centers
  - large number of machines
  - complex internal network topology
  - diversity of applications and software infrastructure
The Plan
• Detecting, managing, and diagnosing failures
  - Review MSN’s current approaches
  - Describe our solution at a high level
Detecting Failures
• Monitor system availability with heartbeats
• Monitor application availability & quality of service using synthetic requests
• Customer complaints
  - Telephone, email
Problems:
  - These approaches provide limited coverage: it is harder to catch failures that don’t affect every request
  - Data on detected failures often lacks the detail needed to suggest a remedy:
      - which front end is flaky?
      - which app component caused the end-user failure?
Managing Failures
Definition: When server “x” fails, what is the impact of this failure?
• Ability to prioritize failures
• Detect component service degradation
• Characterizing app stability
• Capacity planning
• Better use of ops and engineering resources
• Current approach: no systematic attempt to provide this functionality
Our solution (in 2 steps)
Detecting and Managing Failures
• Step 1: Instrument applications to track user requests across the “service chain”
  - Each request is tagged with a unique ID
  - The service chain is composed on the fly with the help of app instrumentation
  - For each request:
      - Collect per-hop performance information
      - Collect per-request failure status
      - Centralized data collection
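As a rough illustration of Step 1, the Python sketch below tags a request with a unique ID, records one timing and status record per hop, and reports each record to a central collector. It is an illustration under stated assumptions, not the MSN instrumentation: `HopRecord`, `Collector`, `new_request_id`, and `handle_hop` are hypothetical names, and the collector is an in-memory list standing in for centralized data collection.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class HopRecord:
    """One per-hop measurement for a tagged request (hypothetical layout)."""
    request_id: str
    hop: str
    elapsed_s: float
    ok: bool


@dataclass
class Collector:
    """Stand-in for centralized data collection: an in-memory list."""
    records: list = field(default_factory=list)

    def report(self, record):
        self.records.append(record)

    def failed_requests(self):
        # A request counts as failed if any of its hops reported a failure.
        return {r.request_id for r in self.records if not r.ok}


def new_request_id():
    """Create the unique ID that travels with the request along the chain."""
    return uuid.uuid4().hex


def handle_hop(collector, request_id, hop, work):
    """Run one hop's work, timing it and recording success or failure."""
    start = time.perf_counter()
    ok = True
    try:
        return work()
    except Exception:
        ok = False
        raise
    finally:
        elapsed = time.perf_counter() - start
        collector.report(HopRecord(request_id, hop, elapsed, ok))
```

Each hop passes the same `request_id` downstream, so the collector can group records by ID and reconstruct the service chain after the fact.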
What kinds of failures?
We can handle:
  - Machine failures
  - Network connectivity problems
Most:
  - Misconfiguration
  - Application bugs
But not all:
  - Application errors where the app itself doesn’t detect that there is a problem
Diagnosing Failures
• Assigning responsibility to a specific hardware or software component
• Insight into internals of a component
• Cross-component interactions
• Current approach: instrument applications
  - App-specific log messages
  - Problems:
      - High request rates => log rollover
      - Perceived overhead => detailed logging enabled during testing, disabled in production
FUSE Background
• FUSE (OSDI 2004): lightweight agreement on only one thing: whether or not a failure has occurred
• Lack of a positive ack => failure
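The rule in the last bullet can be sketched in a few lines of Python. This is not the FUSE protocol itself, only the "no positive ack by the deadline means failure" decision. `AckTracker`, its method names, and the injected `clock` callable are assumptions made for illustration.

```python
class AckTracker:
    """Treats a missing positive ack within `timeout_s` as a failure."""

    def __init__(self, timeout_s, clock):
        self.timeout_s = timeout_s
        self.clock = clock      # callable returning the current time in seconds
        self.deadlines = {}     # group id -> time by which an ack is due
        self.acked = set()      # group ids that were acked in time

    def start(self, group_id):
        self.deadlines[group_id] = self.clock() + self.timeout_s

    def ack(self, group_id):
        # A late ack is ignored: once a failure is declared it stands.
        if self.status(group_id) == "pending":
            self.acked.add(group_id)

    def status(self, group_id):
        if group_id in self.acked:
            return "ok"
        if self.clock() >= self.deadlines[group_id]:
            return "failed"
        return "pending"
```

Passing the clock in as a parameter keeps the timeout logic testable without sleeping; production code would more likely pass `time.monotonic`.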
Step 2: Conditional Logging
• Step 2: Implement “conditional logging” to significantly reduce the overhead of collecting detailed logs across different machines in the service chain
  - Step 1 provides the ability to identify a request across all participants in the service chain; FUSE provides agreement on failure status across that chain
  - While fate is undecided: detailed log messages are stored in main memory
      - Common-case overhead of logging is vastly reduced
  - Once the fate of the service chain is decided, we discard app logs for successful requests and save logs for failures
      - Quantity of data generated is manageable when most requests are successful
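A minimal Python sketch of the buffering scheme described above: detailed messages stay in memory, keyed by request ID, until the request's fate is known, and are written out only on failure. `ConditionalLogger` and its methods are hypothetical names, and the `sink` list stands in for whatever durable log storage a real deployment would use.

```python
class ConditionalLogger:
    """Buffers per-request log messages until the request's fate is decided."""

    def __init__(self, sink):
        self.sink = sink        # list standing in for durable log storage
        self.pending = {}       # request id -> messages buffered in memory

    def log(self, request_id, message):
        # Common case: an in-memory append, with no I/O.
        self.pending.setdefault(request_id, []).append(message)

    def decide(self, request_id, failed):
        """Save buffered messages on failure, discard them on success.

        Returns the number of messages that were buffered for the request.
        """
        messages = self.pending.pop(request_id, [])
        if failed:
            self.sink.extend((request_id, m) for m in messages)
        return len(messages)
```

For example, after `log("r1", "cache miss")` and `decide("r1", failed=False)`, nothing reaches the sink; the same message logged for a request that later fails is saved together with its request ID.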
Example
[Diagram: Client, Server1, Server2, Server3; an X marks a failure]
Benefits:
• FUSE allows monitoring of real transactions.
  - All transactions, or a sampled subset to control overhead.
• When a request fails, FUSE provides an audit trail:
  - How far did it get?
  - How long did each step take?
  - Any additional application-specific context.
• FUSE can be deployed incrementally.
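Two of the points above lend themselves to a short Python sketch: choosing a sampled subset of requests, and answering the audit-trail questions from per-hop records. Both functions are assumptions for illustration. Hashing the request ID is one way to make every machine in the chain reach the same sampling decision without coordination; the slides do not say how sampling is actually done. The hop tuples are made-up data.

```python
import hashlib


def is_sampled(request_id, rate):
    """Deterministic sampling: every hop computes the same answer for an ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # integer in [0, 2**64)
    return bucket < rate * 2**64


def audit_trail(hops):
    """Summarize a request from its hop records.

    `hops` is a list of (hop_name, elapsed_ms, ok) tuples in visit order.
    """
    reached = [name for name, _, _ in hops]
    failed_at = next((name for name, _, ok in hops if not ok), None)
    return {
        "reached": reached,                              # how far did it get?
        "last_hop": reached[-1] if reached else None,
        "failed_at": failed_at,                          # first hop reporting failure
        "timings_ms": {name: ms for name, ms, _ in hops},  # how long each step took
    }
```

A call such as `audit_trail([("Server1", 12.0, True), ("Server2", 250.0, False)])` reports that the request reached Server2, failed there, and spent most of its time on that hop.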
Issues
• Overload policy: need to handle bursts of failures without inducing more failures
• How much effort to make apps FUSE-enabled?
• Are the right components FUSE-enabled?
• Identifying and filtering false positives
• Tracking request flow is non-trivial with network load balancers
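The slides name the overload issue without giving a policy. One possible approach, sketched here in Python purely as an assumption, is to cap how many failure logs are saved per time window and count the rest as dropped, so that a burst of failures cannot swamp the logging path. `FailureLogBudget` and its parameters are hypothetical.

```python
class FailureLogBudget:
    """Allows at most `max_per_window` failure logs to be saved per window."""

    def __init__(self, max_per_window, window_s, clock):
        self.max_per_window = max_per_window
        self.window_s = window_s
        self.clock = clock              # callable returning time in seconds
        self.window_start = clock()
        self.saved = 0                  # logs saved in the current window
        self.dropped = 0                # logs refused since creation

    def try_save(self):
        """Return True if a failure log may be saved now, else count a drop."""
        now = self.clock()
        if now - self.window_start >= self.window_s:
            # A new window begins: reset the per-window counter.
            self.window_start = now
            self.saved = 0
        if self.saved < self.max_per_window:
            self.saved += 1
            return True
        self.dropped += 1
        return False
```

Keeping the `dropped` counter preserves the fact that a burst occurred even when the detailed logs for part of it are discarded.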
Status
• We’ve implemented FUSE for MSN, integrated with the ASP.NET rendering engine
• Testing in progress
• Roll-out at end of summer
Backups
FUSE is Easy to Integrate
Example current code on Front End:
    ReceiveRequestFromClient(…) {
        …
        SendRequestToBackEnd(…);
    }
Example code on Front End using FUSE:
    ReceiveRequestFromClient(…, FUSEinfo f) {   // default value of f = null
        if (f != null) JoinFUSEGroup(f);        // join only when the caller passed FUSE info
        …
        SendRequestToBackEnd(…, f);             // forward the same FUSE info downstream
    }
The current implementation is in C# and consists of 2,400 lines of code.