Nov 2 talk (Duke)

Pip: Detecting the Unexpected in Distributed Systems
Patrick Reynolds, Janet Wiener, Jeff Mogul, Mehul Shah, Chip Killian, Amin Vahdat
http://issg.cs.duke.edu/pip/
reynolds@cs.duke.edu
Motivation
• Distributed systems exhibit complex behaviors
• Some behaviors are unexpected
  – Structural bugs
    • Placement or timing of processing and communication
  – Performance problems
    • Throughput bottlenecks
    • Over- or under-consumption of resources
    • Unexpected interdependencies
• Parallel, inter-node behavior is hard to capture with serial, single-node tools
  – Not captured by traditional debuggers, profilers
  – Not captured by unstructured log files
Pip - November 2005
Motivation
Three target audiences:
• Primary programmer
  – Debugging or optimizing his/her own system
• Secondary programmer
  – Inheriting a project or joining a programming team
  – Learning how the system behaves
• Operator
  – Monitoring running system for unexpected behavior
  – Performing regression tests after a change
Motivation
• Programmers wish to examine and check systemwide behaviors
  – Causal paths
  – Components of end-to-end delay
  – Attribution of resource consumption
• Unexpected behavior might indicate a bug
[Figure: request path across Web server, App server, and Database, annotated with "500ms" and "2000 page faults"]
Pip overview
Pip:
1. Captures events from a running system
2. Reconstructs behavior from events
3. Checks behavior against expectations
4. Displays unexpected behavior
• Both structure and resource violations
Goal: help programmers locate and explain bugs
[Figure: Application, Behavior model, Expectations, Pip checker, Unexpected structure, Resource violations, Pip explorer: visualization GUI]
Outline
• Expressing expected behavior
• Building a model of actual behavior
• Exploring application behavior
• Results
  – FAB
  – RanSub
  – SplitStream
Describing application behavior
• Application behavior consists of paths
  – All events, on any node, related to one high-level operation
  – Definition of a path is programmer defined
  – Path is often causal, related to a user request
[Figure: timeline of one path. WWW: Parse HTTP, Send response; App server: Run application; DB: Query; horizontal axis: time]
Describing application behavior
• Within paths are tasks, messages, and notices
  – Tasks: processing with start and end points
  – Messages: send and receive events for any communication
    • Includes network, synchronization (lock/unlock), and timers
  – Notices: time-stamped strings; essentially log entries
[Figure: timeline of one path. WWW: Parse HTTP, Send response; App server: Run application; DB: Query; horizontal axis: time. Example notices: “Request = /cgi/…”, “2096 bytes in response”, “done with request 12”]
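One way to picture the path/task/message/notice vocabulary is as a small data model. The sketch below is illustrative only: the class and field names are assumptions made for this example, not Pip's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    """Processing with a start and an end point (times in seconds)."""
    name: str
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


@dataclass
class Message:
    """A send event paired with its receive event."""
    sender: str
    receiver: str
    send_time: float
    recv_time: float

    @property
    def latency(self) -> float:
        return self.recv_time - self.send_time


@dataclass
class Notice:
    """A time-stamped string, essentially a log entry."""
    timestamp: float
    text: str


@dataclass
class Path:
    """All events, on any node, related to one high-level operation."""
    path_id: str
    tasks: List[Task] = field(default_factory=list)
    messages: List[Message] = field(default_factory=list)
    notices: List[Notice] = field(default_factory=list)


# Hypothetical path for one web request.
path = Path("request-12")
path.tasks.append(Task("Parse HTTP", start=0.0, end=0.25))
path.messages.append(Message("WWW", "AppServer", send_time=0.25, recv_time=0.5))
path.notices.append(Notice(0.75, "done with request 12"))
```

Keeping tasks, messages, and notices as separate lists mirrors the three event kinds named on the slide; a real store would likely also record host and thread for each event.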
Expectations: Recognizers
• Application behavior consists of paths
• Each recognizer matches paths
  – A path can match more than one recognizer
• A recognizer can be a validator, an invalidator, or neither
• Any path matching zero validators or at least one invalidator is unexpected behavior: bug?
validator CGIRequest
  task("Parse HTTP") limit(CPU_TIME, 100ms);
  notice(m/Request URL: .*/);
  send(AppServer);
  recv(AppServer);
invalidator DatabaseError
  notice(m/Database error: .*/);
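The classification rule above (zero validators matched, or at least one invalidator matched, means unexpected) can be sketched in a few lines. This is a simplified illustration, not Pip's checker: a path is reduced to a list of notice strings, and the helper `has_notice` is a hypothetical stand-in for a full recognizer.

```python
import re
from typing import Callable, Dict, List

Path = List[str]                      # simplification: a path is its notice strings
Recognizer = Callable[[Path], bool]


def has_notice(pattern: str) -> Recognizer:
    """Hypothetical recognizer: matches if any notice matches the regex."""
    rx = re.compile(pattern)
    return lambda path: any(rx.search(notice) for notice in path)


validators: Dict[str, Recognizer] = {
    "CGIRequest": has_notice(r"Request URL: .*"),
}
invalidators: Dict[str, Recognizer] = {
    "DatabaseError": has_notice(r"Database error: .*"),
}


def is_unexpected(path: Path) -> bool:
    """Unexpected = matches zero validators OR at least one invalidator."""
    matched_validator = any(r(path) for r in validators.values())
    matched_invalidator = any(r(path) for r in invalidators.values())
    return (not matched_validator) or matched_invalidator


ok = ["Request URL: /cgi/search"]
db_fail = ["Request URL: /cgi/search", "Database error: timeout"]
unknown = ["something else entirely"]
```

Here `ok` is expected, `db_fail` is flagged because an invalidator fires even though a validator also matches, and `unknown` is flagged because no validator matches it.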
Expectations: Recognizers language
• repeat: matches a ≤ n ≤ b copies of a block
  repeat between 1 and 3 { … }
• xor: matches any one of several blocks
  xor {
    branch: …
    branch: …
  }
• call: include another recognizer (macro)
• future: block matches now or later
  – done: force named block to match
  future F1 { … }
  …
  done(F1);
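The repeat and xor constructs behave like familiar pattern-matching combinators. The sketch below is an analogy under simplifying assumptions, not Pip's matching algorithm: a path is flattened to a list of event names, each matcher returns the set of positions where it could stop, and `future`/`done` are not modeled. All function names are hypothetical.

```python
from typing import Callable, List, Set

Events = List[str]
Matcher = Callable[[Events, int], Set[int]]  # (events, start) -> possible end positions


def event(name: str) -> Matcher:
    """Match exactly one event with the given name."""
    def m(events: Events, i: int) -> Set[int]:
        return {i + 1} if i < len(events) and events[i] == name else set()
    return m


def seq(*parts: Matcher) -> Matcher:
    """Match each part in order."""
    def m(events: Events, i: int) -> Set[int]:
        positions = {i}
        for part in parts:
            positions = {j for p in positions for j in part(events, p)}
        return positions
    return m


def repeat(block: Matcher, lo: int, hi: int) -> Matcher:
    """Match between lo and hi consecutive copies of block."""
    def m(events: Events, i: int) -> Set[int]:
        ends: Set[int] = {i} if lo == 0 else set()
        positions = {i}
        for n in range(1, hi + 1):
            positions = {j for p in positions for j in block(events, p)}
            if n >= lo:
                ends |= positions
        return ends
    return m


def xor(*branches: Matcher) -> Matcher:
    """Match any one of several branches (treated here as plain alternation)."""
    def m(events: Events, i: int) -> Set[int]:
        return set().union(*(b(events, i) for b in branches))
    return m


def matches(recognizer: Matcher, events: Events) -> bool:
    """A recognizer matches a path only if it consumes every event."""
    return len(events) in recognizer(events, 0)


# Hypothetical recognizer: parse, then 1 to 3 queries, then reply or error.
cgi = seq(event("parse"),
          repeat(event("query"), 1, 3),
          xor(event("reply"), event("error")))
```

With this recognizer, `["parse", "query", "query", "reply"]` matches, while a path with no query or with four queries does not.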
Expectations: Aggregate expectations
• Recognizers categorize paths into sets
• Aggregates make assertions about sets of paths
  – Count, unique count, resource constraints
  – Simple math and set operators
assert(instances(CGIRequest) > 4);
assert(max(CPU_TIME, CGIRequest) < 500ms);
assert(max(REAL_TIME, CGIRequest) <= 3*avg(REAL_TIME, CGIRequest));
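The three aggregate assertions above can be mirrored over a plain list of per-path measurements. This is a sketch with made-up numbers; the dictionary layout and the helper names `instances`, `agg_max`, and `agg_avg` are assumptions for illustration.

```python
from statistics import mean
from typing import Dict, List

PathMetrics = Dict[str, float]

# Hypothetical measurements (seconds) for the paths matched by CGIRequest.
cgi_request_paths: List[PathMetrics] = [
    {"CPU_TIME": 0.12, "REAL_TIME": 0.30},
    {"CPU_TIME": 0.08, "REAL_TIME": 0.25},
    {"CPU_TIME": 0.20, "REAL_TIME": 0.60},
    {"CPU_TIME": 0.10, "REAL_TIME": 0.28},
    {"CPU_TIME": 0.15, "REAL_TIME": 0.35},
]


def instances(paths: List[PathMetrics]) -> int:
    return len(paths)


def agg_max(metric: str, paths: List[PathMetrics]) -> float:
    return max(p[metric] for p in paths)


def agg_avg(metric: str, paths: List[PathMetrics]) -> float:
    return mean(p[metric] for p in paths)


def check_aggregates(paths: List[PathMetrics]) -> Dict[str, bool]:
    """Evaluate the three aggregate expectations; True means the assertion holds."""
    return {
        "instances > 4": instances(paths) > 4,
        "max CPU_TIME < 500ms": agg_max("CPU_TIME", paths) < 0.500,
        "max REAL_TIME <= 3*avg": (
            agg_max("REAL_TIME", paths) <= 3 * agg_avg("REAL_TIME", paths)
        ),
    }


results = check_aggregates(cgi_request_paths)
```

The third assertion is the interesting one: it is relative rather than absolute, so it flags a single outlier path without requiring a fixed latency budget.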
Outline
• Expressing expected behavior
• Building a model of actual behavior
• Exploring application behavior
• Results
Building a behavior model
Model consists of paths constructed from events recorded by the running application
Sources of events:
• Annotations in source code
  – Programmer inserts statements manually
• Annotations in middleware
  – Middleware inserts annotations automatically
  – Faster and less error-prone
• Passive tracing or interposition
  – Easier, but less information
• Or any combination of the above
Annotations
• Set path ID
• Start/end task
• Send/receive message
• Notice
[Figure: timeline of one path. WWW: Parse HTTP, Send response; App server: Run application; DB: Query; horizontal axis: time. Example notices: “Request = /cgi/…”, “2096 bytes in response”, “done with request 12”]
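The four annotation kinds listed above could be recorded by a small in-process tracer. The sketch below is a hypothetical Python stand-in: the `Tracer` class and its method names are assumptions for illustration and are not Pip's annotation API.

```python
import time
from typing import List, Optional, Tuple

# (timestamp, path_id, kind, detail)
Event = Tuple[float, Optional[str], str, str]


class Tracer:
    """Hypothetical annotation recorder (not Pip's actual library)."""

    def __init__(self) -> None:
        self.path_id: Optional[str] = None
        self.events: List[Event] = []

    def _record(self, kind: str, detail: str) -> None:
        self.events.append((time.monotonic(), self.path_id, kind, detail))

    def set_path_id(self, path_id: str) -> None:
        # Later events on this thread are attributed to this path.
        self.path_id = path_id

    def start_task(self, name: str) -> None:
        self._record("start_task", name)

    def end_task(self, name: str) -> None:
        self._record("end_task", name)

    def send(self, message_id: str) -> None:
        self._record("send", message_id)

    def receive(self, message_id: str) -> None:
        self._record("receive", message_id)

    def notice(self, text: str) -> None:
        self._record("notice", text)


def handle_request(tracer: Tracer, request_id: int) -> None:
    """Example of manually annotated application code."""
    tracer.set_path_id(f"request-{request_id}")
    tracer.start_task("Parse HTTP")
    tracer.end_task("Parse HTTP")
    tracer.send(f"msg-{request_id}")
    tracer.notice(f"done with request {request_id}")


tracer = Tracer()
handle_request(tracer, 12)
kinds = [kind for _, _, kind, _ in tracer.events]
```

Every event carries the current path ID, which is what lets a later stage group events from different hosts into one path.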
Automating expectations and annotations
• Expectations can be generated from behavior model
  – Create a recognizer for each actual path
  – Eliminate repetition
  – Strike a balance between over- and under-specification
• Annotations can be generated by middleware
• Automatic annotations in Mace, Sandstorm, J2EE, FAB
  – Several of our test systems use Mace annotations
[Figure: Application, Annotations, Behavior model, Expectations, Pip checker, Unexpected behavior]
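One plausible reading of "create a recognizer for each actual path, eliminate repetition" is to collapse runs of identical events and merge paths that share the same collapsed shape, widening the repeat bounds as needed. The sketch below illustrates that idea only; it is an assumption about the approach, not the generator Pip ships, and the function names are hypothetical.

```python
from itertools import groupby
from typing import Dict, List, Tuple

Shape = Tuple[str, ...]
Bounds = List[Tuple[int, int]]


def collapse(events: List[str]) -> List[Tuple[str, int]]:
    """Run-length encode consecutive identical events."""
    return [(name, len(list(run))) for name, run in groupby(events)]


def generalize(paths: List[List[str]]) -> Dict[Shape, Bounds]:
    """Merge paths with the same collapsed shape into (min, max) repeat bounds."""
    shapes: Dict[Shape, Bounds] = {}
    for events in paths:
        runs = collapse(events)
        shape = tuple(name for name, _ in runs)
        counts = [n for _, n in runs]
        if shape not in shapes:
            shapes[shape] = [(n, n) for n in counts]
        else:
            shapes[shape] = [
                (min(lo, n), max(hi, n))
                for (lo, hi), n in zip(shapes[shape], counts)
            ]
    return shapes


observed = [
    ["parse", "query", "reply"],
    ["parse", "query", "query", "query", "reply"],
    ["parse", "error"],
]
candidates = generalize(observed)
```

Here the first two paths merge into one candidate with the query step bounded between 1 and 3, which corresponds to a `repeat between 1 and 3 { … }` block. Merging more aggressively risks under-specification; merging less yields one recognizer per path and over-specification.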
Checking expectations
Application → Traces
Reconciliation: match start/end task, send/receive message → Events database
Path construction: organize events into causal paths → Paths
Expectation checking (Paths + Expectations) → Categorized paths
  For each path P
    For each recognizer R
      Does R match P?
  Check each aggregate
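The reconciliation and path-construction stages can be sketched for tasks alone. This is a minimal illustration under stated assumptions: events are `(timestamp, path_id, kind, name)` tuples, an end event pairs with the most recent unmatched start of the same name on the same path, and messages are omitted. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Event = Tuple[float, str, str, str]          # (timestamp, path_id, kind, name)
TaskRow = Tuple[str, str, float, float]      # (path_id, name, start, end)


def reconcile_tasks(events: List[Event]) -> List[TaskRow]:
    """Pair each end_task with the most recent unmatched start_task."""
    open_starts: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    tasks: List[TaskRow] = []
    for ts, path_id, kind, name in sorted(events):
        if kind == "start_task":
            open_starts[(path_id, name)].append(ts)
        elif kind == "end_task":
            starts = open_starts[(path_id, name)]
            if not starts:
                raise ValueError(f"end_task without start_task: {name!r} on {path_id!r}")
            tasks.append((path_id, name, starts.pop(), ts))
    return tasks


def build_paths(tasks: List[TaskRow]) -> Dict[str, List[Tuple[str, float, float]]]:
    """Group reconciled tasks by path ID."""
    paths: Dict[str, List[Tuple[str, float, float]]] = defaultdict(list)
    for path_id, name, start, end in tasks:
        paths[path_id].append((name, start, end))
    return dict(paths)


# Hypothetical trace with two interleaved requests.
trace: List[Event] = [
    (0.0, "p1", "start_task", "Parse HTTP"),
    (0.1, "p2", "start_task", "Parse HTTP"),
    (0.2, "p1", "end_task", "Parse HTTP"),
    (0.5, "p2", "end_task", "Parse HTTP"),
]
paths = build_paths(reconcile_tasks(trace))
```

An unmatched end event raises an error here; a real reconciler would more likely report it as a trace problem and continue.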
Exploring behavior
• Expectations checker generates lists of valid and invalid paths
• Explore both sets
  – Why did invalid paths occur?
  – Is any unexpected behavior misclassified as valid?
    • Insufficiently constrained expectations
    • Pip may be unable to express all expectations
• Two ways to explore behavior
  – SQL queries over tables
    • Paths, threads, tasks, messages, notices
  – Visualization
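To give a feel for exploring behavior through SQL, the sketch below loads a toy `tasks` table into an in-memory SQLite database and runs two queries. The table layout and column names are assumptions made for this example; the slides name the tables but not their schemas.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical schema; Pip's real tables may differ.
    CREATE TABLE tasks (path_id TEXT, name TEXT, host TEXT, cpu_ms REAL);
    INSERT INTO tasks VALUES
      ('p1', 'Parse HTTP', 'www', 4.0),
      ('p1', 'Query',      'db',  30.0),
      ('p2', 'Parse HTTP', 'www', 150.0),
      ('p2', 'Query',      'db',  25.0);
    """
)

# Which paths spent more than 100 ms of CPU parsing the request?
slow_parses = conn.execute(
    "SELECT path_id, cpu_ms FROM tasks "
    "WHERE name = ? AND cpu_ms > ? ORDER BY cpu_ms DESC",
    ("Parse HTTP", 100.0),
).fetchall()

# Total CPU time attributed to each path.
cpu_per_path = conn.execute(
    "SELECT path_id, SUM(cpu_ms) FROM tasks GROUP BY path_id ORDER BY path_id"
).fetchall()
```

Queries like the first one are a quick way to answer "why did this path fail the 100 ms limit?" before writing a new expectation.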
Visualization: causal paths
[Screenshot of the Pip explorer, with callouts:]
• Causal view of path
• Caused tasks, messages, and notices on that thread
• Timing and resource properties for one task
Visualization: communication graph
• Graph view of all host-to-host network traffic
Visualization: performance graphs
• Plot per-task or per-path resource metrics
  – Cumulative distribution (CDF), probability density (PDF), or vs. time
• Click on a point to see its value and the task/path represented
[Figure: plot of Delay (ms) vs. Time (s)]
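For reference, the cumulative distribution behind a CDF plot is straightforward to compute from per-path samples. The sketch below uses made-up delay values and a hypothetical function name.

```python
from typing import List, Tuple


def empirical_cdf(samples: List[float]) -> List[Tuple[float, float]]:
    """Return (value, fraction of samples <= value) pairs in ascending order."""
    ordered = sorted(samples)
    n = len(ordered)
    return [(value, (i + 1) / n) for i, value in enumerate(ordered)]


# Hypothetical end-to-end delays in milliseconds, one per path.
delays_ms = [12.0, 8.0, 40.0, 9.0]
cdf = empirical_cdf(delays_ms)
```

Each point still corresponds to one sample, which is what makes "click on a point to see the path" possible in an interactive plot.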
Pip vs. printf
Pip                                                 printf
Nesting, causal order                               Unstructured
Time, path, and thread                              No context
CPU and I/O data                                    No resource information
Automatic verification using declarative language   Verification with ad hoc grep or expect scripts
SQL queries                                         “Queries” using Perl scripts
Automatic generation for some middleware            Manual placement
• Both record interesting events to check off-line
  – Pip imposes structure and automates checking
  – Generalizes ad hoc approaches
Results
• We have applied Pip to several distributed systems:
  – FAB: distributed block store
  – SplitStream: DHT-based multicast protocol
  – RanSub: tree-based protocol used to build higher-level systems
  – Others: Bullet, SWORD, Oracle of Bacon
• We have found unexpected behavior in each system
• We have fixed bugs in some systems
  … and used Pip to verify that the behavior was fixed
Results: SplitStream (DHT-based multicast protocol)
13 bugs found, 12 fixed
  – 11 found using expectations, 2 found using GUI
• Structural bug: some nodes have up to 25 children when they should have at most 18
  – This bug was fixed and later reoccurred
  – Root cause #1: variable shadowing
  – Root cause #2: failed to register a callback
• How discovered: first in the explorer GUI, confirmed with automated checking
Results: FAB (distributed block store)
1 bug (so far), fixed
  – Four protocols checked: read, write, Paxos, membership
• Performance bug: nodes seeking quorum call self and peers in arbitrary order
  – Should call self last, to overlap computation
  – For cached blocks, should call self second-to-last
Results: RanSub (tree-based protocol)
2 bugs found, 1 fixed
• Structural bug: during first round of communication, parent nodes send summary messages before hearing from all children
  – Root cause: uninitialized state variables
• Performance bug: linear increase in end-to-end delay for the first ~2 minutes
  – Suspected root cause: data structure listing all discovered nodes
Future work
• Further automation of annotations, tracing
  – Explore tradeoffs between black-box, annotated behavior models
• Extensible annotations
  – Application-specific schema for notices
• Composable expectations for large systems
Related work
• Expectations-based systems
  – PSpec [Perl, 1993]
  – Meta-level compilation [Engler, 2000]
  – Paradyn [Miller, 1995]
• Causal paths
  – Pinpoint [Chen, 2002]
  – Magpie [Barham, 2004]
  – Project5 [Aguilera, 2003]
• Model checking
  – MaceMC [Killian, 2006]
  – VeriSoft [Godefroid, 2005]
Conclusions
• Finding unexpected behavior can help us find bugs
  – Both structure and performance bugs
• Expectations serve as a high-level external specification
  – Summary of inter-component behavior and timing
  – Regression test for structure and performance
• Some bugs not exposed by expectations can be found through exploring: queries and visualization
Extra slides
Resource metrics
• Real time
• User time, system time
  – CPU time = user + system
  – Busy time = CPU time / real time
• Major and minor page faults (paging and allocation)
• Voluntary and involuntary context switches
• Message size and latency
• Number of messages sent
• Causal depth of path
• Number of threads, hosts in path
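The two derived metrics above follow directly from their definitions. The sketch below assumes all times are in seconds; the function names are hypothetical.

```python
def cpu_time(user: float, system: float) -> float:
    """CPU time = user + system."""
    return user + system


def busy_time(user: float, system: float, real: float) -> float:
    """Busy time = CPU time / real time (fraction of wall-clock time on CPU)."""
    if real <= 0:
        raise ValueError("real time must be positive")
    return cpu_time(user, system) / real


# Hypothetical task: 0.25 s user + 0.25 s system over 2 s of wall-clock time.
example_cpu = cpu_time(0.25, 0.25)
example_busy = busy_time(0.25, 0.25, 2.0)
```

A busy time well below 1 for a single-threaded task suggests it spent most of its wall-clock time blocked or waiting rather than computing.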