Scalable Statistical Bug Isolation

Ben Liblit, Mayur Naik, Alice Zheng,
Alex Aiken, and Michael Jordan, 2005
University of Wisconsin, Stanford University, and UC Berkeley
Mustafa Dajani
27 Nov 2006 CMSC 838P
Overview of the Paper
• Explains a statistical debugging algorithm that can
isolate bugs in programs containing multiple
undiagnosed bugs
• Shows that the algorithm is practical and scalable enough
to isolate multiple bugs in many software systems
• Outline:
Introduction
Background
Cause Isolation Algorithm
Experiments
Objective of the Study:
To develop a statistical algorithm to
hunt for causes of failures
• Crash reporting systems are useful in
collecting data
• Actual executions are a vast resource
• Use feedback data to find the causes of failures
Introduction
• Statistical debugging - a dynamic analysis for detecting
the causes of failed runs.
- an instrumented program monitors its own
behavior by sampling information
- this involves testing predicates at particular
events during the run
• Predicates, P - candidate bug predictors; a large program may
contain thousands of predicates
• Feedback Report, R - records whether a
run succeeded or failed, plus the sampled predicate observations.
Introduction
• The study's model of behavior:
“If P is observed to be true at least once during run R then
R(P) = 1, otherwise R(P) = 0.”
- In other words, the instrumentation uses random sampling to count
how often “P observed” and “P observed true” occur
• Previous work used regularized logistic
regression (which tries to select the predicates that determine
the outcome of every run)
- but that approach selects redundant
predicates and has difficulty predicting multiple bugs
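The observation model above can be sketched in a few lines. This is a minimal illustration, not the authors' instrumentation: the class name SampledRun, its method names, and the rate parameter are all hypothetical.

```python
import random
from collections import Counter


class SampledRun:
    """Hypothetical per-run record of sampled predicate observations."""

    def __init__(self, rate: float, rng: random.Random) -> None:
        self.rate = rate                # assumed probability of sampling a check
        self.rng = rng
        self.observed = Counter()       # times P was sampled at all
        self.observed_true = Counter()  # times P was sampled and was true

    def check(self, name: str, value: bool) -> None:
        """Called each time execution reaches the predicate `name`."""
        if self.rng.random() < self.rate:
            self.observed[name] += 1
            if value:
                self.observed_true[name] += 1

    def R(self, name: str) -> int:
        """R(P) = 1 if P was observed true at least once, else 0."""
        return 1 if self.observed_true[name] > 0 else 0


run = SampledRun(rate=1.0, rng=random.Random(0))  # rate 1.0 samples every check
run.check("f == NULL", True)
run.check("x > 0", False)
```

With a rate below 1.0, a predicate can be true during a run and still have R(P) = 0 because it was never sampled, which is why "P observed" and "P observed true" are counted separately.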
Introduction
• Study design:
- determine all possible predicates
- eliminate predicates that have no predictive
power
- loop {
- rank the surviving predicates by importance
- remove the top-ranked predicate P
- discard all runs where P was observed true, R(P)=1
- repeat until the set of runs or the set of
predicates is empty
}
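The elimination loop can be sketched as follows. This is a simplified illustration under stated assumptions: the run representation, the function names, and the stand-in score (the number of failing runs where a predicate was true) are hypothetical, and the sketch takes one top-ranked predicate per iteration and discards the runs in which that predicate was observed true.

```python
from typing import Callable, FrozenSet, List, Set, Tuple

# Hypothetical run model: (failed, predicates observed true in that run),
# so R(P) = 1 exactly when P is in the second component.
Run = Tuple[bool, FrozenSet[str]]


def failures_where_true(p: str, runs: List[Run]) -> int:
    """Stand-in score: number of failing runs where p was observed true."""
    return sum(1 for failed, true_preds in runs if failed and p in true_preds)


def isolate_predictors(
    runs: List[Run],
    predicates: Set[str],
    score: Callable[[str, List[Run]], float] = failures_where_true,
) -> List[str]:
    runs = list(runs)
    remaining = set(predicates)
    chosen: List[str] = []
    while remaining and runs:
        # Rank the survivors; sorting first makes tie-breaking deterministic.
        best = max(sorted(remaining), key=lambda p: score(p, runs))
        if score(best, runs) <= 0:
            break  # no surviving predicate has predictive power
        chosen.append(best)
        remaining.discard(best)
        # Discard every run in which the chosen predicate was true.
        runs = [r for r in runs if best not in r[1]]
    return chosen


example_runs: List[Run] = [
    (True, frozenset({"a", "c"})),
    (True, frozenset({"a"})),
    (True, frozenset({"b"})),
    (False, frozenset({"c"})),
    (False, frozenset()),
]
predictors = isolate_predictors(example_runs, {"a", "b", "c"})
```

Discarding the runs explained by the chosen predicate is what lets a later iteration surface a predictor for a different, less frequent bug.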
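Bug Isolation Architecture
[Diagram: the program source and the predicates pass through the sampler and compiler to produce the shipping application; its runs report predicate counts and success/failure outcomes, which statistical debugging turns into the top bugs with likely causes.]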
Depicting failures through P
F(P) = # of failures where P observed true
S(P) = # of successes where P observed true
Failure(P) = F(P) / (F(P) + S(P))
When does a program fail?
Consider this code fragment:
if (f == NULL) {
    x = 0;
    *f;
}
Valid pointer assignment
if (…) f = …some valid pointer…;
*f;
Predicting P’s truth or falsehood
F(P observed) = # of failures observing P
S(P observed) = # of successes observing P
Context(P) = F(P observed) / (F(P observed) + S(P observed))
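As a worked illustration, the two ratios can be computed from four counts per predicate. The later slides rank by Increase(P), which the paper defines as Failure(P) - Context(P). The Counts class, its field names, and the example numbers below are hypothetical, chosen to mirror the f == NULL fragment shown earlier.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    f_true: int      # F(P): failing runs where P was observed true
    s_true: int      # S(P): successful runs where P was observed true
    f_observed: int  # F(P observed): failing runs where P was observed
    s_observed: int  # S(P observed): successful runs where P was observed


def failure(c: Counts) -> float:
    total = c.f_true + c.s_true
    return c.f_true / total if total else 0.0


def context(c: Counts) -> float:
    total = c.f_observed + c.s_observed
    return c.f_observed / total if total else 0.0


def increase(c: Counts) -> float:
    """Increase(P) = Failure(P) - Context(P)."""
    return failure(c) - context(c)


# Made-up counts. "x == 0" is only ever checked inside the branch that is
# already doomed, so every run that observes it fails regardless of its value.
x_is_zero = Counts(f_true=10, s_true=0, f_observed=10, s_observed=0)
# "f == NULL" is checked in every run but is true only in the failing ones.
f_is_null = Counts(f_true=10, s_true=0, f_observed=10, s_observed=90)
```

Both predicates have Failure(P) = 1, but only f == NULL has a large Increase(P) (about 0.9 here, against 0 for x == 0), which is the point of subtracting the context.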
Notes
• Two predicates are redundant if they predict
the same or nearly the same set of failing
runs
• Because elimination is iterative, it is only
necessary that Importance selects a good
predictor at each step, not necessarily the
best one.
Guide to Visualization
[Figure: legend for the per-predicate bar, with parts labeled Context(P), Increase(P), error bound, S(P), and log(F(P) + S(P)).]
http://www.cs.wisc.edu/~liblit/pldi-2005/
Rank by Increase(P)
• High Increase() but very few failing runs!
• These are all sub-bug predictors
– Each covers one special case of a larger bug
• Redundancy is clearly a problem
Rank by F(P)
• Many failing runs but low Increase()!
• Tend to be super-bug predictors
– Each covers several bugs, plus lots of junk
Notes
• In the language of information retrieval
– Increase(P) has high precision, low recall
– F(P) has high recall, low precision
• Standard solution:
– Take the harmonic mean of both
– Rewards high scores in both dimensions
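A minimal sketch of the harmonic-mean score follows. The slide does not say how F(P) is put on the same 0 to 1 scale as Increase(P); the sketch assumes the normalization log F(P) / log NumF, where NumF is the total number of failing runs, which is how the paper's Importance formula is recalled here. The function and parameter names are hypothetical.

```python
import math


def importance(increase: float, f_true: int, num_failures: int) -> float:
    """Harmonic mean of Increase(P) and a normalized, log-scaled F(P)."""
    if increase <= 0 or f_true <= 1 or num_failures <= 1:
        return 0.0  # either dimension is zero or undefined, so the mean is 0
    recall = math.log(f_true) / math.log(num_failures)  # assumed normalization
    return 2.0 / (1.0 / increase + 1.0 / recall)


# Made-up numbers for three kinds of predictor, out of 1000 failing runs.
sub_bug = importance(increase=0.9, f_true=2, num_failures=1000)
super_bug = importance(increase=0.05, f_true=900, num_failures=1000)
balanced = importance(increase=0.8, f_true=500, num_failures=1000)
```

Because the harmonic mean is dominated by the smaller of its two inputs, the sub-bug predictor (high Increase, few failures) and the super-bug predictor (many failures, low Increase) both score below the balanced one.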
Rank by Harmonic Mean
• It works!
– Large increase, many failures, few or no
successes
• But redundancy is still a problem
Lessons Learned
• Can learn a lot from actual executions
– Users are running buggy code anyway
– We should capture some of that
information
• Crash reporting is a good start, but…
– Pre-crash behavior can be important
– Successful runs reveal correct behavior
– Stack alone is not enough for 50% of bugs