Scalable Statistical Bug Isolation
Ben Liblit, Mayur Naik, Alice Zheng, Alex Aiken, and Michael Jordan, 2005
University of Wisconsin, Stanford University, and UC Berkeley

Mustafa Dajani
27 Nov 2006
CMSC 838P

Overview of the Paper
• Explained a statistical debugging algorithm that can isolate bugs in programs containing multiple undiagnosed bugs
• Showed that the algorithm is practical and scalable for isolating multiple bugs in many software systems

Outline:
• Introduction
• Background
• Cause Isolation Algorithm
• Experiments

Objective of the Study:
To develop a statistical algorithm to hunt for causes of failures
• Crash reporting systems are useful in collecting data
• Actual executions are a vast resource
• Feedback data can be used to find the causes of failures

Introduction
• Statistical debugging
  - a dynamic analysis for detecting the causes of run failures
  - an instrumented program monitors its own behavior by sampling information
  - this involves testing predicates at particular events during the run
• Predicates, P
  - bug predictors; large programs may contain thousands of predicates
• Feedback Report, R
  - records whether a run has succeeded or failed
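To make the sampling and feedback report described above concrete, here is a minimal sketch of what one run's report might hold. The names `FeedbackReport`, `observe`, and `run_once`, the predicate label "f is None", and the default sampling rate of 1/100 are all assumptions made for illustration; they are not the paper's instrumentation.

```python
import random
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class FeedbackReport:
    """One run's report: sampled predicate counts plus the run outcome."""
    failed: bool = False
    # How many times each predicate was sampled (hypothetical layout).
    observed: Counter = field(default_factory=Counter)
    # How many times each predicate was sampled and found true.
    observed_true: Counter = field(default_factory=Counter)


def observe(report, name, value, rate=0.01, rng=random):
    """Record predicate `name` only when this check happens to be sampled."""
    if rng.random() < rate:
        report.observed[name] += 1
        if value:
            report.observed_true[name] += 1


def run_once(f, report, rate=0.01, rng=random):
    """Hypothetical instrumented run: using a missing `f` counts as a failure."""
    observe(report, "f is None", f is None, rate, rng)
    if f is None:
        report.failed = True
    return report
```

With `rate=1.0` every check is recorded, which is handy for trying the sketch out; a low rate keeps the monitoring overhead small at the cost of seeing each predicate only occasionally.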
Introduction
• The study’s model of behavior: “If P is observed to be true at least once during run R then R(P) = 1, otherwise R(P) = 0.”
  - In other words, it counts how often “P observed true” and “P observed” occur, using random sampling
• Previous work used regularized logistic regression, which tries to select predicates that determine the outcome of every run
  - but that approach selects redundant predicates and has difficulty predicting multiple bugs

Introduction
• Study design:
  - determine all possible predicates
  - eliminate predicates that have no predictive power
  - loop {
      - rank the surviving predicates by Importance
      - remove the top-ranked predicate P
      - discard all runs R in which P was observed true, R(P) = 1
      - go to top of loop until the set of runs or the set of predicates is empty
    }

Bug Isolation Architecture
[Diagram: Program Source, Predicates, and Sampler feed the Compiler, which produces the Shipping Application; its counts and success/failure labels (☺/☹) feed Statistical Debugging, which reports the top bugs with likely causes]

Depicting failures through P
F(P) = # of failures where P observed true
S(P) = # of successes where P observed true
Failure(P) = F(P) / (F(P) + S(P))

When does a program fail?
Consider this code fragment:
    if (f == NULL) {
        x = 0;
        *f;
    }
Valid pointer assignment:
    if (…)
        f = …some valid pointer…;
    *f;

Predicting P’s truth or falsehood
F(P observed) = # of failures observing P
S(P observed) = # of successes observing P
Context(P) = F(P observed) / (F(P observed) + S(P observed))
Increase(P) = Failure(P) - Context(P)

Notes
• Two predicates are redundant if they predict the same, or nearly the same, set of failing runs
• Because elimination is iterative, Importance only needs to select a good predictor at each step, not necessarily the best one

Guide to Visualization
[Figure legend: bands for Context(P), Increase(P), error bound, and S(P); scale log(F(P) + S(P))]
http://www.cs.wisc.edu/~liblit/pldi-2005/

Rank by Increase(P)
• High Increase() but very few failing runs!
• These are all sub-bug predictors
  – Each covers one special case of a larger bug
• Redundancy is clearly a problem

Rank by F(P)
• Many failing runs but low Increase()!
• Tend to be super-bug predictors
  – Each covers several bugs, plus lots of junk

Notes
• In the language of information retrieval
  – Increase(P) has high precision, low recall
  – F(P) has high recall, low precision
• Standard solution:
  – Take the harmonic mean of both
  – Rewards high scores in both dimensions

Rank by Harmonic Mean
• It works!
  – Large increase, many failures, few or no successes
• But redundancy is still a problem

Lessons Learned
• Can learn a lot from actual executions
  – Users are running buggy code anyway
  – We should capture some of that information
• Crash reporting is a good start, but…
  – Pre-crash behavior can be important
  – Successful runs reveal correct behavior
  – Stack alone is not enough for 50% of bugs
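The scoring and elimination steps from the earlier slides can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the `Run` record and the function names are hypothetical, Increase(P) is taken as Failure(P) - Context(P), and the recall term F(P) / NumF is a simple normalization assumed here so that both inputs to the harmonic mean lie between 0 and 1 (the slides say only to take the harmonic mean of both).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """Hypothetical per-run record built from a feedback report."""
    failed: bool
    observed: frozenset       # predicates sampled at least once in this run
    observed_true: frozenset  # predicates sampled true at least once, R(P) = 1


def scores(p, runs):
    """Return (F(P), Failure(P), Context(P), Increase(P)) for predicate p."""
    f_true = sum(r.failed and p in r.observed_true for r in runs)
    s_true = sum(not r.failed and p in r.observed_true for r in runs)
    f_obs = sum(r.failed and p in r.observed for r in runs)
    s_obs = sum(not r.failed and p in r.observed for r in runs)
    failure = f_true / (f_true + s_true) if f_true + s_true else 0.0
    context = f_obs / (f_obs + s_obs) if f_obs + s_obs else 0.0
    return f_true, failure, context, failure - context


def importance(p, runs):
    """Harmonic mean of Increase(P) and an assumed recall term F(P) / NumF."""
    num_f = sum(r.failed for r in runs)
    f_true, _, _, increase = scores(p, runs)
    if increase <= 0 or f_true == 0 or num_f == 0:
        return 0.0  # no predictive power on the remaining runs
    recall = f_true / num_f
    return 2 / (1 / increase + 1 / recall)


def isolate(predicates, runs):
    """Pick the top predicate, discard runs where it was true, and repeat."""
    predicates, runs, picked = set(predicates), list(runs), []
    while predicates and runs:
        # sorted() only makes tie-breaking deterministic.
        best = max(sorted(predicates), key=lambda p: importance(p, runs))
        if importance(best, runs) <= 0:
            break  # nothing left that predicts the remaining failures
        picked.append(best)
        predicates.discard(best)
        runs = [r for r in runs if best not in r.observed_true]
    return picked
```

Discarding the runs covered by the chosen predicate is what addresses the redundancy noted above: a second predictor for the same bug loses its failing runs and drops in rank, so the next pick tends to point at a different bug.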