Automated Support for Classifying Software Failure Reports
Andy Podgurski, David Leon, Patrick Francis, Wes Masri, Melinda Minch, Jiayang Sun, Bin Wang
Case Western Reserve University
Presented by: Hamid Haidarian Shahri

Automated failure reporting
- Recent software products automatically detect and report crashes/exceptions to the developer
  - Netscape Navigator
  - Microsoft products
- A report includes the call stack, register values, and other debug information

Example
- [Figure: example of an automated failure report]

User-initiated reporting
- Other products permit the user to report a failure at any time
- The user describes the problem
- Application state information may also be included in the report

Mixed blessing
- Good news: more failures are reported, with more precise diagnostic information
- Bad news: a dramatic increase in failure reports, too many to review manually

Our approach
- Help developers group reported failures with the same cause, before the cause is known
- Provide "semi-automatic" support for:
  - Execution profiling
  - Supervised and unsupervised pattern classification
  - Multivariate visualization
- The initial classification is checked and refined by the developer

Example classification
- [Figure: example classification of failure reports]

How classification helps (benefits)
- Aids prioritization and debugging:
  - Suggests the number of underlying defects
  - Reflects how often each defect causes failures
  - Assembles evidence relevant to prioritizing and diagnosing each defect

Formal view of the problem
- Let F = {f_1, f_2, ..., f_m} be the set of reported failures
- True failure classification: a partition of F into subsets F_1, F_2, ..., F_k such that all failures in each F_i have the same cause
- Our approach produces an approximate failure classification G_1, G_2, ..., G_p

Classification strategy
1. The software is instrumented to collect and upload profiles or captured executions for the developer
2. Profiles of reported failures are combined with those of apparently successful executions (reducing bias)
3. A subset of relevant features is selected
4. Failure profiles are analyzed using cluster analysis and multivariate visualization
5. The initial classification of failures is examined and refined

Execution profiling
- Our approach is not limited to classifying crashes and exceptions: a user may report a failure well after the critical events leading to it
- Profiles should characterize the entire execution
- Profiles should characterize events potentially relevant to the failure, e.g., control flow, data flow, variable values, event sequences, state transitions
- Full execution capture/replay permits arbitrary profiling

Feature selection
1. Generate candidate feature sets
2. Use each one to train a classifier to distinguish failures from successful executions
3. Select the features of the classifier that performs best overall
4. Use those features to group (cluster) related failures

Probabilistic wrapper method
- Used to select features in our experiments; due to Liu and Setiono
- Random feature sets are generated
- Each is used, with one part of the profile data, to train a classifier
- The misclassification rate of each classifier is estimated on another part of the data (testing)
- The features of the classifier with the smallest estimated misclassification rate are used for grouping failures (see the sketch below)

Logistic regression (skip)
- A simple, widely used classifier with a binary dependent variable Y
- The expected value E(Y | x) of Y given the predictor x = (x_1, x_2, ..., x_p) is
  π(x) = P(Y = 1 | x) = e^{g(x)} / (1 + e^{g(x)})

Logistic regression cont. (skip)
- The log odds ratio (logit) g(x) is defined by
  g(x) = ln(π(x) / (1 - π(x))) = β_0 + β_1 x_1 + ... + β_p x_p
- The coefficients are estimated from a sample of x and Y values
- The estimate of Y given x is 1 iff the estimate of g(x) is positive
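To make the wrapper method and the logistic-regression classifier concrete, here is a minimal sketch in Python using scikit-learn. It is not the authors' implementation (they worked in S-Plus, with 500-feature subsets and 400-500 candidate models); the profile matrix, labels, subset size, and iteration count below are illustrative stand-ins.

    # Sketch of Liu-and-Setiono-style probabilistic wrapper feature selection:
    # draw random feature subsets, train a logistic-regression classifier on
    # one part of the data, and keep the subset whose classifier misclassifies
    # the fewest held-out executions.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # X: one row per execution, one column per profile feature (e.g., function
    # call counts); y: 1 = failure, 0 = apparently successful execution.
    X = rng.poisson(3.0, size=(600, 200)).astype(float)  # stand-in profiles
    y = (rng.random(600) < 0.1).astype(int)              # stand-in labels

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=0, stratify=y)

    best_err, best_features = np.inf, None
    for _ in range(50):                        # candidate feature sets
        feats = rng.choice(X.shape[1], size=20, replace=False)
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X_train[:, feats], y_train)
        err = 1.0 - clf.score(X_test[:, feats], y_test)  # misclassification rate
        if err < best_err:
            best_err, best_features = err, feats

    # best_features would then be used to cluster the failure profiles.
    print(f"best held-out misclassification rate: {best_err:.3f}")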
Grouping related failures
- Two alternatives:
  1) Automatic cluster analysis: can be fully automated
  2) Multivariate visualization: the user must identify groups in the display
- The weaknesses of each approach are offset by combining them

1) Automatic cluster analysis
- Identifies clusters among objects based on the similarity of their feature values
- Employs a dissimilarity metric, e.g., Euclidean or Manhattan distance
- Must estimate the number of clusters, a difficult problem: several "reasonable" ways to cluster a population may exist

Estimating the number of clusters
- A widely used metric of clustering quality, due to Calinski and Harabasz:
  CH(c) = (B / (c - 1)) / (W / (n - c))
  where B is the total between-cluster sum of squared distances, W is the total within-cluster sum of squared distances from the cluster centroids, and n is the number of objects in the population
- Local maxima of CH(c) represent alternative estimates of the number of clusters (see the sketch below)
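A minimal sketch of estimating the number of clusters by scanning CH(c), again in Python with scikit-learn. The data is synthetic, and k-means plus calinski_harabasz_score stand in for the k-medoids (clara) clustering the paper actually ran in S-Plus.

    # Sketch: choose the number of clusters c by maximizing the
    # Calinski-Harabasz index CH(c) = (B / (c - 1)) / (W / (n - c)).
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import calinski_harabasz_score

    rng = np.random.default_rng(1)
    # Synthetic "failure profiles": three blobs of selected-feature vectors.
    profiles = np.vstack([rng.normal(m, 1.0, size=(40, 10)) for m in (0, 5, 10)])

    scores = {}
    for c in range(2, 11):          # the paper scanned 2 <= c <= 50
        labels = KMeans(n_clusters=c, n_init=10, random_state=1).fit_predict(profiles)
        scores[c] = calinski_harabasz_score(profiles, labels)

    print("CH index peaks at c =", max(scores, key=scores.get))
    # In practice one plots CH(c) and inspects every local maximum, since
    # each peak is an alternative plausible clustering.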
2) Multidimensional scaling (MDS)
- Represents the dissimilarities between objects as a 2D scatter plot: distances between points in the display approximate the dissimilarities
- Small dissimilarities are poorly represented with high-dimensional profiles
- Our solution: hierarchical MDS (HMDS)

Confirming or refining the initial classification
- Select 2+ failures from each group, choosing ones with maximally dissimilar profiles
- Debug them to determine whether they are related; if not, split the group
- Examine neighboring groups to see if they should be combined

Limitations
- The classification is unlikely to be exact, because of:
  - Sampling error
  - Modeling error
  - Representation error
  - Spurious correlations
  - The form of profiling
  - Human judgment

Experimental validation
- Implemented the classification strategy with three large subject programs: the GCC, Jikes, and javac compilers
- Failures were clustered automatically
- The resulting clusters were examined manually; most or all failures in each cluster were examined

Subject programs
- GCC 2.95.2 C compiler, written in C
  - Used a subset of its regression test suite (self-validating execution tests)
  - 3333 tests run, 136 failures
  - Profiled with GNU gcov (2214 function call counts)
- Jikes 1.15 Java compiler, written in C++
  - Used the Jacks test suite (self-validating)
  - 3149 tests run, 225 failures
  - Profiled with gcov (3644 function call counts)

Subject programs cont.
- javac 1.3.1_02-b02 Java compiler, written in Java
  - Used the Jacks test suite
  - 3140 tests run, 233 failures
  - Profiled with a function-call profiler written using JVMPI (1554 call counts)

Experimental methodology (skip)
- 400-500 candidate logistic regression (LR) models were generated per data set, with 500 randomly selected features per model
- The model with the lowest estimated misclassification rate was chosen
- The data was partitioned into three subsets:
  - Train (50%): used to train the candidate models
  - TestA (25%): used to pick the best model
  - TestB (25%): used for the final estimate of the misclassification rate

Experimental methodology cont. (skip)
- Measure used to pick the best model:
  (% misclassified failures + % misclassified successes) / 2
  Since failures are a small minority of executions, averaging the two per-class rates gives extra weight to the misclassification of failures
- The final LR models correctly classified 72% of failures and 91% of successes
- Linearly dependent features were omitted from the fitted LR models

Experimental methodology cont. (skip)
- Cluster analysis: the S-Plus clustering algorithm clara, based on the k-medoids criterion
  - The Calinski-Harabasz index was plotted for 2 ≤ c ≤ 50, and local maxima were examined
- Visualization: the hierarchical MDS (HMDS) algorithm was used (a plain-MDS sketch follows)
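To illustrate the visualization step, here is a minimal sketch that projects pairwise profile dissimilarities into 2D with scikit-learn's standard MDS; the hierarchical HMDS variant the paper uses is not available there, and the profiles below are synthetic.

    # Sketch: embed pairwise dissimilarities between failure profiles in 2D
    # so an engineer can eyeball candidate groups. Plain metric MDS stands in
    # for the paper's hierarchical MDS (HMDS).
    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from sklearn.manifold import MDS

    rng = np.random.default_rng(2)
    failure_profiles = rng.normal(size=(50, 30))      # stand-in profiles

    # Manhattan (cityblock) distance, one of the dissimilarity metrics mentioned.
    diss = squareform(pdist(failure_profiles, metric="cityblock"))

    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=2)
    coords = mds.fit_transform(diss)                  # 2D display coordinates

    # coords can now be scatter-plotted; nearby points have similar profiles,
    # and convex hulls drawn around clusters mark the automatic classification.
    print(coords[:3])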
Manual examination of failures (skip)
- Several GCC tests often share the same source file, run at different optimization levels, and such tests often fail or succeed together
- Hence, GCC failures were grouped manually based on:
  - The source file
  - Information about bug fixes
  - The date of the first version to pass the test

Manual examination cont. (skip)
- Jikes and javac failures were grouped in two stages:
  1. Automatically formed clusters were checked
  2. Overlapping clusters in the HMDS display were checked
- Activities: debugging, comparing versions, examining error codes, inspecting source files, and checking the correspondence between tests and JLS sections

GCC results

  Number of clusters | % size of largest group of failures in cluster with same cause | Total failures (136)
  21                 | 100                            | 77 (57%)
  1                  | 83                             | 6 (4%)
  3                  | 75, 75, 71                     | 23 (17%)
  1                  | 60                             | 5 (4%)
  1                  | 24                             | 25 (18%)

GCC results cont.
- [Figure: HMDS display of GCC failure profiles after feature selection; convex hulls indicate the results of automatic clustering into 27 clusters]
- [Figure: HMDS display of GCC failure profiles after feature selection; convex hulls indicate failures involving the same defect, grouped with HMDS (more accurate)]

GCC results cont.
- [Figure: HMDS display of GCC failure profiles before feature selection; convex hulls indicate failures involving the same defect]
- So feature selection helps in grouping

javac results

  Number of clusters | % size of largest group of failures in cluster with same cause | Total failures (232)
  9                  | 100                            | 70 (30%)
  5                  | 88, 85, 85, 85, 83             | 64 (28%)
  4                  | 75, 67, 67, 57                 | 49 (21%)
  2                  | 50, 50                         | 20 (9%)
  1                  | 17                             | 23 (10%)

javac results cont.
- [Figure: HMDS display of javac failures; convex hulls indicate the results of manual classification with HMDS]

Jikes results

  Number of clusters | % size of largest group of failures in cluster with same cause | Total failures (225)
  12                 | 100                            | 64 (29%)
  5                  | 85, 83, 80, 75, 75             | 41 (18%)
  4                  | 70, 67, 67, 56                 | 25 (11%)
  8                  | 50, 50, 50, 43, 41, 33, 33, 25 | 76 (34%)

Jikes results cont.
- [Figure: HMDS display of Jikes failures; convex hulls indicate the results of manual classification with HMDS]

Summary of results
- In most automatically created clusters, the majority of failures had the same cause
- A few large, non-homogeneous clusters were created
- Automatic clustering sometimes split groups of failures with the same cause; the sub-clusters are evident in the HMDS displays
- The HMDS displays did not have this problem
- Overall, failures with the same cause formed fairly cohesive clusters

Threats to validity
- Only one type of program (compilers) was used in the experiments
- Hand-crafted test inputs were used for profiling; field reports (think of Microsoft's) may behave differently

Related work
- xSlice [Agrawal et al.]
- Path spectra [Reps et al.]
- Tarantula [Jones et al.]
- Delta debugging [Hildebrandt & Zeller]
- Cluster filtering [Dickinson et al.]
- Clustering IDS alarms [Julisch & Dacier]

Conclusions
- Demonstrated that our classification strategy is potentially useful with compilers
- Further evaluation is needed with different types of software and with failure reports from the field
- Note: the input space is huge; more accurate reporting (severity, location) could facilitate better grouping and help overcome these problems
- Note: limited labeled data is available, and error causes/types change constantly (errors get debugged), so the effectiveness of learning is somewhat questionable (like following your shadow)

Future work
- Further experimental evaluation
- Use more powerful classification and clustering techniques
- Use different profiling techniques
- Extract additional diagnostic information
- Apply the techniques to classifying intrusions reported by anomaly detection systems