Automated Support for Classifying Software Failure Reports
Andy Podgurski, David Leon, Patrick Francis, Wes Masri, Melinda Minch, Jiayang Sun, Bin Wang
Case Western Reserve University
Presented by: Hamid Haidarian Shahri

Automated failure reporting
- Recent software products automatically detect and report crashes/exceptions to the developer
  - Netscape Navigator
  - Microsoft products
- The report includes the call stack, register values, and other debug info

Example

User-initiated reporting
- Other products permit the user to report a failure at any time
- The user describes the problem
- Application state info may also be included in the report

Mixed blessing
- Good news:
  - More failures reported
  - More precise diagnostic information
- Bad news:
  - Dramatic increase in the number of failure reports
  - Too many to review manually

Our approach
- Help developers group reported failures with the same cause, before the cause is known
- Provide "semi-automatic" support for:
  - Execution profiling
  - Supervised and unsupervised pattern classification
  - Multivariate visualization
- Initial classification is checked and refined by the developer

Example classification

How classification helps (Benefits)
Aids prioritization and debugging:
- Suggests the number of underlying defects
- Reflects how often each defect causes failures
- Assembles evidence relevant to prioritizing and diagnosing each defect

Formal view of the problem
- Let F = {f1, f2, ..., fm} be the set of reported failures
- True failure classification: a partition of F into subsets F1, F2, ..., Fk such that all failures in each Fi have the same cause
- Our approach produces an approximate failure classification G1, G2, ..., Gp

Classification strategy (***)
1. Software instrumented to collect and upload profiles or captured executions for the developer
2. Profiles of reported failures combined with those of apparently successful executions (reducing bias)
3. Subset of relevant features selected
4. Failure profiles analyzed using cluster analysis and multivariate visualization
5. Initial classification of failures examined and refined

Execution profiling
- Our approach is not limited to classifying crashes and exceptions
- User may report a failure well after the critical events leading to the failure
- Profiles should characterize the entire execution
- Profiles should characterize events potentially relevant to the failure, e.g.:
  - Control flow, data flow, variable values, event sequences, state transitions
- Full execution capture/replay permits arbitrary profiling (a minimal call-count profiler is sketched below)
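
For illustration only (the experiments below used Gcov and a JVMPI-based profiler, not Python), a function-call-count profile of the kind used later can be gathered with Python's sys.setprofile hook; the helper names here are mine, not part of the paper's tooling:

```python
import sys
from collections import Counter

call_counts = Counter()  # one counter per function = one profile feature

def _tracer(frame, event, arg):
    # Count every function call made while the profiler is installed.
    if event == "call":
        code = frame.f_code
        call_counts[(code.co_filename, code.co_name)] += 1

def profile_execution(entry_point, *args, **kwargs):
    """Run entry_point under the profiler and return its call-count profile."""
    call_counts.clear()
    sys.setprofile(_tracer)
    try:
        entry_point(*args, **kwargs)
    finally:
        sys.setprofile(None)
    return dict(call_counts)
```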

Feature selection
1. Generate candidate feature sets
2. Use each one to train a classifier to distinguish failures from successful executions
3. Select the features of the classifier that performs best overall
4. Use those features to group (cluster) related failures

Probabilistic wrapper method
- Used to select features in our experiments
- Due to Liu and Setiono
- Random feature sets generated
- Each used with one part of the profile data to train a classifier
- Misclassification rate of each classifier estimated using another part of the data (testing)
- Features of the classifier with the smallest estimated misclassification rate used for grouping failures (see the sketch below)
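
A minimal sketch of this wrapper loop, assuming the profiles are already in a NumPy matrix X (rows = executions, columns = profile features) and y marks failures as 1; the subset size and trial count are illustrative, not the paper's settings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def wrapper_select(X, y, n_trials=100, subset_size=50, seed=0):
    """Probabilistic wrapper: try random feature subsets and keep the one whose
    logistic-regression classifier misclassifies least on held-out data."""
    rng = np.random.default_rng(seed)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=seed, stratify=y)
    best_features, best_error = None, np.inf
    for _ in range(n_trials):
        features = rng.choice(X.shape[1], size=subset_size, replace=False)
        clf = LogisticRegression(max_iter=1000).fit(X_train[:, features], y_train)
        error = 1.0 - clf.score(X_test[:, features], y_test)
        if error < best_error:
            best_features, best_error = features, error
    return best_features  # use these columns when clustering the failures
```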

Logistic regression (skip)
- Simple, widely-used classifier
- Binary dependent variable Y
- Expected value E(Y | x) of Y given predictor x = (x1, x2, ..., xp) is π(x) = P(Y = 1 | x):
  \pi(x) = \frac{e^{g(x)}}{1 + e^{g(x)}}

Logistic regression cont. (skip)
- Log odds ratio (logit) g(x) defined by
  g(x) = \ln\left(\frac{\pi(x)}{1 - \pi(x)}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p
- Coefficients estimated from a sample of x and Y values
- Estimate of Y given x is 1 iff the estimate of g(x) is positive (see the sketch below)
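
A small sketch of that decision rule with scikit-learn (illustrative only; this is not the paper's fitting procedure, and the toy data are made up): after fitting, decision_function returns the estimated logit g(x), and the predicted class is 1 exactly when it is positive.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy profiles: rows = executions, columns = features; label 1 = failure.
X = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, 1.0], [3.0, 4.0]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X, y)
beta0, betas = clf.intercept_[0], clf.coef_[0]  # estimated coefficients

x_new = np.array([[2.5, 2.0]])
g = clf.decision_function(x_new)[0]   # estimated logit g(x)
label = clf.predict(x_new)[0]         # 1 exactly when g > 0
print(f"g(x) = {g:.3f}, predicted class = {label}")
```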

Grouping related failures
Alternatives:
- 1) Automatic cluster analysis
  - Can be fully automated
- 2) Multivariate visualization
  - User must identify groups in the display
- Weaknesses of each approach offset by combining them

1) Automatic cluster analysis
- Identifies clusters among objects based on similarity of feature values
- Employs a dissimilarity metric
  - e.g., Euclidean or Manhattan distance (see the snippet below)
- Must estimate the number of clusters
  - Difficult problem
  - Several "reasonable" ways to cluster a population may exist
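
For concreteness, a pairwise dissimilarity matrix over call-count profiles can be computed with SciPy; the toy values are mine, and "cityblock" is SciPy's name for the Manhattan distance:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy call-count profiles: one row per failing execution (made-up values).
profiles = np.array([[3.0, 0.0, 7.0],
                     [2.0, 1.0, 6.0],
                     [9.0, 4.0, 0.0]])

euclidean = squareform(pdist(profiles, metric="euclidean"))
manhattan = squareform(pdist(profiles, metric="cityblock"))
```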

Estimating the number of clusters
- Widely-used metric of clustering quality due to Calinski and Harabasz:
  CH(c) = \frac{B / (c - 1)}{W / (n - c)}
  - B is the total between-cluster sum of squared distances
  - W is the total within-cluster sum of squared distances from cluster centroids
  - n is the number of objects in the population
- Local maxima represent alternative estimates (a scan over c is sketched below)
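
A sketch of scanning CH(c) over candidate cluster counts and reading off local maxima. It uses scikit-learn's built-in Calinski-Harabasz score with KMeans as a stand-in clustering step; the paper's experiments instead used the clara k-medoids algorithm in S-Plus, as noted later in the methodology:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

def ch_scan(X, c_min=2, c_max=50):
    """Return {c: CH(c)} for each candidate number of clusters c."""
    scores = {}
    for c in range(c_min, min(c_max, len(X) - 1) + 1):
        labels = KMeans(n_clusters=c, n_init=10, random_state=0).fit_predict(X)
        scores[c] = calinski_harabasz_score(X, labels)
    return scores

def local_maxima(scores):
    """Candidate cluster counts whose CH value exceeds both neighbors."""
    cs = sorted(scores)
    return [cs[i] for i in range(1, len(cs) - 1)
            if scores[cs[i]] > scores[cs[i - 1]] and scores[cs[i]] > scores[cs[i + 1]]]
```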

2) Multidimensional scaling (MDS)
- Represents dissimilarities between objects by a 2D scatter plot
- Distances between points in the display approximate the dissimilarities
- Small dissimilarities are poorly represented with high-dimensional profiles
- Our solution: hierarchical MDS (HMDS); a plain MDS embedding is sketched below
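
A sketch of an ordinary 2D MDS embedding of a precomputed dissimilarity matrix using scikit-learn; note this is plain MDS, not the paper's HMDS refinement, and the random profiles are only placeholders:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

# Placeholder failure profiles: 30 failing executions, 12 call-count features.
profiles = np.random.default_rng(0).poisson(5, size=(30, 12)).astype(float)
dissim = squareform(pdist(profiles, metric="euclidean"))

# Embed so that 2D inter-point distances approximate the dissimilarities.
embedding = MDS(n_components=2, dissimilarity="precomputed",
                random_state=0).fit_transform(dissim)
# embedding[:, 0] and embedding[:, 1] give scatter-plot coordinates per failure.
```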

Confirming or refining the initial classification
- Select 2+ failures from each group
  - Choose ones with maximally dissimilar profiles (a helper is sketched below)
- Debug to determine if they are related
  - If not, split the group
- Examine neighboring groups to see if they should be combined
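
A small helper, assuming the dissimilarity matrix and cluster labels from the earlier snippets, that picks the two most dissimilar failures within each cluster as candidates for manual debugging:

```python
import numpy as np

def most_dissimilar_pairs(dissim, labels):
    """For each cluster, return the indices of its two most dissimilar failures."""
    pairs = {}
    for cluster in np.unique(labels):
        members = np.flatnonzero(labels == cluster)
        if len(members) < 2:
            continue  # a singleton cluster has nothing to compare
        sub = dissim[np.ix_(members, members)]
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        pairs[int(cluster)] = (int(members[i]), int(members[j]))
    return pairs
```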

Limitations
- Classification unlikely to be exact, due to:
  - Sampling error
  - Modeling error
  - Representation error
  - Spurious correlations
  - Form of profiling
  - Human judgment

Experimental validation
- Implemented the classification strategy with three large subject programs
  - GCC, Jikes, and javac compilers
- Failures clustered automatically (what failure?)
- Resulting clusters examined manually
  - Most or all failures in each cluster examined

Subject programs
- GCC 2.95.2 C compiler
  - Written in C
  - Used a subset of its regression test suite (self-validating execution tests)
  - 3333 tests run, 136 failures
  - Profiled with GNU Gcov (2214 function call counts)
- Jikes 1.15 Java compiler
  - Written in C++
  - Used the Jacks test suite (self-validating)
  - 3149 tests run, 225 failures
  - Profiled with Gcov (3644 function call counts)

Subject programs cont.
- javac 1.3.1_02-b02 Java compiler
  - Written in Java
  - Used the Jacks test suite
  - 3140 tests run, 233 failures
  - Profiled with a function-call profiler written using JVMPI (1554 call counts)

Experimental methodology (skip)
- 400-500 candidate logistic regression (LR) models generated per data set
  - 500 randomly selected features per model
  - Model with lowest estimated misclassification rate chosen
- Data partitioned into three subsets (a split sketch follows):
  - Train (50%): used to train candidate models
  - TestA (25%): used to pick the best model
  - TestB (25%): used for the final estimate of the misclassification rate
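
A sketch of that three-way split using two chained scikit-learn calls; the 50/25/25 proportions come from the slide, the stratification choice is mine, and X, y are the profile matrix and failure labels assumed in the earlier snippets:

```python
from sklearn.model_selection import train_test_split

# First carve off the 50% training set, then halve the remainder into TestA/TestB.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.50, random_state=0, stratify=y)
X_testA, X_testB, y_testA, y_testB = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=0, stratify=y_rest)
```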

Experimental methodology cont. (skip)
- Measure used to pick the best model:
  (% misclassified failures + % misclassified successes) / 2
  - Gives extra weight to misclassification of failures
- Final LR models correctly classified ≥ 72% of failures and ≥ 91% of successes
- Linearly dependent features omitted from the fitted LR models

Experimental methodology cont. (skip)
- Cluster analysis
  - S-Plus clustering algorithm clara
    - Based on the k-medoids criterion
  - Calinski-Harabasz index plotted for 2 ≤ c ≤ 50; local maxima examined
- Visualization
  - Hierarchical MDS (HMDS) algorithm used

Manual examination of failures (skip)
- Several GCC tests often have the same source file, different optimization levels
- Such tests often fail or succeed together
- Hence, GCC failures were grouped manually based on:
  - Source file
  - Information about bug fixes
  - Date of the first version to pass the test

Manual examination cont. (skip)
- Jikes and javac failures grouped in two stages:
  1. Automatically formed clusters checked
  2. Overlapping clusters in the HMDS display checked
- Activities:
  - Debugging
  - Comparing versions
  - Examining error codes
  - Inspecting source files
  - Checking correspondence between tests and JLS sections

GCC results

Number of clusters | % size of largest group of failures in cluster with same cause | Total failures (136)
21 | 100 | 77 (57%)
1 | 83 | 6 (4%)
3 | 75, 75, 71 | 23 (17%)
1 | 60 | 5 (4%)
1 | 24 | 25 (18%)

GCC results cont.
Figure: HMDS display of GCC failure profiles after feature selection. Convex hulls indicate results of automatic clustering into 27 clusters.
Figure: HMDS display of GCC failure profiles after feature selection. Convex hulls indicate failures involving the same defect, grouped using HMDS (more accurate).

GCC results cont.
Figure: HMDS display of GCC failure profiles before feature selection. Convex hulls indicate failures involving the same defect. So feature selection helps in grouping.

javac results

Number of clusters | % size of largest group of failures in cluster with same cause | Total failures (232)
9 | 100 | 70 (30%)
5 | 88, 85, 85, 85, 83 | 64 (28%)
4 | 75, 67, 67, 57 | 49 (21%)
2 | 50, 50 | 20 (9%)
1 | 17 | 23 (10%)

javac results cont.
Figure: HMDS display of javac failures. Convex hulls indicate results of manual classification with HMDS.

Jikes results

Number of clusters | % size of largest group of failures in cluster with same cause | Total failures (225)
12 | 100 | 64 (29%)
5 | 85, 83, 80, 75, 75 | 41 (18%)
4 | 70, 67, 67, 56 | 25 (11%)
8 | 50, 50, 50, 43, 41, 33, 33, 25 | 76 (34%)

Jikes results cont.
Figure: HMDS display of Jikes failures. Convex hulls indicate results of manual classification with HMDS.

Summary of results
- In most automatically-created clusters, the majority of failures had the same cause
- A few large, non-homogeneous clusters were created
  - Sub-clusters were evident in the HMDS displays
- Automatic clustering sometimes splits groups of failures with the same cause
  - HMDS displays didn't have this problem
- Overall, failures with the same cause formed fairly cohesive clusters

Threats to validity
- Only one type of program (compilers) used in the experiments
- Hand-crafted test inputs used for profiling
- Think of Microsoft...

Related work
- χSlice [Agrawal et al.]
- Path spectra [Reps et al.]
- Tarantula [Jones et al.]
- Delta debugging [Hildebrandt & Zeller]
- Cluster filtering [Dickinson et al.]
- Clustering IDS alarms [Julisch & Dacier]

Conclusions
- Demonstrated that our classification strategy is potentially useful with compilers
- Further evaluation needed with different types of software and with failure reports from the field
- Note: The input space is huge. More accurate reporting (severity, location) could facilitate better grouping and help overcome these problems
- Note: Limited labeled data is available, and error causes/types are constantly changing (errors get debugged), so the effectiveness of learning is somewhat questionable (like following your shadow)

Future work
- Further experimental evaluation
- Use more powerful classification and clustering techniques
- Use different profiling techniques
- Extract additional diagnostic information
- Use techniques for classifying intrusions reported by anomaly detection systems