Reduce Instrumentation Predictors Using Random Forests
Presented by Bin Zhao
Department of Computer Science
University of Maryland
May 3, 2005

Motivation
• Crash reports – program information cannot be collected until the program has already crashed, which is too late
• Testing – a large number of test cases; can we focus on the failing ones?

Motivation – failure prediction
• Instrument the program to monitor its behavior
• Predict whether the program is going to fail
• Collect program data if the program is predicted to be likely to fail
• Stop running the test if the program is not likely to fail

The problem
• Large number of instrumentation predictors
• Which instrumentation predictors should be picked?

The questions to answer
• Can a good model for predicting failing runs be found based on all available data?
• Can an equally good model be created from a random selection of k% of the predictors?

Experiment
• Instrumentation on a calculator program
• 295 predictors
• Instrumentation data collected every 50 milliseconds
• 100 runs – 81 successes, 19 failures
• Predictor subset sizes: 275, 250, 225, 200, 175, 150, 125, 100, 90, 80, 70, 60, 50, 40, 30, 25, 20, 15, 10

Sample data

Pass run:

    Run  Res   Rec  0x40bda0  DataItem3  MSF-0x40bda0  MSF-DataItem3
    1    pass  1    3244      0          3244          0
    1    pass  2    3206      0          3206          0
    1    pass  3    3232      0          3232          0
    1    pass  4    3203      0          3203          0
    1    pass  5    3243      0          3243          0

Failure run:

    Run  Res   Rec  0x40bda0  DataItem3  MSF-0x40bda0  MSF-DataItem3
    10   fail  1    3200      0          3200          0
    10   fail  2    3200      0          3200          0
    10   fail  3    3251      0          3251          0
    10   fail  4    3251      0          3251          0
    10   fail  5    3248      0          3248          0

Background – Random Forests
• A random forest consists of many classification trees
• Each tree gives a classification – a vote
• The classification with the most votes is chosen

Background – Random Forests
• A training set is needed to grow the forest
• At each node, m predictors are randomly selected to split the node (the mtry parameter)
• One-third of the training data (the out-of-bag, or OOB, samples) is used to estimate the error

Background – Random Forests
• To classify a test run as pass or fail
• Sample model estimation (OOB error rate: 0.0044):

             "fail"  "pass"  class.error
    "fail"      933      17  0.0178947368421053
    "pass"        5    4045  0.00123456790123455

Background – R
• Software for data manipulation, analysis, and calculation
• Provides scripting capability
• Provides an implementation of Random Forests (used in the sketch below)

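A minimal R sketch of this workflow with the randomForest package; the data frame runs and its pass/fail factor column Res are assumed names, not the deck's actual ones:

    library(randomForest)

    set.seed(1)
    # Fit a forest on all predictors; mtry = 17 is roughly sqrt(295)
    model <- randomForest(Res ~ ., data = runs,
                          ntree = 500, mtry = 17, importance = TRUE)

    model$confusion                     # OOB confusion matrix, as in the sample estimation above
    model$err.rate[model$ntree, "OOB"]  # overall OOB error rate
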
Experiment steps
1. Determine which slice of the data to use for modeling and testing
2. Find which parameters (ntree, mtry) affect the model
3. Find the optimal parameter values for all the random models
4. Build the random models by randomly picking N predictors
5. Verify the random models by prediction

Find the good data

Influential parameters in Random Forests
• Two possible parameters – ntree and mtry
• Build models fixing either ntree or mtry and varying the other (sketched below)
• ntree: 200 – 1000
• mtry: 10 – 295
• Only mtry matters

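A sketch of that sweep, holding mtry fixed while ntree varies over the stated range (same assumed runs data frame as above):

    # Compare OOB error across forest sizes; only mtry turned out to matter
    for (nt in seq(200, 1000, by = 200)) {
      m <- randomForest(Res ~ ., data = runs, ntree = nt, mtry = 17)
      cat("ntree =", nt, " OOB error =", m$err.rate[nt, "OOB"], "\n")
    }
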
Optimal mtry
• Need to decide the optimal mtry for different numbers of predictors (N)
• The default mtry is the square root of N
• For each number of predictors (295 – 10), values from N/2 to 3N were tried (see the sketch below)

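A sketch of picking mtry by its OOB error; the candidate values are left to the caller since randomForest caps mtry at the number of predictors:

    best_mtry <- function(data, candidates) {
      oob <- sapply(candidates, function(m) {
        rf <- randomForest(Res ~ ., data = data, ntree = 500, mtry = m)
        rf$err.rate[rf$ntree, "OOB"]   # OOB error of the finished forest
      })
      candidates[which.min(oob)]       # keep the mtry with the lowest OOB error
    }
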
Random model
• Randomly pick the predictors from the full set of predictors
• Generate 5 sets of data for each number of predictors
• Use the 5 sets to build random forest models and average the results (sketched below)

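A sketch of the random-model step under the same assumed names; it draws 5 random subsets of k predictors, fits a forest on each, and averages the OOB fail-class error:

    random_model_error <- function(data, k) {
      preds <- setdiff(names(data), "Res")
      errs <- replicate(5, {
        cols <- sample(preds, k)       # one random subset of k predictors
        rf <- randomForest(x = data[, cols], y = data$Res, ntree = 500)
        rf$confusion["fail", "class.error"]
      })
      mean(errs)                       # average over the 5 subsets
    }
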
Random prediction
• For each trained random forest, run prediction on an entirely separate set of test data (records 401 – 450); a sketch follows

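A sketch of that prediction step; runs_all is an assumed name for the full record set, and the 401 – 450 slice comes from the deck:

    test <- runs_all[401:450, ]               # records never seen during training
    pred <- predict(model, newdata = test)
    mean(pred[test$Res == "fail"] != "fail")  # fail error rate
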
[Figure: Random prediction result – fail error rate (%) vs. number of predictors (0 – 300)]

Analysis of the random model
[Figure: Fail error rate (%) for experiments Exp1 – Exp5 – why not linear?]

Important predictors
• Random Forests can assign an importance to each predictor – the number of correct votes involving the predictor (see the sketch below)
• Top 20 important predictors: DataItem11, RT-DataItem11, AC-DataItem11, RT-DataItem9, AC-DataItem6, MSF-DataItem9, DataItem9, AC-DataItem9, MSF-DataItem12, AC-DataItem12, PC-DataItem11, RT-DataItem6, MSF-DataItem6, DataItem6, RT-DataItem12, MSF-DataItem11, PC-DataItem6, PC-DataItem9, DataItem12, PC-DataItem12

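A sketch of ranking predictors with the package's importance scores; note that randomForest reports accuracy-based measures rather than the raw vote count the deck describes:

    imp    <- importance(model)        # requires importance = TRUE at fit time
    ranked <- sort(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE)
    top20  <- names(ranked)[1:20]      # the 20 most important predictors
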
Top model
• Pick the top important predictors from the full set to build the model (top 100, 90, 80, 70, 60, 50, 40, 30, 25, 20, 15, 10); a sketch follows

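A sketch of rebuilding the model from only the top-k predictors, reusing the ranking from the previous sketch:

    for (k in c(100, 90, 80, 70, 60, 50, 40, 30, 25, 20, 15, 10)) {
      cols <- names(ranked)[1:k]
      rf   <- randomForest(x = runs[, cols], y = runs$Res, ntree = 500)
      cat("top", k, ": fail error =", rf$confusion["fail", "class.error"], "\n")
    }
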
[Figure: Top model prediction result – fail error rate (%) vs. number of predictors (20 – 100)]

Observation and analysis
• The fail error rate is still high (> 30%)
• Not all the runs fail at the same time
• Fail:Success = 19:81 – too few failing cases to build a good model
• Some predictors are raw, while others are derived – MSF, AC, PC, RT

Improvements
• Take the last N records of each run
• For each data set, randomly drop some pass data and duplicate the fail data (see the sketch below)
• Randomly pick the raw predictors, then include all of their derived predictors

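A sketch of the rebalancing step; dropping half the passes and doubling the fails is an illustrative ratio, not the deck's exact recipe:

    pass_rows <- which(runs$Res == "pass")
    fail_rows <- which(runs$Res == "fail")
    keep_pass <- sample(pass_rows, length(pass_rows) %/% 2)  # randomly drop pass data
    balanced  <- runs[c(keep_pass, rep(fail_rows, 2)), ]     # duplicate the fail data
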
[Figure: Improved random prediction result – fail error rate (%) vs. number of predictors (0 – 300)]

[Figure: Improved top prediction result – fail error rate (%) vs. number of predictors (20 – 100)]

Conclusions so far
• Random selection does not achieve a good error rate
• Some predictors have stronger predictive power
• A small set of important predictors can achieve a good error rate

Future work
• Why do some predictors have stronger predictive power?
• Is there any pattern among the important predictors?
• How many important predictors should we pick?
• How soon can we predict a failing run before it actually fails?

[Figure: Random model estimation result – fail error rate (%) vs. number of predictors (0 – 300)]

[Figure: Top model estimation result – fail error rate (%) vs. number of predictors (20 – 100)]

[Figure: Improved random model – fail error rate (%) vs. number of predictors (0 – 300)]