Reduce Instrumentation Predictors Using Random Forests
Presented by Bin Zhao
Department of Computer Science, University of Maryland
May 3, 2005

Motivation
- Crash reports: by the time the program crashes, it is too late to collect program information.
- Testing: there is a large number of test cases. Can we focus on the failing ones?

Motivation – failure prediction
- Instrument the program to monitor its behavior.
- Predict whether the program is going to fail.
- Collect program data if the program is predicted likely to fail.
- Stop running the test if the program is not likely to fail.

The problem
- Large number of instrumentation predictors.
- Which instrumentation predictors should be picked?

The questions to answer
- Can a good model be found for predicting failing runs based on all available data?
- Can an equally good model be created based on a random selection of k% of the predictors?

Experiment
- Instrumentation on a calculator program: 295 predictors.
- Instrumentation data collected every 50 milliseconds.
- 100 runs: 81 successes, 19 failures.
- Predictor-subset sizes tested: 275, 250, 225, 200, 175, 150, 125, 100, 90, 80, 70, 60, 50, 40, 30, 25, 20, 15, 10.

Sample data

Pass run (Run 1):
  Rec            1     2     3     4     5
  Res            pass  pass  pass  pass  pass
  0x40bda0       3244  3206  3232  3203  3243
  DataItem3      0     0     0     0     0
  MSF-0x40bda0   3244  3206  3232  3203  3243
  MSF-DataItem3  0     0     0     0     0

Failure run (Run 10):
  Rec            1     2     3     4     5
  Res            fail  fail  fail  fail  fail
  0x40bda0       3200  3200  3251  3251  3248
  DataItem3      0     0     0     0     0
  MSF-0x40bda0   3200  3200  3251  3251  3248
  MSF-DataItem3  0     0     0     0     0

Background – Random Forests
- A forest of many classification trees.
- Each tree gives a classification: a vote.
- The classification with the most votes is chosen.

Background – Random Forests
- A training set is needed to grow the forest.
- At each node, mtry predictors are randomly selected to split the node.
- About one-third of the training data (the out-of-bag, or OOB, data) is used to estimate the error.

Background – Random Forests
- Goal: classify a test run as pass or fail.
- Sample model estimation:

  OOB estimate of error rate: 0.44%

  Confusion matrix:
         fail  pass  class.error
  fail    933    17       0.0179
  pass      5  4045       0.0012

Background – R
- Software for data manipulation, analysis, and calculation.
- Provides scripting capability.
- Provides an implementation of Random Forests (see the sketch below).
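The following is a minimal R sketch (not the original script from this study) of how such a model estimation can be produced with the randomForest package. The data frame records, its pass/fail label column Res, and the input file name are assumptions made here for illustration.

    # Minimal sketch, assuming a data frame with one row per record,
    # predictor columns, and a pass/fail label column "Res" (hypothetical names).
    library(randomForest)

    records <- read.csv("instrumentation.csv")   # assumed input file
    records$Res <- factor(records$Res)           # "pass" / "fail"

    set.seed(1)
    rf <- randomForest(Res ~ ., data = records,
                       ntree = 500,                               # number of trees
                       mtry  = floor(sqrt(ncol(records) - 1)),    # default: sqrt(N)
                       importance = TRUE)

    print(rf)        # reports the OOB error rate and confusion matrix, as above
    rf$confusion     # confusion matrix with per-class error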
Experiment steps
1. Determine which slice of the data to use for modeling and testing.
2. Find which parameters (ntree, mtry) affect the model.
3. Find the optimal parameter values for all the random models.
4. Build the random models by randomly picking N predictors.
5. Verify the random models by prediction.

Find the good data

Influential parameters in Random Forests
- Two candidate parameters: ntree and mtry.
- Build models fixing one of ntree or mtry and varying the other.
- ntree: 200 – 1000; mtry: 10 – 295.
- Only mtry matters.

Optimal mtry
- Need to decide the optimal mtry for each number of predictors (N).
- The default mtry is the square root of N.
- For each number of predictors (from 295 down to 10), the search range was N/2 – 3N.

Random model
- Randomly pick predictors from the full set of predictors.
- Generate 5 data sets for each number of predictors.
- Build a random forest model on each of the 5 sets and average the results.

Random prediction
- For each trained random forest, predict on a completely different set of test data (records 401 – 450). An R sketch of this step appears at the end of the document.

[Figure: Random prediction result – fail error rate (%) vs. number of predictors]

Analysis of the random model
- Why not linear?
[Figure: fail error rate (%) across experiments Exp1 – Exp5]

Important predictors
- Random Forests can assign an importance to each predictor: the number of correct votes involving the predictor.
- Top 20 important predictors: DataItem11, RT-DataItem11, AC-DataItem11, RT-DataItem9, AC-DataItem6, MSF-DataItem9, DataItem9, AC-DataItem9, MSF-DataItem12, AC-DataItem12, PC-DataItem11, RT-DataItem6, MSF-DataItem6, DataItem6, RT-DataItem12, MSF-DataItem11, PC-DataItem6, PC-DataItem9, DataItem12, PC-DataItem12.

Top model
- Pick the most important predictors from the full set to build the model (top 100, 90, 80, 70, 60, 50, 40, 30, 25, 20, 15, 10).

[Figure: Top model prediction result – fail error rate (%) vs. number of predictors]

Observation and analysis
- The fail error rate is still high (> 30%).
- Not all the runs fail at the same time.
- Fail : Success = 19 : 81 (too few fail cases to build a good model).
- Some predictors are raw, while others are derived: MSF, AC, PC, RT.

Improvements
- Use only the last N records of each run.
- For a given data set, randomly drop some pass data and duplicate the fail data.
- Randomly pick raw predictors, then include all of their derived predictors.

[Figure: Improved random prediction result – fail error rate (%) vs. number of predictors]

[Figure: Improved top prediction result – fail error rate (%) vs. number of predictors]

Conclusions so far
- Random selection does not achieve a good error rate.
- Some predictors have stronger prediction power than others.
- A small set of important predictors can achieve a good error rate.

Future work
- Why do some predictors have stronger prediction power?
- Is there a pattern to the important predictors?
- How many important predictors should we pick?
- How soon can we predict a failing run before it actually fails?

[Figure: Random model estimation result – fail error rate (%) vs. number of predictors]

[Figure: Top model estimation result – fail error rate (%) vs. number of predictors]

[Figure: Improved random model – fail error rate (%) vs. number of predictors]
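As a closing illustration of experiment steps 4 and 5 (build a random model from a randomly picked predictor subset, then verify it by prediction), here is a minimal R sketch. It is not the original script: the data frame records, the label column Res, the 1–400 / 401–450 split, and the subset size k = 50 are assumptions for illustration.

    # Minimal sketch of the random-subset experiment (hypothetical names throughout).
    library(randomForest)

    records <- read.csv("instrumentation.csv")          # same assumed data as in the earlier sketch
    records$Res <- factor(records$Res)

    predictors <- setdiff(names(records), "Res")
    k <- 50                                             # size of the random predictor subset
    set.seed(2)
    picked <- sample(predictors, k)

    train <- records[1:400,   c(picked, "Res")]         # modeling slice (assumed)
    test  <- records[401:450, c(picked, "Res")]         # records 401-450 used for prediction

    rf <- randomForest(Res ~ ., data = train,
                       mtry = floor(sqrt(k)),           # tuned per predictor count in the study
                       importance = TRUE)

    pred <- predict(rf, newdata = test)
    fail_error <- mean(pred[test$Res == "fail"] != "fail")   # fraction of fail records missed
    fail_error

    # Predictor importance, the basis for the "top" models above.
    imp <- importance(rf)
    head(imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ], 20)

In the study, five such random subsets were drawn for each subset size and the resulting fail error rates were averaged.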