Random Forest
Applied Multivariate Statistics – Spring 2012

Overview
• Intuition of Random Forest
• The Random Forest Algorithm
• De-correlation gives better accuracy
• Out-of-bag error (OOB error)
• Variable importance

Intuition of Random Forest
[Figure: three different classification trees, built from splits on age, height, sex, and work status. A new sample (old, retired, male, short) is dropped down each tree.]
• Tree predictions: diseased, healthy, diseased
• Majority rule: diseased

The Random Forest Algorithm
[Figure: the Random Forest algorithm in pseudo-code.]

Differences to a standard tree
• Train each tree on a bootstrap resample of the data (bootstrap resample of a data set with N samples: make a new data set by drawing N samples with replacement; some samples will probably occur multiple times in the new data set)
• At each split, consider only m randomly selected variables
• Don't prune
• Fit B trees in this way and use averaging (regression) or majority voting (classification) to aggregate the results

Why Random Forest works 1/2
• Mean Squared Error = Variance + Bias²
• If the trees are sufficiently deep, they have very small bias
• How could we improve the variance over that of a single tree?

Why Random Forest works 2/2
For B identically distributed trees, each with variance σ² and pairwise correlation ρ, the variance of the averaged prediction is

  Var( (1/B) Σ_{b=1}^{B} T_b ) = ρσ² + ((1 − ρ)/B) σ²

• The first term decreases if ρ decreases, i.e., if m decreases: de-correlation gives better accuracy
• The second term decreases if the number of trees B increases (irrespective of ρ)

Estimating generalization error: out-of-bag (OOB) error
• Similar to leave-one-out cross-validation, but almost without any additional computational burden
• The OOB error is a random number, since it is based on random resamples of the data
[Worked example: from a small data set of age/height samples labeled healthy or diseased, a bootstrap resample is drawn; the samples that were never drawn form the out-of-bag set. The tree grown on the resample predicts these OOB samples; here one in four OOB predictions is wrong, so the OOB error rate is ¼ = 0.25.]
(A runnable sketch of this scheme follows the runtime comparison below.)

Variable importance for variable i, using permutations
• From the data, draw m bootstrap resamples; resample j yields tree j and its OOB data set j
• Compute the OOB error e_j of tree j
• Permute the values of variable i in OOB data set j and recompute the OOB error, p_j
• Set d_j = e_j − p_j and aggregate over the m trees:

  d̄ = (1/m) Σ_{j=1}^{m} d_j,   s_d² = (1/(m − 1)) Σ_{j=1}^{m} (d_j − d̄)²,   v_i = d̄ / s_d

(A sketch of a simplified version also follows the runtime comparison below.)

Trees vs. Random Forest
Trees:
+ Yield insight into decision rules
+ Rather fast
+ Easy to tune parameters
− Tree predictions tend to have a high variance

Random Forest:
+ Smaller prediction variance and therefore usually better generalization performance
+ Easy to tune parameters
− Rather slow
− "Black box": rather hard to get insight into the decision rules

Comparing runtime (just for illustration)
• Up to "thousands" of variables
• Problematic if there are categorical predictors with many levels (max: 32 levels)
[Figure: runtime of RF vs. a single tree, with the first predictor cut into 15 levels.]
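To make the bootstrap and OOB bookkeeping concrete, here is a minimal R sketch on the built-in iris data using rpart trees. It implements only the bagging core of the algorithm: rpart considers all variables at every split, so the per-split sampling of m variables that distinguishes a true random forest is omitted, and B = 100 is an illustrative choice.

library(rpart)

set.seed(1)
N <- nrow(iris)
B <- 100                                   # number of trees (illustrative)
votes <- matrix(NA_character_, N, B)       # OOB predictions, one column per tree

for (b in 1:B) {
  idx <- sample(N, N, replace = TRUE)      # bootstrap resample, with replacement
  oob <- setdiff(1:N, idx)                 # samples never drawn are out of bag
  fit <- rpart(Species ~ ., data = iris[idx, ],
               control = rpart.control(cp = 0, minsplit = 2))  # deep tree, no pruning
  votes[oob, b] <- as.character(predict(fit, iris[oob, ], type = "class"))
}

## Majority vote over the trees for which each sample was out of bag
## (with B = 100, every sample is OOB for some tree with overwhelming probability)
maj <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(maj != iris$Species)                  # OOB error estimate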
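And a hedged sketch of permutation importance in the same spirit. For simplicity it permutes each variable in a single hold-out set rather than per tree in each OOB set as on the variable-importance slide, and it reports the error increase p − e (the slide's d_j = e_j − p_j is the same quantity with the sign flipped); the train/test split and data set are illustrative.

library(randomForest)

set.seed(1)
train <- sample(nrow(iris), 100)
fit   <- randomForest(Species ~ ., data = iris[train, ])
test  <- iris[-train, ]
e <- mean(predict(fit, test) != test$Species)     # baseline error on hold-out set

imp <- sapply(names(iris)[1:4], function(v) {
  perm      <- test
  perm[[v]] <- sample(perm[[v]])                  # permute variable v
  mean(predict(fit, perm) != perm$Species) - e    # error increase after permuting
})
sort(imp, decreasing = TRUE)                      # larger increase = more important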
RF vs. LDA

RF:
+ Can model nonlinear class boundaries
+ OOB error "for free" (no CV needed)
+ Works on continuous and categorical responses (regression / classification)
+ Gives variable importance
+ Very good performance
− "Black box"
− Slow

LDA:
+ Very fast
+ Discriminants for visualizing group separation
+ The decision rule can be read off directly
− Can model only linear class boundaries
− Mediocre performance
− No variable selection
− Works only on a categorical response
− Needs CV for estimating the prediction error

[Figure: scatter plots of two groups, illustrating a nonlinear RF boundary vs. a linear LDA boundary.]

Concepts to know
• The idea of Random Forest and how it reduces the prediction variance of trees
• OOB error
• Variable importance based on permutations

R functions to know
• randomForest and varImpPlot from the package randomForest
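A minimal usage sketch of these two functions, again on the built-in iris data; the ntree and mtry values are illustrative choices, not recommendations.

library(randomForest)

set.seed(1)
fit <- randomForest(Species ~ ., data = iris,
                    ntree = 500,          # B: number of trees
                    mtry  = 2,            # m: variables considered at each split
                    importance = TRUE)    # compute permutation importance

print(fit)         # shows the OOB error estimate and confusion matrix
importance(fit)    # permutation importance (MeanDecreaseAccuracy column)
varImpPlot(fit)    # plot the variable importances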