Random Forest

Random Forest Applied Multivariate Statistics – Spring 2012 Overview  Intuition of Random Forest  The Random Forest Algorithm  De-correlation gives better accuracy  Out-of-bag error (OOB-error)  Variable importance Healthy Diseased Healthy Diseased Diseased 1 Intuition of Random Forest Tree 2 Tree 1 young old young old diseased healthy diseased healthy male tall female short healthy healthy healthy diseased Tree 3 New sample: retired working healthy healthy old, retired, male, short Tree predictions: diseased, healthy, diseased tall Majority rule: diseased healthy short diseased 2 The Random Forest Algorithm 3 Differences to standard tree  Train each tree on bootstrap resample of data (Bootstrap resample of data set with N samples: Make new data set by drawing with replacement N samples; i.e., some samples will probably occur multiple times in new data set)  For each split, consider only m randomly selected variables  Don’t prune  Fit B trees in such a way and use average or majority voting to aggregate results 4 Why Random Forest works 1/2  Mean Squared Error = Variance + Bias2  If trees are sufficiently deep, they have very small bias  How could we improve the variance over that of a single tree? 5 Why Random Forest works 2/2 i=j Decreaes, if 𝜌 decreases, i.e., if De-correlation gives better accuracy m decreases Decreases, if number of trees B increases (irrespective of 𝜌) 6 Estimating generalization error: Out-of bag (OOB) error  Similar to leave-one-out cross-validation, but almost without any additional computational burden  OOB error is a random number, since based on random resamples of the data Data: Resampled Data: Out of bag samples: old, tall – healthy old, tall – healthy young, short – diseased old, short – diseased old, short – diseased young, tall – healthy young, tall – healthy young, short – diseased young, tall – healthy young, short – healthy young, tall – healthy old, short – diseased young, short – healthy young, tall – healthy young old old, short– diseased diseased tall healthy healthy short diseased Out of bag (OOB) error rate: ¼ = 0.25 Variable Importance for variable i Data using Permutations Resampled Resampled Dataset 1 Dataset m OOB … Data 1 OOB Data m Permute values of variable i in OOB Tree 1 Tree m data set OOB error e1 OOB error em d1 = e1–p1 dm =em-pm OOB error pm OOB error p1 d= s2d = 1 m Pm 1 m¡1 i=1 di Pm i=1 (di 2 ¡ d) vi = d sd 8 Trees vs. Random Forest + Trees yield insight into decision rules + Rather fast + Easy to tune parameters + RF as smaller prediction variance and therefore usually a better general performance + Easy to tune parameters - Prediction of trees tend to have a high variance - Rather slow - “Black Box”: Rather hard to get insights into decision rules 9 Comparing runtime (just for illustration) • Up to “thousands” of variables • Problematic if there are categorical predictors with many levels (max: 32 levels) RF: First predictor cut into 15 levels RF Tree 10 RF vs. + Can model nonlinear class boundaries + OOB error “for free” (no CV needed) + Works on continuous and categorical responses (regression / classification) + Gives variable importance + Very good performance x - “Black box” - Slow xx x x x x xx x x x x LDA + Very fast + Discriminants for visualizing group separation + Can read off decision rule - Can model only linear class boundaries - Mediocre performance - No variable selection - Only on categorical response - Needs CV for estimating prediction error x x x x x x x x x x x 11 Concepts to know  Idea of Random Forest and how it reduces the prediction variance of trees  OOB error  Variable Importance based on Permutation 12 R functions to know  Function “randomForest” and “varImpPlot” from package “randomForest” 13

Random Forest

Related documents

Products

Support

Random Forest

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib