Random Forest

Applied Multivariate Statistics – Spring 2012
Overview
 Intuition of Random Forest
 The Random Forest Algorithm
 De-correlation gives better accuracy
 Out-of-bag error (OOB-error)
 Variable importance
Intuition of Random Forest
[Figure: three decision trees grown for the same classification problem, splitting on age (young/old), sex (male/female), height (tall/short), and employment (retired/working); each leaf predicts healthy or diseased.]

New sample: old, retired, male, short
Tree predictions: diseased, healthy, diseased
Majority rule: diseased
The Random Forest Algorithm
Differences from a standard tree
 Train each tree on a bootstrap resample of the data
(bootstrap resample of a data set with N samples: make a new data set by drawing N samples with replacement; some samples will occur multiple times in the new data set)
 For each split, consider only m randomly selected variables
 Don't prune
 Fit B trees in this way and aggregate their results by averaging (regression) or majority voting (classification); see the sketch below
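A minimal sketch in R of fitting such a forest, assuming the randomForest package; the iris data and all parameter values are purely illustrative:

# Fit a random forest: B = ntree bootstrap trees, m = mtry variables
# considered at each split, no pruning, majority vote for prediction.
library(randomForest)

set.seed(1)
fit <- randomForest(Species ~ ., data = iris,
                    ntree = 500,  # B: number of trees
                    mtry  = 2)    # m: variables tried at each split
print(fit)                  # OOB error estimate and confusion matrix
predict(fit, iris[1:3, ])   # class by majority vote over the trees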
Why Random Forest works 1/2
 Mean Squared Error = Variance + Bias²
 If trees are sufficiently deep, they have very small bias
 How could we improve the variance over that of a single
tree?
Why Random Forest works 2/2
For B identically distributed trees $T_1, \dots, T_B$, each with variance $\sigma^2$ and pairwise correlation $\rho$, the variance of their average is

$$\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2$$

 The first term decreases if $\rho$ decreases, i.e., if m decreases: de-correlation gives better accuracy
 The second term decreases as the number of trees B increases (irrespective of $\rho$)
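A tiny numeric illustration of the formula; the values of rho, B, and sigma2 are arbitrary:

# Variance of the average of B correlated trees:
# rho * sigma2 + (1 - rho) / B * sigma2
var_avg <- function(rho, B, sigma2 = 1) {
  rho * sigma2 + (1 - rho) / B * sigma2
}

var_avg(rho = 0.5, B = 1)    # 1.000: a single tree
var_avg(rho = 0.5, B = 100)  # 0.505: large B shrinks the second term
var_avg(rho = 0.1, B = 100)  # 0.109: smaller rho (smaller m) lowers the floor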
Estimating generalization error:
Out-of-bag (OOB) error
 Similar to leave-one-out cross-validation, but almost
without any additional computational burden
 OOB error is a random number, since it is based on random resamples of the data
Data: old, tall – healthy; old, short – diseased; young, tall – healthy; young, short – diseased; young, short – healthy
Resampled data: N samples drawn from the data with replacement (some occur several times)
Out-of-bag samples: the data points that were not drawn

[Figure: a tree grown on the resampled data, splitting on age (young/old) and height (tall/short), is applied to the out-of-bag samples and misclassifies one of four.]

Out-of-bag (OOB) error rate: ¼ = 0.25
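In R, the OOB error can be read directly off a fitted randomForest object; a minimal sketch (iris data purely for illustration):

# The OOB error comes with the fit itself; no cross-validation needed.
library(randomForest)

set.seed(1)
fit <- randomForest(Species ~ ., data = iris, ntree = 500)
# err.rate has one row per forest size 1..ntree; the "OOB" column holds
# the out-of-bag error, so the last row is the error of the full forest.
fit$err.rate[500, "OOB"]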
Variable importance for variable i using permutations
 From the data, draw resampled data sets 1, …, m and grow one tree on each; the samples not drawn form the corresponding OOB data sets 1, …, m
 For tree j, compute the OOB error e_j on its OOB data set
 Permute the values of variable i in the OOB data set and recompute the OOB error p_j
 Take the difference d_j = p_j − e_j (permuting an informative variable increases the OOB error, so important variables get large positive d_j)

$$\bar{d} = \frac{1}{m}\sum_{j=1}^{m} d_j, \qquad s_d^2 = \frac{1}{m-1}\sum_{j=1}^{m}\left(d_j - \bar{d}\right)^2, \qquad v_i = \frac{\bar{d}}{s_d}$$
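The randomForest package computes this permutation-based measure when importance = TRUE is set; a minimal sketch (data set and parameters are illustrative):

# Permutation importance: type = 1 returns the mean decrease in accuracy,
# by default scaled by its standard error, analogous to v_i above.
library(randomForest)

set.seed(1)
fit <- randomForest(Species ~ ., data = iris, ntree = 500,
                    importance = TRUE)
importance(fit, type = 1)   # one value per predictor variable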
Trees vs. Random Forest

Trees:
+ Yield insight into decision rules
+ Rather fast
+ Easy to tune parameters
- Predictions tend to have a high variance

Random Forest:
+ Has smaller prediction variance and therefore usually a better general performance
+ Easy to tune parameters
- Rather slow
- "Black box": rather hard to get insight into the decision rules
Comparing runtime
(just for illustration)
• Up to “thousands” of variables
• Problematic if there are categorical predictors with many levels (max: 32 levels)
[Figure: runtime comparison of RF vs. a single tree; for RF the first predictor is cut into 15 levels.]
RF vs. LDA

Random Forest:
+ Can model nonlinear class boundaries
+ OOB error "for free" (no CV needed)
+ Works on continuous and categorical responses (regression / classification)
+ Gives variable importance
+ Very good performance
- "Black box"
- Slow

LDA:
+ Very fast
+ Discriminants for visualizing group separation
+ Can read off the decision rule
- Can model only linear class boundaries
- Mediocre performance
- No variable selection
- Only for categorical response
- Needs CV for estimating prediction error
Concepts to know
 Idea of Random Forest and how it reduces the prediction
variance of trees
 OOB error
 Variable Importance based on Permutation
R functions to know
 Functions "randomForest" and "varImpPlot" from package "randomForest"
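A minimal usage sketch of the two functions (iris data purely for illustration):

# Fit a forest with importance = TRUE, then plot the importance measures.
library(randomForest)

set.seed(1)
fit <- randomForest(Species ~ ., data = iris, importance = TRUE)
varImpPlot(fit)   # dot chart: mean decrease in accuracy / Gini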