
Random Forests
Stat 557
Heike Hofmann
Outline
• Growing Random Forests
• Parameters
• Results
• (Neural Networks)
Random Forests
• Breiman (2001), Breiman & Cutler (2004)
• Tree Ensemble built by randomly sampling
cases and variables
• Each case classified once for each tree in
the ensemble
How do Random Forests work?
• Large number (at least 500) of ‘different’
trees is grown
• Each tree gives a classification for each
record, i.e. the tree "votes" for that class.
• The forest determines the overall
classification for each record by a majority
vote.
Growing a Random
Forest
for sample size N and M explanatory variables X1, ..., XM
• draw bootstrap sample of data (i.e. draw
sample of size N with replacement)
• at each node, select m << M variables at
random and find best split.
• each tree is grown to the largest extent
possible, i.e. no pruning!
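A minimal R sketch of the two randomization steps only (the data frame dat, with the class label assumed in its first column, and the value of m are illustrative; growing the actual tree is left to a tree learner):
N <- nrow(dat)                    # number of cases
M <- ncol(dat) - 1                # number of explanatory variables (class label in column 1)
m <- floor(sqrt(M))               # typical choice of m << M for classification
# (1) bootstrap sample of the cases: size N, drawn with replacement
boot.idx <- sample(N, size = N, replace = TRUE)
boot.dat <- dat[boot.idx, ]
oob.dat  <- dat[-unique(boot.idx), ]     # cases never drawn are "out of bag"
# (2) at each node, only m randomly chosen variables compete for the split
cand <- sample(names(dat)[-1], size = m)
# ... find the best split among 'cand', recurse; grow the tree fully, no pruning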
randomForest
package
randomForest(x, y=NULL, xtest=NULL, ytest=NULL, ntree=500,
mtry=if (!is.null(y) && !is.factor(y))
max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
replace=TRUE, classwt=NULL, cutoff, strata,
sampsize = if (replace) nrow(x) else ceiling(.632*nrow(x)),
nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
maxnodes = NULL,
importance=FALSE, localImp=FALSE, nPerm=1,
proximity, oob.prox=proximity,
norm.votes=TRUE, do.trace=FALSE,
keep.forest=!is.null(y) && is.null(xtest), corr.bias=FALSE,
keep.inbag=FALSE, ...)
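A minimal usage sketch (the iris data and settings here are illustrative, not from the slides):
library(randomForest)
set.seed(557)
rf <- randomForest(Species ~ ., data = iris,
                   ntree = 500, importance = TRUE)   # 500 trees, keep importance scores
print(rf)                # OOB error rate and confusion matrix
table(predict(rf))       # counts of (OOB) predicted classes, cf. the Results plot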
Results
[Bar chart: counts of predicted classes predict(rf) for the letters A-Z]
Misclassification
[Plot: predicted vs. true class for the letters A-Z; off-diagonal points mark misclassifications]
Forest Error
• Increasing correlation between any two trees increases the forest error rate.
• Trees with low individual error rates are stronger classifiers. Increasing the strength of the individual trees decreases the overall forest error rate.
• Decreasing m reduces both correlation and strength; the "optimal" range of m is usually quite wide.
Out of bag (oob) error
• Slight modification to the bootstrap samples:
• for each tree, the bootstrap sample of size N leaves about 1/3 of the data out of the sample (the "out-of-bag" cases).
• use the out-of-bag data to get a (running) unbiased estimate of the classification error as each tree is added to the forest.
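In R the running OOB estimate is stored in the fitted object; a short sketch (assuming a classification forest rf as above):
# rf$err.rate has one row per tree: column "OOB" is the running OOB error,
# the remaining columns are the per-class error rates
head(rf$err.rate)
plot(rf$err.rate[, "OOB"], type = "l",
     xlab = "number of trees", ylab = "running OOB error rate")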
running oob error-rate
[Plots: running OOB error rate rf$err.rate[, 1] against the number of trees, for forests of 500 and 1000 trees]
Class errors
• OOB classification allows error rates to be assessed for each class
[Plot: class-specific OOB error rates rf$err.rate[500, ] for the letters A-Z, together with the overall OOB error]
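A sketch of how to read these numbers off the fitted forest (again assuming the classification forest rf from above):
rf$err.rate[nrow(rf$err.rate), ]   # overall OOB error plus per-class errors after the last tree
rf$confusion                       # confusion matrix with class-wise error in the last column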
optimal choice of m
• based on oob error:
[Plot: OOB error rate against m, for m = 2, ..., 10]
sqrt(M) works well in most cases
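The randomForest package also provides tuneRF(), which searches over m guided by the OOB error; a minimal sketch (iris again as a stand-in data set):
set.seed(557)
tuneRF(x = iris[, -5], y = iris$Species,
       ntreeTry = 500,       # trees grown for each candidate m
       stepFactor = 2,       # multiply/divide m by this factor at each step
       improve = 0.01)       # minimum relative improvement in OOB error to continue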
Variable Importance
Permutation Criterion:
• based on the out-of-bag data
• for each tree, count the # of correctly classified oob records
• permute the values of one variable in the oob data, re-count the correctly classified oob records, and subtract from the first count
• for each variable, average this difference over all trees
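With the randomForest package this criterion is reported by importance() and varImpPlot() when the forest was grown with importance = TRUE; a brief sketch:
importance(rf, type = 1)   # type = 1: permutation importance (mean decrease in accuracy)
varImpPlot(rf)             # dot plots of variable importance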
Importance of Variables
[Plot: variable importance (value) for variables V2-V17 and ID, by class A-Z]
Mean Decrease Accuracy
[Plot: MeanDecreaseAccuracy for each variable, ordered by reorder(ID, MeanDecreaseAccuracy)]
Proximity
• N x N matrix of proximity values
• for each tree: if two records k and l end up in the same leaf, increase their proximity by one
• normalize the proximities by dividing by the number of trees
• the N x N size becomes problematic for large data sets
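Proximities are only computed on request; a short sketch (iris as a stand-in data set):
rf <- randomForest(Species ~ ., data = iris, proximity = TRUE)
dim(rf$proximity)            # N x N matrix of proximities
MDSplot(rf, iris$Species)    # 2-d scaling of 1 - proximity, colored by class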
Neural Networks
• Historically used to model (biological)
networks of neurons:
- nodes represent neurons
- edges represent nerves
- network illustrates activity and flow of
signals
Setup
• Response Y has K categories
• Network:
[Diagram: single-hidden-layer network with input layer X1, ..., Xi, ..., Xp; hidden layer with M units Z1, Z2, ..., ZM; output layer Y1, Y2, ..., YK]
Formula Setup
• Relationship between layers:
  Zm = σ(α0m + αm' X)      m = 1, ..., M
  Tk = β0k + βk' Z         k = 1, ..., K
  fk(X) = gk(T)            k = 1, ..., K
• where σ is the activation function, e.g. the sigmoid σ(ν) = 1 / (1 + e^(−ν))
• gk is the final transformation between T and Y
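A small forward-pass sketch in R, with randomly initialized weights and illustrative dimensions (the softmax used here as gk is defined on the next slide):
set.seed(1)
p <- 4; M <- 3; K <- 2                     # inputs, hidden units, classes
X      <- rnorm(p)                          # one input record
alpha0 <- rnorm(M); alpha <- matrix(rnorm(M * p), M, p)
beta0  <- rnorm(K); beta  <- matrix(rnorm(K * M), K, M)
sigma <- function(nu) 1 / (1 + exp(-nu))    # activation function
Z    <- sigma(alpha0 + alpha %*% X)         # Zm = sigma(alpha0m + alpham' X)
Tout <- beta0 + beta %*% Z                  # Tk = beta0k + betak' Z
fX   <- exp(Tout) / sum(exp(Tout))          # fk(X) = gk(T), softmax output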
Formula Setup
• g is usually chosen as the identity for a continuous response and as the softmax for a categorical response:
  gk(T) = e^(Tk) / Σℓ=1..K e^(Tℓ)
• i.e. the estimates are positive and sum to 1
Issues with Neural Nets
• Model is generally highly over-parametrized:
  weights {α0m, αm : m = 1, ..., M}: M(p + 1) parameters
  weights {β0k, βk : k = 1, ..., K}: K(M + 1) parameters
• Optimization problem is non-convex & unstable -> convergence is tricky
• Over-parametrization leads to overfitting at the minimum
Fitting Strategies
• Standardize input variables X
• Pick starting values for alpha, beta close to
zero (i.e. close to linear fit)
• Stop run before convergence (to avoid
overfitting)
• Alternatively: use a penalty on the size of the weights (decay):
  λ · (Σ β² + Σ α²)
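In R these strategies map directly onto arguments of nnet() from the nnet package; a sketch under assumed settings (iris as a stand-in):
library(nnet)
dat <- data.frame(scale(iris[, 1:4]), Species = iris$Species)  # standardized inputs
fit <- nnet(Species ~ ., data = dat,
            size  = 5,       # M hidden units
            rang  = 0.1,     # small random starting weights near zero
            decay = 1e-3,    # weight-decay penalty lambda
            maxit = 100)     # stop before full convergence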
Fitting Strategies
• Pick a large number of hidden units OR do cross-validation to figure out a good size
• # parameters (and with it # units) is bounded by the sample size
• average results from a set of networks (bagging)
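A sketch of the bagging idea for networks, reusing the nnet setup above (bootstrap samples, fresh random starts, averaged class probabilities; all settings are illustrative):
B <- 25
probs <- replicate(B, {
  idx <- sample(nrow(dat), replace = TRUE)                 # bootstrap sample
  fit <- nnet(Species ~ ., data = dat[idx, ], size = 5,
              decay = 1e-3, maxit = 100, trace = FALSE)
  predict(fit, newdata = dat, type = "raw")                # class probabilities
})
avg  <- apply(probs, c(1, 2), mean)                        # average over the B networks
pred <- colnames(avg)[max.col(avg)]                        # averaged-probability class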
Neural Networks are fickle
• Choice of M is important
• Starting parameters are important: some models do not even come close to a good solution in 100 iterations
[Plot: classification error (err) against the number of hidden units M, M = 6, ..., 14]