Random Forests
Stat 557
Heike Hofmann

Outline
• Growing Random Forests
• Parameters
• Results
• (Neural Networks)

Random Forests
• Breiman (2001), Breiman & Cutler (2004)
• tree ensemble built by randomly sampling cases and variables
• each case is classified once by each tree in the ensemble

How do Random Forests work?
• A large number (at least 500) of 'different' trees is grown.
• Each tree gives a classification for each record, i.e. the tree "votes" for that class.
• The forest determines the overall classification for each record by a majority vote.

Growing a Random Forest
For sample size N and M explanatory variables X1, ..., XM:
• draw a bootstrap sample of the data (i.e. draw a sample of size N with replacement)
• at each node, select m << M variables at random and find the best split among them
• grow each tree to the largest extent possible, i.e. no pruning!

randomForest package
randomForest(x, y=NULL, xtest=NULL, ytest=NULL, ntree=500,
             mtry=if (!is.null(y) && !is.factor(y))
                    max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
             replace=TRUE, classwt=NULL, cutoff, strata,
             sampsize = if (replace) nrow(x) else ceiling(.632*nrow(x)),
             nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
             maxnodes = NULL, importance=FALSE, localImp=FALSE, nPerm=1,
             proximity, oob.prox=proximity, norm.votes=TRUE, do.trace=FALSE,
             keep.forest=!is.null(y) && is.null(xtest), corr.bias=FALSE,
             keep.inbag=FALSE, ...)

Results
[Figure: bar chart of prediction counts, predict(rf), for classes A-Z; misclassification plot of predicted class (y) against observed class (x).]

Forest Error
• Increasing the correlation between any two trees increases the forest error rate.
• Trees with low individual error rates are stronger classifiers. Increasing the strength of the individual trees decreases the overall forest error rate.
• Decreasing m reduces both correlation and strength; the "optimal" range of m is usually quite wide.

Out-of-bag (oob) error
• Slight modification to the bootstrap samples:
• for each tree, the bootstrap sample of size N leaves about 1/3 of the records out of the sample ("out of bag").
• use the out-of-bag data to get a (running) unbiased estimate of the classification error as each tree is added to the forest.
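To make the workflow concrete, here is a minimal sketch with the randomForest package in R. The slides' plots come from a letter-classification data set; the built-in iris data is used below purely as a stand-in, and the object name rf is illustrative.

library(randomForest)

set.seed(557)

# grow 500 trees; for classification mtry defaults to floor(sqrt(ncol(x))),
# importance = TRUE stores the permutation importance measures used later
rf <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE)

# out-of-bag (oob) error rate after all 500 trees, overall and per class
rf$err.rate[500, ]

# confusion matrix based on the oob predictions
rf$confusion

# majority-vote (oob) prediction for each record
head(predict(rf))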
Running oob error rate
[Figure: running oob error rate, rf$err.rate[, 1], plotted against the number of trees for forests of 500 and 1000 trees.]

Class errors
• oob classification allows us to assess the error rate for each class
[Figure: per-class oob error rates, rf$err.rate[500, ], for classes A-Z together with the overall OOB error.]

Optimal choice of m
• based on the oob error:
[Figure: oob error rate against m = 2, ..., 10.]
• sqrt(M) works well in most cases

Variable Importance
Permutation criterion:
• based on the out-of-bag data
• for each tree, count the # of correctly classified oob records
• permute the values of variable m, re-count the correctly classified oob records, and subtract this count from the first
• for each variable, average the difference over all trees

Importance of Variables
[Figure: permutation importance of variables V2-V17, broken down by class A-Z.]

Mean Decrease Accuracy
[Figure: variables V2-V17 ordered by MeanDecreaseAccuracy.]

Proximity
• N x N matrix of proximity values
• for each tree: if two records k and l end up in the same leaf, increase their proximity by one
• normalize the proximities by dividing by the number of trees
• the size of the matrix is problematic for large N

Neural Networks
• Historically used to model (biological) networks of neurons:
  - nodes represent neurons
  - edges represent nerves
  - the network illustrates activity and flow of signals

Setup
• Response Y has K categories
• Network: input layer X1, ..., Xp; hidden layer with M units Z1, ..., ZM; output layer Y1, ..., YK

Formula Setup
• Relationship between layers:
  Zm = σ(α0m + αm' X),  m = 1, ..., M
  Tk = β0k + βk' Z,     k = 1, ..., K
  fk(X) = gk(T),        k = 1, ..., K
• σ is the activation function, e.g. the sigmoid σ(ν) = 1 / (1 + e^(-ν))
• g is the final transformation between T and Y

Formula Setup
• With a continuous response, g is usually chosen as the identity; with a categorical response, usually the softmax:
  gk(T) = e^(Tk) / Σℓ=1..K e^(Tℓ)
• i.e. the estimates are positive and sum to 1

Neural Nets Issues
• Model is generally highly over-parametrized. Weights:
  {α0m, αm : m = 1, ..., M}   M(p + 1)
  {β0k, βk : k = 1, ..., K}   K(M + 1)
• Optimization problem is not convex & unstable -> convergence is tricky
• Over-parametrization leads to overfitting at the minimum

Fitting Strategies
• Standardize the input variables X
• Pick starting values for α, β close to zero (i.e. close to a linear fit)
• Stop the run before convergence (to avoid overfitting)
• Alternatively: use a penalty on the size of the weights (decay):
  λ · (Σ β² + Σ α²)
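A minimal sketch of these strategies, assuming the nnet package (the slides do not name the software): inputs standardized, starting weights kept small, a decay penalty, and a capped number of iterations. The iris data again stands in for the lecture's data.

library(nnet)

set.seed(557)

# standardize the inputs
scaled <- iris
scaled[, 1:4] <- scale(scaled[, 1:4])

# one hidden layer with 5 units; rang keeps the starting weights near zero,
# decay is the penalty lambda on the squared weights, maxit caps the number
# of iterations so the fit stops well before full convergence
fit <- nnet(Species ~ ., data = scaled, size = 5,
            rang = 0.1, decay = 0.01, maxit = 100, trace = FALSE)

# softmax output: fitted class probabilities are positive and sum to 1
head(predict(fit))
rowSums(head(predict(fit)))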
Fitting Strategies
• Pick a large number of hidden units, OR do cross-validation to figure out a good size (and with it the # of units)
• # of parameters is bounded by the sample size
• average the results from a set of networks (bagging)

Neural Networks are fickle
• the choice of M is important
• the starting parameters are important
• some models do not even come close to a good solution in 100 iterations
[Figure: error rates against the number of hidden units M = 6, ..., 14.]
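Since M has to be chosen and the fit depends on the random start, here is a short sketch of cross-validating over the hidden-layer size, again assuming nnet and the iris stand-in; cv_error is a hypothetical helper, not part of any package.

library(nnet)

set.seed(557)

# hypothetical helper: 5-fold cross-validated misclassification rate of a
# single-hidden-layer network with M units
cv_error <- function(M, data, folds = 5) {
  idx <- sample(rep(1:folds, length.out = nrow(data)))
  errs <- sapply(1:folds, function(f) {
    fit <- nnet(Species ~ ., data = data[idx != f, ],
                size = M, decay = 0.01, maxit = 200, trace = FALSE)
    pred <- predict(fit, newdata = data[idx == f, ], type = "class")
    mean(pred != data$Species[idx == f])
  })
  mean(errs)
}

scaled <- iris
scaled[, 1:4] <- scale(scaled[, 1:4])

# refit a few times per M: results depend heavily on the random starting
# weights, so a single run per size can be misleading
sizes <- 2:10
err <- sapply(sizes, function(M) mean(replicate(3, cv_error(M, scaled))))
cbind(M = sizes, err = err)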