11 – Tree-based Models for Classification Problems

11.1 – Classification Trees

In classification trees the goal is to build a tree-based model that will classify observations into one of g predetermined classes. The end result of a tree model can be viewed as a series of conditional probabilities (posterior probabilities) of class membership given a set of covariate values. For each terminal node we essentially have a probability distribution for class membership, where the probabilities are of the form

$$P(\text{class } i \mid \mathbf{x} \in N_A) \qquad \text{such that} \qquad \sum_{i=1}^{g} P(\text{class } i \mid \mathbf{x} \in N_A) = 1 .$$

Here, $N_A$ is a neighborhood defined by a set of covariates/predictors $\mathbf{x}$. The neighborhoods are found by a series of binary splits chosen to minimize the overall "loss" of the resulting tree.

For classification problems, measuring overall "loss" can be a bit complicated. One obvious method is to construct classification trees so that the overall misclassification rate is minimized; in fact, this is precisely what the RPART algorithm does by default. However, in classification problems it is often the case that we wish to incorporate prior knowledge about likely class membership. This knowledge is represented by prior probabilities of an observation being from class i, which we will denote by $\pi_i$. Naturally the priors must be chosen in such a way that they sum to 1. Other information we might want to incorporate into the modeling process is the cost or loss incurred by classifying an object from class i as being from class j, and vice versa. With this information provided we would expect the resulting tree to avoid making the most costly misclassifications on our training data set.

Some notation used by Therneau & Atkinson (1999) for determining the loss for a given node A:

  n_iA   = number of observations in node A from class i
  n_i    = number of observations in the training set from class i
  n      = total number of observations in the training set
  n_A    = number of observations in node A
  pi_i   = prior probability of being from class i (by default pi_i = n_i / n)
  L(i,j) = loss incurred by classifying a class i object as being from class j
  tau(A) = predicted class for node A

In general, the loss is specified as a matrix

$$\text{Loss Matrix} = \begin{pmatrix}
0 & L(1,2) & L(1,3) & \cdots & L(1,C) \\
L(2,1) & 0 & L(2,3) & \cdots & L(2,C) \\
L(3,1) & L(3,2) & 0 & \cdots & \vdots \\
\vdots & \vdots & & \ddots & L(C-1,C) \\
L(C,1) & L(C,2) & \cdots & L(C,C-1) & 0
\end{pmatrix}$$

By default this is a symmetric matrix with $L(i,j) = L(j,i) = 1$ for all $i \neq j$.

Using the notation and concepts presented above, the risk or loss at a node A (the "loss" reported for that node in the rpart output) is given by

$$R(A) = \left[\, \sum_{i=1}^{C} \pi_i \, L(i,\tau(A)) \, \frac{n_{iA}}{n_i}\cdot\frac{n}{n_A} \,\right] n_A$$

where the bracketed term is the risk per observation in node A.

Example 11.1: Kyphosis Data

> library(rpart)
> data(kyphosis)
> attach(kyphosis)
> names(kyphosis)
[1] "Kyphosis" "Age"      "Number"   "Start"

> k.default <- rpart(Kyphosis~.,data=kyphosis)
> k.default
n= 81

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 81 17 absent (0.7901235 0.2098765)
   2) Start>=8.5 62 6 absent (0.9032258 0.0967742)
     4) Start>=14.5 29 0 absent (1.0000000 0.0000000) *
     5) Start< 14.5 33 6 absent (0.8181818 0.1818182)
      10) Age< 55 12 0 absent (1.0000000 0.0000000) *
      11) Age>=55 21 6 absent (0.7142857 0.2857143)
        22) Age>=111 14 2 absent (0.8571429 0.1428571) *
        23) Age< 111 7 3 present (0.4285714 0.5714286) *
   3) Start< 8.5 19 8 present (0.4210526 0.5789474) *

Sample Loss Calculations:
Root Node:        R(root) = .2098765*1*(17/17)*(81/81)*81 = 17
Terminal Node 3:  R(A)    = .7901235*1*(8/64)*(81/19)*19 = 8

The actual tree is shown on the following page.
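These node losses can be verified by hand in R. Below is a minimal sketch for doing so; the helper node.loss is hypothetical (it is not part of rpart) and simply applies the formula above to the default kyphosis fit, with the classes ordered (absent, present). Since (n/n_A) * n_A = n, the code just multiplies by n.

# Hypothetical helper: node loss from the class counts in the node, the class
# totals, the priors, the loss matrix, and the predicted class (a column of L).
node.loss <- function(n.iA, n.i, prior, L, pred) {
  n <- sum(n.i)
  sum(prior * L[, pred] * (n.iA / n.i)) * n   # (n/n_A)*n_A simplifies to n
}

L <- matrix(c(0, 1,
              1, 0), nrow = 2, byrow = TRUE)   # default unit losses
prior <- c(64, 17)/81                          # default priors pi_i = n_i/n

node.loss(c(64, 17), c(64, 17), prior, L, pred = 1)  # root, predicted absent   -> 17
node.loss(c(8, 11),  c(64, 17), prior, L, pred = 2)  # node 3, predicted present -> 8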
Suppose now we have prior beliefs that 65% of patients will not have kyphosis (absent) and 35% of patients will have kyphosis (present).

> k.priors <- rpart(Kyphosis~.,data=kyphosis,parms=list(prior=c(.65,.35)))
> k.priors
n= 81

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 81 28.350000 absent (0.65000000 0.35000000)
  2) Start>=12.5 46  3.335294 absent (0.91563089 0.08436911) *
  3) Start< 12.5 35 16.453130 present (0.39676840 0.60323160)
    6) Age< 34.5 10  1.667647 absent (0.81616742 0.18383258) *
    7) Age>=34.5 25  9.049219 present (0.27932897 0.72067103) *

Sample Loss Calculations:
R(root)   = .35*1*(17/17)*(81/81)*81 = 28.35
R(node 7) = .65*1*(11/64)*(81/25)*25 = 9.049219

The resulting tree is shown on the following page.

Finally, suppose we wish to incorporate the following loss information:

                      Classified as
  L(i, j)           present    absent
  Actual  present      0          4
          absent       1          0

Note: In R the category orderings for the loss matrix are Z → A in both dimensions, i.e. the first row and column here correspond to present.

This says that it is 4 times more serious to misclassify a child that actually has kyphosis (present) as not having it (absent). Again we will use the priors from the previous model (65% absent, 35% present).

> lmat <- matrix(c(0,4,1,0),nrow=2,ncol=2,byrow=T)
> k.priorloss <- rpart(Kyphosis~.,data=kyphosis,parms=list(prior=c(.65,.35),loss=lmat))
> k.priorloss
n= 81

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 81 52.650000 present (0.6500000 0.3500000)
  2) Start>=14.5 29  0.000000 absent (1.0000000 0.0000000) *
  3) Start< 14.5 52 28.792970 present (0.5038760 0.4961240)
    6) Age< 39 15  6.670588 absent (0.8735178 0.1264822) *
    7) Age>=39 37 17.275780 present (0.3930053 0.6069947) *

Sample Loss Calculations:
R(root)   = .65*1*(64/64)*(81/81)*81 = 52.650000
R(node 7) = .65*1*(21/64)*(81/37)*37 = 17.275780
R(node 6) = .35*4*(1/17)*(81/15)*15 = 6.67058

The resulting tree is shown on the following page.

Example 11.2 – Owl Diet

> owl.tree <- rpart(species~.,data=OwlDiet)
> owl.tree
n= 179

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 179 140 tiomanicus (0.14 0.21 0.13 0.089 0.056 0.22 0.16)
   2) teeth_row>=56.09075 127 88 tiomanicus (0.2 0.29 0 0.13 0.079 0.31 0)
     4) teeth_row>=72.85422 26 2 annandalfi (0.92 0.077 0 0 0 0 0) *
     5) teeth_row< 72.85422 101 62 tiomanicus (0.0099 0.35 0 0.16 0.099 0.39 0)
      10) teeth_row>=64.09744 49 18 argentiventer (0.02 0.63 0 0.27 0.02 0.061 0)
        20) palatine_foramen>=61.23703 31 4 argentiventer (0.032 0.87 0 0.065 0 0.032 0) *
        21) palatine_foramen< 61.23703 18 7 rajah (0 0.22 0 0.61 0.056 0.11 0) *
      11) teeth_row< 64.09744 52 16 tiomanicus (0 0.077 0 0.058 0.17 0.69 0)
        22) skull_length>=424.5134 7 1 surifer (0 0.14 0 0 0.86 0 0) *
        23) skull_length< 424.5134 45 9 tiomanicus (0 0.067 0 0.067 0.067 0.8 0) *
   3) teeth_row< 56.09075 52 24 whiteheadi (0 0 0.46 0 0 0 0.54)
     6) palatine_foramen>=47.15662 25 1 exulans (0 0 0.96 0 0 0 0.04) *
     7) palatine_foramen< 47.15662 27 0 whiteheadi (0 0 0 0 0 0 1) *

The function misclass.rpart constructs a confusion matrix for a classification tree.

> misclass.rpart(owl.tree)
Table of Misclassification
(row = predicted, col = actual)

                 1  2  3  4  5  6  7
  annandalfi    24  2  0  0  0  0  0
  argentiventer  1 27  0  2  0  1  0
  exulans        0  0 24  0  0  0  1
  rajah          0  4  0 11  1  2  0
  surifer        0  1  0  0  6  0  0
  tiomanicus     0  3  0  3  3 36  0
  whiteheadi     0  0  0  0  0  0 27

Misclassification Rate = 0.134
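The function misclass.rpart itself is not listed in these notes. A minimal sketch, consistent with the generic misclass() function given at the end of this chapter (and with the output above, where the actual classes print as the integer codes 1–7 stored in the rpart object's y component), might look like:

misclass.rpart = function(tree) {
  # cross-tabulate predicted classes against the stored response codes
  temp <- table(predict(tree, type = "class"), tree$y)
  cat("Table of Misclassification\n")
  cat("(row = predicted, col = actual)\n")
  print(temp)
  mcr <- 1 - sum(diag(temp))/length(tree$y)
  cat(paste("\nMisclassification Rate =", format(mcr, digits = 3)), "\n")
}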
Using equal prior probabilities for each of the groups:

> owl.tree2 <- rpart(species~.,data=OwlDiet,parms=list(prior=c(1,1,1,1,1,1,1)/7))
> misclass.rpart(owl.tree2)
Table of Misclassification
(row = predicted, col = actual)

                 1  2  3  4  5  6  7
  annandalfi    24  2  0  0  0  0  0
  argentiventer  1 27  0  2  0  1  0
  exulans        0  0 24  0  0  0  1
  rajah          0  4  0 11  1  2  0
  surifer        0  1  0  1  9  5  0
  tiomanicus     0  3  0  2  0 31  0
  whiteheadi     0  0  0  0  0  0 27

Misclassification Rate = 0.145

> plot(owl.tree)
> text(owl.tree,cex=.6)

We can see that teeth row and palatine foramen figure prominently in the classification rule. If you examine the group differences across these characteristics using comparative boxplots, this fact is not surprising.

Cross-validation is done in the usual fashion: we leave out a certain percentage of the observations, develop a model from the remaining data, predict back the class of the observations we set aside, calculate the misclassification rate, and then repeat this process a number of times. The function crpart.cv leaves out a proportion p of the data at a time to perform Monte Carlo cross-validation (MCCV) to estimate the APER.

> results = crpart.cv(owl.rpart,y=Owldiet$species,data=Owldiet,B=100)
> summary(results)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.1017  0.1695  0.2034  0.2193  0.2712  0.3559

crpart.cv = function(fit,y,data,B=25,p=.333) {
  n = length(y)
  cv <- rep(0,B)
  for (i in 1:B) {
    ss <- floor(n*p)
    sam <- sample(1:n,ss)
    temp <- data[-sam,]
    fit2 <- rpart(formula(fit),data=temp,parms=fit$parms,control=fit$control)
    ynew <- predict(fit2,newdata=data[sam,],type="class")
    tab <- table(y[sam],ynew)
    mc <- ss - sum(diag(tab))
    cv[i] <- mc/ss
  }
  cv
}

Note: the parms=fit$parms and control=fit$control arguments carry over any optional fitting arguments used in the original fit, such as the prior probabilities, loss matrix, complexity parameter (cp), and minsplit/minbucket settings.

11.2 – Bagging and Random Forests for Classification Trees

As with regression problems, bagging (averaging trees built to bootstrap samples of the training data) and random forests (building trees using randomly selected predictors at each split) can produce superior predictive performance.

Bagging (using the bagging() command in the ipred library)

> owl.bag = bagging(species~.,data=Owldiet,coob=T)
> owl.bag

Bagging classification trees with 25 bootstrap replications

Call: bagging.data.frame(formula = species ~ ., data = Owldiet, coob = T)

Out-of-bag estimate of misclassification error:  0.2123

> misclass(predict(owl.bag,newdata=Owldiet),Owldiet$species)
Table of Misclassification
(row = predicted, col = actual)

               y
fit             annandalfi argentiventer exulans rajah surifer tiomanicus whiteheadi
  annandalfi            25             0       0     0       0          0          0
  argentiventer          0            37       0     0       0          0          0
  exulans                0             0      24     0       0          0          0
  rajah                  0             0       0    16       0          0          0
  surifer                0             0       0     0      10          0          0
  tiomanicus             0             0       0     0       0         39          0
  whiteheadi             0             0       0     0       0          0         28

Misclassification Rate = 0

The training cases are predicted perfectly with 0% misclassified; however, the out-of-bag estimate of the APER is .2123, or 21.23% misclassified.
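As the printout above notes, bagging() in the ipred library grows 25 bootstrap trees by default; the nbagg argument controls the number of replications. A quick sketch along these lines (the out-of-bag estimate will vary from run to run because of the random bootstrap samples):

> owl.bag50 = bagging(species~.,data=Owldiet,nbagg=50,coob=T)  # 50 bootstrap trees
> owl.bag50                                                    # prints the new OOB error estimate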
Example 11.3: Italian Olive Oils

> mod = bagging(Area.name~.,data=O.area,coob=T)
> mod

Bagging classification trees with 25 bootstrap replications

Call: bagging.data.frame(formula = Area.name ~ ., data = O.area, coob = T)

Out-of-bag estimate of misclassification error:  0.0857

> misclass(predict(mod,newdata=O.area),O.area$Area.name)
Table of Misclassification
(row = predicted, col = actual)

                  y
fit                Calabria Coastal-Sardinia East-Liguria Inland-Sardinia North-Apulia Sicily South-Apulia Umbria West-Liguria
  Calabria               56                0            0               0            0      0            0      0            0
  Coastal-Sardinia        0               33            0               0            0      0            0      0            0
  East-Liguria            0                0           50               0            0      0            0      0            0
  Inland-Sardinia         0                0            0              65            0      0            0      0            0
  North-Apulia            0                0            0               0           25      0            0      0            0
  Sicily                  0                0            0               0            0     36            0      0            0
  South-Apulia            0                0            0               0            0      0          206      0            0
  Umbria                  0                0            0               0            0      0            0     51            0
  West-Liguria            0                0            0               0            0      0            0      0           50

Misclassification Rate = 0

Again the training cases are predicted perfectly with 0% misclassified; however, the out-of-bag estimate of the APER is .0857, or 8.57% misclassified.

11.3 – Random Forests

Example 11.3 (cont’d): Italian Olive Oils

> area.rf = randomForest(Area.name~.,data=O.area,mtry=2,importance=T)
> area.rf

Call:
 randomForest(formula = Area.name ~ ., data = O.area, mtry = 2, importance = T)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of error rate: 4.72%
Confusion matrix:
                 Calabria Coastal-Sardinia East-Liguria Inland-Sardinia North-Apulia Sicily South-Apulia Umbria West-Liguria class.error
Calabria               54                0            0               0            0      0            2      0            0  0.03571429
Coastal-Sardinia        0               32            0               1            0      0            0      0            0  0.03030303
East-Liguria            0                0           46               0            0      0            0      1            3  0.08000000
Inland-Sardinia         0                0            0              65            0      0            0      0            0  0.00000000
North-Apulia            2                0            0               0           23      0            0      0            0  0.08000000
Sicily                  4                0            0               0            3     23            6      0            0  0.36111111
South-Apulia            1                0            0               0            0      3          202      0            0  0.01941748
Umbria                  0                0            0               0            0      0            0     51            0  0.00000000
West-Liguria            0                0            1               0            0      0            0      0           49  0.02000000

> area.rf = randomForest(Area.name~.,data=O.area,mtry=3,importance=T)
> area.rf

Call:
 randomForest(formula = Area.name ~ ., data = O.area, mtry = 3, importance = T)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of error rate: 4.9%

> area.rf = randomForest(Area.name~.,data=O.area,mtry=4,importance=T)
> area.rf

Call:
 randomForest(formula = Area.name ~ ., data = O.area, mtry = 4, importance = T)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 4

        OOB estimate of error rate: 4.72%
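The mtry comparison above can be automated with a short loop. The sketch below is one way to do it; the OOB error rate is itself a random quantity, so small differences such as 4.72% versus 4.9% should not be over-interpreted. The seed value is an arbitrary choice added only so the comparison can be reproduced.

set.seed(111)   # arbitrary seed for reproducibility
oob.err = sapply(2:5, function(m) {
  fit = randomForest(Area.name ~ ., data = O.area, mtry = m, ntree = 500)
  fit$err.rate[fit$ntree, "OOB"]        # OOB error after the final tree
})
names(oob.err) = paste("mtry =", 2:5)
oob.err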
Measures of Impurity (used to measure variable importance)

As you can see below, the decrease in the Gini index is used as the basis for one of the importance measures for random forest models.

> area.rf$importance
              Calabria Coastal-Sardinia East-Liguria Inland-Sardinia North-Apulia     Sicily South-Apulia     Umbria West-Liguria
palmitic    0.13452872      0.073673892   0.02009836     0.063620230  0.219814164 0.02393492  0.035160949 0.18144189   0.07172358
palmitoleic 0.12917763      0.297395878   0.24871128     0.258199312  0.294735044 0.12067000  0.349637005 0.26129882   0.05793994
strearic    0.05485467      0.008206269   0.02237659     0.020664404  0.002113947 0.17771535  0.048116657 0.09939278   0.01070815
oleic       0.11245305      0.282924624   0.23505554     0.355941504  0.230132746 0.06476381  0.165858075 0.69235145   0.24210970
linoleic    0.32933668      0.508924377   0.45492602     0.495736641  0.316481124 0.14388783  0.224371017 0.48554629   0.34768423
eicosanoic  0.23589078      0.031037995   0.02158836     0.007285309  0.012992940 0.02669415  0.008961566 0.05100809   0.16965593
linolenic   0.07932334      0.016359483   0.03856755     0.012864521  0.041563009 0.10790249  0.014568810 0.01612817   0.38582919
eicosenoic  0.28569336      0.189885202   0.47343418     0.166049624  0.307131929 0.25378467  0.080635695 0.39261139   0.23982349

            MeanDecreaseAccuracy MeanDecreaseGini
palmitic              0.07276472         28.36385
palmitoleic           0.25472399        102.65161
strearic              0.04839498         23.73749
oleic                 0.24516620         96.05318
linoleic              0.33471962         95.28149
eicosanoic            0.05194775         19.67056
linolenic             0.06224676         42.75880
eicosenoic            0.21391468         56.87026

rfimp.class = function(rffit,measure=1,horiz=T) {
  barplot(sort(rffit$importance[,measure]),horiz=horiz,
          xlab="Importance Measure",main="Variable Importance")
}

> rfimp.class(area.rf,measure=11,horiz=F)
> rfimp.class(area.rf,measure=10,horiz=F)

The function below performs MCCV for a random forest model for classification.

crf.cv = function(fit,y,data,B=25,p=.333,mtry=fit$mtry,ntree=fit$ntree) {
  n = length(y)
  cv <- rep(0,B)
  for (i in 1:B) {
    ss <- floor(n*p)
    sam <- sample(1:n,ss)
    temp <- data[-sam,]
    fit2 <- randomForest(formula(fit),data=temp,mtry=mtry,ntree=ntree)
    ynew <- predict(fit2,newdata=data[sam,],type="class")
    tab <- table(y[sam],ynew)
    mc <- ss - sum(diag(tab))
    cv[i] <- mc/ss
  }
  cv
}

> results = crf.cv(area.rf,y=O.area$Area.name,data=O.area)
> results
 [1] 0.07894737 0.09473684 0.03684211 0.06842105 0.05263158 0.07894737 0.05789474
 [8] 0.04736842 0.05789474 0.06842105 0.05263158 0.04210526 0.08947368 0.07368421
[15] 0.04210526 0.04736842 0.06842105 0.05263158 0.05789474 0.06315789 0.06315789
[22] 0.07368421 0.06315789 0.07894737 0.04210526
> summary(results)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.03684 0.05263 0.06316 0.06211 0.07368 0.09474

Proximities

These are another potentially useful byproduct of a random forest fit. After a tree is grown, run all of the data, both training and OOB, down the tree. If observations i and j are in the same terminal node in a given tree, increase their proximity by one. At the end, normalize the proximities by dividing by the number of trees. The proximities are returned in an n × n matrix which can be visualized as a heat-map using the image() function.

> area.rf = randomForest(Area.name~.,data=O.area,mtry=4,importance=T,proximity=T)
> image(area.rf$proximity)
> title(main="Image Plot of Proximities for Olive Oils")
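The proximity matrix can also be examined directly rather than just plotted. For example, the base-R sketch below finds the pair of oils from two different growing areas that the forest considers most alike; it assumes only that the rows of the proximity matrix are in the same order as the rows of O.area, which is how randomForest constructs it.

prox <- area.rf$proximity
diag(prox) <- 0                                              # ignore self-proximities
prox[outer(O.area$Area.name, O.area$Area.name, "==")] <- 0   # ignore within-area pairs
idx <- which(prox == max(prox), arr.ind = TRUE)[1, ]
idx                          # row numbers of the most proximate cross-area pair
O.area$Area.name[idx]        # the growing areas of that pair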
FYI ONLY: Multidimensional Scaling (MDS) is a method in which, given a set of proximities (or distances = 1 – proximities) between a set of I observations in p-dimensional space, a lower-dimensional representation (N = 2 or 3) is found so that the distances between observations are preserved. These data can then be plotted in the lower-dimensional space so that interesting structure in the data can be observed graphically.

The following is taken from the Wikipedia page for MDS.

The data to be analyzed is a collection of I objects on which a distance function is defined, $\delta_{i,j} :=$ distance between the i-th and j-th objects. These distances are the entries of the dissimilarity matrix

$$\Delta := \begin{pmatrix}
\delta_{1,1} & \delta_{1,2} & \cdots & \delta_{1,I} \\
\delta_{2,1} & \delta_{2,2} & \cdots & \delta_{2,I} \\
\vdots & \vdots & & \vdots \\
\delta_{I,1} & \delta_{I,2} & \cdots & \delta_{I,I}
\end{pmatrix}.$$

The goal of MDS is, given $\Delta$, to find vectors $x_1, \ldots, x_I \in \mathbb{R}^N$ such that $\|x_i - x_j\| \approx \delta_{i,j}$ for all $i, j \in \{1,\ldots,I\}$, where $\|\cdot\|$ is a vector norm. In classical MDS, this norm is the Euclidean distance, but in a broader sense it may be a metric or arbitrary distance function.

In other words, MDS attempts to find an embedding of the objects into $\mathbb{R}^N$ such that distances are preserved. If the dimension N is chosen to be 2 or 3, we may plot the vectors $x_i$ to obtain a visualization of the similarities between the objects. Note that the vectors $x_i$ are not unique. With the Euclidean distance, for example, they may be arbitrarily translated and rotated, since these transformations do not change the pairwise distances $\|x_i - x_j\|$.

There are various approaches to determining the vectors $x_i$. Usually, MDS is formulated as an optimization problem, where $x_1, \ldots, x_I$ is found as a minimizer of some cost function, for example

$$\min_{x_1,\ldots,x_I} \sum_{i<j} \left( \|x_i - x_j\| - \delta_{i,j} \right)^2 .$$

A solution may then be found by numerical optimization techniques. For some particularly chosen cost functions (e.g. squared error), minimizers can be stated analytically in terms of matrix eigendecompositions, i.e. the spectral decomposition of the distance matrix.

(End Wikipedia content)

The function MDSplot() will plot the results of a multidimensional scaling of the distances formed by subtracting the proximities returned from a random forest fit from 1. By default a two-dimensional plot is returned. The points are color-coded by the categorical response variable.

> MDSplot(area.rf,O.area$Area.name,k=2)
> title(main="MDS Plot of Italian Olive Oils")

> MDSplot(area.rf,O.area$Area.name,k=3)

The MDS plots almost always look like this, which raises the question of their usefulness.
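For reference, MDSplot() appears to be essentially classical MDS applied to the distances 1 − proximity; a hand-rolled version using base R's cmdscale() might look like the sketch below (the axis labels, colors, and title are arbitrary choices).

d <- 1 - area.rf$proximity            # convert proximities to distances
mds <- cmdscale(d, k = 2)             # classical (metric) MDS in base R
plot(mds, col = as.numeric(O.area$Area.name), pch = 16,
     xlab = "Dim 1", ylab = "Dim 2",
     main = "Hand-rolled MDS of Olive Oil Proximities")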
11.4 – Boosting for Classification Trees

Using the boosting function in the adabag library we can fit classification tree models using the adaptive boosting (AdaBoost) method developed by Freund and Schapire. There are other packages with boosting methods in them: mboost, ada, and gbm (which we used for regression problems). For classification problems with non-binary/dichotomous outcomes, however, the boosting command in adabag appears to be the only choice, and it also seems to work fine when the response is dichotomous.

Example 11.3 (cont’d): Italian Olive Oils ~ classifying the growing area

> area.boost = boosting(Area.name~.,data=O.area,mfinal=200)
> summary(area.boost)
           Length Class   Mode
formula       3   formula call
trees       200   -none-  list
weights     200   -none-  numeric
votes      5148   -none-  numeric
prob       5148   -none-  numeric
class       572   -none-  character
importance    8   -none-  numeric

> barplot(sort(area.boost$importance),main="Variable Importance")

> misclass(area.boost$class,O.area$Area.name)
Table of Misclassification
(row = predicted, col = actual)

                  y
fit                Calabria Coastal-Sardinia East-Liguria Inland-Sardinia North-Apulia Sicily South-Apulia Umbria West-Liguria
  Calabria               56                0            0               0            0      0            0      0            0
  Coastal-Sardinia        0               33            0               0            0      0            0      0            0
  East-Liguria            0                0           50               0            0      0            0      0            0
  Inland-Sardinia         0                0            0              65            0      0            0      0            0
  North-Apulia            0                0            0               0           25      0            0      0            0
  Sicily                  0                0            0               0            0     36            0      0            0
  South-Apulia            0                0            0               0            0      0          206      0            0
  Umbria                  0                0            0               0            0      0            0     51            0
  West-Liguria            0                0            0               0            0      0            0      0           50

Misclassification Rate = 0

> plot(area.boost$trees[[1]])
> text(area.boost$trees[[1]])
> area.boost$weights[1]
[1] 1.035799

> plot(area.boost$trees[[180]])
> text(area.boost$trees[[180]],cex=.7)
> area.boost$weights[180]
[1] 1.067609

> area.boost$weights
  [1] 1.0357992 1.1342587 1.0305158 1.0581825 0.9917211 1.0266454 0.8842932 1.0258271 0.9817287 0.9830194 1.0781610
 [12] 0.9715957 0.9109128 1.0666677 0.9420402 1.0315845 0.8000069 0.9443035 0.9376967 1.3242475 1.0275357 0.9091424
 [23] 0.9370240 0.9950483 0.8706232 0.8595090 0.9354693 1.3969888 1.1309866 0.9871757 0.8733040 0.9645878 0.9838000
 [34] 0.8949037 1.1064092 0.8895321 1.0391414 0.8809439 1.0362868 0.9771788 1.0108042 1.0829109 0.8918500 0.8922770
 [45] 0.8703193 1.1578435 0.8326164 1.0623264 1.0686684 0.8351552 1.1511997 0.9688051 0.9657496 0.9068563 1.1212727
 [56] 1.1165439 0.9244610 1.0625838 0.9551809 1.0194853 0.8126341 1.0491668 1.0504185 1.0177336 1.0563784 1.0633402
 [67] 0.9269494 0.9932527 1.0627201 1.0069796 1.0046468 1.1906942 0.8106182 0.9775952 1.0493857 1.0615412 0.8617204
 [78] 0.8986684 0.8673571 1.1098483 0.9219371 1.0020261 1.0697031 0.9486628 0.9528626 1.4468478 1.0089787 1.0115171
 [89] 0.8862763 0.8581576 0.7838882 0.9551105 1.0679582 1.1108903 1.1117184 0.9922303 0.9585049 1.0442129 1.1921323
[100] 1.4032175 1.0068071 1.0870706 1.0336974 0.8826356 0.9392506 1.2032414 0.9313945 0.9451793 0.9147896 0.9673191
[111] 0.8953054 1.0230568 1.0244511 1.0796687 0.9398476 1.0518340 1.0957845 1.0512727 1.0091305 1.2940220 0.9882869
[122] 1.0342709 0.8832552 1.3778578 1.0634951 1.0060252 1.0988167 1.2851301 0.9475067 1.0060947 1.0095361 1.2943851
[133] 0.7803303 1.1511980 0.9266824 1.0173223 0.9170549 0.9041655 1.0006903 1.0259476 0.9706092 0.9209586 1.3546597
[144] 1.2587555 0.8619396 0.8910515 1.0879095 0.8995054 1.1486299 0.9859894 0.8927899 1.0550020 1.1006079 1.0491384
[155] 1.0375580 1.0651710 0.9567762 0.9914885 0.9840809 1.1065162 1.1213506 0.9043262 1.3029086 1.0930421 0.9440773
[166] 1.0706908 1.0271797 1.0954239 0.8851813 0.9446243 1.0860789 1.0502042 0.8718052 0.8169246 0.9604656 1.0676094
[177] 1.0416944 1.1162135 0.9455997 1.0676094 1.0573716 1.1463685 1.2167139 0.9977561 1.0600536 0.9956170 0.9534185
[188] 0.7859464 1.0937164 0.8481177 0.8882285 1.1383453 1.1355113 1.0273235 1.1245070 0.8847374 0.9920956 0.9847003
[199] 1.1358410 1.0333415

> plot(area.boost$trees[[91]])
> text(area.boost$trees[[91]],cex=.7)
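The tree weights above all hover around 1, so no single tree dominates the vote. A quick sketch for looking at their distribution and spotting the most heavily weighted iterations (plain base R applied to the fit above):

> summary(area.boost$weights)
> order(area.boost$weights,decreasing=T)[1:5]                       # most influential iterations
> plot(area.boost$weights,type="h",xlab="Iteration",ylab="Tree weight")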
The original fit is potentially overfitting the training data, so we should explore this possibility by creating training/test sets from the original data or by performing MCCV.

Test/Training Set Approach

> sam = sample(1:572,100,replace=F)
> Oarea.test = O.area[sam,]
> Oarea.train = O.area[-sam,]
> areaboost.train = boosting(Area.name~.,data=Oarea.train,mfinal=200)
> misclass(areaboost.train$class,Oarea.train$Area.name)
Table of Misclassification
(row = predicted, col = actual)

                  y
fit                Calabria Coastal-Sardinia East-Liguria Inland-Sardinia North-Apulia Sicily South-Apulia Umbria West-Liguria
  Calabria               42                0            0               0            0      0            0      0            0
  Coastal-Sardinia        0               29            0               0            0      0            0      0            0
  East-Liguria            0                0           44               0            0      0            0      0            0
  Inland-Sardinia         0                0            0              55            0      0            0      0            0
  North-Apulia            0                0            0               0           15      0            0      0            0
  Sicily                  0                0            0               0            0     32            0      0            0
  South-Apulia            0                0            0               0            0      0          175      0            0
  Umbria                  0                0            0               0            0      0            0     41            0
  West-Liguria            0                0            0               0            0      0            0      0           39

Misclassification Rate = 0

> predarea = predict(areaboost.train,newdata=Oarea.test)
> attributes(predarea)
$names
[1] "formula"   "votes"     "prob"      "class"     "confusion" "error"

> predarea$confusion
                  Observed Class
Predicted Class    Calabria Coastal-Sardinia East-Liguria Inland-Sardinia North-Apulia Sicily South-Apulia Umbria West-Liguria
  Calabria               14                0            0               0            0      0            0      0            0
  Coastal-Sardinia        0                4            0               0            0      0            0      0            0
  East-Liguria            0                0            6               0            0      0            0      0            0
  Inland-Sardinia         0                0            0              10            0      0            0      0            0
  North-Apulia            0                0            0               0            9      0            0      0            0
  Sicily                  0                0            0               0            1      3            1      0            0
  South-Apulia            0                0            0               0            0      1           30      0            0
  Umbria                  0                0            0               0            0      0            0     10            0
  West-Liguria            0                0            0               0            0      0            0      0           11

> predarea$error
[1] 0.03

3% of the test cases were misclassified.

> barplot(sort(area.boost$importance),horiz=F,
          main="Variable Importances from AdaBoost")

boost.cv = function(fit,y,data,p=.333,B=25,control=rpart.control()) {
  n = length(y)
  cv <- rep(0,B)
  for (i in 1:B) {
    ss <- floor(n*p)
    sam <- sample(1:n,ss,replace=F)
    temp <- data[-sam,]
    fit2 <- boosting(formula(fit),data=temp,control=control)
    ypred <- predict(fit2,newdata=data[sam,])
    tab = ypred$confusion
    mc <- ss - sum(diag(tab))
    cv[i] <- mc/ss
  }
  cv
}

> results = boost.cv(area.boost,O.area$Area.name,data=O.area,B=100)
> summary(results)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.03158 0.05132 0.05789 0.06232 0.07368 0.10530

> Statplot(results)

There are three different weighting formulas available for use with the boosting function in the adabag library. The argument coeflearn has the following options:

Breiman (default):  $\alpha = \dfrac{1}{2}\ln\!\left(\dfrac{1-e_t}{e_t}\right)$

Freund:  $\alpha = \ln\!\left(\dfrac{1-e_t}{e_t}\right)$

Zhu:  $\alpha = \ln\!\left(\dfrac{1-e_t}{e_t}\right) + \ln(\text{number of classes} - 1)$

How much the performance varies across these different weighting schemes can be explored on an application-by-application basis, choosing the one with the optimal performance as judged by MCCV or a test/training set approach.

For problems where the response is dichotomous, other libraries with boosting algorithms can be used. The function ada in the library of the same name will perform discrete, real, or gentle boosting. Real and gentle boosting do not use the misclassification error, but rather weight according to the estimated probability of y = 1 given the x's. For example, the weights for real boosting are determined by

$$\alpha = \frac{1}{2}\ln\!\left(\frac{1 - p_m(x)}{p_m(x)}\right)$$

where $p_m(x) = \hat{P}(y = 1 \mid \mathbf{x})$. For two-class prediction problems it is probably worth exploring a variety of these methods for a given problem to arrive at an optimal choice. My guess is the “optimal” choice, determined by some form of validation, will vary from problem to problem.
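Example 11.4 below compares the two packages using boost.cv and an analogous ada.cv helper; ada.cv is not listed in these notes, but a sketch consistent with the other *.cv functions might look like the following. The use of fit$call$formula assumes the ada object stores its call (print.ada displays one), and predict() on an ada fit is assumed to return class labels by default.

ada.cv = function(fit,y,data,p=.333,B=25) {
  form <- eval(fit$call$formula)     # recover the formula from the original call (assumption)
  n = length(y)
  cv <- rep(0,B)
  for (i in 1:B) {
    ss <- floor(n*p)
    sam <- sample(1:n,ss,replace=F)
    fit2 <- ada(form,data=data[-sam,])
    ynew <- predict(fit2,newdata=data[sam,])   # predicted class labels
    tab <- table(y[sam],ynew)
    cv[i] <- (ss - sum(diag(tab)))/ss
  }
  cv
}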
Example 11.4: Cleveland Heart Disease Study

> diag.ada = ada(diag~.,data=Cleve.diag)
> diag.ada
Call:
ada(diag ~ ., data = Cleve.diag)

Loss: exponential Method: discrete   Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value buff sick
      buff  151    9
      sick   16  120

Train Error: 0.084

Out-Of-Bag Error:  0.108  iteration= 41

Additional Estimates of number of iterations:

train.err1 train.kap1
        44         44

> varplot(diag.ada)

> diag.adaboost = boosting(diag~.,data=Cleve.diag,mfinal=50)
> misclass(diag.adaboost$class,Cleve.diag$diag)
Table of Misclassification
(row = predicted, col = actual)

      y
fit    buff sick
  buff  160    0
  sick    0  136

Misclassification Rate = 0

MCCV Results

> boost.results = boost.cv(diag.adaboost,Cleve.diag$diag,Cleve.diag)
> summary(boost.results)
  Min. 1st Qu. Median   Mean 3rd Qu.   Max.
0.1327  0.1735 0.1939 0.1967  0.2245 0.2653

> ada.results = ada.cv(diag.ada,Cleve.diag$diag,Cleve.diag)
> summary(ada.results)
  Min. 1st Qu. Median   Mean 3rd Qu.   Max.
0.1122  0.1429 0.1633 0.1694  0.1939 0.2347

The boosting methods in the ada package outperform those in the adabag package for this two-class prediction problem. The ada function also allows for easy prediction on a test set with a visualization of the results.

> Cleve.test = Cleve.diag[250:296,]
> Cleve.train = Cleve.diag[1:249,]
> diag.train = ada(diag~.,data=Cleve.train)
> diag.test = addtest(diag.train,Cleve.test[,-14],Cleve.test[,14])
> diag.test
Call:
ada(diag ~ ., data = Cleve.train)

Loss: exponential Method: discrete   Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value buff sick
      buff  127    6
      sick   14  102

Train Error: 0.08

Out-Of-Bag Error:  0.1  iteration= 48

Additional Estimates of number of iterations:

train.err1 train.kap1 test.errs2 test.kaps2
        50         50         32         32

> plot(diag.test,test=T)

Misclassification Table for Tree-based Classification Models

misclass = function (tree) {
  temp <- table(predict(tree, type = "class"), tree$y)
  cat("Table of Misclassification\n")
  cat("(row = predicted, col = actual)\n")
  print(temp)
  cat("\n\n")
  numcor <- sum(diag(temp))
  numinc <- length(tree$y) - numcor
  mcr <- numinc/length(tree$y)
  cat(paste("Misclassification Rate = ", format(mcr, digits = 3)))
  cat("\n")
}

This function should work for rpart, tree, bagging, and randomForest fits.

11.5 – Tree-based Models for Classification in JMP

To begin tree-based modeling for regression or classification problems in JMP, select Partition from the Modeling pull-out menu as shown below. If the response (Y) is continuous it will fit a regression tree-based model, and if the response is nominal it will fit a classification tree-based model.

Decision Tree – basically fits a model like rpart in R.

Bootstrap Forest – can be used to fit a bagged tree model or a random forest to improve the predictive performance of the tree models.

Boosted Tree – uses boosting to improve the predictive performance of a tree, for dichotomous responses only!

Set a fraction of your training data to use as a validation set, e.g. .3333 to set aside 33.33% of your training data for validation purposes.

Bagging and Random Forest Dialog Box in JMP

M = # of tree models to average.

Number of terms sampled per split = # of variables randomly sampled at each potential split (mtry in R).

Bootstrap sample rate = proportion of observations sampled for each bootstrap sample.
Minimum Splits Per Tree = minimum number of splits for each tree.

Maximum Splits Per Tree = maximum tree size for any tree fit to a bootstrap sample.

Minimum Size Split = smallest number of observations in a node for which further splitting would be considered.

Early Stopping = will average fewer than M trees if averaging more trees does not improve the validation statistic.

Multiple Fits over number of terms = will fit random forests using each value from the specified Number of terms sampled per split up to Max Number of terms; the final fit returned is the best random forest obtained using values in this range.