Stat 407 Lab 6
Classification Trees
Fall 2001
SOLUTION

In this lab we will examine the results of classification trees on the crabs data.

1. Load the data, and subset it into training and test data sets. Use about 25% of cases for the test data set, that is, 25 cases from each Species. You can use the following script to build the training and test data sets:

    # draw 25 row indices at random from each species (rows 1-100 and 101-200)
    indx<-sample(c(1:100),size=25)
    indx2<-sample(c(101:200),size=25)
    indx<-c(sort(indx),sort(indx2))
    indx
    # hold out the sampled rows as test data; the rest is training data
    crabs.train<-australian.crabs[-indx,]
    crabs.test<-australian.crabs[indx,]

This will create two data sets called crabs.train and crabs.test, which partition the original data set in two by holding out 25% of each species as test data. The training set has 150 cases, 75 of species 1 and 75 of species 2. The test set has 50 cases, 25 from species 1 and 25 from species 2. Print the list of indices used to create your subsets.

My list of indices for the test data set is

     [1]   2   4   8  12  14  23  28  31  41  42  43  48  50  51  55  56  72
    [18]  73  78  80  81  87  89  98  99 101 105 112 113 116 122 123 135 138
    [35] 139 148 150 151 152 165 170 179 181 184 186 187 192 193 194 196

2. Build a classification tree for the training data. Select Statistics, Tree, Tree Models. Use Sp as the dependent variable and the 5 physical measurements as the independent variables. Save the model as crabs.tree. Choose the Summary Description, Full Tree and Misclassification Errors options in the Results window. Choose to plot the tree using the Proportional to Node Deviance and Add text labels options in the Plot menu. Use the Predict control panel to obtain predictions for the test data. Report the residual mean deviance of the tree, the number of terminal nodes, and list the variables used in the tree construction.

Residual mean deviance = 0.01957 = 3.6 / 184
Number of terminal nodes: 16
FL, CW, BD, RW, CL are used.

3. Examine the plot of the tree. We're going to follow the right hand branch of the tree.
Plot the first two variables (of the full data set) used in the branch of the tree, using color to represent the species. Draw the classification boundaries corresponding to this part of the tree as best possible in this plot.

Separately.

4. Examine the predictions for the test data. Calculate the misclassification table for the test data. What is the estimated error rate of the tree?

                Predicted
    Actual       1     2
       1        24     1
       2         1    24

The error rate is then 2/50 = 0.04.

5. Compare and contrast the results provided by a classification tree and those provided by linear discriminant analysis.

The error rate is higher for the tree than for LDA. The boundary between the groups for the tree is a very awkward step function, because each split cuts on a single variable, whereas the LDA boundary is strictly linear in a 2D discriminant space.
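
The arithmetic behind the numbers reported in steps 2 and 4 can be checked with a few lines of S/R. This is only a minimal sketch, not the script used to produce the solution: the object names conf, n.wrong, err.rate and res.mean.dev are made up for illustration, and the counts are typed in by hand from the misclassification table rather than computed from a fitted tree.

```r
# Misclassification table from step 4, entered by hand
# (rows = actual species, columns = predicted species).
conf <- matrix(c(24,  1,
                  1, 24),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("1", "2"),
                               predicted = c("1", "2")))

# Off-diagonal cells are the misclassified test cases.
n.wrong <- sum(conf) - sum(diag(conf))

# Estimated error rate = misclassified cases / all test cases.
err.rate <- n.wrong / sum(conf)
err.rate

# Residual mean deviance from step 2, using the deviance and
# denominator exactly as reported there.
res.mean.dev <- 3.6 / 184
round(res.mean.dev, 5)
```

With predictions in hand, the same table would normally come from a cross-tabulation such as table(crabs.test$Sp, pred), where pred is a hypothetical vector holding the predicted species for the test cases.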