Supervised Classification using S-Plus

Linear Discriminant Analysis

This is an example of supervised classification of the Australian crabs data.

The first step is to create a new variable sp.sex with 4 categories from the two variables species and sex. Since each group has a sample size of 50, and the cases are ordered according to the species and sex variables, you can do this with a command like:

  sp.sex <- c(rep(1,50), rep(2,50), rep(3,50), rep(4,50))

You'll need to add this variable to the Australian crabs data table.

Next we break the data into training and test samples. We'll hold out 20% of each group to test our rule. Here is the code to do this:

  indx <- c(sample(1:50, size=40), sample(51:100, size=40),
            sample(101:150, size=40), sample(151:200, size=40))
  crabs.train <- australian.crabs[indx,]
  crabs.test <- australian.crabs[-indx,]

Now you're almost ready to run lda. You can change the aspect ratio of the plot with the command

  par(pty="s")

so that the plot of the data in the discriminant space will be square rather than rectangular.

The Multivariate, Discriminant Analysis menu items on the GUI provide a panel to run linear discriminant analysis. On the panel, choose crabs.train for the data, sp.sex as the dependent variable, and FL, ..., BD as the independent variables. Use the Canonical option, with homoscedastic variance (equal variance-covariance matrices), and proportional priors (weighted according to sample size; equal weights in this case). On the Results tab choose Plot and Long output, and select Plug-in. Then click Apply.
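If you prefer the command line to the GUI, a roughly equivalent fit can be obtained with lda() from the MASS library of Venables and Ripley (available in both S-Plus and R). This is a sketch, not the GUI's discrim() call; lda() coerces sp.sex to a factor and uses proportional priors by default, matching the GUI settings:

  library(MASS)   # lda() lives in the MASS library

  # Homoscedastic linear discriminant analysis on the training data
  crabs.lda <- lda(sp.sex ~ FL + RW + CL + CW + BD, data = crabs.train)

  crabs.lda$scaling   # canonical (discriminant) coefficients
  crabs.lda$svd       # singular values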
Here are the results from the GUI run:

*** Discriminant Analysis ***

Call:
discrim(structure(.Data = sp.sex ~ FL + RW + CL + CW + BD,
    class = "formula"), data = crabs.train,
    family = Canonical(cov.structure = "homoscedastic"),
    na.action = na.omit, prior = "proportional")

Group means:
      FL    RW    CL    CW    BD  N Priors
G1 14.53 11.50 31.26 35.93 13.05 40   0.25
G2 13.34 12.18 28.23 32.78 11.86 40   0.25
G3 16.25 12.06 32.98 36.41 14.99 40   0.25
G4 17.36 14.59 34.18 38.50 15.46 40   0.25

Covariance Structure: homoscedastic
     FL   RW    CL    CW    BD
FL 9.93 6.93 21.39 24.07  9.78
RW      5.19 15.13 17.05  6.90
CL           46.74 52.54 21.31
CW                 59.22 23.96
BD                        9.85

Canonical Coefficients:
      dim1    dim2    dim3
FL  1.6105  0.3357 -1.8263
RW  0.5383  1.6172  0.5305
CL  0.3748 -1.2881  0.7397
CW -1.6470  0.7206 -0.6705
BD  1.2708 -0.4509  1.3253

Singular Values:
 dim1  dim2  dim3
19.54 13.34 2.587

Constants:
    G1    G2     G3     G4
-15.74 -25.3 -17.64 -40.29

Linear Coefficients:
       G1     G2    G3    G4
FL  6.464   7.55 13.55 17.22
RW  2.311   7.95  3.56 10.48
CL -7.641 -10.37 -3.53 -9.10
CW  7.189   7.06 -3.05 -0.87
BD -8.518  -7.45 -0.22 -1.92

Canonical Correlations:
     Canonical.Corr Likelihood.Ratio Chi.square df     Pr
dim1         0.9381           0.0240      576.1 15 0.0000
dim2         0.8797           0.2003      248.6  8 0.0000
dim3         0.3377           0.8860       18.8  3 0.0003

Eigenvalues:
     Eigenvalue Difference Proportion Cumulative
dim1      7.341      3.919     0.6740     0.6740
dim2      3.423      3.294     0.3142     0.9882
dim3      0.129                0.0118     1.0000

Tests for Homogeneity of Covariances:
      Statistic df Pr
Box.M     226.3 45  0
adj.M     213.4 45  0

Tests for the Equality of Means:
Group Variable: sp.sex
                       Statistics     F df1 df2 Pr
Wilks Lambda                0.024  80.1  15 420  0
Pillai Trace                1.768  44.2  15 462  0
Hotelling-Lawley Trace     10.893 109.4  15 452  0
Roy Greatest Root           7.341 226.1   5 154  0
* Tests assume covariance homoscedasticity. The F statistic for
  Roy's Greatest Root is an upper bound.

Hotelling's T Squared for Differences in Means Between Each Group:
          F df1 df2 Pr
G1-G2  39.0   5 152  0
G1-G3 121.4   5 152  0
G1-G4 182.2   5 152  0
G2-G3 141.6   5 152  0
G2-G4 105.5   5 152  0
G3-G4  72.6   5 152  0

Plug-in classification table:
        G1 G2 G3 G4 Error Posterior.Error
G1      36  4  0  0  0.10          0.1062
G2       2 38  0  0  0.05          0.0190
G3       0  0 40  0  0.00         -0.0360
G4       0  0  2 38  0.05          0.0685
Overall              0.05          0.0394
(from=rows, to=columns)

Rule Mean Square Error: 0.06431 (conditioned on the training data)

Cross-validation table:
        G1 G2 G3 G4  Error Posterior.Error
G1      36  4  0  0 0.1000          0.1107
G2       2 38  0  0 0.0500          0.0213
G3       0  0 40  0 0.0000         -0.0528
G4       0  0  3 37 0.0750          0.0844
Overall             0.0563          0.0409
(from=rows, to=columns)

Predict the values of the test set:

  crabs.lda.pred <- predict(crabs.lda, crabs.test)

This creates a new data set that has the predicted sp.sex group and the probability of membership in each species-sex group, e.g.

  Groups    G1    G2    G3    G4
      G1 0.999 0.001 0.000 0.000
      G1 0.994 0.006 0.000 0.000
      G1 1.000 0.000 0.000 0.000
      G1 0.787 0.213 0.000 0.000
      G1 0.992 0.008 0.000 0.000

For each of these cases the predicted class is G1, and the probability of membership in each of the 4 classes is given in the four columns. For the 4th case there is some doubt whether it might be group 2.

These predicted classes need to be cross-tabulated with the true class to get the misclassification table:

sp.sex|predicted
      |G1    |G2    |G3    |G4    |RowTotl|
------+------+------+------+------+-------+
1     |10    | 0    | 0    | 0    |10     |
      |1     |0     |0     |0     |0.25   |
      |0.91  |0     |0     |0     |       |
      |0.25  |0     |0     |0     |       |
------+------+------+------+------+-------+
2     | 1    | 9    | 0    | 0    |10     |
      |0.1   |0.9   |0     |0     |0.25   |
      |0.091 |1     |0     |0     |       |
      |0.025 |0.22  |0     |0     |       |
------+------+------+------+------+-------+
3     | 0    | 0    |10    | 0    |10     |
      |0     |0     |1     |0     |0.25   |
      |0     |0     |0.91  |0     |       |
      |0     |0     |0.25  |0     |       |
------+------+------+------+------+-------+
4     | 0    | 0    | 1    | 9    |10     |
      |0     |0     |0.1   |0.9   |0.25   |
      |0     |0     |0.091 |1     |       |
      |0     |0     |0.025 |0.22  |       |
------+------+------+------+------+-------+
ColTotl|11   |9     |11    |9     |40     |
       |0.28 |0.22  |0.28  |0.22  |       |
-------+-----+------+------+------+-------+

(Each cell shows the count, the row proportion, the column proportion, and the overall proportion.) This gives an error rate of 2/40 = 0.05, or 5%.

The plot below shows the data plotted in the three discriminant coordinates. The first two coordinates appear to be sufficient to separate the four groups as well as possible. It should be possible to re-create this plot using the canonical coordinates.

[Figure: pairwise scatterplots of the training data in the discriminant coordinates dim1, dim2 and dim3, with the four groups labelled G1-G4.]

Re-creating the plots above using the canonical coordinates:

> d1 <- c(1.61, 0.54, 0.37, -1.65, 1.27)
> d2 <- c(0.34, 1.62, -1.29, 0.72, -0.45)
> dc <- cbind(d1, d2)
> dc
        d1    d2
[1,]  1.61  0.34
[2,]  0.54  1.62
[3,]  0.37 -1.29
[4,] -1.65  0.72
[5,]  1.27 -0.45
> crabs.lda.proj <- as.matrix(australian.crabs[,5:9]) %*% as.matrix(dc)
> dim(crabs.lda.proj)
> plot(crabs.lda.proj[,1], crabs.lda.proj[,2], type="n",
+   xlab="Discrim 1", ylab="Discrim 2")
> points(crabs.lda.proj[sp.sex==1,1], crabs.lda.proj[sp.sex==1,2], pch="1")
> points(crabs.lda.proj[sp.sex==2,1], crabs.lda.proj[sp.sex==2,2], pch="2")
> points(crabs.lda.proj[sp.sex==3,1], crabs.lda.proj[sp.sex==3,2], pch="3")
> points(crabs.lda.proj[sp.sex==4,1], crabs.lda.proj[sp.sex==4,2], pch="4")

[Figure: the data projected onto the first two canonical coordinates, plotted with the group number as the plotting symbol. The four groups separate as in the plot above.]

Yes, this will do it.
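The misclassification table can also be computed without the crosstabs GUI. A minimal sketch, assuming the MASS lda() fit sketched earlier (where predict() returns the predicted classes in the $class component; the S-Plus discrim() predictions store them in the Groups column instead), and that sp.sex is a column of the test data frame:

  # Cross-tabulate the true class against the predicted class
  tab <- table(crabs.test$sp.sex, crabs.lda.pred$class)
  tab

  # Overall test error rate: the off-diagonal proportion
  1 - sum(diag(tab)) / sum(tab)   # 2/40 = 0.05 here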
Quadratic Discriminant Analysis

On the discriminant analysis control panel select Heterogeneous variance-covariance. You'll need to use the Classical option. Here are the results:

*** Discriminant Analysis ***

Call:
discrim(structure(.Data = sp.sex ~ FL + RW + CL + CW + BD,
    class = "formula"), data = crabs.train,
    family = Classical(cov.structure = "heteroscedastic"),
    na.action = na.omit, prior = "proportional")

Group means:
      FL    RW    CL    CW    BD  N Priors
G1 14.53 11.50 31.26 35.93 13.05 40   0.25
G2 13.34 12.18 28.23 32.78 11.86 40   0.25
G3 16.25 12.06 32.98 36.41 14.99 40   0.25
G4 17.36 14.59 34.18 38.50 15.46 40   0.25

Covariance Structure: heteroscedastic

Group: G1
      FL   RW    CL    CW    BD
FL 10.78 6.85 24.70 28.17 10.79
RW      4.67 15.93 18.20  6.91
CL           57.01 65.02 24.82
CW                 74.29 28.36
BD                       10.94

Group: G2
     FL   RW    CL    CW    BD
FL 6.94 6.17 15.38 17.73  7.05
RW      5.72 13.87 15.96  6.37
CL           34.51 39.68 15.88
CW                 45.74 18.26
BD                        7.46

Group: G3
      FL   RW    CL    CW    BD
FL 12.67 7.85 27.35 30.08 12.69
RW      5.02 17.11 18.85  7.92
CL           59.57 65.50 27.59
CW                 72.16 30.36
BD                       12.87

Group: G4
     FL   RW    CL    CW    BD
FL 9.34 6.86 18.15 20.30  8.58
RW      5.38 13.62 15.19  6.41
CL           35.89 39.95 16.93
CW                 44.70 18.87
BD                        8.12

Constants:
    G1     G2    G3     G4
-49.29 -18.88 -50.8 -23.74

Linear Coefficients:
       G1     G2    G3     G4
FL  27.71  8.958 13.47  2.017
RW  18.64  6.185 27.04  7.769
CL -12.90 -2.317 -4.97 -1.524
CW  -0.16 -0.922 -5.64  0.538
BD  -8.23 -4.965 -4.81 -4.432

Quadratic coefficients:

group: G1
       FL     RW     CL     CW     BD
FL -8.068 -1.759  3.187 -0.040  1.943
RW        -2.914  0.764  0.854 -0.371
CL               -5.953  4.065 -0.656
CW                      -4.446  1.801
BD                             -4.907

group: G2
       FL     RW     CL     CW     BD
FL -8.411 -1.182  1.761  2.839 -1.738
RW        -3.898  1.508  0.791 -0.701
CL               -7.173  4.175  2.103
CW                      -5.240  0.581
BD                             -3.725

group: G3
       FL     RW     CL     CW     BD
FL -4.935 -0.367  1.442  0.335  1.211
RW        -5.177  0.432  1.270 -0.373
CL               -5.100  3.248  1.586
CW                      -3.817  0.928
BD                             -6.596

group: G4
       FL     RW     CL     CW     BD
FL -4.541 -0.998  0.239  2.178  0.024
RW        -2.730  0.767  0.845 -0.355
CL               -3.525  2.150  1.495
CW                      -3.387  0.422
BD                             -3.904

Tests for Homogeneity of Covariances:
      Statistic df Pr
Box.M     226.3 45  0
adj.M     213.4 45  0

Hotelling's T Squared for Differences in Means Between Each Group:
          F df1   df2 Pr
G1-G2  34.7   5 72.39  0
G1-G3 120.0   5 73.67  0
G1-G4 120.2   5 68.19  0
G2-G3 141.4   5 70.88  0
G2-G4 119.0   5 66.20  0
G3-G4  73.7   5 66.67  0
* df2 is Yao's approximation.

Pairwise Generalized Squared Distances:
       G1    G2    G3    G4
G1   0.00 30.58 44.45 77.51
G2  14.82  0.00 47.12 31.13
G3  53.58 85.95  0.00 25.15
G4 115.39 79.19 50.00  0.00

Kolmogorov-Smirnov Test for Normality:
   Statistic Probability
FL    0.0544      0.7301
RW    0.0487      0.8419
CL    0.0429      0.9303
CW    0.0497      0.8238
BD    0.0514      0.7920

Plug-in classification table:
        G1 G2 G3 G4 Error Posterior.Error
G1      38  2  0  0 0.050          0.0949
G2       1 39  0  0 0.025          0.0348
G3       0  0 40  0 0.000         -0.0180
G4       0  0  1 39 0.025          0.0286
Overall             0.025          0.0351
(from=rows, to=columns)

Rule Mean Square Error: 0.04505 (conditioned on the training data)

Cross-validation table:
        G1 G2 G3 G4  Error Posterior.Error
G1      35  5  0  0 0.1250          0.1171
G2       3 36  0  1 0.1000          0.0484
G3       0  0 40  0 0.0000         -0.0122
G4       0  0  1 39 0.0250          0.0149
Overall             0.0625          0.0421
(from=rows, to=columns)

Judging by the cross-validation error rates, quadratic discriminant analysis isn't an improvement over linear discriminant analysis for the crabs data. The test set can be predicted similarly to lda:

  crabs.qda.pred <- predict(crabs.qda, crabs.test)
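As with lda, there is a command-line analogue: qda() in the MASS library fits the heteroscedastic (separate variance-covariance) rule. A sketch, not the GUI's discrim() call:

  library(MASS)

  # Quadratic discriminant analysis: one covariance matrix per group
  crabs.qda <- qda(sp.sex ~ FL + RW + CL + CW + BD, data = crabs.train)

  # Predicted classes for the test set, tabulated against the truth
  crabs.qda.pred <- predict(crabs.qda, crabs.test)
  table(crabs.test$sp.sex, crabs.qda.pred$class)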
Classification Trees

Use the Tree, Tree Models menu items on the GUI. Use crabs.train as the data, sp.sex as the dependent variable and FL, ..., BD as the independent variables. Save the model object.

The fitting options provide control over the minimum number of observations required to split a node, the minimum number of cases allowed in a node, and the minimum deviance required to add a node. On the Results tab choose full tree and save the misclassification errors. On the Predict tab, use the crabs.test data, and use the response option.

*** Tree Model ***

Regression tree:
tree(formula = sp.sex ~ FL + RW + CL + CW + BD, data = crabs.train,
    na.action = na.exclude, mincut = 5, minsize = 10, mindev = 0.01)
Number of terminal nodes: 22
Residual mean deviance: 0.285 = 39.3 / 138
Distribution of residuals:
   Min. 1st Qu. Median  Mean 3rd Qu.  Max.
 -1.570   0.000  0.000 0.000   0.018 1.670

node), split, n, deviance, yval
      * denotes terminal node

  1) root 160 200.0 3
    2) RW<14.35 124 100.0 2
      4) CW<36.2 83 90.0 2
        8) BD<12.15 51 40.0 2
         16) CL<26.75 37 30.0 2
           32) BD<11.15 32 20.0 2
             64) CL<22.9 20 20.0 2
              128) BD<8.95 13 5.0 2
                256) RW<8.1 7 4.0 1 *
                257) RW>8.1 6 0.0 2 *
              129) BD>8.95 7 4.0 3 *
             65) CL>22.9 12 3.0 2
              130) RW<10.55 5 0.0 1 *
              131) RW>10.55 7 0.0 2 *
           33) BD>11.15 5 1.0 3 *
         17) CL>26.75 14 3.0 2
           34) RW<11.25 7 0.0 1 *
           35) RW>11.25 7 0.0 2 *
        9) BD>12.15 32 20.0 3
         18) RW<13.1 24 10.0 3
           36) CW<33.95 18 4.0 3
             72) RW<11.7 13 0.0 3 *
             73) RW>11.7 5 0.0 4 *
           37) CW>33.95 6 3.0 2 *
         19) RW>13.1 8 0.0 4 *
      5) CW>36.2 41 40.0 2
       10) FL<17.3 26 8.0 1
         20) RW<13.65 18 5.0 1
           40) CW<38.95 10 4.0 1
             80) RW<12 5 0.0 1 *
             81) RW>12 5 3.0 2 *
           41) CW>38.95 8 0.0 1 *
         21) RW>13.65 8 0.0 2 *
       11) FL>17.3 15 20.0 2
         22) CL<39.15 9 2.0 3 *
         23) CL>39.15 6 3.0 1 *
    3) RW>14.35 36 30.0 3
      6) CW<48.85 29 20.0 4
       12) FL<17.75 7 7.0 3 *
       13) FL>17.75 22 6.0 4
         26) CW<46.35 14 0.9 4 *
         27) CW>46.35 8 4.0 3 *
      7) CW>48.85 7 8.0 3 *

[Figure: the fitted tree, with the split RW<14.35 at the root and the splits and terminal-node predictions listed above.]

[Figure: pairwise scatterplots of the five predictor variables FL, RW, CL, CW and BD.]

Check the test error rate:

> table(crabs.test[,1], round(crabs.tree.pred$fit, digits=0))
  1 2 3 4
1 8 0 2 0
2 0 7 3 0
3 2 1 7 0
4 1 1 1 7

which gives an error rate of (2+3+2+1+1+1+1)/40 = 0.275, that is, 27.5%. Trees do much worse than linear discriminant analysis on these data.
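For reference, here is a command-line sketch of the same fit and test-set prediction, written for the tree library as it exists in R (where the control parameters are passed through tree.control, and predict() returns the fitted values directly rather than in a $fit component as in the S-Plus output above):

  library(tree)

  # Same control settings as in the GUI fit above. Because sp.sex is
  # numeric, this grows a regression tree, as in the output shown.
  crabs.tree <- tree(sp.sex ~ FL + RW + CL + CW + BD, data = crabs.train,
      control = tree.control(nobs = nrow(crabs.train),
          mincut = 5, minsize = 10, mindev = 0.01))

  # Draw the tree and label the splits
  plot(crabs.tree)
  text(crabs.tree)

  # Predict the test set, rounding fitted values to the nearest class
  crabs.tree.pred <- predict(crabs.tree, crabs.test)
  table(crabs.test$sp.sex, round(crabs.tree.pred, digits = 0))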