Project #3 Answers
STAT 873 Fall 2013

Complete the problems below. Within each part, include your R program output with code inside of it and any additional information needed to explain your answer. Note that you will need to edit your output and code in order to make it look nice after you copy and paste it into your Word document.

Background

Wheat producers want to identify kernels that are in poor condition after being harvested. To facilitate this identification process, categorization systems have been developed to partition kernels into different categories. For this example, we will look at the categories of "healthy", "sprout", and "scab". In summary,

i) Healthy is the preferred condition because these kernels have not been damaged.
ii) Sprout is less preferred than healthy because these kernels have reduced weight and poorer flour quality.
iii) Scab is less preferred than healthy because these kernels come from plants that have been infected by a disease and have undesirable qualities in their appearance.

Ideally, these categorizations would be made for each kernel through an automated process.

Data

To test a new automated system, 276 wheat kernels were classified by human examination (assumed to be perfect) and through the automated system. The automated system uses information about the class of the wheat kernel (soft red winter or hard red winter) and measurements of density, hardness, size, weight, and moisture for the kernel. The data is stored in my wheat_all.csv file available on my website. Below is how I read in the data:

> wheat<-read.csv(file = "C:\\data\\wheat_all.csv")
> head(wheat, n = 3)
  class density hardness size weight moisture    type
1   hrw    1.35    60.33 2.30  24.65    12.02 Healthy
2   hrw    1.29    56.09 2.73  33.30    12.17 Healthy
3   hrw    1.23    43.99 2.51  31.76    11.88 Healthy
> tail(wheat, n = 3)
    class density hardness size weight moisture type
274   srw    0.85    34.07 1.41  12.09    11.93 Scab
275   srw    1.18    60.98 1.06   9.48    12.24 Scab
276   srw    1.03    -9.57 2.06  23.82    12.65 Scab

The focus here is to develop methods that best differentiate among the kernel types using the physical characteristics of the kernel and the wheat class.

1) Perform an initial investigation into the data as follows.

a) (10 points) Examine the data using the appropriate graphical methods discussed earlier in the course. In your plots, determine if there may be ways to differentiate among kernel types. Also, examine observation #31 and compare it to the other observations.
> wheat<-read.csv(file = "C:\\chris\\wheat_all.csv")
> wheat[31,]
   class density hardness size weight moisture type
31   hrw    2.03   121.84 0.99   9.36    10.28 Scab
> table(wheat$type)
Healthy    Scab  Sprout
     96      84      96
> stars(x = wheat[order(wheat$type),-1], ncol = 20, key.loc = c(-2, 1),
    draw.segments = TRUE, label = NULL, cex = 0.75,
    main = "Wheat data ordered by type")

[Star plot: "Wheat data ordered by type"; key shows segments for density, hardness, size, weight, moisture, and type]

> wheat2<-data.frame(kernel = 1:nrow(wheat), wheat[,2:6],
    class.new = ifelse(test = wheat$class == "hrw", yes = 0, no = 1))
> head(wheat2)
  kernel density hardness size weight moisture class.new
1      1    1.35    60.33 2.30  24.65    12.02         0
2      2    1.29    56.09 2.73  33.30    12.17         0
3      3    1.23    43.99 2.51  31.76    11.88         0
4      4    1.34    53.82 2.27  32.71    12.11         0
5      5    1.26    44.39 2.35  26.07    12.06         0
6      6    1.30    48.12 2.49  33.30    12.19         0
> #Colors by condition:
> wheat.colors<-ifelse(test = wheat$type == "Healthy", yes = "black",
    no = ifelse(test = wheat$type == "Sprout", yes = "red", no = "green"))
> #Line type by condition:
> wheat.lty<-ifelse(test = wheat$type == "Healthy", yes = "solid",
    no = ifelse(test = wheat$type == "Sprout", yes = "longdash", no = "dotdash"))
> #Larger line width for kernel #31:
> kernel31<-ifelse(test = wheat2$kernel == 31, yes = 3, no = 1)
> library(MASS)  #parcoord() is in the MASS package
> parcoord(x = wheat2, col = wheat.colors, lty = wheat.lty, lwd = kernel31,
    main = "Parallel coordinate plot \n Kernel #31 is represented by a large line width")
> legend(locator(1), legend = c("Healthy", "Sprout", "Scab"),
    lty = c("solid", "longdash", "dotdash"), col = c("black", "red", "green"),
    cex = 0.8, bty = "n")

[Parallel coordinate plot: "Kernel #31 is represented by a large line width"; axes: kernel, density, hardness, size, weight, moisture, class.new]

Note that I created a new variable named class.new, which is 0 for hard red winter wheat and 1 for soft red winter wheat. This needs to be done because class consists of characters.

All plots show some separation of scab from the other kernel types (although not complete separation), indicating we may have some success differentiating scab kernels from the others. There is not as much separation between the healthy and sprout kernels. For example, there is a lot of overlap of the lines in the parallel coordinate plot for these two kernel types. This indicates that we may have difficulty differentiating between them.

From examining a number of plots, one can see that observation #31 is quite different from the rest! It has the largest density value by far. It also has the largest hardness.

Note that this data set is an example where the order in which observations are plotted in a parallel coordinate plot could make a difference in how you interpret it.
Below is an example where the kernels are first sorted by kernel type (green is plotted last):

> #Sort by wheat type
> wheat.colors2<-ifelse(test = wheat$type == "Healthy", yes = 1,
    no = ifelse(test = wheat$type == "Sprout", yes = 2, no = 3))
> wheat3<-data.frame(wheat.colors2, wheat2)
> parcoord(x = wheat3[order(wheat.colors2),], col = wheat.colors2[order(wheat.colors2)],
    lty = wheat.lty[order(wheat.colors2)],
    main = "Parallel coordinate plot for wheat data - sort by Type")
> legend(locator(1), legend = c("Healthy", "Sprout", "Scab"),
    lty = c("solid", "longdash", "dotdash"), col = c("black", "red", "green"),
    cex = 0.8, bty = "n")

[Parallel coordinate plot: "Parallel coordinate plot for wheat data - sort by Type"; axes: wheat.colors2, kernel, density, hardness, size, weight, moisture, class.new]

One could have missed where all of the green observations (especially for hardness) were located in the first parallel coordinate plot because they were underneath some of the red and black observations.

Using these plots, one can develop general ideas of which variables may be the best for classifying observations. For example, we see that the density, size, and weight of the kernels help to somewhat differentiate between scab and the other two kernel types. Also, we see that some variables, like hardness, may be of little help. PCA could also be used to examine the data. I will save the PCA part until later in the project.

b) (1 point) This data comes from an actual consulting problem that I worked on in the past. I pointed out observation #31 to the subject-matter researcher, and he concluded that this observation must be a result of a measurement error. For this reason, we decided to remove it from the data set. For the remainder of this project, use an altered version of the data set that has this observation removed. Show how this observation is removed and show that the number of observations is now 275.

> wheat.no31<-wheat[-31,]
> nrow(wheat.no31)
[1] 275

2) This portion of the project applies DA methods to differentiate between the kernel types.

a) (10 points) Fill in the table below using the appropriate DA methods:

                                              Proportion correct
DA method  Priors        Accuracy method   Healthy  Sprout  Scab  Overall
Linear     Proportional  Cross-validation
Quadratic  Proportional  Cross-validation

Also, provide the 3x3 classification tables. Which DA method performs the best? For what type of classifications do the methods perform the worst? Fully explain all of your answers.
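Before filling in the table, a note on where the entries below come from: lda() and qda() in the MASS package use the sample proportions of each kernel type as the prior probabilities by default (the "proportional" priors in the table), and specifying CV = TRUE produces leave-one-out cross-validation classifications. As a quick illustrative sketch (not part of the required output), the proportional priors are simply

> #Sample proportions of each kernel type = default ("proportional") priors for lda()/qda()
> table(wheat.no31$type)/nrow(wheat.no31)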
Below is the completed table:

                                              Proportion correct
DA method  Priors        Accuracy method   Healthy  Sprout  Scab    Overall
Linear     Proportional  Cross-validation  0.7292   0.5417  0.7470  0.6691
Quadratic  Proportional  Cross-validation  0.6875   0.5938  0.7831  0.6836

Below are my classification tables:

> #Summarize classifications: counts, row proportions, and overall proportion correct
> summarize.class<-function(original, classify) {
    class.table<-table(original, classify)
    numb<-rowSums(class.table)
    prop<-round(class.table/numb, 4)
    overall<-round(sum(diag(class.table))/sum(class.table), 4)
    list(class.table = class.table, prop = prop, overall.correct = overall)
  }
> library(MASS)
> DA2<-lda(formula = type ~ class + density + hardness + size + weight + moisture,
    data = wheat.no31, CV = TRUE)
> DA4<-qda(formula = type ~ class + density + hardness + size + weight + moisture,
    data = wheat.no31, CV = TRUE)
> lda.accuracy<-summarize.class(original = wheat.no31$type, classify = DA2$class)
> qda.accuracy<-summarize.class(original = wheat.no31$type, classify = DA4$class)
> lda.accuracy
$class.table
         classify
original  Healthy Scab Sprout
  Healthy      70    5     21
  Scab         11   62     10
  Sprout       24   20     52

$prop
         classify
original  Healthy   Scab Sprout
  Healthy  0.7292 0.0521 0.2188
  Scab     0.1325 0.7470 0.1205
  Sprout   0.2500 0.2083 0.5417

$overall.correct
[1] 0.6691

> qda.accuracy
$class.table
         classify
original  Healthy Scab Sprout
  Healthy      66   10     20
  Scab          7   65     11
  Sprout       26   13     57

$prop
         classify
original  Healthy   Scab Sprout
  Healthy  0.6875 0.1042 0.2083
  Scab     0.0843 0.7831 0.1325
  Sprout   0.2708 0.1354 0.5938

$overall.correct
[1] 0.6836

Overall, we see that LDA and QDA have some ability to differentiate between the different kernel types. The largest number of errors occurs with the sprout and healthy classifications – both with sprout being misclassified as healthy and vice versa. The least amount of classification error occurs with healthy kernels being misclassified as scab. QDA leads to very similar results to LDA. For reasons of parsimony, then, I would likely use LDA. However, if the extra amount of accuracy potentially provided by QDA is important to the subject-matter researcher, I would use QDA.

b) (8 points) The DA homework shows a scatter plot where there are two plotting points for each observation. The smaller point denotes the original population for the observation and the larger point denotes the classification. Construct a similar plot here, but use the first two PCs as the plotting axes. Interpret the plot in the context of what the 3x3 classification table in a) gives as the correct and incorrect classification rates. Use the LDA cross-validation classifications found in part a) for the plot.
> #class.new (0 = hrw, 1 = srw) must be added to wheat.no31 for the formula and
> #  column indices below to work
> wheat.no31$class.new<-ifelse(test = wheat.no31$class == "hrw", yes = 0, no = 1)
> save<-princomp(formula = ~ density + hardness + size + weight + moisture + class.new,
    data = wheat.no31, cor = TRUE, scores = TRUE)
> summary(save, loadings = TRUE, cutoff = 0.0)
Importance of components:
                         Comp.1    Comp.2    Comp.3    Comp.4     Comp.5     Comp.6
Standard deviation     1.471919 1.3133732 0.9591963 0.8444315 0.53339093 0.43689493
Proportion of Variance 0.361091 0.2874915 0.1533429 0.1188441 0.04741765 0.03181286
Cumulative Proportion  0.361091 0.6485825 0.8019254 0.9207695 0.96818714 1.00000000

Loadings:
          Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
density   -0.287  0.308  0.618  0.653 -0.043  0.115
hardness   0.361  0.238  0.662 -0.522  0.186 -0.260
size      -0.441  0.459 -0.086 -0.419  0.237  0.597
weight    -0.559  0.325 -0.156 -0.135 -0.159 -0.717
moisture  -0.359 -0.494  0.352 -0.328 -0.604  0.175
class.new -0.390 -0.537  0.156  0.006  0.719 -0.134

> #Replace princomp()'s divisor-n standard deviations with the usual divisor-(n-1)
> #  versions so that predict() scales the data accordingly
> save$scale<-apply(X = wheat.no31[,c(2:6,8)], MARGIN = 2, FUN = sd)
> score.cor<-predict(save, newdata = wheat.no31[,c(2:6,8)])
> head(score.cor)
      Comp.1    Comp.2    Comp.3      Comp.4     Comp.5      Comp.6
1  0.4498091 0.9670625 1.6347757 -0.03223047 -0.6494602  0.38456134
2 -0.4987891 1.5106507 1.0292678 -0.78907466 -0.6696128  0.12537595
3 -0.1694543 1.0656393 0.4728126 -0.59576666 -0.7210953  0.03432551
4 -0.1722994 1.1679671 1.2920498 -0.08442183 -0.8937361 -0.32100839
5  0.2841531 0.7124044 0.7953254 -0.24934765 -0.7444673  0.39286627
6 -0.4135285 1.2352344 0.9291961 -0.38554676 -0.8487791 -0.08053882

> par(pty = "s")
> original.pch<-ifelse(test = wheat.no31$type == "Healthy", yes = 1,
    no = ifelse(test = wheat.no31$type == "Sprout", yes = 2, no = 5))
> original.color<-ifelse(test = wheat.no31$type == "Healthy", yes = "black",
    no = ifelse(test = wheat.no31$type == "Sprout", yes = "red", no = "green"))
> plot(x = score.cor[,1], y = score.cor[,2], pch = original.pch, col = original.color,
    cex = 0.75, xlab = "Principal component 1", ylab = "Principal component 2",
    main = "PC score plot \n Classified (large points) overlaid on the original (small points)")
> abline(h = 0, lty = 1, lwd = 2)
> abline(v = 0, lty = 1, lwd = 2)
> classify.pch<-ifelse(test = DA2$class == "Healthy", yes = 1,
    no = ifelse(test = DA2$class == "Sprout", yes = 2, no = 5))
> classify.color<-ifelse(test = DA2$class == "Healthy", yes = "black",
    no = ifelse(test = DA2$class == "Sprout", yes = "red", no = "green"))
> points(x = score.cor[,1], y = score.cor[,2], pch = classify.pch,
    col = classify.color, cex = 1.5)
> legend(locator(1), legend = c("Healthy", "Sprout", "Scab"), pch = c(1,2,5),
    col = c("black", "red", "green"), cex = 1, bty = "n")

[PC score plot: "LDA classified (large points) overlaid on original (small points)"; Principal component 1 vs. Principal component 2]

Overall, the first two PCs account for 65% of the total variation in the data. While this value is not extremely large, it does represent a decent proportion of the information available in the data. Plots like this can also be done using a bubble plot in order to include the third PC. I decided to keep it to two PCs in the project to correspond to what was done in the homework.

We can immediately see why LDA performs better when classifying the scab kernels than when classifying the healthy and sprout kernels. The scab kernels are more separated from healthy and sprout than those two kernel types are from each other. Thus, kernels with positive PC #1 and negative PC #2 values tend to be correctly classified as scab.
The loadings for the PCs provide some insight into why this is true. In part 1) of this project, we saw that scab kernels tend to have smaller density, size, and weight than the other kernel types. The coefficients on density, size, and weight are negative for PC #1 and positive for PC #2. This means that small density, size, and weight values will tend to make PC #1 larger (given the other loadings for this PC) and PC #2 smaller (given the other loadings for this PC) for scab kernels in comparison to the other kernel types.

3) This portion of the project applies NNC methods to differentiate between the kernel types.

a) (8 points) Determine an appropriate value for K using cross-validation. Set a seed number of 7771 before using NNC so that I can duplicate your results.

> library(class)
> class.new<-ifelse(test = wheat.no31$class == "hrw", yes = 0, no = 1)
> Z<-scale(cbind(wheat.no31[,2:6], class.new))
> head(Z)
    density  hardness      size     weight  moisture  class.new
1 1.2299602 1.2708682 0.1939024 -0.3602040 0.4071616 -0.9590205
2 0.7725714 1.1158760 1.0710624  0.7326285 0.4809390 -0.9590205
3 0.3151825 0.6735634 0.6222829  0.5380664 0.3383028 -0.9590205
4 1.1537287 1.0328967 0.1327052  0.6580885 0.4514281 -0.9590205
5 0.5438769 0.6881853 0.2958978 -0.1808026 0.4268356 -0.9590205
6 0.8488028 0.8245346 0.5814847  0.7326285 0.4907760 -0.9590205

> set.seed(7771)
> save.results.cv<-matrix(data = NA, nrow = 40, ncol = 5)
> for (K in 1:40) {
    NNC.cv<-knn.cv(train = Z, cl = wheat.no31$type, k = K, prob = TRUE)
    NNC.cv.accuracy<-summarize.class(original = wheat.no31$type, classify = NNC.cv)
    save.results.cv[K,]<-c(K, NNC.cv.accuracy$prop[1,1], NNC.cv.accuracy$prop[2,2],
      NNC.cv.accuracy$prop[3,3], NNC.cv.accuracy$overall.correct)
  }
> head(save.results.cv)
     [,1]   [,2]   [,3]   [,4]   [,5]
[1,]    1 0.5833 0.6386 0.5000 0.5709
[2,]    2 0.5625 0.5904 0.4896 0.5455
[3,]    3 0.5521 0.6386 0.4792 0.5527
[4,]    4 0.5312 0.6627 0.5000 0.5600
[5,]    5 0.5833 0.6627 0.5104 0.5818
[6,]    6 0.6562 0.6386 0.5521 0.6145

> plot(x = save.results.cv[,1], y = save.results.cv[,2], ylim = c(0, 1),
    main = "Cross-validation", panel.first = grid(), type = "o", col = "red",
    xlab = "K", ylab = "Accuracy")
> points(x = save.results.cv[,1], y = save.results.cv[,3], type = "o", col = "blue")
> points(x = save.results.cv[,1], y = save.results.cv[,4], type = "o", col = "green")
> points(x = save.results.cv[,1], y = save.results.cv[,5], type = "o", col = "black")
> legend(x = 5, y = 1, legend = c("Correct healthy", "Correct scab", "Correct sprout",
    "Correct overall"), col = c("red", "blue", "green", "black"), bty = "n",
    cex = 0.75, lty = c(1,1,1,1), pch = c(1,1,1,1))

[Plot: "Cross-validation"; accuracy (proportion correct) vs. K for correct healthy, correct scab, correct sprout, and correct overall]

> max(save.results.cv[,5])
[1] 0.6255
> save.results.cv[save.results.cv[,5] == max(save.results.cv[,5]),]
     [,1]   [,2]   [,3]   [,4]   [,5]
[1,]   11 0.6562 0.7108 0.5208 0.6255
[2,]   13 0.6875 0.6747 0.5208 0.6255
[3,]   15 0.6979 0.6506 0.5312 0.6255
[4,]   19 0.6667 0.6747 0.5417 0.6255
> data.frame(K = 1:15, overall = save.results.cv[1:15,5])
    K overall
1   1  0.5709
2   2  0.5455
3   3  0.5527
4   4  0.5600
5   5  0.5818
6   6  0.6145
7   7  0.6073
8   8  0.6145
9   9  0.6073
10 10  0.6073
11 11  0.6255
12 12  0.6145
13 13  0.6255
14 14  0.6073
15 15  0.6255

The proportion of correct classifications is approximately the same once K ≥ 6. For this reason, I will choose K = 6.
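As an aside, knn.cv() produces only leave-one-out classifications of the training data. If a brand new kernel needed to be classified with the chosen K = 6, knn() from the class package would be used instead. Below is a minimal sketch; the new kernel's measurement values are hypothetical and invented purely for illustration, and the new observation must be standardized with the training data's means and standard deviations (which scale() stores as attributes of Z):

> #Hypothetical new kernel -- the measurement values below are for illustration only
> new.kernel<-data.frame(density = 1.20, hardness = 50, size = 2.1,
    weight = 28, moisture = 12, class.new = 0)
> #Standardize with the training means and standard deviations saved by scale()
> Z.new<-scale(new.kernel, center = attr(Z, "scaled:center"),
    scale = attr(Z, "scaled:scale"))
> knn(train = Z, test = Z.new, cl = wheat.no31$type, k = 6, prob = TRUE)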
b) (6 points) With the value of K chosen in a), perform NNC with cross-validation and provide the 3x3 classification table. For what type of classifications do the methods perform the worst? Set a seed number of 6126 before using knn.cv() so that I can duplicate your results.

> set.seed(6126)
> NNC6<-knn.cv(train = Z, cl = wheat.no31$type, k = 6, prob = TRUE)
> NNC6.accuracy<-summarize.class(original = wheat.no31$type, classify = NNC6)
> NNC6.accuracy
$class.table
         classify
original  Healthy Scab Sprout
  Healthy      61    7     28
  Scab          7   59     17
  Sprout       34    8     54

$prop
         classify
original  Healthy   Scab Sprout
  Healthy  0.6354 0.0729 0.2917
  Scab     0.0843 0.7108 0.2048
  Sprout   0.3542 0.0833 0.5625

$overall.correct
[1] 0.6327

Overall, we see that NNC has some ability to differentiate between the different kernel types. The largest number of errors occurs with the sprout and healthy classifications – both with sprout being misclassified as healthy and vice versa. The least amount of classification error occurs with healthy kernels being misclassified as scab.

c) (6 points) Construct a similar plot as done in part 2b with the NNC classifications obtained in b). Interpret the plot in the context of what the 3x3 classification table in b) gives as the correct and incorrect classification rates. Compare the plot for this problem to the one found in part 2b.

> plot(x = score.cor[,1], y = score.cor[,2], pch = original.pch, col = original.color,
    cex = 0.75, xlab = "Principal component 1", ylab = "Principal component 2",
    main = "PC score plot \n NNC classified (large points) overlaid on original (small points)")
> abline(h = 0, lty = 1, lwd = 2)
> abline(v = 0, lty = 1, lwd = 2)
> classify.pch<-ifelse(test = NNC6 == "Healthy", yes = 1,
    no = ifelse(test = NNC6 == "Sprout", yes = 2, no = 5))
> classify.color<-ifelse(test = NNC6 == "Healthy", yes = "black",
    no = ifelse(test = NNC6 == "Sprout", yes = "red", no = "green"))
> points(x = score.cor[,1], y = score.cor[,2], pch = classify.pch,
    col = classify.color, cex = 1.5)
> legend(locator(1), legend = c("Healthy", "Sprout", "Scab"), pch = c(1,2,5),
    col = c("black", "red", "green"), cex = 1, bty = "n")

[PC score plot: "NNC classified (large points) overlaid on original (small points)"; Principal component 1 vs. Principal component 2]

The overall results regarding scab classifications doing better than the healthy and sprout classifications are similar here to what we saw for LDA. Overall, most classifications appear to be the same for NNC and LDA. We do see a few differences here when compared to the results from LDA. For example, the upper left portion of the plot (circled in blue in the figure) contains two sprout kernels whose nearest other kernels are healthy. This leads to a healthy classification by NNC; however, these kernels were correctly classified by LDA. Of course, one needs to be careful with this judgment of "nearest" here because PCA is used with only 65% of the total variation in the data being accounted for.
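To quantify the statement that most classifications appear to be the same for NNC and LDA, one could cross-tabulate the two sets of cross-validation classifications directly. Below is a minimal sketch using DA2$class and NNC6 from earlier; it is illustrative only, and the counts obtained will depend on the seed numbers used:

> #Agreement between the LDA and NNC cross-validation classifications
> agree<-table(LDA = DA2$class, NNC = NNC6)
> agree
> #Proportion of kernels given the same classification by both methods
> sum(diag(agree))/sum(agree)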