Project #3 Answers
STAT 873
Fall 2013
Complete the problems below. Within each part, include your R code and output, along with
any additional information needed to explain your answer. Note that you will need to edit your
code and output to make them look nice after you copy and paste them into your Word
document.
Background
Wheat producers want to identify kernels that are in poor condition after being harvested. To
facilitate this identification process, categorization systems have been developed to partition
kernels into different categories. For this example, we will look at the categories of “healthy”,
“sprout”, or “scab”. In summary,
i) Healthy is the preferred condition because these kernels have not been damaged
ii) Sprout is less preferred than healthy because they have reduced weight and poorer flour
quality
iii) Scab is less preferred than healthy because these kernels come from plants that have
been infected by a disease and have undesirable qualities in their appearance
Ideally, these categorizations would be made for each kernel using an automated process.
Data
To test a new automated system, 276 wheat kernels were classified by human examination
(assumed to be perfect) and through the automated system. The automated system uses
information about the class of the wheat kernel (soft red winter or hard red winter) and
measurements for density, hardness, size, weight, and moisture for the kernel. The data is stored
in my wheat_all.csv file available on my website. Below is how I read in the data:
> wheat<-read.csv(file = "C:\\data\\wheat_all.csv")
> head(wheat, n = 3)
  class density hardness size weight moisture    type
1   hrw    1.35    60.33 2.30  24.65    12.02 Healthy
2   hrw    1.29    56.09 2.73  33.30    12.17 Healthy
3   hrw    1.23    43.99 2.51  31.76    11.88 Healthy
> tail(wheat, n = 3)
    class density hardness size weight moisture type
274   srw    0.85    34.07 1.41  12.09    11.93 Scab
275   srw    1.18    60.98 1.06   9.48    12.24 Scab
276   srw    1.03    -9.57 2.06  23.82    12.65 Scab
The focus here is to develop methods that best differentiate between the kernel types using the
physical characteristics of the kernel and the wheat class.
1) Perform an initial investigation into the data as follows.
a) (10 points) Examine the data using the appropriate graphical methods discussed earlier in the
course. In your plots, determine if there may be ways to differentiate among kernel types. Also,
examine observation #31 and compare it to the other observations.
> wheat<-read.csv(file = "C:\\chris\\wheat_all.csv")
> wheat[31,]
   class density hardness size weight moisture type
31   hrw    2.03   121.84 0.99   9.36    10.28 Scab
> table(wheat$type)

Healthy    Scab  Sprout
     96      84      96
> stars(x = wheat[order(wheat$type),-1], ncol = 20, key.loc = c(-2, 1),
draw.segments = TRUE, label = NULL, cex = 0.75, main = "Wheat data ordered by
type")
[Figure: star plot titled "Wheat data ordered by type", with segments for density, hardness,
size, weight, moisture, and type]
> wheat2<-data.frame(kernel = 1:nrow(wheat), wheat[,2:6],
    class.new = ifelse(test = wheat$class == "hrw", yes = 0, no = 1))
> head(wheat2)
  kernel density hardness size weight moisture class.new
1      1    1.35    60.33 2.30  24.65    12.02         0
2      2    1.29    56.09 2.73  33.30    12.17         0
3      3    1.23    43.99 2.51  31.76    11.88         0
4      4    1.34    53.82 2.27  32.71    12.11         0
5      5    1.26    44.39 2.35  26.07    12.06         0
6      6    1.30    48.12 2.49  33.30    12.19         0
> #Colors by condition:
> wheat.colors<-ifelse(test = wheat$type == "Healthy", yes = "black",
no = ifelse(test = wheat$type == "Sprout", yes = "red", no = "green"))
> #Line type by condition:
> wheat.lty<-ifelse(test = wheat$type == "Healthy", yes = "solid",
no = ifelse(test = wheat$type == "Sprout", yes = "longdash", no = "dotdash"))
> kernel31<-ifelse(test = wheat2$kernel == 31, yes = 3, no = 1)
> library(MASS)  #parcoord() is in the MASS package
> parcoord(x = wheat2, col = wheat.colors, lty = wheat.lty, lwd = kernel31, main =
    "Parallel coordinate plot \n Kernel #31 is represented by a large line width")
> legend(locator(1), legend = c("Healthy", "Sprout", "Scab"), lty = c("solid",
"longdash", "dotdash"), col = c("black", "red", "green"), cex = 0.8, bty = "n")
[Figure: parallel coordinate plot titled "Parallel coordinate plot / Kernel #31 is represented
by a large line width", with axes kernel, density, hardness, size, weight, moisture, and
class.new, and a legend for Healthy, Sprout, and Scab]
Note that I created a new variable named class.new which is 0 for hard red winter wheat and
1 for soft red winter wheat. This needs to be done because class consists of characters.
All plots show some separation of scab from the other kernel types (although not complete
separation) indicating we may have some success differentiating scab kernels from the others.
There is not as much separation between the healthy and sprout kernels. For example, there is
a lot of overlap of the lines in the parallel coordinate plot for these two kernel types. This
indicates that we may have difficulty differentiating between them.
From examining a number of plots, one can see that observation #31 is quite different from the
rest! It has the largest density value by far. It also has the largest hardness.
Note that this data set is an example where the order in which observations are plotted in a
parallel coordinate plot can make a difference in how you interpret it. Below is an example
where the kernels are first sorted by kernel type (green is plotted last):
> #Sort by wheat type
> wheat.colors2<-ifelse(test = wheat$type == "Healthy", yes = 1,
    no = ifelse(test = wheat$type == "Sprout", yes = 2, no = 3))
> wheat3<-data.frame(wheat.colors2, wheat2)
> parcoord(x = wheat3[order(wheat.colors2),], col =
wheat.colors2[order(wheat.colors2)], lty = wheat.lty[order(wheat.colors2)],
main = "Parallel coordinate plot for wheat data - sort by Type")
> legend(locator(1), legend = c("Healthy", "Sprout", "Scab"), lty = c("solid",
"longdash", "dotdash"), col = c("black", "red", "green"), cex = 0.8, bty = "n")
[Figure: "Parallel coordinate plot for wheat data - sort by Type", with axes wheat.colors2,
kernel, density, hardness, size, weight, moisture, and class.new, and a legend for Healthy,
Sprout, and Scab]
One could have missed where all of the green observations (especially for hardness) were
located in the first parallel coordinate plot because they were drawn underneath some of the
red and black observations.
Using these plots, one can develop general ideas of which variables may be the best for
classifying observations. For example, we see that the density, size, and weight of the kernels
help to somewhat differentiate scab from the other two kernel types. Also, we see that some
variables, like hardness, may be of little help.
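As one more quick graphical check (my own addition, not required by the problem),
side-by-side boxplots of a single variable by kernel type show these patterns directly. For
example, with density, the scab kernels should sit visibly lower than the other two types:
> #Boxplots of density by kernel type (a quick single-variable view)
> boxplot(formula = density ~ type, data = wheat, xlab = "Kernel type",
    ylab = "Density", main = "Density by kernel type")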
PCA could also be used to examine the data. I will save the PCA part until later in the project.
b) (1 point) This data comes from an actual consulting problem that I worked on in the past. I
pointed out observation #31 to the subject-matter researcher, and he concluded that this
observation must be a result of a measurement error. For this reason, we decided to remove it
from the data set. For the remainder of this project, use an altered version of the data set that
has this observation removed. Show how this observation is removed and show that the
number of observations is now 275.
> wheat.no31<-wheat[-31,]
> nrow(wheat.no31)
[1] 275
2) This portion of the project applies DA methods to differentiate between the kernel types.
a) (10 points) Fill in the table below using the appropriate DA methods:

                                                Proportion correct
  DA         Priors        Accuracy method      Healthy  Sprout   Scab   Overall
  Linear     Proportional  Cross-validation
  Quadratic  Proportional  Cross-validation

Also, provide the 3×3 classification tables. Which DA method performs the best? For what type
of classifications do the methods perform the worst? Fully explain all of your answers.
                                                Proportion correct
  DA         Priors        Accuracy method      Healthy  Sprout   Scab   Overall
  Linear     Proportional  Cross-validation      0.7292  0.5417  0.7470  0.6691
  Quadratic  Proportional  Cross-validation      0.6875  0.5938  0.7831  0.6836
Below are my classification tables:
> summarize.class<-function(original, classify) {
    class.table<-table(original, classify)  #Counts: true type vs. classification
    numb<-rowSums(class.table)              #Number of kernels per true type
    prop<-round(class.table/numb,4)         #Row-wise proportions
    overall<-round(sum(diag(class.table))/sum(class.table),4)  #Overall proportion correct
    list(class.table = class.table, prop = prop, overall.correct = overall)
  }
> library(MASS)
> DA2<-lda(formula = type ~ class + density + hardness + size + weight + moisture,
    data = wheat.no31, CV = TRUE)
> DA4<-qda(formula = type ~ class + density + hardness + size + weight + moisture,
    data = wheat.no31, CV = TRUE)
> lda.accuracy<-summarize.class(original = wheat.no31$type, classify = DA2$class)
> qda.accuracy<-summarize.class(original = wheat.no31$type, classify = DA4$class)
> lda.accuracy
$class.table
         classify
original  Healthy Scab Sprout
  Healthy      70    5     21
  Scab         11   62     10
  Sprout       24   20     52

$prop
         classify
original  Healthy   Scab Sprout
  Healthy  0.7292 0.0521 0.2188
  Scab     0.1325 0.7470 0.1205
  Sprout   0.2500 0.2083 0.5417

$overall.correct
[1] 0.6691
> qda.accuracy
$class.table
         classify
original  Healthy Scab Sprout
  Healthy      66   10     20
  Scab          7   65     11
  Sprout       26   13     57

$prop
         classify
original  Healthy   Scab Sprout
  Healthy  0.6875 0.1042 0.2083
  Scab     0.0843 0.7831 0.1325
  Sprout   0.2708 0.1354 0.5938

$overall.correct
[1] 0.6836
Overall, we see that LDA and QDA have some ability to differentiate between the different
kernel types. The largest number of errors occurs with the sprout and healthy classifications,
both with sprout being misclassified as healthy and vice versa. The least classification error
occurs with healthy kernels being misclassified as scab.
QDA leads to very similar results to LDA. For reasons of parsimony, then, I would likely use
LDA. However, if the extra accuracy potentially provided by QDA is important to the
subject-matter researcher, I would use QDA.
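As an informal supplement (my own addition, not part of the required answer), one could
compare the log-determinants of the within-type covariance matrices; if they are similar,
QDA's separate covariance matrices likely add little over LDA's pooled estimate:
> #Log-determinant of the covariance matrix for each kernel type
> sapply(X = split(x = wheat.no31[,2:6], f = wheat.no31$type),
    FUN = function(x) determinant(cov(x))$modulus)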
b) (8 points) The DA homework shows a scatter plot where there are two plotting points for each
observation. The smaller point denotes the original population for the observation and the
larger point denotes the classification. Construct a similar plot here, but now plot the first two
PCs for it. Interpret the plot in the context of what the 3×3 classification tables in a) give as the
correct and incorrect classification rates. Use the LDA cross-validation classifications found in
part a) for the plot.
> save<-princomp(formula = ~ density + hardness + size + weight + moisture +
class.new, data = wheat.no31, cor = TRUE, scores = TRUE)
> summary(save, loadings = TRUE, cutoff = 0.0)
Importance of components:
                          Comp.1    Comp.2    Comp.3    Comp.4     Comp.5     Comp.6
Standard deviation     1.471919 1.3133732 0.9591963 0.8444315 0.53339093 0.43689493
Proportion of Variance 0.361091 0.2874915 0.1533429 0.1188441 0.04741765 0.03181286
Cumulative Proportion  0.361091 0.6485825 0.8019254 0.9207695 0.96818714 1.00000000
Loadings:
          Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
density   -0.287  0.308  0.618  0.653 -0.043  0.115
hardness   0.361  0.238  0.662 -0.522  0.186 -0.260
size      -0.441  0.459 -0.086 -0.419  0.237  0.597
weight    -0.559  0.325 -0.156 -0.135 -0.159 -0.717
moisture  -0.359 -0.494  0.352 -0.328 -0.604  0.175
class.new -0.390 -0.537  0.156  0.006  0.719 -0.134
> #Replace the scale element so that predict() standardizes with the sample
> #  standard deviations (divisor n-1) rather than princomp()'s divisor-n values
> save$scale<-apply(X = wheat.no31[,c(2:6,8)], MARGIN = 2, FUN = sd)
> score.cor<-predict(save, newdata = wheat.no31[,c(2:6,8)])
> head(score.cor)
      Comp.1    Comp.2    Comp.3      Comp.4     Comp.5      Comp.6
1  0.4498091 0.9670625 1.6347757 -0.03223047 -0.6494602  0.38456134
2 -0.4987891 1.5106507 1.0292678 -0.78907466 -0.6696128  0.12537595
3 -0.1694543 1.0656393 0.4728126 -0.59576666 -0.7210953  0.03432551
4 -0.1722994 1.1679671 1.2920498 -0.08442183 -0.8937361 -0.32100839
5  0.2841531 0.7124044 0.7953254 -0.24934765 -0.7444673  0.39286627
6 -0.4135285 1.2352344 0.9291961 -0.38554676 -0.8487791 -0.08053882
> par(pty = "s")
> original.pch<-ifelse(test = wheat.no31$type == "Healthy", yes = 1,
no = ifelse(test = wheat.no31$type == "Sprout", yes = 2, no = 5))
> original.color<-ifelse(test = wheat.no31$type == "Healthy", yes = "black",
no = ifelse(test = wheat.no31$type == "Sprout", yes = "red", no = "green"))
> plot(x = score.cor[,1], y = score.cor[,2], pch = original.pch, col =
original.color, cex = 0.75, xlab = "Principal component 1", ylab = "Principal
component 2", main = "PC score plot \n Classified (large points) overlaid on
the original (small points)")
> abline(h = 0, lty = 1, lwd = 2)
> abline(v = 0, lty = 1, lwd = 2)
> classify.pch<-ifelse(test = DA2$class == "Healthy", yes = 1,
no = ifelse(test = DA2$class == "Sprout", yes = 2, no = 5))
> classify.color<-ifelse(test = DA2$class == "Healthy", yes = "black",
no = ifelse(test = DA2$class=="Sprout", yes = "red", no = "green"))
> points(x = score.cor[,1], y = score.cor[,2], pch = classify.pch, col =
classify.color, cex = 1.5)
> legend(locator(1), legend = c("Healthy", "Sprout", "Scab"), pch = c(1,2,5),
col = c("black", "red", "green"), cex=1, bty="n")
[Figure: "PC score plot / LDA classified (large points) overlaid on original (small points)",
with principal component 1 on the x-axis, principal component 2 on the y-axis, and a legend
for Healthy, Sprout, and Scab]
Overall, the first two PCs account for 65% of the total variation in the data. While this value is
not extremely large, it does provide a decent proportion of the information available in the data.
Plots like this can also be done using a bubble plot in order to include the third PC. I decided to
keep it to two PCs in the project to correspond to what was done in the homework.
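For illustration only, a rough sketch of that bubble-plot idea is below (my own addition; the
shift and scaling of PC #3 into a positive radius are arbitrary choices to keep the circles
readable):
> #Bubble plot sketch: circle radius driven by the (shifted) PC #3 score
> PC3.radius<-(score.cor[,3] - min(score.cor[,3]) + 0.1)/10
> symbols(x = score.cor[,1], y = score.cor[,2], circles = PC3.radius,
    inches = FALSE, fg = original.color, xlab = "Principal component 1",
    ylab = "Principal component 2", main = "PC score plot \n PC #3 shown as
    circle size")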
We can immediately see why LDA performs better when classifying the scab kernels than
when classifying the healthy and sprout kernels. The scab kernels are more separated from
healthy and sprout than those two kernel types are from each other. Thus, kernels with
positive PC #1 and negative PC #2 values tend to be correctly classified as scab. The
loadings for the PCs provide some insight into why this is true. In 1) of this project, we saw
that scab kernels tend to have smaller density, size, and weight than the other kernel types.
The coefficients on density, size, and weight are negative for PC #1 and positive for PC #2.
This means that small density, size, and weight values make scab kernels' scores larger for
PC #1 and smaller for PC #2 (given the other loadings for each PC) in comparison to the
other kernel types.
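A quick numerical check of that claim (my own addition) uses the type means:
> #Mean density, size, and weight by kernel type; scab should be smallest
> aggregate(cbind(density, size, weight) ~ type, data = wheat.no31, FUN = mean)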
3) This portion of the project applies NNC methods to differentiate between the kernel types.
a) (8 points) Determine an appropriate value for K using cross-validation. Set a seed number of
7771 before using NNC so that I can duplicate your results.
> library(class)
> class.new<-ifelse(test = wheat.no31$class == "hrw", yes = 0, no = 1)
> Z<-scale(cbind(wheat.no31[,2:6], class.new))
> head(Z)
    density  hardness      size     weight  moisture  class.new
1 1.2299602 1.2708682 0.1939024 -0.3602040 0.4071616 -0.9590205
2 0.7725714 1.1158760 1.0710624 0.7326285 0.4809390 -0.9590205
3 0.3151825 0.6735634 0.6222829 0.5380664 0.3383028 -0.9590205
4 1.1537287 1.0328967 0.1327052 0.6580885 0.4514281 -0.9590205
5 0.5438769 0.6881853 0.2958978 -0.1808026 0.4268356 -0.9590205
6 0.8488028 0.8245346 0.5814847 0.7326285 0.4907760 -0.9590205
> set.seed(7771)
> save.results.cv<-matrix(data = NA, nrow = 40, ncol = 5)
> for (K in 1:40) {
NNC.cv<-knn.cv(train = Z, cl = wheat.no31$type, k = K, prob = TRUE)
NNC.cv.accuracy<-summarize.class(original = wheat.no31$type, classify = NNC.cv)
save.results.cv[K,]<-c(K, NNC.cv.accuracy$prop[1,1], NNC.cv.accuracy$prop[2,2],
NNC.cv.accuracy$prop[3,3], NNC.cv.accuracy$overall.correct)
}
> head(save.results.cv)
     [,1]   [,2]   [,3]   [,4]   [,5]
[1,]    1 0.5833 0.6386 0.5000 0.5709
[2,]    2 0.5625 0.5904 0.4896 0.5455
[3,]    3 0.5521 0.6386 0.4792 0.5527
[4,]    4 0.5312 0.6627 0.5000 0.5600
[5,]    5 0.5833 0.6627 0.5104 0.5818
[6,]    6 0.6562 0.6386 0.5521 0.6145
> plot(x = save.results.cv[,1], y = save.results.cv[,2], ylim = c(0, 1), main =
"Cross-validation", panel.first = grid(), type = "o", col = "red", xlab = "K",
ylab = "Accuracy")
> points(x = save.results.cv[,1], y = save.results.cv[,3], ylim = c(0, 1), type =
"o", col = "blue")
> points(x = save.results.cv[,1], y = save.results.cv[,4], ylim = c(0, 1), type =
"o", col = "green")
> points(x = save.results.cv[,1], y = save.results.cv[,5], ylim = c(0, 1), type =
"o", col = "black")
> legend(x = 5, y = 1, legend = c("Correct healthy", "Correct scab", "Correct
sprout", "Correct overall"), col = c("red", "blue", "green", "black"), bty =
"n", cex = 0.75, lty = c(1,1,1,1), pch = c(1,1,1,1))
[Figure: "Cross-validation" accuracy (y-axis, 0 to 1) versus K (x-axis, 0 to 40), with lines
for correct healthy, correct scab, correct sprout, and correct overall]
> max(save.results.cv[,5])
[1] 0.6255
> save.results.cv[save.results.cv[,5] == max(save.results.cv[,5]),]
     [,1]   [,2]   [,3]   [,4]   [,5]
[1,]   11 0.6562 0.7108 0.5208 0.6255
[2,]   13 0.6875 0.6747 0.5208 0.6255
[3,]   15 0.6979 0.6506 0.5312 0.6255
[4,]   19 0.6667 0.6747 0.5417 0.6255
> data.frame(K = 1:15, overall = save.results.cv[1:15,5])
    K overall
1   1  0.5709
2   2  0.5455
3   3  0.5527
4   4  0.5600
5   5  0.5818
6   6  0.6145
7   7  0.6073
8   8  0.6145
9   9  0.6073
10 10  0.6073
11 11  0.6255
12 12  0.6145
13 13  0.6255
14 14  0.6073
15 15  0.6255
The proportion of correct classifications is approximately the same once K ≥ 6. For this reason,
I will choose K = 6.
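One way to make this choice reproducible (my own addition; the 0.02 tolerance is an arbitrary
choice) is to take the smallest K whose overall accuracy is within a small tolerance of the
best observed value:
> #Smallest K with overall accuracy within 0.02 of the maximum
> min(save.results.cv[save.results.cv[,5] >= max(save.results.cv[,5]) - 0.02, 1])
[1] 6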
b) (6 points) With the value of K chosen in a), perform NNC with cross-validation and provide the
3×3 classification table. For what type of classifications do the methods perform the worst?
Set a seed number of 6126 before using knn.cv() so that I can duplicate your results.
> set.seed(6126)
> NNC6<-knn.cv(train = Z, cl = wheat.no31$type, k = 6, prob = TRUE)
> NNC6.accuracy<-summarize.class(original = wheat.no31$type, classify = NNC6)
> NNC6.accuracy
$class.table
         classify
original  Healthy Scab Sprout
  Healthy      61    7     28
  Scab          7   59     17
  Sprout       34    8     54

$prop
         classify
original  Healthy   Scab Sprout
  Healthy  0.6354 0.0729 0.2917
  Scab     0.0843 0.7108 0.2048
  Sprout   0.3542 0.0833 0.5625

$overall.correct
[1] 0.6327
Overall, we see that NNC has some ability to differentiate between the different kernel types.
The largest number of errors occurs with the sprout and healthy classifications, both with
sprout being misclassified as healthy and vice versa. The least classification error occurs
with healthy kernels being misclassified as scab.
c) (6 points) Construct a similar plot as done in part 2b with the NNC classifications obtained in
b). Interpret the plot in the context of what the 3×3 classification table in b) gives as the correct
and incorrect classification rates. Compare the plot for this problem to the one found in part 2b.
> plot(x = score.cor[,1], y = score.cor[,2], pch = original.pch, col =
original.color, cex = 0.75, xlab = "Principal component 1", ylab = "Principal
component 2", main = "PC score plot \n NNC classified (large points) overlaid
on original (small points)")
> abline(h = 0, lty = 1, lwd = 2)
> abline(v = 0, lty = 1, lwd = 2)
> classify.pch<-ifelse(test = NNC6 == "Healthy", yes = 1,
    no = ifelse(test = NNC6 == "Sprout", yes = 2, no = 5))
> classify.color<-ifelse(test = NNC6 == "Healthy", yes = "black",
    no = ifelse(test = NNC6=="Sprout", yes = "red", no = "green"))
> points(x = score.cor[,1], y = score.cor[,2], pch = classify.pch, col =
classify.color, cex = 1.5)
> legend(locator(1), legend = c("Healthy", "Sprout", "Scab"), pch = c(1,2,5),
col = c("black", "red", "green"), cex=1, bty="n")
[Figure: "PC score plot / NNC classified (large points) overlaid on original (small points)",
with principal component 1 on the x-axis, principal component 2 on the y-axis, and a legend
for Healthy, Sprout, and Scab]
The overall results, with scab classifications doing better than the healthy and sprout
classifications, are similar to what we saw for LDA.
Overall, most classifications appear to be the same for NNC and LDA. We do see a few
differences here when compared to the results from LDA. For example, the upper left portion
of the plot (circled in blue in the original figure) contains two sprout kernels, but the nearest
other kernels are healthy. This leads to a healthy classification by NNC; however, these
kernels were correctly classified by LDA. Of course, one needs to be careful with this
judgment of "nearest" here because PCA is used, with only 65% of the total variation in the
data being accounted for.
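To quantify the visual impression that most classifications agree (my own addition), the two
sets of cross-validation classifications can be cross-tabulated; the diagonal counts the
agreements between LDA and NNC:
> #Agreement between the LDA and NNC cross-validation classifications
> table(LDA = DA2$class, NNC = NNC6)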