Homework 8 Key. Discriminant analysis
Three species were characterized by their seeds in terms of size, mass and color. You need to
create a discrimination rule to assign individuals whose species is not known to one of the
groups, based on measurements of the characteristics mentioned.
1. Perform a discriminant analysis in JMP, assuming multivariate normality, homogeneity of
within variance-covariance matrices, and equal priors.
Discriminant(X(:Species), Y(:Length, :mass, :RedRef, :GreenRef))
The default for JMP 5.1 is to assume equal priors and homogeneity of variance-covariance
matrices across groups. The output shows that the classification is quite good in the training
sample. SppA is completely different from the other two. The projection of the data on the first
two canonical variates shows that a linear function can separate SppB from SppC almost
perfectly, but there is no way to draw a straight line between the two sets of points without
having at least 3 points misclassified. Note that although the figure displays only two canonical
variables, the groups can differ in up to four dimensions, so the linear function separating any
two groups is a hyperplane in 4D.
[Figure: scores on the first two canonical variates (Canonical1 vs. Canonical2), with biplot rays for Length, mass, RedRef, and GreenRef; the SppA cloud sits far from the slightly overlapping SppB and SppC clouds.]
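For reference outside JMP, here is a minimal sketch of the same analysis in Python with scikit-learn; the file name "seeds.csv" is a hypothetical stand-in for an export of the homework data table. Equal priors and a pooled covariance matrix mirror the JMP defaults described above.

# Minimal sketch (not the JMP analysis itself): linear discriminant
# analysis with equal priors and a pooled covariance matrix.
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

data = pd.read_csv("seeds.csv")                      # hypothetical export of the data table
X = data[["Length", "mass", "RedRef", "GreenRef"]]   # predictors (Y in the JMP launch)
y = data["Species"]                                  # grouping variable (X in the JMP launch)

lda = LinearDiscriminantAnalysis(priors=[1/3, 1/3, 1/3])  # equal priors, pooled covariance
lda.fit(X, y)

# Scores on the canonical variates; the two columns correspond to the
# Canonical1 vs. Canonical2 plot above (3 groups give at most 2 variates).
canonical = lda.transform(X)
print(canonical[:5])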
2. Does the analysis use a linear or a quadratic discriminant function? When are linear and
quadratic discriminant functions more appropriate?
The analysis uses a linear function in this case because JMP is limited to linear discriminant
functions. Linear discriminant functions are appropriate when there is homogeneity of variance-covariance matrices (HOV) across groups; quadratic functions are more appropriate when the matrices differ across groups. In this example, we can see that HOV does not hold, because the red group (SppA) differs from the others in both the orientation and the size of its ellipsoid. The other two groups are more similar and closer to HOV. However, the red group is so far from the others that the heterogeneity of variance is practically irrelevant, and a linear classification will do almost as well as a quadratic one.
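To see the linear-versus-quadratic distinction concretely, a quadratic rule can be fit alongside the linear one. This sketch assumes the X and y objects from the sketch under question 1; QDA simply drops the HOV assumption by estimating a separate covariance matrix for each group.

# Sketch: compare linear and quadratic discriminant rules on the same data.
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

for model in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()):
    model.fit(X, y)                                  # X, y from the question 1 sketch
    accuracy = (model.predict(X) == y).mean()        # resubstitution accuracy
    print(type(model).__name__, accuracy)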
3. Obtain a table of Actual by Predicted membership for the training sample. How many
observations are misclassified by this re-substitution method?
Counts: Actual Rows by Predicted Columns

          Predicted
Actual    SppA   SppB   SppC
SppA        50      0      0
SppB         0     49      1
SppC         0      2     48
Only 3 plants are misclassified: one SppB is incorrectly classified as SppC, and 2 SppC are
misclassified as SppB. The overall error rate, using the training sample as the test sample, is 3 out of
150, or 2%. With equal priors this is calculated as 0.333*0 + 0.333*(1/50) + 0.333*(2/50) = 1/50 = 2%.
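As a check on that arithmetic, here is a minimal sketch of the prior-weighted error-rate calculation, using the counts from the table above.

# Sketch: overall error rate as a prior-weighted average of the
# per-group error rates (counts from the Actual-by-Predicted table).
import numpy as np

confusion = np.array([[50,  0,  0],   # actual SppA
                      [ 0, 49,  1],   # actual SppB
                      [ 0,  2, 48]])  # actual SppC
priors = np.array([1/3, 1/3, 1/3])

group_errors = 1 - np.diag(confusion) / confusion.sum(axis=1)  # [0, 1/50, 2/50]
overall = np.sum(priors * group_errors)
print(overall)  # 0.02, i.e., 2%, matching 3 misclassified out of 150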
4. What is the overall error rate using the hold-out method? Hold out the first 10 observations of
each species.
Select the first 10 rows of each species and exclude them by selecting Rows ->
Exclude/Unexclude. Run the discriminant analysis again and check that indeed this time you
have only 120 observations used. As you can see in the new output window, all rows are
classified, including the ones excluded from the training sample. Because the training sample is
different, though, the distances and probabilities are different.
All rows included:

[JMP classification summary output]

Thirty rows held out:

[JMP classification summary output]
The default JMP output shows the summary for the classification of ALL rows, including the
training sample. In order to obtain the table for just the hold-out sample, the exclusion of rows
has to be reversed. Select all rows excluded and Unexclude them; then, choose Rows -> Row
Selection -> Invert row selection and Exclude. Now you should have 120 rows excluded.
In the output window, select the option that saves the discriminant formulas to the data table.
Now the predictions are saved. These formulas and predictions assume that priors are equal. To
find out how the classification worked on the hold-out data, select Analyze -> Fit Y by X and put
the true spp in the X box and the predicted spp in the Y box.
Modify the contingency table output to show only the counts and see that no errors were made on
the hold-out data. The number of observations in the hold-out sample is very small relative to the
overall error rate (30 × 0.02 = 0.6 expected misclassifications), so the chance of actually observing
a mistake is small. In practical terms, I would use the error rate from the whole data set, lest the
hold-out method give a false sense that no classification errors are expected.
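The same hold-out calculation can be sketched outside JMP. This assumes the X and y objects from the question 1 sketch and that the rows are sorted by species, 50 per group, so the first 10 rows of each species can be indexed directly.

# Sketch: hold out the first 10 rows of each species, train on the
# remaining 120, and score only the 30 held-out rows.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

held_out = np.concatenate([np.arange(g * 50, g * 50 + 10) for g in range(3)])
train = np.setdiff1d(np.arange(150), held_out)

lda_holdout = LinearDiscriminantAnalysis(priors=[1/3, 1/3, 1/3])
lda_holdout.fit(X.iloc[train], y.iloc[train])

holdout_error = (lda_holdout.predict(X.iloc[held_out]) != y.iloc[held_out]).mean()
print(holdout_error)  # expected to be 0 here, as discussed above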
5. Select one observation in the listing and explain why it is placed in the category indicated on
the basis of posterior probabilities. Explain how priors and likelihoods are used to calculate
posteriors. Use an equation to explain.
First, note that the multivariate normal probability density function is

f(x) = \frac{1}{(2\pi)^{p/2} \, |\Sigma|^{1/2}} \; e^{-0.5\,(x-\mu)'\,\Sigma^{-1}\,(x-\mu)}

where p is the number of dimensions in the multivariate distribution.
If we use it to calculate the probability of observing a multivariate x from a population with
mean vector \mu and variance-covariance matrix \Sigma, we can write the likelihood of x as

L(x \mid \mu) = k \, e^{-0.5 D^2}

where D^2 = (x-\mu)'\,\Sigma^{-1}\,(x-\mu) is the squared Mahalanobis distance and
k = (2\pi)^{-p/2}\,|\Sigma|^{-1/2}. When we assume homogeneity of variance-covariance across
groups, the k is the same for all groups and for all observations, so it cancels out in the equation
for the posterior probability. That is the reason why the JMP formulas use Exp[–Dist(g)] for the
probabilities.
The equation for the posterior probability of an observation x coming from group g is

p(g \mid x) = \frac{p_g \, L(x \mid \mu_g)}{\sum_{i=1}^{m} p_i \, L(x \mid \mu_i)}

which, assuming equal priors (written here for group A), becomes

p(A \mid x) = \frac{0.3333 \, e^{-0.5 D_A^2}}{0.3333 \, e^{-0.5 D_A^2} + 0.3333 \, e^{-0.5 D_B^2} + 0.3333 \, e^{-0.5 D_C^2}} = \frac{e^{-0.5 D_A^2}}{e^{-0.5 D_A^2} + e^{-0.5 D_B^2} + e^{-0.5 D_C^2}}
For example, observation 73 is SppC, and with equal priors it is classified as C because the
formula above yields p(C | {6.3, 2.5, 4.9, 1.5}) = 0.81553, which is greater than the posteriors for
the other two groups.
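A minimal sketch of that posterior calculation, with hypothetical squared Mahalanobis distances standing in for the actual values for observation 73:

# Sketch: posteriors from squared Mahalanobis distances to each centroid.
# The constant k cancels under homogeneous covariance, so only the
# distances and the priors matter. Distances below are hypothetical.
import numpy as np

def posteriors(d2, priors):
    """p(g|x) = p_g exp(-0.5 D_g^2) / sum_i p_i exp(-0.5 D_i^2)."""
    w = np.asarray(priors) * np.exp(-0.5 * np.asarray(d2))
    return w / w.sum()

d2 = [50.0, 4.0, 1.0]                         # hypothetical D^2 to A, B, C
print(posteriors(d2, [1/3, 1/3, 1/3]))        # approx. [0.00, 0.18, 0.82] -> class C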
6. Classify the set of plants whose species are not known, listed in the "new data" worksheet.
Interpret the classification of the test data. Based on the fact that the new data are actually
observations 9, 19, 29, ..., 149 of the calibration data, are any observations misclassified?
We classify observations without known membership by adding them at the end of the data table
after saving the discrimination formulas. Select all cells in the new data and copy them. Then,
paste them at the end of the original data table. The formulas are automatically extended to the
new rows, giving the predicted groups. Once we learn their true membership we see that once
again none is misclassified.
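Outside JMP, the same scoring can be sketched with the fitted lda object from the question 1 sketch; "new_data.csv" is a hypothetical stand-in for the "new data" worksheet.

# Sketch: score the unknown plants, then check them against the
# calibration rows they came from.
import pandas as pd

new = pd.read_csv("new_data.csv")                      # hypothetical file
print(lda.predict(new[["Length", "mass", "RedRef", "GreenRef"]]))

# The new rows are observations 9, 19, ..., 149 (1-based) of the
# calibration data, i.e., indices 8, 18, ..., 148 in 0-based Python.
idx = list(range(8, 150, 10))
print((lda.predict(X.iloc[idx]) == y.iloc[idx]).all())  # True if none misclassified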
7. What are "prior probabilities?" Why are they considered in the classification process and in
the calculation of error rates? Explain the impact of priors by changing them to 0.1, 0.1, and 0.8 for
species A, B, and C, and repeating step 1.
Prior probability is the probability that an observation or object belongs to a given group,
assessed before the observable characteristics of the observation are measured. In practical
terms, the priors reflect the proportion of objects in each group in the population from which we
are randomly taking observations. They can also be the proportions of the groups in the set of
cases for which we are making predictions, if those are known. Note that the priors are independent
of the training sample, and that they do not affect the likelihood part of the formula. Thus, the
effect of priors can be explored both by using the JMP output and by creating a new formula.
Suppose that we need to make a prediction for a new observation that looks just like observation
120, but that comes from a set that is known to be composed of 10% A, 10% B, and 80% C. We
can change the priors in the JMP output and then see that the “new” observation that looks just
like 120 is classified as SppC, although
based on the statistical distance (Mahalanobis) this observation is closer to the centroid for B.
The change is due to the fact that a priori we now expect many more observations to be in C
than in B. The priors changed the decision rule to keep the overall error rate at a minimum.
The same results are obtained by creating a new column in JMP in which the original equal-priors
formula is modified to use the new priors.
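A sketch of the same prior adjustment, reusing the posteriors() helper from the question 5 sketch; the distances are hypothetical, chosen so that the observation is slightly closer to the B centroid than to C, as described for observation 120.

# Sketch: the same observation flips from B to C when the priors
# change from equal to (0.1, 0.1, 0.8).
import numpy as np

d2 = [40.0, 2.0, 3.5]                          # hypothetical D^2 to A, B, C
print(posteriors(d2, [1/3, 1/3, 1/3]))         # B wins: closest centroid
print(posteriors(d2, [0.1, 0.1, 0.8]))         # C wins: heavy prior on C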