Homework 8 Key. Discriminant analysis

Three species were characterized by their seeds in terms of size, mass, and color. You need to create a discrimination rule to assign individuals whose species is not known to one of the groups, based on measurements of the characteristics mentioned.

1. Perform a discriminant analysis in JMP, assuming multivariate normality, homogeneity of within variance-covariance matrices, and equal priors.

Discriminant(X( :Species), Y( :Length, :mass, :RedRef, :GreenRef))

The default for JMP 5.1 is to assume equal priors and homogeneity of variance-covariance matrices across groups. The output shows that the classification is quite good in the training sample. SppA is completely different from the other two. The projection of the data on the first two canonical variates shows that a linear function can separate SppB from SppC almost perfectly, but there is no way to draw a straight line between the two sets of points without having at least 3 points misclassified. Note that although the figure displays only two canonical variables, the groups can differ in up to four dimensions, so the linear function separating any two groups is a hyperplane in 4D.

[Figure: canonical plot of Canonical2 against Canonical1, showing the SppA, SppB, and SppC clouds with biplot rays for Length, mass, RedRef, and GreenRef.]

2. Does the analysis use a linear or a quadratic discriminant function? When are linear and quadratic discriminant functions more appropriate?

The analysis uses a linear function in this case because JMP is limited to linear discriminant functions. Linear discriminant functions are appropriate when there is homogeneity of variance-covariance matrices (HOV) across groups. In this example, we can see that this is not true, because the red group (SppA) appears different in orientation and dimensions of the ellipsoid. The other two groups are more similar and closer to HOV. However, the red group is so far from the others that the heterogeneity of variance is pretty much irrelevant, and a linear classification will do practically as well as a quadratic one.

3. Obtain a table of Actual by Predicted membership for the training sample. How many observations are misclassified by this re-substitution method?

Counts: Actual Rows by Predicted Columns

          SppA   SppB   SppC
SppA        50      0      0
SppB         0     49      1
SppC         0      2     48

Only 3 plants are misclassified: one SppB is incorrectly classified as SppC, and 2 SppC are misclassified as SppB. The overall error rate using the training sample as test sample is 3 out of 150, or 2%. This is calculated as 0.333*0 + 0.333*(1/50) + 0.333*(2/50) = 1/50 = 2%.

4. What is the overall error rate using the hold-out method? Hold out the first 10 observations of each species.

Select the first 10 rows of each species and exclude them by selecting Rows -> Exclude/Unexclude. Run the discriminant analysis again and check that indeed this time you have only 120 observations used. As you can see in the new output window, all rows are classified, including the ones excluded from the training sample. Because the training sample is different, though, the distances and probabilities are different.

[Screenshot: classification listing with all rows included.]

[Screenshot: classification listing with thirty rows held out.]

The default JMP output shows the summary for the classification of ALL rows, including the training sample. In order to obtain the table for just the hold-out sample, the exclusion of rows has to be reversed. Select all rows that are excluded and Unexclude them; then choose Rows -> Row Selection -> Invert row selection and Exclude. Now you should have 120 rows excluded.
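As a cross-check on questions 3 and 4, the resubstitution and hold-out error rates can be reproduced outside JMP. Here is a minimal sketch in Python with scikit-learn; the file name seeds.csv is an assumption, and the column names mirror the variables in the JSL call above:

import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix

data = pd.read_csv("seeds.csv")                    # hypothetical data file
cols = ["Length", "mass", "RedRef", "GreenRef"]

# Equal priors and a pooled within-group covariance, matching JMP's default.
lda = LinearDiscriminantAnalysis(priors=[1/3, 1/3, 1/3])
lda.fit(data[cols], data["Species"])

# Question 3: resubstitution -- classify the training sample with its own rule.
pred = lda.predict(data[cols])
print(confusion_matrix(data["Species"], pred, labels=["SppA", "SppB", "SppC"]))
print("resubstitution error:", (pred != data["Species"]).mean())  # expect 3/150 = 2%

# Question 4: hold out the first 10 rows of each species and refit on the other 120.
held = data.groupby("Species").head(10)
train = data.drop(held.index)
lda2 = LinearDiscriminantAnalysis(priors=[1/3, 1/3, 1/3])
lda2.fit(train[cols], train["Species"])
print("hold-out errors:", (lda2.predict(held[cols]) != held["Species"]).sum())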
Back in the output window, select the option to save the discriminant formulas:

[Screenshot: saving the formulas from the output window.]

Now the predictions are saved. These formulas and predictions assume that priors are equal. To find out how the classification worked on the hold-out data, select Analyze -> Fit Y by X and put the true species in the X box and the predicted species in the Y box, as shown below:

[Screenshot: Fit Y by X launch dialog.]

Modify the contingency table output to show only the counts and see that no errors were made on the hold-out data. The hold-out sample is very small (30 observations) relative to the overall error rate of 2%, so the expected number of mistakes is less than one and a perfect score is unsurprising. In practical terms, I would use the error rate from the whole data set, lest the hold-out method give a false sense that no classification errors are expected.

5. Select one observation in the listing and explain why it is placed in the category indicated on the basis of posterior probabilities. Explain how priors and likelihoods are used to calculate posteriors. Use an equation to explain.

First, note that the multivariate normal probability density function is

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}\,\lvert \Sigma \rvert^{1/2}}\; e^{-0.5\,(\mathbf{x}-\boldsymbol{\mu})'\,\Sigma^{-1}\,(\mathbf{x}-\boldsymbol{\mu})}$$

where p is the number of dimensions of the multivariate distribution. If we use it to calculate the probability of observing a multivariate x from a population with mean vector μ and variance-covariance matrix Σ, we can write the likelihood of x as

$$L(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = k\, e^{-0.5\,D^2}$$

where D² is the squared Mahalanobis distance from x to the group centroid. When we assume homogeneity of variance-covariance matrices across groups, the constant k is the same for all groups and all observations, so it cancels out of the equation for the posterior probability. That is the reason why the JMP formulas use Exp[-Dist(g)] for the probabilities. The equation for the posterior probability that an observation x comes from group g is

$$p(g \mid \mathbf{x}) = \frac{p_g\, L(\mathbf{x} \mid g)}{\sum_{i=1}^{m} p_i\, L(\mathbf{x} \mid i)}$$

which, for group A with equal priors, becomes

$$p(A \mid \mathbf{x}) = \frac{0.3333\, e^{-0.5 D_A^2}}{0.3333\, e^{-0.5 D_A^2} + 0.3333\, e^{-0.5 D_B^2} + 0.3333\, e^{-0.5 D_C^2}} = \frac{e^{-0.5 D_A^2}}{e^{-0.5 D_A^2} + e^{-0.5 D_B^2} + e^{-0.5 D_C^2}}$$

For example, observation 73 is SppC, and with equal priors it is classified as C because the formula above yields p(C | {6.3, 2.5, 4.9, 1.5}) = 0.81553, which is greater than the posteriors for the other two groups.

6. Classify the set of plants whose species are not known, listed in the "new data" worksheet. Interpret the classification of the test data. Based on the fact that the new data are actually observations 9, 19, 29, ..., 149 of the calibration data, are any observations misclassified?

We classify observations without known membership by adding them at the end of the data table after saving the discrimination formulas. Select all cells in the new data and copy them. Then paste them at the end of the original data table. The formulas are automatically extended to the new rows, giving the predicted groups. Once we learn their true membership, we see that once again none is misclassified.

7. What are "prior probabilities"? Why are they considered in the classification process and in the calculation of error rates? Explain the impact of priors by changing them to 0.1, 0.1 and 0.8 for species A, B, and C, and repeating step 1.

A prior probability is the probability that an observation or object belongs to a given group, calculated before the observable characteristics of the observation are measured. In practical terms, the priors reflect the proportions of the groups in the population from which we are randomly taking observations. They can also be the proportions of the groups in the set of cases for which we are making predictions, if those are known.
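The priors enter the posterior formula from question 5 only as the weights p_g, so once the group centroids and the pooled covariance matrix are available, the posterior can be computed directly. A minimal numpy sketch (the function posteriors and all of its inputs are placeholders, not JMP's saved formulas):

import numpy as np

def posteriors(x, means, pooled_cov, priors):
    # p(g|x) = priors[g] * exp(-0.5 * D_g^2) / sum_i priors[i] * exp(-0.5 * D_i^2),
    # where D_g^2 is the squared Mahalanobis distance from x to the centroid of g.
    # The constant k cancels because the covariance matrix is pooled across groups.
    inv = np.linalg.inv(pooled_cov)
    d2 = np.array([(x - m) @ inv @ (x - m) for m in means])  # squared distances
    w = np.asarray(priors) * np.exp(-0.5 * d2)               # prior * likelihood kernel
    return w / w.sum()                                       # normalize to sum to 1

# Hypothetical usage: with the estimated centroids mu_A, mu_B, mu_C and pooled
# covariance S plugged in, observation 73 should yield a posterior near 0.816 for C:
# posteriors(np.array([6.3, 2.5, 4.9, 1.5]), [mu_A, mu_B, mu_C], S, [1/3, 1/3, 1/3])

With the priors exposed as an argument, the exploration in question 7 amounts to changing that one input.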
Note that the priors are independent of the training sample, and that they do not affect the likelihood part of the formula. Thus, the effect of priors can be explored both by using the JMP output and by creating a new formula. Suppose that we need to make a prediction for a new observation that looks just like observation 120, but that comes from a set that is known to be composed of 10% A, 10% B, and 80% C. We can change the priors in the JMP output:

[Screenshot: JMP dialog with the priors set to 0.1, 0.1, and 0.8.]

We then see that the “new” observation that looks just like 120 is classified as SppC, although based on the statistical (Mahalanobis) distance this observation is closer to the centroid for B. The change is due to the fact that a priori we now expect many more observations to be in C than in B. The priors changed the decision rule to keep the overall error rate at a minimum. The same result is obtained by creating a new column in JMP whose formula modifies the original “prior-free” or equal-priors equation:

[Screenshot: formula for the new column with the modified priors.]
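Numerically, the flip from B to C looks like this. A sketch using hypothetical squared Mahalanobis distances (the values are made up solely to illustrate an observation slightly closer to the B centroid, as 120 is):

import numpy as np

# Hypothetical squared Mahalanobis distances from the new observation to the
# centroids of A, B, and C; B is closest, but not by a wide margin.
d2 = np.array([60.0, 3.0, 5.5])
labels = ["SppA", "SppB", "SppC"]

for priors in ([1/3, 1/3, 1/3], [0.1, 0.1, 0.8]):
    w = np.asarray(priors) * np.exp(-0.5 * d2)
    p = w / w.sum()
    print(priors, "->", labels[int(np.argmax(p))], np.round(p, 3))

# Equal priors classify the observation as SppB; priors of 0.1, 0.1, and 0.8
# shift enough weight toward C that the same likelihoods now favor SppC.

The likelihood factors exp(-0.5 D²) never change; only the prior weights do, which is why the new column in JMP can reuse the saved distance formulas.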