Stat 407 Exam 1 - SOLUTION
Name

1. (5pts) Calculate the mean ($\bar{X}$) and variance-covariance ($S_n$) arrays for the following data:

x1    9    2    6    4
x2   12    8    6    6

$\bar{X} = \begin{bmatrix} 5.25 \\ 8.00 \end{bmatrix}$, $S_n = \begin{bmatrix} 6.7 & 4 \\ 4 & 6.0 \end{bmatrix}$

2. (5pts) Explain what it means to standardize a variable. What is the purpose of standardizing variables during multivariate analysis?

A standardized variable has mean 0 and variance 1. It is computed by subtracting the variable mean from each sample value and dividing by the standard deviation. Standardizing is important for multivariate analysis because it puts the variables on a common, unit-free scale.

3. (5pts) Calculate the pooled variance-covariance matrix, given the two variance-covariance matrices (assume the two sample sizes are equal):

$S_1 = \begin{bmatrix} 4 & 2 \\ 2 & 6 \end{bmatrix}$, $S_2 = \begin{bmatrix} 6 & -2 \\ -2 & 4 \end{bmatrix}$

Do you think it made sense to pool them? Explain yourself.

$S_{pooled} = \begin{bmatrix} 5 & 0 \\ 0 & 5 \end{bmatrix}$

I think these two variance-covariance matrices should not have been pooled, because their covariances are opposite in sign: pooling cancels them and hides the association present in each group.

4. (5pts) How might you detect an outlier using a parallel coordinate plot?

Look at the line traces in the plot. Any trace that differs from all the others, or that extends out from the group at any part of the plot, would correspond to an outlier.

5. The following questions refer to measurements made on the size of the carapace and gender of painted turtles (Jolicoeur and Mosimann, 1960). The variables are Length, Width and Height (in mm), and gender (1 = Female, 2 = Male).

(a) (3pts) Describe the structure in the scatterplot matrix plot of the raw variables.
Length vs Width: Strong, slightly non-linear relationship.
Length vs Height: Linear relationship, slight skewness.
Length vs Gender: More variability in the females than males, males slightly smaller on average.
Width vs Height: Linear relationship, slight skewness.
Width vs Gender: More variability in the females than males, males slightly smaller on average.
Height vs Gender: More variability in the females than males, males slightly smaller on average.

(b) (2pts) As accurately as possible, plot the point $X' = (98\ 81\ 38\ 1)'$ on the scatterplot matrix.

Marked on the plot.

(c) (3pts) How could you design a plot that would better illustrate the size differences on the physical measurements between females and males?

Use color and/or symbol to represent Gender, and display this in a scatterplot matrix of the physical measurements or a tour of the 3 variables.

(d) (2pts) When doing principal component analysis on this data, would it be better to use the covariance matrix or the correlation matrix? Explain your answer.

Correlation. Although the units are the same, the physical measurements have very different scales and ranges.

(e) (3pts) From the attached SAS output, fill in the table of eigenvectors, eigenvalues, cumulative proportion of total variance, for males and females separately.

                  Females                    Males
Variable          e1      e2      e3         e1      e2      e3
Length            .578    -.137   -.804      .582    -.041   -.812
Width             .577    -.628   .522       .575    -.685   .447
Height            .577    .766    .284       .574    .728    .375
Variance          2.94    .0343   .0259      2.87    .0879   .0403
Cum % Tot Var     98.0    99.0    100.0      95.7    98.7    100.0

(f) (2pts) Draw a scree plot for the females.

(g) (2pts) How many principal components would you suggest using to reduce the dimensionality of this data (for the females only)?

One, because the first component alone accounts for 98.0% of the total variation.

(h) (2pts) Write down the value of the variance of the first principal component (of the females).

2.94

(i) (3pts) Interpret the first principal component for the females. Is it the same interpretation for the males?
The first principal component for both females and males weights each variable by an almost equal positive amount, so in both cases it can be interpreted as the overall size of the turtle.

6. The following questions are about a data set measured on Australian crabs. There are 200 measurements on 2 species, both males and females, of crabs. The classes are:

Blue Crabs = 1, Orange Crabs = 2
Males = 1, Females = 2

A new class variable was created: 1 = Blue Male, 2 = Blue Female, 3 = Orange Male, 4 = Orange Female, and the variables are:

CL = Carapace Length
CW = Carapace Width
FL = Frontal Lobe
RW = Rear Width
BD = Body Depth

(a) (2pts) On the attached SAS output highlight (point out) the B (Between group covariance) matrix.

Between-Class Covariance Matrix, DF = 3

Variable   FL            RW            CL            CW            BD
FL         3.677006667   1.955410000   5.343694000   4.850986000   3.398708667
RW         1.955410000   2.009526333   2.337795000   2.334203667   1.589017000
CL         5.343694000   2.337795000   8.284449000   7.649887000   4.999583000
CW         4.850986000   2.334203667   7.649887000   7.331625000   4.441259667
BD         3.398708667   1.589017000   4.999583000   4.441259667   3.201678333

(b) (3pts) Explain conceptually what the between group covariance matrix is.

This is a measure of the distance between the class means: it describes how the class means vary around the overall mean.

(c) (2pts) Linear discriminant analysis was used to build a classification rule. Write down the confusion table for the classification rule.

Number of Observations and Percent Classified into Class

From Class    1     2     3     4     Total
1             45    5     0     0     50
2             0     50    0     0     50
3             0     0     50    0     50
4             0     0     3     47    50
Total         45    55    53    47    200

(d) (2pts) Calculate the apparent error rate of the procedure.

(5 + 3)/200 = 8/200 = 0.04

(e) (2pts) Circle the points corresponding to crabs that were misclassified on the appropriate plot in the SAS output.

Marked as an "x".

[Plot of CAN2*CAN1: second canonical discriminant (vertical axis, -6 to 6) against first canonical discriminant (horizontal axis, -7.5 to 7.5). Symbol is value of _INTO_. The eight misclassified crabs are marked "x".]

(f) (2pts) From the SAS output, which group would a crab with measurements (FL = 22.2, RW = 18.0, CL = 44.0, CW = 47.5, BD = 19.1) be classified into? Which species and sex is this?

Group 4, orange female. This is given on the SAS output.

(g) (5pts) In the following plot of the crabs data in the discriminant space (not centered around the mean), which of the points, X, Y or Z, is most likely to be the new observation? Why?

It is X. Use the raw canonical coefficients to project the point into the discriminant space. Or, from the classification rule, argue that it is closer to group 3 than to groups 1 or 2.

7. (5pts) Are the results from the following two procedures for building a classification rule for 3 groups likely to differ? Explain your answer.

(a) Work pairwise to develop 3 pairwise classification rules (1 vs 2, 1 vs 3, 2 vs 3) and use this collection of rules to classify new observations into groups.

(b) Compute the 2D discriminant space, which is the 2D projection which best separates the 3 groups. Then build a classification rule which classifies all 3 groups with one rule (if ... then group 1, else if ... then group 2, else group 3).

Yes, they are most likely to differ. Working pairwise uses different within and between covariance matrices for each pair. It is also possible to get conflicting results with a pairwise solution, for example that an observation should be classed as both group 1 and group 2.
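
The arithmetic in questions 1, 3 and 6(d) can be checked with a short script. The following is a minimal sketch in plain Python, not part of the exam or its SAS output. The function names are hypothetical, and the sample sizes passed to `pooled` are an arbitrary assumption: the question only states that the two sizes are equal, in which case the pooled matrix is the simple average whatever the common size is.

```python
# Check of the hand calculations in questions 1, 3 and 6(d).
# All function names are illustrative, not taken from the exam.

def mean_vector(rows):
    """Column means; `rows` holds one observation per row."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance_n(rows):
    """Variance-covariance matrix with divisor n (S_n in question 1)."""
    n = len(rows)
    m = mean_vector(rows)
    p = len(m)
    return [
        [sum((r[i] - m[i]) * (r[j] - m[j]) for r in rows) / n for j in range(p)]
        for i in range(p)
    ]


def pooled(s1, s2, n1, n2):
    """Pooled covariance, weighting each matrix by (n_k - 1)."""
    w1, w2 = n1 - 1, n2 - 1
    return [
        [(w1 * a + w2 * b) / (w1 + w2) for a, b in zip(row1, row2)]
        for row1, row2 in zip(s1, s2)
    ]


def apparent_error_rate(confusion):
    """Share of observations off the diagonal of a confusion table."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return (total - correct) / total


# Question 1: observations as (x1, x2) pairs.
data = [(9, 12), (2, 8), (6, 6), (4, 6)]
x_bar = mean_vector(data)
s_n = covariance_n(data)

# Question 3: equal sample sizes; the value 10 is an arbitrary assumption.
s_pooled = pooled([[4, 2], [2, 6]], [[6, -2], [-2, 4]], n1=10, n2=10)

# Question 6(d): rows are true classes, columns are predicted classes.
confusion = [
    [45, 5, 0, 0],
    [0, 50, 0, 0],
    [0, 0, 50, 0],
    [0, 0, 3, 47],
]
error_rate = apparent_error_rate(confusion)

print(x_bar)       # [5.25, 8.0]
print(s_n)         # [[6.6875, 4.0], [4.0, 6.0]]
print(s_pooled)    # [[5.0, 0.0], [0.0, 5.0]]
print(error_rate)  # 0.04
```

The variance of x1 comes out as 6.6875, which rounds to the 6.7 given in the solution. With divisor n - 1 instead of n the entries would differ, so the divisor matters when comparing against software output.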