Stat 407 Exam 1 - SOLUTION Name

1. (5pts) Calculate the mean (X̄) and variance-covariance (Sn) arrays for the following data:
x1   x2
 9   12
 2    8
 6    6
 4    6
$\bar{X} = \begin{bmatrix} 5.25 \\ 8.00 \end{bmatrix}$
$S_n = \begin{bmatrix} 6.7 & 4 \\ 4 & 6.0 \end{bmatrix}$
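The answer to question 1 can be checked numerically. The sketch below assumes NumPy is available and that Sn uses the divisor n (not n - 1), which is what the stated values imply; the variable names are illustrative only.

```python
import numpy as np

# Data from question 1: each row is one observation (x1, x2).
X = np.array([[9, 12],
              [2, 8],
              [6, 6],
              [4, 6]], dtype=float)

# Column means give the mean vector.
mean_vector = X.mean(axis=0)

# bias=True makes np.cov divide by n, matching the Sn convention assumed here.
S_n = np.cov(X, rowvar=False, bias=True)

print(mean_vector)
print(S_n)
```

The unrounded variance of x1 is 6.6875, which the solution reports as 6.7.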
2. (5pts) Explain what it means to standardize a variable. What is the purpose of standardizing variables
during multivariate analysis?
A standardized variable has mean 0 and variance 1. It is computed by subtracting the variable's mean from each
sample value and dividing by the variable's standard deviation. This is important for multivariate analysis because
it puts all variables on a common, unit-free scale.
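As a rough illustration of standardizing, the sketch below applies the subtract-the-mean, divide-by-the-standard-deviation recipe to the data from question 1. It assumes NumPy and uses the n - 1 divisor for the standard deviation (ddof=1); that divisor is an assumption, since the solution does not say which one is intended.

```python
import numpy as np

# Data from question 1: columns are x1 and x2.
X = np.array([[9, 12],
              [2, 8],
              [6, 6],
              [4, 6]], dtype=float)

# Standardize each column: subtract its mean, divide by its standard deviation.
# ddof=1 (divisor n - 1) is an assumption made for this sketch.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

print(Z.mean(axis=0))          # column means of the standardized data
print(Z.var(axis=0, ddof=1))   # column variances of the standardized data
```

After this step both columns have mean 0 and variance 1 (up to floating-point error), regardless of their original scales.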
3. (5pts) Calculate the pooled variance-covariance matrix, given the two variance-covariance matrices (assume
the two sample sizes are equal):
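$S_1 = \begin{bmatrix} 4 & 2 \\ 2 & 6 \end{bmatrix}$, $S_2 = \begin{bmatrix} 6 & -2 \\ -2 & 4 \end{bmatrix}$
Do you think it made sense to pool them? Explain yourself.
With equal sample sizes, the pooled matrix is the average of the two:
$S_{pooled} = \begin{bmatrix} 5 & 0 \\ 0 & 5 \end{bmatrix}$
I think these two variance-covariance matrices should not have been pooled, because their covariances are
opposite in sign (2 versus -2). Pooling averages them to 0, which hides the association present in each group.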
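The pooling step can be sketched with the usual degrees-of-freedom weighting, which reduces to a simple average when the sample sizes are equal. The function name `pooled_covariance` and the sample size of 10 are hypothetical; the question only states that the two sizes are equal.

```python
import numpy as np

def pooled_covariance(S1, S2, n1, n2):
    """Weight each covariance matrix by its degrees of freedom (n - 1)."""
    return ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)

S1 = np.array([[4.0, 2.0],
               [2.0, 6.0]])
S2 = np.array([[6.0, -2.0],
               [-2.0, 4.0]])

# n1 = n2 = 10 is an arbitrary choice; any equal sizes give the same result.
S_pooled = pooled_covariance(S1, S2, n1=10, n2=10)
print(S_pooled)
```

The off-diagonal entries cancel because the two groups have covariances of equal magnitude and opposite sign.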
4. (5pts) How might you detect an outlier using a parallel coordinate plot?
Look at the line traces in the plot. Any traces that are different from all others or that extend out from the
group at any part of the plot would correspond to an outlier.
5. The following questions refer to measurements made on the size of the carapace and gender of painted
turtles (Jolicoeur and Mosimann, 1960). The variables are Length, Width and Height (in mm), and gender
(1 = Female, 2 = Male).
(a) (3pts) Describe the structure in the scatterplot matrix plot of the raw variables.
Length vs Width: Strong, slightly non-linear relationship
Length vs Height: Linear relationship, slight skewness
Length vs Gender: More variability in the females than males, males slightly smaller on average
Width vs Height: Linear relationship, slight skewness
Width vs Gender: More variability in the females than males, males slightly smaller on average
Height vs Gender: More variability in the females than males, males slightly smaller on average
(b) (2pts) As accurately as possible, plot the point X0 = (98, 81, 38, 1)' on the scatterplot matrix.
ON PLOT
(c) (3pts) How could you design a plot that would better illustrate the size differences on the physical
measurements between females and males?
Use color and/or symbol to represent Gender, and display this in a scatterplot matrix of the physical
measurements or a tour of the 3 variables.
(d) (2pts) When doing principal component analysis on this data, would it be better to use the covariance
matrix or the correlation matrix? Explain your answer.
Correlation. Although the units are the same, the physical measurements have very different scales and
ranges, so the variable with the largest variance would dominate a covariance-based analysis.
(e) (3pts) From the attached SAS output, fill in the table of eigenvectors, eigenvalues, cumulative proportion of total variance, for males and females separately.
                         Females                       Males
Variable            e1       e2       e3        e1       e2       e3
Length            .578    -.137    -.804      .582    -.041    -.812
Width             .577    -.628     .522      .575    -.685     .447
Height            .577     .766     .284      .574     .728     .375
Variance          2.94    .0343    .0259      2.87    .0879    .0403
Cum % Tot Var     98.0     99.0    100.0      95.7     98.7    100.0
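The cumulative percentages follow from the eigenvalues: each is the running sum of the variances divided by their total. A minimal sketch, assuming NumPy and using the rounded eigenvalues from the table (so the results can differ slightly in the last digit from the SAS output); the function name `cumulative_percent` is hypothetical.

```python
import numpy as np

def cumulative_percent(eigenvalues):
    """Running share of total variance explained, in percent."""
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    return 100.0 * np.cumsum(eigenvalues) / eigenvalues.sum()

# Eigenvalues (variances of the principal components) as reported in the table.
female_eigenvalues = [2.94, 0.0343, 0.0259]
male_eigenvalues = [2.87, 0.0879, 0.0403]

print(np.round(cumulative_percent(female_eigenvalues), 1))
print(np.round(cumulative_percent(male_eigenvalues), 1))
```

Because the analysis uses three standardized variables, the eigenvalues in each group sum to approximately 3.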
(f) (2pts) Draw a scree plot for the females.
(g) (2pts) How many principal components would you suggest using to reduce the dimensionality of this
data (for the females only)?
One, because the first principal component alone accounts for about 98% of the total variation.
(h) (2pts) Write down the value of the variance of the first principal component (of the females)?
2.94
(i) (3pts) Interpret the first principal component for the females. Is it the same interpretation for the
males?
For both females and males, the first principal component weights each variable about equally and
positively (coefficients of roughly 0.58), so in both cases it can be interpreted as the overall size of the turtle.
6. The following questions are about a data set measured on Australian crabs. There are 200 measurements
on 2 species, both males and females, of crabs. The classes are:
Blue Crabs = 1
Orange Crabs = 2
Males = 1
Females = 2
A new class variable was created:
1=Blue Male, 2=Blue Female, 3=Orange Male, 4=Orange Female
and the variables are:
CL = Carapace Length
CW = Carapace Width
FL = Frontal Lobe
RW = Rear Width
BD = Body Depth
(a) (2pts) On the attached SAS output highlight (point out) the B (Between group covariance) matrix.
Between-Class Covariance Matrix, DF = 3
Variable            FL            RW            CL            CW            BD
FL         3.677006667   1.955410000   5.343694000   4.850986000   3.398708667
RW         1.955410000   2.009526333   2.337795000   2.334203667   1.589017000
CL         5.343694000   2.337795000   8.284449000   7.649887000   4.999583000
CW         4.850986000   2.334203667   7.649887000   7.331625000   4.441259667
BD         3.398708667   1.589017000   4.999583000   4.441259667   3.201678333
(b) (3pts) Explain conceptually what the between group covariance matrix is.
This is a measure of the distance between the class means.
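To make the idea concrete, the sketch below builds a between-group covariance matrix from the deviations of each class mean from the overall mean. It assumes NumPy and uses one common scaling (class-size weights, divisor g - 1 for g classes); software packages differ in scaling, so this is not claimed to reproduce the SAS figures. The function name and the toy data are hypothetical.

```python
import numpy as np

def between_group_covariance(X, labels):
    """Sum of n_c * (class mean - overall mean)(...)' over classes, divided by g - 1."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    overall_mean = X.mean(axis=0)
    classes = np.unique(labels)
    B = np.zeros((X.shape[1], X.shape[1]))
    for c in classes:
        X_c = X[labels == c]
        deviation = X_c.mean(axis=0) - overall_mean
        B += len(X_c) * np.outer(deviation, deviation)
    return B / (len(classes) - 1)

# Hypothetical toy data: two well-separated classes in two dimensions.
X_toy = [[0, 0], [2, 2], [10, 10], [12, 12]]
labels_toy = [1, 1, 2, 2]

B_toy = between_group_covariance(X_toy, labels_toy)
print(B_toy)
```

The entries grow as the class means move apart, and the matrix is all zeros when every class has the same mean.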
(c) (2pts) Linear discriminant analysis was used to build a classification rule. Write down the confusion
table for the classification rule.
Number of Observations and Percent Classified into Class
From Class      1      2      3      4    Total
1              45      5      0      0       50
2               0     50      0      0       50
3               0      0     50      0       50
4               0      0      3     47       50
Total          45     55     53     47      200
(d) (2pts) Calculate the apparent error rate of the procedure.
(5 + 3)/200 = 8/200 = 0.04
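The same calculation can be read directly off the confusion table: the misclassified crabs are the off-diagonal counts. A minimal sketch, assuming NumPy, with rows as the true class and columns as the predicted class; the variable names are illustrative.

```python
import numpy as np

# Confusion table from part (c): rows = from class, columns = classified into class.
confusion = np.array([[45, 5, 0, 0],
                      [0, 50, 0, 0],
                      [0, 0, 50, 0],
                      [0, 0, 3, 47]])

total = confusion.sum()
# Correct classifications lie on the diagonal; everything else is an error.
misclassified = total - np.trace(confusion)
apparent_error_rate = misclassified / total

print(misclassified, apparent_error_rate)
```

This is the apparent (resubstitution) error rate, computed on the same data used to build the rule.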
(e) (2pts) Circle the points corresponding to crabs that were misclassified on the appropriate plot in the
SAS output.
Marked as an “x”.
[Plot of CAN2*CAN1. Symbol is value of _INTO_. Vertical axis: SECOND CANONICAL DISCRIMINANT (-6 to 6). Horizontal axis: FIRST CANONICAL DISCRIMINANT (-7.5 to 7.5).]
(f) (2pts) From the SAS output, which group would a crab with measurements (FL = 22.2, RW = 18.0,
CL = 44.0, CW = 47.5, BD = 19.1) be classified into? Which species and sex is this?
Group 4, orange female. This is given on the SAS output.
(g) (5pts) In the following plot of the crabs data in the discriminant space (not centered around the mean),
which of the points, X, Y or Z, is most likely to be the new observation? Why?
It is X. Use the raw canonical coefficients to project the point into the discriminant space. Or from
the classification rule argue it is closer to group 3 than groups 1 or 2.
7. (5pts) Are the results from the following two procedures for building a classification rule for 3 groups likely
to differ? Explain your answer.
(a) Work pairwise to develop 3 pairwise classification rules (1 vs 2, 1 vs 3, 2 vs 3) and use this collection
of rules to classify new observations into groups.
(b) Compute the 2D discriminant space, which is the 2D projection which best separates the 3 groups.
Then build a classification rule which classifies all 3 groups with one rule (if ... then group 1, else if
... then group 2, else group 3).
Yes, they are most likely to differ. Working pairwise uses different within and between covariance matrices
for each pair of groups. It's possible to get conflicting results with a pairwise solution, for example an
observation that one rule classes as group 1 and another rule classes as group 2.