STAT 501: Multivariate Statistical Methods
Solutions
Homework 5
Problem list: Wichern and Johnson 6.33, 6.14, 7.6, 7.24, 7.21, 7.26, 7.27
Due Sunday, March 25, 2012
Problem 6.33 (15 points):
(a) Following the example from problem 6.31 done in class and the R code available at
http://www.public.iastate.edu/~maitra/stat501/Rcode/manova.R, we can write some
code in R that performs the desired analysis:
library(car)
vars<-c("nm560","nm720","Species","Time","Replication")
spec.reflec<-read.table(
'http://www.public.iastate.edu/~maitra/stat501/datasets/spectral.dat',
col.names=vars)
spec.reflec$Time<-as.factor(spec.reflec$Time)
spec.reflec$Replication<-as.factor(spec.reflec$Replication)
summary(spec.reflec)
fit.manova <- Manova(lm(cbind(nm560,nm720)~ Species * Time, data = spec.reflec))
summary(fit.manova)
from which we get the following statistics and associated p-values testing for the effects
of Species, Time, and their interaction:
Type II MANOVA Tests: Pillai test statistic
             Df test stat approx F num Df den Df    Pr(>F)
Species       2   0.96120  12.4915      4     54 2.910e-07 ***
Time          2   0.99199  13.2853      4     54 1.330e-07 ***
Species:Time  4   0.92116   5.7634      8     54 2.606e-05 ***
The p-values indicate that each null hypothesis (that the corresponding effect is zero
across groups) should be rejected. We therefore conclude that there is statistical
evidence that Species, Time, and their interaction have an impact on the percent of
spectral reflectance.
(b) The assumption of independent sampling must simply be assumed in this case.
The assumption that each set of 4 observations for a specific time and species is
multivariate normal is hard to examine graphically, but we can try; a minimal sketch of
such a check is given below. There might be some elliptical shape in a few of the
resulting plots, but some look less "multivariate normal" than I would like. Since there
are only 4 observations in each species:time group, the MANOVA results will be
sensitive to any problems with normality.
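A minimal sketch of one such graphical check (an illustration, not the original
solution's plots), using the spec.reflec data frame from part (a) to plot the two
wavelengths against each other within every Species:Time cell:

par(mfrow=c(3,3))
# one panel per Species:Time cell; with only 4 points each, look (loosely) for an elliptical cloud
for (s in unique(as.character(spec.reflec$Species))) {
  for (tm in levels(spec.reflec$Time)) {
    sub <- subset(spec.reflec, Species == s & Time == tm)
    plot(sub$nm560, sub$nm720, xlab="nm560", ylab="nm720",
         main=paste("Species", s, "- Time", tm))
  }
}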
We can also test the residuals formally using Dr. Maitra's code:
source('http://www.public.iastate.edu/~maitra/stat501/Rcode/testnormality.R')
fit.model<-lm(cbind(nm560,nm720)~ Species * Time, data = spec.reflec)
testnormality(fit.model$residuals)
which gives a p-value of 0.000204244, leading to a clear rejection of the normality
assumption.
We can also see a problem with the residuals by plotting them against time for the
two wavelengths:
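A sketch of how these residual-versus-time plots can be produced, using the fit.model
object defined above:

par(mfrow=c(1,2))
# residuals from the multivariate fit, one panel per wavelength
plot(as.numeric(spec.reflec$Time), fit.model$residuals[,"nm560"],
     xlab="Time", ylab="nm560 residuals")
plot(as.numeric(spec.reflec$Time), fit.model$residuals[,"nm720"],
     xlab="Time", ylab="nm720 residuals")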
While the residuals are centered around zero, there is an obvious pattern, indicating
that the variance is not the same at each time. Part of this may be due to correlation
between measurements taken at different times, as seen below:
# correlations of each wavelength across the three measurement times
corspec.560<-cbind(spec.reflec[spec.reflec$Time==1, 1],
spec.reflec[spec.reflec$Time==2, 1],
spec.reflec[spec.reflec$Time==3, 1])
corspec.720<-cbind(spec.reflec[spec.reflec$Time==1, 2],
spec.reflec[spec.reflec$Time==2, 2],
spec.reflec[spec.reflec$Time==3, 2])
cor(corspec.560); cor(corspec.720)
The strong correlations across times at least suggest that the independence assumption
may be unfounded, too. So there is statistical evidence that the normality assumption is
invalid; combined with graphical evidence that the homogeneity-of-variance assumption
is invalid, and with the strong correlations between times, we should not put much
faith in the results of the test in part (a) concerning the existence of specific effects.
There are many ways to examine the MANOVA assumptions, and these are just a few.
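One further check, not in the original write-up but easy to sketch, is a univariate
Bartlett test of equal variances across the nine Species:Time cells for each wavelength:

grp <- with(spec.reflec, interaction(Species, Time))
bartlett.test(spec.reflec$nm560, grp)   # equal-variance test across the 9 cells, 560 nm
bartlett.test(spec.reflec$nm720, grp)   # same test for 720 nm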
(c) Performing a univariate ANOVA for each response based on time, species, and their
interaction can be accomplished with the following code (the relevant output is summarized below):
anova(lm(nm560 ~ Species*Time,data=spec.reflec))
anova(lm(nm720 ~ Species*Time,data=spec.reflec))
While Time and Species are each significant for both wavelengths, the interaction term
is significant only at 560 nm (F-value 70.073, p-value 7.3 x 10^(-14)); there is no
statistical evidence of an interaction at 720 nm (F-value 0.7383, p-value 0.57). So the
interaction shows up for only one of the two responses, namely the 560 nm reflectance.
(d) One design change would be to stop having the foresters measure the same plant
at multiple times, as they are described as doing. This would likely remove the
correlation of reflectance across times, but it would require 36 total plants (12 of each
species) instead of the 12 plants (4 of each species) currently used.
A design change that the researchers will probably reject (after they terminate their
relationship with you) would be to increase the total number of plants to about 30 per
species and time (about 300 total plants), which would help ensure that the MANOVA
is robust against violations of multivariate normality.
Any method used to analyze the current data should allow the variance at different
times to differ.
Problem 6.41:
Performing the initial MANOVA in R using
vars<-c("Severity","Complexity","Experience","AssessTime","ImplementTime","TotalTime
cell.drop<-read.table(
’http://www.public.iastate.edu/~maitra/stat501/datasets/breakdown.dat’,
col.names=vars)
cell.drop
fit.manova <- Manova(lm(cbind(AssessTime,ImplementTime)~ Severity*Complexity*Experie
data = cell.drop))
fit.manova
gives
Type II MANOVA Tests: Pillai test statistic
                               Df test stat approx F num Df den Df    Pr(>F)
Severity                        1   0.95211   69.588      2      7 2.403e-05 ***
Complexity                      1   0.98666  258.896      2      7 2.741e-07 ***
Experience                      1   0.97284  125.360      2      7 3.302e-06 ***
Severity:Complexity             1   0.73660    9.788      2      7  0.009379 **
Severity:Experience             1   0.19522    0.849      2      7  0.467595
Complexity:Experience           1   0.22352    1.007      2      7  0.412538
Severity:Complexity:Experience  1   0.12227    0.488      2      7  0.633537
So we can see that the significant contributions to assessment time and implementation
time are the severity and complexity of the problem, the experience of the engineer,
and an interaction between the severity and the complexity of the problem. Inspection
of the Wilks' lambda values for the main effects and their interactions supports this as
well.
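The Wilks' lambda values referred to above can be obtained from car by rerunning the
analysis with a different test statistic, for example:

Manova(lm(cbind(AssessTime,ImplementTime)~ Severity*Complexity*Experience,
data = cell.drop), test.statistic = "Wilks")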
To get the confidence intervals, I will switch tools and use SAS:
filename cells url "http://www.public.iastate.edu/~maitra/stat501/datasets/breakdown.dat";
DATA celldrop;
infile cells;
INPUT Severity $ Complexity $ Experience $ AssessTime ImplementTime TotalTime;
run;
data celldrop;
set celldrop;
SevComp=catx("",Severity,Complexity);
run;
*Bonferroni intervals;
proc glm data=celldrop;
class Severity Complexity Experience SevComp;
model AssessTime ImplementTime = Severity Complexity Experience SevComp;
means Severity Complexity Experience SevComp/bon clm;
run;
This produces the following Bonferroni 95% simultaneous confidence intervals for the
group means (written as mean ± half-width):

                              AssessTime           ImplementTime
Severity     High        N=8  5.0500 ± 0.4120      10.4250 ± 0.4514
             Low         N=8  4.0125 ± 0.4120       7.3375 ± 0.4514
Complexity   Complex     N=8  6.1250 ± 0.4120      11.8375 ± 0.4514
             Simple      N=8  2.9375 ± 0.4120       5.9250 ± 0.4514
Experience   Novice      N=8  5.3875 ± 0.4120      10.9625 ± 0.4514
             Guru        N=8  3.6750 ± 0.4120       6.8000 ± 0.4514
SevComp      High Complex N=4 6.2750 ± 0.6698      12.8250 ± 0.7339
             Low Complex  N=4 5.9750 ± 0.6698      10.8500 ± 0.7339
             High Simple  N=4 3.8250 ± 0.6698       8.0250 ± 0.7339
             Low Simple   N=4 2.0500 ± 0.6698       3.8250 ± 0.7339
Problem 7.24:
(a) It is possible to fit the model easily enough using R:
vars<-c("Breed","SalePt","YrHgt","PrctFFB","FtFrBody","Frame","BkFat","SaleHt","Sale
bulls<-read.table(
’http://www.public.iastate.edu/~maitra/stat501/datasets/bulls.dat’,
col.names=vars)
bulls$Breed<-as.factor(bulls$Breed)
bulls$Frame<-as.factor(bulls$Frame)
fit.lm <- lm(SaleHt ~ YrHgt + FtFrBody, data = bulls)
summary(fit.lm)
plot(fit.lm$fitted,fit.lm$resid)
plot(density(fit.lm$resid))
Plots of the residuals show that the homogeneity-of-variance assumption is reasonable,
and QQ-plots indicate that the residuals follow a roughly normal distribution.
Additionally, the density curve for the residuals looks good.
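The QQ-plot mentioned above can be drawn with, for example:

qqnorm(fit.lm$resid)   # normal QQ-plot of the residuals
qqline(fit.lm$resid)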
The prediction interval can be found by applying the predict function to the lm object:
newdata <- data.frame(YrHgt= 50.5, FtFrBody= 970)
predict(fit.lm,new=newdata,interval='prediction')
which gives a 95% prediction interval of 32.3 to 184.4 (mean of 108.3) for the new
SaleHt.
(b) The new model follows the same routine:
fit.lm <- lm(cbind(SaleWt,SaleHt) ~ YrHgt + FtFrBody, data = bulls)
summary(fit.lm)
plot(fit.lm$fitted[,1],fit.lm$resid[,1])
plot(fit.lm$fitted[,2],fit.lm$resid[,2])
plot(density(fit.lm$resid[,1]))
plot(density(fit.lm$resid[,2]))
plot(fit.lm$resid[,1],fit.lm$resid[,2])
Confidence ellipsoids for the bivariate response can also be examined; a sketch is given below.
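As a sketch (not the textbook's prescribed code), a 95% prediction ellipse for
(SaleWt, SaleHt) at YrHgt = 50.5 and FtFrBody = 970 can be drawn using the same
multivariate prediction formula as the confints() function in problem 7.26(b)(iii) below:

# 95% prediction ellipse for the bivariate response at the new covariate values
n <- nrow(bulls); r <- 2; m <- 2                  # r predictors, m responses
z0 <- c(1, 50.5, 970)                             # intercept, YrHgt, FtFrBody
Z  <- cbind(1, bulls$YrHgt, bulls$FtFrBody)
pred <- drop(t(z0) %*% coef(fit.lm))              # predicted (SaleWt, SaleHt)
Sig  <- crossprod(resid(fit.lm)) / n              # MLE of Sigma
sc   <- drop(1 + t(z0) %*% solve(crossprod(Z)) %*% z0) * n / (n - r - m)
crit <- m * (n - r - 1) / (n - r - m) * qf(0.95, m, n - r - m)
theta <- seq(0, 2*pi, length.out = 200)
ell <- sweep(sqrt(crit) * cbind(cos(theta), sin(theta)) %*% chol(sc * Sig), 2, pred, "+")
plot(ell, type = "l", xlab = "SaleWt", ylab = "SaleHt")
points(pred[1], pred[2], pch = 19)                # center of the ellipse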
Problem 7.21:
(a) This is the same problem as before. The answer for part (iii) is (2.43, 16.86).
(b) Again, this follows from the code in problem 7.24(b). 95% confidence ellipsoids are
wider than 95% confidence intervals because an interval for, say, X1 allows X2 to take
any value, meaning that 95% coverage is achieved without going as far from the chosen
center. An ellipse restricts both coordinates simultaneously, so it must stretch further
in both directions to retain joint 95% coverage.
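Concretely, in the notation of the prediction-interval function in problem 7.26(b)(iii):
a one-at-a-time 95% interval uses the multiplier t(n-r-1, .025), while the simultaneous
(ellipsoid-based) limits use sqrt( m(n-r-1)/(n-r-m) F(m, n-r-m, .05) ), which is at
least as large, so the simultaneous limits extend further from the center.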
Problem 7.26:
(a)
(i) To begin, we consider each response variable separately and fit a model for each:
pulp<-read.table(
'http://www.public.iastate.edu/~maitra/stat501/datasets/pulp_paper.dat',
header=TRUE)
summary(lm(y1~z1+z2+z3+z4,data=pulp))
summary(lm(y2~z1+z2+z3+z4,data=pulp))
summary(lm(y3~z1+z2+z3+z4,data=pulp))
summary(lm(y4~z1+z2+z3+z4,data=pulp))
This leads to the following results: Y1 depends on Z2, Z3, and Z4; Y2 depends on Z1
and Z4; Y3 depends on Z2, Z3, and Z4; Y4 depends on Z2, Z3, and Z4.
We can find the coefficients of each model by using, e.g., coef(lm(y1 ~ z2 + z3 + z4,data=pulp)).
Doing this we get:
Model           Int       z1      z2     z3     z4
y1~z2+z3+z4    -70.12             0.06   0.03   82.53
y2~z1+z4       -21.60    -0.96                  27.04
y3~z2+z3+z4    -43.80             0.06   0.02   44.59
y4~z2+z3+z4    -17.00             0.03   0.01   15.77
(ii) Checking influence with
plot(lm.influence(lm(y1~z2+z3+z4,data=pulp))$hat, type="h", ylab="influence")
and plotting the residuals against their fitted values shows that observations 60 and 61
are highly influential for Y1, Y3, and Y4, while only observation 60 is highly influential
for Y2. To find the outliers, we can look for residuals that are farther than, say, 3
standard errors from 0:
# reduced models from part (i)
fit.y1<-lm(y1~z2+z3+z4,data=pulp)
fit.y2<-lm(y2~z1+z4,data=pulp)
fit.y3<-lm(y3~z2+z3+z4,data=pulp)
fit.y4<-lm(y4~z2+z3+z4,data=pulp)
which(abs(fit.y1$resid) > 3*summary(fit.y1)[6][[1]])
which(abs(fit.y2$resid) > 3*summary(fit.y2)[6][[1]])
which(abs(fit.y3$resid) > 3*summary(fit.y3)[6][[1]])
which(abs(fit.y4$resid) > 3*summary(fit.y4)[6][[1]])
This identifies observations 52 and 56 as outliers on y4, 52 as an outlier on y1, and 56
as an outlier on y3.
(iii) Prediction can be accomplished by using the following:
fit.y3<-lm(y3~z2+z3+z4,data=pulp)
newdata <- data.frame(z2=45.5,z3=20.375,z4=1.01)
predict(fit.y3,new=newdata,interval='prediction')
This reports the 95% prediction interval as (1.59, 4.64).
(b)
(i) Fitting a multivariate regression model for all four responses using
fit.all<-lm(cbind(y1, y2, y3, y4)~z1+z2+z3+z4,data=pulp)
summary(fit.all)
coef(fit.all)
indicates that z2, z3, and z4 are significant factors for y1, y3, and y4, while only z1 and
z4 are significant for y2, giving the coefficients (only significant coefficients shown):
                        y1             y2              y3             y4
(Intercept)   -74.23167347  -24.014741069   -45.76325188   -17.72729238
z1                            -0.54997685
z2              0.09758331                    0.009134465     0.04702695
z3              0.04940019                    0.008352993     0.02530127
z4             85.07614717   28.754768387    45.79821105     16.21994962
(ii) The residuals all have roughly univariate normal distributions and appear to have
constant variance across the fitted responses. Taking observations with residuals more
than 3 standard errors from 0 as outliers, observation 52 is an outlier for y1, y2, and
y4, while observation 56 is an outlier for y3. A sketch of these checks is given below.
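A sketch of how these checks can be carried out for the multivariate fit, assuming the
fit.all object from part (i):

par(mfrow=c(2,2))
# residuals against fitted values, one panel per response
for (j in 1:4) plot(fitted(fit.all)[,j], resid(fit.all)[,j],
                    xlab=paste("fitted y", j), ylab=paste("residual y", j))
# per-response residual standard errors and a 3-standard-error outlier screen
sig <- sqrt(colSums(resid(fit.all)^2) / df.residual(fit.all))
lapply(1:4, function(j) which(abs(resid(fit.all)[,j]) > 3*sig[j]))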
(iii) We can use a little linear algebra in this case. Consider the following function:
confints<-function(model,dataset,newobs,xcols,alpha=.05){
ycols<-c(1:ncol(dataset))
ycols<-ycols[which(!(ycols %in% xcols))]
SIGMA<-t(model$resid)%*%model$resid/nrow(dataset)  # MLE of Sigma
fitmat<-as.matrix(model$coef)
Z<-as.matrix(dataset[,xcols])  # predictor columns form the design matrix
Z<-cbind(rep(1,nrow(dataset)),Z)
tZ<-t(Z)
tZZ<-tZ%*%Z
tZZinv<-solve(tZZ)
n<-nrow(dataset)
r<-length(xcols)
m<-length(ycols)
lower<-upper<-numeric(m)
for(i in 1:m){
lower[i]<-t(newobs)%*%fitmat[,i] -
sqrt(m*(n-r-1)/(n-r-m)*qf(1-alpha,m,n-r-m))*
sqrt((1+t(newobs)%*%tZZinv%*%newobs)*n/(n-r-m)*SIGMA[i,i])
upper[i]<-t(newobs)%*%fitmat[,i] +
sqrt(m*(n-r-1)/(n-r-m)*qf(1-alpha,m,n-r-m))*
sqrt((1+t(newobs)%*%tZZinv%*%newobs)*n/(n-r-m)*SIGMA[i,i])
}
return(cbind(lower,upper))
}
> confints(fit.all,pulp,c(1,0.33,45.5,20.375,1.01),c(5:8))
            lower      upper
[1,]  9.9592146886 22.264982
[2,]  3.8123175885  6.632425
[3,] -0.0004013305  5.316223
[4,] -1.4092290742  1.456298
Notice that the simultaneous prediction interval for Y3 is now wider than the univariate
interval found in part (a)(iii).
Problem 7.27
Editing the data as discussed in the question stem gives:
vars<-c("Severity","Complexity","Experience","AssessTime","ImplementTime","TotalTime
cell.drop<-read.table(
’http://www.public.iastate.edu/~maitra/stat501/datasets/breakdown.dat’,
col.names=vars)
newSeverity <- as.factor(c("Low", "Low", "High", "High", "High"))
newComplexity<- as.factor(c("Complex", "Complex","Simple", "Simple", "Complex"))
newExperience <- as.factor(rep("Experienced", 5))
newAssessTime <- c(5.3, 5.0, 4.0, 4.5, 6.9)
newImplementTime <- c(9.2, 10.9, 8.6, 8.7, 14.9)
newTotalTime <- newAssessTime + newImplementTime
newobs<-data.frame(Severity=newSeverity,
Complexity=newComplexity,
Experience=newExperience,
AssessTime=newAssessTime,
ImplementTime=newImplementTime, TotalTime=newTotalTime)
cell.new<-rbind(cell.drop,newobs)
cell.new[14,3]<-"Experienced"; cell.new[3,3]<-"Experienced"
The model is simple enough to fit, using:
lm.fit<-lm(cbind(AssessTime,ImplementTime) ~ Severity + Experience + Complexity +
Severity:Experience + Severity:Complexity + Experience:Complexity,
data = cell.new)
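The MANOVA for the altered data can then be obtained exactly as in problem 6.41,
for example:

library(car)
Manova(lm.fit)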