STAT 501: Multivariate Statistical Methods
Homework 5 Solutions
Problems (Wichern and Johnson): 6.33, 6.14, 7.6, 7.24, 7.21, 7.26, 7.27
Due Sunday, March 25, 2012

Problem 6.33 (15 points):

(a) Following the example from problem 6.31 done in class and the R code available at
http://www.public.iastate.edu/~maitra/stat501/Rcode/manova.R, we can write R code that performs the desired analysis:

library(car)
vars<-c("nm560","nm720","Species","Time","Replication")
spec.reflec<-read.table(
  'http://www.public.iastate.edu/~maitra/stat501/datasets/spectral.dat',
  col.names=vars)
spec.reflec$Time<-as.factor(spec.reflec$Time)
spec.reflec$Replication<-as.factor(spec.reflec$Replication)
summary(spec.reflec)
fit.manova <- Manova(lm(cbind(nm560,nm720) ~ Species * Time, data = spec.reflec))
summary(fit.manova)

from which we get the following statistics and associated p-values testing for the effects of Species, Time, and their interaction:

Type II MANOVA Tests: Pillai test statistic
             Df test stat approx F num Df den Df    Pr(>F)
Species       2   0.96120  12.4915      4     54 2.910e-07 ***
Time          2   0.99199  13.2853      4     54 1.330e-07 ***
Species:Time  4   0.92116   5.7634      8     54 2.606e-05 ***

The p-values indicate that the null hypothesis (that the effect is 0 across groups) should be rejected in each case. We therefore conclude that there is statistical evidence that Species, Time, and their interaction have an impact on the percentage of spectral reflectance.

(b) Independent sampling must simply be assumed in this case. The assumption that each set of 4 observations for a specific time and species is multivariate normal is hard to examine graphically, but we can try: looking at scatter plots of the two responses for each group, there might be some elliptical shape in a few of them, but some look less "multivariate normal" than I would like. Since there are only 4 observations for each Species:Time group, the MANOVA results will be sensitive to any problems with normality. We can test the residuals using Dr. Maitra's code:

source('http://www.public.iastate.edu/~maitra/stat501/Rcode/testnormality.R')
fit.model<-lm(cbind(nm560,nm720) ~ Species * Time, data = spec.reflec)
testnormality(fit.model$residuals)

which gives a p-value of 0.000204244, leading to automatic rejection of the normality assumption. We can also see a problem with the residuals by plotting them against time for the two wavelengths (a sketch of this plotting code appears at the end of this part). While it is true that the residuals are centered around zero, there is an obvious shape, indicating that the variance is not the same for each time. Part of this may be due to connections between time values, as seen below:

corspec.560<-cbind(spec.reflec[spec.reflec$Time==1, 1],
                   spec.reflec[spec.reflec$Time==2, 1],
                   spec.reflec[spec.reflec$Time==3, 1])
corspec.720<-cbind(spec.reflec[spec.reflec$Time==1, 2],
                   spec.reflec[spec.reflec$Time==2, 2],
                   spec.reflec[spec.reflec$Time==3, 2])
cor(corspec.560); cor(corspec.720)

The strong correlations between times at least imply that the independence assumption may be unfounded, too. So there is statistical evidence that the normality assumption is invalid. Combined with the graphical evidence that the homogeneity-of-variance assumption is violated and the strong correlations between times, this means we should not put much faith in the results of the tests in part (a) concerning the existence of specific effects. There are many ways to examine the MANOVA assumptions, and these are just a few.
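For reference, here is a sketch of the residual-versus-time plots described above; the layout and labels are my own choices, and the chunk assumes spec.reflec and fit.model from the code earlier in this part.

# Residuals from the fitted model plotted against Time, one panel per wavelength
# (a sketch; assumes spec.reflec and fit.model as defined above)
res <- fit.model$residuals
par(mfrow = c(1, 2))
plot(as.numeric(spec.reflec$Time), res[, "nm560"],
     xlab = "Time", ylab = "Residual", main = "560 nm")
abline(h = 0, lty = 2)
plot(as.numeric(spec.reflec$Time), res[, "nm720"],
     xlab = "Time", ylab = "Residual", main = "720 nm")
abline(h = 0, lty = 2)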
(c) Univariate ANOVAs for each response, based on time, species, and their interaction, can be run with the following code:

anova(lm(nm560 ~ Species*Time, data=spec.reflec))
anova(lm(nm720 ~ Species*Time, data=spec.reflec))

While Time and Species are each significant for each wavelength, the interaction term is significant only at 560 nm (F-value: 70.073, p-value: 7.3e-14); there is no statistical evidence that the interaction effect is nonzero at 720 nm (F-value: 0.7383, p-value: 0.57). So the interaction shows up for only one of the two responses, namely the 560 nm reflectance.

(d) The design of the experiment could be changed so that the foresters do not, as they are described as doing, measure the same plant at multiple times. This would likely prevent the correlation of reflectance across times, but it would require 36 total plants (12 of each species) instead of the 12 plants (4 of each species) currently used.

A design change that the researchers will probably reject (after they terminate their relationship with you) would be to increase the number of plants to about 30 per species and time (about 300 total plants), which would ensure that the MANOVA is robust against the violations of multivariate normality. Any method for analyzing the current data would need a technique that allows the variance to differ across times.

6.41: Performing the initial MANOVA in R using

vars<-c("Severity","Complexity","Experience","AssessTime","ImplementTime","TotalTime")
cell.drop<-read.table(
  'http://www.public.iastate.edu/~maitra/stat501/datasets/breakdown.dat',
  col.names=vars)
cell.drop
fit.manova <- Manova(lm(cbind(AssessTime,ImplementTime) ~ Severity*Complexity*Experience,
                        data = cell.drop))
fit.manova

gives

Type II MANOVA Tests: Pillai test statistic
                               Df test stat approx F num Df den Df    Pr(>F)
Severity                        1   0.95211   69.588      2      7 2.403e-05 ***
Complexity                      1   0.98666  258.896      2      7 2.741e-07 ***
Experience                      1   0.97284  125.360      2      7 3.302e-06 ***
Severity:Complexity             1   0.73660    9.788      2      7  0.009379 **
Severity:Experience             1   0.19522    0.849      2      7  0.467595
Complexity:Experience           1   0.22352    1.007      2      7  0.412538
Severity:Complexity:Experience  1   0.12227    0.488      2      7  0.633537

So we can see that the significant contributions to assessment time and implementation time are the severity and complexity of the problem, the experience of the engineer, and the interaction between the severity and the complexity of the problem. Inspection of the Wilks' lambda values for the main effects and their interactions supports this as well.
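The Wilks' lambda versions of these tests can be requested directly from the same fit; a minimal sketch, assuming fit.manova from the chunk above:

# Report the MANOVA table using Wilks' lambda instead of Pillai's trace
# (test.statistic is an argument of car's summary method for Manova fits)
summary(fit.manova, test.statistic = "Wilks")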
To get the confidence intervals, I will switch tools and use SAS:

filename cells url "http://www.public.iastate.edu/~maitra/stat501/datasets/breakdown.dat";
DATA celldrop;
  infile cells;
  INPUT Severity $ Complexity $ Experience $ AssessTime ImplementTime TotalTime;
run;

data celldrop;
  set celldrop;
  SevComp=catx("",Severity,Complexity);
run;

*Bonferroni intervals;
proc glm data=celldrop;
  class Severity Complexity Experience SevComp;
  model AssessTime ImplementTime = Severity Complexity Experience SevComp;
  means Severity Complexity Experience SevComp/bon clm;
run;

This produces the following Bonferroni simultaneous 95% confidence intervals for the factor-level means:

                               AssessTime                    ImplementTime
                        N   Mean   Simultaneous 95% CL    Mean    Simultaneous 95% CL
Severity    High        8  5.0500  (4.6380, 5.4620)     10.4250   ( 9.9736, 10.8764)
            Low         8  4.0125  (3.6005, 4.4245)      7.3375   ( 6.8861,  7.7889)
Complexity  Complex     8  6.1250  (5.7130, 6.5370)     11.8375   (11.3861, 12.2889)
            Simple      8  2.9375  (2.5255, 3.3495)      5.9250   ( 5.4736,  6.3764)
Experience  Novice      8  5.3875  (4.9755, 5.7995)     10.9625   (10.5111, 11.4139)
            Guru        8  3.6750  (3.2630, 4.0870)      6.8000   ( 6.3486,  7.2514)
SevComp     High Complex 4  6.2750  (5.6052, 6.9448)    12.8250   (12.0911, 13.5589)
            Low Complex  4  5.9750  (5.3052, 6.6448)    10.8500   (10.1161, 11.5839)
            High Simple  4  3.8250  (3.1552, 4.4948)     8.0250   ( 7.2911,  8.7589)
            Low Simple   4  2.0500  (1.3802, 2.7198)     3.8250   ( 3.0911,  4.5589)

7.24 (a) It is easy enough to fit the model in R:

vars<-c("Breed","SalePt","YrHgt","PrctFFB","FtFrBody","Frame","BkFat","SaleHt","SaleWt")
bulls<-read.table(
  'http://www.public.iastate.edu/~maitra/stat501/datasets/bulls.dat',
  col.names=vars)
bulls$Breed<-as.factor(bulls$Breed)
bulls$Frame<-as.factor(bulls$Frame)
fit.lm <- lm(SaleHt ~ YrHgt + FtFrBody, data = bulls)
summary(fit.lm)
plot(fit.lm$fitted, fit.lm$resid)
plot(density(fit.lm$resid))

Plots of the residuals against the fitted values show that the homogeneity-of-variance assumption appears valid, and Q-Q plots indicate that the residuals follow a roughly normal distribution. Additionally, the density curve of the residuals looks good. The prediction interval can be found using the predict method for lm objects:

newdata <- data.frame(YrHgt = 50.5, FtFrBody = 970)
predict(fit.lm, new = newdata, interval = 'prediction')

which gives a 95% prediction interval of 32.3 to 184.4 (fitted value 108.3) for the new SaleHt.

(b) The new model follows the same routine:

fit.lm <- lm(cbind(SaleWt,SaleHt) ~ YrHgt + FtFrBody, data = bulls)
summary(fit.lm)
plot(fit.lm$fitted[,1], fit.lm$resid[,1])
plot(fit.lm$fitted[,2], fit.lm$resid[,2])
plot(density(fit.lm$resid[,1]))
plot(density(fit.lm$resid[,2]))
plot(fit.lm$resid[,1], fit.lm$resid[,2])

Confidence and prediction ellipsoids for the bivariate response can also be constructed (see the sketch following problem 7.21 below).

7.21 (a) This is the same problem as before. The answer for part (iii) is (2.43, 16.86).

(b) Again, this follows from the code in problem 7.24 (b). 95% confidence ellipsoids are wider than 95% confidence intervals because an interval for, say, X1 can grab any value of X2, meaning that we can capture 95% of the values without going as far from whatever we chose as the center. If we use an ellipse, there are restrictions in both coordinates on which points we can consider, meaning we must stretch further in both directions.
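To make the ellipsoid comment concrete, here is a sketch of a 95% prediction ellipse for (SaleWt, SaleHt) at YrHgt = 50.5 and FtFrBody = 970, using the usual prediction-region formula for multivariate multiple regression; the object names and plotting approach are my own choices, and the chunk assumes the bulls data frame from problem 7.24.

# Sketch: 95% prediction ellipse for (SaleWt, SaleHt) at a new (YrHgt, FtFrBody)
fit.mlm <- lm(cbind(SaleWt, SaleHt) ~ YrHgt + FtFrBody, data = bulls)
z0 <- c(1, 50.5, 970)                               # intercept, YrHgt, FtFrBody
n  <- nrow(bulls); r <- 2; m <- 2
Z  <- model.matrix(fit.mlm)
Sig  <- crossprod(residuals(fit.mlm)) / (n - r - 1) # unbiased estimate of Sigma
pred <- drop(z0 %*% coef(fit.mlm))                  # predicted (SaleWt, SaleHt)
h0   <- drop(t(z0) %*% solve(crossprod(Z)) %*% z0)
c2   <- (1 + h0) * m * (n - r - 1) / (n - r - m) * qf(0.95, m, n - r - m)
theta <- seq(0, 2 * pi, length.out = 200)
bdry  <- pred + sqrt(c2) * t(chol(Sig)) %*% rbind(cos(theta), sin(theta))
plot(t(bdry), type = "l", xlab = "SaleWt", ylab = "SaleHt")
points(pred[1], pred[2], pch = 19)

Because the ellipse must cover both responses jointly, its projections onto each axis are wider than the corresponding individual 95% prediction intervals, which is the point made in 7.21 (b).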
7.26 (a) (i) To begin, we consider each response variable separately and run a model for each:

pulp<-read.table(
  'http://www.public.iastate.edu/~maitra/stat501/datasets/pulp_paper.dat',
  header=TRUE)
summary(lm(y1~z1+z2+z3+z4,data=pulp))
summary(lm(y2~z1+z2+z3+z4,data=pulp))
summary(lm(y3~z1+z2+z3+z4,data=pulp))
summary(lm(y4~z1+z2+z3+z4,data=pulp))

This leads to the following results: Y1 depends on Z2, Z3, and Z4; Y2 depends on Z1 and Z4; Y3 depends on Z2, Z3, and Z4; Y4 depends on Z2, Z3, and Z4. We can find the coefficients of each reduced model by using, e.g.,

coef(lm(y1 ~ z2 + z3 + z4, data=pulp))

for each model. Doing this we get:

Model           Int      z1     z2     z3     z4
y1~z2+z3+z4  -70.12           0.06   0.06  82.53
y2~z1+z4     -21.60  -0.96                 27.04
y3~z2+z3+z4  -43.80           0.03   0.02  44.59
y4~z2+z3+z4  -17.00           0.03   0.01  15.77

(ii) Checking influence with

plot(lm.influence(lm(y1~z2+z3+z4,data=pulp))$hat, type="h", ylab="hat values")

(and similarly for the other responses) and plotting the residuals against their fitted values shows that observations 60 and 61 are highly influential for Y1, Y3, and Y4, while only observation 60 is highly influential for Y2. To find the outliers, we can look for residuals that are further than, say, 3 residual standard errors from 0:

fit.y1<-lm(y1~z2+z3+z4,data=pulp); fit.y2<-lm(y2~z1+z4,data=pulp)
fit.y3<-lm(y3~z2+z3+z4,data=pulp); fit.y4<-lm(y4~z2+z3+z4,data=pulp)
which(abs(fit.y1$resid) > 3*summary(fit.y1)$sigma)
which(abs(fit.y2$resid) > 3*summary(fit.y2)$sigma)
which(abs(fit.y3$resid) > 3*summary(fit.y3)$sigma)
which(abs(fit.y4$resid) > 3*summary(fit.y4)$sigma)

This identifies observations 52 and 56 as outliers on y4, 52 as an outlier on y1, and 56 as an outlier on y3.

(iii) Prediction can be accomplished using the following:

fit.y3<-lm(y3~z2+z3+z4,data=pulp)
newdata <- data.frame(z2=45.5, z3=20.375, z4=1.01)
predict(fit.y3, new=newdata, interval='prediction')

which reports the interval as (1.59, 4.64).

(b) (i) Fitting a regression model for all the responses at once using

fit.all<-lm(cbind(y1, y2, y3, y4)~z1+z2+z3+z4, data=pulp)
summary(fit.all)
coef(fit.all)

indicates that z2, z3, and z4 are significant factors for y1, y3, and y4, while only z1 and z4 are significant for y2, giving coefficients:

                       y1            y2            y3           y4
(Intercept)  -74.23167347 -24.014741069 -45.76325188 -17.72729238
z1                          -0.54997685
z2             0.09758331                 0.009134465   0.04702695
z3             0.04940019                 0.008352993   0.02530127
z4            85.07614717  28.754768387  45.79821105   16.21994962

(ii) The residuals all have roughly univariate normal distributions and appear to have constant variance across the fitted responses. Taking observations with residuals more than 3 standard errors from 0 as outliers, observation 52 is an outlier for y1, y2, and y4, while observation 56 is an outlier for y3.
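The same 3-standard-error screen can be run on the residuals of the multivariate fit; a minimal sketch, assuming fit.all from above and taking each response's residual standard error from its summary:

# Sketch: flag residuals more than 3 residual standard errors from 0, per response
# (summary() on a multivariate lm fit returns one summary.lm per response)
res <- residuals(fit.all)
sig <- sapply(summary(fit.all), function(s) s$sigma)
outl <- lapply(seq_len(ncol(res)), function(j) which(abs(res[, j]) > 3 * sig[j]))
setNames(outl, colnames(res))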
(iii) We can use a little linear algebra in this case. Consider the following function, which computes simultaneous prediction intervals for all of the responses at a new set of predictor values:

confints<-function(model,dataset,newobs,xcols,alpha=.05){
  # response columns are everything in the data set not listed in xcols
  ycols<-c(1:ncol(dataset))
  ycols<-ycols[which(!(ycols %in% xcols))]
  SIGMA<-t(model$resid)%*%model$resid/nrow(dataset)   # MLE of Sigma
  fitmat<-as.matrix(model$coef)
  # design matrix: a column of ones plus the predictor columns
  Z<-as.matrix(dataset[,xcols])
  Z<-cbind(rep(1,nrow(dataset)),Z)
  tZ<-t(Z)
  tZZ<-tZ%*%Z
  tZZinv<-solve(tZZ)
  n<-nrow(dataset)
  r<-length(xcols)
  m<-length(ycols)
  lower<-upper<-numeric(m)
  for(i in 1:m){
    lower[i]<-t(newobs)%*%fitmat[,i] -
      sqrt(m*(n-r-1)/(n-r-m)*qf(1-alpha,m,n-r-m))*
      sqrt((1+t(newobs)%*%tZZinv%*%newobs)*n/(n-r-m)*SIGMA[i,i])
    upper[i]<-t(newobs)%*%fitmat[,i] +
      sqrt(m*(n-r-1)/(n-r-m)*qf(1-alpha,m,n-r-m))*
      sqrt((1+t(newobs)%*%tZZinv%*%newobs)*n/(n-r-m)*SIGMA[i,i])
  }
  return(cbind(lower,upper))
}

> confints(fit.all,pulp,c(1,0.33,45.5,20.375,1.01),c(5:8))
             lower     upper
[1,]  9.9592146886 22.264982
[2,]  3.8123175885  6.632425
[3,] -0.0004013305  5.316223
[4,] -1.4092290742  1.456298

Notice that now the prediction interval for Y3 is larger.

Problem 7.27: Editing the data as described in the question stem gives:

vars<-c("Severity","Complexity","Experience","AssessTime","ImplementTime","TotalTime")
cell.drop<-read.table(
  'http://www.public.iastate.edu/~maitra/stat501/datasets/breakdown.dat',
  col.names=vars)
newSeverity <- as.factor(c("Low", "Low", "High", "High", "High"))
newComplexity <- as.factor(c("Complex", "Complex", "Simple", "Simple", "Complex"))
newExperience <- as.factor(rep("Experienced", 5))
newAssessTime <- c(5.3, 5.0, 4.0, 4.5, 6.9)
newImplementTime <- c(9.2, 10.9, 8.6, 8.7, 14.9)
newTotalTime <- newAssessTime + newImplementTime
newobs<-data.frame(Severity=newSeverity, Complexity=newComplexity,
                   Experience=newExperience, AssessTime=newAssessTime,
                   ImplementTime=newImplementTime, TotalTime=newTotalTime)
cell.new<-rbind(cell.drop,newobs)
cell.new[14,3]<-"Experienced"; cell.new[3,3]<-"Experienced"

The model is simple enough to fit, using:

lm.fit<-lm(cbind(AssessTime,ImplementTime) ~ Severity + Experience + Complexity +
           Severity:Experience + Severity:Complexity + Experience:Complexity,
           data = cell.new)
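The MANOVA tests for this model can then be obtained just as in the earlier problems; a minimal sketch, assuming lm.fit and cell.new from above:

# Sketch: Type II MANOVA tests for the refitted model (car's Manova, as used earlier)
library(car)
summary(Manova(lm.fit))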