Stat 511 HW#4 Spring 2003
1. Consider the (non-full-rank) "effects model" for the 2 × 2 factorial (with 2 observations per
cell) called "Example d)" in the first lecture. For each of the hypotheses below, identify a
C and a d so that it may be written in the form H0: Cβ = d. Determine which of the
hypotheses below is testable. For each testable hypothesis that can be written in the form
H0: Cβ = 0, find a matrix X0 so that the hypothesis can be written in the form
H0: EY = X0γ for some γ (that is, so that it can be written as H0: EY ∈ C(X0) where
C(X0) ⊂ C(X)).
a) H0: αβ12 − αβ11 − (αβ22 − αβ21) = 7
b) H0: αβ12 − αβ11 − (αβ22 − αβ21) = 0
c) H0: αβij = 0 ∀ i, j
d) H0: µ + α1 = 7
e) H0: αβij = 0 ∀ i, j and α1 − α2 = 7
f) H0: αβij = 0 ∀ i, j and α1 − α2 = 0
Note: This is the problem as I assigned it. But, it is poorly worded and hopelessly ambiguous.
"Testability" as defined in class can only be determined for a particular expression of an
hypothesis in matrix form, which I didn't give you here.
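Once a particular C has been written down, one common way to check whether each row of Cβ is estimable is a rank comparison: a row lies in the row space of X exactly when appending it to X leaves the rank unchanged. The sketch below assumes the parameter ordering (µ, α1, α2, β1, β2, αβ11, αβ12, αβ21, αβ22) and the cell ordering (1,1), (1,2), (2,1), (2,2); both orderings and the helper name rows.estimable are assumptions made for illustration, and the rank check may not capture everything in the definition of testability used in class.

```r
# Assumed parameter order: mu, alpha1, alpha2, beta1, beta2, ab11, ab12, ab21, ab22
cells <- expand.grid(j = 1:2, i = 1:2)      # rows: (i,j) = (1,1), (1,2), (2,1), (2,2)
cells <- cells[rep(1:4, each = 2), ]        # 2 observations per cell
X <- cbind(1,
           cells$i == 1, cells$i == 2,
           cells$j == 1, cells$j == 2,
           cells$i == 1 & cells$j == 1, cells$i == 1 & cells$j == 2,
           cells$i == 2 & cells$j == 1, cells$i == 2 & cells$j == 2)

# Hypothetical helper: TRUE when every row of C lies in the row space of X
rows.estimable <- function(C, X) qr(rbind(X, C))$rank == qr(X)$rank

C.b <- matrix(c(0, 0, 0, 0, 0, -1, 1, 1, -1), nrow = 1)  # one way to write b)
C.d <- matrix(c(1, 1, 0, 0, 0, 0, 0, 0, 0), nrow = 1)    # one way to write d)

qr(X)$rank               # rank of the model matrix
rows.estimable(C.b, X)   # is the row of C.b in the row space of X?
rows.estimable(C.d, X)   # same question for C.d
```

The same helper can be applied to a C with several rows, in which case it reports whether all of the rows pass the check together.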
2. (Adapted from Koehler's Spring 2002 HW 3) On the Web page
http://www.public.iastate.edu/~vardeman/stat511/511data.html you will find the file
biomass.txt. We're going to do some statistical analysis on this R data frame. These data
come from a study of the effects of soil characteristics on aerial biomass production of the marsh
grass Spartina alterniflora (Rick A. Lindhurst, 1979, Aeration, nitrogen, pH, and salinity as
factors affecting Spartina alterniflora growth and dieback, Ph.D. dissertation, North Carolina
State University). There are eight entries on each line of the data frame. These are (in the order
below)
Location
Type (revegetated area, short grass, tall grass)
y = aerial biomass (g/m²)
x1 = soil salinity (%)
x2 = soil acidity as measured in water (pH)
x3 = soil potassium (ppm)
x4 = soil sodium (ppm)
x5 = soil zinc (ppm)
The first row of the file has the variable names in it. (You might open this file in Notepad and
have a look at it.)
Enter these data into R using the command
> biomass<-read.table("filename",header=TRUE)
I was able to get this loaded by placing biomass.txt into the directory "rw1061" created in
the installation of my copy of R, and using biomass.txt (no quote marks) in place of
"filename" above. I was also able to get it loaded by using
"http://www.public.iastate.edu/~vardeman/stat511/biomass.txt"
(complete with quote marks) in place of "filename" above while connected to the network.
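As a small illustration of what read.table does with header=TRUE, the sketch below writes a tiny stand-in file to a temporary location and reads it back. The file contents are made up for the example and are not the real biomass data; the object name toy is likewise hypothetical. Note that read.table takes the file name as a character string.

```r
# Write a tiny stand-in file (made-up values, NOT the real biomass data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("Location Type biomass salinity",
             "SiteA reveg 500.5 30",
             "SiteA short 610.0 31",
             "SiteB tall 955.2 29"), tmp)

toy <- read.table(tmp, header = TRUE)   # file name passed as a character string
dim(toy)                                # number of rows, number of columns
names(toy)                              # column names come from the first line
str(toy)                                # column types chosen by read.table
```

Running dim() and names() on the real data frame in the same way is a quick check that the header line was read as variable names rather than as a data row.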
Use the command
> biomass
to view the data frame. It should have eight columns and 45 rows. Now create two matrices that
will be used to fit a regression model to these data. The third column will be used as the
response vector and the last five columns will be used to make most of the model matrix. Type
> Y<-as.matrix(biomass[,3])
> X<-as.matrix(biomass[,4:8])
Note the use of [] to select columns from the data frame. Here, the function as.matrix is
used to create a matrix from one or more columns of the data frame. To add a column of ones to
the model matrix, type
> X0<-rep(1,length(Y))
> X<-cbind(X0,X)
Make a scatterplot matrix for y, x1, x2, x3, x4 and x5. To do this, first load the "lattice" package.
(Look under the "Packages" heading on the R GUI, select "Load package" and then "lattice".)
Then type
> splom(~biomass[,3:8],aspect="fill")
If you had to guess based on this plot, which single predictor do you think is probably the best
predictor of biomass? Do you see any evidence of multicollinearity (correlation among the
predictors) in this graphic?
To redo the scatterplot matrix with a smooth curve passed through each of the scatterplots,
first define a function
> points.lines<-function(x,y)
+ {
+ points(x,y)
+ lines(loess.smooth(x,y,0.90))
+ }
set some parameters for the graphic
> par(pch=18,cex=1.2,lwd=3)
and then issue the command
> pairs(biomass[,-(1:2)],panel=points.lines)
Also compute a sample correlation matrix for y, x1, x2, x3, x4 and x5. You may compute the
matrix using the cor() function and round the printed values to four places using the round()
function as
> round(cor(biomass[-(1:2)]),4)
Use the qr() function to find the rank of X.
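For orientation, qr() returns a list whose rank component is the numerical rank of the matrix. The sketch below uses simulated matrices (Xsim and Xdef are hypothetical names, and the seed is arbitrary) rather than the biomass model matrix, and shows how the reported rank responds to a linearly dependent column.

```r
set.seed(511)                                  # arbitrary seed for the simulation
n <- 45
Xsim <- cbind(1, matrix(rnorm(n * 5), n, 5))   # intercept plus 5 simulated predictors
qr(Xsim)$rank                                  # 6: full column rank

Xdef <- cbind(Xsim, Xsim[, 2] + Xsim[, 3])     # append a linearly dependent column
ncol(Xdef)                                     # 7 columns
qr(Xdef)$rank                                  # still 6
```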
Use R matrix operations on the X matrix and Y vector to find the estimated regression coefficient
vector bOLS, the estimated mean vector Ŷ, and the vector of residuals e = Y − Ŷ.
Plot the residuals against the fitted means. This can be done using the following code.
> b<-solve(t(X)%*%X)%*%t(X)%*%Y
> yhat<-X%*%b
> e<-Y-yhat
> par(fin=c(6.0,6.0),pch=18,cex=1.5,mar=c(5,5,4,2))
> plot(yhat,e,xlab="Predicted Y",ylab="Residual",main="Residual Plot")
Type > help(par) to see the list of parameters that may be set on a graphic. What does the
first specification above do, i.e. what does fin=c(6.0,6.0) do?
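The coefficient computation above forms the inverse of X′X explicitly. As a side note, the same least squares solution can usually be obtained without an explicit inverse, either by solving the normal equations directly or through a QR decomposition. The sketch below compares the three on simulated data; Xsim, Ysim and the coefficient values are arbitrary choices for illustration, and the comparison assumes X has full column rank.

```r
set.seed(511)                                   # arbitrary seed
n <- 45
Xsim <- cbind(1, matrix(rnorm(n * 5), n, 5))    # simulated full-rank model matrix
Ysim <- Xsim %*% c(2, 1, -1, 0.5, 0, 3) + rnorm(n)

b.inv   <- solve(t(Xsim) %*% Xsim) %*% t(Xsim) %*% Ysim    # explicit inverse, as above
b.solve <- solve(crossprod(Xsim), crossprod(Xsim, Ysim))   # solves X'X b = X'Y directly
b.qr    <- qr.coef(qr(Xsim), Ysim)                         # QR-based least squares

max(abs(b.inv - b.solve))   # differences should be at rounding-error level
max(abs(b.inv - b.qr))
```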
Plot the residuals against salinity. You may use the following code.
> plot(biomass$salinity,e,xlab="Salinity",ylab="Residual",main=
"Residual Plot")
And you can add a smooth trend line to the plot by typing
> lines(loess.smooth(biomass$salinity,e,0.90))
What happens when you instead type the following?
> lines(loess.smooth(biomass$salinity,e,0.50))
(The values 0.90 and 0.50 are values of a "smoothing parameter." You could have discovered
this (and more) about the loess.smooth function by typing > help(loess.smooth).)
Now plot the residuals against each of x2, x3, x4 and x5.
Create a normal plot from the values in the residual vector. You can do so by typing
> qqnorm(e,main="Normal Probability Plot")
> qqline(e)
Now compute the sum of squared residuals and the corresponding estimate of σ², namely
$$\hat{\sigma}^2 = \frac{(Y - \hat{Y})'(Y - \hat{Y})}{n - \operatorname{rank}(X)}$$
Use this and compute an estimate of the covariance matrix for bOLS, namely
$$\hat{\sigma}^2 (X'X)^{-1}$$
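The two formulas above can be checked against lm on simulated data. This is only a sketch under stated assumptions: Xsim and Ysim are simulated stand-ins (not the biomass data), the coefficient values and seed are arbitrary, and the inverse of X′X is used, which requires X to have full column rank.

```r
set.seed(511)                                    # arbitrary seed
n <- 45
Xsim <- cbind(1, matrix(rnorm(n * 2), n, 2))     # simulated model matrix
Ysim <- drop(Xsim %*% c(1, 2, -1)) + rnorm(n)    # simulated response (plain vector)

b    <- solve(crossprod(Xsim), crossprod(Xsim, Ysim))
e    <- Ysim - Xsim %*% b                        # residuals
sse  <- sum(e^2)                                 # (Y - Yhat)'(Y - Yhat)
s2   <- sse / (n - qr(Xsim)$rank)                # estimate of sigma^2
covb <- s2 * solve(crossprod(Xsim))              # estimated covariance matrix of b

fit <- lm(Ysim ~ Xsim - 1)                       # same model fit by lm, for comparison
all.equal(s2, summary(fit)$sigma^2)
all.equal(unname(covb), unname(vcov(fit)))
```

The "- 1" in the formula suppresses lm's own intercept, since Xsim already contains a column of ones.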
Sometimes you may want to write a matrix out to a file. This can be done as follows. First
prepare the row and column labels and round all entries to 4 places using the code
> case<-1:45
> heading<- c("Case","Salinity","pH","K","Na","Zn",
"Biomass","Predicted","Residual")
> temp<-cbind(case,X[,-1],Y,yhat,e)
> dimnames(temp)<-list(case,heading)
> temp<-round(temp,4)
Then load the "MASS" package (in order to make the write.matrix function available).
The code
> write.matrix(temp,file="c:/temp/regoutput.out")
will then write output to the file c:/temp/regoutput.out (you may choose another name
and destination for this file).
Modify the above to create a matrix that has bOLS in the first column and a vector of
corresponding standard errors (square roots of diagonal entries of the estimated covariance
matrix for bOLS) in the second. Label the rows and columns of your matrix and write it out to a
file. Submit a listing of that file and your R code.
3. It is, of course, possible to do the linear model calculations in R "automatically" by calling the
right function. After loading the biomass data frame, type
> lm(formula = biomass$biomass ~ biomass$salinity + biomass$pH +
biomass$K + biomass$Na + biomass$Zn)
Compare the values printed out with things you computed more painfully in question 2. There is
not much detail in what is printed out. Try instead typing
> summary(lm(formula=biomass$biomass~biomass$salinity+biomass$pH
+biomass$K+biomass$Na+biomass$Zn))
and notice that there is more detail provided.
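The same kind of call can be written more compactly with the data argument of lm, which lets the formula refer to columns by name. The sketch below uses a simulated data frame named toy with the column names used above; the values, the seed, and the generating coefficients are arbitrary and are not the real biomass data.

```r
set.seed(511)                                    # arbitrary seed
# Simulated stand-in with the same column names (NOT the real data)
toy <- data.frame(salinity = rnorm(45, 30, 3), pH = rnorm(45, 5, 1),
                  K = rnorm(45, 800, 200), Na = rnorm(45, 16000, 4000),
                  Zn = rnorm(45, 17, 5))
toy$biomass <- 1000 - 20 * toy$salinity + 300 * toy$pH + rnorm(45, sd = 100)

fit.long  <- lm(toy$biomass ~ toy$salinity + toy$pH + toy$K + toy$Na + toy$Zn)
fit.short <- lm(biomass ~ salinity + pH + K + Na + Zn, data = toy)

all.equal(unname(coef(fit.long)), unname(coef(fit.short)))   # same estimates
coef(summary(fit.short))   # estimates, standard errors, t values, p-values
```

Saving the fitted object, as done here, also makes it possible to pull out pieces such as coef(), residuals() and vcov() for comparison with the matrix calculations of question 2.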
4. The lm function in R allows one to do weighted least squares, i.e. minimize
$\sum_i w_i (y_i - \hat{y}_i)^2$ for positive weights $w_i$. For the V1 case of the Aitken model of
Problem 2 from HW 3, find the BLUEs of the 3 cell means using lm and an appropriate vector of
weights. (Type > help(lm) in R in order to get help with the syntax.)
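As a sketch of the syntax only, the example below fits a weighted least squares line with the weights argument of lm and compares it with the direct minimizer of the weighted sum of squares, obtained by solving X′WXb = X′Wy with W = diag(w). The data, weights and seed are arbitrary illustrations and have nothing to do with the V1 of HW 3; choosing the weights for that problem is left as assigned.

```r
set.seed(511)                         # arbitrary seed
x <- 1:10
w <- 1 / x                            # arbitrary positive weights for illustration
y <- 3 + 2 * x + rnorm(10, sd = sqrt(x))

fit.w <- lm(y ~ x, weights = w)       # weighted least squares via lm

# Direct minimizer of sum(w * (y - yhat)^2): solve X'WX b = X'Wy
X <- cbind(1, x)
b.w <- solve(t(X) %*% (w * X), t(X) %*% (w * y))   # w * X scales row i of X by w[i]

all.equal(unname(coef(fit.w)), as.vector(b.w))
```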