Stat 511 HW#6 Spring 2003
1. The data set sealstrength.txt located on the course "data sets" page (follow the
link from the main course Web page) is a classic one taken from "Sealing Strength of
Wax-Polyethylene Blends" by Brown, Turner and Smith (Tappi, 1958). Given are values
of
  x1 = (t1 − 225)/30   (Coded) Seal Temperature, where t1 is in °F
  x2 = (t2 − 55)/9     (Coded) Cooling Bar Temperature, where t2 is in °F
  x3 = (c − 1.1)/.6    (Coded) Polyethylene Content, where c is in %
  y                    Bread Wrapper Seal Strength, in g/in.
from an experiment run to find good (large y) settings for the process variables x. A
"standard" "response surface" analysis of these data is based on a multivariate quadratic
regression. Use R and appropriate matrix calculations to do the following.
a) Fit the (linear in the parameters and quadratic in the predictors) model
yi = β0 + β1 x1i + β2 x2i + β3 x3i + β4 x1i² + β5 x2i² + β6 x3i² + β7 x1i x2i + β8 x1i x3i + β9 x2i x3i + εi
to these data. Then compute and normal-plot standardized residuals.
b) In the model from a), test H0: β4 = β5 = ... = β9 = 0. Report a p-value. Does
quadratic curvature in response (as a function of the x's) appear to be statistically
detectable? (If the null hypothesis is true, the response is "planar" as a function of
the x's.)
c) Some multivariate calculus on the fitted quadratic equation can be used to establish
that it has an absolute maximum at about the set of conditions
x1 = −1.01, x2 = .26, and x3 = .68
Use R matrix calculations to find 90% two-sided confidence limits for the mean response
here. Then find 90% two-sided prediction limits for a new response from this set of
conditions.
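The matrix calculations asked for in parts a) through c) might be organized along the following lines. This is only a sketch under stated assumptions: the data are simulated stand-ins (not the sealstrength.txt values), and the names proj, X0, c0 and the like are illustrative choices rather than anything prescribed by the assignment.

```r
# Sketch only: simulated stand-in data, NOT the sealstrength.txt values.
set.seed(1)
n  <- 20
x1 <- runif(n, -2, 2); x2 <- runif(n, -2, 2); x3 <- runif(n, -2, 2)
y  <- 10 - (x1 + 1)^2 - (x2 - 0.3)^2 - (x3 - 0.7)^2 + rnorm(n, sd = 0.5)

# Full quadratic model matrix (10 columns) and the planar submodel (4 columns)
X  <- cbind(1, x1, x2, x3, x1^2, x2^2, x3^2, x1 * x2, x1 * x3, x2 * x3)
X0 <- X[, 1:4]

# Projection matrix onto C(M); M is assumed to be of full column rank
proj <- function(M) M %*% solve(t(M) %*% M) %*% t(M)

# a) OLS fit and standardized residuals e_i / sqrt(MSE * (1 - h_ii))
XtXinv  <- solve(t(X) %*% X)
b       <- XtXinv %*% t(X) %*% y
PX      <- proj(X)
e       <- y - PX %*% y
dfe     <- n - ncol(X)
mse     <- sum(e^2) / dfe
std.res <- as.vector(e) / sqrt(mse * (1 - diag(PX)))
# qqnorm(std.res) would then give the normal plot

# b) F test of H0: beta4 = ... = beta9 = 0 (full versus planar model)
q     <- ncol(X) - ncol(X0)
Fstat <- (as.numeric(t(y) %*% (PX - proj(X0)) %*% y) / q) / mse
pval  <- 1 - pf(Fstat, q, dfe)

# c) 90% two-sided limits at the point (x1, x2, x3) given in the problem
pt  <- c(-1.01, 0.26, 0.68)
c0  <- c(1, pt, pt^2, pt[1] * pt[2], pt[1] * pt[3], pt[2] * pt[3])
fit <- sum(c0 * b)
lev <- as.numeric(t(c0) %*% XtXinv %*% c0)
tq  <- qt(0.95, dfe)
conf.lim <- fit + c(-1, 1) * tq * sqrt(mse * lev)        # mean response
pred.lim <- fit + c(-1, 1) * tq * sqrt(mse * (1 + lev))  # new response
```

With the real data, x1, x2, x3 and y would instead be read from the file. The prediction limits are necessarily wider than the confidence limits because of the extra "1 +" under the square root.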
2. (Testing "Lack of Fit" … See Section 6.6 of Christensen) Suppose that in the usual
linear model
Y = Xβ + e
X is of full rank ( k ). Suppose further that there are m < n distinct rows in X and that
m > k . One can then make up a "cell means" model for Y (where observations having
the same corresponding row in X are given the same mean response) say
Y = X* μ + e
This model puts no restrictions on the means of the observations except that those with
identical corresponding rows of X are equal. It is the case that C(X) ⊂ C(X*) and it thus
makes sense to test the hypothesis H0: EY ∈ C(X) in the cell means model. This can be
done using
F = [ Y'(P_X* − P_X)Y / (m − k) ] / [ Y'(I − P_X*)Y / (n − m) ]
and this is usually known as testing for "lack of fit."
Use R and matrix calculations to find a p-value for testing lack of fit to the quadratic
regression equation in problem 1.
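As an illustration of the lack of fit F statistic, here is a minimal sketch on a small made-up simple linear regression with replicated x values. The toy data and the helper proj are assumptions for illustration only. For problem 1, X would be the 10-column quadratic model matrix and the distinct rows would be the distinct (x1, x2, x3) settings.

```r
# Sketch only: toy data with replicated x values (hypothetical, not from problem 1)
x <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
y <- c(2.1, 1.9, 3.9, 4.2, 6.1, 5.8, 7.7, 8.4, 9.6, 10.3)
n <- length(y)

X <- cbind(1, x)                       # regression model matrix, k = 2

# Cell means model matrix X*: one indicator column per distinct row of X
rows  <- apply(X, 1, paste, collapse = "_")
cells <- factor(rows, levels = unique(rows))
Xstar <- model.matrix(~ cells - 1)     # n x m
m <- ncol(Xstar); k <- ncol(X)

# Projection matrix onto C(M); M is assumed to be of full column rank
proj <- function(M) M %*% solve(t(M) %*% M) %*% t(M)
PX  <- proj(X)
PXs <- proj(Xstar)

SS.lof <- as.numeric(t(y) %*% (PXs - PX) %*% y)        # lack of fit
SS.pe  <- as.numeric(t(y) %*% (diag(n) - PXs) %*% y)   # pure error
Fstat  <- (SS.lof / (m - k)) / (SS.pe / (n - m))
pval   <- 1 - pf(Fstat, m - k, n - m)
```

The same numbers should come out of comparing the regression fit with a one-way fit on the cells, for example via anova(lm(y ~ x), lm(y ~ factor(x))).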
3. (Adapted from Koehler's Spring 2002 HW7) In a study to examine the effects of
I = 4 drugs on dogs under J = 3 disease conditions, increases in systolic blood pressure
( y , in mm Hg ) were observed after drug treatment for several dogs with experimentally
induced cases of the diseases. The measured increases from Kutner (1974) are as below.
            Drug 1                   Drug 2                 Drug 3            Drug 4
Disease 1   42, 44, 36, 13, 19, 22   28, 23, 24, 42, 13     1, 29, 19         24, 9, 22, −2, 15
Disease 2   33, 26, 33, 21           34, 33, 31, 36         11, 9, 7, 1, −6   27, 12, 12, −5, 16, 15
Disease 3   31, −3, 25, 25, 24       3, 26, 28, 32, 3, 16   21, 1, 9, 3       22, 7, 25, 5, 12
a) Create three vectors in R of length n = 58 . The first should contain y values, the
second Drug ID Numbers ( 1-4 ), and the third Disease ID Numbers ( 1-3 ). Call these
vectors respectively "y", "drug" and "disease". Then create and print out an R data
frame using the commands
> d<-data.frame(y,drug,disease)
> d
b) Turn the numerical variables drug and disease into variables that R will recognize
as levels of factors by issuing the commands
> d$drug<-as.factor(d$drug)
> d$disease<-as.factor(d$disease)
Then compute and print out the cell means by typing
> means<-tapply(d$y,list(d$drug,d$disease),mean)
> means
You may find out more about the function tapply by typing
> ?tapply
c) Make a crude interaction plot by doing the following. First type
> x.axis<-unique(d$drug)
to set up horizontal plotting positions for the sample means. Then make a "matrix plot"
with lines connecting points by issuing the commands
> matplot(c(1,4),c(-10,50),type="n",xlab="Drug",
+ ylab="Mean Response",main="Change in Systolic Blood pressure")
> matlines(x.axis,means,type='b',lty=c(1,3,7))
The first of these commands sets up the axes and makes a dummy plot with invisible
points "plotted" at (1, −10) and (4, 50). The second puts the lines and identifying disease
numbers (as plotting symbols) on the plot.
d) Set the default for the restriction used to create a full rank model matrix, run the linear
models routine and find both (sensible) sets of "Type I" sums of squares by issuing the
following commands
> options(contrasts=c("contr.sum","contr.sum"))
> lm.out1<-lm(y~drug*disease,data=d)
> summary.aov(lm.out1,ssType=1)
> lm.out2<-lm(y~disease*drug,data=d)
> summary.aov(lm.out2,ssType=1)
Then compute "Type III" sums of squares by issuing the command
> summary.aov(lm.out1,ssType=3)
This is the question as assigned. As discussed via e-mail, it appears that R will not
compute Type III sums of squares and one must use John Fox's unsupported "car"
package to get this done in R. Splus DOES produce the Type III sums of squares if the
above command is used.
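For readers working in R, one commonly used base-R route to "Type III" style sums of squares is drop1() with scope . ~ . under the sum restriction, which drops each term in turn from the full model. The sketch below is one possible way to do this, not the method the assignment prescribes. The responses are those of the table in this problem, and the vector n.cell is a helper introduced here for illustration. (The Anova() function in the "car" package mentioned above, with type = 3, is the other usual route.)

```r
# Responses from the problem 3 table, drugs 1-4 in order within each disease
y <- c(42, 44, 36, 13, 19, 22,  28, 23, 24, 42, 13,  1, 29, 19,  24, 9, 22, -2, 15,
       33, 26, 33, 21,  34, 33, 31, 36,  11, 9, 7, 1, -6,  27, 12, 12, -5, 16, 15,
       31, -3, 25, 25, 24,  3, 26, 28, 32, 3, 16,  21, 1, 9, 3,  22, 7, 25, 5, 12)

# Cell sizes in the same order (hypothetical helper vector)
n.cell  <- c(6, 5, 3, 5,  4, 4, 5, 6,  5, 6, 4, 5)
drug    <- rep(rep(1:4, times = 3), times = n.cell)
disease <- rep(rep(1:3, each = 4), times = n.cell)
d <- data.frame(y, drug = factor(drug), disease = factor(disease))

# Sum restriction, as in the assignment
options(contrasts = c("contr.sum", "contr.sum"))
lm.out1 <- lm(y ~ drug * disease, data = d)

# Sequential ("Type I") sums of squares
anova(lm.out1)

# Each term dropped from the full model: "Type III" style sums of squares
type3 <- drop1(lm.out1, scope = . ~ ., test = "F")
type3
```

The interaction line agrees between the two tables, since the interaction is the last term entered in the sequential fit. The main effect lines generally differ because the cell counts are unequal.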
(As Prof. Koehler points out about this data set, we have ignored a potentially important
aspect of the original real problem here. There were actually originally 6 dogs assigned at
random to each of the 12 treatment combinations. We have tacitly assumed that the data
that are missing are "missing at random" i.e. that the "missingness" provides no
information about the effects of the treatments. If that tacit assumption is wrong, none of
what is done above is anything but a numerical exercise … it provides no serious
scientific insight. For example, you might consider how differently you might think
about the medical problem if you believed that in fact all missing data correspond to
dead dogs, and deaths were fundamentally due to huge blood pressure increases that are
not captured by the given values.)
4. Below is a small table of fake 2-way factorial data. Enter them into R in three vectors
of length n = 12 , much as was done in problem 3. Call these vectors "Y", "A", and "B".
               Level 1 of A   Level 2 of A   Level 3 of A
Level 1 of B   12             9              10
Level 2 of B   14             11, 12         11
Level 3 of B   10, 12         6, 7           7
a) Repeat parts a)-d) of Problem 3 on these data.
b) Create 12 × 9 full rank model matrices for both a cell means model and an effects
model with the sum restriction for these data. Using R matrix calculations and the 2nd of
these, compute "Type I" sums of squares corresponding to the linear model fit by
> lm.out1<-lm(Y~A*B,data=d)
Then use the first of these model matrices and appropriate matrices C and compute sums
of squares for H0: Cβ = 0, SS_H0, that are equal to the "A" and "B" "Type III" sums of
squares.
c) Now suppose that by some misfortune, both observations from the (2, 2) cell of this
complete 3 × 3 factorial somehow get lost and one has only n = 10 observations from
k = 8 cells (and thus "incomplete factorial" data). Test the hypothesis that, at least for the
cells where one has data, there are no interactions, i.e. EY ∈ C((1 | Xα | Xβ)). (Note that this
matrix (1 | Xα | Xβ) should be of full rank (5).)
d) In the incomplete factorial context of part c), the function μ + α2 + β2 is estimable.
What is the OLS estimate for it? (Note that this is the mean response for the missing cell
only if the same no interaction model used to describe the 8 cells extends to the 9th.
Notice that this is the kind of assumption one makes in regression analysis when using a
fitted prediction equation to estimate a mean response at a set of conditions not in one's
original data set. It might well be argued, however, that the link between observed and
unobserved conditions is intuitively stronger with quantitative factors than with
qualitative factors.)
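The calculations in parts c) and d) might be set up roughly as follows. This is a sketch under assumptions: the data are the table values with the two (2, 2) observations deleted, the sum restriction is imposed through contrasts.arg, and the names proj, X, Xs and est are illustrative.

```r
# Problem 4 data with both observations from the (A = 2, B = 2) cell removed
Y <- c(12, 9, 10,  14, 11,  10, 12, 6, 7, 7)
A <- factor(c(1, 2, 3,  1, 3,  1, 1, 2, 2, 3))
B <- factor(c(1, 1, 1,  2, 2,  3, 3, 3, 3, 3))
n <- length(Y)

# No-interaction model matrix (1 | X_alpha | X_beta) under the sum restriction
X <- model.matrix(~ A + B,
                  contrasts.arg = list(A = "contr.sum", B = "contr.sum"))

# Cell means model matrix for the 8 cells that actually have data
Xs <- model.matrix(~ interaction(A, B, drop = TRUE) - 1)

# Projection matrix onto C(M); M is assumed to be of full column rank
proj <- function(M) M %*% solve(t(M) %*% M) %*% t(M)
PX  <- proj(X)
PXs <- proj(Xs)
k <- ncol(X)    # 5
m <- ncol(Xs)   # 8

# c) F test of "no interactions among the observed cells"
Fstat <- (as.numeric(t(Y) %*% (PXs - PX) %*% Y) / (m - k)) /
         (as.numeric(t(Y) %*% (diag(n) - PXs) %*% Y) / (n - m))
pval  <- 1 - pf(Fstat, m - k, n - m)

# d) OLS estimate of mu + alpha2 + beta2; the columns of X are
#    (Intercept), A1, A2, B1, B2, so the coefficient vector is (1, 0, 1, 0, 1)
b   <- solve(t(X) %*% X) %*% t(X) %*% Y
est <- sum(c(1, 0, 1, 0, 1) * b)
```

Only 2 degrees of freedom are available for pure error here, so the test in part c) has little power. The estimate in part d) is meaningful for the missing cell only under the no-interaction assumption discussed in the problem statement.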