Stat 511 HW#6 Spring 2003

1. The data set sealstrength.txt located on the course "data sets" page (follow the link from the main course Web page) is a classic one taken from "Sealing Strength of Wax-Polyethylene Blends" by Brown, Turner and Smith (Tappi, 1958). Given are values of

   (Coded) Seal Temperature          $x_1 = (t_1 - 225)/30$, where $t_1$ is in °F
   (Coded) Cooling Bar Temperature   $x_2 = (t_2 - 55)/9$, where $t_2$ is in °F
   (Coded) Polyethylene Content      $x_3 = (c - 1.1)/.6$, where $c$ is in %
   Bread Wrapper Seal Strength       $y$, in g/in.

from an experiment run to find good (large $y$) settings for the process variables $x$. A "standard" "response surface" analysis of these data is based on a multivariate quadratic regression. Use R and appropriate matrix calculations to do the following.

a) Fit the (linear in the parameters and quadratic in the predictors) model

   $$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{1i}^2 + \beta_5 x_{2i}^2 + \beta_6 x_{3i}^2 + \beta_7 x_{1i} x_{2i} + \beta_8 x_{1i} x_{3i} + \beta_9 x_{2i} x_{3i} + \epsilon_i$$

to these data. Then compute and normal-plot standardized residuals.

b) In the model from a), test $H_0: \beta_4 = \beta_5 = \cdots = \beta_9 = 0$. Report a p-value. Does quadratic curvature in response (as a function of the $x$'s) appear to be statistically detectable? (If the null hypothesis is true, the response is "planar" as a function of the $x$'s.)

c) Some multivariate calculus on the fitted quadratic equation can be used to establish that it has an absolute maximum at about the set of conditions $x_1 = -1.01$, $x_2 = .26$, and $x_3 = .68$. Use R matrix calculations to find 90% two-sided confidence limits for the mean response here. Then find 90% two-sided prediction limits for a new response from this set of conditions.

2. (Testing "Lack of Fit" … See Section 6.6 of Christensen) Suppose that in the usual linear model $Y = X\beta + e$, $X$ is of full rank ($k$). Suppose further that there are $m < n$ distinct rows in $X$ and that $m > k$.
One can then make up a "cell means" model for $Y$ (where observations having the same corresponding row in $X$ are given the same mean response), say

   $$Y = X^{*}\mu + e$$

This model puts no restrictions on the means of the observations except that those with identical corresponding rows of $X$ are equal. It is the case that $C(X) \subset C(X^{*})$ and it thus makes sense to test the hypothesis $H_0: \mathrm{E}Y \in C(X)$ in the cell means model. This can be done using

   $$F = \frac{Y'(P_{X^{*}} - P_X)Y / (m - k)}{Y'(I - P_{X^{*}})Y / (n - m)}$$

and this is usually known as testing for "lack of fit." Use R and matrix calculations to find a p-value for testing lack of fit to the quadratic regression equation in problem 1.

3. (Adapted from Koehler's Spring 2002 HW7) In a study to examine the effects of $I = 4$ drugs on dogs under $J = 3$ disease conditions, increases in systolic blood pressure ($y$, in mm Hg) were observed after drug treatment for several dogs with experimentally induced cases of the diseases. The measured increases from Kutner (1974) are as below.

               Drug 1                    Drug 2                  Drug 3             Drug 4
   Disease 1   42, 44, 36, 13, 19, 22    28, 23, 24, 42, 13      1, 29, 19          24, 9, 22, −2, 15
   Disease 2   33, 26, 33, 21            34, 33, 31, 36          11, 9, 7, 1, −6    27, 12, 12, −5, 16, 15
   Disease 3   31, −3, 25, 25, 24        3, 26, 28, 32, 3, 16    21, 1, 9, 3        22, 7, 25, 5, 12

a) Create three vectors in R of length $n = 58$. The first should contain $y$ values, the second Drug ID Numbers (1-4), and the third Disease ID Numbers (1-3). Call these vectors respectively "y", "drug" and "disease".
Then create and print out an R data frame using the commands

   > d<-data.frame(y,drug,disease)
   > d

b) Turn the numerical variables drug and disease into variables that R will recognize as levels of factors by issuing the commands

   > d$drug<-as.factor(d$drug)
   > d$disease<-as.factor(d$disease)

Then compute and print out the cell means by typing

   > means<-tapply(d$y,list(d$drug,d$disease),mean)
   > means

You may find out more about the function tapply by typing

   > ?tapply

c) Make a crude interaction plot by doing the following. First type

   > x.axis<-unique(d$drug)

to set up horizontal plotting positions for the sample means. Then make a "matrix plot" with lines connecting points by issuing the commands

   > matplot(c(1,4),c(-10,50),type="n",xlab="Drug",ylab="Mean Response",main="Change in Systolic Blood pressure")
   > matlines(x.axis,means,type='b',lty=c(1,3,7))

The first of these commands sets up the axes and makes a dummy plot with invisible points "plotted" at $(1, -10)$ and $(4, 50)$. The second puts the lines and identifying disease numbers (as plotting symbols) on the plot.

d) Set the default for the restriction used to create a full rank model matrix, run the linear models routine and find both (sensible) sets of "Type I" sums of squares by issuing the following commands

   > options(contrasts=c("contr.sum","contr.sum"))
   > lm.out1<-lm(y~drug*disease,data=d)
   > summary.aov(lm.out1,ssType=1)
   > lm.out2<-lm(y~disease*drug,data=d)
   > summary.aov(lm.out2,ssType=1)

Then compute "Type III" sums of squares by issuing the command

   > summary.aov(lm.out1,ssType=3)

This is the question as assigned. As discussed via e-mail, it appears that R will not compute Type III sums of squares and one must use John Fox's unsupported "car" package to get this done in R. Splus DOES produce the Type III sums of squares if the above command is used.

(As Prof. Koehler points out about this data set, we have ignored a potentially important aspect of the original real problem here.
There were actually originally 6 dogs assigned at random to each of the 12 treatment combinations. We have tacitly assumed that the data that are missing are "missing at random," i.e. that the "missingness" provides no information about the effects of the treatments. If that tacit assumption is wrong, none of what is done above is anything but a numerical exercise … it provides no serious scientific insight. For example, you might consider how differently you might think about the medical problem if you believed that in fact all missing data correspond to dead dogs, and that deaths were fundamentally due to huge blood pressure increases that are not captured by the given values.)

4. Below is a small table of fake 2-way factorial data. Enter them into R in three vectors of length $n = 12$, much as was done in problem 3. Call these vectors "Y", "A", and "B".

                  Level 1 of A   Level 2 of A   Level 3 of A
   Level 1 of B   12             9              10
   Level 2 of B   14             11, 12         11
   Level 3 of B   10, 12         6, 7           7

a) Repeat parts a)-d) of Problem 3 on these data.

b) Create $12 \times 9$ full rank model matrices for both a cell means model and an effects model with the sum restriction for these data. Using R matrix calculations and the 2nd of these, compute "Type I" sums of squares corresponding to the linear model fit by

   > lm.out1<-lm(Y~A*B,data=d)

Then use the first of these model matrices and appropriate matrices $C$ to compute sums of squares for $H_0: C\beta = 0$, $SS_{H_0}$, that are equal to the "A" and "B" "Type III" sums of squares.

c) Now suppose that by some misfortune, both observations from the $(2, 2)$ cell of this complete $3 \times 3$ factorial somehow get lost and one has only $n = 10$ observations from $k = 8$ cells (and thus "incomplete factorial" data). Test the hypothesis that, at least for the cells where one has data, there are no interactions, i.e. $\mathrm{E}Y \in C(1 \mid X_\alpha \mid X_\beta)$. (Note that this matrix $(1 \mid X_\alpha \mid X_\beta)$ should be of full rank (5).)
d) In the incomplete factorial context of part c), the function $\mu + \alpha_2 + \beta_2$ is estimable. What is the OLS estimate for it? (Note that this is the mean response for the missing cell only if the same no interaction model used to describe the 8 cells extends to the 9th. Notice that this is the kind of assumption one makes in regression analysis when using a fitted prediction equation to estimate a mean response at a set of conditions not in one's original data set. It might well be argued, however, that the link between observed and unobserved conditions is intuitively stronger with quantitative factors than with qualitative factors.)
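
The matrix calculations that parts c) and d) of Problem 4 call for might be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not an official solution: the helper `proj`, the object names (`Xcell`, `Xadd`, `x22` and so on) and the order in which the 10 remaining observations are typed in are all choices made here. It compares the 8-cell cell means model with the no-interaction model using the same projection-matrix form of the F statistic given in Problem 2, and then evaluates the fitted no-interaction model at the $(2, 2)$ cell.

```r
# Fake data from Problem 4 with both (2,2) observations removed (n = 10).
# The order of entry below is an arbitrary choice made for this sketch.
Y <- c(12, 9, 10, 14, 11, 10, 12, 6, 7, 7)
A <- factor(c(1, 2, 3, 1, 3, 1, 1, 2, 2, 3))
B <- factor(c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3))
n <- length(Y)

# Hypothetical helper: projection matrix onto the column space of M
# (M is assumed to have full column rank).
proj <- function(M) M %*% solve(t(M) %*% M) %*% t(M)

# Cell means model: one indicator column per observed cell (8 columns).
cell  <- interaction(A, B, drop = TRUE)
Xcell <- model.matrix(~ 0 + cell)

# No-interaction model (1 | X_alpha | X_beta) under the sum restriction
# (5 columns: intercept, A1, A2, B1, B2).
Xadd <- model.matrix(~ A + B,
                     contrasts.arg = list(A = "contr.sum", B = "contr.sum"))

Pcell <- proj(Xcell)
Padd  <- proj(Xadd)

# Part c): F test of E Y in C(Xadd) within the cell means model.
df.num <- ncol(Xcell) - ncol(Xadd)   # 8 - 5 = 3
df.den <- n - ncol(Xcell)            # 10 - 8 = 2
ss.int <- drop(t(Y) %*% (Pcell - Padd) %*% Y)
ss.err <- drop(t(Y) %*% (diag(n) - Pcell) %*% Y)
F.stat <- (ss.int / df.num) / (ss.err / df.den)
p.val  <- pf(F.stat, df.num, df.den, lower.tail = FALSE)
print(c(F = F.stat, p.value = p.val))

# Part d): OLS estimate of mu + alpha_2 + beta_2 in the no-interaction model.
# With contr.sum coding, level 2 of each factor has its own column (A2, B2),
# so the (2,2) cell corresponds to the row (1, 0, 1, 0, 1).
b   <- solve(t(Xadd) %*% Xadd) %*% t(Xadd) %*% Y
x22 <- c(1, 0, 1, 0, 1)
est <- drop(x22 %*% b)
print(est)
```

Since the fitted mean depends only on the column space of the model matrix, any other full rank coding of the no-interaction model (for example R's default treatment contrasts) should give the same estimate, provided the row `x22` is recoded to match that coding.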