Chapter 9: Multiple Regression (2)

We used several continuous predictors to build a model for log brain mass based on log body
mass. Multiple regression is also how we fit models involving classification variables (ANOVA
models) from Chapter 5. Let's look again at the mice diets in §5.1. Here are the six treatments and their mean lifetimes:
> with(mice, tapply(lifetime, treatment, mean))
   lopro    N/N85       NP    N/R40    N/R50    R/R50
39.68571 32.69123 27.40204 45.11667 42.29718 42.88571
and the coefficient estimates:
> mice.fit1 <- lm(lifetime ~ treatment, mice)
> summary(mice.fit1)
                 Estimate Std. Error   t value      Pr(>|t|)
(Intercept)     39.685714  0.8924172 44.469910 1.870050e-144
treatmentN/N85  -6.994486  1.2565210 -5.566549  5.248159e-08
treatmentNP    -12.283673  1.3063651 -9.402941  7.794053e-19
treatmentN/R40   5.430952  1.2408558  4.376780  1.600946e-05
treatmentN/R50   2.611469  1.1935501  2.187984  2.934503e-02
treatmentR/R50   3.200000  1.2620686  2.535520  1.167155e-02

Residual standard error: 6.678 on 343 degrees of freedom
Multiple R-squared: 0.4543,    Adjusted R-squared: 0.4463
F-statistic: 57.1 on 5 and 343 DF,  p-value: < 2.2e-16
Which treatment does not show up in the coefficient list?
How do you compute treatment means from the coefficient estimates?
You may think of the coefficients as one for a "baseline" treatment and others which adjust the baseline when we move to a different level of treatment. Different stat packages make different choices in how to pick a baseline level of a factor: R uses the first level, while SAS chooses the last level (alphabetically ordered).
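If you want a different baseline in R without building indicators by hand, relevel will reorder the factor levels for you. A quick sketch (treatment2 and mice.refit are names made up for illustration) that would make N/R50 the baseline; we will also do this "by hand" with indicators below.

> mice$treatment2 <- relevel(mice$treatment, ref = "N/R50")  ## N/R50 becomes the first level
> mice.refit <- lm(lifetime ~ treatment2, mice)              ## coefficients now adjust from N/R50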
We can write this out as a multiple regression model.
µ{y | x1, x2, x3, x4, x5} = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5
β0 is for the baseline group, and always gets added in. The xi's are indicators: for any row of the data (except for data points in the baseline group), exactly one of the xi's will be 1 and the others will be zero, so only one of β1, …, β5 gets used at a time.
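For example, here is a quick check (assuming the fitted model object is named mice.fit1 as above):

> b <- coef(mice.fit1)
> b["(Intercept)"]                        ## 39.68571, the baseline (lopro) mean
> b["(Intercept)"] + b["treatmentN/N85"]  ## 39.685714 - 6.994486 = 32.69123, matching tapply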
One of the planned contrasts was to compare lopro to N/R50. Which coefficient does this?
Four of the five comparisons desired are to N/R50. Let’s make this our baseline by building
indicators for the others.
> dim(mice)   ## 349 rows of data
[1] 349   2
## create 5 dummy variables. First make them all zeroes.
> lopro <- N.N85 <- NP <- N.R40 <- R.R50 <- rep(0, 349)
> lopro[which(mice$treatment == "lopro")] = 1   ## change the right rows to 1's
> N.N85[which(mice$treatment == "N/N85")] = 1   ## for each
> NP[which(mice$treatment == "NP")] = 1         ## in turn
> N.R40[which(mice$treatment == "N/R40")] = 1
> R.R50[which(mice$treatment == "R/R50")] = 1
Fit these indicators as in a multiple regression. Now what is the "intercept" estimating?
> summary(mice.fit2 <- lm(lifetime ~ lopro + N.N85 + NP + N.R40 + R.R50, mice))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  42.2972     0.7926  53.368  < 2e-16
lopro        -2.6115     1.1936  -2.188   0.0293
N.N85        -9.6060     1.1877  -8.088 1.06e-14
NP          -14.8951     1.2403 -12.009  < 2e-16
N.R40         2.8195     1.1711   2.408   0.0166
R.R50         0.5885     1.1936   0.493   0.6223

Residual standard error: 6.678 on 343 degrees of freedom
Multiple R-squared: 0.4543,    Adjusted R-squared: 0.4463
F-statistic: 57.1 on 5 and 343 DF,  p-value: < 2.2e-16
We wanted to compare four of the treatments (N/N85, lopro, R/R50, and N/R40) to N/R50. It's now easy to build CIs for those planned comparisons. The fifth comparison, NP to N/N85, will have to be done using a contrast with C2 = 1, C3 = −1, and the other Cj = 0.
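One way to compute that fifth comparison is to apply the contrast directly to the coefficient vector. This is a sketch of the generic computation (the names Cvec, est, and se are made up for illustration):

> ## coefficients of mice.fit2 are (Intercept), lopro, N.N85, NP, N.R40, R.R50
> Cvec <- c(0, 0, 1, -1, 0, 0)                  ## C2 = 1 (N.N85), C3 = -1 (NP)
> est  <- sum(Cvec * coef(mice.fit2))           ## estimated N/N85 - NP difference
> se   <- drop(sqrt(t(Cvec) %*% vcov(mice.fit2) %*% Cvec))
> est + c(-1, 1) * qt(0.975, df = 343) * se     ## 95% CI for the difference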
The two models we have fit are just different versions of the same model. To see this, we can use anova or compare the residuals.
> anova(mice.fit1, mice.fit2)
Analysis of Variance Table

Model 1: lifetime ~ treatment
Model 2: lifetime ~ lopro + N.N85 + NP + N.R40 + R.R50
  Res.Df   RSS Df  Sum of Sq F Pr(>F)
1    343 15297
2    343 15297  0 2.0009e-11           ## really a zero
> which(abs(resid(mice.fit1) - resid(mice.fit2)) > 1.0e-11)
named integer(0)
The anova function attempts to do an ESS F test, but the extra sum of squares is zero, and the two models use the same 5 df, so it can't compute the F ratio. None of the differences in residuals is greater than 10⁻¹¹. Note that each model uses 5 degrees of freedom to fit the six levels of the predictor.
Meadowfoam Example, §9.3.2–9.3.4
In the meadowfoam experiment, we have two predictors: timing of extra light is either day0 or
day24, and light intensity takes on six values. We can build indicator variables for these factors
as we did above. The light intensity variable can be considered continuous, or (to look at lack
of fit) as a factor with 6 levels.
> mfoam <- read.table("data/meadowfoam.txt",head=T)
> mfoam$time <- factor(mfoam$time, labels=c("day0","day24"))
## replaces the old numeric variable with a factor
> mfoam$Intens <- factor(mfoam$intensity)
## creates a new variable so we can use both factor and continuous versions.
> plot(flowers ~ intensity, pch = rep(c(1,16),each=12),mfoam)
> legend("topright",pch=c(1,16), c("Day 0","Day 24"))
## one line is not going to fit well.
> mfoam.fit1 <- lm(flowers ~ intensity, mfoam)
> coef(mfoam.fit1)
(Intercept)   intensity
77.38500000 -0.04047143
> abline(77.385, -0.0405, col="gray")
## do we want two parallel lines?
> mfoam.fit2 <- lm(flowers ~ intensity + time, mfoam)
> coef(mfoam.fit2)
(Intercept)   intensity   timeday24
71.30583333 -0.04047143 12.15833333
> abline(71.306, -0.0405, lty=2, col=4)
> abline(71.306 + 12.158, -0.0405, lty=2, col=4)
## or 2 lines which are not required to be parallel?
> mfoam.fit3 <- lm(flowers ~ intensity + time + intensity*time, mfoam)
> coef(mfoam.fit3)
        (Intercept)           intensity           timeday24 intensity:timeday24
       71.623333333        -0.041076190        11.523333333         0.001209524
> abline(71.623, -0.04108, col=2, lty=1)
> abline(71.623 + 11.523, -0.04108 + 0.00121, col=2, lty=1)
## not much different; we'll want to pick the simpler model.
[Figure: scatterplot of flowers (roughly 30–75) against intensity (150–900); open circles = Day 0, filled circles = Day 24, with the single-line fit (gray), the parallel-lines fit (dashed blue), and the separate-lines fit (red) overlaid.]
Let’s look at the possible models.
1. One slope.
   µ{y|x1} or µ{flowers | light} = β0 + β1x1
   flowers-hat = 77.385 − 0.0405 · light

2. Two parallel lines. What is this model when x2 = 0? when x2 = 1?
   µ{y|x1, x2} or µ{flowers | light, time} = β0 + β1x1 + β2x2
   flowers-hat = 71.306 − 0.0405 · light + 12.158 · day24

3. Two lines with arbitrary slopes. What is this model when x2 = 0? when x2 = 1?
   µ{y|x1, x2} or µ{flowers | light, time} = β0 + β1x1 + β2x2 + β3x1x2
   flowers-hat = 71.62 − 0.04108 · light + 11.523 · day24 + 0.00121 · light · day24
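To evaluate either fitted model at particular settings, predict does the arithmetic for us. A sketch (the intensity value 500 and the name nd are arbitrary choices for illustration):

> nd <- data.frame(intensity = c(500, 500),
+                  time = factor(c("day0", "day24"), levels = c("day0", "day24")))
> predict(mfoam.fit2, nd)   ## parallel lines: the two fitted values differ by 12.158
> predict(mfoam.fit3, nd)   ## interaction model: the gap depends (slightly) on intensity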
The third model uses what is called an "interaction" between the two predictors. An interaction is present when the effect of the first predictor depends on the value of the second predictor. In this case, we are asking whether the effect of light on flowers (a slope) changes with the timing of the light, or equivalently, whether the distance between the two lines (the timing effect) varies from low light intensity (left side of the plot) to high (right side). If the two slopes differ, then the distance between the lines is changing.
Here's the full model summary of the third model:
> summary(mfoam.fit3)
Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         71.623333   4.343305  16.491 4.14e-13
intensity           -0.041076   0.007435  -5.525 2.08e-05
timeday24           11.523333   6.142360   1.876   0.0753
intensity:timeday24  0.001210   0.010515   0.115   0.9096

Residual standard error: 6.598 on 20 degrees of freedom
Multiple R-squared: 0.7993,    Adjusted R-squared: 0.7692
F-statistic: 26.55 on 3 and 20 DF,  p-value: 3.549e-07
Compare to Display 9.14; the effect we call timeday24 is labeled "early" there.
Let’s use sequential ESS F tests on our third model.
> anova(mfoam.fit3)   ## could also include all 3 models here
Analysis of Variance Table

Response: flowers
               Df  Sum Sq Mean Sq F value    Pr(>F)
intensity       1 2579.75 2579.75 59.2597 2.101e-07
time            1  886.95  886.95 20.3742 0.0002119
intensity:time  1    0.58    0.58  0.0132 0.9095675
Residuals      20  870.66   43.53
Conclusions:
There is a significant (negative) effect of light intensity on the number of flowers (F = 59.3 on 1, 20 df, p-value = 2.1×10⁻⁷). The estimated slope −0.0411 (SE = 0.0074) is conditional on having an adjustment in the model allowing different intercepts for timing.
The effect of timing on flowers (after accounting for the light intensity effect) is also significant (F = 20.4 on 1, 20 df, p-value = 2.1×10⁻⁴), with an increase in flowering estimated as 11.5 (SE = 6.14) for beginning the extended lighting 24 days before PFI. The interaction between the two predictors is not significant (F = 0.013 on 1, 20 df, p-value = 0.91).
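Confidence intervals to go with these estimates come straight from confint; a sketch (this output is not shown above):

> round(confint(mfoam.fit3), 4)                   ## 95% CIs for all four coefficients
> confint(mfoam.fit3, "timeday24", level = 0.95)  ## just the timing effect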
We can also fit a model with a mean for each of the 12 treatment combinations by using
intensity as a factor.
> mfoam.fit4 <- lm(flowers ~ Intens * time, mfoam)
> anova(mfoam.fit2, mfoam.fit4)
Analysis of Variance Table

Model 1: flowers ~ intensity + time
Model 2: flowers ~ Intens * time
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     21 871.24
2     12 655.92  9    215.31 0.4377 0.8894
This F test assesses lack of fit. The null hypothesis is that the two-parallel-lines model is adequate; the alternative is that we should fit 12 separate means instead. We fail to reject, and conclude that the lack of fit is not significant.
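You can verify that F ratio by hand from the printed sums of squares:

> Fstat <- ((871.24 - 655.92) / (21 - 12)) / (655.92 / 12)  ## extra SS over its df, divided by full-model MSE
> Fstat                                             ## 0.4377
> pf(Fstat, df1 = 9, df2 = 12, lower.tail = FALSE)  ## about 0.889, the p-value above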
More on Building Indicators and the Software
It's more efficient to use the ifelse function in R like this:
> R.R50 <- ifelse(mice$treatment == "R/R50", 1, 0)
## instead of
> R.R50 <- rep(0, 349)
> R.R50[which(mice$treatment == "R/R50")] = 1   ## change the right rows to 1's
In both constructs, note the use of double equals signs when we want to ask "are two things equal?" That's different from assigning a value to an object, or passing an argument to a function; both of those use a single equals sign. You can look at what factor does using the contrasts function, which shows what the unique rows of the indicator coding look like.
> contrasts(mfoam$Intens)
    300 450 600 750 900
150   0   0   0   0   0
300   1   0   0   0   0
450   0   1   0   0   0
600   0   0   1   0   0
750   0   0   0   1   0
900   0   0   0   0   1
## first row is all zeroes because R skips the indicator for the baseline level (150)
> contrasts(mfoam$time)
      day24
day0      0
day24     1
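To see the full indicator coding row by row (not just the unique rows), look at the design matrix; a sketch:

> head(model.matrix(mfoam.fit4))  ## one row per plant: intercept, Intens dummies, time dummy, products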
Important Practice
What are the fitted models, with and without interactions?
> round(coef(mfoam.fit2), 3)
(Intercept)   intensity   timeday24
     71.306      -0.040      12.158
> round(coef(mfoam.fit3), 3)
        (Intercept)           intensity           timeday24 intensity:timeday24
             71.623              -0.041              11.523               0.001
Write out each model for two cases: day24 = 0 and day24 = 1.
The full 12-mean model uses dummy variables as shown above. What are the 12 fitted values?
> round(coef(mfoam.fit4), 3)   ## (numbers (1)-(12) added after R printed its labels)
          (1) (Intercept)            (2) Intens300            (3) Intens450            (4) Intens600
                   69.85                   -15.10                   -14.10                   -27.30
           (5) Intens750            (6) Intens900            (7) timeday24  (8) Intens300:timeday24
                  -31.75                   -30.50                     6.85                    11.95
 (9) Intens450:timeday24 (10) Intens600:timeday24 (11) Intens750:timeday24 (12) Intens900:timeday24
                    1.45                     8.15                     8.00                     2.30
Which coefficients go into estimating which means? In this first table, just write in the coefficient numbers (1 to 12) added to the R output above.

        L150   L300   L450   L600   L750   L900
day0
day24
Now add those values to get the means.

        L150   L300   L450   L600   L750   L900
day0
day24
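To check your filled-in tables, compute the 12 cell means directly from the data; a sketch:

> with(mfoam, tapply(flowers, list(time, Intens), mean))   ## 2 x 6 table of treatment means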
Summary:
The mean number of flowers is well described by a regression model with a single slope (negative)
on light intensity and a different (higher) intercept for beginning extended lighting 24 days before
PFI (as opposed to waiting for PFI before increasing hours of lighting).
Scope:
Because this was a randomized experiment, we can infer cause-effect relationships between the
predictors and the response. It is not clear where the seedlings came from. They were probably
a convenience sample of some sort, so inference extends only to the sample. Experts could (probably
should) argue that these plants are representative of some larger population of meadowfoam
plants, but the statistical inference extends only to the plants which could have been used in
the experiment.