Chapter 9: Multiple Regression

Multiple regression extends simple linear regression to include several explanatory variables. In this chapter,
we focus just on understanding the regression coefficients.
9.2 Regression Coefficients
9.2.1 The Multiple Linear Regression Model
Multiple Regression:
• One response variable and multiple explanatory variables
• Many possible multiple regression models are available.
• Warnings:
– The regression will not be very helpful if it contains too many explanatory variables
or too much complexity.
– Do not think of the regression as some exact, discoverable equation. George Box:
“All models are wrong, but some are useful.”
• Two explanatory variables: The regression of Y on X1 and X2 describes the mean of the
distribution of Y for particular values of the explanatory variables
• Examples of some MLR models:
µ{Y |X1 , X2 } = β0 + β1 X1 + β2 X2
µ{Y |X1 } = β0 + β1 X1 + β2 X1^2
µ{Y |X1 , X2 } = β0 + β1 X1 + β2 X2 + β3 X1 X2
µ{Y |X1 , X2 } = β0 + β1 log(X1 ) + β2 log(X2 )
The general case:
µ{Y |X1 , X2 } = β0 f0 (X1 , X2 ) + β1 f1 (X1 , X2 ) + β2 f2 (X1 , X2 ) + . . .
where fj (X1 , X2 )’s are known functions of the explanatory variables. Above we used
f0 (X1 , X2 ) = 1, fi (X1 , X2 ) = X1 , fj (X1 , X2 ) = X2 , fk (X1 , X2 ) = X1 X2 , fm (X1 , X2 ) =
log(X2 ), but many other options are available.
• Example (Meadowfoam Case Study):
µ{flowers | light, time} = “mean number of flowers, as a function of intensity and timing”
= β0 + β1 light + β2 time
Var{flowers | light, time} = σ^2
= “variance of numbers of flowers, as a function of light and time”
As in SLR, we assume constant variance.
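The four example models listed above can be written as R model formulas. The sketch below uses simulated data (the variables x1, x2, and y are made up for illustration and are not from the chapter) and only shows how each mean function maps to an lm() call.

```r
# Simulated data for illustration only (not from the chapter).
set.seed(1)
n  <- 40
x1 <- runif(n, 1, 10)
x2 <- runif(n, 1, 10)
y  <- 5 + 2 * x1 - 1.5 * x2 + rnorm(n)

fit_additive    <- lm(y ~ x1 + x2)            # b0 + b1*X1 + b2*X2
fit_quadratic   <- lm(y ~ x1 + I(x1^2))       # b0 + b1*X1 + b2*X1^2
fit_interaction <- lm(y ~ x1 + x2 + x1:x2)    # b0 + b1*X1 + b2*X2 + b3*X1*X2
fit_loglog      <- lm(y ~ log(x1) + log(x2))  # b0 + b1*log(X1) + b2*log(X2)

# Each term in the formula gets one coefficient, plus the intercept.
names(coef(fit_interaction))
```

Every one of these is "linear" in the β's, which is why lm() can fit all of them.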
9.2.2 Interpretation of Regression Coefficients
GOALS:
• Find a good fitting model for the response mean
• Word the questions of interest in terms of model parameters (the regression coefficients)
• Estimate the parameters with available data
• Employ appropriate inferential tools for answering the questions of interest and for expressing the uncertainty in the answers.
Regression surfaces
• For two explanatory variables the regression surface is a plane (Meadowfoam Example).
– β1 is the slope of the plane as a function of light for any fixed value of time.
– β2 is the slope of the plane as a function of time for any fixed value of light.
• For more than 2 explanatory variables, it is difficult and not always useful to consider the
geometry of the regression surface.
We instead interpret the regression coefficients in terms of the association the selected
explanatory variable has with the mean of the response when other explanatory variables
are also included in the model.
“Effects” of Explanatory Variables
• We want to estimate the change in the mean response that is associated with a one-unit
increase in one explanatory variable while holding all other explanatory variables fixed
• Model: µ{Y |X1 , X2 } = β0 + β1 X1 + β2 X2
• Use subtraction with the model specified above to find the parameter you want to estimate:
– Find the parameter describing the change in the mean of Y when X1 is increased by
one unit, and X2 is held constant:
µ{Y |X1 = (a + 1), X2 = b} − µ{Y |X1 = a, X2 = b} =?
– Find the parameter describing the change in the mean of Y when X2 is increased by
one unit, and X1 is held constant:
µ{Y |X1 = a, X2 = (b + 1)} − µ{Y |X1 = a, X2 = b} =?
• For this model, does it matter what the values of a and b are?
• Meadowfoam example:
• Interpretation:
– Randomized Experiment:
Example: “A 1-unit increase in light intensity causes the mean number of flowers to
increase by β1 .”
– Observational Study:
∗ Cannot make causal conclusions from statistical association!
∗ Use “associated with”
∗ Example: “For any subpopulation of mammal species with the same body weight
and litter size, a 1-day increase in the species’ gestation length is associated with
a β2 gram increase in the mean brain weight.”
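As a worked version of the subtraction step for the model µ{Y |X1 , X2 } = β0 + β1 X1 + β2 X2, one way to write it out is shown here. The values a and b are arbitrary placeholders.

```latex
\begin{align*}
\mu\{Y \mid X_1 = a+1,\, X_2 = b\} - \mu\{Y \mid X_1 = a,\, X_2 = b\}
  &= [\beta_0 + \beta_1 (a+1) + \beta_2 b] - [\beta_0 + \beta_1 a + \beta_2 b] = \beta_1, \\
\mu\{Y \mid X_1 = a,\, X_2 = b+1\} - \mu\{Y \mid X_1 = a,\, X_2 = b\}
  &= [\beta_0 + \beta_1 a + \beta_2 (b+1)] - [\beta_0 + \beta_1 a + \beta_2 b] = \beta_2.
\end{align*}
```

Both a and b cancel, so for this model the change does not depend on where the one-unit increase starts.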
Interpretation Depends on What Other X’s Are Included
• Different interpretation of β1 in the following two models:
Model A
Model B
µ{Y |X1 } = β0 + β1 X1
µ{Y |X1 , X2 } = β0 + β1 X1 + β2 X2
• What does the β1 in Model A measure?
• What does the β1 in Model B measure?
• If β1 = 0 in Model B, does that imply that β1 = 0 in Model A?
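A small constructed example may help with the Model A versus Model B questions. The data below are made up for illustration (not from the chapter): Y depends on X2 only, and X1 is correlated with X2. No noise is added, so the coefficients are recovered exactly.

```r
# Constructed data for illustration only (not from the chapter).
x1 <- 1:10
x2 <- x1 + rep(c(1, -1), times = 5)  # X2 moves with X1 but is not a copy of it
y  <- 3 + 2 * x2                     # mean of Y depends on X2 only; no noise added

model_A <- lm(y ~ x1)        # beta1: change in mean Y per unit of X1, X2 ignored
model_B <- lm(y ~ x1 + x2)   # beta1: change in mean Y per unit of X1, X2 held fixed

coef(model_A)["x1"]  # clearly nonzero: X1 acts as a stand-in for X2
coef(model_B)["x1"]  # zero up to rounding error
```

In this sketch β1 = 0 in Model B while β1 is far from 0 in Model A, so the same symbol measures different things in the two models.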
9.3 Specially Constructed Explanatory Variables
9.3.1 A Squared Term for Curvature
• Incorporate curvature into the linear regression model when a straight-line model is not
appropriate:
µ{Y |X1 } = β0 + β1 X1 + β2 X1^2
• What does the linear in SLR refer to?
• Now, the “effect” of rainfall is modeled to be different at different levels of rainfall:
µ{corn|(rain + 1)} − µ{corn|rain} =
• What gives a quick assessment of whether the straight-line model was inadequate from the
table of coefficients?
• It is difficult and usually unnecessary to interpret the coefficients!
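For the squared-term model, subtracting the mean at X1 from the mean at X1 + 1 gives β1 + β2 (2 X1 + 1), which changes with X1. The sketch below checks this numerically on constructed data (an exact quadratic with made-up coefficients, not the chapter's corn data).

```r
# Constructed data for illustration only (not the chapter's corn data).
x <- 0:10
y <- 1 + 2 * x - 0.5 * x^2        # exact quadratic mean, no noise added

fit <- lm(y ~ x + I(x^2))         # I() makes ^ mean arithmetic squaring
b   <- unname(coef(fit))          # b[1] = beta0, b[2] = beta1, b[3] = beta2

x0 <- c(2, 5, 8)
# Change in the fitted mean for a one-unit increase starting at x0 ...
diff_pred <- predict(fit, data.frame(x = x0 + 1)) - predict(fit, data.frame(x = x0))
# ... matches beta1 + beta2 * (2 * x0 + 1), so it depends on x0.
diff_formula <- b[2] + b[3] * (2 * x0 + 1)
```

Comparing diff_pred with diff_formula shows the "effect" of a one-unit increase shrinking and then turning negative as x0 grows, which is the curvature the squared term is meant to capture.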
9.3.2 An Indicator Variable to Distinguish Between Two Groups
• We can incorporate categorical variables (or factors) by using indicator variables! (i.e. We
can combine our usual ANOVA situation with SLR)
• What is an indicator variable (or dummy variable)?
• Consider the Meadowfoam example:
µ{flowers | light, TIME} = β0 + β1 light + β2 ind.day24
where ind.day24 = 0 at PFI (when TIME = 0) and ind.day24 = 1 for pre-PFI (when
TIME = 24).
– What is the equation at PFI (when ind.day24 = 0)?
– What is the equation pre-PFI (when ind.day24 = 1)?
– We can think of TWO different regression lines!
– Why do we call it a parallel lines model?
– How do we interpret β2 ?
– Would our interpretation have changed if we defined our indicator variable differently:
ind.day0 = 1 at PFI and ind.day0 = 0 pre-PFI?
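As a sketch of the two parallel lines, the code below plugs in the estimates from the fit of flowers ~ light + ind.day24 reported later in this chapter. The helper mean_flowers() and the light level 500 are illustrative choices, not part of the original notes.

```r
# Estimates copied from the fit of flowers ~ light + ind.day24 reported later in the chapter.
b0 <- 71.305834   # (Intercept)
b1 <- -0.040471   # light
b2 <- 12.158333   # ind.day24

# At PFI (ind.day24 = 0):  mean flowers = b0 + b1 * light
# Pre-PFI (ind.day24 = 1): mean flowers = (b0 + b2) + b1 * light
lines_tbl <- data.frame(
  timing    = c("At PFI", "Pre-PFI"),
  intercept = c(b0, b0 + b2),
  slope     = c(b1, b1)
)
lines_tbl

# Vertical gap between the lines at any light level equals b2.
mean_flowers <- function(light, ind.day24) b0 + b1 * light + b2 * ind.day24
gap_at_500 <- mean_flowers(500, 1) - mean_flowers(500, 0)
```

The two lines share the slope b1 and differ only in intercept, which is why this is called a parallel lines model.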
9.3.3 Sets of Indicator Variables for Categorical Explanatory Variables with
More than Two Categories
• A categorical explanatory variable is called a factor.
• What are the individual categories called?
• How do we incorporate a factor with k levels into regression?
1. Choose a reference level for the factor. You will NOT make an indicator variable for
this level.
2. For every other level make an indicator variable (a vector of 1’s and 0’s). It should
be 1 when an observation is in that group (or level) and 0 for an observation that is
not in that group (or level).
• Baby Example:
Response (Y)    X1      X2 Category    ind.A = I(X2 = A)    ind.B = I(X2 = B)    ind.C = I(X2 = C)
10.2            12.3    A
9.3             11.4    A
8.4             13.4    B
5.3             12.8    B
5.6             12.7    A
4.5             11.3    B
3.4             13.4    C
11.2            15.6    C
5.7             14.1    D
8.1             13.8    D
7.9             11.1    D
• What would our regression model now look like if we include both X1 and X2 ?
• What function of parameters (β’s) describes the difference in the mean of Y for X2 = B
compared to X2 = D, while holding X1 fixed?
• What function of parameters describes the difference in means between groups A and B for
a fixed value of X1 ?
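The Baby Example can be set up in R as sketched below, using the data from the table and treating level D as the reference level (the notes define indicators for A, B, and C only). The object names fit_ind, fit_factor, and x2_factor are illustrative.

```r
# Data copied from the Baby Example table.
y  <- c(10.2, 9.3, 8.4, 5.3, 5.6, 4.5, 3.4, 11.2, 5.7, 8.1, 7.9)
x1 <- c(12.3, 11.4, 13.4, 12.8, 12.7, 11.3, 13.4, 15.6, 14.1, 13.8, 11.1)
x2 <- c("A", "A", "B", "B", "A", "B", "C", "C", "D", "D", "D")

# Level D is the reference level, so it gets no indicator.
ind.A <- ifelse(x2 == "A", 1, 0)
ind.B <- ifelse(x2 == "B", 1, 0)
ind.C <- ifelse(x2 == "C", 1, 0)

fit_ind <- lm(y ~ x1 + ind.A + ind.B + ind.C)

# Same model with R building the indicators from a factor whose reference level is D.
x2_factor  <- relevel(factor(x2), ref = "D")
fit_factor <- lm(y ~ x1 + x2_factor)

# Difference in mean Y between a non-reference level and D at fixed X1:
# that level's indicator coefficient (here, level B).
coef(fit_ind)["ind.B"]
# Difference between two non-reference levels at fixed X1:
# the difference of their indicator coefficients (here, A minus B).
coef(fit_ind)["ind.A"] - coef(fit_ind)["ind.B"]
```

Both fits give the same coefficients; the factor version just saves building the indicators by hand.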
• Meadowfoam case study:
– What are the factors?
– How many levels does each factor have?
– Let’s make indicator variables for light: How many?
∗ In R:
ind.L300 <- ifelse(light==300,1,0)
ind.L450 <- ifelse(light==450,1,0)
ind.L600 <- ifelse(light==600,1,0)
ind.L750 <- ifelse(light==750,1,0)
ind.L900 <- ifelse(light==900,1,0)
∗ What is the reference level ?
– Let’s look at tables of regression coefficients for several models:
∗ For light as a continuous variable, ignoring TIME:
µ{flowers | light} =
Call: lm(formula = flowers ~ light)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 77.385000   4.161186  18.597 6.06e-15 ***
light       -0.040471   0.007123  -5.682 1.03e-05 ***
---
Residual standard error: 8.94 on 22 degrees of freedom
Multiple R-Squared: 0.5947, Adjusted R-squared: 0.5763
F-statistic: 32.28 on 1 and 22 DF, p-value: 1.030e-05
∗ For light as a continuous variable with categorical TIME (parallel lines):
µ{flowers | light, TIME} =
Call: lm(formula = flowers ~ light + ind.day24)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 71.305834   3.273772  21.781 6.77e-16
light       -0.040471   0.005132  -7.886 1.04e-07
ind.day24   12.158333   2.629557   4.624 0.000146
Residual standard error: 6.441 on 21 degrees of freedom
Multiple R-Squared: 0.7992, Adjusted R-squared: 0.78
F-statistic: 41.78 on 2 and 21 DF, p-value: 4.786e-08
[Figure: Meadowfoam Data, two panels. Each plots flowers (30 to 70) against light (200 to 800), with points coded At PFI / Pre-PFI.]
∗ For LIGHT and TIME as categorical variables:
µ{flowers | LIGHT, TIME} = LIGHT + TIME
Call: lm(formula = flowers ~ ind.L300 + ind.L450 + ind.L600 + ind.L750 +
ind.L900 + ind.day24)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   67.196      3.629  18.518 1.05e-12
ind.L300      -9.125      4.751  -1.921 0.071715
ind.L450     -13.375      4.751  -2.815 0.011919
ind.L600     -23.225      4.751  -4.888 0.000138
ind.L750     -27.750      4.751  -5.841 1.97e-05
ind.L900     -29.350      4.751  -6.178 1.01e-05
ind.day24     12.158      2.743   4.432 0.000365
Residual standard error: 6.719 on 17 degrees of freedom
Multiple R-Squared: 0.8231, Adjusted R-squared: 0.7606
F-statistic: 13.18 on 6 and 17 DF, p-value: 1.427e-05
– ON YOUR OWN: Think about adding an interaction between LIGHT and TIME to
the above model.
∗ How many means are specified by the model?
∗ How many parameters are used to specify those means?
∗ Compare this to the model with no interaction.
9.3.4 A Product Term for Interaction
• Two explanatory variables interact if the effect that one of them has on the mean response
depends on the value of the other.
• How do we incorporate an interaction into our model?
• Meadowfoam example with interaction between light (continuous) and time (SEPARATE
LINES):
µ{flowers | light, TIME} =
Call: lm(formula = flowers ~ light + ind.day24 + light * ind.day24)
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     71.623333   4.343305  16.491 4.14e-13
light           -0.041076   0.007435  -5.525 2.08e-05
ind.day24       11.523333   6.142361   1.876   0.0753
light:ind.day24  0.001210   0.010515   0.115   0.9096
Residual standard error: 6.598 on 20 degrees of freedom
Multiple R-Squared: 0.7993, Adjusted R-squared: 0.7692
F-statistic: 26.55 on 3 and 20 DF, p-value: 3.549e-07
1. What are the slope and intercept at PFI (ind.day24 = 0) for the model with the
interaction between TIME and light (continuous)?
2. What are the slope and intercept pre-PFI (ind.day24 = 1) for the same model?
[Figure: Meadowfoam Data. flowers (30 to 80) against light (0 to 1000), with points coded At PFI / Pre-PFI.]
– NOTES:
∗ It is often difficult to interpret individual coefficients in an interaction model!
∗ When should you include an interaction term?
1. When a question of interest pertains to an interaction.
2. When good reason exists to suspect an interaction
3. When interactions are proposed as a more general model to examine goodness
of fit of the no-interaction model.
∗ If you include an interaction, you should also include the individual terms, even
if their coefficients are not significantly different from zero (except in special
circumstances).
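For the separate-lines fit reported above, the slope and intercept of each line can be read off by setting ind.day24 to 0 or 1. The sketch below does that arithmetic with the reported estimates; the object name separate_lines is illustrative.

```r
# Estimates copied from the interaction fit reported above.
b0 <- 71.623333   # (Intercept)
b1 <- -0.041076   # light
b2 <- 11.523333   # ind.day24
b3 <-  0.001210   # light:ind.day24

# ind.day24 = 0: mean flowers = b0 + b1 * light
# ind.day24 = 1: mean flowers = (b0 + b2) + (b1 + b3) * light
separate_lines <- data.frame(
  timing    = c("At PFI", "Pre-PFI"),
  intercept = c(b0, b0 + b2),
  slope     = c(b1, b1 + b3)
)
separate_lines
```

In R formula syntax, light * ind.day24 already expands to light + ind.day24 + light:ind.day24, so the formula in the Call above is equivalent to flowers ~ light * ind.day24.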
The Mammals Brain Size Study
• Since brain size is obviously related to body size, using regression we can investigate if
litter size and gestational period are associated with brain size after accounting for the
effect of body weight.
• See the Summary of Statistical Findings on page 238.
9.4 A Strategy for Data Analysis
See Display 9.9
9.5 Graphical Methods for Data Exploration and Presentation
• Matrix of Scatterplots:
– Scatterplots of all possible pairwise sets of variables
– Includes scatterplots of the response vs. explanatory variables and explanatory variables vs. other explanatory variables. (See Displays 9.10 and 9.11)
• Coded Scatterplots
Example given above for meadowfoam case study.
• Jittered Scatterplots
See Display 9.12
• Trellis Graphs
See Display 9.13