Chapter 9: Multiple Regression

An extension of simple linear regression to include several explanatory variables. In this chapter, we focus just on understanding the regression coefficients.

9.2 Regression Coefficients

9.2.1 The Multiple Linear Regression Model

Multiple Regression:
• One response variable and multiple explanatory variables
• Many possible multiple regression models are available.
• Warnings:
  – The regression will not be very helpful if it contains too many explanatory variables or too much complexity.
  – Do not think of the regression as some exact, discoverable equation. George Box: “All models are wrong, but some are useful.”
• Two explanatory variables: The regression of Y on X1 and X2 describes the mean of the distribution of Y for particular values of the explanatory variables.
• Examples of some MLR models:
    µ{Y | X1, X2} = β0 + β1 X1 + β2 X2
    µ{Y | X1} = β0 + β1 X1 + β2 X1²
    µ{Y | X1, X2} = β0 + β1 X1 + β2 X2 + β3 X1 X2
    µ{Y | X1, X2} = β0 + β1 log(X1) + β2 log(X2)
  The general case:
    µ{Y | X1, X2} = β0 f0(X1, X2) + β1 f1(X1, X2) + β2 f2(X1, X2) + . . .
  where the fj(X1, X2) are known functions of the explanatory variables. Above we used f0(X1, X2) = 1, fi(X1, X2) = X1, fj(X1, X2) = X2, fk(X1, X2) = X1 X2, fm(X1, X2) = log(X2), but many other options are available.
• Example (Meadowfoam Case Study):
    µ{flowers | light, time} = “mean number of flowers, as a function of intensity and timing” = β0 + β1 light + β2 time
    Var{flowers | light, time} = σ² = “variance of numbers of flowers, as a function of light and time”
  As in SLR, we assume constant variance.
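Each of the example models above is linear in the β's, so each can be fit by least squares with lm(). The following is a minimal R sketch under stated assumptions: the data are simulated, and the names x1, x2, y and the coefficient values used to generate y are made up for illustration, not taken from any case study.

```r
# Simulated data for illustration only (not from the text)
set.seed(1)
n  <- 50
x1 <- runif(n, 1, 10)
x2 <- runif(n, 1, 10)
y  <- 2 + 0.5 * x1 - 0.3 * x2 + rnorm(n)

# mu{Y | X1, X2} = b0 + b1 X1 + b2 X2
fit.additive <- lm(y ~ x1 + x2)

# mu{Y | X1} = b0 + b1 X1 + b2 X1^2  (I() protects ^ inside a formula)
fit.quadratic <- lm(y ~ x1 + I(x1^2))

# mu{Y | X1, X2} = b0 + b1 X1 + b2 X2 + b3 X1 X2
fit.interaction <- lm(y ~ x1 + x2 + x1:x2)

# mu{Y | X1, X2} = b0 + b1 log(X1) + b2 log(X2)
fit.log <- lm(y ~ log(x1) + log(x2))

# Number of estimated betas in each model
sapply(list(additive = fit.additive, quadratic = fit.quadratic,
            interaction = fit.interaction, log = fit.log),
       function(fit) length(coef(fit)))
```

In every case the explanatory terms are known functions of X1 and X2; only the β's are estimated.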
9.2.2 Interpretation of Regression Coefficients

GOALS:
• Find a good fitting model for the response mean
• Word the questions of interest in terms of model parameters (the regression coefficients)
• Estimate the parameters with available data
• Employ appropriate inferential tools for answering the questions of interest and for expressing the uncertainty in the answers.

Regression surfaces
• For two explanatory variables the regression surface is a plane (Meadowfoam Example).
  – β1 is the slope of the plane as a function of light for any fixed value of time.
  – β2 is the slope of the plane as a function of time for any fixed value of light.
• For more than 2 explanatory variables, it is difficult and not always useful to consider the geometry of the regression surface. We instead interpret the regression coefficients in terms of the association the selected explanatory variable has with the mean of the response when other explanatory variables are also included in the model.

“Effects” of Explanatory Variables
• We want to estimate the change in the mean response that is associated with a one-unit increase in one explanatory variable while holding all other explanatory variables fixed.
• Model: µ{Y | X1, X2} = β0 + β1 X1 + β2 X2
• Use subtraction with the model specified above to find the parameter you want to estimate:
  – Find the parameter describing the change in the mean of Y when X1 is increased by one unit, and X2 is held constant:
      µ{Y | X1 = (a + 1), X2 = b} − µ{Y | X1 = a, X2 = b} = ?
  – Find the parameter describing the change in the mean of Y when X2 is increased by one unit, and X1 is held constant:
      µ{Y | X1 = a, X2 = (b + 1)} − µ{Y | X1 = a, X2 = b} = ?
• For this model, does it matter what the values of a and b are?
• Meadowfoam example:
• Interpretation:
  – Randomized Experiment: Example: “A 1-unit increase in light intensity causes the mean number of flowers to increase by β1.”
  – Observational Study:
    ∗ Cannot make causal conclusions from statistical association!
    ∗ Use “associated with”
    ∗ Example: “For any subpopulation of mammal species with the same body weight and litter size, a 1-day increase in the species’ gestation length is associated with a β2 gram increase in the mean brain weight.”

Interpretation Depends on What Other X’s Are Included
• Different interpretation of β1 in the following two models:
    Model A: µ{Y | X1} = β0 + β1 X1
    Model B: µ{Y | X1, X2} = β0 + β1 X1 + β2 X2
• What does the β1 in Model A measure?
• What does the β1 in Model B measure?
• If β1 = 0 in Model B, does that imply that β1 = 0 in Model A?

9.3 Specially Constructed Explanatory Variables

9.3.1 A Squared Term for Curvature
• Incorporate curvature into the linear regression model when a straight-line model is not appropriate:
    µ{Y | X1} = β0 + β1 X1 + β2 X1²
• What does the linear in SLR refer to?
• Now, the “effect” of rainfall is modeled to be different at different levels of rainfall:
    µ{corn | (rain + 1)} − µ{corn | rain} =
• What gives a quick assessment of whether the straight-line model was inadequate from the table of coefficients?
• It is difficult and usually unnecessary to interpret the coefficients!

9.3.2 An Indicator Variable to Distinguish Between Two Groups
• We can incorporate categorical variables (or factors) by using indicator variables! (That is, we can combine our usual ANOVA situation with SLR.)
• What is an indicator variable (or dummy variable)?
• Consider the Meadowfoam example:
    µ{flowers | light, TIME} = β0 + β1 light + β2 ind.day24
  where ind.day24 = 0 at PFI (when TIME = 0) and ind.day24 = 1 pre-PFI (when TIME = 24).
  – What is the equation at PFI (when ind.day24 = 0)?
  – What is the equation pre-PFI (when ind.day24 = 1)?
  – We can think of TWO different regression lines!
  – Why do we call it a parallel lines model?
  – How do we interpret β2?
  – Would our interpretation have changed if we had defined our indicator variable differently: ind.day0 = 1 at PFI and ind.day0 = 0 pre-PFI?
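The subtraction technique from Section 9.2.2 applies directly to the two models in Sections 9.3.1 and 9.3.2. The worked equations below are a sketch that uses only the model forms written above. The one assumption is that the corn yield model is the quadratic model of Section 9.3.1 with Y = corn and X1 = rain.

```latex
% Quadratic model (assumed form): mu{corn | rain} = beta_0 + beta_1 rain + beta_2 rain^2
\begin{align*}
\mu\{corn \mid rain + 1\} - \mu\{corn \mid rain\}
  &= \left[\beta_0 + \beta_1 (rain + 1) + \beta_2 (rain + 1)^2\right]
   - \left[\beta_0 + \beta_1\, rain + \beta_2\, rain^2\right] \\
  &= \beta_1 + \beta_2 (2\, rain + 1)
\end{align*}

% Parallel lines model: mu{flowers | light, TIME} = beta_0 + beta_1 light + beta_2 ind.day24
\begin{align*}
\text{At PFI } (ind.day24 = 0): \quad
  \mu\{flowers \mid light, TIME = 0\} &= \beta_0 + \beta_1\, light \\
\text{Pre-PFI } (ind.day24 = 1): \quad
  \mu\{flowers \mid light, TIME = 24\} &= (\beta_0 + \beta_2) + \beta_1\, light \\
\text{Difference at any fixed } light: \quad
  \mu\{flowers \mid light, TIME = 24\} - \mu\{flowers \mid light, TIME = 0\} &= \beta_2
\end{align*}
```

In the quadratic model the change in the mean depends on the starting value of rain, which is why the individual coefficients are hard to interpret. In the indicator model the two lines share the slope β1 and differ only in intercept, so the vertical gap between them is β2 at every light level.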
9.3.3 Sets of Indicator Variables for Categorical Explanatory Variables with More than Two Categories
• A categorical explanatory variable is called a factor.
• What are the individual categories called?
• How do we incorporate a factor with k levels into regression?
  1. Choose a reference level for the factor. You will NOT make an indicator variable for this level.
  2. For every other level make an indicator variable (a vector of 1’s and 0’s). It should be 1 when an observation is in that group (or level) and 0 for an observation that is not in that group (or level).
• Baby Example:

    Response (Y)   X1     X2 (Category)
    10.2           12.3   A
     9.3           11.4   A
     8.4           13.4   B
     5.3           12.8   B
     5.6           12.7   A
     4.5           11.3   B
     3.4           13.4   C
    11.2           15.6   C
     5.7           14.1   D
     8.1           13.8   D
     7.9           11.1   D

    ind.A = I(X2 = A)
    ind.B = I(X2 = B)
    ind.C = I(X2 = C)

• What would our regression model now look like if we include both X1 and X2?
• What function of parameters (β’s) describes the difference in the mean of Y for X2 = 2 compared to X2 = 4, while holding X1 fixed?
• What function of parameters describes the difference in means between groups 1 and 2 for a fixed value of X1?
• Meadowfoam case study:
  – What are the factors?
  – How many levels does each factor have?
  – Let’s make indicator variables for light: How many?
    ∗ In R:
        ind.L300 <- ifelse(light==300,1,0)
        ind.L450 <- ifelse(light==450,1,0)
        ind.L600 <- ifelse(light==600,1,0)
        ind.L750 <- ifelse(light==750,1,0)
        ind.L900 <- ifelse(light==900,1,0)
    ∗ What is the reference level?
  – Let’s look at tables of regression coefficients for several models:
    ∗ For light as a continuous variable, ignoring TIME:
        µ{flowers | light} =

        Call:
        lm(formula = flowers ~ light)

        Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
        (Intercept) 77.385000   4.161186  18.597 6.06e-15 ***
        light       -0.040471   0.007123  -5.682 1.03e-05 ***
        ---
        Residual standard error: 8.94 on 22 degrees of freedom
        Multiple R-Squared: 0.5947, Adjusted R-squared: 0.5763
        F-statistic: 32.28 on 1 and 22 DF, p-value: 1.030e-05

    ∗ For light as a continuous variable with categorical TIME (parallel lines):
        µ{flowers | light, TIME} =

        Call:
        lm(formula = flowers ~ light + ind.day24)

        Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
        (Intercept) 71.305834   3.273772  21.781 6.77e-16
        light       -0.040471   0.005132  -7.886 1.04e-07
        ind.day24   12.158333   2.629557   4.624 0.000146

        Residual standard error: 6.441 on 21 degrees of freedom
        Multiple R-Squared: 0.7992, Adjusted R-squared: 0.78
        F-statistic: 41.78 on 2 and 21 DF, p-value: 4.786e-08

        [Figure: two scatterplots titled “Meadowfoam Data”, flowers vs. light, points coded At PFI / Pre-PFI]

    ∗ For LIGHT and TIME as categorical variables:
        µ{flowers | LIGHT, TIME} = LIGHT + TIME

        Call:
        lm(formula = flowers ~ ind.L300 + ind.L450 + ind.L600 + ind.L750 + ind.L900 + ind.day24)

        Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
        (Intercept)   67.196      3.629  18.518 1.05e-12
        ind.L300      -9.125      4.751  -1.921 0.071715
        ind.L450     -13.375      4.751  -2.815 0.011919
        ind.L600     -23.225      4.751  -4.888 0.000138
        ind.L750     -27.750      4.751  -5.841 1.97e-05
        ind.L900     -29.350      4.751  -6.178 1.01e-05
        ind.day24     12.158      2.743   4.432 0.000365

        Residual standard error: 6.719 on 17 degrees of freedom
        Multiple R-Squared: 0.8231, Adjusted R-squared: 0.7606
        F-statistic: 13.18 on 6 and 17 DF, p-value: 1.427e-05

  – ON YOUR OWN: Think about adding an interaction between LIGHT and TIME to the above model.
    ∗ How many means are specified by the model?
    ∗ How many parameters are used to specify those means?
    ∗ Compare this to the model with no interaction.
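The two-step recipe for a factor with k levels can be tried on the Baby Example. The R sketch below copies Y, X1, and X2 from the Baby Example table and treats D as the reference level, since the table lists indicators only for A, B, and C. The object names fit.ind, fit.fac, and X2.f are introduced here for illustration only.

```r
# Data copied from the Baby Example table
Y  <- c(10.2, 9.3, 8.4, 5.3, 5.6, 4.5, 3.4, 11.2, 5.7, 8.1, 7.9)
X1 <- c(12.3, 11.4, 13.4, 12.8, 12.7, 11.3, 13.4, 15.6, 14.1, 13.8, 11.1)
X2 <- c("A", "A", "B", "B", "A", "B", "C", "C", "D", "D", "D")

# Hand-made indicators; level D gets none, so D is the reference level
ind.A <- ifelse(X2 == "A", 1, 0)
ind.B <- ifelse(X2 == "B", 1, 0)
ind.C <- ifelse(X2 == "C", 1, 0)
fit.ind <- lm(Y ~ X1 + ind.A + ind.B + ind.C)

# Same model, letting R build the indicators from a factor with reference D
X2.f <- relevel(factor(X2), ref = "D")
fit.fac <- lm(Y ~ X1 + X2.f)

# Compare the two sets of coefficient estimates side by side
cbind(indicators = unname(coef(fit.ind)), factor = unname(coef(fit.fac)))
```

Both fits estimate the same five parameters: an intercept, a slope for X1, and one coefficient per non-reference level. Each indicator coefficient describes the difference in the mean of Y between that level and the reference level at a fixed value of X1, so a comparison of two non-reference levels is a difference of two coefficients.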
9.3.4 A Product Term for Interaction
• Two explanatory variables interact if the effect that one of them has on the mean response depends on the value of the other.
• How do we incorporate an interaction into our model?
• Meadowfoam example with interaction between light (continuous) and time (SEPARATE LINES):
    µ{flowers | light, TIME} =

    Call:
    lm(formula = flowers ~ light + ind.day24 + light * ind.day24)

    Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
    (Intercept)     71.623333   4.343305  16.491 4.14e-13
    light           -0.041076   0.007435  -5.525 2.08e-05
    ind.day24       11.523333   6.142361   1.876   0.0753
    light:ind.day24  0.001210   0.010515   0.115   0.9096

    Residual standard error: 6.598 on 20 degrees of freedom
    Multiple R-Squared: 0.7993, Adjusted R-squared: 0.7692
    F-statistic: 26.55 on 3 and 20 DF, p-value: 3.549e-07

  1. What are the slope and intercept at PFI (ind.day24 = 0) for the model with the interaction between TIME and light (continuous)?
  2. What are the slope and intercept pre-PFI (ind.day24 = 1) for the same model?

    [Figure: “Meadowfoam Data”, flowers vs. light, points coded At PFI / Pre-PFI]

  – NOTES:
    ∗ It is often difficult to interpret individual coefficients in an interaction model!
    ∗ When should you include an interaction term?
      1. When a question of interest pertains to an interaction.
      2. When good reason exists to suspect an interaction.
      3. When interactions are proposed as a more general model to examine goodness of fit of the no-interaction model.
    ∗ If you include an interaction, you should also include the individual terms, even if their coefficients are not significantly different from zero (except in special circumstances)!

The Mammals Brain Size Study
• Since brain size is obviously related to body size, we can use regression to investigate whether litter size and gestation period are associated with brain size after accounting for the effect of body weight.
• See the Summary of Statistical Findings on page 238.
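Returning to the separate-lines meadowfoam fit in Section 9.3.4, the two fitted lines can be read off the coefficient table by setting ind.day24 to 0 and then to 1. The short R sketch below only does arithmetic on the estimates printed in that output. The names b, at.pfi, and pre.pfi are introduced here for illustration.

```r
# Coefficient estimates copied from the interaction-model output
b <- c(intercept = 71.623333, light = -0.041076,
       ind.day24 = 11.523333, interaction = 0.001210)

# At PFI (ind.day24 = 0): both indicator terms drop out
at.pfi <- c(intercept = unname(b["intercept"]),
            slope     = unname(b["light"]))

# Pre-PFI (ind.day24 = 1): the indicator shifts the intercept,
# and the product term shifts the slope
pre.pfi <- c(intercept = unname(b["intercept"] + b["ind.day24"]),
             slope     = unname(b["light"] + b["interaction"]))

rbind(at.pfi, pre.pfi)
```

In an R formula, light * ind.day24 expands to light + ind.day24 + light:ind.day24, so the call shown in the output fits the same model as flowers ~ light * ind.day24 and the individual terms are included automatically.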
9.4 A Strategy for Data Analysis
See Display 9.9

9.5 Graphical Methods for Data Exploration and Presentation
• Matrix of Scatterplots:
  – Scatterplots of all possible pairwise sets of variables
  – Includes scatterplots of the response vs. explanatory variables and explanatory variables vs. other explanatory variables. (See Displays 9.10 and 9.11)
• Coded Scatterplots: Example given above for meadowfoam case study.
• Jittered Scatterplots: See Display 9.12
• Trellis Graphs: See Display 9.13
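Three of the graphical methods listed above can be sketched with base R graphics. This is a rough illustration, not the code behind the displays in the text: the data are simulated, and the names x1, x2, group, y, and dat are made up for the example.

```r
# Simulated data for illustration only (not from any case study)
set.seed(2)
n     <- 24
x1    <- rep(c(1, 2, 3, 4), each = 6)   # few distinct values, so points overlap
x2    <- runif(n)
group <- rep(c("g1", "g2"), times = 12)
y     <- 5 - x1 + 2 * (group == "g2") + rnorm(n)
dat   <- data.frame(y, x1, x2)

# Matrix of scatterplots: every pairwise plot of the response
# and the explanatory variables
pairs(dat)

# Coded scatterplot: the plotting symbol identifies the group
plot(x1, y, pch = ifelse(group == "g1", 1, 17))
legend("topright", legend = c("g1", "g2"), pch = c(1, 17))

# Jittered scatterplot: small random noise separates overlapping points
x1.jit <- jitter(x1)
plot(x1.jit, y)
```

Jittering changes only the plotted positions. Any model should still be fit to the original x1 values.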