Limitations of simple linear regression

• So far, we have only been able to examine the relationship between two variables.
• In many instances, we believe that more than one independent variable is correlated with the dependent variable.
• Multiple linear regression is a tool that allows us to examine the relationship between two or more regressors and a response variable.
• This technique is especially useful when trying to account for potential confounding factors in observational studies.

General form of the model

• The model is very similar to the simple linear model from before, with the addition of other regressor(s).
• As before, we assume that the regressors relate to the response variable linearly.
• We assume the relationship is of the form:

    E(Y) = β0 + β1 X1 + β2 X2 + ... + βk Xk

• We use the least squares methodology from before, which minimizes the sum of squared residuals Σᵢ (yᵢ − ŷᵢ)² over i = 1, ..., n.
• The fitted model is of the form:

    Ŷ = β̂0 + β̂1 X1 + β̂2 X2 + ... + β̂k Xk

Model with two regressors

Consider the case of two regressors, which is much easier to visualize than more complicated cases.

• In the case of two regressors, we are fitting a plane through our data using the same least squares strategy as before.
• The coefficient of each independent variable tells us the estimated change in the response associated with a one-unit increase in that independent variable, given that the other independent variable is held constant.
• If both regressors change, then the estimated change in the response variable (∆Y) is given by β̂1 ∆X1 + β̂2 ∆X2.

Example with two regressors

Imagine that we have a random sample of families. For each family we have information about the annual income, annual savings, and the number of children. We are interested in how the number of children and the level of income relate to the amount of money saved.

• Our response variable is the annual savings.
• Our regressors are the number of children and the annual income.
• Our fitted model will take the form (a code sketch of fitting such models follows the figures below):

    estimated average savings = β̂0 + β̂1 ∗ income + β̂2 ∗ number of children

Looking at the results as separate regression lines

[Figure: savings (in thousands of dollars) plotted against income (in thousands of dollars), with each family marked by its number of children (legend: one child, two children, three children). The three fitted lines are parallel.]

What if the slopes are different?

• In the last example, the coefficient of income remains the same regardless of the number of kids in the family.
• What if you think that there is an interaction between income and children? (That is, you think the effect is not strictly additive.)
• You might choose to fit the model with an interaction effect, in which case you are modeling:

    mean savings = β0 + β1 ∗ income + β2 ∗ children + β3 ∗ income∗children

• This allows the coefficient associated with income to change based on the number of children.
• This sort of model is still linear, because the unknowns (the βs) are linear in their relationship to the knowns (income, children, income∗children).

Fitted lines with interaction term

[Figure: the same scatter plot, with the fitted lines for one-, two-, and three-child families now allowed to have different slopes.]
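The savings example can be fit with a single call to lm() in a package like S-Plus or R. A minimal sketch, assuming a hypothetical data frame called families with columns savings, income, and children (these names are assumptions, not from the slides):

    # Additive model: mean savings = b0 + b1*income + b2*children
    # `families` is a hypothetical data frame; column names are assumed.
    fit.add <- lm(savings ~ income + children, data = families)
    summary(fit.add)   # estimates, standard errors, t values, R-squared

The estimated coefficient on income is the change in predicted savings per one-unit increase in income, with the number of children held constant.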
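Allowing the income slope to vary with the number of children, as in the interaction model above, changes only the formula. A sketch under the same assumptions:

    # Interaction model:
    # mean savings = b0 + b1*income + b2*children + b3*income*children
    fit.int <- lm(savings ~ income * children, data = families)
    summary(fit.int)

In the formula language, income * children expands to income + children + income:children, so the income:children row of the output is the estimate of β3.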
Dummy variables

• What happens when we want to denote a category as a regressor?
• For instance, let's say that we have a data set with gender as a variable. We have denoted males as "0" and females as "1".
• Such binary variables are called "dummy variables".
• Adding such a variable into the regression allows different intercepts in the fitted regression equation for males and females (or whatever two categories you have).
• Adding an interaction term of the form "dummy∗non-dummy" allows different coefficients on the non-dummy variable for males and females (a fitting sketch appears after the hiring salaries example at the end).

Interpretation of dummy variables

Imagine that we have a data set for a sample of families, including annual income, annual savings, and whether the family has a single breadwinner ("1") or not ("0"). Fit the model:

    mean savings = β0 + β1 ∗ income + β2 ∗ oneearn

Assume we obtain the fitted regression equation:

    estimated savings = 400 + 0.05 ∗ income − 0.02 ∗ oneearn

How does being a one-breadwinner family affect the estimated average savings?

Interpretation of dummy variables (cont.)

Fit the model (variables the same as those on the last slide):

    mean savings = β0 + β1 ∗ income + β2 ∗ oneearn + β3 ∗ income∗oneearn

Assume we obtain the fitted regression equation:

    estimated savings = 400 + 0.10 ∗ income − 175 ∗ oneearn − 0.04 ∗ income∗oneearn

Re-examine how being a one-breadwinner family affects the estimated average savings.

Inference concerning mult. regression coefficients

• As in the case of simple linear regression, we can also form confidence intervals and conduct hypothesis tests for the coefficients βi.
• In the case of k regressors, the statistic β̂i / SE(β̂i) has a t distribution with n − k − 1 degrees of freedom.
• 100(1 − α)% confidence interval for βi: β̂i ± t(α/2; n − k − 1) ∗ SE(β̂i) (see the numeric sketch after the hiring salaries example).
• The estimates β̂i and their standard errors can be found in the output of a statistical package like S-Plus. Also, the fitted regression equation is sometimes presented with the standard errors listed under each estimate in parentheses:

    ŷ = β̂0       + β̂1 x1     + β̂2 x2
        (SE(β̂0))   (SE(β̂1))   (SE(β̂2))

Adjusted/corrected R²

• For multiple regression, we can still calculate the coefficient of determination R² = SSR/SST.
• As before, R² measures the proportion of the sum of squares of deviations of Y that can be explained by the relationship we have fitted using the explanatory variables.
• Note that adding regressors can never cause R² to decrease, even if the added regressors do not seem to have a significant effect on the response Y.
• Adjusted (sometimes called "corrected") R² takes into account the number of regressors included in the model; in effect, it penalizes us for adding regressors that don't "contribute their part" to explaining the response variable.
• Adjusted R² is given by the following, where k is the number of regressors:

    Adjusted R² = [(n − 1)R² − k] / (n − k − 1)

Example: Hiring salaries

Call: lm(formula = lgsalhr ~ age + educatn + seniorty + gender)

Coefficients:
                Value  Std. Error  t value  Pr(>|t|)
 (Intercept)   8.7161      0.1116  78.1167    0.0000
 age           0.0002      0.0001   2.7001    0.0083
 educatn       0.0170      0.0045   3.8012    0.0003
 seniorty     -0.0041      0.0009  -4.3281    0.0000
 gender       -0.1439      0.0216  -6.6626    0.0000

Residual standard error: 0.09145 on 88 d.f.
Multiple R-Squared: 0.5209

Using the above information (n = 93 observations, k = 4 regressors), adjusted R² = [(93 − 1)(0.5209) − 4] / (93 − 4 − 1) = 0.4991.
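The adjusted R² computed above can be checked with a few lines of arithmetic; a sketch using only the numbers shown in the output:

    n <- 93; k <- 4; R2 <- 0.5209
    ((n - 1) * R2 - k) / (n - k - 1)   # 0.4991, as computed above

(In R, when the fitted lm object itself is available, summary(fit)$adj.r.squared reports the same quantity.)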
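Similarly, a confidence interval for any coefficient follows the β̂i ± t(α/2; n − k − 1) ∗ SE(β̂i) formula from the inference slide. A sketch of a 95% interval for the seniorty coefficient, using the estimate and standard error from the table (88 degrees of freedom):

    b  <- -0.0041                  # seniorty estimate from the output
    se <-  0.0009                  # its standard error
    t.crit <- qt(0.975, df = 88)   # upper 2.5% point, about 1.99
    b + c(-1, 1) * t.crit * se     # roughly (-0.0059, -0.0023)

In R, confint(fit) computes such intervals for all coefficients at once.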
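Note that gender in this example is exactly the kind of 0/1 dummy variable described earlier: its coefficient of −0.1439 shifts the intercept between the two groups. A sketch of fitting this model, and a variant with a dummy∗non-dummy interaction, assuming a hypothetical data frame hires whose columns match the variable names in the output:

    # The 0/1 dummy enters like any other regressor: different intercepts.
    fit <- lm(lgsalhr ~ age + educatn + seniorty + gender, data = hires)

    # seniorty * gender adds the seniorty:gender interaction term,
    # allowing the seniority coefficient to differ by gender.
    fit2 <- lm(lgsalhr ~ age + educatn + seniorty * gender, data = hires)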