Chapter 13: Multiple Regression

Introduction
• In this chapter we extend simple linear regression, where we had one explanatory variable, to allow for any number of explanatory variables.
• We expect to build a model that fits the data better than the simple linear regression model.
• We shall use computer printout to
  – Assess the model
    • How well does it fit the data?
    • Is it useful?
    • Are any required conditions violated?
  – Employ the model
    • Interpreting the coefficients
    • Predictions using the prediction equation
    • Estimating the expected value of the dependent variable

The Multiple Regression Model
Idea: Examine the linear relationship between 1 response variable (y) and 2 or more explanatory variables (x_i).

Population model:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$
where $\beta_0$ is the y-intercept, $\beta_1, \dots, \beta_k$ are the population slopes, and $\varepsilon$ is the random error.

Estimated multiple regression model:
$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$
where $\hat{y}$ is the estimated (or predicted) value of y, $b_0$ is the estimated intercept, and $b_1, \dots, b_k$ are the estimated slope coefficients.

Simple Linear Regression
$\hat{y} = b_0 + b_1 x$
[Figure: fitted line with intercept $b_0$ and slope $b_1$, showing the observed value of y for $x_i$, the predicted value of y for $x_i$, and the random error $\varepsilon_i$ for this x value.]

Multiple Regression, 2 explanatory variables
[Figure: points plotted against axes $x_1$, $x_2$ and y, with a least squares plane (instead of a line). The scatter of points around the plane is random error.]

Multiple Regression Model
Two variable model: $\hat{y} = b_0 + b_1 x_1 + b_2 x_2$
[Figure: a sample observation $y_i$ at $(x_{1i}, x_{2i})$, its fitted value $\hat{y}_i$ on the plane, and the residual $e = (y_i - \hat{y}_i)$.]
The best fit equation, $\hat{y}$, is found by minimizing the sum of squared errors, $\sum e^2$.

Estimating the Coefficients and Assessing the Model
• The procedure used to perform regression analysis:
  – Obtain the model coefficients and statistics using statistical software.
  – Diagnose violations of required conditions. Try to remedy problems when identified.
  – Assess the model fit using statistics obtained from the sample.
  – If the model assessment indicates a good fit to the data, use the model to interpret the coefficients and generate predictions.
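
As a minimal sketch of the least squares idea above, the code below fits a plane to a few made-up points (not the course data) by solving the normal equations $(X^\top X)\,b = X^\top y$ with Gauss-Jordan elimination. The helper name `fit_least_squares` and the data are hypothetical, chosen only for illustration. Statistical software generally uses more robust numerical methods, so this shows what "minimizing the sum of squared errors" produces, not how Excel computes it.

```python
def fit_least_squares(rows, y):
    """Return [b0, b1, ..., bk] minimizing the sum of squared errors.

    rows: list of observations, each a list of k explanatory values.
    y:    list of observed responses, same length as rows.
    (Hypothetical helper for illustration only.)
    """
    # Design matrix with a leading 1 in each row for the intercept.
    X = [[1.0] + [float(v) for v in r] for r in rows]
    p = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(x[i] * x[j] for x in X) for j in range(p)]
         + [sum(x[i] * yi for x, yi in zip(X, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * c for a, c in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Made-up data lying exactly on the plane y = 1 + 2*x1 + 3*x2,
# so the fitted coefficients should recover 1, 2 and 3.
rows = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1], [1, 3]]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

b = fit_least_squares(rows, y)

# Sum of squared errors for the fitted plane (zero here, up to rounding).
sse = sum((yi - (b[0] + b[1] * x1 + b[2] * x2)) ** 2
          for (x1, x2), yi in zip(rows, y))
print(b, sse)
```

With real data the points do not lie exactly on a plane, so the SSE is positive and the fitted plane is the one that makes it as small as possible.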
Estimating the Coefficients and Assessing the Model, Example
• Predicting final exam scores in BUS/ST 350
  – We would like to predict final exam scores in 350.
  – Use information generated during the semester.
  – Predictors of the final exam score:
    • Exam 1
    • Exam 2
    • Exam 3
    • Homework total
• Data were collected from 203 randomly selected students from previous semesters.
• The following model is proposed:
  $\text{final exam} = b_0 + b_1\,\text{exam1} + b_2\,\text{exam2} + b_3\,\text{exam3} + b_4\,\text{hwtot}$

Sample of the data:

exam1   exam2   exam3   hwtot   finalexm
   80      60      80     159         72
   80      70      75     359         76
   95      70      90     330         84
   90     100     100     359         92
   70      60      80     272         64
   90      70      70     344         84
   90      85      90     351         88
   85      35      90     200         76
   85      55      70     251         60
   40      80      95     293         64

Regression Analysis, Excel Output

Regression Statistics
Multiple R          0.618439
R Square            0.38246679
Adjusted R Square   0.36999137
Standard Error      11.5122313
Observations        203

ANOVA
             df   SS            MS      F       Significance F
Regression    4   16252.40443   4063    30.66   7.32692E-20
Residual    198   26241.23104   132.5
Total       202   42493.63547

            Coefficients   Standard Error   t Stat   P-value   Lower 95%       Upper 95%
Intercept   0.04978935     8.17368799       0.006    0.995     -16.06886586    16.16844455
exam1       0.10021107     0.075633398      1.325    0.187     -0.048939306    0.249361453
exam2       0.15413733     0.072271404      2.133    0.034     0.011616858     0.296657794
exam3       0.29600913     0.066724619      4.436    2E-05     0.16442702      0.427591244
hwtot       0.10771069     0.022685084      4.748    4E-06     0.062975308     0.152446072

This is the sample regression equation (sometimes called the prediction equation):
Final exam score = 0.0498 + 0.1002 exam1 + 0.1541 exam2 + 0.2960 exam3 + 0.1077 hwtot

Interpreting the Coefficients
• b0 = 0.0498. This is the intercept, the value of y when all the variables take the value zero. Since the data range of the independent variables does not cover the value zero, do not interpret the intercept.
• b1 = 0.1002. In this model, for each additional point on exam 1, the final exam score increases on average by 0.1002 (assuming the other variables are held constant).
• b2 = 0.1541. In this model, for each additional point on exam 2, the final exam score increases on average by 0.1541 (assuming the other variables are held constant).
• b3 = 0.2960. For each additional point on exam 3, the final exam score increases on average by 0.2960 (assuming the other variables are held constant).
• b4 = 0.1077. For each additional point on the homework, the final exam score increases on average by 0.1077 (assuming the other variables are held constant).

Final Exam Scores, Predictions
• Predict the average final exam score of a student with the following exam scores and homework score:
  – Exam 1 score 75
  – Exam 2 score 79
  – Exam 3 score 85
  – Homework score 310
  Final exam score = 0.0498 + 0.1002(75) + 0.1541(79) + 0.2960(85) + 0.1077(310) = 78.2857
  – Use the TREND function in Excel.

Model Assessment
• The model is assessed using three tools:
  – The standard error of the residuals
  – The coefficient of determination
  – The F-test of the analysis of variance
• The standard error of the residuals participates in building the other tools.

Standard Error of Residuals
• The standard deviation of the residuals is estimated by the standard error of the residuals:
  $s_e = \sqrt{\dfrac{SSE}{n - k - 1}}$
• The magnitude of $s_e$ is judged by comparing it to $\bar{y}$.
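
The prediction and the standard error formula above can be checked with a few lines of arithmetic. This is a small sketch, assuming the rounded coefficients of the sample regression equation and the SSE, n and k reported on the Excel printout. The function name `predict_final` is hypothetical.

```python
import math

# Coefficients as rounded in the sample regression equation.
B0, B_EXAM1, B_EXAM2, B_EXAM3, B_HWTOT = 0.0498, 0.1002, 0.1541, 0.2960, 0.1077


def predict_final(exam1, exam2, exam3, hwtot):
    """Prediction equation for the final exam score (hypothetical helper)."""
    return (B0 + B_EXAM1 * exam1 + B_EXAM2 * exam2
            + B_EXAM3 * exam3 + B_HWTOT * hwtot)


# Student with exam scores 75, 79, 85 and homework total 310.
predicted = predict_final(75, 79, 85, 310)

# Standard error of the residuals: s_e = sqrt(SSE / (n - k - 1)).
sse = 26241.23104   # residual sum of squares from the ANOVA table
n, k = 203, 4       # observations and explanatory variables
s_e = math.sqrt(sse / (n - k - 1))

print(round(predicted, 4))  # agrees with 78.2857 computed by hand above
print(round(s_e, 4))        # agrees with "Standard Error" on the printout
```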
Regression Analysis, Excel Output

Regression Statistics
Multiple R          0.618439
R Square            0.38246679
Adjusted R Square   0.36999137
Standard Error      11.5122313   (standard error of the residuals, sqrt(MSE))
Observations        203

ANOVA
             df   SS            MS      F       Significance F
Regression    4   16252.40443   4063    30.66   7.32692E-20
Residual    198   26241.23104   132.5
Total       202   42493.63547

• Residual SS = 26241.23104 is the sum of squares of residuals, SSE.
• Residual MS = 132.5 is (standard error of the residuals)^2: MSE = SSE/198.

Standard Error of Residuals
• From the printout, $s_e$ = 11.5122….
• Calculating the mean value of y, we have $\bar{y} = 78.84$.
• It seems $s_e$ is not particularly small.
• Question: Can we conclude the model does not fit the data well?

Coefficient of Determination $R^2$ (like $r^2$ in simple linear regression)
• The proportion of the variation in y that is explained by differences in the explanatory variables $x_1, x_2, \dots, x_k$
• $R^2 = 1 - (SSE/SS_{Total})$
• From the printout, $R^2$ = 0.382466…
• 38.25% of the variation in final exam score is explained by differences in the exam1, exam2, exam3, and hwtot explanatory variables. 61.75% remains unexplained.
• When adjusted for degrees of freedom, Adjusted $R^2$ = 36.99%.

Testing the Validity of the Model
• We pose the question: Is there at least one explanatory variable linearly related to the response variable?
• To answer the question we test the hypotheses
  $H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$
  $H_1$: At least one $\beta_i$ is not equal to zero.
• If at least one $\beta_i$ is not equal to zero, the model has some validity.
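
The fit statistics read from the printout earlier in this section can be reproduced from the ANOVA sums of squares. The sketch below assumes the printed values of SSE, SS Total, n, k, $s_e$ and $\bar{y}$. The adjusted $R^2$ line uses the standard degrees-of-freedom adjustment, which the slides report but do not spell out.

```python
sse = 26241.23104        # residual sum of squares (SSE)
ss_total = 42493.63547   # total sum of squares
n, k = 203, 4            # observations and explanatory variables
s_e = 11.5122313         # standard error of the residuals
y_bar = 78.84            # mean final exam score

# Size of s_e relative to the mean of y (about 15% here).
relative_se = s_e / y_bar

# Coefficient of determination: share of variation in y explained.
r_squared = 1 - sse / ss_total

# Adjusted R^2: each sum of squares is divided by its degrees of freedom.
adj_r_squared = 1 - (sse / (n - k - 1)) / (ss_total / (n - 1))

print(f"s_e / y_bar = {relative_se:.3f}")
print(f"R^2         = {r_squared:.4f}")
print(f"Adj. R^2    = {adj_r_squared:.4f}")
```

Adjusted $R^2$ is always a little below $R^2$ when k > 0, because it penalizes the model for each explanatory variable it uses.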
Testing the Validity of the Final Exam Scores Regression Model
• The hypotheses are tested by what is called an F test, shown in the Excel output below.

ANOVA
             df                SS          MS      F       Significance F
Regression   k = 4             16252.404   4063    30.66   7.32692E-20
Residual     n - k - 1 = 198   26241.231   132.5
Total        n - 1 = 202       42493.635

  – Regression SS is SSR, and MSR = SSR/k.
  – Residual SS is SSE, and MSE = SSE/(n - k - 1).
  – F = MSR/MSE, and Significance F is the P-value of the test.

• [Variation in y] = SSR + SSE. A large F results from a large SSR. Then much of the variation in y is explained by the regression model; the model is useful, and thus the null hypothesis H0 should be rejected.
• Reject H0 when P-value < 0.05.
• The P-value (Significance F) < 0.05, so reject the null hypothesis.
• Conclusion: There is sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis. At least one of the $\beta_i$ is not equal to zero. Thus, at least one explanatory variable is linearly related to y. This linear regression model is valid.

Testing the Coefficients
• The hypotheses for each $\beta_i$ are
  $H_0: \beta_i = 0$
  $H_1: \beta_i \neq 0$
• Test statistic:
  $t = \dfrac{b_i - 0}{s_{b_i}}$, with d.f. = n - k - 1
• Excel printout:

            Coefficients   Standard Error   t Stat   P-value
Intercept   0.04978935     8.17368799       0.006    0.995145915
exam1       0.10021107     0.075633398      1.325    0.186712117
exam2       0.15413733     0.072271404      2.133    0.034176157
exam3       0.29600913     0.066724619      4.436    1.51714E-05
hwtot       0.10771069     0.022685084      4.748    3.93288E-06
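
The F statistic and the t statistics above are simple ratios of numbers already on the printout. The sketch below recomputes them. It takes the P-values as printed by Excel rather than recomputing them, and it assumes the 0.05 significance level used in the F test. The variable names are hypothetical.

```python
ssr = 16252.40443   # regression sum of squares (SSR)
sse = 26241.23104   # residual sum of squares (SSE)
n, k = 203, 4       # observations and explanatory variables

msr = ssr / k               # mean square for regression
mse = sse / (n - k - 1)     # mean square error
f_stat = msr / mse          # F = MSR / MSE

# (coefficient, standard error, P-value) for each row of the printout.
printout = {
    "Intercept": (0.04978935, 8.17368799, 0.995145915),
    "exam1": (0.10021107, 0.075633398, 0.186712117),
    "exam2": (0.15413733, 0.072271404, 0.034176157),
    "exam3": (0.29600913, 0.066724619, 1.51714e-05),
    "hwtot": (0.10771069, 0.022685084, 3.93288e-06),
}

# t = (b_i - 0) / s_{b_i} for each coefficient.
t_stats = {name: coef / se for name, (coef, se, _p) in printout.items()}

# Rows whose printed P-value is below the assumed 0.05 level.
alpha = 0.05
significant = [name for name, (_c, _s, p) in printout.items() if p < alpha]

print(f"F = {f_stat:.2f}")
for name, t in t_stats.items():
    print(f"{name:9s} t = {t:.3f}")
print("Reject H0 at 0.05 for:", significant)
```

On these numbers exam2, exam3 and hwtot have P-values below 0.05, while exam1 (P-value 0.187) does not, given the other variables in the model.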