Multiple Linear Regression

In many business applications, the relationship proposed by a simple linear regression model (a model with only one x variable) does not adequately explain the variation in y. This is because the y variable in most business applications depends on more than one independent variable. In such cases, multiple linear regression can be used to explore the relationships between a dependent variable (e.g., production cost) and a set of independent variables (e.g., cost of raw materials, product complexity, labor utilization, etc.). The general theoretical model takes the form

    y = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ + ε

where y is the dependent (or response) variable; x₁, x₂, …, xₖ are the independent (or predictor) variables; β₀ is the "true" intercept term; β₁, β₂, …, βₖ are the "true" slopes; and ε ~ N(0, σ²) is the error term. As before, we assume the error terms are independent and have constant variance. The β's are estimated using least squares, which can be done in Excel using the built-in regression function.

Example (Estimating Medical Costs, Revisited). Recall the IPA (Individual Practice Association) HMO data used for estimating healthcare expenditures. Enrollment was the sole independent variable and Total Expenses was the dependent variable. In arriving at a cost estimate for 100,000 employees, we simply used Total Enrollment (measured in total member months). However, your employees may be considerably healthier, on average, than the average individual in the general IPA HMO population. Suppose last year's records indicate your employees needed 397,000 visits to the doctor and 26,000 hospital days.

    TOT. EXP. (y)   Tot. Ambulatory Encounters (x₁)   Tot. Hosp. Days (x₂)   TOT. MEM. (x₃)
    141550288         439120                            26926                 1219766
    154319068         473630                            61213                 1238162
    186336170         886628                            38142                 1340556
    201621005        1233593                            46438                 1373815
    158685564         273033                            48599                 1441314
    230493540         350565                            65649                 1457371
    193939844         578571                            45884                 1653062
    217963465         709668                            58310                 2047591
    236795740         711800                            59484                 2065864
    284644518         761674                            58023                 2443874
    299357578        1055879                            82581                 2451653
    233118300        1537164                            92471                 2605678
    322120084         699457                            89987                 2894264
    406588374        1267160                            69287                 3617003
    421456551        1128110                            98194                 3848018
    437953969        1998317                           157609                 4419552

One might expect a more accurate cost estimate from a "utilization" model based on services rendered (ambulatory visits and hospitalization). One possible multiple linear regression model relating costs to utilization has the theoretical form

    y = β₀ + β₁x₁ + β₂x₂ + ε,

where y = total expenses, x₁ = total number of ambulatory encounters, and x₂ = total number of inpatient (hospital) days. Performing the regression in this case, we get the following Excel output.¹

SUMMARY OUTPUT

Regression Statistics
    Multiple R           0.80156102
    R Square             0.642500069
    Adjusted R Square    0.58750008
    Standard Error       61543618.13
    Observations         16

ANOVA
                  df    SS             MS          F          Significance F
    Regression     2    8.84925E+16    4.42E+16    11.68182   0.001248228
    Residual      13    4.9239E+16     3.79E+15
    Total         15    1.37732E+17

                                 Coefficients   Standard Error   t Stat     P-value    Lower 95%       Upper 95%
    Intercept                    85431953.15    38863640.11      2.198249   0.046647    1472179.378    169391726.9
    Tot. Ambulatory Encounters   42.95611322    50.49735715      0.850661   0.410353   -66.13677336    152.0489998
    Tot. Hosp. Days              1960.474897    757.4626725      2.588213   0.022504    324.0765969    3596.873198

¹ Warning: This example is solely for illustrative purposes because the sample size would generally be considered too small for a good model. There are many "rules of thumb" or ad hoc guidelines for determining an appropriate sample size N for a regression model having m independent variables. These guidelines include N ≥ 104 + m, N ≥ 40m, and N ≥ 50 + 8m, among others. You'll notice that book problems routinely violate these data requirements.
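If you want to check the printout outside of Excel, the same fit can be reproduced with ordinary least squares. The following is a minimal sketch, assuming the numpy and statsmodels packages are installed; the array names (y, x1, x2) are mine, and the data are simply the 16 rows of the table above.

```python
# Reproduce the two-predictor regression (total expenses on ambulatory
# encounters and hospital days) with ordinary least squares.
import numpy as np
import statsmodels.api as sm

# 16 observations from the IPA HMO table above
y = np.array([141550288, 154319068, 186336170, 201621005, 158685564,
              230493540, 193939844, 217963465, 236795740, 284644518,
              299357578, 233118300, 322120084, 406588374, 421456551,
              437953969], dtype=float)
x1 = np.array([439120, 473630, 886628, 1233593, 273033, 350565, 578571,
               709668, 711800, 761674, 1055879, 1537164, 699457, 1267160,
               1128110, 1998317], dtype=float)   # ambulatory encounters
x2 = np.array([26926, 61213, 38142, 46438, 48599, 65649, 45884, 58310,
               59484, 58023, 82581, 92471, 89987, 69287, 98194, 157609],
              dtype=float)                       # hospital days

X = sm.add_constant(np.column_stack([x1, x2]))   # adds the intercept column
model = sm.OLS(y, X).fit()
print(model.summary())   # coefficients, t stats, R Square, F, etc.
```

The coefficients, t statistics, R Square, and F statistic printed by the sketch should match the Excel output shown above.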
The marginal cost of an ambulatory encounter (assuming fixed hospital days) is $42.95 in this model. The marginal cost of a hospital day (assuming ambulatory encounters are held fixed) is $1960.47 in this model. Note that both of these interpretations require the other variables to be held fixed, a situation economists like to describe using the Latin phrase ceteris paribus (all other things being equal).

To estimate the costs associated with 397,000 ambulatory visits and 26,000 hospital days, we simply "plug in" the given values into our regression equation. The predicted cost for 397,000 ambulatory encounters and 26,000 hospital days is

    85,431,953.15 + 42.9561(397,000) + 1960.4749(26,000) ≈ $153,457,877.

This is merely a point estimate, and a better approach would be to provide a prediction interval. Unfortunately, exact prediction intervals in multiple linear regression are hard to compute in Excel. The formulas used for simple linear regression models do not generalize in an obvious way. Where absolutely necessary, I'll compute them for you. For example, an exact 95% prediction interval for a future observation with 397,000 ambulatory encounters (given) and 26,000 hospital days (given) is $153,457,877 ± $144,951,976.

A reasonable approximation to a 100(1 - α)% prediction interval for a future observation at the given values x_g = (x₁, x₂, …, xₖ) is given by

    b₀ + b₁x₁ + b₂x₂ + … + bₖxₖ ± t(α/2; n - k - 1 df) · s · √(1 + 1/n).

This approximation works well if the given values of the predictors are somewhat near their respective sample averages. Observe that we need k given values, one for each independent variable. Also note that the values taken from the t distribution are based on n - k - 1 degrees of freedom. As before, α is a "small probability" representing the user's tolerance for getting a "bad" prediction interval.
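The arithmetic for the approximate interval is easy to script. The sketch below is an illustration of the approximate formula only, assuming scipy is available; it uses nothing but quantities taken from the Excel printout (coefficients, standard error s, n, k), and the variable names are mine. Its half-width comes out somewhat narrower than the exact interval quoted above, because the approximation drops the term that accounts for how far the given utilization values sit from their sample averages.

```python
# Approximate 95% prediction interval for total expenses at
# x1 = 397,000 ambulatory encounters and x2 = 26,000 hospital days,
# using the formula ŷ ± t(α/2; n-k-1 df) · s · sqrt(1 + 1/n).
from scipy import stats

b0, b1, b2 = 85431953.15, 42.95611322, 1960.474897   # fitted coefficients from the printout
s = 61543618.13                                      # standard error of the regression
n, k = 16, 2                                         # observations, predictors
alpha = 0.05

x1_g, x2_g = 397_000, 26_000
y_hat = b0 + b1 * x1_g + b2 * x2_g                   # point estimate, about $153.5 million

t_crit = stats.t.ppf(1 - alpha / 2, n - k - 1)       # t critical value with n-k-1 = 13 df
half_width = t_crit * s * (1 + 1 / n) ** 0.5         # approximate half-width from the notes

print(f"point estimate : {y_hat:,.0f}")
print(f"approx 95% PI  : {y_hat - half_width:,.0f} to {y_hat + half_width:,.0f}")
```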
It is especially noteworthy that this multiple linear regression model with two independent variables does not fit the data as well as the simple linear regression model proposed earlier. Recall that y = β₀ + β₃x₃ + ε, where x₃ = enrollment, produced an R² = .929, considerably better than the .6425 achieved by the model above using ambulatory encounters (x₁) and hospitalization (x₂).

Significance of the Overall Regression Model: The F-test

In simple linear regression, we could test for the statistical significance of a relationship between the sole independent variable (x) and the dependent variable (y) by doing a t-test on the slope coefficient. Since there was only one variable in the full model (the intercept doesn't count as a model variable), the variation in y explained by x could also be tested using this t-test. However, in multiple linear regression, there are multiple "slopes" (one for each independent variable), and to measure their collective ability to explain the variation in y, we use something called the F-test. The F-test is a one-tailed test to the right. Under the assumptions of the model (normality, homoscedasticity, independence of ε), the ratio

    F = (SSR / k) / (SSE / (n - k - 1))

has an F distribution with k degrees of freedom in the numerator and n - k - 1 degrees of freedom in the denominator, written F ~ F(k, n - k - 1).

The formal hypothesis test for significance of the model (sometimes called a test of "the model") is

    H₀: β₁ = β₂ = … = βₖ = 0   vs.   Hₐ: at least one βᵢ ≠ 0.

The null hypothesis is that the β's associated with the x's do not collectively explain a significant amount of the variation in y compared to a model that includes only an intercept term. The alternative hypothesis is that the x's do explain a significant amount of the variation in y. The null hypothesis is rejected if the computed F-statistic falls in the critical region determined by the stated level of significance (usually α = .05). Alternatively, Excel gives the p-value for the computed F statistic on the printout. In the IPA healthcare example above, ambulatory visits and hospitalization collectively explain a significant amount of the variation in total expenses: the computed F-statistic of 11.68 has a p-value of .00124. We will always use the p-value approach for hypothesis tests in multiple linear regression.

The relationship SST = SSR + SSE holds for multiple linear regression just as it did for simple linear regression. If the regression coefficients are collectively useful in explaining the variation in y, then SSR should account for a significant portion of SST, at least more than could be attributed to chance. Note that as n becomes large, SSR (and hence R²) can be quite small and yet the overall regression model ("the model") can be deemed statistically significant. When n is large, even explaining a small percentage of SST must be considered more than simply chance. Note, however, that "statistical significance" and "practical significance" are not necessarily the same!
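The F statistic and its p-value on the printout can be checked from the ANOVA table alone. A minimal sketch, assuming scipy is available, with SSR and SSE copied from the Excel output above:

```python
# Overall F-test for the two-predictor utilization model,
# computed from the ANOVA quantities on the Excel printout.
from scipy import stats

SSR = 8.84925e16      # regression sum of squares
SSE = 4.9239e16       # residual (error) sum of squares
n, k = 16, 2          # observations, predictors

F = (SSR / k) / (SSE / (n - k - 1))        # ratio of mean squares
p_value = stats.f.sf(F, k, n - k - 1)      # right-tail area under F(k, n-k-1)

print(f"F = {F:.2f}, p-value = {p_value:.5f}")   # roughly F = 11.68, p = 0.00125
```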
The Variation Explained by Specific Independent Variables: The t-Tests (sometimes called "Partial F-Tests")

It is also possible to determine how much of the total variation in y can be explained by individual x variables, using the t statistics and t-tests in the Excel output. The t-statistic for a particular x variable measures the portion of the variation in y explained by that x variable assuming the other x variables are already included in the model. The t-test for a particular x variable can therefore be viewed as a measure of the linear relationship between y and x after adjusting for the other x variables, and evidence of a linear relationship between y and a particular x can change depending on the set of x variables included in the model. For this reason, some people include the qualifying phrase "in this model" when talking about linear relationships; those words remind the audience that evidence of a linear relationship between y and x may depend on the specific model being used.

To understand the t-statistic in greater detail, consider three different regression models for our IPA HMO: one that includes only ambulatory encounters to predict costs; one that includes only hospital days to predict costs; and one that includes both ambulatory encounters and hospital days to predict costs. Recall that the regression sum of squares, SSR, represents the amount of SST that is explained by the regression model. There is a different SSR for each of the three regression models proposed above. Let SSR(A) represent the amount of SST explained by a regression model that includes only ambulatory encounters. Let SSR(H) represent the amount of SST explained by a regression model that includes only hospital days. Finally, let SSR(A,H) represent the amount of SST explained by a regression model that includes both ambulatory encounters and hospital days, sometimes dubbed the "full" model. The Excel output for this model is the two-predictor SUMMARY OUTPUT displayed earlier.

The t-statistic for hospital days in the full model measures the significance of the difference SSR(A,H) - SSR(A), which is the marginal contribution of hospital days to a model that already includes ambulatory encounters. Similarly, the t-statistic for ambulatory encounters in the full model measures the significance of the difference SSR(A,H) - SSR(H), which is the marginal contribution of ambulatory encounters to a model that already includes hospital days. The large t-statistic (2.588) and small p-value (.022) for hospital days indicate that this variable makes a significant contribution to a model that already includes ambulatory encounters. Consequently, the inclusion of this variable in the final model is warranted. On the other hand, the small t-statistic and large p-value (.41) for ambulatory encounters suggest this variable does not contribute a significant amount to a model that already includes hospital days.
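For a single added variable there is a standard identity connecting these two views: the partial F statistic, [SSR(full) - SSR(reduced)] / MSE(full), equals the square of that variable's t statistic in the full model. The sketch below uses that identity, together with numbers copied from the Excel printout, to back out the marginal sums of squares without refitting anything; the variable names are mine and this is an illustration rather than part of the Excel workflow.

```python
# Back out the marginal (extra) sums of squares from the full-model printout,
# using: partial F for one added variable = (its t statistic)^2, and
# partial F = [SSR(full) - SSR(reduced)] / MSE(full).
SSR_full = 8.84925e16            # SSR(A,H) from the ANOVA table
MSE_full = 4.9239e16 / 13        # SSE / (n - k - 1) for the full model

t_hosp = 2.588213                # t stat for hospital days in the full model
t_amb  = 0.850661                # t stat for ambulatory encounters in the full model

extra_SS_hosp = t_hosp**2 * MSE_full   # = SSR(A,H) - SSR(A)
extra_SS_amb  = t_amb**2  * MSE_full   # = SSR(A,H) - SSR(H)

print(f"SSR(A,H) - SSR(A) is about {extra_SS_hosp:.3e}")   # contribution of hospital days
print(f"SSR(A,H) - SSR(H) is about {extra_SS_amb:.3e}")    # contribution of ambulatory encounters
```

Comparing each squared t statistic to an F distribution with 1 and n - k - 1 degrees of freedom reproduces the same p-values that appear next to the t statistics on the printout.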
In simple linear regression, the t-statistic was used to assess the existence of a linear relationship between x and y. In multiple linear regression, one must be more careful in determining the existence of a linear relationship between an independent variable and the dependent variable using only a t-statistic. The marginal contribution of a particular independent variable, as measured by its t-test, does not always accurately reflect whether there is a linear relationship. One way a linear relationship can be obscured is when the independent variables suffer from excessive multicollinearity. This topic is discussed next.

Multicollinearity

Multicollinearity occurs when a strong linear relationship exists among the x variables. Regression coefficients become unstable, standard deviations of the coefficients become large, t-statistics for coefficients become deceptively small, and prediction/confidence intervals are widened. Multicollinearity is often manifested by one or more nonsensical regression coefficients (e.g., the wrong sign). In general, multicollinearity makes interpretations of coefficients very difficult and often impossible. A strong relationship among the independent variables implies one cannot realistically change one variable without changing other independent variables as well. Moreover, strong relationships between the independent variables make it increasingly difficult to determine the contributions of individual variables. For example, if you have a model consisting of two perfectly correlated x variables, which x variable explains the variation in y?

There are a number of quantitative ways to detect multicollinearity. The simplest involves inspecting the sample correlation matrix constructed from the independent variables. For example, suppose we fit a model of the form

    y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + ε

in our IPA example (y = total expenses, x₁ = total ambulatory encounters, x₂ = total hospital days, and x₃ = total membership). The regression output for this model is in the file IPA-MLR.xls. Of particular interest are the coefficient estimates: -11.869 (Ambulatory Encounters), -271.34 (Hospital Days), and 103.16 (Enrollment). Note the nonsensical signs for Hospital Days and Ambulatory Encounters! The sample correlation matrix for all three independent variables is

    Correlation Matrix    Tot. Amb. Enc.    Tot. Hosp. Days    Tot. Mem.
    Tot. Amb. Enc.        1                 0.73714089         0.739107785
    Tot. Hosp. Days       0.73714089        1                  0.850078659
    Tot. Mem.             0.739107785       0.850078659        1

(This can be found in Excel under Tools, Data Analysis.) Generally speaking, high correlations can spell trouble. A common rule of thumb is that any correlation whose absolute value exceeds .7 is considered too high. In the example above, all three pair-wise correlations are above .7, and so multicollinearity appears to be a genuine concern. This explains the negative coefficients for Ambulatory Encounters and Hospital Days in the three-predictor model cited above.

Multicollinearity can still be a problem even when pair-wise correlations are small. One way to detect multicollinearity in such situations is to calculate the variance inflationary factors (VIFs). There is a different VIF for each independent variable. Each independent variable's VIF measures how much the variance of its coefficient estimate has been inflated by multicollinearity. The ideal VIF for a variable is 1, but one shouldn't expect to see this value in practice (at least for work on observational data). A value of 4 or greater generally means that multicollinearity is a problem; interpretations of the regression coefficients, particularly those with high VIFs, are then suspect. Some authors suggest a threshold value of 5 or 10, but I've found 4 is a better cutoff, and that will be what we use in this class.

To obtain the VIFs in Excel, you first need to fill in the blank positions in your correlation matrix (if you have a really big correlation matrix, I can show you a fast way to do this using the "transpose" function). Once the values for the correlation matrix are filled out, you will invert the matrix using the "minverse" function. To do this, first highlight a square block of blank cells in your spreadsheet having dimensions identical to those of your correlation matrix. For the IPA example above, you'd swipe out a 3 by 3 block of unused cells. The upper left-hand cell of this block will be clear (the other cells will be darkened). Type =minverse(cell range), where cell range is the 3 by 3 block of cells where your correlation matrix is stored (note: your typed entry will appear in the clear cell in the upper left-hand corner). Then hit Ctrl+Shift+Enter simultaneously. The inverse matrix will appear in the 3 by 3 workspace you have allocated for it. For our IPA example above, my correlation matrix was in cells M2:O4 (a 3 by 3 matrix); I swiped out the cells M7:O9 (also 3 by 3) as unused workspace; I typed "=minverse(M2:O4)", which appeared in the (clear) cell M7; then I hit Ctrl+Shift+Enter to get

    Inverse of the Correlation Matrix
     2.433034399   -0.95474559    -0.986665814
    -0.95474559     3.979992168   -2.677646508
    -0.986665814   -2.677646508    4.005462538

The VIFs appear on the diagonal and are not as bad as one might anticipate. For example, the largest VIF (4.005) occurs for the membership variable. I suggest you always inspect the correlation matrix and the VIFs. If multicollinearity is present, at least one of these analyses will usually raise a red flag. When it does, interpretations of coefficients become problematic.

There are a number of simple ways to manage multicollinearity. One way is to add more data (if this is feasible). Another way is to try and avoid including x variables that are highly correlated in the model. Unfortunately, the latter is not always possible.
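The same inversion can be done outside Excel as a cross-check. A minimal sketch with numpy, using the correlation matrix reported above (with your own data you would first build this matrix from the raw x columns, for example with numpy.corrcoef):

```python
# Invert the correlation matrix of the predictors and read the VIFs
# off the diagonal (mirrors the Excel minverse procedure above).
import numpy as np

# Sample correlation matrix of ambulatory encounters, hospital days, membership
R = np.array([
    [1.0,         0.73714089,  0.739107785],
    [0.73714089,  1.0,         0.850078659],
    [0.739107785, 0.850078659, 1.0],
])

R_inv = np.linalg.inv(R)
vifs = np.diag(R_inv)        # VIF for each predictor, in the same order as R

for name, vif in zip(["Tot. Amb. Enc.", "Tot. Hosp. Days", "Tot. Mem."], vifs):
    print(f"{name:16s} VIF = {vif:.3f}")   # roughly 2.43, 3.98, 4.01
```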
Assignment #6 (Please hand in just your answers and work; no Excel output please)

1. Multiple Linear Regression, #6 (p. 658-659).
2. Multiple Linear Regression, #16 (p. 664).
3. Multiple Linear Regression, #24 (p. 672).
4. Multiple Linear Regression, #25 (p. 672-673). Also, determine if multicollinearity is an issue in this problem.
5. Multiple Linear Regression, #31 (p. 675). Skip part (c); use the APPROXIMATE prediction interval discussed in class for part (d).