Regression Diagnostics

Multiple Linear Regression
In many business applications, the relationship proposed by a simple linear regression model
(a model with only one “x” variable) does not adequately explain the variation in y. This is because
the y variable in most business applications depends on more than one independent variable. In
such cases, multiple linear regression can be used to explore the relationships between a dependent
variable (e.g., production cost) and a set of independent variables (e.g., cost of raw materials,
product complexity, labor utilization, etc.). The general theoretical model takes the form:
$y(x_1, x_2, \ldots, x_k) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$
Where $y$ is the dependent (or response) variable; $x_1, x_2, \ldots, x_k$ are the independent (or predictor) variables; $\beta_0$ is the “true” intercept term; $\beta_1, \beta_2, \ldots, \beta_k$ are the “true” slopes; and $\varepsilon \sim N(0, \sigma^2)$ is the error term. As before, we assume error terms are independent and have constant variance. The $\beta$'s are estimated using least squares, which can be done in Excel using the built-in regression function.
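For readers who want to see the least-squares step outside of Excel, here is a minimal sketch in Python (assuming numpy is installed). The data values below are invented purely for illustration; only the fitting step mirrors what Excel's regression tool does.

import numpy as np

# Hypothetical data: 6 observations, 2 predictors (for illustration only).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([5.1, 6.9, 11.2, 12.8, 17.0, 18.9])

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares estimates b = (b0, b1, b2), the sample analogues of the betas.
b, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print("estimated coefficients:", b)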
Example (Estimating Medical Costs, Revisited). Recall the IPA (Individual Practice Association)
HMO data used for estimating healthcare expenditures. Enrollment was the sole independent
variable and Total Expenses was the dependent variable. In arriving at a cost estimate for 100,000
employees, we simply used Total Enrollment (measured in total member months). However, your
employees may be considerably healthier, on average, than the average individual in the general
IPA HMO population. Suppose last year’s records indicate your employees needed 397,000 visits
to the doctor, and 26,000 hospital days.
TOT. EXP. (y)   Tot. Ambulatory Encounters (x1)   Tot. Hosp. Days (x2)   TOT. MEM. (x3)
141550288       439120                            26926                  1219766
154319068       473630                            61213                  1238162
186336170       886628                            38142                  1340556
201621005       1233593                           46438                  1373815
158685564       273033                            48599                  1441314
230493540       350565                            65649                  1457371
193939844       578571                            45884                  1653062
217963465       709668                            58310                  2047591
236795740       711800                            59484                  2065864
284644518       761674                            58023                  2443874
299357578       1055879                           82581                  2451653
233118300       1537164                           92471                  2605678
322120084       699457                            89987                  2894264
406588374       1267160                           69287                  3617003
421456551       1128110                           98194                  3848018
437953969       1998317                           157609                 4419552
One might expect a more accurate cost estimate from a “utilization” model based on services
rendered (ambulatory visits and hospitalization). One possible multiple linear regression model
relating costs to utilization has the theoretical form:
$y(x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon.$
Here $y$ = total expenses, $x_1$ = total number of ambulatory encounters, and $x_2$ = total number of inpatient (hospital) days. Performing the regression in this case, we get the following Excel output.¹
SUMMARY OUTPUT

Regression Statistics
Multiple R            0.80156102
R Square              0.642500069
Adjusted R Square     0.58750008
Standard Error        61543618.13
Observations          16

ANOVA
              df     SS             MS          F           Significance F
Regression     2     8.84925E+16    4.42E+16    11.68182    0.001248228
Residual      13     4.9239E+16     3.79E+15
Total         15     1.37732E+17

                             Coefficients    Standard Error    t Stat      P-value     Lower 95%       Upper 95%
Intercept                    85431953.15     38863640.11       2.198249    0.046647    1472179.378     169391726.9
Tot. Ambulatory Encounters   42.95611322     50.49735715       0.850661    0.410353    -66.13677336    152.0489998
Tot. Hosp. Days              1960.474897     757.4626725       2.588213    0.022504    324.0765969     3596.873198
The marginal cost of an ambulatory encounter (assuming fixed hospital days) is $42.95 in this
model. The marginal cost of a hospital day (assuming ambulatory encounters are held fixed) is
$1960.47 in this model. Note that both of these interpretations require the other variables to be held
fixed, a situation economists like to describe using the Latin phrase ceteris paribus (all other things
being equal).
To estimate the costs associated with 397,000 ambulatory visits and 26,000 hospital days, we simply “plug in” the given values into our regression equation. The predicted cost for 397,000 ambulatory encounters and 26,000 hospital days is
$\hat{y} = 85{,}431{,}953.15 + 42.95611322(397{,}000) + 1960.474897(26{,}000) \approx \$153{,}457{,}877.$
This is merely a point estimate, and a better approach would be to provide a prediction
interval. Unfortunately, exact prediction intervals in multiple linear regression are hard to compute
in Excel. The formulas used for simple linear regression models do not generalize in an obvious
way. Where absolutely necessary, I’ll compute them for you. For example, an exact 95%
prediction interval for a future observation with 397,000 ambulatory encounters (given) and 26,000
hospital days (given) is
$153,457,877 ± $144,951,976.
A reasonable approximation to a $100(1-\alpha)\%$ prediction interval for a future observation at $x_g = (x_1, x_2, \ldots, x_k)$ is given by
$b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k \;\pm\; t_{\alpha/2,\; n-k-1 \text{ df}} \cdot s \cdot \sqrt{1 + 1/n}.$
¹ Warning: This example is solely for illustrative purposes because the sample size would generally be considered too small for a good model. There are many “rules of thumb” or ad hoc guidelines that exist for determining an appropriate sample size N for a regression model having m independent variables. These guidelines include N ≥ 104 + m, N ≥ 40m, and N ≥ 50 + 8m, among others. You'll notice that book problems routinely violate these data requirements.
This approximation works well if the given values of the predictors are somewhat near their
respective sample averages. Observe that we need k given values, one for each independent
variable. Also note that the values taken from the t distribution are based on n-k-1 degrees of
freedom. As before, $\alpha$ is a “small probability” representing the user's tolerance for getting a “bad”
prediction interval.
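As a check on the arithmetic, the approximation can be evaluated in a few lines of Python (a sketch only, assuming scipy is installed). The coefficient, standard error, and sample-size values below are copied from the Excel output above. Because the approximation replaces the exact standard error of prediction with $s\sqrt{1+1/n}$, the interval it produces here is somewhat narrower than the exact ±$144,951,976 quoted above.

import math
from scipy import stats

# Values taken from the Excel output for the two-predictor model.
b0, b1, b2 = 85431953.15, 42.95611322, 1960.474897
s = 61543618.13          # standard error of the regression
n, k = 16, 2             # observations and number of predictors

# Point prediction for the given utilization figures.
x1, x2 = 397_000, 26_000
y_hat = b0 + b1 * x1 + b2 * x2

# Approximate 95% prediction interval: y_hat +/- t * s * sqrt(1 + 1/n).
t_crit = stats.t.ppf(0.975, n - k - 1)
half_width = t_crit * s * math.sqrt(1 + 1 / n)
print(f"predicted cost: {y_hat:,.0f}")
print(f"approximate 95% PI: ({y_hat - half_width:,.0f}, {y_hat + half_width:,.0f})")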
It is especially noteworthy that this multiple linear regression model with two independent
variables does not fit the data as well as the simple linear regression model proposed earlier. Recall
that $y(x_3) = \beta_0 + \beta_3 x_3$, where $x_3$ = enrollment, produced an $R^2 = .929$, considerably better than the .6425 achieved by the model above using ambulatory encounters ($x_1$) and hospitalization ($x_2$).
Significance of the Overall Regression Model: The F-test
In simple linear regression, we could test for the statistical significance of a relationship between
the sole independent variable (x) and the dependent variable (y) by doing a t-test on the slope
coefficient. Since there was only one variable in the full model (the intercept doesn’t count as a
model variable), the variation in y explained by x could also be tested using this t-test. However, in
multiple linear regression, there are multiple “slopes” (one for each independent variable), and to
measure their collective ability to explain the variation in y, we use something called the F-test.
The F-test is a one-tailed test to the right. Under the assumptions of the model (normality,
homoscedasticity, independence of $\varepsilon$), the ratio
$\dfrac{SSR / k}{SSE / (n - k - 1)} = F \sim F_{k,\; n-k-1}$
has an F Distribution with k degrees of freedom in the numerator and n-k-1 degrees of freedom in
the denominator. The formal hypothesis test for significance of the model (sometimes called a test
of “the model”) is
$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$
vs.
$H_A:$ at least one $\beta_i \neq 0$
The null hypothesis is that the $\beta$'s associated with the x's do not collectively explain a significant amount of the variation in y compared to a model that includes only an intercept term. The alternative hypothesis is that the x's do explain a significant amount of the variation in y. The null hypothesis is rejected if the computed F-statistic falls in the critical region determined by the stated level of significance (usually $\alpha = .05$). Alternatively, Excel gives the p-value for the computed F
statistic on the printout. In the IPA Healthcare example above, ambulatory visits and
hospitalization collectively explain a significant amount of the variation in total expenses. The
computed F-Statistic of 11.68 has a p-value of .00124. We will always use the p-value approach for
hypothesis tests in multiple linear regression.
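To see where the reported p-value comes from, the F-statistic and its right-tail probability can be reproduced from the ANOVA sums of squares. This is a sketch assuming scipy is available; the SSR and SSE figures are read off the Excel ANOVA table above.

from scipy import stats

# Sums of squares and degrees of freedom from the ANOVA table above.
SSR, SSE = 8.84925e16, 4.9239e16
k, n = 2, 16

F = (SSR / k) / (SSE / (n - k - 1))
p_value = stats.f.sf(F, k, n - k - 1)   # right-tail area, since the F-test is one-tailed to the right
print(f"F = {F:.2f}, p-value = {p_value:.5f}")   # roughly F = 11.68, p = 0.00125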
The relationship SST = SSR + SSE holds for multiple linear regression just as it did for
simple linear regression. If the regression coefficients are collectively useful in explaining the
variation in y, then SSR should account for a significant portion of SST – at least more than could
be attributed to chance. Note that as n becomes large, SSR (and hence $R^2$) can be quite small and
yet the overall regression model (“the model”) can be deemed statistically significant. When n is
large, even explaining a small percentage of SST must be considered more than simply chance.
Note, however, that “statistical significance” and “practical significance” are not necessarily the
same!
The Variation Explained by Specific Independent Variables: The t-Tests
(sometimes called “Partial F-Tests”)
It is also possible to determine how much of the total variation in y can be explained by individual x
variables using the t statistics and t-tests in the Excel output. The t-statistic for a particular x
variable measures the portion of the variation in y explained by that x variable assuming the other x
variables are already included in the model. For this reason, the t-test for a particular x variable can be viewed as a measure of the linear relationship between y and x after adjusting for the other x variables. Consequently, evidence of a linear relationship between y and a particular x can change depending on the set of x variables included in the model. This is why some people include the qualifying phrase “in this model” when talking about linear relationships. The words “in this model” are
there to remind the audience that evidence of a linear relationship between y and x may depend on
the specific model being used.
To understand the t-statistic in greater detail, consider three different regression models for
our IPA HMO: one that includes only ambulatory encounters to predict costs; one that includes only
hospital days to predict costs; and one that includes both ambulatory encounters and hospital days to
predict costs. Recall that the regression sum of squares, SSR, represents the amount of SST that is
explained by the regression model. There is a different SSR for each of the three regression models
proposed above. Let SSR(A) represent the amount of SST explained by a regression model that
includes only ambulatory encounters. Let SSR(H) represent the amount of SST explained by a
regression model that includes only hospital days. Finally, let SSR(A,H) represent the amount of
SST explained by a regression model that includes both ambulatory encounters and hospital days,
sometimes dubbed the “full” model. The Excel output for this full model is the summary displayed earlier in this section.
The t-statistic for hospital days in the full model measures the significance of the difference
SSR(A,H)-SSR(A), which is the marginal contribution of hospital days to a model that already
includes ambulatory encounters. Similarly, the t-statistic for ambulatory encounters in the full
model measures the significance of the difference SSR(A,H)-SSR(H), which is the marginal
contribution of ambulatory encounters to a model that already includes hospital days. The large t-statistic (2.588) and small p-value (.022) for hospital days indicate that this variable makes a
significant contribution to a model that already includes ambulatory encounters. Consequently, the
inclusion of this variable in the final model is warranted. On the other hand, the small t-statistic and
large p-value (.41) for ambulatory encounters suggests this variable does not contribute a significant
amount to a model that already includes hospital days.
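The following Python sketch illustrates the nested-model comparison described above using the data from the table earlier in this section. It relies on the third-party statsmodels library (an assumption; any OLS routine would do), and the small `fit` helper is defined here for convenience. It fits the A-only, H-only, and full (A, H) models, prints each model's regression sum of squares, and prints the t-statistics from the full model; treat it as an illustration rather than a substitute for the Excel output, since the exact figures depend on the data being transcribed correctly.

import numpy as np
import statsmodels.api as sm

# Data from the IPA HMO table (y = total expenses, x1 = ambulatory encounters, x2 = hospital days).
y  = np.array([141550288, 154319068, 186336170, 201621005, 158685564, 230493540, 193939844, 217963465,
               236795740, 284644518, 299357578, 233118300, 322120084, 406588374, 421456551, 437953969], float)
x1 = np.array([439120, 473630, 886628, 1233593, 273033, 350565, 578571, 709668,
               711800, 761674, 1055879, 1537164, 699457, 1267160, 1128110, 1998317], float)
x2 = np.array([26926, 61213, 38142, 46438, 48599, 65649, 45884, 58310,
               59484, 58023, 82581, 92471, 89987, 69287, 98194, 157609], float)

def fit(*cols):
    """Fit an OLS model with an intercept and the given predictor columns."""
    X = sm.add_constant(np.column_stack(cols))
    return sm.OLS(y, X).fit()

m_A, m_H, m_AH = fit(x1), fit(x2), fit(x1, x2)

# In statsmodels, .ess is the explained (regression) sum of squares -- SSR in these notes.
print("SSR(A)   =", m_A.ess)
print("SSR(H)   =", m_H.ess)
print("SSR(A,H) =", m_AH.ess)
print("marginal contribution of hospital days:         SSR(A,H) - SSR(A) =", m_AH.ess - m_A.ess)
print("marginal contribution of ambulatory encounters: SSR(A,H) - SSR(H) =", m_AH.ess - m_H.ess)
print("t-statistics in the full model:", m_AH.tvalues[1:])   # [ambulatory encounters, hospital days]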
In simple linear regression, the t-statistic was used to assess the existence of a linear
relationship between x and y. In multiple linear regression, one must be more careful in
determining the existence of a linear relationship between an independent variable and the
dependent variable using only a t-statistic. The marginal contribution of a particular independent
variable as measured by its t-test does not always accurately reflect whether there is a linear
relationship. One way a linear relationship can be obscured is when the independent variables
suffer from excessive multicollinearity. This topic is discussed next.
Multicollinearity
Multicollinearity occurs when a strong linear relationship exists among the x variables. Regression
coefficients become unstable, standard deviations of the coefficients become large, t-statistics for
coefficients become deceptively small, and prediction/confidence intervals are widened.
Multicollinearity is often manifested by one or more nonsensical regression coefficients (e.g., the
wrong sign). In general, multicollinearity makes interpretations of coefficients very difficult and
often impossible. A strong relationship among the independent variables implies one cannot
realistically change one variable without changing other independent variables as well. Moreover,
strong relationships between the independent variables make it increasingly difficult to determine
the contributions of individual variables. For example, if you have a model consisting of two
perfectly correlated x variables, which x variable explains the variation in y?
There are a number of quantitative ways to detect multicollinearity. The simplest involves
inspecting the sample correlation matrix constructed from the independent variables. For example,
suppose we fit a model of the form $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$ in our IPA example ($y$ = total expenses, $x_1$ = total ambulatory encounters, $x_2$ = total hospital days, and $x_3$ = total membership).
The regression output for this model is in the file IPA-MLR.xls. Of particular interest are the
coefficient estimates: -11.869 (Ambulatory Encounters), -271.34 (Hospital Days), 103.16
(Enrollment). Note the nonsensical signs for Ambulatory Encounters and Hospital Days! The
sample correlation matrix for all three independent variables is
Correlation Matrix
                   Tot. Amb. Enc.    Tot. Hosp. Days    Tot. Mem.
Tot. Amb. Enc.     1                 0.73714089         0.739107785
Tot. Hosp. Days    0.73714089        1                  0.850078659
Tot. Mem.          0.739107785       0.850078659        1
(This can be found in Excel under Tools, Data Analysis). Generally speaking, high correlations can
spell trouble. A common rule of thumb is that any correlation whose absolute value exceeds .7 is
considered too high. In the example above, all three pair-wise correlations are above .7, and so
multicollinearity appears to be a genuine concern. This explains the negative coefficients for
Ambulatory Encounters and Hospital Days in the three-predictor model cited above.
Multicollinearity can still be a problem even when pair-wise correlations are small. One
way to detect multicollinearity in such situations is to calculate the variance inflationary factors
(VIF’s). There is a different VIF for each independent variable. Each independent variable’s VIF
measures how much the variance of its coefficient estimate has been inflated by multicollinearity.
The ideal VIF for a variable is 1, but one shouldn't expect to see this value in practice (at least for
work on observational data). A value of 4 or greater generally means that multicollinearity is a
problem, thus interpretations of the regression coefficients—particularly those with high VIF’s—are
suspect. Some authors suggest a threshold value of 5 or 10, but I’ve found 4 is a better cutoff, and
that will be what we use in this class.
To obtain the VIF’s in Excel, you first need to fill in the blank positions in your correlation
matrix (if you have a really big correlation matrix, I can show you a fast way to do this using the
“transpose” function). Once the values for the correlation matrix are filled out, you will invert the
matrix using the “minverse” function. To do this, first highlight a square block of blank cells in
43 | P a g e
your spreadsheet having dimensions identical to that of your correlation matrix. For the IPA
example above, you’d swipe out a 3 by 3 block of unused cells. The upper left-hand cell of this
block will be clear (the other cells will be darkened). Type =minverse(cell range), where cell range
is the 3 by 3 block of cells where your correlation matrix is stored (note: your typed entry will appear
in the clear cell in the upper left hand corner). Then hit Ctrl+Shift+Enter simultaneously. The
inverse matrix will appear in the 3 by 3 workspace you have allocated for it. For our IPA example
above, my correlation matrix was in cells M2:O4 (a 3 by 3 matrix); I swiped out the cells M7:O9
(also 3 by 3) as unused workspace; I typed “=minverse(M2:O4)”, which appeared in the (clear) cell
M7; then I hit Ctrl+Shift+Enter to get
Inverse of the Correlation Matrix
 2.433034399    -0.95474559     -0.986665814
-0.95474559      3.979992168    -2.677646508
-0.986665814    -2.677646508     4.005462538
The VIF's appear on the diagonal of the inverse matrix and are not as bad as one might anticipate. For example,
the largest VIF (4.005) occurs for the membership variable. I suggest you always inspect the
correlation matrix and the VIF’s. If multicollinearity is present, at least one of these analyses will
usually raise a red flag. When it does, interpretations of coefficients become problematic.
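For anyone working outside Excel, the same calculation takes only a few lines of Python. This is a sketch assuming numpy is available; it starts from the correlation matrix printed above (which could equally be computed from the raw predictor columns with np.corrcoef) and reads the VIF's off the diagonal of its inverse.

import numpy as np

# Correlation matrix of the three predictors (from the table above).
R = np.array([
    [1.0,         0.73714089,  0.739107785],
    [0.73714089,  1.0,         0.850078659],
    [0.739107785, 0.850078659, 1.0],
])

# The diagonal of the inverse correlation matrix gives the variance inflationary factors.
vif = np.diag(np.linalg.inv(R))
for name, v in zip(["Tot. Amb. Enc.", "Tot. Hosp. Days", "Tot. Mem."], vif):
    print(f"{name}: VIF = {v:.3f}")   # expect roughly 2.43, 3.98, 4.01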
There are a number of simple ways to manage multicollinearity. One way is to add more
data (if this is feasible). Another way is to try and avoid including x-variables that are highly
correlated in the model. Unfortunately, the latter is not always possible.
Assignment #6
(Please hand in just your answers and work—no Excel output please)
1. Multiple Linear Regression, #6 (p. 658-659).
2. Multiple Linear Regression, #16 (p. 664).
3. Multiple Linear Regression, #24 (p. 672).
4. Multiple Linear Regression, #25 (p. 672-673). Also, determine if multicollinearity is an issue in this problem.
5. Multiple Linear Regression, #31 (p. 675). Skip part (c); use the APPROXIMATE prediction interval discussed in class for part (d).