Simple Linear Regression
Previously we’ve discussed the notion of inference – using sample statistics to find
out information about a population. This was done in the context of the mean of a
random variable. In addition we talked early on about correlation between two
variables.
In this chapter we want to take this a step further and formalize the relationship
between two variables, and then extend this to multivariate analysis. This is done
through the concept of regression analysis.
Suppose we are interested in the relationship between two variables: length of stay and
hospital costs. We think that LOS causes costs–a longer LOS results in higher costs.
Note that with correlation there are no claims made about what causes what – just that
they move together. Here we are taking it further by “modeling” the direction of the
relationship. How would we go about testing this? If we take a sample of individual
stays and measure their LOS and cost, we might get the following:
Stay    LOS    Cost
1       3      2614
2       5      4307
3       2      2449
4       3      2569
5       3      1936
6       5      7231
7       5      5343
8       3      4108
9       1      1597
10      2      4061
11      2      1762
12      5      4779
13      1      2078
14      3      4714
15      4      3947
16      2      2903
17      1      1439
18      1      820
19      1      3309
20      6      5476
Just looking at the data, there appears to be a positive relationship: stays with a longer
LOS tend to have a higher cost, but not always.
Another way of looking at the data would be with a scatter diagram:
[Scatter diagram: LOS (horizontal axis, 0 to 7 days) versus Cost in dollars (vertical axis, $0 to $8,000), with a fitted trend line through the points.]
Ignore the line for now.
Cost is on the Y axis, LOS is on the X. It looks like there is a positive relationship
between cost and LOS. Simply eyeballing the data, it looks like the dots go up and to the
right.
The trend line basically connects the dots as best as possible. Note that the slope of this
line will tell you the marginal impact of LOS on charges: what happens to costs if LOS
increases by 1 unit? This line intersects the Y axis just above zero. This would be the
predicted cost if LOS was zero.
If the correlation coefficient between LOS and costs were 1, then all the dots would be on
this line. Note that for some observations the line is very close to the dot, while for
others it is pretty far away. Let the distance between the dot and the line be the error in
the trend line. The trend line is drawn so that the error term is minimized. Since errors
above the line will be positive while errors below the line will be negative, we have to be
careful: positive errors will tend to wash out negative errors. Thus a strategy in
estimating this line would be to draw the line such that we minimize the squared error
term.
This is known as The Least Squares Method.
I. The Logic
The idea of Least Squares is as follows:
In theory our relationship is:
Y = β0 + β1X + ε
Y is the dependent variable – the thing we are trying to explain.
X is the independent variable – what is doing the explaining.
β0 and β1 are population parameters that we are interested in knowing.
In our case Y is the cost and X is LOS. β0 is the intercept (where the line crosses the Y
axis), and β1 is the slope of the line. The coefficient β1 is the marginal impact of X on Y.
These are population parameters that we do not know. From our sample we estimate the
following:
Y = bo + b1X + e
Note I’ve switched from Greek to English letters since we are now dealing with the
sample. So bo is an estimator for β0 and b1 is an estimator for β1. e is the error term
reflecting the fact that we will not always be exactly on our line. If we wanted to get a
predicted value for Y (costs) we would use:
Ŷi = bo + b1Xi    (the ^ means predicted value)
Note the error term is gone and this is just the equation for the trend line. So suppose that
bo=3 and b1 = 2, then someone with a LOS of 5 days would be predicted to have
3+2*5=$13 in charges, etc.
Least squares finds bo and b1 by minimizing the sum of the squared difference between
the actual and predicted value for Y:
II. Specifics
Sum of squared differences = Σi=1..n (Yi − Ŷi)²

Substituting for the predicted value:

Σi=1..n (Yi − Ŷi)² = Σi=1..n [Yi − (bo + b1Xi)]²
Thus least squares finds bo and b1 to minimize this expression. We are not going to go
into the details here of how this is done, but we will focus on the intuition of what is
going on.
The easiest way to think about it is to go back to the scatter diagram: least squares draws
the trend line to connect the dots the best way possible. We choose the parameters to
minimize the size of our mistakes or errors.
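As a check on the intuition, here is a minimal sketch in Python (numpy is my choice of tool here, not something the course requires; Excel does the same work for us below) that applies the least-squares formulas to the 20 stays in the table above. It should reproduce the coefficients Excel reports in the next section.

    import numpy as np

    # LOS and Cost for the 20 stays in the table above
    los = np.array([3, 5, 2, 3, 3, 5, 5, 3, 1, 2,
                    2, 5, 1, 3, 4, 2, 1, 1, 1, 6], dtype=float)
    cost = np.array([2614, 4307, 2449, 2569, 1936, 7231, 5343, 4108, 1597, 4061,
                     1762, 4779, 2078, 4714, 3947, 2903, 1439, 820, 3309, 5476], dtype=float)

    # Least-squares formulas: b1 = sum((X - Xbar)(Y - Ybar)) / sum((X - Xbar)^2)
    #                         b0 = Ybar - b1 * Xbar
    x_dev = los - los.mean()
    y_dev = cost - cost.mean()
    b1 = (x_dev * y_dev).sum() / (x_dev ** 2).sum()
    b0 = cost.mean() - b1 * los.mean()

    print(b0, b1)   # roughly 997.5 and 818.8, matching the Excel output below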
III. How do we do this in Excel?
Excel can do both simple (one independent variable) and multiple (more than one)
regression.
You need the Analysis ToolPak add-in to do it.
Load the Analysis ToolPak
1. Click the File tab, click Options, and then click the Add-Ins category.
2. In the Manage box, select Excel Add-ins and then click Go.
3. In the Add-Ins available box, select the Analysis ToolPak check box, and then click OK.
To do a regression:
Click on the Data tab
Then click on Data Analysis
Then click on Regression, and click OK,
Then give it the Y input Range, and then the X input Range
You get the following output:
SUMMARY OUTPUT

Regression Statistics
Multiple R            0.807337
R Square              0.651794
Adjusted R Square     0.632449
Standard Error        995.4983
Observations          20

ANOVA
              df    SS          MS          F           Significance F
Regression     1    33390795    33390795    33.69347    1.69E-05
Residual      18    17838305    991016.9
Total         19    51229100

              Coefficients    Standard Error    t Stat      P-value     Lower 95%    Upper 95%
Intercept     997.4659        465.7353          2.141701    0.046147    18.99164     1975.94
LOS           818.8394        141.0671          5.804607    1.69E-05    522.4681     1115.211
What does all this mean?
Skip the first two sections for now and just look at the bottom part. The numbers under
Coefficients are our coefficient estimates:
Costi = 997.5 + 818.8*LOSi + ei
So we would predict that a patient starts with a cost of 997.5 and each day adds 818.8 to
the cost.
The least squares estimate of the effect of LOS on cost is $818.8 per day. So someone
with a LOS of 5 days is predicted to have: 997.5 + 818.8*5 = $5091.5 in costs.
The statistics behind all this can get pretty complicated, but the interpretation is easy.
And note that we can now add as many variables as we want and the coefficient estimate
for each variable is calculated holding constant the other right-hand-side variables.
Note that causation is imposed on the model by us; it takes theory, not the statistics alone,
to justify a causal interpretation. We also need to worry about omitted variable bias, which
comes up again when we add more variables below.
IV. Measures of Variation
We now want to talk about how well the model predicted the dependent variable. Is it a
good model or not? This will then allow us to make inferences about our model.
The Sum of Squares
The total sum of squares (SST) is a measure of variation of the Y values around their
mean. This is a measure of how much variation there is in our dependent variable.
SST = Σi=1..n (Yi − Ȳ)²
[Note that if we divide by n-1 we would get the sample variance for Y]
The total sum of squares can be divided into two components: explained variation, or the
Regression Sum of Squares (SSR), which is attributable to the relationship between X and Y;
and unexplained variation, or the Error Sum of Squares (SSE), which is attributable to
factors other than the relationship between X and Y.
[Diagram: for a single observation (Xi, Yi), the vertical distance from Yi down to the mean Ȳ is split at the regression line Ŷi = bo + b1Xi into an explained piece (Ŷi − Ȳ) and an unexplained piece (Yi − Ŷi = ei).]
The dot represents one particular actual observation (Xi, Yi), the horizontal line
represents the mean of Y (Ȳ), and the upward-sloping line represents the estimated
regression line. The distance between Yi and Ȳ is the total variation (SST). This is
broken into two parts: that explained by X and that not explained. The distance between
the predicted value of Y and the mean of Y is the part of the variation that is explained
by X. This is SSR. The distance between the predicted value of Y and the actual value
of Y is the unexplained portion of the variation. This is SSE.
Suppose that X had no effect whatsoever on Y. Then the best regression line would
simply be the mean of Y. So the predicted value of Y would always be the Mean of Y no
matter what X is. So X is doing nothing in helping us to explain Y. Then all the
variation in Y will be unexplained.
Suppose, alternatively, that the predicted value was exactly correct – the dot is on the
regression line. Then notice that all the variation in Y is being explained by the variation
in X. In other words, if you know X you know Y exactly.
As shown above, some of the variation in Y is due to variation in X [the (Ŷi − Ȳ)² piece]
and some of the variation is not explained by variation in X [the (Yi − Ŷi)² piece].

So to get the SSR we calculate: SSR = Σi=1..n (Ŷi − Ȳ)²

And to get the SSE we calculate: SSE = Σi=1..n (Yi − Ŷi)²
Referring back to our first regression output notice the middle table looks as follows:
ANOVA
              df    SS          MS          F           Significance F
Regression     1    33390795    33390795    33.69347    1.69E-05
Residual      18    17838305    991016.9
Total         19    51229100
The third column (labeled SS) holds the sums of squares. The first row is Regression, so
SSR = 33390795. Residual is another word for error (or leftover), so SSE = 17838305,
and the SST = 51229100. Notice that 51229100 = 33390795 + 17838305.
How do we use this information? In general, the method of Least Squares chooses the
coefficients so to minimize SSE. So we want that to be as small as possible – or
equivalently, we want SSR to be as big as possible.
Notice that the closer SSR is to SST the better our regression is doing. In a perfect world
SSR = SST: or our model explains ALL the variation in Y. So if we look at the ratio of
SSR to SST, this will tell us how our model is doing. This is known as the Coefficient of
Determination or R2.
R2 = SSR/SST
for our example: R2= 33390795/51229100= .652
Is this bad? It depends.
Thus, 65% of the variation in charges can be explained by variation in LOS. Note that
this is pretty low since there are many other things that determine charges.
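Continuing the Python sketch from earlier (again just a check on the arithmetic, not something the course requires), we can reproduce the sums of squares and R2 from the fitted line:

    # Continuing from the least-squares sketch above
    y_hat = b0 + b1 * los                      # predicted cost for each stay

    sst = ((cost - cost.mean()) ** 2).sum()    # total variation in Y
    ssr = ((y_hat - cost.mean()) ** 2).sum()   # variation explained by LOS
    sse = ((cost - y_hat) ** 2).sum()          # unexplained variation

    r_squared = ssr / sst
    print(sst, ssr, sse, r_squared)            # about 51229100, 33390795, 17838305, and .652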
Standard Error of the Estimate
Note that for just about any regression, all the data points will not be exactly on the
regression line. We want to be able to measure the variability of the actual Y from the
predicted Y. This is similar to the standard error of the mean as a measure of variability
around a sample mean. This is called The Standard Error of the Estimate
SYX = √( SSE / (n − 2) ) = √( Σi=1..n (Yi − Ŷi)² / (n − 2) )
Notice that this looks very much like the standard deviation for a random variable. But
here we’re looking at variation of actual values around a prediction.
For our example SYX = √(17838305 / 18) = 995.5
Note that the top table in the Excel output has the R-squared and the Standard Error
listed, among other things.
This is a measure of the variation around the fitted regression line – a loose interpretation
would be that on average the data points are about $995 off of the regression line. We
will use this in the next section to make inferences about our coefficients.
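In the Python sketch this is one more line (a sketch only; Excel already reports it as "Standard Error"):

    # Continuing from above: standard error of the estimate
    n = len(cost)
    s_yx = (sse / (n - 2)) ** 0.5
    print(s_yx)   # about 995.5, matching the Excel Standard Error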
V. Inference
We made our estimates above for the regression line based on our sample information.
These are estimates of the (unknown) population parameters. In this section we want to
make inferences about the population using our sample information.
t-test for the slope
Again, our estimate of β1 is b1. We can show that under certain assumptions (to come in a
bit) that b1 is an unbiased estimator for β1. But as discussed above there will still be some
sampling error associated with this estimate. So we can't conclude that β1 = b1 every time,
only on average. Thus we need to take this sampling variability into account.
Suppose we have the following null and alternative hypothesis:
Ho: β1 = 0 (there is no relationship between X and Y)
H1: β1 ≠ 0 (there is a relationship)
This can also be one tailed if you have some prior information to make it so.
Our test statistic will be:
t = (b1 − β1) / Sb1, where Sb1 is the standard error of the coefficient.

Sb1 = SYX / √SSX

where SSX = Σ(Xi − X̄)²

This follows a t-distribution with n − 2 degrees of freedom.

[NOTE: in general this test has n − k − 1 degrees of freedom, where k is the number of right-hand-side variables. In this case k = 1, so it is just n − 2.]

So the standard error of the coefficient is the standard error of the estimate divided by the
square root of the sum of squared deviations in X.
Again note the bottom part of the Excel output:
              Coefficients    Standard Error    t Stat      P-value     Lower 95%    Upper 95%
Intercept     997.4659        465.7353          2.141701    0.046147    18.99164     1975.94
LOS           818.8394        141.0671          5.804607    1.69E-05    522.4681     1115.211
So our LOS coefficient is 818.8. Is this statistically different from zero?
Our test statistic is: t = (818.8 − 0)/141.1 = 5.8. We can use the reported p-value (.0000169)
to conclude that we would reject the null hypothesis and say that there is evidence that β1
is not zero. That is, LOS has a significant effect on charges.
The t-test can be used to test each individual coefficient for significance in a multiple
regression framework. The logic is just the same.
One could also test other hypotheses. Suppose it used to be the case that each day in the
hospital resulted in a $1000 charge; is there evidence that this has changed?

Ho: β1 = 1000
Ha: β1 ≠ 1000

t = (818.8 − 1000)/141.06 = −1.28. The p-value associated with this is .215, so there is a
21.5% chance we could get a coefficient of 818 or further away from 1000 if the null is
true. Thus, we would fail to reject the null and conclude that there is no evidence that the
slope is different from 1000.
We could also estimate a confidence interval for the slope:
b1 ± tn−2 Sb1
Where tn-2 is the appropriate critical value for t. You can get excel to spit this out for you
as well. Just click the confidence interval box and type in the level of confidence and it
will include the upper and lower limits in the output.
For my example we are 95% confident that the population parameter β1 is between 522
and 1115.
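The same numbers can be reproduced away from Excel; here is a minimal sketch using scipy (my choice of tool, not part of the course) that plugs in the slope and standard error from the output above:

    from scipy import stats

    b1, se_b1, n = 818.8394, 141.0671, 20
    df = n - 2

    # Test Ho: beta1 = 1000 against a two-sided alternative
    t_stat = (b1 - 1000) / se_b1
    p_value = 2 * stats.t.sf(abs(t_stat), df)
    print(t_stat, p_value)                              # about -1.28 and .215

    # 95% confidence interval: b1 +/- (critical t with n-2 df) * se(b1)
    t_crit = stats.t.ppf(0.975, df)
    print(b1 - t_crit * se_b1, b1 + t_crit * se_b1)     # about 522 to 1115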
Multiple Regression
I. Introduction
The simple regression can be easily expanded to a multivariate setting. Our model can be
written as:
Yi = β0 + β1X1i + β2X2i + … + βkXki + εi

So we would have k explanatory variables. The interpretation of the β's is the same as in
the simple regression framework. For example, β1 is the marginal influence of X1 on the
dependent variable Y, holding all the other explanatory variables constant.
This is easy to do in Excel. It works just like simple regression except that one needs to
have all the X variables side by side in the spreadsheet.
Inference about individual coefficients is exactly the same as in simple regression.
Suppose we have the following data for 10 hospitals:
Y (Cost)    X1 (Size)    X2 (Visibility)
2750        225          6
2400        200          37
2920        300          14
1800        350          33
3520        200          11
2270        250          21
3100        175          21
1980        400          22
2680        350          20
2720        275          16
Cost is the cost per case for each hospital, Size is the size of the hospital in the number of
beds, and visibility is a scale that measures how much the administrator knows about
competitor hospitals. In this case we might expect larger hospitals to have lower costs per
case, and when the administrator has more knowledge about his/her competition costs
will be lower as well.
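For comparison only, here is a minimal sketch of the same regression in Python with pandas and statsmodels (my choice of tools, not part of the course); it should roughly reproduce the Excel output below:

    import pandas as pd
    import statsmodels.api as sm

    # The 10-hospital data from the table above
    hospitals = pd.DataFrame({
        "Cost":       [2750, 2400, 2920, 1800, 3520, 2270, 3100, 1980, 2680, 2720],
        "Size":       [225, 200, 300, 350, 200, 250, 175, 400, 350, 275],
        "Visibility": [6, 37, 14, 33, 11, 21, 21, 22, 20, 16],
    })

    # Regress Cost on Size and Visibility (add_constant supplies the intercept term)
    X = sm.add_constant(hospitals[["Size", "Visibility"]])
    results = sm.OLS(hospitals["Cost"], X).fit()
    print(results.summary())   # intercept ~ 4240.1, Size ~ -3.76, Visibility ~ -29.9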
SUMMARY OUTPUT

Regression Statistics
Multiple R            0.834034
R Square              0.695612
Adjusted R Square     0.608644
Standard Error        323.9537
Observations          10

ANOVA
              df    SS         MS        F           Significance F
Regression     2    1678818    839409    7.998485    0.01556
Residual       7    734622     104946
Total          9    2413440

              Coefficients    Standard Error    t Stat      P-value     Lower 95%    Upper 95%
Intercept     4240.131        435.6084          9.733813    2.56E-05    3210.081     5270.181
Size          -3.76232        1.442784          -2.60768    0.035032    -7.17395     -0.35068
Visibility    -29.8955        11.66298          -2.56328    0.037372    -57.4741     -2.31699
So in this case we’d say that each bed lowers the per case cost of the hospital by $3.76,
and every one unit increase in the visibility scale lowers costs by 29.90. Note that these
are not the same results we would get if we did two simple regressions:
If we only included Size:
SUMMARY OUTPUT

Regression Statistics
Multiple R            0.640237
R Square              0.409904
Adjusted R Square     0.336142
Standard Error        421.9245
Observations          10

ANOVA
              df    SS          MS          F           Significance F
Regression     1    989277.9    989277.9    5.557108    0.046149
Residual       8    1424162     178020.3
Total          9    2413440

              Coefficients    Standard Error    t Stat       P-value     Lower 95%    Upper 95%
Intercept     3804.717        522.4326          7.282694     8.53E-05    2599.984     5009.449
Size          -4.3696         1.853606          -2.357359    0.046149    -8.64403     -0.09518
While if we only included Visibility:
SUMMARY OUTPUT

Regression Statistics
Multiple R            0.632394
R Square              0.399922
Adjusted R Square     0.324912
Standard Error        425.4781
Observations          10

ANOVA
              df    SS          MS          F           Significance F
Regression     1    965187.2    965187.2    5.331595    0.049765
Residual       8    1448253     181031.6
Total          9    2413440

              Coefficients    Standard Error    t Stat       P-value
Intercept     3315.282        332.1822          9.980311     8.61E-06
Visibility    -34.8896        15.11012          -2.309025    0.049765
Why the changes in the coefficients? Size and Visibility are themselves somewhat correlated
across these hospitals, so each simple regression attributes part of the omitted variable's
effect to the variable that is included; the multiple regression holds the other variable
constant. This is omitted variable bias at work. Note also that the R-squareds do not add up:
the multiple regression R2 is .695, while the two simple R2's are .40 and .41.
II. Testing for the Significance of the Multiple Regression Model
F-test
Another general summary measure for the regression model is the F-test for overall
significance. This is testing whether or not any of our explanatory variables are
important determinants of the dependent variable. This is a type of ANOVA test.
Our null and alternative hypotheses are:
Ho: β1 = β2 = … = βk = 0 (none of the variables are significant)
H1: At least one βj ≠ 0
Here the F statistic is:
F = (SSR / k) / (SSE / (n − k − 1))
Notice that this statistic is essentially the ratio of the regression sum of squares to the
error sum of squares (each divided by its degrees of freedom). If our regression is doing a
lot towards explaining the variation in Y, then SSR
will be large relative to SSE and this will be a “big” number. Whereas if the variables are
not doing much to explain Y, then SSR will be small relative to SSE and this will be a
“small” number.
This ratio follows the F distribution with k and n-k-1 degrees of freedom.
The middle portion of the Excel output contains this information (this is the hospital cost
model with Size and Visibility from above):
ANOVA
              df    SS         MS        F           Significance F
Regression     2    1678818    839409    7.998485    0.01556
Residual       7    734622     104946
Total          9    2413440
F = (1678818/2)/(734622/(10-2-1)) = 839409/104946 = 7.998
The “Significance F” is the p-value. So we’d reject Ho and conclude there is evidence
that at least one of the explanatory variables is contributing to the model. Note that this is
a pretty weak test: it could be only one of the variables or it could be all of them that
matter, or something in between. It just tells us that something in our model matters.
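As a check, a minimal Python sketch (again just an illustration, using scipy) reproduces the F statistic and its p-value from the ANOVA sums of squares:

    from scipy import stats

    ssr, sse = 1678818, 734622     # regression and residual sums of squares from the ANOVA table
    k, n = 2, 10                   # number of explanatory variables, number of observations

    f_stat = (ssr / k) / (sse / (n - k - 1))
    p_value = stats.f.sf(f_stat, k, n - k - 1)
    print(f_stat, p_value)         # about 8.00 and .0156, matching F and Significance F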
III. Dummy Variables in Regression
Up to this point we’ve assumed that all the explanatory variables are numerical. But
suppose we think that, say, costs per case might differ between males and females. How
would we incorporate this into our regression?
The simplest way to do this is to assume that the only difference between men and
women is in the intercept (that is the coefficients on all the other variables are equal for
men and women).
[Diagram: Cost (vertical axis) versus LOS (horizontal axis), with two parallel upward-sloping lines, one for men and one for women, differing only in their intercepts (αmale versus αfemale).]
Assume for now the only other variable that matters is LOS. The idea is that we think
men cost more (or less) than women independent of LOS. That is, the male intercept
(αmale) differs from the female intercept (αfemale). We can incorporate this into our
regression by creating a dummy variable for gender. Suppose we let the variable Male
= 1 if the individual is a male, and 0 otherwise. Then our equation becomes:

Costi = β0 + β1LOSi + β2Agei + β3Malei + εi

So if the individual is male the variable Male is “on” and if she is female Male is “off”.
The coefficient β3 indicates how much higher (or lower, in theory) the cost of males is
relative to females. In terms of our graph, αfemale = β0 and αmale = β0 + β3.
So the dummy variable indicates how much the intercept shifts up or down for that group.
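In practice the dummy is just a new 0/1 column. A minimal sketch (the data frame, column names, and coding here are hypothetical, purely for illustration):

    import pandas as pd

    # Hypothetical patient records; the "gender" column and its coding are made up for illustration
    patients = pd.DataFrame({
        "cost":   [4200, 9800, 3100, 12500],
        "los":    [1, 3, 1, 4],
        "age":    [54, 71, 38, 66],
        "gender": ["M", "F", "F", "M"],
    })

    # Male = 1 if the individual is male, 0 otherwise
    patients["male"] = (patients["gender"] == "M").astype(int)
    print(patients)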
SUMMARY OUTPUT

Regression Statistics
Multiple R            0.70752
R Square              0.500584
Adjusted R Square     0.499657
Standard Error        6396.593
Observations          1619

ANOVA
              df      SS          MS          F           Significance F
Regression       3    6.62E+10    2.21E+10    539.5931    7.3E-243
Residual      1615    6.61E+10    40916408
Total         1618    1.32E+11

              Coefficients    Standard Error    t Stat      P-value     Lower 95%    Upper 95%
Intercept     9719.703        683.4021          14.22253    2.45E-43    8379.255     11060.15
LOS           4423.314        110.0995          40.17561    2.9E-245    4207.361     4639.267
Age           -19.4485        11.12886          -1.74758    0.080727    -41.2771     2.379983
Male          -76.9771        324.6287          -0.23712    0.812591    -713.715     559.7607
This says that costs start at 9719, and each extra day in the hospital adds 4423 to costs, all
else equal. Likewise, each additional year of patient age lowers costs by 19. Finally, the
point estimate says that a male with the same LOS and Age as a female will have costs
that are about 77 lower. Note, however, that this is not a significant effect, so we would
conclude that there is no evidence that males are different from females.
This can be done for more than two categories. Suppose we think that costs also differ by
payor status; then we can write:

Costi = β0 + β1LOSi + β2Agei + β3Malei + β4Medicarei + β5Medicaidi + εi

where Medicare is a dummy variable equal to 1 if the individual is covered by Medicare
and Medicaid = 1 if they are covered by Medicaid. Then private insurance is the
omitted group – just like female is not explicitly accounted for. Thus the coefficients β4
and β5 indicate how costs for Medicare and Medicaid patients differ from private insurance:
Note that if there are x different categories, we include x-1 dummy variables in our
model. The omitted group is always the comparison.
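A minimal sketch of building the x − 1 dummies from a single payor column (the column name and categories here are hypothetical, just to illustrate the coding):

    import pandas as pd

    # Hypothetical payor column with three categories; Private will be the omitted group
    payor = pd.Series(["Medicare", "Private", "Medicaid", "Medicare", "Private"])

    # Keep only two of the three indicator columns: Private is the comparison group
    dummies = pd.get_dummies(payor)[["Medicare", "Medicaid"]].astype(int)
    print(dummies)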
SUMMARY OUTPUT

Regression Statistics
Multiple R            0.708648
R Square              0.502181
Adjusted R Square     0.500638
Standard Error        6390.316
Observations          1619

ANOVA
              df      SS          MS          F           Significance F
Regression       5    6.64E+10    1.33E+10    325.4273    3E-241
Residual      1613    6.59E+10    40836132
Total         1618    1.32E+11

              Coefficients    Standard Error    t Stat      P-value     Lower 95%    Upper 95%
Intercept     10455.71        783.8259          13.33933    1.4E-38     8918.287     11993.13
LOS           4405.033        110.2953          39.93854    4.2E-243    4188.696     4621.37
Age           -36.7734        13.50197          -2.72356    0.006528    -63.2567     -10.2902
Male          -100.697        324.8377          -0.30999    0.756607    -737.845     536.451
Medicare      1087.631        498.349           2.182469    0.029219    110.1516     2065.111
Medicaid      59.37047        360.6697          0.164612    0.86927     -648.06      766.8009
Note the change in the Age effect – larger and more significant – once we account for
payor status. Male is still not significant. The Medicare coefficient of 1087 says that, all
else equal, costs are 1087 more for a Medicare patient than for a private patient. There is
not a significant difference between Medicaid and private.
IV. Interaction Effects
Suppose we’re interested in explaining total charges and we think LOS and gender are
among the explanatory variables. But now what if we think the effect of LOS on charges
is different for males than for females? How might we deal with this? Note that the idea is
that not only is there an intercept difference, but there is a slope difference as well. To
get at this we can interact LOS and Male – that is, create a new variable that multiplies
the two together. Then we would get something like the following:
SUMMARY OUTPUT

Regression Statistics
Multiple R            0.710345
R Square              0.504589
Adjusted R Square     0.502745
Standard Error        6376.819
Observations          1619

ANOVA
                 df      SS          MS          F           Significance F
Regression          6    6.68E+10    1.11E+10    273.6445    1.2E-241
Residual         1612    6.56E+10    40663816
Total            1618    1.32E+11

                 Coefficients    Standard Error    t Stat      P-value     Lower 95%    Upper 95%
Intercept        10409.36        777.6643          13.38541    8.01E-39    8884.017     11934.7
LOS              4564.085        142.0642          32.12693    1.9E-175    4285.436     4842.735
Age              -36.9463        13.4717           -2.74251    0.006165    -63.3702     -10.5224
Male             118.9449        484.3257          0.245589    0.806032    -831.029     1068.919
Medicare         365.9278        555.5593          0.658666    0.510205    -723.767     1455.622
male*LOS         -406.643        224.6731          -1.80993    0.070493    -847.325     34.03915
Male*Medicare    1814.519        787.2146          2.304986    0.021294    270.4473     3358.591
Note the adjusted R2 increases which suggests that adding this new variable is “worth it”.
How do we interpret?
Males have charges that are $119 higher than females holding constant LOS and
Medicare. A one unit increase in LOS increases charges by $4564 for females, while the
effect for males is $406 LOWER. That is each day in the hospital increases charges for
males by 4564-406 = $4,158. So females start at a lower point, but increase faster with
LOS than do males.
Likewise, the Medicare effect says that charges for Medicare females are 367 higher than for
non-Medicare females, while for males the effect is 365.9 + 1814.5 = 2180.4.
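Building the interaction columns is just multiplication. A minimal sketch (a hypothetical data frame with the 0/1 dummies already constructed, for illustration only):

    import pandas as pd

    # Hypothetical data frame with the dummies already built
    patients = pd.DataFrame({
        "los":      [1, 3, 1, 4],
        "male":     [1, 0, 0, 1],
        "medicare": [0, 1, 0, 1],
    })

    # Interaction terms: multiply the two variables together
    patients["male_x_los"] = patients["male"] * patients["los"]
    patients["male_x_medicare"] = patients["male"] * patients["medicare"]
    print(patients)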
HCAI 5220
Fall 2012
Ed Schumacher
Homework #2
Due (around) Monday September 24th
1. Suppose you are interested in predicting length of stay. A sample of 581
pneumonia patients is taken from a consortium of hospitals. These data are
found in the “Consortium” worksheet of the Homework2.xlsx Excel file.
Initially, you think that age causes LOS.
a. Plot a scatter diagram between age and LOS. Does it look like there is a
linear relationship between age and LOS? What other observations do
you have about the diagram?
b. Use the least-squares method to find the regression coefficients bo and b1.
c. Interpret the meaning of your estimates bo and b1.
d. What is the predicted LOS for a patient with an age of 62?
e. What is the standard error of the estimate? Interpret.
f. Determine the coefficient of determination, r2, and interpret its meaning in
this problem.
g. At the .05 level of significance, is there evidence of a relationship between
age and LOS?
2. Now suppose you think there are other determinants of LOS. Namely, you
suspect that the gender of the patient, the number of complicating symptoms, and
whether the patient is covered by Medicaid have an effect along with the patient’s age.
a. Estimate this multiple regression model where LOS is the dependent
variable, and for independent variables include age, the number of
complications, male, and Medicaid. Provide an interpretation of your
coefficients.
b. How does the coefficient on age change here relative to the regression in
question 1?
c. Which variables are significant determinants of LOS?
d. How does the R-Squared in this model compare to that in question 1?
e. Is there evidence that the Medicaid effect is different by gender? Explain.
3. Now you want to use the significant variables found in question 2 to risk adjust
for the physicians in your hospital who treat pneumonia patients. The worksheet
titled “Our Hospital” displays patient data for the four main doctors who treated
pneumonia patients in your hospital this year.
a. Calculate the average patient characteristics for each doctor.
b. Based on a regression model using the significant variables found in
question 2, what is each doctor’s predicted average LOS? How does this
compare to their actual LOS?
c. Use the “Upper 95%” and “Lower 95%” from the regression output to
construct a 95% confidence interval for each doctor’s expected LOS.
d. Which doctors have a length of stay that is significantly greater than their
expected LOS?