Oct. 25 Handout

Stat 501 Oct. 25
Example
Data on hospital infection risk.
Y = Infection risk in hospital
X = Stay = average length of stay
Region = 1, 2, 3, or 4, indicating which of four U.S. regions the hospital is in.
We can’t use Region as a quantitative (numerical) predictor variable. The numbers
assigned to regions are arbitrary. Region is inherently a qualitative (categorical) variable.
Indicator Variables
We’ll define I1 = 1 if region 1, 0 otherwise; I2 = 1 if region 2, 0 otherwise; I3 = 1 if
region 3, 0 otherwise; I4 = 1 if region 4, 0 otherwise.
General Rule: When we have k categories it’s possible to define k indicator variables,
but we only need to use k−1 of them to have a fully specified regression. It doesn’t matter
which k−1 we use.
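As a minimal sketch of the rule above, the following Python builds all four indicators from a region code and checks that they always sum to 1, which is why the fourth is redundant once the model has an intercept. The `regions` list is made-up illustration data, not the hospital data set, and `make_indicators` is a hypothetical helper name.

```python
def make_indicators(region, k=4):
    """Return [I1, ..., Ik] for a region code in 1..k."""
    return [1 if region == j else 0 for j in range(1, k + 1)]

# Hypothetical region codes for six hospitals (illustration only).
regions = [1, 2, 3, 4, 2, 4]

for r in regions:
    i1, i2, i3, i4 = make_indicators(r)
    # Exactly one indicator is 1, so I4 is determined by the other three.
    assert i1 + i2 + i3 + i4 == 1
    assert i4 == 1 - (i1 + i2 + i3)
    print(r, [i1, i2, i3])  # only k-1 = 3 indicators enter the model
```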
Model for our Example
E(Y) = β0 + β1 Stay + β2 I1 + β3 I2 + β4 I3
Output:
Regression Analysis: InfctRsk versus Stay, I1, I2, I3
The regression equation is
InfctRsk = - 0.281 + 0.597 Stay - 1.23 I1 - 1.02 I2 - 1.28 I3
Predictor      Coef  SE Coef      T      P
Constant    -0.2813   0.6707  -0.42  0.676
Stay        0.59737  0.07710   7.75  0.000
I1          -1.2331   0.3821  -3.23  0.002
I2          -1.0244   0.3462  -2.96  0.004
I3          -1.2823   0.3248  -3.95  0.000

S = 1.01823   R-Sq = 40.7%   R-Sq(adj) = 38.5%

Analysis of Variance
Source           DF       SS      MS      F      P
Regression        4   74.141  18.535  17.88  0.000
Residual Error  104  107.827   1.037
Total           108  181.968

Source  DF  Seq SS
Stay     1  57.523
I1       1   0.436
I2       1   0.027
I3       1  16.155
Questions: How can we use the output to do an F-test of whether or not there is a region effect?
In terms of β coefficients, what would be the null hypothesis? Based on the t-tests, what do you
think the result will be?
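One way to approach the first question is a general linear F-test built from the sequential sums of squares, since I1, I2, and I3 enter the model after Stay. The sketch below only does the arithmetic with numbers copied from the output above; the variable names are illustrative, and it stops at the test statistic rather than looking up a p-value.

```python
# Sequential SS for I1, I2, I3 (each added after Stay), from the output.
seq_ss_region = [0.436, 0.027, 16.155]
mse_full = 1.037      # MS for Residual Error in the full model
df_region = 3         # number of indicator coefficients being tested
df_error = 104        # Residual Error DF

# Extra regression SS due to region, given Stay is already in the model.
ss_region = sum(seq_ss_region)

# F statistic for H0: beta2 = beta3 = beta4 = 0 (no region effect).
f_stat = (ss_region / df_region) / mse_full

print(f"SS(region | Stay) = {ss_region:.3f}")
print(f"F = {f_stat:.2f} on {df_region} and {df_error} df")
```

The statistic would then be compared with an F distribution on 3 and 104 degrees of freedom.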
Graph of fitted values from model above.
Notice that three regions (1, 2, 3) appear to be similar. The fourth region appears to have a
generally higher infection risk.
Question:
Here’s the sample regression equation:
InfctRsk = - 0.281 + 0.597 Stay - 1.23 I1 - 1.02 I2 - 1.28 I3
What is the equation for each line in the graph above? There are four different lines so we have
four different equations relating Infection Risk to Stay.
Region = 1 (Hint: I1 = 1, I2 = 0, I3 = 0)
Region = 2
Region = 3
Region = 4 (Hint: I1 = I2 = I3 = 0)
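The hints can be applied mechanically: substituting each region's indicator values into the fitted equation changes only the intercept. The following sketch, using the unrounded coefficients from the Coef column and hypothetical variable names, computes the intercept and slope of each region's line.

```python
# Fitted coefficients from the Coef column of the output.
b0, b_stay = -0.2813, 0.59737
b_ind = {"I1": -1.2331, "I2": -1.0244, "I3": -1.2823}

# Indicator values (I1, I2, I3) for each region; region 4 has all zeros.
codes = {1: (1, 0, 0), 2: (0, 1, 0), 3: (0, 0, 1), 4: (0, 0, 0)}

lines = {}
for region, (i1, i2, i3) in codes.items():
    intercept = b0 + b_ind["I1"] * i1 + b_ind["I2"] * i2 + b_ind["I3"] * i3
    lines[region] = (round(intercept, 4), b_stay)  # (intercept, slope)
    print(f"Region {region}: InfctRsk = {intercept:.3f} + {b_stay:.3f} Stay")
```

All four lines share the Stay slope, so they are parallel and differ only in height.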
The Tricky Part
When we use k−1 indicator variables, the interpretation of the regression coefficients multiplying
the indicator variables is not what you might think.
The correct interpretation is that a coefficient multiplying an indicator variable gives the
difference, when the numeric predictors are held constant, between the mean for the group
indicated by that indicator and the mean for the group whose indicator is not used in the equation.
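To see this interpretation numerically, the sketch below evaluates the fitted equation for region 1 and for region 4 (the region whose indicator is left out) at a few arbitrary Stay values. The `predict` helper and the chosen Stay values are illustrative assumptions; the coefficients come from the output above.

```python
def predict(stay, i1=0, i2=0, i3=0):
    """Fitted mean infection risk from the regression output."""
    return -0.2813 + 0.59737 * stay - 1.2331 * i1 - 1.0244 * i2 - 1.2823 * i3

diffs = []
for stay in (8.0, 10.0, 12.0):  # arbitrary lengths of stay
    region1 = predict(stay, i1=1)
    region4 = predict(stay)      # all indicators 0: the omitted group
    diffs.append(region1 - region4)
    print(f"Stay={stay}: region 1 - region 4 = {region1 - region4:.4f}")
```

At every Stay value the difference matches the I1 coefficient, up to floating-point rounding.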