Stat 501 Oct. 25

Example

Data on hospital infection risk.
Y = Infection risk in hospital
X = Stay = average length of stay
Region = 1, 2, 3, or 4, indicating which of four regions of the U.S. the hospital is in.

We can’t use Region as a quantitative (numerical) predictor variable. The numbers assigned to regions are arbitrary. Region is inherently a qualitative (categorical) variable.

Indicator Variables

We’ll define
I1 = 1 if region 1, 0 otherwise;
I2 = 1 if region 2, 0 otherwise;
I3 = 1 if region 3, 0 otherwise;
I4 = 1 if region 4, 0 otherwise.

General Rule: When we have k categories it’s possible to define k indicator variables, but we only need to use k−1 of them to have a fully specified regression. It doesn’t matter which k−1 we use.

Model for our Example

$E(Y) = \beta_0 + \beta_1\,\text{Stay} + \beta_2 I_1 + \beta_3 I_2 + \beta_4 I_3$

Output:

Regression Analysis: InfctRsk versus Stay, I1, I2, I3

The regression equation is
InfctRsk = - 0.281 + 0.597 Stay - 1.23 I1 - 1.02 I2 - 1.28 I3

Predictor      Coef   SE Coef      T      P
Constant    -0.2813    0.6707  -0.42  0.676
Stay        0.59737   0.07710   7.75  0.000
I1          -1.2331    0.3821  -3.23  0.002
I2          -1.0244    0.3462  -2.96  0.004
I3          -1.2823    0.3248  -3.95  0.000

S = 1.01823   R-Sq = 40.7%   R-Sq(adj) = 38.5%

Analysis of Variance

Source           DF       SS      MS      F      P
Regression        4   74.141  18.535  17.88  0.000
Residual Error  104  107.827   1.037
Total           108  181.968

Source  DF  Seq SS
Stay     1  57.523
I1       1   0.436
I2       1   0.027
I3       1  16.155

Questions: How can we use the output to do an F-test of whether there is or is not a region effect? In terms of β coefficients, what would be the null hypothesis? Based on the t-tests, what do you think will be the result?

Graph of fitted values from model above.

Notice that three regions (1, 2, 3) appear to be similar. The fourth region looks to have a generally higher infection risk.

Question: Here’s the sample regression model.

The regression equation is
InfctRsk = - 0.281 + 0.597 Stay - 1.23 I1 - 1.02 I2 - 1.28 I3

What is the equation for each line in the graph above?
There are four different lines, so we have four different equations relating Infection Risk to Stay.

Region = 1, Hint: I1 = 1, I2 = 0, I3 = 0
Region = 2
Region = 3
Region = 4, Hint: I1 = I2 = I3 = 0

The Tricky Part

When we use k−1 indicator variables, the interpretation of regression coefficients multiplying indicator variables is not what you might think. The correct interpretation is that a coefficient multiplying an indicator variable gives the difference, when the numeric predictors are held constant, between the mean for the group indicated by this indicator and the mean for the group whose indicator is not used in the equation. In our example the unused indicator is I4, so each indicator coefficient compares its region with region 4.
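
The arithmetic behind the questions above can be checked with a short Python sketch. It uses only the coefficients and sums of squares printed in the output; the names `region_intercept`, `predict`, and `f_stat` are made up for this illustration, and Stay = 10 is just an arbitrary value for the comparison. The F-statistic is one possible way to set up the region test, assuming the sequential sums of squares for I1, I2, I3 (entered after Stay) can be added to get the extra sum of squares for Region given Stay.

```python
# Coefficients copied from the regression output in this handout.
b0 = -0.2813        # Constant
b_stay = 0.59737    # slope for Stay (the same in every region)

# Indicator coefficients. Region 4 has no indicator in the model,
# so it is the baseline group and contributes 0.
b_region = {1: -1.2331, 2: -1.0244, 3: -1.2823, 4: 0.0}


def region_intercept(region):
    """Intercept of the fitted line for one region."""
    return b0 + b_region[region]


def predict(stay, region):
    """Fitted infection risk for a given Stay and region."""
    return region_intercept(region) + b_stay * stay


# One line per region: same slope, different intercepts.
for r in sorted(b_region):
    print(f"Region {r}: InfctRsk = {region_intercept(r):.4f} + {b_stay} Stay")

# "The Tricky Part": at the same Stay, region 1 minus region 4
# equals the I1 coefficient (up to floating-point rounding).
diff_1_vs_4 = predict(10, 1) - predict(10, 4)
print(f"Region 1 minus region 4 at Stay = 10: {diff_1_vs_4:.4f}")

# F-test sketch for H0: beta2 = beta3 = beta4 = 0 (no region effect).
seq_ss_region = [0.436, 0.027, 16.155]   # Seq SS for I1, I2, I3
sse, df_error = 107.827, 104             # Residual Error row
ms_region = sum(seq_ss_region) / len(seq_ss_region)
f_stat = ms_region / (sse / df_error)
print(f"F = {f_stat:.2f} on {len(seq_ss_region)} and {df_error} df")
```

The resulting F would be compared with an F distribution with 3 and 104 degrees of freedom. The sketch stops short of computing a p-value so that it needs nothing beyond the standard library.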