Simple Linear Regression

Previously we discussed the notion of inference – using sample statistics to find out information about a population. This was done in the context of the mean of a random variable. In addition, we talked early on about correlation between two variables. In this chapter we want to take this a step further and formalize the relationship between two variables, and then extend this to multivariate analysis. This is done through the concept of regression analysis.

Suppose we are interested in the relationship between two variables: length of stay (LOS) and hospital costs. We think that LOS causes costs – a longer LOS results in higher costs. Note that with correlation there are no claims made about what causes what, just that the variables move together. Here we take it further by "modeling" the direction of the relationship. How would we go about testing this? If we take a sample of individual stays and measure their LOS and cost, we might get the following:

Stay   LOS   Cost
  1     3    2614
  2     5    4307
  3     2    2449
  4     3    2569
  5     3    1936
  6     5    7231
  7     5    5343
  8     3    4108
  9     1    1597
 10     2    4061
 11     2    1762
 12     5    4779
 13     1    2078
 14     3    4714
 15     4    3947
 16     2    2903
 17     1    1439
 18     1     820
 19     1    3309
 20     6    5476

Just looking at the data there appears to be a positive relationship: individuals with a longer LOS tend to have a higher cost, though not always. Another way of looking at the data is with a scatter diagram:

[Scatter diagram: LOS (x-axis, 0–7) vs. Cost (y-axis, $0–$8,000), with a fitted trend line]

Ignore the line for now. Cost is on the Y axis, LOS is on the X axis. It looks like there is a positive relationship between cost and LOS: simply eyeballing the data, the dots go up and to the right. The trend line connects the dots as best as possible. Note that the slope of this line tells you the marginal impact of LOS on costs: what happens to costs if LOS increases by 1 unit? The line intersects the Y axis just above zero; this is the predicted cost if LOS were zero. If the correlation coefficient between LOS and costs were 1, all the dots would lie exactly on this line. Note that for some observations the line is very close to the dot, while for others it is pretty far away. Let the distance between the dot and the line be the error in the trend line. The trend line is drawn so that this error is minimized. Since errors above the line are positive while errors below the line are negative, we have to be careful – positive errors will tend to wash out negative errors. Thus a strategy for estimating this line is to draw it so that we minimize the squared error. This is known as the Least Squares Method.

I. The Logic

The idea of least squares is as follows. In theory our relationship is:

Y = β0 + β1X + ε

Y is the dependent variable – the thing we are trying to explain. X is the independent variable – what is doing the explaining. β0 and β1 are population parameters that we are interested in knowing. In our case Y is the cost and X is LOS. β0 is the intercept (where the line crosses the Y axis), and β1 is the slope of the line. The coefficient β1 is the marginal impact of X on Y. These are population parameters that we do not know. From our sample we estimate the following:

Y = b0 + b1X + e

Note I've switched from Greek to English letters since we are now dealing with the sample. So b0 is an estimator for β0 and b1 is an estimator for β1. e is the error term, reflecting the fact that we will not always be exactly on our line.
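As a supplement to the notes (Python is not part of the original Excel workflow), here is a minimal sketch that applies the closed-form least-squares formulas to the 20 observations in the table above; the only assumption is that numpy is available.

```python
# A minimal sketch (supplementary, not from the original notes): closed-form least squares
# for the 20 LOS/cost observations above.
import numpy as np

los  = np.array([3, 5, 2, 3, 3, 5, 5, 3, 1, 2, 2, 5, 1, 3, 4, 2, 1, 1, 1, 6], dtype=float)
cost = np.array([2614, 4307, 2449, 2569, 1936, 7231, 5343, 4108, 1597, 4061,
                 1762, 4779, 2078, 4714, 3947, 2903, 1439, 820, 3309, 5476], dtype=float)

# slope: b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2);  intercept: b0 = ybar - b1 * xbar
b1 = np.sum((los - los.mean()) * (cost - cost.mean())) / np.sum((los - los.mean()) ** 2)
b0 = cost.mean() - b1 * los.mean()
print(b0, b1)   # roughly 997.5 and 818.8 -- the same numbers Excel reports below
```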
If we wanted to get a predicted value for Y (costs) we would use:

Ŷi = b0 + b1Xi     (the ^ means "predicted value")

Note the error term is gone and this is just the equation for the trend line. So suppose that b0 = 3 and b1 = 2; then someone with a LOS of 5 days would be predicted to have 3 + 2*5 = $13 in costs, and so on. Least squares finds b0 and b1 by minimizing the sum of the squared differences between the actual and predicted values of Y.

II. Specifics

Sum of squared differences = Σᵢ (Yi − Ŷi)²

Substituting:

Σᵢ (Yi − Ŷi)² = Σᵢ [Yi − (b0 + b1Xi)]²

Thus least squares finds b0 and b1 to minimize this expression. We are not going to go into the details here of how this is done, but we will focus on the intuition of what is going on. The easiest way to think about it is to go back to the scatter diagram: least squares draws the trend line to connect the dots the best way possible. We choose the parameters to minimize the size of our mistakes, or errors.

III. How do we do this in Excel?

Excel can do both simple (one independent variable) and multiple (more than one) regression. You need the Analysis ToolPak add-in to do it.

Load the Analysis ToolPak:
1. Click the File tab, click Options, and then click the Add-Ins category.
2. In the Manage box, select Excel Add-ins and then click Go.
3. In the Add-Ins available box, select the Analysis ToolPak check box, and then click OK.

To do a regression: click on the Data tab, then click on Data Analysis, then click on Regression and click OK. Then give it the Y input range and the X input range. You get the following output:

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.807337
R Square             0.651794
Adjusted R Square    0.632449
Standard Error       995.4983
Observations         20

ANOVA
             df    SS          MS          F          Significance F
Regression    1    33390795    33390795    33.69347   1.69E-05
Residual     18    17838305    991016.9
Total        19    51229100

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   997.4659       465.7353         2.141701   0.046147   18.99164    1975.94
LOS         818.8394       141.0671         5.804607   1.69E-05   522.4681    1115.211

What does all this mean? Skip the first two sections for now and just look at the bottom part. The numbers under Coefficients are our coefficient estimates:

Costi = 997.5 + 818.8*LOSi + ei

So we would predict that a patient starts with a cost of 997.5 and each day adds 818.8 to the cost. The least squares estimate of the effect of LOS on cost is $818.8 per day. So someone with a LOS of 5 days is predicted to have 997.5 + 818.8*5 = $5,091.5 in costs. The statistics behind all this can get pretty complicated, but the interpretation is easy. And note that we can now add as many variables as we want, and the coefficient estimate for each variable is calculated holding constant the other right-hand-side variables. Keep in mind that the direction of causation is imposed on the model by us – it takes theory to justify it – and if important determinants of Y are left out of the model, the estimates can suffer from omitted variable bias.
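For readers who want to reproduce the Excel output outside of Excel, a sketch using the statsmodels package (one tool among several) is below; it reuses the los and cost arrays defined in the earlier sketch.

```python
# Sketch: reproducing the Excel regression output with statsmodels
# (uses the los and cost arrays from the previous sketch).
import statsmodels.api as sm

X = sm.add_constant(los)          # adds the intercept column, as Excel does automatically
model = sm.OLS(cost, X).fit()
print(model.params)               # intercept ~997.5, LOS slope ~818.8
print(model.rsquared)             # ~0.652
print(model.summary())            # standard errors, t stats, p-values, and 95% CIs
```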
IV. Measures of Variation

We now want to talk about how well the model predicts the dependent variable. Is it a good model or not? This will then allow us to make inferences about our model.

The Sum of Squares

The total sum of squares (SST) is a measure of the variation of the Y values around their mean – a measure of how much variation there is in our dependent variable:

SST = Σᵢ (Yi − Ȳ)²

[Note that if we divided by n − 1 we would get the sample variance of Y.]

The total sum of squares can be divided into two components: explained variation, or the regression sum of squares (SSR) – that which is attributable to the relationship between X and Y – and unexplained variation, or the sum of squared error (SSE) – that which is attributable to factors other than the relationship between X and Y.

[Diagram: one observation (Xi, Yi), the horizontal line at the mean Ȳ, and the fitted regression line Ŷ = b0 + b1X. The vertical distances mark the total deviation (Yi − Ȳ), the explained deviation (Ŷi − Ȳ), and the unexplained deviation (Yi − Ŷi).]

The dot represents one particular actual observation (Xi, Yi), the horizontal line represents the mean of Y (Ȳ), and the upward-sloping line represents the estimated regression line. The distance between Yi and Ȳ is the total variation (SST). This is broken into two parts: that explained by X and that not explained. The distance between the predicted value of Y and the mean of Y is the part of the variation that is explained by X; this is SSR. The distance between the predicted value of Y and the actual value of Y is the unexplained portion of the variation; this is SSE.

Suppose that X had no effect whatsoever on Y. Then the best regression line would simply be the mean of Y, so the predicted value of Y would always be the mean of Y no matter what X is. X would be doing nothing to help us explain Y, and all the variation in Y would be unexplained. Suppose, alternatively, that the predicted value is exactly correct – the dot sits on the regression line. Then all the variation in Y is being explained by the variation in X; in other words, if you know X you know Y exactly.

As shown above, some of the variation in Y is due to variation in X [(Ŷi − Ȳ)²] and some of the variation is not explained by variation in X [(Yi − Ŷi)²].

So to get the SSR we calculate:  SSR = Σᵢ (Ŷi − Ȳ)²

And to get the SSE we calculate:  SSE = Σᵢ (Yi − Ŷi)²

Referring back to our first regression output, the middle table looks as follows:

ANOVA
             df    SS          MS          F          Significance F
Regression    1    33390795    33390795    33.69347   1.69E-05
Residual     18    17838305    991016.9
Total        19    51229100

The third column is labeled SS (sum of squares), and the first row is Regression, so SSR = 33390795. Residual is another word for error (or leftover), so SSE = 17838305, and SST = 51229100. Notice that 51229100 = 33390795 + 17838305.

How do we use this information? In general, the method of least squares chooses the coefficients so as to minimize SSE. So we want SSE to be as small as possible – or equivalently, we want SSR to be as big as possible. Notice that the closer SSR is to SST, the better our regression is doing. In a perfect world SSR = SST: our model explains ALL the variation in Y. So the ratio of SSR to SST tells us how our model is doing. This is known as the coefficient of determination, or R².

R² = SSR/SST.  For our example: R² = 33390795/51229100 = .652

Is this good or bad? It depends. Here, 65% of the variation in costs can be explained by variation in LOS. It is not surprising that this is well below 1, since there are many other things besides LOS that determine costs.
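Continuing the earlier Python sketches (los, cost, b0, and b1 already defined), the sums of squares and R² can be computed directly; the numbers should match the ANOVA table above.

```python
# Sketch: the sums of squares by hand (continuing from the earlier sketches).
yhat = b0 + b1 * los                        # predicted cost for each stay
sst  = np.sum((cost - cost.mean()) ** 2)    # total variation in Y        (~51229100)
ssr  = np.sum((yhat - cost.mean()) ** 2)    # explained by X              (~33390795)
sse  = np.sum((cost - yhat) ** 2)           # unexplained (residual)      (~17838305)
print(sst, ssr + sse)                       # SST = SSR + SSE
print(ssr / sst)                            # R^2, about .652
```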
Standard Error of the Estimate

For just about any regression, the data points will not all be exactly on the regression line. We want to be able to measure the variability of the actual Y around the predicted Y. This is similar to the standard error of the mean as a measure of variability around a sample mean. It is called the standard error of the estimate:

S_YX = sqrt[ SSE / (n − 2) ] = sqrt[ Σᵢ (Yi − Ŷi)² / (n − 2) ]

Notice that this looks very much like the standard deviation of a random variable, but here we are looking at the variation of actual values around a prediction. For our example:

S_YX = sqrt(17838305 / 18) = 995.5

Note that the top table in the Excel output lists the R-squared and the Standard Error, among other things. This is a measure of the variation around the fitted regression line – a loose interpretation is that on average the data points are about $995 off of the regression line. We will use this in the next section to make inferences about our coefficients.

V. Inference

We made our estimates above for the regression line based on our sample information. These are estimates of the (unknown) population parameters. In this section we want to make inferences about the population using our sample information.

t-test for the slope

Again, our estimate of β1 is b1. We can show that under certain assumptions (to come in a bit) b1 is an unbiased estimator for β1. But as discussed above there will still be some sampling error associated with this estimate, so we can't conclude that β1 = b1 every time, only on average. Thus we need to take this sampling variability into account. Suppose we have the following null and alternative hypotheses:

H0: β1 = 0 (there is no relationship between X and Y)
H1: β1 ≠ 0 (there is a relationship)

This can also be one-tailed if you have some prior information to make it so. Our test statistic is:

t = (b1 − β1) / S_b1

where S_b1 is the standard error of the coefficient:

S_b1 = S_YX / √SSX,   where SSX = Σᵢ (Xi − X̄)²

This follows a t-distribution with n − 2 degrees of freedom. [NOTE: in general this test has n − k − 1 degrees of freedom, where k is the number of right-hand-side variables. In this case k = 1, so it is just n − 2.] So the standard error of the coefficient is the standard error of the estimate divided by the square root of the sum of squared deviations of X around its mean.

Again note the bottom part of the Excel output:

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   997.4659       465.7353         2.141701   0.046147   18.99164    1975.94
LOS         818.8394       141.0671         5.804607   1.69E-05   522.4681    1115.211

So our LOS coefficient is 818.8. Is this statistically different from zero? Our test statistic is t = (818.8 − 0)/141.1 = 5.8. Using the reported p-value of .0000169, we would reject the null hypothesis and say there is evidence that β1 is not zero. That is, LOS has a significant effect on costs. The t-test can be used to test each individual coefficient for significance in a multiple regression framework; the logic is just the same.

One could also test other hypotheses. Suppose it used to be the case that each day in the hospital resulted in a $1,000 charge – is there evidence that this has changed?

H0: β1 = 1000
Ha: β1 ≠ 1000

t = (818.8 − 1000)/141.06 = −1.28. The p-value associated with this is .215 – there is a 21.5% chance of getting a coefficient of 818 or further away from 1000 if the null is true. Thus, we fail to reject the null and conclude that there is no evidence that the slope is different from 1000.

We could also estimate a confidence interval for the slope:

b1 ± t_{n−2} · S_b1

where t_{n−2} is the appropriate critical value for t. You can get Excel to spit this out for you as well: just click the confidence interval box, type in the level of confidence, and it will include the upper and lower limits in the output. For my example we are 95% confident that the population parameter β1 is between 522 and 1115.
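Again continuing the earlier sketches, the same inference calculations can be reproduced with scipy; the numbers should match the Excel output above.

```python
# Sketch: standard errors, the t test for the slope, and a 95% CI (continuing from above).
from scipy import stats

n    = len(los)
s_yx = np.sqrt(sse / (n - 2))                              # standard error of the estimate, ~995.5
s_b1 = s_yx / np.sqrt(np.sum((los - los.mean()) ** 2))     # standard error of the slope, ~141.1

t_stat = (b1 - 0) / s_b1                                   # H0: beta1 = 0  ->  ~5.80
p_val  = 2 * stats.t.sf(abs(t_stat), df=n - 2)             # two-tailed p-value, ~1.7e-05

t_crit = stats.t.ppf(0.975, df=n - 2)                      # critical value, 18 degrees of freedom
print(b1 - t_crit * s_b1, b1 + t_crit * s_b1)              # roughly 522 to 1115
```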
Multiple Regression

I. Introduction

The simple regression can easily be expanded to a multivariate setting. Our model can be written as:

Yi = β0 + β1X1i + β2X2i + … + βkXki + εi

So we have k explanatory variables. The interpretation of the β's is the same as in the simple regression framework. For example, β1 is the marginal influence of X1 on the dependent variable Y, holding all the other explanatory variables constant. This is easy to do in Excel: it is just like simple regression except that one needs to have all the X variables side by side in the spreadsheet. Inference about individual coefficients is exactly the same as in simple regression.

Suppose we have the following data for 10 hospitals:

Y: Cost   X1: Size   X2: Visibility
  2750       225           6
  2400       200          37
  2920       300          14
  1800       350          33
  3520       200          11
  2270       250          21
  3100       175          21
  1980       400          22
  2680       350          20
  2720       275          16

Cost is the cost per case for each hospital, Size is the size of the hospital in number of beds, and Visibility is a scale that measures how much the administrator knows about competitor hospitals. In this case we might expect larger hospitals to have lower costs per case, and when the administrator has more knowledge about his/her competition, costs should be lower as well.

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.834034
R Square             0.695612
Adjusted R Square    0.608644
Standard Error       323.9537
Observations         10

ANOVA
             df    SS         MS        F          Significance F
Regression    2    1678818    839409    7.998485   0.01556
Residual      7    734622     104946
Total         9    2413440

             Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept    4240.131       435.6084         9.733813   2.56E-05   3210.081    5270.181
Size         -3.76232       1.442784         -2.60768   0.035032   -7.17395    -0.35068
Visibility   -29.8955       11.66298         -2.56328   0.037372   -57.4741    -2.31699

So in this case we would say that each additional bed lowers the per-case cost of the hospital by $3.76, and every one-unit increase in the visibility scale lowers costs by $29.90. Note that these are not the same results we would get if we ran two simple regressions.

If we only included Size:

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.640237
R Square             0.409904
Adjusted R Square    0.336142
Standard Error       421.9245
Observations         10

ANOVA
             df    SS          MS          F          Significance F
Regression    1    989277.9    989277.9    5.557108   0.046149
Residual      8    1424162     178020.3
Total         9    2413440

             Coefficients   Standard Error   t Stat      P-value    Lower 95%   Upper 95%
Intercept    3804.717       522.4326         7.282694    8.53E-05   2599.984    5009.449
Size         -4.3696        1.853606         -2.357359   0.046149   -8.64403    -0.09518

While if we only included Visibility:

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.632394
R Square             0.399922
Adjusted R Square    0.324912
Standard Error       425.4781
Observations         10

ANOVA
             df    SS          MS          F          Significance F
Regression    1    965187.2    965187.2    5.331595   0.049765
Residual      8    1448253     181031.6
Total         9    2413440

             Coefficients   Standard Error   t Stat      P-value
Intercept    3315.282       332.1822         9.980311    8.61E-06
Visibility   -34.8896       15.11012         -2.309025   0.049765

Why the changes in the coefficients? When the explanatory variables are correlated with each other, each simple regression attributes to its single variable some of the variation that really belongs to the omitted one – omitted variable bias at work. Note also that the R-squareds do not add up: the multiple regression R² is .695, while the two simple R²'s are .40 and .41.
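The 10-hospital regression can also be reproduced in Python; this sketch types in the data from the table above and should roughly match the multiple-regression output (and its overall F statistic) shown above.

```python
# Sketch: the 10-hospital multiple regression, typed in from the table above.
import numpy as np
import statsmodels.api as sm

cost_pc    = np.array([2750, 2400, 2920, 1800, 3520, 2270, 3100, 1980, 2680, 2720], dtype=float)
size       = np.array([225, 200, 300, 350, 200, 250, 175, 400, 350, 275], dtype=float)
visibility = np.array([6, 37, 14, 33, 11, 21, 21, 22, 20, 16], dtype=float)

X = sm.add_constant(np.column_stack([size, visibility]))
model = sm.OLS(cost_pc, X).fit()
print(model.params)                  # ~4240.1, -3.76, -29.9 (intercept, Size, Visibility)
print(model.rsquared)                # ~0.696
print(model.fvalue, model.f_pvalue)  # overall F test: ~8.0 with p ~0.016
```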
II. Testing for the Significance of the Multiple Regression Model

F-test

Another general summary measure for the regression model is the F-test for overall significance. This tests whether or not any of our explanatory variables are important determinants of the dependent variable. It is a type of ANOVA test. Our null and alternative hypotheses are:

H0: β1 = β2 = … = βk = 0 (none of the variables are significant)
H1: At least one βj ≠ 0

Here the F statistic is:

F = (SSR / k) / (SSE / (n − k − 1))

Notice that this statistic is the ratio of the regression sum of squares to the error sum of squares, each divided by its degrees of freedom. If our regression is doing a lot towards explaining the variation in Y, then SSR will be large relative to SSE and this will be a "big" number. Whereas if the variables are not doing much to explain Y, then SSR will be small relative to SSE and this will be a "small" number. This ratio follows the F distribution with k and n − k − 1 degrees of freedom.

The middle portion of the Excel output contains this information (this is the hospital model with Size and Visibility):

ANOVA
             df    SS         MS        F          Significance F
Regression    2    1678818    839409    7.998485   0.01556
Residual      7    734622     104946
Total         9    2413440

F = (1678818/2) / (734622/(10 − 2 − 1)) = 839409/104946 = 7.998

The "Significance F" is the p-value. So we would reject H0 and conclude there is evidence that at least one of the explanatory variables is contributing to the model. Note that this is a pretty weak test: it could be only one of the variables, or all of them, or something in between. It just tells us that something in our model matters.

III. Dummy Variables in Regression

Up to this point we have assumed that all the explanatory variables are numerical. But suppose we think that, say, costs per case might differ between males and females. How would we incorporate this into our regression? The simplest way is to assume that the only difference between men and women is in the intercept (that is, the coefficients on all the other variables are equal for men and women).

[Diagram: Costs vs. LOS, with two parallel upward-sloping lines – one for men and one for women – whose intercepts are labeled α_male and α_female.]

Assume for now the only other variable that matters is LOS. The idea is that we think men cost more (or less) than women independent of LOS; that is, the male intercept (α_male) is greater than the female intercept (α_female). We can incorporate this into our regression by creating a dummy variable for gender. Suppose we let the variable Male = 1 if the individual is a male, and 0 otherwise. Then our equation becomes:

Costi = β0 + β1LOSi + β2Agei + β3Malei + εi

So if the individual is male the variable Male is "on," and if she is female Male is "off." The coefficient β3 indicates how much more (or less) males cost than females – in theory this can be positive or negative. In terms of our graph, α_female = β0 and α_male = β0 + β3. So the dummy variable indicates how much the intercept shifts up or down for that group.

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.70752
R Square             0.500584
Adjusted R Square    0.499657
Standard Error       6396.593
Observations         1619

ANOVA
             df      SS          MS          F          Significance F
Regression      3    6.62E+10    2.21E+10    539.5931   7.3E-243
Residual     1615    6.61E+10    40916408
Total        1618    1.32E+11

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   9719.703       683.4021         14.22253   2.45E-43   8379.255    11060.15
LOS         4423.314       110.0995         40.17561   2.9E-245   4207.361    4639.267
Age         -19.4485       11.12886         -1.74758   0.080727   -41.2771    2.379983
Male        -76.9771       324.6287         -0.23712   0.812591   -713.715    559.7607

This says that costs start at 9719, and each extra day in the hospital adds 4423 to costs, all else equal. Likewise, every year older the patient is lowers costs by 19. Finally, the point estimate says that a male with the same LOS and Age as a female will have costs that are about 77 lower. Note, however, that this is not a significant effect, so we would conclude that there is no evidence that males are different from females.
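The patient-level file behind this output is not included with these notes, so the sketch below is purely illustrative: the file name and column names (gender, los, age, cost) are hypothetical, but it shows how a dummy variable is typically built and used.

```python
# Illustrative sketch only: the 1,619-observation patient file is not included with these
# notes, so the file name and column names here are hypothetical.
import pandas as pd
import statsmodels.api as sm

df = pd.read_excel("patients.xlsx")              # hypothetical file
df["male"] = (df["gender"] == "M").astype(int)   # dummy: 1 if male, 0 otherwise

X = sm.add_constant(df[["los", "age", "male"]])
model = sm.OLS(df["cost"], X).fit()
print(model.params)   # the 'male' coefficient is the intercept shift for men
```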
This can be done for more than two categories. Suppose we think that costs also differ by payor status; then we can write:

Costi = β0 + β1LOSi + β2Agei + β3Malei + β4Medicarei + β5Medicaidi + εi

where Medicare is a dummy variable equal to 1 if the individual is covered by Medicare, and Medicaid is equal to 1 if they are covered by Medicaid. Private insurance is the omitted group – just as female is not explicitly accounted for. Thus the coefficients β4 and β5 indicate how costs for Medicare and Medicaid patients differ from private insurance. Note that if there are x different categories, we include x − 1 dummy variables in our model; the omitted group is always the comparison.

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.708648
R Square             0.502181
Adjusted R Square    0.500638
Standard Error       6390.316
Observations         1619

ANOVA
             df      SS          MS          F          Significance F
Regression      5    6.64E+10    1.33E+10    325.4273   3E-241
Residual     1613    6.59E+10    40836132
Total        1618    1.32E+11

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   10455.71       783.8259         13.33933   1.4E-38    8918.287    11993.13
LOS         4405.033       110.2953         39.93854   4.2E-243   4188.696    4621.37
Age         -36.7734       13.50197         -2.72356   0.006528   -63.2567    -10.2902
Male        -100.697       324.8377         -0.30999   0.756607   -737.845    536.451
Medicare    1087.631       498.349          2.182469   0.029219   110.1516    2065.111
Medicaid    59.37047       360.6697         0.164612   0.86927    -648.06     766.8009

Note the change in the age effect – larger and more significant – once we account for payor status. Male is still not significant. The Medicare coefficient of 1087 says that, all else equal, costs are 1087 more for a Medicare patient than for a private patient. There is not a significant difference between Medicaid and private.

IV. Interaction Effects

Suppose we are interested in explaining total charges and we think LOS and gender are among the explanatory variables. But now what if we think the effect of LOS on charges is different for males than for females? How might we deal with this? Note that the idea is that not only is there an intercept difference, there is a slope difference as well. To get at this we can interact LOS and Male – that is, create a new variable that multiplies the two together. Then we get something like the following:

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.710345
R Square             0.504589
Adjusted R Square    0.502745
Standard Error       6376.819
Observations         1619

ANOVA
             df      SS          MS          F          Significance F
Regression      6    6.68E+10    1.11E+10    273.6445   1.2E-241
Residual     1612    6.56E+10    40663816
Total        1618    1.32E+11

                Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept       10409.36       777.6643         13.38541   8.01E-39   8884.017    11934.7
LOS             4564.085       142.0642         32.12693   1.9E-175   4285.436    4842.735
Age             -36.9463       13.4717          -2.74251   0.006165   -63.3702    -10.5224
Male            118.9449       484.3257         0.245589   0.806032   -831.029    1068.919
Medicare        365.9278       555.5593         0.658666   0.510205   -723.767    1455.622
Male*LOS        -406.643       224.6731         -1.80993   0.070493   -847.325    34.03915
Male*Medicare   1814.519       787.2146         2.304986   0.021294   270.4473    3358.591

Note that the adjusted R² increases, which suggests that adding these interaction terms is "worth it." How do we interpret this? Males have charges that are $119 higher than females, holding constant LOS and Medicare. A one-unit increase in LOS increases charges by $4,564 for females, while the effect for males is $406 LOWER. That is, each day in the hospital increases charges for males by 4564 − 406 = $4,158. So females start at a lower point, but their charges increase faster with LOS than do males'. Likewise, the Medicare effect says that charges for Medicare females are 365.9 higher than for non-Medicare females, while for males the Medicare effect is 365.9 + 1814.5 = 2180.4.
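Continuing the same hypothetical patient DataFrame as in the earlier sketch, interaction terms are just new columns formed by multiplying two existing ones; the sketch below builds male*LOS and male*Medicare and shows how the female and male LOS slopes are recovered from the coefficients (the payor column name is again hypothetical).

```python
# Illustrative sketch (same hypothetical DataFrame as above): interaction terms are just
# products of existing columns.
df["medicare"]      = (df["payor"] == "Medicare").astype(int)   # hypothetical payor column
df["male_los"]      = df["male"] * df["los"]
df["male_medicare"] = df["male"] * df["medicare"]

X = sm.add_constant(df[["los", "age", "male", "medicare", "male_los", "male_medicare"]])
model = sm.OLS(df["cost"], X).fit()

b = model.params
print(b["los"])                  # LOS slope for females
print(b["los"] + b["male_los"])  # LOS slope for males
```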
HCAI 5220
Fall 2012
Ed Schumacher

Homework #2
Due (around) Monday, September 24th

1. Suppose you are interested in predicting length of stay. A sample of 581 pneumonia patients is taken from a consortium of hospitals. These data are found in the "Consortium" worksheet of the Homework2.xlsx Excel file. Initially, you think that age causes LOS.
a. Plot a scatter diagram between age and LOS. Does it look like there is a linear relationship between age and LOS? What other observations do you have about the diagram?
b. Use the least-squares method to find the regression coefficients b0 and b1.
c. Interpret the meaning of your estimates b0 and b1.
d. What is the predicted LOS for a patient with an age of 62?
e. What is the standard error of the estimate? Interpret.
f. Determine the coefficient of determination, r2, and interpret its meaning in this problem.
g. At the .05 level of significance, is there evidence of a relationship between age and LOS?

2. Now suppose you think there are other determinants of LOS. Namely, you suspect that the gender of the patient, the number of complicating symptoms, and whether the patient is covered by Medicaid have an effect along with the patient's age.
a. Estimate this multiple regression model where LOS is the dependent variable, and for independent variables include age, the number of complications, male, and Medicaid. Provide an interpretation of your coefficients.
b. How does the coefficient on age change here relative to the regression in question 1?
c. Which variables are significant determinants of LOS?
d. How does the R-squared in this model compare to that in question 1?
e. Is there evidence that the Medicaid effect is different by gender? Explain.

3. Now you want to use the significant variables found in question 2 to risk-adjust for the physicians in your hospital who treat pneumonia patients. The worksheet titled "Our Hospital" displays patient data for the four main doctors who treated pneumonia patients in your hospital this year.
a. Calculate the average patient characteristics for each doctor.
b. Based on a regression model using the significant variables found in question 2, what is each doctor's predicted average LOS? How does this compare to their actual LOS?
c. Use the "Upper 95%" and "Lower 95%" from the regression output to construct a 95% confidence interval for each doctor's expected LOS.
d. Which doctors have a length of stay that is significantly greater than their expected LOS?