Simple Linear Regression

Previously we’ve discussed the notion of inference – using sample statistics to find out information about a population. This was done in the context of the mean of a random variable. In addition we talked early on about correlation between two variables. With the correlation coefficient and Chi-Square tests we were able to see whether there were relationships between random variables. Here we want to take this a step further and formalize the relationship between two variables, and then extend this to multivariate analysis. This is done through the concept of regression analysis.

Suppose we are interested in the relationship between two variables: length of stay (LOS) and hospital costs. We think that LOS causes costs: a longer LOS results in higher costs. Note that with correlation there are no claims made about what causes what – just that they move together. Here we are taking it further by “modeling” the direction of the relationship. How would we go about testing this? If we take a sample of individual stays and measure their LOS and cost, we might get the following:

Stay   LOS   Cost
  1     3    2614
  2     5    4307
  3     2    2449
  4     3    2569
  5     3    1936
  6     5    7231
  7     5    5343
  8     3    4108
  9     1    1597
 10     2    4061
 11     2    1762
 12     5    4779
 13     1    2078
 14     3    4714
 15     4    3947
 16     2    2903
 17     1    1439
 18     1     820
 19     1    3309
 20     6    5476

Just looking at the data there appears to be a positive relationship – individuals with a longer LOS seem to have a higher cost, but not always. Another way of looking at the data would be with a scatter diagram:

[Scatter diagram: LOS vs. Cost, with Cost ($0–8000) on the Y axis, LOS (0–7) on the X axis, and a fitted trend line.]

Ignore the trend line for now. Cost is on the Y axis, LOS is on the X. It looks like there is a positive relationship between cost and LOS. Simply eyeballing the data, the dots go up and to the right.

The trend line basically connects the dots as best as possible. Note that the slope of this line tells you the marginal impact of LOS on charges: what happens to costs if LOS increases by 1 unit? This line intersects the Y axis just above zero. This would be the predicted cost if LOS were zero. If the correlation coefficient between LOS and costs were 1, then all the dots would be on this line. Note that for some observations the line is very close to the dot, while for others it is pretty far away. Let the distance between any given dot and the line be the error in the trend line. The trend line is drawn so that these errors are minimized. Since errors above the line will be positive, while errors below the line will be negative, we have to be careful – positive errors will tend to wash out negative errors. Thus a strategy in estimating this line is to draw the line such that we minimize the sum of the squared errors. This is known as the Least Squares Method.

I. The Logic

The idea of Least Squares is as follows. In theory our relationship is:

Y = β0 + β1X + ε

Y is the dependent variable – the thing we are trying to explain. X is the independent variable – what is doing the explaining. β0 and β1 are population parameters that we are interested in knowing. In our case Y is the charge and X is LOS. β0 is the intercept (where the line crosses the Y axis), and β1 is the slope of the line. This coefficient β1 is the marginal impact of X on Y. These are population parameters that we do not know. From our sample we estimate the following:

Y = b0 + b1X + e

Note I’ve switched to non-Greek letters since we are now dealing with the sample. So b0 is an estimator for β0 and b1 is an estimator for β1, and e is the error term, reflecting the fact that we will not always be exactly on our line.
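As a quick check on the mechanics, here is a minimal Python sketch (an illustration, not part of the course materials) that types in the 20 observations from the table above and computes the least-squares intercept and slope from the usual closed-form formulas. The estimates should come out close to the Excel results reported in the next section.

```python
import numpy as np

# LOS and cost for the 20 stays, transcribed by hand from the table above.
los = np.array([3, 5, 2, 3, 3, 5, 5, 3, 1, 2, 2, 5, 1, 3, 4, 2, 1, 1, 1, 6], dtype=float)
cost = np.array([2614, 4307, 2449, 2569, 1936, 7231, 5343, 4108, 1597, 4061,
                 1762, 4779, 2078, 4714, 3947, 2903, 1439, 820, 3309, 5476], dtype=float)

# Closed-form least-squares estimates:
#   b1 = sum((X - Xbar)(Y - Ybar)) / sum((X - Xbar)^2),   b0 = Ybar - b1 * Xbar
x_dev = los - los.mean()
y_dev = cost - cost.mean()
b1 = (x_dev * y_dev).sum() / (x_dev ** 2).sum()
b0 = cost.mean() - b1 * los.mean()
print(f"b0 = {b0:.1f}, b1 = {b1:.1f}")   # should be roughly 997.5 and 818.8
```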
If we wanted to get a predicted value for Y (costs) we would use:

Ŷi = b0 + b1Xi    (the ^ means predicted value)

Note the error term is gone, so this is the equation for the estimated line. Suppose that b0 = 3 and b1 = 2; then someone with a LOS of 5 days would be predicted to have 3 + 2*5 = $13 in charges, and so on.

Least squares finds b0 and b1 by minimizing the sum of the squared differences between the actual and predicted values of Y.

II. Specifics

Sum of squared differences = Σ (Yi − Ŷi)²    (summing over the n observations, i = 1, …, n)

Substituting:

Σ (Yi − Ŷi)² = Σ [Yi − (b0 + b1Xi)]²

Thus least squares finds b0 and b1 to minimize this expression. We are not going to go into the details here of how this is done, but we will focus on the intuition of what is going on. The easiest way to think about it is to go back to the scatter diagram: least squares draws the trend line to connect the dots the best way possible. We choose the parameters to minimize the size of our mistakes or errors.

III. How do we do this in Excel?

Excel can do both simple (one independent variable) and multiple (more than one) regression. You need the Analysis ToolPak to do it.

Load the Analysis ToolPak
1. Click the File tab, click Options, and then click the Add-Ins category.
2. In the Manage box, select Excel Add-ins and then click Go.
3. In the Add-Ins available box, select the Analysis ToolPak check box, and then click OK.

For the Mac you have to use a different add-in called StatPlus: http://www.analystsoft.com/en/products/statplusmacle/

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.807337
R Square             0.651794
Adjusted R Square    0.632449
Standard Error       995.4983
Observations         20

ANOVA
             df    SS          MS          F          Significance F
Regression    1    33390795    33390795    33.69347   1.69E-05
Residual     18    17838305    991016.9
Total        19    51229100

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   997.4659       465.7353         2.141701   0.046147   18.99164    1975.94
LOS         818.8394       141.0671         5.804607   1.69E-05   522.4681    1115.211

What does all this mess mean? Skip the first two sections for now and just look at the bottom part. The numbers under Coefficients are our coefficient estimates:

Costi = 997.5 + 818.8*LOSi + ei

So we would predict that a patient starts with a cost of 997.5 and each day adds 818.8 to the cost. The least squares estimate of the effect of LOS on cost is $818.8 per day. So someone with a LOS of 5 days is predicted to have 997.5 + 818.8*5 = $5,091.5 in costs. The statistics behind all this get pretty complicated, but the interpretation is easy. And note that we can now add as many variables as we want, and the coefficient estimate for each variable is calculated holding constant the other right-hand-side variables.

Science vs. Art in Regression
- Causation
- Omitted variable bias
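Before moving on to measures of variation, here is a rough Python equivalent of the Excel regression above, using the statsmodels library (one possible tool, not the one the course requires). It produces the same kind of summary output for the LOS data, again typed in from the table.

```python
import numpy as np
import statsmodels.api as sm

# Same LOS/cost data as before, transcribed from the table.
los = np.array([3, 5, 2, 3, 3, 5, 5, 3, 1, 2, 2, 5, 1, 3, 4, 2, 1, 1, 1, 6], dtype=float)
cost = np.array([2614, 4307, 2449, 2569, 1936, 7231, 5343, 4108, 1597, 4061,
                 1762, 4779, 2078, 4714, 3947, 2903, 1439, 820, 3309, 5476], dtype=float)

X = sm.add_constant(los)        # adds the intercept column, like Excel does automatically
model = sm.OLS(cost, X).fit()   # ordinary least squares
print(model.summary())          # coefficients, R-squared, ANOVA-style F test, etc.

# Predicted cost for a 5-day stay, using the fitted coefficients:
b0, b1 = model.params
print(b0 + b1 * 5)              # roughly 997.5 + 818.8 * 5 = 5091.5
```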
IV. Measures of Variation

We now want to talk about how well the model predicted the dependent variable. Is it a good model or not? This will then allow us to make inferences about our model.

The Sum of Squares

The total sum of squares (SST) is a measure of variation of the Y values around their mean. This is a measure of how much variation there is in our dependent variable.

SST = Σ (Yi − Ȳ)²

[Note that if we divided by n-1 we would get the sample variance for Y.]

The total sum of squares can be divided into two components: explained variation, or the Regression Sum of Squares (SSR) – that which is attributable to the relationship between X and Y – and unexplained variation, or Error Sum of Squares (SSE) – that which is attributable to factors other than the relationship between X and Y.

[Diagram: the data with the fitted line Ŷi = b0 + b1Xi and a horizontal line at the mean Ȳ. For one observation (Xi, Yi) the total deviation (Yi − Ȳ) is split into an explained piece (Ŷi − Ȳ) and an unexplained piece, the residual ei = (Yi − Ŷi).]

The dot represents one particular actual observation (Xi, Yi), the horizontal line represents the mean of Y (Ȳ), and the upward-sloping line represents the estimated regression line. The distance between Yi and Ȳ is the total variation (SST). This is broken into two parts: that explained by X and that not explained. The distance between the predicted value of Y and the mean of Y is the part of the variation that is explained by X. This is SSR. The distance between the predicted value of Y and the actual value of Y is the unexplained portion of the variation. This is SSE.

Suppose that X had no effect whatsoever on Y. Then the best regression line would simply be the mean of Y. So the predicted value of Y would always be the mean of Y no matter what X is, and X is doing nothing to help us explain Y. Then all the variation in Y will be unexplained. Suppose, alternatively, that the predicted value was exactly correct – the dot is on the regression line. Then notice that all the variation in Y is being explained by the variation in X. In other words, if you know X you know Y exactly.

As shown above, some of the variation in Y is due to variation in X ((Ŷi − Ȳ)²) and some of the variation is not explained by variation in X ((Yi − Ŷi)²).

So to get the SSR we calculate SSR = Σ (Ŷi − Ȳ)², and to get the SSE we calculate SSE = Σ (Yi − Ŷi)².

Referring back to our first regression output, notice the middle table looks as follows:

ANOVA
             df    SS          MS          F          Significance F
Regression    1    33390795    33390795    33.69347   1.69E-05
Residual     18    17838305    991016.9
Total        19    51229100

The third column is labeled SS (for sum of squares), and the first row is Regression, so SSR = 33390795. Residual is another word for error (or leftover), so SSE = 17838305, and SST = 51229100. Notice that 51229100 = 33390795 + 17838305.

How do we use this information? In general, the method of Least Squares chooses the coefficients to minimize SSE. So we want that to be as small as possible – or equivalently, we want SSR to be as big as possible. Notice that the closer SSR is to SST, the better our regression is doing. In a perfect world SSR = SST: our model explains ALL the variation in Y. So the ratio of SSR to SST tells us how our model is doing. This is known as the Coefficient of Determination, or R².

R² = SSR/SST. For our example: R² = 33390795/51229100 = .652

Is this good? It depends. 65% of the variation in charges can be explained by variation in LOS. Note that this is pretty low since there are many other things that determine charges.

Standard Error of the Estimate

Note that for just about any regression the data points will not all be exactly on the regression line. We want to be able to measure the variability of the actual Y around the predicted Y. This is similar to the standard deviation as a measure of variability around a mean, but here we’re looking at variation of actual values around a prediction. This is called the Standard Error of the Estimate:

SYX = √(SSE/(n-2)) = √(Σ (Yi − Ŷi)² / (n-2))

For our example: SYX = √(17838305/18) = 995.5

Note that the top table in the Excel output has the R-squared and the Standard Error listed, among other things. This is a measure of the variation around the fitted regression line – a loose interpretation would be that on average the data points are about $995 off of the regression line. We will use this in the next section to make inferences about our coefficients.
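To make the decomposition concrete, here is a short sketch (illustrative only) that computes SST, SSR, SSE, R², and the standard error of the estimate directly from their definitions for the LOS data; the totals should come out close to the ANOVA table above.

```python
import numpy as np

# LOS/cost data and the least-squares fit from the earlier sketches.
los = np.array([3, 5, 2, 3, 3, 5, 5, 3, 1, 2, 2, 5, 1, 3, 4, 2, 1, 1, 1, 6], dtype=float)
cost = np.array([2614, 4307, 2449, 2569, 1936, 7231, 5343, 4108, 1597, 4061,
                 1762, 4779, 2078, 4714, 3947, 2903, 1439, 820, 3309, 5476], dtype=float)
b1 = ((los - los.mean()) * (cost - cost.mean())).sum() / ((los - los.mean()) ** 2).sum()
b0 = cost.mean() - b1 * los.mean()
cost_hat = b0 + b1 * los                      # predicted values

sst = ((cost - cost.mean()) ** 2).sum()       # total variation in Y
ssr = ((cost_hat - cost.mean()) ** 2).sum()   # variation explained by X
sse = ((cost - cost_hat) ** 2).sum()          # unexplained variation
r2 = ssr / sst
s_yx = np.sqrt(sse / (len(cost) - 2))         # standard error of the estimate

print(sst, ssr, sse)                          # SSR + SSE should equal SST
print(r2, s_yx)                               # roughly 0.652 and 995.5
```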
V. Inference

We made our estimates above for the regression line based on our sample information. These are estimates of the (unknown) population parameters. In this section we want to make inferences about the population using our sample information.

t-test for the slope

Again, our estimate of β1 is b1. We can show that under certain assumptions (to come in a bit) b1 is an unbiased estimator for β1. But as discussed above there will still be some sampling error associated with this estimate. So we can’t conclude that β1 = b1 every time, only on average. Thus we need to take this sampling variability into account. Suppose we have the following null and alternative hypotheses:

H0: β1 = 0 (there is no relationship between X and Y)
H1: β1 ≠ 0 (there is a relationship)

This can also be one-tailed if you have some prior information to make it so. Our test statistic will be:

t = (b1 − β1)/Sb1

where Sb1 is the standard error of the coefficient:

Sb1 = SYX/√SSX, where SSX = Σ (Xi − X̄)²

This follows a t-distribution with n-2 degrees of freedom. [NOTE: in general this test has n-k-1 degrees of freedom, where k is the number of right-hand-side variables. In this case k = 1, so it is just n-2.] The standard error of the coefficient is the standard error of the estimate divided by the square root of the sum of squared deviations in X.

Again note the bottom part of the Excel output:

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   997.4659       465.7353         2.141701   0.046147   18.99164    1975.94
LOS         818.8394       141.0671         5.804607   1.69E-05   522.4681    1115.211

So our LOS coefficient is 818.8. Is this statistically different from zero? Our test statistic is t = (818.8 − 0)/141.1 = 5.8. We can use the reported p-value of .0000169 to conclude that we would reject the null hypothesis and say that there is evidence that β1 is not zero. That is, LOS has a significant effect on charges. The t-test can be used to test each individual coefficient for significance in a multiple regression framework. The logic is just the same.

One could also test other hypotheses. Suppose it used to be the case that each day in the hospital resulted in a $1000 charge; is there evidence that it has changed?

H0: β1 = 1000
Ha: β1 ≠ 1000

t = (818.8 − 1000)/141.06 = -1.28

The p-value associated with this is .215 – so there is a 21.5% chance we could get a coefficient of 818 or further away from 1000 if the null is true. Thus, we would fail to reject the null and conclude that there is no evidence that the slope is different from 1000.

We could also estimate a confidence interval for the slope:

b1 ± t(n-2) * Sb1

where t(n-2) is the appropriate critical value for t. You can get Excel to spit this out for you as well: just click the confidence interval box and type in the level of confidence, and it will include the upper and lower limits in the output. For my example we are 95% confident that the population parameter β1 is between 522 and 1115.
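The same t-test and confidence interval can be reproduced from the reported coefficient and standard error alone. Here is a sketch using scipy (one possible tool, not the course’s) for the H0: β1 = 1000 test and the 95% interval:

```python
from scipy import stats

# Reported values from the regression output: coefficient, standard error, sample size.
b1, se_b1, n = 818.8394, 141.0671, 20
df = n - 2

# Test H0: beta1 = 1000 against a two-sided alternative.
t_stat = (b1 - 1000) / se_b1
p_value = 2 * stats.t.sf(abs(t_stat), df)
print(t_stat, p_value)          # roughly -1.28 and 0.215

# 95% confidence interval for the slope: b1 +/- t_crit * se_b1
t_crit = stats.t.ppf(0.975, df)
print(b1 - t_crit * se_b1, b1 + t_crit * se_b1)   # roughly 522 and 1115
```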
You can also build a confidence interval for a prediction. Suppose a patient was going to stay 3 days in the hospital; what do you predict costs to be?

Point estimate: cost = 997 + 3*818 = $3,451

Or a 95% confidence interval for the expected cost would be:

18.99 + 3*(522) = 1586
1975 + 3*(1115) = 5321

So we are 95% confident that the cost would be between $1,586 and $5,321.

Multiple Regression

VI. Introduction

In the last section we looked at the simple linear regression model, where we have only one explanatory (or independent) variable. This can be easily expanded to a multivariate setting. Our model can be written as:

Yi = β0 + β1X1i + β2X2i + … + βkXki + εi

So we would have k explanatory variables. The interpretation of the β’s is the same as in the simple regression framework. For example, β1 is the marginal influence of X1 on the dependent variable Y, holding all the other explanatory variables constant. This is easy to do in Excel. It is similar to simple regression except that one needs to have all the X variables side by side in the spreadsheet. Inference about individual coefficients is exactly the same as in simple regression.

Suppose we have the following data for 10 hospitals:

Hospital   Cost (Y)   Size (X1)   CEO IQ (X2)
  1        2750        225          6
  2        2400        200         37
  3        2920        300         14
  4        1800        350         33
  5        3520        200         11
  6        2270        250         21
  7        3100        175         21
  8        1980        400         22
  9        2680        350         20
 10        2720        275         16

Cost is the cost per case for each hospital, Size is the size of the hospital in number of beds, and CEO IQ is a scale that measures how much the administrator knows about competitor hospitals. In this case we might expect larger hospitals to have lower costs per case, and when the administrator has more knowledge about his competition, costs will be lower as well.

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.834034
R Square             0.695612
Adjusted R Square    0.608644
Standard Error       323.9537
Observations         10

ANOVA
             df    SS         MS       F          Significance F
Regression    2    1678818    839409   7.998485   0.01556
Residual      7    734622     104946
Total         9    2413440

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   4240.131       435.6084         9.733813   2.56E-05   3210.081    5270.181
Size        -3.76232       1.442784         -2.60768   0.035032   -7.17395    -0.35068
CEO IQ      -29.8955       11.66298         -2.56328   0.037372   -57.4741    -2.31699

So in this case we’d say that each bed lowers the per-case cost of the hospital by $3.76, and every one unit increase in the CEO IQ scale lowers costs by $29.90. Note that these are not the same results we would get if we did two simple regressions.

If we only included Size:

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.640237
R Square             0.409904
Adjusted R Square    0.336142
Standard Error       421.9245
Observations         10

ANOVA
             df    SS          MS          F          Significance F
Regression    1    989277.9    989277.9    5.557108   0.046149
Residual      8    1424162     178020.3
Total         9    2413440

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   3804.717       522.4326         7.282694   8.53E-05   2599.984    5009.449
Size        -4.3696        1.853606         -2.35735   0.046149   -8.64403    -0.09518

While if we only included CEO IQ:

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.632394
R Square             0.399922
Adjusted R Square    0.324912
Standard Error       425.4781
Observations         10

ANOVA
             df    SS          MS          F          Significance F
Regression    1    965187.2    965187.2    5.331595   0.049765
Residual      8    1448253     181031.6
Total         9    2413440

            Coefficients   Standard Error   t Stat      P-value
Intercept   3315.282       332.1822         9.980311    8.61E-06
CEO IQ      -34.8896       15.11012         -2.30902    0.049765

Why the changes in the coefficients? Note that the R-squareds do not add up: the multiple regression R² is .695, and the two simple R²’s are .41 and .40.
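For those following along outside Excel, here is a minimal Python sketch (again just an illustration) that fits the same three models with statsmodels, assuming the ten rows are transcribed from the table above; the coefficients should come out close to the Excel output.

```python
import numpy as np
import statsmodels.api as sm

# Hospital data transcribed from the table above (Cost per case, Size in beds, CEO IQ scale).
cost = np.array([2750, 2400, 2920, 1800, 3520, 2270, 3100, 1980, 2680, 2720], dtype=float)
size = np.array([225, 200, 300, 350, 200, 250, 175, 400, 350, 275], dtype=float)
ceo_iq = np.array([6, 37, 14, 33, 11, 21, 21, 22, 20, 16], dtype=float)

# Multiple regression: both X variables side by side, plus an intercept.
X = sm.add_constant(np.column_stack([size, ceo_iq]))
both = sm.OLS(cost, X).fit()
print(both.params)      # intercept, Size, CEO IQ -- should be close to 4240.1, -3.76, -29.9
print(both.rsquared)    # should be close to 0.696

# The two simple regressions, one variable at a time, for comparison with the text.
size_only = sm.OLS(cost, sm.add_constant(size)).fit()
iq_only = sm.OLS(cost, sm.add_constant(ceo_iq)).fit()
print(size_only.params, iq_only.params)
```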
VII. Testing for the Significance of the Multiple Regression Model

F-test

Another general summary measure for the regression model is the F-test for overall significance. This tests whether or not any of our explanatory variables are important determinants of the dependent variable. It is a type of ANOVA test. Our null and alternative hypotheses are:

H0: β1 = β2 = … = βk = 0 (none of the variables are significant)
H1: At least one βj ≠ 0

Here the F statistic is:

F = (SSR/k) / (SSE/(n-k-1))

Notice that this statistic is the ratio of the regression sum of squares to the error sum of squares (each divided by its degrees of freedom). If our regression is doing a lot towards explaining the variation in Y, then SSR will be large relative to SSE and this will be a “big” number. Whereas if the variables are not doing much to explain Y, then SSR will be small relative to SSE and this will be a “small” number. This ratio follows the F distribution with k and n-k-1 degrees of freedom. The middle portion of the Excel output contains this information (this is the model with Size and CEO IQ):

ANOVA
             df    SS         MS       F          Significance F
Regression    2    1678818    839409   7.998485   0.01556
Residual      7    734622     104946
Total         9    2413440

F = (1678818/2)/(734622/(10-2-1)) = 839409/104946 = 7.998

The “Significance F” is the p-value. So we’d reject H0 and conclude there is evidence that at least one of the explanatory variables is contributing to the model. Note that this is a pretty weak test: it could be only one of the variables or it could be all of them that matter, or something in between. It just tells us that something in our model matters.

VIII. Dummy Variables in Regression

Up to this point we’ve assumed that all the explanatory variables are numerical. But suppose we think that, say, earnings might differ between males and females. How would we incorporate this into our regression? The simplest way to do this is to assume that the only difference between men and women is in the intercept (that is, the coefficients on all the other variables are equal for men and women).

[Diagram: wage (Y axis) against education (X axis), with two parallel upward-sloping lines, one for men and one for women; the male line has the higher intercept (αmale), the female line the lower intercept (αfemale).]

Assume for now the only other variable that matters is education. The idea is that we think men make more than women independent of education. That is, the male intercept (αmale) is greater than the female intercept (αfemale). We can incorporate this into our regression by creating a dummy variable for gender. Suppose we let the variable Male = 1 if the individual is a male, and 0 otherwise. Then our equation becomes:

Wagei = β0 + β1*Educationi + β2*Malei + εi

So if the individual is male the variable Male is “on” and if she is female Male is “off”. The coefficient β2 indicates how much higher the wage of males is than that of females (this can be positive or negative in theory). In terms of our graph, αfemale = β0 and αmale = β0 + β2. So the dummy variable indicates how much the intercept shifts up or down for that group.

This can be done for more than two categories. Suppose we think that earnings also differ by race; then we can write:

Wagei = β0 + β1*Educationi + β2*Malei + β3*Blacki + β4*Asiani + β5*Otheri + εi

where Black is a dummy variable equal to 1 if the individual is black, Asian = 1 if the individual is Asian, and Other = 1 for other nonwhite races. Note that white is omitted from this group, just like female is omitted. Thus the coefficients β3, β4, and β5 indicate how wages differ for blacks, Asians, and others, relative to whites. Note that if there are x different categories, we include x-1 dummy variables in our model. The omitted group is always the comparison.
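Most statistical packages can build these dummy columns for you. As a rough illustration (not part of the Excel workflow in these notes), here is how one might code the gender and race dummies in Python with pandas; the data and column names are made up for the example.

```python
import pandas as pd

# Hypothetical worker-level data, purely for illustration.
df = pd.DataFrame({
    "education": [12, 16, 14, 12, 18],
    "gender":    ["male", "female", "female", "male", "male"],
    "race":      ["white", "black", "asian", "white", "other"],
})

# Male = 1 if male, 0 otherwise.
df["Male"] = (df["gender"] == "male").astype(int)

# For a variable with x categories, keep x - 1 dummies; white gets no dummy,
# so it is the omitted comparison group.
df["Black"] = (df["race"] == "black").astype(int)
df["Asian"] = (df["race"] == "asian").astype(int)
df["Other"] = (df["race"] == "other").astype(int)

# pd.get_dummies(df["race"], drop_first=True) does something similar automatically,
# but it drops whichever category comes first rather than letting you pick.
print(df)
```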
Suppose we estimate this and get:

Ŵage = -20 + 3.12*School + 4*Male − 5*Black − 3*Asian − 3*Other

So males make $4/hour more than females. Blacks make $5 less than whites. Asians make $3 less than whites. Other races make $3 less than whites. Note that this is controlling for differences in other characteristics. That is, a male with the same level of schooling, experience, and race will earn $4 more than the identical female. Similarly for race.

IX. Interaction Effects

Suppose we’re interested in explaining total costs and we think LOS and gender are the explanatory variables. We could estimate our model and get something like:

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.961293
R Square             0.924085
Adjusted R Square    0.918011
Standard Error       619.0914
Observations         28

ANOVA
             df    SS          MS          F          Significance F
Regression    2    1.17E+08    58317790    152.1569   1.01E-14
Residual     25    9581854     383274.2
Total        27    1.26E+08

            Coefficients   Standard Error   t Stat     P-value
Intercept   638.4373       257.5843         2.478557   0.020295
LOS         1113.62        65.47048         17.0095    2.97E-15
Female      -244.657       246.0593         -0.9943    0.329603

Interpret this.

Now what if we think the effect of LOS on cost is different for males than for females? How might we deal with this? Note that the idea is that not only is there an intercept difference, but there is a slope difference as well. To get at this we can interact LOS and Female – that is, create a new variable that multiplies the two together. Then we would get something like the following:

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.998708
R Square             0.997417
Adjusted R Square    0.997095
Standard Error       116.5416
Observations         28

ANOVA
             df    SS          MS          F          Significance F
Regression    3    1.26E+08    41963822    3089.677   3.54E-31
Residual     24    325966.7    13581.95
Total        27    1.26E+08

             Coefficients   Standard Error   t Stat     P-value
Intercept    2075.093       73.34754         28.29125   6.06E-20
LOS          606.5649       23.00362         26.36823   3.12E-19
Female       -2472.39       97.09693         -25.4631   7.01E-19
LOS*Female   711.2028       27.24366         26.10526   3.93E-19

Note that the adjusted R² increases, which suggests that adding this new variable is “worth it”. How do we interpret it? Females have costs that are $2,472 lower than males, holding constant LOS. A one unit increase in LOS increases costs for males by $606, while the effect for females is $711 LARGER. That is, each day in the hospital increases costs for females by 606.5 + 711.2 = $1,317.7. So females start at a lower point, but increase faster with LOS than do males.
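The 28 observations behind these outputs are not listed in the notes, so the sketch below uses made-up data simply to show how an interaction column is built and how the two slopes are read off; the statsmodels formula interface and all variable names here are illustrative assumptions, not the source of the output above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate a small hypothetical dataset with a known intercept shift and slope shift
# for females, just to demonstrate the mechanics of an interaction term.
rng = np.random.default_rng(0)
n = 28
df = pd.DataFrame({
    "LOS": rng.integers(1, 8, size=n),
    "Female": rng.integers(0, 2, size=n),   # dummy: 1 = female, 0 = male
})
df["Cost"] = (2000 + 600 * df["LOS"] - 2500 * df["Female"]
              + 700 * df["LOS"] * df["Female"] + rng.normal(0, 100, size=n))

# "LOS:Female" creates the LOS*Female interaction column automatically.
model = smf.ols("Cost ~ LOS + Female + LOS:Female", data=df).fit()
print(model.params)

# Slope of cost in LOS for males = the LOS coefficient;
# for females it is the LOS coefficient plus the LOS:Female coefficient.
```

The last comment is the same reading as in the interpretation above: the male slope is the LOS coefficient, and the female slope is the LOS coefficient plus the interaction coefficient.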