Lecture 9: Goodness of fit
BUEC 333
Professor David Jacks

More than once, we have said that the goal of regression analysis is to explain variation in the dependent variable ($Y_i$) on the basis of variation in the independent variables ($X_{1i}, X_{2i}, \ldots, X_{ki}$). But what exactly does this mean? And how do we know whether we (or our regression) are doing a good job? Today's topic revolves around the idea of goodness of fit.

Explaining variation in Y

When we talk about the variation in $Y_i$ to be explained, we are implicitly talking about how $Y_i$ varies around its mean. Ultimately, our interest is in deviations of $Y_i$ from its population mean $\mu_Y$. Of course, we do not know $\mu_Y$, so in practice we work with deviations from the sample mean $\bar{Y}$.

The total sum of squares

However, we always have
$$\sum_{i=1}^{n} (Y_i - \bar{Y}) = 0 \quad \text{because} \quad \sum_{i=1}^{n} Y_i = n\bar{Y} = \sum_{i=1}^{n} \bar{Y}.$$
So, trying to explain the total of these deviations is pretty useless. Taking a cue from OLS, it might make sense then to square these deviations before summing them.

We will focus on what is (usually) called the Total Sum of Squares (TSS):
$$TSS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2.$$
This should look familiar to you: think "variance". Generally, TSS is not zero unless there is no variation in $Y_i$ at all. When TSS is large, there is lots of variation in $Y_i$ around its mean.

We can always write:
$$Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i).$$

The decomposition of variance

Thus,
$$TSS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} \left[ (\hat{Y}_i - \bar{Y}) + e_i \right]^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} e_i^2,$$
since the cross-product term vanishes under OLS. That is,
$$TSS = ESS + RSS.$$
This is a very convenient expression, as it decomposes TSS into two components: the Explained Sum of Squares (ESS) and the Residual Sum of Squares (RSS).

When we build a regression model, we want to know how well it "fits" the data; that is, does our model do a good job of explaining variation in $Y_i$? This suggests why our previous decomposition is so useful: it gives us two parts, that which is explained and that which is unexplained. Thus, we use it to measure the proportion of the variation in $Y_i$ that the model explains.

Explained variance

The proportion of variation in $Y_i$ around its mean that is explained by the regression model is $R^2$:
$$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum_i e_i^2}{\sum_i (Y_i - \bar{Y})^2}.$$
Therefore, $R^2$ is a summary statistic of the variation explained; it is bounded between 0 and 1 and is known as the coefficient of determination.

If ESS = 0, then $R^2 = 0$ and we have explained none of the variation with our regression model.

Using R2 to assess model fit

If ESS = TSS, then $R^2 = 1$ and we have explained all of the variation with our regression model. Typically, we do not encounter either of these extremes in the data. Generally, larger values are better in the sense that our model does a better job of predicting $Y_i$. So how big should $R^2$ be to inspire confidence in our model? The answer is context specific.

There is a somewhat natural temptation to build a model (i.e., choose your independent variables) to maximize $R^2$. Avoid this temptation! If you add another independent variable, $R^2$ never decreases, even if the new variable has no real relationship with the dependent variable. Why? Adding more variables will not change TSS, and it can either leave RSS unchanged or lower it.

More about R2

Ultimately, we are looking for a set of independent variables that have economic as well as statistical significance. Another reason to avoid maximizing $R^2$: there is an associated loss of degrees of freedom. The degrees of freedom are defined as the number of observations ($n$) minus the number of parameters estimated ($k + 1$, including the intercept), that is, $n - k - 1$.
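To make the decomposition and $R^2$ concrete before moving on, here is a minimal sketch in Python (not from the lecture; it assumes numpy and uses simulated data with hypothetical variable names) that fits a bivariate regression by OLS and verifies that TSS = ESS + RSS, computing $R^2$ both ways:

```python
import numpy as np

# A minimal sketch (simulated data, hypothetical names) of the
# TSS = ESS + RSS decomposition for a bivariate OLS regression.
rng = np.random.default_rng(333)
n = 100
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)     # true line plus noise

X = np.column_stack([np.ones(n), x])       # regressor matrix with intercept
b, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS coefficient estimates
y_hat = X @ b                              # fitted values
e = y - y_hat                              # residuals

TSS = np.sum((y - y.mean()) ** 2)          # total variation around the mean
ESS = np.sum((y_hat - y.mean()) ** 2)      # variation explained by the model
RSS = np.sum(e ** 2)                       # unexplained (residual) variation

# With an intercept in the model, the two sides match up to rounding.
print(f"TSS = {TSS:.3f}, ESS + RSS = {ESS + RSS:.3f}")
print(f"R^2: ESS/TSS = {ESS / TSS:.3f}, 1 - RSS/TSS = {1 - RSS / TSS:.3f}")
```

Note that the check relies on the intercept being included: it is the OLS normal equations that make the cross-product term vanish.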
Motivating adjusted R2

When we add independent variables to the model, we lose degrees of freedom, and our parameter estimates are less precise. So if we add extra variables to the model, we need to trade off a better fit (in terms of $R^2$) against having a concise model. Adjusted $R^2$ takes this trade-off into account by measuring the share of $Y_i$'s variation explained by the model, adjusted for the degrees of freedom it uses.

With that in mind, adjusted $R^2$ is defined as:
$$\bar{R}^2 = 1 - \frac{\sum_i e_i^2 / (n - k - 1)}{\sum_i (Y_i - \bar{Y})^2 / (n - 1)} = 1 - \frac{RSS}{TSS} \cdot \frac{n - 1}{n - k - 1} = 1 - (1 - R^2) \cdot \frac{n - 1}{n - k - 1}.$$

Adjusted R2

An example: our model of SALARY as a function of POINTS from Lecture 8. Naturally, we think performance will be linked to pay, a result borne out in the estimates. But what happens when we attempt to maximize $R^2$ by adding extraneous independent variables? Naturally, we expect that $R^2$ might increase.

Thus, adjusted $R^2$ (or R-bar-squared) penalizes for having lots of independent variables (i.e., few degrees of freedom). It can increase, decrease, or stay the same when we add an extra regressor to the model.

Like $R^2$, adjusted $R^2$ is less than one, but it is not necessarily positive (i.e., if $R^2$ is very close to zero to begin with, adjusted $R^2$ can be negative). Conveniently, it can be used to compare the fits of regressions with the same dependent variable and different numbers of independent variables. But it is not "the final word": we must also assess whether our independent variables make economic as well as statistical sense.
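To illustrate the penalty, here is a minimal sketch (again assuming numpy; the r2_and_adj_r2 helper and the simulated stand-ins for SALARY and POINTS are hypothetical, not the Lecture 8 data) that adds pure-noise regressors to a simple model and compares $R^2$ with $\bar{R}^2$:

```python
import numpy as np

def r2_and_adj_r2(X, y):
    """Return (R^2, adjusted R^2) for an OLS fit of y on X, where X
    holds the k independent variables (the intercept is added here)."""
    n, k = X.shape
    Xd = np.column_stack([np.ones(n), X])        # prepend intercept column
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    rss = np.sum((y - Xd @ b) ** 2)              # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)            # total sum of squares
    r2 = 1 - rss / tss
    adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return r2, adj

rng = np.random.default_rng(333)
n = 50
points = rng.normal(size=(n, 1))                 # stand-in for POINTS
salary = 1.0 + 0.8 * points[:, 0] + rng.normal(size=n)  # stand-in for SALARY
noise = rng.normal(size=(n, 5))                  # extraneous regressors

print(r2_and_adj_r2(points, salary))                        # baseline model
print(r2_and_adj_r2(np.column_stack([points, noise]), salary))
# R^2 never falls when regressors are added; adjusted R^2 usually falls here
# because the noise columns burn degrees of freedom without explaining much.
```

Running a comparison like this typically shows $R^2$ creeping up while $\bar{R}^2$ declines, which is exactly the penalty for few degrees of freedom described above.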