Lecture 15: Regression specification, part II
BUEC 333
Professor David Jacks

Specification error

Specification error is a violation of Assumption 1 of the CLRM. Mis-specification occurs with the wrong choice of:
1.) independent variables
2.) functional form
3.) error distribution

We have already considered how to choose the “right” independent variables.

What is functional form and why does it matter?

As always, our regression model is:

Yi = E[Yi | X1i, X2i, ..., Xki] + εi

Given that X1i, X2i, ..., Xki will be included in the model, we need to decide on a shape for the regression function, E[Yi | X1i, X2i, ..., Xki]. Do we think the relationship between Xji and Yi is:
1.) a straight line?
2.) a curve?
3.) non-monotonic?

Should the slope be the same for every observation, or are there distinct groups that have separate slopes (e.g., men/women, before/after)? Likewise, should the intercept be the same for all observations, or are there distinct groups of observations that have a separate intercept?

A functional form is a mathematical specification of the regression function E[Yi | X1i, X2i, ..., Xki] that we choose in response to these questions.

The point is that different functional forms may give very different answers about the marginal effects of X on Y, and very different predictions. Thus, correct inference requires us to get the functional form right.

First things first: the constant in OLS regressions

Should a model include an intercept? The short answer: yes, always.

The long answer: it is possible that theory tells you that the regression function should pass through the origin; that is, theory tells you that when all the X’s are zero, then Y is zero as well. Even then, it is better to include the intercept.

And why is this better? Theory could be wrong. If we include the intercept and the true intercept (of the DGP) turns out to be zero, the (unbiased) OLS estimate of the intercept should tend to zero.
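A small simulation can illustrate why including the constant is the safe choice, both for the case just described and for the opposite case where the constant is wrongly left out. This is only a sketch under stated assumptions: the data are simulated, the true slope of 2, the intercept of 5, the sample size and the seed are all made up for illustration, and the two helper functions are hypothetical names rather than anything from the course software.

```python
import random

def ols_with_intercept(x, y):
    """Simple OLS of y on x with a constant; returns (b0, b1)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

def ols_through_origin(x, y):
    """OLS of y on x with the constant suppressed; returns the slope only."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

rng = random.Random(333)                       # fixed seed so the run is repeatable
x = [rng.uniform(0, 10) for _ in range(500)]   # simulated regressor (assumption)
e = [rng.gauss(0, 1) for _ in range(500)]      # simulated errors (assumption)

# Case 1: the true intercept is 0 and the true slope is 2; we include a constant anyway.
y_zero = [2 * xi + ei for xi, ei in zip(x, e)]
b0_case1, b1_case1 = ols_with_intercept(x, y_zero)

# Case 2: the true intercept is 5, estimated both without and with a constant.
y_five = [5 + 2 * xi + ei for xi, ei in zip(x, e)]
b1_case2_origin = ols_through_origin(x, y_five)
b0_case2, b1_case2 = ols_with_intercept(x, y_five)

print(f"case 1, with constant: b0 = {b0_case1:.3f}, b1 = {b1_case1:.3f}")
print(f"case 2, no constant:   b1 = {b1_case2_origin:.3f}")
print(f"case 2, with constant: b0 = {b0_case2:.3f}, b1 = {b1_case2:.3f}")
```

In this setup the unnecessary constant in case 1 costs very little, while suppressing a constant that belongs in the model (case 2) pushes the slope estimate well away from 2.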
But if we leave out the intercept and the true intercept (of the DGP) turns out not to be zero, we can seriously bias our estimates of the slopes.

An example: estimating a cost function

Ci = β0 + β1Qi + εi

If Q = 0, then we often think C should be zero. This makes sense for variable costs: if GM produces no cars, it needs no workers on the assembly line.

It might also help to think about what an estimate of β0 typically includes:
1.) the true β0
2.) the (constant) impact of any specification error
3.) the sample mean of the residuals (if not equal to 0)

Ideally, we want to purge our results of this garbage.

Now, why would anyone ever run a regression without a constant? As it turns out, running OLS without a constant serves to artificially inflate both the value of the F-statistic and the R² of the regression. A regression without a constant folds the effect of the intercept into the TSS, so TSS increases; but ESS increases by more than RSS and, as R² = ESS/TSS, R² rises.

Functional form

Having settled the question of whether to include an intercept, we now turn our attention to functional form. The most common functional forms we will encounter are the:
1.) linear: Y = β0 + β1X1 + ε
2.) polynomial: Y = β0 + β1X1 + β2X1² + ε
3.) log-log: log(Y) = β0 + β1log(X1) + ε
4.) semi-log: Y = β0 + β1log(X1) + ε, or log(Y) = β0 + β1X1 + ε

The simplest functional form arises when the independent variables enter linearly:

Yi = β0 + β1X1i + εi

Remember that linearity can refer to two things: linearity in variables and linearity in coefficients.
The linear functional form

Examples of non-linearity in variables:
Y = β0 + β1X1 + β2X1² + ε
Y = β0 + β1log(X1) + ε
Y = β0 + β1(1/X1) + ε

For our purposes, non-linearity in variables is OK. On the other hand, linearity in coefficients is essential and occurs when the betas enter in the most straightforward fashion. Thus, the betas cannot be:
1.) raised to any power except one
2.) multiplied or divided by other coefficients

Why would we choose the linear form?

1.) If theory or intuition tells us that the marginal effect of X on Y is a constant (that is, the same at every level of X):

$\frac{\partial Y_i}{\partial X_{1i}} = \beta_1$

2.) If theory or intuition tells us that the elasticity of Y with respect to X is not a constant (that is, it is not the same at every level of X):

$\eta_{Y,X_1} = \frac{\partial Y_i / Y_i}{\partial X_{1i} / X_{1i}} = \frac{\partial Y_i}{\partial X_{1i}} \cdot \frac{X_{1i}}{Y_i} = \beta_1 \frac{X_{1i}}{Y_i}$

3.) If you do not know what else to do

The polynomial functional form

A flexible alternative to the linear functional form is a polynomial where one or more independent variables are raised to powers other than one:

Yi = β0 + β1X1i + β2(X1i)² + β3(X1i)³ + β4X2i + εi

Why would we choose a polynomial form? If theory or intuition tells us that the marginal effect of X on Y is not a constant:

$\frac{\partial Y_i}{\partial X_{1i}} = \beta_1 + 2\beta_2 X_{1i} + 3\beta_3 X_{1i}^2$

This also implies that the elasticity is not constant:

$\eta_{Y,X_1} = \frac{\partial Y_i}{\partial X_{1i}} \cdot \frac{X_{1i}}{Y_i} = \left(\beta_1 + 2\beta_2 X_{1i} + 3\beta_3 X_{1i}^2\right) \frac{X_{1i}}{Y_i}$

This approach is a useful supplement to the linear form, precisely because of its flexibility.

A brief explanation of logs

If y to the “b-th power” produces x, then b is the logarithm of x (with base y):

b = log(x) if y^b = x

Thus, a logarithm (or a log) is the exponent to which a given base must be raised in order to produce a specific number.
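The definition above can be checked numerically. The following is a minimal sketch in Python; `log_base` is a hypothetical helper name chosen for illustration, and the numbers 10 and 1000 are arbitrary.

```python
import math

def log_base(x, base):
    """Return b such that base ** b equals x (change-of-base formula)."""
    return math.log(x) / math.log(base)

b = log_base(1000, 10)   # the exponent that turns the base 10 into 1000
recovered = 10 ** b      # raising the base to b gives back x, up to rounding

print(b, recovered)
```

Raising the base to the computed exponent returns the original number, which is exactly what the definition says.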
While logs come in more than one variety, we will use only natural logs (logs to the base e):

e² = 2.71828² = 7.389 → ln(7.389) = 2
ln(100) = 4.605
ln(1,000) = 6.908
ln(10,000) = 9.210
ln(100,000) = 11.513
ln(1,000,000) = 13.816

Distinct advantages of using logs:
1.) makes it easy to figure out impact in % terms

The log-log functional form

One of the most common specifications:

ln(Yi) = β0 + β1ln(X1i) + β2ln(X2i) + εi

Why would we choose this form? If theory or intuition tells us that the marginal effect of X on Y is not a constant (that is, the regression function is, in fact, a curve):

$\frac{\partial Y_i}{\partial X_{1i}} = \frac{\partial Y_i}{\partial \ln Y_i} \cdot \frac{\partial \ln Y_i}{\partial \ln X_{1i}} \cdot \frac{\partial \ln X_{1i}}{\partial X_{1i}} = \beta_1 \frac{Y_i}{X_{1i}}$

But this implies that the elasticity of Y with respect to X is constant (the main reason for using this form):

$\eta_{Y,X_1} = \frac{\partial Y_i}{\partial X_{1i}} \cdot \frac{X_{1i}}{Y_i} = \beta_1 \frac{Y_i}{X_{1i}} \cdot \frac{X_{1i}}{Y_i} = \beta_1$

Here, the betas directly measure elasticities: that is, the % change in Y for a 1% change in X (holding all else constant), implying a non-linear but smooth relationship between X and Y.

The semi-log functional form

This one comes in two flavors:
(1) ln(Yi) = β0 + β1X1i + β2X2i + εi
(2) Yi = β0 + β1ln(X1i) + β2X2i + εi

That is, some but not all of the variables are in natural logarithms. Economists use this kind of functional form often; a common application is when the variable being logged has a very skewed distribution.

[Figure: density plots of usd_salary and ln_salary]

Here, neither the marginal effect nor the elasticity is constant. But the coefficients in model 1 do have a very useful interpretation: β1 (multiplied by 100) measures the approximate percentage change in Yi for a one-unit change in X1i.
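The percentage reading of β1 in model 1 is an approximation that works best when the coefficient is small. A hedged sketch of the comparison follows: the coefficient values 0.08 and 0.50 are made up for illustration, `pct_effects` is a hypothetical helper name, and the exact figure uses the fact that a one-unit increase in X multiplies Y by exp(β1) in a model for ln(Y).

```python
import math

def pct_effects(beta1):
    """Return (approximate, exact) % change in Y for a one-unit change in X
    when the model is ln(Y) = b0 + beta1 * X + e."""
    approx = 100 * beta1                    # the usual slide-level reading
    exact = 100 * (math.exp(beta1) - 1)     # implied by Y being multiplied by exp(beta1)
    return approx, exact

small = pct_effects(0.08)   # hypothetical small coefficient
large = pct_effects(0.50)   # hypothetical large coefficient

print(f"beta1 = 0.08: approx {small[0]:.2f}%, exact {small[1]:.2f}%")
print(f"beta1 = 0.50: approx {large[0]:.2f}%, exact {large[1]:.2f}%")
```

For the small coefficient the two readings are close; for the large one they differ noticeably, so the shortcut is best reserved for modest coefficients.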
Example: in model 1, if Y is a person’s salary and X1 is years of education, then β1 (times 100) measures the % change in salary for one additional year of education.

Dummy variables

Another specification issue we are concerned with is whether different groups of observations have different slopes and/or intercepts.

We have seen dummy variables before (“salary_nomiss” in the NHL; “I” in gravity). These simply indicate the presence or absence of a characteristic and, thus, take the value of 0 or 1.

Example: Di = 1 if person i is male, and Di = 0 otherwise

Intercept dummies

The most common use of dummy variables is to allow different intercepts for different groups.

Example: Wi = β0 + β1Xi + β2Di + εi
where
Wi = person i’s hourly wage
Xi = person i’s education (years)
Di = 1 if person i is male
Di = 0 if person i is not male

For males, Wi = β0 + β1Xi + β2 + εi (intercept is β0 + β2)
For non-males, Wi = β0 + β1Xi + εi (intercept is β0)

The marginal effect of education is the same for both groups but, for a given level of education, the average wage of males and non-males differs by β2 dollars.

The dummy variable trap

Notice that in the previous example, we did not include a second dummy variable for being female:

Wi = β0 + β1Xi + β2Di + β3Fi + εi
where
Fi = 1 if person i is female
Fi = 0 if person i is not female (is male)

Why? This violates Assumption 6 of the CLRM (no perfect collinearity), as Fi = 1 - Di; that is, Fi is an exact linear function of Di and the constant.

What about more than two categories?

Using dummy variables to indicate the absence/presence of conditions with more than two categories is no problem: create more dummies.

Example: dummies for position in hockey; POSITION takes one of four values (L, C, R, D). We know that SALARY and GOALS are positively correlated, a result that holds up in OLS results.

In this case, we could create a set of position dummies:
L = 1 if POSITION = L, and 0 otherwise
C = 1 if POSITION = C, and 0 otherwise
R = 1 if POSITION = R, and 0 otherwise

Our omitted category is POSITION = D.
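The construction of the position dummies, and the reason one category is left out, can be sketched in a few lines of Python. Everything here is illustrative: the list of players is made up, and `position_dummies` is a hypothetical helper rather than a function from any statistics package.

```python
CATEGORIES = ("L", "C", "R", "D")

def position_dummies(position, omitted="D"):
    """Map one POSITION value to 0/1 dummies, skipping the omitted category."""
    return {c: int(position == c) for c in CATEGORIES if c != omitted}

positions = ["L", "C", "R", "D", "C", "D"]   # hypothetical sample of players

# Three dummies, with D as the omitted (reference) category.
rows = [position_dummies(p) for p in positions]

# Keeping all four dummies: every row sums to 1, which duplicates the constant.
full_rows = [position_dummies(p, omitted=None) for p in positions]
trap = all(sum(r.values()) == 1 for r in full_rows)

for p, r in zip(positions, rows):
    print(p, r)
print("all four dummies always sum to 1:", trap)
```

A defenceman (D) gets zeros on all three dummies, so D serves as the reference group. With all four dummies included, their sum equals 1 for every observation, which is the dummy variable trap in its multi-category form.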
Our regression could then be:

SALARYi = β0 + β1Li + β2Ci + β3Ri + … + εi

We can also use dummy variables to allow the slope of the regression to vary across observations.

Example: suppose we think the returns to education (the marginal effect of another year of education on wages) are different for males than for non-males, but the intercepts are the same. We could estimate the following regression model:

Wi = β0 + β1Xi + β2XiDi + εi

For males (Di = 1), the regression model is
Wi = β0 + (β1 + β2)Xi + εi

For non-males (Di = 0), the regression model is
Wi = β0 + β1Xi + εi

Likewise, we can consider regressions where there are both different slopes (i.e., interaction terms are included) as well as different intercepts. In such a model, for example Wi = β0 + β1Xi + β2Di + β3XiDi + εi, the average wage of males and non-males is different by β2 dollars when Xi = 0.

Choosing the wrong functional form

As always, you should choose a functional form based on theory and intuition. Consequently, you should avoid choosing functional form based on model fit (R²). For one thing, you cannot compare R² across different functional forms for the dependent variable; for example, Y versus ln(Y).
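One way to see why R² cannot be compared across Y and ln(Y) is to notice that the two regressions are explaining different totals. The sketch below uses made-up data (an exponential trend with an alternating 10% bump, chosen only so the example is deterministic) and hypothetical helper names; it is an illustration of the point, not a test for choosing between forms.

```python
import math

def fit_line(x, y):
    """Simple OLS of y on x with a constant; returns (b0, b1)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b1 * x_bar, b1

def tss_and_r2(x, y):
    """Return the total sum of squares of y and the R-squared of y on x."""
    b0, b1 = fit_line(x, y)
    y_bar = sum(y) / len(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return tss, 1 - rss / tss

x = list(range(1, 21))
# Hypothetical data: exponential growth, alternately 10% above and below trend.
y = [math.exp(0.2 * xi) * (1.1 if xi % 2 == 0 else 0.9) for xi in x]
ln_y = [math.log(yi) for yi in y]

tss_level, r2_level = tss_and_r2(x, y)     # dependent variable: Y
tss_log, r2_log = tss_and_r2(x, ln_y)      # dependent variable: ln(Y)

print(f"Y:     TSS = {tss_level:.1f}, R2 = {r2_level:.3f}")
print(f"ln(Y): TSS = {tss_log:.1f}, R2 = {r2_log:.3f}")
```

The first R² is a share of the variation in Y and the second is a share of the variation in ln(Y). Since the two TSS values measure different things, a higher R² in one regression says nothing about which functional form is correct.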