Econ107 Applied Econometrics

Topic 6: Specification: Choosing a Functional Form (Studenmund, Chapter 7)

I. Should a Constant Term be Included?

The constant term is there to provide flexibility in the shape (position) of the regression line. Suppose the correct regression model is:

$$\ln W_i = \beta_0 + \beta_1 S_i + \varepsilon_i$$

with $\beta_0 \neq 0$, but we estimate the following model:

$$\ln W_i = \beta_1 S_i + \varepsilon_i$$

The effect of suppressing $\beta_0$ can be seen from the graph:

[Figure: lnW plotted against S]

The consequence of suppressing the constant is that the slope coefficient estimates are biased. Also, under the 'false' model:

$$Var(\hat{\beta}_1^*) = \frac{\sigma^2}{\sum_{i=1}^{n} S_i^2}$$

Under the true model:

$$Var(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^{n} s_i^2}$$

Since

$$\sum_{i=1}^{n} s_i^2 = \sum_{i=1}^{n} (S_i - \bar{S})^2 = \sum_{i=1}^{n} S_i^2 - n\bar{S}^2 \le \sum_{i=1}^{n} S_i^2,$$

the variance under the false model is too small, so the t-ratio is inflated.

Include the constant term if the data are not in the neighborhood of the origin. Unless you have a strong reason, do not suppress the constant term. Although the constant term is important from the specification viewpoint, it should NOT be relied on for purposes of interpretation and analysis.

II. Functional Forms

The Log-Log Regression Model

Consider the following 'exponential' regression model:

$$Y_i = \alpha X_i^{\beta_1} e^{\varepsilon_i}$$

which we can express as a linear (in logs) regression model by taking natural logarithms of both sides:

$$\ln Y_i = \beta_0 + \beta_1 \ln X_i + \varepsilon_i$$

where 'ln' denotes the natural log, 'e' is the natural number (i.e., e = 2.71828) and $\beta_0 = \ln \alpha$.

The model is linear in the logarithms, even though it was originally nonlinear in terms of both the variables and the parameters. It is also referred to as a Double-Log or Log-Log model. If the classical assumptions are fulfilled, then we can estimate the parameters using OLS by letting:

$$Y_i^* = \beta_0 + \beta_1 X_i^* + \varepsilon_i$$

where $Y_i^* = \ln Y_i$ and $X_i^* = \ln X_i$. The estimates are BLUE.

This is a useful specification for a regression model, because the slope coefficient can be interpreted as an 'elasticity'. Using calculus:

$$\frac{dY/Y}{dX/X} = \frac{dY}{dX}\,\frac{X}{Y} = \frac{\%\Delta Y}{\%\Delta X} = \beta_1$$

The assumption is that the elasticity is constant.

A numerical example: a coffee demand function.
$$\widehat{\ln Y_t} = \underset{(.0152)}{-0.7774} - \underset{(.0494)}{0.2530}\,\ln X_t \qquad R^2 = 0.7448$$

where
Y_t = coffee consumption in cups per day.
X_t = coffee price per pound.

The price elasticity is -0.253, implying that a 1% increase in the price of coffee decreases the quantity of coffee demanded (as measured by cups consumed each day) by 0.253%.

Note also that the coefficients of determination of two regressions with different dependent variables cannot be compared. For example, here the $R^2$ is 0.7448. Suppose we estimated the regression without the logs (i.e., we regressed cups of coffee against the per-pound price of coffee). If the $R^2$ for this regression was 0.6519, we could not say that the log-log regression had a 'better fit', because one regression explains variation in $\ln Y_t$ and the other explains variation in $Y_t$.

The Log-Lin Regression Model

Take an example from labour economics. The theory of human capital investment says that individuals will invest in education because it raises their productivity, and higher productivity raises their potential wages in the labour market.

$$W_i = Y_0\, e^{\beta_1 S_i} e^{\varepsilon_i}$$

Taking the logs of both sides:

$$\ln W_i = \beta_0 + \beta_1 S_i + \varepsilon_i \qquad \text{where } \beta_0 = \ln Y_0$$

where W is income or earnings, and S is the number of years of schooling (education). $Y_0$ represents earnings in the absence of all education.

This is known as a Semilog regression model, because only one variable (in this case the dependent variable) is written as a log. It is also called a Log-Lin model (a Lin-Log model has the independent variable as the only log).

In this model, the slope coefficient measures ' ... the constant proportional change in W for a given absolute change in X.' Here S plays the role of X, so the slope, multiplied by 100, is the percentage change in earnings for a one-year increase in educational attainment.

Numerical example:

$$\widehat{\ln W_i} = \underset{(.339)}{2.574} + \underset{(.009)}{0.085}\,S_i \qquad R^2 = 0.215$$

The estimated coefficient on schooling indicates that the 'incremental impact' of a year of education is to raise earnings by 8.5%.

The Polynomial Form

Take another example from labour economics.
$$Earnings_i = \beta_0 + \beta_1 Age_i + \beta_2 Age_i^2 + \varepsilon_i$$

This model can produce a slope that changes as the independent variable changes:

$$\frac{dEarnings_i}{dAge_i} = \beta_1 + 2\beta_2 Age_i$$

The Inverse Form

Take an example from macroeconomics.

$$W_t = \beta_0 + \beta_1 / U_t + \varepsilon_t$$

This model can also produce a slope that changes as the independent variable changes:

$$\frac{dW_t}{dU_t} = -\frac{\beta_1}{U_t^2}$$

So the slope changes as U changes. As $U_t$ gets larger and larger, $W_t$ gets closer and closer to the constant $\beta_0$.

III. Problems with Adopting Wrong Functional Forms

Suppose we estimate:

$$\ln W_i = \beta_0 + \beta_1 S_i + \varepsilon_i^*$$

but the 'true' model is:

$$\ln W_i = \beta_0 + \beta_1 S_i + \beta_2 S_i^2 + \varepsilon_i$$

We want an estimate of the 'rate of return' to education. However, we assume in our estimated regression that it is constant for each year of education. The truth may be that it decreases with the level of education (i.e., $\beta_1 > 0$ and $\beta_2 < 0$). The rate of return is just the partial derivative of the regression function:

$$\frac{\partial \ln W_i}{\partial S_i} = \beta_1 + 2\beta_2 S_i$$

Thus, we would get a biased estimate of the overall rate of return to education if we ignored the fact that it is a linear function of the level of education. The sample regression function (SRF) is a biased estimate of the population regression function (PRF), because the wrong functional form was adopted from the outset.

[Figure: lnW plotted against S]

IV. Dummy Independent Variables

Dummy variables are 'discrete' and 'qualitative' (e.g., male or female, in the labour force or not, working under a collective or individual employment contract, renting or owning your home). Their units of measurement are 'meaningless'. Normally 1 is assigned to the presence of some characteristic or attribute, and 0 to its absence.

EXAMPLE: A regression model of labour market discrimination by gender.

$$Y_i = \beta_0 + \beta_1 S_i + \beta_2 G_i + \varepsilon_i$$

where
Y_i = annual earnings.
S_i = years of education.
G_i = 1 if the ith person is male, 0 if the ith person is female.

There are no special estimation issues as long as the regression meets all the classical assumptions. Only the nature of the independent variables has changed.
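
As a minimal sketch of the point that a dummy regressor needs no special estimation technique, the following Python snippet fits the earnings model above by ordinary least squares with NumPy. The data are synthetic and noise-free, and the coefficient values (10, 2, 5) are hypothetical choices made only for illustration; they are not taken from these notes.

```python
import numpy as np

# Synthetic, noise-free illustration: every number here is made up.
beta0, beta1, beta2 = 10.0, 2.0, 5.0   # hypothetical "true" coefficients

S = np.array([8, 10, 12, 12, 14, 16, 16, 18], dtype=float)  # years of education
G = np.array([0, 1, 0, 1, 0, 1, 0, 1], dtype=float)         # 1 = male, 0 = female
Y = beta0 + beta1 * S + beta2 * G      # earnings with the error term set to zero

# Design matrix: constant, schooling, gender dummy.
X = np.column_stack([np.ones_like(S), S, G])

# OLS via least squares; the dummy column is treated like any other regressor.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_hat)
```

Because the error term is set to zero, the fitted coefficients reproduce the hypothetical values up to floating-point rounding. With real data the estimates would of course differ from the population parameters.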
The expected salary of a female is:

$$E(Y_i \mid S_i, G_i = 0) = \beta_0 + \beta_1 S_i$$

The expected salary of a male is:

$$E(Y_i \mid S_i, G_i = 1) = \beta_0 + \beta_1 S_i + \beta_2 = (\beta_0 + \beta_2) + \beta_1 S_i$$

since $E(\varepsilon_i \mid S_i, G_i) = 0$. Testing for discrimination (i.e., $H_0: \beta_2 = 0$) is a test for a difference in the intercept terms.

Watch for the Dummy Variable Trap. Suppose we estimate the following:

$$Y_i = \beta_0 + \beta_1 S_i + \beta_2 F_i + \beta_3 M_i + \varepsilon_i$$

where
F_i = 1 if the ith person is female, 0 if the ith person is male.
M_i = 1 if the ith person is male, 0 if the ith person is female.

This is known as the 'Dummy Variable Trap'. We are including redundant information in the regression. Suppose the sample looks like this:

Constant   F_i   M_i
   1        1     0
   1        0     1
   1        1     0
   1        0     1
   1        1     0
   1        1     0
   1        0     1

The problem is that the two dummies are a linear function of the constant (i.e., $F_i + M_i = 1$). This is perfect multicollinearity, which violates Assumption (6). We will see in Ch8 that the estimated coefficients and their standard errors cannot be computed. The solution is simple: drop a dummy variable or the constant term.

Rule of Thumb: If you have 'm' categories, then use 'm-1' dummies.

Slope dummy variables: We could allow the returns to education to differ between males and females by adding an 'interacted' variable:

$$Y_i = \beta_0 + \beta_1 S_i + \beta_2 G_i + \beta_3 G_i S_i + \varepsilon_i$$

This is a more 'flexible' specification. The expected salary of a female is:

$$E(Y_i \mid S_i, G_i = 0) = \beta_0 + \beta_1 S_i$$

The expected salary of a male is:

$$E(Y_i \mid S_i, G_i = 1) = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) S_i$$

We now have both a 'composite' intercept term and a 'composite' slope coefficient for males. If $\beta_2 > 0$, then the male regression line has a higher intercept. If $\beta_3 > 0$, it also has a steeper slope.

V. How to Detect the Problem of Adopting a Wrong Functional Form?

Plot the residuals and look for a 'distinct pattern'. If there is a systematic pattern between $e_i$ and $X_i$, a different functional form is called for. If there is a systematic pattern between $e_i$ and a dummy variable, a dummy variable is needed.

VI. Questions for Discussion: Q7.9

VII. Computing Exercise: Q7.16 (Johnson, Ch 7)
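
The residual check described in Section V can be sketched in Python as follows. This is only an illustration under stated assumptions: the data are synthetic and noise-free, the quadratic coefficients (1.0, 0.20, -0.01) are hypothetical, and the helper `ols_residuals` is a name introduced here, not something from the course materials. The 'true' model is quadratic in S, as in Section III, and we compare the residuals from a linear fit with those from a quadratic fit.

```python
import numpy as np

# Synthetic, noise-free data; the coefficients are hypothetical.
S = np.arange(0.0, 11.0)                  # years of schooling 0, 1, ..., 10
lnW = 1.0 + 0.20 * S - 0.01 * S**2        # "true" model is quadratic in S

def ols_residuals(X, y):
    """Return OLS residuals e = y - X b, with b from least squares."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ b

const = np.ones_like(S)
e_linear = ols_residuals(np.column_stack([const, S]), lnW)        # wrong form
e_quad = ols_residuals(np.column_stack([const, S, S**2]), lnW)    # correct form

# A systematic sign pattern in e_linear against S signals misspecification.
print(np.sign(e_linear).astype(int))
print(np.max(np.abs(e_quad)))
```

In this constructed example the residuals from the linear fit are negative at both ends of the schooling range and positive in the middle, which is the kind of 'distinct pattern' that calls for a different functional form. Once the squared term is included, the residuals vanish up to rounding error. With real, noisy data the pattern would be less clean and is usually judged from a plot of $e_i$ against $S_i$.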