The Semi-Log Specification (also known as the Log-Linear Specification)

It is often convenient, in order to give the results a better interpretation, to transform some of the variables we analyze in our econometric models. One common transformation results from assuming that the true relationship between Y and X is

$Y = e^{\beta_0} e^{\beta_1 X}$   (1)

Taking natural logarithms of both sides of the equation we obtain

$\ln Y = \beta_0 + \beta_1 X$   (2)

which is a linear relation between ln Y and X. The usefulness of the semi-log relation arises from the ease of interpreting β1. It is possible to show that

$\beta_1 = \dfrac{\delta Y / Y}{\delta X}$   (3)

that is, the slope coefficient is the ratio of the proportionate change in Y to the absolute change in X. When the change in X is not too small, say one unit, we denote it as ∆X, and from the last equation we can write

proportionate change in $Y = \beta_1 \Delta X$   (4)

If ∆X = 1, then β1 is equal to the proportionate change in Y that results from a unit change in X. In our econometric model this means that we can write

$\ln Y_i = \beta_0 + \beta_1 X_i + u_i$   (5)

where the dependent variable is now ln Y and the exogenous variable is still only X. This specification has been widely used in the human capital theory of earnings determination. The theory states that the logarithm of earnings is linearly related to the level of educational attainment, so

$\ln EARNS_i = \beta_0 + \beta_1 ED_i + u_i$   (6)

If, for example, β1 = 0.1, it means that an additional year of schooling increases earnings by around 10%.

Multiple Regression: The Two-Variable Case

Most economic relations, and the processes they describe, involve more than one determinant of some particular dependent variable. For example, in one of the main examples we have discussed, education is the only variable that affects a person's earnings. There are obviously many other variables that can affect a person's labor earnings: age, experience, gender, marital status, etc.

Think first of the case where we have two explanatory variables, X1 and X2. We are going to assume these are the only two variables that affect Y, and that they are determined outside the model. We can then write the linear multiple regression model

$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i$   (7)

Notice that here X_{1i} denotes the ith observation of X1 and X_{2i} denotes the ith observation of X2. The interpretation of the disturbance term has not changed. Our econometric task is to estimate β0, β1, and β2. The estimated regression decomposes each Yi into the fitted value

$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i}$   (8)

and its residual

$e_i = Y_i - \hat{Y}_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_{1i} - \hat{\beta}_2 X_{2i}$   (9)

The OLS technique calculates the values of the unknown parameters by minimizing the sum of the squares of these residuals. Although this will usually be done with the help of a computer, we can write formulas for this particular, two-variable, model. To simplify the notation, define the following deviations from the means:

$y = Y - \bar{Y}, \quad x_1 = X_1 - \bar{X}_1, \quad x_2 = X_2 - \bar{X}_2$   (10)

Then the estimates are

$\hat{\beta}_2 = \dfrac{\sum x_2 y \sum x_1^2 - \sum x_1 y \sum x_1 x_2}{\sum x_1^2 \sum x_2^2 - \left(\sum x_1 x_2\right)^2}$   (11)

$\hat{\beta}_1 = \dfrac{\sum x_1 y \sum x_2^2 - \sum x_2 y \sum x_1 x_2}{\sum x_1^2 \sum x_2^2 - \left(\sum x_1 x_2\right)^2}$   (12)

$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_2 \bar{X}_2 - \hat{\beta}_1 \bar{X}_1$   (13)

Notice that the estimators for all three coefficients depend on the values of all the variables. For example, β̂2 depends not only on Y and X2 but also on X1. This means that β̂2 is different from the slope coefficient of a regression of Y on X2 alone. The multiple regression coefficients cannot be obtained by estimating two simple regressions, one of Y on X1 and another of Y on X2; the short numerical sketch below illustrates this point.
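As a rough illustration (not part of the formal derivation), the following Python sketch applies formulas (10)-(13) to simulated data and compares β̂2 with the slope of a simple regression of Y on X2. The sample size, the data-generating coefficients, and the use of NumPy are illustrative assumptions, not part of the original notes.

```python
import numpy as np

# Illustrative simulated data; the true coefficients (2.0, 1.5, -0.8) are arbitrary.
rng = np.random.default_rng(0)
n = 200
X1 = rng.normal(size=n)
X2 = 0.6 * X1 + rng.normal(size=n)   # X1 and X2 are correlated by construction
u = rng.normal(size=n)
Y = 2.0 + 1.5 * X1 - 0.8 * X2 + u

# Deviations from the means, as in (10)
y = Y - Y.mean()
x1 = X1 - X1.mean()
x2 = X2 - X2.mean()

# Closed-form two-regressor OLS estimates, as in (11)-(13)
den = np.sum(x1**2) * np.sum(x2**2) - np.sum(x1 * x2) ** 2
b2 = (np.sum(x2 * y) * np.sum(x1**2) - np.sum(x1 * y) * np.sum(x1 * x2)) / den
b1 = (np.sum(x1 * y) * np.sum(x2**2) - np.sum(x2 * y) * np.sum(x1 * x2)) / den
b0 = Y.mean() - b2 * X2.mean() - b1 * X1.mean()

# Slope of a simple regression of Y on X2 alone, for comparison
b2_simple = np.sum(x2 * y) / np.sum(x2**2)

print(f"multiple regression: b0={b0:.3f}, b1={b1:.3f}, b2={b2:.3f}")
print(f"simple regression of Y on X2: slope={b2_simple:.3f}")
```

Because X1 and X2 are correlated in this simulation, the simple-regression slope differs from β̂2; this is the point made in the paragraph above.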
The only exception to this is when X1 and X2 are uncorrelated, meaning that their correlation is zero, which is equivalent to $\sum x_1 x_2 = 0$; in that case the formulas reduce to the slope coefficients of the corresponding simple regression models. The normal situation is one in which the correlation between the explanatory variables is not zero; it may be positive or negative, and in both cases it is taken into account when estimating the unknown parameters by OLS.

It is important to understand the interpretation of the coefficients in this new model. For example, let's interpret β̂1. We say that β̂1 is the change in Ŷ that results from a unit change in X1, holding constant the value of X2. This phrasing corresponds quite closely to the concept of ceteris paribus, which is commonly used in economics. It does not mean that X2 remains constant when X1 changes; it just means that if we were able to keep X2 constant and X1 were to change by one unit, then Ŷ would change by β̂1. The same interpretation applies to β̂2.

Multiple Regression: The General Case

In this case the values of the dependent variable are determined by several explanatory variables, or regressors. In general we say that Y depends on k explanatory variables:

$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + u_i$   (14)

A typical variable is denoted by X_j and a typical coefficient by β_j. This technique allows us to take into account all the relevant variables that help determine the value of the dependent variable. OLS here minimizes the sum of squared residuals, where the latter are quite similar to the ones we wrote in the two-variable case, but now with k variables.

The interpretation of the estimated parameters of this model is not very different from the two-variable case. We say that β̂_j is the change in Ŷ that results from a unit change in X_j, holding constant the values of all the other variables. Again, this closely corresponds to the concept of ceteris paribus. These are the only effects revealed when estimating multiple regression models.

Notice that the main measure of goodness of fit that we have used, the R², is calculated in the same way as in the case of the simple regression:

$R^2 = 1 - \dfrac{\sum_{i=1}^{n} e_i^2}{\sum_{i=1}^{n} \left(Y_i - \bar{Y}\right)^2}$   (15)

However, the R² is not very useful for comparing alternative specifications of the regression model when they have different numbers of regressors. The reason is that it always gives the same answer: the regression with additional variables included fits better. This is because the addition of an explanatory variable to an original regression model cannot raise the sum of squared residuals. Therefore the addition of one variable to a regression model cannot decrease the R², and in practice it virtually always increases it. But the gain in the R² comes at the cost of including another variable. So to compare different specifications with different numbers of parameters to estimate, we need a measure that assesses whether the gain in fit outweighs the cost of estimating one more coefficient for a given number of observations. A statistic that makes this comparison is the adjusted (or corrected) R², denoted by R̄². It is defined as:

$\bar{R}^2 = 1 - \dfrac{\sum_{i=1}^{n} e_i^2 \,/\, (n - k - 1)}{\sum_{i=1}^{n} \left(Y_i - \bar{Y}\right)^2 \,/\, (n - 1)}$   (16)

Notice that when comparing specifications it all boils down to the numerator, since the denominator does not change. When we add one more regressor, the sum of squared residuals goes down, but so does n − k − 1, so the change in R̄² depends on whether the sum of squared residuals decreases proportionately more or less than n − k − 1. The short sketch below illustrates the calculation.
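A minimal sketch of the comparison, under the assumption of simulated data in which the third regressor is pure noise; the helper function and all numerical values are hypothetical and only meant to show how (15) and (16) are computed:

```python
import numpy as np

# Simulated data: Y depends on X1 and X2 only; X3 is an irrelevant regressor.
rng = np.random.default_rng(1)
n = 100
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
X3 = rng.normal(size=n)
Y = 1.0 + 0.5 * X1 + 0.5 * X2 + rng.normal(size=n)

def r2_and_adjusted(Y, regressors):
    """Fit OLS with a constant; return R-squared (15) and adjusted R-squared (16)."""
    n, k = len(Y), len(regressors)
    X = np.column_stack([np.ones(n)] + list(regressors))
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    e = Y - X @ beta
    tss = np.sum((Y - Y.mean()) ** 2)
    r2 = 1 - np.sum(e**2) / tss
    adj = 1 - (np.sum(e**2) / (n - k - 1)) / (tss / (n - 1))
    return r2, adj

print("X1, X2:      R2=%.4f  adj R2=%.4f" % r2_and_adjusted(Y, [X1, X2]))
print("X1, X2, X3:  R2=%.4f  adj R2=%.4f" % r2_and_adjusted(Y, [X1, X2, X3]))
```

Adding the noise regressor X3 raises (or at least does not lower) the R², while the adjusted R̄² will typically fall, because the small drop in the sum of squared residuals is outweighed by the loss of one degree of freedom.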
Most software programs report R̄², so we can compare the fit of different regressions.
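As a final illustration, the sketch below fits the semi-log earnings equation (6), extended with an experience regressor, on simulated data and reads the reported R̄² off the fitted model. The statsmodels package is just one example of such software, and the variable names (ED, EXPER) and simulated coefficient values are hypothetical.

```python
import numpy as np
import statsmodels.api as sm   # one common package that reports adjusted R-squared

# Simulated earnings data in the spirit of equation (6); all numbers are made up.
rng = np.random.default_rng(2)
n = 500
ED = rng.integers(8, 21, size=n).astype(float)       # years of schooling
EXPER = rng.integers(0, 31, size=n).astype(float)    # years of experience
log_earns = 1.5 + 0.10 * ED + 0.03 * EXPER + rng.normal(scale=0.4, size=n)

X = sm.add_constant(np.column_stack([ED, EXPER]))
res = sm.OLS(log_earns, X).fit()

print(res.params)         # intercept and slope estimates
print(res.rsquared)       # R-squared, as in (15)
print(res.rsquared_adj)   # adjusted R-squared, as in (16)
# With a semi-log dependent variable, the ED coefficient (about 0.10 here) is read
# as roughly a 10% increase in earnings per additional year of schooling.
```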