Econ107 Applied Econometrics
Topic 10: Dummy Dependent Variable (Studenmund, Chapter 13)

I. The Linear Probability Model

Suppose we have a cross section of 18-24 year-olds. We specify a simple 2-variable regression model. The probability of enrolling in tertiary study can be written:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

where
$Y_i$ = 1 if enrolled in university; 0 otherwise.
$X_i$ = performance in secondary school, family background, income or wealth of parents, gender, ethnicity, etc.

We write this as a 2-variable regression for simplicity. It could be a multiple regression, where some or all of these independent variables are included.

Define $\text{Prob}_i = P(Y_i = 1 \mid X_i)$. This is known as a Linear Probability Model (LPM), because the conditional expectation of $Y_i$ given $X_i$ is the conditional probability of this event occurring:

$$E(Y_i \mid X_i) = \beta_0 + \beta_1 X_i$$

$$\text{Prob}_i = \beta_0 + \beta_1 X_i$$

where $E(\varepsilon_i \mid X_i) = 0$.

Although we can treat this model like any other regression and use OLS to estimate the parameters, one restriction is that:

$$0 \le \text{Prob}_i \le 1$$

Only probabilities within the [0,1] interval make sense.

Two undesirable characteristics of the LPM (i.e., the use of OLS where the dependent variable is discrete):

1. Nonnormal/heteroskedastic disturbances

In general,

$$\varepsilon_i = Y_i - \beta_0 - \beta_1 X_i = Y_i - \text{Prob}_i$$

For the purposes of statistical inference, we assume that these disturbances are normally distributed with a constant variance. Both assumptions are violated under the LPM. We know that $Y_i$ can take on only one of two values: 0 and 1. Therefore:

If $Y_i = 1$, then $\varepsilon_i = 1 - \text{Prob}_i$
If $Y_i = 0$, then $\varepsilon_i = -\text{Prob}_i$

Since this probability lies between zero and one, the disturbance is positive whenever $Y_i = 1$ and negative whenever $Y_i = 0$. It can be shown that the disturbance follows a 'binomial distribution', with variance:

$$\operatorname{Var}(\varepsilon_i) = \sigma_i^2 = \text{Prob}_i (1 - \text{Prob}_i)$$

The result is that the disturbances are not normally distributed, and are heteroskedastic. If they were homoskedastic, $\operatorname{Var}(\varepsilon_i)$ would be a constant.
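
The variance formula can be checked directly from the two values the disturbance takes. A short derivation, writing $\varepsilon_i$ for the disturbance, which equals $1 - \text{Prob}_i$ with probability $\text{Prob}_i$ and $-\text{Prob}_i$ with probability $1 - \text{Prob}_i$:

```latex
\begin{aligned}
E(\varepsilon_i) &= \text{Prob}_i\,(1 - \text{Prob}_i) + (1 - \text{Prob}_i)\,(-\text{Prob}_i) = 0 \\
\operatorname{Var}(\varepsilon_i) &= E(\varepsilon_i^2)
  = \text{Prob}_i\,(1 - \text{Prob}_i)^2 + (1 - \text{Prob}_i)\,\text{Prob}_i^2 \\
 &= \text{Prob}_i\,(1 - \text{Prob}_i)\,\bigl[(1 - \text{Prob}_i) + \text{Prob}_i\bigr]
  = \text{Prob}_i\,(1 - \text{Prob}_i)
\end{aligned}
```

For instance, the variance is 0.25 when the probability is 0.5 but only 0.09 when the probability is 0.1, so it cannot be constant across observations with different probabilities.
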
Under the LPM, this variance is a function of $X_i$ (i.e., it depends on $\text{Prob}_i$). With heteroskedasticity, the coefficient estimates are still unbiased but inefficient (i.e., no longer minimum variance or BLUE). This can be overcome by running Weighted Least Squares (WLS): transform the data and use OLS.

2-Step Procedure:

1. Run OLS. Retain the fitted values and compute the following 'weight':

$$\hat{W}_i = \widehat{\text{Prob}}_i \,(1 - \widehat{\text{Prob}}_i)$$

2. Transform the data in the following way, and run OLS:

$$\frac{Y_i}{\sqrt{\hat{W}_i}} = \beta_0 \frac{1}{\sqrt{\hat{W}_i}} + \beta_1 \frac{X_i}{\sqrt{\hat{W}_i}} + \frac{\varepsilon_i}{\sqrt{\hat{W}_i}}$$

This eliminates heteroskedasticity, since we now have unit variance for the composite disturbance term. The resulting coefficient estimates are now BLUE. However, this procedure eliminates only one of the two problems.

2. Unrestricted Range of $\text{Prob}_i$

We said earlier that this probability must be restricted to the [0,1] interval. The problem is that nothing in the LPM 'restricts the range' of $\text{Prob}_i$. Consider the following numerical example. Suppose we estimate:

$$\widehat{\text{Prob}}_i = \hat{Y}_i = .197 + .141 X_i$$

where $X_i$ is defined as the father's 'years of education minus 12'. For example, if the father completed a secondary education, we say that he has 12 years of schooling (e.g., School Certificate). In this case, $X_i = 0$. No qualification would make $X_i$ negative. Any post-SC qualification would make $X_i$ positive (e.g., if he has a PhD, $X_i = 7$).

We can show this in the following diagram. The data points lie on the 2 horizontal lines, where y = 0 and y = 1. Either the individual is enrolled in tertiary study or he or she isn't. The dependent variable is dichotomous, although the independent variable is more or less continuous. This is the scatter diagram for a dummy dependent variable model.

OLS tries to fit a regression line through these data points that minimises the sum of the squared residuals. Suppose we get the upward-sloping regression function in the diagram. Enrolment in tertiary education is positively related to the father's education.
The intercept term (0.197) is the intersection of the regression function with the vertical axis. The slope is the estimated coefficient (0.141). Each additional year of education by the father raises the probability that the offspring will be enrolled in tertiary study by 14.1 percentage points.

We can predict the probability that a given individual will be enrolled by plugging his or her father's education into this conditional expectation. For example, two years of post-SC education would give us:

$$\hat{Y}_i = \widehat{\text{Prob}}_i = .197 + .141(2) = .479$$

The problem is that a regression function with any nonzero slope will eventually pass outside the horizontal lines defined by the data points. For example, someone with a father who dropped out of school at age 15 (i.e., $X_i = -2$) will have a negative predicted probability of tertiary study:

$$\hat{Y}_i = \widehat{\text{Prob}}_i = .197 + .141(-2) = -.085$$

This isn't possible. Someone with a father who has a PhD (i.e., $X_i = 7$) will have a predicted probability of tertiary study in excess of one:

$$\hat{Y}_i = \widehat{\text{Prob}}_i = .197 + .141(7) = 1.184$$

This isn't possible either. Thus, we have a fundamental problem with the LPM and forecasts. As a consequence, we need to explore alternatives to the LPM. We want a technique that estimates a 'regression curve' bounded by zero and one (i.e., it asymptotically approaches these two horizontal lines).

We might also note that the $R^2$ statistic isn't very useful in the LPM as a measure of the 'goodness of fit' of this regression function. It's difficult to fit a 'regression line' through two horizontal lines of data points. The intuition is that we're trying to determine the relationship between the 'probability' of this event and some independent variable. But we never observe the true probability. All we see is the eventual outcome of zero or one.

II.
The Logit Model

Under the LPM the probability of an event occurring is written:

$$\hat{Y}_i = \text{Prob}_i = \beta_0 + \beta_1 X_i$$

Under the logit model this probability is written:

$$\text{Prob}_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_i)}} = \frac{1}{1 + e^{-\hat{Y}_i}}$$

Note that there is now a difference between the fitted value $\hat{Y}_i$ and the estimated probability. This probability is now a nonlinear function of X. This is the cumulative distribution function (CDF) for the logistic distribution.

We need to verify that the probability range is now restricted to lie within the [0,1] interval.

If $\hat{Y}_i \to +\infty$, then $\text{Prob}_i \to 1$. When 'e' is raised to a negative number that is large in absolute value, the exponential term goes to zero and this probability approaches one.

If $\hat{Y}_i \to -\infty$, then $\text{Prob}_i \to 0$. When 'e' is raised to a large positive number, the denominator becomes very large and this probability approaches zero.

Thus, this logistic regression function asymptotically approaches one and zero. In between these two extremes, we can show this logit model relative to the LPM in the following diagram.

Note that the marginal or incremental effect of X on Y declines at the extremes. This is the slope of the curve at a given point. Contrast this with the constant slope of the LPM. The largest slope of the logit model occurs at the inflection point, where we go from increasing at an increasing rate to increasing at a decreasing rate. This doesn't have to correspond to X = 0.

Log-Odds Ratio

How do we estimate this logit regression model? One possibility is to convert this nonlinear function into a linear regression function and apply OLS. Begin by writing the probability of not enrolling in tertiary study as:

$$
\begin{aligned}
1 - \text{Prob}_i &= 1 - \frac{1}{1 + e^{-\hat{Y}_i}} \\
&= \frac{1 + e^{-\hat{Y}_i}}{1 + e^{-\hat{Y}_i}} - \frac{1}{1 + e^{-\hat{Y}_i}} \\
&= \frac{e^{-\hat{Y}_i}}{1 + e^{-\hat{Y}_i}} \\
&= \frac{1}{1 + e^{\hat{Y}_i}}
\end{aligned}
$$

We can now write the 'odds ratio' as:

$$\frac{\text{Prob}_i}{1 - \text{Prob}_i} = \frac{1 + e^{\hat{Y}_i}}{1 + e^{-\hat{Y}_i}} = e^{\hat{Y}_i}$$

The trick is to realize that:

$$\frac{1}{1 + e^{-\hat{Y}_i}} = \frac{e^{\hat{Y}_i}}{1 + e^{\hat{Y}_i}}$$

This odds ratio is the probability that an event will occur over the probability that it will not occur. For example, if $\text{Prob}_i = 0.75$, the odds ratio is 3, or 3:1.
If $\text{Prob}_i = 0.8$, the odds ratio is 4, or 4:1.

By taking the natural log of the odds ratio we get:

$$\ln\left(\frac{\text{Prob}_i}{1 - \text{Prob}_i}\right) = \beta_0 + \beta_1 X_i$$

so that the 'log-odds ratio' is a linear function of $X_i$, but the probability is still a nonlinear function of $X_i$. For example, $\beta_1$ tells us how the log of the odds ratio will change with a one unit change in $X_i$.

Estimation

Imagine that we try to use the log-odds ratio to estimate the earlier regression model on tertiary enrolments: plug in observed values of $Y_i$ for $\text{Prob}_i$ and run OLS. What's wrong with this approach? It doesn't work with our cross section of individuals because we don't observe probabilities, just actual outcomes.

If $\text{Prob}_i = 1$, then $\ln\left(\frac{1}{0}\right)$ is undefined
If $\text{Prob}_i = 0$, then $\ln\left(\frac{0}{1}\right)$ is undefined

One way to estimate the model is to use the Maximum Likelihood (ML) method, which is beyond the scope of this course. Alternatively, we can use a method called Grouped Logit. Suppose we have 'group' rather than 'individual' data (e.g., a cross section of secondary schools). We could estimate the probabilities or frequencies of tertiary enrolments for the graduates of each school:

$$\widehat{\text{Prob}}_i = \frac{m_i}{n_i}$$

$m_i$ = number who attend universities or polytechnics by some age.
$n_i$ = number who completed secondary school in that class.

Assuming that this estimated probability is not 0 or 1, we can run the following with OLS:

$$\ln\left(\frac{\widehat{\text{Prob}}_i}{1 - \widehat{\text{Prob}}_i}\right) = \hat{\beta}_0 + \hat{\beta}_1 X_i + \varepsilon_i$$

where the 'hats' on the coefficients indicate that these are estimated with 'grouped' data, and that we lose information in aggregating. The disturbances are heteroskedastic:

$$\operatorname{Var}(\varepsilon_i) = \frac{1}{n_i \,\widehat{\text{Prob}}_i \,(1 - \widehat{\text{Prob}}_i)}$$

We can therefore transform the data by multiplying through by the square root of the weighting variable:

$$\hat{W}_i = n_i \,\widehat{\text{Prob}}_i \,(1 - \widehat{\text{Prob}}_i)$$

This WLS procedure will yield more efficient estimators.

III. The Probit Model

The probit model is nothing more than an alternative regression function that also asymptotically approaches the zero and one horizontal lines.
The difference is that it is based on a 'normal' rather than a 'logistic' distribution function. Recall that under the logit model the probability that an event will occur is written:

$$\text{Prob}_i = \frac{1}{1 + e^{-\hat{Y}_i}}$$

Under probit, we let $\text{Prob}_i$ be the CDF of a normal distribution:

$$\text{Prob}_i = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\hat{Y}_i} \exp\left(-\frac{t^2}{2}\right) dt$$

where t is a standardised normal variable, with zero mean and unit variance. For this reason, the model could equally be called 'normit'.

In general, there is no reason to prefer logit over probit or vice versa. Probit does have a slightly different regression function (although it asymptotically approaches zero and one like logit). It approaches the extreme values faster than logit.

Numerical example. The regression model:

$$LF_i = \beta_0 + \beta_1 M_i + \beta_2 S_i + \varepsilon_i$$

where
$LF_i$ = 1 if woman in labour force; 0 otherwise.
$M_i$ = 1 if woman is married; 0 otherwise.
$S_i$ = number of years of schooling.

1. LPM (OLS). No correction for heteroskedasticity.

$$\widehat{LF}_i = -0.28 - \underset{(0.15)}{0.38}\, M_i + \underset{(0.03)}{0.09}\, S_i$$

We can interpret the effect of marital status on labour force participation:

$$E(LF_i \mid M_i = 0, S_i = 12) = -0.28 + 0.09(12) = .80$$

$$E(LF_i \mid M_i = 1, S_i = 12) = -0.28 - 0.38 + 0.09(12) = .42$$

2. LPM (Weighted Least Squares).

$$\frac{\widehat{LF}_i}{\sqrt{\hat{W}_i}} = -0.21 \frac{1}{\sqrt{\hat{W}_i}} - \underset{(0.15)}{0.39}\, \frac{M_i}{\sqrt{\hat{W}_i}} + \underset{(0.02)}{0.08}\, \frac{S_i}{\sqrt{\hat{W}_i}}$$

where the relevant weight is the product of the probabilities of being in and out of the labour force estimated from the previous regression. It's easy to verify the problem of 'unrestricted range' of estimated probabilities under the LPM:

$$E(LF_i \mid M_i = 0, S_i = 16) = -0.21 + 0.08(16) = 1.07$$

The estimated probability of an unmarried woman with 16 years of education being in the labour force is 107%.

3. Logit (Maximum likelihood). We use the same individual data to estimate the equation with logit.
$$\ln\left(\frac{LF_i}{1 - LF_i}\right) = -5.89 - \underset{(1.18)}{2.59}\, M_i + \underset{(0.31)}{0.69}\, S_i$$

The results are represented in terms of the log of the odds ratio (even though maximum likelihood estimation on the individual data was used).

4. Probit (Maximum likelihood). Again, the individual data are used with maximum likelihood probit.

$$F^{-1}(LF_i) = -3.44 - \underset{(0.62)}{1.44}\, M_i + \underset{(0.17)}{0.40}\, S_i$$

where $F^{-1}$ is the inverse of the normal CDF.

We can't compare the 'magnitudes' of the coefficient estimates from the logit and probit, but the t tests are performed in the traditional manner. The t ratios on $M_i$ are around 2.2 and 2.3 in absolute value, respectively, and better than 2 on $S_i$ in both regressions. The magnitudes of the estimated coefficients have no direct economic meaning, because they're related to the probability of labour force participation in a nonlinear way.

IV. Questions for Discussion: Q13.12

V. Computing Exercise: Johnson, Ch13
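
Although the logit and probit coefficients in the numerical example cannot be compared directly, the probabilities they imply can. The following is a minimal sketch, not part of the original notes, that plugs the reported estimates into the logistic and normal CDFs and sets them beside the WLS version of the LPM. The function names (`logit_prob`, `probit_prob`, `lpm_wls_prob`) are hypothetical and chosen only for illustration.

```python
import math


def logit_prob(married, schooling):
    """Probability implied by the reported logit estimates (logistic CDF)."""
    index = -5.89 - 2.59 * married + 0.69 * schooling
    return 1.0 / (1.0 + math.exp(-index))


def probit_prob(married, schooling):
    """Probability implied by the reported probit estimates (standard normal CDF)."""
    index = -3.44 - 1.44 * married + 0.40 * schooling
    # Standard normal CDF written in terms of the error function.
    return 0.5 * (1.0 + math.erf(index / math.sqrt(2.0)))


def lpm_wls_prob(married, schooling):
    """Fitted value from the reported WLS linear probability model."""
    return -0.21 - 0.39 * married + 0.08 * schooling


if __name__ == "__main__":
    # (married, years of schooling) cases taken from the numerical example.
    cases = [(0, 12), (1, 12), (0, 16)]
    for married, schooling in cases:
        print(
            f"M={married}, S={schooling}: "
            f"LPM(WLS)={lpm_wls_prob(married, schooling):.3f}, "
            f"logit={logit_prob(married, schooling):.3f}, "
            f"probit={probit_prob(married, schooling):.3f}"
        )
```

For an unmarried woman with 12 years of schooling, the logit and probit estimates imply probabilities of roughly 0.92 and 0.91, even though the coefficients themselves differ considerably. For an unmarried woman with 16 years of schooling, both stay below one, whereas the LPM fitted value is 1.07.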