On the Road to Logit and Probit Land

Suppose we have a binary dependent variable d and some covariate x1. Let's graph the data:

[Figure: scatterplot of d against x1; d takes only the values 0 and 1]

What is the central feature of this graph? Now suppose we estimate a linear regression model using OLS:

. reg d x1

      Source |       SS       df       MS              Number of obs =     169
-------------+------------------------------           F(  1,   167) =  129.35
       Model |  18.4248249     1  18.4248249           Prob > F      =  0.0000
    Residual |  23.7881929   167  .142444269           R-squared     =  0.4365
-------------+------------------------------           Adj R-squared =  0.4331
       Total |  42.2130178   168  .251267963           Root MSE      =  .37742

------------------------------------------------------------------------------
           d |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   .7749921   .0681425    11.37   0.000     .6404603    .9095238
       _cons |   .5067289   .0290408    17.45   0.000     .4493945    .5640633
------------------------------------------------------------------------------

What do we see? The model "fits" well (a significant F statistic and a relatively high R-squared), and the coefficients are all significant. On the surface, it all seems fine. Let's compute the residuals:

. predict resid, resid

Now let's graph them (note that I could use the rvfplot command here if I wanted):

. gr resid x1, ylab xlab yline(0)

[Figure: residuals plotted against x1, with a horizontal reference line at 0; at each value of x1 the residuals fall on one of two lines]

What is the central feature of this graph? It illustrates the point that for any given X, there are two and only two possible values of e (the residual). An implication of this is that the estimated variance of e will be heteroskedastic. One way to see this is to graph the squared residuals:

[Figure: squared residuals plotted against x1; the spread varies systematically with x1]

Hmmmm... it looks as if the variance of e is systematically related to X. This is endemic to heteroskedasticity problems. Note that if we performed the White test (or any other test for heteroskedasticity), we would always reject the null of homoskedasticity. This has to be the case when a dichotomous d.v. is used in the OLS setting.
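The two-valued residual structure can be written down directly. Here is a minimal Python sketch (mine, not part of the notes) of the algebra: in the linear probability model, given a fitted value p = Xb, the residual can only be 1 - p (when d = 1) or -p (when d = 0), so Var(e | X) = p(1 - p), which necessarily changes with X.

```python
def lpm_residual_values(p):
    """The only two residuals possible at fitted value p: d - p for d in {1, 0}."""
    return (1.0 - p, -p)

def lpm_error_variance(p):
    """Var(e | X) = p(X) * (1 - p(X)): the variance moves with X."""
    return p * (1.0 - p)
```

Because lpm_error_variance(p) depends on p, and hence on X, homoskedasticity fails by construction, which is why a White test always rejects here.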
Suppose we didn't care about heteroskedastic errors (though this is not recommended!!) and proceeded as usual with OLS. We might want to generate predicted values of the dependent variable, and we would want to interpret these predictions as probabilities. Let's generate the predicted values from our model:

. predict xbOLS
(option xb assumed; fitted values)

and then graph them:

. gr xbOLS x1, ylab xlab yline(0,1)

[Figure: OLS fitted values and d plotted against x1; the fitted line runs below 0 and above 1]

Oops. What is the central feature of this graph? There are several things to note; among them, we find that when X equals -.98, the predicted probability is -.25, and when X equals .94, the predicted probability is 1.23. One way to avoid this embarrassing problem is to place restrictions on the predicted values:

. gen xbOLSrestrict=xbOLS
. replace xbOLSrestrict=.01 if xbOLS<=0
(13 real changes made)
. replace xbOLSrestrict=.99 if xbOLS>=1
(10 real changes made)

Now, I'll graph them:

. gr xbOLSrestrict d x1, ylab xlab yline(0,1)

[Figure: restricted predicted values (xbOLSrestrict) and d plotted against x1]

Now the problems are solved, sort of. We get valid predictions, but only after post hoc restrictions on the predicted values. Gujarati and others discuss the LPM (using WLS); I will not (it is an easy model to estimate). Even with "corrections," OLS is troublesome. What is another feature of the graph shown above? The probabilities are assumed to increase linearly with X: the marginal effect of X is constant. Is this realistic? Unlikely. What we would like is a model that produces valid probabilities (without post hoc constraints) and one where the probabilities are nonlinear in X. As Aldrich and Nelson (1984) [quoted in Gujarati] note, we want a model that "approaches zero at slower and slower rates as X gets small and approaches one at slower and slower rates as X gets very large." What we want is a sigmoid response function (an "S-shaped" function). Note that all cumulative distribution functions are sigmoid.
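The logistic CDF is one such sigmoid function, and it is easy to check numerically that it never leaves the unit interval. A small Python sketch (mine, not the notes'):

```python
import math

def logistic_cdf(z):
    """F(z) = 1 / (1 + exp(-z)): an S-shaped curve, strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Even extreme index values map into (0, 1), unlike the LPM's fitted values:
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    assert 0.0 < logistic_cdf(z) < 1.0
```

The curve approaches 0 and 1 ever more slowly in the tails, exactly the behavior Aldrich and Nelson describe.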
Hence, if we use the CDF of a distribution to model the relationship between a dichotomous Y and X, we'll resolve our problem. The question is: which CDF? Conventionally, the logistic distribution and the standard normal distribution are the two candidates. The logistic gives rise to the logit model; the standard normal gives rise to the probit model. For now, we omit the details of the logit estimator. Let's just estimate one and generate predicted values (which, as we will see, are probabilities):

. logit d x1

Iteration 0:   log likelihood =  -117.0679
Iteration 1:   log likelihood = -75.660879
Iteration 2:   log likelihood = -71.457455
Iteration 3:   log likelihood = -71.135961
Iteration 4:   log likelihood = -71.132852
Iteration 5:   log likelihood = -71.132852

Logit estimates                                   Number of obs   =        169
                                                  LR chi2(1)      =      91.87
                                                  Prob > chi2     =     0.0000
Log likelihood = -71.132852                       Pseudo R2       =     0.3924

------------------------------------------------------------------------------
           d |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   5.093696   .7544297     6.75   0.000     3.615041    6.572351
       _cons |   .0330345   .2093178     0.16   0.875    -.3772208    .4432898
------------------------------------------------------------------------------

. predict xbLOGIT
(option p assumed; Pr(d))

Now, let me graph them:

. gr xbLOGIT d x1, ylab xlab yline(0,1)

[Figure: logit predicted probabilities Pr(d) and d plotted against x1; an S-shaped curve bounded by 0 and 1]

What is the central feature of this graph? Now, estimate probit and generate predicted values:

. probit d x1

Iteration 0:   log likelihood =  -117.0679
Iteration 1:   log likelihood = -74.736867
Iteration 2:   log likelihood = -70.555131
Iteration 3:   log likelihood = -70.350158
Iteration 4:   log likelihood = -70.349323

Probit estimates                                  Number of obs   =        169
                                                  LR chi2(1)      =      93.44
                                                  Prob > chi2     =     0.0000
Log likelihood = -70.349323                       Pseudo R2       =     0.3991

------------------------------------------------------------------------------
           d |      Coef.   Std. Err.      z    P>|z|     [95% Conf.
Interval]
-------------+----------------------------------------------------------------
          x1 |   3.056784   .4159952     7.35   0.000     2.241449     3.87212
       _cons |   .0189488   .1209436     0.16   0.876    -.2180963     .255994
------------------------------------------------------------------------------

. drop xbprobit
. predict xbPROBIT
(option p assumed; Pr(d))

. gr xbPROBIT d x1, ylab xlab yline(0,1)

[Figure: probit predicted probabilities Pr(d) and d plotted against x1]

What is the central feature of this graph? Differences between logit and probit?

[Figure: logit and probit predicted probabilities overlaid against x1; the two curves nearly coincide]

Logit has "fatter" tails, but the difference is almost always trivial.

Extensions: Logit Model

Illustrating the nonlinearity of probabilities in Z. Let's reestimate the model:

. logit d x1

Iteration 0:   log likelihood =  -117.0679
Iteration 1:   log likelihood = -75.660879
Iteration 2:   log likelihood = -71.457455
Iteration 3:   log likelihood = -71.135961
Iteration 4:   log likelihood = -71.132852
Iteration 5:   log likelihood = -71.132852

Logit estimates                                   Number of obs   =        169
                                                  LR chi2(1)      =      91.87
                                                  Prob > chi2     =     0.0000
Log likelihood = -71.132852                       Pseudo R2       =     0.3924

------------------------------------------------------------------------------
           d |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   5.093696   .7544297     6.75   0.000     3.615041    6.572351
       _cons |   .0330345   .2093178     0.16   0.875    -.3772208    .4432898
------------------------------------------------------------------------------

Now, let's generate "Z":

. gen z=_b[_cons]+_b[x1]*x1

From z, let's generate the logistic CDF (which is equivalent to P(Y=1)). One way to do it is like this:

. gen P=1/(1+exp(-z))

The other way to do it is like this:

. gen Prob=exp(z)/(1+exp(z))

Obviously, they're equivalent statements (their correlation is 1.0):

. corr P Prob
(obs=169)

             |        P     Prob
-------------+------------------
           P |   1.0000
        Prob |   1.0000   1.0000

If we want to verify that P is in the permissible range, we can summarize Prob:

. summ Prob

    Variable |     Obs        Mean   Std. Dev.
Min         Max
-------------+------------------------------------------------------
        Prob |     169    .5147929   .3396981   .0069682    .9918078

Note that the probabilities range from .0069 to .99 (recall where the OLS probabilities fell). Second, if we want to verify that P is nonlinearly related to Z, we can simply graph P against Z:

. gr Prob z, ylab xlab c(s)

[Figure: Prob plotted against z; a sigmoid curve rising from 0 to 1]

This graph should look familiar. The only difference between this graph and the previous logit graph is the X-axis: here it is z; before it was X. Now let's turn to some real data:

. logit aff1 white2 ideo

Iteration 0:   log likelihood = -1041.0041
Iteration 1:   log likelihood =  -918.3425
Iteration 2:   log likelihood = -905.73209
Iteration 3:   log likelihood =  -905.7173
Iteration 4:   log likelihood =  -905.7173

Logit estimates                                   Number of obs   =       2087
                                                  LR chi2(2)      =     270.57
                                                  Prob > chi2     =     0.0000
Log likelihood = -905.7173                        Pseudo R2       =     0.1300

------------------------------------------------------------------------------
        aff1 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      white2 |  -1.837103    .122431   -15.01   0.000    -2.077063   -1.597142
        ideo |   .8629587   .1508031     5.72   0.000       .56739    1.158527
       _cons |  -.1051663   .0964957    -1.09   0.276    -.2942944    .0839618
------------------------------------------------------------------------------

Interpretation? Not natural. A negative sign tells us the log-odds are decreasing in X; a positive sign says the log-odds are increasing in X. So we see whites are less likely to support affirmative action, but ideology is positively related. We could convert the coefficients to odds ratios (Stata will do this through the logistic procedure):

. logistic aff1 white2 ideo

Logit estimates                                   Number of obs   =       2087
                                                  LR chi2(2)      =     270.57
                                                  Prob > chi2     =     0.0000
Log likelihood = -905.7173                        Pseudo R2       =     0.1300

------------------------------------------------------------------------------
        aff1 | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf.
Interval]
-------------+----------------------------------------------------------------
      white2 |   .1592782   .0195006   -15.01   0.000     .1252976    .2024743
        ideo |   2.370163    .357428     5.72   0.000     1.763658    3.185239
------------------------------------------------------------------------------

Odds ratios are nice. What about probabilities?

. predict plogit

Note: from this model I'll get only 14 distinct probability estimates. Why?

. table plogit white2

----------------------
          |   white2
 Pr(aff1) |    0      1
----------+-----------
 .0570423 |         48
 .0750336 |        269
 .097348  |        314
 .1253988 |        830
 .1600989 |        201
 .2021813 |        165
 .2536365 |         28
 .2752544 |   13
 .337441  |   42
 .4037311 |   53
 .4737326 |  326
 .5447822 |   43
 .6140543 |   43
 .6808742 |   21
----------------------

Generating probability scenarios:

. gen probscenario1=(exp(_b[_cons]+_b[white2]+_b[ideo]*ideo))/(1+(exp(_b[_cons]+_b[white2]+_b[ideo]*ideo)))
(68 missing values generated)

. gen probscenario2=(exp(_b[_cons]+_b[white2]*0+_b[ideo]*ideo))/(1+(exp(_b[_cons]+_b[white2]*0+_b[ideo]*ideo)))
(68 missing values generated)

Now I have a scenario for whites and one for nonwhites. I could graph them:

. gr probscenario1 probscenario2 ideo, ylab xlab b2(Ideology Scale) c(ll)

[Figure: predicted probabilities plotted against the Ideology Scale, one curve for whites (probscenario1) and one for nonwhites (probscenario2); both rise with ideology, and the nonwhite curve lies everywhere above the white curve]

NOW, let's turn to the PROBIT estimator. We reestimate our model:

. probit aff1 white2 ideo

Iteration 0:   log likelihood = -1041.0041
Iteration 1:   log likelihood = -906.32958
Iteration 2:   log likelihood = -905.40724
Iteration 3:   log likelihood = -905.40688

Probit estimates                                  Number of obs   =       2087
                                                  LR chi2(2)      =     271.19
                                                  Prob > chi2     =     0.0000
Log likelihood = -905.40688                       Pseudo R2       =     0.1303

------------------------------------------------------------------------------
        aff1 |      Coef.   Std. Err.      z    P>|z|     [95% Conf.
Interval]
-------------+----------------------------------------------------------------
      white2 |  -1.081187   .0721158   -14.99   0.000    -1.222531   -.9398421
        ideo |   .4783113    .082473     5.80   0.000     .3166671    .6399554
       _cons |  -.0646902   .0598992    -1.08   0.280    -.1820905      .05271
------------------------------------------------------------------------------

Note that the coefficients change. Why? A different CDF is applied. Suppose we want to generate the utilities:

. gen U=_b[_cons]+_b[ideo]*ideo+_b[white2]*white2
(89 missing values generated)

. summ U

    Variable |     Obs        Mean   Std. Dev.        Min        Max
-------------+-------------------------------------------------------
           U |    2396   -.9258217    .5063438  -1.624188    .413621

Note that they span 0: i.e., they are unconstrained (unlike the LPM). Suppose we want to compute probabilities? We can take our function:

. gen Prob_U=norm(U)
(89 missing values generated)

. summ Prob_U

    Variable |     Obs        Mean   Std. Dev.        Min        Max
-------------+-------------------------------------------------------
      Prob_U |    2396    .2034129    .1554985   .0521678   .6604242

What have I done here? (I've used Stata's function for the normal distribution, i.e., the CDF of the Normal.) Of course, I could ask Stata to compute these directly:

. predict probitprob
(option p assumed; Pr(aff1))
(89 missing values generated)

You can verify for yourself that what I typed before is equivalent to the Stata predictions. Note that, as with logit, we will only get 14 probabilities (why?):

. table probitprob white2

----------------------
          |   white2
 Pr(aff1) |    0      1
----------+-----------
 .0521678 |         48
 .0719306 |        269
 .0961646 |        314
 .1259231 |        830
 .161568  |        201
 .2032153 |        165
 .2522055 |         28
 .2935644 |   13
 .3518333 |   42
 .4119495 |   53
 .4742103 |  326
 .5371088 |   43
 .5990911 |   43
 .6604242 |   21
----------------------

Comparison of probabilities? Let's correlate the logit probabilities with the probit probabilities:

. corr probitprob plogit
(obs=2396)

             | probit~b   plogit
-------------+------------------
  probitprob |   1.0000
      plogit |   0.9996   1.0000

Effectively, no difference.
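Since both sets of coefficients appear above, it is easy to recompute the two kinds of predicted probabilities by hand. A Python sketch (mine, not part of the notes), where NormalDist().cdf plays the role of Stata's norm():

```python
import math
from statistics import NormalDist

# Coefficients copied from the logit and probit outputs above.
LOGIT = {"_cons": -0.1051663, "white2": -1.837103, "ideo": 0.8629587}
PROBIT = {"_cons": -0.0646902, "white2": -1.081187, "ideo": 0.4783113}
PHI = NormalDist().cdf  # standard normal CDF, the analogue of norm()

def logit_prob(white2, ideo):
    """Pr(aff1 = 1) from the logit model: exp(z) / (1 + exp(z))."""
    z = LOGIT["_cons"] + LOGIT["white2"] * white2 + LOGIT["ideo"] * ideo
    return math.exp(z) / (1.0 + math.exp(z))

def probit_prob(white2, ideo):
    """Pr(aff1 = 1) from the probit model: Phi(U)."""
    u = PROBIT["_cons"] + PROBIT["white2"] * white2 + PROBIT["ideo"] * ideo
    return PHI(u)
```

At white2 = 0 and ideo = 0, for example, the two functions return about .4737 and .4742, matching the .4737326 and .4742103 rows in the tables above: the two models tell essentially the same story.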
Note that, as with logit, we can generate probability scenarios:

. gen prob1=norm(_b[_cons]+_b[white2]+_b[ideo]*ideo)
(68 missing values generated)

. gen prob2=norm(_b[_cons]+_b[white2]*0+_b[ideo]*ideo)
(68 missing values generated)

The only difference is that we use the standard normal CDF to derive the function.

Illustrating the Likelihood Ratio Test

Using the results from the probit model, I compute -2logL0 - (-2logLc), where logL0 is the log likelihood of the null (intercept-only) model and logLc is the log likelihood of the current model:

. display (-2*-1041.0041)-(-2*-905.40688)
271.19444

This is equivalent to -2(logL0 - logLc):

. display -2*(-1041.0041--905.40688)
271.19444

This is what Stata reports as "LR chi2(2)". The (2) denotes that there are k=2 degrees of freedom. Let's illustrate this in a case where an independent variable is not significantly different from 0. I generate a random set of numbers and run my probit model:

. set seed 954265252265262626262
. gen random=uniform()
. probit aff1 random

Iteration 0:   log likelihood = -1077.9585
Iteration 1:   log likelihood = -1077.9214

Probit estimates                                  Number of obs   =       2147
                                                  LR chi2(1)      =       0.07
                                                  Prob > chi2     =     0.7851
Log likelihood = -1077.9214                       Pseudo R2       =     0.0000

------------------------------------------------------------------------------
        aff1 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      random |  -.0290269    .106446    -0.27   0.785    -.2376571    .1796033
       _cons |  -.8227274   .0616407   -13.35   0.000    -.9435409   -.7019139
------------------------------------------------------------------------------

Here, the LR test is given by -2(-1077.9585 - (-1077.9214)), which is:

. display -2*(-1077.9585--1077.9214)
.0742

Using Stata as a chi2 table, I find that the p-value for this test statistic is:

. display chi2tail(1,.0742)
.78531696

which clearly demonstrates that the addition of the covariate "random" adds nothing to this model. Now, let's illustrate another kind of LR test. We reestimate the "full" probit model.
. probit aff1 white2 ideo

Iteration 0:   log likelihood = -1041.0041
Iteration 1:   log likelihood = -906.32958
Iteration 2:   log likelihood = -905.40724
Iteration 3:   log likelihood = -905.40688

Probit estimates                                  Number of obs   =       2087
                                                  LR chi2(2)      =     271.19
                                                  Prob > chi2     =     0.0000
Log likelihood = -905.40688                       Pseudo R2       =     0.1303

------------------------------------------------------------------------------
        aff1 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      white2 |  -1.081187   .0721158   -14.99   0.000    -1.222531   -.9398421
        ideo |   .4783113    .082473     5.80   0.000     .3166671    .6399554
       _cons |  -.0646902   .0598992    -1.08   0.280    -.1820905      .05271
------------------------------------------------------------------------------

Now, let's reestimate a simpler model:

. probit aff1 white2 if ideo~=.

Iteration 0:   log likelihood = -1041.0041
Iteration 1:   log likelihood = -922.93117
Iteration 2:   log likelihood = -922.53588
Iteration 3:   log likelihood = -922.53587

Probit estimates                                  Number of obs   =       2087
                                                  LR chi2(1)      =     236.94
                                                  Prob > chi2     =     0.0000
Log likelihood = -922.53587                       Pseudo R2       =     0.1138

------------------------------------------------------------------------------
        aff1 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      white2 |   -1.09842   .0716505   -15.33   0.000    -1.238852   -.9579877
       _cons |  -.0567415    .059649    -0.95   0.341    -.1736513    .0601683
------------------------------------------------------------------------------

(Why is the "if" qualifier used?) The difference in the model chi2 between the first and second models can be used to evaluate whether or not the addition of the "ideology" covariate adds anything. For the first model (call it M0), we see the model chi2 is 271.19, and for the second model (call it M1), the model chi2 is 236.94. The difference then is M0-M1, or 271.19-236.94=34.25. On 1 degree of freedom, the p-value is:
. display chi2tail(1, 34.25)
4.847e-09

which clearly shows that the addition of the ideology covariate improves the fit of the model. There is a simpler way to do this. Stata has a command called lrtest, which can be applied to any model that reports log likelihoods. The command works by first estimating the "full" model and then estimating the "simpler" or reduced model. To illustrate, let me reestimate M0 (I'll suppress the output):

. probit aff1 white2 ideo
[output suppressed]

Next I type:

. lrtest, saving(0)

Then I estimate my simpler model:

. probit aff1 white2 if ideo~=.
[output suppressed]

Next I type:

. lrtest, saving(1)

Now I have two model chi2 statistics saved: one for M0 and one for M1. To compute the likelihood ratio test, I type:

. lrtest

Probit:  likelihood-ratio test                    chi2(1)      =      34.26
                                                  Prob > chi2  =     0.0000

Compare this to my results from above. They are identical (as they should be).
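The arithmetic in this section is simple enough to reproduce outside Stata. A Python sketch (mine, not the notes'): for one degree of freedom, chi2tail(1, x) equals 2*(1 - Phi(sqrt(x))), since a chi-squared(1) variate is the square of a standard normal.

```python
import math
from statistics import NormalDist

def lr_statistic(loglike_restricted, loglike_full):
    """LR = -2 * (logL_restricted - logL_full), asymptotically chi-squared."""
    return -2.0 * (loglike_restricted - loglike_full)

def chi2tail_1df(x):
    """Upper-tail p-value for a chi-squared(1) statistic, via the normal CDF."""
    return 2.0 * (1.0 - NormalDist().cdf(math.sqrt(x)))

# The two tests worked above, from the reported log likelihoods:
lr_full = lr_statistic(-1041.0041, -905.40688)    # reproduces 271.19444
lr_random = lr_statistic(-1077.9585, -1077.9214)  # reproduces .0742
```

Here chi2tail_1df(lr_random) returns about .7853, matching Stata's chi2tail(1,.0742), and chi2tail_1df(34.25) is on the order of 5e-09, matching the display above.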