ECON 4551 Econometrics II, Memorial University of Newfoundland
Qualitative and Limited Dependent Variable Models
Adapted from Vera Tabakova's notes for Principles of Econometrics, 3rd Edition

16.1 Models with Binary Dependent Variables
16.2 The Logit Model for Binary Choice
16.3 Multinomial Logit
16.4 Conditional Logit
16.5 Ordered Choice Models
16.6 Models for Count Data
16.7 Limited Dependent Variables

16.1 Models with Binary Dependent Variables

Examples:
- An economic model explaining why some individuals take a second, or third, job and engage in "moonlighting."
- An economic model of why the federal government awards development grants to some large cities and not others.
- An economic model explaining why someone is in the labour force or not.
- An economic model explaining why some loan applications are accepted and others not at a large metropolitan bank.
- An economic model explaining why some individuals vote "yes" for increased spending in a school board election and others vote "no."
- An economic model explaining why some female college students decide to study engineering and others do not.

As long as the options are mutually exclusive and exhaust the possibilities:

$y = 1$ if the individual drives to work; $y = 0$ if the individual takes the bus to work   (16.1)

If the probability that an individual drives to work is $p$, then $P[y=1] = p$, and it follows that the probability that a person uses public transportation is $P[y=0] = 1-p$. So

$f(y) = p^y (1-p)^{1-y}, \quad y = 0, 1$   (16.2)

with $E(y) = p$ and $\text{var}(y) = p(1-p)$.

The linear probability model:

$y = E(y) + e = p + e$   (16.3)

$E(y) = p = \beta_1 + \beta_2 x$   (16.4)

$y = E(y) + e = \beta_1 + \beta_2 x + e$   (16.5)

One problem with the linear probability model is that the error term is heteroskedastic: the variance of the error term $e$ varies from one observation to another.

    y value    e value                    probability
    1          $1-\beta_1-\beta_2 x$      $p = \beta_1+\beta_2 x$
    0          $-(\beta_1+\beta_2 x)$     $1-p = 1-\beta_1-\beta_2 x$

so $\text{var}(e) = (\beta_1+\beta_2 x)(1-\beta_1-\beta_2 x)$.

Using generalized least squares, with the estimated variances

$\widehat{\text{var}}(e_i) = (b_1+b_2 x_i)(1-b_1-b_2 x_i) = \hat\sigma_i^2$   (16.6)

we transform the data as $y_i^* = y_i/\hat\sigma_i$ and $x_i^* = x_i/\hat\sigma_i$ and estimate

$y_i^* = \beta_1 \hat\sigma_i^{-1} + \beta_2 x_i^* + e_i^*$

So the problem of heteroskedasticity is not insurmountable. The fitted probability and the marginal effect are

$\hat p = b_1 + b_2 x$   (16.7)

$\frac{dp}{dx} = \beta_2$   (16.8)

Problems:
- We can easily obtain values of $\hat p$ that are less than 0 or greater than 1.
- Some of the estimated variances in (16.6) may be negative, so WLS would not work.
- Of course, the errors are not normally distributed.
- R2 is usually very poor and a questionable guide for goodness of fit.

The probit model

[Figure 16.1: (a) standard normal cumulative distribution function; (b) standard normal probability density function]

$\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}, \qquad \Phi(z) = P[Z \le z] = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} e^{-u^2/2}\, du$   (16.9)

$p = P[Z \le \beta_1 + \beta_2 x] = \Phi(\beta_1 + \beta_2 x)$   (16.10)

Since $\Phi$ is a cumulative distribution function, the marginal effect follows from the chain rule:

$\frac{dp}{dx} = \frac{d\Phi(t)}{dt} \cdot \frac{dt}{dx} = \phi(\beta_1 + \beta_2 x)\,\beta_2$   (16.11)

where $t = \beta_1 + \beta_2 x$ and $\phi(\beta_1 + \beta_2 x)$ is the standard normal probability density function evaluated at $\beta_1 + \beta_2 x$. Note that this is clearly a nonlinear model: the marginal effect varies depending on where you measure it.
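To see the nonlinearity concretely, here is a minimal Stata sketch that evaluates (16.11) at several commuting-time differences, using the transportation-problem estimates reported in (16.15) below; the grid of DTIME values is chosen only for illustration.

    * dp/dx = phi(b1 + b2*x)*b2, with b1 = -0.0644 and b2 = 0.0299 from (16.15)
    foreach x of numlist -20 0 20 40 {
        display "DTIME = `x': dp/dx = " %6.4f normalden(-0.0644 + 0.0299*`x')*0.0299
    }

The effect is largest near the borderline case (fitted probability near 0.5) and shrinks in the tails, which is exactly what the implications listed next describe.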
Equation (16.11) has the following implications:

1. Since $\phi(\beta_1 + \beta_2 x)$ is a probability density function, its value is always positive. Consequently the sign of $dp/dx$ is determined by the sign of $\beta_2$. In the transportation problem we expect $\beta_2$ to be positive, so that $dp/dx > 0$: as $x$ increases, we expect $p$ to increase.

2. As $x$ changes, the value of $\phi(\beta_1 + \beta_2 x)$ changes. The standard normal probability density function reaches its maximum when $z = 0$, that is, when $\beta_1 + \beta_2 x = 0$. In this case $p = \Phi(0) = 0.5$, and an individual is equally likely to choose car or bus transportation. The slope of the probit function $p = \Phi(z)$ is at its maximum when $z = 0$, the borderline case.

3. On the other hand, if $\beta_1 + \beta_2 x$ is large, say near 3, then the probability that the individual chooses to drive is very large and close to 1. In this case a change in $x$ will have relatively little effect, since $\phi(\beta_1 + \beta_2 x)$ will be nearly 0. The same is true if $\beta_1 + \beta_2 x$ is a large negative value, say near -3. These results are consistent with the notion that if an individual is "set" in their ways, with $p$ near 0 or 1, the effect of a small change in commuting time will be negligible.

Predicting the probability that an individual chooses the alternative $y = 1$: with $\hat p = \Phi(b_1 + b_2 x)$,

$\hat y = 1$ if $\hat p > 0.5$, and $\hat y = 0$ if $\hat p \le 0.5$   (16.12)

although you have to be careful with this interpretation!

Maximum likelihood estimation: each observation contributes

$f(y_i) = [\Phi(\beta_1 + \beta_2 x_i)]^{y_i}\,[1 - \Phi(\beta_1 + \beta_2 x_i)]^{1-y_i}, \quad y_i = 0, 1$   (16.13)

and, by independence, $f(y_1, y_2, y_3) = f(y_1) f(y_2) f(y_3)$. Suppose that $y_1 = 1$, $y_2 = 1$ and $y_3 = 0$, and that the values of $x$, in minutes, are $x_1 = 15$, $x_2 = 20$ and $x_3 = 5$. Then

$P[y_1 = 1, y_2 = 1, y_3 = 0] = f(1,1,0) = f(1) f(1) f(0) = \Phi(\beta_1 + 15\beta_2)\,\Phi(\beta_1 + 20\beta_2)\,[1 - \Phi(\beta_1 + 5\beta_2)]$   (16.14)

In large samples the maximum likelihood estimator is normally distributed, consistent, and best in the sense that no competing estimator has smaller variance.

For the transportation data the estimated model is

$\hat p = \Phi(\hat\beta_1 + \hat\beta_2\,DTIME_i) = \Phi(-0.0644 + 0.0299\,DTIME_i)$   (16.15)
  (se)       (0.3992)   (0.0103)

The marginal effect of DTIME, measured at DTIME = 20 (using the unrounded estimates):

$\frac{dp}{dDTIME} = \phi(\hat\beta_1 + \hat\beta_2\,DTIME)\,\hat\beta_2 = \phi(-0.0644 + 0.0299 \times 20)(0.0299) = \phi(0.5355)(0.0299) = (0.3456)(0.0299) = 0.0104$

If it takes someone 30 minutes longer to take public transportation than to drive to work, the estimated probability that auto transportation will be selected is

$\hat p = \Phi(-0.0644 + 0.0299 \times 30) = 0.798$

Since this estimated probability of 0.798 is greater than 0.5, we may want to "predict" that when public transportation takes 30 minutes longer than driving to work, the individual will choose to drive. But again, use this cautiously!

In STATA, using transport.dta:

    . sum

        Variable |  Obs        Mean    Std. Dev.      Min      Max
        ---------+-------------------------------------------------
        autotime |   21    49.34762    32.43491        .2     99.1
         bustime |   21    48.12381    34.63082       1.6     91.5
           dtime |   21   -1.223809    56.91037     -90.7       91
            auto |   21    .4761905    .5117663         0        1

[Figure: scatter of auto (= 1 if auto chosen) against dtime (bus time - auto time), with a fitted line. A linear fit? Clearly unsatisfactory for a 0/1 outcome.]

Note that the reported test statistics use the NORMAL distribution, not the t distribution, because the properties of the probit estimator are asymptotic; you can choose p-values accordingly. (One of the tests in the output below you should understand but not use: what is the meaning of this test?)
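A minimal sketch of the commands behind this output and the figure (assuming transport.dta from the POE datasets is in the working directory; the graph options are just one reasonable choice):

    use transport, clear
    summarize autotime bustime dtime auto
    * 0/1 outcome against dtime with an OLS line overlaid: shows why the
    * linear probability model is unattractive for these data
    twoway (scatter auto dtime) (lfit auto dtime), ///
        ytitle("= 1 if auto chosen") xtitle("bus time - auto time")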
. probit auto dtime

Iteration 0:   log likelihood = -14.532272
Iteration 1:   log likelihood = -6.2074806
Iteration 2:   log likelihood =  -6.165583
Iteration 3:   log likelihood = -6.1651585
Iteration 4:   log likelihood = -6.1651585

Probit regression                             Number of obs =      21
                                              LR chi2(1)    =   16.73
                                              Prob > chi2   =  0.0000
Log likelihood = -6.1651585                   Pseudo R2     =  0.5758

        auto |     Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
       dtime |   .029999   .0102867    2.92   0.004     .0098374    .0501606
       _cons | -.0644338   .3992438   -0.16   0.872    -.8469372    .7180696

(You can request these estimation iterations in GRETL too.)

. mfx compute

Marginal effects after probit
      y = Pr(auto) (predict) = .45971697

    variable |    dy/dx   Std. Err.     z    P>|z|   [   95% C.I.   ]        X
       dtime | .0119068      .0041    2.90   0.004   .003871  .019942  -1.22381

mfx evaluates at the means by default.

What yields cnorm(-0.0597171)? Estimate the intercept-only probit:

. probit auto

Iteration 0:   log likelihood = -14.532272
Iteration 1:   log likelihood = -14.532272

Probit regression                             Number of obs =      21
                                              LR chi2(0)    =   -0.00
                                              Prob > chi2   =       .
Log likelihood = -14.532272                   Pseudo R2     = -0.0000

        auto |     Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
       _cons | -.0597171   .2736728   -0.22   0.827     -.596106    .4766718

cnorm(-0.0597171) = $\Phi(-0.0597171)$ = .4762 is a probability: it is the sample proportion choosing auto (the mean of auto, .4761905, in the summary statistics above).

IN STATA:

    * marginal effects
    mfx
    mfx, at(dtime=20)
    * direct calculation: the marginal effect at DTIME = 30, and the
    * predicted probability at DTIME = 30 (this one is a probability)
    nlcom (normalden(_b[_cons] + _b[dtime]*30)*_b[dtime])
    nlcom (normal(_b[_cons] + _b[dtime]*30))

16.2 The Logit Model for Binary Choice

The logistic random variable $L$ has pdf

$\lambda(l) = \frac{e^{-l}}{(1 + e^{-l})^2}, \quad -\infty < l < \infty$   (16.16)

and cdf

$\Lambda(l) = P[L \le l] = \frac{1}{1 + e^{-l}}$   (16.17)

so the choice probability is

$p = P[L \le \beta_1 + \beta_2 x] = \Lambda(\beta_1 + \beta_2 x) = \frac{1}{1 + e^{-(\beta_1 + \beta_2 x)}}$   (16.18)

Since

$p = \frac{\exp(\beta_1 + \beta_2 x)}{1 + \exp(\beta_1 + \beta_2 x)}$ and $1 - p = \frac{1}{1 + \exp(\beta_1 + \beta_2 x)}$,

the odds ratio is

$\frac{p_i}{1 - p_i} = \exp(\beta_1 + \beta_2 x)$

so that

$\ln\!\left(\frac{p_i}{1 - p_i}\right) = \beta_1 + \beta_2 x$

So the "logit", the log-odds, is actually a fully linear function of $x$:

1. As the probability goes from 0 to 1, the logit goes from $-\infty$ to $+\infty$.
2. The logit is linear, but the probability is not.
3. The explanatory variables are individual specific, but do not change across alternatives.
4. The slope coefficient tells us by how much the log-odds change with a unit change in the variable.

Estimation:

1. The model can in principle be estimated with WLS (due to the heteroskedasticity in the error term) if we have grouped data (glogit in STATA, while blogit will run ML logit on grouped data). IN GRETL: if you want to use logit for the analysis of proportions (where the dependent variable is the proportion of cases having a certain characteristic at each observation, rather than a 1 or 0 variable indicating whether the characteristic is present or not), you should not use the logit command, but rather construct the logit variable, as in

       genr lgt_p = log(p/(1 - p))

2. Otherwise we use MLE on individual data.

Measures of goodness of fit:
- McFadden's pseudo R2 (remember that it does not have any natural interpretation for values between 0 and 1)
- Count R2 (% of correct predictions) (dodgy but common!)
- Etc.
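As a quick sketch of where McFadden's pseudo R2 comes from: it compares the full-model log likelihood with the intercept-only one (both shown in the probit output above). Here e(ll) and e(ll_0) are Stata's stored results after estimation:

    quietly probit auto dtime
    display "McFadden's R2 = " 1 - e(ll)/e(ll_0)
    * or by hand, from the log likelihoods printed above:
    display 1 - (-6.1651585)/(-14.532272)    // = 0.5758, as in the output header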
Measures of goodness of fit are of secondary importance: what counts is the sign of the regression coefficients and their statistical and practical significance. Using MLE, a large-sample method, the estimated standard errors are asymptotic, so we use z test statistics (based on the normal distribution) instead of t statistics, and a likelihood ratio test (with a test statistic distributed as chi-square with df = number of regressors) plays the role of the F test.

How do you obtain all the fit measures? With the user-written fitstat command (from Long and Freese's package):

    Measures of Fit for probit of auto

    Log-Lik Intercept Only:     -14.532    Log-Lik Full Model:           -6.165
    D(19):                       12.330    LR(1):                        16.734
                                           Prob > LR:                     0.000
    McFadden's R2:                0.576    McFadden's Adj R2:             0.438
    ML (Cox-Snell) R2:            0.549    Cragg-Uhler(Nagelkerke) R2:    0.733
    McKelvey & Zavoina's R2:      0.745    Efron's R2:                    0.649
    Variance of y*:               3.915    Variance of error:             1.000
    Count R2:                     0.905    Adj Count R2:                  0.800
    AIC:                          0.778    AIC*n:                        16.330
    BIC:                        -45.516    BIC':                        -13.690
    BIC used by Stata:           18.419    AIC used by Stata:            16.330

See http://www.soziologie.uni-halle.de/langer/logitreg/books/long/stbfitstat.pdf. But be very careful with these measures!

. lstat

    Probit model for auto

                  -------- True --------
    Classified |      D      ~D  |  Total
    -----------+-----------------+-------
         +     |      9       1  |     10
         -     |      1      10  |     11
    -----------+-----------------+-------
       Total   |     10      11  |     21

    Classified + if predicted Pr(D) >= .5
    True D defined as auto != 0

    Sensitivity                     Pr( +| D)   90.00%
    Specificity                     Pr( -|~D)   90.91%
    Positive predictive value       Pr( D| +)   90.00%
    Negative predictive value       Pr(~D| -)   90.91%

    False + rate for true ~D        Pr( +|~D)    9.09%
    False - rate for true D         Pr( -| D)   10.00%
    False + rate for classified +   Pr(~D| +)   10.00%
    False - rate for classified -   Pr( D| -)    9.09%

    Correctly classified                        90.48%

So in STATA the "ones" do not really have to be actual ones, just non-zeros (true D is defined as auto != 0). IN GRETL, if you do not have a binary dependent variable it is assumed ordered unless specified multinomial; if the variable is not discrete, you get an error.

To compute the deviance residuals: predict "newname", deviance. The deviance for a logit model is like the RSS in OLS: the smaller the deviance, the better the fit. And (logit only), to combine this with information about leverage: predict "newnamedelta", ddeviance (a recommended cut-off value for the ddeviance is 4).

. logit auto dtime, nolog

    Logistic regression                           Number of obs =      21
                                                  LR chi2(1)    =   16.73
                                                  Prob > chi2   =  0.0000
    Log likelihood = -6.1660422                   Pseudo R2     =  0.5757

            auto |     Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
           dtime |  .0531098   .0206423    2.57   0.010     .0126517     .093568
           _cons | -.2375754   .7504766   -0.32   0.752    -1.708483    1.233332

. predict pred, p
. predict dev, deviance
. predict delta, ddeviance
. list pred if delta>4

           |     pred |
       13. | .0708038 |

Probit versus logit: comparing the two estimators on a specification with both dtime and bustime:

        Variable |   probit      logit
           dtime |   -.0052     -.0044
         bustime |     .103       .184
           _cons |    -4.73      -8.15
        ---------+---------------------
            chi2 |     24.7       24.5
               N |       21         21
             aic |     10.3       10.5
             bic |     13.5       13.7

Why does the rule of thumb below not work for dtime here?

The choice between probit and logit is a matter of taste nowadays, since we all have good computers. The underlying distributions share a mean of zero but have different variances: $\pi^2/3$ for the logistic and 1 for the standard normal. So estimated slope coefficients differ by a factor of about 1.8 ($\pi/\sqrt{3}$); the logit ones are bigger.
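A one-line check of that scaling factor, using the dtime-only probit and logit fits shown above:

    quietly probit auto dtime
    scalar bp = _b[dtime]
    quietly logit auto dtime
    scalar bl = _b[dtime]
    * here .0531098/.029999 = 1.77, close to the rule-of-thumb pi/sqrt(3) = 1.81
    display "logit/probit slope ratio = " bl/bp "   vs pi/sqrt(3) = " _pi/sqrt(3)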
Watch out for "perfect predictions"! Luckily STATA will flag them for you and drop the culprit observations. Gretl has a mechanism for preventing the algorithm from iterating endlessly in search of a nonexistent maximum. One sub-case of interest is when the perfect prediction problem arises because of a single binary explanatory variable: in that case, the offending variable is dropped from the model and estimation proceeds with the reduced specification. However, it may happen that no single "perfect classifier" exists among the regressors, in which case estimation is simply impossible and the algorithm stops with an error. If this happens, unless your model is trivially misspecified (like predicting whether a country is an oil exporter on the basis of oil revenues), it is normally a small-sample problem: you probably just don't have enough data to estimate your model, and you may want to drop some of your explanatory variables.

Learn about the test (Wald tests based on chi2) and lrtest (likelihood ratio tests) commands, so you can test hypotheses as we did with t tests and F tests in OLS; the two are asymptotically equivalent but can differ in small samples. Learn about the many extra STATA capabilities, if you use it, that will make your post-estimation life much easier; Long and Freese's book is a great resource. GRETL is more limited, but doing things by hand for now will actually be a good thing!

For example, after the logit of auto on dtime:

. listcoef, help

    logit (N=21): Factor Change in Odds

    Odds of: 1 vs 0

            auto |        b        z    P>|z|      e^b   e^bStdX     SDofX
           dtime |  0.05311    2.573    0.010   1.0545   20.5426   56.9104

       b       = raw coefficient
       z       = z-score for test of b=0
       P>|z|   = p-value for z-test
       e^b     = exp(b) = factor change in odds for unit increase in X
       e^bStdX = exp(b*SD of X) = change in odds for SD increase in X
       SDofX   = standard deviation of X

Another example (n = 200):

. logit honcomp female

    Iteration 0:   log likelihood = -115.64441
    Iteration 1:   log likelihood = -113.68907
    Iteration 2:   log likelihood = -113.67691
    Iteration 3:   log likelihood =  -113.6769

    Logistic regression                           Number of obs =     200
                                                  LR chi2(1)    =    3.94
                                                  Prob > chi2   =  0.0473
    Log likelihood = -113.6769                    Pseudo R2     =  0.0170

         honcomp |     Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
          female |  .6513706   .3336752    1.95   0.051    -.0026207    1.305362
           _cons | -1.400088   .2631619   -5.32   0.000    -1.915875   -.8842998

. logit honcomp female, or

    Logistic regression                           Number of obs =     200
                                                  LR chi2(1)    =    3.94
                                                  Prob > chi2   =  0.0473
    Log likelihood = -113.6769                    Pseudo R2     =  0.0170

         honcomp | Odds Ratio   Std. Err.      z    P>|z|    [95% Conf. Interval]
          female |   1.918168   .6400451    1.95   0.051     .9973827    3.689024

Stata users? Go through a couple of examples available online with your own STATA session connected to the internet. Examples:

http://www.ats.ucla.edu/stat/stata/dae/probit.htm
http://www.ats.ucla.edu/stat/stata/dae/logit.htm
http://www.ats.ucla.edu/stat/stata/output/old/lognoframe.htm
http://www.ats.ucla.edu/stat/stata/output/stata_logistic.htm
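A quick check of the odds-ratio output above: the ratio reported by logit with the "or" option is just exp() of the raw coefficient, and the confidence bounds are exp() of the coefficient bounds.

    display exp(.6513706)                       // = 1.918168, the reported odds ratio
    display exp(-.0026207) "  " exp(1.305362)   // = .9973827 and 3.689024, the reported CI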
Keywords: binary choice models; censored data; conditional logit; count data models; feasible generalized least squares; Heckit; identification problem; independence of irrelevant alternatives (IIA); index models; individual and alternative specific variables; individual specific variables; latent variables; likelihood function; limited dependent variables; linear probability model; logistic random variable; logit; log-likelihood function; marginal effect; maximum likelihood estimation; multinomial choice models; multinomial logit; odds ratio; ordered choice models; ordered probit; ordinal variables; Poisson random variable; Poisson regression model; probit; selection bias; tobit model; truncated data.

Reference: Long, S. and J. Freese, Regression Models for Categorical Dependent Variables Using Stata, for all of these topics (available on Google!).

Next: 16.3 Multinomial Logit and 16.4 Conditional Logit.