Binary Choice Models

Topic Overview
• Introduction to binary choice models
• The Linear Probability Model (LPM)
• The Probit model
• The Logit model

Introduction
• In some cases the outcome of interest (Y) is not quantitative but a binary decision:
– Go to college or not
– Adopt a technology or not
– Join the union or not
• For example, how well do an individual’s socioeconomic characteristics explain his/her decision to join a trade union?
• Such models are often used to model decisions: to invest or not, to enter a market or not, to hire or not, to adopt a technology or not…
• A binary dependent variable (Y) complicates the estimation process
• Only suitable where we can plausibly narrow the decision alternatives down to two.
• These are qualitative models in which the choice is between two discrete, mutually exclusive and jointly exhaustive alternatives.
• Y, the dependent variable in these models, is binary or dichotomous; it can only take the values 0 or 1.
• Also known as ‘rational choice’ models, as Y often represents a rational choice between two alternatives. The Xs are the factors that are expected to contribute to the selection of one outcome over another.

An Example
• The decision of farmers to adopt the latest technology: Yi = β1 + β2 Xi + ... + ui
• where Yi is a binary variable representing two choices, e.g. to adopt (Y=1) or not to adopt (Y=0) the latest technology
• The decision is influenced by economic, structural, farm and farmer characteristics
• For example costs, farm size, age of the farmer, access to credit, etc.
• So we might, for instance, find that the age of the farmer negatively affects the probability of adoption, while farm size has a positive effect

Alternative Models
• There are several ways to estimate a binary choice model:
1. The Linear Probability Model (LPM)
2. The Probit Model
3. The Logit Model

The Linear Probability Model
• A linear regression model with a binary dependent variable: Yi = β1 + β2 X2i + β3 X3i + … + ui
• The conditional expectation E(Yi|Xi) can be interpreted as the conditional probability that the event (Yi = 1) will occur given Xi: E(Y|X1, X2,…, Xk) = P(Y=1|X1, X2,…, Xk)
• E(Yi|Xi) might express the probability of purchasing a durable good (e.g. a car) for a given level of Xi (e.g. income).
• Estimated with OLS
• With a single regressor: E(Yi|Xi) = β1 + β2 Xi = Pi [ui drops out since we have assumed that E(ui)=0]
• Pi = probability that Yi = 1 and (1-Pi) = probability that Yi = 0
• Yi therefore follows the Bernoulli probability distribution, so that E(Yi) = 0(1-Pi) + 1(Pi) = Pi
• Because E(Yi|Xi) is a probability, it is restricted in the values it can take: 0 ≤ E(Yi|Xi) ≤ 1

An Example of the LPM
• Estimate the determinants of trade union membership (variable union)
• An ordinary OLS regression with union as the dependent variable
• In Stata: regress union exp sex

Stata Output and Interpretation
• The estimates can be interpreted as follows:
• “the slope coefficient measures the change in the average value of the regressand for a unit change in the value of a regressor, with all other variables held constant”
• In this case, holding other variables constant, a one-unit increase in exp (on-the-job experience) increases the probability of union membership by 0.004 (0.4 percentage points)

Problems: LPM
• A simple model, but with important shortcomings:
1. Non-normality of the disturbances
2. Heteroscedasticity
3. Nonsense probabilities
4. Implausibility of linearity

Non-normality of the disturbances
• In the LPM the disturbances ui are: ui = Yi - β1 - β2 Xi
• Just like Yi, ui takes only two values for a given Xi.
• The assumption that ui is normally distributed (needed for exact finite-sample inference) therefore cannot hold.
• In fact, the probability distribution of ui in the LPM is:
– ui = 1 - β1 - β2 Xi with probability Pi (when Yi = 1)
– ui = -β1 - β2 Xi with probability 1-Pi (when Yi = 0)
• This is less of a problem in large samples: by the central limit theorem the OLS estimators are approximately normally distributed, so the usual inference remains approximately valid

Heteroscedasticity
• The Bernoulli probability distribution implies, by definition, a non-constant variance
• Specifically, the variance is: var(ui) = Pi(1-Pi)
• Since Pi = E(Yi|Xi) = β1 + β2 Xi varies with Xi, the variance differs across observations and we can no longer assume a constant variance
• The usual remedial measures may be employed to correct for heteroscedasticity (e.g. WLS)

Nonsense Probabilities
• Because it is a probability: 0 ≤ E(Yi|Xi) ≤ 1
• In practice, though, the OLS fitted values Ŷi (the estimates of E(Yi|Xi)) may be greater than 1 or less than 0.
• We can still ‘constrain’ those estimates to the desired boundaries, but the adjustment is crude: if some of the estimated Ŷs are less than 0 (that is, negative), Ŷi is set to zero for those cases; if they are greater than 1, they are set to 1.

Implausible Linearity
• The LPM assumes a linear relationship between the levels of the X variable(s) and the probability that Y=1.
• This linearity (a constant effect of X on the probability that Y=1) is very implausible.
• Consider a family’s decision to own a house: would an extra unit of income change the probability by the same amount at all levels of income?
• It is more plausible that the effect of income on the probability differs across income levels…

All this indicates that the LPM is probably not a very good model.
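The LPM problems listed above can be seen on a very small example. The sketch below is only an illustration under stated assumptions: the ten observations are hypothetical (they are not the union data used in the slides), `x` stands for an income level and `y` for house ownership, and the helper `ols_simple` is a name introduced here. It fits the LPM by OLS, shows that some fitted "probabilities" fall outside the 0-1 interval, applies the crude clamping rule, and computes the implied variance Pi(1-Pi) for each observation.

```python
# Hypothetical data (not from the slides): x = income level, y = 1 if the family owns a house.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]


def ols_simple(x, y):
    """Return (b1, b2), the OLS intercept and slope of y = b1 + b2*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b2 = sxy / sxx
    b1 = y_bar - b2 * x_bar
    return b1, b2


b1, b2 = ols_simple(x, y)

# LPM fitted values: estimates of P(Y=1 | X) = b1 + b2*X
fitted = [b1 + b2 * xi for xi in x]

# Nonsense probabilities: fitted values below 0 or above 1
out_of_range = [p for p in fitted if p < 0 or p > 1]

# Crude fix described in the slides: set negatives to 0 and values above 1 to 1
clamped = [min(1.0, max(0.0, p)) for p in fitted]

# Heteroscedasticity: var(u_i) = P_i * (1 - P_i) changes with X
implied_var = [p * (1 - p) for p in clamped]

print("intercept, slope:", round(b1, 4), round(b2, 4))
print("fitted values   :", [round(p, 3) for p in fitted])
print("out of range    :", [round(p, 3) for p in out_of_range])
print("implied variance:", [round(v, 3) for v in implied_var])
```

With real data the same checks can be made on the fitted values of any OLS regression with a binary dependent variable; the clamping step only hides the problem and does not change the linear functional form.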
Probit and logit models offer significant advantages and should be preferred.

Probit and Logit Models
• The probit and the logit models are less problematic alternatives:
– The relationship between Pi and Xi is non-linear
– As Xi increases, the conditional probability of the event occurring, Pr(Yi=1|Xi), increases (for β2 > 0) but never steps outside the 0-1 interval
– Because of this built-in non-linearity, both are estimated by the Maximum-Likelihood (ML) method rather than OLS
• Both model Pi with an S-shaped cumulative distribution function (CDF).
• The probit uses the normal distribution; the logit uses the logistic distribution.
• Unlike in the linear probability model, the predicted probabilities always lie between 0 and 1.

The Probit Model
• The probit model can be derived from an underlying latent variable model that satisfies the classical linear assumptions
• The outcome decision depends on an unobservable utility index: Ii = β1 + β2 Xi + ui
• For example, the decision Y to own a house (Y=1) or not (Y=0) depends on an unobservable utility index Ii that is determined by Xi (e.g. income, number of children)
• The larger the value of Ii, the greater the probability that Y=1 (e.g. owning a house)
• The latent (unobservable) variable Ii is linked to the observed decision Yi by:
$$ Y_i = \begin{cases} 1 & \text{if } I_i > 0 \\ 0 & \text{if } I_i \le 0 \end{cases} $$
• If a person’s utility index Ii exceeds the threshold level I* (here assumed to be 0), then Y=1; if not, Y=0
• It is assumed that the error term u is independent of X and follows a standard normal distribution
• The error is symmetrically distributed about 0, which means that 1-F(-z) = F(z)
• Hence the normal distribution allows us to compute the probability that Y=1:
$$ P(Y=1 \mid X) = P(I_i > 0) = P(\beta_1 + \beta_2 X_i + u_i > 0) = P\big(u_i > -(\beta_1 + \beta_2 X_i)\big) = 1 - F\big(-(\beta_1 + \beta_2 X_i)\big) = F(\beta_1 + \beta_2 X_i) $$
• With F being the standard normal cumulative distribution function (CDF):
$$ F(I_i) = \Phi(I_i) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{I_i} e^{-z^2/2}\, dz $$
• This ensures that the probability is strictly between 0 and 1

The Normal CDF
• That is, in the probit model, Pi, the conditional probability that Yi=1 (given Xi), follows the normal CDF.
• So if we plot the probabilities that Yi=1 against different (given) X values, we get an S-shaped curve:
[Figure: normal CDF; vertical axis Pi, horizontal axis Xi running from -∞ through 0 to +∞]

[Stata Output]

Interpreting the Results: Probit
• Interpreting the slope coefficients of the probit model is complicated
• The marginal effect of regressor Xj on the probability that Y=1 is:
$$ \frac{\partial P_i}{\partial X_{ji}} = \beta_j \, \phi(Z_i) $$
• where φ(Zi) is the probability density function of the standard normal variable and Zi = β1 + β2 X2i + ... + βk Xki
• The sign of the marginal effect is the same as the sign of βj
• The magnitude of the change depends on the magnitude of βj and the value of Zi
• All X variables are involved in computing the changes in probability
• Marginal effects vary for different levels of X; it is customary to evaluate them at the means of the variables.
• …it follows that the marginal effects of X on the probability that Y=1 vary for different levels of X
[Figure: normal CDF, Pi against Xi]
• Marginal effects are low at extreme values of X and high at central values.
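The claim that probit marginal effects are small at extreme values of X and largest near the centre can be checked numerically. The following is a minimal sketch, not Stata's implementation: the coefficients `beta1` and `beta2` are hypothetical values chosen for illustration, and the function names are introduced here. It evaluates the probability Φ(β1 + β2 X) and the marginal effect β2 φ(β1 + β2 X) at three values of X, and compares the analytical marginal effect with a finite-difference slope of the probability.

```python
import math


def norm_pdf(z):
    """Standard normal probability density function, phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal cumulative distribution function, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical probit coefficients (assumed for illustration only)
beta1, beta2 = -1.0, 0.5


def prob(x):
    """P(Y=1 | X=x) = Phi(beta1 + beta2 * x)."""
    return norm_cdf(beta1 + beta2 * x)


def marginal_effect(x):
    """dP/dX at X=x, i.e. beta2 * phi(beta1 + beta2 * x)."""
    return beta2 * norm_pdf(beta1 + beta2 * x)


def numerical_slope(x, h=1e-5):
    """Central finite-difference approximation of dP/dX at X=x."""
    return (prob(x + h) - prob(x - h)) / (2.0 * h)


for x in (-4.0, 2.0, 8.0):
    print(
        f"X={x:5.1f}  P(Y=1)={prob(x):.4f}  "
        f"marginal effect={marginal_effect(x):.4f}  "
        f"finite difference={numerical_slope(x):.4f}"
    )
```

With these assumed coefficients the index Z equals zero at X = 2, which is where the density φ(Z), and therefore the marginal effect, is at its maximum. In Stata the corresponding quantities are obtained after `probit` with the `margins` command.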
Stata Output
• Marginal effect of exp, with Zi built from the estimated coefficients:
$$ 0.0064 \times \phi\big(-1.12 + 0.006\,exp - 0.54\,sex - 0.33\,sth + 0.01\,age\big) $$

Interpreting the Results: Probit
• If X2 is a binary variable, the marginal effect of changing it from 0 to 1 is:
$$ F(\beta_1 + \beta_2 \cdot 1 + \beta_3 X_3 + \dots + \beta_k X_k) - F(\beta_1 + \beta_2 \cdot 0 + \beta_3 X_3 + \dots + \beta_k X_k) $$
• Again, this depends on the values of all the other explanatory variables

The Logit Model
• The logit model is similar to the probit model; the key difference is that it is based on the logistic CDF rather than the normal CDF.
• If the utility index exceeds the threshold level I*, Y=1, otherwise Y=0:
$$ Y_i = \begin{cases} 1 & \text{if } I_i > 0 \\ 0 & \text{if } I_i \le 0 \end{cases} $$
$$ P(Y=1 \mid X) = P(I_i > 0) = P(\beta_1 + \beta_2 X_i + u_i > 0) = P\big(u_i > -(\beta_1 + \beta_2 X_i)\big) = F(\beta_1 + \beta_2 X_i) $$
• Assuming F to be a logistic CDF:
$$ F(I_i) = \frac{1}{1 + e^{-Z_i}} = \frac{e^{Z_i}}{1 + e^{Z_i}} $$
• where Zi = β1 + β2 Xi

The Logistic CDF
[Figure: logistic CDF; vertical axis Pi with upper bound 1, horizontal axis Xi running from -∞ through 0 to +∞]

Interpreting the Results: Logit
• The ratio of the two probabilities gives the odds in favour of the outcome:
$$ \frac{P_i}{1 - P_i} = \frac{1 + e^{Z_i}}{1 + e^{-Z_i}} = e^{Z_i} $$
• The logit model produces easily communicable odds ratios: the multiplicative change in the odds of Y=1 from a one-unit increase in each independent variable.
• The ratio P/(1-P) is the odds in favour of owning a house: the ratio of the probability that a family will own a house to the probability that it will not own a house
• Marginal effects can be calculated in the same way as for the probit model
• It is also possible to calculate the odds ratios
• odds ratio = e^β: since Pi/(1-Pi) = e^Zi, a one-unit increase in X multiplies the odds by e^β
• where e (the base of the natural logarithm) equals approximately 2.71828
• If e^β is greater than 1, the odds are e^β times larger
• If e^β is less than 1, the odds are multiplied by e^β, i.e. they are 1/e^β times smaller
• Positive coefficients give odds ratios greater than 1, while negative coefficients give odds ratios between 0 and 1

Stata Output
• e^β = 2.71828^(-0.9625674) = 0.3819
• Holding other regressors constant, the odds of being a union member for women (sex=1) are about 0.38 times those for men, i.e. approximately 2.6 times lower (1/0.3819 ≈ 2.62)

Estimation: Probit and Logit
• OLS is not appropriate because the model is non-linear not only in the variables but also in the parameters (the betas).
• Maximum Likelihood is the suitable method: it chooses the betas that maximise the likelihood function, i.e. the values under which the probability of observing the given Ys is highest.
• For the precise mechanics, see GUJ, Appendix 15A, p. 633.
• In practice software (in our case Stata) does all the hard work for us:
• Command syntax in Stata: probit/logit <Y variable> <X variables>
• e.g. probit union exp sex

[Stata Output]

Inference: Probit and Logit
• Likelihood ratio (LR) statistic:
– Tests the null hypothesis that all slope coefficients (βs) are jointly zero (the analogue of the F-test in the linear regression model).
– The LR statistic follows the chi-square distribution (χ2) with df equal to the number of explanatory variables (constant not included), e.g. LR chi2(3) = 27.55
• Wald statistic:
– Tests the null hypothesis that an individual β=0 (the analogue of the t-statistic)
– Inferences are based on the normal table (in large samples the t-distribution converges to the normal distribution)
• Stata reports p-values for both tests: the probability, if the null hypothesis is true, of obtaining a test statistic at least as extreme as the one observed.

[Stata Output]

Goodness of Fit: Probit and Logit
• The conventional R2 is not very meaningful in probit or logit models.
• Many alternative measures have been proposed; the most widely used are the Count R2 and the McFadden R2.
• Count R2: the number of correct predictions divided by the total number of observations, where an observation is predicted as Y=1 if its predicted probability exceeds 0.5
• McFadden R2 (Pseudo R2), calculated as:
$$ R^2_{McF} = 1 - \frac{\log L}{\log L_0} $$
where L is the likelihood of the estimated model and L0 is the likelihood of a model with only a constant
• The expected signs and the statistical significance of the coefficients are also important

Example in Stata
• After estimating either a probit or a logit, type fitstat (a user-written command from the SPost package) to obtain goodness-of-fit statistics.

Probit or Logit?
• The respective CDFs are almost identical.
• The two models can be used interchangeably: there are no good theoretical reasons to prefer one over the other.
• Their results should be qualitatively identical; i.e. we should get the same coefficient signs regardless of whether we use probit or logit.
• “… if you multiply the probit coefficient by about 1.81 (which is approximately π/√3), you will get approximately the logit coefficient (…) Alternatively, if you multiply a logit coefficient by 0.55 (= 1/1.81), you will get the probit coefficient”
• Sometimes the logit is preferred due to the easy interpretation of its coefficients through odds ratios
• Sometimes the probit is preferred due to its normal distribution assumption
• You can begin by running a logit and performing the tests, and then, as a robustness check, also run a probit and compare the output.

Example: Binary Choice

Discussion Group Membership

              Probit                           Logit
Variable      Coefficients      Marginal      Coefficients       Marginal      Odds
                                effects                          effects       ratio
BMW           -.393* (.221)     -.1228        -.627* (.377)      -.114         .5340
SW            -.516** (.237)    -.155         -.861** (.406)     -.1504        .4227
East          -.449** (.209)    -.1404        -.762** (.355)     -.1395        .4665
Herd size     .017*** (.003)    .0056         .0284*** (.005)    .00568        1.028
LU/ha         .219 (.189)       .0734         .3402 (.333)       .0679         1.405
Age           -.008** (.008)    -.0028        -.015** (.013)     -.0030        .9848
job           -.353 (.243)      -.109         -.621 (.441)       -.1111        .5374
cons          -1.21** (.559)                  -.621** (.946)
LR chi2(7)    72.48                           71.52
Pseudo R2     0.1810                          0.1786
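Two relationships stated in these slides can be checked against the table: the odds ratio equals e raised to the logit coefficient, and logit coefficients are roughly π/√3 ≈ 1.81 times the probit coefficients. The sketch below is only an arithmetic check under simple assumptions: the numbers are copied from the table (constant omitted), the variable names in the code are introduced here, and small differences are expected because the table reports rounded coefficients.

```python
import math

# (variable, probit coefficient, logit coefficient, reported odds ratio),
# copied from the Discussion Group Membership table; the constant is omitted.
rows = [
    ("BMW", -0.393, -0.627, 0.5340),
    ("SW", -0.516, -0.861, 0.4227),
    ("East", -0.449, -0.762, 0.4665),
    ("Herd size", 0.017, 0.0284, 1.028),
    ("LU/ha", 0.219, 0.3402, 1.405),
    ("Age", -0.008, -0.015, 0.9848),
    ("job", -0.353, -0.621, 0.5374),
]

# Rule-of-thumb scale factor between logit and probit coefficients
scale = math.pi / math.sqrt(3.0)

results = []
for name, probit_b, logit_b, reported_or in rows:
    implied_or = math.exp(logit_b)  # odds ratio = e^beta
    ratio = logit_b / probit_b      # compare with pi/sqrt(3)
    results.append((name, implied_or, reported_or, ratio))
    print(
        f"{name:10s} exp(logit b)={implied_or:.4f}  "
        f"reported OR={reported_or:.4f}  logit/probit={ratio:.2f}"
    )

print(f"pi/sqrt(3) = {scale:.4f}")
```

In this table the implied and reported odds ratios agree to about three decimal places, and the logit/probit ratios fall roughly between 1.55 and 1.9, so the 1.81 factor is best read as an approximation rather than an exact conversion.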