7. Regression with a binary dependent variable

Up to now:
• Dependent variable Y has a metric scale (it can take on any value on the real line)

In this section:
• Y takes on either the value 1 or 0 (binary variable)
• We aim at finding out and modeling which determinants (regressors X1, . . . , Xk) cause Y to take on the value 1 or 0

Examples:
• What is the effect of a tuition subsidy on an individual's decision to go to college (Y = 1)?
• Which factors determine whether a teenager takes up smoking (Y = 1)?
• What determines if a country receives foreign aid (Y = 1)?
• What determines if a job applicant is successful (Y = 1)?

Data set examined in this section:
• Boston Home Mortgage Disclosure Act (HMDA) data set
• Which factors determine whether a mortgage application is denied (Y ≡ DENY = 1) or approved (Y ≡ DENY = 0)?
• Potential factors (regressors):
  - The required loan payment (P) relative to the applicant's income (I): X1 ≡ P/I RATIO
  - The applicant's race: X2 ≡ BLACK = 1 if the applicant is black, 0 if the applicant is white

7.1. The linear probability model

[Figure: Scatterplot of mortgage application denial and the payment-to-income ratio, with the OLS regression line]

Meaning of the OLS regression line:
• Plot of the predicted value of Y = DENY as a function of the regressor X1 = P/I RATIO
• For example, when P/I RATIO = 0.3 the predicted value of DENY is 0.2
• General interpretation (for k regressors):

  E(Y|X1, . . . , Xk) = 0 · Pr(Y = 0|X1, . . . , Xk) + 1 · Pr(Y = 1|X1, . . . , Xk)
                      = Pr(Y = 1|X1, . . . , Xk)

  −→ The predicted value from the regression line is the probability that Y = 1 given the values of the regressors X1, . . . , Xk

Definition 7.1: (Linear probability model)
The linear probability model is the linear multiple regression model

  Yi = β0 + β1 · X1i + . . . + βk · Xki + ui    (7.1)

applied to a binary dependent variable Yi.

Remarks:
• Since Y is binary, it follows that
  Pr(Y = 1|X1, . . . , Xk) = β0 + β1 · X1 + . . . + βk · Xk
• The coefficient βj is the change in the probability that Y = 1 associated with a unit change in Xj, holding constant the other regressors
• The regression coefficients can be estimated by OLS (see the sketch at the end of this subsection)
• The errors of the linear probability model are always heteroskedastic
  −→ Use heteroskedasticity-robust standard errors for confidence intervals and hypothesis tests
• The R2 is not a useful measure-of-fit (alternative measures-of-fit are discussed later)

Application to Boston HMDA data:
• OLS regression of DENY on P/I RATIO yields

  DENY^ = −0.080 + 0.604 · P/I RATIO
          (0.032)  (0.098)

• Coefficient on P/I RATIO is positive and significant at the 1% level
• If P/I RATIO increases by 0.1, the probability of denial increases by 0.604 × 0.1 ≈ 0.060 = 6% (predicted change in the probability of denial given a change in the regressor)

Application to Boston HMDA data: [continued]
• Effect of race on the probability of denial, holding constant the P/I RATIO:

  DENY^ = −0.091 + 0.559 · P/I RATIO + 0.177 · BLACK
          (0.029)  (0.089)             (0.025)

• Coefficient on BLACK is positive and significant at the 1% level
  −→ An African American applicant has a 17.7% higher probability of having a mortgage application denied than a white applicant (holding constant the P/I RATIO)
• Potentially omitted factors:
  - Applicant's earning potential
  - Applicant's credit history
  (see class for a detailed case study)

Major shortcoming of the linear probability model:
• Probabilities cannot fall below 0 or exceed 1
  −→ Effect on Pr(Y = 1) of a given change in X must be nonlinear
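The HMDA extract itself is not reproduced on these slides, so the following is a minimal sketch of estimating a linear probability model with heteroskedasticity-robust standard errors on simulated stand-in data; the variable names (deny, pi_ratio, black) and the data-generating process are assumptions, not the actual HMDA data.

```python
import numpy as np
import statsmodels.api as sm

# Simulated stand-in for the HMDA variables (names and DGP are assumptions)
rng = np.random.default_rng(0)
n = 1000
pi_ratio = rng.uniform(0.1, 0.8, n)          # payment-to-income ratio
black = rng.binomial(1, 0.15, n)             # race dummy
p_deny = np.clip(-0.091 + 0.559 * pi_ratio + 0.177 * black, 0, 1)
deny = rng.binomial(1, p_deny)               # binary dependent variable

# Linear probability model: OLS with heteroskedasticity-robust (HC1) SEs
X = sm.add_constant(np.column_stack([pi_ratio, black]))
lpm = sm.OLS(deny, X).fit(cov_type="HC1")
print(lpm.params)   # estimates of beta_0, beta_1, beta_2
print(lpm.bse)      # heteroskedasticity-robust standard errors

# Fitted "probabilities" can leave [0, 1] -- the model's major shortcoming
print(lpm.predict(X).min(), lpm.predict(X).max())
```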
7.2. Probit and logit regression

Now:
• Two alternative nonlinear formulations that force the predicted probabilities Pr(Y = 1|X1, . . . , Xk) to range between 0 and 1
• The probit regression model uses the standard normal cumulative distribution function (cdf)
• The logit regression model uses the logistic cdf

[Figure: Probit model of the probability of DENY, given P/I RATIO]

Definition 7.2: (Probit regression model)
The population probit model with multiple regressors is given by

  Pr(Y = 1|X1, . . . , Xk) = Φ(β0 + β1 · X1 + . . . + βk · Xk),    (7.2)

where the dependent variable Y is binary, Φ(·) is the cumulative standard normal distribution function, and X1, . . . , Xk are the regressors.

Remarks:
• The effect on the predicted probability of a change in a regressor is obtained by computing the predicted probabilities
  1. for the initial Xj-value,
  2. for the changed Xj-value,
  3. and by taking their difference
• The probit coefficients and the standard errors are typically estimated using the method of maximum likelihood (MLE) (see Section 7.3)

Application to Boston HMDA data:
• Fit of a probit model to Y = DENY and X1 = P/I RATIO:

  Pr^(Y = 1|X1) = Φ(−2.19 + 2.97 · P/I RATIO)
                     (0.16)  (0.47)

• P/I RATIO is positively related to the probability of denial
• Relationship is statistically significant at the 1% level (t-statistic = 2.97/0.47 = 6.32)

Application to Boston HMDA data: [continued]
• Change in the probability of denial when P/I RATIO changes from 0.3 to 0.4 (a code sketch follows at the end of this subsection):

  Pr(Y = 1|X1 = 0.3) = Φ(−2.19 + 2.97 · 0.3) = Φ(−1.30) = 0.097
  Pr(Y = 1|X1 = 0.4) = Φ(−2.19 + 2.97 · 0.4) = Φ(−1.00) = 0.159

  −→ Estimated change in the probability of denial:
  Pr(Y = 1|X1 = 0.4) − Pr(Y = 1|X1 = 0.3) = 0.159 − 0.097 = 0.062 = 6.2%

Application to Boston HMDA data: [continued]
• Fit of a probit model to Y = DENY, X1 = P/I RATIO and X2 = BLACK:

  Pr^(Y = 1|X1, X2) = Φ(−2.26 + 2.74 · P/I RATIO + 0.71 · BLACK)
                         (0.16)  (0.44)             (0.083)

• When P/I RATIO = 0.3, then
  Pr^(Y = 1|X1 = 0.3, X2 = 0) = Φ(−1.438) = 0.075 = 7.5% (white applicant)
  Pr^(Y = 1|X1 = 0.3, X2 = 1) = Φ(−0.728) = 0.233 = 23.3% (black applicant)

Definition 7.3: (Logit regression model)
The population logit model with multiple regressors is given by

  Pr(Y = 1|X1, . . . , Xk) = F(β0 + β1 · X1 + . . . + βk · Xk),    (7.3)

where F(·) denotes the cdf of the logistic distribution defined as

  F(x) = 1 / (1 + exp{−x}).

Remarks:
• The logit regression is similar to the probit regression, but uses a different cdf
• The computation of predicted probabilities is performed analogously to the probit model
• The logit coefficients and standard errors are estimated by the maximum likelihood technique
• In practice, logit and probit regressions often produce similar results (see the comparison sketch below)

[Figure: Probit and logit models of the probability of DENY, given P/I RATIO]
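The worked example above (the change of P/I RATIO from 0.3 to 0.4) can be reproduced directly from the reported coefficients; a minimal sketch:

```python
from scipy.stats import norm  # norm.cdf is Phi, the standard normal cdf

# Probit coefficients reported on the slides: beta_0 = -2.19, beta_1 = 2.97
b0, b1 = -2.19, 2.97

p_03 = norm.cdf(b0 + b1 * 0.3)  # Phi(-1.30) ~ 0.097
p_04 = norm.cdf(b0 + b1 * 0.4)  # Phi(-1.00) ~ 0.159

# Estimated change in the probability of denial: ~0.06
# (the slides report 0.062 after rounding the arguments to -1.30 and -1.00)
print(round(p_04 - p_03, 3))
```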
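To illustrate that logit and probit often produce similar results, here is a minimal sketch fitting both to the same simulated data; the variable names and the data-generating process are assumptions, not the HMDA data.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

# Simulated data from a probit DGP (an assumption, not the HMDA data)
rng = np.random.default_rng(0)
n = 1000
pi_ratio = rng.uniform(0.1, 0.8, n)
deny = rng.binomial(1, norm.cdf(-2.19 + 2.97 * pi_ratio))

X = sm.add_constant(pi_ratio)
probit_fit = sm.Probit(deny, X).fit(disp=0)  # ML estimation (see Section 7.3)
logit_fit = sm.Logit(deny, X).fit(disp=0)

# Predicted probabilities on a grid of P/I RATIO values:
# typically very close for the two models
grid = sm.add_constant(np.linspace(0.1, 0.8, 5))
print(np.column_stack([probit_fit.predict(grid), logit_fit.predict(grid)]))
```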
7.3. Estimation and inference in the logit and probit models

Alternative estimation techniques:
• Nonlinear least squares (NLS) estimation by minimizing the sum of squared prediction mistakes:

  Σ_{i=1}^{n} [Yi − Φ(b0 + b1 · X1i + . . . + bk · Xki)]² −→ min over b0, . . . , bk    (7.4)

  (see Eq. (2.2) on Slide 12)
• Maximum likelihood estimation

Nonlinear least squares estimation:
• NLS estimators are consistent and normally distributed in large samples
• However, NLS estimators are inefficient, that is, there are other estimators having a smaller variance than the NLS estimators
  −→ Use of maximum likelihood estimators

Maximum likelihood estimation:
• ML estimators are consistent and normally distributed in large samples
• More efficient than NLS estimators
• ML estimation is discussed in the lecture Advanced Statistics

Statistical inference based on MLE:
• Since ML estimators are normally distributed in large samples, statistical inference about probit and logit coefficients based on MLE proceeds in the same way as inference about the coefficients of the linear regression function based on the OLS estimator

In particular:
• Hypothesis tests are performed using the t- and F-statistics (see Sections 3.2.–3.4.)
• Confidence intervals are constructed according to Formula (3.3) on Slide 55

Measures-of-fit:
• The conventional R2 is inappropriate for probit and logit regression models
• Two frequently encountered measures-of-fit with binary dependent variables are the
  - Fraction correctly predicted
  - Pseudo-R2

Fraction correctly predicted:
• This measure-of-fit is based on a simple classification rule
• An observation Yi is said to be correctly predicted
  - if Yi = 1 and Pr^(Yi = 1|X1i, . . . , Xki) > 0.5, or
  - if Yi = 0 and Pr^(Yi = 1|X1i, . . . , Xki) < 0.5
• Otherwise Yi is said to be incorrectly predicted
• The fraction correctly predicted is the fraction of the n observations Y1, . . . , Yn that are correctly predicted

Pseudo-R2:
• The Pseudo-R2 compares the value of the maximized likelihood function with all regressors to the value of the likelihood function with no regressor (a code sketch of both measures follows at the end of this section)

Case study:
• Application to Boston HMDA data (see class)

Other limited dependent variable models:
• Censored and truncated regression models
• Sample selection models
• Count data
• Ordered responses
• Discrete choice data
• For details see Ruud (2000) and Wooldridge (2002)
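Both measures-of-fit from Section 7.3 are easy to compute from a fitted model; a minimal sketch on simulated data (the variable names and the data-generating process are assumptions, not the HMDA data):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

# Simulated data from a probit DGP (an assumption, not the HMDA data)
rng = np.random.default_rng(0)
n = 1000
pi_ratio = rng.uniform(0.1, 0.8, n)
deny = rng.binomial(1, norm.cdf(-2.19 + 2.97 * pi_ratio))

X = sm.add_constant(pi_ratio)
fit = sm.Probit(deny, X).fit(disp=0)  # ML estimation

# Fraction correctly predicted: classify Y_i as 1 if the predicted
# probability exceeds 0.5, then count the matches with the observed Y_i
correct = ((fit.predict(X) > 0.5).astype(int) == deny).mean()
print("Fraction correctly predicted:", correct)

# Pseudo-R^2: 1 - logL(all regressors) / logL(intercept only)
print("Pseudo-R^2:", 1 - fit.llf / fit.llnull)
print(fit.prsquared)  # statsmodels reports the same quantity directly
```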