BINARY DISCRETE CHOICE REGRESSION MODELS

INTRODUCTION

A binary discrete choice regression model is used to describe a data generation process that has two possible outcomes. In economics, these two outcomes are usually alternatives associated with a choice process. Some examples are as follows.

1) Labor participation decision.
2) Loan default decision.
3) Public transportation decision.
4) Physician specialty choice decision.
5) Home purchase decision.

Binary Discrete Choice Models

Three binary discrete choice regression models are used in economics.

1. Linear probability model
2. Probit model
3. Logit model

SPECIFICATION ISSUES

In economics, a binary discrete choice model is used to analyze a choice process. The objective is to better understand the factors that influence a particular type of economic choice. The following are some basic concepts that underlie the specification of a binary discrete choice model.

Decision-Making Unit

To specify a binary discrete choice model, you begin by identifying the decision-making unit of interest. For example, this could be a consumer, firm, worker, commuter, physician, legislator, voter, state, nation, etc.

Alternatives

The decision-making unit must choose between two alternatives: alternative A and alternative B. You must identify the alternatives from which the decision-making unit can choose. Some examples are as follows.

1) Drive to work, take public transportation.
2) Default on loan, make payment on loan.
3) Work, don’t work.
4) Buy a house, rent a house.

Quantification of Alternatives

To quantify the alternatives, you define a variable Y. This variable can assume two values.

Y = 1 if the decision-maker chooses alternative A
Y = 0 if the decision-maker chooses alternative B

Some examples are as follows.

Y = 1 if a commuter chooses to drive to work; Y = 0 if a commuter chooses to take public transportation.
Y = 1 if a nation chooses to default on a loan; Y = 0 if a nation chooses to make payment on a loan.
Y = 1 if an individual chooses to work; Y = 0 if an individual chooses not to work.
Y = 1 if an individual chooses to buy a house; Y = 0 if an individual chooses to rent a house.

The variable Y is the dependent variable for the choice process; that is, Y is the variable to be explained.

Observable Factors that Influence Choice of Alternatives

Identify observable factors that influence the decision-maker’s choice of alternatives. Observable factors are factors that can be observed, quantified, and included in the statistical model. The vector of observable factors can be denoted as follows,

$X = (X_1, X_2, \dots, X_k)$

In most choice situations, two types of observable factors influence the decision-maker’s choice of alternatives.

1. Attributes of the alternatives
2. Socioeconomic characteristics of the decision-maker

For example, suppose the decision-maker is a commuter. The commuter must choose between two alternatives: 1) drive to work, 2) take public transportation. The actual choice that the commuter makes most likely depends on both attributes of the alternatives and socioeconomic characteristics of the commuter. Examples of attributes of the alternatives are the difference in commuting time between taking public transportation and driving to work; the price of gasoline; the price of public transportation; and the price of parking. Examples of socioeconomic characteristics of the decision-maker are income, age, gender, and marital status. The variables $X_1, X_2, \dots, X_k$ are explanatory variables for the choice process.

Unobservable Factors that Influence Choice of Alternatives

Many factors that influence the decision-maker’s choice of alternatives either can’t be observed or quantified, or are relatively unimportant. These factors are summarized by an error term, $\mu$. The error term is a random variable.
Choice Relationship

The choice relationship can be written in general functional form as follows,

$Y = f(X_1, X_2, \dots, X_k, \mu)$

This tells you that the alternative that is chosen depends upon observable attributes of the alternatives, observable socioeconomic characteristics of the decision-maker, and unobservable factors. The following points should be noted about the choice relationship.

1. The dependent variable Y is a random variable. This is because Y depends on $\mu$, and $\mu$ is a random variable.
2. The dependent variable Y is a discrete random variable, not a continuous random variable. This is because it can take only two values: Y = 1 and Y = 0.

Probability Distribution for the Dependent Variable

The dependent variable is a discrete random variable that can take only two values, and therefore has the following discrete probability distribution,

$g(Y) = P^{Y}(1 - P)^{1 - Y}$ for Y = 0, 1

The probability that Y = 1 is given by

$g(1) = P^{1}(1 - P)^{1 - 1} = P$

The probability that Y = 0 is given by

$g(0) = P^{0}(1 - P)^{1 - 0} = 1 - P$

The mean of Y is

$E(Y) = P$

The variance of Y is

$\mathrm{Var}(Y) = E[(Y - E(Y))^2] = (1 - P)P$

Note that for a discrete random variable that can take only two values, zero and one, the mean and variance of the probability distribution are completely determined by P, the probability that Y = 1. In the context of a discrete choice model, this is the probability that the decision-maker will choose alternative A.

Conditional Mean Function of the Dependent Variable

Recall that the objective of a regression model is to analyze the factors that influence the average value of the dependent variable, i.e., to analyze the average relationship between the dependent variable and the explanatory variable(s). To do this, we estimate the parameters of a conditional mean function.
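The mean and variance of Y stated above follow directly from the two-point distribution g(Y). As a worked check (a short derivation using only the definitions already given, not an additional assumption of the model), summing over the two possible values of Y gives:

```latex
\begin{align*}
E(Y) &= \sum_{Y \in \{0,1\}} Y \, g(Y) = 0 \cdot (1 - P) + 1 \cdot P = P \\
\mathrm{Var}(Y) &= \sum_{Y \in \{0,1\}} (Y - P)^2 \, g(Y)
  = (0 - P)^2 (1 - P) + (1 - P)^2 P \\
  &= P(1 - P)\,[\,P + (1 - P)\,] = (1 - P)P
\end{align*}
```

Because both moments depend only on P, modeling the conditional mean of Y is the same as modeling the probability that Y = 1.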
Given that E(Y) = P, we can write the conditional mean function for a binary discrete choice regression model, in general functional form, as follows,

$P = F(X_1, X_2, \dots, X_k)$

Thus, a binary discrete choice model is a probability model. That is, a binary discrete choice model is used to analyze the factors that influence the probability that a decision-maker will choose alternative A, when there are only two alternatives from which she can choose, A or B. For example, a binary discrete choice model can be used to analyze how the difference in commuting time, price of a gallon of gasoline, price of public transportation, price of parking, and a commuter’s income, age, gender, and marital status influence the probability that he/she will choose to drive to work (alternative A).

3 Binary Discrete Choice Models

To analyze the factors that influence the probability of a choice outcome, economists use one of 3 binary discrete choice models.

1. Linear probability model
2. Probit model
3. Logit model

The major difference between these models is the specific functional form, F, which relates $X_1, X_2, \dots, X_k$ to P. The linear probability model assumes F is a linear function. The probit model assumes F is a normal cumulative distribution function. The logit model assumes F is a logistic cumulative distribution function.

LINEAR PROBABILITY MODEL

The linear probability model is the classical linear regression model with a dummy variable as the dependent variable. It is the easiest model to estimate and interpret. However, it also has the biggest shortcomings.

SPECIFICATION

The linear probability model is given by

$Y_t = \beta_1 + \beta_2 X_{t2} + \dots + \beta_k X_{tk} + \mu_t$ for t = 1, 2, …, T

where Y is a dummy variable, and the error term has all the assumptions of the classical linear regression model.

Heteroscedasticity

The assumption of constant error variance is always violated in the linear probability model. Why? Recall that the variance of Y, and therefore the variance of $\mu$, is given by $\mathrm{Var}(\mu) = (1 - P)P$.
The variance of $\mu$ is a function of P, and P is a function of $X_1, X_2, \dots, X_k$. Thus, the variance of $\mu$ is not constant, but depends on the values of the explanatory variables.

Interpretation

For the classical linear regression model with a dummy dependent variable, the conditional mean function of Y is

$P_t = \beta_1 + \beta_2 X_{t2} + \dots + \beta_k X_{tk}$

and therefore the probability that Y = 1 is a linear function of the explanatory variables. Each slope coefficient, $\beta_j$, is interpreted as a marginal probability; that is, it indicates the effect of a one unit change in $X_j$ on the probability that Y = 1, holding the other explanatory variables constant. If a slope coefficient has a positive (negative) algebraic sign, then larger values of $X_j$ increase (decrease) the probability that Y = 1. The fitted value of Y is a predicted probability for given values of the X’s.

ESTIMATION

To obtain estimates of the parameters of the model, you must choose an estimator. The two estimators used most often are:

1. Ordinary least squares (OLS) estimator with White robust standard errors
2. Feasible generalized least squares (FGLS) estimator

Ordinary Least Squares (OLS) Estimator

To obtain estimates of the parameters of the linear probability model, you can apply the OLS estimator to the sample data. The OLS estimator is given by the rule,

$\hat{\beta} = (X^T X)^{-1} X^T y$

Properties of the OLS Estimator

If the sample data are generated by the linear probability model, then the OLS estimator is unbiased, but inefficient. It is inefficient because it does not use the information that the error term is heteroscedastic. The conventional OLS standard errors are not valid under heteroscedasticity; consistent estimates of the standard errors can be obtained by using White robust standard errors.

Feasible Generalized Least Squares (FGLS) Estimator

To obtain estimates of the parameters of the linear probability model, you can apply the FGLS estimator to the sample data.
The FGLS estimator is given by the rule,

$\hat{\beta} = (X^T W^{-1} X)^{-1} X^T W^{-1} y$

where W is a diagonal matrix whose t-th diagonal element is the estimated error variance, $\hat{P}_t(1 - \hat{P}_t)$.

Properties of the FGLS Estimator

If the sample data are generated by the linear probability model, then the FGLS estimator is consistent and asymptotically efficient. It is efficient because it uses the information that the error term is heteroscedastic.

HYPOTHESIS TESTING

Small (finite) sample tests, e.g., the t-test and F-test, cannot be used to test hypotheses in the linear probability model. This is because, for given values of the X’s, the error term can take only two values and therefore is not normally distributed. To test hypotheses, you must use large sample (asymptotic) tests. These include the asymptotic t-test, approximate F-test, likelihood ratio test, Wald test, and Lagrange multiplier test.

GOODNESS-OF-FIT

The $R^2$ statistic is not a good measure of goodness-of-fit of a linear probability model, and therefore its use should be avoided.

PREDICTION

The linear probability model can be used to predict the probability that an outcome will occur.

MAJOR SHORTCOMINGS OF THE LINEAR PROBABILITY MODEL

The linear probability model has 2 major shortcomings.

1. The predicted probability that Y = 1 (i.e., $P_t$) may be greater than one or less than zero. Because of this, the linear probability model has a logical inconsistency.
2. The marginal probabilities are constant, and therefore do not depend on the values of the explanatory variables. This may be an unreasonable structure to impose on the data.

Both of these shortcomings arise because the linear probability model assumes that the probability is a linear function of the explanatory variables.

PROBIT MODEL

The probit model does not have the shortcomings of the linear probability model. However, it is somewhat more difficult to estimate and interpret than the linear probability model.

SPECIFICATION

The probit model assumes that the conditional mean function for Y is given by

$P = F(I) = F(\beta_1 + \beta_2 X_2 + \dots + \beta_k X_k)$

where $I = \beta_1 + \beta_2 X_2 + \dots + \beta_k X_k$ is called the index function. When the value of an explanatory variable changes, the value of the index function changes.
When the value of the index function changes, the probability that Y = 1 changes.

Interpretation of the Index Function

The index, I, can be given any interpretation that is theoretically plausible. In economics, when specifying a model of a choice process, it is usually interpreted as a utility index. Note that the value of the index is unknown and unobservable because the values of $\beta_1, \beta_2, \dots, \beta_k$ are unknown and unobservable. However, by obtaining estimates of $\beta_1, \beta_2, \dots, \beta_k$ you can obtain an estimate of I.

Choice of Functional Form

For the model to be logically consistent, you must choose a functional form, F(I), that has the following properties:

1. $0 \le P \le 1$
2. $-\infty < I < +\infty$
3. $\partial P / \partial I > 0$

That is, you must choose a functional form that is strictly monotonic and allows P to vary between 0 and 1 as I varies between $-\infty$ and $+\infty$. One function that satisfies these properties is the standard normal cumulative distribution function evaluated at the value of $I = \beta_1 + \beta_2 X_2 + \dots + \beta_k X_k$. This can be written as follows,

$P = F(I) = (2\pi)^{-1/2} \int_{-\infty}^{I} e^{-z^2/2} \, dz$

where z is the variable of integration. Note that the function F assigns to each set of values $X_1, X_2, \dots, X_k$ one and only one value P. The value of P is equal to the area under the standard normal distribution between $-\infty$ and the value of I evaluated at the set of values $X_1, X_2, \dots, X_k$. This is the probit model. The error term for the probit model is a component of the random utility theory used to specify the model, and is therefore embedded in the specification of the model.

Interpretation of Probit Model

The slope coefficients of the index function are not marginal probabilities. To derive the marginal probability for $X_k$ you must use the chain rule from calculus. Recall that for the probit model the probability that Y = 1 is given by

$P = F(I) = F(\beta_1 + \beta_2 X_2 + \dots + \beta_k X_k)$

where F is the standard normal cumulative distribution function. When $X_k$ changes, I changes. When I changes, P changes.
Application of the chain rule therefore yields,

$\dfrac{\partial P}{\partial X_k} = \dfrac{\partial F}{\partial I} \cdot \dfrac{\partial I}{\partial X_k} = \phi(I)\,\beta_k$

where $\phi(I)$ is the standard normal probability density function. To obtain an estimate of the marginal probability, you use the estimate of $\beta_k$ and evaluate $\phi(I)$ at specific values of the X’s, using the estimates of $\beta_1, \beta_2, \dots, \beta_k$. The magnitude of the marginal probability is not constant but varies with the values of the X’s. This is because $\phi(I)$ varies with the values of the X’s. The marginal probability is largest when P = 0.5 and smallest when P is close to 0 or 1. This implies that a change in $X_k$ has the biggest effect on decision-makers who are “sitting on the fence” when choosing Y = 1 or Y = 0, and the smallest effect on decision-makers who are “set in their ways.” The fitted value of Y is a predicted probability for given values of the X’s.

ESTIMATION

If you have individual data, then to obtain estimates of the parameters of the probit model you can apply the maximum likelihood estimator to the sample data. To obtain maximum likelihood estimates, you find the values of the unknown parameters $\beta_1, \beta_2, \dots, \beta_k$ that maximize the log likelihood function for the sample of data. For the probit model, the log likelihood function is

$\ln L = \sum_{t=1}^{T} \left[ Y_t \ln F(\beta_1 + \beta_2 X_{t2} + \dots + \beta_k X_{tk}) + (1 - Y_t) \ln\left(1 - F(\beta_1 + \beta_2 X_{t2} + \dots + \beta_k X_{tk})\right) \right]$

The first-order necessary conditions for a maximum comprise a set of k equations in the k unknown parameters $\beta_1, \beta_2, \dots, \beta_k$. However, because these equations are nonlinear, you cannot solve them analytically, and therefore there is no closed-form formula for $\beta_1, \beta_2, \dots, \beta_k$. To maximize the log likelihood function you must therefore use a numerical optimization procedure that finds the solution iteratively. Since the log likelihood function for the probit model is globally concave and therefore has a single maximum, you can give the unknown parameters any starting values you desire.
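The estimation and interpretation steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes NumPy and SciPy are available, the data are simulated, and the function names (probit_neg_loglik, fit_probit, marginal_effects) and the "true" coefficient values are hypothetical choices made for the example. It minimizes the negative of the log likelihood shown above with a general-purpose iterative optimizer, starting from zeros, and then evaluates the marginal probabilities $\phi(I)\beta_j$ at a chosen point.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm


def probit_neg_loglik(beta, X, y):
    """Negative probit log likelihood; X must include a column of ones."""
    index = X @ beta
    # 1 - F(I) = F(-I) by symmetry of the standard normal distribution;
    # logcdf is numerically safer than log(cdf) far out in the tails.
    return -np.sum(y * norm.logcdf(index) + (1 - y) * norm.logcdf(-index))


def fit_probit(X, y):
    """Maximize the log likelihood iteratively, starting from all zeros."""
    start = np.zeros(X.shape[1])
    result = minimize(probit_neg_loglik, start, args=(X, y), method="BFGS")
    return result.x, -result.fun  # estimates and maximized ln L


def marginal_effects(beta, x_row):
    """Marginal probabilities phi(I) * beta_j evaluated at the point x_row."""
    return norm.pdf(x_row @ beta) * beta


# Simulated data (hypothetical values, for illustration only).
rng = np.random.default_rng(0)
T = 2000
X = np.column_stack([np.ones(T), rng.normal(size=T)])
true_beta = np.array([0.5, -1.0])
y = (X @ true_beta + rng.normal(size=T) > 0).astype(float)

beta_hat, loglik = fit_probit(X, y)
print("estimates:", beta_hat)
print("maximized ln L:", loglik)
# Marginal probabilities at the sample means of the explanatory variables;
# the entry for the constant term has no marginal-effect interpretation.
print("marginal probabilities:", marginal_effects(beta_hat, X.mean(axis=0)))
```

Evaluating the marginal probabilities at the sample means is only one common convention; because $\phi(I)$ varies with the X’s, any other point of interest can be passed as x_row.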
Properties of the Maximum Likelihood Estimator

If the sample consists of individual data generated by the probit model, then the maximum likelihood estimator has desirable large sample properties; that is, it is asymptotically unbiased, consistent, and asymptotically efficient.

HYPOTHESIS TESTING

Small (finite) sample tests, e.g., the t-test and F-test, cannot be used to test hypotheses in the probit model. To test hypotheses, you must use large sample (asymptotic) tests. These include the asymptotic t-test, approximate F-test, likelihood ratio test, Wald test, and Lagrange multiplier test.

GOODNESS-OF-FIT

An $R^2$ statistic cannot be used to measure how well a probit model fits the sample data. The most often used measures of goodness-of-fit are the likelihood ratio index and the percent of correct predictions.

Likelihood Ratio Index

The likelihood ratio index is given by

$LRI = 1 - (\ln L / \ln L_0)$

where $\ln L$ is the maximized value of the log likelihood function for the unrestricted probit model, and $\ln L_0$ is the maximized value of the log likelihood function for the restricted probit model in which all slope coefficients, $\beta_2, \dots, \beta_k$, are set equal to zero. The LRI is an analog to the $R^2$ statistic for the classical linear regression model. If all slope coefficients are jointly equal to zero, then $\ln L = \ln L_0$ and LRI = 0. As the slope coefficients move away from zero, the LRI becomes larger in magnitude. The largest value the LRI can take is 1.00. There are two major problems with the LRI as a measure of goodness of fit:

1. Values of the LRI between 0 and 1 have no specific interpretation.
2. If the LRI = 1, this does not indicate a perfect fit; rather, it indicates a flaw in the model.

Percent of Correct Predictions

A table is constructed giving the number of Y = 1 values in the sample correctly and incorrectly predicted by the probit model, and the number of Y = 0 values in the sample correctly and incorrectly predicted by the probit model.
From this table, the percent of correct predictions is then calculated and used as a measure of goodness of fit. A correct prediction is determined as follows. Compare the actual value of $Y_t$ (0 or 1) to the predicted value $\hat{Y}_t$ for each of the T observations in the sample. Correct and incorrect predictions are as follows:

1. If $Y_t = 1$ and $\hat{Y}_t \ge 0.5$, then this is a correct prediction.
2. If $Y_t = 0$ and $\hat{Y}_t < 0.5$, then this is a correct prediction.
3. If $Y_t = 1$ and $\hat{Y}_t < 0.5$, then this is an incorrect prediction.
4. If $Y_t = 0$ and $\hat{Y}_t \ge 0.5$, then this is an incorrect prediction.

Note that a predicted (estimated) probability of 0.5 is the threshold that separates a predicted Y = 1 from a predicted Y = 0, and therefore determines whether a prediction is correct or incorrect.

PREDICTION

The probit model can be used to predict the probability that an outcome will occur.

LOGIT MODEL

Like the probit model, the logit model assumes that the conditional mean function is given by

$P = F(I) = F(\beta_1 + \beta_2 X_2 + \dots + \beta_k X_k)$

where $I = \beta_1 + \beta_2 X_2 + \dots + \beta_k X_k$ is an index function, with the restriction that $\partial P / \partial I > 0$. However, the logit model assumes that F is a logistic cumulative distribution function. Thus, the conditional mean function for the logit model is given by

$P = \dfrac{1}{1 + \exp(-I)}$

The rest of the model is analogous to the probit model.

Comparison of Probit and Logit Models

1. The major theoretical difference between the probit and logit models is the functional form of the conditional mean function $P = F(X_1, X_2, \dots, X_k)$. The probit model assumes that F is a standard normal cumulative distribution function. The logit model assumes that F is a logistic cumulative distribution function. These two functions have the same basic shape (both have an elongated S shape). However, the logistic distribution has fatter tails than the standard normal distribution.
2. The major practical difference between the probit and logit models is that the logit model is mathematically easier to work with.
3. The major empirical difference between the probit and logit models involves the predicted probabilities if the sample contains a disproportionate number of 0’s or 1’s. If the number of 0’s is not substantially different from the number of 1’s in the sample, then the predicted probabilities for the two models will be very similar. If the sample contains a large number of 0’s relative to 1’s, or a large number of 1’s relative to 0’s, then the two models will yield different predicted probabilities. This is because the logistic distribution has fatter tails than the standard normal distribution.
4. In most applications, whether you choose a probit or logit model makes little difference.
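The "fatter tails" point in the comparison above can be checked numerically. The following sketch, which assumes NumPy and SciPy are available, evaluates both cumulative distribution functions at a few index values. The function names and the grid of index values are hypothetical choices for illustration. Because the standard logistic distribution has variance $\pi^2/3$ rather than 1, the sketch also reports a logistic CDF rescaled to unit variance, so that the comparison reflects shape rather than scale; that rescaling is an assumption of this example and is not part of the models as specified above.

```python
import numpy as np
from scipy.special import expit
from scipy.stats import norm

# Scale giving the logistic distribution unit variance, since the
# standard logistic distribution has variance pi^2 / 3.
UNIT_VARIANCE_SCALE = np.sqrt(3.0) / np.pi


def probit_prob(index):
    """P = F(I) with F the standard normal CDF."""
    return norm.cdf(index)


def logit_prob(index, scale=1.0):
    """P = 1 / (1 + exp(-I / scale)); scale=1 is the standard logistic CDF."""
    return expit(np.asarray(index) / scale)


index = np.array([-4.0, -2.0, 0.0, 2.0, 4.0])
probit_p = probit_prob(index)
logit_p = logit_prob(index)
logit_p_rescaled = logit_prob(index, scale=UNIT_VARIANCE_SCALE)

print("index   probit      logit       logit (unit variance)")
for i, p, l, r in zip(index, probit_p, logit_p, logit_p_rescaled):
    print(f"{i:5.1f}   {p:.6f}    {l:.6f}    {r:.6f}")
```

At I = 0 all three probabilities equal 0.5, while at the extreme index values the logistic probabilities stay further from 0 and 1 than the normal ones, which is why the two models differ most when the sample is dominated by 0’s or by 1’s.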