BINARY DISCRETE CHOICE REGRESSION MODELS
INTRODUCTION
A binary discrete choice regression model is used to describe a data generation process that has
two possible outcomes. In economics, these two outcomes are usually alternatives associated
with a choice process. Some examples are as follows. 1) Labor participation decision. 2) Loan
default decision. 3) Public transportation decision. 4) Physician specialty choice decision. 5)
Home purchase decision.
Binary Discrete Choice Models
Three binary discrete choice regression models are used in economics.
1. Linear probability model
2. Probit model
3. Logit model
SPECIFICATION ISSUES
In economics, a binary discrete choice model is used to analyze a choice process. The objective
is to better understand the factors that influence a particular type of economic choice. The
following are some basic concepts that underlie the specification of a binary discrete choice
model.
Decision-Making Unit
To specify a binary discrete choice model, you begin by identifying the decision-making unit of
interest. For example, this could be a consumer, firm, worker, commuter, physician, legislator,
voter, state, nation, etc.
Alternatives
The decision-making unit must choose between two alternatives: alternative A, and alternative
B. You must identify the alternatives from which the decision-making unit can choose. Some
examples are as follows. 1) drive to work, take public transportation. 2) default on loan, make
payment on loan. 3) work, don’t work. 4) buy a house, rent a house.
Quantification of Alternatives
Quantify the alternatives. To quantify the alternatives, you define a variable Y. This variable
can assume two values.
Y=1
if the decision-maker chooses alternative A
Y=0
if the decision-maker chooses alternative B
Some examples are as follows. Y = 1 if a commuter chooses to drive to work; Y = 0 if a
commuter chooses to take public transportation. Y = 1 if a nation chooses to default on a loan;
Y = 0 if a nation chooses to make payment on a loan. Y = 1 if an individual chooses to work; Y =
0 if an individual chooses not to work. Y = 1 if an individual chooses to buy a house; Y = 0 if an
individual chooses to rent a house. The variable Y is the dependent variable for the choice
process; that is, Y is the variable to be explained.
Observable Factors that Influence Choice of Alternatives
Identify observable factors that influence the decision-maker’s choice of alternatives.
Observable factors are factors that can be observed, quantified, and included in the statistical
model. The vector of observable factors can be denoted as follows,
X = (X1, X2, …, Xk)
In most choice situations, two types of observable factors influence the decision-maker’s choice
of alternatives,
1. Attributes of the alternatives
2. Socioeconomic characteristics of the decision-maker
For example, suppose the decision-maker is a commuter. The commuter must choose between
two alternatives: 1) drive to work, 2) take public transportation. The actual choice that the
commuter makes most likely depends on both attributes of the alternatives and socioeconomic
characteristics of the commuter. Examples of attributes of the alternatives are difference in
commuting time between taking public transportation and driving to work; price of gasoline;
price of public transportation; and price of parking. Examples of socioeconomic characteristics
of the decision-maker are income; age; gender; and marital status. The variables X1, X2, …, Xk
are explanatory variables for the choice process.
Unobservable Factors that Influence Choice of Alternatives
There are many factors that influence the decision-maker’s choice of alternatives that either
can’t be observed or quantified, or that are relatively unimportant. These factors are
summarized by an error term, μ. The error term is a random variable.
Choice Relationship
The choice relationship can be written in general functional form as follows,
Y = f(X1, X2, …, Xk, μ)
This tells you that the alternative that is chosen depends upon observable attributes of the
alternatives, observable socioeconomic characteristics of the decision-maker, and unobservable
factors. The following points should be noted about the choice relationship.
1. The dependent variable Y is a random variable. This is because Y depends on μ, and
μ is a random variable.
2. The dependent variable Y is a discrete random variable, not a continuous random
variable. This is because it can take only two values: Y = 1; Y = 0.
Probability Distribution for the Dependent Variable
The dependent variable is a discrete random variable that can take only two values, and
therefore has the following discrete probability distribution,
$$g(Y) = P^{Y}(1 - P)^{1 - Y}$$
for Y = 0, 1
The probability that Y = 1 is given by
$$g(1) = P^{1}(1 - P)^{1 - 1} = P$$
The probability that Y = 0 is given by
$$g(0) = P^{0}(1 - P)^{1 - 0} = 1 - P$$
The mean of Y is
E(Y) = P
The variance of Y is
$$\mathrm{Var}(Y) = E[(Y - E(Y))^{2}] = (1 - P)P$$
Note that for a discrete random variable that can take only two values, zero and one, the mean
and variance of the probability distribution are completely determined by P, the probability that
Y = 1. In the context of a discrete choice model, this is the probability that the decision-maker
will choose alternative A.
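The mean and variance above can be checked by summing over the two outcomes. The following is a minimal sketch; the function names and the value P = 0.3 are illustrative assumptions, not part of the notes.

```python
def bernoulli_pmf(y, p):
    """g(y) = p^y (1 - p)^(1 - y) for y in {0, 1}."""
    return p ** y * (1 - p) ** (1 - y)

def bernoulli_moments(p):
    """Mean and variance of Y computed directly from the two-point distribution."""
    mean = sum(y * bernoulli_pmf(y, p) for y in (0, 1))
    var = sum((y - mean) ** 2 * bernoulli_pmf(y, p) for y in (0, 1))
    return mean, var

# P = 0.3 is a hypothetical probability of choosing alternative A.
mean, var = bernoulli_moments(0.3)
print(mean, var)  # mean should equal P, variance should equal (1 - P)P
```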
Conditional Mean Function of the Dependent Variable
Recall that the objective of a regression model is to analyze the factors that influence the
average value of the dependent variable, i.e., analyze the average relationship between the
dependent variable and the explanatory variable(s). To do this, we estimate the parameters of
a conditional mean function. Given that E(Y) = P, we can write the conditional mean function
for a binary discrete choice regression model, in general functional form, as follows,
P = F(X1, X2, …, Xk)
Thus, a binary discrete choice model is a probability model. That is, a binary discrete choice
model is used to analyze the factors that influence the probability that a decision-maker will
choose alternative A, when there are only two alternatives from which she can choose, A or B.
For example, a binary discrete choice model can be used to analyze how the difference in
commuting time, price of a gallon of gasoline, price of public transportation, price of parking,
and a commuter’s income, age, gender, and marital status influence the probability that he/she
will choose to drive to work (alternative A).
Three Binary Discrete Choice Models
To analyze the factors that influence the probability of a choice outcome, economists use one
of three binary discrete choice models:
1. Linear probability model
2. Probit model
3. Logit model
The major difference between these models is the specific functional form, F, which relates X1,
X2, …, Xk to P. The linear probability model assumes F is a linear function. The probit model
assumes F is a normal cumulative distribution function. The logit model assumes F is a logistic
cumulative distribution function.
LINEAR PROBABILITY MODEL
The linear probability model is the classical linear regression model with a dummy variable as
the dependent variable. It is the easiest model to estimate and interpret. However, it also has
the biggest shortcomings.
SPECIFICATION
The linear probability model is given by
$$Y_t = \beta_1 + \beta_2 X_{t2} + \dots + \beta_k X_{tk} + \mu_t$$
for t = 1, 2, …, T
where Y is a dummy variable, and the error term is assumed to satisfy the assumptions of the
classical linear regression model.
Heteroscedasticity
The assumption of constant error variance is always violated in the linear probability model.
Why? Recall that the variance of Y, and therefore the variance of μ, is given by Var(μ) = (1 – P)P.
The variance of μ is a function of P, and P is a function of X1, X2, …, Xk. Thus, the variance of μ is
not constant, but depends on the values of the explanatory variables.
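To see the heteroscedasticity concretely, the sketch below evaluates the error variance (1 – P)P at several values of a single explanatory variable. The coefficient values and the grid of X values are hypothetical, chosen only for illustration.

```python
def lpm_probability(x, beta1, beta2):
    """P as a linear function of one explanatory variable."""
    return beta1 + beta2 * x

def error_variance(p):
    """Var(mu) = (1 - P)P for a binary dependent variable."""
    return (1 - p) * p

# Hypothetical parameter values and X values (not taken from any data set).
beta1, beta2 = 0.1, 0.2
xs = [0.0, 1.0, 2.0, 3.0]
variances = [error_variance(lpm_probability(x, beta1, beta2)) for x in xs]

for x, v in zip(xs, variances):
    print(f"X = {x}: Var(mu) = {v:.2f}")
```

Because the variance changes as X changes, a single constant error variance cannot describe these observations.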
Interpretation
For the classical linear regression model with a dummy dependent variable, the conditional
mean function of Y is
$$P_t = \beta_1 + \beta_2 X_{t2} + \dots + \beta_k X_{tk}$$
and therefore the probability that Y = 1 is a linear function of the explanatory variables. Each
slope coefficient, β, is interpreted as a marginal probability; that is, it indicates the effect of a
one unit change in X on the probability that Y = 1. If a slope coefficient has a positive (negative)
algebraic sign, then larger values of X increase (decrease) the probability that Y = 1. The fitted
value of Y is a predicted probability for given values of the X’s.
ESTIMATION
To obtain estimates of the parameters of the model, you must choose an estimator. The two
estimators used most often are:
1. Ordinary least squares (OLS) estimator with White robust standard errors
2. Feasible generalized least squares (FGLS) estimator
Ordinary Least Squares (OLS) Estimator
To obtain estimates of the parameters of the linear probability model, you can apply the OLS
estimator to the sample data. The OLS estimator is given by the rule,
$$\hat{\beta} = (X^{T}X)^{-1}X^{T}y$$
Properties of the OLS Estimator
If the sample data are generated by the linear probability model, then the OLS estimator is
unbiased, but inefficient. It is inefficient because it does not use the information that the error
term is heteroscedastic. The conventional OLS standard errors are also incorrect when the error
term is heteroscedastic; consistent estimates of the standard errors can be obtained by using
White robust standard errors.
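The sketch below applies the OLS rule to a linear probability model with a constant and one explanatory variable, and computes White robust standard errors in their simplest (HC0) form. It is a minimal illustration under stated assumptions: the data are made up, the function name is hypothetical, and the 2 x 2 matrix algebra is written out by hand instead of calling a statistics library.

```python
def ols_with_white_se(x, y):
    """OLS of y on a constant and one regressor x, with White (HC0) standard errors."""
    n = len(y)
    sx = sum(x)
    sxx = sum(v * v for v in x)
    det = n * sxx - sx * sx
    # (X'X)^{-1} for X = [1, x]
    inv = [[sxx / det, -sx / det],
           [-sx / det, n / det]]
    sy = sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    # beta_hat = (X'X)^{-1} X'y
    b1 = inv[0][0] * sy + inv[0][1] * sxy
    b2 = inv[1][0] * sy + inv[1][1] * sxy
    resid = [yt - (b1 + b2 * xt) for xt, yt in zip(x, y)]
    # Middle term of the sandwich: sum over t of e_t^2 * x_t x_t'
    m00 = sum(e * e for e in resid)
    m01 = sum(e * e * xt for e, xt in zip(resid, x))
    m11 = sum(e * e * xt * xt for e, xt in zip(resid, x))
    meat = [[m00, m01], [m01, m11]]

    def matmul(a, b):
        return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
                for i in range(2)]

    # Robust covariance: (X'X)^{-1} [sum e_t^2 x_t x_t'] (X'X)^{-1}
    cov = matmul(matmul(inv, meat), inv)
    se = (cov[0][0] ** 0.5, cov[1][1] ** 0.5)
    return (b1, b2), se

# Made-up sample: x is a single explanatory variable, y is the 0/1 choice.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 1, 0, 1]
(beta1_hat, beta2_hat), (se1, se2) = ols_with_white_se(x, y)
print(beta1_hat, beta2_hat, se1, se2)
```

Here beta2_hat is the estimated marginal probability: the estimated change in the probability that Y = 1 for a one unit change in x.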
Feasible Generalized Least Squares (FGLS) Estimator
To obtain estimates of the parameters of the linear probability model, you can apply the FGLS
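estimator to the sample data. The FGLS estimator is given by the rule,
$$\hat{\beta} = (X^{T}W^{-1}X)^{-1}X^{T}W^{-1}y$$
where W is a diagonal matrix whose t-th diagonal element is an estimate of the error variance
for observation t, (1 – Pt)Pt.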
Properties of the FGLS Estimator
If the sample data are generated by the linear probability model, then the FGLS estimator is
consistent and asymptotically efficient. It is asymptotically efficient because it uses the
information that the error term is heteroscedastic.
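One common way to carry out FGLS for the linear probability model is a two-step weighted least squares procedure: estimate the model by OLS, use the fitted probabilities to estimate each error variance as (1 – P)P, and then re-estimate with weights equal to the inverse of those variances. The sketch below follows that recipe for one explanatory variable. The data, the function names, and the clipping of fitted probabilities to the interval [0.01, 0.99] (needed so every estimated variance is positive) are assumptions made for illustration.

```python
def weighted_least_squares(x, y, w):
    """Minimize sum of w_t * (y_t - b1 - b2 * x_t)^2 over b1 and b2."""
    sw = sum(w)
    swx = sum(wt * xt for wt, xt in zip(w, x))
    swxx = sum(wt * xt * xt for wt, xt in zip(w, x))
    swy = sum(wt * yt for wt, yt in zip(w, y))
    swxy = sum(wt * xt * yt for wt, xt, yt in zip(w, x, y))
    det = sw * swxx - swx * swx
    b1 = (swxx * swy - swx * swxy) / det
    b2 = (sw * swxy - swx * swy) / det
    return b1, b2

def fgls_lpm(x, y, eps=0.01):
    """Two-step FGLS for the linear probability model (illustrative sketch)."""
    # Step 1: OLS is weighted least squares with equal weights.
    b1, b2 = weighted_least_squares(x, y, [1.0] * len(y))
    # Step 2: estimate Var(mu_t) = (1 - P_t)P_t from the fitted probabilities.
    # Clipping to [eps, 1 - eps] is an assumption to keep the variances positive.
    p_hat = [min(max(b1 + b2 * xt, eps), 1.0 - eps) for xt in x]
    weights = [1.0 / ((1.0 - p) * p) for p in p_hat]
    # Step 3: re-estimate, weighting each observation by 1 / estimated variance.
    return weighted_least_squares(x, y, weights)

# Made-up sample data.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 1, 0, 1]
fgls_b1, fgls_b2 = fgls_lpm(x, y)
print(fgls_b1, fgls_b2)
```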
HYPOTHESIS TESTING
Small (finite) sample tests, e.g., the t-test and F-test, cannot be used to test hypotheses in the
linear probability model. This is because the error term has a binomial distribution, not a
normal distribution. To test hypotheses, you must use large sample (asymptotic) tests. These
include the asymptotic t-test, approximate F-test, likelihood ratio test, Wald test, and Lagrange
multiplier test.
GOODNESS-OF-FIT
The R² statistic is not a good measure of the goodness of fit of a linear probability model, and
therefore its use should be avoided.
PREDICTION
The linear probability model can be used to predict the probability that an outcome will occur.
MAJOR SHORTCOMINGS OF THE LINEAR PROBABILITY MODEL
The linear probability model has 2 major shortcomings.
1. The probability that Y = 1 (i.e., Pt) may be greater than one or less than zero. Because of
this the linear probability model has a logical inconsistency.
2. The marginal probabilities are constant, and therefore do not depend on the values of the
explanatory variables. This may be an unreasonable structure to impose on the data.
Both of these shortcomings arise because the linear probability model assumes that P is a linear
function of the explanatory variables.
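The first shortcoming is easy to reproduce. In the sketch below, the parameter values and the values of X are hypothetical; the point is only that a linear function of X is unbounded, so some predicted "probabilities" fall outside the interval from 0 to 1.

```python
def lpm_predict(x, beta1, beta2):
    """Predicted probability from a linear probability model with one regressor."""
    return beta1 + beta2 * x

# Hypothetical parameter values and X values.
beta1, beta2 = 0.1, 0.2
predictions = {x: lpm_predict(x, beta1, beta2) for x in (-1, 2, 5)}

# Collect the X values whose predicted probability is logically impossible.
out_of_range = [x for x, p in predictions.items() if p < 0 or p > 1]
print(predictions)
print(out_of_range)
```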
PROBIT MODEL
The probit model does not have the shortcomings of the linear probability model. However, it is
somewhat more difficult to estimate and interpret than the linear probability model.
SPECIFICATION
The probit model assumes that the conditional mean function for Y is given by
$$P = F(I) = F(\beta_1 + \beta_2 X_2 + \dots + \beta_k X_k)$$
where I = β1 + β2X2 + … + βkXk is called the index function. When the value of an explanatory
variable changes, the value of the index function changes. When the value of the index function
changes, the probability that Y = 1 changes.
Interpretation of the Index Function
The index, I, can be given any interpretation that is theoretically plausible. In economics, when
specifying a model of a choice process it is usually interpreted as a utility index. Note that the
value of the index is unknown and unobservable because the values of β1, β2, …, βk are
unknown and unobservable. However, by obtaining estimates of β1, β2, …, βk you can obtain an
estimate of I.
Choice of Functional Form
For the model to be logically consistent, you must choose a functional form, F(I), that has the
following properties:
1. 0 ≤ P ≤ 1
2. -∞ < I < +∞
3. ∂P/∂I > 0
That is, you must choose a functional form that is strictly monotonic and allows P to vary
between 0 and 1 as I varies between -∞ and +∞. One function that satisfies these
properties is the standard normal cumulative distribution function evaluated at the value I = β1 +
β2X2 + … + βkXk. This can be written as follows,
$$P = F(I) = \int_{-\infty}^{I} (2\pi)^{-1/2} e^{-z^{2}/2} \, dz$$
where z is the variable of integration.
Note that the function F assigns to each set of values X1, X2, … Xk one and only one value P. The
value of P is equal to the area under the standard normal distribution between -∞ and the value
of I evaluated at the set of values X1, X2, … Xk. This is the probit model.
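The probit probability can be computed with the error function from the Python standard library, since the standard normal cumulative distribution function equals 0.5[1 + erf(I/√2)]. The sketch below is illustrative only: the function names and the parameter values are assumptions, not estimates from any data set.

```python
import math

def std_normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probit_probability(x_values, betas):
    """P = F(I), where I = beta_1 + beta_2 * X_2 + ... + beta_k * X_k.

    betas[0] is the intercept; betas[1:] pair up with x_values.
    """
    index = betas[0] + sum(b * x for b, x in zip(betas[1:], x_values))
    return std_normal_cdf(index)

# Hypothetical parameters: intercept -0.5 and one slope of 0.5.
betas = [-0.5, 0.5]
for x in (-4.0, 1.0, 6.0):
    print(x, probit_probability([x], betas))
```

However large or small the index becomes, the result stays between 0 and 1, which is the logical consistency the linear probability model lacks.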
The error term for the probit model is a component of random utility theory used to
specify the model, and is therefore embedded in the specification of the model.
Interpretation of Probit Model
The slope coefficients of the index function are not marginal probabilities. To derive the
marginal probability for Xk you must use the chain rule from calculus. Recall that for the probit
model the probability that Y = 1 is given by P = F(I) = F(β1 + β2X2 + … + βkXk) where F is the
standard normal cumulative distribution function. When Xk changes, I changes. When I changes,
P changes. Application of the chain rule therefore yields,
$$\frac{\partial P}{\partial X_k} = \frac{\partial F}{\partial I} \, \frac{\partial I}{\partial X_k} = \phi(I)\beta_k$$
where φ(I) is the standard normal probability density function. To obtain an estimate of the marginal
probability, you use the estimate of βk and evaluate φ(I) at specific values of the X’s, using the estimates
of β1, β2, …, βk.
The magnitude of the marginal probability is not constant but varies with the values of the X’s. This
is because φ(I) varies with the values of the X’s. The marginal probability is largest when P = 0.5 and
smallest when P is close to 0 or 1. This implies that a change in Xk has the biggest effect on
decision-makers who are “sitting on the fence” when choosing Y = 1 or Y = 0, and the smallest effect on
decision-makers who are “set in their ways.”
The fitted value of Y is a predicted probability for given values of the X’s.
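The marginal probability for the probit model is the standard normal density evaluated at the index, multiplied by the slope coefficient. The sketch below evaluates it at three values of the index. The slope value and the index values are hypothetical; in practice they would come from the estimated model.

```python
import math

def std_normal_pdf(z):
    """Standard normal probability density function."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def probit_marginal_effect(index, beta_k):
    """Marginal probability: density at the index times the slope coefficient."""
    return std_normal_pdf(index) * beta_k

beta_k = 0.5  # hypothetical slope coefficient
effects = {i: probit_marginal_effect(i, beta_k) for i in (-2.0, 0.0, 2.0)}
for i, effect in effects.items():
    print(f"I = {i}: marginal probability = {effect:.4f}")
```

The effect is largest at I = 0, where P = 0.5, and shrinks symmetrically as the index moves toward either tail.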
ESTIMATION
If you have individual data, then to obtain estimates of the parameters of the probit model you
can apply the maximum likelihood estimator to the sample data. To obtain maximum
likelihood estimates, you find the values of the unknown parameters β1, β2, …, βk that maximize
the log likelihood function for the sample of data. For the probit model, the log likelihood
function is
$$\ln L = \sum_{t=1}^{T} \left[ Y_t \ln F(\beta_1 + \beta_2 X_{t2} + \dots + \beta_k X_{tk}) + (1 - Y_t) \ln\left(1 - F(\beta_1 + \beta_2 X_{t2} + \dots + \beta_k X_{tk})\right) \right]$$
The first-order necessary conditions for a maximum comprise a set of k equations in the k unknown
parameters β1, β2, …, βk. However, because these equations are nonlinear, they have no closed-form
solution, and therefore there is no formula for the estimates of β1, β2, …, βk. To maximize the log
likelihood function you must therefore use a numerical optimization procedure that finds the solution
iteratively. Since the log likelihood function for the probit model is globally concave and therefore
has a single maximum, you can give the unknown parameters any starting values you desire.
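The iterative procedure can be sketched with Newton's method, one of several numerical optimizers that could be used here. The example below fits a probit model with a constant and one explanatory variable to a small made-up sample, starting from zero for both parameters. The function names, the data, the tolerance, and the choice of Newton's method are all assumptions for illustration; statistical software typically uses more robust routines.

```python
import math

def cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def log_likelihood(b1, b2, x, y):
    """Probit log likelihood for a constant and one explanatory variable."""
    total = 0.0
    for xt, yt in zip(x, y):
        p = cdf(b1 + b2 * xt)
        total += yt * math.log(p) + (1 - yt) * math.log(1.0 - p)
    return total

def probit_mle(x, y, tol=1e-10, max_iter=50):
    """Maximize the probit log likelihood by Newton's method."""
    b1, b2 = 0.0, 0.0  # arbitrary starting values
    for _ in range(max_iter):
        g1 = g2 = h11 = h12 = h22 = 0.0
        for xt, yt in zip(x, y):
            i = b1 + b2 * xt
            p = cdf(i)
            # Contribution of observation t to the first-order conditions.
            lam = pdf(i) / p if yt == 1 else -pdf(i) / (1.0 - p)
            g1 += lam
            g2 += lam * xt
            # Contribution to minus the Hessian (positive, since ln L is concave).
            c = lam * (lam + i)
            h11 += c
            h12 += c * xt
            h22 += c * xt * xt
        # Newton step: solve (-H) d = g for the 2 x 2 system.
        det = h11 * h22 - h12 * h12
        d1 = (h22 * g1 - h12 * g2) / det
        d2 = (h11 * g2 - h12 * g1) / det
        b1 += d1
        b2 += d2
        if abs(d1) + abs(d2) < tol:
            break
    return b1, b2

# Made-up sample data (not perfectly separated, so a finite maximum exists).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 1, 0, 1]
b1_hat, b2_hat = probit_mle(x, y)
print(b1_hat, b2_hat, log_likelihood(b1_hat, b2_hat, x, y))
```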
Properties of the Maximum Likelihood Estimator
If the sample data generated by the probit model are individual data, then the maximum likelihood
estimator has desirable large sample properties; that is, it is asymptotically unbiased, consistent, and
asymptotically efficient.
HYPOTHESIS TESTING
Small (finite) sample tests, e.g., the t-test and F-test, cannot be used to test hypotheses in the
probit model. To test hypotheses, you must use large sample (asymptotic) tests. These include
the asymptotic t-test, approximate F-test, likelihood ratio test, Wald test, and Lagrange
multiplier test.
GOODNESS-OF-FIT
An R² statistic cannot be used to measure how well a probit model fits the sample data. The
most often used measures of goodness-of-fit are the likelihood ratio index and the percent of
correct predictions.
Likelihood Ratio Index
The likelihood ratio index is given by
$$LRI = 1 - \frac{\ln L}{\ln L_0}$$
where ln L is the maximized value of the log likelihood function for the unrestricted probit
model, and ln L0 is the maximized value of the log likelihood function for the restricted probit model
for which all slope coefficients, β2, …, βk, are set equal to zero. The LRI is an analog to the R²
statistic for the classical linear regression model. If all slope coefficients are jointly equal to
zero, then LRI = 0. As the slope coefficients move away from zero, the LRI becomes larger in
magnitude. The largest value the LRI can take is 1.00. There are two major problems with the
LRI as a measure of goodness of fit:
1. Values of the LRI between 0 and 1 have no specific interpretation.
2. If the LRI = 1, this does not indicate a perfect fit; rather, it indicates a flaw in the model.
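The calculation can be sketched as follows, using the form LRI = 1 – (ln L / ln L0), which equals 0 when the unrestricted model fits no better than the restricted model and approaches 1 as ln L approaches 0. For the restricted model with only a constant, the fitted probability equals the sample proportion of 1’s, so ln L0 can be computed directly from the data. The sample and the value used for ln L are hypothetical.

```python
import math

def restricted_log_likelihood(y):
    """ln L0 for a model with only a constant: P equals the share of 1's."""
    n = len(y)
    n1 = sum(y)
    p = n1 / n
    return n1 * math.log(p) + (n - n1) * math.log(1.0 - p)

def likelihood_ratio_index(ln_l, ln_l0):
    """LRI = 1 - ln L / ln L0."""
    return 1.0 - ln_l / ln_l0

y = [0, 0, 1, 0, 1, 1, 0, 1]   # made-up sample
ln_l0 = restricted_log_likelihood(y)
ln_l = -4.0                    # hypothetical maximized log likelihood
lri = likelihood_ratio_index(ln_l, ln_l0)
print(ln_l0, lri)
```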
Percent of Correct Predictions
A table is constructed giving the number of Y = 1 values in the sample correctly and incorrectly
predicted by the probit model, and the number of Y = 0 values in the sample correctly and
incorrectly predicted by the probit model. From this table, the percent of correct predictions is
then calculated and used as a measure of goodness of fit. A correct prediction is determined as
follows. Compare the actual value of Yt (0 or 1) to the predicted value of Yt, denoted Ŷt, for each of the T
observations in the sample. Correct and incorrect predictions are as follows:
1. If Yt = 1 and Ŷt ≥ 0.5, then this is a correct prediction.
2. If Yt = 0 and Ŷt < 0.5, then this is a correct prediction.
3. If Yt = 1 and Ŷt < 0.5, then this is an incorrect prediction.
4. If Yt = 0 and Ŷt ≥ 0.5, then this is an incorrect prediction.
Note that a predicted (estimated) probability of 0.5 is the threshold that separates a predicted
Y = 1 from a predicted Y = 0.
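The table and the percent of correct predictions can be computed as in the sketch below. The actual outcomes and predicted probabilities are made up, and the function name is hypothetical; predicted probabilities would normally come from the estimated probit model.

```python
def classification_table(y, p_hat, threshold=0.5):
    """Count (actual, predicted) pairs and return the percent predicted correctly."""
    table = {(1, 1): 0, (1, 0): 0, (0, 1): 0, (0, 0): 0}
    for yt, pt in zip(y, p_hat):
        predicted = 1 if pt >= threshold else 0
        table[(yt, predicted)] += 1
    correct = table[(1, 1)] + table[(0, 0)]
    return table, 100.0 * correct / len(y)

# Made-up actual outcomes and predicted probabilities.
y = [1, 0, 1, 0, 1]
p_hat = [0.8, 0.3, 0.4, 0.6, 0.5]
table, percent_correct = classification_table(y, p_hat)
print(table)
print(percent_correct)
```

Each key of the table is a pair (actual Y, predicted Y), so the keys (1, 1) and (0, 0) hold the correct predictions.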
PREDICTION
The probit model can be used to predict the probability that an outcome will occur.
LOGIT MODEL
Like the probit model, the logit model assumes that the conditional mean function is given by
$$P = F(I) = F(\beta_1 + \beta_2 X_2 + \dots + \beta_k X_k)$$
where I = β1 + β2X2 + … + βkXk is an index function, with the restriction that ∂P/∂I > 0. However, the logit
model assumes that F is a logistic cumulative distribution function. Thus, the conditional mean function
for the logit model is given by
$$P = \frac{1}{1 + \exp(-I)}$$
The rest of the model is analogous to the probit model.
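As with the probit model, the marginal probability for the logit model follows from the chain rule. The derivative of the logistic cumulative distribution function with respect to I is P(1 – P), so the marginal probability for Xk is P(1 – P)βk. The sketch below computes both quantities; the slope value and index values are hypothetical.

```python
import math

def logistic_cdf(index):
    """P = 1 / (1 + exp(-I))."""
    return 1.0 / (1.0 + math.exp(-index))

def logit_marginal_effect(index, beta_k):
    """Marginal probability for the logit model: P(1 - P) times the slope."""
    p = logistic_cdf(index)
    return p * (1.0 - p) * beta_k

beta_k = 0.5  # hypothetical slope coefficient
for i in (-2.0, 0.0, 2.0):
    print(i, logistic_cdf(i), logit_marginal_effect(i, beta_k))
```

As in the probit model, the marginal probability is largest at I = 0, where P = 0.5.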
Comparison of Probit and Logit Models
1. The major theoretical difference between the probit and logit models is the functional form of the
conditional mean function P = F(X1, X2, …, Xk). The probit model assumes that F is a standard normal
cumulative distribution function. The logit model assumes that F is a logistic cumulative distribution
function. These two functions have the same basic shape (both have an elongated S shape).
However, the logistic distribution has fatter tails than the standard normal distribution.
2. The major practical difference between the probit and logit model is that the logit model is
mathematically easier to work with.
3. The major empirical difference between the probit and logit model involves the predicted
probabilities if the sample contains a disproportionate number of 0’s or 1’s. If the number of 0’s is
not substantially different from the number of 1’s for the sample, then the predicted probabilities
for the two models will be very similar. If the sample contains a large number of 0’s relative to 1’s or
a large number of 1’s relative to 0’s, then the two models will yield different predicted
probabilities. This is because the logistic distribution has fatter tails than the standard normal
distribution.
4. In most applications, whether you choose a probit or logit model makes little difference.
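The difference in the tails can be illustrated numerically. The sketch below compares the standard normal cumulative distribution function with a logistic cumulative distribution function rescaled to have a variance of one (scale √3/π), so that both are on a comparable footing. The rescaling and the points chosen for the comparison are assumptions made for this illustration.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic_cdf(z, scale=1.0):
    """Logistic cumulative distribution function with the given scale."""
    return 1.0 / (1.0 + math.exp(-z / scale))

# A logistic distribution with this scale has a variance of one.
unit_variance_scale = math.sqrt(3.0) / math.pi

tails = {z: (normal_cdf(z), logistic_cdf(z, unit_variance_scale))
         for z in (-2.0, -3.0, -4.0)}
for z, (normal_tail, logistic_tail) in tails.items():
    print(f"z = {z}: normal = {normal_tail:.5f}, logistic = {logistic_tail:.5f}")
```

Far from the center the logistic probabilities are noticeably larger, which is why the two models can disagree when most of the sample lies in one tail.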