Models with Discrete Dependent Variables

These notes are mainly based on:
Damodar N. Gujarati. Basic Econometrics. Fourth Edition.
1. Concepts
In models with discrete dependent variables, the dependent variables can be
1. Binary or Dichotomous. Assume there are two political parties, Democratic or Republican. The
dependent variable here is the choice between the two political parties. We can let Y=1 if the vote is
for a Democratic candidate and Y=0 if the vote is for a Republican candidate;
2. Trichotomous. In the above example, if there are three parties, Democratic, Republican and
Independent, then the dependent variable is trichotomous.
3. Polychotomous or Multiple-category.
In research we may also encounter an ordinal dependent variable, that is, an ordered categorical variable such as schooling (less than 8 years, 8-11 years, 12-14 years, and 15 or more years). If the categories have no inherent ordering, the variable is called nominal, such as ethnicity (Black, White, Hispanic, Asian, and other).
Models with discrete dependent variables are a common kind of qualitative response model. In a model where the dependent variable is quantitative, our objective is to estimate its expected, or mean, value given the values of the independent variables. In models where the dependent variable is qualitative, our objective is to find the probability of something happening, such as voting for a Democratic candidate, owning a house, or belonging to a union. Hence, qualitative response regression models are often known as probability models. The theoretical framework of probability models is:
Prob (event j occurs) = Prob (Y=j) = Function ( relevant effects: parameters)
In the situations of models with discrete dependent variable, we need to consider the following problems:
1. How to estimate models with discrete dependent variable? Can the OLS procedure be applied here?
2. Are there special inference problems? That is, is the hypothesis testing procedure different here?
3. How can we measure the goodness of fit for such models?
2. Binary Response Regression Model
According to the theoretical framework of probability models, for the binary response regression model $Y = X\beta + \varepsilon$, we expect
$$\lim_{X\beta \to +\infty} \Pr(Y = 1) = 1 \quad \text{and} \quad \lim_{X\beta \to -\infty} \Pr(Y = 1) = 0$$
There are mainly three approaches to developing a probability model for a binary response variable:
A. The linear probability model (LPM)
B. The logit model
C. The probit model
This classification reflects the different probability distributions on which the binary response regression models are based.
2.1 The linear probability model (LPM)
Consider the following regression model:
$$Y_i = \beta_1 + \beta_2 X_i + \varepsilon_i \tag{1}$$
Here $X$ = family income, and $Y = 1$ if the family owns a house and $Y = 0$ if it does not. Model (1) looks like a typical linear regression model. But because the dependent variable is binary, or dichotomous, it is called a linear probability model (LPM). This is because the conditional expectation of $Y_i$ given $X_i$, $E(Y_i \mid X_i)$, can be interpreted as the conditional probability that the event will occur given $X_i$, that is, $\Pr(Y_i = 1 \mid X_i)$. Thus, in this example, $E(Y_i \mid X_i)$ gives the probability that a family whose income is the given amount $X_i$ owns a house.
Assuming $E(\varepsilon_i) = 0$, we can obtain from (1):
$$E(Y_i \mid X_i) = \beta_1 + \beta_2 X_i \tag{2}$$
Let $P_i$ = the probability that $Y_i = 1$ and $(1 - P_i)$ = the probability that $Y_i = 0$. Then $Y_i$ has the following distribution:
$Y_i$      Probability
0          $1 - P_i$
1          $P_i$
Total      1
Thus, the LPM is based on the Bernoulli probability distribution, so $E(Y_i) = 0 \cdot (1 - P_i) + 1 \cdot P_i = P_i$. Also,
$$E(Y_i \mid X_i) = \beta_1 + \beta_2 X_i = P_i$$
That is, the conditional expectation of model (1) can be interpreted as the conditional probability of $Y_i$. Since $P_i$ lies between 0 and 1, we have $0 \le E(Y_i \mid X_i) \le 1$ as a restriction.
The LPM faces several problems:
(1) Non-normality of the Disturbance $\varepsilon_i$:
Since $\varepsilon_i = Y_i - \beta_1 - \beta_2 X_i$, the distribution of $\varepsilon_i$ is:
                   $\varepsilon_i$                  Probability
When $Y_i = 1$     $1 - \beta_1 - \beta_2 X_i$      $P_i$
When $Y_i = 0$     $-\beta_1 - \beta_2 X_i$         $1 - P_i$
So $\varepsilon_i$, like $Y_i$, takes only two values and follows a Bernoulli-type distribution; it cannot be assumed to be normally distributed. However, this problem may not be so critical: if the objective is point estimation, the normality assumption for the disturbance is not necessary and the OLS estimators remain unbiased. Also, according to the central limit theorem, the OLS estimators tend to be normally distributed as the sample size increases.
(2) Heteroscedastic Variance of the Disturbance
For the Bernoulli distribution, the variance is $p(1 - p)$, thus $\mathrm{Var}(\varepsilon_i) = P_i(1 - P_i)$. Since $P_i = E(Y_i \mid X_i) = \beta_1 + \beta_2 X_i$, the variance of $\varepsilon_i$ ultimately depends on the values of $X$ and hence is not homoscedastic. However, this problem can be solved using weighted least squares, with $\sqrt{P_i(1 - P_i)}$ serving as the weights by which both sides of model (1) are divided. In practice, because the true $E(Y_i \mid X_i)$ is unknown, $\sqrt{P_i(1 - P_i)}$ is unknown. We can run the regression $Y_i = \beta_1 + \beta_2 X_i + \varepsilon_i$ despite the heteroscedasticity problem, use $\hat{Y}_i$ as the estimate of the true $E(Y_i \mid X_i)$, and then use $\sqrt{\hat{Y}_i(1 - \hat{Y}_i)}$ as the weights.
(3) Nonfulfillment of $0 \le E(Y_i \mid X_i) \le 1$
As a probability, $E(Y_i \mid X_i)$ must lie between 0 and 1. However, there is no guarantee that the predictions from the model (the estimator of $E(Y_i \mid X_i)$) will fulfill this restriction. This is the real problem with the LPM.
(4) Low $R^2$ Values
In the binary dependent variable model, for a given $X$, the $Y$ values will be either 0 or 1. Therefore, all the $Y$ values lie either along the $X$-axis or along the line corresponding to $Y = 1$, and generally no LPM is expected to fit such a scatter well. As a result, the conventional $R^2$ is quite low for such models. Thus, it is not advisable to use $R^2$ as a summary statistic in models with qualitative dependent variables.
[Figure: $\hat{Y}$ plotted against $X$, with the vertical axis marked at 0 and 1.]
(5) The Linearity is Unrealistic
The LPM assumes that $P_i = \Pr(Y = 1 \mid X)$ increases linearly with $X$, that is, the marginal effect of $X$ remains the same throughout. In our example, a given change in $X$ has the same effect on the probability of owning a house whether the income level is 100 dollars or 100,000 dollars. In reality, however, at a very low or a very high income level, a small change in income $X$ will have very little effect on the probability of owning a house.
Because of problems (3) and (5) above, the LPM is much less popular than the logit model or the probit model.
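Problem (3) is easy to see numerically. The following is a minimal sketch, not part of the original notes: it fits the LPM by OLS with NumPy to the nine-family data that appear as Table 2 in section 2.2.2 of these notes, and checks whether the fitted probabilities stay inside the unit interval. The variable names are illustrative only.

```python
import numpy as np

# Table 2 data from section 2.2.2: income in thousand dollars, and
# Y = 1 if the family owns a house, 0 otherwise.
X = np.array([8, 16, 18, 11, 12, 19, 20, 13, 9], dtype=float)
Y = np.array([0, 1, 1, 0, 0, 1, 1, 0, 0], dtype=float)

# OLS fit of model (1): Y_i = beta_1 + beta_2 * X_i + eps_i.
design = np.column_stack([np.ones_like(X), X])
(b1, b2), *_ = np.linalg.lstsq(design, Y, rcond=None)

# Fitted values are the LPM's estimates of Pr(Y = 1 | X).
fitted = b1 + b2 * X
print("intercept, slope:", b1, b2)
print("smallest and largest fitted probability:", fitted.min(), fitted.max())
```

On these data the smallest fitted value is negative and the largest exceeds one, so the estimated "probabilities" violate the restriction $0 \le E(Y_i \mid X_i) \le 1$.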
2.2 The Logit Model
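The logit model is based on the cumulative distribution function of the logistic distribution:
$$P_i = \Pr(Y = 1 \mid X_i) = \frac{1}{1 + e^{-Z_i}} = \frac{e^{Z_i}}{1 + e^{Z_i}} \tag{3}$$
(In our example, $Z_i = \beta_1 + \beta_2 X_i$.)
[Figure: the logistic cumulative distribution function, $P$ plotted against $Z$, with $P$ ranging from 0 to 1.]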
As we can see from the figure, as $Z_i$ ranges from $-\infty$ to $+\infty$, $P_i$ ranges between 0 and 1, and $P_i$ is nonlinearly related to $Z_i$ (and hence to $X_i$). Hence, problems (3) and (5) are solved in the logit model. On the other hand, because $P_i$ is nonlinear not only in $X$ but also in the $\beta$'s, we cannot use the conventional OLS procedure to estimate the parameters. So we do the following transformations:
$$1 - P_i = 1 - \frac{e^{Z_i}}{1 + e^{Z_i}} = \frac{1}{1 + e^{Z_i}}$$
and
$$\frac{P_i}{1 - P_i} = \frac{1 + e^{Z_i}}{1 + e^{-Z_i}} = e^{Z_i} = e^{\beta_1 + \beta_2 X_i} \tag{4}$$
Here, $\frac{P_i}{1 - P_i}$ is the odds ratio in favor of owning a house: the ratio of the probability that a family will own a house to the probability that it will not own a house. For instance, if $P_i = 0.8$, then the odds are 4 to 1 in favor of the family owning a house.
Let
$$L_i = \ln\left(\frac{P_i}{1 - P_i}\right) = Z_i = \beta_1 + \beta_2 X_i \tag{5}$$
Now $L_i$, the log of the odds ratio, is linear not only in $X$ but also in the parameters. $L$ is called the logit, and model (5) is named the logit model. In the logit model, $\beta_2$ measures the change in $L$ for a unit change in $X$, that is, it tells how the log-odds in favor of owning a house change as income changes by a unit. Also, $\beta_1$ is the value of the log-odds in favor of owning a house if income is zero. This interpretation may not have any physical meaning.
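The relationship between the probability, the odds, and the logit in equations (3) to (5) can be sketched in a few lines of Python. This is an illustration only; the function names are chosen here and do not come from the notes.

```python
import math

def logistic(z):
    """Logistic CDF of equation (3): P = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Odds in favor of the event, P / (1 - P), as in equation (4)."""
    return p / (1.0 - p)

def logit(p):
    """Log of the odds, the logit L of equation (5)."""
    return math.log(odds(p))

# With P = 0.8 the odds are about 4 to 1, as in the example in the text.
print(odds(0.8))

# The logit undoes the logistic transformation: logit(logistic(z)) returns z
# up to floating-point rounding.
print(logit(logistic(1.3)))
```

Because the logit is the inverse of the logistic function, a model that is nonlinear in the probabilities becomes linear in the log-odds.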
Once we get the estimates of $\beta_1$ and $\beta_2$ from the logit model, we can estimate the probability of owning a house by equation (3). The methods of estimating the parameters in the logit model differ according to what kind of data we have:
(1) Grouped or replicated data;
(2) Data at the individual, or micro, level.
2.2.1 Grouped or Replicated Data
Consider the data given in the following table:
Table 1
X_i (income,          N_i (number of families    n_i (number of families owning
thousand dollars)     at income X_i)             a house among N_i)
 6                     40                          8
 8                     50                         12
10                     60                         18
13                     80                         28
15                    100                         45
20                     70                         36
25                     65                         39
30                     50                         33
35                     40                         30
40                     25                         20
Corresponding to each income level $X_i$, there are $N_i$ families, and among these $N_i$ families there are $n_i$ families that own a house. Thus,
$$\hat{P}_i = \frac{n_i}{N_i},$$
the relative frequency, can be used as an estimate of the true $P_i$ corresponding to each $X_i$. (When the sample size becomes infinitely large, the probability of an event is the limit of the relative frequency.) Therefore,
$$\hat{L}_i = \ln\left(\frac{\hat{P}_i}{1 - \hat{P}_i}\right) = \hat{\beta}_1 + \hat{\beta}_2 X_i \tag{6}$$
Thus, when the number of observations $N_i$ at each income level is reasonably large, we can estimate the parameters in the logit model. Can we apply OLS to equation (6) to estimate the parameters? The answer is: not quite, because the disturbance term is heteroscedastic. If $N_i$ is fairly large and if each observation in a given income class $X_i$ is distributed independently as a binomial variable, then
$$\varepsilon_i \sim N\left[0, \frac{1}{N_i P_i (1 - P_i)}\right]$$
That is, $\varepsilon_i$ follows the normal distribution with zero mean and variance equal to $\frac{1}{N_i P_i (1 - P_i)}$.
The formal proof process is as follows:
Let $I$ be given and assume that each of the integers $N_1, \ldots, N_I$ is large. Let $F(z_i) = P_i = \frac{e^{z_i}}{1 + e^{z_i}}$. Since $N_i$ is large, $\hat{P}_i$ is close to $E(y_i) = F(z_i) = P_i$ by the law of large numbers. Also, by the central limit theorem, the random variable $\sqrt{N_i}\,(\hat{P}_i - P_i)$ is approximately normally distributed with mean zero and variance equal to:
$$\mathrm{var}(y_i) = F(z_i)[1 - F(z_i)] = P_i(1 - P_i).$$
Thus we can write:
$$\hat{P}_i = F(z_i) + v_i$$
where the errors $v_i$ are independent and normally distributed with mean zero and variance:
$$\mathrm{var}(v_i) = \frac{P_i(1 - P_i)}{N_i}.$$
We still do not have a model which is linear in $z_i$, however. To move in this direction, we use Slutsky's theorem on convergence in probability. Since $N_i$ is large:
$$F^{-1}(\hat{P}_i) \approx F^{-1}(P_i) = F^{-1}[F(z_i)] = z_i$$
and the random variable:
$$\sqrt{N_i}\,[F^{-1}(\hat{P}_i) - F^{-1}(P_i)] = \sqrt{N_i}\,[F^{-1}(\hat{P}_i) - z_i]$$
is approximately normally distributed with mean zero and variance:
$$P_i(1 - P_i)\left[\frac{dF^{-1}(P_i)}{dP_i}\right]^2 = \frac{P_i(1 - P_i)}{\{f[F^{-1}(P_i)]\}^2} = \frac{P_i(1 - P_i)}{[f(z_i)]^2}$$
where $f = F'$ is the density function corresponding to $F$.
This brings us to the approximating model:
$$F^{-1}(\hat{P}_i) = z_i + u_i$$
That is,
$$\hat{L}_i = \ln\left(\frac{\hat{P}_i}{1 - \hat{P}_i}\right) = \beta_1 + \beta_2 X_i + u_i$$
where the $u_i$ are independent and normally distributed with mean zero and variance:
$$\mathrm{var}(u_i) = \frac{P_i(1 - P_i)}{N_i [f(z_i)]^2} = \frac{F(z_i)[1 - F(z_i)]}{N_i [f(z_i)]^2}.$$
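Because for the logit model $f = F(1 - F)$, we have:
$$\mathrm{var}(u_i) = \frac{1}{N_i F(z_i)[1 - F(z_i)]} = \frac{1}{N_i P_i (1 - P_i)}$$
Therefore, we need to use weighted least squares to estimate equation (6), with $\frac{1}{N_i \hat{P}_i (1 - \hat{P}_i)}$ serving as the estimated error variance, so that each observation is weighted by its reciprocal. In the above example, if $w_i = N_i \hat{P}_i (1 - \hat{P}_i)$, then we can apply weighted least squares by running the following regression:
$$\sqrt{w_i}\,L_i = \beta_1 \sqrt{w_i} + \beta_2 \sqrt{w_i}\,X_i + \sqrt{w_i}\,\varepsilon_i \;\Rightarrow\; L_i^* = \beta_1 \sqrt{w_i} + \beta_2 X_i^* + v_i \tag{7}$$
where $L_i^* = \sqrt{w_i}\,L_i$, $X_i^* = \sqrt{w_i}\,X_i$ and $v_i = \sqrt{w_i}\,\varepsilon_i$. The transformed disturbance $v_i$ is homoscedastic, and regression (7) has no separate intercept term.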
The results are as follows:
$$\hat{L}_i^* = -1.59474\,\sqrt{w_i} + 0.07862\,X_i^*$$
$$t = (-14.43619) \qquad (14.56675)$$
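The weighted least squares steps above can be sketched with NumPy using the grouped data of Table 1. This is an illustrative reconstruction under the definitions in the text, not the computation originally used for the notes, so the estimates may differ from the reported ones in the later decimal places because of rounding.

```python
import numpy as np

# Grouped data from Table 1.
X = np.array([6, 8, 10, 13, 15, 20, 25, 30, 35, 40], dtype=float)     # income, thousand dollars
N = np.array([40, 50, 60, 80, 100, 70, 65, 50, 40, 25], dtype=float)  # families at each income
n = np.array([8, 12, 18, 28, 45, 36, 39, 33, 30, 20], dtype=float)    # home owners among them

p_hat = n / N                            # relative frequencies
L_hat = np.log(p_hat / (1.0 - p_hat))    # empirical logits, equation (6)
w = N * p_hat * (1.0 - p_hat)            # reciprocal of the estimated error variance

# Equation (7): regress sqrt(w)*L on sqrt(w) and sqrt(w)*X, with no separate intercept.
sw = np.sqrt(w)
design = np.column_stack([sw, sw * X])
(b1, b2), *_ = np.linalg.lstsq(design, sw * L_hat, rcond=None)
print("beta_1 estimate:", b1)
print("beta_2 estimate:", b2)
```

Running OLS on the transformed variables is equivalent to minimizing the weighted sum of squares $\sum_i w_i (L_i - \beta_1 - \beta_2 X_i)^2$, which is why no dedicated weighted least squares routine is needed here.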
We have several ways to interpret the results:
(A). Interpretation of the Estimated Logit Model
The coefficient of $X_i^*$ tells us that as weighted income increases by one unit (1,000 dollars), the weighted log of the odds in favor of owning a house goes up by about 0.08 units. But this is an indirect interpretation.
(B). Odds Interpretation
From equation (4) we have:
$$\frac{P_i}{1 - P_i} = \frac{1 + e^{Z_i}}{1 + e^{-Z_i}} = e^{Z_i} = e^{\beta_1 + \beta_2 X_i}$$
Since we have the estimates of $\beta_1$ and $\beta_2$, we can calculate the change in the odds in favor of owning a house caused by a unit change in income $X_i$: a one-unit increase in $X_i$ multiplies the odds by $e^{\beta_2}$.
(C). Computing Probabilities
From equation (3) we have:
$$P_i = \frac{e^{Z_i}}{1 + e^{Z_i}} = \frac{e^{\beta_1 + \beta_2 X_i}}{1 + e^{\beta_1 + \beta_2 X_i}}$$
Using the estimates of $\beta_1$ and $\beta_2$, we can calculate the estimated probability of owning a house given a certain income level.
For example, when $X = 20$ (income is 20,000 dollars),
$$\hat{Z}_i = \hat{\beta}_1 + \hat{\beta}_2 X_i = -1.59474 + 0.07862 \times 20 = -0.022$$
$$\hat{P}_i = \frac{e^{\hat{Z}_i}}{1 + e^{\hat{Z}_i}} = \frac{e^{-0.022}}{1 + e^{-0.022}} = 0.49$$
That is, given an income of 20,000 dollars, the probability of a family owning a house is about 49 percent.
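(D). Computing the Rate of Change of Probability
$$\frac{dP}{dX} = \frac{\partial P}{\partial Z} \cdot \frac{\partial Z}{\partial X} = \frac{e^{Z}(1 + e^{Z}) - e^{Z} \cdot e^{Z}}{(1 + e^{Z})^2} \cdot \beta_2 = \frac{e^{Z}}{1 + e^{Z}} \cdot \frac{1}{1 + e^{Z}} \cdot \beta_2 = P(1 - P)\,\beta_2 = \beta_2 P(1 - P)$$
To calculate the change in the probability of owning a house for a unit increase in income from the level of 20,000 dollars:
$$\hat{\beta}_2 \hat{P}(1 - \hat{P}) = 0.07862 \times 0.49 \times 0.51 = 0.01965$$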
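The odds, probability, and rate-of-change interpretations can be checked numerically. The sketch below simply plugs the coefficient estimates reported in the notes into equations (3) and (4) and into the derivative formula; the variable names are illustrative.

```python
import math

# Estimates reported in the notes for the grouped-data logit model.
b1, b2 = -1.59474, 0.07862
x = 20.0  # income of 20,000 dollars, measured in thousand dollars

z = b1 + b2 * x                          # estimated logit at X = 20
p = math.exp(z) / (1.0 + math.exp(z))    # estimated probability, equation (3)
odds_multiplier = math.exp(b2)           # factor applied to the odds per extra unit of X
marginal_effect = b2 * p * (1.0 - p)     # dP/dX = beta_2 * P * (1 - P)

print("Z:", round(z, 3))
print("P:", round(p, 2))
print("odds multiplier:", round(odds_multiplier, 4))
print("dP/dX:", round(marginal_effect, 5))
```

The odds multiplier does not depend on the income level, whereas the change in probability does, because it is scaled by $P(1 - P)$.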
2.2.2 Ungrouped or Individual Data
An example of this kind of data is as follows:
Table 2
Family    Y (1 if owns house, 0 otherwise)    X (income, thousand dollars)
1         0                                    8
2         1                                   16
3         1                                   18
4         0                                   11
5         0                                   12
6         1                                   19
7         1                                   20
8         0                                   13
9         0                                    9
Here, $P_i = 1$ if a family owns a house and $P_i = 0$ if it does not own a house. If we try to calculate the logit $L_i$:
When a family owns a house, $L_i = \ln\left(\frac{P_i}{1 - P_i}\right) = \ln\left(\frac{1}{0}\right)$.
When a family does not own a house, $L_i = \ln\left(\frac{P_i}{1 - P_i}\right) = \ln\left(\frac{0}{1}\right)$.
Obviously, these expressions are meaningless. Therefore, if we have data at the micro, or individual, level, we cannot estimate the logit model by the standard OLS method and may have to resort to the maximum likelihood (ML) method to estimate the parameters.
Recall that
$$P_i = \frac{1}{1 + e^{-(\beta_1 + \beta_2 X_i)}}$$
We do not actually observe $P_i$, but only observe the outcome $Y = 1$ if an individual owns a house and $Y = 0$ if the individual does not own a house. Since each $Y_i$ is a Bernoulli random variable, we can write
$$\Pr(Y_i = 1) = P_i \quad \text{and} \quad \Pr(Y_i = 0) = 1 - P_i$$
Suppose we have a random sample of $N$ observations and let $f_i(Y_i)$ be the probability that $Y_i = 1$ or $0$. Then the likelihood function is
$$f(Y_1, Y_2, \ldots, Y_N) = \prod_{i=1}^{N} f_i(Y_i) = \prod_{i=1}^{N} P_i^{Y_i} (1 - P_i)^{1 - Y_i}$$
Since this is a little awkward to manipulate, we take its natural logarithm. The log-likelihood function is:
$$\ln f(Y_1, Y_2, \ldots, Y_N) = \sum_{i=1}^{N} [Y_i \ln P_i + (1 - Y_i) \ln(1 - P_i)] = \sum_{i=1}^{N} \left[Y_i \ln\left(\frac{P_i}{1 - P_i}\right)\right] + \sum_{i=1}^{N} \ln(1 - P_i)$$
Since
$$1 - P_i = \frac{1}{1 + e^{\beta_1 + \beta_2 X_i}} \quad \text{and} \quad \ln\left(\frac{P_i}{1 - P_i}\right) = \beta_1 + \beta_2 X_i$$
we have:
$$\ln f(Y_1, Y_2, \ldots, Y_N) = \sum_{i=1}^{N} Y_i (\beta_1 + \beta_2 X_i) - \sum_{i=1}^{N} \ln(1 + e^{\beta_1 + \beta_2 X_i})$$
Next, we differentiate the above equation with respect to $\beta_1$ and $\beta_2$, set the resulting expressions to zero, and solve the resulting equations. Then we apply the second-order condition of maximization to verify that the parameter values we have obtained do in fact maximize the likelihood function.
However, the resulting equations are nonlinear in the parameters and cannot be easily solved analytically. Consequently, the maximum likelihood estimates must be obtained by iterative, numerical techniques. All of the commonly used approaches to the numerical solution of likelihood equations are derived from Newton's method, which is also called the Newton-Raphson method. To learn more about the Newton-Raphson method, you can read the following books:
1. Econometrics of Qualitative Dependent Variables by Christian Gourieroux, or
2. Advanced Econometric Methods by Thomas B. Fomby, R. Carter Hill and Stanley R. Johnson.
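As a rough illustration of the iterative approach, the sketch below applies Newton-Raphson to the log-likelihood derived above, using NumPy. The data are hypothetical and were made up for this example. The Table 2 data are not used because every family with income of 16 or more owns a house and every family below that does not, so the likelihood has no finite maximizer for that sample. The function name and the stopping rule are also assumptions of this sketch.

```python
import numpy as np

def fit_logit_newton(x, y, tol=1e-10, max_iter=50):
    """Maximize the logit log-likelihood by Newton-Raphson; returns [beta_1, beta_2]."""
    X = np.column_stack([np.ones(len(x)), x])
    beta = np.zeros(2)                                   # starting values
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))            # P_i at the current beta
        score = X.T @ (y - p)                            # first derivatives of ln f
        hessian = -(X * (p * (1.0 - p))[:, None]).T @ X  # second derivatives of ln f
        step = np.linalg.solve(hessian, score)
        beta = beta - step                               # Newton-Raphson update
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Hypothetical sample (NOT from the notes): income in thousand dollars and ownership.
x = np.array([8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20], dtype=float)
y = np.array([0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1], dtype=float)

beta_hat = fit_logit_newton(x, y)
p_hat = 1.0 / (1.0 + np.exp(-(beta_hat[0] + beta_hat[1] * x)))
score_at_optimum = np.array([np.sum(y - p_hat), np.sum((y - p_hat) * x)])
print("estimates:", beta_hat)
print("first-order conditions at the estimates:", score_at_optimum)
```

At the solution both first derivatives are numerically zero, which is the first-order condition described in the text. The Hessian of the logit log-likelihood is negative definite whenever the regressors are not collinear, so the second-order condition holds as well.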
Several points are noteworthy about using the method of maximum likelihood in the binary dependent
variable model:
A. Asymptotic estimated standard errors
The method of maximum likelihood is generally a large-sample method, so the estimated standard errors are asymptotic.
B. Measure of statistical significance
If the sample size is reasonably large, the t distribution converges to the normal distribution. Thus we use the standard normal Z statistic instead of the t statistic to evaluate statistical significance.
C. Measure of goodness of fit
The conventional $R^2$ is not a meaningful measure of goodness of fit in binary regressand models. Instead, there are various measures of goodness of fit for binary regressand models, and they are called pseudo $R^2$.
Often the goodness-of-fit measures are implicitly or explicitly based on comparisons with a model that contains only a constant as the explanatory variable. Let $\ln L_1$ be the maximum log-likelihood value of the model of interest and let $\ln L_0$ be the maximum value of the log-likelihood function when all parameters, except the intercept, are set to zero. Because the log-likelihood is the sum of log probabilities, each of which is at most zero, and the restricted model is a special case of the model of interest, $\ln L_0 \le \ln L_1 \le 0$. The larger the difference between the two log-likelihood values, the more the extended model adds to the very restrictive model.
To compute $\ln L_0$, it is not necessary to estimate a logit model with an intercept term only. If there is only a constant term in the model, the MLE for $P$ is:
$$\hat{P} = N_1 / N$$
where $N_1 = \sum_i Y_i$. That is, the estimated probability is equal to the proportion of ones in the sample. Hence,
$$\ln L_0 = \sum_{i=1}^{N} Y_i \ln(N_1/N) + \sum_{i=1}^{N} (1 - Y_i) \ln(1 - N_1/N) = N_1 \ln(N_1/N) + (N - N_1) \ln(1 - N_1/N)$$
The value of $\ln L_1$ can be obtained from the computer package.
McFadden $R^2$ is one of the measures of goodness of fit in binary regressand models:
$$\text{McFadden } R^2 = 1 - \frac{\ln L_1}{\ln L_0}$$
Conceptually, $\ln L_1$ is equivalent to RSS and $\ln L_0$ is equivalent to TSS in the linear regression. McFadden $R^2$ ranges between 0 and 1.
Another simple measure is the count $R^2$, which is defined as:
Count $R^2$ = number of correct predictions / total number of observations.
Since the regressand in the logit model takes a value of 1 or 0, to calculate the count $R^2$ we need to classify each predicted probability as 1 or 0. If the predicted probability is greater than 0.5, we classify it as 1. If the predicted probability is less than 0.5, we classify it as 0.
However, it should be noted that the most important thing in the binary regressand models is the expected
signs of the regression coefficients and their statistical and/or practical significance.
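The goodness-of-fit measures under point C can be sketched as three small Python functions. The four observations and their fitted probabilities below are hypothetical values invented for illustration, treated as if they came from an estimated logit model; the function names are likewise assumptions of this sketch.

```python
import math

def restricted_loglik(y):
    """ln L0 for the constant-only model: N1*ln(N1/N) + (N - N1)*ln(1 - N1/N).
    Requires at least one 1 and one 0 in y, otherwise a log of zero occurs."""
    n, n1 = len(y), sum(y)
    p = n1 / n
    return n1 * math.log(p) + (n - n1) * math.log(1.0 - p)

def mcfadden_r2(lnL1, lnL0):
    """McFadden R^2 = 1 - ln L1 / ln L0."""
    return 1.0 - lnL1 / lnL0

def count_r2(y, p_hat, cutoff=0.5):
    """Share of observations whose predicted class (p_hat > cutoff) matches y."""
    correct = sum(int(p > cutoff) == yi for yi, p in zip(y, p_hat))
    return correct / len(y)

# Hypothetical outcomes and fitted probabilities (NOT from the notes).
y = [1, 0, 1, 0]
p_hat = [0.9, 0.2, 0.4, 0.1]

lnL1 = sum(math.log(p) if yi == 1 else math.log(1.0 - p) for yi, p in zip(y, p_hat))
lnL0 = restricted_loglik(y)
print("ln L0:", lnL0)
print("McFadden R^2:", mcfadden_r2(lnL1, lnL0))
print("count R^2:", count_r2(y, p_hat))
```

In this toy example three of the four observations are classified correctly. The third observation has $Y = 1$ but a fitted probability below 0.5, so it counts as a wrong prediction.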
D. Likelihood ratio statistic (LR)
To test the null hypothesis that all the slope coefficients are simultaneously equal to zero, the equivalent of the F test in the linear regression model is the likelihood ratio statistic:
$$LR = -2(\ln L_0 - \ln L_1)$$
Given the null hypothesis, the LR statistic follows the $\chi^2$ distribution with degrees of freedom equal to the number of explanatory variables (excluding the intercept). The null hypothesis that all the slope coefficients are simultaneously equal to zero is rejected for large values of the likelihood ratio statistic.
Problems:
1. For the data in Table 1, interpret the logit regression results at income level X = 35 in the ways shown in the notes above:
A. What is the change in the odds in favor of owning a house if income increases by one unit (1,000 dollars) at income level X = 35?
B. Calculate the estimated probability of owning a house at income level X = 35.
C. Calculate the change in the probability of owning a house for a unit increase in income from the level of 35,000 dollars.
2. Suppose we use a sample of 32 observations to estimate a binary regressand logit model with 3 independent variables. In this sample, the number of ones in the dependent variable is 11. From the computer package we know that the unrestricted log-likelihood for this model is -12.890. Calculate the likelihood ratio statistic and test, at the 0.05 level, the hypothesis that the coefficients on the 3 independent variables are all zero.
Answers to problems:
Problem 1.
A. From equation (4),
$$\frac{P_i}{1 - P_i} = e^{Z_i} = e^{\beta_1 + \beta_2 X_i}$$
so a one-unit increase in $X_i$ multiplies the odds by
$$e^{0.07862} \approx 1.0818, \qquad 1.0818 - 1 = 0.0818 = 8.18\%$$
So the odds in favor of owning a house increase by about 8.18% if income increases by one unit (1,000 dollars) at income level X = 35.
B. $e^{\hat{\beta}_1 + \hat{\beta}_2 X_i} = e^{-1.59 + 0.078 \times 35} = e^{1.14} = 3.13$
$$\hat{P} = \frac{e^{\hat{\beta}_1 + \hat{\beta}_2 X_i}}{1 + e^{\hat{\beta}_1 + \hat{\beta}_2 X_i}} = \frac{3.13}{1 + 3.13} = 0.76$$
Thus, the estimated probability of owning a house at income level X = 35 is 0.76.
C. $\hat{\beta}_2 \hat{P}(1 - \hat{P}) = 0.07862 \times 0.76 \times 0.24 = 0.0143$
So the change in the probability of owning a house for a unit increase in income from the level of 35,000 dollars is 0.0143.
Problem 2.
The restricted log-likelihood is:
$$\ln L_0 = N_1 \ln(N_1/N) + (N - N_1) \ln(1 - N_1/N) = 11 \ln(11/32) + (32 - 11) \ln(1 - 11/32) = -20.5917$$
Thus, $LR = -2(\ln L_0 - \ln L_1) = -2 \times [-20.5917 - (-12.890)] = 15.403$
The critical value from the chi-squared distribution with 3 degrees of freedom at 0.05 level is 7.81. Hence,
we reject the hypothesis at the 0.05 level that the coefficients on the 3 independent variables are all zero.
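The arithmetic of Problem 2 can be verified with a short script. This is only a check of the calculation; the critical value 7.81 is taken from the answer above rather than computed, and the variable names are illustrative.

```python
import math

N, N1 = 32, 11       # sample size and number of ones in the dependent variable
lnL1 = -12.890       # unrestricted log-likelihood given in the problem

# Restricted log-likelihood of the constant-only model.
lnL0 = N1 * math.log(N1 / N) + (N - N1) * math.log(1.0 - N1 / N)

# Likelihood ratio statistic for H0: all three slope coefficients are zero.
LR = -2.0 * (lnL0 - lnL1)

critical_value = 7.81   # chi-squared, 3 degrees of freedom, 0.05 level (from the notes)
reject = LR > critical_value

print("ln L0:", round(lnL0, 4))
print("LR:", round(LR, 3))
print("reject H0:", reject)
```

The LR statistic is roughly twice the critical value, so the conclusion does not hinge on rounding in the last decimal place.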