Regression with a Binary Dependent Variable
Outline of the Chapter
 Introduction
 Example: Labor Force Participation
 Linear Probability Model
   Interpretation of Coefficients
   Estimation
   Heteroskedasticity of Errors
   Fit
   Shortcomings of the Linear Probability Model
 Probit/Logit Models
   Motivation
   Estimation
   Fit
Introduction
 Many situations in Economics require each individual in a
population to make a decision with two possible outcomes.
The decision may be modeled with a binary (0 or 1)
dependent variable Y.
 For example, Y can be defined to indicate whether an adult has a high school education; Y can indicate whether a loan was repaid or not; Y can indicate whether a firm was taken over by another firm during a given year.
 We are typically interested in the determinants of the decision, that is, in how likely a particular individual is to choose 1 instead of 0.
Example – Labor Force
Participation Decision by Women
inlf = β0 + β1 nwifeinc + β2 educ + β3 exper + β4 age + β5 kidslt6 + β6 kidsge6 + u
where
inlf = binary variable that takes the value 1 if the woman reports working for a wage outside the home at some point during the year 1975
nwifeinc = other sources of income (including the husband's income)
educ = years of education
Example (cont)
inlf = β0 + β1 nwifeinc + β2 educ + β3 exper + β4 age + β5 kidslt6 + β6 kidsge6 + u
exper = past years of labor market experience
age = age in years
kidslt6 = number of children younger than 6 years old
kidsge6 = number of children whose age is between 6 and 18
The sample includes 753 married women, 428 of whom participated in the labor force.
Example (cont)
The estimation (using OLS) produced the following
results:
inlf̂ = 0.707 − 0.0033 nwifeinc + 0.040 educ + 0.023 exper − 0.018 age − 0.272 kidslt6 + 0.0125 kidsge6
(standard errors: 0.146, 0.0016, 0.007, 0.002, 0.002, 0.031, 0.0136)
n = 753, R² = 0.254
All variables, except kidsge6, are statistically significant.
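For readers who want to reproduce numbers like these, here is a minimal sketch (not taken from the slides) of the OLS estimation in Python with statsmodels; the DataFrame `mroz` and the file path are assumptions, standing in for any copy of the MROZ data with the column names used above.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical loading step: any file with columns inlf, nwifeinc, educ,
    # exper, age, kidslt6, kidsge6 would work here.
    mroz = pd.read_csv("mroz.csv")

    # The LPM is just OLS with a binary dependent variable.
    lpm = smf.ols(
        "inlf ~ nwifeinc + educ + exper + age + kidslt6 + kidsge6",
        data=mroz,
    ).fit()
    print(lpm.summary())  # coefficients with the usual (non-robust) OLS standard errors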
Example (cont)
 Plot of the fitted line against years of schooling when
nwifeinc = 50, exper = 5, age = 30, kidslt6 = 1, kidsge6 = 0
Meaning of the coefficients
 What does the slope of the line (β̂2 = 0.04) mean?
 When educ = 12, the predicted inlf is 0.33. What does this 0.33 mean?
Let's go back to the model:
inlf = β0 + β1 nwifeinc + β2 educ + β3 exper + β4 age + β5 kidslt6 + β6 kidsge6 + u
 Because Y = inlf can only take two values, β2 cannot be interpreted as the change in Y, on average, given a one-year increase in the years of schooling, holding other factors fixed: Y either changes from zero to one or from one to zero (or does not change).
Interpretation of Coefficients
Consider the generic model:
Y = β0 + β1 X1 + β2 X2 + ... + βk Xk + u
Y can be a variable with quantitative meaning (like wage) or a binary dependent variable (like inlf).
E[Y|X1, X2, …, Xk] = E[β0 + β1 X1 + β2 X2 + ... + βk Xk + u | X1, X2, …, Xk]
E[Y|X1, X2, …, Xk] = E[β0 | X1, X2, …, Xk] + E[β1 X1 | X1, X2, …, Xk] + E[β2 X2 | X1, X2, …, Xk] + ... + E[βk Xk | X1, X2, …, Xk] + E[u | X1, X2, …, Xk]
E[Y|X1, X2, …, Xk] = β0 + β1 X1 + β2 X2 + ... + βk Xk + E[u | X1, X2, …, Xk]
Assume the first OLS assumption, E[u | X1, X2, …, Xk] = 0, holds; then
E[Y|X1, X2, …, Xk] = β0 + β1 X1 + β2 X2 + ... + βk Xk
Interpretation of Coefficients
E[Y|X1, X2, …, Xk] = β0+ β1 X1 + β2 X2 + ….+ βk Xk
When Y has a quantitative meaning:
Predicted value from the population regression function is
the mean value of Y, given all the values of the regressors
β1 = expected change in Y resulting from changing X1 by one unit, holding constant X2, …, Xk.
OLS predicted values: Ŷ = β̂0 + β̂1 X1 + β̂2 X2 + … + β̂k Xk
β̂1 = estimated value of β1
Interpretation of Coefficients
E[Y|X1, X2, …, Xk] = β0+ β1 X1 + β2 X2 + ….+ βk Xk
But when Y can take only values 0 or 1 then
E[Y|X1, …, Xk] = 0 × Pr[Y=0|X1, …, Xk] + 1 × Pr[Y=1|X1, …, Xk]
E[Y|X1, …, Xk] = Pr[Y=1|X1, …, Xk]
Thus,
Pr[Y=1|X1, …, Xk] = β0 + β1 X1 + β2 X2 + ... + βk Xk
Meaning of the coefficients
P[Y=1|X1, …, Xk] = β0 + β1 X1 + β2 X2 + ... + βk Xk
Predicted value from the population regression function is
the probability that Y = 1, given all the values of the
regressors
This model is called the Linear Probability Model (LPM) because the population regression function is linear in the parameters and represents a probability.
βj measures the change in the probability that Y = 1 when Xj changes by one unit, keeping all the other factors fixed.
Estimation of the LPM
The mechanics of OLS are the same as before.
When we estimate the parameters, we obtain
Ŷ = β̂0 + β̂1 X1 + β̂2 X2 + … + β̂k Xk
β̂j = estimate of the change in the probability that Y = 1, for a unit change in Xj, keeping all the other factors constant
Ŷ = estimated probability that Y = 1, given values for the regressors
Back to the example
• An additional year of education is estimated to increase the
probability of labor force participation by 0.04 (4 percentage
points), keeping all the other factors constant
• If education = 12, the probability of labor force participation is estimated to be around 33% (for the values of all the other variables specified in the earlier plot)
• For those same values of the other variables, the predicted probability of participating in the labor force is negative when education is less than 3.75 years.
• In this case, this is not much cause for concern, since no woman in this sample has less than 5 years of education. Moreover, for the highest education level in the sample (17 years of schooling), the predicted probability of labor force participation is 0.527.
Back to the example
 The coefficient on nwifeinc implies that if other sources of income increase by $10,000, ceteris paribus, the probability that a woman is in the labor force falls by 0.033 (3.3 percentage points).
 Holding other factors fixed, one more year of experience in the job market increases the probability of labor force participation by 0.023 (2.3 percentage points).
 Unlike the number of older children, the number of younger children has a huge impact on labor force participation. Having one additional child younger than six years old reduces the probability of participation by 0.272 (27.2 percentage points), keeping all other variables constant.
Special Features of the LPM
 (1) Errors from the LPM are always heteroskedastic.
To show why, let's use a simple regression model:
Y = β0 + β1 X + u
If the model is well specified, E[u|X] = 0, and then
E[Y | X] = β0 + β1 X
Var[u|X] = Var[Y - β0 - β1 X | X]
= Var[Y|X]
Special Features of the LPM
Since Yi can only take two values (0 and 1),
Var[Y|X] = (0 − E[Y|X])² × P[Y = 0|X] + (1 − E[Y|X])² × P[Y = 1|X]
Var[Y|X] = (0 − (β0 + β1X))² × P[Y = 0|X] + (1 − (β0 + β1X))² × P[Y = 1|X]
Var[Y|X] = (β0 + β1X)² × (1 − (β0 + β1X)) + (1 − (β0 + β1X))² × (β0 + β1X)
Var[Y|X] = (β0 + β1X) × (1 − (β0 + β1X))
Thus, Var[Y|X] = Var[u|X] is not constant; it depends on X. The errors are heteroskedastic (see the robust-standard-error sketch after this list).
• (2) The R² concept does not carry over to the LPM.
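Because of point (1), standard errors for the LPM are usually reported in a heteroskedasticity-robust form. A minimal sketch, reusing the hypothetical `mroz` DataFrame from the earlier LPM snippet:

    import statsmodels.formula.api as smf

    # Same LPM regression, but with White/Huber (HC1) robust standard errors.
    lpm_robust = smf.ols(
        "inlf ~ nwifeinc + educ + exper + age + kidslt6 + kidsge6",
        data=mroz,
    ).fit(cov_type="HC1")
    print(lpm_robust.bse)  # robust SEs; compare with the non-robust lpm.bse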
Another Example
Population: young men in California, born in 1960-61, who have at least one prior arrest
arr86̂ = 0.380 + 0.152 pcnv + 0.0046 avgsen − 0.0026 tottime − 0.024 ptime86 − 0.038 qemp86 + 0.170 black + 0.096 hispan
Another Example (cont)
where
arr86 = binary variable that takes the value 1 if the young man was arrested in 1986, 0 otherwise
pcnv = proportion of prior arrests that led to conviction
avgsen = average sentence from prior convictions
tottime = months spent in prison since age 18 prior to 1986
ptime86 = months spent in prison in 1986
qemp86 = number of quarters employed in 1986
black = binary variable that takes the value 1 if the young man is black
hispan = binary variable that takes the value 1 if the young man is Hispanic
Another Example (cont)
arr86̂ = 0.380 + 0.152 pcnv + 0.0046 avgsen − 0.0026 tottime − 0.024 ptime86 − 0.038 qemp86 + 0.170 black + 0.096 hispan
The model tells us that the probability of arrest is 17
percentage points higher for a black man than for a
white man (the base group), keeping all the other
variables constant.
Shortcomings of LPM
The LPM has the advantage of being easy to estimate
and interpret, but its linearity is also the root of its
major flaws.
Shortcomings:
• Predicted probabilities might be either less than zero
or greater than one. In our example, out of the 753
women, 16 predicted probabilities of entering the
labor force were negative and 17 were greater than one.
Shortcomings of LPM
• Probabilities cannot be linearly related to the independent
variables for all their possible values.
• In our example, the effect on the probability of working of having an additional child is always the same (a decrease of 0.272).
• Thus, the effect is the same if the woman goes from having no
children to having one child or if the woman goes from
having three children to having four children. That, most
likely, is not realistic.
• Also, taken to the extreme, going from zero to four children
diminishes the probability of entering the workforce by 1.088
(108.8%)!
• It seems more realistic that the first small child would reduce
the probability by a large amount, but subsequent children
would have a smaller marginal effect. This is not captured in
the LPM specification.
Shortcomings of LPM
Coefficients from an
LPM may
mischaracterize the
relationship between
X and Y.
Linear Probability Model
• Even with these problems, the LPM is useful and often applied in Economics.
The LPM works well for values of the regressors that are near the averages of
the sample.
• The predicted probabilities outside the unit interval are a little troubling. Still,
we are able to use the estimated probabilities to predict the zero-one outcome:
Ŷi are the predicted probabilities (the fitted values).
Then, if Ŷi is greater than or equal to 0.5, we predict the outcome Y = 1 (the woman enters the workforce);
if Ŷi is less than 0.5, we predict the outcome Y = 0 (the woman does not enter the workforce).
Then, we could compare the actual data with our prediction of the decision and
find the percentage of correct predictions we get from the model (a goodness of
fit measure for binary dependent variables).
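A sketch of this prediction rule and the resulting fraction of correct predictions, reusing the hypothetical `lpm` fit and `mroz` DataFrame from the earlier snippet:

    import numpy as np

    p_hat = lpm.fittedvalues             # estimated probabilities (can fall outside [0, 1])
    y_pred = (p_hat >= 0.5).astype(int)  # predict Y = 1 when the estimated probability is >= 0.5
    frac_correct = np.mean(y_pred == mroz["inlf"])
    print(f"Fraction correctly predicted: {frac_correct:.3f}")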
Probit and Logit Models
LPM: P[Y=1|X1, …, Xk] = β0 + β1 X1 + β2 X2 + ... + βk Xk
Now,
P[Y = 1 | X1, …, Xk] = G(α0 + α1 X1 + α2 X2 + ... + αk Xk)
where G is a function that takes values between zero and
one. That ensures that the estimated probabilities are
strictly between 0 and 1.
What kind of functions do we know that vary between zero
and one? Cumulative Distribution Functions (cdf).
Probit and Logit Functions
 Probit Model: G is the standard normal cumulative
distribution function (cdf)
G(z) = Φ(z) = ∫_{−∞}^{z} (2π)^{−1/2} exp(−v²/2) dv
 Logit Model: G is the logistic function (the cdf of a
standard logistic variable)
G(z) = exp(z) / (1 + exp(z)) = 1 / (1 + exp(−z))
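Both choices of G are one line of Python each; a small sketch, using scipy's implementation of the standard normal cdf for the probit case:

    import numpy as np
    from scipy.stats import norm

    def G_probit(z):
        """Standard normal cdf Phi(z)."""
        return norm.cdf(z)

    def G_logit(z):
        """Logistic cdf: exp(z)/(1 + exp(z)) = 1/(1 + exp(-z))."""
        return 1.0 / (1.0 + np.exp(-z))

    # Both map any real z into (0, 1):
    for z in (-3.0, 0.0, 3.0):
        print(z, G_probit(z), G_logit(z))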
Probit vs LPM - Illustration
Goal: Instead of fitting a
straight line, fit a curve
that is constrained to be
between 0 and 1.
Latent variable
What is the idea of these models?
The idea behind these models is that there exists an unobserved (continuous) variable Y* (a latent variable) that drives the dependent variable Y:
Y* = α0 + α1 X1 + α2 X2 + ... + αk Xk + u
In the example of women's participation in the work force, Y* is the difference in utility between going to work and not going to work. It depends on a number of factors (kids, other sources of income, education, etc.).
Y is the variable we observe. In our example, it is the decision of going to
work (Y = 1) or not (Y = 0).
If the utility of going to work is higher than the utility of staying at home
(Y* > 0), then the woman will go to work (Y = 1). Otherwise, she will
choose not to participate in the labor force. Thus,
 If Y* > 0 then Y = 1
 If Y* ≤ 0 then Y = 0
Latent variable (cont)
Y* = α0 + α1 X1 + α2 X2 + ⋯ + αk Xk + u
P[Y = 1 | X1, ⋯, Xk] =
= P[Y* > 0 | X1, ⋯, Xk] =
= P[α0 + α1 X1 + α2 X2 + ⋯ + αk Xk + u > 0 | X1, ⋯, Xk] =
= P[u > −(α0 + α1 X1 + α2 X2 + ⋯ + αk Xk) | X1, ⋯, Xk] =
= 1 − P[u ≤ −(α0 + α1 X1 + α2 X2 + ⋯ + αk Xk) | X1, ⋯, Xk] =
= 1 − G(−(α0 + α1 X1 + α2 X2 + ⋯ + αk Xk))
Latent variable (cont)
P[Y = 1 | X1, ⋯, Xk] = 1 − G(−(α0 + α1 X1 + α2 X2 + ⋯ + αk Xk))
If the density function of u is symmetric around zero (true for a standard normal and a standard logistic random variable), then
P[Y = 1 | X1, ⋯, Xk] = G(α0 + α1 X1 + α2 X2 + ⋯ + αk Xk)
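A small simulation may make the latent-variable story concrete. The coefficient values below are purely illustrative; drawing u from a standard normal (symmetric around zero) makes the simulated data follow a probit model:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000
    x = rng.normal(size=n)
    u = rng.normal(size=n)         # error symmetric around zero
    y_star = -0.5 + 1.2 * x + u    # latent index Y* (illustrative coefficients)
    y = (y_star > 0).astype(int)   # we observe only Y = 1 when Y* > 0
    print("fraction of ones:", y.mean())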
Probit and Logit Models
The LPM may have its problems, but it is definitely easy to interpret: a one-unit increase in Xj is associated with a βj change in the probability that Y = 1, all else constant.
Probit and logit models have their strengths, but being easy to interpret is not one of them. The effect of Xj on the probability that Y = 1 is not constant: we will see that this effect depends not only on the value of Xj, but also on the values of the other independent variables.
Probit Model
P[Y=1|X1, X2, …, Xk] = Φ(α0 + α1 X1 + …. + αk Xk)
where Φ is the standard normal cdf
We are primarily interested in the effects of each variable Xj on P[Y = 1 | X1, X2, …, Xk]. We are not interested in knowing the effect of Xj on Y*, i.e., we are not really interested in the magnitude of αj.
Since the direction of the effect of Xj on E[Y*|X1, X2, …, Xk] is the same as the direction of the effect of Xj on E[Y|X1, X2, …, Xk] = P[Y = 1 | X1, X2, …, Xk], we are interested in the sign of αj.
Interpretation of Coefficients
If αj > 0, then an increase in Xj increases P[Y = 1 | X1, X2, …, Xk];
if αj < 0, then an increase in Xj decreases P[Y = 1 | X1, X2, …, Xk].
Beyond this, do not interpret the α coefficients directly.
To find the magnitude of the effects (on the probability that Y = 1) of changing the independent variables, we have to compute changes in the conditional probabilities.
If X1 changes by ΔX1, then P[Y=1 |X1, X2, …, Xk] changes by
Φ(α0 + α1 (X1+ΔX1) + … + αk Xk) – Φ(α0 + α1 X1 + … + αk Xk)
Interpretation of Coefficients
Suppose P[inlf = 1 | nwifeinc] = Φ(α0 + α1 nwifeinc)
and α0 = 0.432 and α1 = -0.013
What is P[inlf = 1 | nwifeinc = 10]?
When nwifeinc = 10, α0 + α1 nwifeinc = 0.432 − 0.013 × 10 = 0.302. Thus, P[inlf = 1 | nwifeinc = 10] = Φ(0.302) = 0.6187
What happens when nwifeinc changes from 10 to 20? (Δnwifeinc = 10)
P[inlf = 1 | nwifeinc = 20] = Φ(0.432 − 0.013 × 20) = Φ(0.172) = 0.5683
Thus, ΔP[inlf = 1] = 0.5683 − 0.6187 = −0.0504
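The arithmetic above can be checked in a few lines with scipy:

    from scipy.stats import norm

    a0, a1 = 0.432, -0.013
    p_10 = norm.cdf(a0 + a1 * 10)  # Phi(0.302), about 0.6187
    p_20 = norm.cdf(a0 + a1 * 20)  # Phi(0.172), about 0.5683
    print(p_20 - p_10)             # about -0.0504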
Interpretation of Coefficients
Suppose P[inlf = 1 | nwifeinc] = Φ(α0 + α1 nwifeinc)
and α0 = 0.432 and α1 = -0.013
What happens when nwifeinc changes from 80 to 90? (Δnwifeinc = 10)
P[inlf = 1 | nwifeinc = 80] = Φ(0.432 − 0.013 × 80) = Φ(−0.608) = 0.2716
P[inlf = 1 | nwifeinc = 90] = Φ(0.432 − 0.013 × 90) = Φ(−0.738) = 0.2303
Thus, ΔP[inlf = 1] = 0.2303 − 0.2716 = −0.0413
The effect of nwifeinc depends on the value of nwifeinc.
Interpretation of Coefficients
Now,
P[inlf = 1 | nwifeinc, educ] = Φ(α0 + α1 nwifeinc + α2 educ)
and α0 = -1.131, α1 = -0.021 and α2 = 0.142
What is P[inlf = 1 | nwifeinc = 20, educ = 5]?
P[inlf = 1 | nwifeinc = 20, educ = 5] = Φ(−1.131 − 0.021 × 20 + 0.142 × 5) = Φ(−0.841) = 0.2002
What happens when educ changes from 5 to 10? (Δeduc = 5)
P[inlf = 1 | nwifeinc = 20, educ = 10] = Φ(−1.131 − 0.021 × 20 + 0.142 × 10) = Φ(−0.131) = 0.4479
Thus, ΔP[inlf = 1] = 0.4479 − 0.2002 = 0.2477
Effect of Education
nwifeinc = 20
[Figure: predicted Prob[Y = 1] plotted against education (0 to 25 years). When education increases from 5 to 10, the change in the probability of entering the labor force is 0.25; when education increases from 15 to 20, the change is 0.18.]
Interpretation of Coefficients
P[inlf = 1 | nwifeinc, educ] = Φ(α0 + α1 nwifeinc + α2 educ)
and α0 = -1.131, α1 = -0.021 and α2 = 0.142
What is P[inlf = 1 | nwifeinc = 100, educ = 5]?
P[inlf = 1 | nwifeinc = 100, educ = 5] = Φ(−1.131 − 0.021 × 100 + 0.142 × 5) = Φ(−2.521) = 0.0059
What happens when educ changes from 5 to 10? (Δeduc = 5)
P[inlf = 1 | nwifeinc = 100, educ = 10] = Φ(−1.131 − 0.021 × 100 + 0.142 × 10) = Φ(−1.811) = 0.0351
Thus, ΔP[inlf = 1] = 0.0351 − 0.0059 = 0.0292
The effect of educ depends on the value of nwifeinc.
Effect of Education
nwifeinc = 100
[Figure: predicted Prob[Y = 1] plotted against education (0 to 25 years). When education increases from 5 to 10, the change in the probability of entering the labor force is only 0.03.]
Effect of Other Sources of Income
educ = 12
[Figure: predicted Prob[Y = 1] plotted against other sources of income (0 to 140). When (other sources of) income increases from 10 to 20, the change in the probability of entering the labor force is −0.08; when income increases from 120 to 130, the change is −0.01.]
Effect of Other Sources of Income
educ = 16
[Figure: predicted Prob[Y = 1] plotted against other sources of income (0 to 140). When (other sources of) income increases from 10 to 20, the change in the probability of entering the labor force is −0.06.]
Interpretation of Coefficients
To sum up, for a probit (or logit) model
• The effect of variable Xj on the P[Y=1 | X1, …, Xk]
depends on the value of Xj
• The effect of variable Xj on the P[Y=1 | X1, …, Xk]
depends on the values of the other independent
variables
Probit Model – Estimation Results
Now, let's estimate the full model. The estimation produced the following results:
Pr[Y = 1 | X] = Φ(0.580 − 0.0116 nwifeinc + 0.134 educ + 0.070 exper − 0.056 age − 0.874 kidslt6 + 0.0345 kidsge6)
(standard errors: 0.494, 0.0051, 0.026, 0.008, 0.008, 0.115, 0.0442)
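A sketch of how such probit estimates could be produced with statsmodels (the `mroz` DataFrame is the same assumption as in the LPM snippet):

    import statsmodels.formula.api as smf

    # Probit is fit by maximum likelihood (see the MLE discussion below).
    probit_fit = smf.probit(
        "inlf ~ nwifeinc + educ + exper + age + kidslt6 + kidsge6",
        data=mroz,
    ).fit()
    print(probit_fit.summary())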
Probit Model – Estimation Results
• Now, we are assured that P(inlf = 1 | X) will be estimated between
zero and one.
• The signs of the coefficients are the same as in the LPM.
• Also, just like in the LPM, all variables are statistically significant except kidsge6.
• In the LPM, one additional child is estimated to decrease the probability of labor force participation by 27.2 percentage points, regardless of how many children the woman has.
In the probit model, the effect of one additional child on the
probability of labor participation when the woman has no children
is ΔP[inlf = 1] = Φ(-0.5016) – Φ(0.3727) = 0.3080 – 0.6453 = -0.3373.
(Note: for this computation, we kept nwifeinc, educ, exper and age
at their means and kidsge6 at 0)
Probit Model – Estimation Results
And, keeping all the other variables constant at the values
specified, the effect of one additional child on the probability
of labor participation when the woman has already one child
is ΔP[inlf = 1] = Φ(−1.3759) − Φ(−0.5016) = 0.0844 − 0.3080 = −0.2236.
Therefore, the labor force participation probability is estimated to decrease by 33.7 percentage points for an additional child when the woman has no children initially, and by 22.4 percentage points when she already has one child. The marginal effect of one additional child is smaller when the woman already has one child (younger than 6) than when she has no children.
This makes more sense than the constant marginal effect of 27.2 percentage points we found with the LPM.
Effect of number of young children
[Figure: predicted Prob[Y = 1] plotted against the number of young children (0 to 5). All other variables are set at their averages, except the number of older kids, which is kept at zero.]
Logit Model – Estimation Results
The estimation produced the following results:
Pr[Y = 1 | X] = Λ(0.838 − 0.0202 nwifeinc + 0.227 educ + 0.120 exper − 0.091 age − 1.439 kidslt6 + 0.0582 kidsge6)
(standard errors: 0.838, 0.0087, 0.044, 0.015, 0.014, 0.200, 0.0772)
where Λ is the logistic cdf.
Logit Model (cont)
The results from the logit model are very similar to those from the probit model.
The signs of the coefficients are the same as with the LPM or the probit model.
The magnitudes of the effects are also very similar (check this as an exercise).
Estimation
The coefficients of the probit and logit models cannot be estimated by OLS, since P[Y = 1 | X] is a nonlinear function of the parameters αj.
We can estimate the coefficients of the probit or logit model using two different methods: Nonlinear Least Squares (NLLS) and Maximum Likelihood Estimation (MLE).
Non Linear Least Squares
 Find the values of the parameters that minimize SSR
(sum of squared residuals)
 E[Y | X] = G(α0 + α1 X1 + … + αk Xk)
 Then, NLLS finds the values of the α's that minimize Σi [Yi − G(α0 + α1 X1i + … + αk Xki)]²
 The NLLS estimators are consistent, asymptotically
normal but they are not efficient (there are other
estimators with smaller variance).
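A minimal NLLS sketch for a probit link with a single regressor, on simulated data (the "true" coefficients −0.5 and 1.2 are made up for the illustration):

    import numpy as np
    from scipy.optimize import least_squares
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    x = rng.normal(size=500)
    y = ((-0.5 + 1.2 * x + rng.normal(size=500)) > 0).astype(int)  # probit data

    def residuals(alpha):
        # Y_i - G(alpha0 + alpha1 * X_i), with G the standard normal cdf
        return y - norm.cdf(alpha[0] + alpha[1] * x)

    nlls = least_squares(residuals, x0=np.zeros(2))
    print(nlls.x)  # consistent for (-0.5, 1.2), but less efficient than MLE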
Maximum Likelihood Estimation
 What is it?
 Likelihood function: Joint Probability function of the
data (as a function of the unknown coefficients)
 Maximum Likelihood estimates: values of the
coefficients that maximize the likelihood function
 MLE: Chooses the values of the parameters to
maximize the probability of drawing the data that are
actually observed; picks the parameter values “most
likely” to have produced the data we observe.
MLE
 Example:
Two iid observations y1 and y2 on a binary dependent
variable (Y is a Bernoulli). Assume for now that there
are no regressors, to make it simple.
P(Y = y) = p^y (1 − p)^(1 − y)
We want to estimate p = P(Y = 1) (= the mean of Y)
What is the likelihood function?
P(Y1 = y1, Y2 = y2) = P(Y1 = y1) × P(Y2 = y2) =
= p^y1 (1 − p)^(1 − y1) × p^y2 (1 − p)^(1 − y2) = p^(y1 + y2) (1 − p)^(2 − y1 − y2)
MLE
We need to find the value of p that maximizes the likelihood function: P(Y1 = y1, Y2 = y2) = p^(y1 + y2) (1 − p)^(2 − y1 − y2)
 Take the derivative of the likelihood function, set it equal to zero, and solve:
(y1 + y2) p^(y1 + y2 − 1) (1 − p)^(2 − y1 − y2) − p^(y1 + y2) (2 − y1 − y2) (1 − p)^(1 − y1 − y2) = 0
(y1 + y2) p^(y1 + y2 − 1) (1 − p)^(2 − y1 − y2) = p^(y1 + y2) (2 − y1 − y2) (1 − p)^(1 − y1 − y2)
Divide both sides by p^(y1 + y2) (1 − p)^(1 − y1 − y2):
(y1 + y2) p⁻¹ (1 − p) = 2 − y1 − y2
p⁻¹ − 1 = (2 − y1 − y2) / (y1 + y2)
p⁻¹ = 2 / (y1 + y2)
p = (y1 + y2) / 2 … (the sample average!)
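A quick numerical check of this result: with y1 = 1 and y2 = 0, the likelihood p(1 − p) should peak at the sample average 0.5.

    import numpy as np

    y1, y2 = 1, 0
    p = np.linspace(0.001, 0.999, 999)
    likelihood = p ** (y1 + y2) * (1 - p) ** (2 - y1 - y2)
    print(p[np.argmax(likelihood)])  # 0.5 = (y1 + y2) / 2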
MLE
 Probit/Logit example:
n iid observations y1, y2, …, yn on a binary dependent variable, and
P[Yi = 1 | X1i, X2i, …, Xki] = G(α0 + α1 X1i + … + αk Xki) = pi
We want to estimate the values of α0, α1, …, αk that maximize the likelihood function.
What is the likelihood function?
For the ith observation, P[Yi = yi | X1i, …, Xki] = pi^yi (1 − pi)^(1 − yi)
Then
L = P(Y1 = y1, Y2 = y2, …, Yn = yn | X's) =
= P(Y1 = y1 | X11, …, Xk1) × P(Y2 = y2 | X12, …, Xk2) × … × P(Yn = yn | X1n, …, Xkn) =
= p1^y1 (1 − p1)^(1 − y1) × p2^y2 (1 − p2)^(1 − y2) × … × pn^yn (1 − pn)^(1 − yn)
MLE
 Probit/Logit example (cont):
Take logs to obtain
log L = y1 log(p1) + (1 − y1) log(1 − p1) + y2 log(p2) + (1 − y2) log(1 − p2) + … + yn log(pn) + (1 − yn) log(1 − pn) =
= Σi yi log(pi) + Σi (1 − yi) log(1 − pi) =
= Σi yi log(G(α0 + α1 X1i + … + αk Xki)) + Σi (1 − yi) log(1 − G(α0 + α1 X1i + … + αk Xki))
The values of α0, α1, …, αk that maximize the log likelihood function are the MLE estimates.
Numerical methods are used to obtain the MLE estimates.
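A minimal sketch of such a numerical maximization for a single-regressor probit, on the same kind of simulated data as the NLLS snippet (minimizing the negative log likelihood):

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    x = rng.normal(size=500)
    y = ((-0.5 + 1.2 * x + rng.normal(size=500)) > 0).astype(int)  # probit data

    def neg_log_likelihood(alpha):
        p = norm.cdf(alpha[0] + alpha[1] * x)
        p = np.clip(p, 1e-12, 1 - 1e-12)  # avoid log(0) at extreme fitted values
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    mle = minimize(neg_log_likelihood, x0=np.zeros(2))
    print(mle.x)  # should be close to the true values (-0.5, 1.2)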
MLE
 The Maximum Likelihood estimators are consistent,
asymptotically normal and they are also efficient.
 The t-stats, F-stats, and confidence intervals can be computed in the usual way.
Measures of Fit
(1) Fraction correctly predicted
If Yi = 1 and the estimated P[Y = 1 | X] ≥ 0.50, then we
have a correct prediction.
If Yi = 0 and the estimated P[Y = 1 | X] < 0.50, then we
have a correct prediction.
Fraction correctly predicted = (number of observations
correctly predicted) / total number of observations
Measures of Fit (cont)
(2) Pseudo-R2
Recall R2 in linear regression model: we compare the sum of squared
residuals from our model (when all regressors are present) to the
sum of squared residuals we would obtain when none of the
regressors are included
We use a similar approach here: we compare the value of the log
likelihood function we obtain for our probit/logit model (when all
regressors are present) with the one we would obtain when none of
the regressors are included
Pseudo-R² = 1 − (ln L / ln L0)
Let's look at two possible extremes:
If your model explains absolutely nothing, then ln L = ln L0 and Pseudo-R² = 0.
If your model explains everything, then the likelihood function is 1, which means that ln L = 0 and Pseudo-R² = 1.
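With statsmodels, the pieces of this formula come straight off a fitted result; a sketch using the hypothetical `probit_fit` object from the earlier snippet (llf is ln L, llnull is ln L0):

    pseudo_r2 = 1 - probit_fit.llf / probit_fit.llnull
    print(pseudo_r2)  # statsmodels reports the same quantity as probit_fit.prsquared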