Econometrics I
Professor William Greene
Stern School of Business
Department of Economics
Econometrics I
Part 18 – Maximum Likelihood Estimation
Maximum Likelihood Estimation
This defines a class of estimators based on the particular
distribution assumed to have generated the observed
random variable.
Not estimating a mean – least squares is not available
Estimating a mean (possibly), but also using information
about the distribution
Advantage of The MLE
The main advantage of ML estimators is that
among all Consistent Asymptotically Normal
Estimators, MLEs have optimal asymptotic
properties.
The main disadvantage is that they are not
necessarily robust to failures of the
distributional assumptions. They are very
dependent on the particular assumptions.
The oft-cited disadvantage of their mediocre
small-sample properties is overstated in view of
the usual paucity of viable alternatives.
Properties of the MLE
Consistent: not necessarily unbiased, however.
Asymptotically normally distributed: proof based on central limit theorems.
Asymptotically efficient: among the possible estimators that are consistent and asymptotically normally distributed – the counterpart to Gauss-Markov for linear regression.
Invariant: the MLE of g(θ) is g(the MLE of θ).
Setting Up the MLE
The distribution of the observed random variable is written as a function of the parameters to be estimated:
P(yi|data,β) = probability density | parameters.
The likelihood function is constructed from the density.
Construction: the joint probability density function of the observed sample of data – generally the product of the individual densities when the data are a random sample.
(Log) Likelihood Function
f(yi|θ,xi) = probability density of observed yi given parameter(s) θ and, possibly, data xi.
Observations are independent.
Joint density = Πi f(yi|θ,xi) = L(θ|y,X).
f(yi|θ,xi) is the contribution of observation i to the likelihood.
The MLE of θ maximizes L(θ|y,X).
In practice it is usually easier to maximize
logL(θ|y,X) = Σi log f(yi|θ,xi).
The MLE
The log-likelihood function: logL(θ|data)
The likelihood equation(s):
First derivatives of logL equal zero at the MLE.
(1/n) Σi ∂log f(yi|θ,xi)/∂θ, evaluated at θ̂MLE, = 0.
(Sample statistic.) (The 1/n is irrelevant.)
“First order conditions” for maximization.
A moment condition – its counterpart is the fundamental theoretical result E[∂logL/∂θ] = 0.
Average Time Until Failure
Estimating the average time until failure, θ, of light bulbs.
yi = observed life until failure.
f(yi|θ) = (1/θ) exp(-yi/θ)
L(θ) = Πi f(yi|θ) = θ^(-n) exp(-Σyi/θ)
logL(θ) = -n log(θ) - Σyi/θ
Likelihood equation: ∂logL(θ)/∂θ = -n/θ + Σyi/θ² = 0
Solution: θ̂MLE = Σyi/n.  Note: E[yi] = θ.
Note, ∂log f(yi|θ)/∂θ = -1/θ + yi/θ².
Since E[yi] = θ, E[∂log f(yi|θ)/∂θ] = 0.
(One of the regularity conditions discussed below.)
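A minimal numerical sketch of this example (not from the slides; the data are simulated and theta_true is a made-up value) computes the MLE and checks that the likelihood equation is satisfied at the solution:

import numpy as np

rng = np.random.default_rng(0)
theta_true = 1000.0                      # hypothetical true mean life (hours)
y = rng.exponential(theta_true, size=500)

theta_mle = y.mean()                     # solution of the likelihood equation: sum(y)/n

# likelihood equation dlogL/dtheta = -n/theta + sum(y)/theta^2, evaluated at the MLE
score_at_mle = -len(y) / theta_mle + y.sum() / theta_mle**2
print(theta_mle, score_at_mle)           # the score is (numerically) zero at the MLE

The same calculation illustrates the moment condition: the sample average of ∂log f(yi|θ)/∂θ is exactly zero at θ̂MLE.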
The Linear (Normal) Model
Definition of the likelihood function – the joint density of the observed data, written as a function of the parameters we wish to estimate.
Definition of the maximum likelihood estimator as that function of the observed data that maximizes the likelihood function, or its logarithm.
For the model
yi = β′xi + εi, where εi ~ N[0, σ²],
the maximum likelihood estimators of β and σ² are
b = (X′X)⁻¹X′y and s² = e′e/n.
That is, least squares is ML for the slopes, but the variance estimator makes no degrees of freedom correction, so the MLE of σ² is biased.
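A minimal sketch of this point (simulated data and hypothetical coefficient values, not from the slides): the ML slope estimator is just least squares, while the ML variance estimator e′e/n differs from the degrees-of-freedom-corrected estimator e′e/(n-K):

import numpy as np

rng = np.random.default_rng(1)
n, K = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
beta = np.array([1.0, 0.5, -0.25])            # hypothetical coefficients
y = X @ beta + rng.normal(scale=2.0, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)         # MLE of beta = least squares
e = y - X @ b
sig2_mle = e @ e / n                          # ML estimator, no df correction (biased)
sig2_dfc = e @ e / (n - K)                    # usual corrected estimator
print(b, sig2_mle, sig2_dfc)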
Normal Linear Model
The log-likelihood function
logL = Σi log f(yi|θ) = sum of logs of densities.
For the linear regression model with normally distributed disturbances,
logL = Σi [ -½ log 2π - ½ log σ² - ½ (yi - β′xi)²/σ² ]
     = -(n/2)[ log 2π + log σ² + s²/σ² ],
where s² = e′e/n.
Likelihood Equations
The estimator is defined by the function of the data that equates ∂logL/∂θ to 0. (Likelihood equation.)
The derivative vector of the log-likelihood function is the score function. For the regression model,
g = [∂logL/∂β, ∂logL/∂σ²]′
∂logL/∂β  = Σi [(1/σ²) xi (yi - β′xi)] = X′ε/σ²
∂logL/∂σ² = Σi [-1/(2σ²) + (yi - β′xi)²/(2σ⁴)] = -(n/(2σ²))[1 - s²/σ²]
For the linear regression model, the first derivative vector of logL is
(1/σ²) X′(y - Xβ)                        (K×1)
and
(1/(2σ²)) Σi [(yi - β′xi)²/σ² - 1]       (1×1)
Note that we could compute these functions at any β and σ². If we compute them at b and e′e/n, the functions will be identically zero.
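A minimal numerical check of that last statement (simulated data, hypothetical coefficient values): evaluate both elements of the score at b and e′e/n and observe that they are zero up to rounding.

import numpy as np

rng = np.random.default_rng(2)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([2.0, 1.0]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
sig2 = e @ e / n

g_beta = X.T @ e / sig2                              # dlogL/dbeta  = X'e / sig2
g_sig2 = (e**2 / sig2 - 1.0).sum() / (2.0 * sig2)    # dlogL/dsig2
print(g_beta, g_sig2)                                # both (numerically) zero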
Information Matrix
The negative of the second derivatives matrix of the log-likelihood,
-H = -Σi ∂²log fi/∂θ∂θ′,
forms the basis for estimating the variance of the MLE.
It is usually a random matrix.
Hessian for the Linear Model
H = ∂²logL/∂θ∂θ′, partitioned over θ = (β, σ²):
∂²logL/∂β∂β′   = -(1/σ²) Σi xi xi′
∂²logL/∂β∂σ²   = -(1/σ⁴) Σi xi (yi - β′xi)
∂²logL/∂σ²∂σ²  = n/(2σ⁴) - (1/σ⁶) Σi (yi - β′xi)²
Note that the off diagonal elements have expectation zero.
Information Matrix
This can be computed at any vector β and scalar σ². You can take expected values of the parts of the matrix to get

-E[H] = [ (1/σ²) Σi xi xi′        0      ]
        [        0             n/(2σ⁴)   ]

(which should look familiar). The off diagonal terms go to zero (one of the assumptions of the linear model).
Asymptotic Variance
The asymptotic variance is {-E[H]}⁻¹, i.e., the inverse of the information matrix:

{-E[H]}⁻¹ = [ σ² (Σi xi xi′)⁻¹       0     ]   =   [ σ² (X′X)⁻¹      0     ]
            [        0            2σ⁴/n    ]       [      0        2σ⁴/n   ]

There are several ways to estimate this matrix:
Inverse of negative of expected second derivatives
Inverse of negative of actual second derivatives
Inverse of sum of squares of first derivatives
Robust matrix for some special cases
Computing the Asymptotic Variance
We want to estimate {-E[H]}⁻¹. Three ways:
(1) Just compute the negative of the actual second derivatives matrix
and invert it.
(2) Insert the maximum likelihood estimates into the known expected
values of the second derivatives matrix. Sometimes (1) and (2) give
the same answer (for example, in the linear regression model).
(3) Since -E[H] is the variance of the first derivatives, estimate it with
the sample variance (i.e., mean square) of the first derivatives, then
invert the result. This will almost always be different from (1) and
(2).
Since they are estimating the same thing, in large samples, all three will
give the same answer. Current practice in econometrics often
favors (3). Stata rarely uses (3). Others do.
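A minimal numerical sketch of the three approaches (simulated data, reusing the exponential light-bulb model from above rather than a regression model):

import numpy as np

rng = np.random.default_rng(3)
y = rng.exponential(1000.0, size=500)    # hypothetical observed lifetimes
n = len(y)
theta = y.mean()                         # the MLE

# (1) negative of the actual second derivative of logL, inverted
H = n / theta**2 - 2.0 * y.sum() / theta**3
v_actual = -1.0 / H

# (2) expected information evaluated at the MLE: -E[H] = n / theta^2
v_expected = theta**2 / n

# (3) BHHH / sum of squares of the first derivatives, inverted
g = -1.0 / theta + y / theta**2
v_bhhh = 1.0 / (g @ g)

print(v_actual, v_expected, v_bhhh)      # (1) and (2) coincide in this model; (3) differs

In this particular model (1) and (2) give the same number because the MLE equals the sample mean; in general they differ.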
Poisson Regression Model
Application of ML Estimation: Poisson Regression for a Count of Events
Poisson probability:  Prob[y = j] = exp(-λ) λ^j / j!,  j = 0, 1, ...
Regression model:  λ = E[y|x] = exp(β′x)
Competing estimators:
Nonlinear least squares – consistent
Maximum likelihood – consistent and efficient
Application: Doctor Visits
German Individual Health Care data: n = 27,326
Model for number of visits to the doctor:
Poisson regression (fit by maximum likelihood)
Income, Education, Gender
Poisson Model
Density of observed y:
Prob[yi = j | xi] = exp(-λi) λi^j / j!
Log likelihood:
logL(β|y,X) = Σi [ -λi + yi log λi - log yi! ]
Likelihood equations (derivatives of the log likelihood), with
λi = exp(β′xi)  and  ∂λi/∂β = exp(β′xi) xi = λi xi:
∂logL/∂β = Σi (-λi xi + yi xi) = Σi xi (yi - λi) = Σi xi εi = 0
Asymptotic Variance of the MLE
Variance of the first derivative vector:
Observations are independent. The first derivative vector is the sum of n independent terms, so its variance is the sum of the variances. The variance of each term is
Var[xi(yi - λi)] = xi xi′ Var[yi - λi] = λi xi xi′.
Summing terms,
Var[∂logL/∂β] = Σi λi xi xi′ = X′ΛX.
Estimators of the Asymptotic Covariance Matrix
Conventional estimator ("usual") – inverse of the information matrix:
[ Σi λi xi xi′ ]⁻¹ = [ X′ΛX ]⁻¹
"Berndt, Hall, Hall, Hausman" (BHHH) estimator – inverse of the sum of outer products of the first derivatives:
[ Σi gi gi′ ]⁻¹ = [ Σi [(yi - λi) xi][(yi - λi) xi]′ ]⁻¹ = [ Σi (yi - λi)² xi xi′ ]⁻¹
Robust Estimation
The sandwich estimator:
H⁻¹ (G′G) H⁻¹ = [ X′ΛX ]⁻¹ [ Σi (yi - λi)² xi xi′ ] [ X′ΛX ]⁻¹
Is this appropriate? Why do we do this?
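A minimal sketch of all three covariance estimators for the Poisson case (the function name poisson_covariances and its inputs are hypothetical; X and y are NumPy arrays and beta_hat is a previously computed MLE):

import numpy as np

def poisson_covariances(X, y, beta_hat):
    lam = np.exp(X @ beta_hat)
    info = X.T @ (lam[:, None] * X)          # X' Lambda X
    v_conventional = np.linalg.inv(info)     # inverse information matrix
    G = (y - lam)[:, None] * X               # rows are the g_i' = (y_i - lam_i) x_i'
    v_bhhh = np.linalg.inv(G.T @ G)          # BHHH / outer product of gradients
    v_sandwich = v_conventional @ (G.T @ G) @ v_conventional
    return v_conventional, v_bhhh, v_sandwich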
Newton’s Method
Newton’s Method for Poisson Regression
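A minimal sketch of Newton's method for this model (not the routine that produced the output below; X and y are hypothetical NumPy arrays, with X containing a constant column), using the update β_new = β - H⁻¹g with the Poisson score and Hessian derived above:

import numpy as np

def poisson_newton(X, y, tol=1e-10, max_iter=100):
    beta = np.zeros(X.shape[1])                 # start at zero, as in the output below
    for _ in range(max_iter):
        lam = np.exp(X @ beta)
        g = X.T @ (y - lam)                     # first derivatives
        H = -X.T @ (lam[:, None] * X)           # second derivatives, -X'ΛX
        step = np.linalg.solve(H, g)
        beta = beta - step                      # Newton update
        if np.abs(step).max() < tol:
            break
    return beta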
Poisson Regression Iterations
Poisson ; lhs = doctor ; rhs = one,female,hhninc,educ ; mar ; output=3$
Method=Newton; Maximum iterations=100
Convergence criteria: gtHg .1000D-05  chg.F .0000D+00  max|db| .0000D+00
Start values:   .00000D+00  .00000D+00  .00000D+00  .00000D+00
1st derivs.    -.13214D+06 -.61899D+05 -.43338D+05 -.14596D+07
Parameters:     .28002D+01  .72374D-01 -.65451D+00 -.47608D-01
Itr 2 F= -.1587D+06  gtHg= .2832D+03  chg.F= .1587D+06  max|db|= .1346D+01
1st derivs.    -.33055D+05 -.14401D+05 -.10804D+05 -.36592D+06
Parameters:     .21404D+01  .16980D+00 -.60181D+00 -.48527D-01
Itr 3 F= -.1115D+06  gtHg= .9725D+02  chg.F= .4716D+05  max|db|= .6348D+00
1st derivs.    -.42953D+04 -.15074D+04 -.13927D+04 -.47823D+05
Parameters:     .17997D+01  .27758D+00 -.54519D+00 -.49513D-01
Itr 4 F= -.1063D+06  gtHg= .1545D+02  chg.F= .5162D+04  max|db|= .1437D+00
1st derivs.    -.11692D+03 -.22248D+02 -.37525D+02 -.13159D+04
Parameters:     .17276D+01  .31746D+00 -.52565D+00 -.49852D-01
Itr 5 F= -.1062D+06  gtHg= .5006D+00  chg.F= .1218D+03  max|db|= .6542D-02
1st derivs.    -.12522D+00 -.54690D-02 -.40254D-01 -.14232D+01
Parameters:     .17249D+01  .31954D+00 -.52476D+00 -.49867D-01
Itr 6 F= -.1062D+06  gtHg= .6215D-03  chg.F= .1254D+00  max|db|= .9678D-05
1st derivs.    -.19317D-06 -.94936D-09 -.62872D-07 -.22029D-05
Parameters:     .17249D+01  .31954D+00 -.52476D+00 -.49867D-01
Itr 7 F= -.1062D+06  gtHg= .9957D-09  chg.F= .1941D-06  max|db|= .1602D-10
* Converged
Regression and Partial Effects
+--------+--------------+----------------+--------+--------+----------+
|Variable| Coefficient  | Standard Error |b/St.Er.|P[|Z|>z]| Mean of X|
+--------+--------------+----------------+--------+--------+----------+
|Constant|   1.72492985 |      .02000568 |  86.222|  .0000 |          |
|FEMALE  |    .31954440 |      .00696870 |  45.854|  .0000 | .47877479|
|HHNINC  |   -.52475878 |      .02197021 | -23.885|  .0000 | .35208362|
|EDUC    |   -.04986696 |      .00172872 | -28.846|  .0000 |11.3206310|
+--------+--------------+----------------+--------+--------+----------+
+-------------------------------------------+
| Partial derivatives of expected val. with |
| respect to the vector of characteristics. |
| Effects are averaged over individuals.    |
| Observations used for means are All Obs.  |
| Conditional Mean at Sample Point   3.1835 |
| Scale Factor for Marginal Effects  3.1835 |
+-------------------------------------------+
+--------+--------------+----------------+--------+--------+----------+
|Variable| Coefficient  | Standard Error |b/St.Er.|P[|Z|>z]| Mean of X|
+--------+--------------+----------------+--------+--------+----------+
|Constant|   5.49135704 |      .07890083 |  69.598|  .0000 |          |
|FEMALE  |   1.01727755 |      .02427607 |  41.905|  .0000 | .47877479|
|HHNINC  |  -1.67058263 |      .07312900 | -22.844|  .0000 | .35208362|
|EDUC    |   -.15875271 |      .00579668 | -27.387|  .0000 |11.3206310|
+--------+--------------+----------------+--------+--------+----------+
Comparison of Standard Errors
Negative Inverse of Second Derivatives
+--------+--------------+----------------+--------+--------+----------+
|Variable| Coefficient  | Standard Error |b/St.Er.|P[|Z|>z]| Mean of X|
+--------+--------------+----------------+--------+--------+----------+
|Constant|   1.72492985 |      .02000568 |  86.222|  .0000 |          |
|FEMALE  |    .31954440 |      .00696870 |  45.854|  .0000 | .47877479|
|HHNINC  |   -.52475878 |      .02197021 | -23.885|  .0000 | .35208362|
|EDUC    |   -.04986696 |      .00172872 | -28.846|  .0000 |11.3206310|
+--------+--------------+----------------+--------+--------+----------+
BHHH
+--------+--------------+----------------+--------+--------+
|Variable| Coefficient  | Standard Error |b/St.Er.|P[|Z|>z]|
+--------+--------------+----------------+--------+--------+
|Constant|   1.72492985 |      .00677787 | 254.495|  .0000 |
|FEMALE  |    .31954440 |      .00217499 | 146.918|  .0000 |
|HHNINC  |   -.52475878 |      .00733328 | -71.559|  .0000 |
|EDUC    |   -.04986696 |      .00062283 | -80.065|  .0000 |
+--------+--------------+----------------+--------+--------+
Why are they so different? Model failure. This is a panel. There is autocorrelation.
MLE vs. Nonlinear LS
Testing Hypotheses – A Trinity of Tests
The likelihood ratio test:
Based on the proposition (Greene’s) that restrictions
always “make life worse”
Is the reduction in the criterion (log-likelihood) large?
Leads to the LR test.
The Wald test: The usual.
The Lagrange multiplier test:
Underlying basis: Reexamine the first order
conditions.
Form a test of whether the gradient is significantly
“nonzero” at the restricted estimator.
Testing Hypotheses
Wald tests, using the familiar distance measure.
Likelihood ratio tests:
logLU = log likelihood without restrictions
logLR = log likelihood with restrictions
logLU > logLR for any nested restrictions
2(logLU - logLR) → chi-squared [J]
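A minimal sketch of the computation (using SciPy; the two log likelihoods are the values reported on the "Testing the Model" slide below, with J = 3 restrictions — substitute your own values):

from scipy.stats import chi2

logL_u, logL_r, J = -106215.1, -108662.1, 3
lr = 2.0 * (logL_u - logL_r)
print(lr, chi2.sf(lr, df=J))             # LR statistic and its p-value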
Testing the Model
+---------------------------------------------+
| Poisson Regression                          |
| Maximum Likelihood Estimates                |
| Dependent variable                 DOCVIS   |
| Number of observations              27326   |
| Iterations completed                    7   |
| Log likelihood function         -106215.1   |
| Number of parameters                    4   |
| Restricted log likelihood       -108662.1   |
| McFadden Pseudo R-squared        .0225193   |
| Chi squared                      4893.983   |
| Degrees of freedom                      3   |
| Prob[ChiSqd > value] =           .0000000   |
+---------------------------------------------+
"Log likelihood function" is the unrestricted log likelihood; "Restricted log likelihood" is the log likelihood with only a constant term. Chi squared = 2*[logL - logL(0)] is the likelihood ratio test that all three slopes are zero.
Wald Test
--> MATRIX ; List ; b1 = b(2:4) ; v11 = varb(2:4,2:4) ; B1'<V11>B1$
Matrix B1 has 3 rows and 1 columns.
               1
        +--------------
       1|   .31954
       2|  -.52476
       3|  -.04987
Matrix V11 has 3 rows and 3 columns.
               1              2              3
        +--------------------------------------------
       1|  .4856275D-04  -.4556076D-06   .2169925D-05
       2| -.4556076D-06   .00048         -.9160558D-05
       3|  .2169925D-05  -.9160558D-05   .2988465D-05
Matrix Result has 1 rows and 1 columns.
               1
        +--------------
       1|   4682.38779
LR statistic was 4893.983
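A minimal sketch of the same quadratic form b1′ V11⁻¹ b1 in Python, using the values listed above (the rounding in the displayed matrix means the result only approximates the 4682.39 shown):

import numpy as np

b1 = np.array([0.31954, -0.52476, -0.04987])
V11 = np.array([[ 4.856275e-05, -4.556076e-07,  2.169925e-06],
                [-4.556076e-07,  4.8e-04,      -9.160558e-06],
                [ 2.169925e-06, -9.160558e-06,  2.988465e-06]])
W = b1 @ np.linalg.solve(V11, b1)        # Wald statistic
print(W)                                 # compare with the chi-squared(3) critical value, 7.815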
Chow Style Test for Structural Change
Does the same model apply to 2 (or, in general, G) groups?
For linear regression we used the "Chow" (F) test. For models fit by maximum likelihood, we use a test based on the likelihood function. The same model is fit to the pooled sample and to each group.
Chi squared = 2 [ Σ(g=1..G) logLg - logLpooled ]
Degrees of freedom = (G - 1)K.
Poisson Regressions
----------------------------------------------------------------------
Poisson Regression
Dependent variable                DOCVIS
Log likelihood function     -90878.20153  (Pooled, N = 27326)
Log likelihood function     -43286.40271  (Male,   N = 14243)
Log likelihood function     -46587.29002  (Female, N = 13083)
--------+-------------------------------------------------------------
Variable| Coefficient   Standard Error  b/St.Er.  P[|Z|>z]   Mean of X
--------+-------------------------------------------------------------
Pooled
Constant|   2.54579***      .02797        91.015    .0000
     AGE|    .00791***      .00034        23.306    .0000     43.5257
    EDUC|   -.02047***      .00170       -12.056    .0000     11.3206
    HSAT|   -.22780***      .00133      -171.350    .0000     6.78543
  HHNINC|   -.26255***      .02143       -12.254    .0000      .35208
  HHKIDS|   -.12304***      .00796       -15.464    .0000      .40273
--------+-------------------------------------------------------------
Males
Constant|   2.38138***      .04053        58.763    .0000
     AGE|    .01232***      .00050        24.738    .0000     42.6528
    EDUC|   -.02962***      .00253       -11.728    .0000     11.7287
    HSAT|   -.23754***      .00202      -117.337    .0000     6.92436
  HHNINC|   -.33562***      .03357        -9.998    .0000      .35905
  HHKIDS|   -.10728***      .01166        -9.204    .0000      .41297
--------+-------------------------------------------------------------
Females
Constant|   2.48647***      .03988        62.344    .0000
     AGE|    .00379***      .00048         7.940    .0000     44.4760
    EDUC|    .00893***      .00234         3.821    .0001     10.8764
    HSAT|   -.21724***      .00177      -123.029    .0000     6.63417
  HHNINC|   -.22371***      .02767        -8.084    .0000      .34450
  HHKIDS|   -.14906***      .01107       -13.463    .0000      .39158
--------+-------------------------------------------------------------
Chi Squared Test
Namelist; X = one,age,educ,hsat,hhninc,hhkids$
Sample  ; All $
Poisson ; Lhs = Docvis ; Rhs = X $
Calc    ; Lpool = logl $
Poisson ; For [female = 0] ; Lhs = Docvis ; Rhs = X $
Calc    ; Lmale = logl $
Poisson ; For [female = 1] ; Lhs = Docvis ; Rhs = X $
Calc    ; Lfemale = logl $
Calc    ; K = Col(X) $
Calc    ; List
        ; Chisq = 2*(Lmale + Lfemale - Lpool)
        ; Ctb(.95,k) $
+------------------------------------+
| Listed Calculator Results          |
+------------------------------------+
CHISQ    =   2009.017601
*Result* =     12.591587
The hypothesis that the same model applies to men and women is rejected.
Properties of the Maximum Likelihood Estimator
We will sketch formal proofs of these results:
The log-likelihood function, again.
The likelihood equation and the information matrix.
A linear Taylor series approximation to the first order conditions:
g(θ̂ML) = 0 ≈ g(θ) + H(θ)(θ̂ML - θ)
(under regularity, higher order terms will vanish in large samples).
Our usual approach: large sample behavior of the left and right hand sides is the same.
A proof of consistency. (Property 1)
The limiting variance of √n(θ̂ML - θ). We are using the central limit theorem here. Leads to asymptotic normality (Property 2). We will derive the asymptotic variance of the MLE.
Estimating the variance of the maximum likelihood estimator.
Efficiency (we have not developed the tools to prove this). The Cramer-Rao lower bound for efficient estimation (an asymptotic version of Gauss-Markov).
Invariance. (A VERY handy result.) Coupled with the Slutsky theorem and the delta method, the invariance property makes estimation of nonlinear functions of parameters very easy.
Regularity Conditions
Deriving the theory for the MLE relies on certain "regularity" conditions for the density.
What they are:
1. logf(.) has three continuous derivatives wrt parameters.
2. Conditions needed to obtain expectations of derivatives are met. (E.g., the range of the variable is not a function of the parameters.)
3. The third derivative has finite expectation.
What they mean:
Moment conditions and convergence. We need to obtain expectations of derivatives.
We need to be able to truncate Taylor series.
We will use central limit theorems.
The MLE
The results center on the first order conditions for the MLE:
∂logL/∂θ̂MLE = g(θ̂MLE) = 0
Begin with a Taylor series approximation to the first derivatives:
g(θ̂MLE) = 0 ≈ g(θ) + H(θ)(θ̂MLE - θ)   [+ terms o(1/n) that vanish]
The derivative at the MLE, θ̂MLE, is exactly zero. It is close to zero at the true θ, to the extent that θ̂MLE is a good estimator of θ.
Rearrange this equation and make use of the Slutsky theorem:
θ̂MLE - θ ≈ -[H(θ)]⁻¹ g(θ)
In terms of the original log likelihood,
θ̂MLE - θ ≈ -[Σ(i=1..n) Hi(θ)]⁻¹ [Σ(i=1..n) gi(θ)]
where gi(θ) = ∂log fi(θ)/∂θ and Hi(θ) = ∂²log fi(θ)/∂θ∂θ′.
Consistency
θ̂MLE - θ ≈ -[Σ(i=1..n) Hi(θ)]⁻¹ [Σ(i=1..n) gi(θ)]
Divide both sums by the sample size:
θ̂MLE - θ = -[(1/n) Σ(i=1..n) Hi(θ)]⁻¹ [(1/n) Σ(i=1..n) gi(θ)] + o(1/n)
The approximation is now exact because of the higher order term. As n → ∞, the third term vanishes. The matrices in brackets are sample means that converge to their expectations:
-[(1/n) Σ(i=1..n) Hi(θ)]⁻¹ → {-E[Hi(θ)]}⁻¹, a positive definite matrix;
(1/n) Σ(i=1..n) gi(θ) → E[gi(θ)] = 0, one of the regularity conditions.
Therefore, collecting terms,
θ̂MLE - θ → 0, or plim θ̂MLE = θ.
Asymptotic Variance
Asymptotic Distribution
Efficiency: Variance Bound
Invariance
The maximum likelihood estimator of a function of θ, say h(θ), is h(θ̂MLE). This is not always true of other kinds of estimators. To get the variance of this function, we would use the delta method.
E.g., the MLE of θ = β/σ is b/√(e′e/n).
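A minimal sketch of invariance plus the delta method for this example (simulated data, hypothetical coefficient values; the asymptotic covariance of (b, σ̂²) is taken from the block-diagonal expected information derived earlier):

import numpy as np

rng = np.random.default_rng(4)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5]) + rng.normal(scale=2.0, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
sig2 = e @ e / n                                    # MLE of sigma^2
theta_hat = b / np.sqrt(sig2)                       # invariance: MLE of beta/sigma

# asymptotic covariance of (b, sig2): block diagonal, sigma^2 (X'X)^-1 and 2 sigma^4 / n
V = np.zeros((3, 3))
V[:2, :2] = sig2 * np.linalg.inv(X.T @ X)
V[2, 2] = 2.0 * sig2**2 / n

# delta method for h = b[1] / sqrt(sig2): gradient with respect to (b0, b1, sig2)
grad = np.array([0.0, 1.0 / np.sqrt(sig2), -0.5 * b[1] * sig2**-1.5])
var_theta1 = grad @ V @ grad
print(theta_hat, np.sqrt(var_theta1))               # estimate and its standard error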
Estimating the Tobit Model
Log likelihood for the tobit model for estimation of β and σ:
logL = Σ(i=1..n) [ (1 - di) log Φ(-β′xi/σ) + di log{ (1/σ) φ((yi - β′xi)/σ) } ]
di = 1 if yi > 0, 0 if yi = 0. Derivatives are very complicated, the Hessian is nightmarish. Consider the Olsen transformation*:
θ = 1/σ, γ = -β/σ.  (One to one; σ = 1/θ, β = -γ/θ.)
logL = Σ(i=1..n) [ (1 - di) log Φ(γ′xi) + di log{ θ φ(θyi + γ′xi) } ]
     = Σ(i=1..n) [ (1 - di) log Φ(γ′xi) + di (log θ - (1/2) log 2π - (1/2)(θyi + γ′xi)²) ]
∂logL/∂γ = Σ(i=1..n) [ (1 - di) φ(γ′xi)/Φ(γ′xi) - di ei ] xi,  where ei = θyi + γ′xi
∂logL/∂θ = Σ(i=1..n) di [ 1/θ - ei yi ]
* Olsen, "Note on the Uniqueness of the MLE in the Tobit Model," Econometrica, 1978.
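A minimal sketch of this log likelihood in the Olsen parameterization, assuming the sign convention above (θ = 1/σ, γ = -β/σ); tobit_negloglik is a hypothetical helper, X and y are NumPy arrays, and maximization is left to a generic optimizer rather than the analytic derivatives:

import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def tobit_negloglik(params, X, y):
    gamma, theta = params[:-1], params[-1]          # theta must stay positive
    d = (y > 0).astype(float)                       # 1 for uncensored, 0 for y = 0
    xg = X @ gamma
    ll_censored = norm.logcdf(xg)                   # contribution when y_i = 0
    e = theta * y + xg                              # e_i = theta*y_i + gamma'x_i
    ll_uncensored = np.log(theta) - 0.5 * np.log(2 * np.pi) - 0.5 * e**2
    return -np.sum((1 - d) * ll_censored + d * ll_uncensored)

# Usage sketch (hypothetical data, K regressors): bound theta away from zero, e.g.
# res = minimize(tobit_negloglik, start, args=(X, y), method="L-BFGS-B",
#                bounds=[(None, None)] * K + [(1e-6, None)])
# then recover sigma = 1/theta_hat and beta = -gamma_hat/theta_hat.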