Econometrics I
Professor William Greene
Stern School of Business, Department of Economics

Econometrics I, Part 18 – Maximum Likelihood Estimation

Maximum Likelihood Estimation
Maximum likelihood defines a class of estimators based on the particular distribution assumed to have generated the observed random variable.
  Not estimating a mean: least squares is not available.
  Estimating a mean (possibly), but also using information about the distribution.

Advantage of the MLE
The main advantage of ML estimators is that, among all consistent, asymptotically normal estimators, MLEs have optimal asymptotic properties. The main disadvantage is that they are not necessarily robust to failures of the distributional assumptions; they are very dependent on the particular assumptions. The oft-cited disadvantage of their mediocre small-sample properties is overstated in view of the usual paucity of viable alternatives.

Properties of the MLE
  Consistent: not necessarily unbiased, however.
  Asymptotically normally distributed: proof based on central limit theorems.
  Asymptotically efficient: among the possible estimators that are consistent and asymptotically normally distributed; the counterpart to Gauss-Markov for linear regression.
  Invariant: the MLE of g(θ) is g(the MLE of θ).

Setting Up the MLE
The distribution of the observed random variable is written as a function of the parameters to be estimated: P(yi | data, β) = probability density | parameters. The likelihood function is constructed from the density. Construction: the joint probability density function of the observed sample of data, which is generally the product of the individual densities when the data are a random sample.

(Log) Likelihood Function
f(yi | θ, xi) = probability density of observed yi given parameter(s) θ and, possibly, data xi.
Observations are independent, so the joint density is Πi f(yi | θ, xi) = L(θ | y, X).
f(yi | θ, xi) is the contribution of observation i to the likelihood. The MLE of θ maximizes L(θ | y, X). In practice it is usually easier to maximize
  logL(θ | y, X) = Σi log f(yi | θ, xi).

The MLE
The log-likelihood function: logL(θ | data).
The likelihood equation(s): the first derivatives of logL equal zero at the MLE,
  (1/n) Σi ∂log f(yi | θ, xi)/∂θ = 0 at θ = θ_MLE.
(A sample statistic; the 1/n is irrelevant.) These are the "first order conditions" for maximization. They form a moment condition whose counterpart is the fundamental theoretical result E[∂logL/∂θ] = 0.

Average Time Until Failure
Estimating the average time until failure, θ, of light bulbs, where yi = observed life until failure:
  f(yi | θ) = (1/θ) exp(-yi/θ)
  L(θ) = Πi f(yi | θ) = θ^(-n) exp(-Σi yi/θ)
  logL(θ) = -n log θ - Σi yi/θ
Likelihood equation: ∂logL(θ)/∂θ = -n/θ + Σi yi/θ² = 0.
Solution: θ_MLE = Σi yi / n. Note that E[yi] = θ.
Also, ∂log f(yi | θ)/∂θ = -1/θ + yi/θ², and since E[yi] = θ, E[∂log f(yi | θ)/∂θ] = 0. (This is one of the regularity conditions discussed below.)
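The failure-time example is simple enough to check numerically. The following is a minimal sketch (my own illustration, not from the slides): it simulates exponential lifetimes with an arbitrary true mean, computes the MLE as the sample mean, and verifies that the likelihood equation holds there. The variable names and the simulated sample are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 1000.0                          # hypothetical true mean life (hours)
y = rng.exponential(theta_true, size=500)    # simulated observed lives until failure

# Likelihood equation: dlogL/dtheta = -n/theta + sum(y)/theta^2 = 0  =>  theta_hat = mean(y)
theta_hat = y.mean()

def dlogL(theta):
    """First derivative of the exponential log-likelihood."""
    return -len(y) / theta + y.sum() / theta**2

print(theta_hat)             # close to theta_true in large samples (consistency)
print(dlogL(theta_hat))      # numerically zero at the MLE
```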
The Linear (Normal) Model
Definition of the likelihood function: the joint density of the observed data, written as a function of the parameters we wish to estimate. Definition of the maximum likelihood estimator: that function of the observed data that maximizes the likelihood function, or its logarithm.
For the model yi = xi'β + εi, where εi ~ N[0, σ²], the maximum likelihood estimators of β and σ² are b = (X'X)⁻¹X'y and s² = e'e/n. That is, least squares is ML for the slopes, but the variance estimator makes no degrees of freedom correction, so the MLE of σ² is biased.

Normal Linear Model
The log-likelihood function is Σi log f(yi | θ), the sum of the logs of the densities. For the linear regression model with normally distributed disturbances,
  logL = Σi [ -½ log 2π - ½ log σ² - ½(yi - xi'β)²/σ² ] = -(n/2)[ log 2π + log σ² + s²/σ² ],
where s² = (1/n) Σi (yi - xi'β)², which equals e'e/n when evaluated at β = b.

Likelihood Equations
The estimator is defined by the function of the data that equates ∂logL/∂θ to 0 (the likelihood equation). The derivative vector of the log-likelihood function is the score function. For the regression model, g = [∂logL/∂β', ∂logL/∂σ²]' with
  ∂logL/∂β = Σi (1/σ²) xi (yi - xi'β) = X'ε/σ²
  ∂logL/∂σ² = Σi [ -1/(2σ²) + (yi - xi'β)²/(2σ⁴) ] = -(n/(2σ²))[ 1 - s²/σ² ].
So for the linear regression model, the first derivative vector of logL is (1/σ²)X'(y - Xβ) (K×1) and (1/(2σ²)) Σi [ (yi - xi'β)²/σ² - 1 ] (1×1). Note that we could compute these functions at any β and σ². If we compute them at b and e'e/n, the functions will be identically zero.

Information Matrix
The negative of the second derivatives matrix of the log-likelihood,
  -H = -Σi ∂²log fi/∂θ∂θ',
forms the basis for estimating the variance of the MLE. It is usually a random matrix.

Hessian for the Linear Model

$$
H = \begin{bmatrix} \frac{\partial^2 \log L}{\partial\beta\,\partial\beta'} & \frac{\partial^2 \log L}{\partial\beta\,\partial\sigma^2} \\ \frac{\partial^2 \log L}{\partial\sigma^2\,\partial\beta'} & \frac{\partial^2 \log L}{\partial(\sigma^2)^2} \end{bmatrix}
= \begin{bmatrix} -\frac{1}{\sigma^2}\sum_i x_i x_i' & -\frac{1}{\sigma^4}\sum_i x_i (y_i - x_i'\beta) \\ -\frac{1}{\sigma^4}\sum_i (y_i - x_i'\beta) x_i' & \frac{n}{2\sigma^4} - \frac{1}{\sigma^6}\sum_i (y_i - x_i'\beta)^2 \end{bmatrix}
$$

Note that the off-diagonal elements have expectation zero.

Information Matrix
The Hessian can be computed at any vector β and scalar σ². Taking expected values of the parts of the matrix gives

$$
-E[H] = \begin{bmatrix} \frac{1}{\sigma^2}\sum_i x_i x_i' & 0 \\ 0 & \frac{n}{2\sigma^4} \end{bmatrix}
$$

(which should look familiar). The off-diagonal terms go to zero (one of the assumptions of the linear model).

Asymptotic Variance
The asymptotic variance is {-E[H]}⁻¹, i.e., the inverse of the information matrix:

$$
\{-E[H]\}^{-1} = \begin{bmatrix} \sigma^2\left(\sum_i x_i x_i'\right)^{-1} & 0 \\ 0 & \frac{2\sigma^4}{n} \end{bmatrix}
= \begin{bmatrix} \sigma^2 (X'X)^{-1} & 0 \\ 0 & \frac{2\sigma^4}{n} \end{bmatrix}
$$

There are several ways to estimate this matrix:
  Inverse of the negative of the expected second derivatives
  Inverse of the negative of the actual second derivatives
  Inverse of the sum of squares of the first derivatives
  Robust matrix for some special cases

Computing the Asymptotic Variance
We want to estimate {-E[H]}⁻¹. Three ways:
(1) Just compute the negative of the actual second derivatives matrix and invert it.
(2) Insert the maximum likelihood estimates into the known expected values of the second derivatives matrix. Sometimes (1) and (2) give the same answer (for example, in the linear regression model).
(3) Since -E[H] is also the variance of the first derivatives (the information matrix equality), estimate it with the sample variance (i.e., mean square) of the first derivatives, then invert the result. This will almost always be different from (1) and (2).
Since they are estimating the same thing, in large samples all three will give the same answer. Current practice in econometrics often favors (3). Stata rarely uses (3). Others do.
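As a check on these formulas, here is a minimal sketch (my own, with simulated data and invented names): it computes b and e'e/n, verifies that both blocks of the score are numerically zero there, and forms the estimated asymptotic covariance from the expected information.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])     # simulated regressors
beta_true = np.array([1.0, 0.5, -0.3])
y = X @ beta_true + rng.normal(scale=2.0, size=n)

# ML estimates: least squares for beta, e'e/n for sigma^2 (no degrees-of-freedom correction)
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = e @ e / n

# Score evaluated at (b, s2): both blocks are (numerically) zero
g_beta   = X.T @ e / s2
g_sigma2 = np.sum(-1.0 / (2 * s2) + e**2 / (2 * s2**2))
print(g_beta, g_sigma2)

# Estimated asymptotic covariance from {-E[H]}^{-1}: s2*(X'X)^{-1} for b, 2*s2^2/n for s2
V_b  = s2 * np.linalg.inv(X.T @ X)
V_s2 = 2 * s2**2 / n
print(np.sqrt(np.diag(V_b)), np.sqrt(V_s2))
```

At (b, e'e/n) the actual and expected second derivative matrices coincide for this model, which is why methods (1) and (2) described above give the same answer in the linear regression case.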
Poisson Regression Model
Application of ML estimation: Poisson regression for a count of events.
  Poisson probability: Prob[y = j] = exp(-λ) λ^j / j!,  j = 0, 1, ...
  Regression model: λ = E[y|x] = exp(x'β)
Competing estimators:
  Nonlinear least squares: consistent
  Maximum likelihood: consistent and efficient

Application: Doctor Visits
German individual health care data, n = 27,326. Model for the number of visits to the doctor: Poisson regression (fit by maximum likelihood), with income, education, and gender as covariates.

Poisson Model
Density of observed y: Prob[yi = j | xi] = exp(-λi) λi^j / j!, with λi = exp(xi'β).
Log likelihood:
  logL(β | y, X) = Σi [ -λi + yi log λi - log yi! ]
Likelihood equations (derivatives of the log likelihood): since ∂λi/∂β = exp(xi'β) xi = λi xi,
  ∂logL/∂β = Σi (yi xi - λi xi) = Σi xi (yi - λi) = 0,
which implies Σi xi yi = Σi xi λi at the MLE.

Asymptotic Variance of the MLE
Variance of the first derivative vector: the observations are independent, the first derivative vector is the sum of n independent terms, and the variance is the sum of the variances. The variance of each term is Var[xi(yi - λi)] = xi xi' Var[yi - λi] = λi xi xi'. Summing terms,
  Var[∂logL/∂β] = Σi λi xi xi' = X'ΛX.

Estimators of the Asymptotic Covariance Matrix
Conventional estimator, the inverse of the information matrix ("usual"), with λi evaluated at the MLE:
  [ Σi λi xi xi' ]⁻¹ = (X'ΛX)⁻¹
Berndt, Hall, Hall, Hausman (BHHH):
  [ Σi gi gi' ]⁻¹ = [ Σi ((yi - λi)xi)((yi - λi)xi)' ]⁻¹ = [ Σi (yi - λi)² xi xi' ]⁻¹

Robust Estimation
The sandwich estimator:
  H⁻¹(G'G)H⁻¹ = (X'ΛX)⁻¹ [ Σi (yi - λi)² xi xi' ] (X'ΛX)⁻¹
Is this appropriate? Why do we do this?

Newton's Method
Each iteration updates the estimate using the first and second derivatives of logL: β(k+1) = β(k) - [H(k)]⁻¹ g(k).

Newton's Method for Poisson Regression
For the Poisson model, g = X'(y - λ) and H = -X'ΛX, both evaluated at the current β(k).
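A minimal sketch of the Newton iteration for the Poisson MLE, written in Python rather than the NLOGIT commands used on the slides; the simulated data, function names, and starting values are my own choices for illustration.

```python
import numpy as np
from scipy.special import gammaln

def poisson_newton(y, X, tol=1e-10, max_iter=100):
    """Newton's method for the Poisson regression MLE.
    Score g = X'(y - lam), Hessian H = -X' diag(lam) X, with lam_i = exp(x_i'b)."""
    beta = np.zeros(X.shape[1])                    # start at zero, as on the iteration slide
    for _ in range(max_iter):
        lam = np.exp(X @ beta)
        g = X.T @ (y - lam)
        H = -(X * lam[:, None]).T @ X
        step = np.linalg.solve(H, g)               # H^{-1} g
        beta = beta - step                         # beta_new = beta - H^{-1} g
        if np.max(np.abs(step)) < tol:
            break
    lam = np.exp(X @ beta)
    logL = np.sum(-lam + y * np.log(lam) - gammaln(y + 1))
    cov = np.linalg.inv((X * lam[:, None]).T @ X)  # conventional estimator (X' Lambda X)^{-1}
    return beta, logL, cov

# Illustration on simulated data (names and values are invented)
rng = np.random.default_rng(2)
n = 5000
X = np.column_stack([np.ones(n), rng.integers(0, 2, size=n), rng.normal(size=n)])
y = rng.poisson(np.exp(X @ np.array([1.0, 0.3, -0.5])))
b, logL, V = poisson_newton(y, X)
print(b, logL, np.sqrt(np.diag(V)))
```

Starting from zero, the iteration typically converges in a handful of steps, mirroring the seven Newton iterations shown in the output below.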
Poisson Regression Iterations
Poisson ; Lhs = doctor ; Rhs = one,female,hhninc,educ ; Mar ; Output = 3 $
Method = Newton; Maximum iterations = 100
Convergence criteria: gtHg .1000D-05   chg.F .0000D+00   max|db| .0000D+00
Start values:  .00000D+00   .00000D+00   .00000D+00   .00000D+00
1st derivs.   -.13214D+06  -.61899D+05  -.43338D+05  -.14596D+07
Parameters:    .28002D+01   .72374D-01  -.65451D+00  -.47608D-01
Itr 2 F= -.1587D+06  gtHg= .2832D+03  chg.F= .1587D+06  max|db|= .1346D+01
1st derivs.   -.33055D+05  -.14401D+05  -.10804D+05  -.36592D+06
Parameters:    .21404D+01   .16980D+00  -.60181D+00  -.48527D-01
Itr 3 F= -.1115D+06  gtHg= .9725D+02  chg.F= .4716D+05  max|db|= .6348D+00
1st derivs.   -.42953D+04  -.15074D+04  -.13927D+04  -.47823D+05
Parameters:    .17997D+01   .27758D+00  -.54519D+00  -.49513D-01
Itr 4 F= -.1063D+06  gtHg= .1545D+02  chg.F= .5162D+04  max|db|= .1437D+00
1st derivs.   -.11692D+03  -.22248D+02  -.37525D+02  -.13159D+04
Parameters:    .17276D+01   .31746D+00  -.52565D+00  -.49852D-01
Itr 5 F= -.1062D+06  gtHg= .5006D+00  chg.F= .1218D+03  max|db|= .6542D-02
1st derivs.   -.12522D+00  -.54690D-02  -.40254D-01  -.14232D+01
Parameters:    .17249D+01   .31954D+00  -.52476D+00  -.49867D-01
Itr 6 F= -.1062D+06  gtHg= .6215D-03  chg.F= .1254D+00  max|db|= .9678D-05
1st derivs.   -.19317D-06  -.94936D-09  -.62872D-07  -.22029D-05
Parameters:    .17249D+01   .31954D+00  -.52476D+00  -.49867D-01
Itr 7 F= -.1062D+06  gtHg= .9957D-09  chg.F= .1941D-06  max|db|= .1602D-10  * Converged

Regression and Partial Effects
+--------+--------------+----------------+--------+--------+----------+
|Variable| Coefficient  | Standard Error |b/St.Er.|P[|Z|>z]| Mean of X|
+--------+--------------+----------------+--------+--------+----------+
Constant |  1.72492985  |   .02000568    | 86.222 | .0000  |          |
FEMALE   |   .31954440  |   .00696870    | 45.854 | .0000  | .47877479|
HHNINC   |  -.52475878  |   .02197021    |-23.885 | .0000  | .35208362|
EDUC     |  -.04986696  |   .00172872    |-28.846 | .0000  |11.3206310|
+--------+--------------+----------------+--------+--------+----------+
Partial derivatives of the expected value with respect to the vector of characteristics. Effects are averaged over individuals. Observations used for means are all observations. Conditional mean at sample point: 3.1835. Scale factor for marginal effects: 3.1835.
+--------+--------------+----------------+--------+--------+----------+
|Variable| Coefficient  | Standard Error |b/St.Er.|P[|Z|>z]| Mean of X|
+--------+--------------+----------------+--------+--------+----------+
Constant |  5.49135704  |   .07890083    | 69.598 | .0000  |          |
FEMALE   |  1.01727755  |   .02427607    | 41.905 | .0000  | .47877479|
HHNINC   | -1.67058263  |   .07312900    |-22.844 | .0000  | .35208362|
EDUC     |  -.15875271  |   .00579668    |-27.387 | .0000  |11.3206310|
+--------+--------------+----------------+--------+--------+----------+

Comparison of Standard Errors
Negative inverse of second derivatives:
+--------+--------------+----------------+--------+--------+----------+
|Variable| Coefficient  | Standard Error |b/St.Er.|P[|Z|>z]| Mean of X|
+--------+--------------+----------------+--------+--------+----------+
Constant |  1.72492985  |   .02000568    | 86.222 | .0000  |          |
FEMALE   |   .31954440  |   .00696870    | 45.854 | .0000  | .47877479|
HHNINC   |  -.52475878  |   .02197021    |-23.885 | .0000  | .35208362|
EDUC     |  -.04986696  |   .00172872    |-28.846 | .0000  |11.3206310|
+--------+--------------+----------------+--------+--------+----------+
BHHH:
+--------+--------------+----------------+--------+--------+
|Variable| Coefficient  | Standard Error |b/St.Er.|P[|Z|>z]|
+--------+--------------+----------------+--------+--------+
Constant |  1.72492985  |   .00677787    |254.495 | .0000  |
FEMALE   |   .31954440  |   .00217499    |146.918 | .0000  |
HHNINC   |  -.52475878  |   .00733328    |-71.559 | .0000  |
EDUC     |  -.04986696  |   .00062283    |-80.065 | .0000  |
+--------+--------------+----------------+--------+--------+
Why are they so different? Model failure. This is a panel. There is autocorrelation.
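The gap between the conventional and BHHH standard errors is a symptom of misspecification. The sketch below (my own, not from the slides) reproduces the pattern with simulated data: the counts are generated with extra heterogeneity, so the fitted Poisson model is wrong in the same spirit as the panel correlation noted above, and the conventional, BHHH, and sandwich estimates of the standard errors no longer agree. All names, sample sizes, and coefficients are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000
X = np.column_stack([np.ones(n), rng.integers(0, 2, size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 0.3, -0.5])

# Extra heterogeneity: Poisson conditional on a gamma term, so the plain Poisson
# model fit below is misspecified (overdispersion stands in for the panel
# correlation discussed on the slide).
u = rng.gamma(shape=1.0, scale=1.0, size=n)
y = rng.poisson(np.exp(X @ beta_true) * u)

# Poisson MLE by Newton's method (the log likelihood is globally concave)
b = np.zeros(X.shape[1])
for _ in range(50):
    lam = np.exp(X @ b)
    step = np.linalg.solve((X * lam[:, None]).T @ X, X.T @ (y - lam))
    b = b + step
    if np.max(np.abs(step)) < 1e-10:
        break

lam = np.exp(X @ b)
A = (X * lam[:, None]).T @ X                   # -H = X' Lambda X
B = (X * ((y - lam) ** 2)[:, None]).T @ X      # sum of squared scores (BHHH / OPG)

se_conventional = np.sqrt(np.diag(np.linalg.inv(A)))
se_bhhh         = np.sqrt(np.diag(np.linalg.inv(B)))
se_sandwich     = np.sqrt(np.diag(np.linalg.inv(A) @ B @ np.linalg.inv(A)))
print(se_conventional)
print(se_bhhh)          # smaller than the conventional ones here, as in the table above
print(se_sandwich)      # larger; the robust estimator widens the intervals
```

In a correctly specified model the three estimators converge to the same thing; when they diverge, as here and in the table above, that is itself evidence against the specification.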
MLE vs. Nonlinear LS (graphical comparison on this slide)

Testing Hypotheses – A Trinity of Tests
  The likelihood ratio test: based on the proposition (Greene's) that restrictions always "make life worse." Is the reduction in the criterion (the log-likelihood) large? This leads to the LR test.
  The Wald test: the usual distance measure.
  The Lagrange multiplier test: underlying basis is to re-examine the first order conditions and form a test of whether the gradient is significantly "nonzero" at the restricted estimator.

Testing Hypotheses
Wald tests use the familiar distance measure. Likelihood ratio tests:
  logLU = log likelihood without the restrictions
  logLR = log likelihood with the restrictions
  logLU > logLR for any nested restrictions
  2(logLU - logLR) is asymptotically chi-squared[J], where J is the number of restrictions.

Testing the Model
+---------------------------------------------+
| Poisson Regression                          |
| Maximum Likelihood Estimates                |
| Dependent variable              DOCVIS      |
| Number of observations          27326       |
| Iterations completed            7           |
| Log likelihood function         -106215.1   |
| Number of parameters            4           |
| Restricted log likelihood       -108662.1   |
| McFadden Pseudo R-squared       .0225193    |
| Chi squared                     4893.983    |
| Degrees of freedom              3           |
| Prob[ChiSqd > value] =          .0000000    |
+---------------------------------------------+
The restricted log likelihood is the log likelihood with only a constant term. The chi squared statistic is 2[logL - logL(0)], the likelihood ratio test that all three slopes are zero.

Wald Test
--> MATRIX ; List ; b1 = b(2:4) ; v11 = varb(2:4,2:4) ; B1'<V11>B1 $
Matrix B1 has 3 rows and 1 columns:
   .31954
  -.52476
  -.04987
Matrix V11 has 3 rows and 3 columns:
   .4856275D-04  -.4556076D-06   .2169925D-05
  -.4556076D-06   .00048         -.9160558D-05
   .2169925D-05  -.9160558D-05   .2988465D-05
Matrix Result has 1 rows and 1 columns:
   4682.38779
The LR statistic was 4893.983.

Chow Style Test for Structural Change
Does the same model apply to 2 (or G) groups? For linear regression we used the "Chow" (F) test. For models fit by maximum likelihood, we use a test based on the likelihood function. The same model is fit to the pooled sample and to each group:
  Chi squared = 2 [ Σ(g=1..G) logLg - logLpooled ]
Degrees of freedom = (G - 1)K.
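A minimal sketch of this pooling test on simulated data, assuming two groups whose coefficients genuinely differ; the Poisson fits reuse the Newton iteration sketched earlier, and all data and names are invented for illustration.

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import chi2

def poisson_fit_logl(y, X, tol=1e-10):
    """Fit a Poisson regression by Newton's method and return the maximized logL."""
    b = np.zeros(X.shape[1])
    for _ in range(100):
        lam = np.exp(X @ b)
        step = np.linalg.solve((X * lam[:, None]).T @ X, X.T @ (y - lam))
        b = b + step
        if np.max(np.abs(step)) < tol:
            break
    lam = np.exp(X @ b)
    return np.sum(-lam + y * np.log(lam) - gammaln(y + 1))

# Simulated two-group data where the groups genuinely differ
rng = np.random.default_rng(4)
n = 10000
female = rng.integers(0, 2, size=n)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_m, beta_f = np.array([1.0, -0.4]), np.array([1.2, -0.2])
lam = np.where(female == 1, np.exp(X @ beta_f), np.exp(X @ beta_m))
y = rng.poisson(lam)

logL_pooled = poisson_fit_logl(y, X)
logL_m = poisson_fit_logl(y[female == 0], X[female == 0])
logL_f = poisson_fit_logl(y[female == 1], X[female == 1])

K, G = X.shape[1], 2
chisq = 2 * (logL_m + logL_f - logL_pooled)      # 2*[sum of group logL - pooled logL]
print(chisq, chi2.ppf(0.95, (G - 1) * K))        # reject pooling if chisq exceeds the critical value
```

With the actual doctor-visits data, the corresponding statistic computed below is 2009.02 against a 95% critical value of 12.59.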
Poisson Regressions
----------------------------------------------------------------------
Poisson Regression       Dependent variable: DOCVIS
Log likelihood function   -90878.20153   (Pooled,  N = 27326)
Log likelihood function   -43286.40271   (Males,   N = 14243)
Log likelihood function   -46587.29002   (Females, N = 13083)
--------+-------------------------------------------------------------
Variable|  Coefficient   Standard Error   b/St.Er.   P[|Z|>z]   Mean of X
--------+-------------------------------------------------------------
Pooled
Constant|   2.54579***      .02797          91.015     .0000
     AGE|    .00791***      .00034          23.306     .0000     43.5257
    EDUC|   -.02047***      .00170         -12.056     .0000     11.3206
    HSAT|   -.22780***      .00133        -171.350     .0000     6.78543
  HHNINC|   -.26255***      .02143         -12.254     .0000      .35208
  HHKIDS|   -.12304***      .00796         -15.464     .0000      .40273
--------+-------------------------------------------------------------
Males
Constant|   2.38138***      .04053          58.763     .0000
     AGE|    .01232***      .00050          24.738     .0000     42.6528
    EDUC|   -.02962***      .00253         -11.728     .0000     11.7287
    HSAT|   -.23754***      .00202        -117.337     .0000     6.92436
  HHNINC|   -.33562***      .03357          -9.998     .0000      .35905
  HHKIDS|   -.10728***      .01166          -9.204     .0000      .41297
--------+-------------------------------------------------------------
Females
Constant|   2.48647***      .03988          62.344     .0000
     AGE|    .00379***      .00048           7.940     .0000     44.4760
    EDUC|    .00893***      .00234           3.821     .0001     10.8764
    HSAT|   -.21724***      .00177        -123.029     .0000     6.63417
  HHNINC|   -.22371***      .02767          -8.084     .0000      .34450
  HHKIDS|   -.14906***      .01107         -13.463     .0000      .39158
--------+-------------------------------------------------------------

Chi Squared Test
Namelist ; X = one,age,educ,hsat,hhninc,hhkids $
Sample   ; All $
Poisson  ; Lhs = Docvis ; Rhs = X $
Calc     ; Lpool = logl $
Poisson  ; For [female = 0] ; Lhs = Docvis ; Rhs = X $
Calc     ; Lmale = logl $
Poisson  ; For [female = 1] ; Lhs = Docvis ; Rhs = X $
Calc     ; Lfemale = logl $
Calc     ; K = Col(X) $
Calc     ; List ; Chisq = 2*(Lmale + Lfemale - Lpool) ; Ctb(.95,k) $
+------------------------------------+
| Listed Calculator Results          |
+------------------------------------+
CHISQ    =   2009.017601
*Result* =     12.591587   (the 95% critical value for chi squared with K = 6 degrees of freedom)
The hypothesis that the same model applies to men and women is rejected.

Properties of the Maximum Likelihood Estimator
We will sketch formal proofs of these results:
  The log-likelihood function, again.
  The likelihood equation and the information matrix.
  A linear Taylor series approximation to the first order conditions: g(θ_MLE) = 0 ≈ g(θ) + H(θ)(θ_MLE - θ) (under regularity, the higher order terms will vanish in large samples). This is our usual approach: the large sample behavior of the left and right hand sides is the same.
  A proof of consistency (Property 1).
  The limiting variance of √n(θ_MLE - θ); we are using the central limit theorem here. This leads to asymptotic normality (Property 2). We will derive the asymptotic variance of the MLE.
  Estimating the variance of the maximum likelihood estimator.
  Efficiency (we have not developed the tools to prove this): the Cramér-Rao lower bound for efficient estimation (an asymptotic version of Gauss-Markov).
  Invariance (a VERY handy result). Coupled with the Slutsky theorem and the delta method, the invariance property makes estimation of nonlinear functions of parameters very easy.

Regularity Conditions
Deriving the theory for the MLE relies on certain "regularity" conditions for the density.
What they are:
  1. log f(.) has three continuous derivatives with respect to the parameters.
  2. Conditions needed to obtain expectations of derivatives are met (e.g., the range of the variable is not a function of the parameters).
  3. The third derivative has finite expectation.
What they mean: moment conditions and convergence. We need to obtain expectations of derivatives. We need to be able to truncate Taylor series. We will use central limit theorems.
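Before the formal argument, here is a small Monte Carlo illustration (my own, reusing the failure-time model from earlier in this part) of what consistency and asymptotic normality look like: across repeated samples the MLE centers on the true θ, its sampling variance matches {-E[H]}⁻¹ = θ²/n, and the standardized estimator is approximately standard normal. The true value, sample size, and replication count are arbitrary choices.

```python
import numpy as np

# Exponential (time-until-failure) model: information = n/theta^2, so Asy.Var = theta^2/n
rng = np.random.default_rng(5)
theta, n, reps = 5.0, 200, 10000

# MLE = sample mean, computed in each of `reps` replications
theta_hat = rng.exponential(theta, size=(reps, n)).mean(axis=1)

print(theta_hat.mean())                    # close to theta (consistency)
print(theta_hat.var(), theta**2 / n)       # sampling variance vs. {-E[H]}^{-1}
z = (theta_hat - theta) / np.sqrt(theta**2 / n)
print(np.mean(np.abs(z) < 1.96))           # roughly 0.95 if the normal approximation is good
```

The agreement between the simulated sampling variance and θ²/n is exactly the result the following derivation establishes in general.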
The MLE
The results center on the first order conditions for the MLE:
  ∂logL/∂θ evaluated at θ_MLE: g(θ_MLE) = 0.
Begin with a Taylor series approximation to the first derivatives:
  g(θ_MLE) = 0 ≈ g(θ) + H(θ)(θ_MLE - θ)  [+ terms o(1/n) that vanish].
The derivative at the MLE, θ_MLE, is exactly zero. It is close to zero at the true θ, to the extent that θ_MLE is a good estimator of θ. Rearrange this equation and make use of the Slutsky theorem:
  θ_MLE ≈ θ - H⁻¹g.
In terms of the original log likelihood,
  θ_MLE ≈ θ - [ Σi Hi ]⁻¹ [ Σi gi ],
where gi = ∂log fi(θ)/∂θ and Hi = ∂²log fi(θ)/∂θ∂θ'.

Consistency
Start from θ_MLE ≈ θ - [ Σi Hi ]⁻¹ [ Σi gi ] and divide both sums by the sample size:
  θ_MLE = θ - [ (1/n) Σi Hi ]⁻¹ [ (1/n) Σi gi ] + o(1/n).
The approximation is now exact because of the higher order term. As n → ∞, the third term vanishes. The matrices in brackets are sample means that converge to their expectations:
  -(1/n) Σi Hi → -E[Hi], a positive definite matrix.
  (1/n) Σi gi → E[gi] = 0, one of the regularity conditions.
Therefore, collecting terms, θ_MLE - θ → 0, or plim θ_MLE = θ.

Asymptotic Variance
The asymptotic variance of the MLE is {-E[H]}⁻¹, the inverse of the information matrix, evaluated at the true parameters (derivation not reproduced here).

Asymptotic Distribution
Combining these results, θ_MLE is asymptotically normally distributed with mean θ and variance {-E[H]}⁻¹.

Efficiency: Variance Bound
The Cramér-Rao bound: {-E[H]}⁻¹ is the smallest variance attainable by a consistent, asymptotically normally distributed estimator, so the MLE is asymptotically efficient in that class.

Invariance
The maximum likelihood estimator of a function of θ, say h(θ), is h(θ_MLE). This is not always true of other kinds of estimators. To get the variance of this function, we would use the delta method. E.g., the MLE of θ = β/σ is b/√(e'e/n).

Estimating the Tobit Model
Log likelihood for the tobit model for estimation of β and σ:
  logL = Σi [ (1 - di) log Φ(-xi'β/σ) + di log{ (1/σ) φ((yi - xi'β)/σ) } ],
  di = 1 if yi > 0, 0 if yi = 0.
The derivatives are very complicated and the Hessian is nightmarish. Consider the Olsen transformation*: θ = 1/σ, γ = -β/σ (one to one: σ = 1/θ, β = -γ/θ). Then
  logL = Σi [ (1 - di) log Φ(xi'γ) + di { log θ - (1/2) log 2π - (1/2)(θyi + xi'γ)² } ]
  ∂logL/∂γ = Σi [ (1 - di) {φ(xi'γ)/Φ(xi'γ)} xi - di ei xi ],  where ei = θyi + xi'γ
  ∂logL/∂θ = Σi di [ 1/θ - ei yi ]

* Olsen, "Note on the Uniqueness of the Maximum Likelihood Estimator for the Tobit Model," Econometrica, 1978.
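A minimal sketch of maximizing this log likelihood numerically in the Olsen parameterization, on simulated censored data; the sign conventions follow the transformation stated above (θ = 1/σ, γ = -β/σ), and the data, optimizer choice, and names are my own illustrative assumptions rather than anything from the slides.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def tobit_negloglik(params, y, X):
    """Negative tobit log likelihood in the Olsen parameterization
    (theta = 1/sigma, gamma = -beta/sigma), with censoring at zero from below."""
    gamma, theta = params[:-1], params[-1]
    xg = X @ gamma
    d = (y > 0)
    e = theta * y + xg                    # equals (y - x'beta)/sigma for uncensored observations
    ll = np.where(d,
                  np.log(theta) - 0.5 * np.log(2 * np.pi) - 0.5 * e**2,
                  norm.logcdf(xg))
    return -ll.sum()

# Simulated censored data (coefficients, sample size, and names are illustrative)
rng = np.random.default_rng(6)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true, sigma_true = np.array([0.5, 1.0]), 1.5
y = np.maximum(0.0, X @ beta_true + rng.normal(scale=sigma_true, size=n))

K = X.shape[1]
start = np.concatenate([np.zeros(K), [1.0]])                # gamma = 0, theta = 1
res = minimize(tobit_negloglik, start, args=(y, X),
               method="L-BFGS-B",
               bounds=[(None, None)] * K + [(1e-6, None)])  # keep theta = 1/sigma positive
gamma_hat, theta_hat = res.x[:-1], res.x[-1]
print(-gamma_hat / theta_hat, 1.0 / theta_hat)              # recover beta = -gamma/theta, sigma = 1/theta
```

Recovering β and σ from γ and θ at the end is an application of the invariance property discussed above; the delta method would supply the corresponding standard errors.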