Maximum Likelihood estimation

1 Estimation

Let $w = (w_1, \dots, w_n)$ denote the data, where $w_i$, depending on the context, may denote a random variable or, with a slight abuse of notation, its (possible) realization(s). Let $\theta$ denote a finite dimensional parameter. We let $L(\theta; w) = f(w; \theta)$ denote the Likelihood function, i.e., the probability density or mass function of the data considered as a function of the parameter. Throughout we will assume that the data are iid. Therefore, we have
$$L(\theta; w) = f(w; \theta) = \prod_{i=1}^n f(w_i; \theta),$$
where (with a slight abuse of notation) $f(w_i; \theta)$ denotes the probability density or mass function of $w_i$.

Example 1. $w_i$ is a Bernoulli random variable, i.e.,
$$w_i = \begin{cases} 1 & \text{with probability } \theta \\ 0 & \text{with probability } 1 - \theta. \end{cases}$$
Then, the Likelihood function of a random sample (a sample of iid random variables) is given by
$$L(\theta; w) = \prod_{i=1}^n \theta^{w_i} (1 - \theta)^{1 - w_i}.$$

In what follows, it will turn out to be convenient to work with a one-to-one transformation of the Likelihood function, namely the log-likelihood (function) given by
$$l(\theta; w) \equiv \log L(\theta; w) = \log \prod_{i=1}^n f(w_i; \theta) = \sum_{i=1}^n \log f(w_i; \theta),$$
where $\equiv$ denotes "equals by definition."

Example 1 - continued.
$$l(\theta; w) = \sum_{i=1}^n \left[ w_i \log \theta + (1 - w_i) \log(1 - \theta) \right].$$

The Maximum Likelihood Estimator (MLE) is given by the maximizer of the Likelihood function or, equivalently, of the log-likelihood. Under some regularity conditions, the maximizer, say $\hat\theta$, is given by the parameter value that satisfies the first order condition(s), i.e.,
$$\frac{\partial}{\partial\theta} l(\hat\theta; w) \equiv \left. \frac{\partial}{\partial\theta} l(\theta; w) \right|_{\theta = \hat\theta} = 0. \qquad (1)$$

Example 1 - continued.
$$\begin{aligned}
\frac{\partial}{\partial\theta} l(\hat\theta; w) = 0
&\;\Leftrightarrow\; \sum_{i=1}^n \left[ w_i \frac{1}{\hat\theta} + (1 - w_i) \frac{1}{1 - \hat\theta} (-1) \right] = 0 \\
&\;\Leftrightarrow\; \frac{1}{\hat\theta} \sum_{i=1}^n w_i = \frac{1}{1 - \hat\theta} \sum_{i=1}^n (1 - w_i) \\
&\;\Leftrightarrow\; (1 - \hat\theta) \sum_{i=1}^n w_i = \hat\theta \left( n - \sum_{i=1}^n w_i \right) \\
&\;\Leftrightarrow\; \sum_{i=1}^n w_i = \hat\theta n \\
&\;\Leftrightarrow\; \hat\theta = \frac{1}{n} \sum_{i=1}^n w_i \equiv \bar w.
\end{aligned}$$
The MLE is given by the sample average, $\hat\theta = \bar w$.

2 Asymptotic properties

Next, we will derive several asymptotic properties of the MLE. We will rely on the following results (where all limits are taken as $n \to \infty$):

• Law of Large Numbers (LLN): Let $x_1, \dots, x_n$ be an iid random sample with $\mu \equiv E[x_i]$. Then,
$$\bar x \overset{p}{\to} \mu,$$
where $\bar x \equiv \frac{1}{n} \sum_{i=1}^n x_i$ and where "$\overset{p}{\to}$" denotes convergence in probability.

• Central Limit Theorem (CLT): Let $x_1, \dots, x_n$ be an iid random sample with $\mu \equiv E[x_i]$ and $\sigma^2 \equiv Var[x_i] < \infty$. Then,
$$\sqrt{n}(\bar x - \mu) \overset{d}{\to} N(0, \sigma^2),$$
where $\bar x \equiv \frac{1}{n} \sum_{i=1}^n x_i$ and where "$\overset{d}{\to}$" denotes convergence in distribution.

• Continuous Mapping Theorem (CMT): Let $\{X_n\}$ be a sequence of random variables (e.g., $X_n = \bar x$; the sample average depends on the sample size $n$). Then, for any continuous function $g(\cdot)$,
$$X_n \overset{p}{\to} c \;\Rightarrow\; g(X_n) \overset{p}{\to} g(c) \qquad \text{and} \qquad X_n \overset{d}{\to} X \;\Rightarrow\; g(X_n) \overset{d}{\to} g(X),$$
where $c$ denotes some constant and $X$ some random variable.

• Slutsky: Let $\{X_n\}$ and $\{Y_n\}$ denote two sequences of random variables such that $X_n \overset{d}{\to} X$ and $Y_n \overset{p}{\to} c$ for some random variable $X$ and some constant $c$. Then,
$$X_n + Y_n \overset{d}{\to} X + c, \qquad X_n Y_n \overset{d}{\to} Xc, \qquad \text{and} \qquad X_n / Y_n \overset{d}{\to} X / c \;\; (\text{for } c \neq 0).$$

• Mean value Theorem: For any continuously differentiable function $g(\cdot)$,
$$g(x) = g(x^*) + g'(x^{**})(x - x^*),$$
where $x^{**} \in [x, x^*]$ (assuming $x^* > x$).

In what follows, it will be convenient to define the "individual" log-likelihood
$$l(\theta; w_i) \equiv \log f(w_i; \theta)$$
such that
$$l(\theta; w) = \sum_{i=1}^n l(\theta; w_i).$$
Similarly, we call $\frac{\partial}{\partial\theta} l(\theta; w_i)$ the "individual" score and let
$$\frac{\partial}{\partial\theta} l(\theta; w) = \sum_{i=1}^n \frac{\partial}{\partial\theta} l(\theta; w_i)$$
be the score. Let $\theta^*$ denote the true parameter under which the data was generated. Then, we have that
$$E\left[ \frac{\partial}{\partial\theta} l(\theta^*; w_i) \right] = 0, \qquad (2)$$
where $\frac{\partial}{\partial\theta} l(\theta^*; w_i) \equiv \left. \frac{\partial}{\partial\theta} l(\theta; w_i) \right|_{\theta = \theta^*}$.
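Before deriving equation (2) formally, the Bernoulli example can be used to make equations (1) and (2) concrete numerically. The following is a minimal sketch, assuming NumPy and SciPy are available; the seed, sample size, true parameter, and all variable names are illustrative choices, not part of the notes. It maximizes the log-likelihood numerically, checks that the maximizer coincides with the sample average $\bar w$, and checks that the sample average of the individual scores evaluated at the true $\theta^*$ is close to zero.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
theta_star = 0.3                             # true parameter of the Bernoulli dgp
n = 10_000
w = rng.binomial(1, theta_star, size=n)      # iid Bernoulli(theta_star) sample

def neg_loglik(theta):
    # negative Bernoulli log-likelihood l(theta; w)
    return -np.sum(w * np.log(theta) + (1 - w) * np.log(1 - theta))

# numerical MLE: minimize the negative log-likelihood over (0, 1)
res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
theta_hat = res.x

print("numerical MLE :", theta_hat)
print("sample average:", w.mean())           # closed form: theta_hat = w_bar

# sample analogue of the expected individual score, evaluated at the true theta*
score_i = w / theta_star - (1 - w) / (1 - theta_star)
print("average score at theta*:", score_i.mean())   # close to 0, cf. equation (2)
```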
To show that equation (2) holds, note that
$$E\left[ \frac{\partial}{\partial\theta} l(\theta^*; w_i) \right] = \int \frac{\partial}{\partial\theta} l(\theta^*; w_i) f(w_i; \theta^*)\, dw_i,$$
where the expectation is taken with respect to the true data generating process (dgp), i.e., we are integrating with respect to $f(w_i; \theta^*)$. Using $l(\theta; w_i) = \log f(w_i; \theta)$, we have (under some regularity conditions that allow us to change the order of integration and differentiation)
$$E\left[ \frac{\partial}{\partial\theta} l(\theta^*; w_i) \right] = \int \frac{1}{f(w_i; \theta^*)} \frac{\partial}{\partial\theta} f(w_i; \theta^*)\, f(w_i; \theta^*)\, dw_i = \int \frac{\partial}{\partial\theta} f(w_i; \theta^*)\, dw_i = \frac{\partial}{\partial\theta} \int f(w_i; \theta^*)\, dw_i = \frac{\partial}{\partial\theta} 1 = 0.$$

Note that equation (2) can be seen as a population analogue to the first order condition given in equation (1) (the "missing" factor $\frac{1}{n}$ being immaterial). Similarly, this can be seen as motivation for the MLE: The true value $\theta^*$ sets the expected ("individual") score equal to zero and the MLE $\hat\theta$ sets the sample analogue/average equal to zero. Since sample averages converge (in probability) to population moments (cf. LLN), we would hope that the solution to the "sample problem" (what value of $\theta$ solves equation (1)?) converges (in probability) to the solution of the "population problem" (what value of $\theta$ solves equation (2)?). It can be shown that this intuition is correct and that under certain regularity conditions the MLE is consistent, i.e.,
$$\hat\theta \overset{p}{\to} \theta^*.$$
For the purpose of this class, we take this as given.

Example 1 - continued. Check for yourself that equation (2) holds.

In order to derive the asymptotic distribution of the MLE, we consider a mean value expansion of the score (the first derivative of the log-likelihood) at $\hat\theta$ around the true value $\theta^*$,
$$\frac{\partial}{\partial\theta} l(\hat\theta; w) = \frac{\partial}{\partial\theta} l(\theta^*; w) + \frac{\partial^2}{\partial\theta^2} l(\bar\theta; w)(\hat\theta - \theta^*),$$
where $\bar\theta$ lies between $\hat\theta$ and $\theta^*$. Since the left hand side is zero by "construction" (see equation (1)), we have
$$\begin{aligned}
-\frac{\partial^2}{\partial\theta^2} l(\bar\theta; w)(\hat\theta - \theta^*) &= \frac{\partial}{\partial\theta} l(\theta^*; w) \\
\Leftrightarrow \qquad (\hat\theta - \theta^*) &= \left( -\frac{\partial^2}{\partial\theta^2} l(\bar\theta; w) \right)^{-1} \frac{\partial}{\partial\theta} l(\theta^*; w) \\
\Leftrightarrow \qquad \sqrt{n}(\hat\theta - \theta^*) &= \left( -\frac{1}{n} \frac{\partial^2}{\partial\theta^2} l(\bar\theta; w) \right)^{-1} \sqrt{n}\, \frac{1}{n} \frac{\partial}{\partial\theta} l(\theta^*; w). \qquad (3)
\end{aligned}$$

Let's consider the two terms on the right hand side in turn. Note that
$$\frac{1}{n} \frac{\partial^2}{\partial\theta^2} l(\theta^*; w) = \frac{1}{n} \sum_{i=1}^n \frac{\partial^2}{\partial\theta^2} l(\theta^*; w_i) \overset{p}{\to} E\left[ \frac{\partial^2}{\partial\theta^2} l(\theta^*; w_i) \right]$$
by the LLN. Given that $\hat\theta \overset{p}{\to} \theta^*$, which implies that $\bar\theta \overset{p}{\to} \theta^*$ (since $\bar\theta$ is "between" $\hat\theta$ and $\theta^*$), it can be shown that
$$\frac{1}{n} \frac{\partial^2}{\partial\theta^2} l(\bar\theta; w) \overset{p}{\to} E\left[ \frac{\partial^2}{\partial\theta^2} l(\theta^*; w_i) \right].$$
Therefore, by the CMT we have that the first term on the right hand side of equation (3) satisfies
$$\left( -\frac{1}{n} \frac{\partial^2}{\partial\theta^2} l(\bar\theta; w) \right)^{-1} \overset{p}{\to} \left( -E\left[ \frac{\partial^2}{\partial\theta^2} l(\theta^*; w_i) \right] \right)^{-1}.$$

Now, consider the second term on the right hand side of equation (3). Given equation (2), we have
$$\sqrt{n}\, \frac{1}{n} \frac{\partial}{\partial\theta} l(\theta^*; w) = \sqrt{n} \left( \frac{1}{n} \frac{\partial}{\partial\theta} l(\theta^*; w) - E\left[ \frac{\partial}{\partial\theta} l(\theta^*; w_i) \right] \right) \overset{d}{\to} N\left( 0,\; E\left[ \left( \frac{\partial}{\partial\theta} l(\theta^*; w_i) \right)^2 \right] \right) \qquad (4)$$
by the CLT, where we have used the fact that
$$Var\left[ \frac{\partial}{\partial\theta} l(\theta^*; w_i) \right] = E\left[ \left( \frac{\partial}{\partial\theta} l(\theta^*; w_i) \right)^2 \right] - \left( E\left[ \frac{\partial}{\partial\theta} l(\theta^*; w_i) \right] \right)^2 = E\left[ \left( \frac{\partial}{\partial\theta} l(\theta^*; w_i) \right)^2 \right]$$
using equation (2) again. Combining the two results (using Slutsky), we get
$$\sqrt{n}(\hat\theta - \theta^*) \overset{d}{\to} N\left( 0,\; \left( -E\left[ \frac{\partial^2}{\partial\theta^2} l(\theta^*; w_i) \right] \right)^{-2} E\left[ \left( \frac{\partial}{\partial\theta} l(\theta^*; w_i) \right)^2 \right] \right).$$

This result further simplifies given the so-called information equality, i.e.,
$$E\left[ \left( \frac{\partial}{\partial\theta} l(\theta^*; w_i) \right)^2 \right] = -E\left[ \frac{\partial^2}{\partial\theta^2} l(\theta^*; w_i) \right]. \qquad (5)$$
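Before deriving equation (5), it can be instructive to verify it numerically in the Bernoulli example. The sketch below is only an illustration, assuming NumPy; the simulated sample and all names are ours. It compares the sample analogues of the two sides of (5) with the value $1/(\theta^*(1-\theta^*))$, which is the Fisher information of the Bernoulli model.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_star = 0.3
n = 200_000
w = rng.binomial(1, theta_star, size=n)

# individual score and second derivative of the Bernoulli log-likelihood,
# both evaluated at the true parameter theta*
score_i = w / theta_star - (1 - w) / (1 - theta_star)
hess_i = -w / theta_star**2 - (1 - w) / (1 - theta_star) ** 2

lhs = np.mean(score_i**2)   # sample analogue of E[(d/dtheta l)^2]
rhs = -np.mean(hess_i)      # sample analogue of -E[d^2/dtheta^2 l]

print(lhs, rhs)                                 # both close to ...
print(1 / (theta_star * (1 - theta_star)))      # ... 1/(theta*(1-theta*))
```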
The information equality can be derived as follows (recall $l(\theta; w_i) = \log f(w_i; \theta)$):
$$\begin{aligned}
E\left[ \frac{\partial^2}{\partial\theta^2} l(\theta^*; w_i) \right]
&= E\left[ \frac{\partial}{\partial\theta} \left( \frac{\partial}{\partial\theta} l(\theta^*; w_i) \right) \right]
= E\left[ \frac{\partial}{\partial\theta} \left( \frac{1}{f(w_i; \theta^*)} \frac{\partial}{\partial\theta} f(w_i; \theta^*) \right) \right] \\
&= E\left[ -\left( \frac{1}{f(w_i; \theta^*)} \frac{\partial}{\partial\theta} f(w_i; \theta^*) \right)^2 \right] + E\left[ \frac{1}{f(w_i; \theta^*)} \frac{\partial^2}{\partial\theta^2} f(w_i; \theta^*) \right] \\
&= -E\left[ \left( \frac{\partial}{\partial\theta} l(\theta^*; w_i) \right)^2 \right] + 0.
\end{aligned}$$
The second term is zero because
$$E\left[ \frac{1}{f(w_i; \theta^*)} \frac{\partial^2}{\partial\theta^2} f(w_i; \theta^*) \right] = \int \frac{1}{f(w_i; \theta^*)} \frac{\partial^2}{\partial\theta^2} f(w_i; \theta^*)\, f(w_i; \theta^*)\, dw_i = \int \frac{\partial^2}{\partial\theta^2} f(w_i; \theta^*)\, dw_i = \frac{\partial^2}{\partial\theta^2} \int f(w_i; \theta^*)\, dw_i = \frac{\partial^2}{\partial\theta^2} 1 = 0,$$
where changing the order of integration and differentiation is allowed under certain regularity conditions.

Given equation (5), we obtain our "final" result
$$\sqrt{n}(\hat\theta - \theta^*) \overset{d}{\to} N\left( 0,\; \left( -E\left[ \frac{\partial^2}{\partial\theta^2} l(\theta^*; w_i) \right] \right)^{-1} \right). \qquad (6)$$
The term $-E\left[ \frac{\partial^2}{\partial\theta^2} l(\theta^*; w_i) \right]$ is called the Fisher "information". The term information is intuitive because it corresponds to the (negative of the expected) second derivative of the log-likelihood function. If the log-likelihood function has a lot of "curvature" then it will be easy to find its maximum, i.e., we have a lot of information, and the MLE will have a small variance, which is given by the inverse of the information (a lot of information = small variance).

Example 1 - continued. The MLE is given by $\bar w$. Since there exists a closed form expression for the MLE ($\hat\theta = \bar w$), we can derive its asymptotic distribution directly. In particular, we have (using $\theta^*$ as the true value for notational consistency)
$$\sqrt{n}(\hat\theta - \theta^*) = \sqrt{n}(\bar w - \theta^*) \overset{d}{\to} N(0, \theta^*(1 - \theta^*))$$
by a CLT, since $Var(w_i) = \theta^*(1 - \theta^*)$. Check for yourself that
$$-E\left[ \frac{\partial^2}{\partial\theta^2} l(\theta^*; w_i) \right] = \frac{1}{\theta^*(1 - \theta^*)},$$
i.e., the above theory gives the same result as the standard CLT in this example.

3 Inference

3.1 t-test

We can use the result in equation (6) to do inference, i.e., test certain hypotheses of interest and construct confidence intervals. In order to test
$$H_0: \theta^* = \theta_0 \qquad (7)$$
we can for example rely on the t-test. Under $H_0$, i.e., if the true value of $\theta$, namely $\theta^*$, equals $\theta_0$, we have that
$$\sqrt{n}(\hat\theta - \theta_0) \overset{d}{\to} N\left( 0,\; \left( -E\left[ \frac{\partial^2}{\partial\theta^2} l(\theta_0; w_i) \right] \right)^{-1} \right).$$
We cannot use this result directly because $E\left[ \frac{\partial^2}{\partial\theta^2} l(\theta_0; w_i) \right]$ is unknown. But we can replace it with a consistent estimator:
$$\frac{1}{n} \sum_{i=1}^n \frac{\partial^2}{\partial\theta^2} l(\theta_0; w_i) \overset{p}{\to} E\left[ \frac{\partial^2}{\partial\theta^2} l(\theta_0; w_i) \right]. \qquad (8)$$
Note that $\theta_0$ is "known" under the null hypothesis and therefore does not need to be replaced with a (consistent) estimator. However, in practice we often use
$$\frac{1}{n} \sum_{i=1}^n \frac{\partial^2}{\partial\theta^2} l(\hat\theta; w_i) \overset{p}{\to} E\left[ \frac{\partial^2}{\partial\theta^2} l(\theta_0; w_i) \right]. \qquad (9)$$
The typical t-statistic used in practice is then given by
$$\frac{\sqrt{n}(\hat\theta - \theta_0)}{\sqrt{\left( -\frac{1}{n} \sum_{i=1}^n \frac{\partial^2}{\partial\theta^2} l(\hat\theta; w_i) \right)^{-1}}} = \frac{\hat\theta - \theta_0}{\sqrt{\left( -\sum_{i=1}^n \frac{\partial^2}{\partial\theta^2} l(\hat\theta; w_i) \right)^{-1}}}$$
and satisfies
$$\frac{\sqrt{n}(\hat\theta - \theta_0)}{\sqrt{\left( -\frac{1}{n} \sum_{i=1}^n \frac{\partial^2}{\partial\theta^2} l(\hat\theta; w_i) \right)^{-1}}} \overset{d}{\to} N(0, 1), \qquad (10)$$
by Slutsky. In order to test $H_0$, we rely on the t-test that rejects if the absolute value of the t-statistic exceeds the corresponding critical value, $z_{1-\alpha/2}$ (the absolute value is used because the testing problem is two-sided: $H_0: \theta^* = \theta_0$ vs. $H_1: \theta^* \neq \theta_0$, where $H_1$ denotes the "alternative hypothesis"). The corresponding $1 - \alpha$ confidence interval (CI) is given by
$$\hat\theta \pm z_{1-\alpha/2} \sqrt{\left( -\sum_{i=1}^n \frac{\partial^2}{\partial\theta^2} l(\hat\theta; w_i) \right)^{-1}}.$$
Inspection of the CI explains why we typically use the estimator given in equation (9) rather than the one given in equation (8): We would not be able to write down the CI in such a compact form if we used equation (8).
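The t-statistic and confidence interval are straightforward to compute once the individual second derivatives are available. The sketch below applies them to the Bernoulli example; it is a minimal illustration assuming NumPy and SciPy, with the simulated data, the hypothesized value, and all names chosen by us for illustration only.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
theta_star, theta_0, n, alpha = 0.3, 0.25, 500, 0.05
w = rng.binomial(1, theta_star, size=n)

theta_hat = w.mean()                         # Bernoulli MLE

# observed information: minus the sum of individual second derivatives at theta_hat
neg_hess_sum = np.sum(w / theta_hat**2 + (1 - w) / (1 - theta_hat) ** 2)
se = np.sqrt(1.0 / neg_hess_sum)             # standard error (inverse information)

t_stat = (theta_hat - theta_0) / se          # t-statistic for H0: theta* = theta_0
z = norm.ppf(1 - alpha / 2)                  # two-sided critical value

print("t-statistic:", t_stat, "reject H0:", abs(t_stat) > z)
print("95% CI:", (theta_hat - z * se, theta_hat + z * se))
```

For the Bernoulli model the estimated variance simplifies to $\hat\theta(1-\hat\theta)/n$, so the interval printed above coincides with the familiar $\bar w \pm z_{1-\alpha/2}\sqrt{\bar w(1-\bar w)/n}$.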
3.1.1 Delta method

A very useful result is given by the delta method (often used in combination with the t-test), which states that for any continuously differentiable function $g(\theta)$ for which $g'(\theta^*) \neq 0$ (where $g'(\theta)$ denotes the derivative of $g(\theta)$)
$$\sqrt{n}\left( g(\hat\theta) - g(\theta^*) \right) \overset{d}{\to} N\left( 0,\; \left( g'(\theta^*) \right)^2 \left( -E\left[ \frac{\partial^2}{\partial\theta^2} l(\theta^*; w_i) \right] \right)^{-1} \right).$$
This follows from the mean value theorem, i.e.,
$$g(\hat\theta) = g(\theta^*) + g'(\bar\theta)(\hat\theta - \theta^*) \quad\Leftrightarrow\quad \sqrt{n}\left( g(\hat\theta) - g(\theta^*) \right) = g'(\bar\theta) \sqrt{n}(\hat\theta - \theta^*),$$
together with $g'(\bar\theta) \overset{p}{\to} g'(\theta^*)$ (by the CMT), equation (6), and Slutsky. Similar to above, we can use the delta method to test $H_0: g(\theta^*) = g(\theta_0)$ or to construct confidence intervals for $g(\theta^*)$ by replacing the unknown variance by a (consistent) estimator such as
$$\left( g'(\hat\theta) \right)^2 \left( -\frac{1}{n} \sum_{i=1}^n \frac{\partial^2}{\partial\theta^2} l(\hat\theta; w_i) \right)^{-1} \overset{p}{\to} \left( g'(\theta^*) \right)^2 \left( -E\left[ \frac{\partial^2}{\partial\theta^2} l(\theta^*; w_i) \right] \right)^{-1}.$$
See here for a multivariate version of the delta method which is very useful (not only for this course).

3.2 Alternative tests

Next to the t-test for testing $H_0$ given in equation (7), which is a special case of what is generally referred to as Wald(-type) tests, there are two additional (ML-specific) tests: the Score (or Lagrange Multiplier) test and the Likelihood Ratio test.

3.2.1 Score test

We first consider the Score test. Note that under $H_0: \theta^* = \theta_0$ equation (4) implies
$$\sqrt{n}\, \frac{1}{n} \frac{\partial}{\partial\theta} l(\theta_0; w) \overset{d}{\to} N\left( 0,\; E\left[ \left( \frac{\partial}{\partial\theta} l(\theta_0; w_i) \right)^2 \right] \right).$$
As above, the variance is unknown. One possible estimator is given by
$$\frac{1}{n} \sum_{i=1}^n \left( \frac{\partial}{\partial\theta} l(\theta_0; w_i) \right)^2.$$
Here, we typically do not replace $\theta_0$ by $\hat\theta$, because if we consider
$$\frac{\sqrt{n}\, \frac{1}{n} \frac{\partial}{\partial\theta} l(\theta_0; w)}{\sqrt{\frac{1}{n} \sum_{i=1}^n \left( \frac{\partial}{\partial\theta} l(\theta_0; w_i) \right)^2}} = \frac{\frac{\partial}{\partial\theta} l(\theta_0; w)}{\sqrt{\sum_{i=1}^n \left( \frac{\partial}{\partial\theta} l(\theta_0; w_i) \right)^2}} \overset{d}{\to} N(0, 1),$$
we note that the left hand side does not depend on $\hat\theta$. Put differently, we can use the above test statistic, i.e., the left hand side, for testing $H_0: \theta^* = \theta_0$ without ever computing $\hat\theta$, which can be time consuming in certain models. It is standard practice to rely on the square of the above test statistic, which satisfies
$$\frac{n \left( \frac{1}{n} \frac{\partial}{\partial\theta} l(\theta_0; w) \right)^2}{\frac{1}{n} \sum_{i=1}^n \left( \frac{\partial}{\partial\theta} l(\theta_0; w_i) \right)^2} \overset{d}{\to} \chi^2(1)$$
by the CMT, where $\chi^2(k)$ denotes a chi-square distribution with degrees of freedom equal to $k$. As above, the resulting Score test rejects if the above (squared) test statistic exceeds the corresponding critical value.

3.2.2 Likelihood Ratio test

Next, we consider the Likelihood Ratio test. The Likelihood Ratio statistic for testing $H_0: \theta^* = \theta_0$ is given by
$$2\left( l(\hat\theta; w) - l(\theta_0; w) \right).$$
Under $H_0$, the above test statistic satisfies
$$2\left( l(\hat\theta; w) - l(\theta_0; w) \right) \overset{d}{\to} \chi^2(1).$$
To show this, consider a second-order Taylor expansion (a generalization of the mean value Theorem)
$$l(\theta_0; w) = l(\hat\theta; w) + \frac{\partial}{\partial\theta} l(\hat\theta; w)(\theta_0 - \hat\theta) + \frac{1}{2} \frac{\partial^2}{\partial\theta^2} l(\bar\theta; w)(\theta_0 - \hat\theta)^2,$$
where $\bar\theta$ lies between $\theta_0$ and $\hat\theta$. Given equation (1), we obtain
$$2\left( l(\hat\theta; w) - l(\theta_0; w) \right) = -\frac{\partial^2}{\partial\theta^2} l(\bar\theta; w)(\hat\theta - \theta_0)^2 = -\frac{1}{n} \frac{\partial^2}{\partial\theta^2} l(\bar\theta; w)\, n(\hat\theta - \theta_0)^2. \qquad (11)$$
Since
$$\sqrt{-\frac{1}{n} \frac{\partial^2}{\partial\theta^2} l(\bar\theta; w)}\; \sqrt{n}(\hat\theta - \theta_0) = \frac{\sqrt{n}(\hat\theta - \theta_0)}{\sqrt{\left( -\frac{1}{n} \frac{\partial^2}{\partial\theta^2} l(\bar\theta; w) \right)^{-1}}} \overset{d}{\to} N(0, 1)$$
similar to equation (10) above, the right hand side of equation (11) satisfies
$$-\frac{1}{n} \frac{\partial^2}{\partial\theta^2} l(\bar\theta; w)\, n(\hat\theta - \theta_0)^2 \overset{d}{\to} \chi^2(1)$$
by the CMT. The Likelihood Ratio test therefore rejects the null hypothesis when the Likelihood Ratio statistic exceeds the corresponding critical value.

The Likelihood Ratio test also applies more generally, i.e., when $\theta$ is not necessarily scalar and when the null hypothesis specifies only part of the (true) parameter vector.
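Before turning to this more general case, the three test statistics (Wald/t, Score, and Likelihood Ratio) can be compared in the scalar Bernoulli example. The sketch below is only an illustration, assuming NumPy and SciPy; the simulated data, $\theta_0$, and all names are our own. All three statistics are compared with the same $\chi^2(1)$ critical value (the Wald statistic here is the square of the t-statistic).

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)
theta_star, theta_0, n = 0.3, 0.25, 500
w = rng.binomial(1, theta_star, size=n)
theta_hat = w.mean()

def loglik(theta):
    return np.sum(w * np.log(theta) + (1 - w) * np.log(1 - theta))

def score_i(theta):
    return w / theta - (1 - w) / (1 - theta)   # individual scores

# Wald: squared t-statistic, information evaluated at theta_hat
wald = n * (theta_hat - theta_0) ** 2 / (theta_hat * (1 - theta_hat))

# Score (LM): uses only quantities evaluated at theta_0, no MLE needed
lm = score_i(theta_0).sum() ** 2 / np.sum(score_i(theta_0) ** 2)

# Likelihood Ratio
lr = 2 * (loglik(theta_hat) - loglik(theta_0))

crit = chi2.ppf(0.95, df=1)
print("Wald:", wald, "Score:", lm, "LR:", lr, "critical value:", crit)
```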
If the null hypothesis does not specify the entire parameter vector, $\theta_0$ needs to be replaced with a "restricted" estimator, say $\tilde\theta$ (see also here). In general, the asymptotic distribution of the test statistic is given by a $\chi^2(r)$, where $r$ denotes the number of restrictions imposed by the null hypothesis. Take $\theta = (\theta_1, \theta_2)'$. Then, some examples are given by

• $H_0: \theta_1^* = 0 \;\Rightarrow\; r = 1$

• $H_0: \theta_1^* = \theta_2^* \;\Rightarrow\; r = 1$

• $H_0: \theta_1^* = \theta_2^* = 0 \;\Rightarrow\; r = 2$.

4 Conditional ML

So far, we have considered (unconditional) Maximum Likelihood estimation, i.e., we have specified a distribution function for $w_i$, $f(w_i; \theta)$, where we have implicitly taken $w_i$ to be scalar. In many cases, however, we observe more than one random variable for each entity/individual in our sample. Furthermore, we typically differentiate between an outcome variable $y_i$ and (a vector of) explanatory variables $x_i = (x_{i1}, \dots, x_{iK})'$. Let $w_i = (y_i, x_i')'$. Now, we can write the likelihood function as follows
$$L(\theta; w) = f(w; \theta) = \prod_{i=1}^n f(w_i; \theta) = \prod_{i=1}^n f(y_i, x_i; \theta),$$
where $f(y_i, x_i; \theta)$ denotes the joint distribution of $y_i$ and $x_i$. By Bayes' rule, we have
$$f(y_i, x_i; \theta) = f(y_i | x_i; \theta) f(x_i; \theta).$$
Plugging in and taking logarithms, we obtain
$$l(\theta; w) = \sum_{i=1}^n \log f(y_i | x_i; \theta) + \sum_{i=1}^n \log f(x_i; \theta) \equiv l(\theta; y | x) + \sum_{i=1}^n \log f(x_i; \theta),$$
where $l(\theta; y | x)$ denotes the so-called conditional log-likelihood (function). (Here, we use the notation $| x$ to highlight the conditioning on $x$.) In economics, we often feel more comfortable specifying $f(y_i | x_i; \theta)$ rather than $f(y_i, x_i; \theta)$ or, equivalently, we do not feel comfortable specifying $f(x_i; \theta)$. It turns out that basically all derivations above go through if we replace the log-likelihood with the conditional log-likelihood. The resulting estimator is sometimes referred to as the conditional MLE, but since conditioning is so common in economics, we typically don't explicitly mention the "conditional."

Example 2. $y_i \in \{0, 1\}$. Consider the probit model, where we specify the conditional mass function of $y_i$ given $x_i$ as follows
$$f(y_i | x_i; \theta) = \Phi(x_i'\theta)^{y_i} \left( 1 - \Phi(x_i'\theta) \right)^{1 - y_i},$$
where $\Phi$ and $\phi$ denote the standard normal cdf and pdf, respectively. The conditional "individual" log-likelihood is defined as
$$l(\theta; y_i | x_i) = y_i \log\left( \Phi(x_i'\theta) \right) + (1 - y_i) \log\left( 1 - \Phi(x_i'\theta) \right).$$
Taking the derivative with respect to (the vector) $\theta$, we obtain
$$\frac{\partial}{\partial\theta} l(\theta; y_i | x_i) = y_i \frac{1}{\Phi(x_i'\theta)} \phi(x_i'\theta) x_i - (1 - y_i) \frac{1}{1 - \Phi(x_i'\theta)} \phi(x_i'\theta) x_i = \left[ \frac{y_i}{\Phi(x_i'\theta)} - \frac{1 - y_i}{1 - \Phi(x_i'\theta)} \right] \phi(x_i'\theta) x_i.$$
To show that equation (2) holds in the context of conditional Maximum Likelihood (estimation), note that
$$\begin{aligned}
E\left[ \left. \frac{\partial}{\partial\theta} l(\theta^*; y_i | x_i) \right| x_i \right]
&= \left[ \frac{E[y_i | x_i]}{\Phi(x_i'\theta^*)} - \frac{1 - E[y_i | x_i]}{1 - \Phi(x_i'\theta^*)} \right] \phi(x_i'\theta^*) x_i \\
&= \left[ \frac{\Phi(x_i'\theta^*)}{\Phi(x_i'\theta^*)} - \frac{1 - \Phi(x_i'\theta^*)}{1 - \Phi(x_i'\theta^*)} \right] \phi(x_i'\theta^*) x_i = 0.
\end{aligned}$$
Therefore, by the law of iterated expectations
$$E\left[ \frac{\partial}{\partial\theta} l(\theta^*; y_i | x_i) \right] = E\left[ E\left[ \left. \frac{\partial}{\partial\theta} l(\theta^*; y_i | x_i) \right| x_i \right] \right] = E[0] = 0.$$
Similarly, the other "results" go through.
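As an illustration of conditional ML in practice, the sketch below simulates data from a probit model and maximizes the conditional log-likelihood numerically. It is a minimal sketch assuming NumPy and SciPy; the design (an intercept plus one regressor), the true coefficients, and all names are illustrative choices, not part of the notes.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(4)
n, K = 2_000, 2
theta_star = np.array([0.5, -1.0])                             # true coefficients (simulation only)
x = np.column_stack([np.ones(n), rng.normal(size=n)])          # intercept + one regressor
y = (x @ theta_star + rng.normal(size=n) > 0).astype(float)    # probit dgp: P(y=1|x) = Phi(x'theta*)

def neg_cond_loglik(theta):
    # conditional log-likelihood l(theta; y|x) of the probit model, negated
    p = norm.cdf(x @ theta)
    p = np.clip(p, 1e-12, 1 - 1e-12)                           # guard against log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

res = minimize(neg_cond_loglik, x0=np.zeros(K), method="BFGS")
theta_hat = res.x
print("probit MLE:", theta_hat)                                # close to theta_star for large n

# average conditional score at theta_hat, numerically (approximately) zero, cf. equation (1)
p_hat = norm.cdf(x @ theta_hat)
score = ((y / p_hat - (1 - y) / (1 - p_hat)) * norm.pdf(x @ theta_hat))[:, None] * x
print("average score at theta_hat:", score.mean(axis=0))
```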