9. Maximum Likelihood Estimation

I. Ordinary Least Squares Estimation: only requires a model for the conditional mean of the response variable.

For a linear model
\[
Y_j = \beta_0 + \beta_1 X_{1j} + \cdots + \beta_r X_{rj} + \epsilon_j ,
\]
the OLS estimator for
\[
\beta = \begin{bmatrix} \beta_0 \\ \vdots \\ \beta_r \end{bmatrix}
\quad \mbox{is any} \quad
b = \begin{bmatrix} b_0 \\ \vdots \\ b_r \end{bmatrix}
\]
that minimizes the sum of squared residuals
\[
Q(b) = \sum_{j=1}^{n} (Y_j - b_0 - b_1 X_{1j} - \cdots - b_r X_{rj})^2 .
\]

The estimating equations (normal equations) are
\[
\frac{\partial Q(b)}{\partial b_0} = -2 \sum_{j=1}^{n} (Y_j - b_0 - b_1 X_{1j} - \cdots - b_r X_{rj}) = 0
\]
and
\[
\frac{\partial Q(b)}{\partial b_i} = -2 \sum_{j=1}^{n} X_{ij}\,(Y_j - b_0 - b_1 X_{1j} - \cdots - b_r X_{rj}) = 0
\quad \mbox{for } i = 1, 2, \ldots, r .
\]
The matrix form of these equations is
\[
(X^T X)\, b = X^T Y ,
\]
and a solution is
\[
b = (X^T X)^{-} X^T Y .
\]

The OLS estimator for an estimable function $C^T \beta$ is
\[
C^T b = C^T (X^T X)^{-} X^T Y
\]
for any solution $b$ to the normal equations.

For a Gauss-Markov model with $E(Y) = X\beta$ and $Var(Y) = \sigma^2 I$, the OLS estimator of an estimable function $C^T\beta$ is the unique best linear unbiased estimator (b.l.u.e.) of $C^T\beta$:
\[
E(C^T b) = C^T \beta
\quad \mbox{and} \quad
Var(C^T b) = \sigma^2\, C^T (X^T X)^{-} C
\]
is smaller than the variance of any other linear unbiased estimator for $C^T\beta$. The distribution of $Y$ is not completely specified.

When $Var(Y) = \Sigma$ is not necessarily $\sigma^2 I$, the OLS estimator still satisfies
\[
E(C^T b) = C^T \beta
\quad \mbox{and} \quad
Var(C^T b) = C^T (X^T X)^{-} X^T \Sigma\, X (X^T X)^{-} C ,
\quad \mbox{where } \Sigma = Var(Y) .
\]
The distribution of $Y$ is not completely specified.

II. Generalized Least Squares Estimation

Consider the Aitken model $E(Y) = X\beta$ and $Var(Y) = \sigma^2 V$, where $V$ is a positive definite symmetric matrix of known constants and $\sigma^2$ is an unknown variance parameter. The distribution of $Y$ is not completely specified.

A GLS estimator for $\beta$ is any $b$ that minimizes
\[
Q(b) = (Y - Xb)^T V^{-1} (Y - Xb) .
\]
The estimating equations are
\[
(X^T V^{-1} X)\, b = X^T V^{-1} Y ,
\]
and a solution is
\[
b_{GLS} = (X^T V^{-1} X)^{-} X^T V^{-1} Y .
\]

For any estimable function $C^T\beta$, the unique b.l.u.e. is
\[
C^T b_{GLS} = C^T (X^T V^{-1} X)^{-} X^T V^{-1} Y
\]
for any solution to the normal equations, with
\[
E(C^T b_{GLS}) = C^T \beta
\quad \mbox{and} \quad
Var(C^T b_{GLS}) = \sigma^2\, C^T (X^T V^{-1} X)^{-} C
\]
(from Definition 3.8 with $\Sigma = \sigma^2 V$).

An unbiased estimator for $\sigma^2$ in the Aitken model is
\[
\hat{\sigma}^2_{GLS} = \frac{1}{n - \mbox{rank}(X)}\; Y^T \left[ V^{-1} - V^{-1} X (X^T V^{-1} X)^{-} X^T V^{-1} \right] Y .
\]

In practice, $V$ may not be known. Then $\sigma^2$ and $b_{GLS}$ can be approximated by replacing $V$ with a consistent estimator for $V$:
- The estimator for $C^T\beta$ is not b.l.u.e.
- The estimator for $\sigma^2$ is not unbiased.
- Both estimators are consistent.

III. Maximum Likelihood Estimation

The model must include a specification of the joint distribution of the observations. Find the parameter values that maximize the "likelihood" of the observed data.

Example: Normal theory Gauss-Markov model:
\[
Y_j = \beta_0 + \beta_1 X_{1j} + \cdots + \beta_r X_{rj} + \epsilon_j ,
\quad \mbox{where } \epsilon_j \sim NID(0, \sigma^2), \quad j = 1, \ldots, n ,
\]
or
\[
Y = \begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix} \sim N(X\beta,\, \sigma^2 I) .
\]

For the normal-theory Gauss-Markov model, the likelihood function is
\[
L(\beta, \sigma^2; Y_1, \ldots, Y_n) = \frac{1}{(2\pi)^{n/2}\, \sigma^{n}}\; e^{-\frac{1}{2\sigma^2} (Y - X\beta)^T (Y - X\beta)} .
\]
Find values of $\beta$ and $\sigma^2$ that maximize this likelihood function. This is equivalent to finding values of $\beta$ and $\sigma^2$ that maximize the log-likelihood
\[
\ell(\beta, \sigma^2; Y_1, \ldots, Y_n) = \log L(\beta, \sigma^2; Y_1, \ldots, Y_n)
= -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2} (Y - X\beta)^T (Y - X\beta)
\]
\[
= -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{j=1}^{n} (Y_j - \beta_0 - \cdots - \beta_r X_{rj})^2 .
\]
The sum of squares in the last term is minimized by an OLS estimator for $\beta$, regardless of the value of $\sigma^2$.

Solve the likelihood equations:
\[
\frac{\partial \ell(\beta, \sigma^2; Y)}{\partial \beta_0} = \frac{1}{\sigma^2} \sum_{j=1}^{n} (Y_j - \beta_0 - \cdots - \beta_r X_{rj}) = 0 ,
\]
\[
\frac{\partial \ell(\beta, \sigma^2; Y)}{\partial \beta_i} = \frac{1}{\sigma^2} \sum_{j=1}^{n} X_{ij}\,(Y_j - \beta_0 - \cdots - \beta_r X_{rj}) = 0
\quad \mbox{for } i = 1, 2, \ldots, r ,
\]
and
\[
\frac{\partial \ell(\beta, \sigma^2; Y)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{j=1}^{n} (Y_j - \beta_0 - \cdots - \beta_r X_{rj})^2 = 0 .
\]

Solution:
\[
\hat{\beta} = b_{OLS} = (X^T X)^{-} X^T Y
\]
and
\[
\hat{\sigma}^2 = \frac{1}{n} \sum_{j=1}^{n} (Y_j - \hat{\beta}_0 - \hat{\beta}_1 X_{1j} - \cdots - \hat{\beta}_r X_{rj})^2
= \frac{1}{n}\, Y^T (I - P_X)\, Y = \frac{1}{n}\, SSE .
\]

This estimator for $\sigma^2$ is biased. $\frac{1}{n - \mbox{rank}(X)}\, SSE$ is an unbiased estimator for $\sigma^2$. $\frac{1}{n}\, SSE$ and $\frac{1}{n - \mbox{rank}(X)}\, SSE$ are asymptotically equivalent.
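As a numerical companion to the formulas above (not part of the original notes), the following Python sketch simulates data from an Aitken model with a known diagonal $V$, solves the OLS and GLS normal equations, and compares the unbiased estimator of $\sigma^2$ with the divide-by-$n$ maximum likelihood form. The variable names, the simulated design, and the use of numpy are illustrative choices, not anything prescribed by the notes.

    # Illustrative sketch: OLS and GLS for a simulated Aitken model
    # Y = X beta + eps with Var(eps) = sigma^2 V and V known.
    import numpy as np

    rng = np.random.default_rng(0)
    n, r = 50, 2
    X = np.column_stack([np.ones(n), rng.normal(size=(n, r))])   # intercept + r covariates
    beta_true = np.array([1.0, 2.0, -0.5])
    sigma2_true = 4.0

    # A known positive definite V (diagonal, i.e. heteroscedastic errors)
    V = np.diag(rng.uniform(0.5, 2.0, size=n))
    Y = X @ beta_true + np.linalg.cholesky(sigma2_true * V) @ rng.normal(size=n)

    # OLS: solve the normal equations (X'X) b = X'Y
    b_ols = np.linalg.pinv(X.T @ X) @ X.T @ Y

    # GLS: solve (X' V^{-1} X) b = X' V^{-1} Y
    Vinv = np.linalg.inv(V)
    b_gls = np.linalg.pinv(X.T @ Vinv @ X) @ X.T @ Vinv @ Y

    # Unbiased estimator of sigma^2 in the Aitken model: SSE_GLS / (n - rank(X))
    resid = Y - X @ b_gls
    sse_gls = resid @ Vinv @ resid
    sigma2_unbiased = sse_gls / (n - np.linalg.matrix_rank(X))

    # Maximum likelihood form divides by n instead and is biased downward
    sigma2_mle = sse_gls / n

    print("OLS: ", np.round(b_ols, 3))
    print("GLS: ", np.round(b_gls, 3))
    print("sigma^2 unbiased:", round(sigma2_unbiased, 3), " MLE:", round(sigma2_mle, 3))

Under the Aitken model, the GLS estimator of any estimable function has variance no larger than that of the OLS estimator, which is the b.l.u.e. property stated above; the divide-by-$n$ estimator of $\sigma^2$ corresponds to the maximum likelihood solution discussed next.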
General normal-theory linear model:
\[
Y = X\beta + \epsilon , \quad \mbox{where } \epsilon \sim N(0, \Sigma) \mbox{ and } \Sigma \mbox{ is known.}
\]
The multivariate normal likelihood function is
\[
L(\beta; Y) = \frac{1}{(2\pi)^{n/2}\, |\Sigma|^{1/2}}\; e^{-\frac{1}{2} (Y - X\beta)^T \Sigma^{-1} (Y - X\beta)} ,
\]
and the log-likelihood function is
\[
\ell(\beta; Y) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log(|\Sigma|) - \frac{1}{2} (Y - X\beta)^T \Sigma^{-1} (Y - X\beta) .
\]

Maximizing the log-likelihood when $\Sigma$ is known is equivalent to finding a $\beta$ that minimizes
\[
(Y - X\beta)^T \Sigma^{-1} (Y - X\beta) .
\]
The estimating equations are
\[
(X^T \Sigma^{-1} X)\, \beta = X^T \Sigma^{-1} Y ,
\]
and solutions are of the form
\[
\hat{\beta} = b_{GLS} = (X^T \Sigma^{-1} X)^{-} X^T \Sigma^{-1} Y .
\]
For the general normal theory linear model, when $\Sigma$ is known, maximum likelihood estimation is the same as generalized least squares estimation.

Similarly, generalized least squares estimation and maximum likelihood estimation are equivalent for $\beta$ in the Aitken model $Y \sim N(X\beta, \sigma^2 V)$ when $V$ is known. Substitute $\Sigma = \sigma^2 V$ into the previous discussion. Then, the log-likelihood is
\[
\ell(\beta, \sigma^2; Y) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2}\log(|V|)
- \frac{1}{2\sigma^2} (Y - X\beta)^T V^{-1} (Y - X\beta) .
\]
The "likelihood equations" are
\[
(X^T V^{-1} X)\, \beta = X^T V^{-1} Y
\quad \mbox{and} \quad
\sigma^2 = \frac{1}{n} (Y - X\beta)^T V^{-1} (Y - X\beta) .
\]
Solutions have the form
\[
\hat{\beta} = b_{GLS} = (X^T V^{-1} X)^{-} X^T V^{-1} Y
\quad \mbox{and} \quad
\hat{\sigma}^2 = \frac{1}{n} (Y - X\hat{\beta})^T V^{-1} (Y - X\hat{\beta}) .
\]

When $\Sigma$ contains unknown parameters, you could maximize the log-likelihood
\[
\ell(\beta, \Sigma; Y) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log(|\Sigma|) - \frac{1}{2} (Y - X\beta)^T \Sigma^{-1} (Y - X\beta)
\]
with respect to both $\beta$ and $\Sigma$.
- There may be no algebraic formulas for solutions to the joint likelihood equations, say $\hat{\beta}$ and $\hat{\Sigma}$.
- The MLE for $\Sigma$ is usually biased.
The likelihood equations are more complicated when $V$ contains unknown parameters.

General Properties of MLE's

Regularity Conditions:
(i) The parameter space has finite dimension, is closed and compact, and the true parameter vector is in the interior of the parameter space.
(ii) Probability distributions defined by any two different values of the parameter vector are distinct (an identifiability condition).
(iii) The first three partial derivatives of the log-likelihood function, with respect to the parameters,
  (a) exist
  (b) are bounded by a function with a finite expectation.
(iv) The expectation of the negative of the matrix of second partial derivatives of the log-likelihood is
  (a) finite
  (b) positive definite
in a neighborhood of the true value of the parameter vector. This is called the Fisher information matrix.

Suppose $Y_1, \ldots, Y_n$ are independent vectors of observations, with
\[
Y_j = \begin{bmatrix} Y_{1j} \\ \vdots \\ Y_{pj} \end{bmatrix} ,
\]
and the density function (or probability function) is $f(Y_j; \theta)$. Then, the joint likelihood function is
\[
L(\theta; Y_1, \ldots, Y_n) = \prod_{j=1}^{n} f(Y_j; \theta)
\]
and the log-likelihood function is
\[
\ell(\theta; Y_1, \ldots, Y_n) = \log\left( L(\theta; Y_1, \ldots, Y_n) \right) = \sum_{j=1}^{n} \log\left( f(Y_j; \theta) \right) .
\]

The score function
\[
u(\theta) = \begin{bmatrix} u_1(\theta) \\ \vdots \\ u_r(\theta) \end{bmatrix}
= \begin{bmatrix} \frac{\partial \ell(\theta;\, Y_1, \ldots, Y_n)}{\partial \theta_1} \\ \vdots \\ \frac{\partial \ell(\theta;\, Y_1, \ldots, Y_n)}{\partial \theta_r} \end{bmatrix}
\]
is the vector of first partial derivatives of the log-likelihood function with respect to the elements of
\[
\theta = \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_r \end{bmatrix} .
\]
The likelihood equations are
\[
u(\theta; Y_1, \ldots, Y_n) = 0 .
\]
The maximum likelihood estimator (MLE)
\[
\hat{\theta} = \begin{bmatrix} \hat{\theta}_1 \\ \vdots \\ \hat{\theta}_r \end{bmatrix}
\]
is a solution to the likelihood equations that maximizes the log-likelihood function.

Fisher information matrix:
\[
i(\theta) = Var\left( u(\theta; Y_1, \ldots, Y_n) \right)
= E\left( u(\theta; Y_1, \ldots, Y_n)\, [u(\theta; Y_1, \ldots, Y_n)]^T \right)
= -E\left( \left[ \frac{\partial^2 \ell(\theta;\, Y_1, \ldots, Y_n)}{\partial \theta_j\, \partial \theta_k} \right] \right) .
\]

Let $\theta$ denote the parameter vector, $i(\theta)$ denote the Fisher information matrix, and $\hat{\theta}$ denote the MLE for $\theta$. Then, if the Regularity Conditions are satisfied, we have the following results:

Result 9.1: $\hat{\theta}$ is a consistent estimator:
\[
\Pr\left\{ (\hat{\theta} - \theta)^T (\hat{\theta} - \theta) > \epsilon \right\} \rightarrow 0
\quad \mbox{as } n \rightarrow \infty , \mbox{ for any } \epsilon > 0 .
\]

Result 9.2: Asymptotic normality:
\[
\sqrt{n}\, (\hat{\theta} - \theta) \stackrel{dist}{\longrightarrow} N\!\left( 0,\; \lim_{n \rightarrow \infty} n\, [i(\theta)]^{-1} \right)
\quad \mbox{as } n \rightarrow \infty .
\]
With a slight abuse of notation we may express this as
\[
\hat{\theta} \sim N\!\left( \theta,\; [i(\theta)]^{-1} \right)
\]
for "large" sample sizes.
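The sketch below (again illustrative, not from the notes) ties these pieces together for a general normal-theory model with known $\Sigma$: it maximizes the log-likelihood numerically, confirms that the maximizer agrees with the closed-form GLS solution, checks that the score vector vanishes at the MLE, and computes $[i(\beta)]^{-1} = (X^T \Sigma^{-1} X)^{-1}$, which for this model (with $\Sigma$ known and $X$ of full column rank) is exactly the variance of the estimator anticipated by Result 9.2. All variable names, the simulated $\Sigma$, and the choice of scipy's optimizer are assumptions made for the example.

    # Illustrative sketch: with Sigma known, maximizing the normal log-likelihood
    # over beta numerically reproduces the closed-form GLS solution.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    n, p = 40, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
    beta_true = np.array([0.5, 1.0, -2.0])

    A = rng.normal(size=(n, n))
    Sigma = A @ A.T + n * np.eye(n)          # a known positive definite covariance
    Y = rng.multivariate_normal(X @ beta_true, Sigma)
    Sinv = np.linalg.inv(Sigma)

    def score(beta):
        # u(beta) = X' Sigma^{-1} (Y - X beta), the gradient of the log-likelihood
        return X.T @ Sinv @ (Y - X @ beta)

    def neg2_loglik(beta):
        # -2 * log-likelihood, dropping constants that do not involve beta
        resid = Y - X @ beta
        return resid @ Sinv @ resid

    fit = minimize(neg2_loglik, x0=np.zeros(p),
                   jac=lambda b: -2.0 * score(b), method="BFGS")
    beta_hat_gls = np.linalg.pinv(X.T @ Sinv @ X) @ X.T @ Sinv @ Y

    print("max |numeric MLE - GLS|:", np.max(np.abs(fit.x - beta_hat_gls)))  # ~ 0
    print("score at the MLE:", np.round(score(beta_hat_gls), 8))             # ~ 0 vector
    # Inverse Fisher information; here it equals Var(beta_hat_gls) exactly
    print("[i(beta)]^{-1}:\n", np.linalg.inv(X.T @ Sinv @ X))

Supplying the analytic score as the gradient is just a convenience so that the numerical maximizer converges to essentially the same point as the algebraic GLS solution.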
Result 9.3: If $\hat{\theta}$ is the MLE for $\theta$, then the MLE for $g(\theta)$ is $g(\hat{\theta})$, for any function $g(\cdot)$.
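A small sketch (not from the notes) of the invariance property in Result 9.3, using the normal Gauss-Markov model: the MLE of $\sigma^2$ is $SSE/n$, so the MLE of $\sigma = g(\sigma^2) = \sqrt{\sigma^2}$ should be $\sqrt{SSE/n}$, and a direct numerical maximization over $\sigma$ gives the same answer. The simulated data and variable names are illustrative assumptions.

    # Illustrative sketch: invariance of MLEs (Result 9.3) in the normal
    # Gauss-Markov model.  The MLE of sigma^2 is SSE/n, so the MLE of
    # sigma = g(sigma^2) = sqrt(sigma^2) is sqrt(SSE/n).
    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(2)
    n = 30
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    Y = X @ np.array([1.0, 2.0]) + rng.normal(scale=1.5, size=n)

    b = np.linalg.pinv(X.T @ X) @ X.T @ Y          # OLS = MLE of beta
    sse = float(np.sum((Y - X @ b) ** 2))
    sigma_by_invariance = np.sqrt(sse / n)         # g(MLE of sigma^2)

    def neg2_profile_loglik(sigma):
        # -2 * log-likelihood profiled over beta, up to an additive constant
        return n * np.log(sigma ** 2) + sse / sigma ** 2

    sigma_direct = minimize_scalar(neg2_profile_loglik, bounds=(0.01, 100.0),
                                   method="bounded").x

    print(sigma_by_invariance, sigma_direct)       # agree up to optimizer tolerance

The plug-in value $\sqrt{SSE/n}$ and the direct maximization over $\sigma$ agree up to the optimizer's tolerance, which is exactly what the invariance result asserts.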