Contents

Chapter 1. Preliminaries
  1.1. Conditional Expectation
  1.2. Sufficiency
  1.3. Exponential Families
  1.4. Convex Loss Function
Chapter 2. Unbiasedness
  2.1. UMVU Estimators
  2.2. Non-parametric Families
  2.3. The Information Inequality
  2.4. Multiparameter Case
Chapter 3. Equivariance
  3.1. Equivariance for Location Families
  3.2. The General Equivariant Framework
  3.3. Location-Scale Families
Chapter 4. Average-Risk Optimality
  4.1. Bayes Estimation
  4.2. Minimax Estimation
  4.3. Minimaxity and Admissibility in Exponential Families
  4.4. Shrinkage Estimators
Chapter 5. Large Sample Theory
  5.1. Convergence in Probability and Order in Probability
  5.2. Convergence in Distribution
  5.3. Asymptotic Comparisons (Pitman Efficiency)
  5.4. Comparison of Sample Mean, Median and Trimmed Mean
Chapter 6. Maximum Likelihood Estimation
  6.1. Consistency
  6.2. Asymptotic Normality of the MLE
  6.3. Asymptotic Optimality of the MLE

CHAPTER 1. Preliminaries

1.1. Conditional Expectation

Let $(\mathcal X, \mathcal A, P)$ be a probability space. If $X \in L^1(\mathcal A, P)$ and $\mathcal G$ is a sub-$\sigma$-field of $\mathcal A$, then $E(X \mid \mathcal G)$ is a random variable such that
(i) $E(X \mid \mathcal G) \in \mathcal G$ (i.e. it is $\mathcal G$-measurable);
(ii) $E(I_G X) = E(I_G\,E(X \mid \mathcal G))$ for all $G \in \mathcal G$.

(For $X \ge 0$, $\nu(G) = E(I_G X)$ is a measure on $\mathcal G$ and $P(G) = 0 \Rightarrow \nu(G) = 0$, so by the Radon-Nikodym theorem there exists a $\mathcal G$-measurable function $E(X \mid \mathcal G)$ such that $\nu(G) = \int_G E(X \mid \mathcal G)\,dP$, i.e. (ii) is satisfied. This shows the existence of $E(X^+ \mid \mathcal G)$ and $E(X^- \mid \mathcal G)$. We then define $E(X \mid \mathcal G) = E(X^+ \mid \mathcal G) - E(X^- \mid \mathcal G)$.)

Remark 1.1.1. (ii) generalizes to $E(YX) = E(Y\,E(X \mid \mathcal G))$ for all $Y \in \mathcal G$ such that $E|YX| < \infty$.

The conditional probability of $A$ given $\mathcal G$ is defined for all $A \in \mathcal A$ as $P(A \mid \mathcal G) = E(I_A \mid \mathcal G)$.

Remark 1.1.2.
If $X \in L^2(\mathcal A, P)$, then $E(X \mid \mathcal G)$ is the orthogonal projection in $L^2(\mathcal A, P)$ of $X$ onto the closed linear subspace $L^2(\mathcal G, P)$, since (i) $E(X \mid \mathcal G) \in L^2(\mathcal G, P)$ and (ii) $E(Y(X - E(X \mid \mathcal G))) = 0$ for all $Y \in L^2(\mathcal G, P)$.

Conditioning on a Statistic

Let $X$ be a random variable defined on $(\mathcal X, \mathcal A, P)$ with $E|X| < \infty$, and let $T$ be a measurable function (not necessarily real-valued) from $(\mathcal X, \mathcal A)$ into $(\mathcal T, \mathcal F)$:
$$(\mathcal X, \mathcal A, P) \xrightarrow{\ T\ } (\mathcal T, \mathcal F, P^T).$$
Such a $T$ is called a statistic. The $\sigma$-field of subsets of $\mathcal X$ induced by $T$ is
$$\sigma(T) = \{T^{-1}S : S \in \mathcal F\} = T^{-1}\mathcal F.$$

Definition 1.1.3. $E(X \mid T) \equiv E(X \mid \sigma(T))$.

Recall that a real-valued function $f$ on $\mathcal X$ is $\sigma(T)$-measurable iff $f = g \circ T$ for some $\mathcal F$-measurable $g$ on $\mathcal T$, i.e. $f(x) = g(T(x))$:
$$\mathcal X \xrightarrow{\ T\ } \mathcal T \xrightarrow{\ g\ } \mathbb R.$$
This implies that $E(X \mid T)$ is expressible as $E(X \mid T) = h(T)$ for some $\mathcal F$-measurable function $h$, unique a.e. $P^T$:
$$\mathcal X \xrightarrow{\ T\ } \mathcal T \xrightarrow{\ h\ } \mathbb R.$$

Definition 1.1.4. $E(X \mid t) \equiv h(t)$.

Example 1.1.5. Suppose $(X, T)$ has probability density $p(x, t)$ w.r.t. Lebesgue measure on $\mathbb R^2$ and $E|X| < \infty$. Then $E(X \mid \sigma(T)) = h(T)$, where
$$h(t) = E(X \mid T = t) = \frac{\int x\,p(x,t)\,dx}{\int p(x,t)\,dx}\; I_{\{p^T(t) > 0\}}(t) \quad \text{a.s. } P^T.$$

Proof. (i) The right-hand side is Borel measurable in $t$ (by Fubini). (ii) $G \in \sigma(T) \Rightarrow G = T^{-1}F$ for some $F \in \mathcal F \Rightarrow I_G = I_F(T)$, so
$$E(I_G\,E(X \mid \sigma(T))) = E(I_G X) = \int I_G\,x\,dP = \iint x\,I_F(t)\,p(x,t)\,dx\,dt = \int I_F(t)\,h(t)\,p^T(t)\,dt = E[I_F(T)h(T)] = E[I_G\,h(T)].$$

Properties of Conditional Expectation

If $T$ is a statistic, $X$ is the identity function on $\mathcal X$, and $f_n, f, g$ are integrable, then
(i) $E[af(X) + bg(X) \mid T] = aE[f(X) \mid T] + bE[g(X) \mid T]$ a.s.;
(ii) $a \le f(X) \le b$ a.s. $\Rightarrow a \le E[f(X) \mid T] \le b$ a.s.;
(iii) $|f_n| \le g$, $f_n(x) \to f(x)$ a.s. $\Rightarrow E[f_n(X) \mid T] \to E[f(X) \mid T]$ a.s.;
(iv) $E[E(f(X) \mid T)] = Ef(X)$;
(v) if $E|h(T)f(X)| < \infty$, then $E[h(T)f(X) \mid T] = h(T)\,E[f(X) \mid T]$ a.s.
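The formula of Example 1.1.5 has an exact discrete analogue that is easy to check by hand: with a joint pmf $p(x,t)$, $h(t)$ is the ratio $\sum_x x\,p(x,t) / \sum_x p(x,t)$. The sketch below uses a small hypothetical joint pmf (not from the text) and also verifies the tower property (iv).

```python
# Discrete analogue of Example 1.1.5 with a hypothetical joint pmf p(x, t).
p = {(0, 0): 0.1, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.4}

def h(t):
    # h(t) = E(X | T = t) = sum_x x p(x,t) / sum_x p(x,t)
    num = sum(x * q for (x, s), q in p.items() if s == t)
    den = sum(q for (x, s), q in p.items() if s == t)
    return num / den

# Tower property (iv): E[E(X|T)] = E X.
EX = sum(x * q for (x, t), q in p.items())
E_hT = sum(h(t) * q for (x, t), q in p.items())
assert abs(EX - E_hT) < 1e-9
```

Here $h(0) = 0.2/0.3 = 2/3$ and $h(1) = 0.4/0.7 = 4/7$, and both sides of (iv) equal $0.6$.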
1.2. Sufficiency

Setup:
- $X$: the random observable quantity (the identity function on $(\mathcal X, \mathcal A, P_\theta)$);
- $\mathcal X$: the sample space, the set of possible values of $X$;
- $\mathcal A$: a $\sigma$-algebra of subsets of $\mathcal X$;
- $\mathcal P = \{P_\theta, \theta \in \Omega\}$: a family of probability measures on $\mathcal A$ (the distributions of $X$);
- $T: \mathcal X \to \mathcal T$: an $\mathcal A/\mathcal F$-measurable function; $T(X)$ is called a statistic.

$$\text{probability space } (\mathcal X, \mathcal A, P_\theta) \xrightarrow{\ X\ } \text{sample space } (\mathcal X, \mathcal A, P_\theta) \xrightarrow{\ T\ } (\mathcal T, \mathcal F, P_\theta^T)$$

We adopt this notation because sometimes we wish to talk about $T(X(\cdot))$, the random variable, and sometimes about $T(X(x)) = T(x)$, a particular element of $\mathcal T$. We shall also use the notation $P(A \mid T(x))$ for $P(A \mid T = T(x))$, and $P(A \mid T)$ for the random variable $P(A \mid T(\cdot))$ on $\mathcal X$.

Definition 1.2.1. The statistic $T$ is sufficient for $\theta$ (or for $\mathcal P$) iff the conditional distribution of $X$ given $T = t$ is independent of $\theta$ for all $t$, i.e. there exists an $\mathcal F$-measurable $P(A \mid T = \cdot)$ such that $P_\theta(A \mid T = t) = P(A \mid T = t)$ a.s. $P_\theta^T$ for all $A \in \mathcal A$ and all $\theta \in \Omega$.

Example 1.2.2. $X = (X_1, \ldots, X_n)$ iid with pdf $f_\theta(x)$ w.r.t. $dx$,
$$P_\theta(dx_1, \ldots, dx_n) = f_\theta(x_1)\cdots f_\theta(x_n)\,dx_1\cdots dx_n,$$
and $T(X) = (X_{(1)}, \ldots, X_{(n)})$, where $X_{(i)}$ is the $i$th order statistic. The probability mass function of $X$ given $T = t$ is
$$p_{X \mid T=t}(x \mid t) = \frac{1}{n!}\,I\{x_{(1)} = t_1, \ldots, x_{(n)} = t_n\},$$
i.e. it assigns point mass $1/n!$ to each $x$ such that $x_{(1)} = t_1, \ldots, x_{(n)} = t_n$. This is independent of $\theta$, indicating that $T$ contains all the information about $\theta$ contained in the sample.

The Factorization Criterion

Definition 1.2.3. A family of probability measures $\mathcal P = \{P_\theta : \theta \in \Omega\}$ is equivalent to a probability measure $\lambda$ if $\lambda(A) = 0 \iff P_\theta(A) = 0$ for all $\theta \in \Omega$. We also say that $\mathcal P$ is dominated by a $\sigma$-finite measure $\mu$ on $(\mathcal X, \mathcal A)$ if $P_\theta \ll \mu$ for all $\theta \in \Omega$. It is clear that equivalence to $\lambda$ implies domination by $\lambda$.

Theorem 1.2.4. Let $\mathcal P$ be dominated by the probability measure
$$\lambda = \sum_{i=0}^\infty c_i P_{\theta_i} \qquad (c_i \ge 0,\ \textstyle\sum c_i = 1).$$
Then the statistic $T$ (with range $(\mathcal T, \mathcal F)$) is sufficient for $\mathcal P$ iff there exists an $\mathcal F$-measurable function $g_\theta(\cdot)$ such that
$$dP_\theta(x) = g_\theta(T(x))\,d\lambda(x) \quad \forall \theta \in \Omega.$$
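The computation behind Example 1.2.2 is that an iid joint density $f_\theta(x_1)\cdots f_\theta(x_n)$ is a symmetric function of the sample, so it depends on $x$ only through the order statistics. A quick numeric sketch, using a hypothetical Exponential model (any iid model behaves the same way):

```python
import itertools, math

def joint_density(x, lam):
    # iid Exponential(lam) joint density f(x1)...f(xn); the Exponential
    # choice is only for illustration.
    return math.prod(lam * math.exp(-lam * xi) for xi in x)

x = (0.3, 1.7, 0.9)
vals = [joint_density(perm, 2.0) for perm in itertools.permutations(x)]
# The density is invariant under permutations of the sample,
# i.e. it is a function of sorted(x) alone.
assert all(math.isclose(v, vals[0], rel_tol=1e-12) for v in vals)
```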
Proof. ($\Rightarrow$) Suppose $T$ is sufficient for $\mathcal P$. Then $P_\theta(A \mid T(x)) = P(A \mid T(x))$ for all $\theta$. Throughout this part of the proof $X$ will denote the indicator function of a subset of $\mathcal X$. The preceding equality then implies that $E_\theta(X \mid T) = E(X \mid T)$ for all $X \in \mathcal A$ and all $\theta$. Hence for all $\theta \in \Omega$, $X \in \mathcal A$, $G \in \sigma(T)$,
$$E_\theta(I_G\,E(X \mid T)) = E_\theta(E(I_G X \mid T)) = E_\theta(I_G X).$$
Setting $\theta = \theta_i$, multiplying by $c_i$ and summing over $i = 0, 1, 2, \ldots$ gives
$$E_\lambda(I_G\,E(X \mid T)) = E_\lambda(I_G X) \quad \forall X \in \mathcal A,\ \forall G \in \sigma(T).$$
This implies that $E_\lambda(X \mid T) = E(X \mid T)$ for all $X \in \mathcal A$, and hence
$$E_\theta(X \mid T) = E(X \mid T) = E_\lambda(X \mid T) \quad \forall X \in \mathcal A,\ \forall \theta.$$
Now define $g_\theta(T(\cdot))$ to be the Radon-Nikodym derivative of $P_\theta$ with respect to $\lambda$, with both regarded as measures on $\sigma(T)$. We know this exists since $\lambda$ dominates every $P_\theta$. We also know it is $\sigma(T)$-measurable, so it can be written in the form $g_\theta(T(\cdot))$, and we know that $E_\theta(X) = E_\lambda(g_\theta(T)X)$ for all $X \in \sigma(T)$. We need to establish, however, that this last relation holds for all $X \in \mathcal A$. We do this as follows:
$$X \in \mathcal A \Rightarrow E_\theta(X) = E_\theta(E(X \mid T)) = E_\lambda(g_\theta(T)\,E(X \mid T)) = E_\lambda(E(g_\theta(T)X \mid T)) = E_\lambda(E_\lambda(g_\theta(T)X \mid T)) = E_\lambda(g_\theta(T)X).$$
This shows that $g_\theta(T(x)) = \frac{dP_\theta}{d\lambda}(x)$ when $P_\theta$ and $\lambda$ are regarded as measures on $\mathcal A$.

($\Leftarrow$) Suppose that for each $\theta$, $\frac{dP_\theta}{d\lambda}(x) = g_\theta(T(x))$ for some $g_\theta$. We shall then show that the conditional probability $P_\lambda(A \mid t)$ is a version of $P_\theta(A \mid t)$ for every $\theta$. For $A \in \mathcal A$, $G \in \sigma(T)$,
$$\int_G I_A\,dP_\theta = \int_G P_\theta(A \mid T)\,dP_\theta = \int_G P_\theta(A \mid T)\,g_\theta(T)\,d\lambda,$$
and
$$\int_G I_A\,dP_\theta = \int_G I_A\,g_\theta(T)\,d\lambda = \int_G E_\lambda[I_A\,g_\theta(T) \mid T]\,d\lambda = \int_G E_\lambda[I_A \mid T]\,g_\theta(T)\,d\lambda,$$
so $P_\theta(A \mid T)\,g_\theta(T) = E_\lambda(I_A \mid T)\,g_\theta(T)$ a.s. $\lambda$, and hence a.s. $P_\theta$ for every $\theta$. Also $g_\theta(T) \ne 0$ a.s. $P_\theta$, since $dP_\theta = g_\theta(T)\,d\lambda$. Hence $P_\theta(A \mid T) = E_\lambda(I_A \mid T) = P_\lambda(A \mid T)$ a.s. $P_\theta$, and the right-hand side is independent of $\theta$. $\square$

Theorem 1.2.5. (Theorem 2, Appendix, TSH¹) If $\mathcal P = \{P_\theta, \theta \in \Omega\}$ is dominated by a $\sigma$-finite measure, then it is equivalent to $\lambda = \sum_{i=0}^\infty c_i P_{\theta_i}$ for some countable subcollection $P_{\theta_i} \in \mathcal P$, $i = 0, 1, 2, \ldots$, with $c_i \ge 0$ and $\sum c_i = 1$.

Proof. $\mu$ is $\sigma$-finite, so there exist disjoint sets $A_1, A_2, \ldots \in \mathcal A$ with $\bigcup_i A_i = \mathcal X$ and $0 < \mu(A_i) < \infty$, $i = 1, 2, \ldots$. Set
$$\nu(A) = \sum_{i=1}^\infty \frac{\mu(A \cap A_i)}{2^i\,\mu(A_i)}.$$
Then $\nu$ is a probability measure equivalent to $\mu$.
Hence we can assume without loss of generality that the dominating measure $\mu$ is a probability measure. Let $f_\theta = \frac{dP_\theta}{d\mu}$ and set $S_\theta = \{x : f_\theta(x) > 0\}$. Then

(1.2.1) $P_\theta(A) = P_\theta(A \cap S_\theta)$, and $P_\theta(A \cap S_\theta) = 0$ iff $\mu(A \cap S_\theta) = 0$.

(Since $P_\theta \ll \mu$, and since $\mu(A \cap S_\theta) > 0$ together with $f_\theta > 0$ on $A \cap S_\theta$ implies $P_\theta(A \cap S_\theta) > 0$.)

A set $A \in \mathcal A$ is a kernel if $A \subseteq S_\theta$ for some $\theta$; a finite or countable union of kernels is called a chain. Set
$$\mu^* = \sup_{\text{chains } C} \mu(C).$$
Then $\mu^* = \mu(C)$ for some chain $C = \bigcup_{n=1}^\infty A_n$, $A_n \subseteq S_{\theta_n}$ (since there exist chains $C_n$ with $\mu(C_n) \uparrow \mu^*$, and for $C = \bigcup_n C_n$ we have $\mu(C) = \mu^*$).

It follows from the Lemma below that $\mathcal P$ is dominated by $\lambda(\cdot) = \sum_{n=1}^\infty 2^{-n} P_{\theta_n}(\cdot)$, since
$$\lambda(A) = 0 \Rightarrow P_{\theta_n}(A) = 0\ \forall n \Rightarrow P_\theta(A) = 0\ \forall \theta \quad \text{(by the Lemma)}.$$
Since it is obvious that $P_\theta(A) = 0\ \forall \theta \Rightarrow \lambda(A) = 0$, $\mathcal P$ is in fact equivalent to $\lambda(\cdot) = \sum_{n=1}^\infty 2^{-n} P_{\theta_n}(\cdot)$. $\square$

Lemma 1.2.6. If $\{\theta_n\}$ is the sequence used in the construction of $C$, then $\{P_\theta, \theta \in \Omega\}$ is dominated by $\{P_{\theta_n}, n = 1, 2, \ldots\}$, i.e. $P_{\theta_n}(A) = 0\ \forall n \Rightarrow P_\theta(A) = 0\ \forall \theta$.

Proof.
$$P_{\theta_n}(A) = 0\ \forall n \Rightarrow \mu(A \cap S_{\theta_n}) = 0\ \forall n \quad \text{(by 1.2.1)} \Rightarrow \mu(A \cap C) = 0 \quad (C \subseteq \textstyle\bigcup_n S_{\theta_n}) \Rightarrow P_\theta(A \cap C) = 0\ \forall \theta \quad (P_\theta \ll \mu).$$
If $P_\theta(A) > 0$ for some $\theta$ then, since $P_\theta(A) = P_\theta(A \cap C) + P_\theta(A \cap C^c)$, we have $P_\theta(A \cap C^c) = P_\theta(A \cap C^c \cap S_\theta) > 0$, so $A \cap C^c \cap S_\theta$ is a kernel disjoint from $C$, and $C \cup (A \cap C^c \cap S_\theta)$ is a chain with $\mu$-measure $> \mu^*$ (since $P_\theta(A \cap C^c \cap S_\theta) > 0 \Rightarrow \mu(A \cap C^c \cap S_\theta) > 0$), contradicting the definition of $\mu^*$. Hence $P_\theta(A) = 0$ for all $\theta$. $\square$

¹ TSH stands for Testing Statistical Hypotheses, E.L. Lehmann, Springer Texts in Statistics, 1997.

Theorem 1.2.7. (The Factorization Theorem) Let $\mu$ be a $\sigma$-finite measure which dominates $\mathcal P = \{P_\theta : \theta \in \Omega\}$ and let $p_\theta = \frac{dP_\theta}{d\mu}$. Then the statistic $T$ is sufficient for $\mathcal P$ if and only if there exist a non-negative $\mathcal F$-measurable function $g_\theta : \mathcal T \to \mathbb R$ and an $\mathcal A$-measurable function $h : \mathcal X \to \mathbb R$ such that

(1.2.2) $p_\theta(x) = g_\theta(T(x))\,h(x)$ a.e. $\mu$.

Proof. By Theorem 1.2.5, $\mathcal P$ is equivalent to $\lambda = \sum_i c_i P_{\theta_i}$, where $c_i \ge 0$, $\sum_i c_i = 1$. If $T$ is sufficient for $\mathcal P$,
$$p_\theta(x) = \frac{dP_\theta}{d\mu}(x) = \frac{dP_\theta}{d\lambda}(x)\,\frac{d\lambda}{d\mu}(x) = g_\theta(T(x))\,h(x)$$
by Theorem 1.2.4.
On the other hand, if equation (1.2.2) holds, then

(1.2.3) $\displaystyle\frac{d\lambda}{d\mu}(x) = \sum_i c_i\,p_{\theta_i}(x) = \sum_i c_i\,g_{\theta_i}(T(x))\,h(x) = K(T(x))\,h(x).$

Thus
$$dP_\theta(x) = p_\theta(x)\,d\mu(x) = g_\theta(T(x))\,h(x)\,d\mu(x) = \frac{g_\theta(T(x))}{K(T(x))}\,K(T(x))\,h(x)\,d\mu(x) = \tilde g_\theta(T(x))\,d\lambda(x),$$
where $\tilde g_\theta(T(x)) := g_\theta(T(x))/K(T(x))$, defined to be $0$ when $K(T(x)) = 0$. Hence $T$ is sufficient for $\mathcal P$ by Theorem 1.2.4. $\square$

Remark 1.2.8. If $f_\theta(x)$ is the density of $X$ with respect to Lebesgue measure, then $T$ is sufficient for $\mathcal P$ iff $f_\theta(x) = g_\theta(T(x))\,h(x)$ where $h$ is independent of $\theta$.

Example 1.2.9. Let $X_1, \ldots, X_n$ be iid $N(\xi, \sigma^2)$, $\xi \in \mathbb R$, $\sigma > 0$, and write $X = (X_1, \ldots, X_n)$. A $\sigma$-finite dominating measure on $\mathcal B^n$ is Lebesgue measure, with
$$p_{\xi,\sigma^2}(x) = \frac{1}{(\sqrt{2\pi}\,\sigma)^n}\exp\left(-\frac{1}{2\sigma^2}\sum x_i^2 + \frac{\xi}{\sigma^2}\sum x_i - \frac{n\xi^2}{2\sigma^2}\right) = g_{\xi,\sigma^2}\left(\sum x_i, \sum x_i^2\right).$$
Therefore $T(X) = (\sum X_i, \sum X_i^2)$ is sufficient for $\mathcal P = \{P_{\xi,\sigma^2}\}$.

Remark 1.2.10. $T'(X) = (\bar X, S^2)$, where $S^2 = \sum(X_i - \bar X)^2$, is also sufficient for $\mathcal P = \{P_{\xi,\sigma^2}\}$, since
$$g_{\xi,\sigma^2}\left(\sum x_i, \sum x_i^2\right) = g'_{\xi,\sigma^2}(\bar x, S^2).$$
$T$ and $T'$ are equivalent in the following sense.

Definition 1.2.11. Two statistics $T$ and $S$ are equivalent if they induce the same $\sigma$-algebra up to $\mathcal P$-null sets, i.e. if there exist a $\mathcal P$-null set $N$ and functions $f$ and $g$ such that $T(x) = f(S(x))$ and $S(x) = g(T(x))$ for all $x \in N^c$.

Example 1.2.12. Let $X_1, \ldots, X_n$ be iid $U(0, \theta)$, $\theta > 0$, and $X = (X_1, \ldots, X_n)$. Then
$$p_\theta(x) = \prod_{i=1}^n \frac{1}{\theta}\,I_{[0,\infty)}(x_i)\,I_{(-\infty,\theta]}(x_i) = \frac{1}{\theta^n}\,I_{[0,\infty)}(x_{(1)})\,I_{(-\infty,\theta]}(x_{(n)}) = g_\theta(x_{(n)})\,h(x),$$
so $T(X) = X_{(n)}$ is sufficient for $\theta$.

Example 1.2.13. $X_1, \ldots, X_n$ iid $N(0, \sigma^2)$, $\Omega = \{\sigma^2 : \sigma^2 > 0\}$. Define
$$T_1(X) = (X_1, \ldots, X_n),\quad T_2(X) = (X_1^2, \ldots, X_n^2),\quad T_3(X) = (X_1^2 + \cdots + X_m^2,\ X_{m+1}^2 + \cdots + X_n^2),\quad T_4(X) = X_1^2 + \cdots + X_n^2.$$
Since
$$p_\sigma(x) = \frac{1}{(\sqrt{2\pi}\,\sigma)^n}\exp\left(-\frac{1}{2\sigma^2}\sum X_i^2\right),$$
each $T_i(X)$ is sufficient. However,
$$\sigma(T_4) \subseteq \sigma(T_3) \subseteq \sigma(T_2) \subseteq \sigma(T_1)$$
(since functions of $T_4$ are functions of $T_3$, functions of $T_3$ are functions of $T_2$, and functions of $T_2$ are functions of $T_1$).
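The factorization in Example 1.2.12 can be checked pointwise: the $U(0,\theta)$ joint density equals $g_\theta(x_{(n)})\,h(x)$ with $g_\theta(t) = \theta^{-n}I\{t \le \theta\}$ and $h(x) = I\{\min_i x_i \ge 0\}$. A small numeric sketch:

```python
# Factorization check for iid U(0, theta), Example 1.2.12.
def p(x, theta):
    # joint density: theta**-n on [0, theta]^n, else 0
    return theta ** -len(x) if all(0 <= xi <= theta for xi in x) else 0.0

def g(t, n, theta):
    # g_theta(x_(n)) = theta**-n * 1{x_(n) <= theta}
    return theta ** -n if t <= theta else 0.0

def h(x):
    # h(x) = 1{min x_i >= 0}, free of theta
    return 1.0 if min(x) >= 0 else 0.0

x = [0.2, 1.4, 0.7]
for theta in (1.0, 1.5, 2.0):
    assert p(x, theta) == g(max(x), len(x), theta) * h(x)
```

Note that $h$ does not involve $\theta$, which is exactly what Theorem 1.2.7 requires.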
Remark 1.2.14. If $T$ is sufficient for $\theta$ and $T = H(S)$ where $S$ is some statistic, then $S$ is also sufficient, since
$$p_\theta(x) = g_\theta(T(x))\,h(x) = g_\theta(H(S(x)))\,h(x) = g^*_\theta(S(x))\,h(x).$$
Since $\sigma(T) = S^{-1}H^{-1}\mathcal B_{\mathcal T} \subseteq S^{-1}\mathcal B_{\mathcal S}$ (via $(\mathcal X, \mathcal A) \xrightarrow{\ S\ } (\mathcal S, \mathcal B_{\mathcal S}) \xrightarrow{\ H\ } (\mathcal T, \mathcal B_{\mathcal T})$), $T$ provides a greater reduction of the data than $S$, strictly greater unless $H$ is one-to-one, in which case $S$ and $T$ are equivalent.

Definition 1.2.15. $T$ is a minimal sufficient statistic if for any sufficient statistic $S$ there exists a measurable function $H$ such that $T = H(S)$ a.s. $\mathcal P$.

Theorem 1.2.16. If $\mathcal P$ is dominated by a $\sigma$-finite measure $\mu$, then the statistic $U$ is sufficient iff for every fixed $\theta$ and $\theta'$ the ratio of the densities $p_\theta$ and $p_{\theta'}$ with respect to $\mu$, defined to be $1$ when both densities are zero, satisfies
$$\frac{p_\theta(x)}{p_{\theta'}(x)} = f_{\theta,\theta'}(U(x)) \quad \text{a.s. } \mathcal P$$
for some measurable $f_{\theta,\theta'}$.

Proof. HW problem.

Theorem 1.2.17. Let $\mathcal P$ be a finite family with densities $\{p_0, p_1, \ldots, p_k\}$, all having the same support (i.e. $S = \{x : p_i(x) > 0\}$ is independent of $i$). Then
$$T(x) = \left(\frac{p_1(x)}{p_0(x)}, \frac{p_2(x)}{p_0(x)}, \ldots, \frac{p_k(x)}{p_0(x)}\right)$$
is minimal sufficient. (Also true for a countable collection of densities, with no change in the proof.)

Proof. First, $T$ is sufficient by Theorem 1.2.16, since $p_i(x)/p_j(x)$ is a function of $T(x)$ for all $i$ and $j$ (the common support is needed here). If $U$ is a sufficient statistic then, by Theorem 1.2.16, $p_i(x)/p_0(x)$ is a function of $U$ for each $i$, so $T$ is a function of $U$, and hence $T$ is minimal sufficient. $\square$

Remark 1.2.18. Theorem 1.2.17 extends to uncountable collections under further conditions.

Theorem 1.2.19. Let $\mathcal P$ be a family with common support and suppose $\mathcal P_0 \subseteq \mathcal P$. If $T$ is minimal sufficient for $\mathcal P_0$ and sufficient for $\mathcal P$, then $T$ is minimal sufficient for $\mathcal P$.

Proof. $U$ sufficient for $\mathcal P$ $\Rightarrow$ $U$ sufficient for $\mathcal P_0$, by Definition 1.2.1. $T$ minimal sufficient for $\mathcal P_0$ $\Rightarrow$ $T(x) = H(U(x))$ a.s. $\mathcal P_0$. But since $\mathcal P$ has common support, $T(x) = H(U(x))$ a.s. $\mathcal P$. $\square$

Remark 1.2.20. 1. Minimal sufficient statistics for uncountable families $\mathcal P$ can often be obtained by combining the above theorems.
2. Minimal sufficient statistics exist under weak assumptions (but not always). In particular they exist if $(\mathcal X, \mathcal A) = (\mathbb R^n, \mathcal B^n)$ and $\mathcal P$ is dominated by a $\sigma$-finite measure.

Example 1.2.21. $\mathcal P_0$: $(X_1, \ldots, X_n)$ iid $N(\theta, 1)$, $\theta \in \{\theta_0, \theta_1\}$. $\mathcal P$: $(X_1, \ldots, X_n)$ iid $N(\theta, 1)$, $\theta \in \mathbb R$.
$$\frac{p_1(x)}{p_0(x)} = \exp\left\{-\frac12\left[\sum_i (x_i - \theta_1)^2 - \sum_i (x_i - \theta_0)^2\right]\right\} = \exp\left\{-\frac12\left[2\sum_i x_i(\theta_0 - \theta_1) + n\theta_1^2 - n\theta_0^2\right]\right\}.$$
This is a one-to-one function of $\bar x$, hence $\bar X$ is minimal sufficient for $\mathcal P_0$ by Theorem 1.2.17. Since $\bar X$ is sufficient for $\mathcal P$ (by the factorization theorem), $\bar X$ is minimal sufficient for $\mathcal P$.

Example 1.2.22. $\mathcal P$: $(X_1, \ldots, X_n)$ iid $U(0, \theta)$, $\theta > 0$. Show that $X_{(n)}$ is minimal sufficient. (This is part of problem 1.6.16, for which you will need to use problem 1.6.11.)

Example 1.2.23. (Logistic) $\mathcal P$: $(X_1, \ldots, X_n)$ iid $L(\theta, 1)$, $\theta \in \mathbb R$. $\mathcal P_0$: $(X_1, \ldots, X_n)$ iid $L(\theta, 1)$, $\theta \in \{0, 1, \ldots, n\}$.
$$p_\theta(x) = \frac{\exp[-\sum_i (x_i - \theta)]}{\prod_{i=1}^n \{1 + \exp[-(x_i - \theta)]\}^2},$$
so $T = (T_1(X), \ldots, T_n(X))$ is minimal sufficient for $\mathcal P_0$, where
$$T_i(x) = \frac{p_i(x)}{p_0(x)} = e^{ni}\prod_{j=1}^n \frac{(1 + e^{-x_j})^2}{(1 + e^{-(x_j - i)})^2}.$$
We will show that $T(X)$ is equivalent to $(X_{(1)}, \ldots, X_{(n)})$ by showing that
$$T(x) = T(y) \iff x_{(1)} = y_{(1)}, \ldots, x_{(n)} = y_{(n)}.$$

Proof. ($\Leftarrow$) Obvious from the expression for $T_i(x)$. ($\Rightarrow$) Suppose that $T_i(x) = T_i(y)$ for $i = 1, 2, \ldots, n$, i.e.
$$\prod_{j=1}^n \frac{(1 + e^{-x_j})^2}{(1 + e^{-(x_j - i)})^2} = \prod_{j=1}^n \frac{(1 + e^{-y_j})^2}{(1 + e^{-(y_j - i)})^2}, \quad i = 1, \ldots, n,$$
i.e., taking square roots and writing $u_j = e^{-x_j}$, $v_j = e^{-y_j}$, $\omega_i = e^i$,
$$\prod_{j=1}^n \frac{1 + u_j\omega}{1 + u_j} = \prod_{j=1}^n \frac{1 + v_j\omega}{1 + v_j} \quad \text{at } \omega = \omega_1, \ldots, \omega_n.$$
Here we have two polynomials in $\omega$ of degree $n$ which are equal at $n + 1$ distinct values $1, \omega_1, \ldots, \omega_n$ of $\omega$ (equality at $\omega = 1$ is trivial), and hence for all $\omega$. Setting $\omega = 0$ gives
$$\prod_{j=1}^n (1 + u_j) = \prod_{j=1}^n (1 + v_j) \ \Rightarrow\ \prod_{j=1}^n (1 + u_j\omega) = \prod_{j=1}^n (1 + v_j\omega) \quad \forall \omega,$$
so the zero sets of these two polynomials are the same, the multisets $\{u_j\}$ and $\{v_j\}$ coincide, and $x$ and $y$ have the same order statistics. $\square$

By Theorem 1.2.17, the order statistics are therefore minimal sufficient for $\mathcal P_0$. They are also sufficient for $\mathcal P$, so by Theorem 1.2.19, the order statistics are minimal sufficient for $\mathcal P$. There is not much reduction possible here!
This is fairly typical of location families, the normal, uniform and exponential distributions providing happy exceptions.

Ancillarity

Definition 1.2.24. A statistic $V$ is said to be ancillary for $\mathcal P$ if the distribution $P_\theta^V$ of $V$ does not depend on $\theta$. It is called first-order ancillary if $E_\theta V$ is independent of $\theta$.

Example 1.2.25. In Example 1.2.23, $X_{(2)} - X_{(1)}$ is ancillary, since $Y_1 = X_1 - \theta, \ldots, Y_n = X_n - \theta$ are iid $P_0$ and $X_{(2)} - X_{(1)} = Y_{(2)} - Y_{(1)}$.

Example 1.2.26. $\mathcal P$: $(X_1, \ldots, X_n)$ iid $N(\theta, 1)$, $\theta \in \mathbb R$. $S^2 = \sum(X_i - \bar X)^2$ is ancillary, since $S^2 = \sum(Y_i - \bar Y)^2$, where $Y_i = X_i - \theta$, $i = 1, 2, \ldots$, are iid $N(0, 1)$.

Remark 1.2.27. Ancillary statistics by themselves contain no information about $\theta$; however, minimal sufficient statistics may contain ancillary components. For example, in 1.2.23, $T = (X_{(1)}, \ldots, X_{(n)})$ is equivalent to $T' = (X_{(1)}, X_{(2)} - X_{(1)}, \ldots, X_{(n)} - X_{(1)})$, whose last $n - 1$ components are ancillary. You can't drop them, as $X_{(1)}$ is not even sufficient.

Complete Statistics

A sufficient statistic should bring about the best reduction of the data if it contains as little ancillary material as possible. This suggests requiring that no non-constant function of $T$ be ancillary, or not even first-order ancillary, i.e. that
$$E_\theta f(T) = c \ \text{for all } \theta \in \Omega \Rightarrow f(T) = c \ \text{a.s. } \mathcal P,$$
or equivalently that
$$E_\theta f(T) = 0 \ \text{for all } \theta \in \Omega \Rightarrow f(T) = 0 \ \text{a.s. } \mathcal P.$$

Definition 1.2.28. A statistic $T$ is complete if

(1.2.4) $E_\theta f(T) = 0$ for all $\theta \in \Omega$ $\Rightarrow$ $f(T) = 0$ a.s. $\mathcal P$.

$T$ is said to be boundedly complete if equation (1.2.4) holds for all bounded measurable functions $f$.

Since complete sufficient statistics are intended to give a good reduction of the data, it is not unreasonable to expect them to be minimal. We shall prove a slightly weaker result.

Theorem 1.2.29. Let $U$ be a complete sufficient statistic. If there exists a minimal sufficient statistic, then $U$ is minimal sufficient.

Proof. Let $T$ be a minimal sufficient statistic and let $\phi$ be a bounded measurable function. We will show that $\phi(U) \in \sigma(T)$, i.e. $E(\phi(U) \mid T) = \phi(U)$ a.s.
Now $E(\phi(U) \mid T) = g(U)$ for some measurable $g$, since $T$ is minimal and $U$ is sufficient. Let $h(U) = E(\phi(U) \mid T) - \phi(U)$; then $E_\theta h(U) = 0$ for all $\theta$, so $h(U) = 0$ a.s. $\mathcal P$, since $U$ is complete. Hence $\phi(U) = E(\phi(U) \mid T) \in \sigma(T)$. In particular, $U$-measurable indicator functions are $T$-measurable, i.e. $\sigma(U) \subseteq \sigma(T)$, i.e. $U$ is minimal sufficient. $\square$

Remark 1.2.30. 1. If $\mathcal P$ is dominated by a $\sigma$-finite measure and $(\mathcal X, \mathcal A) = (\mathbb R^n, \mathcal B^n)$, the existence of a minimal sufficient statistic does not need to be assumed.
2. A minimal sufficient statistic is not necessarily complete. See the next example.

Example 1.2.31. $\mathcal P = \{N(\sigma, \sigma^2),\ \sigma > 0\}$,
$$p_\sigma(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{1}{2\sigma^2}(x - \sigma)^2} = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac12\left(\frac{x}{\sigma} - 1\right)^2}.$$
The single observation $X$ is minimal sufficient but not complete, since
$$E_\sigma[I_{(0,\infty)}(X) - \Phi(1)] = P_\sigma(X > 0) - \Phi(1) = 0 \quad \forall \sigma,$$
while $P_\sigma(I_{(0,\infty)}(X) - \Phi(1) = 0) = 0$ for all $\sigma$.

Theorem 1.2.32. (Basu's theorem) If $T$ is complete and sufficient for $\mathcal P$, then any ancillary statistic is independent of $T$.

Proof. If $S$ is ancillary, then $P_\theta(S \in B) = p_B$, independent of $\theta$. Sufficiency of $T$ implies $P(S \in B \mid T) = h(T)$, independent of $\theta$. Then $E_\theta(h(T) - p_B) = 0$ for all $\theta$, so $h(T) = p_B$ a.s. $\mathcal P$ by completeness, and $S$ is independent of $T$. $\square$

1.3. Exponential Families

Definition 1.3.1. A family of probability measures $\{P_\theta : \theta \in \Omega\}$ is said to be an $s$-parameter exponential family if there exists a $\sigma$-finite measure $\mu$ such that
$$p_\theta(x) = \frac{dP_\theta}{d\mu}(x) = \exp\left(\sum_{i=1}^s \eta_i(\theta)\,T_i(x) - B(\theta)\right)h(x),$$
where $\eta_i$, $T_i$ and $B$ are real-valued.

Remark 1.3.2. 1. The $P_\theta$, $\theta \in \Omega$, are equivalent (since $\{x : p_\theta(x) > 0\}$ is independent of $\theta$).
2. The factorization theorem implies that $T = (T_1, \ldots, T_s)$ is sufficient.
3. If we observe $X_1, \ldots, X_n$, iid with marginal distributions $P_\theta$, then $\sum_{j=1}^n T(X_j)$ is sufficient for $\theta$.
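A concrete instance of Definition 1.3.1 is the Binomial family: $\mathrm{Bin}(n, p)$ is a one-parameter exponential family with $T(x) = x$, $\eta(p) = \log\frac{p}{1-p}$, $A(\eta) = n\log(1 + e^\eta)$ and $h(x) = \binom{n}{x}$. The numeric sketch below checks that the exponential-family form reproduces the usual pmf.

```python
import math

# Binomial(n, p) written as exp(eta*x - A(eta)) * h(x):
#   eta = log(p/(1-p)), A(eta) = n*log(1+e^eta), h(x) = C(n, x).
n, p = 7, 0.3
eta = math.log(p / (1 - p))
A = n * math.log(1 + math.exp(eta))

def p_exp_family(x):
    return math.exp(eta * x - A) * math.comb(n, x)

for x in range(n + 1):
    direct = math.comb(n, x) * p**x * (1 - p)**(n - x)
    assert abs(p_exp_family(x) - direct) < 1e-12
# and it is a probability density (sums to 1):
assert abs(sum(p_exp_family(x) for x in range(n + 1)) - 1) < 1e-12
```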
Theorem 1.3.3. If $\{1, \eta_1, \ldots, \eta_s\}$ is LI, then $T = (T_1, \ldots, T_s)$ is minimal sufficient. (Linear independence of $\{1, \eta_1, \ldots, \eta_s\}$ means $c_1\eta_1(\theta) + \cdots + c_s\eta_s(\theta) + d = 0\ \forall \theta \Rightarrow c_1 = \cdots = c_s = d = 0$. Equivalently we can say that $\{\eta_i\}$ is affinely independent, or AI, since otherwise the points $\{(\eta_1(\theta), \ldots, \eta_s(\theta)),\ \theta \in \Omega\}$ lie in a proper affine subspace of $\mathbb R^s$.)

Proof. Fix $\theta_0 \in \Omega$ and consider

(1.3.1) $\displaystyle\frac{dP_\theta}{dP_{\theta_0}}(x) = \frac{p_\theta(x)}{p_{\theta_0}(x)} = \exp\{B(\theta_0) - B(\theta)\}\exp\left\{\sum_{i=1}^s (\eta_i(\theta) - \eta_i(\theta_0))\,T_i(x)\right\}.$

If $\{1, \eta_1, \ldots, \eta_s\}$ is LI then so is $\{1, \eta_1 - \eta_1(\theta_0), \ldots, \eta_s - \eta_s(\theta_0)\}$. Set $S = \{(\eta_1(\theta) - \eta_1(\theta_0), \ldots, \eta_s(\theta) - \eta_s(\theta_0)),\ \theta \in \Omega\} \subseteq \mathbb R^s$. Then $\mathrm{span}(S)$ is a linear subspace of $\mathbb R^s$. If $\dim(\mathrm{span}(S)) < s$, then there exists a non-zero vector $v = (v_1, \ldots, v_s)$ such that
$$v_1(\eta_1(\theta) - \eta_1(\theta_0)) + \cdots + v_s(\eta_s(\theta) - \eta_s(\theta_0)) = 0 \quad \forall \theta,$$
contradicting the linear independence of $\{1, \eta_i - \eta_i(\theta_0)\}$. Hence $\dim(\mathrm{span}(S)) = s$, i.e. there exist $\theta_1, \ldots, \theta_s \in \Omega$ such that

(1.3.2) $\{(\eta_1(\theta_i) - \eta_1(\theta_0), \ldots, \eta_s(\theta_i) - \eta_s(\theta_0)),\ i = 1, \ldots, s\}$ is LI.

From (1.3.1),
$$\sum_{j=1}^s (\eta_j(\theta_i) - \eta_j(\theta_0))\,T_j(x) = \ln\frac{p_{\theta_i}(x)}{p_{\theta_0}(x)} + (B(\theta_i) - B(\theta_0)), \quad i = 1, \ldots, s.$$
Since the matrix $[\eta_j(\theta_i) - \eta_j(\theta_0)]_{i,j=1}^s$ is non-singular, each $T_j(x)$ can be expressed uniquely in terms of $\ln\frac{p_{\theta_i}(x)}{p_{\theta_0}(x)}$, $i = 1, \ldots, s$. But $\left(\frac{p_{\theta_i}(x)}{p_{\theta_0}(x)},\ i = 1, \ldots, s\right)$ is minimal sufficient for $\mathcal P_0 = \{P_{\theta_j},\ j = 0, 1, \ldots, s\}$ by Theorem 1.2.17. Hence $T$ is minimal sufficient by Theorem 1.2.19. $\square$

Example 1.3.4.
$$p_\theta(x) = \frac{1}{\sqrt{2\pi}}\exp\left\{-\frac12 x^2 + \theta x - \frac{\theta^2}{2}\right\}, \qquad \eta_1(\theta) = -\frac12,\ \eta_2(\theta) = \theta.$$
$T(x) = (x^2, x)$ is sufficient but not minimal, since $\eta_1$ is constant and the factor $e^{-x^2/2}$ can be absorbed into $h(x)$: rewriting the model as
$$p_\theta(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\exp\left\{\theta x - \frac{\theta^2}{2}\right\},$$
we see that $T(x) = x$ is minimal sufficient.

Remark 1.3.5. The exponential family can always be rewritten in such a way that the functions $\{T_i\}$ and $\{\eta_i\}$ are AI. If there exist constants $c_1, \ldots, c_s, d$, not all zero, such that $c_1T_1(x) + \cdots + c_sT_s(x) = d$ a.s. $\mathcal P$, then one of the $T_i$'s can be expressed in terms of the others (or is constant). After reducing the number of functions $T_i$ as far as possible, the same can be done with their coefficients until the new functions $\{T_i\}$ and $\{\eta_i\}$ are AI.

Definition 1.3.6. (Order of the exponential family) If the functions $\{T_i,\ i = 1, \ldots, s\}$ on $\mathcal X$ and $\{\eta_i,\ i = 1, \ldots, s\}$ on $\Omega$ are both AI, then $s$ is the order of the exponential family
$$p_\theta(x) = \frac{dP_\theta}{d\mu}(x) = \exp\left(\sum_{i=1}^s \eta_i(\theta)\,T_i(x) - B(\theta)\right)h(x).$$

Proposition 1.3.7. The order is well-defined.

Proof.
We shall show that $s + 1 = \dim(V)$, where $V$ is the set of functions on $\mathcal X$ defined by
$$V = \mathrm{span}\left\{1,\ \ln\frac{dP_\theta}{dP_{\theta_0}}(\cdot),\ \theta \in \Omega\right\}$$
(independent of the dominating measure and of the choice of $\{\eta_i\}$, $\{T_i\}$). Since
$$\ln\frac{dP_\theta}{dP_{\theta_0}}(x) = \sum_{i=1}^s (\eta_i(\theta) - \eta_i(\theta_0))\,T_i(x) + B(\theta_0) - B(\theta),$$
we have $V \subseteq \mathrm{span}\{1, T_i(\cdot),\ i = 1, \ldots, s\}$, so $\dim(V) \le s + 1$. On the other hand, since $\{1, \eta_i,\ i = 1, \ldots, s\}$ is LI, each $T_j(x)$ can be expressed as a linear combination of $1$ and $\ln\frac{dP_{\theta_i}}{dP_{\theta_0}}(x)$, $i = 1, \ldots, s$, as in the proof of the previous theorem, so $\mathrm{span}\{1, T_i(\cdot),\ i = 1, \ldots, s\} \subseteq V$ and $s + 1 \le \dim(V)$. $\square$

Definition 1.3.8. (Canonical form) For any $s$-parameter exponential family (not necessarily of order $s$) we can view the vector $\eta(\theta) = (\eta_1(\theta), \ldots, \eta_s(\theta))'$ as the parameter rather than $\theta$. Then the density with respect to $\mu$ can be rewritten as
$$p(x, \eta) = \exp\left[\sum_{i=1}^s \eta_i\,T_i(x) - A(\eta)\right]h(x), \quad \eta \in \eta(\Omega).$$
Since $p(\cdot, \eta)$ is a probability density with respect to $\mu$,

(1.3.3) $\displaystyle e^{A(\eta)} = \int e^{\sum_1^s \eta_i T_i(x)}\,h(x)\,d\mu(x).$

Definition 1.3.9. (The natural parameter set) This is a possibly larger set than $\{\eta(\theta),\ \theta \in \Omega\}$. It is the set of all $s$-vectors $\eta$ for which, by suitable choice of $A(\eta)$, $p(\cdot, \eta)$ can be a probability density, i.e.
$$N = \left\{\eta = (\eta_1, \ldots, \eta_s) \in \mathbb R^s : \int e^{\sum_1^s \eta_i T_i(x)}\,h(x)\,d\mu(x) < \infty\right\}.$$

Theorem 1.3.10. $N$ is convex.

Proof. Suppose $\eta = (\eta_1, \ldots, \eta_s)$ and $\nu = (\nu_1, \ldots, \nu_s) \in N$ and $0 < p < 1$. By the convexity of the exponential function,
$$\int e^{p\sum_1^s \eta_i T_i(x) + (1-p)\sum_1^s \nu_i T_i(x)}\,h(x)\,d\mu(x) \le p\int e^{\sum_1^s \eta_i T_i(x)}\,h(x)\,d\mu(x) + (1-p)\int e^{\sum_1^s \nu_i T_i(x)}\,h(x)\,d\mu(x) < \infty. \quad \square$$

Theorem 1.3.11. $T = (T_1, \ldots, T_s)$ has density
$$p_\eta(t) = \exp(\eta \cdot t - A(\eta))$$
relative to $\nu = \tilde\mu\,T^{-1}$, where $d\tilde\mu(x) = h(x)\,d\mu(x)$.

Proof. If $f : \mathcal T \to \mathbb R$ is a bounded measurable function,
$$E_\eta f(T) = \int f(T(x))\,e^{\eta \cdot T(x)}\,e^{-A(\eta)}\,d\tilde\mu(x) = \int f(t)\,e^{\eta \cdot t}\,e^{-A(\eta)}\,d\tilde\mu\,T^{-1}(t). \quad \square$$

Definition 1.3.12. The family of densities
$$p_\eta(t) = \exp(\eta \cdot t - A(\eta)), \quad \eta \in \eta(\Omega),$$
is called an $s$-dimensional (or $s$-parameter) standard exponential family. (It is defined on $\mathbb R^s$, not on $\mathcal X$.)
Theorem 1.3.13. Let $\{p_\theta(x)\}$ be the $s$-parameter exponential family
$$p_\theta(x) = \exp\left(\sum_{i=1}^s \eta_i(\theta)\,T_i(x) - B(\theta)\right)h(x), \quad \theta \in \Omega,$$
and suppose that

(1.3.4) $\displaystyle\int \phi(x)\,e^{\sum_1^s \zeta_j T_j(x)}\,d\mu(x)$

exists and is finite for all $\zeta_j = a_j + ib_j$ such that $a = (a_1, \ldots, a_s) \in N$ (the natural parameter space). Then
(i) $\int \phi(x)\,e^{\sum_1^s \zeta_j T_j(x)}\,d\mu(x)$ is an analytic function of each $\zeta_i$ on $\{\zeta : \mathrm{Re}(\zeta) \in \mathrm{int}(N)\}$, and
(ii) the derivatives of all orders with respect to the $\zeta_i$'s of $\int \phi(x)\,e^{\sum_1^s \zeta_j T_j(x)}\,d\mu(x)$ can be computed by differentiating under the integral sign.

Proof. Let $a^0 = (a_1^0, \ldots, a_s^0)$ be in $\mathrm{int}(N)$ and let $\zeta_1^0 = a_1^0 + ib_1^0$. Write
$$\phi(x)\,e^{\sum_2^s \zeta_j T_j(x)} = h_1(x) - h_2(x) + i(h_3(x) - h_4(x)),$$
where $h_1$ and $h_2$ are the positive and negative parts of the real part and $h_3$ and $h_4$ are the positive and negative parts of the imaginary part. Then $\int \phi(x)\,e^{\sum_1^s \zeta_j T_j(x)}\,d\mu(x)$ can be expressed as
$$\int e^{\zeta_1 T_1(x)}\,d\mu_1(x) - \int e^{\zeta_1 T_1(x)}\,d\mu_2(x) + i\int e^{\zeta_1 T_1(x)}\,d\mu_3(x) - i\int e^{\zeta_1 T_1(x)}\,d\mu_4(x),$$
where $d\mu_i(x) = h_i(x)\,d\mu(x)$, $i = 1, \ldots, 4$. Hence it suffices to prove (i) and (ii) for
$$\psi(\zeta_1) = \int e^{\zeta_1 T_1(x)}\,d\mu(x).$$
Since $a^0 \in \mathrm{int}(N)$, there exists $\delta > 0$ such that $\psi(\zeta_1)$ exists and is finite for all $\zeta_1$ with $|a_1 - a_1^0| < \delta$. Now consider the difference quotient

(*) $\displaystyle\frac{\psi(\zeta_1) - \psi(\zeta_1^0)}{\zeta_1 - \zeta_1^0} = \int e^{\zeta_1^0 T_1(x)}\,\frac{e^{(\zeta_1 - \zeta_1^0)T_1(x)} - 1}{\zeta_1 - \zeta_1^0}\,d\mu(x), \quad |\zeta_1 - \zeta_1^0| < \delta/2.$

Observe that
$$|e^{zt} - 1| = \left|\sum_{j=1}^\infty \frac{(zt)^j}{j!}\right| \le \sum_{j=1}^\infty \frac{|zt|^j}{j!} = e^{|zt|} - 1 \le |zt|\,e^{|zt|} \ \Rightarrow\ \left|\frac{e^{zt} - 1}{z}\right| \le |t|\,e^{|zt|}.$$
The integrand in (*) is therefore bounded in absolute value by $|T_1(x)|\,e^{a_1^0 T_1(x) + \frac{\delta}{2}|T_1(x)|}$, where $a_1^0 = \mathrm{Re}(\zeta_1^0)$, and
$$\int |T_1(x)|\,e^{a_1^0 T_1(x) + \frac{\delta}{2}|T_1(x)|}\,d\mu(x) < \infty,$$
since
$$|T_1|\,e^{a_1^0 T_1 + \frac{\delta}{2}|T_1|} = \begin{cases} \underbrace{|T_1|\,e^{-\frac{\delta}{4}T_1}}_{\text{bounded}}\ \underbrace{e^{(a_1^0 + \frac{3\delta}{4})T_1}}_{\text{integrable}} & \text{if } T_1 > 0,\\[2ex] \underbrace{|T_1|\,e^{\frac{\delta}{4}T_1}}_{\text{bounded}}\ \underbrace{e^{(a_1^0 - \frac{3\delta}{4})T_1}}_{\text{integrable}} & \text{if } T_1 < 0,\end{cases}$$
and the bound is independent of $\zeta_1$. Letting $\zeta_1 \to \zeta_1^0$ in (*) and using the dominated convergence theorem therefore gives

(1.3.5) $\displaystyle\psi'(\zeta_1^0) = \int T_1(x)\,e^{\zeta_1^0 T_1(x)}\,d\mu(x),$

where the integral exists and is finite for every $\zeta_1^0$ which is the first component of some $\zeta^0$ with $\mathrm{Re}(\zeta^0) \in \mathrm{int}(N)$. Applying to (1.3.5) the same argument which we applied to (1.3.4) establishes the existence of derivatives of all orders, proving (i) and (ii). $\square$
Theorem 1.3.14. For an exponential family of order $s$ in canonical form and $\eta \in \mathrm{int}(N)$, where $N$ is the natural parameter space,
(i) $E_\eta(T) = \left(\dfrac{\partial A}{\partial\eta_1}, \ldots, \dfrac{\partial A}{\partial\eta_s}\right)$, and
(ii) $\mathrm{Cov}_\eta(T) = \left[\dfrac{\partial^2 A}{\partial\eta_i\,\partial\eta_j}\right]_{i,j=1}^s$.

Proof. From Theorem 1.3.11,
$$e^{A(\eta)} = \int e^{\eta \cdot t}\,\nu(dt) = \int e^{\eta \cdot T(x)}\,h(x)\,\mu(dx),$$
so
(i) $\dfrac{\partial A}{\partial\eta_i}\,e^{A(\eta)} = \displaystyle\int T_i(x)\,e^{\eta \cdot T(x)}\,h(x)\,\mu(dx)$, whence $E_\eta T_i = \dfrac{\partial A}{\partial\eta_i}$;
(ii) $\dfrac{\partial^2 A}{\partial\eta_i\,\partial\eta_j}\,e^{A(\eta)} + \dfrac{\partial A}{\partial\eta_i}\,\dfrac{\partial A}{\partial\eta_j}\,e^{A(\eta)} = \displaystyle\int T_i(x)\,T_j(x)\,e^{\eta \cdot T(x)}\,h(x)\,\mu(dx)$, i.e.
$$\frac{\partial^2 A}{\partial\eta_i\,\partial\eta_j} = E_\eta(T_iT_j) - E_\eta(T_i)\,E_\eta(T_j) = \mathrm{Cov}_\eta(T_i, T_j). \quad \square$$

Higher-order moments of $T_1, \ldots, T_s$ are frequently required, e.g.
$$\alpha_{r_1\cdots r_s} = E(T_1^{r_1}\cdots T_s^{r_s}), \qquad \mu_{r_1\cdots r_s} = E[(T_1 - ET_1)^{r_1}\cdots(T_s - ET_s)^{r_s}],$$
etc. These can often be obtained readily from the MGF,
$$M_T(u_1, \ldots, u_s) := E(e^{u_1T_1 + \cdots + u_sT_s}).$$
If $M_T$ exists in some neighborhood of $0$ ($\sum u_i^2 < \delta$), then all the moments $\alpha_{r_1\cdots r_s}$ exist and are the coefficients in the power series expansion
$$M_T(u_1, \ldots, u_s) = \sum_{r_1,\ldots,r_s} \alpha_{r_1\cdots r_s}\,\frac{u_1^{r_1}}{r_1!}\cdots\frac{u_s^{r_s}}{r_s!}.$$
The cumulant generating function (CGF) is sometimes more convenient for calculations, especially in connection with sums of independent random vectors. The CGF is defined as
$$K_T(u_1, \ldots, u_s) := \log M_T(u_1, \ldots, u_s).$$
If $M_T$ exists in a neighborhood of $0$, then so does $K_T$, and
$$K_T(u_1, \ldots, u_s) = \sum_{r_1,\ldots,r_s=0}^\infty K_{r_1\cdots r_s}\,\frac{u_1^{r_1}}{r_1!}\cdots\frac{u_s^{r_s}}{r_s!},$$
where the coefficients $K_{r_1\cdots r_s}$ are called the cumulants of $T$. The moments and cumulants can be found from each other by formal comparison of the two series.

Theorem 1.3.15. If $X$ has the density
$$p_\eta(x) = \exp\left[\sum_{i=1}^s \eta_i\,T_i(x) - A(\eta)\right]h(x)$$
w.r.t. some $\sigma$-finite measure $\mu$, then for any $\eta \in \mathrm{int}(N)$ the MGF and CGF of $T$ exist in a neighborhood of $0$ and
$$M_T(u) = e^{A(\eta + u) - A(\eta)}, \qquad K_T(u) = A(\eta + u) - A(\eta).$$

Proof. Problem 3.4.

Summary on Exponential Families. The family of probability measures $\{P_\theta\}$ with densities relative to some $\sigma$-finite measure $\mu$,

(1.3.6) $\displaystyle p_\theta(x) = \frac{dP_\theta}{d\mu}(x) = \exp\left\{\sum_1^s \eta_i(\theta)\,T_i(x) - B(\theta)\right\}h(x), \quad \theta \in \Omega,$

is an $s$-parameter exponential family. By redefining the functions $T_i(\cdot)$ and $\eta_i(\cdot)$ if necessary, we can always arrange for both sets of functions to be affinely independent. The number of summands in the exponent is then the order of the exponential family.
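Theorem 1.3.14 can be checked numerically for the Poisson family, which in canonical form has $p_\eta(x) = \exp(\eta x - e^\eta)/x!$ with $A(\eta) = e^\eta$ and $\lambda = e^\eta$: both $A'(\eta)$ and $A''(\eta)$ should equal $\lambda$, the Poisson mean and variance. The sketch compares finite-difference derivatives of $A$ with moments computed directly from the (truncated) pmf.

```python
import math

# Poisson in canonical form: p_eta(x) = exp(eta*x - A(eta)) / x!,
# A(eta) = e^eta, lambda = e^eta.  Theorem 1.3.14: E T = A', Var T = A''.
eta = 0.7
A = math.exp                     # A(eta) = e^eta
step = 1e-5
A1 = (A(eta + step) - A(eta - step)) / (2 * step)            # ~ A'(eta)
A2 = (A(eta + step) - 2 * A(eta) + A(eta - step)) / step**2  # ~ A''(eta)

lam = math.exp(eta)
pmf = [math.exp(eta * x - lam) / math.factorial(x) for x in range(60)]
mean = sum(x * q for x, q in enumerate(pmf))
var = sum((x - mean) ** 2 * q for x, q in enumerate(pmf))
assert abs(A1 - mean) < 1e-4 and abs(A2 - var) < 1e-3
```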
If $\{1, \eta_1, \ldots, \eta_s\}$ and $\{1, T_1, \ldots, T_s\}$ are both LI, then the family is said to be minimal, and
$$s = \dim\left(\mathrm{span}\left\{1,\ \log\frac{dp_\theta}{dp_{\theta_0}}(\cdot),\ \theta \in \Omega\right\}\right) - 1 = \text{order of the exponential family}.$$

Remark 1.3.16. Since (1.3.6) is by definition a probability density w.r.t. $\mu$ for each $\theta \in \Omega$, we have
$$\int \exp\left\{\sum \eta_i(\theta)\,T_i(x) - B(\theta)\right\}h(x)\,\mu(dx) = 1 \ \Rightarrow\ \exp B(\theta) = \int \exp\left\{\sum \eta_i(\theta)\,T_i(x)\right\}h(x)\,\mu(dx),$$
which shows that the dependence of $B$ on $\theta$ is through $\eta(\theta) = (\eta_1(\theta), \ldots, \eta_s(\theta))$ only, i.e. $B(\theta) = A(\eta(\theta))$.

Remark 1.3.17. The previous remark implies that each member of the family (1.3.6) is a member of the family

(1.3.7) $\displaystyle p_\eta(x) = \exp\left\{\sum_1^s \eta_i\,T_i(x) - A(\eta)\right\}h(x), \quad \eta = (\eta_1, \ldots, \eta_s) \in \eta(\Omega)$

(in fact $p_\theta(x) = p_{\eta(\theta)}(x)$). The family of densities $\{p_\eta,\ \eta \in \eta(\Omega)\}$ defined by (1.3.7) is the canonical family associated with (1.3.6). It is the same family, parameterized by the natural parameter $\eta$ = the vector of coefficients of $T_i(x)$, $i = 1, \ldots, s$.

Remark 1.3.18. Instead of restricting $\eta$ to the set $\eta(\Omega)$, it is natural to extend the family (1.3.7) to allow all $\eta \in \mathbb R^s$ for which we can choose a value of $A(\eta)$ to make (1.3.7) a probability density, i.e. for which

(1.3.8) $\displaystyle\int \exp\left\{\sum \eta_i\,T_i(x)\right\}h(x)\,\mu(dx) < \infty.$

$N = \{\eta \in \mathbb R^s : (1.3.8)\ \text{holds}\}$ is the natural parameter space of the family (1.3.7).

Remark 1.3.19. $N \supseteq \eta(\Omega)$, since (1.3.7) is by definition a family of probability densities.

Definition 1.3.20. (Full-rank family) As with the original parameterization, we can always redefine $\eta$ to ensure that $\{T_1, \ldots, T_s\}$ is AI. If $\eta(\Omega)$ contains an $s$-dimensional rectangle and $\{T_1(\cdot), \ldots, T_s(\cdot)\}$ is AI, then $T$ is minimal sufficient and we say the family (1.3.7) is of full rank. (A full-rank family is clearly minimal.)

Remark 1.3.21. Since $N \supseteq \eta(\Omega)$, full rank implies $\mathrm{int}(N) \ne \emptyset$, and this is important in view of the consequence of Theorem 1.3.13 that
$$e^{A(\eta)} = \int \exp\left(\sum_{i=1}^s \eta_i\,T_i(x)\right)h(x)\,\mu(dx)$$
is analytic in each $\eta_i$ on the set of $s$-dimensional complex vectors $\zeta$ with $\mathrm{Re}(\zeta) \in \mathrm{int}(N)$.
(So derivatives of $e^{A(\eta)}$ w.r.t. $\eta_i$, $i = 1, \ldots, s$, of all orders can be obtained by differentiation under the integral, yielding explicit expressions for the moments of $T$ for all values of the canonical parameter vector $\eta \in \mathrm{int}(N)$.)

Example 1.3.22. (Multinomial) $X \sim M(\theta_0, \ldots, \theta_s; n)$, $X = (X_0, \ldots, X_s)$, where $X_i$ is the number of outcomes of type $i$ in $n$ independent trials and $\theta_i$, $i = 0, \ldots, s$, is the probability of an outcome of type $i$ on any one trial.
$$\Omega = \{\theta : \theta_0 \ge 0, \ldots, \theta_s \ge 0,\ \theta_0 + \cdots + \theta_s = 1\}.$$

(1) Probability density with respect to counting measure on $\mathbb Z_+^{s+1}$:
$$p_\theta(x) = \frac{n!}{x_0!\cdots x_s!}\,\theta_0^{x_0}\cdots\theta_s^{x_s}\,\prod_{i=0}^s I_{[0,n]}(x_i)\,I_{\{n\}}\left(\sum x_i\right) = \exp\left\{\sum_{i=0}^s x_i\log\theta_i\right\}h(x), \quad \theta \in \Omega.$$
This is an $(s+1)$-parameter exponential family with $T_i(x) = x_i$, $\eta_i(\theta) = \log\theta_i$. The vectors $\eta(\theta)$, $\theta \in \Omega$, are not confined to a proper affine subspace of $\mathbb R^{s+1}$, so $T$ is minimal sufficient.

(2) $\{T_0, \ldots, T_s\}$ is not AI, since $T_0 + \cdots + T_s = n$. Setting $T_0(x) = x_0 = n - x_1 - \cdots - x_s$ gives
$$p_\theta(x) = h(x)\exp\left\{n\log\theta_0 + \sum_{i=1}^s x_i\log\frac{\theta_i}{\theta_0}\right\}.$$
Redefining $\eta(\theta) = \left(\log\frac{\theta_1}{\theta_0}, \ldots, \log\frac{\theta_s}{\theta_0}\right)$, we now have an $s$-parameter representation in which $\{T_1, \ldots, T_s\}$ is AI, since the vectors $(x_1, \ldots, x_s)$, $x \in \mathcal X$, are subject only to the constraints $x_i \ge 0$ and $\sum_{i=1}^s x_i \le n$.

(3) Furthermore, the new parameter vectors $\eta(\theta) = \left(\log\frac{\theta_1}{\theta_0}, \ldots, \log\frac{\theta_s}{\theta_0}\right)$, $\theta \in \Omega$, are not confined to any proper affine subspace of $\mathbb R^s$, since for any $x \in \mathbb R^s$ there exist $\theta_0, \ldots, \theta_s$ such that $\eta(\theta) = x$, and so $\eta(\Omega) = \mathbb R^s$. Hence $T(x) = (x_1, \ldots, x_s)$ is minimal sufficient for $\mathcal P$, and the order of the family is $s$.

(4) The canonical representation of the family (2) is
$$p_\eta(x) = \exp\left\{\sum_1^s \eta_i\,x_i - A(\eta)\right\}h(x), \quad \eta \in \eta(\Omega) = \left\{\left(\log\frac{\theta_1}{\theta_0}, \ldots, \log\frac{\theta_s}{\theta_0}\right) : \theta \in \Omega\right\}.$$
We know from Remark 1.3.16 that $B(\theta) = A(\eta(\theta))$ for some function $A(\cdot)$. Although it is not necessary, we can verify this directly in this example, since from the representation (2) we have $B(\theta) = -n\log\theta_0$, and
$$\theta_0 = 1 - \theta_1 - \cdots - \theta_s \ \Rightarrow\ \frac{1}{\theta_0} = 1 + \frac{\theta_1}{\theta_0} + \cdots + \frac{\theta_s}{\theta_0} = 1 + e^{\eta_1(\theta)} + \cdots + e^{\eta_s(\theta)}$$
$$\Rightarrow\ B(\theta) = n\log\left(1 + e^{\eta_1(\theta)} + \cdots + e^{\eta_s(\theta)}\right) \ \Rightarrow\ A(\eta) = n\log\left(1 + e^{\eta_1} + \cdots + e^{\eta_s}\right).$$
$A(\eta)$ is of course also determined by $e^{A(\eta)} = \int \exp\left\{\sum_1^s \eta_i\,x_i\right\}h(x)\,d\mu(x)$.
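The identity in (4) can be verified numerically: with $\eta_i = \log(\theta_i/\theta_0)$ and $A(\eta) = n\log(1 + e^{\eta_1} + \cdots + e^{\eta_s})$, the canonical density $\exp\{\sum \eta_i x_i - A(\eta)\}h(x)$ should recover the multinomial pmf exactly. A sketch for $s = 2$:

```python
import math
from itertools import product

# Multinomial with s = 2 (three categories); theta is a hypothetical choice.
n, theta = 5, (0.5, 0.3, 0.2)
eta = [math.log(t / theta[0]) for t in theta[1:]]
A = n * math.log(1 + sum(math.exp(e) for e in eta))

total = 0.0
for x1, x2 in product(range(n + 1), repeat=2):
    if x1 + x2 > n:
        continue
    x0 = n - x1 - x2
    h = math.factorial(n) // (math.factorial(x0) * math.factorial(x1) * math.factorial(x2))
    p_canonical = math.exp(eta[0] * x1 + eta[1] * x2 - A) * h
    p_direct = h * theta[0]**x0 * theta[1]**x1 * theta[2]**x2
    assert abs(p_canonical - p_direct) < 1e-12
    total += p_canonical
assert abs(total - 1) < 1e-12   # A(eta) normalizes the family
```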
(5) The natural parameter space in this case is $N = \mathbb R^s$, since we know that $N \supseteq \eta(\Omega)$ and $\eta(\Omega) = \mathbb R^s$ by (3) above. Clearly $N$ contains an $s$-dimensional rectangle and $\{T_1, \ldots, T_s\}$ is AI, hence $\{p_\eta(x),\ \eta \in N\}$ is of full rank.

(6) Moments of $T(X) = (X_1, \ldots, X_s)$. Theorem 1.3.14 gives, for all $\eta \in \mathbb R^s$,
$$E_\eta T_i = \frac{\partial A}{\partial\eta_i} = \frac{ne^{\eta_i}}{1 + e^{\eta_1} + \cdots + e^{\eta_s}} = \frac{n\,\theta_i/\theta_0}{1 + \theta_1/\theta_0 + \cdots + \theta_s/\theta_0} = n\theta_i,$$
and
$$\mathrm{Cov}(T_i, T_j) = \frac{\partial^2 A}{\partial\eta_i\,\partial\eta_j} = \begin{cases} \dfrac{-n\,e^{\eta_i}e^{\eta_j}}{(1 + e^{\eta_1} + \cdots + e^{\eta_s})^2} = -n\theta_i\theta_j, & i \ne j,\\[2ex] \dfrac{n\,e^{\eta_i}}{1 + \cdots + e^{\eta_s}} - \dfrac{n\,e^{2\eta_i}}{(1 + \cdots + e^{\eta_s})^2} = n\theta_i(1 - \theta_i), & i = j.\end{cases}$$
(The moments exist for all $\eta \in \mathrm{int}(N) = \mathbb R^s$.)

Theorem 1.3.23. (Sufficient condition for completeness of $T$) If
$$p_\eta(x) = \exp\left(\sum_{i=1}^s \eta_i\,T_i(x) - A(\eta)\right)h(x), \quad \eta \in \eta(\Omega),$$
is a minimal canonical representation of the exponential family $\mathcal P = \{p_\theta : \theta \in \Omega\}$ and $\eta(\Omega)$ contains an open subset of $\mathbb R^s$, then $T = (T_1, \ldots, T_s)$ is complete for $\mathcal P$.

Proof. Suppose $E_\eta(f(T)) = 0$ for all $\eta \in \eta(\Omega)$. Then

(1.3.9) $E_\eta f^+(T) = E_\eta f^-(T) \quad \forall \eta \in \eta(\Omega).$

Choose $\eta^0 \in \mathrm{int}(\eta(\Omega))$ and $r > 0$ such that $N(\eta^0, r) := \{\eta : \|\eta - \eta^0\| < r\} \subseteq \eta(\Omega)$. Now define the probability measures
$$\lambda^+(A) = \frac{\int_A f^+\,e^{\eta^0 \cdot t}\,\nu(dt)}{\int f^+\,e^{\eta^0 \cdot t}\,\nu(dt)}, \qquad \lambda^-(A) = \frac{\int_A f^-\,e^{\eta^0 \cdot t}\,\nu(dt)}{\int f^-\,e^{\eta^0 \cdot t}\,\nu(dt)},$$
where $\nu = \tilde\mu\,T^{-1}$, $d\tilde\mu(x) = h(x)\,d\mu(x)$, and where we have assumed that $\nu(\{t : f(t) \ne 0\}) > 0$, since otherwise $f = 0$ a.s. $\mathcal P^T$ and we are done. Observe now that

(1.3.10) $\displaystyle\int e^{\delta \cdot t}\,\lambda^+(dt) = \int e^{\delta \cdot t}\,\lambda^-(dt) \quad \forall \delta \in \mathbb R^s \text{ with } \|\delta\| < r,$

since by (1.3.9), applied at $\eta^0 + \delta$ (numerators) and at $\eta^0$ (denominators),
$$\text{L.S.} = \frac{\int f^+(t)\,e^{(\eta^0 + \delta)\cdot t}\,\nu(dt)}{\int f^+(t)\,e^{\eta^0 \cdot t}\,\nu(dt)} = \frac{\int f^-(t)\,e^{(\eta^0 + \delta)\cdot t}\,\nu(dt)}{\int f^-(t)\,e^{\eta^0 \cdot t}\,\nu(dt)} = \text{R.S.}$$
Now consider each side of (1.3.10) as a function of the complex argument $\zeta = \delta + i\beta$, $\beta \in \mathbb R^s$. Then $L(\zeta) = R(\zeta)$ for all $\zeta = \delta + i\beta$ with $\|\delta\| < r$, since (by Theorem 1.3.13(i)) both sides are analytic in each component of $\zeta$ on the set where $\mathrm{Re}(\eta^0 + \zeta) \in \mathrm{int}(N)$, and they are equal when $\zeta$ is real. In particular,
$$L(i\beta) = \int e^{i\beta \cdot t}\,\lambda^+(dt) = R(i\beta) = \int e^{i\beta \cdot t}\,\lambda^-(dt)$$
for all $\beta \in \mathbb R^s$. Hence $\lambda^+$ and $\lambda^-$ have the same characteristic function, so $\lambda^+ = \lambda^-$, so $f^+ = f^-$ a.s. $\nu$, contradicting $\nu(f \ne 0) > 0$. So $f = 0$ a.s. $\nu$. $\square$
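Theorem 1.3.23 can be made concrete for $\mathrm{Bin}(n, p)$, a full-rank one-parameter exponential family: $E_p f(T) = \sum_k f(k)\binom{n}{k}p^k(1-p)^{n-k}$ is a polynomial of degree $\le n$ in $p$, so vanishing at $n + 1$ distinct values of $p$ forces $f(0) = \cdots = f(n) = 0$, i.e. $T$ is complete. The sketch checks that the Bernstein-basis evaluation matrix is non-singular, so $M f = 0$ has only the zero solution.

```python
from math import comb

n = 3
ps = [0.1, 0.3, 0.5, 0.7]               # n + 1 distinct parameter values
# Row i: the weights of f(0), ..., f(n) in E_{p_i} f(T).
M = [[comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)] for p in ps]

def det(m):
    # cofactor expansion along the first row (fine for small matrices)
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j+1:] for row in m[1:]])
               for j in range(len(m)))

# Non-singular => E_p f(T) = 0 at these n+1 values of p forces f = 0.
assert abs(det(M)) > 1e-9
```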
$X_1,\dots,X_n$ iid $N(\sigma,\sigma^2)$:
$$p_\sigma(x) = \frac{1}{(\sqrt{2\pi}\,\sigma)^n}\exp\Big\{-\frac{1}{2\sigma^2}\sum x_i^2 + \frac{1}{\sigma}\sum x_i - \frac{n}{2}\Big\},$$
$$\eta_1(\sigma) = \frac{1}{2\sigma^2},\quad T_1(x) = -\sum x_i^2; \qquad \eta_2(\sigma) = \frac{1}{\sigma},\quad T_2(x) = \sum x_i.$$
$\eta(\Omega)$ does not contain a 2-dimensional rectangle in $\mathbb{R}^2$. $T(x) = (\sum x_i^2, \sum x_i)$ is not complete, since
$$E_\sigma\Big(\sum X_i^2 - \frac{2}{n+1}\Big(\sum X_i\Big)^2\Big) = n(2\sigma^2) - \frac{2}{n+1}(n\sigma^2 + n^2\sigma^2) = 0,$$
but there exists no $\mathcal{P}$-null set $N$ such that $\sum x_i^2 - \frac{2}{n+1}(\sum x_i)^2 = 0$ on $N^c$.

1.4. Convex Loss Functions

Lemma 1.4.1. Let $\rho$ be a convex function on $(-\infty,\infty)$ which is bounded below, and suppose that $\rho$ is not monotone. Then $\rho$ takes on its minimum value $c$, $\rho^{-1}(\{c\})$ is a closed interval, and it is a singleton when $\rho$ is strictly convex.

Proof. Since $\rho$ is convex and not monotone, $\lim_{x\to\pm\infty}\rho(x) = \infty$. Since $\rho$ is continuous, it attains its minimum value $c$. $\rho^{-1}(\{c\})$ is closed by continuity and an interval by convexity. The interval must have zero length if $\rho$ is strictly convex. $\square$

Theorem 1.4.2. Let $\rho$ be a convex function defined on $(-\infty,\infty)$ and $X$ a random variable such that $\phi(a) = E(\rho(X-a))$ is finite for some $a$. If $\rho$ is not monotone, then $\phi$ takes on its minimum value, the set of minimizers is a closed interval, and it is a singleton when $\rho$ is strictly convex.

Proof. By the lemma we only need to show that $\phi$ is convex and not monotone. Because $\lim_{t\to\pm\infty}\rho(t) = \infty$ and $\lim_{a\to\mp\infty}(x-a) = \pm\infty$, we have $\lim_{a\to\pm\infty}\phi(a) = \infty$, so that $\phi$ is not monotone. The convexity comes from
$$\phi(pa + (1-p)b) = E\,\rho\big(p(X-a) + (1-p)(X-b)\big) \le E\big(p\rho(X-a) + (1-p)\rho(X-b)\big) = p\phi(a) + (1-p)\phi(b). \ \square$$

CHAPTER 2

Unbiasedness

2.1. UMVU estimators.

Notation. $\mathcal{P} = \{P_\theta : \theta\in\Omega\}$ is a family of probability measures on $\mathcal{A}$ (the possible distributions of $X$). $T : \mathcal{X}\to\mathbb{R}$ is an $\mathcal{A}/\mathcal{B}$-measurable function, and $T$ (or $T(X)$) is called a statistic. $g : \Omega\to\mathbb{R}$ is a function on $\Omega$ whose value at $\theta$ is to be estimated.

Definition 2.1.1. A statistic $T$ (or $T(X)$) is called an unbiased estimator of $g(\theta)$ if $E_\theta(T(X)) = g(\theta)$ for all $\theta\in\Omega$.

Objectives of point estimation. In order to specify what we mean by a good estimator of $g(\theta)$, we need to specify what we mean when we say that $T(X)$ is close to $g(\theta)$.
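One standard way to make "closeness" precise, developed in the paragraphs that follow, is average squared error. As a quick illustrative sketch (the shrinkage comparator below is a hypothetical example, not an estimator from these notes), two estimators of a Bernoulli success probability can be compared by their exact mean-squared errors:

```python
# Sketch: exact mean-squared error of two estimators of Bernoulli p.
# The shrinkage estimator (S+1)/(n+2), S = number of successes, is a
# hypothetical comparator chosen only to illustrate risk comparison.

def mse_mean(p, n):
    # the sample mean is unbiased, so MSE = Var = p(1-p)/n
    return p * (1 - p) / n

def mse_shrink(p, n):
    # T = (S+1)/(n+2) with S ~ Binomial(n, p):
    # E T = (n p + 1)/(n + 2), Var T = n p (1-p)/(n+2)^2
    bias = (n * p + 1) / (n + 2) - p
    var = n * p * (1 - p) / (n + 2) ** 2
    return bias ** 2 + var

n = 10
for p in (0.1, 0.5, 0.9):
    print(p, round(mse_mean(p, n), 4), round(mse_shrink(p, n), 4))
```

Neither estimator has smaller mean-squared error for every $p$ (the shrinkage estimator wins near $p = 1/2$, the sample mean near the endpoints), which previews why uniform risk comparisons require restricting the class of estimators.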
A fairly general way of defining this is to specify a loss function:
$$L(\theta,d) = \text{the cost of concluding that } g(\theta) = d \text{ when the true parameter value is } \theta,$$
with $L(\theta,d) \ge 0$ and $L(\theta,g(\theta)) = 0$. Since $T(X)$ is a random variable, we measure the performance of $T(X)$ for estimating $g(\theta)$ in terms of its expected (or long-run average) loss,
$$R(\theta,T) = E_\theta L(\theta,T(X)),$$
known as the risk function.

The choice of a loss function will depend on the problem and the purpose of the estimation. For many estimation problems the conclusion is not particularly sensitive to the choice of loss function within a reasonable range of alternatives. Because of this, and especially because of its mathematical convenience, we often choose (and will do so in this chapter) the squared-error loss function $L(\theta,d) = (g(\theta) - d)^2$, with corresponding risk function
(2.1.1) $\quad R(\theta,T) = E_\theta(T(X) - g(\theta))^2$.

Ideally we would like to choose $T$ to minimize (2.1.1) uniformly in $\theta$. Unfortunately this is impossible, since the estimator $T^*$ defined by
(2.1.2) $\quad T^*(x) = g(\theta_0)$ for all $x\in\mathcal{X}$
(where $\theta_0$ is some fixed parameter value in $\Omega$) has the risk function
$$R(\theta,T^*) = \begin{cases}0 & \text{if } \theta = \theta_0,\\ (g(\theta) - g(\theta_0))^2 & \text{if } \theta\ne\theta_0.\end{cases}$$
An estimator which simultaneously minimized $R(\theta,T)$ for all $\theta\in\Omega$ would necessarily have $R(\theta,T) = 0$ for all $\theta\in\Omega$, and this is impossible except in trivial cases.

Why consider the class of unbiased estimators? There is nothing intrinsically good about unbiased estimators. The only criterion for goodness is that $R(\theta,T)$ should be small. The hope is that by restricting attention to a class of estimators which excludes (2.1.2), we may be able to minimize $R(\theta,T)$ uniformly in $\theta$, and that the resulting estimator will give small values of $R(\theta,T)$. This programme is frequently successful if we attempt to minimize $R(\theta,T)$ with $T$ restricted to the class of unbiased estimators of $g(\theta)$.

Definition 2.1.2. $g(\theta)$ is U-estimable if there exists an unbiased estimator of $g(\theta)$.

Example 2.1.3. $X_1,\dots,X_n$ iid Bernoulli($p$), $p\in(0,1)$.
$g(p) = p$ is U-estimable, since $E_p\bar X_n = p$ for all $p\in(0,1)$, while $h(p) = 1/p$ is not U-estimable, since if
$$\sum_x T(x)\,p^{\sum x_i}(1-p)^{n-\sum x_i} = \frac{1}{p} \qquad \text{for all } p\in(0,1),$$
then $\lim_{p\to0}\mathrm{RS} = \infty$ and $\lim_{p\to0}\mathrm{LS} = T(0)$ (the value of $T$ at $x = (0,\dots,0)$). So $T(0) = \infty$, but this is not possible, since then $E_pT(X) = \infty \ne 1/p$ for all $p\in(0,1)$.

Remark 2.1.4. $n\big/\sum_{i=1}^nX_i \xrightarrow{a.s.} p^{-1}$ for all $p\in(0,1)$. Hence $n/\sum X_i$ is a reasonable estimate of $p^{-1}$ even though it is not unbiased.

Theorem 2.1.5. If $T_0$ is an unbiased estimator of $g(\theta)$, then the totality of unbiased estimators of $g(\theta)$ is given by $\{T_0 - U : E_\theta U = 0 \text{ for all } \theta\in\Omega\}$.

Proof. If $T$ is unbiased for $g(\theta)$, then $T = T_0 - (T_0 - T)$, where $E_\theta(T_0 - T) = 0$ for all $\theta\in\Omega$. Conversely, if $T = T_0 - U$ where $E_\theta U = 0$ for all $\theta$, then $E_\theta T = E_\theta T_0 = g(\theta)$ for all $\theta$. $\square$

Remark 2.1.6. For squared-error loss $L(\theta,d) = (d - g(\theta))^2$, the risk of an unbiased $T = T_0 - U$ is
$$R(\theta,T) = E_\theta\big((T(X) - g(\theta))^2\big) = \operatorname{Var}_\theta(T(X)) = \operatorname{Var}_\theta(T_0(X) - U) = E_\theta\big[(T_0(X) - U)^2\big] - g(\theta)^2,$$
and hence the risk is minimized by minimizing $E_\theta[(T_0(X) - U)^2]$ with respect to $U$, i.e. by taking any fixed unbiased estimator $T_0$ of $g(\theta)$ and finding the unbiased estimator of zero which minimizes $E_\theta[(T_0(X) - U)^2]$. If the minimizing $U$ does not depend on $\theta$, we shall have found a uniformly minimum risk unbiased estimator of $g(\theta)$, while if $U$ depends on $\theta$ there is no uniformly minimum risk unbiased estimator. Note that for unbiased estimators and squared-error loss the risk is the same as the variance of the estimator, so uniformly minimum risk unbiased is the same as uniformly minimum variance unbiased in this case.

Example 2.1.7. $P(X = -1) = p$, $P(X = k) = q^2p^k$, $k = 0,1,\dots$, where $q = 1 - p$.
$T_0(X) = I_{\{-1\}}(X)$ is unbiased for $p$, $0 < p < 1$; $T_1(X) = I_{\{0\}}(X)$ is unbiased for $q^2$. $U$ is unbiased for $0$
$$\iff 0 = \sum_{k=-1}^\infty U(k)P(X = k) = pU(-1) + \sum_{k=0}^\infty U(k)q^2p^k = U(0) + \sum_{k=1}^\infty\big(U(k) - 2U(k-1) + U(k-2)\big)p^k$$
(with $U(-1)$ entering through the coefficient of $p^1$)
$$\iff U(k) = -kU(-1) = ka \text{ for some } a \quad (\text{comparing coefficients of } p^k,\ k = 0,1,2,\dots).$$
So an unbiased estimator of p with minimum risk (i.e.
variance) is $T_0(X) - a_0X$, where $a_0$ is the value of $a$ which minimizes
$$E_p(T_0(X) - aX)^2 = \sum_kP_p(X = k)[T_0(k) - ak]^2.$$
Similarly, an unbiased estimator of $q^2$ with minimum risk (i.e. variance) is $T_1(X) - a_1X$, where $a_1$ is the value of $a$ which minimizes
$$E_p(T_1(X) - aX)^2 = \sum_kP_p(X = k)[T_1(k) - ak]^2.$$
Some straightforward calculations give
$$a_0 = \frac{-p}{p + q^2\sum_{k=1}^\infty k^2p^k} \qquad \text{and} \qquad a_1 = 0.$$
Since $a_1$ is independent of $p$, the estimator $T_1(X)$ of $q^2$ is minimum variance unbiased for all $p$, i.e. UMVU. However, $a_0$ does depend on $p$, and so the estimator $T_0^*(X) = T_0(X) - a_0X$ is only locally minimum variance unbiased at $p$. (We are using "estimator" in a generalized sense here, since $T_0^*(X)$ depends on $p$; we shall continue to use this terminology.) A UMVU estimator of $p$ does not exist in this case.

Definition 2.1.8. Let $V(\theta) = \inf_T\operatorname{Var}_\theta(T)$, where the infimum is over all unbiased estimators of $g(\theta)$. If an unbiased estimator $T$ of $g(\theta)$ satisfies
$$\operatorname{Var}_\theta(T) = V(\theta) \text{ for all } \theta\in\Omega, \text{ it is called UMVU};$$
$$\operatorname{Var}_{\theta_0}(T) = V(\theta_0) \text{ for some } \theta_0\in\Omega, \text{ it is called LMVU at } \theta_0.$$

Remark 2.1.9. Let $\mathcal{H}$ be the Hilbert space of functions on $\mathcal{X}$ which are square integrable with respect to $\mathcal{P}$ (i.e. with respect to every $P\in\mathcal{P}$), and let $\mathcal{U}$ be the set of all unbiased estimators of $0$ in $\mathcal{H}$. If $T_0$ is an unbiased estimator of $g(\theta)$ in $\mathcal{H}$, then an LMVU estimator in $\mathcal{H}$ at $\theta_0$ is $T_0 - P_{\mathcal{U}}(T_0)$, where $P_{\mathcal{U}}$ denotes orthogonal projection onto $\mathcal{U}$ in the inner-product space $L^2(P_{\theta_0})$, i.e. $P_{\mathcal{U}}(T_0)$ is the unique element of $\mathcal{U}$ such that $T_0 - P_{\mathcal{U}}(T_0)\perp\mathcal{U}$ (in $L^2(P_{\theta_0})$). $T_0 - P_{\mathcal{U}}(T_0)$ is LMVU since $P_{\mathcal{U}}(T_0) = \arg\min_{U\in\mathcal{U}}E_{\theta_0}(T_0 - U)^2$.

Notation 2.1.10. We denote by $\Delta$ the set of all estimators $T$ with $E_\theta T^2 < \infty$ for all $\theta$, and by $\mathcal{U}$ the set of all unbiased estimators of $0$ in $\Delta$.

Theorem 2.1.11. An unbiased estimator $T\in\Delta$ of $g(\theta)$ is UMVU iff $E_\theta(TU) = 0$ for all $U\in\mathcal{U}$ and all $\theta\in\Omega$. (Equivalently $\operatorname{Cov}_\theta(T,U) = 0$, since $E_\theta U = 0$ for all $\theta$ and $E_\theta T = g(\theta)$ for all $\theta\in\Omega$.)

Proof. ($\Rightarrow$) Suppose $T$ is UMVU. For $U\in\mathcal{U}$, let $T' = T + \lambda U$ with $\lambda$ real.
Then $T'$ is unbiased and, by the definition of UMVU,
$$\operatorname{Var}_\theta(T') = \operatorname{Var}_\theta(T) + \lambda^2\operatorname{Var}_\theta(U) + 2\lambda\operatorname{Cov}_\theta(T,U) \ge \operatorname{Var}_\theta(T),$$
therefore $\lambda^2\operatorname{Var}_\theta(U) + 2\lambda\operatorname{Cov}_\theta(T,U) \ge 0$. Setting $\lambda = -\operatorname{Cov}_\theta(T,U)/\operatorname{Var}_\theta(U)$ gives a contradiction to this inequality unless $\operatorname{Cov}_\theta(T,U) = 0$. Hence $\operatorname{Cov}_\theta(T,U) = 0$.

($\Leftarrow$) If $E_\theta(TU) = 0$ for all $U\in\mathcal{U}$ and all $\theta\in\Omega$, let $T'$ be any other unbiased estimator. If $\operatorname{Var}_\theta(T') = \infty$, then $\operatorname{Var}_\theta(T) \le \operatorname{Var}_\theta(T')$, so suppose $\operatorname{Var}_\theta(T') < \infty$. Then $T' = T - U$ for some $U$ which is unbiased for $0$ (by Theorem 2.1.5), and
$$U = T - T' \ \Rightarrow\ E_\theta U^2 = E_\theta(T' - T)^2 \le 2E_\theta T'^2 + 2E_\theta T^2 < \infty \ \Rightarrow\ U\in\mathcal{U}.$$
Hence
$$\operatorname{Var}_\theta(T') = \operatorname{Var}_\theta(T - U) = \operatorname{Var}_\theta(T) + \operatorname{Var}_\theta(U) - 2\operatorname{Cov}_\theta(T,U) \ge \operatorname{Var}_\theta(T),$$
since $\operatorname{Cov}_\theta(T,U) = 0$. So $T$ is UMVU. $\square$

Unbiasedness and sufficiency. Suppose now that $T\in\Delta$ is unbiased for $g(\theta)$ and $S$ is sufficient for $\mathcal{P} = \{P_\theta,\ \theta\in\Omega\}$. Consider
$$T' = E(T\mid S) = E_\theta(T\mid S), \quad \text{independent of } \theta.$$
Then
(a) $E_\theta T' = E_\theta E(T\mid S) = E_\theta(T) = g(\theta)$ for all $\theta$;
(b) $\operatorname{Var}_\theta(T) = E_\theta\big(T - E(T\mid S) + E(T\mid S) - g(\theta)\big)^2 = E_\theta\big((T - E(T\mid S))^2\big) + \operatorname{Var}_\theta(T') + 2E_\theta\big[(T - E(T\mid S))(E(T\mid S) - g(\theta))\big] \ge \operatorname{Var}_\theta(T')$.
In (b) we used the fact that $T - E(T\mid S)$ is orthogonal to $\sigma(S)$-measurable square-integrable functions; the inequality is an equality for all $\theta$ iff $T = E(T\mid S)$ a.s. $\mathcal{P}$.

Theorem 2.1.12. If $S$ is a complete sufficient statistic for $\mathcal{P}$, then every U-estimable function $g(\theta)$ has one and only one unbiased estimator which is a function of $S$.

Proof. $T$ unbiased $\Rightarrow$ $E(T\mid S)$ is unbiased and a function of $S$. If $T_1(S)$ and $T_2(S)$ are both unbiased, then $E_\theta(T_1(S) - T_2(S)) = 0$ for all $\theta$, so $T_1(S) = T_2(S)$ a.s. $\mathcal{P}$ (completeness). $\square$

Theorem 2.1.13. (Rao–Blackwell) Suppose $S$ is a complete sufficient statistic for $\mathcal{P}$. Then
(i) If $g(\theta)$ is U-estimable, there exists an unbiased estimator which uniformly minimizes the risk for any loss function $L(\theta,d)$ which is convex in $d$.
(ii) The uniformly minimum risk estimator in (i) is the unique unbiased estimator which is a function of $S$; it is the unique unbiased estimator with minimum risk, provided the risk is finite and $L$ is strictly convex in $d$.

Proof.
(i) $L(\theta,d)$ convex in $d$ means $L(\theta,pd_1 + (1-p)d_2) \le pL(\theta,d_1) + (1-p)L(\theta,d_2)$, $0 < p < 1$. Let $T$ be any unbiased estimator of $g(\theta)$ and let $T' = E(T\mid S)$, another unbiased estimator of $g(\theta)$. Then
$$R(\theta,T') = E_\theta\big[L(\theta,E(T\mid S))\big] \le E_\theta\big[E(L(\theta,T)\mid S)\big] = E_\theta L(\theta,T) = R(\theta,T) \quad \forall\theta,$$
by Jensen's inequality for conditional expectation. If $T_2$ is any other unbiased estimator, then $T_2' = E(T_2\mid S) = T'$ a.s. $\mathcal{P}$ by Theorem 2.1.12. Hence starting from any unbiased estimator and conditioning on the complete sufficient statistic (CSS) $S$ gives a uniquely defined unbiased estimator which uniformly minimizes the risk and is the unique function of $S$ which is unbiased for $g(\theta)$.

(ii) The first statement was established at the end of the proof of (i). If $T$ has uniformly minimum risk, then so does $T' = E(T\mid S)$, as shown in (i). We show that $T$ is necessarily the uniquely determined unbiased function of $S$ by showing that $T$ is a function of $S$ a.s. $\mathcal{P}$. The proof is by contradiction. Suppose that "$T$ is a function of $S$ a.s. $\mathcal{P}$" is false. Then there exist $\theta$ and a set of positive $P_\theta$ measure on which $T' := E(T\mid S) \ne T$. But this implies that
$$R(\theta,T') = E_\theta\big(L(\theta,E(T\mid S))\big) < E_\theta\big(E(L(\theta,T)\mid S)\big) = R(\theta,T)$$
(Jensen's inequality is strict unless $E(T\mid S) = T$ a.s. $P_\theta$), contradicting the minimum-risk property of $T$. $\square$

Theorem 2.1.14. If $\mathcal{P}$ is an exponential family of full rank (i.e. $\{\eta_1,\dots,\eta_s\}$ and $\{T_1,\dots,T_s\}$ are A.I. and $\eta(\Omega)$ contains an open subset of $\mathbb{R}^s$), then the Rao–Blackwell theorem applies to any U-estimable $g(\theta)$ with $S = T$.

Proof. $T$ is complete sufficient for $\mathcal{P}$. $\square$

[Some obvious U-estimable $g(\theta)$'s are $E_\theta T_i(X) = \partial A/\partial\eta_i\big|_{\eta=\eta(\theta)}$ for $\theta\in\{\theta : \eta(\theta)\in\operatorname{int}(N)\}$, where $\tilde p_\eta(x) = e^{\sum\eta_iT_i(x) - A(\eta)}h(x)$ is the canonical representation of $p_\theta(x)$.]

Two methods for finding UMVU estimators.

Method 1. Search for a function $\delta(T)$, where $T$ is a CSS, such that $E_\theta\delta(T) = g(\theta)$ for all $\theta\in\Omega$.

Example 2.1.15. $X_1,\dots,X_n$ iid $N(\xi,\sigma^2)$, $\xi\in\mathbb{R}$, $\sigma^2 > 0$. $T = (\bar X,S^2)$ is a CSS. $E_{\xi,\sigma^2}\bar X = \xi$, so $\bar X$ is UMVU for $\xi$.

Method 2. Search for an unbiased $\delta(X)$ and a CSS $T$. Then $E(\delta(X)\mid T)$ is UMVU.

Example 2.1.16.
$X_1,\dots,X_n$ iid $U(0,\theta)$, $\theta > 0$, $g(\theta) = \theta/2$. $\delta(X) = X_1$ is unbiased and $X_{(n)}$ is a CSS, so $S = E(X_1\mid X_{(n)})$ is UMVU. To compute $S$ we note that, given $X_{(n)} = x$,
$$X_1 = x \text{ w.p. } \frac1n, \qquad X_1\sim U(0,x) \text{ w.p. } 1 - \frac1n$$
$$\Rightarrow\ S(x) = \frac{x}{n} + \Big(1 - \frac1n\Big)\frac{x}{2} = \frac{n+1}{n}\cdot\frac{x}{2}$$
$$\Rightarrow\ S(X_{(n)}) = \frac{n+1}{2n}X_{(n)} \text{ is UMVU for } \frac{\theta}{2} \ \Rightarrow\ \frac{n+1}{n}X_{(n)} \text{ is UMVU for } \theta.$$

Remark 2.1.17. (a) Convexity of $L(\theta,\cdot)$ is crucial to the Rao–Blackwell theorem.
(b) Large-sample theory tends to support the use of convex $L(\theta,\cdot)$. Heuristically, if $X_1,\dots,X_n$ are iid, then as $n\to\infty$ the error in estimating $g(\theta)$ goes to $0$ for any reasonable estimator (in some probabilistic sense). Thus only the behavior of $L(\theta,d)$ for $d$ close to $g(\theta)$ is relevant for large samples. A Taylor expansion around $d = g(\theta)$ gives
$$L(\theta,d) = a(\theta) + b(\theta)(d - g(\theta)) + c(\theta)(d - g(\theta))^2 + \text{remainder}.$$
But $L(\theta,g(\theta)) = 0 \Rightarrow a(\theta) = 0$, and $L(\theta,d) \ge 0 \Rightarrow b(\theta) = 0$. Hence locally $L(\theta,d) \approx c(\theta)(d - g(\theta))^2$, a convex weighted squared-error loss function.

Example 2.1.18. Observe $X_1,\dots,X_m$ iid $N(\xi,\sigma^2)$ and $Y_1,\dots,Y_n$ iid $N(\eta,\tilde\sigma^2)$, independent of $X_1,\dots,X_m$.
(i) For the 4-parameter family $\mathcal{P} = \{P_{\xi,\eta,\sigma^2,\tilde\sigma^2}\}$, $(\bar X,\bar Y,S_X^2,S_Y^2)$ is a CSS, since the exponential family is of full rank. Hence $\bar X$ and $S_X^2$ are UMVU for $\xi$ and $\sigma^2$ respectively, and $\bar Y$ and $S_Y^2$ are UMVU for $\eta$ and $\tilde\sigma^2$.
(ii) For the 3-parameter family $\mathcal{P} = \{P_{\xi,\eta,\sigma^2}\}$ (with $\tilde\sigma^2 = \sigma^2$), $(\bar X,\bar Y,SS)$ is a CSS, where $SS := (m-1)S_X^2 + (n-1)S_Y^2$. Hence $\bar X$, $\bar Y$ and $SS/(m+n-2)$ are UMVU for $\xi$, $\eta$ and $\sigma^2$ respectively.
(iii) For the 3-parameter family with $\xi = \eta$, $\sigma^2\ne\tilde\sigma^2$ (which arises when estimating a mean from two sets of readings with different accuracies), $(\bar X,\bar Y,S_X^2,S_Y^2)$ is minimal sufficient but not complete, since $\bar X - \bar Y\ne0$ a.s. $\mathcal{P}$, but $E(\bar X - \bar Y) = 0$ for all parameter values.

To deal with case (iii) we shall first show the following: if $\sigma^2/\tilde\sigma^2 = r$ for some fixed $r$, i.e. $\mathcal{P}^* = \{P_{\xi,r\sigma^2,\sigma^2}\}$ (writing $\sigma^2$ for the common value of $\tilde\sigma^2$), then
$$T = \Big(\sum X_i + r\sum Y_j,\ \sum X_i^2 + r\sum Y_j^2\Big) \text{ is a CSS.}$$

Proof.
$$p_{\xi,\sigma^2}(x,y) = \frac{1}{(2\pi)^{\frac{m+n}{2}}(r\sigma^2)^{\frac m2}(\sigma^2)^{\frac n2}}\exp\Big\{-\frac{1}{2r\sigma^2}\sum x_i^2 + \frac{\xi}{r\sigma^2}m\bar x - \frac{m\xi^2}{2r\sigma^2} - \frac{1}{2\sigma^2}\sum y_i^2 + \frac{\xi}{\sigma^2}n\bar y - \frac{n\xi^2}{2\sigma^2}\Big\}$$
$$= \exp\{-A(\xi,\sigma^2)\}\exp\Big\{-\frac{1}{2r\sigma^2}\Big(\sum x_i^2 + r\sum y_i^2\Big) + \frac{\xi}{r\sigma^2}\Big(\sum x_i + r\sum y_i\Big)\Big\},$$
a full-rank two-parameter exponential family with natural sufficient statistic $T$. $\square$

Since $T$ is a CSS for $\mathcal{P}^*$ and $T_1 = \dfrac{\sum X_i + r\sum Y_i}{m + rn}$ is unbiased for $\xi$, it is UMVU for $\xi$ in $\mathcal{P}^*$. $T_1$ is also unbiased for $\xi$ in $\mathcal{P} = \{P_{\xi,\sigma^2,\tilde\sigma^2}\}$. Hence
$$V(\xi_0,\sigma_0^2,\tilde\sigma_0^2) \le \operatorname{Var}_{\xi_0,\sigma_0^2,\tilde\sigma_0^2}(T_1) = \frac{\sigma_0^2\tilde\sigma_0^2}{m\tilde\sigma_0^2 + n\sigma_0^2}, \qquad \text{where } \sigma_0^2 = r\tilde\sigma_0^2.$$
($V$ is the smallest variance of all unbiased estimators of $\xi$ for $\mathcal{P}$, evaluated at $(\xi_0,\sigma_0^2,\tilde\sigma_0^2)$.)

On the other hand, every $T$ which is unbiased for $\xi$ in $\mathcal{P}$ is also unbiased in $\mathcal{P}^*$. Hence if $T$ is unbiased for $\xi$ in $\mathcal{P}$, then
$$\operatorname{Var}_{\xi_0,\sigma_0^2,\tilde\sigma_0^2}(T) \ge \operatorname{Var}_{\xi_0,\sigma_0^2,\tilde\sigma_0^2}\Big(\frac{\sum X_i + r\sum Y_i}{m + rn}\Big), \qquad r = \frac{\sigma_0^2}{\tilde\sigma_0^2},$$
and the inequality continues to hold with the left-hand side replaced by $V(\xi_0,\sigma_0^2,\tilde\sigma_0^2)$. So
$$V(\xi_0,\sigma_0^2,\tilde\sigma_0^2) = \frac{\sigma_0^2\tilde\sigma_0^2}{m\tilde\sigma_0^2 + n\sigma_0^2},$$
and the LMVU estimator at $(\xi_0,\sigma_0^2,\tilde\sigma_0^2)$ is
$$\frac{\sum X_i/\sigma_0^2 + \sum Y_i/\tilde\sigma_0^2}{m/\sigma_0^2 + n/\tilde\sigma_0^2}.$$
Since this estimate depends on the ratio $r = \sigma_0^2/\tilde\sigma_0^2$, a UMVU estimator for $\xi$ does not exist in $\mathcal{P}$. A natural estimate of $\xi$ is
$$\hat\xi = \frac{\sum X_i/S_X^2 + \sum Y_i/S_Y^2}{m/S_X^2 + n/S_Y^2}.$$
(See Graybill and Deal, Biometrics, 1959, pp. 543–550, for its properties.)

2.2. Non-parametric families

Consider $X = (X_1,\dots,X_n)$, where $X_1,\dots,X_n$ are iid $F$, $F\in\mathcal{F}$, a family of distribution functions, and $P_F$ is the corresponding product measure on $(\mathbb{R}^n,\mathcal{B}^n)$. For example,
$$\mathcal{F}_0 = \{\text{df's with density relative to Lebesgue measure}\}, \quad \mathcal{F}_1 = \Big\{\text{df's with }\int|x|\,F(dx) < \infty\Big\}, \quad \mathcal{F}_2 = \Big\{\text{df's with }\int x^2\,F(dx) < \infty\Big\}, \quad \text{etc.}$$
The estimand is $g : \mathcal{F}\to\mathbb{R}$; for example
$$g(F) = \int x\,F(dx) = \mu_F, \qquad g(F) = \int x^2\,F(dx), \qquad g(F) = F(a), \qquad g(F) = F^{-1}(p).$$

Proposition 2.2.1. If $\mathcal{F}_0$ is defined as above, then $(X_{(1)},\dots,X_{(n)})$ is complete sufficient for $\mathcal{F}_0$ (i.e. for the corresponding family of probability measures $\mathcal{P}$).

Proof. We know that $T(X) = (X_{(1)},\dots,X_{(n)})$ is sufficient for $\mathcal{P}$. It remains to show (by Problem 1.6.32, p. 72) that $T$ is complete and sufficient for a family $\mathcal{P}_0\subseteq\mathcal{P}$ such that each member of $\mathcal{P}_0$ has positive density on $\mathbb{R}^n$.
Choose $\mathcal{P}_0$ to be the set of probability measures on $\mathcal{B}^n$ with densities relative to Lebesgue measure of the form
$$C(\theta_1,\dots,\theta_n)\exp\Big\{\theta_1\sum_ix_i + \theta_2\sum_{i<j}x_ix_j + \cdots + \theta_nx_1\cdots x_n - \sum_ix_i^{2n}\Big\}.$$
This is an exponential family whose natural parameter set $N$ contains an open set ($N = \mathbb{R}^n$). So $S(x) = (\sum_ix_i,\ \sum_{i<j}x_ix_j,\ \dots,\ x_1\cdots x_n)$ is complete. But $S$ is equivalent to $T$ (consider the $n$th degree polynomial whose zeroes are $x_{(1)},\dots,x_{(n)}$; its coefficients are, up to sign, the components of $S$), so $T$ is complete for $\mathcal{F}_0$. $\square$

Measurable functions of the order statistics. If $T(x) := (x_{(1)},\dots,x_{(n)})$, then $\delta(X_1,\dots,X_n)\in\sigma(T)$ iff $\delta(X_1,\dots,X_n) = \delta(X_{\pi_1},\dots,X_{\pi_n})$ for every permutation $(\pi_1,\dots,\pi_n)$ of $(1,\dots,n)$. Since $T$ is a CSS for $\mathcal{F}_0$, this enables us to identify UMVU estimators of estimands $g$ for which they exist.

Example 2.2.2. $g(F) = F(a)$. An obvious unbiased estimator of $F(a)$ is
$$T_1(X) := \frac1n\sum_{i=1}^nI_{(-\infty,a]}(X_i),$$
and $T_1\in\sigma(T)$, so $T_1$ is UMVU for $F(a)$.

Example 2.2.3. $g(F) = \mu_F = \int x\,dF$, $F\in\mathcal{F}_0\cap\mathcal{F}_2$. Let $T_2(x) = \frac1n\sum_{i=1}^nx_i$. Then $T_2\in\sigma(T)$ and, since $T$ is also complete for $\mathcal{F}_0\cap\mathcal{F}_2$, $T_2$ is therefore UMVU for $\mu_F$.

Example 2.2.4. $g(F) = \sigma_F^2$, $F\in\mathcal{F}_0\cap\mathcal{F}_4$. Let
$$T_3(x) = S^2(x) = \frac{\sum_i(x_i - \bar x)^2}{n-1} = \frac{\sum_i\big(x_{(i)} - \frac1n\sum_jx_{(j)}\big)^2}{n-1}.$$
$T_3\in\sigma(T)$ and is unbiased for $\sigma_F^2$. Since $T$ is complete for $\mathcal{F}_0\cap\mathcal{F}_4$, $T_3$ is UMVU for $\sigma_F^2$.

Remark 2.2.5. $T$ complete for $\mathcal{F}$ does not in general imply that $T$ is complete for a subfamily $\mathcal{F}^*\subseteq\mathcal{F}$. In fact the reverse is true: completeness for $\mathcal{F}^*$ implies completeness for $\mathcal{F}\supseteq\mathcal{F}^*$. However, the same argument used in the proof of Proposition 2.2.1 shows that $T$ is complete for $\mathcal{F}_0\cap\mathcal{F}_2$ (used in Example 2.2.3) and that $T$ is complete for $\mathcal{F}_0\cap\mathcal{F}_4$ (used in Example 2.2.4).

Example 2.2.6. $g(F) = \mu_F^2$, $F\in\mathcal{F}_0\cap\mathcal{F}_4$.
$$T_4(X) = \bar X^2 - \frac{S^2(X)}{n} \text{ is UMVU for } g(F).$$
This result could also be obtained by observing that $X_1X_2$ is unbiased for $\mu_F^2$, $F\in\mathcal{F}_0\cap\mathcal{F}_4$, and therefore $E(X_1X_2\mid X_{(1)},\dots,X_{(n)})$ is UMVU.
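As a numerical sanity check of Example 2.2.6 (an illustrative sketch; the helper names are mine): the symmetric average of $X_iX_j$ over all ordered pairs $i\ne j$ agrees exactly with $\bar X^2 - S^2/n$ on every sample, which is what the conditional-expectation calculation establishes in general.

```python
# Check on a random sample: the pair average (1/(n(n-1))) sum_{i != j} x_i x_j
# equals xbar^2 - s^2/n, where s^2 is the unbiased sample variance.
import random

def pair_average(xs):
    n = len(xs)
    return sum(xs[i] * xs[j]
               for i in range(n) for j in range(n) if i != j) / (n * (n - 1))

def t4(xs):
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    return xbar ** 2 - s2 / n

random.seed(0)
xs = [random.gauss(1.0, 2.0) for _ in range(10)]
assert abs(pair_average(xs) - t4(xs)) < 1e-10
```

The identity holds for every sample, not just in expectation, because both sides are the same symmetric polynomial in the observations.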
But, conditioned on $X_{(1)},\dots,X_{(n)}$,
$$X_1X_2 = X_{(i)}X_{(j)} \quad \text{w.p. } \frac{2}{n(n-1)} \text{ for each subset } \{i,j\} \text{ of } \{1,\dots,n\}$$
$$\Rightarrow\ E(X_1X_2\mid X_{(1)},\dots,X_{(n)}) = \frac{1}{n(n-1)}\sum_{i\ne j}X_{(i)}X_{(j)} = \frac{1}{n(n-1)}\Big(\Big(\sum X_i\Big)^2 - \sum X_i^2\Big)$$
$$= \frac{n}{n-1}\bar X^2 - \frac{\sum X_i^2}{n(n-1)} = \bar X^2 - \frac1n\cdot\frac{\sum X_i^2 - n\bar X^2}{n-1} = \bar X^2 - \frac{S^2}{n} = T_4(X).$$

More generally, suppose $g(F)$ is U-estimable in $\mathcal{F}_0$. Then there exists $\delta(X_1,\dots,X_m)$ such that $E_F\,\delta(X_1,\dots,X_m) = g(F)$ for all $F\in\mathcal{F}_0$. Suppose also that $\delta(X_1,\dots,X_m)$ has finite second moment for $F\in\mathcal{F}_0\cap\mathcal{F}_k$ for some positive integer $k$. We can assume $\delta$ is symmetric in $X_1,\dots,X_m$, since if not we can redefine $\delta$ as
$$\delta^*(X_1,\dots,X_m) = \frac{1}{m!}\sum_{\text{permutations }(\pi_1,\dots,\pi_m)\text{ of }(1,\dots,m)}\delta(X_{\pi_1},\dots,X_{\pi_m}),$$
which is unbiased and symmetric. Now we define the U-statistic
$$T = \binom{n}{m}^{-1}\sum_{1\le i_1<i_2<\cdots<i_m\le n}\delta(X_{i_1},\dots,X_{i_m}).$$
This is symmetric in $X_1,\dots,X_n$ and unbiased, and therefore UMVU for $g(F)$, $F\in\mathcal{F}_0\cap\mathcal{F}_k$.

Questions. (1) Which $g(F)$ are U-estimable? (2) If $g$ is U-estimable, what is the smallest value of $m$ for which there exists a U-statistic for $g$ of the form $T$? This number is called the degree of $g$.

Proposition 2.2.7. If $g$ is of degree 1, then for any $F_1,F_2\in\mathcal{F}_0$, $g(\alpha F_1 + (1-\alpha)F_2)$ is linear in $\alpha$.

Proof. If $g$ is of degree 1, there exists $\delta(X_1)$ such that $\int\delta(x)F(dx) = g(F)$, so
$$g(\alpha F_1 + (1-\alpha)F_2) = \alpha\int\delta(x)F_1(dx) + (1-\alpha)\int\delta(x)F_2(dx) = \alpha g(F_1) + (1-\alpha)g(F_2). \ \square$$

Generalization. If $g$ is of degree $s$, then $g(\alpha F_1 + (1-\alpha)F_2)$ is a polynomial in $\alpha$ of degree $\le s$ (since $dF(x_1)\cdots dF(x_s)$ contributes the term $\alpha^s\,dF_1(x_1)\cdots dF_1(x_s)$, and lower powers of $\alpha$, when each $F$ is replaced by $\alpha F_1 + (1-\alpha)F_2$).

Example 2.2.8. $g(F) = \sigma_F^2$ is of degree 2 in $\mathcal{F}_0\cap\mathcal{F}_2$.

Proof. Let $\delta(X_1,X_2) = \frac12(X_1 - X_2)^2$; then $E_F\delta = E_FX_1^2 - E_FX_1\,E_FX_2 = \sigma_F^2$, so $\deg(g)\le2$. To show $\deg(g)\ne1$, consider
$$g(\alpha F_1 + (1-\alpha)F_2) = \sigma^2_{\alpha F_1+(1-\alpha)F_2} = \alpha\int x^2\,dF_1(x) + (1-\alpha)\int x^2\,dF_2(x) - \big[\alpha\mu_{F_1} + (1-\alpha)\mu_{F_2}\big]^2,$$
and this is linear in $\alpha$ iff $\mu_{F_1} = \mu_{F_2}$, which is not the case for every $F_1,F_2\in\mathcal{F}_0\cap\mathcal{F}_2$. Hence $\deg(g) = 2$. $\square$

Serfling (1980) is a good reference on U-statistics.

Example 2.2.9. $g(F) = \sigma_F$ is not U-estimable in $\mathcal{F}_0$, since $g(\alpha F_1 + (1-\alpha)F_2)$ is not a polynomial in $\alpha$.

2.3.
The Information Inequality

For any estimator $T\in\Delta$ of $g(\theta)$ and any function $\psi(X,\theta)$ such that $E_\theta|\psi(X,\theta)|^2 < \infty$, we have the inequality
(2.3.1) $\quad \operatorname{Var}_\theta(T) \ge \dfrac{|\operatorname{Cov}_\theta(T,\psi)|^2}{\operatorname{Var}_\theta(\psi(X,\theta))}$.
However, this will not in general provide a useful lower bound for $\operatorname{Var}_\theta T$, since the right-hand side depends on $T$. It can be useful, however, when the right-hand side depends on $T$ in a simple way, in particular when it depends on $T$ only through $E_\theta T$.

Theorem 2.3.1. $\operatorname{Cov}_\theta(T,\psi)$ depends on $T$ only through $E_\theta T$ iff $\operatorname{Cov}_\theta(U,\psi) = 0$ for all $U\in\mathcal{U}\cap\Delta$ (unbiased square-integrable estimators of 0).

Proof. ($\Leftarrow$) Suppose $\operatorname{Cov}_\theta(U,\psi) = 0$ for all $U\in\mathcal{U}\cap\Delta$ and that $T_1,T_2$ are two estimators with finite variance and $E_\theta T_1 = E_\theta T_2$ for all $\theta\in\Omega$. Then $T_1 - T_2\in\mathcal{U}$, so $\operatorname{Cov}_\theta(T_1,\psi) = \operatorname{Cov}_\theta(T_2,\psi)$.
($\Rightarrow$) If $\operatorname{Cov}_\theta(T,\psi)$ depends on $T$ only through $E_\theta T$ and $U\in\mathcal{U}$, then $\operatorname{Cov}_\theta(T+U,\psi) = \operatorname{Cov}_\theta(T,\psi)$, so $\operatorname{Cov}_\theta(U,\psi) = 0$. $\square$

Hammersley–Chapman–Robbins Inequality. Suppose $X$ has probability density $p(x,\theta)$, $\theta\in\Omega$, where $p(x,\theta) > 0$ for all $x$ and $\theta$. Then
$$\psi(x,\theta) = \frac{p(x,\theta+\Delta)}{p(x,\theta)} - 1$$
satisfies the conditions of the previous theorem, i.e. $\operatorname{Cov}_\theta(U,\psi) = 0$ for all $U\in\mathcal{U}\cap\Delta$, since $E_\theta\psi(X,\theta) = 0$ and $E_\theta(U\psi) = \int U(x)\big(p(x,\theta+\Delta) - p(x,\theta)\big)\,d\mu(x) = 0$. For any statistic $S\in\Delta$,
$$\operatorname{Cov}_\theta(S,\psi) = E_\theta(S\psi) = \int S\,\big[p(x,\theta+\Delta) - p(x,\theta)\big]\,\mu(dx) = E_{\theta+\Delta}S - E_\theta S = \begin{cases}0 & \text{if } S\in\mathcal{U},\\ g(\theta+\Delta) - g(\theta) & \text{if } S \text{ is unbiased for } g(\cdot).\end{cases}$$
Hence from (2.3.1), if $T\in\Delta$ is unbiased for $g(\theta)$,
$$\operatorname{Var}_\theta(T) \ge \frac{(g(\theta+\Delta) - g(\theta))^2}{E_\theta\big[\big(\frac{p(X,\theta+\Delta)}{p(X,\theta)} - 1\big)^2\big]} \quad \forall\Delta,$$
and we obtain the HCR bound:
$$\operatorname{Var}_\theta(T) \ge \sup_\Delta\frac{(g(\theta+\Delta) - g(\theta))^2}{\operatorname{Var}_\theta\big(\frac{p(X,\theta+\Delta)}{p(X,\theta)}\big)}.$$
Letting $\Delta\to0$ in the HCR bound gives
$$\operatorname{Var}_\theta T \ge \lim_{\Delta\to0}\frac{\big(\frac{g(\theta+\Delta)-g(\theta)}{\Delta}\big)^2}{E_\theta\big(\frac1\Delta\,\frac{p(X,\theta+\Delta)-p(X,\theta)}{p(X,\theta)}\big)^2} = \frac{g'(\theta)^2}{E_\theta\big(\frac{\partial p/\partial\theta}{p}\big)^2} = \frac{g'(\theta)^2}{E_\theta\big(\frac{\partial}{\partial\theta}\log p(X,\theta)\big)^2},$$
provided $g$ is differentiable and we can differentiate under the expectation. These steps are legitimized under the conditions of the following theorem.

Theorem 2.3.2.
(Cramér–Rao Lower Bound, CRLB.) Suppose that $p(x,\theta) > 0$ and $\frac{\partial}{\partial\theta}\log p(x,\theta)$ exists for all $x$ and $\theta$, and that for each $\theta$ there exists $\delta > 0$ such that
$$\theta'\in\Omega \text{ and } |\theta' - \theta| < \delta \ \Rightarrow\ \frac{1}{|\theta' - \theta|}\Big|\frac{p(x,\theta')}{p(x,\theta)} - 1\Big| \le G(x,\theta)$$
(where $G$ is independent of $\theta'$ and $E_\theta G(X,\theta)^2 < \infty$ for all $\theta$). Then for any unbiased estimator $T$ of $g(\theta)$,
$$\operatorname{Var}_\theta T \ge \frac{g'(\theta)^2}{I(\theta)}, \qquad \text{where}\quad \begin{cases}I(\theta) = E_\theta\Big(\dfrac{\partial\log p(X,\theta)}{\partial\theta}\Big)^2 & \text{(definition of Fisher information)},\\[2mm] g'(\theta)^2 = \limsup_{\theta'\to\theta}\Big(\dfrac{g(\theta')-g(\theta)}{\theta'-\theta}\Big)^2.\end{cases}$$

Proof. By the HCR bound,
$$\operatorname{Var}_\theta(T)\int\Big(\frac{p(x,\theta')/p(x,\theta) - 1}{\theta' - \theta}\Big)^2p(x,\theta)\,\mu(dx) \ge \Big(\frac{g(\theta') - g(\theta)}{\theta' - \theta}\Big)^2.$$
Let $\{\theta_n\}$ be a sequence such that $\theta_n\to\theta$ and $\big(\frac{g(\theta_n)-g(\theta)}{\theta_n-\theta}\big)^2\to g'(\theta)^2$. Then setting $\theta' = \theta_n$ in the above inequality and letting $n\to\infty$ gives (by dominated convergence)
$$\operatorname{Var}_\theta(T)\,E_\theta\Big(\frac{\partial}{\partial\theta}\log p(X,\theta)\Big)^2 \ge g'(\theta)^2. \ \square$$

Corollary 2.3.3. If $X_1,\dots,X_n$ are iid $P_{1,\theta}$ (the marginal distribution of each $X_i$) and the corresponding marginal density $p_1(x,\theta)$ satisfies the conditions of the Cramér–Rao Lower Bound theorem, then for any unbiased estimator $T(X_1,\dots,X_n)$ of $g(\theta)$,
$$\operatorname{Var}_\theta(T) \ge \frac{g'(\theta)^2}{nI_1(\theta)}, \qquad \text{where } I_1(\theta) = E_\theta\Big(\frac{\partial\log p_1(X_1,\theta)}{\partial\theta}\Big)^2.$$

Proof. The sample space is $\mathcal{X}^n$ and $\frac{dP_\theta}{d\mu^n}(x) = \prod_{i=1}^np_1(x_i,\theta)$. The Fisher information for $P_\theta$ is
$$\int\Big(\frac{\partial}{\partial\theta}\log\prod_{i=1}^np_1(x_i,\theta)\Big)^2\prod_jp_1(x_j,\theta)\,\mu^n(dx) = \int\Big(\sum_i\frac{\partial\log p_1(x_i,\theta)}{\partial\theta}\Big)^2\prod_jp_1(x_j,\theta)\,\mu^n(dx) = \sum_i\int\Big(\frac{\partial\log p_1(x_i,\theta)}{\partial\theta}\Big)^2\prod_jp_1(x_j,\theta)\,\mu^n(dx) = nI_1(\theta),$$
since, for $i\ne j$,
$$E_\theta\Big(\frac{\partial\log p_1(X_i,\theta)}{\partial\theta}\,\frac{\partial\log p_1(X_j,\theta)}{\partial\theta}\Big) = E_\theta\frac{\partial\log p_1(X_i,\theta)}{\partial\theta}\,E_\theta\frac{\partial\log p_1(X_j,\theta)}{\partial\theta} \quad \text{(by independence)}$$
and $E_\theta\big(\frac{\partial}{\partial\theta}\log p_1(X_i,\theta)\big) = 0$. (We can differentiate under the integral sign by dominated convergence and the assumptions on $p_1$.)

The statement of the Corollary now follows if we can show that the assumptions on $p_1(x_i,\theta)$ carry over to $p(x,\theta)$, i.e. that
$$|\theta' - \theta| < \delta \ \Rightarrow\ \Big|\frac{p_1(x_1,\theta')\cdots p_1(x_n,\theta')}{p_1(x_1,\theta)\cdots p_1(x_n,\theta)} - 1\Big| \le |\theta' - \theta|\,\tilde G(x,\theta),$$
where $E_\theta\tilde G(X,\theta)^2 < \infty$. Now
$$\frac{p_1(x_i,\theta')}{p_1(x_i,\theta)} \le 1 + |\theta' - \theta|\,G(x_i,\theta) \le 1 + \delta\,G(x_i,\theta),$$
and
$$|a_1\cdots a_n - 1| \le |a_1\cdots a_n - a_2\cdots a_n| + |a_2\cdots a_n - a_3\cdots a_n| + \cdots \le |a_1 - 1||a_2\cdots a_n| + |a_2 - 1||a_3\cdots a_n| + \cdots + |a_n - 1|$$
by the triangle inequality. Setting $a_i = p_1(x_i,\theta')/p_1(x_i,\theta)$ gives
$$\Big|\frac{p_1(x_1,\theta')\cdots p_1(x_n,\theta')}{p_1(x_1,\theta)\cdots p_1(x_n,\theta)} - 1\Big| \le |\theta'-\theta|\,G(x_1,\theta)\prod_{i>1}(1 + \delta G(x_i,\theta)) + |\theta'-\theta|\,G(x_2,\theta)\prod_{i>2}(1 + \delta G(x_i,\theta)) + \cdots + |\theta'-\theta|\,G(x_n,\theta) =: |\theta'-\theta|\,\tilde G(x,\theta),$$
and $E_\theta\tilde G(X,\theta)^2 < \infty$ since $X_1,\dots,X_n$ are independent and $E_\theta G(X_i,\theta)^2 < \infty$. (No $G(X_i,\theta)$ is raised to a power greater than 2.) $\square$

Corollary 2.3.4. Suppose $p(x,\theta)$ satisfies the conditions of Theorem 2.3.2 and $E_\theta T(X) = g(\theta) + b(\theta)$, i.e. $T(X)$ has bias $b(\theta)$ for estimating $g(\theta)$. Then
$$\mathrm{MSE}_\theta(T) = E_\theta(T(X) - g(\theta))^2 = b^2(\theta) + \operatorname{Var}_\theta(T) \ge b^2(\theta) + \frac{c(\theta)}{I(\theta)},$$
where
$$c(\theta) = \limsup_{\theta'\to\theta}\Big(\frac{g(\theta') + b(\theta') - g(\theta) - b(\theta)}{\theta' - \theta}\Big)^2.$$

Proof. $T$ is unbiased for $g(\theta) + b(\theta)$, so $\operatorname{Var}_\theta(T(X)) \ge c(\theta)/I(\theta)$. Hence
$$\mathrm{MSE}_\theta(T) = E_\theta(T(X) - g(\theta))^2 = \operatorname{Var}_\theta T(X) + \big[E_\theta(T(X) - g(\theta))\big]^2 \ge \frac{c(\theta)}{I(\theta)} + b^2(\theta). \ \square$$

Behavior of $I(\theta)$ under reparameterization. Suppose $\theta = h(\xi)$ reparameterizes $\{P_\theta : \theta\in\Omega\}$ to $\{P^*_\xi : \xi\in h^{-1}(\Omega)\}$, where $h^{-1}$ denotes the inverse mapping. Then $p^*(x,\xi) = p(x,h(\xi))$ and, by definition,
(2.3.3) $\quad I^*(\xi) = E\Big(\dfrac{\partial}{\partial\xi}\log p^*(X,\xi)\Big)^2 = E\Big(\dfrac{\partial}{\partial\theta}\log p(X,\theta)\Big|_{\theta=h(\xi)}\Big)^2\Big(\dfrac{dh(\xi)}{d\xi}\Big)^2 = I(h(\xi))\Big(\dfrac{dh(\xi)}{d\xi}\Big)^2$.

An alternative expression for $I(\theta)$. Provided $\frac{\partial^2}{\partial\theta^2}\log p(x,\theta)$ exists for all $x$, and if
(2.3.4) $\quad \displaystyle\int\frac{\partial^2}{\partial\theta^2}p(x,\theta)\,d\mu(x) = \frac{\partial^2}{\partial\theta^2}\int p(x,\theta)\,d\mu(x) = 0$,
then
$$I(\theta) = -E_\theta\frac{\partial^2}{\partial\theta^2}\log p(X,\theta).$$
Proof.
$$\frac{\partial^2}{\partial\theta^2}\log p(x,\theta) = \frac{\frac{\partial^2}{\partial\theta^2}p(x,\theta)}{p(x,\theta)} - \frac{\big(\frac{\partial p(x,\theta)}{\partial\theta}\big)^2}{p(x,\theta)^2},$$
and $E_\theta\big[p(X,\theta)^{-1}\frac{\partial^2}{\partial\theta^2}p(X,\theta)\big] = 0$ by (2.3.4). $\square$

Theorem 2.3.5. One-parameter exponential family. Suppose $p(x,\theta) = \exp(T(x)\eta(\theta) - B(\theta))h(x)$, where $\theta = E_\theta T$ and $\eta(\theta)\in\operatorname{int}(N)$. Then
$$I(\theta) = \frac{1}{\operatorname{Var}_\theta T}.$$

Proof. From $\int e^{T(x)\eta - A(\eta)}h(x)\,\mu(dx) = 1$ we get $A'(\eta) = E_\eta T$ and $A''(\eta) = \operatorname{Var}_\eta T$; hence $\theta = A'(\eta)$. Now
$$\tilde p(x,\eta) = e^{T(x)\eta - A(\eta)}h(x)$$
(this is the canonical representation of the density of the exponential family; see (1.3.7)). Hence $\frac{\partial\log\tilde p(x,\eta)}{\partial\eta} = T(x) - A'(\eta)$, so
$$I(\eta) = E_\eta(T(X) - A'(\eta))^2 = \operatorname{Var}_\eta T.$$
Since $\theta = A'(\eta)$, i.e. $\eta = (A')^{-1}(\theta)$, the reparameterization formula gives $I(\eta) = I(\theta(\eta))(A''(\eta))^2$.
So
$$I(\theta) = \frac{I(\eta(\theta))}{A''(\eta(\theta))^2} = \frac{\operatorname{Var}_\theta T}{(\operatorname{Var}_\theta T)^2} = \frac{1}{\operatorname{Var}_\theta T}. \ \square$$

Remark 2.3.6. $T$ attains the CR lower bound in this case. A converse result also holds: under some regularity conditions, attainment of the CR lower bound implies that $T$ is the natural sufficient statistic of some exponential family $\{P_\theta\}$.

Example 2.3.7. Poisson family. Suppose that $X_1,\dots,X_n$ are iid Poisson($\lambda$). Then
$$p(x,\lambda) = e^{-n\lambda}\lambda^{\sum x_i}\prod_{i=1}^n\frac{1}{x_i!} = \exp\Big\{-ne^{\eta/n} + \eta\,\frac{\sum x_i}{n}\Big\}\prod_{i=1}^n\frac{1}{x_i!}, \qquad \eta = n\log\lambda,$$
so $T(x) = \frac{\sum x_i}{n}$ is UMVU for $A'(\eta) = e^{\eta/n} = \lambda = \theta$, and
$$A''(\eta) = \frac1ne^{\eta/n} = \frac{\lambda}{n} = I(\eta) = \operatorname{Var}_\eta T, \qquad I(\theta) = \frac{1}{\operatorname{Var}_\theta T} = \frac{n}{\lambda}.$$
The information on $\eta = n\log\lambda$ increases with $\lambda$; the information on $\theta = \lambda$ decreases with $\lambda$.

Theorem 2.3.8. Alternative version of the CRLB theorem. Suppose $\Omega$ is an open interval, $A = \{x : p(x,\theta) > 0\}$ is independent of $\theta$, $\frac{\partial p(x,\theta)}{\partial\theta}$ is finite for all $x\in A$ and all $\theta\in\Omega$, $E_\theta\frac{\partial}{\partial\theta}\log p(X,\theta) = 0$, and $\frac{\partial}{\partial\theta}(E_\theta T) = \int T(x)\frac{\partial p}{\partial\theta}(x,\theta)\,\mu(dx)$. Then
$$\operatorname{Var}_\theta(T(X)) \ge \frac{\big(\frac{\partial}{\partial\theta}E_\theta T\big)^2}{I(\theta)}.$$

Proof. Choose $\psi(x,\theta) = \frac{\partial}{\partial\theta}\log p(x,\theta)$ in (2.3.1):
$$\operatorname{Cov}_\theta\Big(T,\frac{\partial}{\partial\theta}\log p(X,\theta)\Big) = \int T(x)\frac{\partial p(x,\theta)}{\partial\theta}\,\mu(dx) = \frac{\partial}{\partial\theta}(E_\theta T), \qquad \operatorname{Var}_\theta\Big(\frac{\partial}{\partial\theta}\log p(X,\theta)\Big) = I(\theta). \ \square$$

2.4. Multiparameter Case

Theorem 2.4.1. For $T(X),\psi_1(X,\theta),\dots,\psi_s(X,\theta)$ functions with finite second moments under $P_\theta$, we have the multiparameter analogue of (2.3.1),
$$\operatorname{Var}_\theta(T) \ge \gamma^TC^-\gamma,$$
where $\gamma^T = (\gamma_1,\dots,\gamma_s)$, $\gamma_i = \operatorname{Cov}_\theta(T,\psi_i)$, and $C = [\operatorname{Cov}_\theta(\psi_i,\psi_j)]_{i,j=1}^s$.

Proof. Let $\hat Y$ denote the minimum mean squared error linear predictor of $Y = T - E_\theta T$ in terms of
$$\psi - E_\theta\psi = \begin{pmatrix}\psi_1 - E_\theta\psi_1\\ \vdots\\ \psi_s - E_\theta\psi_s\end{pmatrix}.$$
Then $\hat Y = a^T(\psi - E_\theta\psi)$, where $E_\theta(Y - \hat Y)(\psi_j - E_\theta\psi_j) = 0$, $j = 1,\dots,s$, i.e.
$$Ca = \gamma = \operatorname{Cov}_\theta(T,\psi).$$
These equations have a solution, and hence $a = C^-\gamma$, where $C^-$ is any generalized inverse of $C = E(\psi - E\psi)(\psi - E\psi)^T$. So $\hat Y = \gamma^TC^-(\psi - E_\theta\psi)$. Also
$$E(Y - \hat Y)^2 = EY^2 - E\hat Y^2 \quad (\text{since } \hat Y\perp Y - \hat Y \text{ in } L^2(P_\theta)) = \operatorname{Var}_\theta T - \gamma^TC^-\gamma \ge 0,$$
so $\operatorname{Var}_\theta T \ge \gamma^TC^-\gamma$. Notice that the right-hand side is the same for any generalized inverse $C^-$ of $C$. $\square$
Generalization of the Information Inequality. Assume that
$$\begin{cases}\Omega \text{ is an open subset of } \mathbb{R}^s,\\ A = \{x : p(x,\theta) > 0\} \text{ is independent of } \theta,\\ \frac{\partial p(x,\theta)}{\partial\theta_i} \text{ is finite for all } x\in A,\ \theta\in\Omega,\ i = 1,\dots,s,\\ E_\theta\frac{\partial}{\partial\theta_i}\log p(X,\theta) = 0,\ i = 1,\dots,s.\end{cases}$$

Definition 2.4.2. The information matrix is
$$I(\theta) := \Big[E_\theta\Big(\frac{\partial}{\partial\theta_i}\log p(X,\theta)\,\frac{\partial}{\partial\theta_j}\log p(X,\theta)\Big)\Big]_{i,j=1}^s = \Big[\operatorname{Cov}_\theta\Big(\frac{\partial}{\partial\theta_i}\log p(X,\theta),\frac{\partial}{\partial\theta_j}\log p(X,\theta)\Big)\Big]_{i,j=1}^s.$$
$I(\theta)$ is strictly positive definite iff $\{\frac{\partial}{\partial\theta_i}\log p(X,\theta),\ i = 1,\dots,s\}$ is linearly independent a.s. $P_\theta$.

Theorem 2.4.3. Under the previous assumptions, if $I(\theta)$ is strictly positive definite and if $T(X)$ satisfies $E_\theta T(X)^2 < \infty$ for all $\theta$ and
$$\frac{\partial}{\partial\theta_j}E_\theta T(X) = \int T(x)\frac{\partial}{\partial\theta_j}p(x,\theta)\,\mu(dx),$$
then
$$\operatorname{Var}_\theta T(X) \ge \alpha^TI(\theta)^{-1}\alpha, \qquad \text{where } \alpha = \begin{pmatrix}\frac{\partial}{\partial\theta_1}E_\theta T(X)\\ \vdots\\ \frac{\partial}{\partial\theta_s}E_\theta T(X)\end{pmatrix}.$$

Proof. This is a direct application of Theorem 2.4.1 with the functions $\psi_i$ defined by $\psi_i(x,\theta) = \frac{\partial}{\partial\theta_i}\log p(x,\theta)$. $\square$

Reparameterization. If $\theta_i = f_i(\xi_1,\dots,\xi_s)$, $i = 1,\dots,s$, then
$$I^*(\xi) = \Big[E\Big(\frac{\partial\log p^*}{\partial\xi_i}(X,\xi)\frac{\partial\log p^*}{\partial\xi_j}(X,\xi)\Big)\Big]_{i,j=1}^s = \Big[\sum_m\sum_nE\Big(\frac{\partial\log p}{\partial\theta_m}(X,\theta)\frac{\partial\log p}{\partial\theta_n}(X,\theta)\Big)\frac{\partial\theta_m}{\partial\xi_i}\frac{\partial\theta_n}{\partial\xi_j}\Big]_{i,j=1}^s = J\,I(\theta)\,J^T,$$
where $J = \big[\partial\theta_j/\partial\xi_i\big]_{i,j=1}^s$.

Corollary 2.4.4. If $I(\theta)$ is strictly positive definite, $T_1,\dots,T_n$ are finite-variance unbiased estimators of $g_1(\theta),\dots,g_n(\theta)$ respectively, and each $T_i$ satisfies the conditions of Theorem 2.4.3, then
(2.4.1) $\quad \operatorname{Cov}_\theta(T) \succeq \Big(\dfrac{\partial g}{\partial\theta}\Big)I^{-1}(\theta)\Big(\dfrac{\partial g}{\partial\theta}\Big)^T$,
where $A\succeq B$ means $a^T(A - B)a \ge 0$ for all $a\in\mathbb{R}^n$, and
$$\frac{\partial g}{\partial\theta} := \begin{pmatrix}\frac{\partial g_1}{\partial\theta_1} & \cdots & \frac{\partial g_1}{\partial\theta_s}\\ \vdots & & \vdots\\ \frac{\partial g_n}{\partial\theta_1} & \cdots & \frac{\partial g_n}{\partial\theta_s}\end{pmatrix}.$$

Proof. Since $a^T(\operatorname{Cov}_\theta T)a = \operatorname{Var}_\theta(a^TT)$, (2.4.1) is equivalent to
$$\operatorname{Var}_\theta(a^TT) \ge a^T\Big(\frac{\partial g}{\partial\theta}\Big)I^{-1}(\theta)\Big(\frac{\partial g}{\partial\theta}\Big)^Ta \qquad \forall a\in\mathbb{R}^n.$$
But this follows at once by applying Theorem 2.4.3 to the real-valued statistic $a^TT$, for which
$$\begin{pmatrix}\frac{\partial}{\partial\theta_1}E_\theta(a^TT)\\ \vdots\\ \frac{\partial}{\partial\theta_s}E_\theta(a^TT)\end{pmatrix} = \Big(\frac{\partial g}{\partial\theta}\Big)^Ta. \ \square$$

Corollary 2.4.5. If $T_1,\dots,T_s$ are unbiased for $\theta_1,\dots,\theta_s$, then $\operatorname{Cov}_\theta(T) \succeq I(\theta)^{-1}$.

Proof. Apply Corollary 2.4.4 with $g_i(\theta) = \theta_i$, $i = 1,\dots,s$ (so that $\frac{\partial g}{\partial\theta} = I_{s\times s}$). $\square$

Remark 2.4.6. Suppose we wish to estimate $\theta_1$. If $\theta_2,\dots,\theta_s$ are known, then the CR lower bound is
$$\Big[E_\theta\Big(\frac{\partial\log p}{\partial\theta_1}(X,\theta)\Big)^2\Big]^{-1}.$$
If $\theta_2,\dots,\theta_s$ are not known, then the CR bound for estimating $\theta_1$ is the $(1,1)$ component of $I^{-1}(\theta)$, denoted $[I^{-1}(\theta)]_{11}$ (by Corollary 2.4.5). Naturally we expect
$$[I^{-1}(\theta)]_{11} \ge \Big[E_\theta\Big(\frac{\partial\log p}{\partial\theta_1}(X,\theta)\Big)^2\Big]^{-1}.$$
By the general formula for the inverse of a partitioned matrix,
$$\begin{pmatrix}A_{11} & A_{12}\\ A_{21} & A_{22}\end{pmatrix}^{-1} = \begin{pmatrix}D & -DA_{12}A_{22}^{-1}\\ -A_{22}^{-1}A_{21}D & A_{22}^{-1} + A_{22}^{-1}A_{21}DA_{12}A_{22}^{-1}\end{pmatrix}, \qquad D = (A_{11} - A_{12}A_{22}^{-1}A_{21})^{-1}$$
(see e.g. Lütkepohl, Introduction to Multiple Time Series Analysis), we find, partitioning $I(\theta)$ with scalar $(1,1)$ block $a_{11}$, that
$$[I^{-1}(\theta)]_{11} = \frac{1}{a_{11} - b^TA_{22}^{-1}b} \ge \frac{1}{a_{11}},$$
where $b^T = A_{12}$.

Theorem 2.4.7. (Order-$s$ exponential family.) Suppose that
$$p(x,\theta) = \exp\Big(\sum_{i=1}^s\eta_i(\theta)T_i(x) - B(\theta)\Big)h(x), \qquad \theta\in\Omega,$$
is an order-$s$ exponential family parameterized by $\theta = E_\theta T$, and that $\eta(\Omega)$ contains an open subset of $\mathbb{R}^s$. Then
$$I(\theta) = C^{-1}, \qquad \text{where } C = \operatorname{Cov}_\theta(T).$$

Proof. By Theorem 1.3.14 of Chapter 1 we know that, for $\eta\in\operatorname{int}(N)$,
$$\operatorname{cov}(T_i,T_j) = \frac{\partial^2A}{\partial\eta_i\,\partial\eta_j}, \qquad \text{i.e.}\quad \operatorname{Cov}(T) = \frac{\partial^2A}{\partial\eta^2} := \Big[\frac{\partial^2A}{\partial\eta_i\,\partial\eta_j}\Big]_{i,j=1}^s.$$
We also know that, if $\tilde p(x,\eta)$ is the canonical form of the density,
$$\frac{\partial\log\tilde p(x,\eta)}{\partial\eta_i}\,\frac{\partial\log\tilde p(x,\eta)}{\partial\eta_j} = \Big(T_i - \frac{\partial A}{\partial\eta_i}\Big)\Big(T_j - \frac{\partial A}{\partial\eta_j}\Big) \ \Rightarrow\ I(\eta) = \operatorname{cov}_\eta(T) = \frac{\partial^2A}{\partial\eta^2}.$$
Moreover $\theta = E_\theta T = \frac{\partial A}{\partial\eta}$, and hence by the reparameterization formula
$$I(\eta) = \frac{\partial^2A}{\partial\eta^2}\,I(\theta)\,\frac{\partial^2A}{\partial\eta^2}.$$
But the left-hand side is $\frac{\partial^2A}{\partial\eta^2}$ (a symmetric matrix), and hence
$$I(\theta) = \Big(\frac{\partial^2A}{\partial\eta^2}\Big)^{-1} = C^{-1}. \ \square$$

Examples of Information Matrices.

Example 2.4.8. $X\sim N(\theta,\Sigma)$, $\theta\in\mathbb{R}^s$, $\Sigma$ fixed and non-singular. Then
$$p(x,\theta) = \frac{1}{(\sqrt{2\pi})^s|\det\Sigma|^{1/2}}\exp\Big\{-\frac12(x - \theta)^T\Sigma^{-1}(x - \theta)\Big\}.$$
Writing $\Sigma = [\sigma_{ij}]_{i,j=1}^s$ and $\Sigma^{-1} = [\sigma^{ij}]_{i,j=1}^s$, we have
$$\frac{\partial\log p}{\partial\theta_i}(X,\theta) = \sum_{k=1}^s\sigma^{ik}(X_k - \theta_k)$$
$$\Rightarrow\ E\Big(\frac{\partial\log p}{\partial\theta_i}(X,\theta)\frac{\partial\log p}{\partial\theta_j}(X,\theta)\Big) = \sum_{k=1}^s\sum_{m=1}^s\sigma^{ik}E\big[(X_k - \theta_k)(X_m - \theta_m)\big]\sigma^{jm} = \big[\Sigma^{-1}\Sigma\Sigma^{-1}\big]_{ij} = \big[\Sigma^{-1}\big]_{ij},$$
so $I(\theta) = \Sigma^{-1}$.

Example 2.4.9. The order-two exponential family $\{N(\xi,\sigma^2) : \xi\in\mathbb{R},\ \sigma > 0\}$:
$$p(x,(\xi,\sigma)) = \frac{1}{\sqrt{2\pi}\,\sigma}e^{-\frac{(x-\xi)^2}{2\sigma^2}} = \frac{1}{\sqrt{2\pi}}\exp\Big\{-\frac{1}{2\sigma^2}x^2 + \frac{\xi}{\sigma^2}x - \frac{\xi^2}{2\sigma^2} - \log\sigma\Big\}.$$
By Theorem 2.4.7, if we let
$$\theta = \begin{pmatrix}\theta_1\\ \theta_2\end{pmatrix} = \begin{pmatrix}\xi\\ \sigma^2 + \xi^2\end{pmatrix} = \begin{pmatrix}EX\\ EX^2\end{pmatrix},$$
then we can rewrite $p(x,\theta)$ as
$$p(x,\theta) = \frac{1}{\sqrt{2\pi}}\exp\{\eta_1(\theta)x + \eta_2(\theta)x^2 - B(\theta)\},$$
and, since $\theta = E_\theta T$ with $T = (X,X^2)^T$,
$$I(\theta) = [\operatorname{Cov}(T)]^{-1}, \qquad \operatorname{Cov}(T) = \begin{pmatrix}\sigma^2 & 2\xi\sigma^2\\ 2\xi\sigma^2 & 2\sigma^4 + 4\xi^2\sigma^2\end{pmatrix}$$
(check from the moment generating function). We can now find $I^*(\xi,\sigma)$ from the reparameterization formula $I^*(\xi,\sigma) = J\,I(\theta)\,J^T$, where
$$J = \begin{pmatrix}\partial\theta_1/\partial\xi & \partial\theta_2/\partial\xi\\ \partial\theta_1/\partial\sigma & \partial\theta_2/\partial\sigma\end{pmatrix} = \begin{pmatrix}1 & 2\xi\\ 0 & 2\sigma\end{pmatrix}, \qquad J^{-1} = \begin{pmatrix}1 & -\xi/\sigma\\ 0 & 1/(2\sigma)\end{pmatrix}.$$
Then
$$I^*(\xi,\sigma)^{-1} = (J^{-1})^T\,I(\theta)^{-1}\,J^{-1} = \begin{pmatrix}1 & 0\\ -\xi/\sigma & 1/(2\sigma)\end{pmatrix}\begin{pmatrix}\sigma^2 & 2\xi\sigma^2\\ 2\xi\sigma^2 & 2\sigma^4 + 4\xi^2\sigma^2\end{pmatrix}\begin{pmatrix}1 & -\xi/\sigma\\ 0 & 1/(2\sigma)\end{pmatrix} = \begin{pmatrix}\sigma^2 & 0\\ 0 & \sigma^2/2\end{pmatrix}$$
$$\Rightarrow\ I^*(\xi,\sigma) = \begin{pmatrix}1/\sigma^2 & 0\\ 0 & 2/\sigma^2\end{pmatrix}$$
(this could also be obtained directly from $p$).

Example 2.4.10. Location-scale families. Suppose that $f$ is a probability density with respect to Lebesgue measure, that $f(x)$ is strictly positive for all $x$, and that
$$p(x,(\xi,\sigma)) = \frac1\sigma f\Big(\frac{x - \xi}{\sigma}\Big), \qquad \xi\in\mathbb{R},\ \sigma > 0.$$
Then
$$\frac{\partial\log p}{\partial\xi} = -\frac1\sigma\frac{f'\big(\frac{x-\xi}{\sigma}\big)}{f\big(\frac{x-\xi}{\sigma}\big)}, \qquad \frac{\partial\log p}{\partial\sigma} = -\frac1\sigma\Big(1 + \frac{x-\xi}{\sigma}\cdot\frac{f'\big(\frac{x-\xi}{\sigma}\big)}{f\big(\frac{x-\xi}{\sigma}\big)}\Big),$$
$$I_{11}(\xi,\sigma) = \frac{1}{\sigma^2}\int\Big(\frac{f'\big(\frac{x-\xi}{\sigma}\big)}{f\big(\frac{x-\xi}{\sigma}\big)}\Big)^2\frac1\sigma f\Big(\frac{x-\xi}{\sigma}\Big)dx = \frac{1}{\sigma^2}\int\Big(\frac{f'(x)}{f(x)}\Big)^2f(x)\,dx,$$
$$I_{22}(\xi,\sigma) = \frac{1}{\sigma^2}\int\Big(1 + x\frac{f'(x)}{f(x)}\Big)^2f(x)\,dx,$$
$$I_{12}(\xi,\sigma) = \frac{1}{\sigma^2}\int f'(x)\,dx + \frac{1}{\sigma^2}\int x\Big(\frac{f'(x)}{f(x)}\Big)^2f(x)\,dx = 0 + \frac{1}{\sigma^2}\int x\Big(\frac{f'(x)}{f(x)}\Big)^2f(x)\,dx.$$
Thus $I$ is a diagonal matrix if $f$ is symmetric about $0$ (the remaining integrand is then odd). Example 2.4.9 is a special case of this.

When the CR bound in Theorem 2.4.3 is not sharp, it can be improved by using higher-order derivatives of $p$. This is the content of the following theorem.

Theorem 2.4.11. (Bhattacharyya bounds.) Suppose the densities $p(x,\theta)$ have common support for all $\theta$. Let $T$ be unbiased for $g(\theta)$. Further assume
$$\int T(x)\frac{\partial^i}{\partial\theta^i}p(x,\theta)\,\mu(dx) = g^{(i)}(\theta) \quad \text{and} \quad \int\frac{\partial^i}{\partial\theta^i}p(x,\theta)\,\mu(dx) = 0, \qquad i = 1,\dots,k.$$
Then
$$\operatorname{Var}_\theta(T) \ge \big[g^{(1)}(\theta),\dots,g^{(k)}(\theta)\big]\,V^{-1}\,\big[g^{(1)}(\theta),\dots,g^{(k)}(\theta)\big]^T,$$
where $V$ is the covariance matrix of
$$\Big(\frac{1}{p(X,\theta)}\frac{\partial p(X,\theta)}{\partial\theta},\ \dots,\ \frac{1}{p(X,\theta)}\frac{\partial^kp(X,\theta)}{\partial\theta^k}\Big)^T,$$
assumed non-singular.

Proof.
This follows immediately from theorem 2.4.1 by taking @ i p(x ) i = 1 : : : k i (x ) = p(x1 ) @ i giving i = Cov(T i) = E (Ti ) Z @ i p(x )(dx) = g(i) (): = T (x) @ i Example 2.4.12. X1 : : : Xn iid Poisson(), > 0, and g() = e; = P (Xi = 0). Then I0(X ) is unbiased for g(). Here P xi) log Y 1 xi ! > 0 P is an exponential family of full rank so T (X ) = Xi is complete and sucient. So the UMVU estimator of g( is E I0(X1 )jT (X )] and E I0 (X1)jT (X ) = t] = P (X1 = 0jT (X ) = t) \ T (X ) = t) = P (XP1 =(T0(X ) = t) ; ; ( n ; 1) 1))t =t! = e e ;n ((n ; e (n)t =t! p(x ) = e;n+( = (1 ; n1 )t so (1 ; n1 )T (X ) is UMVU. Noting that T (X ) P (n), we can write its probability generating function as EzT (X ) = ; n e (1;z) , and hence E (1 ; n1 )T (X ) = e; and E (1 ; n1 )2T (X ) = exp(;n(1 ; (1 ; 1=n)2)) = e;2+ n : 1 ) V ar (1 ; )T (X ) = e;2 (e n ; 1): n 2.4. MULTIPARAMETER CASE 55 g0 ()2 nI1 () , where g() = e; and X 2 1 (2.4.2) I1() = E ;1 + 1 = 2 V ar X = 1 e;2 < e;2 (e n ; 1) = e;2 ( + 1 2 + ) ) CRB = n n 2 n2 (Alternatively T is the CSS for a full-rank exponential family with E (T ) = , so I () = n 1 V ar T = = nI1 ().) The Bhattacharyga bound with k = 2, The CRB for this problem is g(1)() ;e; = ; (2) e Px @p @ 1 1 (x ) = p @ = @ log p = ;n + i Px Px 2p 2 1 @ @ 1 @p 2 2 (x ) = p @2 = @2 log p + ( p @ ) = ; 2 i + ( i ; n)2 n= 0 Cov (X ) = 0 2n2=2 Hence the Bhattacharyga bound is n= 0 ;e; " # ; ; V ar T ;e e 0 2n2=2 e; 2 = e;2 ( n + 2n2 ) > CRB but less than V ar T . By taking more derivatives the bound can be make arbitrarily close to V ar T . Extends to the multiparameter case also. More on Fisher Information-Conditional FI. Suppose X Y v p(x y ) with respect to 1 2 on X Y , i.e. d P (x y) = p(x y ) d(1 2) The Fisher information about in (X Y ) is " T @ log p(X Y ) # @ log p ( X Y ) I (XY )() = E @ @ " T @ log p(X ) # @ log p ( X ) I (X ) () = E @ @ g () 56 2. 
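The conclusions of Example 2.4.12, namely that (1 − 1/n)^T is unbiased for e^{−λ} with variance e^{−2λ}(e^{λ/n} − 1) strictly above the Cramér-Rao bound λe^{−2λ}/n, can be checked by simulation. A sketch (ours; the sampler `rpois` is a standard textbook method, not from the notes):

```python
import math
import random

# Simulation sketch (ours) for Example 2.4.12: with X_1,...,X_n iid
# Poisson(lam) and T = sum X_i, the UMVU estimator of g(lam) = e^{-lam}
# is (1 - 1/n)^T.  We check its mean and variance against e^{-lam} and
# e^{-2 lam}(e^{lam/n} - 1), and compare with the Cramer-Rao bound.

def rpois(lam, rng):
    # Knuth's multiplicative method; fine for small lam
    l, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= l:
            return k
        k += 1

rng = random.Random(2)
n, lam, reps = 10, 1.5, 100_000
vals = []
for _ in range(reps):
    t = sum(rpois(lam, rng) for _ in range(n))
    vals.append((1 - 1 / n) ** t)

mean = sum(vals) / reps
var = sum((v - mean) ** 2 for v in vals) / reps

target_mean = math.exp(-lam)
target_var = math.exp(-2 * lam) * (math.exp(lam / n) - 1)
crb = lam * math.exp(-2 * lam) / n     # Cramer-Rao bound for g(lam)
# target_var exceeds crb, as the expansion in the notes shows
```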
UNBIASEDNESS Additional information on contained in Y given X = x is dened to be I (Y jX )( x) = E @ @ Z @ log p(yjx ) T @ log p(yjx ) p(yjx )2(dy) @ @ is the conditional density of Y given X = x. = ) where p(yjx ) = pp(xy (x) " # @ log p(Y jx ) T @ log p(Y jx ) Proposition 2.4.13. Y I (XY ) () = I (X ) () + E I (Y jX )( X ) Proof. Z Z @(log p(x y ) ; log p(x )) T @(log p(x y ) ; log p(x )) p(x y )2(dy)1(dx) E = @ @ = I (XY ) () ; 2I (X ) () + I (X ) () = I (XY ) () ; I (X ) () since Z Z @ log p(x ) T @ log p(x y ) p(x y )2(dy)1(dx) @ @ X Y Z @ log p(x ) T Z @p(x y ) = 2(dy) 1(dx) @ @ X Y Z @ log p(x ) T @p(x ) = 1(dx) @ @ X Z @ log p(x ) T @ log p(x ) p(x )1(dx) = @ @ X =I (X ) () X Y Corollary 2.4.14. X1 : : : Xn are independent for any , then I (X1 :::Xn) () = n X i=1 I (Xi) (): Corollary 2.4.15. Suppose one of n possible experiments is performed, depending on the value of a r.v. J (= 1 2 : : : n), where the distribution of J is independent of and 2.4. MULTIPARAMETER CASE 57 J X1 : : : Xn are independent. Then ; I (JXJ )() = EJ I XJ jJ () = n X j =1 P (J = j )I (Xj ) () where I (Xj ) () is the information associated with the j th experiment. Note: This is a property of the experiment and calculable independently of the outcome of the experiment. If the value of J is known to be j then the information becomes the conditional information given J = j which is I (Xj )(). 58 2. UNBIASEDNESS CHAPTER 3 Equivariance 3.1. Equivariance for Location family In chapter 2 we introduced unbiasedness as a constraint to eliminate estimators which may do well at a particular parameter value at the cost of poor performance elsewhere. Within this limited class we could then sometimes determine uniformly minimum risk estimators for any convex loss function. Equivariance is a more physically motivated restriction. We start by considering how equivariance enters as a natural constraint on statistics used to estimate location parameters. 
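Before formalizing this, two numerical facts are worth previewing: common location statistics satisfy the shift relation T(x + a1) = T(x) + a, and as a consequence (Theorem 3.1.2 below) the risk of such a statistic does not depend on the location parameter. A quick sketch, ours:

```python
import random
import statistics

# Sketch (ours): two numerical facts behind location equivariance.
# (1) The sample mean and median satisfy T(x + a*1) = T(x) + a.
# (2) Hence the risk of such an estimator is the same at every theta;
#     we check this for the median under squared error.

random.seed(3)
x = [random.gauss(0, 1) for _ in range(11)]
a = 273.0
shifted = [xi + a for xi in x]
mean_gap = abs(statistics.fmean(shifted) - (statistics.fmean(x) + a))
median_gap = abs(statistics.median(shifted) - (statistics.median(x) + a))

def median_risk(theta, reps=50_000, seed=4):
    rng = random.Random(seed)
    tot = 0.0
    for _ in range(reps):
        sample = [theta + rng.gauss(0, 1) for _ in range(9)]
        tot += (statistics.median(sample) - theta) ** 2
    return tot / reps

r0, r5 = median_risk(0.0), median_risk(5.0)
# mean_gap and median_gap are numerically zero, and r0 and r5 agree
```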
Suppose X = (X_1, ..., X_n) has density f(x − θ1) = f(x_1 − θ, ..., x_n − θ), where f is known, θ is a location parameter to be estimated, and L(θ, d) is the loss when θ is estimated by d. Suppose we have settled on T(x) as a reasonable estimator of θ, as measured by

R(θ, T) = E_θ L(θ, T(X)).

Suppose another statistician B wants to measure the data using a different origin. Instead of recording X_1, ..., X_n, B records X'_1 = X_1 + 273, ..., X'_n = X_n + 273, say. (This would be the case if we measured temperatures in °C and B measured them in K.) Then

(3.1.1) X' = X + a1.

On the new scale the location parameter becomes

(3.1.2) θ' = θ + a,

and the joint density of X' = (X'_1, ..., X'_n) is f(x' − θ'1) = f(x'_1 − θ', ..., x'_n − θ'). The estimated value d on the original scale becomes

(3.1.3) d' = d + a

on the new one, and the loss resulting from its use is L(θ', d'). The problem of estimating θ is said to be invariant under the transformations (3.1.1), (3.1.2) and (3.1.3) if

(3.1.4) L(θ + a, d + a) = L(θ, d).

This condition on L is equivalent to the assumption that L has the functional form L(θ, d) = ρ(d − θ) for some function ρ. Suppose we choose T(X) as a good estimator of θ on the original scale. Since estimation of θ' from X'_1, ..., X'_n is exactly the same problem, we should use

T'(X'_1, ..., X'_n) = T(X_1, ..., X_n) + a

as our estimator of θ' = θ + a. If

T(X_1 + a, ..., X_n + a) = T(X_1, ..., X_n) + a,

we say that the estimator T is equivariant under the transformations (3.1.1), (3.1.2) and (3.1.3), or location equivariant.

Remark 3.1.1. The mean, the median, weighted averages of the order statistics (with Σ w_i = 1), and the MLE of θ for the family f(x − θ1) are all location equivariant.

Theorem 3.1.2. Suppose X has density f(x − θ1) with respect to Lebesgue measure and T is equivariant for θ with loss L(θ, d) = ρ(d − θ). Then the bias, risk and variance of T do not depend on θ.

Proof. We give the proof for the bias; the other proofs are similar.
b ( ) = E Z T (x) ; 1 = (T (x) ; ) f (x ; 1) (dx) Z = (T (x + 1) ; ) f (x) (dx) by shift-invariance of = E0T (X0) ; = E0 (T (X) + ) ; = E0T (X) : Remark 3.1.3. Since the risk of an equivariant estimator is independent of , the determination of a uniformly minimum risk equivariant estimator reduces to nding the equivariant estimator with minimum (for every ) risk - such an estimator typically exists - and is called a minimum risk equivariant (MRE) estimator. Our rst step is to nd a representation of all location equivariant estimators (Just as we found a representation of all unbiased estimators in Chap 2) 3.1. EQUIVARIANCE FOR LOCATION FAMILY 61 Lemma 3.1.4. If T0 is any location-equivariant estimator then (3.1.5) T is equivariant () T (x) = T0 (x) ; U (x) where U (x) is any function such that (3.1.6) U (x + a1) = U (x) for all x and a: Proof. If T is location-equivariant, set U (x) = T0 (x) ; T (x) : Then U (x + a1) = T0 (x + a1) ; T (x + a1) = T0 (x) + a ; T (x) ; a = U (x) : Conversely if equation (3.1.5) and (3.1.6) hold, then T (x + a1) = T0 (x + a1) ; U (x + a1) = T0 (x) + a ; U (x) = T (x) + a: Lemma 3.1.5. U satises if and only if Proof. () U (x + a1) = U (x) for all x and a U (x) = v (x1 ; xn xn;1 ; xn) for some function v: U (x) = v ((x1 + a) ; (xn + a) (xn;1 + a) ; (xn + a)) : )) Choosing a = ;xn , we have U (x) = U (x1 ; xn xn;1 ; xn 0) = v (x1 ; xn xn;1 ; xn ) Combining the previous lemmas gives the following theorem. Theorem 3.1.6. If T0 is any location-equivariant estimator, then a necessary and sucient condition for T to be equivariant is that there is a function v of n ; 1 variables such that T (x) = T0 (x) ; v (y) for all x where y = (x1 ; xn xn;1 ; xn ). Example 3.1.7. If n = 1, then only equivariant estimators are X + c for c 2 R . 62 3. EQUIVARIANCE Now we can determine the location-equivariant estimator with minimum risk. Theorem 3.1.8. 
Let x have a density function f (x ; ) with respect to Lebesque measure and let y = (y1 yn;1)T where yi = xi ; xn. Suppose that the loss function is given by L ( d) = (d ; ) and that there exists an equivariant estimator T0 with nite risk. Assume that for each y there exists a number v (y) = v (y) which minimizes (3.1.7) E0 ( (T0 (X) ; v (Y)) jY = y) then there exists an MRE estimator T of given by T (x) = T0 (x) ; v (y) : Proof. If T is equivariant then T (X) = T0 (X) ; v (Y) for some v. So to nd the MRE, we need to nd v to minimize R ( T ) = E ( (T ; )) and we calculate: R ( T ) = E ( (T0 (X) ; v (Y) ; )) = E Z 0 ( (T0 (X) ; v (Y))) by theorem (3.1.2) = E0 ( (T (X) ; v (Y)) jY = y) dP0 (y) Z E0 ( (T (X) ; v (Y)) jY = y) dP0 (y) = R (0 T ) : The risk is nite since R (0 T ) R (0 T0) < 1 by assumption. Corollary 3.1.9. If is convex and not monotone, then an MRE exists and is unique if is strictly convex (under the conditions of theorem (3.1.8)). Proof. Let (c) = E ( (T0 (X) ; c) jY = y) and apply theorem (1.4.2). Corollary 3.1.10. The following results hold: 1. (d ; ) = (d ; )2 ) v (y) = E0 (T0 (X) jY = y) 2. (d ; ) = jd ; j ) v (y) =med0 (T0 (X) jY = y) ; Proof. 1. E0 ( (T0 (X) ; c) jY = y) = E0 (T0 (X) ; c)2 jY = y is minimized at c = E0 (T0 (X) jY = y). 2. E0 (jT0 (X) ; cjjY = y) is minimized at c = med0 (T0 (X) jY = y). 3.1. EQUIVARIANCE FOR LOCATION FAMILY 63 Example 3.1.11. MRE's can exist also for non-convex . Suppose is given by 1 if jd ; j > c 0 otherwise, where c is xed. Then for n = 1, using T0 (X ) = X , v will minimize (d ; ) = if and only if maximizes E0 (X ; v) = P0 (jX ; vj > c) P0 (jX ; vj c) : If f is symmetric and unimodal then v = 0 and hence T0 (X ) ; 0 = X is MRE. On the other hand if f is U-shaped, say f (x) = (x2 + 1) I;LL] where c < L, then P0 (jX ; vj c) = P0 (v ; c X v + c) is maximized at v + c = L and v ; c = ;L. Thus there are two MREE's, X ; L + c and X + L ; c. Example 3.1.12. Let X1 Xn be iid N ( 2) where 2 is known. 
Because X is complete sucient and Y = (X1 ; Xn Xn;1 ; Xn) is ancillary, T0 (X) = X is independent of Y. Therefore, by minimizing the expression (3.1.7), we nd that ;; v (y) = argmin E0 X ; v : ; If is convex and even, then (v) := E0 X ; v is convex and even. Therefore, v (y) = 0 and X is MRE. (X is also MRE when is the non-convex function of Example 3.1.11.) Theorem 3.1.13. (Least favorable property of the normal distribution) Let F be the set of all distributions with pdf relative to Lebesgue measure and with variance 1. Let X1 Xn be iid with pdf f (x ; ), where = EXi . If rn (F ) is the risk of the MRE estimator of with squared error loss, then rn (F ) has its maximum value over F when F is normal. Proof. The MREE in the normal case is X with corresponding risk, rn = E0 X 2 = n1 : However, X is an equivariant estimator of for all F 2 F and the risk ; E X ; 2 = n1 for all F 2 F : Therefore, rn (F ) n1 for all F 2 F : 64 3. EQUIVARIANCE Remark 3.1.14. From corollary (3.1.10), the MRE in general in the previous theorem is ; X ; E0 X jY = y : ; But for n 3, E0 X jY = y = 0 if and only if F is normal. So the MRE for n 3 is X if and only if F is normal. Example 3.1.15. Let X1 Xn be iid with 1 ; e;(x;)=b x F (x) = 0 x< where b is known and 2 R. Then T0 = X(1) is equivariant, complete and sucient fot (CHECK) and T0 is independent of the ancillary statistic Y = (X1 ; Xn Xn;1 ; Xn) = (X1 ; ; (Xn ; ) Xn;1 ; ; (Xn ; )) : Therefore X(1) ; v is MRE if v minimizes ; E0 X(1) ; v : (In general we have to minimize E0 ( (T0 ; v (y)) jY = y) for each y by Theorem 3.1.8 but here the complete suciency of T0 and the ancillarity of Y implies that v (y) is independent of y since the distribution of T0 is the same for all y.) We now consider some special cases: ; b=n so that MRE is X ; b . 1. If (d ; ) = (d ; )2 , then v = E0 X;(1) = (1) n 2. If (d ; ) = jd ; j, then v = med0 X(1) = b log 2=n so that MRE is X(1) ; b logn 2 . 
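The two estimators just derived for the shifted exponential family, X₍₁₎ − b/n under squared error and X₍₁₎ − b log 2 / n under absolute error, can be checked by simulation, using the fact that X₍₁₎ − θ is exponential with mean b/n. A sketch (ours, with illustrative parameter values):

```python
import math
import random

# Simulation sketch (ours) for Example 3.1.15: X_i iid with
# F(x) = 1 - exp(-(x - theta)/b) for x >= theta, b known.  Under squared
# error the MRE estimator is X_(1) - b/n, with constant risk (b/n)^2 since
# X_(1) - theta is exponential with mean b/n; under absolute error the MRE
# estimator is X_(1) - b*log(2)/n.

rng = random.Random(5)
n, b, theta, reps = 8, 2.0, 3.0, 100_000
sq_err = abs_err = 0.0
for _ in range(reps):
    x1 = theta + min(rng.expovariate(1 / b) for _ in range(n))  # X_(1)
    sq_err += (x1 - b / n - theta) ** 2 / reps
    abs_err += abs(x1 - b * math.log(2) / n - theta) / reps

# sq_err should be near (b/n)^2 = 0.0625; abs_err is empirically
# close to (b/n)*log(2)
```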
(Because FX(1) (x) = 1 ; e;(x;)n=b = 21 implies (x ; ) n=b = log 2.) 3. If 1 if jd ; j > c (d ; ) = 0 if jd ; j c ; then v is the center of the interval I of length 2c which maximizes P0 X(1) 2 I so that v = c and the MREE is X(1) ; c. Theorem 3.1.16. (Pitman Estimator) Under the assumptions of Theorem 3.1.8, if L ( d) = (d ; )2 , R uf (x ; u x ; u) du T = R f (x 1; u x n; u) du 1 n is the MREE of . Remark 3.1.17. This coincides with the Bayes estimator corresponding to an improper at prior for the location parameter. (i.e. the conditional expectation of given X = x under the assumed joint "density" f (x ; 1) of X and .) 3.1. EQUIVARIANCE FOR LOCATION FAMILY 65 Proof. Corollary (3.1.10) implies T (X) = T0 (X) ; E0 (T0 jY) where T0 is any equivariant estimator. Let T0 (X) = Xn. Because 2 Y1 3 2 1 0 ;1 3 2 X1 3 66 ... 77 = 66 ... . . . ... ... 77 66 ... 77 4 Yn;1 5 4 0 1 ;1 5 4 Xn;1 5 Xn Xn 0 0 1 we have fY1 Yn;1 Xn (y1 yn;1 xn) = fX1 Xn (y1 + xn yn;1 + xn xn) and R xf (y + x y + x x) dx E0 (XnjY = y) = R f (y 1+ x y n;1+ x x) dx = = R xf (x1 ; x + x n; 1 x ; x + x x) dx R f (y 1; x n+ x x n;1; x n+ x x) dx R1uf (xn ; u xn;1 ; un x ; u) du xn ; R f (x 1; u x n;1; u x n; u) du : 1 for T n;1 n Substituting in the expression completes the proof. Example 3.1.18. Suppose f (x) = I(;1=21=2) (x) and X1 Xn are iid with density f (x ; ) = I(;1=2+1=2) (x). Then 1 if ; 1 x and x 1 (1) (n) 2 2 f (x1 xn) = 0 otherwise and 1 if u ; 1 x and x u + 1 (1) (n) 2 2 : f (x ; u1) = 0 otherwise Therefore, since x(n) ; x(1) 1, T (x) = = R x + udu x ; R xn + du xn; 1 ;x + 1 2 ; ;x ; 1 2 (1) ( n ) 2 2 2 1 2 1 ( ) 2 1 (1) 2 1 ( ) 2 (1) x(1) ; x(n) ; = 1 x(1) + x(n) : 2 66 3. EQUIVARIANCE Remark 3.1.19. UMVU vs MRE 1. MRE estimators often exist for more than just convex loss functions 2. For convex loss functions MRE estimators generally vary (unlike UMVUE's) with the loss function. 3. UMVUE's are frequently inadmissible (i.e. can be uniformly improved). 
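The conclusion of Example 3.1.18, that the Pitman estimator for the uniform location family is the midrange, can be verified by brute-force quadrature of the formula in Theorem 3.1.16. A sketch (ours; `pitman_uniform` is our own helper):

```python
import random

# Numerical sketch (ours) of the Pitman estimator of Theorem 3.1.16 for
# f(x) = 1 on (-1/2, 1/2), computed by brute-force quadrature of
# T* = Int u * prod f(x_i - u) du / Int prod f(x_i - u) du.
# Example 3.1.18 says T* should equal the midrange (x_(1) + x_(n))/2.

def pitman_uniform(x, grid=100_001):
    lo, hi = min(x) - 0.5, max(x) + 0.5      # u outside has zero density
    a_u, b_u = max(x) - 0.5, min(x) + 0.5    # u inside has density 1
    num = den = 0.0
    for k in range(grid):
        u = lo + (hi - lo) * k / (grid - 1)
        if a_u <= u <= b_u:
            num += u
            den += 1.0
    return num / den

rng = random.Random(6)
theta = 1.3
x = [theta + rng.uniform(-0.5, 0.5) for _ in range(5)]
t_pitman = pitman_uniform(x)
midrange = (min(x) + max(x)) / 2
# t_pitman should agree with the midrange up to quadrature error
```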
The Pitman estimator, by contrast, is admissible under mild assumptions.
4. The principal application of UMVU estimators is to exponential families.
5. For location problems, UMVU estimators typically do not exist.
6. An MRE estimator is not necessarily unbiased. The following lemma examines this connection.

Lemma 3.1.20. Suppose L(θ, d) = (d − θ)² and f(x − θ1), θ ∈ R, are densities with respect to Lebesgue measure.
1. If T is equivariant with bias b, then T − b is equivariant and unbiased, with risk no larger than that of T.
2. The unique MRE estimator is unbiased.
3. If there exists a UMVU estimator and it is equivariant, then it is MRE.

Proof.
1. It is clear that T − b is equivariant and unbiased. For the risk,
R(θ, T − b) = R(0, T − b) = E_0 (T − b)² = Var_0 T ≤ E_0 T² = R(0, T) = R(θ, T).
2. The MRE estimator is unique by Corollary 3.1.9. It is unbiased by (1): otherwise its risk could be improved by using the equivariant estimator T − b, contradicting minimality.
3. The UMVU estimator is the unique minimum-risk estimator in the class U of unbiased estimators. If it lies in the class E of equivariant estimators, it is the minimum-risk estimator in U ∩ E. But the MRE estimator is also the minimum-risk estimator in U ∩ E, since by (2) it is necessarily unbiased. Hence they coincide.

Definition 3.1.21. An estimator T of g(θ) is risk-unbiased if

E_θ L(θ, T) ≤ E_θ L(θ', T) for all θ' ≠ θ,

i.e. T is "closer" on average to g(θ) than to any false value g(θ').

Example 3.1.22. (Mean unbiasedness) If L(θ, d) = (d − g(θ))² and T is risk-unbiased, then

E_θ (T − g(θ))² ≤ E_θ (T − g(θ'))² for all θ' ≠ θ.

This means (assuming that E_θ T² < ∞ and E_θ T ∈ g(Θ)) that E_θ (T − g(θ'))² is minimized by g(θ') = E_θ T. Hence g(θ) = E_θ T for all θ, i.e. T is unbiased for g in the sense defined in Chapter 2.

Example 3.1.23. (Median unbiasedness) If L(θ, d) = |d − g(θ)| and T is risk-unbiased, then

E_θ |T − g(θ)| ≤ E_θ |T − g(θ')| for all θ' ≠ θ.

But the right-hand side is minimized when g(θ') = med_θ T. Hence

(3.1.8) g(θ) = med_θ T for all θ

(assuming that E_θ |T| < ∞ and that a median of T under θ lies in g(Θ) for all θ).
An estimator satisfying equation (3.1.8) is said to be median-unbiased for g (). Theorem 3.1.24. Suppose X has density f (x ; 1) with respect to Lebesgue measure. If T is an MRE estimator with L ( d) = (d ; ) then T is risk-unbiased. Proof. By Theorem 3.1.2, E (T ; ) = E0 (T ) : If 6= 0, then T ; 0 is equivariant and by denition of MRE, we have E0 (T ) E0 (T ; 0) for all 0 ) E0 (T ) E (T ; 0 ; ) for all 0 ) E0 (T ) E (T ; 0) for all 0 ) E (T ; ) E (T ; 0) for all 0: 3.2. The General Equivariant Framework Notation X : data. : parameter space. P : = fP : 2 g. G : a group of measurable bijective transformations of X ! X . Remark 3.2.1. The operation associated with the group G is function composition. i.e. fg (x) = f g (x) = f (g (x)) : Definition 3.2.2. We say that g leaves P invariant if for all 2 , there exists 0 2 such that and there exists 2 such that X P ) gX P0 X P ) gX P : 68 3. EQUIVARIANCE If C is a class of transformations which leave P invariant then G (C ) = g1 1g2 1 gm 1 : gi 2 C m 2 N is a group (the group generated by C ), each of whose member leaves P invariant. If each member of a group G leaves P invariant we say that G leaves P invariant. If G leaves P invariant and P 6= P0 for 6= 0 then there exists a unique 0 2 such that X P ) gX P0 : This denes a bijection g : ! , via the relation g () = 0: where Pg() is the distribution of g (X) under . Definition 3.2.3. Under the preceding conditions we dene G := fg : g 2 Gg : It is clear that G is also a group. Remark 3.2.4. For g 2 G , E f (gX ) = Eg()f (X ) since Z E f (gX ) = f (g (x)) P (dx) = Z f (x) P g;1 (dx) Z = f (x) Pg() (dx) = Eg()f (X ) : Equivariant Estimation Let T : X ! D be an estimator of h (). Instead of observing X , suppose we observe X 0 = g (X ) where X 0 is a sample from Pg() . Suppose that for any g 2 G , h (g) depends on only through h (), i.e. (3.2.1) h (1 ) = h (2 ) ) h (g1 ) = h (g2 ) : Then we denote gh () = h (g) : It is clear that G := g : g 2 G 3.2. 
THE GENERAL EQUIVARIANT FRAMEWORK 69 is a group and each g is a 1 to 1 mapping from H = h () to H. Definition 3.2.5. Under the conditions prescribed for existence of the groups of mappings G and G , if L (g gd) = L ( d) for all g 2 G (3.2.2) then we say that L is invariant under G . (g if we remove 'for all g 2 G '). If conditions (3.2.1) and (3.2.2) hold, we say that the problem of estimating h () is invariant under the group of transformations G (under g if we remove for all g 2 G !). An estimator T (X) of h () is said to be equivariant under G if gT (X) = T (gX) for all g 2 G : (3:2:3) If (3.2.2) and (3.2.3) hold and T (X) is a good estimator of h () based on X then T (gX) will be a good estimator of g(h ()) based on g(X). Example 3.2.6. (Location parameter) Let h () = and g (X) = X + a. X ! X + a1 X + a P+a ! g = + a h (g) = + a g(h()) = h() + a: Then, the problem of estimating is location-invariant if L(g gd) = L ( + a d + a) = L ( d) : An estimator of h() is equivariant if T (X + a1) = T (X) + a P n ; 1 e.g. X1, median(X), n i=1 Xi , etc. Theorem 3.2.7. If T is equivariant and g leaves P invariant and then where Proof. L ( d) = L (g gd) R ( T ) = R (g () T ) for all R ( T ) := E L ( T (X)) : E L ( T (X)) = = = = E L (g () gT (X)) E L (g () T (g (X))) Eg() L (g () T (X)) R (g () T ) : 70 3. EQUIVARIANCE Corollary 3.2.8. If G is transitive over (i.e. if 1 2 2 and 1 6= 2 then there exists g 2 G such that g (1 ) = 2 ) then R ( T ) is constant for every equivariant T . Proof. Fix 0 2 . If 6= 0 , there exists g such that g (0 ) = . Hence R (0 T ) = R (g (0 ) T ) = R ( T ) : P = fP : 2 g is invariant relative to a transitive group of transformations if and only if P is a group family (see TPE1 section 1.4.1). These are families generated by subjecting a r.v. with xed distribution to a group of transformations. We can then index P using G . Thus P = Pg : g 2 G Remark 3.2.9. where = g (0 ). Theorem 3.2.10. Suppose G is transitive and G is commutative. 
If T is MRE, then T is risk unbiased. Proof. Let T be MRE. For 6= 0 , there exists g such that g (0 ) = . Consequently, ; E (L (0 T (X))) = E L g;1 () T (X) = E L ( gT (X)) E L ( T ) : (If T is equivariant then so is gT since h gT = gh T = gT (h) by commutativity.) Example 3.2.11. Suppose X N ( 2 ), = ( 2 ), and we want to estimate h () = . 1. Let G1 = fg : gx = x + c c 2 R g , then P is invariant under G1 . Tx = x + c, c 2 R , are the only equivariant functions since T (x + a) = T (x) + a for all a ) T (x+aa);T (x) = 1 for all a 0 ) T (x) = 1 ) T (x) = x + c: ) TPE stands for Theory of Point Estimation, E.L. Lehmann and George Casella, Springer Texts in Statistics, 1998. 1 3.2. THE GENERAL EQUIVARIANT FRAMEWORK 71 (a) Suppose L (d ) = (d ; )2 =2 (squared loss function measured in units of ). Then X is MRE under G1 , i.e. 2 E L ( X ) = E (X ;2 ) = 1 ( ) 2 ( T ; ) = min E0 2 : T equivariant with respect to G1 : Because ;; ; g 2 = + c 2 G 1 is not transitive and X is not risk unbiased. (For xed, choose 0 = ( 102). Then 2 1 < E L ( X ) :) E L (0 X ) = E (X10;2 ) = 10 Here, G is not transitive since there exists no g such that g ( 102) = g ( 2) : (b) Suppose L0 (d ) = (d ; )2 . X is MRE and risk-unbiased since E L0 (0 X ) = E (X ; 0)2 E (X ; )2 for all 6= 0: G transitive is therefore not necessary in theorem (3.2.10) 2. Let G2 = fg : gx = ax + c a > 0 c 2 R g . Then, G 2 = fg : g ( 2) = (a + c a2 2)g since gX N (a + c a22 ). Also h () = so that h (g ()) = h (a + c a2 2) = ah () + c . Thus, gh () = a + c and G2 = fg : g (d) = ad + cg : P is invariant under G2 and T (x) = x is equivariant under G2 since Tgx = T (ax + c) = ax + c = g (T (x)) and g(Tx) = g(x) = ax + c: X is MRE under G1 G2 and is therefore MRE under G2. There exist no G1-equivariant estimators with smaller risk so there exist no G2-estimators with smaller risk since G2equivariance is a more severe restriction than G1-equivariance. But X is not risk unbiased with respect to L (d ) = (d ; )2 =2 . 
G 2 is transitive in this case but G2 is not commutative. So, theorem (3.2.10) cannot be applied here. 72 3. EQUIVARIANCE Suppose 3.3. Location-Scale Families X = (X1 Xn )t ;nf x ; x ; n 1 where f is known. The density is with respect to Lebesgue measure and = ( ) 2 = R R + where is the location parameter and is the scale parameter. Our objective here is to estimate the location parameter. Estimation of is discussed in TPE pp.167-173. We will consider this later. Suppose X = Rn = R R+ D = R: Dene gab : R n ! Rn by gab (x) = (ax1 + b axn + b) : Then gab : ! is dened by gab () = (a + b a) : R ! R is dened by and gab (d) = ad + b: gab since gh () = g = h(g) = a + b. Lemma 3.3.1. L ( d) is invariant under G if and only if d ; L (( ) d) = (i.e. if L is a function of error measured in terms of ). ; Proof. () If d; = L (( ) d), then L (g gd) = L((a + b a) ad + b) ; a ; b = ad + b a = L ( d) : )) We need to show if d and 0 0 d0 satisfy d; = d0;00 and if L is invariant then L (( ) d) = L ((0 0) d0) : 3.3. LOCATION-SCALE FAMILIES But this holds since 8 0 > <d = 0 d 0 + ; 0 0 = > : = 00 0 + ; 0 0 73 and L is invariant under this transformation by assumption. P is invariant under each g 2 G (check!) T is equivariant () gT (x) = Tg (x) () aT (x) + b = T (ax + b) : Since G is transitive (given any ( ) we can nd a b such that (a + b a) = (0 0) every equivariant estimator of has constant risk by corollary (3.2.8). Proposition 3.3.2. If T0 is an equivariant estimator of and , satises 1 6= 0 and 1 (aX + b) = a1 (X) for all a > 0 b 2 R then T is equivariant if and only if T (X) = T0 (X) ; W (Z) 1 (X) for some W where X ;X X 1 n n;2 ; Xn Xn;1 ; Xn Z = (Z1 Zn;1) = X ; X X ; X jX ; X j : n;1 n n;1 n n;1 n Note 3.3.3. 1. We have assumed that Xi 6= Xj for all i 6= j . This is ok since fX : Xi = Xj for some i 6= j g has measure 0. 2. Z is ancillary for so if T is a function of a CSS then T is independent of Z. Proof. 1. T is equivariant (i.e. 
T (ax + b) = aT (x) + b) () U := T ;1T0 satises U (ax + b) = U (x) for all a > 0 b 2 R . (check the details - simple algebra) 2. If U = T ;1T0 = W (Z) for some W then it is easy to see that U (ax + b) = U (x) and hence by (1), T is equivariant. 3. If T is equivariant then U (ax + b) = U (x) for all a > 0 b 2 R . Setting a = jX 1; X j b = ; jX X;n X j n;1 n n;1 n ) U (X) = X ;X X 1 n n;1 ; Xn U jX ; X j jX ; X j 0 n;1 Zn;1 nZ n = U 1 Zn;2 Zn;1 0 n;1 Zn;1 =: V (Z1 Zn;1) : 74 3. EQUIVARIANCE G is transitive so by corollary (3.1.9), an equivariant estimator has constant risk. Theorem 3.3.4. Suppose T0 is equivariant with nite risk and 1 is dened as in Propo- sition 3.3.2. If E(01) ( (T0 (X) ; W (Z) 1 (X)) jZ) is minimized when W (Z) = W (Z) then T (X) = T0 (X) ; W (Z) 1 (X) is MRE. Proof. For any equivariant T = T0 ; W (Z) 1 , R ((0 1) T ) = E(01) T (X) ; 0 1 ; = E(01) E(01) ( (T0 (X) ; W (Z) 1 (X)) jZ) E(01) ;E(01) ( (T0 (X) ; W (Z) 1 (X)) jZ) = E(01) (T ) = R ((0 1) T ) : Since every equivariant estimator has constant risk, T is MRE (for all ). Example 3.3.5. Suppose X1 Xn are iid with common pdf with respect to Lebesgue measure, 1 f x ; = 1 e; x; I (1) (x) and d ; d ; 2 : L (( ) d) = = P ; Let T0 = X(1) 1 (X) = ni=2 X(i) ; X(1) , Zi = XXn;i;1 ;XXn n (i = 1 n ; 2), Zn;1 = Xn;1 ;Xn jXn;1 ;Xn j . ;X is complete sucient for We rst show that T0, 1 , and Z are ;independent. (1) 1 ( ) and Z is already ; ancillary so that X(1) 1 is independent of Z by Basu's theorem. Also nX(1) (n ; 1) X(2) ; X(1) X(n) ; X(n;1) are iid E (1) so that X(1) is independent of ; ; 1 = (n ; 1) X(2) ; X(1) + + X(n) ; X(n;1) . Thus, X(1) 1, and Z are independent. Consequently, ; E(01) (T0 (X) ; W (Z) 1 (X))2 jZ = E(01) T02 ; 2W (Z) E(01) T0 E(01) 1 + W (Z)2 E(01) 12 3.3. LOCATION-SCALE FAMILIES is minimized with respect to W (Z) if W (Z) = W (Z) where E TE W (Z) = (01)E 0 (021) 1 (01) 1 1 (n ; 1) n = n ; 1 + (n ; 1)2 = n12 since 1 ; (n ; 1 1). 
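The minimization above yields the MRE estimator X₍₁₎ − n⁻²Σ(X₍ᵢ₎ − X₍₁₎), whose risk works out to (n+1)/n³, smaller than the UMVU estimator's risk 1/(n(n−1)). A simulation sketch (ours) of that comparison, taking (θ, σ) = (0, 1) since every equivariant estimator here has constant risk:

```python
import random

# Simulation sketch (ours) for Example 3.3.5: X_i iid with density
# (1/sigma) exp(-(x - theta)/sigma) on (theta, inf).  Under the invariant
# squared error loss the MRE estimator of theta is X_(1) - S/n^2 with
# S = sum(X_(i) - X_(1)), while the UMVU estimator is X_(1) - S/(n(n-1)).
# Their risks should be near (n+1)/n^3 and 1/(n(n-1)) respectively.

rng = random.Random(7)
n, reps = 8, 200_000
r_mre = r_umvu = 0.0
for _ in range(reps):
    x = [rng.expovariate(1.0) for _ in range(n)]   # (theta, sigma) = (0, 1)
    x1 = min(x)
    s = sum(x) - n * x1
    r_mre += (x1 - s / n ** 2) ** 2 / reps
    r_umvu += (x1 - s / (n * (n - 1))) ** 2 / reps

# expect r_mre ~ (n+1)/n^3 = 9/512 and r_umvu ~ 1/56, with r_mre smaller
```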
Hence the MRE estimator of is n ; X X(i) ; X(1) : T = T0 ; W 1 = X(1) ; n12 Note 3.3.6. 2. Because i=2 1. T is not unbiased and the bias depends on 2 since E() T (X) = E(01) T (X + ) = + E(01) T (X) 1 n ; 1 = + n ; n2 = + n2 R ( T ) = E(01) = where we have T ; 0 2 1 2 n2 1 + Var(01)T Var(01) T = Var(01) X(1) + n14 Var(01) 1 = n12 + n n;4 1 3. The UMVUE of is R ( T ) = n n+3 1 : n ; X T (X) = X(1) ; n (n1; 1) X(i) ; X(1) 2 75 76 3. EQUIVARIANCE with corresponding R ( T ) = E(01) T 2 = n12 + 2 n ; 1 2 + 02 = n (n1; 1) n (n ; 1) n + 1 > n3 = R ( T ) : ; Estimation of r for some constant r. Assume L ( d) = dr and P is invariant under G = fg : gx = ax + bg. Dene gab : R n ! R n by gab (x) = ax + b = (ax1 + b axn + b) : Then gab : ! is dened by gab () = (a + b a) and h () = h ;g () = ar r = ar h () gab ab ,where h () = h (( )) = r : Thus, (d) = ar d: gab We can check invariance of loss function (i.e. L (g gd) =? L ( d)) since ar d d L (g g d) = ar r = r = L ( d) as required. Thus, the condition for equivariant of T is T (g (x)) = gT (x) or T (ax + b) = ar T (x) : Proposition 3.3.7. Let T0 be any positive equivariant estimator of r . Then T is equivariant if and only if T (X) = W (Z) T0 (X) for some W , where Z is dened on proposition (3.3.2) . Proof. If T = W (Z) T0 (X), then T (aX + b) = W (Z) ar T0 (X) = ar T (X) : 3.3. LOCATION-SCALE FAMILIES Conversely, if T is equivariant, then 77 ) U (X) := TT ((X X) 0 satises U (aX + b) = U (X) for all a > 0, b 2 R and so by proposition (3.3.2), U (X) = W (Z) for some W . Theorem 3.3.8. Let T0 be a particular equivariant estimator of r with nite risk. Suppose E(01) ( (W (Z) T0 (X)) jZ) is minimized when W (Z) = W (Z). Then T (X) = W (Z) T0 (X) is MRE. Proof. If T is any equivariant estimator then R ( T ) = R ((0 1) T ) = E(01) (W (Z) T0 (X)) ; = E(01) E(01) ( (W (Z) T0 (X)) jZ) E(01) ;E(01) ( (W (Z) T0 (X)) jZ) = E(01) (T ) = R ( T ) : Example 3.3.9. 
Let X1 Xnbe iid with density 1 e;(x;)= I(1) (x) and suppose we wish to estimate by minimizing the risk under the (invariant) squared fractional error loss function, d d 2 L(( ) d) = = ; 1 : Let T0 (X) = n X i=1 n ; X X(i) ; X(1) X1 ; X(i) = i=2 which is independent of Z (by the argument in example (3.3.5).) T0 is equivariant (T (aX + b) = aT (X)) and 78 3. EQUIVARIANCE ; E(01) E(01) ( (W (Z) T0 (X)) jZ) ; ; = E(01) E(01) (W (Z) T0 (X) ; 1)2 jZ = W (Z)2 E(01)T0 (X)2 ; 2W (Z) E(01) T0 (X) + 1 which is minimized when E T (X) W (Z) = W (Z) = (01) 0 2 = (nn;;1)1 n = n1 : E(01) T0 (X) Hence, T (X) = W (Z) T0 (X) n ; X 1 = n X(i) ; X(1) i=1 is MRE. Note 3.3.10. E() T (X) = E(01)T (X + ) = E(01)T (X) = n ; 1 6= : n CHAPTER 4 Average-Risk Optimality 4.1. Bayes Estimation The main factor contributing to the recent explosion of interest in Bayes estimation is its ability to handle extremely complicated practical problems. Some other factors which make Bayes estimation attractive are as follows. 1. The mathematical structure is very nice. 2. It permits the incorporation of prior information (although there is a lot of debate about how this should be done). 3. It provides a systematic approach to the determination of minimax estimators. In the Bayesian framework, we consider the parameter and observation vectors to be jointly distributed on X . We shall suppose that the parameter vector has the marginal distribution and that the conditional distribution of the observation vector X given = is P . For any particular valueR of , we dene the risk of the estimator T 0 at as R( T 0) := E L( T 0)j = ] = X L( T 0(x))dP (x) as before. Definition 4.1.1. For any estimator T 0 , the integral Z is called the Bayes of T 0 is thus R ( T 0) d () = Z Z X risk of T 0 with respect L( T 0(x))dP (x)d (): to the prior distribution . 
The Bayes risk E (R( T 0)) = E (L( T 0)): An estimator T is a Bayes estimator with respect to the distribution if Z R ( T ) d () = inf T0 Z R ( T 0) d () : Theorem 4.1.2. Suppose and X given = has distribution P . Then if there exists T () which minimizes E (L (( T (X))) jX = x) = Z L ( T (x)) dP (jX = x) for each x, where P ( jX = x) is the conditional (or posterior) distribution of given X = x, then T (X) is Bayes with respect to . 79 80 4. AVERAGE-RISK OPTIMALITY Proof. For any estimator T 0 , E (L ( T 0 (X)) jX) E (L ( T (X)) jX) and so, taking expectations of each side, EL ( T 0 (X)) EL ( T (X)) : Example 4.1.3. 1. Let L ( d) = (d ; g ())2 . The Bayes estimator T of g () minimizes ; E (T (X) ; g ())2 jX = x 8 x: Hence T (x) = E (g () jX = x) which is the posterior mean. 2. Let L ( d) = jd ; g ()j. The Bayes estimator T of g () minimizes E (jT (X ; g ())jjX = x) 8 x: Hence T (x) = med (g () jX = x) which is the posterior median. 3. Let L ( d) = w () (d ; g ())2 . The Bayes estimator T of g () minimizes ; E w () (T (X) ; g ())2 jX = x 8 x: It can therefore be obtained by solving 2 d R dT (x) (T (x) ; g ()) w () dP (jX = x) R 2 (T (x) ; g ()) w () dP (jX = x) = 0: = Hence g () jX = x) : T (x) = E (Ew (() w () jX = x) Example 4.1.4. Suppose X bin (n p) and p B (a b). Thus, d (p) = (;;((aa) +; (bb))) pa;1 (1 ; p)b;1 dp 0 < p < 1 a b > 0: 4.1. BAYES ESTIMATION 81 The posterior distribution of p given X = x is B (a + x b + n ; x) since f (xjp) f (p) fP jX (pjx) = X jP f (x) P ;npx (1X ; p)n;x Kpa;1 (1 ; p)b;1 = x fX (x) = c (x) pa+x;1 (1 ; p)b+n;x;1 which is a beta density. (Without doing any calculations it is clear that c (x) must be ; (a + b + n) = (; (a + x) ; (b + n ; x)).) Let L (p d) = (d ; p)2. Then the Bayes estimator of p is Z1 T (x) = pfP jX (pjx) dp = a +a +b +x n 0 a + n x : = a +a +b +b n a + b a+b+n n | {z } prior mean |{z} the usual etimator Thus T (X ) ; X=n ! 0 a.s. as n ! 1 with a and b xed and as a + b ! 0 with n xed. 
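The posterior mean of Example 4.1.4 can be double-checked without evaluating the Beta integral: the sketch below (ours; `posterior_mean` is our helper) computes E(p | X = x) by quadrature and confirms both the closed form (a + x)/(a + b + n) and its interpretation as a convex combination of the prior mean and X/n.

```python
from math import comb

# Quadrature sketch (ours) of Example 4.1.4: X ~ bin(n, p), p ~ Beta(a, b).
# The Bayes estimator under squared error is the posterior mean
# (a + x)/(a + b + n), a convex combination of the prior mean a/(a + b)
# and the MLE x/n.  We evaluate the posterior mean numerically instead of
# using the closed form.

def posterior_mean(x, n, a, b, grid=20_000):
    num = den = 0.0
    for k in range(1, grid):
        p = k / grid
        w = comb(n, x) * p ** x * (1 - p) ** (n - x) \
            * p ** (a - 1) * (1 - p) ** (b - 1)   # unnormalized posterior
        num += p * w
        den += w
    return num / den

a, b, n, x = 2.0, 3.0, 20, 7
t = posterior_mean(x, n, a, b)
closed_form = (a + x) / (a + b + n)
w = n / (a + b + n)                       # weight on the MLE x/n
combo = (1 - w) * (a / (a + b)) + w * (x / n)
# t, closed_form and combo should all agree
```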
Clearly X=n is not a Bayes estimator for any beta prior (i.e. for any a > 0 and b > 0). However if is concentrated on the two-point set f0 1g then X=n is Bayes as the following argument shows. If P (P = 1) = 1 ; and P (P = 0) = , then X is either 0 or n with probability 1 and ( =1) 1; P (P = 1jX = n) = P (PX(=XnP =n) = 1; = 1 : P (P = 0jX = n) =0 Hence the Bayes estimator satises T (n) = 1 and a similar argument shows that T (0) = 0. Hence T (x) = x=n x = 0 n. Notice that this two-point distribution is the limit in distribution of Beta(a b) as a + b ! 0 with a=(a + b) = . Theorem 4.1.5. Let L ( d) = (d ; g ())2 . Then no unbiased estimator T of g () can be Bayes unless E (T (X) ; g ())2 = 0 i.e. unless T (X ) = g () with probability 1 and the Bayes risk of T is zero. Proof. If T is Bayes with respect to some and unbiased for g () then E (T (X) j) = g () and E (g () jX = x) = T (x) 82 4. AVERAGE-RISK OPTIMALITY Hence and so E (T (X) g ()) = = = = E (E (T (X) g ()) jX) ET (X)2 E (E (T (X) g ()) j) Eg ()2 E (T (X) ; g ())2 = ET (X)2 ; 2E (T (X) g ()) + E (g ())2 = 0: Xn be iid N ( 2) with 2 known and suppose that N ( 2) with 2 known. Then the joint density of and X is Pn fX ( x) = ; p1 n e; (xi;) p1 e; (;) Example 4.1.6. Given = , let X1 2 1 2 2 1 2 2 1 2 2 2 and the posterior density of is fjX (jx) = ffX ((x)x) X P x 2 n 2 = c (x) exp ; 2 + 2 i ; 2 + 2 2 2 nx + 2 2 n 1 2 + 2 2 2 : = N n 2 + 2 For squared error loss, the Bayes estimator of is ;2 ;2 = T (X) E (jX) = n;n X + 2 + ;2 n;2 + ;2 and ; Var (jX) = n;2 1+ ;2 = E ( ; T (X))2 jX r = ER ( T) = E ( ; T (X))2 = n;2 1+ ;2 : For large n, the Bayes estimator is close to X in the sense that T (X) ; X ! 0 a.s. as n ! 1 with and xed. Also T (X) ; X ! 0 as ! 1 with n and xed. However, X is not Bayes since the prior probability distribution N ( 2 ) does not converge to a probability measure as ! 1. 4.1. 
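The normal-normal posterior of Example 4.1.6 can likewise be checked numerically. The sketch below (ours; `posterior_mean_var` is our helper) computes the posterior mean and variance by quadrature, using the fact that the likelihood depends on the data only through x̄, and compares them with the closed forms.

```python
import math

# Quadrature sketch (ours) of Example 4.1.6: X_1,...,X_n iid N(theta,
# sigma^2) with prior theta ~ N(mu, b^2).  The posterior is normal with
# mean (n*xbar/sigma^2 + mu/b^2) / (n/sigma^2 + 1/b^2) and variance
# 1 / (n/sigma^2 + 1/b^2).

def posterior_mean_var(xbar, n, sigma, mu, b, half_width=10.0, grid=40_000):
    # integrate against the unnormalized posterior density on a grid;
    # up to a factor free of theta, the likelihood is
    # exp(-n (xbar - theta)^2 / (2 sigma^2))
    num_m = num_v = den = 0.0
    lo, hi = mu - half_width, mu + half_width
    for k in range(grid + 1):
        t = lo + (hi - lo) * k / grid
        log_w = -n * (xbar - t) ** 2 / (2 * sigma ** 2) \
                - (t - mu) ** 2 / (2 * b ** 2)
        w = math.exp(log_w)
        num_m += t * w
        num_v += t ** 2 * w
        den += w
    m = num_m / den
    return m, num_v / den - m ** 2

n, sigma, mu, b, xbar = 5, 2.0, 1.0, 1.5, 2.4
m, v = posterior_mean_var(xbar, n, sigma, mu, b)
prec = n / sigma ** 2 + 1 / b ** 2
closed_m = (n * xbar / sigma ** 2 + mu / b ** 2) / prec
closed_v = 1 / prec
```

As n grows with μ and b fixed, `prec` is dominated by n/σ², so the posterior mean is pulled toward x̄, matching the remark that the Bayes estimator is close to X̄ for large n.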
However, one can formally obtain $\bar{X}$ as a Bayes estimator with respect to the improper prior distribution $\Lambda(d\theta) = d\theta$. Suppose
\[ p(x \mid \theta) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{\!n} e^{-\frac{1}{2\sigma^2}\sum (x_i - \theta)^2} \]
with $\sigma^2$ known. Setting $\Lambda(d\theta) = d\theta$, we find that the joint density of $(X, \Theta)$ with respect to Lebesgue measure is $p(x, \theta) = p(x \mid \theta)$. The posterior distribution of $\Theta$ is therefore
\[ p(\theta \mid x) = k(x)\exp\left(-\frac{1}{2\sigma^2}\Big(n\theta^2 - 2\theta\sum_{i=1}^n x_i\Big)\right) = \tilde{k}(x)\exp\left(-\frac{n}{2\sigma^2}(\theta - \bar{x})^2\right). \]
Hence $p(\cdot \mid x) = N(\bar{x}, \sigma^2/n)$, and so $\bar{X}$ is the generalized Bayes estimator of $\theta$ with respect to $L(\theta, d) = (d - \theta)^2$ and the improper Lebesgue prior $\Lambda(d\theta) = d\theta$. The improper prior densities $I_{(-\infty, \infty)}$ and $I_{[0, \infty)}$ are frequently used to account for total ignorance of parameters with values in $\mathbb{R}$ and $\mathbb{R}^+$ respectively.

Conjugate Priors

If there exists a parametric family of prior distributions such that the posterior distribution also belongs to the same parametric family, then the family is called conjugate.

Example 4.1.7. Conditional on $\sigma^2$, let $X_1, \dots, X_n$ be iid $N(0, \sigma^2)$ and define $\omega = (2\sigma^2)^{-1}$, so that
\[ f_{X \mid \omega}(x \mid \omega) = c\,\omega^r e^{-\omega\sum x_i^2}, \quad \text{where } r = \frac{n}{2}, \]
and let the prior distribution for $\omega$ be $\Gamma(g, 1/\alpha)$ with density
\[ g(\omega) = \frac{\alpha^g}{\Gamma(g)}\,\omega^{g-1}e^{-\alpha\omega}, \quad \omega \ge 0. \]
We note that
\[ E(\omega) = \frac{g}{\alpha}, \quad E(\omega^2) = \frac{g(g+1)}{\alpha^2}, \quad E(\omega^{-1}) = \frac{\alpha}{g - 1}, \quad E(\omega^{-2}) = \frac{\alpha^2}{(g - 1)(g - 2)}. \]
Then the posterior becomes
\[ f_{\omega \mid X}(\omega \mid x) = c(x)\,\omega^{r + g - 1}e^{-\omega(\sum x_i^2 + \alpha)} = \text{density at } \omega \text{ of } \Gamma\Big(r + g,\ \frac{1}{\sum x_i^2 + \alpha}\Big), \]
so that the gamma distribution family is a conjugate family for the normal distribution. If the loss is squared error, then the Bayes estimator of $\sigma^2 = (2\omega)^{-1}$ is
\[ \int_0^\infty \frac{1}{2\omega}\,f_{\omega \mid X}(\omega \mid x)\,d\omega = \frac{\alpha + \sum x_i^2}{2(r + g - 1)} = \frac{\alpha + \sum x_i^2}{n + 2g - 2}. \]
As $\alpha \to 0$ with $g = 1$, the prior density converges pointwise to the improper prior density $I_{[0, \infty)}$ and the Bayes estimator $T(X)$ satisfies $T(X) - \sum_{i=1}^n X_i^2/n \to 0$ a.s.

4.2. Minimax Estimation

Definition 4.2.1. A statistic $T$ is said to be minimax if $T$ satisfies
\[ \inf_{T'} \sup_\theta R(\theta, T') = \sup_\theta R(\theta, T). \]
We have seen that many estimation problems allow the determination of UMVU, MRE or Bayes estimators.
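The gamma conjugate update of Example 4.1.7 reduces to two arithmetic steps; a minimal sketch (our own helper names, using the shape/rate bookkeeping from the example):

```python
# Conjugate gamma update for omega = 1/(2 sigma^2) with a Gamma(g, 1/alpha)
# prior, given sum_sq = sum of x_i^2 (Example 4.1.7).

def gamma_posterior(sum_sq, n, g, alpha):
    """Posterior (shape, rate) parameters for omega."""
    return g + n / 2, alpha + sum_sq

def bayes_sigma2(sum_sq, n, g, alpha):
    """Bayes estimator of sigma^2 = 1/(2 omega) under squared error loss."""
    shape, rate = gamma_posterior(sum_sq, n, g, alpha)
    return rate / (2 * (shape - 1))   # = (alpha + sum_sq)/(n + 2g - 2)

# With g = 1, alpha -> 0 the estimator reduces to sum_sq / n:
print(bayes_sigma2(sum_sq=20.0, n=10, g=1.0, alpha=0.0))  # 2.0 = 20/10
print(bayes_sigma2(sum_sq=20.0, n=10, g=2.0, alpha=1.0))  # 21/12 = 1.75
```

The first call exhibits the limiting case noted at the end of the example: for the improper prior limit, the Bayes estimate coincides with $\sum x_i^2 / n$.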
Minimax estimators, however, are usually much harder to find. A minimax estimator minimizes the maximum risk, i.e. the risk in the worst case. This suggests a possible connection with Bayes estimation under the worst possible prior. Given $\Lambda$ on $\Theta$, let $r_\Lambda$ denote the Bayes risk of the Bayes estimator $T_\Lambda$, i.e.
\[ r_\Lambda := \int R(\theta, T_\Lambda)\,d\Lambda(\theta). \]

Definition 4.2.2. A prior distribution $\Lambda$ is least favorable if $r_\Lambda \ge r_{\Lambda'}$ for all other priors $\Lambda'$.

Theorem 4.2.3. Suppose a prior $\Lambda$ satisfies $r_\Lambda = \sup_\theta R(\theta, T_\Lambda)$. Then
1. $T_\Lambda$ is minimax.
2. If $T_\Lambda$ is the unique Bayes estimator under $\Lambda$, then it is the unique minimax estimator.
3. $\Lambda$ is least favorable.

Proof.
1. For any estimator $T$,
\[ \sup_\theta R(\theta, T) \ge \int R(\theta, T)\,d\Lambda(\theta) \ge \int R(\theta, T_\Lambda)\,d\Lambda(\theta) \quad (\text{since } T_\Lambda \text{ is Bayes for } \Lambda) = \sup_\theta R(\theta, T_\Lambda) \]
by hypothesis.
2. If $T_\Lambda$ is the unique Bayes solution, then the second inequality in (1) becomes strict (i.e. $\ge$ becomes $>$).
3. If $\Lambda'$ is another prior distribution, then
\[ r_{\Lambda'} = \int R(\theta, T_{\Lambda'})\,d\Lambda'(\theta) \le \int R(\theta, T_\Lambda)\,d\Lambda'(\theta) \quad (\text{since } T_{\Lambda'} \text{ is Bayes for } \Lambda') \le \sup_\theta R(\theta, T_\Lambda) = r_\Lambda \]
by hypothesis.

Corollary 4.2.4. If a Bayes estimator $T_\Lambda$ has constant risk, then it is minimax.

Proof. If $R(\theta, T_\Lambda)$ is independent of $\theta$, then $\int R(\theta, T_\Lambda)\,d\Lambda(\theta) = \sup_\theta R(\theta, T_\Lambda)$.

Example 4.2.5. Let $X \sim \mathrm{bin}(n, p)$, $L(p, d) = (d - p)^2$, and $\theta = p$. We shall show below that $X/n$ is not minimax. Let us use Corollary 4.2.4 to derive a minimax estimator. Suppose $p \sim B(a, b)$. Then we have from Example 4.1.4 that
\[ f_{P \mid X}(p \mid x) = c(x)\,p^{a + x - 1}(1 - p)^{b + n - x - 1} \quad \text{and} \quad T_\Lambda = \frac{a + X}{a + b + n}. \]
Thus, writing $q = 1 - p$, we obtain
\[ R(p, T_\Lambda) = \frac{1}{(a + b + n)^2}\,E_p\big(a + X - p(a + b + n)\big)^2 = \frac{npq + (qa - pb)^2}{(a + b + n)^2}. \]
Now choose $a$ and $b$ to make the risk independent of $p$. Since the coefficient of $p^2$ in the numerator is $-n + (a + b)^2$ and the coefficient of $p$ is $n - 2a(a + b)$, setting these equal to zero gives
\[ a = b = \frac{\sqrt{n}}{2}. \]
Hence
\[ T_0 = \frac{x + \sqrt{n}/2}{n + \sqrt{n}} \]
is Bayes with constant risk and is therefore minimax.
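Both claims of Example 4.2.5 can be checked numerically. The following sketch (our own helper names) evaluates the exact risk of $T_0$ by summing over the binomial distribution, confirms it is constant in $p$ and equal to $1/(4(1+\sqrt{n})^2)$, and compares it with the risk $p(1-p)/n$ of the usual estimator:

```python
# Exact risk of the minimax estimator T0(x) = (x + sqrt(n)/2)/(n + sqrt(n))
# versus X/n, for binomial(n, p) under squared error loss.
from math import comb, sqrt

def risk_T0(p, n):
    t0 = lambda x: (x + sqrt(n) / 2) / (n + sqrt(n))
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) * (t0(x) - p)**2
               for x in range(n + 1))

def risk_mle(p, n):
    return p * (1 - p) / n          # risk of X/n

n = 25
const = 1 / (4 * (1 + sqrt(n))**2)
risks = [risk_T0(p, n) for p in (0.1, 0.3, 0.5, 0.9)]
print(risks, const)                           # all four risks equal the constant
print(risk_T0(0.5, n) < risk_mle(0.5, n))     # T0 wins near p = 1/2
print(risk_T0(0.01, n) > risk_mle(0.01, n))   # X/n wins near the endpoints
```

The comparison anticipates the discussion that follows: $T_0$ beats $X/n$ only on an interval around $p = 1/2$.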
Since $T_0$ is the unique Bayes solution with respect to $\Lambda_0 = B(\sqrt{n}/2, \sqrt{n}/2)$, it follows from Theorem 4.2.3 part (2) that $T_0$ is the unique minimax estimator of $p$ with respect to squared error loss, and the risk is
\[ r_0 = E(T_0 - p)^2 = \int R(p, T_0)\,\Lambda_0(dp) = \frac{n/4}{(n + \sqrt{n})^2} = \frac{1}{4(1 + \sqrt{n})^2} = R(p, T_0). \]
The risk of the usual estimator $T(X) = X/n$ with squared error loss is
\[ R(p, T) = E_p\left(\frac{X}{n} - p\right)^2 = \frac{p(1 - p)}{n}. \]
Thus
\[ R(p, T_0) < R(p, T) \quad \text{for } \tfrac{1}{2} - c_n < p < \tfrac{1}{2} + c_n, \qquad R(p, T_0) > R(p, T) \quad \text{for } \big|p - \tfrac{1}{2}\big| > c_n, \]
where $c_n \to 0$ as $n \to \infty$ but $c_n$ is larger for small $n$. For $n = 1$,
\[ T_0(x) = \begin{cases} 1/4 & \text{if } x = 0, \\ 3/4 & \text{if } x = 1. \end{cases} \]

Remark 4.2.6. Squared error loss might not be best for estimating $p$. The penalty should perhaps be larger at the endpoints $p = 0$ and $p = 1$. If we use
\[ L(p, d) = \frac{(d - p)^2}{pq} = \frac{(d - p)^2}{p(1 - p)}, \]
then $X/n$ has constant risk and is Bayes with respect to the $U(0, 1)$ prior (check!).

Definition 4.2.7. Let $\Lambda_n$ be a sequence of priors, and suppose
\[ r_n = \int R(\theta, T_{\Lambda_n})\,d\Lambda_n(\theta) \to r \quad \text{as } n \to \infty, \]
where $T_{\Lambda_n}$ is Bayes with respect to $\Lambda_n$ for each $n$. We say that the sequence $\{\Lambda_n\}$ is least favorable if $r \ge r_\Lambda$ for all $\Lambda$, i.e. if the limit of the minimum Bayes risks is at least as large as the minimum Bayes risk for any prior. (Compare Definition 4.2.2.)

Theorem 4.2.8. If there exists an estimator $T$ and a sequence of prior distributions $\{\Lambda_n\}$ such that
\[ \sup_\theta R(\theta, T) = \lim_{n \to \infty} r_n, \]
then
1. $T$ is minimax (but not necessarily unique);
2. $\{\Lambda_n\}$ is least favorable.

Proof.
1. If $T'$ is any estimator, then
\[ \sup_\theta R(\theta, T') \ge \int R(\theta, T')\,d\Lambda_n(\theta) \ge r_n \quad \forall n, \]
and hence $\sup_\theta R(\theta, T') \ge \lim_n r_n = \sup_\theta R(\theta, T)$.
2. If $\Lambda$ is any prior, then
\[ r_\Lambda = \int R(\theta, T_\Lambda)\,d\Lambda(\theta) \le \int R(\theta, T)\,d\Lambda(\theta) \le \sup_\theta R(\theta, T) = \lim_{n \to \infty} r_n. \]

Example 4.2.9. Suppose that $X_1, \dots, X_n$ are iid $N(\theta, \sigma^2)$.
1. $\sigma^2$ known: take $\Lambda_b = N(\mu, b^2)$. In Example 4.1.6 we saw that the Bayes estimator for squared error loss is
\[ E(\Theta \mid X) = \frac{n\sigma^{-2}\bar{X} + b^{-2}\mu}{n\sigma^{-2} + b^{-2}} \]
with
\[ r_{\Lambda_b} = E(T_{\Lambda_b} - \Theta)^2 = E\big(E((\Theta - E(\Theta \mid X))^2 \mid X)\big) = E\,\mathrm{Var}(\Theta \mid X) = \frac{1}{n\sigma^{-2} + b^{-2}}. \]
As $b \to \infty$,
\[ r_{\Lambda_b} \to \frac{\sigma^2}{n} = R(\theta, \bar{X}). \]
Hence $\bar{X}$ is minimax (for squared error loss) by Theorem 4.2.8.
2. $\sigma^2$ unknown: since $\sup_{\sigma^2} R((\theta, \sigma^2), \bar{X}) = \infty$, we restrict $\sigma^2$ to satisfy $\sigma^2 \le m < \infty$. If $T$ is any estimator on $\theta \in \mathbb{R}$, $\sigma^2 \le m$, then
\[ \sup_{\theta,\ \sigma^2 \le m} R\big((\theta, \sigma^2), T\big) \ge \sup_{\theta,\ \sigma^2 = m} R\big((\theta, \sigma^2), T\big) \ge \sup_{\theta,\ \sigma^2 = m} R\big((\theta, \sigma^2), \bar{X}\big) = \sup_{\theta,\ \sigma^2 \le m} R\big((\theta, \sigma^2), \bar{X}\big), \]
where the second inequality holds because $\bar{X}$ is minimax on $\sigma^2 = m$. Hence $\bar{X}$ is minimax on $\theta \in \mathbb{R}$, $\sigma^2 \le m$. Although the restriction $\sigma^2 \le m$ was necessary to make the minimax problem meaningful, the minimax estimator $\bar{X}$ does not depend on $m$.

4.3. Minimaxity and Admissibility in Exponential Families

Given two estimators $T, T'$ such that $R(\theta, T) \le R(\theta, T')$ for all $\theta$, $T$ is preferable to $T'$ on the basis of risk.

Definition 4.3.1. $T'$ is inadmissible (with respect to the loss function $L$) if there exists $T$ such that $R(\theta, T) \le R(\theta, T')$ for all $\theta$, with strict inequality for some $\theta$. (In this case, we also say that $T$ dominates $T'$.) An estimator is admissible if it is not inadmissible.

In general, it is difficult to determine whether or not an estimator is admissible.

Theorem 4.3.2. If $T_\Lambda$ is a unique Bayes estimator with respect to some $\Lambda$, then $T_\Lambda$ is admissible. (Uniqueness means that if $\int R(\theta, T')\,d\Lambda(\theta) = \int R(\theta, T_\Lambda)\,d\Lambda(\theta)$ then $P_\theta(T' \ne T_\Lambda) = 0$ for all $\theta$, and consequently $R(\theta, T_\Lambda) = E_\theta L(\theta, T_\Lambda) = E_\theta L(\theta, T') = R(\theta, T')$ for all $\theta$.)

Proof. If $T_\Lambda$ is not admissible, then there exists $T'$ such that
\[ R(\theta, T') \le R(\theta, T_\Lambda) \text{ for all } \theta \quad \text{and} \quad R(\theta, T') < R(\theta, T_\Lambda) \text{ for some } \theta. \]
Now $\int R(\theta, T')\,d\Lambda \le \int R(\theta, T_\Lambda)\,d\Lambda$ implies that $T'$ is Bayes with respect to $\Lambda$. Hence by uniqueness $R(\theta, T') = R(\theta, T_\Lambda)$ for all $\theta$, and we obtain a contradiction.

Example 4.3.3. Suppose that $X_1, \dots, X_n$ are iid $N(\theta, \sigma^2)$ with $\sigma^2$ known, $\Theta \sim N(\mu, b^2)$ and $L(\theta, d) = (d - \theta)^2$. Then
\[ T_\Lambda = \frac{n b^2}{\sigma^2 + n b^2}\,\bar{X} + \frac{\sigma^2}{\sigma^2 + n b^2}\,\mu \tag{$*$} \]
is the unique Bayes estimator of $\theta$ and is therefore admissible.
This example shows that $a\bar{X} + b$ is admissible for all $a \in (0, 1)$ and $b \in \mathbb{R}$, since any $a \in (0, 1)$ and $b \in \mathbb{R}$ can be obtained in ($*$) by a suitable choice of $\mu$ and $b^2$, and hence $a\bar{X} + b$ is unique Bayes for some $\Lambda$.

Theorem 4.3.4. If $X \sim N(\theta, \sigma^2)$, $\sigma^2$ is known and $L(\theta, d) = (d - \theta)^2$, then $aX + b$ is inadmissible for $\theta$ if
1. $a > 1$,
2. $a < 0$, or
3. $a = 1$ and $b \ne 0$.

Proof. We first calculate
\[ R(\theta, aX + b) = E(aX + b - \theta)^2 = E\big(a(X - \theta) + (a - 1)\theta + b\big)^2 = a^2\sigma^2 + \big((a - 1)\theta + b\big)^2. \]
1. If $a > 1$, then $R(\theta, aX + b) \ge a^2\sigma^2 > \sigma^2 = R(\theta, X)$ for all $\theta$.
2. If $a < 0$, then $(a - 1)^2 > 1$ and
\[ R(\theta, aX + b) \ge \big((a - 1)\theta + b\big)^2 = (a - 1)^2\left(\theta + \frac{b}{a - 1}\right)^2 \ge \left(\theta + \frac{b}{a - 1}\right)^2 = R\left(\theta,\ \delta \equiv -\frac{b}{a - 1}\right), \]
so $aX + b$ is dominated by the constant estimator $-b/(a - 1)$ (at $\theta = -b/(a - 1)$ the inequality is strict because $R(\theta, aX + b) \ge a^2\sigma^2 > 0$).
3. If $a = 1$ and $b \ne 0$, then $R(\theta, X + b) = \sigma^2 + b^2 > \sigma^2 = R(\theta, X)$.

Corollary 4.3.5. Suppose that $X_1, \dots, X_n$ are iid $N(\theta, \sigma^2)$ with $\sigma^2$ known. Then $a\bar{X} + b$ is inadmissible if (1) $a > 1$, (2) $a < 0$, or (3) $a = 1$ and $b \ne 0$, and admissible if $0 \le a < 1$.

Proof. It remains only to establish admissibility when $a = 0$. This is immediate, since for the estimator $T(X) \equiv b$ corresponding to $a = 0$, the risk at $\theta = b$ is $R(b, b) = 0$, and every other estimator has positive risk at $b$: if $P_\theta(T'(X) \ne b) > 0$ for some $\theta$, then $T'^{-1}(\{b\}^c)$ has positive Lebesgue measure, which implies $P_b(T'(X) \ne b) > 0$ and hence $E_b(T' - b)^2 > 0$.

Proposition 4.3.6. Suppose that $X_1, \dots, X_n$ are iid $N(\theta, \sigma^2)$ with $\sigma^2$ known. Then $\bar{X}$ is admissible.

Proof. We give two proofs.
1. (Limiting Bayes method.) Assume without loss of generality that $\sigma^2 = 1$. If $\bar{X}$ is inadmissible, then there exists $T$ such that $R(\theta, T) \le \frac{1}{n}$ for all $\theta$ and $R(\theta_0, T) < \frac{1}{n}$ for some $\theta_0$. Now
\[ R(\theta, T) = \int (T(x) - \theta)^2 \prod_{i=1}^n \frac{e^{-\frac{1}{2}(x_i - \theta)^2}}{\sqrt{2\pi}}\,dx \]
is continuous in $\theta$ in a neighborhood of $\theta_0$. Hence there exist $a < \theta_0 < b$ and $c > 0$ such that
\[ R(\theta, T) < \frac{1}{n} - c \quad \text{for all } \theta \in (a, b). \]
Let $\Lambda_\tau := N(0, \tau^2)$. We shall obtain a contradiction by showing that
\[ r_\tau := E\,R(\Theta, T) = \frac{1}{\tau\sqrt{2\pi}}\int R(\theta, T)\,e^{-\frac{\theta^2}{2\tau^2}}\,d\theta \]
is smaller than the minimum Bayes risk for $\Lambda_\tau$, i.e.
\[ r_{\Lambda_\tau} = E\,R(\Theta, T_{\Lambda_\tau}) = \frac{1}{n + \tau^{-2}} = \frac{\tau^2}{n\tau^2 + 1}. \]
(As in Example 4.1.6,
\[ E\big((\Theta - T_{\Lambda_\tau}(X))^2 \mid X\big) = \mathrm{Var}(\Theta \mid X) = \frac{\tau^2}{n\tau^2 + 1}, \]
so that $r_{\Lambda_\tau} = E(\Theta - T_{\Lambda_\tau}(X))^2 = E\big(E((\Theta - T_{\Lambda_\tau}(X))^2 \mid X)\big) = \frac{\tau^2}{n\tau^2 + 1}$.) Now
\[ \frac{1}{n} - r_{\Lambda_\tau} = \frac{1}{n} - \frac{\tau^2}{n\tau^2 + 1} = \frac{1}{n(n\tau^2 + 1)}, \]
so
\[ \frac{\frac{1}{n} - r_\tau}{\frac{1}{n} - r_{\Lambda_\tau}} = \frac{n(n\tau^2 + 1)}{\tau\sqrt{2\pi}}\int\left(\frac{1}{n} - R(\theta, T)\right)e^{-\frac{\theta^2}{2\tau^2}}\,d\theta \ge \frac{n(n\tau^2 + 1)}{\tau\sqrt{2\pi}}\,c\int_a^b e^{-\frac{\theta^2}{2\tau^2}}\,d\theta, \]
where $c > 0$ is independent of $\tau$. Since $\int_a^b e^{-\theta^2/(2\tau^2)}\,d\theta \to b - a$ as $\tau \to \infty$ by the DCT, and $n(n\tau^2 + 1)/\tau \to \infty$, the ratio goes to infinity as $\tau \to \infty$. Thus, for all sufficiently large $\tau$, $r_\tau < r_{\Lambda_\tau}$, which contradicts the fact that $r_{\Lambda_\tau}$ is the minimum Bayes risk for $\Lambda_\tau$.
2. (Via the information inequality.) Assume without loss of generality that $\sigma^2 = 1$. If $T$ is any estimator of $\theta$ with finite second moment under each $P_\theta$, write $E_\theta T = \theta + b(\theta)$. Then
\[ R(\theta, T) = \mathrm{Var}_\theta T + b^2(\theta) \ge \frac{(1 + b'(\theta))^2}{n I(\theta)} + b^2(\theta) = \frac{(1 + b'(\theta))^2}{n} + b^2(\theta), \]
since $I(\theta) = 1$. ($b'(\theta)$ exists by Theorem 1.3.13.) Hence if $T$ is risk-preferable to $\bar{X}$,
\[ R(\theta, T) \le \frac{1}{n} \quad \text{for all } \theta, \tag{$*$} \]
i.e.
\[ b^2(\theta) + \frac{(1 + b'(\theta))^2}{n} \le \frac{1}{n} \quad \text{for all } \theta, \tag{$**$} \]
and so
\[ |b(\theta)| \le \frac{1}{\sqrt{n}} \quad \text{for all } \theta, \tag{$***$} \]
and $(1 + b'(\theta))^2 \le 1 \Rightarrow -2 \le b'(\theta) \le 0 \Rightarrow b$ is non-increasing.
Now we claim $b'(\theta_k) \to 0$ for some sequence $\theta_k \to \infty$. If this is not the case, then $\limsup_{\theta \to \infty} b'(\theta) < 0$, and there exist $\theta_0$ and $\varepsilon > 0$ such that $b'(\theta) < -\varepsilon$ for all $\theta > \theta_0$. Then
\[ b(\theta) = \int_{\theta_0}^\theta b'(y)\,dy + b(\theta_0) \le (\theta - \theta_0)(-\varepsilon) + b(\theta_0) \to -\infty \quad \text{as } \theta \to \infty, \]
contradicting ($***$) and thus proving the claim. Similarly, there exists $\theta_j \to -\infty$ such that $b'(\theta_j) \to 0$. Now ($**$) implies that $b(\theta_j) \to 0$ and $b(\theta_k) \to 0$. But since $b$ is non-increasing, this implies that $b(\theta) = 0$ for all $\theta$ and hence that $b'(\theta) = 0$ for all $\theta$. Hence, by the information inequality, $R(\theta, T) \ge \frac{1}{n}$, and so by ($*$),
\[ R(\theta, T) = \frac{1}{n} = R(\theta, \bar{X}), \]
so that $\bar{X}$ is admissible.

The above argument also shows that $\bar{X}$ is minimax, since there is no estimator whose maximum risk is less than $1/n$. In fact, $\bar{X}$ is the unique minimax estimator, by the following result.

Proposition 4.3.7. Suppose that $T$ has constant risk and is admissible.
Then $T$ is minimax. If in addition $L(\theta, \cdot)$ is strictly convex, then $T$ is the unique minimax estimator.

Proof. By the admissibility of $T$, if there is another estimator $T'$ with $\sup_\theta R(\theta, T') \le \sup_\theta R(\theta, T) = R(\theta, T)$ (constant), then $R(\theta, T') = R(\theta, T)$ for all $\theta$, since the risk of $T'$ cannot go strictly below that of $T$ for any $\theta$. This proves that $T$ is minimax. If the loss function is strictly convex and $T'$ is a minimax estimator such that $P(T' \ne T) > 0$, then for $T^* = \frac{1}{2}(T + T')$,
\[ R(\theta, T^*) < \tfrac{1}{2}\big(R(\theta, T) + R(\theta, T')\big) \le R(\theta, T), \]
which contradicts the admissibility of $T$.

Exponential Families (with $s = 1$). Suppose that the probability density of $X$ with respect to the $\sigma$-finite measure $\mu$ is
\[ p(x, \theta) = e^{\theta T(x) - \psi(\theta)}\,h(x), \quad \text{where } \int e^{\theta T(x)}\,h(x)\,d\mu(x) = e^{\psi(\theta)}. \]
Then $E_\theta T(X) = \psi'(\theta) =: g(\theta)$. Suppose the natural parameter space is an interval with end-points $\theta_L \le \theta_U$, $-\infty \le \theta_L \le \theta_U \le \infty$, and $L(\theta, d) = (d - g(\theta))^2$. The same argument used in the proof of Theorem 4.3.4 shows that $aT + b$ is inadmissible for (1) $a < 0$, (2) $a > 1$, or (3) $a = 1$ and $b \ne 0$. If $a = 0$, then $aT + b \equiv b$ is admissible, since $\hat{g} \equiv b$ is the only estimator with zero risk at $b$. To deal with the remaining cases, consider
\[ \frac{T + \gamma\lambda}{1 + \lambda} = \frac{1}{1 + \lambda}\,T + \frac{\gamma\lambda}{1 + \lambda}, \quad 0 \le \lambda < \infty,\ \gamma \in \mathbb{R} \quad (\text{i.e. } 0 < a \le 1). \]

Theorem 4.3.8. The estimator
\[ \frac{T + \gamma\lambda}{1 + \lambda}, \quad 0 \le \lambda < \infty,\ \gamma \in \mathbb{R}, \]
is admissible for $g(\theta) = \psi'(\theta) = E_\theta T$ if for some (and hence for all) $\theta_0 \in (\theta_L, \theta_U)$
\[ \int_{\theta_0}^{\theta_U} e^{-\gamma\lambda\theta + \lambda\psi(\theta)}\,d\theta = \infty \quad \text{and} \quad \int_{\theta_L}^{\theta_0} e^{-\gamma\lambda\theta + \lambda\psi(\theta)}\,d\theta = \infty. \]

Proof. Recall that
\[ \psi'(\theta) = E_\theta T, \qquad \psi''(\theta) = \mathrm{Var}_\theta(T), \qquad I(\theta) = E_\theta\left(\frac{\partial\log p(x, \theta)}{\partial\theta}\right)^2 = E_\theta(T - \psi'(\theta))^2 = \psi''(\theta). \]
Suppose there exists $\delta(X)$ such that
\[ E_\theta(\delta(X) - \psi'(\theta))^2 \le E_\theta\left(\frac{T + \gamma\lambda}{1 + \lambda} - \psi'(\theta)\right)^2 \quad \text{for all } \theta. \tag{$*$} \]
Writing $b(\theta)$ for the bias of $\delta$ as an estimator of $\psi'(\theta)$, we have
\[ E_\theta(\delta(X) - \psi'(\theta))^2 = \mathrm{Var}_\theta\,\delta(X) + b^2(\theta) \ge \frac{(b'(\theta) + \psi''(\theta))^2}{I(\theta)} + b^2(\theta) = b^2(\theta) + \frac{(b'(\theta) + I(\theta))^2}{I(\theta)}. \]
($b'(\theta)$ exists since $E_\theta|\delta(X)| < \infty$.) So by ($*$),
\[ b^2(\theta) + \frac{(b'(\theta) + I(\theta))^2}{I(\theta)} \le \frac{I(\theta) + \lambda^2(\gamma - \psi'(\theta))^2}{(1 + \lambda)^2}. \tag{$**$} \]
Letting
\[ h(\theta) = b(\theta) - \frac{\lambda}{1 + \lambda}\big(\gamma - \psi'(\theta)\big) = \text{bias of } \delta\ -\ \text{bias of } \frac{T + \gamma\lambda}{1 + \lambda}, \qquad h'(\theta) = b'(\theta) + \frac{\lambda}{1 + \lambda}\,\psi''(\theta), \]
($**$) is exactly equivalent to
\[ h^2(\theta) + \frac{2\lambda}{1 + \lambda}\,h(\theta)\big(\gamma - \psi'(\theta)\big) + \frac{2}{1 + \lambda}\,h'(\theta) + \frac{h'(\theta)^2}{\psi''(\theta)} \le 0. \tag{$***$} \]
Letting $k(\theta) = h(\theta)\,e^{\gamma\lambda\theta - \lambda\psi(\theta)}$ and dropping the nonnegative term $h'^2/\psi''$, ($***$) yields
\[ k^2(\theta)\,e^{-\gamma\lambda\theta + \lambda\psi(\theta)} + \frac{2}{1 + \lambda}\,k'(\theta) \le 0, \tag{$****$} \]
and in particular $k'(\theta) \le 0$ for all $\theta$. Hence $k$ is non-increasing. To prove $k(\theta) \ge 0$ for all $\theta$, suppose $k(\theta_0) < 0$. Then $k(\theta) < 0$ for all $\theta > \theta_0$, and from ($****$) we also have
\[ \frac{d}{d\theta}\left(\frac{1}{k(\theta)}\right) = -\frac{k'(\theta)}{k^2(\theta)} \ge \frac{1 + \lambda}{2}\,e^{-\gamma\lambda\theta + \lambda\psi(\theta)} \quad \text{for all } \theta > \theta_0. \tag{$*****$} \]
Integrating both sides of ($*****$) from $\theta_0$ to $\theta_1 > \theta_0$,
\[ \frac{1}{k(\theta_1)} - \frac{1}{k(\theta_0)} \ge \frac{1 + \lambda}{2}\int_{\theta_0}^{\theta_1} e^{-\gamma\lambda\theta + \lambda\psi(\theta)}\,d\theta. \]
As $\theta_1 \to \theta_U$, this integral diverges to $\infty$ by assumption. But the left-hand side is less than $-1/k(\theta_0)$, so we obtain a contradiction. Hence $k(\theta) \ge 0$ for all $\theta$. Similarly,
\[ \int_{\theta_L}^{\theta_0} e^{-\gamma\lambda\theta + \lambda\psi(\theta)}\,d\theta = \infty \ \Rightarrow\ k(\theta) \le 0 \text{ for all } \theta. \]
Hence $k(\theta) = 0$, so $h(\theta) = 0$ for all $\theta$, and therefore $h'(\theta) = 0$ for all $\theta$. Thus $h' \equiv 0$ and $h \equiv 0$ imply equality in ($***$), equality in ($**$), and finally equality in ($*$). Since $\frac{T + \gamma\lambda}{1 + \lambda}$ thus has the same risk as $\delta$, the estimator $\frac{T + \gamma\lambda}{1 + \lambda}$ is admissible.

Note 4.3.9. The case $\lambda = 0$ is of particular interest: $T$ itself is admissible for $E_\theta T$ provided $\theta_L = -\infty$ and $\theta_U = \infty$.

Example 4.3.10. Suppose $X \sim b(n, p)$ and $\theta := \log\frac{p}{1 - p}$, $-\infty < \theta < \infty$. Then
\[ p(x, \theta) = \binom{n}{x}p^x(1 - p)^{n - x} = \binom{n}{x}e^{\theta x - n\log(1 + e^\theta)}, \]
and $T(X) = X$ is admissible for
\[ \psi'(\theta) = \frac{n e^\theta}{1 + e^\theta} = np, \]
since $\theta_L = -\infty$ and $\theta_U = \infty$.

Example 4.3.11. Suppose $X_1, \dots, X_n$ are iid $N(\mu, \sigma^2)$ with $\sigma^2$ known, and set $\theta = \mu/\sigma^2$. Then
\[ p(x, \theta) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{\!n}\exp\left(-\frac{\sum x_i^2}{2\sigma^2}\right)\exp\left(\theta\sum x_i - \frac{n\sigma^2\theta^2}{2}\right), \]
so $T(X) = \sum X_i$ and $\psi'(\theta) = n\sigma^2\theta = n\mu$. $T(X)$ is admissible for $n\mu$, $\mu \in \mathbb{R}$, since $\int e^{\theta\sum x_i}\,h(x)\,\mu(dx) < \infty$ for $-\infty < \theta < \infty$, i.e. $\theta_L = -\infty$ and $\theta_U = \infty$.

4.4. Shrinkage Estimators

James-Stein Estimator.
Let $X_i \sim N(\theta_i, 1)$, $i = 1, \dots, s$, be independent random variables and let
\[ L(\theta, d) = \|d - \theta\|^2 = \sum_{i=1}^s (d_i - \theta_i)^2. \]
The usual estimator $T(X) = (X_1, \dots, X_s)$ will be shown to be inadmissible if $s > 2$. Let
\[ \delta_i^c = \left(1 - \frac{c(s - 2)}{S^2}\right)X_i, \quad i = 1, \dots, s, \]
where $S^2 = \sum_{i=1}^s X_i^2$. Here
\[ \delta^c := \begin{pmatrix} \delta_1^c \\ \vdots \\ \delta_s^c \end{pmatrix}. \]

Theorem 4.4.1.
\[ R(\theta, \delta^c) = s - (s - 2)^2\,E_\theta\left(\frac{2c - c^2}{S^2}\right). \]

To prove the theorem, we need the following two lemmas.

Lemma 4.4.2. If $X \sim N(0, 1)$ and $g: \mathbb{R} \to \mathbb{R}$ is absolutely continuous with derivative $g'$, then
\[ E|g'(X)| < \infty \ \Rightarrow\ E\,g'(X) = E(X g(X)). \]

Proof. Let $\phi(x) = (2\pi)^{-1/2}\exp(-x^2/2)$. Then
\[ x\phi(x) = -\phi'(x), \quad \text{so that} \quad \int_x^\infty z\phi(z)\,dz = \phi(x) = -\int_{-\infty}^x z\phi(z)\,dz. \tag{$*$} \]
Then
\begin{align*}
E\,g'(X) &= \int g'(x)\phi(x)\,dx \\
&= \int_0^\infty g'(x)\int_x^\infty z\phi(z)\,dz\,dx - \int_{-\infty}^0 g'(x)\int_{-\infty}^x z\phi(z)\,dz\,dx && \text{by ($*$)} \\
&= \int_0^\infty z\phi(z)\int_0^z g'(x)\,dx\,dz - \int_{-\infty}^0 z\phi(z)\int_z^0 g'(x)\,dx\,dz && \text{by Fubini} \\
&= \int_0^\infty (g(z) - g(0))\,z\phi(z)\,dz + \int_{-\infty}^0 (g(z) - g(0))\,z\phi(z)\,dz \\
&= E(X g(X)),
\end{align*}
since $E(X g(0)) = 0$.

Lemma 4.4.3. Suppose $X = (X_1, \dots, X_s)$ where $X_1, \dots, X_s$ are independent and $X_i \sim N(\theta_i, v_i)$. Suppose $f: \mathbb{R}^s \to \mathbb{R}$ is absolutely continuous in $x_i$ for almost all $(x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_s)$. Then if $E\left|\frac{\partial}{\partial x_i}f(X)\right| < \infty$,
\[ v_i\,E\,\frac{\partial}{\partial x_i}f(X) = E\big((X_i - \theta_i)f(X)\big). \]

Proof. Let $Z = (X_i - \theta_i)/\sqrt{v_i}$. Then from Lemma 4.4.2, conditionally on the remaining coordinates,
\[ E\Big(\frac{\partial}{\partial z}f\big(x_1, \dots, x_{i-1},\,\theta_i + \sqrt{v_i}\,z,\,x_{i+1}, \dots, x_s\big)\Big|_{z = Z}\Big) = E\Big(Z\,f\big(x_1, \dots, x_{i-1},\,\theta_i + \sqrt{v_i}\,Z,\,x_{i+1}, \dots, x_s\big)\Big). \]
Thus
\[ \sqrt{v_i}\,E\left(\frac{\partial f}{\partial x_i}(X)\,\Big|\,X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_s\right) = E\left(\frac{X_i - \theta_i}{\sqrt{v_i}}\,f(X)\,\Big|\,X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_s\right). \]
Taking expectation of each side, we obtain the desired result.

Now we are ready to prove the theorem.

Proof (of theorem). Let
\[ f_i(X) = \frac{c(s - 2)}{S^2}\,X_i \quad \text{and} \quad f = \begin{pmatrix} f_1 \\ \vdots \\ f_s \end{pmatrix}, \]
so that $\delta^c = X - f$ and
\begin{align*}
R(\theta, \delta^c) &= E\sum_i (X_i - f_i - \theta_i)^2 = E\left(\sum_i (X_i - \theta_i)^2 - 2\sum_i (X_i - \theta_i)f_i + \sum_i f_i^2\right) \\
&= s - 2\sum_i E\,\frac{\partial f_i}{\partial x_i}(X) + c^2(s - 2)^2\,E\,\frac{\sum_i X_i^2}{S^4} \qquad \text{(by Lemma 4.4.3)}.
\end{align*}
Since
\[ \frac{\partial f_i}{\partial x_i}(X) = c(s - 2)\left(\frac{1}{S^2} - \frac{2X_i^2}{S^4}\right), \quad \text{so that} \quad \sum_i \frac{\partial f_i}{\partial x_i} = c(s - 2)\,\frac{s - 2}{S^2}, \]
we obtain
\[ R(\theta, \delta^c) = s - 2c(s - 2)^2\,E\,\frac{1}{S^2} + c^2(s - 2)^2\,E\,\frac{1}{S^2} = s - (s - 2)^2\,E\left(\frac{2c - c^2}{S^2}\right). \]

Corollary 4.4.4. For $0 < c < 2$ and $s > 2$, $R(\theta, \delta^c) < s$ for all $\theta$, and $\delta^1$ dominates all the other $\delta^c$'s.

Proof. $2c - c^2 > 0$ for all $c \in (0, 2)$, with maximum value at $c = 1$.

Remark 4.4.5. The James-Stein estimator $\delta^1 = \left(1 - \frac{s - 2}{S^2}\right)X$ has risk $R(\theta, \delta^1) = s - (s - 2)^2\,E\,\frac{1}{S^2}$. The risk of $X$ is
\[ R(\theta, X) = E\sum_i (X_i - \theta_i)^2 = s. \]
Therefore, $X$ is not admissible for $\theta$ when $s > 2$. In fact, the James-Stein estimator is not admissible either. A strictly risk-preferable estimator can be arrived at as follows.

Empirical Bayes Interpretation of the James-Stein Estimator. Suppose $\theta_1, \dots, \theta_s$ are iid $N(0, \tau^2)$. If $\tau^2$ were known, the Bayes estimator with respect to squared error loss would be
\[ \hat\theta_i = \frac{X_i}{1 + \tau^{-2}} = \left(1 - \frac{1}{1 + \tau^2}\right)X_i, \quad i = 1, \dots, s. \]
However, if $\tau^2$ is unknown it must be replaced by some estimate. Write $X_i = \theta_i + Z_i$, $\{Z_i\}$ iid $N(0, 1)$ with $\{Z_i\}$ independent of $\{\theta_i\}$ (so that $X_i \mid \theta_i \sim N(\theta_i, 1)$). Then $\{X_i\}$ are iid $N(0, \tau^2 + 1)$. Since $S^2 = \sum_{i=1}^s X_i^2$ is complete and sufficient for $\tau^2$, $\frac{s - 2}{S^2}$ is UMVU for $\frac{1}{1 + \tau^2}$. So a natural estimator of $\theta_i$ is
\[ \delta^1 = \left(1 - \frac{s - 2}{S^2}\right)X. \]
Moreover, since $\frac{1}{1 + \tau^2} \le 1$, a better estimate of $\frac{1}{1 + \tau^2}$ is $\min\left(\frac{s - 2}{S^2}, 1\right)$, which suggests using
\[ \delta^{1+} = \left(1 - \min\left(\frac{s - 2}{S^2},\,1\right)\right)X = \max\left(1 - \frac{s - 2}{S^2},\,0\right)X. \]
$\delta^{1+}$ is strictly risk-preferable to $\delta^1$ (so $\delta^1$ is inadmissible). But $\delta^{1+}$ is also inadmissible. It was a difficult problem to find an estimator which is strictly risk-preferable to $\delta^{1+}$; in fact, it took twenty years. It is now known that there are many admissible minimax estimators. (See TPE, p. 357.)

CHAPTER 5

Large Sample Theory

5.1. Convergence in Probability and Order in Probability

Definition 5.1.1. A sequence of random variables $X_n$ is said to converge to 0 in probability if for any $\varepsilon > 0$,
\[ P(|X_n| > \varepsilon) \to 0 \quad \text{as } n \to \infty, \]
in which case we write $X_n \xrightarrow{p} 0$, or equivalently $X_n = o_p(1)$.

Definition 5.1.2. $\{X_n\}$ is bounded in probability (or tight) if for any $\varepsilon > 0$ there exists $M(\varepsilon) \in \mathbb{R}$ such that
\[ P(|X_n| > M(\varepsilon)) < \varepsilon \quad \text{for all } n, \]
in which case we write $X_n = O_p(1)$.

Definition 5.1.3. $X_n \xrightarrow{p} X \iff X_n - X \xrightarrow{p} 0$
$\iff X_n - X = o_p(1)$. Further,
\[ X_n = o_p(a_n) \iff \frac{X_n}{a_n} = o_p(1), \qquad X_n = O_p(a_n) \iff \frac{X_n}{a_n} = O_p(1). \]

Proposition 5.1.4. Let $\{X_n\}$ and $\{Y_n\}$ be sequences of r.v.'s and suppose $a_n > 0$ and $b_n > 0$. Then the following results hold.
1. If $X_n = o_p(a_n)$ and $Y_n = o_p(b_n)$, then
\[ X_n Y_n = o_p(a_n b_n), \qquad X_n + Y_n = o_p(\max(a_n, b_n)), \qquad |X_n|^r = o_p(a_n^r),\ r > 0. \]
2. If $X_n = o_p(a_n)$ and $Y_n = O_p(b_n)$, then $X_n Y_n = o_p(a_n b_n)$.
3. If $X_n = O_p(a_n)$ and $Y_n = O_p(b_n)$, then
\[ X_n Y_n = O_p(a_n b_n), \qquad X_n + Y_n = O_p(\max(a_n, b_n)), \qquad |X_n|^r = O_p(a_n^r),\ r > 0. \]

Proof. We only prove the first part and leave the remaining parts as exercises. If $\left|\frac{X_n Y_n}{a_n b_n}\right| > \varepsilon$, then either
\[ \left|\frac{Y_n}{b_n}\right| \le 1 \text{ and } \left|\frac{X_n}{a_n}\right| > \varepsilon, \quad \text{or} \quad \left|\frac{Y_n}{b_n}\right| > 1. \]
Thus, if $X_n = o_p(a_n)$ and $Y_n = o_p(b_n)$, then
\[ P\left(\frac{|X_n Y_n|}{a_n b_n} > \varepsilon\right) \le P\left(\left|\frac{X_n}{a_n}\right| > \varepsilon\right) + P\left(\left|\frac{Y_n}{b_n}\right| > 1\right) \to 0 \quad \text{as } n \to \infty. \]
If $\frac{|X_n + Y_n|}{\max(a_n, b_n)} > \varepsilon$ then, since $|X_n + Y_n| \le |X_n| + |Y_n|$, either $\frac{|X_n|}{a_n} > \frac{\varepsilon}{2}$ or $\frac{|Y_n|}{b_n} > \frac{\varepsilon}{2}$. Thus, as in the previous part,
\[ P\left(\frac{|X_n + Y_n|}{\max(a_n, b_n)} > \varepsilon\right) \to 0. \]
If $\frac{|X_n|^r}{a_n^r} > \varepsilon$, then $\frac{|X_n|}{a_n} > \varepsilon^{1/r}$. Thus $P\left(\frac{|X_n|^r}{a_n^r} > \varepsilon\right) \to 0$.

Definition 5.1.5. For a sequence of random vectors $X_n = (X_{n1}, \dots, X_{nm})$, we define $O_p$ and $o_p$ componentwise:
\[ X_n = o_p(a_n) \iff X_{nj} = o_p(a_n),\ j = 1, \dots, m; \qquad X_n = O_p(a_n) \iff X_{nj} = O_p(a_n),\ j = 1, \dots, m; \]
\[ X_n \xrightarrow{p} X \iff X_n - X = o_p(1) \iff X_{nj} \xrightarrow{p} X_j,\ j = 1, \dots, m. \]

Definition 5.1.6. $\|X_n - X\|^2 := \sum_{j=1}^m |X_{nj} - X_j|^2$.

Proposition 5.1.7. $X_n - X = o_p(1) \iff \|X_n - X\| = o_p(1)$.

Proof. ($\Rightarrow$)
\[ P(\|X_n - X\|^2 > \varepsilon) = P\left(\sum_{j=1}^m |X_{nj} - X_j|^2 > \varepsilon\right) \le \sum_{j=1}^m P\left(|X_{nj} - X_j|^2 > \frac{\varepsilon}{m}\right) \to 0. \]
($\Leftarrow$) $|X_{nj} - X_j|^2 \le \|X_n - X\|^2 \Rightarrow X_{nj} - X_j = o_p(1)$.

Proposition 5.1.8. If $X_n - Y_n \xrightarrow{p} 0$ and $Y_n - Y \xrightarrow{p} 0$, then $X_n - Y \xrightarrow{p} 0$.

Proof. $\|X_n - Y\| \le \|X_n - Y_n\| + \|Y_n - Y\| = o_p(1)$.

Proposition 5.1.9. If $X_n \xrightarrow{p} X$ and $g: \mathbb{R}^m \to \mathbb{R}$ is continuous, then $g(X_n) \xrightarrow{p} g(X)$.

Proof. We note that $X_n \xrightarrow{p} X$ if and only if every subsequence $\{X_{n_j}\}$ has a further subsequence $\{X_{n_{j_k}}\}$ such that $X_{n_{j_k}} \to X$ a.s. as $k \to \infty$. Hence, if $\{X_{n_j}\}$ is any subsequence of $\{X_n\}$, then there exists a further subsequence $X_{n_{j_k}} \xrightarrow{a.s.} X$, whence $g(X_{n_{j_k}}) \xrightarrow{a.s.} g(X)$. But this implies that $g(X_n) \xrightarrow{p} g(X)$.

Example 5.1.10. If $X_n \stackrel{d}{=} X$, then $X_n = O_p(1)$, and $X_n = o_p(a_n)$ for any sequence $\{a_n\}$ such that $a_n \to \infty$.

Example 5.1.11. Suppose $X_1, X_2, \dots$ are iid. Then $X_n = O_p(1)$ and $X_n = o_p(a_n)$ if $a_n \to \infty$. Also, by the WLLN and CLT, $S_n = \sum_{i=1}^n X_i$ satisfies
\[ S_n = \begin{cases} o_p(n) & \text{if } EX_1 = 0, \\ O_p(n) & \text{if } EX_1 \ne 0, \\ O_p(\sqrt{n}) & \text{if } EX_1 = 0 \text{ and } \mathrm{Var}(X_1) < \infty. \end{cases} \]

Taylor Expansions in Probability

Proposition 5.1.12. Suppose $X_n = a + O_p(r_n)$ with $r_n \to 0$ and $r_n > 0$. If $g$ has $s$ derivatives at $a$, then
\[ g(X_n) = \sum_{j=0}^s \frac{g^{(j)}(a)}{j!}(X_n - a)^j + o_p(r_n^s). \]

Proof. Let
\[ h(x) := \begin{cases} \dfrac{g(x) - \sum_{j=0}^s \frac{g^{(j)}(a)}{j!}(x - a)^j}{(x - a)^s} & \text{if } x \ne a, \\[1ex] 0 & \text{if } x = a. \end{cases} \]
Then $h$ is continuous at $a$, since $g$ has $s$ derivatives at $a$. Since $\frac{X_n - a}{r_n} = O_p(1)$, we have $X_n - a = o_p(1)$. Thus $h(X_n) \xrightarrow{p} h(a) = 0$, i.e. $h(X_n) = o_p(1)$, and therefore
\[ h(X_n)(X_n - a)^s = o_p(r_n^s). \]

Example 5.1.13. Suppose $\{X_n\}$ are iid $(\mu, \sigma^2)$ with $\mu > 0$. By Chebyshev, $\bar{X}_n = \mu + O_p(n^{-1/2})$. Thus
\[ \log\bar{X}_n = \log\mu + \frac{1}{\mu}(\bar{X}_n - \mu) + o_p(n^{-1/2}) \]
and
\[ \sqrt{n}\,(\log\bar{X}_n - \log\mu) = \frac{\sqrt{n}}{\mu}(\bar{X}_n - \mu) + o_p(1). \]

5.2. Convergence in Distribution

Definition 5.2.1. We say that a sequence of random vectors $X_n$ converges in distribution to $X$, and write $X_n \xrightarrow{d} X$, if $F_{X_n}(x) \to F_X(x)$ for all $x \in C = \{x : F_X \text{ is continuous at } x\}$.

Remark 5.2.2. Convergence for all $x$ is too stringent a requirement, as illustrated by $X_n \equiv \frac{1}{n}$: we would like to say $X_n \xrightarrow{d} X \equiv 0$ even though $F_{X_n}(0) = 0 \not\to F_X(0) = 1$.

Theorem 5.2.3. Suppose $X_n \sim F_n$ and $X \sim F_0$. Then the following are equivalent:
1. $X_n \xrightarrow{d} X$.
2. $\int g(x)\,dF_n(x) \to \int g(x)\,dF_0(x)$ for every bounded continuous function $g$.
3. $\int e^{it^T x}\,dF_n(x) \to \int e^{it^T x}\,dF_0(x)$ for all $t \in \mathbb{R}^m$ (i.e. $\phi_n(t) := E\,e^{it^T X_n} \to E\,e^{it^T X} =: \phi_0(t)$ for all $t \in \mathbb{R}^m$).

Proof. See Billingsley, pp. 378-383.

Proposition 5.2.4. If $X_n \xrightarrow{p} X$, then
1. $E\big(e^{it^T X_n} - e^{it^T X}\big) \to 0$ for all $t \in \mathbb{R}^m$, and
2. $X_n \xrightarrow{d} X$.
Proof.
1. Since
\[ E\left|1 - e^{it^T(X_n - X)}\right| \le E\left(\left|1 - e^{it^T(X_n - X)}\right| I_{\{\|X_n - X\| \le \delta\}}\right) + 2P(\|X_n - X\| > \delta), \]
given any $\varepsilon > 0$ we can choose $\delta$ to make the first term less than $\varepsilon/2$, and then choose $n$ large enough to make the second term less than $\varepsilon/2$. Hence $E\left|1 - e^{it^T(X_n - X)}\right| \to 0$ as $n \to \infty$, and since
\[ \left|E\big(e^{it^T X_n} - e^{it^T X}\big)\right| \le E\left|e^{it^T X}\right|\left|e^{it^T(X_n - X)} - 1\right| = E\left|1 - e^{it^T(X_n - X)}\right|, \]
part (1) follows.
2. By part (1), $\phi_n(t) \to \phi_0(t)$ for all $t$, so by Theorem 5.2.3, $X_n \xrightarrow{d} X$.

Proposition 5.2.5 (Slutsky's theorem). If $X_n - Y_n = o_p(1)$ and $X_n \xrightarrow{d} X$, then $Y_n \xrightarrow{d} X$.

Proof. For a random vector $Z$ define $\phi_Z(t) := \int e^{it^T z}\,dF_Z(z)$. Then
\[ |\phi_{Y_n}(t) - \phi_X(t)| \le |\phi_{Y_n}(t) - \phi_{X_n}(t)| + |\phi_{X_n}(t) - \phi_X(t)|. \]
By the same argument as in Proposition 5.2.4(1),
\[ |\phi_{Y_n}(t) - \phi_{X_n}(t)| \le E\left|1 - e^{-it^T(X_n - Y_n)}\right| \to 0. \]
Also, since $X_n \xrightarrow{d} X$, the second term converges to 0.

Proposition 5.2.6. If $X_n \xrightarrow{d} X$ and $h: \mathbb{R}^m \to \mathbb{R}^s$ is continuous, then $h(X_n) \xrightarrow{d} h(X)$.

Proof. Since $\exp(it^T h(\cdot))$ is a bounded continuous function, $E\,e^{it^T h(X_n)} \to E\,e^{it^T h(X)}$.

Proposition 5.2.7. If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{p} b$, then
\[ \begin{pmatrix} X_n \\ Y_n \end{pmatrix} \xrightarrow{d} \begin{pmatrix} X \\ b \end{pmatrix}. \]
Moreover, if $\dim Y_n = \dim X_n$, then
1. $X_n + Y_n \xrightarrow{d} X + b$;
2. $Y_n^T X_n \xrightarrow{d} b^T X$.

Proof. Let $Z_n = \begin{pmatrix} X_n \\ b \end{pmatrix}$. Then
\[ Z_n - \begin{pmatrix} X_n \\ Y_n \end{pmatrix} \xrightarrow{p} 0, \]
since each component converges to 0 in probability. Also $Z_n \xrightarrow{d} \begin{pmatrix} X \\ b \end{pmatrix}$ since the sequence of characteristic functions converges. Hence by Slutsky,
\[ \begin{pmatrix} X_n \\ Y_n \end{pmatrix} \xrightarrow{d} \begin{pmatrix} X \\ b \end{pmatrix}. \]
Now suppose $\dim Y_n = \dim X_n$. Because the mappings $(x, y) \mapsto x + y$ and $(x, y) \mapsto x^T y$ are continuous from $\mathbb{R}^{2m}$ to $\mathbb{R}^m$ and $\mathbb{R}$ respectively, we obtain results 1 and 2 from Proposition 5.2.6.

Asymptotic Normality

Definition 5.2.8. A sequence of random variables $\{X_n\}$ is said to be asymptotically normal with mean $\mu_n$ and standard deviation $\sigma_n$ if $\sigma_n > 0$ for all sufficiently large $n$ and $\sigma_n^{-1}(X_n - \mu_n) \xrightarrow{d} Z$, where $Z \sim N(0, 1)$. We write: $X_n$ is $AN(\mu_n, \sigma_n^2)$, where $\mu_n$ is the asymptotic mean and $\sigma_n^2$ the asymptotic variance.

Example 5.2.9. The classical CLT states that if $X_1, \dots, X_n$ are iid with mean $\mu$ and variance $\sigma^2$, then $\bar{X}_n$ is $AN(\mu, \frac{\sigma^2}{n})$, i.e.
\[ \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} Z. \]

Proposition 5.2.10. If $X_n$ is $AN(\mu, \sigma_n^2)$ with $\sigma_n \to 0$ and $g$ is differentiable at $\mu$, then $g(X_n)$ is $AN\big(g(\mu),\ g'(\mu)^2\sigma_n^2\big)$.

Proof. Because
\[ Z_n := \frac{X_n - \mu}{\sigma_n} \xrightarrow{d} Z \sim N(0, 1), \]
we have $Z_n = O_p(1)$. (Choose a pair of continuity points $z_1 < z_2$ of $F_Z$ such that $F_Z(z_1) < \varepsilon/4$ and $F_Z(z_2) > 1 - \varepsilon/4$. Then for all $n > N(\varepsilon)$, $F_{Z_n}(z_1) < \frac{\varepsilon}{3}$ and $F_{Z_n}(z_2) > 1 - \frac{\varepsilon}{3}$. For $n = 1, \dots, N(\varepsilon)$ there exists $v(\varepsilon)$ such that $P(|Z_n| > v(\varepsilon)) < \varepsilon$, $n = 1, \dots, N(\varepsilon)$. Choose $M(\varepsilon) \ge \max(v(\varepsilon), |z_1|, |z_2|)$. Then $P(|Z_n| > M(\varepsilon)) < \varepsilon$ for all $n = 1, 2, \dots$.) Thus $Z_n = O_p(1) \Rightarrow X_n = \mu + O_p(\sigma_n)$. By Proposition 5.1.12 ($g(X_n) = g(\mu) + g'(\mu)(X_n - \mu) + o_p(\sigma_n)$),
\[ \frac{g(X_n) - g(\mu)}{\sigma_n} = g'(\mu)\,\frac{X_n - \mu}{\sigma_n} + o_p(1) \xrightarrow{d} N\big(0, g'(\mu)^2\big). \]

Example 5.2.11. Let $\{X_n\}$ be iid $(\mu, \sigma^2)$. Then $\bar{X}_n := \frac{1}{n}(X_1 + \dots + X_n)$ is $AN(\mu, \frac{\sigma^2}{n})$. Suppose $\mu \ne 0$. Then $g(x) := \frac{1}{x}$ has a derivative at $\mu$, so $1/\bar{X}_n$ is $AN\big(\frac{1}{\mu}, \frac{\sigma^2}{n\mu^4}\big)$, i.e.
\[ \frac{\sqrt{n}\,\mu^2}{\sigma}\left(\frac{1}{\bar{X}_n} - \frac{1}{\mu}\right) \xrightarrow{d} N(0, 1). \]
Moreover, since $\bar{X}_n \xrightarrow{p} \mu$ (by the WLLN), Proposition 5.2.7 implies
\[ \frac{\sqrt{n}\,\bar{X}_n^2}{\sigma}\left(\frac{1}{\bar{X}_n} - \frac{1}{\mu}\right) \xrightarrow{d} N(0, 1). \]

Note 5.2.12. Although $\frac{1}{\mu}$ is the asymptotic mean of $1/\bar{X}_n$ in the above example, it is not the limit as $n \to \infty$ of $E(1/\bar{X}_n)$. In fact, $E|1/\bar{X}_n| = \infty$ if $X_1 \sim N(\mu, \sigma^2)$.

Example 5.2.13. Suppose $X_1, \dots, X_n$ are iid $(0, \sigma^2)$. Then
\[ \frac{\sqrt{n}\,\bar{X}_n}{\sigma} \xrightarrow{d} Z \sim N(0, 1) \text{ by the CLT}, \quad \text{so} \quad \frac{n\bar{X}_n^2}{\sigma^2} \xrightarrow{d} Z^2 \sim \chi^2(1), \]
since $g(x) = x^2$ is continuous.

Multivariate Asymptotic Normality

Definition 5.2.14. $X_n$ is $AN(\mu_n, \Sigma_n)$ if
1. $\Sigma_n$ has no zero diagonal elements for all large enough $n$;
2. $\lambda^T X_n$ is $AN(\lambda^T\mu_n, \lambda^T\Sigma_n\lambda)$ for all $\lambda \in \mathbb{R}^m$ such that $\lambda^T\Sigma_n\lambda > 0$ for all large enough $n$.

Recalling the Cramér-Wold device, i.e.
\[ X_n \xrightarrow{d} X \iff \lambda^T X_n \xrightarrow{d} \lambda^T X \text{ for all } \lambda \in \mathbb{R}^m, \]
we see that $X_n$ is $AN(\mu_n, \Sigma_n)$, where $\Sigma_n$ satisfies (1), if and only if
\[ \frac{\lambda^T(X_n - \mu_n)}{\sqrt{\lambda^T\Sigma_n\lambda}} \xrightarrow{d} N(0, 1) \]
for all $\lambda$ such that $\lambda^T\Sigma_n\lambda > 0$ for large enough $n$.

Proposition 5.2.15. Suppose $X_n$ is $AN(\mu, c_n^2\Sigma)$ with $c_n \to 0$. If $g: \mathbb{R}^m \to \mathbb{R}^k$ is continuously differentiable in a neighborhood of $\mu$ and $D\Sigma D^T$ has all diagonal elements greater than 0, where $D = [\partial g_i/\partial x_j]_{x = \mu}$, then $g(X_n)$ is $AN\big(g(\mu),\ c_n^2 D\Sigma D^T\big)$.

Definition 5.2.16.
A sequence of estimators $T_n = T_n(X_1, \dots, X_n)$ of $g(\theta)$ is said to be weakly consistent if $T_n \xrightarrow{P_\theta} g(\theta)$ for all $\theta$, and strongly consistent if $T_n \to g(\theta)$ a.s. $P_\theta$ for all $\theta$.

Example 5.2.17 (Moment Estimation). Let $\{X_n\}$ be iid $P_\theta$ such that $E_\theta|X|^r < \infty$ for all $\theta$. Suppose $m_j(\theta) = E_\theta X_1^j$ for $1 \le j \le r$ and $g(\theta) = \phi(m_1(\theta), \dots, m_r(\theta))$, where $\phi$ is continuous. Then
\[ T_n(X_1, \dots, X_n) = \phi(\hat{m}_1, \dots, \hat{m}_r) \to g(\theta) \text{ a.s. } P_\theta, \quad \text{where } \hat{m}_j = \frac{1}{n}\sum_{k=1}^n X_k^j. \]

Definition 5.2.18. A sequence of estimators is said to be asymptotically normal if there exist $\mu_n(\theta)$ and $\sigma_n(\theta) > 0$ such that $T_n$ is $AN(\mu_n(\theta), \sigma_n^2(\theta))$ for all $\theta$, i.e.
\[ P_\theta\left(\frac{T_n - \mu_n(\theta)}{\sigma_n(\theta)} \le x\right) \to \Phi(x) \quad \text{for all } \theta. \]

Remark 5.2.19. Suppose $T_n$ is $AN(\mu_n'(\theta), \sigma_n'^2(\theta))$. If
\[ \frac{\sigma_n'(\theta)}{\sigma_n(\theta)} \to 1 \quad \text{and} \quad \frac{\mu_n(\theta) - \mu_n'(\theta)}{\sigma_n'(\theta)} \to 0, \]
then $T_n$ is $AN(\mu_n(\theta), \sigma_n^2(\theta))$, since
\[ \frac{T_n - \mu_n}{\sigma_n} = \underbrace{\frac{\sigma_n'}{\sigma_n}}_{\to 1}\Bigg(\underbrace{\frac{T_n - \mu_n'}{\sigma_n'}}_{\xrightarrow{d}\,N(0,1)} + \underbrace{\frac{\mu_n' - \mu_n}{\sigma_n'}}_{\to 0}\Bigg) \xrightarrow{d} N(0, 1). \]

Definition 5.2.20. A sequence of asymptotically normal estimators $\{T_n\}$ is said to be asymptotically unbiased for $g(\theta)$ if
\[ \frac{\mu_n(\theta) - g(\theta)}{\sigma_n(\theta)} \to 0, \]
in which case
\[ \frac{T_n - g(\theta)}{\sigma_n(\theta)} \xrightarrow{d} N(0, 1), \]
and $\frac{\mu_n(\theta) - g(\theta)}{\sigma_n(\theta)}$ is called the standardized asymptotic bias.

Example 5.2.21. Let $\{X_n\}$ be iid $P_\theta$ such that $E_\theta|X|^{2r} < \infty$ for all $\theta$. Suppose $m_j(\theta) = E_\theta X_1^j$ for $1 \le j \le r$ and $g(\theta) = \phi(m_1(\theta), \dots, m_r(\theta))$, where $\phi$ is continuously differentiable. Then
\[ T_n = \phi(m_1(\theta), \dots, m_r(\theta)) + \sum_j \frac{\partial\phi}{\partial x_j}(m(\theta))\,(\hat{m}_j - m_j(\theta)) + o_p\left(\frac{1}{\sqrt{n}}\right), \]
where $\hat{m}_j := \frac{1}{n}\sum_{i=1}^n X_i^j$. From the CLT,
\[ \sqrt{n}\begin{pmatrix} \hat{m}_1 - m_1(\theta) \\ \vdots \\ \hat{m}_r - m_r(\theta) \end{pmatrix} \xrightarrow{d} N(0, \Sigma), \quad \text{where } \Sigma = (\sigma_{ij})_{i,j=1}^r,\ \sigma_{ij} = \mathrm{Cov}(X^i, X^j). \]
Thus
\[ \frac{T_n - \phi(m_1(\theta), \dots, m_r(\theta))}{\sigma_n(\theta)} \xrightarrow{d} N(0, 1), \quad \text{where } \sigma_n^2(\theta) = \frac{1}{n}\left(\frac{\partial\phi}{\partial m_1}, \dots, \frac{\partial\phi}{\partial m_r}\right)\Sigma\begin{pmatrix} \partial\phi/\partial m_1 \\ \vdots \\ \partial\phi/\partial m_r \end{pmatrix} \quad (\text{if } \sigma_n^2 > 0). \]

Example 5.2.22. Let $X_1, \dots, X_n \sim \Gamma(\lambda, \alpha)$ (shape $\lambda$, scale $\alpha$). Then $E X_1 = \lambda\alpha$, $E X_1^2 = \lambda(\lambda + 1)\alpha^2$, and $\mathrm{Var}\,X_1 = \lambda\alpha^2$, so
\[ \phi(m_1, m_2) = \frac{m_1^2}{m_2 - m_1^2} = \lambda, \qquad T_n = \frac{\hat{m}_1^2}{\hat{m}_2 - \hat{m}_1^2} = \frac{\bar{X}^2}{\frac{1}{n}\sum X_i^2 - \bar{X}^2}. \]
Then $T_n$ is $AN(\lambda, \sigma_n^2(\theta))$, where
\[ \sigma_n^2(\theta) = \frac{1}{n}\,\frac{1}{(m_2 - m_1^2)^2}\,(2m_1 m_2,\ -m_1^2)\,\Sigma\,\frac{1}{(m_2 - m_1^2)^2}\begin{pmatrix} 2m_1 m_2 \\ -m_1^2 \end{pmatrix} \]
and
\[ \Sigma = \begin{pmatrix} \mathrm{Cov}(X, X) & \mathrm{Cov}(X, X^2) \\ \mathrm{Cov}(X, X^2) & \mathrm{Cov}(X^2, X^2) \end{pmatrix} = \begin{pmatrix} \lambda\alpha^2 & \lambda(\lambda + 1)(\lambda + 2)\alpha^3 - \lambda^2(\lambda + 1)\alpha^3 \\ \lambda(\lambda + 1)(\lambda + 2)\alpha^3 - \lambda^2(\lambda + 1)\alpha^3 & \lambda(\lambda + 1)(\lambda + 2)(\lambda + 3)\alpha^4 - \lambda^2(\lambda + 1)^2\alpha^4 \end{pmatrix}. \]

5.3. Asymptotic Comparisons (Pitman Efficiency)

Definition 5.3.1. If $\sqrt{n}\,(T_n - g(\theta)) \xrightarrow{d} N(0, \sigma_0^2(\theta))$ and $\sqrt{n}\,\big(T'_{n'(n)}(X_1, \dots, X_{n'(n)}) - g(\theta)\big) \xrightarrow{d} N(0, \sigma_0^2(\theta))$, then the Pitman asymptotic relative efficiency (ARE) of $\{T_n\}$ relative to $\{T'_n\}$ is
\[ e_{T, T'}(\theta) = \lim_{n \to \infty} \frac{n'(n)}{n}, \]
provided the limit exists and is independent of the sequence $n'(n)$ chosen to satisfy
\[ \sqrt{n}\,\big(T'_{n'(n)} - g(\theta)\big) \xrightarrow{d} N(0, \sigma_0^2(\theta)). \]
Roughly speaking, $e_{T, T'}$ is the ratio of the numbers of observations required for the two estimators to achieve the same precision.

Theorem 5.3.2. If $T_n$ is $AN\big(g(\theta), \frac{\sigma_0^2(\theta)}{n}\big)$ and $T'_n$ is $AN\big(g(\theta), \frac{\sigma_1^2(\theta)}{n}\big)$, then
\[ e_{T, T'}(\theta) = \frac{\sigma_1^2(\theta)}{\sigma_0^2(\theta)}. \]

Proof. Let $n' = \left[n\,\frac{\sigma_1^2(\theta)}{\sigma_0^2(\theta)}\right]$, where $[x]$ is the integer part of $x$. Then
\[ \sqrt{n}\,(T'_{n'} - g(\theta)) = \frac{\sigma_0(\theta)}{\sigma_1(\theta)}\sqrt{n'}\,(T'_{n'} - g(\theta)) + o_p(1) \xrightarrow{d} N(0, \sigma_0^2(\theta)), \]
so this choice of $n'(n)$ qualifies, and $\lim n'(n)/n = \sigma_1^2(\theta)/\sigma_0^2(\theta)$, provided the limit is the same for all admissible sequences $n'(n)$ such that $\lim n'(n)/n$ exists. Suppose $n'(n)$ is any such sequence. Then
\[ \sqrt{n}\,(T'_{n'} - g(\theta)) = \sqrt{\frac{n}{n'}}\ \underbrace{\sqrt{n'}\,(T'_{n'} - g(\theta))}_{\xrightarrow{d}\,N(0,\,\sigma_1^2(\theta))} \xrightarrow{d} N(0, \sigma_0^2(\theta)) \]
forces $\frac{n}{n'}\,\sigma_1^2(\theta) \to \sigma_0^2(\theta)$, i.e. $\frac{n'}{n} \to \frac{\sigma_1^2(\theta)}{\sigma_0^2(\theta)}$.

5.4. Comparison of sample mean, median and trimmed mean

Proposition 5.4.1. Let $F$ be a cdf which is differentiable at the $100p$-th percentile (or $p$-quantile) $\xi_p$ of $F$ with $f(\xi_p) = \frac{dF}{dx}\big|_{x = \xi_p} > 0$. Let $X_1, \dots, X_n$ be iid $F$ and let $Y_1 \le \dots \le Y_n$ be the order statistics. Then
\[ Y_{[np]} \text{ is } AN\left(\xi_p,\ \frac{p(1 - p)}{n f^2(\xi_p)}\right). \]

Proof.
\[ P\left(\frac{Y_{[np]} - \xi_p}{\sqrt{p(1 - p)/[n f^2(\xi_p)]}} \le x\right) = P\big(Y_{[np]} \le \tau_n\big) = P\big(S_n \ge [np]\big) = 1 - P\left(\frac{S_n - np_n}{\sqrt{np_n(1 - p_n)}} < \frac{[np] - np_n}{\sqrt{np_n(1 - p_n)}}\right), \]
where
\[ \tau_n = \xi_p + x\sqrt{p(1 - p)/[n f^2(\xi_p)]}, \qquad S_n = \sum_{i=1}^n I_{\{X_i \le \tau_n\}} \sim b(n, p_n), \quad p_n = P(X_1 \le \tau_n). \]
We need the Berry-Esseen theorem here: if $X_1, \dots, X_n$ are iid $F$, where $E|X_1|^3 < \infty$, then there exists $c$, independent of $F$, such that
\[ \sup_x\left|P\left(\frac{S_n - E S_n}{\sqrt{\mathrm{Var}\,S_n}} \le x\right) - \Phi(x)\right| \le \frac{c}{\sqrt{n}}\,\frac{E|X_1 - E X_1|^3}{(\mathrm{Var}(X_1))^{3/2}}. \]
Hence, for our problem,
\[ \left|P\left(\frac{S_n - np_n}{\sqrt{np_n(1 - p_n)}} < \frac{[np] - np_n}{\sqrt{np_n(1 - p_n)}}\right) - \Phi\left(\frac{[np] - np_n}{\sqrt{np_n(1 - p_n)}}\right)\right| \le \frac{c}{\sqrt{n}\,[p_n(1 - p_n)]^{3/2}}\,O(1) \to 0 \]
as $n \to \infty$, since $\tau_n \to \xi_p$ and therefore $p_n(1 - p_n) \to F(\xi_p)(1 - F(\xi_p)) = p(1 - p)$. Moreover,
\[ \frac{\sqrt{n}\,(p - p_n)}{\sqrt{p_n(1 - p_n)}} \approx \frac{\sqrt{n}\,(p - P(X_1 \le \tau_n))}{\sqrt{p(1 - p)}} = -\sqrt{n}\,\frac{F(\tau_n) - F(\xi_p)}{\sqrt{p(1 - p)}} = -\sqrt{n}\,\frac{f(\xi_p)(\tau_n - \xi_p) + o(\tau_n - \xi_p)}{\sqrt{p(1 - p)}} = -\frac{f(\xi_p)\,x\sqrt{p(1 - p)/f^2(\xi_p)} + o(1)}{\sqrt{p(1 - p)}} \to -x, \]
and the result follows.

Proposition 5.4.2. Suppose $F$ is differentiable at $\xi_{p_1}, \xi_{p_2}$ with $f(\xi_{p_1}) > 0$ and $f(\xi_{p_2}) > 0$, where $0 < p_1 < p_2 < 1$. Then
\[ \begin{pmatrix} Y_{[np_1]} \\ Y_{[np_2]} \end{pmatrix} \text{ is } AN\left(\begin{pmatrix} \xi_{p_1} \\ \xi_{p_2} \end{pmatrix},\ \frac{1}{n}\begin{pmatrix} p_1(1 - p_1)/f^2(\xi_{p_1}) & p_1(1 - p_2)/(f(\xi_{p_1})f(\xi_{p_2})) \\ p_1(1 - p_2)/(f(\xi_{p_1})f(\xi_{p_2})) & p_2(1 - p_2)/f^2(\xi_{p_2}) \end{pmatrix}\right). \]

Proof. Similar to the proof of Proposition 5.4.1.

Remark 5.4.3. Suppose $X_1, X_2, \dots$ are iid $F(x - \theta)$ with $F(0) = \frac{1}{2}$ and $f(0) > 0$. Then
\[ \tilde{X} := \mathrm{median}(X_1, \dots, X_n) = X_{([n/2])} \text{ is } AN\left(\theta,\ \frac{1}{4n f^2(0)}\right). \]
If $E X_1 = \theta$, then $\bar{X}$ is $AN(\theta, \frac{\sigma^2}{n})$, where $\sigma^2 = \mathrm{Var}\,X_1$. Therefore $e_{\tilde{X}, \bar{X}} = 4\sigma^2 f^2(0)$. So if $2\sigma f(0) < 1$, then $\bar{X}$ is more efficient; if $2\sigma f(0) > 1$, then $\tilde{X}$ is more efficient.

Trimmed Mean

Proposition 5.4.4. Suppose $F$ is symmetric about 0 with $F(-c) = 0$, $F(c) = 1$ and $f(x) = \frac{dF}{dx}$ strictly positive and continuous on $(-c, c)$. Then
\[ \sqrt{n}\,\bar{X}_\alpha \xrightarrow{d} N(0, \sigma_\alpha^2), \]
where
\[ \bar{X}_\alpha = \frac{1}{n - 2[n\alpha]}\sum_{i = [n\alpha] + 1}^{n - [n\alpha]} X_{(i)}, \qquad \sigma_\alpha^2 = \frac{2}{(1 - 2\alpha)^2}\left(\int_0^{\xi(1 - \alpha)} t^2 f(t)\,dt + \alpha\,\xi^2(1 - \alpha)\right), \]
and $\xi(\cdot) := F^{-1}(\cdot)$.

Proof. Omitted.

Remark 5.4.5. Again assume $X_1, \dots, X_n$ iid $F(x - \theta)$, $F(-c) = 0$, and $f$ symmetric, continuous and positive. Then
\[ \lim_{\alpha \uparrow 1/2} \sigma_\alpha^2 = \frac{1}{4 f^2(0)} \quad (\text{since } \bar{X}_\alpha \to \tilde{X}), \qquad \lim_{\alpha \downarrow 0} \sigma_\alpha^2 = \sigma^2. \]
The AREs of $\tilde{X}$ and $\bar{X}_\alpha$ relative to $\bar{X}$ are
\[ e_{\tilde{X}, \bar{X}}(F) = 4\sigma^2 f^2(0), \qquad e_{\bar{X}_\alpha, \bar{X}}(F) = \sigma^2/\sigma_\alpha^2. \]

\[ \begin{array}{l|ccc}
f & \alpha = .125 & \alpha = .25 & \alpha = .5 \\
\hline
\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2} & .94 & .84 & 2/\pi \approx .64 \\
\frac{1}{\pi(1 + x^2)} & \infty & \infty & \infty \\
\frac{1}{2}\,e^{-|x|} & 1.40 & 1.63 & 2 \\
T(.01, 3) & .98 & .89 & .68 \\
T(.05, 3) & 1.19 & 1.09 & .83 \\
t_3 & 1.91 & 1.97 & 1.62 \\
t_5 & 1.24 & 1.21 & .96 \\
\end{array} \]
Here $T(\varepsilon, \sigma_0)$ denotes the contaminated normal density
\[ T(\varepsilon, \sigma_0)(x) = (1 - \varepsilon)\,\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2} + \frac{\varepsilon}{\sigma_0\sqrt{2\pi}}\,e^{-x^2/(2\sigma_0^2)}, \]
and the $\alpha = .5$ column gives $e_{\tilde{X}, \bar{X}}$.

Remark 5.4.6. $\bar{X}$ is inefficient for heavy tails. This is because it is sensitive to one or two extreme observations.

Remark 5.4.7. The optimal $\alpha$ depends on the distribution sampled.
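The median column of the table is just $e_{\tilde{X}, \bar{X}} = 4\sigma^2 f^2(0)$ from Remark 5.4.3, so it can be recomputed directly; a small sketch (helper names ours):

```python
# Check e_{median, mean} = 4 * sigma^2 * f(0)^2 for three table rows:
# standard normal (2/pi), double exponential f(x) = e^{-|x|}/2 (variance 2),
# and t_5 (variance 5/3, f(0) = Gamma(3)/(sqrt(5 pi) Gamma(5/2))).
from math import pi, sqrt, gamma

def are_median_mean(sigma2, f0):
    return 4 * sigma2 * f0**2

normal = are_median_mean(1.0, 1 / sqrt(2 * pi))            # 2/pi ~ 0.64
laplace = are_median_mean(2.0, 0.5)                         # exactly 2
t5_f0 = gamma(3) / (sqrt(5 * pi) * gamma(2.5))
t5 = are_median_mean(5 / 3, t5_f0)                          # ~0.96
print(normal, laplace, t5)
```

The Cauchy row is $\infty$ for every $\alpha > 0$ because $\sigma^2 = \infty$ while $\sigma_\alpha^2$ is finite.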
For large $n$ the distribution can be estimated and $\alpha$ chosen accordingly as $\hat\alpha$.

Proposition 5.4.8.
(a) For $f$ symmetric about 0 with $f(x) \le f(0) < \infty$,
\[ e_{\tilde X,\bar X}(F) \ge \tfrac13. \]
The lower bound is attained if and only if $F$ is uniform.
(b) If $F$ satisfies the hypotheses of Proposition 5.4.4, then
\[ e_{\bar X_\alpha,\bar X}(F) \ge (1-2\alpha)^2. \]
If in addition $F$ is unimodal, with $f$ non-increasing on $[0,\infty)$, then
\[ e_{\bar X_\alpha,\bar X}(F) \ge \frac{1}{1+4\alpha}. \]

Proof. (a) $e_{\tilde X,\bar X}(F) = 4\sigma^2 f^2(0)$. Since $e_{\tilde X,\bar X}$ is independent of scale, assume $f(0) = 1$ and $0 \le f(x) \le 1$. Then we need to minimize $\int x^2 f(x)\,dx$ subject to $0 \le f(x) \le 1$, $f(0) = 1$ and $\int f(x)\,dx = 1$. Use the method of Lagrange multipliers, i.e. minimize
\[ S = \int x^2 f(x)\,dx - \lambda\Bigl( \int f(x)\,dx - 1 \Bigr) = \int (x^2 - \lambda) f(x)\,dx + \lambda \]
with respect to $\lambda$ and $f$, subject to $0 \le f(x) \le 1$ and $f(0) = 1$. For each fixed $\lambda > 0$, $S$ is minimized by choosing $f$ as large as possible where $x^2 < \lambda$ and as small as possible where $x^2 > \lambda$, i.e.
\[ f(x) = 1 \ \text{ for } |x| < \sqrt\lambda, \qquad f(x) = 0 \ \text{ for } |x| \ge \sqrt\lambda. \]
Setting $\partial S/\partial\lambda = 0$ gives $\int f(x)\,dx = 1$, i.e. $\int_{-\sqrt\lambda}^{\sqrt\lambda} 1\,dx = 1$, so $\sqrt\lambda = 1/2$. Thus $f(x) = I_{[-1/2,1/2]}(x)$ and $\sigma^2 = \frac{1}{12}$, so $\min e_{\tilde X,\bar X}(F) = 4\cdot\frac{1}{12} = \frac13$.

CHAPTER 6

Maximum Likelihood Estimation

6.1. Consistency

Suppose that $X_1, X_2, \dots$ are iid $P_\theta$, $\theta \in \Theta$, and make the following assumptions.
(A0) $P_\theta \ne P_{\theta_0}$ if $\theta \ne \theta_0$.
(A1) The $P_\theta$, $\theta \in \Theta$, have common support.
(A2) $\frac{dP_\theta}{d\mu}(x) = f(x,\theta)$.
(A3) The true parameter $\theta_0 \in \mathrm{int}(\Theta)$.

Definition 6.1.1. $L(x,\theta) = \sum_{i=1}^n \log f(x_i,\theta)$ is called the log likelihood. An estimator $\hat\theta$ of $\theta$ is called a (global) MLE if
\[ L\bigl(x, \hat\theta(x)\bigr) = \sup_{\theta\in\Theta} L(x,\theta). \]
The MLE of $g(\theta)$ is defined to be $g(\hat\theta)$.

Likelihood equations. If $\theta = (\theta_1,\dots,\theta_d)$, $\Theta$ is an open subset of $\mathbb{R}^d$ and $L$ is differentiable on $\Theta$, then $\hat\theta$ (if it exists) satisfies
\[ \frac{\partial L}{\partial\theta_j}\bigl(x, \hat\theta(x)\bigr) = 0, \qquad 1 \le j \le d. \]
In general $\nabla L(x,\theta) = 0$ may not have a unique solution. Sometimes the likelihood is unbounded and the MLE does not exist.

Example 6.1.2. Let
\[ f(x,\theta) = a e^{-ax}\, I_{[0,\nu)}(x) + b e^{-a\nu - b(x-\nu)}\, I_{[\nu,\infty)}(x), \qquad \theta = (a,b,\nu) \in (0,\infty)^3, \]
so that
\[ 1 - F(x,\theta) = \begin{cases} e^{-ax}, & 0 \le x < \nu \\ e^{-a\nu - b(x-\nu)}, & x \ge \nu, \end{cases} \qquad \text{Hazard rate} = \frac{f(x,\theta)}{1 - F(x,\theta)} = \begin{cases} a, & x < \nu \\ b, & x \ge \nu. \end{cases} \]
The log likelihood is
\[ L(x, a, b, \nu) = \sum_{i=1}^k \bigl[ -a x_{(i)} + \log a \bigr] + \sum_{i=k+1}^n \bigl[ -a\nu - b(x_{(i)} - \nu) + \log b \bigr], \]
where $x_{(i)}$ is the $i$th order statistic and $k = \#\{x_i : x_i < \nu\}$. For any fixed $a$, taking $b = (x_{(n)} - \nu)^{-1}$ and letting $\nu \uparrow x_{(n)}$, we see that $L\bigl(x, a, (x_{(n)}-\nu)^{-1}, \nu\bigr) \to \infty$ as $\nu \uparrow x_{(n)}$, and hence a global MLE does not exist. However, if we restrict $\nu < x_{(n-1)}$, the constrained MLE exists and is consistent.

Lemma 6.1.3. Let
\[ I(\theta_j \mid \theta_i) = -E_{\theta_i} \log \frac{f(X,\theta_j)}{f(X,\theta_i)} = -\int \log\frac{f(x,\theta_j)}{f(x,\theta_i)}\, f(x,\theta_i)\, d\mu(x) \]
be the Kullback–Leibler discrepancy of $f(\cdot,\theta_j)$ relative to $f(\cdot,\theta_i)$. Then $I(\theta_j \mid \theta_i) \ge 0$, with equality if and only if $\theta_j = \theta_i$ (under (A0)).

Proof. Jensen's inequality gives
\[ -E_{\theta_i} \log\frac{f(X,\theta_j)}{f(X,\theta_i)} \ge -\log E_{\theta_i} \frac{f(X,\theta_j)}{f(X,\theta_i)} = -\log \int_{\{x: f(x,\theta_i)>0\}} f(x,\theta_j)\, d\mu(x) \ge -\log 1 = 0. \]
Equality holds if and only if $\frac{f(X,\theta_j)}{f(X,\theta_i)}$ is constant a.s. $P_{\theta_i}$ and $\int_{\{x: f(x,\theta_i)>0\}} f(x,\theta_j)\,d\mu(x) = 1$. The latter equality implies $P_{\theta_j} \ll P_{\theta_i}$, and then the former implies $P_{\theta_j} = P_{\theta_i}$, since $\int c\, f(x,\theta_i)\, d\mu = 1$ implies $c = 1$.

Theorem 6.1.4. Suppose $\Theta = \{\theta_0, \dots, \theta_k\}$ and conditions (A0) and (A2) hold. Then the MLE $\hat\theta_n = \hat\theta(X_1,\dots,X_n)$ is unique for sufficiently large $n$ and $\hat\theta_n \xrightarrow{a.s.} \theta_0$.

Proof. Suppose that $\theta_0$ is the true parameter value. Then Lemma 6.1.3 and the SLLN imply that
\[ -\frac1n \sum_{i=1}^n \log\frac{f(X_i,\theta_j)}{f(X_i,\theta_0)} \to I(\theta_j \mid \theta_0) \quad \text{a.s. } P_{\theta_0}, \qquad j = 1,\dots,k. \]
Hence for $n$ sufficiently large,
\[ -\frac1n \sum_{i=1}^n \log\frac{f(X_i,\theta_j)}{f(X_i,\theta_0)} > \frac12\, I(\theta_j \mid \theta_0), \]
i.e.
\[ L(X,\theta_0) - L(X,\theta_j) > \frac n2\, I(\theta_j \mid \theta_0) > 0 \quad \text{if } \theta_j \ne \theta_0. \tag{6.1.1} \]
So for all $n$ sufficiently large, $L(X,\theta_j)$ has a unique maximum at $\theta_j = \theta_0$. Hence $\hat\theta_{ML} \to \theta_0$ a.s. $P_{\theta_0}$, i.e. $\hat\theta_{ML}$ is strongly consistent.

Remark 6.1.5. Theorem 6.1.4 may not hold if $\Theta$ is countably infinite (see the example on page 410 of TPE).

Theorem 6.1.6. Suppose conditions (A0)–(A3) hold and, for almost all $x$, $f(x,\theta)$ is differentiable with respect to $\theta \in N$ with continuous derivative $f'(x,\theta)$, where $N$ is an open subset of $\Theta$ containing $\theta_0$ and $\Theta \subseteq \mathbb{R}$. Then with $P_{\theta_0}$ probability 1, for $n$ large,
\[ L'(\theta, X) = \sum_{i=1}^n \frac{f'(X_i,\theta)}{f(X_i,\theta)} = 0 \]
has a root $\hat\theta_n$, and $\hat\theta_n \to \theta_0$ a.s. $P_{\theta_0}$.

Proof. Choose $\delta > 0$ small so that $(\theta_0-\delta, \theta_0+\delta) \subseteq N$ and let
\[ S_n = \bigl\{ x : L(\theta_0,x) > L(\theta_0-\delta,x) \text{ and } L(\theta_0,x) > L(\theta_0+\delta,x) \bigr\}. \]
From equation (6.1.1), a.s. $P_{\theta_0}$, $L(\theta_0,X) - L(\theta_0\pm\delta,X) > 0$ for all $n$ large. Hence there exists $\hat\theta_n(X) \in (\theta_0-\delta, \theta_0+\delta)$ such that $L'(\hat\theta_n) = 0$. If there exists more than one such root in $(\theta_0-\delta,\theta_0+\delta)$, choose the one closest to $\theta_0$ (the set of roots is closed since $f'(x,\theta)$ is continuous in $\theta$; note that there could be two closest roots, in which case choose the larger). Call this root $\hat\theta_n$. Let
\[ A_\delta = \bigl\{ X = (X_1,X_2,\dots) : \exists\, \hat\theta_n \in (\theta_0-\delta,\theta_0+\delta) \text{ s.t. } L'(\hat\theta_n) = 0 \text{ for all sufficiently large } n,\ \hat\theta_n \text{ closest to } \theta_0 \bigr\}. \]
Then $P_{\theta_0}(A_\delta) = 1$. Define $A_0 = \lim_{k\to\infty} A_{1/k}$. Clearly $P_{\theta_0}(A_0) = 1$, and on $A_0$, $\hat\theta_n \to \theta_0$.

Remark 6.1.7. Theorem 6.1.6 says there exists a sequence of local maxima which converges a.s. $P_{\theta_0}$ to $\theta_0$. However, since we don't know $\theta_0$, we can't determine the sequence unless $L$ has a unique local maximum for each $n$.

Corollary 6.1.8. If $L'(\theta) = 0$ has a unique root $\hat\theta_n$ for all $X$ and all sufficiently large $n$, then $\hat\theta_n \to \theta_0$ a.s. $P_{\theta_0}$. If in addition $\Theta$ is the open interval $(\theta_L, \theta_U)$ and $L'(\theta)$ is continuous on $\Theta$ for all $X$, then $\hat\theta_n$ maximizes the likelihood (globally), i.e. $\hat\theta_n$ is the MLE, and hence the MLE is consistent.

Proof. The first statement follows straight from Theorem 6.1.6. If $\hat\theta_n$ is not the MLE, then $L(\theta) \to \sup_\Theta L(\theta)$ as $\theta \downarrow \theta_L$ or $\theta \uparrow \theta_U$. But $\hat\theta_n$ is a local max by the proof of Theorem 6.1.6, and hence $L$ must also have a local min with $L'(\theta) = 0$ for some $\theta \ne \hat\theta_n$, a contradiction. So $\hat\theta_n$ is the MLE.

The following theorem gives conditions for the consistency of the MLE regardless of the existence and/or the number of roots of the likelihood equation. But first we state without proof a lemma which is needed in the proof of the theorem.

Lemma 6.1.9. If $f : \mathcal{X} \times \bar\Theta \to \mathbb{R}$ (with $\bar\Theta$ compact) is a function such that $f(x,\theta)$ is continuous in $\theta$ for each fixed $x$, and $f(\cdot,\theta)$ is measurable on $\mathcal{X}$ for each fixed $\theta$, then there exists a measurable function $T : \mathcal{X} \to \bar\Theta$ such that
\[ f\bigl(x, T(x)\bigr) = \max_{\theta\in\bar\Theta} f(x,\theta) \quad \forall x. \]
(The point is the measurability of $T$.)

Theorem 6.1.10. Suppose $\Theta \subseteq \bar\Theta$, where $\bar\Theta$ is a compact metric space with metric $d$, and define $\bar{\mathcal P} = \{P_\theta,\ \theta\in\bar\Theta\}$, where $P_\theta$ is a possibly defective probability measure for $\theta \in \bar\Theta \setminus \Theta$. Suppose
(i) there exists a $\sigma$-finite $\mu$ such that $\frac{dP_\theta}{d\mu}(x) = f(x,\theta)$ for all $\theta$ in $\bar\Theta$, and $f(x,\cdot)$ is continuous on $\bar\Theta$ for all $x$;
(ii) $P_\theta \ne P_{\theta_0}$ for all $\theta \ne \theta_0$, for some fixed $\theta_0 \in \Theta$;
(iii) for all $\theta \in \bar\Theta$ there exists a neighborhood $N_\theta$ of $\theta$ in $\bar\Theta$ such that
\[ E_{\theta_0} \inf_{\theta'\in N_\theta} \log\frac{f(X,\theta_0)}{f(X,\theta')} > -\infty. \]
Then for any measurable $\hat\theta_n = \hat\theta_n(X_1,\dots,X_n)$ which maximizes $L(\theta, X)$ over $\bar\Theta$, we have
\[ d(\hat\theta_n, \theta_0) \to 0 \quad \text{a.s. } P_{\theta_0}. \]

Proof. It suffices to show that for each open neighborhood $A(\theta_0)$ of $\theta_0$,
\[ P_{\theta_0}\bigl( \hat\theta_n \in A(\theta_0)\ \forall n \text{ sufficiently large} \bigr) = 1. \]
For each $\theta$ not in $A(\theta_0)$ there exists a neighborhood $N_\theta$ of $\theta$ such that
\[ E_{\theta_0} \inf_{\theta'\in N_\theta} \log\frac{f(X,\theta_0)}{f(X,\theta')} > -\infty. \tag{6.1.2} \]
Let $B(\theta,r) = \{\theta' : d(\theta,\theta') < r\}$. Then
\[ \inf_{\theta'\in B(\theta,r)} \log\frac{f(x,\theta_0)}{f(x,\theta')} \uparrow \log\frac{f(x,\theta_0)}{f(x,\theta)} \quad \text{as } r \downarrow 0, \]
and since
\[ \inf_{\theta'\in B(\theta,r)} \log\frac{f(x,\theta_0)}{f(x,\theta')} \ge \inf_{\theta'\in N_\theta} \log\frac{f(x,\theta_0)}{f(x,\theta')} \quad \text{for } r \text{ small}, \]
it follows from (6.1.2) and the monotone convergence theorem that
\[ \lim_{r\downarrow 0}\, E_{\theta_0} \inf_{\theta'\in B(\theta,r)} \log\frac{f(x,\theta_0)}{f(x,\theta')} = I(\theta \mid \theta_0) > 0. \]
In particular, there exists $r(\theta) > 0$ such that
\[ E_{\theta_0} \inf_{\theta'\in B(\theta, r(\theta))} \log\frac{f(x,\theta_0)}{f(x,\theta')} > 0. \]
Since the class of open sets $B(\theta, r(\theta))$, $\theta$ not in $A(\theta_0)$, covers the compact set $\bar\Theta\setminus A(\theta_0)$, there exist $\theta_1,\dots,\theta_k \in \bar\Theta\setminus A(\theta_0)$ such that
\[ \bar\Theta\setminus A(\theta_0) \subseteq \bigcup_{\nu=1}^k B\bigl(\theta_\nu, r(\theta_\nu)\bigr). \]
For each $\nu \in \{1,\dots,k\}$, the SLLN implies that
\[ \inf_{\theta\in B(\theta_\nu, r(\theta_\nu))} \bigl[ L(\theta_0) - L(\theta) \bigr] \ge \sum_{i=1}^n \inf_{\theta\in B(\theta_\nu, r(\theta_\nu))} \log\frac{f(X_i,\theta_0)}{f(X_i,\theta)} \xrightarrow{\text{SLLN}} \infty \quad \text{a.s. } P_{\theta_0} \text{ as } n \to \infty, \]
i.e. $L(\theta_0) - \sup_{\theta\in B(\theta_\nu, r(\theta_\nu))} L(\theta) \to \infty$ a.s. $P_{\theta_0}$. Hence with $P_{\theta_0}$ probability 1, $\hat\theta_n \notin B(\theta_\nu, r(\theta_\nu))$ for $\nu = 1,\dots,k$ for $n$ sufficiently large, whence $\hat\theta_n \in A(\theta_0)$ for $n$ sufficiently large.

Example 6.1.11. $f(x,\theta) = \theta e^{-\theta x}$, $0 < x < \infty$, $0 < \theta < \infty$. Obviously $\hat\theta_n = 1/\bar X_n \to \theta$ a.s. by the SLLN.
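To make the exponential example concrete, here is a minimal numerical sketch (Python; the data values are made up for illustration) confirming that the closed-form MLE $\hat\theta_n = 1/\bar x_n$ agrees with a direct maximization of the log likelihood of Definition 6.1.1:

```python
import math

def loglik(theta, xs):
    # L(theta, x) = sum_i log f(x_i, theta), with f(x, theta) = theta * exp(-theta * x)
    return sum(math.log(theta) - theta * x for x in xs)

xs = [0.5, 1.2, 0.3, 2.0, 0.8]            # fixed illustrative sample
mle_closed = 1.0 / (sum(xs) / len(xs))    # theta_hat = 1 / xbar

# crude grid maximization of L(theta, x) over (0, 10)
grid = [0.001 * k for k in range(1, 10000)]
mle_grid = max(grid, key=lambda t: loglik(t, xs))

print(mle_closed, mle_grid)
```

The grid maximizer agrees with $1/\bar x_n$ up to the grid spacing; for a smooth strictly concave log likelihood such as this one, that is all the likelihood equation has to offer beyond the closed form.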
However, let us show this by applying Theorem 6.1.10. Let $\bar\Theta = [0,\infty]$ and set $f(x,0) = f(x,\infty) = 0$ (defective probability densities). Then
(i) $f(x,\theta)$ is continuous in $\theta$ for any $x$;
(ii) $P_\theta \ne P_{\theta_0}$ for any $\theta \ne \theta_0$;
(iii) for $\theta \in (0,\infty)$ and $0 < \rho < \theta$,
\[ \inf_{|\theta'-\theta|<\rho} \log\frac{f(X,\theta_0)}{f(X,\theta')} = \inf_{\theta-\rho<\theta'<\theta+\rho} \Bigl[ \log\frac{\theta_0}{\theta'} + X(\theta'-\theta_0) \Bigr] \ge \log\frac{\theta_0}{\theta+\rho} + X(\theta - \rho - \theta_0), \]
and $E_{\theta_0}(\text{RHS}) > -\infty$. For $\theta = 0$ and any $\rho > 0$,
\[ \inf_{0<\theta'<\rho} \Bigl[ \log\frac{\theta_0}{\theta'} + X(\theta'-\theta_0) \Bigr] \ge \log\frac{\theta_0}{\rho} - X\theta_0, \]
and again $E_{\theta_0}(\text{RHS}) > -\infty$. For $\theta = \infty$ and any $\rho > 0$,
\[ \inf_{\theta' > 1/\rho} \Bigl[ \log\frac{\theta_0}{\theta'} + X(\theta'-\theta_0) \Bigr] = \begin{cases} \log(\theta_0 X) + 1 - \theta_0 X, & X < \rho \\[2pt] \log(\theta_0 \rho) + X\bigl(\tfrac1\rho - \theta_0\bigr), & X \ge \rho, \end{cases} \]
and
\[ E_{\theta_0}\Bigl[ \bigl( \log(\theta_0 X) + 1 - \theta_0 X \bigr) I_{[0,\rho)}(X) + \bigl( \log(\theta_0\rho) + X(\tfrac1\rho - \theta_0) \bigr) I_{[\rho,\infty)}(X) \Bigr] > -\infty. \]
Hence the conditions of the theorem are satisfied, so the MLE is consistent.

6.2. Asymptotic Normality of the MLE

Theorem 6.2.1. Suppose that
(i) $\theta_0 = (\theta_1^0,\dots,\theta_d^0)$ is the true parameter and $\theta_0 \in \mathrm{int}(\Theta) \subseteq \mathbb{R}^d$;
(ii) the MLE $\hat\theta_n \xrightarrow{P_{\theta_0}} \theta_0$;
(iii) $E_{\theta_0} \frac{\partial}{\partial\theta_j} \log f(X,\theta_0) = 0$, $1 \le j \le d$;
(iv) $I(\theta_0) = E_{\theta_0}\Bigl[ \frac{\partial}{\partial\theta}\log f(X,\theta_0)\,\Bigl(\frac{\partial}{\partial\theta}\log f(X,\theta_0)\Bigr)^{\!T} \Bigr] = -E_{\theta_0}\Bigl[ \frac{\partial^2}{\partial\theta\,\partial\theta^T}\log f(X,\theta_0) \Bigr]$, and $I(\theta_0)$ is non-singular;
(v) there exists $\epsilon > 0$ such that $E_{\theta_0} W_\epsilon(X) < \infty$, where
\[ W_\epsilon(X) = \sup_{\|\theta-\theta_0\|\le\epsilon} \Bigl| \frac{\partial^2}{\partial\theta_\nu\,\partial\theta_q}\log f(X,\theta) - \frac{\partial^2}{\partial\theta_\nu\,\partial\theta_q}\log f(X,\theta_0) \Bigr|, \qquad 1 \le \nu, q \le d; \]
(vi) $\frac{\partial^2}{\partial\theta_\nu\,\partial\theta_q}\log f(x,\theta)$ is continuous in $\theta$ for all $x$.
Then
\[ \sqrt n\,\bigl(\hat\theta_n - \theta_0\bigr) \xrightarrow{d} N\bigl(0,\ I^{-1}(\theta_0)\bigr). \]

Proof. From (vi), $W_\epsilon(X) \downarrow 0$ as $\epsilon \downarrow 0$. Hence from (v), $E_{\theta_0} W_\epsilon(X) \downarrow 0$ as $\epsilon \downarrow 0$. Now for some $\theta^*$ between $\theta_0$ and $\hat\theta_n$,
\[ 0 = \frac1n \sum_{i=1}^n \frac{\partial}{\partial\theta_\nu}\log f(x_i,\hat\theta_n) = \frac1n \sum_{i=1}^n \frac{\partial}{\partial\theta_\nu}\log f(x_i,\theta_0) + \frac1n \sum_{i=1}^n \sum_{q=1}^d \frac{\partial^2}{\partial\theta_\nu\,\partial\theta_q}\log f(x_i,\theta^*)\,\bigl(\hat\theta_n - \theta_0\bigr)_q \tag{6.2.1} \]
(the left side is 0 since $\hat\theta_n$ is the MLE). We will show that
\[ \frac1n \sum_{i=1}^n \frac{\partial^2}{\partial\theta_\nu\,\partial\theta_q}\log f(x_i,\theta^*) \to -I(\theta_0)_{\nu q}. \tag{6.2.2} \]
Write
\[ \frac1n \sum_{i=1}^n \frac{\partial^2}{\partial\theta_\nu\,\partial\theta_q}\log f(x_i,\theta^*) = \frac1n \sum_{i=1}^n \frac{\partial^2}{\partial\theta_\nu\,\partial\theta_q}\log f(x_i,\theta_0) + \frac1n \sum_{i=1}^n \Bigl[ \frac{\partial^2}{\partial\theta_\nu\,\partial\theta_q}\log f(x_i,\theta^*) - \frac{\partial^2}{\partial\theta_\nu\,\partial\theta_q}\log f(x_i,\theta_0) \Bigr]. \]
The first term on the right-hand side goes to $-I(\theta_0)_{\nu q}$ in probability by the WLLN. For any $c > 0$,
\[ P_{\theta_0}\bigl( |\text{2nd term}| > c \bigr) \le P_{\theta_0}\bigl( \|\theta^* - \theta_0\| > \epsilon \bigr) + P_{\theta_0}\Bigl( \frac1n \sum_{i=1}^n W_\epsilon(x_i) > c \Bigr) = I + II, \]
where $I \le P_{\theta_0}\bigl( \|\hat\theta_n - \theta_0\| > \epsilon \bigr) \to 0$ (since $\theta^*$ lies between $\theta_0$ and $\hat\theta_n$), and $II \le \frac1c\, E_{\theta_0} W_\epsilon(X) \to 0$ as $\epsilon \downarrow 0$. This proves (6.2.2). Now write (6.2.1) in matrix form:
\[ \bigl[ -I(\theta_0) + o_p(1) \bigr]\,\sqrt n\,\bigl(\hat\theta_n - \theta_0\bigr) = -\frac{1}{\sqrt n} \sum_{i=1}^n \Bigl( \frac{\partial}{\partial\theta_\nu}\log f(X_i,\theta_0) \Bigr)_{\nu=1,\dots,d}. \]
By the CLT,
\[ \frac{1}{\sqrt n} \sum_{i=1}^n \Bigl( \frac{\partial}{\partial\theta_\nu}\log f(X_i,\theta_0) \Bigr)_{\nu=1,\dots,d} \xrightarrow{d} N\bigl(0, I(\theta_0)\bigr). \]
It follows that
\[ \sqrt n\,\bigl(\hat\theta_n - \theta_0\bigr) = \bigl[ I(\theta_0) + o_p(1) \bigr]^{-1}\, \frac{1}{\sqrt n} \sum_{i=1}^n \frac{\partial}{\partial\theta}\log f(x_i,\theta_0) \xrightarrow{d} N\bigl(0,\ I^{-1}(\theta_0)\, I(\theta_0)\, I^{-1}(\theta_0)\bigr) = N\bigl(0, I^{-1}(\theta_0)\bigr). \]

Note 6.2.2. Condition (v) can be replaced by: $\bigl| \frac{\partial^3}{\partial\theta_\nu\,\partial\theta_q\,\partial\theta_r}\log f(X,\theta) \bigr| \le M(X)$ for all $\theta$ with $\|\theta-\theta_0\| < \epsilon$, where $E_{\theta_0} M(X) < \infty$.

Example 6.2.3. One-parameter exponential family: $f(x_i,\theta) = e^{\theta T(x_i) - A(\theta)}\, h(x_i)$. The likelihood equation $L'(\theta) = 0$ implies $\frac1n \sum T(x_i) = A'(\theta) = E_\theta T(X_1)$. Since $A''(\theta) = \mathrm{Var}_\theta\, T > 0$, $A'(\theta)$ is strictly increasing, so $L'(\theta) = 0$ has at most one solution. By Theorem 6.1.6 and its corollary, $\hat\theta \to \theta$ a.s. Also $\frac{d^3}{d\theta^3}\log f(x,\theta) = -A'''(\theta)$ is independent of $x$ and continuous. Hence by Theorem 6.2.1,
\[ \sqrt n\,(\hat\theta - \theta) \xrightarrow{d} N\bigl(0, I^{-1}(\theta)\bigr) = N\bigl(0, (\mathrm{Var}_\theta\, T)^{-1}\bigr). \]

Example 6.2.4 (Censoring). Suppose $X_1,\dots,X_n$ are iid $E(\theta)$ (i.e. $EX_i = \theta$), and suppose we observe the censored data $Y_i = \min(X_i, T)$ with $T$ fixed. Let $\mu$ be the measure on $[0,T]$ (Lebesgue measure plus unit mass at $T$) defined by
\[ \mu(A) = \int_A dx + I_A(T). \]
Then the density of $Y$ with respect to $\mu$ is
\[ p(y,\theta) = \begin{cases} \frac1\theta\, e^{-y/\theta}, & 0 \le y < T \\ e^{-T/\theta}, & y = T \quad \bigl( = P_\theta(X_i \ge T) \bigr), \end{cases} \]
so
\[ \frac{\partial}{\partial\theta}\log p(y,\theta) = \begin{cases} -\frac1\theta + \frac{y}{\theta^2}, & 0 \le y < T \\ \frac{T}{\theta^2}, & y = T, \end{cases} \qquad -\frac{\partial^2}{\partial\theta^2}\log p(y,\theta) = \begin{cases} -\frac{1}{\theta^2} + \frac{2y}{\theta^3}, & 0 \le y < T \\ \frac{2T}{\theta^3}, & y = T, \end{cases} \]
whence
\[ I(\theta) = -E_\theta \frac{\partial^2}{\partial\theta^2}\log p(Y,\theta) = \int_0^T \Bigl( -\frac{1}{\theta^2} + \frac{2y}{\theta^3} \Bigr)\, \frac1\theta\, e^{-y/\theta}\, dy + \frac{2T}{\theta^3}\, e^{-T/\theta} = \frac{1}{\theta^2}\bigl( 1 - e^{-T/\theta} \bigr). \]
Moreover
\[ \frac{\partial^3}{\partial\theta^3}\log p(y,\theta) = \begin{cases} -\frac{2}{\theta^3} + \frac{6y}{\theta^4}, & 0 \le y < T \\ \frac{6T}{\theta^4}, & y = T. \end{cases} \]
Since $\theta_0 \in (0,\infty)$, $\bigl| \frac{\partial^3}{\partial\theta^3}\log p(y,\theta) \bigr| \le Ay + B$ for $\frac{\theta_0}{2} < \theta < \frac{3\theta_0}{2}$, and $E_{\theta_0}(AY+B) < \infty$. Check that $\hat\theta_n \to \theta_0$ under $P_{\theta_0}$. Then by Theorem 6.2.1,
\[ \sqrt n\,\bigl(\hat\theta_n - \theta_0\bigr) \xrightarrow{d} N\Bigl(0,\ \frac{\theta_0^2}{1 - e^{-T/\theta_0}}\Bigr). \]

6.3. Asymptotic Optimality of the MLE

Under the conditions of Theorem 6.2.1, the MLE satisfies
\[ \sqrt n\,\bigl(\hat\theta_n - \theta_0\bigr) \xrightarrow{d} N\bigl(0, I^{-1}(\theta_0)\bigr). \]
We will now show that $I^{-1}(\theta_0)$ is the minimal attainable covariance matrix for a class of asymptotically normal estimators. (Note: as in Section 6.2, $I(\theta)$ denotes the Fisher information matrix per observation.)

Theorem 6.3.1. Suppose $\{T_n\}$ is a sequence of asymptotically normal estimators of $\theta$ with $\mathrm{Var}_\theta(T_n) < \infty$ for all $n$. Let
\[ \Delta_n(\theta) = \frac{\partial}{\partial\theta}\, E_\theta(T_n) = \Bigl[ \frac{\partial}{\partial\theta_j}\, E(T_{ni} \mid \theta) \Bigr]_{i,j=1,\dots,d}. \]
If
(i) $\sqrt n\,(T_n - \theta) \xrightarrow{d} N\bigl(0, \Sigma(\theta)\bigr)$,
(ii) $n\,\mathrm{Cov}_\theta(T_n) \ge \Delta_n\, I^{-1}(\theta)\, \Delta_n^T$,
(iii) there exists $\delta(\theta) > 0$ such that $\sup_n E\bigl\| \sqrt n\,(T_n - \theta) \bigr\|^{2+\delta} < \infty$,
(iv) $\Delta_n \to I_{d\times d}$ (the identity matrix),
then
\[ \Sigma(\theta) \ge I^{-1}(\theta). \]

Proof. (iii) implies that $\{\sqrt n\,(T_n-\theta)\}$ and $\{n\,(T_n-\theta)(T_n-\theta)^T\}$ are uniformly integrable (Billingsley, p. 338). Hence from (i), if $Z \sim N(0,\Sigma(\theta))$,
\[ E\,\sqrt n\,(T_n - \theta) \to EZ = 0 \quad \text{and} \quad n\,\mathrm{Cov}(T_n) \to \mathrm{Cov}(Z) = \Sigma(\theta) \]
(Billingsley, p. 338). As $n\to\infty$, the LHS of (ii) $\to \Sigma(\theta)$, and the RHS of (ii) $\to I^{-1}(\theta)$ by (iv). Therefore $\Sigma(\theta) \ge I^{-1}(\theta)$.

Note 6.3.2. Assumption (ii) will be satisfied under the conditions of Corollary 2.4.4 (in the UMVU section).

Definition 6.3.3. If $\{T_n\}$ satisfies (i) with $\Sigma(\theta) = I^{-1}(\theta)$, then it is said to be asymptotically efficient.

Corollary 6.3.4. Let $g = (g_1,\dots,g_r)^T : \Theta \to \mathbb{R}^r$ have continuous derivatives $\frac{\partial g_i}{\partial\theta_j}$, $1 \le j \le d$, $1 \le i \le r$. Define
\[ \Delta(\theta) = \frac{\partial g}{\partial\theta} = \Bigl[ \frac{\partial g_i}{\partial\theta_j} \Bigr]_{j=1,\dots,d;\ i=1,\dots,r}, \]
and assume that the sequence $\{T_n\}$ of estimators satisfies
(i) $\sqrt n\,\bigl(T_n - g(\theta)\bigr) \xrightarrow{d} N\bigl(0, \Sigma(\theta)\bigr)$,
(ii) $n\,\mathrm{Cov}_\theta(T_n) \ge \Delta_n^T(\theta)\, I^{-1}(\theta)\, \Delta_n(\theta)$, where $\Delta_n(\theta) = \frac{\partial}{\partial\theta} E_\theta T_n$,
(iii) there exists $\delta(\theta) > 0$ such that $\sup_n E\bigl\| \sqrt n\,(T_n - g(\theta)) \bigr\|^{2+\delta} < \infty$,
(iv) $\Delta_n \to \Delta$.
Then
\[ \Sigma(\theta) \ge \Delta^T I^{-1}(\theta)\,\Delta. \]

Proof. The same as the proof of Theorem 6.3.1. ((ii) again holds under the conditions of Corollary 2.4.4.)

Next we show that $g(\hat\theta_n)$, where $\hat\theta_n$ is the MLE of $\theta$, achieves the lower bound in Corollary 6.3.4, provided the conditions of Theorem 6.2.1 hold.

Corollary 6.3.5. Suppose $\sqrt n\,(\hat\theta_n - \theta) \xrightarrow{d} N\bigl(0, I^{-1}(\theta)\bigr)$ and $\partial g_j/\partial\theta_i$ is continuous for $1 \le j \le r$, $1 \le i \le d$. Then
\[ g(\hat\theta_n) \text{ is } AN\Bigl( g(\theta),\ \frac1n\, \Delta^T I^{-1}(\theta)\, \Delta \Bigr). \]

Proof. Since $\hat\theta_n - \theta = O_p\bigl( \frac{1}{\sqrt n} \bigr)$,
\[ g(\hat\theta_n) - g(\theta) = \Delta^T \bigl( \hat\theta_n - \theta \bigr) + o_p\Bigl( \frac{1}{\sqrt n} \Bigr), \]
so
\[ \sqrt n\,\bigl( g(\hat\theta_n) - g(\theta) \bigr) \xrightarrow{d} N\bigl( 0,\ \Delta^T I^{-1}(\theta)\,\Delta \bigr) \quad \text{by Slutsky.} \]

Example 6.3.6. Suppose $X_1,\dots,X_n$ are iid lognormal$(\theta, \sigma^2)$ (i.e. $Y_i = \log X_i \sim N(\theta,\sigma^2)$) with $\sigma^2 = 1$. The MLE of $\theta$ is $\hat\theta_n = \frac1n \sum_{i=1}^n \log X_i$.
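A quick numerical sanity check of this claim (Python; data values invented for illustration): with $\sigma^2 = 1$, the lognormal log likelihood is exactly quadratic in $\theta$, so the sample mean of the logs should beat any nearby value of $\theta$.

```python
import math

xs = [1.2, 0.7, 3.1, 2.4, 0.9]   # fixed illustrative data, x_i > 0

def loglik(theta):
    # sum_i log f(x_i, theta) for lognormal(theta, 1):
    # log f(x, theta) = -log x - 0.5*log(2*pi) - (log x - theta)^2 / 2
    return sum(-math.log(x) - 0.5 * math.log(2 * math.pi)
               - (math.log(x) - theta) ** 2 / 2 for x in xs)

theta_hat = sum(math.log(x) for x in xs) / len(xs)   # (1/n) * sum log x_i
print(theta_hat, loglik(theta_hat))
```

Since the log likelihood is a concave parabola in $\theta$ with vertex at the mean of the logs, `theta_hat` is the exact global maximizer, not just a root of the likelihood equation.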
Suppose $g(\theta) = EX_1 = e^{\theta + 1/2}$ (set $\sigma^2 = 1$ in $Ee^Y = e^{\theta + \sigma^2/2}$). Consider the two estimators of $\xi := g(\theta)$:
\[ g(\hat\theta_n) = e^{\hat\theta_n + 1/2} \quad (\text{MLE}) \qquad \text{and} \qquad \tilde\xi_n = \frac1n \sum_{i=1}^n X_i \quad (\text{moment estimator}). \]
The Fisher information for $\theta$ in $X_i$ equals the Fisher information for $\theta$ in $Y_i$, since the transformation $Y_i = \log X_i$ is one-to-one. So it equals $E\bigl( \frac{\partial}{\partial\theta}\log f(Y_i,\theta) \bigr)^2 = E(Y_i - \theta)^2 = 1$ (notice $\log f(Y,\theta) = -\frac12\ln 2\pi - \frac{(Y-\theta)^2}{2}$). Therefore
\[ \sqrt n\,\bigl( g(\hat\theta_n) - \xi \bigr) \xrightarrow{d} N\bigl( 0,\ \Delta^T I^{-1}(\theta)\,\Delta \bigr), \]
where $\Delta = \frac{d}{d\theta}\, g(\theta) = e^{\theta+1/2}$, i.e.
\[ \sqrt n\,\bigl( g(\hat\theta_n) - \xi \bigr) \xrightarrow{d} N\bigl( 0,\ e^{2\theta+1} \bigr). \]
On the other hand, for the moment estimator $\tilde\xi_n = \frac1n \sum X_i$ we have
\[ \sqrt n\,\bigl( \tilde\xi_n - \xi \bigr) \xrightarrow{d} N\bigl( 0, \mathrm{Var}(X_i) \bigr) = N\bigl( 0,\ e^{2\theta+1}(e-1) \bigr) \]
($EX_i^2 = Ee^{2Y} = $ MGF of $N(\theta,1)$ evaluated at $2$ $ = e^{2\theta+2}$, therefore $\mathrm{Var}(X_i) = e^{2\theta+2} - \bigl( e^{\theta+1/2} \bigr)^2 = e^{2\theta+1}(e-1)$). The ARE of $\tilde\xi_n$ relative to $g(\hat\theta_n)$ is thus
\[ \mathrm{ARE}\bigl( \tilde\xi_n, g(\hat\theta_n) \bigr) = \frac{1}{e-1} = .582. \]
The sample mean thus has poor ARE as an estimator of the mean of a lognormal distribution. (The lognormal distribution has heavy tails, so this is not too surprising.)

Example 6.3.7 (Hodges; TPE, p. 440). $X_1,\dots,X_n$ iid $N(\theta,1)$, $I(\theta) = 1$. Define
\[ T_n = \begin{cases} \bar X_n & \text{if } |\bar X_n| > n^{-1/4} \\ 0 & \text{if } |\bar X_n| \le n^{-1/4}. \end{cases} \]
Then $T_n$ is $AN\bigl(\theta, \frac{v(\theta)}{n}\bigr)$, where $v(\theta) = 1$ if $\theta \ne 0$ and $v(0) = 0$. So $v(\theta) < I^{-1}(\theta)$ at $\theta = 0$, and the parameter value $0$ is called a point of superefficiency. Theorem 6.3.1 does not apply to this example. Note also that $T_n$ is not uniformly better than $\bar X_n$ for finite $n$: for example, $\theta_n = n^{-1/4}$ gives
\[ E_{\theta_n}\bigl[ n\,(T_n - \theta_n)^2 \bigr] \to \infty > 1 = \lim_{n\to\infty} E_{\theta_n}\bigl[ n\,(\bar X_n - \theta_n)^2 \bigr]. \]

Remark 6.3.8. Le Cam has shown that for any estimator satisfying
\[ \sqrt n\,(\hat\theta_n - \theta_0) \xrightarrow{d} N\bigl(0, v(\theta)\bigr), \]
the set of $\theta$'s where $v(\theta) < I^{-1}(\theta)$ has Lebesgue measure zero.

Iterative Methods

Suppose we have a sequence of estimators $\tilde\theta_n$ such that
\[ \tilde\theta_n = \theta_0 + O_p\Bigl( \frac{1}{\sqrt n} \Bigr). \]
Set
\[ T_n = \tilde\theta_n - \frac{L'(\tilde\theta_n)}{L''(\tilde\theta_n)}. \]
Then
\[ L'(\tilde\theta_n) = L'(\theta_0) + (\tilde\theta_n - \theta_0)\, L''(\theta_n^*), \]
where $\theta_n^*$ is between $\theta_0$ and $\tilde\theta_n$. Thus
\[ \sqrt n\,(T_n - \theta_0) = \sqrt n\,\Bigl( \tilde\theta_n - \theta_0 - \frac{L'(\tilde\theta_n)}{L''(\tilde\theta_n)} \Bigr) = \sqrt n\,\Bigl( \tilde\theta_n - \theta_0 - \frac{L'(\theta_0)}{L''(\tilde\theta_n)} - (\tilde\theta_n - \theta_0)\,\frac{L''(\theta_n^*)}{L''(\tilde\theta_n)} \Bigr) \]
\[ = \frac{-\frac{1}{\sqrt n}\, L'(\theta_0)}{\frac1n\, L''(\tilde\theta_n)} + \sqrt n\,(\tilde\theta_n - \theta_0)\,\Bigl[ 1 - \frac{L''(\theta_n^*)}{L''(\tilde\theta_n)} \Bigr]. \]
Under the conditions of Theorem 6.2.1,
\[ \frac{1}{\sqrt n}\, L'(\theta_0) \xrightarrow{d} N\bigl(0, I(\theta_0)\bigr), \qquad \frac1n\, L''(\tilde\theta_n) \xrightarrow{P} -I(\theta_0), \qquad \frac1n\, L''(\theta_n^*) \xrightarrow{P} -I(\theta_0). \]
Hence the term in square brackets is $o_p(1)$, so
\[ \sqrt n\,(T_n - \theta_0) = \frac{-\frac{1}{\sqrt n}\, L'(\theta_0)}{\frac1n\, L''(\tilde\theta_n)} + o_p(1) \xrightarrow{d} N\bigl( 0, I^{-1}(\theta_0) \bigr), \]
and so we have proved the following theorem.

Theorem 6.3.9. Suppose conditions (i)–(iv) of Theorem 6.2.1 hold and $\tilde\theta_n = \theta_0 + O_p\bigl( \frac{1}{\sqrt n} \bigr)$. Then
\[ T_n := \tilde\theta_n - \frac{L'(\tilde\theta_n)}{L''(\tilde\theta_n)} \]
is asymptotically efficient.

Corollary 6.3.10. If $I(\theta)$ is continuous, then the estimator
\[ T_n' = \tilde\theta_n + \frac{L'(\tilde\theta_n)}{n\, I(\tilde\theta_n)} \]
is asymptotically efficient.

Proof.
\[ \sqrt n\,(T_n' - T_n) = \sqrt n\,\Bigl( \frac{L'(\tilde\theta_n)}{L''(\tilde\theta_n)} + \frac{L'(\tilde\theta_n)}{n\, I(\tilde\theta_n)} \Bigr) = \frac{L'(\tilde\theta_n)}{\sqrt n} \cdot \frac{ I(\tilde\theta_n) + \frac1n\, L''(\tilde\theta_n) }{ \frac1n\, L''(\tilde\theta_n)\, I(\tilde\theta_n) } = o_p(1). \]
(The first factor is $O_p(1)$, the numerator of the second factor is $o_p(1)$, and the denominator $\xrightarrow{P} -I^2(\theta_0)$.)

Example 6.3.11 (Location family). Suppose that $X_1, X_2, \dots$ are iid $f(x-\theta)$, where $f$ is differentiable and symmetric, $f(x) > 0$ for all $x$, and $f'$ is continuous. Then the conditions of Theorem 6.1.6 hold:
(A0) $P_\theta \ne P_{\theta_0}$ for $\theta \ne \theta_0$; (A1) common support; (A2) $\frac{dP_\theta}{d\mu}(x) = f(x-\theta)$; (A3) $\theta \in \mathrm{int}(\Theta) = (-\infty,\infty)$.
Hence
\[ L'(\theta, X) = -\sum_{i=1}^n \frac{f'(x_i - \theta)}{f(x_i - \theta)} = 0 \tag{6.3.1} \]
has, for $n$ large, a sequence of roots $\hat\theta_n$ such that $\hat\theta_n \to \theta_0$ a.s. $P_{\theta_0}$. Since the likelihood tends to $0$ as $|\theta| \to \infty$, $L(\theta, X)$ must have a maximum; however, there may be several solutions of (6.3.1). Provided $f(x-\theta)$ satisfies conditions (v) and (vi) of Theorem 6.2.1, i.e.
\[ E_{\theta_0}\Bigl[ \sup_{|\theta-\theta_0|\le\epsilon} \Bigl| \frac{\partial^2}{\partial\theta^2}\log f(x-\theta) - \frac{\partial^2}{\partial\theta^2}\log f(x-\theta_0) \Bigr| \Bigr] < \infty \]
and $\frac{\partial^2}{\partial\theta^2}\log f(x-\theta)$ is continuous in $\theta$ for all $x$, then all the conditions of Theorem 6.2.1 (apart from (ii)) are satisfied.
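The one-step estimator of Theorem 6.3.9 is easy to demonstrate in the simplest case. For $N(\theta,1)$ the log likelihood has $L'(t) = \sum_i (x_i - t)$ and $L'' = -n$, so a single Newton step lands exactly on the MLE $\bar x$ from any starting value. A minimal sketch (Python; the sample is invented for illustration):

```python
def one_step(theta0, xs):
    """One Newton step T_n = theta0 - L'(theta0)/L''(theta0) (Theorem 6.3.9)
    for the N(theta, 1) log likelihood, where L'(t) = sum(x - t) and L'' = -n."""
    n = len(xs)
    score = sum(x - theta0 for x in xs)   # L'(theta0)
    return theta0 - score / (-n)          # subtract L'/L''

xs = [0.3, -1.1, 0.7, 2.4, 0.2]           # fixed illustrative sample
xbar = sum(xs) / len(xs)

# For this model one step lands exactly on the MLE (= xbar), whatever the start:
print(one_step(0.0, xs), one_step(10.0, xs), xbar)
```

For less tractable likelihoods (e.g. the Cauchy location family) the step no longer lands exactly on the MLE, but by Theorem 6.3.9 it is already asymptotically efficient provided the starting value is $\sqrt n$-consistent.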
Also $\bar X_n$ is $AN(\theta, \frac{\sigma^2}{n})$, and hence $\bar X_n = \theta + O_p\bigl( \frac{1}{\sqrt n} \bigr)$. Corollary 6.3.10 therefore implies the asymptotic efficiency of
\[ T_n = \bar X_n - \frac{ \displaystyle\sum_{i=1}^n \frac{f'(X_i - \bar X_n)}{f(X_i - \bar X_n)} }{ n\, I }, \qquad \text{where } I = \int \Bigl( \frac{f'(y)}{f(y)} \Bigr)^2 f(y)\, dy. \]
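The Fisher information $I = \int (f'(y)/f(y))^2 f(y)\,dy$ appearing above can be evaluated numerically for a concrete location family. A minimal sketch (Python; the logistic density is my choice of example, not from the notes), using Simpson's rule and checking against the known value $I = 1/3$ for the logistic:

```python
import math

def location_fisher_info(f, fprime, lo=-50.0, hi=50.0, n=20000):
    """I = integral of (f'(y)/f(y))^2 * f(y) dy, by Simpson's rule on [lo, hi]."""
    h = (hi - lo) / n
    def g(y):
        return fprime(y) ** 2 / f(y)
    s = g(lo) + g(hi)
    for k in range(1, n):
        s += (4 if k % 2 else 2) * g(lo + k * h)
    return s * h / 3.0

# Logistic density f(y) = e^{-y} / (1 + e^{-y})^2: symmetric, positive, smooth.
def f(y):
    return math.exp(-y) / (1.0 + math.exp(-y)) ** 2

def fprime(y):
    # f'(y) = f(y) * (e^{-y} - 1) / (e^{-y} + 1)
    return f(y) * (math.exp(-y) - 1.0) / (math.exp(-y) + 1.0)

info = location_fisher_info(f, fprime)
print(info)   # known closed-form value for the logistic location family: 1/3
```

The truncation to $[-50, 50]$ is harmless here since the integrand decays exponentially; for heavier-tailed $f$ the limits would need widening.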