Contents
Chapter 1. Preliminaries
1.1. Conditional Expectation
1.2. Sufficiency
1.3. Exponential Families.
1.4. Convex Loss Function
Chapter 2. Unbiasedness
2.1. UMVU estimators.
2.2. Non-parametric families
2.3. The Information Inequality
2.4. Multiparameter Case
Chapter 3. Equivariance
3.1. Equivariance for Location family
3.2. The General Equivariant Framework
3.3. Location-Scale Families
Chapter 4. Average-Risk Optimality
4.1. Bayes Estimation
4.2. Minimax Estimation
4.3. Minimaxity and Admissibility in Exponential families
4.4. Shrinkage Estimators
Chapter 5. Large Sample Theory
5.1. Convergence in Probability and Order in Probability
5.2. Convergence in Distribution
5.3. Asymptotic Comparisons (Pitman Efficiency)
5.4. Comparison of sample mean, median and trimmed mean
Chapter 6. Maximum Likelihood Estimation
6.1. Consistency
6.2. Asymptotic Normality of the MLE
6.3. Asymptotic Optimality of the MLE
CHAPTER 1
Preliminaries
1.1. Conditional Expectation
Let $(\mathcal{X}, \mathcal{A}, P)$ be a probability space. If $X \in L^1(\mathcal{A}, P)$ and $\mathcal{G}$ is a sub-$\sigma$-field of $\mathcal{A}$, then $E(X|\mathcal{G})$ is a random variable such that
(i) $E(X|\mathcal{G}) \in \mathcal{G}$ (i.e. is $\mathcal{G}$-measurable),
(ii) $E(I_G X) = E(I_G\, E(X|\mathcal{G}))$ for all $G \in \mathcal{G}$.
(For $X \geq 0$, $\nu(G) = E(I_G X)$ is a measure on $\mathcal{G}$ and $P(G) = 0 \Rightarrow \nu(G) = 0$, so by the Radon-Nikodym theorem there exists a $\mathcal{G}$-measurable function $E(X|\mathcal{G})$ such that $\nu(G) = \int_G E(X|\mathcal{G})\, dP$, i.e. (ii) is satisfied. This shows the existence of $E(X^+|\mathcal{G})$ and $E(X^-|\mathcal{G})$. Then we define $E(X|\mathcal{G}) = E(X^+|\mathcal{G}) - E(X^-|\mathcal{G})$.)
Remark 1.1.1. (ii) generalizes to $E(YX) = E(Y\, E(X|\mathcal{G}))$ for all $Y \in \mathcal{G}$ such that $E|YX| < \infty$. The conditional probability of $A$ given $\mathcal{G}$ is defined for all $A \in \mathcal{A}$ as $P(A|\mathcal{G}) = E(I_A|\mathcal{G})$.
Remark 1.1.2. If $X \in L^2(\mathcal{A}, P)$, then $E(X|\mathcal{G})$ is the orthogonal projection in $L^2(\mathcal{A}, P)$ of $X$ onto the closed linear subspace $L^2(\mathcal{G}, P)$, since
(i) $E(X|\mathcal{G}) \in L^2(\mathcal{G}, P)$, and
(ii) $E(Y(X - E(X|\mathcal{G}))) = 0$ for all $Y \in L^2(\mathcal{G}, P)$.
Conditioning on a Statistic
Let $X$ be a r.v. defined on $(\mathcal{X}, \mathcal{A}, P)$ with $E|X| < \infty$ and let $T$ be a measurable function (not necessarily real-valued) from $(\mathcal{X}, \mathcal{A})$ into $(\mathcal{T}, \mathcal{F})$:
$(\mathcal{X}, \mathcal{A}, P) \xrightarrow{\;T\;} (\mathcal{T}, \mathcal{F}, P^T)$
Such a $T$ is called a statistic (and is not necessarily real-valued). The $\sigma$-field of subsets of $\mathcal{X}$ induced by $T$ is
$\sigma(T) = \{T^{-1}S : S \in \mathcal{F}\} = T^{-1}\mathcal{F}.$
Definition 1.1.3. $E(X|T) := E(X|\sigma(T))$.
Recall that a real-valued function $f$ on $\mathcal{X}$ is $\sigma(T)$-measurable $\iff$ $f = g \circ T$ for some $\mathcal{F}$-measurable $g$ on $\mathcal{T}$, i.e. $f(x) = g(T(x))$, as in the diagram $\mathcal{X} \xrightarrow{\;T\;} \mathcal{T} \xrightarrow{\;g\;} \mathbb{R}$.
This implies that $E(X|T)$ is expressible as $E(X|T) = h(T)$ for some $\mathcal{F}$-measurable function $h$ on $\mathcal{T}$, unique a.e. $P^T$ (diagram: $\mathcal{X} \xrightarrow{\;T\;} \mathcal{T} \xrightarrow{\;h\;} \mathbb{R}$).
Definition 1.1.4. $E(X|t) := h(t)$.
Example 1.1.5. Suppose $(X, T)$ has probability density $p(x,t)$ w.r.t. Lebesgue measure on $\mathbb{R}^2$ and $E|X| < \infty$. Then $E(X|\sigma(T)) = h(T)$, where
$h(t) = E(X|T = t) = \frac{\int x\, p(x,t)\, dx}{\int p(x,t)\, dx}\; I_{\{p_T(t) > 0\}}(t) \quad a.s.\ P^T.$
PROOF
(i) The right-hand side is Borel measurable in $t$ (by Fubini).
(ii) $G \in \sigma(T) \Rightarrow G = T^{-1}F$ for some $F \in \mathcal{F} \Rightarrow I_G = I_F(T)$
$\Rightarrow E(I_G\, E(X|\sigma(T))) = E(I_G X) = \iint x\, I_F(t)\, p(x,t)\, dx\, dt = \int I_F(t)\, h(t)\, p_T(t)\, dt = E[I_F(T) h(T)] = E[I_G h(T)].$
Properties of Conditional Expectation
If $T$ is a statistic, $X$ is the identity function on $\mathcal{X}$, and $f_n$, $f$, $g$ are integrable, then
(i) $E[af(X) + bg(X)\,|\,T] = aE[f(X)|T] + bE[g(X)|T]$ a.s.
(ii) $a \leq f(X) \leq b$ a.s. $\Rightarrow a \leq E[f(X)|T] \leq b$ a.s.
(iii) $|f_n| \leq g$, $f_n(x) \to f(x)$ a.s. $\Rightarrow E[f_n(X)|T] \to E[f(X)|T]$ a.s.
(iv) $E[E(f(X)|T)] = Ef(X)$.
If $E|h(T)f(X)| < \infty$, then
(v) $E[h(T)f(X)|T] = h(T)\,E[f(X)|T]$ a.s.
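The following Python sketch is an illustration added here (not part of the original notes) of the formula in Example 1.1.5: for a bivariate normal pair $(X,T)$ with correlation $\rho$, the ratio of integrals $h(t)$ should reproduce the known closed form $E(X|T=t) = \rho t$. The density and grid below are illustrative choices.

```python
# Numerical check of Example 1.1.5 (illustrative, not from the notes):
# for (X, T) standard bivariate normal with correlation rho, E(X | T = t) = rho * t.
# We approximate h(t) = (integral of x p(x,t) dx) / (integral of p(x,t) dx).
import numpy as np

rho = 0.6
def joint_density(x, t):
    # standard bivariate normal density with correlation rho
    z = (x**2 - 2*rho*x*t + t**2) / (1 - rho**2)
    return np.exp(-z/2) / (2*np.pi*np.sqrt(1 - rho**2))

def h(t, grid=np.linspace(-10, 10, 4001)):
    p = joint_density(grid, t)
    return np.trapz(grid * p, grid) / np.trapz(p, grid)

for t in [-1.5, 0.0, 2.0]:
    print(t, h(t), rho * t)   # numerical h(t) should match rho * t
```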
1.2. Sufficiency
Set up
$X$: random observable quantity (the identity function on $(\mathcal{X}, \mathcal{A}, P_\theta)$)
$\mathcal{X}$: sample space, the set of possible values of $X$
$\mathcal{A}$: $\sigma$-algebra of subsets of $\mathcal{X}$
$\mathcal{P} = \{P_\theta, \theta \in \Omega\}$: a family of probability measures on $\mathcal{A}$ (distributions of $X$)
$T: \mathcal{X} \to \mathcal{T}$ is an $\mathcal{A}/\mathcal{F}$ measurable function and $T(X)$ is called a statistic.
probability space $(\mathcal{X}, \mathcal{A}, P_\theta) \xrightarrow{\;X\;}$ sample space $(\mathcal{X}, \mathcal{A}, P_\theta) \xrightarrow{\;T\;} (\mathcal{T}, \mathcal{F}, P_\theta^T)$
We adopt this notation because sometimes we wish to talk about $T(X(\cdot))$, the random variable, and sometimes about $T(X(x)) = T(x)$, a particular element of $\mathcal{T}$. We shall also use the notation $P_\theta(A|T(x))$ for $P_\theta(A|T = T(x))$ and $P_\theta(A|T)$ for the random variable $P_\theta(A|T(\cdot))$ on $\mathcal{X}$.
Definition 1.2.1. The statistic $T$ is sufficient for $\theta$ (or for $\mathcal{P}$) iff the conditional distribution of $X$ given $T = t$ is independent of $\theta$ for all $t$, i.e. there exists an $\mathcal{F}$-measurable $P(A|T = \cdot)$ such that $P_\theta(A|T = t) = P(A|T = t)$ a.s. $P_\theta^T$ for all $A \in \mathcal{A}$ and all $\theta \in \Omega$.
Example 1.2.2.
$X = (X_1, \ldots, X_n)$ iid with pdf $f_\theta(x)$ w.r.t. $dx$,
$P_\theta = P_\theta(dx_1, \ldots, dx_n) = f_\theta(x_1)\cdots f_\theta(x_n)\, dx_1 \cdots dx_n,$
$T(X) = (X_{(1)}, \ldots, X_{(n)})$, where $X_{(i)}$ is the $i$th order statistic.
The probability mass function of $X$ given $T = t$ is
$p_{X|T=t}(x|t) = \frac{1}{n!}\,\delta_{t_1}(x_{(1)})\cdots\delta_{t_n}(x_{(n)}),$
i.e. it assigns point mass $\frac{1}{n!}$ to each $x$ such that $x_{(1)} = t_1, \ldots, x_{(n)} = t_n$. This is independent of $\theta$, indicating that $T$ contains all the information about $\theta$ contained in the sample.
The Factorization Criterion
Definition 1.2.3. A family of probability measures $\mathcal{P} = \{P_\theta : \theta \in \Omega\}$ is equivalent to a p.m. $\lambda$ if
$\lambda(A) = 0 \iff P_\theta(A) = 0 \quad \forall \theta \in \Omega.$
We also say that $\mathcal{P}$ is dominated by a $\sigma$-finite measure $\mu$ on $(\mathcal{X}, \mathcal{A})$ if
$P_\theta \ll \mu$ for all $\theta \in \Omega$.
It is clear that equivalence to $\lambda$ implies domination by $\lambda$.
Theorem 1.2.4. Let $\mathcal{P}$ be dominated by a p.m. $\lambda$ where
$\lambda = \sum_{i=0}^{\infty} c_i P_{\theta_i} \qquad (c_i \geq 0,\ \sum c_i = 1).$
Then the statistic $T$ (with range $(\mathcal{T}, \mathcal{F})$) is sufficient for $\mathcal{P}$ $\iff$ there exists an $\mathcal{F}$-measurable function $g_\theta(\cdot)$ such that
$dP_\theta(x) = g_\theta(T(x))\, d\lambda(x) \quad \forall \theta \in \Omega.$
Proof. ($\Rightarrow$) Suppose $T$ is sufficient for $\mathcal{P}$. Then
$P_\theta(A|T(x)) = P(A|T(x)) \quad \forall \theta.$
Throughout this part of the proof $X$ will denote the indicator function of a subset of $\mathcal{X}$. The preceding equality then implies that
$E_\theta(X|T) = E(X|T) \quad \forall X \in \mathcal{A},\ \forall \theta.$
Hence for all $\theta \in \Omega$, $X \in \mathcal{A}$, $G \in \sigma(T)$, we have
$E_\theta(I_G\, E(X|T)) = E_\theta(E_\theta(I_G X|T)) = E_\theta(I_G X).$
Set $\theta = \theta_i$, multiply by $c_i$ and sum over $i = 0, 1, 2, \ldots$ to get
$E_\lambda(I_G\, E(X|T)) = E_\lambda(I_G X) \quad \forall X \in \mathcal{A},\ \forall G \in \sigma(T).$
This implies that $E_\lambda(X|T) = E(X|T)$ for all $X \in \mathcal{A}$, and hence
$E_\theta(X|T) = E(X|T) = E_\lambda(X|T) \quad \forall X \in \mathcal{A},\ \forall \theta.$
Now define $g_\theta(T(\cdot))$ to be the Radon-Nikodym derivative of $P_\theta$ with respect to $\lambda$, with both regarded as measures on $\sigma(T)$. We know this exists since $\lambda$ dominates every $P_\theta$. We also know it is $\sigma(T)$-measurable, so it can be written in the form $g_\theta(T(\cdot))$, and we know that $E_\theta(X) = E_\lambda(g_\theta(T)X)$ for all $X \in \sigma(T)$. We need to establish however that this last relation holds for all $X \in \mathcal{A}$. We do this as follows.
$X \in \mathcal{A} \Rightarrow E_\theta(X) = E_\theta(E(X|T)) = E_\lambda(g_\theta(T)\,E(X|T)) = E_\lambda(E_\lambda(g_\theta(T)X|T)) = E_\lambda(g_\theta(T)X).$
This shows that $g_\theta(T(x)) = \frac{dP_\theta}{d\lambda}(x)$ when $P_\theta$ and $\lambda$ are regarded as measures on $\mathcal{A}$.
($\Leftarrow$) Suppose that for each $\theta$, $\frac{dP_\theta}{d\lambda}(x) = g_\theta(T(x))$ for some $g_\theta$. We shall then show that the conditional probability $P_\lambda(A|t)$ is a version of $P_\theta(A|t)$ for all $\theta$.
For $A \in \mathcal{A}$, $G \in \sigma(T)$,
$\int_G I_A\, dP_\theta = \int_G P_\theta(A|T)\, dP_\theta = \int_G P_\theta(A|T)\, g_\theta(T)\, d\lambda$
and
$\int_G I_A\, dP_\theta = \int_G I_A\, g_\theta(T)\, d\lambda = \int_G E_\lambda[I_A\, g_\theta(T)|T]\, d\lambda = \int_G E_\lambda[I_A|T]\, g_\theta(T)\, d\lambda$
$\Rightarrow P_\theta(A|T)\, g_\theta(T) = E_\lambda(I_A|T)\, g_\theta(T)$ a.s. $\lambda$, and hence a.s. $P_\theta$ for all $\theta$. Also $g_\theta(T) \neq 0$ a.s. $P_\theta$, since $dP_\theta = g_\theta(T)\, d\lambda$. Hence $P_\theta(A|T) = E_\lambda(I_A|T) = P_\lambda(A|T)$ a.s. $P_\theta$, and the right-hand side is independent of $\theta$. $\square$
Theorem 1.2.5. (Theorem 2, Appendix, TSH 1) If $\mathcal{P} = \{P_\theta, \theta \in \Omega\}$ is dominated by a $\sigma$-finite measure $\mu$, then it is equivalent to $\lambda = \sum_{i=0}^{\infty} c_i P_{\theta_i}$ for some countable subcollection $P_{\theta_i} \in \mathcal{P}$, $i = 0, 1, 2, \ldots$, with $c_i \geq 0$ and $\sum c_i = 1$.
Proof. $\mu$ is $\sigma$-finite $\Rightarrow$ there exist $A_1, A_2, \ldots \in \mathcal{A}$, disjoint, with $\bigcup A_i = \mathcal{X}$ and $0 < \mu(A_i) < \infty$, $i = 1, 2, \ldots$. Set
$\nu(A) = \sum_{i=1}^{\infty} \frac{\mu(A \cap A_i)}{2^i\,\mu(A_i)}.$
Then $\nu$ is a probability measure equivalent to $\mu$. Hence we can assume without loss of generality that the dominating measure $\mu$ is a probability measure. Let
$f_\theta = \frac{dP_\theta}{d\mu}$
and set
$S_\theta = \{x : f_\theta(x) > 0\}.$
Then
(1.2.1) $P_\theta(A) = P_\theta(A \cap S_\theta)$, and $P_\theta(A \cap S_\theta) = 0$ iff $\mu(A \cap S_\theta) = 0$.
(Since $P_\theta \ll \mu$, and since $\mu(A \cap S_\theta) > 0$ together with $f_\theta > 0$ on $A \cap S_\theta$ $\Rightarrow P_\theta(A \cap S_\theta) > 0$.) A set $A \in \mathcal{A}$ is a kernel if $A \subseteq S_\theta$ for some $\theta$; a finite or countable union of kernels is called a chain. Set
$\alpha = \sup_{\text{chains } C} \mu(C).$
Then $\alpha = \mu(C^*)$ for some chain $C^* = \bigcup_{n=1}^{\infty} A_n$, $A_n \subseteq S_{\theta_n}$ (since there exist chains $\{C_n\}$ such that $\mu(C_n) \uparrow \alpha$, and for $C^* = \bigcup C_n$, $\mu(C^*) = \alpha$).
It follows from the following Lemma that $\mathcal{P}$ is dominated by $\lambda(\cdot) = \sum_{n=1}^{\infty} \frac{1}{2^n} P_{\theta_n}(\cdot)$. Since
$\lambda(A) = 0 \Rightarrow P_{\theta_n}(A) = 0 \ \forall n \Rightarrow P_\theta(A) = 0 \ \forall \theta$ (by the Lemma),
and it is obvious that
$P_\theta(A) = 0 \ \forall \theta \Rightarrow \lambda(A) = 0,$
$\mathcal{P}$ is equivalent to $\lambda(\cdot) = \sum_{n=1}^{\infty} \frac{1}{2^n} P_{\theta_n}(\cdot)$. $\square$
Lemma 1.2.6. If $\{\theta_n\}$ is the sequence used in the construction of $C^*$, then $\{P_\theta, \theta \in \Omega\}$ is dominated by $\{P_{\theta_n}, n = 1, 2, \ldots\}$, i.e.
$P_{\theta_n}(A) = 0 \ \forall n \Rightarrow P_\theta(A) = 0 \ \forall \theta.$
TSH stands for Testing Statistical Hypotheses, E.L. Lehmann, Springer Texts in Statistics, 1997.
Proof.
$P_{\theta_n}(A) = 0 \ \forall n \Rightarrow \mu(A \cap S_{\theta_n}) = 0 \ \forall n$ (by 1.2.1)
$\Rightarrow$ (since $C^* \subseteq \bigcup_n S_{\theta_n}$) $\mu(A \cap C^*) = 0$
$\Rightarrow$ (since $P_\theta \ll \mu$) $P_\theta(A \cap C^*) = 0 \ \forall \theta.$
If $P_\theta(A) > 0$ for some $\theta$ then, since $P_\theta(A) = P_\theta(A \cap C^*) + P_\theta(A \cap C^{*c})$,
$P_\theta(A \cap C^{*c}) = P_\theta(A \cap C^{*c} \cap S_\theta) > 0$
$\Rightarrow A \cap C^{*c} \cap S_\theta$ is a kernel disjoint from $C^*$
$\Rightarrow C^* \cup (A \cap C^{*c} \cap S_\theta)$ is a chain with $\mu$-measure $> \alpha$ (since $P_\theta(A) > 0 \Rightarrow \mu(A \cap C^{*c} \cap S_\theta) > 0$),
contradicting the definition of $\alpha$. Hence, $P_\theta(A) = 0$ for all $\theta$. $\square$
Theorem 1.2.7. (The Factorization Theorem) Let $\mu$ be a $\sigma$-finite measure which dominates $\mathcal{P} = \{P_\theta : \theta \in \Omega\}$ and let
$p_\theta = \frac{dP_\theta}{d\mu}.$
Then the statistic $T$ is sufficient for $\mathcal{P}$ if and only if there exists a non-negative $\mathcal{F}$-measurable function $g_\theta: \mathcal{T} \to \mathbb{R}$ and an $\mathcal{A}$-measurable function $h: \mathcal{X} \to \mathbb{R}$ such that
(1.2.2) $p_\theta(x) = g_\theta(T(x))\, h(x)$ a.e. $\mu$.
Proof. By Theorem 1.2.5, $\mathcal{P}$ is equivalent to
$\lambda = \sum_i c_i P_{\theta_i}, \quad\text{where } c_i \geq 0,\ \sum_i c_i = 1.$
If $T$ is sufficient for $\mathcal{P}$,
$p_\theta(x) = \frac{dP_\theta}{d\mu}(x) = \frac{dP_\theta}{d\lambda}(x)\,\frac{d\lambda}{d\mu}(x) = g_\theta(T(x))\, h(x)$
by Theorem 1.2.4.
On the other hand, if equation (1.2.2) holds,
$d\lambda(x) = \sum_i c_i\, dP_{\theta_i}(x) = \sum_i c_i\, p_{\theta_i}(x)\, d\mu(x) = \sum_{i} c_i\, g_{\theta_i}(T(x))\, h(x)\, d\mu(x)$
(1.2.3) $\qquad = K(T(x))\, h(x)\, d\mu(x).$
Thus,
$dP_\theta(x) = p_\theta(x)\, d\mu(x)$ (by the definition of $p_\theta(x)$)
$= g_\theta(T(x))\, h(x)\, d\mu(x)$ (by equations (1.2.2) and (1.2.3))
$= \frac{g_\theta(T(x))}{K(T(x))}\, d\lambda(x) = \tilde{g}_\theta(T(x))\, d\lambda(x),$
where $\tilde{g}_\theta(T(x)) := 0$ if $K(T(x)) = 0$.
Hence $T$ is sufficient for $\mathcal{P}$ by Theorem 1.2.4. $\square$
Remark 1.2.8. If $f_\theta(x)$ is the density of $X$ with respect to Lebesgue measure, then $T$ is sufficient for $\mathcal{P}$ iff
$f_\theta(x) = g_\theta(T(x))\, h(x),$
where $h$ is independent of $\theta$.
Example 1.2.9. Let $X_1, X_2, \ldots, X_n$ be iid $N(\xi, \sigma^2)$, $\xi \in \mathbb{R}$, $\sigma > 0$, and write $X = (X_1, X_2, \ldots, X_n)$. A $\sigma$-finite dominating measure on $\mathcal{B}^n$ is Lebesgue measure, with
$p_{\xi,\sigma^2}(x) = \left(\sqrt{2\pi}\,\sigma\right)^{-n}\exp\left(-\frac{1}{2\sigma^2}\sum x_i^2 + \frac{\xi}{\sigma^2}\sum x_i - \frac{n\xi^2}{2\sigma^2}\right) = g_{\xi,\sigma^2}\!\left(\sum x_i, \sum x_i^2\right)\cdot 1.$
Therefore $T(X) = \left(\sum X_i, \sum X_i^2\right)$ is sufficient for $\mathcal{P} = \{P_{\xi,\sigma^2}\}$.
Remark 1.2.10. $T^*(X) = (\bar{X}, S^2)$ is also sufficient for $\mathcal{P} = \{P_{\xi,\sigma^2}\}$, since
$g_{\xi,\sigma^2}\!\left(\sum x_i, \sum x_i^2\right) = g^*_{\xi,\sigma^2}(\bar{x}, S^2).$
$T$ and $T^*$ are equivalent in the following sense.
Definition 1.2.11. Two statistics $T$ and $S$ are equivalent if they induce the same $\sigma$-algebra up to $\mathcal{P}$-null sets, i.e. if there exists a $\mathcal{P}$-null set $N$ and functions $f$ and $g$ such that
$T(x) = f(S(x))$ and $S(x) = g(T(x))$ for all $x \in N^c$.
Example 1.2.12. Let $X_1, \ldots, X_n$ be iid $U(0, \theta)$, $\theta > 0$, and $X = (X_1, \ldots, X_n)$.
$p_\theta(x) = \prod_{i=1}^{n} \frac{1}{\theta}\, I_{[0,\infty)}(x_i)\, I_{(-\infty,\theta]}(x_i) = \frac{1}{\theta^n}\, I_{[0,\infty)}(x_{(1)})\, I_{(-\infty,\theta]}(x_{(n)}) = g_\theta(x_{(n)})\, h(x)$
$\Rightarrow T(X) = X_{(n)}$ is sufficient for $\theta$.
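The following Python sketch is a small numerical illustration (added here, not part of the notes) of the factorization in Example 1.2.12: since $h(x) = 1$, the likelihood of a $U(0,\theta)$ sample depends on the data only through $x_{(n)}$, so two samples with the same maximum give identical likelihoods at every $\theta$. The sample sizes and seeds are arbitrary.

```python
# Minimal illustration of Example 1.2.12 (not from the notes): for U(0, theta) the
# likelihood depends on the sample only through x_(n) = max(x).
import numpy as np

def likelihood(x, theta):
    x = np.asarray(x)
    return np.all((x >= 0) & (x <= theta)) / theta**len(x)

rng = np.random.default_rng(0)
x = rng.uniform(0, 3, size=10)
y = rng.uniform(0, 3, size=10)
y = y * x.max() / y.max()          # force the two samples to share the same maximum
for theta in [3.0, 4.0, 10.0]:
    print(theta, likelihood(x, theta), likelihood(y, theta))  # equal for each theta
```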
Example 1.2.13. $X_1, \ldots, X_n$ iid $N(0, \sigma^2)$, $\Omega = \{\sigma^2 : \sigma^2 > 0\}$. Define
$T_1(X) = (X_1, \ldots, X_n)$
$T_2(X) = (X_1^2, \ldots, X_n^2)$
$T_3(X) = (X_1^2 + \cdots + X_m^2,\; X_{m+1}^2 + \cdots + X_n^2)$
$T_4(X) = X_1^2 + \cdots + X_n^2$
$p_\sigma(x) = \frac{1}{(\sqrt{2\pi}\,\sigma)^n}\exp\left(-\frac{1}{2\sigma^2}\sum_{1}^{n} x_i^2\right)$
Each $T_i(X)$ is sufficient. However $\sigma(T_4) \subseteq \sigma(T_3) \subseteq \sigma(T_2) \subseteq \sigma(T_1)$
(since functions of $T_4$ are functions of $T_3$, functions of $T_3$ are functions of $T_2$, and functions of $T_2$ are functions of $T_1$).
Remark 1.2.14. If $T$ is sufficient for $\theta$ and $T = H(S)$ where $S$ is some statistic, then $S$ is also sufficient, since
$p_\theta(x) = g_\theta(T(x))\, h(x) = g_\theta(H(S(x)))\, h(x).$
Since $\sigma(T) = S^{-1}H^{-1}\mathcal{B}_T \subseteq S^{-1}\mathcal{B}_S$ (where $(\mathcal{X}, \mathcal{A}) \xrightarrow{\;S\;} (\mathcal{S}, \mathcal{B}_S) \xrightarrow{\;H\;} (\mathcal{T}, \mathcal{B}_T)$), $T$ provides a greater reduction of the data than $S$, strictly greater unless $H$ is one-to-one, in which case $S$ and $T$ are equivalent.
Definition 1.2.15. $T$ is a minimal sufficient statistic if, for any sufficient statistic $S$, there exists a measurable function $H$ such that
$T = H(S)$ a.s. $\mathcal{P}$.
Theorem 1.2.16. If $\mathcal{P}$ is dominated by a $\sigma$-finite measure $\mu$, then the statistic $U$ is sufficient iff for every fixed $\theta$ and $\theta_0$, the ratio of the densities $p_\theta$ and $p_{\theta_0}$ with respect to $\mu$, defined to be 1 when both densities are zero, satisfies
$\frac{p_\theta(x)}{p_{\theta_0}(x)} = f_{\theta,\theta_0}(U(x))$ a.s. $P_\theta$, for some measurable $f_{\theta,\theta_0}$.
Proof. HW problem.
Theorem 1.2.17. Let $\mathcal{P}$ be a finite family with densities $\{p_0, p_1, \ldots, p_k\}$, all having the same support (i.e. $S = \{x : p_i(x) > 0\}$ is independent of $i$). Then
$T(x) = \left(\frac{p_1(x)}{p_0(x)}, \frac{p_2(x)}{p_0(x)}, \ldots, \frac{p_k(x)}{p_0(x)}\right)$
is minimal sufficient. (Also true for a countable collection of densities, with no change in the proof.)
Proof. First, $T$ is sufficient by Theorem 1.2.16, since $\frac{p_j(x)}{p_i(x)}$ is a function of $T(x)$ for all $i$ and $j$ (common support is needed here). If $U$ is a sufficient statistic then, by Theorem 1.2.16,
$\frac{p_i(x)}{p_0(x)}$ is a function of $U$ for each $i$
$\Rightarrow$ $T$ is a function of $U$
$\Rightarrow$ $T$ is minimal sufficient. $\square$
Remark 1.2.18. Theorem 1.2.17 extends to uncountable collections under further conditions.
Theorem 1.2.19. Let $\mathcal{P}$ be a family with common support and suppose $\mathcal{P}_0 \subset \mathcal{P}$. If $T$ is minimal sufficient for $\mathcal{P}_0$ and sufficient for $\mathcal{P}$, then $T$ is minimal sufficient for $\mathcal{P}$.
Proof.
$U$ is sufficient for $\mathcal{P}$ $\Rightarrow$ $U$ is sufficient for $\mathcal{P}_0$ (by Definition 1.2.1).
$T$ is minimal sufficient for $\mathcal{P}_0$ $\Rightarrow$ $T(x) = H(U(x))$ a.s. $\mathcal{P}_0$.
But since $\mathcal{P}$ has common support, $T(x) = H(U(x))$ a.s. $\mathcal{P}$. $\square$
Remark 1.2.20.
1. Minimal sufficient statistics for uncountable families $\mathcal{P}$ can often be obtained by combining the above theorems.
2. Minimal sufficient statistics exist under weak assumptions (but not always). In particular they exist if $(\mathcal{X}, \mathcal{A}) = (\mathbb{R}^n, \mathcal{B}^n)$ and $\mathcal{P}$ is dominated by a $\sigma$-finite measure.
Example 1.2.21. $\mathcal{P}_0$: $(X_1, \ldots, X_n)$ iid $N(\xi, 1)$, $\xi \in \{\xi_0, \xi_1\}$.
$\mathcal{P}$: $(X_1, \ldots, X_n)$ iid $N(\xi, 1)$, $\xi \in \mathbb{R}$.
$\frac{p_1(x)}{p_0(x)} = \exp\left\{-\frac{1}{2}\left[\sum_i (x_i - \xi_1)^2 - \sum_i (x_i - \xi_0)^2\right]\right\} = \exp\left\{-\frac{1}{2}\left[\sum_i 2x_i(\xi_0 - \xi_1) + n\xi_1^2 - n\xi_0^2\right]\right\}.$
This is a function of $\bar{x}$, hence $\bar{X}$ is minimal sufficient for $\mathcal{P}_0$ by Theorem 1.2.17. Since $\bar{X}$ is sufficient for $\mathcal{P}$ (by the factorization theorem), $\bar{X}$ is minimal sufficient for $\mathcal{P}$.
Example 1.2.22. $\mathcal{P}$: $(X_1, \ldots, X_n)$ iid $U(0, \theta)$, $\theta > 0$.
Show that $X_{(n)}$ is minimal sufficient. (This is part of problem 1.6.16, for which you will need to use problem 1.6.11.)
Example 1.2.23. Logistic
$\mathcal{P}$: $(X_1, \ldots, X_n)$ iid $L(\theta, 1)$, $\theta \in \mathbb{R}$.
$\mathcal{P}_0$: $(X_1, \ldots, X_n)$ iid $L(\theta, 1)$, $\theta \in \{0, \theta_1, \ldots, \theta_n\}$.
$p_\theta(x) = \frac{\exp\left[-\sum_i (x_i - \theta)\right]}{\prod_{i=1}^{n}\{1 + \exp[-(x_i - \theta)]\}^2},$
so $T = (T_1(X), \ldots, T_n(X))$ is minimal sufficient for $\mathcal{P}_0$, where
$T_i(x) = \frac{p_{\theta_i}(x)}{p_0(x)} = e^{n\theta_i}\prod_{j=1}^{n}\frac{(1 + e^{-x_j})^2}{(1 + e^{-(x_j - \theta_i)})^2}.$
We will show that $T(X)$ is equivalent to $(X_{(1)}, \ldots, X_{(n)})$, by showing that
$T(x) = T(y) \iff x_{(1)} = y_{(1)}, \ldots, x_{(n)} = y_{(n)}.$
Proof. ($\Leftarrow$) Obvious from the expression for $T_i(x)$.
($\Rightarrow$) Suppose that $T_i(x) = T_i(y)$ for $i = 1, 2, \ldots, n$, i.e.
$\prod_{j=1}^{n}\frac{(1 + e^{-x_j})^2}{(1 + e^{-(x_j - \theta_i)})^2} = \prod_{j=1}^{n}\frac{(1 + e^{-y_j})^2}{(1 + e^{-(y_j - \theta_i)})^2}, \quad i = 1, \ldots, n,$
i.e.
$\prod_{j=1}^{n}\left(\frac{1 + u_j\omega}{1 + u_j}\right) = \prod_{j=1}^{n}\left(\frac{1 + v_j\omega}{1 + v_j}\right), \quad \omega = \omega_1, \ldots, \omega_n,$
where $u_j = e^{-x_j}$, $v_j = e^{-y_j}$ and $\omega_i = e^{\theta_i}$. Here we have two polynomials in $\omega$ of degree $n$ which are equal for $n + 1$ distinct values, $1, \omega_1, \ldots, \omega_n$, of $\omega$, and hence for all $\omega$.
$\omega = 0 \Rightarrow \prod_{j=1}^{n}(1 + u_j) = \prod_{j=1}^{n}(1 + v_j)$
$\Rightarrow \prod_{j=1}^{n}(1 + u_j\omega) = \prod_{j=1}^{n}(1 + v_j\omega) \quad \forall \omega$
$\Rightarrow$ the zero sets of both these polynomials are the same
$\Rightarrow$ $x$ and $y$ have the same order statistics. $\square$
By Theorem 1.2.17, the order statistics are therefore minimal sufficient for $\mathcal{P}_0$. They are also sufficient for $\mathcal{P}$, so by Theorem 1.2.19 the order statistics are minimal sufficient for $\mathcal{P}$. There is not much reduction possible here! This is fairly typical of location families, the normal, uniform and exponential distributions providing happy exceptions.
Ancillarity
Definition 1.2.24. A statistic $V$ is said to be ancillary for $\mathcal{P}$ if the distribution $P_\theta^V$ of $V$ does not depend on $\theta$. It is called first order ancillary if $E_\theta V$ is independent of $\theta$.
Example 1.2.25. In Example 1.2.23, $X_{(2)} - X_{(1)}$ is ancillary, since $Y_1 = X_1 - \theta, \ldots, Y_n = X_n - \theta$ are iid $L(0, 1)$, and $X_{(2)} - X_{(1)} = Y_{(2)} - Y_{(1)}$.
Example 1.2.26.
$\mathcal{P}$: $(X_1, \ldots, X_n)$ iid $N(\xi, 1)$, $\xi \in \mathbb{R}$.
$S^2 = \sum (X_i - \bar{X})^2$ is ancillary, since
$S^2 = \sum (Y_i - \bar{Y})^2$, where $Y_i = X_i - \xi$, $i = 1, 2, \ldots, n$, are iid $N(0, 1)$.
Remark 1.2.27. Ancillary statistics by themselves contain no information about $\theta$; however, minimal sufficient statistics may contain ancillary components. For example, in 1.2.23, $T = (X_{(1)}, \ldots, X_{(n)})$ is equivalent to $T^* = (X_{(1)}, X_{(2)} - X_{(1)}, \ldots, X_{(n)} - X_{(1)})$, whose last $(n-1)$ components are ancillary. You can't drop them, as $X_{(1)}$ is not even sufficient.
Complete Statistics
A sufficient statistic should bring about the best reduction of the data if it contains as little ancillary material as possible. This suggests requiring that no non-constant function of $T$ be ancillary, or not even first order ancillary, i.e. that
$E_\theta f(T) = c$ for all $\theta \in \Omega$ $\Rightarrow$ $f(T) = c$ a.s. $\mathcal{P}$,
or equivalently that
$E_\theta f(T) = 0$ for all $\theta \in \Omega$ $\Rightarrow$ $f(T) = 0$ a.s. $\mathcal{P}$.
Definition 1.2.28. A statistic $T$ is complete if
(1.2.4) $E_\theta f(T) = 0$ for all $\theta \in \Omega$ $\Rightarrow$ $f(T) = 0$ a.s. $\mathcal{P}$.
$T$ is said to be boundedly complete if equation (1.2.4) holds for all bounded measurable functions $f$.
Since complete sufficient statistics are intended to give a good reduction of the data, it is not unreasonable to expect them to be minimal. We shall prove a slightly weaker result.
Theorem 1.2.29. Let $U$ be a complete sufficient statistic. If there exists a minimal sufficient statistic, then $U$ is minimal sufficient.
Proof. Let $T$ be a minimal sufficient statistic and let $\phi$ be a bounded measurable function. We will show that
$\phi(U) \in \sigma(T)$, i.e. $E(\phi(U)|T) = \phi(U)$ a.s.
Now
$E(\phi(U)|T) = g(U)$ for some measurable $g$, since $T$ is minimal and $U$ is sufficient.
Let $h(U) = E(\phi(U)|T) - \phi(U)$; then $E_\theta h(U) = 0$ for all $\theta$, so $h(U) = 0$ a.s. $\mathcal{P}$ since $U$ is complete. Hence $\phi(U) = E(\phi(U)|T) \in \sigma(T)$. Hence $U$-measurable indicator functions are $T$-measurable, i.e. $\sigma(U) \subseteq \sigma(T)$, i.e. $U$ is minimal sufficient. $\square$
Remark 1.2.30.
1. If $\mathcal{P}$ is dominated by a $\sigma$-finite measure and $(\mathcal{X}, \mathcal{A}) = (\mathbb{R}^n, \mathcal{B}^n)$, the existence of a minimal sufficient statistic does not need to be assumed.
2. A minimal sufficient statistic is not necessarily complete. See the next example.
Example 1.2.31.
$\mathcal{P} = \{N(\sigma, \sigma^2), \sigma > 0\}$
$p_\sigma(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2\sigma^2}(x - \sigma)^2} = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x}{\sigma} - 1\right)^2}$
The single observation $X$ is minimal sufficient but not complete, since
$E_\sigma[I_{(0,\infty)}(X) - \Phi(1)] = P_\sigma(X > 0) - \Phi(1) = 0$ for all $\sigma$; however $P_\sigma(I_{(0,\infty)}(X) - \Phi(1) = 0) = 0$ for all $\sigma$.
Theorem 1.2.32. (Basu's theorem) If $T$ is complete and sufficient for $\mathcal{P}$, then any ancillary statistic is independent of $T$.
Proof. If $S$ is ancillary, then $P_\theta(S \in B) = p_B$, independent of $\theta$.
Sufficiency of $T$ $\Rightarrow$ $P_\theta(S \in B\,|\,T) = h(T)$, independent of $\theta$.
$\Rightarrow E_\theta(h(T) - p_B) = 0$ for all $\theta$
$\Rightarrow h(T) = p_B$ a.s. $\mathcal{P}$ by completeness
$\Rightarrow S$ is independent of $T$. $\square$
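A quick simulation (an illustrative sketch added here, not part of the notes) of Basu's theorem combined with Example 1.2.26: for $N(\xi, 1)$ samples, $\bar{X}$ is complete sufficient and $S^2$ is ancillary, so their empirical correlation over many replications should be near zero whatever $\xi$ is. The sample size and seed are arbitrary.

```python
# Illustrative check of Basu's theorem (not from the notes): for iid N(xi, 1) data,
# xbar (complete sufficient) and S^2 (ancillary) are independent, so their empirical
# correlation over many replications should be approximately 0.
import numpy as np

rng = np.random.default_rng(1)
xi, n, reps = 3.0, 10, 20000
samples = rng.normal(xi, 1.0, size=(reps, n))
xbar = samples.mean(axis=1)
s2 = samples.var(axis=1, ddof=1)
print(np.corrcoef(xbar, s2)[0, 1])   # close to 0
```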
1.3. Exponential Families.
Definition 1.3.1. A family of probability measures $\{P_\theta : \theta \in \Omega\}$ is said to be an $s$-parameter exponential family if there exists a $\sigma$-finite measure $\mu$ such that
$p_\theta(x) = \frac{dP_\theta}{d\mu}(x) = \exp\left(\sum_{i=1}^{s} \eta_i(\theta)\, T_i(x) - B(\theta)\right) h(x),$
where $\eta_i$, $T_i$ and $B$ are real-valued.
Remark 1.3.2.
1. The $P_\theta$, $\theta \in \Omega$, are equivalent (since $\{x : p_\theta(x) > 0\}$ is independent of $\theta$).
2. The factorization theorem implies that $T = (T_1, \ldots, T_s)$ is sufficient.
3. If we observe $X_1, \ldots, X_n$, iid with marginal distributions $P_\theta$, then $\sum_{j=1}^{n} T(X_j)$ is sufficient for $\theta$.
Theorem 1.3.3. If $\{1, \eta_1, \ldots, \eta_s\}$ is LI, then $T = (T_1, \ldots, T_s)$ is minimal sufficient.
(Linear independence of $\{1, \eta_1, \ldots, \eta_s\}$ means $c_1\eta_1(\theta) + \cdots + c_s\eta_s(\theta) + d = 0$ for all $\theta$ $\Rightarrow$ $c_1 = \cdots = c_s = d = 0$. Equivalently we say that $\{\eta_i\}$ is affinely independent (AI), since linear dependence means exactly that the set of points $\{(\eta_1(\theta), \ldots, \eta_s(\theta)), \theta \in \Omega\}$ lies in a proper affine subspace of $\mathbb{R}^s$.)
Proof. Fix $\theta_0 \in \Omega$ and consider
(1.3.1) $\frac{dP_\theta}{dP_{\theta_0}}(x) = \frac{p_\theta(x)}{p_{\theta_0}(x)} = \exp\{B(\theta_0) - B(\theta)\}\exp\left\{\sum_{i=1}^{s}(\eta_i(\theta) - \eta_i(\theta_0))\, T_i(x)\right\}.$
If $\{1, \eta_1, \ldots, \eta_s\}$ is LI then so is $\{1, \eta_1 - \eta_1(\theta_0), \ldots, \eta_s - \eta_s(\theta_0)\}$.
Set $S = \{(\eta_1(\theta) - \eta_1(\theta_0), \ldots, \eta_s(\theta) - \eta_s(\theta_0)), \theta \in \Omega\} \subseteq \mathbb{R}^s$. Then $\mathrm{span}(S)$ is a linear subspace of $\mathbb{R}^s$.
If $\dim(\mathrm{span}(S)) < s$, then there exists a non-zero vector $v = (v_1, \ldots, v_s)$ s.t.
$v_1(\eta_1(\theta) - \eta_1(\theta_0)) + \cdots + v_s(\eta_s(\theta) - \eta_s(\theta_0)) = 0 \quad \forall \theta,$
contradicting the linear independence of $\{1, \eta_i - \eta_i(\theta_0)\}$. Hence
(1.3.2) $\dim(\mathrm{span}(S)) = s$, i.e. there exist $\theta_1, \ldots, \theta_s \in \Omega$ s.t. $\{(\eta_1(\theta_i) - \eta_1(\theta_0), \ldots, \eta_s(\theta_i) - \eta_s(\theta_0)),\ i = 1, \ldots, s\}$ is LI.
From (1.3.1),
$\sum_{j=1}^{s}(\eta_j(\theta_i) - \eta_j(\theta_0))\, T_j(x) = \ln\frac{p_{\theta_i}(x)}{p_{\theta_0}(x)} + (B(\theta_i) - B(\theta_0)), \quad i = 1, \ldots, s.$
Since the matrix $[\eta_j(\theta_i) - \eta_j(\theta_0)]_{i,j=1}^{s}$ is non-singular, $T_j(x)$ can be expressed uniquely in terms of $\ln\frac{p_{\theta_i}(x)}{p_{\theta_0}(x)}$, $i = 1, \ldots, s$.
But $\left(\frac{p_{\theta_i}(x)}{p_{\theta_0}(x)},\ i = 1, \ldots, s\right)$ is minimal sufficient for $\mathcal{P}_0 = \{P_{\theta_j},\ j = 0, 1, \ldots, s\}$ by Theorem 1.2.17. Hence $T$ is minimal sufficient by Theorem 1.2.19. $\square$
Example 1.3.4.
$p_\theta(x) = \sqrt{\frac{\theta}{2\pi}}\exp\left\{-\frac{\theta}{2}x^2 + \theta x - \frac{\theta}{2}\right\}, \quad \theta > 0.$
$\eta_1(\theta) = -\frac{\theta}{2}$, $\eta_2(\theta) = \theta$; $T(x) = (x^2, x)$ is sufficient but not minimal,
since rewriting the model as $p_\theta(x) = \sqrt{\frac{\theta}{2\pi}}\exp\left\{-\frac{\theta}{2}(x - 1)^2\right\}$ we see that
$\tilde{T}(x) = (x - 1)^2$ is minimal sufficient.
Remark 1.3.5. The exponential family can always be rewritten in such a way that the functions $\{T_i\}$ and $\{\eta_i\}$ are AI. If there exist constants $c_1, \ldots, c_s, d$, not all zero, such that
$c_1 T_1(x) + \cdots + c_s T_s(x) = d$ a.s. $\mathcal{P}$,
then one of the $T_i$'s can be expressed in terms of the others (or is constant). After reducing the number of functions $T_i$ as far as possible, the same can be done with their coefficients $\eta_i$, until the new functions $\{T_i\}$ and $\{\eta_i\}$ are AI.
Definition 1.3.6. (Order of the exponential family.) If the functions $\{T_i, i = 1, \ldots, s\}$ on $\mathcal{X}$ and $\{\eta_i, i = 1, \ldots, s\}$ on $\Omega$ are both AI, then $s$ is the order of the exponential family
$p_\theta(x) = \frac{dP_\theta}{d\mu}(x) = \exp\left(\sum_{i=1}^{s} \eta_i(\theta)\, T_i(x) - B(\theta)\right) h(x).$
Proposition 1.3.7. The order is well-defined.
Proof. We shall show that
$s + 1 = \dim(V),$
where $V$ is the set of functions on $\mathcal{X}$ defined by $V = \mathrm{span}\{1, \ln\frac{dP_\theta}{dP_{\theta_0}}(\cdot), \theta \in \Omega\}$ (independent of the dominating measure and the choice of $\{\eta_i\}$, $\{T_i\}$).
$\ln\frac{dP_\theta}{dP_{\theta_0}}(x) = \sum_{i=1}^{s}(\eta_i(\theta) - \eta_i(\theta_0))\, T_i(x) + B(\theta_0) - B(\theta),$
so that
$V \subseteq \mathrm{span}\{1, T_i(\cdot), i = 1, \ldots, s\} \Rightarrow \dim(V) \leq s + 1.$
On the other hand, since $\{1, \eta_i, i = 1, \ldots, s\}$ is LI, each $T_j(x)$ can be expressed as a linear combination of $1$ and $\ln\frac{dP_{\theta_i}}{dP_{\theta_0}}(x)$, $i = 1, \ldots, s$, as in the proof of the previous theorem,
$\Rightarrow \mathrm{span}\{1, T_i(\cdot), i = 1, \ldots, s\} \subseteq V$
$\Rightarrow s + 1 \leq \dim(V). \quad \square$
Definition 1.3.8. (Canonical Form) For any $s$-parameter exponential family (not necessarily of order $s$) we can view the vector $\eta(\theta) = (\eta_1(\theta), \ldots, \eta_s(\theta))'$ as the parameter rather than $\theta$. Then the density with respect to $\mu$ can be rewritten as
$p(x, \eta) = \exp\left[\sum_{i=1}^{s} \eta_i T_i(x) - A(\eta)\right] h(x), \quad \eta \in \eta(\Omega).$
Since $p(\cdot, \eta)$ is a probability density with respect to $\mu$,
(1.3.3) $e^{A(\eta)} = \int e^{\sum_{1}^{s} \eta_i T_i(x)}\, h(x)\, d\mu(x).$
Definition 1.3.9. (The Natural Parameter Set) This is a possibly larger set than $\{\eta(\theta), \theta \in \Omega\}$. It is the set of all $s$-vectors $\eta$ for which, by suitable choice of $A(\eta)$, $p(\cdot, \eta)$ can be a probability density, i.e.
$N = \left\{\eta = (\eta_1, \ldots, \eta_s) \in \mathbb{R}^s : \int e^{\sum_{1}^{s}\eta_i T_i(x)}\, h(x)\, d\mu(x) < \infty\right\}.$
Theorem 1.3.10. $N$ is convex.
Proof. Suppose $\eta = (\eta_1, \ldots, \eta_s)$, $\eta^* = (\eta_1^*, \ldots, \eta_s^*) \in N$ and $0 < p < 1$. Then, by convexity of the exponential function,
$\int e^{p\sum_1^s \eta_i T_i(x) + (1-p)\sum_1^s \eta_i^* T_i(x)}\, h(x)\, d\mu(x) \leq p\int e^{\sum_1^s \eta_i T_i(x)}\, h(x)\, d\mu(x) + (1-p)\int e^{\sum_1^s \eta_i^* T_i(x)}\, h(x)\, d\mu(x) < \infty. \quad \square$
Theorem 1.3.11. $T = (T_1, \ldots, T_s)$ has density
$p_\eta(t) = \exp(\eta \cdot t - A(\eta))$
relative to $\nu = \tilde{\mu}\, T^{-1}$, where $d\tilde{\mu}(x) = h(x)\, d\mu(x)$.
Proof. If $f: \mathcal{T} \to \mathbb{R}$ is a bounded measurable function,
$E_\eta f(T) = \int f(T(x))\, e^{\eta \cdot T(x)}\, e^{-A(\eta)}\, d\tilde{\mu}(x) = \int f(t)\, e^{\eta \cdot t}\, e^{-A(\eta)}\, d\tilde{\mu}\, T^{-1}(t). \quad \square$
Definition 1.3.12. The family of densities
$p_\eta(t) = \exp(\eta \cdot t - A(\eta)), \quad \eta \in \eta(\Omega),$
is called an $s$-dimensional or $s$-parameter standard exponential family. (Defined on $\mathbb{R}^s$, not $\mathcal{X}$.)
Theorem 1.3.13. Let $\{p_\theta(x)\}$ be the $s$-parameter exponential family
$p_\theta(x) = \exp\left(\sum_{i=1}^{s} \eta_i(\theta)\, T_i(x) - B(\theta)\right) h(x), \quad \theta \in \Omega,$
and suppose
(1.3.4) $\int \phi(x)\, e^{\sum_{1}^{s} \zeta_j T_j(x)}\, d\mu(x)$
exists and is finite for some $\phi$ and all $\zeta_j = a_j + ib_j$ such that $a \in N$ (= natural parameter space). Then
(i) $\int \phi(x)\, e^{\sum_{1}^{s}\zeta_j T_j(x)}\, d\mu(x)$ is an analytic function of each $\zeta_i$ on $\{\zeta : \mathrm{Re}(\zeta) \in \mathrm{int}(N)\}$, and
(ii) the derivatives of all orders with respect to the $\zeta_i$'s of $\int \phi(x)\, e^{\sum_{1}^{s}\zeta_j T_j(x)}\, d\mu(x)$ can be computed by differentiating under the integral sign.
Proof. Let $a^0 = (a_1^0, \ldots, a_s^0)$ be in $\mathrm{int}(N)$ and let $\zeta_1^0 = a_1^0 + ib_1^0$. Write
$\phi(x)\, e^{\sum_{2}^{s}\zeta_j T_j(x)} = h_1(x) - h_2(x) + i(h_3(x) - h_4(x)),$
where $h_1$ and $h_2$ are the positive and negative parts of the real part and $h_3$ and $h_4$ are the positive and negative parts of the imaginary part. Then $\int \phi(x)\, e^{\sum_{1}^{s}\zeta_j T_j(x)}\, d\mu(x)$ can be expressed as
$\int e^{\zeta_1 T_1(x)}\, d\mu_1(x) - \int e^{\zeta_1 T_1(x)}\, d\mu_2(x) + i\int e^{\zeta_1 T_1(x)}\, d\mu_3(x) - i\int e^{\zeta_1 T_1(x)}\, d\mu_4(x),$
where $d\mu_i(x) = h_i(x)\, d\mu(x)$, $i = 1, \ldots, 4$. Hence it suffices to prove (i) and (ii) for
$\psi(\zeta_1) = \int e^{\zeta_1 T_1(x)}\, d\mu(x).$
Since $a^0 \in \mathrm{int}(N)$, there exists $\delta > 0$ s.t. $\psi(\zeta_1)$ exists and is finite for all $\zeta_1$ with $|a_1 - a_1^0| < \delta$. Now consider the difference quotient
(*) $\frac{\psi(\zeta_1) - \psi(\zeta_1^0)}{\zeta_1 - \zeta_1^0} = \int e^{\zeta_1^0 T_1(x)}\, \frac{e^{(\zeta_1 - \zeta_1^0)T_1(x)} - 1}{\zeta_1 - \zeta_1^0}\, d\mu(x), \qquad 0 < |\zeta_1 - \zeta_1^0| < \delta/2.$
Observe that
$|e^{zt} - 1| = \left|\sum_{j=1}^{\infty} \frac{(zt)^j}{j!}\right| \leq \sum_{j=1}^{\infty} \frac{|zt|^j}{j!} = e^{|zt|} - 1 \leq |zt|\, e^{|zt|} \;\Rightarrow\; \left|\frac{e^{zt} - 1}{z}\right| \leq |t|\, e^{|zt|}.$
The integrand in (*) is therefore bounded in absolute value by $|T_1(x)|\, e^{a_1^0 T_1(x) + \frac{\delta}{2}|T_1(x)|}$, and $\int |T_1(x)|\, e^{a_1^0 T_1(x) + \frac{\delta}{2}|T_1(x)|}\, \mu(dx) < \infty$, since
$|T_1|\, e^{a_1^0 T_1 + \frac{\delta}{2}|T_1|} = \big(|T_1|\, e^{-\frac{\delta}{4}T_1}\big)\, e^{(a_1^0 + \frac{3\delta}{4})T_1}$ if $T_1 > 0$, and $= \big(|T_1|\, e^{\frac{\delta}{4}T_1}\big)\, e^{(a_1^0 - \frac{3\delta}{4})T_1}$ if $T_1 < 0$,
where in each case the first factor is bounded and the second is integrable (independently of $\zeta_1$).
Letting $\zeta_1 \to \zeta_1^0$ in (*) and using the dominated convergence theorem therefore gives
(1.3.5) $\psi'(\zeta_1^0) = \int T_1(x)\, e^{\zeta_1^0 T_1(x)}\, \mu(dx),$
where the integral exists and is finite for every $\zeta_1^0$ which is the first component of some $\zeta^0$ with $\mathrm{Re}(\zeta^0) \in \mathrm{int}(N)$.
Applying to (1.3.5) the same argument that was applied to (1.3.4) gives the existence of all derivatives, and hence (i) and (ii). $\square$
Theorem 1.3.14. For an exponential family of order $s$ in canonical form and $\eta \in \mathrm{int}(N)$, where $N$ is the natural parameter space,
(i) $E_\eta(T) = \left(\frac{\partial A}{\partial \eta_1}, \ldots, \frac{\partial A}{\partial \eta_s}\right)$, and
(ii) $\mathrm{Cov}_\eta(T) = \left[\frac{\partial^2 A}{\partial \eta_i \partial \eta_j}\right]_{i,j=1}^{s}$.
Proof. From Theorem 1.3.11,
$e^{A(\eta)} = \int e^{\eta \cdot t}\, \nu(dt) = \int e^{\eta \cdot T(x)}\, h(x)\, \mu(dx),$
so
(i) $\frac{\partial A}{\partial \eta_i}\, e^{A(\eta)} = \int T_i(x)\, e^{\eta \cdot T(x)}\, h(x)\, \mu(dx)$, whence $E_\eta T_i = \frac{\partial A}{\partial \eta_i}$;
(ii) $\left(\frac{\partial^2 A}{\partial \eta_i \partial \eta_j} + \frac{\partial A}{\partial \eta_i}\frac{\partial A}{\partial \eta_j}\right) e^{A(\eta)} = \int T_i(x)\, T_j(x)\, e^{\eta \cdot T(x)}\, h(x)\, \mu(dx),$
i.e. $\frac{\partial^2 A}{\partial \eta_i \partial \eta_j} = E_\eta(T_i T_j) - E_\eta(T_i)E_\eta(T_j) = \mathrm{Cov}_\eta(T_i, T_j)$. $\square$
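The following Python sketch is an illustration added here (not part of the notes) of Theorem 1.3.14 for the Poisson family: with canonical parameter $\eta = \log\lambda$, $T(x) = x$ and $A(\eta) = e^\eta$, so both $A'(\eta)$ and $A''(\eta)$ should equal $\lambda$. The value of $\lambda$ and the finite-difference step are arbitrary.

```python
# Check of Theorem 1.3.14 (illustrative, not from the notes) for a single Poisson(lambda)
# observation: eta = log(lambda), T(x) = x, A(eta) = exp(eta), so dA/deta = d^2A/deta^2 = lambda.
import numpy as np

def A(eta):
    return np.exp(eta)            # log-normalizer of the Poisson family

eta, h = np.log(2.5), 1e-5        # lambda = 2.5, finite-difference step
dA = (A(eta + h) - A(eta - h)) / (2 * h)
d2A = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2
print(dA, d2A)                    # both approximately 2.5 = E(T) = Var(T)
```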
Higher order moments of $T_1, \ldots, T_s$ are frequently required, e.g.
$\alpha_{r_1 \cdots r_s} = E(T_1^{r_1} \cdots T_s^{r_s}),$
$\mu_{r_1 \cdots r_s} = E[(T_1 - ET_1)^{r_1} \cdots (T_s - ET_s)^{r_s}],$
etc. These can often be obtained readily from the MGF:
$M_T(u_1, \ldots, u_s) := E(e^{u_1 T_1 + \cdots + u_s T_s}).$
If $M_T$ exists in some neighborhood of 0 ($\sum u_i^2 < \delta$), then all the moments $\alpha_{r_1 \cdots r_s}$ exist and are the coefficients in the power series expansion
$M_T(u_1, \ldots, u_s) = \sum_{r_1, \ldots, r_s} \alpha_{r_1 \cdots r_s}\, \frac{u_1^{r_1}}{r_1!} \cdots \frac{u_s^{r_s}}{r_s!}.$
The cumulant generating function, CGF, is sometimes more convenient for calculations, especially in connection with sums of independent random vectors. The CGF is defined as
$K_T(u_1, \ldots, u_s) := \log M_T(u_1, \ldots, u_s).$
If $M_T$ exists in a neighborhood of 0, then so does $K_T$ and
$K_T(u_1, \ldots, u_s) = \sum_{r_1, \ldots, r_s = 0}^{\infty} K_{r_1 \cdots r_s}\, \frac{u_1^{r_1}}{r_1!} \cdots \frac{u_s^{r_s}}{r_s!},$
where the coefficients $K_{r_1 \cdots r_s}$ are called the cumulants of $T$.
The moments and cumulants can be found from each other by formal comparison of the two series.
Theorem 1.3.15. If $X$ has the density
$p_\eta(x) = \exp\left[\sum_{i=1}^{s} \eta_i T_i(x) - A(\eta)\right] h(x)$
w.r.t. some $\sigma$-finite measure $\mu$, then for any $\eta \in \mathrm{int}(N)$ the MGF and CGF of $T$ exist in a neighborhood of 0 and
$M_T(u) = e^{A(\eta + u) - A(\eta)}, \qquad K_T(u) = A(\eta + u) - A(\eta).$
Proof. Problem 3.4.
Summary on Exponential Families. The family of probability measures $\{P_\theta\}$ with densities relative to some $\sigma$-finite measure $\mu$,
(1.3.6) $p_\theta(x) = \frac{dP_\theta}{d\mu}(x) = \exp\left\{\sum_{i=1}^{s} \eta_i(\theta)\, T_i(x) - B(\theta)\right\} h(x), \quad \theta \in \Omega,$
is an $s$-parameter exponential family.
By redefining the functions $T_i(\cdot)$ and $\eta_i(\cdot)$ if necessary, we can always arrange for both sets of functions to be affinely independent. The number of summands in the exponent is then the order of the exponential family.
If $\{1, \eta_1, \ldots, \eta_s\}$ and $\{1, T_1, \ldots, T_s\}$ are both L.I., then the family is said to be minimal and
$s = \dim\left(\mathrm{span}\left\{1, \log\frac{dp_\theta}{dp_{\theta_0}}, \theta \in \Omega\right\}\right) - 1 = $ order of the exponential family.
Remark 1.3.16. Since (1.3.6) is by definition a probability density w.r.t. $\mu$ for each $\theta \in \Omega$, we have
$\int \exp\left\{\sum \eta_i(\theta)\, T_i(x) - B(\theta)\right\} h(x)\, \mu(dx) = 1$
$\Rightarrow \exp B(\theta) = \int \exp\left\{\sum \eta_i(\theta)\, T_i(x)\right\} h(x)\, \mu(dx),$
which shows that the dependence of $B$ on $\theta$ is through $\eta(\theta) = (\eta_1(\theta), \ldots, \eta_s(\theta))$ only, i.e. $B(\theta) = A(\eta(\theta))$.
Remark 1.3.17. The previous note implies that each member of the family (1.3.6) is a member of the family
(1.3.7) $p^*_\eta(x) = \exp\left\{\sum_{i=1}^{s} \eta_i T_i(x) - A(\eta)\right\} h(x), \quad \eta = (\eta_1, \ldots, \eta_s) \in \eta(\Omega)$
(in fact $p_\theta(x) = p^*_{\eta(\theta)}(x)$).
The family of densities $\{p^*_\eta, \eta \in \eta(\Omega)\}$ defined by (1.3.7) is the canonical family associated with (1.3.6). It is the same family parameterized by the natural parameter, $\eta$ = vector of coefficients of $T_i(x)$, $i = 1, \ldots, s$.
Remark 1.3.18. Instead of restricting $\eta$ to the set $\eta(\Omega)$, it is natural to extend the family (1.3.7) to allow all $\eta \in \mathbb{R}^s$ for which we can choose a value of $A(\eta)$ to make (1.3.7) a probability density, i.e. for which
(1.3.8) $\int \exp\left\{\sum \eta_i T_i(x)\right\} h(x)\, \mu(dx) < \infty.$
$N = \{\eta \in \mathbb{R}^s : (1.3.8) \text{ holds}\}$ is the natural parameter space of the family (1.3.7).
Remark 1.3.19. $N \supseteq \eta(\Omega)$, since (1.3.7) is by definition a family of probability densities.
Definition 1.3.20. (Full rank family) As with the original parameterization, we can always redefine to ensure that $\{T_1, \ldots, T_s\}$ is A.I. If $\eta(\Omega)$ contains an $s$-dimensional rectangle and $\{T_1(\cdot), \ldots, T_s(\cdot)\}$ is A.I., then $T$ is minimal sufficient and we say the family (1.3.7) is of full rank. (A full rank family is clearly minimal.)
Remark 1.3.21. Since $N \supseteq \eta(\Omega)$, full rank $\Rightarrow \mathrm{int}(N) \neq \emptyset$, and this is important in view of the consequence of Theorem 1.3.13 that
$e^{A(\eta)} = \int \exp\left(\sum_{i=1}^{s} \eta_i T_i(x)\right) h(x)\, \mu(dx)$
is analytic in each $\eta_i$ on the set of $s$-dimensional complex vectors $\eta$ with $\mathrm{Re}(\eta) \in \mathrm{int}(N)$. (So derivatives of $e^{A(\eta)}$ w.r.t. $\eta_i$, $i = 1, \ldots, s$, of all orders can be obtained by differentiation under the integral, yielding explicit expressions for the moments of $T$ for all values of the canonical parameter vector $\eta \in \mathrm{int}(N)$.)
Example 1.3.22. Multinomial. $X \sim M(\pi_0, \ldots, \pi_s; n) = (X_0, \ldots, X_s)$, where $X_i$ = number of outcomes of type $i$ in $n$ independent trials and $\pi_i$, $i = 0, \ldots, s$, is the probability of an outcome of type $i$ on any one trial.
$\Omega = \{\pi : \pi_0 \geq 0, \ldots, \pi_s \geq 0,\ \pi_0 + \cdots + \pi_s = 1\}$
(1) Probability density with respect to counting measure on $\mathbb{Z}_+^{s+1}$:
$p_\pi(x) = \frac{n!}{x_0!\cdots x_s!}\, \pi_0^{x_0}\cdots\pi_s^{x_s}\prod_{i=0}^{s} I_{[0,n]}(x_i)\, I_{\{n\}}\!\left(\sum x_i\right) = \exp\left\{\sum_{i=0}^{s} x_i \log\pi_i\right\} h(x), \quad \pi \in \Omega.$
This is an $(s+1)$-parameter exponential family with $T_i(x) = x_i$, $\eta_i(\pi) = \log\pi_i$. The vectors $\eta(\pi)$, $\pi \in \Omega$, are not confined to a proper affine subspace of $\mathbb{R}^{s+1}$, so $T$ is minimal sufficient.
(2) $\{T_0, \ldots, T_s\}$ is not A.I., since $T_0 + \cdots + T_s = n$. Setting $T_0(x) = x_0 = n - x_1 - \cdots - x_s$ gives
$p_\pi(x) = h(x)\exp\left\{n\log\pi_0 + \sum_{i=1}^{s} x_i\log\frac{\pi_i}{\pi_0}\right\}.$
Redefining $\eta(\pi) = (\log\frac{\pi_1}{\pi_0}, \ldots, \log\frac{\pi_s}{\pi_0})$, we now have an $s$-parameter representation in which $\{T_1, \ldots, T_s\}$ is A.I., since the vectors $(x_1, \ldots, x_s)$, $x \in \mathcal{X}$, are subject only to the constraints $x_i \geq 0$ and $\sum_{i=1}^{s} x_i \leq n$.
(3) Furthermore the new parameter vectors $\eta(\pi) = (\log\frac{\pi_1}{\pi_0}, \ldots, \log\frac{\pi_s}{\pi_0})$, $\pi \in \Omega$, are not confined to any proper affine subspace of $\mathbb{R}^s$, since for any $x \in \mathbb{R}^s$ there exist $\pi_0, \ldots, \pi_s$ such that $\eta(\pi) = x$, and so $\eta(\Omega) = \mathbb{R}^s$. Hence $T(x) = (x_1, \ldots, x_s)$ is minimal sufficient for $\mathcal{P}$ and the order of the family is $s$.
(4) The canonical representation of the family (2) is
$p^*_\eta(x) = \exp\left\{\sum_{1}^{s} \eta_i x_i - A(\eta)\right\} h(x), \quad \eta \in \eta(\Omega) = \{(\log\tfrac{\pi_1}{\pi_0}, \ldots, \log\tfrac{\pi_s}{\pi_0}) : \pi \in \Omega\}.$
We know from Remark 1.3.16 that $B(\pi) = A(\eta(\pi))$ for some function $A(\cdot)$. Although it is not necessary, we can verify this directly in this example, since from the representation (2) we have
$B(\pi) = -n\log\pi_0$
and
$\pi_0 = 1 - \pi_1 - \cdots - \pi_s \Rightarrow \frac{1}{\pi_0} = 1 + \frac{\pi_1}{\pi_0} + \cdots + \frac{\pi_s}{\pi_0} = 1 + e^{\eta_1(\pi)} + \cdots + e^{\eta_s(\pi)}$
$\Rightarrow B(\pi) = n\log(1 + e^{\eta_1(\pi)} + \cdots + e^{\eta_s(\pi)})$
$\Rightarrow A(\eta) = n\log(1 + e^{\eta_1} + \cdots + e^{\eta_s}).$
$A(\eta)$ is of course also determined by
$e^{A(\eta)} = \int \exp\left\{\sum_{1}^{s} \eta_i x_i\right\} h(x)\, d\mu(x).$
(5) The natural parameter space in this case is $N = \mathbb{R}^s$, since we know that $N \supseteq \eta(\Omega)$ and $\eta(\Omega) = \mathbb{R}^s$ by (3) above. Clearly $N$ contains an $s$-dimensional rectangle and $\{T_1, \ldots, T_s\}$ is A.I., hence $\{p^*_\eta(x), \eta \in N\}$ is of full rank.
(6) Moments of $T(X) = (X_1, \ldots, X_s)$: Theorem 1.3.14 $\Rightarrow$
$E_\eta T_i = \frac{\partial A}{\partial\eta_i} = \frac{n e^{\eta_i}}{1 + e^{\eta_1} + \cdots + e^{\eta_s}} = \frac{n\pi_i/\pi_0}{1 + \pi_1/\pi_0 + \cdots + \pi_s/\pi_0} = n\pi_i \quad \forall \eta \in \mathbb{R}^s,$
and
$\mathrm{Cov}(T_i, T_j) = \frac{\partial^2 A}{\partial\eta_i\partial\eta_j} = \frac{-n e^{\eta_i}e^{\eta_j}}{(1 + e^{\eta_1} + \cdots + e^{\eta_s})^2} = -n\pi_i\pi_j$ for $i \neq j$, and $= \frac{n e^{\eta_i}}{1 + \cdots + e^{\eta_s}} - \frac{n e^{2\eta_i}}{(1 + \cdots + e^{\eta_s})^2} = n\pi_i(1 - \pi_i)$ for $i = j$.
(Moments exist for all $\eta \in \mathrm{int}(N) = \mathbb{R}^s$.)
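The Python sketch below is an illustration added here (not part of the notes) of item (6): for the multinomial log-normalizer $A(\eta) = n\log(1 + \sum_i e^{\eta_i})$, the numerical gradient of $A$ should reproduce $n\pi_i$. The values of $n$ and $\pi$ are arbitrary.

```python
# Illustrative check of item (6) (not from the notes): the gradient of
# A(eta) = n*log(1 + sum(exp(eta))) equals (n*pi_1, ..., n*pi_s).
import numpy as np

n = 20
pi = np.array([0.5, 0.3, 0.2])               # pi_0, pi_1, pi_2  (s = 2)
eta = np.log(pi[1:] / pi[0])                 # canonical parameters

def A(e):
    return n * np.log1p(np.exp(e).sum())

h = 1e-5
grad = np.array([(A(eta + h*np.eye(2)[i]) - A(eta - h*np.eye(2)[i])) / (2*h) for i in range(2)])
print(grad, n * pi[1:])                      # both approximately (n*pi_1, n*pi_2)
```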
Theorem 1.3.23. (Sufficient condition for completeness of $T$) If
$p^*_\eta(x) = \exp\left(\sum_{i=1}^{s} \eta_i T_i(x) - A(\eta)\right) h(x), \quad \eta \in \eta(\Omega),$
is a minimal canonical representation of the exponential family $\mathcal{P} = \{p_\theta : \theta \in \Omega\}$ and $\eta(\Omega)$ contains an open subset of $\mathbb{R}^s$, then $T = (T_1, \ldots, T_s)$ is complete for $\mathcal{P}$.
Proof. Suppose $E_\eta(f(T)) = 0$ for all $\eta \in \eta(\Omega)$. Then
(1.3.9) $E_\eta f^+(T) = E_\eta f^-(T) \quad \forall \eta \in \eta(\Omega).$
Choose $\eta^0 \in \mathrm{int}(\eta(\Omega))$ and $r > 0$ such that
$N(\eta^0, r) := \{\eta : \|\eta - \eta^0\| < r\} \subseteq \eta(\Omega).$
Now define the probability measures
$\mu^+(A) = \frac{\int_A f^+ e^{\eta^0 \cdot t}\, \nu(dt)}{\int_J f^+ e^{\eta^0 \cdot t}\, \nu(dt)}, \qquad \mu^-(A) = \frac{\int_A f^- e^{\eta^0 \cdot t}\, \nu(dt)}{\int_J f^- e^{\eta^0 \cdot t}\, \nu(dt)}, \qquad \nu = \tilde{\mu}\, T^{-1},\ d\tilde{\mu}(x) = h(x)\, d\mu(x),$
where we have assumed that $\nu(\{t : f(t) \neq 0\}) > 0$, since otherwise $f = 0$ a.s. $P^T$ and we are done.
Observe now that
(1.3.10) $\int e^{\delta \cdot t}\, \mu^+(dt) = \int e^{\delta \cdot t}\, \mu^-(dt) \quad \forall \delta \in \mathbb{R}^s$ with $\|\delta\| < r,$
since
$\frac{\int_J f^+(t)\, e^{(\eta^0 + \delta)\cdot t}\, \nu(dt)}{\int_J f^+(t)\, e^{\eta^0 \cdot t}\, \nu(dt)} = \frac{\int_J f^-(t)\, e^{(\eta^0 + \delta)\cdot t}\, \nu(dt)}{\int_J f^-(t)\, e^{\eta^0 \cdot t}\, \nu(dt)}$
by (1.3.9).
Now consider each side of (1.3.10) as a function of the complex argument $\zeta = \delta_0 + i\beta$, $\beta \in \mathbb{R}^s$. Then $L(\zeta) = R(\zeta)$ for all $\zeta = \delta_0 + i\beta$ with $\|\delta_0\| < r$, since (by Theorem 1.3.13 (i)) both sides are analytic in each component of $\zeta$ on the set where $\mathrm{Re}(\eta^0 + \zeta) \in N$, and they are equal when $\zeta$ is real. In particular,
$L(i\beta) = \int e^{i\beta \cdot t}\, \mu^+(dt) = R(i\beta) = \int e^{i\beta \cdot t}\, \mu^-(dt)$
for all $\beta \in \mathbb{R}^s$. Hence $\mu^+$ and $\mu^-$ have the same characteristic function $\Rightarrow \mu^+ = \mu^- \Rightarrow f^+ = f^-$ a.s. $\nu$, contradicting $\nu(f \neq 0) > 0$. So $f = 0$ a.s. $\nu$. $\square$
Example 1.3.24. $X_1, \ldots, X_n$ iid $N(\xi, \xi^2)$:
$p_\xi(x) = \frac{1}{(\sqrt{2\pi}\,\xi)^n}\exp\left\{-\frac{1}{2\xi^2}\sum x_i^2 + \frac{1}{\xi}\sum x_i - \frac{n}{2}\right\}$
$\eta_1(\xi) = \frac{1}{2\xi^2}, \quad T_1(x) = -\sum x_i^2, \qquad \eta_2(\xi) = \frac{1}{\xi}, \quad T_2(x) = \sum x_i.$
$\eta(\Omega)$ does not contain a 2-dimensional rectangle in $\mathbb{R}^2$. $T(x) = (\sum x_i^2, \sum x_i)$ is not complete, since
$E_\xi\left(\sum X_i^2 - \frac{2}{n+1}\left(\sum X_i\right)^2\right) = n(2\xi^2) - \frac{2}{n+1}(n\xi^2 + n^2\xi^2) = 0,$
but there exists no $\mathcal{P}$-null set $N$ such that $\sum x_i^2 - \frac{2}{n+1}(\sum x_i)^2 = 0$ on $N^c$.
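The simulation below is an illustration added here (not part of the notes) of Example 1.3.24: the statistic $W = \sum X_i^2 - \frac{2}{n+1}(\sum X_i)^2$ has mean zero for every $\xi$ yet is not a.s. zero, which is exactly what rules out completeness. The sample size, seed and $\xi$ values are arbitrary.

```python
# Simulation sketch (not from the notes) for Example 1.3.24: for X_i iid N(xi, xi^2),
# W = sum(X_i^2) - 2/(n+1) * (sum X_i)^2 has mean 0 for every xi but is not a.s. 0,
# so T = (sum X_i^2, sum X_i) cannot be complete.
import numpy as np

rng = np.random.default_rng(2)
n, reps = 5, 200000
for xi in [0.5, 1.0, 3.0]:
    x = rng.normal(xi, abs(xi), size=(reps, n))
    w = (x**2).sum(axis=1) - 2/(n+1) * x.sum(axis=1)**2
    print(xi, w.mean(), w.std())   # mean near 0, standard deviation clearly positive
```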
1.4. Convex Loss Function
Lemma 1.4.1. Let $\rho$ be a convex function on $(-\infty, \infty)$ which is bounded below and suppose that $\rho$ is not monotone. Then $\rho$ takes on its minimum value $c$, $\rho^{-1}(c)$ is a closed interval, and it is a singleton when $\rho$ is strictly convex.
Proof. Since $\rho$ is convex and not monotone,
$\lim_{x \to \pm\infty} \rho(x) = \infty.$
Since $\rho$ is continuous, $\rho$ attains its minimum value $c$. $\rho^{-1}(\{c\})$ is closed by continuity and an interval by convexity. The interval must have zero length if $\rho$ is strictly convex. $\square$
Theorem 1.4.2. Let $\rho$ be a convex function defined on $(-\infty, \infty)$ and $X$ a random variable such that $\phi(a) = E(\rho(X - a))$ is finite for some $a$. If $\rho$ is not monotone, $\phi$ takes on its minimum value, the set of minimizers is a closed interval, and it is a singleton when $\rho$ is strictly convex.
Proof. By the lemma, we only need to show that $\phi$ is convex and not monotone. Because $\lim_{t\to\pm\infty}\rho(t) = \infty$ and $x - a \to \mp\infty$ as $a \to \pm\infty$,
$\lim_{a\to\pm\infty}\phi(a) = \infty,$
so that $\phi$ is not monotone. The convexity comes from
$\phi(pa + (1-p)b) = E(\rho(p(X - a) + (1-p)(X - b))) \leq E(p\,\rho(X - a) + (1-p)\,\rho(X - b)) = p\,\phi(a) + (1-p)\,\phi(b). \quad \square$
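The following Python sketch illustrates Theorem 1.4.2 (added here, not part of the notes): for squared-error $\rho$ the minimizer of $\phi(a) = E\rho(X-a)$ is unique (the mean), while for absolute-error $\rho$ the minimizers form the median interval. The small empirical distribution used below is an arbitrary choice.

```python
# Illustration of Theorem 1.4.2 (not from the notes): phi(a) = E rho(X - a) is minimized
# at the mean for rho(t) = t^2 (strictly convex, unique minimizer) and on the median
# interval for rho(t) = |t| (convex, an interval of minimizers).
import numpy as np

x = np.array([0.0, 1.0, 4.0, 10.0])          # a small empirical distribution
grid = np.linspace(-2, 12, 1401)
phi_sq = [np.mean((x - a)**2) for a in grid]
phi_abs = [np.mean(np.abs(x - a)) for a in grid]
print(grid[np.argmin(phi_sq)], x.mean())      # unique minimizer, approximately the mean
mins = grid[np.isclose(phi_abs, np.min(phi_abs), atol=1e-9)]
print(mins.min(), mins.max())                 # an interval: roughly [1, 4], the median interval
```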
CHAPTER 2
Unbiasedness
2.1. UMVU estimators.
Notation. $\mathcal{P} = \{P_\theta, \theta \in \Omega\}$ is a family of probability measures on $\mathcal{A}$ (distributions of $X$).
$T: \mathcal{X} \to \mathbb{R}$ is an $\mathcal{A}/\mathcal{B}$ measurable function and $T$ (or $T(X)$) is called a statistic.
$g: \Omega \to \mathbb{R}$ is a function on $\Omega$ whose value at $\theta$ is to be estimated.
$(\mathcal{X}, \mathcal{A}, P_\theta) \xrightarrow{\;X\;} (\mathcal{X}, \mathcal{A}, P_\theta) \xrightarrow{\;T\;} (\mathbb{R}, \mathcal{B}, P_\theta^T)$
Definition 2.1.1. A statistic $T$ (or $T(X)$) is called an unbiased estimator of $g(\theta)$ if
$E_\theta(T(X)) = g(\theta)$ for all $\theta \in \Omega$.
Objectives of point estimation. In order to specify what we mean by a good estimator of $g(\theta)$, we need to specify what we mean when we say that $T(X)$ is close to $g(\theta)$. A fairly general way of defining this is to specify a loss function:
$L(\theta, d)$ = cost of concluding that $g(\theta) = d$ when the parameter value is $\theta$,
with $L(\theta, d) \geq 0$ and $L(\theta, g(\theta)) = 0$.
Since $T(X)$ is a random variable, we measure the performance of $T(X)$ for estimating $g(\theta)$ in terms of its expected (or long-term average) loss
$R(\theta, T) = E_\theta L(\theta, T(X)),$
known as the risk function.
Choice of a loss function will depend on the problem and the purpose of the estimation. For many estimation problems, the conclusion is not particularly sensitive to the choice of loss function within a reasonable range of alternatives. Because of this, and especially because of its mathematical convenience, we often choose (and will do so in this chapter) the squared-error loss function
$L(\theta, d) = (g(\theta) - d)^2$
with corresponding risk function
(2.1.1) $R(\theta, T) = E_\theta(T(X) - g(\theta))^2.$
Ideally we would like to choose $T$ to minimize (2.1.1) uniformly in $\theta$. Unfortunately this is impossible, since the estimator $T$ defined by
(2.1.2) $T(x) = g(\theta_0) \quad \forall x \in \mathcal{X}$
(where $\theta_0$ is some fixed parameter value in $\Omega$) has the risk function
$R(\theta, T) = 0$ if $\theta = \theta_0$, and $R(\theta, T) = (g(\theta) - g(\theta_0))^2$ if $\theta \neq \theta_0$.
An estimator which simultaneously minimized $R(\theta, T)$ for all $\theta \in \Omega$ would necessarily have $R(\theta, T) = 0$ for all $\theta \in \Omega$, and this is impossible except in trivial cases.
Why consider the class of unbiased estimators? There is nothing intrinsically good about unbiased estimators. The only criterion for goodness is that $R(\theta, T)$ should be small. The hope is that by restricting attention to a class of estimators which excludes (2.1.2), we may be able to minimize $R(\theta, T)$ uniformly in $\theta$ and that the resulting estimator will give small values of $R(\theta, T)$. This programme is frequently successful if we attempt to minimize $R(\theta, T)$ with $T$ restricted to the class of unbiased estimators of $g(\theta)$.
Definition 2.1.2. $g(\theta)$ is U-estimable if there exists an unbiased estimator of $g(\theta)$.
Example 2.1.3. $X_1, \ldots, X_n$ iid Bernoulli($p$), $p \in (0, 1)$. $g(p) = p$ is U-estimable, since $E\bar{X}_n = p$ for all $p \in (0, 1)$, while $h(p) = \frac{1}{p}$ is not U-estimable, since if
$\sum_x T(x)\, p^{\sum x_i}(1-p)^{n - \sum x_i} = \frac{1}{p} \quad \forall p \in (0, 1),$
then $\lim_{p\to 0}\mathrm{RHS} = \infty$ and $\lim_{p\to 0}\mathrm{LHS} = T(0)$. So $T(0) = \infty$, but this is not possible, since then $E_p T(X) = \infty \neq \frac{1}{p}$ for all $p \in (0, 1)$.
Remark 2.1.4. $\frac{n}{\sum_{i=1}^{n} X_i} \xrightarrow{a.s.} p^{-1}$ and $\frac{\sum_{i=1}^{n} X_i}{n} \xrightarrow{a.s.} p$ for all $p \in (0, 1)$. Hence $\frac{n}{\sum X_i}$ is a reasonable estimate of $p^{-1}$ even though it is not unbiased.
Theorem 2.1.5. If $T_0$ is an unbiased estimator of $g(\theta)$, then the totality of unbiased estimators of $g(\theta)$ is given by
$\{T_0 - U : E_\theta U = 0 \text{ for all } \theta \in \Omega\}.$
Proof. If $T$ is unbiased for $g(\theta)$, then $T = T_0 - (T_0 - T)$ where $E_\theta(T_0 - T) = 0$ for all $\theta \in \Omega$. Conversely, if $T = T_0 - U$ where $E_\theta U = 0$ for all $\theta \in \Omega$, then $E_\theta T = E_\theta T_0 = g(\theta)$ for all $\theta \in \Omega$. $\square$
Remark 2.1.6. For squared error loss, $L(\theta, d) = (d - g(\theta))^2$, the risk $R(\theta, T)$ is
$R(\theta, T) = E_\theta((T(X) - g(\theta))^2) = \mathrm{Var}_\theta(T(X))$ if $T$ is unbiased
$= \mathrm{Var}_\theta(T_0(X) - U) = E_\theta[(T_0(X) - U)^2] - g(\theta)^2,$
and hence the risk is minimized by minimizing $E_\theta[(T_0(X) - U)^2]$ with respect to $U$, i.e. by taking any fixed unbiased estimator of $g(\theta)$ and finding the unbiased estimator of zero which minimizes $E_\theta[(T_0(X) - U)^2]$. Then if $U$ does not depend on $\theta$ we shall have found a uniformly minimum risk estimator of $g(\theta)$, while if $U$ depends on $\theta$, there is no uniformly minimum risk estimator. Note that for unbiased estimators and squared error loss, the risk is the same as the variance of the estimator, so uniformly minimum risk unbiased is the same as uniformly minimum variance unbiased in this case.
Example 2.1.7. $P(X = -1) = p$, $P(X = k) = q^2 p^k$, $k = 0, 1, \ldots$, where $q = 1 - p$.
$T_0(X) = I_{\{-1\}}(X)$ is unbiased for $p$, $0 < p < 1$;
$T_1(X) = I_{\{0\}}(X)$ is unbiased for $q^2$.
$U$ is unbiased for 0
$\iff 0 = \sum_{k=-1}^{\infty} U(k)\, P(X = k) = pU(-1) + \sum_{k=0}^{\infty} U(k)\, q^2 p^k \quad \forall p$
$\quad = U(0) + \sum_{k=1}^{\infty}\big(U(k) - 2U(k-1) + U(k-2)\big) p^k \quad \forall p$
$\iff U(k) = -kU(-1) = ka$ for some $a$
(comparing coefficients of $p^k$, $k = 0, 1, 2, \ldots$).
So an unbiased estimator of $p$ with minimum risk (i.e. variance) is $T_0(X) - a_0 X$, where $a_0$ is the value of $a$ which minimizes
$E_p(T_0(X) - aX)^2 = \sum_k P_p(X = k)\,[T_0(k) - ak]^2.$
Similarly an unbiased estimator of $q^2$ with minimum risk (i.e. variance) is $T_1(X) - a_1 X$, where $a_1$ is the value of $a$ which minimizes
$E_p(T_1(X) - aX)^2 = \sum_k P_p(X = k)\,[T_1(k) - ak]^2.$
Some straightforward calculations give
$a_0 = \frac{-p}{p + q^2\sum_{1}^{\infty} k^2 p^k} \quad\text{and}\quad a_1 = 0.$
Since $a_1$ is independent of $p$, the estimator $T_1(X)$ of $q^2$ is minimum variance unbiased for all $p$, i.e. UMVU. However $a_0$ does depend on $p$, and so the estimator $T_0^*(X) = T_0(X) - a_0 X$ is only locally minimum variance unbiased at $p$. (We are using "estimator" in a generalized sense here, since $T_0^*(X)$ depends on $p$. We shall continue to use this terminology.) A UMVU estimator of $p$ does not exist in this case.
Definition 2.1.8. Let $V(\theta) = \inf_T \mathrm{Var}_\theta(T)$, where the inf is over all unbiased estimators of $g(\theta)$. If an unbiased estimator $T$ of $g(\theta)$ satisfies
$\mathrm{Var}_\theta(T) = V(\theta)$ for all $\theta \in \Omega$, it is called UMVU.
If $\mathrm{Var}_{\theta_0}(T) = V(\theta_0)$ for some $\theta_0 \in \Omega$, $T$ is called LMVU at $\theta_0$.
Remark 2.1.9. Let $\mathcal{H}$ be the Hilbert space of functions on $\mathcal{X}$ which are square integrable with respect to $\mathcal{P}$ (i.e. with respect to every $P_\theta \in \mathcal{P}$), and let $\mathcal{U}$ be the set of all unbiased estimators of 0. If $T_0$ is an unbiased estimator of $g(\theta)$ in $\mathcal{H}$, then a LMVU estimator in $\mathcal{H}$ at $\theta_0$ is $T_0 - P_{\mathcal{U}}(T_0)$, where $P_{\mathcal{U}}$ denotes orthogonal projection on $\mathcal{U}$ in the inner product space $L^2(P_{\theta_0})$, i.e. $P_{\mathcal{U}}(T_0)$ is the unique element of $\mathcal{U}$ such that
$T_0 - P_{\mathcal{U}}(T_0) \perp \mathcal{U}$ (in $L^2(P_{\theta_0})$).
$T_0 - P_{\mathcal{U}}(T_0)$ is LMVU since $P_{\mathcal{U}}(T_0) = \arg\min_{U \in \mathcal{U}} E_{\theta_0}(T_0 - U)^2$.
Notation 2.1.10. We denote the set of all estimators $T$ with $E_\theta T^2 < \infty$ for all $\theta$ by $\Delta$, and the set of all unbiased estimators of 0 in $\Delta$ by $\mathcal{U}$.
Theorem 2.1.11. An unbiased estimator $T \in \Delta$ of $g(\theta)$ is UMVU iff
$E_\theta(TU) = 0$ for all $U \in \mathcal{U}$ and for all $\theta \in \Omega$
(i.e. $\mathrm{Cov}_\theta(T, U) = 0$, since $E_\theta U = 0$ for all $\theta$ and $E_\theta T = g(\theta)$ for all $\theta \in \Omega$).
Proof. ($\Rightarrow$) Suppose $T$ is UMVU. For $U \in \mathcal{U}$, let $T' = T + \lambda U$ with $\lambda$ real. Then $T'$ is unbiased and, by definition of $T$,
$\mathrm{Var}_\theta(T') = \mathrm{Var}_\theta(T) + \lambda^2\mathrm{Var}_\theta(U) + 2\lambda\,\mathrm{Cov}_\theta(T, U) \geq \mathrm{Var}_\theta(T);$
therefore $\lambda^2\mathrm{Var}_\theta(U) + 2\lambda\,\mathrm{Cov}_\theta(T, U) \geq 0$. Setting $\lambda = -\frac{\mathrm{Cov}_\theta(T, U)}{\mathrm{Var}_\theta(U)}$ gives a contradiction to this inequality unless $\mathrm{Cov}_\theta(T, U) = 0$. Hence $\mathrm{Cov}_\theta(T, U) = 0$.
($\Leftarrow$) If $E_\theta(TU) = 0$ for all $U \in \mathcal{U}$ and all $\theta \in \Omega$, let $T'$ be any other unbiased estimator. If $\mathrm{Var}_\theta(T') = \infty$, then $\mathrm{Var}_\theta(T) \leq \mathrm{Var}_\theta(T')$, so suppose $\mathrm{Var}_\theta(T') < \infty$. Then $T' = T - U$ for some $U$ which is unbiased for 0 (by Theorem 2.1.5).
$U = T - T' \Rightarrow E_\theta U^2 = E_\theta(T' - T)^2 \leq 2E_\theta T'^2 + 2E_\theta T^2 < \infty \Rightarrow U \in \mathcal{U}.$
Hence
$\mathrm{Var}_\theta(T') = \mathrm{Var}_\theta(T - U) = \mathrm{Var}_\theta(T) + \mathrm{Var}_\theta(U) - 2\mathrm{Cov}_\theta(T, U) \geq \mathrm{Var}_\theta(T)$, since $\mathrm{Cov}_\theta(T, U) = 0$
$\Rightarrow T$ is UMVU. $\square$
Unbiasedness and sufficiency. Suppose now that $T \in \Delta$ is unbiased for $g(\theta)$ and $S$ is sufficient for $\mathcal{P} = \{P_\theta, \theta \in \Omega\}$. Consider
$T' = E(T|S) = E_\theta(T|S)$, independent of $\theta$. Then
(a) $E_\theta T' = E_\theta E(T|S) = E_\theta(T) = g(\theta)$ for all $\theta$;
(b) $\mathrm{Var}_\theta(T) = E_\theta(T - E(T|S) + E(T|S) - g(\theta))^2 = E_\theta((T - E(T|S))^2) + \mathrm{Var}_\theta(T') + 2E_\theta[(T - E(T|S))(E(T|S) - g(\theta))] \geq \mathrm{Var}_\theta(T').$
In (b) the cross term vanishes because $T - E(T|S)$ is orthogonal to $\sigma(S)$, and the inequality is strict unless $T = E(T|S)$ a.s. $P_\theta$.
Theorem 2.1.12. If $S$ is a complete sufficient statistic for $\mathcal{P}$, then every U-estimable function $g(\theta)$ has one and only one unbiased estimator which is a function of $S$.
Proof.
$T$ unbiased $\Rightarrow E(T|S)$ is unbiased and a function of $S$.
$T_1(S), T_2(S)$ unbiased $\Rightarrow E_\theta(T_1(S) - T_2(S)) = 0$ for all $\theta$ $\Rightarrow T_1(S) = T_2(S)$ a.s. $\mathcal{P}$ (completeness). $\square$
Theorem 2.1.13. (Rao-Blackwell) Suppose $S$ is a complete sufficient statistic for $\mathcal{P}$. Then
(i) If $g(\theta)$ is U-estimable, there exists an unbiased estimator which uniformly minimizes the risk for any loss function $L(\theta, d)$ which is convex in $d$.
(ii) The UMVU estimator in (i) is the unique unbiased estimator which is a function of $S$; it is the unique unbiased estimator with minimum risk, provided the risk is finite and $L$ is strictly convex in $d$.
Proof. (i) $L(\theta, d)$ convex in $d$ means
$L(\theta, pd_1 + (1-p)d_2) \leq pL(\theta, d_1) + (1-p)L(\theta, d_2), \quad 0 < p < 1.$
Let $T$ be any unbiased estimator of $g(\theta)$ and let $T' = E(T|S)$, another unbiased estimator of $g(\theta)$. Then
$R(\theta, T') = E_\theta[L(\theta, E(T|S))] \leq E_\theta[E(L(\theta, T)|S)]$ by Jensen's inequality for conditional expectation
$= E_\theta L(\theta, T) = R(\theta, T)$ for all $\theta$.
If $T_2$ is any other unbiased estimator then
$T_2' = E(T_2|S) = T'$ a.s. $\mathcal{P}$ by Theorem 2.1.12.
Hence starting from any unbiased estimator and conditioning on the CSS $S$ gives a uniquely defined unbiased estimator which is UMVU and is the unique function of $S$ which is unbiased for $g(\theta)$.
(ii) The first statement was established at the end of the proof of (i).
If $T$ is UMVU then so is $T' = E(T|S)$, as shown in (i). We will show that $T$ is necessarily the uniquely determined unbiased function of $S$, by showing that $T$ is a function of $S$ a.s. $\mathcal{P}$.
The proof is by contradiction. Suppose that "$T$ is a function of $S$ a.s. $\mathcal{P}$" is false. Then there exists $\theta$ and a set of positive $P_\theta$ measure where
$T' := E(T|S) \neq T.$
But this implies that
$R(\theta, T') = E_\theta(L(\theta, E(T|S))) < E_\theta(E(L(\theta, T)|S))$ (Jensen's inequality is strict unless $E(T|S) = T$ a.s. $P_\theta$)
$= R(\theta, T),$
contradicting the UMVU property of $T$. $\square$
Theorem 2.1.14. If $\mathcal{P}$ is an exponential family of full rank (i.e. $\{\eta_1, \ldots, \eta_s\}$ and $\{T_1, \ldots, T_s\}$ are A.I. and $\eta(\Omega)$ contains an open subset of $\mathbb{R}^s$), then the Rao-Blackwell theorem applies to any U-estimable $g(\theta)$ with $S = T$.
Proof. $T$ is complete sufficient for $\mathcal{P}$. $\square$
[Some obvious U-estimable $g(\theta)$'s are
$E_\theta T_i(X) = \left.\frac{\partial A}{\partial\eta_i}\right|_{\eta = \eta(\theta)}, \quad \{\theta : \eta(\theta) \in \mathrm{int}(N)\},$
where $p^*_\eta(x) = e^{\sum \eta_i T_i(x) - A(\eta)} h(x)$ is the canonical representation of $p_\theta(x)$.]
Two methods for finding UMVUs
Method 1. Search for a function $\delta(T)$, where $T$ is a CSS, such that
$E_\theta\,\delta(T) = g(\theta)$ for all $\theta \in \Omega$.
Example 2.1.15. $X_1, \ldots, X_n$ iid $N(\xi, \sigma^2)$, $\xi \in \mathbb{R}$, $\sigma^2 > 0$.
$T = (\bar{X}, S^2)$ is a CSS.
$E_{\xi,\sigma^2}\bar{X} = \xi$, so $\bar{X}$ is UMVU for $\xi$.
Method 2. Search for an unbiased $\delta(X)$ and a CSS $T$. Then
$S = E(\delta(X)\,|\,T)$ is UMVU.
Example 2.1.16. $X_1, \ldots, X_n$ iid $U(0, \theta)$, $\theta > 0$; $g(\theta) = \theta/2$.
$\delta_1(X) = X_1$ is unbiased; $X_{(n)}$ is a CSS
$\Rightarrow S = E(X_1\,|\,X_{(n)})$ is UMVU.
To compute $S$ we note that, given $X_{(n)} = x$,
$X_1 = x$ w.p. $\frac{1}{n}$, and $X_1 \sim U(0, x)$ w.p. $1 - \frac{1}{n}$
$\Rightarrow S(x) = \frac{x}{n} + \left(1 - \frac{1}{n}\right)\frac{x}{2} = \frac{n+1}{n}\cdot\frac{x}{2}$
$\Rightarrow S(X_{(n)}) = \frac{n+1}{2n}\, X_{(n)}$ is UMVU for $\frac{\theta}{2}$
$\Rightarrow \frac{n+1}{n}\, X_{(n)}$ is UMVU for $\theta$.
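The simulation below is an illustration added here (not part of the notes) of Example 2.1.16: it compares the UMVU estimator $\frac{n+1}{n}X_{(n)}$ with $2\bar{X}$ (also unbiased) and the raw maximum $X_{(n)}$ (biased). The values of $\theta$, $n$ and the seed are arbitrary.

```python
# Simulation sketch (not from the notes) for Example 2.1.16: for U(0, theta) samples,
# compare (n+1)/n * X_(n) (UMVU), 2*Xbar (unbiased, larger variance) and X_(n) (biased).
import numpy as np

rng = np.random.default_rng(3)
theta, n, reps = 2.0, 10, 100000
x = rng.uniform(0, theta, size=(reps, n))
umvu = (n + 1) / n * x.max(axis=1)
two_xbar = 2 * x.mean(axis=1)
for name, est in [("(n+1)/n X_(n)", umvu), ("2 Xbar", two_xbar), ("X_(n)", x.max(axis=1))]:
    print(name, est.mean() - theta, est.var())   # bias (near 0 for the first two) and variance
```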
Remark 2.1.17.
(a) Convexity of $L(\theta, \cdot)$ is crucial to the Rao-Blackwell theorem.
(b) Large-sample theory tends to support the use of convex $L(\theta, \cdot)$.
Heuristically, if $X_1, \ldots, X_n$ are iid, then as $n \to \infty$ the error in estimating $g(\theta)$ tends to 0 for any reasonable estimates (in some probabilistic sense). Thus only the behavior of $L(\theta, d)$ for $d$ close to $g(\theta)$ is relevant for large samples.
A Taylor expansion around $d = g(\theta)$ gives
$L(\theta, d) = a(\theta) + b(\theta)(d - g(\theta)) + c(\theta)(d - g(\theta))^2 + \text{Remainder}.$
But
$L(\theta, g(\theta)) = 0 \Rightarrow a(\theta) = 0$, and $L(\theta, d) \geq 0 \Rightarrow b(\theta) = 0.$
Hence locally $L(\theta, d) \approx c(\theta)(d - g(\theta))^2$, a convex weighted squared error loss function.
Example 2.1.18. Observe $X_1, \ldots, X_m$, iid $N(\xi, \sigma^2)$, and $Y_1, \ldots, Y_n$, iid $N(\eta, \tau^2)$, independent of $X_1, \ldots, X_m$.
(i) For the 4-parameter family $\mathcal{P} = \{P_{\xi,\eta,\sigma^2,\tau^2}\}$, $(\bar{X}, \bar{Y}, S_X^2, S_Y^2)$ is a CSS, since the exponential family is of full rank. Hence $\bar{X}$ and $S_X^2$ are UMVU for $\xi$ and $\sigma^2$ respectively, and $\bar{Y}$ and $S_Y^2$ are UMVU for $\eta$ and $\tau^2$.
(ii) For the 3-parameter family $\mathcal{P} = \{P_{\xi,\eta,\sigma^2}\}$ (i.e. $\tau^2 = \sigma^2$), $(\bar{X}, \bar{Y}, SS)$ is a CSS, where $SS := (m-1)S_X^2 + (n-1)S_Y^2$. Hence $\bar{X}$, $\bar{Y}$ and $\frac{SS}{m+n-2}$ are UMVU for $\xi$, $\eta$ and $\sigma^2$ respectively.
(iii) For the 3-parameter family with $\xi = \eta$, $\sigma^2 \neq \tau^2$ (which arises when estimating a mean from 2 sets of readings with different accuracies), $(\bar{X}, \bar{Y}, S_X^2, S_Y^2)$ is minimal sufficient but not complete, since $\bar{X} - \bar{Y} \neq 0$ a.s. $\mathcal{P}$, but $E(\bar{X} - \bar{Y}) = 0$ for all $\xi, \sigma^2, \tau^2$.
To deal with Case (iii) we shall first show the following: if $\frac{\sigma^2}{\tau^2} = r$ for some fixed $r$, i.e.
$\mathcal{P}^* = \{P_{\xi,\sigma^2,\sigma^2/r} : \xi \in \mathbb{R}, \sigma^2 > 0\},$
then $T = \left(\sum X_i + r\sum Y_j,\ \sum X_i^2 + r\sum Y_j^2\right)$ is a CSS.
Proof.
$p_{\xi,\sigma^2}(x, y) = \frac{1}{(2\pi)^{\frac{m+n}{2}}\sigma^m(\sigma^2/r)^{n/2}}\exp\left\{-\frac{1}{2\sigma^2}\left(\sum x_i^2 + r\sum y_j^2\right) + \frac{\xi}{\sigma^2}\left(\sum x_i + r\sum y_j\right) - \frac{(m + rn)\xi^2}{2\sigma^2}\right\},$
a two-parameter exponential family of full rank in $\left(\sum x_i + r\sum y_j,\ \sum x_i^2 + r\sum y_j^2\right)$. $\square$
Since $T$ is a CSS for $\mathcal{P}^*$ and since $T_1 = \frac{\sum X_i + r\sum Y_j}{m + rn}$ is unbiased for $\xi$, it is UMVU for $\xi$ in $\mathcal{P}^*$.
$T_1$ is also unbiased for $\xi$ in $\mathcal{P} = \{P_{\xi,\sigma^2,\tau^2}\}$
$\Rightarrow V(\xi_0, \sigma_0^2, \tau_0^2) \leq \mathrm{Var}_{\xi_0,\sigma_0^2,\tau_0^2}(T_1) = \frac{\sigma_0^2\tau_0^2}{m\tau_0^2 + n\sigma_0^2}$, where $\tau_0^2 = \sigma_0^2/r$.
($V$ is the smallest variance of all unbiased estimators of $\xi$ for $\mathcal{P}$, evaluated at $(\xi_0, \sigma_0^2, \tau_0^2)$.)
On the other hand, every $T$ which is unbiased for $\xi$ in $\mathcal{P}$ is also unbiased in $\mathcal{P}^*$. Hence if $T$ is unbiased for $\xi$ in $\mathcal{P}$, then
$\mathrm{Var}_{\xi_0,\sigma_0^2,\tau_0^2}(T) \geq \mathrm{Var}_{\xi_0,\sigma_0^2,\tau_0^2}\left(\frac{\sum X_i + r\sum Y_j}{m + rn}\right), \quad\text{where } r = \frac{\sigma_0^2}{\tau_0^2},$
and the inequality continues to hold with the left-hand side replaced by $V(\xi_0, \sigma_0^2, \tau_0^2)$.
So $V(\xi_0, \sigma_0^2, \tau_0^2) = \frac{\sigma_0^2\tau_0^2}{m\tau_0^2 + n\sigma_0^2}$ and the LMVU estimator at $(\xi_0, \sigma_0^2, \tau_0^2)$ is
$\frac{\frac{1}{\sigma_0^2}\sum X_i + \frac{1}{\tau_0^2}\sum Y_j}{\frac{m}{\sigma_0^2} + \frac{n}{\tau_0^2}}.$
Since this estimate depends on the ratio $r = \sigma_0^2/\tau_0^2$, a UMVU for $\xi$ does not exist in $\mathcal{P}$. A natural estimate for $\xi$ is
$\hat{\xi} = \frac{\frac{1}{S_X^2}\sum X_i + \frac{1}{S_Y^2}\sum Y_j}{\frac{m}{S_X^2} + \frac{n}{S_Y^2}}.$
(See Graybill and Deal, Biometrics, 1959, pp. 543-550 for its properties.)
2.2. Non-parametric families
Consider $X = (X_1, \ldots, X_n)$ where $X_1, \ldots, X_n$ are iid $F$, with $F \in \mathcal{F}$, a family of distribution functions, and $P_F$ is the corresponding product measure on $(\mathbb{R}^n, \mathcal{B}^n)$. For example,
$\mathcal{F}_0$ = df's with density relative to Lebesgue measure,
$\mathcal{F}_1$ = df's with $\int |x|\, F(dx) < \infty$,
$\mathcal{F}_2$ = df's with $\int x^2\, F(dx) < \infty$, etc.
The estimand is $g: \mathcal{F} \to \mathbb{R}$. For example,
$g(F) = \int x\, F(dx) = \xi_F$, $\quad g(F) = \int x^2\, F(dx)$, $\quad g(F) = F(a)$, $\quad g(F) = F^{-1}(p)$.
Proposition 2.2.1. If $\mathcal{F}_0$ is defined as above, then $(X_{(1)}, \ldots, X_{(n)})$ is complete sufficient for $\mathcal{F}_0$ (i.e. for the corresponding family of probability measures $\mathcal{P}$).
Proof. We know that $T(X) = (X_{(1)}, \ldots, X_{(n)})$ is sufficient for $\mathcal{P}$. It remains to show (by problem 1.6.32, p. 72) that $T$ is complete and sufficient for a family $\mathcal{P}_0 \subseteq \mathcal{P}$ such that each member of $\mathcal{P}_0$ has positive density on $\mathbb{R}^n$. Choose $\mathcal{P}_0$ to be the set of probability measures on $\mathcal{B}^n$ with densities relative to Lebesgue measure,
$C(\theta_1, \ldots, \theta_n)\exp\left\{\theta_1\sum_i x_i + \theta_2\sum_{i<j} x_i x_j + \cdots + \theta_n x_1\cdots x_n - \left(\sum x_i^2\right)^n\right\}.$
This is an exponential family whose natural parameter set $N$ contains an open set ($N = \mathbb{R}^n$). So $S(x) = \left(\sum x_i, \sum_{i<j} x_i x_j, \ldots, x_1\cdots x_n\right)$ is complete. But $S$ is equivalent to $T$ (consider the $n$th degree polynomial whose zeroes are $x_{(1)}, \ldots, x_{(n)}$), so $T$ is complete for $\mathcal{F}_0$. $\square$
Measurable functions of the order statistics. If $T(x) := (x_{(1)}, \ldots, x_{(n)})$ then
$\delta(X_1, \ldots, X_n) \in \sigma(T) \iff \delta(X_1, \ldots, X_n) = \delta(X_{\pi_1}, \ldots, X_{\pi_n})$
for every permutation $(\pi_1, \ldots, \pi_n)$ of $(1, \ldots, n)$. Since $T$ is a CSS for $\mathcal{F}_0$, this enables us to identify UMVU estimators of estimands $g$ for which they exist.
Example 2.2.2. $g(F) = F(a)$. An obvious unbiased estimator of $F(a)$ is
$T_1(X) := \frac{1}{n}\sum_{i=1}^{n} I_{(-\infty,a]}(X_i),$
and $T_1 \in \sigma(T)$, so $T_1$ is UMVU for $F(a)$.
Example 2.2.3. $g(F) = \int x\, dF = \xi_F$, $F \in \mathcal{F}_0 \cap \mathcal{F}_2$. Let
$T_2(x) = \frac{1}{n}\sum_{i=1}^{n} x_i.$
Then $T_2 \in \sigma(T)$ and, since $T$ is also complete for $\mathcal{F}_0 \cap \mathcal{F}_2$, $T_2$ is therefore UMVU for $\xi_F$.
Example 2.2.4. $g(F) = \sigma_F^2$, $F \in \mathcal{F}_0 \cap \mathcal{F}_4$. Let
$T_3(x) = S^2(x) = \frac{\sum(x_i - \bar{x})^2}{n-1} = \frac{\sum\left(x_{(i)} - \frac{1}{n}\sum x_{(i)}\right)^2}{n-1}.$
$T_3 \in \sigma(T)$ and is unbiased for $\sigma_F^2$. Since $T$ is complete for $\mathcal{F}_0 \cap \mathcal{F}_4$, $T_3$ is UMVU for $\sigma_F^2$.
Remark 2.2.5. $T$ complete for $\mathcal{F}$ does not imply generally that $T$ is complete for $\mathcal{F}^* \subseteq \mathcal{F}$. In fact the reverse is true: completeness for $\mathcal{F}^*$ implies completeness for $\mathcal{F}$. However, the same argument used in the proof of Proposition 2.2.1 shows that
$T$ is complete for $\mathcal{F}_0 \cap \mathcal{F}_2$ (used in Example 2.2.3) and
$T$ is complete for $\mathcal{F}_0 \cap \mathcal{F}_4$ (used in Example 2.2.4).
Example 2.2.6. $g(F) = \xi_F^2$, $F \in \mathcal{F}_0 \cap \mathcal{F}_4$.
$T_4(X) = \bar{X}^2 - \frac{1}{n}S^2(X)$ is UMVU for $g(F)$.
This result could also be obtained by observing that $X_1X_2$ is unbiased for $\xi_F^2$, $F \in \mathcal{F}_0 \cap \mathcal{F}_4$; therefore $E(X_1X_2\,|\,X_{(1)}, \ldots, X_{(n)})$ is UMVU. But conditioned on $X_{(1)}, \ldots, X_{(n)}$,
$X_1X_2 = X_{(i)}X_{(j)}$ w.p. $\frac{2}{n(n-1)}$ for each subset $\{i, j\}$ of $\{1, \ldots, n\}$
$\Rightarrow E(X_1X_2\,|\,X_{(1)}, \ldots, X_{(n)}) = \frac{1}{n(n-1)}\sum_{i\neq j} X_{(i)}X_{(j)} = \frac{1}{n(n-1)}\left[\left(\sum X_i\right)^2 - \sum X_i^2\right] = \bar{X}^2 - \frac{1}{n}S^2(X) = T_4(X).$
More generally, suppose $g(F)$ is U-estimable in $\mathcal{F}_0$. Then there exists $\delta(X_1, \ldots, X_m)$ such that $E_F\,\delta(X_1, \ldots, X_m) = g(F)$ for all $F \in \mathcal{F}_0$. Suppose also that $\delta(X_1, \ldots, X_m)$ has finite second moment for $F \in \mathcal{F}_0 \cap \mathcal{F}_k$ for some positive integer $k$. We can assume $\delta$ is symmetric in $X_1, \ldots, X_m$, since if not we can redefine $\delta$ as
$\delta^*(X_1, \ldots, X_m) = \frac{1}{m!}\sum_{\text{permutations } \pi \text{ of } (1,\ldots,m)} \delta(X_{\pi_1}, \ldots, X_{\pi_m}),$
which is unbiased and symmetric.
Now we define the U-statistic
$T = \binom{n}{m}^{-1}\sum_{1 \leq i_1 < i_2 < \cdots < i_m \leq n} \delta(X_{i_1}, \ldots, X_{i_m}).$
This is symmetric in $X_1, \ldots, X_n$ and unbiased, and therefore UMVU for $g(F)$, $F \in \mathcal{F}_0 \cap \mathcal{F}_k$.
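The Python sketch below is an illustration added here (not part of the notes) of the U-statistic construction just defined, using the kernel $\delta(x_1, x_2) = \frac{1}{2}(x_1 - x_2)^2$ of degree $m = 2$; the resulting U-statistic is identically the unbiased sample variance $S^2$.

```python
# Sketch of the U-statistic above with kernel delta(x1, x2) = (x1 - x2)^2 / 2 (degree m = 2);
# the resulting U-statistic equals the usual unbiased sample variance S^2.
from itertools import combinations
import numpy as np

def u_statistic(x, delta, m):
    combos = list(combinations(range(len(x)), m))
    return sum(delta(*(x[i] for i in idx)) for idx in combos) / len(combos)

rng = np.random.default_rng(4)
x = rng.normal(size=12)
u = u_statistic(x, lambda a, b: 0.5 * (a - b)**2, 2)
print(u, x.var(ddof=1))    # identical up to rounding
```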
Questions
(1) Which $g(F)$ are U-estimable?
(2) If $g$ is U-estimable, what is the smallest value of $m$ for which there exists a U-statistic for $g$ of the form $T$? This number is called the degree of $g$.
Proposition 2.2.7. If $g$ is of degree 1, then for any $F_1, F_2 \in \mathcal{F}_0$, $g(\alpha F_1 + (1-\alpha)F_2)$ is linear in $\alpha$.
Proof. If $g$ is of degree 1, there exists $\delta(X_1)$ such that
$\int \delta(x)\, F(dx) = g(F)$
$\Rightarrow g(\alpha F_1 + (1-\alpha)F_2) = \alpha\int \delta(x)\, F_1(dx) + (1-\alpha)\int \delta(x)\, F_2(dx) = \alpha g(F_1) + (1-\alpha)g(F_2). \quad \square$
Generalization. If $g$ is of degree $s$, then $g(\alpha F_1 + (1-\alpha)F_2)$ is a polynomial in $\alpha$ of degree $\leq s$ (since $dF(x_1, \ldots, x_s)$ contributes a term $\alpha^s\, dF_1(x_1)\cdots dF_1(x_s)$ when $F$ is replaced by $\alpha F_1 + (1-\alpha)F_2$).
Example 2.2.8. $g(F) = \sigma_F^2$ is of degree 2 in $\mathcal{F}_0 \cap \mathcal{F}_2$.
Proof. Let $\delta(X_1, X_2) = \frac{1}{2}(X_1 - X_2)^2$; then
$E_F\,\delta = E_F X_1^2 - E_F(X_1X_2) = \sigma_F^2,$
so $\deg(g) \leq 2$. To show $\deg(g) \neq 1$, consider
$g(\alpha F_1 + (1-\alpha)F_2) = \sigma^2_{\alpha F_1 + (1-\alpha)F_2} = \alpha\int x^2\, dF_1(x) + (1-\alpha)\int x^2\, dF_2(x) - [\alpha\,\xi_{F_1} + (1-\alpha)\,\xi_{F_2}]^2,$
and this is linear in $\alpha$ $\iff$ $\xi_{F_1} = \xi_{F_2}$. But this is not the case for every $F_1, F_2 \in \mathcal{F}_0 \cap \mathcal{F}_2$. Hence $\deg(g) = 2$. $\square$
Serfling (1980) is a good reference on U-statistics.
Example 2.2.9. $g(F) = \sigma_F$ is not U-estimable in $\mathcal{F}_0$, since $g(\alpha F_1 + (1-\alpha)F_2)$ is not a polynomial in $\alpha$.
2.3. The Information Inequality
For any estimator $T \in \Delta$ of $g(\theta)$ and any function $\psi(X, \theta)$ such that $E_\theta|\psi(X,\theta)|^2 < \infty$, we have the inequality
(2.3.1) $\mathrm{Var}_\theta(T) \geq \frac{|\mathrm{Cov}_\theta(T, \psi)|^2}{\mathrm{Var}_\theta(\psi(X,\theta))}.$
However, this will not in general provide a useful lower bound for $\mathrm{Var}_\theta T$, since the RHS depends on $T$. It can be useful however when the RHS depends on $T$ in a simple way, in particular when it depends on $T$ only through $E_\theta T$.
Theorem 2.3.1. $\mathrm{Cov}_\theta(T, \psi)$ depends on $T$ only through $E_\theta T$ iff
$\mathrm{Cov}_\theta(U, \psi) = 0$ for all $U \in \mathcal{U} \cap \Delta$ (unbiased square-integrable estimators of 0).
Proof. ($\Leftarrow$) Suppose $\mathrm{Cov}_\theta(U, \psi) = 0$ for all $U \in \mathcal{U} \cap \Delta$ and that $T_1, T_2$ are two estimators with finite variance and
$E_\theta T_1 = E_\theta T_2$ for all $\theta \in \Omega$.
Then $T_1 - T_2 \in \mathcal{U}$, so $\mathrm{Cov}_\theta(T_1, \psi) = \mathrm{Cov}_\theta(T_2, \psi)$.
($\Rightarrow$) If $\mathrm{Cov}_\theta(T, \psi)$ depends on $T$ only through $E_\theta T$ and if $U \in \mathcal{U}$, then
$\mathrm{Cov}_\theta(T + U, \psi) = \mathrm{Cov}_\theta(T, \psi) \Rightarrow \mathrm{Cov}_\theta(U, \psi) = 0. \quad \square$
Hammersley-Chapman-Robbins Inequality
Suppose $X$ has probability density $p(x, \theta)$, $\theta \in \Omega$, where $p(x, \theta) > 0$ for all $x$ and $\theta$. Then
$\psi(x, \theta) = \frac{p(x, \theta + \Delta)}{p(x, \theta)} - 1$
satisfies the conditions of the previous theorem, i.e. $\mathrm{Cov}_\theta(U, \psi) = 0$ for all $U \in \mathcal{U} \cap \Delta$, since
$E_\theta\,\psi(X, \theta) = 0$ and $E_\theta(U\psi) = \int U(x)\,(p(x, \theta + \Delta) - p(x, \theta))\, d\mu(x) = 0.$
For any statistic $S \in \Delta$,
$\mathrm{Cov}_\theta(S, \psi) = E_\theta(S\psi) = \int S\,[p(x, \theta + \Delta) - p(x, \theta)]\, \mu(dx) = E_{\theta+\Delta}S - E_\theta S,$
which is 0 if $S \in \mathcal{U}$ and equals $g(\theta + \Delta) - g(\theta)$ if $S$ is unbiased for $g(\cdot)$.
Hence from (2.3.1), if $T \in \Delta$ is unbiased for $g(\theta)$,
$\mathrm{Var}_\theta(T) \geq \frac{(g(\theta + \Delta) - g(\theta))^2}{E_\theta\left[\left(\frac{p(X,\theta+\Delta)}{p(X,\theta)} - 1\right)^2\right]} \quad \forall \Delta.$
Hence we obtain the HCR bound: if $T$ is unbiased for $g(\theta)$,
$\mathrm{Var}_\theta(T) \geq \sup_\Delta \frac{(g(\theta + \Delta) - g(\theta))^2}{\mathrm{Var}_\theta\left(\frac{p(X,\theta+\Delta)}{p(X,\theta)}\right)}.$
Letting $\Delta \to 0$ in the HCR bound gives
$\mathrm{Var}_\theta T \geq \lim_{\Delta\to 0}\frac{\left(\frac{g(\theta+\Delta) - g(\theta)}{\Delta}\right)^2}{E_\theta\left[\left(\frac{1}{\Delta}\,\frac{p(X,\theta+\Delta) - p(X,\theta)}{p(X,\theta)}\right)^2\right]} = \frac{g'(\theta)^2}{E_\theta\left(\frac{\partial p/\partial\theta}{p}\right)^2} = \frac{g'(\theta)^2}{E_\theta\left(\frac{\partial}{\partial\theta}\log p(X,\theta)\right)^2},$
provided $g$ is differentiable and we can differentiate under the expectation. These steps are legitimized under the conditions of the following theorem.
Theorem 2.3.2. (Cramér-Rao Lower Bound, CRLB) Suppose that $p(x, \theta) > 0$ and $\frac{\partial}{\partial\theta}\log p(x, \theta)$ exists for all $x$ and $\theta$, and that for each $\theta$ there exists $\delta > 0$ such that
$P_{\theta+\Delta} \ll P_\theta$ and $\frac{1}{|\Delta|}\left|\frac{p(x, \theta+\Delta)}{p(x, \theta)} - 1\right| \leq G_\delta(x, \theta)$ whenever $0 < |\Delta| < \delta$
(where $G_\delta$ is independent of $\Delta$ and $E_\theta G_\delta(X, \theta)^2 < \infty$ for all $\theta$). Then for any unbiased estimator $T$ of $g(\theta)$,
$\mathrm{Var}_\theta T \geq \frac{g'(\theta)^2}{I(\theta)},$
where
$I(\theta) = E_\theta\left(\frac{\partial\log p(X,\theta)}{\partial\theta}\right)^2$ (definition of Fisher information), and
$(g'(\theta))^2 = \limsup_{\Delta\to 0}\left(\frac{g(\theta+\Delta) - g(\theta)}{\Delta}\right)^2.$
Proof. By the HCR bound,
$\mathrm{Var}_\theta(T)\int\left(\frac{p(x, \theta+\Delta)}{p(x,\theta)} - 1\right)^2\frac{p(x,\theta)}{\Delta^2}\,\mu(dx) \geq \left(\frac{g(\theta+\Delta) - g(\theta)}{\Delta}\right)^2.$
Let $\{\Delta_n\}$ be a sequence such that $\Delta_n \to 0$ and
$\left(\frac{g(\theta+\Delta_n) - g(\theta)}{\Delta_n}\right)^2 \to (g'(\theta))^2.$
Then setting $\Delta = \Delta_n$ in the above inequality and letting $n \to \infty$ gives (by dominated convergence)
$\mathrm{Var}_\theta(T)\, E_\theta\left(\frac{\partial}{\partial\theta}\log p(X,\theta)\right)^2 \geq g'(\theta)^2. \quad \square$
Corollary 2.3.3. If $X_1, \ldots, X_n$ are iid $P_{1,\theta}$ (the marginal distribution of $X_i$) and the corresponding marginal density $p_1(x, \theta)$ satisfies the conditions of the Cramér-Rao Lower Bound theorem, then for any unbiased estimator $T(X_1, \ldots, X_n)$ of $g(\theta)$,
$\mathrm{Var}_\theta(T) \geq \frac{g'(\theta)^2}{nI_1(\theta)}, \quad\text{where } I_1(\theta) = E_\theta\left(\frac{\partial\log p_1(X_1,\theta)}{\partial\theta}\right)^2.$
Proof. The sample space is $\mathcal{X}^n$ and
$\frac{dP_\theta}{d\mu^n}(x) = \prod_{i=1}^{n} p_1(x_i, \theta), \quad\text{where } \mu^n = \mu \times \cdots \times \mu.$
The Fisher information for $P_\theta$ is
$\int\left(\frac{\partial}{\partial\theta}\log\prod_{i=1}^{n} p_1(x_i,\theta)\right)^2 p_1(x_1,\theta)\cdots p_1(x_n,\theta)\,\mu^n(dx) = \int\left(\sum_i\frac{\partial\log p_1(x_i,\theta)}{\partial\theta}\right)^2 p_1(x_1,\theta)\cdots p_1(x_n,\theta)\,\mu^n(dx) = nI_1(\theta),$
since, for $i \neq j$,
$E_\theta\left[\frac{\partial\log p_1(X_i,\theta)}{\partial\theta}\,\frac{\partial\log p_1(X_j,\theta)}{\partial\theta}\right] = E_\theta\left[\frac{\partial\log p_1(X_i,\theta)}{\partial\theta}\right]E_\theta\left[\frac{\partial\log p_1(X_j,\theta)}{\partial\theta}\right]$ (by independence)
and
$E_\theta\left(\frac{\partial}{\partial\theta}\log p_1(X_i,\theta)\right) = 0.$
(We can differentiate under the integral sign by dominated convergence and the assumptions on $p_1(x_i, \theta)$.)
The statement of the Corollary now follows if we can show that the assumptions on $p_1(x_i, \theta)$ carry over to $p(x, \theta)$, i.e. that $0 < |\Delta| < \delta$ implies
$\frac{1}{|\Delta|}\left|\frac{p_1(x_1,\theta+\Delta)\cdots p_1(x_n,\theta+\Delta)}{p_1(x_1,\theta)\cdots p_1(x_n,\theta)} - 1\right| \leq \tilde{G}(x, \theta),$
where $E_\theta\tilde{G}(X, \theta)^2 < \infty$. Now
$\frac{p_1(x_i, \theta+\Delta)}{p_1(x_i, \theta)} \leq 1 + |\Delta|\, G(x_i, \theta) \leq 1 + \delta\, G(x_i, \theta)$
and
$|a_1\cdots a_n - 1| \leq |a_1\cdots a_n - a_2\cdots a_n| + |a_2\cdots a_n - a_3\cdots a_n| + \cdots$ (by the triangle inequality)
$\leq |a_1 - 1||a_2\cdots a_n| + |a_2 - 1||a_3\cdots a_n| + \cdots + |a_n - 1|.$
Setting $a_i = \frac{p_1(x_i,\theta+\Delta)}{p_1(x_i,\theta)}$ gives
$\frac{1}{|\Delta|}\left|\frac{p_1(x_1,\theta+\Delta)\cdots p_1(x_n,\theta+\Delta)}{p_1(x_1,\theta)\cdots p_1(x_n,\theta)} - 1\right| \leq G(x_1,\theta)\prod_{i>1}(1 + \delta G(x_i,\theta)) + G(x_2,\theta)\prod_{i>2}(1 + \delta G(x_i,\theta)) + \cdots + G(x_n,\theta) := \tilde{G}(x,\theta),$
and $E_\theta\tilde{G}(X, \theta)^2 < \infty$ since $X_1, \ldots, X_n$ are independent and $E_\theta G(X_i, \theta)^2 < \infty$. (No $G(X_i, \theta)$ is raised to a power greater than 2.) $\square$
Corollary 2.3.4. Suppose $p(x, \theta)$ satisfies the conditions of Theorem 2.3.2 and
$E_\theta T(X) = g(\theta) + b(\theta),$
i.e. $T(X)$ has bias $b(\theta)$ for estimating $g(\theta)$. Then
$\mathrm{MSE}(T) = E_\theta(T(X) - g(\theta))^2 = b^2(\theta) + \mathrm{Var}_\theta(T) \geq b^2(\theta) + \frac{c(\theta)}{I(\theta)},$
where
$c(\theta) = \limsup_{\Delta\to 0}\left(\frac{g(\theta+\Delta) + b(\theta+\Delta) - g(\theta) - b(\theta)}{\Delta}\right)^2.$
Proof. $T$ is unbiased for $g(\theta) + b(\theta)$, so
$\mathrm{Var}_\theta(T) = E_\theta(T(X) - g(\theta) - b(\theta))^2 \geq \frac{c(\theta)}{I(\theta)}.$
Hence $\mathrm{MSE} = E_\theta(T(X) - g(\theta))^2 = \mathrm{Var}_\theta T(X) + [E_\theta(T(X) - g(\theta))]^2 \geq b^2(\theta) + \frac{c(\theta)}{I(\theta)}$. $\square$
Behavior of $I(\theta)$ under reparameterization. Suppose $\xi = h(\theta)$ reparameterizes $\{P_\theta : \theta \in \Omega\}$ to $\{P^*_\xi : \xi \in h(\Omega)\}$. Then
$p^*(x, \xi) = p(x, h^{-}(\xi)),$
where $h^{-}$ denotes the inverse mapping and, by definition,
(2.3.3) $I^*(\xi) = E\left(\frac{\partial\log p^*(X,\xi)}{\partial\xi}\right)^2 = E\left[\left(\frac{\partial\log p(X,\theta)}{\partial\theta}\right)^2\right]_{\theta = h^{-}(\xi)}\left(\frac{d h^{-}(\xi)}{d\xi}\right)^2 = I(h^{-}(\xi))\left(\frac{d h^{-}(\xi)}{d\xi}\right)^2 = \left.\frac{I(\theta)}{(h'(\theta))^2}\right|_{\theta = h^{-}(\xi)}.$
An alternative expression for $I(\theta)$. Provided $\frac{\partial^2}{\partial\theta^2}\log p(x, \theta)$ exists for all $x$, and if
(2.3.4) $\int\frac{\partial^2}{\partial\theta^2} p(x, \theta)\, d\mu(x) = \frac{\partial^2}{\partial\theta^2}\int p(x, \theta)\, d\mu(x) = 0,$
then
$I(\theta) = -E_\theta\left[\frac{\partial^2}{\partial\theta^2}\log p(x, \theta)\right].$
Proof.
$\frac{\partial^2}{\partial\theta^2}\log p(x,\theta) = \frac{\frac{\partial^2}{\partial\theta^2}p(x,\theta)}{p(x,\theta)} - \frac{\left(\frac{\partial p(x,\theta)}{\partial\theta}\right)^2}{p(x,\theta)^2}$
and $E_\theta\left[p(X,\theta)^{-1}\frac{\partial^2}{\partial\theta^2}p(X,\theta)\right] = 0$ by (2.3.4). $\square$
Theorem 2.3.5. (One-parameter exponential family.) Suppose
$p(x, \theta) = \exp(T(x)\,\eta(\theta) - B(\theta))\, h(x),$
where $\tau = E_\theta T$ and $\eta(\theta) \in \mathrm{int}(N)$. Then
$I(\tau) = \frac{1}{\mathrm{Var}_\theta T}.$
Proof.
$\int e^{\eta T(x) - A(\eta)} h(x)\, \mu(dx) = 1$
and $A'(\eta) = E_\eta T$, $A''(\eta) = \mathrm{Var}_\eta T$. Hence $\tau = A'(\eta)$.
Now $p^*(x, \eta) = e^{\eta T(x) - A(\eta)} h(x)$ (this is the canonical representation of the density of the exponential family; see (1.3.7)). Hence $\frac{\partial\log p^*(x,\eta)}{\partial\eta} = T(x) - A'(\eta)$, so
$I(\eta) = E_\eta(T(X) - A'(\eta))^2 = \mathrm{Var}_\eta T.$
Since $\tau = A'(\eta)$, i.e. $\eta = (A')^{-1}(\tau)$, we have by (2.3.3)
$I(\tau) = \frac{I(\eta)}{(A''(\eta))^2} = \frac{\mathrm{Var}_\eta T}{(\mathrm{Var}_\eta T)^2} = \frac{1}{\mathrm{Var}_\eta T}. \quad \square$
Remark 2.3.6. T attains the CR lower bound in this case. A converse result also holds.
Under some regularity conditions, attainment of the CR lower bound implies that T is the
natural sucient statistic of some exponential family fP g.
Example 2.3.7.
Poisson family. Suppose that X1 : : : Xn are iid Poisson. Then
p(x ) = e;n = e
P xi
Qn1x !
i
P xi 1 1
;n+n lg n
Qn x !
1 i
Px
(x ) = e;ne=n+ n i Qn1x !
1 i
Px
T (x) = n i is UMVU for A0() = e=n = A00 () = n1 e=n = n
I () = V ar T = n
I () = V ar1 T = n
= n lg = e=n
The information on n lg increases with . The information on decreases with .
Theorem 2.3.8. Alternative version of CRLB theorem.
@p (x ) is nite
Suppose is an open interval, A = fx : p(x ) > 0g is independent
of , @
R
@p (x )(dx): Then
for all x 2 A and for all 2 , E @@ log p(x ) = 0, @@ (E T ) = T (x) @
V ar (T (X )) ( @@ E T )2
I ()
2.4. MULTIPARAMETER CASE
47
Proof. Choose (x ) = @@ log p(x ) in (2.3.1).
@
Cov T
@ log p(X )
=
T (x) @ (dx)
@ (E T )
= @
=
@
V ar @ log p(X )
@
E T @ log p(X )
Z
@p(X )
= I ():
2.4. Multiparameter Case
Theorem 2.4.1. For T (X ) 1 (X ) s (X ) functions with nite 2nd moments
under P , we have the multiparameter analogue of (2.3.1),
Var (T ) T C ;1
where T = (1 s), i = Cov (T i ) and C = Cov (i j )]sij =1
Proof. Let Y2^ denote the minimum
mean squared error linear predictor, of Y = T ; E T
3
in terms of = 4
1 ; E 1
5. Then
s ; E s
Y^ = aT where E (Y ; Y^ )(j ; E j ) = 0 j = 1 : : : s, i.e.
C a = = Cov (T )
These equations have a solution and hence
a = C ;1
where C ;1 is any generalized inverse of C = E (0). So Y^ = T C ;1 .
Also
E (Y ; Y^ )2 = EY 2 ; E Y^ 2 since Y^ ? Y ; Y^ in L2 (P )
= V ar T ; E ( T C ;1 T C ;1 )
= V ar T ; T C ;1 ) E (Y ; Y^ )2 = V ar T ; T C ;1 0
) V ar T T C ;1 Notice that the right hand side is the same for any generalized inverse C ;1 of C .
48
2. UNBIASEDNESS
Generalization of the Information Inequality
Assume that
8 is an open interval in Rs
>
< A = fx : p(x ) > 0g is independent of @p (x ) is nite 8x 2 A 8 2 i = 1 : : : s
>
@
: Ei @ log p(X ) = 0 i = 1 : : : s
@i
Definition 2.4.2.
The information matrix.
s
@
@
I () := E @ log p(X ) @ log p(X )
j
i@
ijs=1
@
= Cov
@i log p(X ) @j log p(X )
ij =1
I () is strictly positive denite if f @@ i log p(X ) i = 1 : : : sg is linearly independent a:s: P .
Theorem 2.4.3. Under the previous assumptions, if I () is strictly positive denite and
if T (X ) satises
E T (X )2 < 1 8
and
@ E T (X ) = Z T (x) @ p(x )(dx)
@j @j
then
2 @ E T (X ) 3
@ 4
5
where =
V ar T (X ) T I ();1 1
@
@s E T (X )
Proof. This is a direct application of Theorem 2.4.1 with the functions i dened by
i (x ) = @@ log p(x )
i
2.4. MULTIPARAMETER CASE
49
Reparameterization
If i = fi(1 : : : s) i = 1 : : : s
s
@ log p
@
log
p
I () = E @ (X ) @ (X )
ij =1
"X X i j
@ #s
@
log
p
@
log
p
@
m
E @ (X ) @ (X ) @n
=
@
i
m
n
j ij =1
n m
= JI ()J T
where J =
h @j is
@i ij =1.
Corollary 2.4.4. If I () is strictly positive denite, T1 : : : Tn are nite variance unbiased for g1 () : : : gn() respectively and each Ti satises the conditions of theorem 2.4.3,
then
(2.4.1)
Cov T ( @g )I ;1()( @g )T
@
@
2 @g @g 3
@
@s
4
5:
where A B means aT (A ; B )a 0 8a 2 R n and @g
:=
@
@g
@g
1
1
n
@1
1
@ns
Proof. Since aT (Cov T )a = V ar(aT T ), (2,4,1) is equivalent to
;1 ()( @g )T a 8a 2 R n :
)
I
V ar (aT T ) aT ( @g
@
@
But the above inequality follows at once by applying theorem 2.4.3 to the real-valued statistic
aT T for which
2 @ E (aT T ) 3 2 @ E (T T ) 3
@1 @1 T
5 = 4 5 a = ( @g
=4
)
a:
@
@ E (aT T )
@ E (T T )
@s @s Corollary 2.4.5. If T1
1 s, then
Ts are unbiased for
Cov (T ) I ();1 :
Proof. Apply corollary 2.4.4 with gi () = i i = 1 : : : s (where @g
@ = Iss).
Remark 2.4.6. Suppose we wish to estimate 1 . If 2 : : : s are known then the CR
lower bound is
" 2#;1
@
log
p
E @ (X )
:
1
50
2. UNBIASEDNESS
If 2 : : : s are not known, then the CR bound for estimating 1 is the (1,1) component of
I ;1(), denoted by I ;1()11 (by Corollary 2.4.5). Naturally we expect
" 2#;1
@log
I ;1()11
E @ p(X )
1
By the general formula for the inverse of a partitioned matrix,
D
;1
;
DA
A
12
;
1
A = ;A;1 A D A;1 A DA 22A;1
12 22
22 21
22 21
where D = (A11 ; A12A;221 A21 );1, (see e.g. Ltkepohl, Introduction to Multiple Time Series
Analysis) we nd that
I ;1 ()11 = a ; bT1 A;1b a1
Theorem 2.4.7. (Order-s exponential family). Suppose that
p (x ) = exp
X
s
1
!
i () Ti (x) ; B () h (x)
2
is an order-s exponential family parameterized by
= E T
and that () contains an open subset of R s . Then
I () = C ;1
where
C = Cov (T ) :
Proof. By theorem 1.3.14 of chapter 1, we know that for 2 int(N )
@ 2 A = cov(T T ) cov(T ) = @A2 := @ 2 A s
i j
@1 @j
@2
@i @j ij=1
We also know that if (x ) is the canonical form of the density
@ log (x ) @ log (x ) = (T ; @A )(T ; @A )
i
@i
@j
@i j @j
@2A
) I ( ) = cov (T ) = 2
@
2.4. MULTIPARAMETER CASE
51
2 @A=@ 3
1
4
5, and hence by the reparameterization formula
=
Moreover, = E T = @A
@
@A=@s
2
2
I () = @@A2 I () @@A2
But the left hand side is
@2A
@2
(a symmetric matrix) and hence
@2 A ;1
I () = @2
= C ;1:
Examples of Information Matrices.
Example 2.4.8. X N ( ) 2 R s xed and non-singular. Then
p(x ) = p s 1 1=2 exp ; 12 (x ; )T ;1 (x ; )
( 2) jdetj
Writing = ij ]sij ;1 = ij ]sij , we have
s
@ log p (X ) = X
ik (xk ; k )
@i
k=1
X
@ log p (X ) = s (x ; )
jm m
m
@j
m=1
@ log p
X
s X
s
@
log
p
) E
@i (X ) @j (X ) = k=1 m=1 ik E (Xk ; k )(Xm ; m)jm
=
s X
s
X
ik km mj
k=1 m=1
= ;1 ;1
= ;1 :
Example 2.4.9.
The order-two exponential family fN ( ) 2 R > 0}.
p (x ) = p1 e; (x;)
1
2 2
2
2
2
1
= p1 e; 22 x2+ 2 x; 22 ;log 2
52
2. UNBIASEDNESS
where = . By Theorem 2.4.7, if we let
= 1 = 2
2
+ 2
X
= E
2
X
then we can rewrite p(x ) as
p(x ) = p1 exp 1()x + 2 ()x2 ; B ()
2
X
X
and since E X 2 = = E T T = X 2 ,
I () = Cov(T )];1
2
where Cov(T ) =
2
22 2 24 + 42 2 (check from MGF).
We can now nd I () from the reparameterization formula,
I () = JI ()J T
2 @ @
where J = 64 ...
@s
@1
1
1
@1
@s
...
@s
@s
3
75 = 1 2 = 2 2 .
0 2
+
) I ( );1 = (J ;1 )T I ;1 ()J ;1
1 ;= 0 1=2
1 ;= 22 where J ;1 =
1
0
2
= ;=
1= 2
22 24 + 42 2
0 1=2
2 0 = 0 2=2
1=2 0 ) I ( ) = 0 2= 2
(could also be obtained directly from p )
Example 2.4.10.
Location-scale families.
Suppose that f is a probability density with respect to Lebesgue measure, that f (x) is strictly
positive for all x and that
p(x ) = 1 f ( x ; ) 2 R > 0:
2.4. MULTIPARAMETER CASE
53
Then
@ log p (x ) = ; 1 f 0( x; )
@
f ( x; )
@ log p = ; 1 ; x ; f 0( x; )
@
2 f ( x; )
Z f 0( x; ) 2 1 x ; 1
I11 ( ) = 2
f(
)dx
f ( x; ) Z f 0(x) 2
1
= 2
f (x) f (x)dx
Z x ; f 0( x; ) 2 1 x ; 1
I22 ( ) = 2
1+ f(
)dx
f ( x; ) Z f 0(x) 2
1
= 2
1+x
f (x) f (x)dx
Z f 0( x; ) 1 x ; Z x ; f 0( x; ) 2 1 x ; 1
1
I12 ( ) = 2
f
(
)
dx
+
f(
)dx
x
;
x
;
2
f( ) f( ) Z f 0(x) 2
1
=0+ 2 x
f (x) f (x)dx
Thus I is a diagonal matrix if f is symmetric about 0. Example 2.4.9 is a special case of
this.
When the CR bound in theorem 2.4.3 is not sharp, it can be improved by using higher
order derivatives of . This is the content of the following theorem.
Theorem 2.4.11. (Bhattacharya bounds) Suppose p(x ) have common support for
any . Let T be unbiased for g (). Further assume
Z
@ i p(x )(dx) = g(i)() i = 1 : : : k
T (x) @
i
and
Z @i
@i p(x )(dx) = 0 i = 1 : : : k
Then
"
# "
#
V ar (T ) g(1) () : : : g(k)() V ;1 g(1) () : : : g(k)() T
where V = Cov
1 @
p(x) @ p(x
) : : :
1 @k
p(x) @k p(x
)
T , and V is assumed to be non-singular.
54
2. UNBIASEDNESS
Proof. This follows immediately from theorem 2.4.1 by taking
@ i p(x ) i = 1 : : : k
i (x ) = p(x1 ) @
i
giving
i = Cov(T i) = E (Ti )
Z
@ i p(x )(dx) = g(i) ():
= T (x) @
i
Example 2.4.12. X1 : : : Xn
iid Poisson(), > 0, and g() = e; = P (Xi = 0).
Then I0(X ) is unbiased for g(). Here
P xi) log Y 1
xi ! > 0
P
is an exponential family of full rank so T (X ) = Xi is complete and sucient. So the
UMVU estimator of g( is E I0(X1 )jT (X )] and
E I0 (X1)jT (X ) = t] = P (X1 = 0jT (X ) = t)
\ T (X ) = t)
= P (XP1 =(T0(X
) = t)
;
;
(
n
;
1)
1))t =t!
= e e ;n ((n ;
e (n)t =t!
p(x ) = e;n+(
= (1 ; n1 )t
so (1 ; n1 )T (X ) is UMVU.
Noting that T (X ) P (n), we can write its probability generating function as EzT (X ) =
;
n
e (1;z) , and hence
E (1 ; n1 )T (X ) = e; and
E (1 ; n1 )2T (X ) = exp(;n(1 ; (1 ; 1=n)2))
= e;2+ n :
1
) V ar (1 ; )T (X ) = e;2 (e n ; 1):
n
2.4. MULTIPARAMETER CASE
55
g0 ()2
nI1 () ,
where g() = e; and
X 2 1
(2.4.2)
I1() = E ;1 + 1 = 2 V ar X = 1
e;2 < e;2 (e n ; 1) = e;2 ( + 1 2 + )
) CRB =
n
n 2 n2
(Alternatively T is the CSS for a full-rank exponential family with E (T ) = , so I () =
n
1
V ar T = = nI1 ().)
The Bhattacharyga bound with k = 2,
The CRB for this problem is
g(1)() ;e; =
;
(2)
e
Px
@p
@
1
1 (x ) = p @ = @ log p = ;n + i
Px Px
2p
2
1
@
@
1
@p
2
2 (x ) = p @2 = @2 log p + ( p @ ) = ; 2 i + ( i ; n)2
n= 0 Cov (X ) = 0 2n2=2
Hence the Bhattacharyga bound is
n= 0 ;e; "
#
;
;
V ar T ;e e
0 2n2=2
e;
2
= e;2 ( n + 2n2 ) > CRB
but less than V ar T . By taking more derivatives the bound can be make arbitrarily close
to V ar T . Extends to the multiparameter case also.
More on Fisher Information-Conditional FI. Suppose X Y v p(x y ) with respect to 1 2 on X Y , i.e.
d P (x y) = p(x y )
d(1 2) The Fisher information about in (X Y ) is
"
T @ log p(X Y ) #
@
log
p
(
X
Y
)
I (XY )() = E
@
@
"
T @ log p(X ) #
@
log
p
(
X
)
I (X ) () = E
@
@
g ()
56
2. UNBIASEDNESS
Additional information on contained in Y given X = x is dened to be
I (Y jX )( x) = E
@
@
Z @ log p(yjx ) T @ log p(yjx ) p(yjx )2(dy)
@
@
is the conditional density of Y given X = x.
=
)
where p(yjx ) = pp(xy
(x)
"
#
@ log p(Y jx ) T @ log p(Y jx )
Proposition 2.4.13.
Y
I (XY ) () = I (X ) () + E I (Y jX )( X )
Proof.
Z Z @(log p(x y ) ; log p(x )) T @(log p(x y ) ; log p(x )) p(x y )2(dy)1(dx)
E =
@
@
= I (XY ) () ; 2I (X ) () + I (X ) ()
= I (XY ) () ; I (X ) ()
since
Z Z @ log p(x ) T @ log p(x y ) p(x y )2(dy)1(dx)
@
@
X Y
Z @ log p(x ) T Z @p(x y ) =
2(dy) 1(dx)
@
@
X
Y
Z @ log p(x ) T @p(x ) =
1(dx)
@
@
X
Z @ log p(x ) T @ log p(x ) p(x )1(dx)
=
@
@
X
=I (X ) ()
X Y
Corollary 2.4.14. X1 : : : Xn are independent for any , then
I (X1 :::Xn) () =
n
X
i=1
I (Xi) ():
Corollary 2.4.15. Suppose one of n possible experiments is performed, depending on
the value of a r.v. J (= 1 2 : : : n), where the distribution of J is independent of and
2.4. MULTIPARAMETER CASE
57
J X1 : : : Xn are independent. Then
;
I (JXJ )() = EJ I XJ jJ ()
=
n
X
j =1
P (J = j )I (Xj ) ()
where I (Xj ) () is the information associated with the j th experiment.
Note: This is a property of the experiment and calculable independently of the outcome
of the experiment. If the value of J is known to be j then the information becomes the
conditional information given J = j which is I (Xj )().
58
2. UNBIASEDNESS
CHAPTER 3
Equivariance
3.1. Equivariance for Location family
In chapter 2 we introduced unbiasedness as a constraint to eliminate estimators which
may do well at a particular parameter value at the cost of poor performance elsewhere.
Within this limited class we could then sometimes determine uniformly minimum risk estimators for any convex loss function. Equivariance is a more physically motivated restriction.
We start by considering how equivariance enters as a natural constraint on statistics used to
estimate location parameters.
Suppose X = (X1 X2 Xn) has density
f (x ; ) = f (x1 ; xn ; )
where f is known, is a location parameter to be estimated and L ( d) is the loss when is
estimated as d. Suppose we have settled on T (x) as a reasonable estimator of as measured
by
R ( T ) = E (L ( T (x)))
Suppose another statistician B wants to measure the data using a dierent origin. So instead
of recording X1 X Xn, B records X10 = X1 + 273 Xn0 = Xn + 273 say (This would
be the case if we measured temperatures in C and B measured them in K.) Then
(3.1.1)
X 0 = X + a:
On the new scale the location parameter becomes
(3.1.2)
0 = + a
and the joint density of X0 = (X10 Xn0 ) is
f (x0 ; 0) = f (x01 ; x0n ; 0) :
The estimated value d on the original scale becomes
(3.1.3)
d0 = d + a
on the new one and the loss resulting from its use is L ( 0 d0). The problem of estimating
is said to be invariant under the transformations (3.1.1),(3.1.2), and (3.1.3) if
(3.1.4)
L ( + a d + a) = L ( d) :
59
60
3. EQUIVARIANCE
This condition on L is equivalent to the assumption that L has the functional form,
L ( d) = (d ; )
for some function .
Suppose we chose T (X) as a good estimator of in the original scale. Then since
estimation of 0 in terms of X10 Xn0 is exactly the same problem, we should use
T 0 (X10 Xn0 ) = T (X1 Xn) + a
as our estimator of 0 = + a. If
T (X1 + a Xn + a) = T (X1 Xn) + a
then we say that the estimator T is equivariant under the transformations
(3.1.1),(3.1.2), and (3.1.3) or location equivariant .
P
Remark 3.1.1. The mean, median, weighted average of order statistics (with wi = 1)
and the MLE of for the family f (x ; ) are all location equivariant.
Theorem 3.1.2. If X has density f (x ; 1) with respect to Lebesgue measure and T
is equivariant for with loss
L ( d) = ( ; d) :
Then the bias, risk and variance of T are independent of .
Proof. We give the proof for the bias. The other proofs are similar.
b ( ) = E
Z T (x) ; 1
= (T (x) ; ) f (x ; 1) (dx)
Z
= (T (x + 1) ; ) f (x) (dx) by shift-invariance of = E0T (X0) ; = E0 (T (X) + ) ; = E0T (X) :
Remark 3.1.3. Since the risk of an equivariant estimator is independent of , the determination of a uniformly minimum risk equivariant estimator reduces to nding the equivariant estimator with minimum (for every ) risk - such an estimator typically exists - and is
called a minimum risk equivariant (MRE) estimator. Our rst step is to nd a representation of all location equivariant estimators (Just as we found a representation of all unbiased
estimators in Chap 2)
3.1. EQUIVARIANCE FOR LOCATION FAMILY
61
Lemma 3.1.4. If T0 is any location-equivariant estimator then
(3.1.5)
T is equivariant () T (x) = T0 (x) ; U (x)
where U (x) is any function such that
(3.1.6)
U (x + a1) = U (x) for all x and a:
Proof. If T is location-equivariant, set
U (x) = T0 (x) ; T (x) :
Then
U (x + a1) = T0 (x + a1) ; T (x + a1)
= T0 (x) + a ; T (x) ; a
= U (x) :
Conversely if equation (3.1.5) and (3.1.6) hold, then
T (x + a1) = T0 (x + a1) ; U (x + a1)
= T0 (x) + a ; U (x)
= T (x) + a:
Lemma 3.1.5. U satises
if and only if
Proof.
()
U (x + a1) = U (x) for all x and a
U (x) = v (x1 ; xn xn;1 ; xn) for some function v:
U (x) = v ((x1 + a) ; (xn + a) (xn;1 + a) ; (xn + a)) :
)) Choosing a = ;xn , we have
U (x) = U (x1 ; xn xn;1 ; xn 0)
= v (x1 ; xn xn;1 ; xn )
Combining the previous lemmas gives the following theorem.
Theorem 3.1.6. If T0 is any location-equivariant estimator, then a necessary and sucient condition for T to be equivariant is that there is a function v of n ; 1 variables such
that
T (x) = T0 (x) ; v (y) for all x
where y = (x1 ; xn xn;1 ; xn ).
Example 3.1.7. If n = 1, then only equivariant estimators are X + c for c 2 R .
62
3. EQUIVARIANCE
Now we can determine the location-equivariant estimator with minimum risk.
Theorem 3.1.8. Let x have a density function f (x ; ) with respect to Lebesque measure
and let y = (y1 yn;1)T where yi = xi ; xn. Suppose that the loss function is given by
L ( d) = (d ; ) and that there exists an equivariant estimator T0 with nite risk. Assume
that for each y there exists a number v (y) = v (y) which minimizes
(3.1.7)
E0 ( (T0 (X) ; v (Y)) jY = y)
then there exists an MRE estimator T of given by
T (x) = T0 (x) ; v (y) :
Proof. If T is equivariant then
T (X) = T0 (X) ; v (Y)
for some v. So to nd the MRE, we need to nd v to minimize
R ( T ) = E ( (T ; ))
and we calculate:
R ( T ) = E ( (T0 (X) ; v (Y) ; ))
= E
Z 0 ( (T0 (X) ; v (Y))) by theorem (3.1.2)
= E0 ( (T (X) ; v (Y)) jY = y) dP0 (y)
Z
E0 ( (T (X) ; v (Y)) jY = y) dP0 (y)
= R (0 T ) :
The risk is nite since R (0 T ) R (0 T0) < 1 by assumption.
Corollary 3.1.9. If is convex and not monotone, then an MRE exists and is unique
if is strictly convex (under the conditions of theorem (3.1.8)).
Proof. Let
(c) = E ( (T0 (X) ; c) jY = y)
and apply theorem (1.4.2).
Corollary 3.1.10. The following results hold:
1. (d ; ) = (d ; )2
) v (y) = E0 (T0 (X) jY = y)
2. (d ; ) = jd ; j
) v (y) =med0 (T0 (X) jY = y)
;
Proof.
1. E0 ( (T0 (X) ; c) jY = y) = E0 (T0 (X) ; c)2 jY = y is minimized at
c = E0 (T0 (X) jY = y).
2. E0 (jT0 (X) ; cjjY = y) is minimized at c = med0 (T0 (X) jY = y).
3.1. EQUIVARIANCE FOR LOCATION FAMILY
63
Example 3.1.11. MRE's can exist also for non-convex . Suppose is given by
1
if jd ; j > c
0 otherwise,
where c is xed. Then for n = 1, using T0 (X ) = X , v will minimize
(d ; ) =
if and only if maximizes
E0 (X ; v) = P0 (jX ; vj > c)
P0 (jX ; vj c) :
If f is symmetric and unimodal then v = 0 and hence
T0 (X ) ; 0 = X is MRE.
On the other hand if f is U-shaped, say f (x) = (x2 + 1) I;LL] where c < L, then
P0 (jX ; vj c) = P0 (v ; c X v + c)
is maximized at v + c = L and v ; c = ;L. Thus there are two MREE's, X ; L + c and
X + L ; c.
Example 3.1.12. Let X1 Xn be iid N ( 2) where 2 is known. Because X is complete sucient and Y = (X1 ; Xn Xn;1 ; Xn) is ancillary, T0 (X) = X is independent
of Y. Therefore, by minimizing the expression (3.1.7), we nd that
;;
v (y) = argmin E0 X ; v :
; If is convex and even, then (v) := E0 X ; v is convex and even. Therefore, v (y) = 0
and X is MRE. (X is also MRE when is the non-convex function of Example 3.1.11.)
Theorem 3.1.13. (Least favorable property of the normal distribution) Let F be the set
of all distributions with pdf relative to Lebesgue measure and with variance 1. Let X1 Xn
be iid with pdf f (x ; ), where = EXi . If rn (F ) is the risk of the MRE estimator of with squared error loss, then rn (F ) has its maximum value over F when F is normal.
Proof. The MREE in the normal case is X with corresponding risk,
rn = E0 X 2 = n1 :
However, X is an equivariant estimator of for all F 2 F and the risk
; E X ; 2 = n1 for all F 2 F :
Therefore,
rn (F ) n1 for all F 2 F :
64
3. EQUIVARIANCE
Remark 3.1.14. From corollary (3.1.10), the MRE in general in the previous theorem is
;
X ; E0 X jY = y :
;
But for n 3, E0 X jY = y = 0 if and only if F is normal. So the MRE for n 3 is X if
and only if F is normal.
Example 3.1.15. Let X1 Xn be iid with
1 ; e;(x;)=b x F (x) =
0
x<
where b is known and 2 R. Then T0 = X(1) is equivariant, complete and sucient fot (CHECK) and T0 is independent of the ancillary statistic
Y = (X1 ; Xn Xn;1 ; Xn) = (X1 ; ; (Xn ; ) Xn;1 ; ; (Xn ; )) :
Therefore X(1) ; v is MRE if v minimizes
;
E0 X(1) ; v :
(In general we have to minimize E0 ( (T0 ; v (y)) jY = y) for each y by Theorem 3.1.8 but
here the complete suciency of T0 and the ancillarity of Y implies that v (y) is independent
of y since the distribution of T0 is the same for all y.)
We now consider some special cases:
; b=n so that MRE is X ; b .
1. If (d ; ) = (d ; )2 , then v = E0 X;(1) =
(1) n
2. If (d ; ) = jd ; j, then v = med0 X(1) = b log 2=n so that MRE is X(1) ; b logn 2 .
(Because FX(1) (x) = 1 ; e;(x;)n=b = 21 implies (x ; ) n=b = log 2.)
3. If
1 if jd ; j > c
(d ; ) = 0 if jd ; j c
;
then v is the center of the interval I of length 2c which maximizes P0 X(1) 2 I so
that v = c and the MREE is X(1) ; c.
Theorem 3.1.16. (Pitman Estimator) Under the assumptions of Theorem 3.1.8, if
L ( d) = (d ; )2 ,
R uf (x ; u x ; u) du
T = R f (x 1; u x n; u) du
1
n
is the MREE of .
Remark 3.1.17. This coincides with the Bayes estimator corresponding to an improper
at prior for the location parameter. (i.e. the conditional expectation of given X = x
under the assumed joint "density" f (x ; 1) of X and .)
3.1. EQUIVARIANCE FOR LOCATION FAMILY
65
Proof. Corollary (3.1.10) implies
T (X) = T0 (X) ; E0 (T0 jY)
where T0 is any equivariant estimator. Let T0 (X) = Xn. Because
2 Y1 3 2 1 0 ;1 3 2 X1 3
66 ... 77 = 66 ... . . . ... ... 77 66 ... 77
4 Yn;1 5 4 0 1 ;1 5 4 Xn;1 5
Xn
Xn
0 0 1
we have
fY1 Yn;1 Xn (y1 yn;1 xn) = fX1 Xn (y1 + xn yn;1 + xn xn)
and
R xf (y + x y + x x) dx
E0 (XnjY = y) = R f (y 1+ x y n;1+ x x) dx
=
=
R xf (x1 ; x + x n; 1 x ; x + x x) dx
R f (y 1; x n+ x x n;1; x n+ x x) dx
R1uf (xn ; u xn;1 ; un x ; u) du
xn ; R f (x 1; u x n;1; u x n; u) du :
1
for T n;1
n
Substituting in the expression
completes the proof.
Example 3.1.18. Suppose f (x) = I(;1=21=2) (x) and X1 Xn are iid with density
f (x ; ) = I(;1=2+1=2) (x). Then
1 if ; 1 x and x 1
(1)
(n) 2
2
f (x1 xn) = 0
otherwise
and
1 if u ; 1 x and x u + 1
(1)
(n)
2
2 :
f (x ; u1) = 0
otherwise
Therefore, since x(n) ; x(1) 1,
T (x) =
=
R x + udu
x ;
R xn + du
xn;
1 ;x + 1 2 ; ;x ; 1 2
(1)
(
n
)
2
2
2
1
2
1
( ) 2
1
(1) 2
1
( ) 2
(1)
x(1) ; x(n)
;
= 1 x(1) + x(n) :
2
66
3. EQUIVARIANCE
Remark 3.1.19. UMVU vs MRE
1. MRE estimators often exist for more than just convex loss functions
2. For convex loss functions MRE estimators generally vary (unlike UMVUE's) with the
loss function.
3. UMVUE's are frequently inadmissible (i.e. can be uniformly improved). The Pitman
estimator is admissible under mild assumptions.
4. The principal application of UMVUE's is to exponential families.
5. For location problems, UMVUE's typically do not exist.
6. An MRE estimator is not necessarily unbiased. The following lemma examines this
connection.
Lemma 3.1.20. Suppose L (d ) = (d ; )2 , and f (x ; 1) 2 R , are densities with
respect to Lebesgue measure.
1. If T is equivariant with bias b, then T ; b is equivariant, unbiased, with smaller risk
than T .
2. The unique MRE estimator is unbiased.
3. If there exists an UMVU estimator and it is equivariant, then it is MRE.
Proof.
1. It is clear that T ; b is equivariant and unbiased. For the smaller risk
part,
; R ( T ; b) = R (0 T ; b) = E0 (T ; b)2 = VarT E0 T 2 :
2. The MRE estimator is unique by corollary (3.1.9). It is unbiased by (1), otherwise its
risk could be improved by using the equivariant estimator T ; b.
3. The UMVUE is the unique MR estimator in U . If it falls in E it is the MR estimator
in U \E . But the MREE is the MR estimator in U \E since it is necessarily unbiased.
Hence they are the same.
Definition 3.1.21. An estimator T of g () is risk-unbiased if
E L ( T ) E L (0 T ) for all 6= 0:
i.e. T is closer! to g () on average than to any false value g (0 ).
Example 3.1.22. (mean unbiasedness) If L ( d) = (d ; g ())2 and T is risk-unbiased,
then
E (T ; g ())2 E (T ; g (0 ))2 for all 6= 0:
This means (assuming that E T 2 < 1 and E T 2 g ()) that E (T ; g (0 ))2 is minimized
by g (0 ) = E T .
Hence g () = E T for all i.e. T is unbiased for g in the sense dened in chapter 2.
3.2. THE GENERAL EQUIVARIANT FRAMEWORK
67
Example 3.1.23. (Median unbiasedness) If L ( d) = jd ; g ()j and T is risk-unbiased,
then
E jT ; g ()j E jT ; g (0)j for all 6= 0:
But the right hand side is minimized when g (0) = med T .
Hence
(3.1.8)
g () = med T for all :
(assuming that E jT j < 1 and there exists a med T in g () for all .) An estimator
satisfying equation (3.1.8) is said to be median-unbiased for g ().
Theorem 3.1.24. Suppose X has density f (x ; 1) with respect to Lebesgue measure.
If T is an MRE estimator with L ( d) = (d ; ) then T is risk-unbiased.
Proof. By Theorem 3.1.2,
E (T ; ) = E0 (T ) :
If 6= 0, then T ; 0 is equivariant and by denition of MRE, we have
E0 (T ) E0 (T ; 0) for all 0
) E0 (T ) E (T ; 0 ; ) for all 0
) E0 (T ) E (T ; 0) for all 0
) E (T ; ) E (T ; 0) for all 0:
3.2. The General Equivariant Framework
Notation
X : data.
: parameter space.
P : = fP : 2 g.
G : a group of measurable bijective transformations of X ! X .
Remark 3.2.1. The operation associated with the group G is function composition. i.e.
fg (x) = f g (x) = f (g (x)) :
Definition 3.2.2. We say that g leaves P invariant if for all 2 , there exists 0 2 such that
and there exists 2 such that
X P ) gX P0
X P ) gX P :
68
3. EQUIVARIANCE
If C is a class of transformations which leave P invariant then
G (C ) = g1
1g2
1 gm
1 : gi 2 C m 2 N is a group (the group generated by C ), each of whose member leaves P invariant.
If each member of a group G leaves P invariant we say that G leaves P invariant. If G
leaves P invariant and P 6= P0 for 6= 0 then there exists a unique 0 2 such that
X P ) gX P0 :
This denes a bijection g : ! , via the relation
g () = 0:
where Pg() is the distribution of g (X) under .
Definition 3.2.3. Under the preceding conditions we dene
G := fg : g 2 Gg :
It is clear that G is also a group.
Remark 3.2.4. For g 2 G ,
E f (gX ) = Eg()f (X )
since
Z
E f (gX ) = f (g (x)) P (dx)
=
Z
f (x) P g;1 (dx)
Z
= f (x) Pg() (dx)
= Eg()f (X ) :
Equivariant Estimation
Let T : X ! D be an estimator of h (). Instead of observing X , suppose we observe
X 0 = g (X )
where X 0 is a sample from Pg() . Suppose that for any g 2 G , h (g) depends on only
through h (), i.e.
(3.2.1)
h (1 ) = h (2 ) ) h (g1 ) = h (g2 ) :
Then we denote
gh () = h (g) :
It is clear that
G := g : g 2 G
3.2. THE GENERAL EQUIVARIANT FRAMEWORK
69
is a group and each g is a 1 to 1 mapping from H = h () to H.
Definition 3.2.5. Under the conditions prescribed for existence of the groups of mappings G and G , if
L (g gd) = L ( d) for all g 2 G
(3.2.2)
then we say that L is invariant under G . (g if we remove 'for all g 2 G '). If conditions
(3.2.1) and (3.2.2) hold, we say that the problem of estimating h () is invariant under
the group of transformations G (under g if we remove for all g 2 G !). An estimator
T (X) of h () is said to be equivariant under G if
gT (X) = T (gX) for all g 2 G :
(3:2:3)
If (3.2.2) and (3.2.3) hold and T (X) is a good estimator of h () based on X then T (gX)
will be a good estimator of g(h ()) based on g(X).
Example 3.2.6. (Location parameter) Let h () = and g (X) = X + a.
X ! X + a1
X + a P+a ! g = + a
h (g) = + a g(h()) = h() + a:
Then, the problem of estimating is location-invariant if
L(g gd) = L ( + a d + a) = L ( d) :
An estimator of h() is equivariant if
T (X + a1) = T (X) + a
P
n
;
1
e.g. X1, median(X), n
i=1 Xi , etc.
Theorem 3.2.7. If T is equivariant and g leaves P invariant and
then
where
Proof.
L ( d) = L (g gd)
R ( T ) = R (g () T ) for all R ( T ) := E L ( T (X)) :
E L ( T (X)) =
=
=
=
E L (g () gT (X))
E L (g () T (g (X)))
Eg() L (g () T (X))
R (g () T ) :
70
3. EQUIVARIANCE
Corollary 3.2.8. If G is transitive over (i.e. if 1 2 2 and 1 6= 2 then there
exists g 2 G such that g (1 ) = 2 ) then R ( T ) is constant for every equivariant T .
Proof. Fix 0 2 . If 6= 0 , there exists g such that g (0 ) = . Hence
R (0 T ) = R (g (0 ) T ) = R ( T ) :
P = fP : 2 g is invariant relative to a transitive group of transformations if and only if P is a group family (see TPE1 section 1.4.1). These are families
generated by subjecting a r.v. with xed distribution to a group of transformations. We can
then index P using G . Thus
P = Pg : g 2 G
Remark 3.2.9.
where = g (0 ).
Theorem 3.2.10. Suppose G is transitive and G is commutative. If T is MRE, then T
is risk unbiased.
Proof. Let T be MRE. For 6= 0 , there exists g such that g (0 ) = . Consequently,
;
E (L (0 T (X))) = E L g;1 () T (X)
= E L ( gT (X))
E L ( T ) :
(If T is equivariant then so is gT since
h gT = gh T = gT (h)
by commutativity.)
Example 3.2.11. Suppose X N ( 2 ), = ( 2 ), and we want to estimate h () =
.
1. Let G1 = fg : gx = x + c c 2 R g , then P is invariant under G1 . Tx = x + c, c 2 R ,
are the only equivariant functions since
T (x + a) = T (x) + a for all a
) T (x+aa);T (x) = 1
for all a
0
)
T (x) = 1
)
T (x) = x + c:
)
TPE stands for Theory of Point Estimation, E.L. Lehmann and George Casella, Springer Texts in
Statistics, 1998.
1
3.2. THE GENERAL EQUIVARIANT FRAMEWORK
71
(a) Suppose L (d ) = (d ; )2 =2 (squared loss function measured in units of ).
Then X is MRE under G1 , i.e.
2
E L ( X ) = E (X ;2 )
= 1 (
)
2
(
T
;
)
= min E0 2 : T equivariant with respect to G1 :
Because
;;
;
g 2 = + c 2
G 1 is not transitive and X is not risk unbiased. (For xed, choose 0 = ( 102).
Then
2
1 < E L ( X ) :)
E L (0 X ) = E (X10;2 ) = 10
Here, G is not transitive since there exists no g such that g ( 102) = g ( 2) :
(b) Suppose L0 (d ) = (d ; )2 . X is MRE and risk-unbiased since
E L0 (0 X ) = E (X ; 0)2 E (X ; )2 for all 6= 0:
G transitive is therefore not necessary in theorem (3.2.10)
2. Let G2 = fg : gx = ax + c a > 0 c 2 R g . Then, G 2 = fg : g ( 2) = (a + c a2 2)g
since gX N (a + c a22 ). Also h () = so that h (g ()) = h (a + c a2 2) =
ah () + c . Thus,
gh () = a + c and G2 = fg : g (d) = ad + cg :
P is invariant under G2 and T (x) = x is equivariant under G2 since
Tgx = T (ax + c) = ax + c = g (T (x))
and
g(Tx) = g(x) = ax + c:
X is MRE under G1 G2 and is therefore MRE under G2. There exist no G1-equivariant
estimators with smaller risk so there exist no G2-estimators with smaller risk since G2equivariance is a more severe restriction than G1-equivariance. But X is not risk
unbiased with respect to L (d ) = (d ; )2 =2 . G 2 is transitive in this case but G2
is not commutative. So, theorem (3.2.10) cannot be applied here.
72
3. EQUIVARIANCE
Suppose
3.3. Location-Scale Families
X = (X1 Xn
)t
;nf
x ; x ; n
1
where f is known. The density is with respect to Lebesgue measure and
= ( ) 2 = R R +
where is the location parameter and is the scale parameter. Our objective here is
to estimate the location parameter. Estimation of is discussed in TPE pp.167-173. We
will consider this later.
Suppose
X = Rn = R R+ D = R:
Dene gab : R n ! Rn by
gab (x) = (ax1 + b axn + b) :
Then gab : ! is dened by
gab () = (a + b a)
: R ! R is dened by
and gab
(d) = ad + b:
gab
since gh () = g = h(g) = a + b.
Lemma 3.3.1. L ( d) is invariant under G if and only if
d ; L (( ) d) = (i.e. if L is a function of error measured in terms of ).
; Proof. () If d; = L (( ) d), then
L (g gd) = L((a + b a) ad + b)
; a ; b
= ad + b a
= L ( d) :
)) We need to show if d and 0 0 d0 satisfy d; = d0;00 and if L is invariant then
L (( ) d) = L ((0 0) d0) :
3.3. LOCATION-SCALE FAMILIES
But this holds since
8 0
>
<d = 0 d 0 + ; 0 0
= >
: = 00 0 + ; 0 0
73
and L is invariant under this transformation by assumption.
P is invariant under each g 2 G (check!)
T is equivariant () gT (x) = Tg (x)
() aT (x) + b = T (ax + b) :
Since G is transitive (given any ( ) we can nd a b such that (a + b a) = (0 0) every
equivariant estimator of has constant risk by corollary (3.2.8).
Proposition 3.3.2. If T0 is an equivariant estimator of and , satises 1 6= 0 and
1 (aX + b) = a1 (X) for all a > 0 b 2 R
then T is equivariant if and only if
T (X) = T0 (X) ; W (Z) 1 (X)
for some W where
X ;X
X
1
n
n;2 ; Xn Xn;1 ; Xn
Z = (Z1 Zn;1) = X ; X X ; X jX ; X j :
n;1
n
n;1
n
n;1
n
Note 3.3.3.
1. We have assumed that Xi 6= Xj for all i 6= j . This is ok since
fX : Xi = Xj for some i 6= j g has measure 0.
2. Z is ancillary for so if T is a function of a CSS then T is independent of Z.
Proof.
1. T is equivariant (i.e. T (ax + b) = aT (x) + b) () U := T ;1T0 satises
U (ax + b) = U (x) for all a > 0 b 2 R . (check the details - simple algebra)
2. If U = T ;1T0 = W (Z) for some W then it is easy to see that U (ax + b) = U (x) and
hence by (1), T is equivariant.
3. If T is equivariant then U (ax + b) = U (x) for all a > 0 b 2 R . Setting
a = jX 1; X j b = ; jX X;n X j
n;1
n
n;1
n
) U (X) =
X ;X
X
1
n
n;1 ; Xn
U jX ; X j jX ; X j 0
n;1
Zn;1 nZ
n
= U
1
Zn;2 Zn;1 0
n;1
Zn;1
=: V (Z1 Zn;1) :
74
3. EQUIVARIANCE
G is transitive so by corollary (3.1.9), an equivariant estimator has constant risk.
Theorem 3.3.4. Suppose T0 is equivariant with nite risk and 1 is dened as in Propo-
sition 3.3.2. If
E(01) ( (T0 (X) ; W (Z) 1 (X)) jZ)
is minimized when W (Z) = W (Z) then
T (X) = T0 (X) ; W (Z) 1 (X)
is MRE.
Proof. For any equivariant T = T0 ; W (Z) 1 ,
R ((0 1) T ) = E(01)
T (X) ; 0 1
;
= E(01) E(01) ( (T0 (X) ; W (Z) 1 (X)) jZ)
E(01) ;E(01) ( (T0 (X) ; W (Z) 1 (X)) jZ)
= E(01) (T )
= R ((0 1) T ) :
Since every equivariant estimator has constant risk, T is MRE (for all ).
Example 3.3.5. Suppose X1 Xn are iid with common pdf with respect to Lebesgue
measure,
1 f x ; = 1 e; x; I
(1) (x)
and
d ; d ; 2
:
L (( ) d) = = P ;
Let T0 = X(1) 1 (X) = ni=2 X(i) ; X(1) , Zi = XXn;i;1 ;XXn n (i = 1 n ; 2), Zn;1 =
Xn;1 ;Xn
jXn;1 ;Xn j .
;X is complete sucient for
We rst show that T0, 1 , and Z are ;independent.
(1) 1
( ) and Z is already
; ancillary so that X(1) 1 is independent of Z by Basu's theorem.
Also nX(1) (n ; 1) X(2) ; X(1) X(n) ; X(n;1) are iid E (1) so that X(1) is independent
of
;
;
1 = (n ; 1) X(2) ; X(1) + + X(n) ; X(n;1) . Thus, X(1) 1, and Z are independent.
Consequently,
;
E(01) (T0 (X) ; W (Z) 1 (X))2 jZ
= E(01) T02 ; 2W (Z) E(01) T0 E(01) 1 + W (Z)2 E(01) 12
3.3. LOCATION-SCALE FAMILIES
is minimized with respect to W (Z) if W (Z) = W (Z) where
E TE W (Z) = (01)E 0 (021) 1
(01) 1
1 (n ; 1)
n
=
n ; 1 + (n ; 1)2
= n12
since 1 ; (n ; 1 1).
Hence the MRE estimator of is
n ;
X
X(i) ; X(1) :
T = T0 ; W 1 = X(1) ; n12
Note 3.3.6.
2. Because
i=2
1. T is not unbiased and the bias depends on 2 since
E() T (X) = E(01) T (X + )
= + E(01) T (X)
1 n ; 1
= + n ; n2
= + n2
R (
T )
= E(01)
=
where
we have
T ; 0 2
1 2
n2
1
+ Var(01)T Var(01) T = Var(01) X(1) + n14 Var(01) 1
= n12 + n n;4 1
3. The UMVUE of is
R ( T ) = n n+3 1 :
n ;
X
T (X) = X(1) ; n (n1; 1)
X(i) ; X(1)
2
75
76
3. EQUIVARIANCE
with corresponding
R ( T ) = E(01) T 2
= n12 + 2 n ; 1 2 + 02 = n (n1; 1)
n (n ; 1)
n
+
1
> n3 = R ( T ) :
; Estimation of r for some constant r. Assume L ( d) = dr and P is invariant
under G = fg : gx = ax + bg. Dene gab : R n ! R n by
gab (x) = ax + b = (ax1 + b axn + b) :
Then gab : ! is dened by
gab () = (a + b a)
and
h () = h ;g () = ar r = ar h ()
gab
ab
,where
h () = h (( )) = r :
Thus,
(d) = ar d:
gab
We can check invariance of loss function (i.e. L (g gd) =? L ( d)) since
ar d d L (g g d) = ar r = r = L ( d)
as required. Thus, the condition for equivariant of T is
T (g (x)) = gT (x)
or
T (ax + b) = ar T (x) :
Proposition 3.3.7. Let T0 be any positive equivariant estimator of r . Then T is equivariant if and only if
T (X) = W (Z) T0 (X)
for some W , where Z is dened on proposition (3.3.2) .
Proof. If T = W (Z) T0 (X), then
T (aX + b) = W (Z) ar T0 (X)
= ar T (X) :
3.3. LOCATION-SCALE FAMILIES
Conversely, if T is equivariant, then
77
)
U (X) := TT ((X
X)
0
satises U (aX + b) = U (X) for all a > 0, b 2 R and so by proposition (3.3.2),
U (X) = W (Z)
for some W .
Theorem 3.3.8. Let T0 be a particular equivariant estimator of r with nite risk. Suppose
E(01) ( (W (Z) T0 (X)) jZ)
is minimized when W (Z) = W (Z). Then
T (X) = W (Z) T0 (X)
is MRE.
Proof. If T is any equivariant estimator then
R ( T ) = R ((0 1) T )
= E(01) (W (Z) T0 (X))
;
= E(01) E(01) ( (W (Z) T0 (X)) jZ)
E(01) ;E(01) ( (W (Z) T0 (X)) jZ)
= E(01) (T )
= R ( T ) :
Example 3.3.9. Let X1
Xnbe iid with density 1 e;(x;)= I(1) (x) and suppose we
wish to estimate by minimizing the risk under the (invariant) squared fractional error loss
function,
d d 2
L(( ) d) = = ; 1 :
Let
T0 (X) =
n X
i=1
n ;
X
X(i) ; X(1)
X1 ; X(i) =
i=2
which is independent of Z (by the argument in example (3.3.5).)
T0 is equivariant (T (aX + b) = aT (X)) and
78
3. EQUIVARIANCE
;
E(01) E(01) ( (W (Z) T0 (X)) jZ)
; ;
= E(01) E(01) (W (Z) T0 (X) ; 1)2 jZ
= W (Z)2 E(01)T0 (X)2 ; 2W (Z) E(01) T0 (X) + 1
which is minimized when
E T (X)
W (Z) = W (Z) = (01) 0 2 = (nn;;1)1 n = n1 :
E(01) T0 (X)
Hence,
T (X) = W (Z) T0 (X)
n ;
X
1
= n
X(i) ; X(1)
i=1
is MRE.
Note 3.3.10.
E() T (X) = E(01)T (X + )
= E(01)T (X)
= n ; 1 6= :
n
CHAPTER 4
Average-Risk Optimality
4.1. Bayes Estimation
The main factor contributing to the recent explosion of interest in Bayes estimation is its
ability to handle extremely complicated practical problems. Some other factors which make
Bayes estimation attractive are as follows.
1. The mathematical structure is very nice.
2. It permits the incorporation of prior information (although there is a lot of debate
about how this should be done).
3. It provides a systematic approach to the determination of minimax estimators.
In the Bayesian framework, we consider the parameter and observation vectors to be
jointly distributed on X . We shall suppose that the parameter vector has the marginal
distribution and that the conditional distribution of the observation vector X given = is P . For any particular valueR of , we dene the risk of the estimator T 0 at as
R( T 0) := E L( T 0)j = ] = X L( T 0(x))dP (x) as before.
Definition 4.1.1. For any estimator T 0 , the integral
Z
is called the Bayes
of T 0 is thus
R (
T 0) d
() =
Z Z
X
risk of T 0 with respect
L( T 0(x))dP (x)d ():
to the prior distribution . The Bayes risk
E (R( T 0)) = E (L( T 0)):
An estimator T is a Bayes estimator with respect to the distribution if
Z
R ( T ) d () = inf
T0
Z
R ( T 0) d () :
Theorem 4.1.2. Suppose and X given = has distribution P . Then if there
exists T () which minimizes
E (L (( T (X))) jX = x) =
Z
L ( T (x)) dP (jX = x)
for each x, where P ( jX = x) is the conditional (or posterior) distribution of given X = x,
then T (X) is Bayes with respect to .
79
80
4. AVERAGE-RISK OPTIMALITY
Proof. For any estimator T 0 ,
E (L ( T 0 (X)) jX) E (L ( T (X)) jX)
and so, taking expectations of each side,
EL ( T 0 (X)) EL ( T (X)) :
Example 4.1.3.
1. Let L ( d) = (d ; g ())2 . The Bayes estimator T of g () minimizes
;
E (T (X) ; g ())2 jX = x 8 x:
Hence
T (x) = E (g () jX = x)
which is the posterior mean.
2. Let L ( d) = jd ; g ()j. The Bayes estimator T of g () minimizes
E (jT (X ; g ())jjX = x) 8 x:
Hence
T (x) = med (g () jX = x)
which is the posterior median.
3. Let L ( d) = w () (d ; g ())2 . The Bayes estimator T of g () minimizes
;
E w () (T (X) ; g ())2 jX = x 8 x:
It can therefore be obtained by solving
2
d R
dT (x) (T (x) ; g ()) w () dP (jX = x)
R 2 (T (x) ; g ()) w () dP (jX = x) = 0:
=
Hence
g () jX = x) :
T (x) = E (Ew (()
w () jX = x)
Example 4.1.4. Suppose X bin (n p) and p B (a b). Thus,
d (p) = (;;((aa) +; (bb))) pa;1 (1 ; p)b;1 dp 0 < p < 1 a b > 0:
4.1. BAYES ESTIMATION
81
The posterior distribution of p given X = x is B (a + x b + n ; x) since
f (xjp) f (p)
fP jX (pjx) = X jP f (x) P
;npx (1X ; p)n;x Kpa;1 (1 ; p)b;1
= x
fX (x)
= c (x) pa+x;1 (1 ; p)b+n;x;1
which is a beta density. (Without doing any calculations it is clear that c (x) must be
; (a + b + n) = (; (a + x) ; (b + n ; x)).)
Let L (p d) = (d ; p)2. Then the Bayes estimator of p is
Z1
T (x) = pfP jX (pjx) dp = a +a +b +x n
0
a + n x
:
= a +a +b +b n a +
b a+b+n
n
| {z }
prior mean
|{z}
the usual etimator
Thus T (X ) ; X=n ! 0 a.s. as n ! 1 with a and b xed and as a + b ! 0 with n xed.
Clearly X=n is not a Bayes estimator for any beta prior (i.e. for any a > 0 and b > 0).
However if is concentrated on the two-point set f0 1g then X=n is Bayes as the following
argument shows. If P (P = 1) = 1 ; and P (P = 0) = , then X is either 0 or n with
probability 1 and
(
=1)
1;
P (P = 1jX = n) = P (PX(=XnP
=n) = 1; = 1 :
P (P = 0jX = n)
=0
Hence the Bayes estimator satises T (n) = 1 and a similar argument shows that T (0) =
0. Hence T (x) = x=n x = 0 n. Notice that this two-point distribution is the limit in
distribution of Beta(a b) as a + b ! 0 with a=(a + b) = .
Theorem 4.1.5. Let L ( d) = (d ; g ())2 . Then no unbiased estimator T of g () can
be Bayes unless
E (T (X) ; g ())2 = 0
i.e. unless T (X ) = g () with probability 1 and the Bayes risk of T is zero.
Proof. If T is Bayes with respect to some and unbiased for g () then
E (T (X) j) = g () and
E (g () jX = x) = T (x)
82
4. AVERAGE-RISK OPTIMALITY
Hence
and so
E (T (X) g ()) =
=
=
=
E (E (T (X) g ()) jX)
ET (X)2
E (E (T (X) g ()) j)
Eg ()2
E (T (X) ; g ())2 = ET (X)2 ; 2E (T (X) g ()) + E (g ())2
= 0:
Xn be iid N ( 2) with 2 known and suppose that N ( 2) with 2 known. Then the joint density of and X is
Pn
fX ( x) = ; p1 n e; (xi;) p1 e; (;)
Example 4.1.6. Given = , let X1
2
1
2 2
1
2
2
1
2 2
2
and the posterior density of is
fjX (jx) = ffX ((x)x)
X P x 2 n
2
= c (x) exp ; 2 + 2 i ; 2 + 2
2
2 nx + 2 2
n
1
2 + 2
2 2 :
= N n 2 + 2
For squared error loss, the Bayes estimator of is
;2
;2 = T (X)
E (jX) = n;n
X
+
2 + ;2
n;2 + ;2
and
;
Var (jX) = n;2 1+ ;2 = E ( ; T (X))2 jX
r = ER ( T) = E ( ; T (X))2 = n;2 1+ ;2 :
For large n, the Bayes estimator is close to X in the sense that T (X) ; X ! 0 a.s. as
n ! 1 with and xed. Also T (X) ; X ! 0 as ! 1 with n and xed. However, X is
not Bayes since the prior probability distribution N ( 2 ) does not converge to a probability
measure as ! 1.
4.1. BAYES ESTIMATION
83
However one can formally obtain X as a Bayesian estimator with respect to the improper
prior distribution, (d) = d. Suppose
Pn
p (xj) = ; p1 n e; (xi;)
1
2 2
1
2
2
with 2 known. Setting (d) = d we nd that the joint density of (X ) with respect to
Lebesgue measure is
p (x ) = p (xj) :
The posterior distribution of is therefore
n !!
X
1
p (jx) = k (x) exp ; 22 n2 ; 2 xi
i=1
n
= k (x) exp ; 22 ( ; x)2 :
Hence,
2 p (jx) = N x n
and so X is the generalized Bayes estimator of with respect to L ( d) = (d ; )2 and
the improper Lebesgue prior (d) = d for . The improper prior densities I(;1<1)
and I01) are frequently used to account for total ignorance of parameters with values in R
and R + respectively.
Conjugate Priors
If there exists a parametric family of prior distributions such that the posterior distribution
also belongs to the same parametric family family, then the family is called conjugate.
Example 4.1.7. Conditional on 2 = 2 , let X1 Xn be iid N (0 2 ) and dene
! = (22 );1.
Pn
fXj (xj ) = c r e; 1 x2i where r = n2 :
; and let the prior distribution for ! be ; g 1 with density
g
( ) = ; (g) g;1e; 0:
We note that
; E (!) = g
E !2 = g (g+2 1)
2
; ; E !;1 = g ; 1 E !;2 = (g ; 1)(g ; 2) :
84
4. AVERAGE-RISK OPTIMALITY
Then, the posterior becomes
P
fjX ( jx) = c (x) r+g;1e;( x2i +)
= density at of ; r + g P x12 + i
so that the gamma distribution family is a conjugate family for the normal distribution.
If the loss is squared error then the Bayes estimator of 2 = (2 );1 is
P x2
Z 1
+
i
f
jX ( jx) d =
2
2 (r +Pg ; 1)
2
= n++2g ;xi2 :
As ! 0 with g = 1, the prior density/ converges pointwise to the improper prior density
I01) and the Bayes estimator T (X) satises
T (X) ; ni=1 Xi2=n ! 0 a:s::
4.2. Minimax Estimation
Definition 4.2.1. A statistic T is said to be minimax if T satises
min
sup R ( T 0) = sup R ( T ) :
T0
2
We have seen that many estimation problems allow the determination of UMVU, MRE
or Bayes estimators. Minimax estimators however are usually much harder to nd.
A minimax estimator minimizes the maximum risk. i.e., a minimax estimator minimizes
the risk in the worst case. This suggests a possible connection with Bayes estimation under
the worst possible prior.
Given on , let r denote the Bayes risk of the Bayes estimator T, i.e.
r :=
Z
R ( T) d () :
Definition 4.2.2. A prior distribution
is least favorable if
r r0 for all other priors 0:
Theorem 4.2.3. Suppose a prior
satises
r = sup R ( T) :
Then
1. T is minimax.
2. If T is the unique Bayes estimator under , then it is the unique minimax estimator.
3. is least favorable.
4.2. MINIMAX ESTIMATION
Proof.
85
1. For any estimator T ,
sup R ( T ) 2
Z
Z
R ( T ) d ()
R ( T) d () (since T is Bayes for :)
= sup R ( T ) by hypothesis.
2
2. If T is the unique Bayes solution then the second inequality in (1) becomes strict.
(i.e., !>.)
3. If 0 is another prior distribution then
r0 =
Z
R ( T0 ) 0 (d)
Z
R ( T) d 0 () since T0 is Bayes for
sup R ( T )
0
= r by hypothesis.
Corollary 4.2.4. If T has constant risk then it is minimax.
Proof. If R ( T ) is independent of then
Z
R ( T) d () = sup R ( T) :
Example 4.2.5. Let X bin (n p), L ( d) = (d ; )2 , and = p. We shall show below
that X=n is not minimax. Let us use Corollary 4.2.4 to derive a minimax estimator. Suppose
P B (a b). Then we have from example (4.1.4) that
fP jX (pjx) = c (x) pa+x;1 (1 ; p)b+n;x;1 and
T = a +a +b +x n :
Thus we obtain
1
R (p T) =
Ep (a + X ; p (a + b + n))2
2
(a + b + n)
2
npq
+
(
qa
;
pb
)
=
:
(a + b + n)2
86
4. AVERAGE-RISK OPTIMALITY
Now, choose a and b to make the risk independent of p. Since the coecient of p2 is
;n + (a + b)2 and the coecient of p is n ; 2a (a + b), setting these equal to zero gives
pn
a=b= 2 :
Hence
p
x + 2n
T0 = n + pn
is Bayes with constant risk and is therefore minimax.
p p Since T0 is the unique Bayes solution with respect to 0 B 2n 2n , it follows from
theorem 4.2.3 part (2) that T0 is the unique minimax estimator of p with respect to squared
error loss and the risk is
2
r0 = E
(
T
0 ; p)
Z
= R (p T0 ) 0 (dp)
Z n
1
p
(dp)
=
(n + n) 2 4 0
1
=
p = R (p T0 )
4 (1 + n)2
The risk of the usual estimator with squared error loss, T (X ) = X=n, is
X 2 p (1 ; p)
R (p T ) = Ep n ; p = n :
Thus
R (p T0 ) < R (p T ) for 21 ; cn < p < 21 + cn
1 R (p T0 ) > R (p T ) for p ; 2 > cn:
cn ! 0 as n ! 1 but cn is larger for small n. For n = 1,
(1
T0 (x) = 43 if x = 0
4 if x = 1:
Remark 4.2.6. Squared error loss might not be best for estimating p. The penalty
should perhaps be larger at the endpoints p = 0 and p = 1. If we use
2
2
(
d
;
p
)
(
d
;
p
)
L (p d) = pq = p (1 ; p)
then X=n has constant risk and is Bayes with respect to U (0 1) prior (check!).
4.2. MINIMAX ESTIMATION
87
Definition 4.2.7. Let n be a sequence of priors, and suppose
rn =
Z
where Tn is Bayes with respect to
favorable if
R ( Tn) d n () ! r as n ! 1
n
for each n: We say that the sequence f ng is least
r r for all
i.e. if the limit of the minimum Bayes risk is at least as bad as the minimum Bayes risk for
any prior. (Compare Denition 4.2.2.)
Theorem 4.2.8. If there exists an estimator T and a sequence of prior distributions
f ng such that
sup R ( T ) = nlim
r
!1 n
then
1. T is minimax (but not necessarily unique.)
2. f ng is least favorable.
Proof.
1. If T 0 is any estimator, then
sup R (
and hence
Z
R ( T 0) d
rn 8n
T 0)
n
sup R ( T 0) lim rn = sup R ( T ) :
2. If is any prior, then
r =
Z
R ( T) d ()
Z
R ( T ) d ()
sup R ( T )
= nlim
r :
!1 n
Example 4.2.9. Suppose that X1
Xn iid N ( 2 ).
88
4. AVERAGE-RISK OPTIMALITY
1. 2 known:
N ( 2 ) = . In example (4.1.6), we saw that the Bayes estimator for squared
error loss is
;2
;2 E (jX) = n;n
X
+
2 + ;2
n;2 + ;2
with
r = E (T ; )2
;
= E E ( ; E (jX))2 jX
= E Var
(jX1 ) = E n;2 + ;2 :
As ! 1,
2
; r ! n = R X :
Hence, X is minimax (for squared error loss) by theorem (4.2.8).
2. 2 unknown:
Since sup2 R (( 2) T ) = 1 for T (X) = X and X is minimax for each xed , we
restrict 2 to satisfy 2 m < 1. If T is minimax on 2 R , 2 m then
;; 2 T sup R ;; 2 X = sup R ;; 2 X :
sup
R
2
2
2
=m
m
=m
But X is minimax on 2 = m, hence this is an equality. Hence X is minimax on
2 R 2 m. Although the restriction 2 m was necessary to make the minimax
problem meaningful, the minimax estimator X does not depend on m.
4.3. Minimaxity and Admissibility in Exponential families
Given two estimators T T 0, such that
R ( T ) R ( T 0) for all then T is preferable to T 0 on the basis of risk.
Definition 4.3.1. T 0 is inadmissible (with respect to the loss function L) if there exists
T such that
R ( T ) R ( T 0) for all with strict inequality for some . (In this case, we also say that T dominates T 0.)
An estimator is admissible if it is not inadmissible.
In general, it is dicult to determine whether or not an estimator is admissible.
4.3. MINIMAXITY AND ADMISSIBILITY IN EXPONENTIAL FAMILIES
89
Theorem 4.3.2. If T is a uniqueR Bayes estimator with
R respect to some , then T is admissible. (Uniqueness means that if R ( T 0) d () = R ( T ) d () then P (T 0 6= T ) =
0 for all and consequently R ( T ) = E L ( T ) = E L ( T 0) = R ( T 0) for all .)
Proof. If T is not admissible then there exists T 0 such that
R ( T ) R ( T 0) for all and
R ( T ) > R ( T 0) for some :
R
R
Now R ( T ) d R ( T 0) d implies that T 0 is Bayes with respect to . Hence by
uniqueness R ( T 0) = R ( T ) for all and we obtain the contradiction.
Example 4.3.3. Suppose that X1 Xn iid N ( 2 ) with 2 unknown. Let N ( 2) and L ( d)=(d ; )2 . Then
2
2 ()
T = 2 n
X
+
+ n 2
2 + n 2
is the unique Bayes estimator of and is therefore admissible.
This example shows that aX + b is admissible for all a 2 (0 1) and b 2 R since any a 2 (0 1)
and b 2 R can be obtained in (*) by suitable choice of and 2 and hence aX + b is unique
Bayes for some .
Theorem 4.3.4. If X N ( 2 ), 2 is known and L ( d) = (d ; )2 then aX + b is
inadmissible for if
1. a > 1,
2. a < 0, or
3. a = 1 and b 6= 0.
Proof. We rst calculate
R ( aX + b) = E (aX + b ; )2
= E (a (X ; ) + (a ; 1) + b)2
= a2 2 + ((a ; 1) + b)2 :
1. If a > 1,
R ( aX + b) > R ( X ) = 2 for all :
2. If a < 0, then (a ; 1)2 > 1 and
R ( aX + b) ((a ; 1) + b)2
2
b
2
= (a ; 1) + a ; 1
b
> R 0X ; a;1 :
90
4. AVERAGE-RISK OPTIMALITY
3. If a = 1 and b 6= 0, then
R ( X + b) = 2 + b2 > 2 = R ( X ) :
Corollary 4.3.5. Suppose that X1
Xn N ( 2) with 2 known. Then aX + b
is inadmissible if
1. a > 1,
2. a < 0, or
3. a = 1 and b 6= 0,
and admissible if 0 a < 1.
Proof. It remains only to establish admissibility when a = 0. This is trivial since for
the estimator T (X ) = b corresponding to a = 0, the risk at = b is R (b b) = 0 and every
other estimator has positive risk at b since if P (T 0(X ) 6= b) > 0 for some then T 0;1(fbgc)
has positive Lebesque measure and this implies that Pb(T 0(X ) 6= b) > 0 and hence that
Eb (T 0 ; b)2 > 0.
Proposition 4.3.6. Suppose that X1 Xn iid N ( 2 ) with 2 known. Then, X
is admissible.
Proof. We give two proofs.
1. (Limiting Bayes method) Assume without loss of generality that 2 = 1. If X is
inadmissible then there exists T such that R ( T ) n1 for all and R (0 T ) < n1
for some 0 .
(T ; )2 =
Z
Y
(T (x) ; )2 e
n
; 12 (xi ;)2
p
dx
2
i=1
is continuous in a neighborhood of 0 . Hence there exist a and b such that a < 0 < b
and c > 0 such that
R ( T ) < n1 ; c for all 2 (a b) :
Suppose that N (0 2) =: . Then we shall obtain a contradiction by showing
that
Z
2
1
R ( T ) e; 2
2 d
r := ER ( T ) = p
2
is smaller than the minimum Bayes risk for , i.e.
2
r = ER ( T) = n;2 1+ ;2 = n 2 + 1 :
(As in example (4.1.6) ,
2
;
E ( ; T (X))2 jX = n 2 + 1 = Var (jX)
R (
T ) = E
4.3. MINIMAXITY AND ADMISSIBILITY IN EXPONENTIAL FAMILIES
so that
Now
; ;
91
r = E ( ; T (X))2 = E E ( ; T (X))2 jX = n 2 + 1 :)
1
n
1
n
; r =
; r
p1
2
R ; 1 ; R ( T ) e ;
n
1
n
2
2 2
; n +1
2
2
d
2
2 + 1) Z b 1
2
n
(
n
p 2
;
R
( T ) e; 2
2 d
a n
Z
b 2
2
> c n (n + 1) e; 2
2 d
a
where c > 0 is independent of . Since the integral in the last line converges to b ; a
as ! 1 by DCT, the ratio goes to innity as ! 1. Thus, for all suciently large
0 ,
r(
0 ) < r(0 )
which contradicts the fact that r(0) is minimum Bayes risk.
2. (Via the information inequality) If T is any estimator of with nite second moment
under each P , then E T = b () + and
R ( T ) = Var T + b2 ()
0 ())2
(1
+
b
nI () + b2 ()
0 ())2
(1
+
b
2 ( )
+
b
=
n
0
(b () exists by theorem (1.3.13) and we assume without loss of generality = 1.)
Hence if T is risk-preferable to X ,
R ( T ) n1 for all ()
i.e.
2
0
b2 () + (1 + bn ()) n1 for all ()
and so
jb ()j p1n for all ( )
92
4. AVERAGE-RISK OPTIMALITY
and
(1 + b0 ())2 1
) ;2 b0 () 0
) b is non-increasing:
Now, we claim b0 (k ) ! 0 for some sequence k ! 1. If this is not the case, then
lim!1b0 () < 0 and there exists 0 and > 0 such that b0 () < ;" for all > 0 .
Then,
Z
b0 (y) dy + b (0 )
0
( ; 0 ) (;") + b (yn) ! ;1 as ! 1
contradicting
(***) and thus proving the claim. Similarly, there exists j ! ;1 such
;
that b0 j ! 0.
; Now, (**) implies that b j ! 0 and b (k ) ! 0. But since b is non-increasing, this
implies that b () = 0 for all and hence that b0 () = 0 for all . Hence, by the
information inequality, R ( T ) n1 , and so by (*),
; R ( T ) = n1 = R X
so that X is admissible.
b () =
The above argument also shows that X is minimax since there is no estimator whose maximum risk is less than 1=n. In fact, X is the unique minimax estimator by the following
theorem.
Proposition 4.3.7. Suppose that T has constant risk and is admissible. Then T is
minimax. If in addition L( ) is strictly convex, then T is the unique minimax estimator.
Proof. By the admissibility of T , if there is another estimator T 0 with sup R( T 0 ) R( T ) then R( T 0) = R( T ) for all , since the risk of T 0 can't go strictly below that of
T for any . This proves that T is minimax. If the loss function is strictly convex and T 0 is
a minimax estimator such that P (T 0 6= T ) > 0, then if T = 21 (T + T 0),
R ( T ) < 21 (R ( T ) + R ( T 0)) = R ( T )
which contradicts the admissibility of T .
4.3. MINIMAXITY AND ADMISSIBILITY IN EXPONENTIAL FAMILIES
93
Exponential Families (with s = 1). Suppose that the probability density of X with
respect to the -nite measure is
where
Z
eT (x);'()h (x)
eT (x)h (x) d (x) = e'()
Then
E T (X) = '0 () = g () :
Suppose the natural parameter space is an interval with end-points L U , ;1 L U 1 and
L ( d) = (d ; g ())2 :
The same argument used in the proof of theorem (4.3.4) shows that aT + b is inadmissible
for (1) a < 0, (2) a > 1, or (3) a = 1 and b 6= 0.
If a = 0, then aT + b is admissible since g^ () = b is the only estimator with zero risk at
b. To deal with the remaining cases consider
1 T + r
1+
1 + 0 < 1 r 2 R (i.e. 0 < a 1):
Theorem 4.3.8. The estimator
1 T + r 0 < 1 r 2 R
1+
1+
is admissible for g () = 0 () = E T if for some (and hence for all) 0 2 (L U )
Z
and
Proof. Recall that
0
L
Z U
0
e;r+()d = 1
e;r+()d = 1:
'0 () = E T
'00 () = Var (T )
@ log p (x ) 2
I () = E
@
= E (T ; '0 ())2 = '00 () :
94
4. AVERAGE-RISK OPTIMALITY
Suppose there exists (X ) such that
E
We have that
( (X ) ; '0 ())2 E
T + r
1+
2
; '0 ()
for all :
()
E ( (X ) ; '0 ())2 = Var (X ) + b ()
; d (b () + '00 ())2
+ b2 ()
d I ()
2
0
= b2 () + (b ()I+(I) ()) :
(b0 () exists since E j (X )j < 1.) So by (*),
I () + 2 (r ; '0 ())2 b2 () + (b0 () + I ())2 ():
I ()
(1 + )2
(1 + )2
Letting
T
r
0
h () = b () ; 1 + (r ; ' ()) = b () ; bias 1 + + 1 + h0 () = b0 () + 1 + '00 () :
(**) is exactly equivalent to
;h0 () + 1 '00 ()2 00
()
1+
0 h2 () + 2h ()
(r ; '0 ()) +
;
( )
2
00
1+
' ()
(1
+
)
0 2!
0 ()
2
h
= h2 () ; 2h ()
('0 () ; r) +
+ h 00() :
1+
1+
' ()
Letting k () = h () er;'(), (***) becomes
k2 () er;'() + 1 +2 k0 () 0 ( )
) k0 () 0 for all :
Hence, k () is decreasing. To prove
k () 0 for all suppose k (0 ) < 0:Then k () < 0 for all > 0 . From (****), we also have
d 1 = ; k0 () 1 + er;'() for all > ( ):
0
d k ()
2
k ()2
4.3. MINIMAXITY AND ADMISSIBILITY IN EXPONENTIAL FAMILIES
95
Integrating both sides of (*****) from 0 to 1 > 0 ,
Z
1 ; 1 1 + 1 er;'()d:
k (1 ) k (0 )
2 0
As 1 ! U , this integral converges to 1by assumption. But the left hand side is less than
; k(10 ) so that we obtain the contradiction. Hence
k () 0 for all :
Similarly,
Z
0
Hence,
L
er;'() d = 1 ) k () 0 for all :
k () = 0 ) h () = 0 for all ) h0 () = 0 for all :
Thus, h0 () = 0 and h () = 0 implies equality in (***), equality in (**), and nally equality
in (*). (RS of (*)=LS of (**) LS of (*) RS of (**)). Since T1++r has the same risk as ,
T +r is admissible.
1+
Note 4.3.9. The case = 0 is of particular interest, i.e. T is admissible for
E T provided L = ;1 and U = 1.
Example 4.3.10. Suppose X b (n p) and := log 1;p p , ;1 < < 1. Then
n
n
n
;
x
x
p (x ) = x p (1 ; p) = x ex;n log(1;e )
and T (X ) = X is admissible for
0 () = 1 ne
+ e = np
since L = ;1 and U = 1.
Example 4.3.11. Suppose X1 Xn iid N ( 2 ) with 2 known.
P x2 P x n2 1
p (x ) = ; p n exp ; 22i exp 2 i ; 22
P X2
T (X) = 2 i
0 () = n
2 :
T (X) is admissible for
and U = 1.
n
2
R
since e
P xi
2
h (x) (dx) < 1 for ; 1 < < 1 i.e. L = ;1
96
4. AVERAGE-RISK OPTIMALITY
4.4. Shrinkage Estimators
James-Stein Estimator. Let Xi N (i 1) i = 1 s be independent random vari-
ables and let
L ( d) = kd ; k2 =
X
(di ; i)2 :
The usual estimator
T (X) = (X1 Xs)
will be shown to be inadmissible if s > 2.
Let
where S 2 =
Ps X 2. Here
i=1 i
ic = 1 ; c s S;2 2 Xi i = 1 s
0 1c 1
c := @ ... A :
sc
Theorem 4.4.1.
R ( c) = s ; (s ;
2)2 E
2c ; c2 S2
:
To prove the theorem, we need the following two lemmas.
Lemma 4.4.2. If X N (0 1) and g : R
then
! R is absolutely continuous with derivative g0
E jg0 (X )j < 1 ) Eg0 (X ) = E (Xg (X )) :
Proof. Let (x) = (2 ); 2 exp (;x2 =2). Then
1
x (x) = ;0 (x)
():
4.4. SHRINKAGE ESTIMATORS
Then
Eg0 (X ) =
=
=
=
=
Z
97
g0 (x) (x) dx
Z1 Z0
Z0 1
+
Z 1 Z z
0
Z1
0
0
Z 1
;1
g0 (x)
0
g0 (x) (x) dx
x
z (z) dz dx ;
g0 (x) dx
z (z) dz ;
(g (z) ; g (0)) z (z) dz ;
= E (Xg (X )) :
Z0
g0 (x)
Z 0 Z 0
;1
Z0
;1
;1
z
Z x
;1
z (z) dz dx by (*)
g0 (x) dx
z (z) dz by Fubini
(g (0) ; g (z)) z (z) dz
Lemma 4.4.3. Suppose $\mathbf{X} = (X_1, \dots, X_s)$ where $X_1, \dots, X_s$ are independent and $X_i \sim N(\theta_i, v_i)$. Suppose $f : \mathbb{R}^s \to \mathbb{R}$ is absolutely continuous in $x_i$ for almost all $(x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_s)$. Then if
$$E\left|\frac{\partial}{\partial x_i} f(\mathbf{X})\right| < \infty,$$
$$v_i\, E\left[\frac{\partial}{\partial x_i} f(\mathbf{X})\right] = E\left[(X_i - \theta_i) f(\mathbf{X})\right].$$

Proof. Let
$$Z = \frac{X_i - \theta_i}{\sqrt{v_i}}.$$
Then from Lemma 4.4.2,
$$E\left[\frac{\partial}{\partial z} f\left(x_1, \dots, x_{i-1}, \theta_i + \sqrt{v_i}\,z, x_{i+1}, \dots, x_s\right)\Big|_{z=Z}\right] = E\left[Z\,f\left(x_1, \dots, x_{i-1}, \theta_i + \sqrt{v_i}\,Z, x_{i+1}, \dots, x_s\right)\right].$$
Thus,
$$\sqrt{v_i}\, E\left[\frac{\partial f}{\partial x_i}(\mathbf{X})\,\Big|\, X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_s\right] = E\left[\frac{X_i - \theta_i}{\sqrt{v_i}}\, f(\mathbf{X})\,\Big|\, X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_s\right].$$
Taking expectations of each side, we obtain the desired result.
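Both lemmas are easy to check numerically. Below is a minimal Monte Carlo sketch (not part of the original notes) verifying Lemma 4.4.2 for the illustrative choice $g(x) = x^3$, for which $E\,g'(X) = 3E X^2 = 3$ and $E[Xg(X)] = E X^4 = 3$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)

# Lemma 4.4.2 with the illustrative choice g(x) = x^3:
# E g'(X) = 3 E X^2 = 3   and   E[X g(X)] = E X^4 = 3.
g = x**3
g_prime = 3 * x**2
print(g_prime.mean(), (x * g).mean())  # both close to 3
```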
Now, we are ready to prove the theorem.
Proof (of the theorem). Let
$$f_i(\mathbf{X}) = \frac{c(s-2)}{S^2}\, X_i \quad\text{and}\quad f = \begin{pmatrix} f_1 \\ \vdots \\ f_s \end{pmatrix}.$$
Then $\delta^c = \mathbf{X} - f$ and
$$R(\theta, \delta^c) = E_\theta \sum_i (X_i - f_i - \theta_i)^2 = E_\theta \sum_i \left[(X_i - \theta_i)^2 - 2(X_i - \theta_i) f_i + f_i^2\right]$$
$$= s - 2\sum_{i=1}^s E_\theta\left[\frac{\partial}{\partial x_i} f_i(\mathbf{X})\right] + E_\theta\left[\frac{\sum_i X_i^2}{S^4}\, c^2 (s-2)^2\right],$$
using Lemma 4.4.3 (with $v_i = 1$) for the cross terms. Since
$$\frac{\partial}{\partial x_i} f_i(\mathbf{X}) = c(s-2)\,\frac{S^2 - 2X_i^2}{S^4} = c(s-2)\left(\frac{1}{S^2} - \frac{2X_i^2}{S^4}\right),$$
$$R(\theta, \delta^c) = s - 2c(s-2)\, E_\theta\left[\frac{s-2}{S^2}\right] + c^2 (s-2)^2\, E_\theta\left[\frac{1}{S^2}\right] = s - (s-2)^2\, E_\theta\left[\frac{2c - c^2}{S^2}\right].$$
Corollary 4.4.4. For $0 < c < 2$ and $s > 2$,
$$R(\theta, \delta^c) < s \quad\text{for all }\theta,$$
and $\delta^1$ dominates all the other $\delta^c$'s.

Proof. $2c - c^2 > 0$ for all $c \in (0,2)$ and has its maximum value at $c = 1$.

Remark 4.4.5. The James-Stein estimator $\delta^1 = \left(1 - \frac{s-2}{S^2}\right)\mathbf{X}$ has risk $R(\theta, \delta^1) = s - (s-2)^2\, E_\theta\left[\frac{1}{S^2}\right]$. The risk of $\mathbf{X}$ is
$$R(\theta, \mathbf{X}) = E_\theta \sum_i (X_i - \theta_i)^2 = s.$$
Therefore, $\mathbf{X}$ is not admissible for $\theta$. In fact, the James-Stein estimator is not admissible either; a strictly risk-preferable estimator can be arrived at as follows.
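A small simulation (a sketch, not from the notes; $s = 10$ and the mean vector below are arbitrary illustrative choices) confirms Theorem 4.4.1 and Corollary 4.4.4: the Monte Carlo risk of $\delta^1$ matches $s - (s-2)^2 E_\theta[1/S^2]$ and lies below the constant risk $s$ of $\mathbf{X}$.

```python
import numpy as np

rng = np.random.default_rng(1)
s, reps = 10, 200_000
theta = np.linspace(-1.0, 1.0, s)          # arbitrary mean vector (illustrative)

X = theta + rng.standard_normal((reps, s))
S2 = (X**2).sum(axis=1)
delta1 = (1 - (s - 2) / S2)[:, None] * X   # James-Stein estimator (c = 1)

risk_X = ((X - theta)**2).sum(axis=1).mean()        # close to s
risk_JS = ((delta1 - theta)**2).sum(axis=1).mean()
risk_formula = s - (s - 2)**2 * (1.0 / S2).mean()   # s - (s-2)^2 E[1/S^2]
print(risk_X, risk_JS, risk_formula)
```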
Empirical Bayes Interpretation of the James-Stein Estimator. Suppose $\theta_1, \dots, \theta_s$ are iid $N(0, \tau^2)$. If $\tau^2$ were known, the Bayes estimator with respect to squared error loss would be
$$\hat\theta_i = \frac{X_i}{1 + \tau^{-2}} = \left(1 - \frac{1}{1+\tau^2}\right) X_i, \quad i = 1, \dots, s.$$
However, if $\tau^2$ is unknown it must be replaced by some estimate. Write
$$X_i = \theta_i + Z_i, \qquad \{Z_i\} \text{ iid } N(0,1) \text{ with } \{Z_i\} \text{ independent of } \{\theta_i\}$$
(so that $X_i \mid \theta_i = \theta_i \sim N(\theta_i, 1)$). Then
$$\{X_i\} \text{ iid } N(0, \tau^2 + 1).$$
Since $S^2 = \sum_{i=1}^s X_i^2$ is complete and sufficient for $\tau^2$, $\frac{s-2}{S^2}$ is UMVUE for $\frac{1}{1+\tau^2}$. So a natural estimator of $\theta_i$ is
$$\delta^1 = \left(1 - \frac{s-2}{S^2}\right)\mathbf{X}.$$
Moreover, since $\frac{1}{1+\tau^2} < 1$, a better estimate of $\frac{1}{1+\tau^2}$ is $\min\left(\frac{s-2}{S^2}, 1\right)$, which suggests using
$$\delta^{1+} = \left(1 - \min\left(\frac{s-2}{S^2}, 1\right)\right)\mathbf{X} = \max\left(1 - \frac{s-2}{S^2},\, 0\right)\mathbf{X}.$$
$\delta^{1+}$ is strictly risk-preferable to $\delta^1$ (so $\delta^1$ is inadmissible). But $\delta^{1+}$ is also inadmissible. It was a difficult problem to find an estimator which is strictly risk-preferable to $\delta^{1+}$; in fact it took twenty years. It is now known that there are many admissible minimax estimators. (See TPE, p. 357.)
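The improvement from truncating the shrinkage factor can also be seen by simulation. The sketch below (an illustration, with hypothetical values $s = 6$, $\tau = 0.5$) compares $\delta^1$ with the positive-part version $\delta^{1+}$ when the $\theta_i$ are drawn from the $N(0, \tau^2)$ prior of the empirical Bayes formulation.

```python
import numpy as np

rng = np.random.default_rng(2)
s, reps, tau = 6, 200_000, 0.5             # small tau: heavy-shrinkage regime (illustrative)

theta = tau * rng.standard_normal((reps, s))
X = theta + rng.standard_normal((reps, s))
S2 = (X**2).sum(axis=1, keepdims=True)

shrink = 1 - (s - 2) / S2
js  = shrink * X                           # delta^1
jsp = np.maximum(shrink, 0.0) * X          # delta^{1+}, positive-part James-Stein

print(((js  - theta)**2).sum(axis=1).mean(),
      ((jsp - theta)**2).sum(axis=1).mean())   # jsp has the smaller average loss
```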
CHAPTER 5
Large Sample Theory
5.1. Convergence in Probability and Order in Probability
Definition 5.1.1. A sequence of random variables $X_n$ is said to converge to 0 in probability if for any $\varepsilon > 0$,
$$P(|X_n| > \varepsilon) \to 0 \quad\text{as } n \to \infty,$$
in which case we write $X_n \xrightarrow{p} 0$, or equivalently $X_n = o_p(1)$.

Definition 5.1.2. $\{X_n\}$ is bounded in probability (or tight) if for any $\varepsilon > 0$ there exists $M(\varepsilon) \in \mathbb{R}$ such that
$$P\left(|X_n| > M(\varepsilon)\right) < \varepsilon \quad\text{for all } n,$$
in which case we write $X_n = O_p(1)$.

Definition 5.1.3.
$$X_n \xrightarrow{p} X \iff X_n - X \xrightarrow{p} 0 \iff X_n - X = o_p(1),$$
$$X_n = o_p(a_n) \iff \frac{X_n}{a_n} = o_p(1), \qquad X_n = O_p(a_n) \iff \frac{X_n}{a_n} = O_p(1).$$

Proposition 5.1.4. Let $\{X_n\}$ and $\{Y_n\}$ be sequences of r.v.'s and suppose $a_n > 0$ and $b_n > 0$. Then the following results hold:
1. If $X_n = o_p(a_n)$ and $Y_n = o_p(b_n)$, then
$$X_n Y_n = o_p(a_n b_n), \qquad X_n + Y_n = o_p(\max(a_n, b_n)), \qquad |X_n|^r = o_p(a_n^r), \ r > 0.$$
2. If $X_n = o_p(a_n)$ and $Y_n = O_p(b_n)$, then
$$X_n Y_n = o_p(a_n b_n).$$
3. If $X_n = O_p(a_n)$ and $Y_n = O_p(b_n)$, then
$$X_n Y_n = O_p(a_n b_n), \qquad X_n + Y_n = O_p(\max(a_n, b_n)), \qquad |X_n|^r = O_p(a_n^r), \ r > 0.$$

Proof. We only prove the first part and leave the remaining parts as exercises.

$X_n Y_n$: If $\left|\frac{X_n Y_n}{a_n b_n}\right| > \varepsilon$, then either $\left|\frac{Y_n}{b_n}\right| \le 1$ and $\left|\frac{X_n}{a_n}\right| > \varepsilon$, or $\left|\frac{Y_n}{b_n}\right| > 1$ and $\left|\frac{X_n Y_n}{a_n b_n}\right| > \varepsilon$. Thus, if $X_n = o_p(a_n)$ and $Y_n = o_p(b_n)$, then
$$P\left(\left|\frac{X_n Y_n}{a_n b_n}\right| > \varepsilon\right) \le P\left(\left|\frac{X_n}{a_n}\right| > \varepsilon\right) + P\left(\left|\frac{Y_n}{b_n}\right| > 1\right) \to 0 \quad\text{as } n \to \infty.$$

$X_n + Y_n$: If $\frac{|X_n + Y_n|}{\max(a_n, b_n)} > \varepsilon$, then since $|X_n + Y_n| \le |X_n| + |Y_n|$,
$$\frac{|X_n|}{a_n} > \frac{\varepsilon}{2} \quad\text{or}\quad \frac{|Y_n|}{b_n} > \frac{\varepsilon}{2}.$$
Thus, as in the previous part,
$$P\left(\frac{|X_n + Y_n|}{\max(a_n, b_n)} > \varepsilon\right) \to 0.$$

$|X_n|^r$: If $\frac{|X_n|^r}{a_n^r} > \varepsilon$, then $\frac{|X_n|}{a_n} > \varepsilon^{1/r}$. Thus,
$$P\left(\frac{|X_n|^r}{a_n^r} > \varepsilon\right) \to 0.$$
Definition 5.1.5. For a sequence of random vectors $\mathbf{X}_n = (X_{n1}, \dots, X_{nm})$, we define $O_p$ and $o_p$ as follows:
$$\mathbf{X}_n = o_p(a_n) \iff X_{nj} = o_p(a_n), \ j = 1, \dots, m;$$
$$\mathbf{X}_n = O_p(a_n) \iff X_{nj} = O_p(a_n), \ j = 1, \dots, m;$$
$$\mathbf{X}_n \xrightarrow{p} \mathbf{X} \iff \mathbf{X}_n - \mathbf{X} = o_p(1) \iff X_{nj} \xrightarrow{p} X_j, \ j = 1, \dots, m.$$
Definition 5.1.6.
$$\|\mathbf{X}_n - \mathbf{X}\|^2 := \sum_{j=1}^m |X_{nj} - X_j|^2.$$

Proposition 5.1.7.
$$\mathbf{X}_n - \mathbf{X} = o_p(1) \iff \|\mathbf{X}_n - \mathbf{X}\| = o_p(1).$$

Proof.
($\Rightarrow$)
$$P\left(\|\mathbf{X}_n - \mathbf{X}\|^2 > \varepsilon\right) = P\left(\sum_{j=1}^m |X_{nj} - X_j|^2 > \varepsilon\right) \le \sum_{j=1}^m P\left(|X_{nj} - X_j|^2 > \frac{\varepsilon}{m}\right) \to 0.$$
($\Leftarrow$)
$$|X_{nj} - X_j|^2 \le \|\mathbf{X}_n - \mathbf{X}\|^2 \ \Rightarrow\ X_{nj} - X_j = o_p(1).$$
Proposition 5.1.8. If $\mathbf{X}_n - \mathbf{Y}_n \xrightarrow{p} \mathbf{0}$ and $\mathbf{Y}_n - \mathbf{Y} \xrightarrow{p} \mathbf{0}$, then
$$\mathbf{X}_n - \mathbf{Y} \xrightarrow{p} \mathbf{0}.$$

Proof.
$$\|\mathbf{X}_n - \mathbf{Y}\| \le \|\mathbf{X}_n - \mathbf{Y}_n\| + \|\mathbf{Y}_n - \mathbf{Y}\| = o_p(1).$$
Proposition 5.1.9. If $\mathbf{X}_n \xrightarrow{p} \mathbf{X}$ and $g : \mathbb{R}^m \to \mathbb{R}$ is continuous, then
$$g(\mathbf{X}_n) \xrightarrow{p} g(\mathbf{X}).$$

Proof. We note that $\mathbf{X}_n \xrightarrow{p} \mathbf{X}$ if and only if every subsequence $\{\mathbf{X}_{n_j}\}$ has a further subsequence $\{\mathbf{X}_{n_{j_k}}\}$ such that $\mathbf{X}_{n_{j_k}} \to \mathbf{X}$ a.s. as $k \to \infty$. Hence if $\{\mathbf{X}_{n_j}\}$ is any subsequence of $\{\mathbf{X}_n\}$, then there exists a subsequence $\mathbf{X}_{n_{j_k}} \xrightarrow{a.s.} \mathbf{X}$, whence $g(\mathbf{X}_{n_{j_k}}) \xrightarrow{a.s.} g(\mathbf{X})$. But this implies that $g(\mathbf{X}_n) \xrightarrow{p} g(\mathbf{X})$.

Example 5.1.10. If $X_n =_d X$, then $X_n = O_p(1)$ and $X_n = o_p(a_n)$ for any sequence $\{a_n\}$ such that $a_n \to \infty$.
Example 5.1.11. Suppose $X_1, X_2, \dots$ are iid. Then $X_n = O_p(1)$ and $X_n = o_p(a_n)$ if $a_n \to \infty$. Also, by the WLLN and CLT, with $S_n = X_1 + \dots + X_n$,
$$S_n = \begin{cases} o_p(n) & \text{if } EX_1 = 0 \\ O_p(n) & \text{if } EX_1 \ne 0 \\ O_p(\sqrt{n}) & \text{if } EX_1 = 0 \text{ and } \mathrm{Var}(X_1) < \infty. \end{cases}$$
Taylor Expansions in Probability

Proposition 5.1.12. Suppose $X_n = a + O_p(r_n)$ with $r_n \to 0$ and $r_n > 0$. If $g$ has $s$ derivatives at $a$, then
$$g(X_n) = \sum_{j=0}^{s} \frac{g^{(j)}(a)}{j!}\,(X_n - a)^j + o_p(r_n^s).$$

Proof. Let
$$h(x) := \begin{cases} \dfrac{g(x) - \sum_{j=0}^{s} \frac{g^{(j)}(a)}{j!}(x-a)^j}{(x-a)^s} & \text{if } x \ne a \\[2mm] 0 & \text{if } x = a. \end{cases}$$
Then $h$ is continuous at $a$ since $g$ has $s$ derivatives at $a$. Since $\frac{X_n - a}{r_n} = O_p(1)$,
$$X_n - a = o_p(1).$$
Thus,
$$h(X_n) \xrightarrow{p} h(a) = 0, \quad\text{i.e.}\quad h(X_n) = o_p(1).$$
Thus,
$$h(X_n)\,(X_n - a)^s = o_p(r_n^s).$$
Example 5.1.13. Suppose $\{X_n\}$ iid $(\mu, \sigma^2)$, $\mu > 0$. By Chebyshev,
$$\bar{X}_n = \mu + O_p\left(n^{-1/2}\right).$$
Thus,
$$\log \bar{X}_n = \log\mu + \frac{1}{\mu}\left(\bar{X}_n - \mu\right) + o_p\left(n^{-1/2}\right)$$
and
$$\sqrt{n}\left(\log \bar{X}_n - \log\mu\right) = \frac{\sqrt{n}\left(\bar{X}_n - \mu\right)}{\mu} + o_p(1).$$
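A quick sketch of this expansion (assuming, for illustration, exponential data with mean $\mu = 2$, so $\sigma = \mu$): the standardized quantity $\sqrt{n}\left(\log\bar X_n - \log\mu\right)$ should be approximately $N(0, \sigma^2/\mu^2) = N(0,1)$.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, n, reps = 2.0, 500, 100_000

X = rng.exponential(mu, size=(reps, n))    # sigma = mu for the exponential
Z = np.sqrt(n) * (np.log(X.mean(axis=1)) - np.log(mu))

# asymptotic variance from the expansion: sigma^2 / mu^2 = 1 here
print(Z.mean(), Z.var())                   # approx 0 and 1
```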
5.2. Convergence in Distribution
Definition 5.2.1. We say that a sequence of random vectors $\mathbf{X}_n$ converges in distribution to $\mathbf{X}$, and write
$$\mathbf{X}_n \xrightarrow{d} \mathbf{X},$$
if $F_{\mathbf{X}_n}(\mathbf{x}) \to F_{\mathbf{X}}(\mathbf{x})$ for all $\mathbf{x} \in C = \{\mathbf{x} : F_{\mathbf{X}} \text{ is continuous at } \mathbf{x}\}$.

Remark 5.2.2. Convergence for all $\mathbf{x}$ is too stringent a requirement, as illustrated by $X_n \equiv \frac{1}{n}$: we would like to say $X_n \xrightarrow{d} 0$ even though $F_{X_n}(0) = 0 \not\to F_X(0) = 1$.
Theorem 5.2.3. Suppose $\mathbf{X}_n \sim F_n$ and $\mathbf{X} \sim F_0$. Then the following are equivalent:
1. $\mathbf{X}_n \xrightarrow{d} \mathbf{X}$.
2. $\int g(\mathbf{x})\,dF_n(\mathbf{x}) \to \int g(\mathbf{x})\,dF_0(\mathbf{x})$ for all bounded continuous functions $g$.
3. $\int e^{i\mathbf{t}^T\mathbf{x}}\,dF_n(\mathbf{x}) \to \int e^{i\mathbf{t}^T\mathbf{x}}\,dF_0(\mathbf{x})$ for all $\mathbf{t} \in \mathbb{R}^m$ (i.e. $\phi_n(\mathbf{t}) := E\,e^{i\mathbf{t}^T\mathbf{X}_n} \to E\,e^{i\mathbf{t}^T\mathbf{X}} =: \phi_0(\mathbf{t})$ for all $\mathbf{t} \in \mathbb{R}^m$).

Proof. See Billingsley, pp. 378-383.
Proposition 5.2.4. If $\mathbf{X}_n \xrightarrow{p} \mathbf{X}$, then
1. $E\left|e^{i\mathbf{t}^T\mathbf{X}_n} - e^{i\mathbf{t}^T\mathbf{X}}\right| \to 0$ for all $\mathbf{t} \in \mathbb{R}^m$, and
2. $\mathbf{X}_n \xrightarrow{d} \mathbf{X}$.

Proof.
1. Since
$$E\left|1 - e^{i\mathbf{t}^T(\mathbf{X}_n - \mathbf{X})}\right| \le E\left[\left|1 - e^{i\mathbf{t}^T(\mathbf{X}_n - \mathbf{X})}\right| I_{\{\|\mathbf{X}_n - \mathbf{X}\| \le \delta\}}\right] + 2P\left(\|\mathbf{X}_n - \mathbf{X}\| > \delta\right),$$
given any $\varepsilon > 0$ we can choose $\delta$ to make the first term less than $\varepsilon/2$, and then choose $n$ large enough to make the second term less than $\varepsilon/2$. Hence the left-hand side converges to 0 as $n \to \infty$.
2. Since $|\cdot|$ is convex, by Jensen's inequality,
$$\left|E\,e^{i\mathbf{t}^T\mathbf{X}_n} - E\,e^{i\mathbf{t}^T\mathbf{X}}\right| \le E\left|e^{i\mathbf{t}^T\mathbf{X}_n} - e^{i\mathbf{t}^T\mathbf{X}}\right| \to 0.$$
Thus, by Theorem 5.2.3,
$$\mathbf{X}_n \xrightarrow{d} \mathbf{X}.$$
Proposition 5.2.5 (Slutsky's theorem). If $\mathbf{X}_n - \mathbf{Y}_n = o_p(1)$ and $\mathbf{X}_n \xrightarrow{d} \mathbf{X}$, then
$$\mathbf{Y}_n \xrightarrow{d} \mathbf{X}.$$

Proof. For a random vector $\mathbf{Z}$ define
$$\phi_{\mathbf{Z}}(\mathbf{t}) := \int e^{i\mathbf{t}^T\mathbf{z}}\,dF_{\mathbf{Z}}(\mathbf{z}).$$
Then we have
$$\left|\phi_{\mathbf{Y}_n}(\mathbf{t}) - \phi_{\mathbf{X}}(\mathbf{t})\right| \le \left|\phi_{\mathbf{Y}_n}(\mathbf{t}) - \phi_{\mathbf{X}_n}(\mathbf{t})\right| + \left|\phi_{\mathbf{X}_n}(\mathbf{t}) - \phi_{\mathbf{X}}(\mathbf{t})\right|.$$
By Jensen's inequality and the same argument as in Proposition 5.2.4(1),
$$\left|\phi_{\mathbf{Y}_n}(\mathbf{t}) - \phi_{\mathbf{X}_n}(\mathbf{t})\right| \le E\left|1 - e^{-i\mathbf{t}^T(\mathbf{X}_n - \mathbf{Y}_n)}\right| \to 0.$$
Also, since $\mathbf{X}_n \xrightarrow{d} \mathbf{X}$, the second term converges to 0 by Theorem 5.2.3.
Proposition 5.2.6. If $\mathbf{X}_n \xrightarrow{d} \mathbf{X}$ and $h : \mathbb{R}^m \to \mathbb{R}^s$ is continuous, then
$$h(\mathbf{X}_n) \xrightarrow{d} h(\mathbf{X}).$$

Proof. Since $\exp\left(i\mathbf{t}^T h(\cdot)\right)$ is a bounded continuous function,
$$E\,e^{i\mathbf{t}^T h(\mathbf{X}_n)} \to E\,e^{i\mathbf{t}^T h(\mathbf{X})}.$$

Proposition 5.2.7. If $\mathbf{X}_n \xrightarrow{d} \mathbf{X}$ and $\mathbf{Y}_n \xrightarrow{p} \mathbf{b}$, then
$$\begin{pmatrix} \mathbf{X}_n \\ \mathbf{Y}_n \end{pmatrix} \xrightarrow{d} \begin{pmatrix} \mathbf{X} \\ \mathbf{b} \end{pmatrix}.$$
Moreover, if $\dim \mathbf{Y}_n = \dim \mathbf{X}_n$, then
1. $\mathbf{X}_n + \mathbf{Y}_n \xrightarrow{d} \mathbf{X} + \mathbf{b}$,
2. $\mathbf{Y}_n^T \mathbf{X}_n \xrightarrow{d} \mathbf{b}^T \mathbf{X}$.

Proof. Let $\mathbf{Z}_n = \begin{pmatrix} \mathbf{X}_n \\ \mathbf{b} \end{pmatrix}$. Then
$$\mathbf{Z}_n - \begin{pmatrix} \mathbf{X}_n \\ \mathbf{Y}_n \end{pmatrix} \xrightarrow{p} \mathbf{0},$$
since each component converges to 0 in probability. Also $\mathbf{Z}_n \xrightarrow{d} \begin{pmatrix} \mathbf{X} \\ \mathbf{b} \end{pmatrix}$ since the sequence of characteristic functions converges. Hence by Slutsky,
$$\begin{pmatrix} \mathbf{X}_n \\ \mathbf{Y}_n \end{pmatrix} \xrightarrow{d} \begin{pmatrix} \mathbf{X} \\ \mathbf{b} \end{pmatrix}.$$
Now suppose $\dim \mathbf{Y}_n = \dim \mathbf{X}_n$. Because the mappings $(\mathbf{x}, \mathbf{y}) \mapsto \mathbf{x} + \mathbf{y}$ and $(\mathbf{x}, \mathbf{y}) \mapsto \mathbf{x}^T\mathbf{y}$ are continuous on $\mathbb{R}^{2m}$, we obtain results 1 and 2 from Proposition 5.2.6.
Asymptotic Normality

Definition 5.2.8. A sequence of random variables $\{X_n\}$ is said to be asymptotically normal with mean $\mu_n$ and standard deviation $\sigma_n$ if $\sigma_n > 0$ for all sufficiently large $n$ and
$$\sigma_n^{-1}(X_n - \mu_n) \xrightarrow{d} Z, \quad\text{where } Z \sim N(0,1).$$
We write $X_n$ is $AN(\mu_n, \sigma_n^2)$, and call $\mu_n$ the asymptotic mean and $\sigma_n^2$ the asymptotic variance.
Example 5.2.9. The classical CLT states that if $X_1, \dots, X_n$ are iid with mean $\mu$ and variance $\sigma^2$, then $\bar{X}_n$ is $AN\left(\mu, \frac{\sigma^2}{n}\right)$, i.e.
$$\frac{\sqrt{n}\left(\bar{X}_n - \mu\right)}{\sigma} \xrightarrow{d} Z.$$
Proposition 5.2.10. If $X_n$ is $AN(\mu, \sigma_n^2)$ with $\sigma_n \to 0$ and $g$ is differentiable at $\mu$, then $g(X_n)$ is $AN\left(g(\mu),\ g'(\mu)^2\sigma_n^2\right)$.

Proof. Because
$$Z_n := \frac{X_n - \mu}{\sigma_n} \xrightarrow{d} Z \sim N(0,1),$$
we have $Z_n = O_p(1)$. (Choose a pair of continuity points $z_1 < z_2$ of $F_Z$ such that $F_Z(z_1) < \varepsilon/4$ and $F_Z(z_2) > 1 - \varepsilon/4$. Then for all $n > N(\varepsilon)$,
$$F_{Z_n}(z_1) < \frac{\varepsilon}{3} \quad\text{and}\quad F_{Z_n}(z_2) > 1 - \frac{\varepsilon}{3}.$$
For $n = 1, \dots, N(\varepsilon)$, there exists $v(\varepsilon)$ such that
$$P\left(|Z_n| > v(\varepsilon)\right) < \varepsilon, \quad n = 1, \dots, N(\varepsilon).$$
Choose $M(\varepsilon) \ge \max\left(v(\varepsilon), |z_1|, |z_2|\right)$. Then
$$P\left(|Z_n| > M(\varepsilon)\right) < \varepsilon \quad\text{for all } n = 1, 2, \dots.)$$
Thus,
$$Z_n = O_p(1) \ \Rightarrow\ X_n = \mu + O_p(\sigma_n).$$
By Proposition 5.1.12 ($g(X_n) = g(\mu) + g'(\mu)(X_n - \mu) + o_p(\sigma_n)$),
$$\frac{g(X_n) - g(\mu)}{\sigma_n} = g'(\mu)\,\frac{X_n - \mu}{\sigma_n} + o_p(1) \xrightarrow{d} N\left(0, g'(\mu)^2\right).$$
Example 5.2.11. Let $\{X_n\}$ be iid $(\mu, \sigma^2)$. Then $\bar{X}_n := \frac{1}{n}(X_1 + \dots + X_n)$ is $AN\left(\mu, \frac{\sigma^2}{n}\right)$. Suppose $\mu \ne 0$. Then $g(x) := \frac{1}{x}$ has a derivative at $\mu$, so $1/\bar{X}_n$ is $AN\left(\frac{1}{\mu}, \frac{\sigma^2}{\mu^4 n}\right)$, i.e.
$$\frac{\sqrt{n}\,\mu^2}{\sigma}\left(\frac{1}{\bar{X}_n} - \frac{1}{\mu}\right) \xrightarrow{d} N(0,1).$$
Moreover, since $\bar{X}_n \xrightarrow{p} \mu$ (by the WLLN), Proposition 5.2.10 together with Slutsky's theorem implies
$$\frac{\sqrt{n}\,\bar{X}_n^2}{\sigma}\left(\frac{1}{\bar{X}_n} - \frac{1}{\mu}\right) \xrightarrow{d} N(0,1).$$

Note 5.2.12. Although $\frac{1}{\mu}$ is the asymptotic mean of $1/\bar{X}_n$ in the above example, it is not the limit as $n \to \infty$ of $E\left(1/\bar{X}_n\right)$. In fact, $E\left|1/\bar{X}_n\right| = \infty$ if $X_1 \sim N(\mu, \sigma^2)$.
Example 5.2.13. Suppose $X_1, \dots, X_n$ iid $(0, \sigma^2)$. Then
$$\sqrt{n}\,\bar{X}_n \xrightarrow{d} Z \sim N(0, \sigma^2) \text{ by the CLT} \quad\Rightarrow\quad n\bar{X}_n^2 \xrightarrow{d} Z^2,$$
where $Z^2 \sim \sigma^2\chi^2(1)$, since $g(x) = x^2$ is continuous.
Multivariate Asymptotic Normality

Definition 5.2.14. $\mathbf{X}_n$ is $AN(\boldsymbol{\mu}_n, \Sigma_n)$ if
1. $\Sigma_n$ has no zero diagonal elements for all large enough $n$;
2. $\lambda'\mathbf{X}_n$ is $AN(\lambda'\boldsymbol{\mu}_n, \lambda'\Sigma_n\lambda)$ for all $\lambda \in \mathbb{R}^m$ such that $\lambda'\Sigma_n\lambda > 0$ for all large enough $n$.

Recalling the Cramér-Wold device, i.e.
$$\mathbf{X}_n \xrightarrow{d} \mathbf{X} \iff \lambda'\mathbf{X}_n \xrightarrow{d} \lambda'\mathbf{X} \ \text{ for all } \lambda \in \mathbb{R}^m,$$
we see that $\mathbf{X}_n$ is $AN(\boldsymbol{\mu}_n, \Sigma_n)$, where $\Sigma_n$ satisfies (1), if and only if
$$\frac{\lambda'(\mathbf{X}_n - \boldsymbol{\mu}_n)}{\sqrt{\lambda'\Sigma_n\lambda}} \xrightarrow{d} N(0,1)$$
for all $\lambda$ such that $\lambda'\Sigma_n\lambda > 0$ for large enough $n$.

Proposition 5.2.15. Suppose $\mathbf{X}_n$ is $AN(\boldsymbol{\mu}, c_n^2\Sigma)$ with $c_n \to 0$. If $g : \mathbb{R}^m \to \mathbb{R}^k$ is continuously differentiable in a neighborhood of $\boldsymbol{\mu}$ and $D\Sigma D'$ has all diagonal elements greater than 0, where $D = \left[\partial g_i/\partial x_j\right]_{\mathbf{x}=\boldsymbol{\mu}}$, then $g(\mathbf{X}_n)$ is $AN\left(g(\boldsymbol{\mu}),\ c_n^2 D\Sigma D'\right)$.
Definition 5.2.16. A sequence of estimates $T_n = T_n(X_1, \dots, X_n)$ of $g(\theta)$ is said to be weakly consistent if
$$T_n \xrightarrow{P_\theta} g(\theta) \quad\text{for all }\theta,$$
and strongly consistent if
$$T_n \to g(\theta) \ \text{a.s. } P_\theta \quad\text{for all }\theta.$$

Example 5.2.17 (Moment Estimation). Let $\{X_n\}$ be iid $P_\theta$ such that $E_\theta|X|^r < \infty$ for all $\theta$. Suppose $m_j(\theta) = E_\theta X_1^j$ for $1 \le j \le r$ and $g(\theta) = \psi\left(m_1(\theta), \dots, m_r(\theta)\right)$ where $\psi$ is continuous. Then
$$T_n(X_1, \dots, X_n) = \psi(\hat{m}_1, \dots, \hat{m}_r) \to g(\theta) \ \text{a.s. } P_\theta,$$
where
$$\hat{m}_j = \frac{1}{n}\sum_{k=1}^n X_k^j.$$
Definition 5.2.18. A sequence of estimators is said to be asymptotically normal if there exist $\mu_n(\theta)$ and $\sigma_n(\theta) > 0$ such that $T_n$ is $AN(\mu_n(\theta), \sigma_n^2(\theta))$ for all $\theta$, i.e.
$$P_\theta\left(\frac{T_n - \mu_n(\theta)}{\sigma_n(\theta)} \le x\right) \to \Phi(x) \quad\text{for all }\theta.$$

Remark 5.2.19. Suppose $T_n$ is $AN(\mu_n'(\theta), \sigma_n'^2(\theta))$. If
$$\frac{\sigma_n'(\theta)}{\sigma_n(\theta)} \to 1 \quad\text{and}\quad \frac{\mu_n(\theta) - \mu_n'(\theta)}{\sigma_n(\theta)} \to 0,$$
then $T_n$ is $AN(\mu_n(\theta), \sigma_n^2(\theta))$, since
$$\frac{T_n - \mu_n}{\sigma_n} = \underbrace{\frac{\sigma_n'}{\sigma_n}}_{\to 1}\,\underbrace{\frac{T_n - \mu_n'}{\sigma_n'}}_{\xrightarrow{d} N(0,1)} + \underbrace{\frac{\mu_n' - \mu_n}{\sigma_n}}_{\to 0} \ \xrightarrow{d}\ N(0,1).$$

Definition 5.2.20. A sequence of asymptotically normal estimators $\{T_n\}$ is said to be asymptotically unbiased for $g(\theta)$ if
$$\frac{\mu_n(\theta) - g(\theta)}{\sigma_n(\theta)} \to 0,$$
in which case
$$\frac{T_n - g(\theta)}{\sigma_n(\theta)} \xrightarrow{d} N(0,1),$$
and $\frac{\mu_n(\theta) - g(\theta)}{\sigma_n(\theta)}$ is called the standardized asymptotic bias.
Example 5.2.21. Let $\{X_n\}$ be iid $P_\theta$ such that $E_\theta|X|^{2r} < \infty$ for all $\theta$. Suppose $m_j(\theta) = E_\theta X_1^j$ for $1 \le j \le r$ and $g(\theta) = \psi\left(m_1(\theta), \dots, m_r(\theta)\right)$ where $\psi$ is continuously differentiable. Then
$$T_n = \psi\left(m_1(\theta), \dots, m_r(\theta)\right) + \sum_j \frac{\partial\psi}{\partial x_j}\left(\mathbf{m}(\theta)\right)\left(\hat{m}_j - m_j(\theta)\right) + o_p\left(\frac{1}{\sqrt{n}}\right),$$
where $\hat{m}_j := \frac{1}{n}\sum_{i=1}^n X_i^j$. From the CLT,
$$\sqrt{n}\begin{bmatrix} \hat{m}_1 - m_1(\theta) \\ \vdots \\ \hat{m}_r - m_r(\theta) \end{bmatrix} \xrightarrow{d} N(0, \Sigma),$$
where $\Sigma = (\sigma_{ij})_{i,j=1}^{r}$ with $\sigma_{ij} = \mathrm{Cov}\left(X^i, X^j\right)$. Thus,
$$\frac{T_n - \psi\left(m_1(\theta), \dots, m_r(\theta)\right)}{\sigma_n(\theta)} \xrightarrow{d} N(0,1),$$
where
$$\sigma_n^2(\theta) = \frac{1}{n}\left(\frac{\partial\psi}{\partial m_1}, \dots, \frac{\partial\psi}{\partial m_r}\right)\Sigma\begin{pmatrix} \partial\psi/\partial m_1 \\ \vdots \\ \partial\psi/\partial m_r \end{pmatrix} \quad(\text{provided this is } > 0).$$
Example 5.2.22. Let $X_1, \dots, X_n$ be iid $\Gamma(\alpha, \beta)$ (shape $\alpha$, scale $\beta$). Then $EX_1 = \alpha\beta$, $EX_1^2 = \alpha\beta^2 + \alpha^2\beta^2 = \alpha(\alpha+1)\beta^2$, and $\mathrm{Var}\,X_1 = \alpha\beta^2$. With
$$\psi(m_1, m_2) = \frac{m_2 - m_1^2}{m_1} = \beta,$$
the moment estimator of the scale is
$$T_n = \frac{\hat{m}_2 - \hat{m}_1^2}{\hat{m}_1} = \frac{\frac{1}{n}\sum_i X_i^2 - \bar{X}^2}{\bar{X}}.$$
Then $T_n$ is $AN\left(\beta, \sigma_n^2(\theta)\right)$, where, as in Example 5.2.21,
$$\sigma_n^2(\theta) = \frac{1}{n}\left(-\frac{m_2}{m_1^2} - 1,\ \ \frac{1}{m_1}\right)\Sigma\begin{pmatrix} -\frac{m_2}{m_1^2} - 1 \\[1mm] \frac{1}{m_1} \end{pmatrix}$$
and
$$\Sigma = \begin{pmatrix} \mathrm{Cov}(X_1, X_1) & \mathrm{Cov}(X_1, X_1^2) \\ \mathrm{Cov}(X_1^2, X_1) & \mathrm{Cov}(X_1^2, X_1^2) \end{pmatrix},$$
whose entries are computed from the gamma moments $E X_1^k = \alpha(\alpha+1)\cdots(\alpha+k-1)\beta^k$.
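A sketch of this moment estimator (assuming, for illustration, shape $\alpha = 3$ and scale $\beta = 2$): $T_n$ should be approximately normal around $\beta$ with variance of order $1/n$.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, beta, n, reps = 3.0, 2.0, 2000, 50_000

X = rng.gamma(alpha, beta, size=(reps, n))
m1 = X.mean(axis=1)
m2 = (X**2).mean(axis=1)
Tn = (m2 - m1**2) / m1                 # moment estimator of the scale beta

print(Tn.mean(), n * Tn.var())         # mean approx beta; n*Var stabilizes near the asymptotic variance
```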
5.3. Asymptotic Comparisons (Pitman Efficiency)

Definition 5.3.1. If $\sqrt{n}\left(T_n - g(\theta)\right) \xrightarrow{d} N\left(0, \sigma_0^2(\theta)\right)$ and $\sqrt{n}\left(T'_{n'(n)}(X_1, \dots, X_{n'(n)}) - g(\theta)\right) \xrightarrow{d} N\left(0, \sigma_0^2(\theta)\right)$, then the Pitman asymptotic relative efficiency (ARE) of $\{T_n\}$ relative to $\{T'_n\}$ is
$$e_{T,T'}(\theta) = \lim_{n\to\infty} \frac{n'(n)}{n},$$
provided the limit exists and is independent of the sequence $n'(n)$ chosen to satisfy
$$\sqrt{n}\left(T'_{n'(n)} - g(\theta)\right) \xrightarrow{d} N\left(0, \sigma_0^2(\theta)\right).$$
Roughly speaking, $e_{T,T'}$ is the ratio of the numbers of observations required for the two estimators to achieve the same precision.
Theorem 5.3.2. If $T_n$ is $AN\left(g(\theta), \frac{\sigma_0^2(\theta)}{n}\right)$ and $T'_n$ is $AN\left(g(\theta), \frac{\sigma_1^2(\theta)}{n}\right)$, then
$$e_{T,T'}(\theta) = \frac{\sigma_1^2(\theta)}{\sigma_0^2(\theta)}.$$

Proof. Let $n'(n) = \left[n\,\sigma_1^2(\theta)/\sigma_0^2(\theta)\right]$, where $[x]$ is the integer part of $x$. Then $n'(n)/n \to \sigma_1^2(\theta)/\sigma_0^2(\theta)$, and since $T'_m$ is $AN\left(g(\theta), \sigma_1^2(\theta)/m\right)$,
$$\sqrt{n'}\left(T'_{n'} - g(\theta)\right) \xrightarrow{d} N\left(0, \sigma_1^2(\theta)\right).$$
Hence
$$\sqrt{n}\left(T'_{n'} - g(\theta)\right) = \sqrt{\frac{n}{n'}}\;\sqrt{n'}\left(T'_{n'} - g(\theta)\right) \xrightarrow{d} N\left(0,\ \frac{\sigma_0^2(\theta)}{\sigma_1^2(\theta)}\,\sigma_1^2(\theta)\right) = N\left(0, \sigma_0^2(\theta)\right),$$
so this choice of $n'(n)$ satisfies the defining requirement, and
$$e_{T,T'}(\theta) = \lim_{n\to\infty}\frac{n'(n)}{n} = \frac{\sigma_1^2(\theta)}{\sigma_0^2(\theta)},$$
provided the limit is the same for all sequences $n'(n)$ such that $\sqrt{n}\left(T'_{n'(n)} - g(\theta)\right) \xrightarrow{d} N\left(0, \sigma_0^2(\theta)\right)$ and $\lim n'(n)/n$ exists.
Suppose $n'(n)$ is any such sequence. Then
$$\sqrt{n}\left(T'_{n'} - g(\theta)\right) = \sqrt{\frac{n}{n'}}\;\underbrace{\sqrt{n'}\left(T'_{n'} - g(\theta)\right)}_{\xrightarrow{d}\ N(0,\,\sigma_1^2(\theta))},$$
and for the left-hand side to converge to $N\left(0, \sigma_0^2(\theta)\right)$ we must have $\frac{n\,\sigma_1^2(\theta)}{n'\,\sigma_0^2(\theta)} \to 1$, i.e. $\frac{n'}{n} \to \frac{\sigma_1^2(\theta)}{\sigma_0^2(\theta)}$.

5.4. Comparison of sample mean, median and trimmed mean
Proposition 5.4.1. Let $F$ be a cdf which is differentiable at $\xi_p$ = the $100p$-th percentile (or $p$-quantile) of $F$, with $f(\xi_p) = \frac{dF}{dx}\big|_{x=\xi_p} > 0$. Let $X_1, \dots, X_n$ be iid $F$ and let $Y_1 \le \dots \le Y_n$ be the order statistics. Then
$$Y_{[np]} \text{ is } AN\left(\xi_p,\ \frac{p(1-p)}{n f^2(\xi_p)}\right).$$
Proof. Write $\xi_n = \xi_p + x\sqrt{p(1-p)/[n f^2(\xi_p)]}$. Then
$$P\left(\frac{Y_{[np]} - \xi_p}{\sqrt{p(1-p)/[n f^2(\xi_p)]}} \le x\right) = P\left(Y_{[np]} \le \xi_n\right) = P\left(S_n \ge [np]\right)$$
$$= 1 - P\left(\frac{S_n - np_n}{\sqrt{np_n(1-p_n)}} < \frac{n(p - p_n)}{\sqrt{np_n(1-p_n)}}\right),$$
where
$$S_n = \sum_{i=1}^n I_{\{X_i \le \xi_n\}} \sim b(n, p_n), \qquad p_n = P(X_1 \le \xi_n).$$
We need the Berry-Esseen Theorem here: if $X_1, \dots, X_n$ are iid $F$ with $E|X_1|^3 < \infty$, then there exists $c$, independent of $F$, such that
$$\sup_x \left|P\left(\frac{S_n - E S_n}{\sqrt{\mathrm{Var}\,S_n}} \le x\right) - \Phi(x)\right| \le \frac{c}{\sqrt{n}}\,\frac{E|X_1 - EX_1|^3}{\left(\mathrm{Var}(X_1)\right)^{3/2}}.$$
Hence, for our problem,
$$\left|P\left(\frac{S_n - np_n}{\sqrt{np_n(1-p_n)}} < \frac{n(p - p_n)}{\sqrt{np_n(1-p_n)}}\right) - \Phi\left(\frac{n(p - p_n)}{\sqrt{np_n(1-p_n)}}\right)\right| \le \frac{c}{\sqrt{n}\,[p_n(1-p_n)]^{3/2}} = \frac{c}{\sqrt{n}}\,O(1)$$
as $n \to \infty$, since $\xi_n \to \xi_p$ and therefore
$$p_n(1-p_n) \to F(\xi_p)\left(1 - F(\xi_p)\right) = p(1-p).$$
Moreover,
$$\frac{n(p - p_n)}{\sqrt{np_n(1-p_n)}} = \sqrt{n}\,\frac{p - p_n}{\sqrt{p_n(1-p_n)}} \approx \sqrt{n}\,\frac{F(\xi_p) - F(\xi_n)}{\sqrt{p(1-p)}} = -\sqrt{n}\,\frac{f(\xi_p)(\xi_n - \xi_p) + o(\xi_n - \xi_p)}{\sqrt{p(1-p)}}$$
$$= -\sqrt{n}\,\frac{f(\xi_p)\,x\sqrt{\frac{p(1-p)}{n f^2(\xi_p)}} + o\left(\frac{1}{\sqrt{n}}\right)}{\sqrt{p(1-p)}} \to -x,$$
and the result follows.
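The conclusion of Proposition 5.4.1 is easy to check by simulation. The sketch below (standard normal data and $p = 0.75$, both hypothetical choices) compares the Monte Carlo variance of $Y_{[np]}$ with $p(1-p)/(n f^2(\xi_p))$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
p, n, reps = 0.75, 2000, 40_000

X = rng.standard_normal((reps, n))
Y = np.sort(X, axis=1)[:, int(n * p) - 1]          # order statistic Y_{[np]}

xi_p = norm.ppf(p)
asym_var = p * (1 - p) / (n * norm.pdf(xi_p)**2)
print(Y.var(), asym_var)                           # should be close
```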
Proposition 5.4.2. Suppose $F$ is differentiable at $\xi_{p_1}, \xi_{p_2}$ with $f(\xi_{p_1}) > 0$, $f(\xi_{p_2}) > 0$, where $0 < p_1 < p_2 < 1$. Then
$$\begin{pmatrix} Y_{[np_1]} \\ Y_{[np_2]} \end{pmatrix} \text{ is } AN\left(\begin{pmatrix} \xi_{p_1} \\ \xi_{p_2} \end{pmatrix},\ \frac{1}{n}\begin{pmatrix} \dfrac{p_1(1-p_1)}{f^2(\xi_{p_1})} & \dfrac{p_1(1-p_2)}{f(\xi_{p_1})f(\xi_{p_2})} \\[2mm] \dfrac{p_1(1-p_2)}{f(\xi_{p_1})f(\xi_{p_2})} & \dfrac{p_2(1-p_2)}{f^2(\xi_{p_2})} \end{pmatrix}\right).$$

Proof. Similar to the proof of Proposition 5.4.1.
Remark 5.4.3. Suppose $X_1, X_2, \dots$ iid $F(x - \theta)$ with $F(0) = \frac{1}{2}$ and $f(0) > 0$. Then
$$\tilde{X} := \mathrm{median}(X_1, \dots, X_n) = X_{([n/2])} \text{ is } AN\left(\theta,\ \frac{1}{4 n f^2(0)}\right).$$
If $EX_1 = \theta$, then
$$\bar{X} \text{ is } AN\left(\theta,\ \frac{\sigma^2}{n}\right), \quad\text{where } \sigma^2 = \mathrm{Var}\,X_1;$$
therefore $e_{\tilde{X},\bar{X}} = 4\sigma^2 f^2(0)$. So if $2\sigma f(0) < 1$ then $\bar{X}$ is more efficient; if $2\sigma f(0) > 1$ then $\tilde{X}$ is more efficient.
Trimmed Mean

Proposition 5.4.4. Suppose $F$ is symmetric about 0 with $F(-c) = 0$, $F(c) = 1$, and $f(x) = \frac{dF}{dx}$ strictly positive and continuous on $(-c, c)$. Then
$$\sqrt{n}\,\bar{X}_\alpha \xrightarrow{d} N\left(0, \sigma_\alpha^2\right),$$
where
$$\bar{X}_\alpha = \frac{1}{n - 2[n\alpha]}\sum_{i=[n\alpha]+1}^{n-[n\alpha]} X_{(i)},$$
$$\sigma_\alpha^2 = \frac{2}{(1-2\alpha)^2}\left[\int_0^{\xi(1-\alpha)} t^2 f(t)\,dt + \alpha\,\xi^2(1-\alpha)\right],$$
and $\xi(\alpha) := F^{-1}(\alpha)$.

Proof. Omitted.

Remark 5.4.5. Again assume $X_1, \dots, X_n$ iid $F(x-\theta)$, $F(-c) = 0$, and $f$ symmetric, continuous and positive. Then
$$\lim_{\alpha\uparrow\frac12} \sigma_\alpha^2 = \frac{1}{4f^2(0)} \quad(\text{since } \bar{X}_\alpha \to \tilde{X}), \qquad \lim_{\alpha\downarrow 0} \sigma_\alpha^2 = \sigma^2.$$
The ARE's of $\tilde{X}$ and $\bar{X}_\alpha$ relative to $\bar{X}$ are
$$e_{\tilde{X},\bar{X}}(f) = 4\sigma^2 f^2(0), \qquad e_{\bar{X}_\alpha,\bar{X}}(f) = \sigma^2/\sigma_\alpha^2.$$
The following table gives $e_{\bar{X}_\alpha,\bar{X}}(F)$ for several $F$ and trimming proportions $\alpha$ (the column $\alpha = .5$ is the median):

$$
\begin{array}{l|ccc}
f \backslash \alpha & .125 & .25 & .5 \\ \hline
\frac{1}{\sqrt{2\pi}}e^{-x^2/2} & .94 & .84 & \frac{2}{\pi} = .64 \\
\frac{1}{\pi(1+x^2)} & \infty & \infty & \infty \\
\frac{1}{2}e^{-|x|} & 1.40 & 1.63 & 2 \\
T(.01, 3) & .98 & .89 & .68 \\
T(.05, 3) & 1.19 & 1.09 & .83 \\
t_3 & 1.91 & 1.97 & 1.62 \\
t_5 & 1.24 & 1.21 & .96
\end{array}
$$

where $T(\epsilon, \tau)$ denotes the contaminated normal with density
$$(1-\epsilon)\,\frac{1}{\sqrt{2\pi}}e^{-x^2/2} + \epsilon\,\frac{1}{\tau\sqrt{2\pi}}e^{-x^2/(2\tau^2)}.$$
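The table entries can be reproduced (approximately) by Monte Carlo. The sketch below estimates $e_{\tilde{X},\bar{X}}$ and $e_{\bar{X}_{.25},\bar{X}}$ for $t_3$ data, for comparison with the values 1.62 and 1.97 above.

```python
import numpy as np
from scipy.stats import trim_mean

rng = np.random.default_rng(6)
n, reps = 400, 40_000

X = rng.standard_t(df=3, size=(reps, n))
mean_   = X.mean(axis=1)
median_ = np.median(X, axis=1)
trim25  = trim_mean(X, 0.25, axis=1)       # 25%-trimmed mean

v_mean = mean_.var()
print(v_mean / median_.var(), v_mean / trim25.var())   # approx 1.62 and 1.97
```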
Remark 5.4.6. $\bar{X}$ is inefficient for heavy tails. This is because it is sensitive to one or two extreme observations.

Remark 5.4.7. The optimal $\alpha$ depends on the distribution sampled. For large $n$ the distribution can be estimated and $\alpha$ chosen accordingly as $\hat\alpha$.

Proposition 5.4.8. (a) For $f(x)$ symmetric about 0 and $f(x) \le f(0)$,
$$e_{\tilde{X},\bar{X}}(F) \ge \frac{1}{3}.$$
The lower bound is attained if and only if $F$ is uniform.
(b) If $F$ satisfies the hypotheses of Proposition 5.4.4, then
$$e_{\bar{X}_\alpha,\bar{X}}(F) \ge (1-2\alpha)^2.$$
If in addition $F$ is unimodal and $f$ is non-increasing on $[0, \infty)$, then
$$e_{\bar{X}_\alpha,\bar{X}}(F) \ge \frac{1}{1+4\alpha}.$$

Proof. (a) $e_{\tilde{X},\bar{X}}(F) = 4\sigma^2 f^2(0)$. Since $e_{\tilde{X},\bar{X}}$ is independent of scale, assume $f(0) = 1$ and $0 \le f(x) \le 1$. Then we need to minimize $\sigma^2 = \int x^2 f(x)\,dx$ subject to $0 \le f(x) \le 1$, $f(0) = 1$ and $\int f(x)\,dx = 1$.
Use the method of Lagrange multipliers, i.e. minimize
$$S = \int x^2 f(x)\,dx - \lambda\left(\int f(x)\,dx - 1\right) = \int (x^2 - \lambda) f(x)\,dx + \lambda$$
with respect to $\lambda$ and $f$, subject to $0 \le f(x) \le 1$ and $f(0) = 1$.
For each fixed $\lambda > 0$, $S$ is minimized by choosing $f$ as large as possible where $x^2 < \lambda$ and as small as possible where $x^2 > \lambda$, i.e.
$$f(x) = 1 \text{ for } |x| < \sqrt{\lambda}, \qquad f(x) = 0 \text{ for } |x| \ge \sqrt{\lambda}.$$
$$\frac{\partial S}{\partial\lambda} = 0 \ \Rightarrow\ \int f(x)\,dx = 1 \ \Rightarrow\ \int_{-\sqrt\lambda}^{\sqrt\lambda} 1\,dx = 1 \ \Rightarrow\ \sqrt{\lambda} = \frac{1}{2}.$$
Thus $f(x) = I_{[-1/2,\,1/2]}(x)$ and $\sigma^2 = \frac{1}{12}$, so $\min e_{\tilde{X},\bar{X}}(F) = 4\cdot\frac{1}{12} = \frac{1}{3}$.
CHAPTER 6
Maximum Likelihood Estimation
6.1. Consistency
Suppose that $X_1, X_2, \dots$ are iid $P_\theta$, $\theta \in \Omega$, and make the following assumptions.
(A0) $P_\theta \ne P_{\theta'}$ if $\theta \ne \theta'$.
(A1) The $P_\theta$, $\theta \in \Omega$, have common support.
(A2) $\frac{dP_\theta}{d\mu}(x) = f(x, \theta)$.
(A3) The true parameter $\theta_0 \in \mathrm{int}(\Omega)$.

Definition 6.1.1. $L(\mathbf{x}, \theta) = \sum_{i=1}^n \log f(x_i, \theta)$ is called the log likelihood. An estimator $\hat\theta$ of $\theta$ is called a (global) MLE if
$$L\left(\mathbf{x}, \hat\theta(\mathbf{x})\right) = \sup_{\theta\in\Omega} L(\mathbf{x}, \theta).$$
The MLE of $g(\theta)$ is defined to be $g(\hat\theta)$.

Likelihood equations. If $\theta = (\theta_1, \dots, \theta_d)$, $\Omega$ is an open subset of $\mathbb{R}^d$, and $L$ is differentiable on $\Omega$, then $\hat\theta$ (if it exists) satisfies
$$\frac{\partial L}{\partial\theta_j}\left(\mathbf{x}, \hat\theta(\mathbf{x})\right) = 0, \quad 1 \le j \le d.$$
In general
$$\nabla_\theta L(\mathbf{x}, \theta) = 0$$
may not have a unique solution. Sometimes the likelihood is unbounded and the MLE does not exist.
Example 6.1.2. Consider the change-point hazard model with $\theta = (a, b, \nu) \in (0, \infty)^3$:
$$F(x, \theta) = \begin{cases} 1 - e^{-ax} & 0 \le x < \nu \\ 1 - e^{-a\nu - b(x-\nu)} & x \ge \nu, \end{cases} \qquad f(x, \theta) = a e^{-ax} I_{[0,\nu)}(x) + b\, e^{-a\nu - b(x-\nu)} I_{[\nu,\infty)}(x),$$
$$\text{Hazard rate} = \frac{f(x, \theta)}{1 - F(x, \theta)} = \begin{cases} a & x < \nu \\ b & x \ge \nu. \end{cases}$$
$$L(\mathbf{x}, a, b, \nu) = \sum_{i=1}^{k}\left[-a x_{(i)} + \log a\right] + \sum_{i=k+1}^{n}\left[-a\nu - b\left(x_{(i)} - \nu\right) + \log b\right],$$
where $x_{(i)}$ is the $i$th order statistic and $k = \#\{x_i : x_i < \nu\}$.

For any fixed $a$, letting $b = \left(x_{(n)} - \nu\right)^{-1}$ and $\nu \uparrow x_{(n)}$, we see that $L\left(\mathbf{x}, a, (x_{(n)}-\nu)^{-1}, \nu\right) \to \infty$ as $\nu \uparrow x_{(n)}$, and hence a global MLE does not exist. However, if we restrict $\nu < x_{(n-1)}$, the constrained MLE exists and is consistent.
Lemma 6.1.3. Let
$$I(\theta_j \mid \theta_i) = -E_{\theta_i}\left[\log\frac{f(X, \theta_j)}{f(X, \theta_i)}\right] = -\int \log\frac{f(x, \theta_j)}{f(x, \theta_i)}\, f(x, \theta_i)\,d\mu(x)$$
= the Kullback-Leibler discrepancy of $f(\cdot, \theta_j)$ relative to $f(\cdot, \theta_i)$. Then $I(\theta_j \mid \theta_i) \ge 0$, with equality if and only if $P_{\theta_j} = P_{\theta_i}$ (equivalently, under (A0), $\theta_j = \theta_i$).

Proof. Jensen's inequality gives
$$-E_{\theta_i}\log\frac{f(X, \theta_j)}{f(X, \theta_i)} \ge -\log E_{\theta_i}\frac{f(X, \theta_j)}{f(X, \theta_i)} = -\log\int_{\{x : f(x,\theta_i) > 0\}} f(x, \theta_j)\,d\mu(x) \ge -\log 1 = 0.$$
Equality holds if and only if $\frac{f(X, \theta_j)}{f(X, \theta_i)}$ is constant a.s. $P_{\theta_i}$ and $\int_{\{x : f(x,\theta_i) > 0\}} f(x, \theta_j)\,d\mu(x) = 1$. The latter equality implies $P_{\theta_j} \ll P_{\theta_i}$, and the former equality then implies $P_{\theta_j} = P_{\theta_i}$, since $\int c\,f(x, \theta_i)\,d\mu = 1$ implies $c = 1$.
Theorem 6.1.4. Suppose $\Omega = \{\theta_0, \dots, \theta_k\}$ is finite and conditions (A0) and (A2) hold. Then the MLE $\hat\theta_n = \hat\theta(X_1, \dots, X_n)$ is unique for sufficiently large $n$ and $\hat\theta_n \xrightarrow{a.s.} \theta_0$.

Proof. Suppose that $\theta_0$ is the true parameter value. Then Lemma 6.1.3 and the SLLN imply that
$$-\frac{1}{n}\sum_{i=1}^n \log\frac{f(X_i, \theta_j)}{f(X_i, \theta_0)} \to I(\theta_j \mid \theta_0) \quad\text{a.s. } P_{\theta_0}, \quad j = 1, \dots, k.$$
Hence for $n$ sufficiently large,
$$-\frac{1}{n}\sum_{i=1}^n \log\frac{f(X_i, \theta_j)}{f(X_i, \theta_0)} > \frac{1}{2} I(\theta_j \mid \theta_0),$$
i.e.
$$L(\mathbf{X}, \theta_0) - L(\mathbf{X}, \theta_j) > \frac{n}{2} I(\theta_j \mid \theta_0) > 0 \quad\text{if } \theta_j \ne \theta_0. \qquad (6.1.1)$$
So for all $n$ sufficiently large, $L(\mathbf{X}, \theta_j)$ has a unique maximum at $\theta_j = \theta_0$. Hence $\hat\theta_{ML} \to \theta_0$ a.s. $P_{\theta_0}$, i.e. $\hat\theta_{ML}$ is strongly consistent.

Remark 6.1.5. Theorem 6.1.4 may not hold if $\Omega$ is countably infinite (see the example on page 410 of TPE).
Theorem 6.1.6. Suppose conditions (A0)-(A3) hold, $\Omega \subset \mathbb{R}$, and for almost all $x$, $f(x, \theta)$ is differentiable with respect to $\theta \in N$ with continuous derivative $f'(x, \theta)$, where $N$ is an open subset of $\Omega$ containing $\theta_0$. Then with $P_{\theta_0}$ probability 1, for $n$ large,
$$L'(\theta, \mathbf{X}) = \sum_i \frac{f'(X_i, \theta)}{f(X_i, \theta)} = 0$$
has a root $\hat\theta_n$ and $\hat\theta_n \to \theta_0$ a.s. $P_{\theta_0}$.

Proof. Choose $\delta > 0$ small so that $(\theta_0 - \delta, \theta_0 + \delta) \subset N$ and let
$$S_n = \left\{\mathbf{x} : L(\theta_0, \mathbf{x}) > L(\theta_0 - \delta, \mathbf{x}) \text{ and } L(\theta_0, \mathbf{x}) > L(\theta_0 + \delta, \mathbf{x})\right\}.$$
From equation (6.1.1), a.s. $P_{\theta_0}$, $L(\theta_0, \mathbf{X}) - L(\theta_0 \pm \delta, \mathbf{X}) > 0$ for all $n$ large. Hence there exists $\hat\theta_n(\mathbf{X}) \in (\theta_0 - \delta, \theta_0 + \delta)$ such that $L'(\hat\theta_n) = 0$.
If there exists more than one such root in $(\theta_0 - \delta, \theta_0 + \delta)$, choose the one closest to $\theta_0$ (the set of roots is closed since $f'(x, \theta)$ is continuous in $\theta$); call this root $\hat\theta_n$. (Note that there could be 2 closest roots, in which case choose the larger.)
Let $A_\delta = \{\mathbf{X} = (X_1, X_2, \dots) : \exists\ \hat\theta_n \in (\theta_0 - \delta, \theta_0 + \delta) \text{ s.t. } L'(\hat\theta_n) = 0 \text{ for all sufficiently large } n$, $\hat\theta_n$ the closest such root to $\theta_0\}$. Then $P_{\theta_0}(A_\delta) = 1$. Define $A_0 = \lim_{k\to\infty} A_{1/k}$. Clearly $P_{\theta_0}(A_0) = 1$ and on $A_0$, $\hat\theta_n \to \theta_0$.

Remark 6.1.7. Theorem 6.1.6 says there exists a sequence of local maxima which converges a.s. $P_{\theta_0}$ to $\theta_0$. However, since we don't know $\theta_0$, we can't determine the sequence unless $L$ has a unique local maximum for each $n$.
Corollary 6.1.8. If $L'(\theta) = 0$ has a unique root $\hat\theta_n$ for all $\mathbf{X}$ and all sufficiently large $n$, then
$$\hat\theta_n \to \theta_0 \quad\text{a.s. } P_{\theta_0}.$$
If in addition $\Omega$ is the open interval $(\theta_L, \theta_U)$ and $L'(\theta)$ is continuous on $\Omega$ for all $\mathbf{X}$, then $\hat\theta_n$ maximizes the likelihood (globally), i.e. $\hat\theta_n$ is the MLE and hence the MLE is consistent.

Proof. The first statement follows directly from Theorem 6.1.6.
If $\hat\theta_n$ is not the MLE, then
$$L(\theta) \to \sup_\theta L(\theta) \quad\text{as } \theta \downarrow \theta_L \text{ or } \theta \uparrow \theta_U.$$
But $\hat\theta_n$ is a local max by the proof of Theorem 6.1.6, and hence $L$ must also have a local min, so $L'(\theta) = 0$ for some $\theta \ne \hat\theta_n$, a contradiction. So $\hat\theta_n$ is the MLE.

The following theorem gives conditions for the consistency of the MLE regardless of the existence and/or the number of roots of the likelihood equation. But first we state without proof a lemma which is needed in the proof of the theorem.

Lemma 6.1.9. If $f : \mathcal{X} \times \Omega \to \mathbb{R}$ (with $\Omega$ compact) is a function such that $f(x, \theta)$ is continuous in $\theta$ for each fixed $x$ and $f(\cdot, \theta)$ is measurable on $\mathcal{X}$ for each fixed $\theta$, then there exists a measurable function $T : \mathcal{X} \to \Omega$ such that
$$f(x, T(x)) = \max_{\theta\in\Omega} f(x, \theta) \quad\forall x.$$
(The point is the measurability of $T$.)
Theorem 6.1.10. Suppose $\Omega \subset \bar\Omega$, where $\bar\Omega$ is a compact metric space with metric $d$, and define $\bar{\mathcal{P}} = \{P_\theta,\ \theta \in \bar\Omega\}$, where $P_\theta$ is a possibly defective probability measure for $\theta \in \bar\Omega\setminus\Omega$. Suppose
(i) there exists a $\sigma$-finite $\mu$ such that $\frac{dP_\theta}{d\mu}(x) = f(x, \theta)$ for all $\theta$ in $\bar\Omega$, and $f(x, \theta)$ is continuous in $\theta$ for all $x$;
(ii) $P_\theta \ne P_{\theta_0}$ for all $\theta \ne \theta_0$, for some fixed $\theta_0 \in \Omega$;
(iii) for all $\theta \in \bar\Omega$ there exists a neighborhood $N_\theta$ of $\theta$ in $\bar\Omega$ such that
$$E_{\theta_0}\left[\inf_{\theta'\in N_\theta} \log\frac{f(X, \theta_0)}{f(X, \theta')}\right] > -\infty.$$
Then for any measurable $\hat\theta_n = \hat\theta_n(X_1, \dots, X_n)$ which maximizes $L(\theta, \mathbf{X})$ over $\bar\Omega$, we have
$$d(\hat\theta_n, \theta_0) \to 0 \quad\text{a.s. } P_{\theta_0}.$$
Proof. It suffices to show that for each open neighborhood $A(\theta_0)$ of $\theta_0$,
$$P_{\theta_0}\left(\hat\theta_n \in A(\theta_0)\ \forall\ n \text{ sufficiently large}\right) = 1.$$
For each $\theta$ not in $A(\theta_0)$ there exists a neighborhood $N_\theta$ of $\theta$ such that
$$E_{\theta_0}\left[\inf_{\theta'\in N_\theta}\log\frac{f(X, \theta_0)}{f(X, \theta')}\right] > -\infty. \qquad (6.1.2)$$
Let $B(\theta, r) = \{\theta' : d(\theta, \theta') < r\}$. Then
$$\inf_{\theta'\in B(\theta,r)} \log\frac{f(x, \theta_0)}{f(x, \theta')} \uparrow \log\frac{f(x, \theta_0)}{f(x, \theta)} \quad\text{as } r \downarrow 0,$$
and since
$$\inf_{\theta'\in B(\theta,r)} \log\frac{f(x, \theta_0)}{f(x, \theta')} \ge \inf_{\theta'\in N_\theta} \log\frac{f(x, \theta_0)}{f(x, \theta')} \quad\text{for } r \text{ small},$$
it follows from (6.1.2) and the monotone convergence theorem that
$$\lim_{r\downarrow 0}\, E_{\theta_0}\left[\inf_{\theta'\in B(\theta,r)}\log\frac{f(x, \theta_0)}{f(x, \theta')}\right] = I(\theta \mid \theta_0) > 0.$$
In particular there exists $r(\theta) > 0$ such that
$$E_{\theta_0}\left[\inf_{\theta'\in B(\theta, r(\theta))}\log\frac{f(x, \theta_0)}{f(x, \theta')}\right] > 0.$$
Since the class of open sets $B(\theta, r(\theta))$, $\theta$ not in $A(\theta_0)$, covers the compact set $\bar\Omega\setminus A(\theta_0)$, there exist $\theta_1, \dots, \theta_k \in \bar\Omega\setminus A(\theta_0)$ such that
$$\bar\Omega\setminus A(\theta_0) \subset \bigcup_{\ell=1}^k B\left(\theta_\ell, r(\theta_\ell)\right).$$
For each $\ell \in \{1, \dots, k\}$, the SLLN implies that (using $L(\hat\theta_n) \ge L(\theta_0)$, since $\hat\theta_n$ maximizes $L$)
$$\inf_{\theta\in B(\theta_\ell, r(\theta_\ell))}\left[L(\hat\theta_n) - L(\theta)\right] \ge \inf_{\theta\in B(\theta_\ell, r(\theta_\ell))}\left[L(\theta_0) - L(\theta)\right] \ge \sum_{i=1}^n \inf_{\theta\in B(\theta_\ell, r(\theta_\ell))}\log\frac{f(X_i, \theta_0)}{f(X_i, \theta)} \xrightarrow{\text{SLLN}} \infty$$
a.s. $P_{\theta_0}$ as $n \to \infty$, i.e. $L(\hat\theta_n) - \sup_{\theta\in B(\theta_\ell, r(\theta_\ell))} L(\theta) \to \infty$ a.s. $P_{\theta_0}$.
Hence with $P_{\theta_0}$ probability 1,
$$\hat\theta_n \notin B\left(\theta_\ell, r(\theta_\ell)\right),\ \ell = 1, \dots, k, \text{ for } n \text{ sufficiently large, whence } \hat\theta_n \in A(\theta_0) \text{ for } n \text{ sufficiently large.}$$
Example 6.1.11. $f(x, \theta) = \theta e^{-\theta x}$, $0 < x < \infty$, $0 < \theta < \infty$. Obviously $\hat\theta_n = 1/\bar{X}_n \to \theta$ a.s. by the SLLN. However, let us show it by applying Theorem 6.1.10.

Let $\bar\Omega = [0, \infty]$ and $f(x, 0) = f(x, \infty) = 0$ (defective probability densities). Then
(i) $f(x, \theta)$ is continuous in $\theta$ for any $x$;
(ii) $P_\theta \ne P_{\theta_0}$ for any $\theta \ne \theta_0$;
(iii) For $\theta \in (0, \infty)$ and $0 < \delta < \theta$,
$$\inf_{\theta'\in B(\theta,\delta)} \log\frac{f(X, \theta_0)}{f(X, \theta')} = \inf_{\theta-\delta<\theta'<\theta+\delta}\left[\log\frac{\theta_0}{\theta'} + X(\theta' - \theta_0)\right] \ge \log\frac{\theta_0}{\theta+\delta} + X(\theta - \delta - \theta_0),$$
and $E_{\theta_0}(\text{RHS}) > -\infty$.
For $\theta = 0$ and any $\delta > 0$,
$$\inf_{0\le\theta'<\delta} \log\frac{f(X, \theta_0)}{f(X, \theta')} = \inf_{0<\theta'<\delta}\left[\log\frac{\theta_0}{\theta'} + X(\theta' - \theta_0)\right] \ge \log\frac{\theta_0}{\delta} - X\theta_0,$$
and again $E_{\theta_0}(\text{RHS}) > -\infty$.
For $\theta = \infty$ and any $\delta > 0$,
$$\inf_{\theta' > 1/\delta} \log\frac{f(X, \theta_0)}{f(X, \theta')} = \inf_{\theta' > 1/\delta}\left[\log\frac{\theta_0}{\theta'} + X(\theta' - \theta_0)\right] = \begin{cases} \log(\theta_0 X) + 1 - X\theta_0 & \text{if } X < \delta \\ \log(\theta_0\delta) + X\left(\frac{1}{\delta} - \theta_0\right) & \text{if } X \ge \delta, \end{cases}$$
and $E_{\theta_0}\left[\left(\log(\theta_0 X) + 1 - \theta_0 X\right) I_{(0,\delta)}(X) + \left(\log(\theta_0\delta) + X\left(\tfrac{1}{\delta} - \theta_0\right)\right) I_{[\delta,\infty)}(X)\right] > -\infty$.
Hence the conditions of the theorem are satisfied, so the MLE is consistent.
6.2. Asymptotic Normality of the MLE

Theorem 6.2.1. Suppose that
(i) $\theta^0 = (\theta_1^0, \dots, \theta_d^0)$ is the true parameter and $\theta^0 \in \mathrm{int}(\Omega) \subset \mathbb{R}^d$;
(ii) the MLE $\hat\theta_n \xrightarrow{P_{\theta^0}} \theta^0$;
(iii) $E_{\theta^0}\left[\frac{\partial}{\partial\theta_j}\log f(X, \theta^0)\right] = 0$, $1 \le j \le d$;
(iv) $I(\theta^0) = E_{\theta^0}\left[\left(\frac{\partial}{\partial\theta}\log f(X, \theta^0)\right)\left(\frac{\partial}{\partial\theta}\log f(X, \theta^0)\right)^T\right] = -E_{\theta^0}\left[\frac{\partial^2}{\partial\theta\,\partial\theta^T}\log f(X, \theta^0)\right]$, and $I(\theta^0)$ is non-singular;
(v) there exists $\delta > 0$ such that $E_{\theta^0} W_\delta(X) < \infty$, where
$$W_\delta(X) = \sup_{\|\theta - \theta^0\| \le \delta}\ \sum_{p,q=1}^d \left|\frac{\partial^2}{\partial\theta_p\,\partial\theta_q}\log f(X, \theta) - \frac{\partial^2}{\partial\theta_p\,\partial\theta_q}\log f(X, \theta^0)\right|;$$
(vi) $\frac{\partial^2}{\partial\theta_p\,\partial\theta_q}\log f(x, \theta)$ is continuous in $\theta$ for all $x$.
Then
$$\sqrt{n}\left(\hat\theta_n - \theta^0\right) \xrightarrow{d} N\left(0, I^{-1}(\theta^0)\right).$$
Proof. From (vi), $W_\delta(X) \downarrow 0$ as $\delta \downarrow 0$. Hence from (v), $E_{\theta^0} W_\delta(X) \downarrow 0$ as $\delta \downarrow 0$.
Now, for some $\theta_n^*$ between $\theta^0$ and $\hat\theta_n$,
$$\frac{1}{n}\sum_{i=1}^n \frac{\partial}{\partial\theta}\log f(x_i, \theta^0) = \frac{1}{n}\sum_{i=1}^n \frac{\partial}{\partial\theta}\log f(x_i, \hat\theta_n) + \frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta\,\partial\theta^T}\log f(x_i, \theta_n^*)\left(\theta^0 - \hat\theta_n\right). \qquad (6.2.1)$$
(The first term on the right equals 0 since $\hat\theta_n$ is the MLE.)
We will show that
$$\frac{1}{n}\sum_i \frac{\partial^2}{\partial\theta\,\partial\theta^T}\log f(x_i, \theta_n^*) \xrightarrow{P} -I(\theta^0). \qquad (6.2.2)$$
Write
$$\frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta\,\partial\theta^T}\log f(x_i, \theta_n^*) = \frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta\,\partial\theta^T}\log f(x_i, \theta^0) + \frac{1}{n}\sum_i\left[\frac{\partial^2}{\partial\theta\,\partial\theta^T}\log f(x_i, \theta_n^*) - \frac{\partial^2}{\partial\theta\,\partial\theta^T}\log f(x_i, \theta^0)\right].$$
The first term on the right-hand side goes to $-I(\theta^0)$ in probability by the WLLN. For any $c > 0$,
$$P_{\theta^0}\left(|\text{2nd term}| > c\right) = P_{\theta^0}\left(\|\theta_n^* - \theta^0\| > \delta,\ |\text{2nd term}| > c\right) + P_{\theta^0}\left(\|\theta_n^* - \theta^0\| \le \delta,\ |\text{2nd term}| > c\right)$$
$$\le P_{\theta^0}\left(\|\theta_n^* - \theta^0\| > \delta\right) + P_{\theta^0}\left(\frac{1}{n}\sum_i \sup_{\|\theta-\theta^0\|\le\delta}\left|\frac{\partial^2}{\partial\theta\,\partial\theta^T}\log f(x_i, \theta) - \frac{\partial^2}{\partial\theta\,\partial\theta^T}\log f(x_i, \theta^0)\right| > c\right) = I + II,$$
where $I \le P_{\theta^0}\left(\|\hat\theta_n - \theta^0\| > \delta\right) \to 0$, and $II \le P_{\theta^0}\left(\frac{1}{n}\sum_i W_\delta(x_i) > c\right) \le \frac{1}{c}\, E_{\theta^0} W_\delta(X) \to 0$ as $\delta \downarrow 0$. This proves (6.2.2).
Now write (6.2.1) in matrix form:
$$\left[-I(\theta^0) + o_p(1)\right]\sqrt{n}\left(\theta^0 - \hat\theta_n\right) = \frac{1}{\sqrt{n}}\sum_{i=1}^n\left[\frac{\partial}{\partial\theta_j}\log f(X_i, \theta^0)\right]_{j=1,\dots,d}.$$
By the CLT,
$$\frac{1}{\sqrt{n}}\sum_{i=1}^n\left[\frac{\partial}{\partial\theta_j}\log f(X_i, \theta^0)\right]_{j=1,\dots,d} \xrightarrow{d} N\left(0, I(\theta^0)\right).$$
It follows that
$$\sqrt{n}\left(\hat\theta_n - \theta^0\right) = \left[-I(\theta^0) + o_p(1)\right]^{-1}\left(-\frac{1}{\sqrt{n}}\sum_{i=1}^n \frac{\partial}{\partial\theta}\log f(x_i, \theta^0)\right) \xrightarrow{d} N\left(0,\ I^{-1}(\theta^0)\,I(\theta^0)\,I^{-1}(\theta^0)\right) = N\left(0, I^{-1}(\theta^0)\right).$$
Note 6.2.2. Condition (v) can be replaced by: there exists $\delta > 0$ such that
$$\left|\frac{\partial^3}{\partial\theta_p\,\partial\theta_q\,\partial\theta_r}\log f(X, \theta)\right| \le M_{pqr}(X)$$
for all $\theta$ with $\|\theta - \theta^0\| < \delta$, where $E_{\theta^0} M_{pqr}(X) < \infty$.

Example 6.2.3 (One-parameter exponential family).
$$f(x_i, \theta) = e^{\theta T(x_i) - A(\theta)}\,h(x_i).$$
The likelihood equation $L'(\theta) = 0$ implies $\frac{1}{n}\sum T(x_i) = A'(\theta) = E_\theta T(X_1)$. Since $A''(\theta) = \mathrm{Var}_\theta\,T > 0$, $A'(\theta)$ is strictly increasing, so $L'(\theta) = 0$ has at most one solution. By Theorem 6.1.6 and its corollary, $\hat\theta_n \to \theta$ a.s. Also $\frac{d^3}{d\theta^3}\log f(x, \theta) = -A'''(\theta)$ is independent of $x$ and continuous. Hence by Theorem 6.2.1,
$$\sqrt{n}\left(\hat\theta_n - \theta\right) \xrightarrow{d} N\left(0, I^{-1}(\theta)\right) = N\left(0, \frac{1}{\mathrm{Var}_\theta\,T}\right).$$
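Example 6.2.3 can be illustrated with, say, the Poisson family written in natural form ($T(x) = x$, $A(\theta) = e^\theta$, an illustrative choice), where the MLE is $\hat\theta = \log\bar X$; the sketch checks $\sqrt{n}(\hat\theta - \theta) \approx N(0, 1/\mathrm{Var}_\theta T)$ with $\mathrm{Var}_\theta T = e^\theta$.

```python
import numpy as np

rng = np.random.default_rng(7)
theta, n, reps = 0.5, 1000, 50_000
lam = np.exp(theta)                        # Poisson mean = A'(theta) = Var_theta T

X = rng.poisson(lam, size=(reps, n))
theta_hat = np.log(X.mean(axis=1))         # MLE: A'(theta_hat) = T-bar  =>  theta_hat = log(X-bar)

Z = np.sqrt(n) * (theta_hat - theta)
print(Z.var(), 1 / lam)                    # empirical variance approx 1 / Var_theta T
```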
Example 6.2.4 (Censoring). Suppose $X_1, \dots, X_n$ iid $E(1/\theta)$ (i.e. $EX_i = \theta$), and suppose we observe the censored data $Y_i = \min(X_i, T)$ with $T$ fixed. Let $\mu$ be the measure on $[0, T]$ defined by
$$\mu(A) = \int_A dx + I_A(T)$$
(Lebesgue measure plus unit mass at $T$). Then the density of $Y$ with respect to $\mu$ is
$$p(y, \theta) = \begin{cases} \frac{1}{\theta}\,e^{-y/\theta} & 0 \le y < T \\ e^{-T/\theta} & y = T \quad (= P(X_i \ge T)). \end{cases}$$
$$-\frac{\partial^2}{\partial\theta^2}\log p(y, \theta) = -\frac{\partial}{\partial\theta}\begin{cases} -\frac{1}{\theta} + \frac{y}{\theta^2} & 0 \le y < T \\ \frac{T}{\theta^2} & y = T \end{cases} = \begin{cases} -\frac{1}{\theta^2} + \frac{2y}{\theta^3} & 0 \le y < T \\ \frac{2T}{\theta^3} & y = T \end{cases}$$
$$\Rightarrow\quad I(\theta) = -E_\theta\left[\frac{\partial^2}{\partial\theta^2}\log p(Y, \theta)\right] = \int_0^T\left(-\frac{1}{\theta^2} + \frac{2y}{\theta^3}\right)\frac{1}{\theta}\,e^{-y/\theta}\,dy + \frac{2T}{\theta^3}\,e^{-T/\theta} = \frac{1}{\theta^2}\left(1 - e^{-T/\theta}\right).$$
$$\frac{\partial^3}{\partial\theta^3}\log p(y, \theta) = \begin{cases} -\frac{2}{\theta^3} + \frac{6y}{\theta^4} & 0 \le y < T \\ \frac{6T}{\theta^4} & y = T. \end{cases}$$
Since $\theta_0 \in (0, \infty)$, $\left|\frac{\partial^3}{\partial\theta^3}\log p(y, \theta)\right| \le Ay + B$ for $\frac{\theta_0}{2} < \theta < \frac{3\theta_0}{2}$, and $E_{\theta_0}(AY + B) < \infty$. Check that $\hat\theta_n \xrightarrow{P_{\theta_0}} \theta_0$. Then by Theorem 6.2.1,
$$\sqrt{n}\left(\hat\theta_n - \theta_0\right) \xrightarrow{d} N\left(0,\ \frac{\theta_0^2}{1 - e^{-T/\theta_0}}\right).$$
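Setting the score from the density above to zero gives the explicit root $\hat\theta_n = \sum_i Y_i / \#\{Y_i < T\}$ (assuming at least one uncensored observation), so the limit law is easy to check by simulation; a sketch with the hypothetical values $\theta = 2$, $T = 3$:

```python
import numpy as np

rng = np.random.default_rng(8)
theta, T, n, reps = 2.0, 3.0, 1000, 50_000

X = rng.exponential(theta, size=(reps, n))
Y = np.minimum(X, T)
m = (X < T).sum(axis=1)                    # number of uncensored observations
theta_hat = Y.sum(axis=1) / m              # root of the likelihood equation (m >= 1 here w.h.p.)

Z = np.sqrt(n) * (theta_hat - theta)
print(Z.var(), theta**2 / (1 - np.exp(-T / theta)))   # should be close
```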
6.3. Asymptotic Optimality of the MLE

Under the conditions of Theorem 6.2.1, the MLE satisfies
$$\sqrt{n}\left(\hat\theta_n - \theta_0\right) \xrightarrow{d} N\left(0, I^{-1}(\theta_0)\right).$$
We will now show that $I^{-1}(\theta_0)$ is the minimal attainable covariance matrix for a class of asymptotically normal estimators. (Note: as in Section 6.2, $I(\theta)$ denotes the Fisher information matrix per observation.)

Theorem 6.3.1. Suppose $\{T_n\}$ is a sequence of asymptotically normal estimators of $\theta$ with $\mathrm{Var}_\theta(T_n) < \infty$ for all $n$. Let
$$\Delta_n(\theta) = \frac{\partial}{\partial\theta}\, E_\theta(T_n) = \left[\frac{\partial}{\partial\theta_i}\, E_\theta(T_{nj})\right]_{i,j=1,\dots,d}.$$
If
(i) $\sqrt{n}\,(T_n - \theta) \xrightarrow{d} N\left(0, \Sigma(\theta)\right)$,
(ii) $n\,\mathrm{Cov}_\theta(T_n) \ge \Delta_n^T\, I^{-1}(\theta)\,\Delta_n$ (in the positive semidefinite ordering),
(iii) there exists $\epsilon(\theta) > 0$ such that $\sup_n E\,\|\sqrt{n}\,(T_n - \theta)\|^{2+\epsilon} < \infty$,
(iv) $\Delta_n \to I_{d\times d}$,
then
$$\Sigma(\theta) \ge I^{-1}(\theta).$$

Proof. (iii) implies that $\{\sqrt{n}(T_n - \theta)\}$ and $\{n(T_n - \theta)(T_n - \theta)^T\}$ are uniformly integrable (Billingsley, p. 338). Hence from (i), if $Z \sim N(0, \Sigma(\theta))$,
$$E\,\sqrt{n}(T_n - \theta) \to EZ = 0 \quad\text{and}\quad n\,\mathrm{Cov}(T_n) \to \mathrm{Cov}(Z) = \Sigma(\theta)$$
(Billingsley, p. 338). As $n \to \infty$, the LHS of (ii) $\to \Sigma(\theta)$, and the RHS of (ii) $\to I^{-1}(\theta)$ by (iv). Therefore, $\Sigma(\theta) \ge I^{-1}(\theta)$.

Note 6.3.2. Assumption (ii) will be satisfied under the conditions of Corollary 2.4.4 (in the UMVU section).

Definition 6.3.3. If $\{T_n\}$ satisfies (i) with $\Sigma(\theta) = I^{-1}(\theta)$, then it is said to be asymptotically efficient.
Corollary 6.3.4. Let $g = (g_1, \dots, g_r)' : \Omega \to \mathbb{R}^r$ have continuous derivatives $\frac{\partial g_i}{\partial\theta_j}$, $1 \le j \le d$, $1 \le i \le r$. Define
$$\Delta(\theta) = \frac{\partial g}{\partial\theta} = \left[\frac{\partial g_j}{\partial\theta_i}\right]_{i=1,\dots,d;\ j=1,\dots,r}$$
and assume that the sequence $\{T_n\}$ of estimators satisfies
(i) $\sqrt{n}\left(T_n - g(\theta)\right) \xrightarrow{d} N\left(0, \Sigma(\theta)\right)$,
(ii) $n\,\mathrm{Cov}_\theta(T_n) \ge \Delta_n'(\theta)\, I^{-1}(\theta)\,\Delta_n(\theta)$, where $\Delta_n(\theta) = \frac{\partial}{\partial\theta}\,E_\theta T_n$,
(iii) there exists $\epsilon(\theta) > 0$ such that $\sup_n E\,\|\sqrt{n}\,(T_n - g(\theta))\|^{2+\epsilon} < \infty$,
(iv) $\Delta_n \to \Delta$.
Then
$$\Sigma(\theta) \ge \Delta'\, I^{-1}(\theta)\,\Delta.$$

Proof. The same as the proof of Theorem 6.3.1. ((ii) again holds under the conditions of Corollary 2.4.4.)

Next we show that $g(\hat\theta_n)$, where $\hat\theta_n$ is the MLE of $\theta$, achieves the lower bound in Corollary 6.3.4, provided the conditions of Theorem 6.2.1 hold.

Corollary 6.3.5. Suppose $\sqrt{n}\left(\hat\theta_n - \theta\right) \xrightarrow{d} N\left(0, I^{-1}(\theta)\right)$ and $\partial g_j/\partial\theta_i$ is continuous for $1 \le j \le r$, $1 \le i \le d$. Then
$$g(\hat\theta_n) \text{ is } AN\left(g(\theta),\ \frac{1}{n}\,\Delta'\, I^{-1}(\theta)\,\Delta\right).$$
Proof. Since
$$\hat\theta_n - \theta = O_p\left(\frac{1}{\sqrt{n}}\right),$$
$$g(\hat\theta_n) - g(\theta) = \Delta'\left(\hat\theta_n - \theta\right) + o_p\left(\frac{1}{\sqrt{n}}\right)$$
$$\Rightarrow\quad \sqrt{n}\left(g(\hat\theta_n) - g(\theta)\right) \xrightarrow{d} N\left(0,\ \Delta' I^{-1}(\theta)\Delta\right) \quad\text{by Slutsky.}$$

Example 6.3.6. Suppose $X_1, \dots, X_n$ are iid lognormal$(\theta, \sigma^2)$ (i.e. $Y_i = \log X_i \sim N(\theta, \sigma^2)$) with $\sigma^2 = 1$. The MLE of $\theta$ is $\hat\theta_n = \frac{1}{n}\sum_{i=1}^n \log X_i$. Suppose $g(\theta) = E_\theta X_1 = e^{\theta + 1/2}$ (set $\sigma^2 = 1$ in $E\,e^Y = e^{\theta + \sigma^2/2}$, so $E X_1 = e^{\theta + \sigma^2/2}$).
Consider the two estimators of $\xi := g(\theta)$:
$$g(\hat\theta_n) = e^{\hat\theta_n + 1/2} \quad\text{(MLE)}$$
and
$$\tilde\xi_n = \frac{1}{n}\sum_{i=1}^n X_i \quad\text{(moment estimator).}$$
The Fisher information for $\theta$ in $X_i$ equals the Fisher information for $\theta$ in $Y_i$, since the transformation $Y_i = \log X_i$ is one-to-one. So it equals $E\left(\frac{\partial}{\partial\theta}\log f(Y_i, \theta)\right)^2 = E(Y_i - \theta)^2 = 1$ (notice $\log f(Y, \theta) = -\frac{1}{2}\ln 2\pi - \frac{(Y-\theta)^2}{2}$). Therefore
$$\sqrt{n}\left(g(\hat\theta_n) - \xi\right) \xrightarrow{d} N\left(0,\ \Delta' I^{-1}(\theta)\Delta\right), \quad\text{where } \Delta = \frac{d}{d\theta}g(\theta) = e^{\theta + 1/2},$$
i.e.
$$\sqrt{n}\left(g(\hat\theta_n) - \xi\right) \xrightarrow{d} N\left(0,\ e^{2\theta + 1}\right).$$
On the other hand, for the moment estimator $\tilde\xi_n = \frac{1}{n}\sum X_i$, we have
$$\sqrt{n}\left(\tilde\xi_n - \xi\right) \xrightarrow{d} N\left(0, \mathrm{Var}(X_i)\right) = N\left(0,\ e^{2\theta+1}(e - 1)\right).$$
($EX_i^2 = E\,e^{2Y} = $ MGF of $N(\theta, 1)$ evaluated at 2 $= e^{2\theta + 2}$, therefore $\mathrm{Var}(X_i) = e^{2\theta+2} - \left(e^{\theta + 1/2}\right)^2 = e^{2\theta+1}(e - 1)$.)
The ARE of $\tilde\xi_n$ relative to $g(\hat\theta_n)$ is thus
$$\mathrm{ARE}\left(\tilde\xi_n, g(\hat\theta_n)\right) = \frac{1}{e - 1} = .582.$$
The sample mean thus has poor ARE as an estimator of the mean of a lognormal distribution. (The lognormal distribution has heavy tails, so this is not too surprising.)
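The ARE of .582 can be checked directly; a sketch (taking $\theta = 0$ for illustration):

```python
import numpy as np

rng = np.random.default_rng(9)
theta, n, reps = 0.0, 2000, 40_000

Y = theta + rng.standard_normal((reps, n)) # Y_i = log X_i ~ N(theta, 1)
X = np.exp(Y)

mle    = np.exp(Y.mean(axis=1) + 0.5)      # g(theta_hat) = exp(theta_hat + 1/2)
moment = X.mean(axis=1)                    # sample mean of the X_i

print(mle.var() / moment.var())            # approx 1/(e-1) = .582
```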
Example 6.3.7 (Hodges; TPE, p. 440). $X_1, \dots, X_n$ iid $N(\theta, 1)$, $I(\theta) = 1$. Define
$$T_n = \begin{cases} \bar{X}_n & \text{if } |\bar{X}_n| > n^{-1/4} \\ 0 & \text{if } |\bar{X}_n| \le n^{-1/4}. \end{cases}$$
Then $T_n$ is $AN\left(\theta, \frac{v(\theta)}{n}\right)$, where $v(\theta) = 1$ if $\theta \ne 0$ and $v(\theta) = 0$ if $\theta = 0$. So $v(\theta) < I^{-1}(\theta)$ at $\theta = 0$, and the parameter value 0 is called a point of superefficiency.
Theorem 6.3.1 does not apply to this example. Note also that $T_n$ is not uniformly better than $\bar{X}_n$ for finite $n$; for example, $\theta_n = n^{-1/4}$ gives $E_{\theta_n}\left[n(T_n - \theta_n)^2\right] \to \infty > 1 = \lim_{n\to\infty} E_{\theta_n}\left[n(\bar{X}_n - \theta_n)^2\right]$.

Remark 6.3.8. Le Cam has shown that for any estimator satisfying
$$\sqrt{n}\left(\hat\theta_n - \theta\right) \xrightarrow{d} N\left(0, v(\theta)\right),$$
the set of $\theta$'s where $v(\theta) < I(\theta)^{-1}$ has Lebesgue measure zero.
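The superefficiency at $\theta = 0$, and the finite-sample price paid for it near $\theta_n = n^{-1/4}$, show up clearly in simulation; a sketch (simulating $\bar X_n$ directly from its $N(\theta, 1/n)$ distribution):

```python
import numpy as np

rng = np.random.default_rng(10)
n, reps = 10_000, 200_000

def hodges_risk(theta):
    xbar = theta + rng.standard_normal(reps) / np.sqrt(n)   # distribution of X-bar
    Tn = np.where(np.abs(xbar) > n**-0.25, xbar, 0.0)       # Hodges estimator
    return n * ((Tn - theta)**2).mean()                     # normalized risk n E(T_n - theta)^2

print(hodges_risk(0.0))        # far below 1: superefficiency at theta = 0
print(hodges_risk(n**-0.25))   # much larger than 1 at theta_n = n^{-1/4}
print(hodges_risk(1.0))        # approx 1, same as X-bar
```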
Iterative Methods. Suppose we have a sequence of estimators $\tilde\theta_n$ such that
$$\tilde\theta_n = \theta_0 + O_p\left(\frac{1}{\sqrt{n}}\right).$$
Set
$$T_n = \tilde\theta_n - \frac{L'(\tilde\theta_n)}{L''(\tilde\theta_n)}.$$
Then
$$L'(\tilde\theta_n) = L'(\theta_0) + (\tilde\theta_n - \theta_0)\,L''(\theta_n^*),$$
where $\theta_n^*$ is between $\theta_0$ and $\tilde\theta_n$. Thus
$$\sqrt{n}\,(T_n - \theta_0) = \sqrt{n}\left(\tilde\theta_n - \theta_0 - \frac{L'(\tilde\theta_n)}{L''(\tilde\theta_n)}\right) = \sqrt{n}\left(\tilde\theta_n - \theta_0 - \frac{L'(\theta_0)}{L''(\tilde\theta_n)} - (\tilde\theta_n - \theta_0)\,\frac{L''(\theta_n^*)}{L''(\tilde\theta_n)}\right)$$
$$= -\frac{\frac{1}{\sqrt{n}}L'(\theta_0)}{\frac{1}{n}L''(\tilde\theta_n)} + \sqrt{n}\,(\tilde\theta_n - \theta_0)\left[1 - \frac{\frac{1}{n}L''(\theta_n^*)}{\frac{1}{n}L''(\tilde\theta_n)}\right].$$
Under the conditions of Theorem 6.2.1,
$$\frac{1}{\sqrt{n}}L'(\theta_0) \xrightarrow{d} N\left(0, I(\theta_0)\right), \qquad \frac{1}{n}L''(\tilde\theta_n) \xrightarrow{P} -I(\theta_0), \qquad \frac{1}{n}L''(\theta_n^*) \xrightarrow{P} -I(\theta_0).$$
Hence the term in square brackets is $o_p(1)$. Thus
$$\sqrt{n}\,(T_n - \theta_0) = \frac{-\frac{1}{\sqrt{n}}L'(\theta_0)}{\frac{1}{n}L''(\tilde\theta_n)} + o_p(1) \xrightarrow{d} N\left(0, I^{-1}(\theta_0)\right),$$
and so we have proved the following theorem.

Theorem 6.3.9. Suppose the conditions of Theorem 6.2.1 hold and
$$\tilde\theta_n = \theta_0 + O_p\left(\frac{1}{\sqrt{n}}\right).$$
Then
$$T_n := \tilde\theta_n - \frac{L'(\tilde\theta_n)}{L''(\tilde\theta_n)} \quad\text{is asymptotically efficient.}$$
Corollary 6.3.10. If $I(\theta)$ is continuous, then the estimator
$$T'_n = \tilde\theta_n + \frac{L'(\tilde\theta_n)}{n\,I(\tilde\theta_n)}$$
is asymptotically efficient.

Proof.
$$\sqrt{n}\,(T'_n - T_n) = \sqrt{n}\left(\frac{L'(\tilde\theta_n)}{L''(\tilde\theta_n)} + \frac{L'(\tilde\theta_n)}{n\,I(\tilde\theta_n)}\right) = L'(\tilde\theta_n)\left(\frac{\sqrt{n}}{L''(\tilde\theta_n)} + \frac{1}{\sqrt{n}\,I(\tilde\theta_n)}\right)$$
$$= \frac{L'(\tilde\theta_n)}{\sqrt{n}}\cdot\frac{I(\tilde\theta_n) + \frac{1}{n}L''(\tilde\theta_n)}{I(\tilde\theta_n)\,\frac{1}{n}L''(\tilde\theta_n)} = o_p(1).$$
(The first factor is $O_p(1)$, the numerator of the second factor is $o_p(1)$, and the denominator $\xrightarrow{P} -I(\theta_0)^2$.)
Example 6.3.11 (Location family). Suppose that $X_1, X_2, \dots$ are iid $f(x - \theta)$, where $f$ is differentiable and symmetric, $f(x) > 0$ for all $x$, and $f'$ is continuous. Then the conditions of Theorem 6.1.6 hold:
(A0): $P_\theta \ne P_{\theta'}$ for $\theta \ne \theta'$;
(A1): common support;
(A2): $\frac{dP_\theta}{d\mu}(x) = f(x - \theta)$;
(A3): $\theta \in \mathrm{int}(\Omega) = (-\infty, \infty)$.
Hence
$$L'(\theta, \mathbf{X}) = -\sum_{i=1}^n \frac{f'(X_i - \theta)}{f(X_i - \theta)} = 0 \qquad (6.3.1)$$
has a sequence of roots $\hat\theta_n$ for $n$ large such that $\hat\theta_n \to \theta_0$ a.s. $P_{\theta_0}$. Since the likelihood $\prod_i f(X_i - \theta) \to 0$ as $\theta \to \pm\infty$, $L(\theta, \mathbf{X})$ must have a max; however, there may be several solutions of (6.3.1).
Provided $f(x - \theta)$ satisfies conditions (v) and (vi) of Theorem 6.2.1, i.e.
$$E_{\theta_0}\left[\sup_{|\theta - \theta_0| < \delta}\left|\frac{\partial^2}{\partial\theta^2}\log f(x - \theta) - \frac{\partial^2}{\partial\theta^2}\log f(x - \theta_0)\right|\right] < \infty$$
and $\frac{\partial^2}{\partial\theta^2}\log f(x - \theta)$ is continuous in $\theta$ for all $x$, then all the conditions of Theorem 6.2.1 (apart from (ii)) are satisfied.
Also, $\bar{X}_n$ is $AN\left(\theta, \frac{\sigma^2}{n}\right)$ (when $\sigma^2 = \mathrm{Var}\,X_1 < \infty$), and hence
$$\bar{X}_n = \theta + O_p\left(\frac{1}{\sqrt{n}}\right).$$
Corollary 6.3.10 therefore implies the asymptotic efficiency of
$$T_n = \bar{X}_n - \frac{\sum_{i=1}^n \dfrac{f'(X_i - \bar{X}_n)}{f(X_i - \bar{X}_n)}}{n\,I},$$
where
$$I = \int\left(\frac{f'(y)}{f(y)}\right)^2 f(y)\,dy.$$