Stat 643 Results, Fall 2000
Sufficiency (and Ancillarity and Completeness)

Roughly, $T(X)$ is sufficient for $\{P_\theta\}_{\theta\in\Theta}$ (or for $\theta$) provided the conditional distribution of $X$ given $T(X)$ is the same for each $\theta\in\Theta$.

Example  Let $X=(X_1,X_2,\dots,X_n)$, where the $X_i$ are iid Bernoulli$(p)$ and $T(X)=\sum_{i=1}^n X_i$. You may show that $P_p[X=x\mid T=t]$ is the same for each $p$.
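Not part of the original notes, but the claim in this example is easy to check by brute-force enumeration; the Python sketch below computes $P_p[X=x\mid T=t]$ for every binary $x$ and two very different values of $p$.

```python
from itertools import product
from math import isclose

def cond_dist(n, p):
    """For X = (X_1,...,X_n) iid Bernoulli(p) and T = sum(X_i), return
    the conditional pmf P_p[X = x | T = t] as a dict keyed by (x, t)."""
    joint = {}
    for x in product([0, 1], repeat=n):
        t = sum(x)
        joint[x] = (t, p ** t * (1 - p) ** (n - t))
    # marginal P_p[T = t]
    pt = {}
    for x, (t, prob) in joint.items():
        pt[t] = pt.get(t, 0.0) + prob
    return {(x, t): prob / pt[t] for x, (t, prob) in joint.items()}

# The conditional distribution is the same for every p: each x with T(x) = t
# gets probability 1 / C(n, t), which is free of p.
d1 = cond_dist(4, 0.3)
d2 = cond_dist(4, 0.8)
same = all(isclose(d1[k], d2[k]) for k in d1)
print(same)
```

Each conditional probability comes out as $1/\binom{n}{t}$ regardless of $p$, which is exactly the "same for each $p$" claim.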
In some sense, a sufficient statistic carries all the information in $X$. (A variable with the same distribution as $X$ can be constructed from $T(X)$ and some randomization whose form does not depend upon $\theta$.) The $64 question is how to recognize one. The primary tool is the "Factorization Theorem" in its various incarnations. This says roughly that $T$ is sufficient iff
$$f_\theta(x) = g_\theta(T(x))\,h(x)\,.$$
Example (Page 20 Lehmann TSH)  Suppose that $X$ is discrete. Then $T(X)$ is sufficient for $\theta$ iff there exist nonnegative functions $g_\theta(\cdot)$ and $h(\cdot)$ such that $P_\theta[X=x]=g_\theta(T(x))h(x)$. (You can and should prove this yourself.)
The Stat 643 version of the above result is considerably more general and relies on measure-theoretic machinery. Basically, one can "easily" push the factorization result as far as "families of distributions dominated by a $\sigma$-finite measure." The following standard notation will be used throughout these notes:

$(\mathcal{X},\mathcal{B},P_\theta)$  a probability space (for fixed $\theta$)
$\mathcal{P}=\{P_\theta\}_{\theta\in\Theta}$
$E_\theta(g(X)) = \int g\,dP_\theta$
$T:(\mathcal{X},\mathcal{B})\to(\mathcal{T},\mathcal{F})$  a statistic
$\mathcal{B}_0 = \mathcal{B}(T) = \{T^{-1}(F)\}_{F\in\mathcal{F}}$  the $\sigma$-algebra generated by $T$
Definition 1  $T$ (or $\mathcal{B}_0$) is sufficient for $\mathcal{P}$ if for every $B\in\mathcal{B}$ there exists a $\mathcal{B}_0$-measurable function $Y$ such that for all $\theta\in\Theta$
$$Y = E_\theta(I_B\mid\mathcal{B}_0) \text{ a.s. } P_\theta\,.$$
(Note that $E_\theta(I_B\mid\mathcal{B}_0) = P_\theta(B\mid\mathcal{B}_0)$. This definition says that there is a single random variable that will serve as a conditional probability of $B$ given $\mathcal{B}_0$ for all $\theta$ in the parameter space.)
Definition 2  $\mathcal{P}$ is dominated by a $\sigma$-finite measure $\mu$ if $P_\theta \ll \mu\ \forall\,\theta\in\Theta$.

Then, assuming that $\mathcal{P}$ is dominated by $\mu$, standard notation is to let
$$f_\theta = \frac{dP_\theta}{d\mu}$$
and it is these objects (R-N derivatives) that factor nicely iff $T$ is sufficient. To prove this involves some nontrivial technicalities.
Definition 3  $\mathcal{P}$ has a countable equivalent subset if there exist a sequence of positive constants $\{c_i\}_{i=1}^\infty$ with $\sum c_i = 1$ and a corresponding sequence $\{\theta_i\}_{i=1}^\infty$ such that $\lambda = \sum c_i P_{\theta_i}$ is equivalent to $\mathcal{P}$ in the sense that $\lambda(B)=0$ iff $P_\theta(B)=0\ \forall\,\theta$.

Notation: Assuming that $\mathcal{P}$ has a countable equivalent subset, let $\lambda$ be a choice of $\sum c_i P_{\theta_i}$ as in Definition 3.
Theorem 1 (Page 575 Lehmann TSH)  $\mathcal{P}$ is dominated by a $\sigma$-finite measure $\mu$ iff it has a countable equivalent subset.

Theorem 2 (Page 54 Lehmann TSH)  Suppose that $\mathcal{P}$ is dominated by a $\sigma$-finite measure and $T:(\mathcal{X},\mathcal{B})\to(\mathcal{T},\mathcal{F})$. Then $T$ is sufficient for $\mathcal{P}$ iff there exist nonnegative $\mathcal{F}$-measurable functions $g_\theta$ such that $\forall\,\theta$
$$\frac{dP_\theta}{d\lambda}(x) = g_\theta(T(x)) \text{ a.s. } \lambda\,.$$
(This last equality is equivalent to $P_\theta(B) = \int_B g_\theta(T(x))\,d\lambda(x)\ \forall\,B\in\mathcal{B}$.)
Theorem 3 (Page 55 Lehmann TSH) ("The" Factorization Theorem)  Suppose that $\mathcal{P}\ll\mu$ where $\mu$ is $\sigma$-finite. Then $T$ is sufficient for $\mathcal{P}$ iff there exist a nonnegative $\mathcal{B}$-measurable function $h$ and nonnegative $\mathcal{F}$-measurable functions $g_\theta$ such that $\forall\,\theta$
$$\frac{dP_\theta}{d\mu}(x) = f_\theta(x) = g_\theta(T(x))\,h(x) \text{ a.s. } \mu\,.$$

Standard/elementary applications of Theorem 3 are to cases where:
1) $\mu$ is Lebesgue measure on $\mathbb{R}^d$, $\mathcal{P}$ is some parametric family of $d$-dimensional distributions, and the $f_\theta$ are $d$-dimensional probability densities, or
2) $\mu$ is counting measure on some countable set, $\mathcal{P}$ is some parametric family of discrete distributions and the $f_\theta$ are probability mass functions.

But it can also be applied much more widely (e.g. to cases where $\mu$ is the sum of $d$-dimensional Lebesgue measure and counting measure on some countable subset of $\mathbb{R}^d$).
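To make application 1) concrete, here is a small numerical sketch (my own choice of model, not from the notes): for an iid N$(\theta,1)$ sample, $f_\theta(x)=\prod_i \phi(x_i-\theta)$ factors as $g_\theta(T(x))h(x)$ with $T(x)=\sum x_i$, $g_\theta(t)=\exp(\theta t - n\theta^2/2)$ and $h(x)=(2\pi)^{-n/2}\exp(-\sum x_i^2/2)$.

```python
import math

def f(theta, xs):
    """Joint N(theta, 1) density of the sample xs."""
    return math.prod(
        math.exp(-(xi - theta) ** 2 / 2) / math.sqrt(2 * math.pi) for xi in xs
    )

def g(theta, t, n):
    """Factor that touches the data only through t = sum(xs)."""
    return math.exp(theta * t - n * theta ** 2 / 2)

def h(xs):
    """Factor that is free of theta."""
    n = len(xs)
    return (2 * math.pi) ** (-n / 2) * math.exp(-sum(xi ** 2 for xi in xs) / 2)

xs = [0.3, -1.2, 2.5, 0.0]
for theta in (-1.0, 0.0, 0.7):
    assert math.isclose(f(theta, xs), g(theta, sum(xs), len(xs)) * h(xs))
print("factorization holds")
```

By Theorem 3, this algebraic identity is precisely why $\sum X_i$ is sufficient for $\theta$ in the normal-mean model.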
Example  Theorem 3 can be used to show that for $\mathcal{X}=\mathbb{R}^d$, $X=(X_1,X_2,\dots,X_d)$, $\mathcal{B}=\mathcal{B}^d$ the $d$-dimensional Borel $\sigma$-algebra, $\pi=(\pi_1,\pi_2,\dots,\pi_d)$ a generic permutation of the first $d$ positive integers, $\pi(X)=(X_{\pi_1},X_{\pi_2},\dots,X_{\pi_d})$, $\mu$ the $d$-dimensional Lebesgue measure and $\mathcal{P}=\{P \mid P\ll\mu$ and $f=\frac{dP}{d\mu}$ satisfies $f(\pi(x))=f(x)$ a.s. $\mu$ for all permutations $\pi\}$, the order statistic $T(X)=(X_{(1)},X_{(2)},\dots,X_{(d)})$ is sufficient.

Actually, it is an important fact (that does not follow from the factorization theorem) that one can establish the sufficiency of the order statistic even more broadly. With $\mathcal{P}=\{P$ on $\mathbb{R}^d$ such that $P(B)=P(\pi(B))$ for all permutations $\pi\}$, direct argument shows that the order statistic is sufficient. (You should try to prove this. For $B\in\mathcal{B}$, try $Y(x)$ equal to the fraction of all permutations that place $x$ in $B$.)
(Obvious) Result  If $T(X)$ is sufficient for $\mathcal{P}$ and $\mathcal{P}_0$ is a subset of $\mathcal{P}$, $T(X)$ is sufficient for $\mathcal{P}_0$.

A natural question is "How far can one go in the direction of 'reducing' $X$ without throwing something away?" The notion of minimal sufficiency is aimed in the direction of answering this question.
Definition 4  A sufficient statistic $T:(\mathcal{X},\mathcal{B})\to(\mathcal{T},\mathcal{F})$ is minimal sufficient (for $\theta$ or for $\mathcal{P}$) provided for every sufficient statistic $S:(\mathcal{X},\mathcal{B})\to(\mathcal{S},\mathcal{G})$ there exists a (?measurable?) function $u:\mathcal{S}\to\mathcal{T}$ such that
$$T = u\circ S \text{ a.e. } \mathcal{P}\,.$$
(It is probably equivalent or close to equivalent that $\mathcal{B}(T)\subset\mathcal{B}(S)$.)

There are a couple of ways of identifying a minimal sufficient statistic.
Theorem 4 (Like page 92 of Schervish ... other development on pages 41-43 of Lehmann TPE)  Suppose that $\mathcal{P}$ is dominated by a $\sigma$-finite measure $\mu$ and $T:(\mathcal{X},\mathcal{B})\to(\mathcal{T},\mathcal{F})$ is sufficient for $\mathcal{P}$. Suppose that there exist versions of the R-N derivatives $\frac{dP_\theta}{d\mu}(x)=f_\theta(x)$ such that
(the existence of $k(x,y)>0$ such that $f_\theta(y)=f_\theta(x)k(x,y)\ \forall\,\theta$) implies $T(x)=T(y)$;
then $T$ is minimal sufficient.

(A trivial "generalization" of Theorem 4 is to suppose that the condition only holds for a set $\mathcal{X}^*\subset\mathcal{X}$ with $P_\theta$ probability 1 for all $\theta$. The condition in Theorem 4 is essentially that $T(x)=T(y)$ iff $\frac{f_\theta(y)}{f_\theta(x)}$ is free of $\theta$.)

Note that sufficiency of $T(X)$ means that $T(x)=T(y)$ implies (the existence of $k(x,y)>0$ such that $f_\theta(y)=f_\theta(x)k(x,y)\ \forall\,\theta$). So the condition in Theorem 4 could be phrased as an iff condition.
Example  Let $X=(X_1,X_2,\dots,X_n)$, where the $X_i$ are iid Bernoulli$(p)$ and $T(X)=\sum_{i=1}^n X_i$. You may use Theorem 4 to show that $T$ is minimal sufficient.
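A quick numerical illustration (mine, not the notes') of the Theorem 4 criterion in this example: the likelihood ratio $f_p(y)/f_p(x)$ is free of $p$ exactly when $\sum y_i = \sum x_i$.

```python
import math

def f(p, x):
    """Bernoulli(p) joint pmf of a binary tuple x."""
    t = sum(x)
    return p ** t * (1 - p) ** (len(x) - t)

def ratio_free_of_p(x, y, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """True iff f_p(y)/f_p(x) is (numerically) constant over a grid of p."""
    ratios = [f(p, y) / f(p, x) for p in grid]
    return all(math.isclose(r, ratios[0]) for r in ratios)

x = (1, 0, 1, 0)        # T(x) = 2
y_same = (0, 1, 0, 1)   # T(y) = 2: ratio is constant (equal to 1) in p
y_diff = (1, 1, 1, 0)   # T(y) = 3: ratio is p/(1-p), which varies with p
print(ratio_free_of_p(x, y_same), ratio_free_of_p(x, y_diff))
```

The grid of $p$ values stands in for the "free of $\theta$" check; algebraically the ratio is $1$ in the first case and $p/(1-p)$ in the second.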
Lehmann emphasizes a second (?more technical and less illuminating?) method of establishing minimality based on the following two theorems.

Theorem 5 (Page 41 of Lehmann TPE)  Suppose that $\mathcal{P}=\{P_i\}_{i=0}^k$ is finite and let $\{f_i\}$ be the R-N derivatives with respect to some dominating $\sigma$-finite measure (like, e.g., $\sum_{i=0}^k P_i$). Suppose that all $k+1$ derivatives $f_i$ are positive on all of $\mathcal{X}$. Then
$$T(X) = \left(\frac{f_1(x)}{f_0(x)}, \frac{f_2(x)}{f_0(x)}, \dots, \frac{f_k(x)}{f_0(x)}\right)$$
is minimal sufficient.

(Again it is a trivial "generalization" to observe that the minimality also holds if there is a set $\mathcal{X}^*\subset\mathcal{X}$ with $P_\theta$ probability 1 for all $\theta$ where the $k+1$ derivatives $f_i$ are positive on all of $\mathcal{X}^*$.)
Theorem 6 (Page 42 of Lehmann TPE)  Suppose that $\mathcal{P}$ is a family of mutually absolutely continuous distributions and that $\mathcal{P}_0\subset\mathcal{P}$. If $T$ is sufficient for $\mathcal{P}$ and minimal sufficient for $\mathcal{P}_0$, then it is minimal sufficient for $\mathcal{P}$.

The way that Theorems 5 and 6 are used together is to look at a few $\theta$'s, invoke Theorem 5 to establish minimal sufficiency for the finite family of a vector made up of a few likelihood ratios and then use Theorem 6 to infer minimality for the whole family.
Example  Let $X=(X_1,X_2,\dots,X_n)$, where the $X_i$ are iid Beta$(\alpha,\beta)$. Using $P_0$ the Beta$(1,1)$ distribution, $P_1$ the Beta$(2,1)$ distribution and $P_2$ the Beta$(1,2)$ distribution, Theorems 5 and 6 can be used to show that $\left(\prod_{i=1}^n X_i,\ \prod_{i=1}^n (1-X_i)\right)$ is minimal sufficient for the Beta family.
Definition 5  A statistic $T:(\mathcal{X},\mathcal{B})\to(\mathcal{T},\mathcal{F})$ is said to be ancillary for $\mathcal{P}=\{P_\theta\}_{\theta\in\Theta}$ if the distribution of $T$ (i.e., the measure $P_\theta^T$ on $(\mathcal{T},\mathcal{F})$ defined by $P_\theta^T(F)=P_\theta(T^{-1}(F))$) does not depend upon $\theta$ (i.e. is the same for all $\theta$).

Definition 6  A statistic $T:(\mathcal{X},\mathcal{B})\to(\mathbb{R}^1,\mathcal{B}^1)$ is said to be first order ancillary for $\mathcal{P}=\{P_\theta\}_{\theta\in\Theta}$ if the mean of $T$ (i.e., $E_\theta(T(X))=\int T\,dP_\theta$) does not depend upon $\theta$ (i.e. is the same for all $\theta$).

Example  Let $X=(X_1,X_2,\dots,X_n)$, where the $X_i$ are iid N$(\theta,1)$. The sample variance of the $X_i$'s is ancillary.
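The reason is that writing $X_i = \theta + Z_i$ with $Z_i$ iid N$(0,1)$ makes the sample variance of the $X_i$ equal to that of the $Z_i$, whatever $\theta$ is. A Monte Carlo sketch (my own illustration) makes the shift argument visible: with a common random seed, the underlying $Z_i$ draws coincide, so the simulated sample variances agree up to rounding for two very different $\theta$'s.

```python
import random
import statistics

def sample_variances(theta, n=10, reps=20000, seed=0):
    """Simulate the sample variance of n iid N(theta, 1) draws, reps times."""
    rng = random.Random(seed)
    out = []
    for _ in range(reps):
        xs = [theta + rng.gauss(0, 1) for _ in range(n)]
        out.append(statistics.variance(xs))
    return out

m0 = statistics.mean(sample_variances(0.0))
m5 = statistics.mean(sample_variances(5.0))
# Both means sit near 1 (= E sample variance), independent of theta.
print(round(m0, 3), round(m5, 3))
```

The agreement of the two runs is exactly the "distribution free of $\theta$" property that Definition 5 asks for.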
Lehmann argues that, roughly speaking, models with nice (low dimensional) minimal sufficient statistics $T$ are ones where no nontrivial function of $T$ is ancillary (or first order ancillary). This line of thinking brings up the notion of completeness.
Notation: Continuing to let $T:(\mathcal{X},\mathcal{B})\to(\mathcal{T},\mathcal{F})$ and $P_\theta^T$ be the measure on $(\mathcal{T},\mathcal{F})$ defined by $P_\theta^T(F)=P_\theta(T^{-1}(F))$, let $\mathcal{P}^T=\{P_\theta^T\}_{\theta\in\Theta}$.

Definition 7a  $\mathcal{P}^T$ or $T$ is complete (boundedly complete) for $\mathcal{P}$ or $\theta$ if
i) $U:(\mathcal{X},\mathcal{B}_0)\to(\mathbb{R}^1,\mathcal{B}^1)$ $\mathcal{B}_0$-measurable (and bounded), and
ii) $E_\theta U(X) = 0\ \forall\,\theta$
imply that $U=0$ a.s. $P_\theta\ \forall\,\theta$.

By virtue of Lehmann's Theorem (see Cressie's Probability Summary) this is equivalent to the next definition.

Definition 7b  $\mathcal{P}^T$ or $T$ is complete (boundedly complete) for $\mathcal{P}$ or $\theta$ if
i) $h:(\mathcal{T},\mathcal{F})\to(\mathbb{R}^1,\mathcal{B}^1)$ $\mathcal{F}$-measurable (and bounded), and
ii) $E_\theta h(T(X)) = 0\ \forall\,\theta$
imply that $h\circ T = 0$ a.s. $P_\theta\ \forall\,\theta$.

These definitions say that there is no nontrivial function of $T$ that is an unbiased estimator of 0. This is equivalent to saying that there is no nontrivial function of $T$ that is first order ancillary. Notice also that completeness implies bounded completeness, so that bounded completeness is a slightly weaker condition than completeness.

Example  Let $X=(X_1,X_2,\dots,X_n)$, where the $X_i$ are iid Bernoulli$(p)$ and $T(X)=\sum_{i=1}^n X_i$. You may use the fact that polynomials can be identically 0 on an interval only if their coefficients are all 0, to show that $T(X)$ is complete for $p\in[0,1]$.
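The polynomial argument can be mirrored numerically (my sketch, not from the notes): $E_p h(T) = \sum_t h(t)\binom{n}{t}p^t(1-p)^{n-t}$ is a polynomial of degree $n$ in $p$, so if it vanishes at $n+1$ distinct values of $p$, the $(n+1)\times(n+1)$ linear system forces $h\equiv 0$.

```python
import copy
from math import comb

def expectation_matrix(n, ps):
    """One row per p, with entries C(n,t) p^t (1-p)^(n-t) for t = 0..n,
    so that (matrix @ h) lists E_p h(T) at each p."""
    return [[comb(n, t) * p ** t * (1 - p) ** (n - t) for t in range(n + 1)]
            for p in ps]

def free_column(mat):
    """Gaussian elimination with partial pivoting; return the index of a
    free column if mat is (numerically) singular, else None."""
    m = copy.deepcopy(mat)
    size = len(m)
    for col in range(size):
        piv = max(range(col, size), key=lambda r: abs(m[r][col]))
        if abs(m[piv][col]) < 1e-12:
            return col  # nontrivial null space: some h != 0 works
        m[col], m[piv] = m[piv], m[col]
        for r in range(size):
            if r != col:
                fac = m[r][col] / m[col][col]
                m[r] = [a - fac * b for a, b in zip(m[r], m[col])]
    return None  # full rank: only h = 0 gives E_p h(T) = 0 at all these p

n = 4
ps = [0.1, 0.25, 0.5, 0.75, 0.9]  # n + 1 distinct values of p
print(free_column(expectation_matrix(n, ps)))
```

Full rank here is the finite-dimensional shadow of completeness: no nonzero $h$ can make $E_p h(T)$ vanish even at these five $p$'s, let alone on all of $[0,1]$.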
(Obvious) Result 7  If $T$ is complete for $\mathcal{P}=\{P_\theta\}_{\theta\in\Theta}$ and $\Theta'\subset\Theta$, $T$ need not be complete for $\mathcal{P}'=\{P_\theta\}_{\theta\in\Theta'}$.

Theorem 8  If $\Theta'\subset\Theta$ and $\Theta'$ dominates $\Theta$ in the sense that $P_\theta(B)=0\ \forall\,\theta\in\Theta'$ implies that $P_\theta(B)=0\ \forall\,\theta\in\Theta$, then $T$ (boundedly) complete for $\mathcal{P}'=\{P_\theta\}_{\theta\in\Theta'}$ implies that $T$ is (boundedly) complete for $\mathcal{P}=\{P_\theta\}_{\theta\in\Theta}$.
Completeness of a statistic says that there is no part of it that is somehow irrelevant in and of itself to inference about $\theta$. That is reminiscent of minimal sufficiency.

Theorem 9 (Bahadur's Theorem)  Suppose that $T:(\mathcal{X},\mathcal{B})\to(\mathcal{T},\mathcal{F})$ is sufficient and boundedly complete.
i) (Pages 94-95 of Schervish) If $(\mathcal{T},\mathcal{F})=(\mathbb{R}^d,\mathcal{B}^d)$ then $T$ is minimal sufficient.
ii) If there is any minimal sufficient statistic, $T$ is minimal sufficient.

Example  Suppose that $\mathcal{P}\ll\mu$ (where $\mu$ is $\sigma$-finite) and that for $\theta\in\Theta\subset\mathbb{R}^k$
$$\frac{dP_\theta}{d\mu}(x) = c(\theta)\exp\left\{\sum_{i=1}^k \theta_i T_i(x)\right\}\,.$$
Provided, for example, that $\Theta$ contains an open rectangle in $\mathbb{R}^k$, Theorem 1 page 142 Lehmann TSH implies that $T(X)=(T_1(X),T_2(X),\dots,T_k(X))$ is complete. It is clearly sufficient by the factorization criterion. So by part i) of Theorem 9, $T$ is minimal sufficient.
Some Facts About Common Statistical Models

Bayes Models

The "Bayes" approach to statistical problems is to add additional model assumptions to the basic structure introduced thus far. That is, to the standard assumptions concerning $\mathcal{P}=\{P_\theta\}_{\theta\in\Theta}$, "Bayesians" add a distribution on the parameter space $\Theta$. That is, suppose that
$$G \text{ is a distribution on } (\Theta,\mathcal{C})$$
where it may be convenient to suppose that $G$ is dominated by some $\sigma$-finite measure $\nu$ and that
$$\frac{dG}{d\nu}(\theta) = g(\theta)\,.$$
If, considered as a function of both $x$ and $\theta$, $f_\theta(x)$ is $\mathcal{B}\times\mathcal{C}$ measurable, then there exists a probability (a joint distribution for $(X,\theta)$ on $(\mathcal{X}\times\Theta,\mathcal{B}\times\mathcal{C})$) defined by
$$\pi^{X,\theta}(A) = \int_A f_\theta(x)\,d(\mu\times G)(x,\theta) = \int_A f_\theta(x)g(\theta)\,d(\mu\times\nu)(x,\theta)\,.$$
This distribution has marginals
$$\pi^{X}(B) = \int_{B\times\Theta} f_\theta(x)\,d(\mu\times G)(x,\theta) = \int_B\int_\Theta f_\theta(x)\,dG(\theta)\,d\mu(x)$$
(so that $\frac{d\pi^{X}}{d\mu} = \int_\Theta f_\theta(x)\,dG(\theta) = \int_\Theta f_\theta(x)g(\theta)\,d\nu(\theta)$) and
$$\pi^{\theta}(C) = \int_{\mathcal{X}\times C} f_\theta(x)\,d(\mu\times G)(x,\theta) = \int_C\int_{\mathcal{X}} f_\theta(x)\,d\mu(x)\,dG(\theta) = G(C)\,.$$
Further, the conditionals (regular conditional probabilities) are computable as
$$\pi^{X\mid\theta}(B\mid\theta) = \int_B f_\theta(x)\,d\mu(x) = P_\theta(B)$$
and
$$\pi^{\theta\mid X}(C\mid x) = \int_C \frac{f_\theta(x)}{\int_\Theta f_\theta(x)\,dG(\theta)}\,dG(\theta)\,.$$
Note then that
$$\frac{d\pi^{\theta\mid X}}{dG}(\theta\mid x) = \frac{f_\theta(x)}{\int_\Theta f_\theta(x)\,dG(\theta)}$$
and thus that
$$\frac{d\pi^{\theta\mid X}}{d\nu}(\theta\mid x) = \frac{f_\theta(x)g(\theta)}{\int_\Theta f_\theta(x)g(\theta)\,d\nu(\theta)}$$
is the usual "posterior density" for $\theta$ given $X=x$.

Bayesians use the posterior distribution or density (THAT DEPENDS ON ONE'S CHOICE OF $G$) as their basis of inference for $\theta$. Pages 84-88 of Schervish concern a Bayesian formulation of sufficiency and argue (basically) that the Bayesian form is equivalent to the "classical" form used above.
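A standard concrete instance of the posterior-density formula (my example, not in the notes) is the Beta-Bernoulli pair: with prior $G=$ Beta$(a,b)$ on $p$ and $x$ a vector of $n$ iid Bernoulli$(p)$ observations with $t=\sum x_i$, the posterior is Beta$(a+t,\ b+n-t)$. The sketch below evaluates $f_p(x)g(p)/\int f_p(x)g(p)\,dp$ by a grid integral and compares it with the closed form.

```python
import math

def beta_pdf(p, a, b):
    """Density of the Beta(a, b) distribution at p."""
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return p ** (a - 1) * (1 - p) ** (b - 1) / B

def posterior_numeric(x, a, b, p):
    """f_p(x) g(p) / integral of f g, integral done by a midpoint grid."""
    t, n = sum(x), len(x)
    like = lambda q: q ** t * (1 - q) ** (n - t)
    grid = [(i + 0.5) / 10000 for i in range(10000)]
    norm = sum(like(q) * beta_pdf(q, a, b) for q in grid) / 10000
    return like(p) * beta_pdf(p, a, b) / norm

x = (1, 0, 1, 1, 0, 1)   # t = 4, n = 6
a, b = 2.0, 3.0
p = 0.55
num = posterior_numeric(x, a, b, p)
closed = beta_pdf(p, a + sum(x), b + len(x) - sum(x))  # Beta(a+t, b+n-t)
print(round(num, 4), round(closed, 4))
```

The two numbers agree to grid accuracy; the data enter only through $t=\sum x_i$, which previews the Bayesian view of sufficiency mentioned above.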
Exponential Families

Many common families of distributions can be treated with a single set of analyses by recognizing them to be of a common "exponential family" form. The following is a list of facts about such families.

Definition 8  Suppose that $\mathcal{P}=\{P_\theta\}$ is dominated by a $\sigma$-finite measure $\mu$. $\mathcal{P}$ is called an exponential family if for some $h(x)\ge 0$,
$$f_\theta(x) \equiv \frac{dP_\theta}{d\mu}(x) = \exp\left\{a(\theta) + \sum_{i=1}^k \eta_i(\theta)T_i(x)\right\}h(x)\ \forall\,\theta\,.$$
Definition 9  A family of probability measures $\mathcal{P}=\{P_\theta\}$ is said to be identifiable if $\theta_1\ne\theta_2$ implies that $P_{\theta_1}\ne P_{\theta_2}$.

Fact(s) 10  Let $\eta=(\eta_1,\eta_2,\dots,\eta_k)$, $\Gamma=\{\eta\in\mathbb{R}^k \mid \int h(x)\exp\{\sum\eta_i T_i(x)\}\,d\mu(x) < \infty\}$, and consider also the family of distributions (on $\mathcal{X}$) with R-N derivatives wrt $\mu$ of the form
$$f_\eta(x) = K(\eta)\exp\left\{\sum_{i=1}^k \eta_i T_i(x)\right\}h(x)\,,\quad \eta\in\Gamma\,.$$
Call this family $\mathcal{P}^*$. This set of distributions on $\mathcal{X}$ is at least as large as $\mathcal{P}$ (and this second parameterization is mathematically nicer than the first). $\Gamma$ is called the natural parameter space for $\mathcal{P}^*$. It is a convex subset of $\mathbb{R}^k$. (TSH, page 57.) If $\Gamma$ lies in a subspace of dimension less than $k$, then $f_\eta(x)$ (and therefore $f_\theta(x)$) can be written in a form involving fewer than $k$ statistics $T_i$. We will henceforth assume $\Gamma$ to be fully $k$ dimensional. Note that depending upon the nature of the functions $\eta_i(\theta)$ and the parameter space $\Theta$, $\mathcal{P}$ may be a proper subset of $\mathcal{P}^*$. That is, defining $\Gamma_\Theta = \{(\eta_1(\theta),\eta_2(\theta),\dots,\eta_k(\theta))\in\mathbb{R}^k \mid \theta\in\Theta\}$, $\Gamma_\Theta$ can be a proper subset of $\Gamma$.
Fact 11  The "support" of $P_\theta$, defined as $\{x\mid f_\theta(x)>0\}$, is clearly $\{x\mid h(x)>0\}$, which is independent of $\theta$. The distributions in $\mathcal{P}$ are thus mutually absolutely continuous.

Fact 12  From the Factorization Theorem, the statistic $T=(T_1,T_2,\dots,T_k)$ is sufficient for $\mathcal{P}$.

Fact 13  $T$ has induced measures $\{P_\theta^T \mid \theta\in\Theta\}$ which also form an exponential family.

Fact 14  If $\Gamma_\Theta$ contains an open rectangle in $\mathbb{R}^k$, then $T$ is complete for $\mathcal{P}$. (See pages 142-143 of TSH.)

Fact 15  If $\Gamma_\Theta$ contains an open rectangle in $\mathbb{R}^k$ (and actually under the much weaker assumptions given on page 44 of TPE) $T$ is minimal sufficient for $\mathcal{P}$.
Fact 16  If $g$ is any measurable real valued function such that $E_\eta|g(X)| < \infty$, then
$$E_\eta\, g(X) = \int g(x)f_\eta(x)\,d\mu(x)$$
is continuous on $\Gamma$ and has continuous partial derivatives of all orders on the interior of $\Gamma$. These can be calculated as
$$\frac{\partial^{\alpha_1+\alpha_2+\cdots+\alpha_k}}{\partial\eta_1^{\alpha_1}\partial\eta_2^{\alpha_2}\cdots\partial\eta_k^{\alpha_k}}\,E_\eta\, g(X) = \int g(x)\,\frac{\partial^{\alpha_1+\alpha_2+\cdots+\alpha_k}}{\partial\eta_1^{\alpha_1}\partial\eta_2^{\alpha_2}\cdots\partial\eta_k^{\alpha_k}}\,f_\eta(x)\,d\mu(x)\,.$$
(See page 59 of TSH.)
Fact 17  If for $u=(u_1,u_2,\dots,u_k)$, both $\eta_0$ and $\eta_0+u$ belong to $\Gamma$,
$$E_{\eta_0}\exp\{u_1T_1(X)+u_2T_2(X)+\cdots+u_kT_k(X)\} = \frac{K(\eta_0)}{K(\eta_0+u)}\,.$$
Further, if $\eta_0$ is in the interior of $\Gamma$, then
$$E_{\eta_0}\left(T_1^{\alpha_1}(X)T_2^{\alpha_2}(X)\cdots T_k^{\alpha_k}(X)\right) = K(\eta_0)\left.\frac{\partial^{\alpha_1+\alpha_2+\cdots+\alpha_k}}{\partial\eta_1^{\alpha_1}\partial\eta_2^{\alpha_2}\cdots\partial\eta_k^{\alpha_k}}\left(\frac{1}{K(\eta)}\right)\right|_{\eta=\eta_0}\,.$$
In particular,
$$E_{\eta_0}T_j(X) = \left.\frac{\partial}{\partial\eta_j}\left(-\log K(\eta)\right)\right|_{\eta=\eta_0}\,,\quad \mathrm{Var}_{\eta_0}T_j(X) = \left.\frac{\partial^2}{\partial\eta_j^2}\left(-\log K(\eta)\right)\right|_{\eta=\eta_0}\,,$$
and
$$\mathrm{Cov}_{\eta_0}(T_j(X),T_l(X)) = \left.\frac{\partial^2}{\partial\eta_j\partial\eta_l}\left(-\log K(\eta)\right)\right|_{\eta=\eta_0}\,.$$
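As a check on Fact 17 (my example): the Poisson$(\lambda)$ family is exponential with $\eta=\log\lambda$, $T(x)=x$, $h(x)=1/x!$ and $K(\eta)=\exp(-e^\eta)$, so $-\log K(\eta)=e^\eta$. Its first two $\eta$-derivatives are both $e^{\eta_0}=\lambda$, the familiar Poisson mean and variance; numeric differentiation agrees.

```python
import math

def neg_log_K(eta):
    # Poisson in natural form: f_eta(x) = K(eta) exp(eta * x) / x!,
    # with K(eta) = exp(-e^eta), so -log K(eta) = e^eta.
    return math.exp(eta)

def d(fun, x, h=1e-5):
    """Central first difference."""
    return (fun(x + h) - fun(x - h)) / (2 * h)

def d2(fun, x, h=1e-4):
    """Central second difference."""
    return (fun(x + h) - 2 * fun(x) + fun(x - h)) / h ** 2

lam = 3.0
eta0 = math.log(lam)
mean = d(neg_log_K, eta0)    # should be lam, per Fact 17
var = d2(neg_log_K, eta0)    # should also be lam
print(round(mean, 3), round(var, 3))
```

Both derivatives return $\lambda$, matching $E_{\eta_0}T_j$ and $\mathrm{Var}_{\eta_0}T_j$ computed from $-\log K$.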
Fact 18  If $X=(X_1,X_2,\dots,X_n)$ has iid components, with each $X_i\sim P_\theta$, then $X$ generates a $k$ dimensional exponential family on $(\mathcal{X}^n,\mathcal{B}^n)$ wrt $\mu^n$. The $k$ dimensional statistic $\sum_{i=1}^n T(X_i)$ is sufficient for this family. Under the condition that $\Gamma_\Theta$ contains an open rectangle, this statistic is also complete and minimal sufficient.
"Fisher Information"

Definition 10  We will say that the model $\mathcal{P}$ with dominating $\sigma$-finite measure $\mu$ is FI regular at $\theta_0\in\Theta\subset\mathbb{R}^1$, provided there exists an open neighborhood of $\theta_0$, say $O$, such that
i) $f_\theta(x)>0\ \forall x$ and $\forall\theta\in O$,
ii) $\forall x$, $f_\theta(x)$ is differentiable at $\theta_0$ and
iii) $1=\int f_\theta(x)\,d\mu(x)$ can be differentiated wrt $\theta$ under the integral at $\theta_0$, i.e.
$$0 = \int \left.\frac{d}{d\theta}f_\theta(x)\right|_{\theta=\theta_0}\,d\mu(x)\,.$$
(A definition like that on page 111 of Schervish, that requires i) and ii) only on a set of $P_\theta$ probability 1 $\forall\theta$ in $O$, is really no more general than this. WOLOG, just throw away the exceptional set.)

Jargon: The (random) function of $\theta$, $\frac{d}{d\theta}\log f_\theta(X)$, is called the score function. iii) of Definition 10 says that the mean of the score function is $0$ at $\theta_0$.
Definition 11  If the model $\mathcal{P}$ is FI regular at $\theta_0$, then
$$I(\theta_0) = E_{\theta_0}\left(\left.\frac{d}{d\theta}\log f_\theta(X)\right|_{\theta=\theta_0}\right)^2$$
is called the Fisher Information contained in $X$ at $\theta_0$. (If a model is FI regular at $\theta_0$, $I(\theta_0)$ is the variance of the score function at $\theta_0$.)

Theorem 19  Suppose that the model $\mathcal{P}$ is FI regular at $\theta_0$. If $\forall x$, $f_\theta(x)$ is in fact twice differentiable at $\theta_0$ and $1=\int f_\theta(x)\,d\mu(x)$ can be differentiated twice wrt $\theta$ under the integral at $\theta_0$, i.e. both $0=\int \frac{d}{d\theta}f_\theta(x)\big|_{\theta=\theta_0}\,d\mu(x)$ and $0=\int \frac{d^2}{d\theta^2}f_\theta(x)\big|_{\theta=\theta_0}\,d\mu(x)$, then
$$I(\theta_0) = -E_{\theta_0}\left(\left.\frac{d^2}{d\theta^2}\log f_\theta(X)\right|_{\theta=\theta_0}\right)\,.$$
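For a sanity check (my example, not the notes'): in the N$(\theta,1)$ model the score is $\frac{d}{d\theta}\log f_\theta(x)=x-\theta$, so the score has mean 0 (Definition 10 iii), its variance is $I(\theta_0)=1$ (Definition 11), and $-\frac{d^2}{d\theta^2}\log f_\theta$ is identically 1, matching Theorem 19. A quick Monte Carlo:

```python
import random
import statistics

theta0 = 2.0
rng = random.Random(42)
xs = [rng.gauss(theta0, 1) for _ in range(100000)]

scores = [x - theta0 for x in xs]     # d/dtheta log f_theta(x) at theta0
mean_score = statistics.mean(scores)  # ~ 0, per Definition 10 iii)
info = statistics.variance(scores)    # ~ 1 = I(theta0), per Definition 11
neg_second = 1.0                      # -d^2/dtheta^2 log f is constant = 1
print(round(mean_score, 3), round(info, 3), neg_second)
```

Both routes to the information (variance of the score, and minus the expected second derivative) land on the same number, as Theorem 19 promises.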
$k$-dimensional versions of Definitions 10 and 11 are next.

Definition 10b  We will say that the model $\mathcal{P}$ with dominating $\sigma$-finite measure $\mu$ is FI regular at $\theta_0\in\Theta\subset\mathbb{R}^k$, provided there exists an open neighborhood of $\theta_0$, say $O$, such that
i) $f_\theta(x)>0\ \forall x$ and $\forall\theta\in O$,
ii) $\forall x$, $f_\theta(x)$ has first order partial derivatives at $\theta_0$ and
iii) $1=\int f_\theta(x)\,d\mu(x)$ can be differentiated wrt each $\theta_i$ under the integral at $\theta_0$, i.e. $0=\int \frac{\partial}{\partial\theta_i}f_\theta(x)\big|_{\theta=\theta_0}\,d\mu(x)$.

Definition 11b  If the model $\mathcal{P}$ is FI regular at $\theta_0$ and $E_{\theta_0}\left(\frac{\partial}{\partial\theta_i}\log f_\theta(X)\big|_{\theta=\theta_0}\right)^2 < \infty\ \forall i$, then the $k\times k$ matrix
$$I(\theta_0) = \left(E_{\theta_0}\left.\frac{\partial}{\partial\theta_i}\log f_\theta(X)\right|_{\theta=\theta_0}\left.\frac{\partial}{\partial\theta_j}\log f_\theta(X)\right|_{\theta=\theta_0}\right)$$
is called the Fisher Information contained in $X$ at $\theta_0$.
There is a multiparameter version of Theorem 19.

Theorem 19b  Suppose that the model $\mathcal{P}$ is FI regular at $\theta_0$. If $\forall x$, $f_\theta(x)$ in fact has continuous second order partials in the neighborhood $O$, and $1=\int f_\theta(x)\,d\mu(x)$ can be differentiated twice under the integral at $\theta_0$, i.e. both $0=\int \frac{\partial}{\partial\theta_i}f_\theta(x)\big|_{\theta=\theta_0}\,d\mu(x)$ and $0=\int \frac{\partial^2}{\partial\theta_i\partial\theta_j}f_\theta(x)\big|_{\theta=\theta_0}\,d\mu(x)\ \forall\,i$ and $j$, then
$$I(\theta_0) = -E_{\theta_0}\left(\left.\frac{\partial^2}{\partial\theta_i\partial\theta_j}\log f_\theta(X)\right|_{\theta=\theta_0}\right)\,.$$
Fact 20  If $X_1,X_2,\dots,X_n$ are independent, $X_i\sim P_{i\theta}$, then $X=(X_1,X_2,\dots,X_n)$ (that has distribution $P_{1\theta}\times P_{2\theta}\times\cdots\times P_{n\theta}$) carries Fisher Information $I(\theta)=\sum_{i=1}^n I_i(\theta)$, where in the obvious way, $I_i(\theta)$ is the Fisher information in $X_i$.

Fact 21  $I(\theta)$ doesn't depend upon the dominating measure.
Observation 22  The Fisher Information does depend upon the parameterization one uses. For example, suppose that the model $\mathcal{P}$ is FI regular at $\theta_0\in\mathbb{R}^1$ and that
$$\eta = h(\theta) \text{ for some "nice" } h \text{ (say, 1-1 with derivative } h'\text{).}$$
Then for $P_\theta$ with R-N derivative $f_\theta$ with respect to $\mu$, the distributions $Q_\eta$ for $\eta$ in the range of $h$ are defined by
$$Q_\eta = P_{h^{-1}(\eta)}$$
and have R-N derivatives (again with respect to $\mu$)
$$g_\eta = \frac{dQ_\eta}{d\mu}\,.$$
Then, continuing to let "prime" denote differentiation with respect to $\theta$ or $\eta$,
$$g'_\eta = f'_{h^{-1}(\eta)}\cdot\frac{1}{h'\left(h^{-1}(\eta)\right)}$$
and
$$I(\eta) = \left(\frac{1}{h'\left(h^{-1}(\eta)\right)}\right)^2 I\left(h^{-1}(\eta)\right)$$
(where the first "$I$" here is information about $\eta$ and the second is information about $\theta$).
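A concrete check of the change-of-parameterization formula (my example): for one Exponential observation with rate $\theta$, $I(\theta)=1/\theta^2$. Re-parameterizing by the mean $\eta=h(\theta)=1/\theta$, the formula gives $I(\eta)=\left(1/h'(h^{-1}(\eta))\right)^2 I(h^{-1}(\eta))=1/\eta^2$.

```python
def info_rate(theta):
    """Fisher information about the rate in one Exponential(theta) draw."""
    return 1.0 / theta ** 2

def info_mean_via_formula(eta):
    """Apply Observation 22 with eta = h(theta) = 1/theta."""
    theta = 1.0 / eta             # h^{-1}(eta)
    h_prime = -1.0 / theta ** 2   # h'(theta) = -1/theta^2
    return (1.0 / h_prime) ** 2 * info_rate(theta)

for eta in (0.25, 0.5, 2.0):
    assert info_mean_via_formula(eta) == 1.0 / eta ** 2
print("I(eta) = 1/eta^2, as the formula predicts")
```

Note how the same model carries information $1/\theta^2$ about the rate but $1/\eta^2$ about the mean: the information is a property of the parameterization, not just of the family.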
Observation 23  If $T(x)$ is a 1-1 function on $\mathcal{X}$, then the Fisher Information in $T(X)$ is the same as that in $X$. This follows since the distributions of $T(X)$, $P_\theta^T$, are dominated by the measure $\mu^T$ defined by $\mu^T(F)=\mu(T^{-1}(F))$ and have R-N derivatives
$$\frac{dP_\theta^T}{d\mu^T} = f_\theta\circ T^{-1}$$
which will satisfy the FI regularity conditions and have the right derivatives and integrals.
Theorem 24  Suppose the model $\mathcal{P}$ is FI regular at $\theta_0$. (?And suppose that the family $\mathcal{P}^T=\{P_\theta^T\}_{\theta\in\Theta}$ is also FI regular at $\theta_0$.?) Then
$$I_X(\theta_0) \ge I_{T(X)}(\theta_0)$$
and if $T(X)$ is sufficient for $\theta$ there is equality.

Theorem 24b (Theorem 2.86 of Schervish)  Suppose the model $\mathcal{P}$ is FI regular at $\theta_0\in\Theta\subset\mathbb{R}^k$. (?And suppose that the family $\mathcal{P}^T=\{P_\theta^T\}_{\theta\in\Theta}$ is also FI regular at $\theta_0$.?) Then the matrix
$$I_X(\theta_0) - I_{T(X)}(\theta_0) \text{ is nonnegative definite}$$
and if $T(X)$ is sufficient for $\theta$ the matrix contains all $0$ elements.
Fact 25  In exponential families, the Fisher Information is very simple. For a family of distributions as in Fact(s) 10 above with densities $f_\eta(x)$ with respect to $\mu$,
$$I(\eta_0) = \mathrm{Var}_{\eta_0}\,T(X) = \left(\left.-\frac{\partial^2}{\partial\eta_i\partial\eta_j}\log K(\eta)\right|_{\eta=\eta_0}\right)\,.$$
"Kullback-Leibler Information"

There is another notion of statistical information, intended to measure how far apart 2 distributions are in the sense of likelihood, or in some sense how easy it will be to discriminate between them on the basis of an observable, $X$. (This measure is also useful in the asymptotics of maximum likelihood estimation.)

Definition 12  If $P$ and $Q$ are probability measures on $\mathcal{X}$ with R-N derivatives with respect to a common dominating measure $\mu$, $p$ and $q$ respectively, then the Kullback-Leibler Information $I(P;Q)$ is the $P$ expected log likelihood ratio
$$I(P;Q) = E_P\log\frac{p(X)}{q(X)} = \int \log\left(\frac{p(x)}{q(x)}\right)p(x)\,d\mu(x)\,.$$

Fact 26  $I(P;Q)$ is not in general the same as $I(Q;P)$.

Fact 27  $I(P;Q)\ge 0$ and there is equality only when $P=Q$.
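Facts 26 and 27 are easy to exhibit numerically (my sketch) with two pmfs on a three-point set:

```python
import math

def kl(p, q):
    """Kullback-Leibler information I(P;Q) for pmfs given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.6, 0.2, 0.2]

print(kl(p, q), kl(q, p))  # both positive and unequal (Facts 26 and 27)
print(kl(p, p))            # 0: equality holds exactly when P = Q
```

The two orderings give different positive numbers, so K-L "information" is not a metric, though it does vanish only at $P=Q$.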
This fact follows from the next (slightly more general) lemma, upon noting that $\Psi(\cdot)=-\log(\cdot)$ is strictly convex.

Lemma 28 (Silvey page 75)  Suppose that $P$ and $Q$ are two different distributions on a space $\mathcal{X}$ with R-N derivatives $p$ and $q$ respectively with respect to a common measure $\mu$. Suppose further that a function $\Psi:(0,\infty)\to\mathbb{R}^1$ is convex. Then
$$E_P\,\Psi\left(\frac{q(X)}{p(X)}\right) \ge \Psi(1)$$
with strict inequality if $\Psi$ is strictly convex.
Theorem 29  If $X=(X_1,X_2)$ and under both $P$ and $Q$ the variables $X_1$ and $X_2$ are independent, then
$$I_X(P;Q) = I_{X_1}(P^{X_1};Q^{X_1}) + I_{X_2}(P^{X_2};Q^{X_2})\,.$$
So the K-L Information has a kind of additive property similar to that possessed by the Fisher Information. It also has the kind of non-increasing property associated with reduction to a statistic $T(X)$ we saw with the Fisher Information.

Theorem 30  For any statistic $T(X)$
$$I_X(P;Q) \ge I_{T(X)}(P^T;Q^T)$$
and there is equality iff $T(X)$ is sufficient for $\{P,Q\}$.
Finally, there is also an important connection between the notions of Fisher and K-L Information.

Theorem 31  Under the hypotheses of Theorem 19, and supposing that one can reverse the order of limits in
$$\left.\frac{d^2}{d\theta^2}\,E_{\theta_0}\log f_\theta(X)\right|_{\theta=\theta_0}$$
to produce
$$E_{\theta_0}\left.\frac{d^2}{d\theta^2}\log f_\theta(X)\right|_{\theta=\theta_0}\,,$$
it follows that
$$\left.\frac{d^2}{d\theta^2}\,I_X(P_{\theta_0};P_\theta)\right|_{\theta=\theta_0} = I_X(\theta_0)$$
(the first "$I$" is the K-L Information and the second is the Fisher Information).

The multiparameter version of this is, of course, similar.

Theorem 31b  Under the hypotheses of Theorem 19b (and assuming that it is permissible to interchange the order of second order partial differentiation and integration $d\mu(x)$)
$$\left(\left.\frac{\partial^2}{\partial\theta_i\partial\theta_j}\,I_X(P_{\theta_0};P_\theta)\right|_{\theta=\theta_0}\right)_{k\times k} = I_X(\theta_0)$$
(the first "$I$" is the K-L Information and the second is the Fisher Information).
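Theorem 31 can be checked in the N$(\theta,1)$ model (my example), where $I_X(P_{\theta_0};P_\theta)=(\theta-\theta_0)^2/2$ and $I_X(\theta_0)=1$: the curvature of the K-L Information in $\theta$ at $\theta_0$ is indeed the Fisher Information.

```python
def kl_normal(theta0, theta):
    """K-L information between N(theta0, 1) and N(theta, 1)."""
    return (theta - theta0) ** 2 / 2.0

def d2(fun, x, h=1e-4):
    """Central second difference."""
    return (fun(x + h) - 2 * fun(x) + fun(x - h)) / h ** 2

theta0 = 1.3
curvature = d2(lambda th: kl_normal(theta0, th), theta0)
fisher = 1.0  # I(theta0) in the N(theta, 1) model
print(round(curvature, 6), fisher)
```

The K-L Information is locally a quadratic whose second derivative at its minimum is the Fisher Information, which is one way to read Theorem 31.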
Types of Statistical Problems

In the framework introduced above (an identifiable family of distributions $\mathcal{P}=\{P_\theta\}$ dominated by a $\sigma$-finite measure $\mu$ with R-N derivatives $f_\theta=\frac{dP_\theta}{d\mu}$) there are several formulations of the basic notion of "using $X$ to tell us about $\theta$" that is statistical theory. Here we briefly describe those.

Point Estimation is the business of using a statistic $\delta(X)$ (where $\delta:\mathcal{X}\to\Theta$) to guess at the value of $\theta$ (or some function of $\theta$). In those (common) cases where $\Theta\subset\mathbb{R}^k$ one can adopt criteria like "'small' $\theta$ expected difference between $\delta(X)$ and $\theta$ (or a function of $\theta$)" to compare choices of "estimators" $\delta(X)$. Common methods of inventing estimators are taught in 543 and include such methods as maximum likelihood and the method of moments. Bayesians typically use some feature of their posterior distribution (like its mean or mode) to estimate $\theta$ (or a function thereof).
Set Estimation is the business of using a set-valued statistic $S(X)$ (where for $\mathcal{C}$ a $\sigma$-algebra on $\Theta$, $S:\mathcal{X}\to\mathcal{C}$) to guess at the true value of $\theta$ (or some function of $\theta$). One (obvious) criterion for judging choices of "confidence procedures" $S$ is that one wants "'large' $\theta$ coverage probability" for all $\theta$. In those (common) cases where $\Theta\subset\mathbb{R}^k$ one can adopt criteria like "'small' $\theta$ expected set 'size'" to add to the notion of large coverage probability. Common methods of inventing set estimators include looking for "pivotals," inverting families of tests, looking for "standard errors" for good point estimators and using them to provide limits around those point estimators, and bootstrap methods. Bayesians, having a posterior probability distribution for $\theta$, typically use that to choose what they call "credible sets" with large posterior probability of including $\theta$ (or a function thereof).
Prediction is the business of positing a model for $X=(X_1,X_2)$, supposing only $X_1$ to be observable and using it to predict $X_2$ (or some function thereof). There are both point and set versions of this problem. In the common case where $X_2$ takes values in $\mathbb{R}^k$, one can adopt criteria like "'small' $\theta$ expected difference between the prediction $\delta(X_1)$ and $X_2$ (or a function thereof)" to compare choices of "predictors" $\delta(X_1)$. Bayesians have a conditional distribution for $X_2$ (called the predictive distribution) and can use some feature of it to predict $X_2$. "Set prediction" methods are usually judged on the basis of their $\theta$ coverage probabilities and $\theta$ expected set sizes. Bayesian predictions are really formally no different from Bayesian estimation, being based on conditional distributions of the quantity of interest given the observable $X_1$.
Hypothesis Testing is the business of partitioning the parameter space $\Theta$ into two pieces, $\Theta_0$ and $\Theta_1$, and using a statistic $\phi(X)$ (where $\phi:\mathcal{X}\to\{0,1\}$) to decide which of the two pieces is more plausible. Criteria like small "size" (maximum $\theta\in\Theta_0$ probability that $\phi(X)=1$) and large "power" ($\theta\in\Theta_1$ probability that $\phi(X)=1$) are usually used to compare "tests" $\phi$. Classical methods of inventing tests (taught in 543) include appeal to the Neyman-Pearson Lemma and its extensions to convenient families (like monotone likelihood ratio families) and use of the "likelihood ratio criterion." Bayesians, armed as they are with a posterior distribution over $\Theta$, have posterior probabilities for $\Theta_0$ and $\Theta_1$ to guide their choice between $\Theta_0$ and $\Theta_1$ (and therefore their construction of $\phi$'s).

There are optimality theories that can be brought to bear on these statistical problems. Those are part of the somewhat broader world of "statistical decision theory." We will do a fair amount with this subject before the term is over. However, because the situations in which the theory provides clean results do not begin to span the space of important statistical applications, we will first continue with some more general theoretical material of wider applicability, based primarily on asymptotic considerations.
Maximum Likelihood, Its Implications in Likelihood Ratio Testing and Other Contexts (Primarily the Simple Asymptotics Thereof)

"Maximum Likelihood"

Maximum likelihood is a method of point estimation dating at least to Fisher and probably beyond. As usual, for an identifiable family of distributions $\mathcal{P}=\{P_\theta\}$ dominated by a $\sigma$-finite measure $\mu$, we let $f_\theta=\frac{dP_\theta}{d\mu}$.

Definition 13  A statistic $T:\mathcal{X}\to\Theta$ is called a maximum likelihood estimator (MLE) of $\theta$ if
$$f_{T(x)}(x) = \sup_{\theta\in\Theta} f_\theta(x)\ \forall x\in\mathcal{X}\,.$$

This definition is not without practical problems in terms of existence and computation of MLEs. $f_\theta(x)$ can fail to be bounded in $\theta$, and even when it is, can fail to have a maximum value as one searches over $\Theta$. One way to get rid of the problem of potential lack of boundedness (favored and extensively practiced by Prof. Meeker for example) is to admit that one can measure only "to the nearest something," so that "real" $\mathcal{X}$ are discrete (finite or countable) so that $\mu$ can be counting measure and
$$f_\theta(x) = P_\theta(\{x\}) \le 1$$
(so that one is just maximizing the probability of observing the data in hand). Of course, the possibility that the sup is not achieved still remains even when this device is used (like, for example, when $X\sim$ Binomial$(n,p)$ for $p\in(0,1)$ when $X=0$ or $n$).
Note that in the common situation where $\Theta\subset\mathbb{R}^k$ and the $f_\theta$ are nicely differentiable in $\theta$, a maximum likelihood estimator must satisfy the likelihood equations
$$\left.\frac{\partial}{\partial\theta_i}\log f_\theta(x)\right|_{\theta=T(x)} = 0 \text{ for } i=1,2,\dots,k\,.$$
Much of the theory connected with maximum likelihood estimation has to do with roots of the likelihood equations (as opposed to honest maximizers of $f_\theta(x)$). Further, much of the attractiveness of the method derives from its asymptotic properties. A simple version of the theory (primarily for the $k=1$ case) follows.
Adopt the following notation for the discussion of the asymptotics of maximum likelihood.

$X^n$  represents all information on $\theta$ available through the $n$th stage of observation
$\mathcal{X}_n$  the observation space for $X^n$ (through the $n$th stage)
$\mathcal{B}_n$  a $\sigma$-algebra on $\mathcal{X}_n$
$P_\theta^{X^n}$  the distribution of $X^n$ (dominated by a $\sigma$-finite measure $\mu_n$)
$\Theta$  a (fixed) parameter space
$f_\theta^n$  the R-N derivative $\frac{dP_\theta^{X^n}}{d\mu_n}$

For this material one really does need the basic probability space $(\Omega,\mathcal{F},P_\theta)$ in the background, as it is necessary to have a fixed space upon which to prove convergence results. Often (i.e. in the iid case) we may take
$(\mathcal{X},\mathcal{B},P_\theta^{X_1})$ to be a "1-dimensional" model where the $P_\theta^{X_1}$'s are absolutely continuous with respect to a $\sigma$-finite measure $\mu$
$\Omega = \mathcal{X}^\infty$
$X^n = (X_1,X_2,\dots,X_n)$ for $X_i(\omega)=$ the $i$th coordinate of $\omega$
$\mathcal{X}_n = \mathcal{X}^n$
$\mathcal{F} = \mathcal{B}^\infty$
$P_\theta = \left(P_\theta^{X_1}\right)^\infty$
$P_\theta^{X^n} = \left(P_\theta^{X_1}\right)^n$
$\mu_n = \mu^n$
and
$$f_\theta^n(x^n) = \prod_{i=1}^n f_\theta(x_i)\,.$$
For such a set-up, the usual drill is for appropriate sequences of estimators $\{\hat\theta_n(x^n)\}$ to try to prove consistency and asymptotic normality results.

Definition 14  Suppose that $\{T_n\}$ is a sequence of statistics $T_n:\mathcal{X}_n\to\Theta$, where $\Theta\subset\mathbb{R}^k$.
i) $\{T_n\}$ is (weakly) consistent at $\theta_0\in\Theta$ if for every $\epsilon>0$, $P_{\theta_0}\left(|T_n-\theta_0|>\epsilon\right)\to 0$ as $n\to\infty$, and
ii) $\{T_n\}$ is strongly consistent at $\theta_0\in\Theta$ if $T_n\to\theta_0$ a.s. $P_{\theta_0}$.
There are two standard paths for developing asymptotic results for likelihood-based estimators in iid contexts. These are
i) the path blazed by Wald based on very strong regularity assumptions (that can be extremely hard to verify for a specific model) but that deal with "real" maximum likelihood estimators, and
ii) the path blazed by Cramér based on much weaker assumptions and far easier proofs that deal with solutions to the likelihood equations.

We'll wave our hands very briefly at a heuristic version of the Wald argument and defer to pages 415-424 of Schervish for technical details. Then we'll do a version of the Cramér results (probably assuming more than is absolutely necessary to get them).

All that follows regarding the asymptotics of maximum likelihood refers to the iid case, where $x^n=(x_1,x_2,\dots,x_n)$, $\Theta\subset\mathbb{R}^k$ and $f_\theta^n(x^n)=\prod_{i=1}^n f_\theta(x_i)$.
Now suppose that the $f_\theta$ are positive everywhere on $\mathcal{X}$ and define the function of $\theta$
$$Z_{\theta_0}(\theta) \equiv E_{\theta_0}\log f_\theta(X) = \int \log f_\theta(x)\,f_{\theta_0}(x)\,d\mu(x) = -I(\theta_0;\theta) + \int \log f_{\theta_0}(x)\,f_{\theta_0}(x)\,d\mu(x)\,.$$

Corollary 32  If the $f_\theta$ are positive everywhere on $\mathcal{X}$, then for $\theta\ne\theta_0$
$$Z_{\theta_0}(\theta_0) > Z_{\theta_0}(\theta)\,.$$

Then, as a matter of notation, let
$$\ell_n(x^n,\theta) \equiv \log f_\theta^n(x^n) = \sum_{i=1}^n \log f_\theta(x_i)\,.$$
Then clearly $E_{\theta_0}\frac{1}{n}\ell_n(X^n,\theta) = Z_{\theta_0}(\theta)$ and in fact, the law of large numbers says that for every $\theta$,
$$\frac{1}{n}\ell_n(X^n,\theta) = \frac{1}{n}\sum_{i=1}^n \log f_\theta(X_i) \to Z_{\theta_0}(\theta) \text{ a.s. } P_{\theta_0}\,.$$
So assuming enough regularity conditions to make this almost sure convergence uniform in $\theta$, when $\theta_0$ governs the distribution of the observations there is perhaps hope that a single maximizer of $\frac{1}{n}\ell_n(X^n,\theta)$ will eventually exist and be close to the maximizer of $Z_{\theta_0}(\theta)$, namely $\theta_0$. To make this precise is not so easy. See Theorems 7.49 and 7.54 of Schervish.

Now turn to Cramér type results.
Theorem 33 Suppose that \" ß \# ß ÞÞÞß \8 are iid T) , @ § e" and there exists an open
neighborhood of )! , say S, such that
i) 0) ÐBÑ ž ! aB and a) - S,
ii) aBß 0) ÐBÑ is differentiable at every point ) - S, and
iii) E)! log0) Ð\Ñ exists for all ) - S and E)! log0)! Ð\Ñ is finite.
Then, if % ž ! and $ ž !, b 8 such that a 7 ž 8, the T)! probability that the equation
. 7
"log0) Ð\3 Ñ œ !
. ) 3œ"
has a root within % of )! is at least " • $ Þ
(Essentially this theorem ducks the uniformity issue raised earlier by considering 68 at
only )! ß )! • % and )! € %.)
In general, Theorem 33 doesn't immediately do what we'd like. The root guaranteed by the theorem isn't necessarily an MLE. And in fact, when the likelihood equation can have multiple roots, we haven't even yet defined a sequence of random variables, let alone a sequence of estimators converging to $\theta_0$. But the theorem can form the basis for simple results that are of statistical importance.
Corollary 34 Under the hypotheses of Theorem 33, suppose that (in addition) $\Theta$ is open and $\forall x$, $f_\theta(x)$ is positive and differentiable at every point $\theta \in \Theta$, so that the likelihood equation
$$\frac{d}{d\theta} \sum_{i=1}^n \log f_\theta(X_i) = 0$$
makes sense at all points $\theta \in \Theta$. Then define $\rho_n$ to be the root of the likelihood equation when there is exactly one (and otherwise adopt any arbitrary definition for $\rho_n$). If with $P_{\theta_0}$ probability approaching 1 the likelihood equation has a single root, then
$$\rho_n \to \theta_0 \ \text{ in } P_{\theta_0} \text{ probability}.$$
Where there is no guarantee that the likelihood equation will eventually have only a single root, one could use Theorem 33 to argue that, defining
$$\psi_n(\theta_0) = \begin{cases} 0 & \text{if the likelihood equation has no root} \\ \text{the root closest to } \theta_0 & \text{otherwise,} \end{cases}$$
one has $\psi_n(\theta_0) \to \theta_0$ in $P_{\theta_0}$ probability. (The continuity of the likelihood implies that limits of roots are roots, so for $\theta_0$ in an open $\Theta$, the definition of $\psi_n(\theta_0)$ above eventually makes sense.) But, of course, this is of no statistical help, since $\psi_n(\theta_0)$ depends on the unknown $\theta_0$. What is similar in nature but of more help is the next corollary.
Corollary 35 Under the hypotheses of Theorem 33, if $\{T_n\}$ is a sequence of estimators consistent at $\theta_0$ and
$$\hat\theta_n \equiv \begin{cases} T_n & \text{if the likelihood equation has no roots} \\ \text{the root closest to } T_n & \text{otherwise,} \end{cases}$$
then
$$\hat\theta_n \to \theta_0 \ \text{ in } P_{\theta_0} \text{ probability}.$$
In practical terms, finding the root of the likelihood equation nearest to $T_n$ may be a numerical problem. (Actually, if it is, it's not so clear how you know you have solved the problem. But we'll gloss over that.) Finding a root beginning from $T_n$ might be approached using something like Newton's method. That is, abbreviating
$$L_n(\theta) \equiv l_n(X^n, \theta) = \sum_{i=1}^n \log f_\theta(X_i),$$
the likelihood equation is
$$L_n'(\theta) = 0.$$
One might try looking for a root of this equation by iterating using linearizations of $L_n'(\theta)$ (presumably beginning at $T_n$). Or, assuming that $L_n''(T_n) \ne 0$ and taking a single step in this iteration process, a 1-step likelihood-based modification/improvement(?) on $T_n$,
$$\tilde\theta_n = T_n - \frac{L_n'(T_n)}{L_n''(T_n)},$$
is suggested. It is plausible that an estimator like $\tilde\theta_n$ might have asymptotic behavior similar to that of an MLE. (See page 442 of TPE or Theorem 7.75 of Schervish, which treats a more general multiparameter M-estimation version of this idea.)
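As a concrete illustration of the 1-step idea, here is a small sketch for the Cauchy location model (the model choice, sample size, and true value $\theta_0 = 1$ are our own illustrative assumptions, not part of the notes), where the sample median serves as the consistent starting estimator $T_n$:

```python
import numpy as np

# Cauchy location model: f_theta(x) = 1 / (pi * (1 + (x - theta)^2)), so
#   d/dtheta    log f = 2 d / (1 + d^2)            with d = x - theta,
#   d^2/dtheta^2 log f = (2 d^2 - 2) / (1 + d^2)^2.

def derivs(theta, x):
    """L_n'(theta) and L_n''(theta) for the Cauchy loglikelihood."""
    d = x - theta
    return np.sum(2 * d / (1 + d**2)), np.sum((2 * d**2 - 2) / (1 + d**2)**2)

rng = np.random.default_rng(0)
theta0 = 1.0
x = theta0 + rng.standard_cauchy(500)

t_n = np.median(x)                      # consistent (but inefficient) start T_n
lp, lpp = derivs(t_n, x)
theta_tilde = t_n - lp / lpp            # the 1-step estimator
```

For this model $I_1(\theta) = 1/2$, so the 1-step estimator's asymptotic standard deviation is $\sqrt{2/n}$, noticeably smaller than the median's $(\pi/2)/\sqrt{n}$.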
The next matter up is the asymptotic distribution of likelihood-based estimators. The
following is representative of what people have proved. (I suspect that the hypotheses
here are stronger than are really needed to get the conclusion, but they make for a simple
proof.)
Theorem 36 Suppose that $X_1, X_2, \ldots, X_n$ are iid $P_\theta$, $\Theta \subset \Re^1$, and there exists an open neighborhood of $\theta_0$, say $O$, such that
i) $f_\theta(x) > 0$ $\forall x$ and $\forall \theta \in O$,
ii) $\forall x$, $f_\theta(x)$ is three times differentiable at every point $\theta \in O$,
iii) $\exists$ $M(x) \ge 0$ with $\mathrm{E}_{\theta_0} M(X) < \infty$ and
$$\left| \frac{d^3}{d\theta^3} \log f_\theta(x) \right| \le M(x) \quad \forall x \text{ and } \forall \theta \in O,$$
iv) $1 = \int f_\theta(x)\, d\mu(x)$ can be differentiated twice with respect to $\theta$ under the integral at $\theta_0$, i.e. both $0 = \int \frac{d}{d\theta} f_\theta(x) \big|_{\theta = \theta_0}\, d\mu(x)$ and $0 = \int \frac{d^2}{d\theta^2} f_\theta(x) \big|_{\theta = \theta_0}\, d\mu(x)$, and
v) $I_1(\theta) \equiv \mathrm{E}_\theta \left( \frac{d}{d\theta} \log f_\theta(X) \right)^2 \in (0, \infty)$ $\forall \theta \in O$.
If, with $P_{\theta_0}$ probability approaching 1, $\hat\theta_n$ is a root of the likelihood equation, and $\hat\theta_n \to \theta_0$ in $P_{\theta_0}$ probability, then under $P_{\theta_0}$
$$\sqrt{n}\left( \hat\theta_n - \theta_0 \right) \stackrel{d}{\to} \mathrm{N}\!\left( 0, \frac{1}{I_1(\theta_0)} \right).$$
There are multiparameter versions of this theorem (indeed, the Schervish theorem is a multiparameter one). And there are similar (multiparameter) versions of this for the "Newton 1-step approximate MLEs" based on consistent estimators of $\theta$ (i.e. $\tilde\theta_n$'s).
The practical implication of a result like this theorem is that one can make approximate confidence intervals for $\theta$ of the form
$$\hat\theta_n \pm z \frac{1}{\sqrt{n I_1(\theta_0)}}.$$
Well, it is almost a practical implication of the theorem. One needs to do something about the fact that $I_1(\theta_0)$ is unknown when making a confidence interval, so it can't be used in the formula. There are two common routes to fixing this problem. The first is to use $I_1(\hat\theta_n)$ in its place. This method is often referred to as using the "expected Fisher information" in the making of the interval. (Note that $I_1(\theta_0)$ is a measure of curvature of $z_{\theta_0}(\theta)$ at its maximum (located at $\theta_0$), while $I_1(\hat\theta_n)$ is a measure of curvature of $z_{\hat\theta_n}(\theta)$ at its maximum (located at $\hat\theta_n$).) Dealing with the theory of the first method is really very simple. It is easy to prove the following.
Corollary 37 Under the hypotheses of Theorem 36, if $I_1(\theta)$ is continuous at $\theta_0$, then under $P_{\theta_0}$
$$\sqrt{n I_1(\hat\theta_n)} \left( \hat\theta_n - \theta_0 \right) \stackrel{d}{\to} \mathrm{N}(0, 1).$$
(Actually, the continuity of $I_1(\theta)$ at $\theta_0$ may well follow from the hypotheses of Theorem 36 and therefore be redundant in the hypotheses of this corollary. I've not tried to think this issue through.)
The second method of approximating $I_1(\theta_0)$ comes about by realizing that in the proof of Theorem 36, $I_1(\theta_0)$ arises as the limit of $-\frac{1}{n} L_n''(\theta_0)$. One might then hope to approximate it by
$$-\frac{1}{n} L_n''(\hat\theta_n) = -\frac{1}{n} \sum_{i=1}^n \frac{d^2}{d\theta^2} \log f_\theta(X_i) \Big|_{\theta = \hat\theta_n},$$
which is sometimes called the "observed Fisher information." (Note that $-\frac{1}{n} L_n''(\hat\theta_n)$ is a measure of curvature of the average loglikelihood function $\frac{1}{n} l_n(X^n, \theta)$ at $\hat\theta_n$, which one is secretly thinking of as essentially a maximizer of the average loglikelihood.)
Schervish says that Efron and others claim that this second method is generally better than the first (in terms of producing intervals with real coverage properties close to the nominal properties). Proving an analogue of Corollary 37 where $-\frac{1}{n} L_n''(\hat\theta_n)$ replaces $I_1(\hat\theta_n)$ is somewhat harder, but not by any means impossible. That is, one can prove the following.
Corollary 38 Under the hypotheses of Theorem 36, under $P_{\theta_0}$
$$\sqrt{-L_n''(\hat\theta_n)} \left( \hat\theta_n - \theta_0 \right) \stackrel{d}{\to} \mathrm{N}(0, 1).$$
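To see the two routes side by side, here is a sketch (our own hypothetical example, not from the notes) again using the Cauchy location model, where the expected information is the constant $I_1(\theta) = 1/2$ while the observed information $-L_n''(\hat\theta_n)$ varies with the sample:

```python
import numpy as np

# Compare the "expected information" interval theta_hat +/- z / sqrt(n I_1)
# with the "observed information" interval theta_hat +/- z / sqrt(-L_n'').
# For the Cauchy location model I_1(theta) = 1/2 for every theta.

def derivs(theta, x):
    d = x - theta
    return np.sum(2 * d / (1 + d**2)), np.sum((2 * d**2 - 2) / (1 + d**2)**2)

rng = np.random.default_rng(1)
x = 2.0 + rng.standard_cauchy(400)

# Locate the MLE: crude grid maximization of L_n, then a short Newton polish.
grid = np.linspace(np.median(x) - 1.0, np.median(x) + 1.0, 4001)
ll = np.array([-np.sum(np.log(1 + (x - t) ** 2)) for t in grid])
theta_hat = grid[np.argmax(ll)]
for _ in range(5):
    lp, lpp = derivs(theta_hat, x)
    theta_hat -= lp / lpp

n, z = len(x), 1.96
half_expected = z / np.sqrt(n * 0.5)       # expected information I_1 = 1/2
lp, lpp = derivs(theta_hat, x)
half_observed = z / np.sqrt(-lpp)          # observed information -L_n''(theta_hat)
expected_ci = (theta_hat - half_expected, theta_hat + half_expected)
observed_ci = (theta_hat - half_observed, theta_hat + half_observed)
```

With heavy-tailed data the two half-widths genuinely differ sample to sample, which is exactly the point of the expected-vs-observed distinction.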
When, under the hypotheses of Theorem 36, the convergence in distribution of $\sqrt{n}(\hat\theta_n - \theta_0)$ is accompanied by convergence of second moments, one has
$$\mathrm{Var}_{\theta_0} \sqrt{n} \left( \hat\theta_n - \theta_0 \right) \to \frac{1}{I_1(\theta_0)}.$$
Then the fact that the C-R inequality (recall Stat 543) declares that under suitable conditions no unbiased estimator of $\theta$ has variance less than $1/(n I_1(\theta))$ makes one suspect that for a consistent sequence of estimators $\{\theta_n^*\}$, convergence under $\theta_0$
$$\sqrt{n} (\theta_n^* - \theta_0) \stackrel{d}{\to} \mathrm{N}(0, V(\theta_0))$$
might imply that
$$V(\theta_0) \ge \frac{1}{I_1(\theta_0)},$$
but in fact this inequality does not have to hold up.
Likelihood Ratio Tests, Related Tests and Confidence Procedures
One set of general purpose tools for making tests and confidence procedures consists of
the "likelihood ratio test," its relatives and related confidence procedures. The following
is an introduction to these methods and a little about their asymptotics.
For $\mathcal{P} = \{P_\theta\}$, $\mathrm{H}_0\colon \theta \in \Theta_0$ and $\mathrm{H}_1\colon \theta \in \Theta_1$, one might plausibly consider a test criterion like
$$LR(x) = \frac{\sup_{\theta \in \Theta_1} f_\theta(x)}{\sup_{\theta \in \Theta_0} f_\theta(x)}.$$
(Notice, by the way, that Bayes tests look at ratios of $G$ averages rather than sups.) Or, equivalently, one might consider
$$\lambda(x) = \max(LR(x), 1) = \frac{\sup_{\theta \in \Theta} f_\theta(x)}{\sup_{\theta \in \Theta_0} f_\theta(x)},$$
where we will want to reject $\mathrm{H}_0$ for large $\lambda(x)$. This is an intuitively reasonable criterion,
and many common tests (including optimal ones) turn out to be likelihood ratio tests. On
the other hand, it is easy enough to invent problems where LRTs need not be in any sense
optimal.
It is a useful fact that one can find some general results about how to set critical values for LRTs (at least for iid models and large $n$). This is because if there is an MLE, say $\hat\theta$, then
$$\sup_{\theta \in \Theta} f_\theta(x) = f_{\hat\theta}(x)$$
and we can make use of MLE-like estimation results. For example, in the iid context of Theorem 36, there is the following.
Theorem 39 Suppose that $X_1, X_2, \ldots, X_n$ are iid $P_\theta$, $\Theta \subset \Re^1$, and there exists an open neighborhood of $\theta_0$, say $O$, such that
i) $f_\theta(x) > 0$ $\forall x$ and $\forall \theta \in O$,
ii) $\forall x$, $f_\theta(x)$ is three times differentiable at every point $\theta \in O$,
iii) $\exists$ $M(x) \ge 0$ with $\mathrm{E}_{\theta_0} M(X) < \infty$ and
$$\left| \frac{d^3}{d\theta^3} \log f_\theta(x) \right| \le M(x) \quad \forall x \text{ and } \forall \theta \in O,$$
iv) $1 = \int f_\theta(x)\, d\mu(x)$ can be differentiated twice with respect to $\theta$ under the integral at $\theta_0$, i.e. both $0 = \int \frac{d}{d\theta} f_\theta(x) \big|_{\theta = \theta_0}\, d\mu(x)$ and $0 = \int \frac{d^2}{d\theta^2} f_\theta(x) \big|_{\theta = \theta_0}\, d\mu(x)$, and
v) $I_1(\theta) \equiv \mathrm{E}_\theta \left( \frac{d}{d\theta} \log f_\theta(X) \right)^2 \in (0, \infty)$ $\forall \theta \in O$.
If, with $P_{\theta_0}$ probability approaching 1, $\hat\theta_n$ is a root of the likelihood equation, and $\hat\theta_n \to \theta_0$ in $P_{\theta_0}$ probability, then under $P_{\theta_0}$
$$2 \log \lambda_n^* \equiv 2 \log \left( \frac{f^n_{\hat\theta_n}(X^n)}{f^n_{\theta_0}(X^n)} \right) \stackrel{d}{\to} \chi^2_1.$$
(The hypotheses here are exactly the hypotheses of Theorem 36.)
The application that people often make of this theorem is that they reject $\mathrm{H}_0\colon \theta = \theta_0$ in favor of $\mathrm{H}_1\colon \theta \ne \theta_0$ when $2 \log \lambda_n^*$ exceeds the $(1 - \alpha)$ quantile of the $\chi^2_1$ distribution.
Notice also that upon inverting these tests, the theorem gives a way of making large sample confidence sets for $\theta$. When $\hat\theta_n$ is a maximizer of the loglikelihood, the set of $\theta_0$ for which $\mathrm{H}_0\colon \theta = \theta_0$ will be accepted using the LRT with an (approximate) cut-off value (say $\chi^2_1$) is
$$\left\{ \theta \mid L_n(\theta) \ge L_n(\hat\theta_n) - \tfrac{1}{2} \chi^2_1 \right\}.$$
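A minimal sketch of inverting the LRT in this way, for a made-up iid Poisson($\theta$) sample (the model and data are our illustrative assumptions; 3.84 is the 0.95 quantile of $\chi^2_1$):

```python
import numpy as np

# Approximate 95% confidence set {theta : L_n(theta) >= L_n(theta_hat) - 3.84/2}
# for a Poisson mean, found by scanning a grid.  The MLE is theta_hat = xbar.

x = np.array([3, 0, 2, 4, 1, 2, 2, 5, 1, 3])
xbar = x.mean()

def loglik(theta):
    return np.sum(-theta + x * np.log(theta))   # log(x_i!) terms dropped

grid = np.linspace(0.5, 5.0, 2000)
vals = np.array([loglik(t) for t in grid])
keep = grid[vals >= loglik(xbar) - 3.84 / 2]
ci = (keep.min(), keep.max())     # an interval here, since L_n is concave
```

Because $L_n$ is concave in $\theta$ for this model, the confidence set is an interval; in general the set need not be.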
There are versions of the $\chi^2$ limit that cover (composite) null hypotheses about some part of a multi-dimensional parameter vector. For example, Schervish (page 459) proves a result like the following.
Result 40 Suppose that $\Theta \subset \Re^p$ and appropriate regularity conditions hold for iid observations $X_1, X_2, \ldots, X_n$. If $k \le p$ and
$$\lambda_n = \frac{\sup_\theta f^n_\theta(X^n)}{\sup_{\theta \text{ s.t. } \theta_i = \theta_{i0},\, i = 1, 2, \ldots, k} f^n_\theta(X^n)},$$
then under any $P_\theta$ with $\theta_i = \theta_{i0}$, $i = 1, 2, \ldots, k$,
$$2 \log \lambda_n \stackrel{d}{\to} \chi^2_k.$$
This result (of course) allows one to set approximate critical points for testing $\mathrm{H}_0\colon \theta_i = \theta_{i0}$, $i = 1, 2, \ldots, k$ vs $\mathrm{H}_1\colon$ not $\mathrm{H}_0$, by rejecting $\mathrm{H}_0$ when
$$2 \log \lambda_n > \text{the } (1 - \alpha)\ \chi^2_k \text{ quantile}.$$
And one can make large-$n$ approximate confidence sets for $(\theta_1, \theta_2, \ldots, \theta_k)$ of the form
$$\left\{ (\theta_1, \ldots, \theta_k) \,\Big|\, \sup_{\theta_{k+1}, \ldots, \theta_p} L_n((\theta_1, \ldots, \theta_k, \theta_{k+1}, \ldots, \theta_p)) \ge \sup_\theta L_n(\theta) - \tfrac{1}{2} \chi^2_k \right\}$$
for $\chi^2_k$ the $(1 - \alpha)$ quantile of the $\chi^2_k$ distribution.
There is some standard jargon associated with the function of $(\theta_1, \ldots, \theta_k)$ appearing in the above expressions.
Definition 15 The function of $(\theta_1, \ldots, \theta_k)$
$$L_n^*((\theta_1, \ldots, \theta_k)) \equiv \sup_{\theta_{k+1}, \ldots, \theta_p} L_n((\theta_1, \ldots, \theta_k, \theta_{k+1}, \ldots, \theta_p))$$
is called a profile likelihood function for the parameters $\theta_1, \ldots, \theta_k$.
With this jargon/notation, the form of the large-$n$ confidence set becomes
$$\left\{ (\theta_1, \ldots, \theta_k) \,\Big|\, L_n^*((\theta_1, \ldots, \theta_k)) \ge \sup_{(\theta_1, \ldots, \theta_k)} L_n^*((\theta_1, \ldots, \theta_k)) - \tfrac{1}{2} \chi^2_k \right\},$$
which looks more like the form from the case of a one-dimensional parameter.
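For instance, in the $\mathrm{N}(\mu, \sigma^2)$ model the supremum over $\sigma^2$ can be taken analytically, so the profile for $\mu$ has a closed form. A sketch with made-up data (this example is ours, not from the notes):

```python
import numpy as np

# Profile loglikelihood for mu in the N(mu, sigma^2) model: maximizing the
# loglikelihood over sigma^2 for fixed mu gives sigma_hat^2(mu) = mean((x-mu)^2)
# and, up to an additive constant,
#   L_n^*(mu) = -(n/2) * (log(sigma_hat^2(mu)) + 1).

x = np.array([1.2, -0.4, 0.8, 2.1, 0.3, 1.7, -0.2, 0.9])
n = len(x)

def profile_loglik(mu):
    return -0.5 * n * (np.log(np.mean((x - mu) ** 2)) + 1.0)

mu_hat = x.mean()                              # maximizer of the profile

# LRT statistic for H0: mu = 0, computed straight from the profile:
two_log_lambda = 2 * (profile_loglik(mu_hat) - profile_loglik(0.0))
```

Here $2\log\lambda_n$ reduces to $n \log(\hat\sigma^2(0)/\hat\sigma^2(\hat\mu))$, and the profile-based confidence set for $\mu$ is $\{\mu : L_n^*(\mu) \ge L_n^*(\hat\mu) - \tfrac{1}{2}\chi^2_1\}$.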
There are alternatives to the LRT of $\mathrm{H}_0\colon \theta_i = \theta_{i0}$, $i = 1, 2, \ldots, k$ vs $\mathrm{H}_1\colon$ not $\mathrm{H}_0$. One such test is the so-called "Wald test." See, for example, pages 115-120 of Silvey. The following is borrowed pretty much directly from Silvey.
Consider $\Theta \subset \Re^p$ and $k$ restrictions on $\theta$
$$h_1(\theta) = h_2(\theta) = \cdots = h_k(\theta) = 0$$
(like, for instance, $\theta_1 - \theta_{10} = \theta_2 - \theta_{20} = \cdots = \theta_k - \theta_{k0} = 0$). Suppose that $\hat\theta_n$ is an MLE of $\theta$ and define
$$h(\theta) \equiv (h_1(\theta), h_2(\theta), \ldots, h_k(\theta)).$$
Then if $\mathrm{H}_0\colon h_1(\theta) = h_2(\theta) = \cdots = h_k(\theta) = 0$ is true, $h(\hat\theta_n)$ ought to be near 0. The question is how to measure "nearness" and to set a critical value in order to have a test with size approximately $\alpha$. An approach to doing this is as follows.
We expect (under suitable conditions) that under $P_\theta$ the ($p$-dimensional) estimator $\hat\theta_n$ has
$$\sqrt{n} (\hat\theta_n - \theta) \stackrel{d}{\to} \mathrm{N}_p(0, I_1^{-1}(\theta)).$$
Then if
$$H(\theta) \equiv \left( \frac{\partial h_i}{\partial \theta_j} \bigg|_\theta \right)_{k \times p},$$
the delta method suggests that
$$\sqrt{n} \left( h(\hat\theta_n) - h(\theta) \right) \stackrel{d}{\to} \mathrm{N}_k(0, H(\theta) I_1^{-1}(\theta) H'(\theta)).$$
Abbreviate $H(\theta) I_1^{-1}(\theta) H'(\theta)$ as $B(\theta)$. Since $h(\theta) = 0$ if $\mathrm{H}_0$ is true, the above suggests that
$$n h'(\hat\theta_n) B^{-1}(\theta) h(\hat\theta_n) \stackrel{d}{\to} \chi^2_k \ \text{ under } \mathrm{H}_0.$$
Now $n h'(\hat\theta_n) B^{-1}(\theta) h(\hat\theta_n)$ cannot serve as a test statistic, since it involves $\theta$, which is not completely specified by $\mathrm{H}_0$. But it is plausible to consider the statistic
$$W_n \equiv n h'(\hat\theta_n) B^{-1}(\hat\theta_n) h(\hat\theta_n)$$
and hope that under suitable conditions
$$W_n \stackrel{d}{\to} \chi^2_k \ \text{ under } \mathrm{H}_0.$$
If this can be shown to hold up, one may reject for $W_n >$ the $(1 - \alpha)$ quantile of the $\chi^2_k$ distribution.
The Wald tests for $h_i(\theta) = \theta_i - \theta_{i0}$, $i = 1, 2, \ldots, k$ can be inverted to get approximate confidence regions for $k \le p$ of the parameters $\theta_1, \theta_2, \ldots, \theta_p$. The resulting confidence regions will, by construction, be ellipsoidal.
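A numerical sketch of $W_n$ for the single restriction $h(\theta) = \theta_1 - \theta_2$ in a 3-parameter model; the values standing in for $\hat\theta_n$ and $I_1^{-1}(\hat\theta_n)$ below are made up for illustration, not from any real fit:

```python
import numpy as np

# Wald statistic W_n = n h(theta_hat)' B(theta_hat)^{-1} h(theta_hat) with
# B = H I_1^{-1} H', for h(theta) = theta_1 - theta_2 (so k = 1, p = 3).

n = 200
theta_hat = np.array([1.30, 1.10, 0.50])          # hypothetical MLE
I1_inv = np.array([[0.50, 0.10, 0.00],            # hypothetical I_1^{-1}
                   [0.10, 0.40, 0.05],
                   [0.00, 0.05, 0.30]])

H = np.array([[1.0, -1.0, 0.0]])                  # Jacobian of h, k x p
h = H @ theta_hat                                 # h(theta_hat) = [0.2]

B = H @ I1_inv @ H.T
W = float(n * h @ np.linalg.solve(B, h))
reject = W > 3.84                                 # 0.95 quantile of chi^2_1
```

Inverting such tests over $\theta_{10}$ (more generally over $(\theta_{10}, \ldots, \theta_{k0})$) produces the ellipsoidal regions just mentioned.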
There is a final alternative to the LRT and Wald tests that deserves mention. That is the "$\chi^2$ test" or "score test." See again Silvey. The notion here is that on occasion it can be easier to maximize $L_n(\theta)$ subject to $h(\theta) = 0$ (again taking this to be a statement of $k$ constraints on a $p$-dimensional parameter vector $\theta$) than to simply maximize $L_n(\theta)$. Let $\tilde\theta_n$ be a "restricted" MLE (i.e. a maximizer of $L_n(\theta)$ subject to $h(\theta) = 0$). One might expect that if $\mathrm{H}_0\colon h(\theta) = 0$ is true, then $\tilde\theta_n$ ought to be nearly an unrestricted maximizer of $L_n(\theta)$ and the partial derivatives of $L_n(\theta)$ should be nearly 0 at $\tilde\theta_n$. On the other hand, if $\mathrm{H}_0$ is not true, there is little reason to expect the partials to be nearly 0. So (again/still with $L_n(\theta) = \sum_{i=1}^n \log f_\theta(X_i)$) one might consider the statistic
$$X^2 \equiv \frac{1}{n} \left( \nabla L_n(\theta) \big|_{\theta = \tilde\theta_n} \right)' I_1^{-1}(\tilde\theta_n) \left( \nabla L_n(\theta) \big|_{\theta = \tilde\theta_n} \right).$$
Then, for
$$\lambda_n = \frac{\sup_\theta f^n_\theta(X^n)}{\sup_{\theta \text{ s.t. } h(\theta) = 0} f^n_\theta(X^n)},$$
it is typically the case that under $\mathrm{H}_0$, $X^2$ differs from $2 \log \lambda_n$ by a quantity tending to 0 in $P_\theta$ probability. That is, this statistic can be calibrated using $\chi^2$ critical points and can form the basis of a test of $\mathrm{H}_0$ asymptotically equivalent to a "Result 40 type" likelihood ratio test.
Once more (as for the Wald tests), for $h_i(\theta) = \theta_i - \theta_{i0}$, $i = 1, 2, \ldots, k$, one can invert these tests to get confidence regions for $k \le p$ of the parameters $\theta_1, \theta_2, \ldots, \theta_p$.
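For the simple hypothesis $\mathrm{H}_0\colon \theta = \theta_0$ the restricted MLE is just $\theta_0$ itself and the statistic has a closed form. A sketch for a made-up iid Poisson sample (our illustrative choice):

```python
import numpy as np

# Score test of H0: theta = theta0 for iid Poisson(theta) data.  Here
# d/dtheta L_n(theta) = sum(x_i / theta - 1) and I_1(theta) = 1/theta, so
# X^2 = (1/n) U(theta0)^2 / I_1(theta0) = n (xbar - theta0)^2 / theta0.

x = np.array([3, 1, 4, 1, 5])
n, xbar = len(x), x.mean()
theta0 = 2.0

U = np.sum(x / theta0 - 1.0)        # score evaluated at the restricted MLE
I1 = 1.0 / theta0                   # per-observation Fisher information
X2 = U**2 / (n * I1)

X2_closed = n * (xbar - theta0)**2 / theta0    # same thing in closed form
```

Note that no unrestricted maximization of $L_n$ was needed, which is exactly the computational selling point of the score test.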
As a final note on this topic, observe that
i) an LRT of the style in Result 40 requires computation of an MLE, $\hat\theta_n$, and a restricted MLE, $\tilde\theta_n$,
ii) a Wald test requires only computation of the MLE, $\hat\theta_n$, and
iii) a "$\chi^2$" test/"score test" requires only computation of the restricted MLE, $\tilde\theta_n$.
M-Estimation
From one point of view, this method of estimation is simply a generalization of maximum
likelihood. ("Robust" estimation proponents have other motivations for the method.)
Here is a very brief introduction to the subject.
Consider a statistical model with parameter space $\Theta \subset \Re^k$ and observations $X_1, X_2, \ldots, X_n$ iid $P_\theta$. Likelihood-based estimation uses the loglikelihood
$$L_n(\theta) = \sum_{i=1}^n \log f_\theta(X_i)$$
and typically chooses $\theta$ to maximize $L_n(\theta)$ or to solve the $k$-dimensional system of likelihood equations
$$\nabla L_n(\theta) = 0.$$
M-estimation uses some suitable function $\rho(x, \theta)$ and
$$\psi_n(\theta) = \sum_{i=1}^n \rho(X_i, \theta),$$
and chooses $\theta$ to maximize $\psi_n(\theta)$ or to solve the $k$-dimensional system of estimating equations
$$\nabla \psi_n(\theta) = 0.$$
(With $\psi_j = \frac{\partial \rho}{\partial \theta_j}$, this is $\sum_{i=1}^n \psi_j(X_i, \theta) = 0$ $\forall j = 1, 2, \ldots, k$.)
$\rho$ could be $-(x - \theta)^2$, $\log f_\theta$, or even $\log g_\theta$ for $g_\theta$ the density for some other family of models.
Provided that $\rho$ (or the $\psi_j$) interacts properly with the $f_\theta$, one gets properties for M-estimation like those we've deduced for maximum likelihood. For example,
$$\mathrm{E}_{\theta_0} \rho(X, \theta_0) > \mathrm{E}_{\theta_0} \rho(X, \theta) \quad \forall \theta_0 \ \text{and} \ \forall \theta \ne \theta_0,$$
plus other conditions, implies that an M-estimator is consistent. (Note that with $\rho$ the function $\log f_\theta$, this inequality comes from Fact 27 about K-L information.)
Further, under appropriate conditions, one can prove asymptotic normality of i) one-step Newton improvements on a consistent estimator of $\theta$, or ii) consistent roots of $\nabla \psi_n(\theta) = 0$. In such results, the counterpart of $I_1^{-1}(\theta_0)$ in the limiting $k$-variate normal distribution is
$$J^{-1}(\theta_0) K(\theta_0) J^{-1}(\theta_0)'$$
for
$$J(\theta_0) = -\mathrm{E}_{\theta_0} \left( \frac{\partial^2}{\partial \theta_i \partial \theta_j} \rho(X, \theta) \Big|_{\theta = \theta_0} \right)$$
and
$$K(\theta_0) = \left( \mathrm{Cov}_{\theta_0} \left( \frac{\partial}{\partial \theta_i} \rho(X, \theta) \Big|_{\theta = \theta_0},\ \frac{\partial}{\partial \theta_j} \rho(X, \theta) \Big|_{\theta = \theta_0} \right) \right).$$
(In the case of $\rho(x, \theta) = \log f_\theta(x)$, $J = K = I_1$.)
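A sketch of the empirical sandwich $\hat J^{-1} \hat K \hat J^{-1}$ in the simplest case $\rho(x, \theta) = -(x - \theta)^2$, where the M-estimate is the sample mean (the data are made up; in this special case the sandwich collapses to the $1/n$-divisor sample variance):

```python
import numpy as np

# M-estimation with rho(x, theta) = -(x - theta)^2, so psi = 2 (x - theta).
# Empirical versions of J and K evaluated at theta_hat:
#   J_hat = -(1/n) sum d psi / d theta = 2
#   K_hat =  (1/n) sum psi(x_i, theta_hat)^2
# and the estimated asymptotic variance is J_hat^{-1} K_hat J_hat^{-1}.

x = np.array([2.3, 1.1, 3.4, 0.7, 2.9, 1.8, 2.2])
theta_hat = x.mean()                 # solves sum 2 (x_i - theta) = 0

psi = 2 * (x - theta_hat)
J_hat = 2.0
K_hat = np.mean(psi ** 2)
sandwich = K_hat / J_hat ** 2        # scalar J^{-1} K J^{-1}
```

The same recipe (numerical $\hat J$, $\hat K$) applies to any smooth $\rho$, which is what makes the sandwich form useful under model misspecification.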
Simple Implications of the MLE Results for the Shape of the Loglikelihood and
Posteriors
In rough terms, the statistical folklore says that "likelihoods pile up around true values
and are locally quadratic near their peaks, while provided a prior is sufficiently diffuse,
posteriors will pile up near true values and look Gaussian." Doing a really good job of
even saying this precisely is nontrivial (involving uniformities of convergence of random
functions, etc.). What follows are some very simpleminded (and non-uniform type)
consequences of the material on maximum likelihood. For more precise and harder
theorems, see Schervish §7.4.
Lemma 41 For $X_1, X_2, \ldots, X_n$ iid $P_{\theta_0}$,
$$L_n(\theta_0) - L_n(\theta) \stackrel{P_{\theta_0}}{\to} \infty \quad \text{for } \theta \ne \theta_0.$$
Lemma 42 For $X_1, X_2, \ldots, X_n$ iid $P_{\theta_0}$ and $G$ a prior on $\Theta$ with density $g(\theta)$, if
$$g(\theta \mid x^n) = \frac{f^n_\theta(x^n)\, g(\theta)}{\int_\Theta f^n_\theta(x^n)\, g(\theta)\, d\nu(\theta)}$$
is the posterior density of $\theta \mid X^n$ (with respect to the measure $\nu$) and $g(\theta) > 0$, then
$$\frac{g(\theta \mid X^n)}{g(\theta_0 \mid X^n)} \stackrel{P_{\theta_0}}{\to} 0 \quad \text{for } \theta \ne \theta_0.$$
Corollary 43 Under the hypotheses of Lemma 42, if $\Theta$ is finite, $\nu$ is counting measure and $g(\theta) > 0$ $\forall \theta$, the (random) posterior distribution of $\theta \mid X^n$ is consistent in the sense that for $A$ a subset of $\Theta$,
$$g^{\theta \mid X^n}(A) \stackrel{P_{\theta_0}}{\to} I[\theta_0 \in A].$$
Next, here's a simple observation about the local shape of the loglikelihood near its maximum.
Lemma 44 Under the hypotheses of Theorem 36, suppose that $\hat\theta_n$ is an MLE for $\theta$ that is consistent at $\theta_0$. Then
$$Q_n(u) = L_n(\hat\theta_n) - L_n\!\left( \hat\theta_n + \frac{u}{\sqrt{n}} \right) \stackrel{P_{\theta_0}}{\to} \frac{1}{2} u^2 I_1(\theta_0).$$
(This lemma can be interpreted as saying that if one rescales properly so that there is a limiting shape, the loglikelihood is for large $n$ essentially quadratic near its maximum. This is not surprising, since it must have 0 derivative at the maximum and there is enough differentiability assumed to allow a 2nd order Taylor expansion.)
Corollary 45 Under the hypotheses of Lemma 44, suppose that a prior $G$ has a density $g(\theta)$ with respect to Lebesgue measure on $\Theta \subset \Re^1$ such that $g(\theta_0) > 0$ and $g$ is continuous at $\theta_0$. The posterior density of $\theta \mid X^n$ has the property that
$$R_n(u) = \log \left( \frac{g(\hat\theta_n \mid X^n)}{g(\hat\theta_n + \frac{u}{\sqrt{n}} \mid X^n)} \right) \stackrel{P_{\theta_0}}{\to} \frac{1}{2} u^2 I_1(\theta_0).$$
Now in some very rough sense, this says that near the MLE, the (random) log posterior density tends to look approximately quadratic in the distance from the MLE. But now note that if $Y$ is a continuous random variable with pdf $f$ and (exactly)
$$\log \frac{f(0)}{f(y)} = \frac{c y^2}{2},$$
it is the case that $Y$ must be normal with mean 0 and variance $\frac{1}{c}$. This then suggests that under $P_{\theta_0}$ one can expect posterior densities to start to look as if $\theta$ were normal with mean $\hat\theta_n$ and variance $\frac{1}{n I_1(\theta_0)}$. A Bayesian who wants a convenient approximate form for a posterior density (perhaps because the exact one is analytically intractable) might then hope to get approximate posterior probabilities from the
$$\mathrm{N}\!\left( \hat\theta_n, \frac{1}{n I_1(\hat\theta_n)} \right) \quad \text{or} \quad \mathrm{N}\!\left( \hat\theta_n, \frac{1}{-L_n''(\hat\theta_n)} \right)$$
distributions. (Schervish proves that the posterior density of $\sqrt{-L_n''(\hat\theta_n)} (\theta - \hat\theta_n)$, a random function of $\theta$, converges in some appropriate sense to the standard normal density. It is another, harder, matter to argue that approximate posterior probabilities for $\theta$ can be obtained using normal distributions.)
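A small check of this approximation in a case where the exact posterior is available: $n$ Bernoulli($\theta$) observations with a uniform prior give a Beta($s+1$, $n-s+1$) posterior, where $s$ is the number of successes (the counts below are made up):

```python
import math

# Compare the exact Beta posterior with the normal approximation
# N(theta_hat, 1 / (-L_n''(theta_hat))) for Bernoulli data and a flat prior.

n, s = 100, 60
theta_hat = s / n                                            # MLE
obs_info = s / theta_hat**2 + (n - s) / (1 - theta_hat)**2   # -L_n''(theta_hat)
approx_sd = math.sqrt(1.0 / obs_info)

a, b = s + 1, n - s + 1                      # exact posterior is Beta(a, b)
beta_mean = a / (a + b)
beta_sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
```

Here $-L_n''(\hat\theta_n) = n/(\hat\theta_n(1 - \hat\theta_n))$, and the approximating normal has mean 0.600 and sd about 0.049, against the exact posterior mean 0.598 and sd about 0.048.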
Some Simple Theory of Markov Chain Monte Carlo
It is sometimes the case that one has a "known" but unwieldy (joint) distribution for a high-dimensional random vector $Y$, and would like to know something about the distribution (like, for example, the marginal distribution of $Y_1$). While in theory it might be straightforward to compute the quantity of interest, in practice the necessary computation can be impossible. With the advent of cheap computing, given appropriate theory, the possibility exists of simulating a sample of $n$ realizations of $Y$ and computing an empirical version of the quantity of interest to approximate the (intractable) theoretical one.
This possibility is especially important to Bayesians, where $X \mid \theta \sim P_\theta$ and $\theta \sim G$ produce a joint distribution for $(X, \theta)$ and then (in theory) a posterior distribution $G(\theta \mid X)$, the object of primary Bayesian interest. Now (in the usual kind of notation)
$$g(\theta \mid X = x) \propto f_\theta(x)\, g(\theta),$$
so that in this sense the posterior of the possibly high-dimensional $\theta$ is "known." But it may well be analytically intractable, with (for example) no analytical way to compute $\mathrm{E}\, \theta_1$. Simulation is a way of getting (approximate) properties of an intractable/high-dimensional posterior distribution.
"Straightforward" simulation from an arbitrary multivariate distribution is, however, typically not straightforward. Markov Chain Monte Carlo methods have recently become very popular as a solution to this problem. The following is some simple theory covering the case where the (joint) distribution of $Y$ is discrete.
Definition 16 A (discrete time/discrete state space) Markov Chain is a sequence of random quantities $\{X_k\}$, each taking values in a (finite or) countable set $\mathcal{X}$, with the property that
$$P[X_n = x_n \mid X_1 = x_1, \ldots, X_{n-1} = x_{n-1}] = P[X_n = x_n \mid X_{n-1} = x_{n-1}].$$
Definition 17 A Markov Chain is stationary provided $P[X_n = x \mid X_{n-1} = x']$ is independent of $n$.
WOLOG we will henceforth name the elements of $\mathcal{X}$ with the integers $1, 2, 3, \ldots$ and call them "states."
Definition 18 With $p_{ij} \equiv P[X_n = j \mid X_{n-1} = i]$, the square matrix $P \equiv (p_{ij})$ is called the transition matrix for a stationary Markov Chain and the $p_{ij}$ are called transition probabilities.
Note that a transition matrix has nonnegative entries and row sums of 1. Such matrices are often called "stochastic" matrices. As a matter of further notation for a stationary Markov Chain, let
$$p^k_{ij} = P[X_{n+k} = j \mid X_n = i]$$
and
$$f^k_{ij} = P[X_{n+k} = j, X_{n+k-1} \ne j, \ldots, X_{n+1} \ne j \mid X_n = i].$$
(These are respectively the probabilities of moving from $i$ to $j$ in $k$ steps and first moving from $i$ to $j$ in $k$ steps.)
Definition 19 We say that a MC is irreducible if for each $i$ and $j$, $\exists$ $k$ (possibly depending upon $i$ and $j$) such that $p^k_{ij} > 0$. (A chain is irreducible if it is possible to eventually get from any state $i$ to any other state $j$.)
Definition 20 We say that the $i$th state of a MC is transient if $\sum_{k=1}^\infty f^k_{ii} < 1$ and say that the state is persistent if $\sum_{k=1}^\infty f^k_{ii} = 1$. A chain is called persistent if all of its states are persistent.
Definition 21 We say that state $i$ of a MC has period $t$ if $p^k_{ii} = 0$ unless $k = \nu t$ ($k$ is a multiple of $t$) and $t$ is the largest integer with this property. The state is aperiodic if no such $t > 1$ exists. And a MC is called aperiodic if all of its states are aperiodic.
Many sources (including Chapter 15 of the 3rd Edition of Feller Volume 1) present a
number of useful simple results about MC's. Among them are the following.
Theorem 46 All states of an irreducible MC are of the same type (with regard to
persistence and periodicity).
Theorem 47 A finite state space irreducible MC is persistent.
Theorem 48 Suppose that a MC is irreducible, aperiodic and persistent. Suppose further that for each state $i$ the mean recurrence time is finite, i.e.
$$\sum_{k=1}^\infty k f^k_{ii} < \infty.$$
Then an invariant/stationary distribution for the MC exists, i.e. $\exists$ $\{u_j\}$ with $u_j > 0$ and $\sum u_j = 1$ such that
$$u_j = \sum_i u_i p_{ij}.$$
(If the chain is started with distribution $\{u_j\}$, after one transition it is in states $1, 2, 3, \ldots$ with probabilities $\{u_j\}$.) Further, this distribution $\{u_j\}$ satisfies
$$u_j = \lim_{k \to \infty} p^k_{ij} \quad \forall i$$
and
$$u_j = \frac{1}{\sum_{k=1}^\infty k f^k_{jj}}.$$
There is a converse of this theorem.
Theorem 49 An irreducible, aperiodic MC for which $\exists$ $\{u_j\}$ with $u_j > 0$ and $\sum u_j = 1$ such that $u_j = \sum_i u_i p_{ij}$ must be persistent with
$$u_j = \frac{1}{\sum_{k=1}^\infty k f^k_{jj}}.$$
And there is an important "ergodic" result that guarantees that "time averages" have the right limits.
Theorem 50 Under the hypotheses of Theorem 48, if $g$ is a real-valued function such that
$$\sum_j |g(j)|\, u_j < \infty,$$
then for any $j$, if $X_0 = j$,
$$\frac{1}{n} \sum_{k=1}^n g(X_k) \stackrel{\text{a.s.}}{\to} \sum_j g(j)\, u_j.$$
(Note that the choice of $g$ as an indicator provides approximations for stationary probabilities.)
With this background, the basic idea of MCMC is the following. If we wish to simulate from a distribution $\{u_j\}$, or approximate properties of the distribution that can be expressed as moments of some function $g$, we find a convenient MC $\{X_k\}$ whose invariant distribution is $\{u_j\}$. From a starting state $X_0 = i$, one uses $P$ to simulate $X_1$. Using the realization $X_1 = x_1$ and $P$, one simulates $X_2$, etc. One applies Theorem 50 to approximate the quantity of interest. Actually, it is common practice to use a "burn in" of $m$ periods before starting the kind of time average indicated in Theorem 50. Two important questions about this plan (only one of which we'll address) are:
1) How does one set up an appropriate/convenient $P$?
2) How big should $m$ be? (We'll not touch this matter.)
In answer to question 1), there are presumably a multitude of chains that would do the job. But there is the following useful sufficient condition (that has application in the original motivating problem of simulating from high dimensional distributions) for a chain to have $\{u_j\}$ for an invariant distribution.
Lemma 51 If $\{X_k\}$ is a MC with transition probabilities satisfying
$$u_i p_{ij} = u_j p_{ji}, \qquad (\star)$$
then it has invariant distribution $\{u_j\}$.
Note then that if a candidate $P$ satisfies $(\star)$ and is irreducible and aperiodic, Theorem 49 shows that it is persistent, Theorem 48 then shows that any arbitrary starting value can be used and yields approximate realizations from $\{u_j\}$, and Theorem 50 implies that "time averages" can be used to approximate properties of $\{u_j\}$.
Several useful MCMC schemes can be shown to have the "right" invariant distributions by noting that they satisfy $(\star)$.
For example, Lemma 51 can be applied to the "Metropolis-Hastings Algorithm." That is, let $T = (t_{ij})$ be any stochastic matrix corresponding to an irreducible aperiodic MC. (Presumably, some choices will be better than others in terms of providing quick convergence to a target invariant distribution. But that issue is beyond the scope of the present exposition.) (Note that in a finite case, one can take $t_{ij} = 1/(\text{the number of states})$.) The Metropolis-Hastings algorithm simulates as follows:
- Supposing that $X_{n-1} = i$, generate $J$ (at random) according to the distribution over the state space specified by row $i$ of $T$, that is, according to $\{t_{ij}\}$.
- Then generate $X_n$ based on $i$ and (the randomly generated) $J$ according to
$$X_n = \begin{cases} J & \text{with probability } \min\left( 1, \frac{u_J t_{Ji}}{u_i t_{iJ}} \right) \\ i & \text{with probability } \max\left( 0, 1 - \frac{u_J t_{Ji}}{u_i t_{iJ}} \right). \end{cases} \qquad (\star\star)$$
Note that for $\{X_k\}$ so generated, for $j \ne i$
$$p_{ij} = P[X_n = j \mid X_{n-1} = i] = \min\left( 1, \frac{u_j t_{ji}}{u_i t_{ij}} \right) t_{ij}$$
and
$$p_{ii} = P[X_n = i \mid X_{n-1} = i] = t_{ii} + \sum_{j \ne i} \max\left( 0, 1 - \frac{u_j t_{ji}}{u_i t_{ij}} \right) t_{ij}.$$
So, for $i \ne j$,
$$u_i p_{ij} = \min(u_i t_{ij}, u_j t_{ji}) = u_j p_{ji}.$$
That is, $(\star)$ holds and the MC $\{X_k\}$ has stationary distribution $\{u_j\}$. (Further, the assumption that $T$ corresponds to an irreducible aperiodic chain implies that $\{X_k\}$ is irreducible and aperiodic.)
Notice, by the way, that in order to use the Metropolis-Hastings Algorithm one only has to have the $u_j$'s up to a multiplicative constant. That is, if we have $v_j \propto u_j$, we may compute the ratios $u_j / u_i$ as $v_j / v_i$ and are in business as far as simulation goes. (This fact is especially attractive in Bayesian contexts where it can be convenient to know a posterior only up to a multiplicative constant.) Notice also that if $T$ is symmetric (i.e. $t_{ij} = t_{ji}$), step $(\star\star)$ of the Metropolis algorithm becomes
$$X_n = \begin{cases} J & \text{with probability } \min\left( 1, \frac{u_J}{u_i} \right) \\ i & \text{with probability } \max\left( 0, 1 - \frac{u_J}{u_i} \right). \end{cases}$$
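The finite-state construction above is easy to verify numerically: build the full Metropolis transition matrix $P$ from a symmetric uniform proposal $T$ and weights $v_j \propto u_j$, then check $(\star)$ and the invariance $uP = u$ (the target weights below are arbitrary illustrative values):

```python
import numpy as np

# Metropolis algorithm (symmetric proposal) on a 4-state space.  Only the
# unnormalized weights v_j are used to build P, mimicking the situation
# where u_j is known up to a multiplicative constant.

v = np.array([1.0, 2.0, 3.0, 4.0])       # v_j proportional to the target u_j
u = v / v.sum()
m = len(v)
T = np.full((m, m), 1.0 / m)             # symmetric (uniform) proposal

P = np.zeros((m, m))
for i in range(m):
    for j in range(m):
        if j != i:
            P[i, j] = min(1.0, v[j] / v[i]) * T[i, j]
    P[i, i] = 1.0 - P[i].sum()           # leftover probability: stay at i

detailed_balance_gap = max(
    abs(u[i] * P[i, j] - u[j] * P[j, i]) for i in range(m) for j in range(m)
)
```

To actually simulate, one would draw $J$ from row $i$ of $T$ and accept with probability $\min(1, v_J / v_i)$, but building $P$ explicitly makes the detailed-balance check transparent.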
An algorithm similar to the Metropolis-Hastings algorithm is the "Barker Algorithm." The Barker algorithm modifies the above by replacing
$$\min\left( 1, \frac{u_J t_{Ji}}{u_i t_{iJ}} \right) \quad \text{with} \quad \frac{u_J t_{Ji}}{u_i t_{iJ} + u_J t_{Ji}}$$
in $(\star\star)$. Note that for this modification of the Metropolis algorithm, for $j \ne i$
$$p_{ij} = \left( \frac{u_j t_{ji}}{u_i t_{ij} + u_j t_{ji}} \right) t_{ij},$$
so
$$u_i p_{ij} = \frac{(u_i t_{ij})(u_j t_{ji})}{u_i t_{ij} + u_j t_{ji}} = u_j p_{ji}.$$
That is, $(\star)$ holds and thus Lemma 51 guarantees that $\{X_k\}$ has invariant distribution $\{u_j\}$. (And $T$ irreducible and aperiodic continues to imply that $\{X_k\}$ is also.)
Note also that since
$$\frac{u_j t_{ji}}{u_i t_{ij} + u_j t_{ji}} = \frac{\frac{u_j}{u_i} t_{ji}}{t_{ij} + \frac{u_j}{u_i} t_{ji}},$$
once again it suffices to know the $u_j$ up to a multiplicative constant in order to be able to implement Barker's algorithm. (This is again of comfort to Bayesians.) Finally, note that if $T$ is symmetric, step $(\star\star)$ of the Barker algorithm becomes
$$X_n = \begin{cases} J & \text{with probability } \frac{u_J}{u_i + u_J} \\ i & \text{with probability } \frac{u_i}{u_i + u_J}. \end{cases}$$
Finally, consider now the "Gibbs Sampler," or as Schervish correctly argues it should be termed, "Successive Substitution Sampling," for generating an observation from a high dimensional distribution. For sake of concreteness, consider the situation where the distribution of a discrete 3-dimensional random vector $(Y, V, W)$ with probability mass function $f(y, v, w)$ is at issue. In the general MC setup, a possible outcome $(y, v, w)$ represents a "state" and the distribution $f(y, v, w)$ is the "$\{u_j\}$" from which one wants to simulate. One defines a MC $\{X_k\}$ as follows. For an arbitrary starting state $(y_0, v_0, w_0)$:
- Generate $X_1 = (Y_1, v_0, w_0)$ by generating $Y_1$ from the conditional distribution of $Y \mid V = v_0$ and $W = w_0$, i.e. from the (conditional) distribution with probability function $f_{Y \mid V, W}(y \mid v_0, w_0) \equiv \frac{f(y, v_0, w_0)}{\sum_l f(l, v_0, w_0)}$. Let $y_1$ be the realized value of $Y_1$.
- Generate $X_2 = (y_1, V_1, w_0)$ by generating $V_1$ from the conditional distribution of $V \mid Y = y_1$ and $W = w_0$, i.e. from the (conditional) distribution with probability function $f_{V \mid Y, W}(v \mid y_1, w_0) \equiv \frac{f(y_1, v, w_0)}{\sum_l f(y_1, l, w_0)}$. Let $v_1$ be the realized value of $V_1$.
- Generate $X_3 = (y_1, v_1, W_1)$ by generating $W_1$ from the conditional distribution of $W \mid Y = y_1$ and $V = v_1$, i.e. from the (conditional) distribution with probability function $f_{W \mid Y, V}(w \mid y_1, v_1) \equiv \frac{f(y_1, v_1, w)}{\sum_l f(y_1, v_1, l)}$. Let $w_1$ be the realized value of $W_1$.
- Generate $X_4 = (Y_2, v_1, w_1)$ by generating $Y_2$ from the conditional distribution of $Y \mid V = v_1$ and $W = w_1$, i.e. from the (conditional) distribution with probability function $f_{Y \mid V, W}(y \mid v_1, w_1) \equiv \frac{f(y, v_1, w_1)}{\sum_l f(l, v_1, w_1)}$. Let $y_2$ be the realized value of $Y_2$.
And so on, for some appropriate number of cycles.
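The update steps above can be encoded as three coordinate-update transition matrices and checked directly, for a small made-up pmf, to each leave $f$ invariant (so that a full cycle does too):

```python
import itertools
import numpy as np

# SSS/Gibbs for a discrete pmf f(y, v, w) on {0,1}^3: each step re-draws one
# coordinate from its conditional distribution given the other two.  Below,
# update_matrix(c) is the 8x8 transition matrix for re-drawing coordinate c.

f = np.array([[[0.10, 0.05], [0.15, 0.05]],
              [[0.20, 0.10], [0.05, 0.30]]])      # f[y, v, w]; entries sum to 1
states = list(itertools.product(range(2), repeat=3))
idx = {s: k for k, s in enumerate(states)}
N = len(states)

def update_matrix(coord):
    P = np.zeros((N, N))
    for s in states:
        denom = sum(f[s[:coord] + (l,) + s[coord + 1:]] for l in range(2))
        for new in range(2):
            t = s[:coord] + (new,) + s[coord + 1:]
            P[idx[s], idx[t]] += f[t] / denom      # full-conditional probability
    return P

PY, PV, PW = (update_matrix(c) for c in range(3))
pi = np.array([f[s] for s in states])              # f, flattened to match states
P_cycle = PY @ PV @ PW                             # one full Y, V, W cycle
```

Simulating the sampler then amounts to repeatedly drawing from the rows of these matrices (or, in practice, directly from the full conditionals without ever forming the matrices).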
Note that with this algorithm, a typical transition probability (for a step where a $Y$ realization is going to be generated) is
$$P[X_n = (y', v, w) \mid X_{n-1} = (y, v, w)] = \frac{f(y', v, w)}{\sum_l f(l, v, w)},$$
so if $X_{n-1}$ has distribution $f$, the probability that $X_n = (y, v, w)$ is
$$\sum_{l'} f(l', v, w) \cdot \frac{f(y, v, w)}{\sum_l f(l, v, w)} = f(y, v, w),$$
that is, $X_n$ also has distribution $f$. And analogous results hold for transitions of all 3 types ($Y$, $V$ and $W$ realizations).
But it is not absolutely obvious how to apply results 48 and 49 here, since the chain is not stationary. The transition mechanism is not the same for a "$Y$" transition as it is for a "$V$" transition. The key to extricating oneself from this problem is to note that if $P_Y$, $P_V$ and $P_W$ are respectively the transition matrices for $Y$, $V$ and $W$ exchanges, then
$$P = P_Y P_V P_W$$
describes an entire cycle of the SSS algorithm. Then $\{X_k'\}$ defined by $X_n' = X_{3n}$ is a stationary Markov Chain with transition matrix $P$. The fact that $P_Y$, $P_V$ and $P_W$ all leave the distribution $f$ invariant then implies that $P$ also leaves $f$ invariant. So one is in a position to apply Theorems 49 and 48. If $P$ is irreducible and aperiodic, Theorem 49 says that the chain $\{X_k'\}$ is persistent, and then Theorems 48 and 50 say that $f$ can be simulated using an arbitrary starting state.
The non-discrete $Y$ version of all this is somewhat more subtle, but in many ways parallel to the exposition given above. There is a nice exposition in Tierney's December 1994 Annals of Statistics paper (available at ISU through JSTOR at http://www.jstor.org/cgi-bin/jstor/listjournal). The following is borrowed from that source.
Consider now a (possibly non-discrete) state space $\mathcal{X}$ and a target distribution $\pi$ on $(\mathcal{X}, \mathcal{B})$ (for which one wishes to approximate some property via simulation). Suppose that $P(x, A)\colon \mathcal{X} \times \mathcal{B} \to [0, 1]$ is a "transition kernel" (a regular conditional probability, i.e. a function such that for each $x$, $P(x, \cdot)$ is a probability measure, and for each $A$, $P(\cdot, A)$ is $\mathcal{B}$-measurable). Consider a stochastic process $\{X_k\}$ with the property that given $X_1 = x_1, \ldots, X_{n-1} = x_{n-1}$, the variable $X_n$ has distribution $P(x_{n-1}, \cdot)$. $\{X_k\}$ is a general stationary discrete time MC with transition kernel $P(x, A)$, and provided one can simulate from $P(x, A)$, one can simulate realizations of $\{X_k\}$. If $P(x, A)$ is properly chosen, empirical properties of a realization of $\{X_k\}$ can approximate theoretical properties of $\pi$.
Notice that the n-step transition kernels can be computed recursively beginning with
P(x,A) using

   P^n(x,A) = ∫ P^{n-1}(y,A) P(x,dy) .
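For a discrete chain this recursion is just repeated matrix multiplication, and one can watch the n-step kernels settle down to the invariant distribution. A small sketch with a made-up 3-state transition matrix (all numbers are arbitrary choices):

```python
import numpy as np

# A made-up irreducible, aperiodic transition matrix on 3 states.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# n-step kernels via the recursion P^n(x,A) = sum_y P^{n-1}(y,A) P(x,y)
Pn = np.eye(3)
for _ in range(200):
    Pn = P @ Pn

# Every row of P^n approaches the invariant distribution pi (pi P = pi).
pi = Pn[0]
assert np.allclose(pi @ P, pi, atol=1e-10)
print(np.round(pi, 4))
```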
Definition 22 P has invariant distribution π means that ∀ A ∈ ℬ

   π(A) = ∫ P(x,A) dπ(x) .
Definition 23 P is called π-irreducible if for each x and each A with π(A) > 0 ∃
n(x,A) ≥ 1 such that P^{n(x,A)}(x,A) > 0.
Definition 24 A π-irreducible transition kernel P is called periodic if ∃ an integer d ≥ 2
and a sequence of sets A_0, A_1, ..., A_{d-1}, A_d = A_0 in ℬ such that for i = 0, 1, ..., d−1
and all x ∈ A_i, P(x, A_{i+1}) = 1. If a π-irreducible transition kernel P is not periodic it is
called aperiodic.
Definition 25 A π-irreducible transition kernel P with invariant distribution π is called
Harris recurrent if the corresponding MC, {X_k}, has the property that ∀ A ∈ ℬ with
π(A) > 0 and all x ∈ 𝒳

   P_{X_0 = x}[ X_n ∈ A infinitely often ] = 1 .
Theorem 52 Suppose that transition kernel P has invariant distribution π, is
π-irreducible, aperiodic and Harris recurrent. If g is such that

   ∫ |g(x)| dπ(x) < ∞ ,

then for X_0 = x any element of 𝒳

   (1/n) Σ_{k=1}^{n} g(X_k) → ∫ g(x) dπ(x)   a.s.
Theorem 52 says that integrals of g (like, for example, probabilities and moments) can be
approximated by long run sample averages of a sample path of the MC {X_k} beginning
from any starting value, provided a suitable P can be identified. "Suitable" means, of
course, that it has the properties hypothesized in Theorem 52 and one can simulate from
it knowing no more about π than is known in practice. The game of MCMC is then to
identify usable P's. As Tierney's article lays out, the general versions of SSS, the Barker
algorithm and the Metropolis-Hastings algorithm (and others) can often be shown to
satisfy the hypotheses of Theorem 52.
The meaning of "SSS" in the general case is obvious (one generates in turn from each
conditional of a coordinate of a multidimensional ] given all other current coordinates
and considers what one has at the end of each full cycle of these simulations). To
illustrate what another version of MCMC means in the general case, consider the
Metropolis-Hastings algorithm.
Suppose that Q(x,A) is some convenient transition kernel with the property that for
some σ-finite measure μ on (𝒳,ℬ) (dominating π) and a nonnegative function q(x,y) on
𝒳 × 𝒳 (that is ℬ × ℬ measurable) with ∫_𝒳 q(x,y) dμ(y) = 1 for all x,

   Q(x,A) = ∫_A q(x,y) dμ(y) .

Let

   dπ/dμ = p

and suppose that π is not concentrated on a single point. Then with

   α(x,y) = min( p(y)q(y,x) / (p(x)q(x,y)) , 1 )   if p(x)q(x,y) > 0
          = 1                                      if p(x)q(x,y) = 0

and

   p(x,y) = q(x,y)α(x,y)   if x ≠ y
          = 0              if x = y

a Metropolis-Hastings transition kernel is

   P(x,A) = ∫_A p(x,y) dμ(y) + ( 1 − ∫_𝒳 p(x,y) dμ(y) ) Δ_x(A)   (⋆⋆⋆)

where Δ_x is a probability measure degenerate at x. The "simulation interpretation" of this
is that if X_{n-1} = x, one generates y from Q(x,·) and computes α(x,y). Then with
probability α(x,y) one sets X_n = y, and otherwise takes X_n = x. Note that as in the
discrete version, the algorithm depends on p only through the ratios p(y)/p(x), so it
suffices to know p only up to a multiplicative constant, a fact of great comfort to
Bayesians wishing to study analytically intractable posteriors.
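The simulation interpretation is easy to code. Below is a minimal sketch (not from Tierney's paper) using a symmetric normal random-walk proposal, so that q(x,y) = q(y,x) and α(x,y) reduces to min(p(y)/p(x), 1); the target p is a deliberately unnormalized standard-normal density, illustrating that only the ratios p(y)/p(x) are needed:

```python
import math
import random

random.seed(1)

def p(x):
    """Unnormalized target density (here a standard normal, left
    unnormalized on purpose: only ratios p(y)/p(x) ever get used)."""
    return math.exp(-0.5 * x * x)

def mh_chain(n, x0=0.0, step=1.0):
    """Metropolis-Hastings with a symmetric random-walk proposal."""
    x, out = x0, []
    for _ in range(n):
        y = x + random.gauss(0.0, step)   # generate y from Q(x, .)
        alpha = min(p(y) / p(x), 1.0)     # acceptance probability
        if random.random() < alpha:       # with probability alpha, X_n = y
            x = y                         # otherwise X_n = x
        out.append(x)
    return out

xs = mh_chain(50_000)
mean = sum(xs) / len(xs)
var = sum((v - mean) ** 2 for v in xs) / len(xs)
# long-run sample averages approximate the target's moments (0 and 1 here)
print(round(mean, 2), round(var, 2))
```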
It is straightforward to verify that the kernel defined by ⋆⋆⋆ is π-invariant. It turns out
that sufficient conditions for the kernel P to be π-irreducible are that Q is π-irreducible
and either that i) q(x,y) > 0 ∀ x and y or that ii) q(x,y) = q(y,x) ∀ x and y. In turn,
π-irreducibility of P is sufficient to imply Harris recurrence for P. And, if P is
π-irreducible, a sufficient condition for aperiodicity of P is that

   π( { x | ∫ p(x,y) dμ(y) < 1 } ) > 0

so there are some fairly concrete conditions that can be checked to verify the hypotheses
of Theorem 52.
A (Very) Little Bit About Bootstrapping
Suppose that the entries of \ œ Ð\" ß \# ß ÞÞÞß \8 Ñ are iid according to some unknown
(possibly multivariate) distribution J , and for a random function
VÐ\ß J Ñ
there is some aspect of the distribution of VÐ\ß J Ñ that is of statistical interest. For
example, in a univariate context, we might be interested in the (F) probability that

   R(X,F) = (X̄ − μ_F) / (S/√n)

has absolute value less than 2. Often, even if one knew F, the problem of finding
properties of the distribution of VÐ\ß J Ñ would be analytically intractable. (Of course,
knowing J , simulation might provide at least some approximation for the property of
interest ...)
As a means of possibly approximating the distribution of R(X,F), Efron suggested the
following. Suppose that F̂ is an estimator of F. In parametric contexts, where θ̂(X) is
an estimator of θ, this could be F̂ = F_{θ̂}. In nonparametric contexts, if F̂_n is the empirical
distribution of the sample X, we might take F̂ = F̂_n. Suppose then that
X* = (X*_1, ..., X*_n) has entries that are an iid sample from F̂. One might hope that (at
least in some circumstances)

   the F distribution of R(X,F) ≈ the F̂ distribution of R(X*,F̂) .
(This will not always be a reasonable approximation.) Now even the right side of this
expression will typically be analytically intractable. What, however, can be done is to
simulate from this distribution. (Where one cannot simulate from the distribution on the
left because F is not known, simulation from the distribution on the right is possible.)
sÞ
That is, suppose that \ ‡" ß \ ‡# ß ÞÞÞß \ ‡F are F (bootstrap) iid samples of size 8 from J
Using these, one may compute
sÑ
V3‡ œ VÐ\ ‡3 ß J
s distribution
and use the empirical distribution of ÖV3‡ l3 œ "ß ÞÞÞß F × to approximate the J
s Ñ.
of VÐ\ ‡ ß J
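A minimal sketch of the nonparametric version, with F̂ the empirical distribution F̂_n and R the studentized mean from the earlier example; the data, n, and B below are made-up choices for illustration:

```python
import math
import random

random.seed(2)

# Made-up sample standing in for X = (X_1, ..., X_n) iid from unknown F.
x = [random.gauss(10.0, 2.0) for _ in range(30)]
n = len(x)

def R(sample, mu):
    """R(X, F) = (Xbar - mu_F) / (S / sqrt(n)), with mu_F the mean of the
    distribution the sample was drawn from."""
    m = sum(sample) / n
    s = math.sqrt(sum((v - m) ** 2 for v in sample) / (n - 1))
    return (m - mu) / (s / math.sqrt(n))

mu_hat = sum(x) / n          # the mean of Fhat_n, the empirical distribution
B = 2000
r_star = []
for _ in range(B):
    star = [random.choice(x) for _ in range(n)]   # iid sample from Fhat_n
    r_star.append(R(star, mu_hat))

# The empirical distribution of {R*_i} approximates, e.g., the Fhat
# probability that |R| < 2, which in turn approximates P_F[|R(X,F)| < 2].
approx = sum(abs(r) < 2 for r in r_star) / B
print(round(approx, 2))
```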
General Decision Theory
This is a formal framework for posing problems of finding good statistical procedures as
optimization problems.
Basic Elements of a (Statistical) Decision Problem:

X (the data): formally a mapping (Ω,ℱ) → (𝒳,ℬ)

Θ (the parameter space): often one needs to do integrations over Θ, so let ℰ
be a σ-algebra on Θ and assume that for B ∈ ℬ, the
mapping taking θ → P_θ(B) from (Θ,ℰ) → (ℝ¹,ℬ¹) is
measurable

𝒜 (the decision/action space): the class of possible decisions or actions;
we will also need to average over (random) actions,
so let 𝒯 be a σ-algebra on 𝒜

L(θ,a) (the loss function): L : Θ × 𝒜 → [0,∞) and needs to be suitably
measurable

δ(x) (a decision rule): δ(x) : (𝒳,ℬ) → (𝒜,𝒯) and for data X, δ(X) is the
decision taken on the basis of observing X
The fact that PÐ)ß $ Ð\ÑÑ is random prevents any sensible comparison of decision rules
until one has averaged out over the distribution of \ , producing something called a risk
function.
Definition 26 VÐ)ß $ Ñ » E) PÐ)ß $ Ð\ÑÑ œ ' PÐ)ß $ ÐBÑÑ.T) ÐBÑ maps @pÒ!ß_Ñ and is
called the risk function (of the decision rule $ ).
Definition 27 $ is at least as good as $ w if VÐ)ß $ Ñ Ÿ VÐ)ß $ w Ñ a)Þ
Definition 28 $ is better than $ w if VÐ)ß $ Ñ Ÿ VÐ)ß $ w Ñ a) with strict inequality for at
least one ) - @.
Definition 29 $ and $ w are risk equivalent if VÐ)ß $ Ñ œ VÐ)ß $ w Ñ a)Þ
(Note that risk equivalency need not imply equality of decision rules.)
Definition 30 $ is best in a class of decision rules ? if
i) $ - ?
and
ii) $ is at least as good as any other $ w - ? Þ
Typically, only smallish classes of decision rules will have best elements. (For example,
in some point estimation problems there are best unbiased estimators, where no estimator
is best overall. Here the class ? is the class of unbiased estimators.) An alternative to
looking for best elements in small classes of rules is to somehow reduce the function of
), VÐ)ß $ Ñ, to a single number and look for rules minimizing the number. (Reduction by
averaging amounts to looking for Bayes rules. Reduction by taking a maximum value
amounts to looking for minimax rules.)
Definition 31 $ is inadmissible in ? if b $ w - ? which is better than $ (which
"dominates" $ ).
Definition 32 $ is admissible in ? if it is not inadmissible in ? (if there is no $ w - ?
which is better).
On the face of it, it would seem that one would never want to use an inadmissible
decision rule. But, as it turns out, there can be problems where all rules are inadmissible.
(More in this direction later.)
Largely for technical reasons, it is convenient/important to somewhat broaden one's
understanding of what constitutes a decision rule. To this end, as a matter of notation, let
W œ Ö$ ÐBÑ× œ the class of (nonrandomized) decision rules .
Then one can define two fairly natural notions of randomized decision rules.
Definition 33 Suppose that for each B - k , 9B is a distribution on ÐT ß X Ñ. Then 9B is
called a behavioral decision rule.
The idea is that having observed \ œ B, one then randomizes over the action space
according to the distribution 9B in order to choose an action. (This should remind you of
what is done in some theoretical arguments in hypothesis testing.)
Notation: Let W‡ œ Ö9B × œ the class of behavioral decision rules
It is possible to think of W as a subset of W‡ . For $ - W, let 9B$ be a point mass
distribution on T concentrated at $ ÐBÑ. Then 9B$ - W‡ and is "clearly" equivalent to $ .
The risk of a behavioral decision rule is (abusing notation and reusing the symbols
R(·,·) in this context as well as that of Definition 26) clearly

   R(θ,φ) = ∫_𝒳 ∫_𝒜 L(θ,a) dφ_x(a) dP_θ(x) .
A somewhat less intuitively appealing notion of randomization for decision rules is next.
Let Y be a 5-algebra on W that contains the singleton sets.
Definition 34 A randomized decision function (or decision rule) < is a probability
measure on ÐWß Y ÑÞ ($ with distribution < becomes a random object.)
The idea here is that prior to observation, one chooses an element of W by use of <, say $ ,
and then having observed \ œ B decides $ ÐBÑ.
Notation: Let W‡ œ Ö<× œ the class of randomized decision rules.
It is possible to think of W as a subset of W‡ by identifying $ - W with <$ placing mass 1
on $ . <$ is "clearly" equivalent to $ .
The risk of a randomized decision rule is (once again abusing notation and reusing the
symbols R(·,·)) clearly

   R(θ,ψ) = ∫ R(θ,δ) dψ(δ) = ∫ ∫_𝒳 L(θ,δ(x)) dP_θ(x) dψ(δ) ,

assuming that R(θ,ψ) is properly measurable.
The behavioral decision rules are probably more appealing/natural than the randomized
decision rules. On the other hand, the randomized decision rules are often easier to deal
with in proofs. So a reasonably important question is "When are W‡ and W‡ equivalent in
the sense of generating the same set of risk functions?"
"Result" 53 (Some properly qualified version of this is true ... See Ferguson pages
26-27) If 𝒜 is a complete separable metric space with 𝒯 the Borel σ-algebra, appropriate
conditions hold on the P_θ and ??????? holds, then the class of behavioral decision rules
and the class of randomized decision rules are equivalent in terms of generating the same
set of risk functions.
It is also of interest to know when the extra complication of considering randomized rules
is really not necessary. One simple result in that direction concerns cases where P is
convex in its second argument.
Lemma 54 (See page 151 of Schervish, page 40 of Berger and page 78 of Ferguson)
Suppose that T is a convex subset of e. and 9B is a behavioral decision rule. Define a
nonrandomized decision rule $ by
   δ(x) = ∫_𝒜 a dφ_x(a) .
(In the case that . ž ", interpret $ ÐBÑ as vector-valued, the integral as a vector of
integrals over the . coordinates of + - T .) Then
i) If PÐ)ß † Ñ À T pÒ!ß_Ñ is convex, then
VÐ)ß $ Ñ Ÿ VÐ)ß 9Ñ Þ
ii) If PÐ)ß † Ñ À T pÒ!ß_Ñ is strictly convex, VÐ)ß 9Ñ • _ and
T) ÖBl9B is nondegenerate× ž ! ,
then
VÐ)ß $ Ñ • VÐ)ß 9Ñ Þ
(Strict convexity of a function 1 means that for ! - Ð!ß "Ñ
1Ð!B € Ð" • !ÑCÑ • !1ÐBÑ € Ð" • !Ñ1ÐCÑ
for B and C in the convex domain of 1Þ See Berger pages 38 through 40. There is an
extension of Jensen's inequality that says that if 1 is strictly convex and the distribution of
\ is not concentrated at a point, then E1Ð\Ñ ž 1ÐEÐ\ÑÑ.)
Corollary 55 Suppose that T is a convex subset of e. and 9B is a behavioral decision
rule. Define a nonrandomized decision rule $ by
   δ(x) = ∫_𝒜 a dφ_x(a) .
i) If PÐ)ß † Ñ À T pÒ!ß_Ñ is convex a), then $ is at least as good as 9.
ii) If PÐ)ß † Ñ À T pÒ!ß_Ñ is convex a) and for some )! the function
PÐ)! ß † Ñ À T pÒ!ß_Ñ is strictly convex, VÐ)! ß 9Ñ • _ and
T)! ÖBl9B is nondegenerate× ž !
then $ is better than 9.
Corollary 55 says roughly that in the case of a convex loss function, one need not worry
about considering randomized decision rules.
(Finite Dimensional) Geometry of Decision Theory
A very helpful device for understanding some of the basics of decision theory is to
consider the geometry associated with cases where @ is finite. So until further notice,
assume
@ œ Ö)" ß )# ß ÞÞÞß )5 ×
and
VÐ)ß <Ñ • _ a ) - @ and < - W‡ .
As a matter of notation, let
f œ ÖC œ ÐC" ß C# ß ÞÞÞß C5 Ñ - e5 l C3 œ VÐ)3 ß <Ñ a3 and < - W‡ × .
f is the set of all randomized risk vectors.
Theorem 56 f is convex.
In fact, it turns out to be the case, and is more or less obvious from Theorem 56, that if
f ! œ ÖClC3 œ VÐ)3 ß <Ñ a3 and < - W× ,
then f is the convex hull of f ! , i.e. the smallest convex set containing f ! .
Definition 35 For x ∈ ℝ^k, the lower quadrant of x is

   Q_x = { z | z_i ≤ x_i ∀ i } .
Theorem 57 y ∈ 𝒮 (or the decision rule giving rise to y) is admissible iff

   Q_y ∩ 𝒮 = {y} .
Definition 36 The lower boundary of 𝒮 is

   λ(𝒮) = { y | Q_y ∩ cl(𝒮) = {y} }

for cl(𝒮) the closure of 𝒮.
Definition 37 f is closed from below if -Ðf Ñ § f .
Notation: Let EÐf Ñ stand for the class of admissible rules (or risk vectors corresponding
to them) in f .
Theorem 58 If f is closed from below, then EÐf Ñ œ -Ðf ÑÞ
Theorem 58w If f is closed, then EÐf Ñ œ -Ðf Ñ.
Complete Classes of Decision Rules
Now drop the finiteness assumption on @.
Definition 38 A class of decision rules V § W‡ is a complete class if for any 9 Â V, b a
9w - V such that 9w is better than 9.
Definition 39 A class of decision rules V § W‡ is an essentially complete class if for any
9 Â V, b a 9w - V such that 9w is at least as good as 9.
Definition 40 A class of decision rules V § W‡ is a minimal complete class if it is
complete and is a subset of any other complete class.
Notation: Let AÐW‡ Ñ be the set of admissible rules in W‡ .
Theorem 59 If a minimal complete class V exists, then V œ EÐW‡ Ñ.
Theorem 60 If EÐW‡ Ñ is complete, then it is minimal complete.
Theorems 59 and 60 are the properly qualified versions of the rough (and not quite true)
statement "EÐW‡ Ñ equals the minimal complete class".
Sufficiency and Decision Theory
We worked very hard on the notion of sufficiency, with the goal in mind of establishing
that one "doesn't need" \ if one has X Ð\Ñ. That hard work ought to have some
implications for decision theory. Presumably, in most cases one can get by with basing a
decision rule on a sufficient statistic.
Result 61 (See page 120 of Ferguson, page 36 of Berger and page 151 of Schervish)
Suppose a statistic X is sufficient for c œ ÖT) ×. Then if 9 is a behavioral decision rule
there exists another behavioral decision rule 9w that is a function of X and has the same
risk function as 9.
Note that 9w a function of X means that X ÐBÑ œ X ÐCÑ implies that 9Bw and 9Cw are the
same measures on T . Note also that Result 61 immediately implies that the class of
decision rules in W‡ that are functions of X is essentially complete. In terms of risk, one
need only consider rules that are functions of sufficient statistics.
The construction used in the proof of Result 61 doesn't aim to improve on the original
(behavioral) decision rule by moving to a function of the sufficient statistic. But in
convex loss cases, there is a construction that produces rules depending on a sufficient
statistic and potentially improving on an initial (nonrandomized) decision rule. This is
the message of Theorem 64. But to prove Theorem 64, a lemma of some independent
interest is needed. That lemma and an immediate corollary are stated before Theorem 64.
Lemma 62 Suppose that T § e5 is convex and $" ÐBÑ and $# ÐBÑ are two nonrandomized
decision rules. Then $ ÐBÑ œ "# Ð$" ÐBÑ € $# ÐBÑÑ is also a decision rule. Further,
i) if PÐ)ß +Ñ is convex in + and VÐ)ß $" Ñ œ VÐ)ß $# Ñ, then VÐ)ß $ Ñ Ÿ VÐ)ß $" Ñ, and
ii) if PÐ)ß +Ñ is strictly convex in +ß V Ð)ß $" Ñ œ VÐ)ß $# Ñ • _ and
T) Ð$" Ð\Ñ Á $# Ð\ÑÑ ž !, then VÐ)ß $ Ñ • VÐ)ß $" ÑÞ
Corollary 63 Suppose T § e5 is convex and $" ÐBÑ and $# ÐBÑ are two nonrandomized
decision rules with identical risk functions. If PÐ)ß +Ñ is convex in + a), the existence of a
) - @ such that PÐ)ß +Ñ is strictly convex in +ß V Ð)ß $" Ñ œ VÐ)ß $# Ñ • _ and
T) Ð$" Ð\Ñ Á $# Ð\ÑÑ ž !, implies that $" and $# are inadmissible.
Theorem 64 (See Berger page 41, Ferguson page 121, TPE page 50, Schervish page
152) (Rao-Blackwell Theorem) Suppose that T § e5 is convex and $ is a
nonrandomized decision function with E) l$ Ð\Ñl • _ a)Þ Suppose further that X is
sufficient for ) and with U! œ U ÐX Ñ, let
$! ÐBÑ » EÒ$ lU! ÓÐBÑ Þ
Then $! is a nonrandomized decision rule. Further,
i) if PÐ)ß +Ñ is convex in +, then VÐ)ß $! Ñ Ÿ VÐ)ß $ Ñ, and
ii) for any ) for which PÐ)ß +Ñ is strictly convex in +ß V Ð)ß $ Ñ • _ and
T) Ð$ Ð\Ñ Á $! Ð\ÑÑ ž !, one has that VÐ)ß $! Ñ • VÐ)ß $ ÑÞ
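The effect of Theorem 64 can be seen in the Bernoulli example from the start of these notes: δ(X) = X₁ is unbiased for p, T = ΣXᵢ is sufficient, and δ₀(X) = E[δ|T] = T/n. A simulation sketch under squared error loss (the values of p, n, and the replication count are arbitrary choices):

```python
import random

random.seed(3)
p, n, reps = 0.3, 10, 20_000

risk_delta = risk_delta0 = 0.0
for _ in range(reps):
    x = [1 if random.random() < p else 0 for _ in range(n)]
    delta = x[0]            # unbiased but crude estimator of p
    delta0 = sum(x) / n     # E[delta | T] for the sufficient T = sum(x)
    risk_delta += (delta - p) ** 2
    risk_delta0 += (delta0 - p) ** 2

risk_delta /= reps
risk_delta0 /= reps
# Rao-Blackwell: conditioning on T cannot increase (here strictly
# reduces) the squared error risk: p(1-p) vs p(1-p)/n.
print(round(risk_delta, 3), round(risk_delta0, 3))
```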
Bayes Rules
The Bayes approach to decision theory is one means of reducing risk functions to
numbers so that they can be compared in a straightforward fashion.
Notation: Let K be a distribution on Ð@ß V Ñ. K is usually (for philosophical reasons)
called the "prior" distribution on @.
Definition 41 The Bayes risk of φ ∈ 𝒟* with respect to the prior G is (abusing notation
again and once more using the symbols R(·,·))

   R(G,φ) = ∫_Θ R(θ,φ) dG(θ) .

Notation: Again abusing notation somewhat, let

   R(G) ≡ inf_{φ′ ∈ 𝒟*} R(G,φ′) .
Definition 42 9 is said to be Bayes with respect to K (or to be a Bayes rule with respect
to K) provided
VÐKß 9Ñ œ VÐKÑ Þ
Definition 43 9 is said to be %-Bayes with respect to K provided
VÐKß 9Ñ Ÿ VÐKÑ € % Þ
Bayes rules are often admissible, at least if the prior involved "spreads its mass around
sufficiently."
Theorem 65 If @ is finite (or countable), K is a prior distribution with KÐ)Ñ ž ! a) and
9 is Bayes with respect to K, then 9 is admissible.
Theorem 66 Suppose @ § e5 is such that every neighborhood of any point ) - @ has
nonempty intersection with the interior of @. Suppose further that VÐ)ß 9Ñ is continuous
in ) for all 9 - W‡ . Let K be a prior distribution that has support @ in the sense that
every open ball that is a subset of @ has positive K probability. Then if VÐKÑ • _ and
9 is a Bayes rule with respect to K, 9 is admissible.
Theorem 67 If every Bayes rule with respect to K has the same risk function, they are all
admissible.
Corollary 68 In a problem where Bayes rules exist and are unique, they are admissible.
A converse of Theorem 65 (for finite @) is given in Theorem 70. Its proof requires the
use of a piece of 5 -dimensional analytic geometry (called the separating hyperplane
theorem) that is stated next.
Theorem 69 Let f" and f# be two disjoint convex subsets of e5 . Then b a : - e5 such
that : Á ! and !:3 B3 Ÿ !:3 C3 aC - f" and B - f# .
Theorem 70 If @ is finite and 9 is admissible, then 9 is Bayes with respect to some
prior.
For nonfinite @, admissible rules are "usually" Bayes or "limits" of Bayes rules. See
Section 2.10 of Ferguson or page 546 of Berger.
As the next result shows, Bayesians don't need randomized decision rules, at least for
purposes of achieving minimum Bayes risk, VÐKÑ.
Result 71 (See also page 147 of Schervish) Suppose that < - W‡ is Bayes versus K and
VÐKÑ • _. Then there exists a nonrandomized rule $ - W that is also Bayes versus KÞ
There remain the questions of 1) when Bayes rules exist and #Ñ what they look like when
they do exist. These issues are (to some extent) addressed next.
Theorem 72 If @ is finite, f is closed from below and K assigns positive probability to
each ) - @, there is a rule Bayes versus K.
Definition 44 A formal nonrandomized Bayes rule versus a prior G is a rule δ(x) such
that ∀ x ∈ 𝒳,

   δ(x) is an a ∈ 𝒜 minimizing ∫_Θ L(θ,a) [ f_θ(x) / ∫_Θ f_θ(x) dG(θ) ] dG(θ) .
Definition 45 If G is a σ-finite measure, a formal nonrandomized generalized Bayes rule
versus G is a rule δ(x) such that ∀ x ∈ 𝒳,

   δ(x) is an a ∈ 𝒜 minimizing ∫_Θ L(θ,a) f_θ(x) dG(θ) .
Minimax Decision Rules
An alternative to the Bayesian reduction of VÐ)ß 9Ñ to a number by averaging according
to the distribution K, is to instead reduce VÐ)ß 9Ñ to a number by maximization over ).
Definition 46 A decision rule φ ∈ 𝒟* is said to be minimax if

   sup_θ R(θ,φ) = inf_{φ′} sup_θ R(θ,φ′) .
Definition 47 If a decision rule 9 has constant (in )) risk, it is called an equalizer rule.
It should make some sense that if one is trying to find a minimax rule by so to speak
pushing down the highest peak in the profile VÐ)ß 9Ñ, doing so will tend to drive one
towards rules with "flat" VÐ)ß 9Ñ profiles.
Theorem 73 If 9 is an equalizer rule and is admissible, then it is minimax.
Theorem 74 Suppose that Ö93 × is a sequence of decision rules, each 93 Bayes against a
prior K3 . If VÐK3 ß 93 ÑpG and 9 is a decision rule with VÐ)ß 9Ñ Ÿ G a), then 9 is
minimax.
Corollary 75 If 9 is Bayes versus K and VÐ)ß 9Ñ Ÿ VÐKÑ a), then 9 is minimax.
Corollary 75' If 9 is an equalizer rule and is Bayes with respect to a prior K, then it is
minimax.
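A classical illustration of Corollary 75′ (standard, though not worked in these notes): for X ∼ Binomial(n,p) under squared error loss, the Bayes estimator versus a Beta(√n/2, √n/2) prior is δ(X) = (X + √n/2)/(n + √n), and its risk is constant in p, so it is an equalizer Bayes rule and hence minimax. A quick numerical check of the constant-risk claim (n = 16 is an arbitrary choice):

```python
import math

n = 16
a = math.sqrt(n) / 2          # Beta(a, a) prior parameter

def risk(p):
    """R(p, delta) = E_p (delta(X) - p)^2 for delta(X) = (X + a)/(n + 2a),
    computed from Var_p X = n p (1 - p) and the bias of delta."""
    c = 1.0 / (n + 2 * a)
    bias = c * (n * p + a) - p
    var = c * c * n * p * (1 - p)
    return var + bias ** 2

risks = [risk(p / 10) for p in range(11)]
# Equalizer: the risk profile is flat in p, equal to n / (4 (n + sqrt(n))^2).
assert max(risks) - min(risks) < 1e-12
print(round(risks[0], 6))
```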
A reasonable question to ask, in the light of Corollaries 75 and 75', is "Which K should I
be looking for in order to produce a Bayes rule that might be minimax?" The answer to
this question is "Look for a least favorable prior."
Definition 48 A prior distribution G is said to be least favorable if

   R(G) = sup_{G′} R(G′) .
Intuitively, one might think that if an intelligent adversary were choosing ), he/she might
well generate it according to a least favorable K, therefore maximizing the minimum
possible risk. In response a decision maker ought to use 9 Bayes versus K. Perhaps then,
this Bayes rule might tend to minimize the worst that could happen over choice of )?
Theorem 76 If 9 is Bayes versus K and VÐ)ß 9Ñ Ÿ VÐKÑ a), then K is least favorable.
(Corollary 75 already shows 9 in Theorem 76 to be minimax.)
Optimal Point Estimation (Unbiasedness, Information Inequalities
and Invariance and Equivariance)
Unbiasedness
One possible limited class of "sensible" decision rules in estimation problems is the class
of "unbiased" estimators. (Here, we are tacitly assuming that T is e" or some subset of
it.)
Definition 49 The bias of an estimator δ(x) for γ(θ) ∈ ℝ¹ is

   E_θ(δ(X) − γ(θ)) = ∫_𝒳 ( δ(x) − γ(θ) ) dP_θ(x) .
Definition 50 An estimator $ ÐBÑ is unbiased for # Ð)Ñ - e" provided
E) Ð$ Ð\Ñ • # Ð)ÑÑ œ ! a) Þ
Before jumping into the business of unbiased estimation with too much enthusiasm, an
early caution is in order. We supposedly believe that "Bayes ¸ admissible" and that
admissibility is sort of a minimal "goodness" requirement for a decision rule ... but there
are the following theorem and corollary.
Theorem 77 (See Ferguson pages 48 and 52) If PÐ)ß +Ñ œ AÐ)ÑÐ#Ð )Ñ • +Ñ# for
AÐ)Ñ ž !, and $ is both Bayes wrt K with finite risk function and unbiased for # Ð)Ñ, then
(according to the joint distribution of Ð\ß )ÑÑ
$ Ð\Ñ œ # Ð)Ñ a.s. Þ
Corollary 78 Suppose @ § e" has at least two elements and the family c œ ÖT) × is
mutually absolutely continuous. If PÐ)ß +Ñ œ Ð) • +Ñ# no Bayes estimator of ) is
unbiased.
Definition 51 $ is best unbiased for # Ð)Ñ - e" if it is unbiased and at least as good as
any other unbiased estimator.
There need not be any unbiased estimators in a given estimation problem, let alone a best
one. But for many common estimation loss functions, if there is a best unbiased
estimator it is unique.
Theorem 79 Suppose that T § e" is convex, PÐ)ß +Ñ is convex in + a) and $" and $#
are two best unbiased estimators of # Ð)Ñ. Then for any ) for which PÐ)ß +Ñ is strictly
convex in + and VÐ)ß $" Ñ • _
T) Ò$" Ð\Ñ œ $# Ð\ÑÓ œ " Þ
The special case of PÐ)ß +Ñ œ Ð#Ð )Ñ • +Ñ# is, of course, one of most interest. For this
case of unweighted squared error loss, for estimators with second moments
VÐ)ß $ Ñ œ Var) $ Ð\Ñ.
Definition 52 $ is called a UMVUE (uniformly minimum variance unbiased estimator)
of # Ð)Ñ - e" if it is unbiased and
Var) $ Ð\Ñ Ÿ Var) $ w Ð\Ñ a)
for any other unbiased estimator of # Ð)Ñ, $ w .
A much weaker notion is next.
Definition 53 $ is called locally minimum variance unbiased at )! - @ if it is unbiased
and
Var)! $ Ð\Ñ Ÿ Var)! $ w Ð\Ñ
for any other unbiased estimator of # Ð)Ñ, $ w .
Notation: Let h! œ Ö$ l E) $ Ð\Ñ œ ! and E) $ # Ð\Ñ • _ a)×Þ (h! is the set of unbiased
estimators of 0 with finite variance.)
Theorem 80 Suppose that $ is an unbiased estimator of # Ð)Ñ - e" and E) $ # Ð\Ñ • _
a). Then
$ is a UMVUE of # Ð)Ñ iff Cov) Ð$ Ð\Ñß ?Ð\ÑÑ œ ! a) a? - h! .
Theorem 81 (Lehmann-Scheffé) Suppose that T is a convex subset of e" , X is a
sufficient statistic for c œ ÖT) × and $ is an unbiased estimator of # Ð)Ñ. Let
$! œ EÒ$lX Ó œ 9 ‰ X .
Then $! is an unbiased estimator of # Ð)ÑÞ If in addition, X is complete for c ,
i) 9w ‰ X unbiased for # Ð)Ñ implies that 9w ‰ X œ 9 ‰ X a.s. T) a), and
ii) PÐ)ß +Ñ convex in + a) implies that
a) $! is best unbiased, and
b) if $ w is any other best unbiased estimator of # Ð)Ñ
$ w œ $! a.s. T)
for any ) at which VÐ)ß $! Ñ • _ and PÐ)ß +Ñ is strictly convex in +.
(Note that iia) says in particular that $! is UMVUE, and that iib) says that $! is the unique
UMVUE.)
Basu's Theorem (recall Stat 543) is on occasion helpful in doing the conditioning required
to get δ₀ of Theorem 81. The most famous example of this is the calculation of the
UMVUE of Φ((c − μ)/σ) based on n iid N(μ, σ²) observations.
Information Inequalities
These are inequalities about variances of estimators, some of which involve information
measures. (See Schervish Sections 2.3.1, 5.1.2 and TPE Sections 2.6, 2.7.)
The fundamental tool here is the Cauchy-Schwarz inequality.
Lemma 82 (Cauchy-Schwarz) If U and V are random variables with EU² < ∞ and
EV² < ∞, then

   EU² · EV² ≥ (EUV)² .

Applying this to U = X − EX and V = Y − EY, we get the following.
Corollary 83 If X and Y have finite second moments, (Cov(X,Y))² ≤ VarX · VarY.
Corollary 84 Suppose that δ(Y) takes values in ℝ¹ and E(δ(Y))² < ∞. If g is a
measurable function such that Eg(Y) = 0 and 0 < E(g(Y))² < ∞, then defining
C ≡ Eδ(Y)g(Y),

   Var δ(Y) ≥ C² / E(g(Y))² .
A restatement of Corollary 84 in terms that look more "statistical" is next.
Corollary 84′ Suppose that δ(X) takes values in ℝ¹ and E_θ(δ(X))² < ∞. If g_θ is a
measurable function such that E_θ g_θ(X) = 0 and 0 < E_θ(g_θ(X))² < ∞, then defining
C(θ) ≡ E_θ δ(X)g_θ(X),

   Var_θ δ(X) ≥ C²(θ) / E_θ(g_θ(X))² .
Theorem 85 (Cramér-Rao) Suppose that the model 𝒫 is FI regular at θ₀ and that
0 < I(θ₀) < ∞. If E_θ δ(X) = ∫ δ(x) f_θ(x) dμ(x) can be differentiated under the integral
at θ₀, i.e. if

   (d/dθ) E_θ δ(X) |_{θ=θ₀} = ∫ δ(x) (d/dθ) f_θ(x) |_{θ=θ₀} dμ(x) ,

then

   Var_{θ₀} δ(X) ≥ [ (d/dθ) E_θ δ(X) |_{θ=θ₀} ]² / I(θ₀) .

For the special case of δ(X) viewed as an estimator of θ, we sometimes write
E_θ δ(X) = θ + b(θ), and thus Theorem 85 says Var_{θ₀} δ(X) ≥ (1 + b′(θ₀))² / I(θ₀) .
On the face of it, showing that the C-R lower bound is achieved looks like a general
method of showing that an estimator is UMVUE. But the fact is that the circumstances in
which the C-R lower bound is achieved are very well defined. The C-R inequality is
essentially the Cauchy-Schwarz inequality, and one gets equality in Lemma 82 iff
V = aU a.s. This translates in the proof of Theorem 85 to equality iff

   (d/dθ) log f_θ(X) |_{θ=θ₀} = a(θ₀) δ(X) + b(θ₀)   a.s. P_{θ₀} .
And, as it turns out, this can happen for all )! in a parameter space @ iff one is dealing
with a one parameter exponential family of distributions and the statistic $Ð\Ñ is the
natural sufficient statistic. So the method works only for UMVUEs of the means of the
natural sufficient statistics in one parameter exponential families.
Result 86 Suppose that for η real, {P_η} is dominated by a σ-finite measure μ, and with
h(x) > 0,

   f_η(x) = h(x) exp( a(η) + η T(x) )   a.e. μ, ∀ η .

Then for any η₀ in the interior of the natural parameter space where Var_η T(X) < ∞,
T(X) achieves the C-R lower bound for its variance.
(A converse of Result 86 is true, but is also substantially harder to prove.)
The C-R inequality is Corollary 84 with g_θ(x) = (d/dθ) log f_θ(x). There are other
interesting choices of g_θ. For example (assuming it is permissible to move enough
derivatives through integrals), g_θ(x) = ( (d^k/dθ^k) f_θ(x) ) / f_θ(x) produces an
interesting inequality. And the choice g_θ(x) = ( f_θ(x) − f_{θ′}(x) ) / f_θ(x) leads to the
next result.
Theorem 87 If P_{θ′} ≪ P_θ,

   Var_θ δ(X) ≥ ( E_θ δ(X) − E_{θ′} δ(X) )² / E_θ[ ( (f_θ(X) − f_{θ′}(X)) / f_θ(X) )² ] .
So, (the Chapman-Robbins Inequality)

   Var_θ δ(X) ≥ sup_{θ′ s.t. P_{θ′} ≪ P_θ} ( E_θ δ(X) − E_{θ′} δ(X) )² / E_θ[ ( (f_θ(X) − f_{θ′}(X)) / f_θ(X) )² ] .
There are multiparameter versions of the information inequalities (and for that matter,
multiple $ versions involving the covariance matrix of the $ 's). The fundamental result
needed to produce multiparameter versions of the information inequalities is the simple
generalization of Corollary 84.
Lemma 88 Suppose that δ(Y) takes values in ℝ¹ and E(δ(Y))² < ∞. Suppose further
that g_1, g_2, ..., g_k are measurable functions such that Eg_i(Y) = 0 and 0 < E(g_i(Y))² < ∞
∀ i and the g_i are linearly independent as square integrable functions according to the
probability measure P. Let C_i ≡ Eδ(Y)g_i(Y) and define a k × k matrix

   H ≡ ( E g_i(Y) g_j(Y) ) .

Then for C = (C_1, C_2, ..., C_k)′,

   Var δ(Y) ≥ C′ H⁻¹ C .
The "statistical-looking" version of Lemma 88 is next.
Lemma 88′ Suppose that δ(X) takes values in ℝ¹ and E_θ(δ(X))² < ∞. Suppose
further that g_{θ1}, g_{θ2}, ..., g_{θk} are measurable functions such that E_θ g_{θi}(X) = 0 and
0 < E_θ(g_{θi}(X))² < ∞ ∀ i and the g_{θi} are linearly independent as square integrable
functions according to the probability measure P_θ. Let C_i(θ) ≡ E_θ δ(X)g_{θi}(X) and
define a k × k matrix

   H(θ) ≡ ( E_θ g_{θi}(X) g_{θj}(X) ) .

Then for C(θ) = (C_1(θ), C_2(θ), ..., C_k(θ))′,

   Var_θ δ(X) ≥ C(θ)′ H(θ)⁻¹ C(θ) .
And there is the multiparameter version of the C-R inequality.
Theorem 89 (Cramér-Rao) Suppose that the model 𝒫 is FI regular at θ₀ ∈ Θ ⊂ ℝ^k and
that I(θ₀) exists and is nonsingular. If E_θ δ(X) = ∫ δ(x) f_θ(x) dμ(x) can be
differentiated wrt each θ_i under the integral at θ₀, i.e. if

   (∂/∂θ_i) E_θ δ(X) |_{θ=θ₀} = ∫ δ(x) (∂/∂θ_i) f_θ(x) |_{θ=θ₀} dμ(x) ,

then with

   C(θ₀) ≡ ( (∂/∂θ_1) E_θ δ(X) |_{θ=θ₀}, (∂/∂θ_2) E_θ δ(X) |_{θ=θ₀}, ..., (∂/∂θ_k) E_θ δ(X) |_{θ=θ₀} )′ ,

   Var_{θ₀} δ(X) ≥ C(θ₀)′ I(θ₀)⁻¹ C(θ₀) .
There are multiple $ versions of this information inequality stuff. See Rao, pages 326+.
Invariance and Equivariance
Unbiasedness (of estimators) is one means of restricting a class of decision rules down to
a subclass small enough that there is hope of finding a "best" rule in the subclass.
Invariance/equivariance considerations is another. These notions have to do with
symmetries in a decision problem. "Invariance" means that (some) things don't change.
"Equivariance" means that (some) things change similarly or equally.
We'll do the case of Location Parameter Estimation first, and then turn to generalities
about invariance/equivariance that apply more broadly (to both other estimation problems
and to other types of decision problems entirely).
Definition 54 When k œ e5 and @ § e5 , to say that ) is a location parameter means
that \ • ) has the same distribution a).
Definition 55 When k œ e5 and @ § e1 , to say that ) is a 1-dimensional location
parameter means that \ • )" (for " a 5 -vector of "'s) has the same distribution a).
Note that for ) a "-dimensional location parameter, \ µ T) means that
Ð\ € - "Ñ µ T-€) Þ It therefore seems that a "reasonable" restriction on estimators $ is
that they should satisfy $ ÐB € - "Ñ œ $ ÐBÑ € - . How to formalize this?
Until further notice suppose that @ œ T œ e" .
Definition 56 L(θ, a) is called location invariant if L(θ, a) = ρ(θ − a). If ρ(z) is increasing in z for z > 0 and decreasing in z for z < 0, the problem is called a location estimation problem.

Definition 57 A decision rule δ is called location equivariant if δ(x + c1) = δ(x) + c.
Theorem 90 If θ is a 1-dimensional location parameter, L(θ, a) is location invariant and δ is location equivariant, then R(θ, δ) is constant in θ. (That is, δ is an equalizer rule.)
This theorem says that if I'm working on a location estimation problem with a location
invariant loss function and restrict attention to location equivariant estimators, I can
compare estimators by comparing numbers. In fact, as it turns out, there is a fairly
explicit representation for the best location equivariant estimator in such cases.
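Here is a quick Monte Carlo check of Theorem 90 (my own illustration, not from the notes; it assumes iid N(θ, 1) data and squared-error loss). The sample mean is location equivariant and squared-error loss is location invariant, so the estimated risk should not depend on θ.

```python
import random
import statistics

def mc_risk(theta, estimator, n=20, reps=20000, seed=0):
    """Monte Carlo risk under squared-error loss, data iid N(theta, 1)."""
    rng = random.Random(seed)   # same seed for every theta: common random numbers
    total = 0.0
    for _ in range(reps):
        x = [rng.gauss(theta, 1.0) for _ in range(n)]
        total += (estimator(x) - theta) ** 2
    return total / reps

# The sample mean is location equivariant: mean(x + c1) = mean(x) + c.
risks = [mc_risk(theta, statistics.fmean) for theta in (-5.0, 0.0, 3.0)]
spread = max(risks) - min(risks)
print(risks, spread)   # risks all near 1/n = 0.05, spread essentially 0
```

With common random numbers across the three θ's the equalizer property shows up exactly, not just up to simulation error.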
Definition 58 A function u is called location invariant if u(x + c1) = u(x).
Lemma 91 Suppose that δ₀ is location equivariant. Then δ₁ is location equivariant iff ∃ a location invariant function u such that δ₁ = δ₀ + u.

Lemma 92 A function u is location invariant iff u depends on x only through y(x) = (x₁ − x_k, x₂ − x_k, ..., x_{k−1} − x_k).
Theorem 93 Suppose that L(θ, a) = ρ(θ − a) is location invariant and ∃ a location equivariant estimator δ₀ for θ with finite risk. Let ρ′(z) = ρ(−z). Suppose that for each y ∃ a number v*(y) that minimizes

E_{θ=0}[ρ′(δ₀ − v) | Y = y]

over choices of v. Then

δ*(x) = δ₀(x) − v*(y(x))

is a best location equivariant estimator of θ.
In the case of L(θ, a) = (θ − a)², the estimator of Theorem 93 is the Pitman estimator, which can in some situations be written out in a more explicit form.
Theorem 94 Suppose that L(θ, a) = (θ − a)² and the distribution of X = (X₁, X₂, ..., X_k) has R-N derivative wrt Lebesgue measure on ℝᵏ

f_θ(x) = f(x₁ − θ, x₂ − θ, ..., x_k − θ) = f(x − θ1)

where f has second moments. Then the best location equivariant estimator of θ is

δ*(x) = ∫_{−∞}^{∞} θ f(x − θ1) dθ / ∫_{−∞}^{∞} f(x − θ1) dθ

if this is well-defined and has finite risk. (Notice that this is the formal generalized Bayes estimator vs Lebesgue measure.)
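Theorem 94's ratio of integrals can be evaluated numerically. The following sketch (my own illustration, assuming iid standard normal errors, for which the Pitman estimator is known to reduce to the sample mean) approximates both integrals with a midpoint Riemann sum.

```python
import math

def pitman_estimate(x, f, lo=-10.0, hi=10.0, m=4000):
    """delta*(x) = Int theta f(x - theta 1) dtheta / Int f(x - theta 1) dtheta,
    both integrals approximated by a midpoint Riemann sum on [lo, hi]."""
    h = (hi - lo) / m
    num = den = 0.0
    for j in range(m):
        theta = lo + (j + 0.5) * h
        w = f([xi - theta for xi in x])
        num += theta * w
        den += w
    return num / den

def normal_joint(z):
    # joint density f(z1, ..., zk) of iid standard normal "errors"
    return math.exp(-0.5 * sum(zi * zi for zi in z)) / (2 * math.pi) ** (len(z) / 2)

x = [0.3, -1.1, 2.0, 0.7]
est = pitman_estimate(x, normal_joint)
xbar = sum(x) / len(x)
print(est, xbar)   # for normal errors the two should agree closely
```

Swapping in a non-normal f (say, a double exponential) gives a Pitman estimator that is no longer the sample mean.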
Lemma 95 For the case of L(θ, a) = (θ − a)², a best location equivariant estimator of θ must be unbiased for θ.
Now cancel the assumption that Θ = 𝒜 = ℝ¹. The following is a more general story about invariance/equivariance.

Suppose that 𝒫 = {P_θ} is identifiable and (to avoid degeneracies) that

L(θ, a) = L(θ, a′) ∀θ ⇒ a = a′ .
Invariance/equivariance theory is phrased in terms of groups of transformations

i) 𝒳 → 𝒳,
ii) Θ → Θ, and
iii) 𝒜 → 𝒜,

and how L(θ, a) and decision rules behave under these groups of transformations. The natural starting place is with transformations on 𝒳.
Definition 59 For 1-1 transformations of 𝒳 onto 𝒳, g₁ and g₂, the composition of g₁ and g₂ is another 1-1 onto transformation of 𝒳 defined by g₂ ∘ g₁(x) = g₂(g₁(x)) ∀x ∈ 𝒳.

Definition 60 For g a 1-1 transformation of 𝒳 onto 𝒳, if h is another 1-1 onto transformation such that h ∘ g(x) = x ∀x ∈ 𝒳, h is called the inverse of g and denoted as h = g⁻¹.

It is then an easy fact that not only is g⁻¹ ∘ g(x) = x ∀x, but g ∘ g⁻¹(x) = x ∀x as well.

Definition 61 The identity transformation on 𝒳 is defined by e(x) = x ∀x.

Another easy fact is then that e ∘ g = g ∘ e = g for any 1-1 transformation g of 𝒳 onto 𝒳.
(Obvious) Result 96 If 𝒢 is a class of 1-1 transformations of 𝒳 onto 𝒳 that is closed under composition (g₁, g₂ ∈ 𝒢 ⇒ g₁ ∘ g₂ ∈ 𝒢) and inversion (g ∈ 𝒢 ⇒ g⁻¹ ∈ 𝒢), then with the operation "∘" (composition) 𝒢 is a group in the usual mathematical sense.

A group such as that described in Result 96 will be called a transformation group. Such a group may or may not be Abelian/commutative (one is if g₁ ∘ g₂ = g₂ ∘ g₁ ∀g₁, g₂ ∈ 𝒢).

Jargon/notation: Any class of transformations 𝒞 can be thought of as generating a group. That is, 𝒢(𝒞) consisting of all compositions of finite numbers of elements of 𝒞 and their inverses is a group under the operation "∘".
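As a concrete instance of Result 96 (my own toy illustration): the location shifts g_c(x) = x + c of ℝ onto itself are closed under composition and inversion, so they form a transformation group, and an Abelian one.

```python
# Location shifts g_c(x) = x + c on the real line: composition gives
# g_b o g_a = g_{a+b}, the inverse is g_{-c}, and the identity is e = g_0.
class Shift:
    def __init__(self, c):
        self.c = c
    def __call__(self, x):
        return x + self.c
    def compose(self, other):      # self o other
        return Shift(self.c + other.c)
    def inverse(self):
        return Shift(-self.c)

g1, g2, e = Shift(2.5), Shift(-1.0), Shift(0.0)
x = 3.0
assert g2.compose(g1)(x) == g2(g1(x))                 # composition law
assert g1.inverse().compose(g1)(x) == x               # g^{-1} o g = e
assert e.compose(g1)(x) == g1(x) == g1.compose(e)(x)  # identity
assert g1.compose(g2)(x) == g2.compose(g1)(x)         # this group is Abelian
print("shift-group axioms verified")
```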
To get anywhere with this theory, we then need transformations on Θ that fit together nicely with those on 𝒳. At a minimum, one will need that the transformations on 𝒳 don't get one outside of 𝒫.

Definition 62 If g is a 1-1 transformation of 𝒳 onto 𝒳 such that {P_θ^{g(X)}}_{θ∈Θ} = 𝒫, then the transformation g is said to leave the model invariant. If each element of a group of transformations 𝒢 leaves 𝒫 invariant, we'll say that the group leaves the model invariant.

One circumstance in which 𝒢 leaves 𝒫 invariant is where the model 𝒫 can be thought of as "generated by 𝒢" in the following sense. Suppose that U has some fixed probability distribution P₀ and consider

𝒫 = {P^{g(U)} | g ∈ 𝒢} .
If a group of transformations on 𝒳 (say 𝒢) leaves 𝒫 invariant, there is a natural corresponding group of transformations on Θ. That is, for g ∈ 𝒢

X ∼ P_θ ⇒ g(X) ∼ P_{θ′} for some θ′ ∈ Θ .

In fact, because of the identifiability assumption, there is only one such θ′. So, as a matter of notation, adopt the following.

Definition 63 If 𝒢 leaves the model 𝒫 invariant, for each g ∈ 𝒢, define ḡ: Θ → Θ by

ḡ(θ) = the element θ′ of Θ such that X ∼ P_θ ⇒ g(X) ∼ P_{θ′} .

(According to the notation established in Definition 51, for 𝒢 leaving 𝒫 invariant, X ∼ P_θ ⇒ g(X) ∼ P_{ḡ(θ)}.)
(Obvious) Result 97 For 𝒢 a group of 1-1 transformations of 𝒳 onto 𝒳 leaving 𝒫 invariant, 𝒢̄ = {ḡ | g ∈ 𝒢} with the operation "∘" (composition) is a group of 1-1 transformations of Θ.

(I believe that the group 𝒢̄ is the homomorphic image of 𝒢 under the map "¯", so that, for example, (g⁻¹)‾ = ḡ⁻¹ and (g₂ ∘ g₁)‾ = ḡ₂ ∘ ḡ₁.)
We next need transformations on 𝒜 that fit nicely together with those on 𝒳 and Θ. If that's going to be possible, it would seem that the loss function will have to fit nicely with the transformations already considered.

Definition 64 A loss function L(θ, a) is said to be invariant under the group of transformations 𝒢 provided that for every g ∈ 𝒢 and a ∈ 𝒜, ∃ a unique a′ ∈ 𝒜 such that

L(θ, a) = L(ḡ(θ), a′) ∀θ .
Now then, as another matter of jargon/notation, adopt the following.

Definition 65 If 𝒢 leaves the model invariant and the loss L(θ, a) is invariant, define g̃: 𝒜 → 𝒜 by

g̃(a) = the element a′ of 𝒜 such that L(θ, a) = L(ḡ(θ), a′) ∀θ .

(According to the notation established in Definition 53, for 𝒢 leaving 𝒫 invariant and invariant loss L(θ, a),

L(θ, a) = L(ḡ(θ), g̃(a)) .)
(Obvious) Result 98 For 𝒢 a group of 1-1 transformations of 𝒳 onto 𝒳 leaving 𝒫 invariant and invariant loss L(θ, a), 𝒢̃ = {g̃ | g ∈ 𝒢} with the operation "∘" (composition) is a group of 1-1 transformations of 𝒜.

(I believe that the group 𝒢̃ is the homomorphic image of 𝒢 under the map "˜", so that, for example, (g⁻¹)˜ = g̃⁻¹ and (g₂ ∘ g₁)˜ = g̃₂ ∘ g̃₁.)
So finally we come to the notion of equivariance for decision functions. The whole business is slightly subtle for behavioral rules (see pages 150-151 of Ferguson), so for this elementary introduction we'll confine attention to nonrandomized rules.

Definition 66 In an invariant decision problem (one where 𝒢 leaves 𝒫 invariant and L(θ, a) is invariant) a nonrandomized decision rule δ: 𝒳 → 𝒜 is said to be equivariant provided

δ(g(x)) = g̃(δ(x)) ∀x ∈ 𝒳 and g ∈ 𝒢 .

(One could make a definition here involving a single g or perhaps a 3 element group 𝒢 = {g, g⁻¹, e} and talk about equivariance under g instead of under a group. I'll not bother.)
The basic elementary theorem about invariance/equivariance is the generalization of Theorem 90 (that cuts down the size of the comparison problem that one faces when comparing risk functions).

Theorem 99 If 𝒢 leaves 𝒫 invariant, L(θ, a) is invariant and δ is an equivariant nonrandomized decision rule, then

R(θ, δ) = R(ḡ(θ), δ) ∀θ ∈ Θ and g ∈ 𝒢 .
Definition 67 In an invariant decision problem, for θ ∈ Θ, 𝒪_θ = {θ′ ∈ Θ | θ′ = ḡ(θ) for some g ∈ 𝒢} is called the orbit in Θ containing θ.

The set of orbits {𝒪_θ}_{θ∈Θ} partitions Θ. The distinct orbits are equivalence classes.
Theorem 99 says that an equivariant nonrandomized decision rule in an invariant decision
problem has a risk function that is constant on orbits. When looking for good equivariant
rules, the problem of comparing risk functions over @ is reduced somewhat to comparing
risk functions on orbits. Of course, that comparison is especially easy when there is only
one orbit.
Definition 68 A group of 1-1 transformations of a space onto itself is said to be transitive
if for any two points in the space there is a transformation in the group taking the first
point into the second.
Corollary 100 If 𝒢 leaves 𝒫 invariant, L(θ, a) is invariant and 𝒢̄ is transitive over Θ, then every equivariant nonrandomized decision rule has constant risk.
Definition 69 In an invariant decision problem, a decision rule δ is said to be a best (nonrandomized) equivariant (or MRE ... minimum risk equivariant) decision rule if it is equivariant and for any other nonrandomized equivariant rule δ′

R(θ, δ) ≤ R(θ, δ′) ∀θ .

The import of Theorem 99 is that one needs only check this risk inequality for one θ from each orbit.
Optimal Hypothesis Testing (Simple vs Simple, Composite
Hypotheses, UMP Tests, Optimal Confidence Regions and UMPU
Tests)

We now consider the classical hypothesis testing problem. Here we treat

Θ = Θ₀ ∪ Θ₁ where Θ₀ ∩ Θ₁ = ∅

and the hypotheses Hᵢ: θ ∈ Θᵢ, i = 0, 1. As is standard, we will call H₀ the "null hypothesis" and H₁ the "alternative hypothesis." Θᵢ with one element is called "simple" and with more than one element is called "composite."
Deciding between H₀ and H₁ on the basis of data can be thought of in terms of a two-action decision problem. That is, with 𝒜 = {0, 1} one is using X to decide in favor of one of the two hypotheses. Typically one wants a loss structure where correct decisions are preferable to incorrect ones. An obvious (and popular) possibility is 0-1 loss,

L(θ, a) = I[θ ∈ Θ₀, a = 1] + I[θ ∈ Θ₁, a = 0] .

While it is perfectly sensible to simply treat hypothesis testing as a two-action decision problem, being older than formal statistical decision theory, it has its own peculiar/specialized set of jargon and emphases that we will have to recognize. These include the following.
Definition 70 A nonrandomized decision function δ: 𝒳 → {0, 1} is called a hypothesis test.

Definition 71 The rejection region for a nonrandomized test δ is {x | δ(x) = 1}.

Definition 72 A type I error in a testing problem is taking action 1 when θ ∈ Θ₀. A type II error in a testing problem is taking action 0 when θ ∈ Θ₁.

Note that in a two-action decision problem, a behavioral decision rule φ_x is for each x a distribution over {0, 1}, which is clearly characterized by φ_x({1}) ∈ [0, 1]. This notion can be expressed in other terms.

Definition 73 A randomized hypothesis test is a function φ: 𝒳 → [0, 1] with the interpretation that having observed X = x, one rejects H₀ with probability φ(x).
Note that for 0-1 loss,

for θ ∈ Θ₀, R(θ, φ) = E_θ I[a = 1] = E_θ φ(X) ,

the type I error probability, while

for θ ∈ Θ₁, R(θ, φ) = E_θ I[a = 0] = E_θ(1 − φ(X)) = 1 − E_θ φ(X) ,

the type II error probability. Most of testing theory is phrased in terms of the E_θ φ(X) that appears in both of these expressions.
Definition 74 The power function of the (possibly randomized) test φ is

β_φ(θ) ≡ E_θ φ(X) = ∫ φ(x) dP_θ(x) .

Definition 75 The size or level of a test φ is sup_{θ∈Θ₀} β_φ(θ).
This is the maximum probability of a type I error. Standard testing theory proceeds (somewhat asymmetrically) by setting a maximum allowable size and then (subject to that restriction) trying to maximize the power (uniformly) over Θ₁. (That is, subject to a size limitation, standard theory seeks to minimize type II error probabilities.)
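As a small illustration of Definitions 74 and 75 (my own, assuming X̄ ∼ N(θ, 1/n) and a test that rejects for large X̄): the power function has a closed form in the normal CDF, and for H₀: θ ≤ 0 the size is attained at θ = 0.

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power(theta, c, n):
    """beta_phi(theta) = P_theta(Xbar > c) when Xbar ~ N(theta, 1/n)."""
    return 1.0 - norm_cdf((c - theta) * math.sqrt(n))

n = 25
c = 1.645 / math.sqrt(n)        # cutoff aimed at size about 0.05

size = power(0.0, c, n)         # sup over theta <= 0 is attained at theta = 0
beta_alt = power(0.5, c, n)     # power at a point of Theta_1
print(size, beta_alt)
```

The power function is increasing in θ here, so the supremum over Θ₀ = (−∞, 0] really does occur at the boundary point θ = 0.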
Recall that Result 61 says that if T is sufficient for θ, the set of behavioral decision rules that are functions of T is essentially complete. Arguing that fact in the present case is very easy/straightforward. For any randomized test φ, consider

φ*(x) = E[φ | T](x) ,

which is clearly a ℬ(T)-measurable function taking values in [0, 1], i.e. is a hypothesis test based on T. Note that

β_{φ*}(θ) = E_θ φ*(X) = E_θ E[φ | T] = E_θ φ(X) = β_φ(θ) ,

so φ* and φ have the same power functions and are thus equivalent.
Simple vs Simple Testing Problems

Consider first testing problems where Θ₀ = {θ₀} and Θ₁ = {θ₁} (with 0-1 loss). In this finite Θ decision problem, we can for each test φ define the risk vector

(R(θ₀, φ), R(θ₁, φ)) = (β_φ(θ₀), 1 − β_φ(θ₁))

and make the standard two-dimensional plot of risk vectors. A completely equivalent device is to plot instead the points

(β_φ(θ₀), β_φ(θ₁)) .
Lemma 101 For a simple versus simple testing problem, the risk set 𝒮 has the following properties:
i) 𝒮 is convex,
ii) 𝒮 is symmetric wrt the point (1/2, 1/2),
iii) (0, 1) ∈ 𝒮 and (1, 0) ∈ 𝒮, and
iv) 𝒮 is closed.

Lemma 101′ (see page 77 of TSH) For a simple versus simple testing problem, the set 𝒱 ≡ {(β_φ(θ₀), β_φ(θ₁)) | φ is a test} has the following properties:
i) 𝒱 is convex,
ii) 𝒱 is symmetric wrt the point (1/2, 1/2),
iii) (0, 0) ∈ 𝒱 and (1, 1) ∈ 𝒱, and
iv) 𝒱 is closed.
As indicated earlier, the usual game in testing is to fix α and look for a test with size ≤ α and maximum power on Θ₁.

Definition 76 For testing H₀: θ = θ₀ vs H₁: θ = θ₁, a test φ is called most powerful of size α if β_φ(θ₀) = α and for any other test φ* with β_{φ*}(θ₀) ≤ α, β_{φ*}(θ₁) ≤ β_φ(θ₁).
The fundamental result of all of testing theory is the Neyman-Pearson Lemma.
Theorem 102 (The Neyman-Pearson Lemma) Suppose that 𝒫 ≪ μ, a σ-finite measure, and let fᵢ = dP_{θᵢ}/dμ.

i) (Sufficiency) Any test of the form

φ(x) = 1      if f₁(x) > k f₀(x)
       γ(x)   if f₁(x) = k f₀(x)     (*)
       0      if f₁(x) < k f₀(x)

for some k ∈ [0, ∞) and γ: 𝒳 → [0, 1] is most powerful of its size. Further, corresponding to the k = ∞ case, the test

φ(x) = 1      if f₀(x) = 0
       0      otherwise              (**)

is most powerful of size α = 0.

ii) (Existence) For every α ∈ [0, 1], there exists a test of form (*) with γ(x) = γ (a constant in [0, 1]) or a test of form (**) with β_φ(θ₀) = α.

iii) (Uniqueness) If φ is a most powerful test of size α, then it is of form (*) or form (**) except possibly for x ∈ N with P_{θ₀}(N) = P_{θ₁}(N) = 0.
Note that for any two distributions there is always a dominating σ-finite measure (μ = P_{θ₀} + P_{θ₁} will do the job). Further, it's easy to verify that the tests prescribed by this theorem do not depend upon the dominating measure.
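Here is a sketch of constructing a test of form (*) (my own illustration, for a single Binomial(n, p) observation with H₀: p = 0.5 vs H₁: p = 0.7). The likelihood ratio is increasing in t, so the N-P test rejects for large T, randomizing at the boundary point to hit size α exactly.

```python
import math

def binom_pmf(t, n, p):
    return math.comb(n, t) * p**t * (1 - p)**(n - t)

def np_test(n, p0, alpha):
    """N-P test of form (*): reject for T > c, randomize with probability
    gamma at T = c, chosen so the size is exactly alpha."""
    tail = 0.0
    for c in range(n, -1, -1):
        pc = binom_pmf(c, n, p0)
        if tail + pc > alpha:
            gamma = (alpha - tail) / pc
            return c, gamma
        tail += pc
    return 0, 1.0   # alpha = 1: always reject

n, p0, p1, alpha = 10, 0.5, 0.7, 0.05
c, gamma = np_test(n, p0, alpha)

def power(p):
    return sum(binom_pmf(t, n, p) for t in range(c + 1, n + 1)) + gamma * binom_pmf(c, n, p)

print(c, round(gamma, 4), power(p0), power(p1))
```

Without the randomization at T = c, the attainable sizes would be limited to the discrete jumps of the binomial tail; part ii) of the theorem is exactly what the `gamma` computation implements.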
Note also that one could take a more explicitly decision theoretic slant on this whole business. Thinking of testing as a two-decision problem with 0-1 loss and prior distribution G, abbreviating G({θ₀}) = π₀ and G({θ₁}) = π₁, it's easy to see that a Bayesian is a Neyman-Pearson tester with k = π₀/π₁ and that a Neyman-Pearson tester is a closet Bayesian with π₀ = k/(k + 1) and π₁ = 1/(1 + k).
There is a kind of "finite Θ₀ vs simple Θ₁" version of the N-P Lemma. See TSH page 96.

A final small fact about simple versus simple testing (that turns out to be useful in some composite versus composite situations) is next.

Lemma 103 Suppose that α ∈ (0, 1) and φ is most powerful of size α. Then β_φ(θ₁) ≥ α, and the inequality is strict unless f₀ = f₁ a.e. μ.
UMP Tests for Composite vs Composite Problems

Now consider cases where Θ₀ and Θ₁ need not be singletons.

Definition 77 φ is UMP (uniformly most powerful) of size α for testing H₀: θ ∈ Θ₀ vs H₁: θ ∈ Θ₁ provided φ is of size α (i.e. sup_{θ∈Θ₀} β_φ(θ) = α) and for any other φ′ of size no more than α, β_{φ′}(θ) ≤ β_φ(θ) ∀θ ∈ Θ₁.

Problems for which one can find UMP tests are not all that many. Known results cover situations where Θ ⊂ ℝ¹ involving one-sided hypotheses in monotone likelihood ratio families, a few special cases that yield to a kind of "least favorable distribution" argument, and an initially somewhat odd-looking two-sided problem in one parameter exponential families.

Definition 78 Suppose Θ ⊂ ℝ¹ and 𝒫 ≪ μ, a σ-finite measure on (𝒳, ℬ). Suppose that T: 𝒳 → ℝ¹. The family of R-N derivatives {f_θ} is said to have the monotone likelihood ratio (MLR) property in T if for every θ₁ < θ₂ belonging to Θ, the ratio f_{θ₂}(x)/f_{θ₁}(x) is nondecreasing in T(x) on {x | f_{θ₁}(x) + f_{θ₂}(x) > 0}. (𝒫 is also said to have the MLR property.)
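A quick numerical check of Definition 78 (my own illustration): in the Binomial(n, p) family with T(x) = x, the likelihood ratio f_{p₂}(t)/f_{p₁}(t) for p₁ < p₂ should be nondecreasing in t.

```python
import math

def binom_pmf(t, n, p):
    return math.comb(n, t) * p**t * (1 - p)**(n - t)

n, p1, p2 = 10, 0.3, 0.6          # p1 < p2
ratios = [binom_pmf(t, n, p2) / binom_pmf(t, n, p1) for t in range(n + 1)]
# MLR in T(x) = t: the ratio is nondecreasing in t
assert all(ratios[t] <= ratios[t + 1] for t in range(n))
print([round(r, 4) for r in ratios[:4]])
```

Algebraically the ratio is (p₂/p₁)ᵗ((1−p₂)/(1−p₁))ⁿ⁻ᵗ, a geometric (hence monotone) function of t, which is why the check passes for any p₁ < p₂.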
Theorem 104 Suppose that {f_θ} has MLR in T. Then for testing H₀: θ ≤ θ₀ vs H₁: θ > θ₀, for each α ∈ (0, 1] ∃ a test of the form

φ_α(x) = 1    if T(x) > c
         γ    if T(x) = c
         0    if T(x) < c

for c ∈ [−∞, ∞) chosen to satisfy β_φ(θ₀) = α. Such a test
i) has strictly increasing power function,
ii) is UMP size α for these hypotheses, and
iii) uniformly minimizes β_φ(θ) for θ < θ₀ among tests with β_φ(θ₀) = α.
Theorem 105 Suppose that {f_θ} has MLR in T and for state space Θ and action space 𝒜 = {0, 1} the loss function L(θ, a) satisfies

L(θ, 1) − L(θ, 0) < 0 for θ > θ₀, and
L(θ, 1) − L(θ, 0) > 0 for θ < θ₀ .

Then the class of size α tests (α ∈ (0, 1)) identified in Theorem 104 is essentially complete in the class of tests with size in (0, 1).
One two-sided problem in a single real parameter for which it is possible to find UMP tests is covered by the next theorem (that is proved in Section 4.3.4 of Schervish using the generalization of the N-P lemma alluded to earlier).
Theorem 106 Suppose Θ ⊂ ℝ¹, 𝒫 ≪ μ, a σ-finite measure on (𝒳, ℬ), and

f_θ(x) = exp(a(θ) + θT(x)) a.e. μ ∀θ.

Suppose further that θ₁ < θ₂ belong to Θ. A test of the form

φ(x) = 1     if c₁ < T(x) < c₂
       γᵢ    if T(x) = cᵢ, i = 1, 2
       0     if T(x) < c₁ or T(x) > c₂

for c₁ ≤ c₂ minimizes β_φ(θ) ∀θ < θ₁ and ∀θ > θ₂ over choices of tests φ′ with β_{φ′}(θ₁) = β_φ(θ₁) and β_{φ′}(θ₂) = β_φ(θ₂). Further, if β_φ(θ₁) = β_φ(θ₂) = α, then φ is UMP level α for testing H₀: θ ≤ θ₁ or θ ≥ θ₂ vs H₁: θ₁ < θ < θ₂.
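A sketch of finding the constants in Theorem 106 numerically (my own illustration, for a single X ∼ N(θ, 1) so T(x) = x, with θ₁ = −1, θ₂ = 1; since T is continuous no boundary randomization is needed, so γ₁ = γ₂ = 0).

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power(theta, c1, c2):
    """beta_phi(theta) = P_theta(c1 < X < c2) for X ~ N(theta, 1)."""
    return norm_cdf(c2 - theta) - norm_cdf(c1 - theta)

theta1, theta2, alpha = -1.0, 1.0, 0.05
# By the symmetry about 0 we may take c2 = -c1 = d; bisect on d until
# beta(theta1) = beta(theta2) = alpha.
lo, hi = 0.0, 5.0
for _ in range(80):
    d = 0.5 * (lo + hi)
    if power(theta2, -d, d) < alpha:
        lo = d
    else:
        hi = d
c1, c2 = -d, d
print(round(c1, 4), round(c2, 4), power(theta1, c1, c2), power(0.0, c1, c2))
```

The power at θ = 0 (inside (θ₁, θ₂)) comes out larger than α, as it should for a test that rejects when T lands between the cutoffs.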
As far as finding UMP tests for parameter spaces other than subsets of ℝ¹ is concerned, about the only "general" technique known is one discussed in Section 3.8 of TSH. The method is not widely applicable, but does yield a couple of important results. It is based on an "equalizer rule"/"least favorable distribution" type argument.

The idea is that for a few composite vs composite problems, we can for each θ₁ ∈ Θ₁ guess at a distribution Λ over Θ₀ so that with

f_Λ(x) = ∫_{Θ₀} f_θ(x) dΛ(θ) ,

simple versus simple testing of H₀: f_Λ vs H₁: f_{θ₁} produces a size α N-P test whose power over Θ₀ is bounded by α. (Notice, by the way, that this is a Bayes test in the composite vs composite problem against a prior that is a mixture of Λ and a point mass at θ₁.) Anyhow, when that test is independent of θ₁ we get UMP size α for the original problem.
Definition 79 For testing H₀: X ∼ f_Λ vs H₁: X ∼ f_{θ₁}, Λ is called least favorable at level α if for φ_Λ the N-P size α test associated with Λ,

β_{φ_Λ}(θ₁) ≤ β_{φ_{Λ′}}(θ₁)

for any other distribution Λ′ over Θ₀.

Theorem 107 If φ is of size α for testing H₀: θ ∈ Θ₀ vs H₁: θ = θ₁, then it is of size no more than α for testing H₀: X ∼ f_Λ vs H₁: X ∼ f_{θ₁}. Further, if φ is most powerful of size α for testing H₀: X ∼ f_Λ vs H₁: X ∼ f_{θ₁} and is also size α for testing H₀: θ ∈ Θ₀ vs H₁: θ = θ₁, then it is a MP size α test for this set of hypotheses and Λ is least favorable at level α.
Clearly, if in the context of Theorem 107 the same φ is most powerful size α for each θ₁, that test is then UMP size α for the hypotheses H₀: θ ∈ Θ₀ vs H₁: θ ∈ Θ₁.
Confidence Sets Derived From Tests (and Optimal Ones Derived From UMP
Tests)

Definition 80 Suppose that for each x ∈ 𝒳, S(x) is a (measurable) subset of Θ. The family {S(x) | x ∈ 𝒳} is called a γ level family of confidence sets for θ provided

P_θ[θ ∈ S(X)] ≥ γ ∀θ ∈ Θ .

It is possible (and quite standard) to use tests to produce confidence procedures. That goes as follows. For each θ₀ ∈ Θ, let φ_{θ₀} be a size α test of H₀: θ = θ₀ vs H₁: θ ∈ Θ₁ (where θ₀ ∈ Θ − Θ₁) and define

S_φ(x) = {θ | φ_θ(x) < 1} .

(S_φ(x) is the set of θ's for which there is positive conditional probability of accepting H₀ having observed X = x.) {S_φ(x) | x ∈ 𝒳} is a level (1 − α) family of confidence sets for θ. This rigmarole works perfectly generally, and inversion of tests of point null hypotheses is a time-honored means of constructing confidence procedures. In the context of UMP testing, the construction produces confidence sets with an optimality property.
Theorem 108 Suppose that for each θ₀, K(θ₀) ⊂ Θ with θ₀ ∉ K(θ₀), and φ_{θ₀} is a nonrandomized UMP size α test of H₀: θ = θ₀ vs H₁: θ ∈ K(θ₀). Then if S_φ(x) is the confidence procedure derived from the tests φ_{θ₀} and S(x) is any other level (1 − α) confidence procedure,

P_θ[θ₀ ∈ S(X)] ≥ P_θ[θ₀ ∈ S_φ(X)] ∀θ ∈ K(θ₀) .

(That is, the S_φ construction produces confidence procedures that minimize the probabilities of covering incorrect values.)
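A sketch of the test-inversion construction (my own illustration: it inverts equal-tailed exact binomial tests, giving a Clopper-Pearson-style interval, and is not the inversion of UMP tests that Theorem 108 is about).

```python
import math

def binom_pmf(t, n, p):
    return math.comb(n, t) * p**t * (1 - p)**(n - t)

def accepts(p0, t, n, alpha):
    """Equal-tailed exact test of H0: p = p0: 'accept' iff neither tail
    probability of the observed t falls at or below alpha/2."""
    upper = sum(binom_pmf(s, n, p0) for s in range(t, n + 1))
    lower = sum(binom_pmf(s, n, p0) for s in range(0, t + 1))
    return upper > alpha / 2 and lower > alpha / 2

n, t, alpha = 20, 6, 0.05
grid = [i / 1000 for i in range(1, 1000)]
S = [p0 for p0 in grid if accepts(p0, t, n, alpha)]   # S_phi(x) on a grid
print(min(S), max(S))   # endpoints of the inverted confidence set for p
```

Each grid point p₀ plays the role of a point null; the confidence set is exactly the collection of nulls that the data fail to reject.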
UMPU Tests for Composite versus Composite Problems

Cases in which one can find UMP tests are not all that many. This should not be terribly surprising given our experience to date in Stat 643. We are used to not being able to find uniformly best statistical procedures. In estimation problems we found that by restricting the class of estimators under consideration (e.g. to the unbiased or equivariant estimators) we could sometimes find uniformly best rules in the restricted class. Something similar is sometimes possible in composite versus composite testing problems.

Definition 81 A size α test of H₀: θ ∈ Θ₀ vs H₁: θ ∈ Θ₁ is unbiased provided

β_φ(θ) ≥ α ∀θ ∈ Θ₁ .

Definition 82 An unbiased size α test of H₀: θ ∈ Θ₀ vs H₁: θ ∈ Θ₁ is UMPU (uniformly most powerful unbiased) level α if for any other size α unbiased test φ′

β_φ(θ) ≥ β_{φ′}(θ) ∀θ ∈ Θ₁ .

The conditions under which people can identify UMPU tests require that Θ be a topological space. (One needs notions of interior, boundary, continuity of functions, etc.) So in this section make that assumption.
Definition 83 The boundary set is Θ_B ≡ cl(Θ₀) ∩ cl(Θ₁), the intersection of the closures of Θ₀ and Θ₁.

Lemma 109 If β_φ(θ) is continuous and φ is unbiased of size α, then β_φ(θ) = α ∀θ ∈ Θ_B.

Definition 84 A test φ is said to be α-similar on the boundary if

β_φ(θ) = α ∀θ ∈ Θ_B .

When looking for UMPU tests it often suffices to restrict attention to tests α-similar on the boundary.
Lemma 110 Suppose that β_φ(θ) is continuous for every φ. If φ is UMP among tests α-similar on the boundary and has level α, then it is UMPU level α.

This lemma and Theorem 106 (or the generalized Neyman-Pearson lemma that stands behind it) can be used to prove a kind of "reciprocal" of Theorem 106 for one parameter exponential families.
Theorem 111 Suppose Θ ⊂ ℝ¹, 𝒫 ≪ μ, a σ-finite measure on (𝒳, ℬ), and

f_θ(x) = exp(a(θ) + θT(x)) a.e. μ ∀θ.

Suppose further that θ₁ < θ₂ belong to Θ. A test of the form

φ(x) = 1     if T(x) < c₁ or T(x) > c₂
       γᵢ    if T(x) = cᵢ, i = 1, 2
       0     if c₁ < T(x) < c₂

for c₁ ≤ c₂ maximizes β_φ(θ) ∀θ < θ₁ and ∀θ > θ₂ over choices of tests φ′ with β_{φ′}(θ₁) = β_φ(θ₁) and β_{φ′}(θ₂) = β_φ(θ₂). Further, if β_φ(θ₁) = β_φ(θ₂) = α, then φ is UMPU level α for testing H₀: θ ∈ [θ₁, θ₂] vs H₁: θ ∉ [θ₁, θ₂].
Theorem 111 deals with an interval null hypothesis. Dealing with a point null hypothesis turns out to require slightly different arguments.

Theorem 112 Suppose that for Θ an open interval in ℝ¹, 𝒫 ≪ μ, a σ-finite measure on (𝒳, ℬ), and

f_θ(x) = exp(a(θ) + θT(x)) a.e. μ ∀θ.

Suppose that φ is a test of the form given in Theorem 111 and that

β_φ(θ₀) = α and β_φ′(θ₀) = 0

(the power function has value α and derivative 0 at θ₀). Then φ is UMPU size α for testing H₀: θ = θ₀ vs H₁: θ ≠ θ₀.
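A numerical check of the two side conditions in Theorem 112 (my own illustration, for X ∼ N(θ, 1), where by symmetry the equal-tailed test rejecting when |x − θ₀| > c satisfies both conditions).

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

THETA0, C = 0.0, 1.959963985   # c chosen so beta(theta0) is about 0.05

def power(theta):
    """Power of the test rejecting when |x - theta0| > c, X ~ N(theta, 1)."""
    return norm_cdf(THETA0 - C - theta) + 1.0 - norm_cdf(THETA0 + C - theta)

alpha = power(THETA0)                       # size condition: beta(theta0) = alpha
h = 1e-6
slope = (power(h) - power(-h)) / (2 * h)    # derivative condition: beta'(theta0) = 0
print(alpha, slope)
```

For an asymmetric pair of cutoffs the size condition could still be met, but the derivative condition would fail, and the resulting test would be biased.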
Theorems 111 and 112 are about cases where Θ ⊂ ℝ¹ (and, of course, limited to one parameter exponential families). It is possible to find some UMPU tests for hypotheses with "(k − 1)-dimensional" boundaries in "k-dimensional" Θ problems using a methodology exploiting "Neyman structure" in a statistic T. I may write up some notes on this methodology (and again I may not). It is discussed in Section 4.5 of Schervish. A general kind of result that comes from the arguments is next.
Theorem 113 Suppose that X = (X₁, X₂, ..., X_k) takes values in ℝᵏ and the distributions P_η have densities with respect to a σ-finite measure μ on ℝᵏ of the form

f_η(x) = K(η) exp( Σ_{i=1}^{k} ηᵢxᵢ ) .

Suppose further that the natural parameter space Γ contains an open rectangle in ℝᵏ, and let U = (X₂, ..., X_k). Suppose α ∈ (0, 1).

i) For H₀: η₁ ≤ η₁* vs H₁: η₁ > η₁*,
or for H₀: η₁ ≤ η₁* or η₁ ≥ η₁** vs H₁: η₁ ∈ (η₁*, η₁**),
for each u ∃ a conditional on U = u UMP level α test φ′_u(x₁), and the test

φ(x) = φ′_u(x₁)

is UMPU level α.

ii) For H₀: η₁ ∈ [η₁*, η₁**] vs H₁: η₁ ∉ [η₁*, η₁**],
or for H₀: η₁ = η₁* vs H₁: η₁ ≠ η₁*,
for each u ∃ a conditional on U = u UMPU level α test φ′_u(x₁), and the test

φ(x) = φ′_u(x₁)

is UMPU level α.

This theorem says that if one thinks of η = (η₁, η₂, ..., η_k) as containing the parameter of interest η₁ and "nuisance parameters" η₂, ..., η_k, at least in some exponential families it is possible to find "optimal" tests about the parameter of interest.
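The best-known instance of this conditioning idea is Fisher's exact test: for two independent binomials with a common success probability under H₀, the total count is sufficient for the nuisance parameter, and the conditional distribution of the first count given the total is hypergeometric, free of that parameter. A sketch (my own, with made-up counts):

```python
import math

def hypergeom_pmf(x, n1, n2, s):
    """P(X1 = x | X1 + X2 = s) for independent Binomial(n1, p), Binomial(n2, p):
    hypergeometric, free of the common nuisance parameter p."""
    return math.comb(n1, x) * math.comb(n2, s - x) / math.comb(n1 + n2, s)

def fisher_one_sided_p(x1, n1, n2, s):
    """Conditional (Neyman-structure style) p-value for H1: p1 > p2,
    summing the upper tail of the hypergeometric given U = s."""
    hi = min(n1, s)
    return sum(hypergeom_pmf(x, n1, n2, s) for x in range(x1, hi + 1))

# toy 2x2 table: 9 of 12 successes in group 1, 3 of 12 in group 2
p_value = fisher_one_sided_p(9, 12, 12, 12)
print(p_value)
```

The conditional test has the same size α for every value of the nuisance parameter, which is exactly the α-similarity that Neyman structure delivers.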