Stat 643 Results Sufficiency (and Ancillarity and Completeness)

Roughly, T(X) is sufficient for {P_θ}_{θ∈Θ} (or for θ) provided the conditional distribution
of X given T(X) is the same for each θ ∈ Θ.

Example  Let X = (X_1, X_2, ..., X_n), where the X_i are iid Bernoulli(p) and
T(X) = Σ_{i=1}^n X_i. You may show that P_p[X = x | T = t] is the same for each p.
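(A quick numerical check of this example, not part of the original notes: a minimal simulation sketch assuming only numpy, which tabulates the empirical conditional distribution of X given T = t at two different values of p.)

    import numpy as np
    from collections import Counter

    def cond_dist(p, n=4, t=2, reps=200_000, seed=0):
        # empirical conditional distribution of X given T(X) = t under iid Bernoulli(p)
        rng = np.random.default_rng(seed)
        X = rng.binomial(1, p, size=(reps, n))
        keep = X[X.sum(axis=1) == t]
        return Counter(map(tuple, keep.tolist()))

    for p in (0.2, 0.7):
        c = cond_dist(p)
        tot = sum(c.values())
        print(p, {x: round(k / tot, 3) for x, k in sorted(c.items())})
    # Both conditional distributions are (up to simulation error) uniform over the
    # C(4,2) = 6 arrangements containing two 1's, i.e. free of p, as sufficiency requires.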
In some sense, a sufficient statistic carries all the information in X. (A variable with the
same distribution as X can be constructed from T(X) and some randomization whose
form does not depend upon θ.) The $64 question is how to recognize one. The primary
tool is the "Factorization Theorem" in its various incarnations. This says roughly that T
is sufficient iff

    "f_θ(x) = g_θ(T(x)) h(x)".

Result 1 (Page 20 Lehmann TSH)  Suppose that X is discrete. Then T(X) is sufficient
for θ iff there exist nonnegative functions g_θ(·) and h(·) such that P_θ[X = x] = g_θ(T(x)) h(x).
(You can and should prove this yourself.)
The Stat 643 version of Result 1 is considerably more general and relies on measure-theoretic
machinery. Basically, one can "easily" push Result 1 as far as "families of
distributions dominated by a σ-finite measure." The following standard notation will be
used throughout these notes:

    (𝒳, ℬ, P_θ)   a probability space (for fixed θ)
    𝒫 = {P_θ}_{θ∈Θ}
    E_θ(g(X)) = ∫ g dP_θ
    T: (𝒳, ℬ) → (𝒯, ℱ)   a statistic
    ℬ_0 = ℬ(T) = {T^(-1)(F)}_{F∈ℱ}   the σ-algebra generated by T

Definition 1  T (or ℬ_0) is sufficient for 𝒫 if for every B ∈ ℬ there exists a ℬ_0-measurable
function Y such that for all θ ∈ Θ

    Y = E_θ(I_B | ℬ_0)   a.s. P_θ.
(Note that E_θ(I_B | ℬ_0) = P_θ(B | ℬ_0). This definition says that there is a single random
variable that will serve as a conditional probability of B given ℬ_0 for all θ in the
parameter space.)

Definition 2  𝒫 is dominated by a σ-finite measure μ if P_θ ≪ μ ∀ θ ∈ Θ.

Then, assuming that 𝒫 is dominated by μ, standard notation is to let

    f_θ = dP_θ/dμ

and it is these objects (R-N derivatives) that factor nicely iff T is sufficient. To prove this
involves some nontrivial technicalities.

Definition 3  𝒫 has a countable equivalent subset if there exist a sequence of positive constants
{c_i}_{i=1}^∞ with Σ c_i = 1 and a corresponding sequence {θ_i}_{i=1}^∞ such that λ = Σ c_i P_{θ_i} is
equivalent to 𝒫 in the sense that λ(B) = 0 iff P_θ(B) = 0 ∀ θ.

Notation: Assuming that 𝒫 has a countable equivalent subset, let λ be a choice of
Σ c_i P_{θ_i} as in Definition 3.

Theorem 2 (Page 575 Lehmann TSH)  𝒫 is dominated by a σ-finite measure μ iff it has
a countable equivalent subset.

Theorem 3 (Page 54 Lehmann TSH)  Suppose that 𝒫 is dominated by a σ-finite measure
and T: (𝒳, ℬ) → (𝒯, ℱ). Then T is sufficient for 𝒫 iff there exist nonnegative ℱ-measurable
functions g_θ such that ∀ θ

    (dP_θ/dλ)(x) = g_θ(T(x))   a.s. λ.

(This last equality is equivalent to P_θ(B) = ∫_B g_θ(T(x)) dλ(x) ∀ B ∈ ℬ.)

Theorem 4 (Page 55 Lehmann TSH) ("The" Factorization Theorem)  Suppose that
𝒫 ≪ μ where μ is σ-finite. Then T is sufficient for 𝒫 iff there exist a nonnegative ℬ-measurable
function h and nonnegative ℱ-measurable functions g_θ such that ∀ θ

    (dP_θ/dμ)(x) = f_θ(x) = g_θ(T(x)) h(x)   a.s. μ.
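(An informal illustration of what the factorization gives, my own sketch rather than anything from the notes: for iid N(θ,1) data the likelihood depends on x only through T(x) = Σ x_i, so two samples with the same total have likelihoods whose ratio is free of θ.)

    import numpy as np

    def lik(theta, x):
        # N(theta, 1) likelihood, which factors as g_theta(sum(x)) * h(x)
        return np.prod(np.exp(-(x - theta) ** 2 / 2) / np.sqrt(2 * np.pi))

    x = np.array([0.1, 1.2, -0.3])
    y = np.array([0.4, 0.4, 0.2])          # a different sample with the same total
    for theta in (-1.0, 0.0, 2.5):
        print(theta, lik(theta, x) / lik(theta, y))
    # The ratio is the same for every theta, reflecting f_theta(x) = g_theta(T(x)) h(x).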
Standard/elementary applications of Theorem 4 are to cases where:
 1) μ is Lebesgue measure on ℝ^d, 𝒫 is some parametric family of d-dimensional
    distributions, and the f_θ are d-dimensional probability densities, or
 2) μ is counting measure on some countable set (like, e.g., (ℤ₊)^d), 𝒫 is some
    parametric family of discrete distributions and the f_θ are probability mass
    functions.
But it can also be applied much more widely (e.g. to cases where μ is the sum of d-dimensional
Lebesgue measure and counting measure on some countable subset of ℝ^d).

Example  Theorem 4 can be used to show that for 𝒳 = ℝ^d,
X = (X_1, X_2, ..., X_d), ℬ = ℬ^d the d-dimensional Borel σ-algebra, π = (π_1, π_2, ..., π_d) a
generic permutation of the first d positive integers, π(X) = (X_{π_1}, X_{π_2}, ..., X_{π_d}), μ the
d-dimensional Lebesgue measure and 𝒫 = {P | P ≪ μ and f = dP/dμ satisfies
f(π(x)) = f(x) a.s. μ for all permutations π}, the order statistic
T(X) = (X_(1), X_(2), ..., X_(d)) is sufficient.

Actually, it is an important fact (that does not follow from the factorization theorem) that
one can establish the sufficiency of the order statistic even more broadly. With 𝒫 = {P
on ℝ^d such that P(B) = P(π(B)) for all permutations π}, direct argument shows that
the order statistic is sufficient. (You should try to prove this. For B ∈ ℬ, try Y(x) equal
to the fraction of all permutations that place x in B.)

(Obvious) Result 5  If T(X) is sufficient for 𝒫 and 𝒫_0 is a subset of 𝒫, then T(X) is
sufficient for 𝒫_0.

A natural question is "How far can one go in the direction of 'reducing' X without
throwing something away?". The notion of minimal sufficiency is aimed in the direction
of answering this question.

Definition 4  A sufficient statistic T: (𝒳, ℬ) → (𝒯, ℱ) is minimal sufficient (for θ or for
𝒫) provided for every sufficient statistic S: (𝒳, ℬ) → (𝒮, 𝒢) there exists a (?measurable?) function
u: 𝒮 → 𝒯 such that

    T = u ∘ S   a.e. 𝒫.

(It is probably equivalent, or close to equivalent, that ℬ(T) ⊂ ℬ(S).)

There are a couple of ways of identifying a minimal sufficient statistic.
Theorem 6 (Like page 92 of Schervish ... other development on pages 41-43 of
Lehmann TPE)  Suppose that 𝒫 is dominated by a σ-finite measure μ and
T: (𝒳, ℬ) → (𝒯, ℱ) is sufficient for 𝒫. Suppose that there exist versions of the R-N
derivatives (dP_θ/dμ)(x) = f_θ(x) such that

    (the existence of k(x, y) > 0 such that f_θ(y) = f_θ(x) k(x, y) ∀ θ)  implies  T(x) = T(y);

then T is minimal sufficient.

(A trivial "generalization" of Theorem 6 is to suppose that the condition only holds for a
set 𝒳* ⊂ 𝒳 with P_θ probability 1 for all θ. The condition in Theorem 6 is essentially that
T(x) = T(y) iff f_θ(y)/f_θ(x) is free of θ.)

Note that sufficiency of T(X) means that T(x) = T(y) implies
(the existence of k(x, y) > 0 such that f_θ(y) = f_θ(x) k(x, y) ∀ θ). So the condition in
Theorem 6 could be phrased as an iff condition.

Example  Let X = (X_1, X_2, ..., X_n), where the X_i are iid Bernoulli(p) and
T(X) = Σ_{i=1}^n X_i. You may use Theorem 6 to show that T is minimal sufficient.
Lehmann emphasizes a second method of establishing minimality based on the following
two theorems.
Theorem 7 (Page 41 of Lehmann TPE)  Suppose that 𝒫 = {P_i}_{i=0}^k is finite and let {f_i}
be the R-N derivatives with respect to some dominating σ-finite measure (like, e.g.,
Σ_{i=0}^k P_i). Suppose that all k+1 derivatives f_i are positive on all of 𝒳. Then

    T(X) = ( f_1(x)/f_0(x), f_2(x)/f_0(x), ..., f_k(x)/f_0(x) )

is minimal sufficient.

(Again it is a trivial "generalization" to observe that the minimality also holds if there is a
set 𝒳* ⊂ 𝒳 with P_θ probability 1 for all θ where the k+1 derivatives f_i are positive on
all of 𝒳*.)

Theorem 8 (Page 42 of Lehmann TPE)  Suppose that 𝒫 is a family of mutually
absolutely continuous distributions and that 𝒫_0 ⊂ 𝒫. If T is sufficient for 𝒫 and
minimal sufficient for 𝒫_0, then it is minimal sufficient for 𝒫.

The way that Theorems 7 and 8 are used together is to look at a few θ's, invoke Theorem
7 to establish minimal sufficiency (for the finite family) of a vector made up of a few
likelihood ratios, and then use Theorem 8 to infer minimality for the whole family.
Example  Let X = (X_1, X_2, ..., X_n), where the X_i are iid Beta(α, β). Using P_0 the
Beta(1,1) distribution, P_1 the Beta(2,1) distribution and P_2 the Beta(1,2) distribution,
Theorems 7 and 8 can be used to show that ( Π_{i=1}^n X_i , Π_{i=1}^n (1 − X_i) ) is minimal sufficient for
the Beta family.
Definition 5  A statistic T: (𝒳, ℬ) → (𝒯, ℱ) is said to be ancillary for 𝒫 = {P_θ}_{θ∈Θ} if
the distribution of T (i.e., the measure P_θ^T on (𝒯, ℱ) defined by P_θ^T(F) = P_θ(T^(-1)(F)))
does not depend upon θ (i.e. is the same for all θ).

Definition 6  A statistic T: (𝒳, ℬ) → (ℝ¹, ℬ¹) is said to be first order ancillary for
𝒫 = {P_θ}_{θ∈Θ} if the mean of T (i.e., E_θ(T(X)) = ∫ T dP_θ) does not depend upon θ (i.e. is
the same for all θ).

Example  Let X = (X_1, X_2, ..., X_n), where the X_i are iid N(θ, 1). The sample variance
of the X_i's is ancillary.

Lehmann argues that, roughly speaking, models with nice (low dimensional) minimal
sufficient statistics T are ones where no nontrivial function of T is ancillary (or first order
ancillary). This line of thinking brings up the notion of completeness.

Notation: Continuing to let T: (𝒳, ℬ) → (𝒯, ℱ) and P_θ^T be the measure on (𝒯, ℱ)
defined by P_θ^T(F) = P_θ(T^(-1)(F)), let 𝒫^T = {P_θ^T}_{θ∈Θ}.

Definition 7a  𝒫^T or T is complete (boundedly complete) for 𝒫 or θ if
 i) U: (𝒳, ℬ_0) → (ℝ¹, ℬ¹) ℬ_0-measurable (and bounded), and
 ii) E_θ U(X) = 0 ∀ θ
imply that U = 0 a.s. P_θ ∀ θ.

By virtue of Lehmann's Theorem (see Cressie's Probability Summary) this is equivalent to
the next definition.

Definition 7b  𝒫^T or T is complete (boundedly complete) for 𝒫 or θ if
 i) h: (𝒯, ℱ) → (ℝ¹, ℬ¹) ℱ-measurable (and bounded), and
 ii) E_θ h(T(X)) = 0 ∀ θ
imply that h ∘ T = 0 a.s. P_θ ∀ θ.

These definitions say that there is no nontrivial function of T that is an unbiased
estimator of 0. This is equivalent to saying that there is no nontrivial function of T that is
first order ancillary. Notice also that completeness implies bounded completeness, so that
bounded completeness is a slightly weaker condition than completeness.
Example  Let X = (X_1, X_2, ..., X_n), where the X_i are iid Bernoulli(p) and
T(X) = Σ_{i=1}^n X_i. You may use the fact that polynomials can be identically 0 on an interval
only if their coefficients are all 0 to show that T(X) is complete for p ∈ [0, 1].
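(The polynomial argument can also be seen numerically; this is a small sketch of my own, not from the notes. E_p h(T) = Σ_t h(t) C(n,t) p^t (1−p)^(n−t) is a polynomial of degree n in p, so forcing it to vanish at n+1 distinct p's forces h ≡ 0.)

    import numpy as np
    from math import comb

    n = 5
    ps = np.linspace(0.1, 0.9, n + 1)        # n+1 distinct values of p
    # rows: values of p; columns: coefficients of h(0), ..., h(n) in E_p h(T)
    A = np.array([[comb(n, t) * p**t * (1 - p)**(n - t) for t in range(n + 1)] for p in ps])
    print(np.linalg.matrix_rank(A))          # full rank n+1, so the system A h = 0
    print(np.linalg.solve(A, np.zeros(n + 1)))   # has only the solution h = 0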
Theorem 9 (Page 99 Schervish) (Basu's Theorem)  If T is a boundedly complete
sufficient statistic and U is ancillary, then for all θ the random variables U and T are
independent (according to P_θ).

(This theorem has some uses in the identification of UMVUEs.)
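(A simulation sketch of my own for the standard illustration: under iid N(θ,1), the sample mean is a boundedly complete sufficient statistic and the sample variance is ancillary, so Basu's Theorem says they are independent.)

    import numpy as np

    rng = np.random.default_rng(1)
    theta, n, reps = 3.0, 10, 100_000
    X = rng.normal(theta, 1.0, size=(reps, n))
    xbar = X.mean(axis=1)                  # complete sufficient statistic
    s2 = X.var(axis=1, ddof=1)             # ancillary statistic
    print(np.corrcoef(xbar, s2)[0, 1])     # approximately 0
    # A sharper check: the conditional mean of s2 given xbar is flat in xbar.
    bins = np.quantile(xbar, np.linspace(0, 1, 6))
    for lo, hi in zip(bins[:-1], bins[1:]):
        sel = (xbar >= lo) & (xbar < hi)
        print(round(lo, 2), round(hi, 2), round(s2[sel].mean(), 3))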
(Obvious) Result 10  If T is complete for 𝒫 = {P_θ}_{θ∈Θ} and Θ' ⊂ Θ, T need not be
complete for 𝒫' = {P_θ}_{θ∈Θ'}.

Theorem 11  If Θ' ⊂ Θ and Θ' dominates Θ in the sense that P_θ(B) = 0 ∀ θ ∈ Θ' implies
that P_θ(B) = 0 ∀ θ ∈ Θ, then T (boundedly) complete for 𝒫' = {P_θ}_{θ∈Θ'} implies that T
is (boundedly) complete for 𝒫 = {P_θ}_{θ∈Θ}.
Completeness of a statistic says that it is not carrying any "ancillary" information, stuff
that in and of itself is of no help in inference about θ. That is reminiscent of minimal
sufficiency.

Theorem 12 (Bahadur's Theorem)  Suppose that T: (𝒳, ℬ) → (𝒯, ℱ) is sufficient and
boundedly complete.
 i) (Pages 94-95 of Schervish) If (𝒯, ℱ) = (ℝ^d, ℬ^d) then T is minimal sufficient.
 ii) If there is any minimal sufficient statistic, T is minimal sufficient.

Example  Suppose that 𝒫 ≪ μ (where μ is σ-finite) and that for θ ∈ Θ ⊂ ℝ^k

    (dP_θ/dμ)(x) = c(θ) exp{ Σ_{i=1}^k θ_i T_i(x) }.

Provided, for example, that Θ contains an open rectangle in ℝ^k, Theorem 1 page 142
Lehmann TSH implies that T(X) = (T_1(X), T_2(X), ..., T_k(X)) is complete. It is clearly
sufficient by the factorization criterion. So by part i) of Theorem 12, T is minimal
sufficient.
Decision Theory
This is a formal framework for posing problems of finding good statistical procedures as
optimization problems.
Basic Elements of a (Statistical) Decision Problem:

    X        the data; formally a measurable mapping from an underlying probability space (Ω, ℱ) to (𝒳, ℬ)
    Θ        the parameter space; often one needs to do integrations over Θ, so let 𝒞 be a
             σ-algebra on Θ and assume that for B ∈ ℬ the mapping taking θ → P_θ(B) from
             (Θ, 𝒞) to (ℝ¹, ℬ¹) is measurable
    𝒜        the decision/action space, the class of possible decisions or actions; we will also
             need to average over (random) actions, so let ℰ be a σ-algebra on 𝒜
    L(θ, a)  the loss function; L: Θ × 𝒜 → [0, ∞) and needs to be suitably measurable
    δ(x)     a decision rule; δ: (𝒳, ℬ) → (𝒜, ℰ) and for data X, δ(X) is the decision taken on
             the basis of observing X

The fact that L(θ, δ(X)) is random prevents any sensible comparison of decision rules
until one has averaged out over the distribution of X, producing something called a risk
function.

Definition 8  R(θ, δ) ≡ E_θ L(θ, δ(X)) = ∫ L(θ, δ(x)) dP_θ(x) maps Θ → [0, ∞) and is
called the risk function (of the decision rule δ).

Definition 9  δ is at least as good as δ′ if R(θ, δ) ≤ R(θ, δ′) ∀ θ.

Definition 10  δ is better than δ′ if R(θ, δ) ≤ R(θ, δ′) ∀ θ, with strict inequality for at
least one θ ∈ Θ.

Definition 11  δ and δ′ are risk equivalent if R(θ, δ) = R(θ, δ′) ∀ θ.
(Note that risk equivalence need not imply equality of decision rules.)

Definition 12  δ is best in a class of decision rules Δ if
 i) δ ∈ Δ
and
 ii) δ is at least as good as any other δ′ ∈ Δ.

Typically, only smallish classes of decision rules will have best elements. (For example,
in some point estimation problems there are best unbiased estimators, even though no estimator
is best overall. Here the class Δ is the class of unbiased estimators.) An alternative to
looking for best elements in small classes of rules is to somehow reduce the function of
θ, R(θ, δ), to a single number and look for rules minimizing that number. (Reduction by
averaging amounts to looking for Bayes rules. Reduction by taking a maximum value
amounts to looking for minimax rules.)
Definition 13  δ is inadmissible in Δ if ∃ δ′ ∈ Δ which is better than δ (which
"dominates" δ).

Definition 14  δ is admissible in Δ if it is not inadmissible in Δ (if there is no δ′ ∈ Δ
which is better).
On the face of it, it would seem that one would never want to use an inadmissible
decision rule. But, as it turns out, there can be problems where all rules are inadmissible.
(More in this direction later.)
Largely for technical reasons, it is convenient/important to somewhat broaden one's
understanding of what constitutes a decision rule. To this end, as a matter of notation, let
    𝒟 = {δ(x)} = the class of (nonrandomized) decision rules.
Then one can define two fairly natural notions of randomized decision rules.
Definition 15  Suppose that for each x ∈ 𝒳, φ_x is a distribution on (𝒜, ℰ). Then φ_x is
called a behavioral decision rule.

The idea is that having observed X = x, one then randomizes over the action space
according to the distribution φ_x in order to choose an action. (This should remind you of
what is done in some theoretical arguments in hypothesis testing.)

Notation: Let 𝒟* = {φ_x} = the class of behavioral decision rules.

It is possible to think of 𝒟 as a subset of 𝒟*. For δ ∈ 𝒟, let φ_x^δ be a point mass
distribution on 𝒜 concentrated at δ(x). Then φ^δ ∈ 𝒟* and is "clearly" equivalent to δ.

The risk of a behavioral decision rule is (abusing notation and reusing the symbols
R(·,·) in this context as well as that of Definition 8) clearly

    R(θ, φ) = ∫_𝒳 ∫_𝒜 L(θ, a) dφ_x(a) dP_θ(x).
A somewhat less intuitively appealing notion of randomization for decision rules is next.
Let ℰ_𝒟 be a σ-algebra on 𝒟 that contains the singleton sets.

Definition 16  A randomized decision function (or decision rule) ψ is a probability
measure on (𝒟, ℰ_𝒟). (δ with distribution ψ becomes a random object.)

The idea here is that prior to observation, one chooses an element of 𝒟 by use of ψ, say δ,
and then having observed X = x decides δ(x).

Notation: Let 𝒟** = {ψ} = the class of randomized decision rules.

It is possible to think of 𝒟 as a subset of 𝒟** by identifying δ ∈ 𝒟 with ψ_δ placing mass 1
on δ. ψ_δ is "clearly" equivalent to δ.

The risk of a randomized decision rule is (once again abusing notation and reusing the
symbols R(·,·)) clearly

    R(θ, ψ) = ∫_𝒟 R(θ, δ) dψ(δ) = ∫_𝒟 ∫_𝒳 L(θ, δ(x)) dP_θ(x) dψ(δ),

assuming that R(θ, ψ) is properly measurable.

The behavioral decision rules are probably more appealing/natural than the randomized
decision rules. On the other hand, the randomized decision rules are often easier to deal
with in proofs. So a reasonably important question is "When are 𝒟* and 𝒟** equivalent in
the sense of generating the same set of risk functions?"

"Result" 13 (Some properly qualified version of this is true ... See Ferguson pages 26-27.)
If 𝒜 is a complete separable metric space with ℰ the Borel σ-algebra, appropriate
conditions hold on the P_θ and ??????? holds, then 𝒟* and 𝒟** are equivalent in terms of
generating the same set of risk functions.
It is also of interest to know when the extra complication of considering randomized rules
is really not necessary. One simple result in that direction concerns cases where L is
convex in its second argument.
Lemma 14 (See page 151 of Schervish, page 40 of Berger and page 78 of Ferguson)
Suppose that 𝒜 is a convex subset of ℝ^d and φ_x is a behavioral decision rule. Define a
nonrandomized decision rule δ by

    δ(x) = ∫_𝒜 a dφ_x(a).

(In the case that d > 1, interpret δ(x) as vector-valued, the integral as a vector of
integrals over the d coordinates of a ∈ 𝒜.) Then
 i) if L(θ, ·): 𝒜 → [0, ∞) is convex, then
    R(θ, δ) ≤ R(θ, φ), and
 ii) if L(θ, ·): 𝒜 → [0, ∞) is strictly convex, R(θ, φ) < ∞ and
    P_θ{x | φ_x is nondegenerate} > 0,
    then
    R(θ, δ) < R(θ, φ).
(Strict convexity of a function g means that for α ∈ (0, 1)

    g(αx + (1 − α)y) < α g(x) + (1 − α) g(y)

for x ≠ y in the convex domain of g. See Berger pages 38 through 40. There is an
extension of Jensen's inequality that says that if g is strictly convex and the distribution of
X is not concentrated at a point, then E g(X) > g(E(X)).)
Corollary 15  Suppose that 𝒜 is a convex subset of ℝ^d and φ_x is a behavioral decision
rule. Define a nonrandomized decision rule δ by

    δ(x) = ∫_𝒜 a dφ_x(a).

 i) If L(θ, ·): 𝒜 → [0, ∞) is convex ∀ θ, then δ is at least as good as φ.
 ii) If L(θ, ·): 𝒜 → [0, ∞) is convex ∀ θ and for some θ_0 the function
    L(θ_0, ·): 𝒜 → [0, ∞) is strictly convex, R(θ_0, φ) < ∞ and
    P_{θ_0}{x | φ_x is nondegenerate} > 0,
    then δ is better than φ.

Corollary 15 says roughly that in the case of a convex loss function, one need not worry
about considering randomized decision rules.
(Finite Dimensional) Geometry of Decision Theory

A very helpful device for understanding some of the basics of decision theory is to
consider the geometry associated with cases where Θ is finite. So until further notice,
assume

    Θ = {θ_1, θ_2, ..., θ_k}

and

    R(θ, ψ) < ∞   ∀ θ ∈ Θ and ψ ∈ 𝒟**.

As a matter of notation, let

    𝒮 = {y = (y_1, y_2, ..., y_k) ∈ ℝ^k | y_i = R(θ_i, ψ) ∀ i for some ψ ∈ 𝒟**}.

𝒮 is the set of all randomized risk vectors.
Theorem 16  𝒮 is convex.

In fact, it turns out to be the case, and is more or less obvious from Theorem 16, that if

    𝒮_0 = {y | y_i = R(θ_i, δ) ∀ i for some δ ∈ 𝒟},

then 𝒮 is the convex hull of 𝒮_0, i.e. the smallest convex set containing 𝒮_0.

Definition 17  For x ∈ ℝ^k, the lower quadrant of x is

    Q_x = {z | z_i ≤ x_i ∀ i}.

Theorem 17  y ∈ 𝒮 (or the decision rule giving rise to y) is admissible iff

    Q_y ∩ 𝒮 = {y}.

Definition 18  The lower boundary of 𝒮 is

    λ(𝒮) = {y | Q_y ∩ 𝒮̄ = {y}}

for 𝒮̄ the closure of 𝒮.

Definition 19  𝒮 is closed from below if λ(𝒮) ⊂ 𝒮.

Notation: Let A(𝒮) stand for the class of admissible rules (or risk vectors corresponding
to them) in 𝒮.

Theorem 18  If 𝒮 is closed from below, then A(𝒮) = λ(𝒮).

Theorem 18′  If 𝒮 is closed, then A(𝒮) = λ(𝒮).
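(For a finite Θ one can find the admissible points of a finite collection of risk vectors directly from the lower-quadrant criterion of Theorem 17. The following is a small sketch of my own with made-up risk vectors, using the finite analogue of "Q_y ∩ 𝒮 = {y}".)

    import numpy as np

    # made-up risk vectors y = (R(theta_1, delta), R(theta_2, delta)) for a few rules
    S0 = np.array([[1.0, 4.0], [2.0, 2.0], [3.0, 1.5], [2.5, 3.0], [4.0, 1.0]])

    def admissible(S):
        keep = []
        for i, y in enumerate(S):
            # y is inadmissible iff some other risk vector is <= y coordinatewise
            # and strictly < in at least one coordinate
            dominated = any(np.all(z <= y) and np.any(z < y)
                            for j, z in enumerate(S) if j != i)
            if not dominated:
                keep.append(i)
        return keep

    print(admissible(S0))   # indices of the admissible (lower-boundary) risk vectors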
Complete Classes of Decision Rules

Now drop the finiteness assumption on Θ.

Definition 20  A class of decision rules C ⊂ 𝒟* is a complete class if for any φ ∉ C, ∃ a
φ′ ∈ C such that φ′ is better than φ.

Definition 21  A class of decision rules C ⊂ 𝒟* is an essentially complete class if for any
φ ∉ C, ∃ a φ′ ∈ C such that φ′ is at least as good as φ.

Definition 22  A class of decision rules C ⊂ 𝒟* is a minimal complete class if it is
complete and is a subset of any other complete class.

Notation: Let A(𝒟*) be the set of admissible rules in 𝒟*.

Theorem 19  If a minimal complete class C exists, then C = A(𝒟*).
Theorem 20  If A(𝒟*) is complete, then it is minimal complete.

Theorems 19 and 20 are the properly qualified versions of the rough (and not quite true)
statement "A(𝒟*) equals the minimal complete class".
Sufficiency and Decision Theory
We worked very hard on the notion of sufficiency, with the goal in mind of establishing
that one "doesn't need" X if one has T(X). That hard work ought to have some
implications for decision theory. Presumably, in most cases one can get by with basing a
decision rule on a sufficient statistic.

Result 21 (See page 120 of Ferguson, page 36 of Berger and page 151 of Schervish)
Suppose a statistic T is sufficient for 𝒫 = {P_θ}. Then if φ is a behavioral decision rule,
there exists another behavioral decision rule φ′ that is a function of T and has the same
risk function as φ.

Note that φ′ being a function of T means that T(x) = T(y) implies that φ′_x and φ′_y are the
same measure on 𝒜. Note also that Result 21 immediately implies that the class of
decision rules in 𝒟* that are functions of T is essentially complete. In terms of risk, one
need only consider rules that are functions of sufficient statistics.
The construction used in the proof of Result 21 doesn't aim to improve on the original
(behavioral) decision rule by moving to a function of the sufficient statistic. But in
convex loss cases, there is a construction that produces rules depending on a sufficient
statistic and potentially improving on an initial (nonrandomized) decision rule. This is
the message of Theorem 24. But to prove Theorem 24, a lemma of some independent
interest is needed. That lemma and an immediate corollary are stated before Theorem 24.
Lemma 22  Suppose that 𝒜 ⊂ ℝ^k is convex and δ_1(x) and δ_2(x) are two nonrandomized
decision rules. Then δ(x) = (1/2)(δ_1(x) + δ_2(x)) is also a decision rule. Further,
 i) if L(θ, a) is convex in a and R(θ, δ_1) = R(θ, δ_2), then R(θ, δ) ≤ R(θ, δ_1), and
 ii) if L(θ, a) is strictly convex in a, R(θ, δ_1) = R(θ, δ_2) < ∞ and
    P_θ(δ_1(X) ≠ δ_2(X)) > 0, then R(θ, δ) < R(θ, δ_1).

Corollary 23  Suppose 𝒜 ⊂ ℝ^k is convex and δ_1(x) and δ_2(x) are two nonrandomized
decision rules with identical risk functions. If L(θ, a) is convex in a ∀ θ, then the existence of a
θ ∈ Θ such that L(θ, a) is strictly convex in a, R(θ, δ_1) = R(θ, δ_2) < ∞ and
P_θ(δ_1(X) ≠ δ_2(X)) > 0 implies that δ_1 and δ_2 are inadmissible.
Theorem 24 (See Berger page 41, Ferguson page 121, TPE page 50, Schervish page
152) (Rao-Blackwell Theorem)  Suppose that 𝒜 ⊂ ℝ^k is convex and δ is a
nonrandomized decision function with E_θ |δ(X)| < ∞ ∀ θ. Suppose further that T is
sufficient for θ and, with ℬ_0 = ℬ(T), let

    δ_0(x) ≡ E[δ | ℬ_0](x).

Then δ_0 is a nonrandomized decision rule. Further,
 i) if L(θ, a) is convex in a, then R(θ, δ_0) ≤ R(θ, δ), and
 ii) for any θ for which L(θ, a) is strictly convex in a, R(θ, δ) < ∞ and
    P_θ(δ(X) ≠ δ_0(X)) > 0, one has that R(θ, δ_0) < R(θ, δ).
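(A concrete instance, my own illustration rather than one from the notes: with X_1, ..., X_n iid Bernoulli(p) and squared error loss, the crude unbiased rule δ(X) = X_1 conditioned on the sufficient statistic T = Σ X_i gives δ_0 = E[X_1 | T] = T/n, and simulation shows the risk drop promised by Theorem 24.)

    import numpy as np

    rng = np.random.default_rng(2)
    p, n, reps = 0.3, 8, 200_000
    X = rng.binomial(1, p, size=(reps, n))
    delta = X[:, 0]                  # crude unbiased estimator of p
    delta0 = X.mean(axis=1)          # E[X_1 | T] = T/n, a function of the sufficient statistic
    print("R(p, delta)  ~", np.mean((delta - p) ** 2))   # about p(1-p)   = 0.21
    print("R(p, delta0) ~", np.mean((delta0 - p) ** 2))  # about p(1-p)/n = 0.026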
Bayes Rules

The Bayes approach to decision theory is one means of reducing risk functions to
numbers so that they can be compared in a straightforward fashion.

Notation: Let G be a distribution on (Θ, 𝒞). G is usually (for philosophical reasons)
called the "prior" distribution on Θ.

Definition 23  The Bayes risk of φ ∈ 𝒟* with respect to the prior G is (abusing notation
again and once more using the symbols R(·,·))

    R(G, φ) = ∫_Θ R(θ, φ) dG(θ).

Notation: Again abusing notation somewhat, let

    R(G) ≡ inf_{φ′∈𝒟*} R(G, φ′).

Definition 24  φ is said to be Bayes with respect to G (or to be a Bayes rule with respect
to G) provided

    R(G, φ) = R(G).

Definition 25  φ is said to be ε-Bayes with respect to G provided

    R(G, φ) ≤ R(G) + ε.

Bayes rules are often admissible, at least if the prior involved "spreads its mass around
sufficiently."

Theorem 25  If Θ is finite (or countable), G is a prior distribution with G(θ) > 0 ∀ θ and
φ is Bayes with respect to G, then φ is admissible.
Theorem 26  Suppose Θ ⊂ ℝ^k is such that every neighborhood of any point θ ∈ Θ has
nonempty intersection with the interior of Θ. Suppose further that R(θ, φ) is continuous
in θ for all φ ∈ 𝒟*. Let G be a prior distribution that has support Θ in the sense that
every open ball that is a subset of Θ has positive G probability. Then if R(G) < ∞ and
φ is a Bayes rule with respect to G, φ is admissible.

Theorem 27  If every Bayes rule with respect to G has the same risk function, they are all
admissible.

Corollary 28  In a problem where Bayes rules exist and are unique, they are admissible.
A converse of Theorem 25 (for finite Θ) is given in Theorem 30. Its proof requires the
use of a piece of k-dimensional analytic geometry (called the separating hyperplane
theorem) that is stated next.

Theorem 29  Let 𝒮_1 and 𝒮_2 be two disjoint convex subsets of ℝ^k. Then ∃ a p ∈ ℝ^k such
that p ≠ 0 and Σ p_i x_i ≤ Σ p_i y_i ∀ y ∈ 𝒮_1 and x ∈ 𝒮_2.

Theorem 30  If Θ is finite and φ is admissible, then φ is Bayes with respect to some
prior.

For nonfinite Θ, admissible rules are "usually" Bayes or "limits" of Bayes rules. See
Section 2.10 of Ferguson or page 546 of Berger.
As the next result shows, Bayesians don't need randomized decision rules, at least for
purposes of achieving minimum Bayes risk, R(G).

Result 31 (See also page 147 of Schervish)  Suppose that ψ ∈ 𝒟** is Bayes versus G and
R(G) < ∞. Then there exists a nonrandomized rule δ ∈ 𝒟 that is also Bayes versus G.

There remain the questions of 1) when Bayes rules exist and 2) what they look like when
they do exist. These issues are (to some extent) addressed next.

Theorem 32  If Θ is finite, 𝒮 is closed from below and G assigns positive probability to
each θ ∈ Θ, there is a rule Bayes versus G.

Definition 26  A formal nonrandomized Bayes rule versus a prior G is a rule δ(x) such
that ∀ x ∈ 𝒳,

    δ(x) is an a ∈ 𝒜 minimizing ∫_Θ L(θ, a) ( f_θ(x) / ∫_Θ f_θ(x) dG(θ) ) dG(θ).

Definition 27  If G is a σ-finite measure, a formal nonrandomized generalized Bayes rule
versus G is a rule δ(x) such that ∀ x ∈ 𝒳,

    δ(x) is an a ∈ 𝒜 minimizing ∫_Θ L(θ, a) f_θ(x) dG(θ).
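(A minimal sketch of Definition 26 in action, with a model and prior of my own choosing and assuming only numpy: X ~ Binomial(n, θ), a Uniform(0,1) prior, squared error loss. Minimizing the posterior expected loss over a grid of actions recovers the posterior mean, as it should.)

    import numpy as np
    from math import comb

    def formal_bayes_action(x, n=10, loss=lambda t, a: (t - a) ** 2):
        thetas = np.linspace(0.001, 0.999, 2001)       # grid over Theta
        prior = np.ones_like(thetas) / thetas.size     # Uniform(0,1) prior G
        like = comb(n, x) * thetas**x * (1 - thetas)**(n - x)
        post = like * prior
        post /= post.sum()                             # normalized posterior weights
        actions = np.linspace(0, 1, 2001)
        risks = [(loss(thetas, a) * post).sum() for a in actions]
        return actions[int(np.argmin(risks))]

    print(formal_bayes_action(3))   # close to the posterior mean (3+1)/(10+2) = 1/3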
Minimax Decision Rules

An alternative to the Bayesian reduction of R(θ, φ) to a number by averaging according
to the distribution G is to instead reduce R(θ, φ) to a number by maximization over θ.

Definition 28  A decision rule φ ∈ 𝒟* is said to be minimax if

    sup_θ R(θ, φ) = inf_{φ′} sup_θ R(θ, φ′).

Definition 29  If a decision rule φ has constant (in θ) risk, it is called an equalizer rule.

It should make some sense that if one is trying to find a minimax rule by, so to speak,
pushing down the highest peak in the profile R(θ, φ), doing so will tend to drive one
towards rules with "flat" R(θ, φ) profiles.

Theorem 33  If φ is an equalizer rule and is admissible, then it is minimax.

Theorem 34  Suppose that {φ_i} is a sequence of decision rules, each φ_i Bayes against a
prior G_i. If R(G_i, φ_i) → C and φ is a decision rule with R(θ, φ) ≤ C ∀ θ, then φ is
minimax.

Corollary 35  If φ is Bayes versus G and R(θ, φ) ≤ R(G) ∀ θ, then φ is minimax.

Corollary 35′  If φ is an equalizer rule and is Bayes with respect to a prior G, then it is
minimax.

A reasonable question to ask, in the light of Corollaries 35 and 35′, is "Which G should I
be looking for in order to produce a Bayes rule that might be minimax?" The answer to
this question is "Look for a least favorable prior."

Definition 30  A prior distribution G is said to be least favorable if

    R(G) = sup_{G′} R(G′).

Intuitively, one might think that if an intelligent adversary were choosing θ, he/she might
well generate it according to a least favorable G, thereby maximizing the minimum
possible risk. In response a decision maker ought to use φ Bayes versus G. Perhaps then,
this Bayes rule might tend to minimize the worst that could happen over choice of θ?

Theorem 36  If φ is Bayes versus G and R(θ, φ) ≤ R(G) ∀ θ, then G is least favorable.

(Corollary 35 already shows φ in Theorem 36 to be minimax.)
Point Estimation (Unbiasedness, Exponential Families, Information
Inequalities, Invariance and Equivariance and Maximum Likelihood)
Unbiasedness
One possible limited class of "sensible" decision rules in estimation problems is the class
of "unbiased" estimators. (Here, we are tacitly assuming that 𝒜 is ℝ¹ or some subset of
it.)

Definition 31  The bias of an estimator δ(x) for γ(θ) ∈ ℝ¹ is

    E_θ(δ(X) − γ(θ)) = ∫_𝒳 (δ(x) − γ(θ)) dP_θ(x).

Definition 32  An estimator δ(x) is unbiased for γ(θ) ∈ ℝ¹ provided

    E_θ(δ(X) − γ(θ)) = 0   ∀ θ.
Before jumping into the business of unbiased estimation with too much enthusiasm, an
early caution is in order. We supposedly believe that "Bayes ≈ admissible" and that
admissibility is sort of a minimal "goodness" requirement for a decision rule ... but there
are the following theorem and corollary.

Theorem 37 (See Ferguson pages 48 and 52)  If L(θ, a) = w(θ)(γ(θ) − a)² for
w(θ) > 0, and δ is both Bayes wrt G with finite risk function and unbiased for γ(θ), then
(according to the joint distribution of (X, θ))

    δ(X) = γ(θ)   a.s.

Corollary 38  Suppose Θ ⊂ ℝ¹ has at least two elements and the family 𝒫 = {P_θ} is
mutually absolutely continuous. If L(θ, a) = (θ − a)², no Bayes estimator of θ is
unbiased.
Definition 33  δ is best unbiased for γ(θ) ∈ ℝ¹ if it is unbiased and at least as good as
any other unbiased estimator.

There need not be any unbiased estimators in a given estimation problem, let alone a best
one. But for many common estimation loss functions, if there is a best unbiased
estimator it is unique.

Theorem 39  Suppose that 𝒜 ⊂ ℝ¹ is convex, L(θ, a) is convex in a ∀ θ and δ_1 and δ_2
are two best unbiased estimators of γ(θ). Then for any θ for which L(θ, a) is strictly
convex in a and R(θ, δ_1) < ∞,

    P_θ[δ_1(X) = δ_2(X)] = 1.

The special case of L(θ, a) = (γ(θ) − a)² is, of course, the one of most interest. For this
case of unweighted squared error loss, for unbiased estimators with second moments,
R(θ, δ) = Var_θ δ(X).

Definition 34  δ is called a UMVUE (uniformly minimum variance unbiased estimator)
of γ(θ) ∈ ℝ¹ if it is unbiased and

    Var_θ δ(X) ≤ Var_θ δ′(X)   ∀ θ

for any other unbiased estimator δ′ of γ(θ).
A much weaker notion is next.

Definition 35  δ is called locally minimum variance unbiased at θ_0 ∈ Θ if it is unbiased
and

    Var_{θ_0} δ(X) ≤ Var_{θ_0} δ′(X)

for any other unbiased estimator δ′ of γ(θ).

Notation: Let 𝒰_0 = {δ | E_θ δ(X) = 0 and E_θ δ²(X) < ∞ ∀ θ}. (𝒰_0 is the set of unbiased
estimators of 0 with finite variance.)

Theorem 40  Suppose that δ is an unbiased estimator of γ(θ) ∈ ℝ¹ and E_θ δ²(X) < ∞
∀ θ. Then

    δ is a UMVUE of γ(θ) iff Cov_θ(δ(X), u(X)) = 0 ∀ θ, ∀ u ∈ 𝒰_0.
Theorem 41 (Lehmann-Scheffé)  Suppose that 𝒜 is a convex subset of ℝ¹, T is a
sufficient statistic for 𝒫 = {P_θ} and δ is an unbiased estimator of γ(θ). Let

    δ_0 = E[δ | T] = φ ∘ T.

Then δ_0 is an unbiased estimator of γ(θ). If in addition T is complete for 𝒫,
 i) φ′ ∘ T unbiased for γ(θ) implies that φ′ ∘ T = φ ∘ T a.s. P_θ ∀ θ, and
 ii) L(θ, a) convex in a ∀ θ implies that
   a) δ_0 is best unbiased, and
   b) if δ′ is any other best unbiased estimator of γ(θ),

      δ′ = δ_0   a.s. P_θ

   for any θ at which R(θ, δ_0) < ∞ and L(θ, a) is strictly convex in a.

(Note that ii a) says in particular that δ_0 is UMVUE, and that ii b) says that δ_0 is the unique
UMVUE.)

Basu's Theorem (Theorem 9) is on occasion helpful in doing the conditioning required to
get δ_0 of Theorem 41. The most famous example of this is in the calculation of the
UMVUE of Φ((c − μ)/σ) based on n iid N(μ, σ²) observations.
Exponential Families

The most common method of establishing completeness of a statistic T (at least for the
standard examples) for use in the construction of Theorem 41 is an appeal to the
properties of "Exponential Families" of distributions. Hence the placement of the next
batch of class material.

A List of Facts About Exponential Families:

Definition 36  Suppose that 𝒫 = {P_θ} is dominated by a σ-finite measure μ. 𝒫 is called
an exponential family if for some h(x) ≥ 0,

    f_θ(x) ≡ (dP_θ/dμ)(x) = exp{ a(θ) + Σ_{i=1}^k η_i(θ) T_i(x) } h(x)   ∀ θ.
Definition 37  A family of probability measures 𝒫 = {P_θ} is said to be identifiable if
θ_1 ≠ θ_2 implies that P_{θ_1} ≠ P_{θ_2}.

Fact(s) 42  Let η = (η_1, η_2, ..., η_k), Γ = {η ∈ ℝ^k | ∫ h(x) exp{Σ η_i T_i(x)} dμ(x) < ∞},
and consider also the family of distributions (on 𝒳) with R-N derivatives wrt μ of the
form

    f_η(x) = K(η) exp{ Σ_{i=1}^k η_i T_i(x) } h(x),   η ∈ Γ.

Call this family 𝒫*. This set of distributions on 𝒳 is at least as large as 𝒫 (and this
second parameterization is mathematically nicer than the first). Γ is called the natural
parameter space for 𝒫*. It is a convex subset of ℝ^k. (TSH, page 57.) If Γ lies in a
subspace of dimension less than k, then f_η(x) (and therefore f_θ(x)) can be written in a
form involving fewer than k statistics T_i. We will henceforth assume Γ to be fully k
dimensional. Note that depending upon the nature of the functions η_i(θ) and the
parameter space Θ, 𝒫 may be a proper subset of 𝒫*. That is, defining
Γ_Θ = {(η_1(θ), η_2(θ), ..., η_k(θ)) ∈ ℝ^k | θ ∈ Θ}, Γ_Θ can be a proper subset of Γ.
Fact 43  The "support" of P_θ, defined as {x | f_θ(x) > 0}, is clearly {x | h(x) > 0}, which
is independent of θ. The distributions in 𝒫 are thus mutually absolutely continuous.

Fact 44  From the Factorization Theorem, the statistic T = (T_1, T_2, ..., T_k) is sufficient
for 𝒫.
Fact 45  T has induced measures {P_θ^T | θ ∈ Θ} which also form an exponential family.

Fact 46  If Γ_Θ contains an open rectangle in ℝ^k, then T is complete for 𝒫. (See pages
142-143 of TSH.)

Fact 47  If Γ_Θ contains an open rectangle in ℝ^k (and actually under the much weaker
assumptions given on page 44 of TPE) T is minimal sufficient for 𝒫.
Fact 48  If g is any measurable real valued function such that E_η |g(X)| < ∞, then

    E_η g(X) = ∫ g(x) f_η(x) dμ(x)

is continuous on Γ and has continuous partial derivatives of all orders on the interior of Γ.
These can be calculated as

    ∂^(α_1+α_2+⋯+α_k) / (∂η_1^{α_1} ∂η_2^{α_2} ⋯ ∂η_k^{α_k}) E_η g(X)
        = ∫ g(x) [ ∂^(α_1+α_2+⋯+α_k) / (∂η_1^{α_1} ∂η_2^{α_2} ⋯ ∂η_k^{α_k}) f_η(x) ] dμ(x).

(See page 59 of TSH.)
Fact 49  If for u = (u_1, u_2, ..., u_k), both η_0 and η_0 + u belong to Γ,

    E_{η_0} exp{ u_1 T_1(X) + u_2 T_2(X) + ⋯ + u_k T_k(X) } = K(η_0) / K(η_0 + u).

Further, if η_0 is in the interior of Γ, then

    E_{η_0}( T_1^{α_1}(X) T_2^{α_2}(X) ⋯ T_k^{α_k}(X) )
        = K(η_0) [ ∂^(α_1+α_2+⋯+α_k) / (∂η_1^{α_1} ∂η_2^{α_2} ⋯ ∂η_k^{α_k}) (1/K(η)) ] |_{η=η_0}.

In particular,

    E_{η_0} T_j(X) = ∂/∂η_j ( − log K(η) ) |_{η=η_0},
    Var_{η_0} T_j(X) = ∂²/∂η_j² ( − log K(η) ) |_{η=η_0},  and
    Cov_{η_0}(T_j(X), T_l(X)) = ∂²/(∂η_j ∂η_l) ( − log K(η) ) |_{η=η_0}.
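(To make Fact 49 concrete, here is a sketch of my own using the Poisson family and assuming only numpy: writing the Poisson(λ) mass function as K(η) exp{η x} h(x) with η = log λ, h(x) = 1/x! and K(η) = exp(−e^η), the derivatives of −log K(η) = e^η reproduce the mean and variance.)

    import numpy as np

    K = lambda eta: np.exp(-np.exp(eta))       # normalizer in the Poisson natural parameterization
    neg_logK = lambda eta: -np.log(K(eta))     # the cumulant function, e^eta

    lam = 2.5
    eta0, h = np.log(lam), 1e-4
    mean = (neg_logK(eta0 + h) - neg_logK(eta0 - h)) / (2 * h)                    # first derivative
    var = (neg_logK(eta0 + h) - 2 * neg_logK(eta0) + neg_logK(eta0 - h)) / h**2   # second derivative
    print(mean, var)   # both approximately lambda = 2.5, matching E T and Var T for the Poisson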
Fact 50  If X = (X_1, X_2, ..., X_n) has iid components, with each X_i ~ P_θ, then X
generates a k-dimensional exponential family on (𝒳^n, ℬ^n) wrt μ^n. The k-dimensional
statistic Σ_{i=1}^n T(X_i) is sufficient for this family. Under the condition that Γ_Θ contains an
open rectangle, this statistic is also complete and minimal sufficient.
Information Inequalities
These are inequalities about variances of estimators, some of which involve
"information" measures. (See Schervish Sections 2.3.1, 5.1.2 and TPE Sections 2.6, 2.7.)
The fundamental tool here is the Cauchy-Schwarz inequality.
Lemma 51 (Cauchy-Schwarz)  If U and V are random variables with EU² < ∞ and
EV² < ∞, then

    EU² · EV² ≥ (EUV)².

Applying this to U = X − EX and V = Y − EY, we get the following.

Corollary 52  If X and Y have finite second moments, (Cov(X, Y))² ≤ Var X · Var Y.

Corollary 53  Suppose that δ(Y) takes values in ℝ¹ and E(δ(Y))² < ∞. If g is a
measurable function such that E g(Y) = 0 and 0 < E(g(Y))² < ∞, then defining
C ≡ E δ(Y) g(Y),

    Var δ(Y) ≥ C² / E(g(Y))².
A restatement of Corollary 53 in terms that look more "statistical" is next.

Corollary 53′  Suppose that δ(X) takes values in ℝ¹ and E_θ(δ(X))² < ∞. If g_θ is a
measurable function such that E_θ g_θ(X) = 0 and 0 < E_θ(g_θ(X))² < ∞, then defining
C(θ) ≡ E_θ δ(X) g_θ(X),

    Var_θ δ(X) ≥ C²(θ) / E_θ(g_θ(X))².
Definition 38  We will say that the model 𝒫 with dominating σ-finite measure μ is FI
regular at θ_0 ∈ Θ ⊂ ℝ¹ provided there exists an open neighborhood of θ_0, say O, such
that
 i) f_θ(x) > 0 ∀ x and ∀ θ ∈ O,
 ii) ∀ x, f_θ(x) is differentiable at θ_0, and
 iii) 1 = ∫ f_θ(x) dμ(x) can be differentiated wrt θ under the integral at θ_0, i.e.

    0 = ∫ (d/dθ) f_θ(x) |_{θ=θ_0} dμ(x).

(A definition like that on page 111 of Schervish, that requires i) and ii) only on a set of P_θ
probability 1 ∀ θ ∈ O, is really no more general than this. Without loss of generality, just throw away the
exceptional set.)

Jargon: The (random) function of θ, (d/dθ) log f_θ(X), is called the score function. Part iii) of
Definition 38 says that the mean of the score function is 0 at θ_0.
Definition 39  If the model 𝒫 is FI regular at θ_0, then

    I(θ_0) = E_{θ_0} ( (d/dθ) log f_θ(X) |_{θ=θ_0} )²

is called the Fisher Information contained in X at θ_0.

(If a model is FI regular at θ_0, I(θ_0) is the variance of the score function at θ_0.)
Theorem 54 (Cramér-Rao)  Suppose that the model 𝒫 is FI regular at θ_0 and that
0 < I(θ_0) < ∞. If E_θ δ(X) = ∫ δ(x) f_θ(x) dμ(x) can be differentiated under the integral
at θ_0, i.e. if

    (d/dθ) E_θ δ(X) |_{θ=θ_0} = ∫ δ(x) (d/dθ) f_θ(x) |_{θ=θ_0} dμ(x),

then

    Var_{θ_0} δ(X) ≥ ( (d/dθ) E_θ δ(X) |_{θ=θ_0} )² / I(θ_0).

For the special case of δ(X) viewed as an estimator of θ, we sometimes write
E_θ δ(X) = θ + b(θ), and thus Theorem 54 says Var_{θ_0} δ(X) ≥ (1 + b′(θ_0))² / I(θ_0).
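(For instance, a quick numerical check of my own: for X_1, ..., X_n iid Bernoulli(p), I_1(p) = 1/(p(1−p)), the unbiased estimator X̄ has variance p(1−p)/n, and the bound of Theorem 54 is attained.)

    import numpy as np

    p, n, reps = 0.3, 20, 400_000
    rng = np.random.default_rng(3)
    xbar = rng.binomial(n, p, size=reps) / n
    print("Var(xbar)       ~", xbar.var())
    print("C-R lower bound =", p * (1 - p) / n)   # 1 / (n * I_1(p)) with I_1(p) = 1/(p(1-p))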
On the face of it, showing that the C-R lower bound is achieved looks like a general
method of showing that an estimator is UMVUE. But the fact is that the circumstances in
which the C-R lower bound is achieved are very well defined. The C-R inequality is
essentially the Cauchy-Schwarz inequality, and one gets equality in Lemma 51 iff
V = aU a.s. This translates in the proof of Theorem 54 to equality iff

    (d/dθ) log f_θ(X) |_{θ=θ_0} = a(θ_0) δ(X) + b(θ_0)   a.s. P_{θ_0}.

And, as it turns out, this can happen for all θ_0 in a parameter space Θ iff one is dealing
with a one parameter exponential family of distributions and the statistic δ(X) is the
natural sufficient statistic. So the method works only for UMVUEs of the means of the
natural sufficient statistics in one parameter exponential families.
Result 55  Suppose that for η real, {P_η} is dominated by a σ-finite measure μ, and with
h(x) > 0,

    f_η(x) = exp( a(η) + η T(x) ) h(x)   a.s. μ, ∀ η.

Then for any η_0 in the interior of Γ where Var_{η_0} T(X) < ∞, T(X) achieves the C-R lower
bound for its variance.

(A converse of Result 55 is true, but is also substantially harder to prove.)
The C-R inequality is Corollary 53′ with g_θ(x) = (d/dθ) log f_θ(x). There are other interesting
choices of g_θ. For example (assuming it is permissible to move enough derivatives
through integrals), g_θ(x) = ( (d^k/dθ^k) f_θ(x) ) / f_θ(x) produces an interesting inequality.
And the choice g_θ(x) = ( f_θ(x) − f_{θ′}(x) ) / f_θ(x) leads to the next result.
Theorem 56  If P_{θ′} ≪ P_θ,

    Var_θ δ(X) ≥ (E_θ δ(X) − E_{θ′} δ(X))² / E_θ( (f_θ(X) − f_{θ′}(X)) / f_θ(X) )².

So (the Chapman-Robbins Inequality),

    Var_θ δ(X) ≥ sup_{θ′ s.t. P_{θ′} ≪ P_θ} (E_θ δ(X) − E_{θ′} δ(X))² / E_θ( (f_θ(X) − f_{θ′}(X)) / f_θ(X) )².
The Fisher Information comes up not only in the present context of information
inequalities, but also in the theory of the asymptotics of maximum likelihood estimation.
It is thus useful to have some more results about it.
Theorem 57  Suppose that the model 𝒫 is FI regular at θ_0. If ∀ x, f_θ(x) is in fact twice
differentiable at θ_0 and 1 = ∫ f_θ(x) dμ(x) can be differentiated twice wrt θ under the
integral at θ_0, i.e. both

    0 = ∫ (d/dθ) f_θ(x) |_{θ=θ_0} dμ(x)   and   0 = ∫ (d²/dθ²) f_θ(x) |_{θ=θ_0} dμ(x),

then

    I(θ_0) = −E_{θ_0} [ (d²/dθ²) log f_θ(X) |_{θ=θ_0} ].
Fact 58 If \" ß \# ß ÞÞÞß \8 are independent, \3 µ T3) , then \ œ Ð\" ß \# ß ÞÞÞß \8 Ñ (that
has distribution T") ‚ T#) ‚ â ‚ T8) ) carries Fisher Information MÐ)Ñ œ ! M3 Ð)Ñ, where
8
3œ"
in the obvious way, M3 Ð)Ñ is the Fisher information in \3 Þ
Fact 59 MÐ)Ñ doesn't depend upon the dominating measure.
Fact 60 For the case of E) $ Ð\Ñ œ ), the C-R inequality says that Var) $ Ð\Ñ MÐ")Ñ . And
when \ œ Ð\" ß \# ß ÞÞÞß \8 Ñ has iid components, this combined with Fact 58 says that
Var) $Ð\Ñ 8M""Ð)Ñ Þ
There are multiparameter versions of the information inequalities (and for that matter,
multiple-δ versions involving the covariance matrix of the δ's). The fundamental result
needed to produce multiparameter versions of the information inequalities is the simple
generalization of Corollary 53.

Lemma 61  Suppose that δ(Y) takes values in ℝ¹ and E(δ(Y))² < ∞. Suppose further
that g_1, g_2, ..., g_k are measurable functions such that E g_i(Y) = 0 and 0 < E(g_i(Y))² < ∞
∀ i and the g_i are linearly independent as square integrable functions on 𝒳 according to
the probability measure P. Let C_i ≡ E δ(Y) g_i(Y) and define a k × k matrix

    D ≡ ( E g_i(Y) g_j(Y) ).

Then for C = (C_1, C_2, ..., C_k)′,

    Var δ(Y) ≥ C′ D^(-1) C.
The "statistical-looking" version of Lemma 61 is next.

Lemma 61′  Suppose that δ(X) takes values in ℝ¹ and E_θ(δ(X))² < ∞. Suppose
further that g_{θ1}, g_{θ2}, ..., g_{θk} are measurable functions such that E_θ g_{θi}(X) = 0 and
0 < E_θ(g_{θi}(X))² < ∞ ∀ i and the g_{θi} are linearly independent as square integrable
functions on 𝒳 according to the probability measure P_θ. Let C_i(θ) ≡ E_θ δ(X) g_{θi}(X) and
define a k × k matrix

    D(θ) ≡ ( E_θ g_{θi}(X) g_{θj}(X) ).

Then for C(θ) = (C_1(θ), C_2(θ), ..., C_k(θ))′,

    Var_θ δ(X) ≥ C(θ)′ D(θ)^(-1) C(θ).
k-dimensional versions of Definitions 38 and 39 are next.

Definition 40  We will say that the model 𝒫 with dominating σ-finite measure μ is FI
regular at θ_0 ∈ Θ ⊂ ℝ^k provided there exists an open neighborhood of θ_0, say O, such
that
 i) f_θ(x) > 0 ∀ x and ∀ θ ∈ O,
 ii) ∀ x, f_θ(x) has first order partial derivatives at θ_0, and
 iii) 1 = ∫ f_θ(x) dμ(x) can be differentiated wrt each θ_i under the integral at θ_0,
    i.e.  0 = ∫ (∂/∂θ_i) f_θ(x) |_{θ=θ_0} dμ(x).

Definition 41  If the model 𝒫 is FI regular at θ_0 and

    E_{θ_0} ( (∂/∂θ_i) log f_θ(X) |_{θ=θ_0} )² < ∞   ∀ i,

then the k × k matrix

    I(θ_0) = ( E_{θ_0} [ (∂/∂θ_i) log f_θ(X) |_{θ=θ_0} · (∂/∂θ_j) log f_θ(X) |_{θ=θ_0} ] )

is called the Fisher Information contained in X at θ_0.

The multiparameter version of Theorem 57 is next.
Theorem 62  Suppose that the model 𝒫 is FI regular at θ_0. If ∀ x, f_θ(x) in fact has
continuous second order partials in the neighborhood O, and 1 = ∫ f_θ(x) dμ(x) can be
differentiated twice under the integral at θ_0, i.e. both

    0 = ∫ (∂/∂θ_i) f_θ(x) |_{θ=θ_0} dμ(x)   and   0 = ∫ (∂²/(∂θ_i ∂θ_j)) f_θ(x) |_{θ=θ_0} dμ(x)   ∀ i and j,

then

    I(θ_0) = −E_{θ_0} [ (∂²/(∂θ_i ∂θ_j)) log f_θ(X) |_{θ=θ_0} ].
And there is the multiparameter version of the C-R inequality.

Theorem 63 (Cramér-Rao)  Suppose that the model 𝒫 is FI regular at θ_0 ∈ Θ ⊂ ℝ^k and
that I(θ_0) exists and is nonsingular. If E_θ δ(X) = ∫ δ(x) f_θ(x) dμ(x) can be
differentiated wrt each θ_i under the integral at θ_0, i.e. if

    (∂/∂θ_i) E_θ δ(X) |_{θ=θ_0} = ∫ δ(x) (∂/∂θ_i) f_θ(x) |_{θ=θ_0} dμ(x),

then with

    C(θ_0) ≡ ( (∂/∂θ_1) E_θ δ(X) |_{θ=θ_0}, (∂/∂θ_2) E_θ δ(X) |_{θ=θ_0}, ..., (∂/∂θ_k) E_θ δ(X) |_{θ=θ_0} )′,

    Var_{θ_0} δ(X) ≥ C(θ_0)′ I(θ_0)^(-1) C(θ_0).

There are multiple-δ versions of this information inequality stuff. See Rao, pages 326+.
Invariance and Equivariance

Unbiasedness (of estimators) is one means of restricting a class of decision rules down to
a subclass small enough that there is hope of finding a "best" rule in the subclass.
Invariance/equivariance considerations are another. These notions have to do with
symmetries in a decision problem. "Invariance" means that (some) things don't change.
"Equivariance" means that (some) things change similarly or equally.

We'll do the case of Location Parameter Estimation first, and then turn to generalities
about invariance/equivariance that apply more broadly (to both other estimation problems
and to other types of decision problems entirely).

Definition 42  When 𝒳 = ℝ^k and Θ ⊂ ℝ^k, to say that θ is a location parameter means
that X − θ has the same distribution ∀ θ.

Definition 43  When 𝒳 = ℝ^k and Θ ⊂ ℝ¹, to say that θ is a 1-dimensional location
parameter means that X − θ1 (for 1 a k-vector of 1's) has the same distribution ∀ θ.
Note that for θ a 1-dimensional location parameter, X ~ P_θ means that
(X + c1) ~ P_{c+θ}. It therefore seems that a "reasonable" restriction on estimators δ is
that they should satisfy δ(x + c1) = δ(x) + c. How to formalize this?

Until further notice suppose that Θ = 𝒜 = ℝ¹.

Definition 44  L(θ, a) is called location invariant if L(θ, a) = ρ(θ − a). If ρ(z) is
increasing in z for z > 0 and decreasing in z for z < 0, the problem is called a location
estimation problem.

Definition 45  A decision rule δ is called location equivariant if δ(x + c1) = δ(x) + c.

Theorem 64  If θ is a 1-dimensional location parameter, L(θ, a) is location invariant and
δ is location equivariant, then R(θ, δ) is constant in θ. (That is, δ is an equalizer rule.)

This theorem says that if I'm working on a location estimation problem with a location
invariant loss function and restrict attention to location equivariant estimators, I can
compare estimators by comparing numbers. In fact, as it turns out, there is a fairly
explicit representation for the best location equivariant estimator in such cases.

Definition 46  A function u is called location invariant if u(x + c1) = u(x).

Lemma 65  Suppose that δ_0 is location equivariant. Then δ_1 is location equivariant iff ∃
a location invariant function u such that δ_1 = δ_0 + u.

Lemma 66  A function u is location invariant iff u depends on x only through
y(x) = (x_1 − x_k, x_2 − x_k, ..., x_{k-1} − x_k).

Theorem 67  Suppose that L(θ, a) = ρ(θ − a) is location invariant and ∃ a location
equivariant estimator δ_0 for θ with finite risk. Let ρ′(z) = ρ(−z). Suppose that for each
y ∃ a number v*(y) that minimizes

    E_{θ=0}[ ρ′(δ_0 − v) | Y = y ]

over choices of v. Then

    δ*(x) = δ_0(x) − v*(y(x))

is a best location equivariant estimator of θ.

In the case of L(θ, a) = (θ − a)² the estimator of Theorem 67 is Pitman's estimator,
which can for some situations be written out in a more explicit form.
Theorem 68  Suppose that L(θ, a) = (θ − a)² and the distribution of
X = (X_1, X_2, ..., X_k) has R-N derivative wrt Lebesgue measure on ℝ^k

    f_θ(x) = f(x_1 − θ, x_2 − θ, ..., x_k − θ) = f(x − θ1)

where f has second moments. Then the best location equivariant estimator of θ is

    δ*(x) = ( ∫_{−∞}^{∞} θ f(x − θ1) dθ ) / ( ∫_{−∞}^{∞} f(x − θ1) dθ )

if this is well-defined and has finite risk. (Notice that this is the formal generalized Bayes
estimator vs Lebesgue measure.)
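(A numerical sketch of Theorem 68, my own, via simple quadrature: for a location family built from a standard density f_0, the Pitman estimator is the ratio of two integrals in θ, here approximated on a grid. The grid range and density choice are illustrative assumptions.)

    import numpy as np

    def pitman(x, f0, grid=np.linspace(-20, 20, 8001)):
        # best location equivariant estimator under squared error:
        #   integral of theta * f(x - theta*1) dtheta / integral of f(x - theta*1) dtheta
        vals = np.array([np.prod(f0(x - th)) for th in grid])
        return np.trapz(grid * vals, grid) / np.trapz(vals, grid)

    f_normal = lambda z: np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)
    x = np.array([1.1, -0.4, 2.3, 0.7])
    print(pitman(x, f_normal), x.mean())   # for the normal case the Pitman estimator is the sample mean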
Lemma 69  For the case of L(θ, a) = (θ − a)², a best location equivariant estimator of θ
must be unbiased for θ.
Now cancel the assumption that Θ = 𝒜 = ℝ¹. The following is a more general story
about invariance/equivariance.

Suppose that 𝒫 = {P_θ} is identifiable and (to avoid degeneracies) that

    L(θ, a) = L(θ, a′) ∀ θ  ⟹  a = a′.

Invariance/equivariance theory is phrased in terms of groups of transformations
 i) 𝒳 → 𝒳,
 ii) Θ → Θ, and
 iii) 𝒜 → 𝒜,
and how L(θ, a) and decision rules behave under these groups of transformations. The
natural starting place is with transformations on 𝒳.

Definition 47  For 1-1 transformations of 𝒳 onto 𝒳, g_1 and g_2, the composition of g_1 and
g_2 is another 1-1 onto transformation of 𝒳 defined by g_2 ∘ g_1(x) = g_2(g_1(x)) ∀ x ∈ 𝒳.

Definition 48  For g a 1-1 transformation of 𝒳 onto 𝒳, if h is another 1-1 onto
transformation such that h ∘ g(x) = x ∀ x ∈ 𝒳, h is called the inverse of g and denoted
as h = g^(-1).

It is then an easy fact that not only is g^(-1) ∘ g(x) = x ∀ x, but g ∘ g^(-1)(x) = x ∀ x as well.

Definition 49  The identity transformation on 𝒳 is defined by e(x) = x ∀ x.

Another easy fact is then that e ∘ g = g ∘ e = g for any 1-1 transformation g of 𝒳 onto 𝒳.

(Obvious) Result 70  If 𝒢 is a class of 1-1 transformations of 𝒳 onto 𝒳 that is closed
under composition (g_1, g_2 ∈ 𝒢 ⟹ g_1 ∘ g_2 ∈ 𝒢) and inversion (g ∈ 𝒢 ⟹ g^(-1) ∈ 𝒢), then
with the operation "∘" (composition) 𝒢 is a group in the usual mathematical sense.
A group such as that described in Result 70 will be called a transformation group. Such a
group may or may not be Abelian/commutative (it is if g_1 ∘ g_2 = g_2 ∘ g_1 ∀ g_1, g_2 ∈ 𝒢).

Jargon/notation: Any class of transformations C can be thought of as generating a
group. That is, 𝒢(C), consisting of all compositions of finite numbers of elements of C
and their inverses, is a group under the operation "∘".

To get anywhere with this theory, we then need transformations on Θ that fit together
nicely with those on 𝒳. At a minimum, one will need that the transformations on 𝒳 don't
get one outside of 𝒫.

Definition 50  If g is a 1-1 transformation of 𝒳 onto 𝒳 such that {P_θ^{g(X)}}_{θ∈Θ} = 𝒫, then
the transformation g is said to leave the model invariant. If each element of a group of
transformations 𝒢 leaves 𝒫 invariant, we'll say that the group leaves the model invariant.

One circumstance in which 𝒢 leaves 𝒫 invariant is where the model 𝒫 can be thought of
as "generated by 𝒢" in the following sense. Suppose that U has some fixed probability
distribution P_0 and consider

    𝒫 = {P^{g(U)} | g ∈ 𝒢}.

If a group of transformations on 𝒳 (say 𝒢) leaves 𝒫 invariant, there is a natural
corresponding group of transformations on Θ. That is, for g ∈ 𝒢,

    X ~ P_θ ⟹ g(X) ~ P_{θ′} for some θ′ ∈ Θ.

In fact, because of the identifiability assumption, there is only one such θ′. So, as a
matter of notation, adopt the following.

Definition 51  If 𝒢 leaves the model 𝒫 invariant, for each g ∈ 𝒢, define ḡ: Θ → Θ by

    ḡ(θ) = the element θ′ of Θ such that X ~ P_θ ⟹ g(X) ~ P_{θ′}.

(According to the notation established in Definition 51, for 𝒢 leaving 𝒫 invariant,
X ~ P_θ ⟹ g(X) ~ P_{ḡ(θ)}.)

(Obvious) Result 71  For 𝒢 a group of 1-1 transformations of 𝒳 onto 𝒳 leaving 𝒫
invariant, 𝒢̄ = {ḡ | g ∈ 𝒢}, with the operation "∘" (composition), is a group of 1-1
transformations of Θ.

(I believe that the group 𝒢̄ is the homomorphic image of 𝒢 under the map " ¯ ",
so that, for example, (g^(-1))‾ = (ḡ)^(-1) and (g_2 ∘ g_1)‾ = ḡ_2 ∘ ḡ_1.)
We next need transformations on 𝒜 that fit nicely together with those on 𝒳 and Θ. If
that's going to be possible, it would seem that the loss function will have to fit nicely with
the transformations already considered.

Definition 52  A loss function L(θ, a) is said to be invariant under the group of
transformations 𝒢 provided that for every g ∈ 𝒢 and a ∈ 𝒜, ∃ a unique a′ ∈ 𝒜 such that

    L(θ, a) = L(ḡ(θ), a′)   ∀ θ.

Now then, as another matter of jargon/notation, adopt the following.

Definition 53  If 𝒢 leaves the model invariant and the loss L(θ, a) is invariant, define
g̃: 𝒜 → 𝒜 by

    g̃(a) = the element a′ of 𝒜 such that L(θ, a) = L(ḡ(θ), a′) ∀ θ.

(According to the notation established in Definition 53, for 𝒢 leaving 𝒫 invariant and
invariant loss L(θ, a),

    L(θ, a) = L(ḡ(θ), g̃(a)).)

(Obvious) Result 72  For 𝒢 a group of 1-1 transformations of 𝒳 onto 𝒳 leaving 𝒫
invariant and invariant loss L(θ, a), 𝒢̃ = {g̃ | g ∈ 𝒢}, with the operation "∘"
(composition), is a group of 1-1 transformations of 𝒜.

(I believe that the group 𝒢̃ is the homomorphic image of 𝒢 under the map " ˜ ",
so that, for example, (g^(-1))̃ = (g̃)^(-1) and (g_2 ∘ g_1)̃ = g̃_2 ∘ g̃_1.)
So finally we come to the notion of equivariance for decision functions. The whole
business is slightly subtle for behavioral rules (see pages 150-151 of Ferguson), so for this
elementary introduction we'll confine attention to nonrandomized rules.

Definition 54  In an invariant decision problem (one where 𝒢 leaves 𝒫 invariant and
L(θ, a) is invariant) a nonrandomized decision rule δ: 𝒳 → 𝒜 is said to be equivariant
provided

    δ(g(x)) = g̃(δ(x))   ∀ x ∈ 𝒳 and g ∈ 𝒢.

(One could make a definition here involving a single g or perhaps a 3 element group
𝒢 = {g, g^(-1), e} and talk about equivariance under g instead of under a group. I'll not
bother.)

The basic elementary theorem about invariance/equivariance is the generalization of
Theorem 64 (that cuts down the size of the comparison problem that one faces when
comparing risk functions).

Theorem 73  If 𝒢 leaves 𝒫 invariant, L(θ, a) is invariant and δ is an equivariant
nonrandomized decision rule, then

    R(θ, δ) = R(ḡ(θ), δ)   ∀ θ ∈ Θ and g ∈ 𝒢.
Definition 55  In an invariant decision problem, for θ ∈ Θ, O_θ = {θ′ ∈ Θ | θ′ = ḡ(θ) for
some g ∈ 𝒢} is called the orbit in Θ containing θ.

The set of orbits {O_θ}_{θ∈Θ} partitions Θ. The distinct orbits are equivalence classes.
Theorem 73 says that an equivariant nonrandomized decision rule in an invariant decision
problem has a risk function that is constant on orbits. When looking for good equivariant
rules, the problem of comparing risk functions over Θ is reduced somewhat to comparing
risk functions on orbits. Of course, that comparison is especially easy when there is only
one orbit.

Definition 56  A group of 1-1 transformations of a space onto itself is said to be transitive
if for any two points in the space there is a transformation in the group taking the first
point into the second.

Corollary 74  If 𝒢 leaves 𝒫 invariant, L(θ, a) is invariant and 𝒢̄ is transitive over Θ, then
every equivariant nonrandomized decision rule has constant risk.

Definition 57  In an invariant decision problem, a decision rule δ is said to be a best
(nonrandomized) equivariant (or MRE ... minimum risk equivariant) decision rule if it is
equivariant and for any other nonrandomized equivariant rule δ′

    R(θ, δ) ≤ R(θ, δ′)   ∀ θ.

The import of Theorem 73 is that one need only check this risk inequality for one θ
from each orbit.
Maximum Likelihood (Primarily the Simple Asymptotics Thereof)

Maximum likelihood is a non-decision-theoretic method of point estimation dating at
least to Fisher and probably beyond. As usual, for an identifiable family of distributions
𝒫 = {P_θ} dominated by a σ-finite measure μ, we let f_θ = dP_θ/dμ.

Definition 58  A statistic T: 𝒳 → Θ is called a maximum likelihood estimator (MLE) of θ
if

    f_{T(x)}(x) = sup_{θ∈Θ} f_θ(x)   ∀ x ∈ 𝒳.

This definition is not without practical problems in terms of existence and computation of
MLEs. f_θ(x) can fail to be bounded in θ, and even when it is, can fail to have a
maximum value as one searches over Θ. One way to get rid of the problem of potential
lack of boundedness (favored and extensively practiced by Prof. Meeker for example) is
to admit that one can measure only "to the nearest something," so that "real" 𝒳 are
discrete (finite or countable), so that μ can be counting measure and

    f_θ(x) = P_θ({x}) ≤ 1

(so that one is just maximizing the probability of observing the data in hand). Of course,
the possibility that the sup is not achieved still remains even when this device is used
(like, for example, when X ~ Binomial(n, p) for p ∈ (0, 1) when X = 0 or n).

Note that in the common situation where Θ ⊂ ℝ^k and the f_θ are nicely differentiable in
θ, a maximum likelihood estimator must satisfy the likelihood equations

    (∂/∂θ_i) log f_θ(x) |_{θ=T(x)} = 0   for i = 1, 2, ..., k.

Much of the theory connected with maximum likelihood estimation has to do with roots
of the likelihood equations (as opposed to honest maximizers of f_θ(x)). Further, much of
the attractiveness of the method derives from its asymptotic properties. A simple version
of the theory (primarily for the k = 1 case) follows.
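(A small sketch of mine on solving the likelihood equation numerically: the Cauchy location family, where the equation genuinely can have multiple roots, so the starting point matters. The crude grid search plus a few Newton steps is just one safeguarded way to land on a sensible root.)

    import numpy as np

    def loglik(theta, x):
        return -np.sum(np.log(1 + (x - theta) ** 2))     # Cauchy location log likelihood (up to constants)

    def score(theta, x):
        return np.sum(2 * (x - theta) / (1 + (x - theta) ** 2))

    def score_prime(theta, x):
        u = x - theta
        return np.sum((2 * u**2 - 2) / (1 + u**2) ** 2)

    rng = np.random.default_rng(4)
    x = rng.standard_cauchy(25) + 1.0

    grid = np.linspace(x.min(), x.max(), 2001)
    th = grid[np.argmax([loglik(t, x) for t in grid])]   # crude global search over a grid
    for _ in range(5):                                   # a few Newton steps on the score
        th -= score(th, x) / score_prime(th, x)
    print(th, score(th, x))                              # a root of the likelihood equation near 1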
Adopt the following notation for the discussion of the asymptotics of maximum
likelihood.

    X^n        represents all information on θ available through the nth stage of observation
    𝒳_n        the observation space for X^n (through the nth stage)
    ℬ_n        a σ-algebra on 𝒳_n
    P_θ^{X^n}   the distribution of X^n (dominated by a σ-finite measure μ_n)
    Θ          a (fixed) parameter space
    f_θ^n       the R-N derivative dP_θ^{X^n}/dμ_n

For this material one really does need the basic probability space (Ω, ℱ, P_θ) in the
background, as it is necessary to have a fixed space upon which to prove convergence
results. Often (i.e. in the iid case) we may take

    (𝒳, ℬ, P_θ^{X_1}) to be a "1-dimensional" model where the P_θ^{X_1}'s are absolutely
        continuous with respect to a σ-finite measure μ,
    Ω = 𝒳^∞,
    X_i(ω) = the ith coordinate of ω,
    X^n = (X_1, X_2, ..., X_n),
    𝒳_n = 𝒳^n,
    ℱ = ℬ^∞,
    P_θ = (P_θ^{X_1})^∞,
    P_θ^{X^n} = (P_θ^{X_1})^n,
    μ_n = μ^n,   and
    f_θ^n(x^n) = Π_{i=1}^n f_θ(x_i).

For such a set-up, the usual drill is, for appropriate sequences of estimators {θ̂_n(x^n)}, to
try to prove consistency and asymptotic normality results.
Definition 59  Suppose that {T_n} is a sequence of statistics T_n: 𝒳_n → Θ, where Θ ⊂ ℝ^k.
 i) {T_n} is (weakly) consistent at θ_0 ∈ Θ if for every ε > 0,
    P_{θ_0}( |T_n − θ_0| > ε ) → 0 as n → ∞, and
 ii) {T_n} is strongly consistent at θ_0 ∈ Θ if T_n → θ_0 a.s. P_{θ_0}.
There are two standard paths for developing asymptotic results for likelihood-based
estimators in iid contexts. These are
 i) the path blazed by Wald, based on very strong regularity assumptions (that can
    be extremely hard to verify for a specific model) but that deals with "real"
    maximum likelihood estimators, and
 ii) the path blazed by Cramér, based on much weaker assumptions and far easier
    proofs, that deals with solutions to the likelihood equations.
We'll wave our hands very briefly at a heuristic version of the Wald argument and defer
to pages 415-424 of Schervish for technical details. Then we'll do a version of the
Cramér results (probably assuming more than is absolutely necessary to get them).

All that follows regarding the asymptotics of maximum likelihood refers to the iid case,
where x^n = (x_1, x_2, ..., x_n), Θ ⊂ ℝ^k and f_θ^n(x^n) = Π_{i=1}^n f_θ(x_i).
Lemma 75 (Silvey, page 75) Suppose that P and Q are two different distributions on a
space 𝒳 with R-N derivatives p > 0 and q > 0 with respect to a σ-finite measure μ.
Suppose further that a function Φ: (0, ∞) → ℜ is convex. Then
E_Q Φ( p(X)/q(X) ) ≥ Φ(1) ,
with strict inequality if Φ is strictly convex.
Now suppose that the f_θ are positive everywhere on 𝒳 and define the function of θ
D_{θ_0}(θ) ≡ E_{θ_0} log f_θ(X) = ∫ log f_θ(x) f_{θ_0}(x) dμ(x) .
Corollary 76 If the f_θ are positive everywhere on 𝒳, then for θ ≠ θ_0
D_{θ_0}(θ_0) > D_{θ_0}(θ) .
Then, as a matter of notation, let
ℓ_n(x^n, θ) ≡ log f_θ^n(x^n) = Σ_{i=1}^n log f_θ(x_i) .
Then clearly E_{θ_0} (1/n) ℓ_n(X^n, θ) = D_{θ_0}(θ), and in fact the law of large numbers says that for
every θ,
(1/n) ℓ_n(X^n, θ) = (1/n) Σ_{i=1}^n log f_θ(X_i) → D_{θ_0}(θ)   a.s. P_{θ_0} .
So assuming enough regularity conditions to make this almost sure convergence uniform
in θ, when θ_0 governs the distribution of the observations there is perhaps hope that a
single maximizer of (1/n) ℓ_n(X^n, θ) will eventually exist and be close to the maximizer of
D_{θ_0}(θ), namely θ_0. To make this precise is not so easy. See Theorems 7.49 and 7.54 of
Schervish.
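To see this convergence numerically, here is a minimal simulation sketch (not from these notes; it assumes an exponential model with density f_θ(x) = θ e^{−θx} and true value θ_0 = 2, both illustrative choices). For this model D_{θ_0}(θ) = log θ − θ/θ_0, and the sketch checks that the average log likelihood approaches D_{θ_0}(θ) over a grid as n grows.

# A minimal simulation sketch of the heuristic above: for iid exponential data with
# density f_theta(x) = theta * exp(-theta * x), the average log likelihood
# (1/n) l_n(X^n, theta) converges to D_{theta0}(theta) = log(theta) - theta/theta0,
# which is maximized at theta = theta0.  (The exponential model and theta0 = 2.0
# are illustrative choices, not part of the notes.)
import numpy as np

rng = np.random.default_rng(0)
theta0 = 2.0
theta_grid = np.linspace(0.5, 5.0, 10)

def avg_loglik(x, theta):
    # (1/n) sum_i log f_theta(x_i) for the exponential model
    return np.mean(np.log(theta) - theta * x)

def D(theta):
    # E_{theta0} log f_theta(X) = log(theta) - theta * E_{theta0} X
    return np.log(theta) - theta / theta0

for n in (100, 10_000, 1_000_000):
    x = rng.exponential(scale=1.0 / theta0, size=n)
    max_gap = max(abs(avg_loglik(x, t) - D(t)) for t in theta_grid)
    print(f"n = {n:>8}: max |(1/n) l_n - D| over grid = {max_gap:.4f}")
# Both the empirical average and D are maximized (over the grid) near theta0 = 2.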
Now turn to Cramer type results.
Theorem 77 Suppose that X_1, X_2, ..., X_n are iid P_θ, Θ ⊂ ℜ^1 and there exists an open
neighborhood of θ_0, say O, such that
i) f_θ(x) > 0 ∀x and ∀θ ∈ O,
ii) ∀x, f_θ(x) is differentiable at every point θ ∈ O, and
iii) E_{θ_0} log f_θ(X) exists for all θ ∈ O and E_{θ_0} log f_{θ_0}(X) is finite.
Then, if ε > 0 and δ > 0, ∃ n such that ∀ m > n, the P_{θ_0} probability that the equation
(d/dθ) Σ_{i=1}^m log f_θ(X_i) = 0
has a root within ε of θ_0 is at least 1 − δ.
(Essentially this theorem ducks the uniformity issue raised earlier by considering ℓ_n at
only θ_0, θ_0 − ε and θ_0 + ε.)
In general, Theorem 77 doesn't immediately do what we'd like. The root guaranteed by
the Theorem isn't necessarily an MLE. And in fact, when the likelihood equation can
have multiple roots, we haven't even yet defined a sequence of random variables, let alone
a sequence of estimators converging to )! . But the theorem can form the basis for simple
results that are of statistical importance.
Corollary 78 Under the hypotheses of Theorem 77, suppose that (in addition) Θ is open
and ∀x, f_θ(x) is positive and differentiable at every point θ ∈ Θ, so that the likelihood
equation
(d/dθ) Σ_{i=1}^n log f_θ(X_i) = 0
makes sense at all points θ ∈ Θ. Then define θ̂_n to be the root of the likelihood equation
when there is exactly one (and otherwise adopt any arbitrary definition for θ̂_n). If with
P_{θ_0} probability approaching 1 the likelihood equation has a single root, then
θ̂_n → θ_0 in P_{θ_0} probability .
Where there is no guarantee that the likelihood equation will eventually have only a
single root, one could use Theorem 77 to argue that, defining
ψ_n(θ_0) = 0 if the likelihood equation has no root, and ψ_n(θ_0) = the root closest to θ_0 otherwise,
one has ψ_n(θ_0) → θ_0 in P_{θ_0} probability. (The continuity of the likelihood implies that limits of
roots are roots, so for θ_0 in an open Θ, the definition of ψ_n(θ_0) above eventually makes
sense.) But, of course, this is of no statistical help. What is similar in nature but of more
help is the next corollary.
Corollary 79 Under the hypotheses of Theorem 77, if {T_n} is a sequence of estimators
consistent at θ_0 and
θ̂_n ≡ T_n if the likelihood equation has no roots, and θ̂_n ≡ the root closest to T_n otherwise,
then
θ̂_n → θ_0 in P_{θ_0} probability .
In practical terms, finding the root of the likelihood equation nearest to X8 may be a
numerical problem. (Actually, if it is, it's not so clear how you know you have solved the
problem. But we'll gloss over that.) Finding a root beginning from X8 might be
approached using something like Newton's method. That is, abbreviating
L_n(θ) ≡ ℓ_n(X^n, θ) = Σ_{i=1}^n log f_θ(X_i) ,
the likelihood equation is
L′_n(θ) = 0 .
One might try looking for a root of this equation by iterating using linearizations of L′_n(θ)
(presumably beginning at T_n). Or, assuming that L″_n(T_n) ≠ 0, taking a single step in this
iteration process, a 1-step likelihood-based modification/?improvement? on T_n,
θ̃_n = T_n − L′_n(T_n) / L″_n(T_n) ,
is suggested. It is plausible that an estimator like )˜ 8 might have asymptotic behavior
similar to that of an MLE. (See page 442 of TPE or Theorem 7.75 of Schervish, that
treats a more general multiparameter M-estimation version of this idea.)
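As a concrete illustration of this 1-step idea, the following sketch (illustrative assumptions only: a Cauchy location model and the sample median playing the role of the consistent starting estimator T_n) computes θ̃_n = T_n − L′_n(T_n)/L″_n(T_n).

# A sketch (not from the notes) of the 1-step Newton modification of a consistent
# estimator, theta_tilde = T_n - L'_n(T_n) / L''_n(T_n), for the Cauchy location
# model f_theta(x) = 1 / (pi * (1 + (x - theta)^2)).  The sample median plays the
# role of the consistent starting estimator T_n.
import numpy as np

rng = np.random.default_rng(1)
theta_true = 3.0
n = 500
x = theta_true + rng.standard_cauchy(n)

def L1(theta):           # L'_n(theta) for the Cauchy location model
    u = x - theta
    return np.sum(2.0 * u / (1.0 + u * u))

def L2(theta):           # L''_n(theta)
    u = x - theta
    return np.sum(2.0 * (u * u - 1.0) / (1.0 + u * u) ** 2)

T_n = np.median(x)                        # consistent (but inefficient) starting point
theta_tilde = T_n - L1(T_n) / L2(T_n)     # one Newton step toward a likelihood root
print("median T_n      =", round(T_n, 4))
print("1-step theta~_n =", round(theta_tilde, 4))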
The next matter up is the asymptotic distribution of likelihood-based estimators. The
following is representative of what people have proved. (I suspect that the hypotheses
here are stronger than are really needed to get the conclusion, but they make for a simple
proof.)
Theorem 80 Suppose that X_1, X_2, ..., X_n are iid P_θ, Θ ⊂ ℜ^1 and there exists an open
neighborhood of θ_0, say O, such that
i) f_θ(x) > 0 ∀x and ∀θ ∈ O,
ii) ∀x, f_θ(x) is three times differentiable at every point θ ∈ O,
iii) ∃ M(x) ≥ 0 with E_{θ_0} M(X) < ∞ and
|(d³/dθ³) log f_θ(x)| ≤ M(x)   ∀x and ∀θ ∈ O,
iv) 1 = ∫ f_θ(x) dμ(x) can be differentiated twice wrt θ under the integral at θ_0,
i.e. both 0 = ∫ (d/dθ) f_θ(x)|_{θ=θ_0} dμ(x) and 0 = ∫ (d²/dθ²) f_θ(x)|_{θ=θ_0} dμ(x), and
v) I_1(θ) ≡ E_θ ( (d/dθ) log f_θ(X) )² ∈ (0, ∞) ∀θ ∈ O.
If with P_{θ_0} probability approaching 1, {θ̂_n} is a root of the likelihood equation, and θ̂_n →
θ_0 in P_{θ_0} probability, then under P_{θ_0}
√n (θ̂_n − θ_0) →d N(0, 1/I_1(θ_0)) .
There are multiparameter versions of this theorem (indeed the Schervish theorem is a
multiparameter one). And there are similar (multiparameter) versions of this for the
"Newton 1-step approximate MLE's" based on consistent estimators of ) (i.e. )˜ 8 's).
The practical implication of a result like this theorem is that one can make approximate
confidence intervals for ) of the form
θ̂_n ± z · 1/√(n I_1(θ_0)) .
Well, it is almost a practical implication of the theorem. One needs to do something
about the fact that M" Ð)! Ñ is unknown when making a confidence interval so that it can't be
used in the formula. There are two common routes to fixing this problem. The first is to
use I_1(θ̂_n) in its place. This method is often referred to as using the "expected Fisher
information" in the making of the interval. (Note that I_1(θ_0) is a measure of curvature
of D_{θ_0}(θ) at its maximum (located at θ_0), while I_1(θ̂_n) is a measure of curvature of D_{θ̂_n}(θ) at
its maximum (located at θ̂_n).) Dealing with the theory of the first method is really very
simple. It is easy to prove the following.
Corollary 81 Under the hypotheses of Theorem 80, if M" Ð)Ñ is continuous at )! , then
under T)!
√( n I_1(θ̂_n) ) · (θ̂_n − θ_0) →d N(0, 1) .
(Actually, the continuity of M" Ð)Ñ at )! may well follow from the hypotheses of Theorem
80 and therefore be redundant in the hypotheses of this corollary. I've not tried to think
this issue through.)
The second method of approximating I_1(θ_0) comes about by realizing that in the proof of
Theorem 80, I_1(θ_0) arises as the limit of −(1/n) L″_n(θ_0). One might then hope to
approximate it by
−(1/n) L″_n(θ̂_n) = −(1/n) Σ_{i=1}^n (d²/dθ²) log f_θ(X_i) |_{θ=θ̂_n} ,
which is sometimes called the "observed Fisher information." (Note that −(1/n) L″_n(θ̂_n) is
a measure of curvature of the average log likelihood function (1/n) ℓ_n(X^n, θ) at θ̂_n, which
one is secretly thinking of as essentially a maximizer of the average log likelihood.)
Schervish says that Efron and others claim that this second method is generally better
than the first (in terms of producing intervals with real coverage properties close to the
nominal properties). Proving an analogue of Corollary 81 where −(1/n) L″_n(θ̂_n) replaces
I_1(θ̂_n) is somewhat harder, but not by any means impossible. That is, one can prove the
following.
Corollary 82 Under the hypotheses of Theorem 80 and under P_{θ_0},
√( −L″_n(θ̂_n) ) · (θ̂_n − θ_0) →d N(0, 1) .
Under the hypotheses of Theorem 80, convergence in distribution of θ̂_n implies the
convergence of the second moment of θ̂_n, so that
Var_{θ_0} ( √n (θ̂_n − θ_0) ) → 1/I_1(θ_0) ,
i.e.
Var_{θ_0} ( √n θ̂_n ) → 1/I_1(θ_0) .
Then the fact that the C-R inequality declares that under suitable conditions no unbiased
estimator of θ has variance less than 1/(n I_1(θ)) makes one suspect that for a consistent
sequence of estimators {θ*_n}, convergence under θ_0
√n (θ*_n − θ_0) →d N(0, V(θ_0))
might imply that V(θ_0) ≥ 1/I_1(θ_0), but in fact this inequality does not have to hold up.
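To make the two interval recipes above concrete, here is a small sketch under illustrative assumptions (a Cauchy location model, for which the expected Fisher information is I_1(θ) = 1/2, a 95% confidence level, and a Newton iteration from the median used to locate a likelihood root). It computes both the expected-information and observed-information intervals.

# A minimal sketch (illustrative assumptions: Cauchy location model, 95% intervals)
# of the two large-sample confidence intervals discussed above.  For the Cauchy
# location model the expected Fisher information is I_1(theta) = 1/2, while the
# observed information is -L''_n(theta_hat).
import numpy as np

rng = np.random.default_rng(2)
theta_true, n = 3.0, 400
x = theta_true + rng.standard_cauchy(n)

def L1(theta):
    u = x - theta
    return np.sum(2.0 * u / (1.0 + u * u))

def L2(theta):
    u = x - theta
    return np.sum(2.0 * (u * u - 1.0) / (1.0 + u * u) ** 2)

# find a root of the likelihood equation by Newton iteration from the median
theta_hat = np.median(x)
for _ in range(25):
    theta_hat -= L1(theta_hat) / L2(theta_hat)

z = 1.96                                    # approximate 0.975 standard normal quantile
half_expected = z / np.sqrt(n * 0.5)        # uses I_1(theta_hat) = 1/2
half_observed = z / np.sqrt(-L2(theta_hat)) # uses the observed Fisher information
print("theta_hat:", round(theta_hat, 4))
print("expected-information CI:", (round(theta_hat - half_expected, 4),
                                   round(theta_hat + half_expected, 4)))
print("observed-information CI:", (round(theta_hat - half_observed, 4),
                                   round(theta_hat + half_observed, 4)))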
Hypothesis Testing (Simple vs Simple, Composite
Hypotheses, UMP Tests, Confidence Regions, UMPU Tests,
Likelihood Ratio and Related Tests)
We now consider the classical hypothesis testing problem. Here we treat
Θ = Θ_0 ∪ Θ_1 where Θ_0 ∩ Θ_1 = ∅
and the hypotheses H3 :) - @3 3 œ !ß ". As is standard, we will call H! the "null
hypothesis" and H" the "alternative hypothesis." @3 with one element is called "simple"
and with more than one element is called "composite."
Deciding between H! and H" on the basis of data can be thought of in terms of a two-action decision problem. That is, with T œ Ö!ß "× one is using \ to decide in favor of
one of the two hypotheses. Typically one wants a loss structure where correct decisions
are preferable to incorrect ones. An obvious (and popular) possibility is !-" loss,
PÐ)ß +Ñ œ MÒ) - @! ß + œ "Ó € MÒ) - @" ß + œ !Ó .
While it is perfectly sensible to simply treat hypothesis testing as a two-action decision
problem, being older than formal statistical decision theory, it has its own
peculiar/specialized set of jargon and emphases that we will have to recognize. These
include the following.
Definition 60 A nonrandomized decision function $ À k pÖ!ß "× is called a hypothesis
test.
Definition 61 The rejection region for a nonrandomized test $ is ÖBl $ ÐBÑ œ "×Þ
Definition 62 A type I error in a testing problem is taking action 1 when ) - @! . A type
II error in a testing problem is taking action 0 when ) - @" Þ
Note that in a two-action decision problem, a behavioral decision rule 9B is for each B a
distribution over Ö!ß "×, which is clearly characterized by 9B ÐÖ"×Ñ - Ò!ß "Ó. This notion
can be expressed in other terms.
Definition 63 A randomized hypothesis test is a function 9 À k pÒ!ß "Ó, with the
interpretation that having observed \ œ B one rejects H! with probability 9ÐBÑ.
Note that for !-" loss,
for ) - @! ß VÐ)ß 9Ñ œ E) MÒ+ œ "Ó œ E) 9Ð\Ñ
the type I error probability, while
for ) - @" , VÐ)ß 9Ñ œ E) MÒ+ œ !Ó œ E) Ð" • 9Ð\ÑÑ œ " • E) 9Ð\Ñ
the type II error probability. Most of testing theory is phrased in terms of E) 9Ð\Ñ that
appears in both of these expressions.
Definition 64 The power function of the (possibly randomized) test 9 is
"9 Ð)Ñ » E) 9Ð\Ñ œ ( 9ÐBÑ.T) ÐBÑ Þ
Definition 65 The size or level of a test φ is sup_{θ∈Θ_0} β_φ(θ).
This is the maximum probability of a type I error. Standard testing theory proceeds
(somewhat asymmetrically) by setting a maximum allowable size and then (subject to that
restriction) trying to maximize the power (uniformly) over @" Þ (That is, subject to a size
limitation, standard theory seeks to minimize type II error probabilities.)
Recall that Result 21 says that if X is sufficient for ), the set of behavioral decision rules
that are functions of X is essentially complete. Arguing that fact in the present case is
very easy/straightforward. For any randomized test 9, consider
9‡ ÐBÑ œ EÒ9lX ÓÐBÑ ,
which is clearly a UÐX Ñ measurable function taking values in Ò!ß "Ó, i.e. is a hypothesis
test based on X . Note that
"9‡ Ð)Ñ œ E) 9‡ Ð\Ñ œ E) EÒ9lX Ó œ E) 9Ð\Ñ œ "9 Ð)Ñ ,
and 9‡ and 9 have the same power functions and are thus equivalent.
Simple vs Simple Testing Problems
Consider first testing problems where @! œ Ö)! × and @" œ Ö)" × (with !-" loss). In this
finite @ decision problem, we can for each test 9 define the risk vector
ÐVÐ)! ß 9Ñß V Ð)" ß 9ÑÑ œ Ð"9 Ð)! Ñß " • "9 Ð)" ÑÑ
and make the standard two-dimensional plot of risk vectors. A completely equivalent
device is to plot instead the points
Ð"9 Ð)! Ñß "9 Ð)" ÑÑ .
Lemma 83 For a simple versus simple testing problem, the risk set f has the following
properties:
i) f is convex,
ii) f is symmetric wrt the point Ð "# ß "# Ñ,
iii) Ð!ß "Ñ - f and Ð"ß !Ñ - f , and
iv) f is closed.
Lemma 83' (see page 77 of TSH) For a simple versus simple testing problem, the set
i » ÖÐ"9 Ð)! Ñß "9 Ð)" ÑÑl9 is a test× has the following properties:
i) i is convex,
ii) i is symmetric wrt the point Ð "# ß "# Ñ,
iii) Ð!ß !Ñ - i and Ð"ß "Ñ - i , and
iv) i is closed.
As indicated earlier, the usual game in testing is to fix α and look for a test with size
≤ α and maximum power on Θ_1.
Definition 66 For testing H! À ) œ )! vs H" À ) œ )" , a test is called most powerful of
size ! if "9 Ð)! Ñ œ ! and for any other test 9‡ with "9‡ Ð)! Ñ Ÿ !, "9‡ Ð)" Ñ Ÿ "9 Ð)" Ñ.
The fundamental result of all of testing theory is the Neyman-Pearson Lemma.
Theorem 84 (The Neyman-Pearson Lemma) Suppose that 𝒫 ≪ μ, a σ-finite measure,
and let f_i = dP_{θ_i}/dμ.
i) (Sufficiency) Any test of the form
φ(x) = 1 if f_1(x) > k f_0(x),  = ν(x) if f_1(x) = k f_0(x),  = 0 if f_1(x) < k f_0(x)   (æ)
for some k ∈ [0, ∞) and ν: 𝒳 → [0, 1] is most powerful of its size. Further,
corresponding to the k = ∞ case, the test
φ(x) = 1 if f_0(x) = 0,  = 0 otherwise   (ææ)
is most powerful of size α = 0.
ii) (Existence) For every α ∈ [0, 1], there exists a test of form (æ) with ν(x) ≡ ν
(a constant in [0, 1]) or a test of form (ææ) with β_φ(θ_0) = α.
iii) (Uniqueness) If φ is a most powerful test of size α, then it is of form (æ) or
form (ææ) except possibly for x ∈ N with P_{θ_0}(N) = P_{θ_1}(N) = 0.
Note that for any two distributions there is always a dominating 5-finite measure
(. œ T)! € T)" will do the job). Further, it's easy to verify that the tests prescribed by this
theorem do not depend upon the dominating measure.
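As a small numerical illustration of parts i) and ii), the sketch below (assumptions not from the notes: X ~ Binomial(10, p), H_0: p = 0.5 vs H_1: p = 0.7, α = 0.05) constructs a randomized test of form (æ) that achieves size exactly α.

# A sketch (assumptions: X ~ Binomial(10, p), H0: p = 0.5 vs H1: p = 0.7,
# alpha = 0.05 -- none of these numbers come from the notes) of building the
# randomized most powerful test of form (ae): because f_1(x)/f_0(x) is increasing
# in x here, the test rejects for large x, with randomization at a boundary point.
from math import comb

n, p0, p1, alpha = 10, 0.5, 0.7, 0.05

def pmf(x, p):
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# choose the smallest c with P_0(X > c) <= alpha, then randomize at X = c
c = min(c for c in range(n + 1) if sum(pmf(x, p0) for x in range(c + 1, n + 1)) <= alpha)
tail = sum(pmf(x, p0) for x in range(c + 1, n + 1))
nu = (alpha - tail) / pmf(c, p0)          # phi(c) = nu, a constant in [0, 1]

size = tail + nu * pmf(c, p0)             # equals alpha exactly
power = sum(pmf(x, p1) for x in range(c + 1, n + 1)) + nu * pmf(c, p1)
print("reject for X >", c, "and reject with prob", round(nu, 4), "when X =", c)
print("size  =", round(size, 4))
print("power =", round(power, 4))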
Note also that one could take a more explicitly decision theoretic slant on this whole
business. Thinking of testing as a two-decision problem with !-" loss and prior
distribution K, abbreviating KÐÖ)! ×Ñ œ 1! and KÐÖ)1 ×Ñ œ 11 , it's easy to see that a
Bayesian is a Neyman-Pearson tester with 5 œ 1! Î1" and that a Neyman-Pearson tester is
a closet Bayesian with 1! œ 5ÎÐ5 € "Ñ and 1" œ "ÎÐ" € 5Ñ.
There is a kind of "finite @! vs simple @" " version of the N-P Lemma. See TSH page 96.
A final small fact about simple versus simple testing (that turns out to be useful in some
composite versus composite situations) is next.
Lemma 85 Suppose that α ∈ (0, 1) and φ is most powerful of size α. Then β_φ(θ_1) ≥ α,
and the inequality is strict unless f_0 = f_1 a.e. μ.
UMP Tests for Composite vs Composite Problems
Now consider cases where @! and @" need not be singletons.
Definition 67 φ is UMP (uniformly most powerful) of size α for testing H_0: θ ∈ Θ_0 vs
H_1: θ ∈ Θ_1 provided φ is of size α (i.e. sup_{θ∈Θ_0} β_φ(θ) = α) and for any other φ′ of size
no more than α, β_{φ′}(θ) ≤ β_φ(θ) ∀ θ ∈ Θ_1.
Problems for which one can find UMP tests are not all that many. Known results cover
situations where @ § e" involving one-sided hypotheses in monotone likelihood ratio
families, a few special cases that yield to a kind of "least favorable distribution" argument
and an initially somewhat odd-looking two-sided problem in one parameter exponential
families.
Definition 68 Suppose @ § e" and c ¥ ., a 5-finite measure on Ðk ß U Ñ. Suppose that
X À k pe" . The family of R-N derivatives Ö0) × is said to have the monotone likelihood
ratio (MLR) property in X if for every )" • )# belonging to @, the ratio 0)# ÐBÑÎ0)" ÐBÑ is
nondecreasing in X ÐBÑ on ÖBl0)" ÐBÑ € 0)# ÐBÑ ž !×. (c is also said to have the MLR
property.)
Theorem 86 Suppose that {f_θ} has MLR in T. Then for testing H_0: θ ≤ θ_0 vs
H_1: θ > θ_0, for each α ∈ (0, 1] ∃ a test of the form
φ_α(x) = 1 if T(x) > c,  = ν if T(x) = c,  = 0 if T(x) < c
for c ∈ [−∞, ∞) chosen to satisfy β_φ(θ_0) = α. Such a test
i) has strictly increasing power function,
ii) is UMP size α for these hypotheses, and
iii) uniformly minimizes β_φ(θ) for θ < θ_0 for tests with β_φ(θ_0) = α.
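Continuing the illustrative Binomial(10, p) example used after the Neyman-Pearson Lemma, the sketch below evaluates the power function of the corresponding size .05 test of H_0: p ≤ .5 (the cut-off c = 8 and randomization probability ν ≈ 0.8933 come from that earlier sketch); the computed power function is increasing in p, as in i), and equals α at p_0.

# Continuing the illustrative Binomial(10, p) example: the size 0.05 test of
# H0: p <= 0.5 vs H1: p > 0.5 from Theorem 86 rejects for T(x) = x > 8 and rejects
# with probability nu (about 0.8933, from the Neyman-Pearson sketch) when x = 8.
# This evaluates its power function, which is increasing in p and equals alpha at p0.
from math import comb

n, c, nu = 10, 8, 0.8933

def power(p):
    pmf = lambda x: comb(n, x) * p ** x * (1 - p) ** (n - x)
    return sum(pmf(x) for x in range(c + 1, n + 1)) + nu * pmf(c)

for p in (0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9):
    print(f"p = {p:.1f}   power = {power(p):.4f}")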
Theorem 87 Suppose that {f_θ} has MLR in T and for state space Θ and action space
𝒜 = {0, 1} the loss function L(θ, a) satisfies
L(θ, 1) − L(θ, 0) < 0 for θ > θ_0, and
L(θ, 1) − L(θ, 0) > 0 for θ < θ_0 .
Then the class of size α tests (α ∈ (0, 1)) identified in Theorem 86 is essentially
complete in the class of tests with size in (0, 1).
One two-sided problem for a single real parameter for which it is possible to find UMP tests is
covered by the next theorem (which is proved in Section 4.3.4 of Schervish using the
generalization of the N-P lemma alluded to earlier).
Theorem 88 Suppose Θ ⊂ ℜ^1 and 𝒫 ≪ μ, a σ-finite measure on (𝒳, ℬ), and
f_θ(x) = exp( a(θ) + θ T(x) )   a.e. μ ∀θ.
Suppose further that θ_1 < θ_2 belong to Θ. A test of the form
φ(x) = 1 if c_1 < T(x) < c_2,  = ν_i if T(x) = c_i (i = 1, 2),  = 0 if T(x) < c_1 or T(x) > c_2
for c_1 ≤ c_2 minimizes β_φ(θ) ∀ θ < θ_1 and ∀ θ > θ_2 over choices of tests φ′ with
β_{φ′}(θ_1) = β_φ(θ_1) and β_{φ′}(θ_2) = β_φ(θ_2). Further, if β_φ(θ_1) = β_φ(θ_2) = α, then φ is UMP
level α for testing H_0: θ ≤ θ_1 or θ ≥ θ_2 vs H_1: θ_1 < θ < θ_2.
As far as finding UMP tests for parameter spaces other than subsets of e" is concerned,
about the only "general" technique known is one discussed in Section 3.8 of TSHÞ The
method is not widely applicable, but does yield a couple of important results. It is based
on an "equalizer rule"/"least favorable distribution" type argument.
The idea is that for a few composite vs composite problems, we can for each θ_1 ∈ Θ_1
guess at a distribution A over Θ_0 so that with
f_A(x) = ∫_{Θ_0} f_θ(x) dA(θ) ,
simple versus simple testing of H_0: f_A vs H_1: f_{θ_1} produces a size α N-P test whose
power over Θ_0 is bounded by α. (Notice, by the way, that this is a Bayes test in the
composite vs composite problem against a prior that is a mixture of A and a point mass at
θ_1.) Anyhow, when that test does not depend on θ_1, we get a UMP size α test for the original
problem.
Definition 69 For testing H! À \ µ 0A vs H" À \ µ 0)" , A is called least favorable at
level ! if for 9A the N-P size ! test associated with A,
"9A Ð)" Ñ • "9Aw Ð)" Ñ
for any other distribution over @! , Aw .
Theorem 89 If 9 is of size ! for testing H! À ) - @! vs H" À ) œ )" , then it is of size no
more than ! for testing H! À \ µ 0A vs H" À \ µ 0)" . Further, if 9 is most powerful of
size ! for testing H! À \ µ 0A vs H" À \ µ 0)" and is also size ! for testing H! À ) - @!
vs H" À ) œ )" , then it is a MP size ! test for this set of hypotheses and A is least
favorable at level !.
Clearly, if in the context of Theorem 89, the same 9 is most powerful size ! for each )" ,
that test is then UMP size ! for the hypotheses H! À ) - @! vs H" À ) - @" .
Confidence Sets Derived From Tests (and Optimal Ones Derived From UMP
Tests)
Definition 70 Suppose that for each x ∈ 𝒳, S(x) is a (measurable) subset of Θ. The
family {S(x) | x ∈ 𝒳} is called a γ level family of confidence sets for θ provided
P_θ[θ ∈ S(X)] ≥ γ   ∀ θ ∈ Θ .
It is possible (and quite standard) to use tests to produce confidence procedures. That
goes as follows. For each )! - @, let 9)! be a size ! test of H! À ) œ )! vs H" À ) - @"
(where )! - @ • @" ) and define
W9 ÐBÑ œ Ö)l 9) ÐBÑ • "×
(W9 ÐBÑ is the set of )'s for which there is positive conditional probability of accepting H!
having observed \ œ B.) ÖW9 ÐBÑl B - k × is a level " • ! family of confidence sets for
). This rigmarole works perfectly generally and inversion of tests of point null
hypotheses is a time-honored means of constructing confidence procedures. In the
context of UMP testing, the construction produces confidence sets with an optimality
property.
Theorem 90 Suppose that for each θ_0, K(θ_0) ⊂ Θ with θ_0 ∉ K(θ_0), and φ_{θ_0} is a
nonrandomized UMP size α test of H_0: θ = θ_0 vs H_1: θ ∈ K(θ_0). Then if S_φ(x) is the
confidence procedure derived from the tests φ_{θ_0} and S(x) is any other level (1 − α)
confidence procedure,
P_θ[θ_0 ∈ S(X)] ≥ P_θ[θ_0 ∈ S_φ(X)]   ∀ θ ∈ K(θ_0) .
(That is, the W9 construction produces confidence procedures that minimize the
probabilities of covering incorrect values.)
UMPU Tests for Composite versus Composite Problems
Cases in which one can find UMP tests are not all that many. This should not be terribly
surprising given our experience to date in Stat 643. We are used to not being able to find
uniformly best statistical procedures. In estimation problems we found that by restricting
the class of estimators under consideration (e.g. to the unbiased or equivariant estimators)
we could sometimes find uniformly best rules in the restricted class. Something similar is
sometimes possible in composite versus composite testing problems.
Definition 71 A size α test of H_0: θ ∈ Θ_0 vs H_1: θ ∈ Θ_1 is unbiased provided
β_φ(θ) ≥ α   ∀ θ ∈ Θ_1 .
Definition 72 An unbiased size α test of H_0: θ ∈ Θ_0 vs H_1: θ ∈ Θ_1 is UMPU
(uniformly most powerful unbiased) level α if for any other size α unbiased test φ′
β_φ(θ) ≥ β_{φ′}(θ)   ∀ θ ∈ Θ_1 .
The conditions under which people can identify UMPU tests require that @ be a
topological space. (One needs notions of interior, boundary, continuity of functions, etc.)
So in this section make that assumption.
Definition 73 The boundary set is Θ_B ≡ cl(Θ_0) ∩ cl(Θ_1), the intersection of the closures of Θ_0 and Θ_1.
Lemma 91 If "9 Ð)Ñ is continuous and 9 is unbiased of size !, then "9 Ð)Ñ œ ! a ) - @B .
Definition 74 A test is said to be !-similar on the boundary if
" 9 Ð) Ñ œ ! a ) - @ B Þ
When looking for UMPU tests it often suffices to restrict attention to tests !-similar on
the boundary.
Lemma 92 Suppose that "9 Ð)Ñ is continuous for every 9. If 9 is UMP among tests !-similar on the boundary and has level !, then it is UMPU level !.
This lemma and Theorem 88 (or the generalized Neyman-Pearson lemma that stands
behind it) can be used to prove a kind of "reciprocal" of Theorem 88 for one parameter
exponential families.
Theorem 93 Suppose Θ ⊂ ℜ^1 and 𝒫 ≪ μ, a σ-finite measure on (𝒳, ℬ), and
f_θ(x) = exp( a(θ) + θ T(x) )   a.e. μ ∀θ.
Suppose further that θ_1 < θ_2 belong to Θ. A test of the form
φ(x) = 1 if T(x) < c_1 or T(x) > c_2,  = ν_i if T(x) = c_i (i = 1, 2),  = 0 if c_1 < T(x) < c_2
for c_1 ≤ c_2 maximizes β_φ(θ) ∀ θ < θ_1 and ∀ θ > θ_2 over choices of tests φ′ with
β_{φ′}(θ_1) = β_φ(θ_1) and β_{φ′}(θ_2) = β_φ(θ_2). Further, if β_φ(θ_1) = β_φ(θ_2) = α, then φ is
UMPU level α for testing H_0: θ ∈ [θ_1, θ_2] vs H_1: θ ∉ [θ_1, θ_2].
Theorem 93 deals with an interval null hypothesis. Dealing with a point null hypothesis
turns out to require slightly different arguments.
Theorem 94 Suppose that for @ an open interval in e" , c ¥ ., a 5-finite measure on
Ðk ß U Ñ and
0) ÐBÑ œ expa+Ð)Ñ € )X ÐBÑb a.e. . a).
Suppose that φ is a test of the form given in Theorem 93 and that
β_φ(θ_0) = α   and   β′_φ(θ_0) = 0 .
Then 9 is UMPU size ! for testing H! À ) œ )! vs H" À ) Á )! .
Theorems 93 and 94 are about cases where @ § e" (and, of course, limited to one-parameter
exponential families). It is possible to find some UMPU tests for hypotheses with
"Ð5 • "Ñ-dimensional" boundaries in "5 -dimensional" @ problems using a methodology
exploiting "Neyman structure" in a statistic X . I may write up some notes on this
methodology (and again I may not). It is discussed in Section 4.5 of Schervish. A
general kind of result that comes from the arguments is next.
Theorem 95 Suppose that X = (X_1, X_2, ..., X_k) takes values in ℜ^k and the
distributions P_η have densities with respect to a σ-finite measure μ on ℜ^k of the form
f_η(x) = K(η) exp( Σ_{i=1}^k η_i x_i ) .
Suppose further that the natural parameter space Γ contains an open rectangle in ℜ^k and
let U = (X_2, ..., X_k). Suppose α ∈ (0, 1).
i) For H_0: η_1 ≤ η_1* vs H_1: η_1 > η_1*,
or for H_0: η_1 ≤ η_1* or η_1 ≥ η_1** vs H_1: η_1 ∈ (η_1*, η_1**),
for each u ∃ a conditional on U = u UMP level α test φ′_u(x_1), and the test
φ(x) = φ′_u(x_1)
is UMPU level α.
ii) For H_0: η_1 ∈ [η_1*, η_1**] vs H_1: η_1 ∉ [η_1*, η_1**]
or for H_0: η_1 = η_1* vs H_1: η_1 ≠ η_1*,
for each u ∃ a conditional on U = u UMPU level α test φ′_u(x_1), and the test
φ(x) = φ′_u(x_1)
is UMPU level α.
This theorem says that if one thinks of ( œ Ð(" ß (# ß ÞÞÞß (5 Ñ as containing the parameter of
interest (" and "nuisance parameters" (# ß ÞÞÞß (5 , at least in some exponential families it is
possible to find "optimal" tests about the parameter of interest.
Likelihood Ratio Tests, Related Tests and Confidence Procedures
The various optimality theories of classical testing do not work for all that many
complicated models. As such, if one wants to do testing in a complicated model some
general purpose tools are needed for making tests and confidence procedures. One such
set of tools consists of the "likelihood ratio test," its relatives and related confidence
procedures. The following is an introduction to these methods and a little about their
asymptotics.
For c œ ÖT) ×, H! À ) - @! and H" À ) - @" one might plausibly consider a test criterion
like
LR(x) = ( sup_{θ∈Θ_1} f_θ(x) ) / ( sup_{θ∈Θ_0} f_θ(x) ) .
(Notice, by the way, that Bayes tests for 0-1 loss look at ratios of K averages rather than
sups.)
Or equivalently, one might consider
λ(x) = max( LR(x), 1 ) = ( sup_{θ∈Θ} f_θ(x) ) / ( sup_{θ∈Θ_0} f_θ(x) ) ,
where we will want to reject H! for large -ÐBÑ. This is an intuitively reasonable criterion,
and many common tests (including optimal ones) turn out to be likelihood ratio tests. On
the other hand, it is easy enough to invent problems where LRTs need not be in any sense
optimal.
It is a useful fact that one can find some general results about how to set critical values
for LRTs (at least for iid models and large 8). This is because if there is an MLE, say )^,
sup_{θ∈Θ} f_θ(x) = f_{θ̂}(x)
and we can make use of MLE-like estimation results. For example, in the iid context of
Theorem 80, there is the following.
Theorem 96 Suppose that X_1, X_2, ..., X_n are iid P_θ, Θ ⊂ ℜ^1 and there exists an open
neighborhood of θ_0, say O, such that
i) f_θ(x) > 0 ∀x and ∀θ ∈ O,
ii) ∀x, f_θ(x) is three times differentiable at every point θ ∈ O,
iii) ∃ M(x) ≥ 0 with E_{θ_0} M(X) < ∞ and
|(d³/dθ³) log f_θ(x)| ≤ M(x)   ∀x and ∀θ ∈ O,
iv) 1 = ∫ f_θ(x) dμ(x) can be differentiated twice wrt θ under the integral at θ_0,
i.e. both 0 = ∫ (d/dθ) f_θ(x)|_{θ=θ_0} dμ(x) and 0 = ∫ (d²/dθ²) f_θ(x)|_{θ=θ_0} dμ(x), and
v) I_1(θ) ≡ E_θ ( (d/dθ) log f_θ(X) )² ∈ (0, ∞) ∀θ ∈ O.
If with P_{θ_0} probability approaching 1, {θ̂_n} is a root of the likelihood equation, and θ̂_n →
θ_0 in P_{θ_0} probability, then under P_{θ_0}
2 log λ*_n ≡ 2 log( f^n_{θ̂_n}(X^n) / f^n_{θ_0}(X^n) ) →d χ²_1 .
(The hypotheses here are exactly the hypotheses of Theorem 80.)
The application that people often make of this theorem is that they reject H_0: θ = θ_0 in
favor of H_1: θ ≠ θ_0 when 2 log λ*_n exceeds the (1 − α) quantile of the χ²_1 distribution.
Notice also that upon inverting these tests, the theorem gives a way of making large
sample confidence sets for θ. When θ̂_n is a maximizer of the log likelihood, the set of θ_0
for which H_0: θ = θ_0 will be accepted using the LRT with an (approximate) χ²_1 cut-off
value (say χ²_1) is
{ θ | L_n(θ) ≥ L_n(θ̂_n) − (1/2) χ²_1 } .
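Here is a small sketch of this confidence set recipe under illustrative assumptions (an exponential model with rate θ, n = 50, and 3.841 as the approximate 0.95 quantile of the χ²_1 distribution); it simply scans a grid of θ values and keeps those with log likelihood within (1/2)·3.841 of the maximum.

# A sketch of the large-sample likelihood ratio confidence set
# { theta : L_n(theta) >= L_n(theta_hat) - (1/2) * chi^2_1 } for an illustrative
# exponential (rate theta) model, where theta_hat = 1/xbar and 3.841 is the
# approximate 0.95 quantile of the chi-square_1 distribution.
import numpy as np

rng = np.random.default_rng(3)
theta_true, n = 2.0, 50
x = rng.exponential(scale=1.0 / theta_true, size=n)

def L(theta):                              # log likelihood for the exponential model
    return n * np.log(theta) - theta * np.sum(x)

theta_hat = 1.0 / np.mean(x)
cut = L(theta_hat) - 0.5 * 3.841

grid = np.linspace(0.01, 6.0, 2000)
inside = grid[np.array([L(t) >= cut for t in grid])]
print("theta_hat =", round(theta_hat, 3))
print("approx 95% LRT interval: (%.3f, %.3f)" % (inside.min(), inside.max()))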
There are versions of the ;# limit that cover (composite) null hypotheses about some part
of a multi-dimensional parameter vector. For example, Schervish (page 459) proves a
result like the following.
Result 97 Suppose that Θ ⊂ ℜ^p and appropriate regularity conditions hold for iid
observations X_1, X_2, ..., X_n. If k ≤ p and
λ_n = ( sup_{θ∈Θ} f^n_θ(X^n) ) / ( sup_{θ s.t. θ_i = θ_i^0, i=1,2,...,k} f^n_θ(X^n) ) ,
then under any P_θ with θ_i = θ_i^0, i = 1, 2, ..., k,
2 log λ_n →d χ²_k .
This result (of course) allows one to set approximate critical points for testing
H_0: θ_i = θ_i^0, i = 1, 2, ..., k vs H_1: not H_0, by rejecting H_0 when
2 log λ_n > the (1 − α) χ²_k quantile .
And, one can make large n approximate confidence sets for (θ_1, θ_2, ..., θ_k) of the form
{ (θ_1, ..., θ_k) | sup_{θ_{k+1},...,θ_p} L_n((θ_1, ..., θ_k, θ_{k+1}, ..., θ_p)) ≥ sup_θ L_n(θ) − (1/2) χ²_k }
for χ²_k the (1 − α) quantile of the χ²_k distribution.
There is some standard jargon associated with the function of Ð)" ß ÞÞÞß )5 Ñ appearing in the
above expressions.
Definition 75 The function of (θ_1, ..., θ_k)
L*_n((θ_1, ..., θ_k)) ≡ sup_{θ_{k+1},...,θ_p} L_n((θ_1, ..., θ_k, θ_{k+1}, ..., θ_p))
is called a profile likelihood function for the parameters θ_1, ..., θ_k.
With this jargon/notation the form of the large n confidence set becomes
{ (θ_1, ..., θ_k) | L*_n((θ_1, ..., θ_k)) ≥ sup_{(θ_1,...,θ_k)} L*_n((θ_1, ..., θ_k)) − (1/2) χ²_k } ,
which looks more like the form from the case of a one-dimensional parameter.
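As an illustration of Definition 75, the following sketch (assuming, purely for illustration, a N(μ, σ²) model with σ² as the nuisance parameter) computes the profile log likelihood for μ and the corresponding approximate 95% confidence interval of the form just displayed.

# A sketch of a profile (log) likelihood, for the illustrative N(mu, sigma^2) model
# with sigma^2 treated as a nuisance parameter:  L*_n(mu) = sup_{sigma^2} L_n(mu, sigma^2),
# attained at sigma^2_hat(mu) = (1/n) sum (x_i - mu)^2.  The resulting approximate 95%
# confidence set for mu uses the chi-square_1 cut-off 3.841, as in the display above.
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=10.0, scale=2.0, size=40)
n = len(x)

def profile_loglik(mu):
    s2 = np.mean((x - mu) ** 2)            # maximizing value of sigma^2 for this mu
    return -0.5 * n * np.log(2.0 * np.pi * s2) - 0.5 * n

mu_grid = np.linspace(x.mean() - 3, x.mean() + 3, 1201)
prof = np.array([profile_loglik(m) for m in mu_grid])
inside = mu_grid[prof >= prof.max() - 0.5 * 3.841]
print("mu_hat = xbar =", round(x.mean(), 3))
print("approx 95% profile likelihood interval: (%.3f, %.3f)" % (inside.min(), inside.max()))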
There are alternatives to the LRT of H! À )3 œ )3! 3 œ "ß #ß ÞÞÞß 5 vs H" À not H! . One
such test is the so called "Wald Test." See, for example, pages 115-120 of Silvey. The
following is borrowed pretty much directly from Silvey.
Consider @ § e: and 5 restrictions on )
2" Ð)Ñ œ 2# Ð)Ñ œ â œ 25 Ð)Ñ œ ! ,
(like, for instance, )" • )"! œ )# • )#! œ â œ )5 • )5! œ !ÑÞ Suppose that )^8 is an MLE
of ) and define
2Ð)Ñ » Ð2" Ð)Ñß 2# Ð)Ñß ÞÞÞß 25 Ð)ÑÑ Þ
Then if H! À 2" Ð)Ñ œ 2# Ð)Ñ œ â œ 25 Ð)Ñ œ ! is true, 2Ð )^8 Ñ ought to be near 0. The
question is how to measure "nearness" and to set a critical value in order to have a test
with size approximately !. An approach to doing this is as follows.
We expect (under suitable conditions) that under P_θ the (p-dimensional) estimator θ̂_n has
√n (θ̂_n − θ) →d N_p(0, I_1^{-1}(θ)) .
Then if
H(θ) ≡ ( ∂h_i/∂θ_j |_θ )_{k×p} ,
the delta method suggests that
√n ( h(θ̂_n) − h(θ) ) →d N_k(0, H(θ) I_1^{-1}(θ) H′(θ)) .
Abbreviate H(θ) I_1^{-1}(θ) H′(θ) as B(θ). Since h(θ) = 0 if H_0 is true, the above suggests
that
n h′(θ̂_n) B^{-1}(θ) h(θ̂_n) →d χ²_k under H_0 .
Now n h′(θ̂_n) B^{-1}(θ) h(θ̂_n) cannot serve as a test statistic, since it involves θ, which is
not completely specified by H_0. But it is plausible to consider the statistic
W_n ≡ n h′(θ̂_n) B^{-1}(θ̂_n) h(θ̂_n)
and hope that under suitable conditions
W_n →d χ²_k under H_0 .
If this can be shown to hold up, one may reject for W_n > the (1 − α) quantile of the χ²_k
distribution.
The Wald tests for 23 Ð)Ñ œ )3 • )3! 3 œ "ß #ß ÞÞÞß 5 can be inverted to get approximate
confidence regions for 5 Ÿ : of the parameters )" ß )# ß ÞÞÞß ): . The resulting confidence
regions will, by construction, be ellipsoidal.
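A minimal sketch of the Wald statistic, under illustrative assumptions (X_i iid N(μ, σ²), the single restriction h(θ) = μ − μ_0, and the standard facts that I_1(θ) = diag(1/σ², 1/(2σ⁴)) and H(θ) = (1, 0) for this model):

# A sketch of the Wald statistic W_n = n h'(theta_hat) B^{-1}(theta_hat) h(theta_hat)
# for an illustrative case:  X_i iid N(mu, sigma^2), theta = (mu, sigma^2), and the
# single restriction h(theta) = mu - mu_0 = 0.  Here I_1(theta) = diag(1/sigma^2,
# 1/(2 sigma^4)) and H(theta) = (1, 0), standard facts for this model.
import numpy as np

rng = np.random.default_rng(5)
mu0 = 0.0
x = rng.normal(loc=0.3, scale=1.5, size=200)
n = len(x)

mu_hat = x.mean()
s2_hat = np.mean((x - mu_hat) ** 2)        # MLE of sigma^2

h = np.array([mu_hat - mu0])               # h(theta_hat), a 1-vector
H = np.array([[1.0, 0.0]])                 # k x p matrix of partials of h
I1_inv = np.diag([s2_hat, 2.0 * s2_hat ** 2])   # I_1^{-1}(theta_hat)
B = H @ I1_inv @ H.T                        # B(theta_hat) = H I_1^{-1} H'

W = float(n * h @ np.linalg.inv(B) @ h)
print("W_n =", round(W, 3), " reject at level .05 if W_n > 3.841 (chi-square_1 quantile)")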
There is a final alternative to the LRT and Wald tests that deserves mention. That is the
";# test." See again Silvey. The notion here is that on occasion it can be easier to
maximize P8 Ð)Ñ subject to 2Ð)Ñ œ ! (again taking this to be a statement of 5 constraints
on a :-dimensional parameter vector )) than to simply maximize P8 Ð)Ñ. Let )˜ 8 be a
"restricted" MLE (i.e. a maximizer of P8 Ð)Ñ subject to 2Ð)Ñ œ !). One might expect that
if H! À 2Ð)Ñ œ ! is true, then )˜ 8 ought to be nearly an unrestricted maximizer of P8 Ð)Ñ
and the partial derivatives of P8 Ð)Ñ should be nearly 0 at )˜ 8 . On the other hand, if H! is
not true, there is little reason to expect the partials to be nearly 0. So (again/still with
L_n(θ) = Σ_{i=1}^n log f_θ(X_i)) one might consider the statistic
X² ≡ (1/n) ( ∂L_n(θ)/∂θ_i |_{θ=θ̃_n} )′ I_1^{-1}(θ̃_n) ( ∂L_n(θ)/∂θ_i |_{θ=θ̃_n} ) .
Then, for
λ_n = ( sup_{θ∈Θ} f^n_θ(X^n) ) / ( sup_{θ s.t. h(θ)=0} f^n_θ(X^n) ) ,
it is typically the case that under H_0, X² differs from 2 log λ_n by a quantity tending to 0 in
P_θ probability. That is, this statistic can be calibrated using χ² critical points and can
form the basis of a test of H_0 asymptotically equivalent to a "Result 97"-type likelihood
ratio test.
Once more (as for the Wald tests) for 23 Ð)Ñ œ )3 • )3! 3 œ "ß #ß ÞÞÞß 5 , one can invert these
tests to get confidence regions for 5 Ÿ : of the parameters )" ß )# ß ÞÞÞß ): .
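For comparison with the Wald sketch above, here is the same illustrative normal example worked with the "χ²"/score statistic; only the restricted MLE (μ_0, σ̃²) is needed.

# A sketch of the "chi-square"/score statistic
#   X^2 = (1/n) * (grad L_n(theta~))' I_1^{-1}(theta~) (grad L_n(theta~))
# for the same illustrative normal example used in the Wald sketch:  X_i iid
# N(mu, sigma^2) and restriction h(theta) = mu - mu_0 = 0, so the restricted MLE is
# theta~ = (mu_0, sigma~^2) with sigma~^2 = (1/n) sum (x_i - mu_0)^2.
import numpy as np

rng = np.random.default_rng(5)
mu0 = 0.0
x = rng.normal(loc=0.3, scale=1.5, size=200)
n = len(x)

s2_tilde = np.mean((x - mu0) ** 2)         # restricted MLE of sigma^2

# score vector (partials of L_n) evaluated at the restricted MLE
score = np.array([
    n * (x.mean() - mu0) / s2_tilde,                                    # d L_n / d mu
    -n / (2 * s2_tilde) + np.sum((x - mu0) ** 2) / (2 * s2_tilde ** 2)  # d L_n / d sigma^2 (= 0 here)
])
I1_inv = np.diag([s2_tilde, 2.0 * s2_tilde ** 2])

X2 = float(score @ I1_inv @ score) / n
print("X^2 =", round(X2, 3), " compare to the chi-square_1 quantile 3.841")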
As a final note on this topic, observe that
i) an LRT of the style in Result 97 requires computation of an MLE, )^8 , and a
restricted MLE, )˜ 8 ,
ii) a Wald test requires only computation of the MLE, )^8 , and
iii) a ";# " test requires only computation of the restricted MLE, )˜ 8 .
Some Simple Theory of Markov Chain Monte Carlo
It is sometimes the case that one has a "known" but unwieldy (joint) distribution for a
large dimensional random vector ] , and would like to know something about the
distribution (like, for example, the marginal distribution of ]" ). While in theory it might
be straightforward to compute the quantity of interest, in practice the necessary
computation can be impossible. With the advent of cheap computing, given appropriate
theory the possibility exists of simulating a sample of 8 realizations of ] and computing
an empirical version of the quantity of interest to approximate the (intractable) theoretical
one.
This possibility is especially important to Bayesians, where \l) µ T) and ) µ K produce
a joint distribution for Ð\ß )Ñ and then (in theory) a posterior distribution KÐ)l\Ñß the
object of primary Bayesian interest. Now Ðin the usual kind of notationÑ
1Ð)l\ œ BÑ º 0) ÐBÑ1Ð)Ñ
so that in this sense the posterior of the possibly high dimensional ) is "known." But it
may well be analytically intractable, with (for example) no analytical way to compute
E)" . Simulation is a way of getting (approximate) properties of an intractable/high
dimensional posterior distribution.
"Straightforward" simulation from an arbitrary multivariate distribution is, however,
typically not (straightforward). Markov Chain Monte Carlo methods have recently
become very popular as a solution to this problem. The following is some simple theory
covering the case where the (joint) distribution of ] is discrete.
Definition 76 A (discrete time/discrete state space) Markov Chain is a sequence of
random quantities Ö\5 ×, each taking values in a (finite or) countable set k , with the
property that
T Ò\8 œ B8 l\" œ B" ß ÞÞÞß \8•" œ B8•" Ó œ T Ò\8 œ B8 l\8•" œ B8•" Ó .
Definition 77 A Markov Chain is stationary provided T Ò\8 œ Bl\8•" œ Bw Ó is
independent of 8.
WOLOG we will henceforth name the elements of k with the integers "ß #ß $ß ÞÞÞ and call
them "states."
Definition 78 With :34 » T Ò\8 œ 4l\8•" œ 3Ó, the square matrix T » a:34 b is called
the transition matrix for a stationary Markov Chain and the :34 are called transition
probabilities.
Note that a transition matrix has nonnegative entries and row sums of 1. Such matrices
are often called "stochastic" matrices. As a matter of further notation for a stationary
Markov Chain, let
p_ij^k = P[X_{n+k} = j | X_n = i]
and
f_ij^k = P[X_{n+k} = j, X_{n+k−1} ≠ j, ..., X_{n+1} ≠ j | X_n = i] .
(These are respectively the probabilities of moving from i to j in k steps and of first moving
from i to j in k steps.)
Definition 79 We say that a MC is irreducible if for each i and j ∃ k (possibly depending
upon i and j) such that p_ij^k > 0.
(A chain is irreducible if it is possible to eventually get from any state i to any other state
j.)
Definition 80 We say that the ith state of a MC is transient if Σ_{k=1}^∞ f_ii^k < 1 and say that the
state is persistent if Σ_{k=1}^∞ f_ii^k = 1. A chain is called persistent if all of its states are
persistent.
Definition 81 We say that state i of a MC has period t if p_ii^k = 0 unless k = νt (k is a
multiple of t) and t is the largest integer with this property. The state is aperiodic if no
such t > 1 exists. And a MC is called aperiodic if all of its states are aperiodic.
Many sources (including Chapter 15 of the 3rd Edition of Feller Volume 1) present a
number of useful simple results about MC's. Among them are the following.
Theorem 98 All states of an irreducible MC are of the same type (with regard to
persistence and periodicity).
Theorem 99 A finite state space irreducible MC is persistent.
Theorem 100 Suppose that a MC is irreducible, aperiodic and persistent. Suppose
further that for each state i the mean recurrence time is finite, i.e.
Σ_{k=1}^∞ k f_ii^k < ∞ .
Then an invariant/stationary distribution for the MC exists, i.e. ∃ {u_j} with u_j > 0 and
Σ u_j = 1 such that
u_j = Σ_i u_i p_ij .
(If the chain is started with distribution {u_j}, after one transition it is in states 1, 2, 3, ...
with probabilities {u_j}.)
Further, this distribution {u_j} satisfies
u_j = lim_{k→∞} p_ij^k   ∀i ,
and
u_j = 1 / Σ_{k=1}^∞ k f_jj^k .
There is a converse of this theorem.
Theorem 101 An irreducible, aperiodic MC for which ∃ {u_j} with u_j > 0 and Σ u_j = 1
such that u_j = Σ_i u_i p_ij must be persistent with
u_j = 1 / Σ_{k=1}^∞ k f_jj^k .
With this background, the basic idea of MCMC is the following. If we wish to simulate
from a distribution Ö?4 ×, we find a convenient MC Ö\5 × whose invariant distribution is
Ö?4 ×. From a starting state \! œ 3, one uses T to simulate \" Þ Using the realization
\" œ B" and T , one simulates \# , etc. If the conclusions of Theorem 100 are in force,
after an initial "burn in" of 7 periods
T Ò\7 œ 4l\! œ 3Ó ¸ ?4
independent of 3 and \7 is approximately a realization from Ö?4 ×. Two important
questions about this plan (only one of which we'll address) are:
1) How does one set up an appropriate/convenient T ?
2) How big should 7 be? (We'll not touch this matter.)
In answer to question 1), there are presumably a multitude of chains that would do the
job. But there is the following useful sufficient condition (that has application in the
original motivating problem of simulating from high dimensional distributions) for a
chain to have Ö?4 × for an invariant distribution.
Lemma 102 If {X_k} is a MC with transition probabilities satisfying
u_i p_ij = u_j p_ji ,   (æ)
then it has invariant distribution {u_j}.
Note then that if a candidate T satisfies æ and is irreducible and aperiodic, Theorem 101
shows that it is persistent and Theorem 100 then shows that any arbitrary starting value
can be used and yields approximate realizations from Ö?4 ×.
Several useful MCMC schemes can be shown to have the "right" invariant distributions
by noting that they satisfy æ.
For example, Lemma 102 can be applied to the "Metropolis-Hastings Algorithm." That
is, let T = (t_ij) be any stochastic matrix corresponding to an irreducible aperiodic MC.
(Presumably, some choices will be better than others in terms of providing quick
convergence to a target invariant distribution. But that issue is beyond the scope of the
present exposition.) (Note that in a finite case, one can take t_ij = 1/(the number of
states).) The Metropolis-Hastings algorithm simulates as follows:
• Supposing that X_{n−1} = i, generate J (at random) according to the distribution
over the state space specified by row i of T, that is according to {t_ij}.
• Then generate X_n based on i and (the randomly generated) J according to
X_n = J with probability min( 1, (u_J t_Ji)/(u_i t_iJ) ), and X_n = i with probability max( 0, 1 − (u_J t_Ji)/(u_i t_iJ) ) .   (ææ)
Note that for {X_k} so generated, for j ≠ i
p_ij = P[X_n = j | X_{n−1} = i] = min( 1, (u_j t_ji)/(u_i t_ij) ) · t_ij
and
p_ii = P[X_n = i | X_{n−1} = i] = t_ii + Σ_{j≠i} max( 0, 1 − (u_j t_ji)/(u_i t_ij) ) · t_ij .
So, for i ≠ j
u_i p_ij = min( u_i t_ij, u_j t_ji ) = u_j p_ji .
That is, æ holds and the MC Ö\5 × has stationary distribution Ö?4 ×. (Further, the
assumption that X corresponds to an irreducible aperiodic chain implies that Ö\5 × is
irreducible and aperiodic.)
Notice, by the way, that in order to use the Metropolis-Hastings Algorithm one only has
to have the ?4 's up to a multiplicative constant. That is, if we have @4 º ?4 , we may
compute the ratios ?4 Î?3 as @4 Î@3 and are in business as far as simulation goes. (This fact
is especially attractive in Bayesian contexts where it can be convenient to know a
posterior only up to a multiplicative constant.) Notice also that if X is symmetric, (i.e.
>34 œ >43 ), step ææ of the Metropolis algorithm becomes
X_n = J with probability min( 1, u_J/u_i ), and X_n = i with probability max( 0, 1 − u_J/u_i ) .
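Here is a minimal sketch of the Metropolis-Hastings algorithm on a small finite state space, using a symmetric (uniform) proposal T and a target {u_j} specified only up to a multiplicative constant (the particular target and chain length are illustrative choices, not from the notes).

# A sketch of the Metropolis-Hastings algorithm on a small finite state space.
# The target {u_j} is specified only up to a multiplicative constant (as the notes
# point out is all that is needed), and T is the symmetric "uniform over all states"
# proposal, so the acceptance probability reduces to min(1, u_J / u_i).
import numpy as np

rng = np.random.default_rng(6)
v = np.array([1.0, 2.0, 3.0, 4.0])         # v_j proportional to the target u_j
m = len(v)

state = 0
counts = np.zeros(m)
for _ in range(200_000):
    j = rng.integers(m)                    # proposal J from row i of T (uniform here)
    if rng.random() < min(1.0, v[j] / v[state]):
        state = j                          # accept: move to J
    counts[state] += 1                     # otherwise stay at i

print("empirical frequencies:", np.round(counts / counts.sum(), 3))
print("target u_j           :", np.round(v / v.sum(), 3))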
An algorithm similar to the Metropolis-Hastings algorithm is the "Barker Algorithm."
The Barker algorithm modifies the above by replacing
min( 1, (u_J t_Ji)/(u_i t_iJ) )   with   (u_J t_Ji)/(u_i t_iJ + u_J t_Ji)
in ææ. Note that for this modification of the Metropolis algorithm, for j ≠ i
p_ij = ( u_j t_ji / (u_i t_ij + u_j t_ji) ) · t_ij ,
so
u_i p_ij = (u_i t_ij)(u_j t_ji) / (u_i t_ij + u_j t_ji) = u_j p_ji .
That is, æ holds and thus Lemma 102 guarantees that Ö\5 × has invariant distribution
Ö?4 ×Þ (And X irreducible and aperiodic continues to imply that Ö\5 × is also.)
Note also that since
u_j t_ji / (u_i t_ij + u_j t_ji) = (u_j/u_i) t_ji / ( t_ij + (u_j/u_i) t_ji ) ,
once again it suffices to know the u_j up to a multiplicative constant in order to be able to
implement Barker's algorithm. (This is again of comfort to Bayesians.) Finally, note that
if T is symmetric, step ææ of the Barker algorithm becomes
X_n = J with probability u_J/(u_i + u_J), and X_n = i with probability u_i/(u_i + u_J) .
To finally get to the motivating application of this whole business, consider now the
"Gibbs Sampler," or as Schervish correctly argues that it should be termed, "Successive
Substitution Sampling," for generating an observation from a high dimensional
distribution. For sake of concreteness, consider the situation where the distribution of a
discrete 3-dimensional random vector (Y, V, W) with probability mass function
f(y, v, w) is at issue. In the general MC set-up, a possible outcome (y, v, w) represents a
"state" and the distribution f(y, v, w) is the "{u_j}" from which one wants to simulate.
One defines a MC {X_k} as follows. For an arbitrary starting state (y_0, v_0, w_0):
• Generate X_1 = (Y_1, v_0, w_0) by generating Y_1 from the conditional distribution of
Y | V = v_0 and W = w_0, i.e. from the (conditional) distribution with probability
function f_{Y|V,W}(y | v_0, w_0) ≡ f(y, v_0, w_0) / Σ_l f(l, v_0, w_0). Let y_1 be the realized value of Y_1.
• Generate X_2 = (y_1, V_1, w_0) by generating V_1 from the conditional distribution of
V | Y = y_1 and W = w_0, i.e. from the (conditional) distribution with probability
function f_{V|Y,W}(v | y_1, w_0) ≡ f(y_1, v, w_0) / Σ_l f(y_1, l, w_0). Let v_1 be the realized value of V_1.
• Generate X_3 = (y_1, v_1, W_1) by generating W_1 from the conditional distribution
of W | Y = y_1 and V = v_1, i.e. from the (conditional) distribution with probability
function f_{W|Y,V}(w | y_1, v_1) ≡ f(y_1, v_1, w) / Σ_l f(y_1, v_1, l). Let w_1 be the realized value of W_1.
• Generate X_4 = (Y_2, v_1, w_1) by generating Y_2 from the conditional distribution of
Y | V = v_1 and W = w_1, i.e. from the (conditional) distribution with probability
function f_{Y|V,W}(y | v_1, w_1) ≡ f(y, v_1, w_1) / Σ_l f(l, v_1, w_1). Let y_2 be the realized value of Y_2.
And so on, for some appropriate number of cycles.
Note that with this algorithm, a typical transition probability (for a step where a Y
realization is going to be generated) is
P[X_n = (y′, v, w) | X_{n−1} = (y, v, w)] = f(y′, v, w) / Σ_l f(l, v, w) ,
so if X_{n−1} has distribution f, the probability that X_n = (y, v, w) is
Σ_{l′} f(l′, v, w) · ( f(y, v, w) / Σ_l f(l, v, w) ) = f(y, v, w) ,
that is, X_n also has distribution f. And analogous results hold for transitions of all 3
types (Y, V and W realizations).
But it is not absolutely obvious how to apply the results 100 and 101 here, since the chain
is not stationary. The transition mechanism is not the same for a "Y" transition as it is
for a "V" transition. The key to extricating oneself from this problem is to note that if
P_Y, P_V and P_W are respectively the transition matrices for Y, V and W exchanges, then
P = P_Y P_V P_W
describes an entire cycle of the SSS algorithm. Then {X′_k} defined by X′_n = X_{3n} is a
stationary Markov Chain with transition matrix P. The fact that P_Y, P_V and P_W all leave
the distribution f invariant then implies that P also leaves f invariant. So one is in the
position to apply Theorems 101 and 100. If P is irreducible and aperiodic, Theorem 101
says that the chain {X′_k} is persistent and then Theorem 100 says that f can be simulated
using an arbitrary starting state.
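The following sketch implements one version of the SSS/Gibbs scheme just described for an illustrative discrete joint pmf f on {0,1}³, cycling through the full conditionals for Y, V and W and comparing the empirical marginal of Y to the true one.

# A sketch of "Successive Substitution Sampling" (the Gibbs sampler) for a discrete
# 3-dimensional distribution.  f is an arbitrary illustrative pmf on {0,1}^3; each
# cycle draws Y, then V, then W from the full conditionals built from f, exactly as
# in the bulleted steps above.
import numpy as np

rng = np.random.default_rng(7)
f = rng.random((2, 2, 2))
f = f / f.sum()                            # an arbitrary joint pmf for (Y, V, W)

def draw(weights):
    p = weights / weights.sum()            # normalize a slice of f to a conditional pmf
    return rng.choice(len(p), p=p)

y, v, w = 0, 0, 0
y_counts = np.zeros(2)
burn_in, n_cycles = 1000, 50_000
for t in range(burn_in + n_cycles):
    y = draw(f[:, v, w])                   # Y | V = v, W = w
    v = draw(f[y, :, w])                   # V | Y = y, W = w
    w = draw(f[y, v, :])                   # W | Y = y, V = v
    if t >= burn_in:
        y_counts[y] += 1

print("empirical P(Y = 1):", round(y_counts[1] / y_counts.sum(), 3))
print("true      P(Y = 1):", round(f[1].sum(), 3))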
A (Very) Little Bit About Bootstrapping
Suppose that the entries of X = (X_1, X_2, ..., X_n) are iid according to some unknown
(possibly multivariate) distribution F, and for a random function
R(X, F)
there is some aspect of the distribution of R(X, F) that is of statistical interest. For
example, in a univariate context, we might be interested in the (F) probability that
R(X, F) = (X̄ − μ_F) / (S/√n)
has absolute value less than 2. Often, even if one knew F, the problem of finding
properties of the distribution of R(X, F) would be analytically intractable. (Of course,
knowing F, simulation might provide at least some approximation for the property of
interest ...)
As a means of possibly approximating the distribution of R(X, F), Efron suggested the
following. Suppose that F̂ is an estimator of F. In parametric contexts, where θ̂(X) is
an estimator of θ, this could be F̂ = F_{θ̂}. In nonparametric contexts, if F̂_n is the empirical
distribution of the sample X, we might take F̂ = F̂_n. Suppose then that
X* = (X_1*, ..., X_n*) has entries that are an iid sample from F̂. One might hope that (at
least in some circumstances)
the F distribution of R(X, F) ≈ the F̂ distribution of R(X*, F̂) .
(This will not always be a reasonable approximation.) Now even the right side of this
expression will typically be analytically intractable. What, however, can be done is to now
simulate from this distribution. (Where one cannot simulate from the distribution on the
left because F is not known, simulation from the distribution on the right is possible.)
That is, suppose that X*¹, X*², ..., X*ᴮ are B (bootstrap) iid samples of size n from F̂.
Using these, one may compute
R_i* = R(X*ⁱ, F̂)
and use the empirical distribution of {R_i* | i = 1, ..., B} to approximate the F̂ distribution
of R(X*, F̂).