Stat 643 Results

Sufficiency (and Ancillarity and Completeness)

Roughly, $T(X)$ is sufficient for $\{P_\theta\}_{\theta\in\Theta}$ (or for $\theta$) provided the conditional distribution of $X$ given $T(X)$ is the same for each $\theta\in\Theta$.

Example  Let $X=(X_1,X_2,\ldots,X_n)$, where the $X_i$ are iid Bernoulli$(p)$ and $T(X)=\sum_{i=1}^n X_i$. You may show that $P_p[X=x\mid T=t]$ is the same for each $p$.

In some sense, a sufficient statistic carries all the information in $X$. (A variable with the same distribution as $X$ can be constructed from $T(X)$ and some randomization whose form does not depend upon $\theta$.) The \$64 question is how to recognize one. The primary tool is the "Factorization Theorem" in its various incarnations. This says roughly that $T$ is sufficient iff "$f_\theta(x)=g_\theta(T(x))h(x)$".

Result 1 (Page 20 Lehmann TSH)  Suppose that $X$ is discrete. Then $T(X)$ is sufficient for $\theta$ iff $\exists$ nonnegative functions $g_\theta(\cdot)$ and $h(\cdot)$ such that $P_\theta[X=x]=g_\theta(T(x))h(x)$. (You can and should prove this yourself.)

The Stat 643 version of Result 1 is considerably more general and relies on measure-theoretic machinery. Basically, one can "easily" push Result 1 as far as "families of distributions dominated by a $\sigma$-finite measure." The following standard notation will be used throughout these notes:

$(\mathcal{X},\mathcal{B},P_\theta)$  a probability space (for fixed $\theta$)
$\mathcal{P}=\{P_\theta\}_{\theta\in\Theta}$
$E_\theta(g(X))=\int g\,dP_\theta$
$T:(\mathcal{X},\mathcal{B})\to(\mathcal{T},\mathcal{F})$  a statistic
$\mathcal{B}_0=\mathcal{B}(T)=\{T^{-1}(F)\}_{F\in\mathcal{F}}$  the $\sigma$-algebra generated by $T$

Definition 1  $T$ (or $\mathcal{B}_0$) is sufficient for $\mathcal{P}$ if for every $B\in\mathcal{B}$ $\exists$ a $\mathcal{B}_0$-measurable function $Y$ such that for all $\theta\in\Theta$
\[ Y = E_\theta(I_B\mid\mathcal{B}_0) \ \text{a.s. } P_\theta . \]
(Note that $E_\theta(I_B\mid\mathcal{B}_0)=P_\theta(B\mid\mathcal{B}_0)$. This definition says that there is a single random variable that will serve as a conditional probability of $B$ given $\mathcal{B}_0$ for all $\theta$ in the parameter space.)

Definition 2  $\mathcal{P}$ is dominated by a $\sigma$-finite measure $\mu$ if $P_\theta\ll\mu$ $\forall\theta\in\Theta$.

Then, assuming that $\mathcal{P}$ is dominated by $\mu$, standard notation is to let $f_\theta=\frac{dP_\theta}{d\mu}$, and it is these objects (R-N derivatives) that factor nicely iff $T$ is sufficient. To prove this involves some nontrivial technicalities.

Definition 3  $\mathcal{P}$ has a countable equivalent subset if $\exists$ a sequence of positive constants $\{c_i\}_{i=1}^\infty$ with $\sum c_i=1$ and a corresponding sequence $\{\theta_i\}_{i=1}^\infty$ such that $\lambda=\sum c_i P_{\theta_i}$ is equivalent to $\mathcal{P}$ in the sense that $\lambda(B)=0$ iff $P_\theta(B)=0$ $\forall\theta$.

Notation: Assuming that $\mathcal{P}$ has a countable equivalent subset, let $\lambda$ be a choice of $\sum c_i P_{\theta_i}$ as in Definition 3.

Theorem 2 (Page 575 Lehmann TSH)  $\mathcal{P}$ is dominated by a $\sigma$-finite measure $\mu$ iff it has a countable equivalent subset.

Theorem 3 (Page 54 Lehmann TSH)  Suppose that $\mathcal{P}$ is dominated by a $\sigma$-finite measure and $T:(\mathcal{X},\mathcal{B})\to(\mathcal{T},\mathcal{F})$. Then $T$ is sufficient for $\mathcal{P}$ iff $\exists$ nonnegative $\mathcal{F}$-measurable functions $g_\theta$ such that $\forall\theta$
\[ \frac{dP_\theta}{d\lambda}(x)=g_\theta(T(x)) \ \text{a.s. } \lambda . \]
(This last equality is equivalent to $P_\theta(B)=\int_B g_\theta(T(x))\,d\lambda(x)$ $\forall B\in\mathcal{B}$.)

Theorem 4 (Page 55 Lehmann TSH) ("The" Factorization Theorem)  Suppose that $\mathcal{P}\ll\mu$ where $\mu$ is $\sigma$-finite. Then $T$ is sufficient for $\mathcal{P}$ iff $\exists$ a nonnegative $\mathcal{B}$-measurable function $h$ and nonnegative $\mathcal{F}$-measurable functions $g_\theta$ such that $\forall\theta$
\[ \frac{dP_\theta}{d\mu}(x)=f_\theta(x)=g_\theta(T(x))h(x) \ \text{a.s. } \mu . \]

Standard/elementary applications of Theorem 4 are to cases where:
1) $\mu$ is Lebesgue measure on $\mathbb{R}^d$, $\mathcal{P}$ is some parametric family of $d$-dimensional distributions, and the $f_\theta$ are $d$-dimensional probability densities, or
2) $\mu$ is counting measure on some countable set, $\mathcal{P}$ is some parametric family of discrete distributions and the $f_\theta$ are probability mass functions.
But it can also be applied much more widely (e.g., to cases where $\mu$ is the sum of $d$-dimensional Lebesgue measure and counting measure on some countable subset of $\mathbb{R}^d$).
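To make the Bernoulli example above concrete, here is a minimal numerical sketch (not part of the original notes; the function name and the small values of $n$ and $p$ are just illustrative) verifying that the conditional distribution of $X$ given $T$ is free of $p$:

```python
from itertools import product
from math import comb

def cond_prob_given_T(x, p):
    """P_p[X = x | T = sum(x)] for iid Bernoulli(p) coordinates."""
    t, n = sum(x), len(x)
    joint = p ** t * (1 - p) ** (n - t)                      # P_p[X = x]
    marginal = comb(n, t) * p ** t * (1 - p) ** (n - t)      # P_p[T = t]
    return joint / marginal                                  # equals 1 / C(n, t), free of p

for x in product([0, 1], repeat=4):
    vals = {round(cond_prob_given_T(x, p), 10) for p in (0.2, 0.5, 0.9)}
    assert len(vals) == 1   # the same conditional probability for every p
print("conditional law of X given T does not depend on p")
```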
Example  Theorem 4 can be used to show that for $\mathcal{X}=\mathbb{R}^d$, $X=(X_1,X_2,\ldots,X_d)$, $\mathcal{B}=\mathcal{B}^d$ the $d$-dimensional Borel $\sigma$-algebra, $\pi=(\pi_1,\pi_2,\ldots,\pi_d)$ a generic permutation of the first $d$ positive integers, $\pi(X)=(X_{\pi_1},X_{\pi_2},\ldots,X_{\pi_d})$, $\mu$ the $d$-dimensional Lebesgue measure and $\mathcal{P}=\{P\mid P\ll\mu$ and $f=\frac{dP}{d\mu}$ satisfies $f(\pi(x))=f(x)$ a.s. $\mu$ for all permutations $\pi\}$, the order statistic $T(X)=(X_{(1)},X_{(2)},\ldots,X_{(d)})$ is sufficient.

Actually, it is an important fact (that does not follow from the factorization theorem) that one can establish the sufficiency of the order statistic even more broadly. With $\mathcal{P}=\{P$ on $\mathbb{R}^d$ such that $P(B)=P(\pi(B))$ for all permutations $\pi\}$, direct argument shows that the order statistic is sufficient. (You should try to prove this. For $B\in\mathcal{B}$, try $Y(x)$ equal to the fraction of all permutations that place $x$ in $B$.)

(Obvious) Result 5  If $T(X)$ is sufficient for $\mathcal{P}$ and $\mathcal{P}_0$ is a subset of $\mathcal{P}$, then $T(X)$ is sufficient for $\mathcal{P}_0$.

A natural question is "How far can one go in the direction of 'reducing' $X$ without throwing something away?" The notion of minimal sufficiency is aimed in the direction of answering this question.

Definition 4  A sufficient statistic $T:(\mathcal{X},\mathcal{B})\to(\mathcal{T},\mathcal{F})$ is minimal sufficient (for $\theta$ or for $\mathcal{P}$) provided for every sufficient statistic $S:(\mathcal{X},\mathcal{B})\to(\mathcal{S},\mathcal{G})$ $\exists$ a (?measurable?) function $u:\mathcal{S}\to\mathcal{T}$ such that $T=u\circ S$ a.e. $\mathcal{P}$. (It is probably equivalent or close to equivalent that $\mathcal{B}(T)\subset\mathcal{B}(S)$.)

There are a couple of ways of identifying a minimal sufficient statistic.

Theorem 6 (Like page 92 of Schervish ... other development on pages 41-43 of Lehmann TPE)  Suppose that $\mathcal{P}$ is dominated by a $\sigma$-finite measure $\mu$ and $T:(\mathcal{X},\mathcal{B})\to(\mathcal{T},\mathcal{F})$ is sufficient for $\mathcal{P}$. Suppose that there exist versions of the R-N derivatives $\frac{dP_\theta}{d\mu}(x)=f_\theta(x)$ such that

(the existence of $k(x,y)>0$ such that $f_\theta(y)=f_\theta(x)k(x,y)$ $\forall\theta$) implies $T(x)=T(y)$.

Then $T$ is minimal sufficient.

(A trivial "generalization" of Theorem 6 is to suppose that the condition only holds for a set $\mathcal{X}^*\subset\mathcal{X}$ with $P_\theta$ probability 1 for all $\theta$. The condition in Theorem 6 is essentially that $T(x)=T(y)$ iff $f_\theta(y)/f_\theta(x)$ is free of $\theta$.) Note that sufficiency of $T(X)$ means that $T(x)=T(y)$ implies (the existence of $k(x,y)>0$ such that $f_\theta(y)=f_\theta(x)k(x,y)$ $\forall\theta$). So the condition in Theorem 6 could be phrased as an iff condition.

Example  Let $X=(X_1,X_2,\ldots,X_n)$, where the $X_i$ are iid Bernoulli$(p)$ and $T(X)=\sum_{i=1}^n X_i$. You may use Theorem 6 to show that $T$ is minimal sufficient.

Lehmann emphasizes a second method of establishing minimality based on the following two theorems.

Theorem 7 (Page 41 of Lehmann TPE)  Suppose that $\mathcal{P}=\{P_i\}_{i=0}^k$ is finite and let $\{f_i\}$ be the R-N derivatives with respect to some dominating $\sigma$-finite measure (like, e.g., $\sum_{i=0}^k P_i$). Suppose that all $k+1$ derivatives $f_i$ are positive on all of $\mathcal{X}$. Then
\[ T(x)=\left(\frac{f_1(x)}{f_0(x)},\frac{f_2(x)}{f_0(x)},\ldots,\frac{f_k(x)}{f_0(x)}\right) \]
is minimal sufficient. (Again it is a trivial "generalization" to observe that the minimality also holds if there is a set $\mathcal{X}^*\subset\mathcal{X}$ with $P_\theta$ probability 1 for all $\theta$ where the $k+1$ derivatives $f_i$ are positive on all of $\mathcal{X}^*$.)

Theorem 8 (Page 42 of Lehmann TPE)  Suppose that $\mathcal{P}$ is a family of mutually absolutely continuous distributions and that $\mathcal{P}_0\subset\mathcal{P}$. If $T$ is sufficient for $\mathcal{P}$ and minimal sufficient for $\mathcal{P}_0$, then it is minimal sufficient for $\mathcal{P}$.
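Returning to the Bernoulli example, the Theorem 6 criterion can be checked directly (a standard computation, spelled out here for reference):
\[
\frac{f_p(y)}{f_p(x)}=\frac{p^{\sum_i y_i}(1-p)^{\,n-\sum_i y_i}}{p^{\sum_i x_i}(1-p)^{\,n-\sum_i x_i}}=\left(\frac{p}{1-p}\right)^{\sum_i y_i-\sum_i x_i},
\]
which is free of $p\in(0,1)$ exactly when $\sum_i y_i=\sum_i x_i$, i.e. exactly when $T(x)=T(y)$. So $T(X)=\sum_i X_i$ is minimal sufficient.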
The way that Theorems 7 and 8 are used together is to look at a few $\theta$'s, invoke Theorem 7 to establish minimal sufficiency (for the finite family) of a vector made up of a few likelihood ratios, and then use Theorem 8 to infer minimality for the whole family.

Example  Let $X=(X_1,X_2,\ldots,X_n)$, where the $X_i$ are iid Beta$(\alpha,\beta)$. Using $P_0$ the Beta$(1,1)$ distribution, $P_1$ the Beta$(2,1)$ distribution and $P_2$ the Beta$(1,2)$ distribution, Theorems 7 and 8 can be used to show that $\left(\prod_{i=1}^n X_i,\ \prod_{i=1}^n(1-X_i)\right)$ is minimal sufficient for the Beta family.

Definition 5  A statistic $T:(\mathcal{X},\mathcal{B})\to(\mathcal{T},\mathcal{F})$ is said to be ancillary for $\mathcal{P}=\{P_\theta\}_{\theta\in\Theta}$ if the distribution of $T$ (i.e., the measure $P_\theta^T$ on $(\mathcal{T},\mathcal{F})$ defined by $P_\theta^T(F)=P_\theta(T^{-1}(F))$) does not depend upon $\theta$ (i.e. is the same for all $\theta$).

Definition 6  A statistic $T:(\mathcal{X},\mathcal{B})\to(\mathbb{R}^1,\mathcal{B}^1)$ is said to be first order ancillary for $\mathcal{P}=\{P_\theta\}_{\theta\in\Theta}$ if the mean of $T$ (i.e., $E_\theta(T(X))=\int T\,dP_\theta$) does not depend upon $\theta$ (i.e. is the same for all $\theta$).

Example  Let $X=(X_1,X_2,\ldots,X_n)$, where the $X_i$ are iid N$(\theta,1)$. The sample variance of the $X_i$'s is ancillary.

Lehmann argues that, roughly speaking, models with nice (low dimensional) minimal sufficient statistics $T$ are ones where no nontrivial function of $T$ is ancillary (or first order ancillary). This line of thinking brings up the notion of completeness.

Notation: Continuing to let $T:(\mathcal{X},\mathcal{B})\to(\mathcal{T},\mathcal{F})$ and $P_\theta^T$ be the measure on $(\mathcal{T},\mathcal{F})$ defined by $P_\theta^T(F)=P_\theta(T^{-1}(F))$, let $\mathcal{P}^T=\{P_\theta^T\}_{\theta\in\Theta}$.

Definition 7a  $\mathcal{P}^T$ or $T$ is complete (boundedly complete) for $\mathcal{P}$ or $\theta$ if
i) $U:(\mathcal{X},\mathcal{B}_0)\to(\mathbb{R}^1,\mathcal{B}^1)$ $\mathcal{B}_0$-measurable (and bounded), and
ii) $E_\theta U(X)=0$ $\forall\theta$
imply that $U=0$ a.s. $P_\theta$ $\forall\theta$.

By virtue of Lehmann's Theorem (see Cressie's Probability Summary) this is equivalent to the next definition.

Definition 7b  $\mathcal{P}^T$ or $T$ is complete (boundedly complete) for $\mathcal{P}$ or $\theta$ if
i) $h:(\mathcal{T},\mathcal{F})\to(\mathbb{R}^1,\mathcal{B}^1)$ $\mathcal{F}$-measurable (and bounded), and
ii) $E_\theta h(T(X))=0$ $\forall\theta$
imply that $h\circ T=0$ a.s. $P_\theta$ $\forall\theta$.

These definitions say that there is no nontrivial function of $T$ that is an unbiased estimator of 0. This is equivalent to saying that there is no nontrivial function of $T$ that is first order ancillary. Notice also that completeness implies bounded completeness, so that bounded completeness is a slightly weaker condition than completeness.

Example  Let $X=(X_1,X_2,\ldots,X_n)$, where the $X_i$ are iid Bernoulli$(p)$ and $T(X)=\sum_{i=1}^n X_i$. You may use the fact that polynomials can be identically 0 on an interval only if their coefficients are all 0 to show that $T(X)$ is complete for $p\in[0,1]$.

Theorem 9 (Page 99 Schervish) (Basu's Theorem)  If $T$ is a boundedly complete sufficient statistic and $U$ is ancillary, then for all $\theta$ the random variables $U$ and $T$ are independent (according to $P_\theta$). (This theorem has some uses in the identification of UMVUE's.)

(Obvious) Result 10  If $T$ is complete for $\mathcal{P}=\{P_\theta\}_{\theta\in\Theta}$ and $\Theta'\subset\Theta$, $T$ need not be complete for $\mathcal{P}'=\{P_\theta\}_{\theta\in\Theta'}$.

Theorem 11  If $\Theta'\subset\Theta$ and $\Theta'$ dominates $\Theta$ in the sense that $P_\theta(B)=0$ $\forall\theta\in\Theta'$ implies that $P_\theta(B)=0$ $\forall\theta\in\Theta$, then $T$ (boundedly) complete for $\mathcal{P}'=\{P_\theta\}_{\theta\in\Theta'}$ implies that $T$ is (boundedly) complete for $\mathcal{P}=\{P_\theta\}_{\theta\in\Theta}$.

Completeness of a statistic says that it is not carrying any "ancillary" information, stuff that in and of itself is of no help in inference about $\theta$. That is reminiscent of minimal sufficiency.
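A crude Monte Carlo illustration of Basu's Theorem in the N$(\theta,1)$ example (a sketch that is not part of the notes; it only looks at the correlation of two bounded functions, which is of course weaker than independence):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 3.0                                        # any fixed theta; the claim holds for each one
x = rng.normal(theta, 1.0, size=(100_000, 10))     # 100k draws of X = (X_1, ..., X_10)
xbar = x.mean(axis=1)                              # T(X) = sample mean: boundedly complete, sufficient
s2 = x.var(axis=1, ddof=1)                         # sample variance: ancillary
# Basu's Theorem says xbar and s2 are independent under P_theta; a rough check:
print(np.corrcoef(np.sin(xbar), np.exp(-s2))[0, 1])   # approximately 0
```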
Theorem 12 (Bahadur's Theorem)  Suppose that $T:(\mathcal{X},\mathcal{B})\to(\mathcal{T},\mathcal{F})$ is sufficient and boundedly complete.
i) (Pages 94-95 of Schervish) If $(\mathcal{T},\mathcal{F})=(\mathbb{R}^d,\mathcal{B}^d)$ then $T$ is minimal sufficient.
ii) If there is any minimal sufficient statistic, $T$ is minimal sufficient.

Example  Suppose that $\mathcal{P}\ll\mu$ (where $\mu$ is $\sigma$-finite) and that for $\theta\in\Theta\subset\mathbb{R}^k$
\[ \frac{dP_\theta}{d\mu}(x)=c(\theta)\exp\left(\sum_{i=1}^k\theta_i T_i(x)\right) . \]
Provided, for example, that $\Theta$ contains an open rectangle in $\mathbb{R}^k$, Theorem 1 page 142 Lehmann TSH implies that $T(X)=(T_1(X),T_2(X),\ldots,T_k(X))$ is complete. It is clearly sufficient by the factorization criterion. So by part i) of Theorem 12, $T$ is minimal sufficient.

Decision Theory

This is a formal framework for posing problems of finding good statistical procedures as optimization problems.

Basic Elements of a (Statistical) Decision Problem:

$X$ — data, formally a mapping $(\Omega,\mathcal{F})\to(\mathcal{X},\mathcal{B})$
$\Theta$ — parameter space; often one needs to do integrations over $\Theta$, so let $\mathcal{C}$ be a $\sigma$-algebra on $\Theta$ and assume that for $B\in\mathcal{B}$ the mapping $\theta\mapsto P_\theta(B)$ from $(\Theta,\mathcal{C})$ to $(\mathbb{R}^1,\mathcal{B}^1)$ is measurable
$\mathcal{A}$ — decision/action space, the class of possible decisions or actions; we will also need to average over (random) actions, so let $\mathcal{E}$ be a $\sigma$-algebra on $\mathcal{A}$
$L(\theta,a)$ — loss function, $L:\Theta\times\mathcal{A}\to[0,\infty)$, which needs to be suitably measurable
$\delta(x)$ — decision rule, $\delta:(\mathcal{X},\mathcal{B})\to(\mathcal{A},\mathcal{E})$; for data $X$, $\delta(X)$ is the decision taken on the basis of observing $X$

The fact that $L(\theta,\delta(X))$ is random prevents any sensible comparison of decision rules until one has averaged out over the distribution of $X$, producing something called a risk function.

Definition 8  $R(\theta,\delta)\equiv E_\theta L(\theta,\delta(X))=\int L(\theta,\delta(x))\,dP_\theta(x)$ maps $\Theta\to[0,\infty)$ and is called the risk function (of the decision rule $\delta$).

Definition 9  $\delta$ is at least as good as $\delta'$ if $R(\theta,\delta)\le R(\theta,\delta')$ $\forall\theta$.

Definition 10  $\delta$ is better than $\delta'$ if $R(\theta,\delta)\le R(\theta,\delta')$ $\forall\theta$, with strict inequality for at least one $\theta\in\Theta$.

Definition 11  $\delta$ and $\delta'$ are risk equivalent if $R(\theta,\delta)=R(\theta,\delta')$ $\forall\theta$. (Note that risk equivalence need not imply equality of decision rules.)

Definition 12  $\delta$ is best in a class of decision rules $\Delta$ if i) $\delta\in\Delta$ and ii) $\delta$ is at least as good as any other $\delta'\in\Delta$.

Typically, only smallish classes of decision rules will have best elements. (For example, in some point estimation problems there are best unbiased estimators, while no estimator is best overall. Here the class $\Delta$ is the class of unbiased estimators.) An alternative to looking for best elements in small classes of rules is to somehow reduce the function of $\theta$, $R(\theta,\delta)$, to a single number and look for rules minimizing that number. (Reduction by averaging amounts to looking for Bayes rules. Reduction by taking a maximum value amounts to looking for minimax rules.)

Definition 13  $\delta$ is inadmissible in $\Delta$ if $\exists$ $\delta'\in\Delta$ which is better than $\delta$ (which "dominates" $\delta$).

Definition 14  $\delta$ is admissible in $\Delta$ if it is not inadmissible in $\Delta$ (if there is no $\delta'\in\Delta$ which is better).

On the face of it, it would seem that one would never want to use an inadmissible decision rule. But, as it turns out, there can be problems where all rules are inadmissible. (More in this direction later.)

Largely for technical reasons, it is convenient/important to somewhat broaden one's understanding of what constitutes a decision rule. To this end, as a matter of notation, let

$D=\{\delta(x)\}$ = the class of (nonrandomized) decision rules.

Then one can define two fairly natural notions of randomized decision rules.

Definition 15  Suppose that for each $x\in\mathcal{X}$, $\phi_x$ is a distribution on $(\mathcal{A},\mathcal{E})$. Then $\phi_x$ is called a behavioral decision rule.
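Before going on, a small numerical illustration of Definitions 8-14 (a sketch that is not part of the notes; it assumes squared error loss in the familiar Binomial setting) shows that risk functions typically cross, so that neither rule is "better" in the sense of Definition 10:

```python
import numpy as np
from math import comb

n = 10
t = np.arange(n + 1)

def risk(delta, p):
    """R(p, delta) = E_p (delta(T) - p)^2 with T ~ Binomial(n, p) and squared error loss."""
    pmf = np.array([comb(n, k) * p ** k * (1 - p) ** (n - k) for k in t])
    return float(np.sum(pmf * (delta(t) - p) ** 2))

delta1 = lambda t: t / n                                      # the usual estimator T/n
delta2 = lambda t: (t + np.sqrt(n) / 2) / (n + np.sqrt(n))    # a shrinkage estimator

for p in (0.05, 0.25, 0.5):
    print(p, risk(delta1, p), risk(delta2, p))  # delta1 wins near p = 0, delta2 wins near p = 1/2
```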
The idea is that having observed $X=x$, one then randomizes over the action space according to the distribution $\phi_x$ in order to choose an action. (This should remind you of what is done in some theoretical arguments in hypothesis testing.)

Notation: Let $D^*=\{\phi_x\}$ = the class of behavioral decision rules.

It is possible to think of $D$ as a subset of $D^*$. For $\delta\in D$, let $\phi_x^\delta$ be a point mass distribution on $\mathcal{A}$ concentrated at $\delta(x)$. Then $\phi_x^\delta\in D^*$ and is "clearly" equivalent to $\delta$.

The risk of a behavioral decision rule is (abusing notation and reusing the symbols $R(\cdot,\cdot)$ in this context as well as that of Definition 8) clearly
\[ R(\theta,\phi)=\int_{\mathcal{X}}\int_{\mathcal{A}}L(\theta,a)\,d\phi_x(a)\,dP_\theta(x) . \]

A somewhat less intuitively appealing notion of randomization for decision rules is next. Let $\mathcal{G}$ be a $\sigma$-algebra on $D$ that contains the singleton sets.

Definition 16  A randomized decision function (or decision rule) $\psi$ is a probability measure on $(D,\mathcal{G})$. ($\delta$ with distribution $\psi$ becomes a random object.)

The idea here is that prior to observation, one chooses an element of $D$ by use of $\psi$, say $\delta$, and then having observed $X=x$ decides $\delta(x)$.

Notation: Let $D_*=\{\psi\}$ = the class of randomized decision rules.

It is possible to think of $D$ as a subset of $D_*$ by identifying $\delta\in D$ with $\psi_\delta$ placing mass 1 on $\delta$. $\psi_\delta$ is "clearly" equivalent to $\delta$.

The risk of a randomized decision rule is (once again abusing notation and reusing the symbols $R(\cdot,\cdot)$) clearly
\[ R(\theta,\psi)=\int_D R(\theta,\delta)\,d\psi(\delta)=\int_D\int_{\mathcal{X}}L(\theta,\delta(x))\,dP_\theta(x)\,d\psi(\delta) , \]
assuming that $R(\theta,\psi)$ is properly measurable.

The behavioral decision rules are probably more appealing/natural than the randomized decision rules. On the other hand, the randomized decision rules are often easier to deal with in proofs. So a reasonably important question is "When are $D^*$ and $D_*$ equivalent in the sense of generating the same set of risk functions?"

"Result" 13 (Some properly qualified version of this is true ... see Ferguson pages 26-27)  If $\mathcal{A}$ is a complete separable metric space with $\mathcal{E}$ the Borel $\sigma$-algebra, appropriate conditions hold on the $P_\theta$ and ??????? holds, then $D^*$ and $D_*$ are equivalent in terms of generating the same set of risk functions.

It is also of interest to know when the extra complication of considering randomized rules is really not necessary. One simple result in that direction concerns cases where $L$ is convex in its second argument.

Lemma 14 (See page 151 of Schervish, page 40 of Berger and page 78 of Ferguson)  Suppose that $\mathcal{A}$ is a convex subset of $\mathbb{R}^d$ and $\phi_x$ is a behavioral decision rule. Define a nonrandomized decision rule $\delta$ by
\[ \delta(x)=\int_{\mathcal{A}}a\,d\phi_x(a) . \]
(In the case that $d>1$, interpret $\delta(x)$ as vector-valued, the integral as a vector of integrals over the $d$ coordinates of $a\in\mathcal{A}$.) Then
i) If $L(\theta,\cdot):\mathcal{A}\to[0,\infty)$ is convex, then $R(\theta,\delta)\le R(\theta,\phi)$.
ii) If $L(\theta,\cdot):\mathcal{A}\to[0,\infty)$ is strictly convex, $R(\theta,\phi)<\infty$ and $P_\theta\{x\mid\phi_x\text{ is nondegenerate}\}>0$, then $R(\theta,\delta)<R(\theta,\phi)$.

(Strict convexity of a function $g$ means that for $\alpha\in(0,1)$, $g(\alpha x+(1-\alpha)y)<\alpha g(x)+(1-\alpha)g(y)$ for distinct $x$ and $y$ in the convex domain of $g$. See Berger pages 38 through 40. There is an extension of Jensen's inequality that says that if $g$ is strictly convex and the distribution of $X$ is not concentrated at a point, then $Eg(X)>g(E(X))$.)

Corollary 15  Suppose that $\mathcal{A}$ is a convex subset of $\mathbb{R}^d$ and $\phi_x$ is a behavioral decision rule. Define a nonrandomized decision rule $\delta$ by $\delta(x)=\int_{\mathcal{A}}a\,d\phi_x(a)$.
i) If $L(\theta,\cdot):\mathcal{A}\to[0,\infty)$ is convex $\forall\theta$, then $\delta$ is at least as good as $\phi$.
ii) If $L(\theta,\cdot):\mathcal{A}\to[0,\infty)$ is convex $\forall\theta$ and for some $\theta_0$
the function $L(\theta_0,\cdot):\mathcal{A}\to[0,\infty)$ is strictly convex, $R(\theta_0,\phi)<\infty$ and $P_{\theta_0}\{x\mid\phi_x\text{ is nondegenerate}\}>0$, then $\delta$ is better than $\phi$.

Corollary 15 says roughly that in the case of a convex loss function, one need not worry about considering randomized decision rules.

(Finite Dimensional) Geometry of Decision Theory

A very helpful device for understanding some of the basics of decision theory is to consider the geometry associated with cases where $\Theta$ is finite. So until further notice, assume $\Theta=\{\theta_1,\theta_2,\ldots,\theta_k\}$ and $R(\theta,\psi)<\infty$ $\forall\theta\in\Theta$ and $\psi\in D_*$. As a matter of notation, let
\[ \mathcal{S}=\{y=(y_1,y_2,\ldots,y_k)\in\mathbb{R}^k\mid y_i=R(\theta_i,\psi)\ \forall i \text{ and }\psi\in D_*\} . \]
$\mathcal{S}$ is the set of all randomized risk vectors.

Theorem 16  $\mathcal{S}$ is convex.

In fact, it turns out to be the case, and is more or less obvious from Theorem 16, that if
\[ \mathcal{S}_0=\{y\mid y_i=R(\theta_i,\delta)\ \forall i \text{ and }\delta\in D\} , \]
then $\mathcal{S}$ is the convex hull of $\mathcal{S}_0$, i.e. the smallest convex set containing $\mathcal{S}_0$.

Definition 17  For $x\in\mathbb{R}^k$, the lower quadrant of $x$ is $Q_x=\{z\mid z_i\le x_i\ \forall i\}$.

Theorem 17  $y\in\mathcal{S}$ (or the decision rule giving rise to $y$) is admissible iff $Q_y\cap\mathcal{S}=\{y\}$.

Definition 18  The lower boundary of $\mathcal{S}$ is $\lambda(\mathcal{S})=\{y\mid Q_y\cap\bar{\mathcal{S}}=\{y\}\}$, for $\bar{\mathcal{S}}$ the closure of $\mathcal{S}$.

Definition 19  $\mathcal{S}$ is closed from below if $\lambda(\mathcal{S})\subset\mathcal{S}$.

Notation: Let $A(\mathcal{S})$ stand for the class of admissible rules (or risk vectors corresponding to them) in $\mathcal{S}$.

Theorem 18  If $\mathcal{S}$ is closed from below, then $A(\mathcal{S})=\lambda(\mathcal{S})$.

Theorem 18'  If $\mathcal{S}$ is closed, then $A(\mathcal{S})=\lambda(\mathcal{S})$.

Complete Classes of Decision Rules

Now drop the finiteness assumption on $\Theta$.

Definition 20  A class of decision rules $\mathcal{C}\subset D^*$ is a complete class if for any $\phi\notin\mathcal{C}$, $\exists$ a $\phi'\in\mathcal{C}$ such that $\phi'$ is better than $\phi$.

Definition 21  A class of decision rules $\mathcal{C}\subset D^*$ is an essentially complete class if for any $\phi\notin\mathcal{C}$, $\exists$ a $\phi'\in\mathcal{C}$ such that $\phi'$ is at least as good as $\phi$.

Definition 22  A class of decision rules $\mathcal{C}\subset D^*$ is a minimal complete class if it is complete and is a subset of any other complete class.

Notation: Let $A(D^*)$ be the set of admissible rules in $D^*$.

Theorem 19  If a minimal complete class $\mathcal{C}$ exists, then $\mathcal{C}=A(D^*)$.

Theorem 20  If $A(D^*)$ is complete, then it is minimal complete.

Theorems 19 and 20 are the properly qualified versions of the rough (and not quite true) statement "$A(D^*)$ equals the minimal complete class."

Sufficiency and Decision Theory

We worked very hard on the notion of sufficiency, with the goal in mind of establishing that one "doesn't need" $X$ if one has $T(X)$. That hard work ought to have some implications for decision theory. Presumably, in most cases one can get by with basing a decision rule on a sufficient statistic.

Result 21 (See page 120 of Ferguson, page 36 of Berger and page 151 of Schervish)  Suppose a statistic $T$ is sufficient for $\mathcal{P}=\{P_\theta\}$. Then if $\phi$ is a behavioral decision rule there exists another behavioral decision rule $\phi'$ that is a function of $T$ and has the same risk function as $\phi$.

Note that $\phi'$ a function of $T$ means that $T(x)=T(y)$ implies that $\phi'_x$ and $\phi'_y$ are the same measures on $\mathcal{A}$. Note also that Result 21 immediately implies that the class of decision rules in $D^*$ that are functions of $T$ is essentially complete. In terms of risk, one need only consider rules that are functions of sufficient statistics.

The construction used in the proof of Result 21 doesn't aim to improve on the original (behavioral) decision rule by moving to a function of the sufficient statistic.
But in convex loss cases, there is a construction that produces rules depending on a sufficient statistic and potentially improving on an initial (nonrandomized) decision rule. This is the message of Theorem 24. But to prove Theorem 24, a lemma of some independent interest is needed. That lemma and an immediate corollary are stated before Theorem 24.

Lemma 22  Suppose that $\mathcal{A}\subset\mathbb{R}^k$ is convex and $\delta_1(x)$ and $\delta_2(x)$ are two nonrandomized decision rules. Then $\delta(x)=\frac{1}{2}(\delta_1(x)+\delta_2(x))$ is also a decision rule. Further,
i) if $L(\theta,a)$ is convex in $a$ and $R(\theta,\delta_1)=R(\theta,\delta_2)$, then $R(\theta,\delta)\le R(\theta,\delta_1)$, and
ii) if $L(\theta,a)$ is strictly convex in $a$, $R(\theta,\delta_1)=R(\theta,\delta_2)<\infty$ and $P_\theta(\delta_1(X)\ne\delta_2(X))>0$, then $R(\theta,\delta)<R(\theta,\delta_1)$.

Corollary 23  Suppose $\mathcal{A}\subset\mathbb{R}^k$ is convex and $\delta_1(x)$ and $\delta_2(x)$ are two nonrandomized decision rules with identical risk functions. If $L(\theta,a)$ is convex in $a$ $\forall\theta$, then the existence of a $\theta\in\Theta$ such that $L(\theta,a)$ is strictly convex in $a$, $R(\theta,\delta_1)=R(\theta,\delta_2)<\infty$ and $P_\theta(\delta_1(X)\ne\delta_2(X))>0$ implies that $\delta_1$ and $\delta_2$ are inadmissible.

Theorem 24 (See Berger page 41, Ferguson page 121, TPE page 50, Schervish page 152) (Rao-Blackwell Theorem)  Suppose that $\mathcal{A}\subset\mathbb{R}^k$ is convex and $\delta$ is a nonrandomized decision function with $E_\theta|\delta(X)|<\infty$ $\forall\theta$. Suppose further that $T$ is sufficient for $\theta$ and, with $\mathcal{B}_0=\mathcal{B}(T)$, let
\[ \delta_0(x)\equiv E[\delta\mid\mathcal{B}_0](x) . \]
Then $\delta_0$ is a nonrandomized decision rule. Further,
i) if $L(\theta,a)$ is convex in $a$, then $R(\theta,\delta_0)\le R(\theta,\delta)$, and
ii) for any $\theta$ for which $L(\theta,a)$ is strictly convex in $a$, $R(\theta,\delta)<\infty$ and $P_\theta(\delta(X)\ne\delta_0(X))>0$, one has that $R(\theta,\delta_0)<R(\theta,\delta)$.
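A minimal Monte Carlo sketch of Theorem 24 (not part of the notes) in the Bernoulli setting: start from the crude unbiased estimator $\delta(X)=X_1$ of $p$ and condition on $T=\sum X_i$, which gives $\delta_0=E[X_1\mid T]=T/n$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, reps = 10, 0.3, 200_000
x = rng.binomial(1, p, size=(reps, n))
t = x.sum(axis=1)

delta = x[:, 0].astype(float)    # crude unbiased estimator of p: just X_1
delta0 = t / n                   # its Rao-Blackwellization E[X_1 | T] = T/n

print(delta.mean(), delta0.mean())   # both approximately 0.3 (unbiasedness is preserved)
print(delta.var(), delta0.var())     # the conditioned rule has much smaller risk, as i) guarantees
```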
Bayes Rules

The Bayes approach to decision theory is one means of reducing risk functions to numbers so that they can be compared in a straightforward fashion.

Notation: Let $G$ be a distribution on $(\Theta,\mathcal{C})$. $G$ is usually (for philosophical reasons) called the "prior" distribution on $\Theta$.

Definition 23  The Bayes risk of $\phi\in D^*$ with respect to the prior $G$ is (abusing notation again and once more using the symbols $R(\cdot,\cdot)$)
\[ R(G,\phi)=\int_\Theta R(\theta,\phi)\,dG(\theta) . \]

Notation: Again abusing notation somewhat, let $R(G)\equiv\inf_{\phi'\in D^*}R(G,\phi')$.

Definition 24  $\phi$ is said to be Bayes with respect to $G$ (or to be a Bayes rule with respect to $G$) provided $R(G,\phi)=R(G)$.

Definition 25  $\phi$ is said to be $\epsilon$-Bayes with respect to $G$ provided $R(G,\phi)\le R(G)+\epsilon$.

Bayes rules are often admissible, at least if the prior involved "spreads its mass around sufficiently."

Theorem 25  If $\Theta$ is finite (or countable), $G$ is a prior distribution with $G(\theta)>0$ $\forall\theta$ and $\phi$ is Bayes with respect to $G$, then $\phi$ is admissible.

Theorem 26  Suppose $\Theta\subset\mathbb{R}^k$ is such that every neighborhood of any point $\theta\in\Theta$ has nonempty intersection with the interior of $\Theta$. Suppose further that $R(\theta,\phi)$ is continuous in $\theta$ for all $\phi\in D^*$. Let $G$ be a prior distribution that has support $\Theta$ in the sense that every open ball that is a subset of $\Theta$ has positive $G$ probability. Then if $R(G)<\infty$ and $\phi$ is a Bayes rule with respect to $G$, $\phi$ is admissible.

Theorem 27  If every Bayes rule with respect to $G$ has the same risk function, they are all admissible.

Corollary 28  In a problem where Bayes rules exist and are unique, they are admissible.

A converse of Theorem 25 (for finite $\Theta$) is given in Theorem 30. Its proof requires the use of a piece of $k$-dimensional analytic geometry (called the separating hyperplane theorem) that is stated next.

Theorem 29  Let $\mathcal{S}_1$ and $\mathcal{S}_2$ be two disjoint convex subsets of $\mathbb{R}^k$. Then $\exists$ a $p\in\mathbb{R}^k$ such that $p\ne 0$ and $\sum p_i x_i\le\sum p_i y_i$ $\forall y\in\mathcal{S}_1$ and $x\in\mathcal{S}_2$.

Theorem 30  If $\Theta$ is finite and $\phi$ is admissible, then $\phi$ is Bayes with respect to some prior.

For nonfinite $\Theta$, admissible rules are "usually" Bayes or "limits" of Bayes rules. See Section 2.10 of Ferguson or page 546 of Berger.

As the next result shows, Bayesians don't need randomized decision rules, at least for purposes of achieving minimum Bayes risk, $R(G)$.

Result 31 (See also page 147 of Schervish)  Suppose that $\psi\in D_*$ is Bayes versus $G$ and $R(G)<\infty$. Then there exists a nonrandomized rule $\delta\in D$ that is also Bayes versus $G$.

There remain the questions of 1) when Bayes rules exist and 2) what they look like when they do exist. These issues are (to some extent) addressed next.

Theorem 32  If $\Theta$ is finite, $\mathcal{S}$ is closed from below and $G$ assigns positive probability to each $\theta\in\Theta$, there is a rule Bayes versus $G$.

Definition 26  A formal nonrandomized Bayes rule versus a prior $G$ is a rule $\delta(x)$ such that $\forall x\in\mathcal{X}$, $\delta(x)$ is an $a\in\mathcal{A}$ minimizing
\[ \int_\Theta L(\theta,a)\left(\frac{f_\theta(x)}{\int_\Theta f_\theta(x)\,dG(\theta)}\right)dG(\theta) . \]

Definition 27  If $G$ is a $\sigma$-finite measure, a formal nonrandomized generalized Bayes rule versus $G$ is a rule $\delta(x)$ such that $\forall x\in\mathcal{X}$, $\delta(x)$ is an $a\in\mathcal{A}$ minimizing
\[ \int_\Theta L(\theta,a)f_\theta(x)\,dG(\theta) . \]

Minimax Decision Rules

An alternative to the Bayesian reduction of $R(\theta,\phi)$ to a number by averaging according to the distribution $G$ is to instead reduce $R(\theta,\phi)$ to a number by maximization over $\theta$.

Definition 28  A decision rule $\phi\in D^*$ is said to be minimax if
\[ \sup_\theta R(\theta,\phi)=\inf_{\phi'}\sup_\theta R(\theta,\phi') . \]

Definition 29  If a decision rule $\phi$ has constant (in $\theta$) risk, it is called an equalizer rule.

It should make some sense that if one is trying to find a minimax rule by, so to speak, pushing down the highest peak in the profile $R(\theta,\phi)$, doing so will tend to drive one towards rules with "flat" $R(\theta,\phi)$ profiles.

Theorem 33  If $\phi$ is an equalizer rule and is admissible, then it is minimax.

Theorem 34  Suppose that $\{\phi_i\}$ is a sequence of decision rules, each $\phi_i$ Bayes against a prior $G_i$. If $R(G_i,\phi_i)\to C$ and $\phi$ is a decision rule with $R(\theta,\phi)\le C$ $\forall\theta$, then $\phi$ is minimax.

Corollary 35  If $\phi$ is Bayes versus $G$ and $R(\theta,\phi)\le R(G)$ $\forall\theta$, then $\phi$ is minimax.

Corollary 35'  If $\phi$ is an equalizer rule and is Bayes with respect to a prior $G$, then it is minimax.

A reasonable question to ask, in the light of Corollaries 35 and 35', is "Which $G$ should I be looking for in order to produce a Bayes rule that might be minimax?" The answer to this question is "Look for a least favorable prior."

Definition 30  A prior distribution $G$ is said to be least favorable if $R(G)=\sup_{G'}R(G')$.

Intuitively, one might think that if an intelligent adversary were choosing $\theta$, he/she might well generate it according to a least favorable $G$, thereby maximizing the minimum possible risk. In response, a decision maker ought to use $\phi$ Bayes versus $G$. Perhaps then, this Bayes rule might tend to minimize the worst that could happen over choice of $\theta$?

Theorem 36  If $\phi$ is Bayes versus $G$ and $R(\theta,\phi)\le R(G)$ $\forall\theta$, then $G$ is least favorable.
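A standard worked example of Corollary 35' and Theorem 36 (not spelled out in the notes above, but it uses only the Binomial/Beta calculations already familiar from earlier examples): for $T\sim$ Binomial$(n,p)$, squared error loss and the Beta$(\sqrt{n}/2,\sqrt{n}/2)$ prior, the Bayes rule is the posterior mean $\delta(T)=(T+\sqrt{n}/2)/(n+\sqrt{n})$, and
\[
R(p,\delta)=\frac{\mathrm{Var}_p T+\left(E_pT+\tfrac{\sqrt{n}}{2}-p(n+\sqrt{n})\right)^2}{(n+\sqrt{n})^2}
=\frac{np(1-p)+\tfrac{n}{4}(1-2p)^2}{(n+\sqrt{n})^2}=\frac{1}{4(\sqrt{n}+1)^2},
\]
which is constant in $p$. Being an equalizer rule and Bayes, $\delta$ is minimax by Corollary 35', and by Theorem 36 the Beta$(\sqrt{n}/2,\sqrt{n}/2)$ prior is least favorable. (This is the same shrinkage estimator used in the small risk computation after Definition 15.)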
(Corollary 35 already shows $\phi$ in Theorem 36 to be minimax.)

Point Estimation (Unbiasedness, Exponential Families, Information Inequalities, Invariance and Equivariance and Maximum Likelihood)

Unbiasedness

One possible limited class of "sensible" decision rules in estimation problems is the class of "unbiased" estimators. (Here, we are tacitly assuming that $\mathcal{A}$ is $\mathbb{R}^1$ or some subset of it.)

Definition 31  The bias of an estimator $\delta(x)$ for $\gamma(\theta)\in\mathbb{R}^1$ is
\[ E_\theta(\delta(X)-\gamma(\theta))=\int_{\mathcal{X}}(\delta(x)-\gamma(\theta))\,dP_\theta(x) . \]

Definition 32  An estimator $\delta(x)$ is unbiased for $\gamma(\theta)\in\mathbb{R}^1$ provided $E_\theta(\delta(X)-\gamma(\theta))=0$ $\forall\theta$.

Before jumping into the business of unbiased estimation with too much enthusiasm, an early caution is in order. We supposedly believe that "Bayes $\approx$ admissible" and that admissibility is sort of a minimal "goodness" requirement for a decision rule ... but there are the following theorem and corollary.

Theorem 37 (See Ferguson pages 48 and 52)  If $L(\theta,a)=w(\theta)(\gamma(\theta)-a)^2$ for $w(\theta)>0$, and $\delta$ is both Bayes wrt $G$ with finite risk function and unbiased for $\gamma(\theta)$, then (according to the joint distribution of $(X,\theta)$) $\delta(X)=\gamma(\theta)$ a.s.

Corollary 38  Suppose $\Theta\subset\mathbb{R}^1$ has at least two elements and the family $\mathcal{P}=\{P_\theta\}$ is mutually absolutely continuous. If $L(\theta,a)=(\theta-a)^2$, no Bayes estimator of $\theta$ is unbiased.

Definition 33  $\delta$ is best unbiased for $\gamma(\theta)\in\mathbb{R}^1$ if it is unbiased and at least as good as any other unbiased estimator.

There need not be any unbiased estimators in a given estimation problem, let alone a best one. But for many common estimation loss functions, if there is a best unbiased estimator it is unique.

Theorem 39  Suppose that $\mathcal{A}\subset\mathbb{R}^1$ is convex, $L(\theta,a)$ is convex in $a$ $\forall\theta$ and $\delta_1$ and $\delta_2$ are two best unbiased estimators of $\gamma(\theta)$. Then for any $\theta$ for which $L(\theta,a)$ is strictly convex in $a$ and $R(\theta,\delta_1)<\infty$,
\[ P_\theta[\delta_1(X)=\delta_2(X)]=1 . \]

The special case of $L(\theta,a)=(\gamma(\theta)-a)^2$ is, of course, the one of most interest. For this case of unweighted squared error loss, for unbiased estimators with second moments $R(\theta,\delta)=\mathrm{Var}_\theta\,\delta(X)$.

Definition 34  $\delta$ is called a UMVUE (uniformly minimum variance unbiased estimator) of $\gamma(\theta)\in\mathbb{R}^1$ if it is unbiased and $\mathrm{Var}_\theta\,\delta(X)\le\mathrm{Var}_\theta\,\delta'(X)$ $\forall\theta$ for any other unbiased estimator $\delta'$ of $\gamma(\theta)$.

A much weaker notion is next.

Definition 35  $\delta$ is called locally minimum variance unbiased at $\theta_0\in\Theta$ if it is unbiased and $\mathrm{Var}_{\theta_0}\delta(X)\le\mathrm{Var}_{\theta_0}\delta'(X)$ for any other unbiased estimator $\delta'$ of $\gamma(\theta)$.

Notation: Let $\mathcal{U}_0=\{\delta\mid E_\theta\delta(X)=0$ and $E_\theta\delta^2(X)<\infty$ $\forall\theta\}$. ($\mathcal{U}_0$ is the set of unbiased estimators of 0 with finite variance.)

Theorem 40  Suppose that $\delta$ is an unbiased estimator of $\gamma(\theta)\in\mathbb{R}^1$ and $E_\theta\delta^2(X)<\infty$ $\forall\theta$. Then $\delta$ is a UMVUE of $\gamma(\theta)$ iff $\mathrm{Cov}_\theta(\delta(X),u(X))=0$ $\forall\theta$ $\forall u\in\mathcal{U}_0$.

Theorem 41 (Lehmann-Scheffé)  Suppose that $\mathcal{A}$ is a convex subset of $\mathbb{R}^1$, $T$ is a sufficient statistic for $\mathcal{P}=\{P_\theta\}$ and $\delta$ is an unbiased estimator of $\gamma(\theta)$. Let $\delta_0=E[\delta\mid T]=\phi\circ T$. Then $\delta_0$ is an unbiased estimator of $\gamma(\theta)$. If in addition $T$ is complete for $\mathcal{P}$,
i) $\phi'\circ T$ unbiased for $\gamma(\theta)$ implies that $\phi'\circ T=\phi\circ T$ a.s. $P_\theta$ $\forall\theta$, and
ii) $L(\theta,a)$ convex in $a$ $\forall\theta$ implies that
 a) $\delta_0$ is best unbiased, and
 b) if $\delta'$ is any other best unbiased estimator of $\gamma(\theta)$, then $\delta'=\delta_0$ a.s. $P_\theta$ for any $\theta$ at which $R(\theta,\delta_0)<\infty$ and $L(\theta,a)$ is strictly convex in $a$.
(Note that iia) says in particular that $\delta_0$ is UMVUE, and that iib) says that $\delta_0$ is the unique UMVUE.)

Basu's Theorem (Theorem 9) is on occasion helpful in doing the conditioning required to get $\delta_0$ of Theorem 41. The most famous example of this is in the calculation of the UMVUE of $\Phi\!\left(\frac{c-\mu}{\sigma}\right)$ based on $n$ iid N$(\mu,\sigma^2)$ observations.

Exponential Families

The most common method of establishing completeness of a statistic $T$ (at least for the standard examples) for use in the construction of Theorem 41 is an appeal to the properties of "Exponential Families" of distributions. Hence the placement of the next batch of class material.

A List of Facts About Exponential Families:

Definition 36  Suppose that $\mathcal{P}=\{P_\theta\}$ is dominated by a $\sigma$-finite measure $\mu$.
$\mathcal{P}$ is called an exponential family if for some $h(x)\ge 0$,
\[ f_\theta(x)\equiv\frac{dP_\theta}{d\mu}(x)=\exp\left\{a(\theta)+\sum_{i=1}^k\eta_i(\theta)T_i(x)\right\}h(x) \quad \forall\theta . \]

Definition 37  A family of probability measures $\mathcal{P}=\{P_\theta\}$ is said to be identifiable if $\theta_1\ne\theta_2$ implies that $P_{\theta_1}\ne P_{\theta_2}$.

Fact(s) 42  Let $\eta=(\eta_1,\eta_2,\ldots,\eta_k)$, $\Gamma=\{\eta\in\mathbb{R}^k\mid\int h(x)\exp\{\sum\eta_iT_i(x)\}\,d\mu(x)<\infty\}$, and consider also the family of distributions (on $\mathcal{X}$) with R-N derivatives wrt $\mu$ of the form
\[ f_\eta(x)=K(\eta)\exp\left\{\sum_{i=1}^k\eta_iT_i(x)\right\}h(x), \quad \eta\in\Gamma . \]
Call this family $\mathcal{P}^*$. This set of distributions on $\mathcal{X}$ is at least as large as $\mathcal{P}$ (and this second parameterization is mathematically nicer than the first). $\Gamma$ is called the natural parameter space for $\mathcal{P}^*$. It is a convex subset of $\mathbb{R}^k$. (TSH, page 57.) If $\Gamma$ lies in a subspace of dimension less than $k$, then $f_\eta(x)$ (and therefore $f_\theta(x)$) can be written in a form involving fewer than $k$ statistics $T_i$. We will henceforth assume $\Gamma$ to be fully $k$-dimensional. Note that depending upon the nature of the functions $\eta_i(\theta)$ and the parameter space $\Theta$, $\mathcal{P}$ may be a proper subset of $\mathcal{P}^*$. That is, defining $\Gamma_\Theta=\{(\eta_1(\theta),\eta_2(\theta),\ldots,\eta_k(\theta))\in\mathbb{R}^k\mid\theta\in\Theta\}$, $\Gamma_\Theta$ can be a proper subset of $\Gamma$.

Fact 43  The "support" of $P_\theta$, defined as $\{x\mid f_\theta(x)>0\}$, is clearly $\{x\mid h(x)>0\}$, which is independent of $\theta$. The distributions in $\mathcal{P}$ are thus mutually absolutely continuous.

Fact 44  From the Factorization Theorem, the statistic $T=(T_1,T_2,\ldots,T_k)$ is sufficient for $\mathcal{P}$.

Fact 45  $T$ has induced measures $\{P_\theta^T\mid\theta\in\Theta\}$ which also form an exponential family.

Fact 46  If $\Gamma_\Theta$ contains an open rectangle in $\mathbb{R}^k$, then $T$ is complete for $\mathcal{P}$. (See pages 142-143 of TSH.)

Fact 47  If $\Gamma_\Theta$ contains an open rectangle in $\mathbb{R}^k$ (and actually under the much weaker assumptions given on page 44 of TPE), $T$ is minimal sufficient for $\mathcal{P}$.

Fact 48  If $g$ is any measurable real valued function such that $E_\eta|g(X)|<\infty$, then
\[ E_\eta g(X)=\int g(x)f_\eta(x)\,d\mu(x) \]
is continuous on $\Gamma$ and has continuous partial derivatives of all orders on the interior of $\Gamma$. These can be calculated as
\[ \frac{\partial^{\alpha_1+\alpha_2+\cdots+\alpha_k}}{\partial\eta_1^{\alpha_1}\partial\eta_2^{\alpha_2}\cdots\partial\eta_k^{\alpha_k}}E_\eta g(X)=\int g(x)\,\frac{\partial^{\alpha_1+\alpha_2+\cdots+\alpha_k}}{\partial\eta_1^{\alpha_1}\partial\eta_2^{\alpha_2}\cdots\partial\eta_k^{\alpha_k}}f_\eta(x)\,d\mu(x) . \]
(See page 59 of TSH.)

Fact 49  If for $u=(u_1,u_2,\ldots,u_k)$, both $\eta_0$ and $\eta_0+u$ belong to $\Gamma$,
\[ E_{\eta_0}\exp\{u_1T_1(X)+u_2T_2(X)+\cdots+u_kT_k(X)\}=\frac{K(\eta_0)}{K(\eta_0+u)} . \]
Further, if $\eta_0$ is in the interior of $\Gamma$, then
\[ E_{\eta_0}\left(T_1^{\alpha_1}(X)T_2^{\alpha_2}(X)\cdots T_k^{\alpha_k}(X)\right)=K(\eta_0)\left[\frac{\partial^{\alpha_1+\alpha_2+\cdots+\alpha_k}}{\partial\eta_1^{\alpha_1}\partial\eta_2^{\alpha_2}\cdots\partial\eta_k^{\alpha_k}}\left(\frac{1}{K(\eta)}\right)\right]_{\eta=\eta_0} . \]
In particular,
\[ E_{\eta_0}T_j(X)=-\frac{\partial}{\partial\eta_j}\log K(\eta)\Big|_{\eta=\eta_0}, \quad \mathrm{Var}_{\eta_0}T_j(X)=-\frac{\partial^2}{\partial\eta_j^2}\log K(\eta)\Big|_{\eta=\eta_0}, \quad \mathrm{Cov}_{\eta_0}(T_j(X),T_l(X))=-\frac{\partial^2}{\partial\eta_j\partial\eta_l}\log K(\eta)\Big|_{\eta=\eta_0} . \]

Fact 50  If $X=(X_1,X_2,\ldots,X_n)$ has iid components, with each $X_i\sim P_\theta$, then $X$ generates a $k$-dimensional exponential family on $(\mathcal{X}^n,\mathcal{B}^n)$ wrt $\mu^n$. The $k$-dimensional statistic $\sum_{i=1}^nT(X_i)$ is sufficient for this family. Under the condition that $\Gamma_\Theta$ contains an open rectangle, this statistic is also complete and minimal sufficient.
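A quick symbolic check of the moment formulas at the end of Fact 49 (a sketch that is not part of the notes; it takes the Poisson$(\lambda)$ family with $\eta=\log\lambda$, $T(x)=x$, $h(x)=1/x!$, so that $K(\eta)=\exp(-e^\eta)$):

```python
import sympy as sp

eta = sp.symbols('eta', real=True)
K = sp.exp(-sp.exp(eta))              # Poisson: K(eta) = exp(-lambda) with lambda = exp(eta)
mean = -sp.diff(sp.log(K), eta)       # Fact 49: E T = -d/d eta log K(eta)
var = -sp.diff(sp.log(K), eta, 2)     # Fact 49: Var T = -d^2/d eta^2 log K(eta)
print(sp.simplify(mean), sp.simplify(var))   # both exp(eta) = lambda, the Poisson mean and variance
```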
If 1 is a measurable function such that E1Ð] Ñ œ ! and ! • EÐ1Ð] ÑÑ# • _, defining G » E$ Ð] Ñ1Ð] Ñ Var$Ð] Ñ G# . EÐ1Ð] ÑÑ# A restatement of Corollary 53 in terms that look more "statistical" is next. Corollary 53' Suppose that $ Ð\Ñ takes values in e" and E) Ð$ Ð\ÑÑ# • _. If 1) is a measurable function such that E) 1) Ð\Ñ œ ! and ! • E) Ð1) Ð\ÑÑ# • _, defining GÐ)Ñ » E) $ Ð\Ñ1) Ð\Ñ Var) $ Ð\Ñ G # Ð) Ñ . E) Ð1) Ð\ÑÑ# Definition 38 We will say that the model c with dominating 5-finite measure . is FI regular at )! - @ § e" , provided there exists an open neighborhood of )! , say S, such that i) 0) ÐBÑ ž ! aB and a) - S, ii) aBß 0) ÐBÑ is differentiable at )! and iii) " œ ' 0) ÐBÑ. .ÐBÑ can be differentiated wrt ) under the integral at )! , i.e. ! œ ' ..) 0) ÐBѹ . .ÐBÑ . ) œ)! (A definition like that on page 111 of Schervish, that requires i) and ii) only on a set of T) probability 1 a) in S, is really no more general than this. WOLOG, just throw away the exceptional set.) . Jargon: The (random) function of ), .) log0) Ð\Ñ is called the score function. iii) of Definition 3) says that the mean of the score function is ! at )! . -21- Definition 39 If the model c is FI regular at )! , then MÐ)! Ñ œ E)! Œ . log0) Ð\ѹ • ) œ)! .) # is called the Fisher Information contained in \ at )! Þ ÐIf a model is FI regular at )! , MÐ)! Ñ is the variance of the score function at )! .) Theorem 54 (Cramer-Rao) Suppose that the model c is FI regular at )! and that ! • MÐ)! Ñ • _Þ If E) $ Ð\Ñ œ ' $ ÐBÑ0) ÐBÑ. .ÐBÑ can be differentiated under the integral at )! , i.e. if ..) E) $ Ð\ѹ =' $ ÐBÑ ..) 0) ÐBѹ . .ÐBÑ , then ) œ)! Var)! $ Ð\Ñ ) œ)! . Œ .) E) $ Ð\ѹ ) œ) MÐ)! Ñ ! • # Þ For the special case of $ Ð\Ñ viewed as an estimator of ), we sometimes write E) $ Ð\Ñ œ ) € ,Ð)Ñ, and thus Theorem 54 says Var)! $ Ð\Ñ Ð" € ,w Ð)! ÑÑ# ÎMÐ)! Ñ . On the face of it, showing that the C-R lower bound is achieved looks like a general method of showing that an estimator is UMVUE. But the fact is that the circumstances in which the C-R lower bound is achieved are very well defined. The C-R inequality is essentially the Cauchy-Schwarz inequality, and one gets equality in Lemma 51 iff Z œ +Y a.s.. This translates in the proof of Theorem 54 to equality iff . log0) Ð\ѹ œ +Ð)! Ñ$ Ð\Ñ € ,Ð)! Ñ a.s. T)! Þ ) œ)! .) And, as it turns out, this can happen for all )! in a parameter space @ iff one is dealing with a one parameter exponential family of distributions and the statistic $Ð\Ñ is the natural sufficient statistic. So the method works only for UMVUEs of the means of the natural sufficient statistics in one parameter exponential families. Result 55 Suppose that for ( real, ÖT( × is dominated by a 5-finite measure ., and with 2ÐBÑ ž !, 0( ÐBÑ œ expa+Ð(Ñ € (X ÐBÑb a.s. . a( . Then for any (! in the interior of > where Var( X Ð\Ñ • _, X Ð\Ñ achieves the C-R lower bound for its variance. (A converse of result 55 is true, but is also substantially harder to prove.) -22- The C-R inequality is Corollary 53 with 1) ÐBÑ œ ..) log0) ÐBÑ. There are other interesting choices of 1) . For example (assuming it is permissible to move enough derivatives 5 through integralsÑ, 1) ÐBÑ œ Š ..)5 log0) ÐBÑ‹Î0) ÐBÑ produces an interesting inequality. And the choice 1) ÐBÑ œ a0) ÐBÑ • 0)w ÐBÑbÎ0) ÐBÑ leads to the next result. 
The C-R inequality is Corollary 53' with $g_\theta(x)=\frac{d}{d\theta}\log f_\theta(x)$. There are other interesting choices of $g_\theta$. For example (assuming it is permissible to move enough derivatives through integrals), $g_\theta(x)=\left(\frac{d^k}{d\theta^k}f_\theta(x)\right)/f_\theta(x)$ produces an interesting inequality. And the choice $g_\theta(x)=(f_\theta(x)-f_{\theta'}(x))/f_\theta(x)$ leads to the next result.

Theorem 56  If $P_{\theta'}\ll P_\theta$,
\[ \mathrm{Var}_\theta\,\delta(X)\ge\frac{\left(E_\theta\delta(X)-E_{\theta'}\delta(X)\right)^2}{E_\theta\left(\frac{f_\theta(X)-f_{\theta'}(X)}{f_\theta(X)}\right)^2} . \]
So, (the Chapman-Robbins Inequality)
\[ \mathrm{Var}_\theta\,\delta(X)\ge\sup_{\theta'\text{ s.t. }P_{\theta'}\ll P_\theta}\frac{\left(E_\theta\delta(X)-E_{\theta'}\delta(X)\right)^2}{E_\theta\left(\frac{f_\theta(X)-f_{\theta'}(X)}{f_\theta(X)}\right)^2} . \]

The Fisher Information comes up not only in the present context of information inequalities, but also in the theory of the asymptotics of maximum likelihood estimation. It is thus useful to have some more results about it.

Theorem 57  Suppose that the model $\mathcal{P}$ is FI regular at $\theta_0$. If $\forall x$, $f_\theta(x)$ is in fact twice differentiable at $\theta_0$ and $1=\int f_\theta(x)\,d\mu(x)$ can be differentiated twice wrt $\theta$ under the integral at $\theta_0$, i.e. both $0=\int\frac{d}{d\theta}f_\theta(x)\big|_{\theta=\theta_0}\,d\mu(x)$ and $0=\int\frac{d^2}{d\theta^2}f_\theta(x)\big|_{\theta=\theta_0}\,d\mu(x)$, then
\[ I(\theta_0)=-E_{\theta_0}\left(\frac{d^2}{d\theta^2}\log f_\theta(X)\Big|_{\theta=\theta_0}\right) . \]

Fact 58  If $X_1,X_2,\ldots,X_n$ are independent, $X_i\sim P_{i\theta}$, then $X=(X_1,X_2,\ldots,X_n)$ (that has distribution $P_{1\theta}\times P_{2\theta}\times\cdots\times P_{n\theta}$) carries Fisher Information $I(\theta)=\sum_{i=1}^nI_i(\theta)$, where, in the obvious way, $I_i(\theta)$ is the Fisher information in $X_i$.

Fact 59  $I(\theta)$ doesn't depend upon the dominating measure.

Fact 60  For the case of $E_\theta\delta(X)=\theta$, the C-R inequality says that $\mathrm{Var}_\theta\,\delta(X)\ge\frac{1}{I(\theta)}$. And when $X=(X_1,X_2,\ldots,X_n)$ has iid components, this combined with Fact 58 says that $\mathrm{Var}_\theta\,\delta(X)\ge\frac{1}{nI_1(\theta)}$.

There are multiparameter versions of the information inequalities (and for that matter, multiple-$\delta$ versions involving the covariance matrix of the $\delta$'s). The fundamental result needed to produce multiparameter versions of the information inequalities is the simple generalization of Corollary 53.

Lemma 61  Suppose that $\delta(Y)$ takes values in $\mathbb{R}^1$ and $E(\delta(Y))^2<\infty$. Suppose further that $g_1,g_2,\ldots,g_k$ are measurable functions such that $Eg_i(Y)=0$ and $0<E(g_i(Y))^2<\infty$ $\forall i$, and the $g_i$ are linearly independent as square integrable functions on $\mathcal{X}$ according to the probability measure $P$. Let $C_i\equiv E\delta(Y)g_i(Y)$ and define a $k\times k$ matrix $D\equiv(Eg_i(Y)g_j(Y))$. Then for $C=(C_1,C_2,\ldots,C_k)'$,
\[ \mathrm{Var}\,\delta(Y)\ge C'D^{-1}C . \]

The "statistical-looking" version of Lemma 61 is next.

Lemma 61'  Suppose that $\delta(X)$ takes values in $\mathbb{R}^1$ and $E_\theta(\delta(X))^2<\infty$. Suppose further that $g_{\theta 1},g_{\theta 2},\ldots,g_{\theta k}$ are measurable functions such that $E_\theta g_{\theta i}(X)=0$ and $0<E_\theta(g_{\theta i}(X))^2<\infty$ $\forall i$, and the $g_{\theta i}$ are linearly independent as square integrable functions on $\mathcal{X}$ according to the probability measure $P_\theta$. Let $C_i(\theta)\equiv E_\theta\delta(X)g_{\theta i}(X)$ and define a $k\times k$ matrix $D(\theta)\equiv(E_\theta g_{\theta i}(X)g_{\theta j}(X))$. Then for $C(\theta)=(C_1(\theta),C_2(\theta),\ldots,C_k(\theta))'$,
\[ \mathrm{Var}_\theta\,\delta(X)\ge C(\theta)'D(\theta)^{-1}C(\theta) . \]

$k$-dimensional versions of Definitions 38 and 39 are next.

Definition 40  We will say that the model $\mathcal{P}$ with dominating $\sigma$-finite measure $\mu$ is FI regular at $\theta_0\in\Theta\subset\mathbb{R}^k$ provided there exists an open neighborhood of $\theta_0$, say $O$, such that
i) $f_\theta(x)>0$ $\forall x$ and $\forall\theta\in O$,
ii) $\forall x$, $f_\theta(x)$ has first order partial derivatives at $\theta_0$, and
iii) $1=\int f_\theta(x)\,d\mu(x)$ can be differentiated wrt each $\theta_i$ under the integral at $\theta_0$, i.e. $0=\int\frac{\partial}{\partial\theta_i}f_\theta(x)\big|_{\theta=\theta_0}\,d\mu(x)$.

Definition 41  If the model $\mathcal{P}$ is FI regular at $\theta_0$ and $E_{\theta_0}\left(\frac{\partial}{\partial\theta_i}\log f_\theta(X)\big|_{\theta=\theta_0}\right)^2<\infty$ $\forall i$, then the $k\times k$ matrix
\[ I(\theta_0)=\left(E_{\theta_0}\frac{\partial}{\partial\theta_i}\log f_\theta(X)\Big|_{\theta=\theta_0}\frac{\partial}{\partial\theta_j}\log f_\theta(X)\Big|_{\theta=\theta_0}\right) \]
is called the Fisher Information contained in $X$ at $\theta_0$.

The multiparameter version of Theorem 57 is next.
Theorem 62  Suppose that the model $\mathcal{P}$ is FI regular at $\theta_0$. If $\forall x$, $f_\theta(x)$ in fact has continuous second order partials in the neighborhood $O$, and $1=\int f_\theta(x)\,d\mu(x)$ can be differentiated twice under the integral at $\theta_0$, i.e. both $0=\int\frac{\partial}{\partial\theta_i}f_\theta(x)\big|_{\theta=\theta_0}\,d\mu(x)$ and $0=\int\frac{\partial^2}{\partial\theta_i\partial\theta_j}f_\theta(x)\big|_{\theta=\theta_0}\,d\mu(x)$ $\forall i$ and $j$, then
\[ I(\theta_0)=-E_{\theta_0}\left(\frac{\partial^2}{\partial\theta_i\partial\theta_j}\log f_\theta(X)\Big|_{\theta=\theta_0}\right) . \]

And there is the multiparameter version of the C-R inequality.

Theorem 63 (Cramér-Rao)  Suppose that the model $\mathcal{P}$ is FI regular at $\theta_0\in\Theta\subset\mathbb{R}^k$ and that $I(\theta_0)$ exists and is nonsingular. If $E_\theta\delta(X)=\int\delta(x)f_\theta(x)\,d\mu(x)$ can be differentiated wrt each $\theta_i$ under the integral at $\theta_0$, i.e. if $\frac{\partial}{\partial\theta_i}E_\theta\delta(X)\big|_{\theta=\theta_0}=\int\delta(x)\frac{\partial}{\partial\theta_i}f_\theta(x)\big|_{\theta=\theta_0}\,d\mu(x)$, then with
\[ C(\theta_0)\equiv\left(\frac{\partial}{\partial\theta_1}E_\theta\delta(X)\Big|_{\theta=\theta_0},\frac{\partial}{\partial\theta_2}E_\theta\delta(X)\Big|_{\theta=\theta_0},\ldots,\frac{\partial}{\partial\theta_k}E_\theta\delta(X)\Big|_{\theta=\theta_0}\right)' , \]
\[ \mathrm{Var}_{\theta_0}\delta(X)\ge C(\theta_0)'I(\theta_0)^{-1}C(\theta_0) . \]

There are multiple-$\delta$ versions of this information inequality material. See Rao, pages 326+.

Invariance and Equivariance

Unbiasedness (of estimators) is one means of restricting a class of decision rules down to a subclass small enough that there is hope of finding a "best" rule in the subclass. Invariance/equivariance considerations are another. These notions have to do with symmetries in a decision problem. "Invariance" means that (some) things don't change. "Equivariance" means that (some) things change similarly or equally. We'll do the case of Location Parameter Estimation first, and then turn to generalities about invariance/equivariance that apply more broadly (to both other estimation problems and to other types of decision problems entirely).

Definition 42  When $\mathcal{X}=\mathbb{R}^k$ and $\Theta\subset\mathbb{R}^k$, to say that $\theta$ is a location parameter means that $X-\theta$ has the same distribution $\forall\theta$.

Definition 43  When $\mathcal{X}=\mathbb{R}^k$ and $\Theta\subset\mathbb{R}^1$, to say that $\theta$ is a 1-dimensional location parameter means that $X-\theta\mathbf{1}$ (for $\mathbf{1}$ a $k$-vector of 1's) has the same distribution $\forall\theta$.

Note that for $\theta$ a 1-dimensional location parameter, $X\sim P_\theta$ means that $(X+c\mathbf{1})\sim P_{c+\theta}$. It therefore seems that a "reasonable" restriction on estimators $\delta$ is that they should satisfy $\delta(x+c\mathbf{1})=\delta(x)+c$. How to formalize this? Until further notice suppose that $\Theta=\mathcal{A}=\mathbb{R}^1$.

Definition 44  $L(\theta,a)$ is called location invariant if $L(\theta,a)=\rho(\theta-a)$. If $\rho(z)$ is increasing in $z$ for $z>0$ and decreasing in $z$ for $z<0$, the problem is called a location estimation problem.

Definition 45  A decision rule $\delta$ is called location equivariant if $\delta(x+c\mathbf{1})=\delta(x)+c$.

Theorem 64  If $\theta$ is a 1-dimensional location parameter, $L(\theta,a)$ is location invariant and $\delta$ is location equivariant, then $R(\theta,\delta)$ is constant in $\theta$. (That is, $\delta$ is an equalizer rule.)

This theorem says that if I'm working on a location estimation problem with a location invariant loss function and restrict attention to location equivariant estimators, I can compare estimators by comparing numbers. In fact, as it turns out, there is a fairly explicit representation for the best location equivariant estimator in such cases.

Definition 46  A function $g$ is called location invariant if $g(x+c\mathbf{1})=g(x)$.

Lemma 65  Suppose that $\delta_0$ is location equivariant. Then $\delta_1$ is location equivariant iff $\exists$ a location invariant function $u$ such that $\delta_1=\delta_0+u$.

Lemma 66  A function $u$ is location invariant iff $u$ depends on $x$ only through $y(x)=(x_1-x_k,x_2-x_k,\ldots,x_{k-1}-x_k)$.

Theorem 67  Suppose that $L(\theta,a)=\rho(\theta-a)$ is location invariant and $\exists$ a location equivariant estimator $\delta_0$ for $\theta$ with finite risk. Let $\rho'(z)=\rho(-z)$. Suppose that for each $y$ $\exists$ a number $v^*(y)$ that minimizes $E_{\theta=0}[\rho'(\delta_0-v)\mid Y=y]$ over choices of $v$. Then
\[ \delta^*(x)=\delta_0(x)-v^*(y(x)) \]
is a best location equivariant estimator of $\theta$.

In the case of $L(\theta,a)=(\theta-a)^2$, the estimator of Theorem 67 is Pitman's estimator, which can for some situations be written out in a more explicit form.
Theorem 68  Suppose that $L(\theta,a)=(\theta-a)^2$ and the distribution of $X=(X_1,X_2,\ldots,X_k)$ has R-N derivative wrt Lebesgue measure on $\mathbb{R}^k$
\[ f_\theta(x)=f(x_1-\theta,x_2-\theta,\ldots,x_k-\theta)=f(x-\theta\mathbf{1}) , \]
where $f$ has second moments. Then the best location equivariant estimator of $\theta$ is
\[ \delta^*(x)=\frac{\int_{-\infty}^{\infty}\theta f(x-\theta\mathbf{1})\,d\theta}{\int_{-\infty}^{\infty}f(x-\theta\mathbf{1})\,d\theta} \]
if this is well-defined and has finite risk. (Notice that this is the formal generalized Bayes estimator vs Lebesgue measure.)

Lemma 69  For the case of $L(\theta,a)=(\theta-a)^2$, a best location equivariant estimator of $\theta$ must be unbiased for $\theta$.
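A minimal numerical sketch of Theorem 68 (not part of the notes; the function and grid choices are only illustrative) approximates the two integrals by sums over a $\theta$ grid. For iid standard normal errors the Pitman estimator reduces to $\bar{x}$, which the sketch reproduces:

```python
import numpy as np

def pitman_estimate(x, f, grid=np.linspace(-20.0, 20.0, 4001)):
    """Theorem 68 by brute force: ratio of Riemann sums of theta*f(x - theta*1) and f(x - theta*1)."""
    vals = np.array([f(x - th) for th in grid])        # f evaluated at x - theta*1 over the grid
    return float(np.sum(grid * vals) / np.sum(vals))   # the grid spacing cancels in the ratio

# location family built from iid N(0,1) errors
f_normal = lambda u: float(np.exp(-0.5 * np.sum(u ** 2)))
x = np.array([1.3, 2.1, 0.7, 1.9])
print(pitman_estimate(x, f_normal), x.mean())          # the two values agree up to grid error
```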
Now cancel the assumption that $\Theta=\mathcal{A}=\mathbb{R}^1$. The following is a more general story about invariance/equivariance. Suppose that $\mathcal{P}=\{P_\theta\}$ is identifiable and (to avoid degeneracies) that $L(\theta,a)=L(\theta,a')$ $\forall\theta$ $\Rightarrow$ $a=a'$. Invariance/equivariance theory is phrased in terms of groups of transformations i) $\mathcal{X}\to\mathcal{X}$, ii) $\Theta\to\Theta$, and iii) $\mathcal{A}\to\mathcal{A}$, and how $L(\theta,a)$ and decision rules behave under these groups of transformations. The natural starting place is with transformations on $\mathcal{X}$.

Definition 47  For 1-1 transformations $g_1$ and $g_2$ of $\mathcal{X}$ onto $\mathcal{X}$, the composition of $g_1$ and $g_2$ is another 1-1 onto transformation of $\mathcal{X}$ defined by $g_2\circ g_1(x)=g_2(g_1(x))$ $\forall x\in\mathcal{X}$.

Definition 48  For $g$ a 1-1 transformation of $\mathcal{X}$ onto $\mathcal{X}$, if $h$ is another 1-1 onto transformation such that $h\circ g(x)=x$ $\forall x\in\mathcal{X}$, $h$ is called the inverse of $g$ and denoted as $h=g^{-1}$.

It is then an easy fact that not only is $g^{-1}\circ g(x)=x$ $\forall x$, but $g\circ g^{-1}(x)=x$ $\forall x$ as well.

Definition 49  The identity transformation on $\mathcal{X}$ is defined by $e(x)=x$ $\forall x$.

Another easy fact is then that $e\circ g=g\circ e=g$ for any 1-1 transformation $g$ of $\mathcal{X}$ onto $\mathcal{X}$.

(Obvious) Result 70  If $\mathcal{G}$ is a class of 1-1 transformations of $\mathcal{X}$ onto $\mathcal{X}$ that is closed under composition ($g_1,g_2\in\mathcal{G}\Rightarrow g_1\circ g_2\in\mathcal{G}$) and inversion ($g\in\mathcal{G}\Rightarrow g^{-1}\in\mathcal{G}$), then with the operation "$\circ$" (composition) $\mathcal{G}$ is a group in the usual mathematical sense.

A group such as that described in Result 70 will be called a transformation group. Such a group may or may not be Abelian/commutative (one is if $g_1\circ g_2=g_2\circ g_1$ $\forall g_1,g_2\in\mathcal{G}$).

Jargon/notation: Any class of transformations $\mathcal{C}$ can be thought of as generating a group. That is, $\mathcal{G}(\mathcal{C})$, consisting of all compositions of finite numbers of elements of $\mathcal{C}$ and their inverses, is a group under the operation "$\circ$".

To get anywhere with this theory, we then need transformations on $\Theta$ that fit together nicely with those on $\mathcal{X}$. At a minimum, one will need that the transformations on $\mathcal{X}$ don't get one outside of $\mathcal{P}$.

Definition 50  If $g$ is a 1-1 transformation of $\mathcal{X}$ onto $\mathcal{X}$ such that $\{P_\theta^{g(X)}\}_{\theta\in\Theta}=\mathcal{P}$, then the transformation $g$ is said to leave the model invariant. If each element of a group of transformations $\mathcal{G}$ leaves $\mathcal{P}$ invariant, we'll say that the group leaves the model invariant.

One circumstance in which $\mathcal{G}$ leaves $\mathcal{P}$ invariant is where the model $\mathcal{P}$ can be thought of as "generated by $\mathcal{G}$" in the following sense. Suppose that $U$ has some fixed probability distribution $P_0$ and consider $\mathcal{P}=\{P^{g(U)}\mid g\in\mathcal{G}\}$.

If a group of transformations on $\mathcal{X}$ (say $\mathcal{G}$) leaves $\mathcal{P}$ invariant, there is a natural corresponding group of transformations on $\Theta$. That is, for $g\in\mathcal{G}$,
\[ X\sim P_\theta \Rightarrow g(X)\sim P_{\theta'} \text{ for some } \theta'\in\Theta . \]
In fact, because of the identifiability assumption, there is only one such $\theta'$. So, as a matter of notation, adopt the following.

Definition 51  If $\mathcal{G}$ leaves the model $\mathcal{P}$ invariant, for each $g\in\mathcal{G}$ define $\bar{g}:\Theta\to\Theta$ by
\[ \bar{g}(\theta)=\text{the element }\theta'\text{ of }\Theta\text{ such that } X\sim P_\theta\Rightarrow g(X)\sim P_{\theta'} . \]
(According to the notation established in Definition 51, for $\mathcal{G}$ leaving $\mathcal{P}$ invariant, $X\sim P_\theta\Rightarrow g(X)\sim P_{\bar{g}(\theta)}$.)

(Obvious) Result 71  For $\mathcal{G}$ a group of 1-1 transformations of $\mathcal{X}$ onto $\mathcal{X}$ leaving $\mathcal{P}$ invariant, $\bar{\mathcal{G}}=\{\bar{g}\mid g\in\mathcal{G}\}$ with the operation "$\circ$" (composition) is a group of 1-1 transformations of $\Theta$. (I believe that the group $\bar{\mathcal{G}}$ is the homomorphic image of $\mathcal{G}$ under the map "$\bar{\ }$" so that, for example, $\overline{g^{-1}}=\bar{g}^{-1}$ and $\overline{g_2\circ g_1}=\bar{g}_2\circ\bar{g}_1$.)

We next need transformations on $\mathcal{A}$ that fit nicely together with those on $\mathcal{X}$ and $\Theta$. If that's going to be possible, it would seem that the loss function will have to fit nicely with the transformations already considered.

Definition 52  A loss function $L(\theta,a)$ is said to be invariant under the group of transformations $\mathcal{G}$ provided that for every $g\in\mathcal{G}$ and $a\in\mathcal{A}$, $\exists$ a unique $a'\in\mathcal{A}$ such that $L(\theta,a)=L(\bar{g}(\theta),a')$ $\forall\theta$.

Now then, as another matter of jargon/notation, adopt the following.

Definition 53  If $\mathcal{G}$ leaves the model invariant and the loss $L(\theta,a)$ is invariant, define $\tilde{g}:\mathcal{A}\to\mathcal{A}$ by
\[ \tilde{g}(a)=\text{the element }a'\text{ of }\mathcal{A}\text{ such that } L(\theta,a)=L(\bar{g}(\theta),a')\ \forall\theta . \]
(According to the notation established in Definition 53, for $\mathcal{G}$ leaving $\mathcal{P}$ invariant and invariant loss $L(\theta,a)$, $L(\theta,a)=L(\bar{g}(\theta),\tilde{g}(a))$.)

(Obvious) Result 72  For $\mathcal{G}$ a group of 1-1 transformations of $\mathcal{X}$ onto $\mathcal{X}$ leaving $\mathcal{P}$ invariant and invariant loss $L(\theta,a)$, $\tilde{\mathcal{G}}=\{\tilde{g}\mid g\in\mathcal{G}\}$ with the operation "$\circ$" (composition) is a group of 1-1 transformations of $\mathcal{A}$. (I believe that the group $\tilde{\mathcal{G}}$ is the homomorphic image of $\mathcal{G}$ under the map "$\tilde{\ }$" so that, for example, $\widetilde{g^{-1}}=\tilde{g}^{-1}$ and $\widetilde{g_2\circ g_1}=\tilde{g}_2\circ\tilde{g}_1$.)

So finally we come to the notion of equivariance for decision functions. The whole business is slightly subtle for behavioral rules (see pages 150-151 of Ferguson), so for this elementary introduction we'll confine attention to nonrandomized rules.

Definition 54  In an invariant decision problem (one where $\mathcal{G}$ leaves $\mathcal{P}$ invariant and $L(\theta,a)$ is invariant) a nonrandomized decision rule $\delta:\mathcal{X}\to\mathcal{A}$ is said to be equivariant provided
\[ \delta(g(x))=\tilde{g}(\delta(x)) \quad \forall x\in\mathcal{X}\text{ and }g\in\mathcal{G} . \]
(One could make a definition here involving a single $g$ or perhaps a 3 element group $\mathcal{G}=\{g,g^{-1},e\}$ and talk about equivariance under $g$ instead of under a group. I'll not bother.)

The basic elementary theorem about invariance/equivariance is the generalization of Theorem 64 (that cuts down the size of the comparison problem that one faces when comparing risk functions).

Theorem 73  If $\mathcal{G}$ leaves $\mathcal{P}$ invariant, $L(\theta,a)$ is invariant and $\delta$ is an equivariant nonrandomized decision rule, then
\[ R(\theta,\delta)=R(\bar{g}(\theta),\delta) \quad \forall\theta\in\Theta\text{ and }g\in\mathcal{G} . \]
Definition 55  In an invariant decision problem, for $\theta\in\Theta$ the set $\{\theta'\in\Theta\mid\theta'=\bar{g}(\theta)$ for some $g\in\mathcal{G}\}$ is called the orbit in $\Theta$ containing $\theta$. The orbits partition $\Theta$, and Theorem 73 says that an equivariant nonrandomized rule has risk that is constant on orbits.

Definition 56  A group of 1-1 transformations of a space onto itself is said to be transitive if for any two points in the space there is a transformation in the group taking the first point into the second.

Corollary 74  If $\mathcal{G}$ leaves $\mathcal{P}$ invariant, $L(\theta,a)$ is invariant and $\bar{\mathcal{G}}$ is transitive over $\Theta$, then every equivariant nonrandomized decision rule has constant risk.

Definition 57  In an invariant decision problem, a decision rule $\delta$ is said to be a best (nonrandomized) equivariant (or MRE ... minimum risk equivariant) decision rule if it is equivariant and for any other nonrandomized equivariant rule $\delta'$,
\[ R(\theta,\delta)\le R(\theta,\delta') \quad \forall\theta . \]
The import of Theorem 73 is that one needs only check this risk inequality for one $\theta$ from each orbit.

Maximum Likelihood (Primarily the Simple Asymptotics Thereof)

Maximum likelihood is a non-decision-theoretic method of point estimation dating at least to Fisher and probably beyond. As usual, for an identifiable family of distributions $\mathcal{P}=\{P_\theta\}$ dominated by a $\sigma$-finite measure $\mu$, we let $f_\theta=\frac{dP_\theta}{d\mu}$.

Definition 58  A statistic $T:\mathcal{X}\to\Theta$ is called a maximum likelihood estimator (MLE) of $\theta$ if
\[ f_{T(x)}(x)=\sup_{\theta\in\Theta}f_\theta(x) \quad \forall x\in\mathcal{X} . \]

This definition is not without practical problems in terms of existence and computation of MLEs. $f_\theta(x)$ can fail to be bounded in $\theta$, and even when it is, can fail to have a maximum value as one searches over $\Theta$. One way to get rid of the problem of potential lack of boundedness (favored and extensively practiced by Prof. Meeker, for example) is
-31- Definition 59 Suppose that ÖX8 × is a sequence of statistics X8 À k8 p@, where @ § e5 i) ÖX8 × is (weakly) consistent at )! - @ if for every % ž !ß T)! alX8 • )! l ž %bp! as 8p_, and ii) ÖX8 × is strongly consistent at )! - @ if X8 p)! a.s. T)! . There are two standard paths for developing asymptotic results for likelihood-based estimators in iid contexts. These are i) the path blazed by Wald based on very strong regularity assumptions (that can be extremely hard to verify for a specific model) but that deal with "real" maximum likelihood estimators, and ii) the path blazed by Cramer based on much weaker assumptions and far easier proofs that deal with solutions to the likelihood equations. We'll wave our hands very briefly at a heuristic version of the Wald argument and defer to pages 415-424 of Schervish for technical details. Then we'll do a version of the Cramer results (probably assuming more than is absolutely necessary to get them). All that follows regarding the asymptotics of maximum likelihood refers to the iid case, where B8 œ ÐB" ß B# ß ÞÞÞß B8 Ñ, @ § e5 and 0)8 ÐB8 Ñ œ # 0) ÐB3 Ñ. 8 3œ" Lemma 75 (Silvey, page 75) Suppose that T and U are two different distributions on a space k with R-N derivatives : ž ! and ; ž ! with respect to a 5-finite measure .. Suppose further that a function G À Ð!ß_Ñpe" is convex. Then EU G Œ :Ð\Ñ • ;Ð\Ñ GÐ"Ñ with strict inequality if G is strictly convex. Now suppose that the 0) are positive everywhere on k and define the function of ) D)! Ð)Ñ » E)! log0) Ð\Ñ œ ( log0) ÐBÑ 0)! ÐBÑ . .ÐBÑ Þ Corollary 76 If the 0) are positive everywhere on k , then for ) Á )! D)! Ð)! Ñ ž D)! Ð) Ñ Þ Then, as a matter of notation, let 68 ÐB8 ß )Ñ » log0)8 ÐB8 Ñ œ "log0) ÐB3 Ñ Þ 8 3œ" Then clearly E)! 8" 68 Ð\ 8 ß )Ñ œ D)! Ð)Ñ and in fact, the law of large numbers says that for every ), -32- " " 8 8 68 Ð\ ß )Ñ œ "log0) Ð\3 Ñ p D)! Ð)Ñ a.s. T)! . 8 8 3œ" So assuming enough regularity conditions to make this almost sure convergence uniform in ), when )! governs the distribution of the observations there is perhaps hope that a single maximizer of 8" 68 Ð\ 8 ß )Ñ will eventually exist and be close to the maximizer of D)! Ð)Ñ, namely )! . To make this precise is not so easy. See Theorems 7.49 and 7.54 of Schervish. Now turn to Cramer type results. Theorem 77 Suppose that \" ß \# ß ÞÞÞß \8 are iid T) , @ § e" and there exists an open neighborhood of )! , say S, such that i) 0) ÐBÑ ž ! aB and a) - S, ii) aBß 0) ÐBÑ is differentiable at every point ) - S, and iii) E)! log0) Ð\Ñ exists for all ) - S and E)! log0)! Ð\Ñ is finite. Then, if % ž ! and $ ž !, b 8 such that a 7 ž 8, the T)! probability that the equation . 7 "log0) Ð\3 Ñ œ ! . ) 3œ" has a root within % of )! is at least " • $ Þ (Essentially this theorem ducks the uniformity issue raised earlier by considering 68 at only )! ß )! • % and )! € %.) In general, Theorem 77 doesn't immediately do what we'd like. The root guaranteed by the Theorem isn't necessarily an MLE. And in fact, when the likelihood equation can have multiple roots, we haven't even yet defined a sequence of random variables, let alone a sequence of estimators converging to )! . But the theorem can form the basis for simple results that are of statistical importance. Corollary 78 Under the hypotheses of Theorem 77, suppose that (in addition) @ is open and aBß 0) ÐBÑ is positive and differentiable at every point ) - @ so that the likelihood equation . 8 "log0) Ð\3 Ñ œ ! . ) 3œ" makes sense at all points ) - @. 
Then define 38 to be the root of the likelihood equation when there is exactly one (and otherwise adopt any arbitrary definition for 38 ). If with T)! probability approaching 1 the likelihood equation has a single root 38 p )! in T)! probability Þ -33- Where there is no guarantee that the likelihood equation will eventually have only a single root, one could use Theorem 77 to argue that defining <8 Ð)! Ñ œ œ ! if the likelihood equation has no root the root closest to )! otherwise , <8 Ð)! Ñ p )! in T)! probability. (The continuity of the likelihood implies that limits of roots are roots, so for )! in an open @, the definition of <8 Ð)! Ñ above eventually makes sense.) But, of course, this is of no statistical help. What is similar in nature but of more help is the next corollary. Corollary 79 Under the hypotheses of Theorem 77, if ÖX8 × is a sequence of estimators consistent at )! and if the likelihood equation has no roots X8 s )8 » œ the root closest to X8 otherwise , then s ) 8 p )! in T)! probability Þ In practical terms, finding the root of the likelihood equation nearest to X8 may be a numerical problem. (Actually, if it is, it's not so clear how you know you have solved the problem. But we'll gloss over that.) Finding a root beginning from X8 might be approached using something like Newton's method. That is, abbreviating P8 Ð)Ñ » 68 Ð\ ß )Ñ œ "log0) Ð\3 Ñ , 8 8 3œ" the likelihood equation is Pw8 Ð)Ñ œ ! Þ One might try looking for a root of this equation by iterating using linearizations of Pw8 Ð)Ñ (presumably beginning at X8 ). Or, assuming that Pww8 ÐX8 Ñ Á ! taking a single step in this iteration process, a "-step likelihood-based modification/?improvement? on X8 , w ˜)8 œ X8 • P8 ÐX8 Ñ , Pww8 ÐX8 Ñ is suggested. It is plausible that an estimator like )˜ 8 might have asymptotic behavior similar to that of an MLE. (See page 442 of TPE or Theorem 7.75 of Schervish, that treats a more general multiparameter M-estimation version of this idea.) The next matter up is the asymptotic distribution of likelihood-based estimators. The following is representative of what people have proved. (I suspect that the hypotheses here are stronger than are really needed to get the conclusion, but they make for a simple proof.) -34- Theorem 80 Suppose that \" ß \# ß ÞÞÞß \8 are iid T) , @ § e" and there exists an open neighborhood of )! , say S, such that i) 0) ÐBÑ ž ! aB and a) - S, ii) aBß 0) ÐBÑ is three times differentiable at every point ) - S, iii) b Q ÐBÑ ! with E)! Q Ð\Ñ • _ and ¹ .$ log0) ÐBѹ Ÿ Q ÐBÑ aB and a) - S ß . )$ iv) " œ ' 0) ÐBÑ. .ÐBÑ can be differentiated twice wrt ) under the integral at )! , # i.e. both ! œ ' ..) 0) ÐBѹ . .ÐBÑ and ! œ ' ..)# 0) ÐBѹ . .ÐBÑ, and # v) M" Ð)Ñ » E) ˆ ..) log0) Ð\щ - Ð!ß _ Ñ a) - S. If with T) probability approaching 1, Ös ) 8 × is a root of the likelihood equation, and s )8p ) œ)! ) œ)! ! )! in T)! probabilityß then under T)! _ È8Šs ) 8 • )! ‹ Ä NÐ!ß " Ñ Þ M" Ð)! Ñ There are multiparameter versions of this theorem (indeed the Schervish theorem is a multiparameter one). And there are similar (multiparameter) versions of this for the "Newton 1-step approximate MLE's" based on consistent estimators of ) (i.e. )˜ 8 's). The practical implication of a result like this theorem is that one can make approximate confidence intervals for ) of the form s )8 „ D " Þ È8M" Ð)! Ñ Well, it is almost a practical implication of the theorem. One needs to do something about the fact that M" Ð)! 
Ñ is unknown when making a confidence interval so that it can't be used in the formula. There are two common routes to fixing this problem. The first is to use M" Ðs ) 8 Ñ in its place. This method is often referred to as using the "expected Fisher information" in the making of the interval. (Note that M" Ð)! Ñ is a measure of curvature of D)! Ð)Ñ at its maximum (located at )! Ñ while M" Ðs ) 8 Ñ is a measure of curvature of Ds) Ð)Ñ at 8 s its maximum (located at ) 8 ).) Dealing with the theory of the first method is really very simple. It is easy to prove the following. Corollary 81 Under the hypotheses of Theorem 80, if M" Ð)Ñ is continuous at )! , then under T)! _ É8M" Ðs ) 8 )Šs ) 8 • )! ‹ Ä NÐ!ß "Ñ Þ -35- (Actually, the continuity of M" Ð)Ñ at )! may well follow from the hypotheses of Theorem 80 and therefore be redundant in the hypotheses of this corollary. I've not tried to think this issue through.) The second method of approximating M" Ð)! Ñ comes about by realizing that in the proof of Theorem 80, M" Ð)! Ñ arises as the limit of • "n Pww Ð)! ÑÞ One might then hope to approximate it by " ww s " 8 .# • P Ð) 8 Ñ œ • " # log0) Ð\3 ѹ s ß ) œ) 8 8 8 3œ" . ) which is sometimes called the "observed Fisher information." (Note that • 8" Pww Ðs ) 8 Ñ is a measure of curvature of the average log likelihood function " 68 Ð\ 8 ß )Ñ at s ) 8 , which 8 one is secretly thinking of as essentially a maximizer of the average log likelihood.) Schervish says that Efron and others claim that this second method is generally better than the first (in terms of producing intervals with real coverage properties close to the nominal properties). Proving an analogue of Corollary 81 where • 8" Pww Ðs ) 8 Ñ replaces ) 8 ) is somewhat harder, but not by any means impossible. That is, one can prove the M" Ðs following. Corollary 82 Under the hypotheses of Theorem 80 and T)! , _ É • Pww Ðs ) 8 ÑŠs ) 8 • )! ‹ Ä NÐ!ß "Ñ Þ Under the hypotheses of Theorem 80, convergence in distribution of s ) 8 implies the s convergence of the second moment of ) 8 , so that Var)! È8Šs ) 8 • )! ‹ p i.e. Var)! ÐÈ8 Ðs ) 8 )Ñ p " ß M" Ð)! Ñ " Þ M" Ð)! Ñ Then the fact that the C-R inequality declares that under suitable conditions no unbiased estimator of ) has variance less than "ÎÐ8M" Ð)ÑÑ, makes one suspect that for a consistent sequence of estimators Ö)8‡ ×, convergence under )! _ È8Ð)8‡ • )! Ñ p NÐ!ß Z Ð)! ÑÑ might imply that Z Ð)! Ñ " M" Ð)! Ñ , but in fact this inequality does not have to hold up. -36- Hypothesis Testing (Simple vs Simple, Composite Hypotheses, UMP Tests, Confidence Regions, UMPU Tests, Likelihood Ratio and Related Tests) We now consider the classical hypothesis testing problem. Here we treat @ œ @! • @" where @! • @" œ g and the hypotheses H3 :) - @3 3 œ !ß ". As is standard, we will call H! the "null hypothesis" and H" the "alternative hypothesis." @3 with one element is called "simple" and with more than one element is called "composite." Deciding between H! and H" on the basis of data can be thought of in terms of a twoaction decision problem. That is, with T œ Ö!ß "× one is using \ to decide in favor of one of the two hypotheses. Typically one wants a loss structure where correct decisions are preferable to incorrect ones. An obvious (and popular) possibility is !-" loss, PÐ)ß +Ñ œ MÒ) - @! ß + œ "Ó € MÒ) - @" ß + œ !Ó . 
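Before going further into testing, here is a small numerical sketch tying together the one-step Newton update and the two information-based intervals just discussed. Everything concrete (the Cauchy location model, the sample median as the consistent starting estimator, n, and the seed) is an assumption of mine for illustration; the Cauchy family is convenient because its per-observation expected information is the constant 1/2, so the "expected information" and "observed information" intervals genuinely differ.

```python
# Sketch (illustrative assumptions): one-step Newton update from a consistent starting
# estimator, plus Wald-type intervals from expected and observed Fisher information.
import numpy as np

rng = np.random.default_rng(1)
theta_true, n = 2.0, 400
x = theta_true + rng.standard_cauchy(size=n)

def score(theta):       # L_n'(theta) = sum_i d/dtheta log f_theta(x_i) = sum_i 2u/(1+u^2)
    u = x - theta
    return np.sum(2 * u / (1 + u ** 2))

def neg_hess(theta):    # -L_n''(theta) = sum_i (2 - 2u^2)/(1+u^2)^2
    u = x - theta
    return np.sum((2 - 2 * u ** 2) / (1 + u ** 2) ** 2)

t_n = np.median(x)                               # consistent starting estimator
theta_hat = t_n + score(t_n) / neg_hess(t_n)     # one step: T_n - L_n'(T_n)/L_n''(T_n)

z = 1.96                                         # approximate 0.975 standard normal quantile
I1 = 0.5                                         # expected information per observation (Cauchy)
obs_info = neg_hess(theta_hat) / n               # observed information -(1/n) L_n''(theta_hat)

ci_expected = (theta_hat - z / np.sqrt(n * I1), theta_hat + z / np.sqrt(n * I1))
ci_observed = (theta_hat - z / np.sqrt(n * obs_info), theta_hat + z / np.sqrt(n * obs_info))
print(theta_hat, ci_expected, ci_observed)
```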
While it is perfectly sensible to simply treat hypothesis testing as a two-action decision problem, being older than formal statistical decision theory, it has its own peculiar/specialized set of jargon and emphases that we will have to recognize. These include the following. Definition 60 A nonrandomized decision function $ À k pÖ!ß "× is called a hypothesis test. Definition 61 The rejection region for a nonrandomized test $ is ÖBl $ ÐBÑ œ "×Þ Definition 62 A type I error in a testing problem is taking action 1 when ) - @! . A type II error in a testing problem is taking action 0 when ) - @" Þ Note that in a two-action decision problem, a behavioral decision rule 9B is for each B a distribution over Ö!ß "×, which is clearly characterized by 9B ÐÖ"×Ñ - Ò!ß "Ó. This notion can be expressed in other terms. Definition 63 A randomized hypothesis test is a function 9 À k pÒ!ß "Ó with the interpretation that if 9ÐBÑ œ 1, one rejects H! with probability 1. Note that for !-" loss, for ) - @! ß VÐ)ß 9Ñ œ E) MÒ+ œ "Ó œ E) 9Ð\Ñ the type I error probability, while for ) - @" , VÐ)ß 9Ñ œ E) MÒ+ œ !Ó œ E) Ð" • 9Ð\ÑÑ œ " • E) 9Ð\Ñ -37- the type II error probability. Most of testing theory is phrased in terms of E) 9Ð\Ñ that appears in both of these expressions. Definition 64 The power function of the (possibly randomized) test 9 is "9 Ð)Ñ » E) 9Ð\Ñ œ ( 9ÐBÑ.T) ÐBÑ Þ Definition 65 The size or level of a test 9 is sup "9 Ð)ÑÞ ) - @! This is the maximum probability of a type I error. Standard testing theory proceeds (somewhat asymmetrically) by setting a maximum allowable size and then (subject to that restriction) trying to maximize the power (uniformly) over @" Þ (That is, subject to a size limitation, standard theory seek to minimize type II error probabilities.) Recall that Result 21 says that if X is sufficient for ), the set of behavioral decision rules that are functions of X is essentially complete. Arguing that fact in the present case is very easy/straightforward. For any randomized test 9, consider 9‡ ÐBÑ œ EÒ9lX ÓÐBÑ , which is clearly a UÐX Ñ measurable function taking values in Ò!ß "Ó, i.e. is a hypothesis test based on X . Note that "9‡ Ð)Ñ œ E) 9‡ Ð\Ñ œ E) EÒ9lX Ó œ E) 9Ð\Ñ œ "9 Ð)Ñ , and 9‡ and 9 have the same power functions and are thus equivalent. Simple vs Simple Testing Problems Consider first testing problems where @! œ Ö)! × and @" œ Ö)" × (with !-" loss). In this finite @ decision problem, we can for each test 9 define the risk vector ÐVÐ)! ß 9Ñß V Ð)" ß 9ÑÑ œ Ð"9 Ð)! Ñß " • "9 Ð)" ÑÑ and make the standard two-dimensional plot of risk vectors. A completely equivalent device is to plot instead the points Ð"9 Ð)! Ñß "9 Ð)" ÑÑ . Lemma 83 For a simple versus simple testing problem, the risk set f has the following properties: i) f is convex, ii) f is symmetric wrt the point Ð "# ß "# Ñ, iii) Ð!ß "Ñ - f and Ð"ß !Ñ - f , and iv) f is closed. -38- Lemma 83' (see page 77 of TSH) For a simple versus simple testing problem, the set i » ÖÐ"9 Ð)! Ñß "9 Ð)" ÑÑl9 is a test× has the following properties: i) i is convex, ii) i is symmetric wrt the point Ð "# ß "# Ñ, iii) Ð!ß !Ñ - i and Ð"ß "Ñ - i , and iv) i is closed. As indicated earlier, the usual game in testing is to fix ! ! and maximum power on @" . ! and look for a test with size Definition 66 For testing H! À ) œ )! vs H" À ) œ )" , a test is called most powerful of size ! if "9 Ð)! Ñ œ ! and for any other test 9‡ with "9‡ Ð)! Ñ Ÿ !, "9‡ Ð)" Ñ Ÿ "9 Ð)" Ñ. The fundamental result of all of testing theory is the Neyman-Pearson Lemma. 
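Before stating the lemma, here is a tiny numerical sketch of Definitions 63-65. The binomial model, the cutoff, and the randomization probability are assumptions of mine chosen only for illustration; the point is that the power function is computed by averaging the test function against the model, and the size is the supremum of that power over the null.

```python
# Sketch (illustrative assumptions): power function and size of a randomized test
# phi(x) = 1 if x > c, gamma if x = c, 0 if x < c, for X ~ Binomial(n, p) and
# hypotheses H0: p <= 0.5 vs H1: p > 0.5.
import numpy as np
from scipy.stats import binom

n, c, gamma = 20, 13, 0.4          # cutoff and randomization chosen only for illustration

def power(p):
    # beta_phi(p) = E_p phi(X) = P_p[X > c] + gamma * P_p[X = c]
    return binom.sf(c, n, p) + gamma * binom.pmf(c, n, p)

# phi is nondecreasing in x, so beta_phi is nondecreasing in p for this family;
# the supremum over p <= 0.5 (the size) is therefore attained at p = 0.5.
print("size:", power(0.5))
for p in (0.55, 0.6, 0.7, 0.8):
    print(p, power(p))
```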
Theorem 84 (The Neyman-Pearson Lemma) Suppose that c ¥ ., a 5-finite measure .T and let 03 œ ..)3 . i) (Sufficiency) Any test of the form Ú" 9ÐBÑ œ Û / ÐBÑ Ü! if 0" ÐBÑ ž 50! ÐBÑ if 0" ÐBÑ œ 50! ÐBÑ if 0" ÐBÑ • 50! ÐBÑ (æ) for some 5 - Ò!ß_Ñ and / À k pÒ!ß "Ó is most powerful of its size. Further, corresponding to the 5 œ _ case, the test 9ÐBÑ œ œ " ! if 0! ÐBÑ œ ! otherwise (ææ) is most powerful of size ! œ !. ii) (Existence) For every ! - Ò!ß "Ó, there exists a test of form (æ) with / ÐBÑ œ / (a constant in Ò!ß "Ó) or a test of form (ææ) with "9 Ð)! Ñ œ !. iii) (Uniqueness) If 9 is a most powerful test of size !, then it is of form (æ) or form (ææ) except possibly for B - R with T)! ÐRÑ œ T)" ÐRÑ œ !Þ Note that for any two distributions there is always a dominating 5-finite measure (. œ T)! € T)" will do the job). Further, it's easy to verify that the tests prescribed by this theorem do not depend upon the dominating measure. Note also that one could take a more explicitly decision theoretic slant on this whole business. Thinking of testing as a two-decision problem with !-" loss and prior distribution K, abbreviating KÐÖ)! ×Ñ œ 1! and KÐÖ)1 ×Ñ œ 11 , it's easy to see that a Bayesian is a Neyman-Pearson tester with 5 œ 1! Î1" and that a Neyman-Pearson tester is a closet Bayesian with 1! œ 5ÎÐ5 € "Ñ and 1" œ "ÎÐ" € 5Ñ. There is a kind of "finite @! vs simple @" " version of the N-P Lemma. See TSH page 96. -39- A final small fact about simple versus simple testing (that turns out to be useful in some composite versus composite situations) is next. Lemma 85 Suppose that ! - Ð!ß "Ñ and 9 is most powerful of size !. Then "9 Ð)" Ñ and the inequality is strict unless 0! œ 0" a.e. .. ! UMP Tests for Composite vs Composite Problems Now consider cases where @! and @" need not be singletons. Definition 67 9 is UMP (uniformly most powerful) of size ! for testing H! À ) - @! vs H" À ) - @" provided 9 is of size ! (i.e. sup "9 Ð)Ñ œ !) and for any other 9w of size ) - @! no more than !, "9w Ð)Ñ Ÿ "9 Ð)Ñ a) - @" Þ Problems for which one can find UMP tests are not all that many. Known results cover situations where @ § e" involving one-sided hypotheses in monotone likelihood ratio families, a few special cases that yield to a kind of "least favorable distribution" argument and an initially somewhat odd-looking two-sided problem in one parameter exponential families. Definition 68 Suppose @ § e" and c ¥ ., a 5-finite measure on Ðk ß U Ñ. Suppose that X À k pe" . The family of R-N derivatives Ö0) × is said to have the monotone likelihood ratio (MLR) property in X if for every )" • )# belonging to @, the ratio 0)# ÐBÑÎ0)" ÐBÑ is nondecreasing in X ÐBÑ on ÖBl0)" ÐBÑ € 0)# ÐBÑ ž !×. (c is also said to have the MLR property.) Theorem 86 Suppose that Ö0) × has MLR in X . Then for testing H! À ) Ÿ )! vs H" À ) ž )! , for each ! - Ð!ß "Ó b a test of the form Ú" 9! ÐBÑ œ Û / Ü! if X ÐBÑ ž if X ÐBÑ œ if X ÐBÑ • - for - - Ò • _ß_Ñ chosen to satisfy "9 Ð)! Ñ œ !Þ Such a test i) has strictly increasing power function, ii) is UMP size ! for these hypotheses, and iii) uniformly minimizes "9 Ð)Ñ for ) • )! for tests with "9 Ð)! Ñ œ !. Theorem 87 Suppose that Ö0) × has MLR in X and for state space @ and action space T œ Ö!ß "× the loss function PÐ)ß +Ñ satisfies PÐ)ß "Ñ • PÐ)ß !Ñ • ! for ) ž )! , and -40- PÐ)ß "Ñ • PÐ)ß !Ñ ž ! for ) • )! Þ Then the class of size ! tests (! - Ð!ß "Ñ) identified in Theorem 86 is essentially complete in the class of tests with size in Ð!ß "Ñ. 
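The existence argument in part ii) of Theorem 84, and the MLR tests of Theorem 86, amount to choosing a cutoff and a randomization probability so that the power at the boundary equals a prescribed level. Here is a sketch for a binomial example; the model, n, p0, and alpha are my assumptions, not the notes'. Since the Binomial(n, p) family has MLR in x, Theorem 86 says the resulting test is UMP of size alpha for H0: p <= p0 vs H1: p > p0.

```python
# Sketch (illustrative assumptions): solve beta_phi(p0) = alpha for the cutoff c and
# randomization gamma of the test  phi(x) = 1 if x > c, gamma if x = c, 0 if x < c.
import numpy as np
from scipy.stats import binom

def cutoff_and_gamma(n, p0, alpha):
    k = np.arange(n + 1)
    c = int(np.argmax(binom.sf(k, n, p0) <= alpha))             # smallest c with P_p0[X > c] <= alpha
    gamma = (alpha - binom.sf(c, n, p0)) / binom.pmf(c, n, p0)  # randomization fills in the remaining size
    return c, gamma

n, p0, alpha = 20, 0.5, 0.05
c, gamma = cutoff_and_gamma(n, p0, alpha)
size = binom.sf(c, n, p0) + gamma * binom.pmf(c, n, p0)         # equals alpha by construction
power_at_07 = binom.sf(c, n, 0.7) + gamma * binom.pmf(c, n, 0.7)
print(c, gamma, size, power_at_07)
```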
One two-sided problem a single real parameter for which it possible to find UMP tests is covered by the next theorem (that is proved in Section 4.3.4 of Schervish using the generalization of the N-P lemma alluded to earlier). Theorem 88 Suppose @ § e" and c ¥ ., a 5-finite measure on Ðk ß U Ñ and 0) ÐBÑ œ expa+Ð)Ñ € )X ÐBÑb a.e. . a). Suppose further that )" • )# belong to @. A test of the form Ú" 9ÐBÑ œ Û /3 Ü! if -" • X ÐBÑ • -# if X ÐBÑ œ -3 3 œ "ß # if X ÐBÑ • -" or X ÐBÑ ž -# for -" Ÿ -# minimizes "9 Ð)Ñ a ) • )" and a ) ž )# over choices of tests 9w with "9w Ð)" Ñ œ "9 Ð)" Ñ and "9w Ð)2 Ñ œ "9 Ð)2 Ñ. Further, if "9 Ð)" Ñ œ "9 Ð)# Ñ œ !, then 9 is UMP level ! for testing H! À ) Ÿ )1 or ) )2 vs H" À )" • ) • )# . As far as finding UMP tests for parameter spaces other than subsets of e" is concerned, about the only "general" technique known is one discussed in Section 3.8 of TSHÞ The method is not widely applicable, but does yield a couple of important results. It is based on an "equalizer rule"/"least favorable distribution" type argument. The idea is that for a few composite vs composite problems, we can for each )" - @" guess at a distribution A over @! so that with 0A ÐBÑ œ ( 0) ÐBÑ. AÐ)Ñ @! simple versus simple testing of H! À 0A vs H" À 0)" produces a size ! N-P test whose power over @! is bounded by !. (Notice by the wayß that this is a Bayes test in the composite vs composite problem against a prior that is a mixture of A and a point mass at )" ÞÑ Anyhow, when that test is independent of )" we get UMP size ! for the original problem. Definition 69 For testing H! À \ µ 0A vs H" À \ µ 0)" , A is called least favorable at level ! if for 9A the N-P size ! test associated with A, "9A Ð)" Ñ • "9Aw Ð)" Ñ for any other distribution over @! , Aw . -41- Theorem 89 If 9 is of size ! for testing H! À ) - @! vs H" À ) œ )" , then it is of size no more than ! for testing H! À \ µ 0A vs H" À \ µ 0)" . Further, if 9 is most powerful of size ! for testing H! À \ µ 0A vs H" À \ µ 0)" and is also size ! for testing H! À ) - @! vs H" À ) œ )" , then it is a MP size ! test for this set of hypotheses and A is least favorable at level !. Clearly, if in the context of Theorem 89, the same 9 is most powerful size ! for each )" , that test is then UMP size ! for the hypotheses H! À ) - @! vs H" À ) - @" . Confidence Sets Derived From Tests (and Optimal Ones Derived From UMP Tests) Definition 70 Suppose that for each B - k ß WÐBÑ is a (measurable) subset of @. The family ÖWÐBÑl B - k × is called a # level family of confidence sets for ) provided T) Ò) - WÐBÑÓ # a) - @ . It is possible (and quite standard) to use tests to produce confidence procedures. That goes as follows. For each )! - @, let 9)! be a size ! test of H! À ) œ )! vs H" À ) - @" (where )! - @ • @" ) and define W9 ÐBÑ œ Ö)l 9) ÐBÑ • "× (W9 ÐBÑ is the set of )'s for which there is positive conditional probability of accepting H! having observed \ œ B.) ÖW9 ÐBÑl B - k × is a level " • ! family of confidence sets for ). This rigmarole works perfectly generally and inversion of tests of point null hypotheses is a time-honored means of constructing confidence procedures. In the context of UMP testing, the construction produces confidence sets with an optimality property. Theorem 90 Suppose that for each )! ß OÐ)! Ñ § @ with )!  OÐ)! Ñ and 9)! is a nonrandomized UMP size ! test of H! À ) œ )! vs H" À ) - OÐ)! Ñ Then if W9 ÐBÑ is the confidence procedure derived from the tests 9)! and WÐBÑ is any other level Ð" • !Ñ confidence procedure T) Ò)! 
- WÐ\ÑÓ T) Ò)! - W9 Ð\ÑÓ a ) - OÐ)! Ñ Þ (That is, the W9 construction produces confidence procedures that minimize the probabilities of covering incorrect values.) UMPU Tests for Composite versus Composite Problems Cases in which one can find UMP tests are not all that many. This should not be terribly surprising given our experience to date in Stat 643. We are used to not being able to find uniformly best statistical procedures. In estimation problems we found that by restricting the class of estimators under consideration (e.g. to the unbiased or equivariant estimators) -42- we could sometimes find uniformly best rules in the restricted class. Something similar is sometimes possible in composite versus composite testing problems. Definition 71 A size ! test of H! À ) - @! vs H" À ) - @" is unbiased provided " 9 Ð) Ñ ! a ) - @" Þ Definition 72 An unbiased size ! test of H! À ) - @! vs H" À ) - @" is UMPU (uniformly most powerful unbiased) level ! if for any other size ! unbiased test 9w " 9 Ð) Ñ " 9 w Ð) Ñ a ) - @ " . The conditions under which people can identify UMPU tests require that @ be a topological space. (One needs notions of interior, boundary, continuity of functions, etc.) So in this section make that assumption. – – Definition 73 The boundary set is @B » @! • @" . Lemma 91 If "9 Ð)Ñ is continuous and 9 is unbiased of size !, then "9 Ð)Ñ œ ! a ) - @B . Definition 74 A test is said to be !-similar on the boundary if " 9 Ð) Ñ œ ! a ) - @ B Þ When looking for UMPU tests it often suffices to restrict attention to tests !-similar on the boundary. Lemma 92 Suppose that "9 Ð)Ñ is continuous for every 9. If 9 is UMP among tests !similar on the boundary and has level !, then it is UMPU level !. This lemma and Theorem 88 (or the generalized Neyman-Pearson lemma that stands behind it) can be used to prove a kind of "reciprocal" of Theorem 88 for one parameter exponential families. Theorem 93 Suppose @ § e" and c ¥ ., a 5-finite measure on Ðk ß U Ñß and 0) ÐBÑ œ expa+Ð)Ñ € )X ÐBÑb a.e. . a). Suppose further that )" • )# belong to @. A test of the form Ú" 9ÐBÑ œ Û /3 Ü! if X ÐBÑ • -" or X ÐBÑ ž -# if X ÐBÑ œ -3 3 œ "ß # if -" • X ÐBÑ • -# for -" Ÿ -# maximizes "9 Ð)Ñ a ) • )" and a ) ž )# over choices of tests 9w with "9w Ð)" Ñ œ "9 Ð)" Ñ and "9w Ð)2 Ñ œ "9 Ð)2 Ñ. Further, if "9 Ð)" Ñ œ "9 Ð)# Ñ œ !, then 9 is UMPU level ! for testing H! À ) - Ò)" ß )# Ó vs H" À )  Ò)" ß )# ÓÞ -43- Theorem 93 deals with an interval null hypothesis. Dealing with a point null hypothesis turns out to require slightly different arguments. Theorem 94 Suppose that for @ an open interval in e" , c ¥ ., a 5-finite measure on Ðk ß U Ñ and 0) ÐBÑ œ expa+Ð)Ñ € )X ÐBÑb a.e. . a). Suppose that 9 is a test of the form given in Theorem 93 and that " 9 Ð)! Ñ œ ! and "9w Ð)! Ñ œ ! . Then 9 is UMPU size ! for testing H! À ) œ )! vs H" À ) Á )! . Theorems 93 and 94 are about cases where @ § e" (and, of course, limited to one exponential families). It is possible to find some UMPU tests for hypotheses with "Ð5 • "Ñ-dimensional" boundaries in "5 -dimensional" @ problems using a methodology exploiting "Neyman structure" in a statistic X . I may write up some notes on this methodology (and again I may not). It is discussed in Section 4.5 of Schervish. A general kind of result that comes from the arguments is next. Theorem 95 Suppose that \ œ Ð\" ß \# ß ÞÞÞß \5 Ñ takes values in e5 and the distributions T( have densities with respect to a 5-finite measure . on e5 of the form 0( ÐBÑ œ OÐ(Ñexp"(3 B3 5 . 
3œ" Suppose further that the natural parameter space > contains an open rectangle in e5 and let Y œ Ð\# ß ÞÞÞß \5 Ñ. Suppose ! - Ð!ß "ÑÞ i) For H! À (" Ÿ ("‡ vs H" À (" ž ("‡ or for H! À (" Ÿ ("‡ or (" ("‡‡ vs H" À (" - Ð("‡ ß ("‡‡ Ñ, for each ? b a conditional on Y œ ? UMP level ! test 9?w ÐB" Ñ, and the test 9ÐBÑ œ 9?w ÐB" Ñ is UMPU level !. ii) For H! À (" - Ò("‡ ß ("‡‡ Ó vs H" À ("  Ò("‡ ß ("‡‡ Ó or for H! À (" œ ("‡ vs H" À (" Á ("‡ for each ? b a conditional on Y œ ? UMPU level ! test 9?w ÐB" Ñ, and the test 9ÐBÑ œ 9?w ÐB" Ñ is UMPU level !. -44- This theorem says that if one thinks of ( œ Ð(" ß (# ß ÞÞÞß (5 Ñ as containing the parameter of interest (" and "nuisance parameters" (# ß ÞÞÞß (5 , at least in some exponential families it is possible to find "optimal" tests about the parameter of interest. Likelihood Ratio Tests, Related Tests and Confidence Procedures The various optimality theories of classical testing do not work for all that many complicated models. As such, if one wants to do testing in a complicated model some general purpose tools are needed for making tests and confidence procedures. One such set of tools consists of the "likelihood ratio test," its relatives and related confidence procedures. The following is an introduction to these methods and a little about their asymptotics. For c œ ÖT) ×, H! À ) - @! and H" À ) - @" one might plausibly consider a test criterion like sup 0) ÐBÑ ) - @" PVÐBÑ œ Þ sup 0) ÐBÑ ) - @! (Notice, by the way, that Bayes tests for 0-1 loss look at ratios of K averages rather than sups.) Or equivalently, one might consider sup 0) ÐBÑ ) -ÐBÑ œ maxÐPVÐBÑß "Ñ œ - @ ß sup 0) ÐBÑ ) - @! where we will want to reject H! for large -ÐBÑ. This is an intuitively reasonable criterion, and many common tests (including optimal ones) turn out to be likelihood ratio tests. On the other hand, it is easy enough to invent problems where LRTs need not be in any sense optimal. It is a useful fact that one can find some general results about how to set critical values for LRTs (at least for iid models and large 8). This is because if there is an MLE, say )^, sup 0) ÐBÑ œ 0s) ÐBÑ )-@ and we can make use of MLE-like estimation results. For example, in the iid context of Theorem 80, there is the following. -45- Theorem 96 Suppose that \" ß \# ß ÞÞÞß \8 are iid T) , @ § e" and there exists an open neighborhood of )! , say S, such that i) 0) ÐBÑ ž ! aB and a) - S, ii) aBß 0) ÐBÑ is three times differentiable at every point ) - S, iii) b Q ÐBÑ ! with E)! Q Ð\Ñ • _ and ¹ .$ log0) ÐBѹ Ÿ Q ÐBÑ aB and a) - S ß . )$ iv) " œ ' 0) ÐBÑ. .ÐBÑ can be differentiated twice wrt ) under the integral at )! , # i.e. both ! œ ' ..) 0) ÐBѹ . .ÐBÑ and ! œ ' ..)# 0) ÐBѹ . .ÐBÑ, and v) M" Ð)Ñ » ) œ)! # . E) ˆ .) log0) Ð\щ ) œ)! - Ð!ß _ Ñ a) - S. If with T)! probability approaching 1, Ös ) 8 × is a root of the likelihood equation, and s )8p )! in T)! probabilityß then under T)! #log-8‡ » #logŽ 0s8 Ð\ 8 Ñ _ )8 p ;#" • 8 8 0)! Ð\ Ñ Þ (The hypotheses here are exactly the hypotheses of Theorem 80.) The application that people often make of this theorem is that they reject H! À ) œ )! in favor of H" À ) Á )! when #log-8‡ exceeds the Ð" • !Ñ quantile of the ;#" distribution. Notice also that upon inverting these tests, the theorem gives a way of making large sample confidence sets for ). When s ) 8 is a maximizer of the log likelihood, the set of )! for which H! À ) œ )! will be accepted using the LRT with an (approximate) ;#" cut-off value (say ;#" ) is Ö)l P8 Ð)Ñ " P8 Ðs ) 8 Ñ • ;#" × . 
# There are versions of the ;# limit that cover (composite) null hypotheses about some part of a multi-dimensional parameter vector. For example, Schervish (page 459) proves a result like the following. Result 97 Suppose that @ § e: and appropriate regularity conditions hold for iid observations \" ß \# ß ÞÞÞß \8 . If 5 Ÿ : and sup 0)8 Ð\ 8 Ñ ) -8 œ sup 0)8 Ð\ 8 Ñ ! ) s.t. ) 3 œ )3 3 œ "ß #ß ÞÞÞß 5 then under any T) with )3 œ )3! 3 œ "ß #ß ÞÞÞß 5 _ #log-8 p ;#5 . -46- This result (of course) allows one to set approximate critical points for testing H! À )3 œ )3! 3 œ "ß #ß ÞÞÞß 5 vs H" À not H! , by rejecting H! when #log-8 ž the Ð" • !Ñ ;#5 quantile . And, one can make large 8 approximate confidence sets for Ð)" ß )# ß ÞÞÞß )5 Ñ of the form ÖÐ)" ß ÞÞÞß )5 Ñl P8 ÐÐ)" ß ÞÞÞß )5 ß )5€" ß ÞÞÞß ): ÑÑ sup )5€" ß ÞÞÞß ): " sup P8 Ð)Ñ • ;#5 × # ) for ;#5 the Ð" • !Ñ quantile of the ;#5 distribution. There is some standard jargon associated with the function of Ð)" ß ÞÞÞß )5 Ñ appearing in the above expressions. Definition 75 The function of Ð)" ß ÞÞÞß )5 Ñ P‡8 ÐÐ)" ß ÞÞÞß )5 ÑÑ » sup P8 ÐÐ)" ß ÞÞÞß )5 ß )5€" ß ÞÞÞß ): ÑÑ )5€" ß ÞÞÞß ): is called a profile likelihood function for the parameters )" ß ÞÞÞß )5 . With this jargon/notation the form of the large 8 confidence set becomes ÖÐ)" ß ÞÞÞß )5 Ñl P‡8 ÐÐ)" ß ÞÞÞß )5 ÑÑ " P‡8 ÐÐ)" ß ÞÞÞß )5 ÑÑ • ;5# × , sup # Ð)" ß ÞÞÞß )5 Ñ which looks more like the form from the case of a one-dimensional parameter. There are alternatives to the LRT of H! À )3 œ )3! 3 œ "ß #ß ÞÞÞß 5 vs H" À not H! . One such test is the so called "Wald Test." See, for example, pages 115-120 of Silvey. The following is borrowed pretty much directly from Silvey. Consider @ § e: and 5 restrictions on ) 2" Ð)Ñ œ 2# Ð)Ñ œ â œ 25 Ð)Ñ œ ! , (like, for instance, )" • )"! œ )# • )#! œ â œ )5 • )5! œ !ÑÞ Suppose that )^8 is an MLE of ) and define 2Ð)Ñ » Ð2" Ð)Ñß 2# Ð)Ñß ÞÞÞß 25 Ð)ÑÑ Þ Then if H! À 2" Ð)Ñ œ 2# Ð)Ñ œ â œ 25 Ð)Ñ œ ! is true, 2Ð )^8 Ñ ought to be near 0. The question is how to measure "nearness" and to set a critical value in order to have a test with size approximately !. An approach to doing this is as follows. -47- We expect (under suitable conditions) that under T) the (:-dimensional) estimator )^8 has _ È8Ð )^8 • )Ñ p N: Ð!ß M"•" Ð)ÑÑ . Then if `23 L Ð) Ñ » Œ º• ß ` )4 ) 5‚: the ? method suggests that _ È8Š2Ð)^8 Ñ • 2Ð)Ñ‹p N5 Ð!ß L Ð)ÑM"•" Ð)ÑL w Ð)ÑÑ Þ Abbreviate LÐ)ÑM"•" Ð)ÑL w Ð)Ñ as FÐ)ÑÞ Since 2Ð)Ñ œ ! if H! is true, the above suggests that _ 82w Ð)^8 ÑF •" Ð)Ñ2Ð)^8 Ñ p ;#5 under H! Þ Now 82w Ð)^8 ÑF •" Ð)Ñ2Ð )^8 Ñ can not serve as a test statistic, since it involves ), which is not completely specified by H! Þ But it is plausible to consider the statistic [8 » 82w Ð)^8 ÑF •" Ð)^8 Ñ2Ð )^8 Ñ and hope that under suitable conditions _ [8 p ;#5 under H! Þ If this can be shown to hold up, one may reject for [8 ž Ð" • !Ñ quantile of the ;#5 distribution. The Wald tests for 23 Ð)Ñ œ )3 • )3! 3 œ "ß #ß ÞÞÞß 5 can be inverted to get approximate confidence regions for 5 Ÿ : of the parameters )" ß )# ß ÞÞÞß ): . The resulting confidence regions will, by construction, be ellipsoidal. There is a final alternative to the LRT and Wald tests that deserves mention. That is the ";# test." See again Silvey. The notion here is that on occasion it can be easier to maximize P8 Ð)Ñ subject to 2Ð)Ñ œ ! (again taking this to be a statement of 5 constraints on a :-dimensional parameter vector )) than to simply maximize P8 Ð)Ñ. Let )˜ 8 be a "restricted" MLE (i.e. 
a maximizer of P8 Ð)Ñ subject to 2Ð)Ñ œ !). One might expect that if H! À 2Ð)Ñ œ ! is true, then )˜ 8 ought to be nearly an unrestricted maximizer of P8 Ð)Ñ and the partial derivatives of P8 Ð)Ñ should be nearly 0 at )˜ 8 . On the other hand, if H! is not true, there is little reason to expect the partials to be nearly 0. So (again/still with P8 Ð)Ñ œ !83œ" log0) Ð\3 Ñ) one might consider the statistic

\ # » ("Î8) Š ` P8 Ð)ÑÎ` ) ¹)œ)˜ 8 ‹w M"•" Ð)˜ 8 Ñ Š ` P8 Ð)ÑÎ` ) ¹)œ)˜ 8 ‹ Þ

Then, for

-8 œ sup)-@ 0)8 Ð\ 8 Ñ Î sup) s.t. 2Ð)Ñœ! 0)8 Ð\ 8 Ñ ß

it is typically the case that under H! ß \ # differs from 2log-8 by a quantity tending to 0 in T) probability. That is, this statistic can be calibrated using ;# critical points and can form the basis of a test of H! asymptotically equivalent to a "Result 97 type" likelihood ratio type test. Once more (as for the Wald tests) for 23 Ð)Ñ œ )3 • )3! 3 œ "ß #ß ÞÞÞß 5 , one can invert these tests to get confidence regions for 5 Ÿ : of the parameters )" ß )# ß ÞÞÞß ): .

As a final note on this topic, observe that
i) an LRT of the style in Result 97 requires computation of an MLE, )^8 , and a restricted MLE, )˜ 8 ,
ii) a Wald test requires only computation of the MLE, )^8 , and
iii) a ";# " test requires only computation of the restricted MLE, )˜ 8 .

Some Simple Theory of Markov Chain Monte Carlo

It is sometimes the case that one has a "known" but unwieldy (joint) distribution for a large dimensional random vector ] , and would like to know something about the distribution (like, for example, the marginal distribution of ]" ). While in theory it might be straightforward to compute the quantity of interest, in practice the necessary computation can be impossible. With the advent of cheap computing, given appropriate theory the possibility exists of simulating a sample of 8 realizations of ] and computing an empirical version of the quantity of interest to approximate the (intractable) theoretical one.

This possibility is especially important to Bayesians, where \l) µ T) and ) µ K produce a joint distribution for Ð\ß )Ñ and then (in theory) a posterior distribution KÐ)l\Ñß the object of primary Bayesian interest. Now Ðin the usual kind of notationÑ 1Ð)l\ œ BÑ º 0) ÐBÑ1Ð)Ñ so that in this sense the posterior of the possibly high dimensional ) is "known." But it may well be analytically intractable, with (for example) no analytical way to compute E)" . Simulation is a way of getting (approximate) properties of an intractable/high dimensional posterior distribution.

"Straightforward" simulation from an arbitrary multivariate distribution is, however, typically not (straightforward). Markov Chain Monte Carlo methods have recently become very popular as a solution to this problem. The following is some simple theory covering the case where the (joint) distribution of ] is discrete.

Definition 76 A (discrete time/discrete state space) Markov Chain is a sequence of random quantities Ö\5 ×, each taking values in a (finite or) countable set k , with the property that T Ò\8 œ B8 l\" œ B" ß ÞÞÞß \8•" œ B8•" Ó œ T Ò\8 œ B8 l\8•" œ B8•" Ó .

Definition 77 A Markov Chain is stationary provided T Ò\8 œ Bl\8•" œ Bw Ó is independent of 8. WOLOG we will henceforth name the elements of k with the integers "ß #ß $ß ÞÞÞ and call them "states."

Definition 78 With :34 » T Ò\8 œ 4l\8•" œ 3Ó, the square matrix T » a:34 b is called the transition matrix for a stationary Markov Chain and the :34 are called transition probabilities.
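As a small illustration of Definitions 76-78 (the three-state chain below is made up, not from the notes), the following sketch builds a transition matrix T, checks its row sums, and simulates the chain from an arbitrary starting state. The long-run state frequencies it prints can be compared with the invariant distribution discussed next.

```python
# Sketch (made-up 3-state chain): a transition matrix and a simulated stationary Markov chain.
import numpy as np

P = np.array([[0.5, 0.3, 0.2],     # p_ij = P[X_n = j | X_{n-1} = i]
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])
assert np.allclose(P.sum(axis=1), 1.0)    # transition matrix rows sum to 1

rng = np.random.default_rng(2)
x = [0]                                   # arbitrary starting state
for _ in range(10_000):
    x.append(rng.choice(3, p=P[x[-1]]))   # next state drawn from row X_{n-1} of P

# long-run fraction of time spent in each state
print(np.bincount(x, minlength=3) / len(x))
```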
Note that a transition matrix has nonnegative entries and row sums of 1. Such matrices are often called "stochastic" matrices. As a matter of further notation for a stationary Markov Chain, let -50- 5 :34 œ T Ò\8€5 œ 4l\8 œ 3Ó and 0345 œ T Ò\8€5 œ 4ß \8€5•" Á 4ß ÞÞÞß \8€" Á 4l\8 œ 3Ó . (These are respectively the probabilities of moving from 3 to 4 in 5 steps and first moving from 3 to 4 in 5 steps.) Definition 79 We say that a MC is irreducible if for each 3 and 4 b 5 (possibly depending 5 upon 3 and 4) such that :34 ž !. (A chain is irreducible if it is possible to eventually get from any state 3 to any other state 4.) Definition 80 We say that the 3th state of a MC is transient if ! 0335 • " and say that the _ state is persistent if ! 0335 œ "Þ A chain is called persistent if all of its states are 5œ" _ 5œ" persistent. Definition 81 We say that state 3 of a MC has period > if :335 œ ! unless 5 œ / > (5 is a multiple of >) and > is the largest integer with this property. The state is aperiodic if no such > ž " exists. And a MC is called aperiodic if all of its states are aperiodic. Many sources (including Chapter 15 of the 3rd Edition of Feller Volume 1) present a number of useful simple results about MC's. Among them are the following. Theorem 98 All states of an irreducible MC are of the same type (with regard to persistence and periodicity). Theorem 99 A finite state space irreducible MC is persistent. -51- Theorem 100 Suppose that a MC is irreducible, aperiodic and persistent. Suppose further that for each state 3 the mean recurrence time is finite, i.e. "50335 • _ . _ 5œ" Then an invariant/stationary distribution for the MC exists, i.e. b Ö?4 × with ?4 ž ! and !?4 œ " such that ?4 œ "?3 :34 . 3 (If the chain is started with distribution Ö?4 ×, after one transition it is in states "ß #ß $ß ÞÞÞ with probabilities Ö?4 ×Þ) Further, this distributionÖ?4 × satisfies 5 ?4 œ lim :34 a3 , 5p_ and ?4 œ " ! 50445 _ Þ 5œ" There is a converse of this theorem. Theorem 101 An irreducible, aperiodic MC for which b Ö?4 × with ?4 ž ! and !?4 œ " such that ?4 œ !?3 :34 must be persistent with ?4 œ 3 " Þ _ ! 50445 5œ" With this background, the basic idea of MCMC is the following. If we wish to simulate from a distribution Ö?4 ×, we find a convenient MC Ö\5 × whose invariant distribution is Ö?4 ×. From a starting state \! œ 3, one uses T to simulate \" Þ Using the realization \" œ B" and T , one simulates \# , etc. If the conclusions of Theorem 100 are in force, after an initial "burn in" of 7 periods T Ò\7 œ 4l\! œ 3Ó ¸ ?4 independent of 3 and \7 is approximately a realization from Ö?4 ×. Two important questions about this plan (only one of which we'll address) are: 1) How does one set up an appropriate/convenient T ? 2) How big should 7 be? (We'll not touch this matter.) -52- In answer to question 1), there are presumably a multitude of chains that would do the job. But there is the following useful sufficient condition (that has application in the original motivating problem of simulating from high dimensional distributions) for a chain to have Ö?4 × for an invariant distribution. Lemma 102 If Ö\5 × is a MC with transition probabilities satisfying ?3 :34 œ ?4 :43 æ , then it has invariant distribution Ö?4 ×. Note then that if a candidate T satisfies æ and is irreducible and aperiodic, Theorem 101 shows that it is persistent and Theorem 100 then shows that any arbitrary starting value can be used and yields approximate realizations from Ö?4 ×. 
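The detailed balance condition æ in Lemma 102 is easy to check numerically. In the sketch below (the graph and the chain are my own example, not from the notes) the chain is simple random walk on a small connected, non-bipartite graph: with ?3 proportional to the degree d_i and :34 = 1/d_i for each neighbor j of i, one gets ?3 :34 = ?4 :43 = 1/(sum of the degrees), so æ holds and Ö?4 × is an invariant distribution, exactly as Lemma 102 asserts.

```python
# Sketch (made-up example): detailed balance u_i p_ij = u_j p_ji for simple random walk
# on a graph, and the resulting invariance u P = u promised by Lemma 102.
import numpy as np

# adjacency matrix of a small connected, non-bipartite graph (states 0..3)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
deg = A.sum(axis=1)
P = A / deg[:, None]        # p_ij = 1/d_i for each neighbor j of i, 0 otherwise
u = deg / deg.sum()         # candidate invariant distribution, u_i proportional to d_i

balance = u[:, None] * P                     # matrix of u_i p_ij
assert np.allclose(balance, balance.T)       # detailed balance: u_i p_ij = u_j p_ji
print(np.allclose(u, u @ P))                 # hence u is invariant, as Lemma 102 asserts
```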
Several useful MCMC schemes can be shown to have the "right" invariant distributions by noting that they satisfy æ. For example, Lemma 102 can be applied to the "Metropolis-Hastings Algorithm." That is, let X œ a>34 b be any stochastic matrix corresponding to an irreducible aperiodic MC. (Presumably, some choices will be better that others in terms of providing quick convergence to a target invariant distribution. But that issue is beyond the scope of the present exposition.) (Note that in a finite case, one can take >34 œ "ÎÐthe number of states).) The Metropolis-Hastings algorithm simulates as follows: •Supposing that \8•" œ 3, generate N (at random) according to the distribution over the state space specified by row 3 of X , that is according to Ö>34 ×. •Then generate \8 based on 3 and (the randomly generated) N according to Ú N with probability minŠ"ß ??N3 >>3NN 3 ‹ \8 œ Û Þ ææ ? > Ü 3 with probability max Š!ß " • ?N3 >3NN 3 ‹ Note that for Ö\5 × so generated, for 4 Á 3 :34 œ T Ò\8 œ 4l\8•" œ 3Ó œ min Œ"ß ?4 >43 •>34 ?3 >34 and :33 œ T Ò\8 œ 3l\8•" œ 3Ó œ >33 € "maxŒ!ß " • 4Á3 So, for 3 Á 4 ?3 :34 œ minÐ?3 >34 ß ?4 >43 Ñ œ ?4 :43 . -53- ?4 >43 •>34 ?3 >34 . That is, æ holds and the MC Ö\5 × has stationary distribution Ö?4 ×. (Further, the assumption that X corresponds to an irreducible aperiodic chain implies that Ö\5 × is irreducible and aperiodic.) Notice, by the way, that in order to use the Metropolis-Hastings Algorithm one only has to have the ?4 's up to a multiplicative constant. That is, if we have @4 º ?4 , we may compute the ratios ?4 Î?3 as @4 Î@3 and are in business as far as simulation goes. (This fact is especially attractive in Bayesian contexts where it can be convenient to know a posterior only up to a multiplicative constant.) Notice also that if X is symmetric, (i.e. >34 œ >43 ), step ææ of the Metropolis algorithm becomes Ú \8 œ Û Ü3 N with probability minŠ"ß ??N3 ‹ with probability max Š!ß " • ?N ?3 ‹ Þ An algorithm similar to the Metropolis-Hastings algorithm is the "Barker Algorithm." The Barker algorithm modifies the above by replacing minŒ"ß ?N >N 3 ?N >N 3 • with ?3 >3N ?3 >3N € ?N >N 3 in ææ. Note that for this modification of the Metropolis algorithm, for 4 Á 3 :34 œ Œ ?4 >43 •>34 ?3 >34 € ?4 >43 , so ?3 :34 œ Ð?3 >34 ÑÐ?4 >43 Ñ œ ?4 :43 ?3 >34 € ?4 >43 . That is, æ holds and thus Lemma 102 guarantees that Ö\5 × has invariant distribution Ö?4 ×Þ (And X irreducible and aperiodic continues to imply that Ö\5 × is also.) Note also that since ?4 ?4 >43 ?3 >43 œ ? ?3 >34 € ?4 >43 >34 € ?43 >43 once again it suffices to know the ?4 up to a multiplicative constant in order to be able to implement Barker's algorithm. (This is again of comfort to Bayesians.) Finally, note that if X is symmetric, step ææ of the Barker algorithm becomes N \8 œ ž 3 with probability with probability -54- ?N ?3 €?N ?3 ?3 €?N Þ To finally get to the motivating application of this whole business, consider now the "Gibbs Sampler," or as Schervish correctly argues that it should be termed, "Successive Substitution Sampling," for generating an observation from a high dimensional distribution. For sake of concreteness, consider the situation where the distribution of a discrete $-dimensional random vector Ð] ß Z ß [ Ñ with probability mass function 0 ÐCß @ß AÑ is at issue. In the general MC set-up, a possible outcome ÐCß @ß AÑ represents a "state" and the distribution 0 ÐCß @ß AÑ is the "Ö?4 ×" from which on wants to simulate. One defines a MC Ö\5 × as follows. 
For an arbitrary starting state ÐC! ß @! ß A! Ñ:

•Generate \" œ Ð]" ß @! ß A! Ñ by generating ]" from the conditional distribution of ] lZ œ @! and [ œ A! , i.e. from the (conditional) distribution with probability function 0] lZ ß[ ÐCl@! ß A! Ñ » 0 ÐCß @! ß A! ÑÎ!6 0 Ð6ß @! ß A! Ñ . Let C" be the realized value of ]" .

•Generate \# œ ÐC" ß Z" ß A! Ñ by generating Z" from the conditional distribution of Z l] œ C" and [ œ A! , i.e. from the (conditional) distribution with probability function 0Z l] ß[ Ð@lC" ß A! Ñ » 0 ÐC" ß @ß A! ÑÎ!6 0 ÐC" ß 6ß A! Ñ . Let @" be the realized value of Z" .

•Generate \$ œ ÐC" ß @" ß [" Ñ by generating [" from the conditional distribution of [ l] œ C" and Z œ @" , i.e. from the (conditional) distribution with probability function 0[ l] ßZ ÐAlC" ß @" Ñ » 0 ÐC" ß @" ß AÑÎ!6 0 ÐC" ß @" ß 6Ñ . Let A" be the realized value of [" .

•Generate \4 œ Ð]2 ß @1 ß A1 Ñ by generating ]2 from the conditional distribution of ] lZ œ @1 and [ œ A1 , i.e. from the (conditional) distribution with probability function 0] lZ ß[ ÐCl@1 ß A1 Ñ » 0 ÐCß @1 ß A1 ÑÎ!6 0 Ð6ß @1 ß A1 Ñ . Let C2 be the realized value of ]2 .

And so on, for some appropriate number of cycles.

Note that with this algorithm, a typical transition probability (for a step where a ] realization is going to be generated) is

T Ò\8 œ ÐCw ß @ß AÑl\8•" œ ÐCß @ß AÑÓ œ 0 ÐCw ß @ß AÑÎ!6 0 Ð6ß @ß AÑ

so if \8•" has distribution 0 , the probability that \8 œ ÐCß @ß AÑ is

!6w 0 Ð6w ß @ß AÑ † 0 ÐCß @ß AÑÎ!6 0 Ð6ß @ß AÑ œ 0 ÐCß @ß AÑ ß

that is, \8 also has distribution 0 Þ And analogous results hold for transitions of all 3 types (] ß Z and [ realizations).

But it is not absolutely obvious how to apply Theorems 100 and 101 here, since the chain is not stationaryÞ The transition mechanism is not the same for a "] " transition as it is for a "Z " transition. The key to extricating oneself from this problem is to note that if T] ß TZ and T[ are respectively the transition matrices for ] ß Z and [ exchanges, then T œ T] TZ T[ describes an entire cycle of the SSS algorithm. Then Ö\5w × defined by \8w œ \$8 is a stationary Markov Chain with transition matrix T Þ The fact that T] ß TZ and T[ all leave the distribution 0 invariant then implies that T also leaves 0 invariant. So one is in the position to apply Theorems 101 and 100. If T is irreducible and aperiodic, Theorem 101 says that the chain Ö\5w × is persistent and then Theorem 100 says that 0 can be simulated using an arbitrary starting state.

A (Very) Little Bit About Bootstrapping

Suppose that the entries of \ œ Ð\" ß \# ß ÞÞÞß \8 Ñ are iid according to some unknown (possibly multivariate) distribution J , and for a random function VÐ\ß J Ñ there is some aspect of the distribution of VÐ\ß J Ñ that is of statistical interest. For example, in a univariate context, we might be interested in the (J ) probability that

VÐ\ß J Ñ œ Ð – \ • .J Ñ Î ÐWÎÈ8Ñ

has absolute value less than #Þ Often, even if one knew J , the problem of finding properties of the distribution of VÐ\ß J Ñ would be analytically intractable. (Of course, knowing J , simulation might provide at least some approximation for the property of interest ...)

As a means of possibly approximating the distribution of VÐ\ß J Ñ, Efron suggested the following. Suppose that J^ is an estimator of J . In parametric contexts, where )^Ð\Ñ is an estimator of ), this could be J^ œ J)^ . In nonparametric contexts, if J^ 8 is the empirical distribution of the sample \ , we might take J^ œ J^ 8 . Suppose then that \ ‡ œ Ð\"‡ ß ÞÞÞß \8‡ Ñ has entries that are an iid sample from J^Þ One might hope that (at least in some circumstances)

the J distribution of VÐ\ß J Ñ ¸ the J^ distribution of VÐ\ ‡ ß J^Ñ .

(This will not always be a reasonable approximation.) Now even the right side of this expression will typically be analytically intractable. What, however, can be done is to simulate from this distribution. (Where one cannot simulate from the distribution on the left because J is not known, simulation from the distribution on the right is possible.) That is, suppose that \ ‡" ß \ ‡# ß ÞÞÞß \ ‡F are F (bootstrap) iid samples of size 8 from J^Þ Using these, one may compute

V3‡ œ VÐ\ ‡3 ß J^Ñ

and use the empirical distribution of ÖV3‡ l3 œ "ß ÞÞÞß F × to approximate the J^ distribution of VÐ\ ‡ ß J^Ñ.