Stat 643 Results, Fall 2000
Sufficiency (and Ancillarity and Completeness)

Roughly, $T(X)$ is sufficient for $\{P_\theta\}_{\theta\in\Theta}$ (or for $\theta$) provided the conditional distribution of $X$ given $T(X)$ is the same for each $\theta\in\Theta$.

Example  Let $X=(X_1,X_2,\dots,X_n)$, where the $X_i$ are iid Bernoulli$(p)$ and $T(X)=\sum_{i=1}^n X_i$. You may show that $P_p[X=x\mid T=t]$ is the same for each $p$.
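Not part of the original notes, but the claim in this example is easy to check by brute-force enumeration; the Python sketch below computes $P_p[X=x\mid T=t]$ for every binary $x$ and two very different values of $p$.

```python
from itertools import product
from math import isclose

def cond_dist(n, p):
    """For X = (X_1,...,X_n) iid Bernoulli(p) and T = sum(X_i), return
    the conditional pmf P_p[X = x | T = t] as a dict keyed by (x, t)."""
    joint = {}
    for x in product([0, 1], repeat=n):
        t = sum(x)
        joint[x] = (t, p ** t * (1 - p) ** (n - t))
    # marginal P_p[T = t]
    pt = {}
    for x, (t, prob) in joint.items():
        pt[t] = pt.get(t, 0.0) + prob
    return {(x, t): prob / pt[t] for x, (t, prob) in joint.items()}

# The conditional distribution is the same for every p: each x with T(x) = t
# gets probability 1 / C(n, t), which is free of p.
d1 = cond_dist(4, 0.3)
d2 = cond_dist(4, 0.8)
same = all(isclose(d1[k], d2[k]) for k in d1)
print(same)
```

Each conditional probability comes out as $1/\binom{n}{t}$ regardless of $p$, which is exactly the "same for each $p$" claim.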
In some sense, a sufficient statistic carries all the information in $X$. (A variable with the same distribution as $X$ can be constructed from $T(X)$ and some randomization whose form does not depend upon $\theta$.) The $64 question is how to recognize one. The primary tool is the "Factorization Theorem" in its various incarnations. This says roughly that $T$ is sufficient iff
$$f_\theta(x) = g_\theta(T(x))\,h(x)\,.$$
Example (Page 20 Lehmann TSH)  Suppose that $X$ is discrete. Then $T(X)$ is sufficient for $\theta$ iff there exist nonnegative functions $g_\theta(\cdot)$ and $h(\cdot)$ such that $P_\theta[X=x]=g_\theta(T(x))h(x)$. (You can and should prove this yourself.)
The Stat 643 version of the above result is considerably more general and relies on measure-theoretic machinery. Basically, one can "easily" push the factorization result as far as "families of distributions dominated by a $\sigma$-finite measure." The following standard notation will be used throughout these notes:

$(\mathcal{X},\mathcal{B},P_\theta)$  a probability space (for fixed $\theta$)
$\mathcal{P}=\{P_\theta\}_{\theta\in\Theta}$
$E_\theta(g(X)) = \int g\,dP_\theta$
$T:(\mathcal{X},\mathcal{B})\to(\mathcal{T},\mathcal{F})$  a statistic
$\mathcal{B}_0 = \mathcal{B}(T) = \{T^{-1}(F)\}_{F\in\mathcal{F}}$  the $\sigma$-algebra generated by $T$
Definition 1  $T$ (or $\mathcal{B}_0$) is sufficient for $\mathcal{P}$ if for every $B\in\mathcal{B}$ there exists a $\mathcal{B}_0$-measurable function $Y$ such that for all $\theta\in\Theta$
$$Y = E_\theta(I_B\mid\mathcal{B}_0) \text{ a.s. } P_\theta\,.$$
(Note that $E_\theta(I_B\mid\mathcal{B}_0) = P_\theta(B\mid\mathcal{B}_0)$. This definition says that there is a single random variable that will serve as a conditional probability of $B$ given $\mathcal{B}_0$ for all $\theta$ in the parameter space.)
Definition 2  $\mathcal{P}$ is dominated by a $\sigma$-finite measure $\mu$ if $P_\theta \ll \mu\ \forall\,\theta\in\Theta$.

Then, assuming that $\mathcal{P}$ is dominated by $\mu$, standard notation is to let
$$f_\theta = \frac{dP_\theta}{d\mu}$$
and it is these objects (R-N derivatives) that factor nicely iff $T$ is sufficient. To prove this involves some nontrivial technicalities.
Definition 3  $\mathcal{P}$ has a countable equivalent subset if there exist a sequence of positive constants $\{c_i\}_{i=1}^\infty$ with $\sum c_i = 1$ and a corresponding sequence $\{\theta_i\}_{i=1}^\infty$ such that $\lambda = \sum c_i P_{\theta_i}$ is equivalent to $\mathcal{P}$ in the sense that $\lambda(B)=0$ iff $P_\theta(B)=0\ \forall\,\theta$.

Notation: Assuming that $\mathcal{P}$ has a countable equivalent subset, let $\lambda$ be a choice of $\sum c_i P_{\theta_i}$ as in Definition 3.
Theorem 1 (Page 575 Lehmann TSH)  $\mathcal{P}$ is dominated by a $\sigma$-finite measure $\mu$ iff it has a countable equivalent subset.

Theorem 2 (Page 54 Lehmann TSH)  Suppose that $\mathcal{P}$ is dominated by a $\sigma$-finite measure and $T:(\mathcal{X},\mathcal{B})\to(\mathcal{T},\mathcal{F})$. Then $T$ is sufficient for $\mathcal{P}$ iff there exist nonnegative $\mathcal{F}$-measurable functions $g_\theta$ such that $\forall\,\theta$
$$\frac{dP_\theta}{d\lambda}(x) = g_\theta(T(x)) \text{ a.s. } \lambda\,.$$
(This last equality is equivalent to $P_\theta(B) = \int_B g_\theta(T(x))\,d\lambda(x)\ \forall\,B\in\mathcal{B}$.)
Theorem 3 (Page 55 Lehmann TSH) ("The" Factorization Theorem)  Suppose that $\mathcal{P}\ll\mu$ where $\mu$ is $\sigma$-finite. Then $T$ is sufficient for $\mathcal{P}$ iff there exist a nonnegative $\mathcal{B}$-measurable function $h$ and nonnegative $\mathcal{F}$-measurable functions $g_\theta$ such that $\forall\,\theta$
$$\frac{dP_\theta}{d\mu}(x) = f_\theta(x) = g_\theta(T(x))\,h(x) \text{ a.s. } \mu\,.$$

Standard/elementary applications of Theorem 3 are to cases where:
1) $\mu$ is Lebesgue measure on $\mathbb{R}^d$, $\mathcal{P}$ is some parametric family of $d$-dimensional distributions, and the $f_\theta$ are $d$-dimensional probability densities, or
2) $\mu$ is counting measure on some countable set, $\mathcal{P}$ is some parametric family of discrete distributions and the $f_\theta$ are probability mass functions.

But it can also be applied much more widely (e.g. to cases where $\mu$ is the sum of $d$-dimensional Lebesgue measure and counting measure on some countable subset of $\mathbb{R}^d$).
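To make application 1) concrete, here is a small numerical sketch (my own choice of model, not from the notes): for an iid N$(\theta,1)$ sample, $f_\theta(x)=\prod_i \phi(x_i-\theta)$ factors as $g_\theta(T(x))h(x)$ with $T(x)=\sum x_i$, $g_\theta(t)=\exp(\theta t - n\theta^2/2)$ and $h(x)=(2\pi)^{-n/2}\exp(-\sum x_i^2/2)$.

```python
import math

def f(theta, xs):
    """Joint N(theta, 1) density of the sample xs."""
    return math.prod(
        math.exp(-(xi - theta) ** 2 / 2) / math.sqrt(2 * math.pi) for xi in xs
    )

def g(theta, t, n):
    """Factor that touches the data only through t = sum(xs)."""
    return math.exp(theta * t - n * theta ** 2 / 2)

def h(xs):
    """Factor that is free of theta."""
    n = len(xs)
    return (2 * math.pi) ** (-n / 2) * math.exp(-sum(xi ** 2 for xi in xs) / 2)

xs = [0.3, -1.2, 2.5, 0.0]
for theta in (-1.0, 0.0, 0.7):
    assert math.isclose(f(theta, xs), g(theta, sum(xs), len(xs)) * h(xs))
print("factorization holds")
```

By Theorem 3, this algebraic identity is precisely why $\sum X_i$ is sufficient for $\theta$ in the normal-mean model.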
Example  Theorem 3 can be used to show that for $\mathcal{X}=\mathbb{R}^d$, $X=(X_1,X_2,\dots,X_d)$, $\mathcal{B}=\mathcal{B}^d$ the $d$-dimensional Borel $\sigma$-algebra, $\pi=(\pi_1,\pi_2,\dots,\pi_d)$ a generic permutation of the first $d$ positive integers, $\pi(X)=(X_{\pi_1},X_{\pi_2},\dots,X_{\pi_d})$, $\mu$ the $d$-dimensional Lebesgue measure and $\mathcal{P}=\{P \mid P\ll\mu$ and $f=\frac{dP}{d\mu}$ satisfies $f(\pi(x))=f(x)$ a.s. $\mu$ for all permutations $\pi\}$, the order statistic $T(X)=(X_{(1)},X_{(2)},\dots,X_{(d)})$ is sufficient.

Actually, it is an important fact (that does not follow from the factorization theorem) that one can establish the sufficiency of the order statistic even more broadly. With $\mathcal{P}=\{P$ on $\mathbb{R}^d$ such that $P(B)=P(\pi(B))$ for all permutations $\pi\}$, direct argument shows that the order statistic is sufficient. (You should try to prove this. For $B\in\mathcal{B}$, try $Y(x)$ equal to the fraction of all permutations that place $x$ in $B$.)
(Obvious) Result  If $T(X)$ is sufficient for $\mathcal{P}$ and $\mathcal{P}_0$ is a subset of $\mathcal{P}$, $T(X)$ is sufficient for $\mathcal{P}_0$.

A natural question is "How far can one go in the direction of 'reducing' $X$ without throwing something away?" The notion of minimal sufficiency is aimed in the direction of answering this question.
Definition 4  A sufficient statistic $T:(\mathcal{X},\mathcal{B})\to(\mathcal{T},\mathcal{F})$ is minimal sufficient (for $\theta$ or for $\mathcal{P}$) provided for every sufficient statistic $S:(\mathcal{X},\mathcal{B})\to(\mathcal{S},\mathcal{G})$ there exists a (?measurable?) function $u:\mathcal{S}\to\mathcal{T}$ such that
$$T = u\circ S \text{ a.e. } \mathcal{P}\,.$$
(It is probably equivalent or close to equivalent that $\mathcal{B}(T)\subset\mathcal{B}(S)$.)

There are a couple of ways of identifying a minimal sufficient statistic.
Theorem 4 (Like page 92 of Schervish ... other development on pages 41-43 of Lehmann TPE)  Suppose that $\mathcal{P}$ is dominated by a $\sigma$-finite measure $\mu$ and $T:(\mathcal{X},\mathcal{B})\to(\mathcal{T},\mathcal{F})$ is sufficient for $\mathcal{P}$. Suppose that there exist versions of the R-N derivatives $\frac{dP_\theta}{d\mu}(x)=f_\theta(x)$ such that
(the existence of $k(x,y)>0$ such that $f_\theta(y)=f_\theta(x)k(x,y)\ \forall\,\theta$) implies $T(x)=T(y)$;
then $T$ is minimal sufficient.

(A trivial "generalization" of Theorem 4 is to suppose that the condition only holds for a set $\mathcal{X}^*\subset\mathcal{X}$ with $P_\theta$ probability 1 for all $\theta$. The condition in Theorem 4 is essentially that $T(x)=T(y)$ iff $\frac{f_\theta(y)}{f_\theta(x)}$ is free of $\theta$.)

Note that sufficiency of $T(X)$ means that $T(x)=T(y)$ implies (the existence of $k(x,y)>0$ such that $f_\theta(y)=f_\theta(x)k(x,y)\ \forall\,\theta$). So the condition in Theorem 4 could be phrased as an iff condition.
Example  Let $X=(X_1,X_2,\dots,X_n)$, where the $X_i$ are iid Bernoulli$(p)$ and $T(X)=\sum_{i=1}^n X_i$. You may use Theorem 4 to show that $T$ is minimal sufficient.
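A quick numerical illustration (mine, not the notes') of the Theorem 4 criterion in this example: the likelihood ratio $f_p(y)/f_p(x)$ is free of $p$ exactly when $\sum y_i = \sum x_i$.

```python
import math

def f(p, x):
    """Bernoulli(p) joint pmf of a binary tuple x."""
    t = sum(x)
    return p ** t * (1 - p) ** (len(x) - t)

def ratio_free_of_p(x, y, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """True iff f_p(y)/f_p(x) is (numerically) constant over a grid of p."""
    ratios = [f(p, y) / f(p, x) for p in grid]
    return all(math.isclose(r, ratios[0]) for r in ratios)

x = (1, 0, 1, 0)        # T(x) = 2
y_same = (0, 1, 0, 1)   # T(y) = 2: ratio is constant (equal to 1) in p
y_diff = (1, 1, 1, 0)   # T(y) = 3: ratio is p/(1-p), which varies with p
print(ratio_free_of_p(x, y_same), ratio_free_of_p(x, y_diff))
```

The grid of $p$ values stands in for the "free of $\theta$" check; algebraically the ratio is $1$ in the first case and $p/(1-p)$ in the second.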
Lehmann emphasizes a second (?more technical and less illuminating?) method of establishing minimality based on the following two theorems.

Theorem 5 (Page 41 of Lehmann TPE)  Suppose that $\mathcal{P}=\{P_i\}_{i=0}^k$ is finite and let $\{f_i\}$ be the R-N derivatives with respect to some dominating $\sigma$-finite measure (like, e.g., $\sum_{i=0}^k P_i$). Suppose that all $k+1$ derivatives $f_i$ are positive on all of $\mathcal{X}$. Then
$$T(X) = \left(\frac{f_1(x)}{f_0(x)}, \frac{f_2(x)}{f_0(x)}, \dots, \frac{f_k(x)}{f_0(x)}\right)$$
is minimal sufficient.

(Again it is a trivial "generalization" to observe that the minimality also holds if there is a set $\mathcal{X}^*\subset\mathcal{X}$ with $P_\theta$ probability 1 for all $\theta$ where the $k+1$ derivatives $f_i$ are positive on all of $\mathcal{X}^*$.)
Theorem 6 (Page 42 of Lehmann TPE)  Suppose that $\mathcal{P}$ is a family of mutually absolutely continuous distributions and that $\mathcal{P}_0\subset\mathcal{P}$. If $T$ is sufficient for $\mathcal{P}$ and minimal sufficient for $\mathcal{P}_0$, then it is minimal sufficient for $\mathcal{P}$.

The way that Theorems 5 and 6 are used together is to look at a few $\theta$'s, invoke Theorem 5 to establish minimal sufficiency for the finite family of a vector made up of a few likelihood ratios and then use Theorem 6 to infer minimality for the whole family.
Example  Let $X=(X_1,X_2,\dots,X_n)$, where the $X_i$ are iid Beta$(\alpha,\beta)$. Using $P_0$ the Beta$(1,1)$ distribution, $P_1$ the Beta$(2,1)$ distribution and $P_2$ the Beta$(1,2)$ distribution, Theorems 5 and 6 can be used to show that $\left(\prod_{i=1}^n X_i,\ \prod_{i=1}^n (1-X_i)\right)$ is minimal sufficient for the Beta family.
Definition 5  A statistic $T:(\mathcal{X},\mathcal{B})\to(\mathcal{T},\mathcal{F})$ is said to be ancillary for $\mathcal{P}=\{P_\theta\}_{\theta\in\Theta}$ if the distribution of $T$ (i.e., the measure $P_\theta^T$ on $(\mathcal{T},\mathcal{F})$ defined by $P_\theta^T(F)=P_\theta(T^{-1}(F))$) does not depend upon $\theta$ (i.e. is the same for all $\theta$).

Definition 6  A statistic $T:(\mathcal{X},\mathcal{B})\to(\mathbb{R}^1,\mathcal{B}^1)$ is said to be first order ancillary for $\mathcal{P}=\{P_\theta\}_{\theta\in\Theta}$ if the mean of $T$ (i.e., $E_\theta(T(X))=\int T\,dP_\theta$) does not depend upon $\theta$ (i.e. is the same for all $\theta$).

Example  Let $X=(X_1,X_2,\dots,X_n)$, where the $X_i$ are iid N$(\theta,1)$. The sample variance of the $X_i$'s is ancillary.
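The reason is that writing $X_i = \theta + Z_i$ with $Z_i$ iid N$(0,1)$ makes the sample variance of the $X_i$ equal to that of the $Z_i$, whatever $\theta$ is. A Monte Carlo sketch (my own illustration) makes the shift argument visible: with a common random seed, the underlying $Z_i$ draws coincide, so the simulated sample variances agree up to rounding for two very different $\theta$'s.

```python
import random
import statistics

def sample_variances(theta, n=10, reps=20000, seed=0):
    """Simulate the sample variance of n iid N(theta, 1) draws, reps times."""
    rng = random.Random(seed)
    out = []
    for _ in range(reps):
        xs = [theta + rng.gauss(0, 1) for _ in range(n)]
        out.append(statistics.variance(xs))
    return out

m0 = statistics.mean(sample_variances(0.0))
m5 = statistics.mean(sample_variances(5.0))
# Both means sit near 1 (= E sample variance), independent of theta.
print(round(m0, 3), round(m5, 3))
```

The agreement of the two runs is exactly the "distribution free of $\theta$" property that Definition 5 asks for.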
Lehmann argues that, roughly speaking, models with nice (low dimensional) minimal sufficient statistics $T$ are ones where no nontrivial function of $T$ is ancillary (or first order ancillary). This line of thinking brings up the notion of completeness.
Notation: Continuing to let $T:(\mathcal{X},\mathcal{B})\to(\mathcal{T},\mathcal{F})$ and $P_\theta^T$ be the measure on $(\mathcal{T},\mathcal{F})$ defined by $P_\theta^T(F)=P_\theta(T^{-1}(F))$, let $\mathcal{P}^T=\{P_\theta^T\}_{\theta\in\Theta}$.

Definition 7a  $\mathcal{P}^T$ or $T$ is complete (boundedly complete) for $\mathcal{P}$ or $\theta$ if
i) $U:(\mathcal{X},\mathcal{B}_0)\to(\mathbb{R}^1,\mathcal{B}^1)$ $\mathcal{B}_0$-measurable (and bounded), and
ii) $E_\theta U(X) = 0\ \forall\,\theta$
imply that $U=0$ a.s. $P_\theta\ \forall\,\theta$.

By virtue of Lehmann's Theorem (see Cressie's Probability Summary) this is equivalent to the next definition.

Definition 7b  $\mathcal{P}^T$ or $T$ is complete (boundedly complete) for $\mathcal{P}$ or $\theta$ if
i) $h:(\mathcal{T},\mathcal{F})\to(\mathbb{R}^1,\mathcal{B}^1)$ $\mathcal{F}$-measurable (and bounded), and
ii) $E_\theta h(T(X)) = 0\ \forall\,\theta$
imply that $h\circ T = 0$ a.s. $P_\theta\ \forall\,\theta$.

These definitions say that there is no nontrivial function of $T$ that is an unbiased estimator of 0. This is equivalent to saying that there is no nontrivial function of $T$ that is first order ancillary. Notice also that completeness implies bounded completeness, so that bounded completeness is a slightly weaker condition than completeness.

Example  Let $X=(X_1,X_2,\dots,X_n)$, where the $X_i$ are iid Bernoulli$(p)$ and $T(X)=\sum_{i=1}^n X_i$. You may use the fact that polynomials can be identically 0 on an interval only if their coefficients are all 0, to show that $T(X)$ is complete for $p\in[0,1]$.
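The polynomial argument can be mirrored numerically (my sketch, not from the notes): $E_p h(T) = \sum_t h(t)\binom{n}{t}p^t(1-p)^{n-t}$ is a polynomial of degree $n$ in $p$, so if it vanishes at $n+1$ distinct values of $p$, the $(n+1)\times(n+1)$ linear system forces $h\equiv 0$.

```python
import copy
from math import comb

def expectation_matrix(n, ps):
    """One row per p, with entries C(n,t) p^t (1-p)^(n-t) for t = 0..n,
    so that (matrix @ h) lists E_p h(T) at each p."""
    return [[comb(n, t) * p ** t * (1 - p) ** (n - t) for t in range(n + 1)]
            for p in ps]

def free_column(mat):
    """Gaussian elimination with partial pivoting; return the index of a
    free column if mat is (numerically) singular, else None."""
    m = copy.deepcopy(mat)
    size = len(m)
    for col in range(size):
        piv = max(range(col, size), key=lambda r: abs(m[r][col]))
        if abs(m[piv][col]) < 1e-12:
            return col  # nontrivial null space: some h != 0 works
        m[col], m[piv] = m[piv], m[col]
        for r in range(size):
            if r != col:
                fac = m[r][col] / m[col][col]
                m[r] = [a - fac * b for a, b in zip(m[r], m[col])]
    return None  # full rank: only h = 0 gives E_p h(T) = 0 at all these p

n = 4
ps = [0.1, 0.25, 0.5, 0.75, 0.9]  # n + 1 distinct values of p
print(free_column(expectation_matrix(n, ps)))
```

Full rank here is the finite-dimensional shadow of completeness: no nonzero $h$ can make $E_p h(T)$ vanish even at these five $p$'s, let alone on all of $[0,1]$.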
(Obvious) Result 7  If $T$ is complete for $\mathcal{P}=\{P_\theta\}_{\theta\in\Theta}$ and $\Theta'\subset\Theta$, $T$ need not be complete for $\mathcal{P}'=\{P_\theta\}_{\theta\in\Theta'}$.

Theorem 8  If $\Theta'\subset\Theta$ and $\Theta'$ dominates $\Theta$ in the sense that $P_\theta(B)=0\ \forall\,\theta\in\Theta'$ implies that $P_\theta(B)=0\ \forall\,\theta\in\Theta$, then $T$ (boundedly) complete for $\mathcal{P}'=\{P_\theta\}_{\theta\in\Theta'}$ implies that $T$ is (boundedly) complete for $\mathcal{P}=\{P_\theta\}_{\theta\in\Theta}$.
Completeness of a statistic says that there is no part of it that is somehow irrelevant in and of itself to inference about $\theta$. That is reminiscent of minimal sufficiency.

Theorem 9 (Bahadur's Theorem)  Suppose that $T:(\mathcal{X},\mathcal{B})\to(\mathcal{T},\mathcal{F})$ is sufficient and boundedly complete.
i) (Pages 94-95 of Schervish) If $(\mathcal{T},\mathcal{F})=(\mathbb{R}^d,\mathcal{B}^d)$ then $T$ is minimal sufficient.
ii) If there is any minimal sufficient statistic, $T$ is minimal sufficient.

Example  Suppose that $\mathcal{P}\ll\mu$ (where $\mu$ is $\sigma$-finite) and that for $\theta\in\Theta\subset\mathbb{R}^k$
$$\frac{dP_\theta}{d\mu}(x) = c(\theta)\exp\left\{\sum_{i=1}^k \theta_i T_i(x)\right\}\,.$$
Provided, for example, that $\Theta$ contains an open rectangle in $\mathbb{R}^k$, Theorem 1 page 142 Lehmann TSH implies that $T(X)=(T_1(X),T_2(X),\dots,T_k(X))$ is complete. It is clearly sufficient by the factorization criterion. So by part i) of Theorem 9, $T$ is minimal sufficient.
Some Facts About Common Statistical Models

Bayes Models

The "Bayes" approach to statistical problems is to add additional model assumptions to the basic structure introduced thus far. That is, to the standard assumptions concerning $\mathcal{P}=\{P_\theta\}_{\theta\in\Theta}$, "Bayesians" add a distribution on the parameter space $\Theta$. That is, suppose that
$$G \text{ is a distribution on } (\Theta,\mathcal{C})$$
where it may be convenient to suppose that $G$ is dominated by some $\sigma$-finite measure $\nu$ and that
$$\frac{dG}{d\nu}(\theta) = g(\theta)\,.$$
If, considered as a function of both $x$ and $\theta$, $f_\theta(x)$ is $\mathcal{B}\times\mathcal{C}$ measurable, then there exists a probability (a joint distribution for $(X,\theta)$ on $(\mathcal{X}\times\Theta,\mathcal{B}\times\mathcal{C})$) defined by
$$\pi^{X,\theta}(A) = \int_A f_\theta(x)\,d(\mu\times G)(x,\theta) = \int_A f_\theta(x)g(\theta)\,d(\mu\times\nu)(x,\theta)\,.$$
This distribution has marginals
$$\pi^{X}(B) = \int_{B\times\Theta} f_\theta(x)\,d(\mu\times G)(x,\theta) = \int_B\int_\Theta f_\theta(x)\,dG(\theta)\,d\mu(x)$$
(so that $\frac{d\pi^{X}}{d\mu} = \int_\Theta f_\theta(x)\,dG(\theta) = \int_\Theta f_\theta(x)g(\theta)\,d\nu(\theta)$) and
$$\pi^{\theta}(C) = \int_{\mathcal{X}\times C} f_\theta(x)\,d(\mu\times G)(x,\theta) = \int_C\int_{\mathcal{X}} f_\theta(x)\,d\mu(x)\,dG(\theta) = G(C)\,.$$
Further, the conditionals (regular conditional probabilities) are computable as
$$\pi^{X\mid\theta}(B\mid\theta) = \int_B f_\theta(x)\,d\mu(x) = P_\theta(B)$$
and
$$\pi^{\theta\mid X}(C\mid x) = \int_C \frac{f_\theta(x)}{\int_\Theta f_\theta(x)\,dG(\theta)}\,dG(\theta)\,.$$
Note then that
$$\frac{d\pi^{\theta\mid X}}{dG}(\theta\mid x) = \frac{f_\theta(x)}{\int_\Theta f_\theta(x)\,dG(\theta)}$$
and thus that
$$\frac{d\pi^{\theta\mid X}}{d\nu}(\theta\mid x) = \frac{f_\theta(x)g(\theta)}{\int_\Theta f_\theta(x)g(\theta)\,d\nu(\theta)}$$
is the usual "posterior density" for $\theta$ given $X=x$.

Bayesians use the posterior distribution or density (THAT DEPENDS ON ONE'S CHOICE OF $G$) as their basis of inference for $\theta$. Pages 84-88 of Schervish concern a Bayesian formulation of sufficiency and argue (basically) that the Bayesian form is equivalent to the "classical" form used above.
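A standard concrete instance of the posterior-density formula (my example, not in the notes) is the Beta-Bernoulli pair: with prior $G=$ Beta$(a,b)$ on $p$ and $x$ a vector of $n$ iid Bernoulli$(p)$ observations with $t=\sum x_i$, the posterior is Beta$(a+t,\ b+n-t)$. The sketch below evaluates $f_p(x)g(p)/\int f_p(x)g(p)\,dp$ by a grid integral and compares it with the closed form.

```python
import math

def beta_pdf(p, a, b):
    """Density of the Beta(a, b) distribution at p."""
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return p ** (a - 1) * (1 - p) ** (b - 1) / B

def posterior_numeric(x, a, b, p):
    """f_p(x) g(p) / integral of f g, integral done by a midpoint grid."""
    t, n = sum(x), len(x)
    like = lambda q: q ** t * (1 - q) ** (n - t)
    grid = [(i + 0.5) / 10000 for i in range(10000)]
    norm = sum(like(q) * beta_pdf(q, a, b) for q in grid) / 10000
    return like(p) * beta_pdf(p, a, b) / norm

x = (1, 0, 1, 1, 0, 1)   # t = 4, n = 6
a, b = 2.0, 3.0
p = 0.55
num = posterior_numeric(x, a, b, p)
closed = beta_pdf(p, a + sum(x), b + len(x) - sum(x))  # Beta(a+t, b+n-t)
print(round(num, 4), round(closed, 4))
```

The two numbers agree to grid accuracy; the data enter only through $t=\sum x_i$, which previews the Bayesian view of sufficiency mentioned above.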
Exponential Families

Many common families of distributions can be treated with a single set of analyses by recognizing them to be of a common "exponential family" form. The following is a list of facts about such families.

Definition 8  Suppose that $\mathcal{P}=\{P_\theta\}$ is dominated by a $\sigma$-finite measure $\mu$. $\mathcal{P}$ is called an exponential family if for some $h(x)\ge 0$,
$$f_\theta(x) \equiv \frac{dP_\theta}{d\mu}(x) = \exp\left\{a(\theta) + \sum_{i=1}^k \eta_i(\theta)T_i(x)\right\}h(x)\ \forall\,\theta\,.$$
Definition 9  A family of probability measures $\mathcal{P}=\{P_\theta\}$ is said to be identifiable if $\theta_1\ne\theta_2$ implies that $P_{\theta_1}\ne P_{\theta_2}$.

Fact(s) 10  Let $\eta=(\eta_1,\eta_2,\dots,\eta_k)$, $\Gamma=\{\eta\in\mathbb{R}^k \mid \int h(x)\exp\{\sum\eta_i T_i(x)\}\,d\mu(x) < \infty\}$, and consider also the family of distributions (on $\mathcal{X}$) with R-N derivatives wrt $\mu$ of the form
$$f_\eta(x) = K(\eta)\exp\left\{\sum_{i=1}^k \eta_i T_i(x)\right\}h(x)\,,\quad \eta\in\Gamma\,.$$
Call this family $\mathcal{P}^*$. This set of distributions on $\mathcal{X}$ is at least as large as $\mathcal{P}$ (and this second parameterization is mathematically nicer than the first). $\Gamma$ is called the natural parameter space for $\mathcal{P}^*$. It is a convex subset of $\mathbb{R}^k$. (TSH, page 57.) If $\Gamma$ lies in a subspace of dimension less than $k$, then $f_\eta(x)$ (and therefore $f_\theta(x)$) can be written in a form involving fewer than $k$ statistics $T_i$. We will henceforth assume $\Gamma$ to be fully $k$ dimensional. Note that depending upon the nature of the functions $\eta_i(\theta)$ and the parameter space $\Theta$, $\mathcal{P}$ may be a proper subset of $\mathcal{P}^*$. That is, defining $\Gamma_\Theta = \{(\eta_1(\theta),\eta_2(\theta),\dots,\eta_k(\theta))\in\mathbb{R}^k \mid \theta\in\Theta\}$, $\Gamma_\Theta$ can be a proper subset of $\Gamma$.
Fact 11  The "support" of $P_\theta$, defined as $\{x\mid f_\theta(x)>0\}$, is clearly $\{x\mid h(x)>0\}$, which is independent of $\theta$. The distributions in $\mathcal{P}$ are thus mutually absolutely continuous.

Fact 12  From the Factorization Theorem, the statistic $T=(T_1,T_2,\dots,T_k)$ is sufficient for $\mathcal{P}$.

Fact 13  $T$ has induced measures $\{P_\theta^T \mid \theta\in\Theta\}$ which also form an exponential family.

Fact 14  If $\Gamma_\Theta$ contains an open rectangle in $\mathbb{R}^k$, then $T$ is complete for $\mathcal{P}$. (See pages 142-143 of TSH.)

Fact 15  If $\Gamma_\Theta$ contains an open rectangle in $\mathbb{R}^k$ (and actually under the much weaker assumptions given on page 44 of TPE) $T$ is minimal sufficient for $\mathcal{P}$.
Fact 16  If $g$ is any measurable real valued function such that $E_\eta|g(X)| < \infty$, then
$$E_\eta\, g(X) = \int g(x)f_\eta(x)\,d\mu(x)$$
is continuous on $\Gamma$ and has continuous partial derivatives of all orders on the interior of $\Gamma$. These can be calculated as
$$\frac{\partial^{\alpha_1+\alpha_2+\cdots+\alpha_k}}{\partial\eta_1^{\alpha_1}\partial\eta_2^{\alpha_2}\cdots\partial\eta_k^{\alpha_k}}\,E_\eta\, g(X) = \int g(x)\,\frac{\partial^{\alpha_1+\alpha_2+\cdots+\alpha_k}}{\partial\eta_1^{\alpha_1}\partial\eta_2^{\alpha_2}\cdots\partial\eta_k^{\alpha_k}}\,f_\eta(x)\,d\mu(x)\,.$$
(See page 59 of TSH.)
Fact 17  If for $u=(u_1,u_2,\dots,u_k)$, both $\eta_0$ and $\eta_0+u$ belong to $\Gamma$,
$$E_{\eta_0}\exp\{u_1T_1(X)+u_2T_2(X)+\cdots+u_kT_k(X)\} = \frac{K(\eta_0)}{K(\eta_0+u)}\,.$$
Further, if $\eta_0$ is in the interior of $\Gamma$, then
$$E_{\eta_0}\left(T_1^{\alpha_1}(X)T_2^{\alpha_2}(X)\cdots T_k^{\alpha_k}(X)\right) = K(\eta_0)\left.\frac{\partial^{\alpha_1+\alpha_2+\cdots+\alpha_k}}{\partial\eta_1^{\alpha_1}\partial\eta_2^{\alpha_2}\cdots\partial\eta_k^{\alpha_k}}\left(\frac{1}{K(\eta)}\right)\right|_{\eta=\eta_0}\,.$$
In particular,
$$E_{\eta_0}T_j(X) = \left.\frac{\partial}{\partial\eta_j}\left(-\log K(\eta)\right)\right|_{\eta=\eta_0}\,,\quad \mathrm{Var}_{\eta_0}T_j(X) = \left.\frac{\partial^2}{\partial\eta_j^2}\left(-\log K(\eta)\right)\right|_{\eta=\eta_0}\,,$$
and
$$\mathrm{Cov}_{\eta_0}(T_j(X),T_l(X)) = \left.\frac{\partial^2}{\partial\eta_j\partial\eta_l}\left(-\log K(\eta)\right)\right|_{\eta=\eta_0}\,.$$
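As a check on Fact 17 (my example): the Poisson$(\lambda)$ family is exponential with $\eta=\log\lambda$, $T(x)=x$, $h(x)=1/x!$ and $K(\eta)=\exp(-e^\eta)$, so $-\log K(\eta)=e^\eta$. Its first two $\eta$-derivatives are both $e^{\eta_0}=\lambda$, the familiar Poisson mean and variance; numeric differentiation agrees.

```python
import math

def neg_log_K(eta):
    # Poisson in natural form: f_eta(x) = K(eta) exp(eta * x) / x!,
    # with K(eta) = exp(-e^eta), so -log K(eta) = e^eta.
    return math.exp(eta)

def d(fun, x, h=1e-5):
    """Central first difference."""
    return (fun(x + h) - fun(x - h)) / (2 * h)

def d2(fun, x, h=1e-4):
    """Central second difference."""
    return (fun(x + h) - 2 * fun(x) + fun(x - h)) / h ** 2

lam = 3.0
eta0 = math.log(lam)
mean = d(neg_log_K, eta0)    # should be lam, per Fact 17
var = d2(neg_log_K, eta0)    # should also be lam
print(round(mean, 3), round(var, 3))
```

Both derivatives return $\lambda$, matching $E_{\eta_0}T_j$ and $\mathrm{Var}_{\eta_0}T_j$ computed from $-\log K$.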
Fact 18  If $X=(X_1,X_2,\dots,X_n)$ has iid components, with each $X_i\sim P_\theta$, then $X$ generates a $k$ dimensional exponential family on $(\mathcal{X}^n,\mathcal{B}^n)$ wrt $\mu^n$. The $k$ dimensional statistic $\sum_{i=1}^n T(X_i)$ is sufficient for this family. Under the condition that $\Gamma_\Theta$ contains an open rectangle, this statistic is also complete and minimal sufficient.
"Fisher Information"

Definition 10  We will say that the model $\mathcal{P}$ with dominating $\sigma$-finite measure $\mu$ is FI regular at $\theta_0\in\Theta\subset\mathbb{R}^1$, provided there exists an open neighborhood of $\theta_0$, say $O$, such that
i) $f_\theta(x)>0\ \forall x$ and $\forall\theta\in O$,
ii) $\forall x$, $f_\theta(x)$ is differentiable at $\theta_0$ and
iii) $1=\int f_\theta(x)\,d\mu(x)$ can be differentiated wrt $\theta$ under the integral at $\theta_0$, i.e.
$$0 = \int \left.\frac{d}{d\theta}f_\theta(x)\right|_{\theta=\theta_0}\,d\mu(x)\,.$$
(A definition like that on page 111 of Schervish, that requires i) and ii) only on a set of $P_\theta$ probability 1 $\forall\theta$ in $O$, is really no more general than this. WOLOG, just throw away the exceptional set.)

Jargon: The (random) function of $\theta$, $\frac{d}{d\theta}\log f_\theta(X)$, is called the score function. iii) of Definition 10 says that the mean of the score function is $0$ at $\theta_0$.
Definition 11  If the model $\mathcal{P}$ is FI regular at $\theta_0$, then
$$I(\theta_0) = E_{\theta_0}\left(\left.\frac{d}{d\theta}\log f_\theta(X)\right|_{\theta=\theta_0}\right)^2$$
is called the Fisher Information contained in $X$ at $\theta_0$. (If a model is FI regular at $\theta_0$, $I(\theta_0)$ is the variance of the score function at $\theta_0$.)

Theorem 19  Suppose that the model $\mathcal{P}$ is FI regular at $\theta_0$. If $\forall x$, $f_\theta(x)$ is in fact twice differentiable at $\theta_0$ and $1=\int f_\theta(x)\,d\mu(x)$ can be differentiated twice wrt $\theta$ under the integral at $\theta_0$, i.e. both $0=\int \frac{d}{d\theta}f_\theta(x)\big|_{\theta=\theta_0}\,d\mu(x)$ and $0=\int \frac{d^2}{d\theta^2}f_\theta(x)\big|_{\theta=\theta_0}\,d\mu(x)$, then
$$I(\theta_0) = -E_{\theta_0}\left(\left.\frac{d^2}{d\theta^2}\log f_\theta(X)\right|_{\theta=\theta_0}\right)\,.$$
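For a sanity check (my example, not the notes'): in the N$(\theta,1)$ model the score is $\frac{d}{d\theta}\log f_\theta(x)=x-\theta$, so the score has mean 0 (Definition 10 iii), its variance is $I(\theta_0)=1$ (Definition 11), and $-\frac{d^2}{d\theta^2}\log f_\theta$ is identically 1, matching Theorem 19. A quick Monte Carlo:

```python
import random
import statistics

theta0 = 2.0
rng = random.Random(42)
xs = [rng.gauss(theta0, 1) for _ in range(100000)]

scores = [x - theta0 for x in xs]     # d/dtheta log f_theta(x) at theta0
mean_score = statistics.mean(scores)  # ~ 0, per Definition 10 iii)
info = statistics.variance(scores)    # ~ 1 = I(theta0), per Definition 11
neg_second = 1.0                      # -d^2/dtheta^2 log f is constant = 1
print(round(mean_score, 3), round(info, 3), neg_second)
```

Both routes to the information (variance of the score, and minus the expected second derivative) land on the same number, as Theorem 19 promises.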
$k$-dimensional versions of Definitions 10 and 11 are next.

Definition 10b  We will say that the model $\mathcal{P}$ with dominating $\sigma$-finite measure $\mu$ is FI regular at $\theta_0\in\Theta\subset\mathbb{R}^k$, provided there exists an open neighborhood of $\theta_0$, say $O$, such that
i) $f_\theta(x)>0\ \forall x$ and $\forall\theta\in O$,
ii) $\forall x$, $f_\theta(x)$ has first order partial derivatives at $\theta_0$ and
iii) $1=\int f_\theta(x)\,d\mu(x)$ can be differentiated wrt each $\theta_i$ under the integral at $\theta_0$, i.e. $0=\int \frac{\partial}{\partial\theta_i}f_\theta(x)\big|_{\theta=\theta_0}\,d\mu(x)$.

Definition 11b  If the model $\mathcal{P}$ is FI regular at $\theta_0$ and $E_{\theta_0}\left(\frac{\partial}{\partial\theta_i}\log f_\theta(X)\big|_{\theta=\theta_0}\right)^2 < \infty\ \forall i$, then the $k\times k$ matrix
$$I(\theta_0) = \left(E_{\theta_0}\left.\frac{\partial}{\partial\theta_i}\log f_\theta(X)\right|_{\theta=\theta_0}\left.\frac{\partial}{\partial\theta_j}\log f_\theta(X)\right|_{\theta=\theta_0}\right)$$
is called the Fisher Information contained in $X$ at $\theta_0$.
There is a multiparameter version of Theorem 19.

Theorem 19b  Suppose that the model $\mathcal{P}$ is FI regular at $\theta_0$. If $\forall x$, $f_\theta(x)$ in fact has continuous second order partials in the neighborhood $O$, and $1=\int f_\theta(x)\,d\mu(x)$ can be differentiated twice under the integral at $\theta_0$, i.e. both $0=\int \frac{\partial}{\partial\theta_i}f_\theta(x)\big|_{\theta=\theta_0}\,d\mu(x)$ and $0=\int \frac{\partial^2}{\partial\theta_i\partial\theta_j}f_\theta(x)\big|_{\theta=\theta_0}\,d\mu(x)\ \forall\,i$ and $j$, then
$$I(\theta_0) = -E_{\theta_0}\left(\left.\frac{\partial^2}{\partial\theta_i\partial\theta_j}\log f_\theta(X)\right|_{\theta=\theta_0}\right)\,.$$
Fact 20  If $X_1,X_2,\dots,X_n$ are independent, $X_i\sim P_{i\theta}$, then $X=(X_1,X_2,\dots,X_n)$ (that has distribution $P_{1\theta}\times P_{2\theta}\times\cdots\times P_{n\theta}$) carries Fisher Information $I(\theta)=\sum_{i=1}^n I_i(\theta)$, where in the obvious way, $I_i(\theta)$ is the Fisher information in $X_i$.

Fact 21  $I(\theta)$ doesn't depend upon the dominating measure.
Observation 22  The Fisher Information does depend upon the parameterization one uses. For example, suppose that the model $\mathcal{P}$ is FI regular at $\theta_0\in\mathbb{R}^1$ and that
$$\eta = h(\theta) \text{ for some "nice" } h \text{ (say, 1-1 with derivative } h'\text{).}$$
Then for $P_\theta$ with R-N derivative $f_\theta$ with respect to $\mu$, the distributions $Q_\eta$ for $\eta$ in the range of $h$ are defined by
$$Q_\eta = P_{h^{-1}(\eta)}$$
and have R-N derivatives (again with respect to $\mu$)
$$g_\eta = \frac{dQ_\eta}{d\mu}\,.$$
Then, continuing to let "prime" denote differentiation with respect to $\theta$ or $\eta$,
$$g'_\eta = f'_{h^{-1}(\eta)}\cdot\frac{1}{h'\left(h^{-1}(\eta)\right)}$$
and
$$I(\eta) = \left(\frac{1}{h'\left(h^{-1}(\eta)\right)}\right)^2 I\left(h^{-1}(\eta)\right)$$
(where the first "$I$" here is information about $\eta$ and the second is information about $\theta$).
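A concrete check of the change-of-parameterization formula (my example): for one Exponential observation with rate $\theta$, $I(\theta)=1/\theta^2$. Re-parameterizing by the mean $\eta=h(\theta)=1/\theta$, the formula gives $I(\eta)=\left(1/h'(h^{-1}(\eta))\right)^2 I(h^{-1}(\eta))=1/\eta^2$.

```python
def info_rate(theta):
    """Fisher information about the rate in one Exponential(theta) draw."""
    return 1.0 / theta ** 2

def info_mean_via_formula(eta):
    """Apply Observation 22 with eta = h(theta) = 1/theta."""
    theta = 1.0 / eta             # h^{-1}(eta)
    h_prime = -1.0 / theta ** 2   # h'(theta) = -1/theta^2
    return (1.0 / h_prime) ** 2 * info_rate(theta)

for eta in (0.25, 0.5, 2.0):
    assert info_mean_via_formula(eta) == 1.0 / eta ** 2
print("I(eta) = 1/eta^2, as the formula predicts")
```

Note how the same model carries information $1/\theta^2$ about the rate but $1/\eta^2$ about the mean: the information is a property of the parameterization, not just of the family.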
Observation 23  If $T(x)$ is a 1-1 function on $\mathcal{X}$, then the Fisher Information in $T(X)$ is the same as that in $X$. This follows since the distributions of $T(X)$, $P_\theta^T$, are dominated by the measure $\mu^T$ defined by $\mu^T(F)=\mu(T^{-1}(F))$ and have R-N derivatives
$$\frac{dP_\theta^T}{d\mu^T} = f_\theta\circ T^{-1}$$
which will satisfy the FI regularity conditions and have the right derivatives and integrals.
Theorem 24  Suppose the model $\mathcal{P}$ is FI regular at $\theta_0$. (?And suppose that the family $\mathcal{P}^T=\{P_\theta^T\}_{\theta\in\Theta}$ is also FI regular at $\theta_0$.?) Then
$$I_X(\theta_0) \ge I_{T(X)}(\theta_0)$$
and if $T(X)$ is sufficient for $\theta$ there is equality.

Theorem 24b (Theorem 2.86 of Schervish)  Suppose the model $\mathcal{P}$ is FI regular at $\theta_0\in\Theta\subset\mathbb{R}^k$. (?And suppose that the family $\mathcal{P}^T=\{P_\theta^T\}_{\theta\in\Theta}$ is also FI regular at $\theta_0$.?) Then the matrix
$$I_X(\theta_0) - I_{T(X)}(\theta_0) \text{ is nonnegative definite}$$
and if $T(X)$ is sufficient for $\theta$ the matrix contains all $0$ elements.
Fact 25  In exponential families, the Fisher Information is very simple. For a family of distributions as in Fact(s) 10 above with densities $f_\eta(x)$ with respect to $\mu$,
$$I(\eta_0) = \mathrm{Var}_{\eta_0}\,T(X) = \left(\left.-\frac{\partial^2}{\partial\eta_i\partial\eta_j}\log K(\eta)\right|_{\eta=\eta_0}\right)\,.$$
"Kullback-Leibler Information"

There is another notion of statistical information, intended to measure how far apart 2 distributions are in the sense of likelihood, or in some sense how easy it will be to discriminate between them on the basis of an observable, $X$. (This measure is also useful in the asymptotics of maximum likelihood estimation.)

Definition 12  If $P$ and $Q$ are probability measures on $\mathcal{X}$ with R-N derivatives with respect to a common dominating measure $\mu$, $p$ and $q$ respectively, then the Kullback-Leibler Information $I(P;Q)$ is the $P$ expected log likelihood ratio
$$I(P;Q) = E_P\log\frac{p(X)}{q(X)} = \int \log\left(\frac{p(x)}{q(x)}\right)p(x)\,d\mu(x)\,.$$

Fact 26  $I(P;Q)$ is not in general the same as $I(Q;P)$.

Fact 27  $I(P;Q)\ge 0$ and there is equality only when $P=Q$.
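Facts 26 and 27 are easy to exhibit numerically (my sketch) with two pmfs on a three-point set:

```python
import math

def kl(p, q):
    """Kullback-Leibler information I(P;Q) for pmfs given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.6, 0.2, 0.2]

print(kl(p, q), kl(q, p))  # both positive and unequal (Facts 26 and 27)
print(kl(p, p))            # 0: equality holds exactly when P = Q
```

The two orderings give different positive numbers, so K-L "information" is not a metric, though it does vanish only at $P=Q$.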
This fact follows from the next (slightly more general) lemma, upon noting that $\Psi(\cdot)=-\log(\cdot)$ is strictly convex.

Lemma 28 (Silvey page 75)  Suppose that $P$ and $Q$ are two different distributions on a space $\mathcal{X}$ with R-N derivatives $p$ and $q$ respectively with respect to a common measure $\mu$. Suppose further that a function $\Psi:(0,\infty)\to\mathbb{R}^1$ is convex. Then
$$E_P\,\Psi\left(\frac{q(X)}{p(X)}\right) \ge \Psi(1)$$
with strict inequality if $\Psi$ is strictly convex.
Theorem 29  If $X=(X_1,X_2)$ and under both $P$ and $Q$ the variables $X_1$ and $X_2$ are independent, then
$$I_X(P;Q) = I_{X_1}(P^{X_1};Q^{X_1}) + I_{X_2}(P^{X_2};Q^{X_2})\,.$$
So the K-L Information has a kind of additive property similar to that possessed by the Fisher Information. It also has the kind of non-increasing property associated with reduction to a statistic $T(X)$ we saw with the Fisher Information.

Theorem 30  For any statistic $T(X)$
$$I_X(P;Q) \ge I_{T(X)}(P^T;Q^T)$$
and there is equality iff $T(X)$ is sufficient for $\{P,Q\}$.
Finally, there is also an important connection between the notions of Fisher and K-L Information.

Theorem 31  Under the hypotheses of Theorem 19, and supposing that one can reverse the order of limits in
$$\left.\frac{d^2}{d\theta^2}\,E_{\theta_0}\log f_\theta(X)\right|_{\theta=\theta_0}$$
to produce
$$E_{\theta_0}\left.\frac{d^2}{d\theta^2}\log f_\theta(X)\right|_{\theta=\theta_0}\,,$$
it follows that
$$\left.\frac{d^2}{d\theta^2}\,I_X(P_{\theta_0};P_\theta)\right|_{\theta=\theta_0} = I_X(\theta_0)$$
(the first "$I$" is the K-L Information and the second is the Fisher Information).

The multiparameter version of this is, of course, similar.

Theorem 31b  Under the hypotheses of Theorem 19b (and assuming that it is permissible to interchange the order of second order partial differentiation and integration $d\mu(x)$)
$$\left(\left.\frac{\partial^2}{\partial\theta_i\partial\theta_j}\,I_X(P_{\theta_0};P_\theta)\right|_{\theta=\theta_0}\right)_{k\times k} = I_X(\theta_0)$$
(the first "$I$" is the K-L Information and the second is the Fisher Information).
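Theorem 31 can be checked in the N$(\theta,1)$ model (my example), where $I_X(P_{\theta_0};P_\theta)=(\theta-\theta_0)^2/2$ and $I_X(\theta_0)=1$: the curvature of the K-L Information in $\theta$ at $\theta_0$ is indeed the Fisher Information.

```python
def kl_normal(theta0, theta):
    """K-L information between N(theta0, 1) and N(theta, 1)."""
    return (theta - theta0) ** 2 / 2.0

def d2(fun, x, h=1e-4):
    """Central second difference."""
    return (fun(x + h) - 2 * fun(x) + fun(x - h)) / h ** 2

theta0 = 1.3
curvature = d2(lambda th: kl_normal(theta0, th), theta0)
fisher = 1.0  # I(theta0) in the N(theta, 1) model
print(round(curvature, 6), fisher)
```

The K-L Information is locally a quadratic whose second derivative at its minimum is the Fisher Information, which is one way to read Theorem 31.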
Types of Statistical Problems

In the framework introduced above (an identifiable family of distributions $\mathcal{P}=\{P_\theta\}$ dominated by a $\sigma$-finite measure $\mu$ with R-N derivatives $f_\theta=\frac{dP_\theta}{d\mu}$) there are several formulations of the basic notion of "using $X$ to tell us about $\theta$" that is statistical theory. Here we briefly describe those.

Point Estimation is the business of using a statistic $\delta(X)$ (where $\delta:\mathcal{X}\to\Theta$) to guess at the value of $\theta$ (or some function of $\theta$). In those (common) cases where $\Theta\subset\mathbb{R}^k$ one can adopt criteria like "'small' $\theta$ expected difference between $\delta(X)$ and $\theta$ (or a function of $\theta$)" to compare choices of "estimators" $\delta(X)$. Common methods of inventing estimators are taught in 543 and include such methods as maximum likelihood and the method of moments. Bayesians typically use some feature of their posterior distribution (like its mean or mode) to estimate $\theta$ (or a function thereof).
Set Estimation is the business of using a set-valued statistic $S(X)$ (where for $\mathcal{C}$ a $\sigma$-algebra on $\Theta$, $S:\mathcal{X}\to\mathcal{C}$) to guess at the true value of $\theta$ (or some function of $\theta$). One (obvious) criterion for judging choices of "confidence procedures" $S$ is that one wants "'large' $\theta$ coverage probability" for all $\theta$. In those (common) cases where $\Theta\subset\mathbb{R}^k$ one can adopt criteria like "'small' $\theta$ expected set 'size'" to add to the notion of large coverage probability. Common methods of inventing set estimators include looking for "pivotals," inverting families of tests, looking for "standard errors" for good point estimators and using them to provide limits around those point estimators, and bootstrap methods. Bayesians, having a posterior probability distribution for $\theta$, typically use that to choose what they call "credible sets" with large posterior probability of including $\theta$ (or a function thereof).
Prediction is the business of positing a model for $X=(X_1,X_2)$, supposing only $X_1$ to be observable and using it to predict $X_2$ (or some function thereof). There are both point and set versions of this problem. In the common case where $X_2$ takes values in $\mathbb{R}^k$, one can adopt criteria like "'small' $\theta$ expected difference between the prediction $\delta(X_1)$ and $X_2$ (or a function thereof)" to compare choices of "predictors" $\delta(X_1)$. Bayesians have a conditional distribution for $X_2$ (called the predictive distribution) and can use some feature of it to predict $X_2$. "Set prediction" methods are usually judged on the basis of their $\theta$ coverage probabilities and $\theta$ expected set sizes. Bayesian predictions are really formally no different from Bayesian estimation, being based on conditional distributions of the quantity of interest given the observable $X_1$.
Hypothesis Testing is the business of partitioning the parameter space $\Theta$ into two pieces, $\Theta_0$ and $\Theta_1$, and using a statistic $\phi(X)$ (where $\phi:\mathcal{X}\to\{0,1\}$) to decide which of the two pieces is more plausible. Criteria like small "size" (maximum $\theta\in\Theta_0$ probability that $\phi(X)=1$) and large "power" ($\theta\in\Theta_1$ probability that $\phi(X)=1$) are usually used to compare "tests" $\phi$. Classical methods of inventing tests (taught in 543) include appeal to the Neyman-Pearson Lemma and its extensions to convenient families (like monotone likelihood ratio families) and use of the "likelihood ratio criterion." Bayesians, armed as they are with a posterior distribution over $\Theta$, have posterior probabilities for $\Theta_0$ and $\Theta_1$ to guide their choice between $\Theta_0$ and $\Theta_1$ (and therefore their construction of $\phi$'s).

There are optimality theories that can be brought to bear on these statistical problems. Those are part of the somewhat broader world of "statistical decision theory." We will do a fair amount with this subject before the term is over. However, because the situations in which the theory provides clean results do not begin to span the space of important statistical applications, we will first continue with some more general theoretical material of wider applicability, based primarily on asymptotic considerations.
Maximum Likelihood, Its Implications in Likelihood Ratio Testing and Other Contexts (Primarily the Simple Asymptotics Thereof)

"Maximum Likelihood"

Maximum likelihood is a method of point estimation dating at least to Fisher and probably beyond. As usual, for an identifiable family of distributions $\mathcal{P}=\{P_\theta\}$ dominated by a $\sigma$-finite measure $\mu$, we let $f_\theta=\frac{dP_\theta}{d\mu}$.

Definition 13  A statistic $T:\mathcal{X}\to\Theta$ is called a maximum likelihood estimator (MLE) of $\theta$ if
$$f_{T(x)}(x) = \sup_{\theta\in\Theta} f_\theta(x)\ \forall x\in\mathcal{X}\,.$$

This definition is not without practical problems in terms of existence and computation of MLEs. $f_\theta(x)$ can fail to be bounded in $\theta$, and even when it is, can fail to have a maximum value as one searches over $\Theta$. One way to get rid of the problem of potential lack of boundedness (favored and extensively practiced by Prof. Meeker for example) is to admit that one can measure only "to the nearest something," so that "real" $\mathcal{X}$ are discrete (finite or countable) so that $\mu$ can be counting measure and
$$f_\theta(x) = P_\theta(\{x\}) \le 1$$
(so that one is just maximizing the probability of observing the data in hand). Of course, the possibility that the sup is not achieved still remains even when this device is used (like, for example, when $X\sim$ Binomial$(n,p)$ for $p\in(0,1)$ when $X=0$ or $n$).
Note that in the common situation where $\Theta\subset\mathbb{R}^k$ and the $f_\theta$ are nicely differentiable in $\theta$, a maximum likelihood estimator must satisfy the likelihood equations
$$\left.\frac{\partial}{\partial\theta_i}\log f_\theta(x)\right|_{\theta=T(x)} = 0 \text{ for } i=1,2,\dots,k\,.$$
Much of the theory connected with maximum likelihood estimation has to do with roots of the likelihood equations (as opposed to honest maximizers of $f_\theta(x)$). Further, much of the attractiveness of the method derives from its asymptotic properties. A simple version of the theory (primarily for the $k=1$ case) follows.
Adopt the following notation for the discussion of the asymptotics of maximum likelihood.

$X^n$  represents all information on $\theta$ available through the $n$th stage of observation
$\mathcal{X}_n$  the observation space for $X^n$ (through the $n$th stage)
$\mathcal{B}_n$  a $\sigma$-algebra on $\mathcal{X}_n$
$P_\theta^{X^n}$  the distribution of $X^n$ (dominated by a $\sigma$-finite measure $\mu_n$)
$\Theta$  a (fixed) parameter space
$f_\theta^n$  the R-N derivative $\frac{dP_\theta^{X^n}}{d\mu_n}$

For this material one really does need the basic probability space $(\Omega,\mathcal{F},P_\theta)$ in the background, as it is necessary to have a fixed space upon which to prove convergence results. Often (i.e. in the iid case) we may take
$(\mathcal{X},\mathcal{B},P_\theta^{X_1})$ to be a "1-dimensional" model where the $P_\theta^{X_1}$'s are absolutely continuous with respect to a $\sigma$-finite measure $\mu$
$\Omega = \mathcal{X}^\infty$
$X^n = (X_1,X_2,\dots,X_n)$ for $X_i(\omega)=$ the $i$th coordinate of $\omega$
$\mathcal{X}_n = \mathcal{X}^n$
$\mathcal{F} = \mathcal{B}^\infty$
$P_\theta = \left(P_\theta^{X_1}\right)^\infty$
$P_\theta^{X^n} = \left(P_\theta^{X_1}\right)^n$
$\mu_n = \mu^n$
and
$$f_\theta^n(x^n) = \prod_{i=1}^n f_\theta(x_i)\,.$$
For such a set-up, the usual drill is for appropriate sequences of estimators $\{\hat\theta_n(x^n)\}$ to try to prove consistency and asymptotic normality results.

Definition 14  Suppose that $\{T_n\}$ is a sequence of statistics $T_n:\mathcal{X}_n\to\Theta$, where $\Theta\subset\mathbb{R}^k$.
i) $\{T_n\}$ is (weakly) consistent at $\theta_0\in\Theta$ if for every $\epsilon>0$, $P_{\theta_0}\left(|T_n-\theta_0|>\epsilon\right)\to 0$ as $n\to\infty$, and
ii) $\{T_n\}$ is strongly consistent at $\theta_0\in\Theta$ if $T_n\to\theta_0$ a.s. $P_{\theta_0}$.
There are two standard paths for developing asymptotic results for likelihood-based estimators in iid contexts. These are
i) the path blazed by Wald based on very strong regularity assumptions (that can be extremely hard to verify for a specific model) but that deal with "real" maximum likelihood estimators, and
ii) the path blazed by Cramér based on much weaker assumptions and far easier proofs that deal with solutions to the likelihood equations.

We'll wave our hands very briefly at a heuristic version of the Wald argument and defer to pages 415-424 of Schervish for technical details. Then we'll do a version of the Cramér results (probably assuming more than is absolutely necessary to get them).

All that follows regarding the asymptotics of maximum likelihood refers to the iid case, where $x^n=(x_1,x_2,\dots,x_n)$, $\Theta\subset\mathbb{R}^k$ and $f_\theta^n(x^n)=\prod_{i=1}^n f_\theta(x_i)$.
Now suppose that the $f_\theta$ are positive everywhere on $\mathcal{X}$ and define the function of $\theta$
$$Z_{\theta_0}(\theta) \equiv E_{\theta_0}\log f_\theta(X) = \int \log f_\theta(x)\,f_{\theta_0}(x)\,d\mu(x) = -I(\theta_0;\theta) + \int \log f_{\theta_0}(x)\,f_{\theta_0}(x)\,d\mu(x)\,.$$

Corollary 32  If the $f_\theta$ are positive everywhere on $\mathcal{X}$, then for $\theta\ne\theta_0$
$$Z_{\theta_0}(\theta_0) > Z_{\theta_0}(\theta)\,.$$

Then, as a matter of notation, let
$$\ell_n(x^n,\theta) \equiv \log f_\theta^n(x^n) = \sum_{i=1}^n \log f_\theta(x_i)\,.$$
Then clearly $E_{\theta_0}\frac{1}{n}\ell_n(X^n,\theta) = Z_{\theta_0}(\theta)$ and in fact, the law of large numbers says that for every $\theta$,
$$\frac{1}{n}\ell_n(X^n,\theta) = \frac{1}{n}\sum_{i=1}^n \log f_\theta(X_i) \to Z_{\theta_0}(\theta) \text{ a.s. } P_{\theta_0}\,.$$
So assuming enough regularity conditions to make this almost sure convergence uniform in $\theta$, when $\theta_0$ governs the distribution of the observations there is perhaps hope that a single maximizer of $\frac{1}{n}\ell_n(X^n,\theta)$ will eventually exist and be close to the maximizer of $Z_{\theta_0}(\theta)$, namely $\theta_0$. To make this precise is not so easy. See Theorems 7.49 and 7.54 of Schervish.

Now turn to Cramér type results.
Theorem 33 Suppose that \" ß \# ß ÞÞÞß \8 are iid T) , @ § e" and there exists an open
neighborhood of )! , say S, such that
i) 0) ÐBÑ ž ! aB and a) - S,
ii) aBß 0) ÐBÑ is differentiable at every point ) - S, and
iii) E)! log0) Ð\Ñ exists for all ) - S and E)! log0)! Ð\Ñ is finite.
Then, if % ž ! and $ ž !, b 8 such that a 7 ž 8, the T)! probability that the equation
. 7
"log0) Ð\3 Ñ œ !
. ) 3œ"
has a root within % of )! is at least " • $ Þ
(Essentially this theorem ducks the uniformity issue raised earlier by considering 68 at
only )! ß )! • % and )! € %.)
In general, Theorem 33 doesn't immediately do what we'd like. The root guaranteed by the theorem isn't necessarily an MLE. And in fact, when the likelihood equation can have multiple roots, we haven't even yet defined a sequence of random variables, let alone a sequence of estimators converging to $\theta_0$. But the theorem can form the basis for simple results that are of statistical importance.
Corollary 34 Under the hypotheses of Theorem 33, suppose that (in addition) $\Theta$ is open and $\forall x$, $f_\theta(x)$ is positive and differentiable at every point $\theta \in \Theta$, so that the likelihood equation
$$\frac{d}{d\theta} \sum_{i=1}^n \log f_\theta(X_i) = 0$$
makes sense at all points $\theta \in \Theta$. Then define $\rho_n$ to be the root of the likelihood equation when there is exactly one (and otherwise adopt any arbitrary definition for $\rho_n$). If with $P_{\theta_0}$ probability approaching 1 the likelihood equation has a single root, then
$$\rho_n \to \theta_0 \ \text{ in } P_{\theta_0} \text{ probability}.$$
Where there is no guarantee that the likelihood equation will eventually have only a single root, one could use Theorem 33 to argue that, defining
$$\psi_n(\theta_0) = \begin{cases} 0 & \text{if the likelihood equation has no root} \\ \text{the root closest to } \theta_0 & \text{otherwise,} \end{cases}$$
one has $\psi_n(\theta_0) \to \theta_0$ in $P_{\theta_0}$ probability. (The continuity of the likelihood implies that limits of roots are roots, so for $\theta_0$ in an open $\Theta$, the definition of $\psi_n(\theta_0)$ above eventually makes sense.) But, of course, this is of no statistical help, since $\psi_n(\theta_0)$ depends on the unknown $\theta_0$. What is similar in nature but of more help is the next corollary.
Corollary 35 Under the hypotheses of Theorem 33, if $\{T_n\}$ is a sequence of estimators consistent at $\theta_0$ and
$$\hat\theta_n \equiv \begin{cases} T_n & \text{if the likelihood equation has no roots} \\ \text{the root closest to } T_n & \text{otherwise,} \end{cases}$$
then
$$\hat\theta_n \to \theta_0 \ \text{ in } P_{\theta_0} \text{ probability}.$$
In practical terms, finding the root of the likelihood equation nearest to $T_n$ may be a numerical problem. (Actually, if it is, it's not so clear how you know you have solved the problem. But we'll gloss over that.) Finding a root beginning from $T_n$ might be approached using something like Newton's method. That is, abbreviating
$$L_n(\theta) \equiv l_n(X^n, \theta) = \sum_{i=1}^n \log f_\theta(X_i),$$
the likelihood equation is
$$L_n'(\theta) = 0.$$
One might try looking for a root of this equation by iterating using linearizations of $L_n'(\theta)$ (presumably beginning at $T_n$). Or, assuming that $L_n''(T_n) \ne 0$ and taking a single step in this iteration process, a 1-step likelihood-based modification/improvement(?) on $T_n$,
$$\tilde\theta_n = T_n - \frac{L_n'(T_n)}{L_n''(T_n)},$$
is suggested. It is plausible that an estimator like $\tilde\theta_n$ might have asymptotic behavior similar to that of an MLE. (See page 442 of TPE or Theorem 7.75 of Schervish, which treats a more general multiparameter M-estimation version of this idea.)
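As a concrete illustration of the 1-step idea, here is a small sketch for the Cauchy location model (the model choice, sample size, and true value $\theta_0 = 1$ are our own illustrative assumptions, not part of the notes), where the sample median serves as the consistent starting estimator $T_n$:

```python
import numpy as np

# Cauchy location model: f_theta(x) = 1 / (pi * (1 + (x - theta)^2)), so
#   d/dtheta    log f = 2 d / (1 + d^2)            with d = x - theta,
#   d^2/dtheta^2 log f = (2 d^2 - 2) / (1 + d^2)^2.

def derivs(theta, x):
    """L_n'(theta) and L_n''(theta) for the Cauchy loglikelihood."""
    d = x - theta
    return np.sum(2 * d / (1 + d**2)), np.sum((2 * d**2 - 2) / (1 + d**2)**2)

rng = np.random.default_rng(0)
theta0 = 1.0
x = theta0 + rng.standard_cauchy(500)

t_n = np.median(x)                      # consistent (but inefficient) start T_n
lp, lpp = derivs(t_n, x)
theta_tilde = t_n - lp / lpp            # the 1-step estimator
```

For this model $I_1(\theta) = 1/2$, so the 1-step estimator's asymptotic standard deviation is $\sqrt{2/n}$, noticeably smaller than the median's $(\pi/2)/\sqrt{n}$.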
The next matter up is the asymptotic distribution of likelihood-based estimators. The
following is representative of what people have proved. (I suspect that the hypotheses
here are stronger than are really needed to get the conclusion, but they make for a simple
proof.)
Theorem 36 Suppose that $X_1, X_2, \ldots, X_n$ are iid $P_\theta$, $\Theta \subset \Re^1$, and there exists an open neighborhood of $\theta_0$, say $O$, such that
i) $f_\theta(x) > 0$ $\forall x$ and $\forall \theta \in O$,
ii) $\forall x$, $f_\theta(x)$ is three times differentiable at every point $\theta \in O$,
iii) $\exists$ $M(x) \ge 0$ with $\mathrm{E}_{\theta_0} M(X) < \infty$ and
$$\left| \frac{d^3}{d\theta^3} \log f_\theta(x) \right| \le M(x) \quad \forall x \text{ and } \forall \theta \in O,$$
iv) $1 = \int f_\theta(x)\, d\mu(x)$ can be differentiated twice with respect to $\theta$ under the integral at $\theta_0$, i.e. both $0 = \int \frac{d}{d\theta} f_\theta(x) \big|_{\theta = \theta_0}\, d\mu(x)$ and $0 = \int \frac{d^2}{d\theta^2} f_\theta(x) \big|_{\theta = \theta_0}\, d\mu(x)$, and
v) $I_1(\theta) \equiv \mathrm{E}_\theta \left( \frac{d}{d\theta} \log f_\theta(X) \right)^2 \in (0, \infty)$ $\forall \theta \in O$.
If, with $P_{\theta_0}$ probability approaching 1, $\hat\theta_n$ is a root of the likelihood equation, and $\hat\theta_n \to \theta_0$ in $P_{\theta_0}$ probability, then under $P_{\theta_0}$
$$\sqrt{n}\left( \hat\theta_n - \theta_0 \right) \stackrel{d}{\to} \mathrm{N}\!\left( 0, \frac{1}{I_1(\theta_0)} \right).$$
There are multiparameter versions of this theorem (indeed, the Schervish theorem is a multiparameter one). And there are similar (multiparameter) versions of this for the "Newton 1-step approximate MLEs" based on consistent estimators of $\theta$ (i.e. $\tilde\theta_n$'s).
The practical implication of a result like this theorem is that one can make approximate confidence intervals for $\theta$ of the form
$$\hat\theta_n \pm z \frac{1}{\sqrt{n I_1(\theta_0)}}.$$
Well, it is almost a practical implication of the theorem. One needs to do something about the fact that $I_1(\theta_0)$ is unknown when making a confidence interval, so it can't be used in the formula. There are two common routes to fixing this problem. The first is to use $I_1(\hat\theta_n)$ in its place. This method is often referred to as using the "expected Fisher information" in the making of the interval. (Note that $I_1(\theta_0)$ is a measure of curvature of $z_{\theta_0}(\theta)$ at its maximum (located at $\theta_0$), while $I_1(\hat\theta_n)$ is a measure of curvature of $z_{\hat\theta_n}(\theta)$ at its maximum (located at $\hat\theta_n$).) Dealing with the theory of the first method is really very simple. It is easy to prove the following.
Corollary 37 Under the hypotheses of Theorem 36, if $I_1(\theta)$ is continuous at $\theta_0$, then under $P_{\theta_0}$
$$\sqrt{n I_1(\hat\theta_n)} \left( \hat\theta_n - \theta_0 \right) \stackrel{d}{\to} \mathrm{N}(0, 1).$$
(Actually, the continuity of $I_1(\theta)$ at $\theta_0$ may well follow from the hypotheses of Theorem 36 and therefore be redundant in the hypotheses of this corollary. I've not tried to think this issue through.)
The second method of approximating $I_1(\theta_0)$ comes about by realizing that in the proof of Theorem 36, $I_1(\theta_0)$ arises as the limit of $-\frac{1}{n} L_n''(\theta_0)$. One might then hope to approximate it by
$$-\frac{1}{n} L_n''(\hat\theta_n) = -\frac{1}{n} \sum_{i=1}^n \frac{d^2}{d\theta^2} \log f_\theta(X_i) \Big|_{\theta = \hat\theta_n},$$
which is sometimes called the "observed Fisher information." (Note that $-\frac{1}{n} L_n''(\hat\theta_n)$ is a measure of curvature of the average loglikelihood function $\frac{1}{n} l_n(X^n, \theta)$ at $\hat\theta_n$, which one is secretly thinking of as essentially a maximizer of the average loglikelihood.)
Schervish says that Efron and others claim that this second method is generally better than the first (in terms of producing intervals with real coverage properties close to the nominal properties). Proving an analogue of Corollary 37 where $-\frac{1}{n} L_n''(\hat\theta_n)$ replaces $I_1(\hat\theta_n)$ is somewhat harder, but not by any means impossible. That is, one can prove the following.
Corollary 38 Under the hypotheses of Theorem 36, under $P_{\theta_0}$
$$\sqrt{-L_n''(\hat\theta_n)} \left( \hat\theta_n - \theta_0 \right) \stackrel{d}{\to} \mathrm{N}(0, 1).$$
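To see the two routes side by side, here is a sketch (our own hypothetical example, not from the notes) again using the Cauchy location model, where the expected information is the constant $I_1(\theta) = 1/2$ while the observed information $-L_n''(\hat\theta_n)$ varies with the sample:

```python
import numpy as np

# Compare the "expected information" interval theta_hat +/- z / sqrt(n I_1)
# with the "observed information" interval theta_hat +/- z / sqrt(-L_n'').
# For the Cauchy location model I_1(theta) = 1/2 for every theta.

def derivs(theta, x):
    d = x - theta
    return np.sum(2 * d / (1 + d**2)), np.sum((2 * d**2 - 2) / (1 + d**2)**2)

rng = np.random.default_rng(1)
x = 2.0 + rng.standard_cauchy(400)

# Locate the MLE: crude grid maximization of L_n, then a short Newton polish.
grid = np.linspace(np.median(x) - 1.0, np.median(x) + 1.0, 4001)
ll = np.array([-np.sum(np.log(1 + (x - t) ** 2)) for t in grid])
theta_hat = grid[np.argmax(ll)]
for _ in range(5):
    lp, lpp = derivs(theta_hat, x)
    theta_hat -= lp / lpp

n, z = len(x), 1.96
half_expected = z / np.sqrt(n * 0.5)       # expected information I_1 = 1/2
lp, lpp = derivs(theta_hat, x)
half_observed = z / np.sqrt(-lpp)          # observed information -L_n''(theta_hat)
expected_ci = (theta_hat - half_expected, theta_hat + half_expected)
observed_ci = (theta_hat - half_observed, theta_hat + half_observed)
```

With heavy-tailed data the two half-widths genuinely differ sample to sample, which is exactly the point of the expected-vs-observed distinction.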
When, under the hypotheses of Theorem 36, the convergence in distribution of $\sqrt{n}(\hat\theta_n - \theta_0)$ is accompanied by convergence of second moments, one has
$$\mathrm{Var}_{\theta_0} \sqrt{n} \left( \hat\theta_n - \theta_0 \right) \to \frac{1}{I_1(\theta_0)}.$$
Then the fact that the C-R inequality (recall Stat 543) declares that under suitable conditions no unbiased estimator of $\theta$ has variance less than $1/(n I_1(\theta))$ makes one suspect that for a consistent sequence of estimators $\{\theta_n^*\}$, convergence under $\theta_0$
$$\sqrt{n} (\theta_n^* - \theta_0) \stackrel{d}{\to} \mathrm{N}(0, V(\theta_0))$$
might imply that
$$V(\theta_0) \ge \frac{1}{I_1(\theta_0)},$$
but in fact this inequality does not have to hold up.
Likelihood Ratio Tests, Related Tests and Confidence Procedures
One set of general purpose tools for making tests and confidence procedures consists of
the "likelihood ratio test," its relatives and related confidence procedures. The following
is an introduction to these methods and a little about their asymptotics.
For $\mathcal{P} = \{P_\theta\}$, $\mathrm{H}_0\colon \theta \in \Theta_0$ and $\mathrm{H}_1\colon \theta \in \Theta_1$, one might plausibly consider a test criterion like
$$LR(x) = \frac{\sup_{\theta \in \Theta_1} f_\theta(x)}{\sup_{\theta \in \Theta_0} f_\theta(x)}.$$
(Notice, by the way, that Bayes tests look at ratios of $G$ averages rather than sups.) Or, equivalently, one might consider
$$\lambda(x) = \max(LR(x), 1) = \frac{\sup_{\theta \in \Theta} f_\theta(x)}{\sup_{\theta \in \Theta_0} f_\theta(x)},$$
where we will want to reject $\mathrm{H}_0$ for large $\lambda(x)$. This is an intuitively reasonable criterion,
and many common tests (including optimal ones) turn out to be likelihood ratio tests. On
the other hand, it is easy enough to invent problems where LRTs need not be in any sense
optimal.
It is a useful fact that one can find some general results about how to set critical values for LRTs (at least for iid models and large $n$). This is because if there is an MLE, say $\hat\theta$, then
$$\sup_{\theta \in \Theta} f_\theta(x) = f_{\hat\theta}(x)$$
and we can make use of MLE-like estimation results. For example, in the iid context of Theorem 36, there is the following.
Theorem 39 Suppose that $X_1, X_2, \ldots, X_n$ are iid $P_\theta$, $\Theta \subset \Re^1$, and there exists an open neighborhood of $\theta_0$, say $O$, such that
i) $f_\theta(x) > 0$ $\forall x$ and $\forall \theta \in O$,
ii) $\forall x$, $f_\theta(x)$ is three times differentiable at every point $\theta \in O$,
iii) $\exists$ $M(x) \ge 0$ with $\mathrm{E}_{\theta_0} M(X) < \infty$ and
$$\left| \frac{d^3}{d\theta^3} \log f_\theta(x) \right| \le M(x) \quad \forall x \text{ and } \forall \theta \in O,$$
iv) $1 = \int f_\theta(x)\, d\mu(x)$ can be differentiated twice with respect to $\theta$ under the integral at $\theta_0$, i.e. both $0 = \int \frac{d}{d\theta} f_\theta(x) \big|_{\theta = \theta_0}\, d\mu(x)$ and $0 = \int \frac{d^2}{d\theta^2} f_\theta(x) \big|_{\theta = \theta_0}\, d\mu(x)$, and
v) $I_1(\theta) \equiv \mathrm{E}_\theta \left( \frac{d}{d\theta} \log f_\theta(X) \right)^2 \in (0, \infty)$ $\forall \theta \in O$.
If, with $P_{\theta_0}$ probability approaching 1, $\hat\theta_n$ is a root of the likelihood equation, and $\hat\theta_n \to \theta_0$ in $P_{\theta_0}$ probability, then under $P_{\theta_0}$
$$2 \log \lambda_n^* \equiv 2 \log \left( \frac{f^n_{\hat\theta_n}(X^n)}{f^n_{\theta_0}(X^n)} \right) \stackrel{d}{\to} \chi^2_1.$$
(The hypotheses here are exactly the hypotheses of Theorem 36.)
The application that people often make of this theorem is that they reject $\mathrm{H}_0\colon \theta = \theta_0$ in favor of $\mathrm{H}_1\colon \theta \ne \theta_0$ when $2 \log \lambda_n^*$ exceeds the $(1 - \alpha)$ quantile of the $\chi^2_1$ distribution.
Notice also that upon inverting these tests, the theorem gives a way of making large sample confidence sets for $\theta$. When $\hat\theta_n$ is a maximizer of the loglikelihood, the set of $\theta_0$ for which $\mathrm{H}_0\colon \theta = \theta_0$ will be accepted using the LRT with an (approximate) cut-off value (say $\chi^2_1$) is
$$\left\{ \theta \mid L_n(\theta) \ge L_n(\hat\theta_n) - \tfrac{1}{2} \chi^2_1 \right\}.$$
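A minimal sketch of inverting the LRT in this way, for a made-up iid Poisson($\theta$) sample (the model and data are our illustrative assumptions; 3.84 is the 0.95 quantile of $\chi^2_1$):

```python
import numpy as np

# Approximate 95% confidence set {theta : L_n(theta) >= L_n(theta_hat) - 3.84/2}
# for a Poisson mean, found by scanning a grid.  The MLE is theta_hat = xbar.

x = np.array([3, 0, 2, 4, 1, 2, 2, 5, 1, 3])
xbar = x.mean()

def loglik(theta):
    return np.sum(-theta + x * np.log(theta))   # log(x_i!) terms dropped

grid = np.linspace(0.5, 5.0, 2000)
vals = np.array([loglik(t) for t in grid])
keep = grid[vals >= loglik(xbar) - 3.84 / 2]
ci = (keep.min(), keep.max())     # an interval here, since L_n is concave
```

Because $L_n$ is concave in $\theta$ for this model, the confidence set is an interval; in general the set need not be.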
There are versions of the $\chi^2$ limit that cover (composite) null hypotheses about some part of a multi-dimensional parameter vector. For example, Schervish (page 459) proves a result like the following.
Result 40 Suppose that $\Theta \subset \Re^p$ and appropriate regularity conditions hold for iid observations $X_1, X_2, \ldots, X_n$. If $k \le p$ and
$$\lambda_n = \frac{\sup_\theta f^n_\theta(X^n)}{\sup_{\theta \text{ s.t. } \theta_i = \theta_{i0},\, i = 1, 2, \ldots, k} f^n_\theta(X^n)},$$
then under any $P_\theta$ with $\theta_i = \theta_{i0}$, $i = 1, 2, \ldots, k$,
$$2 \log \lambda_n \stackrel{d}{\to} \chi^2_k.$$
This result (of course) allows one to set approximate critical points for testing $\mathrm{H}_0\colon \theta_i = \theta_{i0}$, $i = 1, 2, \ldots, k$ vs $\mathrm{H}_1\colon$ not $\mathrm{H}_0$, by rejecting $\mathrm{H}_0$ when
$$2 \log \lambda_n > \text{the } (1 - \alpha)\ \chi^2_k \text{ quantile}.$$
And one can make large-$n$ approximate confidence sets for $(\theta_1, \theta_2, \ldots, \theta_k)$ of the form
$$\left\{ (\theta_1, \ldots, \theta_k) \,\Big|\, \sup_{\theta_{k+1}, \ldots, \theta_p} L_n((\theta_1, \ldots, \theta_k, \theta_{k+1}, \ldots, \theta_p)) \ge \sup_\theta L_n(\theta) - \tfrac{1}{2} \chi^2_k \right\}$$
for $\chi^2_k$ the $(1 - \alpha)$ quantile of the $\chi^2_k$ distribution.
There is some standard jargon associated with the function of $(\theta_1, \ldots, \theta_k)$ appearing in the above expressions.
Definition 15 The function of $(\theta_1, \ldots, \theta_k)$
$$L_n^*((\theta_1, \ldots, \theta_k)) \equiv \sup_{\theta_{k+1}, \ldots, \theta_p} L_n((\theta_1, \ldots, \theta_k, \theta_{k+1}, \ldots, \theta_p))$$
is called a profile likelihood function for the parameters $\theta_1, \ldots, \theta_k$.
With this jargon/notation, the form of the large-$n$ confidence set becomes
$$\left\{ (\theta_1, \ldots, \theta_k) \,\Big|\, L_n^*((\theta_1, \ldots, \theta_k)) \ge \sup_{(\theta_1, \ldots, \theta_k)} L_n^*((\theta_1, \ldots, \theta_k)) - \tfrac{1}{2} \chi^2_k \right\},$$
which looks more like the form from the case of a one-dimensional parameter.
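For instance, in the $\mathrm{N}(\mu, \sigma^2)$ model the supremum over $\sigma^2$ can be taken analytically, so the profile for $\mu$ has a closed form. A sketch with made-up data (this example is ours, not from the notes):

```python
import numpy as np

# Profile loglikelihood for mu in the N(mu, sigma^2) model: maximizing the
# loglikelihood over sigma^2 for fixed mu gives sigma_hat^2(mu) = mean((x-mu)^2)
# and, up to an additive constant,
#   L_n^*(mu) = -(n/2) * (log(sigma_hat^2(mu)) + 1).

x = np.array([1.2, -0.4, 0.8, 2.1, 0.3, 1.7, -0.2, 0.9])
n = len(x)

def profile_loglik(mu):
    return -0.5 * n * (np.log(np.mean((x - mu) ** 2)) + 1.0)

mu_hat = x.mean()                              # maximizer of the profile

# LRT statistic for H0: mu = 0, computed straight from the profile:
two_log_lambda = 2 * (profile_loglik(mu_hat) - profile_loglik(0.0))
```

Here $2\log\lambda_n$ reduces to $n \log(\hat\sigma^2(0)/\hat\sigma^2(\hat\mu))$, and the profile-based confidence set for $\mu$ is $\{\mu : L_n^*(\mu) \ge L_n^*(\hat\mu) - \tfrac{1}{2}\chi^2_1\}$.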
There are alternatives to the LRT of $\mathrm{H}_0\colon \theta_i = \theta_{i0}$, $i = 1, 2, \ldots, k$ vs $\mathrm{H}_1\colon$ not $\mathrm{H}_0$. One such test is the so-called "Wald test." See, for example, pages 115-120 of Silvey. The following is borrowed pretty much directly from Silvey.
Consider $\Theta \subset \Re^p$ and $k$ restrictions on $\theta$
$$h_1(\theta) = h_2(\theta) = \cdots = h_k(\theta) = 0$$
(like, for instance, $\theta_1 - \theta_{10} = \theta_2 - \theta_{20} = \cdots = \theta_k - \theta_{k0} = 0$). Suppose that $\hat\theta_n$ is an MLE of $\theta$ and define
$$h(\theta) \equiv (h_1(\theta), h_2(\theta), \ldots, h_k(\theta)).$$
Then if $\mathrm{H}_0\colon h_1(\theta) = h_2(\theta) = \cdots = h_k(\theta) = 0$ is true, $h(\hat\theta_n)$ ought to be near 0. The question is how to measure "nearness" and to set a critical value in order to have a test with size approximately $\alpha$. An approach to doing this is as follows.
We expect (under suitable conditions) that under $P_\theta$ the ($p$-dimensional) estimator $\hat\theta_n$ has
$$\sqrt{n} (\hat\theta_n - \theta) \stackrel{d}{\to} \mathrm{N}_p(0, I_1^{-1}(\theta)).$$
Then if
$$H(\theta) \equiv \left( \frac{\partial h_i}{\partial \theta_j} \bigg|_\theta \right)_{k \times p},$$
the delta method suggests that
$$\sqrt{n} \left( h(\hat\theta_n) - h(\theta) \right) \stackrel{d}{\to} \mathrm{N}_k(0, H(\theta) I_1^{-1}(\theta) H'(\theta)).$$
Abbreviate $H(\theta) I_1^{-1}(\theta) H'(\theta)$ as $B(\theta)$. Since $h(\theta) = 0$ if $\mathrm{H}_0$ is true, the above suggests that
$$n h'(\hat\theta_n) B^{-1}(\theta) h(\hat\theta_n) \stackrel{d}{\to} \chi^2_k \ \text{ under } \mathrm{H}_0.$$
Now $n h'(\hat\theta_n) B^{-1}(\theta) h(\hat\theta_n)$ cannot serve as a test statistic, since it involves $\theta$, which is not completely specified by $\mathrm{H}_0$. But it is plausible to consider the statistic
$$W_n \equiv n h'(\hat\theta_n) B^{-1}(\hat\theta_n) h(\hat\theta_n)$$
and hope that under suitable conditions
$$W_n \stackrel{d}{\to} \chi^2_k \ \text{ under } \mathrm{H}_0.$$
If this can be shown to hold up, one may reject for $W_n >$ the $(1 - \alpha)$ quantile of the $\chi^2_k$ distribution.
The Wald tests for $h_i(\theta) = \theta_i - \theta_{i0}$, $i = 1, 2, \ldots, k$ can be inverted to get approximate confidence regions for $k \le p$ of the parameters $\theta_1, \theta_2, \ldots, \theta_p$. The resulting confidence regions will, by construction, be ellipsoidal.
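A numerical sketch of $W_n$ for the single restriction $h(\theta) = \theta_1 - \theta_2$ in a 3-parameter model; the values standing in for $\hat\theta_n$ and $I_1^{-1}(\hat\theta_n)$ below are made up for illustration, not from any real fit:

```python
import numpy as np

# Wald statistic W_n = n h(theta_hat)' B(theta_hat)^{-1} h(theta_hat) with
# B = H I_1^{-1} H', for h(theta) = theta_1 - theta_2 (so k = 1, p = 3).

n = 200
theta_hat = np.array([1.30, 1.10, 0.50])          # hypothetical MLE
I1_inv = np.array([[0.50, 0.10, 0.00],            # hypothetical I_1^{-1}
                   [0.10, 0.40, 0.05],
                   [0.00, 0.05, 0.30]])

H = np.array([[1.0, -1.0, 0.0]])                  # Jacobian of h, k x p
h = H @ theta_hat                                 # h(theta_hat) = [0.2]

B = H @ I1_inv @ H.T
W = float(n * h @ np.linalg.solve(B, h))
reject = W > 3.84                                 # 0.95 quantile of chi^2_1
```

Inverting such tests over $\theta_{10}$ (more generally over $(\theta_{10}, \ldots, \theta_{k0})$) produces the ellipsoidal regions just mentioned.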
There is a final alternative to the LRT and Wald tests that deserves mention. That is the "$\chi^2$ test" or "score test." See again Silvey. The notion here is that on occasion it can be easier to maximize $L_n(\theta)$ subject to $h(\theta) = 0$ (again taking this to be a statement of $k$ constraints on a $p$-dimensional parameter vector $\theta$) than to simply maximize $L_n(\theta)$. Let $\tilde\theta_n$ be a "restricted" MLE (i.e. a maximizer of $L_n(\theta)$ subject to $h(\theta) = 0$). One might expect that if $\mathrm{H}_0\colon h(\theta) = 0$ is true, then $\tilde\theta_n$ ought to be nearly an unrestricted maximizer of $L_n(\theta)$ and the partial derivatives of $L_n(\theta)$ should be nearly 0 at $\tilde\theta_n$. On the other hand, if $\mathrm{H}_0$ is not true, there is little reason to expect the partials to be nearly 0. So (again/still with $L_n(\theta) = \sum_{i=1}^n \log f_\theta(X_i)$) one might consider the statistic
$$X^2 \equiv \frac{1}{n} \left( \nabla L_n(\theta) \big|_{\theta = \tilde\theta_n} \right)' I_1^{-1}(\tilde\theta_n) \left( \nabla L_n(\theta) \big|_{\theta = \tilde\theta_n} \right).$$
Then, for
$$\lambda_n = \frac{\sup_\theta f^n_\theta(X^n)}{\sup_{\theta \text{ s.t. } h(\theta) = 0} f^n_\theta(X^n)},$$
it is typically the case that under $\mathrm{H}_0$, $X^2$ differs from $2 \log \lambda_n$ by a quantity tending to 0 in $P_\theta$ probability. That is, this statistic can be calibrated using $\chi^2$ critical points and can form the basis of a test of $\mathrm{H}_0$ asymptotically equivalent to a "Result 40 type" likelihood ratio test.
Once more (as for the Wald tests), for $h_i(\theta) = \theta_i - \theta_{i0}$, $i = 1, 2, \ldots, k$, one can invert these tests to get confidence regions for $k \le p$ of the parameters $\theta_1, \theta_2, \ldots, \theta_p$.
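For the simple hypothesis $\mathrm{H}_0\colon \theta = \theta_0$ the restricted MLE is just $\theta_0$ itself and the statistic has a closed form. A sketch for a made-up iid Poisson sample (our illustrative choice):

```python
import numpy as np

# Score test of H0: theta = theta0 for iid Poisson(theta) data.  Here
# d/dtheta L_n(theta) = sum(x_i / theta - 1) and I_1(theta) = 1/theta, so
# X^2 = (1/n) U(theta0)^2 / I_1(theta0) = n (xbar - theta0)^2 / theta0.

x = np.array([3, 1, 4, 1, 5])
n, xbar = len(x), x.mean()
theta0 = 2.0

U = np.sum(x / theta0 - 1.0)        # score evaluated at the restricted MLE
I1 = 1.0 / theta0                   # per-observation Fisher information
X2 = U**2 / (n * I1)

X2_closed = n * (xbar - theta0)**2 / theta0    # same thing in closed form
```

Note that no unrestricted maximization of $L_n$ was needed, which is exactly the computational selling point of the score test.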
As a final note on this topic, observe that
i) an LRT of the style in Result 40 requires computation of an MLE, $\hat\theta_n$, and a restricted MLE, $\tilde\theta_n$,
ii) a Wald test requires only computation of the MLE, $\hat\theta_n$, and
iii) a "$\chi^2$" test/"score test" requires only computation of the restricted MLE, $\tilde\theta_n$.
M-Estimation
From one point of view, this method of estimation is simply a generalization of maximum
likelihood. ("Robust" estimation proponents have other motivations for the method.)
Here is a very brief introduction to the subject.
Consider a statistical model with parameter space $\Theta \subset \Re^k$ and observations $X_1, X_2, \ldots, X_n$ iid $P_\theta$. Likelihood-based estimation uses the loglikelihood
$$L_n(\theta) = \sum_{i=1}^n \log f_\theta(X_i)$$
and typically chooses $\theta$ to maximize $L_n(\theta)$ or to solve the $k$-dimensional system of likelihood equations
$$\nabla L_n(\theta) = 0.$$
M-estimation uses some suitable function $\rho(x, \theta)$ and
$$\psi_n(\theta) = \sum_{i=1}^n \rho(X_i, \theta),$$
and chooses $\theta$ to maximize $\psi_n(\theta)$ or to solve the $k$-dimensional system of estimating equations
$$\nabla \psi_n(\theta) = 0.$$
(With $\psi_j = \frac{\partial \rho}{\partial \theta_j}$, this is $\sum_{i=1}^n \psi_j(X_i, \theta) = 0$ $\forall j = 1, 2, \ldots, k$.)
$\rho$ could be $-(x - \theta)^2$, $\log f_\theta$, or even $\log g_\theta$ for $g_\theta$ the density for some other family of models.
Provided that $\rho$ (or the $\psi_j$) interacts properly with the $f_\theta$, one gets properties for M-estimation like those we've deduced for maximum likelihood. For example,
$$\mathrm{E}_{\theta_0} \rho(X, \theta_0) > \mathrm{E}_{\theta_0} \rho(X, \theta) \quad \forall \theta_0 \ \text{and} \ \forall \theta \ne \theta_0,$$
plus other conditions, implies that an M-estimator is consistent. (Note that with $\rho$ the function $\log f_\theta$, this inequality comes from Fact 27 about K-L information.)
Further, under appropriate conditions, one can prove asymptotic normality of i) one-step Newton improvements on a consistent estimator of $\theta$, or ii) consistent roots of $\nabla \psi_n(\theta) = 0$. In such results, the counterpart of $I_1^{-1}(\theta_0)$ in the limiting $k$-variate normal distribution is
$$J^{-1}(\theta_0) K(\theta_0) J^{-1}(\theta_0)'$$
for
$$J(\theta_0) = -\mathrm{E}_{\theta_0} \left( \frac{\partial^2}{\partial \theta_i \partial \theta_j} \rho(X, \theta) \Big|_{\theta = \theta_0} \right)$$
and
$$K(\theta_0) = \left( \mathrm{Cov}_{\theta_0} \left( \frac{\partial}{\partial \theta_i} \rho(X, \theta) \Big|_{\theta = \theta_0},\ \frac{\partial}{\partial \theta_j} \rho(X, \theta) \Big|_{\theta = \theta_0} \right) \right).$$
(In the case of $\rho(x, \theta) = \log f_\theta(x)$, $J = K = I_1$.)
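A sketch of the empirical sandwich $\hat J^{-1} \hat K \hat J^{-1}$ in the simplest case $\rho(x, \theta) = -(x - \theta)^2$, where the M-estimate is the sample mean (the data are made up; in this special case the sandwich collapses to the $1/n$-divisor sample variance):

```python
import numpy as np

# M-estimation with rho(x, theta) = -(x - theta)^2, so psi = 2 (x - theta).
# Empirical versions of J and K evaluated at theta_hat:
#   J_hat = -(1/n) sum d psi / d theta = 2
#   K_hat =  (1/n) sum psi(x_i, theta_hat)^2
# and the estimated asymptotic variance is J_hat^{-1} K_hat J_hat^{-1}.

x = np.array([2.3, 1.1, 3.4, 0.7, 2.9, 1.8, 2.2])
theta_hat = x.mean()                 # solves sum 2 (x_i - theta) = 0

psi = 2 * (x - theta_hat)
J_hat = 2.0
K_hat = np.mean(psi ** 2)
sandwich = K_hat / J_hat ** 2        # scalar J^{-1} K J^{-1}
```

The same recipe (numerical $\hat J$, $\hat K$) applies to any smooth $\rho$, which is what makes the sandwich form useful under model misspecification.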
Simple Implications of the MLE Results for the Shape of the Loglikelihood and
Posteriors
In rough terms, the statistical folklore says that "likelihoods pile up around true values
and are locally quadratic near their peaks, while provided a prior is sufficiently diffuse,
posteriors will pile up near true values and look Gaussian." Doing a really good job of
even saying this precisely is nontrivial (involving uniformities of convergence of random
functions, etc.). What follows are some very simpleminded (and non-uniform type)
consequences of the material on maximum likelihood. For more precise and harder
theorems, see Schervish §7.4.
Lemma 41 For $X_1, X_2, \ldots, X_n$ iid $P_{\theta_0}$,
$$L_n(\theta_0) - L_n(\theta) \stackrel{P_{\theta_0}}{\to} \infty \quad \text{for } \theta \ne \theta_0.$$
Lemma 42 For $X_1, X_2, \ldots, X_n$ iid $P_{\theta_0}$ and $G$ a prior on $\Theta$ with density $g(\theta)$, if
$$g(\theta \mid x^n) = \frac{f^n_\theta(x^n)\, g(\theta)}{\int_\Theta f^n_\theta(x^n)\, g(\theta)\, d\nu(\theta)}$$
is the posterior density of $\theta \mid X^n$ (with respect to the measure $\nu$) and $g(\theta) > 0$, then
$$\frac{g(\theta \mid X^n)}{g(\theta_0 \mid X^n)} \stackrel{P_{\theta_0}}{\to} 0 \quad \text{for } \theta \ne \theta_0.$$
Corollary 43 Under the hypotheses of Lemma 42, if $\Theta$ is finite, $\nu$ is counting measure and $g(\theta) > 0$ $\forall \theta$, the (random) posterior distribution of $\theta \mid X^n$ is consistent in the sense that for $A$ a subset of $\Theta$,
$$g^{\theta \mid X^n}(A) \stackrel{P_{\theta_0}}{\to} I[\theta_0 \in A].$$
Next, here's a simple observation about the local shape of the loglikelihood near its maximum.
Lemma 44 Under the hypotheses of Theorem 36, suppose that $\hat\theta_n$ is an MLE for $\theta$ that is consistent at $\theta_0$. Then
$$Q_n(u) = L_n(\hat\theta_n) - L_n\!\left( \hat\theta_n + \frac{u}{\sqrt{n}} \right) \stackrel{P_{\theta_0}}{\to} \frac{1}{2} u^2 I_1(\theta_0).$$
(This lemma can be interpreted as saying that if one rescales properly so that there is a limiting shape, the loglikelihood is for large $n$ essentially quadratic near its maximum. This is not surprising, since it must have 0 derivative at the maximum and there is enough differentiability assumed to allow a 2nd order Taylor expansion.)
Corollary 45 Under the hypotheses of Lemma 44, suppose that a prior $G$ has a density $g(\theta)$ with respect to Lebesgue measure on $\Theta \subset \Re^1$ such that $g(\theta_0) > 0$ and $g$ is continuous at $\theta_0$. The posterior density of $\theta \mid X^n$ has the property that
$$R_n(u) = \log \left( \frac{g(\hat\theta_n \mid X^n)}{g(\hat\theta_n + \frac{u}{\sqrt{n}} \mid X^n)} \right) \stackrel{P_{\theta_0}}{\to} \frac{1}{2} u^2 I_1(\theta_0).$$
Now in some very rough sense, this says that near the MLE, the (random) log posterior density tends to look approximately quadratic in the distance from the MLE. But now note that if $Y$ is a continuous random variable with pdf $f$ and (exactly)
$$\log \frac{f(0)}{f(y)} = \frac{c y^2}{2},$$
it is the case that $Y$ must be normal with mean 0 and variance $\frac{1}{c}$. This then suggests that under $P_{\theta_0}$ one can expect posterior densities to start to look as if $\theta$ were normal with mean $\hat\theta_n$ and variance $\frac{1}{n I_1(\theta_0)}$. A Bayesian who wants a convenient approximate form for a posterior density (perhaps because the exact one is analytically intractable) might then hope to get approximate posterior probabilities from the
$$\mathrm{N}\!\left( \hat\theta_n, \frac{1}{n I_1(\hat\theta_n)} \right) \quad \text{or} \quad \mathrm{N}\!\left( \hat\theta_n, \frac{1}{-L_n''(\hat\theta_n)} \right)$$
distributions. (Schervish proves that the posterior density of $\sqrt{-L_n''(\hat\theta_n)} (\theta - \hat\theta_n)$, a random function of $\theta$, converges in some appropriate sense to the standard normal density. It is another, harder, matter to argue that approximate posterior probabilities for $\theta$ can be obtained using normal distributions.)
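A small check of this approximation in a case where the exact posterior is available: $n$ Bernoulli($\theta$) observations with a uniform prior give a Beta($s+1$, $n-s+1$) posterior, where $s$ is the number of successes (the counts below are made up):

```python
import math

# Compare the exact Beta posterior with the normal approximation
# N(theta_hat, 1 / (-L_n''(theta_hat))) for Bernoulli data and a flat prior.

n, s = 100, 60
theta_hat = s / n                                            # MLE
obs_info = s / theta_hat**2 + (n - s) / (1 - theta_hat)**2   # -L_n''(theta_hat)
approx_sd = math.sqrt(1.0 / obs_info)

a, b = s + 1, n - s + 1                      # exact posterior is Beta(a, b)
beta_mean = a / (a + b)
beta_sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
```

Here $-L_n''(\hat\theta_n) = n/(\hat\theta_n(1 - \hat\theta_n))$, and the approximating normal has mean 0.600 and sd about 0.049, against the exact posterior mean 0.598 and sd about 0.048.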
Some Simple Theory of Markov Chain Monte Carlo
It is sometimes the case that one has a "known" but unwieldy (joint) distribution for a high-dimensional random vector $Y$, and would like to know something about the distribution (like, for example, the marginal distribution of $Y_1$). While in theory it might be straightforward to compute the quantity of interest, in practice the necessary computation can be impossible. With the advent of cheap computing, given appropriate theory, the possibility exists of simulating a sample of $n$ realizations of $Y$ and computing an empirical version of the quantity of interest to approximate the (intractable) theoretical one.
This possibility is especially important to Bayesians, where $X \mid \theta \sim P_\theta$ and $\theta \sim G$ produce a joint distribution for $(X, \theta)$ and then (in theory) a posterior distribution $G(\theta \mid X)$, the object of primary Bayesian interest. Now (in the usual kind of notation)
$$g(\theta \mid X = x) \propto f_\theta(x)\, g(\theta),$$
so that in this sense the posterior of the possibly high-dimensional $\theta$ is "known." But it may well be analytically intractable, with (for example) no analytical way to compute $\mathrm{E}\, \theta_1$. Simulation is a way of getting (approximate) properties of an intractable/high-dimensional posterior distribution.
"Straightforward" simulation from an arbitrary multivariate distribution is, however, typically not straightforward. Markov Chain Monte Carlo methods have recently become very popular as a solution to this problem. The following is some simple theory covering the case where the (joint) distribution of $Y$ is discrete.
Definition 16 A (discrete time/discrete state space) Markov Chain is a sequence of random quantities $\{X_k\}$, each taking values in a (finite or) countable set $\mathcal{X}$, with the property that
$$P[X_n = x_n \mid X_1 = x_1, \ldots, X_{n-1} = x_{n-1}] = P[X_n = x_n \mid X_{n-1} = x_{n-1}].$$
Definition 17 A Markov Chain is stationary provided $P[X_n = x \mid X_{n-1} = x']$ is independent of $n$.
WOLOG we will henceforth name the elements of $\mathcal{X}$ with the integers $1, 2, 3, \ldots$ and call them "states."
Definition 18 With $p_{ij} \equiv P[X_n = j \mid X_{n-1} = i]$, the square matrix $P \equiv (p_{ij})$ is called the transition matrix for a stationary Markov Chain and the $p_{ij}$ are called transition probabilities.
Note that a transition matrix has nonnegative entries and row sums of 1. Such matrices are often called "stochastic" matrices. As a matter of further notation for a stationary Markov Chain, let
$$p^k_{ij} = P[X_{n+k} = j \mid X_n = i]$$
and
$$f^k_{ij} = P[X_{n+k} = j, X_{n+k-1} \ne j, \ldots, X_{n+1} \ne j \mid X_n = i].$$
(These are respectively the probabilities of moving from $i$ to $j$ in $k$ steps and first moving from $i$ to $j$ in $k$ steps.)
Definition 19 We say that a MC is irreducible if for each $i$ and $j$, $\exists$ $k$ (possibly depending upon $i$ and $j$) such that $p^k_{ij} > 0$. (A chain is irreducible if it is possible to eventually get from any state $i$ to any other state $j$.)
Definition 20 We say that the $i$th state of a MC is transient if $\sum_{k=1}^\infty f^k_{ii} < 1$ and say that the state is persistent if $\sum_{k=1}^\infty f^k_{ii} = 1$. A chain is called persistent if all of its states are persistent.
Definition 21 We say that state $i$ of a MC has period $t$ if $p^k_{ii} = 0$ unless $k = \nu t$ ($k$ is a multiple of $t$) and $t$ is the largest integer with this property. The state is aperiodic if no such $t > 1$ exists. And a MC is called aperiodic if all of its states are aperiodic.
Many sources (including Chapter 15 of the 3rd Edition of Feller Volume 1) present a
number of useful simple results about MC's. Among them are the following.
Theorem 46 All states of an irreducible MC are of the same type (with regard to
persistence and periodicity).
Theorem 47 A finite state space irreducible MC is persistent.
Theorem 48 Suppose that a MC is irreducible, aperiodic and persistent. Suppose further that for each state $i$ the mean recurrence time is finite, i.e.
$$\sum_{k=1}^\infty k f^k_{ii} < \infty.$$
Then an invariant/stationary distribution for the MC exists, i.e. $\exists$ $\{u_j\}$ with $u_j > 0$ and $\sum u_j = 1$ such that
$$u_j = \sum_i u_i p_{ij}.$$
(If the chain is started with distribution $\{u_j\}$, after one transition it is in states $1, 2, 3, \ldots$ with probabilities $\{u_j\}$.) Further, this distribution $\{u_j\}$ satisfies
$$u_j = \lim_{k \to \infty} p^k_{ij} \quad \forall i$$
and
$$u_j = \frac{1}{\sum_{k=1}^\infty k f^k_{jj}}.$$
There is a converse of this theorem.
Theorem 49 An irreducible, aperiodic MC for which $\exists$ $\{u_j\}$ with $u_j > 0$ and $\sum u_j = 1$ such that $u_j = \sum_i u_i p_{ij}$ must be persistent with
$$u_j = \frac{1}{\sum_{k=1}^\infty k f^k_{jj}}.$$
And there is an important "ergodic" result that guarantees that "time averages" have the right limits.
Theorem 50 Under the hypotheses of Theorem 48, if $g$ is a real-valued function such that
$$\sum_j |g(j)|\, u_j < \infty,$$
then for any $j$, if $X_0 = j$,
$$\frac{1}{n} \sum_{k=1}^n g(X_k) \stackrel{\text{a.s.}}{\to} \sum_j g(j)\, u_j.$$
(Note that the choice of $g$ as an indicator provides approximations for stationary probabilities.)
With this background, the basic idea of MCMC is the following. If we wish to simulate from a distribution $\{u_j\}$, or approximate properties of the distribution that can be expressed as moments of some function $g$, we find a convenient MC $\{X_k\}$ whose invariant distribution is $\{u_j\}$. From a starting state $X_0 = i$, one uses $P$ to simulate $X_1$. Using the realization $X_1 = x_1$ and $P$, one simulates $X_2$, etc. One applies Theorem 50 to approximate the quantity of interest. Actually, it is common practice to use a "burn in" of $m$ periods before starting the kind of time average indicated in Theorem 50. Two important questions about this plan (only one of which we'll address) are:
1) How does one set up an appropriate/convenient $P$?
2) How big should $m$ be? (We'll not touch this matter.)
In answer to question 1), there are presumably a multitude of chains that would do the job. But there is the following useful sufficient condition (that has application in the original motivating problem of simulating from high dimensional distributions) for a chain to have $\{u_j\}$ for an invariant distribution.
Lemma 51 If $\{X_k\}$ is a MC with transition probabilities satisfying
$$u_i p_{ij} = u_j p_{ji}, \qquad (\star)$$
then it has invariant distribution $\{u_j\}$.
Note then that if a candidate $P$ satisfies $(\star)$ and is irreducible and aperiodic, Theorem 49 shows that it is persistent, Theorem 48 then shows that any arbitrary starting value can be used and yields approximate realizations from $\{u_j\}$, and Theorem 50 implies that "time averages" can be used to approximate properties of $\{u_j\}$.
Several useful MCMC schemes can be shown to have the "right" invariant distributions by noting that they satisfy $(\star)$.
For example, Lemma 51 can be applied to the "Metropolis-Hastings Algorithm." That is, let $T = (t_{ij})$ be any stochastic matrix corresponding to an irreducible aperiodic MC. (Presumably, some choices will be better than others in terms of providing quick convergence to a target invariant distribution. But that issue is beyond the scope of the present exposition.) (Note that in a finite case, one can take $t_{ij} = 1/(\text{the number of states})$.) The Metropolis-Hastings algorithm simulates as follows:
- Supposing that $X_{n-1} = i$, generate $J$ (at random) according to the distribution over the state space specified by row $i$ of $T$, that is, according to $\{t_{ij}\}$.
- Then generate $X_n$ based on $i$ and (the randomly generated) $J$ according to
$$X_n = \begin{cases} J & \text{with probability } \min\left( 1, \frac{u_J t_{Ji}}{u_i t_{iJ}} \right) \\ i & \text{with probability } \max\left( 0, 1 - \frac{u_J t_{Ji}}{u_i t_{iJ}} \right). \end{cases} \qquad (\star\star)$$
Note that for $\{X_k\}$ so generated, for $j \ne i$
$$p_{ij} = P[X_n = j \mid X_{n-1} = i] = \min\left( 1, \frac{u_j t_{ji}}{u_i t_{ij}} \right) t_{ij}$$
and
$$p_{ii} = P[X_n = i \mid X_{n-1} = i] = t_{ii} + \sum_{j \ne i} \max\left( 0, 1 - \frac{u_j t_{ji}}{u_i t_{ij}} \right) t_{ij}.$$
So, for $i \ne j$,
$$u_i p_{ij} = \min(u_i t_{ij}, u_j t_{ji}) = u_j p_{ji}.$$
That is, $(\star)$ holds and the MC $\{X_k\}$ has stationary distribution $\{u_j\}$. (Further, the assumption that $T$ corresponds to an irreducible aperiodic chain implies that $\{X_k\}$ is irreducible and aperiodic.)
Notice, by the way, that in order to use the Metropolis-Hastings Algorithm one only has to have the $u_j$'s up to a multiplicative constant. That is, if we have $v_j \propto u_j$, we may compute the ratios $u_j / u_i$ as $v_j / v_i$ and are in business as far as simulation goes. (This fact is especially attractive in Bayesian contexts where it can be convenient to know a posterior only up to a multiplicative constant.) Notice also that if $T$ is symmetric (i.e. $t_{ij} = t_{ji}$), step $(\star\star)$ of the Metropolis algorithm becomes
$$X_n = \begin{cases} J & \text{with probability } \min\left( 1, \frac{u_J}{u_i} \right) \\ i & \text{with probability } \max\left( 0, 1 - \frac{u_J}{u_i} \right). \end{cases}$$
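The finite-state construction above is easy to verify numerically: build the full Metropolis transition matrix $P$ from a symmetric uniform proposal $T$ and weights $v_j \propto u_j$, then check $(\star)$ and the invariance $uP = u$ (the target weights below are arbitrary illustrative values):

```python
import numpy as np

# Metropolis algorithm (symmetric proposal) on a 4-state space.  Only the
# unnormalized weights v_j are used to build P, mimicking the situation
# where u_j is known up to a multiplicative constant.

v = np.array([1.0, 2.0, 3.0, 4.0])       # v_j proportional to the target u_j
u = v / v.sum()
m = len(v)
T = np.full((m, m), 1.0 / m)             # symmetric (uniform) proposal

P = np.zeros((m, m))
for i in range(m):
    for j in range(m):
        if j != i:
            P[i, j] = min(1.0, v[j] / v[i]) * T[i, j]
    P[i, i] = 1.0 - P[i].sum()           # leftover probability: stay at i

detailed_balance_gap = max(
    abs(u[i] * P[i, j] - u[j] * P[j, i]) for i in range(m) for j in range(m)
)
```

To actually simulate, one would draw $J$ from row $i$ of $T$ and accept with probability $\min(1, v_J / v_i)$, but building $P$ explicitly makes the detailed-balance check transparent.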
An algorithm similar to the Metropolis-Hastings algorithm is the "Barker Algorithm." The Barker algorithm modifies the above by replacing
$$\min\left( 1, \frac{u_J t_{Ji}}{u_i t_{iJ}} \right) \quad \text{with} \quad \frac{u_J t_{Ji}}{u_i t_{iJ} + u_J t_{Ji}}$$
in $(\star\star)$. Note that for this modification of the Metropolis algorithm, for $j \ne i$
$$p_{ij} = \left( \frac{u_j t_{ji}}{u_i t_{ij} + u_j t_{ji}} \right) t_{ij},$$
so
$$u_i p_{ij} = \frac{(u_i t_{ij})(u_j t_{ji})}{u_i t_{ij} + u_j t_{ji}} = u_j p_{ji}.$$
That is, $(\star)$ holds and thus Lemma 51 guarantees that $\{X_k\}$ has invariant distribution $\{u_j\}$. (And $T$ irreducible and aperiodic continues to imply that $\{X_k\}$ is also.)
Note also that since
$$\frac{u_j t_{ji}}{u_i t_{ij} + u_j t_{ji}} = \frac{\frac{u_j}{u_i} t_{ji}}{t_{ij} + \frac{u_j}{u_i} t_{ji}},$$
once again it suffices to know the $u_j$ up to a multiplicative constant in order to be able to implement Barker's algorithm. (This is again of comfort to Bayesians.) Finally, note that if $T$ is symmetric, step $(\star\star)$ of the Barker algorithm becomes
$$X_n = \begin{cases} J & \text{with probability } \frac{u_J}{u_i + u_J} \\ i & \text{with probability } \frac{u_i}{u_i + u_J}. \end{cases}$$
Finally, consider now the "Gibbs Sampler," or as Schervish correctly argues it should be termed, "Successive Substitution Sampling," for generating an observation from a high dimensional distribution. For sake of concreteness, consider the situation where the distribution of a discrete 3-dimensional random vector $(Y, V, W)$ with probability mass function $f(y, v, w)$ is at issue. In the general MC setup, a possible outcome $(y, v, w)$ represents a "state" and the distribution $f(y, v, w)$ is the "$\{u_j\}$" from which one wants to simulate. One defines a MC $\{X_k\}$ as follows. For an arbitrary starting state $(y_0, v_0, w_0)$:
- Generate $X_1 = (Y_1, v_0, w_0)$ by generating $Y_1$ from the conditional distribution of $Y \mid V = v_0$ and $W = w_0$, i.e. from the (conditional) distribution with probability function $f_{Y \mid V, W}(y \mid v_0, w_0) \equiv \frac{f(y, v_0, w_0)}{\sum_l f(l, v_0, w_0)}$. Let $y_1$ be the realized value of $Y_1$.
- Generate $X_2 = (y_1, V_1, w_0)$ by generating $V_1$ from the conditional distribution of $V \mid Y = y_1$ and $W = w_0$, i.e. from the (conditional) distribution with probability function $f_{V \mid Y, W}(v \mid y_1, w_0) \equiv \frac{f(y_1, v, w_0)}{\sum_l f(y_1, l, w_0)}$. Let $v_1$ be the realized value of $V_1$.
- Generate $X_3 = (y_1, v_1, W_1)$ by generating $W_1$ from the conditional distribution of $W \mid Y = y_1$ and $V = v_1$, i.e. from the (conditional) distribution with probability function $f_{W \mid Y, V}(w \mid y_1, v_1) \equiv \frac{f(y_1, v_1, w)}{\sum_l f(y_1, v_1, l)}$. Let $w_1$ be the realized value of $W_1$.
- Generate $X_4 = (Y_2, v_1, w_1)$ by generating $Y_2$ from the conditional distribution of $Y \mid V = v_1$ and $W = w_1$, i.e. from the (conditional) distribution with probability function $f_{Y \mid V, W}(y \mid v_1, w_1) \equiv \frac{f(y, v_1, w_1)}{\sum_l f(l, v_1, w_1)}$. Let $y_2$ be the realized value of $Y_2$.
And so on, for some appropriate number of cycles.
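The update steps above can be encoded as three coordinate-update transition matrices and checked directly, for a small made-up pmf, to each leave $f$ invariant (so that a full cycle does too):

```python
import itertools
import numpy as np

# SSS/Gibbs for a discrete pmf f(y, v, w) on {0,1}^3: each step re-draws one
# coordinate from its conditional distribution given the other two.  Below,
# update_matrix(c) is the 8x8 transition matrix for re-drawing coordinate c.

f = np.array([[[0.10, 0.05], [0.15, 0.05]],
              [[0.20, 0.10], [0.05, 0.30]]])      # f[y, v, w]; entries sum to 1
states = list(itertools.product(range(2), repeat=3))
idx = {s: k for k, s in enumerate(states)}
N = len(states)

def update_matrix(coord):
    P = np.zeros((N, N))
    for s in states:
        denom = sum(f[s[:coord] + (l,) + s[coord + 1:]] for l in range(2))
        for new in range(2):
            t = s[:coord] + (new,) + s[coord + 1:]
            P[idx[s], idx[t]] += f[t] / denom      # full-conditional probability
    return P

PY, PV, PW = (update_matrix(c) for c in range(3))
pi = np.array([f[s] for s in states])              # f, flattened to match states
P_cycle = PY @ PV @ PW                             # one full Y, V, W cycle
```

Simulating the sampler then amounts to repeatedly drawing from the rows of these matrices (or, in practice, directly from the full conditionals without ever forming the matrices).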
Note that with this algorithm, a typical transition probability (for a step where a $Y$ realization is going to be generated) is
$$P[X_n = (y', v, w) \mid X_{n-1} = (y, v, w)] = \frac{f(y', v, w)}{\sum_l f(l, v, w)},$$
so if $X_{n-1}$ has distribution $f$, the probability that $X_n = (y, v, w)$ is
$$\sum_{l'} f(l', v, w) \cdot \frac{f(y, v, w)}{\sum_l f(l, v, w)} = f(y, v, w),$$
that is, $X_n$ also has distribution $f$. And analogous results hold for transitions of all 3 types ($Y$, $V$ and $W$ realizations).
But it is not absolutely obvious how to apply results 48 and 49 here, since the chain is not stationary. The transition mechanism is not the same for a "$Y$" transition as it is for a "$V$" transition. The key to extricating oneself from this problem is to note that if $P_Y$, $P_V$ and $P_W$ are respectively the transition matrices for $Y$, $V$ and $W$ exchanges, then
$$P = P_Y P_V P_W$$
describes an entire cycle of the SSS algorithm. Then $\{X_k'\}$ defined by $X_n' = X_{3n}$ is a stationary Markov Chain with transition matrix $P$. The fact that $P_Y$, $P_V$ and $P_W$ all leave the distribution $f$ invariant then implies that $P$ also leaves $f$ invariant. So one is in a position to apply Theorems 49 and 48. If $P$ is irreducible and aperiodic, Theorem 49 says that the chain $\{X_k'\}$ is persistent, and then Theorems 48 and 50 say that $f$ can be simulated using an arbitrary starting state.
The non-discrete $Y$ version of all this is somewhat more subtle, but in many ways parallel to the exposition given above. There is a nice exposition in Tierney's December 1994 Annals of Statistics paper (available at ISU through JSTOR at http://www.jstor.org/cgi-bin/jstor/listjournal). The following is borrowed from that source.
Consider now a (possibly non-discrete) state space $\mathcal{X}$ and a target distribution $\pi$ on $(\mathcal{X}, \mathcal{B})$ (for which one wishes to approximate some property via simulation). Suppose that $P(x, A)\colon \mathcal{X} \times \mathcal{B} \to [0, 1]$ is a "transition kernel" (a regular conditional probability, i.e. a function such that for each $x$, $P(x, \cdot)$ is a probability measure, and for each $A$, $P(\cdot, A)$ is $\mathcal{B}$-measurable). Consider a stochastic process $\{X_k\}$ with the property that given $X_1 = x_1, \ldots, X_{n-1} = x_{n-1}$, the variable $X_n$ has distribution $P(x_{n-1}, \cdot)$. $\{X_k\}$ is a general stationary discrete time MC with transition kernel $P(x, A)$, and provided one can simulate from $P(x, A)$, one can simulate realizations of $\{X_k\}$. If $P(x, A)$ is properly chosen, empirical properties of a realization of $\{X_k\}$ can approximate theoretical properties of $\pi$.
Notice that the n-step transition kernels can be computed recursively beginning with
P(x,A) using

   P^n(x,A) = ∫ P^{n-1}(y,A) P(x,dy) .
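For a discrete chain this recursion is just repeated matrix multiplication, and one can watch the n-step kernels settle down to the invariant distribution. A small sketch with a made-up 3-state transition matrix (all numbers are arbitrary choices):

```python
import numpy as np

# A made-up irreducible, aperiodic transition matrix on 3 states.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# n-step kernels via the recursion P^n(x,A) = sum_y P^{n-1}(y,A) P(x,y)
Pn = np.eye(3)
for _ in range(200):
    Pn = P @ Pn

# Every row of P^n approaches the invariant distribution pi (pi P = pi).
pi = Pn[0]
assert np.allclose(pi @ P, pi, atol=1e-10)
print(np.round(pi, 4))
```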
Definition 22 P has invariant distribution π means that ∀ A ∈ ℬ

   π(A) = ∫ P(x,A) dπ(x) .
Definition 23 P is called π-irreducible if for each x and each A with π(A) > 0 ∃
n(x,A) ≥ 1 such that P^{n(x,A)}(x,A) > 0.
Definition 24 A π-irreducible transition kernel P is called periodic if ∃ an integer d ≥ 2
and a sequence of sets A_0, A_1, ..., A_{d-1}, A_d = A_0 in ℬ such that for i = 0, 1, ..., d−1
and all x ∈ A_i, P(x, A_{i+1}) = 1. If a π-irreducible transition kernel P is not periodic it is
called aperiodic.
Definition 25 A π-irreducible transition kernel P with invariant distribution π is called
Harris recurrent if the corresponding MC, {X_k}, has the property that ∀ A ∈ ℬ with
π(A) > 0 and all x ∈ 𝒳

   P_{X_0 = x}[ X_n ∈ A infinitely often ] = 1 .
Theorem 52 Suppose that transition kernel P has invariant distribution π, is
π-irreducible, aperiodic and Harris recurrent. If g is such that

   ∫ |g(x)| dπ(x) < ∞ ,

then for X_0 = x any element of 𝒳

   (1/n) Σ_{k=1}^{n} g(X_k) → ∫ g(x) dπ(x)   a.s.
Theorem 52 says that integrals of g (like, for example, probabilities and moments) can be
approximated by long run sample averages of a sample path of the MC {X_k} beginning
from any starting value, provided a suitable P can be identified. "Suitable" means, of
course, that it has the properties hypothesized in Theorem 52 and one can simulate from
it knowing no more about π than is known in practice. The game of MCMC is then to
identify usable P's. As Tierney's article lays out, the general versions of SSS, the Barker
algorithm and the Metropolis-Hastings algorithm (and others) can often be shown to
satisfy the hypotheses of Theorem 52.
The meaning of "SSS" in the general case is obvious (one generates in turn from each
conditional of a coordinate of a multidimensional ] given all other current coordinates
and considers what one has at the end of each full cycle of these simulations). To
illustrate what another version of MCMC means in the general case, consider the
Metropolis-Hastings algorithm.
Suppose that Q(x,A) is some convenient transition kernel with the property that for
some σ-finite measure μ on (𝒳,ℬ) (dominating π) and a nonnegative function q(x,y) on
𝒳 × 𝒳 (that is ℬ × ℬ measurable) with ∫_𝒳 q(x,y) dμ(y) = 1 for all x,

   Q(x,A) = ∫_A q(x,y) dμ(y) .

Let

   dπ/dμ = p

and suppose that π is not concentrated on a single point. Then with

   α(x,y) = min( p(y)q(y,x) / (p(x)q(x,y)) , 1 )   if p(x)q(x,y) > 0
          = 1                                      if p(x)q(x,y) = 0

and

   p(x,y) = q(x,y)α(x,y)   if x ≠ y
          = 0              if x = y

a Metropolis-Hastings transition kernel is

   P(x,A) = ∫_A p(x,y) dμ(y) + ( 1 − ∫_𝒳 p(x,y) dμ(y) ) Δ_x(A)   (⋆⋆⋆)

where Δ_x is a probability measure degenerate at x. The "simulation interpretation" of this
is that if X_{n-1} = x, one generates y from Q(x,·) and computes α(x,y). Then with
probability α(x,y) one sets X_n = y, and otherwise takes X_n = x. Note that as in the
discrete version, the algorithm depends on p only through the ratios p(y)/p(x), so it
suffices to know p only up to a multiplicative constant, a fact of great comfort to
Bayesians wishing to study analytically intractable posteriors.
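The simulation interpretation is easy to code. Below is a minimal sketch (not from Tierney's paper) using a symmetric normal random-walk proposal, so that q(x,y) = q(y,x) and α(x,y) reduces to min(p(y)/p(x), 1); the target p is a deliberately unnormalized standard-normal density, illustrating that only the ratios p(y)/p(x) are needed:

```python
import math
import random

random.seed(1)

def p(x):
    """Unnormalized target density (here a standard normal, left
    unnormalized on purpose: only ratios p(y)/p(x) ever get used)."""
    return math.exp(-0.5 * x * x)

def mh_chain(n, x0=0.0, step=1.0):
    """Metropolis-Hastings with a symmetric random-walk proposal."""
    x, out = x0, []
    for _ in range(n):
        y = x + random.gauss(0.0, step)   # generate y from Q(x, .)
        alpha = min(p(y) / p(x), 1.0)     # acceptance probability
        if random.random() < alpha:       # with probability alpha, X_n = y
            x = y                         # otherwise X_n = x
        out.append(x)
    return out

xs = mh_chain(50_000)
mean = sum(xs) / len(xs)
var = sum((v - mean) ** 2 for v in xs) / len(xs)
# long-run sample averages approximate the target's moments (0 and 1 here)
print(round(mean, 2), round(var, 2))
```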
It is straightforward to verify that the kernel defined by ⋆⋆⋆ is π-invariant. It turns out
that sufficient conditions for the kernel P to be π-irreducible are that Q is π-irreducible
and either that i) q(x,y) > 0 ∀ x and y or that ii) q(x,y) = q(y,x) ∀ x and y. In turn,
π-irreducibility of P is sufficient to imply Harris recurrence for P. And, if P is
π-irreducible, a sufficient condition for aperiodicity of P is that

   π( { x | ∫ p(x,y) dμ(y) < 1 } ) > 0

so there are some fairly concrete conditions that can be checked to verify the hypotheses
of Theorem 52.
A (Very) Little Bit About Bootstrapping
Suppose that the entries of \ œ Ð\" ß \# ß ÞÞÞß \8 Ñ are iid according to some unknown
(possibly multivariate) distribution J , and for a random function
VÐ\ß J Ñ
there is some aspect of the distribution of VÐ\ß J Ñ that is of statistical interest. For
example, in a univariate context, we might be interested in the (F) probability that

   R(X,F) = (X̄ − μ_F) / (S/√n)

has absolute value less than 2. Often, even if one knew F, the problem of finding
properties of the distribution of VÐ\ß J Ñ would be analytically intractable. (Of course,
knowing J , simulation might provide at least some approximation for the property of
interest ...)
As a means of possibly approximating the distribution of R(X,F), Efron suggested the
following. Suppose that F̂ is an estimator of F. In parametric contexts, where θ̂(X) is
an estimator of θ, this could be F̂ = F_{θ̂}. In nonparametric contexts, if F̂_n is the empirical
distribution of the sample X, we might take F̂ = F̂_n. Suppose then that
X* = (X*_1, ..., X*_n) has entries that are an iid sample from F̂. One might hope that (at
least in some circumstances)

   the F distribution of R(X,F) ≈ the F̂ distribution of R(X*,F̂) .
(This will not always be a reasonable approximation.) Now even the right side of this
expression will typically be analytically intractable. What, however, can be done is to
simulate from this distribution. (Where one cannot simulate from the distribution on the
left because F is not known, simulation from the distribution on the right is possible.)
sÞ
That is, suppose that \ ‡" ß \ ‡# ß ÞÞÞß \ ‡F are F (bootstrap) iid samples of size 8 from J
Using these, one may compute
sÑ
V3‡ œ VÐ\ ‡3 ß J
s distribution
and use the empirical distribution of ÖV3‡ l3 œ "ß ÞÞÞß F × to approximate the J
s Ñ.
of VÐ\ ‡ ß J
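A minimal sketch of the nonparametric version, with F̂ the empirical distribution F̂_n and R the studentized mean from the earlier example; the data, n, and B below are made-up choices for illustration:

```python
import math
import random

random.seed(2)

# Made-up sample standing in for X = (X_1, ..., X_n) iid from unknown F.
x = [random.gauss(10.0, 2.0) for _ in range(30)]
n = len(x)

def R(sample, mu):
    """R(X, F) = (Xbar - mu_F) / (S / sqrt(n)), with mu_F the mean of the
    distribution the sample was drawn from."""
    m = sum(sample) / n
    s = math.sqrt(sum((v - m) ** 2 for v in sample) / (n - 1))
    return (m - mu) / (s / math.sqrt(n))

mu_hat = sum(x) / n          # the mean of Fhat_n, the empirical distribution
B = 2000
r_star = []
for _ in range(B):
    star = [random.choice(x) for _ in range(n)]   # iid sample from Fhat_n
    r_star.append(R(star, mu_hat))

# The empirical distribution of {R*_i} approximates, e.g., the Fhat
# probability that |R| < 2, which in turn approximates P_F[|R(X,F)| < 2].
approx = sum(abs(r) < 2 for r in r_star) / B
print(round(approx, 2))
```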
General Decision Theory
This is a formal framework for posing problems of finding good statistical procedures as
optimization problems.
Basic Elements of a (Statistical) Decision Problem:

X (the data): formally a mapping (Ω,ℱ) → (𝒳,ℬ)

Θ (the parameter space): often one needs to do integrations over Θ, so let ℰ
be a σ-algebra on Θ and assume that for B ∈ ℬ, the
mapping taking θ → P_θ(B) from (Θ,ℰ) → (ℝ¹,ℬ¹) is
measurable

𝒜 (the decision/action space): the class of possible decisions or actions;
we will also need to average over (random) actions,
so let 𝒯 be a σ-algebra on 𝒜

L(θ,a) (the loss function): L : Θ × 𝒜 → [0,∞) and needs to be suitably
measurable

δ(x) (a decision rule): δ(x) : (𝒳,ℬ) → (𝒜,𝒯) and for data X, δ(X) is the
decision taken on the basis of observing X
The fact that PÐ)ß $ Ð\ÑÑ is random prevents any sensible comparison of decision rules
until one has averaged out over the distribution of \ , producing something called a risk
function.
Definition 26 VÐ)ß $ Ñ » E) PÐ)ß $ Ð\ÑÑ œ ' PÐ)ß $ ÐBÑÑ.T) ÐBÑ maps @pÒ!ß_Ñ and is
called the risk function (of the decision rule $ ).
Definition 27 $ is at least as good as $ w if VÐ)ß $ Ñ Ÿ VÐ)ß $ w Ñ a)Þ
Definition 28 $ is better than $ w if VÐ)ß $ Ñ Ÿ VÐ)ß $ w Ñ a) with strict inequality for at
least one ) - @.
Definition 29 $ and $ w are risk equivalent if VÐ)ß $ Ñ œ VÐ)ß $ w Ñ a)Þ
(Note that risk equivalency need not imply equality of decision rules.)
Definition 30 $ is best in a class of decision rules ? if
i) $ - ?
and
ii) $ is at least as good as any other $ w - ? Þ
Typically, only smallish classes of decision rules will have best elements. (For example,
in some point estimation problems there are best unbiased estimators, where no estimator
is best overall. Here the class ? is the class of unbiased estimators.) An alternative to
looking for best elements in small classes of rules is to somehow reduce the function of
), VÐ)ß $ Ñ, to a single number and look for rules minimizing the number. (Reduction by
averaging amounts to looking for Bayes rules. Reduction by taking a maximum value
amounts to looking for minimax rules.)
Definition 31 $ is inadmissible in ? if b $ w - ? which is better than $ (which
"dominates" $ ).
Definition 32 $ is admissible in ? if it is not inadmissible in ? (if there is no $ w - ?
which is better).
On the face of it, it would seem that one would never want to use an inadmissible
decision rule. But, as it turns out, there can be problems where all rules are inadmissible.
(More in this direction later.)
Largely for technical reasons, it is convenient/important to somewhat broaden one's
understanding of what constitutes a decision rule. To this end, as a matter of notation, let
W œ Ö$ ÐBÑ× œ the class of (nonrandomized) decision rules .
Then one can define two fairly natural notions of randomized decision rules.
Definition 33 Suppose that for each B - k , 9B is a distribution on ÐT ß X Ñ. Then 9B is
called a behavioral decision rule.
The idea is that having observed \ œ B, one then randomizes over the action space
according to the distribution 9B in order to choose an action. (This should remind you of
what is done in some theoretical arguments in hypothesis testing.)
Notation: Let W‡ œ Ö9B × œ the class of behavioral decision rules
It is possible to think of W as a subset of W‡ . For $ - W, let 9B$ be a point mass
distribution on T concentrated at $ ÐBÑ. Then 9B$ - W‡ and is "clearly" equivalent to $ .
The risk of a behavioral decision rule is (abusing notation and reusing the symbols
R(·,·) in this context as well as that of Definition 26) clearly

   R(θ,φ) = ∫_𝒳 ∫_𝒜 L(θ,a) dφ_x(a) dP_θ(x) .
A somewhat less intuitively appealing notion of randomization for decision rules is next.
Let Y be a 5-algebra on W that contains the singleton sets.
Definition 34 A randomized decision function (or decision rule) < is a probability
measure on ÐWß Y ÑÞ ($ with distribution < becomes a random object.)
The idea here is that prior to observation, one chooses an element of W by use of <, say $ ,
and then having observed \ œ B decides $ ÐBÑ.
Notation: Let W‡ œ Ö<× œ the class of randomized decision rules.
It is possible to think of W as a subset of W‡ by identifying $ - W with <$ placing mass 1
on $ . <$ is "clearly" equivalent to $ .
The risk of a randomized decision rule is (once again abusing notation and reusing the
symbols R(·,·)) clearly

   R(θ,ψ) = ∫ R(θ,δ) dψ(δ) = ∫ ∫_𝒳 L(θ,δ(x)) dP_θ(x) dψ(δ) ,

assuming that R(θ,ψ) is properly measurable.
The behavioral decision rules are probably more appealing/natural than the randomized
decision rules. On the other hand, the randomized decision rules are often easier to deal
with in proofs. So a reasonably important question is "When are W‡ and W‡ equivalent in
the sense of generating the same set of risk functions?"
"Result" 53 (Some properly qualified version of this is true ... See Ferguson pages
26-27) If 𝒜 is a complete separable metric space with 𝒯 the Borel σ-algebra, appropriate
conditions hold on the P_θ and ??????? holds, then the class of behavioral decision rules
and the class of randomized decision rules are equivalent in terms of generating the same
set of risk functions.
It is also of interest to know when the extra complication of considering randomized rules
is really not necessary. One simple result in that direction concerns cases where P is
convex in its second argument.
Lemma 54 (See page 151 of Schervish, page 40 of Berger and page 78 of Ferguson)
Suppose that T is a convex subset of e. and 9B is a behavioral decision rule. Define a
nonrandomized decision rule $ by
   δ(x) = ∫_𝒜 a dφ_x(a) .
(In the case that . ž ", interpret $ ÐBÑ as vector-valued, the integral as a vector of
integrals over the . coordinates of + - T .) Then
i) If PÐ)ß † Ñ À T pÒ!ß_Ñ is convex, then
VÐ)ß $ Ñ Ÿ VÐ)ß 9Ñ Þ
ii) If PÐ)ß † Ñ À T pÒ!ß_Ñ is strictly convex, VÐ)ß 9Ñ • _ and
T) ÖBl9B is nondegenerate× ž ! ,
then
VÐ)ß $ Ñ • VÐ)ß 9Ñ Þ
(Strict convexity of a function 1 means that for ! - Ð!ß "Ñ
1Ð!B € Ð" • !ÑCÑ • !1ÐBÑ € Ð" • !Ñ1ÐCÑ
for B and C in the convex domain of 1Þ See Berger pages 38 through 40. There is an
extension of Jensen's inequality that says that if 1 is strictly convex and the distribution of
\ is not concentrated at a point, then E1Ð\Ñ ž 1ÐEÐ\ÑÑ.)
Corollary 55 Suppose that T is a convex subset of e. and 9B is a behavioral decision
rule. Define a nonrandomized decision rule $ by
   δ(x) = ∫_𝒜 a dφ_x(a) .
i) If PÐ)ß † Ñ À T pÒ!ß_Ñ is convex a), then $ is at least as good as 9.
ii) If PÐ)ß † Ñ À T pÒ!ß_Ñ is convex a) and for some )! the function
PÐ)! ß † Ñ À T pÒ!ß_Ñ is strictly convex, VÐ)! ß 9Ñ • _ and
T)! ÖBl9B is nondegenerate× ž !
then $ is better than 9.
Corollary 55 says roughly that in the case of a convex loss function, one need not worry
about considering randomized decision rules.
(Finite Dimensional) Geometry of Decision Theory
A very helpful device for understanding some of the basics of decision theory is to
consider the geometry associated with cases where @ is finite. So until further notice,
assume
@ œ Ö)" ß )# ß ÞÞÞß )5 ×
and
VÐ)ß <Ñ • _ a ) - @ and < - W‡ .
As a matter of notation, let
f œ ÖC œ ÐC" ß C# ß ÞÞÞß C5 Ñ - e5 l C3 œ VÐ)3 ß <Ñ a3 and < - W‡ × .
f is the set of all randomized risk vectors.
Theorem 56 f is convex.
In fact, it turns out to be the case, and is more or less obvious from Theorem 56, that if
f ! œ ÖClC3 œ VÐ)3 ß <Ñ a3 and < - W× ,
then f is the convex hull of f ! , i.e. the smallest convex set containing f ! .
Definition 35 For x ∈ ℝ^k, the lower quadrant of x is

   Q_x = { z | z_i ≤ x_i ∀ i } .
Theorem 57 y ∈ 𝒮 (or the decision rule giving rise to y) is admissible iff

   Q_y ∩ 𝒮 = {y} .
Definition 36 The lower boundary of 𝒮 is

   λ(𝒮) = { y | Q_y ∩ cl(𝒮) = {y} }

for cl(𝒮) the closure of 𝒮.
Definition 37 f is closed from below if -Ðf Ñ § f .
Notation: Let EÐf Ñ stand for the class of admissible rules (or risk vectors corresponding
to them) in f .
Theorem 58 If f is closed from below, then EÐf Ñ œ -Ðf ÑÞ
Theorem 58w If f is closed, then EÐf Ñ œ -Ðf Ñ.
Complete Classes of Decision Rules
Now drop the finiteness assumption on @.
Definition 38 A class of decision rules V § W‡ is a complete class if for any 9 Â V, b a
9w - V such that 9w is better than 9.
Definition 39 A class of decision rules V § W‡ is an essentially complete class if for any
9 Â V, b a 9w - V such that 9w is at least as good as 9.
Definition 40 A class of decision rules V § W‡ is a minimal complete class if it is
complete and is a subset of any other complete class.
Notation: Let AÐW‡ Ñ be the set of admissible rules in W‡ .
Theorem 59 If a minimal complete class V exists, then V œ EÐW‡ Ñ.
Theorem 60 If EÐW‡ Ñ is complete, then it is minimal complete.
Theorems 59 and 60 are the properly qualified versions of the rough (and not quite true)
statement "EÐW‡ Ñ equals the minimal complete class".
Sufficiency and Decision Theory
We worked very hard on the notion of sufficiency, with the goal in mind of establishing
that one "doesn't need" \ if one has X Ð\Ñ. That hard work ought to have some
implications for decision theory. Presumably, in most cases one can get by with basing a
decision rule on a sufficient statistic.
Result 61 (See page 120 of Ferguson, page 36 of Berger and page 151 of Schervish)
Suppose a statistic X is sufficient for c œ ÖT) ×. Then if 9 is a behavioral decision rule
there exists another behavioral decision rule 9w that is a function of X and has the same
risk function as 9.
Note that 9w a function of X means that X ÐBÑ œ X ÐCÑ implies that 9Bw and 9Cw are the
same measures on T . Note also that Result 61 immediately implies that the class of
decision rules in W‡ that are functions of X is essentially complete. In terms of risk, one
need only consider rules that are functions of sufficient statistics.
The construction used in the proof of Result 61 doesn't aim to improve on the original
(behavioral) decision rule by moving to a function of the sufficient statistic. But in
convex loss cases, there is a construction that produces rules depending on a sufficient
statistic and potentially improving on an initial (nonrandomized) decision rule. This is
the message of Theorem 64. But to prove Theorem 64, a lemma of some independent
interest is needed. That lemma and an immediate corollary are stated before Theorem 64.
Lemma 62 Suppose that T § e5 is convex and $" ÐBÑ and $# ÐBÑ are two nonrandomized
decision rules. Then $ ÐBÑ œ "# Ð$" ÐBÑ € $# ÐBÑÑ is also a decision rule. Further,
i) if PÐ)ß +Ñ is convex in + and VÐ)ß $" Ñ œ VÐ)ß $# Ñ, then VÐ)ß $ Ñ Ÿ VÐ)ß $" Ñ, and
ii) if PÐ)ß +Ñ is strictly convex in +ß V Ð)ß $" Ñ œ VÐ)ß $# Ñ • _ and
T) Ð$" Ð\Ñ Á $# Ð\ÑÑ ž !, then VÐ)ß $ Ñ • VÐ)ß $" ÑÞ
Corollary 63 Suppose T § e5 is convex and $" ÐBÑ and $# ÐBÑ are two nonrandomized
decision rules with identical risk functions. If PÐ)ß +Ñ is convex in + a), the existence of a
) - @ such that PÐ)ß +Ñ is strictly convex in +ß V Ð)ß $" Ñ œ VÐ)ß $# Ñ • _ and
T) Ð$" Ð\Ñ Á $# Ð\ÑÑ ž !, implies that $" and $# are inadmissible.
Theorem 64 (See Berger page 41, Ferguson page 121, TPE page 50, Schervish page
152) (Rao-Blackwell Theorem) Suppose that T § e5 is convex and $ is a
nonrandomized decision function with E) l$ Ð\Ñl • _ a)Þ Suppose further that X is
sufficient for ) and with U! œ U ÐX Ñ, let
$! ÐBÑ » EÒ$ lU! ÓÐBÑ Þ
Then $! is a nonrandomized decision rule. Further,
i) if PÐ)ß +Ñ is convex in +, then VÐ)ß $! Ñ Ÿ VÐ)ß $ Ñ, and
ii) for any ) for which PÐ)ß +Ñ is strictly convex in +ß V Ð)ß $ Ñ • _ and
T) Ð$ Ð\Ñ Á $! Ð\ÑÑ ž !, one has that VÐ)ß $! Ñ • VÐ)ß $ ÑÞ
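The effect of Theorem 64 can be seen in the Bernoulli example from the start of these notes: δ(X) = X₁ is unbiased for p, T = ΣXᵢ is sufficient, and δ₀(X) = E[δ|T] = T/n. A simulation sketch under squared error loss (the values of p, n, and the replication count are arbitrary choices):

```python
import random

random.seed(3)
p, n, reps = 0.3, 10, 20_000

risk_delta = risk_delta0 = 0.0
for _ in range(reps):
    x = [1 if random.random() < p else 0 for _ in range(n)]
    delta = x[0]            # unbiased but crude estimator of p
    delta0 = sum(x) / n     # E[delta | T] for the sufficient T = sum(x)
    risk_delta += (delta - p) ** 2
    risk_delta0 += (delta0 - p) ** 2

risk_delta /= reps
risk_delta0 /= reps
# Rao-Blackwell: conditioning on T cannot increase (here strictly
# reduces) the squared error risk: p(1-p) vs p(1-p)/n.
print(round(risk_delta, 3), round(risk_delta0, 3))
```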
Bayes Rules
The Bayes approach to decision theory is one means of reducing risk functions to
numbers so that they can be compared in a straightforward fashion.
Notation: Let K be a distribution on Ð@ß V Ñ. K is usually (for philosophical reasons)
called the "prior" distribution on @.
Definition 41 The Bayes risk of φ ∈ 𝒟* with respect to the prior G is (abusing notation
again and once more using the symbols R(·,·))

   R(G,φ) = ∫_Θ R(θ,φ) dG(θ) .

Notation: Again abusing notation somewhat, let

   R(G) ≡ inf_{φ′ ∈ 𝒟*} R(G,φ′) .
Definition 42 9 is said to be Bayes with respect to K (or to be a Bayes rule with respect
to K) provided
VÐKß 9Ñ œ VÐKÑ Þ
Definition 43 9 is said to be %-Bayes with respect to K provided
VÐKß 9Ñ Ÿ VÐKÑ € % Þ
Bayes rules are often admissible, at least if the prior involved "spreads its mass around
sufficiently."
Theorem 65 If @ is finite (or countable), K is a prior distribution with KÐ)Ñ ž ! a) and
9 is Bayes with respect to K, then 9 is admissible.
Theorem 66 Suppose @ § e5 is such that every neighborhood of any point ) - @ has
nonempty intersection with the interior of @. Suppose further that VÐ)ß 9Ñ is continuous
in ) for all 9 - W‡ . Let K be a prior distribution that has support @ in the sense that
every open ball that is a subset of @ has positive K probability. Then if VÐKÑ • _ and
9 is a Bayes rule with respect to K, 9 is admissible.
Theorem 67 If every Bayes rule with respect to K has the same risk function, they are all
admissible.
Corollary 68 In a problem where Bayes rules exist and are unique, they are admissible.
A converse of Theorem 65 (for finite @) is given in Theorem 70. Its proof requires the
use of a piece of 5 -dimensional analytic geometry (called the separating hyperplane
theorem) that is stated next.
Theorem 69 Let f" and f# be two disjoint convex subsets of e5 . Then b a : - e5 such
that : Á ! and !:3 B3 Ÿ !:3 C3 aC - f" and B - f# .
Theorem 70 If @ is finite and 9 is admissible, then 9 is Bayes with respect to some
prior.
For nonfinite @, admissible rules are "usually" Bayes or "limits" of Bayes rules. See
Section 2.10 of Ferguson or page 546 of Berger.
As the next result shows, Bayesians don't need randomized decision rules, at least for
purposes of achieving minimum Bayes risk, VÐKÑ.
Result 71 (See also page 147 of Schervish) Suppose that < - W‡ is Bayes versus K and
VÐKÑ • _. Then there exists a nonrandomized rule $ - W that is also Bayes versus KÞ
There remain the questions of 1) when Bayes rules exist and #Ñ what they look like when
they do exist. These issues are (to some extent) addressed next.
Theorem 72 If @ is finite, f is closed from below and K assigns positive probability to
each ) - @, there is a rule Bayes versus K.
Definition 44 A formal nonrandomized Bayes rule versus a prior G is a rule δ(x) such
that ∀ x ∈ 𝒳,

   δ(x) is an a ∈ 𝒜 minimizing ∫_Θ L(θ,a) [ f_θ(x) / ∫_Θ f_θ(x) dG(θ) ] dG(θ) .
Definition 45 If G is a σ-finite measure, a formal nonrandomized generalized Bayes rule
versus G is a rule δ(x) such that ∀ x ∈ 𝒳,

   δ(x) is an a ∈ 𝒜 minimizing ∫_Θ L(θ,a) f_θ(x) dG(θ) .
Minimax Decision Rules
An alternative to the Bayesian reduction of VÐ)ß 9Ñ to a number by averaging according
to the distribution K, is to instead reduce VÐ)ß 9Ñ to a number by maximization over ).
Definition 46 A decision rule φ ∈ 𝒟* is said to be minimax if

   sup_θ R(θ,φ) = inf_{φ′} sup_θ R(θ,φ′) .
Definition 47 If a decision rule 9 has constant (in )) risk, it is called an equalizer rule.
It should make some sense that if one is trying to find a minimax rule by so to speak
pushing down the highest peak in the profile VÐ)ß 9Ñ, doing so will tend to drive one
towards rules with "flat" VÐ)ß 9Ñ profiles.
Theorem 73 If 9 is an equalizer rule and is admissible, then it is minimax.
Theorem 74 Suppose that Ö93 × is a sequence of decision rules, each 93 Bayes against a
prior K3 . If VÐK3 ß 93 ÑpG and 9 is a decision rule with VÐ)ß 9Ñ Ÿ G a), then 9 is
minimax.
Corollary 75 If 9 is Bayes versus K and VÐ)ß 9Ñ Ÿ VÐKÑ a), then 9 is minimax.
Corollary 75' If 9 is an equalizer rule and is Bayes with respect to a prior K, then it is
minimax.
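A classical illustration of Corollary 75′ (standard, though not worked in these notes): for X ∼ Binomial(n,p) under squared error loss, the Bayes estimator versus a Beta(√n/2, √n/2) prior is δ(X) = (X + √n/2)/(n + √n), and its risk is constant in p, so it is an equalizer Bayes rule and hence minimax. A quick numerical check of the constant-risk claim (n = 16 is an arbitrary choice):

```python
import math

n = 16
a = math.sqrt(n) / 2          # Beta(a, a) prior parameter

def risk(p):
    """R(p, delta) = E_p (delta(X) - p)^2 for delta(X) = (X + a)/(n + 2a),
    computed from Var_p X = n p (1 - p) and the bias of delta."""
    c = 1.0 / (n + 2 * a)
    bias = c * (n * p + a) - p
    var = c * c * n * p * (1 - p)
    return var + bias ** 2

risks = [risk(p / 10) for p in range(11)]
# Equalizer: the risk profile is flat in p, equal to n / (4 (n + sqrt(n))^2).
assert max(risks) - min(risks) < 1e-12
print(round(risks[0], 6))
```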
A reasonable question to ask, in the light of Corollaries 75 and 75', is "Which K should I
be looking for in order to produce a Bayes rule that might be minimax?" The answer to
this question is "Look for a least favorable prior."
Definition 48 A prior distribution G is said to be least favorable if

   R(G) = sup_{G′} R(G′) .
Intuitively, one might think that if an intelligent adversary were choosing ), he/she might
well generate it according to a least favorable K, therefore maximizing the minimum
possible risk. In response a decision maker ought to use 9 Bayes versus K. Perhaps then,
this Bayes rule might tend to minimize the worst that could happen over choice of )?
Theorem 76 If 9 is Bayes versus K and VÐ)ß 9Ñ Ÿ VÐKÑ a), then K is least favorable.
(Corollary 75 already shows 9 in Theorem 76 to be minimax.)
Optimal Point Estimation (Unbiasedness, Information Inequalities
and Invariance and Equivariance)
Unbiasedness
One possible limited class of "sensible" decision rules in estimation problems is the class
of "unbiased" estimators. (Here, we are tacitly assuming that T is e" or some subset of
it.)
Definition 49 The bias of an estimator δ(x) for γ(θ) ∈ ℝ¹ is

   E_θ(δ(X) − γ(θ)) = ∫_𝒳 ( δ(x) − γ(θ) ) dP_θ(x) .
Definition 50 An estimator $ ÐBÑ is unbiased for # Ð)Ñ - e" provided
E) Ð$ Ð\Ñ • # Ð)ÑÑ œ ! a) Þ
Before jumping into the business of unbiased estimation with too much enthusiasm, an
early caution is in order. We supposedly believe that "Bayes ¸ admissible" and that
admissibility is sort of a minimal "goodness" requirement for a decision rule ... but there
are the following theorem and corollary.
Theorem 77 (See Ferguson pages 48 and 52) If PÐ)ß +Ñ œ AÐ)ÑÐ#Ð )Ñ • +Ñ# for
AÐ)Ñ ž !, and $ is both Bayes wrt K with finite risk function and unbiased for # Ð)Ñ, then
(according to the joint distribution of Ð\ß )ÑÑ
$ Ð\Ñ œ # Ð)Ñ a.s. Þ
Corollary 78 Suppose @ § e" has at least two elements and the family c œ ÖT) × is
mutually absolutely continuous. If PÐ)ß +Ñ œ Ð) • +Ñ# no Bayes estimator of ) is
unbiased.
Definition 51 $ is best unbiased for # Ð)Ñ - e" if it is unbiased and at least as good as
any other unbiased estimator.
There need not be any unbiased estimators in a given estimation problem, let alone a best
one. But for many common estimation loss functions, if there is a best unbiased
estimator it is unique.
Theorem 79 Suppose that T § e" is convex, PÐ)ß +Ñ is convex in + a) and $" and $#
are two best unbiased estimators of # Ð)Ñ. Then for any ) for which PÐ)ß +Ñ is strictly
convex in + and VÐ)ß $" Ñ • _
T) Ò$" Ð\Ñ œ $# Ð\ÑÓ œ " Þ
The special case of PÐ)ß +Ñ œ Ð#Ð )Ñ • +Ñ# is, of course, one of most interest. For this
case of unweighted squared error loss, for estimators with second moments
VÐ)ß $ Ñ œ Var) $ Ð\Ñ.
Definition 52 $ is called a UMVUE (uniformly minimum variance unbiased estimator)
of # Ð)Ñ - e" if it is unbiased and
Var) $ Ð\Ñ Ÿ Var) $ w Ð\Ñ a)
for any other unbiased estimator of # Ð)Ñ, $ w .
A much weaker notion is next.
Definition 53 $ is called locally minimum variance unbiased at )! - @ if it is unbiased
and
Var)! $ Ð\Ñ Ÿ Var)! $ w Ð\Ñ
for any other unbiased estimator of # Ð)Ñ, $ w .
Notation: Let h! œ Ö$ l E) $ Ð\Ñ œ ! and E) $ # Ð\Ñ • _ a)×Þ (h! is the set of unbiased
estimators of 0 with finite variance.)
Theorem 80 Suppose that $ is an unbiased estimator of # Ð)Ñ - e" and E) $ # Ð\Ñ • _
a). Then
$ is a UMVUE of # Ð)Ñ iff Cov) Ð$ Ð\Ñß ?Ð\ÑÑ œ ! a) a? - h! .
Theorem 81 (Lehmann-Scheffé) Suppose that T is a convex subset of e" , X is a
sufficient statistic for c œ ÖT) × and $ is an unbiased estimator of # Ð)Ñ. Let
$! œ EÒ$lX Ó œ 9 ‰ X .
Then $! is an unbiased estimator of # Ð)ÑÞ If in addition, X is complete for c ,
i) 9w ‰ X unbiased for # Ð)Ñ implies that 9w ‰ X œ 9 ‰ X a.s. T) a), and
ii) PÐ)ß +Ñ convex in + a) implies that
a) $! is best unbiased, and
b) if $ w is any other best unbiased estimator of # Ð)Ñ
$ w œ $! a.s. T)
for any ) at which VÐ)ß $! Ñ • _ and PÐ)ß +Ñ is strictly convex in +.
(Note that iia) says in particular that $! is UMVUE, and that iib) says that $! is the unique
UMVUE.)
Basu's Theorem (recall Stat 543) is on occasion helpful in doing the conditioning required
to get δ₀ of Theorem 81. The most famous example of this is the calculation of the
UMVUE of Φ((c − μ)/σ) based on n iid N(μ, σ²) observations.
Information Inequalities
These are inequalities about variances of estimators, some of which involve information
measures. (See Schervish Sections 2.3.1, 5.1.2 and TPE Sections 2.6, 2.7.)
The fundamental tool here is the Cauchy-Schwarz inequality.
Lemma 82 (Cauchy-Schwarz) If U and V are random variables with EU² < ∞ and
EV² < ∞, then

   EU² · EV² ≥ (EUV)² .

Applying this to U = X − EX and V = Y − EY, we get the following.
Corollary 83 If X and Y have finite second moments, (Cov(X,Y))² ≤ VarX · VarY.
Corollary 84 Suppose that δ(Y) takes values in ℝ¹ and E(δ(Y))² < ∞. If g is a
measurable function such that Eg(Y) = 0 and 0 < E(g(Y))² < ∞, then defining
C ≡ Eδ(Y)g(Y),

   Var δ(Y) ≥ C² / E(g(Y))² .
A restatement of Corollary 84 in terms that look more "statistical" is next.
Corollary 84′ Suppose that δ(X) takes values in ℝ¹ and E_θ(δ(X))² < ∞. If g_θ is a
measurable function such that E_θ g_θ(X) = 0 and 0 < E_θ(g_θ(X))² < ∞, then defining
C(θ) ≡ E_θ δ(X)g_θ(X),

   Var_θ δ(X) ≥ C²(θ) / E_θ(g_θ(X))² .
Theorem 85 (Cramér-Rao) Suppose that the model 𝒫 is FI regular at θ₀ and that
0 < I(θ₀) < ∞. If E_θ δ(X) = ∫ δ(x) f_θ(x) dμ(x) can be differentiated under the integral
at θ₀, i.e. if

   (d/dθ) E_θ δ(X) |_{θ=θ₀} = ∫ δ(x) (d/dθ) f_θ(x) |_{θ=θ₀} dμ(x) ,

then

   Var_{θ₀} δ(X) ≥ [ (d/dθ) E_θ δ(X) |_{θ=θ₀} ]² / I(θ₀) .

For the special case of δ(X) viewed as an estimator of θ, we sometimes write
E_θ δ(X) = θ + b(θ), and thus Theorem 85 says Var_{θ₀} δ(X) ≥ (1 + b′(θ₀))² / I(θ₀) .
On the face of it, showing that the C-R lower bound is achieved looks like a general
method of showing that an estimator is UMVUE. But the fact is that the circumstances in
which the C-R lower bound is achieved are very well defined. The C-R inequality is
essentially the Cauchy-Schwarz inequality, and one gets equality in Lemma 82 iff
V = aU a.s. This translates in the proof of Theorem 85 to equality iff

   (d/dθ) log f_θ(X) |_{θ=θ₀} = a(θ₀) δ(X) + b(θ₀)   a.s. P_{θ₀} .
And, as it turns out, this can happen for all )! in a parameter space @ iff one is dealing
with a one parameter exponential family of distributions and the statistic $Ð\Ñ is the
natural sufficient statistic. So the method works only for UMVUEs of the means of the
natural sufficient statistics in one parameter exponential families.
Result 86 Suppose that for η real, {P_η} is dominated by a σ-finite measure μ, and with
h(x) > 0,

   f_η(x) = h(x) exp( a(η) + η T(x) )   a.e. μ, ∀ η .

Then for any η₀ in the interior of the natural parameter space where Var_η T(X) < ∞,
T(X) achieves the C-R lower bound for its variance.
(A converse of Result 86 is true, but is also substantially harder to prove.)
The C-R inequality is Corollary 84 with g_θ(x) = (d/dθ) log f_θ(x). There are other
interesting choices of g_θ. For example (assuming it is permissible to move enough
derivatives through integrals), g_θ(x) = ( (d^k/dθ^k) f_θ(x) ) / f_θ(x) produces an
interesting inequality. And the choice g_θ(x) = ( f_θ(x) − f_{θ′}(x) ) / f_θ(x) leads to the
next result.
Theorem 87 If P_{θ′} ≪ P_θ,

   Var_θ δ(X) ≥ ( E_θ δ(X) − E_{θ′} δ(X) )² / E_θ[ ( (f_θ(X) − f_{θ′}(X)) / f_θ(X) )² ] .
So, (the Chapman-Robbins Inequality)

   Var_θ δ(X) ≥ sup_{θ′ s.t. P_{θ′} ≪ P_θ} ( E_θ δ(X) − E_{θ′} δ(X) )² / E_θ[ ( (f_θ(X) − f_{θ′}(X)) / f_θ(X) )² ] .
There are multiparameter versions of the information inequalities (and for that matter,
multiple $ versions involving the covariance matrix of the $ 's). The fundamental result
needed to produce multiparameter versions of the information inequalities is the simple
generalization of Corollary 84.
Lemma 88 Suppose that δ(Y) takes values in ℝ¹ and E(δ(Y))² < ∞. Suppose further
that g_1, g_2, ..., g_k are measurable functions such that Eg_i(Y) = 0 and 0 < E(g_i(Y))² < ∞
∀ i and the g_i are linearly independent as square integrable functions according to the
probability measure P. Let C_i ≡ Eδ(Y)g_i(Y) and define a k × k matrix

   H ≡ ( E g_i(Y) g_j(Y) ) .

Then for C = (C_1, C_2, ..., C_k)′,

   Var δ(Y) ≥ C′ H⁻¹ C .
The "statistical-looking" version of Lemma 88 is next.
Lemma 88′ Suppose that δ(X) takes values in ℝ¹ and E_θ(δ(X))² < ∞. Suppose
further that g_{θ1}, g_{θ2}, ..., g_{θk} are measurable functions such that E_θ g_{θi}(X) = 0 and
0 < E_θ(g_{θi}(X))² < ∞ ∀ i and the g_{θi} are linearly independent as square integrable
functions according to the probability measure P_θ. Let C_i(θ) ≡ E_θ δ(X)g_{θi}(X) and
define a k × k matrix

   H(θ) ≡ ( E_θ g_{θi}(X) g_{θj}(X) ) .

Then for C(θ) = (C_1(θ), C_2(θ), ..., C_k(θ))′,

   Var_θ δ(X) ≥ C(θ)′ H(θ)⁻¹ C(θ) .
And there is the multiparameter version of the C-R inequality.
Theorem 89 (Cramér-Rao) Suppose that the model 𝒫 is FI regular at θ₀ ∈ Θ ⊂ ℝ^k and
that I(θ₀) exists and is nonsingular. If E_θ δ(X) = ∫ δ(x) f_θ(x) dμ(x) can be
differentiated wrt each θ_i under the integral at θ₀, i.e. if

   (∂/∂θ_i) E_θ δ(X) |_{θ=θ₀} = ∫ δ(x) (∂/∂θ_i) f_θ(x) |_{θ=θ₀} dμ(x) ,

then with

   C(θ₀) ≡ ( (∂/∂θ_1) E_θ δ(X) |_{θ=θ₀}, (∂/∂θ_2) E_θ δ(X) |_{θ=θ₀}, ..., (∂/∂θ_k) E_θ δ(X) |_{θ=θ₀} )′ ,

   Var_{θ₀} δ(X) ≥ C(θ₀)′ I(θ₀)⁻¹ C(θ₀) .
There are multiple $ versions of this information inequality stuff. See Rao, pages 326+.
Invariance and Equivariance
Unbiasedness (of estimators) is one means of restricting a class of decision rules down to
a subclass small enough that there is hope of finding a "best" rule in the subclass.
Invariance/equivariance considerations is another. These notions have to do with
symmetries in a decision problem. "Invariance" means that (some) things don't change.
"Equivariance" means that (some) things change similarly or equally.
We'll do the case of Location Parameter Estimation first, and then turn to generalities
about invariance/equivariance that apply more broadly (to both other estimation problems
and to other types of decision problems entirely).
Definition 54 When k œ e5 and @ § e5 , to say that ) is a location parameter means
that \ • ) has the same distribution a).
Definition 55 When k œ e5 and @ § e1 , to say that ) is a 1-dimensional location
parameter means that \ • )" (for " a 5 -vector of "'s) has the same distribution a).
Note that for ) a "-dimensional location parameter, \ µ T) means that
Ð\ € - "Ñ µ T-€) Þ It therefore seems that a "reasonable" restriction on estimators $ is
that they should satisfy $ ÐB € - "Ñ œ $ ÐBÑ € - . How to formalize this?
Until further notice suppose that @ œ T œ e" .
Definition 56 L(θ, a) is called location invariant if L(θ, a) = ρ(θ − a). If ρ(z) is increasing in z for z > 0 and decreasing in z for z < 0, the problem is called a location estimation problem.

Definition 57 A decision rule δ is called location equivariant if δ(x + c1) = δ(x) + c.
Theorem 90 If θ is a 1-dimensional location parameter, L(θ, a) is location invariant and δ is location equivariant, then R(θ, δ) is constant in θ. (That is, δ is an equalizer rule.)
This theorem says that if I'm working on a location estimation problem with a location
invariant loss function and restrict attention to location equivariant estimators, I can
compare estimators by comparing numbers. In fact, as it turns out, there is a fairly
explicit representation for the best location equivariant estimator in such cases.
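Here is a quick Monte Carlo check of Theorem 90 (my own illustration, not from the notes; it assumes iid N(θ, 1) data and squared-error loss). The sample mean is location equivariant and squared-error loss is location invariant, so the estimated risk should not depend on θ.

```python
import random
import statistics

def mc_risk(theta, estimator, n=20, reps=20000, seed=0):
    """Monte Carlo risk under squared-error loss, data iid N(theta, 1)."""
    rng = random.Random(seed)   # same seed for every theta: common random numbers
    total = 0.0
    for _ in range(reps):
        x = [rng.gauss(theta, 1.0) for _ in range(n)]
        total += (estimator(x) - theta) ** 2
    return total / reps

# The sample mean is location equivariant: mean(x + c1) = mean(x) + c.
risks = [mc_risk(theta, statistics.fmean) for theta in (-5.0, 0.0, 3.0)]
spread = max(risks) - min(risks)
print(risks, spread)   # risks all near 1/n = 0.05, spread essentially 0
```

With common random numbers across the three θ's the equalizer property shows up exactly, not just up to simulation error.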
Definition 58 A function u is called location invariant if u(x + c1) = u(x).
Lemma 91 Suppose that δ₀ is location equivariant. Then δ₁ is location equivariant iff ∃ a location invariant function u such that δ₁ = δ₀ + u.

Lemma 92 A function u is location invariant iff u depends on x only through y(x) = (x₁ − x_k, x₂ − x_k, ..., x_{k−1} − x_k).
Theorem 93 Suppose that L(θ, a) = ρ(θ − a) is location invariant and ∃ a location equivariant estimator δ₀ for θ with finite risk. Let ρ′(z) = ρ(−z). Suppose that for each y ∃ a number v*(y) that minimizes

E_{θ=0}[ρ′(δ₀ − v) | Y = y]

over choices of v. Then

δ*(x) = δ₀(x) − v*(y(x))

is a best location equivariant estimator of θ.
In the case of L(θ, a) = (θ − a)², the estimator of Theorem 93 is the Pitman estimator, which can in some situations be written out in a more explicit form.
Theorem 94 Suppose that L(θ, a) = (θ − a)² and the distribution of X = (X₁, X₂, ..., X_k) has R-N derivative wrt Lebesgue measure on ℝᵏ

f_θ(x) = f(x₁ − θ, x₂ − θ, ..., x_k − θ) = f(x − θ1)

where f has second moments. Then the best location equivariant estimator of θ is

δ*(x) = ∫_{−∞}^{∞} θ f(x − θ1) dθ / ∫_{−∞}^{∞} f(x − θ1) dθ

if this is well-defined and has finite risk. (Notice that this is the formal generalized Bayes estimator vs Lebesgue measure.)
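Theorem 94's ratio of integrals can be evaluated numerically. The following sketch (my own illustration, assuming iid standard normal errors, for which the Pitman estimator is known to reduce to the sample mean) approximates both integrals with a midpoint Riemann sum.

```python
import math

def pitman_estimate(x, f, lo=-10.0, hi=10.0, m=4000):
    """delta*(x) = Int theta f(x - theta 1) dtheta / Int f(x - theta 1) dtheta,
    both integrals approximated by a midpoint Riemann sum on [lo, hi]."""
    h = (hi - lo) / m
    num = den = 0.0
    for j in range(m):
        theta = lo + (j + 0.5) * h
        w = f([xi - theta for xi in x])
        num += theta * w
        den += w
    return num / den

def normal_joint(z):
    # joint density f(z1, ..., zk) of iid standard normal "errors"
    return math.exp(-0.5 * sum(zi * zi for zi in z)) / (2 * math.pi) ** (len(z) / 2)

x = [0.3, -1.1, 2.0, 0.7]
est = pitman_estimate(x, normal_joint)
xbar = sum(x) / len(x)
print(est, xbar)   # for normal errors the two should agree closely
```

Swapping in a non-normal f (say, a double exponential) gives a Pitman estimator that is no longer the sample mean.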
Lemma 95 For the case of L(θ, a) = (θ − a)², a best location equivariant estimator of θ must be unbiased for θ.
Now cancel the assumption that Θ = 𝒜 = ℝ¹. The following is a more general story about invariance/equivariance.

Suppose that 𝒫 = {P_θ} is identifiable and (to avoid degeneracies) that

L(θ, a) = L(θ, a′) ∀θ ⇒ a = a′ .
Invariance/equivariance theory is phrased in terms of groups of transformations

i) 𝒳 → 𝒳,
ii) Θ → Θ, and
iii) 𝒜 → 𝒜,

and how L(θ, a) and decision rules behave under these groups of transformations. The natural starting place is with transformations on 𝒳.
Definition 59 For 1-1 transformations of 𝒳 onto 𝒳, g₁ and g₂, the composition of g₁ and g₂ is another 1-1 onto transformation of 𝒳 defined by g₂ ∘ g₁(x) = g₂(g₁(x)) ∀x ∈ 𝒳.

Definition 60 For g a 1-1 transformation of 𝒳 onto 𝒳, if h is another 1-1 onto transformation such that h ∘ g(x) = x ∀x ∈ 𝒳, h is called the inverse of g and denoted as h = g⁻¹.

It is then an easy fact that not only is g⁻¹ ∘ g(x) = x ∀x, but g ∘ g⁻¹(x) = x ∀x as well.

Definition 61 The identity transformation on 𝒳 is defined by e(x) = x ∀x.

Another easy fact is then that e ∘ g = g ∘ e = g for any 1-1 transformation g of 𝒳 onto 𝒳.
(Obvious) Result 96 If 𝒢 is a class of 1-1 transformations of 𝒳 onto 𝒳 that is closed under composition (g₁, g₂ ∈ 𝒢 ⇒ g₁ ∘ g₂ ∈ 𝒢) and inversion (g ∈ 𝒢 ⇒ g⁻¹ ∈ 𝒢), then with the operation "∘" (composition) 𝒢 is a group in the usual mathematical sense.

A group such as that described in Result 96 will be called a transformation group. Such a group may or may not be Abelian/commutative (one is if g₁ ∘ g₂ = g₂ ∘ g₁ ∀g₁, g₂ ∈ 𝒢).

Jargon/notation: Any class of transformations 𝒞 can be thought of as generating a group. That is, 𝒢(𝒞) consisting of all compositions of finite numbers of elements of 𝒞 and their inverses is a group under the operation "∘".
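As a concrete instance of Result 96 (my own toy illustration): the location shifts g_c(x) = x + c of ℝ onto itself are closed under composition and inversion, so they form a transformation group, and an Abelian one.

```python
# Location shifts g_c(x) = x + c on the real line: composition gives
# g_b o g_a = g_{a+b}, the inverse is g_{-c}, and the identity is e = g_0.
class Shift:
    def __init__(self, c):
        self.c = c
    def __call__(self, x):
        return x + self.c
    def compose(self, other):      # self o other
        return Shift(self.c + other.c)
    def inverse(self):
        return Shift(-self.c)

g1, g2, e = Shift(2.5), Shift(-1.0), Shift(0.0)
x = 3.0
assert g2.compose(g1)(x) == g2(g1(x))                 # composition law
assert g1.inverse().compose(g1)(x) == x               # g^{-1} o g = e
assert e.compose(g1)(x) == g1(x) == g1.compose(e)(x)  # identity
assert g1.compose(g2)(x) == g2.compose(g1)(x)         # this group is Abelian
print("shift-group axioms verified")
```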
To get anywhere with this theory, we then need transformations on Θ that fit together nicely with those on 𝒳. At a minimum, one will need that the transformations on 𝒳 don't get one outside of 𝒫.

Definition 62 If g is a 1-1 transformation of 𝒳 onto 𝒳 such that {P_θ^{g(X)}}_{θ∈Θ} = 𝒫, then the transformation g is said to leave the model invariant. If each element of a group of transformations 𝒢 leaves 𝒫 invariant, we'll say that the group leaves the model invariant.

One circumstance in which 𝒢 leaves 𝒫 invariant is where the model 𝒫 can be thought of as "generated by 𝒢" in the following sense. Suppose that U has some fixed probability distribution P₀ and consider

𝒫 = {P^{g(U)} | g ∈ 𝒢} .
If a group of transformations on 𝒳 (say 𝒢) leaves 𝒫 invariant, there is a natural corresponding group of transformations on Θ. That is, for g ∈ 𝒢

X ∼ P_θ ⇒ g(X) ∼ P_{θ′} for some θ′ ∈ Θ .

In fact, because of the identifiability assumption, there is only one such θ′. So, as a matter of notation, adopt the following.

Definition 63 If 𝒢 leaves the model 𝒫 invariant, for each g ∈ 𝒢, define ḡ: Θ → Θ by

ḡ(θ) = the element θ′ of Θ such that X ∼ P_θ ⇒ g(X) ∼ P_{θ′} .

(According to the notation established in Definition 51, for 𝒢 leaving 𝒫 invariant, X ∼ P_θ ⇒ g(X) ∼ P_{ḡ(θ)}.)
(Obvious) Result 97 For 𝒢 a group of 1-1 transformations of 𝒳 onto 𝒳 leaving 𝒫 invariant, 𝒢̄ = {ḡ | g ∈ 𝒢} with the operation "∘" (composition) is a group of 1-1 transformations of Θ.

(I believe that the group 𝒢̄ is the homomorphic image of 𝒢 under the map "¯", so that, for example, (g⁻¹)‾ = ḡ⁻¹ and (g₂ ∘ g₁)‾ = ḡ₂ ∘ ḡ₁.)
We next need transformations on 𝒜 that fit nicely together with those on 𝒳 and Θ. If that's going to be possible, it would seem that the loss function will have to fit nicely with the transformations already considered.

Definition 64 A loss function L(θ, a) is said to be invariant under the group of transformations 𝒢 provided that for every g ∈ 𝒢 and a ∈ 𝒜, ∃ a unique a′ ∈ 𝒜 such that

L(θ, a) = L(ḡ(θ), a′) ∀θ .
Now then, as another matter of jargon/notation, adopt the following.

Definition 65 If 𝒢 leaves the model invariant and the loss L(θ, a) is invariant, define g̃: 𝒜 → 𝒜 by

g̃(a) = the element a′ of 𝒜 such that L(θ, a) = L(ḡ(θ), a′) ∀θ .

(According to the notation established in Definition 53, for 𝒢 leaving 𝒫 invariant and invariant loss L(θ, a),

L(θ, a) = L(ḡ(θ), g̃(a)) .)
(Obvious) Result 98 For 𝒢 a group of 1-1 transformations of 𝒳 onto 𝒳 leaving 𝒫 invariant and invariant loss L(θ, a), 𝒢̃ = {g̃ | g ∈ 𝒢} with the operation "∘" (composition) is a group of 1-1 transformations of 𝒜.

(I believe that the group 𝒢̃ is the homomorphic image of 𝒢 under the map "˜", so that, for example, (g⁻¹)˜ = g̃⁻¹ and (g₂ ∘ g₁)˜ = g̃₂ ∘ g̃₁.)
So finally we come to the notion of equivariance for decision functions. The whole business is slightly subtle for behavioral rules (see pages 150-151 of Ferguson), so for this elementary introduction we'll confine attention to nonrandomized rules.

Definition 66 In an invariant decision problem (one where 𝒢 leaves 𝒫 invariant and L(θ, a) is invariant) a nonrandomized decision rule δ: 𝒳 → 𝒜 is said to be equivariant provided

δ(g(x)) = g̃(δ(x)) ∀x ∈ 𝒳 and g ∈ 𝒢 .

(One could make a definition here involving a single g or perhaps a 3 element group 𝒢 = {g, g⁻¹, e} and talk about equivariance under g instead of under a group. I'll not bother.)
The basic elementary theorem about invariance/equivariance is the generalization of Theorem 90 (that cuts down the size of the comparison problem that one faces when comparing risk functions).

Theorem 99 If 𝒢 leaves 𝒫 invariant, L(θ, a) is invariant and δ is an equivariant nonrandomized decision rule, then

R(θ, δ) = R(ḡ(θ), δ) ∀θ ∈ Θ and g ∈ 𝒢 .
Definition 67 In an invariant decision problem, for θ ∈ Θ, 𝒪_θ = {θ′ ∈ Θ | θ′ = ḡ(θ) for some g ∈ 𝒢} is called the orbit in Θ containing θ.

The set of orbits {𝒪_θ}_{θ∈Θ} partitions Θ. The distinct orbits are equivalence classes.
Theorem 99 says that an equivariant nonrandomized decision rule in an invariant decision
problem has a risk function that is constant on orbits. When looking for good equivariant
rules, the problem of comparing risk functions over @ is reduced somewhat to comparing
risk functions on orbits. Of course, that comparison is especially easy when there is only
one orbit.
Definition 68 A group of 1-1 transformations of a space onto itself is said to be transitive
if for any two points in the space there is a transformation in the group taking the first
point into the second.
Corollary 100 If 𝒢 leaves 𝒫 invariant, L(θ, a) is invariant and 𝒢̄ is transitive over Θ, then every equivariant nonrandomized decision rule has constant risk.
Definition 69 In an invariant decision problem, a decision rule δ is said to be a best (nonrandomized) equivariant (or MRE ... minimum risk equivariant) decision rule if it is equivariant and for any other nonrandomized equivariant rule δ′

R(θ, δ) ≤ R(θ, δ′) ∀θ .

The import of Theorem 99 is that one needs only check this risk inequality for one θ from each orbit.
Optimal Hypothesis Testing (Simple vs Simple, Composite
Hypotheses, UMP Tests, Optimal Confidence Regions and UMPU
Tests)

We now consider the classical hypothesis testing problem. Here we treat

Θ = Θ₀ ∪ Θ₁ where Θ₀ ∩ Θ₁ = ∅

and the hypotheses Hᵢ: θ ∈ Θᵢ, i = 0, 1. As is standard, we will call H₀ the "null hypothesis" and H₁ the "alternative hypothesis." Θᵢ with one element is called "simple" and with more than one element is called "composite."
Deciding between H₀ and H₁ on the basis of data can be thought of in terms of a two-action decision problem. That is, with 𝒜 = {0, 1} one is using X to decide in favor of one of the two hypotheses. Typically one wants a loss structure where correct decisions are preferable to incorrect ones. An obvious (and popular) possibility is 0-1 loss,

L(θ, a) = I[θ ∈ Θ₀, a = 1] + I[θ ∈ Θ₁, a = 0] .

While it is perfectly sensible to simply treat hypothesis testing as a two-action decision problem, being older than formal statistical decision theory, it has its own peculiar/specialized set of jargon and emphases that we will have to recognize. These include the following.
Definition 70 A nonrandomized decision function δ: 𝒳 → {0, 1} is called a hypothesis test.

Definition 71 The rejection region for a nonrandomized test δ is {x | δ(x) = 1}.

Definition 72 A type I error in a testing problem is taking action 1 when θ ∈ Θ₀. A type II error in a testing problem is taking action 0 when θ ∈ Θ₁.

Note that in a two-action decision problem, a behavioral decision rule φ_x is for each x a distribution over {0, 1}, which is clearly characterized by φ_x({1}) ∈ [0, 1]. This notion can be expressed in other terms.

Definition 73 A randomized hypothesis test is a function φ: 𝒳 → [0, 1] with the interpretation that having observed X = x, one rejects H₀ with probability φ(x).
Note that for 0-1 loss,

for θ ∈ Θ₀, R(θ, φ) = E_θ I[a = 1] = E_θ φ(X) ,

the type I error probability, while

for θ ∈ Θ₁, R(θ, φ) = E_θ I[a = 0] = E_θ(1 − φ(X)) = 1 − E_θ φ(X) ,

the type II error probability. Most of testing theory is phrased in terms of the E_θ φ(X) that appears in both of these expressions.
Definition 74 The power function of the (possibly randomized) test φ is

β_φ(θ) ≡ E_θ φ(X) = ∫ φ(x) dP_θ(x) .

Definition 75 The size or level of a test φ is sup_{θ∈Θ₀} β_φ(θ).
This is the maximum probability of a type I error. Standard testing theory proceeds (somewhat asymmetrically) by setting a maximum allowable size and then (subject to that restriction) trying to maximize the power (uniformly) over Θ₁. (That is, subject to a size limitation, standard theory seeks to minimize type II error probabilities.)
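As a small illustration of Definitions 74 and 75 (my own, assuming X̄ ∼ N(θ, 1/n) and a test that rejects for large X̄): the power function has a closed form in the normal CDF, and for H₀: θ ≤ 0 the size is attained at θ = 0.

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power(theta, c, n):
    """beta_phi(theta) = P_theta(Xbar > c) when Xbar ~ N(theta, 1/n)."""
    return 1.0 - norm_cdf((c - theta) * math.sqrt(n))

n = 25
c = 1.645 / math.sqrt(n)        # cutoff aimed at size about 0.05

size = power(0.0, c, n)         # sup over theta <= 0 is attained at theta = 0
beta_alt = power(0.5, c, n)     # power at a point of Theta_1
print(size, beta_alt)
```

The power function is increasing in θ here, so the supremum over Θ₀ = (−∞, 0] really does occur at the boundary point θ = 0.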
Recall that Result 61 says that if T is sufficient for θ, the set of behavioral decision rules that are functions of T is essentially complete. Arguing that fact in the present case is very easy/straightforward. For any randomized test φ, consider

φ*(x) = E[φ | T](x) ,

which is clearly a ℬ(T)-measurable function taking values in [0, 1], i.e. is a hypothesis test based on T. Note that

β_{φ*}(θ) = E_θ φ*(X) = E_θ E[φ | T] = E_θ φ(X) = β_φ(θ) ,

so φ* and φ have the same power functions and are thus equivalent.
Simple vs Simple Testing Problems

Consider first testing problems where Θ₀ = {θ₀} and Θ₁ = {θ₁} (with 0-1 loss). In this finite Θ decision problem, we can for each test φ define the risk vector

(R(θ₀, φ), R(θ₁, φ)) = (β_φ(θ₀), 1 − β_φ(θ₁))

and make the standard two-dimensional plot of risk vectors. A completely equivalent device is to plot instead the points

(β_φ(θ₀), β_φ(θ₁)) .
Lemma 101 For a simple versus simple testing problem, the risk set 𝒮 has the following properties:
i) 𝒮 is convex,
ii) 𝒮 is symmetric wrt the point (1/2, 1/2),
iii) (0, 1) ∈ 𝒮 and (1, 0) ∈ 𝒮, and
iv) 𝒮 is closed.

Lemma 101′ (see page 77 of TSH) For a simple versus simple testing problem, the set 𝒱 ≡ {(β_φ(θ₀), β_φ(θ₁)) | φ is a test} has the following properties:
i) 𝒱 is convex,
ii) 𝒱 is symmetric wrt the point (1/2, 1/2),
iii) (0, 0) ∈ 𝒱 and (1, 1) ∈ 𝒱, and
iv) 𝒱 is closed.
As indicated earlier, the usual game in testing is to fix α and look for a test with size ≤ α and maximum power on Θ₁.

Definition 76 For testing H₀: θ = θ₀ vs H₁: θ = θ₁, a test φ is called most powerful of size α if β_φ(θ₀) = α and for any other test φ* with β_{φ*}(θ₀) ≤ α, β_{φ*}(θ₁) ≤ β_φ(θ₁).
The fundamental result of all of testing theory is the Neyman-Pearson Lemma.
Theorem 102 (The Neyman-Pearson Lemma) Suppose that 𝒫 ≪ μ, a σ-finite measure, and let fᵢ = dP_{θᵢ}/dμ.

i) (Sufficiency) Any test of the form

φ(x) = 1      if f₁(x) > k f₀(x)
       γ(x)   if f₁(x) = k f₀(x)     (*)
       0      if f₁(x) < k f₀(x)

for some k ∈ [0, ∞) and γ: 𝒳 → [0, 1] is most powerful of its size. Further, corresponding to the k = ∞ case, the test

φ(x) = 1      if f₀(x) = 0
       0      otherwise              (**)

is most powerful of size α = 0.

ii) (Existence) For every α ∈ [0, 1], there exists a test of form (*) with γ(x) = γ (a constant in [0, 1]) or a test of form (**) with β_φ(θ₀) = α.

iii) (Uniqueness) If φ is a most powerful test of size α, then it is of form (*) or form (**) except possibly for x ∈ N with P_{θ₀}(N) = P_{θ₁}(N) = 0.
Note that for any two distributions there is always a dominating σ-finite measure (μ = P_{θ₀} + P_{θ₁} will do the job). Further, it's easy to verify that the tests prescribed by this theorem do not depend upon the dominating measure.
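Here is a sketch of constructing a test of form (*) (my own illustration, for a single Binomial(n, p) observation with H₀: p = 0.5 vs H₁: p = 0.7). The likelihood ratio is increasing in t, so the N-P test rejects for large T, randomizing at the boundary point to hit size α exactly.

```python
import math

def binom_pmf(t, n, p):
    return math.comb(n, t) * p**t * (1 - p)**(n - t)

def np_test(n, p0, alpha):
    """N-P test of form (*): reject for T > c, randomize with probability
    gamma at T = c, chosen so the size is exactly alpha."""
    tail = 0.0
    for c in range(n, -1, -1):
        pc = binom_pmf(c, n, p0)
        if tail + pc > alpha:
            gamma = (alpha - tail) / pc
            return c, gamma
        tail += pc
    return 0, 1.0   # alpha = 1: always reject

n, p0, p1, alpha = 10, 0.5, 0.7, 0.05
c, gamma = np_test(n, p0, alpha)

def power(p):
    return sum(binom_pmf(t, n, p) for t in range(c + 1, n + 1)) + gamma * binom_pmf(c, n, p)

print(c, round(gamma, 4), power(p0), power(p1))
```

Without the randomization at T = c, the attainable sizes would be limited to the discrete jumps of the binomial tail; part ii) of the theorem is exactly what the `gamma` computation implements.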
Note also that one could take a more explicitly decision theoretic slant on this whole business. Thinking of testing as a two-decision problem with 0-1 loss and prior distribution G, abbreviating G({θ₀}) = π₀ and G({θ₁}) = π₁, it's easy to see that a Bayesian is a Neyman-Pearson tester with k = π₀/π₁ and that a Neyman-Pearson tester is a closet Bayesian with π₀ = k/(k + 1) and π₁ = 1/(1 + k).
There is a kind of "finite Θ₀ vs simple Θ₁" version of the N-P Lemma. See TSH page 96.

A final small fact about simple versus simple testing (that turns out to be useful in some composite versus composite situations) is next.

Lemma 103 Suppose that α ∈ (0, 1) and φ is most powerful of size α. Then β_φ(θ₁) ≥ α, and the inequality is strict unless f₀ = f₁ a.e. μ.
UMP Tests for Composite vs Composite Problems

Now consider cases where Θ₀ and Θ₁ need not be singletons.

Definition 77 φ is UMP (uniformly most powerful) of size α for testing H₀: θ ∈ Θ₀ vs H₁: θ ∈ Θ₁ provided φ is of size α (i.e. sup_{θ∈Θ₀} β_φ(θ) = α) and for any other φ′ of size no more than α, β_{φ′}(θ) ≤ β_φ(θ) ∀θ ∈ Θ₁.

Problems for which one can find UMP tests are not all that many. Known results cover situations where Θ ⊂ ℝ¹ involving one-sided hypotheses in monotone likelihood ratio families, a few special cases that yield to a kind of "least favorable distribution" argument, and an initially somewhat odd-looking two-sided problem in one parameter exponential families.

Definition 78 Suppose Θ ⊂ ℝ¹ and 𝒫 ≪ μ, a σ-finite measure on (𝒳, ℬ). Suppose that T: 𝒳 → ℝ¹. The family of R-N derivatives {f_θ} is said to have the monotone likelihood ratio (MLR) property in T if for every θ₁ < θ₂ belonging to Θ, the ratio f_{θ₂}(x)/f_{θ₁}(x) is nondecreasing in T(x) on {x | f_{θ₁}(x) + f_{θ₂}(x) > 0}. (𝒫 is also said to have the MLR property.)
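A quick numerical check of Definition 78 (my own illustration): in the Binomial(n, p) family with T(x) = x, the likelihood ratio f_{p₂}(t)/f_{p₁}(t) for p₁ < p₂ should be nondecreasing in t.

```python
import math

def binom_pmf(t, n, p):
    return math.comb(n, t) * p**t * (1 - p)**(n - t)

n, p1, p2 = 10, 0.3, 0.6          # p1 < p2
ratios = [binom_pmf(t, n, p2) / binom_pmf(t, n, p1) for t in range(n + 1)]
# MLR in T(x) = t: the ratio is nondecreasing in t
assert all(ratios[t] <= ratios[t + 1] for t in range(n))
print([round(r, 4) for r in ratios[:4]])
```

Algebraically the ratio is (p₂/p₁)ᵗ((1−p₂)/(1−p₁))ⁿ⁻ᵗ, a geometric (hence monotone) function of t, which is why the check passes for any p₁ < p₂.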
Theorem 104 Suppose that {f_θ} has MLR in T. Then for testing H₀: θ ≤ θ₀ vs H₁: θ > θ₀, for each α ∈ (0, 1] ∃ a test of the form

φ_α(x) = 1    if T(x) > c
         γ    if T(x) = c
         0    if T(x) < c

for c ∈ [−∞, ∞) chosen to satisfy β_φ(θ₀) = α. Such a test
i) has strictly increasing power function,
ii) is UMP size α for these hypotheses, and
iii) uniformly minimizes β_φ(θ) for θ < θ₀ among tests with β_φ(θ₀) = α.
Theorem 105 Suppose that {f_θ} has MLR in T and for state space Θ and action space 𝒜 = {0, 1} the loss function L(θ, a) satisfies

L(θ, 1) − L(θ, 0) < 0 for θ > θ₀, and
L(θ, 1) − L(θ, 0) > 0 for θ < θ₀ .

Then the class of size α tests (α ∈ (0, 1)) identified in Theorem 104 is essentially complete in the class of tests with size in (0, 1).
One two-sided problem in a single real parameter for which it is possible to find UMP tests is covered by the next theorem (that is proved in Section 4.3.4 of Schervish using the generalization of the N-P lemma alluded to earlier).
Theorem 106 Suppose Θ ⊂ ℝ¹, 𝒫 ≪ μ, a σ-finite measure on (𝒳, ℬ), and

f_θ(x) = exp(a(θ) + θT(x)) a.e. μ ∀θ.

Suppose further that θ₁ < θ₂ belong to Θ. A test of the form

φ(x) = 1     if c₁ < T(x) < c₂
       γᵢ    if T(x) = cᵢ, i = 1, 2
       0     if T(x) < c₁ or T(x) > c₂

for c₁ ≤ c₂ minimizes β_φ(θ) ∀θ < θ₁ and ∀θ > θ₂ over choices of tests φ′ with β_{φ′}(θ₁) = β_φ(θ₁) and β_{φ′}(θ₂) = β_φ(θ₂). Further, if β_φ(θ₁) = β_φ(θ₂) = α, then φ is UMP level α for testing H₀: θ ≤ θ₁ or θ ≥ θ₂ vs H₁: θ₁ < θ < θ₂.
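A sketch of finding the constants in Theorem 106 numerically (my own illustration, for a single X ∼ N(θ, 1) so T(x) = x, with θ₁ = −1, θ₂ = 1; since T is continuous no boundary randomization is needed, so γ₁ = γ₂ = 0).

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power(theta, c1, c2):
    """beta_phi(theta) = P_theta(c1 < X < c2) for X ~ N(theta, 1)."""
    return norm_cdf(c2 - theta) - norm_cdf(c1 - theta)

theta1, theta2, alpha = -1.0, 1.0, 0.05
# By the symmetry about 0 we may take c2 = -c1 = d; bisect on d until
# beta(theta1) = beta(theta2) = alpha.
lo, hi = 0.0, 5.0
for _ in range(80):
    d = 0.5 * (lo + hi)
    if power(theta2, -d, d) < alpha:
        lo = d
    else:
        hi = d
c1, c2 = -d, d
print(round(c1, 4), round(c2, 4), power(theta1, c1, c2), power(0.0, c1, c2))
```

The power at θ = 0 (inside (θ₁, θ₂)) comes out larger than α, as it should for a test that rejects when T lands between the cutoffs.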
As far as finding UMP tests for parameter spaces other than subsets of ℝ¹ is concerned, about the only "general" technique known is one discussed in Section 3.8 of TSH. The method is not widely applicable, but does yield a couple of important results. It is based on an "equalizer rule"/"least favorable distribution" type argument.

The idea is that for a few composite vs composite problems, we can for each θ₁ ∈ Θ₁ guess at a distribution Λ over Θ₀ so that with

f_Λ(x) = ∫_{Θ₀} f_θ(x) dΛ(θ) ,

simple versus simple testing of H₀: f_Λ vs H₁: f_{θ₁} produces a size α N-P test whose power over Θ₀ is bounded by α. (Notice, by the way, that this is a Bayes test in the composite vs composite problem against a prior that is a mixture of Λ and a point mass at θ₁.) Anyhow, when that test is independent of θ₁ we get UMP size α for the original problem.
Definition 79 For testing H₀: X ∼ f_Λ vs H₁: X ∼ f_{θ₁}, Λ is called least favorable at level α if for φ_Λ the N-P size α test associated with Λ,

β_{φ_Λ}(θ₁) ≤ β_{φ_{Λ′}}(θ₁)

for any other distribution Λ′ over Θ₀.

Theorem 107 If φ is of size α for testing H₀: θ ∈ Θ₀ vs H₁: θ = θ₁, then it is of size no more than α for testing H₀: X ∼ f_Λ vs H₁: X ∼ f_{θ₁}. Further, if φ is most powerful of size α for testing H₀: X ∼ f_Λ vs H₁: X ∼ f_{θ₁} and is also size α for testing H₀: θ ∈ Θ₀ vs H₁: θ = θ₁, then it is a MP size α test for this set of hypotheses and Λ is least favorable at level α.
Clearly, if in the context of Theorem 107 the same φ is most powerful size α for each θ₁, that test is then UMP size α for the hypotheses H₀: θ ∈ Θ₀ vs H₁: θ ∈ Θ₁.
Confidence Sets Derived From Tests (and Optimal Ones Derived From UMP
Tests)

Definition 80 Suppose that for each x ∈ 𝒳, S(x) is a (measurable) subset of Θ. The family {S(x) | x ∈ 𝒳} is called a γ level family of confidence sets for θ provided

P_θ[θ ∈ S(X)] ≥ γ ∀θ ∈ Θ .

It is possible (and quite standard) to use tests to produce confidence procedures. That goes as follows. For each θ₀ ∈ Θ, let φ_{θ₀} be a size α test of H₀: θ = θ₀ vs H₁: θ ∈ Θ₁ (where θ₀ ∈ Θ − Θ₁) and define

S_φ(x) = {θ | φ_θ(x) < 1} .

(S_φ(x) is the set of θ's for which there is positive conditional probability of accepting H₀ having observed X = x.) {S_φ(x) | x ∈ 𝒳} is a level (1 − α) family of confidence sets for θ. This rigmarole works perfectly generally, and inversion of tests of point null hypotheses is a time-honored means of constructing confidence procedures. In the context of UMP testing, the construction produces confidence sets with an optimality property.
Theorem 108 Suppose that for each θ₀, K(θ₀) ⊂ Θ with θ₀ ∉ K(θ₀), and φ_{θ₀} is a nonrandomized UMP size α test of H₀: θ = θ₀ vs H₁: θ ∈ K(θ₀). Then if S_φ(x) is the confidence procedure derived from the tests φ_{θ₀} and S(x) is any other level (1 − α) confidence procedure,

P_θ[θ₀ ∈ S(X)] ≥ P_θ[θ₀ ∈ S_φ(X)] ∀θ ∈ K(θ₀) .

(That is, the S_φ construction produces confidence procedures that minimize the probabilities of covering incorrect values.)
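A sketch of the test-inversion construction (my own illustration: it inverts equal-tailed exact binomial tests, giving a Clopper-Pearson-style interval, and is not the inversion of UMP tests that Theorem 108 is about).

```python
import math

def binom_pmf(t, n, p):
    return math.comb(n, t) * p**t * (1 - p)**(n - t)

def accepts(p0, t, n, alpha):
    """Equal-tailed exact test of H0: p = p0: 'accept' iff neither tail
    probability of the observed t falls at or below alpha/2."""
    upper = sum(binom_pmf(s, n, p0) for s in range(t, n + 1))
    lower = sum(binom_pmf(s, n, p0) for s in range(0, t + 1))
    return upper > alpha / 2 and lower > alpha / 2

n, t, alpha = 20, 6, 0.05
grid = [i / 1000 for i in range(1, 1000)]
S = [p0 for p0 in grid if accepts(p0, t, n, alpha)]   # S_phi(x) on a grid
print(min(S), max(S))   # endpoints of the inverted confidence set for p
```

Each grid point p₀ plays the role of a point null; the confidence set is exactly the collection of nulls that the data fail to reject.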
UMPU Tests for Composite versus Composite Problems

Cases in which one can find UMP tests are not all that many. This should not be terribly surprising given our experience to date in Stat 643. We are used to not being able to find uniformly best statistical procedures. In estimation problems we found that by restricting the class of estimators under consideration (e.g. to the unbiased or equivariant estimators) we could sometimes find uniformly best rules in the restricted class. Something similar is sometimes possible in composite versus composite testing problems.

Definition 81 A size α test of H₀: θ ∈ Θ₀ vs H₁: θ ∈ Θ₁ is unbiased provided

β_φ(θ) ≥ α ∀θ ∈ Θ₁ .

Definition 82 An unbiased size α test of H₀: θ ∈ Θ₀ vs H₁: θ ∈ Θ₁ is UMPU (uniformly most powerful unbiased) level α if for any other size α unbiased test φ′

β_φ(θ) ≥ β_{φ′}(θ) ∀θ ∈ Θ₁ .

The conditions under which people can identify UMPU tests require that Θ be a topological space. (One needs notions of interior, boundary, continuity of functions, etc.) So in this section make that assumption.
Definition 83 The boundary set is Θ_B ≡ cl(Θ₀) ∩ cl(Θ₁), the intersection of the closures of Θ₀ and Θ₁.

Lemma 109 If β_φ(θ) is continuous and φ is unbiased of size α, then β_φ(θ) = α ∀θ ∈ Θ_B.

Definition 84 A test φ is said to be α-similar on the boundary if

β_φ(θ) = α ∀θ ∈ Θ_B .

When looking for UMPU tests it often suffices to restrict attention to tests α-similar on the boundary.
Lemma 110 Suppose that β_φ(θ) is continuous for every φ. If φ is UMP among tests α-similar on the boundary and has level α, then it is UMPU level α.

This lemma and Theorem 106 (or the generalized Neyman-Pearson lemma that stands behind it) can be used to prove a kind of "reciprocal" of Theorem 106 for one parameter exponential families.
Theorem 111 Suppose Θ ⊂ ℝ¹, 𝒫 ≪ μ, a σ-finite measure on (𝒳, ℬ), and

f_θ(x) = exp(a(θ) + θT(x)) a.e. μ ∀θ.

Suppose further that θ₁ < θ₂ belong to Θ. A test of the form

φ(x) = 1     if T(x) < c₁ or T(x) > c₂
       γᵢ    if T(x) = cᵢ, i = 1, 2
       0     if c₁ < T(x) < c₂

for c₁ ≤ c₂ maximizes β_φ(θ) ∀θ < θ₁ and ∀θ > θ₂ over choices of tests φ′ with β_{φ′}(θ₁) = β_φ(θ₁) and β_{φ′}(θ₂) = β_φ(θ₂). Further, if β_φ(θ₁) = β_φ(θ₂) = α, then φ is UMPU level α for testing H₀: θ ∈ [θ₁, θ₂] vs H₁: θ ∉ [θ₁, θ₂].
Theorem 111 deals with an interval null hypothesis. Dealing with a point null hypothesis turns out to require slightly different arguments.

Theorem 112 Suppose that for Θ an open interval in ℝ¹, 𝒫 ≪ μ, a σ-finite measure on (𝒳, ℬ), and

f_θ(x) = exp(a(θ) + θT(x)) a.e. μ ∀θ.

Suppose that φ is a test of the form given in Theorem 111 and that

β_φ(θ₀) = α and β_φ′(θ₀) = 0

(the power function has value α and derivative 0 at θ₀). Then φ is UMPU size α for testing H₀: θ = θ₀ vs H₁: θ ≠ θ₀.
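A numerical check of the two side conditions in Theorem 112 (my own illustration, for X ∼ N(θ, 1), where by symmetry the equal-tailed test rejecting when |x − θ₀| > c satisfies both conditions).

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

THETA0, C = 0.0, 1.959963985   # c chosen so beta(theta0) is about 0.05

def power(theta):
    """Power of the test rejecting when |x - theta0| > c, X ~ N(theta, 1)."""
    return norm_cdf(THETA0 - C - theta) + 1.0 - norm_cdf(THETA0 + C - theta)

alpha = power(THETA0)                       # size condition: beta(theta0) = alpha
h = 1e-6
slope = (power(h) - power(-h)) / (2 * h)    # derivative condition: beta'(theta0) = 0
print(alpha, slope)
```

For an asymmetric pair of cutoffs the size condition could still be met, but the derivative condition would fail, and the resulting test would be biased.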
Theorems 111 and 112 are about cases where Θ ⊂ ℝ¹ (and, of course, limited to one parameter exponential families). It is possible to find some UMPU tests for hypotheses with "(k − 1)-dimensional" boundaries in "k-dimensional" Θ problems using a methodology exploiting "Neyman structure" in a statistic T. I may write up some notes on this methodology (and again I may not). It is discussed in Section 4.5 of Schervish. A general kind of result that comes from the arguments is next.
Theorem 113 Suppose that X = (X₁, X₂, ..., X_k) takes values in ℝᵏ and the distributions P_η have densities with respect to a σ-finite measure μ on ℝᵏ of the form

f_η(x) = K(η) exp( Σ_{i=1}^{k} ηᵢxᵢ ) .

Suppose further that the natural parameter space Γ contains an open rectangle in ℝᵏ, and let U = (X₂, ..., X_k). Suppose α ∈ (0, 1).

i) For H₀: η₁ ≤ η₁* vs H₁: η₁ > η₁*,
or for H₀: η₁ ≤ η₁* or η₁ ≥ η₁** vs H₁: η₁ ∈ (η₁*, η₁**),
for each u ∃ a conditional on U = u UMP level α test φ′_u(x₁), and the test

φ(x) = φ′_u(x₁)

is UMPU level α.

ii) For H₀: η₁ ∈ [η₁*, η₁**] vs H₁: η₁ ∉ [η₁*, η₁**],
or for H₀: η₁ = η₁* vs H₁: η₁ ≠ η₁*,
for each u ∃ a conditional on U = u UMPU level α test φ′_u(x₁), and the test

φ(x) = φ′_u(x₁)

is UMPU level α.

This theorem says that if one thinks of η = (η₁, η₂, ..., η_k) as containing the parameter of interest η₁ and "nuisance parameters" η₂, ..., η_k, at least in some exponential families it is possible to find "optimal" tests about the parameter of interest.
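The best-known instance of this conditioning idea is Fisher's exact test: for two independent binomials with a common success probability under H₀, the total count is sufficient for the nuisance parameter, and the conditional distribution of the first count given the total is hypergeometric, free of that parameter. A sketch (my own, with made-up counts):

```python
import math

def hypergeom_pmf(x, n1, n2, s):
    """P(X1 = x | X1 + X2 = s) for independent Binomial(n1, p), Binomial(n2, p):
    hypergeometric, free of the common nuisance parameter p."""
    return math.comb(n1, x) * math.comb(n2, s - x) / math.comb(n1 + n2, s)

def fisher_one_sided_p(x1, n1, n2, s):
    """Conditional (Neyman-structure style) p-value for H1: p1 > p2,
    summing the upper tail of the hypergeometric given U = s."""
    hi = min(n1, s)
    return sum(hypergeom_pmf(x, n1, n2, s) for x in range(x1, hi + 1))

# toy 2x2 table: 9 of 12 successes in group 1, 3 of 12 in group 2
p_value = fisher_one_sided_p(9, 12, 12, 12)
print(p_value)
```

The conditional test has the same size α for every value of the nuisance parameter, which is exactly the α-similarity that Neyman structure delivers.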