Stat 643 Outline
Spring 2010
Steve Vardeman, Iowa State University
April 22, 2010

Abstract
This outline summarizes the main points of the class lectures. Details (including proofs) must be seen in class notes and reference books.

Contents
1 Characteristic Functions
  1.1 Some Simple Generalities and Basic Facts About Chf's
  1.2 Chf's and Probabilities
  1.3 Chf's and Convergence in Distribution
2 Central Limit Theorems
  2.1 Basic CLTs for Independent Double Arrays
  2.2 Related Limit Results
    2.2.1 CLTs Under Dependence
    2.2.2 CLTs for Sums of Random Numbers of Random Summands
    2.2.3 Rates for CLTs and Generalizations
    2.2.4 Large Deviation Results
    2.2.5 Results About "Stable Laws"
    2.2.6 Functional Central Limit Theorems and Empirical Processes
3 Conditioning
  3.1 Basic Results About Conditional Expectation
  3.2 Conditional Variance
  3.3 Conditional Analogues of Properties of Ordinary Expectation
  3.4 Conditional Probability
4 Transition to Statistics
5 Sufficiency and Related Concepts
  5.1 Sufficiency and the Factorization Theorem
  5.2 Minimal Sufficiency
  5.3 Ancillarity and Completeness
6 Facts About Common Statistical Models
  6.1 Bayes Models
  6.2 Exponential Families
  6.3 Measures of Statistical Information
    6.3.1 Fisher Information
    6.3.2 Kullback-Leibler Information
7 Statistical Decision Theory
  7.1 Basic Framework and Concepts
  7.2 (Finite Dimensional) Geometry of Decision Theory
  7.3 Complete Classes of Decision Rules
  7.4 Sufficiency and Decision Theory
  7.5 Bayes Decision Rules
  7.6 Minimax Decision Rules
  7.7 Invariance/Equivariance Theory
    7.7.1 Location Parameter Estimation and Invariance/Equivariance
    7.7.2 Generalities of Invariance/Equivariance Theory
8 Asymptotics of Likelihood Inference
  8.1 Asymptotics of Likelihood Estimation
    8.1.1 Consistency of Likelihood-Based Estimators
    8.1.2 Asymptotic Normality of Likelihood-Based Estimators
  8.2 Asymptotics of LRT-like Tests and Competitors and Related Confidence Regions
  8.3 Asymptotic Shape of the Loglikelihood and Related Bayesian Asymptotics
9 Optimality in Finite Sample Point Estimation
  9.1 Unbiasedness
  9.2 "Information" Inequalities
10 Optimality in Finite Sample Testing
  10.1 Simple versus Simple Testing

1 Characteristic Functions

Recall from simple complex analysis that i ≡ √−1 (we'll have to tell from context whether i is standing for the root of −1 or for something else like an index of summation). For some purposes, complex numbers a + bi are just a compact representation of points in R².
That is, a + bi with "real part" a and "imaginary part" b is equivalent to the ordered pair (a, b) ∈ R². But the complex number rules of operation give the convenience of operating with many of the same rules of computation that we use for real numbers, and others that are derivable by simply operating algebraically on i. For example, a reciprocal of the complex number a + bi (another complex number that when multiplied times it produces 1) is

  (a + bi)⁻¹ = (a − bi) / ((a + bi)(a − bi)) = (a − bi) / (a² + b²)

If g is a complex-valued function, we may think of g in terms of two real-valued functions (Re g(t) and Im g(t)) such that

  g(t) = Re g(t) + i Im g(t)

So, for example, we may use the shorthand

  ∫ g dμ ≡ ∫ Re g dμ + i ∫ Im g dμ

Further, in the obvious way, derivatives of complex functions come from derivatives of two real-valued functions:

  (d/dt) g(t) = (d/dt) Re g(t) + i (d/dt) Im g(t)

A standard complex analysis convention is (for real θ) to define

  e^{iθ} ≡ cos θ + i sin θ

(You should check that this definition is plausible based on the forms of the Maclaurin series for exp t, cos t, and sin t.) Now it's easy to see that |e^{iθ}| = 1, so that e^{iθ} is on the unit circle in C (or R²). It's also straightforward to verify that for real θ₁ and θ₂

  e^{i(θ₁+θ₂)} = e^{iθ₁} e^{iθ₂}

Further, the complex-valued function of a single real variable g(t) = e^{itb} has nth derivative

  g⁽ⁿ⁾(t) = (ib)ⁿ e^{itb}

1.1 Some Simple Generalities and Basic Facts About Chf's

Definition 1 If μ is a probability distribution on R^k (and is the distribution of the random vector X), then the characteristic function of μ (or of X) is the mapping from R^k to C defined by

  φ(t) = E e^{it′X} = ∫ e^{it′x} dμ(x) = ∫ cos(t′x) dμ(x) + i ∫ sin(t′x) dμ(x)

Characteristic functions are guaranteed to exist for all probability distributions. Both A&L Ch 10 and Chung Ch 6 are full of interesting and useful material about chf's. Some simple facts are:

1. If φ is a chf, then φ(0) = 1.
2. If φ is a chf, then |φ(t)| ≤ 1 ∀t.
3. If φ₁, …, φ_m are chf's for k-dimensional distributions μ₁, …, μ_m and p₁, …, p_m specify a distribution over {1, 2, …, m}, then Σ p_l φ_l is the chf of the mixture Σ p_l μ_l.
4. If X₁, …, X_m are independent k-dimensional random vectors with chf's φ₁, …, φ_m, then the sum S_m = Σ_{l=1}^m X_l has chf φ_{S_m} = Π_{l=1}^m φ_l.
5. If φ is the chf of real-valued X, then for real numbers a and b, Y = a + bX has chf φ_Y(t) = e^{iat} φ(bt).
6. If φ is the chf of X, then the complex conjugate of φ is the chf of −X.
7. If X and Y are iid with chf φ, then the (symmetrically distributed) random vector X − Y has (real-valued) chf φ_{X−Y} = |φ|².
8. If φ is the chf of X, μ is the distribution of X, and μ̃ is the distribution of −X, then the (symmetric) distribution (1/2)μ + (1/2)μ̃ has (real-valued) chf Re φ.
9. If φ₁ is the chf of the k₁-dimensional X₁ and φ₂ is the chf of the k₂-dimensional X₂ and the random vectors X₁ and X₂ are independent, then for t₁ ∈ R^{k₁} and t₂ ∈ R^{k₂} the stacked random vector (X₁′, X₂′)′ has characteristic function

  φ(t₁, t₂) = φ₁(t₁) φ₂(t₂)

On pages 155-156 Chung provides a nice list of standard distributions and their chf's. Table 1 is derived from Chung's list.

Table 1: Chung's List of Chf's
  Distribution              pdf or pmf                                        Chf
  Unit point mass at a      --                                                e^{iat}
  Mass 1/2 at each of ±1    (1/2) I[x ∈ {−1, 1}]                              cos t
  Ber(p)                    p^x (1−p)^{1−x} I[x is 0 or 1]                    (1−p) + p e^{it}
  Bi(n, p)                  C(n,x) p^x (1−p)^{n−x} I[x ∈ {0, 1, …, n}]        ((1−p) + p e^{it})^n
  Geo(p)                    p (1−p)^x I[x ∈ {0, 1, …}]                        p / (1 − (1−p) e^{it})
  Poisson(λ)                λ^x e^{−λ}/x! I[x ∈ {0, 1, …}]                    exp(λ(e^{it} − 1))
  Exp(λ) (mean 1/λ)         λ e^{−λx} I[x > 0]                                (1 − it/λ)^{−1}
  U(−a, a)                  (1/(2a)) I[−a < x < a]                            sin(at)/(at)
  Triangular on [−a, a]     ((a − |x|)/a²) I[−a < x < a]                      2(1 − cos(at))/(a²t²)
  N(μ, σ²)                  (1/√(2πσ²)) exp(−(x − μ)²/(2σ²))                  exp(iμt − σ²t²/2)
  N(0, 1)                   (1/√(2π)) exp(−x²/2)                              e^{−t²/2}
  Cauchy with scale λ       λ / (π(λ² + x²))                                  e^{−λ|t|}

Per Prof. Nordman's chf notes and 10.1.3 of A&L, there is the promise that chf's are continuous.

Theorem 2 If φ is a chf, then φ is uniformly continuous on R^k.

Per Nordman's Theorem 7.1, 10.1.4 of A&L, or page 67 of Lamperti, there is the fact that when they exist, moments of random variables can be deduced from derivatives of chf's.

Theorem 3 If φ is the chf of the random variable X, and for some positive integer r, E|X|^r < ∞, then φ has a continuous rth derivative φ⁽ʳ⁾ on R and

  φ⁽ʳ⁾(0) = i^r E X^r

As a bit of an aside, an interesting and important additional property of characteristic functions (provided by Bochner's Theorem) is that they are nonnegative definite complex-valued functions. That means that any real-valued characteristic function can serve as a lag-correlation function for a second-order stationary stochastic process. That fact gives large classes of models for time series and spatial statistics (that are defined in terms of their lag-correlation functions). Related to this use of characteristic functions are theorems that provide sufficient conditions for a real-valued function to be a characteristic function. One such theorem is on page 191 of Chung and is 10.1.8 of A&L.

Theorem 4 If f on R is decreasing and convex on R⁺ = [0, ∞) with f(0) = 1, f(t) ≥ 0, and f(t) = f(−t), then f is a chf.

1.2 Chf's and Probabilities

Theorem 3 shows that when they exist, chf's encode the moments of a distribution. In fact, a chf encodes all there is to know about a distribution. One can get all probabilities from a distribution's chf. 10.2.6 of A&L and Theorem 11.1.1 of Rosenthal is next.

Theorem 5 (Fourier Inversion Theorem) If φ is the chf of a probability distribution μ on R, then for any real a < b

  lim_{T→∞} (1/(2π)) ∫_{−T}^{T} ((e^{−ita} − e^{−itb})/(it)) φ(t) dt = μ((a, b)) + (1/2) μ({a}) + (1/2) μ({b})

Two simple consequences of Theorem 5 are next.

Corollary 6 Under the hypotheses of Theorem 5, if a and b are continuity points of the cdf F corresponding to μ, then

  lim_{T→∞} (1/(2π)) ∫_{−T}^{T} ((e^{−ita} − e^{−itb})/(it)) φ(t) dt = μ((a, b))

Corollary 7 (Fourier Uniqueness Theorem) Chf's uniquely determine probability distributions.
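The closed forms in Table 1 can be spot-checked numerically by computing E e^{itX} directly from the pmf or pdf. A minimal sketch for two rows (the test point t, the truncation at 80 terms, the grid size, and the tolerances are all arbitrary choices):

```python
import numpy as np
from math import exp, factorial

t = 0.7  # an arbitrary test point

# Poisson(lam): sum e^{itx} over the pmf vs the closed form exp(lam(e^{it} - 1));
# the series converges quickly, so truncating at 80 terms is more than enough
lam = 2.0
series = sum(exp(-lam) * lam**k / factorial(k) * np.exp(1j * t * k)
             for k in range(80))
assert abs(series - np.exp(lam * (np.exp(1j * t) - 1))) < 1e-10

# U(-a, a): trapezoid rule for the integral of e^{itx}/(2a) vs sin(at)/(at)
a = 1.5
xs = np.linspace(-a, a, 200001)
f = np.exp(1j * t * xs) / (2 * a)
integral = (f.sum() - 0.5 * (f[0] + f[-1])) * (xs[1] - xs[0])
assert abs(integral - np.sin(a * t) / (a * t)) < 1e-8
```

The same pattern (series or quadrature against the closed form) works for any row of the table.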
Corollary 7 is 10.2.5 of A&L.

We will say that X is symmetrically distributed iff X and −X have the same distribution.

Corollary 8 X is symmetrically distributed iff its characteristic function φ is real-valued.

There are all sorts of properties of distributions that are reflected in particular properties of characteristic functions. Per Nordman's Theorem 7.2 and Corollary 7.4 (that is 10.2.4 of A&L) there are the following two results, the second of which follows from Corollary 6.

Theorem 9 (Riemann-Lebesgue Lemma) If the distribution of a random variable X has a density f wrt Lebesgue measure on R, then the chf of X, φ, satisfies φ(t) → 0 as |t| → ∞.

Corollary 10 If the probability distribution μ on R has a chf φ for which ∫ |φ(t)| dt < ∞, then μ has density wrt Lebesgue measure (pdf) on R

  f(x) = (1/(2π)) ∫ e^{−itx} φ(t) dt

A&L's 10.2.6b is also interesting. It is next.

Theorem 11 For a distribution μ on R with chf φ and real number a,

  μ({a}) = lim_{T→∞} (1/(2T)) ∫_{−T}^{T} e^{−ita} φ(t) dt

1.3 Chf's and Convergence in Distribution

The major remaining result concerning characteristic functions is that they are useful for establishing convergence in distribution. That is, for distributions μ_n with chf's φ_n and a distribution μ with chf φ, μ_n →d μ iff φ_n(t) → φ(t) ∀t ∈ R^k. Notice that such a result is a kind of continuity theorem for the mappings that 1) produce chf's from distributions and 2) produce distributions from chf's. (Continuity of a function means that proximity of arguments implies proximity of function values.) To get to this kind of continuity result, a technical lemma is required. The following is 10.3.3 of A&L and 11.1.13 of Rosenthal.

Lemma 12 Let {μ_n} be a sequence of distributions on R with corresponding sequence of characteristic functions {φ_n}. If there exists g : R → C that is continuous at 0 and t₀ > 0 such that φ_n(t) → g(t) ∀t ∈ (−t₀, t₀), then {μ_n} is tight.

This then enables proof of the important continuity theorem (that is 10.3.4 of A&L).
Theorem 13 (The R¹ version of the Continuity Theorem) If distributions {μ_n} on R have characteristic functions {φ_n} and distribution μ has chf φ, then

  μ_n →d μ iff φ_n(t) → φ(t) ∀t ∈ R¹

(See A&L 10.4.4 for an R^k version of this.)

A technical lemma provides a simple Central Limit Theorem as a direct consequence of Theorem 13. This concerns the complex number version of a limit from freshman calculus.

Lemma 14 If complex numbers c_n converge to c, then

  lim_{n→∞} (1 + c_n/n)ⁿ = e^c

Theorem 15 Suppose the random variables {X_n} are iid with E X₁² < ∞. Let E X₁ = μ and Var X₁ = σ² < ∞. Then

  √n (X̄_n − μ) →d N(0, σ²)

A second very useful consequence of Theorem 13 is the so-called "Cramér-Wold device." This appears as 10.4.5 of A&L.

Theorem 16 (Cramér-Wold Device) Suppose that {X_n} is a sequence of k-dimensional random vectors. Then

  X_n →d X iff a′X_n →d a′X ∀a ∈ R^k

2 Central Limit Theorems

The basic idea of Central Limit Theory is that sums of small, not too highly dependent, random variables should be approximately normal. The simplest such theorems are for sequences of summands X₁, X₂, …, where for S_n ≡ Σ_{i=1}^n X_i there are sequences of constants a_n and b_n such that

  (S_n − a_n)/b_n →d N(0, 1)

A more general and useful framework for CLTs is that of double arrays.

Definition 17 {X_ij} for i = 1, 2, … and j = 1, 2, …, n_i where n_i → ∞ as i → ∞ will be called a double array of random variables, and in the case where n_i = i we will call the collection a triangular array. Standard notation will be to let

  S_i = Σ_{j=1}^{n_i} X_ij

It is easy to see that even if one assumes that {X_ij}_{j=1,2,…,n_i} is a set of independent random variables, there is no guarantee that there are constants a_i and b_i such that

  (S_i − a_i)/b_i →d N(0, 1)

2.1 Basic CLTs for Independent Double Arrays

The best-known condition under which Central Limit Theorems are proved is the "Lindeberg condition."
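Before stating it, the iid CLT of Theorem 15 is easy to check by simulation. A minimal sketch using Exp(1) summands (so μ = σ = 1); the sample size, replication count, seed, and tolerances are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 400, 10000
# reps independent samples, each of n iid Exp(1) variables
xbar = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (xbar - 1.0)  # sqrt(n)(Xbar_n - mu)/sigma with mu = sigma = 1

# The standardized means should look approximately N(0, 1)
assert abs(z.mean()) < 0.05
assert abs(z.std() - 1.0) < 0.05
assert abs((z <= 0.0).mean() - 0.5) < 0.02  # near Phi(0) = 1/2
```

Shrinking n visibly degrades the normal approximation, which motivates the rate results of Section 2.2.3.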
Definition 18 A double array of independent random variables {X_ij} with E X_ij = 0 ∀i, j for which each E X_ij² < ∞ is said to satisfy the Lindeberg condition provided for B_i² ≡ Var S_i = Σ_{j=1}^{n_i} E X_ij², ∀ε > 0

  (1/B_i²) Σ_{j=1}^{n_i} E X_ij² I[|X_ij| > εB_i] → 0 as i → ∞

This condition says that the tails of the X_ij contribute less and less to the variance of S_i with increasing i. Next is 11.1.1 of A&L.

Theorem 19 (Lindeberg Central Limit Theorem) If {X_ij} is a double array of independent mean 0 random variables with finite variances that satisfy the Lindeberg condition, then

  S_i/B_i →d N(0, 1)

A simpler-looking condition under which the Lindeberg condition holds is Liapounov's condition.

Definition 20 A double array of independent random variables {X_ij} with E X_ij = 0 ∀i, j for which each E|X_ij|^{2+δ} < ∞ for some fixed δ > 0 is said to satisfy the Liapounov condition provided for B_i² ≡ Var S_i = Σ_{j=1}^{n_i} E X_ij²

  (1/B_i^{2+δ}) Σ_{j=1}^{n_i} E|X_ij|^{2+δ} → 0 as i → ∞

The Liapounov condition says that the ratio of the total absolute 2+δ moment to the 1+δ/2 power of the total absolute 2nd moment goes to 0. Next is a result on page 348 of A&L.

Theorem 21 If a double array of independent random variables {X_ij} with E X_ij = 0 ∀i, j for which each E|X_ij|^{2+δ} < ∞ for some fixed δ > 0 satisfies the Liapounov condition, it also satisfies the Lindeberg condition.

2.2 Related Limit Results

The basic central limit theorems can be generalized in a number of directions. A few such directions are mentioned here.

2.2.1 CLTs Under Dependence

One direction is to allow some (not too severe) dependencies in the variables being summed.

Definition 22 A sequence of random variables X₁, X₂, … is said to be m-dependent if (X₁, …, X_j) is independent of (X_{j+m}, X_{j+m+1}, …) ∀j.

On his page 224, Chung gives an m-dependent central limit theorem.

Theorem 23 If an m-dependent sequence of uniformly bounded random variables X₁, X₂, … is such that σ_n ≡ √(Var S_n) satisfies

  σ_n / n^{1/3} → ∞

then

  (S_n − E S_n)/σ_n →d
N(0, 1)

The condition on the variance of the partial sum in Theorem 23 says that the variance of S_n can not grow too slowly. Note that an iid model of course has σ_n = σ√n.

2.2.2 CLTs for Sums of Random Numbers of Random Summands

There are limit theorems for random sums of random variables like that of A&L problem 11.11 and Chung page 226.

Theorem 24 Suppose that X₁, X₂, … is a sequence of iid mean 0 variance 1 random variables and ν₁, ν₂, … is a sequence of random variables taking only positive integer values such that for some c > 0

  ν_n/n →P c

Then

  S_{ν_n}/√ν_n →d N(0, 1)

This kind of theorem is fairly easy to prove in the case that {X_j} and {ν_j} are independent (using characteristic functions and conditioning). But this result is harder to prove and much more general.

2.2.3 Rates for CLTs and Generalizations

There are theorems about the rate of convergence of distributions to normal. That is, Theorem 15 says that in the iid finite variance case

  P((S_n − nμ)/(σ√n) ≤ t) → Φ(t) ∀t

This together with Pólya's Theorem (9.14 of A&L) says that

  Δ_n ≡ sup_t |P((S_n − nμ)/(σ√n) ≤ t) − Φ(t)| → 0

and there are rate theorems about Δ_n like 11.4.1 of A&L.

Theorem 25 (The Berry-Esseen Theorem) Suppose the random variables {X_j} are iid with E|X₁|³ < ∞. Let E X₁ = μ and Var X₁ = σ². Then there is a C ∈ (0, ∞) such that

  Δ_n ≤ C E|X₁ − μ|³ / (σ³√n)

In fact, the C in Theorem 25 is no more than 5.05. There are also rate theorems for "higher order" approximations called "Edgeworth Expansions." See for example Theorem 11.4 of A&L, that says that under conditions like those of Theorem 25, with

  Δ′_n ≡ sup_t |P((S_n − nμ)/(σ√n) ≤ t) − [Φ(t) + (E(X₁ − μ)³/(6σ³√n)) (1 − t²) φ(t)]|

one has Δ′_n = o(n^{−1/2}) or even Δ′_n = o(n^{−1}).

2.2.4 Large Deviation Results

There are theorems about approximating other probabilities related to S_n. For X₁, X₂, … a sequence of iid variables with E X₁ = μ and t > μ, so called "large deviation" results concern approximation of P(S_n/n > t).
Note that typical Central Limit Theorems say that

  P((S_n − nμ)/(σ√n) > t) = P(S_n > nμ + tσ√n) → 1 − Φ(t)

and so CLTs provide approximate cumulative probabilities for values of S_n/n differing from μ by amounts that shrink to 0 at rate n^{−1/2}. Large deviation results instead concern probabilities of exceeding a fixed value t > μ. Theorem 11.4.6 of A&L says

  P(S_n/n > t)^{1/n} → C(t)

for C(t) some function depending upon the distribution of X₁.

2.2.5 Results About "Stable Laws"

CLTs say that there are sequences of constants a_n and b_n such that

  (S_n − a_n)/b_n   (1)

has a standard normal limit. A sensible question is whether quantities (1) can have any other types of limit distributions.

Definition 26 A non-degenerate distribution G on R is stable iff for X₁, X₂, … iid G, ∃ sequences of constants a_n and b_n such that

  (S_n − a_n)/b_n =d X₁

Theorem 27 A non-degenerate distribution G on R is stable iff ∃ some distribution F and sequences of constants a_n and b_n such that with X₁, X₂, … iid F

  (S_n − a_n)/b_n →d G

Section 11.2 of A&L is full of interesting facts about stable laws (including representations of their characteristic functions).

2.2.6 Functional Central Limit Theorems and Empirical Processes

For C[0, 1] the space of continuous functions on the unit interval with metric

  d(f, g) = sup_t |f(t) − g(t)|
one may think of putting distributions on the Borel sets in the complete separable metric space C[0, 1]. Thinking of the argument of an f ∈ C[0, 1] as "time," such a distribution specifies a stochastic process model. Suppose that {X_j} are iid with E X₁ = 0 and Var X₁ = 1 and define W_n(t) on [0, 1] by

  W_n(j/n) = (1/√n) S_j for j = 0, 1, …, n

and by linear interpolation for t not of the form j/n for any j = 0, 1, …, n. W_n(t) is a random element of C[0, 1] and there is a probability measure μ_n such that W_n ~ μ_n. The multivariate CLT can be used to show that ∀ 0 ≤ t₁ ≤ t₂ ≤ ⋯ ≤ t_k ≤ 1

  (W_n(t₁), W_n(t₂), …, W_n(t_k))′ →d MVN_k(0, Σ)

where Σ = (σ_ij) with

  σ_ij = t_i ∧ t_j   (2)

It further turns out that there is a distribution μ on C[0, 1] called the Standard Brownian Motion such that ∀ 0 ≤ t₁ ≤ t₂ ≤ ⋯ ≤ t_k ≤ 1, W ~ μ implies that

  (W(t₁), W(t₂), …, W(t_k))′ ~ MVN_k(0, Σ)

where Σ has form (2). (μ is the distribution of a Gaussian process, meaning all finite dimensional marginals are Gaussian.) In addition, under these assumptions there is a "functional CLT" (11.4.9 of A&L).

Theorem 28 For {X_j} iid with E X₁ = 0 and Var X₁ = 1, and W_n(t) and μ_n as defined above,

  μ_n →d μ

i.e., the distribution of W_n converges to a standard Brownian motion. (Convergence in distribution in this context is most naturally convergence of probabilities of sets of functions with boundaries that are null sets under the limiting distribution. See A&L Section 9.3 for convergence in distribution in abstract spaces.)

A related line of argument concerns "empirical processes" and the "Brownian Bridge" distribution on C[0, 1] and is discussed in Section 11.4.5 of A&L. That is, suppose that W ~ μ for μ the standard Brownian motion distribution on C[0, 1]. Let

  B(t) = W(t) − t W(1)

Then B(t) has a Gaussian process distribution called the "Brownian Bridge" with B(0) = B(1) = 0 and E B(t) = 0 ∀t and

  Cov(B(t₁), B(t₂)) = t₁ ∧ t₂ − t₁t₂

This distribution on C[0, 1] can be thought of as arising as a limit distribution in the following way. For {U_j} iid U(0, 1) and (the empirical cumulative distribution function for the U's)

  F_n(t) ≡ (1/n) Σ_{j=1}^n I[U_j ≤ t] for t ∈ [0, 1]

let

  Y_n(t) = the linear interpolation of √n (F_n(t) − t) made from values at the jump points of F_n (i.e. where t is some U_j)

Then there is 11.4.1 of A&L.

Theorem 29 With B(t) and Y_n(t) as above,

  Y_n(t) →d B(t)

Y_n(t) is the "empirical process" for the U_j and Theorem 29 says that approximations for probabilities for the process (and thence for F_n(t)) can be had from use of the Brownian bridge distribution.
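The Brownian-bridge covariance t₁ ∧ t₂ − t₁t₂ behind Theorem 29 can be seen in simulation by evaluating √n(F_n(t) − t) at two fixed time points. A minimal sketch (sizes, seed, time points, and tolerances are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 500, 8000
u = rng.uniform(size=(reps, n))  # reps samples of n iid U(0,1) draws
t1, t2 = 0.3, 0.7
# sqrt(n)(Fn(t) - t) evaluated at t1 and t2 for each replicate
y1 = np.sqrt(n) * ((u <= t1).mean(axis=1) - t1)
y2 = np.sqrt(n) * ((u <= t2).mean(axis=1) - t2)

# Var B(t) = t(1 - t) and Cov(B(t1), B(t2)) = t1 ^ t2 - t1 t2
assert abs(y1.var() - t1 * (1 - t1)) < 0.02
assert abs(np.cov(y1, y2)[0, 1] - (min(t1, t2) - t1 * t2)) < 0.02
```

(For the indicator of [0, t] these moments are in fact exact for every n; the limit theorem adds joint Gaussianity and tightness as a process.)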
3 Conditioning

One needs properly general notions of "conditioning" in order to deal coherently with general statistical models. Some thought about even fairly simple not-completely-discrete problems leaves one convinced that ideas of conditional expectation must focus on conditioning σ-algebras, not conditioning variables (a conditioning variable generates a corresponding σ-algebra, but it is the latter that is intrinsic, not the former, as many slightly different variables can generate the same σ-algebra). Here is the standard definition.

Definition 30 Let (Ω, F, P) be a probability space and G ⊂ F be a sub σ-algebra. Let X be a real-valued measurable function (a random variable) on Ω with E|X| = ∫ |X(ω)| dP(ω) < ∞. Then the conditional expectation of X given G, written E(X|G), is a random variable Ω → R with the properties that

1. E(X|G) is G-measurable, and
2. for any A ∈ G,

  ∫_A E(X|G) dP = ∫_A X dP

In other notation, the basic requirement 2. of Definition 30 is that E[I_A E(X|G)] = E[I_A X] for all A ∈ G. The existence of E(X|G) satisfying Definition 30 follows from the Radon-Nikodym Theorem. The case X = I_B for B ∈ F gets its own notation.

Definition 31 For B ∈ F and G ⊂ F a sub σ-algebra, the conditional probability of B given G is

  P(B|G) ≡ E(I_B|G)

Several comments about the notion of conditional expectation are in order. First, E(X|G) is clearly unique only up to sets of P measure 0. Second, the case where G = σ(W) for some random variable W is written

  E(X|W) ≡ E(X|σ(W))

And third, A&L have what is a very clever and possibly more natural (at least for statisticians) definition of conditioning that applies to X with second moments. Namely, E(X|G) is the projection of X onto the linear subspace of L²(Ω, F, P) consisting of G-measurable functions. (More on this later.)

If G is the trivial sub σ-algebra, i.e. G = {∅, Ω}, then E(X|G) must be constant and thus E(X|G) = EX. On the other hand, if G = F, then it's easy to argue that E(X|G) = X.
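On a finite probability space with G generated by a partition, Definition 30 is completely concrete: E(X|G) is the probability-weighted average of X over each partition cell, and requirement 2. can be checked directly. A minimal sketch (all numbers are arbitrary), which also previews the variance decomposition Var X = Var E(X|G) + E Var(X|G) that appears later as Theorem 37:

```python
import numpy as np

p = np.array([.1, .2, .1, .25, .2, .15])  # P on outcomes 0..5
x = np.array([3., 1., 4., 1., 5., 9.])    # a random variable X
cells = [np.array([0, 1]), np.array([2, 3, 4]), np.array([5])]  # partition generating G

e_x_g = np.empty_like(x)                  # E(X|G), constant on each cell
for c in cells:
    e_x_g[c] = (p[c] * x[c]).sum() / p[c].sum()

# Requirement 2.: integrals of E(X|G) and of X agree over each generator A of G
for c in cells:
    assert abs((p[c] * e_x_g[c]).sum() - (p[c] * x[c]).sum()) < 1e-12
# Taking A = Omega recovers E E(X|G) = EX
assert abs((p * e_x_g).sum() - (p * x).sum()) < 1e-12

# Var X = Var E(X|G) + E Var(X|G)
ex = (p * x).sum()
var_x = (p * (x - ex) ** 2).sum()
var_cond_mean = (p * (e_x_g - ex) ** 2).sum()
e_cond_var = (p * (x - e_x_g) ** 2).sum()  # E[(X - E(X|G))^2] = E Var(X|G)
assert abs(var_x - (var_cond_mean + e_cond_var)) < 1e-12
```

A coarser partition gives a "less random" E(X|G), in line with the intuition below.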
Intuitively, the smaller is G, the "less random" is E(X|G), i.e. the "more averaging has gone on in its creation," i.e. the fewer values it takes on.

3.1 Basic Results About Conditional Expectation

Some basic facts about conditional expectation are collected in the next theorem.

Theorem 32 Let (Ω, F, P) be a probability space and G ⊂ F be a sub σ-algebra.
1. If c ∈ R, E(c|G) = c a.s.
2. If X has a first moment and is G-measurable, then E(X|G) = X a.s.
3. If X has a first moment and X ≥ 0 a.s., then E(X|G) ≥ 0 a.s.
4. If X₁, …, X_k have first moments and c₁, …, c_k are real, then
   E(Σ_{i=1}^k c_i X_i | G) = Σ_{i=1}^k c_i E(X_i|G) a.s.
5. If X has a first moment, then |E(X|G)| ≤ E(|X| | G) a.s.

Next is 12.1.5ii of A&L and 13.2.6 of Rosenthal.

Theorem 33 Suppose that X and Y are random variables and G ⊂ F is a sub σ-algebra. If Y and XY have first moments and X is G-measurable, then

  E(XY|G) = X E(Y|G) a.s.

(A G-measurable X can be "factored out" of a conditional expectation.)

Then there is a standard result from 12.1.5i of A&L or 13.2.7 of Rosenthal.

Theorem 34 Let X be a random variable with first moment and G₁ ⊂ G₂ ⊂ F be two sub σ-algebras. Then

  E(E(X|G₂)|G₁) = E(X|G₁) a.s.

And there is a natural result about independence.

Theorem 35 Suppose that X has a first moment and G ⊂ F is a sub σ-algebra. If G and σ(X) are independent (in the sense that A ∈ G and B ∈ σ(X) implies that P(A ∩ B) = P(A)P(B)), then E(X|G) = EX a.s.

3.2 Conditional Variance

Definition 31 is one important special case of conditional expectation. Another is conditional variance.

Definition 36 If E X² < ∞ and G ⊂ F is a sub σ-algebra, define the conditional variance

  Var(X|G) = E(X²|G) − (E(X|G))²

Theorem 12.2.6 of A&L is next.

Theorem 37 If E X² < ∞ and G ⊂ F is a sub σ-algebra, then

  Var X = Var(E(X|G)) + E(Var(X|G))

The next result shows that the A&L definition of conditional expectation is appropriate for X ∈ L²(Ω, F, P).
Theorem 38 If E X² < ∞ and G ⊂ F is a sub σ-algebra, then E(X|G) minimizes

  E(X − Y)²

over choices of G-measurable random variables Y. (So under the hypotheses of Theorem 38, E(X|G) is indeed the projection of X onto the subspace of L²(Ω, F, P) consisting of G-measurable random variables.)

3.3 Conditional Analogues of Properties of Ordinary Expectation

Many properties of ordinary expectation have analogues for conditional expectations. For one, there is a conditional version of Jensen's inequality (that is 12.2.4 of A&L).

Theorem 39 For −∞ ≤ a < b ≤ ∞, suppose that X on (Ω, F, P) has P[a < X < b] = 1 and that the function φ is convex on (a, b) with E|φ(X)| < ∞. Then for any sub σ-algebra G ⊂ F,

  φ(E(X|G)) ≤ E(φ(X)|G) a.s.

And there are convergence theorems for conditional expectations like a conditional monotone convergence theorem (12.2.1 of A&L), a conditional version of the Fatou lemma (12.2.2 of A&L), and a conditional dominated convergence theorem (12.2.3 of A&L).

Theorem 40 (Monotone Convergence Theorem for Conditional Expectation) If {X_n} is a sequence of non-negative random variables on (Ω, F, P) with X_n ≤ X_{n+1} a.s. and X_n → X a.s., then E(X_n|G) → E(X|G) a.s.

Theorem 41 (Fatou's Lemma for Conditional Expectation) If {X_n} is a sequence of non-negative random variables on (Ω, F, P), then

  E(lim inf X_n | G) ≤ lim inf E(X_n|G) a.s.

Theorem 42 (Dominated Convergence Theorem for Conditional Expectation) Suppose {X_n} is a sequence of random variables on (Ω, F, P) and that X_n → X a.s. If there ∃ a random variable Z such that |X_n| ≤ Z a.s. for all n and E|Z| < ∞, then E(X_n|G) → E(X|G) a.s.

For statistical applications, it is Theorem 39 that is the most important of these analogues of standard results for ordinary expectations.

3.4 Conditional Probability

Consider the implications of Definition 31. It's possible to argue that for every countable set of disjoint A_i ∈ F,

  Σ_i P(A_i|G) = P(∪A_i|G) a.s.

i.e. that ∃ N_{A_i} with P(N_{A_i}) = 0 such that

  Σ_i P(A_i|G)(ω) = P(∪A_i|G)(ω)
  ∀ω ∈ Ω∖N_{A_i}   (3)

It's tempting to expect that this means that except for a set of P measure 0, I may think of P(·|G)(ω) as a probability measure. But (3) does not in general guarantee this. The problem is that there are typically uncountably many sets {A_i} and the union of the N_{A_i} need not be F-measurable, let alone have P measure 0. So the fact that (3) holds does not guarantee the existence of an object satisfying the next definition.

Definition 43 For (Ω, F, P) a probability space and G ⊂ F and D ⊂ F two sub σ-algebras, a function

  ν : D × Ω → [0, 1]

is called a regular conditional probability on D given G provided

1. ∀A ∈ D, ν(A, ·) = P(A|G)(·) a.s., and
2. ∀ω ∈ Ω, ν(·, ω) is a probability measure on (Ω, D).

A regular conditional probability need not exist. (See the example on page 81 of Breiman's Probability.) When a regular conditional probability does exist, conditional expectations can be computed using it. That is, there is the following.

Lemma 44 If ν is a regular conditional probability on D given G and Y is D-measurable, then

  E(Y|G)(ω) = ∫ Y dν(·, ω) for P almost all ω

In the case where D = σ(X), a regular conditional probability on D given G is called a regular conditional distribution of X given G. Provided that X takes values in a nice enough space (like (R^k, B^k) or even (R^∞, B^∞)), it's possible to conclude that a regular conditional distribution of X given G exists. This is 12.3.1 of A&L.

Theorem 45 Suppose that (Ω, F, P) is a probability space and (X, B) is a Polish space with Borel σ-algebra B. Let X : Ω → X be measurable. Then for any sub σ-algebra G ⊂ F, ∃ a regular conditional distribution of X given G (a regular conditional probability on σ(X) given G).

Under the conditions of Theorem 45, for any measurable h : X → R with E|h(X)| < ∞, for ν the regular conditional distribution of X given G guaranteed to exist by the theorem,

  E(h(X)|G)(ω) = ∫ h(X) dν(·, ω) for P almost all ω

and this obviously then works for indicator functions h. So for A ∈ B,

  P(X⁻¹(A)|G)(ω)
  = E(I_{X⁻¹(A)}|G)(ω) = E(I_A(X)|G)(ω) = ν({ω′ | X(ω′) ∈ A}, ω) = ν(X⁻¹(A), ω) for P almost all ω

making use of Lemma 44 for the switch between conditional expectations and values of ν.

A final topic to be considered in the discussion of conditional probability is conditional independence.

Definition 46 Suppose G is a sub σ-algebra of F and for each l belonging to some index set Λ, F_l is a subset of F. We will say that {F_l}_{l∈Λ} is conditionally independent given G if for any positive integer k and l₁, l₂, …, l_k ∈ Λ,

  P(A₁ ∩ A₂ ∩ ⋯ ∩ A_k | G) = Π_{i=1}^k P(A_i|G) a.s.

for all choices of A_i ∈ F_{l_i}. A collection of random variables {X_l}_{l∈Λ} is called conditionally independent given G if {σ(X_l)}_{l∈Λ} is conditionally independent given G.

Next is the result of Problem 12.18 of A&L.

Proposition 47 Suppose G₁, G₂, and G₃ are three sub σ-algebras of F. G₁ and G₂ are conditionally independent given G₃ if and only if

  P(A|σ(G₂ ∪ G₃)) = P(A|G₃) ∀A ∈ G₁

And there is the result of Problem 12.19 of A&L.

Proposition 48 Suppose G₁, G₂, and G₃ are three sub σ-algebras of F. If σ(G₁ ∪ G₃) is independent of G₂, then G₁ and G₂ are conditionally independent given G₃.

4 Transition to Statistics

In the usual probability theory set-up, a (measurable) random observable X maps (Ω, F, P) to some measure space (X, B) and thereby induces a probability measure P^X on (X, B) (the distribution of the random observable). In statistical theory, the space (Ω, F, P) largely disappears from consideration, being replaced by consideration of (X, B, P^X), and there is not just a single P^X to be considered but rather a whole family of such (giving different models for the observable) indexed by a parameter θ belonging to some index set Θ. It is completely standard to abuse notation somewhat and replace the P^X notation with just P_θ notation for the distribution of X, and thus consider a set of models for X

  P ≡ {P_θ}_{θ∈Θ}

We'll use a subscript θ to indicate computations made under the P_θ model.
So, for example,

  E_θ g(X) = ∫ g dP_θ

The object is then to use X to learn about θ, some function of θ, or perhaps some as yet unobserved Y that has a distribution depending upon θ. The reason to have learned measure-theoretic probability is that it now allows a logically coherent approach to statistical analysis in models more general than the completely discrete or completely continuous models of Stat 542/543.

Henceforth, we will suppose that ∃ a σ-finite measure μ such that P_θ ≪ μ ∀θ ∈ Θ (we will say that P is dominated by μ). The Radon-Nikodym theorem then promises that for each θ ∈ Θ, ∃ f_θ : X → R such that

  P_θ(B) = ∫_B f_θ dμ ∀B ∈ B

(f_θ = dP_θ/dμ is the R-N derivative of P_θ with respect to μ.) f_θ basically tells one how to go from probabilities for one model for X to probabilities for another. Then, for observable X, f_θ(X) is a random function of θ that becomes the likelihood function in a properly general statistical set-up.

We note for future reference that a Bayesian approach to inference will add to these measure-theoretic model assumptions an assumption that θ has some distribution G (on Θ after it is given some σ-algebra). Notice then that X|θ ~ P_θ together with θ ~ G presumably produces some (single) joint distribution on X × Θ that in turn typically can be thought of as producing some regular conditional distribution of θ given σ(X) in the style of Section 3.4 (that serves as a posterior distribution of θ|X).

5 Sufficiency and Related Concepts

5.1 Sufficiency and the Factorization Theorem

Roughly, T(X) is sufficient for P ≡ {P_θ}_{θ∈Θ} (or for θ) provided "the conditional distribution of X given T(X) is the same for each θ," and this ought to be equivalent to the likelihood function being essentially the same for any two possible values of X, say x and x′, with T(x) = T(x′). This equivalence is the "factorization theorem."

A formal development that will produce the factorization theorem begins with a statistic T, a measurable map T : (X, B) →
$(\mathcal{T},\mathcal{F})$ and the sub-$\sigma$-algebra of $\mathcal{B}$,
$$\mathcal{B}_0=\sigma_{\mathcal{B}}(T)\equiv T^{-1}(\mathcal{F})=\left\{T^{-1}(F)\mid F\in\mathcal{F}\right\}$$

Definition 49 $T$ (or $\mathcal{B}_0$) is sufficient for $\mathcal{P}$ (or for $\theta$) if for every $B\in\mathcal{B}$ $\exists$ a $\mathcal{B}_0$-measurable function $Y$ such that $\forall\theta$
$$Y=P_\theta\left(B\mid\mathcal{B}_0\right)\quad\text{a.s. }P_\theta$$

This general definition of sufficiency is phrased in terms of conditional probabilities (conditional expectations of indicators) and says that for each $B$, there is a single $Y$ that works as conditional probability for $B$ given $\mathcal{B}_0$ as one looks across all $\theta$. In the event that there is a single regular conditional distribution of $X$ given $\mathcal{B}_0=\sigma_{\mathcal{B}}(T)=\sigma(T)$ that works for all $\theta\in\Theta$, Definition 49 says that $T$ is sufficient. (But the definition doesn't require the existence of such a regular conditional distribution, only the existence of the "$\theta$-free" conditional expectations one $B$ at a time.)

Theorem 50 (Halmos-Savage Factorization Theorem) Suppose that $\mathcal{P}\ll\mu$ where $\mu$ is $\sigma$-finite. Then $T$ is sufficient for $\mathcal{P}$ iff $\exists$ a nonnegative $\mathcal{B}$-measurable function $h$ and nonnegative $\mathcal{F}$-measurable functions $g_\theta$ such that $\forall\theta$
$$\frac{dP_\theta}{d\mu}(x)=f_\theta(x)=g_\theta(T(x))\, h(x)\quad\text{a.s. }\mu$$

Theorem 50 says that $T$ is sufficient iff $x$ and $x'$ with $T(x)=T(x')$ have likelihood functions that are proportional.

Theorem 50 is not so easy to prove. A series of lemmas is needed. The first is Lemma 1.2 of Shao and the hint of Problem 12.1 of A&L. (Below $\mathcal{B}^k$ is the Borel $\sigma$-algebra on $\mathbb{R}^k$.)

Lemma 51 (Lehmann's Theorem) Let $T:\mathcal{X}\to\mathcal{T}$ be measurable and $\phi:(\mathcal{X},\mathcal{B})\to(\mathbb{R}^k,\mathcal{B}^k)$ be $\sigma_{\mathcal{B}}(T)$-measurable. Then $\exists$ an $\mathcal{F}$-measurable function $\psi:(\mathcal{T},\mathcal{F})\to(\mathbb{R}^k,\mathcal{B}^k)$ such that $\phi(x)=\psi(T(x))$.

Next is Lemma 2.1 of Shao.

Lemma 52 $\mathcal{P}$ is dominated by a $\sigma$-finite measure iff $\mathcal{P}$ is dominated by a probability measure of the form
$$\lambda=\sum_{i=1}^\infty c_i P_{\theta_i}$$
for some countable set $\{P_{\theta_i}\}\subset\mathcal{P}$ and set of $c_i\ge 0$ with $\sum_{i=1}^\infty c_i=1$.

Lemma 53 Suppose that $\mathcal{P}$ is dominated by a $\sigma$-finite measure and $T:(\mathcal{X},\mathcal{B})\to(\mathcal{T},\mathcal{F})$. Then $T$ is sufficient iff $\exists$ nonnegative $\mathcal{F}$-measurable functions $g_\theta$ such that $\forall\theta$ (and with $\lambda$ as in Lemma 52 fixed)
$$\frac{dP_\theta}{d\lambda}(x)=g_\theta(T(x))\quad\text{a.s. }\lambda$$

We'll take Lemmas 51 and 52 as given.
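The proportionality criterion in Theorem 50 is easy to check numerically in a familiar special case. The sketch below is my own illustration (not from the references): for iid $N(\theta,1)$ observations with Lebesgue dominating measure and $T(x)=\sum x_i$, two samples with the same value of $T$ should have log-likelihoods differing by a constant free of $\theta$.

```python
import math

def normal_loglik(x, theta):
    # log-likelihood of iid N(theta, 1) observations x under parameter theta
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (xi - theta) ** 2 for xi in x)

x  = [0.2, 1.1, 2.7]   # two samples with the same sufficient statistic T(x) = sum = 4.0
xp = [1.0, 1.5, 1.5]

# if T(x) = T(x'), the log-likelihood difference should be constant in theta,
# i.e. the likelihood functions are proportional
diffs = [normal_loglik(x, t) - normal_loglik(xp, t) for t in (-1.0, 0.0, 2.5)]
```

Because the difference is free of $\theta$, the likelihoods are proportional, exactly as the factorization $f_\theta(x)=g_\theta(T(x))\,h(x)$ requires.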
An outline of the proof of Lemma 53 using Lemmas 51 and 52 is

($\Rightarrow$) For $P_\theta'$ and $\lambda'$ the restrictions of $P_\theta$ and $\lambda$ to $\mathcal{B}_0$, $\frac{dP_\theta'}{d\lambda'}=g_\theta\circ T$ by Lehmann's Theorem. Then show $g_\theta\circ T=\frac{dP_\theta}{d\lambda}$ using facts about $P_\theta(B\mid\mathcal{B}_0)$.

($\Leftarrow$) Show $E_\theta(I_B\mid\mathcal{B}_0)=E_\lambda(I_B\mid\mathcal{B}_0)$ using the representation of $\frac{dP_\theta}{d\lambda}$.

Then, an outline of the proof of Theorem 50 using Lemmas 51-53 is

($\Rightarrow$) $\frac{dP_\theta}{d\mu}$ "$=$" $\frac{dP_\theta}{d\lambda}\cdot\frac{d\lambda}{d\mu}$ plus a.e.'s, with $\frac{dP_\theta}{d\lambda}=g_\theta\circ T$ (Lemma 53) and $h=\frac{d\lambda}{d\mu}$.

($\Leftarrow$) $\frac{dP_\theta}{d\lambda}$ "$=$" $\frac{dP_\theta}{d\mu}\cdot\frac{d\mu}{d\lambda}$ plus a.e.'s, where $\frac{dP_\theta}{d\mu}=(g_\theta\circ T)\,h$ and $\frac{d\lambda}{d\mu}=h\sum_{i=1}^\infty c_i\left(g_{\theta_i}\circ T\right)$. This implies that $\frac{dP_\theta}{d\lambda}$ is of the "right" form (a function of $T$ alone), so that Lemma 53 applies.

5.2 Minimal Sufficiency

A sufficient statistic $T(X)$ is a reduction of data $X$ that in some sense still carries all the information available about $\mathcal{P}$. A sensible question is what form the "most compact" sufficient reduction of $X$ takes. This is the notion of minimal sufficiency.

Definition 54 A sufficient statistic $T:(\mathcal{X},\mathcal{B})\to(\mathcal{T},\mathcal{F})$ is minimal sufficient for $\mathcal{P}$ (or for $\theta$) provided for every sufficient statistic $S:(\mathcal{X},\mathcal{B})\to(\mathcal{S},\mathcal{G})$ $\exists$ a function $U:\mathcal{S}\to\mathcal{T}$ such that $T=U\circ S$ a.s. $\mathcal{P}$. (This definition is equivalent to or close to equivalent to $\sigma_{\mathcal{B}}(T)\subset\sigma_{\mathcal{B}}(S)$.)

The fundamental qualitative insight here is that "the 'shape' of the loglikelihood" is minimal sufficient. This is phrased in technical terms in the next result.

Theorem 55 Suppose that $\mathcal{P}$ is dominated by a $\sigma$-finite measure $\mu$, and that $T:(\mathcal{X},\mathcal{B})\to(\mathcal{T},\mathcal{F})$ is sufficient for $\mathcal{P}$. Suppose further that $\exists$ versions of the R-N derivatives
$$\frac{dP_\theta}{d\mu}(x)=f_\theta(x)$$
such that the existence of $k(x,y)>0$ such that $f_\theta(y)=f_\theta(x)\,k(x,y)\ \forall\theta$ implies that $T(x)=T(y)$. Then $T$ is minimal sufficient.

A more technical and perhaps less illuminating line of argument for establishing minimal sufficiency is based on the next two results. (Shao's Theorem 2.3ii is the $k=1$ version of Theorem 56.)

Theorem 56 Suppose that $\mathcal{P}=\{P_i\}_{i=0}^k$ is finite and let $f_i$ be the R-N derivative of $P_i$ with respect to some dominating $\sigma$-finite measure $\mu$. Suppose that all $k+1$ densities $f_i$ are positive everywhere on $\mathcal{X}$.
Then
$$T(X)=\left(\frac{f_1(X)}{f_0(X)},\frac{f_2(X)}{f_0(X)},\ldots,\frac{f_k(X)}{f_0(X)}\right)$$
is minimal sufficient for $\mathcal{P}$.

(The positivity assumption in Theorem 56 isn't really necessary, but proof without that assumption is fairly unpleasant.) The next result is a version of Shao's Theorem 2.3i.

Theorem 57 Suppose that $\mathcal{P}$ is a family of distributions and $\mathcal{P}_0\subset\mathcal{P}$ dominates $\mathcal{P}$ ($P(B)=0\ \forall P\in\mathcal{P}_0$ implies that $P(B)=0\ \forall P\in\mathcal{P}$). If $T$ is sufficient for $\mathcal{P}$ and minimal sufficient for $\mathcal{P}_0$, then it is minimal sufficient for $\mathcal{P}$.

5.3 Ancillarity and Completeness

Sufficiency of $T(X)$ means roughly that "what is left over in $X$ beyond the information in $T(X)$ is of no additional use in inferences about $\theta$." This notion is related to the concepts of ancillarity and completeness.

Definition 58 A statistic $T:(\mathcal{X},\mathcal{B})\to(\mathcal{T},\mathcal{F})$ is said to be ancillary for $\mathcal{P}=\{P_\theta\}_{\theta\in\Theta}$ if the distribution of $T(X)$ (i.e. the measure $P_\theta^T$ on $(\mathcal{T},\mathcal{F})$ defined by $P_\theta^T(F)=P_\theta\left(T^{-1}(F)\right)$) does not depend upon $\theta$ (i.e. is the same for all $\theta$).

Definition 59 A statistic $T:(\mathcal{X},\mathcal{B})\to(\mathbb{R}^1,\mathcal{B}^1)$ is said to be 1st order ancillary for $\mathcal{P}=\{P_\theta\}_{\theta\in\Theta}$ if $E_\theta\, T(X)$ does not depend upon $\theta$ (i.e. is the same for all $\theta$).

Two slightly different versions of the notion of completeness are next. They are both stated in the framework of Definition 58. That is, suppose $T:(\mathcal{X},\mathcal{B})\to(\mathcal{T},\mathcal{F})$, define $P_\theta^T$ on $(\mathcal{T},\mathcal{F})$ by $P_\theta^T(F)=P_\theta\left(T^{-1}(F)\right)$, and let $\mathcal{P}^T=\left\{P_\theta^T\right\}_{\theta\in\Theta}$.

Definition 60 (A) $\mathcal{P}^T$ (or $T$) is complete (or boundedly complete) for $\mathcal{P}$ or $\theta$ if
i) $U:(\mathcal{X},\mathcal{B})\to(\mathbb{R}^1,\mathcal{B}^1)$ is $\sigma_{\mathcal{B}}(T)$-measurable (and bounded) and
ii) $E_\theta\, U(X)=0\ \forall\theta$
imply that $U=0$ a.s. $P_\theta\ \forall\theta$.
(B) $\mathcal{P}^T$ (or $T$) is complete (or boundedly complete) for $\mathcal{P}$ or $\theta$ if
i) $h:(\mathcal{T},\mathcal{F})\to(\mathbb{R}^1,\mathcal{B}^1)$ is $\mathcal{F}$-measurable (and bounded) and
ii) $E_\theta\, h(T(X))=0\ \forall\theta$
imply that $h\circ T=0$ a.s. $P_\theta\ \forall\theta$.

By virtue of Lehmann's Theorem, the two versions of completeness in Definition 60 are equivalent. They both say that there is no nontrivial (bounded) function of $T$ that is an unbiased estimator of 0.
This is equivalent to saying that there is no nontrivial (bounded) function of $T$ that is first order ancillary. Bounded completeness is obviously a slightly weaker condition than completeness. Two simple results regarding completeness and sub-families of models are next.

Proposition 61 If $T$ is complete for $\mathcal{P}=\{P_\theta\}_{\theta\in\Theta}$ and $\Theta_0\subset\Theta$, $T$ need not be complete for $\mathcal{P}_0=\{P_\theta\}_{\theta\in\Theta_0}$.

Theorem 62 If $\Theta'\subset\Theta''$ and $\Theta'$ dominates $\Theta''$ in the sense that $P_\theta(B)=0\ \forall\theta\in\Theta'$ implies that $P_\theta(B)=0\ \forall\theta\in\Theta''$, then $T$ (boundedly) complete for $\mathcal{P}'=\{P_\theta\}_{\theta\in\Theta'}$ implies that $T$ is (boundedly) complete for $\mathcal{P}''=\{P_\theta\}_{\theta\in\Theta''}$.

Completeness of a statistic guarantees that there is no part of it (no nontrivial function of it) that is somehow irrelevant in and of itself to inference about $\theta$. That is reminiscent of minimal sufficiency. An important connection between the two concepts is next. The proof of part i) of the following is on pages 95-96 of Schervish. (Silvey page 30 has a heuristic argument for part ii).)

Theorem 63 (Bahadur's Theorem) Suppose that $T:(\mathcal{X},\mathcal{B})\to(\mathcal{T},\mathcal{F})$ is sufficient and boundedly complete. Then
i) if $(\mathcal{T},\mathcal{F})=(\mathbb{R}^k,\mathcal{B}^k)$ then $T$ is minimal sufficient, and
ii) if there is any minimal sufficient statistic, then $T$ is minimal sufficient.

Another standard result involving completeness, sufficiency, and ancillarity is the following (from page 99 of Schervish).

Theorem 64 (Basu's Theorem) If $T$ is a boundedly complete sufficient statistic and $U$ is ancillary, then for all $\theta$ the random variables $U$ and $T$ are independent (according to $P_\theta$).

6 Facts About Common Statistical Models

6.1 Bayes Models

The "Bayes" approach to statistical problems is to add model assumptions to the basic structure introduced thus far. To the standard assumptions concerning $\mathcal{P}=\{P_\theta\}_{\theta\in\Theta}$, "Bayesians" add a distribution on the parameter space $\Theta$.
That is, suppose that $G$ is a distribution on $(\Theta,\mathcal{C})$ where it may be convenient to suppose that $G$ is dominated by some $\sigma$-finite measure $\nu$ and that
$$\frac{dG}{d\nu}(\theta)=g(\theta)$$
If, considered as a function of both $x$ and $\theta$, $f_\theta(x)$ is $\mathcal{B}\times\mathcal{C}$-measurable, then $\exists$ a probability $\pi$ (a joint distribution for $(X,\theta)$ on $(\mathcal{X}\times\Theta,\mathcal{B}\times\mathcal{C})$) defined by
$$\pi(A)=\int_A f_\theta(x)\, d(\mu\times G)(x,\theta)=\int_A f_\theta(x)\, g(\theta)\, d(\mu\times\nu)(x,\theta)$$
This distribution has marginals
$$\pi^X(B)=\int_{B\times\Theta} f_\theta(x)\, d(\mu\times G)(x,\theta)=\int_B\int_\Theta f_\theta(x)\, dG(\theta)\, d\mu(x)$$
(so that $\frac{d\pi^X}{d\mu}(x)=\int f_\theta(x)\, dG(\theta)=\int f_\theta(x)\, g(\theta)\, d\nu(\theta)$) and
$$\pi^\theta(C)=\int_{\mathcal{X}\times C} f_\theta(x)\, d(\mu\times G)(x,\theta)=\int_C\int_{\mathcal{X}} f_\theta(x)\, d\mu(x)\, dG(\theta)=G(C)$$
Further, the conditionals (regular conditional probabilities) are computable as
$$\pi^{X|\theta}(B\mid\theta)=\int_B f_\theta(x)\, d\mu(x)=P_\theta(B)$$
and
$$\pi^{\theta|X}(C\mid x)=\int_C\frac{f_\theta(x)}{\int f_t(x)\, dG(t)}\, dG(\theta)$$
Note then that
$$\frac{d\pi^{\theta|X}}{dG}(\theta\mid x)=\frac{f_\theta(x)}{\int f_t(x)\, dG(t)}$$
and thus that
$$\frac{d\pi^{\theta|X}}{d\nu}(\theta\mid x)=\frac{f_\theta(x)\, g(\theta)}{\int f_t(x)\, g(t)\, d\nu(t)}$$
is the usual "posterior density" for $\theta$ given $X=x$. Bayesians use the posterior distribution or density (THAT DEPENDS ON ONE'S CHOICE OF $G$) as their basis of inference for $\theta$.

Pages 84-88 of Schervish concern a Bayesian formulation of sufficiency and argue (basically) that the Bayesian form is equivalent to the "classical" form.

6.2 Exponential Families

Many standard families of distributions can be treated with a single set of analyses by recognizing them to be of a common "exponential family" form. The following is a list of facts about such families. (See, for example, Schervish Sections 2.2.1 and 2.2.2, or scattered results in Shao or the old books by Lehmann (TSH and TPE) for details.)

Definition 65 Suppose that $\mathcal{P}=\{P_\theta\}$ is dominated by a $\sigma$-finite measure $\mu$. $\mathcal{P}$ is called an exponential family if for some $h(x)\ge 0$,
$$\frac{dP_\theta}{d\mu}(x)=f_\theta(x)=\exp\left(a(\theta)+\sum_{i=1}^k\eta_i(\theta)\, T_i(x)\right)h(x)\quad\forall\theta$$

Definition 66 A family of probability measures $\mathcal{P}=\{P_\theta\}$ is said to be identifiable if $\theta_1\ne\theta_2$ implies that $P_{\theta_1}\ne P_{\theta_2}$.
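To make Definition 65 concrete: the Poisson($\theta$) family (a standard example of mine, with $\mu$ counting measure on $\{0,1,2,\ldots\}$) is exponential with $k=1$, $a(\theta)=-\theta$, $\eta_1(\theta)=\ln\theta$, $T_1(x)=x$, and $h(x)=1/x!$. A quick numerical sketch confirming the two expressions agree:

```python
import math

def poisson_pmf(x, theta):
    # density of Poisson(theta) with respect to counting measure
    return math.exp(-theta) * theta ** x / math.factorial(x)

def expfam_form(x, theta):
    # Definition 65 shape: f_theta(x) = exp(a(theta) + eta(theta) T(x)) h(x)
    a, eta, T, h = -theta, math.log(theta), x, 1.0 / math.factorial(x)
    return math.exp(a + eta * T) * h

checks = [abs(poisson_pmf(x, 2.5) - expfam_form(x, 2.5)) for x in range(10)]
```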
Claim 67 Let $\eta=(\eta_1,\eta_2,\ldots,\eta_k)$, $\Gamma=\left\{\eta\in\mathbb{R}^k\mid\int h(x)\exp\left(\sum\eta_i T_i(x)\right)d\mu(x)<\infty\right\}$, and consider also the family of distributions (on $\mathcal{X}$) with R-N derivatives wrt $\mu$ of the form
$$f_\eta(x)=K(\eta)\exp\left(\sum_{i=1}^k\eta_i T_i(x)\right)h(x),\quad\eta\in\Gamma$$
Call this family $\mathcal{P}^*$. This set of distributions on $\mathcal{X}$ is at least as large as $\mathcal{P}$ (and this second parameterization is mathematically nicer than the first). $\Gamma$ is called the natural parameter space for $\mathcal{P}^*$. It is a convex subset of $\mathbb{R}^k$. (TSH, page 57.) If $\Gamma$ lies in a subspace of dimension less than $k$, then $f_\eta(x)$ (and therefore $f_\theta(x)$) can be written in a form involving fewer than $k$ statistics $T_i$. We will henceforth assume $\Gamma$ to be fully $k$-dimensional. Note that depending upon the nature of the functions $\eta_i(\theta)$ and the parameter space $\Theta$, $\mathcal{P}$ may be a proper subset of $\mathcal{P}^*$. That is, defining
$$N=\left\{\left(\eta_1(\theta),\eta_2(\theta),\ldots,\eta_k(\theta)\right)\in\mathbb{R}^k\mid\theta\in\Theta\right\},$$
$N$ can be a proper subset of $\Gamma$.

Claim 68 The "support" of $P_\theta$, defined as $\{x\mid f_\theta(x)>0\}$, is clearly $\{x\mid h(x)>0\}$, which is independent of $\theta$. The distributions in $\mathcal{P}^*$ are thus mutually absolutely continuous.

Claim 69 From the Factorization Theorem, the statistic $T=(T_1,T_2,\ldots,T_k)$ is sufficient for $\mathcal{P}$.

Claim 70 $T$ has induced measures $\left\{P_\eta^T\mid\eta\in\Gamma\right\}$ which also form an exponential family.

Claim 71 If $N$ contains an open rectangle in $\mathbb{R}^k$, then $T$ is complete for $\mathcal{P}$. (See pages 142-143 of TSH.)

Claim 72 If $N$ contains an open rectangle in $\mathbb{R}^k$ (and actually under the much weaker assumptions given on page 44 of TPE) $T$ is minimal sufficient for $\mathcal{P}$.

Claim 73 If $g$ is any measurable real valued function such that $E_\eta|g(X)|<\infty$, then
$$E_\eta\, g(X)=\int g(x)\, f_\eta(x)\, d\mu(x)$$
is continuous on $\Gamma$ and has continuous partial derivatives of all orders on the interior of $\Gamma$. These can be calculated as
$$\frac{\partial^{\alpha_1+\alpha_2+\cdots+\alpha_k}}{\partial\eta_1^{\alpha_1}\partial\eta_2^{\alpha_2}\cdots\partial\eta_k^{\alpha_k}}E_\eta\, g(X)=\int g(x)\,\frac{\partial^{\alpha_1+\alpha_2+\cdots+\alpha_k}}{\partial\eta_1^{\alpha_1}\partial\eta_2^{\alpha_2}\cdots\partial\eta_k^{\alpha_k}}f_\eta(x)\, d\mu(x)$$
(See page 59 of TSH.)
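Claim 73's interchange of derivative and integral can be sanity-checked numerically. The sketch below is my own illustration, assuming the Poisson family in natural form ($\eta=\ln\theta$, counting-measure "integrals", $g(x)=x^2$); for this family $\frac{\partial}{\partial\eta}f_\eta(x)=f_\eta(x)\left(x-e^\eta\right)$, so both sides of the claim are directly computable.

```python
import math

ETA0 = math.log(2.0)   # natural parameter, theta = e^eta = 2

def f(x, eta):
    # Poisson density wrt counting measure, natural parameterization
    theta = math.exp(eta)
    return math.exp(-theta) * theta ** x / math.factorial(x)

def Eg(eta, g, xmax=120):
    # E_eta g(X) as a (truncated) counting-measure integral
    return sum(g(x) * f(x, eta) for x in range(xmax))

g = lambda x: x * x
eps = 1e-6
# left side of Claim 73: derivative of E_eta g(X), by central difference
lhs = (Eg(ETA0 + eps, g) - Eg(ETA0 - eps, g)) / (2 * eps)
# right side: "integrate" the eta-derivative of f_eta under the integral sign;
# for this assumed family, df/d eta = f(x, eta) (x - e^eta)
rhs = sum(g(x) * f(x, ETA0) * (x - math.exp(ETA0)) for x in range(120))
```

Both sides come out near 10 here (for Poisson(2), $\frac{d}{d\eta}E_\eta X^2=\theta(1+2\theta)=10$).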
Claim 74 If for $u=(u_1,u_2,\ldots,u_k)$, both $\eta_0$ and $\eta_0+u$ belong to $\Gamma$,
$$E_{\eta_0}\exp\left\{u_1T_1(X)+u_2T_2(X)+\cdots+u_kT_k(X)\right\}=\frac{K(\eta_0)}{K(\eta_0+u)}$$
Further, if $\eta_0$ is in the interior of $\Gamma$, then
$$E_{\eta_0}\left(T_1^{\alpha_1}(X)\,T_2^{\alpha_2}(X)\cdots T_k^{\alpha_k}(X)\right)=K(\eta_0)\left.\frac{\partial^{\alpha_1+\alpha_2+\cdots+\alpha_k}}{\partial\eta_1^{\alpha_1}\partial\eta_2^{\alpha_2}\cdots\partial\eta_k^{\alpha_k}}\frac{1}{K(\eta)}\right|_{\eta=\eta_0}$$
In particular,
$$E_{\eta_0}T_j(X)=\left.\frac{\partial}{\partial\eta_j}\left(-\ln K(\eta)\right)\right|_{\eta=\eta_0},\quad\mathrm{Var}_{\eta_0}\,T_j(X)=\left.\frac{\partial^2}{\partial\eta_j^2}\left(-\ln K(\eta)\right)\right|_{\eta=\eta_0},$$
and
$$\mathrm{Cov}_{\eta_0}\left(T_j(X),T_l(X)\right)=\left.\frac{\partial^2}{\partial\eta_j\partial\eta_l}\left(-\ln K(\eta)\right)\right|_{\eta=\eta_0}$$

Claim 75 If $X=(X_1,X_2,\ldots,X_n)$ has iid components, with each $X_i\sim P_\eta$, then $X$ generates a $k$-dimensional exponential family on $(\mathcal{X}^n,\mathcal{B}^n)$ wrt $\mu^n$. The $k$-dimensional statistic $\sum_{i=1}^n T(X_i)$ is sufficient for this family. Under the condition that $N$ contains an open rectangle, this statistic is also complete and minimal sufficient.

6.3 Measures of Statistical Information

Details for the development here can be found in Section 2.3 of Schervish.

6.3.1 Fisher Information

This is an attempt to quantify how much on average the likelihood $f_\theta(X)$ tells one about $\theta\in\mathbb{R}^k$. To even begin, one must have enough technical conditions in place to make it possible to define statistical information.

Definition 76 We will say that the model $\mathcal{P}$ with dominating $\sigma$-finite measure $\mu$ is FI Regular at $\theta_0\in\mathbb{R}^k$ provided there is an open neighborhood of $\theta_0$, say $O$, such that
1. $f_\theta(x)>0\ \forall x\in\mathcal{X}$ and $\theta\in O$,
2. $\forall x$, $f_\theta(x)$ has first order partial derivatives at $\theta_0$, and
3. $1=\int f_\theta(x)\, d\mu(x)$ can be differentiated with respect to each $\theta_i$ under the integral sign at $\theta_0$; that is, $0=\int\left.\frac{\partial}{\partial\theta_i}f_\theta(x)\right|_{\theta=\theta_0}d\mu(x)$.

Definition 77 If the model $\mathcal{P}$ with dominating $\sigma$-finite measure $\mu$ is FI Regular at $\theta_0\in\mathbb{R}^k$, and
$$E_{\theta_0}\left(\left.\frac{\partial}{\partial\theta_i}\ln f_\theta(X)\right|_{\theta=\theta_0}\right)^2<\infty\quad\forall i$$
then the $k\times k$ matrix
$$I(\theta_0)=E_{\theta_0}\left(\left.\frac{\partial}{\partial\theta_i}\ln f_\theta(X)\right|_{\theta=\theta_0}\cdot\left.\frac{\partial}{\partial\theta_j}\ln f_\theta(X)\right|_{\theta=\theta_0}\right)$$
is called the Fisher Information about $\theta$ contained in $X$ at $\theta_0$.

Claim 78 Fisher information doesn't depend upon the dominating measure. The Fisher Information certainly does depend upon the parameterization one uses.
For example, suppose that the model $\mathcal{P}$ is FI regular at $\theta_0\in\mathbb{R}^1$ and that for some "nice" function $h$ (say, 1-1 with nonvanishing derivative $h'$)
$$\gamma=h(\theta)$$
Then for $P_\theta$ with R-N derivative $f_\theta$ with respect to $\mu$, the distributions $Q_\gamma$ (for $\gamma$ in the range of $h$) are defined by $Q_\gamma=P_{h^{-1}(\gamma)}$ and have R-N derivatives (again with respect to $\mu$)
$$g_\gamma=\frac{dQ_\gamma}{d\mu}=f_{h^{-1}(\gamma)}$$
Then, continuing to let "prime" denote differentiation,
$$g_\gamma'=f_{h^{-1}(\gamma)}'\cdot\frac{1}{h'\left(h^{-1}(\gamma)\right)}$$
and
$$J(\gamma)=\left(\frac{1}{h'\left(h^{-1}(\gamma)\right)}\right)^2 I\left(h^{-1}(\gamma)\right)$$
where $I$ is the information in $X$ about $\theta$ and $J$ is the information in $X$ about $\gamma$.

Under appropriate conditions, there is a second form for the Fisher Information matrix.

Theorem 79 Suppose that the model $\mathcal{P}$ is FI regular at $\theta_0$. If $\forall x$, $f_\theta(x)$ has continuous partials in the neighborhood $O$, and $1=\int f_\theta(x)\, d\mu(x)$ can be differentiated twice with respect to each $\theta_i$ under the integral sign at $\theta_0$, that is,
$$0=\int\left.\frac{\partial}{\partial\theta_i}f_\theta(x)\right|_{\theta=\theta_0}d\mu(x)\quad\text{and}\quad 0=\int\left.\frac{\partial^2}{\partial\theta_i\partial\theta_j}f_\theta(x)\right|_{\theta=\theta_0}d\mu(x)\quad\forall i\text{ and }j,$$
then
$$I(\theta_0)=-E_{\theta_0}\left(\left.\frac{\partial^2}{\partial\theta_i\partial\theta_j}\ln f_\theta(X)\right|_{\theta=\theta_0}\right)$$

One reason Fisher Information is an attractive measure is that for independent observations, it is additive.

Proposition 80 If $X_1,X_2,\ldots,X_n$ are independent with $X_i\sim P_{i,\theta}$, then $X=(X_1,X_2,\ldots,X_n)$ carries Fisher Information $I(\theta)=\sum_{i=1}^n I_i(\theta)$ (where in the obvious notation $I_i(\theta)$ is the Fisher information carried by $X_i$).

Fisher information also behaves "properly" in terms of what happens when one reduces $X$ to some statistic $T(X)$. That is, there are the next two results, the second of which is like 2.86 of Schervish.

Proposition 81 Suppose the model $\mathcal{P}$ is FI Regular at $\theta_0\in\mathbb{R}^k$. If the function $T$ is 1-1, then the Fisher Information in $T(X)$ is the same as the Fisher Information in $X$.

Proposition 82 Suppose the model $\mathcal{P}$ is FI Regular at $\theta_0\in\mathbb{R}^k$. (?And also suppose that $\mathcal{P}^T$ is FI regular at $\theta_0$?) Then the matrix
$$I_X(\theta_0)-I_{T(X)}(\theta_0)$$
is non-negative definite. Further, if $T(X)$ is sufficient, then $I_X(\theta_0)=I_{T(X)}(\theta_0)$. Also, $I_X(\theta_0)-I_{T(X)}(\theta_0)=0\ \forall\theta_0$ implies that $T(X)$ is sufficient.
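As a check on Definition 77 and the "second form" of Theorem 79, here is a small numerical sketch (my own; the assumed example is the Poisson($\theta$) family, where the score is $x/\theta-1$, $\frac{\partial^2}{\partial\theta^2}\ln f_\theta(x)=-x/\theta^2$, and the closed form is $I(\theta)=1/\theta$):

```python
import math

def pmf(x, theta):
    # Poisson density wrt counting measure
    return math.exp(-theta) * theta ** x / math.factorial(x)

def info_score_sq(theta, xmax=150):
    # Definition 77: I(theta) = E_theta (d/dtheta ln f_theta(X))^2
    return sum(pmf(x, theta) * (x / theta - 1.0) ** 2 for x in range(xmax))

def info_neg_hessian(theta, xmax=150):
    # Theorem 79: I(theta) = -E_theta d^2/dtheta^2 ln f_theta(X) = E_theta X / theta^2
    return sum(pmf(x, theta) * (x / theta ** 2) for x in range(xmax))

theta = 3.0
# both computations should agree with the closed form 1/theta = 1/3
```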
In exponential families, the Fisher Information has a very nice form.

Proposition 83 For a family of distributions as in Claim 67,
$$I(\eta_0)=\mathrm{Var}_{\eta_0}\left(T(X)\right)=\left.\left(\frac{\partial^2}{\partial\eta_i\partial\eta_j}\left(-\ln K(\eta)\right)\right)\right|_{\eta=\eta_0}$$

6.3.2 Kullback-Leibler Information

This is another measure of "information" that aims to measure how far apart two distributions are in the sense of likelihood, i.e. in the sense of how hard it will be to discriminate between them on the basis of $X$. It is a quantity that arises naturally in the asymptotics of maximum likelihood.

Definition 84 If $P$ and $Q$ are probability measures on $\mathcal{X}$ with R-N derivatives with respect to a common dominating measure $\mu$, $p$ and $q$ respectively, then the Kullback-Leibler Information is the $P$ expected log likelihood ratio
$$I(P,Q)=E_P\ln\frac{p(X)}{q(X)}=\int\ln\left(\frac{p(x)}{q(x)}\right)p(x)\, d\mu(x)$$

Claim 85 $I(P,Q)$ is not in general the same as $I(Q,P)$.

Lemma 86 $I(P,Q)\ge 0$ and there is equality only when $P=Q$.

Lemma 86 follows from the next (slightly more general) lemma (from page 75 of Silvey), upon noting that $C(\cdot)=-\ln(\cdot)$ is strictly convex.

Lemma 87 Suppose that $P$ and $Q$ are two different probability measures on $\mathcal{X}$ with R-N derivatives with respect to a common dominating measure $\mu$, $p$ and $q$ respectively. Suppose further that a function $C:(0,\infty)\to\mathbb{R}$ is convex. Then
$$E_P\, C\left(\frac{q(X)}{p(X)}\right)\ge C(1)$$
with strict inequality if $C$ is strictly convex.

K-L Information has a kind of additive property similar to that possessed by the Fisher Information.

Theorem 88 If $X=(X_1,X_2)$ and under both $P$ and $Q$ the variables $X_1$ and $X_2$ are independent, then
$$I_X(P,Q)=I_{X_1}\left(P^{X_1},Q^{X_1}\right)+I_{X_2}\left(P^{X_2},Q^{X_2}\right)$$

K-L Information also has the kind of non-increasing property associated with reduction of $X$ to a statistic $T(X)$ possessed by the Fisher Information.

Theorem 89 $I_X(P,Q)\ge I_{T(X)}\left(P^{T(X)},Q^{T(X)}\right)$ and there is equality if and only if $T(X)$ is sufficient for $\{P,Q\}$.

Finally, there is also an important connection between the notions of Fisher and K-L Information under the formulation of Section 6.3.1.
Theorem 90 Under the hypotheses of Theorem 79, if one can reverse the order of limits in all
$$\left.\frac{\partial^2}{\partial\theta_i\partial\theta_j}\left(E_{\theta_0}\ln f_\theta(X)\right)\right|_{\theta=\theta_0}$$
to get
$$E_{\theta_0}\left(\left.\frac{\partial^2}{\partial\theta_i\partial\theta_j}\ln f_\theta(X)\right|_{\theta=\theta_0}\right)$$
then
$$\left.\left(\frac{\partial^2}{\partial\theta_i\partial\theta_j}I\left(P_{\theta_0},P_\theta\right)\right)\right|_{\theta=\theta_0}=I(\theta_0)$$
(Clearly, the first $I$ is the K-L Information and the second is the Fisher Information).

As a bit of an aside, it is worth considering what would establish the legitimacy of exchange of order of differentiation and integration (e.g. required in the hypotheses of Theorem 90). For any function $h(\cdot):\mathbb{R}^1\to\mathbb{R}^1$, define
$$D^1(h)(\theta)=h(\theta+\epsilon)-h(\theta)\quad\text{and}\quad D^2(h)(\theta)=h(\theta+\epsilon)+h(\theta-\epsilon)-2h(\theta)$$
Then
$$\lim_{\epsilon\to 0}\frac{1}{\epsilon}D^1(h)(\theta)=h'(\theta)\quad\text{and}\quad\lim_{\epsilon\to 0}\frac{1}{\epsilon^2}D^2(h)(\theta)=h''(\theta)$$
Consider $g(\theta,x):\mathbb{R}^1\times\mathcal{X}\to\mathbb{R}^1$. Then, for example,
$$\int\frac{1}{\epsilon^2}D^2\left(g(\cdot,x)\right)(\theta_0)\, f_{\theta_0}(x)\, d\mu(x)=\frac{1}{\epsilon^2}\int\left[g(\theta_0+\epsilon,x)+g(\theta_0-\epsilon,x)-2g(\theta_0,x)\right]f_{\theta_0}(x)\, d\mu(x)=\frac{1}{\epsilon^2}D^2\left(\int g(\cdot,x)\, f_{\theta_0}(x)\, d\mu(x)\right)(\theta_0)$$
Now the limit of the last of these (as $\epsilon\to 0$) is
$$\left.\frac{d^2}{d\theta^2}\int g(\theta,x)\, f_{\theta_0}(x)\, d\mu(x)\right|_{\theta=\theta_0}$$
So the question of interchange of expectation and second derivative is that of whether the $\mu$ integral of the limit of $\frac{1}{\epsilon^2}D^2\left(g(\cdot,x)\right)(\theta_0)\, f_{\theta_0}(x)$ is the limit of the $\mu$ integral. This might be established using the dominated convergence theorem.

7 Statistical Decision Theory

This is a formal framework for balancing various kinds of risks under uncertainty, and can sometimes be used to look for "good" statistical procedures. It finds modern uses in the search for good data mining and machine learning algorithms.

7.1 Basic Framework and Concepts

To the usual statistical modeling framework of Section 4 we now add the following elements:
1. some "action space" $\mathcal{A}$ with $\sigma$-algebra $\mathcal{E}$,
2. a suitably measurable "loss function" $L:\Theta\times\mathcal{A}\to[0,\infty)$, and
3. (non-randomized) decision rules $\delta:(\mathcal{X},\mathcal{B})\to(\mathcal{A},\mathcal{E})$.
(For data $X$, $\delta(X)$ is the action taken on the basis of $X$.)

Definition 91 $R(\theta,\delta)\equiv E_\theta\, L(\theta,\delta(X))=\int L(\theta,\delta(x))\, dP_\theta(x)$, mapping $\Theta\to[0,\infty)$, is called the risk function for $\delta$.
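Definition 91 is directly computable in small discrete problems. A sketch (my own example, not from the outline: $X\sim\text{Binomial}(n,\theta)$, squared error loss, and the nonrandomized rule $\delta(x)=x/n$), where the risk is an exact finite sum over the $n+1$ sample points:

```python
import math

def binom_pmf(x, n, p):
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def risk(delta, n, theta):
    # Definition 91 with squared error loss: R(theta, delta) = E_theta (theta - delta(X))^2
    return sum(binom_pmf(x, n, theta) * (theta - delta(x, n)) ** 2 for x in range(n + 1))

mle = lambda x, n: x / n
n = 10
# unbiasedness of X/n gives the closed form R(theta, X/n) = theta (1 - theta) / n
```

The computed values match the closed form $R(\theta,X/n)=\theta(1-\theta)/n$.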
Definition 92 $\delta$ is at least as good as $\delta'$ if $R(\theta,\delta)\le R(\theta,\delta')\ \forall\theta$.

Definition 93 $\delta$ is better than (dominates) $\delta'$ if $R(\theta,\delta)\le R(\theta,\delta')\ \forall\theta$ with strict inequality for at least one $\theta\in\Theta$.

Definition 94 $\delta$ and $\delta'$ are risk equivalent if $R(\theta,\delta)=R(\theta,\delta')\ \forall\theta$.

Definition 95 $\delta$ is best in a class of decision rules $\Delta$ if $\delta\in\Delta$ and is at least as good as any other $\delta'\in\Delta$.

Definition 96 $\delta$ is inadmissible in $\Delta$ if $\exists\,\delta'\in\Delta$ which is better than (dominates) $\delta$.

Definition 97 $\delta$ is admissible in $\Delta$ if it is not inadmissible in $\Delta$ (if there is no $\delta'$ which is better).

We will use the notation
$$\mathcal{D}=\{\delta\}=\text{the class of (non-randomized) decision rules}$$
It is technically useful to extend the notion of decision procedures to include the possibility of randomizing in various ways.

Definition 98 If for each $x\in\mathcal{X}$, $\phi_x$ is a distribution on $(\mathcal{A},\mathcal{E})$, then $\phi_x$ is called a behavioral decision rule.

The notion of a behavioral decision rule is that one observes $X=x$ and then makes a random choice of an element of $\mathcal{A}$ using distribution $\phi_x$. We'll let
$$\mathcal{D}^*=\{\phi_x\}=\text{the class of behavioral decision rules}$$
It's possible to think of $\mathcal{D}$ as a subset of $\mathcal{D}^*$ by associating with $\delta\in\mathcal{D}$ a behavioral decision rule $\phi_x$ that is a point mass distribution on $\mathcal{A}$ concentrated at $\delta(x)$. The natural definition of the risk function of a behavioral decision rule is (abusing notation and using "$R$" here too)
$$R(\theta,\phi)=\int_{\mathcal{X}}\int_{\mathcal{A}}L(\theta,a)\, d\phi_x(a)\, dP_\theta(x)$$
A second (less intuitively appealing) notion of randomizing decisions is one that might somehow pick an element of $\mathcal{D}$ at random (and then plug in $X$). Let $\mathcal{F}^*$ be a $\sigma$-algebra on $\mathcal{D}$ that contains all singleton sets.

Definition 99 A randomized decision function (or rule) $\psi$ is a probability measure on $(\mathcal{D},\mathcal{F}^*)$.

With distribution $\psi$, $\delta$ becomes a random object. Let
$$\overline{\mathcal{D}}=\{\psi\}=\text{the class of randomized decision rules}$$
It's possible to think of $\mathcal{D}$ as a subset of $\overline{\mathcal{D}}$ by associating with $\delta\in\mathcal{D}$ a randomized decision rule placing mass 1 on $\delta$.
The natural definition of the risk function of a randomized decision rule is (yet again abusing notation and using "$R$" here too)
$$R(\theta,\psi)=\int_{\mathcal{D}}R(\theta,\delta)\, d\psi(\delta)=\int_{\mathcal{D}}\int_{\mathcal{X}}L(\theta,\delta(x))\, dP_\theta(x)\, d\psi(\delta)$$
(assuming that $R(\theta,\delta)$ is properly measurable).

The behavioral decision rules are most natural, while the randomized decision rules are easiest to deal with in some proofs. So it is a reasonably important question when $\mathcal{D}^*$ and $\overline{\mathcal{D}}$ are equivalent in the sense of generating the same set of risk functions. Some properly qualified version of the following is true.

Proposition 100 If $\mathcal{A}$ is a complete separable metric space with $\mathcal{E}$ the Borel $\sigma$-algebra and ??? regarding the distributions $P_\theta$ and ???, then $\mathcal{D}^*$ and $\overline{\mathcal{D}}$ are equivalent in terms of generating the same set of risk functions.

$\mathcal{D}^*$ and $\overline{\mathcal{D}}$ are clearly more complicated than $\mathcal{D}$. A sensible question is when they really provide anything $\mathcal{D}$ doesn't provide. One kind of negative answer can be given for the case of convex loss. The following is like 2.5a of Shao, page 151 of Schervish, page 40 of Berger, or page 78 of Ferguson.

Lemma 101 Suppose that $\mathcal{A}$ is a convex subset of $\mathbb{R}^d$ and $\phi_x$ is a behavioral decision rule. Define a non-randomized decision rule $\delta_\phi$ by
$$\delta_\phi(x)=\int_{\mathcal{A}}a\, d\phi_x(a)$$
(In the case that $d>1$, interpret $\delta_\phi(x)$ as vector-valued, the integral as a vector of integrals over the $d$ coordinates of $a\in\mathcal{A}$.) Then
1. if $L(\theta,\cdot):\mathcal{A}\to[0,\infty)$ is convex, then $R(\theta,\delta_\phi)\le R(\theta,\phi)$, and
2. if $L(\theta,\cdot):\mathcal{A}\to[0,\infty)$ is strictly convex, $R(\theta,\phi)<\infty$ and $P_\theta\left(\{x\mid\phi_x\text{ is non-degenerate}\}\right)>0$, then $R(\theta,\delta_\phi)<R(\theta,\phi)$.

Corollary 102 Suppose that $\mathcal{A}$ is a convex subset of $\mathbb{R}^d$, $\phi_x$ is a behavioral decision rule and
$$\delta_\phi(x)=\int_{\mathcal{A}}a\, d\phi_x(a)$$
Then
1. if $L(\theta,a)$ is convex in $a\ \forall\theta$, $\delta_\phi$ is at least as good as $\phi$, and
2. if $L(\theta,a)$ is convex in $a\ \forall\theta$ and for some $\theta_0$ the function $L(\theta_0,a)$ is strictly convex in $a$, $R(\theta_0,\phi)<\infty$ and $P_{\theta_0}\left(\{x\mid\phi_x\text{ is non-degenerate}\}\right)>0$, then $\delta_\phi$ is better than $\phi$.

The corollary shows, e.g., that for squared error loss estimation, averaging out over non-trivial randomization in a behavioral decision rule will in fact improve that estimator.

7.2 (Finite Dimensional) Geometry of Decision Theory

A very helpful device for understanding some of the basics of decision theory is the geometry associated with cases where $\Theta$ is finite. Let
$$\Theta=\{\theta_1,\theta_2,\ldots,\theta_k\}$$
Assume that $R(\theta,\psi)<\infty\ \forall\theta\in\Theta$ and $\psi\in\overline{\mathcal{D}}$ and note that in this case $R(\cdot,\psi):\Theta\to[0,\infty)$ can be thought of as a $k$-vector. Let
$$\mathcal{S}=\left\{y=(y_1,y_2,\ldots,y_k)\in\mathbb{R}^k\mid y_i=R(\theta_i,\psi)\text{ for all }i\text{ and some }\psi\in\overline{\mathcal{D}}\right\}$$
$=$ the set of all randomized risk vectors.

Theorem 103 $\mathcal{S}$ is convex.

In fact, it turns out that if
$$\mathcal{S}_0=\left\{y=(y_1,y_2,\ldots,y_k)\in\mathbb{R}^k\mid y_i=R(\theta_i,\delta)\text{ for all }i\text{ and some }\delta\in\mathcal{D}\right\}$$
$\mathcal{S}$ is the convex hull of $\mathcal{S}_0$. This is the smallest convex set containing $\mathcal{S}_0$, the set of all convex combinations of points of $\mathcal{S}_0$, the intersection of all convex sets containing $\mathcal{S}_0$. (By the definition of randomized decision rules, each $\psi$ produces a $y$ that is a convex combination of points in $\mathcal{S}_0$, so it is at least obvious that $\mathcal{S}$ is a subset of the convex hull of $\mathcal{S}_0$.)

Definition 104 For $x\in\mathbb{R}^k$, the lower quadrant of $x$ is
$$Q_x=\{z\mid z_i\le x_i\ \forall i\}$$

Theorem 105 $y\in\mathcal{S}$ (or the decision rule giving rise to $y$) is admissible if and only if $Q_y\cap\mathcal{S}=\{y\}$.

Definition 106 For $\overline{\mathcal{S}}$ the closure of $\mathcal{S}$, the lower boundary of $\mathcal{S}$ is
$$\lambda(\mathcal{S})=\left\{y\mid Q_y\cap\overline{\mathcal{S}}=\{y\}\right\}$$

Definition 107 $\mathcal{S}$ is closed from below if $\lambda(\mathcal{S})\subset\mathcal{S}$.

Let $A(\mathcal{S})=\{y\mid Q_y\cap\mathcal{S}=\{y\}\}$ be the set of admissible risk points.

Theorem 108 If $\mathcal{S}$ is closed, $A(\mathcal{S})=\lambda(\mathcal{S})$.

Theorem 108 is very easy to prove. A better theorem (one with weaker hypotheses) that is perhaps surprisingly harder to prove is next.

Theorem 109 If $\mathcal{S}$ is closed from below, $A(\mathcal{S})=\lambda(\mathcal{S})$.

Now drop the finiteness assumption.

7.3 Complete Classes of Decision Rules

Definition 110 A class of decision rules $\mathcal{C}\subset\overline{\mathcal{D}}$ is a complete class if for any $\psi\notin\mathcal{C}$, $\exists\,\psi'\in\mathcal{C}$ such that $\psi'$ is better than $\psi$.
(One can do better in the complete class than anywhere outside of the complete class in terms of risk functions.)

Definition 111 A class of decision rules $\mathcal{C}\subset\overline{\mathcal{D}}$ is an essentially complete class if for any $\psi\notin\mathcal{C}$, $\exists\,\psi'\in\mathcal{C}$ such that $\psi'$ is at least as good as $\psi$.

Definition 112 A class of decision rules $\mathcal{C}\subset\overline{\mathcal{D}}$ is a minimal complete class if it is complete and is a subset of any other complete class.

Let $A(\overline{\mathcal{D}})$ be the set of admissible rules in $\overline{\mathcal{D}}$.

Theorem 113 If a minimal complete class $\mathcal{C}$ exists, then $\mathcal{C}=A(\overline{\mathcal{D}})$.

Theorem 114 If $A(\overline{\mathcal{D}})$ is complete, then it is minimal complete.

Theorems 113 and 114 are properly qualified versions of the rough (not quite true) statement "$A(\overline{\mathcal{D}})$ is the minimal complete class."

7.4 Sufficiency and Decision Theory

Presumably one can typically do as well in a decision problem based on a sufficient statistic $T(X)$ as one can do based on $X$.

Theorem 115 If $T$ is sufficient for $\mathcal{P}$ and $\phi$ is a behavioral decision rule, then $\exists$ another behavioral decision rule $\phi'$ that is a function of $T$ and has the same risk function as $\phi$. (That $\phi'$ is a function of $T$ means that $T(x)=T(y)$ implies that $\phi'_x$ and $\phi'_y$ are the same distributions on $\mathcal{A}$.)

Theorem 115 shows that the class of rules that are functions of a sufficient statistic is essentially complete. Theorem 118 shows that under a convex loss, conditioning on a sufficient statistic can sometimes actually improve a decision rule. We state two small results that lead up to this.

Lemma 116 Suppose $\mathcal{A}\subset\mathbb{R}^d$ is convex and $\delta_1$ and $\delta_2$ are two non-randomized decision rules. Then
$$\delta=\frac{1}{2}\left(\delta_1+\delta_2\right)$$
is also a decision rule. Further,
1. if $L(\theta,a)$ is convex in $a$ and $R(\theta,\delta_1)=R(\theta,\delta_2)$, then $R(\theta,\delta)\le R(\theta,\delta_1)$, and
2. if $L(\theta,a)$ is strictly convex in $a$, $R(\theta,\delta_1)=R(\theta,\delta_2)<\infty$, and $P_\theta\left(\delta_1(X)\ne\delta_2(X)\right)>0$, then $R(\theta,\delta)<R(\theta,\delta_1)$.

Corollary 117 Suppose $\mathcal{A}\subset\mathbb{R}^d$ is convex and $\delta_1$ and $\delta_2$ are two nonrandomized decision rules with identical risk functions.
If $L(\theta,a)$ is convex in $a\ \forall\theta$, the existence of a $\theta\in\Theta$ for which $L(\theta,a)$ is strictly convex in $a$, $R(\theta,\delta_1)=R(\theta,\delta_2)<\infty$ and $P_\theta\left(\delta_1(X)\ne\delta_2(X)\right)>0$, implies that $\delta_1$ and $\delta_2$ are inadmissible.

Finally, we have the general form of the Rao-Blackwell Theorem (that usually appears in its SELE form in Stat 543 and is 2.5.6 of Shao, page 152 of Schervish, Berger page 41, and Ferguson page 121).

Theorem 118 (Rao-Blackwell Theorem) Suppose $\mathcal{A}\subset\mathbb{R}^d$ is convex and $\delta$ is a nonrandomized decision function with $E_\theta\|\delta(X)\|<\infty\ \forall\theta$. Suppose further that $T$ is sufficient for $\theta$ and with $\mathcal{B}_0=\sigma_{\mathcal{B}}(T)$ let
$$\delta_0(x)\equiv E\left(\delta\mid\mathcal{B}_0\right)(x)$$
Then $\delta_0$ is a non-randomized decision rule. Further,
1. if $L(\theta,\cdot)$ is convex in $a$, then $R(\theta,\delta_0)\le R(\theta,\delta)$, and
2. for any $\theta$ for which $L(\theta,a)$ is strictly convex in $a$, $R(\theta,\delta)<\infty$, and $P_\theta\left(\delta_0(X)\ne\delta(X)\right)>0$, it is the case that $R(\theta,\delta_0)<R(\theta,\delta)$.

7.5 Bayes Decision Rules

The Bayes approach to decision theory is one means of reducing risk functions to numbers so that they can be compared in a straightforward fashion. Let $G$ be a distribution on $(\Theta,\mathcal{C})$.

Definition 119 The Bayes risk of $\psi\in\overline{\mathcal{D}}$ with respect to the prior $G$ is (abusing notation again and once more using the symbol "$R$")
$$R(G,\psi)=\int R(\theta,\psi)\, dG(\theta)$$
Further (again abusing notation) the "minimum" Bayes risk is
$$R(G)\equiv\inf_{\psi'\in\overline{\mathcal{D}}}R(G,\psi')$$

Definition 120 $\psi\in\overline{\mathcal{D}}$ is said to be Bayes with respect to $G$ if
$$R(G,\psi)=R(G)$$

Definition 121 $\psi\in\overline{\mathcal{D}}$ is said to be $\epsilon$-Bayes with respect to $G$ provided
$$R(G,\psi)\le R(G)+\epsilon$$

Bayes rules are often admissible, at least if the prior involved "spreads its mass around sufficiently."

Theorem 122 If $\Theta$ is countable and $G$ is a prior with $G(\{\theta\})>0\ \forall\theta$ and $\psi$ is Bayes with respect to $G$, then $\psi$ is admissible.

Theorem 123 Suppose $\Theta\subset\mathbb{R}^k$ is such that every neighborhood of any point $\theta\in\Theta$ has a non-empty intersection with the interior of $\Theta$. Suppose further that $R(\theta,\psi)$ is continuous in $\theta$ for all $\psi\in\overline{\mathcal{D}}$. Let $G$ be a prior distribution that has support $\Theta$ in the sense that every open ball that is a subset of $\Theta$ has positive $G$ probability.
Then if $R(G)<\infty$ and $\psi$ is a Bayes rule with respect to $G$, $\psi$ is admissible.

Theorem 124 If every Bayes rule with respect to $G$ has the same risk function, they are all admissible.

Corollary 125 In a problem where Bayes rules exist and are unique, they are admissible.

A converse of Theorem 122 (for finite $\Theta$) is given in Theorem 127. Its proof requires the use of a piece of $k$-dimensional analytical geometry (called the separating hyperplane theorem) that is stated next.

Theorem 126 (Separating Hyperplane Theorem) Let $\mathcal{S}_1$ and $\mathcal{S}_2$ be two disjoint convex subsets of $\mathbb{R}^k$. Then $\exists$ a $p\in\mathbb{R}^k$ such that $p\ne 0$ and
$$\sum p_i x_i\ge\sum p_i y_i\quad\forall y\in\mathcal{S}_1\text{ and }\forall x\in\mathcal{S}_2$$

Theorem 127 If $\Theta$ is finite and $\psi$ is admissible, then $\psi$ is Bayes with respect to some prior.

For non-finite $\Theta$, admissible rules are "usually" Bayes or "limits" of Bayes rules. See Section 2.10 of Ferguson or page 546 of Berger.

As the next result shows, Bayesians don't need randomized decision rules, at least for purposes of achieving minimum Bayes risk, $R(G)$. See also page 147 of Schervish.

Proposition 128 Suppose that $\psi\in\overline{\mathcal{D}}$ is Bayes versus $G$ and $R(G)<\infty$. Then there exists a nonrandomized rule $\delta\in\mathcal{D}$ that is also Bayes versus $G$.

There remain the questions of 1) when Bayes rules exist and 2) what they look like when they do exist. These are addressed next.

Theorem 129 If $\Theta$ is finite, $\mathcal{S}$ is closed from below, and $G$ assigns positive probability to each $\theta\in\Theta$, there is a rule Bayes versus $G$.

Definition 130 A formal nonrandomized Bayes rule versus a prior $G$ is a rule $\delta(x)$ such that $\forall x\in\mathcal{X}$,
$$\delta(x)\text{ is an }a\in\mathcal{A}\text{ minimizing }\int L(\theta,a)\,\frac{f_\theta(x)\, dG(\theta)}{\int f_t(x)\, dG(t)}$$

Definition 131 If $G$ is a $\sigma$-finite measure, a formal nonrandomized generalized Bayes rule versus $G$ is a rule $\delta(x)$ such that $\forall x\in\mathcal{X}$,
$$\delta(x)\text{ is an }a\in\mathcal{A}\text{ minimizing }\int L(\theta,a)\, f_\theta(x)\, dG(\theta)$$

7.6 Minimax Decision Rules

An alternative to the Bayes reduction of $R(\theta,\psi)$ to the number $R(G,\psi)=\int R(\theta,\psi)\, dG(\theta)$ is to reduce $R(\theta,\psi)$ to the number $\sup_\theta R(\theta,\psi)$.
Definition 132 A decision rule $\psi\in\overline{\mathcal{D}}$ is said to be minimax if
$$\sup_\theta R(\theta,\psi)=\inf_{\psi'}\sup_\theta R(\theta,\psi')$$

It is intuitively plausible that if one is trying to produce a minimax rule by pushing down "the highest peak" in $R(\theta,\psi)$ (considered as a function of $\theta$), that will tend to direct one toward rules with fairly "flat" risk functions. So the notion of "equalizer rules" has a central place in minimax theory.

Definition 133 If a decision rule $\psi\in\overline{\mathcal{D}}$ has a constant risk function, it is called an equalizer rule.

Theorem 134 If $\psi\in\overline{\mathcal{D}}$ is an equalizer rule and is admissible, then it is minimax.

Theorem 135 Suppose that $\{\psi_i\}$ is a sequence of decision rules, each $\psi_i$ Bayes versus a prior $G_i$. If $R(G_i,\psi_i)\to C<\infty$ and $\psi$ is a decision rule with $R(\theta,\psi)\le C\ \forall\theta$, then $\psi$ is minimax.

Corollary 136 If $\psi\in\overline{\mathcal{D}}$ is an equalizer rule and is Bayes with respect to a prior $G$, then it is minimax.

Corollary 137 If $\psi\in\overline{\mathcal{D}}$ is Bayes versus $G$ and $R(\theta,\psi)\le R(G)\ \forall\theta$, then it is minimax.

In light of Corollaries 136 and 137, a reasonable question is "How might I identify a prior that might produce a Bayes rule that is minimax?" An answer to this question is "Look for a 'least favorable' prior distribution."

Definition 138 A prior distribution $G$ is said to be least favorable if
$$R(G)=\sup_{G'}R(G')$$

Theorem 139 If $\psi$ is Bayes versus $G$ and $R(\theta,\psi)\le R(G)\ \forall\theta$, then $G$ is least favorable.

7.7 Invariance/Equivariance Theory

In decision problems with sufficient symmetries, it may make sense to restrict attention to decision rules that have corresponding symmetries. It then can happen that risk functions of such rules take relatively few values ... are potentially even constant. Then when comparing (constant) risk functions, one is in a position to identify a best (symmetric) rule. Invariance/equivariance theory is the formalization of this line of argument.
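Corollaries 136 and 137 can be illustrated with a classic example (standard, though not worked in the outline): for $X\sim\text{Binomial}(n,\theta)$ and squared error loss, $\delta(x)=(x+\sqrt{n}/2)/(n+\sqrt{n})$ is Bayes versus a Beta$(\sqrt{n}/2,\sqrt{n}/2)$ prior and has constant risk $n/(4(n+\sqrt{n})^2)$, so it is an equalizer Bayes rule and hence minimax. A numerical sketch of the constancy, and of the comparison with the MLE's risk:

```python
import math

def binom_pmf(x, n, p):
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def risk(delta, n, theta):
    # squared error risk R(theta, delta) = E_theta (theta - delta(X))^2, exact finite sum
    return sum(binom_pmf(x, n, theta) * (theta - delta(x, n)) ** 2 for x in range(n + 1))

mle       = lambda x, n: x / n
equalizer = lambda x, n: (x + math.sqrt(n) / 2) / (n + math.sqrt(n))

n = 10
grid = [0.05, 0.25, 0.5, 0.75, 0.95]
eq_risks  = [risk(equalizer, n, t) for t in grid]
mle_risks = [risk(mle, n, t) for t in grid]
```

The equalizer's risk is flat across the grid, and its maximum sits well below the MLE's maximum risk of $1/(4n)$ at $\theta=1/2$.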
7.7.1 Location Parameter Estimation and Invariance/Equivariance

Definition 140 When X = R^k and Θ ⊆ R^k, to say that θ is a location parameter means that under P_θ, X − θ has the same distribution for all θ.

Definition 141 When X = R^k and Θ ⊆ R, to say that θ is a 1-dimensional location parameter means that under P_θ, X − θ1 has the same distribution for all θ.

Until further notice, assume that Θ = A = R.

Definition 142 A loss function is called location invariant if L(θ, a) = γ(θ − a) for some non-negative function γ. If γ(z) is increasing in z for z > 0 and decreasing in z for z < 0, the corresponding decision problem is called a location estimation problem.

Definition 143 A non-randomized decision rule δ is called location equivariant if

δ(x + c1) = δ(x) + c

Theorem 144 If θ is a 1-dimensional location parameter, L(θ, a) is location invariant, and δ is location equivariant, then R(θ, δ) is constant in θ.

Definition 145 A function g: R^k → R is called location invariant if g(x + c1) = g(x). A location invariant function is "constant on '45° lines.'"

Lemma 146 A function u is location invariant if and only if u depends upon x only through

y(x) = (x1 − xk, x2 − xk, ..., x_{k−1} − xk)

Lemma 147 Suppose that a non-randomized decision rule δ0 is location equivariant. Then δ1 is location equivariant if and only if ∃ a location invariant function u such that

δ1 = δ0 + u

Theorem 148 Suppose that L(θ, a) = γ(θ − a) is location invariant and ∃ a location equivariant estimator δ0 for θ with finite risk. Let γ0(z) = γ(−z). Suppose that for each y ∃ a number v(y) that minimizes E_{θ=0}[γ0(δ0 − v) | Y = y] over choices of v. Then

δ(x) = δ0(x) − v(y(x))

is a best location equivariant estimator of θ.

In the case of squared error loss, the estimator of Theorem 148 is "Pitman's estimator" and can be written out more explicitly in some situations.
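For squared error loss, the best equivariant (Pitman) estimator takes the form δ(x) = ∫ θ f(x − θ1) dθ / ∫ f(x − θ1) dθ, and a brute-force quadrature version is easy to sketch. The N(θ, 1) model, the data, and the grid settings below are assumptions for illustration; for iid normal observations the Pitman estimator is known to reduce to the sample mean, which the numerics reproduce.

```python
# Brute-force sketch of the Pitman estimator via a Riemann sum over theta.
from math import exp, pi, sqrt

def pitman(xs, half_width=10.0, m=4001):
    # delta(x) = integral of theta * f(x - theta 1) d theta
    #            / integral of f(x - theta 1) d theta
    xbar = sum(xs) / len(xs)
    num = den = 0.0
    for i in range(m):
        theta = xbar - half_width + 2 * half_width * i / (m - 1)
        lik = 1.0
        for x in xs:                      # N(theta, 1) likelihood (assumed model)
            lik *= exp(-0.5 * (x - theta)**2) / sqrt(2 * pi)
        num += theta * lik
        den += lik
    return num / den

xs = [0.8, 1.3, -0.2, 2.1, 0.5]
print(pitman(xs), sum(xs) / len(xs))      # for normal data these agree
```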
Theorem 149 Suppose that L(θ, a) = (θ − a)² and that the distribution of X = (X1, ..., Xk) (P_θ) has R-N derivative with respect to Lebesgue measure on R^k

f_θ(x) = f_θ(x1, ..., xk) = f(x − θ1)

and that f has second moments. Then the best location equivariant estimator of θ is

δ(x) = ∫_{−∞}^{∞} θ f(x − θ1) dθ / ∫_{−∞}^{∞} f(x − θ1) dθ

Pitman's estimator is the formal generalized Bayes estimator versus Lebesgue measure on R.

Lemma 150 For the case of L(θ, a) = (θ − a)², a best location equivariant estimator of θ must be unbiased for θ.

7.7.2 Generalities of Invariance/Equivariance Theory

A somewhat more general story about invariance leads to a generalization of Theorem 144. So, suppose that P = {P_θ} is identifiable and that L(θ, a) = L(θ, a') ∀ θ implies that a = a'. Invariance/equivariance theory is phrased in terms of transformations X → X, Θ → Θ, and A → A and how decision rules relate to them.

Definition 151 For 1-1 transformations of X onto X, g1 and g2, the composition of g1 and g2 is another 1-1 transformation of X onto X defined by

g2 ∘ g1 (x) = g2(g1(x)) ∀ x

Definition 152 For g a 1-1 transformation of X onto X, if h is another such transformation such that h ∘ g (x) = x ∀ x, h is called the inverse of g and denoted as h = g^{-1}.

It's an easy result that both g^{-1} ∘ g (x) = x ∀ x and g ∘ g^{-1} (x) = x ∀ x.

Definition 153 The identity transformation on X is defined by e(x) = x ∀ x.

Another easy fact is then that e ∘ g = g ∘ e = g for g any 1-1 transformation of X onto X.

Proposition 154 If G is a class of 1-1 transformations of X onto X that is closed under composition and inversion, then with the operation "∘", G is a group in the usual mathematical sense.

To go anywhere with this theory, we need transformations on Θ that fit with the ones on X. At a minimum, we don't want the transformations on X to move us outside of P.

Definition 155 If g is a 1-1 transformation of X onto X such that

{ P_θ^{g(X)} | θ ∈ Θ } = P

the transformation is said to leave the model invariant.
If each element of a group of transformations G leaves P invariant, we'll say that the group leaves the model invariant. One circumstance in which G leaves P invariant is that where the model P can be thought of as "generated by G" in the sense that for some U ~ P (a fixed distribution P), P = { P^{g(U)} | g ∈ G }.

If a group of transformations on X (say G) leaves P invariant, there is a natural corresponding group of transformations on Θ. That is, for g ∈ G,

X ~ P_θ ⟹ g(X) ~ P_{θ'} for some θ' ∈ Θ

(and in fact, because of the identifiability assumption, there is only one such θ'). So one might adopt the next definition.

Definition 156 If G leaves P invariant, for each g ∈ G, define ḡ: Θ → Θ by

ḡ(θ) = the element θ' ∈ Θ such that X ~ P_θ ⟹ g(X) ~ P_{θ'}

Proposition 157 For G a group of 1-1 transformations of X onto X leaving P invariant, the set Ḡ = { ḡ | g ∈ G } with the operation "function composition" is a group of 1-1 transformations of Θ onto Θ. (Ḡ is the homomorphic image of G under the map g → ḡ.)

Now we need transformations on A that fit together nicely with those on X and Θ.

Definition 158 A loss function L(θ, a) is said to be invariant under the group of transformations G, provided for every g ∈ G and a ∈ A, ∃ a unique a' ∈ A such that

L(θ, a) = L(ḡ(θ), a') ∀ θ

Definition 159 If G leaves P invariant and the loss function L(θ, a) is invariant, define g̃: A → A by

g̃(a) = the element a' of A such that L(θ, a) = L(ḡ(θ), a') ∀ θ

According to the notations in Definitions 156 and 159, for G leaving P invariant and L(θ, a) invariant,

L(θ, a) = L(ḡ(θ), g̃(a))

Proposition 160 For G a group of 1-1 transformations of X onto X leaving P invariant and invariant loss function L(θ, a), G̃ = { g̃ | g ∈ G } with the operation "function composition" is a group of 1-1 transformations of A onto A. (G̃ is the homomorphic image of G under the map g → g̃.)

Finally, we come to the business of equivariance of decision functions. Here restrict attention to non-randomized decision rules.
Definition 161 In an invariant decision problem (one where G leaves P invariant and the loss function L(θ, a) is invariant), a non-randomized rule δ is said to be equivariant provided

δ(g(x)) = g̃(δ(x)) ∀ x ∈ X and g ∈ G

We now have enough structure to state the promised generalization of Theorem 144.

Theorem 162 If G leaves P invariant, the loss function L(θ, a) is invariant, and the non-randomized rule δ is equivariant, then

R(θ, δ) = R(ḡ(θ), δ) ∀ θ ∈ Θ and ḡ ∈ Ḡ

Definition 163 For θ ∈ Θ, O_θ = { θ' ∈ Θ | θ' = ḡ(θ) for some ḡ ∈ Ḡ } is called the orbit in Θ containing θ.

The distinct orbits O_θ partition Θ, and Theorem 162 says that an equivariant non-randomized rule in an invariant problem has a risk function that is constant on orbits. Comparing risk functions for equivariant non-randomized rules in an invariant problem reduces to comparing risks on orbits. This, of course, is most helpful when there is only one orbit.

Definition 164 A group of 1-1 transformations of a space onto itself is called transitive if for any two points in the space there is a transformation in the group taking the first point to the second.

Corollary 165 If G leaves P invariant, the loss function L(θ, a) is invariant, and Ḡ is transitive over Θ, then every equivariant non-randomized decision rule has constant risk.

Definition 166 In an invariant decision problem, an equivariant decision rule δ is said to be a best (non-randomized) equivariant (or MRE: minimum risk equivariant) decision rule provided that for any other (non-randomized) equivariant rule δ',

R(θ, δ) ≤ R(θ, δ') ∀ θ

(and it suffices to check that R(θ, δ) ≤ R(θ, δ') for any one θ).

8 Asymptotics of Likelihood Inference

As usual, write P = {P_θ}. Suppose that P is identifiable and dominated by a σ-finite measure μ, and write

f_θ = dP_θ/dμ

We'll here consider asymptotics for inference based on the likelihood function f_θ(X) (a random function of θ). We'll concentrate on the iid (one sample) case. Here is some notation we'll use in the discussion.
The basic one-observation model is (X, B, P_θ) with parameter space Θ ⊆ R^k. Based on this we'll take

X^n = (X1, X2, ..., Xn)   the observable through stage n
X^n (the space)           the observation space for X^n
B^n                       the n-fold product σ-algebra
P_θ^n                     the distribution of X^n
μ^n                       the dominating (product) measure on the observation space
f_θ^n = dP_θ^n/dμ^n

Notice that if one wants to prove almost sure convergence results, one needs a fixed probability space in which to operate for all n. In that case, one could use a big space involving

(Ω, C, P) = (X^∞, B^∞, P_θ^∞)

and think of the Xi in X^n = (X1, X2, ..., Xn) as

Xi(ω) = the ith coordinate of ω

8.1 Asymptotics of Likelihood Estimation

8.1.1 Consistency of Likelihood-Based Estimators

Definition 167 A statistic T_n: X^n → Θ is called a maximum likelihood estimator of θ if

f^n_{T_n(x^n)}(x^n) = sup_{θ ∈ Θ} f^n_θ(x^n) ∀ x^n ∈ X^n

For a fixed x^n ∈ X^n, a θ ∈ Θ maximizing f^n_θ(x^n) might be called a maximum likelihood estimate even when some x^n's might not have corresponding maximizers.

Definition 168 Suppose that {T_n} is a sequence of statistics, T_n: X^n → R^k.

1. {T_n} is (weakly) consistent at θ0 ∈ Θ if for every ε > 0,

P^n_{θ0}[ ||T_n − θ0|| > ε ] → 0 as n → ∞

2. {T_n} is strongly consistent at θ0 ∈ Θ if T_n → θ0 a.s. P^∞_{θ0}.

Let

z_{θ0}(θ) = E_{θ0} ln f_θ(X) = −I(θ0; θ) + ∫ ln(f_{θ0}(x)) f_{θ0}(x) dμ(x)

It is a corollary of Lemma 86 that for θ ≠ θ0,

z_{θ0}(θ) < z_{θ0}(θ0)

Consider the loglikelihood through n observations

l_n(x^n, θ) = ln f^n_θ(x^n) = Σ_{i=1}^n ln f_θ(xi)

Note that

z_{θ0}(θ) = E_{θ0} [ (1/n) l_n(X^n, θ) ]

and in fact the LLN promises that for every θ,

(1/n) l_n(X^n, θ) → z_{θ0}(θ) in P_{θ0} probability

So, assuming enough regularity to make the convergence uniform in θ, there is perhaps hope that a single maximizer of (1/n) l_n(X^n, θ) will exist and be close to the maximizer of z_{θ0}(θ) (namely θ0), i.e. that an MLE might exist and be consistent. To make this kind of argument precise is quite hard. See Theorems 7.49 and 7.54 of Schervish. We'll take a different, easier, and much more common line of development.
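The heuristic above (the limiting average loglikelihood z_{θ0}(θ) is uniquely maximized at θ0, the corollary of Lemma 86 cited above) can be made concrete in a small example. The Bernoulli model and the value θ0 = 0.3 below are illustrative assumptions.

```python
# Sketch: for Bernoulli(theta), z_theta0(theta) = E_theta0 ln f_theta(X)
# = theta0 ln(theta) + (1 - theta0) ln(1 - theta), maximized uniquely at theta0.
from math import log

theta0 = 0.3  # assumed true parameter

def z(theta):
    return theta0 * log(theta) + (1 - theta0) * log(1 - theta)

grid = [i / 1000 for i in range(1, 1000)]
argmax = max(grid, key=z)
print(argmax)   # the maximizer over the grid sits at theta0
```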
Where the f_θ are differentiable in θ, an MLE must satisfy the likelihood equations.

Definition 169 In the case where Θ ⊆ R^k and the f_θ are differentiable in θ, the likelihood equations are

∂/∂θ_i ln f^n_θ(x^n) = 0 for i = 1, 2, ..., k

Much of the relatively simple theory of (maximum) likelihood revolves more around solutions of the likelihood equations than it does maximizers of the likelihood. For a one-dimensional parameter there is the following result.

Theorem 170 Suppose that k = 1 and ∃ an open neighborhood of θ0, say O, such that

1. f_θ(x) > 0 ∀ x and ∀ θ ∈ O,

2. ∀ x, f_θ(x) is differentiable at every point θ ∈ O, and

3. E_{θ0} ln f_θ(X) exists for all θ ∈ O and E_{θ0} ln f_{θ0}(X) is finite.

Then for every ε > 0 and δ > 0, ∃ n such that ∀ m > n the P_{θ0} probability that the equation

(d/dθ) l_m(X^m, θ) = 0

(the likelihood equation) has a root within ε of θ0 is at least 1 − δ.

Corollary 171 Under the hypotheses of Theorem 170, suppose in addition that Θ is open and f_θ(x) is differentiable at every point θ ∈ Θ, so that the likelihood equation

(d/dθ) l_n(X^n, θ) = Σ_{i=1}^n (d/dθ) ln f_θ(Xi) = 0

makes sense at all θ ∈ Θ. Define θ_n to be the root of the likelihood equation when there is exactly one (and otherwise adopt any definition for θ_n). If with P_{θ0} probability approaching 1 the likelihood equation has a single root, then

θ_n → θ0 in P_{θ0} probability

Corollary 172 Under the hypotheses of Theorem 170, if {T_n} is a sequence of estimators consistent at θ0 and

θ̂_n = T_n if the likelihood equation has no roots, and
θ̂_n = the root of the likelihood equation closest to T_n otherwise,

then θ̂_n → θ0 in P_{θ0} probability.

Actually implementing the prescription for θ̂_n in Corollary 172 is not without its practical and philosophical difficulties. Another line of attack for possibly improving a consistent estimator is to adopt a "1-step Newton improvement" on it. Writing

L_n(θ) = l_n(X^n, θ) = Σ_{i=1}^n ln f_θ(Xi)

this is

θ̃_n = T_n − L'_n(T_n) / L''_n(T_n)

(of course provided L''_n(T_n) ≠ 0).
In the event that k > 1, this becomes

θ̃_n = T_n − [L''_n(T_n)]^{-1} L'_n(T_n)

for L'_n(T_n) the k × 1 vector of first partials of L_n(θ) and L''_n(T_n) the k × k matrix of second partials of L_n(θ). It is plausible that estimators of this type end up having asymptotic behavior similar to that of real roots of the likelihood equations. See, e.g., Schervish's development around his Theorem 7.75 for a more general ("M-estimation") version of this.

8.1.2 Asymptotic Normality of Likelihood-Based Estimators

An honest version of the Stat 543 claim that "MLEs are asymptotically Normal" is next.

Theorem 173 Suppose that k = 1 and ∃ an open neighborhood of θ0, say O, such that

1. f_θ(x) > 0 ∀ x and ∀ θ ∈ O,

2. ∀ x, f_θ(x) is three times differentiable at every point of O,

3. ∃ M(x) ≥ 0 with E_{θ0} M(X) < ∞ and

| d³/dθ³ ln f_θ(x) | ≤ M(x) ∀ x and ∀ θ ∈ O,

4. 1 = ∫ f_θ(x) dμ(x) can be differentiated twice with respect to θ under the integral at θ0, and

5. I1(θ) ∈ (0, ∞) ∀ θ ∈ O.

If with P_{θ0} probability approaching 1, θ̂_n is a root of the likelihood equation and θ̂_n → θ0 in P_{θ0} probability, then under θ0

√n (θ̂_n − θ0) →d N(0, 1/I1(θ0))

Corollary 174 Under the hypotheses of Theorem 173, if I1(θ) is continuous at θ0, then under θ0

√(n I1(θ̂_n)) (θ̂_n − θ0) →d N(0, 1)

Corollary 175 Under the hypotheses of Theorem 173, under θ0

√(−L''_n(θ̂_n)) (θ̂_n − θ0) →d N(0, 1)

What is often a more practically useful result (parallel to Theorem 173) concerns "one-step Newton improvements" on "√n-consistent" estimators. (The following is a special case of Schervish's Theorem 7.75.)

Theorem 176 Under the hypotheses 1-5 of Theorem 173, suppose that under θ0 the estimators T_n are √n-consistent, that is, √n (T_n − θ0) converges in distribution (or more generally, is bounded in P_{θ0} probability). Then with

θ̃_n = T_n − L'_n(T_n) / L''_n(T_n)

under θ0

√n (θ̃_n − θ0) →d N(0, 1/I1(θ0))

It is then obvious that versions of Corollaries 174 and 175 hold where θ̂_n is replaced by θ̃_n.
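The one-step Newton improvement of Theorem 176 can be sketched concretely. The data and the Cauchy location model below are assumptions for illustration; T_n is taken to be the sample median (a √n-consistent starter for Cauchy location), and one Newton step on the loglikelihood gives θ̃_n = T_n − L'_n(T_n)/L''_n(T_n).

```python
# One-step Newton improvement of a root-n-consistent starter (Cauchy location,
# an assumed example model): log f_theta(x) = -log(1 + (x - theta)^2) + const.
import statistics

xs = [0.1, -1.4, 0.9, 2.3, 0.4, -0.6, 1.1, 0.2, 3.5, -0.3]

def lprime(t):          # L_n'(theta): sum of d/dtheta log f
    return sum(2 * (x - t) / (1 + (x - t)**2) for x in xs)

def ldoubleprime(t):    # L_n''(theta)
    return sum(2 * ((x - t)**2 - 1) / (1 + (x - t)**2)**2 for x in xs)

Tn = statistics.median(xs)                      # starter estimator
tilde = Tn - lprime(Tn) / ldoubleprime(Tn)      # one Newton step
print(Tn, tilde)
```

One step already moves the score much closer to zero than it is at the starter.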
8.2 Asymptotics of LRT-like Tests and Competitors and Related Confidence Regions

This primarily concerns the most common "general purpose" means of cooking up tests (and related confidence sets) in non-standard statistical models. For testing H0: θ ∈ Θ0 versus Ha: θ ∈ Θ1 = Θ − Θ0 we might consider a statistic

LR(x) = sup_{θ ∈ Θ1} f_θ(x) / sup_{θ ∈ Θ0} f_θ(x)

or perhaps

λ(x) = max(LR(x), 1) = sup_{θ ∈ Θ} f_θ(x) / sup_{θ ∈ Θ0} f_θ(x)

where rejection is appropriate for large λ(x). Notice that if there is an MLE of θ, say θ̂, then

sup_{θ ∈ Θ} f_θ(x) = f_{θ̂}(x)

and there is thus the promise of using "MLE-type" asymptotics to establish limiting distributions for likelihood ratio-type test statistics. Corresponding to the case of a point null hypothesis (H0: θ = θ0) in the iid (one-sample) context there is the following. (This is the k = 1 version. Similar results hold for k > 1 with χ²_k limits.)

Theorem 177 Under the hypotheses of Theorem 173, if

2 ln λ_n = 2 ln [ f^n_{θ̂_n}(X^n) / f^n_{θ0}(X^n) ] = 2 [ L_n(θ̂_n) − L_n(θ0) ]

then under θ0

2 ln λ_n →d χ²_1

Theorem 177 and its k > 1 generalizations are relevant to testing of point null hypotheses and corresponding confidence set making for entire parameter vectors. There are also versions of χ² limit theorems for composite null hypotheses relevant to confidence set making for parts of a k-dimensional parameter vector. (See, for example, Schervish page 459 for the complete statement of a theorem of this type.) In vague terms, one can often prove things like what is indicated in the next claim.

Claim 178 Suppose that Θ ⊆ R^p and appropriate regularity conditions hold in the iid problem. If k < p and

λ_n = sup_{θ ∈ Θ} f^n_θ(X^n) / sup_{θ s.t. θ_i = θ_i0, i = 1, 2, ..., k} f^n_θ(X^n)

then under any θ for which θ_i = θ_i0 for i = 1, 2, ..., k,

2 ln λ_n →d χ²_k

Claim 178 is relevant to testing H0: θ_i = θ_i0 for i = 1, 2, ..., k and for making confidence sets for (θ1, θ2, ..., θk).
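As a worked instance of Theorem 177 in the textbook N(θ, 1) example (an assumption for illustration): the MLE is the sample mean, and 2 ln λ_n = 2(L_n(x̄) − L_n(θ0)) collapses algebraically to n(x̄ − θ0)², which is asymptotically χ²_1 under θ0.

```python
# Two equivalent forms of the LRT statistic for H0: theta = theta0 in N(theta, 1).
xs = [0.4, 1.1, -0.3, 0.9, 0.2, 1.6, 0.7, -0.1]   # assumed data
n = len(xs)
theta0 = 0.0
xbar = sum(xs) / n

def Ln(theta):   # loglikelihood up to an additive constant
    return -0.5 * sum((x - theta)**2 for x in xs)

stat = 2 * (Ln(xbar) - Ln(theta0))
print(stat, n * (xbar - theta0)**2)   # identical up to rounding
```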
A way of thinking about confidence sets for part of a parameter vector that come from the inversion of likelihood ratio tests is in terms of "profile loglikelihood" functions.

Definition 179 In the (iid) context where Θ ⊆ R^p and k < p, the function of (θ1, θ2, ..., θk)

L*_n(θ1, θ2, ..., θk) = sup_{(θ_{k+1}, ..., θ_p)} L_n(θ)

is called the profile loglikelihood function for (θ1, θ2, ..., θk).

With the notation/jargon of Definition 179 and results like Claim 178, large n confidence sets for (θ1, θ2, ..., θk) become

{ (θ1, ..., θk) | L*_n(θ1, ..., θk) > sup_{(θ1, ..., θk)} L*_n(θ1, ..., θk) − (1/2) χ²_k }

(for χ²_k a small upper percentage point of the χ²_k distribution).

A generalization of the hypothesis H0: θ_i = θ_i0 for i = 1, 2, ..., k is the hypothesis H0: h_i(θ) = c_i for i = 1, 2, ..., k for some set of functions {h_i(θ)} (that could be h_i(θ) = θ_i) and constants {c_i}. The obvious LRTs based on

λ_n^{c1,...,ck} = sup_{θ ∈ Θ} f^n_θ(X^n) / sup_{θ s.t. h_i(θ) = c_i, i = 1, ..., k} f^n_θ(X^n)

could be defined, χ²_k limit distributions identified under appropriate conditions on the h_i(θ), and large sample confidence sets for (h1(θ), h2(θ), ..., hk(θ)) of the form

{ (c1, c2, ..., ck) | 2 ln λ_n^{c1,...,ck} < χ²_k }

made (for χ²_k a small upper percentage point of the χ²_k distribution).

Tests (and associated confidence sets) sometimes discussed as competitors for likelihood ratio tests (and regions) are so-called "Wald" and "score" procedures. So consider next Wald tests for H0: h_i(θ) = c_i for i = 1, 2, ..., k in the (iid) context where Θ ⊆ R^p and k < p. Define

h(θ) = (h1(θ), h2(θ), ..., hk(θ))'

and for θ̂_n such that under θ0

√n (θ̂_n − θ0) →d MVN_p(0, I1^{-1}(θ0))

consider the estimator h(θ̂_n) of h(θ). For differentiable h(θ) let

H(θ) = ( ∂h_i(θ)/∂θ_j )_{k × p}

The δ-method shows that under θ0

√n ( h(θ̂_n) − h(θ0) ) →d MVN_k( 0, H(θ0) I1^{-1}(θ0) H(θ0)' )

Let B(θ0) = H(θ0) I1^{-1}(θ0) H(θ0)'.
Provided H(θ0) is full rank so that B(θ0) is invertible, and h(θ0) = c = (c1, c2, ..., ck)', under θ0

n ( h(θ̂_n) − c )' B(θ0)^{-1} ( h(θ̂_n) − c ) →d χ²_k

So, supposing that B(θ) is continuous at θ0, an "expected information" version of a Wald test of H0: h_i(θ) = c_i for i = 1, 2, ..., k can be based on the statistic

W_n^{c1,...,ck} = n ( h(θ̂_n) − c )' B(θ̂_n)^{-1} ( h(θ̂_n) − c )

and a limiting χ²_k null distribution. Or, an "observed information" version of W_n can be had by writing

D_n(θ) = ( −(1/n) ∂²/∂θ_i ∂θ_j L_n(θ) )

B̂_n = H(θ̂_n) D_n(θ̂_n)^{-1} H(θ̂_n)'

and then replacing B(θ̂_n)^{-1} in W_n^{c1,...,ck} with B̂_n^{-1}. Note finally that a (k-dimensional) large sample ellipsoidal confidence set for h(θ) is then of the form

{ (c1, c2, ..., ck) | W_n^{c1,...,ck} < χ²_k }

(for χ²_k a small upper percentage point of the χ²_k distribution).

The "score test" competitor for LRT and Wald tests of H0: h_i(θ) = c_i for i = 1, 2, ..., k is motivated by the notion that if H0 is true and θ̃_n maximizes the loglikelihood L_n(θ) subject to the constraints imposed by the null hypothesis, then θ̃_n should typically nearly be an absolute maximizer of L_n(θ), and therefore the gradient of L_n(θ) at θ̃_n should be small. That is, define the p × 1 gradient vector

∇L_n(θ̃_n) = ( ∂/∂θ_i L_n(θ) )|_{θ = θ̃_n}

(this is the score function evaluated at the restricted MLE, θ̃_n), and under H0 we expect this to be close to 0. The score test statistic is

V_n^{c1,...,ck} = (1/n) ∇L_n(θ̃_n)' I1(θ̃_n)^{-1} ∇L_n(θ̃_n)

As it turns out, it is typically the case that under H0, V_n^{c1,...,ck} differs from 2 ln λ_n^{c1,...,ck} (the version of the corresponding LRT statistic with the standard limiting null distribution) by a quantity tending to 0 in probability. Hence a χ²_k limiting null distribution is appropriate for a score test using V_n^{c1,...,ck}.
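In the N(θ, 1) example (an illustrative assumption), with h(θ) = θ and H0: θ = c, one has I1(θ) = 1, unrestricted MLE x̄, and restricted MLE c, so the Wald and score statistics above both reduce algebraically to n(x̄ − c)², and in this particular model they also coincide exactly with 2 ln λ_n.

```python
# Wald and score statistics for H0: theta = c in the (assumed) N(theta, 1) model.
xs = [0.4, 1.1, -0.3, 0.9, 0.2, 1.6, 0.7, -0.1]   # assumed data
n = len(xs)
c = 0.0
xbar = sum(xs) / n

I1 = 1.0                               # Fisher information per observation
wald = n * (xbar - c)**2 * I1          # n (h(theta_hat) - c)' B^{-1} (h(theta_hat) - c)
grad = sum(x - c for x in xs)          # score dL_n/dtheta at the restricted MLE c
score = (1.0 / n) * grad * (1.0 / I1) * grad
print(wald, score)                     # identical here
```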
Note too that another large sample confidence set for h(θ) is then of the form

{ (c1, c2, ..., ck) | V_n^{c1,...,ck} < χ²_k }

(for χ²_k a small upper percentage point of the χ²_k distribution).

It is common to point out that when dealing with a composite null hypothesis, in order to use an LRT one must find both unrestricted and restricted MLEs, while to use a Wald test only an unrestricted MLE is needed, and to use a score test only a restricted MLE is needed.

8.3 Asymptotic Shape of the Loglikelihood and Related Bayesian Asymptotics

We consider some simple/crude/easy implications of the foregoing for questions of the large sample shape of random functions that are the loglikelihood and Bayes posterior densities. (Much finer/harder results are available and some are discussed in Schervish. See his Section 7.4.) Standard statistical folklore is that for large n, under θ0,

1. loglikelihoods are steep with a peak near θ0 and a locally quadratic shape in continuous contexts, and

2. "posteriors are consistent" if a prior spreads some of its mass around at/near θ0, and in continuous contexts, the corresponding posterior density will be approximately normal with mean an MLE of θ.

In the following discussion, continue the iid (one sample) context.

Lemma 180 For θ ≠ θ0,

L_n(θ0) − L_n(θ) → ∞ in P_{θ0} probability

Claim 181 If G is a prior on Θ with density g with respect to ν, the posterior density is therefore

g(θ | X^n) = g(θ) f^n_θ(X^n) / ∫ g(θ) f^n_θ(X^n) dν(θ)

and then for θ ≠ θ0,

g(θ | X^n) / g(θ0 | X^n) → 0 in P_{θ0} probability

Corollary 182 If Θ is finite, ν is counting measure, and g(θ) > 0 ∀ θ, the posterior π(· | X^n) is consistent in the sense that for any A ⊆ Θ,

π(A | X^n) → I[θ0 ∈ A] in P_{θ0} probability

A (crude) result giving a hint about shape for a loglikelihood is next.

Lemma 183 Under the hypotheses of Theorem 173, suppose that θ̂_n is an MLE of θ consistent at θ0. Then

Q_n(γ) = L_n(θ̂_n + γ/√n) − L_n(θ̂_n) → −(1/2) γ² I1(θ0) in P_{θ0} probability

Lemma 183 and the phenomenon it represents have implications for the nature of a posterior distribution. Schervish has a very careful and very hard development. What we can get easily is the following.

Corollary 184 Under the hypotheses of Theorem 173, suppose that a prior G has density g with respect to Lebesgue measure on Θ, g(θ0) > 0, and g is continuous at θ0. If θ̂_n is an MLE of θ consistent at θ0, then the posterior density g(θ | X^n) has the property that

ln [ g(θ̂_n | X^n) / g(θ̂_n + γ/√n | X^n) ] → (1/2) γ² I1(θ0) in P_{θ0} probability

Corollary 184 suggests that under θ0 one might, roughly speaking, expect posterior densities to look approximately normal with mean θ̂_n and variance 1/(n I1(θ0)). So a practical approximate form for g(θ | X^n) might be

N( θ̂_n, 1/(n I1(θ̂_n)) ) or N( θ̂_n, 1/(−L''_n(θ̂_n)) )

Schervish proves that under appropriate conditions the posterior density of

√(−L''_n(θ̂_n)) (θ − θ̂_n)

(a random function of θ) converges in an appropriate sense to the standard normal density. It is another (harder) matter to prove that approximate posterior probabilities for θ are in some appropriate sense close to real posterior probabilities.

9 Optimality in Finite Sample Point Estimation

9.1 Unbiasedness

Definition 185 The bias of an estimator δ for γ(θ) ∈ R is

E_θ(δ(X) − γ(θ)) = ∫_X (δ(x) − γ(θ)) dP_θ(x)

Definition 186 An estimator δ is unbiased for γ(θ) ∈ R provided

E_θ(δ(X) − γ(θ)) = 0 ∀ θ ∈ Θ

The following suggests that often, no unbiased estimator can be much good.

Theorem 187 If L(θ, a) = w(θ)(γ(θ) − a)² for w(θ) > 0 ∀ θ and δ is both Bayes with respect to G with finite risk function and unbiased for γ(θ), then (according to the joint distribution of (X, θ)) δ(X) = γ(θ) a.s.

Corollary 188 Suppose Θ ⊆ R is nontrivial (has at least 2 elements) and {P_θ} has elements that are mutually absolutely continuous. Then if L(θ, a) = (θ − a)², no Bayes estimator with finite variance can be unbiased.
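A numerical companion to Corollary 188 (the Binomial/Beta setup is an assumed, standard illustration): under squared error loss and a uniform prior on p, the Bayes estimator of p from S ~ Binomial(n, p) is the posterior mean (S + 1)/(n + 2), and its bias (np + 1)/(n + 2) − p is nonzero except at p = 1/2, so this Bayes rule with finite variance is indeed not unbiased.

```python
# Exact bias of the Bayes (posterior mean) estimator (S + 1)/(n + 2) of p.
from math import comb

n = 10

def bayes_est(s):
    return (s + 1) / (n + 2)   # posterior mean under the uniform (Beta(1,1)) prior

def bias(p):
    mean = sum(comb(n, s) * p**s * (1 - p)**(n - s) * bayes_est(s)
               for s in range(n + 1))
    return mean - p            # equals (1 - 2p) / (n + 2)

print(bias(0.2), bias(0.5), bias(0.8))
```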
Definition 189 δ is best unbiased for γ(θ) ∈ R if it is unbiased and is at least as good as any other unbiased estimator.

Theorem 190 Suppose that A ⊆ R is convex, L(θ, a) is convex in a ∀ θ, and δ1 and δ2 are two best unbiased estimators of γ(θ). Then for any θ for which L(θ, a) is strictly convex in a and R(θ, δ1) < ∞,

P_θ[ δ1(X) = δ2(X) ] = 1

The special case of SELE (squared error loss estimation) of γ(θ) is of most importance. For this case, E_θ δ²(X) < ∞ and δ unbiased for γ(θ) implies that R(θ, δ) = Var_θ δ(X).

Definition 191 δ is called a UMVUE (uniformly minimum variance unbiased estimator) of γ(θ) ∈ R if it is unbiased and

Var_θ δ(X) ≤ Var_θ δ'(X) ∀ θ

for δ' any other unbiased estimator of γ(θ).

The primary tool of UMVUE theory is the famous Lehmann-Scheffé Theorem.

Theorem 192 (Lehmann-Scheffé) Suppose that A ⊆ R is convex, T is sufficient for P = {P_θ}, and δ is an unbiased estimator of γ(θ). Let E[δ | T] = δ0 ∘ T (the latter representation guaranteed by Lehmann's Theorem). Then δ0 ∘ T is an unbiased estimator of γ(θ). If in addition, T is complete for P:

1. if δ' ∘ T is unbiased for γ(θ), then δ' ∘ T = δ0 ∘ T a.s. P_θ ∀ θ, and

2. L(θ, a) convex in a ∀ θ implies that

(a) δ0 ∘ T is best unbiased, and

(b) if δ' is any other best unbiased estimator of γ(θ), then δ' = δ0 ∘ T a.s. P_θ for any θ for which R(θ, δ0 ∘ T) < ∞ and L(θ, a) is strictly convex in a.

9.2 "Information" Inequalities

The topic of discussion here is inequalities about variances of estimators, some of which involve "information" measures. The basic tool is the Cauchy-Schwarz inequality.

Lemma 193 (Cauchy-Schwarz) If U and V are random variables with EU² < ∞ and EV² < ∞, then

(EUV)² ≤ EU² EV²

Corollary 194 If X and Y have finite second moments, (Cov(X, Y))² ≤ VarX VarY (so that |Corr(X, Y)| ≤ 1).

Corollary 195 Suppose that δ(Y) takes values in R and E δ²(Y) < ∞. If g is a measurable function such that Eg(Y) = 0 and 0 < Eg²(Y) < ∞, then defining C = E δ(Y) g(Y),

Var δ(Y) ≥ C² / Eg²(Y)

A more statistical-looking version of Corollary 195 is next.

Corollary 196 Suppose that δ(X) takes values in R and E_θ δ²(X) < ∞.
If g_θ is a measurable function such that E_θ g_θ(X) = 0 and 0 < E_θ g²_θ(X) < ∞, then defining C(θ) = E_θ δ(X) g_θ(X),

Var_θ δ(X) ≥ C²(θ) / E_θ g²_θ(X)

Corollary 196 is the source of several interesting inequalities as one chooses different interesting g_θ's. The most famous derives from the choice

g_θ(x) = (d/dθ) ln f_θ(x)

and is stated next.

Theorem 197 (Cramér-Rao) Suppose that the model P = {P_θ} (for Θ ⊆ R) is FI regular at θ0 and that 0 < I(θ0) < ∞. If E_θ δ(X) = ∫ δ(x) f_θ(x) dμ(x) can be differentiated under the integral sign at θ0, i.e. if

(d/dθ) E_θ δ(X) |_{θ = θ0} = ∫ δ(x) [ (d/dθ) f_θ(x) |_{θ = θ0} ] dμ(x)

then

Var_{θ0} δ(X) ≥ [ (d/dθ) E_θ δ(X) |_{θ = θ0} ]² / I(θ0)

Then, consider the choice

g_{θ0}(x) = ( f_θ(x) − f_{θ0}(x) ) / f_{θ0}(x)

Provided P_θ ≪ P_{θ0}, Corollary 196 implies that

Var_{θ0} δ(X) ≥ ( E_θ δ(X) − E_{θ0} δ(X) )² / E_{θ0} [ ( f_θ(X) − f_{θ0}(X) ) / f_{θ0}(X) ]²

This produces the so-called Chapman-Robbins inequality.

Theorem 198 (Chapman-Robbins)

Var_{θ0} δ(X) ≥ sup_{θ s.t. P_θ ≪ P_{θ0}} ( E_θ δ(X) − E_{θ0} δ(X) )² / E_{θ0} [ ( f_θ(X) − f_{θ0}(X) ) / f_{θ0}(X) ]²

10 Optimality in Finite Sample Testing

We consider the classical hypothesis testing problem. Write Θ = Θ0 ∪ Θ1 where Θ0 ∩ Θ1 = ∅, and we must decide between H0: θ ∈ Θ0 and Ha: θ ∈ Θ1 based on X ~ P_θ. This can be thought of in decision theoretic terms as a two-action decision problem, i.e. where A = {0, 1}. One possible loss for such an approach is (0-1 loss)

L(θ, a) = I[θ ∈ Θ0] I[a = 1] + I[θ ∈ Θ1] I[a = 0]

While a completely decision-theoretic approach is possible, the subject is older than formal decision theory and has its own set of emphases and peculiar terminology.

Definition 199 A non-randomized decision function δ: X → A = {0, 1} is called an hypothesis test.

Definition 200 The rejection region for a non-randomized test δ is {x | δ(x) = 1}.

Definition 201 A Type 1 error in a testing problem is taking action 1 when θ ∈ Θ0. A Type 2 error is taking action 0 when θ ∈ Θ1.

In a two-action decision problem a behavioral decision rule is for each x a distribution over A = {0, 1}, which is clearly characterized by φ_x({1}) ∈ [0, 1].
In hypothesis testing terminology one has

Definition 202 A randomized test is a function φ: X → [0, 1] with the interpretation that if φ(x) = p, one rejects H0 with probability p.

For 0-1 loss, for θ ∈ Θ0,

R(θ, φ) = E_θ I[a = 1] = E_θ φ(X)

while for θ ∈ Θ1,

R(θ, φ) = E_θ I[a = 0] = 1 − E_θ φ(X)

(a the action randomly chosen by φ), and it is thus clear that the function of θ, E_θ φ(X), will be of central importance to the theory of hypothesis testing.

Definition 203 The power function of a (possibly randomized) test φ is

β(θ) = E_θ φ(X) = ∫ φ(x) dP_θ(x)

Definition 204 The size or level of the test φ is sup_{θ ∈ Θ0} β(θ).

10.1 Simple versus Simple Testing

Consider the case of testing where Θ0 = {θ0} and Θ1 = {θ1}. We may for each test φ define the (0-1 loss) risk vector

(R(θ0, φ), R(θ1, φ)) = (β(θ0), 1 − β(θ1))

and consider the standard decision theoretic two-dimensional risk set S. Fairly obviously, a completely equivalent device is to instead consider the set V of points (β(θ0), β(θ1)).

Lemma 205 For a simple versus simple testing problem with

S = { (β(θ0), 1 − β(θ1)) | φ is a test } and V = { (β(θ0), β(θ1)) | φ is a test }

1. both S and V are convex,

2. both S and V are symmetric with respect to the point (1/2, 1/2),

3. (0, 1) ∈ S, (1, 0) ∈ S, and (0, 0) ∈ V, (1, 1) ∈ V, and

4. both S and V are closed.

Definition 206 For testing H0: θ = θ0 versus Ha: θ = θ1, a test φ is called most powerful of size α if it is of size α and for any other test φ' with β_{φ'}(θ0) ≤ α,

β_{φ'}(θ1) ≤ β_φ(θ1)

The fundamental theorem of all testing theory is the Neyman-Pearson Lemma. It has implications that reach far beyond the present case of simple versus simple testing, but is stated in the present context.

Theorem 207 (Neyman-Pearson) Suppose that P = {P_θ0, P_θ1} is dominated by a σ-finite measure μ and let f_i = dP_θi/dμ.

1. (Sufficiency) Any test of the form

φ(x) = 1 if f1(x) > k f0(x), γ(x) if f1(x) = k f0(x), and 0 if f1(x) < k f0(x)   (*)

for some k ∈ [0, ∞) and γ(x): X → [0, 1] is most powerful of its size.
Further (corresponding to the k = ∞ case), the test

φ(x) = 1 if f0(x) = 0, and 0 otherwise   (**)

is most powerful of its size α = 0.

2. (Existence) For every α ∈ [0, 1], ∃ a test of the form (*) with γ(x) = γ (a constant in [0, 1]) or a test of form (**) with β(θ0) = α.

3. (Uniqueness) If φ is a most powerful test of size α, then it is of form (*) or (**) except possibly for x ∈ N with P_θ0(N) = P_θ1(N) = 0.
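The existence construction in part 2 can be sketched concretely. The model (Binomial(5, 0.3) vs Binomial(5, 0.6)) and the level α = 0.05 below are assumptions for illustration. The likelihood ratio f1/f0 is increasing in x here, so the most powerful test of form (*) rejects for large x and randomizes on the boundary point to hit the size exactly.

```python
# Construct the most powerful size-alpha test of form (*) by greedily rejecting
# sample points in decreasing order of the likelihood ratio f1/f0, randomizing
# on the boundary point.
from math import comb

n, p0, p1, alpha = 5, 0.3, 0.6, 0.05   # assumed simple-vs-simple setup

def f(p, x):
    return comb(n, x) * p**x * (1 - p)**(n - x)

pts = sorted(range(n + 1), key=lambda x: -f(p1, x) / f(p0, x))

phi = {}                    # phi[x] = rejection probability at x
size_left = alpha
for x in pts:
    p = f(p0, x)
    if p <= size_left:
        phi[x] = 1.0        # reject outright
        size_left -= p
    else:
        phi[x] = size_left / p   # gamma: randomize on the boundary
        size_left = 0.0

size = sum(phi[x] * f(p0, x) for x in range(n + 1))
power = sum(phi[x] * f(p1, x) for x in range(n + 1))
print(size, power)
```

The resulting φ rejects outright at x = 4, 5, randomizes at x = 3, and accepts below, achieving size exactly α with the maximum attainable power.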