Statistics 550 Notes 6

Reading: Section 1.5

I. Sufficiency: Review and Factorization Theorem

Motivation: The motivation for looking for sufficient statistics is that it is useful to condense the data $X$ to a statistic $T(X)$ that contains all the information about $\theta$ in the sample.

Definition: A statistic $Y = T(X)$ is sufficient for $\theta$ if the conditional distribution of $X$ given $Y = y$ does not depend on $\theta$ for any value of $y$.

Example 1: Let $X_1, \ldots, X_n$ be a sequence of independent Bernoulli random variables with $P(X_i = 1) = \theta$. Then $Y = \sum_{i=1}^n X_i$ is sufficient for $\theta$.

Example 2: Let $X_1, \ldots, X_n$ be iid Uniform$(0, \theta)$. Consider the statistic $Y = \max_{1 \le i \le n} X_i$. We showed in Notes 4 that
$$ f_Y(y) = \begin{cases} n y^{n-1} / \theta^n & 0 \le y \le \theta \\ 0 & \text{elsewhere.} \end{cases} $$
For $y = \max_{1 \le i \le n} x_i \le \theta$, we have
$$ P(X_1 = x_1, \ldots, X_n = x_n \mid Y = y) = \frac{P(X_1 = x_1, \ldots, X_n = x_n, Y = y)}{P(Y = y)} = \frac{(1/\theta^n)\, I(y \le \theta)}{(n y^{n-1}/\theta^n)\, I(y \le \theta)} = \frac{1}{n y^{n-1}}, $$
which does not depend on $\theta$. For $y \ne \max_{1 \le i \le n} x_i$, $P(X_1 = x_1, \ldots, X_n = x_n \mid Y = y) = 0$.

NOTE: NEED TO THINK MORE ABOUT THIS EXAMPLE, as $P(X_1 = x_1, \ldots, X_n = x_n \mid Y = y)$ does seem to depend on $\theta$.

It is often hard to verify or disprove sufficiency of a statistic directly because we need to find the distribution of the statistic. The following theorem is often helpful.

Factorization Theorem: A statistic $T(X)$ is sufficient for $\theta$ if and only if there exist functions $g(t, \theta)$ and $h(x)$ such that
$$ p(x \mid \theta) = g(T(x), \theta)\, h(x) $$
for all $x$ and all $\theta$ (where $p(x \mid \theta)$ denotes the probability mass function for discrete data given the parameter and the probability density function for continuous data).

Proof: We prove the theorem for discrete data; the proof for continuous distributions is similar. First, suppose that the probability mass function factors as given in the theorem. Consider $P(X = x \mid T = t)$. If $t \ne T(x)$, then $P(X = x \mid T = t) = 0$ for all $\theta$. Suppose $t = T(x)$. We have $P_\theta(T = t) = \sum_{x': T(x') = t} p(x' \mid \theta)$,
so that
$$ P(X = x \mid T = t) = \frac{P(X = x, T = t)}{P(T = t)} = \frac{P(X = x)}{P(T = t)} = \frac{g(T(x), \theta)\, h(x)}{\sum_{x': T(x') = t} g(T(x'), \theta)\, h(x')} = \frac{h(x)}{\sum_{x': T(x') = t} h(x')}, $$
where the last equality holds because $T(x') = t = T(x)$ for every $x'$ in the sum, so the factor $g(t, \theta)$ cancels. Thus, $P(X = x \mid T = t)$ does not depend on $\theta$, and $T(X)$ is sufficient for $\theta$ by the definition of sufficiency.

Conversely, suppose $T(X)$ is sufficient for $\theta$. Then the conditional distribution of $X \mid T(X)$ does not depend on $\theta$. Let $P(X = x \mid T = t) = k(x, t)$. Then
$$ p(x \mid \theta) = k(x, T(x))\, P_\theta(T = T(x)). $$
Thus, we can take $h(x) = k(x, T(x))$ and $g(t, \theta) = P_\theta(T = t)$.

Example 1 continued: Let $X_1, \ldots, X_n$ be a sequence of independent Bernoulli random variables with $P(X_i = 1) = \theta$. To show that $Y = \sum_{i=1}^n X_i$ is sufficient for $\theta$, we factor the probability mass function as follows:
$$ P(X_1 = x_1, \ldots, X_n = x_n \mid \theta) = \prod_{i=1}^n \theta^{x_i} (1 - \theta)^{1 - x_i} = \theta^{\sum_{i=1}^n x_i} (1 - \theta)^{n - \sum_{i=1}^n x_i}. $$
The pmf is of the form $g\left(\sum_{i=1}^n x_i, \theta\right) h(x_1, \ldots, x_n)$ where $h(x_1, \ldots, x_n) \equiv 1$.

Example 2 continued: Let $X_1, \ldots, X_n$ be iid Uniform$(0, \theta)$. To show that $Y = \max_{1 \le i \le n} X_i$ is sufficient, we factor the pdf as follows:
$$ f(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^n \frac{1}{\theta}\, I(0 \le x_i \le \theta) = \frac{1}{\theta^n}\, I\left(\max_{1 \le i \le n} x_i \le \theta\right) I\left(\min_{1 \le i \le n} x_i \ge 0\right). $$
The pdf is of the form $g\left(\max_{1 \le i \le n} x_i, \theta\right) h(x_1, \ldots, x_n)$ where
$$ g(t, \theta) = \frac{1}{\theta^n}\, I(t \le \theta), \qquad h(x_1, \ldots, x_n) = I\left(\min_{1 \le i \le n} x_i \ge 0\right). $$

Example 3: Let $X_1, \ldots, X_n$ be iid Normal$(\mu, \sigma^2)$. The pdf factors as
$$ f(x_1, \ldots, x_n ; \mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} (x_i - \mu)^2 \right) = \frac{1}{(2\pi)^{n/2} \sigma^n} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \right) $$
$$ = \frac{1}{(2\pi)^{n/2} \sigma^n} \exp\left( -\frac{1}{2\sigma^2} \left( \sum_{i=1}^n x_i^2 - 2\mu \sum_{i=1}^n x_i + n\mu^2 \right) \right). $$
The pdf is thus of the form $g\left(\sum_{i=1}^n x_i, \sum_{i=1}^n x_i^2, \mu, \sigma^2\right) h(x_1, \ldots, x_n)$ where $h(x_1, \ldots, x_n) \equiv 1$. Thus, $\left(\sum_{i=1}^n X_i, \sum_{i=1}^n X_i^2\right)$ is a two-dimensional sufficient statistic for $(\mu, \sigma^2)$, i.e., the distribution of $X_1, \ldots, X_n$ is independent of $(\mu, \sigma^2)$ given $\left(\sum_{i=1}^n X_i, \sum_{i=1}^n X_i^2\right)$.

A theorem for proving that a statistic is not sufficient:

Theorem 1: Let $T(X)$ be a statistic. If there exist some $\theta_1, \theta_2$ and $x, y$ such that
(i) $T(x) = T(y)$;
(ii) $f(x \mid \theta_1)\, f(y \mid \theta_2) \ne f(x \mid \theta_2)\, f(y \mid \theta_1)$,
then $T(X)$ is not a sufficient statistic.
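Before the proof of Theorem 1, the claim of Example 3 can be sanity-checked numerically. The sketch below (not part of the notes; the sample values are arbitrary) builds two different samples sharing $\sum x_i$ and $\sum x_i^2$ and confirms that their log-likelihood ratio is the same for every $(\mu, \sigma^2)$, as the factorization with $h \equiv 1$ requires.

```python
import math

# Sketch: if (sum x_i, sum x_i^2) is sufficient for (mu, sigma^2), two
# samples sharing those statistics must have a likelihood ratio that is
# free of the parameters.
def normal_loglik(xs, mu, sigma2):
    n = len(xs)
    return (-n / 2) * math.log(2 * math.pi * sigma2) \
           - sum((x - mu) ** 2 for x in xs) / (2 * sigma2)

x = [0.0, 3.0, 3.0]   # sum = 6, sum of squares = 18
y = [1.0, 1.0, 4.0]   # same sum = 6 and same sum of squares = 18
ratios = [normal_loglik(x, mu, s2) - normal_loglik(y, mu, s2)
          for mu in (-1.0, 0.0, 2.5) for s2 in (0.5, 1.0, 4.0)]
# log-likelihood difference is the same constant at every (mu, sigma^2)
assert all(abs(r - ratios[0]) < 1e-12 for r in ratios)
print("log-likelihood ratio at every (mu, sigma^2):", ratios[0])
```

Since $h \equiv 1$ in Example 3, the constant here is in fact $0$ (ratio $1$).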
Proof: First, suppose one side of (ii) equals 0 and the other side of (ii) does not equal 0; say $f(x \mid \theta_1)\, f(y \mid \theta_2) > 0$ and $f(x \mid \theta_2)\, f(y \mid \theta_1) = 0$. If $T(X)$ were sufficient, then writing $f(x \mid \theta) = k(x, t)\, P_\theta(T = t)$ with $t = T(x) = T(y)$, the facts $f(x \mid \theta_1) > 0$ and $f(y \mid \theta_2) > 0$ would force $k(x, t) > 0$, $k(y, t) > 0$, $P_{\theta_1}(T = t) > 0$ and $P_{\theta_2}(T = t) > 0$, so that both $x$ and $y$ would be in the support of both $f(\cdot \mid \theta_1)$ and $f(\cdot \mid \theta_2)$, contradicting $f(x \mid \theta_2)\, f(y \mid \theta_1) = 0$. Hence $T(X)$ is not sufficient.

Second, suppose both sides of (ii) are greater than zero, so that $f(x \mid \theta_1), f(y \mid \theta_2), f(x \mid \theta_2), f(y \mid \theta_1) > 0$. If $T(X)$ were sufficient, then since the distribution of $X$ given $T(X)$ is independent of $\theta$, we must have
$$ \frac{f(x \mid T(x), \theta_1)}{f(x \mid T(x), \theta_2)} = \frac{f(y \mid T(y), \theta_1)}{f(y \mid T(y), \theta_2)}. \quad (0.1) $$
The left hand side of (0.1) equals
$$ \frac{f(x \mid T(x), \theta_1)}{f(x \mid T(x), \theta_2)} = \frac{f(x \mid \theta_1)\, f(T(x) \mid \theta_2)}{f(x \mid \theta_2)\, f(T(x) \mid \theta_1)} $$
and the right hand side of (0.1) equals
$$ \frac{f(y \mid T(y), \theta_1)}{f(y \mid T(y), \theta_2)} = \frac{f(y \mid \theta_1)\, f(T(y) \mid \theta_2)}{f(y \mid \theta_2)\, f(T(y) \mid \theta_1)} = \frac{f(y \mid \theta_1)\, f(T(x) \mid \theta_2)}{f(y \mid \theta_2)\, f(T(x) \mid \theta_1)}, $$
using $T(x) = T(y)$. Thus, from (0.1), we conclude that if $T(X)$ were sufficient, we would have
$$ \frac{f(x \mid \theta_1)}{f(x \mid \theta_2)} = \frac{f(y \mid \theta_1)}{f(y \mid \theta_2)}, \quad \text{i.e.,} \quad f(x \mid \theta_1)\, f(y \mid \theta_2) = f(x \mid \theta_2)\, f(y \mid \theta_1). $$
Thus, (i) and (ii) show that $T(X)$ is not a sufficient statistic.

Example 4: Consider a series of three independent Bernoulli trials $X_1, X_2, X_3$ with probability of success $p$. Let $T = X_1 + 2X_2 + 3X_3$. Show that $T$ is not sufficient.

Let $x = (X_1 = 0, X_2 = 0, X_3 = 1)$ and $y = (X_1 = 1, X_2 = 1, X_3 = 0)$. We have $T(x) = T(y) = 3$. But
$$ f(x \mid p = 1/3)\, f(y \mid p = 2/3) = \left( (2/3)^2 \cdot (1/3) \right) \cdot \left( (2/3)^2 \cdot (1/3) \right) = 16/729, $$
$$ f(x \mid p = 2/3)\, f(y \mid p = 1/3) = \left( (1/3)^2 \cdot (2/3) \right) \cdot \left( (1/3)^2 \cdot (2/3) \right) = 4/729. $$
Thus, by Theorem 1, $T$ is not sufficient.

II. Implications of Sufficiency

We have said that reducing the data to a sufficient statistic does not sacrifice any information about $\theta$.
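Returning briefly to Example 4 above, its arithmetic can be verified with exact rational arithmetic. A sketch (not part of the notes):

```python
from fractions import Fraction

# Sketch checking Example 4: T = X1 + 2*X2 + 3*X3 takes the same value
# on x = (0,0,1) and y = (1,1,0), yet the cross-products of likelihoods
# at p = 1/3 and p = 2/3 differ, so Theorem 1 applies.
def bernoulli_pmf(xs, p):
    prob = Fraction(1)
    for x in xs:
        prob *= p if x == 1 else 1 - p
    return prob

x, y = (0, 0, 1), (1, 1, 0)
assert sum(c * v for c, v in zip((1, 2, 3), x)) == 3   # T(x) = 3
assert sum(c * v for c, v in zip((1, 2, 3), y)) == 3   # T(y) = 3
p1, p2 = Fraction(1, 3), Fraction(2, 3)
lhs = bernoulli_pmf(x, p1) * bernoulli_pmf(y, p2)
rhs = bernoulli_pmf(x, p2) * bernoulli_pmf(y, p1)
assert lhs == Fraction(16, 729) and rhs == Fraction(4, 729)
assert lhs != rhs   # condition (ii) of Theorem 1 holds, so T is not sufficient
```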
We now justify the claim that a sufficient statistic preserves all the information about $\theta$ in two ways:
(1) We show that for any decision procedure, we can find a randomized decision procedure that is based only on the sufficient statistic and that has the same risk function.
(2) We show that any point estimator that is not a function of the sufficient statistic can be improved upon for a convex loss function.

(1) Let $\delta(X)$ be a decision procedure and $T(X)$ be a sufficient statistic. Consider the following randomized decision procedure, call it $\delta'(T(X))$: based on $T(X)$, randomly draw $X'$ from the conditional distribution of $X \mid T(X)$ (which does not depend on $\theta$ and is hence known) and take action $\delta(X')$. Then $X'$ has the same distribution as $X$, so that $\delta'(T(X)) = \delta(X')$ has the same distribution, and hence the same risk function, as $\delta(X)$.

Example 2: $X \sim N(0, \sigma^2)$. $T(X) = |X|$ is sufficient because $X \mid T(X) = t$ is equally likely to be $\pm t$ for all $\sigma^2$. Given $T = t$, construct $X'$ to be $\pm t$ with probability 0.5 each. Then $X' \sim N(0, \sigma^2)$.

(2) The Rao-Blackwell Theorem.

Convex functions: A real-valued function $\phi$ defined on an open interval $I = (a, b)$ is convex if for any $x, y \in I$ and $0 \le \alpha \le 1$,
$$ \phi[\alpha x + (1 - \alpha) y] \le \alpha \phi(x) + (1 - \alpha) \phi(y). $$
$\phi$ is strictly convex if the inequality is strict whenever $x \ne y$ and $0 < \alpha < 1$. If $\phi''$ exists, then $\phi$ is convex if and only if $\phi'' \ge 0$ on $I = (a, b)$. A convex function lies above all its tangent lines.

Convexity of loss functions for point estimation:

Squared error loss is strictly convex.

Absolute error loss is convex but not strictly convex.

Huber's loss function
$$ l(\theta, a) = \begin{cases} (q(\theta) - a)^2 & \text{if } |q(\theta) - a| \le k \\ 2k\,|q(\theta) - a| - k^2 & \text{if } |q(\theta) - a| > k \end{cases} $$
for some constant $k$ is convex but not strictly convex.

The zero-one loss function
$$ l(\theta, a) = \begin{cases} 0 & \text{if } |q(\theta) - a| \le k \\ 1 & \text{if } |q(\theta) - a| > k \end{cases} $$
is nonconvex.

Jensen's Inequality (Appendix B.9): Let $X$ be a random variable.
(i) If $\phi$ is convex in an open interval $I$, $P(X \in I) = 1$ and $E[X]$ is finite, then $\phi(E[X]) \le E[\phi(X)]$.
(ii) If $\phi$ is strictly convex, then $\phi(E[X]) < E[\phi(X)]$ unless $X$ equals a constant with probability one.
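As an aside before the proof of Jensen's inequality: the randomized reconstruction in the $N(0, \sigma^2)$ example above can be checked by simulation. The sketch below (not part of the notes; seed and sample size are arbitrary) keeps only $T = |X|$, flips a fair sign, and confirms that the reconstructed draws match the first two moments of $N(0, \sigma^2)$.

```python
import random
import statistics

# Sketch: given only T = |X|, drawing X' = +T or -T with probability 1/2
# each reproduces the N(0, sigma^2) distribution, so a procedure based
# only on T can match the risk of any procedure based on X.
random.seed(0)
sigma = 2.0
xs = [random.gauss(0.0, sigma) for _ in range(200_000)]
x_prime = [abs(x) * random.choice([-1.0, 1.0]) for x in xs]  # uses only T = |X|

# X' should agree with N(0, sigma^2) in mean and variance (up to noise)
assert abs(statistics.mean(x_prime)) < 0.05
assert abs(statistics.pvariance(x_prime) - sigma ** 2) < 0.1
```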
Proof of (i): Let $L(x)$ be a tangent line to $\phi(x)$ at the point $(E[X], \phi(E[X]))$. Write $L(x) = a + bx$. By the convexity of $\phi$, $\phi(x) \ge a + bx$. Since expectations preserve inequalities,
$$ E[\phi(X)] \ge E[a + bX] = a + b\,E[X] = L(E[X]) = \phi(E[X]), $$
as was to be shown.

Rao-Blackwell Theorem: Let $T(X)$ be a sufficient statistic. Let $\delta(X)$ be a point estimate of $q(\theta)$ and assume that the loss function $l(\theta, d)$ is strictly convex in $d$. Also assume that $R(\theta, \delta) < \infty$. Let
$$ \delta^*(t) = E[\delta(X) \mid T(X) = t]. $$
Then $R(\theta, \delta^*) < R(\theta, \delta)$ unless $\delta(X) = \delta^*(T(X))$ with probability one.

Proof: Fix $\theta$. Apply Jensen's inequality with $\phi(d) = l(\theta, d)$, letting $X$ have the conditional distribution of $X \mid T(X) = t$ for a particular choice of $t$. By Jensen's inequality,
$$ l(\theta, \delta^*(t)) < E[\,l(\theta, \delta(X)) \mid T(X) = t\,] \quad (0.2) $$
unless $\delta(X)$ is constant given $T(X) = t$ with probability one. Taking the expectation over $T(X)$ on both sides of this inequality yields $R(\theta, \delta^*) < R(\theta, \delta)$ unless $\delta(X) = \delta^*(T(X))$ with probability one.

Comments:
(1) Sufficiency ensures that $\delta^*(t) = E[\delta(X) \mid T(X) = t]$ is an estimator (i.e., it depends only on $t$ and not on $\theta$).
(2) If the loss is convex rather than strictly convex, we get $\le$ in (0.2).
(3) The theorem is not true without convexity of the loss function.

Consequence of the Rao-Blackwell theorem: For convex loss functions, we can dispense with randomized estimators. A randomized estimator randomly chooses the estimate according to a distribution $Y(x)$ that is known given $X = x$. A randomized estimator can be obtained as a nonrandomized estimator $\delta^*(X, U)$, where $X$ and $U$ are independent and $U$ is uniformly distributed on $(0, 1)$: we observe $X = x$ and then use $U$ to generate a draw from the distribution of $Y(x)$. For the data $(X, U)$, $X$ is sufficient. Thus, by the Rao-Blackwell theorem, the nonrandomized estimator $E[\delta^*(X, U) \mid X]$ dominates $\delta^*(X, U)$ for strictly convex loss functions.

III. Minimal Sufficiency

For any model, there are many sufficient statistics.
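Before continuing, here is a simulation sketch (not part of the notes; squared error loss, seed and sample sizes arbitrary) of the Rao-Blackwell improvement from the previous section. In the iid Bernoulli model, start from the crude estimator $\delta(X) = X_1$; conditioning on the sufficient statistic $T = \sum_i X_i$ gives $E[X_1 \mid T = t] = t/n$, the sample mean, whose risk is $n$ times smaller in theory.

```python
import random

# Sketch: compare the MSE of delta(X) = X_1 with its Rao-Blackwellization
# E[delta(X) | T] = sample mean, for iid Bernoulli(theta) data.
random.seed(1)
n, theta, reps = 10, 0.3, 100_000
se_crude, se_rb = 0.0, 0.0
for _ in range(reps):
    x = [1 if random.random() < theta else 0 for _ in range(n)]
    se_crude += (x[0] - theta) ** 2        # crude estimator X_1
    se_rb += (sum(x) / n - theta) ** 2     # E[X_1 | T] = sample mean
mse_crude, mse_rb = se_crude / reps, se_rb / reps

# theory: MSE(X_1) = theta(1-theta) = 0.21, MSE(mean) = 0.021
assert mse_rb < mse_crude
assert abs(mse_crude - 0.21) < 0.01 and abs(mse_rb - 0.021) < 0.005
```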
Example 1: For $X_1, \ldots, X_n$ iid Bernoulli$(\theta)$,
$$ T(X) = \sum_{i=1}^n X_i \quad \text{and} \quad T'(X) = (X_1, \ldots, X_n) $$
are both sufficient, but $T$ provides a greater reduction of the data.

Definition: A statistic $T(X)$ is minimally sufficient if it is sufficient and it provides a reduction of the data that is at least as great as that of any other sufficient statistic $S(X)$, in the sense that we can find a transformation $r$ such that $T(X) = r(S(X))$.

Comments:
(1) To say that we can find a transformation $r$ such that $T(X) = r(S(X))$ means that if $S(x) = S(y)$, then $T(x)$ must equal $T(y)$.
(2) Data reduction in terms of a particular statistic can be thought of as a partition of the sample space. A statistic $T(X)$ partitions the sample space into sets $A_t = \{x : T(x) = t\}$. If a statistic $T(X)$ is minimally sufficient, then for any other sufficient statistic $S(X)$, which partitions the sample space into sets $B_s = \{x : S(x) = s\}$, every set $B_s$ must be a subset of some $A_t$. Thus, the partition associated with a minimal sufficient statistic is the coarsest possible partition for a sufficient statistic, and in this sense the minimal sufficient statistic achieves the greatest possible data reduction for a sufficient statistic.

A useful theorem for finding a minimal sufficient statistic is the following:

Theorem 2 (Lehmann and Scheffe, 1950): Suppose $S(X)$ is a sufficient statistic for $\theta$. Also suppose that for every two sample points $x$ and $y$, if the ratio $f(x \mid \theta) / f(y \mid \theta)$ is constant as a function of $\theta$, then $S(x) = S(y)$. Then $S(X)$ is a minimal sufficient statistic for $\theta$.

Proof: Let $T(X)$ be any statistic that is sufficient for $\theta$. By the factorization theorem, there exist functions $g$ and $h$ such that $f(x \mid \theta) = g(T(x), \theta)\, h(x)$. Let $x$ and $y$ be any two sample points with $T(x) = T(y)$. Then
$$ \frac{f(x \mid \theta)}{f(y \mid \theta)} = \frac{g(T(x), \theta)\, h(x)}{g(T(y), \theta)\, h(y)} = \frac{h(x)}{h(y)}. $$
Since this ratio does not depend on $\theta$, the assumption of the theorem implies that $S(x) = S(y)$.
Thus, $T(x) = T(y)$ implies $S(x) = S(y)$, so $S(X)$ partitions the sample space at least as coarsely as $T(X)$; since $T(X)$ was an arbitrary sufficient statistic, $S(X)$ is minimally sufficient.

Example 1 continued: Consider the ratio
$$ \frac{f(x \mid \theta)}{f(y \mid \theta)} = \frac{\theta^{\sum_{i=1}^n x_i} (1 - \theta)^{n - \sum_{i=1}^n x_i}}{\theta^{\sum_{i=1}^n y_i} (1 - \theta)^{n - \sum_{i=1}^n y_i}} = \left( \frac{\theta}{1 - \theta} \right)^{\sum_{i=1}^n x_i - \sum_{i=1}^n y_i}. $$
This ratio is constant as a function of $\theta$ if and only if $\sum_{i=1}^n x_i = \sum_{i=1}^n y_i$. Since we have shown that $T(X) = \sum_{i=1}^n X_i$ is a sufficient statistic, it follows from Theorem 2 that $T(X) = \sum_{i=1}^n X_i$ is a minimal sufficient statistic.

Note that a minimal sufficient statistic is not unique. Any one-to-one function of a minimal sufficient statistic is also a minimal sufficient statistic. For example, $T'(X) = \frac{1}{n} \sum_{i=1}^n X_i$ is a minimal sufficient statistic for the iid Bernoulli case.

Example 2: Suppose $X_1, \ldots, X_n$ are iid uniform on the interval $(\theta, \theta + 1)$, $-\infty < \theta < \infty$. Then the joint pdf of $X$ is
$$ f(x \mid \theta) = \begin{cases} 1 & \theta < x_i < \theta + 1, \ i = 1, \ldots, n \\ 0 & \text{otherwise} \end{cases} = \begin{cases} 1 & \max_i x_i - 1 < \theta < \min_i x_i \\ 0 & \text{otherwise.} \end{cases} $$
The statistic $T(X) = (\min_i X_i, \max_i X_i)$ is a sufficient statistic by the factorization theorem with
$$ g(T(x), \theta) = I\left( \max_i x_i - 1 < \theta < \min_i x_i \right), \qquad h(x) = 1. $$
For any two sample points $x$ and $y$, the numerator and denominator of the ratio $f(x \mid \theta) / f(y \mid \theta)$ will be positive for the same values of $\theta$ if and only if $\min_i x_i = \min_i y_i$ and $\max_i x_i = \max_i y_i$; if the minima and maxima are equal, then the ratio is constant and in fact equals 1. Thus, $T(X) = (\min_i X_i, \max_i X_i)$ is a minimal sufficient statistic by Theorem 2.

Example 2 is a case in which the dimension of the minimal sufficient statistic (2) does not match the dimension of the parameter (1). There are models in which the dimension of the minimal sufficient statistic is equal to the sample size, e.g., $X_1, \ldots, X_n$ iid Cauchy$(\theta)$,
$$ f(x \mid \theta) = \frac{1}{\pi \left[ 1 + (x - \theta)^2 \right]} $$
(Problem 1.5.15).

IV. Ancillary Statistics

A statistic $T(X)$ is ancillary if its distribution does not depend on $\theta$.

Example 4: Suppose our model is $X_1, \ldots, X_n$ iid $N(\mu, 1)$. Then $\bar{X}$ is a sufficient statistic and $(X_1 - \bar{X}, \ldots, X_n - \bar{X})$ is an ancillary statistic.
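The claim in Example 4 can be checked by simulation. The sketch below (not part of the notes; the summary statistic, seed, and sample sizes are arbitrary choices) compares the mean absolute residual of $(X_1 - \bar{X}, \ldots, X_n - \bar{X})$ under two very different values of $\mu$; since the residual vector is ancillary, the two answers should agree up to simulation error.

```python
import random

# Sketch: for X_1,...,X_n iid N(mu, 1), the residual vector (X_i - Xbar)
# is ancillary, so any summary of it has the same distribution for every mu.
random.seed(2)
n, reps = 5, 50_000

def mean_abs_residual(mu):
    total = 0.0
    for _ in range(reps):
        xs = [random.gauss(mu, 1.0) for _ in range(n)]
        xbar = sum(xs) / n
        total += sum(abs(x - xbar) for x in xs) / n
    return total / reps

a, b = mean_abs_residual(0.0), mean_abs_residual(100.0)
assert abs(a - b) < 0.02   # distribution of the residuals is free of mu
```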
Although ancillary statistics contain no information about $\theta$ when the model is true, ancillary statistics are useful for checking the validity of the model.