Statistics 550 Notes 4

Reading: Sections 1.4, 1.5

Note: For Thursday, Sept. 18th, I will hold my office hours from 1-2 instead of my usual time 4:45-5:45.

I. Prediction (Chapter 1.4)

A common decision problem is that we want to predict a variable Y based on a covariate vector Z. Examples:
(1) Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack, based on demographic, diet and clinical measurements for that patient;
(2) Predict the price of a stock 6 months from now, on the basis of company performance measures and economic data;
(3) Predict the numbers in a handwritten ZIP code, from a digitized image.

We typically have a "training" sample (Y_1, Z_1), ..., (Y_n, Z_n) available from the joint distribution of (Y, Z), and we want to predict Y_new for a new observation from the distribution of (Y, Z) for which we know Z = Z_new. In Section 1.4, we consider how to make predictions when we know the joint distribution of (Y, Z); in practice, we often have only an estimate of the joint distribution based on the training sample.

Let g(Z) be a rule for predicting Y based on Z. A criterion that is often used for judging different prediction rules is the mean squared prediction error:

Δ²(Y, g(Z)) = E[{g(Z) − Y}² | Z].

This is the average squared prediction error when g(Z) is used to predict Y for a particular Z. We want Δ²(Y, g(Z)) to be as small as possible.

Theorem 1.4.1: Let μ(Z) = E(Y | Z). Then μ(Z) is the best prediction rule under mean squared prediction error.

Proof: For any prediction rule g(Z),

Δ²(Y, g(Z)) = E[{g(Z) − Y}² | Z]
= E[{(g(Z) − E(Y | Z)) + (E(Y | Z) − Y)}² | Z]
= E[(g(Z) − E(Y | Z))² | Z] + 2 E[(g(Z) − E(Y | Z))(E(Y | Z) − Y) | Z] + E[(E(Y | Z) − Y)² | Z]
= E[(g(Z) − E(Y | Z))² | Z] + E[(E(Y | Z) − Y)² | Z]
≥ E[(E(Y | Z) − Y)² | Z]
= Δ²(Y, E(Y | Z)).

The cross term vanishes because, given Z, g(Z) − E(Y | Z) is a constant and E[E(Y | Z) − Y | Z] = 0. ∎
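To illustrate Theorem 1.4.1, a small simulation (a sketch of my own, not part of the notes) can compare the mean squared prediction error of the conditional mean with that of a competing rule. Assume, purely for illustration, the model Y = Z² + ε with ε ~ N(0, 1) and Z uniform on (−2, 2), so that E(Y | Z) = Z²; the rule g(Z) = Z is the competitor.

```python
import random

random.seed(0)

n = 100_000
sq_err_cond_mean = 0.0  # prediction rule mu(Z) = E[Y|Z] = Z^2
sq_err_linear = 0.0     # competing rule g(Z) = Z

for _ in range(n):
    z = random.uniform(-2, 2)
    y = z ** 2 + random.gauss(0, 1)   # Y = Z^2 + noise, so E[Y|Z] = Z^2
    sq_err_cond_mean += (z ** 2 - y) ** 2
    sq_err_linear += (z - y) ** 2

mspe_cond_mean = sq_err_cond_mean / n   # should be near Var(noise) = 1
mspe_linear = sq_err_linear / n         # strictly larger, per Theorem 1.4.1
print(mspe_cond_mean, mspe_linear)
```

The conditional mean attains the noise variance, the irreducible part of the mean squared prediction error; any other rule pays an extra E[(g(Z) − E(Y | Z))²] on top of it.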
II. Sufficient Statistics

Our usual setting: We observe data X from a distribution P, where we do not know the true P but only know that P ∈ 𝒫 = {P_θ, θ ∈ Θ} (the statistical model).

The observed sample of data X may be very complicated (e.g., in the handwritten zip code example from Notes 1, the data are 500 matrices of pixel intensities). An experimenter may wish to summarize the information in a sample by determining a few key features of the sample values, e.g., the sample mean, the sample variance or the largest observation. These are all examples of statistics.

Recall: A statistic Y = T(X) is a random variable or random vector that is a function of the data.

A statistic is sufficient if it carries all the information in the data about the parameter vector θ. T(X) can be a scalar or a vector. If T(X) is of lower "dimension" than X, then we have a good summary of the data that does not discard any important information.

For example, consider a sequence of independent Bernoulli trials with unknown probability of success θ. We may have the intuitive feeling that the total number of successes contains all the information about θ that there is in the sample; the order in which the successes occurred, for example, does not give any additional information. The following definition formalizes this idea:

Definition: A statistic Y = T(X) is sufficient for θ if the conditional distribution of X given Y = y does not depend on θ for any value of y.¹

Implication: If a statistic Y = T(X) is sufficient, then once we know the value of the statistic, knowing the full data X does not provide any additional information about θ.

Example 1: Let X_1, ..., X_n be a sequence of independent Bernoulli random variables with P(X_i = 1) = θ. We will verify that Y = Σ_{i=1}^n X_i is sufficient for θ. Consider

P_θ(X_1 = x_1, ..., X_n = x_n | Σ_{i=1}^n X_i = y).

¹ This definition is not quite precise. Difficulties arise when P_θ(Y = y) = 0, so that the conditioning event has probability zero.
The definition of conditional probability can then be changed at one or more values of y (in fact, at any set of y values which has probability zero) without affecting the distribution of X, which is the result of combining the distribution of Y with the conditional distribution of X given Y. In general, there can be more than one version of the conditional probability distribution P(X | Y) which, together with the distribution of Y, leads back to the distribution of X. We define a statistic as sufficient if there exists at least one version of the conditional probability distributions P(X | Y) which is the same for all θ. See Lehmann and Casella, Theory of Point Estimation, 2nd Edition, Chapter 1.6, pp. 34-35, for further discussion. For our purposes, we define Y to be a sufficient statistic if (i) for discrete distributions of the data, for each y that has positive probability for at least one θ, the conditional probability P_θ(X | Y = y) does not depend on θ for all θ for which P_θ(Y = y) > 0; and (ii) for continuous distributions of the data, for each y that has positive density for at least one θ, the conditional density f_θ(X | Y = y) does not depend on θ for all θ for which f_θ(y) > 0.

Returning to Example 1, we have

P_θ(X_1 = x_1, ..., X_n = x_n | Σ_{i=1}^n X_i = y)
= P_θ(X_1 = x_1, ..., X_n = x_n, Y = y) / P_θ(Y = y)
= θ^y (1−θ)^(n−y) / [C(n, y) θ^y (1−θ)^(n−y)]
= 1 / C(n, y)

for any (x_1, ..., x_n) with x_1 + ... + x_n = y. The conditional distribution thus does not involve θ at all, and thus Y = Σ_{i=1}^n X_i is sufficient for θ.

Example 2: Let X_1, ..., X_n be iid Uniform(0, θ). Consider the statistic Y = max_{1≤i≤n} X_i. We showed in Notes 4 that

f_Y(y) = n y^(n−1) / θ^n for 0 < y < θ, and 0 elsewhere.

Y must be less than θ. For y ≤ θ, we have

f_θ(x_1, ..., x_n | Y = y) = f_θ(x_1, ..., x_n) / f_Y(y) = [(1/θ^n) I(y ≤ θ)] / [(n y^(n−1)/θ^n) I(y ≤ θ)] = 1 / (n y^(n−1)),

which does not depend on θ.
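The computation in Example 1 can be checked by brute-force enumeration (an illustrative sketch, not part of the notes): for every θ, the conditional probability of each sequence given its sum is exactly 1/C(n, y).

```python
from itertools import product
from math import comb, isclose

# For iid Bernoulli(theta) trials, P(X = x | sum(X) = y) = 1 / C(n, y),
# whatever theta is -- the conditional law does not involve theta.
n = 4
for theta in (0.2, 0.7):
    for y in range(n + 1):
        p_y = comb(n, y) * theta**y * (1 - theta)**(n - y)  # P(sum = y)
        for x in product((0, 1), repeat=n):
            if sum(x) != y:
                continue
            p_x = theta**y * (1 - theta)**(n - y)  # P(X = x) when sum(x) = y
            cond = p_x / p_y                        # P(X = x | sum = y)
            assert isclose(cond, 1 / comb(n, y))
print("verified: conditional law is 1 / C(n, y) for every theta")
```

Every sequence with the same number of successes is equally likely given the total, which is exactly the intuition that the order of the successes carries no information about θ.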
It is often hard to verify or disprove sufficiency of a statistic directly, because we need to find the distribution of the sufficient statistic. The following theorem is often helpful.

Factorization Theorem: A statistic T(X) is sufficient for θ if and only if there exist functions g(t, θ) and h(x) such that

p(x | θ) = g(T(x), θ) h(x)

for all x and all θ (where p(x | θ) denotes the probability mass function for discrete data given the parameter θ, and the probability density function for continuous data).

Proof: We prove the theorem for discrete data; the proof for continuous distributions is similar. First, suppose that the probability mass function factors as given in the theorem. We have

P_θ(T(X) = t) = Σ_{x': T(x') = t} p(x' | θ),

so that, for x with T(x) = t,

P_θ(X = x | T(X) = t) = P_θ(X = x, T(X) = t) / P_θ(T(X) = t)
= P_θ(X = x) / P_θ(T(X) = t)
= g(T(x), θ) h(x) / Σ_{x': T(x') = t} g(T(x'), θ) h(x')
= h(x) / Σ_{x': T(x') = t} h(x').

Thus, T(X) is sufficient for θ because the conditional distribution P_θ(X = x | T(X) = t) does not depend on θ.

Conversely, suppose T(X) is sufficient for θ. Then the conditional distribution of X | T(X) does not depend on θ. Let P(X = x | T(X) = t) = k(x, t). Then

p(x | θ) = k(x, T(x)) P_θ(T(X) = T(x)).

Thus, we can take h(x) = k(x, T(x)) and g(t, θ) = P_θ(T(X) = t). ∎

Example 1 continued: Let X_1, ..., X_n be a sequence of independent Bernoulli random variables with P(X_i = 1) = θ. To show that Y = Σ_{i=1}^n X_i is sufficient for θ, we factor the probability mass function as follows:

P(X_1 = x_1, ..., X_n = x_n | θ) = Π_{i=1}^n θ^(x_i) (1−θ)^(1−x_i) = θ^(Σ_{i=1}^n x_i) (1−θ)^(n−Σ_{i=1}^n x_i) = (θ/(1−θ))^(Σ_{i=1}^n x_i) (1−θ)^n.

The pmf is of the form g(Σ_{i=1}^n x_i, θ) h(x_1, ..., x_n), where h(x_1, ..., x_n) = 1.

Example 2 continued: Let X_1, ..., X_n be iid Uniform(0, θ). To show that Y = max_{1≤i≤n} X_i is sufficient, we factor the pdf as follows:

f(x_1, ..., x_n | θ) = Π_{i=1}^n (1/θ) I(0 ≤ x_i ≤ θ) = (1/θ^n) I(max_{1≤i≤n} x_i ≤ θ) I(min_{1≤i≤n} x_i ≥ 0).

The pdf is of the form g(max_{1≤i≤n} x_i, θ) h(x_1, ..., x_n), where g(max_{1≤i≤n} x_i, θ) = (1/θ^n) I(max_{1≤i≤n} x_i ≤ θ) and h(x_1, ..., x_n) = I(min_{1≤i≤n} x_i ≥ 0).

Example 3: Let X_1, ..., X_n be iid Normal(μ, σ²).
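As a quick numerical sanity check on Example 1 continued (a sketch of my own, not from the notes), we can verify that the Bernoulli joint pmf equals the factored form g(Σx_i, θ)·h(x), with h ≡ 1, at every sample point and several values of θ:

```python
from itertools import product
from math import isclose

def pmf(x, theta):
    # joint Bernoulli pmf: product of theta^{x_i} (1 - theta)^{1 - x_i}
    p = 1.0
    for xi in x:
        p *= theta if xi == 1 else 1 - theta
    return p

def g(t, theta, n):
    # factor that depends on the data only through t = sum(x)
    return theta**t * (1 - theta)**(n - t)

n = 5
for theta in (0.1, 0.5, 0.9):
    for x in product((0, 1), repeat=n):
        # h(x) = 1, so the pmf should equal g(sum(x), theta) exactly
        assert isclose(pmf(x, theta), g(sum(x), theta, n))
print("factorization p(x|theta) = g(sum(x), theta) * 1 verified")
```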
The pdf factors as

f(x_1, ..., x_n; μ, σ²) = Π_{i=1}^n (1/√(2πσ²)) exp(−(x_i − μ)²/(2σ²))
= (2πσ²)^(−n/2) exp(−(1/(2σ²)) Σ_{i=1}^n (x_i − μ)²)
= (2πσ²)^(−n/2) exp(−(1/(2σ²)) (Σ_{i=1}^n x_i² − 2μ Σ_{i=1}^n x_i + nμ²)).

The pdf is thus of the form g(Σ_{i=1}^n x_i, Σ_{i=1}^n x_i², μ, σ²) h(x_1, ..., x_n), where h(x_1, ..., x_n) = 1. Thus, (Σ_{i=1}^n x_i, Σ_{i=1}^n x_i²) is a two-dimensional sufficient statistic for (μ, σ²), i.e., the distribution of X_1, ..., X_n is independent of (μ, σ²) given (Σ_{i=1}^n x_i, Σ_{i=1}^n x_i²).

A theorem for proving that a statistic is not sufficient:

Theorem 1: Let T(X) be a statistic. If there exist some θ_1, θ_2 and x, y such that
(i) T(x) = T(y);
(ii) p(x | θ_1) p(y | θ_2) ≠ p(x | θ_2) p(y | θ_1),
then T(X) is not a sufficient statistic.

Proof: Assume that (i) and (ii) hold. Suppose that T(X) is a sufficient statistic. Then by the factorization theorem, p(x | θ) = g(T(x), θ) h(x). Thus,

p(x | θ_1) p(y | θ_2) = g(T(x), θ_1) h(x) g(T(y), θ_2) h(y) = g(T(x), θ_1) h(x) g(T(x), θ_2) h(y),

where the last equality follows from (i). Also,

p(x | θ_2) p(y | θ_1) = g(T(x), θ_2) h(x) g(T(y), θ_1) h(y) = g(T(x), θ_2) h(x) g(T(x), θ_1) h(y),

where the last equality follows from (i). Thus, p(x | θ_1) p(y | θ_2) = p(x | θ_2) p(y | θ_1). This contradicts (ii). Hence the supposition that T(X) is a sufficient statistic is impossible, and T(X) must not be a sufficient statistic when (i) and (ii) hold. ∎

Example 4: Consider a series of three independent Bernoulli trials X_1, X_2, X_3 with probability of success p. Let T = X_1 + 2X_2 + 3X_3. Show that T is not sufficient.

Let x = (X_1 = 0, X_2 = 0, X_3 = 1) and y = (X_1 = 1, X_2 = 1, X_3 = 0). We have T(x) = T(y) = 3. But

f(x | p = 1/3) f(y | p = 2/3) = ((2/3)² (1/3)) ((2/3)² (1/3)) = 16/729,
f(x | p = 2/3) f(y | p = 1/3) = ((1/3)² (2/3)) ((1/3)² (2/3)) = 4/729.

Thus, by Theorem 1, T is not sufficient.
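The two products in Example 4 are easy to verify numerically (an illustrative sketch):

```python
from math import isclose

def pmf(x, p):
    # joint pmf of three independent Bernoulli(p) trials
    out = 1.0
    for xi in x:
        out *= p if xi == 1 else 1 - p
    return out

x = (0, 0, 1)
y = (1, 1, 0)
T = lambda v: v[0] + 2 * v[1] + 3 * v[2]
assert T(x) == T(y) == 3          # condition (i) of Theorem 1

lhs = pmf(x, 1/3) * pmf(y, 2/3)   # = 16/729
rhs = pmf(x, 2/3) * pmf(y, 1/3)   # = 4/729
assert not isclose(lhs, rhs)      # condition (ii): T is not sufficient
print(lhs, rhs)
```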
III. Implications of Sufficiency

We have said that reducing the data to a sufficient statistic does not sacrifice any information about θ. We now justify this statement in two ways:
(1) We show that for any decision procedure, we can find a randomized decision procedure that is based only on the sufficient statistic and that has the same risk function.
(2) We show that any point estimator that is not a function of the sufficient statistic can be improved upon for a strictly convex loss function.

(1) Let δ(X) be a decision procedure and T(X) be a sufficient statistic. Consider the following randomized decision procedure [call it δ'(T(X))]: Based on T(X), randomly draw X' from the distribution X | T(X) (which does not depend on θ and is hence known) and take action δ(X'). X has the same distribution as X', so that δ(X) has the same distribution as δ'(T(X)) = δ(X'). Since δ(X) and δ(X') have the same distribution, they have the same risk function.

Example 5: X ~ N(0, σ²). T(X) = |X| is sufficient because X | T(X) = t is equally likely to be ±t for all σ². Given T = t, construct X' to be ±t with probability 0.5 each. Then X' ~ N(0, σ²).

(2) The Rao-Blackwell Theorem.

Convex functions: A real-valued function φ defined on an open interval I = (a, b) is convex if for any a < x ≤ y < b and 0 ≤ α ≤ 1,

φ(αx + (1−α)y) ≤ αφ(x) + (1−α)φ(y).

φ is strictly convex if the inequality is strict. If φ'' exists, then φ is convex if and only if φ'' ≥ 0 on I = (a, b). A convex function lies above all its tangent lines.

Convexity of loss functions for point estimation: Squared error loss is strictly convex. Absolute error loss is convex but not strictly convex. Huber's loss function,

l(θ, a) = (q(θ) − a)² if |q(θ) − a| ≤ k, and 2k|q(θ) − a| − k² if |q(θ) − a| > k,

for some constant k, is convex but not strictly convex. The zero-one loss function,

l(θ, a) = 0 if |q(θ) − a| ≤ k, and 1 if |q(θ) − a| > k,

is nonconvex.
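Example 5 can be illustrated by simulation (a sketch; the value of σ and the sample size are my own choices): reconstructing X' = ±|X| with a fresh fair coin flip reproduces the N(0, σ²) distribution, so any procedure δ(X') has the same risk as δ(X).

```python
import random

random.seed(1)
sigma = 2.0
n = 200_000

mean_x = mean_xp = 0.0
var_x = var_xp = 0.0
for _ in range(n):
    x = random.gauss(0, sigma)
    t = abs(x)                                  # sufficient statistic T = |X|
    xp = t if random.random() < 0.5 else -t     # draw X' from X | T = t
    mean_x += x / n;  var_x += x * x / n
    mean_xp += xp / n;  var_xp += xp * xp / n

# X' should match X in distribution: mean near 0, variance near sigma^2 = 4
print(mean_x, mean_xp, var_x, var_xp)
```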
Jensen's Inequality (Appendix B.9): Let X be a random variable.
(i) If φ is convex in an open interval I, P(X ∈ I) = 1, and E(X) is finite, then φ(E[X]) ≤ E[φ(X)].
(ii) If φ is strictly convex, then φ(E[X]) < E[φ(X)] unless X equals a constant with probability one.

Proof of (i): Let L(x) be a tangent line to φ(x) at the point (E[X], φ(E[X])). Write L(x) = a + bx. By the convexity of φ, φ(x) ≥ a + bx. Since expectations preserve inequalities,

E[φ(X)] ≥ E[a + bX] = a + bE[X] = L(E[X]) = φ(E[X]),

as was to be shown. ∎

Rao-Blackwell Theorem: Let T(X) be a sufficient statistic. Let δ(X) be a point estimate of q(θ), and assume that the loss function l(θ, d) is strictly convex in d. Also assume that R(θ, δ) < ∞. Let η(t) = E[δ(X) | T(X) = t]. Then

R(θ, η) < R(θ, δ) unless δ(X) = η(T(X)) with probability one.

Proof: Fix θ. Apply Jensen's inequality with φ(d) = l(θ, d), and let X have the conditional distribution of X | T(X) = t for a particular choice of t. By Jensen's inequality,

l(θ, η(t)) ≤ E[l(θ, δ(X)) | T(X) = t],   (0.1)

with strict inequality unless δ(X) is constant given T(X) = t. Taking the expectation on both sides of this inequality yields R(θ, η) < R(θ, δ) unless δ(X) = η(T(X)) with probability one. ∎

Comments:
(1) Sufficiency ensures that η(t) = E[δ(X) | T(X) = t] is an estimator (i.e., it depends only on t and not on θ).
(2) If the loss is convex rather than strictly convex, we get only ≤ in (0.1), and hence R(θ, η) ≤ R(θ, δ).
(3) The theorem is not true without convexity of the loss function.
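A simulation sketch of the Rao-Blackwell theorem under squared error loss (a toy example of my own, not from the notes): for iid Bernoulli(p) data, start with the crude unbiased estimator δ(X) = X_1 and condition on the sufficient statistic T = ΣX_i. Since E[X_1 | T = t] = t/n, Rao-Blackwellization produces the sample mean, whose risk p(1−p)/n is far below Var(X_1) = p(1−p).

```python
import random

random.seed(2)
p, n, reps = 0.3, 10, 100_000

risk_delta = 0.0  # risk of delta(X) = X_1
risk_eta = 0.0    # risk of eta(T) = T/n = E[X_1 | T]
for _ in range(reps):
    x = [1 if random.random() < p else 0 for _ in range(n)]
    t = sum(x)
    risk_delta += (x[0] - p) ** 2 / reps
    risk_eta += (t / n - p) ** 2 / reps

print(risk_delta, risk_eta)  # roughly p(1-p) = 0.21 vs p(1-p)/n = 0.021
```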
Example 4 continued: Consider a series of three independent Bernoulli trials X_1, X_2, X_3 with probability of success p. We have shown that T(X) = X_1 + X_2 + X_3 is a sufficient statistic and that T'(X) = X_1 + 2X_2 + 3X_3 is not a sufficient statistic.

The unbiased estimator

δ(X) = (X_1 + 2X_2 + 3X_3)/6

is a function of the insufficient statistic T'(X) = X_1 + 2X_2 + 3X_3 and can thus be improved for a strictly convex loss function by using the Rao-Blackwell theorem:

η(t) = E(δ(X) | T(X) = t) = E[(X_1 + 2X_2 + 3X_3)/6 | X_1 + X_2 + X_3 = t].

Note that

P_p(X_1 = x | X_1 + X_2 + X_3 = t) = P_p(X_1 = x, X_2 + X_3 = t − x) / P_p(X_1 + X_2 + X_3 = t)
= [p^x (1−p)^(1−x) C(2, t−x) p^(t−x) (1−p)^(2−t+x)] / [C(3, t) p^t (1−p)^(3−t)]
= C(2, t−x) / C(3, t),

which equals t/3 if x = 1 and 1 − t/3 if x = 0. By symmetry, the same holds for X_2 and X_3, so

η(t) = (1/6)(t/3) + (2/6)(t/3) + (3/6)(t/3) = t/3.

For squared error loss we have

R(p, δ) = Bias_p²(δ) + Var_p(δ) = 0 + (1² + 2² + 3²) p(1−p)/36 = 14 p(1−p)/36

and

R(p, η) = Bias_p²(η) + Var_p(η) = 0 + 3p(1−p)/9 = p(1−p)/3 = 12 p(1−p)/36,

so that R(p, η) < R(p, δ).

Consequence of the Rao-Blackwell theorem: For convex loss functions, we can dispense with randomized estimators. A randomized estimator randomly chooses the estimate Y(x), where the distribution of Y(x) is known. A randomized estimator can be obtained as an estimator δ*(X, U), where X and U are independent and U is uniformly distributed on (0, 1). This is achieved by observing X = x and then using U to construct the distribution of Y(x). For the data (X, U), X is sufficient. Thus, by the Rao-Blackwell theorem, the nonrandomized estimator E[δ*(X, U) | X] dominates δ*(X, U) for strictly convex loss functions.

IV. Minimal Sufficiency

A statistic T(X) is minimally sufficient if it is sufficient and it provides a reduction of the data that is at least as great as that of any other sufficient statistic S(X), in the sense that we can find a transformation r such that T(X) = r(S(X)).

Theorem 2 (Lehmann and Scheffé, 1950): Suppose S(X) is a sufficient statistic for θ. Also suppose that if for two sample points x and y, the ratio f(x | θ)/f(y | θ) is constant as a function of θ, then S(x) = S(y). Then S(X) is a minimal sufficient statistic for θ.

Proof: Let T(X) be any statistic that is sufficient for θ.
By the factorization theorem, there exist functions g and h such that f(x | θ) = g(T(x) | θ) h(x). Let x and y be any two sample points with T(x) = T(y). Then

f(x | θ) / f(y | θ) = [g(T(x) | θ) h(x)] / [g(T(y) | θ) h(y)] = h(x)/h(y).

Since this ratio does not depend on θ, the assumptions of the theorem imply that S(x) = S(y). Thus, S(X) is at least as coarse a partition of the sample space as T(X), and consequently S(X) is minimal sufficient. ∎

Example 6: Suppose X_1, ..., X_n are iid Bernoulli(θ). Consider the ratio

f(x | θ) / f(y | θ) = [θ^(Σ_{i=1}^n x_i) (1−θ)^(n−Σ_{i=1}^n x_i)] / [θ^(Σ_{i=1}^n y_i) (1−θ)^(n−Σ_{i=1}^n y_i)].

If this ratio is constant as a function of θ, then Σ_{i=1}^n x_i = Σ_{i=1}^n y_i. Since we have shown that T(X) = Σ_{i=1}^n X_i is a sufficient statistic, it follows from the above sentence and Theorem 2 that T(X) = Σ_{i=1}^n X_i is a minimal sufficient statistic.

Note that a minimal sufficient statistic is not unique. Any one-to-one function of a minimal sufficient statistic is also a minimal sufficient statistic. For example, T'(X) = (1/n) Σ_{i=1}^n X_i is a minimal sufficient statistic for the i.i.d. Bernoulli case.

Example 7: Suppose X_1, ..., X_n are iid uniform on the interval (θ, θ + 1), −∞ < θ < ∞. Then the joint pdf of X is

f(x | θ) = 1 if θ < x_i < θ + 1 for i = 1, ..., n, and 0 otherwise,

which equals 1 if max_i x_i − 1 < θ < min_i x_i, and 0 otherwise. The statistic T(X) = (min_i X_i, max_i X_i) is a sufficient statistic by the factorization theorem, with g(T(X), θ) = I(max_i X_i − 1 < θ < min_i X_i) and h(X) = 1.

For any two sample points x and y, the numerator and denominator of the ratio f(x | θ)/f(y | θ) will be positive for the same values of θ if and only if min_i x_i = min_i y_i and max_i x_i = max_i y_i; if the minima and maxima are equal, then the ratio is constant and in fact equals 1. Thus, T(X) = (min_i x_i, max_i x_i) is a minimal sufficient statistic by Theorem 2.

Example 7 is a case in which the dimension of the minimal sufficient statistic (2) does not match the dimension of the parameter (1).
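The criterion of Theorem 2 for Example 6 can be probed numerically (an illustrative sketch): over a grid of θ values, the likelihood ratio f(x | θ)/f(y | θ) is constant exactly when Σx_i = Σy_i.

```python
def lik(x, theta):
    # Bernoulli likelihood: theta^sum * (1 - theta)^(n - sum)
    s, n = sum(x), len(x)
    return theta**s * (1 - theta)**(n - s)

def ratio_is_constant(x, y, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    ratios = [lik(x, th) / lik(y, th) for th in grid]
    return max(ratios) - min(ratios) < 1e-12

x = (1, 0, 1, 0)   # sum = 2
y = (0, 1, 0, 1)   # sum = 2: same value of the minimal sufficient statistic
z = (1, 1, 1, 0)   # sum = 3: different value

same_sum_constant = ratio_is_constant(x, y)   # constant ratio (in fact 1)
diff_sum_constant = ratio_is_constant(x, z)   # ratio (1-theta)/theta varies
print(same_sum_constant, diff_sum_constant)
```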
There are models in which the dimension of the minimal sufficient statistic is equal to the sample size, e.g., X_1, ..., X_n iid Cauchy(θ),

f(x | θ) = 1 / (π[1 + (x − θ)²]).

(Problem 1.5.15.)