Chapter 3
Sufficient statistics and variance reduction

Let $X_1, X_2, \ldots, X_n$ be a random sample from a certain distribution with p.m/d.f. $f(x|\theta)$. A function $T(X_1, X_2, \ldots, X_n) = T(X)$ of these observations is called a statistic. From a statistical point of view, taking a statistic of the observations is equivalent to taking into account only part of the information in the sample.

Example: An experiment can either result in "success" or "failure" with probabilities $\theta$ and $(1-\theta)$ respectively. The experiment is performed independently $n$ times. Let

$$
X_i = \begin{cases} 1 & \text{if the } i\text{th repetition results in "success"} \\ 0 & \text{if the } i\text{th repetition results in "failure".} \end{cases}
$$

Let $S_m = \sum_{i=1}^{m} X_i$ and $S_{n-m} = \sum_{i=m+1}^{n} X_i$. Consider the bivariate statistic $T(X) = (S_m, S_{n-m})$. This statistic gives information on how many "successes" are obtained in the first $m$ experiments and on how many "successes" are obtained in the last $n-m$ experiments. The information on which particular experiments the "successes" were obtained in is not retained; neither is the information about how many "successes" are obtained in the first $r$ experiments for $r \neq m$.

Consider now the statistic $U(X) = \sum_{i=1}^{n} X_i$. This statistic gives information on the total number of "successes" in the $n$ repetitions; all other information in the sample is not retained by $U(X)$. Note, in fact, that $U(X)$ retains even less information than $T(X)$. Note also that $U(X) = S_m + S_{n-m}$, i.e. $U(X)$ is a function (i.e. a statistic) of $T(X)$. Consequently we come to the conclusion that every time we take a function of a statistic we drop some of the information.

We have argued in the past that the Fisher information

$$
I(\theta) = I_X(\theta) = E\left(S^2(X)\right), \qquad S(X) = \frac{d}{d\theta} \log f_X(X|\theta),
$$

is a measure of the amount of information in the sample $X$ about the parameter $\theta$. Now, if $T(X)$ is a statistic, then a measure of the amount of information in $T$ about $\theta$ can be given by the Fisher information of $T$, defined by

$$
I_T(\theta) = E\left(S^2(T)\right), \qquad S(T) = \frac{d}{d\theta} \log f_T(T|\theta),
$$

where $f_T(t|\theta)$ is the p.m/d.f. of the statistic $T$. If $\hat{\theta}(T)$ is an unbiased estimator of $\theta$ based on the statistic $T$ instead of on the whole sample $X$, then the Cramér-Rao inequality becomes

$$
\operatorname{Var}\left(\hat{\theta}(T)\right) \geq \frac{1}{I_T(\theta)}. \qquad (3.1)
$$

Now, in view of the remarks we made about a statistic being equivalent to taking into account only part of the information in the sample, we should expect to have

$$
I_T(\theta) \leq I_X(\theta) \qquad (3.2)
$$

with equality holding if and only if the statistic has retained all the relevant information about $\theta$ and dropped only information which does not relate to $\theta$. A statistic which retains all the relevant information about $\theta$ and discards only information which does not relate to $\theta$ is said to be sufficient for $\theta$.

Unfortunately, tempting as it may be, we cannot adopt strict equality in (3.2) as the formal definition of sufficiency of a statistic $T$, as this is only possible in cases where there is enough regularity for the Fisher information to be defined. We need a formal definition of sufficiency which holds in all cases, irrespective of whether this regularity is present or not.

Formal definition of sufficiency: A statistic $T(X)$ of the observations $X$ with p.m/d.f. $f_X(x|\theta)$ is said to be sufficient for the parameter $\theta$ if the conditional distribution of $X$ given $T = t$ is free of $\theta$, i.e. if the conditional p.m/d.f. $f_X(x|T=t)$ does not involve $\theta$.
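As an added numerical illustration of (3.2) (a minimal Monte Carlo sketch, not part of the original notes), the following Python fragment estimates Fisher information as the variance of the score $\frac{d}{d\theta}\log f$ for the Bernoulli success/failure example above, comparing the whole sample (equivalently its sum, which turns out to be sufficient) with the single observation $X_1$. The choices of $n$, $\theta$ and the number of replications are arbitrary.

```python
import numpy as np

# Estimate Fisher information as the Monte Carlo variance of the score
# d/dtheta log f, for a Bernoulli(theta) sample of size n.
rng = np.random.default_rng(0)
n, theta, reps = 10, 0.3, 200_000

x = rng.binomial(1, theta, size=(reps, n))     # `reps` independent samples of size n

def score(successes, trials, theta):
    # Score of a Binomial(trials, theta) count; the score of the whole sample is
    # identical because the binomial coefficient does not involve theta.
    return successes / theta - (trials - successes) / (1 - theta)

I_X  = score(x.sum(axis=1), n, theta).var()    # info in the sample X (and in T = sum X_i)
I_X1 = score(x[:, 0], 1, theta).var()          # info in the statistic X_1 alone

print(f"I_X  ~ {I_X:.2f}   (theory n/(theta(1-theta)) = {n / (theta * (1 - theta)):.2f})")
print(f"I_X1 ~ {I_X1:.2f}   (theory 1/(theta(1-theta)) = {1 / (theta * (1 - theta)):.2f})")
```

The estimates come out near $n/(\theta(1-\theta))$ and $1/(\theta(1-\theta))$ respectively, consistent with $I_{X_1}(\theta) < I_X(\theta)$: the statistic $X_1$ discards information that does relate to $\theta$.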
From this definition of sufficiency we have the following.

The factorization theorem: A statistic $T(X)$, where $X$ has joint p.m/d.f. $f_X(x|\theta)$, is sufficient for $\theta$ if and only if

$$
f_X(x|\theta) = g(t, \theta)\, h(x) \qquad \text{for all } x \in \mathcal{X}^n
$$

where $g(t, \theta)$ is a function of $\theta$ which depends on the observations only through the value $t$ of $T$, and $h(x)$ is a function which does not involve $\theta$.

Proof. We first note that if $f_{X,T}(x, t|\theta)$ is the joint p.m/d.f. of $X$ and $T$ then

$$
f_{X,T}(x, t|\theta) = \begin{cases} f_X(x|\theta) & \text{if } t = T(x) \\ 0 & \text{if } t \neq T(x) \end{cases}
\;=\; \begin{cases} f_X(x|\theta) & \text{if } x \in A_t \\ 0 & \text{if } x \notin A_t \end{cases} \qquad (3.3)
$$

where the set $A_t = \{x' : T(x') = t\}$ is the set of all sample results for which $T = t$.

We can understand the result in (3.3) better in terms of an example. Suppose an experiment which can result in either "success" or "failure" is repeated independently three times, and on the $i$th repetition we record $X_i = 1$ if we get a "success" and $X_i = 0$ if we get a "failure", $i = 1, 2, 3$. Let the statistic $T = \sum_{i=1}^{3} X_i$ be the number of "successes" in the three repetitions. The possible outcomes of the sample $X = (X_1, X_2, X_3)$ and of the statistic $T$ are shown below.

Partition sets:
$A_0$: $(0,0,0)$ $\to T = 0$
$A_1$: $(1,0,0),\ (0,1,0),\ (0,0,1)$ $\to T = 1$
$A_2$: $(1,1,0),\ (1,0,1),\ (0,1,1)$ $\to T = 2$
$A_3$: $(1,1,1)$ $\to T = 3$

Clearly

$$
f_{X,T}\big((0,1,0), 2\,|\,\theta\big) = \Pr\Big((X_1, X_2, X_3) = (0,1,0),\ \sum_{i=1}^{3} X_i = 2\Big) = 0
$$

since clearly we cannot have the result $(X_1, X_2, X_3) = (0,1,0)$ and at the same time have $\sum_{i=1}^{3} X_i = 2$. On the other hand

$$
f_{X,T}\big((0,1,0), 1\,|\,\theta\big) = \Pr\Big((X_1, X_2, X_3) = (0,1,0),\ \sum_{i=1}^{3} X_i = 1\Big) = \Pr\big((X_1, X_2, X_3) = (0,1,0)\big) = f_X\big((0,1,0)|\theta\big).
$$

We now turn our attention to the proof of the factorization theorem. Assume first that $T$ is sufficient for $\theta$, i.e. that $f_X(x|T=t)$ is free of the parameter $\theta$. Since for $t = T(x)$

$$
f_X(x|\theta) = f_{X,T}(x, t|\theta) = f_X(x|T=t)\, f_T(t|\theta) \qquad \text{(see (3.3))}
$$

the factorization follows by taking $f_X(x|T=t) \equiv h(x)$ and $f_T(t|\theta) \equiv g(t, \theta)$.

Assume now that the factorization $f_X(x|\theta) = g(t, \theta) h(x)$ holds for all $x \in \mathcal{X}^n$ with $t = T(x)$. It follows that

$$
f_T(t|\theta) = \sum_{x \in A_t} f_X(x|\theta) = \sum_{x \in A_t} g(t, \theta) h(x) = g(t, \theta) \sum_{x \in A_t} h(x) = g(t, \theta) H(t) \qquad (3.4)
$$

where the set $A_t = \{x' : T(x') = t\}$ is the set of all sample results for which $T = t$. In calculating (3.4) we have assumed the observations to be discrete; if they are continuous, replace summations by integrals. Further, in (3.3) we have seen that

$$
f_X(x|T=t) = \begin{cases} \dfrac{f_X(x|\theta)}{f_T(t|\theta)} & \text{if } x \in A_t \\ 0 & \text{if } x \notin A_t \end{cases}
$$

and from (3.4) and the factorization we get

$$
f_X(x|T=t) = \begin{cases} \dfrac{g(t, \theta) h(x)}{g(t, \theta) H(t)} = \dfrac{h(x)}{H(t)} & \text{if } x \in A_t \\ 0 & \text{if } x \notin A_t \end{cases}
$$

i.e. $f_X(x|T=t)$ is free of $\theta$. This completes the proof of the factorization theorem.

Remark 3.0.1 What are the implications of having the conditional p.m/d.f. $f_X(x|T=t)$ free of $\theta$? Given that we know that $T(x) = t$, it follows that $x$ must be situated in the set $A_t$; if further $f_X(x|T=t)$ is free of $\theta$, we can conclude that once we know that $x$ is in the set $A_t$, the probability of it being in any particular position within $A_t$ does not depend on $\theta$, i.e. once we know that $x$ is in the set $A_t$, information on its exact position within $A_t$ does not relate to $\theta$. Put another way, all the information in $x$ relating to $\theta$ is contained in the value of $T(x)$; the information in $x$ which is not retained by the statistic $T$ does not relate to $\theta$. But we have seen that a statistic $T$ which retains all the relevant information about $\theta$ and discards only information that is not relevant to $\theta$ is what we call a sufficient statistic for $\theta$.
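The three-trial example can be checked directly by enumeration. The short Python sketch below (an added illustration, not part of the original notes) builds the partition sets $A_0, \ldots, A_3$, computes $f_T(t|\theta)$ as in (3.4), and prints the conditional probabilities $f_X(x|T=t)$ for two different values of $\theta$.

```python
from itertools import product

# Enumerate the partition sets A_t for three Bernoulli(theta) trials and show that
# the conditional probabilities f_X(x | T = t) do not depend on theta.
def joint_pmf(x, theta):
    """f_X(x | theta) for independent Bernoulli(theta) observations."""
    s = sum(x)
    return theta**s * (1 - theta)**(len(x) - s)

for theta in (0.2, 0.7):                      # two different parameter values
    print(f"theta = {theta}")
    for t in range(4):                        # partition sets A_0, ..., A_3
        A_t = [x for x in product((0, 1), repeat=3) if sum(x) == t]
        f_T = sum(joint_pmf(x, theta) for x in A_t)          # f_T(t | theta), as in (3.4)
        cond = {x: joint_pmf(x, theta) / f_T for x in A_t}   # f_X(x | T = t)
        print(f"  A_{t}:", {x: round(p, 3) for x, p in cond.items()})
```

For both values of $\theta$ the conditional probabilities are $1, \tfrac{1}{3}, \tfrac{1}{3}, 1$ over $A_0, A_1, A_2, A_3$ respectively, which is exactly what sufficiency of $T = \sum_i X_i$ asserts.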
Result: Let $T(X)$ be a statistic of the sample $X$ whose joint distribution depends on a parameter $\theta$. Then, under certain regularity conditions on the joint p.d/m.f. $f_X(x|\theta)$ of $X$ and on the p.d/m.f. $f_T(t|\theta)$ of $T$,

$$
I_T(\theta) \leq I_X(\theta), \qquad \theta \in \Theta,
$$

with equality if and only if $T(X)$ is sufficient for $\theta$. Here

$$
I_T(\theta) = E\left(\left[\frac{\partial}{\partial\theta} \log f_T(T|\theta)\right]^2\right) = E\left(-\frac{\partial^2}{\partial\theta^2} \log f_T(T|\theta)\right)
$$

and

$$
I_X(\theta) = E\left(\left[\frac{\partial}{\partial\theta} \log f_X(X|\theta)\right]^2\right) = E\left(-\frac{\partial^2}{\partial\theta^2} \log f_X(X|\theta)\right).
$$

Proof. The inequality $I_T(\theta) \leq I_X(\theta)$ will be assumed valid as a consequence of our understanding of what a statistic does and of what the Fisher information represents, although it can be rigorously proved. That equality holds if and only if $T$ is sufficient for $\theta$ follows from the factorization theorem and is left as an exercise.

Remarks:

1. Notice that the factorization theorem not only gives us necessary and sufficient conditions for the existence of a sufficient statistic, it also identifies the sufficient statistic for us.

2. Sufficiency implies that basing inferences about $\theta$ on procedures involving sufficient statistics rather than the whole sample is preferable, since such procedures discard, outright, unnecessary information which does not relate to $\theta$. In particular, in estimating $\theta$, the best unbiased estimators based on sufficient statistics are not going to be any less efficient (in the formal sense) than the best unbiased estimators based on the whole sample, since for $T$ sufficient $I_T(\theta) = I_X(\theta)$, i.e. the CRLB for unbiased estimators based on $T$ is the same as the CRLB for unbiased estimators based on $X$.

Example 3.0.1 Let $X_1, X_2, \ldots, X_n$ be a random sample from the Bernoulli distribution, i.e.

$$
X_i = \begin{cases} 1 & \text{with probability } \theta \\ 0 & \text{with probability } 1 - \theta. \end{cases}
$$

Hence $f_{X_i}(x_i|\theta) = \theta^{x_i}(1-\theta)^{1-x_i}$ for all $i$. Use the factorization theorem to find a sufficient statistic for $\theta$ and then confirm that it is sufficient for $\theta$ using the formal definition of sufficiency.

Solution: The joint mass function of the observations $X = (X_1, X_2, \ldots, X_n)$ is

$$
f_X(x|\theta) = \prod_{i=1}^{n} f_{X_i}(x_i|\theta) = \prod_{i=1}^{n} \theta^{x_i}(1-\theta)^{1-x_i} = \theta^{\sum_{i=1}^{n} x_i} (1-\theta)^{n - \sum_{i=1}^{n} x_i} = g\Big(\sum_{i=1}^{n} x_i,\ \theta\Big)\, h(x)
$$

with $h(x) \equiv 1$. Hence by the factorization theorem $\sum_{i=1}^{n} X_i$ is a sufficient statistic for $\theta$.

Notice that since the factorization is not unique, there may be more than one sufficient statistic. For example we could have written

$$
f_X(x|\theta) = \theta^{S_m + S_{n-m}} (1-\theta)^{n - S_m - S_{n-m}} = g\big((S_m, S_{n-m}),\ \theta\big)\, h(x)
$$

with, once again, $h(x) \equiv 1$, $S_m = \sum_{i=1}^{m} x_i$ and $S_{n-m} = \sum_{i=m+1}^{n} x_i$. Hence by the factorization theorem $\big(\sum_{i=1}^{m} X_i,\ \sum_{i=m+1}^{n} X_i\big)$ is a bivariate sufficient statistic for $\theta$.

We now show, using the formal definition, that $\sum_{i=1}^{n} X_i$ is indeed a sufficient statistic. The conditional p.m.f. of $X$ given that $T(X) = \sum_{i=1}^{n} X_i = t$ is

$$
f_X(x|T=t) = \frac{f_{X,T}(x, t|\theta)}{f_T(t|\theta)} = \frac{f_X(x|\theta)}{f_T(t|\theta)}
$$

when $T(x) = t$, and zero otherwise. The last equality was obtained using (3.3). However, the statistic $T = \sum_{i=1}^{n} X_i$ has the Binomial$(n, \theta)$ distribution. Hence

$$
f_X(x|T=t) = \frac{f_X(x|\theta)}{f_T(t|\theta)} = \frac{\theta^{\sum_{i=1}^{n} x_i} (1-\theta)^{n - \sum_{i=1}^{n} x_i}}{\binom{n}{t} \theta^{t} (1-\theta)^{n-t}} = 1\Big/\binom{n}{t}
$$

which is independent of $\theta$, confirming that $\sum_{i=1}^{n} X_i$ is sufficient for $\theta$.
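The conclusion $f_X(x|T=t) = 1/\binom{n}{t}$ can also be checked by simulation. The sketch below (an added illustration under arbitrary choices of $n$, $t$ and $\theta$, not part of the original notes) draws many Bernoulli samples, keeps those with $T = t$, and estimates the conditional probability of one particular arrangement for two values of $\theta$.

```python
import numpy as np
from math import comb

# Conditionally on T = sum(X_i) = t, every arrangement of successes should be
# equally likely, with probability 1 / C(n, t), whatever the value of theta.
rng = np.random.default_rng(1)
n, t, reps = 5, 2, 400_000

for theta in (0.2, 0.6):
    x = rng.binomial(1, theta, size=(reps, n))
    conditioned = x[x.sum(axis=1) == t]               # keep only samples with T = t
    target = np.array([1, 1, 0, 0, 0])                 # one particular arrangement
    freq = (conditioned == target).all(axis=1).mean()  # empirical conditional frequency
    print(f"theta={theta}:  P(X=(1,1,0,0,0) | T=2) ~ {freq:.4f}  "
          f"(theory 1/C(5,2) = {1 / comb(5, 2):.4f})")
```

Both runs give roughly $0.1 = 1/\binom{5}{2}$, regardless of $\theta$, in line with the analytic verification above.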
Example 3.0.2 Let $X_1, X_2, \ldots, X_n$ be a random sample from the $N(\mu, \sigma^2)$ distribution. Then

$$
f_X(x|\theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x_i - \mu)^2\right)
= \big(2\pi\sigma^2\big)^{-n/2} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2\right)
= \big(2\pi\sigma^2\big)^{-n/2} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} x_i^2 + \frac{\mu}{\sigma^2} \sum_{i=1}^{n} x_i - \frac{n\mu^2}{2\sigma^2}\right).
$$

Suppose that both $\mu$ and $\sigma^2$ are unknown, so that $\theta = (\mu, \sigma^2)^{T}$. Then

$$
f_X(x|\theta) = g\left(\Big(\sum_{i=1}^{n} x_i,\ \sum_{i=1}^{n} x_i^2\Big),\ \theta\right) h(x)
$$

with $h(x) \equiv 1$ and

$$
g\left(\Big(\sum_{i=1}^{n} x_i,\ \sum_{i=1}^{n} x_i^2\Big),\ \theta\right) = \big(2\pi\sigma^2\big)^{-n/2} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} x_i^2 + \frac{\mu}{\sigma^2} \sum_{i=1}^{n} x_i - \frac{n\mu^2}{2\sigma^2}\right).
$$

Hence the bivariate statistic $\big(\sum_{i=1}^{n} X_i, \sum_{i=1}^{n} X_i^2\big)$ is sufficient for $(\mu, \sigma^2)$. This should NOT be interpreted as saying that $\sum_{i=1}^{n} X_i$ is sufficient for $\mu$ and $\sum_{i=1}^{n} X_i^2$ is sufficient for $\sigma^2$. All it says is that all the information contained in the sample about $\mu$ and $\sigma^2$ is also contained in the statistic $\big(\sum_{i=1}^{n} X_i, \sum_{i=1}^{n} X_i^2\big)$.

Suppose now that $\mu$ is unknown but $\sigma^2$ is known, so that we now have $\theta = \mu$, and

$$
f_X(x|\theta) = \underbrace{\big(2\pi\sigma^2\big)^{-n/2} \exp\left(\frac{\mu}{\sigma^2} \sum_{i=1}^{n} x_i - \frac{n\mu^2}{2\sigma^2}\right)}_{=\, g\left(\sum_{i=1}^{n} x_i,\ \theta\right)}\ \underbrace{\exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} x_i^2\right)}_{h(x)}.
$$

By the factorization theorem we conclude that $\sum_{i=1}^{n} X_i$ is sufficient for $\theta = \mu$.

Suppose now that $\mu$ is known but $\sigma^2$ is unknown, so that now $\theta = \sigma^2$ and

$$
f_X(x|\theta) = \underbrace{(2\pi\theta)^{-n/2} \exp\left(-\frac{1}{2\theta} \sum_{i=1}^{n} x_i^2 + \frac{\mu}{\theta} \sum_{i=1}^{n} x_i - \frac{n\mu^2}{2\theta}\right)}_{=\, g\left(\left(\sum_{i=1}^{n} x_i,\ \sum_{i=1}^{n} x_i^2\right),\ \theta\right)} \cdot \underbrace{1}_{h(x)}.
$$

By the factorization theorem we conclude that the bivariate statistic $\big(\sum_{i=1}^{n} X_i, \sum_{i=1}^{n} X_i^2\big)$ is sufficient for $\theta = \sigma^2$. Note that $\sum_{i=1}^{n} X_i^2$ by itself is not sufficient for $\sigma^2$ unless $\mu = 0$.

Example 3.0.3 Let $X_1, X_2, \ldots, X_n$ be a random sample from the $U(0, \theta)$ distribution, i.e.

$$
f_{X_i}(x_i|\theta) = \begin{cases} \dfrac{1}{\theta} & \text{if } 0 < x_i < \theta \\ 0 & \text{otherwise.} \end{cases}
$$

Note that $\theta$ is involved in the range of the distribution. Hence it is better if we write the p.d.f. of $X_i$ as

$$
f_{X_i}(x_i|\theta) = \frac{1}{\theta} I_{(0,\theta)}(x_i)
$$

where $I_{(0,\theta)}$ is the indicator function of the interval $(0, \theta)$. For any set $A$ the indicator function of $A$ is defined as

$$
I_A(x) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A. \end{cases}
$$

Hence

$$
f_X(x|\theta) = \prod_{i=1}^{n} \frac{1}{\theta} I_{(0,\theta)}(x_i) = \frac{1}{\theta^n} \prod_{i=1}^{n} I_{(0,\theta)}(x_i) = \underbrace{\frac{1}{\theta^n}\, I_{(0,\theta)}\big(\max_i x_i\big)}_{=\, g(\max_i x_i,\ \theta)} \cdot \underbrace{1}_{h(x)}
$$

$\Longrightarrow \max_{1 \leq i \leq n} X_i$ is sufficient for $\theta$ by the factorization theorem.
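The intuition behind Example 3.0.3 can be illustrated by simulation (a small added sketch under arbitrary choices of $n$, $\theta$ and the number of replications, not part of the original notes): once the sufficient statistic $M = \max_i X_i$ is known, the remaining observations behave like a sample from $U(0, M)$, so their ratios $X_i / M$ have the same distribution whatever the value of $\theta$.

```python
import numpy as np

# Rescale each U(0, theta) sample by its maximum: the distribution of the ratios
# does not depend on theta, illustrating that max X_i carries all the information
# about theta in the sample.
rng = np.random.default_rng(2)
n, reps = 6, 100_000

for theta in (1.0, 7.5):
    x = rng.uniform(0, theta, size=(reps, n))
    m = x.max(axis=1, keepdims=True)                  # the sufficient statistic for each sample
    ratios = np.sort(x / m, axis=1)[:, :-1].ravel()   # drop the maximum itself (ratio 1)
    print(f"theta={theta}: mean {ratios.mean():.3f}, "
          f"quartiles {np.quantile(ratios, [0.25, 0.5, 0.75]).round(3)}")
```

Both parameter values give mean roughly $0.5$ and quartiles roughly $(0.25, 0.5, 0.75)$, as for a $U(0,1)$ sample: beyond $\max_i X_i$, the data carry no further information about $\theta$.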