Chapter 3

Sufficient statistics and variance reduction
Let $X_1, X_2, \ldots, X_n$ be a random sample from a certain distribution with p.m/d.f. $f(x|\theta)$. A function $T(X_1, X_2, \ldots, X_n) = T(X)$ of these observations is called a statistic. From a statistical point of view, taking a statistic of the observations is equivalent to taking into account only part of the information in the sample.
Example: An experiment can either result in “success” or “failure” with probability $\theta$ and $(1-\theta)$ respectively. The experiment is performed independently $n$ times. Let
$$X_i = \begin{cases} 1 & \text{if the } i\text{th repetition results in “success”} \\ 0 & \text{if the } i\text{th repetition results in “failure”} \end{cases}$$
Let $S_m = \sum_{i=1}^{m} X_i$ and $S_{n-m} = \sum_{i=m+1}^{n} X_i$. Consider the bivariate statistic $T(X) = (S_m, S_{n-m})$. This statistic gives information on how many “successes” are obtained in the first $m$ experiments and on how many “successes” are obtained in the last $n-m$ experiments. The information on which particular experiments the “successes” were obtained in is not retained; neither is the information about how many “successes” are obtained in the first $r$ experiments for $r \neq m$. Consider now the statistic $U(X) = \sum_{i=1}^{n} X_i$. This statistic gives information on the total number of “successes” in the $n$ repetitions; all other information in the sample is not retained by $U(X)$. Note, in fact, that $U(X)$ retains even less information than $T(X)$. Note also that
$U(X) = S_m + S_{n-m}$, i.e. $U(X)$ is a function (i.e. a statistic) of $T(X)$. Consequently we come to the conclusion that every time we take a function of a statistic we can only lose, never gain, information.
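As a quick computational illustration (a minimal Python sketch; the values $n = 10$, $m = 4$, $\theta = 0.5$ are chosen only for illustration), the statistics above can be computed as follows; note that $U(X)$ is recoverable from $T(X)$, but not vice versa:

```python
import random

random.seed(0)
n, m, theta = 10, 4, 0.5          # illustrative values only

# Simulate n independent Bernoulli(theta) experiments.
x = [int(random.random() < theta) for _ in range(n)]

s_m = sum(x[:m])                  # S_m: "successes" in the first m experiments
s_n_m = sum(x[m:])                # S_{n-m}: "successes" in the last n - m experiments
T = (s_m, s_n_m)                  # bivariate statistic T(X)
U = s_m + s_n_m                   # U(X) = S_m + S_{n-m}, a function of T(X)
print(x, T, U)
```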
We have argued in the past that the Fisher information
$$I(\theta) = I_X(\theta) = E\left(S^2(X)\right),$$
where $S(X) = \frac{d}{d\theta}\log f_X(X|\theta)$, is a measure of the amount of information in the sample $X$ about the parameter $\theta$. Now, if $T(X)$ is a statistic, then a measure of the amount of information in $T$ about $\theta$ can be given by the Fisher information of $T$, defined by
$$I_T(\theta) = E\left(S^2(T)\right)$$
with $S(T) = \frac{d}{d\theta}\log f_T(T|\theta)$, where $f_T(t|\theta)$ is the p.m/d.f. of the statistic $T$. If $\hat{\theta}(T)$ is an unbiased estimator of $\theta$ based on the statistic $T$ instead of on the whole sample $X$, then the Cramér–Rao inequality becomes
$$\mathrm{Var}\left(\hat{\theta}(T)\right) \geq \frac{1}{I_T(\theta)} \tag{3.1}$$
Now, in view of the remarks we made about a statistic being equivalent to taking into account only part of the information in the sample, we should expect that
$$I_T(\theta) \leq I_X(\theta) \tag{3.2}$$
with equality holding if and only if the statistic has retained all the relevant information about $\theta$ and dropped only information which does not relate to $\theta$. A statistic which retains all the relevant information about $\theta$ and discards only information which does not relate to $\theta$ is said to be sufficient for $\theta$.
Unfortunately, tempting as it may be, we cannot adopt strict equality in (3.2) as the formal definition of sufficiency of a statistic $T$, as this is only possible in cases where there is enough regularity for the Fisher information to be defined. We need a formal definition of sufficiency which holds in all cases, irrespective of whether this regularity is present or not.
Formal definition of Sufficiency: A statistic $T(X)$ of the observations $X$ with p.m/d.f. $f_X(x|\theta)$ is said to be sufficient for the parameter $\theta$ if the conditional distribution of $X$ given $T = t$ is free of $\theta$, i.e. if the conditional p.m/d.f. $f_X(x|T=t)$ does not involve $\theta$.
From this definition of sufficiency we have the following.

The factorization theorem

A statistic $T(X)$, where $X$ has joint p.m/d.f. $f_X(x|\theta)$, is sufficient for $\theta$ if and only if
$$f_X(x|\theta) = g(t,\theta)h(x) \qquad \text{for all } x \in \mathcal{X}^n$$
where $g(t,\theta)$ is a function of $\theta$ which depends on the observations only through the value $t$ of $T$, and $h(x)$ is a function which does not involve $\theta$.
Proof. We first note that if $f_{X,T}(x,t|\theta)$ is the joint p.m/d.f. of $X$ and $T$, then
$$f_{X,T}(x,t|\theta) = \begin{cases} f_X(x|\theta) & \text{if } t = T(x) \\ 0 & \text{if } t \neq T(x) \end{cases} \;=\; \begin{cases} f_X(x|\theta) & \text{if } x \in A_t \\ 0 & \text{if } x \notin A_t \end{cases} \tag{3.3}$$
where $A_t = \{x' : T(x') = t\}$ is the set of all sample results for which $T = t$.
We can understand the result in (3.3) better in terms of an example. Suppose an experiment which can result in either “success” or “failure” is repeated independently three times, and on the $i$th repetition we record $X_i = 1$ if we get a “success” and $X_i = 0$ if we get a “failure”, $i = 1, 2, 3$. Let the statistic $T = \sum_{i=1}^{3} X_i$ be the number of “successes” in the three repetitions. The possible outcomes of the sample $X = (X_1, X_2, X_3)$ and of the statistic $T$ are shown below.
Partition sets of $(X_1, X_2, X_3)$ and corresponding values of $T$:

$A_0 = \{(0,0,0)\} \;\to\; T = 0$
$A_1 = \{(1,0,0),\,(0,1,0),\,(0,0,1)\} \;\to\; T = 1$
$A_2 = \{(1,1,0),\,(1,0,1),\,(0,1,1)\} \;\to\; T = 2$
$A_3 = \{(1,1,1)\} \;\to\; T = 3$
Clearly
$$f_{X,T}((0,1,0),\,2|\theta) = \Pr\left((X_1,X_2,X_3) = (0,1,0),\; \sum_{i=1}^{3}X_i = 2\right) = 0$$
since clearly we cannot have the result $(X_1,X_2,X_3) = (0,1,0)$ and at the same time have $\sum_{i=1}^{3}X_i = 2$. On the other hand,
$$f_{X,T}((0,1,0),\,1|\theta) = \Pr\left((X_1,X_2,X_3) = (0,1,0),\; \sum_{i=1}^{3}X_i = 1\right) = \Pr\left((X_1,X_2,X_3) = (0,1,0)\right) = f_X((0,1,0)|\theta).$$
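The following sketch (Python, with an arbitrary illustrative value $\theta = 0.3$) reproduces these two computations and lists the partition sets $A_t$:

```python
from itertools import product

theta = 0.3  # illustrative value; any theta in (0, 1) gives the same structure

def f_X(x, theta):
    """Joint p.m.f. of three independent Bernoulli(theta) observations."""
    s = sum(x)
    return theta**s * (1 - theta)**(len(x) - s)

def f_XT(x, t, theta):
    """Joint p.m.f. of (X, T) as in (3.3): f_X(x|theta) if t = T(x), else 0."""
    return f_X(x, theta) if sum(x) == t else 0.0

print(f_XT((0, 1, 0), 2, theta))                         # 0.0: (0,1,0) not in A_2
print(f_XT((0, 1, 0), 1, theta), f_X((0, 1, 0), theta))  # equal: (0,1,0) in A_1

# The partition sets A_t = {x : T(x) = t}:
for t in range(4):
    print(t, [x for x in product((0, 1), repeat=3) if sum(x) == t])
```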
We now turn our attention to the proof of the factorization theorem. Assume first that $T$ is sufficient for $\theta$, i.e. that $f_X(x|T=t)$ is free of the parameter $\theta$. Since for $t = T(x)$
$$f_X(x|\theta) = f_{X,T}(x,t|\theta) = f_X(x|T=t)\,f_T(t|\theta)$$
(see (3.3)), the factorization follows by taking $f_X(x|T=t) \equiv h(x)$ and $f_T(t|\theta) \equiv g(t,\theta)$.
Assume now that the factorization $f_X(x|\theta) = g(t,\theta)h(x)$ holds for all $x \in \mathcal{X}^n$ with $t = T(x)$. It follows that
$$f_T(t|\theta) = \sum_{x \in A_t} f_X(x|\theta) = \sum_{x \in A_t} g(t,\theta)h(x) = g(t,\theta)\sum_{x \in A_t} h(x) = g(t,\theta)H(t) \tag{3.4}$$
where the set $A_t = \{x' : T(x') = t\}$ is the set of all sample results for which $T = t$. In calculating (3.4) we have assumed the observations to be discrete; if they are continuous, replace summations by integrals. Further, in (3.3) we have seen that
$$f_X(x|T=t) = \begin{cases} \dfrac{f_X(x|\theta)}{f_T(t|\theta)} & \text{if } x \in A_t \\[1ex] 0 & \text{if } x \notin A_t \end{cases}$$
and from (3.4) and the factorization we get
$$f_X(x|T=t) = \begin{cases} \dfrac{g(t,\theta)h(x)}{g(t,\theta)H(t)} = \dfrac{h(x)}{H(t)} & \text{if } x \in A_t \\[1ex] 0 & \text{if } x \notin A_t \end{cases}$$
i.e. $f_X(x|T=t)$ is free of $\theta$. This completes the proof of the factorization theorem.
Remark 3.0.1 What are the implications of having the conditional p.m/d.f. $f_X(x|T=t)$ free of $\theta$? Given that we know that $T(x) = t$, it follows that $x$ must be situated in the set $A_t$; if, further, $f_X(x|T=t)$ is free of $\theta$, we can conclude that once we know that $x$ is in the set $A_t$, the probability of it being in any particular position within $A_t$ does not depend on $\theta$, i.e. once we know that $x$ is in the set $A_t$, information on its exact position within $A_t$ does not relate to $\theta$. Put another way, all the information in $x$ relating to $\theta$ is contained in the value of $T(x)$; the information in $x$ which is not retained by the statistic $T$ does not relate to $\theta$. But we have seen that a statistic $T$ which retains all the relevant information about $\theta$ and discards only information that is not relevant to $\theta$ is what we call a sufficient statistic for $\theta$.
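The remark can also be checked by simulation. The sketch below (illustrative values only: $n = 3$, $t = 2$, and two very different values of $\theta$) estimates the conditional distribution of $X$ within $A_2$; up to Monte Carlo noise it is uniform over the three points of $A_2$ whichever $\theta$ generated the data:

```python
import random
from collections import Counter

def conditional_dist(theta, t=2, n=3, reps=100_000):
    """Empirical distribution of X = (X_1, ..., X_n) given T = sum(X) = t."""
    counts = Counter()
    for _ in range(reps):
        x = tuple(int(random.random() < theta) for _ in range(n))
        if sum(x) == t:
            counts[x] += 1
    total = sum(counts.values())
    return {x: round(c / total, 3) for x, c in sorted(counts.items())}

# Two very different theta values give (up to noise) the same conditional
# distribution: uniform over the three points of A_2, i.e. about 1/3 each.
print(conditional_dist(0.2))
print(conditional_dist(0.8))
```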
Result: Let $T(X)$ be a statistic of the sample $X$ whose joint distribution depends on a parameter $\theta$. Then, under certain regularity conditions on the joint p.m/d.f. $f_X(x|\theta)$ of $X$ and on the p.m/d.f. $f_T(t|\theta)$ of $T$,
$$I_T(\theta) \leq I_X(\theta), \qquad \theta \in \Theta,$$
with equality if and only if $T(X)$ is sufficient for $\theta$. Here
$$I_T(\theta) = E\left(\left[\frac{\partial}{\partial\theta}\log f_T(T|\theta)\right]^2\right) = E\left(-\frac{\partial^2}{\partial\theta^2}\log f_T(T|\theta)\right)$$
and
$$I_X(\theta) = E\left(\left[\frac{\partial}{\partial\theta}\log f_X(X|\theta)\right]^2\right) = E\left(-\frac{\partial^2}{\partial\theta^2}\log f_X(X|\theta)\right).$$
Proof. The inequality $I_T(\theta) \leq I_X(\theta)$ will be assumed valid as a consequence of our understanding of what a statistic does and of what the Fisher information represents, although it can be rigorously proved. That strict equality holds if and only if $T$ is sufficient for $\theta$ follows from the factorization theorem and is left as an exercise.
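As a concrete sanity check (a Monte Carlo sketch with illustrative values $\theta = 0.3$, $n = 5$): for a Bernoulli$(\theta)$ sample, $I_X(\theta) = n/(\theta(1-\theta))$, while the statistic $T = X_1$, which is not sufficient for $n > 1$, has $I_T(\theta) = 1/(\theta(1-\theta)) < I_X(\theta)$:

```python
import random

THETA, N = 0.3, 5   # illustrative values only

def bern(theta):
    return int(random.random() < theta)

def score(x, theta):
    """Score of one Bernoulli observation: d/dtheta log f(x|theta)."""
    return x / theta - (1 - x) / (1 - theta)

def mc_info(sample_score, reps=200_000):
    """Monte Carlo estimate of E(S^2)."""
    return sum(sample_score() ** 2 for _ in range(reps)) / reps

# I_X: the score of the whole sample is the sum of the individual scores.
i_x = mc_info(lambda: sum(score(bern(THETA), THETA) for _ in range(N)))
# I_T for the non-sufficient statistic T = X_1: a single score term.
i_t = mc_info(lambda: score(bern(THETA), THETA))

print(i_x, N / (THETA * (1 - THETA)))   # ~ 23.8 = n/(theta(1-theta))
print(i_t, 1 / (THETA * (1 - THETA)))   # ~ 4.76, strictly smaller
```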
Remark:
1. Notice that the factorization theorem not only gives us necessary and sufficient conditions for the existence of a sufficient statistic, it also identifies the sufficient statistic for us.
2. Sufficiency implies that basing inferences about $\theta$ on procedures involving sufficient statistics, rather than the whole sample, is preferable, since such procedures discard outright unnecessary information which does not relate to $\theta$. In particular, in estimating $\theta$, the best unbiased estimators based on sufficient statistics are not going to be any less efficient (in the formal sense) than the best unbiased estimators based on the whole sample, since for $T$ sufficient $I_T(\theta) = I_X(\theta)$, i.e. the CRLB for unbiased estimators based on $T$ is the same as the CRLB for unbiased estimators based on $X$.
Example 3.0.1 Let $X_1, X_2, \ldots, X_n$ be a random sample from the Bernoulli distribution, i.e.
$$X_i = \begin{cases} 1 & \text{with probability } \theta \\ 0 & \text{with probability } 1-\theta \end{cases}$$
Hence
$$f_{X_i}(x_i|\theta) = \theta^{x_i}(1-\theta)^{1-x_i}$$
for all $i$. Use the factorization theorem to find a sufficient statistic for $\theta$, and then confirm that it is sufficient for $\theta$ with the use of the formal definition of sufficiency.
Solution: The joint mass function of the observations $X = (X_1, X_2, \ldots, X_n)$ is
$$f_X(x|\theta) = \prod_{i=1}^{n} f_{X_i}(x_i|\theta) = \prod_{i=1}^{n} \theta^{x_i}(1-\theta)^{1-x_i} = \theta^{\sum_{i=1}^{n}x_i}(1-\theta)^{n-\sum_{i=1}^{n}x_i} = g\left(\sum_{i=1}^{n} x_i,\, \theta\right)h(x)$$
with $h(x) \equiv 1$. Hence by the factorization theorem $\sum_{i=1}^{n} X_i$ is a sufficient statistic for $\theta$. Notice that since the factorization is not unique, there may be more than one sufficient statistic. For example we could have written
$$f_X(x|\theta) = \theta^{S_m + S_{n-m}}(1-\theta)^{n - S_m - S_{n-m}} = g((S_m, S_{n-m}),\, \theta)h(x)$$
with, once again, $h(x) \equiv 1$, and $S_m = \sum_{i=1}^{m} x_i$, $S_{n-m} = \sum_{i=m+1}^{n} x_i$. Hence by the factorization theorem $\left(\sum_{i=1}^{m} X_i, \sum_{i=m+1}^{n} X_i\right)$ is a bivariate sufficient statistic for $\theta$.
We now show, using the formal definition, that $\sum_{i=1}^{n} X_i$ is indeed a sufficient statistic. The conditional p.m.f. of $X$ given that $T(X) = \sum_{i=1}^{n} X_i = t$ is
$$f_X(x|T=t) = \frac{f_{X,T}(x,t|\theta)}{f_T(t|\theta)} = \frac{f_X(x|\theta)}{f_T(t|\theta)}$$
when $T(x) = t$ and zero otherwise. The last equality was obtained using (3.3). However, the statistic $T = \sum_{i=1}^{n} X_i$ has the Binomial$(n,\theta)$ distribution. Hence
$$f_X(x|T=t) = \frac{f_X(x|\theta)}{f_T(t|\theta)} = \frac{\theta^{\sum_{i=1}^{n}x_i}(1-\theta)^{n-\sum_{i=1}^{n}x_i}}{\dbinom{n}{t}\theta^{t}(1-\theta)^{n-t}} = 1\Big/\dbinom{n}{t}$$
which is independent of $\theta$, confirming that $\sum_{i=1}^{n} X_i$ is sufficient for $\theta$.
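The same conclusion can be verified numerically by brute force. The sketch below (illustrative values $n = 4$, $t = 2$) evaluates $f_X(x|\theta)/f_T(t|\theta)$ for every $x \in A_t$ and several values of $\theta$; in every case the conditional probability equals $1/\binom{n}{t}$:

```python
from itertools import product
from math import comb

def f_X(x, theta):
    """Joint p.m.f. of independent Bernoulli(theta) observations."""
    s = sum(x)
    return theta**s * (1 - theta)**(len(x) - s)

def f_T(t, n, theta):
    """Binomial(n, theta) p.m.f. of T = sum of the X_i."""
    return comb(n, t) * theta**t * (1 - theta)**(n - t)

n, t = 4, 2   # illustrative values only
for theta in (0.1, 0.5, 0.9):
    ratios = {f_X(x, theta) / f_T(t, n, theta)
              for x in product((0, 1), repeat=n) if sum(x) == t}
    # Every x in A_t gets the same conditional probability, whatever theta is.
    print(theta, ratios, 1 / comb(n, t))
```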
Example 3.0.2 Let $X_1, X_2, \ldots, X_n$ be a random sample from the $N(\mu,\sigma^2)$ distribution. Then
$$f_X(x|\theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{1}{2\sigma^2}(x_i-\mu)^2\right) = \left(2\pi\sigma^2\right)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2\right) = \left(2\pi\sigma^2\right)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}x_i^2 + \frac{\mu}{\sigma^2}\sum_{i=1}^{n}x_i - \frac{n\mu^2}{2\sigma^2}\right)$$
Suppose that both $\mu$ and $\sigma^2$ are unknown, so that $\theta = (\mu, \sigma^2)^T$. Then
$$f_X(x|\theta) = g\left(\left(\sum_{i=1}^{n}x_i,\, \sum_{i=1}^{n}x_i^2\right),\, \theta\right)h(x)$$
with $h(x) \equiv 1$ and
$$g\left(\left(\sum_{i=1}^{n}x_i,\, \sum_{i=1}^{n}x_i^2\right),\, \theta\right) = \left(2\pi\sigma^2\right)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}x_i^2 + \frac{\mu}{\sigma^2}\sum_{i=1}^{n}x_i - \frac{n\mu^2}{2\sigma^2}\right)$$
Hence the bivariate statistic $\left(\sum_{i=1}^{n}X_i, \sum_{i=1}^{n}X_i^2\right)$ is sufficient for $(\mu,\sigma^2)$. This should NOT be interpreted as saying that $\sum_{i=1}^{n}X_i$ is sufficient for $\mu$ and $\sum_{i=1}^{n}X_i^2$ is sufficient for $\sigma^2$. All it says is that all the information contained in the sample about $\mu$ and $\sigma^2$ is also contained in the statistic $\left(\sum_{i=1}^{n}X_i, \sum_{i=1}^{n}X_i^2\right)$.
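A brief sketch makes this concrete (simulated data with illustrative parameters; the quantities $\hat\mu$ and $\hat\sigma^2$ below are the usual maximum likelihood estimates, used here only as examples of inferences about $(\mu,\sigma^2)$): once the pair $\left(\sum x_i, \sum x_i^2\right)$ has been recorded, the raw sample is no longer needed:

```python
import random

random.seed(1)
n, mu, sigma = 50, 2.0, 1.5                      # illustrative values only
x = [random.gauss(mu, sigma) for _ in range(n)]

# The bivariate sufficient statistic: the raw sample can now be discarded.
s1, s2 = sum(x), sum(xi * xi for xi in x)

# Example inferences recovered from (s1, s2) alone: the usual MLEs.
mu_hat = s1 / n
sigma2_hat = s2 / n - mu_hat**2   # equals sum((x_i - x_bar)^2) / n
print(mu_hat, sigma2_hat)
```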
Suppose now that $\mu$ is unknown but $\sigma^2$ is known, so that we now have $\theta = \mu$ and
$$f_X(x|\theta) = \underbrace{\left(2\pi\sigma^2\right)^{-n/2}\exp\left(\frac{\mu}{\sigma^2}\sum_{i=1}^{n}x_i - \frac{n\mu^2}{2\sigma^2}\right)}_{g\left(\sum_{i=1}^{n}x_i,\; \theta\right)}\;\underbrace{\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}x_i^2\right)}_{h(x)}$$
By the factorization theorem we conclude that $\sum_{i=1}^{n} X_i$ is sufficient for $\theta = \mu$.
Suppose now that $\mu$ is known but $\sigma^2$ is unknown, so that now $\theta = \sigma^2$ and
$$f_X(x|\theta) = \underbrace{(2\pi\theta)^{-n/2}\exp\left(-\frac{1}{2\theta}\sum_{i=1}^{n}x_i^2 + \frac{\mu}{\theta}\sum_{i=1}^{n}x_i - \frac{n\mu^2}{2\theta}\right)}_{g\left(\left(\sum_{i=1}^{n}x_i,\; \sum_{i=1}^{n}x_i^2\right),\; \theta\right)} \cdot \underbrace{1}_{h(x)}$$
By the factorization theorem we conclude that the bivariate statistic $\left(\sum_{i=1}^{n}X_i, \sum_{i=1}^{n}X_i^2\right)$ is sufficient for $\theta = \sigma^2$. Note that $\sum_{i=1}^{n}X_i^2$ by itself is not sufficient for $\sigma^2$ unless $\mu = 0$. (Since $\mu$ is known, we could equally well have written $f_X(x|\theta) = (2\pi\theta)^{-n/2}\exp\left(-\frac{1}{2\theta}\sum_{i=1}^{n}(x_i-\mu)^2\right)$, so that the univariate statistic $\sum_{i=1}^{n}(X_i-\mu)^2$ is also sufficient for $\sigma^2$.)
Example 3.0.3 Let $X_1, X_2, \ldots, X_n$ be a random sample from the $U(0,\theta)$ distribution, i.e.
$$f_{X_i}(x_i|\theta) = \begin{cases} \dfrac{1}{\theta} & \text{if } 0 < x_i < \theta \\[1ex] 0 & \text{otherwise} \end{cases}$$
Note that $\theta$ is involved in the range of the distribution. Hence it is better if we write the p.d.f. of $X_i$ as
$$f_{X_i}(x_i|\theta) = \frac{1}{\theta}\, I_{(0,\theta)}(x_i)$$
where $I_{(0,\theta)}$ is the indicator function of the interval $(0,\theta)$. For any set $A$, the indicator function of $A$ is defined as
$$I_A(x) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A \end{cases}$$
Hence
$$f_X(x|\theta) = \prod_{i=1}^{n} \frac{1}{\theta}\, I_{(0,\theta)}(x_i) = \frac{1}{\theta^n}\prod_{i=1}^{n} I_{(0,\theta)}(x_i) = \underbrace{\frac{1}{\theta^n}\, I_{(0,\theta)}\left(\max_i x_i\right)}_{g(\max_i x_i,\; \theta)} \cdot \underbrace{I_{(0,\infty)}\left(\min_i x_i\right)}_{h(x)}$$
(the condition $0 < x_i < \theta$ for all $i$ is equivalent to $\max_i x_i < \theta$ together with $\min_i x_i > 0$, and the latter factor is free of $\theta$). Hence $\max_{1 \leq i \leq n} X_i$ is sufficient for $\theta$ by the factorization theorem.
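A minimal sketch (illustrative values only) makes this computationally explicit: the likelihood of a $U(0,\theta)$ sample can be evaluated from $\max_i x_i$ alone, so the full data add nothing:

```python
import random

def likelihood(theta, x):
    """Full-sample likelihood: prod_i (1/theta) * I(0 < x_i < theta)."""
    return theta ** (-len(x)) if all(0 < xi < theta for xi in x) else 0.0

def likelihood_from_max(theta, x_max, n):
    """The same likelihood computed from max(x) alone (observations assumed
    positive, as they are for U(0, theta) data)."""
    return theta ** (-n) if 0 < x_max < theta else 0.0

random.seed(0)
theta_true = 4.0                                   # illustrative value only
x = [random.uniform(0, theta_true) for _ in range(10)]

for theta in (3.0, 4.5, 6.0):
    print(theta, likelihood(theta, x), likelihood_from_max(theta, max(x), len(x)))
```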