Statistics 550 Notes 6
Reading: Section 1.5
I. Sufficiency: Review and Factorization Theorem
Motivation: The motivation for looking for sufficient statistics is that it is useful to condense the data $X$ to a statistic $T(X)$ that contains all the information about $\theta$ in the sample.
Definition: A statistic $Y = T(X)$ is sufficient for $\theta$ if the conditional distribution of $X$ given $Y = y$ does not depend on $\theta$ for any value of $y$.
Example 1: Let $X_1, \ldots, X_n$ be a sequence of independent Bernoulli random variables with $P(X_i = 1) = \theta$. Then $Y = \sum_{i=1}^n X_i$ is sufficient for $\theta$.
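As a concrete check (not part of the original notes), the short Python sketch below computes $P(X = x \mid Y = t)$ exactly for a binary sequence $x$ and several values of $\theta$: the conditional probability equals $1/\binom{n}{t}$ no matter what $\theta$ is. The function name `cond_prob` is ours, purely illustrative.

```python
from math import comb

def cond_prob(x, theta):
    """P(X = x | sum(X) = t) for x a binary tuple, X iid Bernoulli(theta)."""
    n, t = len(x), sum(x)
    p_x = theta**t * (1 - theta)**(n - t)               # P(X = x)
    p_t = comb(n, t) * theta**t * (1 - theta)**(n - t)  # P(sum(X) = t)
    return p_x / p_t                                    # = 1 / comb(n, t)

x = (1, 0, 1, 0, 0)
for theta in (0.2, 0.5, 0.9):
    print(theta, cond_prob(x, theta))  # 0.1 for every theta (= 1/C(5,2))
```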
Example 2: Let $X_1, \ldots, X_n$ be iid Uniform($0, \theta$). Consider the statistic $Y = \max_{1 \le i \le n} X_i$. We showed in Notes 4 that
$$ f_Y(y) = \begin{cases} \dfrac{n y^{n-1}}{\theta^n} & 0 \le y \le \theta \\[4pt] 0 & \text{elsewhere.} \end{cases} $$
For $y \le \theta$, we have
$$ f(x_1, \ldots, x_n \mid Y = y) = \frac{f(x_1, \ldots, x_n, Y = y)}{f_Y(y)} = \frac{\frac{1}{\theta^n}\, I_{y \le \theta}}{\frac{n y^{n-1}}{\theta^n}\, I_{y \le \theta}} = \frac{1}{n y^{n-1}}, $$
which does not depend on $\theta$. The conditional distribution only appears to depend on $\theta$: the indicator $I_{y \le \theta}$ cancels between the numerator and the denominator, and values $y > \theta$ need not be considered because under $P_\theta$ the event $Y > \theta$ has probability zero.
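A Monte Carlo sketch of the same fact (our construction, not from the notes; conditioning on $Y = y$ is approximated by a small bin around $y$): for two different values of $\theta$, the conditional distribution of the non-maximal observations given $\max_i X_i \approx 0.8$ looks like iid Uniform$(0, 0.8)$ in both cases.

```python
import numpy as np

rng = np.random.default_rng(0)

def conditional_rest(theta, y=0.8, width=0.02, n=3, reps=200_000):
    """Values of the non-maximal coordinates given max(X) falls near y."""
    x = rng.uniform(0, theta, size=(reps, n))
    m = x.max(axis=1)
    keep = np.abs(m - y) < width             # condition on max(X) ~ y
    rest = np.sort(x[keep], axis=1)[:, :-1]  # drop the maximum in each kept row
    return rest.ravel()

for theta in (1.0, 2.0):
    r = conditional_rest(theta)
    print(theta, r.mean(), np.quantile(r, [0.25, 0.5, 0.75]))
# Both values of theta give the moments of Uniform(0, 0.8): mean ~ 0.4,
# quartiles ~ (0.2, 0.4, 0.6).
```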
It is often hard to verify or disprove sufficiency of a statistic directly because we need to find the conditional distribution of the data given the statistic. The following theorem is often helpful.
Factorization Theorem: A statistic $T(X)$ is sufficient for $\theta$ if and only if there exist functions $g(t, \theta)$ and $h(x)$ such that
$$ p(x \mid \theta) = g(T(x), \theta)\, h(x) $$
for all $x$ and all $\theta \in \Theta$
(where $p(x \mid \theta)$ denotes the probability mass function for discrete data given the parameter $\theta$ and the probability density function for continuous data).
Proof: We prove the theorem for discrete data; the proof for continuous distributions is similar. First, suppose that the probability mass function factors as given in the theorem. Consider $P_\theta(X = x \mid T = t)$. If $t \ne T(x)$, then $P_\theta(X = x \mid T = t) = 0$ for all $\theta$. Suppose $t = T(x)$. We have
$$ P_\theta(T = t) = \sum_{x' : T(x') = t} p(x' \mid \theta), $$
so that
$$ P_\theta(X = x \mid T = t) = \frac{P_\theta(X = x, T = t)}{P_\theta(T = t)} = \frac{P_\theta(X = x)}{P_\theta(T = t)} = \frac{g(T(x), \theta)\, h(x)}{\sum_{x' : T(x') = t} g(T(x'), \theta)\, h(x')} = \frac{h(x)}{\sum_{x' : T(x') = t} h(x')}, $$
where the last equality holds because $g(T(x'), \theta) = g(t, \theta)$ for every $x'$ in the sum, so this common factor cancels. Thus, $P_\theta(X = x \mid T = t)$ does not depend on $\theta$, and $T(X)$ is sufficient for $\theta$ by the definition of sufficiency.

Conversely, suppose $T(X)$ is sufficient for $\theta$. Then the conditional distribution of $X \mid T(X)$ does not depend on $\theta$. Let $P(X = x \mid T = t) = k(x, t)$. Then, with $t = T(x)$,
$$ p(x \mid \theta) = k(x, T(x))\, P_\theta(T = T(x)). $$
Thus, we can take $h(x) = k(x, T(x))$ and $g(t, \theta) = P_\theta(T = t)$.
Example 1 Continued: Let $X_1, \ldots, X_n$ be a sequence of independent Bernoulli random variables with $P(X_i = 1) = \theta$. To show that $Y = \sum_{i=1}^n X_i$ is sufficient for $\theta$, we factor the probability mass function as follows:
$$ P(X_1 = x_1, \ldots, X_n = x_n \mid \theta) = \prod_{i=1}^n \theta^{x_i} (1 - \theta)^{1 - x_i} = \theta^{\sum_{i=1}^n x_i} (1 - \theta)^{n - \sum_{i=1}^n x_i} = \left( \frac{\theta}{1 - \theta} \right)^{\sum_{i=1}^n x_i} (1 - \theta)^n. $$
The pmf is of the form $g\!\left(\sum_{i=1}^n x_i, \theta\right) h(x_1, \ldots, x_n)$ where $h(x_1, \ldots, x_n) = 1$.
Example 2 continued: Let $X_1, \ldots, X_n$ be iid Uniform($0, \theta$). To show that $Y = \max_{1 \le i \le n} X_i$ is sufficient, we factor the pdf as follows:
$$ f(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^n \frac{1}{\theta}\, I_{0 \le x_i \le \theta} = \frac{1}{\theta^n}\, I_{\max_{1 \le i \le n} x_i \le \theta} \cdot I_{\min_{1 \le i \le n} x_i \ge 0}. $$
The pdf is of the form $g(\max_{1 \le i \le n} x_i, \theta)\, h(x_1, \ldots, x_n)$ where
$$ g(t, \theta) = \frac{1}{\theta^n}\, I_{t \le \theta}, \qquad h(x_1, \ldots, x_n) = I_{\min_{1 \le i \le n} x_i \ge 0}. $$
Example 3: Let $X_1, \ldots, X_n$ be iid Normal($\mu, \sigma^2$). The pdf factors as
$$ f(x_1, \ldots, x_n; \mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sigma \sqrt{2\pi}} \exp\left\{ -\frac{1}{2\sigma^2} (x_i - \mu)^2 \right\} = \frac{1}{(2\pi)^{n/2} \sigma^n} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \right\} $$
$$ = \frac{1}{(2\pi)^{n/2} \sigma^n} \exp\left\{ -\frac{1}{2\sigma^2} \left( \sum_{i=1}^n x_i^2 - 2\mu \sum_{i=1}^n x_i + n\mu^2 \right) \right\}. $$
The pdf is thus of the form $g\!\left( \sum_{i=1}^n x_i, \sum_{i=1}^n x_i^2, \mu, \sigma^2 \right) h(x_1, \ldots, x_n)$ where $h(x_1, \ldots, x_n) = 1$.
Thus, $\left( \sum_{i=1}^n x_i, \sum_{i=1}^n x_i^2 \right)$ is a two-dimensional sufficient statistic for $\theta = (\mu, \sigma^2)$, i.e., the distribution of $X_1, \ldots, X_n$ given $\left( \sum_{i=1}^n x_i, \sum_{i=1}^n x_i^2 \right)$ does not depend on $(\mu, \sigma^2)$.
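To illustrate numerically (a sketch we added; the data sets and the helper `loglik` are hypothetical), any two samples with the same $\left(\sum x_i, \sum x_i^2\right)$ have identical likelihoods at every $(\mu, \sigma^2)$, even when one is not a permutation of the other:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0])
# A second data set on the circle {v : sum(v) = 3, sum(v**2) = 5}, which is
# centered at (1, 1, 1) with radius sqrt(2); y is not a permutation of x.
u = np.array([1.0, 1.0, -2.0]) / np.sqrt(6.0)
y = np.ones(3) + np.sqrt(2.0) * u

def loglik(data, mu, sigma):
    """Log-likelihood of iid N(mu, sigma^2) data."""
    n = data.size
    return (-0.5 * n * np.log(2 * np.pi * sigma**2)
            - ((data - mu) ** 2).sum() / (2 * sigma**2))

print(x.sum(), y.sum(), (x**2).sum(), (y**2).sum())   # 3 3 5 5
for mu, sigma in [(0.0, 1.0), (1.5, 0.7), (-2.0, 3.0)]:
    print(mu, sigma, loglik(x, mu, sigma), loglik(y, mu, sigma))  # equal pairs
```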
A theorem for proving that a statistic is not sufficient:
Theorem 1: Let $T(X)$ be a statistic. If there exist some $\theta_1, \theta_2 \in \Theta$ and $x, y$ such that
(i) $T(x) = T(y)$;
(ii) $f(x \mid \theta_1) f(y \mid \theta_2) \ne f(x \mid \theta_2) f(y \mid \theta_1)$,
then $T(X)$ is not a sufficient statistic.
Proof: First, suppose one side of (ii) equals 0 and the other side does not. This implies that either $x$ or $y$ is in the support of $f(\cdot \mid \theta_1)$ but not of $f(\cdot \mid \theta_2)$, or vice versa. If $T(X)$ were sufficient, then since $P(X = x \mid T = T(x))$ does not depend on $\theta$, (i) would imply that both $x$ and $y$ lie in the support of both $f(\cdot \mid \theta_1)$ and $f(\cdot \mid \theta_2)$. Hence $T(X)$ is not sufficient.

Second, suppose both sides of (ii) are greater than zero, so that $f(x \mid \theta_1), f(y \mid \theta_2), f(x \mid \theta_2), f(y \mid \theta_1) > 0$. If $T(X)$ were sufficient, then since the distribution of $X$ given $T(X)$ does not depend on $\theta$, we must have
$$ \frac{f(x \mid T(x), \theta_1)}{f(x \mid T(x), \theta_2)} = \frac{f(y \mid T(y), \theta_1)}{f(y \mid T(y), \theta_2)}. \qquad (0.1) $$
The left-hand side of (0.1) equals
$$ \frac{f(x \mid T(x), \theta_1)}{f(x \mid T(x), \theta_2)} = \frac{f(x \mid \theta_1)\, f(T(x) \mid \theta_2)}{f(x \mid \theta_2)\, f(T(x) \mid \theta_1)} $$
and the right-hand side of (0.1) equals
$$ \frac{f(y \mid T(y), \theta_1)}{f(y \mid T(y), \theta_2)} = \frac{f(y \mid \theta_1)\, f(T(y) \mid \theta_2)}{f(y \mid \theta_2)\, f(T(y) \mid \theta_1)} = \frac{f(y \mid \theta_1)\, f(T(x) \mid \theta_2)}{f(y \mid \theta_2)\, f(T(x) \mid \theta_1)}, $$
using (i) for the last equality. Thus, from (0.1), we conclude that if $T(X)$ were sufficient, we would have
$$ \frac{f(x \mid \theta_1)}{f(x \mid \theta_2)} = \frac{f(y \mid \theta_1)}{f(y \mid \theta_2)}, \quad \text{so that} \quad f(x \mid \theta_1) f(y \mid \theta_2) = f(x \mid \theta_2) f(y \mid \theta_1). $$
Thus, (i) and (ii) show that $T(X)$ is not a sufficient statistic.
Example 4: Consider a series of three independent Bernoulli trials $X_1, X_2, X_3$ with probability of success $p$. Let $T = X_1 + 2X_2 + 3X_3$. Show that $T$ is not sufficient.
Let $x = (X_1 = 0, X_2 = 0, X_3 = 1)$ and $y = (X_1 = 1, X_2 = 1, X_3 = 0)$. We have $T(x) = T(y) = 3$. But
$$ f(x \mid p = 1/3)\, f(y \mid p = 2/3) = \left[ (2/3)^2 (1/3) \right] \cdot \left[ (2/3)^2 (1/3) \right] = 16/729 $$
$$ \ne f(x \mid p = 2/3)\, f(y \mid p = 1/3) = \left[ (1/3)^2 (2/3) \right] \cdot \left[ (1/3)^2 (2/3) \right] = 4/729. $$
Thus, by Theorem 1, $T$ is not sufficient.
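The same conclusion can be checked by brute force (an illustrative sketch; the name `cond_dist` is ours): enumerating all outcomes with $T = 3$ shows that the conditional distribution of $X$ given $T = 3$ changes with $p$, violating the definition of sufficiency.

```python
from itertools import product

def cond_dist(t, p):
    """P(X = x | T = t) for three iid Bernoulli(p) trials, T = X1 + 2*X2 + 3*X3."""
    pmf = {}
    for x in product((0, 1), repeat=3):
        if x[0] + 2 * x[1] + 3 * x[2] == t:
            pmf[x] = p ** sum(x) * (1 - p) ** (3 - sum(x))
    total = sum(pmf.values())
    return {x: q / total for x, q in pmf.items()}

print(cond_dist(3, 1/3))  # {(0,0,1): 2/3, (1,1,0): 1/3}
print(cond_dist(3, 2/3))  # {(0,0,1): 1/3, (1,1,0): 2/3} -- depends on p
```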
II. Implications of Sufficiency
We have said that reducing the data to a sufficient statistic does not sacrifice any information about $\theta$. We now justify this statement in two ways:
(1) We show that for any decision procedure, we can find a randomized decision procedure that is based only on the sufficient statistic and that has the same risk function.
(2) We show that any point estimator that is not a function of the sufficient statistic can be improved upon for a convex loss function.
(1) Let $\delta(X)$ be a decision procedure and $T(X)$ be a sufficient statistic. Consider the following randomized decision procedure [call it $\delta'(T(X))$]: based on $T(X)$, randomly draw $X'$ from the conditional distribution $X \mid T(X)$ (which does not depend on $\theta$ and is hence known) and take action $\delta(X')$.
$X'$ has the same distribution as $X$, so that $\delta'(T(X)) = \delta(X')$ has the same distribution as $\delta(X)$; hence $\delta'$ has the same risk function as $\delta$.
Example 2: $X \sim N(0, \sigma^2)$. $T(X) = |X|$ is sufficient because given $T(X) = t$, $X$ is equally likely to be $+t$ or $-t$ for all $\sigma^2$. Given $T = t$, construct $X'$ to be $\pm t$ with probability 0.5 each. Then $X' \sim N(0, \sigma^2)$.
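A simulation sketch of this reconstruction (ours, with illustrative names): generate $X \sim N(0, \sigma^2)$, reduce to $T = |X|$, draw $X' = \pm T$ with probability 0.5 each, and compare the distributions of $X$ and $X'$.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0
x = rng.normal(0.0, sigma, size=100_000)

t = np.abs(x)                                  # sufficient statistic T = |X|
signs = rng.choice([-1.0, 1.0], size=t.shape)  # draw X' from X | T: +/- t, prob 1/2
x_prime = signs * t

# X' matches X in distribution: compare moments and quantiles.
for arr, name in [(x, "X"), (x_prime, "X'")]:
    print(name, arr.mean().round(3), arr.std().round(3),
          np.quantile(arr, [0.1, 0.5, 0.9]).round(3))
```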
(2) The Rao-Blackwell Theorem.
Convex functions: A real-valued function $\phi$ defined on an open interval $I = (a, b)$ is convex if for any $a < x < y < b$ and $0 < \lambda < 1$,
$$ \phi[\lambda x + (1 - \lambda) y] \le \lambda \phi(x) + (1 - \lambda) \phi(y). $$
$\phi$ is strictly convex if the inequality is strict. If $\phi''$ exists, then $\phi$ is convex if and only if $\phi'' \ge 0$ on $I = (a, b)$.
A convex function lies above all its tangent lines.
Convexity of loss functions:
For point estimation:
• Squared error loss is strictly convex.
• Absolute error loss is convex but not strictly convex.
• Huber's loss function,
$$ l(\theta, a) = \begin{cases} (q(\theta) - a)^2 & \text{if } |q(\theta) - a| \le k \\ 2k\, |q(\theta) - a| - k^2 & \text{if } |q(\theta) - a| > k \end{cases} $$
for some constant $k$, is convex but not strictly convex.
• The zero-one loss function
$$ l(\theta, a) = \begin{cases} 0 & \text{if } |q(\theta) - a| \le k \\ 1 & \text{if } |q(\theta) - a| > k \end{cases} $$
is nonconvex. (A numerical check of these convexity claims is sketched below.)
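Here is the promised numerical check (our sketch, assuming $k = 1$ and writing $u = q(\theta) - a$): a function is convex on a grid only if all of its second differences are nonnegative.

```python
import numpy as np

k = 1.0
losses = {
    "squared":  lambda u: u**2,
    "absolute": lambda u: np.abs(u),
    "huber":    lambda u: np.where(np.abs(u) <= k, u**2, 2*k*np.abs(u) - k**2),
    "zero-one": lambda u: np.where(np.abs(u) <= k, 0.0, 1.0),
}

u = np.linspace(-3, 3, 601)           # grid of u = q(theta) - a values
for name, loss in losses.items():
    second_diff = np.diff(loss(u), n=2)
    print(name, "convex on grid:", bool((second_diff >= -1e-9).all()))
# squared/absolute/huber: True; zero-one: False (it jumps down at |u| = k)
```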
Jensen's Inequality (Appendix B.9):
Let $X$ be a random variable. (i) If $\phi$ is convex in an open interval $I$, $P(X \in I) = 1$, and $E(X)$ is finite, then
$$ \phi(E[X]) \le E[\phi(X)]. $$
(ii) If $\phi$ is strictly convex, then $\phi(E[X]) < E[\phi(X)]$ unless $X$ equals a constant with probability one.
Proof of (i): Let $L(x)$ be a tangent line to $\phi(x)$ at the point $x = E[X]$. Write $L(x) = a + bx$. By the convexity of $\phi$, $\phi(x) \ge a + bx$. Since expectations preserve inequalities,
$$ E[\phi(X)] \ge E[a + bX] = a + b E[X] = L(E[X]) = \phi(E[X]), $$
as was to be shown.
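A quick Monte Carlo illustration (ours, not from the notes): for the strictly convex $\phi(x) = x^2$ and $X \sim$ Exponential(1), $\phi(E[X]) = 1$ while $E[\phi(X)] = 2$.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=1.0, size=1_000_000)  # E[X] = 1, E[X^2] = 2

phi = lambda v: v**2          # strictly convex
print(phi(x.mean()))          # ~1.0 = phi(E[X])
print(phi(x).mean())          # ~2.0 = E[phi(X)] > phi(E[X])
```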
Rao-Blackwell Theorem: Let $T(X)$ be a sufficient statistic. Let $\delta$ be a point estimate of $q(\theta)$ and assume that the loss function $l(\theta, d)$ is strictly convex in $d$. Also assume that $R(\theta, \delta) < \infty$. Let $\eta(t) = E[\delta(X) \mid T(X) = t]$. Then $R(\theta, \eta) < R(\theta, \delta)$ unless $\delta(X) = \eta(T(X))$ with probability one.
Proof: Fix $\theta$. Apply Jensen's inequality with $\phi(d) = l(\theta, d)$, letting $X$ have the conditional distribution of $X \mid T(X) = t$ for a particular choice of $t$. By Jensen's inequality,
$$ l(\theta, \eta(t)) \le E[\, l(\theta, \delta(X)) \mid T(X) = t \,], \qquad (0.2) $$
with strict inequality unless $\delta(X)$ is constant given $T(X) = t$. Taking the expectation over $T(X)$ on both sides of this inequality yields $R(\theta, \eta) < R(\theta, \delta)$ unless $\delta(X) = \eta(T(X))$ with probability one.
Comments:
(1) Sufficiency ensures that $\eta(t) = E[\delta(X) \mid T(X) = t]$ is an estimator (i.e., it depends only on $t$ and not on $\theta$).
(2) If the loss is convex rather than strictly convex, we get $\le$ in (0.2).
(3) The theorem is not true without convexity of the loss function.
Consequence of the Rao-Blackwell theorem: For convex loss functions, we can dispense with randomized estimators. A randomized estimator randomly chooses the estimate according to a distribution $Y(x)$ that is known once $X = x$ is observed. A randomized estimator can be obtained as an estimator $\delta^*(X, U)$ where $X$ and $U$ are independent and $U$ is uniformly distributed on $(0, 1)$: observe $X = x$ and then use $U$ to generate a draw from the distribution of $Y(x)$. For the data $(X, U)$, $X$ is sufficient. Thus, by the Rao-Blackwell Theorem, the nonrandomized estimator $E[\delta^*(X, U) \mid X]$ dominates $\delta^*(X, U)$ for strictly convex loss functions.
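A simulation sketch of Rao-Blackwellization in the Bernoulli model (our example, with illustrative names): start from the crude unbiased estimator $\delta(X) = X_1$ and condition on the sufficient statistic $T = \sum_i X_i$. Since $E[X_1 \mid T = t] = t/n$, the improved estimator is $\bar{X}$, and its risk under squared error loss is smaller by a factor of $n$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, theta, reps = 10, 0.3, 200_000

x = rng.binomial(1, theta, size=(reps, n))
delta = x[:, 0].astype(float)   # naive unbiased estimator: the first observation
eta = x.mean(axis=1)            # its Rao-Blackwellization: E[X1 | sum X = t] = t/n

print("MSE delta:", ((delta - theta) ** 2).mean())  # ~ theta(1-theta) = 0.21
print("MSE eta:  ", ((eta - theta) ** 2).mean())    # ~ theta(1-theta)/n = 0.021
```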
III. Minimal Sufficiency
For any model, there are many sufficient statistics.
Example 1: For $X_1, \ldots, X_n$ iid Bernoulli($\theta$), $T(X) = \sum_{i=1}^n X_i$ and $T'(X) = (X_1, \ldots, X_n)$ are both sufficient, but $T$ provides a greater reduction of the data.
Definition: A statistic $T(X)$ is minimally sufficient if it is sufficient and it provides a reduction of the data that is at least as great as that of any other sufficient statistic $S(X)$, in the sense that we can find a transformation $r$ such that $T(X) = r(S(X))$.
Comments:
(1) To say that we can find a transformation $r$ such that $T(X) = r(S(X))$ means that if $S(x) = S(y)$, then $T(x)$ must equal $T(y)$.
(2) Data reduction in terms of a particular statistic can be thought of as a partition of the sample space. A statistic $T(X)$ partitions the sample space into sets $A_t = \{ x : T(x) = t \}$. If a statistic $T(X)$ is minimally sufficient, then for any other sufficient statistic $S(X)$, which partitions the sample space into sets $B_s = \{ x : S(x) = s \}$, every set $B_s$ must be a subset of some $A_t$. Thus, the partition associated with a minimal sufficient statistic is the coarsest possible partition for a sufficient statistic, and in this sense the minimal sufficient statistic achieves the greatest possible data reduction for a sufficient statistic.
A useful theorem for finding a minimal sufficient statistic is the following:
Theorem 2 (Lehmann and Scheffé, 1950): Suppose $S(X)$ is a sufficient statistic for $\theta$. Also suppose that for every two sample points $x$ and $y$, the ratio $f(x \mid \theta) / f(y \mid \theta)$ is constant as a function of $\theta$ only if $S(x) = S(y)$. Then $S(X)$ is a minimal sufficient statistic for $\theta$.
Proof: Let $T(X)$ be any statistic that is sufficient for $\theta$. By the factorization theorem, there exist functions $g$ and $h$ such that $f(x \mid \theta) = g(T(x), \theta)\, h(x)$. Let $x$ and $y$ be any two sample points with $T(x) = T(y)$. Then
$$ \frac{f(x \mid \theta)}{f(y \mid \theta)} = \frac{g(T(x), \theta)\, h(x)}{g(T(y), \theta)\, h(y)} = \frac{h(x)}{h(y)}. $$
Since this ratio does not depend on $\theta$, the assumption of the theorem implies that $S(x) = S(y)$. Thus, $T(x) = T(y)$ implies $S(x) = S(y)$, so $S(X)$ induces at least as coarse a partition of the sample space as $T(X)$, and consequently $S(X)$ is minimally sufficient.
Example 1 continued: Consider the ratio
$$ \frac{f(x \mid \theta)}{f(y \mid \theta)} = \frac{\theta^{\sum_{i=1}^n x_i} (1 - \theta)^{n - \sum_{i=1}^n x_i}}{\theta^{\sum_{i=1}^n y_i} (1 - \theta)^{n - \sum_{i=1}^n y_i}}. $$
This ratio is constant as a function of $\theta$ if and only if $\sum_{i=1}^n x_i = \sum_{i=1}^n y_i$. Since we have shown that $T(X) = \sum_{i=1}^n X_i$ is a sufficient statistic, it follows from Theorem 2 that $T(X) = \sum_{i=1}^n X_i$ is a minimal sufficient statistic.
Note that a minimal sufficient statistic is not unique: any one-to-one function of a minimal sufficient statistic is also minimally sufficient. For example, $T'(X) = \frac{1}{n} \sum_{i=1}^n X_i$ is a minimal sufficient statistic for the iid Bernoulli case.
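A numerical check of the ratio criterion (our sketch; `lik` is an illustrative helper): the likelihood ratio of two samples is constant in $\theta$ when their sums agree and varies with $\theta$ when they do not.

```python
import numpy as np

def lik(x, theta):
    """Bernoulli likelihood theta^sum(x) * (1 - theta)^(n - sum(x))."""
    x = np.asarray(x)
    return theta ** x.sum() * (1 - theta) ** (x.size - x.sum())

thetas = np.linspace(0.1, 0.9, 9)
x = np.array([1, 1, 0, 0, 0])
y = np.array([0, 0, 1, 0, 1])   # same sum as x
z = np.array([1, 1, 1, 0, 0])   # different sum

print(np.round(lik(x, thetas) / lik(y, thetas), 6))  # constant (all ones)
print(np.round(lik(x, thetas) / lik(z, thetas), 6))  # varies with theta
```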
Example 2: Suppose $X_1, \ldots, X_n$ are iid uniform on the interval $(\theta, \theta + 1)$, $-\infty < \theta < \infty$. Then the joint pdf of $X$ is
$$ f(x \mid \theta) = \begin{cases} 1 & \theta < x_i < \theta + 1,\ i = 1, \ldots, n \\ 0 & \text{otherwise} \end{cases} = \begin{cases} 1 & \max_i x_i - 1 < \theta < \min_i x_i \\ 0 & \text{otherwise.} \end{cases} $$
The statistic $T(X) = (\min_i X_i, \max_i X_i)$ is a sufficient statistic by the factorization theorem, with $g(T(x), \theta) = I(\max_i x_i - 1 < \theta < \min_i x_i)$ and $h(x) = 1$.
For any two sample points $x$ and $y$, the numerator and denominator of the ratio $f(x \mid \theta) / f(y \mid \theta)$ are positive for the same values of $\theta$ if and only if $\min_i x_i = \min_i y_i$ and $\max_i x_i = \max_i y_i$; if the minima and maxima are equal, then the ratio is constant and in fact equals 1. Thus, $T(X) = (\min_i X_i, \max_i X_i)$ is a minimal sufficient statistic by Theorem 2.
Example 2 is a case in which the dimension of the minimal sufficient statistic (2) does not match the dimension of the parameter (1). There are models in which the dimension of the minimal sufficient statistic is equal to the sample size, e.g., $X_1, \ldots, X_n$ iid Cauchy($\theta$),
$$ f(x \mid \theta) = \frac{1}{\pi [1 + (x - \theta)^2]} $$
(Problem 1.5.15).
IV. Ancillary Statistics
A statistic $T(X)$ is ancillary if its distribution does not depend on $\theta$.
Example 4: Suppose our model is $X_1, \ldots, X_n$ iid $N(\mu, 1)$. Then $\bar{X}$ is a sufficient statistic and $(X_1 - \bar{X}, \ldots, X_n - \bar{X})$ is an ancillary statistic.
Although ancillary statistics contain no information about $\theta$ when the model is true, ancillary statistics are useful for checking the validity of the model.
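For instance (a simulation sketch we added), the joint distribution of the residuals $(X_1 - \bar{X}, \ldots, X_n - \bar{X})$ is the same whether $\mu = 0$ or $\mu = 10$, which is what makes them usable for model checking without knowledge of $\mu$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 5, 100_000

for mu in (0.0, 10.0):
    x = rng.normal(mu, 1.0, size=(reps, n))
    resid = x - x.mean(axis=1, keepdims=True)   # ancillary: (X_i - Xbar)
    print(mu, resid.std().round(3),
          np.quantile(resid[:, 0], [0.1, 0.5, 0.9]).round(3))
# Identical residual summaries for mu = 0 and mu = 10.
```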