Statistics 550 Notes 9
Reading: Section 1.6.4-1.6.5, 2.1
I. Multiparameter exponential families
The family of distributions of a model $\{P_\theta : \theta \in \Theta\}$ is said to be a $k$-parameter exponential family if there exist real-valued functions $\eta_1, \ldots, \eta_k, B$ of $\theta$ such that the pdf or pmf may be written as
$$p(x|\theta) = h(x)\exp\Big\{\sum_{j=1}^{k} \eta_j(\theta) T_j(x) - B(\theta)\Big\} \quad (1.1)$$
By the factorization theorem, $T(X) = (T_1(X), \ldots, T_k(X))$ is a sufficient statistic.
Example 1: $X_1, \ldots, X_n$ iid $N(\mu, \sigma^2)$ is a two-parameter exponential family.
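To see this, write the normal density in the form (1.1):
$$p(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{-\frac{(x-\mu)^2}{2\sigma^2}\Big\} = \frac{1}{\sqrt{2\pi}} \exp\Big\{\frac{\mu}{\sigma^2}\,x - \frac{1}{2\sigma^2}\,x^2 - \Big(\frac{\mu^2}{2\sigma^2} + \log\sigma\Big)\Big\},$$
so $\eta_1(\theta) = \mu/\sigma^2$, $\eta_2(\theta) = -1/(2\sigma^2)$, $T_1(x) = x$, $T_2(x) = x^2$, and $B(\theta) = \mu^2/(2\sigma^2) + \log\sigma$. For the iid sample, the sufficient statistic is $T(X) = \big(\sum_{i=1}^{n} X_i, \sum_{i=1}^{n} X_i^2\big)$.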
Example 2: Multinomial. Suppose we observe $n$ independent trials where each trial can end up in one of $k$ possible categories $\{1, \ldots, k\}$ with probabilities $\theta = \{p_1, \ldots, p_{k-1}, p_k = 1 - p_1 - \cdots - p_{k-1}\}$. Let $y_1(x), \ldots, y_k(x)$ be the number of outcomes in categories $1, \ldots, k$ in the $n$ trials. Then,
$$\begin{aligned}
p(x|\theta) &= \frac{n!}{y_1(x)! \cdots y_k(x)!}\, p_1^{y_1(x)} \cdots p_k^{y_k(x)} \\
&= \frac{n!}{y_1(x)! \cdots y_k(x)!}\, \Big(\frac{p_1}{p_k}\Big)^{y_1(x)} \cdots \Big(\frac{p_{k-1}}{p_k}\Big)^{y_{k-1}(x)} p_k^{\,n} \\
&= \frac{n!}{y_1(x)! \cdots y_k(x)!}\, \exp\big[y_1(x)\log(p_1/p_k) + \cdots + y_{k-1}(x)\log(p_{k-1}/p_k) + n \log p_k\big] \\
&= \frac{n!}{y_1(x)! \cdots y_k(x)!}\, \exp\Big[y_1(x)\log(p_1/p_k) + \cdots + y_{k-1}(x)\log(p_{k-1}/p_k) - n \log\Big(1 + \sum_{i=1}^{k-1} \exp\Big(\log\frac{p_i}{p_k}\Big)\Big)\Big]
\end{aligned}$$
The multinomial is a $(k-1)$-parameter exponential family with $\eta = (\log(p_1/p_k), \ldots, \log(p_{k-1}/p_k))$, $T(x) = (y_1(x), \ldots, y_{k-1}(x))$, and $A(\eta) = n \log\big(1 + \sum_{i=1}^{k-1} \exp(\eta_i)\big)$.
Moments of Sufficient Statistics: As with the one-parameter exponential family, it is convenient to index the family by $\eta = (\eta_1, \ldots, \eta_k)$. The analogue of Theorem 1.6.2 that calculates the moments of the sufficient statistics is Corollary 1.6.1:
$$E_{\eta_0}[T(X)] = \Big(\frac{\partial A}{\partial \eta_1}(\eta_0), \ldots, \frac{\partial A}{\partial \eta_k}(\eta_0)\Big)^T$$
$$\mathrm{Var}_{\eta_0}[T(X)] = \Big(\frac{\partial^2 A}{\partial \eta_a \partial \eta_b}(\eta_0)\Big)_{a,b=1}^{k},$$
where $\mathrm{Var}_{\eta_0}[T(X)]$ denotes the covariance matrix of $T(X)$.
Example 2 continued: For the multinomial distribution,
$$E[y_j(x)] = \frac{\partial}{\partial \eta_j}\Big[n \log\Big(1 + \sum_{i=1}^{k-1} e^{\eta_i}\Big)\Big] = \frac{n e^{\eta_j}}{1 + \sum_{i=1}^{k-1} e^{\eta_i}} = n p_j$$
$$\mathrm{Cov}[y_i(x), y_j(x)] = \frac{\partial^2}{\partial \eta_i\, \partial \eta_j}\Big[n \log\Big(1 + \sum_{l=1}^{k-1} e^{\eta_l}\Big)\Big] = -\frac{n e^{\eta_i} e^{\eta_j}}{\big(1 + \sum_{l=1}^{k-1} e^{\eta_l}\big)^2} = -n p_i p_j, \quad i \neq j$$
$$\mathrm{Var}[y_i(x)] = \frac{\partial^2}{\partial \eta_i^2}\Big[n \log\Big(1 + \sum_{l=1}^{k-1} e^{\eta_l}\Big)\Big] = \frac{n e^{\eta_i}\big(1 + \sum_{l=1}^{k-1} e^{\eta_l}\big) - n e^{2\eta_i}}{\big(1 + \sum_{l=1}^{k-1} e^{\eta_l}\big)^2} = n p_i (1 - p_i).$$
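As a quick numerical sanity check on these formulas (not part of the notes; the trial count and cell probabilities below are arbitrary choices), a short R simulation compares empirical moments of multinomial counts with $np_j$, $np_i(1-p_i)$, and $-np_ip_j$:

# Simulation check of the multinomial moment formulas.
set.seed(1)
n <- 50                       # trials per multinomial draw
p <- c(0.2, 0.3, 0.5)         # (p1, p2, p3), so k = 3
draws <- rmultinom(100000, size = n, prob = p)  # 3 x 100000 matrix of counts

rowMeans(draws)               # should be close to n*p = (10, 15, 25)
cov(t(draws))                 # diagonal near n*p*(1-p); off-diagonal near -n*p_i*p_j
n * p * (1 - p)               # theoretical variances
-n * p[1] * p[2]              # theoretical Cov(y1, y2) = -3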
II. Conjugate Families of Prior Distributions (Section 1.6.5)
Consider a Bayesian model for the data in which the distribution of the data given the parameter $\theta$ is $p(x|\theta)$ where $\theta \in \Theta$. A family of prior distributions for $\theta$, $\{\pi(\theta|\gamma), \gamma \in \Gamma\}$, is a conjugate family of priors to this model if the posterior distribution for $\theta$, $p(\theta|x)$, also belongs to $\{\pi(\theta|\gamma), \gamma \in \Gamma\}$. Note that the parameters $\gamma$ of a prior distribution are often called hyperparameters.
Examples: In Notes 2, we showed that the beta family of priors is conjugate to the binomial model for the data, and in Notes 3, we showed that the normal family of priors is conjugate to the normal model for the data in which the variance is known.
Suppose $X_1, \ldots, X_n$ are iid from the $k$-parameter exponential family (1.1) so that
$$p(x|\theta) = \Big[\prod_{i=1}^{n} h(x_i)\Big] \exp\Big\{\sum_{j=1}^{k} \eta_j(\theta) \sum_{i=1}^{n} T_j(x_i) - nB(\theta)\Big\} \quad (1.2)$$
A conjugate exponential family of priors is obtained by letting $t_j = \sum_{i=1}^{n} T_j(x_i)$, $j = 1, \ldots, k$, and $n = t_{k+1}$ be "parameters" and treating $\theta$ as the variable of interest. That is, let $t = (t_1, \ldots, t_{k+1})^T$ and
$$\omega(t) = \int \cdots \int \exp\Big\{\sum_{j=1}^{k} t_j \eta_j(\theta) - t_{k+1} B(\theta)\Big\}\, d\theta_1 \cdots d\theta_k,$$
$$\Omega = \{(t_1, \ldots, t_{k+1}) : 0 < \omega(t_1, \ldots, t_{k+1}) < \infty\},$$
with integrals replaced by sums in the discrete case. We assume $\Omega$ is nonempty. Then,
Proposition 1.6.1: The $(k+1)$-parameter exponential family given by
$$\pi(\theta|t) = \exp\Big\{\sum_{j=1}^{k} \eta_j(\theta) t_j - t_{k+1} B(\theta) - \log \omega(t)\Big\}, \quad t = (t_1, \ldots, t_{k+1}) \in \Omega \quad (1.3)$$
is a conjugate family of prior distributions to $p(x|\theta)$ given by (1.2).
Note: We can view the prior distribution as saying that we have $t_{k+1}$ additional data points $x^*$ from $p(x|\theta)$ with sufficient statistics $T_1 = t_1, \ldots, T_k = t_k$, since the data $(x^*, x)$ has a pdf/pmf that is proportional to the joint distribution of $(\theta, x)$ in the Bayesian model.
Proof: If $p(x|\theta)$ is given by (1.2) and the prior distribution $\pi$ for $\theta$ is given by a member of the family (1.3), then
$$p(\theta|x) \propto p(x|\theta)\,\pi(\theta|t) \propto \exp\Big\{\sum_{j=1}^{k} \eta_j(\theta)\Big(\sum_{i=1}^{n} T_j(x_i) + t_j\Big) - (t_{k+1} + n)B(\theta)\Big\} \propto \pi(\theta|s),$$
where
$$s = (s_1, \ldots, s_{k+1})^T = \Big(t_1 + \sum_{i=1}^{n} T_1(x_i), \ldots,\; t_k + \sum_{i=1}^{n} T_k(x_i),\; t_{k+1} + n\Big).$$
Because two probability densities that are proportional must be equal, $p(\theta|x) = \pi(\theta|s)$.
Note that the prior parameter $t$ is simply updated to $s = t + a$, where $a = \big(\sum_{i=1}^{n} T_1(X_i), \ldots, \sum_{i=1}^{n} T_k(X_i), n\big)$.
Many well-known examples of conjugate families are special cases of Proposition 1.6.1.
Example: Suppose $X_1, \ldots, X_n$ are iid Bernoulli($\theta$). The $X_i$ follow a one-parameter exponential family
$$p(x|\theta) = \theta^x (1-\theta)^{1-x} = \exp\Big[x \log\Big(\frac{\theta}{1-\theta}\Big) + \log(1-\theta)\Big],$$
with $\eta(\theta) = \log\big(\frac{\theta}{1-\theta}\big)$, $B(\theta) = -\log(1-\theta)$, $T(x) = x$.
From Proposition 1.6.1, a conjugate family of prior distributions is
$$\pi(\theta|t_1, t_2) \propto \exp\big\{\eta(\theta) t_1 - t_2 B(\theta)\big\} = \exp\Big\{t_1 \log\Big(\frac{\theta}{1-\theta}\Big) + t_2 \log(1-\theta)\Big\} = \exp\big\{t_1 \log\theta + (t_2 - t_1)\log(1-\theta)\big\} = \theta^{t_1}(1-\theta)^{t_2 - t_1}.$$
This is proportional to a Beta($t_1 + 1, t_2 - t_1 + 1$) distribution.
Thus, the beta family of prior distributions is a conjugate family to the Bernoulli (or binomial) model. For $r$ and $s$ integers, we can view a Beta($r, s$) prior as saying that we have $r + s - 2$ additional data points of which $r - 1$ are successes.
In Example 2 of Notes 2, we put a Beta(53,47) prior on Shaquille O'Neal's probability $\theta$ of making a free throw shot. This is saying that our prior information is equivalent to observing Shaq take 98 free throw shots and make 52 of the shots.
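To make the updating rule $s = t + a$ concrete, here is a minimal R sketch of the beta-binomial conjugate update; the 20 additional shots below are hypothetical, not data from the notes:

# Prior Beta(r, s); data: k successes in m Bernoulli trials;
# by Proposition 1.6.1 the posterior is Beta(r + k, s + m - k).
r <- 53; s <- 47     # Shaq's Beta(53, 47) prior from Notes 2
k <- 14; m <- 20     # hypothetical new data: 14 makes in 20 attempts
r_post <- r + k
s_post <- s + (m - k)
c(prior_mean = r / (r + s), posterior_mean = r_post / (r_post + s_post))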
III. Methods of Estimation: Basic Heuristics of Estimation
Basic Setup: Family of possible distributions for data $X$: $\{p(x|\theta), \theta \in \Theta\}$. Observe data $X$.
Point estimation: Best estimate of $\theta$ based on data $X$.
We discussed the decision theoretic approach to evaluating point estimates, focusing particularly on squared error as a loss function, which results in mean squared error as the risk function. But how do we come up with possible estimates of $\theta$?
Example Estimation Problems:
(1) Bernoulli data: We observe $X_1, \ldots, X_n$ iid Bernoulli($\theta$) (e.g., Shaq's free throws). How do we estimate $\theta$?
(2) Regression: We are interested in the mean of a response $Y$ given covariates $X_1, \ldots, X_p$ and assume a model $E(Y | X_1, \ldots, X_p) = g(\beta, X_1, \ldots, X_p)$, where $g$ is a known function and $\beta$ is the unknown parameter vector.
Example: Life insurance companies are keenly interested in predicting how long their customers will live because their premiums and profitability depend on such numbers. An actuary for one insurance company gathered data from 100 recently deceased male customers. She recorded Y = the age at death of the customer, X1 = the age at death of his mother, X2 = the age at death of his father, X3 = the mean age at death of his grandmothers, and X4 = the mean age at death of his grandfathers.
Multiple linear regression model:
$$E(Y | X_1, X_2, X_3, X_4) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4$$
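As a sketch of how such a model is fit in practice, here is the least squares fit in R; the data frame lifedata and its columns are hypothetical stand-ins for the actuary's data:

# Fit the multiple linear regression by least squares.
# 'lifedata' is a hypothetical data frame with columns Y, X1, X2, X3, X4
# holding the ages at death described above (not the actual study data).
fit <- lm(Y ~ X1 + X2 + X3 + X4, data = lifedata)
coef(fit)    # estimates of beta_0, beta_1, ..., beta_4

Least squares is itself an instance of the minimum contrast heuristic introduced below (Example 2.1.1).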
(3) Parameter estimation in an iid model: As part of a study to estimate the population size of the bowhead whale, Raftery and Zeh wanted to understand the distribution of whale swimming speeds. They randomly sampled the time to swim 1 km for 210 whales and believe that the gamma model
$$p(x | p, \lambda) = \frac{\lambda^p x^{p-1} e^{-\lambda x}}{\Gamma(p)}$$
is a reasonable model for this data. How do we estimate $p$ and $\lambda$?
(4) Hardy-Weinberg equilibrium: If gene frequencies are in equilibrium, then for a gene with two alleles, the genotypes AA, Aa and aa occur in a population with frequencies $(1-\theta)^2$, $2\theta(1-\theta)$, $\theta^2$, respectively, according to the Hardy-Weinberg law. In a sample from the Chinese population of Hong Kong in 1937, blood types occurred with the following frequencies, where M and N are erythrocyte antigens:

Blood Type    M     MN    N     Total
Frequency     342   500   187   1029

We can model the observed blood types as an iid sample from a multinomial distribution with probabilities $(1-\theta)^2$, $2\theta(1-\theta)$, $\theta^2$. How do we estimate $\theta$?
Minimum contrast heuristic: Choose a contrast function $\rho(X, \theta)$ that measures the "discrepancy" between the data $X$ and the parameter vector $\theta$. The range of the contrast function is typically taken to be the real numbers greater than or equal to zero, and the smaller the value of the contrast function, the more "plausible" $\theta$ is based on the data $X$.
Let $\theta_0$ denote the true parameter. Define the population discrepancy $D(\theta_0, \theta)$ as the expected value of the discrepancy $\rho(X, \theta)$:
$$D(\theta_0, \theta) = E_{\theta_0}\,\rho(X, \theta) \quad (1.4)$$
In order for $\rho(X, \theta)$ to be a valid contrast function, we require that $D(\theta_0, \theta)$ is uniquely minimized for $\theta = \theta_0$, i.e., $D(\theta_0, \theta) > D(\theta_0, \theta_0)$ if $\theta \neq \theta_0$.
$\theta = \theta_0$ is the minimizer of $D(\theta_0, \theta)$. Although we don't know $D(\theta_0, \theta)$, the contrast function $\rho(X, \theta)$ is an unbiased estimate of $D(\theta_0, \theta)$ (see (1.4)). The minimum contrast heuristic is to estimate $\theta$ by minimizing $\rho(X, \theta)$, i.e.,
$$\hat{\theta} = \arg\min_{\theta} \rho(X, \theta).$$
Example 1: Suppose $X_1, \ldots, X_n$ iid Bernoulli($p$), $0 \le p \le 1$. The following is an example of a contrast function and an associated estimate:
"Least Squares":
$$\rho(X, p) = \sum_{i=1}^{n} (X_i - p)^2.$$
$$D(p_0, p) = E_{p_0}\Big[\sum_{i=1}^{n} (X_i - p)^2\Big] = np_0 - 2npp_0 + np^2.$$
We have
$$\frac{\partial D(p_0, p)}{\partial p} = -2np_0 + 2np,$$
and it can be verified by the second derivative test that
$$\arg\min_p D(p_0, p) = p_0.$$
Thus, $\rho(X, p) = \sum_{i=1}^{n} (X_i - p)^2$ is a valid contrast function.
The associated estimate is
$$\hat{p} = \arg\min_p \rho(X, p) = \arg\min_p \sum_{i=1}^{n} (X_i - p)^2 = \arg\min_p \Big(np^2 - 2p \sum_{i=1}^{n} X_i\Big) = \frac{\sum_{i=1}^{n} X_i}{n}.$$
The following is an example of a function that is not a contrast function:
$$\rho(X, p) = \sum_{i=1}^{n} (X_i - p)^4$$
$$D(p_0, p) = E_{p_0}\Big[\sum_{i=1}^{n} (X_i - p)^4\Big] = E_{p_0}\Big[\sum_{i=1}^{n} \big(X_i^4 - 4X_i^3 p + 6X_i^2 p^2 - 4X_i p^3 + p^4\big)\Big] = n\big(p_0 - 4p_0 p + 6p_0 p^2 - 4p_0 p^3 + p^4\big),$$
using $X_i^j = X_i$ for Bernoulli $X_i$. For $p_0 = 0.7$, we find that $D(p_0, p)$ is minimized at about $p = 0.57 \neq p_0$, so this $\rho$ is not a valid contrast function.
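This claim is easy to check numerically in R (a quick sketch that minimizes the per-observation discrepancy over [0, 1]):

# D(p0, p) for the quartic discrepancy with p0 = 0.7, per observation (n = 1).
p0 <- 0.7
D <- function(p) p0 - 4*p0*p + 6*p0*p^2 - 4*p0*p^3 + p^4
optimize(D, interval = c(0, 1))$minimum    # about 0.57, not p0 = 0.7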
Least squares methods for estimating regressions can be viewed as minimum contrast estimates (Example 2.1.1).
Estimating Equation Heuristic:
Suppose $\theta$ is $d$-dimensional. Consider a $d$-dimensional function $\Psi(X, \theta)$ and define
$$V(\theta_0, \theta) = E_{\theta_0}\,\Psi(X, \theta).$$
Suppose $V(\theta_0, \theta) = 0$ has $\theta = \theta_0$ as its unique solution for $\theta \in \Theta$. We do not know $V(\theta_0, \theta)$, but $\Psi(X, \theta)$ is an unbiased estimate of $V(\theta_0, \theta)$. The estimating equation heuristic is to estimate $\theta$ by solving $\Psi(X, \theta) = 0$, i.e., by finding $\hat{\theta}$ such that
$$\Psi(X, \hat{\theta}) = 0.$$
$\Psi(X, \theta) = 0$ is called an estimating equation.
Method of Moments: Suppose $X_1, \ldots, X_n$ iid from $\{p(x|\theta), \theta \in \Theta\}$ where $\theta$ is $d$-dimensional.
Let $\mu_1(\theta), \ldots, \mu_d(\theta)$ denote the first $d$ moments of the population we are sampling from (assuming that they exist):
$$\mu_j(\theta) = E_\theta(X^j), \quad 1 \le j \le d.$$
Define the $j$th sample moment $\hat{\mu}_j$ by
$$\hat{\mu}_j = \frac{1}{n}\sum_{i=1}^{n} X_i^j, \quad 1 \le j \le d.$$
The function
$$\Psi(X, \theta) = (\hat{\mu}_1 - \mu_1(\theta), \ldots, \hat{\mu}_d - \mu_d(\theta))$$
is an estimating equation for which
$$V(\theta_0, \theta_0) = E_{\theta_0}\,\Psi(X, \theta_0) = \big(E_{\theta_0}\hat{\mu}_1 - \mu_1(\theta_0), \ldots, E_{\theta_0}\hat{\mu}_d - \mu_d(\theta_0)\big) = 0.$$
For many models, $V(\theta_0, \theta) \neq 0$ for all $\theta \neq \theta_0$.
Suppose $\theta \to (\mu_1(\theta), \ldots, \mu_d(\theta))$ is a 1-1 continuous function from $\mathbb{R}^d$ to $\mathbb{R}^d$. Then the estimating equation estimate of $\theta$ based on $\Psi(X, \theta)$ is the $\hat{\theta}$ that solves $\Psi(X, \hat{\theta}) = 0$, i.e.,
$$\hat{\mu}_j - \mu_j(\hat{\theta}) = 0, \quad j = 1, \ldots, d.$$
Example 2: $X_1, \ldots, X_n$ iid Uniform$(0, \theta)$.
$$\mu_1(\theta) = \frac{\theta}{2}.$$
The method of moments estimator solves
$$\bar{X} - \frac{\hat{\theta}}{2} = 0,$$
i.e., $\hat{\theta} = 2\bar{X}$.
Example 3: $X_1, \ldots, X_n$ iid $N(\mu, \sigma^2)$.
$$\mu_1(\theta) = \mu, \qquad \mu_2(\theta) = \sigma^2 + \mu^2.$$
The method of moments estimator solves
$$\bar{X} - \hat{\mu} = 0$$
$$\frac{\sum_{i=1}^{n} X_i^2}{n} - \hat{\sigma}^2 - \hat{\mu}^2 = 0.$$
Thus, $\hat{\mu} = \bar{X}$ and
$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} X_i^2}{n} - \bar{X}^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n}.$$
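The same heuristic answers the estimation question in the whale example (problem (3)): for the gamma model, $\mu_1 = p/\lambda$ and $\mu_2 = p(p+1)/\lambda^2$, so solving the two moment equations gives $\hat{\lambda} = \hat{\mu}_1/(\hat{\mu}_2 - \hat{\mu}_1^2)$ and $\hat{p} = \hat{\lambda}\hat{\mu}_1$. A minimal R sketch, using simulated data in place of the 210 whale swimming times (which we do not have):

# Method of moments for the gamma model p(x | p, lambda).
# Simulated stand-in for the whale data; true p = 4, lambda = 2.
set.seed(550)
x <- rgamma(210, shape = 4, rate = 2)

m1 <- mean(x)                    # first sample moment
m2 <- mean(x^2)                  # second sample moment
lambda_hat <- m1 / (m2 - m1^2)
p_hat <- lambda_hat * m1
c(p_hat = p_hat, lambda_hat = lambda_hat)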
Large sample motivation for method of moments:
A reasonable requirement for a point estimator is that it should converge to the true parameter value as we collect more and more information.
Suppose $X_1, \ldots, X_n$ iid.
A point estimator $h(X_1, \ldots, X_n)$ of a parameter $q(\theta)$ is consistent if $h(X_1, \ldots, X_n) \xrightarrow{P} q(\theta)$ as $n \to \infty$ for all $\theta \in \Theta$.
Definition of convergence in probability (A.14.1, page 466): $h(X_1, \ldots, X_n) \xrightarrow{P} q(\theta)$ means that for all $\epsilon > 0$,
$$\lim_{n \to \infty} P\big[|h(X_1, \ldots, X_n) - q(\theta)| \ge \epsilon\big] = 0.$$
Under certain regularity conditions, the method of moments estimator is consistent. We give a proof for a special case. Let $g(\theta) = (\mu_1(\theta), \ldots, \mu_d(\theta))$. By the assumptions in formulating the method of moments, $g$ is a 1-1 continuous function from $\mathbb{R}^d$ to $\mathbb{R}^d$. The method of moments estimator solves
$$g(\hat{\theta}) - (\hat{\mu}_1, \ldots, \hat{\mu}_d) = 0.$$
When $g$'s range is $\mathbb{R}^d$, then $\hat{\theta} = g^{-1}(\hat{\mu}_1, \ldots, \hat{\mu}_d)$. We prove the method of moments estimator is consistent when $\hat{\theta} = g^{-1}(\hat{\mu}_1, \ldots, \hat{\mu}_d)$ and $g^{-1}$ is continuous.
Sketch of Proof: The method of moments estimator solves
$$\hat{\mu}_j - \mu_j(\hat{\theta}) = 0, \quad j = 1, \ldots, d.$$
By the law of large numbers,
$$(\hat{\mu}_1, \ldots, \hat{\mu}_d) \xrightarrow{P} (\mu_1(\theta), \ldots, \mu_d(\theta)).$$
By the open mapping theorem (A.14.8, page 467), since $g^{-1}$ is assumed to be continuous,
$$\hat{\theta} = g^{-1}(\hat{\mu}_1, \ldots, \hat{\mu}_d) \xrightarrow{P} g^{-1}(\mu_1(\theta), \ldots, \mu_d(\theta)) = \theta.$$
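To illustrate, a small R simulation (the true $\theta = 3$ and the sample sizes are arbitrary choices) shows the estimator $\hat{\theta} = 2\bar{X}$ from Example 2 approaching $\theta$ as $n$ grows:

# Consistency of the method of moments estimator theta_hat = 2 * mean(x)
# for X_1, ..., X_n iid Uniform(0, theta), illustrated by simulation.
set.seed(9)
theta <- 3
for (n in c(10, 100, 1000, 10000)) {
  x <- runif(n, min = 0, max = theta)
  cat("n =", n, " theta_hat =", 2 * mean(x), "\n")
}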
Comments on method of moments:
(1) Instead of using the first $d$ moments, we could use higher order moments instead, leading to different estimating equations. But the method of moments estimator may be altered by which moments we choose.
Example: $X_1, \ldots, X_n$ iid Poisson($\lambda$). The first moment is $\mu_1(\lambda) = E(X) = \lambda$. Thus, the method of moments estimator based on the first moment is $\hat{\lambda} = \bar{X}$.
We could also consider using the second moment to form a method of moments estimator:
$$\mu_2(\lambda) = E(X^2) = \lambda + \lambda^2.$$
The method of moments estimator based on the second moment solves
$$\frac{1}{n}\sum_{i=1}^{n} X_i^2 = \hat{\lambda} + \hat{\lambda}^2.$$
Solving this equation (by taking the positive root), we find that
$$\hat{\lambda} = -\frac{1}{2} + \Big(\frac{1}{4} + \frac{1}{n}\sum_{i=1}^{n} X_i^2\Big)^{1/2}.$$
The two method of moments estimators are different. For example, for the data
> rpois(10,1)
[1] 2 3 0 1 2 1 3 1 2 1
the method of moments estimator based on the first moment is $\bar{X} = 1.6$ and the method of moments estimator based on the second moment is $-\frac{1}{2} + \big(\frac{1}{4} + 3.4\big)^{1/2} \approx 1.4105$.
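These values can be reproduced in R from the data above:

# Compute both method of moments estimators for the Poisson data shown.
x <- c(2, 3, 0, 1, 2, 1, 3, 1, 2, 1)
mean(x)                          # first-moment estimator: 1.6
-1/2 + sqrt(1/4 + mean(x^2))     # second-moment estimator: about 1.4105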
(2) The method of moments does not use all the information that is available. Consider $X_1, \ldots, X_n$ iid Uniform$(0, \theta)$. The method of moments estimator based on the first moment is $\hat{\theta} = 2\bar{X}$. If $2\bar{X} < \max_i X_i$, we know that $\theta \ge \max_i X_i > \hat{\theta}$.