Statistics 550 Notes 10
Reading: Section 2.1.
Take-home midterm: I will e-mail it to you by the morning
of Saturday, October 14th. It will be due Monday, October
23rd by 5 p.m.
I. Method of Moments
Suppose $X_1, \ldots, X_n$ are iid from $\{p(x \mid \theta), \theta \in \Theta\}$ where $\theta$ is $d$-dimensional.
Let $\mu_1(\theta), \ldots, \mu_d(\theta)$ denote the first $d$ moments of the population we are sampling from (assuming that they exist),
$$\mu_j(\theta) = E_\theta(X^j), \quad 1 \le j \le d.$$
Define the $j$th sample moment $\hat{\mu}_j$ by
$$\hat{\mu}_j = \frac{1}{n} \sum_{i=1}^n X_i^j, \quad 1 \le j \le d.$$
The function
$$\Psi(X, \theta) = (\hat{\mu}_1 - \mu_1(\theta), \ldots, \hat{\mu}_d - \mu_d(\theta))$$
is an estimating equation for which
$$V(\theta_0, \theta_0) = E_{\theta_0} \Psi(X, \theta_0) = (E_{\theta_0}\hat{\mu}_1 - \mu_1(\theta_0), \ldots, E_{\theta_0}\hat{\mu}_d - \mu_d(\theta_0)) = 0.$$
For many models, $V(\theta_0, \theta) \ne 0$ for all $\theta \ne \theta_0$, so that $\Psi(X, \hat{\theta}) = 0$ is a valid estimating equation.
Suppose $\theta \mapsto (\mu_1(\theta), \ldots, \mu_d(\theta))$ is a 1-1 continuous function from $\mathbb{R}^d$ to $\mathbb{R}^d$. Then the estimating equation estimate of $\theta$ based on $\Psi(X, \theta)$ is the $\hat{\theta}$ that solves $\Psi(X, \hat{\theta}) = 0$, i.e.,
$$\hat{\mu}_j - \mu_j(\hat{\theta}) = 0, \quad j = 1, \ldots, d.$$
Example 4: $X_1, \ldots, X_n$ iid Gamma$(p, \lambda)$, with density
$$f(x \mid p, \lambda) = \frac{\lambda^p x^{p-1} e^{-\lambda x}}{\Gamma(p)} \quad \text{for } x > 0$$
(see Section B.2.2 of Bickel and Doksum). The first two moments of the gamma distribution are
$$E_{p,\lambda}(X) = \frac{p}{\lambda}, \qquad E_{p,\lambda}(X^2) = \frac{p(p+1)}{\lambda^2}$$
(see Exercise B.2.3, page 526).
The method of moments estimator solves
$$\bar{X} - \frac{\hat{p}}{\hat{\lambda}} = 0$$
$$\frac{1}{n}\sum_{i=1}^n X_i^2 - \frac{\hat{p}(\hat{p}+1)}{\hat{\lambda}^2} = 0$$
which yields
$$\hat{\lambda} = \frac{\bar{X}}{\frac{1}{n}\sum_{i=1}^n X_i^2 - \bar{X}^2} \quad \text{and} \quad \hat{p} = \bar{X}\hat{\lambda} = \frac{\bar{X}^2}{\frac{1}{n}\sum_{i=1}^n X_i^2 - \bar{X}^2}.$$
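These closed-form estimates are easy to compute; the following short R sketch is an illustration rather than part of the notes (the function name gamma_mom and the simulated parameters are my own choices).

# Method of moments for the Gamma(p, lambda) model:
#   lambda-hat = xbar / (m2 - xbar^2),  p-hat = xbar * lambda-hat,
# where m2 = (1/n) * sum(X_i^2) is the second sample moment.
gamma_mom <- function(x) {
  xbar <- mean(x)
  m2 <- mean(x^2)
  lambda_hat <- xbar / (m2 - xbar^2)
  p_hat <- xbar * lambda_hat
  c(p = p_hat, lambda = lambda_hat)
}

# Example with simulated data (true p and lambda chosen arbitrarily):
set.seed(550)
gamma_mom(rgamma(500, shape = 2, rate = 3))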
Example: The gamma model is frequently used for
describing precipitation levels. In a study of the natural
variability of rainfall, the rainfall of summer storms was
measured by a network of rain gauges in southern Illinois
for the years 1960-1964. 227 measurements were taken.
For these data, $\bar{X} = 0.224$ and $\frac{1}{n}\sum_{i=1}^{n} X_i^2 = 0.184$, so that the method of moments estimates are
$$\hat{\lambda} = \frac{\bar{X}}{\frac{1}{n}\sum_{i=1}^{n} X_i^2 - \bar{X}^2} = \frac{0.224}{0.184 - 0.224^2} = 1.674$$
$$\hat{p} = \frac{\bar{X}^2}{\frac{1}{n}\sum_{i=1}^{n} X_i^2 - \bar{X}^2} = \frac{0.224^2}{0.184 - 0.224^2} = 0.375$$
The following plot shows the Gamma($p = 0.375$, $\lambda = 1.674$) density plotted on the histogram. In order to make the visual comparison easy, the density was normalized to have a total area equal to the total area under the histogram, which is the number of observations times the bin width of the histogram, or $227 \times 0.2 = 45.4$.
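A plot of this kind can be produced in R along the following lines; this is a sketch that assumes the 227 measurements are stored in a vector rainfall (the data themselves are not reproduced in these notes) and uses the bin width of 0.2 stated above.

# Histogram of the rainfall data with the fitted Gamma(0.375, 1.674) density,
# rescaled by 227 * 0.2 = 45.4 so its area matches the histogram's area.
hist(rainfall, breaks = seq(0, max(rainfall) + 0.2, by = 0.2),
     xlab = "Rainfall", main = "Gamma fit to rainfall data")
curve(45.4 * dgamma(x, shape = 0.375, rate = 1.674), add = TRUE)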
Qualitatively, the fit of the gamma model to the data looks
reasonable; we will examine methods for assessing the
goodness of fit of a model to data in Chapter 4.
Large sample motivation for method of moments:
A reasonable requirement for a point estimator is that it
should converge to the true parameter value as we collect
more and more information.
Suppose $X_1, \ldots, X_n$ are iid. A point estimator $h(X_1, \ldots, X_n)$ of a parameter $q(\theta)$ is consistent if $h(X_1, \ldots, X_n) \xrightarrow{P} q(\theta)$ as $n \to \infty$ for all $\theta \in \Theta$.
Definition of convergence in probability (A.14.1, page 466): $h(X_1, \ldots, X_n) \xrightarrow{P} q(\theta)$ means that for all $\epsilon > 0$,
$$\lim_{n \to \infty} P[|h(X_1, \ldots, X_n) - q(\theta)| \ge \epsilon] = 0.$$
Under certain regularity conditions, the method of moments estimator is consistent. We give a proof for a special case.
Let $g(\theta) = (\mu_1(\theta), \ldots, \mu_d(\theta))$. By the assumptions made in formulating the method of moments, $g$ is a 1-1 continuous function from $\mathbb{R}^d$ to $\mathbb{R}^d$. The method of moments estimator solves
$$g(\hat{\theta}) - (\hat{\mu}_1, \ldots, \hat{\mu}_d) = 0.$$
When $g$'s range is $\mathbb{R}^d$, then $\hat{\theta} = g^{-1}(\hat{\mu}_1, \ldots, \hat{\mu}_d)$. We prove the method of moments estimator is consistent when $\hat{\theta} = g^{-1}(\hat{\mu}_1, \ldots, \hat{\mu}_d)$ and $g^{-1}$ is continuous.
Sketch of Proof: The method of moments estimator solves
$$\hat{\mu}_j - \mu_j(\hat{\theta}) = 0, \quad j = 1, \ldots, d.$$
By the law of large numbers,
$$(\hat{\mu}_1, \ldots, \hat{\mu}_d) \xrightarrow{P} (\mu_1(\theta), \ldots, \mu_d(\theta)).$$
By the open mapping theorem (A.14.8, page 467), since $g^{-1}$ is assumed to be continuous,
$$\hat{\theta} = g^{-1}(\hat{\mu}_1, \ldots, \hat{\mu}_d) \xrightarrow{P} g^{-1}(\mu_1(\theta), \ldots, \mu_d(\theta)) = \theta.$$
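The consistency result can be illustrated by simulation; in the following R sketch (my own illustration, with the true values p = 2 and lambda = 3 chosen arbitrarily), the gamma method of moments estimates move toward the truth as n grows.

# MoM estimates for Gamma(p, lambda) at increasing sample sizes.
set.seed(550)
for (n in c(100, 1000, 10000)) {
  x <- rgamma(n, shape = 2, rate = 3)
  xbar <- mean(x); m2 <- mean(x^2)
  lambda_hat <- xbar / (m2 - xbar^2)
  cat("n =", n, ": p-hat =", xbar * lambda_hat,
      ", lambda-hat =", lambda_hat, "\n")
}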
Comments on method of moments:
(1) Instead of using the first $d$ moments, we could use higher order moments (or other functions of the data; see Problem 2.1.13) instead, leading to different estimating equations. But the method of moments estimator can differ depending on which moments we choose.
Example: $X_1, \ldots, X_n$ iid Poisson($\theta$). The first moment is $\mu_1(\theta) = E_\theta(X) = \theta$. Thus, the method of moments estimator based on the first moment is $\hat{\theta} = \bar{X}$.
We could also consider using the second moment to form a method of moments estimator:
$$\mu_2(\theta) = E_\theta(X^2) = \theta + \theta^2.$$
The method of moments estimator based on the second moment solves
$$\frac{1}{n}\sum_{i=1}^n X_i^2 - \hat{\theta} - \hat{\theta}^2 = 0.$$
Solving this equation (by taking the positive root), we find that
$$\hat{\theta} = -\frac{1}{2} + \left( \frac{1}{4} + \frac{1}{n}\sum_{i=1}^n X_i^2 \right)^{1/2}.$$
The two method of moments estimators are different. For example, for the data
> rpois(10,1)
 [1] 2 3 0 1 2 1 3 1 2 1
the method of moments estimator based on the first moment is 1.1 and the method of moments estimator based on the second moment is 1.096872.
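Both estimators can be computed for any sample with a few lines of R; this sketch (the function name poisson_mom is my own) applies the two formulas to a fresh Poisson sample.

# Poisson MoM estimators:
#   first moment:  theta-hat = xbar
#   second moment: positive root of theta^2 + theta - m2 = 0.
poisson_mom <- function(x) {
  c(first_moment = mean(x),
    second_moment = -1/2 + sqrt(1/4 + mean(x^2)))
}

x <- rpois(10, 1)  # draws analogous to the sample shown above
poisson_mom(x)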
(2) The method of moments does not use all the information that is available.
Example: $X_1, \ldots, X_n$ iid Uniform$(0, \theta)$.
The method of moments estimator based on the first moment is $\hat{\theta} = 2\bar{X}$. If $2\bar{X} < \max_i X_i$, we know that $\theta \ge \max_i X_i > \hat{\theta}$.
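A small simulation (my own illustration, with theta = 1 and n = 20 assumed) shows how often the data themselves contradict the method of moments estimate in this way.

# Proportion of Uniform(0, 1) samples in which 2 * xbar < max(X_i),
# i.e., in which the MoM estimate is certainly below theta.
set.seed(550)
below_max <- replicate(1000, {
  x <- runif(20)
  2 * mean(x) < max(x)
})
mean(below_max)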
II. Minimum Contrast Heuristic
Minimum contrast heuristic: Choose a contrast function $\rho(X, \theta)$ that measures the "discrepancy" between the data $X$ and the parameter vector $\theta$. The range of the contrast function is typically taken to be the real numbers greater than or equal to zero, and the smaller the value of the contrast function, the more "plausible" $\theta$ is based on the data $X$.
Let $\theta_0$ denote the true parameter. Define the population discrepancy $D(\theta_0, \theta)$ as the expected value of the discrepancy $\rho(X, \theta)$:
$$D(\theta_0, \theta) = E_{\theta_0} \rho(X, \theta) \qquad (1.1)$$
In order for $\rho(X, \theta)$ to be a valid contrast function, we require that $D(\theta_0, \theta)$ is uniquely minimized for $\theta = \theta_0$, i.e.,
$$D(\theta_0, \theta) > D(\theta_0, \theta_0) \text{ if } \theta \ne \theta_0,$$
so that $\theta = \theta_0$ is the minimizer of $D(\theta_0, \theta)$. Although we don't know $D(\theta_0, \theta)$, the contrast function $\rho(X, \theta)$ is an unbiased estimate of $D(\theta_0, \theta)$ (see (1.1)). The minimum contrast heuristic is to estimate $\theta$ by minimizing $\rho(X, \theta)$, i.e.,
$$\hat{\theta} = \arg\min\nolimits_\theta \rho(X, \theta).$$
Example 1: Suppose $X_1, \ldots, X_n$ iid Bernoulli$(p)$, $0 \le p \le 1$. The following is an example of a contrast function and an associated estimate:
"Least Squares":
$$\rho(X, p) = \sum_{i=1}^n (X_i - p)^2,$$
$$D(p_0, p) = E_{p_0}\left[\sum_{i=1}^n (X_i - p)^2\right] = np_0 - 2npp_0 + np^2.$$
We have
$$\frac{\partial D(p_0, p)}{\partial p} = -2np_0 + 2np,$$
and it can be verified by the second derivative test that
$$\arg\min\nolimits_p D(p_0, p) = p_0.$$
Thus, $\rho(X, p) = \sum_{i=1}^n (X_i - p)^2$ is a valid contrast function. The associated estimate is
$$\hat{p} = \arg\min\nolimits_p \rho(X, p) = \arg\min\nolimits_p \sum_{i=1}^n (X_i - p)^2 = \arg\min\nolimits_p \left( np^2 - 2p \sum_{i=1}^n X_i \right) = \frac{\sum_{i=1}^n X_i}{n}.$$
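As a check, one can minimize the contrast numerically and compare with the closed form; in this R sketch the data are simulated (p = 0.3 is an arbitrary choice of mine).

# Minimize rho(X, p) = sum((X_i - p)^2) over p and compare with xbar.
set.seed(550)
x <- rbinom(50, size = 1, prob = 0.3)
rho <- function(p) sum((x - p)^2)
optimize(rho, interval = c(0, 1))$minimum  # numerical minimizer
mean(x)                                    # closed-form minimizer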
The following is an example of a function that is not a contrast function:
$$\rho(X, p) = \sum_{i=1}^n (X_i - p)^4.$$
Here
$$D(p_0, p)/n = E_{p_0}[(X_i - p)^4] = E_{p_0}[X_i^4 - 4X_i^3 p + 6X_i^2 p^2 - 4X_i p^3 + p^4] = p_0 - 4p_0 p + 6p_0 p^2 - 4p_0 p^3 + p^4,$$
using the fact that $X_i^k = X_i$ for a Bernoulli random variable. For $p_0 = 0.7$, we find that $D(p_0, p)/n$ is minimized at about $p = 0.57$, not at $p_0 = 0.7$, so $\rho(X, p)$ is not a valid contrast function.
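The claimed minimizer is easy to verify numerically; this R snippet (my own check) minimizes the formula for $D(p_0, p)/n$ derived above.

# D(p0, p)/n = p0 - 4*p0*p + 6*p0*p^2 - 4*p0*p^3 + p^4 at p0 = 0.7.
p0 <- 0.7
D_over_n <- function(p) p0 - 4 * p0 * p + 6 * p0 * p^2 - 4 * p0 * p^3 + p^4
optimize(D_over_n, interval = c(0, 1))$minimum  # about 0.57, not 0.7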
Least squares methods for estimating regressions can be viewed as minimum contrast estimates (Example 2.1.1 and Section 2.2.1).
III. The Plug-in Principle (Section 2.1.2)
Cox and Lewis (1966) reported 799 waiting times (in
seconds) between successive pulses along a nerve fiber.
Let $X_1, \ldots, X_{799}$ be the 799 waiting times.
"Nonparametric" model for nerve firings: $X_1, \ldots, X_{799}$ iid from a distribution with CDF $F$, with no further restrictions on $F$.
How do we estimate parameters such as the mean of $F$, the variance of $F$, or the skewness of $F$?
Estimating $F$: A natural estimate of $F$ is the empirical CDF. The empirical CDF $\hat{F}_n$ for a sample of size $n$ is the CDF that puts mass $1/n$ at each data point $X_i$. Formally,
$$\hat{F}_n(x) = \frac{\sum_{i=1}^n I(X_i \le x)}{n}.$$
The empirical CDF is a consistent estimate of $F$ as $n \to \infty$ in a strong sense (Glivenko-Cantelli Theorem):
$$\sup_x |\hat{F}_n(x) - F(x)| \xrightarrow{P} 0.$$
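In R, the empirical CDF is available directly via ecdf(); the following sketch uses a simulated placeholder sample, since the nerve firing data are not reproduced in these notes.

# ecdf() returns the step function Fhat_n; evaluating it agrees with the
# definition (1/n) * sum(I(X_i <= x)).
set.seed(550)
x <- rexp(100)     # placeholder sample
Fhat <- ecdf(x)
Fhat(0.5)          # Fhat_n(0.5)
mean(x <= 0.5)     # the same value, computed straight from the definition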
Plug-in principle: Consider a parameter $\theta$ that can be written as a function of $F$, i.e., $\theta = T(F)$. The plug-in estimator of $\theta = T(F)$ is $\hat{\theta} = T(\hat{F}_n)$.
Example 1:
(1) The mean. Let $\theta = T(F) = \int x \, dF(x)$. The plug-in estimator is $\hat{\theta} = \int x \, d\hat{F}_n(x) = \bar{X}$, the sample mean. For the nerve firing data, $\hat{\theta} = 0.219$.
(2) The variance. Let
$$\theta = T(F) = \int x^2 \, dF(x) - \left( \int x \, dF(x) \right)^2.$$
The plug-in estimator is
$$\hat{\sigma}^2 = \int x^2 \, d\hat{F}_n(x) - \left( \int x \, d\hat{F}_n(x) \right)^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - \left( \frac{1}{n}\sum_{i=1}^n X_i \right)^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2.$$
For the nerve firings data, $\hat{\sigma}^2 = 0.044$.
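Both plug-in estimates take one line each in R; in the sketch below the sample is a simulated placeholder for the nerve data (the exponential model and its rate are my own choices).

# Plug-in mean and variance; the variance divides by n, not n - 1,
# so it differs slightly from R's var().
plug_in <- function(x) {
  c(mean = mean(x),
    variance = mean(x^2) - mean(x)^2)  # equals mean((x - mean(x))^2)
}

set.seed(550)
plug_in(rexp(799, rate = 1 / 0.219))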
Comments on plug-in estimates:
(1) The plug-in estimator is a good estimator for the nonparametric model in which nothing is assumed about the CDF $F$.
(2) The plug-in estimator is generally consistent.
(3) However, for parametric models, we can often obtain estimators with better risk functions by utilizing the specific parametric structure.
(4) Plug-in estimators are often valuable as preliminary
estimates in algorithms that search for more efficient
estimates.