C22.0015 / B90.3302
NOTES for Wednesday 2011.MAR.09
Please see the handout with caption “A Summary of the Bayesian Method and Bayesian
Point of View.”
We will consider the topic of decision theory. There is a separate handout for this
subject; this was distributed last week.
Here are some interesting developments of the 20th century . . .
Cantor’s notion of infinity (around 1900)
Gödel’s incompleteness theorem (1931)
Ronald Fisher’s foundations of statistical inference (1920s)
Neyman-Pearson notion of statistical hypothesis testing (1931)
Operations research (World War II)
Shannon’s information theory (about 1948)
Kuhn-Tucker theorem for constrained optimization (about 1948)
John von Neumann’s game theory (about 1948)
Statistical and economic utility theory (1950s)
Kahneman and Tversky’s prospect theory (1979)
This list was Gary Simon’s. People in the class also added Box-Jenkins methods for
time series and Bayesian methods.
Within just statistics, we’ve done many things that will “stick” through time. They are
probably not as important as the things in the list above, though. We could include the
Kaplan-Meier non-parametric survival function (1958), Cox survival analysis
(1972), Efron’s bootstrap methods (1970s), Laird-Rubin E-M method for missing data
(1970s). These have been around for a while and have proved very useful.
There are other things that we’ve worked on. We found them stimulating, we had fun
with them, we wrote lots of papers…. and then they have faded away. I’m going to call
these “toys.”
The statistical robustness movement of the 1970s was such a toy. Very few of the ideas
remain, and no one really talks about it anymore.
I am going to claim here that statistical decision theory is such a toy. Others might
disagree. I guess that talking about it at all provides some evidence that maybe it is
not a toy.
Of course, “toy” is an ex-post judgment. It’s hard to make that appraisal while
something is still at high mania. Perhaps MCMC is such a toy. Later we will mention
regression’s LASSO estimate. Could be a toy. What about networks?
There are several incredibly useful properties of maximum likelihood estimates. With
one notable category of exceptions, these estimates are asymptotically normal with the
optimum variance. What we mean here is that if $\hat{\theta}_{ML}$ is a maximum likelihood estimate, then $\dfrac{\hat{\theta}_{ML} - \theta}{\mathrm{SD}(\hat{\theta}_{ML})}$ is approximately normally distributed and, moreover, the standard deviation $\mathrm{SD}(\hat{\theta}_{ML})$ is as small as possible.
OK, what’s the category of exceptions? These occur when estimating parameters
which limit the support of the random variable. For example, if we have a sample
from U(0, $\theta$), then $\theta$ marks off the range of possible values. The maximum
likelihood estimate here will not have a limiting normal distribution.
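To make the exception concrete, here is a small Python simulation sketch (the settings are my own, chosen purely for illustration). For a sample from U(0, $\theta$) the MLE is the sample maximum, and it is $n(\theta - \hat{\theta}_{ML})$, not $\sqrt{n}(\hat{\theta}_{ML} - \theta)$, that has a non-degenerate limit, and that limit is exponential rather than normal.

```python
import numpy as np

# Sketch: for U(0, theta) the MLE is the sample maximum, and n*(theta - MLE)
# has an exponential limit (mean theta), not a normal one. Values are illustrative.
rng = np.random.default_rng(0)
theta, n, reps = 2.0, 200, 50000

theta_hat = rng.uniform(0, theta, size=(reps, n)).max(axis=1)
scaled = n * (theta - theta_hat)          # note the scaling by n, not sqrt(n)

skew = ((scaled - scaled.mean()) ** 3).mean() / scaled.std() ** 3
print("mean", scaled.mean(), "sd", scaled.std(), "skewness", skew)
# Mean and SD both come out near theta = 2, and the skewness is near 2 (the
# exponential value); a normal limit would give skewness near 0.
```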
We have a problem involving parameter $\theta$. At the moment, there are no other
parameters. We’ve written the likelihood L and we have found $\hat{\theta}_{ML}$. We are going to
assert that we have the approximate value of $\mathrm{SD}(\hat{\theta}_{ML})$, and also that $\dfrac{\hat{\theta}_{ML} - \theta}{\mathrm{SD}(\hat{\theta}_{ML})}$ is
approximately normally distributed.
For this maximum likelihood estimate, the value of $\mathrm{SD}(\hat{\theta}_{ML})$ is optimally small.
We can actually obtain the optimal limiting variance as $\dfrac{1}{I(\theta)}$, where $I(\theta)$ is Fisher’s
information in the data. Then, of course, $\mathrm{SD}(\hat{\theta}_{ML})$ will be the square root of this limiting
variance.
We need to see more about Fisher’s information.
How do we find $I(\theta)$? There are a number of ways. Let $f(x \mid \theta)$ be the likelihood for the
whole problem. Note here that x is used as a vector to represent the entire set of data.
We’ll use X (upper case) to denote the corresponding random variable. Let S be the score
random variable, defining this as
$$S = \frac{\partial}{\partial \theta} \log f(X \mid \theta)$$

This S is a random variable, but it is not a statistic; its form involves $\theta$, which is
unknown.
It can be shown that E S = 0. There are three ways to get $I(\theta)$:

(1)  $I(\theta) = E\,S^2$

(2)  $I(\theta) = \operatorname{Var} S$

(3)  $I(\theta) = E\left[-\dfrac{\partial^2}{\partial \theta^2}\log f(X \mid \theta)\right] = E\left[-\dfrac{\partial S}{\partial \theta}\right]$
Generally one way will be somewhat easier than the others.
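As a quick numerical check on these three routes (not from the notes), here is a Python sketch using a Poisson($\theta$) sample, a model I chose purely for illustration; there the score is $S = \sum X_i/\theta - n$ and all three routes give $I(\theta) = n/\theta$.

```python
import numpy as np

# Monte Carlo check that E S = 0 and that the three routes to I(theta) agree,
# for an illustrative Poisson(theta) sample (model and values are my assumptions).
rng = np.random.default_rng(1)
theta, n, reps = 3.0, 50, 100000

x = rng.poisson(theta, size=(reps, n))
S = x.sum(axis=1) / theta - n                   # score of the whole sample
neg_dS_dtheta = x.sum(axis=1) / theta ** 2      # -dS/dtheta for this model

print("E S              ", S.mean())             # near 0
print("(1) E S^2        ", (S ** 2).mean())      # near n/theta = 16.67
print("(2) Var S        ", S.var())              # near n/theta
print("(3) E[-dS/dtheta]", neg_dS_dtheta.mean()) # near n/theta
```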
Here’s a neat example. Suppose that $X_1, X_2, \ldots, X_n$ are independent random variables,
each N($\mu$, $\sigma^2$). Let’s suppose for this example that $\sigma$ is a known value. It’s pretty easy to
show that $\hat{\mu}_{MM} = \hat{\mu}_{ML} = \bar{X}$. Now let’s find $I(\mu)$. First, get the likelihood for the whole
sample:

$$L = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2\sigma^2}(x_i - \mu)^2} = \frac{1}{\sigma^n (2\pi)^{n/2}}\, e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2}$$
Now we need the score random variable:

$$\log L = -n\log\sigma - \frac{n}{2}\log(2\pi) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$$

$$S = \frac{\partial}{\partial\mu}\log L = -\frac{1}{2\sigma^2}\sum_{i=1}^{n} 2(x_i - \mu)(-1) = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu)$$

In random variable form, this is $S = \dfrac{1}{\sigma^2}\sum_{i=1}^{n}(X_i - \mu)$. This is not a statistic, as it involves
the unknown parameter $\mu$. It’s easy to see that E S = 0 here.
There are several ways to get $I(\mu)$, all pretty easy.

(1)  $I(\mu) = E\,S^2 = E\left[\dfrac{1}{\sigma^2}\sum_{i=1}^{n}(X_i - \mu) \cdot \dfrac{1}{\sigma^2}\sum_{j=1}^{n}(X_j - \mu)\right] = \dfrac{1}{\sigma^4}\sum_{i=1}^{n}\sum_{j=1}^{n} E(X_i - \mu)(X_j - \mu) = \dfrac{1}{\sigma^4}\, n\sigma^2 = \dfrac{n}{\sigma^2}$.

Consider the double sum. Any term with $i \ne j$ has expected value
zero. The n terms with $i = j$ each have expected value $\sigma^2$.

(2)  $I(\mu) = \operatorname{Var} S = \dfrac{1}{\sigma^4}\, n\sigma^2 = \dfrac{n}{\sigma^2}$. This is easy, but the third way is even
easier.

(3)  $I(\mu) = E\left[-\dfrac{\partial^2}{\partial\mu^2}\log L\right] = E\left[-\dfrac{\partial}{\partial\mu}\left(\dfrac{1}{\sigma^2}\sum_{i=1}^{n}(X_i - \mu)\right)\right] = \dfrac{1}{\sigma^2}\,E[\,n\,] = \dfrac{n}{\sigma^2}$.

Thus, the asymptotic variance of the maximum likelihood estimate is $\dfrac{1}{I(\mu)} = \dfrac{\sigma^2}{n}$.
For cases in which we have a sample, meaning n independent values sampled from the
same distribution, we have $I(\theta) = n\,I_1(\theta)$, where $I_1(\theta)$ is the information in one
observation. We can get this from the score random variable based on one observation,
generally identified as $S_1$.
For the example above, the likelihood for just the first observation is

$$L = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2\sigma^2}(x_1 - \mu)^2}$$

This leads to

$$\log L = -\log\sigma - \frac{1}{2}\log(2\pi) - \frac{1}{2\sigma^2}(x_1 - \mu)^2$$

Then

$$S = \frac{\partial}{\partial\mu}\log L = -\frac{1}{2\sigma^2}(x_1 - \mu)(-2) = \frac{1}{\sigma^2}(x_1 - \mu)$$

In random variable form, this is $S = \dfrac{1}{\sigma^2}(X_1 - \mu)$.
It follows that $I_1(\mu) = \operatorname{Var}(S) = \operatorname{Var}\left(\dfrac{1}{\sigma^2}(X_1 - \mu)\right) = \dfrac{1}{\sigma^4}\operatorname{Var}(X_1 - \mu) = \dfrac{1}{\sigma^4}\operatorname{Var}(X_1) = \dfrac{1}{\sigma^4}\,\sigma^2 = \dfrac{1}{\sigma^2}$.

This verifies, for this example, that $I(\mu) = n\,I_1(\mu)$.
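A minimal simulation sketch (all values below are my own illustrative choices) confirms that the sampling variance of $\bar{X}$ matches $1/I(\mu) = \sigma^2/n$:

```python
import numpy as np

# For N(mu, sigma^2) data with sigma known, the MLE of mu is xbar, and its
# sampling variance should match 1/I(mu) = sigma^2/n. Values are illustrative.
rng = np.random.default_rng(2)
mu, sigma, n, reps = 5.0, 2.0, 40, 100000

xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
print("simulated Var(xbar):", xbar.var())
print("sigma^2 / n        :", sigma ** 2 / n)   # 4/40 = 0.1
```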
OK, let’s work through the details for this particular situation.
Suppose that $x_1, x_2, \ldots, x_n$ are known values, all positive. Suppose that $Y_1, Y_2, \ldots, Y_n$ are
independent, with $Y_i \sim$ N($\beta x_i$, $x_i^2\sigma^2$). We can certainly get a method of moments
estimate for $\beta$. Observe that $E\,Y_i = \beta x_i$, so that

$$E\sum_{i=1}^{n} Y_i = \beta\sum_{i=1}^{n} x_i$$

Note that this is not a sample of values, all with the same
distribution. Each $Y_i$ has a possibly-different distribution. Thus
we cannot use the $n\,I_1$ logic.

By dividing by n, we get $E\,\bar{Y} = \beta\bar{x}$, so the method of moments estimate is $\hat{\beta}_{MM} = \dfrac{\bar{Y}}{\bar{x}}$.
As an interesting challenge, you might show that, since $\operatorname{Var} Y_i = x_i^2\sigma^2$,

$$\operatorname{Var}\hat{\beta}_{MM} = \frac{1}{\bar{x}^2}\operatorname{Var}\bar{Y} = \frac{\sigma^2}{n^2\bar{x}^2}\sum_{i=1}^{n} x_i^2$$
In what follows we have to worry about two parameters. For the sake of this example,
let’s think of $\sigma$ as known. (It actually won’t matter here.)
The likelihood for $Y_i$ is

$$f(y_i \mid x_i) = \frac{1}{x_i\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2 x_i^2\sigma^2}(y_i - \beta x_i)^2}$$
Based on this, we can write the likelihood for the whole problem:
$$L = \prod_{i=1}^{n}\left[\frac{1}{x_i\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2x_i^2\sigma^2}(y_i - \beta x_i)^2}\right] = \frac{1}{\sigma^n (2\pi)^{n/2}\prod_{i=1}^{n} x_i}\, e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\frac{(y_i - \beta x_i)^2}{x_i^2}}$$
We’ll need to take log L:
$$\log L = -n\log\sigma - \frac{n}{2}\log(2\pi) - \sum_{i=1}^{n}\log x_i - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\frac{(y_i - \beta x_i)^2}{x_i^2}$$
We could get the maximum likelihood estimates for both $\beta$ and $\sigma$. For now, we’ll just
worry about $\beta$, as noted above. Clearly we get that estimate by minimizing the sum from
the exponent. Thus we solve

$$\frac{\partial}{\partial\beta}\sum_{i=1}^{n}\frac{(y_i - \beta x_i)^2}{x_i^2} = \sum_{i=1}^{n}\frac{2(y_i - \beta x_i)(-x_i)}{x_i^2} = -2\sum_{i=1}^{n}\frac{x_i y_i - \beta x_i^2}{x_i^2} = -2\left[\sum_{i=1}^{n}\frac{y_i}{x_i} - n\beta\right] \stackrel{\text{set}}{=} 0$$

Clearly the solution is $\hat{\beta}_{ML} = \dfrac{1}{n}\sum_{i=1}^{n}\dfrac{Y_i}{x_i}$. This is a very unusual ratio estimate.
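As a small concrete illustration with made-up numbers (mine, not from the notes), the two estimates are just two different averages of the data:

```python
import numpy as np

# Toy illustration of the two estimators for the model Y_i ~ N(beta*x_i, x_i^2*sigma^2).
x = np.array([1.0, 2.0, 4.0, 5.0])    # known positive constants (made up)
y = np.array([1.2, 1.9, 4.5, 4.8])    # observed responses (made up)

beta_mm = y.mean() / x.mean()         # method of moments: Ybar / xbar
beta_ml = np.mean(y / x)              # maximum likelihood: average of the ratios Y_i / x_i
print("beta_MM =", round(beta_mm, 4), " beta_ML =", round(beta_ml, 4))
```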
Suppose we wanted to know its asymptotic variance. We need the score random
variable. (Sometimes this score random variable is found as part of the routine of getting
the maximum likelihood estimate, but not here.)
$$S = \frac{\partial}{\partial\beta}\log L = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}\frac{(y_i - \beta x_i)(-2x_i)}{x_i^2} = \frac{1}{\sigma^2}\sum_{i=1}^{n}\left(\frac{y_i}{x_i} - \beta\right)$$

In random variable form, this is

$$S = \frac{1}{\sigma^2}\sum_{i=1}^{n}\left(\frac{Y_i}{x_i} - \beta\right)$$
The easiest way to get $I(\beta)$ is as Var S:

$$I(\beta) = \operatorname{Var} S = \operatorname{Var}\left(\frac{1}{\sigma^2}\sum_{i=1}^{n}\left(\frac{Y_i}{x_i} - \beta\right)\right) = \frac{1}{\sigma^4}\sum_{i=1}^{n}\frac{\operatorname{Var} Y_i}{x_i^2} = \frac{1}{\sigma^4}\sum_{i=1}^{n}\frac{x_i^2\sigma^2}{x_i^2} = \frac{n}{\sigma^2}$$

It follows that the limiting variance of $\hat{\beta}_{ML}$ is $\dfrac{\sigma^2}{n}$. You can actually show that this is a
non-asymptotic result as well.
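A short simulation sketch (the x values, $\beta$, and $\sigma$ below are assumptions made only for this check) confirms both variance formulas: $\operatorname{Var}\hat{\beta}_{ML} = \sigma^2/n$ exactly, while $\operatorname{Var}\hat{\beta}_{MM} = \frac{\sigma^2}{n^2\bar{x}^2}\sum x_i^2$ is at least as large.

```python
import numpy as np

# Compare sampling variances of the two estimators under Y_i ~ N(beta*x_i, x_i^2*sigma^2).
# The x values, beta, and sigma are assumptions made only for this sketch.
rng = np.random.default_rng(3)
x = np.array([0.5, 1.0, 2.0, 3.0, 5.0, 8.0])
beta, sigma, reps = 1.5, 0.7, 200000
n = len(x)

y = rng.normal(beta * x, sigma * x, size=(reps, n))
beta_ml = (y / x).mean(axis=1)                # (1/n) * sum of Y_i / x_i
beta_mm = y.mean(axis=1) / x.mean()           # Ybar / xbar

print("Var beta_ML (sim):", beta_ml.var(), " theory:", sigma ** 2 / n)
print("Var beta_MM (sim):", beta_mm.var(),
      " theory:", sigma ** 2 * (x ** 2).sum() / (n * x.mean()) ** 2)
```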
Another example. Suppose that $X_1, X_2, \ldots, X_n$ is a sample from the exponential density

$$f(x) = \lambda e^{-\lambda x}\, I(x \ge 0)$$

Let’s find the maximum likelihood estimate. Begin with the likelihood

$$L = \prod_{i=1}^{n} \lambda e^{-\lambda x_i} = \lambda^n e^{-\lambda\sum_{i=1}^{n} x_i}$$

It follows that

$$\log L = n\log\lambda - \lambda\sum_{i=1}^{n} x_i$$

$$\frac{\partial}{\partial\lambda}\log L = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i \stackrel{\text{set}}{=} 0$$

It follows that $\hat{\lambda}_{ML} = \dfrac{n}{\sum_{i=1}^{n} x_i} = \dfrac{1}{\bar{x}}$. In random variable form, this is $\hat{\lambda}_{ML} = \dfrac{1}{\bar{X}}$. It’s
going to be very difficult to get a limiting distribution directly. Let’s use the asymptotic results,
based on the fact that this is a maximum likelihood estimate.
For observation 1, we have $\log L_1 = \log\lambda - \lambda x_1$. Then

$$S_1 = \frac{\partial}{\partial\lambda}\log L_1 = \frac{1}{\lambda} - X_1$$
Certainly $E\,S_1 = 0$. There are several ways to get $I_1(\lambda)$. Here’s the easiest:

$$I_1(\lambda) = E\left[-\frac{\partial^2}{\partial\lambda^2}\log L_1\right] = E\left[-\frac{\partial}{\partial\lambda}\left(\frac{1}{\lambda} - X_1\right)\right] = E\left[\frac{1}{\lambda^2}\right] = \frac{1}{\lambda^2}$$

It follows then that $I(\lambda) = n\,I_1(\lambda) = \dfrac{n}{\lambda^2}$.
We can certainly make an approximate 95% confidence interval based on $\hat{\lambda}_{ML} \pm 2\,\mathrm{SE}(\hat{\lambda}_{ML})$. Specifically, this is

$$\frac{1}{\bar{X}} \pm 2\,\frac{1}{\bar{X}\sqrt{n}}$$
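Here is a brief Python sketch of that interval, with a simulated exponential sample standing in for real data (the true $\lambda$ and the sample size are my own illustrative choices):

```python
import numpy as np

# Approximate 95% CI for lambda: 1/xbar +/- 2/(xbar*sqrt(n)).
# The data are simulated from an assumed lambda just to have numbers to show.
rng = np.random.default_rng(4)
true_lam, n = 0.8, 100
x = rng.exponential(scale=1 / true_lam, size=n)

lam_hat = 1 / x.mean()
half_width = 2 / (x.mean() * np.sqrt(n))
print("lambda_hat =", round(lam_hat, 3),
      " 95% CI: (", round(lam_hat - half_width, 3), ",", round(lam_hat + half_width, 3), ")")
```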
This next example was not done in class.
Let’s use this technology for a genuinely difficult problem.
Recall our censored likelihood problem. Suppose that $X_1, X_2, \ldots, X_n$ are
independent random variables, each from the density $\lambda e^{-\lambda x}$. Actually we are able
to observe the $X_i$’s only if they have a value T or below. This corresponds to an
experimental framework in which we are observing the lifetimes of n independent
objects (light bulbs, say), but the experiment ceases at time T.
Suppose that K of the $X_i$’s are observed; call these values $X_1, X_2, X_3, \ldots, X_K$.
The remaining $n - K$ values are censored at T; operationally, this means that
there were $n - K$ light bulbs still burning when the experiment stopped at time T.
The overall likelihood is

$$L = \left(e^{-\lambda T}\right)^{n-K} \lambda^K\, e^{-\lambda\sum_{i=1}^{K} x_i}$$
It is not at all obvious what would be the maximum likelihood estimate. Let’s take
logs:
$$\log L = -\lambda T(n - K) + K\log\lambda - \lambda\sum_{i=1}^{K} x_i$$

Then we found . . . eventually . . .

$$\hat{\lambda}_{ML} = \frac{K}{T(n - K) + \sum_{i=1}^{K} x_i}$$
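As a quick sanity check on this closed form (not part of the class notes), one can maximize the censored log-likelihood numerically and compare; the data below are simulated under assumed values of $\lambda$, T, and n.

```python
import numpy as np

# Check the closed-form censored-data MLE against a brute-force grid search over
# the censored log-likelihood. lambda, T, and n are assumed illustrative values.
rng = np.random.default_rng(5)
true_lam, T, n = 0.5, 3.0, 200
lifetimes = rng.exponential(scale=1 / true_lam, size=n)

observed = lifetimes[lifetimes <= T]          # the K uncensored values
K = len(observed)

lam_closed_form = K / (T * (n - K) + observed.sum())

grid = np.linspace(0.01, 5.0, 100000)
loglik = -grid * T * (n - K) + K * np.log(grid) - grid * observed.sum()
lam_grid = grid[np.argmax(loglik)]

print("closed form:", round(lam_closed_form, 4), " grid search:", round(lam_grid, 4))
```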
Let’s get its asymptotic variance.
S() =
d
K
log L =  T  n  K  

d

K
 x
i 1
i
Finding $E\,S^2(\lambda)$ or $\operatorname{Var} S(\lambda)$ will be very tricky, as we need to consider that K is
random, and the random variables $x_i$ have to be considered conditional on having a value
below T.
We’ll do the third method:

$$I(\lambda) = E\left[-\frac{\partial^2}{\partial\lambda^2}\log L\right] = E\left[-\frac{\partial S}{\partial\lambda}\right] = E\left[-\frac{\partial}{\partial\lambda}\left(-T(n - K) + \frac{K}{\lambda} - \sum_{i=1}^{K} x_i\right)\right] = E\left[\frac{K}{\lambda^2}\right] = \frac{1}{\lambda^2}\,E\,K$$
Here K is binomial with n trials and event probability $1 - e^{-\lambda T}$. Its expected value
is then $n\left(1 - e^{-\lambda T}\right)$. This leads to

$$I(\lambda) = \frac{n\left(1 - e^{-\lambda T}\right)}{\lambda^2}$$
The limiting variance of the maximum likelihood estimate is

$$\frac{1}{I(\lambda)} = \frac{\lambda^2}{n\left(1 - e^{-\lambda T}\right)}$$

In any actual use, we would use

$$\frac{1}{I(\hat{\lambda})} = \frac{\hat{\lambda}^2}{n\left(1 - e^{-\hat{\lambda} T}\right)}$$