Notes 12 - Wharton Statistics Department

Statistics 550 Notes 12
Reading: Section 2.4.2
The bisection method is a root finding algorithm which
works by repeatedly dividing an interval in half and then
selecting the subinterval in which the root exists.
The bisection method can be used to solve the likelihood
equation to find the MLE for a one-dimensional model (or,
as in the gamma model, a multidimensional model in which
solving the likelihood equation can be reduced to the
problem of solving one equation).
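The interval-halving idea described above can be sketched in a few lines of R. This is a minimal illustration, not code from the notes: the function name `bisect`, its arguments, and the test function $x^2 - 2$ are all assumptions made for the example, and the sketch assumes the function is continuous with opposite signs at the two endpoints.

```r
# Minimal bisection sketch (bisect is a hypothetical helper, not from the notes).
# Assumes f is continuous and f(lower), f(upper) have opposite signs.
bisect <- function(f, lower, upper, tol = 1e-8, maxit = 200) {
  flow <- f(lower)
  fupp <- f(upper)
  if (flow == 0) return(lower)
  if (fupp == 0) return(upper)
  if (flow * fupp > 0) stop("f(lower) and f(upper) must have opposite signs")
  for (it in seq_len(maxit)) {
    mid <- (lower + upper) / 2
    fmid <- f(mid)
    # Stop at an exact root or when the half-width is below the tolerance
    if (fmid == 0 || (upper - lower) / 2 < tol) return(mid)
    # Keep the half-interval on which f changes sign
    if (flow * fmid < 0) {
      upper <- mid
    } else {
      lower <- mid
      flow <- fmid
    }
  }
  (lower + upper) / 2
}

# Illustrative use: the positive root of x^2 - 2 on [0, 2]
root <- bisect(function(x) x^2 - 2, 0, 2)
```

To find an MLE in a one-parameter model, `f` would be the derivative of the log likelihood and the root would be a solution of the likelihood equation.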
I. Coordinate Ascent Method
The coordinate ascent method is an approach to finding the
maximum likelihood estimate in a multidimensional
family. The coordinate ascent method works by using the
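bisection method iteratively. Suppose we have a $k$-dimensional
parameter $(\theta_1, \ldots, \theta_k)$. The coordinate ascent
method is:
Choose an initial estimate $(\hat\theta_1, \ldots, \hat\theta_k)$.
0. Set $(\hat\theta_1, \ldots, \hat\theta_k)_{old} = (\hat\theta_1, \ldots, \hat\theta_k)$.
1. Maximize $l_x(\theta_1, \hat\theta_2, \ldots, \hat\theta_k)$ over $\theta_1$ using the bisection
method by solving $\frac{\partial}{\partial \theta_1} l_x(\theta_1, \hat\theta_2, \ldots, \hat\theta_k) = 0$ (assuming the
log likelihood is differentiable). Reset $\hat\theta_1$ to the $\theta_1$ that
maximizes $l_x(\theta_1, \hat\theta_2, \ldots, \hat\theta_k)$.
2. Maximize $l_x(\hat\theta_1, \theta_2, \hat\theta_3, \ldots, \hat\theta_k)$ over $\theta_2$ using the bisection
method. Reset $\hat\theta_2$ to the $\theta_2$ that maximizes
$l_x(\hat\theta_1, \theta_2, \hat\theta_3, \ldots, \hat\theta_k)$.
....
K. Maximize $l_x(\hat\theta_1, \hat\theta_2, \hat\theta_3, \ldots, \hat\theta_{k-1}, \theta_k)$ over $\theta_k$ using the
bisection method. Reset $\hat\theta_k$ to the $\theta_k$ that maximizes
$l_x(\hat\theta_1, \hat\theta_2, \hat\theta_3, \ldots, \hat\theta_{k-1}, \theta_k)$.
K+1. Stop if the distance between $(\hat\theta_1, \ldots, \hat\theta_k)_{old}$ and
$(\hat\theta_1, \ldots, \hat\theta_k)$ is less than some tolerance $\epsilon$. Otherwise return
to step 0.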
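The steps above can be sketched generically in R. This is a hedged illustration rather than the notes' own code: `coord_ascent`, `grad_toy`, and the search bounds are hypothetical names and values, the toy objective is a strictly concave quadratic chosen only because its maximizer $(0, 1)$ is known, and the one-dimensional solves use R's `uniroot` in place of a hand-written bisection.

```r
# Hypothetical helper: coordinate ascent that solves each partial-derivative
# equation with a one-dimensional root finder. grad(theta) returns the gradient.
coord_ascent <- function(grad, theta, lower, upper, tol = 1e-6, maxit = 500) {
  for (it in seq_len(maxit)) {
    theta_old <- theta                        # step 0: remember current estimate
    for (j in seq_along(theta)) {             # steps 1 to K: one coordinate at a time
      partial_j <- function(t) {
        th <- theta
        th[j] <- t
        grad(th)[j]
      }
      theta[j] <- uniroot(partial_j, c(lower[j], upper[j]), tol = 1e-10)$root
    }
    # step K+1: stop when the estimate has moved less than the tolerance
    if (sqrt(sum((theta - theta_old)^2)) < tol) break
  }
  theta
}

# Toy strictly concave objective (an assumption for illustration):
# l(t1, t2) = -(t1^2 + t2^2 + t1 * t2) + t1 + 2 * t2, maximized at (0, 1)
grad_toy <- function(th) c(-2 * th[1] - th[2] + 1, -2 * th[2] - th[1] + 2)
theta_hat <- coord_ascent(grad_toy, theta = c(5, -5),
                          lower = c(-10, -10), upper = c(10, 10))
```

The search bounds must bracket a sign change of each partial derivative at every iteration; otherwise `uniroot` stops with an error.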
The coordinate ascent method converges to the maximum
likelihood estimate when the log likelihood function is
strictly concave on the parameter space. See Theorem
2.4.2 and Figure 2.4.1 in Bickel and Doksum.
Example: Beta Distribution
$$p(x \mid r, s) = \frac{\Gamma(r+s)}{\Gamma(r)\Gamma(s)} x^{r-1} (1-x)^{s-1}$$
$$= \exp\{(r-1)\log x + (s-1)\log(1-x) - \log\Gamma(r) - \log\Gamma(s) + \log\Gamma(r+s)\}$$
for $0 < x < 1$ and $\Theta = \{(r, s) : 0 < r < \infty,\ 0 < s < \infty\}$.
This is a two parameter full exponential family.
We found the method of moments estimates in Homework
5.
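For $X_1, \ldots, X_n$ iid Beta($r, s$),
$$l_x(r, s) = (r-1)\sum_{i=1}^n \log X_i + (s-1)\sum_{i=1}^n \log(1 - X_i) - n\log\Gamma(r) - n\log\Gamma(s) + n\log\Gamma(r+s)$$
$$\frac{\partial l}{\partial r} = \sum_{i=1}^n \log X_i - n\frac{\partial \log\Gamma(r)}{\partial r} + n\frac{\partial \log\Gamma(r+s)}{\partial r}$$
$$\frac{\partial l}{\partial s} = \sum_{i=1}^n \log(1 - X_i) - n\frac{\partial \log\Gamma(s)}{\partial s} + n\frac{\partial \log\Gamma(r+s)}{\partial s}$$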
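The derivative of $\log\Gamma(z)$ is the digamma function, available in R as `digamma`, and $\log\Gamma$ itself is `lgamma`. The sketch below writes the log likelihood and its partial derivatives with these functions and checks the derivatives against central finite differences. The names `beta_loglik`, `beta_score`, and `x_demo` are assumptions for illustration, and the four data values are made up.

```r
# Beta log likelihood, written with lgamma()
beta_loglik <- function(r, s, x) {
  n <- length(x)
  (r - 1) * sum(log(x)) + (s - 1) * sum(log(1 - x)) -
    n * lgamma(r) - n * lgamma(s) + n * lgamma(r + s)
}

# Analytic partial derivatives, using digamma(z) = d/dz log Gamma(z)
beta_score <- function(r, s, x) {
  n <- length(x)
  c(dr = sum(log(x)) - n * digamma(r) + n * digamma(r + s),
    ds = sum(log(1 - x)) - n * digamma(s) + n * digamma(r + s))
}

x_demo <- c(0.2, 0.5, 0.7, 0.4)   # made-up values, for illustration only
h <- 1e-6
# Central finite differences of the log likelihood at (r, s) = (2, 3)
num_dr <- (beta_loglik(2 + h, 3, x_demo) - beta_loglik(2 - h, 3, x_demo)) / (2 * h)
num_ds <- (beta_loglik(2, 3 + h, x_demo) - beta_loglik(2, 3 - h, x_demo)) / (2 * h)
score <- beta_score(2, 3, x_demo)
```

If the formulas are right, `num_dr` and `num_ds` should agree with `score` up to finite-difference error.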
The following data come from an experiment on seeding
clouds to suppress hail in South Africa. Each observation
corresponds to one storm and is the ratio of the area where
total destruction of crops due to hail was reported to the
corresponding larger area where at least partial destruction
of crops due to hail was reported. The beta distribution was
thought to be a good model for these data (Mielke, 1975,
Journal of Applied Meteorology).
X = (0.33, 0.48, 0.54, 0.82, 0.58, 0.68, 0.65, 0.37, 0.42, 0.37, 0.48,
0.44, 0.46, 0.61, 0.53, 0.46, 0.45, 0.57, 0.22, 0.59, 0.68, 0.98, 0.48,
0.26, 0.40, 0.60, 0.98)
R code for finding the MLE:
# Code for beta distribution MLE
# xvec stores the data
# rhatcurr, shatcurr store current estimates of r and s
xvec <- c(0.33, 0.48, 0.54, 0.82, 0.58, 0.68, 0.65, 0.37, 0.42, 0.37, 0.48,
          0.44, 0.46, 0.61, 0.53, 0.46, 0.45, 0.57, 0.22, 0.59, 0.68, 0.98,
          0.48, 0.26, 0.40, 0.60, 0.98)
# To generate data from a Beta(r=2, s=3) distribution instead:
# xvec <- rbeta(20, 2, 3)
# Set low and high endpoints for the root-finding searches
rhatlow <- 0.001
rhathigh <- 20
shatlow <- 0.001
shathigh <- 20
# Use method of moments for starting values
rhatmom <- mean(xvec) * (mean(xvec) - mean(xvec^2)) / (mean(xvec^2) - mean(xvec)^2)
shatmom <- (1 - mean(xvec)) * (mean(xvec) - mean(xvec^2)) / (mean(xvec^2) - mean(xvec)^2)
rhatcurr <- rhatmom
shatcurr <- shatmom
# rhatiters, shatiters record the estimates at each iteration
rhatiters <- rhatcurr
shatiters <- shatcurr
# Partial derivative of the log likelihood with respect to r
derivrfunc <- function(r, s, xvec) {
  n <- length(xvec)
  sum(log(xvec)) - n * digamma(r) + n * digamma(r + s)
}
# Partial derivative of the log likelihood with respect to s
derivsfunc <- function(s, r, xvec) {
  n <- length(xvec)
  sum(log(1 - xvec)) - n * digamma(s) + n * digamma(r + s)
}
dist <- 1
toler <- 0.0001
# Coordinate ascent: update r given s, then s given the new r, until the
# estimates move less than toler. uniroot() solves each one-dimensional
# equation with Brent's method, which combines bisection with interpolation.
while (dist > toler) {
  rhatnew <- uniroot(derivrfunc, c(rhatlow, rhathigh), s = shatcurr, xvec = xvec)$root
  shatnew <- uniroot(derivsfunc, c(shatlow, shathigh), r = rhatnew, xvec = xvec)$root
  dist <- sqrt((rhatnew - rhatcurr)^2 + (shatnew - shatcurr)^2)
  rhatcurr <- rhatnew
  shatcurr <- shatnew
  rhatiters <- c(rhatiters, rhatcurr)
  shatiters <- c(shatiters, shatcurr)
}
rhatmle <- rhatcurr
shatmle <- shatcurr
> rhatmom
[1] 3.508473
> shatmom
[1] 3.056237
> rhatmle
[1] 2.507865
> shatmle
[1] 1.997723
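One way to cross-check these estimates is to maximize the log likelihood directly in both parameters at once. The sketch below does this with R's general-purpose optimizer `optim` (method `"L-BFGS-B"`), which is a different technique from the coordinate ascent used in the notes; the names `negloglik`, `negscore`, and `fit`, the starting point, and the lower bounds are assumptions made for the example.

```r
# Hail data from the notes
xvec <- c(0.33, 0.48, 0.54, 0.82, 0.58, 0.68, 0.65, 0.37, 0.42, 0.37, 0.48,
          0.44, 0.46, 0.61, 0.53, 0.46, 0.45, 0.57, 0.22, 0.59, 0.68, 0.98,
          0.48, 0.26, 0.40, 0.60, 0.98)
n <- length(xvec)

# Negative log likelihood in p = c(r, s); optim() minimizes by default
negloglik <- function(p) {
  -((p[1] - 1) * sum(log(xvec)) + (p[2] - 1) * sum(log(1 - xvec)) -
      n * lgamma(p[1]) - n * lgamma(p[2]) + n * lgamma(p[1] + p[2]))
}

# Gradient of the negative log likelihood
negscore <- function(p) {
  -c(sum(log(xvec)) - n * digamma(p[1]) + n * digamma(p[1] + p[2]),
     sum(log(1 - xvec)) - n * digamma(p[2]) + n * digamma(p[1] + p[2]))
}

# Start at (1, 1) (arbitrary) and keep both parameters positive
fit <- optim(c(1, 1), negloglik, negscore, method = "L-BFGS-B",
             lower = c(1e-3, 1e-3))
fit$par   # compare with rhatmle and shatmle reported above
```

Because the beta family is an exponential family, the log likelihood is concave, so both approaches should arrive at essentially the same maximizer, up to their stopping tolerances.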
Simulation Study:
We simulate data $X_1, \ldots, X_n$ iid from a Beta($r = 2$, $s = 3$)
distribution, using 1000 simulations.
Bias
                   n = 20    n = 1000
$\hat r_{MOM}$     0.287     0.004
$\hat r_{MLE}$     0.315     0.005
$\hat s_{MOM}$     0.471     0.003
$\hat s_{MLE}$     0.516     0.006
Root Mean Squared Error ($\sqrt{MSE}$)
                   n = 20    n = 1000
$\hat r_{MOM}$     0.852     0.089
$\hat r_{MLE}$     0.844     0.083
$\hat s_{MOM}$     0.374     0.133
$\hat s_{MLE}$     0.355     0.128
Both the method of moments and MLE have a substantial
bias for a small sample size ($n = 20$). Both methods are
approximately unbiased for a large sample size ($n = 1000$).
For both sample sizes, the MLE is slightly better than the
method of moments in terms of mean squared error.
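A simulation of this kind might be organized as in the sketch below. It is not the code that produced the numbers in the notes: the function names `mom_beta`, `mle_beta`, and `sim_beta`, the seed, and the reduced number of replications (200 instead of 1000) are assumptions, and the MLE is computed with `optim` rather than coordinate ascent, so the output will differ from the reported values.

```r
set.seed(550)  # arbitrary seed so the sketch is reproducible

# Method of moments estimates for Beta(r, s)
mom_beta <- function(x) {
  m1 <- mean(x)
  m2 <- mean(x^2)
  c(r = m1 * (m1 - m2) / (m2 - m1^2),
    s = (1 - m1) * (m1 - m2) / (m2 - m1^2))
}

# MLE by direct numerical maximization, started at the method of moments values
mle_beta <- function(x) {
  n <- length(x)
  slx <- sum(log(x))
  sl1x <- sum(log(1 - x))
  negll <- function(p) {
    r <- p[[1]]; s <- p[[2]]
    -((r - 1) * slx + (s - 1) * sl1x - n * lgamma(r) - n * lgamma(s) + n * lgamma(r + s))
  }
  negsc <- function(p) {
    r <- p[[1]]; s <- p[[2]]
    -c(slx - n * digamma(r) + n * digamma(r + s),
       sl1x - n * digamma(s) + n * digamma(r + s))
  }
  fit <- optim(unname(mom_beta(x)), negll, negsc, method = "L-BFGS-B",
               lower = c(1e-3, 1e-3))
  c(r = fit$par[1], s = fit$par[2])
}

# Bias and root mean squared error of both estimators over reps simulated samples
sim_beta <- function(n, reps = 200, r = 2, s = 3) {
  est <- t(replicate(reps, {
    x <- rbeta(n, r, s)
    c(mom = mom_beta(x), mle = mle_beta(x))
  }))
  truth <- c(r, s, r, s)          # columns: mom.r, mom.s, mle.r, mle.s
  err <- sweep(est, 2, truth)     # subtract the true value from each column
  rbind(bias = colMeans(err), rmse = sqrt(colMeans(err^2)))
}

res20 <- sim_beta(20)
res1000 <- sim_beta(1000)
round(res20, 3)
round(res1000, 3)
```

Each result is a 2 x 4 matrix with rows `bias` and `rmse` and one column per estimator and parameter.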
II. Non-concave likelihoods
In Notes 11, we showed that for exponential family
models, the log likelihood is concave. This means that a
solution to the likelihood equation found by the bisection
method or the coordinate ascent method is a global
maximum. For other models, the log likelihood might not
be concave, and solutions to the likelihood equation might
be local minima, saddle points, or local maxima that are not
global maxima. When there are multiple solutions to the
likelihood equation, the solution found by the bisection
method or coordinate ascent depends on the starting point.
Example: MLE for Cauchy Distribution
Cauchy model:
$$p(x \mid \theta) = \frac{1}{\pi (1 + (x - \theta)^2)}, \quad -\infty < x < \infty,\ -\infty < \theta < \infty$$
Suppose $X_1, X_2, X_3$ are iid Cauchy($\theta$) and we observe
$X_1 = 0$, $X_2 = 1$, $X_3 = 10$.
The log likelihood is not concave and has two local maxima
between 0 and 10. There is also a local minimum.
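The shape of this log likelihood can be examined by evaluating it on a grid. The sketch below is an illustration under assumptions: `cauchy_loglik`, `x_obs`, the grid range $[-2, 12]$, and the step 0.01 are choices made for the example, and local maxima are detected simply as grid points that are higher than both neighbors.

```r
# Cauchy log likelihood at each value of theta, using the built-in density
cauchy_loglik <- function(theta, x) {
  sapply(theta, function(t) sum(dcauchy(x, location = t, log = TRUE)))
}

x_obs <- c(0, 1, 10)
grid <- seq(-2, 12, by = 0.01)
ll <- cauchy_loglik(grid, x_obs)

# A grid point is a local maximum if the log likelihood rises into it
# and falls after it (slope sign changes from +1 to -1)
is_peak <- c(FALSE, diff(sign(diff(ll))) == -2, FALSE)
peaks <- grid[is_peak]
peak_values <- ll[is_peak]

# plot(grid, ll, type = "l")   # uncomment to view the curve
```

Comparing `peak_values` shows which of the local maxima is the global maximum on the grid.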
The likelihood equation is
$$l_x'(\theta) = \sum_{i=1}^3 \frac{2(x_i - \theta)}{1 + (x_i - \theta)^2} = 0$$
The local maximum (i.e., the solution to the likelihood
equation) that the bisection method finds depends on the
interval searched over.
R program to use bisection method
derivloglikfunc=function(theta,x1,x2,x3){
dloglikx1=2*(x1-theta)/(1+(x1-theta)^2);
dloglikx2=2*(x2-theta)/(1+(x2-theta)^2);
dloglikx3=2*(x3-theta)/(1+(x3-theta)^2);
dloglikx1+dloglikx2+dloglikx3;
}
When the starting points for the bisection method are
$x_0 = 0$, $x_1 = 5$, the bisection method finds the MLE:
uniroot(derivloglikfunc,interval=c(0,5),x1=0,x2=1,x3=10);
$root
[1] 0.6092127
When the starting points for the bisection method are
$x_0 = 0$, $x_1 = 10$, the bisection method finds a local
maximum but not the MLE:
uniroot(derivloglikfunc,interval=c(0,10),x1=0,x2=1,x3=10);
$root
[1] 9.775498
For non-concave likelihood functions, we should try
multiple starting points.
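One simple way to try multiple starting points is to split the search range into short intervals, solve the likelihood equation on every interval where the derivative changes sign, and keep the solution with the largest log likelihood. The sketch below applies this to the Cauchy example; the names `score`, `loglik`, `breaks`, `roots`, and `best`, and the unit-width intervals on $[-1, 11]$, are assumptions made for illustration.

```r
x_obs <- c(0, 1, 10)

# Derivative of the Cauchy log likelihood and the log likelihood itself
score <- function(theta) sum(2 * (x_obs - theta) / (1 + (x_obs - theta)^2))
loglik <- function(theta) sum(dcauchy(x_obs, location = theta, log = TRUE))

# Hypothetical grid of interval endpoints covering the data
breaks <- seq(-1, 11, by = 1)
roots <- c()
for (i in seq_len(length(breaks) - 1)) {
  a <- breaks[i]
  b <- breaks[i + 1]
  # Only search intervals where the derivative changes sign
  if (score(a) * score(b) < 0) {
    roots <- c(roots, uniroot(score, c(a, b), tol = 1e-8)$root)
  }
}

# Among all solutions of the likelihood equation, keep the one with the
# highest log likelihood
best <- roots[which.max(sapply(roots, loglik))]
```

The vector `roots` contains every solution found, including the local minimum, so comparing log likelihood values is what separates the MLE from the other stationary points. A finer grid reduces the chance of missing a pair of roots that fall in the same interval.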