Statistics 550 Notes 12

Reading: Sections 2.4.2

The bisection method is a root finding algorithm which works by repeatedly dividing an interval in half and then selecting the subinterval in which the root exists. The bisection method can be used to solve the likelihood equation to find the MLE for a one-dimensional model (or, as in the gamma model, a multidimensional model in which solving the likelihood equation can be reduced to the problem of solving one equation).

I. Coordinate Ascent Method

The coordinate ascent method is an approach to finding the maximum likelihood estimate in a multidimensional family. The coordinate ascent method works by using the bisection method iteratively. Suppose we have a $k$-dimensional parameter $(\theta_1, \ldots, \theta_k)$. The coordinate ascent method is:

Choose an initial estimate $(\hat\theta_1, \ldots, \hat\theta_k)$.

0. Set $(\hat\theta_1, \ldots, \hat\theta_k)_{old} = (\hat\theta_1, \ldots, \hat\theta_k)$.

1. Maximize $l_x(\theta_1, \hat\theta_2, \ldots, \hat\theta_k)$ over $\theta_1$ using the bisection method by solving $\frac{\partial}{\partial\theta_1} l_x(\theta_1, \hat\theta_2, \ldots, \hat\theta_k) = 0$ (assuming the log likelihood is differentiable). Reset $\hat\theta_1$ to the $\theta_1$ that maximizes $l_x(\theta_1, \hat\theta_2, \ldots, \hat\theta_k)$.

2. Maximize $l_x(\hat\theta_1, \theta_2, \hat\theta_3, \ldots, \hat\theta_k)$ over $\theta_2$ using the bisection method. Reset $\hat\theta_2$ to the $\theta_2$ that maximizes $l_x(\hat\theta_1, \theta_2, \hat\theta_3, \ldots, \hat\theta_k)$.

...

K. Maximize $l_x(\hat\theta_1, \hat\theta_2, \hat\theta_3, \ldots, \hat\theta_{k-1}, \theta_k)$ over $\theta_k$ using the bisection method. Reset $\hat\theta_k$ to the $\theta_k$ that maximizes $l_x(\hat\theta_1, \hat\theta_2, \hat\theta_3, \ldots, \hat\theta_{k-1}, \theta_k)$.

K+1. Stop if the distance between $(\hat\theta_1, \ldots, \hat\theta_k)_{old}$ and $(\hat\theta_1, \ldots, \hat\theta_k)$ is less than some tolerance $\varepsilon$. Otherwise return to step 0.

The coordinate ascent method converges to the maximum likelihood estimate when the log likelihood function is strictly concave on the parameter space. See Theorem 2.4.2 and Figure 2.4.1 in Bickel and Doksum.

Example: Beta Distribution

\[ p(x \mid r, s) = \frac{\Gamma(r+s)}{\Gamma(r)\Gamma(s)} x^{r-1} (1-x)^{s-1} = \exp\bigl((r-1)\log x + (s-1)\log(1-x) - \log\Gamma(r) - \log\Gamma(s) + \log\Gamma(r+s)\bigr) \]

for $0 < x < 1$ and $\Theta = \{(r, s) : 0 < r < \infty,\ 0 < s < \infty\}$.

This is a two parameter full exponential family. We found the method of moments estimates in Homework 5.
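The one-dimensional search that coordinate ascent repeats in each step can be sketched in R as follows. This is a minimal illustration rather than code from the notes: the function name `bisect`, its arguments, and the test function $x^2 - 2$ are assumptions chosen for the example. The R code in these notes calls the built-in `uniroot`, which likewise needs an interval whose endpoints give opposite signs.

```r
# Hypothetical helper (not from the notes): plain bisection for a root of f
# on [lower, upper], assuming f is continuous and changes sign on the interval.
bisect <- function(f, lower, upper, tol = 1e-8, maxiter = 200) {
  flow <- f(lower)
  fhigh <- f(upper)
  if (flow == 0) return(lower)
  if (fhigh == 0) return(upper)
  if (flow * fhigh > 0) stop("f(lower) and f(upper) must have opposite signs")
  for (iter in seq_len(maxiter)) {
    mid <- (lower + upper) / 2
    fmid <- f(mid)
    # Stop when the midpoint is a root or the half-width is below tol
    if (fmid == 0 || (upper - lower) / 2 < tol) return(mid)
    if (flow * fmid < 0) {
      upper <- mid          # sign change is in the left half
    } else {
      lower <- mid          # sign change is in the right half
      flow <- fmid
    }
  }
  (lower + upper) / 2
}

# Illustrative use: the positive root of x^2 - 2 on [0, 2]
root <- bisect(function(x) x^2 - 2, 0, 2)
print(root)
```

In the coordinate ascent steps, `f` would be the partial derivative of the log likelihood in one coordinate with the other coordinates held at their current estimates.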
For $X_1, \ldots, X_n$ iid Beta($r$, $s$),

\[ l_x(r, s) = (r-1)\sum_{i=1}^{n} \log X_i + (s-1)\sum_{i=1}^{n} \log(1 - X_i) - n\log\Gamma(r) - n\log\Gamma(s) + n\log\Gamma(r+s) \]

\[ \frac{\partial l}{\partial r} = \sum_{i=1}^{n} \log X_i - n\frac{\partial}{\partial r}\log\Gamma(r) + n\frac{\partial}{\partial r}\log\Gamma(r+s) \]

\[ \frac{\partial l}{\partial s} = \sum_{i=1}^{n} \log(1 - X_i) - n\frac{\partial}{\partial s}\log\Gamma(s) + n\frac{\partial}{\partial s}\log\Gamma(r+s) \]

The following data come from an experiment on seeding clouds to suppress hail in South Africa. Each observation is for one storm: the ratio of the area where total destruction of crops due to hail was reported to the corresponding larger area where at least partial destruction of crops due to hail was reported. The beta distribution was thought to be a good model for these data (Mielke, 1975, Journal of Applied Meteorology).

X = (0.33, 0.48, 0.54, 0.82, 0.58, 0.68, 0.65, 0.37, 0.42, 0.37, 0.48, 0.44, 0.46, 0.61, 0.53, 0.46, 0.45, 0.57, 0.22, 0.59, 0.68, 0.98, 0.48, 0.26, 0.40, 0.60, 0.98)

R code for finding the MLE:

    # Code for beta distribution MLE
    # xvec stores the data
    # rhatcurr, shatcurr store current estimates of r and s
    xvec=c(0.33,0.48,0.54,0.82,0.58,0.68,0.65,0.37,0.42,0.37,0.48,0.44,0.46,0.61,
           0.53,0.46,0.45,0.57,0.22,0.59,0.68,0.98,0.48,0.26,0.40,0.60,0.98)

    # Generate data from Beta(r=2,s=3) distribution
    # xvec=rbeta(20,2,3)

    # Set low and high starting values for the bisection searches
    rhatlow=.001; rhathigh=20; shatlow=.001; shathigh=20

    # Use method of moments for starting values
    rhatmom=mean(xvec)*(mean(xvec)-mean(xvec^2))/(mean(xvec^2)-mean(xvec)^2)
    shatmom=((1-mean(xvec))*(mean(xvec)-mean(xvec^2)))/(mean(xvec^2)-mean(xvec)^2)

    rhatcurr=rhatmom; shatcurr=shatmom
    # rhatiters, shatiters record the estimates after each pass
    rhatiters=rhatcurr; shatiters=shatcurr

    # Partial derivative of the log likelihood with respect to r
    # (digamma is the derivative of log gamma)
    derivrfunc=function(r,s,xvec){
      n=length(xvec)
      sum(log(xvec))-n*digamma(r)+n*digamma(r+s)
    }

    # Partial derivative of the log likelihood with respect to s
    derivsfunc=function(s,r,xvec){
      n=length(xvec)
      sum(log(1-xvec))-n*digamma(s)+n*digamma(r+s)
    }

    dist=1
    toler=.0001
    while(dist>toler){
      # Step 1: solve for r with s held at its current value
      rhatnew=uniroot(derivrfunc,c(rhatlow,rhathigh),s=shatcurr,xvec=xvec)$root
      # Step 2: solve for s with r held at its updated value
      shatnew=uniroot(derivsfunc,c(shatlow,shathigh),r=rhatnew,xvec=xvec)$root
      # Distance between the old and new estimates
      dist=sqrt((rhatnew-rhatcurr)^2+(shatnew-shatcurr)^2)
      rhatcurr=rhatnew
      shatcurr=shatnew
      rhatiters=c(rhatiters,rhatcurr)
      shatiters=c(shatiters,shatcurr)
    }
    rhatmle=rhatcurr
    shatmle=shatcurr

    > rhatmom
    [1] 3.508473
    > shatmom
    [1] 3.056237
    > rhatmle
    [1] 2.507865
    > shatmle
    [1] 1.997723

Simulation Study: We simulate data $X_1, \ldots, X_n$ iid from a Beta($r = 2$, $s = 3$) distribution. 1000 simulations.

Bias

                $\hat r_{MOM}$   $\hat r_{MLE}$   $\hat s_{MOM}$   $\hat s_{MLE}$
    n = 20          0.287            0.315            0.471            0.516
    n = 1000        0.004            0.005            0.003            0.006

Root Mean Squared Error ($\sqrt{MSE}$)

                $\hat r_{MOM}$   $\hat r_{MLE}$   $\hat s_{MOM}$   $\hat s_{MLE}$
    n = 20          0.852            0.844            0.374            0.355
    n = 1000        0.089            0.083            0.133            0.128

Both the method of moments and the MLE have a substantial bias for a small sample size ($n = 20$). Both methods are approximately unbiased for a large sample size ($n = 1000$). For both sample sizes, the MLE is slightly better than the method of moments in terms of mean squared error.

II. Non-concave likelihoods

In Notes 11, we showed that for exponential family models, the log likelihood is concave. This means that a solution to the likelihood equation found by the bisection method or the coordinate ascent method is a global maximum. For other models, the log likelihood might not be concave, and solutions to the likelihood equation might be local minima, saddlepoints, or local maxima that are not global maxima. When there are multiple solutions to the likelihood equation, the solution found by the bisection method or coordinate ascent depends on the starting point.

Example: MLE for Cauchy Distribution

Cauchy model:

\[ p(x \mid \theta) = \frac{1}{\pi\bigl(1 + (x - \theta)^2\bigr)}, \quad -\infty < x < \infty \]

Suppose $X_1, X_2, X_3$ are iid Cauchy($\theta$) and we observe $X_1 = 0$, $X_2 = 1$, $X_3 = 10$.

The log likelihood is not concave and has two local maxima between 0 and 10. There is also a local minimum. The likelihood equation is

\[ l_x'(\theta) = \sum_{i=1}^{3} \frac{2(x_i - \theta)}{1 + (x_i - \theta)^2} = 0 \]

The local maximum (i.e., the solution to the likelihood equation) that the bisection method finds depends on the interval searched over.
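As a worked step that is not spelled out in the notes, the likelihood equation above follows from differentiating the log likelihood of the three observations. The second derivative is added here only to illustrate why concavity fails.

```latex
\begin{align*}
l_x(\theta) &= -3\log\pi - \sum_{i=1}^{3} \log\bigl(1 + (x_i - \theta)^2\bigr), \\
l_x'(\theta) &= \sum_{i=1}^{3} \frac{2(x_i - \theta)}{1 + (x_i - \theta)^2}, \\
l_x''(\theta) &= \sum_{i=1}^{3} \frac{2\bigl((x_i - \theta)^2 - 1\bigr)}{\bigl(1 + (x_i - \theta)^2\bigr)^2}.
\end{align*}
```

Each term of $l_x''(\theta)$ is positive whenever $|x_i - \theta| > 1$. At $\theta = 5$ the three distances are 5, 4 and 5, so $l_x''(5) > 0$ and the log likelihood cannot be concave.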
R program to use the bisection method:

    # Derivative of the Cauchy log likelihood for three observations x1, x2, x3
    derivloglikfunc=function(theta,x1,x2,x3){
      dloglikx1=2*(x1-theta)/(1+(x1-theta)^2)
      dloglikx2=2*(x2-theta)/(1+(x2-theta)^2)
      dloglikx3=2*(x3-theta)/(1+(x3-theta)^2)
      dloglikx1+dloglikx2+dloglikx3
    }

When the search interval for the bisection method is $[0, 5]$, the bisection method finds the MLE:

    uniroot(derivloglikfunc,interval=c(0,5),x1=0,x2=1,x3=10)
    $root
    [1] 0.6092127

When the search interval for the bisection method is $[0, 10]$, the bisection method finds a local maximum but not the MLE:

    uniroot(derivloglikfunc,interval=c(0,10),x1=0,x2=1,x3=10)
    $root
    [1] 9.775498

For non-concave likelihood functions, we should try multiple starting points.
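One way to try multiple starting points is sketched below. It scans a grid for places where the derivative of the log likelihood changes from positive to negative, runs `uniroot` inside each such bracket, and keeps the local maximum with the largest log likelihood. The helper names `cauchy_dloglik` and `cauchy_loglik`, the search range $[-5, 15]$ and the grid step 0.5 are assumptions made for this illustration, not part of the notes.

```r
# Observed data from the Cauchy example
xobs <- c(0, 1, 10)

# Derivative of the log likelihood (hypothetical helper, any number of observations)
cauchy_dloglik <- function(theta, xobs) {
  sum(2 * (xobs - theta) / (1 + (xobs - theta)^2))
}

# Log likelihood, using R's built-in Cauchy density
cauchy_loglik <- function(theta, xobs) {
  sum(dcauchy(xobs, location = theta, scale = 1, log = TRUE))
}

# Assumed search range and step; both are illustrative choices
grid <- seq(-5, 15, by = 0.5)
dvals <- sapply(grid, cauchy_dloglik, xobs = xobs)

# Brackets where the derivative goes from positive to negative (local maxima)
left <- which(dvals[-length(dvals)] > 0 & dvals[-1] < 0)

# Solve the likelihood equation inside each bracket
localmax <- sapply(left, function(i) {
  uniroot(cauchy_dloglik, interval = grid[c(i, i + 1)],
          xobs = xobs, tol = 1e-8)$root
})

# Compare the log likelihood at each local maximum and keep the largest
loglikvals <- sapply(localmax, cauchy_loglik, xobs = xobs)
thetahat <- localmax[which.max(loglikvals)]
print(cbind(localmax, loglikvals))
print(thetahat)
```

A grid that is too coarse can miss a pair of nearby roots, so the step size should be small relative to the spacing of the observations.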