Statistics 550 Notes 15

Reading: Sections 2.4, 3.1-3.2

I. Numerical Methods for Finding MLEs

Review from last class: The bisection method is a method for finding the root of a one-dimensional function f that is continuous on (a, b), has f(a) and f(b) of opposite signs, and is monotone increasing or decreasing. It can be used when the likelihood equation is (or can be reduced to) a one-parameter equation. The bisection method works by repeatedly dividing an interval in half and then selecting the subinterval in which the root lies.

Coordinate Ascent Method

The coordinate ascent method is an approach to finding the maximum likelihood estimate in a multidimensional family. It works by using the bisection method iteratively. Suppose we have a k-dimensional parameter (θ_1, ..., θ_k). The coordinate ascent method is:

Choose an initial estimate (θ̂_1, ..., θ̂_k).

0. Set (θ̂_1, ..., θ̂_k)_old = (θ̂_1, ..., θ̂_k).

1. Maximize l_x(θ_1, θ̂_2, ..., θ̂_k) over θ_1 using the bisection method by solving ∂l(θ_1, θ̂_2, ..., θ̂_k)/∂θ_1 = 0 (assuming the log likelihood is differentiable). Reset θ̂_1 to the θ_1 that maximizes l_x(θ_1, θ̂_2, ..., θ̂_k).

2. Maximize l_x(θ̂_1, θ_2, θ̂_3, ..., θ̂_k) over θ_2 using the bisection method. Reset θ̂_2 to the θ_2 that maximizes l_x(θ̂_1, θ_2, θ̂_3, ..., θ̂_k).

...

k. Maximize l_x(θ̂_1, θ̂_2, ..., θ̂_{k-1}, θ_k) over θ_k using the bisection method. Reset θ̂_k to the θ_k that maximizes l_x(θ̂_1, θ̂_2, ..., θ̂_{k-1}, θ_k).

k+1. Stop if the distance between (θ̂_1, ..., θ̂_k)_old and (θ̂_1, ..., θ̂_k) is less than some tolerance ε. Otherwise return to step 0.

The coordinate ascent method converges to the maximum likelihood estimate when the log likelihood function is strictly concave on the parameter space. See Figure 2.4.1 in Bickel and Doksum.
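Each step of the coordinate ascent method requires a one-dimensional root finder for the derivative of the log likelihood. As an illustration (not part of the original notes; R's built-in uniroot, used in the code below, does the same job more robustly), a minimal bisection root finder can be written in a few lines, assuming f is continuous on [a, b] with f(a) and f(b) of opposite signs:

# Minimal bisection root finder: repeatedly halve the bracket [a,b],
# keeping the half that still contains a sign change
bisect=function(f,a,b,tol=1e-6){
  if(f(a)*f(b)>0) stop("f(a) and f(b) must have opposite signs");
  while(b-a>tol){
    mid=(a+b)/2;
    if(f(a)*f(mid)<=0){ b=mid } else { a=mid };
  }
  (a+b)/2;
}
# Example: the root of f(x) = x^2 - 2 on [0, 2] is sqrt(2)
bisect(function(x){x^2-2},0,2);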
Example: Beta Distribution

p(x | r, s) = [Γ(r+s) / (Γ(r)Γ(s))] x^(r-1) (1-x)^(s-1)
            = exp((r-1) log x + (s-1) log(1-x) - log Γ(r) - log Γ(s) + log Γ(r+s))

for 0 < x < 1 and Θ = {(r, s): 0 < r < ∞, 0 < s < ∞}.

This is a two-parameter full exponential family and hence the log likelihood is strictly concave. We found the method of moments estimates in Homework 5.

For X_1, ..., X_n iid Beta(r, s),

l_x(r, s) = (r-1) Σ_{i=1}^n log X_i + (s-1) Σ_{i=1}^n log(1 - X_i) - n log Γ(r) - n log Γ(s) + n log Γ(r+s)

∂l/∂r = Σ_{i=1}^n log X_i - n ∂ log Γ(r)/∂r + n ∂ log Γ(r+s)/∂r

∂l/∂s = Σ_{i=1}^n log(1 - X_i) - n ∂ log Γ(s)/∂s + n ∂ log Γ(r+s)/∂s

For the data

X = (0.3184108, 0.3875947, 0.7411803, 0.4044642, 0.7240628, 0.7247060, 0.1091041, 0.1388588, 0.7347975, 0.5138287, 0.2683177, 0.4685777, 0.1746448, 0.2779592, 0.2876237, 0.5833377, 0.5847999, 0.2530112, 0.5018544, 0.5295680),

the following R code finds the MLE by coordinate ascent:

# Code for beta distribution MLE
# xvec stores the data
# rhatcurr, shatcurr store current estimates of r and s

# Generate data from Beta(r=2,s=3) distribution
xvec=rbeta(20,2,3);

# Set low and high starting values for the bisection searches
rhatlow=.001;
rhathigh=20;
shatlow=.001;
shathigh=20;

# Use method of moments estimates for starting values
rhatcurr=mean(xvec)*(mean(xvec)-mean(xvec^2))/(mean(xvec^2)-mean(xvec)^2);
shatcurr=((1-mean(xvec))*(mean(xvec)-mean(xvec^2)))/(mean(xvec^2)-mean(xvec)^2);
rhatiters=rhatcurr;
shatiters=shatcurr;

# Derivative of the log likelihood with respect to r (for fixed s)
derivrfunc=function(r,s,xvec){
  n=length(xvec);
  sum(log(xvec))-n*digamma(r)+n*digamma(r+s);
}

# Derivative of the log likelihood with respect to s (for fixed r)
derivsfunc=function(s,r,xvec){
  n=length(xvec);
  sum(log(1-xvec))-n*digamma(s)+n*digamma(r+s);
}

dist=1;
toler=.0001;
while(dist>toler){
  # Maximize over r with s fixed, then over s with r fixed
  rhatnew=uniroot(derivrfunc,c(rhatlow,rhathigh),s=shatcurr,xvec=xvec)$root;
  shatnew=uniroot(derivsfunc,c(shatlow,shathigh),r=rhatnew,xvec=xvec)$root;
  dist=sqrt((rhatnew-rhatcurr)^2+(shatnew-shatcurr)^2);
  rhatcurr=rhatnew;
  shatcurr=shatnew;
  rhatiters=c(rhatiters,rhatcurr);
  shatiters=c(shatiters,shatcurr);
}
rhatmle=rhatcurr;
shatmle=shatcurr;

Newton's Method

Newton's method is a numerical method for approximating solutions to equations. The method produces a sequence of values θ^(0), θ^(1), ... that, under ideal conditions, converges to the MLE θ̂_MLE.

To motivate the method, we expand the derivative of the log likelihood around θ^(j):

0 = l'(θ̂_MLE) ≈ l'(θ^(j)) + (θ̂_MLE - θ^(j)) l''(θ^(j)).

Solving for θ̂_MLE gives

θ̂_MLE ≈ θ^(j) - l'(θ^(j)) / l''(θ^(j)).

This suggests the following iterative scheme:

θ^(j+1) = θ^(j) - l'(θ^(j)) / l''(θ^(j)).

Newton's method can be extended to more than one dimension (usually called Newton-Raphson):

θ^(j+1) = θ^(j) - [∇²l(θ^(j))]^(-1) ∇l(θ^(j)),

where ∇l denotes the gradient vector of the log likelihood and ∇²l denotes the Hessian.
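For comparison, here is a minimal Newton-Raphson sketch for the same beta log likelihood (my own illustration, not from the notes: the function name nrbetamle and the use of trigamma for the second derivatives are my additions, and no safeguard is included against a step leaving the parameter space):

# Newton-Raphson for the beta MLE, starting from the method of moments estimates
nrbetamle=function(xvec,toler=1e-4){
  n=length(xvec);
  m1=mean(xvec); m2=mean(xvec^2);
  theta=c(m1*(m1-m2)/(m2-m1^2),(1-m1)*(m1-m2)/(m2-m1^2));  # method of moments start
  repeat{
    r=theta[1]; s=theta[2];
    # gradient of the log likelihood (same derivatives as in the coordinate ascent code)
    score=c(sum(log(xvec))-n*digamma(r)+n*digamma(r+s),
            sum(log(1-xvec))-n*digamma(s)+n*digamma(r+s));
    # Hessian of the log likelihood
    hess=matrix(c(-n*trigamma(r)+n*trigamma(r+s),n*trigamma(r+s),
                  n*trigamma(r+s),-n*trigamma(s)+n*trigamma(r+s)),2,2);
    thetanew=theta-solve(hess,score);  # Newton-Raphson update
    if(sqrt(sum((thetanew-theta)^2))<toler){ return(thetanew) };
    theta=thetanew;
  }
}
# nrbetamle(xvec) should agree closely with (rhatmle, shatmle) from the code above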
Comments on methods for finding the MLE:

1. The bisection method is guaranteed to converge if there is a unique root in the interval being searched over, but it is slower than Newton's method.

2. Newton's method:
A. The method does not work if l''(θ^(j)) = 0.
B. The method does not always converge. See the attached pages from the Numerical Recipes in C book.

3. For the coordinate ascent method and Newton's method, a good choice of starting values is often the method of moments estimator.

4. When there are multiple roots to the likelihood equation, the solution found by the bisection method, the coordinate ascent method and Newton's method depends on the starting value. These algorithms might converge to a local maximum (or a saddlepoint) rather than a global maximum.

5. The EM (Expectation/Maximization) algorithm (Section 2.4.4) is another approach to finding the MLE that is particularly suitable when part of the data is missing.

Example of a nonconcave likelihood: MLE for the Cauchy Distribution

Cauchy model:

p(x | θ) = 1 / (π(1 + (x - θ)²)), -∞ < x < ∞.

Suppose X_1, X_2, X_3 are iid Cauchy(θ) and we observe X_1 = 0, X_2 = 1, X_3 = 10. The log likelihood is not concave; it has two local maxima between 0 and 10 and also a local minimum.

The likelihood equation is

l_x'(θ) = Σ_{i=1}^3 2(x_i - θ) / (1 + (x_i - θ)²) = 0.

The local maximum (i.e., the solution to the likelihood equation) that the bisection method finds depends on the interval searched over.

R program to use the bisection method:

# Derivative of the Cauchy log likelihood for three observations
derivloglikfunc=function(theta,x1,x2,x3){
  dloglikx1=2*(x1-theta)/(1+(x1-theta)^2);
  dloglikx2=2*(x2-theta)/(1+(x2-theta)^2);
  dloglikx3=2*(x3-theta)/(1+(x3-theta)^2);
  dloglikx1+dloglikx2+dloglikx3;
}

When the starting points for the bisection method are x_0 = 0, x_1 = 5, the bisection method finds the MLE:

uniroot(derivloglikfunc,interval=c(0,5),x1=0,x2=1,x3=10);
$root
[1] 0.6092127

When the starting points for the bisection method are x_0 = 0, x_1 = 10, the bisection method finds a local maximum but not the MLE:

uniroot(derivloglikfunc,interval=c(0,10),x1=0,x2=1,x3=10);
$root
[1] 9.775498
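To confirm which root is the global maximizer, we can evaluate the log likelihood at the two roots (a small check I have added; the helper name cauchyloglik is not from the notes):

# Cauchy log likelihood for the three observations
cauchyloglik=function(theta,x1,x2,x3){
  sum(-log(pi*(1+(c(x1,x2,x3)-theta)^2)));
}
cauchyloglik(0.6092127,x1=0,x2=1,x3=10);  # log likelihood at the root found on (0,5)
cauchyloglik(9.775498,x1=0,x2=1,x3=10);   # log likelihood at the root found on (0,10)
# The first value is larger, confirming that 0.6092127 is the MLE.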
II. Bayes Procedures (Chapter 3.2)

In Chapter 3, we return to the theme of Section 1.3, which is how to appraise and select among point estimators and decision procedures. We now discuss how the Bayes criteria for choosing decision procedures can be implemented.

Review of the Bayes criteria:

Suppose a person's prior distribution about θ is π(θ) and the probability distribution for the data is that X | θ has probability density function (or probability mass function) p(x | θ). This can be viewed as a model in which θ is a random variable and the joint pdf of (X, θ) is π(θ) p(x | θ).

The Bayes risk of a decision procedure δ for a prior distribution π, denoted by r(π, δ), is the expected value of the loss function over the joint distribution of (X, θ), which is the expected value of the risk function over the prior distribution of θ:

r(π, δ) = E_θ[E[l(θ, δ(X)) | θ]] = E_θ[R(θ, δ)].

For a person with prior distribution π, the decision procedure which minimizes r(π, δ) minimizes the expected loss and is the best procedure from this person's point of view. The decision procedure which minimizes the Bayes risk for a prior π is called the Bayes rule for the prior π.

A nonsubjective interpretation of the Bayes criteria: The Bayes approach leads us to compare procedures on the basis of

r(π, δ) = Σ_θ R(θ, δ) π(θ) if θ is discrete with frequency function π, or

r(π, δ) = ∫ R(θ, δ) π(θ) dθ if θ is continuous with density π.

Such comparisons make sense even if we do not interpret π as a prior density or frequency function, but only as a weight function that reflects the importance we place on doing well at the different possible values of θ.

Computing Bayes estimators:

In Chapter 1.3, we found the Bayes decision procedure by computing the Bayes risk for each decision procedure. This is usually an impossible task. We now provide a constructive method for finding the Bayes decision procedure.

Recall from Chapter 1.2 that the posterior distribution is the subjective probability distribution for the parameter θ after seeing the data x:

p(θ | x) = p(x | θ) π(θ) / ∫ p(x | θ) π(θ) dθ.

The posterior risk of an action a is the expected loss from taking action a under the posterior distribution p(θ | x):

r(a | x) = E_{p(θ|x)}[l(θ, a)].

The following proposition shows that a decision procedure which chooses the action that minimizes the posterior risk for each sample x is a Bayes decision procedure.

Proposition 3.2.1: Suppose there exists a function δ*(x) such that

r(δ*(x) | x) = inf{r(a | x): a ∈ A},    (1.1)

where A denotes the action space. Then δ*(x) is a Bayes rule.

Proof: For any decision rule δ, we have

r(π, δ) = E[l(θ, δ(X))] = E[E[l(θ, δ(X)) | X]].    (1.2)

By (1.1),

E[l(θ, δ(X)) | X = x] = r(δ(x) | x) ≥ r(δ*(x) | x) = E[l(θ, δ*(X)) | X = x].

Therefore, E[l(θ, δ(X)) | X = x] ≥ E[l(θ, δ*(X)) | X = x] for every x, and the result follows from (1.2).

Consider the oil-drilling example (Example 1.3.5) again.

Example 2: We are trying to decide whether to drill a location for oil. There are two possible states of nature: θ_1 = the location contains oil and θ_2 = the location doesn't contain oil. We are considering three actions: a_1 = drill for oil, a_2 = sell the location, or a_3 = sell partial rights to the location. The following loss function is decided on:

                  a_1 (Drill)   a_2 (Sell)   a_3 (Partial rights)
θ_1 (Oil)              0            10              5
θ_2 (No oil)          12             1              6

An experiment is conducted to obtain information about θ, resulting in the random variable X (rock formation) with possible values 0, 1 and frequency function p(x | θ) given by the following table:

                  x = 0    x = 1
θ_1 (Oil)          0.3      0.7
θ_2 (No oil)       0.6      0.4

Application to Example 1.3.5: Consider the prior π(θ_1) = 0.2, π(θ_2) = 0.8. Suppose we observe x = 0. Then the posterior distribution of θ is

p(θ | x = 0) = p(x = 0 | θ) π(θ) / Σ_{i=1}^2 p(x = 0 | θ_i) π(θ_i),

which equals p(θ_1 | x = 0) = 1/9, p(θ_2 | x = 0) = 8/9. Thus, the posterior risks of the actions a_1, a_2, a_3 are

r(a_1 | 0) = (1/9) l(θ_1, a_1) + (8/9) l(θ_2, a_1) = 10.67,
r(a_2 | 0) = 2,  r(a_3 | 0) = 5.89.

Therefore, a_2 has the smallest posterior risk and, if δ* is the Bayes rule, δ*(0) = a_2. Similarly,

r(a_1 | 1) = 8.35,  r(a_2 | 1) = 3.74,  r(a_3 | 1) = 5.70,

and we conclude that δ*(1) = a_2.
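The posterior risk calculations in this example can be checked with a short R script (my own sketch, not part of the original notes):

# Oil-drilling example: compute the posterior risks of a1, a2, a3 for x = 0 and x = 1
prior=c(0.2,0.8);                  # pi(theta1), pi(theta2)
pxgiventheta=rbind(c(0.3,0.7),     # p(x=0|theta1), p(x=1|theta1)
                   c(0.6,0.4));    # p(x=0|theta2), p(x=1|theta2)
loss=rbind(c(0,10,5),              # l(theta1,a1), l(theta1,a2), l(theta1,a3)
           c(12,1,6));             # l(theta2,a1), l(theta2,a2), l(theta2,a3)
for(xcol in 1:2){
  # posterior distribution of theta given X = xcol-1
  posterior=prior*pxgiventheta[,xcol]/sum(prior*pxgiventheta[,xcol]);
  postrisk=as.vector(posterior%*%loss);   # posterior risks of a1, a2, a3
  cat("x =",xcol-1,": posterior risks =",round(postrisk,2),
      "; Bayes action =",paste0("a",which.min(postrisk)),"\n");
}
# The output should match the values above (10.67, 2, 5.89 for x = 0 and
# 8.35, 3.74, 5.70 for x = 1), so the Bayes rule takes action a2 for both x = 0 and x = 1.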