On the Convergence Properties of the EM Algorithm Author(s): C. F. Jeff Wu Source: The Annals of Statistics, Vol. 11, No. 1 (Mar., 1983), pp. 95-103 Published by: Institute of Mathematical Statistics Stable URL: http://www.jstor.org/stable/2240463 . Accessed: 12/05/2013 20:54 Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at . http://www.jstor.org/page/info/about/policies/terms.jsp . JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms of scholarship. For more information about JSTOR, please contact support@jstor.org. . Institute of Mathematical Statistics is collaborating with JSTOR to digitize, preserve and extend access to The Annals of Statistics. http://www.jstor.org This content downloaded from 128.12.187.216 on Sun, 12 May 2013 20:54:56 PM All use subject to JSTOR Terms and Conditions The Annals of Statistics 1983,Vol. 11, No, 1, 95-103 ON THE CONVERGENCE PROPERTIES OF THE EM ALGORITHM' BY C. F. JEFF WU Universityof Wisconsin,Madison are studied:(i) does the aspectsoftheEM algorithm Two convergence valueofthe(incompleteora stationary EM algorithm finda localmaximum data) likelihoodfunction?(ii) does the sequenceof parameterestimates resultsare obtainedunder Severalconvergence generated byEM converge? Twousefulspecial conditions thatareapplicabletomanypracticalsituations. can be described complete-data specification casesare: (a) iftheunobserved space,all thelimit family withcompactparameter bya curvedexponential pointsofthe likelihoodfunction; pointsofanyEM sequenceare stationary (b) if the likelihoodfunctionis unimodaland a certaindifferentiability to theuniquemaxiis satisfied, thenanyEM sequenceconverges condition is included. A listofkeyproperties ofthealgorithm mumlikelihood estimate. 1. Introduction. Dempster,Laird and Rubin (1977) (henceforthabbreviated DLR) introducedthe EM algorithmforcomputingmaximumlikelihoodestimatesfromincomplete data. The essentialideas underlyingthe EM algorithmhave been presentedin special cases by many authors;see DLR fora detailed account. Amongthem we mentionBaum et al. (1970),Hartleyand Hocking (1971), Orchardand Woodbury(1972), Sundberg (1974). The DLR paper has made threesignificantcontributions:(i) it recognizesthe expectation step (E-step) and the maximizationstep (M-step) in theirgeneralforms,(ii) it gives some theoreticalpropertiesof the algorithm,and (iii) it recognizesand gives a wide range of applicationsin statistics. However,the proofof convergenceof EM sequences in DLR contains an error.The implicationfrom(3.13) to (3.14) in their Theorem 2 fails due to an incorrectuse of the triangleinequality.Additionalcommentson thisproofare givenin Section 2.2. Therefore the convergenceof EM sequence as proved in their Theorems 2 and 3 is cast in doubt. Otherresultson the monotonicityoflikelihoodsequence and the convergencerate of EM sequence (Theorems 1 and 4 of DLR) remainvalid. Despite its slow numericalconvergence,the EM algorithmhas become a verypopular computationalmethod in statistics. Contraryto the general experience in numerical optimization,the implementationof the E-step and M-step is easy formany statistical problems,thanksto the nice formof the complete-datalikelihoodfunction.Solutions of the M-step oftenexist in closed form.In many cases the M-step can be performedwitha standardstatisticalpackage, thus saving programmingtime. Anotherreason forstatisticians to preferEM is that it does not requirelarge storagespace. These two featuresare especiallyattractiveto those withfreeaccess to small computers. In thispaper, instead ofpatchingup the originalproofof DLR, we studymore broadly two convergenceaspects of the EM algorithm.Our approach is to view EM as a special optimizationalgorithmand to utilizeexistingresultsin the optimizationliterature. Formallywe have two sample spaces . and cNand a many-to-onemappingfrom. to OY.Instead of observingthe "complete data" x in ?, we observe the "incompletedata" y - y(x) in ON. Let the densityfunctionof x be f (x I4) with parameters4 E f2and let the densityfunctionof y be givenby ReceivedMay 1981;revisedAugust1982. GrantNo. MCS-7901846. 'Researchsupported bytheNationalScienceFoundation AMS 1980 subject classification.Primary62F10, 90C30. incomplete data,curvedexponential GEM algorithm, Key wordsand phrases.EM algorithm, maximum likelihood estimate. family, 95 This content downloaded from 128.12.187.216 on Sun, 12 May 2013 20:54:56 PM All use subject to JSTOR Terms and Conditions C. F. JEFF WU 96 g(yI4)) f(xIo) dx, where X(y) = {x: y(x) = y). The parameters 4 are to be estimated by the method of maximumlikelihood,i.e., by maximizingg(y I4) over 4 E 2. In many statisticalproblems, maximizationof the complete-dataspecificationf(x 1I ) is simplerthan that of the incomplete-data specificationg(y 1I ). A main featureof the EM algorithmis maximizationof f(x 14))over 4 E 2 (M-step). Since x is unobservable, we replace logf(x 14))by its conditionalexpectationgiveny and the currentfit4p)(E-step).To this end, let k (x Iy, 4) f (x I4)/g(y 14) be the conditionaldensityof x giveny and 4. Then the log-likelihood L(4') = logg(yI o') = Q(O'I4)) -H(O'I 4), (1) whereQ (O'I 4))= E {logf(xI o') Iy, )) andH(O'l4) = E {logk(x I y,4') y, )) areassumed ), 4p,i E M(qp) as follows: to existforall pairs (4', 4). We now definethe EM iteration4p - Q (4 I4p). E-STEP. Determine M-STEP. Choose op?1 to be any value*,ofE)E 2 whichmaximizesQ (4 I 4p). Note that M is a point-to-setmap, i.e., M(4p) is the set of 4 values which maximizes Many applicationsof EM are forthe curved exponentialfamily,for whichthe E-step and M-step take special forms. Sometimes it may not be numericallyfeasible to performthe M-step. DLR defineda generalizedEM algorithm(a GEM algorithm)to be an iterative scheme 4)p-* 4)?p+iE M(4p),where4 -* M(4)) is a point-to-setmap, such that Q( I4p)over 4 E S. (2) Q(4'I) ' Q(4)I4))forall ' E M(+). EM is a special case of GEM. For any instance {fp) of a GEM algorithm, (3) L(Op+1)2 L(Op) followsfromthe definitionof GEM and the inequality (4) H (. I(t) >- H (.' I(t) forany O' (= S2. For proofsof (3) and (4), see Lemma 1 and Theorem 1 of DLR. In Sections 2.1 and 2.2 we inspect two convergence aspects of the EM and GEM algorithmsand discuss the relationshipof our resultsto previousones in the literature.In Section 3 we summarizethe keypropertiesofEM. Potential EM users that are not patient withmathematicaldetails may read the summarybeforeconsultingthe resultsin Section 2. 2. Does EM do the job? The originalpurpose of the EM algorithmwas to provide iterative computation of the maximum-likelihoodestimates. For a bounded sequence L (4p), (3) implies that L (4p) convergesmonotonicallyto some L *. We want to know whetherL * is the global maximumof L (4) over 2. If not, is it a local maximum or a stationaryvalue? This problem is studied in Section 2.1. A related problem of the convergenceof an EM or GEM sequence {fp) is studied in Section 2.2. We make the followingassumptionsforthe rest of the paper: (5) (6) (7) 2 is a subset in the r-dimensionalEuclidean space R, 2'> = {4) E 2: L(4)) 2 L(4)o)) is compact forany L(4o) > L is continuousin 2 and differentiable in the interiorof 2. As a consequence of (5), (6), (7), we have (8) -oo {L(4p))p:),,0is bounded above forany 4o E- 2. This content downloaded from 128.12.187.216 on Sun, 12 May 2013 20:54:56 PM All use subject to JSTOR Terms and Conditions 97 CONVERGENCE OF EM ALGORITHM The compactnessassumptionin (6) can be restrictivewhenno realisticcompactification ofthe originalparameterspace is available. Such may happen in,say,variance components models and factoranalysis.To avoid trivialities,we assume that the startingpoint4o of an EM or GEM algorithmsatisfiesL (0o) > -oo. When we computethe derivativesofL, Q, H at 4)p,we assume that4p is in theinteriorofS2.Such an assumptionis implied,forexample, by S2< is in the interiorof 2 forq)oE 2. (9) 2.1 Convergenceto global maximum,local maximumor stationaryvalue? From (3) and (8), L (4p) convergesmonotonicallyto some L *. There is no guaranteethat L * is the global maximumofL (4) over 2 forthe EM algorithm.Althougha global maximizationof Q is involvedin the M-step, the other termH in L = Q-H may not cooperate. Even the question of convergence to a local maximum cannot be satisfactorilyanswered. For simplicitywe assume that q)pconvergesto some 4* in the interiorof 2, that the Hessian matricesD20 Q(O* I 0* ) and D2OH(O* I 0*) with respect to the first4* variable exist,and that D20Q (4)' 1 ) is continuous in (4', 4). Then -D20Q (4* I 0*) is non-negativedefinite (n.n.d.)accordingto the definitionofthe M-step,and -D2OH(O* I 0*) is n.n.d.accordingto Lemma 2 of DLR. Since D2L(O*) = D20Q(4)* I )* - D2OH(O* I 0*), * may not be a local maximum. We give an example (Murray, 1977) to illustratethe possibilityof convergingto a stationaryvalue but not to a local maximum.The twelveobservationsin the displaybelow come froma bivariatenormal distributionwith zero means, correlationcoefficientp and aseik 2 variances ui, U2, where asterisksrepresentmissingvalues. For these data the likelihood has two 1 -1 Variable 1: 1 Variable2: 1 -1 -1 2 2 -2 1 -1 * * * -2 * * * * 2 2 -2 * -2 8/3and a saddle point at p = 0, u2 - (2 - 5/2If the global maxima at p = + 1/2, l = (= startingpointofan EM sequence has p = 0, thenp = 0 remainstrueforall the subsequent iterationsand the sequence convergesto the saddle point. If p is bounded away fromzero in the EM iterations,thenthe sequence convergesto eithermaximum.We will come back to this example afterCorollary1. In general,if the log-likelihoodL has several (local or global) maxima and stationary points,convergenceof the EM sequence to eithertype of point depends on the choice of startingpoint.This phenomenonhas also been reportedin Hasselblad (1969),Wolfe(1970), Haberman (1974), Laird (1978), Rubin and Thayer (1982). The above discussion on the convergence to local or global maximum may seem redundant,since it is known that no general optimizationalgorithmsare guaranteed to convergeto local maxima.Over the last fewyears some misconceptionofthe powerofEM has developed,partlybecause of the global maximizationnature of its M-step. We hope that our discussionwill help to remove the misconception. We now considerthe issue of convergenceto stationaryvalues. The main theoremsof thissectionrelyon the followingresult.A map A frompointsofX to subsets ofX is called a point-to-setmap on X. It is said to be closed at x ifXk -> X, Xk E X and Yk-> Y,Yk E A (Xk), implyy E A (x). For point-to-point map, continuityimplies closedness. GLOBAL CONVERGENCE THEOREM. Let the sequence {Xk} k=o be generated by Xk?l E M(xk), whereM is a point-to-setmap on X. Let a solutionset F C X be given,and suppose that: (i) all points Xk are contained in a compact set S C X; (ii) M is closed over the complementof F; (iii) thereis a continuousfunctiona on X such that (a) if x C F, a (y) > a (x) forall y E M(x), and (b) ifx E F, a(y) a-(x) forall y E M(x). Then all the limit points of {xk} are in the solution set F and a(Xk) converges monotonicallyto a (x) forsome x E r. This content downloaded from 128.12.187.216 on Sun, 12 May 2013 20:54:56 PM All use subject to JSTOR Terms and Conditions 98 C. F. JEFF WU PROOF. See Zangwill(1969,page 91). Let M be the point-to-setmap in a GEM iterationand let a(x) be the log-likelihood functionL. Take the solutionset F to be X1= set of local maxima in the interiorof S2, or M= set of stationarypoints in the interiorof U. Condition(iii)(b) followsfrom(3) and condition(i) followsfrom(3), (8). Then Theorem 1 followsas a special case of the above theorem. THEOREM 1. Let {4)p) be a GEM sequence generated byq)p+l E M(op), and suppose that (i) M is a closed point-to-set map over the complement of Y (resp. #), (ii) L (Op?i) > L(4p) forall 4p C Y' (resp. #). Then all the limitpoints of {fop}are stationarypoints (local maxima) ofL, and L (4p) convergesmonotonicallyto L * = L(0*) forsome 0* E Y (resp. #). For the EM algorithm,it is easy to show that a simple sufficientconditionfor the closedness ofM is that (10) in both4 and4. Q(4 I )) is continuous This condition is very weak and should be satisfied in most practical situations. For convergenceto stationaryvalues it turnsout to be the only requiredregularitycondition (in additionto those givenbefore).The followingtheoremis most usefulin that it covers a broad range of statisticalapplications. THEOREM 2. Suppose Q satisfies the continuitycondition (10). Then all the limit points of any instance {fp} of an EM algorithmare stationarypoints of L and L (0p) convergesmonotonicallyto L * = L (0*) forsome stationarypoint 4)*. PROOF. Since (10) is sufficient for(i) ofTheorem 1,it remainsto prove (ii) ofTheorem 1 forall 4p C Y. Consider a 4p, which is in the interiorof S by (9). Since 4p maximizes H(4 I 4p) over 4 E S2accordingto (4), D10H(+p I4p) = 0. ThereforeDL (0p) = D10Q((+pIOp) # 0 forany 4p C Y fromthe definitionof Y, implyingthat 4p is not a local maximumof Q (4 I 4p) over 4 E 2. From the definitionofthe M-step,Q (0p+jI4p) > Q (4pI4p). Together with (4), this proves L (Op+i) > L (4p) forall 4p C Y. The desired resultfollows.O The same argumentdoes notapplywhen9 is replacedby M. This is easilydemonstrated by consideringa 4p in 9 but not in M. Here DL (4p) = D 10Q(4pI4p) = 0 and 4p may indeed E maximizeQ (4 I 4p) over 4 (2. The EM iterationterminatesat 4)p,a stationarypoint but not a local maximum,and L (4p+i) > L (4p) does not hold forsuch op. Thus to guarantee convergenceto a local maximum,we need a furtherconditionsuch as (11) below. Since (11) holds forany 4 C Y, (4) and (11) imply(ii) of Theorem 1 forall OpC . The following theoremis a special case of Theorem 1. THEOREM 3. Suppose Q satisfies the continuitycondition (10) and (11) supf'EQ()' I4) > Q(4 I) I forany E-e\ X(. Then all the limitpoints of any instance {(p) of an EM algorithmare local maxima of L and L (op) convergesmonotonicallyto L * = L (4*) forsome local maximum4 *. Since (11) is typicallyhard to verify,the utilityof Theorem 3 is somewhatlimited.An importantclass of densitiesthat satisfy(10) is the curved exponentialfamily This content downloaded from 128.12.187.216 on Sun, 12 May 2013 20:54:56 PM All use subject to JSTOR Terms and Conditions 99 CONVERGENCE OF EM ALGORITHM f(x 14))= b(x)exp{fTt(x)}/a(0), (12) where the parameters 4 lie in a compact submanifold So of the r-dimensionalconvex region 2 = {(: a(+) = f.b(x)exp[OTt(x)] dx < 00). Here Q(0'10) = - loga(o') + 4). From the compactness of 2o and properties of E{logb(x)ly, 4) + 0'TE{t(x)Iy, exponentialfamily,one can easily verifycondition(10). We emphasize that the continuity of Q also hold trueformany densitiesoutside the exponentialfamily. Theorem 1 is the most general result for EM and GEM algorithms.The result in Theorem 2 was obtained by Baum et al. (1970) and Haberman (1977) for two special models. Boyles (1980) obtained a similar result for general models but under stronger regularityconditions.One key conditionin Baum et al. (1970) and Boyles (1980) is thatM is a continuouspoint-to-pointmap over 2, which is strongerthan a closed point-to-set map over 2\99 (assumed in Theorem 1). Boyles showed that his Lemma 4.2 covers the regular exponential family,a case not frequentlyencountered in practice, while our Theorem 2 covers a much broader class of applications. Murray'sexample illustratesTheorems 2 and 3 well. Convergenceof any EM sequence to a stationaryvalue followsfromTheorem 2 or Corollary1. But whydoes an EM sequence converge to a saddle point in this example? From the assumptions of Theorem 3, it becomes clear that condition(11) is being violated at the saddle point p = 0, l = 2= 5/2. In fact the saddle point maximizesthe likelihood over all the parametricspecifications with p = 0. On the other hand, if a particularEM sequence is not attractedtoward the hyperplanep = 0, then violationof (11) is avoided and convergenceof this EM sequence to local maxima is guaranteed. 2.2 Convergenceof an EM or GEM sequence {fp}. The convergenceof L (4p) to L * studied in Section 2.1 does not automaticallyimplythe convergenceof 4p to a point 4*. (The implicationin the reversedirectionis trueifD 10H(4' 4)) is continuous.)Convergence ofEM in the lattersense usually requiresmore stringentregularityconditionsand will be addressed in this subsection. Theorem 2 ofDLR incorrectlyclaimed that,undertheirconditions(1) and (2), a GEM sequence {fp} convergesto a point4* in the closureof 2. A GEM (but not EM) sequence was givenin Boyles (1983) as a counterexample.The DLR argumentwould be valid only ifcondition(2) of theirtheoremcould be replaced by the following: (13) Q (OP+l lfp)-Q (OPl p ) > A 1 p+l-op lI for some A >O and all p. However,if p --*, then (13) implies fl DlOQ(4* 1)* ) 11= 11DL (4* ) l -\ > O, and 4* can not be a stationarypoint.There appears to be no simple way to fixtheirproof.From the numericalviewpoint,the convergenceof4pis not as importantas the convergenceofL (4p) to stationaryvalues or local maxima. Since the originalDLR claim was made in termsof the convergenceof {fp} and since subsequentEM users oftenquoted thisresult,a rigorous studyof this problemis worthwhile. Define 99(a) = {4 E YML(+) a) and M (a) = {4 E M: L(+) = a). Under the assumptionsof Theorem 1, L (4p) L* and all the limitpointsof {fp) are in 9(L *) (resp. ,&(L*)). If 99(L*)(resp. Jl(L*)) consists of a single point 4)*,i.e., there can not be two different stationarypoints (resp. local maxima) withthe same L *, then p -+*. THEOREM 4. Let {f), be an instance of a GEM algorithmsatisfyingconditions i) and ii) of Theorem 1. If 9(L *)(resp. J/(L*)) = {4*}, whereL * is the limitof L (4p) in Theorem 1, then 4p -*. The above assumption99(L *) = {)* } can be greatlyrelaxed ifwe assume ||p+i 0 as p -> oo,a condition necessaryfor the desired result 4p -*. - | 4),p THEOREM 5. Let {fp) be an instance of a GEM algorithmsatisfyingconditions i) O--- 0 asp -0oo, thenall the limitpoints of {fp) are in and ii) of Theorem1. If 114p+i-i-p 11 This content downloaded from 128.12.187.216 on Sun, 12 May 2013 20:54:56 PM All use subject to JSTOR Terms and Conditions C. F. JEFF WU 100 a connectedand compactsubset ofS?(L *) (respectivelyX1 (L *)), whereL * is the limitof L(4p) in Theorem 1. In particular, if 9'(L*)(resp. dk(L*)) is discrete, i.e. its only connected componentsare singletons, then 4p converges to some 4* in 9(L*)(resp. ,&(L *)). PROOF. From assumption(6), {op) is a bounded sequence. Accordingto Theorem 28.1 of Ostrowski(1966), the set of limitpoints of a bounded sequence {top}with 11 Op+i - 4p 11 -O 0 as p -*oo is connectedand compact. From Theorem 1, all the limitpointsof {dip} are already in 9(L * ) (resp. X&(L * )). The desiredresultnow follows.0 Note that condition(ii) of Theorem 1 for4,pC 9'is automaticallysatisfiedforany EM sequence as demonstratedin the proofof Theorem 2. Theorem 5 was proved in Boyles (1980, 1982, 1983), under differentregularityconditions. Theorem 4 was obtained by Hartleyand Hocking (1971) fora special model. 9onvergenceof 4ppto a stationarypoint can be proved withoutrecourseto Theorem 1. Let Y(L) = {4)E l:L(4)) =L}. THEOREM 6. Let {4p) be an instance of a GEM algorithm with the additional propertyD10Q (p+i+iI4p) = 0. Suppose D 10Q()' 14)) is continuous in 4)' and 4. Then pp convergesto a stationarypoint 4)*withL (4*) = L *, the limitofL (4p), ifeither(a) S(L *) oo and f(L* ) is discrete. {)* }, or (b) 1IIp++- p O-> 11 0 asp forsome 4* E As in the proofsof Theorems 4 and 5, we can show 4p -* PROOF. S(L *). DL (4)*) = DOQ (4)* 14)*) = 0 followsfrom(4), D 10Q(4p+l14p))= 0 and the continuity ofD10Q (4)'j)I in4)' and 4). Note that EM satisfiesD 0Q(4,p+i14)p)= 0. Since X (L) and YS(L) are subsets ofYI(L), conditions(a) and (b) in Theorem 6 are strongerthan the correspondingones in Theorems 4 and 5 respectively.The advantage ofTheorem 6 is that it does not requireconditions(i) and (ii) of Theorem 1. An importantspecial case is the following. COROLLARY 1. Suppose that L (4)) is unimodal in S2with4* being the onlystationary point and that D10Q(O' 14) is continuous in 4) and 4'. Then for any EM sequence {fp}, 4p convergesto the unique maximizer4* of L (4). From the viewpointof the user,Theorem 2 and Corollary1 are the most usefulresults since they require conditionsthat are very easy to verify.Their applications to specific statistical problems are numerous,e.g., Haberman (1977), Redner and Walker (1982), Turnbull (1976), Turnbulland Mitchell (1981), Vardi (1982). There are two key assumptions (in addition to those discussed in Section 2.1) in Theorem 5 forthe convergenceof (4pp):discretenessof Y5(L *) and 11pp+i- 4)Op 11 0 as p -> oo.The latterconditionmay be hard to verifyin practice. For the special cases stated below,the conditionholds true. conditions for11pp+i- 4pI Sufficient CONDITION 0 as p -* oo: 1. There existsa forcingfunctiona such that Q(4P+1I14p) -Q(4p) p),-,g(114)p+i-p)lj)forallp. (14) (A mappinga: [0, oo)-> [0, oo) is said to be a forcingfunctionif,forany sequence = 0 implieslimk-,,,tk = 0.) From (4), 00), limk ,-U(tk) {tk) in [0, L (4p+,) -L (4pp) ': Q(4p+i14)p)- Q(4),p I)p). (15) Since L(,) > L * and a is a forcingfunction,(14) and (15) implyII,0p+1- 4p,j --* 0. By This content downloaded from 128.12.187.216 on Sun, 12 May 2013 20:54:56 PM All use subject to JSTOR Terms and Conditions 101 CONVERGENCE OF EM ALGORITHM taking((t) = Xt2,X > 0, (14) becomes Q (P+i I4p)-)Q (16) forall p. (p 1p) >-AX 11p+i _ 4p 112 Except forthe regularexponentialfamily,it is difficultto verify(16). We singleout a subclass of GEM algorithmsforwhich (14) is satisfied.Let B (0p) be a positivedefiniter x r matrixcontinuousin op. Then a sufficientconditionis obtained by lettingthe GEM iterationop ----op+,be definedby (17) and (18) (17) op+, wheres < 1 and 0 < spc s op + (pB (op) D '0Q (op Iopl)11 D'0Q (opIop) 11 1 is chosen such that op+, is in the interiorof 2, Iop)T(op+i_ op), IIop)- Q(oap Iop) Q(4W+i I -aD1OQ(op (18) where0 < a < 1 is independentofp. For example,forsteepestascent,we would take B (Op) = L The Armijoline search method (Polak, 1971,page 36;'Ortega and Rheinboldt,1970, algorithmdesignedforfindingthe steplength(p specified page 491) is a finite-terminating in (17) and (18). Such a 4palways exists because op is in the interiorby assumption, B (op)D 10Q in thedirection Q (4)I:op)is locallyincreasing (opIop), and 0 < a < 1. By the nature of the GEM iteration (17) and (18), the algorithm terminates at a ,p with D 10Q (op Iop) = 0. It is based solely on the directionalderivativesof Q and line searches along the directions.We now proceed to prove that the GEM iteration (17) and (18) satisfies(14) with((t) = ct(2-s)/(l-s) Let A* = sup,.g.0 {max eigenvalue ofB ()). From (17) and (18), Q(OP+llIp) - > aGi(op+i - 0p))TB-1(p) Q(OP IP) )(op+ I >' A*-X 0P+i 11 - a(X* )r(1-s)I- II 4p+l - op)ID10Q(opIop)Is p 11 2((p-lX*-l 11p+l _ o)p _ opII)(s)/(1-S) (2-s)/(1-s) Note that X* < oosince S20,is compact. For otheriterationschemes,sufficientconditions 0 can be foundin Ortega and Rheinboldt (1970, Chapter 14). for11op+, - op, We concludethissubsectionwitha discussionofthe discretenessassumptionon f(L *), whichwas not mentionedin DLR. If L (4)) has a ridge of stationarypoints in which L (4) = L *, i.e. f(L *) is not discrete,does an EM sequence with 11op+, -op 11-O 0 convergeto a 4)*in 99(L *)? Or may it sometimesmove indefinitely on the ridge?DLR (1977,page 10) made an unwarrantedclaim that the formermust be true. Although(16), an assumption made in DLR, impliesEp=o II _p+1 op 112< ooand 11op+,-op O--11 0, it is stillpossible to have ||I4p+, - op 11= ??. A sequence with this propertymay move indefinitelyon a ridge. E Althoughwe do'not have a counter-exampleto thisclaim by DLR, our generalimpression is that, if f((L*) is not discrete, convergence of {fp) can only be guaranteed under conditionsstrongerthan those assumed in theirpaper. This beliefis furthersupportedby the GEM counterexampleof Boyles (1983). 3. Summary of properties of EM algorithm. (i) Any EM sequence {fp) increases the likelihood and L (op), if bounded above, convergesto some L *. (ii) If Q ('PI)) is continuous in both 4 and 4, L * is a stationaryvalue of L. The continuityof Q holds forthe importantcase of a curved exponentialfamily.If 4p converges to some point 4*, 4* is a stationarypoint under the continuity ofD10Q(4' 14) in4' and 4. condition This content downloaded from 128.12.187.216 on Sun, 12 May 2013 20:54:56 PM All use subject to JSTOR Terms and Conditions 102 C. F. JEFF WU (iii) If,in additionto (ii), Q is not trappedat any point pothat is a stationarypointbut not a local maximumofL, i.e. supoE=aQ(4I Po) > Q(4o IGo),thenL* is also a local maximumofL. But thisconditionmay be difficult to verify.Since the convergence to stationaryvalue or local maximumor global maximumdepends on the choice of startingpoints, we recommend that several EM iterations be tried with different startingpoints that are representativeof the parameterspace. (iv) If, in addition to (ii) or (iii), 11Op+j- p 11 0 as p -> oo and the set of stationary points (local maxima) with a given L value is discrete,then 4p convergesto a stationarypoint (local maximum). (v) If, in addition to (ii) or (iii), there cannot exist two differentstationarypoints (local maxima) with the same L value, then 4p convergesto a stationarypoint (local maximum). 0 is condition(14), which is satisfiedby (vi) A sufficientconditionfor11Op,j - Ip a class of optimizationalgorithms(17) and (18), whose iterationscheme is based on the local directionalderivativesof Q and Armijoline searches along the chosen direction.For a regularexponentialfamily,a special case (16) of (14) is satisfied by the EM algorithm. is (vii) If L (f) is unimodal in S2 and has only stationary point and D 10Q (?') continuousin 0 and p',then op convergesto the unique maximizerp*of L (p). (viii) If the set ofstationarypoints (local maxima) witha givenL value, denotedf(L) (respectively./ (L)), is not discrete,and 11 -p+1- p 11 0, then 4p convergesto a compact, connected component of f(L) (resp. M (L)) but not necessarilyto a point. The originalclaim by DLR that 4p convergesto a point does not seem to hold underthe conditionstheystated. But we emphasize that the convergenceof the EM sequence 4S is not as importantas the convergenceof L (0p) to desired locations on the log-likelihoodsurface,an issue largelyresolved in the present article. Acknowledgments. helpfulcomments. The authorwishesto thankWing-HungWong and a refereefor REFERENCES BAUM,L. E., PETRIE, T., SOULES, G. and WEISS, N. (1970). A maximizationtechnique occurringin the statisticalanalysisof probabilisticfunctionsof Markov chains. Ann. Math. Statist. 41 164-171. BOYLES,R. A. (1980). Convergenceresultsforthe EM algorithm.Technical Report No. 13, Division of Statistics,Universityof California,Davis. BOYLES,R. A. (1983). On the convergenceofthe EM algorithm.J. Roy. Statist. Soc. B 44. To appear. DEMPSTER,A. P., LAIRD, N. M. and RUBIN, D. B. (1977). Maximum likelihoodfromincompletedata via the EM algorithm(withdiscussion).J. Roy. Statist. Soc. Ser. 39 1-38. HABERMAN,S. J. (1974). Loglinear models for frequencytables derived by indirect observation: maximumlikelihoodequations. Ann. Statist. 2 911-924. HABERMAN,S. J. (1977). Product models forfrequencytables involvingindirectobservation.Ann. Statist. 5 1124-1147. HARTLEY,H. 0. and HOCKING,R. R. (1971). The analysisof incompletedata. Biometrics27 783-808. HASSELBLAD,V. (1969). Estimationof finitemixturesof distributionsfromthe exponentialfamily.J. Amer.Statist. Assoc. 64 1459-1471. LAIRD, N. (1978). Nonparametricmaximumlikelihoodestimationof a mixingdistribution.J. Amer. Statist. Assoc. 73 805-811. MURRAY, G. D. (1977). Contributionto discussionofpaper by A. P. Dempster,N. M. Laird and D. B. Rubin. J. Roy. Statist. Soc. Ser. B 39 27-28. ORCHARD, T. and WOODBURY, M. A. (1972). A missinginformation principle:theoryand applications. Proc. 6thBerkeleySymposiumon Math. Statist. and Probab. 1 697-715. ORTEGA, J. M. and RHEINBOLDT, W. C. (1970). IterativeSolution ofNonlinear Equations in Several Variables. Academic,New York. OSTROWSKI, A. M. (1966). Solution of Equations and Systemsof Equations. 2nd Edition. Academic, New York. POLAK,E. (1971). ComputationalMethods in Optimization.Academic, New York. This content downloaded from 128.12.187.216 on Sun, 12 May 2013 20:54:56 PM All use subject to JSTOR Terms and Conditions CONVERGENCE OF EM ALGORITHM 103 REDNER, R. A. and WALKER, H. F. (1982). Mixture densities,maximum likelihood,and the EM algorithm.Preprint. D. B. and THAYER, D. T. (1982). EM algorithmsforML factoranalysis. Psychometrika 47 69-76. SUNDBERG, R. (1974). Maximum likelihoodtheoryforincompletedata froman exponentialfamily. RUBIN, Scand. J. Statist.1 49-58. grouped,censored and TURNBULL, B. W. (1976). The empiricaldistributionfunctionwith arbitrarily truncateddata. J. Roy.Statist.Soc. B 38 290-295. TURNBULL, B. W. and MITCHELL, T. J. (1981). Nonparametricestimationof the time to onset for specific diseases in survival/sacrifice experiments. Biometrics.To appear. 10 772-785. VARDI, Y. (1982). Nonparametric estimation in renewal processes. Ann.Statistics WOLFE, J. H. (1970). Pattern clustering by multivariate mixture analysis. Multivariate Behavioral Research 5 329-350. A ZANGWILL, W. I. (1969).NonlinearProgramming: Cliffs,New Jersey. UnifiedApproach.Prentice Hall, Englewood DEPARTMENT OF STATISTICS UNIVERSITY OF WISCONSIN 1210 WEST DAYTON ST. MADISON, WISCONSIN 53706 This content downloaded from 128.12.187.216 on Sun, 12 May 2013 20:54:56 PM All use subject to JSTOR Terms and Conditions