On the Convergence Properties of the EM Algorithm

advertisement
On the Convergence Properties of the EM Algorithm
Author(s): C. F. Jeff Wu
Source: The Annals of Statistics, Vol. 11, No. 1 (Mar., 1983), pp. 95-103
Published by: Institute of Mathematical Statistics
Stable URL: http://www.jstor.org/stable/2240463 .
Accessed: 12/05/2013 20:54
Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at .
http://www.jstor.org/page/info/about/policies/terms.jsp
.
JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of
content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms
of scholarship. For more information about JSTOR, please contact support@jstor.org.
.
Institute of Mathematical Statistics is collaborating with JSTOR to digitize, preserve and extend access to The
Annals of Statistics.
http://www.jstor.org
This content downloaded from 128.12.187.216 on Sun, 12 May 2013 20:54:56 PM
All use subject to JSTOR Terms and Conditions
The Annals of Statistics
1983,Vol. 11, No, 1, 95-103
ON THE CONVERGENCE PROPERTIES OF THE EM ALGORITHM'
BY C. F.
JEFF WU
Universityof Wisconsin,Madison
are studied:(i) does the
aspectsoftheEM algorithm
Two convergence
valueofthe(incompleteora stationary
EM algorithm
finda localmaximum
data) likelihoodfunction?(ii) does the sequenceof parameterestimates
resultsare obtainedunder
Severalconvergence
generated
byEM converge?
Twousefulspecial
conditions
thatareapplicabletomanypracticalsituations.
can be described
complete-data
specification
casesare: (a) iftheunobserved
space,all thelimit
family
withcompactparameter
bya curvedexponential
pointsofthe likelihoodfunction;
pointsofanyEM sequenceare stationary
(b) if the likelihoodfunctionis unimodaland a certaindifferentiability
to theuniquemaxiis satisfied,
thenanyEM sequenceconverges
condition
is included.
A listofkeyproperties
ofthealgorithm
mumlikelihood
estimate.
1. Introduction. Dempster,Laird and Rubin (1977) (henceforthabbreviated DLR)
introducedthe EM algorithmforcomputingmaximumlikelihoodestimatesfromincomplete data. The essentialideas underlyingthe EM algorithmhave been presentedin special
cases by many authors;see DLR fora detailed account. Amongthem we mentionBaum
et al. (1970),Hartleyand Hocking (1971), Orchardand Woodbury(1972), Sundberg (1974).
The DLR paper has made threesignificantcontributions:(i) it recognizesthe expectation
step (E-step) and the maximizationstep (M-step) in theirgeneralforms,(ii) it gives some
theoreticalpropertiesof the algorithm,and (iii) it recognizesand gives a wide range of
applicationsin statistics.
However,the proofof convergenceof EM sequences in DLR contains an error.The
implicationfrom(3.13) to (3.14) in their Theorem 2 fails due to an incorrectuse of the
triangleinequality.Additionalcommentson thisproofare givenin Section 2.2. Therefore
the convergenceof EM sequence as proved in their Theorems 2 and 3 is cast in doubt.
Otherresultson the monotonicityoflikelihoodsequence and the convergencerate of EM
sequence (Theorems 1 and 4 of DLR) remainvalid.
Despite its slow numericalconvergence,the EM algorithmhas become a verypopular
computationalmethod in statistics. Contraryto the general experience in numerical
optimization,the implementationof the E-step and M-step is easy formany statistical
problems,thanksto the nice formof the complete-datalikelihoodfunction.Solutions of
the M-step oftenexist in closed form.In many cases the M-step can be performedwitha
standardstatisticalpackage, thus saving programmingtime. Anotherreason forstatisticians to preferEM is that it does not requirelarge storagespace. These two featuresare
especiallyattractiveto those withfreeaccess to small computers.
In thispaper, instead ofpatchingup the originalproofof DLR, we studymore broadly
two convergenceaspects of the EM algorithm.Our approach is to view EM as a special
optimizationalgorithmand to utilizeexistingresultsin the optimizationliterature.
Formallywe have two sample spaces . and cNand a many-to-onemappingfrom. to
OY.Instead of observingthe "complete data" x in ?, we observe the "incompletedata" y
- y(x) in ON.
Let the densityfunctionof x be f (x I4) with parameters4 E f2and let the
densityfunctionof y be givenby
ReceivedMay 1981;revisedAugust1982.
GrantNo. MCS-7901846.
'Researchsupported
bytheNationalScienceFoundation
AMS 1980 subject classification.Primary62F10, 90C30.
incomplete
data,curvedexponential
GEM algorithm,
Key wordsand phrases.EM algorithm,
maximum
likelihood
estimate.
family,
95
This content downloaded from 128.12.187.216 on Sun, 12 May 2013 20:54:56 PM
All use subject to JSTOR Terms and Conditions
C. F. JEFF WU
96
g(yI4))
f(xIo) dx,
where X(y) = {x: y(x) = y). The parameters 4 are to be estimated by the method of
maximumlikelihood,i.e., by maximizingg(y I4) over 4 E 2. In many statisticalproblems,
maximizationof the complete-dataspecificationf(x 1I ) is simplerthan that of the incomplete-data specificationg(y 1I ). A main featureof the EM algorithmis maximizationof
f(x 14))over 4 E 2 (M-step). Since x is unobservable, we replace logf(x 14))by its
conditionalexpectationgiveny and the currentfit4p)(E-step).To this end, let k (x Iy, 4)
f (x I4)/g(y 14) be the conditionaldensityof x giveny and 4. Then the log-likelihood
L(4') = logg(yI o') = Q(O'I4)) -H(O'I 4),
(1)
whereQ (O'I 4))= E {logf(xI o') Iy, )) andH(O'l4) = E {logk(x I y,4') y, )) areassumed
), 4p,i E M(qp) as follows:
to existforall pairs (4', 4). We now definethe EM iteration4p
-
Q (4 I4p).
E-STEP. Determine
M-STEP. Choose op?1 to be any value*,ofE)E 2 whichmaximizesQ (4 I 4p).
Note that M is a point-to-setmap, i.e., M(4p) is the set of 4 values which maximizes
Many applicationsof EM are forthe curved exponentialfamily,for
whichthe E-step and M-step take special forms.
Sometimes it may not be numericallyfeasible to performthe M-step. DLR defineda
generalizedEM algorithm(a GEM algorithm)to be an iterative scheme 4)p-* 4)?p+iE
M(4p),where4 -* M(4)) is a point-to-setmap, such that
Q( I4p)over 4 E S.
(2)
Q(4'I) ' Q(4)I4))forall
' E M(+).
EM is a special case of GEM. For any instance {fp) of a GEM algorithm,
(3)
L(Op+1)2 L(Op)
followsfromthe definitionof GEM and the inequality
(4)
H (. I(t) >- H (.' I(t) forany O' (= S2.
For proofsof (3) and (4), see Lemma 1 and Theorem 1 of DLR.
In Sections 2.1 and 2.2 we inspect two convergence aspects of the EM and GEM
algorithmsand discuss the relationshipof our resultsto previousones in the literature.In
Section 3 we summarizethe keypropertiesofEM. Potential EM users that are not patient
withmathematicaldetails may read the summarybeforeconsultingthe resultsin Section
2.
2. Does EM do the job? The originalpurpose of the EM algorithmwas to provide
iterative computation of the maximum-likelihoodestimates. For a bounded sequence
L (4p), (3) implies that L (4p) convergesmonotonicallyto some L *. We want to know
whetherL * is the global maximumof L (4) over 2. If not, is it a local maximum or a
stationaryvalue? This problem is studied in Section 2.1. A related problem of the
convergenceof an EM or GEM sequence {fp) is studied in Section 2.2. We make the
followingassumptionsforthe rest of the paper:
(5)
(6)
(7)
2 is a subset in the r-dimensionalEuclidean space R,
2'> = {4) E 2: L(4)) 2 L(4)o)) is compact forany L(4o) >
L is continuousin 2 and differentiable
in the interiorof 2.
As a consequence of (5), (6), (7), we have
(8)
-oo
{L(4p))p:),,0is bounded above forany 4o E- 2.
This content downloaded from 128.12.187.216 on Sun, 12 May 2013 20:54:56 PM
All use subject to JSTOR Terms and Conditions
97
CONVERGENCE OF EM ALGORITHM
The compactnessassumptionin (6) can be restrictivewhenno realisticcompactification
ofthe originalparameterspace is available. Such may happen in,say,variance components
models and factoranalysis.To avoid trivialities,we assume that the startingpoint4o of an
EM or GEM algorithmsatisfiesL (0o) > -oo. When we computethe derivativesofL, Q, H
at 4)p,we assume that4p is in theinteriorofS2.Such an assumptionis implied,forexample,
by
S2< is in the interiorof 2 forq)oE 2.
(9)
2.1 Convergenceto global maximum,local maximumor stationaryvalue? From (3)
and (8), L (4p) convergesmonotonicallyto some L *. There is no guaranteethat L * is the
global maximumofL (4) over 2 forthe EM algorithm.Althougha global maximizationof
Q is involvedin the M-step, the other termH in L = Q-H may not cooperate. Even the
question of convergence to a local maximum cannot be satisfactorilyanswered. For
simplicitywe assume that q)pconvergesto some 4* in the interiorof 2, that the Hessian
matricesD20 Q(O* I 0* ) and D2OH(O* I 0*) with respect to the first4* variable exist,and
that D20Q (4)' 1 ) is continuous in (4', 4). Then -D20Q (4* I 0*) is non-negativedefinite
(n.n.d.)accordingto the definitionofthe M-step,and -D2OH(O* I 0*) is n.n.d.accordingto
Lemma 2 of DLR. Since D2L(O*) = D20Q(4)* I )* - D2OH(O* I 0*), * may not be a local
maximum.
We give an example (Murray, 1977) to illustratethe possibilityof convergingto a
stationaryvalue but not to a local maximum.The twelveobservationsin the displaybelow
come froma bivariatenormal distributionwith zero means, correlationcoefficientp and
aseik
2
variances ui, U2, where asterisksrepresentmissingvalues. For these data the likelihood
has two
1 -1
Variable 1:
1
Variable2:
1 -1
-1
2
2 -2
1 -1
*
*
*
-2
*
*
*
*
2
2 -2
*
-2
8/3and a saddle point at p = 0, u2 - (2 - 5/2If the
global maxima at p = + 1/2, l = (=
startingpointofan EM sequence has p = 0, thenp = 0 remainstrueforall the subsequent
iterationsand the sequence convergesto the saddle point. If p is bounded away fromzero
in the EM iterations,thenthe sequence convergesto eithermaximum.We will come back
to this example afterCorollary1.
In general,if the log-likelihoodL has several (local or global) maxima and stationary
points,convergenceof the EM sequence to eithertype of point depends on the choice of
startingpoint.This phenomenonhas also been reportedin Hasselblad (1969),Wolfe(1970),
Haberman (1974), Laird (1978), Rubin and Thayer (1982).
The above discussion on the convergence to local or global maximum may seem
redundant,since it is known that no general optimizationalgorithmsare guaranteed to
convergeto local maxima.Over the last fewyears some misconceptionofthe powerofEM
has developed,partlybecause of the global maximizationnature of its M-step. We hope
that our discussionwill help to remove the misconception.
We now considerthe issue of convergenceto stationaryvalues. The main theoremsof
thissectionrelyon the followingresult.A map A frompointsofX to subsets ofX is called
a point-to-setmap on X. It is said to be closed at x ifXk -> X, Xk E X and Yk-> Y,Yk E A (Xk),
implyy E A (x). For point-to-point
map, continuityimplies closedness.
GLOBAL CONVERGENCE THEOREM.
Let the sequence {Xk} k=o be generated by Xk?l E
M(xk), whereM is a point-to-setmap on X. Let a solutionset F C X be given,and suppose
that: (i) all points Xk are contained in a compact set S C X; (ii) M is closed over the
complementof F; (iii) thereis a continuousfunctiona on X such that (a) if x C F, a (y)
> a (x) forall y E M(x), and (b) ifx E F, a(y)
a-(x) forall y E M(x).
Then all the limit points of {xk} are in the solution set F and a(Xk) converges
monotonicallyto a (x) forsome x E r.
This content downloaded from 128.12.187.216 on Sun, 12 May 2013 20:54:56 PM
All use subject to JSTOR Terms and Conditions
98
C. F. JEFF WU
PROOF. See Zangwill(1969,page 91).
Let M be the point-to-setmap in a GEM iterationand let a(x) be the log-likelihood
functionL. Take the solutionset F to be
X1= set of local maxima in the interiorof S2,
or
M= set of stationarypoints in the interiorof U.
Condition(iii)(b) followsfrom(3) and condition(i) followsfrom(3), (8). Then Theorem 1
followsas a special case of the above theorem.
THEOREM 1. Let {4)p) be a GEM sequence generated byq)p+l E M(op), and suppose
that (i) M is a closed point-to-set map over the complement of Y (resp. #), (ii) L (Op?i) >
L(4p) forall 4p C Y' (resp. #).
Then all the limitpoints of {fop}are stationarypoints (local maxima) ofL, and L (4p)
convergesmonotonicallyto L * = L(0*) forsome 0* E Y (resp. #).
For the EM algorithm,it is easy to show that a simple sufficientconditionfor the
closedness ofM is that
(10)
in both4 and4.
Q(4 I )) is continuous
This condition is very weak and should be satisfied in most practical situations. For
convergenceto stationaryvalues it turnsout to be the only requiredregularitycondition
(in additionto those givenbefore).The followingtheoremis most usefulin that it covers
a broad range of statisticalapplications.
THEOREM 2. Suppose Q satisfies the continuitycondition (10). Then all the limit
points of any instance {fp} of an EM algorithmare stationarypoints of L and L (0p)
convergesmonotonicallyto L * = L (0*) forsome stationarypoint 4)*.
PROOF. Since (10) is sufficient
for(i) ofTheorem 1,it remainsto prove (ii) ofTheorem
1 forall 4p C Y. Consider a 4p, which is in the interiorof S by (9). Since 4p maximizes
H(4 I 4p) over 4 E S2accordingto (4), D10H(+p I4p) = 0. ThereforeDL (0p) = D10Q((+pIOp)
# 0 forany 4p C Y fromthe definitionof Y, implyingthat 4p is not a local maximumof
Q (4 I 4p) over 4 E 2. From the definitionofthe M-step,Q (0p+jI4p) > Q (4pI4p). Together
with (4), this proves L (Op+i) > L (4p) forall 4p C Y. The desired resultfollows.O
The same argumentdoes notapplywhen9 is replacedby M. This is easilydemonstrated
by consideringa 4p in 9 but not in M. Here DL (4p) = D 10Q(4pI4p) = 0 and 4p may indeed
E
maximizeQ (4 I 4p) over 4 (2.
The EM iterationterminatesat 4)p,a stationarypoint but
not a local maximum,and L (4p+i) > L (4p) does not hold forsuch op. Thus to guarantee
convergenceto a local maximum,we need a furtherconditionsuch as (11) below. Since
(11) holds forany 4 C Y, (4) and (11) imply(ii) of Theorem 1 forall OpC . The following
theoremis a special case of Theorem 1.
THEOREM 3. Suppose Q satisfies the continuitycondition (10) and
(11)
supf'EQ()'
I4) > Q(4 I)
I
forany
E-e\ X(.
Then all the limitpoints of any instance {(p) of an EM algorithmare local maxima of
L and L (op) convergesmonotonicallyto L * = L (4*) forsome local maximum4 *.
Since (11) is typicallyhard to verify,the utilityof Theorem 3 is somewhatlimited.An
importantclass of densitiesthat satisfy(10) is the curved exponentialfamily
This content downloaded from 128.12.187.216 on Sun, 12 May 2013 20:54:56 PM
All use subject to JSTOR Terms and Conditions
99
CONVERGENCE OF EM ALGORITHM
f(x 14))= b(x)exp{fTt(x)}/a(0),
(12)
where the parameters 4 lie in a compact submanifold So of the r-dimensionalconvex
region 2 = {(: a(+) = f.b(x)exp[OTt(x)] dx < 00). Here Q(0'10) = - loga(o') +
4). From the compactness of 2o and properties of
E{logb(x)ly, 4) + 0'TE{t(x)Iy,
exponentialfamily,one can easily verifycondition(10). We emphasize that the continuity
of Q also hold trueformany densitiesoutside the exponentialfamily.
Theorem 1 is the most general result for EM and GEM algorithms.The result in
Theorem 2 was obtained by Baum et al. (1970) and Haberman (1977) for two special
models. Boyles (1980) obtained a similar result for general models but under stronger
regularityconditions.One key conditionin Baum et al. (1970) and Boyles (1980) is thatM
is a continuouspoint-to-pointmap over 2, which is strongerthan a closed point-to-set
map over 2\99 (assumed in Theorem 1). Boyles showed that his Lemma 4.2 covers the
regular exponential family,a case not frequentlyencountered in practice, while our
Theorem 2 covers a much broader class of applications.
Murray'sexample illustratesTheorems 2 and 3 well. Convergenceof any EM sequence
to a stationaryvalue followsfromTheorem 2 or Corollary1. But whydoes an EM sequence
converge to a saddle point in this example? From the assumptions of Theorem 3, it
becomes clear that condition(11) is being violated at the saddle point p = 0, l = 2= 5/2.
In fact the saddle point maximizesthe likelihood over all the parametricspecifications
with p = 0. On the other hand, if a particularEM sequence is not attractedtoward the
hyperplanep = 0, then violationof (11) is avoided and convergenceof this EM sequence
to local maxima is guaranteed.
2.2 Convergenceof an EM or GEM sequence {fp}. The convergenceof L (4p) to L *
studied in Section 2.1 does not automaticallyimplythe convergenceof 4p to a point 4*.
(The implicationin the reversedirectionis trueifD 10H(4' 4)) is continuous.)Convergence
ofEM in the lattersense usually requiresmore stringentregularityconditionsand will be
addressed in this subsection.
Theorem 2 ofDLR incorrectlyclaimed that,undertheirconditions(1) and (2), a GEM
sequence {fp} convergesto a point4* in the closureof 2. A GEM (but not EM) sequence
was givenin Boyles (1983) as a counterexample.The DLR argumentwould be valid only
ifcondition(2) of theirtheoremcould be replaced by the following:
(13)
Q (OP+l lfp)-Q
(OPl p ) > A 1 p+l-op
lI for some
A >O
and all p.
However,if p --*, then (13) implies fl
DlOQ(4* 1)* ) 11= 11DL (4* ) l -\ > O, and 4* can
not be a stationarypoint.There appears to be no simple way to fixtheirproof.From the
numericalviewpoint,the convergenceof4pis not as importantas the convergenceofL (4p)
to stationaryvalues or local maxima. Since the originalDLR claim was made in termsof
the convergenceof {fp} and since subsequentEM users oftenquoted thisresult,a rigorous
studyof this problemis worthwhile.
Define 99(a) = {4 E YML(+)
a) and M (a) = {4 E M: L(+) = a). Under the
assumptionsof Theorem 1, L (4p) L* and all the limitpointsof {fp) are in 9(L *) (resp.
,&(L*)). If 99(L*)(resp. Jl(L*)) consists of a single point 4)*,i.e., there can not be two
different
stationarypoints (resp. local maxima) withthe same L *, then p -+*.
THEOREM
4. Let {f), be an instance of a GEM algorithmsatisfyingconditions i)
and ii) of Theorem 1. If 9(L *)(resp. J/(L*)) = {4*}, whereL * is the limitof L (4p) in
Theorem 1, then 4p -*.
The above assumption99(L *) = {)* } can be greatlyrelaxed ifwe assume ||p+i
0 as p -> oo,a condition necessaryfor the desired result 4p -*.
-
|
4),p
THEOREM
5. Let {fp) be an instance of a GEM algorithmsatisfyingconditions i)
O--- 0 asp -0oo, thenall the limitpoints of {fp) are in
and ii) of Theorem1. If 114p+i-i-p 11
This content downloaded from 128.12.187.216 on Sun, 12 May 2013 20:54:56 PM
All use subject to JSTOR Terms and Conditions
C. F. JEFF WU
100
a connectedand compactsubset ofS?(L *) (respectivelyX1 (L *)), whereL * is the limitof
L(4p) in Theorem 1. In particular, if 9'(L*)(resp. dk(L*)) is discrete, i.e. its only
connected componentsare singletons, then 4p converges to some 4* in 9(L*)(resp.
,&(L *)).
PROOF. From assumption(6), {op) is a bounded sequence. Accordingto Theorem 28.1
of Ostrowski(1966), the set of limitpoints of a bounded sequence {top}with 11
Op+i - 4p 11
-O 0 as p -*oo is connectedand compact. From Theorem 1, all the limitpointsof {dip} are
already in 9(L * ) (resp. X&(L * )). The desiredresultnow follows.0
Note that condition(ii) of Theorem 1 for4,pC 9'is automaticallysatisfiedforany EM
sequence as demonstratedin the proofof Theorem 2. Theorem 5 was proved in Boyles
(1980, 1982, 1983), under differentregularityconditions. Theorem 4 was obtained by
Hartleyand Hocking (1971) fora special model.
9onvergenceof 4ppto a stationarypoint can be proved withoutrecourseto Theorem 1.
Let Y(L) = {4)E l:L(4)) =L}.
THEOREM
6. Let {4p) be an instance of a GEM algorithm with the additional
propertyD10Q (p+i+iI4p) = 0. Suppose D 10Q()' 14)) is continuous in 4)' and 4. Then pp
convergesto a stationarypoint 4)*withL (4*) = L *, the limitofL (4p), ifeither(a) S(L *)
oo and f(L* ) is discrete.
{)* }, or (b) 1IIp++- p
O->
11 0 asp
forsome 4* E
As in the proofsof Theorems 4 and 5, we can show 4p -*
PROOF.
S(L *). DL (4)*) = DOQ (4)* 14)*) = 0 followsfrom(4), D 10Q(4p+l14p))= 0 and the continuity
ofD10Q (4)'j)I in4)' and 4).
Note that EM satisfiesD 0Q(4,p+i14)p)= 0. Since X (L) and YS(L) are subsets ofYI(L),
conditions(a) and (b) in Theorem 6 are strongerthan the correspondingones in Theorems
4 and 5 respectively.The advantage ofTheorem 6 is that it does not requireconditions(i)
and (ii) of Theorem 1. An importantspecial case is the following.
COROLLARY 1. Suppose that L (4)) is unimodal in S2with4* being the onlystationary
point and that D10Q(O' 14) is continuous in 4) and 4'. Then for any EM sequence {fp},
4p convergesto the unique maximizer4* of L (4).
From the viewpointof the user,Theorem 2 and Corollary1 are the most usefulresults
since they require conditionsthat are very easy to verify.Their applications to specific
statistical problems are numerous,e.g., Haberman (1977), Redner and Walker (1982),
Turnbull (1976), Turnbulland Mitchell (1981), Vardi (1982).
There are two key assumptions (in addition to those discussed in Section 2.1) in
Theorem 5 forthe convergenceof (4pp):discretenessof Y5(L *) and 11pp+i- 4)Op 11 0 as p
-> oo.The latterconditionmay be hard to verifyin practice. For the special cases stated
below,the conditionholds true.
conditions
for11pp+i- 4pI
Sufficient
CONDITION
0 as p
-* oo:
1. There existsa forcingfunctiona such that
Q(4P+1I14p)
-Q(4p) p),-,g(114)p+i-p)lj)forallp.
(14)
(A mappinga: [0, oo)-> [0, oo) is said to be a forcingfunctionif,forany sequence
= 0 implieslimk-,,,tk = 0.) From (4),
00), limk ,-U(tk)
{tk)
in
[0,
L (4p+,) -L (4pp) ': Q(4p+i14)p)- Q(4),p
I)p).
(15)
Since L(,)
>
L * and a is a forcingfunction,(14) and (15) implyII,0p+1- 4p,j --* 0. By
This content downloaded from 128.12.187.216 on Sun, 12 May 2013 20:54:56 PM
All use subject to JSTOR Terms and Conditions
101
CONVERGENCE OF EM ALGORITHM
taking((t) = Xt2,X > 0, (14) becomes
Q (P+i I4p)-)Q
(16)
forall p.
(p 1p) >-AX
11p+i _ 4p 112
Except forthe regularexponentialfamily,it is difficultto verify(16).
We singleout a subclass of GEM algorithmsforwhich (14) is satisfied.Let B (0p) be a
positivedefiniter x r matrixcontinuousin op. Then a sufficientconditionis obtained by
lettingthe GEM iterationop ----op+,be definedby (17) and (18)
(17)
op+,
wheres < 1 and 0 <
spc
s
op + (pB (op) D '0Q (op Iopl)11 D'0Q (opIop) 11
1 is chosen such that op+, is in the interiorof 2,
Iop)T(op+i_ op),
IIop)- Q(oap
Iop)
Q(4W+i
I -aD1OQ(op
(18)
where0 < a < 1 is independentofp. For example,forsteepestascent,we would take B (Op)
= L The Armijoline search method (Polak, 1971,page 36;'Ortega and Rheinboldt,1970,
algorithmdesignedforfindingthe steplength(p specified
page 491) is a finite-terminating
in (17) and (18). Such a 4palways exists because op is in the interiorby assumption,
B (op)D 10Q
in thedirection
Q (4)I:op)is locallyincreasing
(opIop), and 0 < a < 1. By the
nature of the GEM iteration (17) and (18), the algorithm terminates at a ,p with
D 10Q
(op Iop) = 0. It is based solely on the directionalderivativesof Q and line searches
along the directions.We now proceed to prove that the GEM iteration (17) and (18)
satisfies(14) with((t) = ct(2-s)/(l-s) Let
A* = sup,.g.0 {max eigenvalue ofB ()).
From (17) and (18),
Q(OP+llIp)
-
> aGi(op+i - 0p))TB-1(p)
Q(OP
IP)
)(op+
I
>'
A*-X
0P+i
11
- a(X* )r(1-s)I-
II
4p+l
-
op)ID10Q(opIop)Is
p 11
2((p-lX*-l 11p+l
_ o)p
_
opII)(s)/(1-S)
(2-s)/(1-s)
Note that X* < oosince S20,is compact. For otheriterationschemes,sufficientconditions
0 can be foundin Ortega and Rheinboldt (1970, Chapter 14).
for11op+, - op,
We concludethissubsectionwitha discussionofthe discretenessassumptionon f(L *),
whichwas not mentionedin DLR. If L (4)) has a ridge of stationarypoints in which L (4)
= L *, i.e. f(L *) is not discrete,does an EM sequence with 11op+, -op 11-O 0 convergeto
a 4)*in 99(L *)? Or may it sometimesmove indefinitely
on the ridge?DLR (1977,page 10)
made an unwarrantedclaim that the formermust be true. Although(16), an assumption
made in DLR, impliesEp=o II _p+1 op 112< ooand 11op+,-op O--11 0, it is stillpossible to have
||I4p+, - op 11= ??. A sequence with this propertymay move indefinitelyon a ridge.
E
Althoughwe do'not have a counter-exampleto thisclaim by DLR, our generalimpression
is that, if f((L*) is not discrete, convergence of {fp) can only be guaranteed under
conditionsstrongerthan those assumed in theirpaper. This beliefis furthersupportedby
the GEM counterexampleof Boyles (1983).
3. Summary of properties of EM algorithm.
(i) Any EM sequence {fp) increases the likelihood and L (op), if bounded above,
convergesto some L *.
(ii) If Q ('PI)) is continuous in both 4 and 4, L * is a stationaryvalue of L. The
continuityof Q holds forthe importantcase of a curved exponentialfamily.If
4p converges to some point 4*, 4* is a stationarypoint under the continuity
ofD10Q(4' 14) in4' and 4.
condition
This content downloaded from 128.12.187.216 on Sun, 12 May 2013 20:54:56 PM
All use subject to JSTOR Terms and Conditions
102
C. F. JEFF WU
(iii) If,in additionto (ii), Q is not trappedat any point pothat is a stationarypointbut
not a local maximumofL, i.e. supoE=aQ(4I Po) > Q(4o IGo),thenL* is also a local
maximumofL. But thisconditionmay be difficult
to verify.Since the convergence
to stationaryvalue or local maximumor global maximumdepends on the choice
of startingpoints, we recommend that several EM iterations be tried with
different
startingpoints that are representativeof the parameterspace.
(iv) If, in addition to (ii) or (iii), 11Op+j- p 11 0 as p -> oo and the set of stationary
points (local maxima) with a given L value is discrete,then 4p convergesto a
stationarypoint (local maximum).
(v) If, in addition to (ii) or (iii), there cannot exist two differentstationarypoints
(local maxima) with the same L value, then 4p convergesto a stationarypoint
(local maximum).
0 is condition(14), which is satisfiedby
(vi) A sufficientconditionfor11Op,j - Ip
a class of optimizationalgorithms(17) and (18), whose iterationscheme is based
on the local directionalderivativesof Q and Armijoline searches along the chosen
direction.For a regularexponentialfamily,a special case (16) of (14) is satisfied
by the EM algorithm.
is
(vii) If L (f) is unimodal in S2 and has only stationary point and D 10Q
(?')
continuousin 0 and p',then op convergesto the unique maximizerp*of L (p).
(viii) If the set ofstationarypoints (local maxima) witha givenL value, denotedf(L)
(respectively./ (L)), is not discrete,and 11 -p+1- p 11 0, then 4p convergesto a
compact, connected component of f(L) (resp. M (L)) but not necessarilyto a
point. The originalclaim by DLR that 4p convergesto a point does not seem to
hold underthe conditionstheystated. But we emphasize that the convergenceof
the EM sequence 4S is not as importantas the convergenceof L (0p) to desired
locations on the log-likelihoodsurface,an issue largelyresolved in the present
article.
Acknowledgments.
helpfulcomments.
The authorwishesto thankWing-HungWong and a refereefor
REFERENCES
BAUM,L. E., PETRIE, T., SOULES, G. and WEISS, N. (1970). A maximizationtechnique occurringin
the statisticalanalysisof probabilisticfunctionsof Markov chains. Ann. Math. Statist. 41
164-171.
BOYLES,R. A. (1980). Convergenceresultsforthe EM algorithm.Technical Report No. 13, Division
of Statistics,Universityof California,Davis.
BOYLES,R. A. (1983). On the convergenceofthe EM algorithm.J. Roy. Statist. Soc. B 44. To appear.
DEMPSTER,A. P., LAIRD, N. M. and RUBIN, D. B. (1977). Maximum likelihoodfromincompletedata
via the EM algorithm(withdiscussion).J. Roy. Statist. Soc. Ser. 39 1-38.
HABERMAN,S. J. (1974). Loglinear models for frequencytables derived by indirect observation:
maximumlikelihoodequations. Ann. Statist. 2 911-924.
HABERMAN,S. J. (1977). Product models forfrequencytables involvingindirectobservation.Ann.
Statist. 5 1124-1147.
HARTLEY,H. 0. and HOCKING,R. R. (1971). The analysisof incompletedata. Biometrics27 783-808.
HASSELBLAD,V. (1969). Estimationof finitemixturesof distributionsfromthe exponentialfamily.J.
Amer.Statist. Assoc. 64 1459-1471.
LAIRD, N. (1978). Nonparametricmaximumlikelihoodestimationof a mixingdistribution.J. Amer.
Statist. Assoc. 73 805-811.
MURRAY, G. D. (1977). Contributionto discussionofpaper by A. P. Dempster,N. M. Laird and D. B.
Rubin. J. Roy. Statist. Soc. Ser. B 39 27-28.
ORCHARD, T. and WOODBURY, M. A. (1972). A missinginformation
principle:theoryand applications.
Proc. 6thBerkeleySymposiumon Math. Statist. and Probab. 1 697-715.
ORTEGA, J. M. and RHEINBOLDT, W. C. (1970). IterativeSolution ofNonlinear Equations in Several
Variables. Academic,New York.
OSTROWSKI, A. M. (1966). Solution of Equations and Systemsof Equations. 2nd Edition. Academic,
New York.
POLAK,E. (1971). ComputationalMethods in Optimization.Academic, New York.
This content downloaded from 128.12.187.216 on Sun, 12 May 2013 20:54:56 PM
All use subject to JSTOR Terms and Conditions
CONVERGENCE
OF EM ALGORITHM
103
REDNER, R. A. and WALKER, H. F. (1982). Mixture densities,maximum likelihood,and the EM
algorithm.Preprint.
D. B. and THAYER, D. T. (1982). EM algorithmsforML factoranalysis. Psychometrika
47
69-76.
SUNDBERG, R. (1974). Maximum likelihoodtheoryforincompletedata froman exponentialfamily.
RUBIN,
Scand. J. Statist.1 49-58.
grouped,censored and
TURNBULL, B. W. (1976). The empiricaldistributionfunctionwith arbitrarily
truncateddata. J. Roy.Statist.Soc. B 38 290-295.
TURNBULL, B. W. and MITCHELL, T. J. (1981). Nonparametricestimationof the time to onset for
specific diseases in survival/sacrifice experiments. Biometrics.To appear.
10 772-785.
VARDI, Y. (1982). Nonparametric estimation in renewal processes. Ann.Statistics
WOLFE, J. H. (1970). Pattern clustering by multivariate mixture analysis. Multivariate
Behavioral
Research 5 329-350.
A
ZANGWILL, W. I. (1969).NonlinearProgramming:
Cliffs,New Jersey.
UnifiedApproach.Prentice Hall, Englewood
DEPARTMENT
OF STATISTICS
UNIVERSITY
OF WISCONSIN
1210 WEST DAYTON ST.
MADISON, WISCONSIN
53706
This content downloaded from 128.12.187.216 on Sun, 12 May 2013 20:54:56 PM
All use subject to JSTOR Terms and Conditions
Download