The paper is organized as follows. We present a basic introduction to structured mean field in Section 2. We then discuss our analysis and algorithmic developments in Section 3. We present empirical results to support our claims in Section 4, and we present our conclusions in Section 5.

2 Background

In this section, we review the principles of mean field approximation and set the notation. Our exposition follows the general treatment of variational methods presented in Wainwright and Jordan (2003), in which the Legendre-Fenchel transformation plays a central role.

2.1 Exponential families

We assume that the random variable under study, X_θ, has a distribution in a regular exponential family P in canonical form:

P(X_θ ∈ A) = ∫_A exp{⟨φ(x), θ⟩ − A(θ)} ν(dx),   (1)

A(θ) = log ∫ exp{⟨φ(x), θ⟩} ν(dx),   (2)

for a sufficient statistic φ : X → R^d, base measure ν, and parameters θ ∈ Ω = {θ ∈ R^d : A(θ) < ∞}. We will also use the notation X_μ, where μ ∈ R^d, to denote a random variable with distribution in P such that E[φ(X_μ)] = μ. Note that this is well defined since φ is sufficient for θ.

We are interested in the case in which the distribution of X factors according to an undirected graphical model on m vertices, G = (V, E), i.e. X = (X_1, ..., X_m) with each X_i taking values in X. For simplicity of notation, we focus on the case in which the interactions are pairwise and the base measure is discrete. However, the ideas, in particular the structured mean field algorithm based on auxiliary exponential families, apply directly to the general exponential family; this will be discussed in more detail in Section 3.

Let F = (V × X) ∪ (E × X²) be the index set for the coordinates of φ (the potentials). If e = (a, b) ∈ E, then it is understood that the following inclusion holds on the induced sigma-algebras: σ(φ_{e,·}(X)) ⊇ σ(X_a, X_b). Similarly, if v ∈ V, σ(φ_{v,·}(X)) ⊇ σ(X_v). We lose no generality by requiring the existence of potentials for all vertices and edges, since we can always set the corresponding parameters to zero.
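To fix ideas, here is a minimal sketch of ours (the function names are our own, not the paper's) of a discrete exponential family over a state space small enough that the log partition function (2) and the moments can be evaluated by direct enumeration:

import math

def log_partition(phi, states, theta):
    """A(theta) = log sum_x exp{<phi(x), theta>}, as in Equation (2)."""
    vals = [sum(p * t for p, t in zip(phi(x), theta)) for x in states]
    m = max(vals)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(v - m) for v in vals))

def moments(phi, states, theta):
    """mu = E[phi(X_theta)], computed by enumerating the state space."""
    A = log_partition(phi, states, theta)
    mu = [0.0] * len(theta)
    for x in states:
        w = math.exp(sum(p * t for p, t in zip(phi(x), theta)) - A)
        mu = [m_ + w * p for m_, p in zip(mu, phi(x))]
    return mu

# Example: two coupled spins with phi(x) = (x_1, x_2, x_1 * x_2).
states = [(a, b) for a in (-1, 1) for b in (-1, 1)]
phi = lambda x: [x[0], x[1], x[0] * x[1]]
mu = moments(phi, states, [0.1, -0.2, 0.5])

The vector returned by moments is exactly the gradient of log_partition at theta, an identity recalled in Section 2.2 below.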
2.2 Convex duality

A simple but fundamental property of exponential families is that the gradient and Hessian of the log partition function have the following forms:

∇A(θ) = E[φ(X_θ)]
H(A(θ)) = Var[φ(X_θ)].   (3)

The second identity implies convexity, which we can use in conjunction with the Legendre-Fenchel transformation to establish an alternative form for A.

Definition 1 For an extended real-valued function f, the Legendre-Fenchel transformation is defined as:

f*(x) = sup{⟨x, y⟩ − f(y) : y ∈ dom(f)}.

When f is convex and lower semi-continuous, f = f**, and we can use convexity of A to obtain:

A(θ) = sup{⟨θ, μ⟩ − A*(μ) : μ ∈ M},   (4)

where M = ∇A(Ω) is the set of realizable moments.

Formulation (4) is no more tractable than the definition of A in Equation (2): the term A*(μ), which can be shown to be equal to the negative entropy −H_ν(X_μ) (Wainwright and Jordan 2008), cannot be computed efficiently for arbitrary μ, and computing A itself is intractable in general (e.g., NP-complete for non-planar Ising models). On the other hand, (4) is a constrained optimization problem that can be relaxed. Mean field methods can be seen as a particular type of relaxation in which the sup is taken over a proper subset of M; in particular, over a subset of M for which the objective function and its gradient can be evaluated efficiently.
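As a concrete instance of Definition 1 and Equation (4), consider a worked example of ours (not from the original text): a Bernoulli variable with φ(x) = x on X = {0, 1}, so that A(θ) = log(1 + e^θ).

% The Legendre-Fenchel transform of A is the negative entropy:
A^*(\mu) = \sup_{\theta \in \mathbb{R}} \{ \theta \mu - \log(1 + e^{\theta}) \}
         = \mu \log \mu + (1 - \mu) \log(1 - \mu), \qquad \mu \in (0, 1),
% attained at \theta = \log\{\mu / (1 - \mu)\}; the dual form (4) recovers A:
A(\theta) = \sup_{\mu \in [0, 1]} \{ \theta \mu - A^*(\mu) \} = \log(1 + e^{\theta}).

This makes the statement A*(μ) = −H_ν(X_μ) explicit: the supremum value is exactly the negative Shannon entropy of a Bernoulli(μ) variable.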
2.3 Graphical mean-field relaxations

In order to define this relaxation, the user needs to provide a subset of the principal exponential family, Q ⊂ P, in which the log partition function and the moments can be computed. An intuitively appealing approach to making this selection is to make use of the graphical representation of the exponential family and to choose a subset of the edges, E' ⊂ E, to represent a tractable subfamily. In particular, in defining this subfamily we retain only the potentials with indices

F' = {f ∈ F : f = (v, ·) for v ∈ V or f = (e, ·) for e ∈ E'}.

The subgraph G' = (V, E') is generally taken to be acyclic, so that inference in the induced subfamily is indeed tractable.

We denote the parameters indexing this subfamily by ω ∈ Ξ and its moments by τ. Also, we let Y_ω denote a generic random variable that has a distribution in Q, and we let N denote the set of realizable moments of Q. Note that N is formally distinct from the set M_MF defined below (in particular, its elements have different dimensionality).

The subfamily induces a tractable subset M_MF ⊆ M of moments:

M_MF = {μ ∈ M : ∃ω ∈ Ξ s.t. E[φ(Y_ω)] = μ},

which in turn induces a tractable relaxation:

Â(θ) = sup{⟨θ, μ⟩ − A*(μ) : μ ∈ M_MF}.   (5)

We can then write the fundamental equation of mean field approximation:

sup{⟨θ, μ⟩ − A*(μ) : μ ∈ M_MF} = sup{⟨ω, τ⟩ + ⟨ϑ, Γ(τ)⟩ − A*(τ) : τ ∈ N},   (6)

where θ = (ω, ϑ) splits the parameters into the coordinates indexed by F' and by F\F', respectively. Note the slight abuse of notation: we use A to denote the partition function of both exponential families; the notation can always be disambiguated by inspecting the dimensionality of the parameter vector. In particular, the function Γ on the right-hand side of Equation (6) is generally non-convex; its precise form will be established shortly.
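The bookkeeping behind F and F' is mechanical; the sketch below (our illustration, with hypothetical names) constructs both index sets for a binary model and a choice of retained edges:

import itertools

def index_sets(V, E, E_sub, X=(0, 1)):
    """F = (V x X) u (E x X^2), and the retained subset F' induced by E_sub."""
    F = [(v, x) for v in V for x in X]
    F += [(e, xy) for e in E for xy in itertools.product(X, X)]
    keep = set(V) | set(E_sub)
    return F, [f for f in F if f[0] in keep]

# 2 x 2 grid, retaining one edge: d = |F| = 4*2 + 4*4 = 24, d' = |F'| = 12.
V = [0, 1, 2, 3]
E = [(0, 1), (2, 3), (0, 2), (1, 3)]
F, F_prime = index_sets(V, E, E_sub=[(0, 1)])

The reduction from d = |F| to d' = |F'| coordinates is the dimensionality saving discussed after Equation (6).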
Note that Equation (6) allows us to perform the optimization in the smaller space R^{d'}, where d' = |F'|; this is a key algorithmic consequence of the mean-field approximation. Indeed, for all μ ∈ M_MF, A*(μ) amounts to computing the entropy of a forest-shaped graphical model:

H_ν(Y) = Σ_i H_ν(Y_i) + Σ_j H_ν(Y_j | Y_pa(j)),

where the first sum ranges over the roots of the connected components of G', the second sum ranges over the remaining vertices, and pa(j) denotes the parent of j in an arbitrary orientation of the forest.

The right-hand side of Equation (6) makes it clear that the mean field optimization problem is different from performing inference in Q, the latter being:

sup{⟨ω, τ⟩ − A*(τ) : τ ∈ N}.

The left-hand side of Equation (6) gives another perspective on the mean-field optimization problem: here we have a convex objective, but the optimization is over a non-convex set (Wainwright and Jordan 2008).

Note also that Â(θ) ≤ A(θ), which is a very useful property when mean-field inference is used in the inner loop of EM (Wainwright and Jordan 2008). Moreover, if E'' ⊆ E', with associated mean field approximation Ă(θ), then Ă(θ) ≤ Â(θ) ≤ A(θ). As a consequence, adding edges in G' = (V, E') can only increase the quality of the global optimum. This does not imply that the local optimum found by the optimization procedure will always be superior, but we show in the experimental section that empirically there is indeed an improvement when edges are added to the approximation.
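The forest entropy decomposition is what makes A*(τ) evaluable from node and edge marginals alone; here is a minimal sketch of ours (marginals are assumed given as dictionaries mapping states to probabilities):

import math

def entropy(p):
    """Shannon entropy of a marginal given as {state: probability}."""
    return -sum(q * math.log(q) for q in p.values() if q > 0)

def forest_entropy(roots, parent_edges, node_marg, edge_marg):
    """H(Y) = sum over roots of H(Y_r) + sum over non-roots of H(Y_j | Y_pa(j)).

    parent_edges lists (pa, j) pairs from an arbitrary orientation of G';
    the conditional entropy is computed as H(Y_pa, Y_j) - H(Y_pa).
    """
    H = sum(entropy(node_marg[r]) for r in roots)
    for pa, j in parent_edges:
        H += entropy(edge_marg[(pa, j)]) - entropy(node_marg[pa])
    return H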
2.4 Generic fixed point updates

Let G(τ) = ⟨ω, τ⟩ + ⟨ϑ, Γ(τ)⟩ − A*(τ). We take partial derivatives to obtain stationary point conditions. By the definition of Γ,

Γ_g(τ) = E[φ_g(Y_τ)],

where, analogously to X_μ, we write Y_τ for a random variable with distribution in Q whose moments are τ. We therefore have:

∂G/∂τ_f (τ) = ω_f + Σ_{g∈F\F'} ϑ_g ∂Γ_g/∂τ_f (τ) − ∂A*/∂τ_f (τ),

where f ∈ F'. It will be useful to represent this update in vector notation; for this purpose, we introduce the following definition.

Definition 2 The embedding Jacobian is the (transposed) Jacobian matrix of Γ:

J_{f,g}(τ) = ∂Γ_g/∂τ_f (τ),   f ∈ F', g ∈ F\F'.

With this definition, we obtain the concise expression ∇G = ω + Jϑ − ∇A*, so that the necessary optimality condition is

0 = ω + J(τ)ϑ − ∇A*(τ).

When the tractable subfamily is regular and minimal, ∇A* = (∇A)^{−1}, and the condition can be rewritten as the fixed point equation τ = ∇A(ω + J(τ)ϑ).

These updates decompose across the connected components of G'. Writing cc(G') for the set of these components and ζ^(c) for the coordinates of ζ ∈ Ξ carried by component c, we have

A(ζ) = Σ_{c∈cc(G')} A^(c)(ζ^(c)),

valid for any ζ ∈ Ξ, so that the gradient is block-diagonal: ∇A(ζ) = (∇A^(1)(ζ^(1)), ..., ∇A^(k)(ζ^(k))). Plugging this decomposition into the fixed point equation yields:

Proposition 5 The right-hand side of the fixed point equation decomposes across the connected components of G', yielding the block updates

τ^(c) = ∇A^(c)(ω^(c) + J^(c)(τ^(1), ..., τ^(k))ϑ),   c ∈ cc(G').

An interesting property of this coordinate ascent algorithm is that it is guaranteed to converge to a local optimum (Wainwright and Jordan 2008). This can be seen from Equation (6) by using the fact that the original problem in Equation (4) is convex.
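In pseudocode form, the block updates of Proposition 5 amount to the following skeleton (a sketch of ours; grad_A and embedding_jacobian are hypothetical callables that would both be implemented with sum-product on the component trees):

import numpy as np

def structured_mean_field(omega, vartheta, grad_A, embedding_jacobian,
                          components, tau_init, n_sweeps=100):
    """Cycle through components, applying tau^(c) <- grad A^(c)(omega^(c) + J^(c) vartheta).

    omega and tau_init are dictionaries keyed by component; vartheta is the
    parameter vector of the dropped potentials, indexed by F \ F'.
    """
    tau = {c: np.asarray(tau_init[c], dtype=float) for c in components}
    for _ in range(n_sweeps):
        for c in components:
            J_c = embedding_jacobian(c, tau)   # |F'^(c)| x |F \ F'| matrix
            tau[c] = grad_A(c, omega[c] + J_c @ vartheta)
    return tau

Each sweep is a pass of coordinate ascent on G, so the objective is non-decreasing and the iterates converge to a local optimum.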
3.3 Dichotomy of tractable mean field subgraphs

The complexity of carrying out the block updates of Proposition 5 turns out to depend on a simple property of the tractable subgraph. We call an acyclic subgraph with edges E' ⊆ E v-acyclic if, for all e ∈ E, E' ∪ {e} is still acyclic, and b-acyclic otherwise. Equivalently, a connected component of G' is v-acyclic when no edge of E\E' joins two of its vertices, and G' is v-acyclic exactly when all of its components are.

The notion of v- and b-acyclic subgraphs is different from the notion of "overlapping cluster" used in the work on structured mean field by Geiger et al. (2006). Some v-acyclic graphs have overlapping clusters, some do not; moreover, the computational dichotomy we establish here does not hold if the notion of v- and b-acyclic subgraphs is replaced by that of overlapping clusters. Note that other variational approximations, such as Expectation Propagation, also have a subgraph interpretation (Minka and Qi 2003). While this subgraph sometimes happens to be b-acyclic, there is no special distinction between v- and b-acyclic graphs in the case of Bethe-energy variational approximations. This is why we focus on mean field approximations.

Suppose now that connected component G'^(c) is v-acyclic. We show that the entries of the embedding Jacobian J^(c)(τ) can be computed in constant time in this case. Since g ∈ F\F', we must have g = (e, (s, t)) for some e = (a, b) ∈ E\E' and (s, t) ∈ X². There are three subcases to consider:

1. Exactly one of the vertices a, b belongs to V^(c). Suppose a ∈ V^(c) and b ∉ V^(c). Since a and b are in different connected components, by the Hammersley-Clifford theorem Y_a and Y_b are independent, so that Γ_g(τ) = P(Y_a = s, Y_b = t) = τ_{a,s} τ_{b,t}, and hence J_{f,g}(τ) = τ_{b,t} × 1[f = (a, s)].

2. Both a and b belong to V^(c). We claim that this case cannot, in fact, occur in a v-acyclic component. Suppose the contrary: since a and b belong to the same connected component, there is a path between a and b, which means that E' ∪ {(a, b)} spans a graph with a cycle, a contradiction.

3. Neither a nor b belongs to V^(c). Then Γ_g does not depend on τ^(c), so J^(c)_{f,g} = 0 for all f.

Hence the number of nonzero entries in J^(c) is equal to the number of edges in E\E' that have an endpoint in V^(c), and each entry is a quantity that does not depend on τ^(c). It is hence possible to optimize exactly, in time O(|F'^(c)|), the block of coordinates τ^(c) while keeping the values of the other blocks fixed.
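The defining property is easy to test mechanically; the following sketch of ours uses union-find to check whether a given acyclic subgraph is v-acyclic:

def _find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path compression
        x = parent[x]
    return x

def is_v_acyclic(vertices, sub_edges, all_edges):
    """True iff adding any single edge of all_edges to sub_edges stays acyclic."""
    parent = {v: v for v in vertices}
    for a, b in sub_edges:              # union the tractable subgraph
        ra, rb = _find(parent, a), _find(parent, b)
        assert ra != rb, "the tractable subgraph must be acyclic"
        parent[ra] = rb
    # b-acyclic iff some remaining edge joins an already-connected pair
    return all(_find(parent, a) != _find(parent, b)
               for a, b in set(all_edges) - set(sub_edges))

# On a 4-cycle, two opposite edges are v-acyclic; a spanning path is not.
C4 = [(1, 2), (2, 3), (3, 4), (4, 1)]
assert is_v_acyclic([1, 2, 3, 4], [(1, 2), (3, 4)], C4)
assert not is_v_acyclic([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4)], C4)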
3.4 Relation to Gibbs sampling

Before moving to the b-acyclic case, we draw a parallel between block Gibbs sampling and structured mean field approximations in the case in which all connected components of G' are v-acyclic. This connection generalizes the classical relationship between naive Gibbs sampling steps and naive mean field coordinate ascent updates.

Recall that for binary random variables X_s with coupling weights θ_{s,s'} and observation weights θ_s, the naive Gibbs sampler is defined as the Markov chain X(1), X(2), ... where only coordinate s_t is resampled at time t, with transition probabilities:

P(X_{s_t}(t) = 1 | X(t−1)) = σ(θ_{s_t} + Σ_{s'∈N(s_t)} θ_{s_t,s'} X_{s'}(t−1)),

where σ(x) = {1 + exp(−x)}^{−1}. This closely resembles the naive mean field coordinate updates:

μ_s(t) ← σ(θ_s + Σ_{s'∈N(s)} θ_{s,s'} μ_{s'}(t−1)).

One can check easily that the block Gibbs sampler with blocks V^(1), ..., V^(k) has a transition kernel given by:

X(t) | X(t−1) ∼ MRF(P'^(c_t), ω^(c_t) + B^(c_t)(X(t−1))ϑ),

where MRF(P, θ) denotes the distribution of X_θ in the family P, P'^(c) is the tractable subfamily restricted to component c, and B^(c) is the matrix defined as follows, for g = ((a, b), (s, t)):

B^(c)_{f,g}(X(t−1)) = 1[f = (a, s)] × 1[X_b(t−1) = t] × 1[|{a, b} ∩ V^(c)| = 1].

Note that the sparsity pattern of B follows that of J. Moreover, the complexity of sampling from MRF(P'^(c), ·) is the same (up to a multiplicative constant) as the complexity of computing ∇A^(c)(·): both require executing sum-product on the same tree.

This parallel breaks when one or several of the connected components are b-acyclic. In this case there is no corresponding tractable block Gibbs sampler, while (as we discuss in the following section) mean field is still tractable, albeit at a computational cost higher than in the v-acyclic case.
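In code, the resemblance is word for word; the following sketch of ours shows the two updates side by side for binary 0/1 variables (all names are ours):

import math, random

def sigma(x):
    return 1.0 / (1.0 + math.exp(-x))

def gibbs_step(x, s, theta_v, theta_e, nbrs, rng=random):
    """Resample site s given the current hard assignment x."""
    field = theta_v[s] + sum(theta_e[(s, t)] * x[t] for t in nbrs[s])
    x[s] = 1 if rng.random() < sigma(field) else 0

def mean_field_step(mu, s, theta_v, theta_e, nbrs):
    """Identical functional form, with neighbour samples replaced by means."""
    mu[s] = sigma(theta_v[s] + sum(theta_e[(s, t)] * mu[t] for t in nbrs[s]))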
3.5 Optimization of b-acyclic components

We now turn to the case of structured mean field when there are b-acyclic components. To simplify the notation, we assume that there is a single b-acyclic component that spans the entire graph G'; the derivation is essentially identical when there are several connected components.

One way to find the required partial derivatives is to construct a specialized dynamic programming algorithm. This task is non-trivial; to see why, let

p_g = {a = p_0, p_1, ..., p_k = b : ∀i, (p_i, p_{i+1}) ∈ E'}

denote the shortest path in G' from a to b (see Figure 2). The quantity to compute is:

Γ_g(τ) = P(Y_a = s, Y_b = t)
       = Σ_{y_1∈X} ⋯ Σ_{y_{k−1}∈X} P(Y_{p_i} = y_i, ∀i ∈ {0, ..., k})
       = Σ_{y_1∈X} ⋯ Σ_{y_{k−1}∈X} τ_{(p_0,p_1),(y_0,y_1)} Π_{i=2}^{k} [ τ_{(p_{i−1},p_i),(y_{i−1},y_i)} / Σ_{s'} τ_{(p_{i−1},p_i),(y_{i−1},s')} ],

where y_0 = s and y_k = t, and this time, in contrast to the v-acyclic case, the probability on the right-hand side cannot be decoupled into a product of marginals. From this expansion, we see immediately that J_{f,g}(τ) will not have the same sparsity properties as in the v-acyclic case. Moreover, when the partial derivative is taken with respect to a coordinate τ_f corresponding to an edge in the path p_g, there are factors τ_f that appear both in the numerator and in the denominator. The task is thus more complex than that of creating a dynamic programming algorithm of the type used for sum-product: a chain rule would need to be used, changing the form of the recursion for each pair f, g. The complexity of this naive approach can be shown to be O(|E'| × |F'| × |F\F'|), which is considerable.
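Evaluating Γ_g itself is just sum-product on a chain; the following sketch of ours eliminates the interior variables of p_g one at a time (cond[i] is the row-stochastic matrix of conditionals built from the bracketed ratios above; names are ours):

import numpy as np

def path_pair_marginal(marg_a, cond, s, t):
    """P(Y_a = s, Y_b = t) via k matrix-vector products along the path."""
    msg = np.zeros_like(marg_a)
    msg[s] = marg_a[s]          # clamp Y_a = s
    for M in cond:              # after step i: msg[y] = P(Y_a = s, Y_{p_i} = y)
        msg = msg @ M
    return msg[t]

The difficulty discussed above is not this evaluation but its differentiation with respect to every coordinate of τ, which motivates the auxiliary-family construction that follows.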
We now present an alternative approach that is both simpler to implement and asymptotically faster. The idea is to construct an auxiliary exponential family model and to use an elementary property of Jacobian matrices to reduce the computation to a standard application of sum-product in the auxiliary exponential family.

There is one auxiliary exponential family P^[g] for each g ∈ F\F'. It is defined on the chain of interior vertices p_1, ..., p_{k−1}, where p_g is as above; the clamped endpoints Y_a = s and Y_b = t are folded into the potentials of the first and last edges of the chain. We pick its parameters θ^[g] so that its partition function, and hence the partial derivatives of its partition function, coincide with the quantities of interest:

Z^[g](θ^[g](τ)) = Γ_g(τ).

One can check that this is achieved with the following choice:

θ_h^[g] = log τ_h − log τ_{v,x} + log τ_{(a,v),(s,x)}                    if v = p_1,
θ_h^[g] = log τ_h − log τ_{v,x} + log τ_{(p_{k−1},b),(y,t)} − log τ_{w,y}    if w = p_{k−1},
θ_h^[g] = log τ_h − log τ_{v,x}                                          otherwise,

where h = ((v, w), (x, y)), (v, w) ∈ p_g, (x, y) ∈ X². Using Equation (3), we have for all such h:

∂Z^[g]/∂θ_h^[g] = Z^[g] × ∂A^[g]/∂θ_h^[g] = Z^[g] × μ_h^[g],

where A^[g] = log Z^[g] and μ^[g] = ∇A^[g](θ^[g]). This shows that one execution of sum-product, at a cost of O(|p_g|), yields |F'| entries.

[Figure 2: An example of the notation used in this section. This corresponds to the inference problem in the mean field subgraph in Figure 1, left column, bottom row. The boxes indicate which nodes are involved in the auxiliary exponential family P^[g] used to compute Γ_g; the edges of the path p_g are in bold, and the edge corresponding to g is the bold dashed line between Y_a and Y_b.]
How can we obtain, from these moments, the partial derivatives with respect to τ_f? Next, we define the following two Jacobian matrices:

I = [∂Z^[g]/∂θ_h^[g]]_{g,h};   K = [∂θ_h^[g]/∂τ_f]_{h,f},

where I has size |F\F'| × N and K has size N × |F'|, with N = Σ_{g∈F\F'} |p_g|. The entries of I are given by the sum-product computations above, and the entries of K are obtained in closed form by differentiating the logarithms in the definition of θ^[g]. We can finally derive an expression for J, using the generalization of the chain rule to Jacobian matrices:

J = Kᵀ Iᵀ.

After carrying out this matrix multiplication, we obtain:

Proposition 6 When G' is b-acyclic, the embedding Jacobian has the form J = Kᵀ Iᵀ; in particular, J_{f,g}(τ) is nonzero only when f = ((v, w), (x, y)) is such that (v, w) ∈ p_g, and each nonzero entry is a product of Z^[g], moments μ^[g], and closed-form functions of the coordinates of τ along the path (with an additional factor 1[x = s] when (v, w) is the first edge of p_g, and 1[y = t] when it is the last).

Note that the total cost of computing J is O(|E'| × |F\F'|), which is larger than the cost derived in the v-acyclic case, but smaller than the cost of the naive dynamic programming approach mentioned at the beginning of this subsection. Using a b-acyclic subgraph therefore remains more expensive per update than using a v-acyclic one; the experiments in the next section quantify when this extra cost pays off.
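The matrix form is the whole algorithm: once I and K are filled, J is a single product. A shape-level sketch of ours:

import numpy as np

def embedding_jacobian(I, K):
    """J = K^T I^T: (|F'| x N) @ (N x |F minus F'|) -> |F'| x |F minus F'|."""
    return K.T @ I.T

I = np.random.rand(5, 12)    # |F \ F'| = 5 dropped potentials, N = 12
K = np.random.rand(12, 30)   # |F'| = 30 retained coordinates
assert embedding_jacobian(I, K).shape == (30, 5)

In practice I and K are sparse, with one block per path p_g, which is what brings the cost down to O(|E'| × |F\F'|).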
4 Experiments

We performed experiments on the Ising model:

φ((x_{i,j})_{i,j∈{1,...,9}}) = Σ_{j=1}^{9} Σ_{i=1}^{8} (x_{i,j} x_{i+1,j} + x_{j,i} x_{j,i+1}),

with θ' = 1/T, where T is the temperature parameter as used in statistical physics. In this model, the partition function and the moments can be computed exactly, so that absolute errors can be computed. Before performing the experiments, we also verified empirically, using directional derivatives, that the updates we have derived are error-free.

We used the updates of Proposition 5 when the tractable subgraph was v-acyclic (SMF1), and the fixed point updates of Proposition 6 in the b-acyclic case (SMF2); as a baseline, we ran naive mean field (NMF). SMF2 uses more edges of G than SMF1: SMF1 is v-acyclic while SMF2 is not. In Figure 3, we show the error in the partition function estimate at different temperatures. There is a regime in which using the more expensive b-acyclic approximation pays off.

[Figure 3: Error in the partition function estimate as a function of the temperature of the model. SMF2 uses more edges in G than SMF1: SMF1 is v-acyclic while SMF2 is not. See text for more details.]
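The verification step is a standard finite-difference test; a sketch of ours (G and grad_G are hypothetical callables for the objective and its derived gradient):

import numpy as np

def check_gradient(G, grad_G, tau, n_dirs=20, eps=1e-6, tol=1e-4, seed=0):
    """Compare grad_G against central differences along random directions."""
    rng = np.random.default_rng(seed)
    g = grad_G(tau)
    for _ in range(n_dirs):
        d = rng.standard_normal(tau.shape)
        d /= np.linalg.norm(d)
        fd = (G(tau + eps * d) - G(tau - eps * d)) / (2 * eps)
        if abs(fd - g @ d) > tol:
            return False
    return True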
We also performed timing experiments to compare the convergence behavior of the mean field approximations. The results are displayed in Figure 4. We can see that in this model it takes one order of magnitude more time to move from naive mean field to v-acyclic structured mean field, and two orders of magnitude more time to move from v-acyclic to b-acyclic structured mean field approximations. This is consistent with the theoretical results developed in Section 3. Moreover, the bound on the log partition function gets tighter as more edges are added to the tractable subgraph.

[Figure 4: Error in the partition function estimate as a function of the running time in milliseconds (abscissa in a log scale) for three algorithms: naive mean field (NMF), v-acyclic structured mean field (SMF1), and b-acyclic structured mean field (SMF2).]
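For reference, exact values against which such errors can be measured are obtainable by brute force on small grids; a sketch of ours (feasible only for small n; a grid as large as 9 × 9 calls for a smarter exact method):

import itertools, math

def exact_log_Z(n, theta):
    """Exact log partition function of an n x n Ising grid, x_v in {-1, +1}."""
    log_terms = []
    for x in itertools.product((-1, 1), repeat=n * n):
        s = sum(theta * x[i * n + j] * x[i * n + j + 1]      # horizontal edges
                for i in range(n) for j in range(n - 1))
        s += sum(theta * x[i * n + j] * x[(i + 1) * n + j]   # vertical edges
                 for i in range(n - 1) for j in range(n))
        log_terms.append(s)
    m = max(log_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_terms))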
We empirical taken results toto be • •Two equivalent ways to to specify convex functions subgraph G =Section (V, E ) ispresent generally The second identity implies convexity, which we can the mean field optimization problem is then different ⊂ in whichtransformation the transformation log partition and moments We can write thethan fundamental equation of meantheP, Legendre-Fenchel is defined as: can support our claims D in Section 4 and we present Tool: our Q Legendre-Fenchel Definition: an acyclic subgraph with edges E' ⊆ E is ... use in conjunction with the Legendre-Fenchel trans- } exp!ζ (c) , φ(c) (x)"ν( dx) A(c) (ζ (c) ), * Computer Science Division of Statistics University of California at Berkeley Now plug in ζ = ω Preview: Z c∈cc(G! ) Alexandre Bouchard-Côté* Michael I. Jordan* D + !ζ (k) , φ(k) (x)"}ν( dx) b-acyclic subgraphs the followin µw,y ´ 0 > ence procedure allowing internal structure for overlap> × 1[x = s] if v = p SMF1 and SMF2. As a baseline, we ran the laxation for approximate inference in discrete Markov 0 10 10 " 10 10 10 tion. regime, innaive this c τ > τ ,s ) f k−1 ,pk ),(y k−1 < Journal We now (ppresent an alternative Time approach that is both ping clusters and deterministic constraints. of˘ ¯ random fields. IEEE Transactions on Signal Processing, mean field (curve NMF on the graph). The results are [g] 1[y=t] × `P Form : n is: path in G! from 9J X o a to b (see FigThe quantity toof compute X denote the8 shortest 1 [g] Intelligence Research, 54:2099–2109, 2006. − τv,x if4.wWe=can pk−1 φ((xi,j )i,j∈{1,...,9}ure x x + x x , Jf,g (τ ) = Z27:1–23, × 2006.µv,x × 2) = simpler to implement and asymptotically faster. The 5Artificial i,j i+1,j j,i j,i+1 in Figure see that in this model itthe τdisplayed Conclusion f 2). SMF1 > A. Globerson and T. Jaakkola. Approximate inference us∂ [g] M. J. Wainwright and M. I. Jordan. Graphical models, Figure 4: Error in the partition function estimate as a > j=1 i=1 [g] idea is to construct an auxiliary exponential family takes one order of magnitude more time to move from > µ The task is thus more complex than that of creating can check Jeasily that block Gibbs sampler with (τ )let =y the P(Y s, Ywe t), : f Neuralµv,x ing planar graph decomposition. In Advances in f,gwe a = b = exponential families, and variational inference. Foundafunction of the running time in milliseconds (abscissa If = s, y = t, have: graph w − otherwise 0 f k naive mean field to v-acyclic structured mean field, (1) (k) ∂τ model and to use an elementary property of Jacobian τf 2006.τv,x Information Processing Systems, Cambridge, a dynamic programming algorithm ofalgorithms: the typenaive used tions and Trends in Machine Learning, 1:1–305, 2008. blocks1V , . . . , V has a transition kernel given by: We have characterized a dichotomy in the MA, complexity in a log scale) for three mean field and two orders of magnitude more time to moveponent from MIT Press. with θ! = T , where T is the temperature parameter matrices to reduce computation to a standard W. Wiegerinck. Variational approximations between mean Ya needmean for sum-product. A... chain rule the would to be used, ∂ X on X ! time the probability (NMF), v-acyclic structured field (SMF1) and ap- of optimizing structure mean field approximations of v-acyclic where this the right-hand side p to b-acyclic structured mean field approximap · · · P(Y = y , ∀i ∈ {1, . . . k}) J (τ ) = G. Hua and Y. Wu. Sequential field variational anal! f,g physics. 
In this model, thepiparti1 0 i field theory and(v, the w) junction algorithm. In Proas used in statistical t) exponential models. class plication sum-product the auxiliary wherefamily fmean = ((v,The w),first (x, y)) is such ∈ tree pthe mean field (SMF2). g forineach X (c (t)!decoupled X(t − 1) ∼into changing the formb-acyclic of of the recursion f, g. The exponential graphical, ∂τf ya∈X g ; theoretical ysis of structured deformable shapes. Computer Vision tions. Thisthat isthe consistent with results ceedings of Sixteenth Conference on Uncertainty in cannot be product yk−1of ∈X marginals. Let p 1 k allows efficient block updates while the second is comSMF2 the f tion function and moments can be computed exactly# family. " and Image Understanding, 101:87–99, 2006. Artificial Intelligence, pages 626–633, San Mateo, CA, complexity of this naive approach can be shown to be otherwise, J is equal to zero. developed in Section 3. Moreover, the bound on the Y f,g most tractable X X b putationally more challenging. While !(c∂ (ct ) (c t) t) 2000. Morgan Kaufmann. ! ! so that absolute errors can ! , = y )O(|E !·|·×| ωestablished. + ∀i, B (p(X(t − 1))ϑ J. B. Lasserre. Global optimization with polynomials and in Figu · F There | ×| F is \Fone |), auxiliary which is considerable. p1) = 0 pg = {a MRF = p0 , pP , . .∂τ .be , ,pP(Y pi+1 ∈ yE1 |Y }p0 1= k =a b=: s) i ,P(Y studied in the existing literature have fallen log partition function gets tighter as more edges are exponential family P [g] for each subgraphs the problem of moments. SIAM Journal on Optimizaf J. S. Yedidia, W. T. Freeman, and Y. Weiss. Generaladded to the tractable subgraph. ! ! ! y1 ∈X y2 ∈X the11:796–817, first category, wetotal have presented theoretical and ((a, b), (s,alternative t))using = gthe∈more F \Fexpensive . It that is b-acyclic defined on the chain intion, 2000. where approximaThe cost of computing Jizedisbelief O(|E | × |FIn\F |), in Neural subgrap propagation. Advances InforWe used the updates of defined Proposition 5 when the We now present an approach is both ! follows: X X where B is the matrix as empirical reasons to expand the scope of the structured denote the shortest path to0 ,yb 1(see Figτ(p0 ,pa1 ),(y ∂ in G from mation Processing Systems, pages 689–695, Cambridge, ) tion pays off. p , p , . . . , p where p is as above. We pick its paT. Minka and Y. Qi. Tree-structured approximations by 1 2 k−1 g tractable subgraph was =v-acyclic which is larger than the cost derived in the τa,s and used the fixed · · · simpler to implement and asymptotically faster. The MA, 2001. MIT Press.v-acyclic mean field method to consider the in second expectation propagation. In Advances Neuralcategory. Infor[g] ure 2). (c) ∂τf τ p0 ,y0 so that the partial derivatives of its parrameters θ We also performed timing experiments to compare the y1the ∈X y2 ∈X point updatesBf,g of (X(t Proposition 6 in= b-acyclic cases. Figure 2: An example of the notation used in this − 1)) =1[f (a, s)]× mation Processing Systems, Cambridge, MA, 2003. MIT We also presented a novel algorithm for computing idea is to construct an auxiliary exponential family case, but smaller than the naive programming In Figure 3 5 dynamic Conclusion convergence behavior of the mean field approximations Press. we let y0 =the s, ykexperiments, = t, we have:we verified the gradient and bound on the log-partition function section. 
Thistocorresponds to the inference problem in BeforeIfperforming empir(c) model and use an elementary property of Jacobian algorithm mentioned at the beginning of this subsecdifferent tem SMF1 and SMF2. As a baseline, we ran the naive 1[|{a, b} ∩ V | = 1]× C.inPeterson and J. R. Anderson. A mean field theory learnthe b-acyclic case. the mean field subgraph in Figure 1, left column, botically using directional derivatives that the updates X X matrices to reduce the computation to a standard apmean field (curve NMF on the graph). The results are ∂ ing algorithm for neural networks. Complex Systems, 1: We have characterized a dichotomy in theregime, complexity tion. in t 1[XP(Y 1) = s], ··· Jf,ghave (τ ) =derived are b (t p− tom row. The box indicate which i = yi , ∀i ∈ {1, . . . k}) 995–1019, 1987. displayed in 4. Weare can involved see that inin this model it that we plication of sum-product inFigure the nodes auxiliary exponential of optimizing structure mean field approximations of ∂τf y ∈X error-free. [g] yk−1 ∈X 1 takes onefamily order of more time to move from the auxiliary exponential Pmagnitude used to compute L. K. Saul and M. I. Jordan. Exploiting tractable subgraphical, exponential family models. The first class family. s" 0 1 2 3 4