Graphical models and message-passing
Part II: Marginals and likelihoods

Martin Wainwright
UC Berkeley, Departments of Statistics and EECS

Tutorial materials (slides, monograph, lecture notes) available at:
www.eecs.berkeley.edu/~wainwrig/kyoto12

September 3, 2012


Graphs and factorization

[Figure: an undirected graph on vertices 1-7, with cliques such as {7}, {4,7}, and {4,5,6} carrying compatibility functions \psi_7, \psi_{47}, \psi_{456}.]

- a clique C is a fully connected subset of vertices
- a compatibility function \psi_C is defined on the variables x_C = \{x_s, s \in C\}
- factorization over all cliques:

    p(x_1, \dots, x_N) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C)


Core computational challenges

Given an undirected graphical model (Markov random field)

    p(x_1, x_2, \dots, x_N) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C),

how can we efficiently compute:

- the most probable configuration (MAP estimate):

    Maximize:  \hat{x} = \arg\max_{x \in \mathcal{X}^N} p(x_1, \dots, x_N) = \arg\max_{x \in \mathcal{X}^N} \prod_{C \in \mathcal{C}} \psi_C(x_C)

- the data likelihood or normalization constant:

    Sum/integrate:  Z = \sum_{x \in \mathcal{X}^N} \prod_{C \in \mathcal{C}} \psi_C(x_C)

- marginal distributions at single sites, or subsets:

    Sum/integrate:  p(X_s = x_s) = \frac{1}{Z} \sum_{x_t,\, t \neq s} \prod_{C \in \mathcal{C}} \psi_C(x_C)


§1. Sum-product message-passing on trees

Goal: compute marginal distributions of a tree-structured distribution

    p(x) \propto \prod_{s \in V} \exp(\theta_s(x_s)) \prod_{(s,t) \in E} \exp(\theta_{st}(x_s, x_t)).

Example (chain on nodes 1-2-3, with messages M_{12} and M_{32} passed into node 2):

    p(x_2) = \sum_{x_1, x_3} p(x) = \frac{1}{Z} \exp(\theta_2(x_2)) \prod_{t \in \{1,3\}} \Big\{ \sum_{x_t} \exp[\theta_t(x_t) + \theta_{2t}(x_2, x_t)] \Big\}


Putting together the pieces

Sum-product is an exact algorithm for any tree.

[Figure: node t gathers messages M_{ut}, M_{vt}, M_{wt} from the subtrees T_u, T_v, T_w rooted at its other neighbors, and passes the message M_{ts} to node s.]

    M_{ts}  ≡  message from node t to node s
    N(t)    ≡  neighbors of node t

Update:

    M_{ts}(x_s) \leftarrow \sum_{x'_t \in \mathcal{X}_t} \exp\big[ \theta_{st}(x_s, x'_t) + \theta_t(x'_t) \big] \prod_{v \in N(t) \setminus s} M_{vt}(x'_t)

Sum-marginals:

    p_s(x_s; \theta) \propto \exp\{\theta_s(x_s)\} \prod_{t \in N(s)} M_{ts}(x_s)


Summary: sum-product on trees

- converges after at most (graph diameter) many iterations
- updating a single message is an O(m^2) operation
- the overall algorithm requires O(N m^2) operations
- upon convergence, yields the exact node and edge marginals:

    p_s(x_s) \propto e^{\theta_s(x_s)} \prod_{u \in N(s)} M_{us}(x_s)

    p_{st}(x_s, x_t) \propto e^{\theta_s(x_s) + \theta_t(x_t) + \theta_{st}(x_s, x_t)} \prod_{u \in N(s) \setminus t} M_{us}(x_s) \prod_{u \in N(t) \setminus s} M_{ut}(x_t)

- the messages can also be used to compute the partition function (a runnable sketch of these updates follows below):

    Z = \sum_{x_1, \dots, x_N} \prod_{s \in V} e^{\theta_s(x_s)} \prod_{(s,t) \in E} e^{\theta_{st}(x_s, x_t)}
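To make the updates concrete, here is a minimal sketch of sum-product on a tree in Python with numpy. The function and variable names (sum_product_tree, edge_pot, and so on) are our own illustrative choices, not from the tutorial; messages are left unnormalized so that the partition function Z can be read off at any node.

    import numpy as np

    def sum_product_tree(nodes, edges, theta_s, theta_st):
        """Exact sum-product on a tree.  theta_s[s] is a length-m node
        potential; theta_st[(s, t)] is an (m, m) edge potential indexed
        as [x_s, x_t].  Returns (node marginals, partition function Z)."""
        nbrs = {s: [] for s in nodes}
        for s, t in edges:
            nbrs[s].append(t); nbrs[t].append(s)
        m = len(next(iter(theta_s.values())))

        def edge_pot(s, t):  # theta_st viewed as an array indexed [x_s, x_t]
            return theta_st[(s, t)] if (s, t) in theta_st else theta_st[(t, s)].T

        # M[(t, s)] is the message from t to s; diameter-many synchronous
        # sweeps guarantee convergence on a tree
        M = {(t, s): np.ones(m) for s in nodes for t in nbrs[s]}
        for _ in range(len(nodes)):
            new = {}
            for (t, s) in M:
                incoming = np.ones(m)          # product over v in N(t) \ s
                for v in nbrs[t]:
                    if v != s:
                        incoming = incoming * M[(v, t)]
                # M_ts(x_s) = sum_{x_t} exp(theta_st + theta_t) * incoming(x_t)
                new[(t, s)] = np.exp(edge_pot(s, t) + theta_s[t][None, :]) @ incoming
            M = new

        marginals, Z = {}, None
        for s in nodes:
            b = np.exp(theta_s[s])
            for t in nbrs[s]:
                b = b * M[(t, s)]
            Z = b.sum()                        # same value at every node
            marginals[s] = b / Z
        return marginals, Z

    # the chain 1 - 2 - 3 from the example above, with binary variables
    nodes, edges = [1, 2, 3], [(1, 2), (2, 3)]
    th_s = {s: np.array([0.0, 0.5]) for s in nodes}
    th_st = {e: np.array([[0.4, 0.0], [0.0, 0.4]]) for e in edges}
    marg, Z = sum_product_tree(nodes, edges, th_s, th_st)
    print(marg[2], Z)   # p(x_2) and the partition function

Each message update is the O(m^2) operation described above, and the final beliefs reproduce both the sum-marginals and Z.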
§2. Sum-product on graphs with cycles

- as with max-product, a widely used heuristic with a long history:
  ◮ error-control coding: Gallager, 1963
  ◮ artificial intelligence: Pearl, 1988
  ◮ turbo decoding: Berrou et al., 1993
  ◮ etc.
- some concerns with sum-product on graphs with cycles:
  ◮ no convergence guarantees
  ◮ can have multiple fixed points
  ◮ the final estimate of Z is neither a lower nor an upper bound
- as before, we can consider a broader class of reweighted sum-product algorithms


Tree-reweighted sum-product algorithms

Message update from node t to node s:

    M_{ts}(x_s) \leftarrow \kappa \sum_{x'_t \in \mathcal{X}_t} \exp\bigg[ \underbrace{\frac{\theta_{st}(x_s, x'_t)}{\rho_{st}}}_{\text{reweighted edge}} + \theta_t(x'_t) \bigg] \, \frac{\overbrace{\prod_{v \in N(t) \setminus s} M_{vt}(x'_t)^{\rho_{vt}}}^{\text{reweighted messages}}}{\underbrace{M_{st}(x'_t)^{(1 - \rho_{ts})}}_{\text{opposite message}}}

Properties:
1. The modified updates remain distributed and purely local over the graph.
   • Messages are reweighted with \rho_{st} \in [0, 1].
2. Key differences:
   • The potential on edge (s, t) is rescaled by \rho_{st} \in [0, 1].
   • The update involves the message in the reverse direction along the edge.
3. The choice \rho_{st} = 1 for all edges (s, t) recovers the standard update (see the code sketch below).


Bethe entropy approximation

- define local marginal distributions (e.g., for m = 3 states):

    \mu_s(x_s) = \begin{pmatrix} \mu_s(0) \\ \mu_s(1) \\ \mu_s(2) \end{pmatrix}, \qquad \mu_{st}(x_s, x_t) = \begin{pmatrix} \mu_{st}(0,0) & \mu_{st}(0,1) & \mu_{st}(0,2) \\ \mu_{st}(1,0) & \mu_{st}(1,1) & \mu_{st}(1,2) \\ \mu_{st}(2,0) & \mu_{st}(2,1) & \mu_{st}(2,2) \end{pmatrix}

- define node-based entropy and edge-based mutual information:

    Node-based entropy:  H_s(\mu_s) = - \sum_{x_s} \mu_s(x_s) \log \mu_s(x_s)

    Mutual information:  I_{st}(\mu_{st}) = \sum_{x_s, x_t} \mu_{st}(x_s, x_t) \log \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s)\, \mu_t(x_t)}

- \rho-reweighted Bethe entropy:

    H_{\text{Bethe}}(\mu) = \sum_{s \in V} H_s(\mu_s) - \sum_{(s,t) \in E} \rho_{st} I_{st}(\mu_{st})


Bethe entropy is exact for trees

- exact for trees, using the factorization

    p(x; \theta) = \prod_{s \in V} \mu_s(x_s) \prod_{(s,t) \in E} \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s)\, \mu_t(x_t)}


Reweighted sum-product and Bethe variational principle

Define the local constraint set

    \mathbb{L}(G) = \Big\{ \tau_s, \tau_{st} \;\Big|\; \tau \geq 0, \;\; \sum_{x_s} \tau_s(x_s) = 1, \;\; \sum_{x_t} \tau_{st}(x_s, x_t) = \tau_s(x_s) \Big\}.

Theorem. For any choice of positive edge weights \rho_{st} > 0:

(a) Fixed points of reweighted sum-product are stationary points of the Lagrangian associated with

    A_{\text{Bethe}}(\theta; \rho) := \max_{\tau \in \mathbb{L}(G)} \Big\{ \sum_{s \in V} \langle \tau_s, \theta_s \rangle + \sum_{(s,t) \in E} \langle \tau_{st}, \theta_{st} \rangle + H_{\text{Bethe}}(\tau; \rho) \Big\}.

(b) For valid choices of edge weights \{\rho_{st}\}, the fixed points are unique, and moreover

    \log Z(\theta) \leq A_{\text{Bethe}}(\theta; \rho).

In addition, reweighted sum-product converges with appropriate scheduling.
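Here is a minimal sketch of one synchronous sweep of the reweighted update, together with the ρ-reweighted Bethe entropy of a set of pseudomarginals. All names (trw_sweep, bethe_entropy, the dictionary-based message storage) are our own assumptions rather than anything from the tutorial, and messages are normalized for numerical stability.

    import numpy as np

    def trw_sweep(M, nbrs, theta_s, theta_st, rho):
        """One synchronous sweep of reweighted sum-product.  M[(t, s)] is
        the normalized message from t to s (length-m array); rho is a
        symmetric dict of edge appearance probabilities."""
        m = len(next(iter(M.values())))
        def pot(s, t):   # theta_st viewed as an array indexed [x_s, x_t]
            return theta_st[(s, t)] if (s, t) in theta_st else theta_st[(t, s)].T
        def r(s, t):
            return rho[(s, t)] if (s, t) in rho else rho[(t, s)]
        new = {}
        for (t, s) in M:
            # reweighted incoming messages: prod_{v in N(t)\s} M_vt^(rho_vt)
            inc = np.ones(m)
            for v in nbrs[t]:
                if v != s:
                    inc = inc * M[(v, t)] ** r(v, t)
            # divide by the opposite-direction message: M_st^(1 - rho_ts)
            inc = inc / M[(s, t)] ** (1.0 - r(s, t))
            # edge potential rescaled by 1/rho_st, plus node potential on t
            msg = np.exp(pot(s, t) / r(s, t) + theta_s[t][None, :]) @ inc
            new[(t, s)] = msg / msg.sum()      # kappa: normalization
        return new

    def bethe_entropy(tau_s, tau_st, rho):
        """rho-reweighted Bethe entropy, sum_s H_s - sum_(s,t) rho_st I_st
        (assumes strictly positive pseudomarginals)."""
        H = sum(-(p * np.log(p)).sum() for p in tau_s.values())
        for (s, t), T in tau_st.items():
            H -= rho[(s, t)] * (T * np.log(T / np.outer(tau_s[s], tau_s[t]))).sum()
        return H

Setting every rho entry to 1 reduces trw_sweep to the ordinary sum-product sweep, matching property 3 above.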
Lagrangian derivation of ordinary sum-product

- let's try to solve this problem by a (partial) Lagrangian formulation
- assign a Lagrange multiplier \lambda_{ts}(x_s) for each constraint

    C_{ts}(x_s) := \tau_s(x_s) - \sum_{x_t} \tau_{st}(x_s, x_t) = 0

- we will enforce the normalization (\sum_{x_s} \tau_s(x_s) = 1) and non-negativity constraints explicitly
- the Lagrangian takes the form:

    L(\tau; \lambda) = \langle \theta, \tau \rangle + \sum_{s \in V} H_s(\tau_s) - \sum_{(s,t) \in E(G)} I_{st}(\tau_{st}) + \sum_{(s,t) \in E} \Big[ \sum_{x_t} \lambda_{st}(x_t) C_{st}(x_t) + \sum_{x_s} \lambda_{ts}(x_s) C_{ts}(x_s) \Big]


Lagrangian derivation (part II)

- taking derivatives of the Lagrangian w.r.t. \tau_s and \tau_{st} yields

    \frac{\partial L}{\partial \tau_s(x_s)} = \theta_s(x_s) - \log \tau_s(x_s) + \sum_{t \in N(s)} \lambda_{ts}(x_s) + C

    \frac{\partial L}{\partial \tau_{st}(x_s, x_t)} = \theta_{st}(x_s, x_t) - \log \frac{\tau_{st}(x_s, x_t)}{\tau_s(x_s) \tau_t(x_t)} - \lambda_{ts}(x_s) - \lambda_{st}(x_t) + C'

- setting these partial derivatives to zero and simplifying:

    \tau_s(x_s) \propto \exp\big(\theta_s(x_s)\big) \prod_{t \in N(s)} \exp\big(\lambda_{ts}(x_s)\big)

    \tau_{st}(x_s, x_t) \propto \exp\big(\theta_s(x_s) + \theta_t(x_t) + \theta_{st}(x_s, x_t)\big) \prod_{u \in N(s) \setminus t} \exp\big(\lambda_{us}(x_s)\big) \prod_{v \in N(t) \setminus s} \exp\big(\lambda_{vt}(x_t)\big)

- enforcing the constraint C_{ts}(x_s) = 0 on these representations yields the familiar update rule for the messages M_{ts}(x_s) = \exp(\lambda_{ts}(x_s)):

    M_{ts}(x_s) \leftarrow \sum_{x_t} \exp\big(\theta_t(x_t) + \theta_{st}(x_s, x_t)\big) \prod_{u \in N(t) \setminus s} M_{ut}(x_t)


Convex combinations of trees

Idea: upper bound A(\theta) := \log Z(\theta) with a convex combination of tree-structured problems. By convexity of A, whenever

    \theta = \rho(T^1)\, \theta(T^1) + \rho(T^2)\, \theta(T^2) + \rho(T^3)\, \theta(T^3),

we have

    A(\theta) \leq \rho(T^1)\, A(\theta(T^1)) + \rho(T^2)\, A(\theta(T^2)) + \rho(T^3)\, A(\theta(T^3)),

where
    \rho = \{\rho(T)\}  ≡  probability distribution over spanning trees
    \theta(T)           ≡  tree-structured parameter vector


Finding the tightest upper bound

Observation: for each fixed distribution \rho over spanning trees, there are many such upper bounds.

Goal: find the tightest such upper bound over all trees.

Challenge: the number of spanning trees grows rapidly with the graph size. Example, on the 2-D lattice:

    Grid size    # trees
    9            192
    16           100352
    36           3.26 × 10^13
    100          5.69 × 10^42

By a suitable dual reformulation, this problem can be avoided.

Key duality relation:

    \min_{\{\theta(T)\}: \sum_T \rho(T) \theta(T) = \theta} \; \sum_T \rho(T) A(\theta(T)) \;=\; \max_{\mu \in \mathbb{L}(G)} \big\{ \langle \mu, \theta \rangle + H_{\text{Bethe}}(\mu; \rho_{st}) \big\}


Edge appearance probabilities

Experiment: what is the probability \rho_e that a given edge e \in E belongs to a tree T drawn randomly under \rho?

[Figure: (a) the original graph, with edges labeled b, e, f; (b)-(d) three spanning trees T^1, T^2, T^3, each with weight \rho(T^i) = 1/3.]

In this example: \rho_b = 1, \rho_e = 2/3, \rho_f = 1/3.

The vector \rho = \{ \rho_e \mid e \in E \} must belong to the spanning tree polytope (Edmonds, 1971).
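The spanning-tree counts in the table above can be reproduced with Kirchhoff's matrix-tree theorem: the number of spanning trees equals any cofactor of the graph Laplacian. A small sketch (grid_spanning_trees is our own helper name):

    import numpy as np
    from itertools import product

    def grid_spanning_trees(k):
        # Count spanning trees of the k x k grid graph: by the matrix-tree
        # theorem, this is the determinant of the Laplacian with one row
        # and one column deleted.
        n = k * k
        L = np.zeros((n, n))
        for r, c in product(range(k), repeat=2):
            for rr, cc in [(r, c + 1), (r + 1, c)]:   # right/down neighbors
                if rr < k and cc < k:
                    i, j = r * k + c, rr * k + cc
                    L[i, j] -= 1; L[j, i] -= 1
                    L[i, i] += 1; L[j, j] += 1
        return np.linalg.det(L[1:, 1:])

    for k in [3, 4, 6]:          # grid sizes 9, 16, 36
        print(k * k, grid_spanning_trees(k))
    # prints 192, 100352, and approximately 3.26e13; the 10 x 10 case
    # needs exact integer arithmetic to avoid floating-point error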
Why does entropy arise in the duality?

Due to a deep correspondence between two problems:

Maximum entropy density estimation: maximize the entropy

    H(p) = - \sum_x p(x_1, \dots, x_N) \log p(x_1, \dots, x_N)

subject to expectation constraints of the form

    \sum_x p(x) \phi_\alpha(x) = \hat{\mu}_\alpha.

Maximum likelihood in an exponential family: maximize the likelihood of parameterized densities

    p(x_1, \dots, x_N; \theta) = \exp\Big\{ \sum_\alpha \theta_\alpha \phi_\alpha(x) - A(\theta) \Big\}.


Conjugate dual functions

- conjugate duality is a fertile source of variational representations
- any function f can be used to define another function f^* as follows:

    f^*(v) := \sup_{u \in \mathbb{R}^n} \big\{ \langle v, u \rangle - f(u) \big\}

- it is easy to show that f^* is always a convex function
- how about taking the "dual of the dual", i.e., what is (f^*)^*?
- when f is well-behaved (convex and lower semi-continuous), we have (f^*)^* = f, or alternatively stated:

    f(u) = \sup_{v \in \mathbb{R}^n} \big\{ \langle u, v \rangle - f^*(v) \big\}


Geometric view: supporting hyperplanes

Question: given all hyperplanes in \mathbb{R}^n \times \mathbb{R} with normal (v, -1), what is the intercept of the one that supports epi(f)?

[Figure: the epigraph epi(f) := \{(u, \beta) \in \mathbb{R}^{n+1} \mid f(u) \leq \beta\}, supported from below by the hyperplane \langle v, u \rangle - c_a; a lower hyperplane \langle v, u \rangle - c_b fails to touch it.]

Analytically, we require the smallest c \in \mathbb{R} such that

    \langle v, u \rangle - c \leq f(u) \quad \text{for all } u \in \mathbb{R}^n.

By rearranging, we find that this optimal c^* is the dual value:

    c^* = \sup_{u \in \mathbb{R}^n} \big\{ \langle v, u \rangle - f(u) \big\}.


Example: single Bernoulli

A random variable X \in \{0, 1\} yields an exponential family of the form

    p(x; \theta) \propto \exp(\theta x) \quad \text{with } A(\theta) = \log\big(1 + \exp(\theta)\big).

Let's compute the dual

    A^*(\mu) := \sup_{\theta \in \mathbb{R}} \big\{ \mu\theta - \log[1 + \exp(\theta)] \big\}.

(Possible) stationary point: \mu = \exp(\theta)/[1 + \exp(\theta)].

[Figure: (a) for \mu \in (0, 1), the epigraph of A is supported by the hyperplane \langle \mu, \theta \rangle - A^*(\mu); (b) for \mu outside [0, 1], the epigraph cannot be supported.]

We find that:

    A^*(\mu) = \begin{cases} \mu \log \mu + (1 - \mu) \log(1 - \mu) & \text{if } \mu \in [0, 1] \\ +\infty & \text{otherwise.} \end{cases}

This leads to the variational representation A(\theta) = \max_{\mu \in [0,1]} \big\{ \mu \cdot \theta - A^*(\mu) \big\}.


Geometry of the Bethe variational problem

[Figure: the marginal polytope M(G) sits inside the local polytope L(G); a point \mu_{int} lies in M(G), while a fractional point \mu_{frac} \in L(G) lies outside M(G).]

- belief propagation uses a polyhedral outer approximation to M(G):
  ◮ for any graph, L(G) ⊇ M(G)
  ◮ equality holds ⟺ G is a tree
- Natural question: do BP fixed points ever fall outside of the marginal polytope M(G)?


Illustration: globally inconsistent BP fixed points

Consider the following assignment of pseudomarginals \tau_s, \tau_{st} on the 3-cycle on nodes {1, 2, 3} with binary variables: every node has \tau_s = (0.5, 0.5); two of the edges carry

    \tau_{st} = \begin{pmatrix} 0.4 & 0.1 \\ 0.1 & 0.4 \end{pmatrix},

while the third carries

    \tau_{st} = \begin{pmatrix} 0.1 & 0.4 \\ 0.4 & 0.1 \end{pmatrix}.

- one can verify that \tau \in L(G), and that \tau is a fixed point of belief propagation (with all constant messages)
- however, \tau is globally inconsistent

Note: more generally, for any \tau in the interior of L(G), one can construct a distribution with \tau as a BP fixed point.
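These claims can be checked numerically: the local-consistency equalities place τ in L(G), while a small feasibility LP over the 8 joint configurations shows that no distribution on the 3-cycle has these edge marginals, so τ lies outside M(G). A sketch, assuming edges (1,2) and (2,3) carry the agreement-favoring matrix (scipy's linprog is used only as a feasibility oracle):

    import numpy as np
    from itertools import product
    from scipy.optimize import linprog

    tau_s = {s: np.array([0.5, 0.5]) for s in (1, 2, 3)}
    same = np.array([[0.4, 0.1], [0.1, 0.4]])     # favors x_s = x_t
    diff = np.array([[0.1, 0.4], [0.4, 0.1]])     # favors x_s != x_t
    tau_st = {(1, 2): same, (2, 3): same, (1, 3): diff}

    # local consistency: each tau_st marginalizes to tau_s and tau_t
    for (s, t), T in tau_st.items():
        assert np.allclose(T.sum(axis=1), tau_s[s])
        assert np.allclose(T.sum(axis=0), tau_s[t])
    print("tau lies in L(G)")

    # global consistency: is there p(x1, x2, x3) >= 0 with these edge
    # marginals?  Feasibility LP over the 8 configurations.
    configs = list(product([0, 1], repeat=3))
    A_eq, b_eq = [], []
    for (s, t), T in tau_st.items():
        for a, b in product([0, 1], repeat=2):
            A_eq.append([float(x[s-1] == a and x[t-1] == b) for x in configs])
            b_eq.append(T[a, b])
    res = linprog(c=[0.0] * 8, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * 8)
    print("tau in M(G)?", res.success)   # False: no such joint distribution
    # intuition: P(x1 != x3) <= P(x1 != x2) + P(x2 != x3) = 0.4, yet tau
    # demands P(x1 != x3) = 0.8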
High-level perspective: a broad class of methods

- message-passing algorithms (e.g., mean field, belief propagation) are solving approximate versions of an exact variational principle in exponential families
- there are two distinct components to the approximations:
  (a) either inner or outer bounds on the marginal polytope M
  (b) various approximations to the entropy function -A^*(\mu)

Refining one or both components yields better approximations:

- BP: polyhedral outer bound and non-convex Bethe approximation
- Kikuchi and variants: tighter polyhedral outer bounds and better entropy approximations (e.g., Yedidia et al., 2002)
- Expectation propagation: better outer bounds and Bethe-like entropy approximations (Minka, 2002)