Message passing and approximate message passing
Arian Maleki, Columbia University

What is the problem?

Given a pdf $\mu(x_1, x_2, \ldots, x_n)$ we are interested in
- $\arg\max_{x_1,\ldots,x_n} \mu(x_1, x_2, \ldots, x_n)$
- Calculating the marginal pdf $\mu_i(x_i)$
- Calculating the joint distribution $\mu_S(x_S)$ for $S \subset \{1, 2, \ldots, n\}$

These problems are important because
- Many inference problems take these forms
- Many applications in image processing, communication, machine learning, signal processing

This talk: marginalizing the distribution
- Most of the tools carry over to the other problems

This talk: marginalization

Problem: Given the pdf $\mu(x_1, x_2, \ldots, x_n)$, find $\mu_i(x_i)$
- Simplification: all $x_j$'s are binary random variables

Is it difficult?
- Yes. No polynomial-time algorithm
- Complexity of variable elimination: $2^{n-1}$

Solution:
- No "good solution" for a generic $\mu$
- Consider structured $\mu$

Structured µ

Problem: Given the pdf $\mu(x_1, x_2, \ldots, x_n)$, find $\mu_i(x_i)$
- Simplification: all $x_j$'s are binary random variables

Example 1: $x_1, x_2, \ldots, x_n$ are independent
- Easy marginalization
- Not interesting

Example 2: independent subsets
\[ \mu(x_1, x_2, \ldots, x_n) = \mu'_1(x_1, \ldots, x_k)\,\mu'_2(x_{k+1}, \ldots, x_n) \]
- More interesting and less easy
- Complexity still exponential in the size of the subsets
- Does not cover many interesting cases

Structured µ, cont'd

Example 3: a more interesting structure:
\[ \mu(x_1, x_2, \ldots, x_n) = \mu'_1(x_{S_1})\,\mu'_2(x_{S_2})\cdots\mu'_\ell(x_{S_\ell}) \]
- Unlike independence, $S_i \cap S_j \neq \emptyset$ is allowed
- Covers many interesting practical problems
- Let $|S_i| \ll n$ for all $i$. Is marginalization easier than for a generic $\mu$?
  o Not clear!
We introduce the factor graph to address this question.

Factor graph

Slight generalization:
\[ \mu(x_1, x_2, \ldots, x_n) = \frac{1}{Z} \prod_{a \in F} \psi_a(x_{S_a}) \]
- $\psi_a : \{0,1\}^{|S_a|} \to \mathbb{R}_+$, not necessarily a pdf
- $Z$: normalizing constant, called the "partition function"
- $\psi_a(x_{S_a})$: factors

Factor graph:
- Variable nodes: $n$ nodes corresponding to $x_1, \ldots, x_n$
- Factor (function) nodes: $|F|$ nodes corresponding to $\psi_a(x_{S_a}), \psi_b(x_{S_b}), \ldots$
- Edge between factor node $a$ and variable node $i$ iff $x_i \in S_a$
[Figure: bipartite graph with variable nodes $1, \ldots, n$ and factor nodes $a, b, \ldots$]

Factor graph, cont'd

Some notation:
- $\partial a$: neighbors of factor node $a$: $\partial a = \{i : x_i \in S_a\}$
- $\partial i$: neighbors of variable node $i$: $\partial i = \{b \in F : i \in S_b\}$

Factor graph example

\[ \mu(x_1, x_2, x_3) = \frac{1}{Z}\, \psi_a(x_1, x_2)\, \psi_b(x_2, x_3) \]
- $\partial a = \{1, 2\}$, $\partial b = \{2, 3\}$
- $\partial 1 = \{a\}$, $\partial 2 = \{a, b\}$
[Figure: variable nodes 1, 2, 3 and factor nodes a, b]

Why factor graphs?

Factor graphs simplify understanding structure:
- Independence: the graph has disconnected pieces
- Tree graphs: easy marginalization
- Loops (especially short loops): generally make the problem more complicated

Marginalization on tree-structured graphs

Tree: a graph with no loops
- Marginalization by variable elimination is efficient on such graphs (distributions)
  o linear in $n$
  o exponential in the size of the factors
- Proof: on the board

Summary of variable elimination on trees

Notation:
- $\mu(x_1, x_2, \ldots, x_n) = \frac{1}{Z}\prod_{a\in F} \psi_a(x_{\partial a})$
- $\mu_{a\to j}$: belief (message) from factor node $a$ to variable node $j$
- $\hat\mu_{j\to a}$: belief (message) from variable node $j$ to factor node $a$

We also have
\[
\mu_{a\to j}(x_j) = \sum_{x_{\partial a\setminus j}} \psi_a(x_{\partial a}) \prod_{\ell\in\partial a\setminus j} \hat\mu_{\ell\to a}(x_\ell),
\qquad
\hat\mu_{j\to a}(x_j) = \prod_{b\in\partial j\setminus a} \mu_{b\to j}(x_j).
\]
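The update equations above translate almost line by line into code. Below is a minimal sketch, not taken from the slides, of sum-product message passing on a small tree-structured factor graph with binary variables; the factor tables, the node names, and the synchronous update schedule are illustrative choices.

```python
import numpy as np
import itertools

# psi_a(x1, x2) and psi_b(x2, x3) as 2x2 tables indexed by values in {0, 1}
factors = {
    "a": (("x1", "x2"), np.array([[1.0, 2.0], [2.0, 1.0]])),
    "b": (("x2", "x3"), np.array([[3.0, 1.0], [1.0, 1.0]])),
}
variables = ["x1", "x2", "x3"]

def neighbors_of_var(i):
    return [a for a, (scope, _) in factors.items() if i in scope]

# nu[(a, i)]: factor -> variable message; nu_hat[(i, a)]: variable -> factor message
nu = {(a, i): np.ones(2) for a, (scope, _) in factors.items() for i in scope}
nu_hat = {(i, a): np.ones(2) for a, (scope, _) in factors.items() for i in scope}

for t in range(5):                                 # a few sweeps suffice on a small tree
    for a, (scope, psi) in factors.items():        # factor -> variable updates
        for pos, i in enumerate(scope):
            msg = np.zeros(2)
            for values in itertools.product([0, 1], repeat=len(scope)):
                rest = np.prod([nu_hat[(j, a)][values[q]]
                                for q, j in enumerate(scope) if j != i])
                msg[values[pos]] += psi[values] * rest
            nu[(a, i)] = msg / msg.sum()
    for i in variables:                            # variable -> factor updates
        for a in neighbors_of_var(i):
            msg = np.ones(2)
            for b in neighbors_of_var(i):
                if b != a:
                    msg *= nu[(b, i)]
            nu_hat[(i, a)] = msg / msg.sum()

# Marginal of x_i: normalized product of incoming factor -> variable messages
for i in variables:
    belief = np.ones(2)
    for a in neighbors_of_var(i):
        belief *= nu[(a, i)]
    print(i, belief / belief.sum())
```

On a tree, a few synchronous sweeps reproduce the exact marginals, which is the content of the lemma stated for belief propagation below.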
Belief propagation algorithm on trees

Based on variable elimination on trees, consider the iterative algorithm
\[
\nu^{t+1}_{a\to j}(x_j) = \sum_{x_{\partial a\setminus j}} \psi_a(x_{\partial a}) \prod_{\ell\in\partial a\setminus j} \hat\nu^{t}_{\ell\to a}(x_\ell),
\qquad
\hat\nu^{t+1}_{j\to a}(x_j) = \prod_{b\in\partial j\setminus a} \nu^{t}_{b\to j}(x_j).
\]
- $\nu^t_{a\to j}$ and $\hat\nu^t_{j\to a}$ are the beliefs of the nodes at time $t$
- Each node propagates its belief to the other nodes

Hope: convergence to a "good" fixed point
\[
\nu^{t}_{b\to j}(x_j) \to \mu_{b\to j}(x_j), \qquad \hat\nu^{t}_{j\to b}(x_j) \to \hat\mu_{j\to b}(x_j) \qquad \text{as } t\to\infty.
\]

Why BP and not variable elimination?
- On trees: no good reason
- On loopy graphs: BP can still be applied easily

Lemma: On a tree graph, BP converges to the marginal distributions in a finite number of iterations, i.e., the limits displayed above are reached exactly after finitely many steps.

Summary of belief propagation on trees
- Equivalent to variable elimination
- Exact on trees
- Converges in a finite number of iterations: twice the maximum depth of the tree

Loopy belief propagation

Apply belief propagation to a loopy graph, using the same update equations.
- Wait until the algorithm converges
- Announce $\prod_{a\in\partial j} \nu^{\infty}_{a\to j}(x_j)$ (after normalization) as $\mu_j(x_j)$

Challenges for loopy BP
- Does not necessarily converge
  o Example: $\mu(x_1, x_2, x_3) = \mathbb{I}(x_1 \neq x_2)\,\mathbb{I}(x_2 \neq x_3)\,\mathbb{I}(x_3 \neq x_1)$
  o $t = 1$: start with $x_1 = 0$
  o $t = 2 \Rightarrow x_2 = 1$
  o $t = 3 \Rightarrow x_3 = 0$
  o $t = 4 \Rightarrow x_1 = 1$, and the beliefs keep oscillating
- Even if it converges, the result is not necessarily the marginal distribution
- Can have more than one fixed point
[Figure: triangle graph on variables 1, 2, 3]

Advantages of loopy BP

In many applications it works really well:
- Coding theory
- Machine vision
- Compressed sensing
- ...
In some cases, loopy BP can be analyzed theoretically.

Some theoretical results on loopy BP
- The BP equations have at least one fixed point
  o proof uses Brouwer's fixed-point theorem
  o BP does not necessarily converge to any of the fixed points
- BP is exact on trees
- BP is "accurate" for "locally tree-like" graphs
  o Gallager 1963, Luby et al. 2001, Richardson and Urbanke 2001
- Several methods exist for verifying the correctness of BP
  o computation trees
  o Dobrushin condition
  o ...

Locally tree-like graphs

BP is "accurate" for "locally tree-like" graphs.
- Heuristic: sparse graphs (without many edges) can be "locally tree-like"
How to check?
- If girth(G) is large, of order $\log(n)$ (girth(G) is the length of the shortest cycle)
- A random $(d_v, d_c)$ graph is locally tree-like with high probability

Dobrushin condition

A sufficient condition for convergence of BP. Influence of node $j$ on node $i$:
\[
C_{ij} = \sup_{x, x'} \Big\{ \big\| \mu(x_i \mid x_{V\setminus i}) - \mu(x_i \mid x'_{V\setminus i}) \big\|_{TV} \ : \ x_\ell = x'_\ell \ \forall \ell \neq j \Big\}
\]
Intuition: if the influence of the nodes on each other is very small, then oscillations should not occur.

Theorem. If $\gamma = \sup_i \sum_j C_{ij} < 1$, then the BP marginals converge to the true marginals.

Summary of BP
- Same update equations as above
- Exact on trees
- Accurate on locally tree-like graphs
- No generic proof technique that works on all problems
- Successful in many applications, even when we cannot prove convergence
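To illustrate the "accurate but not exact" behavior on a loopy graph, here is a minimal sketch, not from the slides, of loopy BP on a single cycle of four binary variables with pairwise factors; the factor tables, the parallel schedule, and the fixed number of iterations are arbitrary choices, and the BP belief at one node is compared against the brute-force marginal.

```python
import numpy as np
import itertools

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]               # one pairwise factor per edge
psi = {e: np.array([[2.0, 1.0], [1.0, 2.0]]) for e in edges}
psi[(0, 1)] = np.array([[3.0, 1.0], [1.0, 2.0]])       # break the symmetry a little

# variable -> factor messages, one per (edge, endpoint); start uniform
m = {(e, i): np.ones(2) for e in edges for i in e}

def factor_to_var(e, i, msgs):
    """Sum-product message from edge-factor e to variable i."""
    j = e[0] if e[1] == i else e[1]
    table = psi[e] if e[0] == i else psi[e].T          # orient table as (x_i, x_j)
    out = table @ msgs[(e, j)]
    return out / out.sum()

for t in range(50):                                    # iterate and hope for convergence
    new_m = {}
    for e in edges:
        for i in e:
            # variable i -> factor e: product of messages from i's other factors
            prod = np.ones(2)
            for e2 in edges:
                if i in e2 and e2 != e:
                    prod *= factor_to_var(e2, i, m)
            new_m[(e, i)] = prod / prod.sum()
    m = new_m

# BP belief at variable 0 versus the exact marginal by brute force
belief = np.ones(2)
for e in edges:
    if 0 in e:
        belief *= factor_to_var(e, 0, m)
belief /= belief.sum()

exact = np.zeros(2)
for x in itertools.product([0, 1], repeat=4):
    exact[x[0]] += np.prod([psi[e][x[e[0]], x[e[1]]] for e in edges])
exact /= exact.sum()
print("BP belief:", belief, "exact marginal:", exact)
```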
More challenges

What if $\mu$ is a pdf on $\mathbb{R}^n$? The extension seems straightforward:
\[
\nu^{t+1}_{a\to j}(x_j) = \int \psi_a(x_{\partial a}) \prod_{\ell\in\partial a\setminus j} \hat\nu^{t}_{\ell\to a}(x_\ell)\, dx_{\partial a\setminus j},
\qquad
\hat\nu^{t+1}_{j\to a}(x_j) = \prod_{b\in\partial j\setminus a} \nu^{t}_{b\to j}(x_j).
\]
But the calculations become complicated:
- Must discretize the pdf to a certain accuracy
- Lots of numerical multiple integrations

Exception: Gaussian graphical models
- $\mu(x) = \frac{\exp(-x^T J x + c^T x)}{Z}$
- All messages remain Gaussian
- NOT the focus of this talk

Approximate message passing

Model
\[ y = A x_o + w \]
- $x_o$: $k$-sparse vector in $\mathbb{R}^N$
- $A$: $n \times N$ design matrix
- $y$: measurement vector in $\mathbb{R}^n$
- $w$: measurement noise in $\mathbb{R}^n$
[Figure: $y = A x_o + w$ with a $k$-sparse signal $x_o$]

LASSO

Many useful heuristic algorithms:
- Noiseless: $\ell_1$-minimization
\[ \min_x \ \|x\|_1 \quad \text{subject to } y = Ax, \qquad \|x\|_1 = \sum_i |x_i| \]
- Noisy: LASSO
\[ \min_x \ \frac{1}{2}\|y - Ax\|_2^2 + \lambda \|x\|_1 \]
- Both are convex optimization problems

Chen, Donoho, Saunders (96), Tibshirani (96)

Algorithmic challenges of sparse recovery

Use convex optimization tools to solve the LASSO. Computational complexity:
- Interior point methods:
  o generic tools
  o appropriate for $N < 5000$
- Homotopy methods:
  o use the structure of the LASSO
  o appropriate for $N < 50000$
- First-order methods (see the sketch below):
  o low computational complexity per iteration
  o require many iterations
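As a concrete instance of a first-order method, here is a minimal sketch, not from the slides, of iterative soft thresholding (IST) for the LASSO; the problem sizes, step size, and $\lambda$ are arbitrary illustrative choices, and `soft()` is the entrywise soft-thresholding operator.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, k = 100, 400, 10
A = rng.normal(size=(n, N)) / np.sqrt(n)             # columns roughly unit norm
x0 = np.zeros(N); x0[rng.choice(N, k, replace=False)] = rng.normal(size=k)
y = A @ x0 + 0.01 * rng.normal(size=n)

lam = 0.1
step = 1.0 / np.linalg.norm(A, 2) ** 2               # 1 / ||A||^2 guarantees descent

def soft(u, thresh):
    return np.sign(u) * np.maximum(np.abs(u) - thresh, 0.0)

x = np.zeros(N)
for t in range(500):                                 # many cheap iterations
    x = soft(x + step * A.T @ (y - A @ x), step * lam)

print("LASSO objective:", 0.5 * np.sum((y - A @ x) ** 2) + lam * np.sum(np.abs(x)))
```

Each iteration costs only two matrix-vector products, which is what makes first-order methods attractive at large $N$; the price is the number of iterations.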
Message passing for LASSO

Define
\[
\mu(dx) = \frac{1}{Z}\, e^{-\beta\lambda \|x\|_1 - \frac{\beta}{2}\|y - Ax\|_2^2}\, dx
\]
- $Z$: normalization constant
- $\beta > 0$
As $\beta \to \infty$:
- $\mu$ concentrates around the solution of the LASSO, $\hat x_\lambda$
Marginalize $\mu$ to obtain $\hat x_\lambda$.

Message passing algorithm

\[
\mu(dx) = \frac{1}{Z}\, e^{-\beta\lambda \sum_i |x_i| - \frac{\beta}{2}\sum_a (y_a - (Ax)_a)^2}\, dx
\]
Belief propagation update rules:
\[
\nu^{t+1}_{i\to a}(s_i) \cong e^{-\beta\lambda|s_i|} \prod_{b\neq a} \hat\nu^{t}_{b\to i}(s_i),
\qquad
\hat\nu^{t}_{a\to i}(s_i) \cong \int e^{-\frac{\beta}{2}(y_a - (As)_a)^2} \prod_{j\neq i} d\nu^{t}_{j\to a}(s_j).
\]
Pearl (82)
(Note the convention from here on: $\nu_{i\to a}$ is the variable-to-factor message and $\hat\nu_{a\to i}$ the factor-to-variable message.)

Issues of the message passing algorithm

Challenges:
- Messages are distributions over the real line
- Calculation of messages is difficult
  o numerical integration
  o the number of variables in the integral is $N - 1$
- The graph is not "tree-like"
  o no good analysis tools

Solution: the blessing of dimensionality
- As $N \to \infty$: $\hat\nu^{t}_{a\to i}(s_i)$ converges to a Gaussian
- As $N \to \infty$: $\nu^{t}_{i\to a}(s_i)$ converges to
\[
f_\beta(s; x, b) \equiv \frac{1}{z_\beta(x, b)}\, e^{-\beta|s| - \frac{\beta}{2b}(s - x)^2}.
\]

Simplifying messages for large N

Simplification of $\hat\nu^{t}_{a\to i}(s_i)$:
\[
\hat\nu^{t}_{a\to i}(s_i) \cong \int e^{-\frac{\beta}{2}(y_a - (As)_a)^2} \prod_{j\neq i} d\nu^{t}_{j\to a}(s_j)
\cong \mathbb{E}\exp\Big(-\frac{\beta}{2}\big(y_a - A_{ai}s_i - \textstyle\sum_{j\neq i} A_{aj}s_j\big)^2\Big)
\cong \mathbb{E}\exp\Big(-\frac{\beta}{2}(A_{ai}s_i - Z)^2\Big),
\]
where $Z = y_a - \sum_{j\neq i} A_{aj}s_j \xrightarrow{d} W$ with $W$ Gaussian, and
\[
\big|\mathbb{E}\, h_{s_i}(Z) - \mathbb{E}\, h_{s_i}(W)\big| \le \frac{C'''_t}{N^{1/2}\,(\hat\tau^{t}_{a\to i})^{3/2}}.
\]

Message passing

As $N \to \infty$, $\hat\nu^{t}_{a\to i}$ converges to a Gaussian distribution.

Theorem. Let $x^{t}_{j\to a}$ and $\tau^{t}_{j\to a}/\beta$ denote the mean and variance of the distribution $\nu^{t}_{j\to a}$. Then there exists a constant $C'_t$ such that
\[
\sup_{s_i}\big|\hat\nu^{t}_{a\to i}(s_i) - \hat\phi^{t}_{a\to i}(s_i)\big| \le \frac{C'_t}{N\,(\hat\tau^{t}_{a\to i})^{3}},
\qquad
\hat\phi^{t}_{a\to i}(ds_i) \equiv \sqrt{\frac{\beta A_{ai}^2}{2\pi\hat\tau^{t}_{a\to i}}}\; e^{-\beta(A_{ai}s_i - z^{t}_{a\to i})^2/2\hat\tau^{t}_{a\to i}}\, ds_i,
\]
where
\[
z^{t}_{a\to i} \equiv y_a - \sum_{j\neq i} A_{aj} x^{t}_{j\to a},
\qquad
\hat\tau^{t}_{a\to i} \equiv \sum_{j\neq i} A_{aj}^2 \tau^{t}_{j\to a}.
\]
Donoho, M., Montanari (11)

Shape of $\nu^{t}_{i\to a}$: intuition

Summary:
- $\hat\nu^{t}_{b\to i}(s_i) \approx$ Gaussian
- $\nu^{t+1}_{i\to a}(s_i) \cong e^{-\beta\lambda|s_i|} \prod_{b\neq a} \hat\nu^{t}_{b\to i}(s_i)$

Shape of $\nu^{t+1}_{i\to a}(s_i)$:
- $\nu^{t+1}_{i\to a}(s_i) \approx \frac{1}{z_\beta(x, b)}\, e^{-\beta|s| - \frac{\beta}{2b}(s - x)^2}$ for appropriate $x$ and $b$

Message passing (cont'd)

Theorem. Suppose that $\hat\nu^{t}_{a\to i} = \hat\phi^{t}_{a\to i}$, with mean $z^{t}_{a\to i}$ and variance $\hat\tau^{t}_{a\to i} = \hat\tau^{t}$. Then at the next iteration we have
\[
\big|\nu^{t+1}_{i\to a}(s_i) - \phi^{t+1}_{i\to a}(s_i)\big| < C/n,
\qquad
\phi^{t+1}_{i\to a}(s_i) \equiv \lambda\, f_\beta\Big(\lambda s_i;\ \lambda\sum_{b\neq a} A_{bi} z^{t}_{b\to i},\ \lambda^2(1 + \hat\tau^{t})\Big),
\]
with $f_\beta(s; x, b) \equiv \frac{1}{z_\beta(x,b)}\, e^{-\beta|s| - \frac{\beta}{2b}(s-x)^2}$ as before.
Donoho, M., Montanari (11)

Next asymptotic: $\beta \to \infty$

Reminder:
- $x^{t}_{i\to a}$: mean of $\nu^{t}_{i\to a}$
- $z^{t}_{a\to i}$: mean of $\hat\nu^{t}_{a\to i}$
- $z^{t}_{a\to i} \equiv y_a - \sum_{j\neq i} A_{aj} x^{t}_{j\to a}$

Can we write $x^{t+1}_{i\to a}$ in terms of the $z^{t}_{b\to i}$?
\[
x^{t+1}_{i\to a} = \frac{1}{z_\beta}\int s_i\, e^{-\beta|s_i| - \beta\big(s_i - \sum_{b\neq a} A_{bi} z^{t}_{b\to i}\big)^2}\, ds_i
\]
- Can be done, but ...
- The Laplace method simplifies it as $\beta \to \infty$:
\[
x^{t+1}_{i\to a} = \eta\Big(\sum_{b\neq a} A_{bi} z^{t}_{b\to i}\Big),
\]
where $\eta(\cdot)$ is the soft-thresholding function.

Summary
- $x^{t}_{i\to a}$: mean of $\nu^{t}_{i\to a}$; $z^{t}_{a\to i}$: mean of $\hat\nu^{t}_{a\to i}$
\[
z^{t}_{a\to i} = y_a - \sum_{j\neq i} A_{aj} x^{t}_{j\to a},
\qquad
x^{t+1}_{i\to a} = \eta\Big(\sum_{b\neq a} A_{bi} z^{t}_{b\to i}\Big).
\]

Comparison of MP and IST

MP (one message per edge):
\[
z^{t}_{a\to i} = y_a - \sum_{j\neq i} A_{aj} x^{t}_{j\to a},
\qquad
x^{t+1}_{i\to a} = \eta\Big(\sum_{b\neq a} A_{bi} z^{t}_{b\to i}\Big).
\]
IST (one value per node, cf. the next slides):
\[
z^{t}_{a} = y_a - \sum_{j} A_{aj} x^{t}_{j},
\qquad
x^{t+1}_{i} = \eta\Big(x^{t}_{i} + \sum_{b} A_{bi} z^{t}_{b}\Big).
\]
[Figure: bipartite graphs contrasting MP's edge messages $x_{i\to a}$, $z_{a\to i}$ with IST's per-node quantities $x_i$, $z_a$]
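To make the bookkeeping concrete, here is a minimal sketch, not from the slides, of the edge-message iteration in the summary above, with illustrative problem sizes and a fixed threshold $\lambda$ in place of an adaptive one; it stores full $N \times n$ and $n \times N$ arrays of messages, which is exactly the overhead that AMP removes next.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N, lam = 100, 200, 0.3
A = rng.normal(size=(n, N)) / np.sqrt(n)          # unit-norm columns on average
x_true = np.zeros(N); x_true[:5] = 1.0
y = A @ x_true + 0.01 * rng.normal(size=n)

def eta(u, thresh):                               # soft-thresholding nonlinearity
    return np.sign(u) * np.maximum(np.abs(u) - thresh, 0.0)

X = np.zeros((N, n))                              # X[i, a] = x_{i->a}
for t in range(30):
    # z_{a->i} = y_a - sum_{j != i} A_{aj} x_{j->a}
    r = (A * X.T).sum(axis=1)                     # r[a] = sum_j A_{aj} x_{j->a}
    Z = (y - r)[:, None] + A * X.T                # add the j = i term back per column
    # x_{i->a} = eta( sum_{b != a} A_{bi} z_{b->i} ; lam )
    S = (A * Z).sum(axis=0)                       # S[i] = sum_b A_{bi} z_{b->i}
    X = eta(S[:, None] - (A * Z).T, lam)

x_hat = eta(S, lam)                               # node estimate uses all factors b
print("relative error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```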
Approximate message passing

We can further simplify the messages:
\[
x^{t+1} = \eta\big(x^{t} + A^{T} z^{t};\ \lambda_t\big),
\qquad
z^{t} = y - A x^{t} + \frac{\|x^{t}\|_0}{n}\, z^{t-1}.
\]

IST versus AMP

IST:
\[
x^{t+1} = \eta(x^{t} + A^{T} z^{t};\ \lambda^{t}), \qquad z^{t} = y - A x^{t}
\]
AMP:
\[
x^{t+1} = \eta(x^{t} + A^{T} z^{t};\ \lambda^{t}), \qquad z^{t} = y - A x^{t} + \frac{1}{n}\|x^{t}\|_0\, z^{t-1}
\]
[Figure: mean square error versus iteration for IST and AMP; state evolution prediction versus empirical results]

Theoretical guarantees

We can predict the performance of AMP at every iteration in the asymptotic regime.

Idea:
- $x^{t} + A^{T} z^{t}$ can be modeled as signal plus Gaussian noise
- We keep track of the noise variance
- I will discuss this part in the "Risk Seminar"
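Below is a minimal sketch, not from the slides, of the AMP iteration above with illustrative sizes and noise level. The threshold is taken as $\alpha\,\tau_t$ with $\tau_t = \|z^t\|/\sqrt{n}$, a common practical stand-in for the slides' $\lambda_t$; $\tau_t$ is also the usual empirical proxy for the effective noise variance that state evolution tracks.

```python
import numpy as np

rng = np.random.default_rng(2)
n, N, k, alpha = 250, 500, 25, 2.0
A = rng.normal(size=(n, N)) / np.sqrt(n)           # unit-norm columns on average
x0 = np.zeros(N); x0[rng.choice(N, k, replace=False)] = rng.normal(size=k)
y = A @ x0 + 0.01 * rng.normal(size=n)

def eta(u, t):                                     # soft thresholding
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

x, z = np.zeros(N), np.zeros(n)
for t in range(30):
    # z^t = y - A x^t + (||x^t||_0 / n) z^{t-1}   (the Onsager correction term)
    z = y - A @ x + (np.count_nonzero(x) / n) * z
    tau = np.linalg.norm(z) / np.sqrt(n)           # effective noise level estimate
    # x^{t+1} = eta(x^t + A^T z^t ; lambda_t), here lambda_t = alpha * tau
    x = eta(x + A.T @ z, alpha * tau)
    if t % 5 == 0:
        print(f"iter {t:2d}  MSE {np.mean((x - x0) ** 2):.4f}  tau {tau:.4f}")
# Dropping the Onsager term (z = y - A @ x) turns this loop into plain IST.
```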