Message passing and approximate message passing
Arian Maleki, Columbia University

What is the problem?

Given a pdf $\mu(x_1, x_2, \ldots, x_n)$ we are interested in
- $\arg\max_{x_1,\ldots,x_n} \mu(x_1, x_2, \ldots, x_n)$
- Calculating the marginal pdf $\mu_i(x_i)$
- Calculating the joint distribution $\mu_S(x_S)$ for $S \subset \{1, 2, \ldots, n\}$

These problems are important because
- Many inference problems take these forms
- Many applications in image processing, communication, machine learning, signal processing

This talk: marginalizing the distribution
- Most of the tools carry over to the other problems

This talk: marginalization

Problem: Given the pdf $\mu(x_1, x_2, \ldots, x_n)$, find $\mu_i(x_i)$
- Simplification: all $x_j$'s are binary random variables

Is it difficult?
- Yes. No polynomial-time algorithm
- Complexity of variable elimination: $2^{n-1}$

Solution:
- No "good solution" for a generic $\mu$
- Consider structured $\mu$

Structured µ

Problem: Given the pdf $\mu(x_1, x_2, \ldots, x_n)$, find $\mu_i(x_i)$
- Simplification: all $x_j$'s are binary random variables

Example 1: $x_1, x_2, \ldots, x_n$ are independent
- Easy marginalization
- Not interesting

Example 2: independent subsets
\[ \mu(x_1, x_2, \ldots, x_n) = \mu'_1(x_1, \ldots, x_k)\,\mu'_2(x_{k+1}, \ldots, x_n) \]
- More interesting and less easy
- Complexity still exponential in the size of the subsets
- Does not cover many interesting cases

Structured µ, cont'd

Example 3: a more interesting structure:
\[ \mu(x_1, x_2, \ldots, x_n) = \mu'_1(x_{S_1})\,\mu'_2(x_{S_2})\cdots\mu'_\ell(x_{S_\ell}) \]
- Unlike independence, $S_i \cap S_j \neq \emptyset$ is allowed
- Covers many interesting practical problems
- Let $|S_i| \ll n$ for all $i$. Is marginalization easier than for a generic $\mu$?
  o Not clear!
We introduce the factor graph to address this question.

Factor graph

Slight generalization:
\[ \mu(x_1, x_2, \ldots, x_n) = \frac{1}{Z} \prod_{a \in F} \psi_a(x_{S_a}) \]
- $\psi_a : \{0,1\}^{|S_a|} \to \mathbb{R}_+$, not necessarily a pdf
- $Z$: normalizing constant, called the "partition function"
- $\psi_a(x_{S_a})$: factors

Factor graph:
- Variable nodes: $n$ nodes corresponding to $x_1, \ldots, x_n$
- Factor (function) nodes: $|F|$ nodes corresponding to $\psi_a(x_{S_a}), \psi_b(x_{S_b}), \ldots$
- Edge between factor node $a$ and variable node $i$ iff $x_i \in S_a$
[Figure: bipartite graph with variable nodes $1, \ldots, n$ and factor nodes $a, b, \ldots$]

Factor graph, cont'd

Some notation:
- $\partial a$: neighbors of factor node $a$: $\partial a = \{i : x_i \in S_a\}$
- $\partial i$: neighbors of variable node $i$: $\partial i = \{b \in F : i \in S_b\}$

Factor graph example

\[ \mu(x_1, x_2, x_3) = \frac{1}{Z}\, \psi_a(x_1, x_2)\, \psi_b(x_2, x_3) \]
- $\partial a = \{1, 2\}$, $\partial b = \{2, 3\}$
- $\partial 1 = \{a\}$, $\partial 2 = \{a, b\}$
[Figure: variable nodes 1, 2, 3 and factor nodes a, b]

Why factor graphs?

Factor graphs simplify understanding structure:
- Independence: the graph has disconnected pieces
- Tree graphs: easy marginalization
- Loops (especially short loops): generally make the problem more complicated

Marginalization on tree-structured graphs

Tree: a graph with no loops
- Marginalization by variable elimination is efficient on such graphs (distributions)
  o linear in $n$
  o exponential in the size of the factors
- Proof: on the board

Summary of variable elimination on trees

Notation:
- $\mu(x_1, x_2, \ldots, x_n) = \frac{1}{Z}\prod_{a\in F} \psi_a(x_{\partial a})$
- $\mu_{a\to j}$: belief (message) from factor node $a$ to variable node $j$
- $\hat\mu_{j\to a}$: belief (message) from variable node $j$ to factor node $a$

We also have
\[
\mu_{a\to j}(x_j) = \sum_{x_{\partial a\setminus j}} \psi_a(x_{\partial a}) \prod_{\ell\in\partial a\setminus j} \hat\mu_{\ell\to a}(x_\ell),
\qquad
\hat\mu_{j\to a}(x_j) = \prod_{b\in\partial j\setminus a} \mu_{b\to j}(x_j).
\]
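The update equations above translate almost line by line into code. Below is a minimal sketch, not taken from the slides, of sum-product message passing on a small tree-structured factor graph with binary variables; the factor tables, the node names, and the synchronous update schedule are illustrative choices.

```python
import numpy as np
import itertools

# psi_a(x1, x2) and psi_b(x2, x3) as 2x2 tables indexed by values in {0, 1}
factors = {
    "a": (("x1", "x2"), np.array([[1.0, 2.0], [2.0, 1.0]])),
    "b": (("x2", "x3"), np.array([[3.0, 1.0], [1.0, 1.0]])),
}
variables = ["x1", "x2", "x3"]

def neighbors_of_var(i):
    return [a for a, (scope, _) in factors.items() if i in scope]

# nu[(a, i)]: factor -> variable message; nu_hat[(i, a)]: variable -> factor message
nu = {(a, i): np.ones(2) for a, (scope, _) in factors.items() for i in scope}
nu_hat = {(i, a): np.ones(2) for a, (scope, _) in factors.items() for i in scope}

for t in range(5):                                 # a few sweeps suffice on a small tree
    for a, (scope, psi) in factors.items():        # factor -> variable updates
        for pos, i in enumerate(scope):
            msg = np.zeros(2)
            for values in itertools.product([0, 1], repeat=len(scope)):
                rest = np.prod([nu_hat[(j, a)][values[q]]
                                for q, j in enumerate(scope) if j != i])
                msg[values[pos]] += psi[values] * rest
            nu[(a, i)] = msg / msg.sum()
    for i in variables:                            # variable -> factor updates
        for a in neighbors_of_var(i):
            msg = np.ones(2)
            for b in neighbors_of_var(i):
                if b != a:
                    msg *= nu[(b, i)]
            nu_hat[(i, a)] = msg / msg.sum()

# Marginal of x_i: normalized product of incoming factor -> variable messages
for i in variables:
    belief = np.ones(2)
    for a in neighbors_of_var(i):
        belief *= nu[(a, i)]
    print(i, belief / belief.sum())
```

On a tree, a few synchronous sweeps reproduce the exact marginals, which is the content of the lemma stated for belief propagation below.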
Belief propagation algorithm on trees

Based on variable elimination on trees, consider the iterative algorithm
\[
\nu^{t+1}_{a\to j}(x_j) = \sum_{x_{\partial a\setminus j}} \psi_a(x_{\partial a}) \prod_{\ell\in\partial a\setminus j} \hat\nu^{t}_{\ell\to a}(x_\ell),
\qquad
\hat\nu^{t+1}_{j\to a}(x_j) = \prod_{b\in\partial j\setminus a} \nu^{t}_{b\to j}(x_j).
\]
- $\nu^t_{a\to j}$ and $\hat\nu^t_{j\to a}$ are the beliefs of the nodes at time $t$
- Each node propagates its belief to the other nodes

Hope: convergence to a "good" fixed point
\[
\nu^{t}_{b\to j}(x_j) \to \mu_{b\to j}(x_j), \qquad \hat\nu^{t}_{j\to b}(x_j) \to \hat\mu_{j\to b}(x_j) \qquad \text{as } t\to\infty.
\]

Why BP and not variable elimination?
- On trees: no good reason
- On loopy graphs: BP can still be applied easily

Lemma: On a tree graph, BP converges to the marginal distributions in a finite number of iterations, i.e., the limits displayed above are reached exactly after finitely many steps.

Summary of belief propagation on trees
- Equivalent to variable elimination
- Exact on trees
- Converges in a finite number of iterations: twice the maximum depth of the tree

Loopy belief propagation

Apply belief propagation to a loopy graph, using the same update equations.
- Wait until the algorithm converges
- Announce $\prod_{a\in\partial j} \nu^{\infty}_{a\to j}(x_j)$ (after normalization) as $\mu_j(x_j)$

Challenges for loopy BP
- Does not necessarily converge
  o Example: $\mu(x_1, x_2, x_3) = \mathbb{I}(x_1 \neq x_2)\,\mathbb{I}(x_2 \neq x_3)\,\mathbb{I}(x_3 \neq x_1)$
  o $t = 1$: start with $x_1 = 0$
  o $t = 2 \Rightarrow x_2 = 1$
  o $t = 3 \Rightarrow x_3 = 0$
  o $t = 4 \Rightarrow x_1 = 1$, and the beliefs keep oscillating
- Even if it converges, the result is not necessarily the marginal distribution
- Can have more than one fixed point
[Figure: triangle graph on variables 1, 2, 3]

Advantages of loopy BP

In many applications it works really well:
- Coding theory
- Machine vision
- Compressed sensing
- ...
In some cases, loopy BP can be analyzed theoretically.

Some theoretical results on loopy BP
- The BP equations have at least one fixed point
  o proof uses Brouwer's fixed-point theorem
  o BP does not necessarily converge to any of the fixed points
- BP is exact on trees
- BP is "accurate" for "locally tree-like" graphs
  o Gallager 1963, Luby et al. 2001, Richardson and Urbanke 2001
- Several methods exist for verifying the correctness of BP
  o computation trees
  o Dobrushin condition
  o ...

Locally tree-like graphs

BP is "accurate" for "locally tree-like" graphs.
- Heuristic: sparse graphs (without many edges) can be "locally tree-like"
How to check?
- If girth(G) is large, of order $\log(n)$ (girth(G) is the length of the shortest cycle)
- A random $(d_v, d_c)$ graph is locally tree-like with high probability

Dobrushin condition

A sufficient condition for convergence of BP. Influence of node $j$ on node $i$:
\[
C_{ij} = \sup_{x, x'} \Big\{ \big\| \mu(x_i \mid x_{V\setminus i}) - \mu(x_i \mid x'_{V\setminus i}) \big\|_{TV} \ : \ x_\ell = x'_\ell \ \forall \ell \neq j \Big\}
\]
Intuition: if the influence of the nodes on each other is very small, then oscillations should not occur.

Theorem. If $\gamma = \sup_i \sum_j C_{ij} < 1$, then the BP marginals converge to the true marginals.

Summary of BP
- Same update equations as above
- Exact on trees
- Accurate on locally tree-like graphs
- No generic proof technique that works on all problems
- Successful in many applications, even when we cannot prove convergence
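To illustrate the "accurate but not exact" behavior on a loopy graph, here is a minimal sketch, not from the slides, of loopy BP on a single cycle of four binary variables with pairwise factors; the factor tables, the parallel schedule, and the fixed number of iterations are arbitrary choices, and the BP belief at one node is compared against the brute-force marginal.

```python
import numpy as np
import itertools

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]               # one pairwise factor per edge
psi = {e: np.array([[2.0, 1.0], [1.0, 2.0]]) for e in edges}
psi[(0, 1)] = np.array([[3.0, 1.0], [1.0, 2.0]])       # break the symmetry a little

# variable -> factor messages, one per (edge, endpoint); start uniform
m = {(e, i): np.ones(2) for e in edges for i in e}

def factor_to_var(e, i, msgs):
    """Sum-product message from edge-factor e to variable i."""
    j = e[0] if e[1] == i else e[1]
    table = psi[e] if e[0] == i else psi[e].T          # orient table as (x_i, x_j)
    out = table @ msgs[(e, j)]
    return out / out.sum()

for t in range(50):                                    # iterate and hope for convergence
    new_m = {}
    for e in edges:
        for i in e:
            # variable i -> factor e: product of messages from i's other factors
            prod = np.ones(2)
            for e2 in edges:
                if i in e2 and e2 != e:
                    prod *= factor_to_var(e2, i, m)
            new_m[(e, i)] = prod / prod.sum()
    m = new_m

# BP belief at variable 0 versus the exact marginal by brute force
belief = np.ones(2)
for e in edges:
    if 0 in e:
        belief *= factor_to_var(e, 0, m)
belief /= belief.sum()

exact = np.zeros(2)
for x in itertools.product([0, 1], repeat=4):
    exact[x[0]] += np.prod([psi[e][x[e[0]], x[e[1]]] for e in edges])
exact /= exact.sum()
print("BP belief:", belief, "exact marginal:", exact)
```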
More challenges

What if $\mu$ is a pdf on $\mathbb{R}^n$? The extension seems straightforward:
\[
\nu^{t+1}_{a\to j}(x_j) = \int \psi_a(x_{\partial a}) \prod_{\ell\in\partial a\setminus j} \hat\nu^{t}_{\ell\to a}(x_\ell)\, dx_{\partial a\setminus j},
\qquad
\hat\nu^{t+1}_{j\to a}(x_j) = \prod_{b\in\partial j\setminus a} \nu^{t}_{b\to j}(x_j).
\]
But the calculations become complicated:
- Must discretize the pdf to a certain accuracy
- Lots of numerical multiple integrations

Exception: Gaussian graphical models
- $\mu(x) = \frac{\exp(-x^T J x + c^T x)}{Z}$
- All messages remain Gaussian
- NOT the focus of this talk

Approximate message passing

Model
\[ y = A x_o + w \]
- $x_o$: $k$-sparse vector in $\mathbb{R}^N$
- $A$: $n \times N$ design matrix
- $y$: measurement vector in $\mathbb{R}^n$
- $w$: measurement noise in $\mathbb{R}^n$
[Figure: $y = A x_o + w$ with a $k$-sparse signal $x_o$]

LASSO

Many useful heuristic algorithms:
- Noiseless: $\ell_1$-minimization
\[ \min_x \ \|x\|_1 \quad \text{subject to } y = Ax, \qquad \|x\|_1 = \sum_i |x_i| \]
- Noisy: LASSO
\[ \min_x \ \frac{1}{2}\|y - Ax\|_2^2 + \lambda \|x\|_1 \]
- Both are convex optimization problems

Chen, Donoho, Saunders (96), Tibshirani (96)

Algorithmic challenges of sparse recovery

Use convex optimization tools to solve the LASSO. Computational complexity:
- Interior point methods:
  o generic tools
  o appropriate for $N < 5000$
- Homotopy methods:
  o use the structure of the LASSO
  o appropriate for $N < 50000$
- First-order methods (see the sketch below):
  o low computational complexity per iteration
  o require many iterations
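As a concrete instance of a first-order method, here is a minimal sketch, not from the slides, of iterative soft thresholding (IST) for the LASSO; the problem sizes, step size, and $\lambda$ are arbitrary illustrative choices, and `soft()` is the entrywise soft-thresholding operator.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, k = 100, 400, 10
A = rng.normal(size=(n, N)) / np.sqrt(n)             # columns roughly unit norm
x0 = np.zeros(N); x0[rng.choice(N, k, replace=False)] = rng.normal(size=k)
y = A @ x0 + 0.01 * rng.normal(size=n)

lam = 0.1
step = 1.0 / np.linalg.norm(A, 2) ** 2               # 1 / ||A||^2 guarantees descent

def soft(u, thresh):
    return np.sign(u) * np.maximum(np.abs(u) - thresh, 0.0)

x = np.zeros(N)
for t in range(500):                                 # many cheap iterations
    x = soft(x + step * A.T @ (y - A @ x), step * lam)

print("LASSO objective:", 0.5 * np.sum((y - A @ x) ** 2) + lam * np.sum(np.abs(x)))
```

Each iteration costs only two matrix-vector products, which is what makes first-order methods attractive at large $N$; the price is the number of iterations.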
Message passing for LASSO

Define
\[
\mu(dx) = \frac{1}{Z}\, e^{-\beta\lambda \|x\|_1 - \frac{\beta}{2}\|y - Ax\|_2^2}\, dx
\]
- $Z$: normalization constant
- $\beta > 0$
As $\beta \to \infty$:
- $\mu$ concentrates around the solution of the LASSO, $\hat x_\lambda$
Marginalize $\mu$ to obtain $\hat x_\lambda$.

Message passing algorithm

\[
\mu(dx) = \frac{1}{Z}\, e^{-\beta\lambda \sum_i |x_i| - \frac{\beta}{2}\sum_a (y_a - (Ax)_a)^2}\, dx
\]
Belief propagation update rules:
\[
\nu^{t+1}_{i\to a}(s_i) \cong e^{-\beta\lambda|s_i|} \prod_{b\neq a} \hat\nu^{t}_{b\to i}(s_i),
\qquad
\hat\nu^{t}_{a\to i}(s_i) \cong \int e^{-\frac{\beta}{2}(y_a - (As)_a)^2} \prod_{j\neq i} d\nu^{t}_{j\to a}(s_j).
\]
Pearl (82)
(Note the convention from here on: $\nu_{i\to a}$ is the variable-to-factor message and $\hat\nu_{a\to i}$ the factor-to-variable message.)

Issues of the message passing algorithm

Challenges:
- Messages are distributions over the real line
- Calculation of messages is difficult
  o numerical integration
  o the number of variables in the integral is $N - 1$
- The graph is not "tree-like"
  o no good analysis tools

Solution: the blessing of dimensionality
- As $N \to \infty$: $\hat\nu^{t}_{a\to i}(s_i)$ converges to a Gaussian
- As $N \to \infty$: $\nu^{t}_{i\to a}(s_i)$ converges to
\[
f_\beta(s; x, b) \equiv \frac{1}{z_\beta(x, b)}\, e^{-\beta|s| - \frac{\beta}{2b}(s - x)^2}.
\]

Simplifying messages for large N

Simplification of $\hat\nu^{t}_{a\to i}(s_i)$:
\[
\hat\nu^{t}_{a\to i}(s_i) \cong \int e^{-\frac{\beta}{2}(y_a - (As)_a)^2} \prod_{j\neq i} d\nu^{t}_{j\to a}(s_j)
\cong \mathbb{E}\exp\Big(-\frac{\beta}{2}\big(y_a - A_{ai}s_i - \textstyle\sum_{j\neq i} A_{aj}s_j\big)^2\Big)
\cong \mathbb{E}\exp\Big(-\frac{\beta}{2}(A_{ai}s_i - Z)^2\Big),
\]
where $Z = y_a - \sum_{j\neq i} A_{aj}s_j \xrightarrow{d} W$ with $W$ Gaussian, and
\[
\big|\mathbb{E}\, h_{s_i}(Z) - \mathbb{E}\, h_{s_i}(W)\big| \le \frac{C'''_t}{N^{1/2}\,(\hat\tau^{t}_{a\to i})^{3/2}}.
\]

Message passing

As $N \to \infty$, $\hat\nu^{t}_{a\to i}$ converges to a Gaussian distribution.

Theorem. Let $x^{t}_{j\to a}$ and $\tau^{t}_{j\to a}/\beta$ denote the mean and variance of the distribution $\nu^{t}_{j\to a}$. Then there exists a constant $C'_t$ such that
\[
\sup_{s_i}\big|\hat\nu^{t}_{a\to i}(s_i) - \hat\phi^{t}_{a\to i}(s_i)\big| \le \frac{C'_t}{N\,(\hat\tau^{t}_{a\to i})^{3}},
\qquad
\hat\phi^{t}_{a\to i}(ds_i) \equiv \sqrt{\frac{\beta A_{ai}^2}{2\pi\hat\tau^{t}_{a\to i}}}\; e^{-\beta(A_{ai}s_i - z^{t}_{a\to i})^2/2\hat\tau^{t}_{a\to i}}\, ds_i,
\]
where
\[
z^{t}_{a\to i} \equiv y_a - \sum_{j\neq i} A_{aj} x^{t}_{j\to a},
\qquad
\hat\tau^{t}_{a\to i} \equiv \sum_{j\neq i} A_{aj}^2 \tau^{t}_{j\to a}.
\]
Donoho, M., Montanari (11)

Shape of $\nu^{t}_{i\to a}$: intuition

Summary:
- $\hat\nu^{t}_{b\to i}(s_i) \approx$ Gaussian
- $\nu^{t+1}_{i\to a}(s_i) \cong e^{-\beta\lambda|s_i|} \prod_{b\neq a} \hat\nu^{t}_{b\to i}(s_i)$

Shape of $\nu^{t+1}_{i\to a}(s_i)$:
- $\nu^{t+1}_{i\to a}(s_i) \approx \frac{1}{z_\beta(x, b)}\, e^{-\beta|s| - \frac{\beta}{2b}(s - x)^2}$ for appropriate $x$ and $b$

Message passing (cont'd)

Theorem. Suppose that $\hat\nu^{t}_{a\to i} = \hat\phi^{t}_{a\to i}$, with mean $z^{t}_{a\to i}$ and variance $\hat\tau^{t}_{a\to i} = \hat\tau^{t}$. Then at the next iteration we have
\[
\big|\nu^{t+1}_{i\to a}(s_i) - \phi^{t+1}_{i\to a}(s_i)\big| < C/n,
\qquad
\phi^{t+1}_{i\to a}(s_i) \equiv \lambda\, f_\beta\Big(\lambda s_i;\ \lambda\sum_{b\neq a} A_{bi} z^{t}_{b\to i},\ \lambda^2(1 + \hat\tau^{t})\Big),
\]
with $f_\beta(s; x, b) \equiv \frac{1}{z_\beta(x,b)}\, e^{-\beta|s| - \frac{\beta}{2b}(s-x)^2}$ as before.
Donoho, M., Montanari (11)

Next asymptotic: $\beta \to \infty$

Reminder:
- $x^{t}_{i\to a}$: mean of $\nu^{t}_{i\to a}$
- $z^{t}_{a\to i}$: mean of $\hat\nu^{t}_{a\to i}$
- $z^{t}_{a\to i} \equiv y_a - \sum_{j\neq i} A_{aj} x^{t}_{j\to a}$

Can we write $x^{t+1}_{i\to a}$ in terms of the $z^{t}_{b\to i}$?
\[
x^{t+1}_{i\to a} = \frac{1}{z_\beta}\int s_i\, e^{-\beta|s_i| - \beta\big(s_i - \sum_{b\neq a} A_{bi} z^{t}_{b\to i}\big)^2}\, ds_i
\]
- Can be done, but ...
- The Laplace method simplifies it as $\beta \to \infty$:
\[
x^{t+1}_{i\to a} = \eta\Big(\sum_{b\neq a} A_{bi} z^{t}_{b\to i}\Big),
\]
where $\eta(\cdot)$ is the soft-thresholding function.

Summary
- $x^{t}_{i\to a}$: mean of $\nu^{t}_{i\to a}$; $z^{t}_{a\to i}$: mean of $\hat\nu^{t}_{a\to i}$
\[
z^{t}_{a\to i} = y_a - \sum_{j\neq i} A_{aj} x^{t}_{j\to a},
\qquad
x^{t+1}_{i\to a} = \eta\Big(\sum_{b\neq a} A_{bi} z^{t}_{b\to i}\Big).
\]

Comparison of MP and IST

MP (one message per edge):
\[
z^{t}_{a\to i} = y_a - \sum_{j\neq i} A_{aj} x^{t}_{j\to a},
\qquad
x^{t+1}_{i\to a} = \eta\Big(\sum_{b\neq a} A_{bi} z^{t}_{b\to i}\Big).
\]
IST (one value per node, cf. the next slides):
\[
z^{t}_{a} = y_a - \sum_{j} A_{aj} x^{t}_{j},
\qquad
x^{t+1}_{i} = \eta\Big(x^{t}_{i} + \sum_{b} A_{bi} z^{t}_{b}\Big).
\]
[Figure: bipartite graphs contrasting MP's edge messages $x_{i\to a}$, $z_{a\to i}$ with IST's per-node quantities $x_i$, $z_a$]
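To make the bookkeeping concrete, here is a minimal sketch, not from the slides, of the edge-message iteration in the summary above, with illustrative problem sizes and a fixed threshold $\lambda$ in place of an adaptive one; it stores full $N \times n$ and $n \times N$ arrays of messages, which is exactly the overhead that AMP removes next.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N, lam = 100, 200, 0.3
A = rng.normal(size=(n, N)) / np.sqrt(n)          # unit-norm columns on average
x_true = np.zeros(N); x_true[:5] = 1.0
y = A @ x_true + 0.01 * rng.normal(size=n)

def eta(u, thresh):                               # soft-thresholding nonlinearity
    return np.sign(u) * np.maximum(np.abs(u) - thresh, 0.0)

X = np.zeros((N, n))                              # X[i, a] = x_{i->a}
for t in range(30):
    # z_{a->i} = y_a - sum_{j != i} A_{aj} x_{j->a}
    r = (A * X.T).sum(axis=1)                     # r[a] = sum_j A_{aj} x_{j->a}
    Z = (y - r)[:, None] + A * X.T                # add the j = i term back per column
    # x_{i->a} = eta( sum_{b != a} A_{bi} z_{b->i} ; lam )
    S = (A * Z).sum(axis=0)                       # S[i] = sum_b A_{bi} z_{b->i}
    X = eta(S[:, None] - (A * Z).T, lam)

x_hat = eta(S, lam)                               # node estimate uses all factors b
print("relative error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```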
Approximate message passing

We can further simplify the messages:
\[
x^{t+1} = \eta\big(x^{t} + A^{T} z^{t};\ \lambda_t\big),
\qquad
z^{t} = y - A x^{t} + \frac{\|x^{t}\|_0}{n}\, z^{t-1}.
\]

IST versus AMP

IST:
\[
x^{t+1} = \eta(x^{t} + A^{T} z^{t};\ \lambda^{t}), \qquad z^{t} = y - A x^{t}
\]
AMP:
\[
x^{t+1} = \eta(x^{t} + A^{T} z^{t};\ \lambda^{t}), \qquad z^{t} = y - A x^{t} + \frac{1}{n}\|x^{t}\|_0\, z^{t-1}
\]
[Figure: mean square error versus iteration for IST and AMP; state evolution prediction versus empirical results]

Theoretical guarantees

We can predict the performance of AMP at every iteration in the asymptotic regime.

Idea:
- $x^{t} + A^{T} z^{t}$ can be modeled as signal plus Gaussian noise
- We keep track of the noise variance
- I will discuss this part in the "Risk Seminar"
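Below is a minimal sketch, not from the slides, of the AMP iteration above with illustrative sizes and noise level. The threshold is taken as $\alpha\,\tau_t$ with $\tau_t = \|z^t\|/\sqrt{n}$, a common practical stand-in for the slides' $\lambda_t$; $\tau_t$ is also the usual empirical proxy for the effective noise variance that state evolution tracks.

```python
import numpy as np

rng = np.random.default_rng(2)
n, N, k, alpha = 250, 500, 25, 2.0
A = rng.normal(size=(n, N)) / np.sqrt(n)           # unit-norm columns on average
x0 = np.zeros(N); x0[rng.choice(N, k, replace=False)] = rng.normal(size=k)
y = A @ x0 + 0.01 * rng.normal(size=n)

def eta(u, t):                                     # soft thresholding
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

x, z = np.zeros(N), np.zeros(n)
for t in range(30):
    # z^t = y - A x^t + (||x^t||_0 / n) z^{t-1}   (the Onsager correction term)
    z = y - A @ x + (np.count_nonzero(x) / n) * z
    tau = np.linalg.norm(z) / np.sqrt(n)           # effective noise level estimate
    # x^{t+1} = eta(x^t + A^T z^t ; lambda_t), here lambda_t = alpha * tau
    x = eta(x + A.T @ z, alpha * tau)
    if t % 5 == 0:
        print(f"iter {t:2d}  MSE {np.mean((x - x0) ** 2):.4f}  tau {tau:.4f}")
# Dropping the Onsager term (z = y - A @ x) turns this loop into plain IST.
```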