Message passing and approximate message passing

Arian Maleki
Columbia University
What is the problem?
Given a pdf µ(x1, x2, . . . , xn) we are interested in
- arg max_{x1,...,xn} µ(x1, x2, . . . , xn)
- Calculating the marginal pdf µi(xi)
- Calculating the joint distribution µS(xS) for S ⊂ {1, 2, . . . , n}
These problems are important because
- Many inference problems are in these forms
- Many applications in image processing, communication, machine learning, signal processing
This talk: marginalizing the distribution
- Most of the tools carry over to the other problems
This talk: marginalization
Problem: Given a pdf µ(x1, x2, . . . , xn), find µi(xi)
- Simplification: all the xj's are binary random variables
Is it difficult?
- Yes. No polynomial-time algorithm is known in general
- Complexity of variable elimination: 2^{n−1}
Solution:
- No "good solution" for a "generic µ"
- Consider structured µ
Structured µ
Problem: Given a pdf µ(x1, x2, . . . , xn), find µi(xi)
- Simplification: all the xj's are binary random variables
Example 1: x1, x2, . . . , xn are independent
- Easy marginalization
- Not interesting
Example 2: independent subsets
    µ(x1, x2, . . . , xn) = µ′1(x1, . . . , xk) µ′2(xk+1, . . . , xn)
- More interesting and less easy
- Complexity still exponential in the size of the subsets
- Does not cover many interesting cases
Structured µ: Cont'd
Problem: Given a pdf µ(x1, x2, . . . , xn), find µi(xi)
- Simplification: all the xj's are binary random variables
Example 3: a more interesting structure:
    µ(x1, x2, . . . , xn) = µ′1(xS1) µ′2(xS2) · · · µ′ℓ(xSℓ)
- Unlike the independent case, Si ∩ Sj ≠ ∅ is allowed
- Covers many interesting practical problems
- Let |Si| ≪ n, ∀i. Is marginalization easier than for a generic µ?
  - Not clear! We introduce the factor graph to address this question
Factor graph
Slight generalization:
    µ(x1, x2, . . . , xn) = (1/Z) ∏_{a∈F} ψa(xSa)
- ψa : {0, 1}^{|Sa|} → R+, not necessarily a pdf
- Z: normalizing constant, called the "partition function"
- ψa(xSa): factors
Factor graph:
- Variable nodes: n nodes corresponding to x1, x2, . . . , xn
- Factor nodes: |F| nodes corresponding to ψa(xSa), ψb(xSb), . . .
- Edge between factor node a and variable node i iff xi ∈ Sa
[Figure: bipartite graph with variable nodes 1, 2, . . . , n on one side and factor nodes a, b, . . . on the other]
Factor graph: Cont'd
    µ(x1, x2, . . . , xn) = (1/Z) ∏_{a∈F} ψa(xSa)
Factor graph:
- Variable nodes: n nodes corresponding to x1, x2, . . . , xn
- Factor nodes: |F| nodes corresponding to ψa(xSa), ψb(xSb), . . .
- Edge between factor node a and variable node i iff xi ∈ Sa
Some notation:
- ∂a: neighbors of factor node a: ∂a = {i : xi ∈ Sa}
- ∂i: neighbors of variable node i: ∂i = {b ∈ F : i ∈ Sb}
[Figure: bipartite factor graph with variable nodes 1, 2, . . . , n and factor nodes a, b, . . .]
Factor graph example
    µ(x1, x2, x3) = (1/Z) ψa(x1, x2) ψb(x2, x3)
Some notation:
- ∂a: neighbors of factor node a: ∂a = {1, 2}, ∂b = {2, 3}
- ∂i: neighbors of variable node i: ∂1 = {a}, ∂2 = {a, b}
[Figure: factor graph with variable nodes 1, 2, 3 and factor nodes a, b]
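As a concrete illustration, here is a minimal sketch (Python/numpy; the dictionary layout and random factor tables are assumptions made for illustration) of how the factor graph of this example can be stored as neighbor sets:

```python
# Minimal sketch: factor graph for µ(x1,x2,x3) = (1/Z) ψa(x1,x2) ψb(x2,x3).
# Factors are stored as tables over their local (binary) variables.
import numpy as np

variables = [1, 2, 3]
factors = {
    "a": {"scope": (1, 2), "table": np.random.rand(2, 2)},  # ψa(x1, x2)
    "b": {"scope": (2, 3), "table": np.random.rand(2, 2)},  # ψb(x2, x3)
}

# Neighborhoods ∂a (variables touched by factor a) and ∂i (factors touching variable i)
factor_neighbors = {a: set(f["scope"]) for a, f in factors.items()}
variable_neighbors = {i: {a for a, f in factors.items() if i in f["scope"]}
                      for i in variables}

print(factor_neighbors)    # {'a': {1, 2}, 'b': {2, 3}}
print(variable_neighbors)  # {1: {'a'}, 2: {'a', 'b'}, 3: {'b'}}
```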
Why factor graph?
Factor graphs simplify understanding the structure:
- Independence: the graph has disconnected pieces
- Tree graphs: easy marginalization
- Loops (especially short loops): generally make the problem more complicated
[Figure: bipartite factor graph with variable nodes 1, 2, . . . , n and factor nodes a, b, . . .]
Marginalization on tree-structured graphs
Tree: graph with no loops
- Marginalization by variable elimination is efficient on such graphs (distributions)
  - Linear in n
  - Exponential in the size of the factors
- Proof: on the board
Summary of variable elimination on trees
Notation:
- µ(x1, x2, . . . , xn) = ∏_{a∈F} ψa(x∂a)
- µa→j : message (belief) from factor node a to variable node j
- µ̂j→a : message (belief) from variable node j to factor node a
Variable elimination on a tree gives
    µa→j(xj) = Σ_{x∂a\j} ψa(x∂a) ∏_{ℓ∈∂a\j} µ̂ℓ→a(xℓ)
    µ̂j→a(xj) = ∏_{b∈∂j\a} µb→j(xj)
Belief propagation algorithm on trees
Based on variable elimination on trees, consider the iterative algorithm:
    ν^{t+1}_{a→j}(xj) = Σ_{x∂a\j} ψa(x∂a) ∏_{ℓ∈∂a\j} ν̂^t_{ℓ→a}(xℓ)
    ν̂^{t+1}_{j→a}(xj) = ∏_{b∈∂j\a} ν^t_{b→j}(xj)
- ν^t_{a→j} and ν̂^t_{j→a} are the beliefs of the nodes at time t
- Each node propagates its belief to the other nodes
Hope: convergence to a "good" fixed point
    ν^t_{b→j}(xj) → µb→j(xj) as t → ∞
    ν̂^t_{j→b}(xj) → µ̂j→b(xj) as t → ∞
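A minimal sum-product sketch of these updates for binary variables (Python/numpy; the small two-factor example, the parallel update schedule, and message normalization are assumptions made for illustration):

```python
import numpy as np

# Factors: name -> (scope of variable ids, table over {0,1}^{|scope|})
factors = {
    "a": ((1, 2), np.array([[1.0, 2.0], [2.0, 1.0]])),  # ψa(x1, x2)
    "b": ((2, 3), np.array([[1.0, 3.0], [3.0, 1.0]])),  # ψb(x2, x3)
}
variables = [1, 2, 3]
var_nbrs = {i: [a for a, (scope, _) in factors.items() if i in scope] for i in variables}

# Messages: nu[a,i] factor->variable, nu_hat[i,a] variable->factor (length-2 vectors)
nu = {(a, i): np.ones(2) / 2 for a, (scope, _) in factors.items() for i in scope}
nu_hat = {(i, a): np.ones(2) / 2 for (a, i) in nu}

for t in range(20):
    new_nu = {}
    for a, (scope, table) in factors.items():
        for j in scope:
            msg = np.zeros(2)
            # Sum ψa over all variables in ∂a except j, weighted by the incoming ν̂
            for idx in np.ndindex(*table.shape):
                assignment = dict(zip(scope, idx))
                w = table[idx] * np.prod([nu_hat[(l, a)][assignment[l]]
                                          for l in scope if l != j])
                msg[assignment[j]] += w
            new_nu[(a, j)] = msg / msg.sum()
    nu = new_nu
    nu_hat = {}
    for i in variables:
        for a in var_nbrs[i]:
            others = [nu[(b, i)] for b in var_nbrs[i] if b != a]
            msg = np.prod(others, axis=0) if others else np.ones(2)
            nu_hat[(i, a)] = msg / msg.sum()

# Approximate marginal of each variable: product of incoming factor messages
for i in variables:
    belief = np.prod([nu[(a, i)] for a in var_nbrs[i]], axis=0)
    print(i, belief / belief.sum())
```

Since the example graph is a chain (a tree), these beliefs match the exact marginals after a few iterations.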
Belief propagation algorithm on trees
Belief propagation:
    ν^{t+1}_{a→j}(xj) = Σ_{x∂a\j} ψa(x∂a) ∏_{ℓ∈∂a\j} ν̂^t_{ℓ→a}(xℓ)
    ν̂^{t+1}_{j→a}(xj) = ∏_{b∈∂j\a} ν^t_{b→j}(xj)
Why BP and not variable elimination?
- On trees: no good reason
- Loopy graphs: BP can be easily applied
Belief propagation algorithm on trees
Belief propagation:
    ν^{t+1}_{a→j}(xj) = Σ_{x∂a\j} ψa(x∂a) ∏_{ℓ∈∂a\j} ν̂^t_{ℓ→a}(xℓ)
    ν̂^{t+1}_{j→a}(xj) = ∏_{b∈∂j\a} ν^t_{b→j}(xj)
Lemma: On a tree graph, BP converges to the marginal distributions in a finite number of iterations, i.e.,
    ν^t_{b→j}(xj) → µb→j(xj) as t → ∞
    ν̂^t_{j→b}(xj) → µ̂j→b(xj) as t → ∞
Summary of belief propagation on trees
- Equivalent to variable elimination
- Exact on trees
- Converges in a finite number of iterations:
  - 2 times the maximum depth of the tree
Loopy belief propagation
Apply belief propagation to a loopy graph:
    ν^{t+1}_{a→j}(xj) = Σ_{x∂a\j} ψa(x∂a) ∏_{ℓ∈∂a\j} ν̂^t_{ℓ→a}(xℓ)
    ν̂^{t+1}_{j→a}(xj) = ∏_{b∈∂j\a} ν^t_{b→j}(xj)
Wait until the algorithm converges:
- Announce ∏_{a∈∂j} ν^∞_{a→j}(xj) as µj(xj)
Challenges for loopy BP
Does not necessarily converge
- Example (see the numerical sketch below): µ(x1, x2, x3) = I(x1 ≠ x2) I(x2 ≠ x3) I(x3 ≠ x1)
  - t = 1: start with x1 = 0
  - t = 2 ⇒ x2 = 1
  - t = 3 ⇒ x3 = 0
  - t = 4 ⇒ x1 = 1, contradicting the start; the beliefs keep oscillating around the cycle
Even if it converges, it does not necessarily converge to the marginal distribution
Can have more than one fixed point
[Figure: triangle factor graph on variables 1, 2, 3]
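A small numerical illustration of this oscillation, written with pairwise messages on the 3-cycle with ψ(xi, xj) = I(xi ≠ xj); the biased initialization and the parallel update schedule are assumptions made for the demo (Python/numpy sketch):

```python
import numpy as np

# Pairwise BP on the cycle 1-2-3-1 with ψ(x_i, x_j) = I(x_i ≠ x_j).
# m[(i, j)] is the belief variable i sends to its neighbor j (length-2 vector).
psi = np.array([[0.0, 1.0], [1.0, 0.0]])                 # disagreement constraint
nbrs = {1: [2, 3], 2: [1, 3], 3: [1, 2]}
m = {(i, j): np.array([0.9, 0.1]) for i in nbrs for j in nbrs[i]}  # biased start

for t in range(6):
    new_m = {}
    for (i, j) in m:
        k = [v for v in nbrs[i] if v != j][0]            # the other neighbor of i
        msg = psi.T @ m[(k, i)]                          # sum_{x_i} ψ(x_i, x_j) m_{k→i}(x_i)
        new_m[(i, j)] = msg / msg.sum()
    m = new_m
    print(t, np.round(m[(1, 2)], 2))  # the message flips every iteration: no convergence
```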
Advantages of loopy BP
In many applications it works really well:
- Coding theory
- Machine vision
- Compressed sensing
- ...
In some cases, loopy BP can be analyzed theoretically
Some theoretical results on loopy BP
BP equations have at least one fixed point
- Proof uses Brouwer's fixed-point theorem
- BP does not necessarily converge to any of the fixed points
BP is exact on trees
BP is "accurate" for "locally tree-like" graphs
  (Gallager 1963, Luby et al. 2001, Richardson and Urbanke 2001)
Several methods exist for verifying the correctness of BP:
- Computation trees
- Dobrushin condition
- ...
Locally tree-like graphs
BP is "accurate" for "locally tree-like" graphs
- Heuristic: sparse graphs (without many edges) can be "locally tree-like"
How to check?
- If girth(G) is on the order of log(n)
  (girth(G) is the length of the shortest cycle)
- A random (dv, dc) graph is locally tree-like with high probability
Dobrushin condition
A sufficient condition for convergence of BP.
Influence of node j on node i:
    Cij = sup_{x,x′} { ‖µ(xi | xV\{i,j}) − µ(xi | x′V\{i,j})‖_TV : xl = x′l ∀l ≠ j }
Intuition: if the influence of the nodes on each other is very small, then oscillations should not occur.
Theorem: If γ = sup_i (Σ_j Cij) < 1, then the BP marginals converge to the true marginals.
Summary of BP
Belief propagation:
    ν^{t+1}_{a→j}(xj) = Σ_{x∂a\j} ψa(x∂a) ∏_{ℓ∈∂a\j} ν̂^t_{ℓ→a}(xℓ)
    ν̂^{t+1}_{j→a}(xj) = ∏_{b∈∂j\a} ν^t_{b→j}(xj)
- Exact on trees
- Accurate on locally tree-like graphs
- No generic proof technique that works on all problems
- Successful in many applications, even when we cannot prove convergence
More challenges
What if µ is a pdf on R^n?
Extension seems straightforward:
    ν^{t+1}_{a→j}(xj) = ∫ ψa(x∂a) ∏_{ℓ∈∂a\j} ν̂^t_{ℓ→a}(xℓ) dx∂a\j
    ν̂^{t+1}_{j→a}(xj) = ∏_{b∈∂j\a} ν^t_{b→j}(xj)
But the calculations become complicated:
- Must discretize the pdf to a certain accuracy
- Lots of numerical multiple integrations
Exception: Gaussian graphical models
- µ(x) = exp(−x^T J x + c^T x) / Z
- All messages become Gaussian
- NOT the focus of this talk
Approximate message passing
Model
    y = A xo + w
- xo: k-sparse signal in R^N
- A: n × N design matrix
- y: measurement vector in R^n
- w: measurement noise in R^n
[Figure: y = A xo + w, with A an n × N matrix and xo a k-sparse vector]
LASSO
Many useful heuristic algorithms:
- Noiseless: ℓ1-minimization
      minimize_x ‖x‖_1  subject to  y = Ax,   where ‖x‖_1 = Σ_i |xi|
- Noisy: LASSO
      minimize_x (1/2)‖y − Ax‖_2^2 + λ‖x‖_1
Both are convex optimization problems.
Chen, Donoho, Saunders (96), Tibshirani (96)
Algorithmic challenges of sparse recovery
Use convex optimization tools to solve the LASSO.
Computational complexity:
- Interior-point methods:
  - Generic tools
  - Appropriate for N < 5000
- Homotopy methods:
  - Use the structure of the LASSO
  - Appropriate for N < 50000
- First-order methods (see the sketch below):
  - Low computational complexity per iteration
  - Require many iterations
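As a point of reference for the first-order methods mentioned above, a minimal iterative soft-thresholding (ISTA) sketch for the LASSO; the step size 1/‖A‖², the iteration count, and the synthetic problem sizes are assumptions (Python/numpy):

```python
import numpy as np

def soft_threshold(u, theta):
    """η(u; θ) = sign(u) · max(|u| − θ, 0), applied elementwise."""
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def ista(A, y, lam, iters=500):
    """Iterative soft-thresholding for min_x 0.5·||y − Ax||_2^2 + λ·||x||_1."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the smooth part
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - y)           # gradient of 0.5·||y − Ax||_2^2
        x = soft_threshold(x - grad / L, lam / L)
    return x

# Tiny synthetic example
rng = np.random.default_rng(0)
n, N, k = 50, 200, 5
A = rng.standard_normal((n, N)) / np.sqrt(n)
x0 = np.zeros(N); x0[rng.choice(N, k, replace=False)] = rng.standard_normal(k)
y = A @ x0 + 0.01 * rng.standard_normal(n)
x_hat = ista(A, y, lam=0.05)
```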
Message passing for LASSO
Define:
    µ(dx) = (1/Z) exp(−βλ‖x‖_1 − (β/2)‖y − Ax‖_2^2) dx
- Z: normalization constant
- β > 0
As β → ∞:
- µ concentrates around the solution of the LASSO, x̂λ
Marginalize µ to obtain x̂λ
Message passing algorithm
    µ(dx) = (1/Z) exp(−βλ Σ_i |xi| − (β/2) Σ_a (ya − (Ax)a)^2) dx
Belief propagation update rules:
    ν^{t+1}_{i→a}(si) ≅ e^{−βλ|si|} ∏_{b≠a} ν̂^t_{b→i}(si)
    ν̂^t_{a→i}(si) ≅ ∫ e^{−(β/2)(ya − (As)a)^2} ∏_{j≠i} dν^t_{j→a}(sj)
Pearl (82)
Issues of message passing algorithm
    ν^{t+1}_{i→a}(si) ≅ e^{−βλ|si|} ∏_{b≠a} ν̂^t_{b→i}(si)
    ν̂^t_{a→i}(si) ≅ ∫ e^{−(β/2)(ya − (As)a)^2} ∏_{j≠i} dν^t_{j→a}(sj)
Challenges:
- Messages are distributions over the real line
- Calculation of the messages is difficult
  - Numerical integration
  - The number of variables in the integral is N − 1
- The graph is not "tree-like"
  - No good analysis tools
Challenges for message passing algorithm
    ν^{t+1}_{i→a}(si) ≅ e^{−βλ|si|} ∏_{b≠a} ν̂^t_{b→i}(si)
    ν̂^t_{a→i}(si) ≅ ∫ e^{−(β/2)(ya − (As)a)^2} ∏_{j≠i} dν^t_{j→a}(sj)
Challenges:
- Messages are distributions over the real line
- Calculation of the messages is difficult
- The graph is not "tree-like"
Solution:
- Blessing of dimensionality:
  - As N → ∞, ν̂^t_{a→i}(si) converges to a Gaussian
  - As N → ∞, ν^t_{i→a}(si) converges to
        fβ(s; x, b) ≡ (1/zβ(x, b)) e^{−β|s| − (β/(2b))(s−x)^2}
Simplifying messages for large N
Simplification of ν̂^t_{a→i}(si):
    ν̂^t_{a→i}(si) ≅ ∫ e^{−(β/2)(ya − (As)a)^2} ∏_{j≠i} dν^t_{j→a}(sj)
                 ≅ E exp(−(β/2)(ya − Aai si − Σ_{j≠i} Aaj sj)^2)
                 ≅ E exp(−(β/2)(Aai si − Z)^2),   Z = ya − Σ_{j≠i} Aaj sj
Z converges in distribution to a Gaussian W, and
    |E h_{si}(Z) − E h_{si}(W)| ≤ C_t''' / (N^{1/2} (τ̂^t_{a→i})^{3/2})
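A quick numerical sanity check of the Gaussian approximation for Z = ya − Σ_{j≠i} Aaj sj; the N(0, 1/n) scaling of the row of A and the Laplace draws for the sj are assumptions made for the demo, not part of the derivation (Python/numpy sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, trials = 100, 1000, 20000
y_a = 0.3

# Empirical distribution of Z = y_a − Σ_j A_aj s_j over many redraws of A's row and s
A_row = rng.standard_normal((trials, N)) / np.sqrt(n)    # hypothetical scaling of A
s = rng.laplace(size=(trials, N))                        # any i.i.d. choice for the s_j
Z = y_a - np.einsum("ij,ij->i", A_row, s)

# Compare a few moments with the matching Gaussian
print("mean", Z.mean(), "vs", y_a)
print("std ", Z.std(), "vs", np.sqrt(N / n * s.var()))
print("excess kurtosis", ((Z - Z.mean())**4).mean() / Z.var()**2 - 3)  # ≈ 0 if Gaussian
```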
Message passing
As N → ∞, ν̂^t_{a→i} converges to a Gaussian distribution.
Theorem
Let x^t_{j→a} and τ^t_{j→a}/β denote the mean and variance of the distribution ν^t_{j→a}. Then there exists a constant C'_t such that
    sup_{si} |ν̂^t_{a→i}(si) − φ̂^t_{a→i}(si)| ≤ C'_t / (N (τ̂^t_{a→i})^3),
where
    φ̂^t_{a→i}(dsi) ≡ sqrt(β Aai^2 / (2π τ̂^t_{a→i})) exp(−β(Aai si − z^t_{a→i})^2 / (2 τ̂^t_{a→i})) dsi,
    z^t_{a→i} ≡ ya − Σ_{j≠i} Aaj x^t_{j→a},
    τ̂^t_{a→i} ≡ Σ_{j≠i} Aaj^2 τ^t_{j→a}.
Donoho, M., Montanari (11)
Shape of ν^t_{i→a}: intuition
Summary:
- ν̂^t_{b→i}(si) ≈ Gaussian
- ν^{t+1}_{i→a}(si) ≅ e^{−βλ|si|} ∏_{b≠a} ν̂^t_{b→i}(si)
Shape of ν^{t+1}_{i→a}(si):
- ν^{t+1}_{i→a}(si) ≅ (1/zβ(x, b)) e^{−β|si| − (β/(2b))(si − x)^2} for appropriate x and b
Message passing (cont’d)
Theorem
Suppose that ν̂^t_{a→i} = φ̂^t_{a→i}, with mean z^t_{a→i} and variance τ̂^t_{a→i} = τ̂^t. Then at the next iteration we have
    |ν^{t+1}_{i→a}(si) − φ^{t+1}_{i→a}(si)| < C/n,
    φ^{t+1}_{i→a}(si) ≡ λ fβ(λ si; λ Σ_{b≠a} Abi z^t_{b→i}, λ^2 (1 + τ̂^t)),
and
    fβ(s; x, b) ≡ (1/zβ(x, b)) e^{−β|s| − (β/(2b))(s−x)^2}.
Donoho, M., Montanari (11)
Next asymptotic: β → ∞
Reminder:
- x^t_{j→a}: mean of ν^t_{j→a}
- z^t_{a→i}: mean of ν̂^t_{a→i}
- z^t_{a→i} ≡ ya − Σ_{j≠i} Aaj x^t_{j→a}
Can we write x^{t+1}_{i→a} in terms of the z^t_{b→i}?
    x^{t+1}_{i→a} = (1/zβ) ∫ si e^{−β|si| − β(si − Σ_{b≠a} Abi z^t_{b→i})^2} dsi
- Can be done, but ...
- The Laplace method simplifies it (β → ∞):
    x^{t+1}_{i→a} = η(Σ_{b≠a} Abi z^t_{b→i}),
  where η(·) denotes the soft-thresholding function
Summary
- x^t_{j→a}: mean of ν^t_{j→a}
- z^t_{a→i}: mean of ν̂^t_{a→i}
Message passing iteration:
    z^t_{a→i} = ya − Σ_{j≠i} Aaj x^t_{j→a}
    x^{t+1}_{i→a} = η(Σ_{b≠a} Abi z^t_{b→i})
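A minimal sketch of this per-edge message-passing iteration (Python/numpy; the fixed threshold level `theta`, the zero initialization, and the iteration count are assumptions made for illustration):

```python
import numpy as np

def soft_threshold(u, theta):
    """η(u; θ) = sign(u) · max(|u| − θ, 0), applied elementwise."""
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def mp_lasso(A, y, theta, iters=30):
    """Per-edge message passing:
       z[a,i] = y_a − Σ_{j≠i} A_aj x[j,a],  x[i,a] = η(Σ_{b≠a} A_bi z[b,i]; θ).
       Keeps O(nN) scalar messages per iteration."""
    n, N = A.shape
    x = np.zeros((N, n))                                   # x[i, a]
    z = np.zeros((n, N))                                   # z[a, i]
    for _ in range(iters):
        S = np.einsum("aj,ja->a", A, x)                    # Σ_j A_aj x[j,a]
        z = (y - S)[:, None] + A * x.T                     # add back the j = i term
        T = (A * z).sum(axis=0)                            # Σ_b A_bi z[b,i]
        x = soft_threshold(T[:, None] - (A * z).T, theta)  # drop the b = a term
    return soft_threshold((A * z).sum(axis=0), theta)      # final estimate x̂
```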
Comparison of MP and IST
MP:
    z^t_{a→i} = ya − Σ_{j≠i} Aaj x^t_{j→a}
    x^{t+1}_{i→a} = η(Σ_{b≠a} Abi z^t_{b→i})
IST:
    z^t_a = ya − Σ_j Aaj x^t_j
    x^{t+1}_i = η(Σ_b Abi z^t_b)
[Figure: bipartite graphs relating x1, . . . , xn and y1, . . . , yn under MP (per-edge messages) and IST (per-node quantities)]
Approximate message passing
We can further simplify the messages:
    x^{t+1} = η(x^t + A^T z^t; λ_t)
    z^t = y − A x^t + (‖x^t‖_0 / n) z^{t−1}
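A minimal AMP sketch following these two updates (Python/numpy; the fixed threshold, zero initialization, and iteration count are assumptions):

```python
import numpy as np

def soft_threshold(u, theta):
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def amp(A, y, theta, iters=30):
    """AMP: z^t = y − A x^t + (‖x^t‖_0 / n) z^{t−1}  (Onsager correction),
            x^{t+1} = η(x^t + A^T z^t; θ)."""
    n, N = A.shape
    x = np.zeros(N)
    z = np.zeros(n)
    for _ in range(iters):
        z = y - A @ x + (np.count_nonzero(x) / n) * z   # Onsager term reuses old z
        x = soft_threshold(x + A.T @ z, theta)
    return x

# Setting the Onsager coefficient to zero recovers IST: z = y − A @ x.
```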
IST versus AMP
AMP:
    x^{t+1} = η(x^t + A^T z^t; λ_t)
    z^t = y − A x^t + (1/n)‖x^t‖_0 z^{t−1}
IST:
    x^{t+1} = η(x^t + A^T z^t; λ_t)
    z^t = y − A x^t
[Figure: mean square error vs. iteration for IST and AMP, comparing the empirical curves with the state evolution prediction]
Theoretical guarantees
We can predict the performance of AMP at every iteration in the asymptotic regime.
Idea (see the sketch below):
- x^t + A^T z^t can be modeled as signal plus Gaussian noise
- We keep track of the noise variance
- I will discuss this part in the "Risk Seminar"
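A rough Monte Carlo sketch of this idea, assuming the standard state evolution recursion for AMP with soft thresholding; the noise level σ_w, the undersampling ratio δ = n/N, the signal prior, and the threshold are hypothetical choices (Python/numpy):

```python
import numpy as np

def soft_threshold(u, theta):
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def state_evolution(sigma_w, delta, theta, x_samples, iters=20, mc=100000):
    """Track the effective noise variance tau_t^2 of x^t + A^T z^t ≈ x_o + tau_t·Z:
       tau_{t+1}^2 = sigma_w^2 + (1/delta)·E[(η(X + tau_t·Z; theta) − X)^2]."""
    rng = np.random.default_rng(0)
    tau2 = sigma_w**2 + np.mean(x_samples**2) / delta     # one common initialization
    history = [tau2]
    for _ in range(iters):
        X = rng.choice(x_samples, size=mc)                # draws from the signal prior
        Z = rng.standard_normal(mc)
        mse = np.mean((soft_threshold(X + np.sqrt(tau2) * Z, theta) - X) ** 2)
        tau2 = sigma_w**2 + mse / delta
        history.append(tau2)
    return history

# Example: a sparse prior with 10% nonzero Gaussian entries
rng = np.random.default_rng(1)
x_samples = np.where(rng.random(10000) < 0.1, rng.standard_normal(10000), 0.0)
print(state_evolution(sigma_w=0.1, delta=0.5, theta=0.3, x_samples=x_samples)[:5])
```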