Graphical models and message-passing Part II: Marginals and likelihoods

Martin Wainwright
UC Berkeley
Departments of Statistics, and EECS
Tutorial materials (slides, monograph, lecture notes) available at:
www.eecs.berkeley.edu/~wainwrig/kyoto12
September 3, 2012
Graphs and factorization

[Figure: an undirected graph on vertices 1–7; example compatibility functions ψ7, ψ47, and ψ456 are attached to cliques.]

- a clique C is a fully connected subset of vertices
- a compatibility function ψC is defined on the variables xC = {xs, s ∈ C}
- the distribution factorizes over all cliques:

    p(x1, . . . , xN) = (1/Z) ∏_{C∈𝒞} ψC(xC).
Core computational challenges

Given an undirected graphical model (Markov random field):

    p(x1, x2, . . . , xN) = (1/Z) ∏_{C∈𝒞} ψC(xC)

How to efficiently compute?

- the most probable configuration (MAP estimate):

    Maximize:   x̂ = arg max_{x∈X^N} p(x1, . . . , xN) = arg max_{x∈X^N} ∏_{C∈𝒞} ψC(xC).

- the data likelihood or normalization constant:

    Sum/integrate:   Z = ∑_{x∈X^N} ∏_{C∈𝒞} ψC(xC)

- marginal distributions at single sites, or subsets:

    Sum/integrate:   p(Xs = xs) = (1/Z) ∑_{xt, t≠s} ∏_{C∈𝒞} ψC(xC)
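On small models all three quantities can be computed by brute-force enumeration, which is a useful correctness baseline for the message-passing algorithms discussed later. A minimal sketch (not from the slides), assuming a hypothetical binary 3-node chain with two edge compatibility functions:

```python
import itertools

# Toy pairwise MRF: chain 0-1-2, binary states (a hypothetical example).
# psi[(s, t)][xs][xt] are edge compatibility functions; node potentials are uniform.
N, states = 3, (0, 1)
psi = {(0, 1): [[2.0, 1.0], [1.0, 2.0]],
       (1, 2): [[2.0, 1.0], [1.0, 2.0]]}

def unnorm(x):
    """Unnormalized probability: product of clique compatibilities."""
    p = 1.0
    for (s, t), table in psi.items():
        p *= table[x[s]][x[t]]
    return p

# Partition function Z: sum over all |X|^N configurations.
Z = sum(unnorm(x) for x in itertools.product(states, repeat=N))

# MAP estimate: configuration maximizing the unnormalized product.
x_map = max(itertools.product(states, repeat=N), key=unnorm)

# Marginal at site s: sum out all other variables, then normalize by Z.
def marginal(s):
    m = [0.0] * len(states)
    for x in itertools.product(states, repeat=N):
        m[x[s]] += unnorm(x)
    return [v / Z for v in m]

print(Z, x_map, marginal(1))
```

The cost of this baseline is exponential in N, which is exactly why the tree-structured recursions of the next section matter.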
§1. Sum-product message-passing on trees

Goal: Compute the marginal distribution at a node u of a tree-structured distribution

    p(x) ∝ ∏_{s∈V} exp(θs(xs)) ∏_{(s,t)∈E} exp(θst(xs, xt)).

[Figure: chain graph 1–2–3, with messages M12 and M32 flowing into node 2.]

Example: on the chain 1–2–3, the marginal at node 2 is

    p2(x2) = ∑_{x1,x3} p(x) ∝ exp(θ2(x2)) ∏_{t∈{1,3}} { ∑_{xt} exp[θt(xt) + θ2t(x2, xt)] },

and each bracketed sum over xt is precisely the message Mt2(x2).
Putting together the pieces

Sum-product is an exact algorithm for any tree.

[Figure: node t with neighbors w, u, v rooting subtrees Tw, Tu, Tv; their messages Mwt, Mut, Mvt are combined into the message Mts sent from t to s.]

    Mts   ≡  message from node t to s
    N(t)  ≡  neighbors of node t

Update:

    Mts(xs) ← ∑_{x′t∈Xt} { exp[θst(xs, x′t) + θt(x′t)] ∏_{v∈N(t)\s} Mvt(x′t) }

Sum-marginals:

    ps(xs; θ) ∝ exp{θs(xs)} ∏_{t∈N(s)} Mts(xs).
Summary: sum-product on trees

- converges in at most graph diameter # of iterations
- updating a single message is an O(m²) operation
- overall algorithm requires O(N m²) operations
- upon convergence, yields the exact node and edge marginals:

    ps(xs) ∝ e^{θs(xs)} ∏_{u∈N(s)} Mus(xs)

    pst(xs, xt) ∝ e^{θs(xs)+θt(xt)+θst(xs,xt)} ∏_{u∈N(s)\t} Mus(xs) ∏_{u∈N(t)\s} Mut(xt)

- messages can also be used to compute the partition function

    Z = ∑_{x1,...,xN} ∏_{s∈V} e^{θs(xs)} ∏_{(s,t)∈E} e^{θst(xs,xt)}.
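The update and sum-marginal equations translate directly into code. A minimal sketch (not the tutorial's own implementation), assuming a hypothetical star tree on four binary variables; the claimed exactness is checked against brute-force enumeration:

```python
import itertools
import math

# Hypothetical tree: star with center 0 and leaves 1, 2, 3; binary states.
V, E = (0, 1, 2, 3), ((0, 1), (0, 2), (0, 3))
nbrs = {0: (1, 2, 3), 1: (0,), 2: (0,), 3: (0,)}
theta_node = {s: [0.0, 0.5] for s in V}                   # θs
theta_edge = {e: [[0.8, -0.2], [-0.2, 0.8]] for e in E}   # θst

def th_edge(s, t, xs, xt):
    return theta_edge[(s, t)][xs][xt] if (s, t) in theta_edge else theta_edge[(t, s)][xt][xs]

# Messages M[(t, s)][xs]; diameter of the star is 2, so two synchronous sweeps suffice.
M = {(t, s): [1.0, 1.0] for t in V for s in nbrs[t]}
for _ in range(2):
    new = {}
    for (t, s) in M:
        vals = [sum(math.exp(th_edge(s, t, xs, xt) + theta_node[t][xt])
                    * math.prod(M[(v, t)][xt] for v in nbrs[t] if v != s)
                    for xt in (0, 1))
                for xs in (0, 1)]
        z = sum(vals)                     # normalize for numerical stability
        new[(t, s)] = [v / z for v in vals]
    M = new

def bp_marginal(s):
    p = [math.exp(theta_node[s][xs]) * math.prod(M[(t, s)][xs] for t in nbrs[s])
         for xs in (0, 1)]
    z = sum(p)
    return [v / z for v in p]

def brute_marginal(s):
    w = {x: math.exp(sum(theta_node[v][x[v]] for v in V)
                     + sum(th_edge(a, b, x[a], x[b]) for (a, b) in E))
         for x in itertools.product((0, 1), repeat=len(V))}
    Z = sum(w.values())
    return [sum(p for x, p in w.items() if x[s] == xs) / Z for xs in (0, 1)]
```

On this tree the two marginals agree to machine precision, matching the exactness claim above.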
§2. Sum-product on graphs with cycles

- as with max-product, a widely used heuristic with a long history:
  ◮ error-control coding: Gallager, 1963
  ◮ artificial intelligence: Pearl, 1988
  ◮ turbo decoding: Berrou et al., 1993
  ◮ etc.
- some concerns with sum-product on graphs with cycles:
  ◮ no convergence guarantees
  ◮ can have multiple fixed points
  ◮ final estimate of Z is neither a lower nor an upper bound
- as before, one can consider a broader class of reweighted sum-product algorithms
Tree-reweighted sum-product algorithms

Message update from node t to node s:

    Mts(xs) ← κ ∑_{x′t∈Xt} exp[ θst(xs, x′t)/ρst + θt(x′t) ] · [ ∏_{v∈N(t)\s} Mvt(x′t)^ρvt ] / Mst(x′t)^(1−ρts)

    (θst/ρst: reweighted edge;  Mvt^ρvt: reweighted messages;  Mst^(1−ρts): opposite message)

Properties:

1. Modified updates remain distributed and purely local over the graph.
2. Key differences:
   • Potential on edge (s, t) is rescaled by ρst ∈ [0, 1].
   • Messages are reweighted with ρvt ∈ [0, 1].
   • Update involves the reverse-direction message Mst.
3. The choice ρst = 1 for all edges (s, t) recovers the standard update.
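A single reweighted update is a direct transcription of the display above. A sketch under assumed data structures (messages keyed by directed pairs `(t, s)`, edge potentials and weights stored under both key orderings; none of these names come from the slides):

```python
import math

def trw_update(t, s, theta_node, theta_edge, M, nbrs, rho, states):
    """One tree-reweighted update of the message from t to s:
    Mts(xs) <- k * sum_{x't} exp[theta_st(xs,x't)/rho_st + theta_t(x't)]
                   * prod_{v in N(t)\\s} Mvt(x't)^rho_vt / Mst(x't)^(1-rho_ts).
    Assumes theta_edge and rho are stored under both key orderings."""
    msg = []
    for xs in states:
        total = 0.0
        for xt in states:
            val = math.exp(theta_edge[(s, t)][xs][xt] / rho[(s, t)]   # reweighted edge
                           + theta_node[t][xt])
            for v in nbrs[t]:
                if v != s:
                    val *= M[(v, t)][xt] ** rho[(v, t)]               # reweighted messages
            val /= M[(s, t)][xt] ** (1.0 - rho[(s, t)])               # opposite message
            total += val
        msg.append(total)
    z = sum(msg)        # kappa: normalize the outgoing message
    return [m / z for m in msg]
```

With ρst = 1 on every edge, the rescaling and the opposite-message division disappear and the function reduces to the standard sum-product update, which gives a quick sanity check.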
Bethe entropy approximation

- define local marginal distributions (e.g., for m = 3 states):

    µs(xs) = [µs(0), µs(1), µs(2)]    µst(xs, xt) = [ µst(0,0)  µst(0,1)  µst(0,2)
                                                      µst(1,0)  µst(1,1)  µst(1,2)
                                                      µst(2,0)  µst(2,1)  µst(2,2) ]

- define node-based entropy and edge-based mutual information:

    Node-based entropy:   Hs(µs) = −∑_{xs} µs(xs) log µs(xs)

    Mutual information:   Ist(µst) = ∑_{xs,xt} µst(xs, xt) log [ µst(xs, xt) / (µs(xs) µt(xt)) ]

- ρ-reweighted Bethe entropy:

    HBethe(µ) = ∑_{s∈V} Hs(µs) − ∑_{(s,t)∈E} ρst Ist(µst)
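These three definitions translate line-for-line into code; a sketch (illustrative, not from the slides) assuming pseudomarginals stored as plain lists (node) and nested lists (edge):

```python
import math

def node_entropy(mu_s):
    """H_s(mu_s) = -sum mu_s log mu_s."""
    return -sum(p * math.log(p) for p in mu_s if p > 0)

def mutual_info(mu_st, mu_s, mu_t):
    """I_st(mu_st) = sum mu_st log[mu_st / (mu_s mu_t)]."""
    return sum(mu_st[i][j] * math.log(mu_st[i][j] / (mu_s[i] * mu_t[j]))
               for i in range(len(mu_s)) for j in range(len(mu_t))
               if mu_st[i][j] > 0)

def bethe_entropy(mu_node, mu_edge, rho):
    """H_Bethe(mu) = sum_s H_s(mu_s) - sum_(s,t) rho_st I_st(mu_st)."""
    H = sum(node_entropy(m) for m in mu_node.values())
    for (s, t), m in mu_edge.items():
        H -= rho[(s, t)] * mutual_info(m, mu_node[s], mu_node[t])
    return H
```

Two sanity checks: an independent uniform pair has Ist = 0 and entropy 2 log 2, while a perfectly correlated pair loses exactly log 2 of entropy.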
Bethe entropy is exact for trees

- exact for trees, using the factorization:

    p(x; θ) = ∏_{s∈V} µs(xs) ∏_{(s,t)∈E} [ µst(xs, xt) / (µs(xs) µt(xt)) ]
Reweighted sum-product and Bethe variational principle

Define the local constraint set

    L(G) = { (τs, τst) | τ ≥ 0,  ∑_{xs} τs(xs) = 1,  ∑_{xt} τst(xs, xt) = τs(xs) }

Theorem
For any choice of positive edge weights ρst > 0:

(a) Fixed points of reweighted sum-product are stationary points of the Lagrangian associated with

    ABethe(θ; ρ) := max_{τ∈L(G)} { ∑_{s∈V} ⟨τs, θs⟩ + ∑_{(s,t)∈E} ⟨τst, θst⟩ + HBethe(τ; ρ) }.

(b) For valid choices of edge weights {ρst}, the fixed points are unique and moreover log Z(θ) ≤ ABethe(θ; ρ). In addition, reweighted sum-product converges with appropriate scheduling.
Lagrangian derivation of ordinary sum-product

- let's try to solve this problem by a (partial) Lagrangian formulation
- assign a Lagrange multiplier λts(xs) for each constraint

    Cts(xs) := τs(xs) − ∑_{xt} τst(xs, xt) = 0

- will enforce the normalization (∑_{xs} τs(xs) = 1) and non-negativity constraints explicitly
- the Lagrangian takes the form:

    L(τ; λ) = ⟨θ, τ⟩ + ∑_{s∈V} Hs(τs) − ∑_{(s,t)∈E(G)} Ist(τst)
              + ∑_{(s,t)∈E} [ ∑_{xt} λst(xt) Cst(xt) + ∑_{xs} λts(xs) Cts(xs) ]
Lagrangian derivation (part II)

- taking derivatives of the Lagrangian w.r.t. τs and τst yields

    ∂L/∂τs(xs) = θs(xs) − log τs(xs) + ∑_{t∈N(s)} λts(xs) + C

    ∂L/∂τst(xs, xt) = θst(xs, xt) − log [ τst(xs, xt) / (τs(xs) τt(xt)) ] − λts(xs) − λst(xt) + C′

- setting these partial derivatives to zero and simplifying:

    τs(xs) ∝ exp{θs(xs)} ∏_{t∈N(s)} exp{λts(xs)}

    τst(xs, xt) ∝ exp{θs(xs) + θt(xt) + θst(xs, xt)} × ∏_{u∈N(s)\t} exp{λus(xs)} ∏_{v∈N(t)\s} exp{λvt(xt)}

- enforcing the constraint Cts(xs) = 0 on these representations yields the familiar update rule for the messages Mts(xs) = exp(λts(xs)):

    Mts(xs) ← ∑_{xt} exp{θt(xt) + θst(xs, xt)} ∏_{u∈N(t)\s} Mut(xt)
Convex combinations of trees

Idea: Upper bound A(θ) := log Z(θ) with a convex combination of tree-structured problems:

    θ     =  ρ(T¹)θ(T¹) + ρ(T²)θ(T²) + ρ(T³)θ(T³)
    A(θ)  ≤  ρ(T¹)A(θ(T¹)) + ρ(T²)A(θ(T²)) + ρ(T³)A(θ(T³))

where

    ρ = {ρ(T)}  ≡  probability distribution over spanning trees
    θ(T)        ≡  tree-structured parameter vector
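The bound is just Jensen's inequality applied to the convex function A, and it can be checked numerically on a small cycle. On the 3-cycle each edge appears in 2 of the 3 spanning trees, so taking ρ(T) = 1/3 and scaling each retained edge parameter by 1/ρe = 3/2 makes ∑_T ρ(T)θ(T) = θ. A sketch with hypothetical binary potentials (θst(xs, xt) = θe·1[xs = xt], node terms zero), with A computed by enumeration:

```python
import itertools
import math

edges = [(0, 1), (1, 2), (0, 2)]  # the 3-cycle
theta = {e: 0.5 for e in edges}   # hypothetical attractive edge parameters

def log_Z(active):
    """A(theta) = log Z for the model with the given edge parameters."""
    total = 0.0
    for x in itertools.product((0, 1), repeat=3):
        total += math.exp(sum(w for (s, t), w in active.items() if x[s] == x[t]))
    return math.log(total)

A_theta = log_Z(theta)

# Spanning trees of the 3-cycle: drop one edge each; rho(T) = 1/3, rho_e = 2/3,
# so each tree's retained edges carry theta_e / rho_e.
bound = 0.0
for dropped in edges:
    theta_T = {e: theta[e] / (2.0 / 3.0) for e in edges if e != dropped}
    bound += (1.0 / 3.0) * log_Z(theta_T)

print(A_theta, bound)  # A(theta) <= bound
```

Because A is strictly convex and the tree parameters differ, the inequality is strict here; the gap is what the optimization over ρ on the next slides tries to shrink.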
Finding the tightest upper bound

Observation: For each fixed distribution ρ over spanning trees, there are many such upper bounds.

Goal: Find the tightest such upper bound over all trees.

Challenge: Number of spanning trees grows rapidly in graph size.

Example: On the 2-D lattice:

    Grid size | # trees
    ----------|------------
            9 | 192
           16 | 100352
           36 | 3.26 × 10¹³
          100 | 5.69 × 10⁴²

By a suitable dual reformulation, the problem can be avoided.

Key duality relation:

    min_{∑_T ρ(T)θ(T) = θ} ∑_T ρ(T)A(θ(T)) = max_{µ∈L(G)} { ⟨µ, θ⟩ + HBethe(µ; ρst) }.
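The table's counts can be reproduced with Kirchhoff's matrix-tree theorem: the number of spanning trees of a graph equals any cofactor of its graph Laplacian. A sketch in exact rational arithmetic (none of this code is from the slides):

```python
from fractions import Fraction

def grid_laplacian(n):
    """Laplacian (degree matrix minus adjacency) of the n x n grid graph."""
    idx = lambda r, c: r * n + c
    N = n * n
    L = [[0] * N for _ in range(N)]
    for r in range(n):
        for c in range(n):
            for dr, dc in ((0, 1), (1, 0)):
                r2, c2 = r + dr, c + dc
                if r2 < n and c2 < n:
                    a, b = idx(r, c), idx(r2, c2)
                    L[a][a] += 1; L[b][b] += 1
                    L[a][b] -= 1; L[b][a] -= 1
    return L

def num_spanning_trees(n):
    """Matrix-tree theorem: delete one row/column of L, take the determinant."""
    L = grid_laplacian(n)
    M = [[Fraction(v) for v in row[1:]] for row in L[1:]]  # drop row/col 0
    det = Fraction(1)
    for i in range(len(M)):   # exact Gaussian elimination over the rationals
        piv = next(r for r in range(i, len(M)) if M[r][i] != 0)
        if piv != i:
            M[i], M[piv] = M[piv], M[i]
            det = -det
        det *= M[i][i]
        for r in range(i + 1, len(M)):
            f = M[r][i] / M[i][i]
            for c in range(i, len(M)):
                M[r][c] -= f * M[i][c]
    return int(det)

print(num_spanning_trees(3), num_spanning_trees(4))  # 192, 100352
```

This also makes the challenge concrete: the counts explode, so optimizing over ρ tree-by-tree is hopeless, and the dual reformulation above is what makes the problem tractable.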
Edge appearance probabilities

Experiment: What is the probability ρe that a given edge e ∈ E belongs to a tree T drawn randomly under ρ?

[Figure: (a) original graph with labeled edges b, e, f; (b)–(d) three spanning trees T¹, T², T³, each with weight ρ(T¹) = ρ(T²) = ρ(T³) = 1/3.]

In this example:   ρb = 1;   ρe = 2/3;   ρf = 1/3.

The vector ρe = { ρe | e ∈ E } must belong to the spanning tree polytope. (Edmonds, 1971)
Why does entropy arise in the duality?

Due to a deep correspondence between two problems:

Maximum entropy density estimation: maximize the entropy

    H(p) = −∑_x p(x1, . . . , xN) log p(x1, . . . , xN)

subject to expectation constraints of the form

    ∑_x p(x) φα(x) = µ̂α.

Maximum likelihood in an exponential family: maximize the likelihood of parameterized densities

    p(x1, . . . , xN; θ) = exp{ ∑_α θα φα(x) − A(θ) }.
Conjugate dual functions

- conjugate duality is a fertile source of variational representations
- any function f can be used to define another function f* as follows:

    f*(v) := sup_{u∈ℝⁿ} { ⟨v, u⟩ − f(u) }.

- easy to show that f* is always a convex function
- how about taking the "dual of the dual"? I.e., what is (f*)*?
- when f is well-behaved (convex and lower semi-continuous), we have (f*)* = f, or alternatively stated:

    f(u) = sup_{v∈ℝⁿ} { ⟨u, v⟩ − f*(v) }

Geometric view: Supporting hyperplanes

Question: Given all hyperplanes in ℝⁿ × ℝ with normal (v, −1), what is the intercept of the one that supports epi(f)?

[Figure: graph of f(u) with two candidate hyperplanes ⟨v, u⟩ − ca and ⟨v, u⟩ − cb; only the first supports the epigraph.]

Epigraph of f:

    epi(f) := { (u, β) ∈ ℝⁿ⁺¹ | f(u) ≤ β }.

Analytically, we require the smallest c ∈ ℝ such that

    ⟨v, u⟩ − c ≤ f(u)   for all u ∈ ℝⁿ

By re-arranging, we find that this optimal c* is the dual value:

    c* = sup_{u∈ℝⁿ} { ⟨v, u⟩ − f(u) }.
Example: Single Bernoulli

Random variable X ∈ {0, 1} yields an exponential family of the form:

    p(x; θ) ∝ exp(θ x)   with A(θ) = log[1 + exp(θ)].

Let's compute the dual A*(µ) := sup_{θ∈ℝ} { µθ − log[1 + exp(θ)] }.

(Possible) stationary point:   µ = exp(θ)/[1 + exp(θ)].

[Figure: (a) for µ ∈ (0, 1), the epigraph of A is supported by the hyperplane ⟨µ, θ⟩ − A*(µ); (b) for µ outside [0, 1], the epigraph cannot be supported by ⟨µ, θ⟩ − c.]

We find that:

    A*(µ) = { µ log µ + (1 − µ) log(1 − µ)   if µ ∈ [0, 1]
            { +∞                             otherwise.

Leads to the variational representation:

    A(θ) = max_{µ∈[0,1]} { µ · θ − A*(µ) }.
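The closed form for A*(µ) and the variational representation can be verified numerically: for any θ, a grid search of µθ − A*(µ) over [0, 1] should recover log(1 + e^θ). A sketch:

```python
import math

def A(theta):
    """Bernoulli log-partition function A(theta) = log(1 + e^theta)."""
    return math.log(1.0 + math.exp(theta))

def A_star(mu):
    """Conjugate dual A*(mu): negative Bernoulli entropy on [0, 1], +inf outside."""
    if not 0.0 <= mu <= 1.0:
        return float("inf")
    if mu in (0.0, 1.0):
        return 0.0
    return mu * math.log(mu) + (1.0 - mu) * math.log(1.0 - mu)

def A_via_dual(theta, grid=200000):
    """A(theta) = max over mu in [0,1] of { mu*theta - A*(mu) }, on a fine grid."""
    return max(mu * theta - A_star(mu)
               for mu in (i / grid for i in range(grid + 1)))
```

The stationary condition is also visible in the search: the maximizing grid point sits at the mean parameter µ = e^θ/(1 + e^θ).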
Geometry of Bethe variational problem

[Figure: the marginal polytope M(G) nested inside the outer bound L(G), with an integral vertex µint of M(G) and a fractional vertex µfrac of L(G).]

- belief propagation uses a polyhedral outer approximation L(G) to M(G):
  ◮ for any graph, L(G) ⊇ M(G).
  ◮ equality holds ⇐⇒ G is a tree.

Natural question: Do BP fixed points ever fall outside of the marginal polytope M(G)?
Illustration: Globally inconsistent BP fixed points

Consider the following assignment of pseudomarginals τs, τst on the 3-cycle with nodes {1, 2, 3}:

    τs = [0.5, 0.5]   at each node s

    τ12 = τ23 = [ 0.4  0.1        τ13 = [ 0.1  0.4
                  0.1  0.4 ]              0.4  0.1 ]

- can verify that τ ∈ L(G), and that τ is a fixed point of belief propagation (with all constant messages)
- however, τ is globally inconsistent

Note: More generally, for any τ in the interior of L(G), one can construct a distribution with τ as a BP fixed point.
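Both claims can be checked mechanically (a sketch; the edge tables are transcribed as read from the figure above): local consistency amounts to row/column sums, and global inconsistency follows from a union-bound argument on disagreement events that any true joint distribution must satisfy.

```python
# Pseudomarginals from the 3-cycle example: two "attractive" edges, one "repulsive".
tau_node = {s: [0.5, 0.5] for s in (1, 2, 3)}
tau_edge = {(1, 2): [[0.4, 0.1], [0.1, 0.4]],
            (2, 3): [[0.4, 0.1], [0.1, 0.4]],
            (1, 3): [[0.1, 0.4], [0.4, 0.1]]}

# Local consistency (tau in L(G)): each tau_st marginalizes to tau_s and tau_t.
for (s, t), m in tau_edge.items():
    for x in (0, 1):
        assert abs(sum(m[x]) - tau_node[s][x]) < 1e-12                 # sum over x_t
        assert abs(sum(row[x] for row in m) - tau_node[t][x]) < 1e-12  # sum over x_s

# Global inconsistency: for any true joint, {X1 != X3} is contained in
# {X1 != X2} union {X2 != X3}, so P(X1 != X3) <= P(X1 != X2) + P(X2 != X3).
def p_disagree(m):
    return m[0][1] + m[1][0]

d12, d23, d13 = (p_disagree(tau_edge[e]) for e in ((1, 2), (2, 3), (1, 3)))
assert d13 > d12 + d23   # 0.8 > 0.2 + 0.2: no joint distribution can match tau
```

So every local marginalization check passes, yet the three edge tables jointly contradict each other, which is exactly the gap between L(G) and M(G).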
High-level perspective: A broad class of methods

- message-passing algorithms (e.g., mean field, belief propagation) are solving approximate versions of an exact variational principle in exponential families
- there are two distinct components to the approximations:
  (a) can use either inner or outer bounds on M
  (b) various approximations to the entropy function −A*(µ)

Refining one or both components yields better approximations:

- BP: polyhedral outer bound and non-convex Bethe approximation
- Kikuchi and variants: tighter polyhedral outer bounds and better entropy approximations (e.g., Yedidia et al., 2002)
- Expectation-propagation: better outer bounds and Bethe-like entropy approximations (Minka, 2002)