An Introduction to Factor Graphs

Hans-Andrea Loeliger
MLSB 2008, Copenhagen
Definition
A factor graph represents the factorization of a function of several
variables. We use Forney-style factor graphs (Forney, 2001).
Example:
f (x1, x2, x3, x4, x5) = fA(x1, x2, x3) · fB(x3, x4, x5) · fC(x4).
[Figure: the corresponding factor graph, with nodes fA, fB, fC and edges/half-edges x1, x2, x3, x4, x5.]
Rules:
• A node for every factor.
• An edge or half-edge for every variable.
• Node g is connected to edge x iff variable x appears in factor g.
(What if some variable appears in more than 2 factors?)
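For readers who like to experiment, here is a minimal sketch (not from the slides) of how such a factorization can be stored and evaluated; only the variable lists match the example above, the factor values are arbitrary placeholders.

# Minimal sketch: the factorization f = fA(x1, x2, x3) * fB(x3, x4, x5) * fC(x4),
# stored as (factor function, list of variable names); the factor values are placeholders.
fA = lambda x1, x2, x3: 1.0 + x1 * x2 * x3
fB = lambda x3, x4, x5: 1.0 + x3 * (x4 + x5)
fC = lambda x4: 2.0 - x4
factors = [(fA, ["x1", "x2", "x3"]),
           (fB, ["x3", "x4", "x5"]),
           (fC, ["x4"])]

def f(config):
    """Evaluate the global function at a configuration such as {'x1': 0, ..., 'x5': 1}."""
    value = 1.0
    for factor, variables in factors:
        value *= factor(*(config[v] for v in variables))
    return value

print(f({"x1": 1, "x2": 0, "x3": 1, "x4": 0, "x5": 1}))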
Example:
Markov Chain
pXYZ(x, y, z) = pX(x) pY|X(y|x) pZ|Y(z|y).
[Figure: chain factor graph with nodes pX, pY|X, pZ|Y and edges X, Y, Z.]
We will often use capital letters for the variables. (Why?)
Further examples will come later.
Message Passing Algorithms
operate by passing messages along the edges of a factor graph:
[Figure: a factor graph with arrows indicating messages flowing along its edges.]
A main point of factor graphs (and similar graphical notations):
A Unified View of Historically Different Things
Statistical physics:
- Markov random fields (Ising 1925)
Signal processing:
- linear state-space models and Kalman filtering: Kalman 1960. . .
- recursive least-squares adaptive filters
- Hidden Markov models: Baum et al. 1966. . .
- unification: Levy et al. 1996. . .
Error correcting codes:
- Low-density parity check codes: Gallager 1962; Tanner 1981;
MacKay 1996; Luby et al. 1998. . .
- Convolutional codes and Viterbi decoding: Forney 1973. . .
- Turbo codes: Berrou et al. 1993. . .
Machine learning, statistics:
- Bayesian networks: Pearl 1988; Shachter 1988; Lauritzen and
Spiegelhalter 1988; Shafer and Shenoy 1990. . .
Other Notation Systems for Graphical Models
Example: p(u, w, x, y, z) = p(u)p(w)p(x|u, w)p(y|x)p(z|x).
[Figure: the same model drawn in four notation systems: a Forney-style factor graph; the original factor graph notation [FKLW 1997]; a Bayesian network; and a Markov random field.]
Outline
1. Introduction
2. The sum-product and max-product algorithms
3. More about factor graphs
4. Applications of sum-product & max-product
to hidden Markov models
5. Graphs with cycles, continuous variables, . . .
The Two Basic Problems
1. Marginalization: Compute
f̄k(xk) := Σ_{x1, . . . , xn except xk} f(x1, . . . , xn)
2. Maximization: Compute the “max-marginal”
f̂k(xk) := max_{x1, . . . , xn except xk} f(x1, . . . , xn),
assuming that f is real-valued and nonnegative and has a maximum.
Note that
argmax_{x1,...,xn} f(x1, . . . , xn) = ( argmax_{x1} f̂1(x1), . . . , argmax_{xn} f̂n(xn) ).
For large n, both problems are in general intractable
even for x1, . . . , xn ∈ {0, 1}.
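A brute-force sketch of both quantities for binary variables (the toy function f below is an assumption); it makes the 2^n cost explicit:

from itertools import product

def marginal_and_max_marginal(f, n, k):
    """Brute force over all 2^n binary configurations: returns (f_bar_k, f_hat_k) as dicts."""
    f_bar = {0: 0.0, 1: 0.0}
    f_hat = {0: 0.0, 1: 0.0}
    for x in product((0, 1), repeat=n):
        value = f(x)
        f_bar[x[k]] += value                     # sum over everything except x_k
        f_hat[x[k]] = max(f_hat[x[k]], value)    # max over everything except x_k
    return f_bar, f_hat

# Assumed toy global function, just to make the sketch runnable.
f = lambda x: (1 + x[0] + x[1]) * (1 + x[1] * x[2])
print(marginal_and_max_marginal(f, n=3, k=1))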
Factorization Helps
For example, if f(x1, . . . , xn) = f1(x1) f2(x2) · · · fn(xn), then
f̄k(xk) = ( Σ_{x1} f1(x1) ) · · · ( Σ_{xk−1} fk−1(xk−1) ) · fk(xk) · ( Σ_{xk+1} fk+1(xk+1) ) · · · ( Σ_{xn} fn(xn) )
and
f̂k(xk) = ( max_{x1} f1(x1) ) · · · ( max_{xk−1} fk−1(xk−1) ) · fk(xk) · ( max_{xk+1} fk+1(xk+1) ) · · · ( max_{xn} fn(xn) ).
Factorization also helps beyond this trivial example.
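A minimal sketch of this fully factorized case, assuming binary variables and toy unary factors; each sum or max now runs over a single variable, so the cost is linear in n:

# Marginal and max-marginal of x_k when f(x1, ..., xn) = f1(x1) * ... * fn(xn);
# the binary variables and the toy unary factors below are assumptions.
def marginal_k(unary_factors, k, x_k):
    value = unary_factors[k](x_k)
    for i, f_i in enumerate(unary_factors):
        if i != k:
            value *= sum(f_i(x) for x in (0, 1))     # sum out x_i
    return value

def max_marginal_k(unary_factors, k, x_k):
    value = unary_factors[k](x_k)
    for i, f_i in enumerate(unary_factors):
        if i != k:
            value *= max(f_i(x) for x in (0, 1))     # maximize over x_i
    return value

factors = [lambda x: 1.0 + x, lambda x: 2.0 - x, lambda x: 0.5 + x]
print(marginal_k(factors, k=1, x_k=0), max_marginal_k(factors, k=1, x_k=0))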
Towards the sum-product algorithm:
Computing Marginals—A Generic Example
Assume we wish to compute
f̄3(x3) = Σ_{x1, . . . , x7 except x3} f(x1, . . . , x7)
and assume that f can be factored as follows:
[Figure: factor graph of f(x1, . . . , x7) = f1(x1) f2(x2) f3(x1, x2, x3) f4(x4) f5(x3, x4, x5) f6(x5, x6, x7) f7(x7).]
Example cont’d:
Closing Boxes by the Distributive Law
[Figure: the same factor graph, with boxes closed around the subgraphs that are summed out by the distributive law.]
f̄3(x3) = ( Σ_{x1,x2} f1(x1) f2(x2) f3(x1, x2, x3) )
· ( Σ_{x4,x5} f4(x4) f5(x3, x4, x5) ( Σ_{x6,x7} f6(x5, x6, x7) f7(x7) ) )
Example cont’d: Message Passing View
[Figure: the same factor graph, with the messages →µX3, ←µX5, and ←µX3 indicated on the edges.]
f̄3(x3) = ( Σ_{x1,x2} f1(x1) f2(x2) f3(x1, x2, x3) ) · ( Σ_{x4,x5} f4(x4) f5(x3, x4, x5) ( Σ_{x6,x7} f6(x5, x6, x7) f7(x7) ) ),
where the first factor is the message →µX3(x3), the inner sum is the message ←µX5(x5), and the whole second factor is the message ←µX3(x3).
Example cont’d: Messages Everywhere
[Figure: the same factor graph with a message indicated on every edge.]
With →µX1(x1) := f1(x1), →µX2(x2) := f2(x2), etc., we have
→µX3(x3) = Σ_{x1,x2} →µX1(x1) →µX2(x2) f3(x1, x2, x3)
←µX5(x5) = Σ_{x6,x7} →µX7(x7) f6(x5, x6, x7)
←µX3(x3) = Σ_{x4,x5} →µX4(x4) ←µX5(x5) f5(x3, x4, x5)
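A sketch of these three message computations, assuming binary variables and arbitrary toy factors f1, . . . , f7; the product of the two messages on edge X3 reproduces f̄3(x3):

from itertools import product

# Assumed toy factors for the example f = f1 f2 f3 f4 f5 f6 f7 (only the argument
# lists match the factor graph; the values are placeholders).
f1 = lambda x1: 1.0 + x1
f2 = lambda x2: 2.0 - x2
f3 = lambda x1, x2, x3: 1.0 + x1 * x2 + x3
f4 = lambda x4: 1.0 + 2 * x4
f5 = lambda x3, x4, x5: 1.0 + x3 * x4 + x5
f6 = lambda x5, x6, x7: 1.0 + x5 + x6 * x7
f7 = lambda x7: 3.0 - x7
B = (0, 1)                                       # binary alphabet (an assumption)

def mu_fwd_X3(x3):                               # message toward X3 from the {f1, f2, f3} side
    return sum(f1(x1) * f2(x2) * f3(x1, x2, x3) for x1, x2 in product(B, B))

def mu_bwd_X5(x5):                               # message toward X5 from the {f6, f7} side
    return sum(f7(x7) * f6(x5, x6, x7) for x6, x7 in product(B, B))

def mu_bwd_X3(x3):                               # message toward X3 from the {f4, f5, f6, f7} side
    return sum(f4(x4) * mu_bwd_X5(x5) * f5(x3, x4, x5) for x4, x5 in product(B, B))

# The marginal f_bar_3(x3) is the product of the two messages on edge X3.
print([mu_fwd_X3(x3) * mu_bwd_X3(x3) for x3 in B])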
The Sum-Product Algorithm (Belief Propagation)
[Figure: a generic node g with incoming edges Y1, . . . , Yn and an outgoing edge X.]
Sum-product message computation rule:
→µX(x) = Σ_{y1, . . . , yn} g(x, y1, . . . , yn) →µY1(y1) · · · →µYn(yn)
Sum-product theorem:
If the factor graph for some function f has no cycles, then
f̄X(x) = →µX(x) ←µX(x).
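A generic sketch of this rule for discrete variables; the representation of a message as a dictionary from values to weights, and the example node g, are assumptions:

from itertools import product

def sum_product_update(g, incoming, out_alphabet):
    """Sum-product rule: the message out of node g along edge X.

    g            : function g(x, y1, ..., yn)
    incoming     : one message per incoming edge, each a dict {value: weight}
    out_alphabet : the values that x can take
    """
    alphabets = [list(mu.keys()) for mu in incoming]
    out = {}
    for x in out_alphabet:
        total = 0.0
        for ys in product(*alphabets):           # sum over y1, ..., yn
            weight = g(x, *ys)
            for mu, y in zip(incoming, ys):
                weight *= mu[y]                  # product of the incoming messages
            total += weight
        out[x] = total
    return out

# Assumed example: g enforces x = y1 XOR y2, with two incoming binary messages.
g = lambda x, y1, y2: 1.0 if x == (y1 ^ y2) else 0.0
print(sum_product_update(g, [{0: 0.5, 1: 0.5}, {0: 0.9, 1: 0.1}], (0, 1)))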
Towards the max-product algorithm:
Computing Max-Marginals—A Generic Example
Assume we wish to compute
f̂3(x3) = max_{x1, . . . , x7 except x3} f(x1, . . . , x7)
and assume that f can be factored as follows:
[Figure: the same factor graph of f(x1, . . . , x7) = f1(x1) f2(x2) f3(x1, x2, x3) f4(x4) f5(x3, x4, x5) f6(x5, x6, x7) f7(x7).]
Example:
Closing Boxes by the Distributive Law
[Figure: the same factor graph, with boxes closed around the subgraphs that are maximized out.]
f̂3(x3) = ( max_{x1,x2} f1(x1) f2(x2) f3(x1, x2, x3) )
· ( max_{x4,x5} f4(x4) f5(x3, x4, x5) ( max_{x6,x7} f6(x5, x6, x7) f7(x7) ) )
Example cont’d: Message Passing View
[Figure: the same factor graph, with the messages →µX3, ←µX5, and ←µX3 indicated on the edges.]
f̂3(x3) = ( max_{x1,x2} f1(x1) f2(x2) f3(x1, x2, x3) ) · ( max_{x4,x5} f4(x4) f5(x3, x4, x5) ( max_{x6,x7} f6(x5, x6, x7) f7(x7) ) ),
where the first factor is →µX3(x3), the inner maximum is ←µX5(x5), and the whole second factor is ←µX3(x3).
Example cont’d: Messages Everywhere
[Figure: the same factor graph with a message indicated on every edge.]
With →µX1(x1) := f1(x1), →µX2(x2) := f2(x2), etc., we have
→µX3(x3) = max_{x1,x2} →µX1(x1) →µX2(x2) f3(x1, x2, x3)
←µX5(x5) = max_{x6,x7} →µX7(x7) f6(x5, x6, x7)
←µX3(x3) = max_{x4,x5} →µX4(x4) ←µX5(x5) f5(x3, x4, x5)
The Max-Product Algorithm
[Figure: a generic node g with incoming edges Y1, . . . , Yn and an outgoing edge X.]
Max-product message computation rule:
→µX(x) = max_{y1, . . . , yn} g(x, y1, . . . , yn) →µY1(y1) · · · →µYn(yn)
Max-product theorem:
If the factor graph for some global function f has no cycles, then
f̂X(x) = →µX(x) ←µX(x).
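Compared with the sum-product sketch above, only the combination step changes from a sum to a maximum; a minimal sketch with the same assumed message representation:

from itertools import product

def max_product_update(g, incoming, out_alphabet):
    """Max-product rule: like sum_product_update above, but maximizing over y1, ..., yn
    (factors are assumed nonnegative, as in the slides)."""
    alphabets = [list(mu.keys()) for mu in incoming]
    out = {}
    for x in out_alphabet:
        best = 0.0
        for ys in product(*alphabets):
            weight = g(x, *ys)
            for mu, y in zip(incoming, ys):
                weight *= mu[y]
            best = max(best, weight)
        out[x] = best
    return out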
Outline
1. Introduction
2. The sum-product and max-product algorithms
3. More about factor graphs
4. Applications of sum-product & max-product
to hidden Markov models
5. Graphs with cycles, continuous variables, . . .
Terminology
[Figure: the factor graph from the introduction, with nodes fA, fB, fC and edges/half-edges x1, . . . , x5.]
Global function f = product of all factors; usually (but not always!)
real and nonnegative.
A configuration is an assignment of values to all variables.
The configuration space is the set of all configurations, which is the
domain of the global function.
A configuration ω = (x1, . . . , x5) is valid iff f(ω) ≠ 0.
Invalid Configurations Do Not Affect Marginals
A configuration ω = (x1, . . . , xn) is valid iff f(ω) ≠ 0.
Recall:
1. Marginalization: Compute
f̄k(xk) := Σ_{x1, . . . , xn except xk} f(x1, . . . , xn)
2. Maximization: Compute the “max-marginal”
f̂k(xk) := max_{x1, . . . , xn except xk} f(x1, . . . , xn),
assuming that f is real-valued and nonnegative and has a maximum.
Constraints may be expressed by factors that evaluate to 0 if the
constraint is violated.
Branching Point = Equality Constraint Factor
[Figure: a branching point (dot) on edge X is equivalent to an equality constraint node “=” with edges X, X′, X′′.]
The factor
f=(x, x′, x′′) := 1 if x = x′ = x′′, and 0 otherwise,
enforces X = X′ = X′′ for every valid configuration.
More generally,
f=(x, x′, x′′) := δ(x − x′) δ(x − x′′),
where δ denotes the Kronecker delta for discrete variables and the Dirac delta for continuous variables.
Example:
Independent Observations
Let Y1 and Y2 be two independent observations of X:
p(x, y1, y2) = p(x)p(y1|x)p(y2|x).
[Figure: factor graph with node pX on edge X, an equality constraint node “=”, and nodes pY1|X and pY2|X with half-edges Y1 and Y2.]
Literally, the factor graph represents an extended model
p(x, x′, x′′, y1, y2) = p(x) p(y1|x′) p(y2|x′′) f=(x, x′, x′′)
with the same marginals and max-marginals as p(x, y1, y2).
From A Priori to A Posteriori Probability
Example (cont’d): Let Y1 = y1 and Y2 = y2 be two independent
observations of X, i.e., p(x, y1, y2) = p(x)p(y1|x)p(y2|x).
For fixed y1 and y2, we have
p(x|y1, y2) = p(x, y1, y2) / p(y1, y2) ∝ p(x, y1, y2).
[Figure: the same factor graph, now with the observed values y1 and y2 attached to the half-edges.]
The factorization is unchanged (except for a scale factor).
Known variables will be denoted by small letters;
unknown variables will usually be denoted by capital letters.
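A small numerical sketch (all numbers are assumptions): the posterior is the product of the prior factor and the two likelihood factors, normalized at the end.

# Posterior p(x | y1, y2) ∝ p(x) p(y1|x) p(y2|x) for a binary X; all numbers are assumptions.
p_x = {0: 0.7, 1: 0.3}                     # prior p(x)
lik_y1 = {0: 0.2, 1: 0.9}                  # p(y1|x) for the observed y1, as a function of x
lik_y2 = {0: 0.4, 1: 0.5}                  # p(y2|x) for the observed y2, as a function of x

unnormalized = {x: p_x[x] * lik_y1[x] * lik_y2[x] for x in p_x}
scale = sum(unnormalized.values())         # this is p(y1, y2)
posterior = {x: v / scale for x, v in unnormalized.items()}
print(posterior)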
Outline
1. Introduction
2. The sum-product and max-product algorithms
3. More about factor graphs
4. Applications of sum-product & max-product
to hidden Markov models
5. Graphs with cycles, continuous variables, . . .
Example:
Hidden Markov Model
p(x0, x1, x2, . . . , xn, y1, y2, . . . , yn) = p(x0) Π_{k=1}^{n} p(xk|xk−1) p(yk|xk−1)
[Figure: factor graph of the hidden Markov model: a chain of equality nodes “=” carrying the state edges X0, X1, X2, . . . , with the factors p(x0), p(x1|x0), p(x2|x1), . . . along the chain and the factors p(y1|x0), p(y2|x1), . . . branching off toward the output edges Y1, Y2, . . .]
Sum-product algorithm applied to HMM:
Estimation of Current State
p(xn|y1, . . . , yn) = p(xn, y1, . . . , yn) / p(y1, . . . , yn)
∝ p(xn, y1, . . . , yn)
= Σ_{x0} · · · Σ_{xn−1} p(x0, x1, . . . , xn, y1, y2, . . . , yn)
= →µXn(xn).
For n = 2:
[Figure: the HMM factor graph for n = 2 with the observed values y1, y2 plugged in; the messages arriving from the open, unobserved part of the graph are ≡ 1.]
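A sketch of this forward sum-product sweep for a discrete HMM; the transition and emission tables are assumptions, and note that, as in the factorization above, yk is emitted from xk−1:

import numpy as np

def forward_messages(p0, P, E, ys):
    """Forward messages for the HMM p(x0) * prod_k p(xk|xk-1) p(yk|xk-1).

    p0: (S,) prior p(x0); P[i, j] = p(x_k = j | x_{k-1} = i); E[i, y] = p(y_k = y | x_{k-1} = i).
    Recursion: mu_Xk(xk) = sum_{xk-1} mu_Xk-1(xk-1) p(yk|xk-1) p(xk|xk-1).
    """
    mus = [np.asarray(p0, dtype=float)]
    for y in ys:
        mus.append((mus[-1] * E[:, y]) @ P)        # sum over x_{k-1}
    return mus                                     # unscaled: mus[-1].sum() equals p(y1, ..., yn)

# Assumed toy model: two states, two output symbols.
p0 = np.array([0.5, 0.5])
P = np.array([[0.9, 0.1], [0.2, 0.8]])
E = np.array([[0.7, 0.3], [0.1, 0.9]])
mu = forward_messages(p0, P, E, ys=[0, 1, 1])
print(mu[-1] / mu[-1].sum())                       # p(x_n | y_1, ..., y_n)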
Backward Message in Chain Rule Model
[Figure: node pX with edge X into node pY|X, which has edge Y.]
If Y = y is known (observed):
←µX(x) = pY|X(y|x),
the likelihood function.
If Y is unknown:
←µX(x) = Σ_y pY|X(y|x) = 1.
Sum-product algorithm applied to HMM:
Prediction of Next Output Symbol
p(yn+1|y1, . . . , yn) = p(y1, . . . , yn+1) / p(y1, . . . , yn)
∝ p(y1, . . . , yn+1)
= Σ_{x0,x1,...,xn} p(x0, x1, . . . , xn, y1, y2, . . . , yn, yn+1)
= →µYn+1(yn+1).
For n = 2:
[Figure: the HMM factor graph for n = 2 with y1, y2 observed; the message out of edge Y3 gives the prediction.]
Sum-product algorithm applied to HMM:
Estimation of Time-k State
p(xk | y1, y2, . . . , yn) = p(xk, y1, y2, . . . , yn) / p(y1, y2, . . . , yn)
∝ p(xk, y1, y2, . . . , yn)
= Σ_{x0, . . . , xn except xk} p(x0, x1, . . . , xn, y1, y2, . . . , yn)
= →µXk(xk) ←µXk(xk)
For k = 1:
[Figure: the HMM factor graph with y1, y2, y3 observed; the forward and backward messages on edge X1 are combined.]
Sum-product algorithm applied to HMM:
All States Simultaneously
p(xk |y1, . . . , yn) for all k:
[Figure: the HMM factor graph with y1, y2, y3 observed; forward and backward messages are passed along the entire chain.]
In this application, the sum-product algorithm coincides with the
Baum-Welch / BCJR forward-backward algorithm.
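A sketch of the full forward-backward sweep for the same assumed discrete model, with per-step scaling of the messages (see the “Scaling of Messages” slide below):

import numpy as np

def smooth(p0, P, E, ys):
    """Posterior p(x_k | y_1, ..., y_n) for all k, by scaled forward-backward sum-product sweeps."""
    n, S = len(ys), len(p0)
    fwd = np.zeros((n + 1, S))
    fwd[0] = p0
    for k, y in enumerate(ys, start=1):            # forward sweep
        fwd[k] = (fwd[k - 1] * E[:, y]) @ P
        fwd[k] /= fwd[k].sum()                     # free scaling, recovered by the final normalization
    bwd = np.ones((n + 1, S))                      # the message from the open future part of the graph is 1
    for k in range(n - 1, -1, -1):                 # backward sweep
        bwd[k] = E[:, ys[k]] * (P @ bwd[k + 1])
        bwd[k] /= bwd[k].sum()
    post = fwd * bwd                               # product of the two messages on each state edge
    return post / post.sum(axis=1, keepdims=True)

p0 = np.array([0.5, 0.5])                          # same assumed toy model as before
P = np.array([[0.9, 0.1], [0.2, 0.8]])
E = np.array([[0.7, 0.3], [0.1, 0.9]])
print(smooth(p0, P, E, ys=[0, 1, 1]))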
Scaling of Messages
In all the examples so far:
• The final result (such as →µXk(xk) ←µXk(xk)) equals the desired quantity (such as p(xk|y1, . . . , yn)) only up to a scale factor.
• The missing scale factor γ may be recovered at the end from the condition
Σ_{xk} γ →µXk(xk) ←µXk(xk) = 1.
• It follows that messages may be scaled freely along the way.
• Such message scaling is often mandatory to avoid numerical
problems.
Sum-product algorithm applied to HMM:
Probability of the Observation
p(y1, . . . , yn) = Σ_{x0} · · · Σ_{xn} p(x0, x1, . . . , xn, y1, y2, . . . , yn)
= Σ_{xn} →µXn(xn).
This is a number. Scale factors cannot be neglected in this case.
For n = 2:
[Figure: the HMM factor graph for n = 2 with y1, y2 observed; the messages from the unobserved part of the graph are ≡ 1.]
Max-product algorithm applied to HMM:
MAP Estimate of the State Trajectory
The estimate
(x̂0, . . . , x̂n)MAP = argmax_{x0,...,xn} p(x0, . . . , xn|y1, . . . , yn)
= argmax_{x0,...,xn} p(x0, . . . , xn, y1, . . . , yn)
may be obtained by computing
p̂k(xk) := max_{x0, . . . , xn except xk} p(x0, . . . , xn, y1, . . . , yn)
= →µXk(xk) ←µXk(xk)
for all k by forward-backward max-product sweeps.
In this example, the max-product algorithm is a time-symmetric
version of the Viterbi algorithm with soft output.
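A sketch of these forward-backward max-product sweeps for the same assumed discrete model; taking the argmax of each max-marginal recovers the MAP trajectory when it is unique:

import numpy as np

def max_product_marginals(p0, P, E, ys):
    """Max-marginals p_hat_k(x_k), up to scale, by forward-backward max-product sweeps."""
    n, S = len(ys), len(p0)
    fwd = np.zeros((n + 1, S))
    fwd[0] = p0
    for k, y in enumerate(ys, start=1):
        # forward: max over x_{k-1} of fwd(x_{k-1}) p(y_k|x_{k-1}) p(x_k|x_{k-1})
        fwd[k] = np.max((fwd[k - 1] * E[:, y])[:, None] * P, axis=0)
        fwd[k] /= fwd[k].max()                     # free scaling
    bwd = np.ones((n + 1, S))
    for k in range(n - 1, -1, -1):
        # backward: max over x_{k+1} of p(y_{k+1}|x_k) p(x_{k+1}|x_k) bwd(x_{k+1})
        bwd[k] = E[:, ys[k]] * np.max(P * bwd[k + 1][None, :], axis=1)
        bwd[k] /= bwd[k].max()
    return fwd * bwd

p0 = np.array([0.5, 0.5])                          # same assumed toy model as before
P = np.array([[0.9, 0.1], [0.2, 0.8]])
E = np.array([[0.7, 0.3], [0.1, 0.9]])
p_hat = max_product_marginals(p0, P, E, ys=[0, 1, 1])
print(p_hat.argmax(axis=1))                        # MAP state sequence x_0, ..., x_n (if unique)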
Max-product algorithm applied to HMM:
MAP Estimate of the State Trajectory cont’d
Computing
p̂k(xk) := max_{x0, . . . , xn except xk} p(x0, . . . , xn, y1, . . . , yn)
= →µXk(xk) ←µXk(xk)
simultaneously for all k:
[Figure: the HMM factor graph with y1, y2, y3 observed; forward and backward max-product messages are passed along the entire chain.]
Marginals and Output Edges
Marginals such as →µX(x) ←µX(x) may be viewed as messages out of an
“output half-edge” (without incoming message):
[Figure: edge X is split by an equality node “=” into edges X′, X′′ and an output half-edge X̄.]
→µX̄(x) = ∫∫ →µX′(x′) ←µX′′(x′′) δ(x − x′) δ(x − x′′) dx′ dx′′
= →µX′(x) ←µX′′(x)
=⇒ Marginals are computed like messages out of “=”-nodes.
Outline
1. Introduction
2. The sum-product and max-product algorithms
3. More about factor graphs
4. Applications of sum-product & max-product
to hidden Markov models
5. Graphs with cycles, continuous variables, . . .
Factor Graphs with Cycles? Continuous Variables?
[Figure: a factor graph with messages on its edges, as before.]
Example:
Error Correcting Codes (e.g. LDPC Codes)
Codeword x = (x1, . . . , xn) ∈ C ⊂ {0, 1}^n. Received y = (y1, . . . , yn) ∈ R^n.
p(x, y) ∝ p(y|x) δC(x).
[Figure: factor graph of the code and channel: parity-check nodes ⊕ at the top, joined by “random” connections to equality nodes “=” for the code symbols X1, . . . , Xn, each of which is attached to its channel factor with observed output y1, . . . , yn.]
Parity-check function:
f⊕(z1, . . . , zm) := 1 if z1 + · · · + zm ≡ 0 (mod 2), and 0 otherwise.
Code membership function:
δC(x) := 1 if x ∈ C, and 0 otherwise.
Channel model: p(y|x), e.g., p(y|x) = Π_{ℓ=1}^{n} p(yℓ|xℓ).
Decoding by iterative sum-product message passing.
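A sketch of single sum-product updates at a parity-check node ⊕ and at an equality node “=”, with each message represented as a pair (p0, p1); the representation and the numbers are assumptions:

from itertools import product

def check_node_update(incoming):
    """Sum-product message out of a parity-check node, given the messages on its other
    edges; every message is a pair (p0, p1)."""
    out = [0.0, 0.0]
    for zs in product((0, 1), repeat=len(incoming)):
        weight = 1.0
        for mu, z in zip(incoming, zs):
            weight *= mu[z]
        out[sum(zs) % 2] += weight                 # f_xor allows x only if x + z_1 + ... + z_m is even
    return tuple(out)

def equality_node_update(incoming):
    """Sum-product message out of an "=" node: the componentwise product of the other messages."""
    p0 = p1 = 1.0
    for mu in incoming:
        p0 *= mu[0]
        p1 *= mu[1]
    return (p0, p1)

# Assumed example: two incoming messages at a degree-3 check node and at a degree-3 "=" node.
print(check_node_update([(0.9, 0.1), (0.6, 0.4)]))     # (0.58, 0.42)
print(equality_node_update([(0.9, 0.1), (0.6, 0.4)]))  # (0.54, 0.04), typically rescaled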
Example:
Magnets, Spin Glasses, etc.
Configuration x = (x1, . . . , xn), xk ∈ {0, 1}.
“Energy” w(x) = Σ_k βk xk + Σ_{neighboring pairs (k, ℓ)} βk,ℓ (xk − xℓ)^2
p(x) ∝ e^{−w(x)} = Π_k e^{−βk xk} · Π_{neighboring pairs (k, ℓ)} e^{−βk,ℓ (xk − xℓ)^2}
[Figure: two-dimensional grid-shaped factor graph with an equality node “=” for each variable and a pairwise factor between each pair of neighbors.]
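A small sketch of this energy and the corresponding unnormalized probability on an assumed 2×2 grid with constant coefficients:

import itertools
import math

# "Energy" w(x) and unnormalized p(x) ∝ exp(-w(x)) on an assumed 2x2 grid with
# constant coefficients beta_k = 0.3 and beta_{k,l} = 1.0.
ROWS, COLS = 2, 2
BETA_NODE, BETA_PAIR = 0.3, 1.0

def neighboring_pairs():
    for r, c in itertools.product(range(ROWS), range(COLS)):
        if r + 1 < ROWS:
            yield (r, c), (r + 1, c)
        if c + 1 < COLS:
            yield (r, c), (r, c + 1)

def energy(x):                                     # x maps each site (r, c) to 0 or 1
    w = sum(BETA_NODE * x[s] for s in x)
    w += sum(BETA_PAIR * (x[a] - x[b]) ** 2 for a, b in neighboring_pairs())
    return w

x = {(0, 0): 1, (0, 1): 1, (1, 0): 0, (1, 1): 1}
print(energy(x), math.exp(-energy(x)))             # w(x) and the unnormalized p(x)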
Example:
Hidden Markov Model with Parameter(s)
p(x0, x1, x2, . . . , xn, y1, y2, . . . , yn | θ) = p(x0) Π_{k=1}^{n} p(xk|xk−1, θ) p(yk|xk−1)
[Figure: the HMM factor graph extended by a parameter edge Θ, which is connected through a chain of equality nodes “=” to every transition factor p(xk|xk−1, θ).]
Example:
Least-Squares Problems
Minimizing Σ_{k=1}^{n} xk^2 subject to (linear or nonlinear) constraints is equivalent to maximizing
e^{−Σ_{k=1}^{n} xk^2} = Π_{k=1}^{n} e^{−xk^2}
subject to the given constraints.
[Figure: a “constraints” node with edges X1, . . . , Xn, each attached to a factor e^{−xk^2}.]
General Linear State Space Model
[Figure: factor graph of the linear state-space model: at each step k, the input Uk enters through a node Bk, the previous state Xk−1 passes through a node Ak, both feed an adder node “+”, and an equality node “=” branches the new state Xk off through a node Ck to the output Yk.]
Xk = Ak Xk−1 + Bk Uk
Yk = Ck Xk
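A sketch of one forward Gaussian message step for this model, parameterized by mean and covariance; the Gaussian input Uk ~ N(0, Q) and the additive observation noise with covariance R are assumptions layered on top of the slide (the figure shows Yk = Ck Xk without noise):

import numpy as np

def kalman_forward_step(m, V, A, B, Q, C, y, R):
    """One forward Gaussian message step for X_k = A X_{k-1} + B U_k, observed through C.

    The message on the previous state edge is N(m, V); U_k ~ N(0, Q) and an additive
    observation noise with covariance R are assumed (the noise is not shown on the slide).
    """
    m_pred = A @ m                                 # prediction through the A, B, '+' nodes
    V_pred = A @ V @ A.T + B @ Q @ B.T
    S = C @ V_pred @ C.T + R                       # combine with the message from the observation
    K = V_pred @ C.T @ np.linalg.inv(S)            # Kalman gain
    m_new = m_pred + K @ (y - C @ m_pred)
    V_new = V_pred - K @ C @ V_pred
    return m_new, V_new

# Assumed scalar example.
m, V = np.array([0.0]), np.array([[1.0]])
A = B = C = np.array([[1.0]])
Q, R = np.array([[0.1]]), np.array([[0.5]])
print(kalman_forward_step(m, V, A, B, Q, C, np.array([1.2]), R))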
Gaussian Message Passing in Linear Models
encompasses much of classical signal processing and appears often
as a component of more complex problems/algorithms.
Note:
1. Gaussianity of messages is preserved in linear models.
2. Includes Kalman filtering and recursive least-squares algorithms.
3. For Gaussian messages, the sum-product (integral-product) algorithm coincides with the max-product algorithm.
4. For jointly Gaussian random variables,
MAP estimation = MMSE estimation = LMMSE estimation.
5. Even if X and Y are not jointly Gaussian, the LMMSE estimate
of X from Y = y may be obtained by pretending that X and Y
are jointly Gaussian (with their actual means and second-order
moments) and forming the corresponding MAP estimate.
See “The factor graph approach to model-based signal processing,”
Proceedings of the IEEE, June 2007.
Continuous Variables: Message Types
The following message types are widely applicable.
• Quantization of messages into discrete bins. Infeasible in higher
dimensions.
• Single point: the message µ(x) is replaced by a temporary or
final estimate x̂.
• Function value and gradient at a point selected by the receiving
node. Allows steepest-ascent (descent) algorithms.
• Gaussians. Works for Kalman filtering.
• Gaussian mixtures.
• List of samples: a pdf can be represented by a list of samples.
This data type is the basis of particle filters, but it can be used
as a general data type in a graphical model.
• Compound messages: the “product” of other message types.
Particle Methods as Message Passing
Basic idea: represent a probability density f by a list
L = ( (x̂(1), w(1)), . . . , (x̂(L), w(L)) )
of weighted samples (“particles”):
[Figure: a probability density function f : R → R+ and its representation as a list of particles.]
• Versatile data type for sum-product, max-product, EM, . . .
• Not sensitive to dimensionality.
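A small sketch of this data type: a density represented by weighted samples obtained by importance sampling, used here to approximate an expectation; the target and proposal densities are assumptions:

import math
import random

def draw_particles(proposal_sample, target_pdf, proposal_pdf, num):
    """Represent the target density by a list of weighted samples ("particles")."""
    particles = []
    for _ in range(num):
        x = proposal_sample()
        particles.append((x, target_pdf(x) / proposal_pdf(x)))   # importance weight
    return particles

def expectation(particles, h):
    """Approximate E[h(X)] under the density that the particle list represents."""
    total = sum(w for _, w in particles)
    return sum(w * h(x) for x, w in particles) / total

# Assumed example: a standard Gaussian represented via a uniform proposal on [-5, 5].
target = lambda x: math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)
particles = draw_particles(lambda: random.uniform(-5, 5), target, lambda x: 0.1, 2000)
print(expectation(particles, lambda x: x * x))                   # should be close to 1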
On Factor Graphs with Cycles
• Generally iterative algorithms.
• For example, alternating maximization
x̂new = argmax_x f(x, ŷ) and ŷnew = argmax_y f(x̂, y),
using the max-product algorithm in each iteration (a small sketch follows after this list).
• Iterative sum-product message passing gives excellent results
for maximization(!) in some applications (e.g., the decoding of
error correcting codes).
• Many other useful algorithms can be formulated in message
passing form (e.g., J. Dauwels): gradient ascent, Gibbs sampling, expectation maximization, variational methods, clustering
by “affinity propagation” (B. Frey) . . .
• Factor graphs make it easy to mix and match different algorithmic techniques.
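A minimal sketch of the alternating maximization mentioned above, with the two argmax steps done by brute force over an assumed small table (the slides would use the max-product algorithm for each step):

# Minimal sketch: alternating maximization on a small discrete function f(x, y)
# given as a table (the table values below are assumptions).
f = {(x, y): (1 + x * y) * (2 - 0.5 * abs(x - y)) for x in range(3) for y in range(3)}

x_hat, y_hat = 0, 0
for _ in range(10):                                              # iterate a few rounds
    x_hat = max(range(3), key=lambda x: f[(x, y_hat)])           # x_new = argmax_x f(x, y_hat)
    y_hat = max(range(3), key=lambda y: f[(x_hat, y)])           # y_new = argmax_y f(x_hat, y)
print(x_hat, y_hat, f[(x_hat, y_hat)])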
End of this talk.