An Introduction to Factor Graphs
Hans-Andrea Loeliger
MLSB 2008, Copenhagen

Definition
A factor graph represents the factorization of a function of several variables. We use Forney-style factor graphs (Forney, 2001).
Example: f(x1, x2, x3, x4, x5) = fA(x1, x2, x3) · fB(x3, x4, x5) · fC(x4).
[Figure: factor graph with nodes fA, fB, fC and edges x1, x2, x3, x4, x5.]
Rules:
• A node for every factor.
• An edge or half-edge for every variable.
• Node g is connected to edge x iff variable x appears in factor g.
(What if some variable appears in more than 2 factors?)

Example: Markov Chain
   pXYZ(x, y, z) = pX(x) pY|X(y|x) pZ|Y(z|y).
[Figure: chain factor graph with nodes pX, pY|X, pZ|Y and edges X, Y, Z.]
We will often use capital letters for the variables. (Why?)
Further examples will come later.

Message Passing Algorithms
Message passing algorithms operate by passing messages along the edges of a factor graph.
[Figure: factor graph with arrows indicating messages flowing along the edges.]

A main point of factor graphs (and similar graphical notations):
A Unified View of Historically Different Things
Statistical physics:
- Markov random fields (Ising 1925)
Signal processing:
- linear state-space models and Kalman filtering: Kalman 1960...
- recursive least-squares adaptive filters
- Hidden Markov models: Baum et al. 1966...
- unification: Levy et al. 1996...
Error correcting codes:
- Low-density parity-check codes: Gallager 1962; Tanner 1981; MacKay 1996; Luby et al. 1998...
- Convolutional codes and Viterbi decoding: Forney 1973...
- Turbo codes: Berrou et al. 1993...
Machine learning, statistics:
- Bayesian networks: Pearl 1988; Shachter 1988; Lauritzen and Spiegelhalter 1988; Shafer and Shenoy 1990...

Other Notation Systems for Graphical Models
Example: p(u, w, x, y, z) = p(u) p(w) p(x|u, w) p(y|x) p(z|x).
[Figure: the same model drawn in four notations — Forney-style factor graph, original factor graph [FKLW 1997], Bayesian network, Markov random field.]

Outline
1. Introduction
2. The sum-product and max-product algorithms
3. More about factor graphs
4. Applications of sum-product & max-product to hidden Markov models
5. Graphs with cycles, continuous variables, ...

The Two Basic Problems
1. Marginalization: Compute
   f̄k(xk) ≜ Σ_{x1,...,xn except xk} f(x1,...,xn)
2. Maximization: Compute the "max-marginal"
   f̂k(xk) ≜ max_{x1,...,xn except xk} f(x1,...,xn),
assuming that f is real-valued and nonnegative and has a maximum.
Note that
   argmax_{x1,...,xn} f(x1,...,xn) = ( argmax_{x1} f̂1(x1), ..., argmax_{xn} f̂n(xn) ).
For large n, both problems are in general intractable even for x1,...,xn ∈ {0, 1}.

Factorization Helps
For example, if f(x1,...,xn) = f1(x1) f2(x2) ··· fn(xn), then
   f̄k(xk) = ( Σ_{x1} f1(x1) ) ··· ( Σ_{xk−1} fk−1(xk−1) ) fk(xk) ( Σ_{xk+1} fk+1(xk+1) ) ··· ( Σ_{xn} fn(xn) )
and
   f̂k(xk) = ( max_{x1} f1(x1) ) ··· ( max_{xk−1} fk−1(xk−1) ) fk(xk) ( max_{xk+1} fk+1(xk+1) ) ··· ( max_{xn} fn(xn) ).
Factorization helps also beyond this trivial example.

Towards the sum-product algorithm:
Computing Marginals—A Generic Example
Assume we wish to compute
   f̄3(x3) = Σ_{x1,...,x7 except x3} f(x1,...,x7)
and assume that f can be factored as follows:
   f(x1,...,x7) = f1(x1) f2(x2) f3(x1, x2, x3) f4(x4) f5(x3, x4, x5) f6(x5, x6, x7) f7(x7).
[Figure: tree-shaped factor graph with nodes f1,...,f7 and edges X1,...,X7.]

Example cont'd: Closing Boxes by the Distributive Law
[Figure: the same factor graph with boxes drawn around the subtrees on either side of edge X3.]
   f̄3(x3) = ( Σ_{x1,x2} f1(x1) f2(x2) f3(x1, x2, x3) )
           · ( Σ_{x4,x5} f4(x4) f5(x3, x4, x5) ( Σ_{x6,x7} f6(x5, x6, x7) f7(x7) ) )

Example cont'd: Message Passing View
[Figure: the same factor graph with messages →µX3(x3) and ←µX3(x3) meeting on edge X3.]
   f̄3(x3) = ( Σ_{x1,x2} f1(x1) f2(x2) f3(x1, x2, x3) )   [this factor is →µX3(x3)]
           · ( Σ_{x4,x5} f4(x4) f5(x3, x4, x5) ( Σ_{x6,x7} f6(x5, x6, x7) f7(x7) ) )   [the inner sum is ←µX5(x5); the outer factor is ←µX3(x3)]

Example cont'd: Messages Everywhere
[Figure: the same factor graph with a message on every edge.]
With →µX1(x1) ≜ f1(x1), →µX2(x2) ≜ f2(x2), etc., we have
   →µX3(x3) = Σ_{x1,x2} →µX1(x1) →µX2(x2) f3(x1, x2, x3)
   ←µX5(x5) = Σ_{x6,x7} →µX7(x7) f6(x5, x6, x7)
   ←µX3(x3) = Σ_{x4,x5} →µX4(x4) ←µX5(x5) f5(x3, x4, x5)
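[Editor's sketch, not part of the slides.] The generic example above can be checked numerically. The sketch below assumes binary variables and random nonnegative tables for f1, ..., f7 (purely illustrative choices), computes the three messages exactly as on the "Messages Everywhere" slide, and verifies that their product equals the brute-force marginal f̄3(x3).

```python
# Minimal numerical sketch of the marginalization example (slides
# "Computing Marginals" through "Messages Everywhere").  Binary variables
# and random factor tables are illustrative assumptions, not from the talk.
import numpy as np

rng = np.random.default_rng(0)
f1 = rng.random(2)            # f1(x1)
f2 = rng.random(2)            # f2(x2)
f3 = rng.random((2, 2, 2))    # f3(x1, x2, x3)
f4 = rng.random(2)            # f4(x4)
f5 = rng.random((2, 2, 2))    # f5(x3, x4, x5)
f6 = rng.random((2, 2, 2))    # f6(x5, x6, x7)
f7 = rng.random(2)            # f7(x7)

# Messages, exactly as on the "Messages Everywhere" slide
# (einsum indices: a=x1, b=x2, c=x3, d=x4, e=x5, f=x6, g=x7):
mu_fwd_X3 = np.einsum('a,b,abc->c', f1, f2, f3)        # ->mu_X3(x3)
mu_bwd_X5 = np.einsum('g,efg->e', f7, f6)              # <-mu_X5(x5)
mu_bwd_X3 = np.einsum('d,e,cde->c', f4, mu_bwd_X5, f5) # <-mu_X3(x3)

marginal_msg = mu_fwd_X3 * mu_bwd_X3                   # f_bar_3(x3)

# Brute-force check: sum f over all variables except x3.
f_full = np.einsum('a,b,abc,d,cde,efg,g->abcdefg', f1, f2, f3, f4, f5, f6, f7)
marginal_brute = f_full.sum(axis=(0, 1, 3, 4, 5, 6))

assert np.allclose(marginal_msg, marginal_brute)
print(marginal_msg)
```

The brute-force sum costs 2^7 terms, while the message computations only ever sum over the variables attached to one node; that is exactly the saving the distributive law ("closing boxes") provides.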
The Sum-Product Algorithm (Belief Propagation)
[Figure: generic node g with incoming edges Y1, ..., Yn and outgoing edge X.]
Sum-product message computation rule:
   →µX(x) = Σ_{y1,...,yn} g(x, y1, ..., yn) →µY1(y1) ··· →µYn(yn)
Sum-product theorem: If the factor graph for some function f has no cycles, then
   f̄X(x) = →µX(x) ←µX(x).

Towards the max-product algorithm:
Computing Max-Marginals—A Generic Example
Assume we wish to compute
   f̂3(x3) = max_{x1,...,x7 except x3} f(x1,...,x7)
and assume that f can be factored as follows:
   f(x1,...,x7) = f1(x1) f2(x2) f3(x1, x2, x3) f4(x4) f5(x3, x4, x5) f6(x5, x6, x7) f7(x7).
[Figure: the same tree-shaped factor graph with nodes f1,...,f7 and edges X1,...,X7.]

Example: Closing Boxes by the Distributive Law
[Figure: the same factor graph with boxes drawn around the subtrees on either side of edge X3.]
   f̂3(x3) = ( max_{x1,x2} f1(x1) f2(x2) f3(x1, x2, x3) )
           · ( max_{x4,x5} f4(x4) f5(x3, x4, x5) ( max_{x6,x7} f6(x5, x6, x7) f7(x7) ) )

Example cont'd: Message Passing View
[Figure: the same factor graph with messages →µX3(x3) and ←µX3(x3) meeting on edge X3.]
   f̂3(x3) = ( max_{x1,x2} f1(x1) f2(x2) f3(x1, x2, x3) )   [this factor is →µX3(x3)]
           · ( max_{x4,x5} f4(x4) f5(x3, x4, x5) ( max_{x6,x7} f6(x5, x6, x7) f7(x7) ) )   [the inner max is ←µX5(x5); the outer factor is ←µX3(x3)]

Example cont'd: Messages Everywhere
[Figure: the same factor graph with a message on every edge.]
With →µX1(x1) ≜ f1(x1), →µX2(x2) ≜ f2(x2), etc., we have
   →µX3(x3) = max_{x1,x2} →µX1(x1) →µX2(x2) f3(x1, x2, x3)
   ←µX5(x5) = max_{x6,x7} →µX7(x7) f6(x5, x6, x7)
   ←µX3(x3) = max_{x4,x5} →µX4(x4) ←µX5(x5) f5(x3, x4, x5)

The Max-Product Algorithm
[Figure: generic node g with incoming edges Y1, ..., Yn and outgoing edge X.]
Max-product message computation rule:
   →µX(x) = max_{y1,...,yn} g(x, y1, ..., yn) →µY1(y1) ··· →µYn(yn)
Max-product theorem: If the factor graph for some global function f has no cycles, then
   f̂X(x) = →µX(x) ←µX(x).

Outline
1. Introduction
2. The sum-product and max-product algorithms
3. More about factor graphs
4. Applications of sum-product & max-product to hidden Markov models
5. Graphs with cycles, continuous variables, ...

Terminology
[Figure: factor graph with nodes fA, fB, fC and edges x1, ..., x5.]
Global function f = product of all factors; usually (but not always!) real and nonnegative.
A configuration is an assignment of values to all variables.
The configuration space is the set of all configurations, which is the domain of the global function.
A configuration ω = (x1, ..., x5) is valid iff f(ω) ≠ 0.

Invalid Configurations Do Not Affect Marginals
A configuration ω = (x1, ..., xn) is valid iff f(ω) ≠ 0.
Recall:
1. Marginalization: Compute
   f̄k(xk) ≜ Σ_{x1,...,xn except xk} f(x1,...,xn)
2. Maximization: Compute the "max-marginal"
   f̂k(xk) ≜ max_{x1,...,xn except xk} f(x1,...,xn),
assuming that f is real-valued and nonnegative and has a maximum.
Constraints may be expressed by factors that evaluate to 0 if the constraint is violated.

Branching Point = Equality Constraint Factor
[Figure: a branching point on edge X is replaced by an "=" node with edges X, X′, X″.]
The factor
   f=(x, x′, x″) ≜ 1 if x = x′ = x″, and 0 otherwise,
enforces X = X′ = X″ for every valid configuration.
More generally,
   f=(x, x′, x″) ≜ δ(x − x′) δ(x − x″),
where δ denotes the Kronecker delta for discrete variables and the Dirac delta for continuous variables.
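[Editor's sketch, not part of the slides.] For discrete variables, the generic message computation rule of the sum-product and max-product slides is a single tensor contraction. The helper below (the name node_message and the axis convention are illustrative assumptions) implements both rules for a node g(x, y1, ..., yn) stored as a numpy array, and then checks the equality-constraint node: its sum-product output message is simply the pointwise product of the incoming messages.

```python
# Generic sum-product / max-product message out of one node, assuming
# discrete variables and a factor table g with axes ordered (x, y1, ..., yn).
import numpy as np

def node_message(g, incoming, mode="sum"):
    """Outgoing message ->mu_X(x) of a node g(x, y1, ..., yn),
    given incoming messages ->mu_Y1, ..., ->mu_Yn."""
    t = np.asarray(g, dtype=float).copy()
    for k, mu in enumerate(incoming, start=1):
        shape = [1] * t.ndim
        shape[k] = len(mu)
        t = t * np.asarray(mu).reshape(shape)   # multiply message onto axis k
    axes = tuple(range(1, t.ndim))
    return t.sum(axis=axes) if mode == "sum" else t.max(axis=axes)

# Equality-constraint node f=(x, x', x'') for a ternary variable:
n = 3
f_eq = np.zeros((n, n, n))
for i in range(n):
    f_eq[i, i, i] = 1.0

mu1 = np.array([0.2, 0.5, 0.3])
mu2 = np.array([0.6, 0.1, 0.3])
out = node_message(f_eq, [mu1, mu2], mode="sum")
# For the "=" node, the outgoing message is the pointwise product
# of the incoming messages:
assert np.allclose(out, mu1 * mu2)
print(out)
```

Switching mode="sum" to mode="max" turns the same routine into the max-product rule; nothing else changes, which is the point of treating both algorithms in one framework.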
Example: Independent Observations
Let Y1 and Y2 be two independent observations of X:
   p(x, y1, y2) = p(x) p(y1|x) p(y2|x).
[Figure: factor graph with node pX, edge X entering an "=" node, and nodes pY1|X, pY2|X with half-edges Y1, Y2.]
Literally, the factor graph represents an extended model
   p(x, x′, x″, y1, y2) = p(x) p(y1|x′) p(y2|x″) f=(x, x′, x″)
with the same marginals and max-marginals as p(x, y1, y2).

From A Priori to A Posteriori Probability
Example (cont'd): Let Y1 = y1 and Y2 = y2 be two independent observations of X, i.e.,
   p(x, y1, y2) = p(x) p(y1|x) p(y2|x).
For fixed y1 and y2, we have
   p(x|y1, y2) = p(x, y1, y2) / p(y1, y2) ∝ p(x, y1, y2).
[Figure: the same factor graph with the half-edges fixed to the observed values y1 and y2.]
The factorization is unchanged (except for a scale factor).
Known variables will be denoted by small letters; unknown variables will usually be denoted by capital letters.

Outline
1. Introduction
2. The sum-product and max-product algorithms
3. More about factor graphs
4. Applications of sum-product & max-product to hidden Markov models
5. Graphs with cycles, continuous variables, ...

Example: Hidden Markov Model
   p(x0, x1, x2, ..., xn, y1, y2, ..., yn) = p(x0) ∏_{k=1}^{n} p(xk|xk−1) p(yk|xk−1)
[Figure: chain of "=" nodes carrying the states X0, X1, X2, ..., connected by transition factors p(xk|xk−1), with observation factors p(yk|xk−1) hanging off toward the half-edges Y1, Y2, ...]

Sum-product algorithm applied to HMM:
Estimation of Current State
   p(xn|y1, ..., yn) = p(xn, y1, ..., yn) / p(y1, ..., yn)
                     ∝ p(xn, y1, ..., yn)
                     = Σ_{x0} ··· Σ_{xn−1} p(x0, x1, ..., xn, y1, y2, ..., yn)
                     = →µXn(xn).
[Figure, for n = 2: the HMM factor graph with y1 and y2 observed, forward messages flowing toward Xn, and backward messages equal to 1 entering from the open right end and from the unobserved Y3.]

Backward Message in Chain Rule Model
[Figure: node pX with edge X into node pY|X with half-edge Y.]
If Y = y is known (observed):
   ←µX(x) = pY|X(y|x), the likelihood function.
If Y is unknown:
   ←µX(x) = Σ_y pY|X(y|x) = 1.

Sum-product algorithm applied to HMM:
Prediction of Next Output Symbol
   p(yn+1|y1, ..., yn) = p(y1, ..., yn+1) / p(y1, ..., yn)
                       ∝ p(y1, ..., yn+1)
                       = Σ_{x0,x1,...,xn} p(x0, x1, ..., xn, y1, y2, ..., yn, yn+1)
                       = →µYn+1(yn+1).
[Figure, for n = 2: the HMM factor graph with y1 and y2 observed and the forward message emerging on the half-edge Y3.]

Sum-product algorithm applied to HMM:
Estimation of Time-k State
   p(xk|y1, y2, ..., yn) = p(xk, y1, y2, ..., yn) / p(y1, y2, ..., yn)
                         ∝ p(xk, y1, y2, ..., yn)
                         = Σ_{x0,...,xn except xk} p(x0, x1, ..., xn, y1, y2, ..., yn)
                         = →µXk(xk) ←µXk(xk).
[Figure, for k = 1: the HMM factor graph with y1, y2, y3 observed and a forward and a backward message meeting on edge X1.]

Sum-product algorithm applied to HMM:
All States Simultaneously
p(xk|y1, ..., yn) for all k:
[Figure: the HMM factor graph with forward and backward messages on every state edge.]
In this application, the sum-product algorithm coincides with the Baum-Welch / BCJR forward-backward algorithm.

Scaling of Messages
In all the examples so far:
• The final result (such as →µXk(xk) ←µXk(xk)) equals the desired quantity (such as p(xk|y1, ..., yn)) only up to a scale factor.
• The missing scale factor γ may be recovered at the end from the condition
   Σ_{xk} γ →µXk(xk) ←µXk(xk) = 1.
• It follows that messages may be scaled freely along the way.
• Such message scaling is often mandatory to avoid numerical problems.

Sum-product algorithm applied to HMM:
Probability of the Observation
   p(y1, ..., yn) = Σ_{x0} ··· Σ_{xn} p(x0, x1, ..., xn, y1, y2, ..., yn)
                  = Σ_{xn} →µXn(xn).
This is a number. Scale factors cannot be neglected in this case.
[Figure, for n = 2: the HMM factor graph with y1 and y2 observed and backward messages equal to 1 entering from the right.]
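[Editor's sketch, not part of the slides.] The HMM slides above translate directly into a scaled forward-backward sweep. The sketch below follows the factorization used here, p(x0) ∏ p(xk|xk−1) p(yk|xk−1), so the observation at time k depends on the previous state; the 2-state, 3-symbol model and the observation sequence are arbitrary illustrative assumptions. Messages are normalized at every step (free scaling), and the probability of the observation is recovered from the scale factors.

```python
# Scaled forward-backward sum-product sweep on a small HMM,
# following the slides' factorization p(x0) * prod_k p(xk|xk-1) p(yk|xk-1).
import numpy as np

p0 = np.array([0.6, 0.4])                           # p(x0)
A  = np.array([[0.9, 0.1], [0.2, 0.8]])             # A[i, j] = p(xk = j | xk-1 = i)
B  = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])   # B[i, y] = p(yk = y | xk-1 = i)
y  = [0, 2, 1, 2]                                   # observed y1, ..., yn
n  = len(y)

# Forward messages ->mu_Xk(xk), scaled to sum to 1 at every step.
alpha, scale = [p0.copy()], []
for k in range(n):
    a = (alpha[-1] * B[:, y[k]]) @ A     # sum over xk-1
    scale.append(a.sum())
    alpha.append(a / a.sum())            # messages may be scaled freely

# Backward messages <-mu_Xk(xk), starting from the all-ones message.
beta = [np.ones(2)]
for k in range(n - 1, -1, -1):
    b = B[:, y[k]] * (A @ beta[0])       # sum over xk+1
    beta.insert(0, b / b.sum())

# Posterior state marginals p(xk | y1..yn), normalized at the end:
posteriors = [a * b / np.sum(a * b) for a, b in zip(alpha, beta)]
print(np.array(posteriors))

# Probability of the observation, recovered from the scale factors:
print("p(y1..yn) =", np.prod(scale))
```

Replacing the sums by maximizations (and keeping track of the argmax) turns the same sweep into the max-product / Viterbi computation of the next slide.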
Max-product algorithm applied to HMM:
MAP Estimate of the State Trajectory
The estimate
   (x̂0, ..., x̂n)MAP = argmax_{x0,...,xn} p(x0, ..., xn | y1, ..., yn)
                      = argmax_{x0,...,xn} p(x0, ..., xn, y1, ..., yn)
may be obtained by computing
   p̂k(xk) ≜ max_{x0,...,xn except xk} p(x0, ..., xn, y1, ..., yn)
           = →µXk(xk) ←µXk(xk)
for all k by forward-backward max-product sweeps.
In this example, the max-product algorithm is a time-symmetric version of the Viterbi algorithm with soft output.

Max-product algorithm applied to HMM:
MAP Estimate of the State Trajectory cont'd
Computing
   p̂k(xk) ≜ max_{x0,...,xn except xk} p(x0, ..., xn, y1, ..., yn)
           = →µXk(xk) ←µXk(xk)
simultaneously for all k:
[Figure: the HMM factor graph with y1, y2, y3 observed and forward and backward max-product messages on every state edge.]

Marginals and Output Edges
Marginals such as →µX(x) ←µX(x) may be viewed as messages out of an "output half-edge" X̄ (without incoming message):
[Figure: edge X replaced by an "=" node with edges X′, X″ and output half-edge X̄.]
   →µX̄(x) = ∫∫ →µX′(x′) ←µX″(x″) δ(x − x′) δ(x − x″) dx′ dx″
           = →µX′(x) ←µX″(x)
⇒ Marginals are computed like messages out of "="-nodes.

Outline
1. Introduction
2. The sum-product and max-product algorithms
3. More about factor graphs
4. Applications of sum-product & max-product to hidden Markov models
5. Graphs with cycles, continuous variables, ...

Factor Graphs with Cycles? Continuous Variables?
[Figure: a factor graph with cycles, with messages on its edges.]

Example: Error Correcting Codes (e.g. LDPC Codes)
Codeword x = (x1, ..., xn) ∈ C ⊂ {0, 1}^n. Received y = (y1, ..., yn) ∈ R^n.
   p(x, y) ∝ p(y|x) δC(x).
[Figure: LDPC factor graph — parity-check nodes ⊕ with "random" connections to "=" nodes on the code bits X1, ..., Xn, which connect to the channel model p(y|x) below.]
Parity-check function:
   f⊕(z1, ..., zm) = 1 if z1 + ... + zm ≡ 0 (mod 2), and 0 otherwise.
Code membership function:
   δC(x) = 1 if x ∈ C, and 0 otherwise.
Channel model p(y|x), e.g.,
   p(y|x) = ∏_{ℓ=1}^{n} p(yℓ|xℓ).
Decoding by iterative sum-product message passing.

Example: Magnets, Spin Glasses, etc.
Configuration x = (x1, ..., xn), xk ∈ {0, 1}.
"Energy"
   w(x) = Σ_k βk xk + Σ_{neighboring pairs (k,ℓ)} βk,ℓ (xk − xℓ)²
   p(x) ∝ e^{−w(x)} = ∏_k e^{−βk xk} · ∏_{neighboring pairs (k,ℓ)} e^{−βk,ℓ (xk − xℓ)²}
[Figure: grid-shaped factor graph with single-variable factors and pairwise factors between neighboring variables.]

Example: Hidden Markov Model with Parameter(s)
   p(x0, x1, x2, ..., xn, y1, y2, ..., yn | θ) = p(x0) ∏_{k=1}^{n} p(xk|xk−1, θ) p(yk|xk−1)
[Figure: the HMM factor graph with an additional parameter edge Θ, branched by "=" nodes into every transition factor p(xk|xk−1, θ).]

Example: Least-Squares Problems
Minimizing Σ_{k=1}^{n} xk² subject to (linear or nonlinear) constraints is equivalent to maximizing
   e^{−Σ_{k=1}^{n} xk²} = ∏_{k=1}^{n} e^{−xk²}
subject to the given constraints.
[Figure: a "constraints" node with edges X1, ..., Xn, each attached to a factor e^{−xk²}.]

General Linear State Space Model
[Figure: block-diagram factor graph with inputs Uk, matrices Ak, Bk, Ck, state edges Xk−1, Xk, Xk+1 joined by "+" and "=" nodes, and outputs Yk.]
   Xk = Ak Xk−1 + Bk Uk
   Yk = Ck Xk

Gaussian Message Passing in Linear Models
encompasses much of classical signal processing and appears often as a component of more complex problems/algorithms.
Note:
1. Gaussianity of messages is preserved in linear models.
2. Includes Kalman filtering and recursive least-squares algorithms.
3. For Gaussian messages, the sum-product (integral-product) algorithm coincides with the max-product algorithm.
4. For jointly Gaussian random variables, MAP estimation = MMSE estimation = LMMSE estimation.
5. Even if X and Y are not jointly Gaussian, the LMMSE estimate of X from Y = y may be obtained by pretending that X and Y are jointly Gaussian (with their actual means and second-order moments) and forming the corresponding MAP estimate.
See "The factor graph approach to model-based signal processing," Proceedings of the IEEE, June 2007.
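[Editor's sketch, not part of the slides.] Point 2 above can be made concrete with a forward Gaussian sweep on the linear state-space model. The sketch assumes a specific 2-dimensional model, Gaussian input Uk with covariance Q, and an added observation noise Wk with covariance R (the noise term is my assumption; the slide's model has Yk = Ck Xk exactly). Propagating the Gaussian message (mean, covariance) along the state edges and combining it with the observation at each "=" node reproduces the Kalman filter recursions.

```python
# Forward Gaussian message passing on a linear state-space model
# (a Kalman filter written as a message-passing sweep).  Model matrices
# and the observation noise are illustrative assumptions.
import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])   # Ak (held constant here)
B = np.array([[0.5], [1.0]])             # Bk
C = np.array([[1.0, 0.0]])               # Ck
Q = np.array([[0.1]])                    # covariance of the input Uk
R = np.array([[1.0]])                    # covariance of the observation noise

def forward_step(m, V, y):
    """Propagate the Gaussian message (m, V) on edge X_{k-1} through the
    A, '+', and '=' nodes, combining it with the observation y_k."""
    # prediction: message on X_k before the observation is used
    m_pred = A @ m
    V_pred = A @ V @ A.T + B @ Q @ B.T
    # measurement update: product of two Gaussian messages at the "=" node
    S = C @ V_pred @ C.T + R                 # innovation covariance
    K = V_pred @ C.T @ np.linalg.inv(S)      # Kalman gain
    m_new = m_pred + K @ (y - C @ m_pred)
    V_new = V_pred - K @ C @ V_pred
    return m_new, V_new

m, V = np.zeros((2, 1)), np.eye(2)           # prior Gaussian message on X0
for y in [np.array([[0.9]]), np.array([[2.1]]), np.array([[2.8]])]:
    m, V = forward_step(m, V, y)
print(m.ravel())     # filtered state estimate after the last observation
print(V)             # its covariance
```

Adding the corresponding backward sweep gives Kalman smoothing, in exact analogy to the forward-backward sweeps on the discrete HMM slides.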
Continuous Variables: Message Types
The following message types are widely applicable.
• Quantization of messages into discrete bins. Infeasible in higher dimensions.
• Single point: the message µ(x) is replaced by a temporary or final estimate x̂.
• Function value and gradient at a point selected by the receiving node. Allows steepest-ascent (descent) algorithms.
• Gaussians. Works for Kalman filtering.
• Gaussian mixtures.
• List of samples: a pdf can be represented by a list of samples. This data type is the basis of particle filters, but it can be used as a general data type in a graphical model.
• Compound messages: the "product" of other message types.

Particle Methods as Message Passing
Basic idea: represent a probability density f by a list
   L = ( (x̂(1), w(1)), ..., (x̂(L), w(L)) )
of weighted samples ("particles").
[Figure: a probability density function f: R → R+ and its representation as a list of particles.]
• Versatile data type for sum-product, max-product, EM, ...
• Not sensitive to dimensionality.

On Factor Graphs with Cycles
• Generally iterative algorithms.
• For example, alternating maximization
   x̂new = argmax_x f(x, ŷ) and ŷnew = argmax_y f(x̂, y),
using the max-product algorithm in each iteration.
• Iterative sum-product message passing gives excellent results for maximization(!) in some applications (e.g., the decoding of error correcting codes).
• Many other useful algorithms can be formulated in message-passing form (e.g., J. Dauwels): gradient ascent, Gibbs sampling, expectation maximization, variational methods, clustering by "affinity propagation" (B. Frey), ...
• Factor graphs make it easy to mix and match different algorithmic techniques.

End of this talk.
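[Editor's closing sketch, not part of the talk.] The "list of samples" message type from the Particle Methods slide can be illustrated in a few lines: a density is carried as weighted particles, a likelihood factor reweights them, and resampling keeps the list well conditioned. The Gaussian prior, the Gaussian observation model, and the observed value are illustrative assumptions.

```python
# A density represented as a list of weighted particles, updated by a
# likelihood factor and resampled (the basic particle-filter message step).
import numpy as np

rng = np.random.default_rng(1)
L = 5000

# incoming message ->mu_X: particles drawn from a prior density
particles = rng.normal(loc=0.0, scale=2.0, size=L)
weights = np.full(L, 1.0 / L)

# factor node p(y|x): reweight the particles with the likelihood of y = 1.3
y = 1.3
likelihood = np.exp(-0.5 * (y - particles) ** 2)   # assumed Gaussian observation model
weights = weights * likelihood
weights /= weights.sum()

# outgoing message: keep the weighted list, or resample to equal weights
resampled = rng.choice(particles, size=L, p=weights)

print("posterior mean ~", np.sum(weights * particles))
print("resampled mean ~", resampled.mean())
```

The same list-of-particles data type can be passed along any edge of a factor graph, which is what makes it usable beyond classical particle filtering.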