Christopher M. Bishop, Pattern Recognition and Machine Learning 1 Outline Introduction Directed Graphs Undirected Graphs Factor Graphs Summary 2 Outline Introduction Directed Graphs Undirected Graphs Factor Graphs Summary 3 Introduction A graph consists of nodes (vertices) that are connected by edges (links, arcs) They provide a simple and clear way to visualize the probabilistic model Complex computations can be expressed in terms of graphical manipulations 4 Probabilistic Graphical Models There are two models: directed and undirected graphical models Each node represents a random variable and the edges represent probabilistic relationships between these variables Directed Undirected 5 Outline Introduction Directed Graphs Undirected Graphs Factor Graphs Summary 6 Directed Graphical Models An example: p ( a , b , c ) p ( c | a , b ) p ( a , b ) p (c | a , b ) p (b | a ) p ( a ) a b c Definition: for a graph with K nodes, the joint distribution is given by K p ( x1 , , xK ) p ( x k | pa k ) k 1 where pa k denotes the set of parents of x k 7 An Example x1 x2 x4 x6 p ( x1 , x3 x5 x7 , x 7 ) p ( x1 ) p ( x 2 ) p ( x 3 ) p ( x 4 | x1 , x 2 , x 3 ) p ( x 5 | x1 , x 3 ) p ( x 6 | x 4 ) p ( x 7 | x 4 , x 5 ) 8 Conditional Independence (1) a is conditionally independent of b given c : p (a | b, c) p (a | c) p ( a , b | c ) p ( a | b , c ) p (b | c ) p ( a | c ) p (b | c ) A shorthand notation: There are three types of conditional independencies for the directed graphs 9 Conditional Independence (2) tail-to-tail a b p ( a , b , c ) p ( a | c ) p (b | c ) p ( c ) p (a, b) c blocked c c p ( a | c ) p (b | c ) p ( c ) a b p (a, b | c) p (a, b, c) p (c ) p ( a | c ) p (b | c ) 10 Conditional Independence (2) a c b head-to-tail b a c head-to-head b a c dependence Definition: d-separation is the notion of being separated on a directed graph 11 D-separation: an example f a e b c 12 Application: an Example Hidden Markov model: 13 Outline Introduction Directed Graphs Undirected Graphs Factor Graphs Summary 14 Undirected Graphical Models C B A Nodes of set A and B are separated by the third set C A and B are conditionally independent, p ( a1 , b1 | c1 , c 2 ) p ( a1 | c1 , c 2 ) p ( b1 | c1 , c 2 ) 15 Conditional Independence C1 H1 H2 C2 The computers can infect each other via the hubs and the hubs can infect each other via the computers 16 Cliques Definition: a subset of the nodes in a clique is fully connected Maximal cliques x1 x2 x3 x4 We can define the factors in decomposition of the joint distribution as functions of the variable in the clique 17 Undirected Factorization Consider factorizations of the form: p(x ) 1 Z C (x C ) C where C (x C ) is a non-negative potential function of a maximal clique An example: x x p ( x1 , 1 2 x3 x4 , x 4 ) ( x1 , x 2 , x 3 ) ( x 2 , x 3 , x 4 ) 18 An Example Markov random field: 19 Directed versus Undirected (1) x1 x3 x1 x2 moralization x3 x2 moral graph x4 p ( x1 , x4 , x 4 ) p ( x1 ) p ( x 2 ) p ( x 3 ) p ( x 4 | x1 , x 2 , x 3 ) ( x1 , x 2 , x 3 , x 4 ) We have to discard some conditional independence properties to complete this transfer 20 Directed versus Undirected (2) D U P P: the set of all distributions over a given set of variables 21 Outline Introduction Directed Graphs Undirected Graphs Factor Graphs Summary 22 Factor Graphs (1) A factor graph is a more general graph It allows us to be more explicit about the details of the factorization An example: x2 x1 x3 Variable node Factor node fa fb fc fd p ( x1 , x 2 , x 3 ) f a ( x1 , x 2 ) f b ( x1 , x 2 ) f c ( x 2 , x 3 ) f d ( x 3 ) 23 Factor Graphs (2) Definition: given a factor graph, the joint probability distribution is given by p(x ) fs (x s ) s where the x s denotes a subset of the variables that connect to the factor f s Each factor f s is a function of a corresponding set of variables x s 24 Factor Graphs (3) ( x1 , x 2 , x 3 ) f a ( x1f, (xx2 1, ,xx32), fxb3()x1 , x 2 ) 25 Factor Graphs (4) f a ( x1 ) pf ((xx11),, xf2b,(xx32)) pp((xx1 2) ),p (fxc 2( )x1p,(xx23, |xx3 1), x 2 p) ( x 3 | x1 , x 2 ) Directed and undirected graphs are special cases of factor graphs 26 Sum-Product Algorithm (1) Goal: Obtain a efficient, exact inference algorithm for finding marginals Allow computations to be shared efficiently By definition, the marginal is p(x) p (x ) x\x 27 Sum-Product Algorithm (2) p(x ) Fs ( x , X s ) s ne ( x ) where n e ( x ) : the factor nodes are neighbors of x X s : all variables in the subtree Fs ( x , X s ) : the product of all the factors in the group associated with factor f s 28 Sum-Product Algorithm (3) p( x) p(x ) x\x f x\x Fs ( x , X s ) s ne ( x ) Fs ( x , X s ) s ne ( x ) X s s ne ( x ) f s x ( x) can be view as messages from the factor node fs to the variable node x Fs ( x , X s ) which is a factor sub-graph can itself be factorized s x ( x) Fs ( x , X s ) f s ( x , x1 , , x M ) G1 ( x1 , X s1 ) G M ( x M , X sM ) 29 Sum-Product Algorithm (4) G1(x1,Xs1) x1 x x2 fs xM Fs ( x , X s ) f s ( x , x1 , , x M ) G1 ( x1 , X s1 ) G M ( x M , X sM ) 30 Sum-Product Algorithm (5) f s x ( x) x1 xM x1 xM f s ( x , x1 , , x M ) G m ( x m , X sm ) m ne ( f s ) \ x X sm f s ( x , x1 , , xM ) m ne ( f s ) \ x x m fs ( xm ) 31 Sum-Product Algorithm (6) G m ( x m , X sm ) Fl ( x m , X m l ) l n e ( x m ) \ f s x m fs ( xm ) l ne ( x m ) \ f s Fl ( x m , X m l ) X ml l ne ( x m ) \ f s f l xm ( xm ) 32 Sum-Product Algorithm (7) Messages: Variable node factor node: take the product of the in coming messages along all of the other link Factor node variable node: take the product of the in coming messages along all of the other link and multiply by the factor 33 Sum-Product Algorithm (8) The sum-product algorithm can be viewed purely in terms of messages sent out by factor nodes to other factor nodes 34 Sum-Product Algorithm – an Example root x1 x2 x3 fb fa fc x4 35 Max-Sum Algorithm (1) Find a setting of the variables that has the largest probability x m ax arg m ax p ( x ) x Find the value of that probability p(x m ax ) m ax p ( x ) x m ax p ( x ) m ax x x1 m ax p ( x ) xM 36 Max-Sum Algorithm (2) Compare this with the marginal: p(x m ax ) m ax p ( x ) x m ax m ax p ( x ) x x\x p(x) p (x ) x\x That is similar to the sum-product algorithm except that the summations are replaced by maximization 37 Max-Sum Algorithm (3) The max-product algorithm: f s x ( x) x1 xM f x ( x ) m ax x1 f s ( x , x1 , x mne ( f s ) \ x m ax f ( x , x1 , xM x f ( x ) , xM ) , xM ) mne ( f ) \ x l n e ( x ) \ f f l x m fs x m ( xm ) f ( xm ) ( x) 38 Max-Sum Algorithm (4) It is convenient to work with the logarithm of the joint distribution The max-sum algorithm: f x ( x ) m ax ln f ( x , x1 , x1 , , x M x f ( x) , xM ) l ne ( x ) \ f m ne ( f ) \ x f l x x m f ( xm ) ( x) 39 Max-Sum Algorithm (5) We can find the maximum by propagating messages from leaves to a root node p m ax m ax f s x ( x ) x s ne ( x ) Now we want to find the configuration of the variables for which the joint distribution attains this maximum value 40 Max-Sum Algorithm (6) An example: x2 x1 x3 f2,3 f1,2 x m ax N xN-1 xN fN-1,N arg m ax f N 1 , N x N ( x N ) x N ( x n ) arg m ax ln f n 1, n ( x n 1, x n ) x x n 1 n 1 f n 1 , n ( xn ) m ax N Once we know x , we can propagate a message back down the chain using x n 1 ( x n m ax m ax ) 41 Max-Sum Algorithm (7) It is known as back-tracking This can be extended to a general treestructure factor graph 42 Examples A Markov chain: A hidden Markov model: 43 Outline Introduction Directed Graphs Undirected Graphs Factor Graphs Summary 44 Summary The author introduces three types of probabilistic graphs Graphical models are composed of probability theory and graphical theory The concept is to factorize a complicated system into some simple components 45