Representation for Undirected Graphical Models
Le Song
Machine Learning II: Advanced Topics
CSE 8803ML, Spring 2012

Summary of Directed Graphical Models
Local Markov assumptions: X ⊥ NonDescendants_X | Pa_X
I-map: I_l(G) ⊆ I(P) ⇔ P(X_1, ..., X_n) factorizes as ∏_i P(X_i | Pa_{X_i})
(topological ordering, chain rule, local Markov assumptions)
Fewer edges = stronger assumptions
If deleting any edge of G makes it no longer an I-map, then G is a minimal I-map for P
P-map: I(G) = I(P)
D-separation: active trails capture longer-range dependence
No active trail between X_i and X_j given Z ⇒ X_i ⊥ X_j | Z
Examples (Flu, Allergy, Sinus, Headache, Nose network):
¬(A ⊥ H) but A ⊥ H | S; ¬(N ⊥ H) but N ⊥ H | S; A ⊥ F but ¬(A ⊥ F | S)

How about other derived relations?
(Example network over nodes A, B, C, D, E, F, G, H, I, J, K)
F ⊥ {B, E, G, J}?            Yes: F ⊥ {B, E, G, J}
{A, C, F, I} ⊥ {B, E, G, J}? Yes: {A, C, F, I} ⊥ {B, E, G, J}
B ⊥ J | E?                   Yes: B ⊥ J | E
E ⊥ F | K?                   No: ¬(E ⊥ F | K)
E ⊥ F | {K, I}?              Yes: E ⊥ F | {K, I}
F ⊥ G | D?                   No: ¬(F ⊥ G | D)
F ⊥ G | H?                   No: ¬(F ⊥ G | H)
F ⊥ G | {H, K}?              No: ¬(F ⊥ G | {H, K})
F ⊥ G | {H, A}?              Yes: F ⊥ G | {H, A}

Generate Samples from Bayesian Networks
A BN describes a generative process for observations.
First, sort the nodes in topological order; then generate samples in this order according to the CPTs.
Example (Flu, Allergy, Sinus, Nose, Headache): to generate one sample of (A, F, S, N, H):
sample a_i ~ P(A)
sample f_i ~ P(F)
sample s_i ~ P(S | a_i, f_i)
sample n_i ~ P(N | s_i)
sample h_i ~ P(H | s_i)
(A Python sketch of this procedure is given below.)

Reduction in representation
Easy case: one parent each, D → C → {X_1, ..., X_n}:
P(X_1, ..., X_n, C, D) = P(D) P(C | D) ∏_i P(X_i | C)
Only small two-way tables are needed.
Difficult case: multiple parents for C, {X_1, ..., X_n} → C → D:
P(X_1, ..., X_n, C, D) = P(X_1) ... P(X_n) P(C | X_1, ..., X_n) P(D | C)
The factor P(C | X_1, ..., X_n) still needs an (n+1)-way table, about 2^n parameters.

Additional notation
Observed variable: filled circle. Hidden variable: open circle.
(Example: hidden H_1, H_2, H_3 with observed children X_1, X_2, X_3.)
A plate repeats a variable n times with the same CPT, e.g. X_1, ..., X_n inside a plate when all P(X_i | C) are identical.

Nested plate notation
E.g. Latent Dirichlet Allocation (LDA):
θ_d: topic mixing proportion for document d
z_dn: topic indicator variable for word n of document d
w_dn: word n in document d
α: prior for the mixing proportions
β: topic parameters
P(α, θ_d, z_dn, w_dn, β) = P(α) P(β) ∏_d P(θ_d | α) ∏_n P(z_dn | θ_d) P(w_dn | z_dn, β)
Edges drawn with the same color share the same parameter.

Inexistence of P-maps for Bayesian Networks
XOR example: A = B XOR C
A ⊥ B but ¬(A ⊥ B | C); B ⊥ C but ¬(B ⊥ C | A); C ⊥ A but ¬(A ⊥ C | B)
A minimal I-map exists, but it is not a P-map: it cannot encode all of A ⊥ B, B ⊥ C, C ⊥ A.
Swinging couples of variables (X_1, Y_1) and (X_2, Y_2):
¬(X_1 ⊥ Y_1), ¬(X_2 ⊥ Y_2), but X_1 ⊥ X_2 | {Y_1, Y_2} and Y_1 ⊥ Y_2 | {X_1, X_2}
No BN P-map exists; we need a new representation!

Undirected Graphical Models (UGM)
E.g. grid models (image processing, physics): each node is a pixel, or an atom.
The values of adjacent variables are dependent due to pattern continuity, electromagnetic forces, etc.
The most likely joint configuration corresponds to a "low-energy" state:
P(X_1, ..., X_n) = (1/Z) exp( Σ_{(i,j)∈E} θ_ij X_i X_j + Σ_{i∈V} θ_i X_i )
Three questions:
1. How to read conditional independence from the graph
2. How to factorize the distribution
3. Representation power
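Returning to the ancestral-sampling procedure above, here is a minimal Python sketch for the Flu/Allergy/Sinus/Nose/Headache network. Only the sampling order (parents before children) comes from the slides; the CPT numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def bernoulli(p):
    """Draw one 0/1 sample with P(1) = p."""
    return int(rng.random() < p)

def sample_one():
    """One ancestral sample from the Flu/Allergy/Sinus/Nose/Headache BN.
    Nodes are visited in topological order (A, F, S, N, H); each node is
    drawn from its CPT given the already-sampled values of its parents.
    All CPT numbers below are made up for illustration."""
    a = bernoulli(0.2)                                   # a_i ~ P(A)
    f = bernoulli(0.1)                                   # f_i ~ P(F)
    p_s = {(0, 0): 0.05, (0, 1): 0.6, (1, 0): 0.7, (1, 1): 0.9}
    s = bernoulli(p_s[(a, f)])                           # s_i ~ P(S | a_i, f_i)
    n = bernoulli(0.8 if s else 0.1)                     # n_i ~ P(N | s_i)
    h = bernoulli(0.7 if s else 0.05)                    # h_i ~ P(H | s_i)
    return dict(A=a, F=f, S=s, N=n, H=h)

samples = [sample_one() for _ in range(5)]
print(samples)
```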
Read conditional independence from UGM
Global Markov independence: A ⊥ B | C, independence based on separation.
Local Markov independence: X ⊥ rest | {A, B, C, D}, where {A, B, C, D} is the Markov blanket of X.

Global Markov Independence
sep_G(A; B | C): C separates A and B if every path from a node in A to a node in B passes through a node in C.
A distribution satisfies the global Markov property if for any disjoint A, B, C such that sep_G(A; B | C), A is independent of B given C:
I(G) = {A ⊥ B | C : sep_G(A; B | C)}

Soundness of the separation criterion
The independencies in I(G) are precisely those that are guaranteed to hold for every MN distribution P over G.
In other words, the separation criterion is sound for detecting independence properties in MN distributions over G.
In a sense, reading conditional independence from an MN is simpler than from a BN:
BN with v-structure A → C ← B: C lies between A and B in the graph, but ¬(A ⊥ B | C).
MN chain A − C − B (or any MN in which C separates A and B): (A ⊥ B | C).

Local Markov Independence
For each node X_i ∈ V there is a unique Markov blanket MB_{X_i}: the set of immediate neighbors of X_i in the graph.
The local Markov independencies associated with G are
I_l(G) = {X_i ⊥ V − {X_i} − MB_{X_i} | MB_{X_i} : ∀i}
In other words, X_i is independent of the rest given its immediate neighbors.

Pairwise Markov independence
Given a graph G = (V, E), the pairwise Markov independencies associated with G are
I_p(G) = {X ⊥ Y | V ∖ {X, Y} : {X, Y} ∉ E}
E.g. for the chain X_1 − X_2 − X_3 − X_4 − X_5: X_1 ⊥ X_5 | {X_2, X_3, X_4}.
In a BN we need active trails to judge this; e.g. for X_1 → X_2 ← X_3, ¬(X_1 ⊥ X_3 | X_2).

Markov blanket example
Note: the local Markov independencies in an MN and a BN can be quite different!
MN: X ⊥ V − {X} − {A, B, C, D} | {A, B, C, D}.
BN with the same skeleton: ¬(X ⊥ V − {X} − {A, B, C, D} | {A, B, C, D}); e.g. ¬(X ⊥ E | {A, B, C, D}), because a BN Markov blanket must also include the co-parents of X's children (such as E).

Read conditional independence from UGM (continued)
Global Markov: A ⊥ B | C based on separation. Local Markov: X ⊥ rest given its Markov blanket.
Edges express dependence between variables, but carry no causal relations, and generating samples is more complicated than in a BN.
How do we factorize the distribution?

Maximal Cliques
For G = (V, E), a complete subgraph (clique) is a subgraph G' = (V' ⊆ V, E' ⊆ E) such that the nodes in V' are fully connected.
A maximal clique is a complete subgraph such that no superset V'' ⊃ V' is fully connected.
Example: maximal cliques = {A, B, C}, {A, B, D}; sub-cliques = {A}, {B}, {A, B}, ... (all edges and singletons).

Distribution Factorization in Markov Networks
Given an undirected graph G over variables X = {X_1, ..., X_n}, a distribution P factorizes over G if there exist
subsets of variables D_1 ⊆ X, ..., D_m ⊆ X (the D_i are the maximal cliques in G), and
non-negative potentials (factors/functions) Ψ_1(D_1), ..., Ψ_m(D_m),
such that
P(X_1, X_2, ..., X_n) = (1/Z) ∏_{i=1}^m Ψ_i(D_i), where Z = Σ_{x_1, x_2, ..., x_n} ∏_{i=1}^m Ψ_i(D_i).
Also known as Gibbs distributions, Markov random fields, and undirected graphical models.
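As a minimal sketch of this factorization, the snippet below builds the Gibbs distribution for a three-node chain X − Y − Z from two pairwise potentials, normalizes by the partition function, and numerically checks the separation-based independence X ⊥ Z | Y that the graph encodes. The potential values are arbitrary non-negative numbers chosen for illustration.

```python
import numpy as np

# Arbitrary non-negative potentials for the chain X - Y - Z
# (the values are made up; any non-negative tables would do).
psi_xy = np.array([[1.0, 3.0],
                   [2.0, 0.5]])       # psi_xy[x, y]
psi_yz = np.array([[4.0, 1.0],
                   [0.5, 2.0]])       # psi_yz[y, z]

# Joint table: P(x, y, z) proportional to psi_xy[x, y] * psi_yz[y, z]
joint = psi_xy[:, :, None] * psi_yz[None, :, :]
joint /= joint.sum()                   # divide by the partition function Z

# Check the independence X _|_ Z | Y implied by separation in the graph:
# P(x, z | y) should equal P(x | y) * P(z | y) for every y.
for y in (0, 1):
    p_xz_given_y = joint[:, y, :] / joint[:, y, :].sum()
    p_x_given_y = p_xz_given_y.sum(axis=1, keepdims=True)
    p_z_given_y = p_xz_given_y.sum(axis=0, keepdims=True)
    assert np.allclose(p_xz_given_y, p_x_given_y * p_z_given_y)
print("X is independent of Z given Y for any such chain factorization")
```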
Interpretation of Potential Functions
Consider the chain X − Y − Z with P(X, Y, Z) = (1/Z) Ψ_1(X, Y) Ψ_2(Y, Z).
The undirected graph implies X ⊥ Z | Y. This independence statement implies that the joint must factorize as
P(X, Y, Z) = P(Y) P(X, Z | Y) = P(Y) P(X | Y) P(Z | Y).
So we can write P(X, Y, Z) = P(X, Y) P(Z | Y) or P(X | Y) P(Z, Y), but we cannot have all potentials be marginals, and we cannot have all potentials be conditionals.
The clique potentials can be thought of as general "compatibility", "goodness" or "happiness" functions over their variables, but not as distributions or conditionals.

Another example
P(A, B, C, D) = (1/Z) Ψ_1(A, B, C) Ψ_2(A, B, D), with Z = Σ_{a,b,c,d} Ψ_1(A, B, C) Ψ_2(A, B, D).
For discrete nodes, we can represent P(A, B, C, D) with two 3-way tables instead of one 4-way table.
Compare the BN factorization P(A, B, C, D) = P(C) P(A | C) P(B | A, C) P(D | A, B): two 3-way tables + one 2-way table + one vector, and each table has a meaning as a conditional distribution.

Conditional Independence in the Problem
World, data, reality: the true distribution P contains conditional independence assertions I(P).
BN: local Markov assumptions, I_l(G) ⊆ I(P).
MN: global Markov assumptions, I(G) ⊆ I(P).

I-map: Bayesian Networks
A BN encodes the local Markov assumptions I_l(G).
If the local conditional independencies of the BN are a subset of the conditional independencies of P, i.e. I_l(G) ⊆ I(P) (the BN is an I-map of P), then the joint probability can be written as
P(X_1, ..., X_n) = ∏_i P(X_i | Pa_{X_i}),
i.e. P factorizes according to the BN. Every P has at least one BN structure G.
Conversely, if the joint probability can be written as P(X_1, ..., X_n) = ∏_i P(X_i | Pa_{X_i}), then the local conditional independencies of the BN are a subset of the conditional independencies of P, I_l(G) ⊆ I(P): we can read independencies of P from the BN structure G.

I-map: Markov Networks
An MN encodes the global Markov assumptions I(G).
If the global conditional independencies of the MN are a subset of the conditional independencies of P, i.e. I(G) ⊆ I(P) (the MN is an I-map of P), then the joint probability can be written as
P(X_1, ..., X_n) = (1/Z) ∏_i Ψ(D_i),
i.e. P factorizes according to the MN. Every P has at least one MN structure G.
Conversely, if the joint probability can be written as P(X_1, ..., X_n) = (1/Z) ∏_i Ψ(D_i), then the global conditional independencies of the MN are a subset of the conditional independencies of P, I(G) ⊆ I(P): we can read independencies of P from the MN structure G.

Counter example
X_1, ..., X_4 are binary, and only eight assignments have positive probability (each with probability 1/8):
(0,0,0,0), (1,0,0,0), (1,1,0,0), (1,1,1,0), (0,0,0,1), (0,0,1,1), (0,1,1,1), (1,1,1,1)
The graph is the 4-cycle X_1 − X_2 − X_3 − X_4 − X_1.
The global Markov independencies hold, e.g. X_1 ⊥ X_3 | {X_2, X_4}: P(X_1, X_3 | X_2, X_4) = P(X_1 | X_2, X_4) P(X_3 | X_2, X_4) for every value of (X_2, X_4). [The slide shows the corresponding conditional probability tables.]
But the distribution does not factorize over the cycle! E.g. P(0, 0, 1, 0) = 0 cannot equal (1/Z) Ψ_12(0,0) Ψ_23(0,1) Ψ_34(1,0) Ψ_14(0,0), since each of these potential entries must be positive to produce the positive-probability assignments above.
So global Markov independence alone does not guarantee factorization; positivity is needed (next slide).

Markov Network Representation
If the global conditional independencies of the MN are a subset of the conditional independencies of a strictly positive P (for all x, P(X = x) > 0), i.e. I(G) ⊆ I(P) (the MN is an I-map of P), then the joint probability can be written as
P(X_1, ..., X_n) = (1/Z) ∏_i Ψ(D_i),
i.e. P factorizes according to the MN. Every strictly positive P has at least one MN structure G.
This result is known as the Hammersley-Clifford theorem.

Minimal I-maps and Markov networks
A fully connected graph is an I-map for every distribution.
Recall minimal I-maps: deleting any edge makes the graph no longer an I-map.
In a Bayesian network there is no unique minimal I-map.
For strictly positive distributions and Markov networks, the minimal I-map is unique!
There are many ways to find the minimal I-map, e.g.: take each pairwise Markov assumption X_i ⊥ X_j | rest; if P does not entail it, add the edge {X_i, X_j}.
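The pairwise construction just described can be made concrete. Below is a minimal Python sketch (not from the slides) that, given a strictly positive joint table over binary variables, tests each pairwise Markov assumption X_i ⊥ X_j | rest and adds an edge whenever it fails; the example distribution and its potential values are made up.

```python
import itertools
import numpy as np

def conditionally_independent(joint, i, j, tol=1e-9):
    """Test X_i _|_ X_j | rest for a strictly positive joint table over
    binary variables. joint is an ndarray with one axis per variable."""
    n = joint.ndim
    rest = [k for k in range(n) if k not in (i, j)]
    # Iterate over all assignments of the remaining variables.
    for assignment in itertools.product([0, 1], repeat=len(rest)):
        index = [slice(None)] * n
        for axis, value in zip(rest, assignment):
            index[axis] = value
        block = joint[tuple(index)]               # 2x2 table over (X_i, X_j)
        block = block / block.sum()               # P(X_i, X_j | rest)
        outer = np.outer(block.sum(axis=1), block.sum(axis=0))
        if not np.allclose(block, outer, atol=tol):
            return False
    return True

def minimal_imap_edges(joint):
    """Pairwise-Markov construction of the (unique) minimal I-map for a
    strictly positive distribution: add the edge {i, j} whenever
    X_i _|_ X_j | rest does NOT hold."""
    n = joint.ndim
    return [(i, j) for i, j in itertools.combinations(range(n), 2)
            if not conditionally_independent(joint, i, j)]

# Example: a strictly positive distribution that factorizes over the chain
# X0 - X1 - X2 (the potential values are made up).
psi01 = np.array([[2.0, 1.0], [1.0, 3.0]])
psi12 = np.array([[1.0, 4.0], [2.0, 1.0]])
joint = psi01[:, :, None] * psi12[None, :, :]
joint /= joint.sum()
print(minimal_imap_edges(joint))   # expected: [(0, 1), (1, 2)]
```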
How about a perfect map?
A perfect map: the independencies in the graph are exactly those in P.
For Bayesian networks a perfect map does not always exist; counter example: swinging couples of variables.
How about for Markov networks? Counter example: the v-structure.
The BN A → S ← F encodes A ⊥ F and ¬(A ⊥ F | S).
The minimal I-map MN is the fully connected triangle over {A, S, F}; it is not a P-map, since it cannot encode A ⊥ F.

Pairwise Markov Networks
All factors are over single variables or pairs of variables:
node potentials Ψ_i(X_i) and edge potentials Ψ_ij(X_i, X_j).
Factorization: P(X) = (1/Z) ∏_{i∈V} Ψ_i(X_i) ∏_{(i,j)∈E} Ψ_ij(X_i, X_j)
Note that there may be bigger cliques in the graph, but we only consider pairwise potentials.

An example
Maximal clique specification: P(A, B, C, D) = (1/Z) Ψ_1(A, B, C) Ψ_2(A, B, D), with Z = Σ_{a,b,c,d} Ψ_1(A, B, C) Ψ_2(A, B, D); uses two 3-way tables.
Pairwise Markov network: P'(A, B, C, D) = (1/Z') Ψ(A, C) Ψ(B, C) Ψ(A, B) Ψ(A, D) Ψ(B, D); uses five 2-way tables.
What is the relation between I(P) and I(P')?
I(fully connected graph) ⊆ I(P) ⊆ I(P') ⊆ I(disconnected graph)

Applications of Pairwise Markov Networks
Image segmentation: separate foreground (fg) from background (bg).
Graph structure: a grid with one node per pixel; a pairwise Markov network.
Node potentials compare the pixel color z_i to the foreground and background colors:
Ψ(X_i = fg, z_i) = exp(−||z_i − μ_fg||²), Ψ(X_i = bg, z_i) = exp(−||z_i − μ_bg||²)
Edge potentials encode that neighbors likely have the same label, e.g.
Ψ(X_i, X_j) = 10 if X_i = X_j and 1 otherwise (a 2×2 table: 10 on the diagonal, 1 off it).
P(X_1, ..., X_n) = (1/Z) ∏_i Ψ(X_i) ∏_{(i,j)} Ψ(X_i, X_j)
(A code sketch of this model appears at the end of these notes.)

Exponential Form
Standard model: P(X_1, ..., X_n) = (1/Z) ∏_i Ψ(D_i).
Assuming strictly positive potentials:
P(X_1, ..., X_n) = (1/Z) ∏_i Ψ(D_i) = (1/Z) ∏_i exp( log Ψ(D_i) ) = (1/Z) exp( Σ_i log Ψ(D_i) ) = (1/Z) exp( Σ_i Φ(D_i) )
We can maintain the tables Φ(D_i) (which can have negative entries) rather than the tables Ψ(D_i) (strictly positive entries).

Exponential Form: Log-linear Models
Features are functions f(D) of a subset of variables D.
A log-linear model over a Markov network G consists of
a set of features f_1(D_1), ..., f_k(D_k), where each D_i is a subset of a clique in G (e.g. in a pairwise model D_i = {X_i, X_j}); two features may be over the same variables (D_i = D_j is allowed), and
a set of weights w_1, ..., w_k, usually learned from data.
P(X_1, ..., X_n) = (1/Z) exp( Σ_{i=1}^k w_i f_i(D_i) )

Factor Graph
Maximal clique specification: P(A, B, C, D) = (1/Z) Ψ_1(A, B, C) Ψ_2(A, B, D).
Pairwise Markov network: P'(A, B, C, D) = (1/Z') Ψ(A, C) Ψ(B, C) Ψ(A, B) Ψ(A, D) Ψ(B, D).
Both have the same undirected graph, so we cannot look at the graph alone and tell which potentials are used.
The factor graph makes this explicit in graphical form.

Factor Graph (continued)
Make factor dependency explicit; useful for later inference.
A factor graph is a bipartite graph with
variable nodes (circles) for X_1, ..., X_n, and
factor nodes (squares) for Ψ_1, ..., Ψ_m,
with an edge X_i − Ψ_j whenever X_i ∈ D_j (the scope of Ψ_j(D_j)).
Example: (1/Z) Ψ_1(A, B, C) Ψ_2(A, B, D) has two factor nodes; (1/Z) Ψ_1(A, B) Ψ_2(A, C) Ψ_3(B, C) Ψ_4(A, D) Ψ_5(B, D) has five.

Conditional Random Fields
Focus on the conditional distribution P(Y_1, ..., Y_n | X_1, ..., X_n, X).
Do not explicitly model the dependence among X_1, ..., X_n, X; only model the Y−Y and Y−X relations.
E.g. for a chain of three labels:
P(Y_1, Y_2, Y_3 | X_1, X_2, X_3, X) = (1/Z(X_1, X_2, X_3, X)) Ψ(Y_1, Y_2, X_1, X_2, X) Ψ(Y_2, Y_3, X_2, X_3, X)
with Z(X_1, X_2, X_3, X) = Σ_{y_1, y_2, y_3} Ψ(y_1, y_2, X_1, X_2, X) Ψ(y_2, y_3, X_2, X_3, X).
Note that the normalization constant is a function of the observations.
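Returning to the image segmentation model, here is a minimal Python sketch of the pairwise network above on a toy 2x2 "image". The node and edge potentials follow the forms on the slide; the values μ_fg, μ_bg and the pixel intensities are made-up numbers, and the most probable labelling is found by brute-force enumeration rather than by a proper inference algorithm.

```python
import itertools
import numpy as np

# Toy 2x2 "image": pixels 0, 1 are bright (foreground-ish), 2, 3 are dark.
# All numbers (mu values, intensities) are made up for illustration.
image = [7.5, 8.5, 2.5, 1.5]
edges = [(0, 1), (2, 3), (0, 2), (1, 3)]            # 4-neighbour grid edges
mu = {0: 8.0, 1: 2.0}                               # label 0 = fg, label 1 = bg

edge_pot = np.array([[10.0, 1.0],                   # neighbours prefer to agree
                     [1.0, 10.0]])

def node_pot(label, z):
    """Node potential from the slide: exp(-||z - mu_label||^2)."""
    return np.exp(-(z - mu[label]) ** 2)

def score(labels):
    """Unnormalized probability of one labelling (product of all potentials).
    Dividing by the sum over all labellings would give the joint P(X)."""
    p = np.prod([node_pot(l, z) for l, z in zip(labels, image)])
    p *= np.prod([edge_pot[labels[i], labels[j]] for i, j in edges])
    return p

labellings = list(itertools.product([0, 1], repeat=len(image)))
best = max(labellings, key=score)
print(best)   # expected (0, 0, 1, 1): bright pixels foreground, dark ones background
```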
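Finally, a brute-force sketch of the chain CRF above: the conditional P(Y_1, Y_2, Y_3 | X_1, X_2, X_3, X) is computed by enumerating all label sequences and normalizing per input, which makes the input-dependent partition function explicit. The specific potential function below is invented for illustration and is not from the slides.

```python
import itertools
import numpy as np

def crf_conditional(xs, x_global, psi, n_labels=2):
    """Brute-force P(Y_1, ..., Y_n | X_1, ..., X_n, X) for a small chain CRF.
    psi(y_a, y_b, x_a, x_b, x_global) is any non-negative potential over
    neighbouring label pairs and their observations."""
    n = len(xs)
    label_seqs = list(itertools.product(range(n_labels), repeat=n))
    scores = np.array([
        np.prod([psi(y[t], y[t + 1], xs[t], xs[t + 1], x_global)
                 for t in range(n - 1)])
        for y in label_seqs])
    z = scores.sum()                   # partition function Z(X_1, ..., X_n, X)
    return dict(zip(label_seqs, scores / z))

def psi(ya, yb, xa, xb, xg):
    """A made-up potential: neighbouring labels like to agree, and each
    label y in {0, 1} prefers observations whose sign matches 2y - 1."""
    agree = 2.0 if ya == yb else 1.0
    fit = np.exp((2 * ya - 1) * xa + (2 * yb - 1) * xb + 0.1 * xg * (ya + yb))
    return agree * fit

dist = crf_conditional(xs=[0.8, -0.3, 1.2], x_global=0.5, psi=psi)
print(max(dist, key=dist.get))         # most probable label sequence given the input
```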