Inference in Graphical Models: Variable Elimination and Message Passing
Le Song
Machine Learning II: Advanced Topics
CSE 8803ML, Spring 2012

Conditional Independence Assumptions

Local Markov Assumption (directed): X ⊥ NonDescendants_X | Pa_X
Global Markov Assumption (undirected): A ⊥ B | C whenever sep_G(A, B; C)

[Figure: a Bayesian network is converted to an undirected model by moralizing, and then to an undirected chordal graph by triangulating; an undirected tree is already chordal.]

Distribution Factorization

Bayesian networks (directed graphical models):
I-map: I_l(G) ⊆ I(P) ⇔ P(X_1, …, X_n) = ∏_{i=1}^n P(X_i | Pa_{X_i})
The local factors P(X_i | Pa_{X_i}) are conditional probability tables (CPTs).

Markov networks (undirected graphical models), for strictly positive P:
I-map: I(G) ⊆ I(P) ⇔ P(X_1, …, X_n) = (1/Z) ∏_{i=1}^m Ψ_i(D_i)
The Ψ_i are potentials over (maximal) cliques D_i, and the normalization constant Z (the partition function) is
Z = ∑_{x_1, …, x_n} ∏_{i=1}^m Ψ_i(D_i)

Inference in Graphical Models

Graphical models give compact representations of probability distributions P(X_1, …, X_n) (from n-way tables to much smaller tables).
How do we answer queries about P? We use "inference" as a name for the process of computing answers to such queries.

Query Type 1: Likelihood

Most queries involve evidence. Evidence e is an assignment of values to a set E of variables, i.e., observations on some variables. Without loss of generality, E = {X_{k+1}, …, X_n}.
The simplest query is the probability of the evidence:
P(e) = ∑_{x_1} … ∑_{x_k} P(x_1, …, x_k, e)
This is often referred to as computing the likelihood of e; we sum over the non-evidence variables X_1, …, X_k.

Query Type 2: Conditional Probability

Often we are interested in the conditional probability distribution of a variable given the evidence:
P(X | e) = P(X, e) / P(e) = P(X, e) / ∑_x P(X = x, e)
This is also called the a posteriori belief in X given evidence e.
We usually query a subset Y of all variables and "don't care" about the remaining variables Z:
P(Y | e) = ∑_z P(Y, Z = z | e)
This takes all possible configurations of Z into account; the process of summing out the unwanted variables Z is called marginalization.

Query Type 2: Conditional Probability Example

[Figure: two example networks, each annotated with the variables whose conditionals we are interested in and the set of variables E to sum over.]

Applications of a Posteriori Belief

Prediction: what is the probability of an outcome given the starting condition? The query node is a descendant of the evidence (e.g., in the chain A → B → C, query C given A).
Diagnosis: what is the probability of a disease/fault given symptoms? The query node is an ancestor of the evidence (query A given C).
Learning under partial observation: fill in the unobserved variables.
Information can flow in either direction, and inference can combine evidence from all parts of the network.

Query Type 3: Most Probable Assignment

We want to find the most probable joint assignment for some variables of interest. Such reasoning is usually performed under some given evidence e, and ignoring (the values of) the other variables Z. This is also called the maximum a posteriori (MAP) assignment for Y:
MAP(Y | e) = argmax_y P(y | e) = argmax_y ∑_z P(y, Z = z | e)

Applications of MAP Assignment

Classification: find the most likely label, given the evidence.
Explanation: what is the most likely scenario, given the evidence?
Cautionary note: the MAP assignment of a variable depends on its context, i.e., the set of variables being jointly queried.
Example: MAP of (X, Y)? (0, 0). MAP of X alone? 1.

X  Y  P(X, Y)        X  P(X)
0  0  0.35           0  0.4
0  1  0.05           1  0.6
1  0  0.30
1  1  0.30
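To see the cautionary note concretely, here is a minimal Python check that recovers both answers from the joint table above (the table values are from the slide; the code itself is only illustrative):

    import numpy as np

    # Joint distribution P(X, Y) from the table: rows index X, columns index Y.
    P_XY = np.array([[0.35, 0.05],
                     [0.30, 0.30]])

    # MAP of (X, Y) jointly: the single most probable cell of the joint table.
    x_star, y_star = np.unravel_index(np.argmax(P_XY), P_XY.shape)
    print("MAP of (X, Y):", (x_star, y_star))   # (0, 0), with probability 0.35

    # MAP of X alone: marginalize Y out first, then maximize.
    P_X = P_XY.sum(axis=1)                      # [0.4, 0.6]
    print("MAP of X:", int(np.argmax(P_X)))     # 1, with probability 0.6

The two answers disagree on X, which is exactly the point: maximizing the joint is not the same as maximizing a marginal.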
Complexity of Inference

Computing the a posteriori belief P(X | e) in a GM is NP-hard in general. Hardness implies we cannot find a general procedure that works efficiently for arbitrary GMs.
For particular families of GMs, we have provably efficient procedures, e.g., trees.
For other families, we need to design efficient approximate inference algorithms, e.g., grids.

Approaches to Inference

Exact inference algorithms:
  the variable elimination algorithm
  the message-passing algorithm (sum-product, belief propagation)
  the junction tree algorithm
Approximate inference algorithms:
  sampling methods / stochastic simulation
  variational algorithms

Marginalization and Elimination

A metabolic pathway: what is the likelihood that protein E is produced? [Figure: the chain A → B → C → D → E.]
Query: P(E).
P(E) = ∑_d ∑_c ∑_b ∑_a P(a, b, c, d, E)
Using the graphical model, the joint factorizes, so
P(E) = ∑_d ∑_c ∑_b ∑_a P(a) P(b|a) P(c|b) P(d|c) P(E|d)
Naive summation needs to enumerate an exponential number of terms.

Elimination in Chains

Rearranging the terms and the summations:
P(E) = ∑_d ∑_c ∑_b ∑_a P(a) P(b|a) P(c|b) P(d|c) P(E|d)
     = ∑_d P(E|d) ∑_c P(d|c) ∑_b P(c|b) ∑_a P(a) P(b|a)

Elimination in Chains (cont.)

Now we can perform the innermost summation efficiently:
P(E) = ∑_d P(E|d) ∑_c P(d|c) ∑_b P(c|b) ∑_a P(a) P(b|a)
     = ∑_d P(E|d) ∑_c P(d|c) ∑_b P(c|b) p(b)
This step is equivalent to a matrix-vector multiplication, costing |Val(A)| × |Val(B)| operations. The innermost summation eliminates one variable from the summation at a local cost.

Elimination in Chains (cont.)

Rearranging and summing again:
P(E) = ∑_d P(E|d) ∑_c P(d|c) ∑_b P(c|b) p(b)
     = ∑_d P(E|d) ∑_c P(d|c) p(c)
Again a matrix-vector multiplication, costing |Val(B)| × |Val(C)| operations. For example, with

p(b):   P(B=0) = 0.35, P(B=1) = 0.65
P(C|B):        B=0    B=1
        C=0    0.15   0.25
        C=1    0.85   0.75

we get p(C=0) = 0.15 · 0.35 + 0.25 · 0.65 and p(C=1) = 0.85 · 0.35 + 0.75 · 0.65.

Elimination in Chains (cont.)

Eliminating nodes one by one all the way to the end:
P(E) = ∑_d P(E|d) p(d)
Computational complexity for a chain of length k, with n values per variable: each step
Ψ(x_i) = ∑_{x_{i-1}} P(x_i | x_{i-1}) p(x_{i-1})
costs O(|Val(X_{i-1})| × |Val(X_i)|) operations, so the total is O(k n²). Compare this to naive summation,
∑_{x_1} … ∑_{x_{k-1}} P(x_1, …, x_k),
which is O(n^k).

Undirected Chains

For the undirected chain A - B - C - D - E, rearrange terms and perform local summations in the same way:
P(E) = ∑_d ∑_c ∑_b ∑_a (1/Z) Ψ(a, b) Ψ(b, c) Ψ(c, d) Ψ(E, d)
     = (1/Z) ∑_d Ψ(E, d) ∑_c Ψ(c, d) ∑_b Ψ(b, c) ∑_a Ψ(a, b)

The Sum-Product Operation

During inference, we repeatedly compute an expression in sum-product form:
∑_Z ∏_{Ψ ∈ F} Ψ
where
  X = {X_1, …, X_n} is the set of variables,
  F is a set of factors such that for each Ψ ∈ F, Scope[Ψ] ⊆ X,
  Y ⊆ X is the set of query variables, and
  Z = X − Y is the set of variables to eliminate.
The result of eliminating the variables in Z is a factor
τ(Y) = ∑_Z ∏_{Ψ ∈ F} Ψ
This factor does not necessarily correspond to any probability or conditional probability in the network. To obtain one, renormalize:
P(Y) = τ(Y) / ∑_y τ(y)

Inference via Variable Elimination

General idea: write the query in the form
P(X_1, e) = ∑_{x_n} … ∑_{x_3} ∑_{x_2} ∏_i P(x_i | Pa_i)
The sum is ordered to suggest an elimination order. Then, iteratively:
  move all irrelevant terms outside of the innermost sum,
  perform the innermost sum, getting a new term,
  insert the new term into the product.
Finally, renormalize:
P(X_1 | e) = P(X_1, e) / ∑_{x_1} P(x_1, e)

A More Complex Network

A food web: [Figure: a network over A, …, H; from the factorization below, the edges are B → C, A → D, A → F, C → E, D → E, E → G, E → H, F → H.]
What is the probability P(A | H) that hawks are leaving given that the grass condition is poor?
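Before moving to the food web, here is a minimal NumPy sketch of chain elimination. P_C_B and the resulting p(b) match the slide's numbers; the remaining tables (p_a, P_B_A, P_D_C, P_E_D) are made-up placeholders:

    import numpy as np

    # Chain A -> B -> C -> D -> E, all variables binary.
    # Convention: P_B_A[b, a] = P(B=b | A=a), so each column sums to one.
    p_a   = np.array([0.5, 0.5])       # placeholder prior, chosen so p(b) = [0.35, 0.65]
    P_B_A = np.array([[0.6, 0.1],
                      [0.4, 0.9]])     # placeholder CPT
    P_C_B = np.array([[0.15, 0.25],
                      [0.85, 0.75]])   # P(C|B) from the slide
    P_D_C = np.array([[0.6, 0.2],
                      [0.4, 0.8]])     # placeholder CPT
    P_E_D = np.array([[0.9, 0.3],
                      [0.1, 0.7]])     # placeholder CPT

    # Each elimination step is one matrix-vector product, O(n^2):
    p_b = P_B_A @ p_a     # sum_a P(b|a) p(a)  -> [0.35, 0.65]
    p_c = P_C_B @ p_b     # sum_b P(c|b) p(b)
    p_d = P_D_C @ p_c     # sum_c P(d|c) p(c)
    p_e = P_E_D @ p_d     # sum_d P(E|d) p(d)
    print("P(E):", p_e)   # a proper distribution: entries sum to 1

Four O(n²) products replace the O(n^k) naive sum over all joint configurations (here k = 5).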
Example: Variable Elimination

Query: P(A | h); we need to eliminate B, C, D, E, F, G, H.
Initial factors:
P(a) P(b) P(c|b) P(d|a) P(e|c,d) P(f|a) P(g|e) P(h|e,f)
Choose an elimination order: H, G, F, E, D, C, B.

Step 1: eliminate H by conditioning (fix the evidence node to its observed value):
m_h(e, f) = P(H = h | e, f)
⇒ P(a) P(b) P(c|b) P(d|a) P(e|c,d) P(f|a) P(g|e) m_h(e, f)

Step 2: eliminate G. Compute
m_g(e) = ∑_g P(g|e) = 1
⇒ P(a) P(b) P(c|b) P(d|a) P(e|c,d) P(f|a) m_h(e, f)

Step 3: eliminate F. Compute
m_f(e, a) = ∑_f P(f|a) m_h(e, f)
⇒ P(a) P(b) P(c|b) P(d|a) P(e|c,d) m_f(e, a)

Step 4: eliminate E. Compute
m_e(a, c, d) = ∑_e P(e|c,d) m_f(e, a)
⇒ P(a) P(b) P(c|b) P(d|a) m_e(a, c, d)

Step 5: eliminate D. Compute
m_d(a, c) = ∑_d P(d|a) m_e(a, c, d)
⇒ P(a) P(b) P(c|b) m_d(a, c)

Step 6: eliminate C. Compute
m_c(a, b) = ∑_c P(c|b) m_d(a, c)
⇒ P(a) P(b) m_c(a, b)

Step 7: eliminate B. Compute
m_b(a) = ∑_b P(b) m_c(a, b)
⇒ P(a) m_b(a)

Step 8: renormalize over A. The remaining factor is p(a, h) = P(a) m_b(a); computing p(h) = ∑_a P(a) m_b(a) gives
P(A | h) = P(a) m_b(a) / ∑_a P(a) m_b(a)

[Figure: after each step, the graph with the eliminated node removed.]
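The walkthrough above mechanizes directly. Below is a compact, generic variable-elimination sketch in Python: a factor is a pair (variable names, table), and eliminating a variable multiplies the factors that mention it and sums it out. Only the graph structure matches the food web; all CPT values are random placeholders, and the observed value H = 1 is an arbitrary stand-in for h:

    import numpy as np
    from functools import reduce

    def multiply(f1, f2):
        # Factor product: align shared variables, keep the union of scopes.
        (v1, a1), (v2, a2) = f1, f2
        out = tuple(dict.fromkeys(v1 + v2))
        letter = {v: chr(97 + i) for i, v in enumerate(out)}
        sub = lambda vs: ''.join(letter[v] for v in vs)
        return out, np.einsum(f"{sub(v1)},{sub(v2)}->{sub(out)}", a1, a2)

    def marginalize(f, var):
        # Sum a variable out of a factor.
        vs, a = f
        return tuple(v for v in vs if v != var), a.sum(axis=vs.index(var))

    def eliminate(factors, order):
        # Variable elimination in the given order; returns the final factor tau.
        for var in order:
            involved = [f for f in factors if var in f[0]]
            rest = [f for f in factors if var not in f[0]]
            factors = rest + [marginalize(reduce(multiply, involved), var)]
        return reduce(multiply, factors)

    rng = np.random.default_rng(0)
    def cpt(*shape):
        # Random CPT with the child variable on axis 0 (normalized over the child).
        t = rng.random(shape)
        return t / t.sum(axis=0, keepdims=True)

    P_H_EF = cpt(2, 2, 2)                    # P(h|e,f), variables ordered (H, E, F)
    factors = [
        (('A',), cpt(2)), (('B',), cpt(2)),
        (('C', 'B'), cpt(2, 2)),             # P(c|b)
        (('D', 'A'), cpt(2, 2)),             # P(d|a)
        (('E', 'C', 'D'), cpt(2, 2, 2)),     # P(e|c,d)
        (('F', 'A'), cpt(2, 2)),             # P(f|a)
        (('G', 'E'), cpt(2, 2)),             # P(g|e)
        (('E', 'F'), P_H_EF[1]),             # Step 1: conditioning, m_h(e,f) = P(H=1|e,f)
    ]
    scope, tau = eliminate(factors, order=['G', 'F', 'E', 'D', 'C', 'B'])
    print(scope, tau / tau.sum())            # P(A | h), after renormalizing

The elimination order matters only for cost, not correctness: any order yields the same tau up to the scopes of the intermediate factors it creates.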
Complexity of Variable Elimination

Suppose in one elimination step we compute
m_x(y_1, …, y_k) = ∑_x m'_x(x, y_1, …, y_k), where m'_x(x, y_1, …, y_k) = ∏_{i=1}^k m_i(x, y_{c_i})
This requires
  k · |Val(X)| · ∏_i |Val(Y_{c_i})| multiplications (for each value of x, y_1, …, y_k, we do k multiplications), and
  |Val(X)| · ∏_i |Val(Y_{c_i})| additions (for each value of y_1, …, y_k, we do |Val(X)| additions).
The complexity is exponential in the number of variables in the intermediate factor.

From Variable Elimination to Message Passing

Recall that the dependencies induced during marginalization are captured in elimination cliques:
  summation ↔ elimination
  intermediate terms ↔ elimination cliques
Can this lead to a generic inference algorithm?

Tree Graphical Models

Undirected tree: a unique path exists between any pair of nodes.
Directed tree: all nodes except the root have exactly one parent.

Equivalence of Directed and Undirected Trees

Any undirected tree can be converted to a directed tree by choosing a root node and directing all edges away from it. A directed tree and the corresponding undirected tree make the same conditional independence assertions, and their parameterizations are essentially the same.
Undirected tree: P(x) = (1/Z) ∏_{i ∈ V} Ψ(x_i) ∏_{(i,j) ∈ E} Ψ(x_i, x_j)
Directed tree: P(x) = P(x_r) ∏_{(i,j) ∈ E} P(x_j | x_i)
Equivalence: Ψ(x_r) = P(x_r), Ψ(x_i, x_j) = P(x_j | x_i), Z = 1, and Ψ(x_i) = 1 for i ≠ r.

From Variable Elimination to Message Passing

Recall the variable elimination algorithm:
  choose an ordering in which the query node f is the final node,
  eliminate node i by removing all potentials containing i and taking the sum/product over x_i,
  place the resulting factor back into the pool.
For a tree graphical model:
  choose the query node f as the root of the tree,
  view the tree as a directed tree with edges pointing towards f.
Elimination of each node can then be viewed as message passing directly along the tree branches, rather than on some transformed graph. Thus we can use the tree itself as a data structure for inference.

Message Passing for Trees

Let m_{ji}(x_i) denote the factor resulting from eliminating all the variables below j, up to i; it is a function of x_i:
m_{ji}(x_i) = ∑_{x_j} Ψ(x_j) Ψ(x_i, x_j) ∏_{k ∈ N(j)∖i} m_{kj}(x_j)
This is like a message sent from j to i. At the root,
p(x_f) ∝ Ψ(x_f) ∏_{e ∈ N(f)} m_{ef}(x_f)
m_{ji}(x_i) represents a belief about x_i coming from x_j.
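Finally, a minimal recursive sketch of this message-passing scheme on a small tree (binary variables; the tree shape and all potential values are made-up placeholders):

    import numpy as np

    # A star-shaped tree: node 0 joined to nodes 1, 2, 3 (any tree works).
    neighbors = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

    # Singleton potentials Psi(x_i) and pairwise potentials Psi(x_i, x_j).
    psi = {i: np.array([0.6, 0.4]) for i in neighbors}
    psi_pair = {(i, j): np.array([[0.8, 0.2],
                                  [0.2, 0.8]])
                for i in neighbors for j in neighbors[i] if i < j}

    def pair(i, j):
        # Pairwise potential as a table indexed [x_i, x_j], whichever way it is stored.
        return psi_pair[(i, j)] if (i, j) in psi_pair else psi_pair[(j, i)].T

    def message(j, i):
        # m_ji(x_i) = sum_{x_j} Psi(x_j) Psi(x_i, x_j) prod_{k in N(j)\i} m_kj(x_j)
        belief = psi[j].copy()
        for k in neighbors[j]:
            if k != i:
                belief *= message(k, j)
        return pair(i, j) @ belief

    def marginal(f):
        # p(x_f) ∝ Psi(x_f) prod_{e in N(f)} m_ef(x_f)
        b = psi[f].copy()
        for e in neighbors[f]:
            b *= message(e, f)
        return b / b.sum()

    print(marginal(0))   # treat node 0 as the root/query
    print(marginal(2))   # any node can serve as the query

Each call recomputes messages from scratch; a full sum-product implementation would cache them with two sweeps (leaves to root, then root to leaves) so that all marginals come from a single pass. This sketch simply mirrors the elimination view above.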