INFERENCE IN BAYESIAN NETWORKS

AGENDA
- Reading off independence assumptions
- Efficient inference in Bayesian networks:
  - Top-down inference
  - Variable elimination
  - Monte-Carlo methods

SOME APPLICATIONS OF BN
- Medical diagnosis
- Troubleshooting of hardware/software systems
- Fraud / uncollectible-debt detection
- Data mining
- Analysis of genetic sequences
- Data interpretation, computer vision, image understanding

MORE COMPLICATED SINGLY-CONNECTED BELIEF NET
[Figure: car-diagnosis network over Battery, Radio, Gas, SparkPlugs, Starts, Moves]
[Figure: image-understanding network with region variables R1-R4, each taking a value in {Sky, Tree, Grass, Rock}, linked by "Above" relations]
[Figure: a BN used to evaluate insurance risks]

BN FROM LAST LECTURE
[Figure: Burglary -> Alarm <- Earthquake; Alarm -> JohnCalls; Alarm -> MaryCalls]
- Intuitive meaning of an arc from X to Y: "X has direct influence on Y"
- A directed acyclic graph: arcs point from causes to effects

ARCS DO NOT NECESSARILY ENCODE CAUSALITY!
[Figure: two chains, A -> B -> C and C -> B -> A]
Two BNs that can encode the same joint probability distribution.

READING OFF INDEPENDENCE RELATIONSHIPS
Consider the chain A -> B -> C. Given B, does the value of A affect the probability of C?
No: P(C|B,A) = P(C|B). C's parent (B) is given, and so C is independent of its non-descendants (here, A).
Independence is symmetric: C ⊥ A | B  =>  A ⊥ C | B

WHAT DOES THE BN ENCODE?
A node is independent of its non-descendants, given its parents. For the alarm network:
- Burglary ⊥ Earthquake
- JohnCalls ⊥ MaryCalls | Alarm
- JohnCalls ⊥ Burglary | Alarm
- JohnCalls ⊥ Earthquake | Alarm
- MaryCalls ⊥ Burglary | Alarm
- MaryCalls ⊥ Earthquake | Alarm

READING OFF INDEPENDENCE RELATIONSHIPS
How about Burglary ⊥ Earthquake | Alarm? No! Why?
P(B,E|A) = P(A|B,E)P(B)P(E)/P(A) ≈ 0.00075, whereas P(B|A)P(E|A) ≈ 0.086.
Conditioning on Alarm makes Burglary and Earthquake dependent. (A numerical check of these values appears below, after the sparse-models slide.)

READING OFF INDEPENDENCE RELATIONSHIPS
How about Burglary ⊥ Earthquake | JohnCalls? No! Why?
Knowing JohnCalls affects the probability of Alarm, which in turn makes Burglary and Earthquake dependent.

INDEPENDENCE RELATIONSHIPS
Rough intuition (this holds for tree-like graphs, i.e., polytrees):
- Evidence on the (directed) road between two variables makes them independent
- Evidence on an "A" node (a common ancestor) makes its descendants independent
- Evidence on a "V" node (a common descendant), or below the V, makes the ancestors dependent (otherwise they are independent)
Formal property in the general case: d-separation implies independence (see R&N).

BENEFITS OF SPARSE MODELS
Modeling
- Fewer relationships need to be encoded (either through understanding or statistics)
- Large networks can be built up from smaller ones
Intuition
- Dependencies/independencies between variables can be read off the network structure
Tractable inference
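The explaining-away numbers above can be reproduced directly from the CPTs of the alarm network (listed in the top-down inference slides below). A minimal sketch in Python; the code and its names are illustrative, not from the lecture:

```python
# Numerical check of "explaining away": B and E are marginally independent,
# but become dependent given A.  CPT values are from the alarm network slides.
from itertools import product

P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=true | B, E)

def joint(b, e, a):
    """P(B=b, E=e, A=a) via the BN factorization P(B)P(E)P(A|B,E)."""
    p_a = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    return P_B[b] * P_E[e] * p_a

p_a = sum(joint(b, e, True) for b, e in product((True, False), repeat=2))
p_be_a = joint(True, True, True) / p_a
p_b_a = sum(joint(True, e, True) for e in (True, False)) / p_a
p_e_a = sum(joint(b, True, True) for b in (True, False)) / p_a

print(f"P(A)         = {p_a:.5f}")            # ~0.00252
print(f"P(B,E|A)     = {p_be_a:.5f}")         # ~0.00075
print(f"P(B|A)P(E|A) = {p_b_a * p_e_a:.5f}")  # ~0.086
```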
TOP-DOWN INFERENCE
Suppose we want to compute P(Alarm). The CPTs of the alarm network:

  P(B) = 0.001      P(E) = 0.002

  B E | P(A|B,E)      A | P(J|A)      A | P(M|A)
  T T | 0.95          T | 0.90        T | 0.70
  T F | 0.94          F | 0.05        F | 0.01
  F T | 0.29
  F F | 0.001

1. P(Alarm) = Σ_{b,e} P(A,b,e)
2. P(Alarm) = Σ_{b,e} P(A|b,e) P(b) P(e)
3. P(A) = P(A|B,E)P(B)P(E) + P(A|B,¬E)P(B)P(¬E) + P(A|¬B,E)P(¬B)P(E) + P(A|¬B,¬E)P(¬B)P(¬E)
4. P(A) = 0.95·0.001·0.002 + 0.94·0.001·0.998 + 0.29·0.999·0.002 + 0.001·0.999·0.998 = 0.00252

TOP-DOWN INFERENCE
Now, suppose we want to compute P(MaryCalls):
1. P(M) = P(M|A)P(A) + P(M|¬A)P(¬A)
2. P(M) = 0.70·0.00252 + 0.01·(1 − 0.00252) ≈ 0.0117

TOP-DOWN INFERENCE WITH EVIDENCE
Suppose we want to compute P(Alarm|Earthquake), written P(A|e):
1. P(A|e) = Σ_b P(A,b|e)
2. P(A|e) = Σ_b P(A|b,e) P(b)
3. P(A|e) = 0.95·0.001 + 0.29·0.999 = 0.29066

TOP-DOWN INFERENCE
- Only works if the graph of ancestors of the query variable is a polytree
- Evidence must be given on ancestor(s) of the query variable
- Efficient: O(d·2^k) time, where d is the number of ancestors of the variable and k is a bound on the number of parents per node
- Evidence on an ancestor cuts off the influence of the portion of the graph above the evidence node
(A code sketch of the computations above follows.)
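These top-down computations are easy to mechanize. A minimal sketch, assuming the CPTs above (the helper names are illustrative, not from the lecture):

```python
# Top-down inference in the alarm network: P(Alarm), P(MaryCalls), P(Alarm|E).
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=true | b, e)
P_M = {True: 0.70, False: 0.01}                      # P(M=true | a)

def p(prob_true, value):
    """P(X=value) for a Boolean X with P(X=true) = prob_true."""
    return prob_true if value else 1 - prob_true

# Step 2: P(Alarm) = sum_{b,e} P(A|b,e) P(b) P(e)
p_alarm = sum(P_A[(b, e)] * p(P_B, b) * p(P_E, e)
              for b in (True, False) for e in (True, False))
print(f"P(Alarm)     = {p_alarm:.5f}")    # ~0.00252

# P(MaryCalls) = P(M|A)P(A) + P(M|not A)P(not A)
p_mary = P_M[True] * p_alarm + P_M[False] * (1 - p_alarm)
print(f"P(MaryCalls) = {p_mary:.5f}")     # ~0.0117

# With evidence Earthquake=true: P(A|e) = sum_b P(A|b, e=true) P(b)
p_alarm_e = sum(P_A[(b, True)] * p(P_B, b) for b in (True, False))
print(f"P(Alarm|E)   = {p_alarm_e:.5f}")  # ~0.29066
```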
QUERYING THE BN
[Figure: Cavity -> Toothache]

  P(C) = 0.1
  C | P(T|C)
  T | 0.4
  F | 0.01111

The BN gives P(T|C). What about P(C|T)?

BAYES' RULE
P(A∧B) = P(A|B) P(B) = P(B|A) P(A)
So: P(A|B) = P(B|A) P(A) / P(B)

APPLYING BAYES' RULE
Let A be a cause, B be an effect, and let's say we know P(B|A) and P(A) (the conditional probability tables). What's P(B)?
P(B) = Σ_a P(B, A=a)            [marginalization]
P(B, A=a) = P(B|A=a) P(A=a)     [conditional probability]
So: P(B) = Σ_a P(B|A=a) P(A=a)

APPLYING BAYES' RULE
What's P(A|B)?
P(A|B) = P(B|A) P(A) / P(B)                       [Bayes' rule]
P(B) = Σ_a P(B|A=a) P(A=a)                        [last slide]
So: P(A|B) = P(B|A) P(A) / [Σ_a P(B|A=a) P(A=a)]

HOW DO WE READ THIS?
P(A|B) = P(B|A) P(A) / [Σ_a P(B|A=a) P(A=a)]
is an equation that holds for all values A can take on and all values B can take on. Spelled out for particular values:
P(A=a|B=b) = P(B=b|A=a) P(A=a) / [Σ_a' P(B=b|A=a') P(A=a')]
Is the a in the numerator the same as the a in the sum? NO! Be careful about indices: the sum ranges over all values a', independently of the particular a on the left-hand side.

QUERYING THE BN
P(Cavity|Toothache) = P(Toothache|Cavity) P(Cavity) / P(Toothache)     [Bayes' rule]
The denominator is computed by summing the numerator over Cavity and ¬Cavity:
P(Toothache) = 0.4·0.1 + 0.01111·0.9 ≈ 0.05, so P(Cavity|Toothache) = 0.04/0.05 = 0.8.
Querying a BN is just applying Bayes' rule on a larger scale...

PERFORMING INFERENCE
- Variables X; evidence set E=e; query variable Q
- Want to compute the posterior probability distribution over Q, given E=e
- Let the non-evidence variables be Y (= X \ E)
Straightforward method:
1. Compute the joint P(Y, E=e)
2. Marginalize to get P(Q, E=e)
3. Divide by P(E=e) to get P(Q|E=e)

INFERENCE IN THE ALARM EXAMPLE
P(J|MaryCalls) = ??   (evidence: MaryCalls; query: JohnCalls)
1. P(J,A,B,E,MaryCalls) = P(J|A) P(MaryCalls|A) P(A|B,E) P(B) P(E)
   This is an instance of P(x1 x2 … xn) = Π_{i=1,…,n} P(xi | parents(Xi)); with MaryCalls fixed, it is a table of 2^4 entries of the full joint distribution.
2. P(J, MaryCalls) = Σ_{a,b,e} P(J, A=a, B=b, E=e, MaryCalls)
   (2 entries: one for JohnCalls, the other for ¬JohnCalls)
3. P(J|MaryCalls) = P(J, MaryCalls) / P(MaryCalls) = P(J, MaryCalls) / (Σ_j P(j, MaryCalls))

HOW EXPENSIVE?
P(X) = P(x1 x2 … xn) = Π_{i=1,…,n} P(xi | parents(Xi))
Straightforward method:
1. Use the above to compute P(Y, E=e)
2. P(Q, E=e) = Σ_{y1} … Σ_{yk} P(Y, E=e)
3. P(E=e) = Σ_q P(Q, E=e)   (a normalization factor; no big deal once we have P(Q, E=e))
Step 1 builds a table with O(2^(n−|E|)) entries! Can we do better?
(A sketch of the straightforward method follows; variable elimination comes next.)
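For concreteness, here is a minimal sketch of the straightforward method for P(J | MaryCalls=true) in the alarm network (illustrative code; CPTs from the slides). It enumerates the 2^4 joint entries consistent with the evidence:

```python
# Straightforward inference: enumerate the joint, marginalize, normalize.
from itertools import product

P_B, P_E = 0.001, 0.002
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}
P_J = {1: 0.90, 0: 0.05}   # P(J=1 | a)
P_M = {1: 0.70, 0: 0.01}   # P(M=1 | a)

def p(prob_true, value):
    return prob_true if value else 1 - prob_true

def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) via the BN factorization."""
    return (p(P_B, b) * p(P_E, e) * p(P_A[(b, e)], a)
            * p(P_J[a], j) * p(P_M[a], m))

# Steps 1-2: P(J=j, M=1), summing out the non-evidence variables B, E, A.
p_jm = {j: sum(joint(b, e, a, j, 1) for b, e, a in product((0, 1), repeat=3))
        for j in (0, 1)}

# Step 3: normalize by P(M=1) = sum_j P(J=j, M=1).
print(f"P(J=1 | M=1) = {p_jm[1] / (p_jm[0] + p_jm[1]):.4f}")  # roughly 0.178
```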
VARIABLE ELIMINATION
Consider the linear network X1 → X2 → X3:
P(X) = P(X1) P(X2|X1) P(X3|X2)
P(X3) = Σ_x1 Σ_x2 P(x1) P(x2|x1) P(X3|x2)
      = Σ_x2 P(X3|x2) Σ_x1 P(x1) P(x2|x1)     [rearrange the equation]
      = Σ_x2 P(X3|x2) P(x2)
The inner sum P(x2) is computed once for each value of X2 and cached, then reused for both values of X3. How many multiplications and additions are saved?
  multiplications: 2·4·2 = 16 vs. 4+4 = 8
  additions: 2·3 = 6 vs. 2+1 = 3
This can lead to huge gains in larger networks. (A code sketch follows the non-polytree slide below.)

VE IN ALARM EXAMPLE
P(E|j,m) = P(E,j,m) / P(j,m)
P(E,j,m) = Σ_a Σ_b P(E) P(b) P(a|E,b) P(j|a) P(m|a)
         = P(E) Σ_b P(b) Σ_a P(a|E,b) P(j|a) P(m|a)
         = P(E) Σ_b P(b) P(j,m|E,b)     [inner factor computed for all values of E, b]
         = P(E) P(j,m|E)                [computed for all values of E]

WHAT ORDER TO PERFORM VE?
- For tree-like BNs (polytrees), order the variables so that parents come before children
- The number of entries in each intermediate probability table is then 2^(# of parents of the node)
- If the number of parents per node is bounded, VE runs in linear time!
- In other networks, intermediate factors may become large

NON-POLYTREE NETWORKS
[Figure: A -> B, A -> C, B -> D, C -> D]
P(D) = Σ_a Σ_b Σ_c P(A) P(B|A) P(C|A) P(D|B,C)
     = Σ_b Σ_c P(D|B,C) Σ_a P(A) P(B|A) P(C|A)
No more simplifications: the inner sum over a produces a factor over B and C jointly, which does not decompose further.
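To make the elimination concrete, a minimal sketch on the chain X1 → X2 → X3. The CPT numbers here are invented for illustration (the slides give none for this network); only the factorization comes from the lecture:

```python
# Variable elimination on X1 -> X2 -> X3: eliminate X1, cache the result,
# then eliminate X2.  CPT values below are made-up placeholders.
p_x1 = {0: 0.6, 1: 0.4}                    # P(X1)
p_x2 = {(0, 0): 0.7, (0, 1): 0.3,          # P(X2=x2 | X1=x1), keyed (x1, x2)
        (1, 0): 0.2, (1, 1): 0.8}
p_x3 = {(0, 0): 0.9, (0, 1): 0.1,          # P(X3=x3 | X2=x2), keyed (x2, x3)
        (1, 0): 0.5, (1, 1): 0.5}

# Eliminate X1: f(x2) = sum_x1 P(x1) P(x2|x1), which is exactly P(X2);
# computed once per value of X2 and cached.
f_x2 = {x2: sum(p_x1[x1] * p_x2[(x1, x2)] for x1 in (0, 1)) for x2 in (0, 1)}

# Eliminate X2: P(x3) = sum_x2 f(x2) P(x3|x2), reusing the cached factor.
p_x3_marginal = {x3: sum(f_x2[x2] * p_x3[(x2, x3)] for x2 in (0, 1))
                 for x3 in (0, 1)}
print(p_x3_marginal)   # a valid distribution: the two entries sum to 1
```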
APPROXIMATE INFERENCE TECHNIQUES
Based on the idea of Monte-Carlo simulation.
Basic idea: to estimate the probability that a coin flips heads, flip it a huge number of times and count the fraction of heads observed.
Conditional simulation: to estimate the probability P(H) that a coin picked out of a bucket B flips heads:
1. Pick a coin C out of B (occurs with probability P(C))
2. Flip C and observe whether it comes up heads (occurs with probability P(H|C))
3. Put C back and repeat from step 1 many times
4. Return the fraction of heads observed (an estimate of P(H))

APPROXIMATE INFERENCE: MONTE-CARLO SIMULATION
Sample from the joint distribution by sampling each variable given its (already sampled) parents, e.g.:
  B=0 E=0 A=0 J=1 M=0
As more samples are generated, the distribution of the samples approaches the joint distribution:
  B=0 E=0 A=0 J=1 M=0
  B=0 E=0 A=0 J=0 M=0
  B=0 E=0 A=0 J=0 M=0
  B=1 E=0 A=1 J=1 M=0
Inference: given evidence E=e (e.g., J=1), remove the samples that conflict with the evidence (above, only the first and last samples survive). The distribution of the remaining samples approximates the conditional distribution! (See the sampling sketches at the end of these notes.)

HOW MANY SAMPLES?
The error of the estimate, for n samples, is on average O(1/√n).
Variance-reduction techniques can reduce the constant factor.

RARE EVENT PROBLEM
What if some events are really rare (e.g., burglary ∧ earthquake)?
The number of samples must be huge to get a reasonable estimate.
Solution: likelihood weighting
- Enforce that each sample agrees with the evidence
- While generating a sample, keep track of the ratio
    (how likely the sampled value is to occur in the real world) /
    (how likely you were to generate the sampled value)

LIKELIHOOD WEIGHTING
Suppose the evidence is Alarm ∧ MaryCalls. Sample B and E with P=0.5 each, enforce the evidence, and update the weight w as we go (w starts at 1):

  Sample B=0, E=1:  w = (0.999·0.002)/(0.5·0.5) ≈ 0.008
  Enforce A=1:      w ≈ 0.008·0.29 ≈ 0.0023   (A=1 is enforced, and the weight is updated to reflect the likelihood that this occurs)
  Enforce M=1:      w ≈ 0.0023·0.70 ≈ 0.0016; then sample J, say J=1

  Sample B=0, E=0:  w = (0.999·0.998)/(0.5·0.5) ≈ 3.988
  Enforce A=1:      w ≈ 3.988·0.001 ≈ 0.004
  Enforce M=1:      w ≈ 0.004·0.70 ≈ 0.0028; sample J=1

  Sample B=1, E=0:  enforce A=1: w ≈ 0.00375; enforce M=1: w ≈ 0.0026; sample J=1
  Sample B=1, E=1:  enforce A=1 and M=1: w ≈ 5·10⁻⁶, i.e., ≈ 0; sample J=1

LIKELIHOOD WEIGHTING
The four weighted samples:
  w = 0.0016   B=0 E=1 A=1 M=1 J=1
  w = 0.0028   B=0 E=0 A=1 M=1 J=1
  w = 0.0026   B=1 E=0 A=1 M=1 J=1
  w ≈ 0        B=1 E=1 A=1 M=1 J=1
N=4 gives P(B|A,M) ≈ 0.371; exact inference gives P(B|A,M) = 0.375. (A code sketch appears at the end of these notes.)

RECAP
- Efficient inference in BNs
- Variable elimination
- Approximate methods: Monte-Carlo sampling

NEXT LECTURE
Statistical learning: from data to distributions
R&N 20.1-2
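APPENDIX: SAMPLING SKETCHES
A minimal sketch of the two sampling schemes from the lecture, both estimating P(B=1 | A=1, M=1) in the alarm network (exact value ≈ 0.375). The code and its names are illustrative, not from the lecture; the CPTs are from the slides. Note how rejection sampling wastes almost all samples because the evidence A=1, M=1 is rare, while likelihood weighting keeps every sample:

```python
# Rejection sampling vs. likelihood weighting on the alarm network.
import random

P_B, P_E = 0.001, 0.002
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}
P_M = {1: 0.70, 0: 0.01}

def flip(p):
    return 1 if random.random() < p else 0

def rejection_sampling(n):
    """Sample the joint; keep only samples that agree with A=1, M=1."""
    hits = kept = 0
    for _ in range(n):
        b, e = flip(P_B), flip(P_E)
        a = flip(P_A[(b, e)])
        m = flip(P_M[a])
        if a == 1 and m == 1:          # reject conflicting samples
            kept += 1
            hits += b
    return hits / kept if kept else float("nan")

def likelihood_weighting(n):
    """Sample B, E with P=0.5 as in the slides; enforce A=1, M=1 and weight."""
    num = den = 0.0
    for _ in range(n):
        b, e = flip(0.5), flip(0.5)
        w = ((P_B if b else 1 - P_B) / 0.5) * ((P_E if e else 1 - P_E) / 0.5)
        w *= P_A[(b, e)]   # likelihood of the enforced evidence A=1
        w *= P_M[1]        # likelihood of the enforced evidence M=1
        # (J would be sampled from P(J|A=1), but it does not affect this query.)
        num += w * b
        den += w
    return num / den

random.seed(0)
print("rejection: ", rejection_sampling(2_000_000))  # most samples rejected
print("weighting: ", likelihood_weighting(20_000))
```

Both estimators converge to the same posterior; likelihood weighting simply gets there with far fewer samples when the evidence is rare.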