CS B553: ALGORITHMS FOR OPTIMIZATION AND LEARNING
Bayesian Networks

AGENDA
Bayesian networks
Chain rule for Bayes nets
Naïve Bayes models
Independence declarations
D-separation
Probabilistic inference queries

PURPOSES OF BAYESIAN NETWORKS
Efficient and intuitive modeling of complex causal interactions
Compact representation of joint distributions: O(n) rather than O(2^n)
Algorithms for efficient inference with given evidence (more on this next time)

INDEPENDENCE OF RANDOM VARIABLES
Two random variables A and B are independent if P(A,B) = P(A) P(B), hence P(A|B) = P(A)
Knowing B doesn't give you any information about A
[This equality has to hold for all combinations of values that A and B can take on, i.e., all events A=a and B=b are independent]

SIGNIFICANCE OF INDEPENDENCE
If A and B are independent, then P(A,B) = P(A) P(B)
=> The joint distribution over A and B can be defined as a product of the distribution of A and the distribution of B
=> Store two much smaller probability tables rather than one large table over all combinations of A and B

CONDITIONAL INDEPENDENCE
Two random variables A and B are conditionally independent given C if P(A,B|C) = P(A|C) P(B|C), hence P(A|B,C) = P(A|C)
Once you know C, learning B doesn't give you any information about A
[Again, this has to hold for all combinations of values that A, B, and C can take on]

SIGNIFICANCE OF CONDITIONAL INDEPENDENCE
Consider Grade(CS101), Intelligence, and SAT
Ostensibly, the grade in a course doesn't have a direct relationship with SAT scores, but good students are more likely to get good SAT scores, so they are not independent…
It is reasonable to believe that Grade(CS101) and SAT are conditionally independent given Intelligence

BAYESIAN NETWORK
Explicitly represents independence among propositions
Notice that Intelligence is the "cause" of both Grade and SAT, and the causality is represented explicitly
[Network: Intelligence → Grade, Intelligence → SAT]
P(I,G,S) = P(G,S|I) P(I) = P(G|I) P(S|I) P(I)

P(I):      I=high  0.3
           I=low   0.7

P(G|I):          I=low   I=high
    G='a'        0.2     0.74
    G='b'        0.34    0.17
    G='c'        0.46    0.09

P(S|I):          I=low   I=high
    S=low        0.95    0.2
    S=high       0.05    0.8

7 independent probabilities, instead of the 11 needed for the full joint

DEFINITION: BAYESIAN NETWORK
Set of random variables X={X1,…,Xn} with domains Val(X1),…,Val(Xn)
Each node Xi has a set of parents Pa_Xi
The graph must be a DAG
Each node also maintains a conditional probability distribution (often, a table) P(Xi|Pa_Xi)
2^k independent entries for a binary-valued variable with k parents
Overall: O(n·2^k) storage for binary variables
Encodes the joint probability over X1,…,Xn
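To make the factorization concrete, here is a minimal Python sketch (my own illustration, not from the lecture) that stores the three tables above and evaluates the joint via P(I,G,S) = P(I) P(G|I) P(S|I):

```python
from itertools import product

# CPTs from the Intelligence/Grade/SAT slides above
P_I = {"high": 0.3, "low": 0.7}
P_G_given_I = {"low":  {"a": 0.20, "b": 0.34, "c": 0.46},   # P(G=g | I=i)
               "high": {"a": 0.74, "b": 0.17, "c": 0.09}}
P_S_given_I = {"low":  {"low": 0.95, "high": 0.05},         # P(S=s | I=i)
               "high": {"low": 0.20, "high": 0.80}}

def joint(i, g, s):
    # Chain rule for this network: P(I,G,S) = P(I) P(G|I) P(S|I)
    return P_I[i] * P_G_given_I[i][g] * P_S_given_I[i][s]

# Sanity check: the 12 joint entries, built from only 7 independent
# numbers, still sum to 1.
print(sum(joint(i, g, s)
          for i, g, s in product(["low", "high"], "abc", ["low", "high"])))
```

Storing the three small factors instead of the 12-entry joint table is exactly the kind of saving the O(n·2^k) bound above describes.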
CALCULATION OF JOINT PROBABILITY
[Network: Burglary → Alarm ← Earthquake; Alarm → JohnCalls; Alarm → MaryCalls]

P(B=T) = 0.001          P(E=T) = 0.002

P(A=T|B,E):    B  E
               T  T   0.95
               T  F   0.94
               F  T   0.29
               F  F   0.001

P(J=T|A):      A=T   0.90
               A=F   0.05

P(M=T|A):      A=T   0.70
               A=F   0.01

P(j∧m∧a∧¬b∧¬e) = ?

P(j∧m∧a∧¬b∧¬e)
= P(j,m | a,¬b,¬e) P(a,¬b,¬e)
= P(j | a,¬b,¬e) P(m | a,¬b,¬e) P(a,¬b,¬e)     (J and M are independent given A)
P(j | a,¬b,¬e) = P(j|a)     (J and B, and J and E, are independent given A; likewise for M)
P(m | a,¬b,¬e) = P(m|a)
P(a,¬b,¬e) = P(a|¬b,¬e) P(¬b|¬e) P(¬e) = P(a|¬b,¬e) P(¬b) P(¬e)     (B and E are independent)
P(j∧m∧a∧¬b∧¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)
= 0.9 × 0.7 × 0.001 × 0.999 × 0.998 ≈ 0.00062

The same pattern yields the full joint distribution:
P(x1, x2, …, xn) = ∏_{i=1,…,n} P(xi | pa_Xi)

CHAIN RULE FOR BAYES NETS
The joint distribution is the product of all the CPTs:
P(X1, X2, …, Xn) = ∏_{i=1,…,n} P(Xi | Pa_Xi)

EXAMPLE: NAÏVE BAYES MODELS
P(Cause, Effect1, …, Effectn) = P(Cause) ∏_i P(Effecti | Cause)
[Network: Cause → Effect1, Effect2, …, Effectn]

ADVANTAGES OF BAYES NETS (AND OTHER GRAPHICAL MODELS)
More manageable number of parameters to set and store
Incremental modeling
Explicit encoding of independence assumptions
Efficient inference techniques

ARCS DO NOT NECESSARILY ENCODE CAUSALITY
[Figure: three 3-node networks over A, B, C: two BNs with the same expressive power, and a 3rd with greater power (exercise)]

READING OFF INDEPENDENCE RELATIONSHIPS
[Network: A → B → C]
Given B, does the value of A affect the probability of C? No: P(C|B,A) = P(C|B)
C's parent (B) is given, and so C is independent of its non-descendants (A)
Independence is symmetric: C ⊥ A | B => A ⊥ C | B

BASIC RULE
A node is independent of its non-descendants given its parents (and given nothing else)

WHAT DOES THE BN ENCODE?
[Network: Burglary → Alarm ← Earthquake; Alarm → JohnCalls; Alarm → MaryCalls]
Burglary ⊥ Earthquake
JohnCalls ⊥ MaryCalls | Alarm
JohnCalls ⊥ Burglary | Alarm
JohnCalls ⊥ Earthquake | Alarm
MaryCalls ⊥ Burglary | Alarm
MaryCalls ⊥ Earthquake | Alarm
A node is independent of its non-descendants, given its parents

READING OFF INDEPENDENCE RELATIONSHIPS
How about Burglary ⊥ Earthquake | Alarm? No! Why?
P(B,E|A) = P(A|B,E) P(B,E) / P(A) ≈ 0.00075
P(B|A) P(E|A) ≈ 0.086
These are far from equal, so B and E are dependent given A

READING OFF INDEPENDENCE RELATIONSHIPS
How about Burglary ⊥ Earthquake | JohnCalls? No! Why?
Knowing JohnCalls affects the probability of Alarm, which makes Burglary and Earthquake dependent
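The numbers above can be checked by brute-force enumeration of the joint. A small sketch (my own helper code, assuming only the CPTs shown on the slides):

```python
from itertools import product

P_b, P_e = 0.001, 0.002                                  # P(B=T), P(E=T)
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}       # P(A=T | B,E)
P_j = {True: 0.90, False: 0.05}                          # P(J=T | A)
P_m = {True: 0.70, False: 0.01}                          # P(M=T | A)

def pr(value, p_true):
    # Probability of a boolean outcome, given P(var = True)
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    # Chain rule: P(j,m,a,b,e) = P(j|a) P(m|a) P(a|b,e) P(b) P(e)
    return (pr(b, P_b) * pr(e, P_e) * pr(a, P_a[(b, e)])
            * pr(j, P_j[a]) * pr(m, P_m[a]))

print(joint(False, False, True, True, True))  # ≈ 0.000628, the 0.00062 above

# B and E are dependent given A: compare P(B,E|A) with P(B|A) P(E|A)
states = list(product([False, True], repeat=5))
marg = lambda pred: sum(joint(*s) for s in states if pred(*s))
p_A = marg(lambda b, e, a, j, m: a)
print(marg(lambda b, e, a, j, m: b and e and a) / p_A)   # ≈ 0.00075
print(marg(lambda b, e, a, j, m: b and a) / p_A
      * marg(lambda b, e, a, j, m: e and a) / p_A)       # ≈ 0.086
```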
INDEPENDENCE RELATIONSHIPS
For polytrees, there exists a unique undirected path between A and B. For each node E on the path:
Linear connection (X → E → Y or X ← E ← Y): evidence on E makes X and Y independent
Diverging connection (X ← E → Y): evidence on E makes the descendants X and Y independent
Converging ("V") connection, with evidence on the V node or below it (X → E ← Y, or X → W ← Y with E a descendant of W): the evidence makes X and Y dependent (otherwise they are independent)

GENERAL CASE
Formal property in the general case: d-separation, i.e., the above properties must hold for all (acyclic) paths between A and B
D-separation => independence
That is, we can't read off any more independence relationships from the graph than those that are encoded in d-separation
The CPTs may indeed encode additional independences

PROBABILITY QUERIES
Given: some probabilistic model over variables X
Find: the distribution over Y ⊆ X given evidence E=e for some subset E ⊆ X/Y
P(Y|E=e)
This is the inference problem

ANSWERING INFERENCE PROBLEMS WITH THE JOINT DISTRIBUTION
Easiest case: Y = X/E
P(Y|E=e) = P(Y,e)/P(e)
Determine the denominator P(e) by marginalizing: P(e) = Σ_y P(Y=y, e)
(the denominator makes the probabilities sum to 1)
Otherwise, let W = X/(E ∪ Y)
P(Y|E=e) = Σ_w P(Y, W=w, e) / P(e)
P(e) = Σ_y Σ_w P(Y=y, W=w, e)
Inference with the joint distribution: O(2^|X/E|) for binary variables

NAÏVE BAYES CLASSIFIER
P(Class, Feature1, …, Featuren) = P(Class) ∏_i P(Featurei | Class)
[Network: Class → Feature1, Feature2, …, Featuren]
Class: Spam / Not Spam, English / French / Latin, …
Features: word occurrences
Given the features, what class?
P(C|F1,…,Fn) = P(C,F1,…,Fn) / P(F1,…,Fn) = (1/Z) P(C) ∏_i P(Fi|C)

NAÏVE BAYES CLASSIFIER
Given some of the features, what is the distribution over the class?
P(C|F1,…,Fk) = (1/Z) P(C,F1,…,Fk)
= (1/Z) Σ_{f_{k+1},…,f_n} P(C, F1,…,Fk, f_{k+1},…,f_n)
= (1/Z) Σ_{f_{k+1},…,f_n} P(C) ∏_{i=1,…,k} P(Fi|C) ∏_{j=k+1,…,n} P(fj|C)
= (1/Z) P(C) ∏_{i=1,…,k} P(Fi|C) ∏_{j=k+1,…,n} Σ_{fj} P(fj|C)
= (1/Z) P(C) ∏_{i=1,…,k} P(Fi|C)     (each Σ_{fj} P(fj|C) = 1)
Unobserved features simply drop out of the product

FOR GENERAL QUERIES
For BNs and queries in general, it's not that simple… more in later lectures.
Next class: skim 5.1-3, begin reading 9.1-4
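Returning to the naïve Bayes classifier above, a short sketch of the query P(C|F1,…,Fk) = (1/Z) P(C) ∏_i P(Fi|C). The class prior and per-word probabilities below are invented for illustration, not from the lecture; only the formula comes from the slides:

```python
# Illustrative naive Bayes spam classifier; all numbers are assumptions.
prior = {"spam": 0.4, "ham": 0.6}                  # P(Class), assumed
p_word = {                                          # P(word present | Class), assumed
    "offer":   {"spam": 0.30, "ham": 0.02},
    "meeting": {"spam": 0.01, "ham": 0.10},
}

def posterior(observed):
    # P(Class | observed features); unobserved features marginalize out
    unnorm = {}
    for c in prior:
        p = prior[c]
        for word, present in observed.items():
            p_present = p_word[word][c]
            p *= p_present if present else 1.0 - p_present
        unnorm[c] = p
    Z = sum(unnorm.values())                        # normalization constant
    return {c: p / Z for c, p in unnorm.items()}

print(posterior({"offer": True}))
# -> spam ≈ 0.909, ham ≈ 0.091: only the observed feature contributes;
#    the unmentioned word "meeting" drops out of the product, as derived above.
```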