Solving problems using propositional logic
• Need to write what you know as propositional formulas
• Theorem proving will then tell you whether a given new sentence holds given what you know
• Three kinds of queries
  – Is my knowledge base consistent? (i.e., is there at least one world where everything I know is true?) Satisfiability
  – Is the sentence S entailed by my knowledge base? (i.e., is it true in every world where my knowledge base is true?)
  – Is the sentence S consistent with (possibly true given) my knowledge base? (i.e., is S true in at least one of the worlds where my knowledge base holds?)
• S is consistent if ~S is not entailed
• But we cannot differentiate between degrees of likelihood among possible sentences

Example
• Pearl lives in Los Angeles. It is a high-crime area. Pearl installed a burglar alarm. He asked his neighbors John & Mary to call him if they hear the alarm. This way he can come home if there is a burglary. Los Angeles is also earthquake prone. The alarm goes off when there is an earthquake.
  Burglary => Alarm
  Earthquake => Alarm
  Alarm => John-calls
  Alarm => Mary-calls
• If there is a burglary, will Mary call? Check KB & B |= M
• If Mary didn't call, is it possible that a burglary occurred? Check that KB & ~M doesn't entail ~B (a small truth-table sketch of these checks appears at the end of this discussion)

Example (Real)
• Pearl lives in Los Angeles. It is a high-crime area. Pearl installed a burglar alarm. He asked his neighbors John & Mary to call him if they hear the alarm. This way he can come home if there is a burglary. Los Angeles is also earthquake prone. The alarm goes off when there is an earthquake.
• But Pearl lives in the real world, where (1) burglars can sometimes disable alarms, (2) some earthquakes may be too slight to set off the alarm, (3) even in Los Angeles, burglaries are more likely than earthquakes, (4) John and Mary both have their own lives and may not always call when the alarm goes off, (5) between John and Mary, John is more of a slacker than Mary, and (6) John and Mary may call even without the alarm going off.
  Burglary => Alarm
  Earthquake => Alarm
  Alarm => John-calls
  Alarm => Mary-calls
• If there is a burglary, will Mary call? Check KB & B |= M
• If Mary didn't call, is it possible that a burglary occurred? Check that KB & ~M doesn't entail ~B
• John already called. If Mary also calls, is it more likely that a burglary occurred?
• You now also hear on the TV that there was an earthquake. Is a burglary more or less likely now?

How do we handle Real Pearl? ("Potato in the tail-pipe")
• Omniscient & Eager way:
  – Model everything!
  – E.g., model exactly the conditions under which John will call: he shouldn't be listening to loud music, he hasn't gone on an errand, he didn't recently have a tiff with Pearl, etc., etc.
    A & c1 & c2 & c3 & ... & cn => J  (also, the exceptions may have interactions, e.g. c1 & c5 => ~c9)
  – The qualification and ramification problems make this an infeasible enterprise
• Ignorant (non-omniscient) and Lazy (non-omnipotent) way:
  – Model the likelihood
  – In 85% of the worlds where there was an alarm, John will actually call
  – How do we do this?
    • Non-monotonic logics
    • "Certainty factors"
    • "Fuzzy logic"
    • "Probability" theory?
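Before moving on to those approaches, here is a minimal sketch of the two purely logical checks from the Pearl example above, done by truth-table enumeration in Python. The variable and function names are illustrative choices, not from the slides.

```python
from itertools import product

VARS = ["B", "E", "A", "J", "M"]           # Burglary, Earthquake, Alarm, John-calls, Mary-calls

def implies(p, q):                          # material implication
    return (not p) or q

def kb(w):                                  # the strict propositional Pearl knowledge base
    return (implies(w["B"], w["A"]) and implies(w["E"], w["A"]) and
            implies(w["A"], w["J"]) and implies(w["A"], w["M"]))

def worlds():                               # all 2^5 truth assignments
    for vals in product([True, False], repeat=len(VARS)):
        yield dict(zip(VARS, vals))

def entails(premise, conclusion):           # premise |= conclusion: true in every model of premise
    return all(conclusion(w) for w in worlds() if premise(w))

# If there is a burglary, will Mary call?  KB & B |= M  -> prints True
print(entails(lambda w: kb(w) and w["B"], lambda w: w["M"]))
# Does KB & ~M entail ~B?  -> prints True here: with this idealized KB, Mary not calling
# rules out a burglary entirely, which is precisely why the "Real" version of the example
# needs something weaker than strict implication.
print(entails(lambda w: kb(w) and not w["M"], lambda w: not w["B"]))
```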
Non-monotonic (default) logic
• Propositional calculus (as well as the first-order logic we shall discuss later) is monotonic, in that once you prove a fact F to be true, no amount of additional knowledge can allow us to disprove F.
• But in the real world we jump to conclusions by default, and revise them on additional evidence
  – Consider the way the truth of the statement "F: Tweety flies" is revised as we are given facts in sequence:
    1. Tweety is a bird (F)
    2. Tweety is an ostrich (~F)
    3. Tweety is a magical ostrich (F)
    4. Tweety was cursed recently (~F)
    5. Tweety was able to get rid of the curse (F)
• How can we make logic support this sort of "defeasible" (aka defeatable) conclusion?
  – Many ideas, one being negation as failure
  – Let the rule about birds be Bird & ~abnormal => Fly
    • The "abnormal" predicate is treated specially: if we can't prove abnormal, we can assume ~abnormal is true
    • (Note that in normal logic, failure to prove a fact F doesn't allow us to assume that ~F is true, since F may hold in some models and not in others.)
  – The non-monotonic logic enterprise involves (1) providing clean semantics for this type of reasoning and (2) making defeasible inference efficient

Certainty Factors
• Associate numbers with each of the facts/axioms
• When you do derivations, compute the c.f. of the results in terms of the c.f. of the constituents ("truth functional")
• Problem: circular reasoning because of mixed causal/diagnostic directions
  – Raining => Grass-wet 0.9
  – Grass-wet => Raining 0.7
  – If you know grass-wet with 0.4, then we know raining, which makes the grass more wet, which…

Fuzzy Logic vs. Prob. Prop. Logic
• Fuzzy logic assumes that the world is made of statements that have different grades of truth
  – Recall the puppy example
• Fuzzy logic is "truth functional"; i.e., it assumes that the truth value of a sentence can be established in terms of the truth values only of the constituent elements of that sentence
• PPL assumes that the world is made up of statements that are either true or false
• PPL is truth functional for "truth value in a given world" but not truth functional for entailment status

Prop. Logic vs. Prob. Prop. Logic
• Model theory for logic
  – Set of all worlds where the formula/KB holds
  – Each world is either a model or not
• Proof theory/inference
  – Rather than enumerate all models, write KB sentences that constrain the models
• Model theory for prob. logic
  – For each world, give the probability p that the formula holds in that world
    • p(w)=0 means that world is definitely not a model
    • Otherwise, 0 < p(w) <= 1
    • Sum of p(w) = 1
  – AKA Joint Probability Distribution: 2^n – 1 numbers
• Proof theory: statements on subsets of propositions
  – E.g. P(A)=0.5; P(B|A)=0.7 etc.
  – (Conditional) independences…
  – These constrain the joint probability distribution

Easy Special Cases
• If there are no relations between the propositions (i.e., they can take values independently of each other)
  – Then the joint probability distribution can be specified in terms of the probability of each proposition being true
  – Just n numbers instead of 2^n (see the sketch at the end of this slide group)
• If, in addition, each proposition is equally likely to be true or false
  – Then the joint probability distribution can be specified without giving any numbers!
    • All worlds are equally probable! If there are n propositions, each world will be 1/2^n probable
    • The probability of any propositional conjunction with m (< n) propositions will be 1/2^m

Will we always need 2^n numbers?
• If every pair of variables is independent of each other, then
  – P(x1,x2…xn) = P(xn)*P(xn-1)*…*P(x1)
  – We need just n numbers!
  – But if our world were that simple, it would also be very uninteresting & uncontrollable (nothing is correlated with anything else!)
• A more realistic middle ground is that interactions between variables are confined to regions
  – E.g., the "school variables" and the "home variables" interact only loosely (are independent for most practical purposes)
  – We will wind up needing O(2^k) numbers (k << n)
• We need 2^n numbers if every subset of our n variables is correlated together
  – P(x1,x2…xn) = P(xn|x1…xn-1)*P(xn-1|x1…xn-2)*…*P(x1)
  – But that is too pessimistic an assumption about the world
    • If our world were so interconnected, we would have been dead long ago…
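As a concrete sketch of the "easy special case" above: when the n propositions are mutually independent, n numbers determine all 2^n joint entries. The marginal values below are made up for illustration.

```python
from itertools import product

# Marginal probabilities of three independent propositions (illustrative numbers)
p = {"x1": 0.3, "x2": 0.6, "x3": 0.9}

# Under full independence, the 2^3 = 8 joint entries follow from just these 3 numbers:
# P(x1, x2, x3) = P(x1) * P(x2) * P(x3), using 1 - P(xi) for negated values.
joint = {}
for vals in product([True, False], repeat=3):
    prob = 1.0
    for name, v in zip(p, vals):
        prob *= p[name] if v else 1.0 - p[name]
    joint[vals] = prob

print(sum(joint.values()))   # sums to 1 (up to floating point): a legal joint from only n numbers
```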
Probabilistic Calculus to the Rescue
  Burglary => Alarm
  Earthquake => Alarm
  Alarm => John-calls
  Alarm => Mary-calls
• Suppose we know the likelihood of each of the (propositional) worlds (aka the Joint Probability Distribution)
• Then we can use the standard rules of probability to compute the likelihood of all queries (as I will remind you)
• So, the Joint Probability Distribution is all that you ever need!
• In the case of the Pearl example, we just need the joint probability distribution over B,E,A,J,M (32 numbers)
  – In general, 2^n separate numbers (which should add up to 1)
• The topology encodes the conditional independence assertions: only 10 (instead of 32) numbers to specify!
• If the joint distribution is sufficient for reasoning, what is domain knowledge supposed to help us with?
  – Answer: indirectly, by helping us specify the joint probability distribution with fewer than 2^n numbers
  – The local relations between propositions can be seen as "constraining" the form the joint probability distribution can take!

Propositional Probabilistic Logic

Computing Prob. Queries Given a Joint Distribution
            TA      ~TA
  CA       0.04    0.06
  ~CA      0.01    0.89
• P(CA & TA) = 0.04
• P(CA) = 0.04 + 0.06 = 0.1 (marginalizing over TA)
• P(TA) = 0.04 + 0.01 = 0.05
• P(CA V TA) = P(CA) + P(TA) – P(CA & TA) = 0.1 + 0.05 – 0.04 = 0.11
• P(CA|~TA) = P(CA & ~TA)/P(~TA) = 0.06/(0.06 + 0.89) = 0.06/0.95 ≈ 0.063
• P(~TA|CA) = P(~TA & CA)/P(CA) = 0.06/0.1 = 0.6
• Check that P(CA|~TA) = P(~TA|CA)*P(CA)/P(~TA)
• Think of this as analogous to entailment by truth-table enumeration! (A small code sketch follows at the end of this slide group.)
(The material in this slide was presented on the white board.)
• If B => A, then P(A|B) = ?  P(B|~A) = ?  P(B|A) = ?

Most useful probabilistic reasoning involves computing posterior distributions
• P(CA | TA; Catch; Romney-won)?
  [Figure: the posteriors P(CA), P(CA|TA) and P(CA|TA;Catch) plotted against the variable values; the distribution sharpens as evidence accumulates]
• Important: computing a posterior distribution is inference, not learning
(The material in this slide was presented on the white board.)

Conditional Probabilities
• Non-monotonicity w.r.t. evidence: P(A|B) can be higher than, lower than, or equal to P(A)
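Here is a small sketch of "inference by enumeration" over the cavity/toothache joint table above (CA = cavity, TA = toothache). The dictionary layout and function name are just one convenient encoding, not something from the slides.

```python
# Joint distribution over CA (cavity) and TA (toothache), from the table above
joint = {(True, True): 0.04, (True, False): 0.06,
         (False, True): 0.01, (False, False): 0.89}

def prob(event):
    """Sum the joint entries of all worlds (ca, ta) satisfying the event."""
    return sum(p for (ca, ta), p in joint.items() if event(ca, ta))

p_ca_and_ta = prob(lambda ca, ta: ca and ta)          # 0.04
p_ca        = prob(lambda ca, ta: ca)                 # 0.10  (marginalizing over TA)
p_ta        = prob(lambda ca, ta: ta)                 # 0.05
p_ca_or_ta  = prob(lambda ca, ta: ca or ta)           # 0.11
p_ca_given_not_ta = prob(lambda ca, ta: ca and not ta) / prob(lambda ca, ta: not ta)  # 0.06/0.95

print(p_ca, p_ta, p_ca_or_ta, round(p_ca_given_not_ta, 3))
```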
Generalized Bayes Rule
  P(A|B,e) = P(B|A,e) * P(A|e) / P(B|e)
• Get by with easier-to-assess numbers
  – Let A be Anthrax and Rn be Runny Nose: P(A|Rn) = P(Rn|A) * P(A) / P(Rn)
• Think of this as analogous to inference rules (like modus ponens)
(The material in this slide was presented on the white board.)

Relative ease/utility of assessing various types of probabilities
• The joint distribution requires us to assess probabilities of the type P(x1,~x2,x3,…,~xn)
  – This means we have to look at all entities in the world and see what fraction of them have x1,~x2,x3,…,~xn true
  – A difficult experiment to set up…
• Conditional probabilities of the type P(A|B) are relatively easier to assess
  – You just need to look at the set of entities having B true, and find the fraction of them that also have A true
  – Eventually, they too can get baroque: P(x1,~x2,…,xm | y1,…,yn)
• Among the conditional probabilities, causal probabilities of the form P(effect|cause) are better to assess than diagnostic probabilities of the form P(cause|effect)
  – Causal probabilities tend to be more stable than diagnostic probabilities
  – (For example, a textbook in dentistry can publish P(TA|Cavity) and hope that it will hold in a variety of places. In contrast, P(Cavity|TA) may depend on other fortuitous factors; e.g., in areas where people tend to eat a lot of ice cream, many toothaches may be prevalent, and few of them may actually be due to cavities.)
(The material in this slide was presented on the white board.)

What happens if there are multiple symptoms…?
• The patient walked in and complained of a toothache
  – You assess P(Cavity|Toothache)
• Now you probe the patient's mouth with that steel thingie, and it catches…
  – How do we update our belief in Cavity?
    P(Cavity|TA,Catch) = P(TA,Catch|Cavity) * P(Cavity) / P(TA,Catch)
                       = α * P(TA,Catch|Cavity) * P(Cavity)
  – We need to know P(TA,Catch|Cavity)! With n evidence variables, we will need 2^n such probabilities!
• Conditional independence to the rescue
  – Suppose P(TA,Catch|Cavity) = P(TA|Cavity) * P(Catch|Cavity)

• Put the variables in reverse topological sort and apply the chain rule:
  P(J,M,A,~B,~E) = P(J|M,A,~B,~E) * P(M|A,~B,~E) * P(A|~B,~E) * P(~B|~E) * P(~E)
                 = P(J|A) * P(M|A) * P(A|~B,~E) * P(~B) * P(~E)   (by the conditional independence inherent in the Bayes net)
                 = 0.9 * 0.7 * 0.001 * 0.999 * 0.998 ≈ 0.000628

Local Semantics
• A node is independent of its non-descendants given its parents
• This gives the global semantics, i.e., the full joint

Conditional Independence Assertions
• We write X || Y | Z to say that the set of variables X is conditionally independent of the set of variables Y given evidence on the set of variables Z (where X, Y, Z are subsets of the set of all random variables in the domain model)
• Inference can exploit conditional independence assertions. Specifically,
  – X || Y | Z implies
    • P(X & Y|Z) = P(X|Z) * P(Y|Z)
    • P(X|Y,Z) = P(X|Z)
    • P(Y|X,Z) = P(Y|Z)
  – If A || B | C then P(A,B,C) = P(A|B,C)*P(B,C) = P(A|B,C)*P(B|C)*P(C) = P(A|C)*P(B|C)*P(C)
    (We can get by with 1+2+2 = 5 numbers instead of 8; see the sketch below)
• Why not write down all conditional independence assertions that hold in a domain?
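To make the "1+2+2 = 5 numbers" point concrete, here is a minimal sketch (with made-up CPT values) that assembles the full 8-entry joint over A, B, C from those five numbers, exploiting A || B | C.

```python
from itertools import product

# Illustrative (made-up) numbers exploiting A || B | C:
p_c = 0.2                                   # P(C)
p_a_given_c = {True: 0.7, False: 0.1}       # P(A|C), P(A|~C)
p_b_given_c = {True: 0.6, False: 0.3}       # P(B|C), P(B|~C)

# Because A || B | C, every joint entry is P(A|C)*P(B|C)*P(C):
# 1 + 2 + 2 = 5 numbers instead of the 8 entries of the full joint.
joint = {}
for a, b, c in product([True, False], repeat=3):
    pa = p_a_given_c[c] if a else 1 - p_a_given_c[c]
    pb = p_b_given_c[c] if b else 1 - p_b_given_c[c]
    pc = p_c if c else 1 - p_c
    joint[(a, b, c)] = pa * pb * pc

print(sum(joint.values()))                  # sums to 1: a consistent joint distribution
```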
Cond. Indep. Assertions (Contd)
• Idea: Why not write down all conditional independence assertions (CIA) X || Y | Z that hold in a domain?
• Problem: There can be exponentially many conditional independence assertions that hold in a domain (recall that X, Y and Z are all subsets of the domain variables).
• Brilliant idea: Maybe we should implicitly specify the CIA by writing down the "local dependencies" between variables using a graphical model
  – A Bayes network is a way of doing just this
  – A Bayes net is a directed acyclic graph whose nodes are random variables, and the immediate dependencies between variables are represented by directed arcs
  – The topology of a Bayes network shows the inter-variable dependencies. Given the topology, there is a way of checking whether any conditional independence assertion holds in the network (the Bayes Ball algorithm and the D-Sep idea)

Topological Semantics
• Markov blanket: parents, children, and children's other parents
• These two conditions (non-descendants given parents; all other nodes given the Markov blanket) are equivalent
• Many other conditional independence assertions follow from these

CIA implicit in Bayes Nets
• So, what conditional independence assumptions are implicit in Bayes nets?
  – Local Markov assumption:
    • A node N is independent of its non-descendants (including ancestors) given its immediate parents. (So if P are the immediate parents of N, and A is a subset of the ancestors and other non-descendants, then {N} || A | P.)
    • (Equivalently) A node N is independent of all other nodes given its Markov blanket (parents, children, children's parents)
  – Given this assumption, many other conditional independencies follow. For a full answer, we need to appeal to the D-Sep condition and/or Bayes Ball reachability

[Figure: the Burglary/Earthquake/Alarm network constructed with a non-causal variable ordering]
• P(A|J,M) = P(A)?
• Introduce variables in the causal order
  – Easy when you know the causality of the domain; hard otherwise…
• How many probabilities are needed? 13 for the new (non-causal) ordering; 10 for the old (causal) one
• Is this the worst? In the worst case an ordering requires 1 + 2 + 4 + 8 + 16 = 31 numbers, i.e., the full joint

Bayesian (Personal) Probabilities
• Continuing with bad friends: in the question above, suppose a second friend comes along and says that he can give you the conditional probabilities that you want to complete the specification of your Bayes net. You ask him for a CPT entry, and pat comes a response, some number between 0 and 1. This friend is well meaning, but you are worried that the numbers he is giving may lead to some sort of inconsistent joint probability distribution. Is your worry justified? (I.e., can your friend give you numbers that lead to an inconsistency?)
  – (To understand "inconsistency", consider someone who insists on giving you P(A), P(B), P(A&B) as well as P(AVB), and the numbers wind up not satisfying P(AVB) = P(A) + P(B) – P(A&B); or alternately, they insist on giving you P(A|B), P(B|A), P(A) and P(B), and the four numbers don't satisfy Bayes rule.)
• Answer: No. As long as we only ask the friend to fill in the CPTs of the Bayes network, there is no way the numbers won't make up a consistent joint probability distribution
  – This should be seen as a feature…
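A minimal two-node sketch of why the answer is "No": whatever CPT entries you are handed, multiplying them out always yields a joint that sums to 1. The network here is just B -> A, and the random numbers stand in for the friend's answers.

```python
import random
from itertools import product

# Arbitrary CPT entries for a tiny network B -> A: any values in [0,1] will do.
p_b = random.random()                       # P(B)
p_a_given_b = {True: random.random(),       # P(A|B)
               False: random.random()}      # P(A|~B)

joint = {}
for b, a in product([True, False], repeat=2):
    pb = p_b if b else 1 - p_b
    pa = p_a_given_b[b] if a else 1 - p_a_given_b[b]
    joint[(b, a)] = pb * pa                 # product of CPT entries

print(sum(joint.values()))                  # always 1.0 (up to floating point): no inconsistency possible
```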
Personal Probabilities
• John may be an optimist and believe that P(burglary) = 0.01, and Tom may be a pessimist and believe that P(burglary) = 0.99
• Bayesians consider both John and Tom to be fine (they don't insist on an objective frequentist interpretation for probabilities)
• However, Bayesians do think that John and Tom should act consistently with their own beliefs
  – For example, it makes no sense for John to go about installing tons of burglar alarms given his belief, just as it makes no sense for Tom to put all his valuables on his lawn

Ideas for reducing the number of probabilities to be specified
• Problem 1: The joint distribution requires 2^n numbers to specify, and those numbers are harder to assess
  – Solution: Use Bayes nets to reduce the number of numbers, and specify them as CPTs
• Problem 2: But the CPTs will be as big as the full joint if the network is dense
  – Solution: Introduce intermediate variables to induce sparsity into the network
• Problem 3: But CPTs can still be quite hard to specify if there are too many parents (or if the variables are continuous)
  – Solution: Parameterize the CPT (use noisy-OR etc. for discrete variables; Gaussians etc. for continuous variables)

Making the network sparse by introducing intermediate variables
• Consider a network of boolean variables where n parent nodes are connected to m children nodes (with each parent influencing each child)
  – You will need n + m*2^n conditional probabilities
• Suppose you realize that what is really influencing the child nodes is some single aggregate function of the parents' values (e.g., the sum of the parents)
  – We can introduce a single intermediate node called "sum" which has links from all n parent nodes, and which separately influences each of the m child nodes
  – Now you will wind up needing only n + 2^n + 2m conditional probabilities to specify this new network!
• Learning such hidden variables from data poses challenges…

Noisy-OR
• Compact/parameterized distributions are pretty much the only way to go when continuous variables are involved!
• Think of a firing squad with up to k gunners trying to shoot you: you will live only if everyone who shoots misses…
• We only consider the failure-to-cause probabilities of the causes that hold:
  P(~X | u1,…,uj false, u(j+1),…,uk true) = Π (i = j+1 to k) r_i
  where r_i is the probability that X fails to hold even though the i-th parent holds
• How about a Noisy-AND? (hint: A & B => ~(~A V ~B))

Constructing Belief Networks: Summary
• Decide on what sorts of queries you are interested in answering
  – This in turn dictates what factors to model in the network
• Decide on a vocabulary of the variables and their domains for the problem
  – Introduce "hidden" variables into the network as needed to make the network "sparse"
• Decide on an order of introduction of the variables into the network
  – Introducing variables in the causal direction leads to fewer connections (sparse structure) AND easier-to-assess probabilities
• Try to use canonical distributions to specify the CPTs (a noisy-OR sketch follows this summary)
  – Noisy-OR
  – Parameterized discrete/continuous distributions, such as Poisson, Normal (Gaussian), etc.
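Here is a minimal noisy-OR sketch matching the firing-squad picture above. The function name, the optional `leak` parameter, and the numbers are illustrative assumptions, not taken from the slides.

```python
# Sketch of a noisy-OR CPT entry.
# r[i] is the probability that the effect fails to occur even though cause i holds.
def noisy_or(cause_values, r, leak=0.0):
    """P(effect is true | the given truth values of its causes).

    cause_values: list of booleans, one per cause
    r:            list of failure-to-cause probabilities, one per cause
    leak:         optional probability that the effect occurs with no cause present
    """
    p_fail = 1.0 - leak
    for holds, r_i in zip(cause_values, r):
        if holds:                      # only causes that hold can fail to cause the effect
            p_fail *= r_i
    return 1.0 - p_fail

# Firing squad with 3 gunners; each shooter independently misses with the given probability.
# P(you die | gunners 1 and 3 shoot) = 1 - 0.1*0.1 = 0.99
print(noisy_or([True, False, True], r=[0.1, 0.2, 0.1]))
```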
Case Study: Pathfinder System
• Domain: lymph-node diseases
  – Deals with 60 diseases and 100 disease findings
• Versions:
  – Pathfinder I: a rule-based system with logical reasoning
  – Pathfinder II: tried a variety of approaches for uncertainty
    • Simple Bayes reasoning outperformed the others
  – Pathfinder III: simple Bayes reasoning, but with reassessed probabilities
  – Pathfinder IV: a Bayesian network was used to handle a variety of conditional dependencies
• Effort:
  – Deciding the vocabulary: 8 hours
  – Devising the topology of the network: 35 hours
  – Assessing the (14,000) probabilities: 40 hours
    • Physician experts liked assessing causal probabilities
• Evaluation: 53 "referral" cases
  – Pathfinder III: 7.9/10
  – Pathfinder IV: 8.9/10 [saves one additional life in every 1000 cases!]
  – A more recent comparison shows that Pathfinder now outperforms the experts who helped design it!

Independence in Bayes Networks: Causal Chains, Common Causes, Common Effects
• Causal chain (linear): X causes Y through Z; the path is blocked if Z is given
• Common cause (diverging): X and Y are caused by Z; the path is blocked if Z is given
• Common effect (converging): X and Y cause Z; the path is blocked if (and only if) neither Z nor any of its descendants is given

D-Sep (Direction-Dependent Separation)
• X || Y | E if every undirected path from X to Y is blocked by E
  – A path is blocked if there is a node Z on the path s.t.
    1. Z is in E and Z has one arrow coming in and another going out
    2. Z is in E and Z has both arrows going out
    3. Neither Z nor any of its descendants is in E, and both path arrows lead into Z
• Examples (for the burglary network): B||M|A;  (J,M)||E|A;  B||E;  B||E|A;  B||E|M

The "Asia" network
• Nodes: Visit to Asia (V), Smoking (S), Tuberculosis (T), Lung Cancer (L), Abnormality in Chest (A), X-Ray (X), Bronchitis (B), Dyspnea (D)
• How many probabilities are needed? (assume all variables are boolean)
• P(V,~T,S,~L,A,B,~X,~D) = ?
• Which of these hold? V||A|T;  T||L;  T||L|D;  X||D;  X||D|A;  (V,S)||(X,D)|A;  (V,S)||X|A

0th Idea for Bayes Net Inference
• Given a Bayes net, we can compute all the entries of the joint distribution (by just multiplying entries in the CPTs)
• Given the joint distribution, we can answer any probabilistic query
• Ergo, we can do inference on Bayes networks
• Qn: Can we do better?
  – Ideas:
    • Implicitly enumerate only the part of the joint that is needed
    • Use sampling techniques to compute the probabilities

Factor Algebra… Variable Elimination (Summing Away)
• Elimination is what we mostly do to compute probabilities: P(A|B) = P(A,B)/P(B), and P(B) = ? (eliminate A from P(A,B): sum the rows where A changes polarity, then normalize)
• If the expression tree is evaluated in a depth-first fashion, then the space requirement is linear…
• Products such as fA(a,b,e)*fJ(a)*fM(a) are a join; the sum fA(a,b,e)*fJ(a)*fM(a) + fA(~a,b,e)*fJ(~a)*fM(~a) eliminates A
• Complexity depends on the size of the largest factor, which in turn depends on the order in which the variables are eliminated…
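Before the worked example below, here is a minimal sketch of the two factor operations that variable elimination needs: joining (multiplying) factors and summing a variable out. The dictionary representation and the CPT numbers are illustrative assumptions; the slides do not give the Asia network's actual tables.

```python
from itertools import product

def multiply(f1, vars1, f2, vars2):
    """Pointwise product (join) of two factors; returns (factor, variable list)."""
    out_vars = list(dict.fromkeys(vars1 + vars2))       # order-preserving union
    out = {}
    for vals in product([True, False], repeat=len(out_vars)):
        w = dict(zip(out_vars, vals))
        out[vals] = f1[tuple(w[v] for v in vars1)] * f2[tuple(w[v] for v in vars2)]
    return out, out_vars

def sum_out(f, vars_, var):
    """Eliminate var from factor f by summing it away."""
    keep = [v for v in vars_ if v != var]
    out = {}
    for vals, p in f.items():
        w = dict(zip(vars_, vals))
        key = tuple(w[v] for v in keep)
        out[key] = out.get(key, 0.0) + p
    return out, keep

# Example: f_v(t) = sum over v of P(v)*P(t|v), as in the worked Asia example below
P_v = {(True,): 0.01, (False,): 0.99}                       # P(V), made-up numbers
P_t_given_v = {(True, True): 0.05, (False, True): 0.95,     # keys are (T, V); P(T|V), made-up
               (True, False): 0.01, (False, False): 0.99}
prod, pvars = multiply(P_t_given_v, ["T", "V"], P_v, ["V"])
f_v, fvars = sum_out(prod, pvars, "V")
print(fvars, f_v)                                           # f_v(T), which here equals P(T)
```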
*Read the Worked-Out Example of Variable Elimination in the Lecture Notes* [From Lise Getoor's notes]

A More Complex Example
• The "Asia" network: Visit to Asia (V), Smoking (S), Tuberculosis (T), Lung Cancer (L), Abnormality in Chest (A), X-Ray (X), Bronchitis (B), Dyspnea (D)
• We want to compute P(d); we need to eliminate v, s, x, t, l, a, b
• Initial factors:
  P(v)*P(s)*P(t|v)*P(l|s)*P(b|s)*P(a|t,l)*P(x|a)*P(d|a,b)
• Eliminate v. Compute f_v(t) = Σ_v P(v)*P(t|v), leaving
  f_v(t)*P(s)*P(l|s)*P(b|s)*P(a|t,l)*P(x|a)*P(d|a,b)
  – Note: f_v(t) = P(t). In general, the result of elimination is not necessarily a probability term
• Eliminate s. Compute f_s(b,l) = Σ_s P(s)*P(b|s)*P(l|s), leaving
  f_v(t)*f_s(b,l)*P(a|t,l)*P(x|a)*P(d|a,b)
  – Summing over s results in a factor with two arguments, f_s(b,l). In general, the result of elimination may be a function of several variables
• Eliminate x. Compute f_x(a) = Σ_x P(x|a), leaving
  f_v(t)*f_s(b,l)*f_x(a)*P(a|t,l)*P(d|a,b)
  – Note: f_x(a) = 1 for all values of a!
• Eliminate t. Compute f_t(a,l) = Σ_t f_v(t)*P(a|t,l), leaving
  f_s(b,l)*f_x(a)*f_t(a,l)*P(d|a,b)
• Eliminate l. Compute f_l(a,b) = Σ_l f_s(b,l)*f_t(a,l), leaving
  f_l(a,b)*f_x(a)*P(d|a,b)
• Eliminate a and b. Compute f_a(b,d) = Σ_a f_l(a,b)*f_x(a)*P(d|a,b), and then f_b(d) = Σ_b f_a(b,d)

Variable Elimination
• We now understand variable elimination as a sequence of rewriting operations
• The actual computation is done in the elimination steps
• The computation depends on the order of elimination
  – An optimal elimination order can be computed, but it is NP-hard to do so

Sufficient Condition 1
• In general, any leaf node that is not a query or evidence variable is irrelevant (and can be removed)
  – Once it is removed, other nodes may be seen to be irrelevant
• We can drop irrelevant variables from the network before starting the query off…
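A small sketch of "Sufficient Condition 1": repeatedly drop leaf nodes that are neither query nor evidence variables. The graph encoding (a child-to-parents map) and the function name are illustrative choices.

```python
def prune_irrelevant(parents, query, evidence):
    """Iteratively remove leaf nodes that are neither query nor evidence variables."""
    nodes = set(parents)
    keep = set(query) | set(evidence)
    changed = True
    while changed:
        changed = False
        for n in list(nodes):
            is_leaf = not any(n in parents[c] for c in nodes if c != n)
            if is_leaf and n not in keep:
                nodes.remove(n)          # an irrelevant leaf; removing it may expose more
                changed = True
    return nodes

# Burglary network: A has parents B and E; J and M each have parent A.
parents = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}
print(prune_irrelevant(parents, query=["J"], evidence=["B"]))
# M is pruned: the remaining nodes are {B, E, A, J} (set ordering may vary)
```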
Sufficient Condition 2
• Note that condition 2 doesn't subsume condition 1. In particular, it won't allow us to say that M is irrelevant for the query P(J|B)

Approximate Inference by Sampling
• Notice that sampling methods could in general be used even when we don't know the Bayes net (and are just observing the world)!
• We should strive to make the sampling more efficient given that we do know the Bayes net

Generating a Sample from the Network
• Note that the sample is generated in the causal order, so all the required probabilities can be read off from the CPTs! (A small sampling sketch appears at the end of this section.)
  – (To see how useful this is, consider generating the sample in the order wetgrass, rain, sprinkler… To sample WG, we would first need P(WG), then P(Rain|WG), then P(Sp|Rain,WG); NONE of these are directly given in the CPTs and would have to be computed…)
  – Note that in MCMC we do (re)sample a node given its Markov blanket, which is not in causal order, since the MB contains children and their parents
• Example sample: <C, ~S, R, W>  [Figure: network → samples → (approximation of the) joint distribution]
• Rejection sampling: the rejection sampling method doesn't really use the Bayes network that much…
• Likelihood weighting: notice that to attach the likelihood weight to the evidence, we are using the CPTs of the Bayes net. (Model-free empirical observation, in contrast, either gives you a sample or not; we can't get fractional samples.)
• MCMC (Gibbs sampling): note that the other parents of zj are part of the Markov blanket
  – E.g., P(rain|cl,sp,wg) ∝ P(rain|cl) * P(wg|sp,rain)

Network Topology & Complexity of Inference
• Singly connected networks (polytrees: at most one path between any pair of nodes): inference is polynomial
  – Examples of singly connected networks include Markov chains and Hidden Markov Models
• Multiply connected networks (e.g., Cloudy → Sprinklers and Rain → Wetgrass): inference is NP-hard
  – Such a network can be converted to a singly connected one by merging nodes (e.g., Sprinklers+Rain becomes a single node taking 4 = 2x2 values)
  – But the "size" of the merged network can be exponentially larger (so polynomial inference on that network isn't exactly god's gift)

Summary of BN Inference Algorithms
• Complexity
  – NP-hard (actually #P-complete, since we "count" models)
    • Polynomial for "singly connected" networks (one path between each pair of nodes)
  – NP-hard also for absolute and relative approximation
• Exact inference algorithms
  – Enumeration
  – Variable elimination
    • Avoids the redundant computations of enumeration
  – [Many others, such as "message passing" algorithms, constraint-propagation-based algorithms, etc.]
• Approximate inference algorithms
  – Based on stochastic simulation
    • Sampling from empty networks
    • Rejection sampling
    • Likelihood weighting
    • MCMC
    • [And many more]

MPE and MAP
• Conjunctive queries are essentially computing joint distributions on sets of query variables
• A special case of computing the full joint on the query variables is finding just the query-variable configuration that is most likely given the evidence. There are two special cases here:
  – MPE (Most Probable Explanation): the most likely assignment to all other variables given the evidence; mostly involves max/product operations
  – MAP (Maximum A Posteriori): the most likely assignment to some variables given the evidence; can involve max/product/sum operations
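To tie together the sampling discussion above, here is a minimal sketch of prior (forward) sampling in causal order, plus a rejection-sampling style query, on the Cloudy/Sprinkler/Rain/WetGrass network. The CPT numbers and names are illustrative assumptions, not taken from the slides.

```python
import random

# Variables are listed in causal (topological) order, so every probability we need
# can be read directly off a CPT, as noted in the slides above.
CPTS = {
    "C": lambda s: 0.5,
    "S": lambda s: 0.1 if s["C"] else 0.5,
    "R": lambda s: 0.8 if s["C"] else 0.2,
    "W": lambda s: {(True, True): 0.99, (True, False): 0.90,
                    (False, True): 0.90, (False, False): 0.01}[(s["S"], s["R"])],
}
ORDER = ["C", "S", "R", "W"]                      # parents before children

def prior_sample():
    s = {}
    for var in ORDER:
        s[var] = random.random() < CPTS[var](s)   # sample var given its already-sampled parents
    return s

# Rejection-sampling style estimate of P(R | W): keep only samples consistent with the evidence.
samples = [prior_sample() for _ in range(100_000)]
consistent = [s for s in samples if s["W"]]
print(sum(s["R"] for s in consistent) / len(consistent))
```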