Machine Learning and the Axioms of Probability
Brigham S. Anderson
www.cs.cmu.edu/~brigham
brigham@cmu.edu
School of Computer Science, Carnegie Mellon University
Copyright © 2006, Brigham S. Anderson

Probability
• The world is a very uncertain place.
• 30 years of Artificial Intelligence and Database research danced around this fact.
• And then a few AI researchers decided to use some ideas from the eighteenth century.

What we're going to do
• We will review the fundamentals of probability.
• It's really going to be worth it.
• You'll see examples of probabilistic analytics in action: inference, anomaly detection, and Bayes classifiers.

Discrete Random Variables
• A is a Boolean-valued random variable if A denotes an event and there is some degree of uncertainty as to whether A occurs.
• Examples:
  A = The US president in 2023 will be male
  A = You wake up tomorrow with a headache
  A = You have influenza

Probabilities
• We write P(A) for "the probability that A is true."
• We could at this point spend 2 hours on the philosophy of this. We'll spend slightly less...

Sample Space
Definition 1. The set S of all possible outcomes of a particular experiment is called the sample space for the experiment. The elements of the sample space are called outcomes.

Sample Spaces
• Sample space of a coin flip: S = {H, T}
• Sample space of a die roll: S = {1, 2, 3, 4, 5, 6}
• Sample space of three die rolls: S = {111, 112, 113, ..., 664, 665, 666}
• Sample space of a single draw from a deck of cards: S = {As, Ac, Ah, Ad, 2s, 2c, 2h, ..., Ks, Kc, Kd, Kh}

So Far...
• The sample space is the set of all possible worlds. Example: {As, Ac, Ah, Ad, 2s, 2c, 2h, ..., Ks, Kc, Kd, Kh}
• An outcome is an element of the sample space. Example: 2c

Events
Definition 2. An event is any subset of S (including S itself).
• The sample space is the set of all outcomes.
• An outcome is a possible world.
• An event is a set of outcomes. Examples from the card-draw sample space: "Jack", "Hearts", "Red and Face".

Definitions
• The sample space is the set of all possible worlds: {As, Ac, Ah, Ad, 2s, 2c, 2h, ..., Ks, Kc, Kd, Kh}
• An outcome is a single point in the sample space: 2c
• An event is a set of outcomes from the sample space: {2h, 2c, 2s, 2d}

Events
Definition 3. Two events A and B are mutually exclusive if A ∩ B = Ø. (For example, "hearts" and "clubs" are mutually exclusive; hearts, clubs, spades, and diamonds together cover the whole deck.)
Definition 4. If A1, A2, ... are mutually exclusive and A1 ∪ A2 ∪ ... = S, then the collection A1, A2, ... forms a partition of S.
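As a concreteness check, here is a minimal Python sketch (mine, not from the slides) that builds the card-draw sample space, represents a few of the events above as sets, and verifies that the four suits are mutually exclusive and form a partition of S.

```python
# Build the 52-outcome sample space for a single card draw.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["s", "c", "h", "d"]  # spades, clubs, hearts, diamonds
S = {r + s for r in ranks for s in suits}

# Events are just subsets of S.
jacks = {o for o in S if o.startswith("J")}
hearts = {o for o in S if o.endswith("h")}
red = {o for o in S if o[-1] in ("h", "d")}
face = {o for o in S if o[0] in ("J", "Q", "K")}
red_and_face = red & face                 # the "Red and Face" event

# "hearts" and "clubs" are mutually exclusive: their intersection is empty.
clubs = {o for o in S if o.endswith("c")}
assert hearts & clubs == set()

# The four suit events cover S (and are pairwise disjoint by construction): a partition.
suit_events = [{o for o in S if o.endswith(s)} for s in suits]
assert set.union(*suit_events) == S

print(len(S), len(jacks), len(red_and_face))   # 52 4 6
```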
Probability
Definition 5. Given a sample space S, a probability function is a function that maps each event in S to a real number and satisfies:
• P(A) ≥ 0 for any event A in S
• P(S) = 1
• For any number of mutually exclusive events A1, A2, A3, ..., P(A1 ∪ A2 ∪ A3 ∪ ...) = P(A1) + P(A2) + P(A3) + ...
* This definition of the domain of this function is not 100% sufficient, but it's close enough for our purposes... (I'm sparing you Borel Fields.)

Definitions
• The sample space is the set of all possible worlds: {As, Ac, Ah, Ad, 2s, 2c, 2h, ..., Ks, Kc, Kd, Kh}
• An outcome is a single point in the sample space: 4c
• An event is a set of one or more outcomes: card is "Red"
• P(E) maps event E to a real number and satisfies the axioms of probability: P(Red) = 0.50, P(Black) = 0.50

Misconception
• "The relative area of the events determines their probability."
• ...in a Venn diagram it does, but not in general. However, the "area equals probability" rule is guaranteed to result in axiom-satisfying probability functions.
• We often assume, for example, that the probability of "heads" is equal to "tails" in the absence of other information... but this is totally outside the axioms!

Creating a Valid P()
One convenient way to create an axiom-satisfying probability function:
1. Assign a probability to each outcome in S
2. Make sure they sum to one
3. Declare that P(A) equals the sum of the probabilities of the outcomes in event A

Everyday Example
• Assume you are a doctor. This is the sample space of "patients you might see on any given day."
• Example outcomes:
  Non-smoker, female, diabetic, headache, good insurance, etc.
  Smoker, male, herniated disk, back pain, mildly schizophrenic, delinquent medical bills, etc.
• Number of elements in the "patient space": 100 jillion.
• Are these patients equally likely to occur? Again, generally not. Let's assume for the moment that they are, though, which roughly means "area equals probability."
• Event: patient has Flu (F). Size of set F: 2 jillion (exactly 2 jillion of the points in the sample space have flu). Size of the patient space: 100 jillion.
  P_patientSpace(F) = 2 jillion / 100 jillion = 0.02
• From now on, the subscript on P() will be omitted.

These Axioms Are Not to Be Trifled With
• There have been attempts to build different methodologies for uncertainty: fuzzy logic, three-valued logic, Dempster-Shafer, non-monotonic reasoning.
• But the axioms of probability are the only system with this property: if you gamble using them, you can't be unfairly exploited by an opponent using some other system [de Finetti 1931].

Theorems from the Axioms
Axioms:
• P(A) ≥ 0 for any event A in S
• P(S) = 1
• For any number of mutually exclusive events A1, A2, A3, ..., P(A1 ∪ A2 ∪ A3 ∪ ...) = P(A1) + P(A2) + P(A3) + ...
Theorem. If P is a probability function and A is an event in S, then P(~A) = 1 − P(A).
Proof:
(1) Since A and ~A partition S, P(A ∪ ~A) = P(S) = 1.
(2) Since A and ~A are disjoint, P(A ∪ ~A) = P(A) + P(~A).
Combining (1) and (2) gives the result.
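The "Creating a Valid P()" recipe above translates directly into code. A minimal sketch (my own illustration, using the equally-likely card-draw space from earlier): assign each outcome a probability, check that they sum to one, and define P(event) as the sum over the outcomes in the event.

```python
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["s", "c", "h", "d"]
S = [r + s for r in ranks for s in suits]

# Step 1: assign a probability to each outcome (here: equally likely).
p_outcome = {o: 1.0 / len(S) for o in S}

# Step 2: make sure they sum to one.
assert abs(sum(p_outcome.values()) - 1.0) < 1e-9

# Step 3: P(A) is the sum of the outcome probabilities in event A.
def P(event):
    return sum(p_outcome[o] for o in event)

red = {o for o in S if o[-1] in ("h", "d")}
black = {o for o in S if o[-1] in ("s", "c")}
print(P(red), P(black), P(red | black))   # ≈ 0.5, ≈ 0.5, ≈ 1.0
```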
Multivalued Random Variables
• Suppose A can take on more than 2 values.
• A is a random variable with arity k if it can take on exactly one value out of {A1, A2, ..., Ak}, and
• The events {A1, A2, ..., Ak} partition S, so
  P(Ai, Aj) = 0 if i ≠ j
  P(A1 ∪ A2 ∪ ... ∪ Ak) = 1

Elementary Probability in Pictures
(Each of these facts can be read off a Venn diagram.)
• P(~A) + P(A) = 1
• P(B) = P(B, A) + P(B, ~A)
• Σ_{j=1..k} P(Aj) = 1, when A1, ..., Ak partition S
• P(B) = Σ_{j=1..k} P(B, Aj)   ...Useful!

Conditional Probability
• Assume once more that you are a doctor. Again, this is the sample space of "patients you might see on any given day."
• Event: Flu, with P(F) = 0.02.
• Event: Headache, with P(H) = 0.10.
• ...we still need to specify the interaction between flu and headache...
• Define P(H|F) = fraction of F's outcomes which are also in H.
• With P(F) = 0.02, P(H) = 0.10, and P(H|F) = 0.50, the four regions of the Venn diagram have probabilities: flu and headache 0.01, flu only 0.01, headache only 0.09, neither 0.89. (H = "headache", F = "flu")

P(H|F) = fraction of flu worlds in which the patient has a headache
       = (# worlds with flu and headache) / (# worlds with flu)
       = (size of the "H and F" region) / (size of the "F" region)
       = P(H, F) / P(F)

Conditional Probability
Definition. If A and B are events in S, and P(B) > 0, then the conditional probability of A given B, written P(A|B), is
  P(A|B) = P(A, B) / P(B)
The Chain Rule: a simple rearrangement of the above equation yields
  P(A, B) = P(A|B) P(B)
This is the main Bayes Net concept!

Probabilistic Inference
• H = "Have a headache", F = "Coming down with Flu"
• P(H) = 0.10, P(F) = 0.02, P(H|F) = 0.50
• One day you wake up with a headache. You think: "Drat! 50% of flus are associated with headaches, so I must have a 50-50 chance of coming down with flu."
• Is this reasoning good?
  P(F|H) = P(F, H) / P(H) = P(H|F) P(F) / P(H) = (0.50)(0.02) / 0.10 = 0.10

What we just did...
  P(B|A) = P(A, B) / P(A) = P(A|B) P(B) / P(A)
This is Bayes Rule.
Bayes, Thomas (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418.

More General Forms of Bayes Rule
  P(A|B) = P(B|A) P(A) / [P(B|A) P(A) + P(B|~A) P(~A)]
  P(A|B, C) = P(B|A, C) P(A, C) / P(B, C)
  P(Ai|B) = P(B|Ai) P(Ai) / Σ_{k=1..nA} P(B|Ak) P(Ak)

Independence
Definition. Two events, A and B, are statistically independent if
  P(A, B) = P(A) P(B)
which is equivalent to
  P(A|B) = P(A)
Important for Bayes Nets.
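To make the headache/flu numbers concrete, here is a tiny Python sketch (my own, using only the values given on the slides) that applies Bayes rule, reproducing P(F|H) = 0.1, and then checks the independence definition just stated.

```python
# Numbers from the flu/headache example.
p_F, p_H, p_H_given_F = 0.02, 0.10, 0.50

# Bayes rule: P(F|H) = P(H|F) P(F) / P(H)
p_F_given_H = p_H_given_F * p_F / p_H
print(p_F_given_H)            # 0.1, not the 50-50 chance the naive reasoning suggested

# Independence check: F and H are independent only if P(F, H) == P(F) P(H).
p_FH = p_H_given_F * p_F      # chain rule: P(F, H) = P(H|F) P(F) = 0.01
print(p_FH, p_F * p_H)        # 0.01 vs 0.002, so flu and headache are not independent
```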
Representing P(A, B, C)
• How can we represent the function P(A)? P(A, B)? P(A, B, C)?

The Joint Probability Table
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.

Example: P(A, B, C)

A  B  C  Prob
0  0  0  0.30
0  0  1  0.05
0  1  0  0.10
0  1  1  0.05
1  0  0  0.05
1  0  1  0.10
1  1  0  0.25
1  1  1  0.10

(The same distribution can also be drawn as a three-circle Venn diagram with these probabilities attached to its regions.)

Using the Joint
Once you have the JPT you can ask for the probability of any logical expression E:
  P(E) = Σ_{rows matching E} P(row)
• ...what is P(Poor, Male)?  P(Poor, Male) = 0.4654
• ...what is P(Poor)?  P(Poor) = 0.7604

Inference with the Joint
  P(E1|E2) = P(E1, E2) / P(E2) = [Σ_{rows matching E1 and E2} P(row)] / [Σ_{rows matching E2} P(row)]
• ...what is P(Male|Poor)?  P(Male|Poor) = 0.4654 / 0.7604 = 0.612

Inference is a big deal
• I've got this evidence. What's the chance that this conclusion is true?
  • I've got a sore neck: how likely am I to have meningitis?
  • I see my lights are out and it's 9pm. What's the chance my spouse is already asleep?
• There's a thriving set of industries growing based around Bayesian inference. Highlights are: medicine, pharma, help desk support, engine fault diagnosis.

Where do Joint Distributions come from?
• Idea One: Expert humans.
• Idea Two: Simpler probabilistic facts and some algebra.
  Example: Suppose you knew
    P(A) = 0.5
    P(B|A) = 0.2
    P(B|~A) = 0.1
    P(C|A,B) = 0.1
    P(C|A,~B) = 0.8
    P(C|~A,B) = 0.3
    P(C|~A,~B) = 0.1
  Then you can automatically compute the JPT using the chain rule:
    P(A, B, C) = P(A) P(B|A) P(C|A, B)
  Bayes Nets are a systematic way to do this.
• Idea Three: Learn them from data!
  Prepare to witness an impressive learning algorithm...

Learning a JPT
• Build a joint probability table for your attributes in which the probabilities are unspecified, then fill in each row with
  P̂(row) = (# records matching row) / (total number of records)
• For example, the entry for the row A=1, B=1, C=0 is the fraction of all records in which A and B are true but C is false. Filling in every row this way yields a learned JPT like the example table above.

Example of Learning a JPT
• The JPT used in the P(Poor, Male) examples above was obtained by learning from three attributes in the UCI "Adult" census database [Kohavi 1995].

Where are we?
• We have recalled the fundamentals of probability.
• We have become content with what JPTs are and how to use them.
• And we even know how to learn JPTs from data.
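The "learn a JPT, then query it" loop fits in a few lines of Python. This sketch is my own; the toy records are invented, but chosen so that the learned table matches the example JPT above. It counts rows to estimate the JPT and then evaluates P(E1|E2) by summing matching entries.

```python
from collections import Counter
from itertools import product

# Toy boolean records over (A, B, C); invented so the learned JPT matches the example table.
records = ([(0, 0, 0)] * 6 + [(0, 0, 1)] * 1 + [(0, 1, 0)] * 2 + [(0, 1, 1)] * 1 +
           [(1, 0, 0)] * 1 + [(1, 0, 1)] * 2 + [(1, 1, 0)] * 5 + [(1, 1, 1)] * 2)

# Learn the JPT: P_hat(row) = (# records matching row) / (total number of records).
counts = Counter(records)
jpt = {row: counts[row] / len(records) for row in product((0, 1), repeat=3)}

def P(match):
    """P(E) = sum of the JPT entries whose rows satisfy the predicate E."""
    return sum(p for row, p in jpt.items() if match(row))

def P_cond(match1, match2):
    """P(E1|E2) = P(E1 and E2) / P(E2), both computed by summing matching rows."""
    return P(lambda r: match1(r) and match2(r)) / P(match2)

print(jpt[(1, 1, 0)])                                      # 0.25, as in the example table
print(P(lambda r: r[0] == 1))                              # P(A=1) = 0.50
print(P_cond(lambda r: r[2] == 1, lambda r: r[0] == 1))    # P(C=1 | A=1) = 0.40
```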
Density Estimation
• Our Joint Probability Table (JPT) learner is our first example of something called Density Estimation.
• A density estimator learns a mapping from a set of input attributes to a probability.

Evaluating a density estimator
• Given a record x, a density estimator M can tell you how likely the record is: P̂(x|M)
• Given a dataset with R records, a density estimator can tell you how likely the dataset is (under the assumption that all records were independently generated from the probability function):
  P̂(dataset|M) = P̂(x1, x2, ..., xR|M) = ∏_{k=1..R} P̂(xk|M)

A small dataset: Miles Per Gallon
• 192 training set records from the UCI repository (thanks to Ross Quinlan), with three attributes: mpg (good/bad), modelyear (70to74, 75to78, 79to83), and maker (america, asia, europe). A few sample records:

mpg   modelyear  maker
good  75to78     asia
bad   70to74     america
bad   75to78     europe
bad   70to74     america
...   ...        ...
good  79to83     america
bad   75to78     europe

• Evaluating a JPT learned from this dataset on the training set itself:
  P̂(dataset|M) = P̂(x1, x2, ..., xR|M) = ∏_{k=1..R} P̂(xk|M) ≈ 3.4 × 10^-203

Log Probabilities
Since probabilities of datasets get so small we usually use log probabilities:
  log P̂(dataset|M) = log ∏_{k=1..R} P̂(xk|M) = Σ_{k=1..R} log P̂(xk|M)
For the MPG training set: log P̂(dataset|M) ≈ -466.19

Summary: The Good News
The JPT allows us to learn P(X) from data.
• Can do inference: P(E1|E2). Automatic doctor, recommender, etc.
• Can do anomaly detection: spot suspicious / incorrect records (e.g., credit card fraud).
• Can do Bayesian classification: predict the class of a record (e.g., predict cancerous/not-cancerous).

Summary: The Bad News
• Density estimation with JPTs is trivial, mindless, and dangerous.
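Before turning to test sets, here is how the evaluation above is computed: a density estimator's score on a dataset is just a sum of per-record log probabilities. The sketch is my own; the tiny model and dataset are invented for illustration (they reuse the Mpg/Horse numbers that appear later in the slides).

```python
import math

def log_likelihood(dataset, density):
    """log P(dataset|M) = sum_k log P(x_k|M), assuming i.i.d. records."""
    return sum(math.log(density[x]) for x in dataset)

# Tiny illustrative model and dataset (invented; real JPTs have many more rows).
density = {("good", "low"): 0.36, ("good", "high"): 0.04,
           ("bad", "low"): 0.12, ("bad", "high"): 0.48}
dataset = [("good", "low"), ("bad", "high"), ("bad", "high"), ("good", "high")]

# The raw product of probabilities underflows for large datasets; the log-sum does not.
print(math.prod(density[x] for x in dataset))   # ≈ 0.0033
print(log_likelihood(dataset, density))         # ≈ -5.71, the same information in log form
```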
Using a test set
• An independent test set with 196 cars has a much worse log-likelihood than the training set had (actually it's a billion quintillion quintillion quintillion quintillion times less likely).
• ...Density estimators can overfit. And the JPT estimator is the overfittiest of them all!

Overfitting Density Estimators
• If a test-set record lands in a JPT cell that no training record filled, then P̂(x|M) = 0 and log P̂(x|M) = −∞. If this ever happens, it means there are certain combinations that we have learned are "impossible."

Using a test set
• The only reason that our test set didn't score −∞ is that Andrew's code is hard-wired to always predict a probability of at least one in 10^20.
• We need density estimators that are less prone to overfitting.

Is there a better way?
• The problem with the JPT is that it just mirrors the training data. In fact, it is just another way of storing the data: we could reconstruct the original dataset perfectly from it!
• We need to represent the probability function with fewer parameters...

Aside: Bayes Nets

Bayes Nets
• What are they? Bayesian nets are a framework for representing and analyzing models involving uncertainty.
• What are they used for? Intelligent decision aids, data fusion, 3-E feature recognition, intelligent diagnostic aids, automated free text understanding, data mining.
• How are they different from other knowledge representation and probabilistic analysis tools? Uncertainty is handled in a mathematically rigorous yet efficient and simple way.

Bayes Net Concepts
1. Chain Rule: P(A, B) = P(A) P(B|A)
2. Conditional Independence: P(A|B, C) = P(A|B)

A Simple Bayes Net
• Let's assume that we already have P(Mpg, Horse):
  P(good, low)  = 0.36
  P(good, high) = 0.04
  P(bad,  low)  = 0.12
  P(bad,  high) = 0.48
• How would you rewrite this using the Chain rule?

Review: Chain Rule
P(Mpg, Horse) = P(Mpg) × P(Horse|Mpg), with
  P(Mpg):       P(good) = 0.4,  P(bad) = 0.6
  P(Horse|Mpg): P(low|good) = 0.89, P(low|bad) = 0.21, P(high|good) = 0.11, P(high|bad) = 0.79
Checking the factorization against the JPT:
  P(good, low)  = P(good) × P(low|good)  = 0.4 × 0.89 ≈ 0.36
  P(good, high) = P(good) × P(high|good) = 0.4 × 0.11 ≈ 0.04
  P(bad,  low)  = P(bad)  × P(low|bad)   = 0.6 × 0.21 ≈ 0.12
  P(bad,  high) = P(bad)  × P(high|bad)  = 0.6 × 0.79 ≈ 0.48

How to Make a Bayes Net
P(Mpg, Horse) = P(Mpg) × P(Horse|Mpg)
• Draw one node per variable and one arc for each conditioning relationship: Mpg → Horse.
• Each node is a probability function; each arc denotes conditional dependence.
  P(Mpg):       P(good) = 0.4,  P(bad) = 0.6
  P(Horse|Mpg): P(low|good) = 0.90, P(low|bad) = 0.21, P(high|good) = 0.10, P(high|bad) = 0.79
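Here is a minimal sketch (my own, using the slide's numbers) of the two-node factorization: store P(Mpg) and P(Horse|Mpg) instead of the joint, and recover any joint entry with the chain rule.

```python
# Two-node Bayes net: Mpg -> Horse, using the CPT numbers from the slides.
p_mpg = {"good": 0.4, "bad": 0.6}
p_horse_given_mpg = {"good": {"low": 0.90, "high": 0.10},
                     "bad":  {"low": 0.21, "high": 0.79}}

def joint(mpg, horse):
    """Chain rule: P(Mpg, Horse) = P(Mpg) * P(Horse|Mpg)."""
    return p_mpg[mpg] * p_horse_given_mpg[mpg][horse]

for mpg in ("good", "bad"):
    for horse in ("low", "high"):
        print(mpg, horse, round(joint(mpg, horse), 3))
# Approximately reproduces the JPT: 0.36, 0.04, 0.126, 0.474 (the slides round the CPT numbers).
```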
How to Make a Bayes Net
• So, what have we accomplished thus far? Nothing; we've just "Bayes Net-ified" the P(Mpg, Horse) JPT using the Chain rule.
• ...the real excitement starts when we wield conditional independence.

How to Make a Bayes Net
Before we continue, we need a worthier opponent than puny P(Mpg, Horse)... We'll use P(Mpg, Horse, Accel) (* note: I made these up...):
  P(good, low,  slow) = 0.37
  P(good, low,  fast) = 0.01
  P(good, high, slow) = 0.02
  P(good, high, fast) = 0.00
  P(bad,  low,  slow) = 0.10
  P(bad,  low,  fast) = 0.12
  P(bad,  high, slow) = 0.16
  P(bad,  high, fast) = 0.22

How to Make a Bayes Net
Step 1: Rewrite the joint using the Chain rule:
  P(Mpg, Horse, Accel) = P(Mpg) P(Horse|Mpg) P(Accel|Mpg, Horse)
Note: obviously, we could have written this 3! = 6 different ways:
  P(M, H, A) = P(M) × P(H|M) × P(A|M,H)
             = P(M) × P(A|M) × P(H|M,A)
             = P(H) × P(M|H) × P(A|H,M)
             = P(H) × P(A|H) × P(M|H,A)
             = ...
As a graph: Mpg → Horse, with arcs from both Mpg and Horse into Accel. The node CPTs are P(Mpg), P(Horse|Mpg), and P(Accel|Mpg, Horse):
  P(Mpg):       P(good) = 0.4,  P(bad) = 0.6
  P(Horse|Mpg): P(low|good) = 0.90, P(low|bad) = 0.21, P(high|good) = 0.10, P(high|bad) = 0.79
  P(Accel|Mpg, Horse) (* note: I made these up too...):
    P(slow|good, low)  = 0.97   P(fast|good, low)  = 0.03
    P(slow|good, high) = 0.15   P(fast|good, high) = 0.85
    P(slow|bad,  low)  = 0.90   P(fast|bad,  low)  = 0.10
    P(slow|bad,  high) = 0.05   P(fast|bad,  high) = 0.95

A Miracle Occurs!
You are told by God (or another domain expert) that Accel is independent of Mpg given Horse! i.e.,
  P(Accel | Mpg, Horse) = P(Accel | Horse)

How to Make a Bayes Net
The Accel node now needs only P(Accel|Horse):
  P(Mpg):         P(good) = 0.4,  P(bad) = 0.6
  P(Horse|Mpg):   P(low|good) = 0.89, P(low|bad) = 0.21, P(high|good) = 0.11, P(high|bad) = 0.79
  P(Accel|Horse): P(slow|low) = 0.22, P(slow|high) = 0.64, P(fast|low) = 0.78, P(fast|high) = 0.36

How to Make a Bayes Net
Thank you, domain expert! Now I only need to learn 5 parameters instead of 7 from my data! My parameter estimates will be more accurate as a result!

Independence
• "The Acceleration does not depend on the Mpg once I know the Horsepower."
• This can be specified very simply: P(Accel | Mpg, Horse) = P(Accel | Horse)
• This is a powerful statement! It required extra domain knowledge, a different kind of knowledge than numerical probabilities. It needed an understanding of causation.
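A sketch of the condensed three-node net (my own code, with the slide's post-miracle numbers): Accel's CPT is conditioned only on Horse, and any joint entry is the product of the three node CPTs. The free-parameter count drops from 7 to 5 (1 for Mpg, 2 for Horse|Mpg, 2 for Accel|Horse).

```python
# Mpg -> Horse -> Accel, with Accel conditionally independent of Mpg given Horse.
p_mpg = {"good": 0.4, "bad": 0.6}
p_horse = {"good": {"low": 0.89, "high": 0.11},
           "bad":  {"low": 0.21, "high": 0.79}}
p_accel = {"low":  {"slow": 0.22, "fast": 0.78},
           "high": {"slow": 0.64, "fast": 0.36}}

def joint(mpg, horse, accel):
    """P(M, H, A) = P(M) * P(H|M) * P(A|H): conditional independence in action."""
    return p_mpg[mpg] * p_horse[mpg][horse] * p_accel[horse][accel]

# Free parameters: 1 (Mpg) + 2 (Horse|Mpg) + 2 (Accel|Horse) = 5, versus 7 for the full JPT.
print(joint("good", "low", "fast"))    # 0.4 * 0.89 * 0.78 ≈ 0.278
print(sum(joint(m, h, a) for m in ("good", "bad")
          for h in ("low", "high") for a in ("slow", "fast")))   # ≈ 1.0: a valid joint
```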
Bayes Nets Formalized
A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair (V, E) where:
• V is a set of vertices.
• E is a set of directed edges joining vertices. No loops of any length are allowed.
Each vertex in V contains the following information:
• A Conditional Probability Table (CPT) indicating how this variable's probabilities depend on all possible combinations of parental values.

Bayes Nets Summary
• Bayes nets are a factorization of the full JPT which uses the chain rule and conditional independence.
• They can do everything a JPT can do (like quick, cheap lookups of probabilities).

The good news
We can do inference. We can compute any conditional probability P(some variable | some other variable values):
  P(E1|E2) = P(E1, E2) / P(E2) = [Σ_{joint entries matching E1 and E2} P(joint entry)] / [Σ_{joint entries matching E2} P(joint entry)]
Suppose you have m binary-valued variables in your Bayes Net and expression E2 mentions k variables. How much work is the above computation?

The sad, bad news
• Doing inference "JPT-style" by enumerating all matching entries in the joint is expensive: exponential in the number of variables.
• But perhaps there are faster ways of querying Bayes nets?
  • In fact, if I ever ask you to manually do a Bayes Net inference, you'll find there are often many tricks to save you time.
  • So we've just got to program our computer to do those tricks too, right?
• Sadder and worse news: general querying of Bayes nets is NP-complete.
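To make the cost concrete, here is a brute-force sketch (my own, reusing the slide's three-node net) that answers a query exactly as the formula above prescribes, by summing over every joint entry. The number of entries it touches doubles with every extra binary variable, which is the exponential blow-up the slides warn about.

```python
from itertools import product

# The same three-node net (Mpg -> Horse -> Accel), restated so this sketch is self-contained.
p_mpg = {"good": 0.4, "bad": 0.6}
p_horse = {"good": {"low": 0.89, "high": 0.11}, "bad": {"low": 0.21, "high": 0.79}}
p_accel = {"low": {"slow": 0.22, "fast": 0.78}, "high": {"slow": 0.64, "fast": 0.36}}

def joint(m, h, a):
    return p_mpg[m] * p_horse[m][h] * p_accel[h][a]

def infer(query, evidence):
    """P(query | evidence) by enumerating joint entries: exponential in the number of variables."""
    worlds = [dict(zip(("Mpg", "Horse", "Accel"), w))
              for w in product(("good", "bad"), ("low", "high"), ("slow", "fast"))]
    def matches(world, assignment):
        return all(world[var] == val for var, val in assignment.items())
    numer = sum(joint(w["Mpg"], w["Horse"], w["Accel"])
                for w in worlds if matches(w, query) and matches(w, evidence))
    denom = sum(joint(w["Mpg"], w["Horse"], w["Accel"])
                for w in worlds if matches(w, evidence))
    return numer / denom

print(infer({"Mpg": "good"}, {"Accel": "fast"}))   # P(Mpg=good | Accel=fast), about 0.52
```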
Case Study I
• Pathfinder system (Heckerman 1991, Probabilistic Similarity Networks, MIT Press, Cambridge MA).
• Diagnostic system for lymph-node diseases.
• 60 diseases and 100 symptoms and test results.
• 14,000 probabilities.
• Expert consulted to make the net: 8 hours to determine the variables, 35 hours for the net topology, 40 hours for the probability table values.
• Apparently, the experts found it quite easy to invent the causal links and probabilities.
• Pathfinder is now outperforming the world experts in diagnosis. Being extended to several dozen other medical domains.

Bayes Net Info
• GUI packages: Genie (free), Netica ($$), Hugin ($$)
• Non-GUI packages: all of the above have APIs; BNT for MATLAB; AUTON code (learning extremely large networks of tens of thousands of nodes)

Bayes Nets and Machine Learning

Machine Learning Tasks
• Inference engine: evidence e1 and missing variables E2 → P(E2 | e1)
• Anomaly detector: data point x → P(x)
• Classifier: data point x → P(C | x)

What is an Anomaly?
• An irregularity that cannot be explained by simple domain models and knowledge.
• Anomaly detection only needs to learn from examples of "normal" system behavior.
• Classification, on the other hand, would need examples labeled "normal" and "not-normal".

Anomaly Detection in Practice
• Monitoring computer networks for attacks.
• Monitoring population-wide health data for outbreaks or attacks.
• Looking for suspicious activity in bank transactions.
• Detecting unusual eBay selling/buying behavior.

JPT Anomaly Detector
• Suppose we have the following model:
  P(Mpg, Horse): P(good, low) = 0.36, P(good, high) = 0.04, P(bad, low) = 0.12, P(bad, high) = 0.48
• We're trying to detect anomalous cars. If the next example we see is <good, high>, how anomalous is it?
• How likely is <good, high>?
  likelihood(good, high) = P(good, high) = 0.04
• Could not be easier! Just look up the entry in the JPT! Smaller numbers are more anomalous in that the model is more surprised to see them.

Bayes Net Anomaly Detector
• Same question with the two-node net, where P(good) = 0.4, P(bad) = 0.6, P(low|good) = 0.90, P(low|bad) = 0.21, P(high|good) = 0.10, P(high|bad) = 0.79.
• How likely is <good, high>?
  likelihood(good, high) = P(good, high) = P(good) P(high|good) = 0.04
• Again, trivial! We need to do one tiny lookup for each variable in the network!

Machine Learning Tasks (recap): inference engine, anomaly detector, classifier. Next up: the classifier.

Bayes Classifiers
• A formidable and sworn enemy of decision trees.

Bayes Classifiers in 1 Slide
Bayes classifiers just do inference. That's it.
The "algorithm":
1. Learn P(class, X)
2. For a given input x, infer P(class|x)
3. Choose the class with the highest probability

JPT Bayes Classifier
• Suppose we have the following model:
  P(Mpg, Horse): P(good, low) = 0.36, P(good, high) = 0.04, P(bad, low) = 0.12, P(bad, high) = 0.48
• We're trying to classify cars as Mpg = "good" or "bad". If the next example we see is Horse = "low", how do we classify it?
  P(good|low) = P(good, low) / P(low) = P(good, low) / [P(good, low) + P(bad, low)] = 0.36 / (0.36 + 0.12) = 0.75
• P(good|low) = 0.75, so we classify the example as "good".
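Both the anomaly detector and the JPT classifier above reduce to lookups in the same table. A small sketch (my own, reusing the slide's four-entry JPT) scores a record by direct lookup and classifies Mpg from Horse by normalizing over the class values.

```python
# The slide's four-entry JPT over (Mpg, Horse).
jpt = {("good", "low"): 0.36, ("good", "high"): 0.04,
       ("bad", "low"): 0.12, ("bad", "high"): 0.48}

def anomaly_score(mpg, horse):
    """Likelihood of the record; smaller means the model is more surprised."""
    return jpt[(mpg, horse)]

def classify_mpg(horse):
    """P(Mpg=good | Horse=horse) by inference on the JPT, then pick the more probable class."""
    p_good = jpt[("good", horse)] / (jpt[("good", horse)] + jpt[("bad", horse)])
    return ("good" if p_good >= 0.5 else "bad"), p_good

print(anomaly_score("good", "high"))   # 0.04
print(classify_mpg("low"))             # ('good', 0.75)
```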
Bayes Net Bayes Classifier
• Now use the three-node net:
  P(Mpg):         P(good) = 0.4,  P(bad) = 0.6
  P(Horse|Mpg):   P(low|good) = 0.89, P(low|bad) = 0.21, P(high|good) = 0.11, P(high|bad) = 0.79
  P(Accel|Horse): P(slow|low) = 0.95, P(slow|high) = 0.11, P(fast|low) = 0.05, P(fast|high) = 0.89
• We're trying to classify cars as Mpg = "good" or "bad". Suppose we get a <Horse=low, Accel=fast> example. How do we classify it?
  P(good | low, fast) = P(good, low, fast) / P(low, fast)
                      = P(good) P(low|good) P(fast|low) / P(low, fast)
                      = (0.4)(0.89)(0.05) / P(low, fast)
                      = 0.0178 / [P(good, low, fast) + P(bad, low, fast)]
                      = 0.75
  (Note: this is not exactly 0.75 because I rounded some of the CPT numbers earlier...)
• P(good | low, fast) = 0.75, so we classify the example as "good".
• ...but that seems somehow familiar... Wasn't that the same answer as P(Mpg = good | Horse = low)?

Bayes Classifiers
• OK, so classification can be posed as inference.
• In fact, virtually all machine learning tasks are a form of inference:
  • Anomaly detection:  P(x)
  • Classification:     P(Class | x)
  • Regression:         P(Y | x)
  • Model learning:     P(Model | dataset)
  • Feature selection:  P(Model | dataset)

The Naïve Bayes Classifier
ASSUMPTION: all the attributes are conditionally independent given the class variable.

The Naïve Bayes Advantage
• The unrestricted model: at least 256 parameters! You had better have the data to support them...
• The naïve model: a mere 25 parameters! (The CPTs are tiny because the attribute nodes only have one parent.)

What is the Probability Function of the Naïve Bayes?
  P(Mpg, Cylinders, Weight, Maker, ...) = P(Mpg) P(Cylinders|Mpg) P(Weight|Mpg) P(Maker|Mpg) ...
In general:
  P(class, x) = P(class) ∏_i P(xi | class)
This is another great feature of Bayes nets: you can graphically see your model assumptions.

Bayes Classifier Results
• "MPG" dataset, 392 records: the classifier learned by "Naive BC". (Results figure on the slide.)
• "MPG" dataset, 40 records. (Results figure on the slide.)

More Facts About Bayes Classifiers
• Many other density estimators can be slotted in.
• Density estimation can be performed with real-valued inputs.
• Bayes Classifiers can be built with real-valued inputs.
• Rather technical complaint: Bayes Classifiers don't try to be maximally discriminative; they merely try to honestly model what's going on.
• Zero probabilities are painful for Joint and Naïve. A hack (justifiable with the magic words "Dirichlet Prior") can help.
• Naïve Bayes is wonderfully cheap. And survives 10,000 attributes cheerfully!

Summary
• Axioms of Probability
• Bayes nets are created by the chain rule and conditional independence
• Bayes Nets can do inference, anomaly detection, and classification

Copyright © 2006, Brigham S. Anderson
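To close the loop, here is a small naïve Bayes sketch (my own illustration; the toy records are invented) that implements P(class, x) = P(class) ∏_i P(xi|class) with counts from data and picks the most probable class. The add-one smoothing is a simple stand-in for the "Dirichlet Prior" hack mentioned above.

```python
from collections import Counter, defaultdict
from math import prod

def train_naive_bayes(records):
    """records: list of (label, {attr: value}). Learn P(class) and P(value | attr, class) by counting."""
    n = len(records)
    class_counts = Counter(label for label, _ in records)
    cond_counts = defaultdict(Counter)              # (label, attr) -> Counter over values
    for label, attrs in records:
        for attr, value in attrs.items():
            cond_counts[(label, attr)][value] += 1
    p_class = {c: k / n for c, k in class_counts.items()}
    def p_attr(value, attr, label):
        # Add-one smoothing so unseen attribute values do not give zero probability.
        c = cond_counts[(label, attr)]
        return (c[value] + 1) / (class_counts[label] + len(c) + 1)
    return p_class, p_attr

def classify(x, p_class, p_attr):
    """Naive Bayes: pick argmax over classes of P(class) * prod_i P(x_i | class)."""
    scores = {c: pc * prod(p_attr(v, a, c) for a, v in x.items())
              for c, pc in p_class.items()}
    return max(scores, key=scores.get)

# Invented toy data: classify Mpg from Horse and Accel.
data = ([("good", {"Horse": "low",  "Accel": "slow"})] * 6 +
        [("good", {"Horse": "low",  "Accel": "fast"})] * 2 +
        [("bad",  {"Horse": "high", "Accel": "fast"})] * 7 +
        [("bad",  {"Horse": "low",  "Accel": "fast"})] * 1)
p_class, p_attr = train_naive_bayes(data)
print(classify({"Horse": "low", "Accel": "fast"}, p_class, p_attr))   # expect "good"
```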