Machine Learning and the Axioms of Probability
Brigham S. Anderson
www.cs.cmu.edu/~brigham
brigham@cmu.edu
School of Computer Science, Carnegie Mellon University
Copyright © 2006, Brigham S. Anderson

Probability
• The world is a very uncertain place.
• 30 years of Artificial Intelligence and Database research danced around this fact.
• And then a few AI researchers decided to use some ideas from the eighteenth century.

What we're going to do
• We will review the fundamentals of probability.
• It's really going to be worth it.
• You'll see examples of probabilistic analytics in action: inference, anomaly detection, and Bayes classifiers.

Discrete Random Variables
• A is a Boolean-valued random variable if A denotes an event and there is some degree of uncertainty as to whether A occurs.
• Examples:
  A = The US president in 2023 will be male
  A = You wake up tomorrow with a headache
  A = You have influenza

Probabilities
• We write P(A) for "the probability that A is true."
• We could at this point spend 2 hours on the philosophy of this. We'll spend slightly less...

Sample Space
Definition 1. The set S of all possible outcomes of a particular experiment is called the sample space for the experiment. The elements of the sample space are called outcomes.

Sample Spaces
• Sample space of a coin flip: S = {H, T}
• Sample space of a die roll: S = {1, 2, 3, 4, 5, 6}
• Sample space of three die rolls: S = {111, 112, 113, ..., 664, 665, 666}
• Sample space of a single draw from a deck of cards: S = {As, Ac, Ah, Ad, 2s, 2c, 2h, ..., Ks, Kc, Kd, Kh}

So Far...
• The sample space is the set of all possible worlds. Example: {As, Ac, Ah, Ad, 2s, 2c, 2h, ..., Ks, Kc, Kd, Kh}
• An outcome is an element of the sample space. Example: 2c

Events
Definition 2. An event is any subset of S (including S itself).
• The sample space is the set of all outcomes.
• An outcome is a possible world.
• An event is a set of outcomes. Examples from the card-draw sample space: "Jack", "Hearts", "Red and Face".

Definitions
• The sample space is the set of all possible worlds: {As, Ac, Ah, Ad, 2s, 2c, 2h, ..., Ks, Kc, Kd, Kh}
• An outcome is a single point in the sample space: 2c
• An event is a set of outcomes from the sample space: {2h, 2c, 2s, 2d}

Events
Definition 3. Two events A and B are mutually exclusive if A ∩ B = Ø. (For example, "hearts" and "clubs" are mutually exclusive; hearts, clubs, spades, and diamonds together cover the whole deck.)
Definition 4. If A1, A2, ... are mutually exclusive and A1 ∪ A2 ∪ ... = S, then the collection A1, A2, ... forms a partition of S.
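As a concreteness check, here is a minimal Python sketch (mine, not from the slides) that builds the card-draw sample space, represents a few of the events above as sets, and verifies that the four suits are mutually exclusive and form a partition of S.

```python
# Build the 52-outcome sample space for a single card draw.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["s", "c", "h", "d"]  # spades, clubs, hearts, diamonds
S = {r + s for r in ranks for s in suits}

# Events are just subsets of S.
jacks = {o for o in S if o.startswith("J")}
hearts = {o for o in S if o.endswith("h")}
red = {o for o in S if o[-1] in ("h", "d")}
face = {o for o in S if o[0] in ("J", "Q", "K")}
red_and_face = red & face                 # the "Red and Face" event

# "hearts" and "clubs" are mutually exclusive: their intersection is empty.
clubs = {o for o in S if o.endswith("c")}
assert hearts & clubs == set()

# The four suit events cover S (and are pairwise disjoint by construction): a partition.
suit_events = [{o for o in S if o.endswith(s)} for s in suits]
assert set.union(*suit_events) == S

print(len(S), len(jacks), len(red_and_face))   # 52 4 6
```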
Probability
Definition 5. Given a sample space S, a probability function is a function that maps each event in S to a real number and satisfies:
• P(A) ≥ 0 for any event A in S
• P(S) = 1
• For any number of mutually exclusive events A1, A2, A3, ..., P(A1 ∪ A2 ∪ A3 ∪ ...) = P(A1) + P(A2) + P(A3) + ...
* This definition of the domain of this function is not 100% sufficient, but it's close enough for our purposes... (I'm sparing you Borel Fields.)

Definitions
• The sample space is the set of all possible worlds: {As, Ac, Ah, Ad, 2s, 2c, 2h, ..., Ks, Kc, Kd, Kh}
• An outcome is a single point in the sample space: 4c
• An event is a set of one or more outcomes: card is "Red"
• P(E) maps event E to a real number and satisfies the axioms of probability: P(Red) = 0.50, P(Black) = 0.50

Misconception
• "The relative area of the events determines their probability."
• ...in a Venn diagram it does, but not in general. However, the "area equals probability" rule is guaranteed to result in axiom-satisfying probability functions.
• We often assume, for example, that the probability of "heads" is equal to "tails" in the absence of other information... but this is totally outside the axioms!

Creating a Valid P()
One convenient way to create an axiom-satisfying probability function:
1. Assign a probability to each outcome in S
2. Make sure they sum to one
3. Declare that P(A) equals the sum of the probabilities of the outcomes in event A

Everyday Example
• Assume you are a doctor. This is the sample space of "patients you might see on any given day."
• Example outcomes:
  Non-smoker, female, diabetic, headache, good insurance, etc.
  Smoker, male, herniated disk, back pain, mildly schizophrenic, delinquent medical bills, etc.
• Number of elements in the "patient space": 100 jillion.
• Are these patients equally likely to occur? Again, generally not. Let's assume for the moment that they are, though, which roughly means "area equals probability."
• Event: patient has Flu (F). Size of set F: 2 jillion (exactly 2 jillion of the points in the sample space have flu). Size of the patient space: 100 jillion.
  P_patientSpace(F) = 2 jillion / 100 jillion = 0.02
• From now on, the subscript on P() will be omitted.

These Axioms Are Not to Be Trifled With
• There have been attempts to build different methodologies for uncertainty: fuzzy logic, three-valued logic, Dempster-Shafer, non-monotonic reasoning.
• But the axioms of probability are the only system with this property: if you gamble using them, you can't be unfairly exploited by an opponent using some other system [de Finetti 1931].

Theorems from the Axioms
Axioms:
• P(A) ≥ 0 for any event A in S
• P(S) = 1
• For any number of mutually exclusive events A1, A2, A3, ..., P(A1 ∪ A2 ∪ A3 ∪ ...) = P(A1) + P(A2) + P(A3) + ...
Theorem. If P is a probability function and A is an event in S, then P(~A) = 1 − P(A).
Proof:
(1) Since A and ~A partition S, P(A ∪ ~A) = P(S) = 1.
(2) Since A and ~A are disjoint, P(A ∪ ~A) = P(A) + P(~A).
Combining (1) and (2) gives the result.
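The "Creating a Valid P()" recipe above translates directly into code. A minimal sketch (my own illustration, using the equally-likely card-draw space from earlier): assign each outcome a probability, check that they sum to one, and define P(event) as the sum over the outcomes in the event.

```python
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["s", "c", "h", "d"]
S = [r + s for r in ranks for s in suits]

# Step 1: assign a probability to each outcome (here: equally likely).
p_outcome = {o: 1.0 / len(S) for o in S}

# Step 2: make sure they sum to one.
assert abs(sum(p_outcome.values()) - 1.0) < 1e-9

# Step 3: P(A) is the sum of the outcome probabilities in event A.
def P(event):
    return sum(p_outcome[o] for o in event)

red = {o for o in S if o[-1] in ("h", "d")}
black = {o for o in S if o[-1] in ("s", "c")}
print(P(red), P(black), P(red | black))   # ≈ 0.5, ≈ 0.5, ≈ 1.0
```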
Multivalued Random Variables
• Suppose A can take on more than 2 values.
• A is a random variable with arity k if it can take on exactly one value out of {A1, A2, ..., Ak}, and
• The events {A1, A2, ..., Ak} partition S, so
  P(Ai, Aj) = 0 if i ≠ j
  P(A1 ∪ A2 ∪ ... ∪ Ak) = 1

Elementary Probability in Pictures
(Each of these facts can be read off a Venn diagram.)
• P(~A) + P(A) = 1
• P(B) = P(B, A) + P(B, ~A)
• Σ_{j=1..k} P(Aj) = 1, when A1, ..., Ak partition S
• P(B) = Σ_{j=1..k} P(B, Aj)   ...Useful!

Conditional Probability
• Assume once more that you are a doctor. Again, this is the sample space of "patients you might see on any given day."
• Event: Flu, with P(F) = 0.02.
• Event: Headache, with P(H) = 0.10.
• ...we still need to specify the interaction between flu and headache...
• Define P(H|F) = fraction of F's outcomes which are also in H.
• With P(F) = 0.02, P(H) = 0.10, and P(H|F) = 0.50, the four regions of the Venn diagram have probabilities: flu and headache 0.01, flu only 0.01, headache only 0.09, neither 0.89. (H = "headache", F = "flu")

P(H|F) = fraction of flu worlds in which the patient has a headache
       = (# worlds with flu and headache) / (# worlds with flu)
       = (size of the "H and F" region) / (size of the "F" region)
       = P(H, F) / P(F)

Conditional Probability
Definition. If A and B are events in S, and P(B) > 0, then the conditional probability of A given B, written P(A|B), is
  P(A|B) = P(A, B) / P(B)
The Chain Rule: a simple rearrangement of the above equation yields
  P(A, B) = P(A|B) P(B)
This is the main Bayes Net concept!

Probabilistic Inference
• H = "Have a headache", F = "Coming down with Flu"
• P(H) = 0.10, P(F) = 0.02, P(H|F) = 0.50
• One day you wake up with a headache. You think: "Drat! 50% of flus are associated with headaches, so I must have a 50-50 chance of coming down with flu."
• Is this reasoning good?
  P(F|H) = P(F, H) / P(H) = P(H|F) P(F) / P(H) = (0.50)(0.02) / 0.10 = 0.10

What we just did...
  P(B|A) = P(A, B) / P(A) = P(A|B) P(B) / P(A)
This is Bayes Rule.
Bayes, Thomas (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418.

More General Forms of Bayes Rule
  P(A|B) = P(B|A) P(A) / [P(B|A) P(A) + P(B|~A) P(~A)]
  P(A|B, C) = P(B|A, C) P(A, C) / P(B, C)
  P(Ai|B) = P(B|Ai) P(Ai) / Σ_{k=1..nA} P(B|Ak) P(Ak)

Independence
Definition. Two events, A and B, are statistically independent if
  P(A, B) = P(A) P(B)
which is equivalent to
  P(A|B) = P(A)
Important for Bayes Nets.
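To make the headache/flu numbers concrete, here is a tiny Python sketch (my own, using only the values given on the slides) that applies Bayes rule, reproducing P(F|H) = 0.1, and then checks the independence definition just stated.

```python
# Numbers from the flu/headache example.
p_F, p_H, p_H_given_F = 0.02, 0.10, 0.50

# Bayes rule: P(F|H) = P(H|F) P(F) / P(H)
p_F_given_H = p_H_given_F * p_F / p_H
print(p_F_given_H)            # 0.1, not the 50-50 chance the naive reasoning suggested

# Independence check: F and H are independent only if P(F, H) == P(F) P(H).
p_FH = p_H_given_F * p_F      # chain rule: P(F, H) = P(H|F) P(F) = 0.01
print(p_FH, p_F * p_H)        # 0.01 vs 0.002, so flu and headache are not independent
```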
Representing P(A, B, C)
• How can we represent the function P(A)? P(A, B)? P(A, B, C)?

The Joint Probability Table
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.

Example: P(A, B, C)

A  B  C  Prob
0  0  0  0.30
0  0  1  0.05
0  1  0  0.10
0  1  1  0.05
1  0  0  0.05
1  0  1  0.10
1  1  0  0.25
1  1  1  0.10

(The same distribution can also be drawn as a three-circle Venn diagram with these probabilities attached to its regions.)

Using the Joint
Once you have the JPT you can ask for the probability of any logical expression E:
  P(E) = Σ_{rows matching E} P(row)
• ...what is P(Poor, Male)?  P(Poor, Male) = 0.4654
• ...what is P(Poor)?  P(Poor) = 0.7604

Inference with the Joint
  P(E1|E2) = P(E1, E2) / P(E2) = [Σ_{rows matching E1 and E2} P(row)] / [Σ_{rows matching E2} P(row)]
• ...what is P(Male|Poor)?  P(Male|Poor) = 0.4654 / 0.7604 = 0.612

Inference is a big deal
• I've got this evidence. What's the chance that this conclusion is true?
  • I've got a sore neck: how likely am I to have meningitis?
  • I see my lights are out and it's 9pm. What's the chance my spouse is already asleep?
• There's a thriving set of industries growing based around Bayesian inference. Highlights are: medicine, pharma, help desk support, engine fault diagnosis.

Where do Joint Distributions come from?
• Idea One: Expert humans.
• Idea Two: Simpler probabilistic facts and some algebra.
  Example: Suppose you knew
    P(A) = 0.5
    P(B|A) = 0.2
    P(B|~A) = 0.1
    P(C|A,B) = 0.1
    P(C|A,~B) = 0.8
    P(C|~A,B) = 0.3
    P(C|~A,~B) = 0.1
  Then you can automatically compute the JPT using the chain rule:
    P(A, B, C) = P(A) P(B|A) P(C|A, B)
  Bayes Nets are a systematic way to do this.
• Idea Three: Learn them from data!
  Prepare to witness an impressive learning algorithm...

Learning a JPT
• Build a joint probability table for your attributes in which the probabilities are unspecified, then fill in each row with
  P̂(row) = (# records matching row) / (total number of records)
• For example, the entry for the row A=1, B=1, C=0 is the fraction of all records in which A and B are true but C is false. Filling in every row this way yields a learned JPT like the example table above.

Example of Learning a JPT
• The JPT used in the P(Poor, Male) examples above was obtained by learning from three attributes in the UCI "Adult" census database [Kohavi 1995].

Where are we?
• We have recalled the fundamentals of probability.
• We have become content with what JPTs are and how to use them.
• And we even know how to learn JPTs from data.
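The "learn a JPT, then query it" loop fits in a few lines of Python. This sketch is my own; the toy records are invented, but chosen so that the learned table matches the example JPT above. It counts rows to estimate the JPT and then evaluates P(E1|E2) by summing matching entries.

```python
from collections import Counter
from itertools import product

# Toy boolean records over (A, B, C); invented so the learned JPT matches the example table.
records = ([(0, 0, 0)] * 6 + [(0, 0, 1)] * 1 + [(0, 1, 0)] * 2 + [(0, 1, 1)] * 1 +
           [(1, 0, 0)] * 1 + [(1, 0, 1)] * 2 + [(1, 1, 0)] * 5 + [(1, 1, 1)] * 2)

# Learn the JPT: P_hat(row) = (# records matching row) / (total number of records).
counts = Counter(records)
jpt = {row: counts[row] / len(records) for row in product((0, 1), repeat=3)}

def P(match):
    """P(E) = sum of the JPT entries whose rows satisfy the predicate E."""
    return sum(p for row, p in jpt.items() if match(row))

def P_cond(match1, match2):
    """P(E1|E2) = P(E1 and E2) / P(E2), both computed by summing matching rows."""
    return P(lambda r: match1(r) and match2(r)) / P(match2)

print(jpt[(1, 1, 0)])                                      # 0.25, as in the example table
print(P(lambda r: r[0] == 1))                              # P(A=1) = 0.50
print(P_cond(lambda r: r[2] == 1, lambda r: r[0] == 1))    # P(C=1 | A=1) = 0.40
```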
Density Estimation
• Our Joint Probability Table (JPT) learner is our first example of something called Density Estimation.
• A density estimator learns a mapping from a set of input attributes to a probability.

Evaluating a density estimator
• Given a record x, a density estimator M can tell you how likely the record is: P̂(x|M)
• Given a dataset with R records, a density estimator can tell you how likely the dataset is (under the assumption that all records were independently generated from the probability function):
  P̂(dataset|M) = P̂(x1, x2, ..., xR|M) = ∏_{k=1..R} P̂(xk|M)

A small dataset: Miles Per Gallon
• 192 training set records from the UCI repository (thanks to Ross Quinlan), with three attributes: mpg (good/bad), modelyear (70to74, 75to78, 79to83), and maker (america, asia, europe). A few sample records:

mpg   modelyear  maker
good  75to78     asia
bad   70to74     america
bad   75to78     europe
bad   70to74     america
...   ...        ...
good  79to83     america
bad   75to78     europe

• Evaluating a JPT learned from this dataset on the training set itself:
  P̂(dataset|M) = P̂(x1, x2, ..., xR|M) = ∏_{k=1..R} P̂(xk|M) ≈ 3.4 × 10^-203

Log Probabilities
Since probabilities of datasets get so small we usually use log probabilities:
  log P̂(dataset|M) = log ∏_{k=1..R} P̂(xk|M) = Σ_{k=1..R} log P̂(xk|M)
For the MPG training set: log P̂(dataset|M) ≈ -466.19

Summary: The Good News
The JPT allows us to learn P(X) from data.
• Can do inference: P(E1|E2). Automatic doctor, recommender, etc.
• Can do anomaly detection: spot suspicious / incorrect records (e.g., credit card fraud).
• Can do Bayesian classification: predict the class of a record (e.g., predict cancerous/not-cancerous).

Summary: The Bad News
• Density estimation with JPTs is trivial, mindless, and dangerous.
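Before turning to test sets, here is how the evaluation above is computed: a density estimator's score on a dataset is just a sum of per-record log probabilities. The sketch is my own; the tiny model and dataset are invented for illustration (they reuse the Mpg/Horse numbers that appear later in the slides).

```python
import math

def log_likelihood(dataset, density):
    """log P(dataset|M) = sum_k log P(x_k|M), assuming i.i.d. records."""
    return sum(math.log(density[x]) for x in dataset)

# Tiny illustrative model and dataset (invented; real JPTs have many more rows).
density = {("good", "low"): 0.36, ("good", "high"): 0.04,
           ("bad", "low"): 0.12, ("bad", "high"): 0.48}
dataset = [("good", "low"), ("bad", "high"), ("bad", "high"), ("good", "high")]

# The raw product of probabilities underflows for large datasets; the log-sum does not.
print(math.prod(density[x] for x in dataset))   # ≈ 0.0033
print(log_likelihood(dataset, density))         # ≈ -5.71, the same information in log form
```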
Using a test set
• An independent test set with 196 cars has a much worse log-likelihood than the training set had (actually it's a billion quintillion quintillion quintillion quintillion times less likely).
• ...Density estimators can overfit. And the JPT estimator is the overfittiest of them all!

Overfitting Density Estimators
• If a test-set record lands in a JPT cell that no training record filled, then P̂(x|M) = 0 and log P̂(x|M) = −∞. If this ever happens, it means there are certain combinations that we have learned are "impossible."

Using a test set
• The only reason that our test set didn't score −∞ is that Andrew's code is hard-wired to always predict a probability of at least one in 10^20.
• We need density estimators that are less prone to overfitting.

Is there a better way?
• The problem with the JPT is that it just mirrors the training data. In fact, it is just another way of storing the data: we could reconstruct the original dataset perfectly from it!
• We need to represent the probability function with fewer parameters...

Aside: Bayes Nets

Bayes Nets
• What are they? Bayesian nets are a framework for representing and analyzing models involving uncertainty.
• What are they used for? Intelligent decision aids, data fusion, 3-E feature recognition, intelligent diagnostic aids, automated free text understanding, data mining.
• How are they different from other knowledge representation and probabilistic analysis tools? Uncertainty is handled in a mathematically rigorous yet efficient and simple way.

Bayes Net Concepts
1. Chain Rule: P(A, B) = P(A) P(B|A)
2. Conditional Independence: P(A|B, C) = P(A|B)

A Simple Bayes Net
• Let's assume that we already have P(Mpg, Horse):
  P(good, low)  = 0.36
  P(good, high) = 0.04
  P(bad,  low)  = 0.12
  P(bad,  high) = 0.48
• How would you rewrite this using the Chain rule?

Review: Chain Rule
P(Mpg, Horse) = P(Mpg) × P(Horse|Mpg), with
  P(Mpg):       P(good) = 0.4,  P(bad) = 0.6
  P(Horse|Mpg): P(low|good) = 0.89, P(low|bad) = 0.21, P(high|good) = 0.11, P(high|bad) = 0.79
Checking the factorization against the JPT:
  P(good, low)  = P(good) × P(low|good)  = 0.4 × 0.89 ≈ 0.36
  P(good, high) = P(good) × P(high|good) = 0.4 × 0.11 ≈ 0.04
  P(bad,  low)  = P(bad)  × P(low|bad)   = 0.6 × 0.21 ≈ 0.12
  P(bad,  high) = P(bad)  × P(high|bad)  = 0.6 × 0.79 ≈ 0.48

How to Make a Bayes Net
P(Mpg, Horse) = P(Mpg) × P(Horse|Mpg)
• Draw one node per variable and one arc for each conditioning relationship: Mpg → Horse.
• Each node is a probability function; each arc denotes conditional dependence.
  P(Mpg):       P(good) = 0.4,  P(bad) = 0.6
  P(Horse|Mpg): P(low|good) = 0.90, P(low|bad) = 0.21, P(high|good) = 0.10, P(high|bad) = 0.79
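Here is a minimal sketch (my own, using the slide's numbers) of the two-node factorization: store P(Mpg) and P(Horse|Mpg) instead of the joint, and recover any joint entry with the chain rule.

```python
# Two-node Bayes net: Mpg -> Horse, using the CPT numbers from the slides.
p_mpg = {"good": 0.4, "bad": 0.6}
p_horse_given_mpg = {"good": {"low": 0.90, "high": 0.10},
                     "bad":  {"low": 0.21, "high": 0.79}}

def joint(mpg, horse):
    """Chain rule: P(Mpg, Horse) = P(Mpg) * P(Horse|Mpg)."""
    return p_mpg[mpg] * p_horse_given_mpg[mpg][horse]

for mpg in ("good", "bad"):
    for horse in ("low", "high"):
        print(mpg, horse, round(joint(mpg, horse), 3))
# Approximately reproduces the JPT: 0.36, 0.04, 0.126, 0.474 (the slides round the CPT numbers).
```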
How to Make a Bayes Net
• So, what have we accomplished thus far? Nothing; we've just "Bayes Net-ified" the P(Mpg, Horse) JPT using the Chain rule.
• ...the real excitement starts when we wield conditional independence.

How to Make a Bayes Net
Before we continue, we need a worthier opponent than puny P(Mpg, Horse)... We'll use P(Mpg, Horse, Accel) (* note: I made these up...):
  P(good, low,  slow) = 0.37
  P(good, low,  fast) = 0.01
  P(good, high, slow) = 0.02
  P(good, high, fast) = 0.00
  P(bad,  low,  slow) = 0.10
  P(bad,  low,  fast) = 0.12
  P(bad,  high, slow) = 0.16
  P(bad,  high, fast) = 0.22

How to Make a Bayes Net
Step 1: Rewrite the joint using the Chain rule:
  P(Mpg, Horse, Accel) = P(Mpg) P(Horse|Mpg) P(Accel|Mpg, Horse)
Note: obviously, we could have written this 3! = 6 different ways:
  P(M, H, A) = P(M) × P(H|M) × P(A|M,H)
             = P(M) × P(A|M) × P(H|M,A)
             = P(H) × P(M|H) × P(A|H,M)
             = P(H) × P(A|H) × P(M|H,A)
             = ...
As a graph: Mpg → Horse, with arcs from both Mpg and Horse into Accel. The node CPTs are P(Mpg), P(Horse|Mpg), and P(Accel|Mpg, Horse):
  P(Mpg):       P(good) = 0.4,  P(bad) = 0.6
  P(Horse|Mpg): P(low|good) = 0.90, P(low|bad) = 0.21, P(high|good) = 0.10, P(high|bad) = 0.79
  P(Accel|Mpg, Horse) (* note: I made these up too...):
    P(slow|good, low)  = 0.97   P(fast|good, low)  = 0.03
    P(slow|good, high) = 0.15   P(fast|good, high) = 0.85
    P(slow|bad,  low)  = 0.90   P(fast|bad,  low)  = 0.10
    P(slow|bad,  high) = 0.05   P(fast|bad,  high) = 0.95

A Miracle Occurs!
You are told by God (or another domain expert) that Accel is independent of Mpg given Horse! i.e.,
  P(Accel | Mpg, Horse) = P(Accel | Horse)

How to Make a Bayes Net
The Accel node now needs only P(Accel|Horse):
  P(Mpg):         P(good) = 0.4,  P(bad) = 0.6
  P(Horse|Mpg):   P(low|good) = 0.89, P(low|bad) = 0.21, P(high|good) = 0.11, P(high|bad) = 0.79
  P(Accel|Horse): P(slow|low) = 0.22, P(slow|high) = 0.64, P(fast|low) = 0.78, P(fast|high) = 0.36

How to Make a Bayes Net
Thank you, domain expert! Now I only need to learn 5 parameters instead of 7 from my data! My parameter estimates will be more accurate as a result!

Independence
• "The Acceleration does not depend on the Mpg once I know the Horsepower."
• This can be specified very simply: P(Accel | Mpg, Horse) = P(Accel | Horse)
• This is a powerful statement! It required extra domain knowledge, a different kind of knowledge than numerical probabilities. It needed an understanding of causation.
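A sketch of the condensed three-node net (my own code, with the slide's post-miracle numbers): Accel's CPT is conditioned only on Horse, and any joint entry is the product of the three node CPTs. The free-parameter count drops from 7 to 5 (1 for Mpg, 2 for Horse|Mpg, 2 for Accel|Horse).

```python
# Mpg -> Horse -> Accel, with Accel conditionally independent of Mpg given Horse.
p_mpg = {"good": 0.4, "bad": 0.6}
p_horse = {"good": {"low": 0.89, "high": 0.11},
           "bad":  {"low": 0.21, "high": 0.79}}
p_accel = {"low":  {"slow": 0.22, "fast": 0.78},
           "high": {"slow": 0.64, "fast": 0.36}}

def joint(mpg, horse, accel):
    """P(M, H, A) = P(M) * P(H|M) * P(A|H): conditional independence in action."""
    return p_mpg[mpg] * p_horse[mpg][horse] * p_accel[horse][accel]

# Free parameters: 1 (Mpg) + 2 (Horse|Mpg) + 2 (Accel|Horse) = 5, versus 7 for the full JPT.
print(joint("good", "low", "fast"))    # 0.4 * 0.89 * 0.78 ≈ 0.278
print(sum(joint(m, h, a) for m in ("good", "bad")
          for h in ("low", "high") for a in ("slow", "fast")))   # ≈ 1.0: a valid joint
```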
Bayes Nets Formalized
A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair (V, E) where:
• V is a set of vertices.
• E is a set of directed edges joining vertices. No loops of any length are allowed.
Each vertex in V contains the following information:
• A Conditional Probability Table (CPT) indicating how this variable's probabilities depend on all possible combinations of parental values.

Bayes Nets Summary
• Bayes nets are a factorization of the full JPT which uses the chain rule and conditional independence.
• They can do everything a JPT can do (like quick, cheap lookups of probabilities).

The good news
We can do inference. We can compute any conditional probability P(some variable | some other variable values):
  P(E1|E2) = P(E1, E2) / P(E2) = [Σ_{joint entries matching E1 and E2} P(joint entry)] / [Σ_{joint entries matching E2} P(joint entry)]
Suppose you have m binary-valued variables in your Bayes Net and expression E2 mentions k variables. How much work is the above computation?

The sad, bad news
• Doing inference "JPT-style" by enumerating all matching entries in the joint is expensive: exponential in the number of variables.
• But perhaps there are faster ways of querying Bayes nets?
  • In fact, if I ever ask you to manually do a Bayes Net inference, you'll find there are often many tricks to save you time.
  • So we've just got to program our computer to do those tricks too, right?
• Sadder and worse news: general querying of Bayes nets is NP-complete.
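To make the cost concrete, here is a brute-force sketch (my own, reusing the slide's three-node net) that answers a query exactly as the formula above prescribes, by summing over every joint entry. The number of entries it touches doubles with every extra binary variable, which is the exponential blow-up the slides warn about.

```python
from itertools import product

# The same three-node net (Mpg -> Horse -> Accel), restated so this sketch is self-contained.
p_mpg = {"good": 0.4, "bad": 0.6}
p_horse = {"good": {"low": 0.89, "high": 0.11}, "bad": {"low": 0.21, "high": 0.79}}
p_accel = {"low": {"slow": 0.22, "fast": 0.78}, "high": {"slow": 0.64, "fast": 0.36}}

def joint(m, h, a):
    return p_mpg[m] * p_horse[m][h] * p_accel[h][a]

def infer(query, evidence):
    """P(query | evidence) by enumerating joint entries: exponential in the number of variables."""
    worlds = [dict(zip(("Mpg", "Horse", "Accel"), w))
              for w in product(("good", "bad"), ("low", "high"), ("slow", "fast"))]
    def matches(world, assignment):
        return all(world[var] == val for var, val in assignment.items())
    numer = sum(joint(w["Mpg"], w["Horse"], w["Accel"])
                for w in worlds if matches(w, query) and matches(w, evidence))
    denom = sum(joint(w["Mpg"], w["Horse"], w["Accel"])
                for w in worlds if matches(w, evidence))
    return numer / denom

print(infer({"Mpg": "good"}, {"Accel": "fast"}))   # P(Mpg=good | Accel=fast), about 0.52
```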
Case Study I
• Pathfinder system (Heckerman 1991, Probabilistic Similarity Networks, MIT Press, Cambridge MA).
• Diagnostic system for lymph-node diseases.
• 60 diseases and 100 symptoms and test results.
• 14,000 probabilities.
• Expert consulted to make the net: 8 hours to determine the variables, 35 hours for the net topology, 40 hours for the probability table values.
• Apparently, the experts found it quite easy to invent the causal links and probabilities.
• Pathfinder is now outperforming the world experts in diagnosis. Being extended to several dozen other medical domains.

Bayes Net Info
• GUI packages: Genie (free), Netica ($$), Hugin ($$)
• Non-GUI packages: all of the above have APIs; BNT for MATLAB; AUTON code (learning extremely large networks of tens of thousands of nodes)

Bayes Nets and Machine Learning

Machine Learning Tasks
• Inference engine: evidence e1 and missing variables E2 → P(E2 | e1)
• Anomaly detector: data point x → P(x)
• Classifier: data point x → P(C | x)

What is an Anomaly?
• An irregularity that cannot be explained by simple domain models and knowledge.
• Anomaly detection only needs to learn from examples of "normal" system behavior.
• Classification, on the other hand, would need examples labeled "normal" and "not-normal".

Anomaly Detection in Practice
• Monitoring computer networks for attacks.
• Monitoring population-wide health data for outbreaks or attacks.
• Looking for suspicious activity in bank transactions.
• Detecting unusual eBay selling/buying behavior.

JPT Anomaly Detector
• Suppose we have the following model:
  P(Mpg, Horse): P(good, low) = 0.36, P(good, high) = 0.04, P(bad, low) = 0.12, P(bad, high) = 0.48
• We're trying to detect anomalous cars. If the next example we see is <good, high>, how anomalous is it?
• How likely is <good, high>?
  likelihood(good, high) = P(good, high) = 0.04
• Could not be easier! Just look up the entry in the JPT! Smaller numbers are more anomalous in that the model is more surprised to see them.

Bayes Net Anomaly Detector
• Same question with the two-node net, where P(good) = 0.4, P(bad) = 0.6, P(low|good) = 0.90, P(low|bad) = 0.21, P(high|good) = 0.10, P(high|bad) = 0.79.
• How likely is <good, high>?
  likelihood(good, high) = P(good, high) = P(good) P(high|good) = 0.04
• Again, trivial! We need to do one tiny lookup for each variable in the network!

Machine Learning Tasks (recap): inference engine, anomaly detector, classifier. Next up: the classifier.

Bayes Classifiers
• A formidable and sworn enemy of decision trees.

Bayes Classifiers in 1 Slide
Bayes classifiers just do inference. That's it.
The "algorithm":
1. Learn P(class, X)
2. For a given input x, infer P(class|x)
3. Choose the class with the highest probability

JPT Bayes Classifier
• Suppose we have the following model:
  P(Mpg, Horse): P(good, low) = 0.36, P(good, high) = 0.04, P(bad, low) = 0.12, P(bad, high) = 0.48
• We're trying to classify cars as Mpg = "good" or "bad". If the next example we see is Horse = "low", how do we classify it?
  P(good|low) = P(good, low) / P(low) = P(good, low) / [P(good, low) + P(bad, low)] = 0.36 / (0.36 + 0.12) = 0.75
• P(good|low) = 0.75, so we classify the example as "good".
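Both the anomaly detector and the JPT classifier above reduce to lookups in the same table. A small sketch (my own, reusing the slide's four-entry JPT) scores a record by direct lookup and classifies Mpg from Horse by normalizing over the class values.

```python
# The slide's four-entry JPT over (Mpg, Horse).
jpt = {("good", "low"): 0.36, ("good", "high"): 0.04,
       ("bad", "low"): 0.12, ("bad", "high"): 0.48}

def anomaly_score(mpg, horse):
    """Likelihood of the record; smaller means the model is more surprised."""
    return jpt[(mpg, horse)]

def classify_mpg(horse):
    """P(Mpg=good | Horse=horse) by inference on the JPT, then pick the more probable class."""
    p_good = jpt[("good", horse)] / (jpt[("good", horse)] + jpt[("bad", horse)])
    return ("good" if p_good >= 0.5 else "bad"), p_good

print(anomaly_score("good", "high"))   # 0.04
print(classify_mpg("low"))             # ('good', 0.75)
```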
Bayes Net Bayes Classifier
• Now use the three-node net:
  P(Mpg):         P(good) = 0.4,  P(bad) = 0.6
  P(Horse|Mpg):   P(low|good) = 0.89, P(low|bad) = 0.21, P(high|good) = 0.11, P(high|bad) = 0.79
  P(Accel|Horse): P(slow|low) = 0.95, P(slow|high) = 0.11, P(fast|low) = 0.05, P(fast|high) = 0.89
• We're trying to classify cars as Mpg = "good" or "bad". Suppose we get a <Horse=low, Accel=fast> example. How do we classify it?
  P(good | low, fast) = P(good, low, fast) / P(low, fast)
                      = P(good) P(low|good) P(fast|low) / P(low, fast)
                      = (0.4)(0.89)(0.05) / P(low, fast)
                      = 0.0178 / [P(good, low, fast) + P(bad, low, fast)]
                      = 0.75
  (Note: this is not exactly 0.75 because I rounded some of the CPT numbers earlier...)
• P(good | low, fast) = 0.75, so we classify the example as "good".
• ...but that seems somehow familiar... Wasn't that the same answer as P(Mpg = good | Horse = low)?

Bayes Classifiers
• OK, so classification can be posed as inference.
• In fact, virtually all machine learning tasks are a form of inference:
  • Anomaly detection:  P(x)
  • Classification:     P(Class | x)
  • Regression:         P(Y | x)
  • Model learning:     P(Model | dataset)
  • Feature selection:  P(Model | dataset)

The Naïve Bayes Classifier
ASSUMPTION: all the attributes are conditionally independent given the class variable.

The Naïve Bayes Advantage
• The unrestricted model: at least 256 parameters! You had better have the data to support them...
• The naïve model: a mere 25 parameters! (The CPTs are tiny because the attribute nodes only have one parent.)

What is the Probability Function of the Naïve Bayes?
  P(Mpg, Cylinders, Weight, Maker, ...) = P(Mpg) P(Cylinders|Mpg) P(Weight|Mpg) P(Maker|Mpg) ...
In general:
  P(class, x) = P(class) ∏_i P(xi | class)
This is another great feature of Bayes nets: you can graphically see your model assumptions.

Bayes Classifier Results
• "MPG" dataset, 392 records: the classifier learned by "Naive BC". (Results figure on the slide.)
• "MPG" dataset, 40 records. (Results figure on the slide.)

More Facts About Bayes Classifiers
• Many other density estimators can be slotted in.
• Density estimation can be performed with real-valued inputs.
• Bayes Classifiers can be built with real-valued inputs.
• Rather technical complaint: Bayes Classifiers don't try to be maximally discriminative; they merely try to honestly model what's going on.
• Zero probabilities are painful for Joint and Naïve. A hack (justifiable with the magic words "Dirichlet Prior") can help.
• Naïve Bayes is wonderfully cheap. And survives 10,000 attributes cheerfully!

Summary
• Axioms of Probability
• Bayes nets are created by the chain rule and conditional independence
• Bayes Nets can do inference, anomaly detection, and classification

Copyright © 2006, Brigham S. Anderson
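To close the loop, here is a small naïve Bayes sketch (my own illustration; the toy records are invented) that implements P(class, x) = P(class) ∏_i P(xi|class) with counts from data and picks the most probable class. The add-one smoothing is a simple stand-in for the "Dirichlet Prior" hack mentioned above.

```python
from collections import Counter, defaultdict
from math import prod

def train_naive_bayes(records):
    """records: list of (label, {attr: value}). Learn P(class) and P(value | attr, class) by counting."""
    n = len(records)
    class_counts = Counter(label for label, _ in records)
    cond_counts = defaultdict(Counter)              # (label, attr) -> Counter over values
    for label, attrs in records:
        for attr, value in attrs.items():
            cond_counts[(label, attr)][value] += 1
    p_class = {c: k / n for c, k in class_counts.items()}
    def p_attr(value, attr, label):
        # Add-one smoothing so unseen attribute values do not give zero probability.
        c = cond_counts[(label, attr)]
        return (c[value] + 1) / (class_counts[label] + len(c) + 1)
    return p_class, p_attr

def classify(x, p_class, p_attr):
    """Naive Bayes: pick argmax over classes of P(class) * prod_i P(x_i | class)."""
    scores = {c: pc * prod(p_attr(v, a, c) for a, v in x.items())
              for c, pc in p_class.items()}
    return max(scores, key=scores.get)

# Invented toy data: classify Mpg from Horse and Accel.
data = ([("good", {"Horse": "low",  "Accel": "slow"})] * 6 +
        [("good", {"Horse": "low",  "Accel": "fast"})] * 2 +
        [("bad",  {"Horse": "high", "Accel": "fast"})] * 7 +
        [("bad",  {"Horse": "low",  "Accel": "fast"})] * 1)
p_class, p_attr = train_naive_bayes(data)
print(classify({"Horse": "low", "Accel": "fast"}, p_class, p_attr))   # expect "good"
```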