Artificial Intelligence: Reasoning Under Uncertainty / Bayes Nets
Bayesian Learning

Conditional Probability
• Probability of an event given the occurrence of some other event:

  P(X | Y) = P(X ∩ Y) / P(Y) = P(X, Y) / P(Y)

Example
You've been keeping track of the last 1000 emails you received. You find that 100 of them are spam. You also find that 200 of them were put in your junk folder, of which 90 were spam.

What is the probability an email you receive is spam?

  P(X) = 100 / 1000 = .1

What is the probability an email you receive is put in your junk folder?

  P(Y) = 200 / 1000 = .2

Given that an email is in your junk folder, what is the probability it is spam?

  P(X | Y) = P(X ∩ Y) / P(Y) = .09 / .2 = .45

Given that an email is spam, what is the probability it is in your junk folder?

  P(Y | X) = P(X ∩ Y) / P(X) = .09 / .1 = .9

(Here P(X ∩ Y) = 90 / 1000 = .09, the fraction of emails that are both spam and in the junk folder.)
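The arithmetic above can be checked with a minimal sketch (not part of the original slides; the variable names are illustrative):

```python
# Computing the conditional probabilities above directly from the email counts.
total = 1000        # emails tracked
spam = 100          # emails that are spam          -> event X
junk = 200          # emails in the junk folder     -> event Y
spam_and_junk = 90  # emails that are both          -> X ∩ Y

p_x = spam / total            # P(X) = 0.1
p_y = junk / total            # P(Y) = 0.2
p_xy = spam_and_junk / total  # P(X ∩ Y) = 0.09

p_x_given_y = p_xy / p_y      # P(X | Y) = 0.45
p_y_given_x = p_xy / p_x      # P(Y | X) = 0.9

print(p_x_given_y, p_y_given_x)
```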
Deriving Bayes Rule

  P(X | Y) = P(X ∩ Y) / P(Y)
  P(Y | X) = P(X ∩ Y) / P(X)

  Bayes rule:  P(X | Y) = P(Y | X) P(X) / P(Y)

General Application to Data Models
• In machine learning we have a space H of hypotheses: h1, h2, ..., hn (possibly infinite).
• We also have a set D of data.
• We want to calculate P(h | D).
• Bayes rule gives us:

  P(h | D) = P(D | h) P(h) / P(D)

Terminology
– Prior probability of h:
  • P(h): probability that hypothesis h is true given our prior knowledge.
  • If there is no prior knowledge, all h ∈ H are equally probable.
– Posterior probability of h:
  • P(h | D): probability that hypothesis h is true, given the data D.
– Likelihood of D:
  • P(D | h): probability that we will see data D, given that hypothesis h is true.
– Marginal likelihood of D:
  • P(D) = Σ_h P(D | h) P(h)

A Bayesian Approach to the Monty Hall Problem
You are a contestant on a game show. There are 3 doors, A, B, and C. There is a new car behind one of them and goats behind the other two. Monty Hall, the host, knows what is behind the doors. He asks you to pick a door, any door. You pick door A. Monty tells you he will open a door, different from A, that has a goat behind it. He opens door B: behind it there is a goat. Monty now gives you a choice: stick with your original choice A or switch to C.

Bayesian probability formulation
Hypothesis space H:
  h1 = car is behind door A
  h2 = car is behind door B
  h3 = car is behind door C
Prior probability:
  P(h1) = 1/3, P(h2) = 1/3, P(h3) = 1/3
Data D: after you picked door A, Monty opened B to show a goat.
Likelihood:
  P(D | h1) = 1/2, P(D | h2) = 0, P(D | h3) = 1
What are P(h1 | D), P(h2 | D), and P(h3 | D)?
Marginal likelihood:
  P(D) = P(D | h1) P(h1) + P(D | h2) P(h2) + P(D | h3) P(h3) = 1/6 + 0 + 1/3 = 1/2
By Bayes rule:
  P(h1 | D) = P(D | h1) P(h1) / P(D) = (1/2)(1/3)(2) = 1/3
  P(h2 | D) = P(D | h2) P(h2) / P(D) = (0)(1/3)(2) = 0
  P(h3 | D) = P(D | h3) P(h3) / P(D) = (1)(1/3)(2) = 2/3
So you should switch!

MAP ("maximum a posteriori") Learning
Bayes rule:

  P(h | D) = P(D | h) P(h) / P(D)

Goal of learning: find the maximum a posteriori hypothesis hMAP:

  hMAP = argmax_{h ∈ H} P(h | D)
       = argmax_{h ∈ H} P(D | h) P(h) / P(D)
       = argmax_{h ∈ H} P(D | h) P(h)

because P(D) is a constant independent of h.
Note: if every h ∈ H is equally probable, then

  hMAP = argmax_{h ∈ H} P(D | h)

and hMAP is called the "maximum likelihood hypothesis".

A Medical Example
Toby takes a test for leukemia. The test has two outcomes: positive and negative. It is known that if the patient has leukemia, the test is positive 98% of the time. If the patient does not have leukemia, the test is positive 3% of the time. It is also known that 0.008 of the population has leukemia. Toby's test is positive. Which is more likely: Toby has leukemia or Toby does not have leukemia?

• Hypothesis space:
  h1 = Toby has leukemia
  h2 = Toby does not have leukemia
• Prior: 0.008 of the population has leukemia, so
  P(h1) = 0.008, P(h2) = 0.992
• Likelihood:
  P(+ | h1) = 0.98, P(− | h1) = 0.02
  P(+ | h2) = 0.03, P(− | h2) = 0.97
• Posterior knowledge: the blood test is + for this patient.
• Thus:

  hMAP = argmax_{h ∈ H} P(D | h) P(h)

  P(+ | leukemia) P(leukemia) = (0.98)(0.008) = 0.0078
  P(+ | ¬leukemia) P(¬leukemia) = (0.03)(0.992) = 0.0298

  hMAP = ¬leukemia

• What is P(leukemia | +)? By Bayes rule, P(h | D) = P(D | h) P(h) / P(D), so

  P(leukemia | +) = 0.0078 / (0.0078 + 0.0298) = 0.21
  P(¬leukemia | +) = 0.0298 / (0.0078 + 0.0298) = 0.79

These are called the "posterior" probabilities.
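A small sketch (assumed, not from the slides; the dictionary names are illustrative) that applies Bayes rule to the leukemia example and recovers the numbers above:

```python
# Bayes rule on the leukemia example: posteriors and the MAP hypothesis.
prior = {"leukemia": 0.008, "no_leukemia": 0.992}
likelihood_pos = {"leukemia": 0.98, "no_leukemia": 0.03}   # P(+ | h)

# Unnormalized posteriors P(+ | h) P(h)
unnorm = {h: likelihood_pos[h] * prior[h] for h in prior}

# Marginal likelihood P(+) = sum_h P(+ | h) P(h)
p_pos = sum(unnorm.values())

# Posteriors P(h | +)
posterior = {h: unnorm[h] / p_pos for h in unnorm}
print(posterior)   # ~{'leukemia': 0.21, 'no_leukemia': 0.79}

# hMAP: hypothesis with the largest unnormalized posterior
print(max(unnorm, key=unnorm.get))   # 'no_leukemia'
```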
Bayesianism vs. Frequentism
• Classical probability: frequentism
  – The probability of a particular event is defined relative to its frequency in a sample space of events.
  – E.g., the probability of "the coin will come up heads on the next trial" is defined relative to the frequency of heads in a sample space of coin tosses.
• Bayesian probability:
  – Combine a measure of "prior" belief you have in a proposition with your subsequent observations of events.
• Example: a Bayesian can assign a probability to the statement "There was life on Mars a billion years ago," but a frequentist cannot.

Independence and Conditional Independence
• Recall that two random variables, X and Y, are independent if

  P(X, Y) = P(X) P(Y)

• Two random variables, X and Y, are independent given C if

  P(X, Y | C) = P(X | C) P(Y | C)

Naive Bayes Classifier
Let f(x) be a target function for classification: f(x) ∈ {+1, −1}.
Let x = (x1, x2, ..., xn).
We want to find the most probable class value given the data x:

  classMAP = argmax_{class ∈ {+1,−1}} P(class | x1, x2, ..., xn)

By Bayes theorem:

  classMAP = argmax_{class ∈ {+1,−1}} P(x1, x2, ..., xn | class) P(class) / P(x1, x2, ..., xn)
           = argmax_{class ∈ {+1,−1}} P(x1, x2, ..., xn | class) P(class)

P(class) can be estimated from the training data. How?
However, in general it is not practical to use the training data to estimate P(x1, x2, ..., xn | class). Why not?

• Naive Bayes classifier: assume

  P(x1, x2, ..., xn | class) = P(x1 | class) P(x2 | class) ... P(xn | class)

  Is this a good assumption?

Given this assumption, here is how to classify an instance x = (x1, x2, ..., xn) with the naive Bayes classifier:

  classNB(x) = argmax_{class ∈ {+1,−1}} P(class) ∏_i P(xi | class)

To train: estimate the values of these various probabilities over the training set.

Training data:

  Day  Outlook   Temp  Humidity  Wind    PlayTennis
  D1   Sunny     Hot   High      Weak    No
  D2   Sunny     Hot   High      Strong  No
  D3   Overcast  Hot   High      Weak    Yes
  D4   Rain      Mild  High      Weak    Yes
  D5   Rain      Cool  Normal    Weak    Yes
  D6   Rain      Cool  Normal    Strong  No
  D7   Overcast  Cool  Normal    Strong  Yes
  D8   Sunny     Mild  High      Weak    No
  D9   Sunny     Cool  Normal    Weak    Yes
  D10  Rain      Mild  Normal    Weak    Yes
  D11  Sunny     Mild  Normal    Strong  Yes
  D12  Overcast  Mild  High      Strong  Yes
  D13  Overcast  Hot   Normal    Weak    Yes
  D14  Rain      Mild  High      Strong  No
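The "estimate by counting" step can be sketched as follows (not from the slides; the data literal is the PlayTennis table above, and the function names are illustrative):

```python
# Estimate P(class) and P(x_i = a_k | class) by counting over the training set.
from collections import Counter, defaultdict

data = [  # (Outlook, Temp, Humidity, Wind, PlayTennis), rows D1-D14 above
    ("Sunny", "Hot", "High", "Weak", "No"),      ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),  ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),   ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),  ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
attrs = ["Outlook", "Temp", "Humidity", "Wind"]

class_counts = Counter(row[-1] for row in data)        # n_c
cond_counts = defaultdict(Counter)                     # n_{c, x_i = a_k}
for *features, label in data:
    for attr, value in zip(attrs, features):
        cond_counts[(label, attr)][value] += 1

def p_class(c):
    return class_counts[c] / len(data)                 # P(class)

def p_attr_given_class(attr, value, c):
    return cond_counts[(c, attr)][value] / class_counts[c]   # P(x_i = a_k | class)

print(p_attr_given_class("Outlook", "Sunny", "Yes"))   # 2/9
print(p_attr_given_class("Humidity", "High", "No"))    # 4/5
```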
Test data:

  Day  Outlook  Temp  Humidity  Wind    PlayTennis
  D15  Sunny    Cool  High      Strong  ?

Use the training data to compute a probabilistic model:

  P(Outlook = Sunny | Yes) = 2/9         P(Outlook = Sunny | No) = 3/5
  P(Outlook = Overcast | Yes) = 4/9      P(Outlook = Overcast | No) = 0
  P(Outlook = Rain | Yes) = 3/9          P(Outlook = Rain | No) = 2/5
  P(Temperature = Hot | Yes) = 2/9       P(Temperature = Hot | No) = 2/5
  P(Temperature = Mild | Yes) = 4/9      P(Temperature = Mild | No) = 2/5
  P(Temperature = Cool | Yes) = 3/9      P(Temperature = Cool | No) = 1/5
  P(Humidity = High | Yes) = 3/9         P(Humidity = High | No) = 4/5
  P(Humidity = Normal | Yes) = 6/9       P(Humidity = Normal | No) = 1/5
  P(Wind = Strong | Yes) = 3/9           P(Wind = Strong | No) = 3/5
  P(Wind = Weak | Yes) = 6/9             P(Wind = Weak | No) = 2/5

Then classify D15 with

  classNB(x) = argmax_{class} P(class) ∏_i P(xi | class)

Estimating probabilities / Smoothing
• Recap: in the previous example, we had a training set and a new example (Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong).
• We asked: what classification is given by a naive Bayes classifier?
• Let nc be the number of training instances with class c, and n_{c, xi=ak} be the number of training instances with attribute value xi = ak and class c. Then:

  P(xi = ak | c) = n_{c, xi=ak} / nc

• Problem with this method: if nc is very small, it gives a poor estimate.
• E.g., P(Outlook = Overcast | No) = 0.
• Now suppose we want to classify a new instance (Outlook = Overcast, Temperature = Cool, Humidity = High, Wind = Strong). Then:

  P(No) ∏_i P(xi | No) = 0

  This incorrectly gives us zero probability due to the small sample.

One solution: Laplace smoothing (also called "add-one" smoothing)
For each class c and attribute xi with value ak, add one "virtual" instance. That is, for each class c, recalculate:

  P(xi = ak | c) = (n_{c, xi=ak} + 1) / (nc + K)

where K is the number of possible values of attribute xi.
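As a sketch, the counting code shown earlier can be extended with add-one smoothing (this block assumes class_counts, cond_counts, and attrs from that earlier sketch are in scope; the domain sizes are read off the training data):

```python
# Laplace ("add-one") smoothing on top of the raw counts.
domain_sizes = {"Outlook": 3, "Temp": 3, "Humidity": 2, "Wind": 2}   # K per attribute

def p_attr_given_class_smoothed(attr, value, c):
    n_c = class_counts[c]
    n_cv = cond_counts[(c, attr)][value]
    return (n_cv + 1) / (n_c + domain_sizes[attr])    # (n_{c,xi=ak} + 1) / (nc + K)

print(p_attr_given_class_smoothed("Outlook", "Overcast", "No"))    # (0+1)/(5+3) = 1/8
print(p_attr_given_class_smoothed("Outlook", "Overcast", "Yes"))   # (4+1)/(9+3) = 5/12
```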
Training data: the PlayTennis table from above (D1–D14).

Laplace smoothing: add the following virtual instances for Outlook:
  Outlook = Sunny: Yes      Outlook = Sunny: No
  Outlook = Overcast: Yes   Outlook = Overcast: No
  Outlook = Rain: Yes       Outlook = Rain: No

  P(Outlook = Overcast | No) = 0/5  →  (n_{c,xi=ak} + 1) / (nc + K) = (0 + 1) / (5 + 3) = 1/8
  P(Outlook = Overcast | Yes) = 4/9  →  (4 + 1) / (9 + 3) = 5/12

  P(Outlook = Sunny | Yes) = 2/9 → 3/12       P(Outlook = Sunny | No) = 3/5 → 4/8
  P(Outlook = Overcast | Yes) = 4/9 → 5/12    P(Outlook = Overcast | No) = 0/5 → 1/8
  P(Outlook = Rain | Yes) = 3/9 → 4/12        P(Outlook = Rain | No) = 2/5 → 3/8
  Etc.

In-class exercise
1. Recall the naive Bayes classifier:

  classNB(x) = argmax_{class ∈ {+1,−1}} P(class) ∏_i P(xi | class)

Consider this training set, in which each instance has four binary features and a binary class:

  Instance  x1  x2  x3  x4  Class
  x1        1   1   1   1   POS
  x2        1   1   0   1   POS
  x3        0   1   1   1   POS
  x4        1   0   0   1   POS
  x5        1   0   0   0   NEG
  x6        1   0   1   0   NEG
  x7        0   1   0   1   NEG

(a) Create a probabilistic model that you could use to classify new instances. That is, calculate P(class) and P(xi | class) for each class. No smoothing is needed (yet).
(b) Use your probabilistic model to determine classNB for the following new instance:

  Instance  x1  x2  x3  x4  Class
  x8        0   0   0   1   ?

2. Recall the formula for Laplace smoothing:

  P(xi = ak | c) = (n_{c, xi=ak} + 1) / (nc + K)

where K is the number of possible values of attribute xi.
(a) Apply Laplace smoothing to all the probabilities from the training set in question 1.
(b) Use the smoothed probabilities to determine classNB for the following new instances:

  Instance  x1  x2  x3  x4  Class
  x10       0   1   0   0   ?
  x11       0   0   0   0   ?

Naive Bayes on continuous-valued attributes
• How do we deal with continuous-valued attributes? Two possible solutions:
  – Discretize.
  – Assume a particular probability distribution of classes over values (estimate its parameters from training data).

Discretization: Equal-Width Binning
For each attribute xi, create k equal-width bins in the interval from min(xi) to max(xi). The discrete "attribute values" are now the bins.
Questions: What should k be? What if some bins have very few instances?
There is a trade-off between discretization bias and variance: the more bins, the lower the bias, but the higher the variance, due to small sample sizes within bins.

Discretization: Equal-Frequency Binning
For each attribute xi, create k bins so that each bin contains an equal number of values.
This also has problems: What should k be? It hides outliers, and it can group together instances that are far apart.

Gaussian Naive Bayes
Assume that within each class c, the values of each numeric feature xi are normally distributed:

  P(xi | c) = N(xi; μi,c, σi,c)

where μi,c is the mean of feature i given the class c, and σi,c is the standard deviation of feature i given the class c. We estimate μi,c and σi,c from the training data.
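A brief sketch of this estimation step (assumed, not from the slides; it uses the population standard deviation, which matches the numbers in the example that follows):

```python
# Gaussian naive Bayes: fit a mean and standard deviation per feature per class,
# then use the normal density in place of P(x_i | class).
import math

def fit_gaussian(values):
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return mu, sigma

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# e.g., the x1 values of the POS class in the example below:
mu, sigma = fit_gaussian([3.0, 4.1, 7.2])
print(mu, sigma)                   # ~4.8, ~1.8
print(normal_pdf(5.2, mu, sigma))  # ~0.22, the value used for P(x1 | POS) below
```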
Example

  x1   x2   Class
  3.0  5.1  POS
  4.1  6.3  POS
  7.2  9.8  POS
  2.0  1.1  NEG
  4.1  2.0  NEG
  8.1  9.4  NEG

  P(POS) = 0.5, P(NEG) = 0.5

Estimated class-conditional densities (see http://homepage.stat.uiowa.edu/~mbognar/applets/normal.html):

  N1,POS = N(x; 4.8, 1.8)    N2,POS = N(x; 7.1, 2.0)
  N1,NEG = N(x; 4.7, 2.5)    N2,NEG = N(x; 4.2, 3.7)

Now suppose you have a new example x, with x1 = 5.2, x2 = 6.3. What is classNB(x)?

  classNB(x) = argmax_{class ∈ {+1,−1}} P(class) ∏_i P(xi | class)

Note: N is a probability density function, but it can be used analogously to a probability in naive Bayes calculations.

  Positive: P(POS) P(x1 | POS) P(x2 | POS) = (.5)(.22)(.18) = .02
  Negative: P(NEG) P(x1 | NEG) P(x2 | NEG) = (.5)(.16)(.09) = .0072

  classNB(x) = POS

Use logarithms to avoid underflow

  classNB(x) = argmax_{class ∈ {+1,−1}} P(class) ∏_i P(xi | class)
             = argmax_{class ∈ {+1,−1}} log( P(class) ∏_i P(xi | class) )
             = argmax_{class ∈ {+1,−1}} ( log P(class) + Σ_i log P(xi | class) )

Bayes Nets

Another example
A patient comes into a doctor's office with a bad cough and a high fever.
Hypothesis space H:                  Data D:
  h1: patient has flu                  coughing = true, fever = true
  h2: patient does not have flu
Prior probabilities:                 Likelihoods:
  p(h1) = .1                           p(D | h1) = .8
  p(h2) = .9                           p(D | h2) = .4
Posterior probabilities:
  P(h1 | D) = ?     P(h2 | D) = ?
Probability of the data:
  P(D) = ?

• Let's say we have the following random variables: cough, fever, flu, smokes.

Full joint probability distribution

  smokes:
            cough,fever   cough,¬fever   ¬cough,fever   ¬cough,¬fever
    flu         p1            p2             p3              p4
    ¬flu        p5            p6             p7              p8
  ¬smokes:
    flu         p9            p10            p11             p12
    ¬flu        p13           p14            p15             p16

The sum of all sixteen entries is 1.
In principle, the full joint distribution can be used to answer any question about probabilities of these combined parameters. However, the size of the full joint distribution scales exponentially with the number of parameters, so it is expensive to store and to compute with.

Bayesian networks
• The idea is to represent dependencies (or causal relations) for all the variables so that space and computation-time requirements are minimized.
• (Network structure, also called a "graphical model": flu → cough, flu → fever, smokes → cough.)

Conditional probability tables for each node:

  P(flu):    true 0.01,  false 0.99
  P(smoke):  true 0.2,   false 0.8

  P(cough | flu, smoke):
    flu    smoke    true   false
    true   true     0.95   0.05
    true   false    0.8    0.2
    false  true     0.6    0.4
    false  false    0.05   0.95

  P(fever | flu):
    flu    true   false
    true   0.9    0.1
    false  0.2    0.8

Semantics of Bayesian networks
• If the network is correct, we can calculate the full joint probability distribution from the network:

  P((X1 = x1) ∧ (X2 = x2) ∧ ... ∧ (Xn = xn)) = ∏_{i=1}^{n} P(Xi = xi | parents(Xi))

  where parents(Xi) denotes specific values of the parents of Xi.

Example
• Calculate P[(cough = t) ∧ (fever = f) ∧ (flu = f) ∧ (smoke = f)].

Different types of inference in Bayesian networks

Causal inference: the evidence is a cause; we infer the probability of an effect.
Example: instantiate the evidence flu = true. What is P(fever | flu)?

  P(fever | flu) = .9   (up from the prior P(fever) = .207)

Diagnostic inference: the evidence is an effect; we infer the probability of a cause.
Example: instantiate the evidence fever = true. What is P(flu | fever)?

  P(flu | fever) = P(fever | flu) P(flu) / P(fever) = (.9)(.01) / .207 = .043   (up from .01)
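For a network this small, both the full-joint example and the inference queries above can be checked by brute force. The following sketch (assumed, not from the slides; function and variable names are illustrative) builds the joint from the CPTs given above and sums it out:

```python
# Brute-force inference on the flu/smoke/cough/fever network.
from itertools import product

p_flu = {True: 0.01, False: 0.99}
p_smoke = {True: 0.2, False: 0.8}
p_cough_true = {(True, True): 0.95, (True, False): 0.8,     # P(cough=T | flu, smoke)
                (False, True): 0.6, (False, False): 0.05}
p_fever_true = {True: 0.9, False: 0.2}                      # P(fever=T | flu)

def joint(flu, smoke, cough, fever):
    # P(flu, smoke, cough, fever) as a product of CPT entries
    pc = p_cough_true[(flu, smoke)] if cough else 1 - p_cough_true[(flu, smoke)]
    pf = p_fever_true[flu] if fever else 1 - p_fever_true[flu]
    return p_flu[flu] * p_smoke[smoke] * pc * pf

def prob(query, evidence=None):
    # P(query | evidence), by summing the full joint distribution
    evidence = evidence or {}
    num = den = 0.0
    for flu, smoke, cough, fever in product([True, False], repeat=4):
        world = {"flu": flu, "smoke": smoke, "cough": cough, "fever": fever}
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(flu, smoke, cough, fever)
            den += p
            if all(world[k] == v for k, v in query.items()):
                num += p
    return num / den

print(joint(flu=False, smoke=False, cough=True, fever=False))   # the example above
print(prob({"fever": True}))                   # P(fever) = 0.207
print(prob({"flu": True}, {"fever": True}))    # P(flu | fever) ≈ 0.043
```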
Example: What is P(flu | cough)?

  P(flu | cough) = P(cough | flu) P(flu) / P(cough)
                 = [P(cough | flu, smoke) P(smoke) + P(cough | flu, ¬smoke) P(¬smoke)] P(flu) / P(cough)
                 = [(.95)(.2) + (.8)(.8)](.01) / .167
                 = .0497

Inter-causal inference: explaining away different possible causes of an effect.
Example: What is P(flu | cough, smoke)?

  P(flu | cough, smoke) = P(flu ∧ cough ∧ smoke) / P(cough ∧ smoke)
    = P(cough | flu, smoke) P(flu) P(smoke)
      / [P(cough | flu, smoke) P(flu) P(smoke) + P(cough | ¬flu, smoke) P(¬flu) P(smoke)]
    = (.95)(.01)(.2) / [(.95)(.01)(.2) + (.6)(.2)(.99)]
    = 0.016

Why is P(flu | cough, smoke) < P(flu | cough)?

Complexity of Bayesian Networks
For n random Boolean variables:
• Full joint probability distribution: 2^n entries
• Bayesian network with at most k parents per node:
  – Each conditional probability table: at most 2^k entries
  – Entire network: n·2^k entries

What are the advantages of Bayesian networks?
• An intuitive, concise representation of the joint probability distribution (i.e., the conditional dependencies) of a set of random variables.
• Represents "beliefs and knowledge" about a particular class of situations.
• Efficient (?) (approximate) inference algorithms.
• Efficient, effective learning algorithms.

Issues in Bayesian Networks
• Building / learning network topology
• Assigning / learning conditional probability tables
• Approximate inference via sampling

Real-World Example: The Lumière Project at Microsoft Research
• A Bayesian network approach to answering user queries about Microsoft Office.
• "At the time we initiated our project in Bayesian information retrieval, managers in the Office division were finding that users were having difficulty finding assistance efficiently."
• "As an example, users working with the Excel spreadsheet might have required assistance with formatting 'a graph'. Unfortunately, Excel has no knowledge about the common term, 'graph,' and only considered in its keyword indexing the term 'chart'."
• Networks were developed by experts from user modeling studies.
• An offspring of the project was the Office Assistant in Office 97.

Naive Bayes as a Bayesian network
A network with class variable flu and children smoke, fever, cough, nausea, headache factors as:

  P(flu ∧ smoke ∧ fever ∧ cough ∧ nausea ∧ headache)
    = P(flu) P(smoke | flu) P(fever | flu) P(cough | flu) P(nausea | flu) P(headache | flu)

More generally, for classification:

  P(C = cj | X1 = x1, ..., Xn = xn) ∝ P(C = cj) ∏_i P(Xi = xi | C = cj)

This is "naive Bayes".

Learning network topology
• Many different approaches, including:
  – Heuristic search, with evaluation based on information-theoretic measures
  – Genetic algorithms
  – Using "meta" Bayesian networks!

Learning conditional probabilities
• In general, random variables are not binary, but real-valued.
• Conditional probability tables become conditional probability distributions.
• Estimate the parameters of these distributions from data.

Approximate inference via sampling
• Recall: we can calculate the full joint probability distribution from the network:

  P(X1, ..., Xd) = ∏_{i=1}^{d} P(Xi | parents(Xi))

  where parents(Xi) denotes specific values of the parents of Xi.
• We can do diagnostic, causal, and inter-causal inference.
• But if there are a lot of nodes in the network, this can be very slow!
• We need efficient algorithms to do approximate calculations; one simple sampling-based estimate is sketched below.
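As a first illustration of sampling-based approximate inference (a sketch, not from the slides), the query P(flu | cough) on the same network can be estimated by forward sampling and rejecting samples that contradict the evidence:

```python
# Rejection-sampling estimate of P(flu | cough) on the flu/smoke/cough/fever network.
import random

p_flu, p_smoke = 0.01, 0.2
p_cough_true = {(True, True): 0.95, (True, False): 0.8,     # P(cough=T | flu, smoke)
                (False, True): 0.6, (False, False): 0.05}
p_fever_true = {True: 0.9, False: 0.2}                      # P(fever=T | flu)

def sample_world(rng):
    # Forward-sample one full assignment, parents before children.
    flu = rng.random() < p_flu
    smoke = rng.random() < p_smoke
    cough = rng.random() < p_cough_true[(flu, smoke)]
    fever = rng.random() < p_fever_true[flu]
    return flu, smoke, cough, fever

def estimate_p_flu_given_cough(n=200_000, seed=0):
    rng = random.Random(seed)
    kept = hits = 0
    for _ in range(n):
        flu, smoke, cough, fever = sample_world(rng)
        if cough:               # reject samples inconsistent with the evidence
            kept += 1
            hits += flu
    return hits / kept

print(estimate_p_flu_given_cough())   # should be close to the exact 0.0497 above
```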
A Précis of Sampling Algorithms

Gibbs Sampling
• Suppose that we want to sample from p(x | θ), x ∈ ℝ^D.
• Basic idea: sample sequentially from the "full conditionals".
• Initialize x_i^(0) for i = 1, ..., D.
• For t = 1, ..., T:

    sample x_1^(t+1) ~ p(x_1 | x_(−1)^(t))
    sample x_2^(t+1) ~ p(x_2 | x_(−2)^(t))
    ...
    sample x_D^(t+1) ~ p(x_D | x_(−D)^(t))

  where x_(−i) denotes all components of x except x_i.
• Complications: (i) how to "order" the samples (this can be random, but be careful); (ii) we need the full conditionals (an approximation can be used).
• Under "nice" assumptions (ergodicity), we get samples x ~ p(x | θ).

Sampling Algorithms More Generally
• How do we perform posterior inference more generally?
• Oftentimes, we rely on strong parametric assumptions (e.g., conjugacy, exponential family structures).
• Monte Carlo approximation/inference can get around this.
• Basic idea: (1) draw samples x_s ~ p(x | θ); (2) compute the quantity of interest, e.g. (marginals):

    E_p[x_1 | D] ≈ (1/S) Σ_{s=1}^{S} x_{1,s},  etc.

  In general:

    E[f(x)] = ∫ f(x) p(x | θ) dx ≈ (1/S) Σ_{s=1}^{S} f(x_s)

CDF Method (MC technique)
Steps: (1) sample u ~ U(0, 1); (2) then F^(−1)(u) is distributed according to F.

Rejection Sampling (MC technique)
Draw x from a proposal q(x) with c·q(x) ≥ p(x), and accept x with probability p(x) / (c·q(x)). One can show that the accepted samples satisfy x ~ p(x).
Issues: we need a "good" q(x) and c, and the rejection rate can grow astronomically!
Pros of MC sampling: samples are independent. Cons: very inefficient in high dimensions. Alternatively, one can use MCMC methods.

Markov Chain Monte Carlo Sampling
• One of the most common methods used in real applications.
• Recall that, by construction of a Bayesian network, a node is conditionally independent of its non-descendants, given its parents.
• Also recall that a node can be conditionally dependent on its children and on the other parents of its children. (Why?)
• Definition: the Markov blanket of a variable Xi is Xi's parents, children, and children's other parents.

Example
• In the flu / smokes / cough / fever network above, what is the Markov blanket of cough? Of flu?
• Theorem: a node Xi is conditionally independent of all other nodes in the network, given its Markov blanket.

Markov Chain Monte Carlo (MCMC) Sampling
• Start with a random sample of the variables: (x1, ..., xn). This is the current "state" of the algorithm.
• Next state: randomly sample a value for one non-evidence variable Xi, conditioned on the current values in the Markov blanket of Xi.

Example
• Query: What is P(cough | smoke)?
• MCMC:
  – Take a random sample, with the evidence variable fixed:
      flu = true, smoke = true, fever = false, cough = true
  – Repeat:
    1. Sample flu probabilistically, given the current values of its Markov blanket: smoke = true, fever = false, cough = true. Suppose the result is false. New state:
        flu = false, smoke = true, fever = false, cough = true
    2. Sample cough, given the current values of its Markov blanket: smoke = true, flu = false. Suppose the result is true. New state:
        flu = false, smoke = true, fever = false, cough = true
    3. Sample fever, given the current values of its Markov blanket: flu = false. Suppose the result is true. New state:
        flu = false, smoke = true, fever = true, cough = true
• Each sample contributes to the estimate for the query P(cough | smoke).
• Suppose we perform 100 such samples, 20 with cough = true and 80 with cough = false.
• Then the answer to the query is P(cough | smoke) = .20.
• Theorem: MCMC settles into behavior in which each state is sampled exactly according to its posterior probability, given the evidence.
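A sketch of this MCMC procedure for P(cough | smoke) on the same network is below (assumed, not from the slides; burn-in is ignored, and for such a tiny network each variable's conditional is computed from the full joint, which is equivalent to conditioning on its Markov blanket):

```python
# Gibbs-style MCMC estimate of P(cough | smoke) with smoke fixed as evidence.
import random

p_flu = {True: 0.01, False: 0.99}
p_smoke = {True: 0.2, False: 0.8}
p_cough_true = {(True, True): 0.95, (True, False): 0.8,     # P(cough=T | flu, smoke)
                (False, True): 0.6, (False, False): 0.05}
p_fever_true = {True: 0.9, False: 0.2}                      # P(fever=T | flu)

def joint(w):
    pc = p_cough_true[(w["flu"], w["smoke"])]
    pf = p_fever_true[w["flu"]]
    return (p_flu[w["flu"]] * p_smoke[w["smoke"]]
            * (pc if w["cough"] else 1 - pc)
            * (pf if w["fever"] else 1 - pf))

def mcmc_p_cough_given_smoke(steps=100_000, seed=0):
    rng = random.Random(seed)
    state = {"smoke": True,                        # evidence: stays fixed
             "flu": rng.random() < 0.5,
             "cough": rng.random() < 0.5,
             "fever": rng.random() < 0.5}
    cough_true = 0
    for _ in range(steps):
        var = rng.choice(["flu", "cough", "fever"])          # a non-evidence variable
        p = {val: joint({**state, var: val}) for val in (True, False)}
        state[var] = rng.random() < p[True] / (p[True] + p[False])
        cough_true += state["cough"]
    return cough_true / steps

print(mcmc_p_cough_given_smoke())   # sampling-based estimate of P(cough | smoke)
```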
Applying Bayesian Reasoning to Speech Recognition
• Task: identify the sequence of words uttered by a speaker, given the acoustic signal.
• Uncertainty is introduced by noise, speaker error, variation in pronunciation, homonyms, etc.
• Thus speech recognition is viewed as a problem of probabilistic inference.
• So far, we've looked at probabilistic reasoning in static environments.
• Speech: a time sequence of "static environments".
  – Let X be the "state variables" (i.e., the set of non-evidence variables) describing the environment (e.g., the words said during time step t).
  – Let E be the set of evidence variables (e.g., S = features of the acoustic signal).
  – The E values and the X joint probability distribution change over time:
      t1: X1, e1;  t2: X2, e2;  etc.
• At each t, we want to compute P(Words | S).
• We know from Bayes rule:

  P(Words | S) ∝ P(S | Words) P(Words)

• P(S | Words), for all words, is a previously learned "acoustic model".
  – E.g., for each word, a probability distribution over phones, and for each phone, a probability distribution over acoustic signals (which can vary in pitch, speed, and volume).
• P(Words), for all words, is the "language model", which specifies the prior probability of each utterance.
  – E.g., a "bigram model": the probability of each word following each other word.
• Speech recognition typically makes three assumptions:
  1. The process underlying the change is itself "stationary", i.e., the state transition probabilities don't change.
  2. The current state X depends on only a finite history of previous states (the "Markov assumption").
     – A Markov process of order n: the current state depends only on the n previous states.
  3. The values et of the evidence variables depend only on the current state Xt (the "sensor model").

Hidden Markov Models
• Markov model: given state Xt, what is the probability of transitioning to the next state Xt+1?
• E.g., word bigram probabilities give P(wordt+1 | wordt).
• Hidden Markov model: there are observable states (e.g., the signal S) and "hidden" states (e.g., Words). The HMM represents the probabilities of hidden states given observable states.
• Example: "I'm firsty, um, can I have something to dwink?"

Graphical Models and Computer Vision