CS 416 Artificial Intelligence
Lecture 16: Uncertainty (Chapter 14)

Conditional probability
• P(a | b): the probability of a, given that all we know is b
• Written in terms of unconditional probabilities: P(a | b) = P(a ∧ b) / P(b)

Conditioning
• A distribution over Y can be obtained by summing out all the other variables from any joint distribution containing Y
• P(X | e) = α Σ_y P(X, e, y), where X is the query (event) variable, e is the evidence, and y ranges over all the other variables
• We need the full joint distribution to compute this sum

Bayes Network
• A Bayes network captures the full joint distribution: P(x1, …, xn) = Π_i P(xi | parents(Xi))

Example: P(B | john calls, mary calls)
• For comparison, compute the query the old way, by summing over the full joint distribution (the hidden variables are Earthquake, e, and Alarm, a):
  P(B | j, m) = α Σ_e Σ_a P(B) P(e) P(a | B, e) P(j | a) P(m | a)
• To expedite, move terms that don't depend on the summation variables outside the summations:
  P(B | j, m) = α P(B) Σ_e P(e) Σ_a P(a | B, e) P(j | a) P(m | a)
• Evaluating this expression requires a depth-first tree traversal

Complexity of Bayes Net
• A Bayes net reduces space complexity
• A Bayes net does not reduce time complexity in the general case

Time complexity
• Note the repeated subexpressions in the enumeration tree
• Dynamic programming: compute each subexpression once and reuse it
• Works well for polytrees (networks in which there is at most one path between any two nodes)
• Doesn't work for multiply connected networks

Clustering
• Try to convert multiply connected networks to polytrees

Approximate Inference
• It's expensive to work with the full joint distribution, whether as a table or as a Bayes network
• Is an approximation good enough?

Monte Carlo
• Use samples to approximate the solution
• Simulated annealing used Monte Carlo theory to justify why random guesses, and sometimes going uphill, can lead to optimality
• More samples = better approximation
  – How many are needed?
  – Where should you take the samples?

Prior sampling
• The ability to sample from the prior probabilities of a set of random variables
• Approximates the true distribution: with enough samples, the sample frequencies converge to the true distribution

Rejection sampling
• Compute P(X | e)
• Use PriorSample (sampling distribution S_PS) to create N samples
• Inspect each sample for the truth of e
• Keep a count of how many samples are consistent with e
• P(X | e) is estimated by normalizing the counts of the values of X among the consistent samples

Example
• P(Rain | Sprinkler = true)
• Use the Bayes net to generate 100 samples
  – Suppose 73 have Sprinkler = false
  – Suppose 27 have Sprinkler = true: 8 have Rain = true, 19 have Rain = false
• P(Rain | Sprinkler = true) ≈ Normalize(<8, 19>) ≈ <0.3, 0.7>

Problems with rejection sampling
• The standard deviation of the error in the probability is proportional to 1/sqrt(n), where n is the number of samples consistent with the evidence
• As problems become complex, the number of samples consistent with the evidence becomes small, and it becomes harder to construct accurate estimates

Likelihood weighting
• We only want to generate samples that are consistent with the evidence e
• We still sample the Bayes net, but we won't let every random variable be sampled; the evidence variables are forced to produce their observed values

Example: P(Rain | Sprinkler = true, WetGrass = true)
• First, the weight w is set to 1.0
• Notice that the weight is reduced according to how likely each evidence variable's observed value is, given its parents
• So the final probability is a function of what comes from sampling the free variables while constraining the evidence variables (see the sketch below)
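None of these sampling procedures appears as code in the slides, so here is a minimal Python sketch, for illustration only, of prior sampling, rejection sampling, and likelihood weighting on the Cloudy/Sprinkler/Rain/WetGrass network used in the examples above. The CPT numbers and the helper names (p_true, prior_sample, rejection_sampling, likelihood_weighting) are assumptions based on the standard textbook version of this network, not material from the lecture.

```python
import random

# Assumed textbook network: Cloudy -> {Sprinkler, Rain} -> WetGrass.
VARS = ("Cloudy", "Sprinkler", "Rain", "WetGrass")

def p_true(var, s):
    """P(var = True | parents), reading parent values from the partial sample s.
    CPT numbers are the standard textbook values (an assumption)."""
    if var == "Cloudy":
        return 0.5
    if var == "Sprinkler":
        return 0.1 if s["Cloudy"] else 0.5
    if var == "Rain":
        return 0.8 if s["Cloudy"] else 0.2
    # WetGrass depends on Sprinkler and Rain
    if s["Sprinkler"] and s["Rain"]:
        return 0.99
    if s["Sprinkler"] or s["Rain"]:
        return 0.90
    return 0.0

def prior_sample():
    """Sample every variable in topological order from P(var | parents)."""
    s = {}
    for var in VARS:
        s[var] = random.random() < p_true(var, s)
    return s

def rejection_sampling(query, evidence, n=10000):
    """Keep only the prior samples that agree with the evidence, then
    normalize the counts of the query variable among the survivors."""
    counts = {True: 0, False: 0}
    for _ in range(n):
        s = prior_sample()
        if all(s[var] == val for var, val in evidence.items()):
            counts[s[query]] += 1
    total = counts[True] + counts[False]
    return {v: c / total for v, c in counts.items()}

def likelihood_weighting(query, evidence, n=10000):
    """Force evidence variables to their observed values and weight each
    sample by P(observed value | parents); no samples are discarded."""
    totals = {True: 0.0, False: 0.0}
    for _ in range(n):
        s, w = {}, 1.0
        for var in VARS:
            p = p_true(var, s)
            if var in evidence:
                s[var] = evidence[var]            # forced, not sampled
                w *= p if evidence[var] else 1.0 - p
            else:
                s[var] = random.random() < p      # sampled freely
        totals[s[query]] += w
    z = totals[True] + totals[False]
    return {v: wt / z for v, wt in totals.items()}

# The two queries from the slides:
print(rejection_sampling("Rain", {"Sprinkler": True}))
print(likelihood_weighting("Rain", {"Sprinkler": True, "WetGrass": True}))
```

With a few thousand samples the rejection-sampling call should land near the <0.3, 0.7> answer worked out above, while the likelihood-weighting call answers the second query without throwing any samples away.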
Comparing techniques
• In likelihood weighting, attention is paid to the evidence variables before the samples are collected
• In rejection sampling, the evidence variables are considered only after the sampling
• Likelihood weighting doesn't sample from the true posterior distribution P(z | e), because each sampled variable ignores evidence among its non-ancestors

Likelihood weighting
• Uses all of the samples
• As the number of evidence variables increases, it becomes harder to keep the weights high, and estimate quality drops

Markov chain Monte Carlo (MCMC)
• Imagine being in a current state
  – An assignment to all of the random variables
• The next state is selected by randomly resampling one of the nonevidence variables, Xi
  – Conditioned on the current values of the other variables in the current state
• MCMC wanders around the state space, flipping one variable at a time while keeping the evidence variables fixed (see the sketch at the end)

Cool method of operation
• The sampling process settles into a "dynamic equilibrium" in which the long-run fraction of time spent in each state is exactly proportional to its posterior probability
• Let q(x → x') = the probability of transitioning from state x to state x'
• A Markov chain is a sequence of state transitions generated according to q(·)
• p_t(x) measures the probability of being in state x after t steps

Markov chains
• p_{t+1}(x') = Σ_x p_t(x) q(x → x') is the probability of being in state x' after t + 1 steps
• If p_t = p_{t+1}, we have reached a stationary distribution
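The flip-one-variable scheme described above corresponds to Gibbs sampling, and the sketch below illustrates it on the same assumed sprinkler network as the earlier sampling sketch; the joint helper, the CPT numbers, and the query are again illustrative assumptions rather than material from the slides.

```python
import random

def joint(s):
    """Full joint probability of a complete assignment s (a dict of booleans),
    using the assumed textbook sprinkler-network CPTs."""
    def pr(p, value):                     # P(var = value) when P(var = True) = p
        return p if value else 1.0 - p
    p_wet = {(True, True): 0.99, (True, False): 0.90,
             (False, True): 0.90, (False, False): 0.0}
    return (pr(0.5, s["Cloudy"])
            * pr(0.1 if s["Cloudy"] else 0.5, s["Sprinkler"])
            * pr(0.8 if s["Cloudy"] else 0.2, s["Rain"])
            * pr(p_wet[s["Sprinkler"], s["Rain"]], s["WetGrass"]))

def gibbs(query, evidence, n=50000):
    """Wander the state space: repeatedly pick one nonevidence variable and
    resample it conditioned on the current values of all the other variables,
    keeping the evidence variables fixed; count the visited states."""
    non_evidence = [v for v in ("Cloudy", "Sprinkler", "Rain", "WetGrass")
                    if v not in evidence]
    state = dict(evidence)
    for v in non_evidence:                       # random initial state
        state[v] = random.random() < 0.5
    counts = {True: 0, False: 0}
    for _ in range(n):
        var = random.choice(non_evidence)
        # P(var | everything else) is proportional to the full joint, so
        # evaluate the joint for var = True and var = False and normalize
        # (pt + pf > 0 for the query below).
        state[var] = True
        pt = joint(state)
        state[var] = False
        pf = joint(state)
        state[var] = random.random() < pt / (pt + pf)
        counts[state[query]] += 1
    total = counts[True] + counts[False]
    return {v: c / total for v, c in counts.items()}

# Same query as the likelihood-weighting example:
print(gibbs("Rain", {"Sprinkler": True, "WetGrass": True}))
```

Because each step resamples a variable from its exact conditional distribution given everything else, the long-run fraction of visited states with Rain = true approaches the posterior P(Rain | Sprinkler = true, WetGrass = true), which is the dynamic equilibrium described above.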