CSC242: Intro to AI, Lecture 15

Administrivia
• Project 3: code available; due Mon Apr 9, 11:59 PM
• ULW 2nd draft due Apr 1
• College Writing Program Contest: $100! Deadline: April 5 (http://writing.rochester.edu)

Approximate Inference in Bayesian Networks

Bayesian Networks
• Nodes are random variables; an edge means "has direct influence on"
• Example: Cavity → Toothache and Cavity → Catch, with P(Cavity), P(Toothache | Cavity), P(Catch | Cavity)
• Toothache and Catch are conditionally independent given their parent, Cavity
• Each node Xi stores P(Xi | Parents(Xi))

The Goal
• Query variable X
• Evidence variables E1, ..., Em
• Observed values: e = <e1, ..., em>
• Non-evidence, non-query ("hidden") variables: Y
• Approximate: P(X | e)

Generating Samples
• Sample each variable in topological order (a child appears after its parents)
• Choose the value for each variable conditioned on the values already chosen for its parents

Rejection Sampling
• Generate samples from the prior distribution specified by the network
• Reject any sample inconsistent with the evidence
• Use the remaining samples to estimate the probability of the event
• Problem: the fraction of samples consistent with the evidence drops exponentially with the number of evidence variables

Likelihood Weighting
• Generate each sample in topological order
• Evidence variable: fix its value to the observed value and multiply the sample's weight by that value's probability in the network
• Non-evidence variable: sample a value using the probabilities in the network (given its parents)
• Reference: Probabilistic Graphical Models: Principles and Techniques, by Daphne Koller and Nir Friedman

[Figure: the sprinkler network. Cloudy → Sprinkler, Cloudy → Rain, {Sprinkler, Rain} → WetGrass.
  P(C) = 0.5
  P(S | C):  C = t: 0.10;  C = f: 0.50
  P(R | C):  C = t: 0.80;  C = f: 0.20
  P(W | S, R):  (t,t): 0.99;  (t,f): 0.90;  (f,t): 0.90;  (f,f): 0.00
  Sample shown: Cloudy = true, Sprinkler = true, Rain = false, WetGrass = true.
  Query: P(Rain | Sprinkler = true, WetGrass = true)]

Markov Chain Monte Carlo Simulation
• To approximate: P(X | e)
• Generate a sequence of states
  • Values of evidence variables are fixed
  • Values of the other variables appear in the right proportion given the distribution encoded by the network
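The likelihood-weighting procedure and the sprinkler network above can be put together in a short sketch for the query P(Rain | Sprinkler = true, WetGrass = true). This is a minimal illustration, not the lecture's code: the dict-based CPTs and the hand-coded topological order are assumptions made here.

```python
import random

random.seed(0)  # for reproducibility of this sketch

# CPTs from the sprinkler network on the slides
P_C = 0.5
P_S_given_C = {True: 0.10, False: 0.50}   # P(Sprinkler = true | Cloudy)
P_R_given_C = {True: 0.80, False: 0.20}   # P(Rain = true | Cloudy)
P_W_given_SR = {(True, True): 0.99, (True, False): 0.90,
                (False, True): 0.90, (False, False): 0.00}

def weighted_sample(evidence):
    """One weighted sample: evidence variables are fixed and contribute to
    the weight; the others are sampled given their (already-set) parents."""
    w = 1.0
    s = {}
    s["Cloudy"] = random.random() < P_C               # no parents
    for var, p in (("Sprinkler", P_S_given_C), ("Rain", P_R_given_C)):
        pt = p[s["Cloudy"]]
        if var in evidence:
            s[var] = evidence[var]
            w *= pt if s[var] else 1 - pt             # update the weight
        else:
            s[var] = random.random() < pt             # sample from network
    pt = P_W_given_SR[(s["Sprinkler"], s["Rain"])]
    if "WetGrass" in evidence:
        s["WetGrass"] = evidence["WetGrass"]
        w *= pt if s["WetGrass"] else 1 - pt
    else:
        s["WetGrass"] = random.random() < pt
    return s, w

def likelihood_weighting(query_var, evidence, n=50_000):
    """Estimate P(query_var = true | evidence) from n weighted samples."""
    num = den = 0.0
    for _ in range(n):
        sample, w = weighted_sample(evidence)
        den += w
        if sample[query_var]:
            num += w
    return num / den

est = likelihood_weighting("Rain", {"Sprinkler": True, "WetGrass": True})
```

Enumerating the four (Cloudy, Rain) cases by hand gives an exact answer of about 0.32, so the weighted estimate should land near that value.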
[Figure: a node X with parents U1, ..., Um, children Y1, ..., Yn, and the children's other parents Z1j, ..., Znj; together these form the Markov blanket of X.]

Markov Blanket
• The Markov blanket of a node consists of:
  • its parents
  • its children
  • its children's other parents
• A node is conditionally independent of all other nodes in the network given its Markov blanket

MCMC Simulation: Gibbs Sampling
• To approximate: P(X | e)
• Start in a state with the evidence variables set to their evidence values (others arbitrary)
• On each step, sample a non-evidence variable conditioned on the values of the variables in its Markov blanket
• Order irrelevant

[Worked example on the sprinkler network for P(Rain | Sprinkler = true, WetGrass = true). Sprinkler and WetGrass stay fixed at true; Cloudy and Rain are resampled in turn:
  Start: Cloudy = true, Sprinkler = true, Rain = false, WetGrass = true
  Sample Cloudy from P(Cloudy | Sprinkler = true, Rain = false): say Cloudy = false
  Sample Rain from P(Rain | Cloudy = false, Sprinkler = true, WetGrass = true): say Rain = true
  Visited states (Cloudy, Sprinkler, Rain, WetGrass): (t,t,f,t), (f,t,f,t), (f,t,t,t); tally: ¬R twice, R once]

Gibbs Sampling
• To approximate: P(X | e)
• Start in a state with the evidence variables set to their evidence values (others arbitrary)
• On each step, sample non-evidence variables conditioned on the values of the variables in their Markov blankets
• Order irrelevant
• A form of local search!
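The Gibbs loop above can be sketched for the same sprinkler query. The Markov-blanket conditionals below are worked out by hand for this specific network (Cloudy's blanket excludes WetGrass; Rain's blanket is Cloudy, Sprinkler, WetGrass); the representation is an assumption of this sketch, not the lecture's code.

```python
import random

random.seed(0)  # for reproducibility of this sketch

P_C = 0.5
P_S = {True: 0.10, False: 0.50}   # P(Sprinkler = t | Cloudy)
P_R = {True: 0.80, False: 0.20}   # P(Rain = t | Cloudy)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.00}  # P(WetGrass = t | S, R)

def bern(p):
    return random.random() < p

def sample_cloudy(s, r):
    # P(C | s, r) is proportional to P(C) P(s | C) P(r | C);
    # WetGrass is outside Cloudy's Markov blanket.
    def score(c):
        ps = P_S[c] if s else 1 - P_S[c]
        pr = P_R[c] if r else 1 - P_R[c]
        return (P_C if c else 1 - P_C) * ps * pr
    t, f = score(True), score(False)
    return bern(t / (t + f))

def sample_rain(c, s, w):
    # P(R | c, s, w) is proportional to P(R | c) P(w | s, R)
    def score(r):
        pr = P_R[c] if r else 1 - P_R[c]
        pw = P_W[(s, r)] if w else 1 - P_W[(s, r)]
        return pr * pw
    t, f = score(True), score(False)
    return bern(t / (t + f))

def gibbs_rain(n=100_000, burn=1_000):
    s = w = True                 # evidence stays fixed
    c, r = True, False           # arbitrary initial state
    count_r = 0
    for i in range(n + burn):
        c = sample_cloudy(s, r)  # resample Cloudy given its blanket
        r = sample_rain(c, s, w) # resample Rain given its blanket
        if i >= burn:
            count_r += r         # tally Rain = true in visited states
    return count_r / n

est = gibbs_rain()
```

In the long run the fraction of visited states with Rain = true approaches the exact posterior, about 0.32 for this query.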
Exact Inference in Bayesian Networks
• #P-hard, even for a distribution described as a Bayesian network

Approximate Inference in Bayesian Networks
• Sampling consistent with a distribution
• Rejection sampling: rejects too much
• Likelihood weighting: weights get too small
• Gibbs sampling: an MCMC algorithm (like local search)
• All generate consistent estimates (equal to the exact probability in the large-sample limit)

Probabilistic Reasoning Over Time

Belief States
• Logic: defined in terms of possible (or impossible) worlds
• Probability: defined in terms of more (or less) likely possible worlds
• Either way: the state of the world is fixed (it doesn't change during reasoning), so each random variable has a single fixed value

[Figure: two wumpus-world belief states, (a) and (b). Legend: A = agent, B = breeze, G = glitter/gold, OK = safe square, P = pit (P? possible, P! known), S = stench, V = visited, W = wumpus (W! known).]

Changing Probabilities
• Example: monitoring a diabetic patient
  • Query: BloodSugar, InsulinLevel
  • Hidden: MetabolicActivity, ...
  • Evidence: MeasuredBloodSugar, InsulinTaken, FoodEaten, ...

Goal
• Given a history of evidence:
  • Assess the current state
  • Predict future states

Representation
• Model the world as a series of time slices
• Unobservable state variables X; observable evidence variables E
• State at time t: Xt; observation at time t: Et = et
• Umbrella example: Raint = Rt, Umbrellat = Ut
  • Xt: R0, R1, R2, ...
  • Et: U1, U2, ...

Modeling Change
• Given the state from time 0 through time t, we need the distribution of the state variables at time t+1:
  P(Xt+1 | X0:t) = P(Xt+1 | X0, X1, ..., Xt)

Andrey (Andrei) Andreyevich Markov (1856-1922)

Markov Process (Assumption)
• The current state depends only on the previous state, not on earlier states
• The future is conditionally independent of the past, given the present:
  P(Xt | X0:t−1) = P(Xt | Xt−1)

[Figure: a Markov chain ... Xt−2 → Xt−1 → Xt → Xt+1 → Xt+2 ...; in the umbrella example, ... Rt−2 → Rt−1 → Rt → Rt+1 → Rt+2 ..., with each link labeled by its transition probability: P(Rt−1 | Rt−2), P(Rt | Rt−1), P(Rt+1 | Rt), P(Rt+2 | Rt+1).]

Stationary Process
• Changes in the state are caused by a process that does not itself change
• The same model computes the change for any pair of states Xt, Xt+1
• Example: P(Rt | Rt−1) is the same for every t:

  Rt−1  P(Rt)
   t     0.7
   f     0.3

• Representation of state: Xt, Et
• Transition model: P(Xt | Xt−1)
  • Markov assumption, stationary process

Sensor Markov Assumption
• Observed values of the evidence variables depend only on the current state
• Evidence is conditionally independent of the past, given the present:
  P(Et | X0:t, E0:t−1) = P(Et | Xt)

Sensor Model
• For the umbrella example:

  Rt  P(Ut)
   t    0.9
   f    0.2

Temporal Models
• Representation of state: Xt, Et
• Transition model: P(Xt | Xt−1)
  • Markov assumption, stationary process
• Sensor model: P(Et | Xt)
• Prior distribution at time 0: P(X0)

Temporal Model
  P(X0:t, E1:t) = P(X0) ∏_i=1..t P(Xi | Xi−1) P(Ei | Xi)
  (initial state model × transition model × sensor model)

Inference
• Filtering (state estimation)
• Prediction
• Smoothing
• Most likely explanation

Filtering (State Estimation)
• Compute the current belief state given all evidence to date
• Method: build the network incrementally and do inference on it
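The temporal model above (prior, transition model, sensor model, and their product) can be sketched for the umbrella world. The dict-based representation is an illustrative choice made here, not the lecture's code.

```python
# Umbrella world: prior P(R0), stationary transition model P(Rt | Rt-1),
# and sensor model P(Ut | Rt), all from the slides.
PRIOR = {True: 0.5, False: 0.5}    # P(R0 = true / false)
TRANS = {True: 0.7, False: 0.3}    # P(Rt = true | Rt-1 = true / false)
SENSOR = {True: 0.9, False: 0.2}   # P(Ut = true | Rt = true / false)

def joint(states, observations):
    """P(x0:t, e1:t) = P(x0) * prod_i P(xi | xi-1) P(ei | xi).
    states = [r0, r1, ..., rt]; observations = [u1, ..., ut]."""
    p = PRIOR[states[0]]                             # initial state model
    for i in range(1, len(states)):
        pt = TRANS[states[i - 1]]
        p *= pt if states[i] else 1 - pt             # transition term
        ps = SENSOR[states[i]]
        p *= ps if observations[i - 1] else 1 - ps   # sensor term
    return p

# e.g. P(r0, r1, u1) with rain on days 0 and 1 and the umbrella seen on
# day 1: 0.5 * 0.7 * 0.9 = 0.315
p = joint([True, True], [True])
```

Exact inference tasks such as filtering are just structured ways of summing this joint over the unobserved variables.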
[Figure: the umbrella network grown incrementally: R0; then R0 → R1 with U1; then R0 → R1 → R2 with U1, U2; then R0 → R1 → R2 → R3 with U1, U2, U3. Transition model P(Rt = t | Rt−1): t: 0.7, f: 0.3; sensor model P(Ut = t | Rt): t: 0.9, f: 0.2.]

Filtering (State Estimation)
• Compute the current belief state given all evidence to date
• Must maintain a current state estimate and update it, rather than recomputing over the whole history of observations each time

Filtering (State Estimation)
  P(Xt+1 | e1:t+1) = P(Xt+1 | e1:t, et+1)
                   = α P(et+1 | Xt+1, e1:t) P(Xt+1 | e1:t)    [Bayes' rule]
                   = α P(et+1 | Xt+1) P(Xt+1 | e1:t)          [sensor Markov assumption]
• The second factor is a prediction of the next state; the first updates it with the new evidence (using the sensor model)

One-Step Prediction
  P(Xt+1 | e1:t) = Σ_xt P(Xt+1 | xt, e1:t) P(xt | e1:t)       [condition on Xt]
                 = Σ_xt P(Xt+1 | xt) P(xt | e1:t)             [Markov assumption]

Filtering (State Estimation)
  P(Xt+1 | e1:t+1) = α P(et+1 | Xt+1) Σ_xt P(Xt+1 | xt) P(xt | e1:t)
• Predict the next state, then update with the evidence (using the sensor model)

Worked example (umbrella world; evidence U1 = true, U2 = true):
  P(R0) = ⟨0.5, 0.5⟩
  P(R1) = Σ_r0 P(R1 | r0) P(r0) = ⟨0.7, 0.3⟩ × 0.5 + ⟨0.3, 0.7⟩ × 0.5 = ⟨0.5, 0.5⟩
  P(R1 | u1) = α P(u1 | R1) P(R1) = α ⟨0.9, 0.2⟩⟨0.5, 0.5⟩ = α ⟨0.45, 0.1⟩ ≈ ⟨0.818, 0.182⟩
  P(R2 | u1) = Σ_r1 P(R2 | r1) P(r1 | u1) = ⟨0.7, 0.3⟩ × 0.818 + ⟨0.3, 0.7⟩ × 0.182 ≈ ⟨0.627, 0.373⟩
  P(R2 | u1, u2) = α P(u2 | R2) P(R2 | u1) = α ⟨0.9, 0.2⟩⟨0.627, 0.373⟩ = α ⟨0.565, 0.075⟩ ≈ ⟨0.883, 0.117⟩

Filtering (State Estimation)
  P(Xt+1 | e1:t+1) = α P(et+1 | Xt+1) Σ_xt P(Xt+1 | xt) P(xt | e1:t)
• Implement as a recursive procedure:
  P(X0 | e1:0) = P(X0)
  P(Xt+1 | e1:t+1) = α Forward(P(Xt | e1:t), et+1)
• Each update takes constant time and space!
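The recursive forward update above can be sketched directly, reproducing the numbers from the worked example. Distributions are represented here as (P(rain), P(not rain)) pairs; the function name `forward` follows the slide, the rest is an illustrative assumption.

```python
PRIOR = (0.5, 0.5)    # P(R0 = true), P(R0 = false)
TRANS = (0.7, 0.3)    # P(Rt = true | Rt-1 = true), P(Rt = true | Rt-1 = false)
SENSOR = (0.9, 0.2)   # P(Ut = true | Rt = true), P(Ut = true | Rt = false)

def forward(belief, umbrella):
    """One filtering step: predict through the transition model, then
    weight by the sensor model and normalize."""
    p_rain, p_dry = belief
    # One-step prediction: P(Rt+1 | e1:t) = sum_rt P(Rt+1 | rt) P(rt | e1:t)
    pred_rain = TRANS[0] * p_rain + TRANS[1] * p_dry
    pred_dry = (1 - TRANS[0]) * p_rain + (1 - TRANS[1]) * p_dry
    # Update with the evidence: multiply by P(ut+1 | Rt+1)
    if umbrella:
        up_rain, up_dry = SENSOR[0] * pred_rain, SENSOR[1] * pred_dry
    else:
        up_rain, up_dry = (1 - SENSOR[0]) * pred_rain, (1 - SENSOR[1]) * pred_dry
    alpha = up_rain + up_dry              # normalization constant
    return (up_rain / alpha, up_dry / alpha)

b1 = forward(PRIOR, True)   # P(R1 | u1)     ~ (0.818, 0.182)
b2 = forward(b1, True)      # P(R2 | u1, u2) ~ (0.883, 0.117)
```

Each call touches only the previous belief and the newest observation, which is exactly why the update runs in constant time and space.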
Prediction
• Compute the posterior distribution over a future state, given all evidence to date
• This is just filtering without the addition of any new evidence:
  P(Xt+k+1 | e1:t) = Σ_xt+k P(Xt+k+1 | xt+k) P(xt+k | e1:t)

Smoothing
• Compute the posterior over past state(s) given evidence up to the present
• May improve the estimate made at the time, since we now know what was then the future
• Umbrella example (with U1 = true, U2 = true):
  P(R1 | u1, u2) = α P(R1 | u1) P(u2 | R1) ≈ ⟨0.883, 0.117⟩
  (compare the filtered estimate P(R1 | u1) ≈ ⟨0.818, 0.182⟩)

Finding the Most Likely Sequence
• Infer the most likely sequence of states that could have generated the observations
• Without enumerating all possible sequences of states and evaluating their likelihoods

[Figure: (a) the lattice of Rain1 ... Rain5 values (true/false) for the umbrella sequence true, true, false, true, true; (b) the Viterbi messages m1:1 ... m1:5 = ⟨.8182, .1818⟩, ⟨.5155, .0491⟩, ⟨.0361, .1237⟩, ⟨.0334, .0173⟩, ⟨.0210, .0024⟩.]

Viterbi Algorithm
• Time complexity: O(t)
• Space complexity: O(t)

Temporal Models
• Representation of state: Xt, Et
• Transition model: P(Xt | Xt−1)
  • Markov assumption, stationary process
• Sensor model: P(Et | Xt)
  • Sensor Markov assumption
• Prior distribution at time 0: P(X0)

Temporal Model
  P(X0:t, E1:t) = P(X0) ∏_i=1..t P(Xi | Xi−1) P(Ei | Xi)
  (initial state model × transition model × sensor model)

Inference
• Filtering (state estimation): compute the current belief state given all evidence to date
• Prediction: compute the posterior distribution over a future state, given all evidence to date
• Smoothing: compute the posterior over past state(s) given evidence up to the present
• Most likely explanation: infer the most likely sequence of states that could have generated the observations

For Next Time: AIMA 16.0-16.3, 16.5
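To accompany the most-likely-sequence slides above, the Viterbi max-product recursion can be sketched for the umbrella world, reproducing the message values m1:1 ... m1:5 from the lattice figure. The tuple representation and state encoding (0 = rain, 1 = no rain) are assumptions of this sketch; the first message is normalized, matching the figure.

```python
PRIOR = (0.5, 0.5)    # P(R0 = true), P(R0 = false)
TRANS = (0.7, 0.3)    # P(Rt = true | Rt-1 = true), P(Rt = true | Rt-1 = false)
SENSOR = (0.9, 0.2)   # P(Ut = true | Rt = true), P(Ut = true | Rt = false)

def trans(prev, s):
    """P(Rt = s | Rt-1 = prev), with state 0 = rain, 1 = no rain."""
    p = TRANS[0] if prev == 0 else TRANS[1]
    return p if s == 0 else 1 - p

def sensor(s, u):
    """P(Ut = u | Rt = s)."""
    p = SENSOR[0] if s == 0 else SENSOR[1]
    return p if u else 1 - p

def viterbi_messages(observations):
    """m[t][s]: probability of the best state sequence ending in state s
    at step t+1, given the observations u1..ut+1."""
    msgs = []
    for t, u in enumerate(observations):
        prev_m = PRIOR if t == 0 else msgs[-1]
        # Max (not sum) over the predecessor state, then weight by evidence
        m = [sensor(s, u) * max(trans(p, s) * prev_m[p] for p in (0, 1))
             for s in (0, 1)]
        if t == 0:
            z = m[0] + m[1]            # normalize the first message,
            m = [m[0] / z, m[1] / z]   # as on the slides
        msgs.append(m)
    return msgs

msgs = viterbi_messages([True, True, False, True, True])
# msgs[0] ~ (0.8182, 0.1818) and msgs[4] ~ (0.0210, 0.0024), as in the figure
```

One pass of this recursion plus stored back-pointers recovers the most likely sequence, which is the source of the O(t) time and space bounds quoted above.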