Bayesian Networks
Tamara Berg
CS 560 Artificial Intelligence
Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart Russell, Andrew Moore, Percy Liang, Luke Zettlemoyer, Rob Pless, Kilian Weinberger, Deva Ramanan

Announcements
• HW2 due tomorrow, 11:59pm
• Mid-term was handed back last week (see me if you haven't collected yours)
• No class Monday (University Day)

Review: Probability
• Random variables, events
• Axioms of probability
• Atomic events
• Joint and marginal probability distributions
• Conditional probability distributions
• Product rule
• Independence and conditional independence
• Inference

Bayesian decision making
• Suppose the agent has to make decisions about the value of an unobserved query variable X based on the values of an observed evidence variable E
• Inference problem: given some evidence E = e, what is P(X | e)?
• Learning problem: estimate the parameters of the probabilistic model P(X | E) given training samples {(x1,e1), ..., (xn,en)}

Bayesian networks (BNs)
• A type of graphical model
• A BN states conditional independence relationships between random variables
• Compact specification of full joint distributions

Random Variables
Random variables X = {X1, X2, X3, ..., Xn}. Let xi be a realization of Xi.
A random variable is some aspect of the world about which we (may) have uncertainty. Random variables can:
• Be binary (e.g. {true, false}, {spam, ham}),
• Take on a discrete set of values (e.g. {Spring, Summer, Fall, Winter}),
• Or be continuous (e.g. [0, 1]).

Joint Probability Distribution
A joint probability distribution specifies a probability for every atomic event:
P(X1 = x1, X2 = x2, X3 = x3, ..., Xn = xn), also written p(x1, x2, x3, ..., xn).

Queries
Given a joint distribution, we can reason about unobserved variables given observations (evidence):
P(Xq | xe1, ..., xek)
where Xq is the stuff you care about (the query) and xe1, ..., xek is the stuff you already know (the evidence).

Representation
One way to represent the joint probability distribution for discrete Xi is as an n-dimensional table, each cell containing the probability of one setting of X. This table would have r^n entries if each Xi ranges over r values. Graphical models represent joint probability distributions more economically, using a set of "local" relationships among variables.

Graphical Models
Graphical models offer several useful properties:
1. They provide a simple way to visualize the structure of a probabilistic model and can be used to design and motivate new models.
2. Insights into the properties of the model, including conditional independence properties, can be obtained by inspection of the graph.
3. Complex computations, required to perform inference and learning in sophisticated models, can be expressed in terms of graphical manipulations.
(from Chris Bishop)
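To make the table representation concrete, here is a minimal Python sketch (not from the slides; all numbers are invented for illustration) of a joint distribution over three binary variables, answering a conditional query by summing table cells:

```python
# A joint distribution over n = 3 binary variables stored as an explicit
# table: r^n = 2^3 = 8 entries. Any query can be answered by summation.
# The probabilities below are made up for illustration.

joint = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.05,
    (0, 1, 0): 0.20, (0, 1, 1): 0.15,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10,
    (1, 1, 0): 0.05, (1, 1, 1): 0.30,
}
assert abs(sum(joint.values()) - 1.0) < 1e-9   # probabilities sum to 1

def query(q_index, q_value, evidence):
    """P(X_q = q_value | evidence): sum the matching cells, then
    normalize, since P(q | e) = P(q, e) / P(e).
    `evidence` maps variable index -> observed value."""
    p_qe = p_e = 0.0
    for event, p in joint.items():
        if all(event[i] == v for i, v in evidence.items()):
            p_e += p
            if event[q_index] == q_value:
                p_qe += p
    return p_qe / p_e

print(query(0, 1, {2: 1}))   # P(X1 = 1 | X3 = 1) = 0.40 / 0.60 ≈ 0.667
```

The table works for small n, but the exponential blow-up is exactly what the "local" relationships of graphical models avoid.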
Main kinds of models
• Undirected (also called Markov Random Fields): links express constraints between variables.
• Directed (also called Bayesian Networks): have a notion of causality; one can regard an arc from A to B as indicating that A "causes" B.

Syntax
A Bayesian network is a directed acyclic graph.
• Nodes: random variables. Can be assigned/observed (shaded) or unassigned/unobserved (unshaded).
• Arcs: interactions. An arrow from one variable to another indicates direct influence. Arcs encode independence/conditional independence.
Example (Weather, Cavity, Toothache, Catch): Weather is independent of the other variables; Toothache and Catch are conditionally independent given Cavity.

Example: N independent coin flips
Complete independence, no interactions: nodes X1, X2, ..., Xn with no arcs between them.

Example: Naïve Bayes document model
Random variables: X (document class) and W1, ..., Wn (words in the document), with an arc from X to each Wi.

Example: Burglar Alarm
I have a burglar alarm that is sometimes set off by minor earthquakes. My two neighbors, John and Mary, promise to call me at work if they hear the alarm.
Example inference task: suppose Mary calls and John doesn't call. What is the probability of a burglary?
What are the random variables? Burglary, Earthquake, Alarm, JohnCalls, MaryCalls.
What are the direct influence relationships?
• A burglar can set the alarm off
• An earthquake can set the alarm off
• The alarm can cause Mary to call
• The alarm can cause John to call
What does this mean? What are the model parameters?

Bayes Nets
A directed graph G = (X, E) with nodes X = {X1, X2, X3, ..., Xn} and edges E = {(Xi, Xj) : i ≠ j}. Each node is associated with a random variable.

Example
[Figure: a six-node graph with arcs X1 → X2, X1 → X3, X2 → X4, X3 → X5, and X2, X5 → X6.]

Joint Distribution
By the chain rule (using the usual arithmetic ordering):
p(x1, x2, x3, x4, x5, x6) = p(x1) p(x2 | x1) p(x3 | x1, x2) p(x4 | x1, x2, x3) p(x5 | x1, x2, x3, x4) p(x6 | x1, x2, x3, x4, x5)

Directed Graphical Models
Definition of the joint probability in a graphical model:
p(x1, ..., xn) = ∏i p(xi | xπi)
where πi denotes the parents of xi.

Example
Joint probability for the graph above:
p(x1, x2, x3, x4, x5, x6) = p(x1) p(x2 | x1) p(x3 | x1) p(x4 | x2) p(x5 | x3) p(x6 | x2, x5)

Conditional Independence
Independence: p(xA, xB) = p(xA) p(xB)
Conditional independence: p(xA, xC | xB) = p(xA | xB) p(xC | xB), or equivalently p(xA | xB, xC) = p(xA | xB)

Conditional Independence
By the chain rule: p(x1, ..., x6) = p(x1) p(x2 | x1) p(x3 | x1, x2) p(x4 | x1, x2, x3) p(x5 | x1, x2, x3, x4) p(x6 | x1, x2, x3, x4, x5)
Joint distribution from the example graph: p(x1) p(x2 | x1) p(x3 | x1) p(x4 | x2) p(x5 | x3) p(x6 | x2, x5)
Missing variables in the local conditional probability functions correspond to missing edges in the underlying graph: removing an edge into node i eliminates an argument from the conditional probability factor p(xi | x1, x2, ..., xi-1).
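The factorization translates directly into code. Below is a minimal sketch, assuming binary variables; only the structure p(x1) p(x2|x1) p(x3|x1) p(x4|x2) p(x5|x3) p(x6|x2,x5) comes from the example graph, and the CPT numbers are invented:

```python
# Factored joint for the six-node example graph. Each node stores only
# P(X_i = 1 | parents), so the full 2^6-entry table is never built.
# The probabilities below are made up for illustration.

parents = {1: (), 2: (1,), 3: (1,), 4: (2,), 5: (3,), 6: (2, 5)}

# cpt[i][parent_values] = P(X_i = 1 | parents)
cpt = {
    1: {(): 0.6},
    2: {(0,): 0.3, (1,): 0.8},
    3: {(0,): 0.5, (1,): 0.2},
    4: {(0,): 0.1, (1,): 0.7},
    5: {(0,): 0.4, (1,): 0.9},
    6: {(0, 0): 0.05, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.95},
}

def joint(assignment):
    """p(x1, ..., x6) = product over i of p(x_i | parents(x_i))."""
    p = 1.0
    for i, pa in parents.items():
        pvals = tuple(assignment[j] for j in pa)
        p1 = cpt[i][pvals]                       # P(X_i = 1 | parents)
        p *= p1 if assignment[i] == 1 else 1 - p1
    return p

# p(x1=1, x2=0, x3=1, x4=0, x5=1, x6=1)
print(joint({1: 1, 2: 0, 3: 1, 4: 0, 5: 1, 6: 1}))
```

Note how each factor consults only a node's parents; this locality is the space saving quantified in the "Size of a Bayes' Net" slide below.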
Semantics
A BN represents a full joint distribution in a compact way. We only need to specify a conditional probability distribution for each node given its parents: P(X | Parents(X)).
[Figure: a node X with parents Z1, Z2, ..., Zn; X stores the table P(X | Z1, ..., Zn).]

Example
[Figure: the six-node network from before, with a small conditional probability table attached to each node.]

Example: Alarm Network
Structure: Burglary → Alarm ← Earthquake, Alarm → JohnCalls, Alarm → MaryCalls.

B    P(B)          E    P(E)
+b   0.001         +e   0.002
¬b   0.999         ¬e   0.998

B    E    A    P(A|B,E)
+b   +e   +a   0.95
+b   +e   ¬a   0.05
+b   ¬e   +a   0.94
+b   ¬e   ¬a   0.06
¬b   +e   +a   0.29
¬b   +e   ¬a   0.71
¬b   ¬e   +a   0.001
¬b   ¬e   ¬a   0.999

A    J    P(J|A)        A    M    P(M|A)
+a   +j   0.9           +a   +m   0.7
+a   ¬j   0.1           +a   ¬m   0.3
¬a   +j   0.05          ¬a   +m   0.01
¬a   ¬j   0.95          ¬a   ¬m   0.99

Size of a Bayes' Net
• How big is a joint distribution over N Boolean variables? 2^N
• How big is an N-node net if nodes have up to k parents? O(N · 2^(k+1))
• Both give you the power to calculate the full joint distribution
• BNs: huge space savings!
• Easier to elicit local CPTs
• Also faster to answer queries

The joint probability distribution
For example, P(j, m, a, ¬b, ¬e) = P(¬b) P(¬e) P(a | ¬b, ¬e) P(j | a) P(m | a)

Independence in a BN
• Important question about a BN: are two nodes independent given some evidence?
  – If yes, we can prove it using algebra (tedious in general)
  – If no, we can prove it with a counterexample
• Example: X → Y → Z (X: low pressure, Y: rain, Z: traffic)
  – Question: are X and Z necessarily independent?
  – Answer: no. Low pressure causes rain, which causes traffic; X can influence Z, and Z can influence X (via Y).
  – Addendum: they could be independent: how?

Independence
Key properties:
a) Each node is conditionally independent of its non-descendants given its parents.
b) A node is conditionally independent of all other nodes in the graph given its Markov blanket (its parents, children, and children's other parents).

Moral Graphs
Equivalent undirected form of a directed acyclic graph.

Causal Chains
• This configuration is a "causal chain": X → Y → Z (X: project due, Y: no office hours, Z: students panic)
• Is Z independent of X given Y? Yes! Evidence along the chain "blocks" the influence.

Common Cause
• Another basic configuration: two effects of the same cause, X ← Y → Z (Y: homework due, X: full attendance, Z: students sleepy)
• Are X and Z independent? Not in general.
• Are X and Z independent given Y? Yes! Observing the cause blocks influence between effects.

Common Effect
• Last configuration: two causes of one effect (v-structures), X → Y ← Z (X: raining, Z: ballgame, Y: traffic)
• Are X and Z independent? Yes: the ballgame and the rain cause traffic, but they are not correlated. (Still need to prove they must be independent: try it!)
• Are X and Z independent given Y? No: seeing traffic puts the rain and the ballgame in competition as explanations. This is backwards from the other cases: observing an effect activates influence between possible causes.

The General Case
• Any complex example can be analyzed using these three canonical cases: causal chain, common cause, and common effect.
• General question: in a given BN, are two variables independent (given evidence)?
• Solution: analyze the graph.
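These canonical cases can be checked numerically on the alarm network, whose CPTs were given above. The sketch below (illustrative code, not from the slides, but using the slides' CPT values) evaluates the example joint probability and demonstrates explaining away in the common-effect configuration Burglary → Alarm ← Earthquake:

```python
from itertools import product

# CPT values from the alarm network slides.
# Variables: B(urglary), E(arthquake), A(larm), J(ohnCalls), M(aryCalls).
P_b, P_e = 0.001, 0.002                               # P(+b), P(+e)
P_a = {(True, True): 0.95, (True, False): 0.94,       # P(+a | B, E)
       (False, True): 0.29, (False, False): 0.001}
P_j = {True: 0.9, False: 0.05}                        # P(+j | A)
P_m = {True: 0.7, False: 0.01}                        # P(+m | A)

def joint(b, e, a, j, m):
    """P(b, e, a, j, m) = P(b) P(e) P(a|b,e) P(j|a) P(m|a)."""
    def f(p_true, x):                 # P(X = x) from P(X = true)
        return p_true if x else 1 - p_true
    return (f(P_b, b) * f(P_e, e) * f(P_a[(b, e)], a)
            * f(P_j[a], j) * f(P_m[a], m))

def prob(query, evidence):
    """P(query | evidence) by enumerating all 2^5 atomic events."""
    names = ['b', 'e', 'a', 'j', 'm']
    num = den = 0.0
    for vals in product([True, False], repeat=5):
        world = dict(zip(names, vals))
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(**world)
            den += p
            if all(world[k] == v for k, v in query.items()):
                num += p
    return num / den

# The slide example: P(j, m, a, ¬b, ¬e) ≈ 0.000628
print(joint(b=False, e=False, a=True, j=True, m=True))

# Explaining away (common effect): the two causes compete.
print(prob({'b': True}, {'a': True}))              # ≈ 0.374
print(prob({'b': True}, {'a': True, 'e': True}))   # ≈ 0.0033
```

Observing the alarm raises the belief in a burglary from 0.001 to about 0.37; additionally observing an earthquake drives it back down to about 0.003, because the earthquake explains the alarm.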
Bayes Ball
• Shade all observed nodes. Place balls at the starting node, let them bounce around according to the rules, and ask whether any ball reaches the goal node.
• We need to know what happens when a ball arrives at a node on its way to the goal node.
[Figure: the ball-bouncing rules for the canonical cases.]

Examples
[Figures: two example networks (nodes R, B, T, T' in the first; L, R, D, B, T, T' in the second) worked through with the Bayes ball rules; the queried independences hold ("Yes") in each case.]

Constructing Bayesian networks
1. Choose an ordering of variables X1, ..., Xn
2. For i = 1 to n: add Xi to the network and select parents from X1, ..., Xi-1 such that P(Xi | Parents(Xi)) = P(Xi | X1, ..., Xi-1)

Example
Suppose we choose the ordering M, J, A, B, E:
P(J | M) = P(J)?  No
P(A | J, M) = P(A)?  No
P(A | J, M) = P(A | J)?  No
P(A | J, M) = P(A | M)?  No
P(B | A, J, M) = P(B)?  No
P(B | A, J, M) = P(B | A)?  Yes
P(E | B, A, J, M) = P(E)?  No
P(E | B, A, J, M) = P(E | A, B)?  Yes

Example contd.
• Deciding conditional independence is hard in noncausal directions.
• The causal direction seems much more natural, but is not mandatory.
• The resulting network is less compact.

Summary
• Bayesian networks provide a natural representation for (causally induced) conditional independence.
• Topology + conditional probability tables.
• Generally easy for domain experts to construct.
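To close, here is a compact sketch of the d-separation test that the Bayes ball procedure computes. It follows the standard reachability formulation (a search over (node, direction) pairs); the function names and graph encoding are illustrative, not from the slides.

```python
from collections import deque

def d_separated(parents, x, y, z):
    """True iff every trail between x and y is blocked given evidence z.
    `parents` maps each node to its list of parents; z is a set."""
    children = {}
    for c, ps in parents.items():
        for p in ps:
            children.setdefault(p, set()).add(c)

    # Phase 1: evidence nodes and all their ancestors (for v-structures).
    anc, stack = set(), list(z)
    while stack:
        n = stack.pop()
        if n not in anc:
            anc.add(n)
            stack.extend(parents.get(n, ()))

    # Phase 2: search over (node, direction) pairs.
    # 'up' = arriving from a child, 'down' = arriving from a parent.
    visited, frontier = set(), deque([(x, 'up')])
    while frontier:
        node, direction = frontier.popleft()
        if (node, direction) in visited:
            continue
        visited.add((node, direction))
        if node not in z and node == y:
            return False                          # active trail found
        if direction == 'up' and node not in z:
            for p in parents.get(node, ()):       # chain continues upward
                frontier.append((p, 'up'))
            for c in children.get(node, ()):      # common cause: turn down
                frontier.append((c, 'down'))
        elif direction == 'down':
            if node not in z:                     # chain continues downward
                for c in children.get(node, ()):
                    frontier.append((c, 'down'))
            if node in anc:                       # v-structure activated by
                for p in parents.get(node, ()):   # an observed descendant
                    frontier.append((p, 'up'))
    return True

# Alarm network structure: B -> A <- E, A -> J, A -> M.
g = {'A': ['B', 'E'], 'J': ['A'], 'M': ['A'], 'B': [], 'E': []}
print(d_separated(g, 'B', 'E', set()))     # True: unlinked causes
print(d_separated(g, 'B', 'E', {'A'}))     # False: explaining away
print(d_separated(g, 'J', 'M', {'A'}))     # True: common cause observed
```

The three branches of the search correspond exactly to the three canonical cases: chains and common causes are blocked by observed nodes, while v-structures are opened by them.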