Part II: Graphical models

Challenges of probabilistic models
• Specifying well-defined probabilistic models with many variables is hard (for modelers)
• Representing probability distributions over those variables is hard (for computers/learners)
• Computing quantities using those distributions is hard (for computers/learners)

Representing structured distributions
Four random variables, each with domain {0,1}:
X1: coin toss produces heads
X2: pencil levitates
X3: friend has psychic powers
X4: friend has two-headed coin

Joint distribution
• Requires 15 numbers to specify the probability of all values x1, x2, x3, x4
– N binary variables require 2^N − 1 numbers
• Similar cost when computing conditional probabilities:
P(x2, x3, x4 | x1 = 1) = P(x1 = 1, x2, x3, x4) / P(x1 = 1)
(the joint table has an entry for each of the 16 outcomes 0000, 0001, …, 1111)

How can we use fewer numbers?
Four random variables, each with domain {0,1}:
X1, X2, X3, X4: four coin tosses, each producing heads

Statistical independence
• Two random variables X1 and X2 are independent if P(x1|x2) = P(x1)
– e.g.
coin flips: P(x1 = H | x2 = H) = P(x1 = H) = 0.5
• Independence makes it easier to represent and work with probability distributions
• We can exploit the product rule:
P(x1, x2, x3, x4) = P(x1 | x2, x3, x4) P(x2 | x3, x4) P(x3 | x4) P(x4)
If x1, x2, x3, and x4 are all independent:
P(x1, x2, x3, x4) = P(x1) P(x2) P(x3) P(x4)

Expressing independence
• Statistical independence is the key to efficient probabilistic representation and computation
• This has led to the development of languages for indicating dependencies among variables
• Some of the most popular languages are based on "graphical models"

Part II: Graphical models
• Introduction to graphical models
– representation and inference
• Causal graphical models
– causality
– learning about causal relationships
• Graphical models and cognitive science
– uses of graphical models
– an example: causal induction

Graphical models
• Express the probabilistic dependency structure among a set of variables (Pearl, 1988)
• Consist of
– a set of nodes, corresponding to variables
– a set of edges, indicating dependency
– a set of functions defined on the graph that specify a probability distribution

Undirected graphical models
(graph over nodes X1, …, X5)
• Consist of
– a set of nodes
– a set of edges
– a potential for each clique, multiplied together to yield the distribution over variables
• Examples
– statistical physics: Ising model, spin glasses
– early neural networks (e.g.
Boltzmann machines)

Directed graphical models
(graph over nodes X1, …, X5)
• Consist of
– a set of nodes
– a set of edges
– a conditional probability distribution for each node, conditioned on its parents, multiplied together to yield the distribution over variables
• Constrained to directed acyclic graphs (DAGs)
• Called Bayesian networks or Bayes nets

Bayesian networks and Bayes
• Two different problems
– Bayesian statistics is a method of inference
– Bayesian networks are a form of representation
• There is no necessary connection
– many users of Bayesian networks rely upon frequentist statistical methods
– many Bayesian inferences cannot be easily represented using Bayesian networks

Properties of Bayesian networks
• Efficient representation and inference
– exploiting dependency structure makes it easier to represent and compute with probabilities
• Explaining away
– a pattern of probabilistic reasoning characteristic of Bayesian networks, especially in their early use in AI

Efficient representation and inference
Four random variables:
X1: coin toss produces heads
X2: pencil levitates
X3: friend has psychic powers
X4: friend has two-headed coin
(network: X4 → X1 ← X3 → X2, with distributions P(x4), P(x3), P(x1|x3, x4), P(x2|x3))

The Markov assumption
Every node is conditionally independent of its non-descendants, given its parents:
P(xi | xi+1, …, xk) = P(xi | Pa(Xi))
where Pa(Xi) is the set of parents of Xi. Via the product rule,
P(x1, …, xk) = Π_{i=1..k} P(xi | Pa(Xi))

Efficient representation and inference
(same four variables and network as above)
Numbers required: 1 for P(x4), 1 for P(x3), 4 for P(x1|x3, x4), 2 for P(x2|x3)
Total = 8 (vs 15 for the full joint)
P(x1, x2, x3, x4) = P(x1|x3,
x4) P(x2|x3) P(x3) P(x4)

Reading a Bayesian network
• The structure of a Bayes net can be read as the generative process behind a distribution
• It gives the joint probability distribution over variables obtained by sampling each variable conditioned on its parents
• Simple rules determine whether two variables are dependent or independent
• Independence makes inference more efficient
(for the network above: P(x1, x2, x3, x4) = P(x1|x3, x4) P(x2|x3) P(x3) P(x4))

Computing with Bayes nets
(network: X4 → X1 ← X3 → X2)
P(x1, x2, x3, x4) = P(x1|x3, x4) P(x2|x3) P(x3) P(x4)
To compute P(x2, x3, x4 | x1 = 1) = P(x1 = 1, x2, x3, x4) / P(x1 = 1), we need the marginal
P(x1 = 1) = Σ_{x2, x3, x4} P(x1 = 1, x2, x3, x4)   (a sum over 8 values)
Substituting the factorization,
P(x1 = 1) = Σ_{x2, x3, x4} P(x1 = 1 | x3, x4) P(x2|x3) P(x3) P(x4)
and summing out x2 (since Σ_{x2} P(x2|x3) = 1),
P(x1 = 1) = Σ_{x3, x4} P(x1 = 1 | x3, x4) P(x3) P(x4)   (a sum over 4 values)

Computing with Bayes nets
• Inference algorithms for Bayesian networks exploit dependency structure
• Message-passing algorithms
– "belief propagation" passes simple messages between nodes, and is exact for tree-structured networks
• More general inference algorithms
– exact: the "junction tree" algorithm
– approximate: Monte Carlo schemes (see Part IV)

Properties of Bayesian networks
• Efficient representation and inference
– exploiting dependency structure makes it easier to represent and compute with probabilities
• Explaining away
– a pattern of probabilistic reasoning characteristic of Bayesian networks, especially in their early use in AI

Explaining away
(network: Rain → Grass Wet ← Sprinkler)
P(R, S, W) = P(R) P(S) P(W | S, R)
Assume grass will be wet if and only if it rained last night, or if the sprinklers were left on:
P(W = 1 | S, R) = 1 if S = 1 or R = 1; 0 if R = 0 and S = 0.

Compute the probability it rained last night, given that the grass is wet:
P(r | w) = P(w | r) P(r) / P(w)
Expanding the denominator over all values of R and S:
P(r | w) = P(w | r) P(r) / Σ_{r′, s′} P(w | r′, s′) P(r′) P(s′)
Under the deterministic OR, P(w | r) = 1, and the only combinations with P(w | r′, s′) = 1 are those where it rained or the sprinklers were on:
P(r | w) = P(r) / [P(r, s) + P(r, ¬s) + P(¬r, s)] = P(r) / [P(r) + P(¬r) P(s)]
The denominator lies between P(r) and 1 (depending on P(s)), so P(r | w) lies between P(r) and 1: wet grass is evidence of rain.

Explaining away
P(R, S, W) = P(R) P(S) P(W | S, R)
P(W = 1 | S, R) = 1 if S = 1 or R = 1; 0 if R = 0 and S = 0.
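The wet-grass posterior P(r | w) can be checked by direct enumeration. A minimal sketch; the priors P(r) = P(s) = 0.3 are assumed for illustration and are not given on the slides.

```python
# Check the explaining-away computation by enumeration.
# Priors are assumed for illustration (not given on the slides).
P_r, P_s = 0.3, 0.3

def P_w_given(r, s):
    # Deterministic OR: grass is wet iff it rained or the sprinklers were on.
    return 1.0 if (r or s) else 0.0

# P(w) = sum over r', s' of P(w | r', s') P(r') P(s')
P_w = sum(P_w_given(r, s) * (P_r if r else 1 - P_r) * (P_s if s else 1 - P_s)
          for r in (0, 1) for s in (0, 1))

# Bayes' rule, with P(w | r) = 1 under the OR model:
P_r_given_w = P_r / P_w
print(P_r_given_w > P_r)   # True: wet grass raises the probability of rain
```

With these priors, P(w) = P(r) + P(¬r)P(s) = 0.51 and P(r | w) ≈ 0.59, above the prior of 0.3, as the derivation predicts.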
Compute the probability it rained last night, given that the grass is wet and the sprinklers were left on:
P(r | w, s) = P(w | r, s) P(r | s) / P(w | s)
Both P(w | r, s) and P(w | s) equal 1 under the deterministic OR, so
P(r | w, s) = P(r | s) = P(r) < P(r | w) = P(r) / [P(r) + P(¬r) P(s)]
"Discounting" to the prior probability: the sprinkler explains the wet grass.

Contrast w/ production system
(Rain and Sprinkler both cause Grass Wet)
• Formulate IF-THEN rules:
– IF Rain THEN Wet
– IF Wet THEN Rain
– or: IF Wet AND NOT Sprinkler THEN Rain
• Rules do not distinguish directions of inference
• Requires a combinatorial explosion of rules

Contrast w/ spreading activation
• Excitatory links: Rain → Wet, Sprinkler → Wet
• Observing rain, Wet becomes more active
• Observing grass wet, Rain and Sprinkler become more active
• Observing grass wet and sprinkler, Rain cannot become less active. No explaining away!

Contrast w/ spreading activation
• Excitatory links: Rain → Wet, Sprinkler → Wet
• Inhibitory link between Rain and Sprinkler
• Observing grass wet, Rain and Sprinkler become more active
• Observing grass wet and sprinkler, Rain becomes less active: explaining away

Contrast w/ spreading activation
(add Burst pipe as another cause of Grass Wet)
• Each new variable requires more inhibitory connections
• Not modular
– whether a connection exists depends on what others exist
– big holism problem
– combinatorial explosion

Contrast w/ spreading activation
(figure: the interactive activation model)
(McClelland & Rumelhart, 1981)

Graphical models
• Capture dependency structure in distributions
• Provide an efficient means of representing and reasoning with probabilities
• Allow kinds of inference that are problematic for other representations: explaining away
– hard to capture in a production system
– more natural than with spreading activation

Part II: Graphical models
• Introduction to graphical models
– representation and inference
• Causal graphical models
– causality
– learning about causal relationships
• Graphical models and cognitive science
– uses of graphical models
– an example: causal induction

Causal graphical models
• Graphical models represent statistical dependencies among variables (i.e. correlations)
– can answer questions about observations
• Causal graphical models represent causal dependencies among variables (Pearl, 2000)
– express underlying causal structure
– can answer questions about both observations and interventions (actions upon a variable)
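The observation/intervention distinction can be sketched on the wet-grass network. A minimal example; the priors P(r) = P(s) = 0.3 and the deterministic-OR wetness model are assumed for illustration.

```python
# Observation vs. intervention on the rain/sprinkler/wet-grass network.
# Priors and the deterministic-OR model are illustrative assumptions.
P_r, P_s = 0.3, 0.3

def joint(r, s, w):
    # P(r, s, w) = P(r) P(s) P(w | r, s), with deterministic OR for W
    p_w = 1.0 if w == (1 if (r or s) else 0) else 0.0
    return (P_r if r else 1 - P_r) * (P_s if s else 1 - P_s) * p_w

# Observing W = 1: condition in the intact network.
obs = (sum(joint(1, s, 1) for s in (0, 1)) /
       sum(joint(r, s, 1) for r in (0, 1) for s in (0, 1)))

# Intervening, do(W = 1): cut W's incoming links (the "mutilated" network),
# so the value of W carries no evidence about its former parents.
do = P_r

print(obs > do)   # True: observing wet grass is evidence of rain,
                  # but making the grass wet is not
```

The same mutilated-network recipe applies to any intervened node: delete its incoming links, clamp its value, and compute with the remaining network.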
Bayesian networks
Nodes: variables. Links: dependency.
Four random variables:
X1: coin toss produces heads
X2: pencil levitates
X3: friend has psychic powers
X4: friend has two-headed coin
Each node has a conditional probability distribution.
Data: observations of x1, …, x4
(network: X4 → X1 ← X3 → X2, with P(x4), P(x3), P(x1|x3, x4), P(x2|x3))

Causal Bayesian networks
Nodes: variables. Links: causality.
Each node has a conditional probability distribution.
Data: observations of, and interventions on, x1, …, x4
(same network)

Interventions
• Cut all incoming links for the node that we intervene on
• Compute probabilities with the "mutilated" Bayes net
Example: hold down the pencil. Intervening on X2 cuts the link X3 → X2.

Learning causal graphical models
(two candidate structures over C, B, and E)
• Strength: how strong is a relationship?
• Structure: does a relationship exist?

Causal structure vs. causal strength
(h1: B → E ← C with strengths w0, w1; h0: B → E with strength w0)
• Strength: how strong is a relationship?
– requires defining the nature of the relationship

Parameterization
• Structures: h1 = C → E ← B; h0 = E ← B
• Generic parameterization:
C B | h1: P(E = 1 | C, B) | h0: P(E = 1 | C, B)
0 0 | p00 | p0
1 0 | p10 | p0
0 1 | p01 | p1
1 1 | p11 | p1

Parameterization
• Structures: h1 = C → E ← B (strengths w1, w0); h0 = E ← B (strength w0)
• w0, w1: strength parameters for B, C
• Linear parameterization:
C B | h1: P(E = 1 | C, B) | h0: P(E = 1 | C, B)
0 0 | 0 | 0
1 0 | w1 | 0
0 1 | w0 | w0
1 1 | w1 + w0 | w0
• "Noisy-OR" parameterization:
C B | h1: P(E = 1 | C, B) | h0: P(E = 1 | C, B)
0 0 | 0 | 0
1 0 | w1 | 0
0 1 | w0 | w0
1 1 | w1 + w0 − w1 w0 | w0

Parameter estimation
• Maximum likelihood estimation: maximize Π_i P(b_i, c_i, e_i; w0, w1)
• Bayesian methods: as in Part I

Causal structure vs. causal strength
• Structure: does a relationship exist?

Approaches to structure learning
• Constraint-based (attempts to reduce the inductive problem to a deductive problem):
– infer dependency from statistical tests (e.g.
χ²)
– deduce structure from dependencies
(Pearl, 2000; Spirtes et al., 1993)
• Bayesian:
– compute the posterior probability of structures given the observed data:
P(h | data) ∝ P(data | h) P(h)
– compare P(h1 | data) and P(h0 | data) for structures with and without a C → E link
(Heckerman, 1998; Friedman, 1999)

Bayesian Occam's Razor
(figure: P(d | h) plotted over all possible data sets d, for h0 (no relationship) and h1 (relationship))
For any model h, Σ_d P(d | h) = 1, so a more flexible model must spread its probability over more possible data sets.

Causal graphical models
• Extend graphical models to deal with interventions as well as observations
• Respecting the direction of causality results in efficient representation and inference
• Two steps in learning causal models
– strength: parameter estimation
– structure: structure learning

Part II: Graphical models
• Introduction to graphical models
– representation and inference
• Causal graphical models
– causality
– learning about causal relationships
• Graphical models and cognitive science
– uses of graphical models
– an example: causal induction

Uses of graphical models
• Understanding existing cognitive models
– e.g., neural network models
• Representation and reasoning
– a way to address holism in induction (cf. Fodor)
• Defining generative models
– mixture models, language models (see Part IV)
• Modeling human causal reasoning

Human causal reasoning
• How do people reason about interventions?
(Gopnik, Glymour, Sobel, Schulz, Kushnir & Danks, 2004; Lagnado & Sloman, 2004; Sloman & Lagnado, 2005; Steyvers, Tenenbaum, Wagenmakers & Blum, 2003)
• How do people learn about causal relationships?
– parameter estimation (Shanks, 1995; Cheng, 1997)
– constraint-based models (Glymour, 2001)
– Bayesian structure learning (Steyvers et al., 2003; Griffiths & Tenenbaum, 2005)

Causation from contingencies
"Does C cause E?" (rate on a scale from 0 to 100)
Contingency table of counts:
              C present (c+)   C absent (c−)
E present (e+)      a                c
E absent (e−)       b                d

Two models of causal judgment
• ΔP (Jenkins & Ward, 1965):
ΔP = P(e+ | c+) − P(e+ | c−) = a/(a + b) − c/(c + d)
• Power PC (Cheng, 1997):
power = ΔP / (1 − P(e+ | c−)), where 1 − P(e+ | c−) = d/(c + d)

Buehner and Cheng (1997)
(figure: human judgments vs. the predictions of ΔP and causal power, for contingencies with P(e+|c+) and P(e+|c−) ranging over 0.00, 0.25, 0.50, 0.75, 1.00)
• Conditions with constant ΔP produce changing judgments
• Conditions with constant causal power produce changing judgments
• Conditions with ΔP = 0 produce changing judgments

Causal structure vs. causal strength
(h1: B → E ← C with strengths w0, w1; h0: B → E with strength w0)
• Strength: how strong is a relationship?
• Structure: does a relationship exist?

Causal strength
• Assume the structure B → E ← C, with strengths w0, w1
• ΔP and causal power are maximum likelihood estimates of the strength parameter w1, under different parameterizations for P(E | B, C):
– linear parameterization → ΔP
– Noisy-OR parameterization → causal power

Causal structure
• Hypotheses: h1 = C → E ← B; h0 = E ← B
• Bayesian causal inference: support = P(d | h1) / P(d | h0)
– the likelihood ratio (Bayes factor) gives the evidence in favor of h1
P(d | h1) = ∫∫ P(d | w0, w1) p(w0, w1 | h1) dw0 dw1
P(d | h0) = ∫ P(d | w0) p(w0 | h0) dw0

Buehner and Cheng (1997)
(figure: model fits to human judgments)
• ΔP: r = 0.89; causal power: r = 0.88; support: r = 0.97

The importance of parameterization
• Noisy-OR incorporates mechanism assumptions:
– generativity: causes increase the probability of their effects
– each cause is sufficient to produce the effect
– causes act via independent mechanisms
(Cheng, 1997)
• Consider other models:
– statistical dependence: χ² test
– generic parameterization (cf.
Anderson, 1990)
(figure: human judgments vs. support computed with the Noisy-OR parameterization, the χ² test, and support computed with the generic parameterization)

Generativity is essential
(figure: support predictions for contingencies with P(e+|c+) = P(e+|c−) = 8/8, 6/8, 4/8, 2/8, 0/8)
• Predictions result from a "ceiling effect"
– ceiling effects only matter if you believe a cause increases the probability of an effect

Blicket detector (Dave Sobel, Alison Gopnik, and colleagues)
• "See this? It's a blicket machine. Blickets make it go."
• "Let's put this one on the machine. Oooh, it's a blicket!"
• Children see which objects activate the detector, are asked if each object is a blicket, and are then asked to make the machine go

"Backwards blocking"
Figure 13: Procedure used in Sobel et al.
(2002), Experiment 2. One-cause condition: both objects activate the detector on the AB trial, then object A does not activate the detector by itself. Backward blocking condition: both objects activate the detector on the AB trial, then object A activates the detector by itself. In each condition children are asked if each object is a blicket, then asked to make the machine go.

"Backwards blocking" (Sobel, Tenenbaum & Gopnik, 2004)
• Two objects: A and B
• Trial 1: A and B on detector – detector active
• Trial 2: A on detector – detector active
• 4-year-olds judge whether each object is a blicket
– A: a blicket (100% say yes)
– B: probably not a blicket (34% say yes)

Possible hypotheses
(figure: the candidate causal graphs relating A, B, and E)

Bayesian inference
• Evaluating causal models in light of data:
P(hi | d) = P(d | hi) P(hi) / Σ_j P(d | hj) P(hj)
• Inferring a particular causal relation:
P(A → E | d) = Σ_{hj ∈ H} P(A → E | hj) P(hj | d)

Bayesian inference
• With a uniform prior on hypotheses, and the generic parameterization
(figure: probability of being a blicket for objects A and B; values 0.32, 0.32, 0.34, 0.34)

Modeling backwards blocking
Assume:
• Links can only exist from blocks to detectors
• Blocks are blickets with prior probability q
• Blickets always activate detectors, but detectors never activate on their own
– deterministic Noisy-OR, with wi = 1 and w0 = 0

Modeling backwards blocking
Four hypotheses about links from A and B to E:
h00 (no links): P(h00) = (1 − q)²
h01 (B → E only): P(h01) = (1 − q) q
h10 (A → E only): P(h10) = q (1 − q)
h11 (both links): P(h11) = q²
Predictions of each hypothesis (columns h00, h01, h10, h11):
P(E=1 | A=0, B=0): 0, 0, 0, 0
P(E=1 | A=1, B=0): 0, 0, 1, 1
P(E=1 | A=0, B=1): 0, 1, 0, 1
P(E=1 | A=1, B=1): 0, 1, 1, 1
Prior: P(B → E) = P(h01) + P(h11) = q

After the AB trial (E = 1 when A = 1, B = 1), h00 is ruled out:
P(B → E | d) = [P(h01) + P(h11)] / [P(h01) + P(h10) + P(h11)] = q / [q + q(1 − q)]

After the A trial (E = 1 when A = 1, B = 0), h01 is also ruled out:
P(B → E | d) = P(h11) / [P(h10) + P(h11)] = q

Manipulating prior probability
(Tenenbaum, Sobel, Griffiths, & Gopnik, submitted)
(figure: procedure used in Sobel et al. (2002), Experiment 2, with one-cause and backward blocking conditions: an initial phase, an AB trial, and an A trial)

Summary
• Graphical models provide solutions to many of the challenges of probabilistic models
– defining structured distributions
– representing
distributions on many variables
– efficiently computing probabilities
• Causal graphical models provide tools for defining rational models of human causal reasoning and learning
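The backwards-blocking computation above can be reproduced by enumerating the four hypotheses. A sketch assuming the deterministic noisy-OR detector from the slides (wi = 1, w0 = 0) and independent prior probability q that each block is a blicket; the function name is mine.

```python
from itertools import product

def posterior_B_is_blicket(trials, q=0.5):
    """Posterior that B is a blicket, enumerating h00, h01, h10, h11.

    trials: list of (a_on, b_on, detector_active) tuples.
    Detector model: deterministic noisy-OR with wi = 1, w0 = 0, so the
    detector activates iff some blicket is on it.
    """
    post = {}
    for a_link, b_link in product((0, 1), repeat=2):
        prior = (q if a_link else 1 - q) * (q if b_link else 1 - q)
        like = 1.0
        for a_on, b_on, active in trials:
            pred = 1 if (a_on and a_link) or (b_on and b_link) else 0
            like *= 1.0 if pred == active else 0.0
        post[(a_link, b_link)] = prior * like
    z = sum(post.values())
    # B -> E holds in the hypotheses with b_link = 1
    return (post[(0, 1)] + post[(1, 1)]) / z

q = 0.5
print(posterior_B_is_blicket([], q))                      # prior: q
print(posterior_B_is_blicket([(1, 1, 1)], q))             # rises after AB trial
print(posterior_B_is_blicket([(1, 1, 1), (1, 0, 1)], q))  # back to q after A trial
```

With q = 0.5 the posterior for B goes 0.5 → 2/3 → 0.5: the A trial "explains away" the AB trial, which is the backwards-blocking pattern in the slides.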