# Bayesian models of inductive learning

Josh Tenenbaum & Tom Griffiths
MIT
Computational Cognitive Science Group
Department of Brain and Cognitive Sciences
Computer Science and AI Lab (CSAIL)
What to expect
• What you’ll get out of this tutorial:
– Our view of what Bayesian models have to offer
cognitive science.
– In-depth examples of basic and advanced models:
how the math works & what it buys you.
– Some comparison to other approaches.
– Opportunities to ask questions.
• What you won’t get:
– Detailed, hands-on how-to.
http://bayesiancognition.com
Outline
• Morning
– Introduction (Josh)
– Basic case study #1: Flipping coins (Tom)
– Basic case study #2: Rules and similarity (Josh)
• Afternoon
– Advanced case study #1: Causal induction (Tom)
– Advanced case study #2: Property induction (Josh)
– Quick tour of more advanced topics (Tom)
Bayesian models in cognitive science
• Vision
• Motor control
• Memory
• Language
• Inductive learning and reasoning….
Everyday inductive leaps
• Learning concepts and words from examples
“horse”
“horse”
“horse”
Learning concepts and words
“tufa”
“tufa”
“tufa”
Can you pick out the tufas?
Inductive reasoning
Input (premises):
Cows can get Hick’s disease.
Gorillas can get Hick’s disease.
Conclusion:
All mammals can get Hick’s disease.
Task: Judge how likely the conclusion is to be true, given that the premises are true.
Inferring causal relations
Input:

| Day | Took vitamin B23 | Headache |
|-----|------------------|----------|
| 1   | yes              | no       |
| 2   | yes              | yes      |
| 3   | no               | yes      |
| 4   | yes              | no       |
| ... | ...              | ...      |

Does vitamin B23 cause headaches?
Task: Judge the probability of a causal link given several joint observations.
Everyday inductive leaps
How can we learn so much about . . .
– Properties of natural kinds
– Meanings of words
– Future outcomes of a dynamic process
– Hidden causal properties of an object
– Causes of a person’s action (beliefs, goals)
– Causal laws governing a domain
. . . from such limited data?
The Challenge
• How do we generalize successfully from very
limited data?
– Just one or a few examples
– Often only positive examples
• Philosophy:
– Induction is a “problem”, a “riddle”, a “paradox”,
a “scandal”, or a “myth”.
• Machine learning and statistics:
– Focus on generalization from many examples,
both positive and negative.
Rational statistical inference
(Bayes, Laplace)
Posterior probability ∝ Likelihood × Prior probability

$$p(h \mid d) = \frac{p(d \mid h)\, p(h)}{\sum_{h' \in H} p(d \mid h')\, p(h')}$$

The denominator sums over the space of hypotheses H.
Bayesian models of inductive
learning: some recent history
• Shepard (1987)
– Analysis of one-shot stimulus generalization, to explain
the universal exponential law.
• Anderson (1990)
– Models of categorization and causal induction.
• Oaksford & Chater (1994)
– Model of conditional reasoning (Wason selection task).
• Heit (1998)
– Framework for category-based inductive reasoning.
Theory-Based Bayesian Models
• Rational statistical inference (Bayes):
$$p(h \mid d) = \frac{p(d \mid h)\, p(h)}{\sum_{h' \in H} p(d \mid h')\, p(h')}$$
• Learners’ domain theories generate their
hypothesis space H and prior p(h).
– Well-matched to structure of the natural world.
– Learnable from limited data.
– Computationally tractable inference.
What is a theory?
• Working definition
– An ontology and a system of abstract principles
that generates a hypothesis space of candidate
world structures along with their relative
probabilities.
• Analogy to grammar in language.
• Example: Newton’s laws
Structure and statistics
• A framework for understanding how structured
knowledge and statistical inference interact.
– How structured knowledge guides statistical inference, and is
itself acquired through higher-order statistical learning.
– How simplicity trades off with fit to the data in evaluating
structural hypotheses.
– How increasingly complex structures may grow as required
by new data, rather than being pre-specified in advance.
Structure and statistics
• A framework for understanding how structured
knowledge and statistical inference interact.
– How structured knowledge guides statistical inference, and is
itself acquired through higher-order statistical learning.
Hierarchical Bayes.
– How simplicity trades off with fit to the data in evaluating
structural hypotheses.
Bayesian Occam’s Razor.
– How increasingly complex structures may grow as required
by new data, rather than being pre-specified in advance.
Non-parametric Bayes.
Alternative approaches to
inductive generalization
• Associative learning
• Connectionist networks
• Similarity to examples
• Toolkit of simple heuristics
• Constraint satisfaction
• Analogical mapping
Marr’s Three Levels of Analysis
• Computation:
“What is the goal of the computation, why is it
appropriate, and what is the logic of the
strategy by which it can be carried out?”
• Representation and algorithm:
Cognitive psychology
• Implementation:
Neurobiology
Why Bayes?
• A framework for explaining cognition.
– How people can learn so much from such limited data.
– Why process-level models work the way that they do.
– Strong quantitative models with minimal ad hoc assumptions.
• A framework for understanding how structured
knowledge and statistical inference interact.
– How structured knowledge guides statistical inference, and is
itself acquired through higher-order statistical learning.
– How simplicity trades off with fit to the data in evaluating
structural hypotheses (Occam’s razor).
– How increasingly complex structures may grow as required
by new data, rather than being pre-specified in advance.
Outline
• Morning
– Introduction (Josh)
– Basic case study #1: Flipping coins (Tom)
– Basic case study #2: Rules and similarity (Josh)
• Afternoon
– Advanced case study #1: Causal induction (Tom)
– Advanced case study #2: Property induction (Josh)
– Quick tour of more advanced topics (Tom)
Coin flipping
Coin flipping
HHTHT
HHHHH
What process produced these sequences?
Bayes’ rule
For data D and a hypothesis H, we have:
$$P(H \mid D) = \frac{P(H)\, P(D \mid H)}{P(D)}$$
• “Posterior probability”: P ( H | D )
• “Prior probability”: P(H )
• “Likelihood”: P ( D | H )
The origin of Bayes’ rule
• A simple consequence of using probability
to represent degrees of belief
• For any two random variables:
$$p(A \& B) = p(A)\, p(B \mid A)$$
$$p(A \& B) = p(B)\, p(A \mid B)$$
$$\Rightarrow\quad p(B)\, p(A \mid B) = p(A)\, p(B \mid A)$$
$$\Rightarrow\quad p(A \mid B) = \frac{p(A)\, p(B \mid A)}{p(B)}$$
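Bayes’ rule over a discrete hypothesis space can be sketched in a few lines. This is a minimal illustration, not code from the tutorial; the hypotheses and numbers (a “fair” vs. a hypothetical “two-headed” coin) are made up for the example.

```python
# Minimal sketch: Bayes' rule over a small discrete hypothesis space.
# Hypotheses and numbers are illustrative, not from the tutorial.

def posterior(priors, likelihoods):
    """Return P(h | d) for each hypothesis h, given P(h) and P(d | h)."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(joint.values())  # P(d) = sum_h P(d | h) P(h)
    return {h: joint[h] / evidence for h in joint}

priors = {"fair": 0.99, "two-headed": 0.01}
likelihoods = {"fair": 0.5 ** 3, "two-headed": 1.0}  # P(HHH | h)
post = posterior(priors, likelihoods)
print(post)
```

Even after three heads in a row, the strong prior keeps most of the posterior mass on “fair”.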
Why represent degrees of belief
with probabilities?
• Good statistics
– consistency, and worst-case error bounds.
• Cox Axioms
– necessary to cohere with common sense
• “Dutch Book” + Survival of the Fittest
– if your beliefs do not accord with the laws of
probability, then you can always be out-gambled by
someone whose beliefs do so accord.
• Provides a theory of learning
– a common currency for combining prior knowledge and
the lessons of experience.
Bayes’ rule
For data D and a hypothesis H, we have:
$$P(H \mid D) = \frac{P(H)\, P(D \mid H)}{P(D)}$$
• “Posterior probability”: P ( H | D )
• “Prior probability”: P(H )
• “Likelihood”: P ( D | H )
Hypotheses in Bayesian inference
• Hypotheses H refer to processes that could
have generated the data D
• Bayesian inference provides a distribution
over these hypotheses, given D
• P(D|H) is the probability of D being
generated by the process identified by H
• Hypotheses H are mutually exclusive: only
one process could have generated D
Hypotheses in coin flipping
Describe processes by which D could be generated
D = HHTHT
• Fair coin, P(H) = 0.5
• Coin with P(H) = p
• Markov model
• Hidden Markov model
• ...
statistical models
Hypotheses in coin flipping
Describe processes by which D could be generated
D = HHTHT
• Fair coin, P(H) = 0.5
• Coin with P(H) = p
• Markov model
• Hidden Markov model
• ...
generative models
Representing generative models
• Graphical model notation
– Pearl (1988), Jordan (1998)
• Variables are nodes, edges
indicate dependency
• Directed edges show causal
process of data generation
[Figure: the sequence HHTHT as observed nodes d1 … d5. Under “fair coin, P(H) = 0.5” the di are independent; under the Markov model each di depends on the previous flip.]
Models with latent structure
• Not all nodes in a graphical
model need to be observed
• Some variables reflect latent
structure, used in generating
D but unobserved
[Figure: HHTHT as observed nodes d1 … d5. In the “P(H) = p” model a latent node p is the parent of every di; in the hidden Markov model latent states s1 … s4 generate the observations.]
Coin flipping
• Comparing two simple hypotheses
– P(H) = 0.5 vs. P(H) = 1.0
• Comparing simple and complex hypotheses
– P(H) = 0.5 vs. P(H) = p
• Comparing infinitely many hypotheses
– P(H) = p
• Psychology: Representativeness
Comparing two simple hypotheses
• Contrast simple hypotheses:
– H1: “fair coin”, P(H) = 0.5
– H2:“always heads”, P(H) = 1.0
• Bayes’ rule:
$$P(H \mid D) = \frac{P(H)\, P(D \mid H)}{P(D)}$$
• With two hypotheses, use odds form
Bayes’ rule in odds form

$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$$

– D: data
– H1, H2: models
– P(H1|D): posterior probability that H1 generated the data
– P(D|H1): likelihood of the data under model H1
– P(H1): prior probability that H1 generated the data
Coin flipping
HHTHT
HHHHH
What process produced these sequences?
Comparing two simple hypotheses
D: HHTHT
H1, H2: “fair coin”, “always heads”
P(D|H1) = 1/2^5, P(H1) = 999/1000
P(D|H2) = 0, P(H2) = 1/1000
P(H1|D) / P(H2|D) = infinity
Comparing two simple hypotheses
D: HHHHH
H1, H2: “fair coin”, “always heads”
P(D|H1) = 1/2^5, P(H1) = 999/1000
P(D|H2) = 1, P(H2) = 1/1000
P(H1|D) / P(H2|D) ≈ 30
Comparing two simple hypotheses
D: HHHHHHHHHH
H1, H2: “fair coin”, “always heads”
P(D|H1) = 1/2^10, P(H1) = 999/1000
P(D|H2) = 1, P(H2) = 1/1000
P(H1|D) / P(H2|D) ≈ 1
Comparing two simple hypotheses
• Bayes’ rule tells us how to combine prior
beliefs with new data
– top-down and bottom-up influences
• As a model of human inference
– predicts conclusions drawn from data
– identifies point at which prior beliefs are
overwhelmed by new experiences
• But… more complex cases?
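The three worked comparisons above can be sketched directly in code, using the slides’ priors P(H1) = 999/1000 and P(H2) = 1/1000:

```python
# Sketch of the odds-form comparisons: "fair coin" (H1) vs "always heads" (H2),
# with the priors used on the slides.

def posterior_odds(seq, prior1=999/1000, prior2=1/1000):
    n = len(seq)
    like1 = 0.5 ** n                            # fair coin: P(D|H1) = 1/2^n
    like2 = 1.0 if set(seq) == {"H"} else 0.0   # always heads: 1 iff all heads
    if like2 == 0:
        return float("inf")
    return (like1 / like2) * (prior1 / prior2)

print(posterior_odds("HHTHT"))   # infinity: "always heads" is ruled out
print(posterior_odds("HHHHH"))   # about 31
print(posterior_odds("H" * 10))  # about 1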
Coin flipping
• Comparing two simple hypotheses
– P(H) = 0.5 vs. P(H) = 1.0
• Comparing simple and complex hypotheses
– P(H) = 0.5 vs. P(H) = p
• Comparing infinitely many hypotheses
– P(H) = p
• Psychology: Representativeness
Comparing simple and complex hypotheses
[Figure: graphical models for “fair coin, P(H) = 0.5” (independent nodes d1 … d4) vs. “P(H) = p” (latent p as parent of d1 … d4).]
• Which provides a better account of the data:
the simple hypothesis of a fair coin, or the
complex hypothesis that P(H) = p?
Comparing simple and complex hypotheses
• P(H) = p is more complex than P(H) = 0.5 in
two ways:
– P(H) = 0.5 is a special case of P(H) = p
– for any observed sequence X, we can choose p
such that X is more probable than if P(H) = 0.5
Comparing simple and complex hypotheses
[Figure: the probability of each sequence under P(H) = p. For HHHHH the best fit is p = 1.0; for HHTHT the best fit is p = 0.6.]
Comparing simple and complex hypotheses
• P(H) = p is more complex than P(H) = 0.5 in
two ways:
– P(H) = 0.5 is a special case of P(H) = p
– for any observed sequence X, we can choose p
such that X is more probable than if P(H) = 0.5
• How can we deal with this?
– frequentist: hypothesis testing
– information theorist: minimum description length
– Bayesian: just use probability theory!
Comparing simple and complex hypotheses
$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$$

Computing P(D|H1) is easy:

$$P(D \mid H_1) = 1/2^N$$

Compute P(D|H2) by averaging over p:

$$P(D \mid H_2) = \int_0^1 P(D \mid p)\, p(p)\, dp$$
[Figure: the resulting distribution over sequences is an average over all values of p.]
Comparing simple and complex hypotheses
• Simple and complex hypotheses can be
compared directly using Bayes’ rule
– requires summing over latent variables
• Complex hypotheses are penalized for their
greater flexibility: “Bayesian Occam’s razor”
• This principle is used in model selection
methods in psychology (e.g. Myung &amp; Pitt, 1997)
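The Bayesian Occam’s razor here can be checked numerically. With a uniform prior on p, the average over p has the closed form ∫ p^NH (1−p)^NT dp = 1 / ((N+1)·C(N, NH)); this sketch compares it with the fair-coin likelihood 1/2^N for the sequence HHTHT:

```python
from math import comb

# Sketch: marginal likelihood of the "complex" hypothesis P(H) = p,
# averaging P(D | p) over a uniform prior on p.
# Closed form: integral of p^NH (1-p)^NT dp = 1 / ((N+1) * C(N, NH)).

def marginal_likelihood(n_heads, n_tails):
    n = n_heads + n_tails
    return 1 / ((n + 1) * comb(n, n_heads))

# D = HHTHT: 3 heads, 2 tails.
simple = 0.5 ** 5                     # P(D | fair) = 1/32
complex_ = marginal_likelihood(3, 2)  # P(D | p-model) = 1/60
print(simple, complex_)
```

For HHTHT the fair coin wins despite the complex model’s flexibility to fit any sequence; for HHHHH the averaged model wins (1/6 vs. 1/32), which is the Bayesian Occam’s razor at work.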
Coin flipping
• Comparing two simple hypotheses
– P(H) = 0.5 vs. P(H) = 1.0
• Comparing simple and complex hypotheses
– P(H) = 0.5 vs. P(H) = p
• Comparing infinitely many hypotheses
– P(H) = p
• Psychology: Representativeness
Comparing infinitely many hypotheses
• Assume data are generated from a model:
[Figure: graphical model with latent p as the parent of d1 … d4; P(H) = p.]
• What is the value of p?
– each value of p is a hypothesis H
– requires inference over infinitely many hypotheses
Comparing infinitely many hypotheses
• Flip a coin 10 times and see 5 heads, 5 tails.
• P(H) on next flip? 50%
• Why? 50% = 5 / (5+5) = 5/10.
• “Future will be like the past.”
• Suppose we had seen 4 heads and 6 tails.
• P(H) on next flip? Closer to 50% than to 40%.
• Why? Prior knowledge.
Integrating prior knowledge and data
$$P(H \mid D) = \frac{P(H)\, P(D \mid H)}{P(D)}$$

$$P(p \mid D) \propto P(D \mid p)\, P(p)$$
• Posterior distribution P(p | D) is a probability
density over p = P(H)
• Need to work out likelihood P(D | p) and
specify prior distribution P(p)
Likelihood and prior
• Likelihood:

$$P(D \mid p) = p^{N_H} (1-p)^{N_T}$$

– NH: number of heads
– NT: number of tails
• Prior:

$$P(p) \propto p^{F_H - 1} (1-p)^{F_T - 1}$$
A simple method of specifying priors
• Imagine some fictitious trials, reflecting a
set of previous experiences
– strategy often used with neural networks
• e.g., F ={1000 heads, 1000 tails} ~ strong
expectation that any new coin will be fair
• In fact, this is a sensible statistical idea...
Likelihood and prior
• Likelihood:

$$P(D \mid p) = p^{N_H} (1-p)^{N_T}$$

– NH: number of heads
– NT: number of tails
• Prior:

$$P(p) \propto p^{F_H - 1} (1-p)^{F_T - 1} \quad\sim\quad \mathrm{Beta}(F_H, F_T)$$

– FH: fictitious observations of heads
– FT: fictitious observations of tails
Conjugate priors
• Exist for many standard distributions
– formula for exponential family conjugacy
• Define prior in terms of fictitious observations
• Beta is conjugate to Bernoulli (coin-flipping)
[Figure: Beta(FH, FT) densities for FH = FT = 1 (flat), FH = FT = 3 (weakly peaked at p = 0.5), and FH = FT = 1000 (sharply peaked at p = 0.5).]
Likelihood and prior
• Likelihood:

$$P(D \mid p) = p^{N_H} (1-p)^{N_T}$$

– NH: number of heads
– NT: number of tails
• Prior:

$$P(p) \propto p^{F_H - 1} (1-p)^{F_T - 1}$$

– FH: fictitious observations of heads
– FT: fictitious observations of tails
Comparing infinitely many hypotheses
$$P(p \mid D) \propto P(D \mid p)\, P(p) = p^{N_H + F_H - 1} (1-p)^{N_T + F_T - 1}$$

• Posterior is Beta(NH+FH, NT+FT)
– same form as the conjugate prior
• Posterior mean: (NH+FH) / (NH+FH+NT+FT)
• Posterior predictive distribution: P(H on the next flip) = (NH+FH) / (NH+FH+NT+FT)
Some examples
• e.g., F ={1000 heads, 1000 tails} ~ strong
expectation that any new coin will be fair
• After seeing 4 heads, 6 tails, P(H) on next
flip = 1004 / (1004+1006) = 49.95%
• e.g., F ={3 heads, 3 tails} ~ weak
expectation that any new coin will be fair
• After seeing 4 heads, 6 tails, P(H) on next
flip = 7 / (7+9) = 43.75%
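The conjugate updating rule behind these examples is one line of arithmetic; this sketch reproduces the numbers from the slides:

```python
# Sketch of Beta-Bernoulli conjugate updating: with a Beta(FH, FT) prior,
# the predicted probability of heads after NH heads and NT tails is
# (NH + FH) / (NH + FH + NT + FT).

def predict_heads(n_h, n_t, f_h, f_t):
    return (n_h + f_h) / (n_h + f_h + n_t + f_t)

print(predict_heads(4, 6, 1000, 1000))  # strong fair prior: 1004/2010 = 49.95%
print(predict_heads(4, 6, 3, 3))        # weak fair prior: 7/16 = 43.75%
print(predict_heads(2, 0, 4, 3))        # thumbtack example: 6/9 = 67%
```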
Prior knowledge too weak
But… flipping thumbtacks
• e.g., F ={4 heads, 3 tails} ~ weak expectation
that tacks are slightly biased towards heads
• After seeing 2 heads, 0 tails, P(H) on next flip
= 6 / (6+3) = 67%
• Some prior knowledge is always necessary to
avoid jumping to hasty conclusions...
• Suppose F = { }: After seeing 2 heads, 0 tails,
P(H) on next flip = 2 / (2+0) = 100%
Origin of prior knowledge
• Tempting answer: prior experience
• Suppose you have previously seen 2000
coin flips: 1000 heads, 1000 tails
• By assuming all coins (and flips) are alike,
these observations of other coins are as
good as observations of the present coin
Problems with simple empiricism
• Haven’t really seen 2000 coin flips, or any flips of a
thumbtack
– Prior knowledge is stronger than raw experience justifies
• Haven’t seen exactly equal number of heads and tails
– Prior knowledge is smoother than raw experience justifies
• Should be a difference between observing 2000 flips
of a single coin versus observing 10 flips each for 200
coins, or 1 flip each for 2000 coins
– Prior knowledge is more structured than raw experience
A simple theory
• “Coins are manufactured by a standardized
procedure that is effective but not perfect.”
– Justifies generalizing from previous coins to the
present coin.
– Justifies smoother and stronger prior than raw
experience alone.
– Explains why seeing 10 flips each for 200 coins is
more valuable than seeing 2000 flips of one coin.
• “Tacks are asymmetric, and manufactured to
less exacting standards.”
Limitations
• Can all domain knowledge be represented
so simply, in terms of an equivalent number
of fictional observations?
• Suppose you flip a coin 25 times and get all heads.
Something funny is going on…
• But with F = {1000 heads, 1000 tails}, P(H) on next flip
= 1025 / (1025+1000) = 50.6%. Looks like nothing unusual.
Hierarchical priors
• Higher-order hypothesis: is this coin fair or unfair?
• Example probabilities:
– P(fair) = 0.99
– P(p|fair) is Beta(1000,1000)
– P(p|unfair) is Beta(1,1)
• 25 heads in a row propagates up, affecting p and then P(fair|D)
[Figure: graphical model with node “fair” above p, above d1 … d4.]
• Posterior odds: P(fair | 25 heads) / P(unfair | 25 heads) ≈ 9 × 10^-5
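That posterior odds can be checked with Beta-function arithmetic; this sketch (using `math.lgamma` for numerical stability, not code from the tutorial) computes the marginal likelihood of 25 heads under each branch of the hierarchical prior:

```python
from math import lgamma, exp

# Sketch of the hierarchical model: P(fair) = 0.99,
# p | fair ~ Beta(1000, 1000), p | unfair ~ Beta(1, 1).
# After 25 heads in a row, the "unfair" branch takes over.

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def marginal(n_h, n_t, f_h, f_t):
    # integral of p^NH (1-p)^NT Beta(p; FH, FT) dp
    # = B(NH+FH, NT+FT) / B(FH, FT)
    return exp(log_beta(n_h + f_h, n_t + f_t) - log_beta(f_h, f_t))

odds = (marginal(25, 0, 1000, 1000) * 0.99) / (marginal(25, 0, 1, 1) * 0.01)
print(odds)  # on the order of 1e-4: the coin looks unfair
```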
More hierarchical priors
• Latent structure can capture coin variability: shared parameters FH, FT at the top level; for each coin, p ~ Beta(FH, FT) generates that coin’s flips.
[Figure: graphical model with FH, FT above per-coin parameters p for Coin 1, Coin 2, …, Coin 200, each with flips d1 … d4.]
• 10 flips from 200 coins is better than 2000 flips from a single coin: it allows estimation of FH, FT.
Yet more hierarchical priors
[Figure: physical knowledge at the top level, above FH, FT, above the per-coin parameters p and their flips d1 … d4.]
• Discrete beliefs (e.g. symmetry) can influence estimation of continuous properties (e.g. FH, FT)
Comparing infinitely many hypotheses
• Apply Bayes’ rule to obtain posterior
probability density
• Requires prior over all hypotheses
– computation simplified by conjugate priors
– richer structure with hierarchical priors
• Hierarchical priors indicate how simple
theories can inform statistical inferences
– one step towards structure and statistics
Coin flipping
• Comparing two simple hypotheses
– P(H) = 0.5 vs. P(H) = 1.0
• Comparing simple and complex hypotheses
– P(H) = 0.5 vs. P(H) = p
• Comparing infinitely many hypotheses
– P(H) = p
• Psychology: Representativeness
Psychology: Representativeness
Which sequence is more likely from a fair coin?
HHTHT
HHHHH
more representative
of a fair coin
(Kahneman & Tversky, 1972)
What might representativeness mean?
Evidence for a random generating process:

$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$$

– H1: random process (fair coin)
– H2: alternative processes
– The likelihood ratio carries the evidence.
A constrained hypothesis space
Four hypotheses:
– h1: fair coin (e.g. HHTHTTTH)
– h2: “always alternates” (e.g. HTHTHTHT)
– h3: “mostly heads” (e.g. HHTHTHHH)
– h4: “always heads” (e.g. HHHHHHHH)
Representativeness judgments
Results
• Good account of representativeness data, with three pseudo-free parameters, r = 0.91
– “always alternates” means alternating 99% of the time
– “mostly heads” means P(H) = 0.85
– “always heads” means P(H) = 0.99
• With a scaling parameter, r = 0.95
(Tenenbaum & Griffiths, 2001)
The role of theories
The fact that HHTHT looks representative of
a fair coin and HHHHH does not reflects our
implicit theories of how the world works.
– Easy to imagine how a trick all-heads coin
could work: high prior probability.
– Hard to imagine how a trick “HHTHT” coin
could work: low prior probability.
Summary
• Three kinds of Bayesian inference
– comparing two simple hypotheses
– comparing simple and complex hypotheses
– comparing an infinite number of hypotheses
• Critical notions:
– generative models, graphical models
– Bayesian Occam’s razor
– priors: conjugate, hierarchical (theories)
Outline
• Morning
– Introduction (Josh)
– Basic case study #1: Flipping coins (Tom)
– Basic case study #2: Rules and similarity (Josh)
• Afternoon
– Advanced case study #1: Causal induction (Tom)
– Advanced case study #2: Property induction (Josh)
– Quick tour of more advanced topics (Tom)
Rules and similarity
Structure versus statistics
Rules
Logic
Symbols
Statistics
Similarity
Typicality
A better metaphor
Structure and statistics
Statistics
Similarity
Typicality
Rules
Logic
Symbols
Structure and statistics
• Basic case study #1: Flipping coins
– Learning and reasoning with structured
statistical models.
• Basic case study #2: Rules and similarity
– Statistical learning with structured
representations.
The number game
• Program input: number between 1 and 100
• Program output: “yes” or “no”
The number game
– Observe one or more positive (“yes”) examples.
– Judge whether other numbers are “yes” or “no”.
The number game
Examples of
“yes” numbers
Generalization
judgments (N = 20)
60
Diffuse similarity
The number game
Examples of
“yes” numbers
Generalization
judgments (N = 20)
60
Diffuse similarity
60 80 10 30
Rule:
“multiples of 10”
The number game
Examples of
“yes” numbers
Generalization
judgments (N = 20)
60
Diffuse similarity
60 80 10 30
Rule:
“multiples of 10”
60 52 57 55
Focused similarity:
numbers near 50-60
The number game
Examples of
“yes” numbers
Generalization
judgments (N = 20)
16
Diffuse similarity
16 8 2 64
Rule:
“powers of 2”
16 23 19 20
Focused similarity:
numbers near 20
The number game
60
Diffuse similarity
60 80 10 30
Rule:
“multiples of 10”
60 52 57 55
Focused similarity:
numbers near 50-60
Main phenomena to explain:
– Generalization can appear either similarity-based (graded) or rule-based (all-or-none).
– Learning from just a few positive examples.
Rule/similarity hybrid models
• Category learning
– Nosofsky, Palmeri et al.: RULEX
– Erickson & Kruschke: ATRIUM
Divisions into “rule” and
“similarity” subsystems
• Category learning
– Nosofsky, Palmeri et al.: RULEX
– Erickson & Kruschke: ATRIUM
• Language processing
– Pinker, Marcus et al.: Past tense morphology
• Reasoning
– Sloman
– Rips
– Nisbett, Smith et al.
Rule/similarity hybrid models
• Why two modules?
• Why do these modules work the way that they do,
and interact as they do?
• How do people infer a rule or similarity metric
from just a few positive examples?
Bayesian model
• H: Hypothesis space of possible concepts:
– h1 = {2, 4, 6, 8, 10, 12, …, 96, 98, 100} (“even numbers”)
– h2 = {10, 20, 30, 40, …, 90, 100} (“multiples of 10”)
– h3 = {2, 4, 8, 16, 32, 64} (“powers of 2”)
– h4 = {50, 51, 52, …, 59, 60} (“numbers between 50 and 60”)
– ...
Representational interpretations for H:
– Candidate rules
– Features for similarity
– “Consequential subsets” (Shepard, 1987)
Inferring hypotheses from
similarity judgment
Additive clustering (Shepard & Arabie, 1977):

$$s_{ij} = \sum_k w_k f_{ik} f_{jk}$$

– s_ij: similarity of stimuli i, j
– w_k: weight of cluster k
– f_ik: membership of stimulus i in cluster k (1 if stimulus i is in cluster k, 0 otherwise)

Equivalent to similarity as a weighted sum of common features (Tversky, 1977).
Additive clustering for the integers 0-9:

$$s_{ij} = \sum_k w_k f_{ik} f_{jk}$$

| Rank | Weight | Stimuli in cluster | Interpretation |
|------|--------|--------------------|----------------|
| 1 | .444 | 2 4 8 | powers of two |
| 2 | .345 | 0 1 2 | small numbers |
| 3 | .331 | 3 6 9 | multiples of three |
| 4 | .291 | 6 7 8 9 | large numbers |
| 5 | .255 | 2 3 4 5 6 | middle numbers |
| 6 | .216 | 1 3 5 7 9 | odd numbers |
| 7 | .214 | 1 2 3 4 | smallish numbers |
| 8 | .172 | 4 5 6 7 8 | largish numbers |
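The additive clustering equation is a weighted count of shared features; this sketch implements it for two illustrative clusters (the cluster memberships here are examples, not the full fitted solution):

```python
# Sketch of additive clustering: s_ij = sum_k w_k * f_ik * f_jk,
# i.e. similarity as a weighted sum of shared cluster memberships.
# The two clusters below are illustrative.

def similarity(i, j, clusters):
    """clusters: list of (weight, member_set) pairs."""
    return sum(w for w, members in clusters if i in members and j in members)

clusters = [
    (0.444, {2, 4, 8}),   # "powers of two"
    (0.345, {0, 1, 2}),   # "small numbers"
]
print(similarity(2, 4, clusters))  # share "powers of two" only
print(similarity(1, 2, clusters))  # share "small numbers" only
print(similarity(2, 2, clusters))  # self-similarity sums both weights
```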
Three hypothesis subspaces for
number concepts
• Mathematical properties (24 hypotheses):
– Odd, even, square, cube, prime numbers
– Multiples of small integers
– Powers of small integers
• Raw magnitude (5050 hypotheses):
– All intervals of integers with endpoints between
1 and 100.
• Approximate magnitude (10 hypotheses):
– Decades (1-10, 10-20, 20-30, …)
Hypothesis spaces and theories
• Why a hypothesis space is like a domain theory:
– Represents one particular way of classifying entities in
a domain.
– Not just an arbitrary collection of hypotheses, but a
principled system.
• What’s missing?
– Explicit representation of the principles.
• Hypothesis spaces (and priors) are generated by
theories. Some analogies:
– Grammars generate languages (and priors over
structural descriptions)
– Hierarchical Bayesian modeling
Bayesian model
• H: Hypothesis space of possible concepts:
– Mathematical properties: even, odd, square, prime, . . . .
– Approximate magnitude: {1-10}, {10-20}, {20-30}, . . . .
– Raw magnitude: all intervals between 1 and 100.
• X = {x1, . . . , xn}: n examples of a concept C.
• Evaluate hypotheses given data:
$$p(h \mid X) = \frac{p(X \mid h)\, p(h)}{p(X)}$$
– p(h) [“prior”]: domain knowledge, pre-existing biases
– p(X|h) [“likelihood”]: statistical information in examples.
– p(h|X) [“posterior”]: degree of belief that h is the true extension of C.
Bayesian model
• H: Hypothesis space of possible concepts:
– Mathematical properties: even, odd, square, prime, . . . .
– Approximate magnitude: {1-10}, {10-20}, {20-30}, . . . .
– Raw magnitude: all intervals between 1 and 100.
• X = {x1, . . . , xn}: n examples of a concept C.
• Evaluate hypotheses given data:
$$p(h \mid X) = \frac{p(X \mid h)\, p(h)}{\sum_{h' \in H} p(X \mid h')\, p(h')}$$
– p(h) [“prior”]: domain knowledge, pre-existing biases
– p(X|h) [“likelihood”]: statistical information in examples.
– p(h|X) [“posterior”]: degree of belief that h is the true extension of C.
Likelihood: p(X|h)
• Size principle: Smaller hypotheses receive greater
likelihood, and exponentially more so as n increases.
$$p(X \mid h) = \begin{cases} \left(\dfrac{1}{\mathrm{size}(h)}\right)^{n} & \text{if } x_1, \ldots, x_n \in h \\[4pt] 0 & \text{if any } x_i \notin h \end{cases}$$
• Follows from assumption of randomly sampled examples.
• Captures the intuition of a representative sample.
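The size principle is a one-line likelihood; this sketch shows the exponential preference it creates for the smaller consistent hypothesis in the number game:

```python
# Sketch of the size principle: likelihood (1/|h|)^n for hypotheses
# containing all the examples, 0 otherwise.

def likelihood(examples, hypothesis):
    if not all(x in hypothesis for x in examples):
        return 0.0
    return (1 / len(hypothesis)) ** len(examples)

even = set(range(2, 101, 2))      # "even numbers", size 50
mult10 = set(range(10, 101, 10))  # "multiples of 10", size 10
X = [60, 80, 10, 30]
print(likelihood(X, even))    # (1/50)^4
print(likelihood(X, mult10))  # (1/10)^4, i.e. 5^4 = 625 times larger
```

With four examples, “multiples of 10” already beats “even numbers” by a factor of 625; the preference grows exponentially with n.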
Illustrating the size principle
[Figure: the even numbers 2 … 100, with h1 = “even numbers” as a large region and h2 as a much smaller subset within it.]
Illustrating the size principle
[Figure: the same two hypotheses with a couple of examples observed.]
Data slightly more of a coincidence under h1
Illustrating the size principle
[Figure: the same two hypotheses with more examples observed.]
Data much more of a coincidence under h1
Bayesian Occam’s Razor
[Figure: p(D = d | M) plotted over all possible data sets d, for a simple model M1 and a more complex model M2.]
The law of “Conservation of Belief”: for any model M,

$$\sum_{d \in D} p(D = d \mid M) = 1$$
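Conservation of belief can be verified by brute force for the two coin models, summing over all 2^5 sequences of length five:

```python
from itertools import product
from math import comb

# Sketch of "conservation of belief": for any model M, the probabilities
# it assigns to all possible data sets sum to 1. Checked here for the
# fair-coin model and the p-averaged model, over all length-5 sequences.

def p_fair(seq):
    return 0.5 ** len(seq)

def p_averaged(seq):
    # integral of p^NH (1-p)^NT dp = 1 / ((N+1) * C(N, NH))
    n, n_h = len(seq), seq.count("H")
    return 1 / ((n + 1) * comb(n, n_h))

seqs = ["".join(s) for s in product("HT", repeat=5)]
total_fair = sum(p_fair(s) for s in seqs)
total_avg = sum(p_averaged(s) for s in seqs)
print(total_fair, total_avg)  # both 1.0
```

Because both models spread a fixed unit of belief, the flexible model must assign less probability to any particular unremarkable sequence.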
[Figure (recalled from the coin-flipping case study): the distribution over sequences under P(H) = p is an average over all values of p.]
Prior: p(h)
• Choice of hypothesis space embodies a strong prior:
effectively, p(h) ~ 0 for many logically possible but
conceptually unnatural hypotheses.
• Prevents overfitting by highly specific but unnatural
hypotheses, e.g. “multiples of 10 except 50 and 70”.
Prior: p(h)
• Choice of hypothesis space embodies a strong prior:
effectively, p(h) ~ 0 for many logically possible but
conceptually unnatural hypotheses.
• Prevents overfitting by highly specific but unnatural
hypotheses, e.g. “multiples of 10 except 50 and 70”.
• p(h) encodes relative weights of alternative theories:
H: Total hypothesis space
– H1: Mathematical properties (24 hypotheses), p(H1) = 1/5
• even numbers, powers of two, multiples of three, …: p(h) = p(H1) / 24
– H2: Raw magnitude (5050 hypotheses), p(H2) = 3/5
• 10-15, 20-32, 37-54, …: p(h) = p(H2) / 5050
– H3: Approximate magnitude (10 hypotheses), p(H3) = 1/5
• 10-20, 20-30, 30-40, …: p(h) = p(H3) / 10
A more complex approach to priors
• Start with a base set of regularities R and combination
operators C.
• Hypothesis space = closure of R under C.
– C = {and, or}: H = unions and intersections of regularities in R (e.g.,
“multiples of 10 between 30 and 70”).
– C = {and-not}: H = regularities in R with exceptions (e.g., “multiples
of 10 except 50 and 70”).
• Two qualitatively similar priors:
– Description length: number of combinations in C needed to generate the hypothesis from R.
– Bayesian Occam’s Razor, with model classes defined by number of combinations: more combinations → more hypotheses → lower prior.
Posterior: p(h|X)

$$p(h \mid X) = \frac{p(X \mid h)\, p(h)}{\sum_{h' \in H} p(X \mid h')\, p(h')}$$
• X = {60, 80, 10, 30}
• Why prefer “multiples of 10” over “even
numbers”? p(X|h).
• Why prefer “multiples of 10” over “multiples of
10 except 50 and 20”? p(h).
• Why does a good generalization need both high prior and high likelihood? p(h|X) ∝ p(X|h) p(h)
Bayesian Occam’s Razor
Probabilities provide a common currency for
balancing model complexity with fit to the data.
Generalizing to new objects
Given p(h|X), how do we compute p(y ∈ C | X), the probability that C applies to some new stimulus y?
Generalizing to new objects
Hypothesis averaging:
Compute the probability that C applies to some
new object y by averaging the predictions of all
hypotheses h, weighted by p(h|X):
p( y  C | X ) 
y  C | h) p ( h | X )
 p(

hH


 1 if yh

 0 if yh
h { y, X }
p(h | X )
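Hypothesis averaging combined with the size principle can be sketched in a toy number game. The three hypotheses and the uniform prior here are illustrative, not the model’s full hypothesis space:

```python
# Sketch of hypothesis averaging in a toy number game: the size-principle
# likelihood gives the posterior; generalization to y sums the posterior
# over hypotheses containing y. Uniform prior over three toy hypotheses.

hypotheses = {
    "even": set(range(2, 101, 2)),
    "multiples of 10": set(range(10, 101, 10)),
    "multiples of 5": set(range(5, 101, 5)),
}

def generalize(y, X):
    post = {}
    for name, h in hypotheses.items():
        if all(x in h for x in X):
            post[name] = (1 / len(h)) ** len(X)  # size principle
    Z = sum(post.values())
    return sum(p for name, p in post.items() if y in hypotheses[name]) / Z

X = [60, 80, 10, 30]
print(generalize(20, X))  # in every consistent hypothesis: probability 1
print(generalize(52, X))  # only in "even": a much lower probability
```

Because “multiples of 10” dominates the posterior, numbers like 52 that fall only under the broad “even” hypothesis get near-zero probability, giving the all-or-none, rule-like pattern.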
[Figure: generalization gradient after the single example 16.]
Connection to feature-based
similarity
• Additive clustering model of similarity:

$$s_{ij} = \sum_k w_k f_{ik} f_{jk}$$

• Bayesian hypothesis averaging:

$$p(y \in C \mid X) = \sum_{h \in H} p(y \in C \mid h)\, p(h \mid X)$$

• Equivalent if we identify the features f_k with hypotheses h, and the weights w_k with p(h | X).
[Figures: generalization gradients after the example sets 16 8 2 64 and 16 23 19 20.]
Model fits
[Figure: generalization judgments (N = 20) vs. the Bayesian model (r = 0.96) for the example sets 60; 60 80 10 30; 60 52 57 55.]
Model fits
[Figure: generalization judgments (N = 20) vs. the Bayesian model (r = 0.93) for the example sets 16; 16 8 2 64; 16 23 19 20.]
Summary of the Bayesian model
• How do the statistics of the examples interact with
prior knowledge to guide generalization?
posterior  likelihood  prior
• Why does generalization appear rule-based or
similarity-based?
hypothesis averaging  size principle
narrow p(h|X): all-or-none rule
Summary of the Bayesian model
• How do the statistics of the examples interact with
prior knowledge to guide generalization?
posterior  likelihood  prior
• Why does generalization appear rule-based or
similarity-based?
hypothesis averaging  size principle
Many h of similar size: broad p(h|X)
One h much smaller: narrow p(h|X)
Alternative models
• Neural networks
even
60
80
10
30
multiple
of 10
multiple
of 3
power
of 2
Alternative models
• Neural networks
• Hypothesis ranking and elimination
[Figure: a ranked list of hypotheses (1. even, 2. multiple of 10, 3. multiple of 3, 4. power of 2, …) checked against the examples 60, 80, 10, 30.]
Alternative models
• Neural networks
• Hypothesis ranking and elimination
• Similarity to exemplars
– Average similarity:

$$p(y \in C \mid X) = \frac{1}{|X|} \sum_{x_j \in X} \mathrm{sim}(y, x_j)$$

[Figure: data vs. model (r = 0.80) for the example sets 60; 60 80 10 30; 60 52 57 55.]
Alternative models
• Neural networks
• Hypothesis ranking and elimination
• Similarity to exemplars
– Max similarity:

$$p(y \in C \mid X) = \max_{x_j \in X} \mathrm{sim}(y, x_j)$$

[Figure: data vs. model (r = 0.64) for the example sets 60; 60 80 10 30; 60 52 57 55.]
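The two exemplar rules differ only in how they pool similarities. This sketch uses a hypothetical similarity function sim(y, x) = exp(−|y − x| / 10), chosen for illustration, not the one fit in the tutorial:

```python
from math import exp

# Sketch of the two exemplar models, with a hypothetical similarity
# function (an exponential decay in numerical distance).

def sim(y, x):
    return exp(-abs(y - x) / 10)

def avg_similarity(y, X):
    return sum(sim(y, x) for x in X) / len(X)

def max_similarity(y, X):
    return max(sim(y, x) for x in X)

X = [60, 52, 57, 55]
print(avg_similarity(56, X))  # near the cluster of examples: high
print(avg_similarity(90, X))  # far from every example: low
print(max_similarity(56, X))
```

Both models produce graded generalization around the examples, which is why they capture the 60 52 57 55 condition but not the all-or-none, rule-like conditions.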
Alternative models
• Neural networks
• Hypothesis ranking and elimination
• Similarity to exemplars
– Average similarity
– Max similarity
– Flexible similarity? Bayes.
Alternative models
• Neural networks
• Hypothesis ranking and elimination
• Similarity to exemplars
• Toolbox of simple heuristics
– 60: “general” similarity
– 60 80 10 30: most specific rule (“subset principle”).
– 60 52 57 55: similarity in magnitude
Why these heuristics? When to use which heuristic?
Bayes.
Summary
• Generalization from limited data possible via the
interaction of structured knowledge and statistics.
– Structured knowledge: a space of candidate rules; theories generate the hypothesis space (cf. hierarchical priors)
– Statistics: Bayesian Occam’s razor.
• Better understand the interactions between
– Rules and statistics
– Rules and similarity
– Rules and representativeness
• Explains why central but notoriously slippery
processing-level concepts work the way they do.
– Similarity
– Representativeness
Why Bayes?
• A framework for explaining cognition.
– How people can learn so much from such limited data.
– Why process-level models work the way that they do.
– Strong quantitative models with minimal ad hoc assumptions.
• A framework for understanding how structured
knowledge and statistical inference interact.
– How structured knowledge guides statistical inference, and is
itself acquired through higher-order statistical learning.
– How simplicity trades off with fit to the data in evaluating
structural hypotheses (Occam’s razor).
– How increasingly complex structures may grow as required
by new data, rather than being pre-specified in advance.
Theory-Based Bayesian Models
• Rational statistical inference (Bayes):
P(h | d) = P(d | h) P(h) / Σ_{h′ ∈ H} P(d | h′) P(h′)
• Learners’ domain theories generate their
hypothesis space H and prior p(h).
– Well-matched to structure of the natural world.
– Learnable from limited data.
– Computationally tractable inference.
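As a sketch of how this inference computes, here is Bayes’ rule over a small discrete hypothesis space; the coin-bias hypotheses, priors, and data below are illustrative, not from the tutorial.

```python
# Sketch of Bayes' rule over a discrete hypothesis space H.
# The coin-bias hypotheses, priors, and data are illustrative.

def posterior(hypotheses, likelihood, prior, data):
    # P(h | d) = P(d | h) P(h) / sum over h' in H of P(d | h') P(h')
    unnorm = {h: likelihood(data, h) * prior[h] for h in hypotheses}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

hypotheses = ["fair", "biased"]
prior = {"fair": 0.99, "biased": 0.01}

def likelihood(data, h):
    p_heads = 0.5 if h == "fair" else 0.9
    return p_heads ** data.count("H") * (1 - p_heads) ** data.count("T")

post = posterior(hypotheses, likelihood, prior, "HHHHH")
```

With enough data, the likelihood overwhelms even a strongly skeptical prior, which is the basic dynamic the case studies that follow exploit.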
Looking towards the afternoon
• How do we apply these ideas to more
natural and complex aspects of cognition?
• Where do the hypothesis spaces come
from?
• Can we formalize the contributions of
domain theories?
Outline
• Morning
– Introduction (Josh)
– Basic case study #1: Flipping coins (Tom)
– Basic case study #2: Rules and similarity (Josh)
• Afternoon
– Advanced case study #1: Causal induction (Tom)
– Advanced case study #2: Property induction (Josh)
– Quick tour of more advanced topics (Tom)
Marr’s Three Levels of Analysis
• Computation:
“What is the goal of the computation, why is it
appropriate, and what is the logic of the
strategy by which it can be carried out?”
• Representation and algorithm:
Cognitive psychology
• Implementation:
Neurobiology
Working at the computational level
• What is the statistical computational problem?
– input: data
– output: solution
• What knowledge is available to the learner?
• Where does that knowledge come from?
Theory-Based Bayesian Models
• Rational statistical inference (Bayes):
P(h | d) = P(d | h) P(h) / Σ_{h′ ∈ H} P(d | h′) P(h′)
• Learners’ domain theories generate their
hypothesis space H and prior p(h).
– Well-matched to structure of the natural world.
– Learnable from limited data.
– Computationally tractable inference.
Causality
Bayes nets and beyond...
• Increasingly popular approach to studying
human causal inferences
(e.g. Glymour, 2001; Gopnik et al., 2004)
• Three reactions:
– Bayes nets are the solution!
– Bayes nets are missing the point, not sure why…
– what is a Bayes net?
Bayes nets and beyond...
• What are Bayes nets?
– graphical models
– causal graphical models
• An example: elemental causal induction
• Beyond Bayes nets…
– other knowledge in causal induction
– formalizing causal theories
Graphical models
• Express the probabilistic dependency
structure among a set of variables (Pearl, 1988)
• Consist of
– a set of nodes, corresponding to variables
– a set of edges, indicating dependency
– a set of functions defined on the graph that together define a probability distribution
Undirected graphical models
[Graph: nodes X1, …, X5 joined by undirected edges]
• Consist of
– a set of nodes
– a set of edges
– a potential for each clique, multiplied together to yield the distribution over variables
• Examples
– statistical physics: Ising model, spin glasses
– early neural networks (e.g., Boltzmann machines)
Directed graphical models
[Graph: nodes X1, …, X5 joined by directed edges]
• Consist of
– a set of nodes
– a set of edges
– a conditional probability distribution for each node, conditioned on its parents, multiplied together to yield the distribution over variables
• Constrained to directed acyclic graphs (DAGs)
• AKA: Bayesian networks, Bayes nets
Bayesian networks and Bayes
• Two different problems
– Bayesian statistics is a method of inference
– Bayesian networks are a form of representation
• There is no necessary connection
– many users of Bayesian networks rely upon
frequentist statistical methods (e.g. Glymour)
– many Bayesian inferences cannot be easily
represented using Bayesian networks
Properties of Bayesian networks
• Efficient representation and inference
– exploiting dependency structure makes it easier
to represent and compute with probabilities
• Explaining away
– pattern of probabilistic reasoning characteristic of
Bayesian networks, especially early use in AI
Efficient representation and inference
• Three binary variables: Cavity, Toothache, Catch
• Specifying P(Cavity, Toothache, Catch) requires 7 parameters (one for each combination of values, minus one because the probabilities must sum to one)
• With n variables, we need 2^n − 1 parameters
• Here n = 3. Realistically, many more: X-ray, diet, oral hygiene, personality, . . . .
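The counting argument can be checked directly; this minimal sketch compares the full joint against the factored Cavity network described on these slides (the `bayes_net_params` helper is mine, not from the tutorial).

```python
# Parameter counting for the dentist example: full joint vs. factored network.

def full_joint_params(n):
    # Full joint over n binary variables: 2^n - 1 free parameters.
    return 2 ** n - 1

def bayes_net_params(parent_counts):
    # Each binary node needs one parameter per configuration of its parents.
    return sum(2 ** k for k in parent_counts)

print(full_joint_params(3))          # 7, as on the slide
# Cavity (no parents), Toothache and Catch each conditioned on Cavity:
print(bayes_net_params([0, 1, 1]))   # 1 + 2 + 2 = 5
```

The gap widens exponentially as variables are added, which is the point of exploiting conditional independence.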
Conditional independence
• All three variables are dependent, but Toothache
and Catch are independent given the presence or
absence of Cavity
• In probabilistic terms:
P(ache, catch | cav) = P(ache | cav) P(catch | cav)
P(¬ache, catch | cav) = P(¬ache | cav) P(catch | cav) = (1 − P(ache | cav)) P(catch | cav)
• With n evidence variables x1, …, xn, we need only 2n conditional probabilities: P(xi | cav), P(xi | ¬cav)
A simple Bayesian network
• Graphical representation of relations between a set of random variables:
[Graph: Cavity → Toothache, Cavity → Catch]
• Probabilistic interpretation: factorizing complex terms
P(A, B, C) = Π_{V ∈ {A, B, C}} P(V | parents[V])
P(Ache, Catch, Cav) = P(Ache, Catch | Cav) P(Cav)
= P(Ache | Cav) P(Catch | Cav) P(Cav)
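The factorization can be verified numerically; the CPT values in this sketch are made up for illustration.

```python
# Numeric check of the factorization
# P(Ache, Catch, Cav) = P(Ache | Cav) P(Catch | Cav) P(Cav).
# The CPT values are made up for illustration.

p_cav = {True: 0.1, False: 0.9}
p_ache = {True: 0.8, False: 0.05}    # P(ache = 1 | cav)
p_catch = {True: 0.9, False: 0.1}    # P(catch = 1 | cav)

def joint(ache, catch, cav):
    pa = p_ache[cav] if ache else 1 - p_ache[cav]
    pc = p_catch[cav] if catch else 1 - p_catch[cav]
    return pa * pc * p_cav[cav]

total = sum(joint(a, c, v) for a in (True, False)
            for c in (True, False) for v in (True, False))
print(total)   # sums to 1 (up to float rounding): a valid distribution
```

Five numbers specify all eight joint probabilities, rather than the seven a full joint would need.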
A more complex system
[Graph: Battery → Radio, Battery → Ignition; Ignition, Gas → Starts; Starts → On time to work]
• Joint distribution sufficient for any inference:
P(B, R, I, G, S, O) = P(B) P(R|B) P(I|B) P(G) P(S|I, G) P(O|S)
P(O | G) = P(O, G) / P(G) = [Σ_{B, R, I, S} P(B) P(R|B) P(I|B) P(G) P(S|I, G) P(O|S)] / P(G)
A more complex system
[Graph: Battery → Radio, Battery → Ignition; Ignition, Gas → Starts; Starts → On time to work]
• Joint distribution sufficient for any inference:
P(B, R, I, G, S, O) = P(B) P(R|B) P(I|B) P(G) P(S|I, G) P(O|S)
P(O | G) = Σ_S [ Σ_{B, I} P(B) P(I|B) P(S|I, G) ] P(O|S)
(the P(G) factors cancel, and Σ_R P(R|B) = 1)
A more complex system
Battery
Ignition
Gas
Starts
On time to work
• Joint distribution sufficient for any inference:
P(B, R, I, G, S, O) = P(B) P(R|B) P(I|B) P(G) P(S|I, G) P(O|S)
• General inference algorithm: local message passing
(belief propagation; Pearl, 1988)
– efficiency depends on sparseness of graph structure
Explaining away
[Graph: Rain → Grass Wet ← Sprinkler]
P(R, S, W) = P(R) P(S) P(W | S, R)
• Assume grass will be wet if and only if it rained last night, or if the sprinklers were left on:
P(W = w | S, R) = 1 if S = s or R = r
= 0 if R = ¬r and S = ¬s.
Explaining away
• Compute the probability it rained last night, given that the grass is wet:
P(r | w) = P(w | r) P(r) / P(w)
= P(w | r) P(r) / Σ_{r′, s′} P(w | r′, s′) P(r′, s′)
= P(r) / [P(r, s) + P(r, ¬s) + P(¬r, s)]
= P(r) / [P(r) + P(¬r) P(s)]
This lies between P(r) and 1: wet grass raises the probability of rain.
Explaining away
• Compute the probability it rained last night, given that the grass is wet and the sprinklers were left on:
P(r | w, s) = P(w | r, s) P(r | s) / P(w | s) = P(r | s) = P(r)
(both likelihood terms equal 1, and Rain is a priori independent of Sprinkler)
Explaining away
P(r | w) = P(r) / [P(r) + P(¬r) P(s)] ≥ P(r)
P(r | w, s) = P(r | s) = P(r)
“Discounting” to prior probability.
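A numeric check of this derivation, with made-up priors for Rain and Sprinkler and the deterministic OR for Wet:

```python
# Numeric check of explaining away in the Rain/Sprinkler/Grass Wet network.
# The priors below are made up for illustration.

p_r, p_s = 0.2, 0.4            # P(rain), P(sprinkler)

def p_w(r, s):
    # Deterministic OR: grass is wet iff it rained or the sprinkler ran.
    return 1.0 if (r or s) else 0.0

# P(r | w) = P(r) / (P(r) + P(~r) P(s)): wet grass raises belief in rain
p_r_given_w = p_r / (p_r + (1 - p_r) * p_s)

# P(r | w, s) = P(r): the sprinkler fully explains the wet grass
p_r_given_ws = p_r

print(p_r_given_w)             # ~0.385, up from the prior 0.2
```

Learning that the sprinkler was on “discounts” rain all the way back to its prior, exactly the pattern spreading activation fails to reproduce below.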
Contrast w/ production system
Rain
Sprinkler
Grass Wet
• Formulate IF-THEN rules:
– IF Rain THEN Wet
– IF Wet THEN Rain
– IF Wet AND NOT Sprinkler THEN Rain
• Rules do not distinguish directions of inference
• Requires a combinatorial explosion of rules
Contrast w/ spreading activation
[Graph: Rain → Grass Wet ← Sprinkler]
• Excitatory links: Rain → Wet, Sprinkler → Wet
• Observing rain, Wet becomes more active.
• Observing grass wet, Rain and Sprinkler become more active.
• Observing grass wet and sprinkler, Rain cannot become less active. No explaining away!
Contrast w/ spreading activation
[Graph: Rain → Grass Wet ← Sprinkler]
• Excitatory links: Rain → Wet, Sprinkler → Wet
• Inhibitory link: Rain — Sprinkler
• Observing grass wet, Rain and Sprinkler become more active.
• Observing grass wet and sprinkler, Rain becomes less active: explaining away.
Contrast w/ spreading activation
Rain
Burst pipe
Sprinkler
Grass Wet
• Each new variable requires more inhibitory
connections.
• Interactions between variables are not causal.
• Not modular.
– Whether a connection exists depends on what other
connections exist, in non-transparent ways.
– Big holism problem.
– Combinatorial explosion.
Graphical models
• Capture dependency structure in distributions
• Provide an efficient means of representing
and reasoning with probabilities
• Allow kinds of inference that are problematic
for other representations: explaining away
– hard to capture in a production system
– hard to capture with spreading activation
Bayes nets and beyond...
• What are Bayes nets?
– graphical models
– causal graphical models
• An example: causal induction
• Beyond Bayes nets…
– other knowledge in causal induction
– formalizing causal theories
Causal graphical models
• Graphical models represent statistical dependencies among variables (i.e., correlations)
• Causal graphical models represent causal
dependencies among variables
– express underlying causal structure
– can answer questions about both observations and
interventions (actions upon a variable)
Observation and intervention
[Graph: Battery → Ignition; Ignition, Gas → Starts; Starts → On time to work]
• Graphical model: supports inference from observations.
• Causal graphical model: also supports inference about interventions; “graph surgery” on the intervened variable produces the “mutilated graph”.
Assessing interventions
• To compute P(Y|do(X=x)), delete all edges
coming into X and reason with the resulting
Bayesian network (“do calculus”; Pearl, 2000)
• Allows a single structure to make predictions
about both observations and interventions
Causality simplifies inference
• Using a representation in which the direction of
causality is correct produces sparser graphs
• Suppose we get the direction of causality wrong, thinking that “symptoms” cause “diseases”:
[Graph: Ache → Cavity ← Catch]
• Does not capture the correlation between symptoms: we falsely believe P(Ache, Catch) = P(Ache) P(Catch).
Causality simplifies inference
• Using a representation in which the direction of
causality is correct produces sparser graphs
• Suppose we get the direction of causality wrong, thinking that “symptoms” cause “diseases”:
[Graph: Ache → Cavity ← Catch, plus a new Ache → Catch arrow]
• Inserting the new arrow allows us to capture this correlation.
• But this model is too complex: it fails to capture that P(Ache, Catch | Cav) = P(Ache | Cav) P(Catch | Cav)
Causality simplifies inference
• Using a representation in which the direction of
causality is correct produces sparser graphs
• Suppose we get the direction of causality wrong, thinking that “symptoms” cause “diseases”:
[Graph: Ache, X-ray, Catch → Cavity, with arrows among the symptoms]
• New symptoms require a combinatorial proliferation
of new arrows. This reduces efficiency of inference.
Learning causal graphical models
[Graphs: B → E alone vs. B → E with C → E]
• Strength: how strong is a relationship?
• Structure: does a relationship exist?
Causal structure vs. causal strength
[Graphs: B → E alone vs. B → E with C → E]
• Strength: how strong is a relationship?
Causal structure vs. causal strength
[Graphs: B → E (strength w0) with C → E (strength w1) vs. B → E (strength w0) alone]
• Strength: how strong is a relationship?
– requires defining the nature of the relationship
Parameterization
• Structures: h1 = {C → E, B → E};  h0 = {B → E}
• Generic parameterization:
C B   h1: P(E = 1 | C, B)   h0: P(E = 1 | C, B)
0 0          p00                   p0
1 0          p10                   p0
0 1          p01                   p1
1 1          p11                   p1
Parameterization
• Structures: h1 = {C → E, B → E};  h0 = {B → E}
w0, w1: strength parameters for B, C
• Linear parameterization:
C B   h1: P(E = 1 | C, B)   h0: P(E = 1 | C, B)
0 0           0                     0
1 0           w1                    0
0 1           w0                    w0
1 1        w1 + w0                  w0
Parameterization
• Structures: h1 = {C → E, B → E};  h0 = {B → E}
w0, w1: strength parameters for B, C
• “Noisy-OR” parameterization:
C B   h1: P(E = 1 | C, B)   h0: P(E = 1 | C, B)
0 0           0                     0
1 0           w1                    0
0 1           w0                    w0
1 1   w1 + w0 − w1 w0               w0
Parameter estimation
• Maximum likelihood estimation: maximize Π_i P(b_i, c_i, e_i; w0, w1)
• Bayesian methods: as in the “Comparing
infinitely many hypotheses” example…
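A minimal sketch of the maximum-likelihood step under the noisy-OR parameterization; the grid search and the (b, c, e) observations below are illustrative, not from the tutorial.

```python
import math

# Sketch: maximum-likelihood estimation of noisy-OR strengths (w0, w1)
# by grid search. The (b, c, e) observations are illustrative.

def noisy_or(b, c, w0, w1):
    # P(e = 1 | b, c): background and cause act as independent generative causes.
    return 1 - (1 - w0) ** b * (1 - w1) ** c

def log_likelihood(data, w0, w1):
    ll = 0.0
    for b, c, e in data:
        p = min(max(noisy_or(b, c, w0, w1), 1e-10), 1 - 1e-10)  # avoid log(0)
        ll += math.log(p) if e else math.log(1 - p)
    return ll

# Background b always present; effect occurs 3/3 with the cause, 1/3 without.
data = [(1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 0, 1), (1, 0, 0), (1, 0, 0)]
grid = [i / 20 for i in range(21)]
w0_hat, w1_hat = max(((w0, w1) for w0 in grid for w1 in grid),
                     key=lambda w: log_likelihood(data, *w))
```

For real use a continuous optimizer would replace the grid, but the grid makes the likelihood surface easy to inspect.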
Causal structure vs. causal strength
[Graphs: h1 = {C → E, B → E};  h0 = {B → E}]
• Structure: does a relationship exist?
Approaches to structure learning
• Constraint-based:
– dependency from statistical tests (e.g., χ²)
– deduce structure from dependencies
(Pearl, 2000; Spirtes et al., 1993)
Attempts to reduce an inductive problem to a deductive problem.
Approaches to structure learning
• Constraint-based:
– dependency from statistical tests (e.g., χ²)
– deduce structure from dependencies
(Pearl, 2000; Spirtes et al., 1993)
• Bayesian:
– compute the posterior probability of structures, given observed data:
P(S | data) ∝ P(data | S) P(S)
– e.g., compare P(S1 | data) and P(S0 | data) for S1 = {C → E, B → E} and S0 = {B → E}
(Heckerman, 1998; Friedman, 1999)
Causal graphical models
• Extend graphical models to deal with
interventions as well as observations
• Respecting the direction of causality results
in efficient representation and inference
• Two steps in learning causal models
– parameter estimation
– structure learning
Bayes nets and beyond...
• What are Bayes nets?
– graphical models
– causal graphical models
• An example: elemental causal induction
• Beyond Bayes nets…
– other knowledge in causal induction
– formalizing causal theories
Elemental causal induction
             E present   E absent
C present        a           b
C absent         c           d
“To what extent does C cause E?”
Causal structure vs. causal strength
[Graphs: B → E (strength w0) with C → E (strength w1) vs. B → E (strength w0) alone]
• Strength: how strong is a relationship?
• Structure: does a relationship exist?
Causal strength
• Assume the structure: B → E (strength w0), C → E (strength w1)
• Leading models (ΔP and causal power) are maximum likelihood estimates of the strength parameter w1, under different parameterizations for P(E | B, C):
– linear → ΔP;  noisy-OR → causal power
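Both strength estimates have closed forms in the 2×2 contingency counts (a, b, c, d); the example counts in this sketch are illustrative.

```python
# Delta-P and causal power from a 2x2 contingency table (a, b, c, d):
# a, b = effect present/absent with the cause; c, d = without the cause.

def delta_p(a, b, c, d):
    # Delta-P = P(e+ | c+) - P(e+ | c-)
    return a / (a + b) - c / (c + d)

def causal_power(a, b, c, d):
    # Cheng's causal power: Delta-P / (1 - P(e+ | c-))
    return delta_p(a, b, c, d) / (1 - c / (c + d))

print(delta_p(6, 2, 2, 6))        # 0.5
print(causal_power(6, 2, 2, 6))   # 0.666...
```

The two models agree when the effect never occurs without the cause and diverge as the base rate rises.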
Causal structure
• Hypotheses: h1 = {C → E, B → E};  h0 = {B → E}
• Bayesian causal inference:
support = P(data | h1) / P(data | h0)
P(data | h1) = ∫₀¹ ∫₀¹ P(data | w0, w1) p(w0, w1 | h1) dw1 dw0
P(data | h0) = ∫₀¹ P(data | w0) p(w0 | h0) dw0
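The marginal likelihoods can be approximated numerically; the sketch below uses simple grid integration with uniform strength priors (an assumption) and illustrative data counts, and reports support on a log scale.

```python
import math

# Sketch: approximate causal support, P(data | h1) / P(data | h0), by
# grid integration over noisy-OR strengths. Uniform priors on w0, w1
# are an assumption; the data counts are illustrative.

def noisy_or(w0, w1, c):
    # P(e = 1 | c) with background strength w0 and causal strength w1.
    return 1 - (1 - w0) * (1 - w1) ** c

def likelihood(data, w0, w1):
    p = 1.0
    for c, e in data:
        pe = noisy_or(w0, w1, c)
        p *= pe if e else 1 - pe
    return p

def log_support(data, n=100):
    grid = [(i + 0.5) / n for i in range(n)]
    # P(data | h1): average the likelihood over the (w0, w1) grid
    m1 = sum(likelihood(data, w0, w1) for w0 in grid for w1 in grid) / n ** 2
    # P(data | h0): no C -> E link, i.e. w1 = 0
    m0 = sum(likelihood(data, w0, 0.0) for w0 in grid) / n
    return math.log(m1 / m0)

# Contingency {6/8 with the cause, 2/8 without}: evidence for C -> E
data = [(1, 1)] * 6 + [(1, 0)] * 2 + [(0, 1)] * 2 + [(0, 0)] * 6
```

Averaging over strengths is what gives support its Occam’s-razor behavior: h1’s extra parameter is only rewarded when the data demand it.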
Buehner and Cheng (1997)
[Plot: model fits to human judgments — ΔP (r = 0.89), causal power (r = 0.88), support (r = 0.97)]
The importance of parameterization
• Noisy-OR incorporates mechanism assumptions:
– generativity: causes increase probability of effects
– each cause is sufficient to produce the effect
– causes act via independent mechanisms
(Cheng, 1997)
• Consider other models:
– statistical dependence: χ² test
– generic parameterization (Anderson; computer science)
[Plot: fits to human judgments — support (noisy-OR), χ², support (generic)]
Generativity is essential
[Plot: human judgments and model support for non-contingent data, P(e+|c+) = P(e+|c−) at 8/8, 6/8, 4/8, 2/8, 0/8]
• Predictions result from a “ceiling effect”
– ceiling effects only matter if you believe a cause increases the probability of an effect
Bayes nets and beyond...
• What are Bayes nets?
– graphical models
– causal graphical models
• An example: elemental causal induction
• Beyond Bayes nets…
– other knowledge in causal induction
– formalizing causal theories
chemicals → genes
[Figure from Hamadeh et al. (2002), Toxicological Sciences: chemicals Clofibrate, Wyeth 14,643, Gemfibrozil, and Phenobarbital linked to genes p450 2B1 and Carnitine Palmitoyl Transferase 1; an unknown Chemical X shows the same profile as the peroxisome proliferators (Clofibrate, Wyeth 14,643, Gemfibrozil)]
Using causal graphical models
• Three questions (usually solved by researcher)
– what are the variables?
– what structures are plausible?
– how do variables interact?
• How are these questions answered if causal
graphical models are used in cognition?
Bayes nets and beyond...
• What are Bayes nets?
– graphical models
– causal graphical models
• An example: elemental causal induction
• Beyond Bayes nets…
– other knowledge in causal induction
– formalizing causal theories
Theory-based causal induction
• Causal theory:
– Ontology
– Plausible relations
– Functional form
• The theory generates a hypothesis space of causal graphical models, e.g.:
h1 (prior P(h1) = α): X → Y, with background B and other variables Z
h0 (prior P(h0) = 1 − α): no X → Y link
• Hypotheses are evaluated by statistical inference: P(h | data) ∝ P(data | h) P(h)
Blicket detector
(Gopnik, Sobel, and colleagues)
• “See this? It’s a blicket machine. Blickets make it go.”
• “Let’s put this one on the machine.”
• “Oooh, it’s a blicket!”
“Blocking”
(procedure used in Sobel et al., 2002, Experiment 2)
• Two objects: A and B
• Trial 1: A on detector – detector active
• Trial 2: B on detector – detector inactive
• Trials 3, 4: A B on detector – detector active
• 3- and 4-year-olds judge whether each object is a blicket
– A: a blicket
– B: not a blicket
A deductive inference?
• Causal law: detector activates if and only if
one or more objects on top of it are blickets.
• Premises:
– Trial 1: A on detector – detector active
– Trial 2: B on detector – detector inactive
– Trials 3,4: A B on detector – detector active
• Conclusions deduced from premises and
causal law:
– A: a blicket
– B: not a blicket
“Backwards blocking”
(Sobel, Tenenbaum & Gopnik, 2004; procedure used in Sobel et al., 2002, Experiment 2)
• Two objects: A and B
• Trial 1: A B on detector – detector active
• Trial 2: A on detector – detector active
• 4-year-olds judge whether each object is a blicket
– A: a blicket (100% of judgments)
– B: probably not a blicket (66% of judgments)
Theory
• Ontology
– Types: Block, Detector, Trial
– Predicates:
Contact(Block, Detector, Trial)
Active(Detector, Trial)
• Constraints on causal relations
– For any Block b and Detector d, with prior probability q :
Cause(Contact(b,d,t), Active(d,t))
• Functional form of causal relations
– Causes of Active(d,t) are independent mechanisms, with
causal strengths wi. A background cause has strength w0.
Assume a near-deterministic mechanism: wi ~ 1, w0 ~ 0.
Theory
• Ontology
– Types: Block, Detector, Trial
– Predicates:
Contact(Block, Detector, Trial)
Active(Detector, Trial)
[Graph: A → E ← B]
A = 1 if Contact(block A, detector, trial), else 0
B = 1 if Contact(block B, detector, trial), else 0
E = 1 if Active(detector, trial), else 0
Theory
• Constraints on causal relations
– For any Block b and Detector d, with prior probability q:
Cause(Contact(b,d,t), Active(d,t))
• Hypothesis space (A → E = “A is a blicket”):
h00: no blickets, P(h00) = (1 − q)²
h10: A → E only, P(h10) = q(1 − q)
h01: B → E only, P(h01) = (1 − q) q
h11: A → E and B → E, P(h11) = q²
• No hypotheses with E → B, E → A, A → B, etc.
Theory
• Functional form of causal relations
– Causes of Active(d,t) are independent mechanisms, with causal strengths wb. A background cause has strength w0. Assume a near-deterministic mechanism: wb ~ 1, w0 ~ 0.
• Resulting likelihoods, for hypotheses h00, h01, h10, h11 with priors (1 − q)², (1 − q)q, q(1 − q), q²:
A B   h00   h01   h10   h11     (entries are P(E = 1 | A, B))
0 0    0     0     0     0
1 0    0     0     1     1
0 1    0     1     0     1
1 1    0     1     1     1
“Activation law”: E = 1 if and only if some block in contact with the detector is a blicket.
Bayesian inference
• Evaluating causal models in light of data:
P(hi | d) = P(d | hi) P(hi) / Σ_{hj ∈ H} P(d | hj) P(hj)
• Inferring a particular causal relation:
P(A → E | d) = Σ_{hj ∈ H} P(A → E | hj) P(hj | d)
Modeling backwards blocking
• Hypotheses h00, h01, h10, h11, with priors (1 − q)², (1 − q)q, q(1 − q), q² and the deterministic likelihoods P(E = 1 | A, B) given by the theory.
• Before any trials:
P(B → E | d) / P(B ↛ E | d) = [P(h01) + P(h11)] / [P(h00) + P(h10)] = q / (1 − q)
Modeling backwards blocking
• After Trial 1 (A B on detector, detector active), only hypotheses with P(E = 1 | A = 1, B = 1) = 1 remain (h01, h10, h11):
P(B → E | d) / P(B ↛ E | d) = [P(h01) + P(h11)] / P(h10) = 1 / (1 − q)
Modeling backwards blocking
• After Trial 2 (A alone activates the detector), h01 is also eliminated, since it predicts P(E = 1 | A = 1, B = 0) = 0:
P(B → E | d) / P(B ↛ E | d) = P(h11) / P(h10) = q / (1 − q)
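The whole backwards-blocking computation fits in a few lines; this sketch uses the deterministic activation law and a made-up prior q = 1/3.

```python
# Sketch of the backwards-blocking computation: posterior over the four
# blicket hypotheses, using the deterministic activation law and a
# made-up prior q = 1/3 on any block being a blicket.

def consistent(trial, blickets):
    on, e = trial                       # objects on the detector, detector state
    active = any(o in blickets for o in on)
    return active == bool(e)

def posterior(q, trials):
    hyps = [frozenset(), frozenset("A"), frozenset("B"), frozenset("AB")]
    post = {h: q ** len(h) * (1 - q) ** (2 - len(h)) for h in hyps}
    for t in trials:
        post = {h: (p if consistent(t, h) else 0.0) for h, p in post.items()}
    z = sum(post.values())
    return {h: p / z for h, p in post.items()}

q = 1 / 3
trials = [(("A", "B"), 1), (("A",), 1)]   # AB -> active, then A alone -> active
post = posterior(q, trials)
p_b_blicket = sum(p for h, p in post.items() if "B" in h)
print(p_b_blicket)                        # falls back to the prior q
```

After the ambiguous first trial B looks likely to be a blicket; the second trial discounts it back to the prior, matching the odds ratios above derived on the slides.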
Manipulating the prior
(procedure adapted from Sobel et al., 2002, Experiment 2)
I. Pre-training phase: Blickets are rare . . . .
• “Rare” condition: First observe 12 objects on detector, of which 2 set it off.
• “Common” condition: First observe 12 objects on detector, of which 10 set it off.
II. Backwards blocking phase:
• Trial 1: A B on detector – detector active
• Trial 2: A on detector – detector active
After each trial, adults judge the probability that each object is a blicket.
Inferences from ambiguous data
(procedure adapted from Sobel et al., 2002, Experiment 2)
I. Pre-training phase: Blickets are rare . . . .
II. Two trials:
• Trial 1: A B on detector – detector active
• Trial 2: B C on detector – detector active
After each trial, adults judge the probability that each object is a blicket.
Same domain theory generates hypothesis space for 3 objects:
• Hypotheses: h000, h100, h010, h001, h110, h101, h011, h111 — one for each subset of {A → E, B → E, C → E}
• Likelihoods: P(E = 1 | A, B, C; h) = 1 if A = 1 and A → E exists, or B = 1 and B → E exists, or C = 1 and C → E exists; else 0.
• “Rare” condition: First observe 12 objects
on detector, of which 2 set it off.
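To make the inference concrete, here is a minimal sketch of how backward blocking falls out of Bayes, using an assumed two-object encoding (rather than the three-object space above), the deterministic detector likelihood, and the "rare" prior of 1/6:

```python
from itertools import product

# A minimal sketch (not the slides' exact model) of backward blocking with
# two objects. A hypothesis h = (a, b) says whether A and B are blickets.
P_BLICKET = 1.0 / 6.0  # "rare" condition: 2 of 12 objects set off the detector

def likelihood(trial, h):
    # Deterministic detector: E = 1 iff some object placed on it is a blicket.
    present, e = trial
    predicted = 1 if any(h[i] for i in present) else 0
    return 1.0 if predicted == e else 0.0

def posterior(data):
    scores = {}
    for h in product([0, 1], repeat=2):
        prior = 1.0
        for flag in h:
            prior *= P_BLICKET if flag else 1 - P_BLICKET
        lik = 1.0
        for trial in data:
            lik *= likelihood(trial, h)
        scores[h] = prior * lik
    z = sum(scores.values())
    return {h: s / z for h, s in scores.items()}

# Backward blocking: A and B activate the detector together,
# then A activates it by itself.
data = [((0, 1), 1), ((0,), 1)]
post = posterior(data)
p_a = sum(p for h, p in post.items() if h[0] == 1)
p_b = sum(p for h, p in post.items() if h[1] == 1)
print(p_a, p_b)  # A is certainly a blicket; B falls back to its prior of 1/6
```

The second trial explains away B: its posterior probability of being a blicket returns to the base rate, exactly the backward-blocking pattern.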
The role of causal mechanism
knowledge
• Is mechanism knowledge necessary?
– Constraint-based learning using 2 tests of
conditional independence.
• How important is the deterministic functional
form of causal relations?
– Bayes with “noisy sufficient causes” theory (c.f.,
Cheng’s causal power theory).
Bayes with correct theory:
Bayes with “noisy sufficient causes” theory:
Theory-based causal induction
• Explains one-shot causal inferences about
physical systems: blicket detectors
• Captures a spectrum of inferences:
– unambiguous data: adults and children make all-or-none inferences
– ambiguous data: adults and children make more graded inferences
• Extends to more complex cases with hidden
variables, dynamic systems: come to my talk!
Summary
• Causal graphical models provide a language for expressing causal knowledge
• Key issues in modeling causal induction:
– what do we mean by causal induction?
– how do knowledge and statistics interact?
• Bayesian approach allows exploration of
different answers to these questions
Outline
• Morning
– Introduction (Josh)
– Basic case study #1: Flipping coins (Tom)
– Basic case study #2: Rules and similarity (Josh)
• Afternoon
– Advanced case study #1: Causal induction (Tom)
– Advanced case study #2: Property induction (Josh)
– Quick tour of more advanced topics (Tom)
Property induction
Collaborators
Charles Kemp
Lauren Schmidt
Fei Xu
Pat Shafto
Neville Sanjana
Amy Perfors
Liz Baraff
The Big Question
• How can we generalize new concepts
reliably from just one or a few examples?
– Learning word meanings
“horse”
“horse”
“horse”
The Big Question
• How can we generalize new concepts
reliably from just one or a few examples?
– Learning word meanings, causal relations,
social rules, ….
– Property induction
Gorillas have T4 cells.
Squirrels have T4 cells.
All mammals have T4 cells.
How probable is the conclusion (target) given
the premises (examples)?
The Big Question
• How can we generalize new concepts
reliably from just one or a few examples?
– Learning word meanings, causal relations,
social rules, ….
– Property induction
Gorillas have T4 cells.
Squirrels have T4 cells.
Gorillas have T4 cells.
Chimps have T4 cells.
All mammals have T4 cells.
All mammals have T4 cells.
More diverse examples
stronger generalization
Is rational inference the answer?
• Everyday induction often appears to follow
principles of rational scientific inference.
– Could that explain its success?
• Goal of this work: a rational computational
model of human inductive generalization.
– Explain people’s judgments as approximations to
optimal inference in natural environments.
– Close quantitative fits to people’s judgments with
a minimum of free parameters or assumptions.
Theory-Based Bayesian Models
• Rational statistical inference (Bayes):
p(h | d) = p(d | h) p(h) / Σ_{h′ ∈ H} p(d | h′) p(h′)
• Learners’ domain theories generate their
hypothesis space H and prior p(h).
– Well-matched to structure of the natural world.
– Learnable from limited data.
– Computationally tractable inference.
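The core computation behind this slide can be sketched as follows; the hypothesis names and numbers are invented for illustration:

```python
# A minimal sketch of rational statistical inference p(h | d), normalized over
# a finite hypothesis space H; the hypotheses and numbers are invented.
def posterior(prior, likelihood):
    unnorm = {h: likelihood[h] * p for h, p in prior.items()}
    z = sum(unnorm.values())  # the sum over H in the denominator of Bayes' rule
    return {h: u / z for h, u in unnorm.items()}

prior = {"h1": 0.7, "h2": 0.3}        # p(h): h1 favored a priori
likelihood = {"h1": 0.1, "h2": 0.9}   # p(d | h) for some observed data d
post = posterior(prior, likelihood)
print(post)  # h2 overtakes h1: p(h2 | d) = 0.27 / 0.34 ≈ 0.794
```

A strong enough likelihood overturns the prior, which is the basic dynamic exploited throughout the case studies below.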
The plan
• Similarity-based models
• Theory-based model
• Bayesian models
– “Empiricist” Bayes
– Theory-based Bayes, with different theories
• Connectionist (PDP) models
• Advanced Theory-based Bayes
– Learning with multiple domain theories
– Learning domain theories
The plan
• Similarity-based models
• Theory-based model
• Bayesian models
– “Empiricist” Bayes
– Theory-based Bayes, with different theories
• Connectionist (PDP) models
• Advanced Theory-based Bayes
– Learning with multiple domain theories
– Learning domain theories
An experiment
(Osherson et al., 1990)
• 20 subjects rated the strength of 45 arguments:
X1 have property P.
X2 have property P.
X3 have property P.
All mammals have property P.
• 40 different subjects rated the similarity of all
pairs of 10 mammals.
Similarity-based models
(Osherson et al.)
strength(“all mammals” | X) ∝ Σ_{i ∈ mammals} sim(i, X)
[Figure: mammals plotted in similarity space, with the examples X marked ×.]
Similarity-based models
(Osherson et al.)
strength(“all mammals” | X) ∝ Σ_{i ∈ mammals} sim(i, X)
• Sum-Similarity: sim(i, X) = Σ_{j ∈ X} sim(i, j)
Similarity-based models
(Osherson et al.)
strength(“all mammals” | X) ∝ Σ_{i ∈ mammals} sim(i, X)
• Max-Similarity: sim(i, X) = max_{j ∈ X} sim(i, j)
Sum-sim versus Max-sim
• Two models appear functionally similar:
– Both increase monotonically as new examples
are observed.
• Reasons to prefer Sum-sim:
– Standard form of exemplar models of
categorization, memory, and object recognition.
– Analogous to kernel density estimation
techniques in statistical pattern recognition.
• Reasons to prefer Max-sim:
– Fit to generalization judgments . . . .
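The two aggregation rules can be sketched directly; the species and similarity values below are invented for illustration, not Osherson et al.'s data:

```python
# Toy sketch of the Osherson et al. similarity-based models.
sim = {("horse", "cow"): 0.8, ("horse", "seal"): 0.2, ("cow", "seal"): 0.3}

def s(i, j):
    # Symmetric lookup with self-similarity 1.
    return 1.0 if i == j else sim.get((i, j), sim.get((j, i), 0.0))

def strength(examples, category, agg):
    # strength("all mammals" | X) is proportional to the sum over category
    # members i of agg(sim(i, j) for j in X), with agg = max or sum.
    return sum(agg(s(i, j) for j in examples) for i in category)

mammals = ["horse", "cow", "seal"]
X = ["horse", "seal"]
max_sim = strength(X, mammals, max)  # 1.0 + 0.8 + 1.0 = 2.8
sum_sim = strength(X, mammals, sum)  # 1.2 + 1.1 + 1.2 = 3.5
print(max_sim, sum_sim)
```

Both scores grow monotonically as examples are added; the models differ only in how each category member aggregates its similarity to the example set.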
Data vs. models
[Scatter plot: human data vs. model predictions.]
Each “.” represents one argument:
X1 have property P.
X2 have property P.
X3 have property P.
All mammals have property P.
Three data sets
[Scatter plots of data vs. Max-sim and Sum-sim predictions. Conclusion kinds: “all mammals” (3 examples), “horses” (2 examples), “horses” (1, 2, or 3 examples).]
Feature rating data
(Osherson and Wilkie)
• People were given 48 animals, 85 features,
and asked to rate whether each animal had
each feature.
• E.g., elephant: 'gray' 'hairless' 'toughskin'
'big' 'bulbous' 'longleg'
'tail' 'chewteeth' 'tusks'
'smelly' 'walks' 'slow'
'inactive' 'vegetation' 'grazer'
'oldworld' 'bush' 'jungle'
'ground' 'timid' 'smart'
'group'
?
Species 1
Species 2
Species 3
Species 4
Species 5
Species 6
Species 7
Species 8
Species 9
Species 10
?
?
?
?
?
?
?
?
Features
New property
• Compute similarity based on Hamming
distance, |A ∩ B| / |A ∪ B|, or cosine.
• Generalize based on Max-sim or Sum-sim.
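A minimal sketch of two such similarity measures on toy binary feature vectors (the vectors are invented, not the Osherson–Wilkie ratings):

```python
import math

# Feature-overlap similarity (fraction of features on which two species
# agree) and cosine similarity, on invented binary feature vectors.
def hamming_sim(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

elephant = [1, 1, 1, 0, 0, 1]
rhino    = [1, 1, 0, 0, 0, 1]
print(hamming_sim(elephant, rhino))  # agree on 5 of 6 features
print(cosine_sim(elephant, rhino))   # 3 / (2 * sqrt(3)) ≈ 0.866
```

Either measure can then be plugged into the Max-sim or Sum-sim aggregation above.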
Three data sets
[Scatter plots. Max-Sim: r = 0.77, 0.75, 0.94. Sum-Sim: r = −0.21, 0.63, 0.19. Conclusion kinds: “all mammals” (3 examples), “horses” (2 examples), “horses” (1, 2, or 3 examples).]
Problems for sim-based approach
• No principled explanation for why Max-Sim works so
well on this task, and Sum-Sim so poorly, when Sum-Sim is the standard in other similarity-based models.
• Free parameters mixing similarity and coverage terms,
and possibly Max-Sim and Sum-Sim terms.
• Does not extend to induction with other kinds of
properties, e.g., from Smith et al., 1993:
Dobermanns can bite through wire.
German shepherds can bite through wire.
Poodles can bite through wire.
German shepherds can bite through wire.
Marr’s Three Levels of Analysis
• Computation:
“What is the goal of the computation, why is it
appropriate, and what is the logic of the
strategy by which it can be carried out?”
• Representation and algorithm:
Max-sim, Sum-sim
• Implementation:
Neurobiology
The plan
• Similarity-based models
• Theory-based model
• Bayesian models
– “Empiricist” Bayes
– Theory-based Bayes, with different theories
• Connectionist (PDP) models
• Advanced Theory-based Bayes
– Learning with multiple domain theories
– Learning domain theories
Theory-based induction
• Scientific biology: species generated by an
evolutionary branching process.
– A tree-structured taxonomy of species.
• Taxonomy also central in folkbiology (Atran).
Theory-based induction
Begin by reconstructing intuitive taxonomy
from similarity judgments:
clustering
How taxonomy constrains induction
• Atran (1998): “Fundamental principle of
systematic induction” (Warburton 1967,
Bock 1973)
– Given a property found among members of any
two species, the best initial hypothesis is that
the property is also present among all species
that are included in the smallest higher-order
taxon containing the original pair of species.
“all mammals”
Cows have property P.
Dolphins have property P.
Squirrels have property P.
All mammals have property P.
Strong: 0.76 [max = 0.82]

“large herbivores”
Cows have property P.
Horses have property P.
Rhinos have property P.
All mammals have property P.
Weak: 0.17 [min = 0.14]

“all mammals”
Seals have property P.
Dolphins have property P.
Squirrels have property P.
All mammals have property P.
Weak: 0.30 [min = 0.14]
[Scatter plots of data vs. taxonomic distance, Max-sim, and Sum-sim. Conclusion kinds: “all mammals” (3 examples), “horses” (2 examples), “horses” (1, 2, or 3 examples).]
The challenge
• Can we build models with the best of both approaches?
– Quantitatively accurate predictions.
– Strong rational basis.
• Will require novel ways of integrating
structured knowledge with statistical inference.
The plan
• Similarity-based models
• Theory-based model
• Bayesian models
– “Empiricist” Bayes
– Theory-based Bayes, with different theories
• Connectionist (PDP) models
• Advanced Theory-based Bayes
– Learning with multiple domain theories
– Learning domain theories
The Bayesian approach
[Figure: a species (1–10) × features matrix, plus a new-property column whose entries are “?”; a hypothesis about the new property’s extension supports generalization.]
The Bayesian approach
[Figure: hypothesis h with prior p(h); observed data d (the examples) with likelihood p(d | h); species × features matrix with the new-property column.]
Bayes’ rule:
p(h | d) = p(d | h) p(h) / Σ_{h′ ∈ H} p(d | h′) p(h′)
[Figure: prior p(h), likelihood p(d | h), and data d over the species × features matrix.]
Probability that property Q holds for species x:
p(Q(x) | d) = Σ_{h consistent with Q(x)} p(h | d)
[Figure: same matrix as above.]
“Size principle”:
p(d | h) = 1/|h| if d is consistent with h, 0 otherwise,
where |h| = number of positive instances of h.
[Figure: same matrix as above.]
The size principle
[Figure: hypotheses h1 = “even numbers” and h2 = “multiples of 10”, shown as subsets of the numbers 1–100. With a few examples consistent with both, the data are slightly more of a coincidence under h1; as more such examples accumulate, the data become much more of a coincidence under h1.]
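The size principle can be run directly on the number-game hypotheses from these slides:

```python
# The size principle in the number game: with n examples drawn independently
# from a hypothesis h, p(d | h) = 1 / |h|**n if all examples lie in h, else 0.
evens = set(range(2, 101, 2))      # |h1| = 50
mult10 = set(range(10, 101, 10))   # |h2| = 10

def likelihood(examples, h):
    if not all(x in h for x in examples):
        return 0.0
    return 1.0 / len(h) ** len(examples)

for examples in ([20], [20, 40], [20, 40, 60]):
    ratio = likelihood(examples, mult10) / likelihood(examples, evens)
    print(examples, ratio)  # 5, 25, 125: h2 wins more decisively as n grows
```

Each extra example consistent with both hypotheses multiplies the likelihood ratio by |h1|/|h2| = 5, which is why the data become "much more of a coincidence" under the larger hypothesis.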
Illustrating the size principle
Which argument is stronger?
Grizzly bears have property P.
All mammals have property P.
“Non-monotonicity”
Grizzly bears have property P.
Brown bears have property P.
Polar bears have property P.
All mammals have property P.
Probability that property Q holds for species x:
p(Q(x) | d) = [ Σ_{h consistent with Q(x), d} p(h)/|h| ] / [ Σ_{h consistent with d} p(h)/|h| ]
[Figure: prior p(h), likelihood p(d | h), and generalization p(Q(x) | d) over the species × features matrix.]
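Hypothesis averaging with size-principle weighting also produces the non-monotonicity noted above. A sketch with a toy two-hypothesis space (assumed uniform prior, and weight p(h)/|h|^n for n examples):

```python
# Sketch of p(Q(x) | d) by hypothesis averaging. The hypothesis space and
# species are invented; the prior is uniform for simplicity.
hypotheses = [
    frozenset({"grizzly", "brown bear", "polar bear"}),   # "bears"
    frozenset({"grizzly", "brown bear", "polar bear",
               "horse", "cow", "seal"}),                  # "all mammals"
]

def p_generalize(x, d, hyps):
    # Each hypothesis consistent with the examples d contributes 1 / |h|**n.
    n = len(d)
    num = sum(1.0 / len(h) ** n for h in hyps if d <= h and x in h)
    den = sum(1.0 / len(h) ** n for h in hyps if d <= h)
    return num / den

one = p_generalize("horse", frozenset({"grizzly"}), hypotheses)
three = p_generalize("horse", frozenset({"grizzly", "brown bear",
                                         "polar bear"}), hypotheses)
print(one, three)  # 1/3 with one bear example, 1/9 with three: non-monotonic
```

Three bear examples fit the small "bears" hypothesis so much better than "all mammals" that generalization to horses weakens, matching the grizzly-bear argument pair on the earlier slide.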
Specifying the prior p(h)
• A good prior must focus on a small subset of
all 2^n possible hypotheses, in order to:
– Match the distribution of properties in the world.
– Be learnable from limited data.
– Be efficiently computable.
• We consider two approaches:
– “Empiricist” Bayes: unstructured prior based
directly on known features.
– “Theory-based” Bayes: structured prior based on
rational domain theory, tuned to known features.
“Empiricist” Bayes (Heit, 1998):
• Hypotheses h1 … h12 are the labelings observed among the known feature columns.
• p(h) is estimated from how often each labeling occurs among the known features (e.g., 1/15, 2/15, 3/15, …).
[Figure: species × features matrix with hypothesis columns h1–h12 and their empirical prior probabilities.]
Results
[Scatter plots. “Empiricist” Bayes: r = 0.38, 0.16, 0.79. Max-Sim: r = 0.77, 0.75, 0.94.]
Why doesn’t “Empiricist”
Bayes work?
• With no structural bias, requires too many
features to estimate the prior reliably.
• An analogy: Estimating a smooth probability
density function by local interpolation.
N=5
N = 100
N = 500
Why doesn’t “Empiricist”
Bayes work?
• With no structural bias, requires too many
features to estimate the prior reliably.
• An analogy: Estimating a smooth probability
density function by local interpolation.
• Assuming an appropriately structured form for the density leads to better generalization from sparse data.
[Figure: density estimates at N = 5.]
“Theory-based” Bayes
Theory: Two principles based on the structure of
species and properties in the natural world.
1. Species generated by an evolutionary
branching process.
– A tree-structured taxonomy of species (Atran,
1998).
2. Features generated by stochastic mutation
process and passed on to descendants.
– Novel features can appear anywhere in tree, but
some distributions are more likely than others.
Mutation process generates p(h|T):
– Choose label for root.
– Probability that label mutates along branch b: (1 − e^(−2λ|b|)) / 2
λ = mutation rate
|b| = length of branch b
[Figure: tree T over species s1–s10; p(h|T) feeds the hypothesis h over the species × features matrix.]
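A brute-force sketch of this prior on a tiny assumed four-leaf tree (unit branch lengths, λ = 0.5), checking that labelings cutting the tree on fewer branches are more probable:

```python
from itertools import product
from math import exp

# Assumed tree: root R with internal children U and V; leaves s1..s4.
lam = 0.5
children = {"R": ["U", "V"], "U": ["s1", "s2"], "V": ["s3", "s4"]}

def flip_prob(length):
    # Probability that the label mutates along a branch of this length.
    return (1 - exp(-2 * lam * length)) / 2

def p_labeling(leaf_labels):
    # Marginalize over internal-node labels by brute-force enumeration.
    q = flip_prob(1.0)
    total = 0.0
    for r, u, v in product([0, 1], repeat=3):
        labels = dict(leaf_labels, R=r, U=u, V=v)
        p = 0.5  # uniform label at the root
        for parent, kids in children.items():
            for kid in kids:
                p *= q if labels[kid] != labels[parent] else 1 - q
        total += p
    return total

mono = {"s1": 1, "s2": 1, "s3": 0, "s4": 0}  # respects one subtree
poly = {"s1": 1, "s2": 0, "s3": 1, "s4": 0}  # scattered labeling
print(p_labeling(mono) > p_labeling(poly))  # True
```

On larger trees this brute-force sum is replaced by message passing, as the slides note, but the preference for "monophyletic" labelings is the same.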
Samples from the prior
• Labelings that cut the data along fewer
branches are more probable:
>
“monophyletic”
“polyphyletic”
Samples from the prior
• Labelings that cut the data along longer
branches are more probable:
>
“more distinctive”
“less distinctive”
• Mutation process over tree T
generates p(h|T).
• Message passing over tree T
efficiently sums over all h.
• How do we know which tree T
to use?
[Figure: tree T over species s1–s10 feeding the species × features matrix.]
The same mutation process
generates p(Features|T):
– Assume each feature generated
independently over the tree.
– Use MCMC to infer most likely
tree T and mutation rate λ given
observed features.
– No free parameters!
[Figure: tree T over species s1–s10 feeding the species × features matrix.]
Results
[Scatter plots. “Theory-based” Bayes: r = 0.91, 0.95, 0.91. “Empiricist” Bayes: r = 0.38, 0.16, 0.79. Max-Sim: r = 0.77, 0.75, 0.94.]
Grounding in similarity
Reconstruct intuitive taxonomy from
similarity judgments:
clustering
[Scatter plots of data vs. Theory-based Bayes, Max-sim, and Sum-sim. Conclusion kinds: “all mammals” (3 examples), “horses” (2 examples), “horses” (1, 2, or 3 examples).]
Explaining similarity
• Why does Max-sim fit so well?
– An efficient and accurate approximation to
this Theory-Based Bayesian model.
– Correlation with Bayes on three-premise general arguments, over 100 simulated trees: mean r = 0.94.
– Theorem. Nearest neighbor classification
approximates evolutionary Bayes in the limit of
high mutation rate, if domain is tree-structured.
Alternative feature-based models
• Taxonomic Bayes (strictly taxonomic
hypotheses, with no mutation process)
>
“monophyletic”
“polyphyletic”
Alternative feature-based models
• Taxonomic Bayes (strictly taxonomic
hypotheses, with no mutation process)
• PDP network (Rogers and McClelland)
Species
Features
Results
[Scatter plots. Theory-based Bayes: r = 0.91, 0.95, 0.91 — bias is just right. Taxonomic Bayes: r = 0.51, 0.53, 0.85 — bias is too strong. PDP network: r = 0.41, 0.62, 0.71 — bias is too weak.]
Mutation principle versus
pure Occam’s Razor
• Mutation principle provides a version of
Occam’s Razor, by favoring hypotheses that
span fewer disjoint clusters.
• Could we use a more generic Bayesian
Occam’s Razor, without the biological
motivation of mutation?
Mutation process generates p(h|T):
– Choose label for root.
– Probability that label mutates along branch b: λ, independent of branch length |b| (a generic Occam’s Razor variant).
[Figure: tree T over species s1–s10 feeding the species × features matrix.]
Premise typicality effect (Rips, 1975; Osherson et al., 1990):
Strong: Horses have property P. → All mammals have property P.
Weak: Seals have property P. → All mammals have property P.
[Model comparison: Bayes (taxonomy + mutation), Bayes (taxonomy + Occam), and Max-sim; conclusion kind “all mammals”, number of examples: 1.]
Typicality meets hierarchies
• Collins and Quillian: semantic memory structured
hierarchically
• Traditional story: Simple hierarchical structure
uncomfortable with typicality effects & exceptions.
• New story: Typicality & exceptions compatible with
rational statistical inference over hierarchy.
Intuitive versus scientific
theories of biology
• Same structure for how species are related.
– Tree-structured taxonomy.
• Same probabilistic model for traits
– Small probability of occurring along any branch
at any time, plus inheritance.
• Different features
– Scientist: genes
– People: coarse anatomy and behavior
Induction in Biology: summary
• Theory-based Bayesian inference explains
taxonomic inductive reasoning in folk biology.
• Insight into processing-level accounts.
– Why Max-sim over Sum-sim in this domain?
– How is hierarchical representation compatible
with typicality effects & exceptions?
• Reveals essential principles of domain theory.
– Category structure: taxonomic tree.
– Feature distribution: stochastic mutation process +
inheritance.
The plan
• Similarity-based models
• Theory-based model
• Bayesian models
– “Empiricist” Bayes
– Theory-based Bayes, with different theories
• Connectionist (PDP) models
• Advanced Theory-based Bayes
– Learning with multiple domain theories
– Learning domain theories
Property type → Theory structure:
• Generic “essence” → taxonomic tree.
• Size-related → dimensional (one-dimensional order).
• Food-carried → directed acyclic network.
[Figure: a taxonomic tree, a one-dimensional order, and a food-web network over Lion, Cheetah, Hyena, Giraffe, Gazelle, Gorilla, Monkey.]
One-dimensional predicates
• Q = “Have skins that are more resistant to
penetration than most synthetic fibers”.
– Unknown relevant property: skin toughness
– Model influence of known properties via judged
prior probability that each species has Q.
[Figure: species ordered by skin toughness (house cat, camel, elephant, rhino), with a threshold for Q.]
One-dimensional predicates
[Scatter plots comparing Bayes (taxonomy + mutation), Max-sim, and Bayes (1D model).]
Food web model fits (Shafto et al.), mammals and island scenarios:
disease property — r = 0.82, r = 0.77; generic property — r = −0.35, r = −0.05.
Taxonomic tree model fits (Shafto et al.), mammals and island scenarios:
disease property — r = 0.16, r = −0.12; generic property — r = 0.81, r = 0.62.
The plan
• Similarity-based models
• Theory-based model
• Bayesian models
– “Empiricist” Bayes
– Theory-based Bayes, with different theories
• Connectionist (PDP) models
• Advanced Theory-based Bayes
– Learning with multiple domain theories
– Learning domain theories
Theory
• Species organized in taxonomic tree structure
• Feature i generated by mutation process with rate λi
[Figure: domain structure — a tree over species S1–S10 with features F1–F14 on its branches; p(S|T) links theory to structure, p(D|S) links structure to the observed species × features data. A high mutation rate λ10 corresponds to a low weight for feature F10.]
Theory
• Species organized in taxonomic tree structure
• Feature i generated by mutation process with rate λi
[Figure: the same structure also supports adding a new Species X — placing it in the tree (SX) and predicting its unobserved features (the “?” row).]
Where does the domain theory
come from?
• Innate.
– Atran (1998): The tendency to group living
kinds into hierarchies reflects an “innately
determined cognitive structure”.
• Emerges (only approximately) through
learning in unstructured connectionist
networks.
– McClelland and Rogers (2003).
Bayesian inference to theories
• Challenge to the nativist-empiricist
dichotomy.
– We really do have structured domain theories.
– We really do learn them.
• Bayesian framework applies over multiple
levels:
– Given hypothesis space + data, infer concepts.
– Given theory + data, infer hypothesis space.
– Given X + data, infer theory.
Bayesian inference to theories
• Candidate theories for biological species and
their features:
– T0: Features generated independently for each species. (c.f.
naive Bayes, Anderson’s rational model.)
– T1: Features generated by mutation in tree-structured
taxonomy of species.
– T2: Features generated by mutation in a one-dimensional
chain of species.
• Score theories by likelihood on object-feature
matrix: p(D|T) = Σ_S p(D|S,T) p(S|T) ≈ max_S p(D|S,T) p(S|T)
T0:
• No organizational structure
for species.
• Features distributed
independently over species.
T1:
• Species organized in
taxonomic tree structure.
• Features distributed via
stochastic mutation process.

Tree-structured data (features respect a taxonomy over S1–S10):
p(Data|T0) ≈ 1.83 × 10^−41, p(Data|T1) ≈ 2.42 × 10^−32 — the tree theory scores higher.
[Figure: the feature matrix and best-fitting tree for this dataset.]

Independently generated data (features scattered across species):
p(Data|T0) ≈ 2.29 × 10^−42, p(Data|T1) ≈ 4.38 × 10^−53 — the null theory scores higher.
[Figure: the feature matrix and best-fitting tree for this dataset.]
Empirical tests
• Synthetic data: 32 objects, 120 features
– tree-structured generative model
– linear chain generative model
– unconstrained (independent features).
• Real data
– Animal feature judgments: 48 species, 85
features.
– US Supreme Court decisions, 1981-1985: 9
people, 637 cases.
Results
Preferred model by dataset:
– Synthetic tree-structured data → Tree
– Synthetic linear-chain data → Linear
– Synthetic independent-feature data → Null
– Animal feature judgments → Tree
– Supreme Court decisions → Linear
Theory acquisition: summary
• So far, just a computational proof of concept.
• Future work:
– Experimental studies of theory acquisition in the
lab, with adult and child subjects.
– Modeling developmental or historical trajectories
of theory change.
• Sources of hypotheses for candidate theories:
– What is innate?
– Role of analogy?
Outline
• Morning
– Introduction (Josh)
– Basic case study #1: Flipping coins (Tom)
– Basic case study #2: Rules and similarity (Josh)
• Afternoon
– Advanced case study #1: Causal induction (Tom)
– Advanced case study #2: Property induction (Josh)
– Quick tour of more advanced topics (Tom)
Structure and statistics
• Statistical language modeling
– topic models
• Relational categorization
– attributes and relations
Statistical language modeling
• A variety of approaches to statistical language
modeling are used in cognitive science
– e.g. LSA (Landauer & Dumais, 1997)
– distributional clustering (Redington, Chater, & Finch, 1998)
• Generative models have unique advantages
– identify assumed causal structure of language
– make use of standard tools of Bayesian statistics
– easily extended to capture more complex structure
Generative models for language
latent structure
observed data
Generative models for language
meaning
sentences
Topic models
• Each document a mixture of topics
• Each word chosen from a single topic
• Introduced by Blei, Ng, and Jordan (2001),
reinterpretation of PLSI (Hofmann, 1999)
• Idea of probabilistic topics widely used
(e.g. Bigi et al., 1997; Iyer & Ostendorf, 1996; Ueda & Saito, 2003)
Generating a document
[Graphical model: θ (distribution over topics) → z (topic assignment for each word) → w (observed words).]
P(w|z = 1) = φ(1) (topic 1):
HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2; SCIENTIFIC, KNOWLEDGE, WORK, RESEARCH, MATHEMATICS all 0.0.
P(w|z = 2) = φ(2) (topic 2):
HEART, LOVE, SOUL, TEARS, JOY all 0.0; SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2.
Choose mixture weights for each document, generate “bag of words”
θ = {P(z = 1), P(z = 2)}
{0, 1}
{0.25, 0.75}
MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS
RESEARCH WORK SCIENTIFIC MATHEMATICS WORK
SCIENTIFIC KNOWLEDGE MATHEMATICS SCIENTIFIC
HEART LOVE TEARS KNOWLEDGE HEART
{0.5, 0.5}
MATHEMATICS HEART RESEARCH LOVE MATHEMATICS
WORK TEARS SOUL KNOWLEDGE HEART
{0.75, 0.25}
WORK JOY SOUL TEARS MATHEMATICS
TEARS LOVE LOVE LOVE SOUL
{1, 0}
TEARS LOVE JOY SOUL LOVE TEARS SOUL SOUL TEARS JOY
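The generative process on these slides can be sketched with the two topics above (the φ values match the slides; the sampling code itself is illustrative):

```python
import random

# Two-topic model: topic 1 is the "love" topic, topic 2 the "science" topic.
phi = {
    1: {"HEART": 0.2, "LOVE": 0.2, "SOUL": 0.2, "TEARS": 0.2, "JOY": 0.2},
    2: {"SCIENTIFIC": 0.2, "KNOWLEDGE": 0.2, "WORK": 0.2,
        "RESEARCH": 0.2, "MATHEMATICS": 0.2},
}

def generate_document(theta, n_words, seed=0):
    """For each word: pick a topic z ~ theta, then a word w ~ phi[z]."""
    rng = random.Random(seed)
    doc = []
    for _ in range(n_words):
        z = 1 if rng.random() < theta[0] else 2
        words, probs = zip(*phi[z].items())
        doc.append(rng.choices(words, probs)[0])
    return doc

print(generate_document((0.0, 1.0), 10))  # all science-topic words
print(generate_document((1.0, 0.0), 10))  # all love-topic words
```

Intermediate mixture weights such as θ = {0.5, 0.5} produce the blended documents shown above.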
A selection of topics (from 500)
– THEORY, SCIENTISTS, EXPERIMENT, OBSERVATIONS, SCIENTIFIC, EXPERIMENTS, HYPOTHESIS, EXPLAIN, SCIENTIST, OBSERVED, EXPLANATION, BASED, OBSERVATION, IDEA, EVIDENCE, THEORIES, BELIEVED, DISCOVERED, OBSERVE, FACTS
– SPACE, EARTH, MOON, PLANET, ROCKET, MARS, ORBIT, ASTRONAUTS, FIRST, SPACECRAFT, JUPITER, SATELLITE, SATELLITES, ATMOSPHERE, SPACESHIP, SURFACE, SCIENTISTS, ASTRONAUT, SATURN, MILES
– ART, PAINT, ARTIST, PAINTING, PAINTED, ARTISTS, MUSEUM, WORK, PAINTINGS, STYLE, PICTURES, WORKS, OWN, SCULPTURE, PAINTER, ARTS, BEAUTIFUL, DESIGNS, PORTRAIT, PAINTERS
– STUDENTS, TEACHER, STUDENT, TEACHERS, TEACHING, CLASS, CLASSROOM, SCHOOL, LEARNING, PUPILS, CONTENT, INSTRUCTION, TAUGHT, GROUP, SHOULD, CLASSES, PUPIL, GIVEN
– BRAIN, NERVE, SENSE, SENSES, ARE, NERVOUS, NERVES, BODY, SMELL, TASTE, TOUCH, MESSAGES, IMPULSES, CORD, ORGANS, SPINAL, FIBERS, SENSORY, PAIN, IS
– CURRENT, ELECTRICITY, ELECTRIC, CIRCUIT, IS, ELECTRICAL, VOLTAGE, FLOW, BATTERY, WIRE, WIRES, SWITCH, CONNECTED, ELECTRONS, RESISTANCE, POWER, CONDUCTORS, CIRCUITS, TUBE, NEGATIVE
– NATURE, WORLD, HUMAN, PHILOSOPHY, MORAL, KNOWLEDGE, THOUGHT, REASON, SENSE, OUR, TRUTH, NATURAL, EXISTENCE, BEING, LIFE, MIND, ARISTOTLE, BELIEVED, EXPERIENCE, REALITY
– THIRD, FIRST, SECOND, THREE, FOURTH, FOUR, TWO, FIFTH, SEVENTH, SIXTH, EIGHTH, HALF, SEVEN, SIX, SINGLE, NINTH, END, TENTH, ANOTHER
A selection of topics (from 500), each shown by its most probable words:
– JOB: WORK, JOBS, CAREER, EXPERIENCE, EMPLOYMENT, OPPORTUNITIES, WORKING, TRAINING, SKILLS, CAREERS, POSITIONS, FIND, POSITION, FIELD, OCCUPATIONS, REQUIRE, OPPORTUNITY, EARN, ABLE
– SCIENCE: STUDY, SCIENTISTS, SCIENTIFIC, KNOWLEDGE, WORK, RESEARCH, CHEMISTRY, TECHNOLOGY, MANY, MATHEMATICS, BIOLOGY, FIELD, PHYSICS, LABORATORY, STUDIES, WORLD, SCIENTIST, STUDYING, SCIENCES
– BALL: GAME, TEAM, FOOTBALL, BASEBALL, PLAYERS, PLAY, FIELD, PLAYER, COACH, PLAYED, PLAYING, HIT, TENNIS, TEAMS, GAMES, SPORTS, BAT, TERRY
– FIELD: MAGNETIC, MAGNET, WIRE, NEEDLE, CURRENT, COIL, POLES, IRON, COMPASS, LINES, CORE, ELECTRIC, DIRECTION, FORCE, MAGNETS, BE, MAGNETISM, POLE, INDUCED
– STORY: STORIES, TELL, CHARACTER, CHARACTERS, AUTHOR, TOLD, SETTING, TALES, PLOT, TELLING, SHORT, FICTION, ACTION, TRUE, EVENTS, TELLS, TALE, NOVEL
– MIND: WORLD, DREAM, DREAMS, THOUGHT, IMAGINATION, MOMENT, THOUGHTS, OWN, REAL, LIFE, IMAGINE, SENSE, CONSCIOUSNESS, STRANGE, FEELING, WHOLE, BEING, MIGHT, HOPE
– DISEASE: BACTERIA, DISEASES, GERMS, FEVER, CAUSE, CAUSED, VIRUSES, INFECTION, VIRUS, MICROORGANISMS, PERSON, INFECTIOUS, COMMON, CAUSING, SMALLPOX, BODY, INFECTIONS, CERTAIN
– WATER: FISH, SEA, SWIM, SWIMMING, POOL, LIKE, SHELL, SHARK, TANK, SHELLS, SHARKS, DIVING, DOLPHINS, SWAM, LONG, SEAL, DIVE, DOLPHIN, UNDERWATER
Learning topic hierarchies
(Blei, Griffiths, Jordan, &amp; Tenenbaum, 2004)
Syntax and semantics from statistics
Factorization of language based on statistical
dependency patterns:
– long-range, document-specific dependencies →
semantics: probabilistic topics
– short-range dependencies constant across all
documents → syntax: probabilistic regular grammar
(Figure: the topic model — q generating topic
assignments z and observed words w — composed with a
hidden sequence of syntactic classes x.)
(Griffiths, Steyvers, Blei, &amp; Tenenbaum, submitted)
Example: generating “THE LOVE OF RESEARCH …”

(Figure, redrawn incrementally as each word is generated:
– syntactic class x = 1 emits from the semantic topics, with P(z = 1) = 0.4 over HEART, LOVE, SOUL, TEARS, JOY (each 0.2) and P(z = 2) = 0.6 over SCIENTIFIC, KNOWLEDGE, WORK, RESEARCH, MATHEMATICS (each 0.2)
– class x = 2 emits function words: OF 0.6, FOR 0.3, BETWEEN 0.1
– class x = 3 emits determiners: THE 0.6, A 0.3, MANY 0.1
– arrows between classes carry transition probabilities (0.8, 0.7, 0.9, …), forming the probabilistic regular grammar
– the generated string grows: THE → THE LOVE → THE LOVE OF → THE LOVE OF RESEARCH …)
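The composite generative process can be sketched as follows. The emission probabilities are the ones shown in the figure; the class-transition probabilities in `transitions` are illustrative assumptions, since the figure's transition labels are only partly recoverable:

```python
import random

# Semantic class (x = 1) emits from a two-topic mixture; the other classes
# emit function words. Probabilities for emissions are taken from the figure.
topics = {1: (["HEART", "LOVE", "SOUL", "TEARS", "JOY"], [0.2] * 5),
          2: (["SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"],
              [0.2] * 5)}
topic_weights = [0.4, 0.6]                       # P(z = 1), P(z = 2)
function_words = {2: (["OF", "FOR", "BETWEEN"], [0.6, 0.3, 0.1]),
                  3: (["THE", "A", "MANY"], [0.6, 0.3, 0.1])}
# Assumed transition matrix over syntactic classes (illustrative only).
transitions = {3: {1: 1.0}, 1: {2: 0.5, 3: 0.5}, 2: {1: 1.0}}

def generate_sentence(n_words, rng=random):
    x, words = 3, []            # start in the determiner class, as in "THE ..."
    for _ in range(n_words):
        if x == 1:              # semantic class: pick a topic, then a word
            z = rng.choices([1, 2], weights=topic_weights)[0]
            vocab, probs = topics[z]
        else:                   # syntactic class: pick a function word
            vocab, probs = function_words[x]
        words.append(rng.choices(vocab, weights=probs)[0])
        nxt = transitions[x]    # move to the next syntactic class
        x = rng.choices(list(nxt), weights=list(nxt.values()))[0]
    return " ".join(words)

print(generate_sentence(4))
```

Runs of this sketch produce strings like the figure's “THE LOVE OF RESEARCH”: short-range structure comes from the class sequence, long-range content from the topic mixture.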
Semantic categories
– MAP, NORTH, EARTH, SOUTH, POLE, MAPS, EQUATOR, WEST, LINES, EAST, AUSTRALIA, GLOBE, POLES, HEMISPHERE, LATITUDE, PLACES, LAND, WORLD, COMPASS, CONTINENTS
– FOOD, FOODS, BODY, NUTRIENTS, DIET, FAT, SUGAR, ENERGY, MILK, EATING, FRUITS, VEGETABLES, WEIGHT, FATS, NEEDS, CARBOHYDRATES, VITAMINS, CALORIES, PROTEIN, MINERALS
– GOLD, IRON, SILVER, COPPER, METAL, METALS, STEEL, CLAY, ORE, ALUMINUM, MINERAL, MATERIALS, MINE, STONE, MINERALS, POT, MATERIAL, MINING, MINERS, TIN
– CELLS, CELL, ORGANISMS, ALGAE, BACTERIA, MICROSCOPE, MEMBRANE, ORGANISM, FOOD, LIVING, FUNGI, MOLD, NUCLEUS, CELLED, STRUCTURES, STRUCTURE, GREEN, MOLDS
– BEHAVIOR, SELF, INDIVIDUAL, PERSONALITY, RESPONSE, SOCIAL, EMOTIONAL, LEARNING, FEELINGS, PSYCHOLOGISTS, INDIVIDUALS, PSYCHOLOGICAL, EXPERIENCES, ENVIRONMENT, HUMAN, RESPONSES, BEHAVIORS, ATTITUDES, PSYCHOLOGY, PERSON
– DOCTOR, PATIENT, HEALTH, HOSPITAL, MEDICAL, CARE, PATIENTS, NURSE, DOCTORS, MEDICINE, NURSING, TREATMENT, NURSES, PHYSICIAN, HOSPITALS, DR, SICK, ASSISTANT, EMERGENCY, PRACTICE
– BOOK, BOOKS, INFORMATION, LIBRARY, REPORT, PAGE, TITLE, SUBJECT, PAGES, GUIDE, WORDS, ARTICLE, ARTICLES, WORD, FACTS, AUTHOR, REFERENCE, NOTE
– PLANTS, PLANT, LEAVES, SEEDS, SOIL, ROOTS, FLOWERS, WATER, FOOD, GREEN, SEED, STEMS, FLOWER, STEM, LEAF, ANIMALS, ROOT, POLLEN, GROWING, GROW
Syntactic categories
– SAID, THOUGHT, TOLD, SAYS, MEANS, CALLED, CRIED, SHOWS, TELLS, REPLIED, SHOUTED, EXPLAINED, LAUGHED, MEANT, WROTE, SHOWED, BELIEVED, WHISPERED
– THE, HIS, THEIR, YOUR, HER, ITS, MY, OUR, THIS, THESE, A, AN, THAT, NEW, THOSE, EACH, MR, ANY, MRS, ALL
– MORE, SUCH, LESS, MUCH, KNOWN, JUST, BETTER, RATHER, GREATER, HIGHER, LARGER, LONGER, FASTER, EXACTLY, SMALLER, SOMETHING, BIGGER, FEWER, LOWER, ALMOST
– ON, AT, INTO, FROM, WITH, THROUGH, OVER, AROUND, AGAINST, ACROSS, UPON, TOWARD, UNDER, ALONG, NEAR, BEHIND, OFF, ABOVE, DOWN, BEFORE
– GOOD, SMALL, NEW, IMPORTANT, GREAT, LITTLE, LARGE, *, BIG, LONG, HIGH, DIFFERENT, SPECIAL, OLD, STRONG, YOUNG, COMMON, WHITE, SINGLE, CERTAIN
– ONE, SOME, MANY, TWO, EACH, ALL, MOST, ANY, THREE, THIS, EVERY, SEVERAL, FOUR, FIVE, BOTH, TEN, SIX, MUCH, TWENTY, EIGHT
– HE, YOU, THEY, I, SHE, WE, IT, PEOPLE, EVERYONE, OTHERS, SCIENTISTS, SOMEONE, WHO, NOBODY, ONE, SOMETHING, ANYONE, EVERYBODY, SOME, THEN
– BE, MAKE, GET, HAVE, GO, TAKE, DO, FIND, USE, SEE, HELP, KEEP, GIVE, LOOK, COME, WORK, MOVE, LIVE, EAT, BECOME
Statistical language modeling
• Generative models provide
– transparent assumptions about causal process
– opportunities to combine and extend models
• Richer generative models...
– probabilistic context-free grammars
– paragraph or sentence-level dependencies
– more complex semantics
Structure and statistics
• Statistical language modeling
– topic models
• Relational categorization
– attributes and relations
Relational categorization
• Most approaches to categorization in
psychology and machine learning focus on
attributes - properties of objects
– words in titles of CogSci posters
• But… a significant portion of knowledge is
organized in terms of relations
– co-authors on posters
– who talks to whom
(Kemp, Griffiths, &amp; Tenenbaum, 2004)
Attributes and relations
Data: an objects × attributes matrix X
Model: P(X) = ∏_i ∑_{z_i} ∏_k P(x_ik | z_i) P(z_i)
— a mixture model (c.f. Anderson, 1990)
Data: an objects × objects matrix Y
Model: P(Y) = ∑_Z ∏_{i,j} P(y_ij | z_i, z_j) ∏_i P(z_i)
— a stochastic blockmodel
Stochastic blockmodels
• For any pair of objects (i, j), the probability of a
relation is determined by their classes (z_i, z_j):

               To type j
From type i    λ_11  λ_12  λ_13
               λ_21  λ_22  λ_23
               λ_31  λ_32  λ_33

Each entity has a type, collected in Z; the link
probabilities form the matrix Λ.

P(Z, Λ | Y) ∝ P(Y | Z, Λ) P(Z) P(Λ)

• Allows types of objects and class probabilities
to be learned from data
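A minimal sketch of the blockmodel's generative process and likelihood, assuming two types and illustrative link probabilities (the matrix `eta` here plays the role of Λ; all numbers are made up):

```python
import math
import random

random.seed(0)
eta = [[0.9, 0.1],
       [0.1, 0.8]]                 # link probability for each (from-type, to-type)
z = [0, 0, 0, 1, 1, 1]             # type assignments for 6 objects

# Generate a relation matrix Y: y_ij is Bernoulli with parameter eta[z_i][z_j].
Y = [[int(random.random() < eta[z[i]][z[j]]) for j in range(6)] for i in range(6)]

def log_likelihood(Y, z, eta):
    """log P(Y | Z, eta): sum of Bernoulli log-probabilities over all pairs."""
    ll = 0.0
    for i in range(len(Y)):
        for j in range(len(Y)):
            p = eta[z[i]][z[j]]
            ll += math.log(p if Y[i][j] else 1 - p)
    return ll

print(log_likelihood(Y, z, eta))
```

Scoring alternative assignments with this likelihood (together with priors P(Z) and P(Λ)) is what lets the types and link probabilities be inferred from the data.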
Stochastic blockmodels
(Figure: example relation matrices with rows and
columns sorted by inferred class; entities of classes
A, B, C, D form visible blocks.)
Categorizing words
• Relational data: word association norms
(Nelson, McEvoy, &amp; Schreiber, 1998)
• 5018 x 5018 matrix of associations
– symmetrized
– all words with more than 10 and fewer than 50 associates
– 2513 nodes, 34716 links
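The preprocessing above can be sketched as follows; the random matrix stands in for the 5018 × 5018 association norms, and all variable names and sizes are illustrative:

```python
import numpy as np

# Stand-in for the word-association matrix: A[i, j] = 1 if word i evokes word j.
rng = np.random.default_rng(0)
A = (rng.random((100, 100)) < 0.15).astype(int)
np.fill_diagonal(A, 0)

# Symmetrize: treat words as linked if either direction has an association.
A_sym = ((A + A.T) > 0).astype(int)

# Keep words whose number of associates falls inside the degree thresholds.
degree = A_sym.sum(axis=1)
keep = (degree > 10) & (degree < 50)
idx = keep.nonzero()[0]
A_sub = A_sym[np.ix_(idx, idx)]        # submatrix over the retained words

print(A_sub.shape, int(A_sub.sum()) // 2, "links")
```

Applied to the real norms, this kind of filtering yields the 2513-node, 34716-link graph described above.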
Categorizing words
– BAND, INSTRUMENT, BLOW, HORN, FLUTE, BRASS, GUITAR, PIANO, TUBA, TRUMPET
– TIE, COAT, SHOES, ROPE, LEATHER, SHOE, HAT, PANTS, WEDDING, STRING
– SEW, MATERIAL, WOOL, YARN, WEAR, TEAR, FRAY, JEANS, COTTON, CARPET
– WASH, LIQUID, BATHROOM, SINK, CLEANER, STAIN, DRAIN, DISHES, TUB, SCRUB
Categorizing actors
• Internet Movie Database (IMDB) data, from
the start of cinema to 1960 (Jeremy Kubica)
• Relational data: collaboration
• 5000 x 5000 matrix of most prolific actors
– all actors with more than 1 and fewer than 400 collaborators
– 2275 nodes, 204761 links
Categorizing actors
– Germany → UK: Albert Lieven, Karel Stepanek, Walter Rilla, Anton Walbrook
– British comedy: Moore Marriott, Laurence Hanray, Gus McNaughton, Gordon Harker, Helen Haye, Alfred Goddard, Morland Graham, Margaret Lockwood, Hal Gordon, Bromley Davenport
– Italian: Gino Cervi, Enrico Glori, Paolo Stoppa, Bernardi Nerio, Amedeo Nazzari, Gina Lollobrigida, Aldo Silvani, Franco Interlenghi, Guido Celano
– US Westerns: Archie Ricks, Helen Gibson, Oscar Gahan, Buck Moulton, Buck Connors, Clyde McClary, Barney Beasley, Buck Morgan, Tex Phelps, George Sowards
Structure and statistics
• Bayesian approach allows us to specify
structured probabilistic models
• Explore novel representations and domains
– topics for semantic representation
– relational categorization
• Use powerful methods for inference,
developed in statistics and machine learning
Other methods and tools...
• Inference algorithms
– belief propagation
– dynamic programming
– the EM algorithm and variational methods
– Markov chain Monte Carlo
• More complex models
– Dirichlet processes and Bayesian non-parametrics
– Gaussian processes and kernel methods
Reading list at http://www.bayesiancognition.com
Taking stock
Bayesian models of inductive learning
• Inductive leaps can be explained with
hierarchical Theory-based Bayesian models:
Domain Theory → Structural Hypotheses → Data, each
arrow a probabilistic generative model; Bayesian
inference runs in the reverse direction, from data
back to hypotheses and theories.
Bayesian models of inductive learning
• Inductive leaps can be explained with
hierarchical Theory-based Bayesian models:
(Figure: a theory T generates multiple structures S,
each of which generates sets of data D.)
Bayesian models of inductive learning
• Inductive leaps can be explained with
hierarchical Theory-based Bayesian models.
• What the approach offers:
– Strong quantitative models of generalization
behavior.
– Flexibility to model different patterns of reasoning
that arise in different tasks and domains, using
differently structured theories, but the same
general-purpose Bayesian engine.
– A framework for explaining why inductive
generalization works, and where knowledge comes
from as well as how it is used.
Bayesian models of inductive learning
• Inductive leaps can be explained with
hierarchical Theory-based Bayesian models.
• Challenges:
– Theories are hard.
Bayesian models of inductive learning
• Inductive leaps can be explained with
hierarchical Theory-based Bayesian models:
• The interaction between structure and
statistics is crucial.
– How structured knowledge supports statistical
learning, by constraining hypothesis spaces.
– How statistics supports reasoning with and
learning structured knowledge.
– How complex structures can grow from data,
rather than being fully specified in advance.
```