Bayesian models of inductive learning Tom Griffiths Josh Tenenbaum

Bayesian models of
inductive learning
Tom Griffiths (UC Berkeley)
Josh Tenenbaum (MIT)
Charles Kemp (MIT)
What to expect
• What you’ll get out of this tutorial:
– Our view of what Bayesian models have to offer
cognitive science.
– In-depth examples of basic and advanced models: how
the math works & what it buys you.
– Some (not extensive) comparison to other approaches.
– Opportunities to ask questions.
• What you won’t get:
– Detailed, hands-on how-to.
– Where you can learn more:
• http://bayesiancognition.com
• Trends in Cognitive Sciences, July 2006, special issue on
“Probabilistic Models of Cognition”.
Outline
• Morning
– Introduction: Why Bayes? (Josh)
– Basics of Bayesian inference (Josh)
– Graphical models, causal inference and learning
(Tom)
• Afternoon
– Hierarchical Bayesian models, property induction,
and learning domain structures (Charles)
– Methods of approximate learning and inference,
probabilistic models of semantic memory (Tom)
Why Bayes?
• The problem of induction
– How does the mind form inferences, generalizations, models or
theories about the world from impoverished data?
• Induction is ubiquitous in cognition
– Vision (+ audition, touch, or other perceptual modalities)
– Language (understanding, production)
– Concepts (semantic knowledge, “common sense”)
– Causal learning and reasoning
– Decision-making and action (production, understanding)
• Bayes gives a general framework for explaining
how induction can work in principle, and
perhaps, how it does work in the mind….
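Grammar G:
S → NP VP
NP → Det [ Adj ] Noun [ RelClause ]
RelClause → [ Rel ] NP V
VP → VP NP
VP → Verb
Grammar G → Phrase structure S, with probability P(S | G)
Phrase structure S → Utterance U, with probability P(U | S)
$$P(S \mid U, G) \propto P(U \mid S) \times P(S \mid G)$$
(Bottom-up: P(U | S). Top-down: P(S | G).)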
“Universal Grammar”
Hierarchical phrase structure
grammars (e.g., CFG, HPSG, TAG)
P(grammar | UG)
Grammar
P(phrase structure | grammar)
Phrase structure
P(utterance | phrase structure)
Utterance
P(speech | utterance)
Speech signal
(cf. Chater and Manning, 2006)
S → NP VP
NP → Det [ Adj ] Noun [ RelClause ]
RelClause → [ Rel ] NP V
VP → VP NP
VP → Verb
The approach
• Key concepts
– Inference in probabilistic generative models
– Hierarchical probabilistic models, with inference at all levels of
abstraction
– Structured knowledge representations: graphs, grammars,
predicate logic, schemas, theories
– Flexible structures, with complexity constrained by Bayesian
Occam’s razor
– Approximate methods of learning and inference: Expectation-Maximization (EM), Markov chain Monte Carlo (MCMC)
• Much recent progress!
– Computational resources to implement and test models that we could
dream up but not realistically imagine working with
– New theoretical tools let us develop models that we could not clearly
conceive of before.
Vision as probabilistic parsing
(Han and Zhu, 2006)
Word learning on planet Gazoob
“tufa”
“tufa”
“tufa”
Can you pick out the tufas?
Learning word meanings
Principles: Whole-object principle, Shape bias, Taxonomic principle, Contrast principle, Basic-level bias
Structure
Data
Causal learning and reasoning
Principles
Structure
Data
Goal-directed action
(production and comprehension)
(Wolpert et al., 2003)
Marr’s Three Levels of Analysis
• Computation:
“What is the goal of the computation, why is it
appropriate, and what is the logic of the
strategy by which it can be carried out?”
• Algorithm:
Cognitive psychology
• Implementation:
Neurobiology
Alternative approaches to
inductive learning and inference
• Associative learning
• Connectionist networks
• Similarity to examples
• Toolkit of simple heuristics
• Constraint satisfaction
• Analogical mapping
Summary: Why Bayes?
• A unifying framework for explaining cognition.
– How people can learn so much from such limited data.
– Strong quantitative models with minimal ad hoc assumptions.
– Why algorithmic-level models work the way they do.
• A framework for understanding how structured
knowledge and statistical inference interact.
– How structured knowledge guides statistical inference, and
may itself be acquired through statistical means.
– What forms knowledge takes, at multiple levels of
abstraction.
– What knowledge must be innate, and what can be learned.
– How flexible knowledge structures may grow as required by
the data, with complexity controlled by Occam’s razor.
Outline
• Morning
– Introduction: Why Bayes? (Josh)
– Basics of Bayesian inference (Josh)
– Graphical models, causal inference and learning
(Tom)
• Afternoon
– Hierarchical Bayesian models, property induction,
and learning domain structures (Charles)
– Methods of approximate learning and inference,
probabilistic models of semantic memory (Tom)
Bayes’ rule
For any hypothesis h and data d,
$$p(h \mid d) = \frac{p(d \mid h)\, p(h)}{\sum_{h' \in H} p(d \mid h')\, p(h')}$$
– p(h | d): posterior probability
– p(d | h): likelihood
– p(h): prior probability
– Denominator: sum over space of alternative hypotheses
Bayesian inference
• Bayes’ rule:
$$P(h \mid d) = \frac{P(h)\, P(d \mid h)}{\sum_{h_i} P(h_i)\, P(d \mid h_i)}$$
• An example
– Data: John is coughing
– Some hypotheses:
1. John has a cold
2. John has emphysema
3. John has a stomach flu
– Prior favors 1 and 3 over 2
– Likelihood P(d|h) favors 1 and 2 over 3
– Posterior P(h|d) favors 1 over 2 and 3
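The coughing example can be sketched numerically. The slide gives only the qualitative ordering of priors and likelihoods, so every number below is a hypothetical value chosen to respect that ordering; the function name `posterior` is likewise just an illustration.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' rule over a finite set of hypotheses."""
    # Unnormalized posterior: prior times likelihood for each hypothesis.
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    # Normalizing constant: sum over all alternative hypotheses.
    evidence = sum(joint.values())
    return {h: joint[h] / evidence for h in joint}

# Hypothetical numbers (not from the tutorial):
# the prior favors cold and stomach flu over emphysema,
# the likelihood of coughing favors cold and emphysema over stomach flu.
priors = {"cold": 0.45, "emphysema": 0.10, "stomach flu": 0.45}
likelihoods = {"cold": 0.8, "emphysema": 0.8, "stomach flu": 0.1}

post = posterior(priors, likelihoods)
for h, p in post.items():
    print(f"P({h} | coughing) = {p:.3f}")
```

With these assumed values, only "cold" is favored by both the prior and the likelihood, so it ends up with the highest posterior probability.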
Coin flipping
• Basic Bayes
– data = HHTHT or HHHHH
– compare two simple hypotheses:
P(H) = 0.5 vs. P(H) = 1.0
• Parameter estimation (Model fitting)
– compare many hypotheses in a parameterized family
P(H) = q : Infer q
• Model selection
– compare qualitatively different hypotheses, often
varying in complexity:
P(H) = 0.5 vs. P(H) = q
Coin flipping
HHTHT
HHHHH
What process produced these sequences?
Comparing two simple hypotheses
• Contrast simple hypotheses:
– h1: “fair coin”, P(H) = 0.5
– h2:“always heads”, P(H) = 1.0
• Bayes’ rule:
$$P(h \mid d) = \frac{P(h)\, P(d \mid h)}{\sum_{h_i} P(h_i)\, P(d \mid h_i)}$$
• With two hypotheses, use odds form
Comparing two simple hypotheses
$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$$
D: HHTHT
H1, H2: “fair coin”, “always heads”
P(D|H1) = 1/2^5     P(H1) = ?
P(D|H2) = 0         P(H2) = 1 - ?
Comparing two simple hypotheses
$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$$
D: HHTHT
H1, H2: “fair coin”, “always heads”
P(D|H1) = 1/2^5     P(H1) = 999/1000
P(D|H2) = 0         P(H2) = 1/1000
$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{1/32}{0} \times \frac{999}{1} = \infty$$
Comparing two simple hypotheses
$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$$
D: HHHHH
H1, H2: “fair coin”, “always heads”
P(D|H1) = 1/2^5     P(H1) = 999/1000
P(D|H2) = 1         P(H2) = 1/1000
$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{1/32}{1} \times \frac{999}{1} \approx 30$$
Comparing two simple hypotheses
$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$$
D: HHHHHHHHHH
H1, H2: “fair coin”, “always heads”
P(D|H1) = 1/2^10    P(H1) = 999/1000
P(D|H2) = 1         P(H2) = 1/1000
$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{1/1024}{1} \times \frac{999}{1} \approx 1$$
The role of intuitive theories
The fact that HHTHT looks representative of
a fair coin and HHHHH does not reflects our
implicit theories of how the world works.
– Easy to imagine how a trick all-heads coin
could work: high prior probability.
– Hard to imagine how a trick “HHTHT” coin
could work: low prior probability.
Coin flipping
• Basic Bayes
– data = HHTHT or HHHHH
– compare two hypotheses:
P(H) = 0.5 vs. P(H) = 1.0
• Parameter estimation (Model fitting)
– compare many hypotheses in a parameterized family
P(H) = q : Infer q
• Model selection
– compare qualitatively different hypotheses, often
varying in complexity:
P(H) = 0.5 vs. P(H) = q
Parameter estimation
• Assume data are generated from a
parameterized model:
[Graphical model: q → d1, d2, d3, d4]
P(H) = q
• What is the value of q ?
– each value of q is a hypothesis H
– requires inference over infinitely many hypotheses
Model selection
• Assume hypothesis space of possible models:
[Graphical models]
Fair coin: P(H) = 0.5, generating d1, d2, d3, d4
P(H) = q: q → d1, d2, d3, d4
Hidden Markov model: s1 → s2 → s3 → s4, each si → di, with si ∈ {Fair coin, Trick coin}
• Which model generated the data?
– requires summing out hidden variables
– requires some form of Occam’s razor to trade off
complexity with fit to the data.
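Summing out hidden variables in the hidden Markov model can be sketched with the forward algorithm. The slide gives no parameter values, so the trick coin's P(H) = 1.0, the probability 0.9 of staying in the same state, and the uniform initial distribution are all assumptions made for illustration.

```python
def sequence_likelihood(seq, p_heads, stay=0.9):
    """P(seq) under a hidden Markov model, with hidden states summed out.

    p_heads maps each hidden state to its P(H); `stay` is the assumed
    probability of keeping the same state from one flip to the next.
    """
    states = list(p_heads)

    def emit(state, flip):
        return p_heads[state] if flip == "H" else 1.0 - p_heads[state]

    def trans(prev, nxt):
        if prev == nxt:
            return stay
        return (1.0 - stay) / (len(states) - 1)

    # alpha[s] = P(flips so far, current state = s); uniform initial state.
    alpha = {s: emit(s, seq[0]) / len(states) for s in states}
    for flip in seq[1:]:
        alpha = {
            s: emit(s, flip) * sum(alpha[r] * trans(r, s) for r in states)
            for s in states
        }
    return sum(alpha.values())

# Hypothetical emission probabilities for the two hidden states.
P_HEADS = {"fair": 0.5, "trick": 1.0}

for seq in ["HHTHT", "HHHHH"]:
    print(seq, sequence_likelihood(seq, P_HEADS))
```

The forward recursion costs time linear in the sequence length, whereas enumerating every hidden state path grows exponentially.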
Parameter estimation vs. Model selection
across learning and development
• Causality: learning the strength of a relation vs. learning
the existence and form of a relation
• Language acquisition: learning a speaker's accent, or
frequencies of different words vs. learning a new tense or
syntactic rule (or learning a new language, or the existence
of different languages)
• Concepts: learning what horses look like vs. learning that
there is a new species (or learning that there are species)
• Intuitive physics: learning the mass of an object vs.
learning about gravity or angular momentum
• Intuitive psychology: learning a person’s beliefs or goals
vs. learning that there can be false beliefs, or that visual
access is valuable for establishing true beliefs
A hierarchical learning framework
[Hierarchy: model M → parameters w → data D]
Parameter estimation:
$$p(w \mid D, M) \propto p(D \mid w, M)\, p(w \mid M)$$
A hierarchical learning framework
[Hierarchy: model class C → model M → parameters w → data D]
Model selection:
$$p(M \mid D, C) \propto p(D \mid M)\, p(M \mid C)$$
where
$$p(D \mid M) = \sum_w p(D \mid w, M)\, p(w \mid M)$$
Parameter estimation:
$$p(w \mid D, M) \propto p(D \mid w, M)\, p(w \mid M)$$
Bayesian parameter estimation
• Assume data are generated from a model:
[Graphical model: q → d1, d2, d3, d4]
P(H) = q
• What is the value of q ?
– each value of q is a hypothesis H
– requires inference over infinitely many hypotheses
Some intuitions
• D = 10 flips, with 5 heads and 5 tails.
• q = P(H) on next flip? 50%
• Why? 50% = 5 / (5+5) = 5/10.
• Why? “The future will be like the past”
• Suppose we had seen 4 heads and 6 tails.
• P(H) on next flip? Closer to 50% than to 40%.
• Why? Prior knowledge.
Integrating prior knowledge and data
$$p(q \mid D) = \frac{p(D \mid q)\, p(q)}{p(D)}$$
• Posterior distribution P(q | D) is a probability
density over q = P(H)
• Need to work out likelihood P(D | q ) and
specify prior distribution P(q )
Likelihood and prior
• Likelihood: Bernoulli distribution
$$P(D \mid q) = q^{N_H} (1 - q)^{N_T}$$
– NH: number of heads
– NT: number of tails
• Prior:
$$P(q) \propto\ ?$$
Some intuitions
• D = 10 flips, with 5 heads and 5 tails.
• q = P(H) on next flip? 50%
• Why? 50% = 5 / (5+5) = 5/10.
• Why? Maximum likelihood: $\hat{q} = \arg\max_q P(D \mid q)$
• Suppose we had seen 4 heads and 6 tails.
• P(H) on next flip? Closer to 50% than to 40%.
• Why? Prior knowledge.
A simple method of specifying priors
• Imagine some fictitious trials, reflecting a
set of previous experiences
– strategy often used with neural networks or
building invariance into machine vision.
• e.g., F ={1000 heads, 1000 tails} ~ strong
expectation that any new coin will be fair
• In fact, this is a sensible statistical idea...
Likelihood and prior
• Likelihood: Bernoulli(q ) distribution
$$P(D \mid q) = q^{N_H} (1 - q)^{N_T}$$
– NH: number of heads
– NT: number of tails
• Prior: Beta(FH,FT) distribution
$$P(q) \propto q^{F_H - 1} (1 - q)^{F_T - 1}$$
– FH: fictitious observations of heads
– FT: fictitious observations of tails
Shape of the Beta prior
[Figure: Beta density over q, one panel for each setting]
FH = 0.5, FT = 0.5
FH = 0.5, FT = 2
FH = 2, FT = 0.5
FH = 2, FT = 2
Bayesian parameter estimation
$$P(q \mid D) \propto P(D \mid q)\, P(q) \propto q^{N_H + F_H - 1} (1 - q)^{N_T + F_T - 1}$$
• Posterior is Beta(NH+FH,NT+FT)
– same form as prior!
– expected P(H) = (NH+FH) / (NH+FH+NT+FT)
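The conjugate update amounts to adding observed counts to fictitious counts. A minimal sketch, with function names chosen here for illustration:

```python
def beta_posterior(n_heads, n_tails, f_heads, f_tails):
    """Parameters of the Beta posterior after observing the data."""
    return n_heads + f_heads, n_tails + f_tails

def predictive_heads(n_heads, n_tails, f_heads, f_tails):
    """Expected P(H) on the next flip: the mean of the Beta posterior."""
    a, b = beta_posterior(n_heads, n_tails, f_heads, f_tails)
    return a / (a + b)

# 4 heads and 6 tails observed, under a strong and a weak fair-coin prior.
print(predictive_heads(4, 6, 1000, 1000))  # 1004 / 2010, about 0.4995
print(predictive_heads(4, 6, 3, 3))        # 7 / 16 = 0.4375
```

The stronger the prior (the more fictitious flips), the less the same 10 observed flips move the prediction away from 50%.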
Conjugate priors
• A prior p(q ) is conjugate to a likelihood
function p(D | q ) if the posterior has the same
functional form as the prior.
– Parameter values in the prior can be thought of as a
summary of “fictitious observations”.
– Different parameter values in the prior and
posterior reflect the impact of observed data.
– Conjugate priors exist for many standard models
(e.g., all exponential family models)
Some examples
• e.g., F ={1000 heads, 1000 tails} ~ strong
expectation that any new coin will be fair
• After seeing 4 heads, 6 tails, P(H) on next
flip = 1004 / (1004+1006) = 49.95%
• e.g., F ={3 heads, 3 tails} ~ weak
expectation that any new coin will be fair
• After seeing 4 heads, 6 tails, P(H) on next
flip = 7 / (7+9) = 43.75%
Prior knowledge too weak
But… flipping thumbtacks
• e.g., F ={4 heads, 3 tails} ~ weak expectation
that tacks are slightly biased towards heads
• After seeing 2 heads, 0 tails, P(H) on next flip
= 6 / (6+3) = 67%
• Some prior knowledge is always necessary to
avoid jumping to hasty conclusions...
• Suppose F = { }: After seeing 1 head, 0 tails,
P(H) on next flip = 1 / (1+0) = 100%
Origin of prior knowledge
• Tempting answer: prior experience
• Suppose you have previously seen 2000
coin flips: 1000 heads, 1000 tails
Problems with simple empiricism
• Haven’t really seen 2000 coin flips, or any flips of a
thumbtack
– Prior knowledge is stronger than raw experience justifies
• Haven’t seen exactly equal number of heads and tails
– Prior knowledge is smoother than raw experience justifies
• Should be a difference between observing 2000 flips
of a single coin versus observing 10 flips each for 200
coins, or 1 flip each for 2000 coins
– Prior knowledge is more structured than raw experience
A simple theory
• “Coins are manufactured by a standardized
procedure that is effective but not perfect, and
symmetric with respect to heads and tails.
Tacks are asymmetric, and manufactured to
less exacting standards.”
– Justifies generalizing from previous coins to the
present coin.
– Justifies smoother and stronger prior than raw
experience alone.
– Explains why seeing 10 flips each for 200 coins is
more valuable than seeing 2000 flips of one coin.
A hierarchical Bayesian model
[Graphical model: physical knowledge → FH, FT → q1, q2, ..., q200 (Coin 1, Coin 2, ..., Coin 200), each qi → d1, d2, d3, d4]
Coins: q ~ Beta(FH,FT)
• Qualitative physical knowledge (symmetry) can
influence estimates of continuous parameters (FH, FT).
• Explains why 10 flips of 200 coins are better than 2000
flips of a single coin: more informative about FH, FT.
Stability versus Flexibility
• Can all domain knowledge be represented
with conjugate priors?
• Suppose you flip a coin 25 times and get all
heads. Something funny is going on …
• But with F ={1000 heads, 1000 tails},
P(heads) on next flip = 1025 / (1025+1000)
= 50.6%. Looks like nothing unusual.
• How do we balance stability and flexibility?
– Stability: 6 heads, 4 tails → q ~ 0.5
– Flexibility: 25 heads, 0 tails → q ~ 1
A hierarchical Bayesian model
fair/unfair?
• Higher-order hypothesis: is this
coin fair or unfair?
• Example probabilities:
– P(fair) = 0.99
– P(q |fair) is Beta(1000,1000)
– P(q |unfair) is Beta(1,1)
• 25 heads in a row propagates up,
affecting q and then P(fair|D)
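[Graphical model: FH, FT → q → d1, d2, d3, d4]
$$\frac{P(\text{fair} \mid 25\ \text{heads})}{P(\text{unfair} \mid 25\ \text{heads})} = \frac{P(25\ \text{heads} \mid \text{fair})\, P(\text{fair})}{P(25\ \text{heads} \mid \text{unfair})\, P(\text{unfair})} = 9 \times 10^{-5}$$
$$P(D \mid \text{fair}) = \int_0^1 P(D \mid q)\, p(q \mid \text{fair})\, dq$$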
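The fair/unfair comparison can be checked numerically. With a Beta(FH, FT) prior, the integral of P(D | q) over q has the closed form B(NH + FH, NT + FT) / B(FH, FT), where B is the Beta function; this closed form is a standard result rather than something stated on the slide, and the function names are illustrative.

```python
from math import exp, lgamma

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal(n_heads, n_tails, f_heads, f_tails):
    """log P(D) for one specific flip sequence under a Beta(f_heads, f_tails) prior."""
    return log_beta(n_heads + f_heads, n_tails + f_tails) - log_beta(f_heads, f_tails)

def posterior_odds_fair(n_heads, n_tails, p_fair=0.99):
    """P(fair | D) / P(unfair | D) with the example priors from the slide."""
    log_fair = log_marginal(n_heads, n_tails, 1000, 1000)  # P(q | fair) is Beta(1000, 1000)
    log_unfair = log_marginal(n_heads, n_tails, 1, 1)      # P(q | unfair) is Beta(1, 1)
    return exp(log_fair - log_unfair) * p_fair / (1.0 - p_fair)

print(posterior_odds_fair(25, 0))
```

For 25 heads in a row this gives odds of roughly 9 × 10^-5, so the unfair hypothesis dominates despite its prior of only 0.01.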

Summary: Bayesian parameter estimation
• Learning the parameters of a generative
model as Bayesian inference.
• Conjugate priors
– an elegant way to represent simple kinds of prior
knowledge.
• Hierarchical Bayesian models
– integrate knowledge across instances of a system,
or different systems within a domain.
– can represent richer, more abstract knowledge
Some questions
• Learning isn’t just about parameter
estimation
– How do we learn the functional form of a
variable’s distribution?
– How do we learn model structure, or theories
with the expressiveness of predicate logic?
• Can we “grow” levels of abstraction?
A hierarchical learning framework
[Hierarchy: model class C → model M → parameters w → data D]
Model selection:
$$p(M \mid D, C) \propto p(D \mid M)\, p(M \mid C)$$
where
$$p(D \mid M) = \sum_w p(D \mid w, M)\, p(w \mid M)$$
Model fitting:
$$p(w \mid D, M) \propto p(D \mid w, M)\, p(w \mid M)$$
Bayesian model selection
[Graphical models: Fair coin, P(H) = 0.5 (generating d1, d2, d3, d4) vs. P(H) = q (q → d1, d2, d3, d4)]
• Which provides a better account of the data:
the simple hypothesis of a fair coin, or the
complex hypothesis that P(H) = q ?
Comparing simple and complex hypotheses
• P(H) = q is more complex than P(H) = 0.5 in
two ways:
– P(H) = 0.5 is a special case of P(H) = q
– for any observed sequence D, we can choose q
such that D is more probable than if P(H) = 0.5
Comparing simple and complex hypotheses
$$P(D \mid q) = q^n (1 - q)^{N - n}$$
[Plot; vertical axis: Probability; shown for q = 0.5]
D = HHHHH
Comparing simple and complex hypotheses
$$P(D \mid q) = q^n (1 - q)^{N - n}$$
[Plot; vertical axis: Probability; shown for q = 1.0 and q = 0.5]
D = HHHHH
Comparing simple and complex hypotheses
$$P(D \mid q) = q^n (1 - q)^{N - n}$$
[Plot; vertical axis: Probability; shown for q = 0.6 and q = 0.5]
D = HHTHT
Comparing simple and complex hypotheses
• P(H) = q is more complex than P(H) = 0.5 in
two ways:
– P(H) = 0.5 is a special case of P(H) = q
– for any observed sequence D, we can choose q
such that D is more probable than if P(H) = 0.5
• How can we deal with this?
– Some version of Occam’s razor?
– Bayes: automatic version of Occam’s razor
follows from the “law of conservation of belief”.
Comparing simple and complex hypotheses
$$\frac{P(h_1 \mid D)}{P(h_0 \mid D)} = \frac{P(D \mid h_1)}{P(D \mid h_0)} \times \frac{P(h_1)}{P(h_0)}$$
$$P(D \mid h_0) = (1/2)^n (1 - 1/2)^{N - n} = 1/2^N$$
$$P(D \mid h_1) = \int_0^1 P(D \mid q)\, p(q \mid h_1)\, dq$$
The “evidence” or “marginal likelihood”: The
probability that randomly selected parameters
from the prior would generate the data.
[Plot: $\log \frac{P(D \mid h_1)}{P(D \mid h_0)}$; axis label: q]
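The ratio of marginal likelihoods can be computed in closed form once a prior p(q | h1) is fixed. The slide does not specify that prior; the sketch below assumes it is uniform on [0, 1], in which case the integral equals n!(N-n)!/(N+1)!. The function names are illustrative.

```python
from math import factorial

def evidence_h0(n_heads, n_total):
    """P(D | h0): every sequence of length N has probability 1/2^N."""
    return 0.5 ** n_total

def evidence_h1(n_heads, n_total):
    """P(D | h1) under an assumed uniform prior on q.

    Integral of q^n (1-q)^(N-n) over [0, 1] = n! (N-n)! / (N+1)!
    """
    return (factorial(n_heads) * factorial(n_total - n_heads)
            / factorial(n_total + 1))

def bayes_factor(n_heads, n_total):
    """P(D | h1) / P(D | h0); values above 1 favor the flexible model."""
    return evidence_h1(n_heads, n_total) / evidence_h0(n_heads, n_total)

print(bayes_factor(5, 5))  # D = HHHHH: (1/6) / (1/32), about 5.33
print(bayes_factor(3, 5))  # D = HHTHT: (1/60) / (1/32), about 0.53
```

Under this assumption the flexible model is preferred for HHHHH but penalized for HHTHT, even though some value of q always fits the data at least as well as q = 0.5.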
Bayesian Occam’s Razor
[Plot: p(D = d | M) for models M1 and M2, over all possible data sets d]
For any model M,
$$\sum_{\text{all } d \in D} p(D = d \mid M) = 1$$
Law of “conservation of belief”: A model that can predict many
possible data sets must assign each of them low probability.
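Conservation of belief can be verified by brute force for coin flips: enumerate every sequence of a given length and sum its probability under each model. The flexible model below assumes a uniform prior on q, which is an illustrative choice rather than something the slide specifies.

```python
from itertools import product
from math import factorial

N = 5  # sequence length

def p_fair(seq):
    """p(D = d | fair coin): the same for every sequence."""
    return 0.5 ** len(seq)

def p_flexible(seq):
    """p(D = d | P(H) = q), with q integrated out under an assumed uniform prior."""
    n = seq.count("H")
    return factorial(n) * factorial(len(seq) - n) / factorial(len(seq) + 1)

sequences = ["".join(flips) for flips in product("HT", repeat=N)]
total_fair = sum(p_fair(s) for s in sequences)
total_flexible = sum(p_flexible(s) for s in sequences)

print(total_fair, total_flexible)
print(p_fair("HHTHT"), p_flexible("HHTHT"))
print(p_fair("HHHHH"), p_flexible("HHHHH"))
```

Both totals come out to 1 (up to floating-point rounding). The flexible model puts more mass on extreme sequences such as HHHHH, so it must put less on typical-looking ones such as HHTHT.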
Occam’s Razor in curve fitting

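$$\sum_{\text{all } d \in D} p(D = d \mid M) = 1$$
[Figure: curve fits for models M1, M2, M3, and a plot of p(D = d | M) for M1, M2, M3 over possible data sets, with the observed data D marked]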
M1: A model that is too simple is unlikely to generate
the data.
M3: A model that is too complex can generate many
possible data sets, so it is unlikely to generate
this particular data set at random.
$$p(D \mid M) = p(y \mid x, M) = \int p(y \mid x, q, M)\, p(q \mid M)\, dq$$
[assume Gaussian parameter priors, Gaussian likelihoods (noise)]
For best fitting version of each model:
Model   Prior                     Likelihood
M1      high                      low
M2      medium                    high
M3      very very very very low   very high
(assuming Gaussian noise, and Gaussian priors on parameters)
(Ghahramani)
Hierarchical Bayesian learning with
flexibly structured models
• Learning context-free grammars for natural language
(Stolcke & Omohundro; Griffiths and Johnson; Perfors et al.).
• Learning complex concepts.
“fruit”
“fruit” <= or(
and(
color > 0.2, color < 0.4,
size > 1, size < 4 ),
and(
color > 0.5, color < 0.65,
size > 2, size < 7),
…)
Navarro (2006):
Nonparametric model
Goodman et al. (2006):
Probabilistic context-free grammar
for rule-based concepts
The “blessing of abstraction”
• Often easier to learn at higher levels of abstraction
– Easier to learn that you have a biased coin than to learn
its bias.
– Easier to learn causal structure than causal strength.
– Easier to learn that you are hearing two languages (vs.
one), or to learn that language has a hierarchical phrase
structure, than to learn how any one language works.
• Why? Hypothesis space gets smaller as you go up.
– But the total hypothesis space gets bigger when we add
levels of abstraction (e.g., model selection).
– Can make better (more confident, more accurate)
predictions by increasing the size of the hypothesis
space, if we introduce good inductive biases.
Summary
• Three kinds of Bayesian inference
– Comparing two simple hypotheses
– Parameter estimation
   • The importance and subtlety of prior knowledge
– Model selection
   • Bayesian Occam’s razor, the blessing of abstraction
• Key concepts
– Probabilistic generative models
– Hierarchies of abstraction, with statistical
inference at all levels
– Flexibly structured representations