Outline • Bayesian parameter estimation • Hierarchical Bayesian models • Metropolis-Hastings

advertisement
Outline
• Bayesian parameter estimation
• Hierarchical Bayesian models
• Metropolis-Hastings
– A more general approach to MCMC
Coin flipping
• Comparing two simple hypotheses
– P(H) = 0.5 vs. P(H) = 1.0
• Comparing simple and complex hypotheses
– P(H) = 0.5 vs. P(H) = θ
• Comparing infinitely many hypotheses
– P(H) = θ : Infer θ
Comparing infinitely many hypotheses
• Assume data are generated from a model:
θ
d1
d2
d3
d4
P(H) = θ
• What is the value of θ ?
– each value of θ is a hypothesis H
– requires inference over infinitely many hypotheses
Comparing infinitely many hypotheses
• Flip a coin 10 times and see 5 heads, 5 tails.
• P(H) on next flip? 50%
• Why? 50% = 5 / (5+5) = 5/10.
• “Future will be like the past.”
• Suppose we had seen 4 heads and 6 tails.
• P(H) on next flip? Closer to 50% than to 40%.
• Why? Prior knowledge.
Integrating prior knowledge and data
P( H ) P( D | H )
P( H | D) =
P( D)
P(θ | D) ∝ P(D | θ ) P(θ )
• Posterior distribution P(p | D) is a probability
density over θ = P(H)
• Need to work out likelihood P(D | θ ) and
specify prior distribution P(θ )
Likelihood and prior
• Likelihood:
P(D | θ ) = θ NH (1-θ ) NT
– NH: number of heads
– NT: number of tails
• Prior:
?
P(θ ) ∝ pFH-1 (1-p)FT-1
A simple method of specifying priors
• Imagine some fictitious trials, reflecting a
set of previous experiences
– strategy often used with neural networks or
building invariance into stat. machine vision.
• e.g., F ={1000 heads, 1000 tails} ~ strong
expectation that any new coin will be fair
• In fact, this is a sensible statistical idea...
Likelihood and prior
• Likelihood:
P(D | θ ) = θ NH (1-θ ) NT
– NH: number of heads
– NT: number of tails
• Prior:
P(θ ) ∝ θ FH-1 (1-θ ) FT-1 Beta(FH,FT)
– FH: fictitious observations of heads
– FT: fictitious observations of tails
Likelihood and prior
• Likelihood:
P(D | θ ) = θ NH (1-θ ) NT
– NH: number of heads
– NT: number of tails
• Prior:
P(θ ) =
Γ(FH+FT)
Γ(FH) Γ(FT)
θ FH-1 (1-θ ) FT-1
– FH: fictitious observations of heads
– FT: fictitious observations of tails
Likelihood and prior
• Likelihood:
P(D | θ ) = θ NH (1-θ ) NT
– NH: number of heads
– NT: number of tails
1
∫0
• Prior:
1 Γ(FH+FT)
0 Γ(FH) Γ(FT)
P(θ ) dθ = ∫
θ FH-1 (1-θ ) FT-1dθ = 1
A very useful integral
Likelihood and prior
• Likelihood:
P(D | θ ) = θ NH (1-θ ) NT
– NH: number of heads
– NT: number of tails
1
∫0
• Prior:
1 Γ(FH+FT)
0 Γ(FH) Γ(FT)
P(θ ) dθ = ∫
θ FH-1 (1-θ ) FT-1dθ = 1
Also useful: Γ(x) = (x-1)!
Γ(x+1) = x Γ(x)
Shape of the Beta prior
FH = 0.5, FT = 0.5
2
1
0
Probability
Probability
3
8
7
6
5
4
3
2
1
0
0
0.25
0.5
X
0.75
1
FH = 2, FT = 0.5
0
0.25
0.5
0.75
1
8
7
6
5
4
3
2
1
0
FH = 0.5, FT = 2
2
Probability
Probability
4
0
0.25
0.5
X
0.75
1
FH = 2, FT = 2
1.5
1
0.5
0
X
0
0.25
X
Figure by MIT OCW.
0.5
0.75
1
Shape of the Beta prior
FH = FT = 1
FH = FT = 3
40
30
FH = FT = 1000
P (p) 20
10
0
0
0.1
0.2
0.3
0.4
0.5
Figure by MIT OCW.
0.6
0.7
0.8
0.9
1
Bayesian parameter learning
• Likelihood: Bernoulli(θ )
P(D | θ ) = θ NH (1-θ ) NT
– NH, NT: number of heads, tails observed
• Prior: Beta(FH,FT)
P(θ ) ∝ θ FH-1 (1-θ ) FT-1
– FH, FT: fictitious observations of heads, tails
• Posterior: Beta(NH+FH, NT+FT)
P(θ | D) ∝ θ NH+FH-1 (1-θ ) NT+FT-1
=
Γ(NH+FH+NT+FT)
Γ(NH+FH) Γ(NT+FT)
θ NH+FH-1 (1-θ ) NT+FT-1
Bayesian parameter learning
θ
D = NH,NT
d1
d2
d3
d4
• Likelihood: Bernoulli(θ )
P(D | θ ) = θ NH (1-θ ) NT
– NH, NT: number of heads, tails observed
Bayesian parameter learning
FH,FT
θ
D = NH,NT
d1
d2
d3
d4
• Prior: Beta(FH,FT)
P(θ | FH, FT) ∝ θ FH-1 (1-θ ) FT-1
– FH, FT: fictitious observations of heads, tails
Bayesian parameter learning
FH,FT
θ
D = NH,NT
d1
d2
d3
d4
• Posterior: Beta(NH+FH, NT+FT)
P(θ | D, FH, FT) ∝ θ NH+FH-1 (1-θ ) NT+FT-1
=
Γ(NH+FH+NT+FT)
Γ(NH+FH) Γ(NT+FT)
θ NH+FH-1 (1-θ ) NT+FT-1
Conjugate priors
• A prior p(θ ) is conjugate to a likelihood
function p(D | θ ) if the posterior has the same
functional form of the prior.
– Different parameter values in the prior and
posterior reflect the impact of observed data.
– Parameter values in the prior can be thought of as a
summary of “fictitious observations”.
• Exist for many standard distributions
– all exponential family models
– e.g., Beta is conjugate to Bernoulli (coin-flipping)
Bayesian parameter learning
FH,FT
θ
D = NH,NT
d1
d2
d3
d4
H
• Posterior predictive distribution:
P(H|D=NH, NT; FH, FT) = ?
Bayesian parameter learning
FH,FT
θ
D = NH,NT
d1
d2
d3
d4
H
• Posterior predictive distribution:
P(H|D, FH, FT) =
1
∫0 P(H|θ ) P(θ | D, FH, FT) dθ
Bayesian parameter learning
FH,FT
θ
D = NH,NT
d1
d2
d3
d4
H
• Posterior predictive distribution:
P(H|D, FH, FT) =
1
∫0 θ
P(θ | D, FH, FT) dθ
Bayesian parameter learning
FH,FT
θ
D = NH,NT
d1
d2
d3
d4
H
• Posterior predictive distribution:
P(H|D, FH, FT) =
1
∫0
θ
Γ(NH+FH+NT+FT)
Γ(NH+FH) Γ(NT+FT)
θ NH+FH-1 (1-θ ) NT+FT-1 dθ
Bayesian parameter learning
FH,FT
θ
D = NH,NT
d1
d2
d3
d4
H
• Posterior predictive distribution:
P(H|D, FH, FT) =
1
∫
Γ(NH+FH+NT+FT)
Γ(NH+FH) Γ(NT+FT) 0
θ θ NH+FH-1 (1-θ ) NT+FT-1 dθ
Bayesian parameter learning
FH,FT
θ
D = NH,NT
d1
d2
d3
d4
H
• Posterior predictive distribution:
P(H|D, FH, FT) =
1
∫
Γ(NH+FH+NT+FT)
Γ(NH+FH) Γ(NT+FT) 0
θ NH+FH (1-θ ) NT+FT-1 dθ
Bayesian parameter learning
FH,FT
θ
D = NH,NT
d1
d2
d3
d4
H
• Posterior predictive distribution:
P(H|D, FH, FT) =
Γ(x+1) = x Γ(x)
Γ(NH+FH+NT+FT)
Γ(NH+FH+1) Γ(NT+FT)
x
Γ(NH+FH) Γ(NT+FT)
Γ(NH+FH+NT+FT+1)
Bayesian parameter learning
FH,FT
θ
D = NH,NT
d1
d2
d3
d4
H
• Posterior predictive distribution:
P(H|D, FH, FT) =
Γ(NH+FH+NT+FT)
(NH+FH) Γ(NH+FH) Γ(NT+FT)
x
Γ(NH+FH) Γ(NT+FT)
(NH+FH+NT+FT) Γ(NH+FH+NT+FT)
Bayesian parameter learning
FH,FT
θ
D = NH,NT
d1
d2
d3
d4
H
• Posterior predictive distribution:
P(H|D, FH, FT) =
(NH+FH)
(NH+FH+NT+FT)
Some examples
• e.g., F ={1000 heads, 1000 tails} ~ strong
expectation that any new coin will be fair
• After seeing 4 heads, 6 tails, P(H) on next
flip = 1004 / (1004+1006) = 49.95%
• e.g., F ={3 heads, 3 tails} ~ weak
expectation that any new coin will be fair
• After seeing 4 heads, 6 tails, P(H) on next
flip = 7 / (7+9) = 43.75%
Prior knowledge too weak
But… flipping thumbtacks
• e.g., F ={4 heads, 3 tails} ~ weak expectation
that tacks are slightly biased towards heads
• After seeing 2 heads, 0 tails, P(H) on next flip
= 6 / (6+3) = 67%
• Some prior knowledge is always necessary to
avoid jumping to hasty conclusions...
• Suppose F = { }: After seeing 2 heads, 0 tails,
P(H) on next flip = 2 / (2+0) = 100%
Origin of prior knowledge
• Tempting answer: prior experience
• Suppose you have previously seen 2000
coin flips: 1000 heads, 1000 tails
• By assuming all coins (and flips) are alike,
these observations of other coins are as
good as observations of the present coin
– e.g., 200 coins flipped 10 times each.
Hierarchical priors
θ ~ Beta(FH,FT)
FH,FT
Coin 1
d1
Coin 2
θ1
d2
d3
d4
d1
d2
...
θ2
d3
d4
θ200 Coin 200
d1
d2
d3
• Latent structure captures what is common to all
coins, and also their individual variability
d4
Problems with simple empiricism
• Haven’t really seen 2000 coin flips, or any flips of a
thumbtack
– Prior knowledge is stronger than raw experience justifies
• Haven’t seen exactly equal number of heads and tails
– Prior knowledge is smoother than raw experience justifies
• Should be a difference between observing 2000 flips
of a single coin versus observing 10 flips each for 200
coins, or 1 flip each for 2000 coins
– Prior knowledge is more structured than raw experience
A simple theory
• “Coins are manufactured by a standardized
procedure that is effective but not perfect.”
– Justifies generalizing from previous coins to the
present coin.
– Justifies smoother and stronger prior than raw
experience alone.
– Explains why seeing 10 flips each for 200 coins is
more valuable than seeing 2000 flips of one coin.
• “Tacks are asymmetric, and manufactured to
less exacting standards.”
Hierarchical priors
physical knowledge
Coins
FH,FT
Coin 1
d1
Thumbtacks
d2
...
θ1
d3
d4
θ200
Tack 1
d1
FH,FT
...
θ1
d2
d3
d4
• Qualitative beliefs (e.g. symmetry) can influence
estimation of continuous properties (e.g. FH, FT)
θ200
Hierarchical priors
physical knowledge
Coins
FH,FT
Coin 1
d1
Thumbtacks
d2
...
θ1
d3
d4
θ200
Tack 1
d1
FH,FT
...
θ1
d2
d3
θ200
d4
• Explains why 10 flips of 200 coin are better than 2000
flips of a single coin: more informative about FH, FT,
assuming parameters not too large for new kind of coin.
Stability versus Flexibility
• Can all domain knowledge be represented
with conjugate priors?
• Suppose you flip a coin 25 times and get all
heads. Something funny is going on …
• But with F ={1000 heads, 1000 tails},
P(heads) on next flip = 1025 / (1025+1000)
= 50.6%. Looks like nothing unusual.
• How do we balance stability and flexibility?
– Stability: 6 heads, 4 tails
– Flexibility: 25 heads, 0 tails
θ ~ 0.5
θ ~1
Hierarchical priors
• Higher-order hypothesis: is this
coin fair or unfair?
• Example probabilities:
fair/unfair?
FH,FT
– P(fair) = 0.99
– P(θ |fair) is Beta(1000,1000)
– P(θ |unfair) is Beta(1,1)
• 25 heads in a row propagates up,
affecting θ and then P(fair|D)
θ
d1
d2
d3
P(fair|25 heads)
P(25 heads|fair) P(fair) = 9 x 10-5
=
P(unfair|25 heads) P(25 heads|unfair) P(unfair)
1
P ( D | fair ) = ∫ P ( D | θ ) p (θ | fair )dθ
0
d4
Physical knowledge
social knowledge
Hierarchical priors
fair/unfair?
• Higher-order hypothesis: is this
coin fair or unfair?
• Example probabilities:
– P(fair) = 0.99
– P(θ |fair) is Beta(1000,1000)
– P(θ |unfair) is Beta(1,1)
• 25 heads in a row propagates up,
affecting θ and then P(fair|D)
FH,FT
θ
d1
d2
d3
d4
P(fair|25 heads)
P(25 heads|fair) P(fair) = 9 x 10-5
=
P(unfair|25 heads) P(25 heads|unfair) P(unfair)
Summary
• Learning the parameters of a generative
model as Bayesian inference.
• Conjugate priors
– an elegant way to represent simple kinds of prior
knowledge.
• Hierarchical Bayesian models
– integrate knowledge across instances of a system,
or different systems within a domain.
– can incorporate abstract theoretical knowledge.
– inference may get difficult….
Other questions
• Learning isn’t just about parameter
estimation
– How do we learn the functional form of a
variable’s distribution?
– How do we learn model structure? Theories?
• Can we “grow” levels of abstraction?
• How do hierarchical Bayesian models
address the Grue problem? Do we care?
• The “topics” model for semantics as an
example of applying hierarchical Bayesian
modeling to cognition. Probably next time.
Topic models of semantic structure: e.g., Latent
Dirichlet Allocation (Blei, Ng, Jordan)
– Each document in a corpus is associated with a
distribution θ over topics.
– Each topic t is associated with a distribution φ(t)
over words.
Image removed due to copyright considerations. Please see:
Blei, David, Andrew Ng, and Michael Jordan. "Latent Dirichlet Allocation."
Journal of Machine Learning Research 3 (Jan 2003): 993-1022.
A selection of topics (TASA)
JOB
SCIENCE
BALL
FIELD
STORY
MIND
DISEASE
WATER
WORK
STUDY
GAME
MAGNETIC
STORIES
WORLD
BACTERIA
FISH
JOBS
SCIENTISTS
TEAM
MAGNET
TELL
DREAM
DISEASES
SEA
CAREER
SCIENTIFIC FOOTBALL
WIRE
CHARACTER
DREAMS
GERMS
SWIM
KNOWLEDGE
BASEBALL EXPERIENCE
NEEDLE
THOUGHT CHARACTERS
FEVER
SWIMMING
WORK
PLAYERS EMPLOYMENT
CURRENT
AUTHOR
IMAGINATION
CAUSE
POOL
OPPORTUNITIES
RESEARCH
PLAY
COIL
READ
MOMENT
CAUSED
LIKE
WORKING
CHEMISTRY
FIELD
POLES
TOLD
THOUGHTS
SPREAD
SHELL
TRAINING
TECHNOLOGY PLAYER
IRON
SETTING
OWN
VIRUSES
SHARK
SKILLS
MANY
BASKETBALL
COMPASS
TALES
REAL
INFECTION
TANK
CAREERS
MATHEMATICS COACH
LINES
PLOT
LIFE
VIRUS
SHELLS
POSITIONS
BIOLOGY
PLAYED
CORE
TELLING
IMAGINE
MICROORGANISMS SHARKS
FIND
FIELD
PLAYING
ELECTRIC
SHORT
SENSE
PERSON
DIVING
POSITION
PHYSICS
HIT
DIRECTION
INFECTIOUS
DOLPHINS CONSCIOUSNESS FICTION
FIELD
LABORATORY TENNIS
FORCE
ACTION
STRANGE
COMMON
SWAM
OCCUPATIONS
STUDIES
TEAMS
MAGNETS
TRUE
FEELING
CAUSING
LONG
REQUIRE
WORLD
GAMES
BE
EVENTS
WHOLE
SMALLPOX
SEAL
OPPORTUNITY
SPORTS
MAGNETISM SCIENTIST
TELLS
BEING
BODY
DIVE
EARN
STUDYING
BAT
POLE
TALE
MIGHT
INFECTIONS
DOLPHIN
ABLE
SCIENCES
TERRY
INDUCED
NOVEL
HOPE
CERTAIN
UNDERWATER
A selection of topics (TASA)
JOB
BALL
SCIENCE
FIELD
STORY
MIND
DISEASE
WATER
WORK
STUDY
GAME
MAGNETIC
STORIES
WORLD
BACTERIA
FISH
JOBS
SCIENTISTS
TEAM
MAGNET
TELL
DREAM
DISEASES
SEA
CAREER
SCIENTIFIC FOOTBALL
WIRE
CHARACTER
DREAMS
GERMS
SWIM
KNOWLEDGE
BASEBALL EXPERIENCE
NEEDLE
THOUGHT CHARACTERS
FEVER
SWIMMING
WORK
PLAYERS EMPLOYMENT
CURRENT
AUTHOR
IMAGINATION
CAUSE
POOL
OPPORTUNITIES
RESEARCH
PLAY
COIL
READ
MOMENT
CAUSED
LIKE
WORKING
CHEMISTRY
FIELD
POLES
TOLD
THOUGHTS
SPREAD
SHELL
TRAINING
TECHNOLOGY PLAYER
IRON
SETTING
OWN
VIRUSES
SHARK
SKILLS
MANY
BASKETBALL
COMPASS
TALES
REAL
INFECTION
TANK
CAREERS
MATHEMATICS COACH
LINES
PLOT
LIFE
VIRUS
SHELLS
POSITIONS
BIOLOGY
PLAYED
CORE
TELLING
IMAGINE
MICROORGANISMS SHARKS
FIND
FIELD
PLAYING
ELECTRIC
SHORT
SENSE
PERSON
DIVING
POSITION
PHYSICS
HIT
DIRECTION
INFECTIOUS
DOLPHINS CONSCIOUSNESS FICTION
FIELD
LABORATORY TENNIS
FORCE
ACTION
STRANGE
COMMON
SWAM
OCCUPATIONS
STUDIES
TEAMS
MAGNETS
TRUE
FEELING
CAUSING
LONG
REQUIRE
WORLD
GAMES
BE
EVENTS
WHOLE
SMALLPOX
SEAL
OPPORTUNITY
SPORTS
MAGNETISM SCIENTIST
TELLS
BEING
BODY
DIVE
EARN
STUDYING
BAT
POLE
TALE
MIGHT
INFECTIONS
DOLPHIN
ABLE
SCIENCES
TERRY
INDUCED
NOVEL
HOPE
CERTAIN
UNDERWATER
Joint models of syntax and semantics
(Griffiths, Steyvers, Blei & Tenenbaum, NIPS 2004)
• Embed topics model inside an nth order
Hidden Markov Model:
Image removed due to copyright considerations. Please see:
Griffiths, T. L., M. Steyvers, D. M. Blei, and J. B. Tenenbaum.Integrating Topics
and Syntax. Advances in Neural Information Processing Systems 17 (2005).
Image removed due to copyright considerations. Please see:
Griffiths, T. L., M. Steyvers, D. M. Blei, and J. B. Tenenbaum.Integrating Topics
and Syntax. Advances in Neural Information Processing Systems 17 (2005).
Semantic classes
MAP
FOOD
NORTH
FOODS
EARTH
BODY
SOUTH
NUTRIENTS
POLE
DIET
MAPS
FAT
EQUATOR
SUGAR
WEST
ENERGY
LINES
MILK
EAST
EATING
AUSTRALIA
FRUITS
GLOBE
VEGETABLES
POLES
WEIGHT
HEMISPHERE
FATS
LATITUDE
NEEDS
CARBOHYDRATES PLACES
LAND
VITAMINS
WORLD
CALORIES
COMPASS
PROTEIN
CONTINENTS
MINERALS
GOLD
CELLS
BEHAVIOR
DOCTOR
BOOK
IRON
CELL
SELF
PATIENT
BOOKS
SILVER
ORGANISMS
INDIVIDUAL
HEALTH
READING
ALGAE
PERSONALITY
HOSPITAL
INFORMATION COPPER
METAL
BACTERIA
RESPONSE
MEDICAL
LIBRARY
METALS
MICROSCOPE
SOCIAL
CARE
REPORT
STEEL
MEMBRANE
EMOTIONAL
PATIENTS
PAGE
CLAY
ORGANISM
LEARNING
NURSE
TITLE
LEAD
FOOD
FEELINGS
DOCTORS
SUBJECT
ADAM
LIVING
PSYCHOLOGISTS
MEDICINE
PAGES
ORE
FUNGI
INDIVIDUALS
NURSING
GUIDE
ALUMINUM PSYCHOLOGICAL
MOLD
TREATMENT
WORDS
MINERAL
MATERIALS
EXPERIENCES
NURSES
MATERIAL
MINE
NUCLEUS
ENVIRONMENT
PHYSICIAN
ARTICLE
STONE
CELLED
HUMAN
HOSPITALS
ARTICLES
MINERALS
STRUCTURES
RESPONSES
DR
WORD
POT
MATERIAL
BEHAVIORS
SICK
FACTS
MINING
STRUCTURE
ATTITUDES
ASSISTANT
AUTHOR
MINERS
GREEN
PSYCHOLOGY
EMERGENCY
REFERENCE
TIN
MOLDS
PERSON
PRACTICE
NOTE
PLANTS
PLANT
LEAVES
SEEDS
SOIL
ROOTS
FLOWERS
WATER
FOOD
GREEN
SEED
STEMS
FLOWER
STEM
LEAF
ANIMALS
ROOT
POLLEN
GROWING
GROW
Image removed due to copyright considerations. Please see:
Griffiths, T. L., M. Steyvers, D. M. Blei, and J. B. Tenenbaum.Integrating Topics
and Syntax. Advances in Neural Information Processing Systems 17 (2005).
Syntactic classes
SAID
ASKED
THOUGHT
TOLD
SAYS
MEANS
CALLED
CRIED
SHOWS
ANSWERED
TELLS
REPLIED
SHOUTED
EXPLAINED
LAUGHED
MEANT
WROTE
SHOWED
BELIEVED
WHISPERED
THE
HIS
THEIR
YOUR
HER
ITS
MY
OUR
THIS
THESE
A
AN
THAT
NEW
THOSE
EACH
MR
ANY
MRS
ALL
MORE
SUCH
LESS
MUCH
KNOWN
JUST
BETTER
RATHER
GREATER
HIGHER
LARGER
LONGER
FASTER
EXACTLY
SMALLER
SOMETHING
BIGGER
FEWER
LOWER
ALMOST
ON
AT
INTO
FROM
WITH
THROUGH
OVER
AROUND
AGAINST
ACROSS
UPON
TOWARD
UNDER
ALONG
NEAR
BEHIND
OFF
ABOVE
DOWN
BEFORE
GOOD
SMALL
NEW
IMPORTANT
GREAT
LITTLE
LARGE
*
BIG
LONG
HIGH
DIFFERENT
SPECIAL
OLD
STRONG
YOUNG
COMMON
WHITE
SINGLE
CERTAIN
ONE
SOME
MANY
TWO
EACH
ALL
MOST
ANY
THREE
THIS
EVERY
SEVERAL
FOUR
FIVE
BOTH
TEN
SIX
MUCH
TWENTY
EIGHT
HE
YOU
THEY
I
SHE
WE
IT
PEOPLE
EVERYONE
OTHERS
SCIENTISTS
SOMEONE
WHO
NOBODY
ONE
SOMETHING
ANYONE
EVERYBODY
SOME
THEN
BE
MAKE
GET
HAVE
GO
TAKE
DO
FIND
USE
SEE
HELP
KEEP
GIVE
LOOK
COME
WORK
MOVE
LIVE
EAT
BECOME
Corpus-specific factorization
(NIPS)
Image removed due to copyright considerations. Please see:
Griffiths, T. L., M. Steyvers, D. M. Blei, and J. B. Tenenbaum.Integrating Topics
and Syntax. Advances in Neural Information Processing Systems 17 (2005).
Syntactic classes in PNAS
5
IN
FOR
ON
BETWEEN
DURING
AMONG
FROM
UNDER
WITHIN
THROUGHOUT
THROUGH
TOWARD
INTO
AT
INVOLVING
AFTER
ACROSS
AGAINST
WHEN
ALONG
8
ARE
WERE
WAS
IS
WHEN
REMAIN
REMAINS
REMAINED
PREVIOUSLY
BECOME
BECAME
BEING
BUT
GIVE
MERE
APPEARED
APPEAR
ALLOWED
NORMALLY
EACH
14
THE
THIS
ITS
THEIR
AN
EACH
ONE
ANY
INCREASED
EXOGENOUS
OUR
RECOMBINANT
ENDOGENOUS
TOTAL
PURIFIED
TILE
FULL
CHRONIC
ANOTHER
EXCESS
25
26
30
SUGGEST
LEVELS
RESULTS
INDICATE
NUMBER
ANALYSIS
SUGGESTING
LEVEL
DATA
SUGGESTS
RATE
STUDIES
SHOWED
TIME
STUDY
REVEALED
CONCENTRATIONS
FINDINGS
SHOW
VARIETY
EXPERIMENTS
DEMONSTRATE
RANGE
OBSERVATIONS
INDICATING
CONCENTRATION
HYPOTHESIS
PROVIDE
DOSE
ANALYSES
SUPPORT
FAMILY
ASSAYS
INDICATES
SET
POSSIBILITY
PROVIDES
FREQUENCY
MICROSCOPY
INDICATED
SERIES
PAPER
DEMONSTRATED
AMOUNTS
WORK
SHOWS
RATES
EVIDENCE
SO
CLASS
FINDING
REVEAL
VALUES
MUTAGENESIS
DEMONSTRATES
AMOUNT
OBSERVATION
SUGGESTED
SITES
MEASUREMENTS
33
BEEN
MAY
CAN
COULD
WELL
DID
DOES
DO
MIGHT
SHOULD
WILL
WOULD
MUST
CANNOT
REMAINED
ALSO
THEY
BECOME
MAG
LIKELY
Semantic highlighting
Darker words are more likely to have been generated from the
topic-based “semantics” module:
Outline
• Bayesian parameter estimation
• Hierarchical Bayesian models
• Metropolis-Hastings
– A more general approach to MCMC
Motivation
• Want to compute P(h|evidence):
P(evidence | h) P(h)
P (h | evidence) =
∑ P(evidence | h′) P(h′)
h′
• General problem with complex models: sum
over alternative hypotheses is intractable.
Markov chain Monte Carlo
• Sample from a Markov chain which
converges to posterior distribution
• After an initial “burn in” period,
samples are independent of starting
conditions.
Image removed due to
copyright considerations.
What’s a Markov chain?
x
x
x
x
x
x
x
Transition matrix
P(x(t+1)|x(t)) = T(x(t),x(t+1))
• States of chain are variables of interest
• Transition matrix chosen to give posterior
distribution as stationary distribution
x
Gibbs sampling
• Suppose (1) we can factor hypotheses into
individual state variables, h = <h1, h2, …, hn>;
• and (2) we can easily compute
P(hi|h-i, evidence), where
h-i = h1(t+1), h2(t+1),…, hi-1(t+1), hi+1(t), …, hn(t)
• Then use Gibbs sampling:
– Cycle through variables h1, h2, …, hn
– Draw hi(t+1) from P(hi|h-i, evidence)
Gibbs sampling
Image removed due to copyright considerations.
(MacKay, 2002)
Motivation for Metropolis-Hastings
• Want to compute P(h|evidence):
P(evidence | h) P(h)
P (h | evidence) =
∑ P(evidence | h′) P(h′)
h′
• We have a probabilistic model that allows
us to compute P(evidence|h) and P(h).
• We can compute relative posteriors:
P (hi | evidence) P(evidence | hi ) P(hi )
=
P(h j | evidence) P (evidence | h j ) P(h j )
Metropolis-Hastings algorithm
• Transitions have two parts:
– proposal distribution: Q(h(t+1)| h(t))
– acceptance: take proposals with probability
A(h(t+1)|
h(t))
P(h(t+1)|evidence) Q(h(t)| h(t +1))
= min{ 1,
}
(t)
(t+1)
(t)
P(h |evidence) Q(h | h )
Metropolis-Hastings algorithm
Complex unknown posterior distribution
Complex Unknown Posterior Distribution
Figure by MIT OCW.
Metropolis-Hastings algorithm
Complex unknown posterior distribution
Complex Unknown Posterior Distribution
Figure by MIT OCW.
e.g., Gaussian proposal distribution
Metropolis-Hastings algorithm
Complex unknown posterior distribution
Complex Unknown Posterior Distribution
Figure by MIT OCW.
e.g., Gaussian proposal distribution
Metropolis-Hastings algorithm
Complex unknown posterior distribution
Complex Unknown Posterior Distribution
A (h(t+1) h(t)) = 0.5
Figure by MIT OCW.
e.g., Gaussian proposal distribution
Metropolis-Hastings algorithm
Complex unknown posterior distribution
Complex Unknown Posterior Distribution
Figure by MIT OCW.
e.g., Gaussian proposal distribution
Metropolis-Hastings algorithm
Complex unknown posterior distribution
Complex Unknown Posterior Distribution
A (h(t+1) h(t)) = 1
Figure by MIT OCW.
e.g., Gaussian proposal distribution
Advanced topics
• What makes a good proposal distribution?
– “Goldilocks principle”
– May be data-dependent
• Connections to simulated annealing
– Integration versus optimization
– MCMC at different temperatures
• MCMC over model structures
– Reversible jump MCMC
Relation to simulated annealing
Complex unknown cost function
Complex Unknown Cost Function
Figure by MIT OCW.
Why MCMC is important
• Simple
• Can be used with just about any kind of
probabilistic model, including complex
hierarchical structures
• Always works pretty well, if you’re willing
to wait a long time
(cf. Backpropagation for neural networks.)
A model for cognitive
development?
• Some features of cognitive development:
–
–
–
–
–
Small, random, dumb, local steps
Takes a long time
Can get stuck in plateaus or stages
“Two steps forward, one step back”
Over time, intuitive theories get consistently
better (more veridical, more powerful, broader
scope).
– Everyone reaches basically the same state
(though some take longer than others).
Download