Outline
• Bayesian parameter estimation
• Hierarchical Bayesian models
• Metropolis-Hastings
  – A more general approach to MCMC

Coin flipping
• Comparing two simple hypotheses
  – P(H) = 0.5 vs. P(H) = 1.0
• Comparing simple and complex hypotheses
  – P(H) = 0.5 vs. P(H) = θ
• Comparing infinitely many hypotheses
  – P(H) = θ: infer θ

Comparing infinitely many hypotheses
• Assume data are generated from a model with P(H) = θ, producing flips d1, d2, d3, d4, ...
• What is the value of θ?
  – each value of θ is a hypothesis
  – requires inference over infinitely many hypotheses

Comparing infinitely many hypotheses
• Flip a coin 10 times and see 5 heads, 5 tails.
• P(H) on next flip? 50%
• Why? 50% = 5 / (5+5) = 5/10. "Future will be like the past."
• Suppose we had seen 4 heads and 6 tails.
• P(H) on next flip? Closer to 50% than to 40%.
• Why? Prior knowledge.

Integrating prior knowledge and data

  P(H | D) = P(D | H) P(H) / P(D)

  P(θ | D) ∝ P(D | θ) P(θ)

• Posterior distribution P(θ | D) is a probability density over θ = P(H)
• Need to work out the likelihood P(D | θ) and specify the prior distribution P(θ)

Likelihood and prior
• Likelihood: P(D | θ) = θ^NH (1-θ)^NT
  – NH: number of heads
  – NT: number of tails
• Prior: P(θ) = ?

A simple method of specifying priors
• Imagine some fictitious trials, reflecting a set of previous experiences
  – a strategy often used with neural networks, or for building invariances into statistical machine vision.
• e.g., F = {1000 heads, 1000 tails} ~ strong expectation that any new coin will be fair
• In fact, this is a sensible statistical idea...

Likelihood and prior
• Likelihood: P(D | θ) = θ^NH (1-θ)^NT
  – NH: number of heads
  – NT: number of tails
• Prior: Beta(FH, FT)

  P(θ) ∝ θ^(FH-1) (1-θ)^(FT-1)

  – FH: fictitious observations of heads
  – FT: fictitious observations of tails
• Normalized:

  P(θ) = [Γ(FH+FT) / (Γ(FH) Γ(FT))] θ^(FH-1) (1-θ)^(FT-1)

• A very useful integral:

  ∫₀¹ P(θ) dθ = [Γ(FH+FT) / (Γ(FH) Γ(FT))] ∫₀¹ θ^(FH-1) (1-θ)^(FT-1) dθ = 1

• Also useful: Γ(x) = (x-1)! for integer x, and Γ(x+1) = x Γ(x)

Shape of the Beta prior
[Figure by MIT OCW: four panels plotting the Beta density over θ for (FH, FT) = (0.5, 0.5), (2, 0.5), (0.5, 2), and (2, 2).]

Shape of the Beta prior
[Figure by MIT OCW: Beta densities for FH = FT = 1, FH = FT = 3, and FH = FT = 1000; larger fictitious counts concentrate the prior ever more tightly around θ = 0.5.]
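To make the prior concrete, here is a minimal Python sketch (added here, not part of the original slides) of the normalized Beta density; the helper name beta_pdf is ours, and lgamma is used so the normalizing constant stays finite even for fictitious counts as large as 1000:

```python
import math

def beta_pdf(theta, FH, FT):
    """Beta(FH, FT) density at theta: the normalizing constant
    Gamma(FH+FT) / (Gamma(FH) Gamma(FT)) times theta^(FH-1) * (1-theta)^(FT-1).
    Working in log space via lgamma avoids overflow for large counts."""
    log_norm = math.lgamma(FH + FT) - math.lgamma(FH) - math.lgamma(FT)
    return math.exp(log_norm
                    + (FH - 1) * math.log(theta)
                    + (FT - 1) * math.log(1 - theta))

# Qualitative check against the figures above: flat for (1,1), U-shaped
# for (0.5,0.5), and sharply peaked at theta = 0.5 for (1000,1000).
for FH, FT in [(0.5, 0.5), (1, 1), (2, 2), (3, 3), (1000, 1000)]:
    vals = [round(beta_pdf(t, FH, FT), 2) for t in (0.1, 0.25, 0.5, 0.75, 0.9)]
    print(f"Beta({FH},{FT}):", vals)
```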
Bayesian parameter learning
• Likelihood: Bernoulli(θ)

  P(D | θ) = θ^NH (1-θ)^NT

  – NH, NT: number of heads, tails observed
• Prior: Beta(FH, FT)

  P(θ | FH, FT) ∝ θ^(FH-1) (1-θ)^(FT-1)

  – FH, FT: fictitious observations of heads, tails
• Posterior: Beta(NH+FH, NT+FT)

  P(θ | D, FH, FT) ∝ θ^(NH+FH-1) (1-θ)^(NT+FT-1)
                   = [Γ(NH+FH+NT+FT) / (Γ(NH+FH) Γ(NT+FT))] θ^(NH+FH-1) (1-θ)^(NT+FT-1)

[Graphical model: FH, FT → θ → D = NH, NT, summarizing the individual flips d1, d2, d3, d4.]

Conjugate priors
• A prior P(θ) is conjugate to a likelihood function P(D | θ) if the posterior has the same functional form as the prior.
  – Different parameter values in the prior and posterior reflect the impact of observed data.
  – Parameter values in the prior can be thought of as a summary of "fictitious observations".
• Conjugate priors exist for many standard distributions
  – all exponential family models
  – e.g., Beta is conjugate to Bernoulli (coin flipping)

Bayesian parameter learning
• Posterior predictive distribution: the probability that the next flip (a new node H in the graphical model) comes up heads,

  P(H | D, FH, FT) = ∫₀¹ P(H | θ) P(θ | D, FH, FT) dθ
                   = ∫₀¹ θ P(θ | D, FH, FT) dθ
                   = [Γ(NH+FH+NT+FT) / (Γ(NH+FH) Γ(NT+FT))] ∫₀¹ θ^(NH+FH) (1-θ)^(NT+FT-1) dθ
                   = [Γ(NH+FH+NT+FT) / (Γ(NH+FH) Γ(NT+FT))] × [Γ(NH+FH+1) Γ(NT+FT) / Γ(NH+FH+NT+FT+1)]
                   = (NH+FH) / (NH+FH+NT+FT)     [using Γ(x+1) = x Γ(x)]

Some examples
• e.g., F = {1000 heads, 1000 tails} ~ strong expectation that any new coin will be fair
• After seeing 4 heads, 6 tails, P(H) on next flip = 1004 / (1004+1006) = 49.95%
• e.g., F = {3 heads, 3 tails} ~ weak expectation that any new coin will be fair
• After seeing 4 heads, 6 tails, P(H) on next flip = 7 / (7+9) = 43.75%
  – Prior knowledge too weak.

But... flipping thumbtacks
• e.g., F = {4 heads, 3 tails} ~ weak expectation that tacks are slightly biased towards heads
• After seeing 2 heads, 0 tails, P(H) on next flip = 6 / (6+3) = 67%
• Suppose F = { }: after seeing 2 heads, 0 tails, P(H) on next flip = 2 / (2+0) = 100%
• Some prior knowledge is always necessary to avoid jumping to hasty conclusions...
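The closed form above makes prediction a one-liner. A small sketch (again ours, not from the slides; predict_heads is a hypothetical helper) reproducing the slides' numbers:

```python
def predict_heads(NH, NT, FH, FT):
    """Posterior predictive P(H | D, FH, FT) under a Beta(FH, FT) prior:
    observed-plus-fictitious heads over all observed-plus-fictitious flips."""
    return (NH + FH) / (NH + FH + NT + FT)

print(predict_heads(4, 6, 1000, 1000))  # 1004/2010 ~ 0.4995 (strong fair-coin prior)
print(predict_heads(4, 6, 3, 3))        # 7/16 = 0.4375 (weak fair-coin prior)
print(predict_heads(2, 0, 4, 3))        # 6/9 ~ 0.667 (thumbtack prior)
print(predict_heads(2, 0, 0, 0))        # 2/2 = 1.0: F = { }, a hasty conclusion
```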
Origin of prior knowledge
• Tempting answer: prior experience
• Suppose you have previously seen 2000 coin flips: 1000 heads, 1000 tails
• By assuming all coins (and flips) are alike, these observations of other coins are as good as observations of the present coin
  – e.g., 200 coins flipped 10 times each.

Hierarchical priors
[Graphical model: shared hyperparameters FH, FT at the top; each coin (Coin 1, Coin 2, ..., Coin 200) has its own bias θi ~ Beta(FH, FT), which generates that coin's flips d1, d2, d3, d4.]
• Latent structure captures what is common to all coins, and also their individual variability

Problems with simple empiricism
• Haven't really seen 2000 coin flips, or any flips of a thumbtack
  – Prior knowledge is stronger than raw experience justifies
• Haven't seen exactly equal numbers of heads and tails
  – Prior knowledge is smoother than raw experience justifies
• Should be a difference between observing 2000 flips of a single coin versus observing 10 flips each for 200 coins, or 1 flip each for 2000 coins
  – Prior knowledge is more structured than raw experience

A simple theory
• "Coins are manufactured by a standardized procedure that is effective but not perfect."
  – Justifies generalizing from previous coins to the present coin.
  – Justifies a smoother and stronger prior than raw experience alone.
  – Explains why seeing 10 flips each for 200 coins is more valuable than seeing 2000 flips of one coin.
• "Tacks are asymmetric, and manufactured to less exacting standards."

Hierarchical priors
[Graphical model: physical knowledge at the top generates separate hyperparameters FH, FT for coins and for thumbtacks; each set generates the per-object biases θ1, ..., θ200 and their flips.]
• Qualitative beliefs (e.g. symmetry) can influence estimation of continuous properties (e.g. FH, FT)
• Explains why 10 flips each of 200 coins are better than 2000 flips of a single coin: more informative about FH and FT, and hence about the bias of any new coin of the same kind.

Stability versus flexibility
• Can all domain knowledge be represented with conjugate priors?
• Suppose you flip a coin 25 times and get all heads. Something funny is going on...
• But with F = {1000 heads, 1000 tails}, P(heads) on next flip = 1025 / (1025+1000) = 50.6%. Looks like nothing unusual.
• How do we balance stability and flexibility?
  – Stability: 6 heads, 4 tails → θ ~ 0.5
  – Flexibility: 25 heads, 0 tails → θ ~ 1

Hierarchical priors
[Graphical model: a discrete fair/unfair variable at the top, informed by both physical knowledge and social knowledge, selects the hyperparameters FH, FT for θ.]
• Higher-order hypothesis: is this coin fair or unfair?
• Example probabilities:
  – P(fair) = 0.99
  – P(θ | fair) is Beta(1000, 1000)
  – P(θ | unfair) is Beta(1, 1)
• 25 heads in a row propagates up, affecting θ and then P(fair | D):

  P(fair | 25 heads) / P(unfair | 25 heads)
    = [P(25 heads | fair) P(fair)] / [P(25 heads | unfair) P(unfair)]
    ≈ 9 × 10⁻⁵

  where P(D | fair) = ∫₀¹ P(D | θ) P(θ | fair) dθ, and similarly for P(D | unfair).
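These posterior odds can be computed exactly, because the Beta-Bernoulli integral has the closed form P(D | F) = B(NH+FH, NT+FT) / B(FH, FT). A minimal sketch (ours, not from the slides; the helper names log_beta and log_marginal are hypothetical) that reproduces the 9 × 10⁻⁵ figure:

```python
import math

def log_beta(a, b):
    # log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a+b)
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal(NH, NT, FH, FT):
    """Log marginal likelihood of NH heads, NT tails under a Beta(FH, FT)
    prior: the integral of P(D | theta) P(theta | F) over theta."""
    return log_beta(NH + FH, NT + FT) - log_beta(FH, FT)

NH, NT = 25, 0
log_odds = (log_marginal(NH, NT, 1000, 1000) + math.log(0.99)   # fair
            - log_marginal(NH, NT, 1, 1) - math.log(0.01))      # unfair
print(math.exp(log_odds))  # ~9e-5, matching the slide
```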
Summary
• Learning the parameters of a generative model as Bayesian inference.
• Conjugate priors
  – an elegant way to represent simple kinds of prior knowledge.
• Hierarchical Bayesian models
  – integrate knowledge across instances of a system, or different systems within a domain.
  – can incorporate abstract theoretical knowledge.
  – inference may get difficult...

Other questions
• Learning isn't just about parameter estimation
  – How do we learn the functional form of a variable's distribution?
  – How do we learn model structure? Theories?
• Can we "grow" levels of abstraction?
• How do hierarchical Bayesian models address the Grue problem? Do we care?
• The "topics" model for semantics as an example of applying hierarchical Bayesian modeling to cognition. Probably next time.

Topic models of semantic structure
e.g., Latent Dirichlet Allocation (Blei, Ng, Jordan)
• Each document in a corpus is associated with a distribution θ over topics.
• Each topic t is associated with a distribution φ(t) over words.
Image removed due to copyright considerations. Please see: Blei, David, Andrew Ng, and Michael Jordan. "Latent Dirichlet Allocation." Journal of Machine Learning Research 3 (Jan 2003): 993-1022.

A selection of topics (TASA)
[Table: eight example topics learned from the TASA corpus, each shown as a column of its most probable words:
– JOB: WORK, JOBS, CAREER, EMPLOYMENT, OPPORTUNITIES, WORKING, TRAINING, SKILLS, CAREERS, POSITIONS, OCCUPATIONS, EARN, ...
– SCIENCE: STUDY, SCIENTISTS, SCIENTIFIC, RESEARCH, CHEMISTRY, TECHNOLOGY, MATHEMATICS, BIOLOGY, PHYSICS, LABORATORY, STUDIES, SCIENCES, ...
– BALL: GAME, TEAM, FOOTBALL, BASEBALL, PLAYERS, PLAY, BASKETBALL, COACH, PLAYED, PLAYING, TENNIS, TEAMS, GAMES, SPORTS, BAT, ...
– FIELD: MAGNETIC, MAGNET, WIRE, NEEDLE, CURRENT, COIL, POLES, IRON, COMPASS, LINES, CORE, ELECTRIC, MAGNETS, MAGNETISM, POLE, ...
– STORY: STORIES, TELL, CHARACTER, CHARACTERS, AUTHOR, READ, TOLD, SETTING, TALES, PLOT, TELLING, FICTION, TALE, NOVEL, ...
– MIND: WORLD, DREAM, DREAMS, THOUGHT, IMAGINATION, THOUGHTS, REAL, LIFE, IMAGINE, SENSE, CONSCIOUSNESS, FEELING, ...
– DISEASE: BACTERIA, DISEASES, GERMS, FEVER, CAUSE, CAUSED, SPREAD, VIRUSES, INFECTION, VIRUS, MICROORGANISMS, INFECTIOUS, SMALLPOX, ...
– WATER: FISH, SEA, SWIM, SWIMMING, POOL, SHELL, SHARK, TANK, SHELLS, SHARKS, DIVING, DOLPHINS, SEAL, DIVE, UNDERWATER, ...]

Joint models of syntax and semantics (Griffiths, Steyvers, Blei & Tenenbaum, NIPS 2004)
• Embed the topics model inside an nth-order Hidden Markov Model:
Image removed due to copyright considerations. Please see: Griffiths, T. L., M. Steyvers, D. M. Blei, and J. B. Tenenbaum. "Integrating Topics and Syntax." Advances in Neural Information Processing Systems 17 (2005).
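Before adding syntax, it helps to see the topics component on its own. Here is a toy sketch (ours, with two invented topics; full LDA also draws each document's θ from a Dirichlet(α) prior and each φ(t) from a Dirichlet(β) prior) of the generative process described above:

```python
import random

# Hypothetical toy topics for illustration: each maps words to probabilities.
topics = {
    "sea":     {"water": 0.4, "fish": 0.3, "swim": 0.2, "shark": 0.1},
    "disease": {"disease": 0.4, "bacteria": 0.3, "virus": 0.2, "fever": 0.1},
}

def sample(dist):
    # Draw one item from a {item: probability} distribution.
    r, total = random.random(), 0.0
    for item, p in dist.items():
        total += p
        if r < total:
            return item
    return item

def generate_document(theta, n_words=10):
    """Generate a document: for each word, draw a topic t ~ theta,
    then a word w ~ phi(t)."""
    words = []
    for _ in range(n_words):
        t = sample(theta)                 # topic assignment for this word
        words.append(sample(topics[t]))   # word from that topic's distribution
    return words

print(generate_document({"sea": 0.7, "disease": 0.3}))
```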
Semantic classes
[Table: example semantic word classes, each a column of related words:
– MAP: NORTH, EARTH, SOUTH, POLE, MAPS, EQUATOR, WEST, LINES, EAST, AUSTRALIA, GLOBE, POLES, HEMISPHERE, LATITUDE, PLACES, LAND, WORLD, COMPASS, CONTINENTS
– FOOD: FOODS, BODY, NUTRIENTS, DIET, FAT, SUGAR, ENERGY, MILK, EATING, FRUITS, VEGETABLES, WEIGHT, FATS, NEEDS, CARBOHYDRATES, VITAMINS, CALORIES, PROTEIN, MINERALS
– GOLD: IRON, SILVER, COPPER, METAL, METALS, STEEL, CLAY, LEAD, ORE, ALUMINUM, MINERAL, MINE, STONE, MINING, MINERS, TIN, ...
– CELLS: CELL, ORGANISMS, ALGAE, BACTERIA, MICROSCOPE, MEMBRANE, ORGANISM, FUNGI, MOLD, NUCLEUS, CELLED, STRUCTURES, STRUCTURE, MOLDS, ...
– BEHAVIOR: SELF, INDIVIDUAL, PERSONALITY, RESPONSE, SOCIAL, EMOTIONAL, LEARNING, FEELINGS, PSYCHOLOGISTS, INDIVIDUALS, PSYCHOLOGICAL, EXPERIENCES, RESPONSES, BEHAVIORS, ATTITUDES, PSYCHOLOGY, ...
– DOCTOR: PATIENT, HEALTH, HOSPITAL, MEDICAL, CARE, PATIENTS, NURSE, DOCTORS, MEDICINE, NURSING, TREATMENT, NURSES, PHYSICIAN, HOSPITALS, DR, SICK, ASSISTANT, EMERGENCY, PRACTICE
– BOOK: BOOKS, READING, INFORMATION, LIBRARY, REPORT, PAGE, TITLE, SUBJECT, PAGES, GUIDE, WORDS, ARTICLE, ARTICLES, WORD, FACTS, AUTHOR, REFERENCE, NOTE, ...
– PLANTS: PLANT, LEAVES, SEEDS, SOIL, ROOTS, FLOWERS, WATER, FOOD, GREEN, SEED, STEMS, FLOWER, STEM, LEAF, ANIMALS, ROOT, POLLEN, GROWING, GROW]
Image removed due to copyright considerations. Please see: Griffiths, T. L., M. Steyvers, D. M. Blei, and J. B. Tenenbaum. "Integrating Topics and Syntax." Advances in Neural Information Processing Systems 17 (2005).

Syntactic classes
[Table: example syntactic word classes, each a column of words playing similar grammatical roles:
– SAID: ASKED, THOUGHT, TOLD, SAYS, MEANS, CALLED, CRIED, SHOWS, ANSWERED, TELLS, REPLIED, SHOUTED, EXPLAINED, LAUGHED, MEANT, WROTE, SHOWED, BELIEVED, WHISPERED
– THE: HIS, THEIR, YOUR, HER, ITS, MY, OUR, THIS, THESE, A, AN, THAT, NEW, THOSE, EACH, MR, ANY, MRS, ALL
– MORE: SUCH, LESS, MUCH, KNOWN, JUST, BETTER, RATHER, GREATER, HIGHER, LARGER, LONGER, FASTER, EXACTLY, SMALLER, SOMETHING, BIGGER, FEWER, LOWER, ALMOST
– ON: AT, INTO, FROM, WITH, THROUGH, OVER, AROUND, AGAINST, ACROSS, UPON, TOWARD, UNDER, ALONG, NEAR, BEHIND, OFF, ABOVE, DOWN, BEFORE
– GOOD: SMALL, NEW, IMPORTANT, GREAT, LITTLE, LARGE, *, BIG, LONG, HIGH, DIFFERENT, SPECIAL, OLD, STRONG, YOUNG, COMMON, WHITE, SINGLE, CERTAIN
– ONE: SOME, MANY, TWO, EACH, ALL, MOST, ANY, THREE, THIS, EVERY, SEVERAL, FOUR, FIVE, BOTH, TEN, SIX, MUCH, TWENTY, EIGHT
– HE: YOU, THEY, I, SHE, WE, IT, PEOPLE, EVERYONE, OTHERS, SCIENTISTS, SOMEONE, WHO, NOBODY, ONE, SOMETHING, ANYONE, EVERYBODY, SOME, THEN
– BE: MAKE, GET, HAVE, GO, TAKE, DO, FIND, USE, SEE, HELP, KEEP, GIVE, LOOK, COME, WORK, MOVE, LIVE, EAT, BECOME]

Corpus-specific factorization (NIPS)
Image removed due to copyright considerations. Please see: Griffiths, T. L., M. Steyvers, D. M. Blei, and J. B. Tenenbaum. "Integrating Topics and Syntax." Advances in Neural Information Processing Systems 17 (2005).
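To make the "topics inside an HMM" idea concrete, here is a toy first-order sketch (ours, with invented classes and parameters; the actual model is an nth-order HMM whose distributions are learned from the corpus). One HMM class is "semantic" and emits words topic-first, as in the LDA sketch above; the other classes emit function words:

```python
import random

def sample(dist):
    # Draw one item from a {item: probability} distribution.
    r, total = random.random(), 0.0
    for item, p in dist.items():
        total += p
        if r < total:
            return item
    return item

# Hypothetical toy parameters for illustration.
transitions = {
    "start":    {"pronoun": 0.6, "semantic": 0.4},
    "pronoun":  {"verb": 1.0},
    "verb":     {"semantic": 0.8, "pronoun": 0.2},
    "semantic": {"verb": 0.4, "pronoun": 0.3, "semantic": 0.3},
}
class_words = {
    "pronoun": {"he": 0.5, "they": 0.5},
    "verb":    {"see": 0.5, "make": 0.5},
}
topics = {
    "sea":     {"water": 0.5, "fish": 0.5},
    "disease": {"virus": 0.5, "fever": 0.5},
}

def generate(theta, n_words=8):
    """Each word's class follows the Markov chain; the semantic class draws
    a topic t ~ theta and then a word w ~ phi(t), all other classes draw
    directly from their own word distributions."""
    state, words = "start", []
    for _ in range(n_words):
        state = sample(transitions[state])
        if state == "semantic":
            words.append(sample(topics[sample(theta)]))
        else:
            words.append(sample(class_words[state]))
    return words

print(generate({"sea": 0.9, "disease": 0.1}))
```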
Syntactic classes in PNAS
[Table: selected word classes from PNAS abstracts, numbered as in the original figure:
– Class 5: IN, FOR, ON, BETWEEN, DURING, AMONG, FROM, UNDER, WITHIN, THROUGHOUT, THROUGH, TOWARD, INTO, AT, INVOLVING, AFTER, ACROSS, AGAINST, WHEN, ALONG
– Class 8: ARE, WERE, WAS, IS, WHEN, REMAIN, REMAINS, REMAINED, PREVIOUSLY, BECOME, BECAME, BEING, BUT, GIVE, MERE, APPEARED, APPEAR, ALLOWED, NORMALLY, EACH
– Class 14: THE, THIS, ITS, THEIR, AN, EACH, ONE, ANY, INCREASED, EXOGENOUS, OUR, RECOMBINANT, ENDOGENOUS, TOTAL, PURIFIED, TILE, FULL, CHRONIC, ANOTHER, EXCESS
– Class 25: SUGGEST, INDICATE, SUGGESTING, SUGGESTS, SHOWED, REVEALED, SHOW, DEMONSTRATE, INDICATING, PROVIDE, SUPPORT, INDICATES, PROVIDES, INDICATED, DEMONSTRATED, SHOWS, SO, REVEAL, DEMONSTRATES, SUGGESTED
– Class 26: LEVELS, NUMBER, LEVEL, RATE, TIME, CONCENTRATIONS, VARIETY, RANGE, CONCENTRATION, DOSE, FAMILY, SET, FREQUENCY, SERIES, AMOUNTS, RATES, CLASS, VALUES, AMOUNT, SITES
– Class 30: RESULTS, ANALYSIS, DATA, STUDIES, STUDY, FINDINGS, EXPERIMENTS, OBSERVATIONS, HYPOTHESIS, ANALYSES, ASSAYS, POSSIBILITY, MICROSCOPY, PAPER, WORK, EVIDENCE, FINDING, MUTAGENESIS, OBSERVATION, MEASUREMENTS
– Class 33: BEEN, MAY, CAN, COULD, WELL, DID, DOES, DO, MIGHT, SHOULD, WILL, WOULD, MUST, CANNOT, REMAINED, ALSO, THEY, BECOME, MAG, LIKELY]

Semantic highlighting
• Darker words are more likely to have been generated from the topic-based "semantics" module.

Outline
• Bayesian parameter estimation
• Hierarchical Bayesian models
• Metropolis-Hastings
  – A more general approach to MCMC

Motivation
• Want to compute P(h | evidence):

  P(h | evidence) = P(evidence | h) P(h) / Σh′ P(evidence | h′) P(h′)

• General problem with complex models: the sum over alternative hypotheses is intractable.

Markov chain Monte Carlo
• Sample from a Markov chain which converges to the posterior distribution
• After an initial "burn-in" period, samples are independent of starting conditions.
Image removed due to copyright considerations.

What's a Markov chain?
[Figure: a chain of states x(1) → x(2) → ... with transition matrix P(x(t+1) | x(t)) = T(x(t), x(t+1)).]
• States of the chain are the variables of interest
• Transition matrix is chosen to give the posterior distribution as the stationary distribution

Gibbs sampling
• Suppose (1) we can factor hypotheses into individual state variables, h = <h1, h2, ..., hn>;
• and (2) we can easily compute P(hi | h-i, evidence), where

  h-i = h1^(t+1), h2^(t+1), ..., h(i-1)^(t+1), h(i+1)^(t), ..., hn^(t)

• Then use Gibbs sampling:
  – Cycle through the variables h1, h2, ..., hn
  – Draw hi^(t+1) from P(hi | h-i, evidence)

Gibbs sampling
Image removed due to copyright considerations. (MacKay, 2002)

Motivation for Metropolis-Hastings
• Want to compute P(h | evidence):

  P(h | evidence) = P(evidence | h) P(h) / Σh′ P(evidence | h′) P(h′)

• We have a probabilistic model that allows us to compute P(evidence | h) and P(h).
• We can compute relative posteriors:

  P(hi | evidence) / P(hj | evidence) = [P(evidence | hi) P(hi)] / [P(evidence | hj) P(hj)]

Metropolis-Hastings algorithm
• Transitions have two parts:
  – proposal distribution: Q(h(t+1) | h(t))
  – acceptance: take proposals with probability

    A(h(t+1) | h(t)) = min{ 1, [P(h(t+1) | evidence) Q(h(t) | h(t+1))] / [P(h(t) | evidence) Q(h(t+1) | h(t))] }

Metropolis-Hastings algorithm
[Figures by MIT OCW: a sequence of slides walking the sampler over a complex, unknown posterior distribution using, e.g., a Gaussian proposal distribution. Proposals toward lower-posterior states are accepted only sometimes (e.g., A(h(t+1) | h(t)) = 0.5), while proposals toward higher-posterior states are always accepted (A(h(t+1) | h(t)) = 1).]
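A minimal random-walk Metropolis-Hastings sketch (ours, not from the slides; with a symmetric Gaussian proposal the Q terms cancel in the acceptance ratio). It targets the coin posterior from the first half of the lecture, which we pretend we can only evaluate up to a constant, exactly the situation MCMC is built for:

```python
import math, random

def metropolis_hastings(log_post, x0, n_samples, step=1.0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal,
    so Q(x'|x) = Q(x|x') and only the posterior ratio matters. Needing
    only an unnormalized log posterior is the point: the intractable
    normalizing sum cancels in the ratio."""
    x, samples = x0, []
    for _ in range(n_samples):
        proposal = random.gauss(x, step)            # draw from Q(. | x)
        log_A = log_post(proposal) - log_post(x)    # log acceptance ratio
        if log_A >= 0 or random.random() < math.exp(log_A):
            x = proposal                            # accept w.p. min(1, A)
        samples.append(x)
    return samples

# Target: the coin posterior Beta(NH+FH, NT+FT), known only up to a constant.
NH, NT, FH, FT = 4, 6, 3, 3
def log_post(theta):
    if not 0.0 < theta < 1.0:
        return float("-inf")                        # zero density outside (0,1)
    return (NH + FH - 1) * math.log(theta) + (NT + FT - 1) * math.log(1 - theta)

samples = metropolis_hastings(log_post, x0=0.5, n_samples=20000, step=0.1)
kept = samples[5000:]                               # discard burn-in
print(sum(kept) / len(kept))                        # ~ (NH+FH)/(NH+FH+NT+FT) = 0.4375
```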
Advanced topics
• What makes a good proposal distribution?
  – "Goldilocks principle"
  – May be data-dependent
• Connections to simulated annealing
  – Integration versus optimization
  – MCMC at different temperatures
• MCMC over model structures
  – Reversible-jump MCMC

Relation to simulated annealing
[Figure by MIT OCW: a complex, unknown cost function, explored by the same kind of local stochastic search.]

Why MCMC is important
• Simple
• Can be used with just about any kind of probabilistic model, including complex hierarchical structures
• Always works pretty well, if you're willing to wait a long time (cf. backpropagation for neural networks.)

A model for cognitive development?
• Some features of cognitive development:
  – Small, random, dumb, local steps
  – Takes a long time
  – Can get stuck in plateaus or stages
  – "Two steps forward, one step back"
  – Over time, intuitive theories get consistently better (more veridical, more powerful, broader scope).
  – Everyone reaches basically the same state (though some take longer than others).