Part IV: Inference algorithms

Estimation and inference
• Actually working with probabilistic models requires solving some difficult computational problems…
• Two key problems:
– estimating parameters in models with latent variables
– computing posterior distributions involving large numbers of variables

Part IV: Inference algorithms
• The EM algorithm
– for estimation in models with latent variables
• Markov chain Monte Carlo
– for sampling from posterior distributions involving large numbers of variables

SUPERVISED
[Figure: example images, each labeled "dog" or "cat"]

Supervised learning
Category A vs. Category B
• What characterizes the categories?
• How should we categorize a new observation?

Parametric density estimation
• Assume that p(x|c) has a simple form, characterized by parameters θ
• Given stimuli X = x1, x2, …, xn from category c, find θ by maximum-likelihood estimation
θ̂ = argmax_θ p(X | c, θ)
or some form of Bayesian estimation, e.g.
θ̂ = argmax_θ [ log p(X | c, θ) + log p(θ) ]

Spatial representations
• Assume a simple parametric form for p(x|c): a Gaussian
• For each category, estimate parameters
– mean μ
– variance σ²
[Graphical model: c drawn from P(c), x drawn from p(x|c)]

The Gaussian distribution
Probability density:
p(x) = (1 / (√(2π) σ)) exp{ −(x − μ)² / 2σ² }
with mean μ, standard deviation σ, and variance σ²
[Figure: Gaussian density plotted as a function of (x − μ)/σ]

Estimating a Gaussian
X = {x1, x2, …, xn} independently sampled from a Gaussian:
p(X | μ, σ) = ∏_{i=1}^n p(x_i | μ, σ)
= ∏_{i=1}^n (1 / (√(2π) σ)) exp{ −(x_i − μ)² / 2σ² }
= (1 / (2πσ²))^{n/2} exp{ −(1 / 2σ²) ∑_{i=1}^n (x_i − μ)² }
Maximum-likelihood parameter estimates:
μ = (1/n) ∑_{i=1}^n x_i
σ² = (1/n) ∑_{i=1}^n (x_i − μ)²

Multivariate Gaussians
Univariate: p(x | μ, σ) = (1 / (√(2π) σ)) exp{ −(x − μ)² / 2σ² }, with mean μ and variance σ²
Multivariate: p(x | μ, Σ) = (1 / ((2π)^{m/2} |Σ|^{1/2})) exp{ −(x − μ)ᵀ Σ⁻¹ (x − μ) / 2 }
with mean vector μ, covariance matrix Σ, and quadratic form (x − μ)ᵀ Σ⁻¹ (x − μ)
Example covariance matrices:
Σ = [1 0; 0 1], Σ = [1 0; 0 0.25], Σ = [1 0.8; 0.8 1]

Estimating a Gaussian
X = {x1, x2, …, xn} independently sampled from a multivariate Gaussian
Maximum-likelihood parameter estimates:
μ = (1/n) ∑_{i=1}^n x_i
Σ = (1/n) ∑_{i=1}^n (x_i − μ)(x_i − μ)ᵀ
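The maximum-likelihood estimates above are simple averages, so they are easy to compute directly. Below is a minimal sketch (Python with NumPy assumed; the function name and example parameters are illustrative, not from the slides) of fitting a multivariate Gaussian by maximum likelihood.

```python
import numpy as np

def fit_gaussian_ml(X):
    """Maximum-likelihood estimates for a multivariate Gaussian.

    X: (n, m) array of n observations in m dimensions.
    Returns the mean vector and the (1/n) covariance matrix,
    matching the ML formulas on the slides.
    """
    n = X.shape[0]
    mu = X.mean(axis=0)                  # mu = (1/n) sum_i x_i
    centered = X - mu
    Sigma = centered.T @ centered / n    # Sigma = (1/n) sum_i (x_i - mu)(x_i - mu)^T
    return mu, Sigma

# Example: recover the parameters of a known Gaussian from samples
rng = np.random.default_rng(0)
true_mu = np.array([0.0, 2.0])
true_Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=5000)
mu_hat, Sigma_hat = fit_gaussian_ml(X)
print(mu_hat, Sigma_hat)
```

With enough samples the estimates land close to the true mean and covariance, which is the point of the maximum-likelihood formulas.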
Bayesian inference
P(c | x) = P(x | c) P(c) / ∑_c P(x | c) P(c)

UNSUPERVISED
[Figure: the same example images, now without category labels]

Unsupervised learning
• What latent structure is present?
• What are the properties of a new observation?

An example: Clustering
• Assume each observed x_i is from a cluster c_i, where c_i is unknown
• What characterizes the clusters?
• What cluster does a new x come from?

Density estimation
• We need to estimate some probability distributions
– what is P(c)?
– what is p(x|c)?
• But… c is unknown, so we only know the value of x
[Graphical model: c drawn from P(c), x drawn from p(x|c)]

Supervised and unsupervised
Supervised learning: categorization
• Given x = {x1, …, xn} and c = {c1, …, cn}
• Estimate parameters θ of p(x|c) and P(c):
θ̂ = argmax_θ p(x, c | θ) = argmax_θ ∏_{i=1}^n p(x_i | c_i, θ) P(c_i | θ)
Unsupervised learning: clustering
• Given x = {x1, …, xn}
• Estimate parameters θ of p(x|c) and P(c):
θ̂ = argmax_θ p(x | θ) = argmax_θ ∏_{i=1}^n ∑_{c_i} p(x_i | c_i, θ) P(c_i | θ)

Mixture distributions
p(x_i | θ) = ∑_{c_i} p(x_i | c_i, θ) P(c_i | θ)
– p(x_i | θ): mixture distribution
– p(x_i | c_i, θ): mixture components
– P(c_i | θ): mixture weights
[Graphical model: c_i drawn from P(c), x_i drawn from p(x|c)]

More generally…
Unsupervised learning is density estimation using distributions with latent variables:
P(x) = ∑_z P(x | z) P(z)
– z: latent (unobserved), drawn from P(z)
– x: observed, drawn from P(x|z)
Marginalize out (i.e., sum over) the latent structure.

A chicken-and-egg problem
• If we knew which cluster the observations were from, we could find the distributions
– this is just density estimation
• If we knew the distributions, we could infer which cluster each observation came from
– this is just categorization

Alternating optimization algorithm
0. Guess initial parameter values
1. Given parameter estimates, solve for maximum a posteriori assignments c_i:
c_i = argmax_{c_i} P(c_i | x_i, θ) = argmax_{c_i} p(x_i | c_i, θ) P(c_i | θ)
2. Given assignments c, solve for maximum-likelihood parameter estimates:
θ̂ = argmax_θ p(x, c | θ) = argmax_θ ∏_{i=1}^n p(x_i | c_i, θ) P(c_i | θ)
3. Go to step 1

Alternating optimization algorithm
• x: observations; c: assignments to clusters; θ: parameters (μ, σ, P(c))
• For simplicity, assume σ and P(c) fixed: this is the "k-means" algorithm
[Figure: animation of the algorithm on two clusters. Step 0: initial parameter values; Step 1: update assignments; Step 2: update parameters; steps 1 and 2 then repeat until assignments and cluster means stop changing]

Alternating optimization algorithm (why "hard" assignments?)
0. Guess initial parameter values
1. Given parameter estimates, solve for maximum a posteriori assignments c_i:
c_i = argmax_{c_i} P(c_i | x_i, θ) = argmax_{c_i} p(x_i | c_i, θ) P(c_i | θ)
2. Given assignments c, solve for maximum-likelihood parameter estimates:
θ̂ = argmax_θ p(x | c, θ) = argmax_θ ∏_{i=1}^n p(x_i | c_i, θ)
3. Go to step 1

Estimating a Gaussian (with hard assignments)
X = {x1, x2, …, xn} independently sampled from a Gaussian:
p(X | μ, σ) = (1 / (2πσ²))^{n/2} exp{ −(1 / 2σ²) ∑_{i=1}^n (x_i − μ)² }
Maximum-likelihood parameter estimates:
μ = (1/n) ∑_{i=1}^n x_i
σ² = (1/n) ∑_{i=1}^n (x_i − μ)²

Estimating a Gaussian (with soft assignments)
The "weight" of each point is the probability of being in the cluster:
P(c_i = j | x_i, θ) = P(x_i | c_i = j, θ) P(c_i = j | θ) / ∑_c P(x_i | c, θ) P(c | θ)
Maximum-likelihood parameter estimates:
μ_j = ∑_{i=1}^n x_i P(c_i = j | x_i, θ) / ∑_{i=1}^n P(c_i = j | x_i, θ)
σ_j² = ∑_{i=1}^n (x_i − μ_j)² P(c_i = j | x_i, θ) / ∑_{i=1}^n P(c_i = j | x_i, θ)

The Expectation-Maximization algorithm (clustering version)
0. Guess initial parameter values
1. Given parameter estimates, compute the posterior distribution over assignments c_i:
P(c_i | x_i, θ) ∝ p(x_i | c_i, θ) P(c_i | θ)
2. Solve for maximum-likelihood parameter estimates, weighting each observation by the probability it came from each cluster
3. Go to step 1
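To make the clustering version concrete, here is a minimal sketch of the E-step / M-step loop just described (Python with NumPy; a one-dimensional mixture of Gaussians with fixed, equal mixture weights is an illustrative simplification, and the function and variable names are not from the slides).

```python
import numpy as np

def em_mixture_of_gaussians(x, K=2, n_iter=50, seed=0):
    """EM for a 1-D mixture of K Gaussians with equal mixture weights.

    E-step: posterior P(c_i = j | x_i, theta) for each point and cluster.
    M-step: re-estimate each cluster's mean and variance, weighting each
    point by its posterior probability of belonging to that cluster.
    """
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=K, replace=False)   # initial means: random data points
    var = np.full(K, x.var())                   # initial variances
    for _ in range(n_iter):
        # E-step: p(x_i | c = j) * P(c = j), with P(c = j) = 1/K, then normalize
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood estimates
        Nj = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / Nj
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nj
    return mu, var, resp

# Example: two well-separated clusters
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])
mu, var, resp = em_mixture_of_gaussians(x)
print(mu, var)
```

Replacing the soft responsibilities with a hard argmax over clusters (and holding the variances fixed) turns this into the k-means-style alternating optimization shown earlier.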
The Expectation-Maximization algorithm (more general version)
0. Guess initial parameter values
1. Given parameter estimates, compute the posterior distribution over latent variables z:
P(z | x, θ) ∝ P(x | z, θ) P(z | θ)
2. Find new parameter estimates
θ̂_new = argmax_{θ_new} ∑_z P(z | x, θ_old) log P(x, z | θ_new)
3. Go to step 1

A note on expectations
• For a function f(x) and distribution P(x), the expectation of f with respect to P is
E_{P(x)}[ f(x) ] = ∑_x f(x) P(x)
• The expectation is the average of f when x is drawn from the probability distribution P

Good features of EM
• Convergence
– guaranteed to converge to at least a local maximum of the likelihood (or other extremum)
– likelihood is non-decreasing across iterations
• Efficiency
– takes big steps initially (other algorithms do better later)
• Generality
– can be defined for many probabilistic models
– can be combined with a prior for MAP estimation

Limitations of EM
• Local maxima
– e.g., one component poorly fits two clusters, while two components split up a single cluster
• Degeneracies
– e.g., two components may merge, or a component may lock onto one data point, with its variance going to zero
• May be intractable for complex models
– dealing with this is an active research topic

EM and cognitive science
• The EM algorithm seems like it might be a good way to describe some "bootstrapping"
– anywhere there's a "chicken and egg" problem
– a prime example: language learning

Probabilistic context-free grammars
Rules (with probabilities):
S → NP VP 1.0
NP → T N 0.7
NP → N 0.3
VP → V NP 1.0
T → the 0.8
T → a 0.2
N → man 0.5
N → ball 0.5
V → hit 0.6
V → took 0.4
Example: the parse tree for "the man hit the ball" uses S → NP VP, NP → T N, VP → V NP, T → the, N → man, V → hit, NP → T N, T → the, and N → ball, so
P(tree) = 1.0 × 0.7 × 1.0 × 0.8 × 0.5 × 0.6 × 0.7 × 0.8 × 0.5

EM and cognitive science
• Fried and Holyoak (1984) explicitly tested a model of human categorization that was almost exactly a version of the EM algorithm for a mixture of Gaussians

Part IV: Inference algorithms
• The EM algorithm
– for estimation in models with latent variables
• Markov chain Monte Carlo
– for sampling from posterior distributions involving large numbers of variables

The Monte Carlo principle
• The expectation of f with respect to P can be approximated by
E_{P(x)}[ f(x) ] ≈ (1/n) ∑_{i=1}^n f(x_i)
where the x_i are sampled from P(x)
• Example: the average number of spots on a die roll

The Monte Carlo principle
The law of large numbers
[Figure: the running Monte Carlo estimate of the expected number of spots converging as the number of sampled die rolls grows]
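A minimal sketch of the die-roll example (Python with NumPy; names and sample sizes are illustrative): estimate the expected number of spots by averaging simulated rolls, and watch the estimate approach 3.5 as the law of large numbers predicts.

```python
import numpy as np

rng = np.random.default_rng(0)

# E[f(x)] is approximated by (1/n) * sum_i f(x_i), with x_i sampled from P(x).
# Here P is uniform over {1, ..., 6} and f is the identity, so the true value is 3.5.
for n in [10, 100, 1_000, 10_000, 100_000]:
    rolls = rng.integers(1, 7, size=n)   # n fair die rolls
    print(n, rolls.mean())
```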
Markov chain Monte Carlo
• Sometimes it isn't possible to sample directly from a distribution
• Sometimes you can only compute something proportional to the distribution
• Markov chain Monte Carlo: construct a Markov chain that will converge to the target distribution, and draw samples from that chain
– it only uses something proportional to the target

Markov chains
x → x → x → x → …
Transition matrix T = P(x(t+1) | x(t))
Each variable x(t+1) is independent of all previous variables given its immediate predecessor x(t)

An example: card shuffling
• Each state x(t) is a permutation of a deck of cards (there are 52! permutations)
• The transition matrix T indicates how likely one permutation is to become another
• The transition probabilities are determined by the shuffling procedure
– riffle shuffle
– overhand
– one card

Convergence of Markov chains
• Why do we shuffle cards?
• Convergence to a uniform distribution takes only 7 riffle shuffles…
• Other Markov chains will also converge to a stationary distribution, if certain simple conditions (called "ergodicity") are satisfied
– e.g., every state can be reached in some number of steps from every other state

Markov chain Monte Carlo
x → x → x → x → …
Transition matrix T = P(x(t+1) | x(t))
• The states of the chain are the variables of interest
• The transition matrix is chosen to give the target distribution as the stationary distribution

Metropolis-Hastings algorithm
• Transitions have two parts:
– proposal distribution: Q(x(t+1) | x(t))
– acceptance: take proposals with probability
A(x(t), x(t+1)) = min( 1, [ P(x(t+1)) Q(x(t) | x(t+1)) ] / [ P(x(t)) Q(x(t+1) | x(t)) ] )
[Figure: animation over a density p(x); proposals toward lower density are accepted with probability less than one (e.g., A(x(t), x(t+1)) = 0.5), while proposals toward higher density are always accepted (A(x(t), x(t+1)) = 1)]
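A minimal Metropolis-Hastings sketch (Python with NumPy; the particular target and the Gaussian random-walk proposal are illustrative choices, not from the slides). The target only needs to be known up to a constant, and because the proposal is symmetric the Q terms cancel in the acceptance ratio.

```python
import numpy as np

def unnormalized_target(x):
    # Something proportional to the target distribution: here two Gaussian
    # bumps, deliberately left unnormalized.
    return np.exp(-0.5 * (x - 2) ** 2) + 0.5 * np.exp(-0.5 * (x + 2) ** 2)

def metropolis_hastings(n_samples=10_000, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = 0.0                                   # arbitrary starting state
    samples = []
    for _ in range(n_samples):
        proposal = x + step * rng.normal()    # symmetric random-walk proposal Q
        # A(x, x') = min(1, P(x') / P(x)) since Q is symmetric
        accept_prob = min(1.0, unnormalized_target(proposal) / unnormalized_target(x))
        if rng.random() < accept_prob:
            x = proposal
        samples.append(x)                     # a rejected proposal repeats the old state
    return np.array(samples)

samples = metropolis_hastings()
print(samples.mean(), samples.std())
```

After an initial "burn-in" period, the visited states behave like samples from the normalized target, even though only the unnormalized density was ever evaluated.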
Gibbs sampling
• A particular choice of proposal distribution
• For variables x = x1, x2, …, xn, draw x_i(t+1) from the full conditional distribution
P(x_i | x_-i), where x_-i = x_1(t+1), x_2(t+1), …, x_{i-1}(t+1), x_{i+1}(t), …, x_n(t)

Gibbs sampling
[Figure: illustration of Gibbs sampling updates (MacKay, 2002)]

MCMC vs. EM
• EM: converges to a single solution
• MCMC: converges to a distribution of solutions
[Figure: schematic comparison of the two algorithms]

MCMC and cognitive science
• The Metropolis-Hastings algorithm seems like a good metaphor for aspects of development…
• Some forms of cultural evolution can be shown to be equivalent to Gibbs sampling (Griffiths & Kalish, 2005)
• For experiments based on MCMC, see the talk by Adam Sanborn at MathPsych!
• The main use of MCMC is for probabilistic inference in complex models

A selection of topics
[Figure: the most probable words in several topics learned from a corpus, including a JOB/WORK/JOBS/CAREER topic, a SCIENCE/STUDY/SCIENTISTS/SCIENTIFIC topic, a BALL/GAME/TEAM/FOOTBALL topic, a FIELD/MAGNETIC/MAGNET/WIRE topic, a STORY/STORIES/TELL/CHARACTER topic, a MIND/WORLD/DREAM/DREAMS topic, a DISEASE/BACTERIA/DISEASES/GERMS topic, and a WATER/FISH/SEA/SWIM topic]

Syntactic classes, semantic classes, and the semantic "gist" of a document
[Figure: word classes recovered by the model. Semantic classes include FOOD/FOODS/BODY/NUTRIENTS, MAP/NORTH/EARTH/SOUTH, DOCTOR/PATIENT/HEALTH/HOSPITAL, BOOK/BOOKS/READING/INFORMATION, GOLD/IRON/SILVER/COPPER, CELLS/CELL/ORGANISMS/ALGAE, and BEHAVIOR/SELF/INDIVIDUAL/PERSONALITY. Syntactic classes include determiners (THE HIS THEIR YOUR), comparatives (MORE SUCH LESS MUCH), prepositions (ON AT INTO FROM), adjectives (GOOD SMALL NEW IMPORTANT), quantifiers (ONE SOME MANY TWO), pronouns (HE YOU THEY I), and verbs (BE MAKE GET HAVE)]

Summary
• Probabilistic models can pose significant computational challenges
– parameter estimation with latent variables, computing posteriors with many variables
• Clever algorithms exist for solving these problems, easing the use of probabilistic models
• These algorithms also provide a source of new models and methods in cognitive science

Generative models for language
latent structure → observed data
meaning → words

Topic models
• Each document (or conversation, or segment of either) is a mixture of topics
• Each word is chosen from a single topic:
P(w_i) = ∑_{j=1}^T P(w_i | z_i = j) P(z_i = j)
where w_i is the ith word, z_i is the topic of the ith word, and T is the number of topics

Generating a document
g (distribution over topics) → z1, z2, z3, … (topic assignments) → w1, w2, w3, … (observed words)
Two example topics:
– topic 1: P(w | z = 1) = 0.2 for each of HEART, LOVE, SOUL, TEARS, JOY, and 0 for the science words
– topic 2: P(w | z = 2) = 0.2 for each of SCIENTIFIC, KNOWLEDGE, WORK, RESEARCH, MATHEMATICS, and 0 for the emotion words
Choose mixture weights g = {P(z = 1), P(z = 2)} for each document, then generate a "bag of words":
{0, 1}: MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS RESEARCH WORK SCIENTIFIC MATHEMATICS WORK
{0.25, 0.75}: SCIENTIFIC KNOWLEDGE MATHEMATICS SCIENTIFIC HEART LOVE TEARS KNOWLEDGE HEART
{0.5, 0.5}: MATHEMATICS HEART RESEARCH LOVE MATHEMATICS WORK TEARS SOUL KNOWLEDGE HEART
{0.75, 0.25}: WORK JOY SOUL TEARS MATHEMATICS TEARS LOVE LOVE LOVE SOUL
{1, 0}: TEARS LOVE JOY SOUL LOVE TEARS SOUL SOUL TEARS JOY
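A minimal sketch of this generative process (Python with NumPy; the two hypothetical topics and the ten-word vocabulary mirror the example above, and generating exactly ten words per document is an illustrative choice): for each word, draw a topic z from the document's mixture weights g, then draw the word from that topic's distribution P(w | z).

```python
import numpy as np

vocab = ["HEART", "LOVE", "SOUL", "TEARS", "JOY",
         "SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"]

# P(w | z): topic 1 spreads its mass over the "emotion" words,
# topic 2 over the "science" words, as in the example above.
topics = np.array([
    [0.2, 0.2, 0.2, 0.2, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0],   # topic 1
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.2, 0.2, 0.2, 0.2],   # topic 2
])

rng = np.random.default_rng(0)

def generate_document(g, n_words=10):
    """Generate a bag of words: pick a topic from g = {P(z=1), P(z=2)} for each
    word position, then pick the word from that topic's distribution."""
    z = rng.choice(len(topics), size=n_words, p=g)          # topic assignments
    return [vocab[rng.choice(len(vocab), p=topics[zi])] for zi in z]

for g in [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]:
    print(g, generate_document(np.array(g)))
```

Documents generated with g = {0, 1} contain only science words, g = {1, 0} only emotion words, and intermediate weights mix the two, just as in the example bags of words above.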
Inferring topics from text
• The topic model is a generative model for a set of documents (given a set of topics)
– a simple procedure for generating documents
• Given the documents, we can try to find the topics and their proportions in each document
• This is an unsupervised learning problem
– we can use the EM algorithm, but it's not great
– instead, we use Markov chain Monte Carlo

A selection from 500 topics [P(w | z = j)]
[Figure: the most probable words in several of 500 topics learned from an educational corpus, including a THEORY/SCIENTISTS/EXPERIMENT/OBSERVATIONS topic, a BRAIN/NERVE/SENSE/SENSES topic, a CURRENT/ELECTRICITY/ELECTRIC/CIRCUIT topic, an ART/PAINT/ARTIST/PAINTING topic, a STUDENTS/TEACHER/STUDENT/TEACHERS topic, a SPACE/EARTH/MOON/PLANET topic, a FIELD/MAGNETIC/MAGNET/WIRE topic, a STORY/STORIES/TELL/CHARACTER topic, a MIND/WORLD/DREAM/DREAMS topic, a JOB/WORK/JOBS/CAREER topic, a SCIENCE/STUDY/SCIENTISTS/SCIENTIFIC topic, and a BALL/GAME/TEAM/FOOTBALL topic]

Gibbs sampling for topics
• We need the full conditional distributions for the variables we sample
• Since we only sample the topic assignments z, the full conditional P(z_i = j | z_-i, w) is proportional to
(the number of times word w_i is assigned to topic j) × (the number of times topic j is used in document d_i)

Gibbs sampling
[Slide animation: a table of the 50 word tokens, showing each token i with its word w_i, document d_i, and topic assignment z_i across iterations]
Iteration 1 (initial assignments):
i   w_i           d_i   z_i
1   MATHEMATICS   1     2
2   KNOWLEDGE     1     2
3   RESEARCH      1     1
4   WORK          1     2
5   MATHEMATICS   1     1
6   RESEARCH      1     2
7   WORK          1     2
8   SCIENTIFIC    1     1
9   MATHEMATICS   1     2
10  WORK          1     1
11  SCIENTIFIC    2     1
12  KNOWLEDGE     2     1
…   …             …     …
50  JOY           5     2
In iteration 2, each z_i is resampled in turn from its full conditional (shown on the slides as a "?" being filled in token by token); the animation continues through iteration 1000, by which point the chain has approximately converged and the assignments are samples from the posterior distribution over topic assignments.
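A minimal collapsed Gibbs sampler in this spirit (Python with NumPy; the smoothing constants alpha and beta, the iteration count, and all names are illustrative assumptions, since the slides only state that the full conditional combines word-topic and document-topic counts).

```python
import numpy as np

def gibbs_topic_model(docs, n_topics, vocab_size, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for a simple topic model.

    docs: list of documents, each a list of word ids in range(vocab_size).
    Each word token gets a topic assignment z; each sweep resamples every z
    from its full conditional, which is proportional to
      (how often this word type is assigned to topic j) *
      (how often topic j is used in this document),
    with small smoothing constants alpha and beta (assumed values).
    """
    rng = np.random.default_rng(seed)
    nwt = np.zeros((vocab_size, n_topics))   # word-topic counts
    ndt = np.zeros((len(docs), n_topics))    # document-topic counts
    nt = np.zeros(n_topics)                  # topic totals
    # Random initial assignments
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            j = z[d][i]
            nwt[w, j] += 1; ndt[d, j] += 1; nt[j] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                j = z[d][i]
                # Remove this token's current assignment from the counts
                nwt[w, j] -= 1; ndt[d, j] -= 1; nt[j] -= 1
                # Full conditional over topics for this token
                p = (nwt[w] + beta) / (nt + vocab_size * beta) * (ndt[d] + alpha)
                j = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = j
                nwt[w, j] += 1; ndt[d, j] += 1; nt[j] += 1
    return z, nwt, ndt

# Example: two tiny "documents" over a 10-word vocabulary
docs = [[9, 6, 8, 7, 9, 8, 7, 5, 9, 7],   # science-flavored word ids
        [3, 1, 4, 2, 1, 3, 2, 2, 3, 4]]   # emotion-flavored word ids
z, nwt, ndt = gibbs_topic_model(docs, n_topics=2, vocab_size=10)
print(ndt)   # how often each topic is used in each document
```

On data like the toy documents above, the sampler tends to settle into a state where each document's tokens are concentrated in one topic, which is the behavior the iteration table illustrates.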
A visual example: Bars
• Sample each pixel from a mixture of topics
– pixel = word
– image = document
[Figure: the "bars" example]

Summary
• Probabilistic models can pose significant computational challenges
– parameter estimation with latent variables, computing posteriors with many variables
• Clever algorithms exist for solving these problems, easing the use of probabilistic models
• These algorithms also provide a source of new models and methods in cognitive science

When Bayes is useful…
• Clarifying computational problems in cognition
• Providing rational explanations for behavior
• Characterizing knowledge informing induction
• Capturing inferences at multiple levels