Probabilistic Topic Models and Associative Memory

Mark Steyvers, UC Irvine
Tom Griffiths, Brown University
Josh Tenenbaum, MIT

Overview
I Associative memory
II The topic model
III Applications to associative memory
IV Extensions of the model
V Applications in machine learning/text mining

Example of associative memory: word association
CUE: PLAY
RESPONSES: FUN, BALL, GAME, WORK, GROUND, MATE, CHILD, ENJOY, WIN, ACTOR

Example of associative memory: free recall
STUDY THESE WORDS: Bed, Rest, Awake, Tired, Dream, Wake, Snooze, Blanket, Doze, Slumber, Snore, Nap, Peace, Yawn, Drowsy
RECALL WORDS .....
FALSE RECALL: “Sleep” 61%

A theory for semantic association
• Semantic association as probabilistic inference
• Representation of semantic structure

Latent Semantic Structure
Latent structure $\ell$, words $w$
• Distribution over words: $P(w) = \sum_{\ell} P(w, \ell)$
• Inferring latent structure: $P(\ell \mid w) = \frac{P(w \mid \ell)\, P(\ell)}{P(w)}$
• Prediction: $P(w_{n+1} \mid w)$ ...

II The topic model

Probabilistic Topic Models
• Probabilistic Latent Semantic Indexing (pLSI): Hofmann (1999)
• Latent Dirichlet Allocation (LDA): Blei, Ng, & Jordan (2003)
• This talk: use topic models as a theory for human semantic association

Topic Model
Unsupervised learning of topics (“gist”) of documents:
• articles/chapters
• conversations
• emails
• ....
• any verbal context
Topics are useful latent structures to explain semantic association

Probabilistic Generative Model
• Each document is a probability distribution over topics
• Each topic is a probability distribution over words

GENERATIVE PROCESS
[Figure: TOPIC 1 and TOPIC 2 (mixture components) generate the words of two documents; the mixture weights shown are .8 and .2 for DOCUMENT 1, and .3 and .7 for DOCUMENT 2. The digit after each word is the topic that generated it.]
DOCUMENT 1: money1 bank1 bank1 loan1 river2 stream2 bank1 money1 river2 bank1 money1 bank1 loan1 money1 stream2 bank1 money1 bank1 bank1 loan1 river2 stream2 bank1 money1 river2 bank1 money1 bank1 loan1 bank1 money1 stream2
DOCUMENT 2: river2 stream2 bank2 stream2 bank2 money1 loan1 river2 stream2 loan1 bank2 river2 bank2 bank1 stream2 river2 loan1 bank2 stream2 bank2 money1 loan1 river2 stream2 bank2 stream2 bank2 money1 river2 stream2 loan1 bank2 river2 bank2 money1 bank1 stream2 river2 bank2 stream2 bank2 money1

Bayesian approach: use priors
• Mixture weights ~ Dirichlet($\alpha$)
• Mixture components ~ Dirichlet($\beta$)

The probability of choosing a word:
$P(w) = \sum_{z=1}^{T} P(w \mid z)\, P(z)$
• $P(w \mid z = j)$: word probability in topic j
• $P(z = j)$: probability of topic j in document

Graphical Model
• $\alpha$, $\theta$: sample a distribution over topics
• $z$: sample a topic
• $\beta$, $\phi$, $w$: sample a word from that topic
• Plates: $T$, $N_d$, $D$

INVERTING THE GENERATIVE PROCESS
TOPIC 1: ?
TOPIC 2: ?
DOCUMENT 1: A Play is written to be performed on a stage before a live audience or before motion picture or television cameras (for later viewing by large audiences). A Play is written because playwrights have something ...
DOCUMENT 2: He was listening to music coming from a passing riverboat. The music had already captured his heart as well as his ear. It was jazz. Bix Beiderbecke had already had music lessons. He wanted to play the cornet. And he wanted to play jazz .......
We estimate the assignments of topics to words

INVERTING THE GENERATIVE PROCESS
DOCUMENT 1: A Play082 is written082 to be performed082 on a stage082 before a live093 audience082 or before motion270 picture004 or television004 cameras004 (for later054 viewing004 by large202 audiences082).
A Play082 is written082 because playwrights082 have something ...
DOCUMENT 2: He was listening077 to music077 coming009 from a passing043 riverboat. The music077 had already captured006 his heart157 as well as his ear119. It was jazz077. Bix Beiderbecke had already had music077 lessons077. He wanted268 to play077 the cornet. And he wanted268 to play077 jazz077.......
(The number attached to each word is its assigned topic.)
We estimate the assignments of topics to words

Statistical Inference
• Fix number of topics T
• We estimate the posterior over topic assignments:
  $P(z \mid w) = \frac{P(w, z)}{\sum_{z} P(w, z)}$
• Markov Chain Monte Carlo (MCMC) with Gibbs sampling

Choosing number of topics
• Subjective interpretability
• Bayesian model selection: Griffiths & Steyvers (2004)
• Generalization test
• Non-parametric Bayesian statistics
  • Infinite models; models that grow with size of data
  • Teh, Jordan, Beal, & Blei (2004)
  • Blei, Griffiths, Jordan, Tenenbaum (2004)

Procedure
INPUT: word-document counts
OUTPUT:
• topic assignments to each word: $z$
• likely words in each topic: $P(w \mid z)$
• likely topics for a document (“gist”): $P(z \mid w)$

Example: topics from an educational corpus (TASA)
• 37K docs, 26K words
• 1700 topics, e.g.:
PRINTING PAPER PRINT PRINTED TYPE PROCESS INK PRESS IMAGE PRINTER PRINTS PRINTERS COPY COPIES FORM OFFSET GRAPHIC SURFACE PRODUCED CHARACTERS
PLAY PLAYS STAGE AUDIENCE THEATER ACTORS DRAMA SHAKESPEARE ACTOR THEATRE PLAYWRIGHT PERFORMANCE DRAMATIC COSTUMES COMEDY TRAGEDY CHARACTERS SCENES OPERA PERFORMED
TEAM GAME BASKETBALL PLAYERS PLAYER PLAY PLAYING SOCCER PLAYED BALL TEAMS BASKET FOOTBALL SCORE COURT GAMES TRY COACH GYM SHOT
JUDGE TRIAL COURT CASE JURY ACCUSED GUILTY DEFENDANT JUSTICE EVIDENCE WITNESSES CRIME LAWYER WITNESS ATTORNEY HEARING INNOCENT DEFENSE CHARGE CRIMINAL
HYPOTHESIS EXPERIMENT SCIENTIFIC OBSERVATIONS SCIENTISTS EXPERIMENTS SCIENTIST EXPERIMENTAL TEST METHOD HYPOTHESES TESTED EVIDENCE BASED OBSERVATION SCIENCE FACTS DATA RESULTS EXPLANATION
STUDY TEST STUDYING HOMEWORK NEED CLASS MATH TRY
TEACHER WRITE PLAN ARITHMETIC ASSIGNMENT PLACE STUDIED CAREFULLY DECIDE IMPORTANT NOTEBOOK REVIEW

Polysemy
In the topics above, the same word can appear in more than one topic: PLAY (theater, sports), COURT (sports, law), TEST (science, studying), EVIDENCE (law, science), CHARACTERS (printing, theater), TRY (sports, studying).

III Applications to associative memory

Example associative structure
[Figure: association network over the words BAT, BALL, BASEBALL, GAME, PLAY, STAGE, THEATER]
(Association norms by Doug Nelson et al. 1998)

Explaining structure with topics
[Figure: the same words with two topic regions; topic 1 lies among BAT, BASEBALL, BALL, GAME, PLAY and topic 2 among PLAY, STAGE, THEATER]

TASA corpus
• Need a suitable corpus to model human associations
• TASA: an educational corpus of text
• 37K documents
• 26K words

Modeling Word Association
• Word association modeled as prediction
• Given that a single word is observed, which other words are likely to occur next?
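A minimal sketch of word association as prediction, assuming a single topic generated both the cue and the response: infer P(z | cue) by Bayes' rule, then average each topic's word distribution under that posterior. The function name `predict_associates`, the two toy topics, and all probabilities are hypothetical illustration values, not numbers from the TASA model.

```python
def predict_associates(cue, phi, topic_prior):
    """Return P(w_next | cue) = sum_z P(w_next | z) P(z | cue).

    phi[z][w] is P(w | z); topic_prior[z] is P(z).
    Both are hypothetical inputs for this sketch.
    """
    # Bayes' rule: P(z | cue) is proportional to P(cue | z) * P(z).
    joint = [phi[z].get(cue, 0.0) * topic_prior[z] for z in range(len(phi))]
    total = sum(joint)
    if total == 0.0:
        raise ValueError("cue has zero probability under every topic")
    posterior = [j / total for j in joint]

    # Predictive distribution: mix the topics' word distributions
    # using the posterior over topics as mixture weights.
    vocab = set().union(*phi)
    return {
        w: sum(phi[z].get(w, 0.0) * posterior[z] for z in range(len(phi)))
        for w in vocab
    }


# Toy topics (made-up numbers): a "sports" topic and a "theater" topic.
phi = [
    {"play": 0.2, "ball": 0.4, "game": 0.4},
    {"play": 0.4, "stage": 0.3, "theater": 0.3},
]
prior = [0.5, 0.5]

pred = predict_associates("play", phi, prior)
for word, p in sorted(pred.items(), key=lambda kv: -kv[1]):
    print(f"{word:8s} {p:.3f}")
```

Because "play" is more probable under the toy theater topic, the posterior favors that topic and theater words receive more predictive mass than sports words; with real topics the same computation would run over the full vocabulary.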
Under a single topic assumption:
$P(w_{n+1} \mid w) = \sum_{z} P(w_{n+1} \mid z)\, P(z \mid w)$
where $w_{n+1}$ is the response and $w$ is the cue.

Observed associates for the cue “play” and model predictions

HUMANS                 TOPICS (T=500)
Word      P(word)      Word        P(word)
FUN       .141         BALL        .041
BALL      .134         GAME        .039
GAME      .074         CHILDREN    .019
WORK      .067         ROLE        .014
GROUND    .060         GAMES       .014
MATE      .027         MUSIC       .009
CHILD     .020         BASEBALL    .009
ENJOY     .020         HIT         .008
WIN       .020         FUN         .008
ACTOR     .013         TEAM        .008
FIGHT     .013         IMPORTANT   .006
HORSE     .013         BAT         .006
KID       .013         RUN         .006
MUSIC     .013         STAGE       .005

LSA (entries truncated): KICKB, VOLLEY, GAME, COSTU, DRAM, ROL, PLAYWR, FUN, ACTO, REHEAR, GAM, ACTO, CHECK, MOLIE
FUN, the first associate for humans, is at rank 9 in the topic model's list.

Median rank of first associate
[Figure: Median Rank (y-axis, 0 to 40) of the first associate for best LSA cosine, best LSA inner product, and topic models with 300, 500, 700, 900, 1100, 1300, 1500, and 1700 topics]

Latent Semantic Analysis (Landauer & Dumais, 1997)
• Singular value decomposition of word-document counts gives a high dimensional space
• Each word is a single point in semantic space (example words: STREAM, RIVER, BANK, MONEY)
• Similarity measured by cosine of angle between word vectors

Triangle Inequality in Spatial Representations
Example: w1 = THEATER, w2 = PLAY, w3 = SOCCER
Cosine similarity: $\cos(w_1, w_3) \ge \cos(w_1, w_2)\cos(w_2, w_3) - \sin(w_1, w_2)\sin(w_2, w_3)$

Testing violation of triangle inequality
• Look for triplets of associates w1, w2, w3 such that $P(w_2 \mid w_1) > t$ and $P(w_3 \mid w_2) > t$, and measure $P(w_3 \mid w_1)$
• Vary threshold t

Recall: example study list
STUDY: Bed, Rest, Awake, Tired, Dream, Wake, Snooze, Blanket, Doze, Slumber, Snore, Nap, Peace, Yawn, Drowsy
FALSE RECALL: “Sleep” 61%

Recall as a reconstructive process
• Reconstruct study list based on the stored “gist”
• The gist can be represented by a distribution over topics
• Under a single topic assumption:
  $P(w_{n+1} \mid w) = \sum_{z} P(w_{n+1} \mid z)\, P(z \mid w)$
  where $w_{n+1}$ is the retrieved word and $w$ is the study list.

Predictions for the “Sleep” list
[Figure: $P(w_{n+1} \mid w)$ (axis from 0 to 0.2) for two groups of words. STUDY LIST: BED, REST, TIRED, AWAKE, WAKE, NAP, DREAM, YAWN, DROWSY, BLANKET, SNORE, SLUMBER, PEACE, DOZE. EXTRA LIST (top 8): SLEEP, NIGHT, ASLEEP, MORNING, HOURS, SLEEPY, EYES, AWAKENED.]

Correlation between intrusion rates and predictions
[Figure: two panels with y-axis Correlation. LSA panel: x-axis # Dimensions (0 to 800). TOPICS MODEL panel: x-axis # Topics (0 to 2000). Each panel also marks a “(word association)” reference. Values labeled in the figure: .37, .53, .69.]

Latent Semantic Analysis vs. Topics
• Quantitative differences
• Qualitative differences
  • probabilistic generative models can work with more structured representations
  • Extensions of topic models: hierarchies, syntax-semantics

IV Extensions of the model

Integrating Topics and Syntax (Griffiths, Steyvers, Blei, & Tenenbaum, 2004)
• Syntactic dependencies: short range dependencies
• Semantic dependencies: long-range
[Graphical model: $\theta$; topic assignments z1 to z4; words w1 to w4; HMM states s1 to s4]
• Semantic state: generate words from topic model
• Syntactic states: generate words from HMM

ATTENTION SEARCH VISUAL PROCESSING TASK PERFORMANCE INFORMATION ATTENTIONAL
THE A AN THIS THEIR ITS EACH ONE
MEMORY TERM LONG SHORT RETRIEVAL STORAGE MEMORIES AMNESIA
IN BY WITH ON AS FROM TO FOR
IQ BEHAVIOR EVOLUTIONARY ENVIRONMENT GENES HERITABILITY GENETIC SELECTION
IS ARE BE HAS HAVE WAS WERE AS
DRUG AROUSAL NEURAL BRAIN HABITUATION BIOLOGICAL TOLERANCE BEHAVIORAL
BASED PRESENTED DISCUSSED PROPOSED DESCRIBED SUCH USED DERIVED
...
SOCIAL SELF ATTITUDE IMPLICIT ATTITUDES PERSONALITY JUDGMENT PERCEPTION
THEORY MODEL PROCESSES MODELS SYSTEM PROCESS EFFECTS INFORMATION

(S) THE SEARCH IN LONG TERM MEMORY ……
(S) A MODEL OF VISUAL ATTENTION ……

Random sentence generation
LANGUAGE:
[S] RESEARCHERS GIVE THE SPEECH
[S] THE SOUND FEEL NO LISTENERS
[S] WHICH WAS TO BE MEANING
[S] HER VOCABULARIES STOPPED WORDS
[S] HE EXPRESSLY WANTED THAT BETTER VOWEL

Topic Hierarchies
• In regular topic model, no relations between topics
• Alternative: hierarchical topic organization
[Figure: tree of topics 1 to 7]

Nested Chinese Restaurant Process
• Blei, Griffiths, Jordan, Tenenbaum (2004)
• Learn hierarchical structure, as well as topics within structure

Example: Psych Review Abstracts
[Figure: learned topic hierarchy; words shown at its nodes:]
THE OF AND TO IN A IS A MODEL MEMORY FOR MODELS TASK INFORMATION RESULTS ACCOUNT RESPONSE SPEECH STIMULUS READING REINFORCEMENT WORDS RECOGNITION MOVEMENT STIMULI MOTOR RECALL VISUAL CHOICE WORD CONDITIONING SEMANTIC ACTION SOCIAL SELF EXPERIENCE EMOTION GOALS EMOTIONAL THINKING SELF SOCIAL PSYCHOLOGY RESEARCH RISK STRATEGIES INTERPERSONAL PERSONALITY SAMPLING GROUP IQ INTELLIGENCE SOCIAL RATIONAL INDIVIDUAL GROUPS MEMBERS SEX EMOTIONS GENDER EMOTION STRESS WOMEN HEALTH HANDEDNESS MOTION VISUAL SURFACE BINOCULAR RIVALRY CONTOUR DIRECTION CONTOURS SURFACES DRUG FOOD BRAIN AROUSAL ACTIVATION AFFECTIVE HUNGER EXTINCTION PAIN REASONING IMAGE CONDITIONIN ATTITUDE COLOR STRESS CONSISTENCY MONOCULAR EMOTIONAL SITUATIONAL LIGHTNESS BEHAVIORAL INFERENCE GIBSON FEAR JUDGMENT SUBMOVEMENT STIMULATION PROBABILITIES ORIENTATION TOLERANCE STATISTICAL HOLOGRAPHIC RESPONSES

Generative Process
[Figure: the same topic hierarchy as in the Psych Review Abstracts example]

V Applications in machine learning/text mining

Applications in machine learning/text mining
Mark Steyvers, UC Irvine
Padhraic Smyth, UC Irvine
Michal Rosen-Zvi, UC Irvine
Tom Griffiths, Brown University

Applications in Machine Learning
• Automatically learn topics from large text collections
  • NSF/NIH grant proposals
  • 18th century newspapers
  • Enron email
• Topics provide quick overview of content

Enron email data
• 500,000 emails
• 5000 authors
• 1999-2002

Enron topics
TEXANS WIN FOOTBALL FANTASY SPORTSLINE PLAY TEAM GAME SPORTS GAMES
GOD LIFE MAN PEOPLE CHRIST FAITH LORD JESUS SPIRITUAL VISIT
ENVIRONMENTAL AIR MTBE EMISSIONS CLEAN EPA PENDING SAFETY WATER GASOLINE
FERC MARKET ISO COMMISSION ORDER FILING COMMENTS PRICE CALIFORNIA FILED
POWER CALIFORNIA ELECTRICITY UTILITIES PRICES MARKET PRICE UTILITY CUSTOMERS ELECTRIC
STATE PLAN CALIFORNIA DAVIS RATE BANKRUPTCY SOCAL POWER BONDS MOU
[Figure: TIMELINE from 2000 to 2003 for PERSON1 and PERSON2; marked event: May 22, 2000, start of California energy crisis]

NSF & NIH grant abstracts
• Analyze 22,000+ active grants during 2002
• NIH: NIMH, NCI
• NSF: BIO, SBE
• What topics are funded?
Topic map of funding programs

Example topics
BRAIN IMAGING: brain .101, fmri .054, imaging .054, functional .046, mri .033, subjects .033, magnetic .031, resonance .029, neuroimaging .028, structural .018
VISUAL PROCESSING: visual .075, processing .048, sensory .035, spatial .034, information .022, eye .020, stimuli .020, object .019, objects .019, perception .018
CHILD PARENT INTERACTION: children .153, child .089, parent .038, parents .032, family .032, families .022, early .020, problems .019, mothers .017, risk .017
MEMORY: memory .237, working .049, memories .022, tasks .022, retrieval .021, encoding .020, cognitive .019, processing .019, recognition .018, performance .016
HIV INTERVENTION: hiv .121, intervention .064, risk .050, sexual .043, prevention .037, aids .024, interventions .018, reduction .015, behavior .015, men .013
AGING: older .083, adults .071, age .066, elderly .041, geriatric .041, life .039, aging .033, late .032, cognitive .028, aged .022
SCHIZOPHRENIA: schizophrenia .226, patients .067, deficits .054, schizophrenic .027, psychosis .024, subjects .023, psychotic .022, dysfunction .019, abnormalities .017, clinical .015
ALZHEIMER DISEASE: disease .102, ad .074, alzheimer .043, diabetes .025, cardiovascular .016, insulin .015, vascular .015, blood .013, clinical .012, individuals .012

[Figure: topic map of funding programs, with regions labeled NSF SBE, NSF BIO, and NIH. Program labels include INT Japan and Korea; BCS Geography and regional science; SES Science and technology studies; DEB Ecological studies; MCB Biomolecular structure & function; NIMH Intramural research; NCI Cancer treatment.]

Pennsylvania Gazette
(courtesy of David Newman & Sharon Block, UC Irvine)
• 1728-1800
• 80,000 articles

Historical Trends in the Pennsylvania Gazette
(courtesy of David Newman & Sharon Block, UC Irvine)
Topics shown:
STATE GOVERNMENT CONSTITUTION LAW UNITED POWER CITIZEN PEOPLE PUBLIC CONGRES
SILK COTTON DITTO WHITE BLACK LINEN CLOTH WOMEN BLUE WORSTED
[Figure: Topic Proportion (%), 0 to 10, by YEAR, 1730 to 1800]

Conclusion
• Semantic association as probabilistic inference
  • prediction (compare with ACT-R)
• Relation to other theories of memory
  • REM
  • ACT-R
• Generative models are useful
  • make modeling assumptions explicit
  • flexible
Cognitive Science, Machine Learning