Text Mining and Topic Modeling
Padhraic Smyth, Department of Computer Science, University of California, Irvine

Progress Report
• New deadline: in class, Thursday February 18th (not Tuesday)
• Outline: 3 to 5 pages maximum
• Suggested content:
  • Brief restatement of the problem you are addressing (no need to repeat everything in your original proposal), e.g., ½ a page
  • Summary of progress so far
  • Background papers read
  • Data preprocessing, exploratory data analysis
  • Algorithms/software reviewed, implemented, tested
  • Initial results (if any)
  • Challenges and difficulties encountered
  • Brief outline of plans between now and the end of the quarter
• Use diagrams, figures, and tables where possible
• Write clearly: check what you write

Road Map
• Topics covered so far:
  • Data
  • Exploratory data analysis and visualization
  • Regression
  • Classification
  • Text classification
• Yet to come:
  • Unsupervised learning with text (topic models)
  • Social networks
  • Recommender systems (including Netflix)
  • Mining of Web data

Text Mining
• Document classification
• Information extraction
  • Named-entity extraction: recognize names of people, places, genes, etc.
• Document summarization
  • Google News, Newsblaster (http://www1.cs.columbia.edu/nlp/newsblaster/)
• Document clustering
• Topic modeling
  • Representing documents as mixtures of topics
• And many more...

Named-Entity Extraction
• Often a combination of:
  • Knowledge-based approaches (rules, parsers)
  • Machine learning (e.g., hidden Markov models)
  • Dictionaries
• Non-trivial, since entity names can be confused with other names
  • E.g., the gene name ABS and the abbreviation ABS
• Can also look for co-references
  • E.g., "IBM today…… Later, the company announced….."
• Useful as a preprocessing step for data mining, e.g., use entity names to train a classifier to predict the category of an article

Example: GATE/ANNIE extractor
• GATE: free software infrastructure for text analysis (University of Sheffield, UK)
• ANNIE: widely used entity recognizer, part of GATE (http://www.gate.ac.uk/annie/)

Information Extraction
• From Seymore, McCallum, Rosenfeld, Learning Hidden Markov Model Structure for Information Extraction, AAAI 1999

Topic Models
• Background on graphical models
• Unsupervised learning from text documents
  • Motivation
  • Topic model and learning algorithm
  • Results
• Extensions
  • Topics over time, author-topic models, etc.

Pennsylvania Gazette, 1728-1800: 80,000 articles, 25 million words (www.accessible.com)

Enron email data: 250,000 emails, 28,000 authors, 1999-2002

Other Examples of Large Corpora
• CiteSeer digital collection: 700,000 papers, 700,000 authors, 1986-2005
• MEDLINE collection: 16 million abstracts in medicine/biology
• US Patent collection
• and many more....

Unsupervised Learning from Text
• Large collections of unlabeled documents:
  • Web
  • Digital libraries
  • Email archives, etc.
• Often wish to organize/summarize/index/tag these documents automatically
• We will look at probabilistic techniques for clustering and topic extraction from sets of documents

Problems of Interest
• What topics do these documents "span"?
• Which documents are about a particular topic?
• How have topics changed over time?
• What does author X write about?
• Who is likely to write about topic Y?
• Who wrote this specific document?
• and so on…..

Review Slides on Graphical Models

Multinomial Models for Documents
• Example: 50,000 possible words in our vocabulary
• Simple memoryless model:
  • a 50,000-sided die; a non-uniform die: each side/word has its own probability
  • to generate N words we toss the die N times
• This is a simple probability model: p(document | φ) = ∏_i p(word_i | φ)
• To "learn" the model we just count frequencies: p(word i) = number of occurrences of word i / total number of words
• Typically interested in conditional multinomials, e.g., p(words | spam) versus p(words | non-spam)

Real examples of Word Multinomials, P(w | z):
• TOPIC 209: PROBABILISTIC 0.0778, BAYESIAN 0.0671, PROBABILITY 0.0532, CARLO 0.0309, MONTE 0.0308, DISTRIBUTION 0.0257, INFERENCE 0.0253, PROBABILITIES 0.0253, CONDITIONAL 0.0229, PRIOR 0.0219, ...
• TOPIC 289: RETRIEVAL 0.1179, TEXT 0.0853, DOCUMENTS 0.0527, INFORMATION 0.0504, DOCUMENT 0.0441, CONTENT 0.0242, INDEXING 0.0205, RELEVANCE 0.0159, COLLECTION 0.0146, RELEVANT 0.0136, ...

A Graphical Model for Multinomials
• p(doc | φ) = ∏_i p(w_i | φ)
• φ = "parameter vector" = set of probabilities, one per word
[graph: node φ with arrows to the word nodes w_1, w_2, ..., w_n]

Another view....
• p(doc | φ) = ∏_i p(w_i | φ)
• This is "plate notation": [φ → w_i, plate over i = 1:n]
• Items inside the plate are conditionally independent given the variable outside the plate
• There are "n" conditionally independent replicates represented by the plate

Being Bayesian....
[plate diagram: α → φ → w_i, i = 1:n]
• α is a prior on our multinomial parameters, e.g., a simple Dirichlet smoothing prior with symmetric parameter α, to avoid estimates of probabilities that are 0
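To make the multinomial-plus-Dirichlet-prior picture concrete, here is a minimal sketch (not part of the original slides; the toy documents, the vocabulary, and the smoothing value α = 0.1 are illustrative assumptions) of estimating smoothed word probabilities by counting, and of scoring a document under the resulting multinomial:

```python
import math
from collections import Counter

def smoothed_multinomial(docs, vocab, alpha=0.1):
    """Estimate p(word | phi) with a symmetric Dirichlet smoothing prior.

    Posterior-mean estimate: p(w) = (count(w) + alpha) / (N + alpha * |vocab|),
    which keeps unseen words from getting probability exactly 0.
    """
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    denom = total + alpha * len(vocab)
    return {w: (counts[w] + alpha) / denom for w in vocab}

def log_prob_doc(doc, phi):
    # p(document | phi) = product over words of p(word | phi), computed in log space
    return sum(math.log(phi[w]) for w in doc)

# Toy example (hypothetical documents and vocabulary)
vocab = ["money", "loan", "bank", "river", "stream"]
docs = [["money", "bank", "loan", "bank"],
        ["river", "stream", "bank"]]
phi = smoothed_multinomial(docs, vocab, alpha=0.1)
print(phi)
print(log_prob_doc(["bank", "loan"], phi))
```

With α > 0 every vocabulary word gets nonzero probability, which is exactly the role the symmetric Dirichlet prior plays on the slide above.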
Being Bayesian....
• Learning: infer p(φ | words, α), which is proportional to p(words | φ) p(φ | α)
[plate diagram: α → φ → w_i, i = 1:n]

Multiple Documents
• p(corpus | φ) = ∏_d p(doc_d | φ)
[plate diagram: α → φ → w_i, plates over words 1:n and documents 1:D]

Different Document Types
• p(w | φ) is a multinomial over words [first for one document, then with a plate over documents 1:D]
• Now give each document a "label" z_d: p(w | φ, z_d) is a multinomial over words
• Different multinomials, depending on the (discrete) value of z_d: φ now represents |z| different multinomials

Unknown Document Types
• Now the values of z_d for each document are unknown - hopeless?
• Not hopeless :) We can learn about both φ and the z_d's, e.g., with the EM algorithm
• This gives probabilistic clustering: p(w | z = k, φ) is the kth multinomial over words

Topic Model
• Now z_i is a "label" for each word (rather than for each document)
• θ_d: p(z_i | θ_d) = a distribution over topics that is document specific
• φ: p(w | φ, z_i = k) = a multinomial over words = a "topic"
[plate diagram: α → θ_d → z_i → w_i ← φ ← β, plates over words 1:n and documents 1:D]

Example of generating words
[figure: two topics φ (topic 1 over MONEY, LOAN, BANK; topic 2 over RIVER, STREAM, BANK), document-specific mixtures θ (e.g., all weight on topic 1, a .6/.4 mix, all weight on topic 2), and the resulting documents with their per-word topic assignments, e.g., "MONEY BANK BANK LOAN BANK MONEY BANK" with every word drawn from topic 1, and "RIVER MONEY BANK STREAM BANK BANK" with words drawn from a mix of the two topics]
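The generation example above can be written out as a short simulation. A sketch only: the topic distributions and mixture weights below are made-up values chosen to mirror the figure, not numbers taken from the slides:

```python
import random

# Two "topics": each is a distribution over the same small vocabulary
phi = {
    1: {"money": 0.4, "loan": 0.3, "bank": 0.3},
    2: {"river": 0.4, "stream": 0.3, "bank": 0.3},
}

def sample_categorical(dist):
    # Draw one item from a {value: probability} dictionary
    values, probs = zip(*dist.items())
    return random.choices(values, weights=probs, k=1)[0]

def generate_document(theta, n_words):
    """Generate (word, topic) pairs: topic z ~ theta, then word w ~ phi[z]."""
    doc = []
    for _ in range(n_words):
        z = sample_categorical(theta)   # document-specific topic mixture
        w = sample_categorical(phi[z])  # topic-specific word distribution
        doc.append((w, z))
    return doc

# One document that is all topic 1, one that mixes the topics 60/40
print(generate_document({1: 1.0}, 7))
print(generate_document({1: 0.6, 2: 0.4}, 7))
```

Learning runs this process in reverse: given only the generated words, recover φ, θ, and the per-word topic assignments.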
Learning
[figure: the same picture, but now only the documents are observed; the topics φ, the mixtures θ, and the per-word topic assignments are all "?" and must be inferred from the data]

Key Features of Topic Models
• The model allows a document to be composed of multiple topics
  • More powerful than "1 document -> 1 cluster"
• Completely unsupervised
  • Topics learned directly from data
  • Leverages strong dependencies at the word level
• Learning algorithm
  • Gibbs sampling is the method of choice
• Scalable
  • Linear in the number of word tokens
  • Can be run on millions of documents

Document generation as a probabilistic process
• Each topic is a distribution over words
• Each document is a mixture of topics
• Each word is chosen from a single topic
• P(w_i) = Σ_{j=1..T} P(w_i | z_i = j) P(z_i = j), where P(w_i | z_i = j) comes from the topic parameters φ^(j) and P(z_i = j) from the document parameters θ^(d)

Learning the Model
• Three sets of latent variables we can learn:
  • topic-word distributions φ
  • document-topic distributions θ
  • topic assignments z for each word
• Options:
  • EM algorithm to find point estimates of φ and θ (e.g., Chien and Wu, IEEE Trans ASLP, 2008)
  • Gibbs sampling: find p(φ | data), p(θ | data), p(z | data); can be slow to converge
  • Collapsed Gibbs sampling: the most widely used method
• [See also Asuncion, Welling, Smyth, Teh, UAI 2009 for additional discussion]

Gibbs Sampling
• Say we have 3 parameters x, y, z, and some data
• Bayesian learning:
  • We want to compute p(x, y, z | data)
  • But frequently it is impossible to compute this exactly
  • However, often we can compute conditionals for individual variables, e.g., p(x | y, z, data)
  • Not clear how this is useful yet, since it assumes y and z are known (i.e., we condition on them)

Gibbs Sampling 2
• Example of Gibbs sampling:
  • Initialize with x', y', z' (e.g., randomly)
  • Iterate:
    • Sample new x' ~ P(x | y', z', data)
    • Sample new y' ~ P(y | x', z', data)
    • Sample new z' ~ P(z | x', y', data)
  • Continue for some (large) number of iterations
  • Each iteration consists of a sweep through the hidden variables or parameters (here x, y, and z)
• Gibbs sampling is a Markov chain Monte Carlo (MCMC) method
  • In the limit, the samples x', y', z' will be samples from the true joint distribution P(x, y, z | data)
  • This gives us an empirical estimate of P(x, y, z | data)

Example of Gibbs Sampling in 2d
[figure: from online MCMC tutorial notes by Frank Dellaert, Georgia Tech]

Computation
• Convergence
  • In the limit, samples x', y', z' are from P(x, y, z | data)
  • How many iterations are needed? This cannot be computed ahead of time
  • Early iterations are discarded ("burn-in")
  • Typically some quantities of interest are monitored to assess convergence
  • Convergence in Gibbs/MCMC is a tricky issue!
• Complexity per iteration
  • Linear in the number of hidden variables and parameters
  • Times the complexity of generating a sample each time

Gibbs Sampling for the Topic Model
• Recall: 3 sets of latent variables we can learn: topic-word distributions φ, document-topic distributions θ, topic assignments z for each word
• Gibbs sampling algorithm:
  • Initialize all the z's randomly to a topic: z_1, ..., z_N
  • Iteration: for i = 1, ..., N, sample z_i ~ p(z_i | all other z's, data)
  • Continue for a fixed number of iterations or until convergence
• Note that this is collapsed Gibbs sampling: we sample from p(z_1, ..., z_N | data), "collapsing" over φ and θ

Topic Model
[plate diagram: θ_d → z_i → w_i ← φ, plates over words 1:n and documents 1:D]

Sampling each Topic-Word Assignment
• p(z_i = t | z_{-i}) ∝ (n_td^{-i} + α) / (Σ_{t'} n_t'd^{-i} + T α) × (n_wt^{-i} + β) / (Σ_{w'} n_w't^{-i} + W β)
• n_td = count of topic t assigned to document d; n_wt = count of word w assigned to topic t; the superscript -i means the counts are taken with word i's current assignment removed
• This is the probability that word i is assigned to topic t
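A compact sketch of a collapsed Gibbs sampler built around the update above. It is for illustration only: the hyperparameter values and variable names are assumptions, the document-length denominator is dropped because it is constant across topics, and production implementations (e.g., the toolboxes cited later) are far more optimized:

```python
import random
from collections import defaultdict

def collapsed_gibbs_lda(docs, n_topics, vocab, alpha=0.1, beta=0.01, n_iter=200):
    """Minimal collapsed Gibbs sampler for the topic model.

    docs: list of documents, each a list of word strings.
    Resamples each token's topic z_i from
      p(z_i = t | z_-i) ∝ (n_td + alpha) * (n_wt + beta) / (n_t + W*beta),
    with word i's current assignment removed from the counts first.
    """
    W = len(vocab)
    n_td = defaultdict(int)   # counts: (topic, doc)
    n_wt = defaultdict(int)   # counts: (word, topic)
    n_t = defaultdict(int)    # counts: topic totals
    z = []                    # topic assignment for every token

    # Initialize all z's randomly (the "Starting the Gibbs Sampling" step)
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            t = random.randrange(n_topics)
            z[d].append(t)
            n_td[t, d] += 1
            n_wt[w, t] += 1
            n_t[t] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t_old = z[d][i]
                # Remove word i from the counts (the "-i" in the formula)
                n_td[t_old, d] -= 1
                n_wt[w, t_old] -= 1
                n_t[t_old] -= 1
                # Full conditional over topics (doc-length term is constant in t)
                weights = [(n_td[t, d] + alpha) * (n_wt[w, t] + beta) / (n_t[t] + W * beta)
                           for t in range(n_topics)]
                t_new = random.choices(range(n_topics), weights=weights, k=1)[0]
                z[d][i] = t_new
                n_td[t_new, d] += 1
                n_wt[w, t_new] += 1
                n_t[t_new] += 1

    # Point estimate of the topic-word distributions phi from the final sample
    phi = {t: {w: (n_wt[w, t] + beta) / (n_t[t] + W * beta) for w in vocab}
           for t in range(n_topics)}
    return phi, z
```

On a small corpus like the 16 artificial river/stream/bank/money/loan documents shown a few slides below, a sampler of this form typically separates the two underlying topics within a few dozen sweeps.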
Convergence Example
[figure: from Newman et al, JMLR, 2009]

Complexity
• Time
  • O(N T) per iteration, where N is the number of word "tokens" and T the number of topics
  • For fast sampling, see "Fast-LDA", Porteous et al, ACM SIGKDD 2008; also Yao, Mimno, McCallum, ACM SIGKDD 2009
  • For distributed algorithms, see Newman et al., Journal of Machine Learning Research, 2009, e.g., T = 1000, N = 100 million
• Space
  • O(D T + T W + N), where D is the number of documents and W is the number of unique words (size of vocabulary)
  • Can reduce these sizes by using sparse matrices: store only the non-zero doc-topic and topic-word counts, and apply smoothing only at prediction time

16 Artificial Documents
[figure: word-document occurrence matrix for documents 1-16 over the vocabulary River, Stream, Bank, Money, Loan]
• Can we recover the original topics and topic mixtures from this data?

Starting the Gibbs Sampling
• Assign word tokens randomly to topics (●=topic 1; ●=topic 2)
[figure: random topic assignments for documents 1-16]

After 1 iteration
[figure]

After 4 iterations
[figure]

After 32 iterations
• topic 1: stream .40, bank .35, river .25
• topic 2: bank .39, money .32, loan .29
[figure: recovered topic assignments for documents 1-16]

Software for Topic Modeling
• Mark Steyvers' public-domain MATLAB toolbox for topic modeling on the Web: psiexp.ss.uci.edu/research/programs_data/toolbox.htm

History of topic models
• Origins in statistics:
  • latent class models in social science
  • admixture models in statistical genetics
• Applications in computer science:
  • Hofmann, SIGIR, 1999
  • Blei, Ng, Jordan, JMLR 2003 (known as "LDA")
  • Griffiths and Steyvers, PNAS, 2004
• More recent work:
  • author-topic models: Steyvers et al, Rosen-Zvi et al, 2004
  • hierarchical topics: McCallum et al, 2006
  • correlated topic models: Blei and Lafferty, 2005
  • Dirichlet process models: Teh, Jordan, et al
  • large-scale web applications: Buntine et al, 2004, 2005
  • undirected models: Welling et al, 2004

Topic = probability distribution over words
TOPIC 209 TOPIC 289 WORD PROB.
WORD PROB. PROBABILISTIC 0.0778 RETRIEVAL 0.1179 BAYESIAN 0.0671 TEXT 0.0853 PROBABILITY 0.0532 DOCUMENTS 0.0527 CARLO 0.0309 INFORMATION 0.0504 MONTE 0.0308 DOCUMENT 0.0441 DISTRIBUTION 0.0257 CONTENT 0.0242 INFERENCE 0.0253 INDEXING 0.0205 PROBABILITIES 0.0253 RELEVANCE 0.0159 CONDITIONAL 0.0229 COLLECTION 0.0146 PRIOR 0.0219 RELEVANT 0.0136 .... ... ... ... P(w | z ) Important point: these distributions are learned in a completely automated “unsupervised” fashion from the data Data Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine Four example topics from NIPS TOPIC 19 TOPIC 24 TOPIC 29 WORD PROB. KERNEL 0.0683 0.0371 SUPPORT 0.0377 ACTION 0.0332 VECTOR 0.0257 0.0241 OPTIMAL 0.0208 KERNELS 0.0217 HANDWRITTEN 0.0169 ACTIONS 0.0208 SET 0.0205 0.0159 FUNCTION 0.0178 SVM 0.0204 IMAGE 0.0157 REWARD 0.0165 SPACE 0.0188 0.0254 DISTANCE 0.0153 SUTTON 0.0164 MACHINES 0.0168 PARAMETERS 0.0209 DIGIT 0.0149 AGENT 0.0136 ESTIMATE 0.0204 HAND 0.0126 DECISION 0.0118 MARGIN 0.0151 AUTHOR PROB. AUTHOR PROB. AUTHOR PROB. AUTHOR PROB. Tresp_V 0.0333 Simard_P 0.0694 Singh_S 0.1412 Smola_A 0.1033 Singer_Y 0.0281 Martin_G 0.0394 Barto_A 0.0471 Scholkopf_B 0.0730 Jebara_T 0.0207 LeCun_Y 0.0359 Sutton_R 0.0430 Burges_C 0.0489 Ghahramani_Z 0.0196 Denker_J 0.0278 Dayan_P 0.0324 Vapnik_V 0.0431 Ueda_N 0.0170 Henderson_D 0.0256 Parr_R 0.0314 Chapelle_O 0.0210 Jordan_M 0.0150 Revow_M 0.0229 Dietterich_T 0.0231 Cristianini_N 0.0185 Roweis_S 0.0123 Platt_J 0.0226 Tsitsiklis_J 0.0194 Ratsch_G 0.0172 Schuster_M 0.0104 Keeler_J 0.0192 Randlov_J 0.0167 Laskov_P 0.0169 Xu_L 0.0098 Rashid_M 0.0182 Bradtke_S 0.0161 Tipping_M 0.0153 Saul_L 0.0094 Sackinger_E 0.0132 Schwartz_A 0.0142 Sollich_P 0.0141 Data WORD PROB. WORD PROB. LIKELIHOOD 0.0539 RECOGNITION 0.0400 MIXTURE 0.0509 CHARACTER 0.0336 POLICY EM 0.0470 CHARACTERS 0.0250 DENSITY 0.0398 TANGENT GAUSSIAN 0.0349 ESTIMATION 0.0314 DIGITS LOG 0.0263 MAXIMUM Mining Lectures WORD TOPIC 87 PROB. 
REINFORCEMENT 0.0411 Text Mining and Topic Models REGRESSION 0.0155 © Padhraic Smyth, UC Irvine Topics from New York Times Terrorism Wall Street Firms Stock Market Bankruptcy SEPT_11 WAR SECURITY IRAQ TERRORISM NATION KILLED AFGHANISTAN ATTACKS OSAMA_BIN_LADEN AMERICAN ATTACK NEW_YORK_REGION NEW MILITARY NEW_YORK WORLD NATIONAL QAEDA TERRORIST_ATTACKS WALL_STREET ANALYSTS INVESTORS FIRM GOLDMAN_SACHS FIRMS INVESTMENT MERRILL_LYNCH COMPANIES SECURITIES RESEARCH STOCK BUSINESS ANALYST WALL_STREET_FIRMS SALOMON_SMITH_BARNEY CLIENTS INVESTMENT_BANKING INVESTMENT_BANKERS INVESTMENT_BANKS WEEK DOW_JONES POINTS 10_YR_TREASURY_YIELD PERCENT CLOSE NASDAQ_COMPOSITE STANDARD_POOR CHANGE FRIDAY DOW_INDUSTRIALS GRAPH_TRACKS EXPECTED BILLION NASDAQ_COMPOSITE_INDEX EST_02 PHOTO_YESTERDAY YEN 10 500_STOCK_INDEX BANKRUPTCY CREDITORS BANKRUPTCY_PROTECTION ASSETS COMPANY FILED BANKRUPTCY_FILING ENRON BANKRUPTCY_COURT KMART CHAPTER_11 FILING COOPER BILLIONS COMPANIES BANKRUPTCY_PROCEEDINGS DEBTS RESTRUCTURING CASE GROUP Data Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine Comparing Topics and Other Approaches • Clustering documents • • • LSI/LSA/SVD • • • • • Linear project of V-dim word vectors into lower dimensions Less interpretable Not generalizable • E.g., authors or other side-information Not as accurate • E.g., precision-recall: Hoffman, Blei et al, Buntine, etc Probabilistic models such as topic Models • • Data Computationally simpler… But a less accurate and less flexible model Mining Lectures “next-generation” text modeling, after LSI provide a modular extensible framework Text Mining and Topic Models © Padhraic Smyth, UC Irvine Clusters v. Topics Hidden Markov Models in Molecular Biology: New Algorithms and Applications Pierre Baldi, Yves C Hauvin, Tim Hunkapiller, Marcella A. McClure Hidden Markov Models (HMMs) can be applied to several important problems in molecular biology. We introduce a new convergent learning algorithm for HMMs that, unlike the classical Baum-Welch algorithm is smooth and can be applied online or in batch mode, with or without the usual Viterbi most likely path approximation. Left-right HMMs with insertion and deletion states are then trained to represent several protein families including immunoglobulins and kinases. In all cases, the models derived capture all the important statistical properties of the families and can be used efficiently in a number of important tasks such as multiple alignment, motif detection, and classification. Data Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine Clusters v. Topics One Cluster Hidden Markov Models in Molecular Biology: New Algorithms and Applications Pierre Baldi, Yves C Hauvin, Tim Hunkapiller, Marcella A. McClure Hidden Markov Models (HMMs) can be applied to several important problems in molecular biology. We introduce a new convergent learning algorithm for HMMs that, unlike the classical Baum-Welch algorithm is smooth and can be applied online or in batch mode, with or without the usual Viterbi most likely path approximation. Left-right HMMs with insertion and deletion states are then trained to represent several protein families including immunoglobulins and kinases. In all cases, the models derived capture all the important statistical properties of the families and can be used efficiently in a number of important tasks such as multiple alignment, motif detection, and classification. 
Data Mining Lectures model data models time neural figure state learning set parameters network probability number networks training function system algorithm hidden markov [cluster 88] Text Mining and Topic Models © Padhraic Smyth, UC Irvine Clusters v. Topics One Cluster Hidden Markov Models in Molecular Biology: New Algorithms and Applications Pierre Baldi, Yves C Hauvin, Tim Hunkapiller, Marcella A. McClure Hidden Markov Models (HMMs) can be applied to several important problems in molecular biology. We introduce a new convergent learning algorithm for HMMs that, unlike the classical Baum-Welch algorithm is smooth and can be applied online or in batch mode, with or without the usual Viterbi most likely path approximation. Left-right HMMs with insertion and deletion states are then trained to represent several protein families including immunoglobulins and kinases. In all cases, the models derived capture all the important statistical properties of the families and can be used efficiently in a number of important tasks such as multiple alignment, motif detection, and classification. Data Mining Lectures model data models time neural figure state learning set parameters network probability number networks training function system algorithm hidden markov [cluster 88] Text Mining and Topic Models Multiple Topics state hmm markov sequence models hidden states probabilities sequences parameters transition probability training hmms hybrid model likelihood modeling [topic 10] genetic structure chain protein population region algorithms human mouse selection fitness proteins search evolution generation function sequence sequences genes [topic 37] © Padhraic Smyth, UC Irvine Examples of Topics learned from Proceedings of the National Academy of Sciences Griffiths and Steyvers, PNAS, 2004 FORCE HIV SURFACE VIRUS MOLECULES INFECTED SOLUTION IMMUNODEFICIENCY SURFACES CD4 MICROSCOPY INFECTION WATER HUMAN FORCES VIRAL PARTICLES TAT STRENGTH GP120 POLYMER REPLICATION IONIC TYPE ATOMIC ENVELOPE AQUEOUS AIDS MOLECULAR REV PROPERTIES BLOOD LIQUID CCR5 SOLUTIONS INDIVIDUALS BEADS ENV MECHANICAL PERIPHERAL Data Mining Lectures MUSCLE CARDIAC HEART SKELETAL MYOCYTES VENTRICULAR MUSCLES SMOOTH HYPERTROPHY DYSTROPHIN HEARTS CONTRACTION FIBERS FUNCTION TISSUE RAT MYOCARDIAL ISOLATED MYOD FAILURE STRUCTURE ANGSTROM CRYSTAL RESIDUES STRUCTURES STRUCTURAL RESOLUTION HELIX THREE HELICES DETERMINED RAY CONFORMATION HELICAL HYDROPHOBIC SIDE DIMENSIONAL INTERACTIONS MOLECULE SURFACE Text Mining and Topic Models NEURONS BRAIN CORTEX CORTICAL OLFACTORY NUCLEUS NEURONAL LAYER RAT NUCLEI CEREBELLUM CEREBELLAR LATERAL CEREBRAL LAYERS GRANULE LABELED HIPPOCAMPUS AREAS THALAMIC TUMOR CANCER TUMORS HUMAN CELLS BREAST MELANOMA GROWTH CARCINOMA PROSTATE NORMAL CELL METASTATIC MALIGNANT LUNG CANCERS MICE NUDE PRIMARY OVARIAN © Padhraic Smyth, UC Irvine Examples of PNAS topics CHROMOSOME ADULT MALE PARASITE REGION DEVELOPMENT FEMALE PARASITES CHROMOSOMES FETAL MALES FALCIPARUM KB DAY FEMALES MALARIA MAP DEVELOPMENTAL SEX HOST MAPPING POSTNATAL SEXUAL PLASMODIUM CHROMOSOMAL EARLY BEHAVIOR ERYTHROCYTES HYBRIDIZATION OFFSPRING DAYS ERYTHROCYTE ARTIFICIAL REPRODUCTIVE NEONATAL MAJOR MAPPED LIFE MATING LEISHMANIA PHYSICAL DEVELOPING SOCIAL INFECTED MAPS EMBRYONIC SPECIES BLOOD GENOMIC BIRTH REPRODUCTION INFECTION DNA NEWBORN FERTILITY MOSQUITO LOCUS MATERNAL TESTIS INVASION GENOME PRESENT MATE TRYPANOSOMA GENE PERIOD GENETIC CRUZI HUMAN ANIMALS GERM BRUCEI SITU NEUROGENESIS CHOICE HUMAN CLONES ADULTS SRY HOSTS MODEL STUDIES MECHANISM 
MODELS PREVIOUS MECHANISMS EXPERIMENTAL SHOWN UNDERSTOOD BASED RESULTS POORLY PROPOSED RECENT ACTION DATA PRESENT UNKNOWN SIMPLE STUDY REMAIN DYNAMICS DEMONSTRATED UNDERLYING PREDICTED INDICATE MOLECULAR EXPLAIN WORK PS BEHAVIOR SUGGEST REMAINS THEORETICAL SUGGESTED SHOW ACCOUNT USING RESPONSIBLE THEORY FINDINGS PROCESS PREDICTS DEMONSTRATE SUGGEST COMPUTER REPORT UNCLEAR QUANTITATIVE INDICATED REPORT PREDICTIONS CONSISTENT LEADING CONSISTENT REPORTS LARGELY PARAMETERS CONTRAST KNOWN Examples of PNAS topics CHROMOSOME ADULT MALE PARASITE REGION DEVELOPMENT FEMALE PARASITES CHROMOSOMES FETAL MALES FALCIPARUM KB DAY FEMALES MALARIA MAP DEVELOPMENTAL SEX HOST MAPPING POSTNATAL SEXUAL PLASMODIUM CHROMOSOMAL EARLY BEHAVIOR ERYTHROCYTES HYBRIDIZATION OFFSPRING DAYS ERYTHROCYTE ARTIFICIAL REPRODUCTIVE NEONATAL MAJOR MAPPED LIFE MATING LEISHMANIA PHYSICAL DEVELOPING SOCIAL INFECTED MAPS EMBRYONIC SPECIES BLOOD GENOMIC BIRTH REPRODUCTION INFECTION DNA NEWBORN FERTILITY MOSQUITO LOCUS MATERNAL TESTIS INVASION GENOME PRESENT MATE TRYPANOSOMA GENE PERIOD GENETIC CRUZI HUMAN ANIMALS GERM BRUCEI SITU NEUROGENESIS CHOICE HUMAN CLONES ADULTS SRY HOSTS MODEL STUDIES MECHANISM MODELS PREVIOUS MECHANISMS EXPERIMENTAL SHOWN UNDERSTOOD BASED RESULTS POORLY PROPOSED RECENT ACTION DATA PRESENT UNKNOWN SIMPLE STUDY REMAIN DYNAMICS DEMONSTRATED UNDERLYING PREDICTED INDICATE MOLECULAR EXPLAIN WORK PS BEHAVIOR SUGGEST REMAINS THEORETICAL SUGGESTED SHOW ACCOUNT USING RESPONSIBLE THEORY FINDINGS PROCESS PREDICTS DEMONSTRATE SUGGEST COMPUTER REPORT UNCLEAR QUANTITATIVE INDICATED REPORT PREDICTIONS CONSISTENT LEADING CONSISTENT REPORTS LARGELY PARAMETERS CONTRAST KNOWN What can Topic Models be used for? • Queries • Who writes on this topic? • Data e.g., finding experts or reviewers in a particular area What topics does this person do research on? • Comparing groups of authors or documents • Discovering trends over time • Detecting unusual papers and authors • Interactive browsing of a digital library via topics • Parsing documents (and parts of documents) by topic • and more….. Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine What is this paper about? Empirical Bayes screening for multi-item associations Bill DuMouchel and Daryl Pregibon, ACM SIGKDD 2001 Most likely topics according to the model are… 1. 2. 3. 4. Data Mining Lectures data, mining, discovery, association, attribute.. set, subset, maximal, minimal, complete,… measurements, correlation, statistical, variation, Bayesian, model, prior, data, mixture,….. Text Mining and Topic Models © Padhraic Smyth, UC Irvine 3 of 300 example topics (TASA) TOPIC 82 WORD TOPIC 166 PROB. WORD PROB. WORD PROB. 
PLAY 0.0601 MUSIC 0.0903 PLAY 0.1358 PLAYS 0.0362 DANCE 0.0345 BALL 0.1288 STAGE 0.0305 SONG 0.0329 GAME 0.0654 MOVIE 0.0288 PLAY 0.0301 PLAYING 0.0418 SCENE 0.0253 SING 0.0265 HIT 0.0324 ROLE 0.0245 SINGING 0.0264 PLAYED 0.0312 AUDIENCE 0.0197 BAND 0.0260 BASEBALL 0.0274 THEATER 0.0186 PLAYED 0.0229 GAMES 0.0250 PART 0.0178 SANG 0.0224 BAT 0.0193 FILM 0.0148 SONGS 0.0208 RUN 0.0186 ACTORS 0.0145 DANCING 0.0198 THROW 0.0158 DRAMA 0.0136 PIANO 0.0169 BALLS 0.0154 REAL 0.0128 PLAYING 0.0159 TENNIS 0.0107 CHARACTER 0.0122 RHYTHM 0.0145 HOME 0.0099 ACTOR 0.0116 ALBERT 0.0134 CATCH 0.0098 ACT 0.0114 MUSICAL 0.0134 FIELD 0.0097 MOVIES 0.0114 DRUM 0.0129 PLAYER 0.0096 ACTION 0.0101 GUITAR 0.0098 FUN 0.0092 0.0097 BEAT 0.0097 THROWING 0.0083 0.0094 BALLET 0.0096 PITCHER 0.0080 SET SCENES Data TOPIC 77 Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine Automated Tagging of Words (numbers & colors topic assignments) A Play082 is written082 to be performed082 on a stage082 before a live093 audience082 or before motion270 picture004 or television004 cameras004 ( for later054 viewing004 by large202 audiences082). A Play082 is written082 because playwrights082 have something ... He was listening077 to music077 coming009 from a passing043 riverboat. The music077 had already captured006 his heart157 as well as his ear119. It was jazz077. Bix beiderbecke had already had music077 lessons077. He wanted268 to play077 the cornet. And he wanted268 to play077 jazz077... Jim296 plays166 the game166. Jim296 likes081 the game166 for one. The game166 book254 helps081 jim296. Don180 comes040 into the house038. Don180 and jim296 read254 the game166 book254. The boys020 see a game166 for two. The two boys020 play166 the game166.... Data Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine Four example topics from CiteSeer (T=300) TOPIC 205 TOPIC 209 WORD PROB. WORD DATA 0.1563 MINING 0.0674 BAYESIAN ATTRIBUTES 0.0462 DISCOVERY TOPIC 289 PROB. TOPIC 10 WORD PROB. WORD PROB. RETRIEVAL 0.1179 QUERY 0.1848 0.0671 TEXT 0.0853 QUERIES 0.1367 PROBABILITY 0.0532 DOCUMENTS 0.0527 INDEX 0.0488 0.0401 CARLO 0.0309 INFORMATION 0.0504 DATA 0.0368 ASSOCIATION 0.0335 MONTE 0.0308 DOCUMENT 0.0441 JOIN 0.0260 LARGE 0.0280 CONTENT 0.0242 INDEXING 0.0180 KNOWLEDGE 0.0260 INDEXING 0.0205 PROCESSING 0.0113 DATABASES 0.0210 RELEVANCE 0.0159 AGGREGATE 0.0110 ATTRIBUTE 0.0188 CONDITIONAL 0.0229 COLLECTION 0.0146 ACCESS 0.0102 DATASETS 0.0165 PRIOR 0.0219 RELEVANT 0.0136 PRESENT 0.0095 AUTHOR PROB. AUTHOR PROB. AUTHOR PROB. AUTHOR PROB. Han_J 0.0196 Friedman_N 0.0094 Oard_D 0.0110 Suciu_D 0.0102 Rastogi_R 0.0094 Heckerman_D 0.0067 Croft_W 0.0056 Naughton_J 0.0095 Zaki_M 0.0084 Ghahramani_Z 0.0062 Jones_K 0.0053 Levy_A 0.0071 Shim_K 0.0077 Koller_D 0.0062 Schauble_P 0.0051 DeWitt_D 0.0068 Ng_R 0.0060 Jordan_M 0.0059 Voorhees_E 0.0050 Wong_L 0.0067 Liu_B 0.0058 Neal_R 0.0055 Singhal_A 0.0048 Mannila_H 0.0056 Raftery_A 0.0054 Hawking_D 0.0048 Ross_K 0.0061 Brin_S 0.0054 Lukasiewicz_T 0.0053 Merkl_D 0.0042 Hellerstein_J 0.0059 Liu_H 0.0047 Halpern_J 0.0052 Allan_J 0.0040 Lenzerini_M 0.0054 Holder_L 0.0044 Muller_P 0.0048 Doermann_D 0.0039 Moerkotte_G 0.0053 PROBABILISTIC 0.0778 DISTRIBUTION 0.0257 INFERENCE 0.0253 PROBABILITIES 0.0253 Chakrabarti_K 0.0064 More CiteSeer Topics TOPIC 10 TOPIC 209 WORD TOPIC 87 WORD PROB. SPEECH 0.1134 RECOGNITION 0.0349 BAYESIAN 0.0671 TOPIC 20 PROB. WORD PROB. WORD PROB. 
PROBABILISTIC 0.0778 USER 0.2541 STARS 0.0164 INTERFACE 0.1080 OBSERVATIONS 0.0150 WORD 0.0295 PROBABILITY 0.0532 USERS 0.0788 SOLAR 0.0150 SPEAKER 0.0227 CARLO 0.0309 INTERFACES 0.0433 MAGNETIC 0.0145 ACOUSTIC 0.0205 MONTE 0.0308 GRAPHICAL 0.0392 RAY 0.0144 RATE 0.0134 DISTRIBUTION 0.0257 INTERACTIVE 0.0354 EMISSION 0.0134 SPOKEN 0.0132 INFERENCE 0.0253 INTERACTION 0.0261 GALAXIES 0.0124 SOUND 0.0127 VISUAL 0.0203 OBSERVED 0.0108 TRAINING 0.0104 CONDITIONAL 0.0229 DISPLAY 0.0128 SUBJECT 0.0101 MUSIC 0.0102 PRIOR 0.0219 MANIPULATION 0.0099 STAR 0.0087 AUTHOR PROB. AUTHOR PROB. AUTHOR PROB. AUTHOR PROB. Waibel_A 0.0156 Friedman_N 0.0094 Shneiderman_B 0.0060 Linsky_J 0.0143 Gauvain_J 0.0133 Heckerman_D 0.0067 Rauterberg_M 0.0031 Falcke_H 0.0131 Lamel_L 0.0128 Ghahramani_Z 0.0062 Lavana_H 0.0024 Mursula_K 0.0089 Woodland_P 0.0124 Koller_D 0.0062 Pentland_A 0.0021 Butler_R 0.0083 Ney_H 0.0080 Jordan_M 0.0059 Myers_B 0.0021 Bjorkman_K 0.0078 Hansen_J 0.0078 Neal_R 0.0055 Minas_M 0.0021 Knapp_G 0.0067 Renals_S 0.0072 Raftery_A 0.0054 Burnett_M 0.0021 Kundu_M 0.0063 Noth_E 0.0071 Lukasiewicz_T 0.0053 Winiwarter_W 0.0020 Christensen-J 0.0059 Boves_L 0.0070 Halpern_J 0.0052 Chang_S 0.0019 Cranmer_S 0.0055 Young_S 0.0069 Muller_P 0.0048 Korvemaker_B 0.0019 Nagar_N 0.0050 PROBABILITIES 0.0253 Temporal patterns in topics: hot and cold topics • CiteSeer papers from 1986-2002, about 200k papers • For each year, calculate the fraction of words assigned to each topic • -> a time-series for topics • • Data Hot topics become more prevalent Cold topics become less prevalent Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine 4 2 x 10 Document and Word Distribution by Year in the UCI CiteSeer Data 5 x 10 14 1.8 12 1.6 Number of Documents 1.2 8 1 6 0.8 0.6 Number of Words 10 1.4 4 0.4 2 0.2 0 1986 1988 1990 1992 1994 1996 1998 2000 2002 0 Year Data Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine 0.012 CHANGING TRENDS IN COMPUTER SCIENCE 0.01 WWW Topic Probability 0.008 0.006 INFORMATION RETRIEVAL 0.004 0.002 0 1990 1992 1994 1996 1998 2000 2002 Year Data Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine 0.012 CHANGING TRENDS IN COMPUTER SCIENCE 0.01 Topic Probability 0.008 OPERATING SYSTEMS WWW PROGRAMMING LANGUAGES 0.006 INFORMATION RETRIEVAL 0.004 0.002 0 1990 1992 1994 1996 1998 2000 2002 Year Data Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine -3 8 x 10 HOT TOPICS: MACHINE LEARNING/DATA MINING 7 Topic Probability 6 CLASSIFICATION 5 REGRESSION 4 DATA MINING 3 2 1 1990 1992 1994 1996 1998 2000 2002 Year Data Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine -3 5.5 x 10 BAYES MARCHES ON 5 BAYESIAN Topic Probability 4.5 PROBABILITY 4 3.5 STATISTICAL PREDICTION 3 2.5 2 1.5 1990 1992 1994 1996 1998 2000 2002 Year Data Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine -3 9 x 10 SECURITY-RELATED TOPICS 8 Topic Probability 7 6 5 COMPUTER SECURITY 4 3 ENCRYPTION 2 1 1990 1992 1994 1996 1998 2000 2002 Year Data Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine 0.012 INTERESTING "TOPICS" 0.01 FRENCH WORDS: LA, LES, UNE, NOUS, EST Topic Probability 0.008 0.006 DARPA 0.004 0.002 0 1990 MATH SYMBOLS: GAMMA, DELTA, OMEGA 1992 1994 1996 1998 2000 2002 Year Data Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine Four example topics from NIPS (T=100) TOPIC 19 TOPIC 24 TOPIC 29 WORD PROB. WORD PROB. 
WORD LIKELIHOOD 0.0539 RECOGNITION 0.0400 MIXTURE 0.0509 CHARACTER 0.0336 POLICY EM 0.0470 CHARACTERS 0.0250 DENSITY 0.0398 TANGENT GAUSSIAN 0.0349 ESTIMATION 0.0314 DIGITS LOG 0.0263 MAXIMUM TOPIC 87 PROB. WORD PROB. KERNEL 0.0683 0.0371 SUPPORT 0.0377 ACTION 0.0332 VECTOR 0.0257 0.0241 OPTIMAL 0.0208 KERNELS 0.0217 HANDWRITTEN 0.0169 ACTIONS 0.0208 SET 0.0205 0.0159 FUNCTION 0.0178 SVM 0.0204 IMAGE 0.0157 REWARD 0.0165 SPACE 0.0188 0.0254 DISTANCE 0.0153 SUTTON 0.0164 MACHINES 0.0168 PARAMETERS 0.0209 DIGIT 0.0149 AGENT 0.0136 ESTIMATE 0.0204 HAND 0.0126 DECISION 0.0118 MARGIN 0.0151 AUTHOR PROB. AUTHOR PROB. AUTHOR PROB. AUTHOR PROB. Tresp_V 0.0333 Simard_P 0.0694 Singh_S 0.1412 Smola_A 0.1033 Singer_Y 0.0281 Martin_G 0.0394 Barto_A 0.0471 Scholkopf_B 0.0730 Jebara_T 0.0207 LeCun_Y 0.0359 Sutton_R 0.0430 Burges_C 0.0489 Ghahramani_Z 0.0196 Denker_J 0.0278 Dayan_P 0.0324 Vapnik_V 0.0431 Ueda_N 0.0170 Henderson_D 0.0256 Parr_R 0.0314 Chapelle_O 0.0210 Jordan_M 0.0150 Revow_M 0.0229 Dietterich_T 0.0231 Cristianini_N 0.0185 Roweis_S 0.0123 Platt_J 0.0226 Tsitsiklis_J 0.0194 Ratsch_G 0.0172 Schuster_M 0.0104 Keeler_J 0.0192 Randlov_J 0.0167 Laskov_P 0.0169 Xu_L 0.0098 Rashid_M 0.0182 Bradtke_S 0.0161 Tipping_M 0.0153 Saul_L 0.0094 Sackinger_E 0.0132 Schwartz_A 0.0142 Sollich_P 0.0141 REINFORCEMENT 0.0411 REGRESSION 0.0155 NIPS: support vector topic Data Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine NIPS: neural network topic Data Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine NIPS Topics over Time topic mass (in vertical height) Figure courtesy of Xuerie Wang and Andrew McCallum, U Mass Amherst time Pennsylvania Gazette (courtesy of David Newman & Sharon Block, UC Irvine) 1728-1800 80,000 articles Data Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine Pennsylvania Gazette Data courtesy of David Newman (CS Dept) and Sharon Block (History Dept) Data Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine Topic trends from New York Times 15 Tour-de-France 10 5 0 Jan00 330,000 articles 2000-2002 Jul00 Jan01 Jul01 Jan02 Jul02 Jan03 Quarterly Earnings 30 20 10 0 Jan00 Jul00 Jan01 Jul01 Jan02 Jul02 50 Data Mining Lectures 0 Jan00 Jul00 Jan01 Jul01 Jan02 Text Mining and Topic Models Jul02 COMPANY QUARTER PERCENT ANALYST SHARE SALES EARNING Jan03 Anthrax 100 TOUR RIDER LANCE_ARMSTRONG TEAM BIKE RACE FRANCE Jan03 ANTHRAX LETTER MAIL WORKER OFFICE SPORES POSTAL BUILDING © Padhraic Smyth, UC Irvine Enron email data 250,000 emails 28,000 authors 1999-2002 Data Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine Enron email: business topics TOPIC 36 Data TOPIC 72 TOPIC 23 TOPIC 54 WORD PROB. WORD PROB. WORD PROB. FEEDBACK 0.0781 PROJECT 0.0514 FERC 0.0554 PERFORMANCE 0.0462 PLANT 0.028 MARKET 0.0328 PROCESS 0.0455 COST 0.0182 ISO 0.0226 PEP 0.0446 MANAGEMENT 0.03 UNIT 0.0166 ORDER COMPLETE 0.0205 FACILITY 0.0165 QUESTIONS 0.0203 SITE 0.0136 CONSTRUCTION 0.0169 WORD PROB. ENVIRONMENTAL 0.0291 AIR 0.0232 MTBE 0.019 EMISSIONS 0.017 0.0212 CLEAN 0.0143 FILING 0.0149 EPA 0.0133 COMMENTS 0.0116 PENDING 0.0129 COMMISSION 0.0215 SELECTED 0.0187 PROJECTS 0.0117 PRICE 0.0116 SAFETY 0.0104 COMPLETED 0.0146 CONTRACT 0.011 CALIFORNIA 0.0110 WATER 0.0092 SYSTEM 0.0146 UNITS 0.0106 FILED 0.0110 GASOLINE 0.0086 SENDER PROB. SENDER PROB. SENDER PROB. SENDER PROB. 
perfmgmt 0.2195 *** 0.0288 *** 0.0532 *** 0.1339 perf eval process 0.0784 *** 0.022 *** 0.0454 *** 0.0275 enron announcements 0.0489 *** 0.0123 *** 0.0384 *** 0.0205 *** 0.0089 *** 0.0111 *** 0.0334 *** 0.0166 *** 0.0048 *** 0.0108 *** 0.0317 *** 0.0129 Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine Enron: non-work topics… TOPIC 66 Data TOPIC 182 TOPIC 113 TOPIC 109 WORD PROB. WORD PROB. WORD PROB. WORD PROB. HOLIDAY 0.0857 TEXANS 0.0145 GOD 0.0357 AMAZON 0.0312 PARTY 0.0368 WIN 0.0143 LIFE 0.0272 GIFT 0.0226 YEAR 0.0316 FOOTBALL 0.0137 MAN 0.0116 CLICK 0.0193 SEASON 0.0305 FANTASY 0.0129 PEOPLE 0.0103 SAVE 0.0147 COMPANY 0.0255 SPORTSLINE 0.0129 CHRIST 0.0092 SHOPPING 0.0140 CELEBRATION 0.0199 PLAY 0.0123 FAITH 0.0083 OFFER 0.0124 ENRON 0.0198 TEAM 0.0114 LORD 0.0079 HOLIDAY 0.0122 TIME 0.0194 GAME 0.0112 JESUS 0.0075 RECEIVE 0.0102 RECOGNIZE 0.019 SPORTS 0.011 SPIRITUAL 0.0066 SHIPPING 0.0100 MONTH 0.018 GAMES 0.0109 VISIT 0.0065 FLOWERS 0.0099 SENDER PROB. SENDER PROB. SENDER PROB. SENDER PROB. chairman & ceo 0.131 cbs sportsline com 0.0866 crosswalk com 0.2358 amazon com 0.1344 *** 0.0102 houston texans 0.0267 wordsmith 0.0208 jos a bank 0.0266 *** 0.0046 houstontexans 0.0203 *** 0.0107 sharperimageoffers 0.0136 *** 0.0022 sportsline rewards 0.0175 travelocity com 0.0094 general announcement 0.0017 pro football 0.0136 barnes & noble com 0.0089 Mining Lectures doctor dictionary 0.0101 *** Text Mining and Topic Models 0.0061 © Padhraic Smyth, UC Irvine Enron: public-interest topics... TOPIC 18 Data TOPIC 22 TOPIC 114 WORD PROB. WORD PROB. WORD PROB. TOPIC 194 WORD PROB. POWER 0.0915 STATE 0.0253 COMMITTEE 0.0197 LAW 0.0380 CALIFORNIA 0.0756 PLAN 0.0245 BILL 0.0189 TESTIMONY 0.0201 ELECTRICITY 0.0331 CALIFORNIA 0.0137 HOUSE 0.0169 ATTORNEY 0.0164 UTILITIES 0.0253 POLITICIAN Y 0.0137 SETTLEMENT 0.0131 PRICES 0.0249 RATE 0.0131 LEGAL 0.0100 MARKET 0.0244 EXHIBIT 0.0098 PRICE 0.0207 SOCAL 0.0119 CONGRESS 0.0112 CLE 0.0093 UTILITY 0.0140 POWER 0.0114 PRESIDENT 0.0105 SOCALGAS 0.0093 CUSTOMERS 0.0134 BONDS 0.0109 METALS 0.0091 ELECTRIC 0.0120 MOU 0.0107 DC 0.0093 PERSON Z 0.0083 SENDER PROB. SENDER PROB. SENDER PROB. SENDER PROB. *** 0.1160 *** 0.0395 *** 0.0696 *** 0.0696 *** 0.0518 *** 0.0337 *** 0.0453 *** 0.0453 *** 0.0284 *** 0.0295 *** 0.0255 *** 0.0255 *** 0.0272 *** 0.0251 *** 0.0173 *** 0.0173 *** 0.0266 *** 0.0202 *** 0.0317 *** 0.0317 Mining Lectures BANKRUPTCY 0.0126 WASHINGTON 0.0140 SENATE 0.0135 POLITICIAN X 0.0114 LEGISLATION 0.0099 Text Mining and Topic Models © Padhraic Smyth, UC Irvine Comparing Predictive Power 14000 • Author model 10000 8000 6000 4000 6 24 10 25 64 16 4 Author-Topics 0 2000 Topics model 1 Perplexity (new words) 12000 Train models on part of a new document and predict remaining words # Observed words in document Data Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine Using Topic Models for Document Search Data Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine References on Topic Models Overviews: • Steyvers, M. & Griffiths, T. (2006). Probabilistic topic models. In T. Landauer, D McNamara, S. Dennis, and W. Kintsch (eds), Latent Semantic Analysis: A Road to Meaning. Laurence Erlbaum • D. Blei and J. Lafferty. Topic Models. In A. Srivastava and M. Sahami, editors, Text Mining: Theory and Applications. Taylor and Francis, 2009. More specific: • Latent Dirichlet allocation David Blei, Andrew Y. Ng and Michael Jordan. Journal of Machine Learning Research, 3:993-1022, 2003. 
Data • Finding scientific topics Griffiths, T., & Steyvers, M. (2004). Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228-5235 • Probabilistic author-topic models for information discovery M. Steyvers, P. Smyth, M. Rosen-Zvi, and T. Griffiths, in Proceedings of the ACM SIGKDD Conference on Data Mining and Knowledge Discovery, August 2004. Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine ADDITIONAL SLIDES Data Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine Data Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine PubMed-Query Topics TOPIC 188 WORD BIOLOGICAL TOPIC 63 PROB. PROB. WORD PROB. 0.1002 PLAGUE 0.0296 BOTULISM 0.1014 AGENTS 0.0889 MEDICAL 0.0287 BOTULINUM 0.0888 THREAT 0.0396 CENTURY 0.0280 BIOTERRORISM 0.0348 MEDICINE 0.0266 0.0563 0.0669 INHIBITORS 0.0366 0.0340 INHIBITOR 0.0220 0.0245 PLASMA 0.0204 0.0203 POTENTIAL 0.0305 EPIDEMIC 0.0106 ATTACK 0.0290 GREAT 0.0091 CHEMICAL 0.0288 WARFARE 0.0219 CHINESE 0.0083 ANTHRAX 0.0146 FRENCH 0.0082 PARALYSIS 0.0124 PROB. AUTHOR PROB. AUTHOR PROTEASE 0.0916 TYPE HISTORY PROB. HIV PROB. 0.0877 0.0328 EPIDEMICS 0.0090 WORD TOXIN WEAPONS AUTHOR CLOSTRIDIUM INFANT NEUROTOXIN AMPRENAVIR 0.0527 0.0184 APV 0.0169 BONT 0.0167 DRUG 0.0169 FOOD 0.0134 RITONAVIR 0.0164 IMMUNODEFICIENCY0.0150 AUTHOR PROB. Atlas_RM 0.0044 Károly_L 0.0089 Hatheway_CL 0.0254 Sadler_BM 0.0129 Tegnell_A 0.0036 Jian-ping_Z 0.0085 Schiavo_G 0.0141 Tisdale_M 0.0118 Aas_P 0.0036 Sabbatani_S 0.0080 Sugiyama_H 0.0111 Lou_Y 0.0069 Arnon_SS 0.0108 Stein_DS 0.0069 Simpson_LL 0.0093 Haubrich_R 0.0061 Greenfield_RA Bricaire_F Data WORD TOPIC 32 TOPIC 85 Mining Lectures 0.0032 0.0032 Theodorides_J 0.0045 Bowers_JZ 0.0045 Text Mining and Topic Models © Padhraic Smyth, UC Irvine PubMed-Query Topics TOPIC 40 WORD TOPIC 89 PROB. WORD ANTHRACIS 0.1627 CHEMICAL 0.0578 ANTHRAX 0.1402 SARIN 0.0454 BACILLUS 0.1219 AGENT 0.0332 SPORES 0.0614 GAS 0.0312 CEREUS 0.0382 SPORE 0.0274 THURINGIENSIS 0.0177 VX PROB. ENZYME 0.0938 MUSTARD 0.0639 ACTIVE 0.0429 EXPOSURE 0.0444 SM SKIN 0.0208 REACTION 0.0225 EXPOSED 0.0185 AGENT 0.0140 0.0124 TOXIC 0.0197 PRODUCTS 0.0170 PROB. EPIDERMAL DAMAGE AUTHOR Mock_M 0.0203 Minami_M 0.0093 Phillips_AP 0.0125 Hoskin_FC 0.0092 Smith_WJ Welkos_SL 0.0083 Benschop_HP 0.0090 Turnbull_PC 0.0071 Raushel_FM 0.0084 Mining Lectures 0.0361 0.0264 0.0232 Wild_JR SITE 0.0399 0.0308 STERNE 0.0067 0.0353 SUBSTRATE ENZYMES 0.0220 Fouet_A PROB. 0.0657 NERVE AUTHOR WORD 0.0343 ACID PROB. HD PROB. SULFUR 0.0152 AUTHOR WORD 0.0268 SUBTILIS INHALATIONAL 0.0104 Data AGENTS TOPIC 178 TOPIC 104 0.0075 0.0129 0.0116 PROB. Monteiro-Riviere_NA 0.0284 SUBSTRATES 0.0201 FOLD CATALYTIC RATE AUTHOR 0.0176 0.0154 0.0148 PROB. 
Masson_P 0.0166 0.0219 Kovach_IM 0.0137 Lindsay_CD 0.0214 Schramm_VL 0.0094 Sawyer_TW 0.0146 Meier_HL 0.0139 Text Mining and Topic Models Barak_D Broomfield_CA 0.0076 0.0072 © Padhraic Smyth, UC Irvine PubMed-Query: Topics by Country ISRAEL, n=196 authors TOPIC 188 p=0.049 BIOLOGICAL AGENTS THREAT BIOTERRORISM WEAPONS POTENTIAL ATTACK CHEMICAL WARFARE ANTHRAX TOPIC 6 p=0.045 INJURY INJURIES WAR TERRORIST MILITARY MEDICAL VICTIMS TRAUMA BLAST VETERANS TOPIC 133 p=0.043 HEALTH PUBLIC CARE SERVICES EDUCATION NATIONAL COMMUNITY INFORMATION PREVENTION LOCAL TOPIC 7 p=0.026 RENAL HFRS VIRUS SYNDROME FEVER TOPIC 79 p=0.024 FINDINGS CHEST CT LUNG CLINICAL HEMORRHAGIC PULMONARY TOPIC 104 p=0.027 HD MUSTARD EXPOSURE SM SULFUR SKIN EXPOSED AGENT EPIDERMAL DAMAGE TOPIC 159 p=0.025 EMERGENCY RESPONSE MEDICAL PREPAREDNESS DISASTER MANAGEMENT TRAINING EVENTS BIOTERRORISM LOCAL CHINA, n=1775 authors Data Mining Lectures TOPIC 177 p=0.045 SARS RESPIRATORY SEVERE COV SYNDROME ACUTE CORONAVIRUS CHINA HANTAVIRUS ABNORMAL HANTAAN Text MiningINVOLVEMENT and Topic Models TOPIC 49 p=0.024 METHODS RESULTS CONCLUSION OBJECTIVE CONCLUSIONS BACKGROUND STUDY OBJECTIVES TOPIC 197 p=0.023 PATIENTS HOSPITAL PATIENT ADMITTED TWENTY HOSPITALIZED CONSECUTIVE PROSPECTIVELY © Padhraic Smyth, UC Irvine POTENTIAL ATTACK CHEMICAL WARFARE ANTHRAX MEDICAL VICTIMS TRAUMA BLAST VETERANS NATIONAL COMMUNITY INFORMATION PREVENTION LOCAL SKIN TOPIC 7 p=0.026 RENAL HFRS VIRUS SYNDROME FEVER TOPIC 79 p=0.024 FINDINGS CHEST CT LUNG CLINICAL HEMORRHAGIC PULMONARY CONCLUSIONS BACKGROUND HANTAVIRUS HANTAAN PUUMALA ABNORMAL STUDY INVOLVEMENT COMMON OBJECTIVES INVESTIGATE HANTAVIRUSES RADIOGRAPHIC DESIGN EXPOSED MANAGEMENT TRAINING EVENTS BIOTERRORISM LOCAL PubMed-Query: Topics by Country EPIDERMAL AGENT DAMAGE CHINA, n=1775 authors TOPIC 177 p=0.045 SARS RESPIRATORY SEVERE COV SYNDROME ACUTE CORONAVIRUS CHINA KONG PROBABLE Data Mining Lectures Text Mining and Topic Models TOPIC 49 p=0.024 METHODS RESULTS CONCLUSION OBJECTIVE TOPIC 197 p=0.023 PATIENTS HOSPITAL PATIENT ADMITTED TWENTY HOSPITALIZED CONSECUTIVE PROSPECTIVELY DIAGNOSED PROGNOSIS © Padhraic Smyth, UC Irvine Examples of Topics from New York Times Terrorism Wall Street Firms Stock Market Bankruptcy SEPT_11 WAR SECURITY IRAQ TERRORISM NATION KILLED AFGHANISTAN ATTACKS OSAMA_BIN_LADEN AMERICAN ATTACK NEW_YORK_REGION NEW MILITARY NEW_YORK WORLD NATIONAL QAEDA TERRORIST_ATTACKS WALL_STREET ANALYSTS INVESTORS FIRM GOLDMAN_SACHS FIRMS INVESTMENT MERRILL_LYNCH COMPANIES SECURITIES RESEARCH STOCK BUSINESS ANALYST WALL_STREET_FIRMS SALOMON_SMITH_BARNEY CLIENTS INVESTMENT_BANKING INVESTMENT_BANKERS INVESTMENT_BANKS WEEK DOW_JONES POINTS 10_YR_TREASURY_YIELD PERCENT CLOSE NASDAQ_COMPOSITE STANDARD_POOR CHANGE FRIDAY DOW_INDUSTRIALS GRAPH_TRACKS EXPECTED BILLION NASDAQ_COMPOSITE_INDEX EST_02 PHOTO_YESTERDAY YEN 10 500_STOCK_INDEX BANKRUPTCY CREDITORS BANKRUPTCY_PROTECTION ASSETS COMPANY FILED BANKRUPTCY_FILING ENRON BANKRUPTCY_COURT KMART CHAPTER_11 FILING COOPER BILLIONS COMPANIES BANKRUPTCY_PROCEEDINGS DEBTS RESTRUCTURING CASE GROUP Data Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine Examples of Topics from New York Times Terrorism Wall Street Firms SEPT_11 WAR SECURITY IRAQ TERRORISM NATION KILLED AFGHANISTAN ATTACKS OSAMA_BIN_LADEN AMERICAN ATTACK NEW_YORK_REGION NEW MILITARY NEW_YORK WORLD NATIONAL QAEDA TERRORIST_ATTACKS WALL_STREET ANALYSTS INVESTORS FIRM GOLDMAN_SACHS FIRMS INVESTMENT MERRILL_LYNCH COMPANIES SECURITIES RESEARCH STOCK BUSINESS ANALYST 
WALL_STREET_FIRMS SALOMON_SMITH_BARNEY CLIENTS INVESTMENT_BANKING INVESTMENT_BANKERS INVESTMENT_BANKS Data Mining Lectures Stock Market Bankruptcy WEEK BANKRUPTCY DOW_JONES CREDITORS POINTS BANKRUPTCY_PROTECTION 10_YR_TREASURY_YIELD ASSETS PERCENT COMPANY CLOSE FILED NASDAQ_COMPOSITE BANKRUPTCY_FILING STANDARD_POOR ENRON CHANGE BANKRUPTCY_COURT FRIDAY KMART DOW_INDUSTRIALS CHAPTER_11 GRAPH_TRACKS FILING EXPECTED COOPER BILLION BILLIONS NASDAQ_COMPOSITE_INDEX COMPANIES EST_02 BANKRUPTCY_PROCEEDINGS PHOTO_YESTERDAY DEBTS YEN RESTRUCTURING 10 CASE 500_STOCK_INDEX GROUP Text Mining and Topic Models © Padhraic Smyth, UC Irvine Collocation Topic Model For each document, choose a mixture of topics For every word slot, sample a topic If x=0, sample a word from the topic If x=1, sample a word from the distribution based on previous word Data Mining Lectures TOPIC MIXTURE TOPIC TOPIC TOPIC ... WORD WORD WORD ... X Text Mining and Topic Models X ... © Padhraic Smyth, UC Irvine Collocation Topic Model Example: “DOW JONES RISES” TOPIC MIXTURE JONES is more likely explained as a word following DOW than as word sampled from topic TOPIC Result: DOW_JONES recognized as collocation DOW JONES X=1 Data Mining Lectures Text Mining and Topic Models TOPIC ... RISES ... X=0 ... © Padhraic Smyth, UC Irvine Using Topic Models for Information Retrieval Data Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine Stability of Topics • Content of topics is arbitrary across runs of model (e.g., topic #1 is not the same across runs) • However, • • • Data Majority of topics are stable over processing time Majority of topics can be aligned across runs Topics appear to represent genuine structure in data Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine Comparing NIPS topics from the same Markov chain 10 16 20 14 30 12 40 10 50 70 80 90 6 4 2 100 20 40 60 80 100 topics at t1=1000 Data Mining Lectures t1 ANALOG CIRCUIT CHIP CURRENT VOLTAGE VLSI INPUT OUTPUT CIRCUITS FIGURE PULSE SYNAPSE SILICON CMOS MEAD .043 .040 .034 .025 .023 .022 .018 .018 .015 .014 .012 .011 .011 .009 .008 t2 ANALOG CIRCUIT CHIP VOLTAGE CURRENT VLSI OUTPUT INPUT CIRCUITS PULSE SYNAPSE SILICON FIGURE CMOS GATE .044 .040 .037 .024 .023 .023 .022 .019 .015 .012 .012 .011 .010 .009 .009 8 60 KL distance Re-ordered topics at t2=2000 BEST KL = 0.54 Text Mining and Topic Models WORST KL = 4.78 t1 FEEDBACK ADAPTATION CORTEX REGION FIGURE FUNCTION BRAIN COMPUTATIONAL FIBER FIBERS ELECTRIC BOWER FISH SIMULATIONS CEREBELLAR .040 .034 .025 .016 .015 .014 .013 .013 .012 .011 .011 .010 .010 .009 .009 t2 ADAPTATION FIGURE SIMULATION GAIN EFFECTS FIBERS COMPUTATIONAL EXPERIMENT FIBER SITES RESULTS EXPERIMENTS ELECTRIC SITE NEURO .051 .033 .026 .025 .016 .014 .014 .014 .013 .012 .012 .012 .011 .009 .009 © Padhraic Smyth, UC Irvine 10 16 20 14 30 12 40 10 50 8 60 KL distance Re-ordered topics from chain 2 Comparing NIPS topics from two different Markov chains 70 80 90 6 4 BEST KL = 1.03 Chain 1 MOTOR TRAJECTORY ARM HAND MOVEMENT INVERSE DYNAMICS CONTROL JOINT POSITION FORWARD TRAJECTORIES MOVEMENTS FORCE MUSCLE .041 .031 .027 .022 .022 .019 .019 .018 .018 .017 .014 .014 .013 .012 .011 Chain 2 MOTOR ARM TRAJECTORY HAND MOVEMENT INVERSE JOINT DYNAMICS CONTROL POSITION FORWARD FORCE TRAJECTORIES MOVEMENTS CHANGE .040 .030 .030 .024 .023 .021 .021 .018 .015 .015 .015 .014 .013 .012 .010 2 100 20 40 60 80 100 topics from chain 1 Data Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine Outline • Background on 
statistical text modeling • Unsupervised learning from text documents • • • • Extensions • • Author-topic models Applications • • Motivation Topic model and learning algorithm Results Demo of topic browser Future directions Approach • The author-topic model • extension of the topic model: linking authors and topics • authors -> topics -> words • learned from data • completely unsupervised, no labels • generative model • Different questions or queries can be answered by appropriate probability calculus • E.g., p(author | words in document) • E.g., p(topic | author) Graphical Model x Author z Topic Graphical Model x Author z Topic w Word Graphical Model x Author z Topic w Word n Graphical Model a x Author z Topic w Word n D Graphical Model a Author-Topic distributions Topic-Word distributions q f x Author z Topic w Word n D Generative Process • Let’s assume authors A1 and A2 collaborate and produce a paper • • • A1 has multinomial topic distribution q1 A2 has multinomial topic distribution q2 For each word in the paper: 1. Sample an author x (uniformly) from A1, A2 2. Sample a topic z from qX 3. Sample a word w from a multinomial topic distribution fz Graphical Model a Author-Topic distributions Topic-Word distributions q f x Author z Topic w Word n D Learning • Observed • • Unknown • • • x, z : hidden variables Θ, f : unknown parameters Interested in: • • • W = observed words, A = sets of known authors p( x, z | W, A) p( θ , f | W, A) But exact learning is not tractable Step 1: Gibbs sampling of x and z a Average over unknown parameters q f x Author z Topic w Word n D Step 2: estimates of θ and f a q f x Author z Topic w Word Condition on particular samples of x and z n D Gibbs Sampling • Need full conditional distributions for variables • The probability of assigning the current word i to topic j and author k given everything else: P( zi j , xi k | wi m, z i , x i , w i , a d ) WT Cmj b WT C m' m' j Vb CmjAT a AT C j ' kj ' Ta number of times word w assigned to topic j number of times topic j assigned to author k Authors and Topics (CiteSeer Data) TOPIC 10 TOPIC 209 WORD TOPIC 87 WORD PROB. SPEECH 0.1134 RECOGNITION 0.0349 BAYESIAN 0.0671 TOPIC 20 PROB. WORD PROB. WORD PROB. PROBABILISTIC 0.0778 USER 0.2541 STARS 0.0164 INTERFACE 0.1080 OBSERVATIONS 0.0150 WORD 0.0295 PROBABILITY 0.0532 USERS 0.0788 SOLAR 0.0150 SPEAKER 0.0227 CARLO 0.0309 INTERFACES 0.0433 MAGNETIC 0.0145 ACOUSTIC 0.0205 MONTE 0.0308 GRAPHICAL 0.0392 RAY 0.0144 RATE 0.0134 DISTRIBUTION 0.0257 INTERACTIVE 0.0354 EMISSION 0.0134 SPOKEN 0.0132 INFERENCE 0.0253 INTERACTION 0.0261 GALAXIES 0.0124 SOUND 0.0127 VISUAL 0.0203 OBSERVED 0.0108 TRAINING 0.0104 CONDITIONAL 0.0229 DISPLAY 0.0128 SUBJECT 0.0101 MUSIC 0.0102 PRIOR 0.0219 MANIPULATION 0.0099 STAR 0.0087 AUTHOR PROB. AUTHOR PROB. AUTHOR PROB. AUTHOR PROB. 
Waibel_A 0.0156 Friedman_N 0.0094 Shneiderman_B 0.0060 Linsky_J 0.0143 Gauvain_J 0.0133 Heckerman_D 0.0067 Rauterberg_M 0.0031 Falcke_H 0.0131 Lamel_L 0.0128 Ghahramani_Z 0.0062 Lavana_H 0.0024 Mursula_K 0.0089 Woodland_P 0.0124 Koller_D 0.0062 Pentland_A 0.0021 Butler_R 0.0083 Ney_H 0.0080 Jordan_M 0.0059 Myers_B 0.0021 Bjorkman_K 0.0078 Hansen_J 0.0078 Neal_R 0.0055 Minas_M 0.0021 Knapp_G 0.0067 Renals_S 0.0072 Raftery_A 0.0054 Burnett_M 0.0021 Kundu_M 0.0063 Noth_E 0.0071 Lukasiewicz_T 0.0053 Winiwarter_W 0.0020 Christensen-J 0.0059 Boves_L 0.0070 Halpern_J 0.0052 Chang_S 0.0019 Cranmer_S 0.0055 Young_S 0.0069 Muller_P 0.0048 Korvemaker_B 0.0019 Nagar_N 0.0050 PROBABILITIES 0.0253 Some likely topics per author (CiteSeer) • Author = Andrew McCallum, U Mass: • • • • Topic 1: classification, training, generalization, decision, data,… Topic 2: learning, machine, examples, reinforcement, inductive,….. Topic 3: retrieval, text, document, information, content,… Author = Hector Garcia-Molina, Stanford: - Topic 1: query, index, data, join, processing, aggregate…. - Topic 2: transaction, concurrency, copy, permission, distributed…. - Topic 3: source, separation, paper, heterogeneous, merging….. • Author = Paul Cohen, USC/ISI: - Topic 1: agent, multi, coordination, autonomous, intelligent…. - Topic 2: planning, action, goal, world, execution, situation… - Topic 3: human, interaction, people, cognitive, social, natural…. Finding unusual papers for an author Perplexity = exp [entropy (words | model) ] = measure of surprise for model on data Can calculate perplexity of unseen documents, conditioned on the model for a particular author Data Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine Papers and Perplexities: M_Jordan Data Mining Lectures Factorial Hidden Markov Models 687 Learning from Incomplete Data 702 Text Mining and Topic Models © Padhraic Smyth, UC Irvine Papers and Perplexities: M_Jordan Factorial Hidden Markov Models 687 Learning from Incomplete Data 702 MEDIAN PERPLEXITY Data Mining Lectures Text Mining and Topic Models 2567 © Padhraic Smyth, UC Irvine Papers and Perplexities: M_Jordan Factorial Hidden Markov Models 687 Learning from Incomplete Data 702 MEDIAN PERPLEXITY Data Mining Lectures 2567 Defining and Handling Transient Fields in Pjama 14555 An Orthogonally Persistent JAVA 16021 Text Mining and Topic Models © Padhraic Smyth, UC Irvine Papers and Perplexities: T_Mitchell Data Mining Lectures Explanation-based Learning for Mobile Robot Perception 1093 Learning to Extract Symbolic Knowledge from the Web 1196 Text Mining and Topic Models © Padhraic Smyth, UC Irvine Papers and Perplexities: T_Mitchell Data Mining Lectures Explanation-based Learning for Mobile Robot Perception 1093 Learning to Extract Symbolic Knowledge from the Web 1196 MEDIAN PERPLEXITY 2837 Text Mining and Topic Models © Padhraic Smyth, UC Irvine Papers and Perplexities: T_Mitchell Data Mining Lectures Explanation-based Learning for Mobile Robot Perception 1093 Learning to Extract Symbolic Knowledge from the Web 1196 MEDIAN PERPLEXITY 2837 Text Classification from Labeled and Unlabeled Documents using EM 3802 A Method for Estimating Occupational Radiation Dose… 8814 Text Mining and Topic Models © Padhraic Smyth, UC Irvine Who wrote what? 
Test of model: 1) artificially combine abstracts from different authors 2) check whether assignment is to correct original author A method1 is described which like the kernel1 trick1 in support1 vector1 machines1 SVMs1 lets us generalize distance1 based2 algorithms to operate in feature1 spaces usually nonlinearly related to the input1 space This is done by identifying a class of kernels1 which can be represented as norm1 based2 distances1 in Hilbert spaces It turns1 out that common kernel1 algorithms such as SVMs1 and kernel1 PCA1 are actually really distance1 based2 algorithms and can be run2 with that class of kernels1 too As well as providing1 a useful new insight1 into how these algorithms work the present2 work can form the basis1 for conceiving new algorithms This paper presents2 a comprehensive approach for model2 based2 diagnosis2 which includes proposals for characterizing and computing2 preferred2 diagnoses2 assuming that the system2 description2 is augmented with a system2 structure2 a directed2 graph2 explicating the interconnections between system2 components2 Specifically we first introduce the notion of a consequence2 which is a syntactically2 unconstrained propositional2 sentence2 that characterizes all consistency2 based2 diagnoses2 and show2 that standard2 characterizations of diagnoses2 such as minimal conflicts1 correspond to syntactic2 variations1 on a consequence2 Second we propose a new syntactic2 variation on the consequence2 known as negation2 normal form NNF and discuss its merits compared to standard variations Third we introduce a basic algorithm2 for computing consequences in NNF given a structured system2 description We show that if the system2 structure2 does not contain cycles2 then there is always a linear size2 consequence2 in NNF which can be computed in linear time2 For arbitrary1 system2 structures2 we show a precise connection between the complexity2 of computing2 consequences and the topology of the underlying system2 structure2 Finally we present2 an algorithm2 that enumerates2 the preferred2 diagnoses2 characterized by a consequence2 The algorithm2 is shown1 to take linear time2 in the size2 of the consequence2 if the preference criterion1 satisfies some general conditions Data Mining Lectures Text Mining and Topic Models Written by (1) Scholkopf_B Written by (2) Darwiche_A © Padhraic Smyth, UC Irvine The Author-Topic Browser Querying on author Pazzani_M Querying on topic relevant to author Querying on document written by author Data Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine Comparing Predictive Power 14000 • Author model 10000 8000 6000 4000 6 24 10 25 64 16 4 Author-Topics 0 2000 Topics model 1 Perplexity (new words) 12000 Train models on part of a new document and predict remaining words # Observed words in document Data Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine Outline • Background on statistical text modeling • Unsupervised learning from text documents • • • • Extensions • • Author-topic models Applications • • Motivation Topic model and learning algorithm Results Demo of topic browser Future directions Online Demonstration of Topic Browser for UCI and UCSD Faculty Data Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine Data Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine Data Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine Data Mining Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine Data Mining 
Lectures Text Mining and Topic Models © Padhraic Smyth, UC Irvine