TEXT MINING
STATISTICAL MODELING OF TEXTUAL DATA
LECTURE 3

Mattias Villani
Division of Statistics
Dept. of Computer and Information Science
Linköping University


OVERVIEW LECTURE 3

- Demo of document classification in the tm package in R.
- Topic models (LDA).
- Demo of the topicmodels package in R.


TOPIC MODELS

- Models for unsupervised learning [no need for labelled data!], but more recently also used for supervised learning.
- Probabilistic generative models.
- Very popular in applications and research: > 8000 Google Scholar citations in 11 years.
- The basic topic models are extensions of the bag-of-words (unigram) model.
- Unigram model: each word is assumed to be drawn from the same word (term) distribution, estimated by relative frequency:

      P̂(w) = #w / N,

  where #w is the number of occurrences of word w and N is the total number of words.
- Many extensions in recent years: n-grams, supervised, nonparametric, relational topics, correlated topics, dynamically time-varying topics.


MIXTURE OF UNIGRAMS

- Mixture of unigrams:
  1. Draw a topic z_d for the d-th document from a topic distribution θ = (θ_1, ..., θ_K).
  2. Conditional on the drawn topic z_d, draw the words of the document from the word distribution for that topic.
- Topic models are mixed-membership models: each document can belong to several topics simultaneously.


MULTINOMIAL AND DIRICHLET DISTRIBUTIONS

- Multinomial distribution: a random discrete variable X ∈ {1, 2, ..., K} that assumes exactly one of K (unordered) values.
  - Pr(X = k) = θ_k.
  - Parameters θ = (θ_1, ..., θ_K).
- Dirichlet distribution: a random vector X = (X_1, ..., X_K) satisfying the constraint X_1 + X_2 + ... + X_K = 1 (the unit simplex).
  - Parameters: α = (α_1, ..., α_K).
  - Uniform distribution when α = (1, 1, ..., 1).
  - Small variance (informative) when the α's are large.
  - Bathtub shape when α_k < 1 for all k.


GENERATING A CORPUS FROM A TOPIC MODEL

- Assume that we have:
  - a fixed vocabulary V
  - D documents
  - N words in each document
  - K topics
- Generative process:
  1. For each topic (k = 1, ..., K):
     a. Draw a distribution over the words β_k ∼ Dir(η, η, ..., η).
  2. For each document (d = 1, ..., D):
     a. Draw a vector of topic proportions θ_d ∼ Dir(α_1, ..., α_K).
     b. For each word (n = 1, ..., N):
        i. Draw a topic assignment z_{d,n} ∼ Multinomial(θ_d).
        ii. Draw a word w_{d,n} ∼ Multinomial(β_{z_{d,n}}).


(HORRIBLE PICTURE OF A) TOPIC MODEL

[Figure: illustration of a topic model]


EXAMPLE: SIMULATION FROM TWO TOPICS

Word distributions for the two topics:

  Topic   Word distr.   probability   dna   gene   data   distribution
  1       β_1           0.5           0.1   0.0    0.2    0.2
  2       β_2           0.0           0.5   0.4    0.1    0.0

Simulated documents:

  Doc 1, θ_1 = (0.2, 0.8):
    Word 1: Topic = 2, Word = 'gene'
    Word 2: Topic = 2, Word = 'gene'
    Word 3: Topic = 1, Word = 'data'
  Doc 2, θ_2 = (0.9, 0.1):
    Word 1: Topic = 1, Word = 'probability'
    Word 2: Topic = 1, Word = 'data'
    Word 3: Topic = 1, Word = 'probability'
  Doc 3, θ_3 = (0.5, 0.5):
    ...
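Below is a minimal R sketch of this simulation, using the word distributions β_1, β_2 and the topic proportions θ_d from the example above. The vocabulary and topic proportions match the slide; the random seed and the three-words-per-document choice are illustrative assumptions, so the drawn words will not reproduce the exact output shown.

    # Simulate short documents from the two-topic model above.
    vocab <- c("probability", "dna", "gene", "data", "distribution")
    beta  <- rbind(c(0.5, 0.1, 0.0, 0.2, 0.2),   # beta_1: word distribution for topic 1
                   c(0.0, 0.5, 0.4, 0.1, 0.0))   # beta_2: word distribution for topic 2
    theta <- rbind(c(0.2, 0.8),                  # theta_1
                   c(0.9, 0.1),                  # theta_2
                   c(0.5, 0.5))                  # theta_3

    set.seed(1)      # illustrative choice
    n_words <- 3     # words per document, as in the example
    for (d in 1:nrow(theta)) {
      cat("Doc", d, "\n")
      for (n in 1:n_words) {
        z <- sample(1:2, size = 1, prob = theta[d, ])    # draw topic assignment z_{d,n}
        w <- sample(vocab, size = 1, prob = beta[z, ])   # draw word w_{d,n} from beta_z
        cat("  Word", n, ": Topic =", z, " Word =", w, "\n")
      }
    }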
LEARNING / INFERENCE IN TOPIC MODELS

- What do we know?
  - The words in the documents: w_{1:D}.
- What do we not know?
  - Topic proportions for each document: θ_{1:D}.
  - Topic assignments for each word in each document: z_{1:D}.
  - Word distributions for each topic: β_{1:K}.
- Do the Bayes dance: the posterior distribution p(θ_{1:D}, z_{1:D}, β_{1:K} | w_{1:D}).
- The posterior is mathematically intractable. Solutions:
  - Gibbs sampling (MCMC) [correct, but can be slow].
  - Variational Bayes [crude approximation of the posterior distribution, but typically rather accurate about the posterior mode (MAP)].
- The inferred θ_{1:D} can be used as features in supervised classification.


[Figure from Blei (2012), Probabilistic Topic Models, Communications of the ACM.]
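As a companion to the topicmodels demo mentioned in the overview, here is a minimal sketch of fitting an LDA model with Gibbs sampling in R. The toy corpus, number of topics, seed, and iteration counts are assumptions made for illustration; the function calls (VCorpus, DocumentTermMatrix, LDA, terms, posterior) come from the tm and topicmodels packages.

    # Fit a two-topic LDA model on a toy corpus with the topicmodels package.
    library(tm)
    library(topicmodels)

    docs <- c("gene dna gene data dna",
              "probability data probability distribution",
              "dna gene probability data distribution")

    corpus <- VCorpus(VectorSource(docs))
    dtm    <- DocumentTermMatrix(corpus)

    # Gibbs sampling: correct but can be slow (see the inference slide above).
    lda <- LDA(dtm, k = 2, method = "Gibbs",
               control = list(seed = 1, burnin = 500, iter = 1000))

    terms(lda, 3)            # top 3 words per topic (estimates of beta_k)
    posterior(lda)$topics    # per-document topic proportions (theta_d)

The estimated topic proportions returned by posterior(lda)$topics are the quantities that, as noted above, can be used as features in supervised classification.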