STATISTICAL TOPIC MODELING, Part 1
Andrea Tagarelli, Univ. of Calabria, Italy

Statistical topic modeling (1/3)
• Key assumption:
  • text data are represented as a mixture of topics, i.e., probability distributions over terms
• Generative model for documents:
  • document features are generated by latent variables
• Topic modeling vs. vector-space text modeling:
  • (latent) semantic aspects underlying correlations between words
  • document topical structure

Statistical topic modeling (2/3)
• Training on a (large) corpus to learn:
  • per-topic word distributions
  • per-document topic distributions
[Blei, CACM, 2012]

Statistical topic modeling (3/3) [Hofmann, SIGIR, 1999]
• Graphical "plate" notation:
  • standard representation for generative models
  • rectangles (plates) represent repeated parts of the model; the number attached to a plate indicates how many times the enclosed variable(s) are repeated

Observed and latent variables
• Observed variable: its current value is known
• Latent variable: a variable whose state cannot be observed
• Estimation problem:
  • estimate values for a set of distribution parameters that can best explain a set of observations
  • most likely parameter values: maximum likelihood of the model
• The likelihood is impossible to calculate in full; it is approximated through one of the following (code sketches are given at the end of this section):
  • Expectation-maximization (EM): an iterative method to estimate the probability of unobserved, latent variables, repeated until a local optimum is reached
  • Gibbs sampling: update parameters sample-wise
  • Variational inference: approximate the model by a simpler, tractable one

Probabilistic LSA
• PLSA [Hofmann, 2001]
  • probabilistic version of LSA, conceived to better handle problems of term polysemy
[Plate diagram: d → z → w; inner plate repeated M times (word positions per document), outer plate repeated N times (documents)]

PLSA training (1/2)
• Joint probability model:
  P(d, w) = P(d) Σz P(z|d) P(w|z)
• Likelihood, maximized over the corpus:
  L = Σd Σw n(d, w) log P(d, w)
  where n(d, w) is the number of occurrences of term w in document d

PLSA training (2/2)
• Training with EM:
  • initialization of the per-topic word distributions P(w|z) and per-document topic distributions P(z|d)
  • E-step: compute the posterior of the latent topics
    P(z|d, w) = P(z|d) P(w|z) / Σz' P(z'|d) P(w|z')
  • M-step: re-estimate the distributions from the expected counts
    P(w|z) ∝ Σd n(d, w) P(z|d, w)
    P(z|d) ∝ Σw n(d, w) P(z|d, w)

Latent Dirichlet Allocation (1/2)
• LDA [Blei et al., 2003]
  • adds a Dirichlet prior on the per-document topic distribution
  • 3-level scheme: corpus, documents, and terms
  • terms are the only observed variables
[Plate diagram: per-document topic distribution θj → topic assignment zj,i to the word at position i in doc dj → word token wj,i at position i in doc dj, drawn from the per-topic word distribution; inner plate: for each word position in a doc of length M; outer plate: for each doc in a collection of N docs]
[Moens and Vulic, Tutorial @WSDM 2014]

Latent Dirichlet Allocation (2/2)
• Meaning of the Dirichlet priors:
  • θ ~ Dir(α1, …, αK)
  • each αk is a prior observation count for the number of times topic zk is sampled in a document, prior to any word observations
  • analogously for each ηi, with β ~ Dir(η1, …, ηV)
• Inference for a new document: given α, β, η, infer θ
• The exact inference problem is intractable; training is carried out through:
  • Gibbs sampling
  • Variational inference
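
To make the PLSA EM updates above concrete, here is a minimal numpy sketch that trains PLSA on a toy term-document count matrix. It is not part of the original slides: the function name plsa_em, the toy matrix, and all hyperparameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def plsa_em(n_dw, K=2, iters=50):
    """Minimal PLSA trained with EM.
    n_dw: (D, W) term-document count matrix n(d, w).
    Returns P(z|d) of shape (D, K) and P(w|z) of shape (K, W)."""
    D, W = n_dw.shape
    # Random initialization of the two distributions, rows normalized
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(iters):
        # E-step: responsibilities P(z|d,w) ∝ P(z|d) P(w|z), shape (D, W, K)
        q = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        q /= q.sum(axis=2, keepdims=True) + 1e-12
        # M-step: re-estimate from the expected counts n(d,w) P(z|d,w)
        nq = n_dw[:, :, None] * q                       # (D, W, K)
        p_w_z = nq.sum(axis=0).T                        # (K, W)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = nq.sum(axis=1)                          # (D, K)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z

# Toy corpus: 4 docs over a 6-term vocabulary, with two obvious themes
n_dw = np.array([[3, 2, 1, 0, 0, 0],
                 [4, 1, 2, 0, 1, 0],
                 [0, 0, 1, 3, 2, 4],
                 [0, 1, 0, 2, 3, 3]], dtype=float)
p_z_d, p_w_z = plsa_em(n_dw, K=2)
print(np.round(p_z_d, 2))  # per-document topic distributions
```

On this toy matrix the two learned topics should separate the first three terms from the last three, with each document concentrating its mass on one topic.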
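
The "prior observation count" reading of the Dirichlet parameters can be seen by sampling θ ~ Dir(α1, …, αK) for a few symmetric settings αk = α. This small demo (again a sketch, not from the slides; K and the α values are arbitrary) shows that small α yields sparse per-document topic mixtures, while large α yields near-uniform ones.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 5  # number of topics

# theta ~ Dir(alpha * 1_K): each alpha_k acts like a prior count for topic k
for alpha in (0.1, 1.0, 10.0):
    theta = rng.dirichlet(alpha * np.ones(K), size=3)  # 3 sample documents
    print(f"alpha={alpha:5.1f}", np.round(theta, 2))
# Small alpha -> sparse mixtures (few active topics per document);
# large alpha -> near-uniform mixtures over all K topics.
```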
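
Finally, a compact sketch of collapsed Gibbs sampling for LDA, one of the two approximate training schemes named on the last slide. The full conditional used below, P(zj,i = k | rest) ∝ (n_dk + α)(n_kw + η)/(n_k + Vη), is the standard collapsed form; the function name, toy corpus, and hyperparameter values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def lda_gibbs(docs, V, K=2, alpha=0.1, eta=0.01, iters=200):
    """Collapsed Gibbs sampling for LDA.
    docs: list of documents, each a list of word ids in [0, V).
    Returns estimated per-doc topic dists theta and per-topic word dists beta."""
    N = len(docs)
    # Count tables: doc-topic, topic-word, and per-topic totals
    n_dk = np.zeros((N, K)); n_kw = np.zeros((K, V)); n_k = np.zeros(K)
    z = []  # current topic assignment z_ji for every word position
    for j, doc in enumerate(docs):
        zj = rng.integers(K, size=len(doc))
        z.append(zj)
        for w, t in zip(doc, zj):
            n_dk[j, t] += 1; n_kw[t, w] += 1; n_k[t] += 1
    for _ in range(iters):
        for j, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[j][i]
                # Remove this token's current assignment from the counts
                n_dk[j, t] -= 1; n_kw[t, w] -= 1; n_k[t] -= 1
                # Full conditional over the K topics, then resample
                p = (n_dk[j] + alpha) * (n_kw[:, w] + eta) / (n_k + V * eta)
                t = rng.choice(K, p=p / p.sum())
                z[j][i] = t
                n_dk[j, t] += 1; n_kw[t, w] += 1; n_k[t] += 1
    theta = n_dk + alpha; theta /= theta.sum(axis=1, keepdims=True)
    beta = n_kw + eta; beta /= beta.sum(axis=1, keepdims=True)
    return theta, beta

# Toy corpus of word ids over a vocabulary of size 6
docs = [[0, 0, 1, 2], [0, 1, 1, 2, 2], [3, 4, 5, 5], [3, 3, 4, 5]]
theta, beta = lda_gibbs(docs, V=6, K=2)
print(np.round(theta, 2))  # per-document topic distributions
```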