STATISTICAL TOPIC MODELING
part 1
Andrea Tagarelli
Univ. of Calabria, Italy
Statistical topic modeling (1/3)
• Key assumption:
• text data are represented as a mixture of topics, i.e., probability distributions over terms
• Generative model for documents:
• document features are modeled as being generated by latent variables
• Topic modeling vs. vector-space text modeling:
• (latent) semantic aspects underlying correlations between words
• document topical structure
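In symbols, the mixture assumption can be written as follows (standard topic-model notation, assumed here rather than taken from the slides), where $p(z = k \mid d)$ is the per-document topic distribution and $p(w \mid z = k)$ is the per-topic word distribution over the vocabulary:

$$ p(w \mid d) = \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d) $$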
Statistical topic modeling (2/3)
• Training on a (large) corpus to learn:
• Per-topic word distributions
• Per-document topic distributions
[Blei, CACM, 2012]
Statistical topic modeling (3/3)
[Hofmann, SIGIR, 1999]
• Graphical “Plate” notation
• Standard representation for generative models
• Rectangles (plates) represent repeated parts of the model
• the number in a plate's corner indicates how many times the enclosed variable(s) are repeated
Observed and latent variables
• Observed variable: its value is known (it is observed in the data)
• Latent variable: a variable whose state cannot be observed
• Estimation problem:
• estimate values for a set of distribution parameters that best explain a set of observations
• the most likely parameter values are those that maximize the likelihood of the model (formalized below)
• The likelihood is impossible to calculate in full
• Approximation through:
• Expectation-Maximization (EM): an iterative method to estimate the probabilities of the unobserved, latent variables, repeated until a local optimum is reached
• Gibbs sampling: update parameters sample by sample
• Variational inference: approximate the model by a simpler, tractable one
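As a reference formula (standard maximum-likelihood notation; the symbols $X$, $z$, $\theta$ are assumptions made here, not taken from the slides), for observed data $X$, latent variables $z$, and parameters $\theta$:

$$ \hat{\theta} = \arg\max_{\theta} \log p(X \mid \theta) = \arg\max_{\theta} \log \sum_{z} p(X, z \mid \theta) $$

The sum over the latent variables is what makes the likelihood hard to evaluate and maximize exactly, which is why EM, Gibbs sampling, or variational inference are used.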
Probabilistic LSA
• PLSA [Hofmann, 2001]
• Probabilistic version of LSA, conceived to better handle problems of term polysemy
[Plate diagram: document d, latent topic z, observed word w; inner plate repeated M times (word positions per doc), outer plate repeated N times (docs)]
PLSA training (1/2)
• Joint probability model:
• Likelihood
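For reference, the standard PLSA quantities in the asymmetric formulation of [Hofmann, 2001] (notation matches the plate diagram above; $n(d, w)$ denotes the count of word $w$ in document $d$):

$$ P(d, w) = P(d) \sum_{z} P(w \mid z)\, P(z \mid d) $$

$$ \mathcal{L} = \sum_{d} \sum_{w} n(d, w) \log P(d, w) $$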
PLSA training (2/2)
• Training with EM:
• Initialization of the per-topic word distributions and per-document topic distributions
• E-step:
• M-step:
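A sketch of the usual PLSA EM updates, in the same notation as above (the exact subscripting is assumed here):

E-step, posterior of the topics given a (document, word) pair:

$$ P(z \mid d, w) = \frac{P(w \mid z)\, P(z \mid d)}{\sum_{z'} P(w \mid z')\, P(z' \mid d)} $$

M-step, re-estimation of the per-topic word and per-document topic distributions:

$$ P(w \mid z) \propto \sum_{d} n(d, w)\, P(z \mid d, w), \qquad P(z \mid d) \propto \sum_{w} n(d, w)\, P(z \mid d, w) $$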
Latent Dirichlet Allocation (1/2)
• LDA [Blei et al., 2003]
• Adds a Dirichlet prior on the per-document topic distribution
• 3-level scheme: corpus, documents, and terms
• Terms are the only observed variables
[Plate diagram: per-document topic distribution; topic assignment to the word at position i in doc dj; observed word token at position i in doc dj; per-topic word distribution; inner plate repeated for each word position in a doc of length M; outer plate repeated for each doc in a collection of N docs]
[Moens and Vulic, Tutorial @WSDM 2014]
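A compact statement of the generative process that the plate diagram encodes (smoothed LDA; the subscripting of θ, β, z, w is an assumption made here to match the labels above):
• for each topic k: β_k ~ Dir(η)  (per-topic word distribution)
• for each document dj: θ_j ~ Dir(α)  (per-document topic distribution)
• for each word position i in dj: z_{i,j} ~ Mult(θ_j), then w_{i,j} ~ Mult(β_{z_{i,j}})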
Latent Dirichlet Allocation (2/2)
• Meaning of Dirichlet priors
• θ ~ Dir(α1, …, αK)
• Each αk acts as a prior (pseudo-)observation count for the number of times topic zk is sampled in a document, before any words are observed
• Analogously for each ηi, with β ~ Dir(η1, …, ηV)
• Inference for a new document: Given α, β, η, infer θ
• The exact inference problem is intractable; training is carried out through:
• Gibbs sampling
• Variational inference
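As a practical illustration of learning per-topic word distributions and per-document topic distributions, a minimal sketch using scikit-learn's variational-inference implementation of LDA; this example is not from the slides, and the toy corpus, topic count, and prior values are made up:

# Illustrative sketch only: corpus, K = 2 topics, and the priors are assumptions
# made for this example, not taken from the lecture.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat",
    "dogs and cats are friendly pets",
    "stock markets fell sharply today",
    "investors fear rising interest rates",
]

# Bag-of-words counts: one row per document, one column per vocabulary term
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(
    n_components=2,          # number of topics K
    doc_topic_prior=0.5,     # plays the role of alpha (per-document topic prior)
    topic_word_prior=0.1,    # plays the role of eta (per-topic word prior)
    learning_method="batch", # variational EM over the whole corpus
    random_state=0,
)

theta = lda.fit_transform(X)   # per-document topic distributions (rows sum to 1)
phi = lda.components_          # per-topic word pseudo-counts; normalize rows for p(w | z)

terms = vectorizer.get_feature_names_out()
for k, row in enumerate(phi):
    top_terms = [terms[i] for i in row.argsort()[::-1][:3]]
    print(f"topic {k}: {top_terms}")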