Part 1

British Museum Library, London
Picture Courtesy: flickr
Courtesy: Wikipedia
Topic Models and the Role of Sampling
Barnan Das
Topic Modeling
• Methods for automatically organizing,
understanding, searching and summarizing
large electronic archives:
  • Uncover hidden topical patterns in collections.
  • Annotate documents according to topics.
  • Use annotations to organize, summarize and
  search.
Topic Modeling
NIH Grants Topic Map 2011
NIH Map Viewer (https://app.nihmaps.org)
Topic Modeling Applications
• Information retrieval.
• Content-based image retrieval.
• Bioinformatics
Overview of this Presentation
• Latent Dirichlet allocation (LDA)
• Approximate posterior inference
  • Gibbs sampling
• Paper
  • Fast collapsed Gibbs sampling for LDA
Latent Dirichlet Allocation
David Blei’s Talk
Machine Learning Summer School, Cambridge 2009
D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet
allocation," Journal of Machine Learning Research,
vol. 3, pp. 993-1022, 2003.
Probabilistic Model
• Generative probabilistic modeling
  • Treats data as observations
  • Contains hidden variables
  • Hidden variables reflect thematic structure of
  the collection.
• Infer hidden structure using posterior
inference
  • Discovering topics in the collection.
• Placing new data into the estimated model
  • Situating new documents into the estimated
  topic structure.
Intuition
Generative Model
Posterior Distribution
• Only documents are observable.
• Infer underlying topic structure.
  • Topics that generated the documents.
  • For each document, distribution of topics.
  • For each word, which topic generated the word.
• Algorithmic challenge: Finding the
conditional distribution of all the latent
variables, given the observations.
LDA as Graphical Model
[Figure: plate diagram of LDA; node labels: Dirichlet, Multinomial, Multinomial, Dirichlet]
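The generative story behind the graphical model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `generate_corpus`, its parameters, the fixed document length, and the use of symmetric Dirichlet priors with scalar `alpha` and `beta` are choices made here, not taken from the slides.

```python
import numpy as np

def generate_corpus(n_docs, doc_len, n_topics, vocab_size, alpha, beta, seed=0):
    """Sample a toy corpus from the LDA generative process (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Per-corpus topics: each row is a distribution over the vocabulary (Dirichlet).
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    # Per-document topic proportions (Dirichlet).
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)
    docs, assignments = [], []
    for d in range(n_docs):
        # Per-word topic assignment z_{d,n}, drawn from theta_d (Multinomial).
        z_d = rng.choice(n_topics, size=doc_len, p=theta[d])
        # Observed word W_{d,n}, drawn from the assigned topic (Multinomial).
        w_d = np.array([rng.choice(vocab_size, p=phi[k]) for k in z_d])
        docs.append(w_d)
        assignments.append(z_d)
    return docs, assignments, theta, phi
```

Only `docs` would be observed in practice; `assignments`, `theta` and `phi` are the hidden variables that posterior inference tries to recover.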
Posterior Distribution
• From a collection of documents W, infer
  • Per-word topic assignment $z_{d,n}$
  • Per-document topic proportions $\theta_d$
  • Per-corpus topic distribution $\phi_k$
• Use posterior expectation to perform different
tasks.
Posterior Distribution
• Evaluate P(z|W): posterior distribution over
the assignment of words to topics.
• $\phi$ and $\theta$ can be estimated.
Computing P(z|W)
• Involves evaluating a probability distribution
over a large discrete space.
• Contribution of each $z_{d,n}$ depends on:
  • All $z_{-n}$ values.
  • $N_k^{W_{d,n}}$: number of times word $W_{d,n}$ has been
  assigned to topic k.
  • $N_k^{d}$: number of times a word from document d has
  been assigned to topic k.
• Sampling from the target distribution using
MCMC.
Approximate posterior inference:
Gibbs Sampling
C. M. Bishop, Pattern Recognition and Machine Learning.
New York: Springer, 2006.
Iain Murray’s Talk
Machine Learning Summer School, Cambridge 2009
Overview
• When exact inference is intractable.
• Standard sampling techniques have
limitations:
  • Cannot handle all kinds of distributions.
  • Cannot handle high dimensional data.
• MCMC techniques do not have these
limitations.
• Markov chain:
  • For random variables $x^{(1)},\dots,x^{(M)}$,
  $p(x^{(m+1)} \mid x^{(1)},\dots,x^{(m)}) = p(x^{(m+1)} \mid x^{(m)})$ for $m \in \{1,\dots,M-1\}$
Gibbs Sampling
• Target distribution: $p(x) = p(x_1,\dots,x_M)$.
• Choose the initial state of the Markov chain:
$\{x_i : i = 1,\dots,M\}$.
• Replace $x_i$ by a value drawn from the
distribution $p(x_i \mid x_{-i})$.
  • $x_i$: ith component of x
  • $x_{-i}$: $x_1,\dots,x_M$ with $x_i$ omitted.
• This process is repeated for all the variables.
• Repeat the whole cycle for however many
samples are needed.
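The cycle above can be illustrated on a small target where both full conditionals are known in closed form. The sketch below assumes a zero-mean, unit-variance bivariate normal with correlation `rho`, for which $p(x_1 \mid x_2)$ is normal with mean `rho * x2` and variance `1 - rho**2`. The target, the function name `gibbs_bivariate_normal` and the `burn_in` parameter are choices made for this example, not part of the slides.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0                 # initial state of the Markov chain
    sd = np.sqrt(1.0 - rho ** 2)      # std. dev. of each full conditional
    samples = np.empty((n_samples, 2))
    for t in range(burn_in + n_samples):
        # Replace each component by a draw from p(x_i | x_{-i}).
        x1 = rng.normal(rho * x2, sd)
        x2 = rng.normal(rho * x1, sd)
        if t >= burn_in:              # keep states only after the burn-in period
            samples[t - burn_in] = (x1, x2)
    return samples
```

For example, `np.corrcoef(gibbs_bivariate_normal(0.8, 20000).T)[0, 1]` should come out close to 0.8, up to Monte Carlo error.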
Why Gibbs Sampling?
• Compared to other MCMC techniques, Gibbs
sampling:
  • Is easy to implement
  • Requires little memory
  • Is competitive in speed and performance
Gibbs Sampling for LDA
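• The full conditional distribution is:
$$P(z_{d,n} = k \mid z_{-(d,n)}, W_d) \propto \frac{N_{\neg n,k}^{W_{d,n}} + \beta}{N_{\neg n,k}^{(\cdot)} + W\beta} \cdot \frac{N_{\neg n,k}^{d} + \alpha}{N_{\neg n,(\cdot)}^{d} + K\alpha}$$
  • First factor: probability of $W_{d,n}$ under topic k
  • Second factor: probability of topic k in document d
  • W: vocabulary size, K: number of topics
• With the normalization constant Z:
$$P(z_{d,n} = k \mid z_{-(d,n)}, W_d) = \frac{1}{Z} \cdot \frac{N_{\neg n,k}^{W_{d,n}} + \beta}{N_{\neg n,k}^{(\cdot)} + W\beta} \cdot \frac{N_{\neg n,k}^{d} + \alpha}{N_{\neg n,(\cdot)}^{d} + K\alpha}, \qquad Z = \sum_{k=1}^{K} \frac{N_{\neg n,k}^{W_{d,n}} + \beta}{N_{\neg n,k}^{(\cdot)} + W\beta} \cdot \frac{N_{\neg n,k}^{d} + \alpha}{N_{\neg n,(\cdot)}^{d} + K\alpha}$$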

Gibbs Sampling for LDA
• Target distribution: $P(z_{d,n} = k \mid z_{-(d,n)}, W_d)$
• Initial state of Markov chain: each $z_n$ takes a
value in {1,2,…,K}.
• Chain run for a number of iterations.
• In each iteration a new state is found by
sampling each $z_n$ from $P(z_{d,n} = k \mid z_{-(d,n)}, W_d)$
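A minimal sketch of this procedure follows. It is an illustration rather than the implementation used in the paper: the names `init_state`, `gibbs_sweep`, `run_gibbs` and the count arrays `n_kw`, `n_dk`, `n_k` are hypothetical, documents are assumed to be lists of integer word ids, and the per-document denominator is dropped because it is the same for every topic k.

```python
import numpy as np

def init_state(docs, n_topics, vocab_size, rng):
    """Random initial topic assignments and the matching count tables."""
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    n_kw = np.zeros((n_topics, vocab_size), dtype=np.int64)  # topic-word counts
    n_dk = np.zeros((len(docs), n_topics), dtype=np.int64)   # document-topic counts
    for d, doc in enumerate(docs):
        for w, k in zip(doc, z[d]):
            n_kw[k, w] += 1
            n_dk[d, k] += 1
    return z, n_kw, n_dk

def gibbs_sweep(docs, z, n_kw, n_dk, alpha, beta, rng):
    """One iteration: resample every z_{d,n} from its full conditional."""
    n_topics, vocab_size = n_kw.shape
    n_k = n_kw.sum(axis=1)                 # total words assigned to each topic
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k_old = z[d][n]
            # Remove the current token from the counts (the "not n" counts).
            n_kw[k_old, w] -= 1
            n_dk[d, k_old] -= 1
            n_k[k_old] -= 1
            # Unnormalized conditional; the factor 1/(N^d + K*alpha) is constant in k.
            p = (n_kw[:, w] + beta) / (n_k + vocab_size * beta) * (n_dk[d] + alpha)
            k_new = rng.choice(n_topics, p=p / p.sum())
            # Add the token back under its new topic.
            z[d][n] = k_new
            n_kw[k_new, w] += 1
            n_dk[d, k_new] += 1
            n_k[k_new] += 1

def run_gibbs(docs, n_topics, vocab_size, alpha, beta, n_iter, seed=0):
    """Initialize the chain and run it for n_iter sweeps."""
    rng = np.random.default_rng(seed)
    z, n_kw, n_dk = init_state(docs, n_topics, vocab_size, rng)
    for _ in range(n_iter):
        gibbs_sweep(docs, z, n_kw, n_dk, alpha, beta, rng)
    return z, n_kw, n_dk
```

Computing the K-vector `p` for every token is the O(K) step per sample in this sketch.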
Gibbs Sampling for LDA
• Subsequent samples are taken after an
appropriate lag to ensure that their
autocorrelation is low.
• This is collapsed Gibbs sampling.
• For a single sample, $\phi$ and $\theta$ are calculated
from z:
$$\hat{\phi}_k^{W_{d,n}} = \frac{N_k^{W_{d,n}} + \beta}{N_k^{(\cdot)} + W\beta}, \qquad \hat{\theta}_k^{d} = \frac{N_k^{d} + \alpha}{N^{d} + K\alpha}$$
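Given the count tables from a single sample of z, the two point estimates are direct array operations. The sketch below assumes the counts are stored as a topic-by-word matrix `n_kw` and a document-by-topic matrix `n_dk`; these names and the function name `estimate_phi_theta` are hypothetical.

```python
import numpy as np

def estimate_phi_theta(n_kw, n_dk, alpha, beta):
    """Point estimates of phi (topics x words) and theta (docs x topics)."""
    n_kw = np.asarray(n_kw, dtype=float)
    n_dk = np.asarray(n_dk, dtype=float)
    n_topics, vocab_size = n_kw.shape
    # phi_hat[k, w] = (N_k^w + beta) / (N_k^(.) + W * beta)
    phi = (n_kw + beta) / (n_kw.sum(axis=1, keepdims=True) + vocab_size * beta)
    # theta_hat[d, k] = (N_k^d + alpha) / (N^d + K * alpha)
    theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + n_topics * alpha)
    return phi, theta
```

Each row of `phi` and each row of `theta` sums to one by construction.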
Fast Collapsed Gibbs Sampling
For Latent Dirichlet Allocation
Ian Porteous, David Newman, Alexander Ihler,
Arthur Asuncion, Padhraic Smyth, Max Welling
University of California, Irvine
FastLDA: Graphical Representation
FastLDA: Segments
• Sequence of bounds on Z: $Z_1,\dots,Z_K$
• $Z_1 \ge Z_2 \ge \dots \ge Z_K = Z$
• Several segments $s_k^{l},\dots,s_k^{K}$ for each topic.
• 1st segment: conservative estimate of the
probability of the topic given the upper
bound $Z_k$ on the true normalization factor Z.
• Subsequent segments: corrections for the
missing probability mass for a topic given
the improved bound.
FastLDA: Segments
Upper Bounds for Z
• Find a sequence of improving bounds on the
normalization constant.
• Z defined in terms of component vectors.
• Hölder’s inequality to construct initial upper
bound.
• Bound intelligently improved for each topic.
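For reference, the generalized Hölder inequality for three non-negative vectors is written out below. The split of the unnormalized conditional into the component vectors a, b and c is an assumption made here for illustration (the slides only state that Z is defined in terms of component vectors), and the factor $1/(N^{d} + K\alpha)$ is left out because it does not depend on k.

```latex
Z \;=\; \sum_{k=1}^{K} a_k\, b_k\, c_k
  \;\le\; \|a\|_p \,\|b\|_q \,\|c\|_r ,
\qquad \frac{1}{p} + \frac{1}{q} + \frac{1}{r} = 1,
\qquad
a_k = N_{\neg n,k}^{d} + \alpha,\quad
b_k = N_{\neg n,k}^{W_{d,n}} + \beta,\quad
c_k = \frac{1}{N_{\neg n,k}^{(\cdot)} + W\beta}.
```

Applying the same inequality only to the topics not yet visited, and adding the exact terms for the visited ones, gives a bound that tightens as more topics are visited.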
Fast LDA Algorithm
• Algorithm:
  • Sort topics in decreasing order of $N_k^{d}$
  • u ~ Uniform[0,1]
  • For topics in order:
    • Calculate length of segments.
    • For each next topic, $Z_k$ is improved.
    • When sum of segments > u:
      • Return topic and stop.
• Complexity:
  • Not more than O(K log K) for any operation.
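The segment idea can be sketched as follows. This is a simplified illustration, not the algorithm as published. It assumes the unnormalized probability of topic k factors as `a[k] * b[k]`, and it bounds the unvisited tail with the Cauchy-Schwarz inequality (the two-vector special case of Hölder's inequality). It also finds the segment by a plain linear scan, so it does not reproduce the stated complexity. The names `fast_sample`, `a` and `b` are hypothetical.

```python
import numpy as np

def fast_sample(a, b, u):
    """Draw a topic with probability a[k]*b[k]/Z using improving upper bounds on Z."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    order = np.argsort(-a)                 # visit topics with large a[k] first
    a, b = a[order], b[order]
    K = len(a)
    # tail_a[l] = sum(a[l:]**2), with tail_a[K] = 0 (same for tail_b).
    tail_a = np.append(np.cumsum((a ** 2)[::-1])[::-1], 0.0)
    tail_b = np.append(np.cumsum((b ** 2)[::-1])[::-1], 0.0)
    p = np.zeros(K)                        # unnormalized probs of visited topics
    partial = 0.0                          # exact part of Z seen so far
    covered = 0.0                          # length of [0, 1] covered by segments
    z_prev = np.inf
    for l in range(K):
        p[l] = a[l] * b[l]
        partial += p[l]
        # Improved bound: exact sum so far plus Cauchy-Schwarz bound on the tail.
        z_l = partial + np.sqrt(tail_a[l + 1] * tail_b[l + 1])
        # Correction segments for earlier topics: mass missed under the old bound.
        for k in range(l):
            covered += p[k] * (1.0 / z_l - 1.0 / z_prev)
            if u <= covered:
                return int(order[k])
        # First segment of the newly visited topic.
        covered += p[l] / z_l
        if u <= covered:
            return int(order[l])
        z_prev = z_l
    return int(order[K - 1])               # guard against floating-point round-off
```

The segments of topic k add up to exactly `a[k] * b[k] / Z`, so the draw follows the target distribution, and the loop can stop before every topic has been visited whenever `u` falls in an early segment.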
Experiments
• Four large datasets:
  • NIPS full papers
  • Enron emails
  • NY Times news articles
  • PubMed abstracts
• $\beta = 0.01$ and $\alpha = 2/K$
• Computations run on workstations with:
  • Dual Xeon 3.0 GHz processors
  • Code compiled by gcc version 3.4.
Results
Speedup: 5-8 times
Results
• Speedup relatively insensitive
to number of documents in the
corpus.
Results
• Large Dirichlet parameter smooths
the distribution of the topics within a
document.
• FastLDA needs to visit and
compute more topics before drawing
a sample.
Discussions
• Other domains.
• Other sampling techniques.
• Distributions other than Dirichlet.
• Parallel computation.
  • Newman et al., “Scalable parallel topic models”.
• Deciding on the value of K.
• Choices of bounds.
• Reason behind choosing these datasets.
• Are the values mentioned in the paper magic numbers?
• Why were the words having count <10 discarded?
• Assigning weights to words.
Backup Slides
Dirichlet Distribution
• The Dirichlet distribution is an exponential
family distribution over the simplex, i.e.,
positive vectors that sum to one.
• The Dirichlet is conjugate to the multinomial.
Given a multinomial observation, the posterior
distribution of $\theta$ is a Dirichlet.
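Conjugacy means the posterior update is just an addition of counts to the prior parameters. The helper names `dirichlet_posterior` and `dirichlet_mean` below are hypothetical and serve only as a worked example.

```python
import numpy as np

def dirichlet_posterior(alpha, counts):
    """Posterior Dirichlet parameters after observing multinomial counts."""
    return np.asarray(alpha, dtype=float) + np.asarray(counts, dtype=float)

def dirichlet_mean(alpha):
    """Mean of a Dirichlet distribution: alpha_i / sum(alpha)."""
    alpha = np.asarray(alpha, dtype=float)
    return alpha / alpha.sum()
```

With a prior of `[1, 1, 1]` and observed counts `[3, 0, 1]`, the posterior parameters are `[4, 1, 2]` and the posterior mean is `[4/7, 1/7, 2/7]`.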