Finding Scientific Topics

August 26 - 31, 2011
Topic Modeling
1. A document as a probabilistic mixture of topics.
2. A topic as a probability distribution over words.
3. Words are assumed known and the number of words is fixed.
If there are T topics, the probability of a word w is
P(w) = sum over j = 1, …, T of P(w | z = j) P(z = j).
Here, {w} denotes the words and {z} denotes the topics.
The conditional probability P ( w | z ) indicates which words are
important to a topic.
For a particular document, P(z), the distribution over topics, determines
how these topics are mixed together to form the document.
This is an example of “soft classification”: a document is not assigned
only one topic (a single class).
For a document, P(z) indicates which topics should be
assigned to it.
How should we think about or visualize P(z) and P(w | z)?
[Figure: labels “words” and “topics”]
What do we want to know? What do we want to compute from the input
data?
Inputs:
a. A document or a collection of documents, given as the list of
words that appear in it, {w1, …, wn} (repetition allowed,
perhaps with unimportant words deleted, such as the articles ‘the’, ‘an’
and the prepositions ‘on’, ‘of’).
b. The number of topics, T.
We want to know (compute) P(z), and P(w | z) for each topic z.
There is one P(z) (or D of them, where D is the number of documents) and T distributions P(w | z = j). What form
should P(z) and P(w | z) take?
Multinomial distributions ( http://en.wikipedia.org/wiki/Multinomial_distribution )
What is this? Each P ( z) is a non-negative vector with T components (with sum = 1).
Each P ( w | z ) is a non-negative vector with W components (with sum = 1).
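As a small illustration of these vectors, the sketch below stores P(z) as a list with T entries and P(w | z) as T lists with W entries each, then mixes them into P(w) = sum over j of P(w | z = j) P(z = j). The numbers and the helper names (`is_distribution`, `word_probabilities`) are made up for illustration and do not come from the slides.

```python
# Hypothetical toy numbers: T = 2 topics, W = 3 vocabulary words.
theta = [0.7, 0.3]                 # P(z): one entry per topic, sums to 1
phi = [
    [0.5, 0.4, 0.1],               # P(w | z = 0)
    [0.1, 0.2, 0.7],               # P(w | z = 1)
]

def is_distribution(p, tol=1e-9):
    """Check that p is a non-negative vector whose entries sum to 1."""
    return all(x >= 0 for x in p) and abs(sum(p) - 1.0) < tol

def word_probabilities(theta, phi):
    """Mix the topics: P(w) = sum_j P(w | z = j) * P(z = j)."""
    n_words = len(phi[0])
    return [sum(theta[j] * phi[j][w] for j in range(len(theta)))
            for w in range(n_words)]

p_w = word_probabilities(theta, phi)
print(p_w, is_distribution(p_w))
```

The mixture P(w) is itself a distribution over the W words, which is why `is_distribution(p_w)` holds.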
One possible solution
Question: how many variables are here?
Problem with this approach: local maxima and slow convergence
Bayesian Approach
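Estimate phi and theta indirectly via the following generative model:
theta ~ Dirichlet(alpha), phi(j) ~ Dirichlet(beta) for each topic j,
z_i | theta ~ Discrete(theta), w_i | z_i, phi ~ Discrete(phi(z_i)).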
Dirichlet Distribution (http://en.wikipedia.org/wiki/Dirichlet_distribution )
What does this generative model say?
It gives us the way that the observed data are thought to be generated.
Where is the prior?
Idea: use the generative model to explain the input data.
Alpha and beta are the hyperparameters.
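One way to read a generative model is to run it forward. The sketch below does this for a single document, assuming the usual setup for this kind of topic model: theta is drawn from a symmetric Dirichlet(alpha), each topic's word distribution from a symmetric Dirichlet(beta), then each word gets a topic from theta and a word from that topic. The function names, the seed, and the toy vocabulary are illustrative choices.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from Dirichlet(alphas) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, T, vocab, alpha, beta, rng):
    """Run the generative model forward for one document (assumed form)."""
    # Priors: theta ~ Dirichlet(alpha), phi[j] ~ Dirichlet(beta) for each topic.
    theta = sample_dirichlet([alpha] * T, rng)
    phi = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(T)]
    topics, words = [], []
    for _ in range(n_words):
        z = rng.choices(range(T), weights=theta)[0]   # z_i ~ Discrete(theta)
        w = rng.choices(vocab, weights=phi[z])[0]     # w_i ~ Discrete(phi[z_i])
        topics.append(z)
        words.append(w)
    return theta, phi, topics, words

rng = random.Random(0)  # arbitrary seed, only for repeatability
theta, phi, topics, words = generate_document(10, 3, ["A", "B", "C"], 1.0, 1.0, rng)
print(words, topics)
```

Inference goes the other way: only `words` is observed, and `topics`, `theta` and `phi` have to be recovered.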
The goal is to evaluate the posterior distribution P(z | w) = P(w, z) / P(w).
This is difficult because the denominator P(w), a sum over all possible topic assignments, cannot be computed.
(Know very well what the notations Z, W stand for.)
However, we do have a closed form for the numerator P(w, z) = P(w | z) P(z). For P(z):
P(z | theta) = P(z1, …, zW | theta) = P(z1 | theta) … P(zW | theta) (assuming
conditional independence of the zi given theta).
After integrating out theta, this gives Equation 3 (with D = 1, one document).
Equation 2 can be obtained similarly.
But what can we do with P(z | w)?
Recall that our goal is to estimate theta (the topic proportions) and phi (the topics).
Suppose we know the true topic assignments (z1, …, zW). Then theta can be estimated as
theta_i = (number of words assigned to topic i) / (total number of words, W).
How about phi?
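The counting estimate can be sketched as follows, using the ten-word list and topic assignment from the worked example in these notes. For phi, the sketch assumes the analogous count: the fraction of the words assigned to topic j that are the word w. That is the natural counterpart of the theta formula, but it is an assumption here rather than a formula quoted from the slides.

```python
from collections import Counter

words = list("AACBCACABB")              # word list from the worked example
z = [1, 1, 2, 2, 3, 1, 2, 3, 1, 1]      # treated here as the "true" assignment
T, vocab = 3, ["A", "B", "C"]

# theta_i = (number of words assigned to topic i) / W
topic_counts = Counter(z)
theta_hat = {j: topic_counts[j] / len(words) for j in range(1, T + 1)}

# Assumed analogue for phi: (times word w is assigned to topic j) / (words in topic j).
# This divides by zero if a topic has no words; every topic is used in this example.
pair_counts = Counter(zip(z, words))
phi_hat = {j: {w: pair_counts[(j, w)] / topic_counts[j] for w in vocab}
           for j in range(1, T + 1)}

print(theta_hat)
print(phi_hat)
```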
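More About Dirichlet Distribution
The Dirichlet distribution is defined on the standard simplex of vectors
theta = (theta_1, …, theta_T) with theta_i >= 0 and theta_1 + … + theta_T = 1 (a set of dimension T - 1).
The density function of the Dirichlet distribution with parameters alpha_1, …, alpha_T > 0 is
p(theta) = Gamma(alpha_1 + … + alpha_T) / (Gamma(alpha_1) … Gamma(alpha_T)) * theta_1^(alpha_1 - 1) … theta_T^(alpha_T - 1).
When alpha_i is close to zero, the probability concentrates near theta_i = 0.
On the other hand, when alpha_i is large, the probability
moves away from theta_i = 0.
An example with T = 3,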
The expected value and variance of each component theta_i are
given by the formulas
E[theta_i] = alpha_i / alpha_0,
Var[theta_i] = alpha_i (alpha_0 - alpha_i) / (alpha_0^2 (alpha_0 + 1)),
where alpha_0 = alpha_1 + … + alpha_T.
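To get a feel for the T = 3 case, the sketch below draws samples from a Dirichlet distribution with small and with large parameters and compares the empirical mean and variance of theta_1 with the standard closed forms E[theta_i] = alpha_i / alpha_0 and Var[theta_i] = alpha_i (alpha_0 - alpha_i) / (alpha_0^2 (alpha_0 + 1)), where alpha_0 is the sum of the alpha_i. The parameter values, the 0.05 threshold, the sample size and the seed are arbitrary illustrative choices.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one point on the simplex by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def dirichlet_mean_var(alphas):
    """Closed-form mean and variance of each component theta_i."""
    a0 = sum(alphas)
    mean = [a / a0 for a in alphas]
    var = [a * (a0 - a) / (a0 ** 2 * (a0 + 1)) for a in alphas]
    return mean, var

rng = random.Random(1)  # fixed seed, chosen arbitrarily
near_zero = {}
for name, alphas in (("small", [0.1, 0.1, 0.1]), ("large", [10.0, 10.0, 10.0])):
    samples = [sample_dirichlet(alphas, rng) for _ in range(20000)]
    first = [s[0] for s in samples]
    emp_mean = sum(first) / len(first)
    emp_var = sum((x - emp_mean) ** 2 for x in first) / len(first)
    # Share of samples whose first component lies close to zero.
    near_zero[name] = sum(x < 0.05 for x in first) / len(first)
    mean, var = dirichlet_mean_var(alphas)
    print(name, "mean", round(emp_mean, 3), "vs", round(mean[0], 3),
          "| var", round(emp_var, 4), "vs", round(var[0], 4),
          "| share with theta_1 < 0.05:", round(near_zero[name], 3))
```

With small parameters a large share of the samples has theta_1 near zero. With large parameters almost none do, although the mean is 1/3 in both cases.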
Of course, we do not know the true topic assignments (z1, …, zW), only their
distribution P(Z | W), so the counting estimate of theta is not directly available.
We need to know P(theta | Z, W).
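By Bayes' rule, we have
P(theta | Z, W) proportional to P(Z | theta) P(theta), where P(theta) is the Dirichlet prior.
Since P(Z | theta) is a product of powers of the theta_i, P(theta | Z, W) is again a Dirichlet distribution.
For a given Z, what should the estimated theta be?
This gives Equation 6 (and 7 similarly).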
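A minimal sketch of one possible answer, assuming a symmetric Dirichlet(alpha) prior and taking the posterior mean as the estimate (a common choice, though whether it matches the slide's equation exactly is an assumption). If n_j words are assigned to topic j, the posterior is Dirichlet(n_1 + alpha, …, n_T + alpha), and its mean is (n_j + alpha) / (W + T alpha). The function name is illustrative.

```python
from collections import Counter

def posterior_theta_mean(z, T, alpha):
    """Posterior mean of theta given assignments z under a Dirichlet(alpha) prior."""
    counts = Counter(z)                      # n_j: words assigned to topic j
    total = len(z) + T * alpha               # W + T * alpha
    return [(counts[j] + alpha) / total for j in range(1, T + 1)]

z = [1, 1, 2, 2, 3, 1, 2, 3, 1, 1]           # assignment from the worked example
theta_hat = posterior_theta_mean(z, T=3, alpha=1.0)
print(theta_hat)
```

Compared with the raw counting estimate, alpha acts like one extra pseudo-count per topic, so a topic with no assigned words still gets non-zero weight.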
The point, of course, is that we don’t know the exact topic assignment (z1, …, zW),
but only its distribution P(z | w).
For a probability distribution P(x), the expectation of a function f can be
estimated as
E[f(x)] ≈ (1/N) (f(y1) + … + f(yN)),
where y1, …, yN are samples drawn from P(x).
For example, we can use this formula to estimate the mean and
variance of a distribution from its samples.
More samples make the right-hand side a more accurate estimate.
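The sample-average estimate can be tried on a distribution whose answer is known. The sketch below uses a normal distribution with mean 2 and standard deviation 3 (an arbitrary choice for illustration) and estimates its mean and variance from N samples for increasing N.

```python
import random

def monte_carlo_expectation(f, samples):
    """Estimate E[f(x)] by the average of f over the samples."""
    return sum(f(y) for y in samples) / len(samples)

rng = random.Random(0)  # arbitrary seed, only for repeatability
estimates = {}
for n in (100, 10000, 100000):
    samples = [rng.gauss(2.0, 3.0) for _ in range(n)]   # true mean 2, variance 9
    mean = monte_carlo_expectation(lambda y: y, samples)
    var = monte_carlo_expectation(lambda y: (y - mean) ** 2, samples)
    estimates[n] = (mean, var)
    print(n, round(mean, 3), round(var, 3))
```

The estimates for larger N should typically sit closer to 2 and 9, although any single run can fluctuate.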
Because the posterior cannot be evaluated directly, we use Markov Chain Monte Carlo (MCMC) to simulate it.
This means that we draw samples from the distribution P(z | w);
each sample gives one estimate of theta and phi according to Equations 6 and 7.
Simulating P(z | w) using MCMC (Markov Chain Monte Carlo)
Much more on this later… The steps are
1. Initialize the topic assignments (z1, …, zW) at s = 0.
2. Do (say, three thousand iterations):
for each i = 1, …, W, change the current assignment for zi according
to the conditional probability P(zi | Z-i, W), where Z-i denotes all assignments except zi.
One cycle through all i gives a new topic assignment (z1, …, zW) at s = s + 1.
What does the formula say?
3. Generate samples.
Example: suppose there are 3 topics (T = 3), 3 words (A, B, C)
in the dictionary, and the word list has 10 words (W = 10). Take
alpha = beta = 1. The word list is {A, A, C, B, C, A, C, A, B, B}
with the initial topic assignment
{1, 1, 2, 2, 3, 1, 2, 3, 1, 1}
How do we apply the formula? The first word is A; in the word list, A has been
assigned to topics 1, 1, 1, and 3.
P(z1 | Z-1, W) = (0.5159, 0.1848, 0.3003)
Sampling z1 from this distribution gives, for example, z1 = 3, and the assignment becomes
{3, 1, 2, 2, 3, 1, 2, 3, 1, 1}
Next, compute P(z2 | Z-2, W) and sample a new z2 value.
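The sampling loop can be sketched on this example as follows. The update used here is the standard collapsed Gibbs conditional for this kind of model, written for one document: the weight of topic j is (count of the current word in topic j + beta) / (words in topic j + V beta), times (words in topic j + alpha) / (all other words + T alpha). Word i is left out of every count, and V = 3 is the dictionary size. Both conventions are assumptions, and it has not been confirmed that this is exactly the formula on the slide, so the probabilities it produces need not match the figures quoted above.

```python
import random

words = list("AACBCACABB")
z = [1, 1, 2, 2, 3, 1, 2, 3, 1, 1]          # initial topic assignment
T, vocab, alpha, beta = 3, ["A", "B", "C"], 1.0, 1.0

def conditional(i, z, words):
    """P(z_i = j | z_-i, w) for j = 1..T, with word i left out of all counts."""
    others = [(zk, wk) for k, (zk, wk) in enumerate(zip(z, words)) if k != i]
    weights = []
    for j in range(1, T + 1):
        n_j = sum(1 for zk, _ in others if zk == j)
        n_jw = sum(1 for zk, wk in others if zk == j and wk == words[i])
        word_term = (n_jw + beta) / (n_j + len(vocab) * beta)
        topic_term = (n_j + alpha) / (len(others) + T * alpha)
        weights.append(word_term * topic_term)
    total = sum(weights)
    return [w / total for w in weights]

def gibbs_sweep(z, words, rng):
    """One cycle through all i; returns the new topic assignment."""
    z = list(z)
    for i in range(len(words)):
        probs = conditional(i, z, words)
        z[i] = rng.choices(range(1, T + 1), weights=probs)[0]
    return z

rng = random.Random(0)                      # arbitrary seed
p_first = conditional(0, z, words)          # distribution for the first word
state = z
for _ in range(3000):                       # "say three thousand iterations"
    state = gibbs_sweep(state, words, rng)
print(p_first, state)
```

In practice one would keep several of the later states as samples and average the resulting estimates of theta and phi, rather than use only the final state.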
What are the effects of alpha and beta? (prior and large W)
Summary
The goal is to infer theta (with a Dirichlet prior) from the data, where theta is itself a distribution.
The Dirichlet distribution is a prior distribution on theta (a Bayesian approach).
It is therefore a distribution on the space of distributions. No particular form
of theta is assumed (nonparametric).
The base probability space is finite and discrete, X = {1, …, T}. Things become
much more complicated when X is no longer discrete, for example when X is the set of
real numbers.
That case needs more sophisticated mathematical language, which will be the goal for
the next two to three weeks.