TEXT MINING
STATISTICAL MODELING OF TEXTUAL DATA
LECTURE 3

Mattias Villani
Division of Statistics
Dept. of Computer and Information Science
Linköping University
OVERVIEW LECTURE 3

- Demo of document classification in the tm package in R (see the sketch below).
- Topic models (LDA).
- Demo of the topicmodels package in R.
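
A minimal sketch of such a classification workflow (not the lecture demo): tm builds a document-term matrix of word counts, which is then fed to a classifier. The toy texts, the labels, and the choice of 1-nearest-neighbour from the class package are illustrative assumptions.

    ## Sketch: tm builds the document-term matrix, a simple classifier is
    ## trained on the word counts (toy data, illustrative classifier choice).
    library(tm)
    library(class)

    texts  <- c("the gene encodes a protein", "dna sequencing of the gene",
                "posterior distribution of the parameters", "bayesian analysis of data")
    labels <- factor(c("bio", "bio", "stat", "stat"))

    corpus <- VCorpus(VectorSource(texts))
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeWords, stopwords("en"))

    dtm <- DocumentTermMatrix(corpus)       # documents x terms matrix of counts
    X   <- as.data.frame(as.matrix(dtm))    # dense feature matrix for the classifier

    pred <- knn(train = X, test = X, cl = labels, k = 1)   # 1-NN on the word counts
    table(pred, labels)                                    # here: fit on the training set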
TOPIC MODELS

- Models for unsupervised learning [No need for labelled data!], but more recently also for supervised learning.
- Probabilistic generative models.
- Very popular model in applications and research: > 8000 Google Scholar citations in 11 years.
- The basic topic models are extensions of the bag-of-words (unigram) model.
- Unigram model: each word is assumed to be drawn from the same word (term) distribution,

      P̂(w) = #w / N,

  where #w is the number of occurrences of word w and N is the total number of words (see the sketch after this list).
- Many extensions in recent years: n-grams, supervised, nonparametric, relational topics, correlated topics, dynamically time-varying topics.
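
A minimal sketch of the unigram estimate, with a made-up word vector for illustration:

    ## Sketch: the unigram estimate P_hat(w) = #w / N as relative term frequencies.
    words <- c("gene", "dna", "gene", "data", "gene", "probability")
    p_hat <- table(words) / length(words)   # count of each term divided by N
    p_hat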
MIXTURE OF UNIGRAMS

- Mixture of unigrams:
  1. Draw a topic z_d for the d-th document from a topic distribution θ = (θ_1, ..., θ_K).
  2. Conditional on the drawn topic z_d, draw the words from the word distribution for that topic.
- Topic models, in contrast, are mixed-membership models: each document can belong to several topics simultaneously.
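
A minimal sketch of the two-step mixture-of-unigrams process; the topic distribution and word distributions are made up for illustration.

    ## Mixture of unigrams: one topic per document, then all words from that topic.
    set.seed(1)
    vocab <- c("probability", "dna", "gene", "data", "distribution")
    theta <- c(0.4, 0.6)                               # topic distribution
    beta  <- rbind(c(0.5, 0.1, 0.0, 0.2, 0.2),         # word distribution, topic 1
                   c(0.0, 0.5, 0.4, 0.1, 0.0))         # word distribution, topic 2

    simulate_doc <- function(n_words) {
      z <- sample(1:2, size = 1, prob = theta)         # step 1: one topic for the document
      sample(vocab, n_words, replace = TRUE, prob = beta[z, ])  # step 2: draw the words
    }
    simulate_doc(5)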
MULTINOMIAL AND DIRICHLET DISTRIBUTIONS

- Multinomial distribution: random discrete variable X ∈ {1, 2, ..., K} that can assume exactly one of K (unordered) values.
  - Pr(X = k) = θ_k
  - Parameters θ = (θ_1, ..., θ_K).
- Dirichlet distribution: random vector X = (X_1, ..., X_K) satisfying the constraint X_1 + X_2 + ... + X_K = 1.
  - Unit simplex
  - Parameters: α = (α_1, ..., α_K)
  - Uniform distribution: α = (1, 1, ..., 1)
  - Small variance (informative) when the α's are large.
  - Bathtub shape when α_k < 1 for all k.
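
A minimal sketch of how α controls the shape, drawing from the Dirichlet by normalizing independent Gamma variates (base R has no built-in Dirichlet sampler; packages such as gtools or MCMCpack also provide rdirichlet):

    ## Draw one vector from Dirichlet(alpha) by normalizing Gamma(alpha_k, 1) draws.
    rdirichlet_one <- function(alpha) {
      g <- rgamma(length(alpha), shape = alpha, rate = 1)
      g / sum(g)                           # lands on the unit simplex
    }
    set.seed(1)
    rdirichlet_one(c(1, 1, 1))             # uniform over the simplex
    rdirichlet_one(c(50, 50, 50))          # large alphas: concentrated near (1/3, 1/3, 1/3)
    rdirichlet_one(c(0.1, 0.1, 0.1))       # alphas < 1: mass pushed to the corners ("bathtub")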
GENERATING A CORPUS FROM A TOPIC MODEL

- Assume that we have:
  - A fixed vocabulary V
  - D documents
  - N words in each document
  - K topics

1. For each topic (k = 1, ..., K):
   A. Draw a distribution over the words β_k ∼ Dir(η, η, ..., η)
2. For each document (d = 1, ..., D):
   A. Draw a vector of topic proportions θ_d ∼ Dir(α_1, ..., α_K)
   B. For each word (n = 1, ..., N):
      I.  Draw a topic assignment z_{d,n} ∼ Multinomial(θ_d)
      II. Draw a word w_{d,n} ∼ Multinomial(β_{z_{d,n}})
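
A minimal sketch of this generative process with toy sizes; all sizes and hyperparameters are made up, and Dirichlet draws use normalized Gamma variates as above.

    ## Simulate a tiny corpus from the topic model's generative process.
    set.seed(1)
    V <- 6; D <- 3; N <- 8; K <- 2                 # vocabulary size, docs, words/doc, topics
    eta <- 0.5; alpha <- rep(1, K)                 # Dirichlet hyperparameters
    vocab <- paste0("word", 1:V)

    rdirichlet_one <- function(a) { g <- rgamma(length(a), shape = a); g / sum(g) }

    beta <- t(replicate(K, rdirichlet_one(rep(eta, V))))     # 1. word distribution per topic
    docs <- vector("list", D)
    for (d in 1:D) {                                         # 2. for each document
      theta_d <- rdirichlet_one(alpha)                       #    A. topic proportions
      z <- sample(1:K, N, replace = TRUE, prob = theta_d)    #    B.I  topic per word
      docs[[d]] <- vapply(z, function(k)                     #    B.II word given its topic
        sample(vocab, 1, prob = beta[k, ]), character(1))
    }
    docs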
(HORRIBLE PICTURE OF A) TOPIC MODEL

[Figure: graphical illustration of the topic model.]
EXAMPLE - SIMULATION FROM TWO TOPICS

Word distributions for the two topics:

  Topic   Word distr.   probability   dna   gene   data   distribution
  1       β1            0.5           0.1   0.0    0.2    0.2
  2       β2            0.0           0.5   0.4    0.1    0.0

Simulated documents (first three words shown):

  Doc 1, θ1 = (0.2, 0.8):
    Word 1: Topic=2, Word='gene'
    Word 2: Topic=2, Word='gene'
    Word 3: Topic=1, Word='data'
    ...
  Doc 2, θ2 = (0.9, 0.1):
    Word 1: Topic=1, Word='probability'
    Word 2: Topic=1, Word='data'
    Word 3: Topic=1, Word='probability'
    ...
  Doc 3, θ3 = (0.5, 0.5):
    ...
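
A minimal sketch reproducing this kind of simulation with the table's fixed parameters; the random seed and the document length of three words are arbitrary choices.

    ## Re-simulate short documents from the two fixed topics above.
    set.seed(1)
    vocab  <- c("probability", "dna", "gene", "data", "distribution")
    beta   <- rbind(c(0.5, 0.1, 0.0, 0.2, 0.2),               # beta_1
                    c(0.0, 0.5, 0.4, 0.1, 0.0))               # beta_2
    thetas <- rbind(c(0.2, 0.8), c(0.9, 0.1), c(0.5, 0.5))    # theta_1, theta_2, theta_3

    for (d in 1:3) {
      z <- sample(1:2, size = 3, replace = TRUE, prob = thetas[d, ])   # topic per word
      w <- vapply(z, function(k) sample(vocab, 1, prob = beta[k, ]), character(1))
      cat("Doc", d, ":", paste0(w, " (topic ", z, ")"), "\n")
    }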
LEARNING / INFERENCE IN TOPIC MODELS

- What do we know?
  - The words in the documents: w_{1:D}.
- What do we not know?
  - Topic proportions for each document: θ_{1:D}
  - Topic assignments for each word in each document: z_{1:D}
  - Word distributions for each topic: β_{1:K}
- Do the Bayes dance: the posterior distribution

      p(θ_{1:D}, z_{1:D}, β_{1:K} | w_{1:D})

- The posterior is mathematically intractable. Solutions:
  - Gibbs sampling (MCMC) [correct, but can be slow]
  - Variational Bayes [crude approximation of the posterior distribution, but typically rather accurate about the posterior mode (MAP)]
- The inferred θ_{1:D} can be used as features in supervised classification (see the sketch below).
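
A minimal sketch of inference with the topicmodels package (not the lecture demo); it assumes a document-term matrix dtm built with tm as in the earlier sketch, and the number of topics and control settings are illustrative.

    ## Fit LDA by collapsed Gibbs sampling and by variational Bayes (VEM).
    library(topicmodels)

    K <- 2
    lda_gibbs <- LDA(dtm, k = K, method = "Gibbs",
                     control = list(iter = 2000, burnin = 500, seed = 1))
    lda_vem   <- LDA(dtm, k = K, method = "VEM")   # variational Bayes

    terms(lda_gibbs, 5)                        # top 5 words per topic (the beta's)
    theta_hat <- posterior(lda_gibbs)$topics   # D x K matrix of estimated topic proportions
    theta_hat                                  # usable as features in a supervised classifier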
[Figure from Blei (2012), Probabilistic Topic Models, Communications of the ACM.]