Text Mining and Topic Modeling
Padhraic Smyth
Department of Computer Science
University of California, Irvine
Progress Report
• New deadline
  • In class, Thursday February 18th (not Tuesday)
• Outline
  • 3 to 5 pages maximum
• Suggested content
  • Brief restatement of the problem you are addressing (no need to repeat everything in your original proposal), e.g., ½ a page
  • Summary of progress so far
    • Background papers read
    • Data preprocessing, exploratory data analysis
    • Algorithms/software reviewed, implemented, tested
    • Initial results (if any)
  • Challenges and difficulties encountered
  • Brief outline of plans between now and end of quarter
  • Use diagrams, figures, tables where possible
  • Write clearly: check what you write
Data Mining Lectures: Text Mining and Topic Models, © Padhraic Smyth, UC Irvine
Road Map
• Topics covered
  • Data
  • Exploratory data analysis and visualization
  • Regression
  • Classification
  • Text classification
• Yet to come…
  • Unsupervised learning with text (topic models)
  • Social networks
  • Recommender systems (including Netflix)
  • Mining of Web data
Text Mining
• Document classification
• Information extraction
  • Named-entity extraction: recognize names of people, places, genes, etc.
• Document summarization
  • Google News, Newsblaster (http://www1.cs.columbia.edu/nlp/newsblaster/)
• Document clustering
• Topic modeling
  • Representing documents as mixtures of topics
• And many more…
Named-Entity Extraction
• Often a combination of
  • Knowledge-based approach (rules, parsers)
  • Machine learning (e.g., hidden Markov model)
  • Dictionary
• Non-trivial, since entity names can be confused with real names
  • E.g., gene name ABS and abbreviation ABS
• Also can look for co-references
  • E.g., "IBM today… Later, the company announced…"
• Useful as a preprocessing step for data mining, e.g., use entity names to train a classifier to predict the category of an article
Example: GATE/ANNIE extractor
• GATE: free software infrastructure for text analysis (University of Sheffield, UK)
• ANNIE: widely used entity recognizer, part of GATE
  http://www.gate.ac.uk/annie/
Information Extraction

From Seymore, McCallum, Rosenfeld, Learning Hidden Markov Model Structure for Information Extraction, AAAI 1999
Topic Models
• Background on graphical models
• Unsupervised learning from text documents
  • Motivation
  • Topic model and learning algorithm
  • Results
• Extensions
  • Topics over time, author-topic models, etc.
Pennsylvania Gazette
1728-1800
80,000 articles
25 million words
www.accessible.com
Enron email data
250,000 emails
28,000 authors
1999-2002
Other Examples of Large Corpora
• CiteSeer digital collection: 700,000 papers, 700,000 authors, 1986-2005
• MEDLINE collection: 16 million abstracts in medicine/biology
• US Patent collection
• and many more…
Unsupervised Learning from Text
• Large collections of unlabeled documents
  • Web
  • Digital libraries
  • Email archives, etc.
• Often wish to organize/summarize/index/tag these documents automatically
• We will look at probabilistic techniques for clustering and topic extraction from sets of documents
Problems of Interest
• What topics do these documents "span"?
• Which documents are about a particular topic?
• How have topics changed over time?
• What does author X write about?
• Who is likely to write about topic Y?
• Who wrote this specific document?
• and so on…
Review Slides on Graphical Models
Multinomial Models for Documents
• Example: 50,000 possible words in our vocabulary
• Simple memoryless model
  • a 50,000-sided die; a non-uniform die: each side/word has its own probability
  • to generate N words we toss the die N times
• This is a simple probability model:

    p(document | φ) = ∏_i p(word_i | φ)

• To "learn" the model we just count frequencies
  • p(word i) = number of occurrences of word i / total number of words
• Typically interested in conditional multinomials, e.g.,
  • p(words | spam) versus p(words | non-spam)
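The counting step above fits in a few lines of Python; the toy spam documents below are made up for illustration:

```python
from collections import Counter

def learn_multinomial(docs):
    """Maximum-likelihood multinomial: p(word i) = count(word i) / total count."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# A conditional multinomial such as p(words | spam) is the same estimate,
# fit only on the spam documents.
spam_docs = [["free", "money", "money"], ["free", "loan"]]
phi_spam = learn_multinomial(spam_docs)
# phi_spam == {"free": 0.4, "money": 0.4, "loan": 0.2}
```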
Real examples of Word Multinomials: P(w | z)

  TOPIC 209                     TOPIC 289
  WORD            PROB.         WORD            PROB.
  PROBABILISTIC   0.0778        RETRIEVAL       0.1179
  BAYESIAN        0.0671        TEXT            0.0853
  PROBABILITY     0.0532        DOCUMENTS       0.0527
  CARLO           0.0309        INFORMATION     0.0504
  MONTE           0.0308        DOCUMENT        0.0441
  DISTRIBUTION    0.0257        CONTENT         0.0242
  INFERENCE       0.0253        INDEXING        0.0205
  PROBABILITIES   0.0253        RELEVANCE       0.0159
  CONDITIONAL     0.0229        COLLECTION      0.0146
  PRIOR           0.0219        RELEVANT        0.0136
  ...             ...           ...             ...
A Graphical Model for Multinomials

[Figure: graphical model with node φ (the "parameter vector": a set of probabilities, one per word) pointing to word nodes w1, w2, …, wn]

  p(doc | φ) = ∏_i p(w_i | φ)
Another view....

[Figure: plate notation — node φ outside a plate; node w_i inside a plate labeled i = 1:n]

  p(doc | φ) = ∏_i p(w_i | φ)

• This is "plate notation": items inside the plate are conditionally independent given the variable outside the plate
• There are "n" conditionally independent replicates represented by the plate
Being Bayesian....

[Figure: α → φ → w_i, plate i = 1:n]

• α is a prior on our multinomial parameters, e.g., a simple Dirichlet smoothing prior with symmetric parameter α, to avoid estimates of probabilities that are 0
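The effect of the symmetric Dirichlet prior is easy to see in the posterior-mean estimate, which replaces raw counts with smoothed counts; a minimal sketch with made-up counts:

```python
def smoothed_prob(counts, vocab_size, alpha=1.0):
    """Posterior-mean estimate under a symmetric Dirichlet(alpha) prior:
    p(w) = (n_w + alpha) / (N + V*alpha), which is never exactly zero."""
    total = sum(counts.values())
    denom = total + vocab_size * alpha
    return lambda w: (counts.get(w, 0) + alpha) / denom

p = smoothed_prob({"bank": 3, "money": 1}, vocab_size=4)
# p("bank") = (3+1)/(4+4) = 0.5; the unseen word "loan" gets 1/8, not 0
```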
Being Bayesian....

[Figure: α → φ → w_i, plate i = 1:n]

Learning: infer p(φ | words, α), proportional to p(words | φ) p(φ | α)
Multiple Documents

[Figure: α → φ → w_i, inner plate 1:n nested in an outer document plate 1:D]

  p(corpus | φ) = ∏_d p(doc_d | φ)
Different Document Types

[Figure: α → φ → w_i, plate 1:n]

p(w | φ) is a multinomial over words

Different Document Types

[Figure: same model, now with the outer document plate 1:D added]

p(w | φ) is a multinomial over words
Different Document Types

[Figure: α → z_d; z_d and φ → w_i; plates 1:n and 1:D]

p(w | φ, z_d) is a multinomial over words; z_d is the "label" for each doc

Different Document Types

[Figure: same model]

• Different multinomials, depending on the value of z_d (discrete)
• φ now represents |z| different multinomials
Unknown Document Types

[Figure: priors α and β; unobserved z_d and φ → w_i; plates 1:n and 1:D]

Now the values of z for each document are unknown - hopeless?
Unknown Document Types

[Figure: same model]

• Not hopeless :) We can learn about both z and θ, e.g., with the EM algorithm
• This gives probabilistic clustering
• p(w | z = k, φ) is the kth multinomial over words
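A minimal EM sketch for this mixture-of-multinomials clustering; the toy documents, the seeding-from-documents initialization, and the +1 smoothing are illustrative choices, not from the slides:

```python
import math

def em_mixture_of_multinomials(docs, K, W, n_iters=50):
    """EM for a mixture of multinomials (probabilistic document clustering).
    docs: list of word-count dicts {word_id: count}; K clusters; W vocab size.
    Returns (pi, phi, resp): mixture weights, word probs, soft assignments."""
    # crude deterministic initialization: seed each cluster from one document
    phi = []
    for k in range(K):
        seed = docs[k % len(docs)]
        total = sum(seed.values()) + W
        phi.append([(seed.get(w, 0) + 1.0) / total for w in range(W)])
    pi = [1.0 / K] * K
    resp = []
    for _ in range(n_iters):
        # E-step: resp[d][k] proportional to pi_k * prod_w phi_k[w]^count (log space)
        resp = []
        for doc in docs:
            logp = [math.log(pi[k]) + sum(c * math.log(phi[k][w])
                    for w, c in doc.items()) for k in range(K)]
            m = max(logp)
            e = [math.exp(l - m) for l in logp]
            s = sum(e)
            resp.append([x / s for x in e])
        # M-step: re-estimate pi and phi from soft counts (with +1 smoothing)
        pi = [sum(r[k] for r in resp) / len(docs) for k in range(K)]
        for k in range(K):
            counts = [1.0] * W
            for doc, r in zip(docs, resp):
                for w, c in doc.items():
                    counts[w] += r[k] * c
            total = sum(counts)
            phi[k] = [c / total for c in counts]
    return pi, phi, resp

# vocabulary: 0=money, 1=loan, 2=bank, 3=river, 4=stream
docs = [{0: 3, 1: 2, 2: 2}, {3: 3, 4: 2, 2: 2},
        {0: 2, 1: 3, 2: 1}, {3: 2, 4: 3, 2: 2}]
pi, phi, resp = em_mixture_of_multinomials(docs, K=2, W=5)
```

After a few iterations, the two "money" documents end up softly assigned to one cluster and the two "river" documents to the other.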
Topic Model

[Figure: α → θ_d → z_i → w_i ← φ ← β; plates 1:n and 1:D]

• z_i is a "label" for each word
• θ: p(z_i | θ_d) = distribution over topics that is document-specific
• φ: p(w | φ, z_i = k) = multinomial over words = a "topic"
Example of generating words

[Figure: two topics φ (topic 1: MONEY, LOAN, BANK; topic 2: RIVER, STREAM, BANK) and several documents generated from document-specific mixtures θ. A document with θ = (1.0, 0) draws every word from topic 1, e.g. "MONEY1 BANK1 BANK1 LOAN1 BANK1 MONEY1 BANK1 …"; a document with θ = (.6, .4) mixes words from both topics, e.g. "RIVER2 MONEY1 BANK1 STREAM2 BANK2 BANK1 …". The superscript on each word is its topic assignment z.]
Learning

[Figure: the same documents, but now every topic assignment is a "?". Given only the observed words ("MONEY BANK BANK LOAN BANK MONEY BANK", "RIVER MONEY BANK STREAM BANK BANK", …), learning must recover the topics φ, the mixtures θ, and the per-word assignments z.]
Key Features of Topic Models
• Model allows a document to be composed of multiple topics
  • More powerful than 1 doc -> 1 cluster
• Completely unsupervised
  • Topics learned directly from data
  • Leverages strong dependencies at word level
• Learning algorithm
  • Gibbs sampling is the method of choice
• Scalable
  • Linear in number of word tokens
  • Can be run on millions of documents
Document generation as a probabilistic process
• Each topic is a distribution over words
• Each document is a mixture of topics
• Each word is chosen from a single topic

  P(w_i) = Σ_{j=1}^{T} P(w_i | z_i = j) P(z_i = j)

where P(w_i | z_i = j) comes from the topic parameters φ^(j), and P(z_i = j) comes from the document-specific mixture parameters θ^(d).
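The generative process above can be sketched directly; the toy topics and probabilities below echo the earlier MONEY/RIVER example but are made up:

```python
import random

def generate_doc(theta, phi, n_words, seed=0):
    """Generate one document from the topic model: for each word,
    draw a topic z ~ theta, then a word w ~ phi[z]."""
    rng = random.Random(seed)
    doc = []
    for _ in range(n_words):
        z = rng.choices(range(len(theta)), weights=theta)[0]
        words, probs = zip(*phi[z].items())
        doc.append((rng.choices(words, weights=probs)[0], z))
    return doc

phi = [{"money": 0.4, "loan": 0.2, "bank": 0.4},    # topic 0: finance
       {"river": 0.4, "stream": 0.2, "bank": 0.4}]  # topic 1: rivers
doc = generate_doc(theta=[0.6, 0.4], phi=phi, n_words=8)
```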
Learning the Model
• Three sets of latent variables we can learn:
  • topic-word distributions φ
  • document-topic distributions θ
  • topic assignments for each word z
• Options:
  • EM algorithm to find point estimates of φ and θ
    • e.g., Chien and Wu, IEEE Trans ASLP, 2008
  • Gibbs sampling
    • Find p(φ | data), p(θ | data), p(z | data)
    • Can be slow to converge
  • Collapsed Gibbs sampling
    • Most widely used method

[See also Asuncion, Welling, Smyth, Teh, UAI 2009 for additional discussion]
Gibbs Sampling
• Say we have 3 parameters x, y, z, and some data
• Bayesian learning:
  • We want to compute p(x, y, z | data)
  • But frequently it is impossible to compute this exactly
  • However, often we can compute conditionals for individual variables, e.g., p(x | y, z, data)
  • Not clear how this is useful yet, since it assumes y and z are known (i.e., we condition on them)
Gibbs Sampling 2
• Example of Gibbs sampling:
  • Initialize with x', y', z' (e.g., randomly)
  • Iterate:
    • Sample new x' ~ P(x | y', z', data)
    • Sample new y' ~ P(y | x', z', data)
    • Sample new z' ~ P(z | x', y', data)
  • Continue for some (large) number of iterations
  • Each iteration consists of a sweep through the hidden variables or parameters (here, x, y, and z)
• Gibbs = a Markov chain Monte Carlo (MCMC) method
• In the limit, the samples x', y', z' will be samples from the true joint distribution P(x, y, z | data), which gives us an empirical estimate of P(x, y, z | data)
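A two-variable version of this sweep is easy to run end to end. For a standard bivariate normal with correlation ρ, each full conditional is itself a one-dimensional normal, so the sampler is a few lines (a sketch, not from the slides):

```python
import math, random

def gibbs_bivariate_normal(rho, n_iters, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho.
    Each full conditional is normal:
        x | y ~ N(rho*y, 1 - rho^2),   y | x ~ N(rho*x, 1 - rho^2)."""
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho * rho)
    x = y = 0.0                      # arbitrary starting point
    samples = []
    for _ in range(n_iters):
        x = rng.gauss(rho * y, sd)   # sample new x' ~ P(x | y')
        y = rng.gauss(rho * x, sd)   # sample new y' ~ P(y | x')
        samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(rho=0.9, n_iters=5000)
```

After discarding some burn-in, the empirical correlation of the samples approaches 0.9.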
Example of Gibbs Sampling in 2d
From online MCMC tutorial notes by Frank Dellaert, Georgia Tech
Computation
• Convergence
  • In the limit, samples x', y', z' are from P(x, y, z | data)
  • How many iterations are needed?
    • Cannot be computed ahead of time
    • Early iterations are discarded ("burn-in")
    • Typically monitor some quantities of interest to assess convergence
    • Convergence in Gibbs/MCMC is a tricky issue!
• Complexity per iteration
  • Linear in number of hidden variables and parameters
  • Times the complexity of generating a sample each time
Gibbs Sampling for the Topic Model
• Recall: 3 sets of latent variables we can learn
  • topic-word distributions φ
  • document-topic distributions θ
  • topic assignments for each word z
• Gibbs sampling algorithm
  • Initialize all the z's randomly to a topic: z1, …, zN
  • Iteration:
    • For i = 1, …, N: sample zi ~ p(zi | all other z's, data)
  • Continue for a fixed number of iterations or until convergence
• Note that this is collapsed Gibbs sampling
  • Sample from p(z1, …, zN | data), "collapsing" over φ and θ
Topic Model

[Figure: θ_d → z_i → w_i ← φ; plates 1:n and 1:D]
Sampling each Topic-Word Assignment

  p(z_i = t | z_{-i}) ∝ [ (n_{td}^{-i} + α) / (Σ_{t'} n_{t'd}^{-i} + Tα) ] × [ (n_{wt}^{-i} + β) / (Σ_{w'} n_{w't}^{-i} + Wβ) ]

where n_{td}^{-i} is the count of topic t assigned to doc d and n_{wt}^{-i} is the count of word w assigned to topic t, both excluding the current assignment of word i; the result is the probability that word i is assigned to topic t.
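The update above drops straight into code. A toy collapsed Gibbs sampler (a sketch, with no burn-in handling or hyperparameter tuning; word ids and documents are hypothetical):

```python
import random

def collapsed_gibbs_lda(docs, T, alpha=0.1, beta=0.01, n_iters=100, seed=0):
    """Collapsed Gibbs sampler for the topic model.
    docs: list of documents, each a list of word ids in 0..W-1."""
    rng = random.Random(seed)
    W = max(w for doc in docs for w in doc) + 1
    ntd = [[0] * T for _ in docs]      # n_td: topic counts per document
    nwt = [[0] * T for _ in range(W)]  # n_wt: topic counts per word
    nt = [0] * T                       # total tokens assigned to each topic
    z = [[0] * len(doc) for doc in docs]
    for d, doc in enumerate(docs):     # random initialization of all z's
        for i, w in enumerate(doc):
            t = rng.randrange(T)
            z[d][i] = t
            ntd[d][t] += 1; nwt[w][t] += 1; nt[t] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]            # remove word i's current assignment
                ntd[d][t] -= 1; nwt[w][t] -= 1; nt[t] -= 1
                # p(z_i = t | z_-i) ~ (n_td + alpha)(n_wt + beta)/(n_t + W*beta);
                # the per-document denominator is constant in t, so it drops out
                p = [(ntd[d][k] + alpha) * (nwt[w][k] + beta) / (nt[k] + W * beta)
                     for k in range(T)]
                t = rng.choices(range(T), weights=p)[0]
                z[d][i] = t            # add the new assignment back
                ntd[d][t] += 1; nwt[w][t] += 1; nt[t] += 1
    return z, ntd, nwt
```

The returned count matrices ntd and nwt are exactly the sparse structures the later Complexity slide suggests storing; smoothed estimates of θ and φ follow by normalizing them with α and β.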
Convergence Example
(from Newman et al, JMLR, 2009)
Complexity
• Time
  • O(N T) per iteration, where N is the number of "tokens" and T the number of topics
  • For fast sampling, see "Fast-LDA", Porteous et al., ACM SIGKDD, 2008; also Yao, Mimno, McCallum, ACM SIGKDD 2009
  • For distributed algorithms, see Newman et al., Journal of Machine Learning Research, 2009, e.g., T = 1000, N = 100 million
• Space
  • O(D T + T W + N), where D is the number of documents and W is the number of unique words (size of vocabulary)
  • Can reduce the size by using sparse matrices
    • Store non-zero counts for doc-topic and topic-word
    • Only apply smoothing at prediction time
16 Artificial Documents

[Figure: a bag-of-words count matrix for 16 artificial documents over the vocabulary {River, Stream, Bank, Money, Loan}.]

Can we recover the original topics and topic mixtures from this data?
Starting the Gibbs Sampling
• Assign word tokens randomly to topics (● = topic 1; ● = topic 2)

[Figure: the 16-document matrix with each token colored by its random initial topic assignment.]
After 1 iteration

[Figure: token-topic assignments for the 16 documents after one Gibbs sweep.]
After 4 iterations

[Figure: token-topic assignments after four Gibbs sweeps; the two topics are beginning to separate.]
After 32 iterations

[Figure: the assignments have settled into two coherent topics:]

  topic 1: stream .40, bank .35, river .25
  topic 2: bank .39, money .32, loan .29
Software for Topic Modeling
• Mark Steyvers's public-domain MATLAB toolbox for topic modeling on the Web:
  psiexp.ss.uci.edu/research/programs_data/toolbox.htm
History of topic models
• Origins in statistics:
  • latent class models in social science
  • admixture models in statistical genetics
• Applications in computer science:
  • Hofmann, SIGIR, 1999
  • Blei, Ng, Jordan, JMLR 2003 (known as "LDA")
  • Griffiths and Steyvers, PNAS, 2004
• More recent work:
  • author-topic models: Steyvers et al., Rosen-Zvi et al., 2004
  • hierarchical topics: McCallum et al., 2006
  • correlated topic models: Blei and Lafferty, 2005
  • Dirichlet process models: Teh, Jordan, et al.
  • large-scale web applications: Buntine et al., 2004, 2005
  • undirected models: Welling et al., 2004
Topic = probability distribution over words

  TOPIC 209                     TOPIC 289
  WORD            PROB.         WORD            PROB.
  PROBABILISTIC   0.0778        RETRIEVAL       0.1179
  BAYESIAN        0.0671        TEXT            0.0853
  PROBABILITY     0.0532        DOCUMENTS       0.0527
  CARLO           0.0309        INFORMATION     0.0504
  MONTE           0.0308        DOCUMENT        0.0441
  DISTRIBUTION    0.0257        CONTENT         0.0242
  INFERENCE       0.0253        INDEXING        0.0205
  PROBABILITIES   0.0253        RELEVANCE       0.0159
  CONDITIONAL     0.0229        COLLECTION      0.0146
  PRIOR           0.0219        RELEVANT        0.0136
  ...             ...           ...             ...

P(w | z)

Important point: these distributions are learned in a completely automated, "unsupervised" fashion from the data.
Four example topics from NIPS

[Table: four topics (19, 24, 29, 87), each shown as its top words with probabilities and its top authors with probabilities: a mixture-model/EM topic (LIKELIHOOD 0.0539, MIXTURE 0.0509, EM 0.0470, DENSITY 0.0398, GAUSSIAN 0.0349, ESTIMATION, LOG, MAXIMUM, …; authors Tresp_V 0.0333, Singer_Y, Jebara_T, Ghahramani_Z, Ueda_N, Jordan_M, Roweis_S, …), a handwritten-character recognition topic (RECOGNITION 0.0400, CHARACTER 0.0336, CHARACTERS 0.0250, TANGENT, HANDWRITTEN 0.0169, DIGITS, IMAGE, DISTANCE, DIGIT 0.0149, HAND 0.0126, …; authors Simard_P 0.0694, Martin_G, LeCun_Y, Denker_J, Henderson_D, Revow_M, Platt_J, Keeler_J, …), a reinforcement-learning topic (POLICY, ACTION 0.0332, REINFORCEMENT 0.0411, OPTIMAL 0.0208, ACTIONS 0.0208, REWARD 0.0165, SUTTON 0.0164, AGENT 0.0136, DECISION 0.0118, …; authors Singh_S 0.1412, Barto_A, Sutton_R, Dayan_P, Parr_R, Dietterich_T, Tsitsiklis_J, …), and a kernel/SVM topic (KERNEL 0.0683, SUPPORT 0.0377, VECTOR 0.0257, KERNELS 0.0217, SET 0.0205, SVM 0.0204, SPACE 0.0188, MACHINES 0.0168, REGRESSION 0.0155, MARGIN 0.0151, …; authors Smola_A 0.1033, Scholkopf_B 0.0730, Burges_C, Vapnik_V, Chapelle_O, Cristianini_N, Ratsch_G, Laskov_P, Tipping_M, Sollich_P, …).]
Topics from New York Times

Terrorism: SEPT_11, WAR, SECURITY, IRAQ, TERRORISM, NATION, KILLED, AFGHANISTAN, ATTACKS, OSAMA_BIN_LADEN, AMERICAN, ATTACK, NEW_YORK_REGION, NEW, MILITARY, NEW_YORK, WORLD, NATIONAL, QAEDA, TERRORIST_ATTACKS

Wall Street Firms: WALL_STREET, ANALYSTS, INVESTORS, FIRM, GOLDMAN_SACHS, FIRMS, INVESTMENT, MERRILL_LYNCH, COMPANIES, SECURITIES, RESEARCH, STOCK, BUSINESS, ANALYST, WALL_STREET_FIRMS, SALOMON_SMITH_BARNEY, CLIENTS, INVESTMENT_BANKING, INVESTMENT_BANKERS, INVESTMENT_BANKS

Stock Market: WEEK, DOW_JONES, POINTS, 10_YR_TREASURY_YIELD, PERCENT, CLOSE, NASDAQ_COMPOSITE, STANDARD_POOR, CHANGE, FRIDAY, DOW_INDUSTRIALS, GRAPH_TRACKS, EXPECTED, BILLION, NASDAQ_COMPOSITE_INDEX, EST_02, PHOTO_YESTERDAY, YEN, 10, 500_STOCK_INDEX

Bankruptcy: BANKRUPTCY, CREDITORS, BANKRUPTCY_PROTECTION, ASSETS, COMPANY, FILED, BANKRUPTCY_FILING, ENRON, BANKRUPTCY_COURT, KMART, CHAPTER_11, FILING, COOPER, BILLIONS, COMPANIES, BANKRUPTCY_PROCEEDINGS, DEBTS, RESTRUCTURING, CASE, GROUP
Comparing Topics and Other Approaches
• Clustering documents
  • Computationally simpler…
  • But a less accurate and less flexible model
• LSI/LSA/SVD
  • Linear projection of V-dim word vectors into lower dimensions
  • Less interpretable
  • Not generalizable
    • E.g., to authors or other side-information
  • Not as accurate
    • E.g., precision-recall: Hofmann, Blei et al., Buntine, etc.
• Probabilistic models such as topic models
  • "next-generation" text modeling, after LSI
  • provide a modular, extensible framework
Clusters v. Topics

Hidden Markov Models in Molecular Biology: New Algorithms and Applications
Pierre Baldi, Yves Chauvin, Tim Hunkapiller, Marcella A. McClure

Hidden Markov Models (HMMs) can be applied to several important problems in molecular biology. We introduce a new convergent learning algorithm for HMMs that, unlike the classical Baum-Welch algorithm, is smooth and can be applied on-line or in batch mode, with or without the usual Viterbi most likely path approximation. Left-right HMMs with insertion and deletion states are then trained to represent several protein families including immunoglobulins and kinases. In all cases, the models derived capture all the important statistical properties of the families and can be used efficiently in a number of important tasks such as multiple alignment, motif detection, and classification.
Clusters v. Topics
One Cluster

[Same Baldi et al. HMM abstract as above, now shown next to the single cluster the document is assigned to:]

cluster 88: model, data, models, time, neural, figure, state, learning, set, parameters, network, probability, number, networks, training, function, system, algorithm, hidden, markov
Clusters v. Topics
One Cluster vs. Multiple Topics

[Same abstract again, now shown next to both its single cluster and its multiple topics:]

cluster 88: model, data, models, time, neural, figure, state, learning, set, parameters, network, probability, number, networks, training, function, system, algorithm, hidden, markov

topic 10: state, hmm, markov, sequence, models, hidden, states, probabilities, sequences, parameters, transition, probability, training, hmms, hybrid, model, likelihood, modeling

topic 37: genetic, structure, chain, protein, population, region, algorithms, human, mouse, selection, fitness, proteins, search, evolution, generation, function, sequence, sequences, genes
Examples of Topics learned from Proceedings of the National Academy of Sciences
Griffiths and Steyvers, PNAS, 2004

[Six example topics, each shown as its 20 most probable words:]

FORCE, SURFACE, MOLECULES, SOLUTION, SURFACES, MICROSCOPY, WATER, FORCES, PARTICLES, STRENGTH, POLYMER, IONIC, ATOMIC, AQUEOUS, MOLECULAR, PROPERTIES, LIQUID, SOLUTIONS, BEADS, MECHANICAL

HIV, VIRUS, INFECTED, IMMUNODEFICIENCY, CD4, INFECTION, HUMAN, VIRAL, TAT, GP120, REPLICATION, TYPE, ENVELOPE, AIDS, REV, BLOOD, CCR5, INDIVIDUALS, ENV, PERIPHERAL

MUSCLE, CARDIAC, HEART, SKELETAL, MYOCYTES, VENTRICULAR, MUSCLES, SMOOTH, HYPERTROPHY, DYSTROPHIN, HEARTS, CONTRACTION, FIBERS, FUNCTION, TISSUE, RAT, MYOCARDIAL, ISOLATED, MYOD, FAILURE

STRUCTURE, ANGSTROM, CRYSTAL, RESIDUES, STRUCTURES, STRUCTURAL, RESOLUTION, HELIX, THREE, HELICES, DETERMINED, RAY, CONFORMATION, HELICAL, HYDROPHOBIC, SIDE, DIMENSIONAL, INTERACTIONS, MOLECULE, SURFACE

NEURONS, BRAIN, CORTEX, CORTICAL, OLFACTORY, NUCLEUS, NEURONAL, LAYER, RAT, NUCLEI, CEREBELLUM, CEREBELLAR, LATERAL, CEREBRAL, LAYERS, GRANULE, LABELED, HIPPOCAMPUS, AREAS, THALAMIC

TUMOR, CANCER, TUMORS, HUMAN, CELLS, BREAST, MELANOMA, GROWTH, CARCINOMA, PROSTATE, NORMAL, CELL, METASTATIC, MALIGNANT, LUNG, CANCERS, MICE, NUDE, PRIMARY, OVARIAN
Examples of PNAS topics

[Table: additional PNAS topics, each a column of top words: a genome-mapping topic (CHROMOSOME, REGION, CHROMOSOMES, KB, MAP, MAPPING, CHROMOSOMAL, HYBRIDIZATION, LOCUS, GENOMIC, DNA, GENOME, GENE, CLONES, …), a development topic (ADULT, DEVELOPMENT, FETAL, DAY, DEVELOPMENTAL, POSTNATAL, EARLY, NEONATAL, EMBRYONIC, BIRTH, NEWBORN, MATERNAL, NEUROGENESIS, …), a reproduction topic (MALE, FEMALE, MALES, FEMALES, SEX, SEXUAL, BEHAVIOR, OFFSPRING, REPRODUCTIVE, MATING, REPRODUCTION, FERTILITY, TESTIS, MATE, GERM, SRY, …), a parasitology topic (PARASITE, PARASITES, FALCIPARUM, MALARIA, HOST, PLASMODIUM, ERYTHROCYTES, LEISHMANIA, INFECTED, BLOOD, INFECTION, MOSQUITO, INVASION, TRYPANOSOMA, CRUZI, BRUCEI, HOSTS, …), and several topics of generic scientific vocabulary (MODEL, MODELS, EXPERIMENTAL, BASED, DYNAMICS, PREDICTED, THEORETICAL, THEORY, COMPUTER, QUANTITATIVE, PREDICTIONS, PARAMETERS, …; STUDIES, PREVIOUS, SHOWN, RESULTS, RECENT, PRESENT, DEMONSTRATED, SUGGEST, REPORT, INDICATED, CONSISTENT, …; MECHANISM, MECHANISMS, UNDERSTOOD, POORLY, ACTION, UNKNOWN, REMAIN, UNDERLYING, MOLECULAR, REMAINS, UNCLEAR, …).]
What can Topic Models be used for?
• Queries
  • Who writes on this topic?
    • e.g., finding experts or reviewers in a particular area
  • What topics does this person do research on?
• Comparing groups of authors or documents
• Discovering trends over time
• Detecting unusual papers and authors
• Interactive browsing of a digital library via topics
• Parsing documents (and parts of documents) by topic
• and more…
What is this paper about?

Empirical Bayes screening for multi-item associations
Bill DuMouchel and Daryl Pregibon, ACM SIGKDD 2001

Most likely topics according to the model are:
1. data, mining, discovery, association, attribute, …
2. set, subset, maximal, minimal, complete, …
3. measurements, correlation, statistical, variation, …
4. Bayesian, model, prior, data, mixture, …
3 of 300 example topics (TASA)

  TOPIC 82                TOPIC 77                TOPIC 166
  WORD        PROB.       WORD        PROB.       WORD        PROB.
  PLAY        0.0601      MUSIC       0.0903      PLAY        0.1358
  PLAYS       0.0362      DANCE       0.0345      BALL        0.1288
  STAGE       0.0305      SONG        0.0329      GAME        0.0654
  MOVIE       0.0288      PLAY        0.0301      PLAYING     0.0418
  SCENE       0.0253      SING        0.0265      HIT         0.0324
  ROLE        0.0245      SINGING     0.0264      PLAYED      0.0312
  AUDIENCE    0.0197      BAND        0.0260      BASEBALL    0.0274
  THEATER     0.0186      PLAYED      0.0229      GAMES       0.0250
  PART        0.0178      SANG        0.0224      BAT         0.0193
  FILM        0.0148      SONGS       0.0208      RUN         0.0186
  ACTORS      0.0145      DANCING     0.0198      THROW       0.0158
  DRAMA       0.0136      PIANO       0.0169      BALLS       0.0154
  REAL        0.0128      PLAYING     0.0159      TENNIS      0.0107
  CHARACTER   0.0122      RHYTHM      0.0145      HOME        0.0099
  ACTOR       0.0116      ALBERT      0.0134      CATCH       0.0098
  ACT         0.0114      MUSICAL     0.0134      FIELD       0.0097
  MOVIES      0.0114      DRUM        0.0129      PLAYER      0.0096
  ACTION      0.0101      GUITAR      0.0098      FUN         0.0092
  SET         0.0097      BEAT        0.0097      THROWING    0.0083
  SCENES      0.0094      BALLET      0.0096      PITCHER     0.0080
Automated Tagging of Words
(numbers and colors → topic assignments)

A Play082 is written082 to be performed082 on a stage082 before a live093 audience082 or before motion270 picture004 or television004 cameras004 (for later054 viewing004 by large202 audiences082). A Play082 is written082 because playwrights082 have something ...

He was listening077 to music077 coming009 from a passing043 riverboat. The music077 had already captured006 his heart157 as well as his ear119. It was jazz077. Bix Beiderbecke had already had music077 lessons077. He wanted268 to play077 the cornet. And he wanted268 to play077 jazz077...

Jim296 plays166 the game166. Jim296 likes081 the game166 for one. The game166 book254 helps081 jim296. Don180 comes040 into the house038. Don180 and jim296 read254 the game166 book254. The boys020 see a game166 for two. The two boys020 play166 the game166....
Four example topics from CiteSeer (T=300)

[Table: four topics, each with top words and top authors: a data-mining topic 205 (DATA 0.1563, MINING 0.0674, ATTRIBUTES 0.0462, DISCOVERY 0.0401, ASSOCIATION 0.0335, LARGE 0.0280, KNOWLEDGE 0.0260, DATABASES 0.0210, ATTRIBUTE 0.0188, DATASETS 0.0165; authors Han_J 0.0196, Rastogi_R, Zaki_M, Shim_K, Ng_R, Liu_B, Mannila_H, Brin_S, Liu_H, Holder_L), a Bayesian topic 209 (PROBABILISTIC 0.0778, BAYESIAN 0.0671, PROBABILITY 0.0532, CARLO 0.0309, MONTE 0.0308, DISTRIBUTION 0.0257, INFERENCE 0.0253, PROBABILITIES 0.0253, CONDITIONAL 0.0229, PRIOR 0.0219; authors Friedman_N 0.0094, Heckerman_D, Ghahramani_Z, Koller_D, Jordan_M, Neal_R, Raftery_A, Lukasiewicz_T, Halpern_J, Muller_P), an information-retrieval topic 289 (RETRIEVAL 0.1179, TEXT 0.0853, DOCUMENTS 0.0527, INFORMATION 0.0504, DOCUMENT 0.0441, CONTENT 0.0242, INDEXING 0.0205, RELEVANCE 0.0159, COLLECTION 0.0146, RELEVANT 0.0136; authors Oard_D 0.0110, Croft_W, Jones_K, Schauble_P, Voorhees_E, Singhal_A, Hawking_D, Merkl_D, Allan_J, Doermann_D), and a database-query topic 10 (QUERY 0.1848, QUERIES 0.1367, INDEX 0.0488, DATA 0.0368, JOIN 0.0260, INDEXING 0.0180, PROCESSING 0.0113, AGGREGATE 0.0110, ACCESS 0.0102, PRESENT 0.0095; authors Suciu_D 0.0102, Naughton_J, Levy_A, DeWitt_D, Wong_L, Ross_K, Hellerstein_J, Lenzerini_M, Moerkotte_G, Chakrabarti_K).]
More CiteSeer Topics

[Table: four more topics with top words and authors: a speech-recognition topic (SPEECH 0.1134, RECOGNITION 0.0349, WORD 0.0295, SPEAKER 0.0227, ACOUSTIC 0.0205, RATE 0.0134, SPOKEN 0.0132, SOUND 0.0127, TRAINING 0.0104, MUSIC 0.0102; authors Waibel_A 0.0156, Gauvain_J, Lamel_L, Woodland_P, Ney_H, Hansen_J, Renals_S, Noth_E, Boves_L, Young_S), the Bayesian topic 209 again (PROBABILISTIC 0.0778, BAYESIAN 0.0671, PROBABILITY 0.0532, CARLO, MONTE, DISTRIBUTION, INFERENCE, PROBABILITIES, CONDITIONAL, PRIOR; authors Friedman_N, Heckerman_D, Ghahramani_Z, Koller_D, Jordan_M, Neal_R, Raftery_A, Lukasiewicz_T, Halpern_J, Muller_P), a user-interface topic (USER 0.2541, INTERFACE 0.1080, USERS 0.0788, INTERFACES 0.0433, GRAPHICAL 0.0392, INTERACTIVE 0.0354, INTERACTION 0.0261, VISUAL 0.0203, DISPLAY 0.0128, MANIPULATION 0.0099; authors Shneiderman_B 0.0060, Rauterberg_M, Lavana_H, Pentland_A, Myers_B, Minas_M, Burnett_M, Winiwarter_W, Chang_S, Korvemaker_B), and an astronomy topic (STARS 0.0164, OBSERVATIONS 0.0150, SOLAR 0.0150, MAGNETIC 0.0145, RAY 0.0144, EMISSION 0.0134, GALAXIES 0.0124, OBSERVED 0.0108, SUBJECT 0.0101, STAR 0.0087; authors Linsky_J 0.0143, Falcke_H, Mursula_K, Butler_R, Bjorkman_K, Knapp_G, Kundu_M, Christensen-J, Cranmer_S, Nagar_N).]
Temporal patterns in topics: hot and cold topics
•
CiteSeer papers from 1986-2002, about 200k papers
•
For each year, calculate the fraction of words assigned to each topic
-> a time-series for each topic
•
Hot topics become more prevalent
•
Cold topics become less prevalent
Data Mining Lectures
Text Mining and Topic Models
© Padhraic Smyth, UC Irvine
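The per-year topic fractions described above can be computed directly from the sampler's token-level topic assignments. A minimal sketch, assuming the assignments and each token's document year are available as parallel lists (the variable names are illustrative, not from the original slides):

```python
from collections import Counter, defaultdict

def topic_trends(word_topics, word_years, num_topics):
    """Fraction of words assigned to each topic, per year.

    word_topics : topic index for every word token in the corpus
    word_years  : publication year of the document each token came from
    Returns {year: [fraction for topic 0 .. num_topics-1]}.
    """
    totals = Counter(word_years)            # total tokens per year
    counts = defaultdict(Counter)           # year -> topic -> token count
    for t, y in zip(word_topics, word_years):
        counts[y][t] += 1
    return {y: [counts[y][t] / totals[y] for t in range(num_topics)]
            for y in totals}

# A topic whose fraction rises over the years is "hot"; one that falls is "cold".
trends = topic_trends([0, 0, 1, 1, 1, 0],
                      [1990, 1990, 1990, 2000, 2000, 2000], 2)
```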
[Figure: Document and Word Distribution by Year in the UCI CiteSeer Data; number of documents (left axis) and number of words (right axis) per year, 1986-2002.]
[Figure: Changing Trends in Computer Science; topic probability by year, 1990-2002, for the WWW and INFORMATION RETRIEVAL topics.]
[Figure: Changing Trends in Computer Science; topic probability by year, 1990-2002, now also showing the OPERATING SYSTEMS and PROGRAMMING LANGUAGES topics alongside WWW and INFORMATION RETRIEVAL.]
[Figure: Hot Topics: Machine Learning/Data Mining; topic probability by year, 1990-2002, for the CLASSIFICATION, REGRESSION, and DATA MINING topics.]
[Figure: Bayes Marches On; topic probability by year, 1990-2002, for the BAYESIAN, PROBABILITY, and STATISTICAL PREDICTION topics.]
[Figure: Security-Related Topics; topic probability by year, 1990-2002, for the COMPUTER SECURITY and ENCRYPTION topics.]
[Figure: Interesting "Topics"; topic probability by year, 1990-2002, for non-semantic topics: FRENCH WORDS (LA, LES, UNE, NOUS, EST), DARPA, and MATH SYMBOLS (GAMMA, DELTA, OMEGA).]
Four example topics from NIPS (T=100)
[Table: four NIPS topics (TOPIC 19, 24, 29, 87), each shown as its most probable words and authors; exact word-probability pairs are unrecoverable from the extraction, but the topics are roughly:
• mixture models / EM: LIKELIHOOD, MIXTURE, EM, DENSITY, GAUSSIAN, ESTIMATION, LOG, MAXIMUM, PARAMETERS, ESTIMATE (authors: Tresp_V, Singer_Y, Jebara_T, Ghahramani_Z, Ueda_N, Jordan_M, Roweis_S, Schuster_M, Xu_L, Saul_L)
• character recognition: RECOGNITION, CHARACTER, CHARACTERS, TANGENT, DIGITS, HANDWRITTEN, IMAGE, DISTANCE, DIGIT, HAND (authors: Simard_P, Martin_G, LeCun_Y, Denker_J, Henderson_D, Revow_M, Keeler_J, Rashid_M, Sackinger_E)
• reinforcement learning: POLICY, ACTION, REINFORCEMENT, OPTIMAL, ACTIONS, REWARD, SUTTON, AGENT, DECISION (authors: Singh_S, Barto_A, Sutton_R, Dayan_P, Parr_R, Dietterich_T, Tsitsiklis_J, Randlov_J, Bradtke_S, Schwartz_A)
• kernel methods: KERNEL, SUPPORT, VECTOR, KERNELS, SET, SVM, FUNCTION, SPACE, MACHINES, MARGIN, REGRESSION (authors: Smola_A, Scholkopf_B, Burges_C, Vapnik_V, Chapelle_O, Cristianini_N, Ratsch_G, Laskov_P, Tipping_M, Sollich_P)]
NIPS: support vector topic
NIPS: neural network topic
NIPS Topics over Time
[Figure: topic mass (vertical height) per topic over time.]
Figure courtesy of Xuerui Wang and Andrew McCallum, UMass Amherst
Pennsylvania Gazette, 1728-1800: 80,000 articles
(courtesy of David Newman & Sharon Block, UC Irvine)
Pennsylvania Gazette Data
courtesy of David Newman (CS Dept) and Sharon Block (History Dept)
Topic trends from New York Times
330,000 articles, 2000-2002
[Figure: per-month topic trends, Jan00-Jan03, for three topics:]
• Tour-de-France: TOUR, RIDER, LANCE_ARMSTRONG, TEAM, BIKE, RACE, FRANCE
• Quarterly Earnings: COMPANY, QUARTER, PERCENT, ANALYST, SHARE, SALES, EARNING
• Anthrax: ANTHRAX, LETTER, MAIL, WORKER, OFFICE, SPORES, POSTAL, BUILDING
Enron email data
250,000 emails
28,000 authors
1999-2002
Enron email: business topics
[Table: four Enron business topics (TOPIC 36, 72, 23, 54), each shown as top words and top senders (most sender names anonymized as ***); roughly:
• performance review: FEEDBACK, PERFORMANCE, PROCESS, PEP, MANAGEMENT, COMPLETE, QUESTIONS, SELECTED, COMPLETED, SYSTEM (senders: perfmgmt, perf eval process, enron announcements)
• plant and construction projects: PROJECT, PLANT, COST, UNIT, FACILITY, SITE, CONSTRUCTION, PROJECTS, CONTRACT, UNITS
• regulatory filings: FERC, MARKET, ISO, ORDER, FILING, COMMENTS, PENDING, COMMISSION, PRICE, CALIFORNIA, FILED
• environmental: ENVIRONMENTAL, AIR, MTBE, EMISSIONS, CLEAN, EPA, SAFETY, WATER, GASOLINE]
Enron: non-work topics…
[Table: four Enron non-work topics (TOPIC 66, 182, 113, 109), each shown as top words and top senders; roughly:
• holiday celebrations: HOLIDAY, PARTY, YEAR, SEASON, COMPANY, CELEBRATION, ENRON, TIME, RECOGNIZE, MONTH (senders: chairman & ceo, general announcement)
• fantasy football: TEXANS, WIN, FOOTBALL, FANTASY, SPORTSLINE, PLAY, TEAM, GAME, SPORTS, GAMES (senders: cbs sportsline com, houston texans, sportsline rewards, pro football)
• religion: GOD, LIFE, MAN, PEOPLE, CHRIST, FAITH, LORD, JESUS, SPIRITUAL, VISIT (senders: crosswalk com, wordsmith, doctor dictionary)
• online shopping: AMAZON, GIFT, CLICK, SAVE, SHOPPING, OFFER, HOLIDAY, RECEIVE, SHIPPING, FLOWERS (senders: amazon com, jos a bank, sharperimageoffers, travelocity com, barnes & noble com)]
Enron: public-interest topics...
[Table: four Enron public-interest topics (TOPIC 18, 22, 114, 194), top senders anonymized as ***; roughly:
• California power crisis: POWER, CALIFORNIA, ELECTRICITY, UTILITIES, PRICES, MARKET, PRICE, UTILITY, CUSTOMERS, ELECTRIC
• state settlement plan: STATE, PLAN, CALIFORNIA, RATE, BANKRUPTCY, SOCAL, POWER, BONDS, MOU, SOCALGAS
• federal politics: COMMITTEE, BILL, HOUSE, WASHINGTON, SENATE, CONGRESS, PRESIDENT, LEGISLATION, DC, POLITICIAN X, POLITICIAN Y
• legal: LAW, TESTIMONY, ATTORNEY, SETTLEMENT, LEGAL, EXHIBIT, CLE, METALS, PERSON Z]
Comparing Predictive Power
•
Train models on part of a new document and predict remaining words
[Figure: perplexity of new words vs. number of observed words in the document (1 to 64), for the Author model, the Topics model, and the Author-Topics model.]
Using Topic Models for Document Search
References on Topic Models
Overviews:
•
Steyvers, M., & Griffiths, T. (2006). Probabilistic topic models. In T. Landauer, D. McNamara, S. Dennis, & W. Kintsch (Eds.), Latent Semantic Analysis: A Road to Meaning. Lawrence Erlbaum.
•
Blei, D., & Lafferty, J. (2009). Topic models. In A. Srivastava & M. Sahami (Eds.), Text Mining: Theory and Applications. Taylor and Francis.
More specific:
•
Blei, D., Ng, A. Y., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.
•
Griffiths, T., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228-5235.
•
Steyvers, M., Smyth, P., Rosen-Zvi, M., & Griffiths, T. (2004). Probabilistic author-topic models for information discovery. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
ADDITIONAL SLIDES
PubMed-Query Topics
[Table: four topics from a PubMed query, each shown as top words and top authors; roughly:
• bioterrorism agents (TOPIC 188): BIOLOGICAL, AGENTS, THREAT, BIOTERRORISM, WEAPONS, POTENTIAL, ATTACK, CHEMICAL, WARFARE, ANTHRAX (authors: Atlas_RM, Tegnell_A, Aas_P, Greenfield_RA, Bricaire_F)
• history of epidemics (TOPIC 63): PLAGUE, MEDICAL, CENTURY, MEDICINE, HISTORY, EPIDEMIC, GREAT, CHINESE, FRENCH, EPIDEMICS (authors: Károly_L, Jian-ping_Z, Sabbatani_S, Theodorides_J, Bowers_JZ)
• botulism: BOTULISM, BOTULINUM, TOXIN, CLOSTRIDIUM, INFANT, NEUROTOXIN, BONT, FOOD, PARALYSIS (authors: Hatheway_CL, Schiavo_G, Sugiyama_H, Arnon_SS, Simpson_LL)
• HIV protease inhibitors: PROTEASE, HIV, TYPE, INHIBITORS, INHIBITOR, PLASMA, AMPRENAVIR, APV, DRUG, RITONAVIR, IMMUNODEFICIENCY (authors: Sadler_BM, Tisdale_M, Lou_Y, Stein_DS, Haubrich_R)]
PubMed-Query Topics
[Table: four more PubMed-query topics, each shown as top words and top authors; roughly:
• anthrax (TOPIC 40): ANTHRACIS, ANTHRAX, BACILLUS, SPORES, CEREUS, SPORE, THURINGIENSIS, SUBTILIS, STERNE, INHALATIONAL (authors: Mock_M, Phillips_AP, Welkos_SL, Turnbull_PC, Fouet_A)
• nerve agents (TOPIC 89): CHEMICAL, SARIN, AGENT, GAS, VX, NERVE, AGENTS (authors: Minami_M, Hoskin_FC, Benschop_HP, Raushel_FM, Wild_JR)
• sulfur mustard (TOPIC 104): HD, MUSTARD, EXPOSURE, SM, SULFUR, SKIN, EXPOSED, AGENT, EPIDERMAL, DAMAGE (authors: Smith_WJ, Monteiro-Riviere_NA, Lindsay_CD, Sawyer_TW, Meier_HL)
• enzyme kinetics (TOPIC 178): ENZYME, ACTIVE, SITE, REACTION, TOXIC, PRODUCTS, ACID, SUBSTRATE, SUBSTRATES, ENZYMES, FOLD, CATALYTIC, RATE (authors: Masson_P, Kovach_IM, Schramm_VL, Barak_D, Broomfield_CA)]
PubMed-Query: Topics by Country

ISRAEL, n=196 authors:
• TOPIC 188 (p=0.049): BIOLOGICAL, AGENTS, THREAT, BIOTERRORISM, WEAPONS, POTENTIAL, ATTACK, CHEMICAL, WARFARE, ANTHRAX
• TOPIC 6 (p=0.045): INJURY, INJURIES, WAR, TERRORIST, MILITARY, MEDICAL, VICTIMS, TRAUMA, BLAST, VETERANS
• TOPIC 133 (p=0.043): HEALTH, PUBLIC, CARE, SERVICES, EDUCATION, NATIONAL, COMMUNITY, INFORMATION, PREVENTION, LOCAL
• TOPIC 104 (p=0.027): HD, MUSTARD, EXPOSURE, SM, SULFUR, SKIN, EXPOSED, AGENT, EPIDERMAL, DAMAGE
• TOPIC 159 (p=0.025): EMERGENCY, RESPONSE, MEDICAL, PREPAREDNESS, DISASTER, MANAGEMENT, TRAINING, EVENTS, BIOTERRORISM, LOCAL

CHINA, n=1775 authors:
• TOPIC 177 (p=0.045): SARS, RESPIRATORY, SEVERE, COV, SYNDROME, ACUTE, CORONAVIRUS, CHINA, KONG, PROBABLE
• TOPIC 7 (p=0.026): RENAL, HFRS, VIRUS, SYNDROME, FEVER, HANTAVIRUS, HANTAAN, PUUMALA, HANTAVIRUSES, INVOLVEMENT
• TOPIC 79 (p=0.024): FINDINGS, CHEST, CT, LUNG, CLINICAL, HEMORRHAGIC, PULMONARY, ABNORMAL, RADIOGRAPHIC, COMMON
• TOPIC 49 (p=0.024): METHODS, RESULTS, CONCLUSION, OBJECTIVE, CONCLUSIONS, BACKGROUND, STUDY, OBJECTIVES, INVESTIGATE, DESIGN
• TOPIC 197 (p=0.023): PATIENTS, HOSPITAL, PATIENT, ADMITTED, TWENTY, HOSPITALIZED, CONSECUTIVE, PROSPECTIVELY, DIAGNOSED, PROGNOSIS
Examples of Topics from New York Times

Terrorism: SEPT_11, WAR, SECURITY, IRAQ, TERRORISM, NATION, KILLED, AFGHANISTAN, ATTACKS, OSAMA_BIN_LADEN, AMERICAN, ATTACK, NEW_YORK_REGION, NEW, MILITARY, NEW_YORK, WORLD, NATIONAL, QAEDA, TERRORIST_ATTACKS

Wall Street Firms: WALL_STREET, ANALYSTS, INVESTORS, FIRM, GOLDMAN_SACHS, FIRMS, INVESTMENT, MERRILL_LYNCH, COMPANIES, SECURITIES, RESEARCH, STOCK, BUSINESS, ANALYST, WALL_STREET_FIRMS, SALOMON_SMITH_BARNEY, CLIENTS, INVESTMENT_BANKING, INVESTMENT_BANKERS, INVESTMENT_BANKS

Stock Market: WEEK, DOW_JONES, POINTS, 10_YR_TREASURY_YIELD, PERCENT, CLOSE, NASDAQ_COMPOSITE, STANDARD_POOR, CHANGE, FRIDAY, DOW_INDUSTRIALS, GRAPH_TRACKS, EXPECTED, BILLION, NASDAQ_COMPOSITE_INDEX, EST_02, PHOTO_YESTERDAY, YEN, 10, 500_STOCK_INDEX

Bankruptcy: BANKRUPTCY, CREDITORS, BANKRUPTCY_PROTECTION, ASSETS, COMPANY, FILED, BANKRUPTCY_FILING, ENRON, BANKRUPTCY_COURT, KMART, CHAPTER_11, FILING, COOPER, BILLIONS, COMPANIES, BANKRUPTCY_PROCEEDINGS, DEBTS, RESTRUCTURING, CASE, GROUP
Collocation Topic Model
•
For each document, choose a mixture of topics
•
For every word slot, sample a topic
•
If x=0, sample a word from the topic
•
If x=1, sample a word from the distribution based on the previous word
[Diagram: topic mixture -> a topic per word slot -> a word, with a switch variable X between consecutive words.]
Collocation Topic Model
Example: "DOW JONES RISES"
•
JONES is more likely explained as a word following DOW than as a word sampled from a topic
•
Result: DOW_JONES recognized as a collocation
[Diagram: DOW sampled from a topic; JONES generated with X=1 (follows DOW); RISES generated with X=0 (sampled from a topic).]
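The collocation model's generative story can be sketched in a few lines. All the distributions below (theta, phi, bigram, p_x1) and their toy values are hypothetical stand-ins, not parameters from the slides:

```python
import random

def generate_collocation(theta, phi, bigram, p_x1, length, seed=0):
    """Sketch of the collocation topic model's generative process.

    theta  : {topic: prob}                 document's topic mixture
    phi    : {topic: {word: prob}}         topic-word distributions
    bigram : {prev_word: {word: prob}}     distribution given previous word
    p_x1   : {prev_word: prob}             chance that x=1 (word continues a collocation)
    """
    rng = random.Random(seed)

    def draw(dist):                        # sample an item from a discrete distribution
        r, acc = rng.random(), 0.0
        for item, p in dist.items():
            acc += p
            if r < acc:
                return item
        return item

    words, prev = [], None
    for _ in range(length):
        z = draw(theta)                                     # topic for this slot
        x = prev is not None and rng.random() < p_x1.get(prev, 0.0)
        w = draw(bigram[prev]) if x else draw(phi[z])       # x=1: follow previous word
        words.append(w)
        prev = w
    return words

out = generate_collocation({0: 1.0}, {0: {"DOW": 1.0}},
                           {"DOW": {"JONES": 1.0}}, {"DOW": 1.0}, 2)
# → ['DOW', 'JONES'], with JONES generated as a continuation of DOW
```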
Using Topic Models for Information Retrieval
Stability of Topics
•
Content of topics is arbitrary across runs of the model
(e.g., topic #1 is not the same across runs)
•
However,
- Majority of topics are stable over processing time
- Majority of topics can be aligned across runs
- Topics appear to represent genuine structure in data
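Aligning topics across runs can be done by matching each topic to its nearest counterpart under a symmetrized KL distance, as in the comparisons that follow. A sketch; the greedy matching strategy is an assumption (any assignment method would do):

```python
import math

def kl(p, q):
    """KL divergence between two discrete word distributions (equal-length lists)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def align_topics(run1, run2):
    """Greedily pair each topic in run1 with its closest unused topic in run2
    by symmetrized KL; returns a list of (i, j, distance) triples."""
    free = set(range(len(run2)))
    pairs = []
    for i, p in enumerate(run1):
        j = min(free, key=lambda j: kl(p, run2[j]) + kl(run2[j], p))
        pairs.append((i, j, kl(p, run2[j]) + kl(run2[j], p)))
        free.remove(j)
    return pairs

# Topic 0 of run1 is closest to topic 1 of run2, and vice versa.
pairs = align_topics([[0.9, 0.1], [0.1, 0.9]], [[0.2, 0.8], [0.8, 0.2]])
```

Well-aligned pairs have small distances (like the BEST KL examples below in spirit); topics with no good partner stand out with large distances.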
Comparing NIPS topics from the same Markov chain

[Figure: KL distance matrix between topics at t1=1000 and re-ordered topics at t2=2000.]

BEST KL = 0.54:
t1: ANALOG .043, CIRCUIT .040, CHIP .034, CURRENT .025, VOLTAGE .023, VLSI .022, INPUT .018, OUTPUT .018, CIRCUITS .015, FIGURE .014, PULSE .012, SYNAPSE .011, SILICON .011, CMOS .009, MEAD .008
t2: ANALOG .044, CIRCUIT .040, CHIP .037, VOLTAGE .024, CURRENT .023, VLSI .023, OUTPUT .022, INPUT .019, CIRCUITS .015, PULSE .012, SYNAPSE .012, SILICON .011, FIGURE .010, CMOS .009, GATE .009

WORST KL = 4.78:
t1: FEEDBACK .040, ADAPTATION .034, CORTEX .025, REGION .016, FIGURE .015, FUNCTION .014, BRAIN .013, COMPUTATIONAL .013, FIBER .012, FIBERS .011, ELECTRIC .011, BOWER .010, FISH .010, SIMULATIONS .009, CEREBELLAR .009
t2: ADAPTATION .051, FIGURE .033, SIMULATION .026, GAIN .025, EFFECTS .016, FIBERS .014, COMPUTATIONAL .014, EXPERIMENT .014, FIBER .013, SITES .012, RESULTS .012, EXPERIMENTS .012, ELECTRIC .011, SITE .009, NEURO .009
Comparing NIPS topics from two different Markov chains

[Figure: KL distance matrix between topics from chain 1 and re-ordered topics from chain 2.]

BEST KL = 1.03:
Chain 1: MOTOR .041, TRAJECTORY .031, ARM .027, HAND .022, MOVEMENT .022, INVERSE .019, DYNAMICS .019, CONTROL .018, JOINT .018, POSITION .017, FORWARD .014, TRAJECTORIES .014, MOVEMENTS .013, FORCE .012, MUSCLE .011
Chain 2: MOTOR .040, ARM .030, TRAJECTORY .030, HAND .024, MOVEMENT .023, INVERSE .021, JOINT .021, DYNAMICS .018, CONTROL .015, POSITION .015, FORWARD .015, FORCE .014, TRAJECTORIES .013, MOVEMENTS .012, CHANGE .010
Outline
•
Background on statistical text modeling
•
Unsupervised learning from text documents
- Motivation
- Topic model and learning algorithm
- Results
•
Extensions
- Author-topic models
•
Applications
- Demo of topic browser
•
Future directions
Approach
•
The author-topic model
- extension of the topic model: linking authors and topics
• authors -> topics -> words
- learned from data
• completely unsupervised, no labels
- generative model
• different questions or queries can be answered by appropriate probability calculus
• e.g., p(author | words in document)
• e.g., p(topic | author)
Graphical Model
[Plate diagram, built up over several slides: observed author set a; for each of the n words in each of D documents, an author x is chosen, then a topic z, then a word w. The Author-Topic distributions θ govern the choice of z given x, and the Topic-Word distributions φ govern the choice of w given z.]
Generative Process
•
Let's assume authors A1 and A2 collaborate and produce a paper
- A1 has multinomial topic distribution θ1
- A2 has multinomial topic distribution θ2
•
For each word in the paper:
1. Sample an author x (uniformly) from {A1, A2}
2. Sample a topic z from θx
3. Sample a word w from the multinomial topic distribution φz
Learning
•
Observed
- W = observed words, A = sets of known authors
•
Unknown
- x, z : hidden variables
- θ, φ : unknown parameters
•
Interested in:
- p(x, z | W, A)
- p(θ, φ | W, A)
- But exact learning is not tractable
Step 1: Gibbs sampling of x and z
[Diagram: sample the hidden author x and topic z for each word, averaging over the unknown parameters θ and φ.]
Step 2: estimates of θ and φ
[Diagram: condition on particular samples of x and z to estimate θ and φ.]
Gibbs Sampling
•
Need full conditional distributions for variables
•
The probability of assigning the current word i to topic j and author k, given everything else:

P(z_i = j, x_i = k | w_i = m, z_{-i}, x_{-i}, w_{-i}, a_d) ∝
    (C^WT_mj + β) / (Σ_m' C^WT_m'j + Vβ) × (C^AT_kj + α) / (Σ_j' C^AT_kj' + Tα)

where C^WT_mj is the number of times word m is assigned to topic j, and C^AT_kj is the number of times topic j is assigned to author k.
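The full conditional above can be sketched as a function that scores every (author, topic) pair for the current word. Variable names and the toy counts are illustrative; the count matrices are assumed to already exclude the word's current assignment:

```python
def conditional(m, d_authors, cwt, cat, n_wt_topic, n_at_author, beta, alpha, V, T):
    """Unnormalized P(z_i=j, x_i=k | ...) for word m, over all (author k, topic j).

    cwt[m][j]      : C^WT, times word m is assigned to topic j
    cat[k][j]      : C^AT, times topic j is assigned to author k
    n_wt_topic[j]  : total words assigned to topic j   (sum over m' of cwt)
    n_at_author[k] : total topic assignments of author k (sum over j' of cat)
    V, T           : vocabulary size and number of topics
    """
    probs = {}
    for k in d_authors:                       # only the document's authors
        for j in range(T):
            p_w = (cwt[m][j] + beta) / (n_wt_topic[j] + V * beta)    # word | topic
            p_t = (cat[k][j] + alpha) / (n_at_author[k] + T * alpha) # topic | author
            probs[(k, j)] = p_w * p_t
    return probs

# Tiny example: one word, one author, two topics.
probs = conditional(0, [0], [[2, 0]], [[2, 0]], [2, 0], [2],
                    beta=1.0, alpha=1.0, V=1, T=2)
```

A sampler would normalize these scores and draw the new (x_i, z_i) from them, then update the count matrices.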
Authors and Topics (CiteSeer Data)
[Table: four CiteSeer topics with their most probable words and authors:
• TOPIC 10, speech recognition: SPEECH 0.1134, RECOGNITION 0.0349, WORD 0.0295, SPEAKER 0.0227, ACOUSTIC 0.0205 (Waibel_A, Gauvain_J, Lamel_L, Woodland_P)
• TOPIC 209, Bayesian learning: PROBABILISTIC 0.0778, BAYESIAN 0.0671, PROBABILITY 0.0532, CARLO 0.0309, MONTE 0.0308 (Friedman_N, Heckerman_D, Ghahramani_Z, Koller_D)
• TOPIC 87, user interfaces: USER 0.2541, INTERFACE 0.1080, USERS 0.0788, INTERFACES 0.0433 (Shneiderman_B, Rauterberg_M, Pentland_A, Myers_B)
• TOPIC 20, astronomy: STARS 0.0164, OBSERVATIONS 0.0150, SOLAR 0.0150, MAGNETIC 0.0145 (Linsky_J, Falcke_H, Mursula_K, Butler_R)]
Some likely topics per author (CiteSeer)
•
Author = Andrew McCallum, U Mass:
- Topic 1: classification, training, generalization, decision, data, …
- Topic 2: learning, machine, examples, reinforcement, inductive, …
- Topic 3: retrieval, text, document, information, content, …
•
Author = Hector Garcia-Molina, Stanford:
- Topic 1: query, index, data, join, processing, aggregate, …
- Topic 2: transaction, concurrency, copy, permission, distributed, …
- Topic 3: source, separation, paper, heterogeneous, merging, …
•
Author = Paul Cohen, USC/ISI:
- Topic 1: agent, multi, coordination, autonomous, intelligent, …
- Topic 2: planning, action, goal, world, execution, situation, …
- Topic 3: human, interaction, people, cognitive, social, natural, …
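Per-author topic lists like these are just a query against the learned author-topic matrix θ. A minimal sketch; the theta dictionary and its values are hypothetical, not the CiteSeer estimates:

```python
def top_topics(theta, author, k=3):
    """Return the k most probable topics for an author, given the learned
    author-topic distributions theta = {author: {topic: prob}}."""
    dist = theta[author]
    return sorted(dist, key=dist.get, reverse=True)[:k]

# Hypothetical theta row for illustration only.
theta = {"McCallum_A": {"classification": 0.30, "reinforcement": 0.25,
                        "retrieval": 0.20, "vision": 0.05}}
print(top_topics(theta, "McCallum_A"))
# → ['classification', 'reinforcement', 'retrieval']
```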
Finding unusual papers for an author
•
Perplexity = exp[entropy(words | model)] = measure of surprise for model on data
•
Can calculate perplexity of unseen documents, conditioned on the model for a particular author
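The perplexity definition above is exp of the average negative log-likelihood of the words. A sketch, assuming the author-conditioned model has already been reduced to a per-word probability lookup (word_probs is an illustrative stand-in):

```python
import math

def perplexity(words, word_probs):
    """perplexity = exp(-mean log p(w | model)); higher means more surprising.

    words      : the document's word tokens
    word_probs : {word: p(word | model)}, e.g. the model conditioned on one author
    """
    log_lik = sum(math.log(word_probs[w]) for w in words)
    return math.exp(-log_lik / len(words))

# If every word has probability 0.25 under the model, perplexity is exactly 4.
pp = perplexity(["a", "b"], {"a": 0.25, "b": 0.25})
```

Papers scoring far above an author's median perplexity, as in the examples that follow, are unusual for that author.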
Papers and Perplexities: M_Jordan
• Factorial Hidden Markov Models: 687
• Learning from Incomplete Data: 702
• MEDIAN PERPLEXITY: 2567
• Defining and Handling Transient Fields in Pjama: 14555
• An Orthogonally Persistent JAVA: 16021
Papers and Perplexities: T_Mitchell
• Explanation-based Learning for Mobile Robot Perception: 1093
• Learning to Extract Symbolic Knowledge from the Web: 1196
• MEDIAN PERPLEXITY: 2837
• Text Classification from Labeled and Unlabeled Documents using EM: 3802
• A Method for Estimating Occupational Radiation Dose…: 8814
Who wrote what?
•
Test of model:
1) artificially combine abstracts from different authors
2) check whether each word's sampled assignment is to the correct original author
(In the abstracts below, the digit after a word is its sampled author: 1 = Scholkopf_B, 2 = Darwiche_A.)
A method1 is described which like the kernel1 trick1 in support1 vector1 machines1 SVMs1 lets us generalize
distance1 based2 algorithms to operate in feature1 spaces usually nonlinearly related to the input1 space
This is done by identifying a class of kernels1 which can be represented as norm1 based2 distances1 in
Hilbert spaces It turns1 out that common kernel1 algorithms such as SVMs1 and kernel1 PCA1 are actually
really distance1 based2 algorithms and can be run2 with that class of kernels1 too As well as providing1 a
useful new insight1 into how these algorithms work the present2 work can form the basis1 for conceiving
new algorithms
This paper presents2 a comprehensive approach for model2 based2 diagnosis2 which includes proposals for
characterizing and computing2 preferred2 diagnoses2 assuming that the system2 description2 is augmented
with a system2 structure2 a directed2 graph2 explicating the interconnections between system2
components2 Specifically we first introduce the notion of a consequence2 which is a syntactically2
unconstrained propositional2 sentence2 that characterizes all consistency2 based2 diagnoses2 and show2
that standard2 characterizations of diagnoses2 such as minimal conflicts1 correspond to syntactic2
variations1 on a consequence2 Second we propose a new syntactic2 variation on the consequence2 known
as negation2 normal form NNF and discuss its merits compared to standard variations Third we introduce a
basic algorithm2 for computing consequences in NNF given a structured system2 description We show that
if the system2 structure2 does not contain cycles2 then there is always a linear size2 consequence2 in NNF
which can be computed in linear time2 For arbitrary1 system2 structures2 we show a precise connection
between the complexity2 of computing2 consequences and the topology of the underlying system2
structure2 Finally we present2 an algorithm2 that enumerates2 the preferred2 diagnoses2 characterized by a
consequence2 The algorithm2 is shown1 to take linear time2 in the size2 of the consequence2 if the
preference criterion1 satisfies some general conditions
First abstract written by (1) Scholkopf_B; second abstract written by (2) Darwiche_A.
The Author-Topic Browser
•
Querying on author Pazzani_M
•
Querying on topic relevant to author
•
Querying on document written by author
Outline
•
Background on statistical text modeling
•
Unsupervised learning from text documents
- Motivation
- Topic model and learning algorithm
- Results
•
Extensions
- Author-topic models
•
Applications
- Demo of topic browser
•
Future directions
Online Demonstration of Topic Browser for UCI and UCSD Faculty