topic model - University of Massachusetts Amherst

Discovering Latent Structure in
Multiple Modalities
Andrew McCallum
Computer Science Department
University of Massachusetts Amherst
Joint work with 
Xuerui Wang, Natasha Mohanty,
Andres Corrada, Chris Pal, Wei Li, Greg Druck.
Social Network in an Email Dataset
Social Network in Political Data
Vote similarity in
U.S. Senate
[Jakulin & Buntine 2005]
Groups and Topics
• Input:
– Observed relations between people
– Attributes on those relations (text, or categorical)
• Output:
– Attributes clustered into “topics”
– Groups of people---varying depending on topic
Discovering Groups from
Observed Set of Relations
Student Roster
Academic Admiration
Adams
Bennett
Carter
Davis
Edwards
Frederking
Acad(A, B) Acad(C, B)
Acad(A, D) Acad(C, D)
Acad(B, E) Acad(D, E)
Acad(B, F) Acad(D, F)
Acad(E, A) Acad(F, A)
Acad(E, C) Acad(F, C)
Admiration relations among six high school students.
Adjacency Matrix Representing Relations
Student Roster
Academic Admiration
Adams
Bennett
Carter
Davis
Edwards
Frederking
Acad(A, B) Acad(C, B)
Acad(A, D) Acad(C, D)
Acad(B, E) Acad(D, E)
Acad(B, F) Acad(D, F)
Acad(E, A) Acad(F, A)
Acad(E, C) Acad(F, C)
[Figure: adjacency matrix over students A to F. In roster order (A B C D E F) the group labels are G1 G2 G1 G2 G3 G3. Reordering rows and columns as A C B D E F gives group labels G1 G1 G2 G2 G3 G3 and exposes the block structure.]
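As a rough illustration of the adjacency-matrix view, the following sketch builds the matrix for the Acad(X, Y) relations listed on the slide and prints it in the group-sorted order A C B D E F. The helper name `adjacency` is hypothetical and only used for this example.

```python
# Acad(X, Y) pairs copied from the slide: X admires Y academically.
acad = [("A", "B"), ("C", "B"), ("A", "D"), ("C", "D"),
        ("B", "E"), ("D", "E"), ("B", "F"), ("D", "F"),
        ("E", "A"), ("F", "A"), ("E", "C"), ("F", "C")]

def adjacency(order, relations):
    """Return a 0/1 matrix whose rows and columns follow `order`."""
    index = {name: i for i, name in enumerate(order)}
    matrix = [[0] * len(order) for _ in order]
    for admirer, admired in relations:
        matrix[index[admirer]][index[admired]] = 1
    return matrix

# Group-sorted order from the slide: G1 = {A, C}, G2 = {B, D}, G3 = {E, F}.
sorted_order = ["A", "C", "B", "D", "E", "F"]
reordered = adjacency(sorted_order, acad)
for name, row in zip(sorted_order, reordered):
    print(name, row)
```

In the reordered matrix the ones fall into three solid 2x2 blocks (G1 to G2, G2 to G3, G3 to G1), which is the structure a group model is meant to recover.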
Group Model:
Partitioning Entities into Groups
Stochastic Blockstructures for Relations
[Nowicki, Snijders 2001]

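[Graphical model: a group assignment g for each of S entities, drawn from a Multinomial with a Dirichlet prior; a relation v for each of S² entity pairs, drawn from a Binomial with a Beta prior on each of G² group pairs.]
S: number of entities
G: number of groups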
Enhanced with arbitrary number of groups in [Kemp, Griffiths, Tenenbaum 2004]
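A minimal generative sketch of a blockstructure model of this kind, using only the distributions named on the slide. The symmetric Dirichlet and Beta(1, 1) hyperparameters, the single-trial (Bernoulli) reading of "Binomial", and all function names are assumptions made for illustration, not details taken from the cited papers.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_blockmodel(num_entities, num_groups, rng):
    # Group proportions: multinomial parameters with an assumed symmetric Dirichlet prior.
    theta = sample_dirichlet([1.0] * num_groups, rng)
    # One group assignment g per entity (S draws).
    groups = rng.choices(range(num_groups), weights=theta, k=num_entities)
    # One relation probability per ordered group pair (G^2 values), Beta(1, 1) assumed.
    gamma = [[rng.betavariate(1.0, 1.0) for _ in range(num_groups)]
             for _ in range(num_groups)]
    # One binary relation v per ordered entity pair (S^2 draws).
    relations = [[int(rng.random() < gamma[groups[i]][groups[j]])
                  for j in range(num_entities)]
                 for i in range(num_entities)]
    return groups, relations

groups, relations = sample_blockmodel(num_entities=6, num_groups=3,
                                      rng=random.Random(0))
print(groups)
```

Inference runs this story in reverse: given only `relations`, it recovers plausible values of `groups`.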
Two Relations with Different Attributes
Student Roster
Academic Admiration
Social Admiration
Adams
Bennett
Carter
Davis
Edwards
Frederking
Acad(A, B) Acad(C, B)
Acad(A, D) Acad(C, D)
Acad(B, E) Acad(D, E)
Acad(B, F) Acad(D, F)
Acad(E, A) Acad(F, A)
Acad(E, C) Acad(F, C)
Soci(A, B) Soci(A, D) Soci(A, F)
Soci(B, A) Soci(B, C) Soci(B, E)
Soci(C, B) Soci(C, D) Soci(C, F)
Soci(D, A) Soci(D, C) Soci(D, E)
Soci(E, B) Soci(E, D) Soci(E, F)
Soci(F, A) Soci(F, C) Soci(F, E)
[Figure: two adjacency matrices. Academic Admiration is ordered A C B D E F with group labels G1 G1 G2 G2 G3 G3; Social Admiration is ordered A C E B D F with group labels G1 G1 G1 G2 G2 G2.]
The Group-Topic Model:
Discovering Groups and Topics Simultaneously
[Wang, Mohanty, McCallum 2006]

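[Graphical model: for each of B events, a topic t drawn from a Uniform; N_b words w drawn from a per-topic Multinomial with a Dirichlet prior (T topics); a group assignment g for each of S entities, drawn from a per-topic Multinomial with a Dirichlet prior (T topics); a relation v for each of S² entity pairs, drawn from a Binomial with a Beta prior on each of G² group pairs.]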
Inference and Estimation
Gibbs Sampling:
- Many random variables can be integrated out
- Easy to implement
- Reasonably fast
We assume the relationship is symmetric.
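The generative story behind the Group-Topic model can be sketched roughly as follows, reading the distribution labels off the graphical-model slide. Hyperparameter values, the Bernoulli reading of "Binomial", the toy vocabulary, and every identifier are assumptions for illustration; the symmetric fill of the relation matrix reflects the symmetry assumption stated above.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_group_topic(num_events, num_entities, num_groups, num_topics,
                       vocab, words_per_event, rng):
    # Per-topic multinomial over words and per-topic multinomial over groups,
    # each with an assumed symmetric Dirichlet prior.
    phi = [sample_dirichlet([1.0] * len(vocab), rng) for _ in range(num_topics)]
    theta = [sample_dirichlet([1.0] * num_groups, rng) for _ in range(num_topics)]
    events = []
    for _ in range(num_events):
        topic = rng.randrange(num_topics)  # topic t drawn uniformly
        words = rng.choices(vocab, weights=phi[topic], k=words_per_event)
        groups = rng.choices(range(num_groups), weights=theta[topic],
                             k=num_entities)
        # Event-specific agreement probability per unordered group pair.
        gamma = [[0.0] * num_groups for _ in range(num_groups)]
        for g in range(num_groups):
            for h in range(g, num_groups):
                gamma[g][h] = gamma[h][g] = rng.betavariate(1.0, 1.0)
        # Symmetric binary relation between every pair of entities.
        relation = [[0] * num_entities for _ in range(num_entities)]
        for i in range(num_entities):
            for j in range(i + 1, num_entities):
                v = int(rng.random() < gamma[groups[i]][groups[j]])
                relation[i][j] = relation[j][i] = v
        events.append({"topic": topic, "words": words,
                       "groups": groups, "relation": relation})
    return events

events = sample_group_topic(num_events=3, num_entities=5, num_groups=2,
                            num_topics=2, vocab=["tax", "school", "energy"],
                            words_per_event=4, rng=random.Random(0))
```

Because group assignments are drawn per event from a topic-specific distribution, the same entity can land in different groups under different topics, which is the point of the model.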
Dataset #1:
U.S. Senate
• 16 years of voting records in the US Senate (1989 – 2005)
• A Senator may respond Yea or Nay to a resolution
• 3423 resolutions with text attributes (index terms)
• 191 Senators in total across 16 years
S.543
Title: An Act to reform Federal deposit insurance, protect the deposit insurance
funds, recapitalize the Bank Insurance Fund, improve supervision and regulation
of insured depository institutions, and for other purposes.
Sponsor: Sen Riegle, Donald W., Jr. [MI] (introduced 3/5/1991) Cosponsors (2)
Latest Major Action: 12/19/1991 Became Public Law No: 102-242.
Index terms: Banks and banking Accounting Administrative fees Cost control
Credit Deposit insurance Depressed areas and other 110 terms
Adams (D-WA), Nay Akaka (D-HI), Yea Bentsen (D-TX), Yea Biden (D-DE), Yea
Bond (R-MO), Yea Bradley (D-NJ), Nay Conrad (D-ND), Nay ……
Topics Discovered (U.S. Senate)
Mixture of Unigrams
• Education: education, school, aid, children, drug, students, elementary, prevention
• Energy: energy, power, water, nuclear, gas, petrol, research, pollution
• Military Misc.: government, military, foreign, tax, congress, aid, law, policy
• Economic: federal, labor, insurance, aid, tax, business, employee, care
Group-Topic Model
• Education + Domestic: education, school, federal, aid, government, tax, energy, research
• Foreign: foreign, trade, chemicals, tariff, congress, drugs, communicable, diseases
• Economic: labor, insurance, tax, congress, income, minimum, wage, business
• Social Security + Medicare: social, security, insurance, medical, care, medicare, disability, assistance
Groups Discovered (US Senate)
Groups from topic Education + Domestic
Senators Who Change Coalition the Most, Dependent on Topic
e.g. Senator Shelby (D-AL) votes
– with the Republicans on Economic
– with the Democrats on Education + Domestic
– with a small group of maverick Republicans on Social Security + Medicare
Dataset #2:
The UN General Assembly
• Voting records of the UN General Assembly (1990 - 2003)
• A country may choose to vote Yes, No or Abstain
• 931 resolutions with text attributes (titles)
• 192 countries in total
• Also experiments later with resolutions from 1960-2003
Vote on Permanent Sovereignty of Palestinian People, 87th plenary meeting
The draft resolution on permanent sovereignty of the Palestinian people in the
occupied Palestinian territory, including Jerusalem, and of the Arab population in
the occupied Syrian Golan over their natural resources (document A/54/591) was
adopted by a recorded vote of 145 in favour to 3 against with 6 abstentions:
In favour: Afghanistan, Argentina, Belgium, Brazil, Canada, China, France,
Germany, India, Japan, Mexico, Netherlands, New Zealand, Pakistan, Panama,
Russian Federation, South Africa, Spain, Turkey, and other 126 countries.
Against: Israel, Marshall Islands, United States.
Abstain: Australia, Cameroon, Georgia, Kazakhstan, Uzbekistan, Zambia.
Topics Discovered (UN)
Mixture of Unigrams
• Everything Nuclear: nuclear, weapons, use, implementation, countries
• Human Rights: rights, human, palestine, situation, israel
• Security in Middle East: occupied, israel, syria, security, calls
Group-Topic Model
• Nuclear Non-proliferation: nuclear, states, united, weapons, nations
• Nuclear Arms Race: nuclear, arms, prevention, race, space
• Human Rights: rights, human, palestine, occupied, israel
Groups Discovered (UN)
The countries listed for each group are ordered by their 2005 GDP (PPP), and only 5 countries are shown for groups that have more than 5 members.
Outline
Discovering Latent Structure in Multiple Modalities
• ✓ Groups & Text (Group-Topic Model, GT)
• Nested Correlations (Pachinko Allocation, PAM)
• Time & Text (Topics-over-Time Model, TOT)
• Time & Text with Nested Correlations (PAM-TOT)
• Multi-Conditional Mixtures
Latent Dirichlet Allocation
[Blei, Ng, Jordan, 2003]
[Figure: LDA plate diagram with variables α, θ, z, w, β, φ and plate sizes n, N, T.]
• “images, motion, eyes” (LDA 20): visual, model, motion, field, object, image, images, objects, fields, receptive, eye, position, spatial, direction, target, vision, multiple, figure, orientation, location
• “motion, some junk” (LDA 100): motion, detection, field, optical, flow, sensitive, moving, functional, detect, contrast, light, dimensional, intensity, computer, mt, measures, occlusion, temporal, edge, real
Correlated Topic Model
[Blei, Lafferty, 2005]


[Figure: CTM plate diagram with a logistic normal prior in place of the Dirichlet; variables z, w, β, φ and plate sizes n, N, T.]
Square matrix of pairwise correlations.
Topic Correlation Representation
7 topics: {A, B, C, D, E, F, G}
Correlations: {A, B, C, D, E} and {C, D, E, F, G}
[Figure: two representations of these correlations over topics A to G.]
• CTM: 21 parameters
• Mixture Model: 14 parameters
Pachinko Machine
Pachinko Allocation Model
[Li, McCallum, 2005, 2006]
[Figure: a DAG with interior nodes 11; 21, 22; 31, 32, 33; 41, 42, 43, 44, 45 and leaves word1 ... word8.]
Given:
directed acyclic graph (DAG);
at each interior node: a Dirichlet over its children
and words at leaves
For each document:
Sample a multinomial from each Dirichlet
For each word in this document:
Starting from the root, sample a child from successive nodes, down to a leaf.
Generate the word at the leaf
Like a Polya tree, but DAG shaped, with arbitrary number of children.
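The generative procedure on this slide can be sketched in a few lines. The toy DAG, the symmetric Dirichlet parameters, and all identifiers below are hypothetical; only the three steps (one multinomial per interior node per document, then a root-to-leaf walk per word) come from the slide.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(dag, alphas, root, num_words, rng):
    # For each document: sample a multinomial from each interior node's Dirichlet.
    multinomials = {node: sample_dirichlet(alphas[node], rng) for node in dag}
    words = []
    for _ in range(num_words):
        node = root
        # Walk from the root, sampling a child at each interior node,
        # until a leaf (a word, not a key of `dag`) is reached.
        while node in dag:
            node = rng.choices(dag[node], weights=multinomials[node], k=1)[0]
        words.append(node)
    return words

# Hypothetical toy DAG: interior nodes map to their children; leaves are words.
dag = {
    "root": ["super1", "super2"],
    "super1": ["sub1", "sub2"],
    "super2": ["sub2", "sub3"],
    "sub1": ["word1", "word2", "word3"],
    "sub2": ["word3", "word4", "word5"],
    "sub3": ["word6", "word7", "word8"],
}
alphas = {node: [1.0] * len(children) for node, children in dag.items()}
doc = generate_document(dag, alphas, "root", num_words=10, rng=random.Random(0))
print(doc)
```

Note that `sub2` has two parents, which a tree could not express; this sharing is what lets the DAG encode correlations between topics.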
Pachinko Allocation Model
[Li, McCallum, 2005]
[Figure: the same DAG, with interior nodes 11; 21, 22; 31, 32, 33; 41, 42, 43, 44, 45 and leaves word1 ... word8.]
DAG may have arbitrary structure
• arbitrary depth
• any number of children per node
• sparse connectivity
• edges may skip layers
Pachinko Allocation Model
[Li, McCallum, 2005]
[Figure: the same DAG, annotated by level.]
• Upper interior nodes: distributions over distributions over topics...
• Middle interior nodes: distributions over topics; mixtures, representing topic correlations
• Lowest interior nodes: distributions over words (like “LDA topics”)
Some interior nodes could contain one multinomial, used for all documents.
(i.e. a very peaked Dirichlet)
Pachinko Allocation Model
[Li, McCallum, 2005]
[Figure: the same DAG, with interior nodes 11; 21, 22; 31, 32, 33; 41, 42, 43, 44, 45 and leaves word1 ... word8.]
Estimate all these Dirichlets from data.
Estimate model structure from data.
(number of nodes, and connectivity)
Pachinko Allocation Special Cases
Latent Dirichlet Allocation
[Figure: a single node 32 over nodes 41, 42, 43, 44, 45, which emit word1 ... word8.]
Pachinko Allocation Model
... with two layers, no skipping layers,
fully-connected from one layer to the next.
[Figure: root 11; “super-topics” 21, 22, 23; “sub-topics” 31, 32, 33, 34, 35 (fixed multinomials); leaves word1 ... word8.]
Another special case would select only one super-topic per document.
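Graphical Models
[Figure: two plate diagrams side by side.]
• LDA: variables α, θ, z, w, β, φ; plate sizes n, N, T
• Four-level PAM (with fixed multinomials for sub-topics): variables α1, α2, θ2, θ3, z2, z3, w, β, φ; plate sizes n, N, T, T’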
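Inference – Gibbs Sampling
\[
P(z_{w2} = t_k,\, z_{w3} = t_p \mid D, \mathbf{z}_{-w}, \alpha, \beta) \;\propto\;
\frac{n_{1k}^{(d)} + \alpha_{1k}}{n_{1}^{(d)} + \sum_{k'} \alpha_{1k'}}
\times
\frac{n_{kp}^{(d)} + \alpha_{kp}}{n_{k}^{(d)} + \sum_{p'} \alpha_{kp'}}
\times
\frac{n_{pw} + \beta_{w}}{n_{p} + \sum_{m} \beta_{m}}
\]
[Figure: four-level PAM plate diagram (α2, α3, θ2, θ3, z2, z3, w, β, φ; plate sizes n, N, T, T’). z2 and z3 are jointly sampled; the three factors correspond to P(t_k), P(t_p | t_k) and P(w | t_p).]
Dirichlet parameters α are estimated with moment matching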
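As a hedged sketch of the jointly sampled update for the four-level model, the function below multiplies three smoothed count ratios, one each for P(t_k), P(t_p | t_k) and P(w | t_p), and normalizes over all (super-topic, sub-topic) pairs. Argument names and the count layout are assumptions; counts are taken to exclude the word currently being resampled, and the super-topic total for a document is assumed to equal the row sum of its sub-topic counts.

```python
def joint_conditional(n_kp, n_pw, n_p, alpha1, alpha2, beta_w, beta_sum):
    """Return probs[k][p] for assigning super-topic k and sub-topic p to one word.

    n_kp[k][p]: times sub-topic p was drawn under super-topic k in this document
    n_pw[p]:    corpus count of the current word in sub-topic p
    n_p[p]:     corpus count of all words in sub-topic p
    alpha1[k], alpha2[k][p]: Dirichlet parameters at the root and at super-topic k
    beta_w, beta_sum: word smoothing for this word and its sum over the vocabulary
    """
    n_k = [sum(row) for row in n_kp]   # super-topic counts in this document
    n_1 = sum(n_k)                     # root count: other words in this document
    alpha1_sum = sum(alpha1)
    weights = []
    for k, row_counts in enumerate(n_kp):
        root_term = (n_k[k] + alpha1[k]) / (n_1 + alpha1_sum)
        alpha2_sum = sum(alpha2[k])
        row = []
        for p, count in enumerate(row_counts):
            super_term = (count + alpha2[k][p]) / (n_k[k] + alpha2_sum)
            word_term = (n_pw[p] + beta_w) / (n_p[p] + beta_sum)
            row.append(root_term * super_term * word_term)
        weights.append(row)
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]

# With no counts and symmetric priors, every (k, p) pair is equally likely.
probs = joint_conditional(n_kp=[[0, 0], [0, 0]], n_pw=[0, 0], n_p=[0, 0],
                          alpha1=[1.0, 1.0], alpha2=[[1.0, 1.0], [1.0, 1.0]],
                          beta_w=0.1, beta_sum=1.0)
```

A sampler would draw one (k, p) pair from `probs`, update the counts, and move to the next word.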
Experimental Results
• Topic clarity by human judgement
• Likelihood on held-out data
• Document classification
Datasets
• Rexa (http://rexa.info/)
– 4000 documents, 278438 word tokens and 25597 unique words.
• NIPS
– 1647 documents, 114142 word tokens and 11708 unique words.
• 20 newsgroup comp5 subset
– 4836 documents, 35567 unique words.
Topic Correlations
Example Topics
• “images, motion, eyes” (LDA 20): visual, model, motion, field, object, image, images, objects, fields, receptive, eye, position, spatial, direction, target, vision, multiple, figure, orientation, location
• “motion” (+ some generic) (LDA 100): motion, detection, field, optical, flow, sensitive, moving, functional, detect, contrast, light, dimensional, intensity, computer, mt, measures, occlusion, temporal, edge, real
• “motion” (PAM 100): motion, video, surface, surfaces, figure, scene, camera, noisy, sequence, activation, generated, analytical, pixels, measurements, assigne, advance, lated, shown, closed, perceptual
• “eyes” (PAM 100): eye, head, vor, vestibulo, oculomotor, vestibular, vary, reflex, vi, pan, rapid, semicircular, canals, responds, streams, cholinergic, rotation, topographically, detectors, ning
• “images” (PAM 100): image, digit, faces, pixel, surface, interpolation, scene, people, viewing, neighboring, sensors, patches, manifold, dataset, magnitude, transparency, rich, dynamical, amounts, tor
Blind Topic Evaluation
• Randomly select 25 similar pairs of topics generated from PAM and LDA
• 5 people
• Each asked to “select the topic in each pair that you find more semantically coherent.”
Topic counts
              LDA   PAM
5 votes         0     5
>= 4 votes      3     8
>= 3 votes      9    16
Examples
• PAM (5 votes): control, systems, robot, adaptive, environment, goal, state, controller
• LDA (0 votes): control, systems, based, adaptive, direct, con, controller, change
• PAM (4 votes): motion, image, detection, images, scene, vision, texture, segmentation
• LDA (1 vote): image, motion, images, multiple, local, generated, noisy, optical
Examples
• PAM (4 votes): signals, source, separation, eeg, sources, blind, single, event
• LDA (1 vote): signal, signals, single, time, low, source, temporal, processing
• PAM (1 vote): algorithm, learning, algorithms, gradient, convergence, function, stochastic, weight
• LDA (4 votes): algorithm, algorithms, gradient, convergence, stochastic, line, descent, converge
Likelihood Comparison
• Dataset: NIPS
• Two sets of experiments:
– Varying number of topics
– Different proportions of training data
Likelihood Comparison
• Varying number of topics
Likelihood Comparison
• Different proportions of training data
Document Classification
• 20 newsgroup comp5 subset
• 5-way classification (accuracy in %)
class        # docs    LDA      PAM
graphics     243       83.95    86.83
os           239       81.59    84.10
pc           245       83.67    88.16
mac          239       86.61    89.54
Windows.x    243       88.07    92.20
total        1209      84.70    87.34
Outline
Discovering Latent Structure in Multiple Modalities
• ✓ Groups & Text (Group-Topic Model, GT)
• ✓ Nested Correlations (Pachinko Allocation, PAM)
• Time & Text (Topics-over-Time Model, TOT)
• Time & Text with Nested Correlations (PAM-TOT)
• Multi-Conditional Mixtures
Want to Model Trends over Time
• Is prevalence of topic growing or waning?
• Pattern appears only briefly
– Capture its statistics in focused way
– Don’t confuse it with patterns elsewhere in time
• How do roles, groups, influence shift over time?
Topics over Time (TOT)
[Wang, McCallum 2006]
[Graphical model, drawn in two layouts: each of D documents has a multinomial over topics with a Dirichlet prior; each of its N_d words has a topic index z; the word w is drawn from a per-topic multinomial over words (Dirichlet prior, T topics); the time stamp t is drawn from a per-topic Beta over time (Uniform prior, T topics).]
Attributes of this Approach to
Modeling Time
• Not a Markov model
– No state transitions, or Markov assumption
• Continuous Time
– Time not discretized
• Easily incorporated into other more complex
models with additional modalities.
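To make the "continuous time, no Markov assumption" point concrete, here is a rough generative sketch in which each word token gets a topic, a word, and a time stamp, with the time stamp drawn from that topic's Beta distribution. It assumes time has been rescaled to the interval [0, 1]; the vocabulary, parameter values, and identifiers are hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_tot_document(alpha, phi, psi, vocab, num_words, rng):
    # Per-document multinomial over topics, with a Dirichlet prior.
    theta = sample_dirichlet(alpha, rng)
    tokens = []
    for _ in range(num_words):
        z = rng.choices(range(len(alpha)), weights=theta, k=1)[0]  # topic index
        w = rng.choices(vocab, weights=phi[z], k=1)[0]             # word from topic z
        t = rng.betavariate(*psi[z])                               # time stamp from topic z
        tokens.append((z, w, t))
    return tokens

vocab = ["tax", "war", "energy"]
phi = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]  # per-topic word probabilities
psi = [(2.0, 5.0), (5.0, 2.0)]            # topic 0 peaks early, topic 1 peaks late
tokens = generate_tot_document([1.0, 1.0], phi, psi, vocab, 8, random.Random(0))
```

No state is carried from one time step to the next: a topic's temporal profile is just the shape of its Beta density, so a topic that appears only briefly gets a narrow peak.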
State of the Union Address
208 Addresses delivered between January 8, 1790 and January 29, 2002.
To increase the number of documents, we split the addresses into paragraphs
and treated them as ‘documents’. One-line paragraphs were excluded.
Stop-word removal was applied.
• 17156 ‘documents’
• 21534 words
• 669,425 tokens
Our scheme of taxation, by means of which this needless surplus is taken
from the people and put into the public Treasury, consists of a tariff or
duty levied upon importations from abroad and internal-revenue taxes levied
upon the consumption of tobacco and spirituous and malt liquors. It must be
conceded that none of the things subjected to internal-revenue taxation
are, strictly speaking, necessaries. There appears to be no just complaint
of this taxation by the consumers of these articles, and there seems to be
nothing so well able to bear the burden without hardship to any portion of
the people.
1910
Comparing TOT against LDA
TOT on 17 years of NIPS proceedings
[Figure: Topic Distributions Conditioned on Time; horizontal axis: time; topic mass shown as vertical height.]
TOT on 17 years of NIPS proceedings
[Figure: TOT and LDA side by side.]
TOT versus LDA on my email
TOT improves ability to Predict Time
Predicting the year of a State-of-the-Union address.
L1 = distance between predicted year and actual year.
Outline
Discovering Latent Structure in Multiple Modalities
• ✓ Groups & Text (Group-Topic Model, GT)
• ✓ Nested Correlations (Pachinko Allocation, PAM)
• ✓ Time & Text (Topics-over-Time Model, TOT)
• Time & Text with Nested Correlations (PAM-TOT)
• Multi-Conditional Mixtures
PAM Over Time (PAMTOT)
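[Graphical model: the four-level PAM plate diagram (α1, α2, θ2, θ3, z2, z3, w, β, φ; plate sizes n, N, T, T’) with added time-stamp variables t2 and t3.]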
Experimental Results
• Dataset: Rexa subset
– 4454 documents between years 1991 and 2005
– 372936 word tokens
– 21748 unique words
• Topic Examples
• Predict Time
Topic Examples (1)
PAMTOT
PAM
Topic Examples (2)
PAMTOT
PAM
Predict Time with PAMTOT
          L1 Error   E(L1)   Accuracy
PAMTOT    1.56       1.57    0.29
PAM       5.34       5.30    0.10
L1 Error: the difference between predicted and true years
E(L1): average difference between all years and true year using p(t|.) from the model.
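The two error measures can be illustrated with a small sketch. It assumes the model's p(t|.) is available as a mapping from candidate year to probability and that the predicted year is the most probable one; both the data layout and that choice of predictor are assumptions, and the example distribution is made up.

```python
def l1_error(p_year, true_year):
    """Absolute difference between the most probable year and the true year."""
    predicted = max(p_year, key=p_year.get)
    return abs(predicted - true_year)

def expected_l1(p_year, true_year):
    """Average absolute difference to the true year, weighted by p(t|.)."""
    total = sum(p_year.values())
    return sum(p * abs(year - true_year) for year, p in p_year.items()) / total

# Hypothetical predictive distribution over three candidate years.
p_year = {1999: 0.2, 2000: 0.5, 2001: 0.3}
print(l1_error(p_year, 2001))
print(expected_l1(p_year, 2001))
```

L1 Error looks only at the single best guess, while E(L1) also penalizes probability mass placed on years far from the truth.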
Outline
Discovering Latent Structure in Multiple Modalities
• ✓ Groups & Text (Group-Topic Model, GT)
• ✓ Nested Correlations (Pachinko Allocation, PAM)
• ✓ Time & Text (Topics-over-Time Model, TOT)
• ✓ Time & Text with Nested Correlations (PAM-TOT)
• Multi-Conditional Mixtures
Want a “topic model” with
the advantages of CRFs
• Use arbitrary, overlapping features of the input.
• Undirected graphical model, so we don’t have to
think about avoiding cycles.
• Integrate naturally with our other CRF
components.
• Train “discriminatively”
• Natural semi-supervised & transfer learning
“Multi-Conditional Mixtures”
Latent Variable Models fit by Multi-way Conditional Probability
[McCallum, Wang, Pal, 2005],
[McCallum, Pal, Druck, Wang, 2006]
• For clustering structured data,
à la Latent Dirichlet Allocation & its successors
• But an undirected model,
like the Harmonium [Welling, Rosen-Zvi, Hinton, 2005]
• But trained by a “multi-conditional” objective:
O = P(A|B,C) P(B|A,C) P(C|A,B)
e.g. A,B,C are different modalities
See also [Minka 2005 TR] and [Pereira & Gordon 2006 ICML]
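For reference, taking logarithms of the multi-conditional objective turns the product into a sum of conditional log-likelihoods, one per modality. This is only a restatement of the formula on the slide; any per-term weighting used in practice is not shown here.

```latex
\log O \;=\; \log P(A \mid B, C) \;+\; \log P(B \mid A, C) \;+\; \log P(C \mid A, B)
```

Each term asks the model to predict one modality from the other two, so no single modality is treated as the only output.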
Objective Functions for Parameter Estimation
Traditional
Traditional, joint training (e.g. naive Bayes, most topic models)
Traditional mixture model (e.g. LDA)
Traditional, conditional training (e.g. MaxEnt classifiers, CRFs)
New, multi-conditional
Conditional mixtures (e.g. Jebara’s CEM,
McCallum CRF string edit distance, ...)
Multi-conditional
(mostly conditional, generative regularization)
Multi-conditional
(for semi-sup)
Multi-conditional
(for transfer learning, 2 tasks, shared hiddens)
“Multi-Conditional Learning” (Regularization)
[McCallum, Pal, Wang, 2006]
Predictive Random Fields
mixture of Gaussians on synthetic data
Data, classify by color
Generatively trained
[McCallum, Wang, Pal, 2005]
Multi-Conditional
Conditionally-trained [Jebara 1998]
Topic Words
Strong positive and negative indicators
20 Newsgroups data, two subtopics of talk.politics.guns:
Multi-Conditional Harmonium
Multi-Conditional Mixtures vs. Harmonium
on document retrieval task
[McCallum, Wang, Pal, 2005]
Multi-Conditional,
multi-way conditionally
trained
Conditionally-trained,
to predict class labels
Harmonium, joint,
with class labels
and words
Harmonium, joint with
words, no labels
Outline
Discovering Latent Structure in Multiple Modalities
• ✓ Groups & Text (Group-Topic Model, GT)
• ✓ Nested Correlations (Pachinko Allocation, PAM)
• ✓ Time & Text (Topics-over-Time Model, TOT)
• ✓ Time & Text with Nested Correlations (PAM-TOT)
• ✓ Multi-Conditional Mixtures
Summary