Predictively Modeling Social Text
William W. Cohen
Machine Learning Dept. and Language Technologies Institute
School of Computer Science
Carnegie Mellon University
Joint work with: Amr Ahmed, Andrew Arnold, Ramnath Balasubramanyan, Frank
Lin, Matt Hurst (MSFT), Ramesh Nallapati, Noah Smith, Eric Xing, Tae Yano
Newswire Text
• Formal
• Primary purpose:
– Inform “typical reader” about recent events
• Broad audience:
– Explicitly establish shared context with reader
– Ambiguity often avoided

Social Media Text
• Informal
• Many purposes:
– Entertain, connect, persuade…
• Narrow audience:
– Friends and colleagues
– Shared context already established
– Many statements are ambiguous out of social context
Newswire Text
• Goals of analysis:
– Extract information about events from text
– “Understanding” text requires understanding the “typical reader”
• conventions for communicating with him/her
• prior knowledge, background, …

Social Media Text
• Goals of analysis:
– Very diverse
– Evaluation is difficult
• and requires revisiting often as goals evolve
– Often “understanding” social text requires understanding a community
Outline
• Tools for analysis of text
– Probabilistic models for text, communities, and time
• Mixture models and LDA models for text
• LDA extensions to model hyperlink structure
• LDA extensions to model time
– Alternative framework based on graph analysis to model time & community
• Preliminary results & tradeoffs
• Discussion of results & challenges
Introduction to Topic Models
• Multinomial Naïve Bayes
[Plate diagram: class node C (here “football”) generates words W1, W2, W3, …, WN (“The Pittsburgh Steelers won …”); parameters π and β. The box, repeated M times, is shorthand for many repetitions of the structure.]
Introduction to Topic Models
• Multinomial Naïve Bayes
[Same plate diagram, now with class C = “politics” generating “The Pittsburgh mayor stated …”.]
Introduction to Topic Models
• Naïve Bayes Model: compact representation
[Plate diagram: the word nodes W1, …, WN are collapsed into a single node W inside a plate of size N, nested in the document plate of size M; parameters π and β as before.]
Introduction to Topic Models
• Multinomial Naïve Bayes
• For each document d = 1,…,M
– Generate Cd ~ Mult( · | π)
– For each position n = 1,…,Nd
• Generate wn ~ Mult( · | β, Cd)

Example, for document d = 1:
• Generate Cd ~ Mult( · | π) = ‘football’
• For each position n = 1,…,Nd = 67:
– Generate w1 ~ Mult( · | β, Cd) = ‘the’
– Generate w2 = ‘Pittsburgh’
– Generate w3 = ‘Steelers’
– …
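A minimal sketch of this generative story in Python with numpy; the class prior pi, the per-class word distributions beta, and the toy vocabulary are illustrative assumptions, not values from the talk:

import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "pittsburgh", "steelers", "mayor", "won", "stated"]
classes = ["football", "politics"]

pi = np.array([0.5, 0.5])            # class prior (pi), assumed uniform here
beta = np.array([                    # per-class word distributions (beta)
    [0.3, 0.2, 0.3, 0.0, 0.2, 0.0],  # 'football' favors {the, pittsburgh, steelers, won}
    [0.3, 0.2, 0.0, 0.3, 0.0, 0.2],  # 'politics' favors {the, pittsburgh, mayor, stated}
])

def generate_document(n_words):
    c = rng.choice(len(classes), p=pi)                  # C_d ~ Mult(. | pi)
    words = rng.choice(vocab, size=n_words, p=beta[c])  # w_n ~ Mult(. | beta, C_d)
    return classes[c], list(words)

print(generate_document(5))  # e.g. ('football', ['the', 'steelers', ...])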
Introduction to Topic Models
• Multinomial Naïve Bayes
In the graphs:
• shaded circles are known values
• parents of variable W are the inputs to the function used in generating W
Goal: given the known values, estimate the rest, usually by choosing parameters (π, β) that maximize the probability of the observations.
Introduction to Topic Models
• Mixture model: unsupervised naïve Bayes model
• Joint probability of words and classes:
Pr(Zd = z, w1, …, wNd) = πz ∏n β(wn | z)
• But classes are not visible, so we work with the marginal:
Pr(w1, …, wNd) = Σz πz ∏n β(wn | z)
[Plate diagram as before, with the class node Z now hidden.]
Introduction to Topic Models
• Learning for naïve Bayes:
– Take logs: the objective is well-behaved (concave) and easy to optimize for any parameter
• Learning for mixture model:
– Many local maxima (at least one for each permutation of classes)
– Expectation/maximization is the most common method
Introduction to Topic Models
• Mixture model: EM solution
E-step: estimate the expected values of the unknown variables (“soft classification”):
q(z | d) ∝ πz ∏n β(wn | z)
M-step: maximize the values of the parameters subject to this guess; usually, this is learning the parameter values given the “soft classifications”
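A compact sketch of this EM loop in numpy for the mixture of multinomials, assuming each document is given as a row of word counts; the smoothing constant and random initialization are illustrative choices:

import numpy as np

def em_mixture_of_multinomials(X, k, n_iter=50, eps=1e-9, seed=0):
    """X: (n_docs, vocab_size) word-count matrix. Returns (pi, beta)."""
    rng = np.random.default_rng(seed)
    n_docs, vocab_size = X.shape
    pi = np.full(k, 1.0 / k)                      # class prior
    beta = rng.dirichlet(np.ones(vocab_size), k)  # per-class word distributions
    for _ in range(n_iter):
        # E-step: soft classifications q[d, z] = P(z | doc d), computed in log space
        log_q = np.log(pi + eps) + X @ np.log(beta + eps).T
        log_q -= log_q.max(axis=1, keepdims=True)  # stabilize before exponentiating
        q = np.exp(log_q)
        q /= q.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi and beta from the soft classifications
        pi = q.sum(axis=0) / n_docs
        beta = q.T @ X + eps
        beta /= beta.sum(axis=1, keepdims=True)
    return pi, beta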
Introduction to Topic Models
• Probabilistic Latent Semantic Analysis Model
• Select document d ~ Mult( · )
• For each position n = 1,…,Nd
– generate zn ~ Mult( · | θd)
– generate wn ~ Mult( · | βzn)
[Plate diagram: document node d, per-position topic z drawn from the document’s topic distribution θd, word w drawn from βz.]

Mixture model: each document is generated by a single (unknown) multinomial distribution of words; the corpus is “mixed” by π.
PLSA model: each word is generated by a single unknown multinomial distribution of words; each document is mixed by θd.
Introduction to Topic Models
[Blei, Ng & Jordan, JMLR 2003]
Introduction to Topic Models
• Latent Dirichlet Allocation
• For each document d = 1,…,M
– Generate θd ~ Dir( · | α)
– For each position n = 1,…,Nd
• generate zn ~ Mult( · | θd)
• generate wn ~ Mult( · | βzn)
[Plate diagram: α → θd → zn → wn ← β, with plates N (positions) and M (documents).]
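The same generative story as runnable Python; the priors and sizes below are placeholders for illustration:

import numpy as np

rng = np.random.default_rng(0)

def generate_lda_corpus(M, doc_len, alpha, beta):
    """alpha: (K,) Dirichlet parameter; beta: (K, V) topic-word distributions."""
    K, V = beta.shape
    corpus = []
    for _ in range(M):
        theta = rng.dirichlet(alpha)                 # theta_d ~ Dir(alpha)
        z = rng.choice(K, size=doc_len, p=theta)     # z_n ~ Mult(theta_d)
        w = [rng.choice(V, p=beta[zn]) for zn in z]  # w_n ~ Mult(beta_{z_n})
        corpus.append((z, w))
    return corpus

# e.g. K=3 topics over a 10-word vocabulary:
# corpus = generate_lda_corpus(M=5, doc_len=20, alpha=np.ones(3),
#                              beta=rng.dirichlet(np.ones(10), 3))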
Introduction to Topic Models
• LDA’s view of a document
Introduction to Topic Models
• LDA topics
Introduction to Topic Models
• Latent Dirichlet Allocation
– Overcomes some technical issues with PLSA
• PLSA only estimates mixing parameters for training docs
– Parameter learning is more complicated:
• Gibbs Sampling: easy to program, often slow
• Variational EM
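A bare-bones sketch of the “easy to program, often slow” option, collapsed Gibbs sampling; the hyperparameters and iteration count are illustrative, and a real run would also use burn-in and sample averaging:

import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, eta=0.01, n_iter=200, seed=0):
    """docs: list of lists of word ids in 0..V-1. Returns assignments and counts."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))  # topic counts per document
    n_kv = np.zeros((K, V))          # word counts per topic
    n_k = np.zeros(K)                # total words per topic
    z = [rng.integers(K, size=len(doc)) for doc in docs]  # random initialization
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1; n_kv[k, w] += 1; n_k[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]          # remove this token's current assignment
                n_dk[d, k] -= 1; n_kv[k, w] -= 1; n_k[k] -= 1
                # collapsed conditional: P(z_i = k | all other assignments)
                p = (n_dk[d] + alpha) * (n_kv[:, w] + eta) / (n_k + V * eta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k          # record the new assignment
                n_dk[d, k] += 1; n_kv[k, w] += 1; n_k[k] += 1
    return z, n_dk, n_kv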
Introduction to Topic Models
• Perplexity comparison of various models
[Figure: perplexity vs. number of topics for several models, from Unigram down to LDA; lower is better.]
Introduction to Topic Models
• Prediction accuracy for classification using learning with topic models as features
[Figure: accuracy curves; higher is better.]
Outline
• Tools for analysis of text
– Probabilistic models for text, communities, and time
• Mixture models and LDA models for text
• LDA extensions to model hyperlink structure
• LDA extensions to model time
– Alternative framework based on graph analysis to model time & community
• Preliminary results & tradeoffs
• Discussion of results & challenges
Hyperlink modeling using LDA
Hyperlink modeling using LinkLDA
[Erosheva, Fienberg, Lafferty, PNAS, 2004]
• For each document d = 1,…,M
– Generate θd ~ Dir( · | α)
– For each position n = 1,…,Nd
• generate zn ~ Mult( · | θd)
• generate wn ~ Mult( · | βzn)
– For each citation j = 1,…,Ld
• generate zj ~ Mult( · | θd)
• generate cj ~ Mult( · | γzj)
Learning using variational EM
[Plate diagram: the word plate N and the citation plate L both hang off the same θd.]
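A sketch of how the citation side extends the LDA loop; gamma, the per-topic distribution over citeable documents, and the sizes are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)

def generate_linklda_doc(alpha, beta, gamma, n_words, n_links):
    """beta: (K, V) topic-word dists; gamma: (K, D) topic-citation dists."""
    K, V = beta.shape
    theta = rng.dirichlet(alpha)                    # theta_d ~ Dir(alpha)
    zw = rng.choice(K, size=n_words, p=theta)
    words = [rng.choice(V, p=beta[k]) for k in zw]  # w_n ~ Mult(beta_{z_n})
    zc = rng.choice(K, size=n_links, p=theta)       # citations reuse the same theta_d
    cites = [rng.choice(gamma.shape[1], p=gamma[k]) for k in zc]  # c_j ~ Mult(gamma_{z_j})
    return words, cites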
Hyperlink modeling using LDA
[Erosheva, Fienberg, Lafferty, PNAS, 2004]
Newswire Text
• Goals of analysis:
– Extract information about events from text
– “Understanding” text requires understanding the “typical reader”
• conventions for communicating with him/her
• prior knowledge, background, …

Social Media Text
• Goals of analysis:
– Very diverse
– Evaluation is difficult
• And requires revisiting often as goals evolve
– Often “understanding” social text requires understanding a community

Science as a testbed for social text: an open community which we understand
Models of hypertext for blogs
[ICWSM 2008]
Ramesh Nallapati, Amr Ahmed, Eric Xing, me
Link-PLSA-LDA:
• LinkLDA model for citing documents
• Variant of PLSA model for cited documents
• Topics are shared between citing and cited documents
• Links depend on the topics in the two documents
Experiments
• 8.4M blog postings in Nielsen/Buzzmetrics corpus
– Collected over three weeks in summer 2005
• Selected all postings with ≥2 in-links or ≥2 out-links
– 2248 citing documents (2+ out-links), 1777 cited documents (2+ in-links)
– Only 68 documents in both sets; these are duplicated
• Fit model using variational EM
Topics in blogs
The model can answer questions like: which blogs are most likely to be cited when discussing topic z?
Topics in blogs
The model can be evaluated by predicting which links an author will include in an article.
[Figure: link-prediction perplexity for Link-LDA vs. Link-PLSA-LDA; lower is better.]
Another model: Pairwise Link-LDA
• LDA for both cited and citing documents
• Generate an indicator for every pair of docs
– vs. generating pairs of docs
• Link depends on the mixing components (θ’s)
• stochastic block model
[Plate diagram: two LDA document models whose topic mixtures jointly generate a link indicator c for each pair of documents.]
Pairwise Link-LDA supports new inferences…
…but doesn’t perform better on link prediction
Outline
• Tools for analysis of text
– Probabilistic models for text, communities, and time
• Mixture models and LDA models for text
• LDA extensions to model hyperlink structure
– Observation: these models can be used for many purposes…
• LDA extensions to model time
– Alternative framework based on graph analysis to model time & community
• Discussion of results & challenges
Predicting Response to Political Blog
Posts with Topic Models [NAACL ’09]
Tae Yano
Noah Smith
Political blogs and comments
Posts are often coupled with comment sections. Comment style is casual, creative, less carefully edited.
Political blogs and comments
• Most of the text associated with large “A-list” community blogs is comments
– 5-20x as many words in comments as in post text for the 5 sites considered in Yano et al.
• A large part of socially-created commentary in the blogosphere is comments
– Not blog→blog hyperlinks
• Comments do not just echo the post
Modeling political blogs
Our political blog model: CommentLDA
z, z′ = topic; w = word (in post); w′ = word (in comments); u = user
D = # of documents; N = # of words in post; M = # of words in comments
Modeling political blogs
Our proposed political blog model: CommentLDA
The LHS is vanilla LDA.
(D = # of documents; N = # of words in post; M = # of words in comments)
Modeling political blogs
Our proposed political blog model: CommentLDA
The RHS captures the generation of the reaction separately from the post body:
• the two chambers share the same topic mixture
• but use two separate sets of word distributions
Modeling political blogs
Our proposed political blog model: CommentLDA
User IDs of the commenters are treated as part of the comment text, and generate the words in the comment section.
Modeling political blogs
Another model we tried: CommentLDA with the words taken out of the comment section!
This model is agnostic to the words in the comment section, and is structurally equivalent to LinkLDA from (Erosheva et al., 2004).
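A generative sketch of the two-chamber idea: one topic mixture theta_d shared between post and comments, with two separate output distributions. Whether the comment side emits comment words, commenter IDs, or both is exactly the design choice that distinguishes CommentLDA from its LinkLDA-style variant; the names and sizes here are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)

def generate_post_and_comments(alpha, beta_post, beta_cmnt, n_post, n_cmnt):
    """beta_post: (K, V) post-word dists; beta_cmnt: (K, V') comment-side dists
    (over comment words, or over commenter IDs in the LinkLDA-style variant)."""
    K = beta_post.shape[0]
    theta = rng.dirichlet(alpha)              # one topic mixture for both chambers
    z = rng.choice(K, size=n_post, p=theta)
    post = [rng.choice(beta_post.shape[1], p=beta_post[k]) for k in z]
    z2 = rng.choice(K, size=n_cmnt, p=theta)  # comments share theta_d ...
    cmnt = [rng.choice(beta_cmnt.shape[1], p=beta_cmnt[k]) for k in z2]  # ... not beta
    return post, cmnt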
Topic discovery - Matthew Yglesias (MY) site
[Figures: example topics discovered on the MY site.]
Comment prediction
[Results, one panel per site (user prediction, precision at top 10): MY: 20.54% (Comment LDA (R)); RS: 16.92% (Link LDA (R)); CB: 32.06% (Link LDA (C)). Bars from left to right: Link LDA (-v, -r, -c), Comment LDA (-v, -r, -c), Baseline (Freq, NB).]
• LinkLDA and CommentLDA consistently outperform baseline models
• Neither consistently outperforms the other
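The user-prediction metric above is precision at top 10; a tiny helper of the kind one might use to compute it, assuming the model outputs a ranked list of candidate commenters per post (the names below are hypothetical):

def precision_at_k(ranked_users, true_users, k=10):
    """Fraction of the top-k predicted commenters who actually commented."""
    truth = set(true_users)
    return sum(u in truth for u in ranked_users[:k]) / k

# averaged over posts:
# score = sum(precision_at_k(r, t) for r, t in zip(rankings, gold)) / len(gold)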
From Episodes to Sagas:
Temporally Clustering News Via Social-Media Commentary [current work]
Ramnath Balasubramanyan, Frank Lin, Matthew Hurst, Noah Smith
Motivation
• News-related blogosphere is driven by recency
• Some recent news is better understood based on the context of a sequence of related stories
• Some readers have this context; some don’t
• To reconstruct the context, reconstruct the sequence of related stories (the “saga”)
– Similar to retrospective event detection
• First efforts:
– Find related stories
– Cluster by time
• Evaluation: agreement with human annotators
Clustering results on Democratic-primary-related documents
• k-walks (more later)
• SpeCluster + time: mixture of multinomials, plus a model for “general” text, plus a timestamp drawn from a Gaussian
Clustering results on Democratic-primary-related documents
Also had three human annotators build gold-standard timelines:
• hierarchical
• annotated with names of events, times, …
A machine-produced timeline can be evaluated by its tree-distance to a gold-standard one.
Clustering results on Democratic-primary-related documents
Issue: divergence of opinion with human annotators
• Is modeling community interests the problem?
• How much of what we want is actually in the data?
• Should this task be supervised or unsupervised?
More sophisticated time models
• Hierarchical LDA Over Time (HOTS) model:
– LDA to generate text
– Also generates a timestamp for each document from topic-specific Gaussians
– “Non-parametric” model
• the number of clusters is also generated (not specified by the user)
– Allows use of user-provided “prototypes”
– Evaluated on liberal/conservative blogs and ML papers from NIPS conferences
Ramnath Balasubramanyan
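A sketch of the timestamp side of this idea: each topic carries its own Gaussian over time, and a document’s timestamp is drawn from one of its topics. This is a simplification for illustration; the actual model is hierarchical and non-parametric, and the names and numbers below are assumptions:

import numpy as np

rng = np.random.default_rng(0)

def generate_timestamp(theta, mu, sigma):
    """theta: (K,) document topic mixture; mu, sigma: (K,) per-topic time Gaussians."""
    k = rng.choice(len(theta), p=theta)  # pick one of the document's topics
    return rng.normal(mu[k], sigma[k])   # t_d ~ N(mu_k, sigma_k^2)

# e.g. one topic centered early in the year, another in late summer (days of year):
# t = generate_timestamp(theta=np.array([0.7, 0.3]),
#                        mu=np.array([45.0, 240.0]), sigma=np.array([10.0, 5.0]))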
Results with HOTS model - unsupervised
Results with HOTS model - human guidance
• Adding human seeds for some key events improves performance on all events
• Allows a user to partially specify a timeline of events and have the system complete it
Outline
• Tools for analysis of text
– Probabilistic models for text, communities, and time
• Mixture models and LDA models for text
• LDA extensions to model hyperlink structure
• LDA extensions to model time
– Alternative framework based on graph analysis to model time & community
• Preliminary results & tradeoffs
• Discussion of results & challenges
Spectral Clustering: Graph = Matrix
Vector = Node → Weight
[Figure: a 10-node graph A-J drawn next to its adjacency matrix M (rows/columns A-J, a 1 for each edge, _ on the diagonal) and a vector v assigning each node a weight, e.g. A = 3, B = 2, C = 3, …]
Spectral Clustering: Graph = Matrix
M * v1 = v2 “propagates weights from neighbors”
[Figure: multiplying M by v1 replaces each node’s weight with the sum of its neighbors’ weights, e.g. A ← 2·1 + 3·1 + 0·1, B ← 3·1 + 3·1, C ← 3·1 + 2·1, …]
Spectral Clustering: Graph = Matrix
M * v1 = v2 “propagates weights from neighbors”
[Figure: the result of the multiplication, e.g. A = 5, B = 6, C = 5, …]
Spectral Clustering: Graph = Matrix
W * v1 = v2 “propagates weights from neighbors”
W v = λ v : v is an eigenvector with eigenvalue λ
[Figure (from Shi & Meila, 2002): nodes plotted by their coordinates in eigenvectors e1, e2, e3; points from the three classes (x, y, z) fall into separate, roughly constant bands.]
Spectral Clustering
If W is the row-normalized adjacency matrix for a connected graph with k closely-connected subcommunities, then:
• the “top” eigenvector is a constant vector
• the next k eigenvectors are roughly piecewise constant, with “pieces” corresponding to the subcommunities
Spectral clustering:
• Find the top k+1 eigenvectors v1,…,vk+1
• Discard the “top” one
• Replace every node a with the k-dimensional vector xa = <v2(a),…,vk+1(a)>
• Cluster with k-means
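The recipe above in numpy; this is a sketch (dense eigendecomposition, a tiny inline k-means) rather than a production implementation:

import numpy as np

def spectral_cluster(A, k, n_iter=100, seed=0):
    """A: symmetric adjacency matrix (n x n). Returns a cluster label per node."""
    rng = np.random.default_rng(seed)
    W = A / A.sum(axis=1, keepdims=True)  # row-normalized adjacency matrix
    vals, vecs = np.linalg.eig(W)
    order = np.argsort(-vals.real)        # eigenvectors, largest eigenvalue first
    X = vecs[:, order[1:k + 1]].real      # discard the top (constant) eigenvector
    # k-means on the k-dimensional node embeddings
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                            else centers[c] for c in range(k)])
    return labels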
Spectral Clustering: Pros and Cons
• Elegant, and well-founded mathematically
• Works quite well when relations are approximately
transitive (like similarity, social connections)
• Expensive for very large datasets
– Computing eigenvectors is the bottleneck
• Noisy datasets cause problems
– “Informative” eigenvectors need not be in top few
– Performance can drop suddenly from good to terrible
Experimental results:
best-case assignment of class labels to clusters
Adamic & Glance
“Divided They Blog:…” 2004
Spectral Clustering: Graph = Matrix
M * v1 = v2 “propagates weights from neighbors”
[Figure repeated from earlier: multiplying M by v1 gives A = 5, B = 6, C = 5, …]
Repeated averaging with neighbors as a clustering method
• Pick a vector v0 (maybe even at random)
• Compute v1 = Wv0
– i.e., replace v0[x] with the weighted average of v0[y] for the neighbors y of x
• Plot v1[x] for each x
• Repeat for v2, v3, …
• What are the dynamics of this process?
Repeated averaging with neighbors on a sample problem…
[Figure: node values spreading from small to larger over the iterations.]
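The process in a few lines of numpy; the rescaling step is one common choice (an assumption here) to keep the iterates comparable across steps:

import numpy as np

def repeated_averaging(W, v0, n_steps):
    """W: row-normalized adjacency matrix, so W @ v replaces v[x] with the
    weighted average of v[y] over the neighbors y of x."""
    vs = [v0]
    for _ in range(n_steps):
        v = W @ vs[-1]
        v = v / np.abs(v).sum()  # rescale so the values do not shrink away
        vs.append(v)
    return vs                    # plot each vs[t] to watch the dynamics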
PIC: Power Iteration Clustering
Run power iteration (repeated averaging with neighbors) with early stopping.
Frank Lin
• Formally, can show this works when spectral techniques work
• Experimentally, linear time
• Easy to implement and efficient
• Very easily parallelized
• Experimentally, often better than traditional spectral methods
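A sketch of the idea under stated assumptions: power iteration on the row-normalized adjacency matrix, stopped early when the per-step change flattens out (a stand-in for the published stopping rule), then a 1-D k-means over the node values:

import numpy as np

def pic(A, k, eps=1e-5, max_iter=1000, seed=0):
    """Power-iteration clustering sketch. A: adjacency matrix; returns labels."""
    rng = np.random.default_rng(seed)
    W = A / A.sum(axis=1, keepdims=True)
    v = rng.random(A.shape[0])
    v /= np.abs(v).sum()
    prev_delta = np.inf
    for _ in range(max_iter):
        v_new = W @ v
        v_new /= np.abs(v_new).sum()
        delta = np.abs(v_new - v).max()
        v = v_new
        if abs(prev_delta - delta) < eps:  # stop while v is still piecewise constant
            break
        prev_delta = delta
    centers = np.linspace(v.min(), v.max(), k)  # 1-D k-means over node values
    for _ in range(100):
        labels = np.argmin(np.abs(v[:, None] - centers[None]), axis=1)
        centers = np.array([v[labels == c].mean() if np.any(labels == c)
                            else centers[c] for c in range(k)])
    return labels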
Experimental results:
best-case assignment of class labels to clusters
Experiments: run time and scalability
[Figure: run time in milliseconds vs. dataset size.]
Clustering results on Democratic-primary-related documents
k-walks is an early version of PIC:
• cluster a graph with several types of nodes: blog entries, news stories, and dates
• clusters (“communities”) of the graph correspond to events in the “saga”
Advantage:
• PIC clusters at interactive speeds
[Figure: k-walks timelines vs. human annotators.]
Outline
• Tools for analysis of text
– Probabilistic models for text, communities, and time
• Mixture models and LDA models for text
• LDA extensions to model hyperlink structure
• LDA extensions to model time
– Alternative framework based on graph analysis to model time & community
• Preliminary results & tradeoffs
• Discussion of results & challenges
Social Media Text
• Goals of analysis:
– Very diverse
– Evaluation is difficult
• And requires revisiting often as goals evolve
– Often “understanding” social text requires understanding a community

Comments
• Probabilistic models
– can model many aspects of social text
• community (links, comments)
• time
• Evaluation:
– introspective, qualitative on communities we understand
• scientific communities
– quantitative on predictive tasks
• link prediction, user prediction, …
– against “gold-standard visualization” (sagas)
Thanks to…
• NIH/NIGMS
• NSF
• Microsoft LiveLabs
• Microsoft Research
• Johnson & Johnson
• Language Technology Inst, CMU