Adaptation of Graph-Based Semi-Supervised
Methods to Large-Scale Text Data
Frank Lin and William W. Cohen
School of Computer Science, Carnegie Mellon University
KDD-MLG 2011
2011-08-21, San Diego, CA, USA
Overview
• Graph-based SSL
• The Problem with Text Data
• Implicit Manifolds
– Two GSSL Methods
– Three Similarity Functions
• A Framework
• Results
• Related Work
Graph-based SSL
• Graph-based semi-supervised learning methods
can often be viewed as methods for propagating
labels along edges of a graph.
• Naturally they are a good fit for network data
• Labeled: a few nodes of class A and a few of class B
• How to label the rest of the nodes in the graph?
The Problem with Text Data
• Documents are often represented as feature
vectors of words:
• Example documents:
"The importance of a Web page is an inherently subjective matter, which depends on the readers…"
"In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use…"
"You're not cool just because you have a lot of followers on twitter, get over yourself…"
• Their word-count feature vectors:
        cool  web  search  make  over  you
Doc 1:     0    4       8     2     5    3
Doc 2:     0    8       7     4     3    2
Doc 3:     1    0       0     0     1    2
The Problem with Text Data
• Feature vectors are often sparse
• But the similarity matrix is not!
– Feature matrix: mostly zeros; any document contains only a small fraction of the vocabulary
– Similarity matrix: mostly non-zero; any two documents are likely to have a word in common
The Problem with Text Data
• A similarity matrix is the input to many GSSL methods
• How can we apply GSSL methods efficiently?
– O(n²) time to construct
– O(n²) space to store
– > O(n²) time to operate on
Too expensive! Does not scale up to big datasets!
The Problem with Text Data
• Solutions:
1. Make the matrix sparse (a lot of cool work has gone into how to do this…)
2. Implicit manifolds (this is what we'll talk about!)
(If you do SSL on large-scale non-network data, there is a good chance you are using this already…)
Implicit Manifolds
What do you
mean by…?
• A pair-wise similarity matrix is a manifold on which the data points "reside".
• It is implicit because we never explicitly construct the manifold (the pair-wise similarity matrix), although the results we get are exactly the same as if we did.
Sounds too
good. What’s
the catch?
Implicit Manifolds
• Two requirements for using implicit manifolds
on text (-like) data:
1. The GSSL method can be implemented with
matrix-vector multiplications
2. We can decompose the dense similarity matrix
into a series of sparse matrix-matrix
multiplications
As long as they are met, we can
obtain the exact same results
without ever constructing a
similarity matrix!
Let’s look at some
specifics, starting with
two GSSL methods
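As a small illustration of the second requirement (the data here is made up, not from the paper), multiplying a vector by the dense similarity matrix S = FFᵀ can be reordered into two sparse matrix-vector products, so S is never materialized:

```python
import numpy as np
from scipy import sparse

# Sparse document-feature matrix F (3 documents x 6 vocabulary terms).
F = sparse.csr_matrix(np.array([
    [0, 4, 8, 2, 5, 3],
    [0, 8, 7, 4, 3, 2],
    [1, 0, 0, 0, 1, 2],
], dtype=float))

v = np.array([1.0, 0.0, 0.0])    # a label/score vector over documents

# Explicit manifold: build the dense n x n similarity matrix, then multiply.
S = (F @ F.T).toarray()          # O(n^2) space -- what we want to avoid
explicit = S @ v

# Implicit manifold: reorder the parentheses; S is never constructed.
implicit = F @ (F.T @ v)         # two sparse matrix-vector products

assert np.allclose(explicit, implicit)
```

For n documents with a bounded number of non-zero features each, the reordered form costs O(n) time and space per multiplication instead of O(n²).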
Two GSSL Methods
• Harmonic functions method (HF)
• MultiRankWalk (MRW)
If you've never heard of them… you might have heard of one of their cousins:
MRW's cousins (propagation via forward random walks, with restart):
– Partially labeled classification using Markov random walks (Szummer & Jaakkola 2001)
– Learning with local and global consistency (Zhou et al. 2004)
– Graph-based SSL as a generative model (He et al. 2007)
– Ghost edges for classification (Gallagher et al. 2008)
– and others…
HF's cousins (propagation via backward random walks):
– Gaussian fields and harmonic functions classifier (Zhu et al. 2003)
– Weighted-voted relational network classifier (Macskassy & Provost 2007)
– Weakly-supervised classification via random walks (Talukdar et al. 2008)
– Adsorption (Baluja et al. 2008)
– Learning on diffusion maps (Lafon & Lee 2006)
– and others…
Two GSSL Methods
• Harmonic functions method (HF)
• MultiRankWalk (MRW)
Let's look at their implicit manifold qualification…
• In both of their iterative implementations, the core computation is matrix-vector multiplication!
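As a rough sketch of that claim (a hypothetical toy graph, not the paper's implementation), one iteration of each method is a single matrix-vector product:

```python
import numpy as np

# Row-normalized adjacency matrix of a tiny 4-node graph (hypothetical).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W = A / A.sum(axis=1, keepdims=True)

v = np.array([1.0, 0.0, 0.0, 0.0])   # current scores; node 0 is a seed
u = np.array([1.0, 0.0, 0.0, 0.0])   # seed (restart) distribution

# Harmonic functions step: average the neighbors, then clamp labeled nodes.
hf = W @ v
hf[0] = 1.0                          # labeled nodes keep their labels

# MultiRankWalk step: random walk with restart, damping d.
d = 0.85
mrw = (1 - d) * u + d * (W.T @ v)

# Either way, the core computation is one matrix-vector multiplication.
```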
Three Similarity Functions
• Inner product similarity: simple, good for binary features
• Cosine similarity: often used for document categorization
• Bipartite graph walk: can be viewed as a walk on the document-feature bipartite graph, or as feature reweighting by relative frequency; related to TF-IDF term weighting…
Side note: perhaps all the cool work on matrix factorization could be useful here too…
How about a similarity function we actually use for text data?
Putting Them Together
• Example: HF + inner product similarity: S = F F^T
• The diagonal matrix D can be calculated as D(i,i) = d(i), where d = F F^T 1
• Iteration update becomes: v^(t+1) = D^(-1) (F (F^T v^(t)))   (the parentheses are important!)
• Construction: O(n); Storage: O(n); Operation: O(n)
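A minimal sketch of one HF-style iteration with inner-product similarity (hypothetical feature counts; variable names are ours, not the paper's): d = F Fᵀ 1 and the update are both computed without ever forming F Fᵀ:

```python
import numpy as np
from scipy import sparse

# Sparse feature matrix F (documents x terms); hypothetical counts.
F = sparse.csr_matrix(np.array([
    [0, 4, 8, 2],
    [0, 8, 7, 4],
    [1, 0, 0, 1],
], dtype=float))

ones = np.ones(F.shape[0])
d = F @ (F.T @ ones)             # d = F F^T 1, computed sparsely
D_inv = 1.0 / d                  # diagonal of D^{-1}

v = np.array([1.0, 0.0, 0.0])

# One update: v <- D^{-1} (F (F^T v)). The parentheses matter:
# (F F^T) v would materialize the dense similarity matrix first.
v_next = D_inv * (F @ (F.T @ v))
```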
Putting Them Together
• Example: HF + cosine similarity: S = N_c F F^T N_c, where N_c is the diagonal cosine-normalizing matrix
• Iteration update: v^(t+1) = D^(-1) (N_c (F (F^T (N_c v^(t)))))
• The diagonal matrix D can be calculated as D(i,i) = d(i), where d = N_c F F^T N_c 1
• Compact storage: we don't need a cosine-normalized version of the feature vectors
• Construction: O(n); Storage: O(n); Operation: O(n)
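The cosine variant can be sketched the same way (again with hypothetical data): N_c holds the inverse Euclidean norms of the feature vectors, so cosine normalization becomes elementwise scaling before and after the sparse products:

```python
import numpy as np
from scipy import sparse

# Sparse feature matrix F (documents x terms); hypothetical counts.
F = sparse.csr_matrix(np.array([
    [0, 4, 8, 2],
    [0, 8, 7, 4],
    [1, 0, 0, 1],
], dtype=float))

# Diagonal cosine-normalizing matrix N_c: 1 / ||F(i,:)||_2 per document.
norms = np.sqrt(np.asarray(F.multiply(F).sum(axis=1)).ravel())
n_c = 1.0 / norms

ones = np.ones(F.shape[0])
d = n_c * (F @ (F.T @ (n_c * ones)))   # d = N_c F F^T N_c 1
D_inv = 1.0 / d

v = np.array([1.0, 0.0, 0.0])
# One update: v <- D^{-1} (N_c (F (F^T (N_c v)))); we never store a
# cosine-normalized copy of the feature vectors.
v_next = D_inv * (n_c * (F @ (F.T @ (n_c * v))))
```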
A Framework
• So what’s the point?
1. Towards a principled way for researchers to apply
GSSL methods to text data, and the conditions under
which they can do this efficiently
2. Towards a framework on which researchers can develop and discover new methods (and recognize old ones)
3. Building an SSL tool set: pick one that works for you, according to all the great work that has been done
A Framework
Hmm… I have a large dataset with very few training labels. What should I try?
How about MultiRankWalk with a low restart probability?
Choose your SSL Method…
• Harmonic Functions
• MultiRankWalk
A Framework
But the documents in my dataset are kinda long…
Can't go wrong with cosine similarity!
… and pick your similarity function
• Inner Product
• Cosine Similarity
• Bipartite Graph Walk
Results
• On 2 commonly used document categorization datasets
• On 2 less common NP classification datasets
• Goal: to show that they work on large text datasets, with results consistent with what we know about these SSL methods and similarity functions
Results
• Document categorization
Results
• Noun phrase classification dataset:
Questions?
Additional Information
MRW: RWR for Classification
We refer to this method as MultiRankWalk: it classifies data with multiple rankings obtained via random walks with restart.
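One way to sketch that idea (hypothetical graph and seeds; the damping value and iteration count are arbitrary choices, not the paper's settings): run one random walk with restart per class, restarting at that class's seeds, and label each node by its highest-scoring walk:

```python
import numpy as np

# Column-normalized adjacency of a small hypothetical 5-node graph.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
W = A / A.sum(axis=0, keepdims=True)

seeds = {"A": [0], "B": [4]}     # seed node indices per class
d = 0.85                         # damping (1 - restart probability)

rankings = {}
for label, idx in seeds.items():
    u = np.zeros(len(A))
    u[idx] = 1.0 / len(idx)      # restart distribution for this class
    v = u.copy()
    for _ in range(100):         # power iteration toward the fixpoint
        v = (1 - d) * u + d * (W @ v)
    rankings[label] = v

# Each node gets the label whose walk scores it highest.
labels = [max(rankings, key=lambda c: rankings[c][i]) for i in range(len(A))]
```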
MRW: Seed Preference
• Obtaining labels for data points is expensive
• We want to minimize cost for obtaining labels
• Observations:
– Some labels are inherently more useful than others
– Some labels are easier to obtain than others
Question: “Authoritative” or “popular” nodes
in a network are typically easier to obtain
labels for. But are these labels also more
useful than others?
Seed Preference
• Consider the task of giving a human expert (or posting jobs on Amazon Mechanical Turk) a list of data points to label
• The list (seeds) can be generated uniformly at random, or we can have a seed preference, according to simple properties of the unlabeled data
• We consider 3 preferences:
– Random
– Link Count (nodes with the highest link counts make the list)
– PageRank (nodes with the highest PageRank scores make the list)
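The three preferences can be sketched as follows (hypothetical graph; the PageRank here is a plain power iteration with damping 0.85, an assumption rather than the paper's exact setup):

```python
import numpy as np

# Undirected hypothetical graph as an adjacency matrix.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
k = 2                                    # number of seeds to label
rng = np.random.default_rng(0)

# Random: a uniform sample of nodes.
random_seeds = rng.choice(len(A), size=k, replace=False)

# Link count: nodes with the highest degree make the list.
degree = A.sum(axis=1)
link_count_seeds = np.argsort(-degree)[:k]

# PageRank: nodes with the highest PageRank scores make the list.
W = A / A.sum(axis=0, keepdims=True)     # column-normalized for the walk
pr = np.full(len(A), 1.0 / len(A))
for _ in range(100):
    pr = 0.15 / len(A) + 0.85 * (W @ pr)
pagerank_seeds = np.argsort(-pr)[:k]
```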
MRW: The Question
• What really makes MRW and wvRN different?
• Network-based SSL often boils down to label propagation.
• MRW and wvRN represent two general propagation methods; note that they are called by many names:
– MRW: random walk with restart; regularized random walk; personalized PageRank
– wvRN: reverse random walk; random walk with sink nodes; hitting time; local & global consistency; harmonic functions on graphs; iterative averaging of neighbors
Great… but we still don't know why their behavior differs on these network datasets!
MRW: The Question
• It’s difficult to answer exactly why MRW does better
with a smaller number of seeds.
• But we can gather probable factors from their
propagation models:
1. MRW is centrality-sensitive; wvRN is centrality-insensitive
2. MRW has exponential drop-off (a damping factor); wvRN has no drop-off / damping
3. MRW propagates different classes independently; in wvRN the propagation of different classes interacts
(1. Centrality-sensitive: seeds have different scores, and not necessarily the highest)
MRW: The Question
• An example from a political blog dataset: MRW vs. wvRN scores for how much a blog is politically conservative (seed labels underlined in the original slide)
• The two rankings (MRW: 0.020, 0.019, 0.017, …, 0.005; wvRN: 1.000, 1.000, 1.000, 0.593, 0.585, …, 0.509) cover blogs such as firstdownpolitics.com, neoconservatives.blogspot.com, strangedoctrines.typepad.com, jmbzine.com, millers_time.typepad.com, rooksrant.com, and others
• 2. Exponential drop-off: much less sure about nodes further away from seeds
• 3. Classes propagate independently: charlineandjamie.com is both very likely a conservative and a liberal blog (good or bad?)
We still don't really understand it yet.
MRW: Related Work
• MRW is very much related to:
– "Learning with local and global consistency" (Zhou et al. 2004): similar formulation, different view
– "Web content categorization using link information" (Gyongyi et al. 2006): RWR ranking as features to an SVM
– "Graph-based semi-supervised learning as a generative model" (He et al. 2007): random walk without restart, heuristic stopping
• Seed preference is related to the field of active learning
– Active learning chooses which data point to label next based on previous labels; the labeling is interactive
– Seed preference is a batch labeling method
Authoritative seed preference is a good baseline for active learning on network data!
Results
• How much better is MRW using authoritative seed preference?
• y-axis: MRW F1 score minus wvRN F1; x-axis: number of seed labels per class
• The gap between MRW and wvRN narrows with authoritative seeds, but remains prominent on some datasets with a small number of seed labels
A Web-Scale Knowledge Base
• Read the Web (RtW) project:
Build a never-ending system that
learns to extract information
from unstructured web pages,
resulting in a knowledge base of
structured information.
Noun Phrase and Context Data
• As a part of the RtW project, two kinds of noun phrase (NP) and context co-occurrence data were produced:
– NP-context co-occurrence data
– NP-NP co-occurrence data
• These datasets can be treated as graphs
Noun Phrase and Context Data
• NP-context data:
… know that drinking pomegranate juice may not be a bad …
– NPs: pomegranate juice, JELL-O, Jagermeister
– Contexts (before and after the NP, with the NP slot marked "_"): know that drinking _; _ may not be a bad; _ is made from; _ promotes responsible
– NPs and contexts form a bipartite NP-context graph, with edges weighted by co-occurrence counts
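One plausible way to build such a bipartite graph as a sparse matrix (the triples and counts below are made up for illustration, not from the dataset):

```python
from scipy import sparse

# (noun phrase, context, co-occurrence count) triples -- hypothetical numbers.
triples = [
    ("pomegranate juice", "know that drinking _", 3),
    ("pomegranate juice", "_ may not be a bad", 2),
    ("JELL-O", "_ is made from", 5),
    ("Jagermeister", "_ is made from", 8),
    ("Jagermeister", "_ promotes responsible", 2),
]

# Assign a row index to each NP and a column index to each context.
nps = sorted({n for n, _, _ in triples})
contexts = sorted({c for _, c, _ in triples})
np_idx = {n: i for i, n in enumerate(nps)}
ctx_idx = {c: i for i, c in enumerate(contexts)}

rows = [np_idx[n] for n, _, _ in triples]
cols = [ctx_idx[c] for _, c, _ in triples]
vals = [w for _, _, w in triples]

# Sparse NP x context matrix F: the bipartite NP-context graph.
F = sparse.csr_matrix((vals, (rows, cols)), shape=(len(nps), len(contexts)))
```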
Noun Phrase and Context Data
• NP-NP data:
… French Constitution Council validates burqa ban …
– NPs that co-occur (via a shared context) are linked: veil, French Constitution Council, burqa ban, hot pants, French Court, Jagermeister, JELL-O
– Context can be used for weighting edges or making a more complex graph
– The result is an NP-NP graph
Noun Phrase Categorization
• We propose using MRW (with path folding) on
the NP-context data to categorize NPs, given a
handful of seed NPs.
• Challenges:
– Large, noisy dataset (10m NPs, 8.6m contexts from
500m web pages).
– What’s the right function for NP-NP categorical
similarity?
– Which learned category assignment should we
“promote” to the knowledge base?
– How to evaluate it?
Noun Phrase Categorization
• Preliminary experiment:
– Small subset of the NP-context data
• 88k NPs
• 99k contexts
– Find category “city”
• Start with a handful of seeds
– Ground truth set of 2,404 city NPs created using