# Learning Similarity Measures Based on Random Walks in Graphs William W. Cohen

```Learning Similarity Measures
Based on Random Walks in Graphs
William W. Cohen
Machine Learning Department and Language Technologies Institute
School of Computer Science
Carnegie Mellon University
joint work with:
Tom Mitchell, CMU
Einat Minkov, Univ Haifa
Motivation:
The simple and the complex
• In computer science there is a tension between
– The elegant, simple and general
– The messy, complex and problem-specific
• Graphs are:
– Simple: so they are easy to analyze and store
– General: so
• They appear in many contexts
• They are often a natural representation of
important aspects of information
– Well-understood
Motivation:
The simple and the complex
• The real world is complex…
• … learning is a way to incorporate that complexity in our
models without sacrificing elegance and generality
Motivation:
The simple and the complex
• This talk: Learning Similarity Measures Based on Random
Walks in Graphs
– Many fundamental tasks in computer science map
an input to an output
– … i.e., the task can be modeled as a relation between
input and output
– …and further the relation can often be viewed as a
similarity relation: the desired outputs are similar to
the input (query)
– we want to learn this relationship
• even if (especially if) it is complex
• even if it is described by a multi-step process
– Here: one line of work on learning complex
relationships
Motivation:
The simple and the complex
• This talk:
– One line of work on learning complex relationships
– Not covered here:
• Minkov et al 2006, 2008, 2011: Similar framework
for personalized information management queries
and NLP relationships (e.g., synonyms) using
generative and reranking-based learning strategies
• Backstrom and Leskovec, 2011: Alternative, very
expressive parameterization of learning complex
similarity metrics in graphs with feature vectors on
the edges.
Similarity Queries on Graphs
1) Given type t* and node x in G, find y:T(y)=t* and y~x.
2) Given type t* and node set X, find y:T(y)=t* and y~X.
• Nearest-neighbor classification:
– G contains feature nodes and instance nodes
– A link (x,f) means feature f is true for instance x
– x* is a query instance, y~x* means y likely of same class as x*
• Information retrieval:
– G contains word nodes and document nodes
– A link (w,d) means word w is in document d
– X is a set of keywords, y~X means y likely to be relevant to X
• Database retrieval:
– G encodes a database
– …?
BANKS: Browsing and Keyword Search
• Database is modeled as a graph
– Nodes = tuples
– Edges = references between tuples
• edges are directed and indicate foreign key, inclusion dependencies, ..
[Figure: graph fragment in which the paper “MultiQuery Optimization” is linked through “writes” tuples to the authors S. Sudarshan and Prasan Roy]
Query: {“sudarshan”, “roy”} Answer: subtree from graph
As a similarity query: y: paper(y) & y~“sudarshan” AND paper(y) & y~“roy”
Similarity Queries on Graphs
1) Given type t* and node x in G, find y:T(y)=t* and y~x.
2) Given type t* and node set X, find y:T(y)=t* and y~X.
• Nearest-neighbor classification
• Information retrieval
• Database retrieval
• Evaluation: specific families of tasks for scientific publications:
– Citation recommendation for a paper: (given title, year, …, of paper p,
what papers should be cited by p?)
– Expert-finding: (given keywords, genes, … suggest a possible author)
– “Entity recommendation”: (given title, author, year, … predict entities
mentioned in a paper, e.g. gene-protein entities) – can improve NER
– Literature recommendation: given researcher and year, suggest papers
• Evaluation: Inference in a DB of automatically-extracted facts
Similarity Queries on Graphs
[Diagram: training pairs (query 1, ans 1), (query 2, ans 2), … → LEARNER (may use PPR) → Sim(s,p) = mapping from query → ans (a variant of PPR)]
Outline
• Motivation for Learning Similarity in Graphs
• A Baseline Similarity Metric
• The Path Ranking Algorithm (Learning Method)
– Motivation
– Details
Defining Similarity on Graphs: PPR/RWR
[Personalized PageRank 1999]
Given type t* and node x, find y:T(y)=t* and y~x.
• Similarity defined by “damped” version of PageRank
• Similarity between nodes x and y:
– “Random surfer model”: from a node z,
• with probability α, teleport back to x (“reset”)
• Else pick a y uniformly from { y’ : z → y’ }
• repeat from node y ....
– Similarity x~y = Pr( surfer is at y | restart is always to x )
• Intuitively, x~y is sum of weight of all paths from x to y, where weight
of path decreases exponentially with length (and fanout)
• Can easily extend to a “query” set X={x1,…,xk}
• Data used in this study
– Yeast: 0.2M nodes, 5.5M links
– Fly: 0.8M nodes, 3.5M links
– E.g. the fly graph
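[Figure: schema of the fly graph. Labels and counts as they appear in the figure: Cite 1,267,531; Author 233,229; Write 679,903; Publication 126,813; Physical/Genetic interactions 1,352,820; Title Terms 102,223; Journal 1,801; Transcribe 293,285; Gene 516,416; Protein 414,824. Other labels: Year, before, Downstream/Upstream, Bioentity. Unlabeled counts: 689,812; 2,060,275; 1,785,626; 58; 5,823,376.]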
Learning Proximity Measures for
– Gene recommendation: author, year → gene
– Reference recommendation: words, year → paper
– Expert-finding: words, genes → author
– Literature-recommendation:
• Baseline method:
– Typed RWR proximity methods
• Baseline learning method:
– parameterize Prob(walk edge|edge label=L) and tune the
parameters for each label L (somehow…)
[Figure: the graph schema annotated with one walk-probability parameter per edge label: P(cite) = a, P(write) = b, P(NE) = c, P(bindTo) = d, P(express) = d]
Path-based vs Edge-label based learning
• RWR is a very robust and useful similarity metric
• Learning one-parameter-per-edge label is very limited
• In many cases, there aren’t enough parameters to express a
complex relationship
Path-based vs Edge-label based learning
• Learning one-parameter-per-edge label is limited because the context in
which an edge label appears is ignored
– E.g. (observed from real data – task, find papers to read)
• Instead, we will learn path-specific parameters
[Table: an example path and its interpretation, “favorite authors”]
• Paths will be interpreted as constrained random walks that give a
similarity-like weight to every reachable node
• Step 0: D0 = {a} Start at author a
• Step 1: D1: Uniform over all papers p read by a
• Step 2: D2: Author a’ of papers in D1 weighted by number of papers
• Step 3: D3: Papers p’ written by a’, weighted by ....
• …
A Limitation of RWR Learning Methods
• Learning one-parameter-per-edge label is limited because
the context in which an edge label appears is ignored
– E.g. (observed from real data – task, find papers to read)
• Instead, we will learn path-specific parameters
Example paths and their interpretations:
– … –[write⁻¹]→ author –[write]→ paper: “favorite authors”
– author –[write]→ paper –[contain]→ gene –[contain⁻¹]→ paper: “I’m working on”
– author –[write]→ paper –[publish⁻¹]→ institute –[publish]→ paper: “my own lab”
Definitions
• A graph G=(T,R,X,E) is
– a set of entity types T={T} and a set of relations R={R}
– a set of entities (nodes) X={x}, where each node x has a type from T
– a set of edges E={e=(x,y)}, where each edge has a relation label from R
• A path P=(R1, …,Rn) is a sequence of relations
• Path Constrained Random Walk
– Given a query set S of “source” nodes
– Distribution D0 at time 0 is uniform over s in S
– Distribution Dt at time t>0 is formed by
• Pick x from Dt-1
• Pick y uniformly from all things related to x
– by an edge labeled Rt
[Figure: tree of relation paths rooted at Author. The first step is Author –[Write]→ Paper; later steps branch over WrittenBy (to Author), Cite and CiteBy (to Paper).]
– Notation: f_P(s,t) = Prob(s → t; P)
– In our examples type of t will be determined by Rn
– Example path: x –[AthletePlaysForTeam]→ y –[TeamPlaysInLeague]→ z
Path Ranking Algorithm (PRA)
[Lao & Cohen, ECML 2010]
• A PRA model scores a source-target node pair by a linear function of
their path features
  score(s, t) = \sum_{P \in \mathcal{P}} f_P(s, t) \, \theta_P
  f_P(s, t) = Prob(s \to t; P)
where \mathcal{P} is the set of all relation paths with length ≤ L (with support on data, in
some cases – see [Lao and Cohen EMNLP 2011])
• For a relation R and a set of node pairs {(si, ti)}, we construct a training
dataset D ={(xi, yi)}, where xi is a vector of all the path features for (si, ti),
and yi indicates whether R(si, ti) is true or not
• θ is estimated using L1- and L2-regularized logistic regression
• We’ve gone from a small parameter space to a huge one
Parameter Estimation (Details)
• Given a set of training data
– D = {(q^(m), A^(m), y^(m))}, m = 1…M, with y^(m)(e) = 1 or 0
• We can define a regularized objective function
  O(\theta) = \sum_{m=1..M} o_m(\theta) - \lambda_1 |\theta|_1 - \lambda_2 |\theta|_2^2 / 2
• Use average log-likelihood as the objective om(θ)
  o_m(\theta) = |P_m|^{-1} \sum_{i \in P_m} \ln p_i^{(m)} + |N_m|^{-1} \sum_{i \in N_m} \ln(1 - p_i^{(m)})
  p_i^{(m)} = p(y_i^{(m)} = 1 | q^{(m)}; \theta) = \exp(\theta^T A_i^{(m)}) / (1 + \exp(\theta^T A_i^{(m)}))
– P_m: the index set of relevant entities
– N_m: the index set of irrelevant entities
(how to choose them will be discussed later)
Parameter Estimation (Details)
• Selecting the negative entity set Nm
– Few positive entities vs. thousands (or millions) of negative entities?
– First sort all the negative entities with a uniform-weight RWR model
– Then take negative entities at the k(k+1)/2-th position, for k=1,2,….
• Use orthant-wise L-BFGS (Andrew & Gao, 2007) to estimate θ
– Efficient; can deal with L1 regularization
L2 Regularization
• Improves retrieval quality
– On the citation recommendation task
[Figure: two panels, each with curves for path lengths l=2, l=3, l=4: negative log-likelihood vs. λ2 (λ1=0) and MAP vs. λ2 (λ1=0), with λ2 ranging from 1E-07 to 0.1]
L1 Regularization
• Does not improve retrieval quality…
[Figure: two panels, each with curves for path lengths l=2, l=3, l=4: negative log-likelihood vs. λ1 (λ2=0.00001) and MAP vs. λ1 (λ2=0.00001), with λ1 ranging from 1E-05 to 0.1]
L1 Regularization
• … but can help reduce number of features
[Figure: two panels, each with curves for path lengths l=2, l=3, l=4: MRR vs. λ1 (λ2=0.00001) and No. Active Features (log scale) vs. λ1 (λ2=0.00001), with λ1 ranging from 1E-05 to 0.1]
Another potential “regularization”:
approximate RWR
Experiment Setup for BioLiterature
• Data sources for bio-informatics
– PubMed: on-line archive of over 18 million biological abstracts
– PubMed Central (PMC): full-text copies of over 1 million of these papers
– Saccharomyces Genome Database (SGD): a database for yeast
– Flymine: a database for fruit flies
• Tasks
– Gene recommendation: author, year → gene
– Venue recommendation: genes, title words → journal
– Citation recommendation: title words, year → paper
– Expert-finding: title words, genes → author
• Data split
– 2000 training, 2000 tuning, 2000 test
• Time variant graph
– each edge is tagged with a time stamp (year)
– during the random walk, only consider edges that are earlier than the query
BioLiterature: Some Results
• Compare the mean average precision (MAP) of PRA to
– RWR model
– RWR trained with one-parameter per link
Except those marked †, all improvements are statistically significant
at p<0.05 using a paired t-test
Example Path Features and their Weights
• A PRA+qip+pop model trained for the citation
recommendation task on the yeast data
1) papers co-cited with on-topic papers
6) approx. standard IR retrieval
7,8) papers cited during the past two years
9) well cited papers
10,11) key early papers about specific genes
12,13) papers published during the past two years
14) old papers
Extension 1: Query Independent Paths
• PageRank (and other query-independent rankings):
– assign an importance score (query independent) to each web page
– later combined with relevance score (query dependent)
• We generalize PageRank to heterogeneous graphs:
– We add to each query a special entity e0 of special type T0
– T0 is related to all other entity types, and each type is related to all
instances of that type
– This defines a set of PageRank-like query independent relation paths
– Compute f(* → t; P) offline for efficiency
• Example
[Figure: query-independent paths starting from T0, which links to all papers and all authors; following relations such as Cite, CiteBy, WrittenBy and Wrote from there picks out, e.g., well cited papers and productive authors]
Extension 2: Entity-specific rankings
• There are entity-specific characteristics which cannot be captured by
a general model
– Some items are interesting to the users because of features not
captured in the data
– To model this, assume the identity of the entity matters
– Introduce new features f(s → t; P_{s,t}) to account for jumping from s
to t and new features f(* → t; P_{*,t})
– At each gradient step, add a few new features of this sort with
highest gradient, count on regularization to avoid overfitting
BioLiterature: Some Results
• Compare the MAP of PRA to
– RWR model
– query independent paths (qip)
– popular entity biases (pop)
Except those marked †, all improvements are statistically significant
at p<0.05 using a paired t-test
Outline
• Random Walk With Reset/Personalized
PageRank
– What is it?
• Similarity Queries
• Learning How to “Tune” Similarity Functions for
An Application/Subdomains
• Applications and Results
– BioLiterature
– Knowledge Base Inference
Outline
• Motivation for Learning Similarity in Graphs
• A Baseline Similarity Metric
• The Path Ranking Algorithm (Learning Method)
– Motivation
– Details
[Lao, Mitchell, Cohen, EMNLP 2011]
Large Scale Knowledge-Bases
• Large-Scale Collections of Automatically Extracted Knowledge
– KnowItAll (Univ. Washington)
• 0.5B facts extracted from 0.1B web pages
– DBpedia (Univ. Leipzig)
• 3.5M entities, 0.7B facts extracted from Wikipedia
– YAGO (Max-Planck-Institute)
• 2M entities, 20M facts extracted from Wikipedia and WordNet
– FreeBase
• 20M entities 0.3B links, integrated from different data sources
and human judgments
– NELL (Never-Ending Language Learning, CMU)
• 0.85M facts extracted from 0.5B webpages
Inference in Noisy Knowledge Bases
• Challenges
– Robustness: extracted knowledge is incomplete and noisy
– Scalability: the size of knowledge base is large
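[Figure: example inference. Known edges: HinesWard –[AthletePlaysForTeam]→ Steelers and Steelers –[TeamPlaysInLeague]→ NFL; the edge to infer is HinesWard –[AthletePlaysInLeague]→ ?, with NFL as the candidate. Other labels in the figure: IsA, isa⁻¹, PlaysIn, American.]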
The NELL Case Study
• Never-Ending Language Learning: “a never-ending learning
system that operates 24 hours per day, for years, to
continuously improve its ability to read (extract structured
facts from) the web” (Carlson et al., 2010)
• Closed domain, semi-supervised extraction
• Combines multiple strategies: morphological patterns,
textual context, HTML patterns, logical inference
• Example beliefs
• We consider 48 relations for which NELL database has more than
100 instances
– AthletePlaysInLeague(HinesWard,?)
– AthletePlaysInLeague(?, NFL)
• The actual nodes y known to satisfy R(x; ?) are treated as labeled
positive examples, and all other nodes are treated as negative
examples
Current NELL method (baseline)
• FOIL (Quinlan and Cameron-Jones, 1993) is a learning
algorithm similar to decision trees, but in relational domains
• NELL implements two assumptions for efficient learning
– The predicates are functional, e.g. an athlete plays in at
most one league
– Only find clauses that correspond to bounded-length paths
of binary relations: relational pathfinding (Richards &
Mooney, 1992)
Current NELL method (baseline)
• FOL not great for handling uncertainty
– FOIL can only combine rules with disjunctions, therefore cannot
leverage low accuracy rules
– E.g. rules for teamPlaysSports
Experiments - Cross Validation on KB data
(for parameter setting, etc)
RWR: Random Walk with Restart (PPR)
† Paired t-tests give p-values 7×10^-3, 9×10^-4, 9×10^-8, 4×10^-4
Example Paths
• Synonyms of the query team
Evaluation by Mechanical Turk
• There are many test queries per predicate
– All entities of a predicate’s domain/range, e.g.
• WorksFor(person, organization)
– On average 7,000 test queries for each functional predicate, and
13,000 for each non-functional predicate
• Sampled evaluation
– We only evaluate the top ranked result for each query
– We sort the queries for each predicate according to the scores of
their top ranked results, and then evaluate precisions at top 10,
100 and 1000 queries
• Each belief is voted on by 5 workers
– Workers are given assertions like “Hines Ward plays for the
Evaluation by Mechanical Turk
• On 8 functional predicates where N-FOIL can successfully learn
– PRA is comparable to N-FOIL for p@10, but has significantly better
p@100
• On 8 randomly sampled non-functional (one-many) predicates
– Slightly lower accuracy than functional predicates
                            N-FOIL                      PRA
                            #Rules    p@10   p@100      #Paths  p@10   p@100
Functional Predicates       2.1(+37)  0.76   0.380      43      0.79   0.668
Non-functional Predicates   ----      ----   ----       92      0.65   0.620
PRA: Path Ranking Algorithm
Beyond Pure KB Inference
• Following Minkov et al, 2008:
– Learn paths in a graph composed of multiple
dependency trees—to find synonyms, etc.
Learning Lexico-Syntactic Patterns
• Following Minkov et al, 2008:
– Learn paths in a graph composed of text and
knowledge [Lao et al, EMNLP 2011]
Beyond Pure KB Inference
• Following Minkov et al, 2008:
– Learn paths in a graph composed of text and
knowledge [Lao et al, EMNLP 2011]
Outline
• Motivation for Learning Similarity in Graphs
• A Baseline Similarity Metric
• The Path Ranking Algorithm (Learning Method)
– Motivation
– Details
• Conclusions
Summary/Conclusion
• Learning is the way to make a clean, elegant formulation of a task
work in the messy, complicated real world
• Learning how to navigate graphs is a significant, core task that
models
– Recommendation, expert-finding, …
– Information retrieval
– Inference in KBs
– …
• It includes significant, core learning problems
– Regularization/search of huge feature space
– Discovery: long paths, lexicalized paths, …
– Incorporating knowledge of graph structure …
– ….
Looking Forward
• PRA learns very restricted “inference rules”
desiredResult(Query,Result) ←
p1(Query,X1), p2(X1,X2), …, pk(Xk-1,Result)
• Can you generalize from these to a larger set of inference rules?
• Can you generalize from binary to n-ary relationships?
• Can you jointly learn several relationships at once?
• PRA learns to navigate “real” graphs
– What about graphs that are built on-the-fly?
• E.g., Graphs that summarize a program’s execution, or a
theorem-prover’s behavior?
• Future work?
• Thanks to:
– My co-authors on this work
– All of you for being here
– NSF grant IIS-0811562
– NIH grant R01GM081293