On Beyond Hypertext:
Searching in Graphs Containing
Documents, Words, and Data
William W. Cohen
Center for Automated Learning and Discovery
+ Language Technology Institute
+ Center for Bioimage Informatics
+ Joint CMU-Pitt Program in Bioinformatics
Carnegie Mellon University
joint work with:
Einat Minkov (CMU)
Andrew Ng (Stanford)
Outline
• Motivation: why I’m interested in
– structured data that is partly text;
– structured data represented as graphs;
– measuring similarity of nodes in graphs
• Contributions:
– a simple query language for graphs;
– experiments on natural types of queries;
– techniques for learning to answer queries of a
certain type better
“A Little Knowledge is A Dangerous
Thing” [A. Pope, 1709]
• Three centuries later, we’ve
learned that a lot of knowledge
is also sort of dangerous....
... so how do we deal with
information overload?
One approach: adding structure to
unstructured information
October 14, 2002, 4:00 a.m. PT
... by recognizing entity
names...
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
... and relationships
between them...
Today, Microsoft claims to "love" the open-source concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access."
Richard Stallman, founder of the Free
Software Foundation, countered saying…
IE

NAME               TITLE     ORGANIZATION
Bill Gates         CEO       Microsoft
Bill Veghte        VP        Microsoft
Richard Stallman   founder   Free Soft..
One approach: adding structure to
unstructured information
[Carvalho, Cohen SIGIR05;
Cohen, Carvalho, Mitchell EMNLP 04]
One approach: adding structure to
unstructured information
[Mitchell et al CEAS 2004]
One approach: adding structure to
unstructured information
One approach: adding structure to
unstructured information
[McCallum et al IJCAI05]
Is converting unstructured data to
structured data enough?
Limitations of structured data
What is the email address for
the person named “Halevy”
mentioned in this presentation?
• Diversity: many different types of
information from many different
sources, that arise to fill many
different needs.
• Uncertainty: information from many
sources (like IE programs or the web)
need not be correct.
What files from my home
machine will I need for this
meeting?
What people will attend this
meeting?
• Complexity of interaction:
formulating ‘information needs’ as
queries to a DB can be
difficult...especially a heterogeneous
DB, with a complex/changing schema.
How do you discover & access the
tens or hundreds of structured
databases?
How do you understand & combine
the hundreds of schemata, with
thousands of fields?
How can you include many
diverse sources of information
in a single database?
How do you relate the thousands
or millions or ... of entity identifiers
from the different databases?
When are two entities the same?
When is referent(oid1)=referent(oid2) ?
• Bell Labs [1925]
• Bell Telephone Labs
• AT&T Bell Labs
• AT&T Labs
• AT&T Labs—Research
• AT&T Labs Research, Shannon Laboratory
• Shannon Labs
• Bell Labs Innovations
• Lucent Technologies/Bell Labs Innovations
History of Innovation: From 1925
to today, AT&T has attracted some
of the world's greatest scientists,
engineers and developers….
[www.research.att.com]
Bell Labs Facts: Bell Laboratories,
the research and development arm
of Lucent Technologies, has been
operating continuously since 1925…
[bell-labs.com]
Is there a definition of ‘entity identity’ that is user- and purpose-independent?
[Figure: “Bell Telephone Labs” judged equal (=) to some of the names above and
not equal (≠) to others]
When are two entities the same?
“Buddhism rejects the key element in folk psychology: the
idea of a self (a unified personal identity that is continuous
through time)…
King Milinda and Nagasena (the Buddhist sage) discuss …
personal identity… Milinda gradually realizes that "Nagasena"
(the word) does not stand for anything he can point to: … not …
the hairs on Nagasena's head, nor the hairs of the body, nor the
"nails, teeth, skin, muscles, sinews, bones, marrow, kidneys, ..."
etc… Milinda concludes that "Nagasena" doesn't stand for
anything… If we can't say what a person is, then how do we
know a person is the same person through time? …
There's really no you, and if there's no you, there are no beliefs
or desires for you to have… The folk psychology picture is
profoundly misleading and believing it will make you
miserable.” -S. LaFave
Traditional approach: linkage queries
Uncertainty about what to link
must be decided by the integration
system, not the end user.
WHIRL vision:
SELECT R.a,S.a,S.b,T.b FROM R,S,T
WHERE R.a=S.a and S.b=T.b
Link items as needed by Q
Query Q

R.a       S.a       S.b       T.b
Anhai     Anhai     Doan      Doan      ← strongest links: those agreeable to most users
Dan       Dan       Weld      Weld
William   Will      Cohen     Cohn      ← weaker links: those agreeable to some users
Steve     Steven    Minton    Mitton
William   David     Cohen     Cohn      ← even weaker links…
WHIRL vision: DB1 + DB2 ≠ DB
SELECT R.a,S.a,S.b,T.b FROM R,S,T
WHERE R.a~S.a and S.b~T.b
(~ TFIDF-similar)
Link items as needed by Q: incrementally produce a
ranked list of possible links, with “best matches” first. User
(or downstream process) decides how much of the list to
generate and examine.
Query Q

R.a       S.a       S.b       T.b
Anhai     Anhai     Doan      Doan
Dan       Dan       Weld      Weld
William   Will      Cohen     Cohn
Steve     Steven    Minton    Mitton
William   David     Cohen     Cohn
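A minimal sketch of the WHIRL-style “soft join” idea above: instead of requiring R.a = S.a exactly, rank candidate links by TF-IDF/cosine similarity and let the user decide how far down the ranked list to look. The tokenization, TF-IDF weighting, and example names here are illustrative, not WHIRL’s actual implementation.

```python
# A toy TF-IDF soft join: score every (r, s) pair by cosine similarity
# of TF-IDF vectors and return pairs sorted best-match-first.
import math
from collections import Counter

def tfidf_vectors(strings):
    """L2-normalized TF-IDF vectors over whitespace tokens."""
    docs = [s.lower().split() for s in strings]
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    vecs = []
    for d in docs:
        tf = Counter(d)
        v = {t: (1 + math.log(c)) * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        vecs.append({t: w / norm for t, w in v.items()})
    return vecs

def soft_join(r_col, s_col):
    """Return (r, s, score) triples, ranked with 'best matches' first."""
    vecs = tfidf_vectors(r_col + s_col)
    rv, sv = vecs[:len(r_col)], vecs[len(r_col):]
    pairs = []
    for i, r in enumerate(r_col):
        for j, s in enumerate(s_col):
            score = sum(w * sv[j].get(t, 0.0) for t, w in rv[i].items())
            pairs.append((r, s, score))
    return sorted(pairs, key=lambda p: -p[2])
```

A downstream process can simply truncate the ranked list wherever the scores become too weak to trust.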
Outline
• Motivation: why I’m interested in
– structured data that is partly text: similarity!
There are general-purpose, fast, robust
similarity measures for text, which are
useful for data integration....and hence,
combining information from multiple
sources.
– structured data represented as graphs;
– measuring similarity of nodes in graphs
• Contributions:
– a simple query language for graphs;
– experiments on natural types of queries;
– techniques for learning to answer queries of a
certain type better
Limitations of structured data
What is the email address for
the person named “Halevy”
mentioned in this presentation?
• Diversity: many different types of
information from many different
sources, that arise to fill many
different needs.
What files from my home
machine will I need for this
meeting?
• Uncertainty: information from many
sources (like IE programs or the web)
need not be correct.
• Complexity of interaction:
formulating ‘information needs’ as
queries to a DB can be
difficult...especially a heterogeneous
one.
How can you exploit
structure without
understanding the structure?
What people will attend this
meeting?
... ?
?
Schema-free structured search
• DataSpot (DTL)/Mercado Intuifind: [VLDB 98]
• Proximity Search: [VLDB 98]
• Information units (linked Web pages): [WWW10]
• Microsoft DBExplorer, Microsoft English Query
• BANKS (Browsing ANd Keyword Search):
[Chakrabarti & others, VLDB 02, VLDB 05]
BANKS: Basic Data Model
• Database is modeled as a graph
– Nodes = tuples
– Edges = references between tuples
User need not know
organization of database
to formulate queries.
• edges are directed.
• foreign key, inclusion dependencies, ..
BANKS: Keyword search…
[Figure: a graph fragment; the paper node “MultiQuery Optimization” is connected
by writes/author edges to the author nodes “Charuta”, “S. Sudarshan”, and
“Prasan Roy”]
BANKS: Answer to Query
Query: “sudarshan roy”  Answer: subtree from graph
[Figure: the answer subtree; the paper “MultiQuery Optimization” is connected by
writes/author edges to “S. Sudarshan” and “Prasan Roy”]
not quite so basic
BANKS: Basic Data Model
• All information is modeled as a graph (not just the database)
– Nodes = tuples or documents or strings or words
– Edges = references between nodes
• edges are directed, labeled and weighted
• foreign key, inclusion dependencies, ...
• doc/string D to word contained by D (TFIDF weighted, perhaps)
• word W to doc/string containing W (inverted index)
• [string S to strings ‘similar to’ S]
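The “everything is a graph” model above can be sketched in a few lines: each doc/string node gets a has-term edge to each of its words, and each word gets an inverse edge back via the inverted index. The node names, edge labels, and tokenization below are illustrative, not the implementation from the talk.

```python
# Build a tiny text-as-graph: doc -> term edges plus the inverse
# term -> doc edges, so TFIDF-similar strings end up connected by
# many length-2 paths through their shared terms.
from collections import defaultdict

def build_graph(docs):
    """docs: {node_id: text}. Returns {node: [(edge_label, neighbor), ...]}."""
    graph = defaultdict(list)
    for doc_id, text in docs.items():
        # crude tokenization for illustration only
        for term in set(text.lower().replace('.', '').replace(',', '').split()):
            graph[doc_id].append(('hasTerm', 'term:' + term))
            graph['term:' + term].append(('inFile', doc_id))  # inverted index
    return graph

g = build_graph({'s1': 'William W. Cohen, CMU', 's2': 'Dr. W. W. Cohen'})
# s1 and s2 are linked by length-2 paths through shared terms (w, cohen)
```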
Outline
• Motivation: why I’m interested in
– structured data that is partly text – similarity!
– structured data represented as graphs; all sorts
of information can be poured into this model.
– measuring similarity of nodes in graphs
• Contributions:
– a simple query language for graphs;
– experiments on natural types of queries;
– techniques for learning to answer queries of a
certain type better
Yet another schema-free query language
• Assume data is encoded in a graph with:
– a node for each object x
– a type of each object x, T(x)
– an edge for each binary relation r: x → y
• Queries are of this form:
– Given type t* and node x, find y: T(y)=t* and y~x.
• We’d like to construct a general-purpose node-similarity
function x~y for objects in the graph.
• We’d also like to learn many such functions for different
specific tasks (like “who should attend a meeting”)
Similarity of Nodes in Graphs
Given type t* and node x, find y:T(y)=t* and y~x.
• Similarity defined by “damped” version of PageRank
• Similarity between nodes x and y:
– “Random surfer model”: from a node z,
• with probability α, stop and “output” z
• pick an edge label r using Pr(r | z) ... e.g. uniform
• pick a node y uniformly from { y’ : z → y’ with label r }
• repeat from node y ....
– Similarity x~y = Pr( “output” y | start at x)
• Intuitively, x~y is summation of weight of all paths from x to y, where
weight of path decreases exponentially with length.
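The “damped” random-surfer similarity just defined can be sketched by iterating the walk distribution for a fixed number of steps; the graph encoding and parameter values below are illustrative assumptions, not the GHIRL implementation.

```python
# Damped random walk: from node z, with probability alpha stop and
# "output" z; otherwise pick an edge label uniformly (Pr(r|z) uniform),
# then a neighbor uniformly under that label. x~y is the probability of
# stopping at y when starting at x.
from collections import defaultdict

def walk_similarity(graph, x, alpha=0.5, steps=10):
    """graph: {node: {label: [neighbors]}}. Returns {y: Pr(output y | start x)}."""
    dist = {x: 1.0}              # probability of currently being at each node
    out = defaultdict(float)     # accumulated stopping ("output") probability
    for _ in range(steps):
        nxt = defaultdict(float)
        for z, p in dist.items():
            out[z] += alpha * p  # stop and output z
            labels = graph.get(z, {})
            if not labels:
                continue         # dead end: remaining mass is dropped (a sketch)
            for label, nbrs in labels.items():
                share = (1 - alpha) * p / len(labels)  # uniform Pr(label | z)
                for y in nbrs:
                    nxt[y] += share / len(nbrs)        # uniform neighbor choice
        dist = nxt
    return dict(out)
```

Because each continuation multiplies the remaining mass by (1 − alpha), a path’s contribution decays exponentially with its length, matching the intuition on the slide.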
not quite so basic
BANKS: Basic Data Model
• All information is modeled as a graph
– Nodes = tuples or documents or strings or words
– Edges = references between nodes
• edges are directed, labeled and weighted
• foreign key, inclusion dependencies, ...
• doc/string D to word contained by D (TFIDF weighted, perhaps)
• word W to doc/string containing W (inverted index)
• [string S to strings ‘similar to’ S] (optional: strings that are similar in
TFIDF/cosine distance will still be “nearby” in the graph, connected by
many length=2 paths)
[Figure: the strings “William W. Cohen, CMU” and “Dr. W. W. Cohen” are linked
through shared term nodes such as w and cohen]
Similarity of Nodes in Graphs
• Random surfer on graphs:
– natural extension to PageRank
– closely related to Lafferty’s heat diffusion kernel
• but generalized to directed graphs
– somewhat amenable to learning parameters of the walk
(gradient search, w/ various optimization metrics):
• Toutanova, Manning & Ng, ICML2004
• Nie et al, WWW2005
• Xi et al, SIGIR 2005
– can be sped up and adapted to longer walks by
sampling approaches to matrix multiplication (e.g. Lewis
& E. Cohen, SODA 1998), similar to particle filtering
– our current implementation (GHIRL): Lucene +
Sleepycat with extensive use of memory caching
(sampling approaches visit many nodes repeatedly)
Query: “sudarshan roy”  Answer: subtree from graph
[Figure: the answer subtree; the paper “MultiQuery Optimization” is connected by
writes/author edges to “S. Sudarshan” and “Prasan Roy”]
y: paper(y) & y~“sudarshan”
AND
w: paper(w) & w~“roy”
Evaluation on Personal Information
Management Tasks [Minkov et al, SIGIR 2006]
Many tasks can be expressed as simple, non-conjunctive
search queries in this framework.
Such as:
• Person Name Disambiguation in Email (novel)
[cf. Diehl, Getoor, Namata, 2006]
What is the email address for the person named “Halevy” mentioned in this presentation?
• Threading [eg Lewis & Knowles 97]
What files from my home machine will I need for this meeting?
• Finding email-address aliases given a person’s name (novel)
• Finding relevant meeting attendees (novel)
What people will attend this meeting? ... ?
Also consider a generalization x → Vq, where Vq is a distribution over nodes.
Email as a graph
[Figure: the email corpus as a graph. Nodes represent email addresses, person
names, dates, files, and terms; edge types include sent_to, sent_from,
sent_date, alias, in_file, in_subj, +1_day (between date nodes), and their
inverses (a_inv, sf_inv, st_inv, sd_inv, if_inv, is_inv).]
Person Name Disambiguation
Q: “who is Andy?”
• Given: a term that is known to be a personal name but is not mentioned ‘as is’
in the header (otherwise, easy)
• Output: ranked person nodes.
[Figure: the term node “andy” connected through file nodes to several person
nodes, e.g. Person: Andrew Johns]
* This task is complementary to person name annotation in email
(E. Minkov, R. Wang, W. Cohen, Extracting Personal Names from Emails: Applying Named Entity
Recognition to Informal Text, HLT/EMNLP 2005)
Corpora and Datasets
a. Corpora
b. Types of names
Example nicknames: Dave for David, Kai for Keiko, Jenny for Qing
Person Name Disambiguation
1. Baseline: string matching (& common nicknames)
Find persons that are similar to the name term (Jaro)
• Successful in many cases
• Not successful for some nicknames
• Can not handle ambiguity (arbitrary)
2. Graph walk: term
Vq: name term node (2 steps)
• Models co-occurrences
• Can not handle ambiguity (dominant)
3. Graph walk: term+file
Vq: name term + file nodes (2 steps)
• The file node is naturally available context
• Solves the ambiguity problem!
• But, incorporates additional noise
4. Graph walk: term+file, reranked using learning
Re-rank the output of (3), using:
- path-describing features
- ‘source count’: do the paths originate from a single source node or from two?
- string similarity
Results
[Figure: results on the person name disambiguation task, comparing the baseline
(string match + nicknames), the graph walk from the name term, the graph walk
from {name, file}, and the walk after learning-to-rank; a second chart shows the
same comparison on the Enron execs corpus.]
Learning
• There is no single “best” measure of similarity:
– How can you learn how to better rank graph nodes, for a
particular task?
• Learning methods for graph walks:
– The parameters can be adjusted using gradient descent methods
(Diligenti et al, IJCAI 2005)
– We explored a node re-ranking approach, which can take
advantage of a wider range of features (and is
complementary to parameter tuning)
• Features of candidate answer y describe the set of paths
from query x to y
Re-ranking overview
Boosting-based reranking, following (Collins and Koo, Computational Linguistics, 2005):
A training example includes:
– a ranked list of li nodes
– each node represented through m features
– at least one known correct node
Scoring function: a linear combination of the features and the original
graph-walk score y~x:
F(y) = a0 * log(ScoreWalk(y)) + a1 * f1(y) + ... + am * fm(y)
Find the weights a that minimize the boosted version of the exponential loss,
summing over examples i and incorrect candidates j:
ExpLoss(a) = Σi Σj exp( −( F(yi,correct) − F(yi,j) ) )
Requires binary features and has a closed-form formula to find the best feature and
delta in each iteration.
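The scoring function and exponential loss above can be sketched as follows. The candidate encoding and feature names are made up for illustration, and the boosting loop itself (the closed-form per-feature update) is omitted.

```python
# Reranking setup: each candidate is (walk_score, {binary feature: 0/1});
# weights are (w0, {feature: weight}). The exponential loss compares the
# correct node's score against each competitor's.
import math

def score(cand, w):
    """F(y) = w0 * log(walk score) + sum_j wj * fj(y)."""
    w0, wf = w
    walk_score, feats = cand
    return w0 * math.log(walk_score) + sum(wf.get(f, 0.0) * v for f, v in feats.items())

def exp_loss(examples, w):
    """examples: list of (correct_candidate, [competitor_candidates])."""
    total = 0.0
    for correct, others in examples:
        sc = score(correct, w)
        total += sum(math.exp(-(sc - score(o, w))) for o in others)
    return total
```

Raising the weight of a feature that fires on correct nodes but not on competitors widens the score margin and so strictly lowers this loss, which is what the boosting iterations exploit.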
Path-describing Features
• The set of paths to a target node at step k is recovered in full.
[Figure: nodes x1…x5 over walk steps K=0, 1, 2; the paths reaching x3 at k=2 are
x2 → x1 → x3, x4 → x1 → x3, x2 → x2 → x3, and x2 → x3]
‘Edge unigram’ features:
was edge type l used in reaching x from Vq?
‘Edge bigram’ features:
were edge types l1 and l2 used (in that order) in reaching x from Vq?
‘Top edge bigram’ features:
were edge types l1 and l2 used (in that order) in reaching x from Vq,
among the top two highest-scoring paths?
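Extracting the edge unigram and bigram features from a set of recovered paths is straightforward to sketch; the label strings below are illustrative stand-ins for real edge types.

```python
# Turn recovered paths (each a sequence of edge labels from Vq to the
# candidate node) into binary path-describing features: edge unigrams
# (label l used on some path) and edge bigrams (l1 then l2 in order).
def path_features(paths):
    feats = set()
    for labels in paths:
        for l in labels:
            feats.add(('unigram', l))
        for l1, l2 in zip(labels, labels[1:]):
            feats.add(('bigram', l1, l2))
    return feats
```

The “top edge bigram” variant would apply the same bigram extraction to only the two highest-scoring paths.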
Results
Threading
Threading is an interesting problem, because:
• Thread structural information is often irregular, so thread discourse
should be captured using an intelligent approach (D.D. Lewis and K.A. Knowles,
Threading email: A preliminary study, Information Processing and Management, 1997)
• Threading information can improve message categorization into topical
folders (B. Klimt and Y. Yang, The Enron corpus: A new dataset for email classification
research, ECML, 2004)
• Adjacent messages in a thread can be assumed to be the most similar to each
other in the corpus; therefore, threading is related to the general problem of
finding similar messages in a corpus.
The task: given a message, retrieve adjacent messages in the thread
Some intuition ?
[Figure, built up over four slides: a message file x connects to its thread
neighbors through shared content, the social network, and the timeline.]
Threading: experiments
1. Baseline: TF-IDF similarity
Consider all the available information (header & body) as text
2. Graph walk: uniform
Start from the file node, 2 steps, uniform edge weights
3. Graph walk: random
Start from the file node, 2 steps, random edge weights (best out of 10)
4. Graph walk: reranked
Rerank the output of (3) using the graph-describing features
Results
Highly-ranked edge-bigrams:
• sent-from → sent-to⁻¹
• date-of → date-of⁻¹
• has-term → has-term⁻¹
Finding Meeting Attendees
Extended graph contains 2 months of calendar data:
Main Contributions
• Presented an extended similarity measure incorporating non-textual
objects
• Finite lazy random walks to perform typed search
• A re-ranking paradigm to improve on graph walk results
• Instantiation of this framework for email
• Defined and evaluated novel tasks for email
Another Task that Can be Formulated as
a Graph Query: GeneId-Ranking
• Given:
– a biomedical paper abstract
• Find:
– the geneId for every gene mentioned in the abstract
• Method:
– from paper x, ranked list of geneId y: x~y
• Background resources:
– a “synonym list”: geneId → { name1, name2, ... }
– one or more protein NER systems
– training/evaluation data: pairs of (paper, {geneId1, ...., geneIdn})
Sample abstracts and synonyms
True labels, from the synonym list, e.g.:
• MGI:96273: Htr1a; 5-hydroxytryptamine (serotonin) receptor 1A; 5-HT1A receptor
• MGI:104886: Gpx5; glutathione peroxidase 5; Arep; ...
• 52,000+ for mouse, 35,000+ for fly
Candidate mentions in each abstract come from an NER extractor.
Graph for the task....
[Figure: abstracts (e.g. file:doc115) connect via hasProtein edges to extracted
protein strings (“HT1A”, “CA1”, “HT1”); protein and synonym strings connect via
hasTerm edges to term nodes (term:HT, term:1, term:A, term:hippocampus,
term:CA, ...), with inFile edges back to documents; synonym strings (“5-HT1A
receptor”, “Htr1a”, “eIF-1A”, ...) connect via synonym edges to geneIds
(MGI:46273, MGI:95298, ...). A second version of the figure adds noisy training
abstracts (file:doc214, file:doc523, ..., file:doc6273) linked to their geneIds.]
Experiments
• Data: Biocreative Task 1B
– mouse: 10,000 train abstracts, 250 devtest, using first
150 for now; 50,000+ geneIds; graph has 525,000+
nodes
• NER systems:
– likelyProtein: trained on yapex.train using off-the-shelf
NER systems (Minorthird)
– possibleProtein: same, but modified (on yapex.test) to
optimize F3, not F1 (rewards recall over precision)
Experiments with NER

                          Token            Span
                          Prec.   Recall   Prec.   Recall   F1
yapex.test   likely       94.9    64.8     87.2    62.1     72.5
             possible     49.0    97.4     47.2    82.5     60.0
mouse        likely       81.6    31.3     66.7    26.8     45.3
             possible     43.9    88.5     30.4    56.6     39.6
             dictionary   50.1    46.9     24.5    43.9     31.4
Experiments with Graph Search
• Baseline method:
– extract entities of type x
– for each string of type x, find best-matching synonym,
and then its geneId
• consider only synonyms sharing >=1 token
• Soft/TFIDF distance
• break ties randomly
– rank geneId’s by number of times they are reached
• rewards multiple mentions (even via alternate synonyms)
• Evaluation:
– average, over 50 test documents, of
• non-interpolated average precision (plausible for curators)
• max F1 over all cutoffs
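The baseline procedure above can be sketched as follows. Plain token-set overlap (Jaccard) stands in for the Soft/TFIDF distance used in the talk, ties fall to dictionary order rather than being broken randomly, and all names and data are illustrative.

```python
# Baseline geneId ranking: for each extracted mention, find the
# best-matching synonym among those sharing >= 1 token, follow it to
# its geneId, and rank geneIds by how often they are reached
# (rewarding multiple mentions, even via alternate synonyms).
from collections import Counter

def rank_gene_ids(mentions, synonyms):
    """mentions: extracted strings; synonyms: {synonym_string: gene_id}."""
    hits = Counter()
    for m in mentions:
        mtoks = set(m.lower().split())
        best, best_score = None, 0.0
        for syn, gid in synonyms.items():
            stoks = set(syn.lower().split())
            shared = mtoks & stoks
            if not shared:
                continue                               # require >= 1 shared token
            score = len(shared) / len(mtoks | stoks)   # Jaccard as a stand-in
            if score > best_score:
                best, best_score = gid, score
        if best is not None:
            hits[best] += 1
    return [gid for gid, _ in hits.most_common()]
```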
Experiments with Graph Search

mouse dataset                    MAP     maxF1
likelyProtein + softTFIDF        45.0    58.1
possibleProtein + softTFIDF      62.6    74.9
graph walk                       51.3    64.3
Baseline vs Graphwalk
• Baseline includes:
– softTFIDF distances from NER entity to gene synonyms
– knowledge that the “shortcut” path doc → entity → synonym → geneId is
important
• Graph includes:
– IDF effects, correlations, training data, etc
• Proposed graph extension:
– add softTFIDF and “shortcut” edges
• Learning and reranking:
– start with “local” features fi(e) of edges e = u → v
– for answer y, compute expectations: E( fi(e) | start=x, end=y )
– use expectations as feature values and voted perceptron (Collins,
2002) as learning-to-rank method.
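The expectation features above can be sketched for the simple case where the local feature is an indicator of an edge label: conditioned on the walk starting at x and ending at y, the expected number of times each label is traversed is a probability-weighted count over the paths. The path representation here is an illustrative assumption.

```python
# Expected edge-label counts, conditioned on the walk ending at y.
# paths: [(prob, [edge_labels])] -- the probability and label sequence of
# each walk from x that ends at candidate y.
from collections import defaultdict

def expected_edge_features(paths):
    total = sum(p for p, _ in paths) or 1.0   # Pr(end = y | start = x)
    exp = defaultdict(float)
    for p, labels in paths:
        for l in labels:
            exp[l] += p / total               # each traversal of l counts once
    return dict(exp)
```

These per-label expectations then serve directly as real-valued feature vectors for the voted perceptron.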
Experiments with Graph Search

mouse dataset                    MAP     maxF1
likelyProtein + softTFIDF        45.0    58.1
possibleProtein + softTFIDF      62.6    74.9
graph walk                       51.3    64.3
walk + extra links               73.0    80.7
walk + extra links + learning    79.7    83.9
Experiments with Graph Search
Hot off the presses
• Ongoing work: learn NER system from pairs of
(document,geneIdList)
– much easier to obtain training data than documents in which every
occurrence of every gene name is highlighted (usual NER training
data)
– obtains F1 of 71.4 on mouse data (vs 45.3 by training on YAPEX
data, which is from different distribution)
– Joint work with Richard Wang, Bob Frederking, Anthony Tomasic
Experiments with Graph Search

mouse dataset                    MAP (Yapex Trained)   MAP (MGI Trained)
likelyProtein + softTFIDF        45.0                  72.7
possibleProtein + softTFIDF      62.6                  65.7
graph walk                       51.3                  54.4
walk + extra links               73.0                  76.7
walk + extra links + learning    79.7                  84.2
Summary
• Contributions:
– a very simple query language for graphs, based
on a diffusion-kernel (damped PageRank,...)
similarity metric
– experiments on natural types of queries:
• finding likely meeting attendees
• finding related documents (email threading)
• disambiguating person and gene/protein entity
names
– techniques for learning to answer queries
• reranking using expectations of simple, local features
• tune performance to a particular “similarity”
Summary
• Some open problems:
– scalability & efficiency:
• a K-step walk on a node-node graph with fan-out b is
O(KbN)
• accurate sampling takes on the order of a minute for
10-step walks on graphs with ~10^6 nodes
– faster, better learning methods:
• combine re-ranking with learning parameters of
graph walk
– add language modeling, topic modeling:
• extend graph to include models as well as data
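The sampling approach mentioned under scalability can be sketched as plain Monte Carlo: simulate many independent damped walkers instead of propagating the full distribution, so cost scales with the number of samples rather than O(KbN). The unlabeled-graph encoding and parameters here are simplifying assumptions for illustration.

```python
# Monte Carlo approximation of the damped-walk similarity: each walker
# stops with probability alpha per step (or at a dead end) and "outputs"
# its current node; x~y is estimated by the fraction of walkers ending at y.
import random
from collections import Counter

def sampled_similarity(graph, x, alpha=0.5, max_steps=10, n_samples=5000):
    """graph: {node: [neighbors]}. Returns an approximate {y: x~y}."""
    out = Counter()
    for _ in range(n_samples):
        z = x
        for _ in range(max_steps):
            if random.random() < alpha or not graph.get(z):
                break                       # stop and output z
            z = random.choice(graph[z])     # step to a uniform neighbor
        out[z] += 1
    return {y: c / n_samples for y, c in out.items()}
```

Accuracy improves as O(1/sqrt(n_samples)), and each sample touches at most max_steps nodes, which is what makes caching repeatedly visited nodes pay off.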