A Framework for Learning to
Query Heterogeneous Data
William W. Cohen
Machine Learning Department and Language Technologies Institute
School of Computer Science
Carnegie Mellon University
joint work with:
Einat Minkov, Andrew Ng, Richard Wang, Anthony Tomasic, Bob Frederking
Outline
• Two views on data quality:
– Cleaning your data vs living with the mess.
– “A lazy/Bayesian view of data cleaning”
• A framework for querying dirty data
– Data model
– Query language
– Baseline results (biotext and email)
– How to improve results with learning
• Learning to re-rank query output
• Conclusions
[Science 1959]
A Bayesian Looks at Record Linkage
• Record linkage problem: given two sets of records A={a1,…,am} and B={b1,…,bn}, determine when referent(ai)=referent(bj)
• Idea: compute for each ai,bj pair Pr(referent(ai)=referent(bj))
• Pick two thresholds:
  – Pr(a=b) > HI → accept pairing
  – Pr(a=b) < LO → reject pairing
  – otherwise, “clerical review” by a human clerk
• Every optimal decision boundary is defined by a threshold on the ranked list.
• Thresholds depend on the prior probability of a and b matching.
A     B     Pr(A=B)
A17   B22   0.99
A43   B07   0.98
…     …     …
A21   B13   0.85
A37   B44   0.82
A84   B03   0.79
A83   B71   0.63
…     …     …
A24   B52   0.25
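A minimal sketch of the two-threshold decision rule just described (accept above HI, reject below LO, send the rest to clerical review). The probabilities are assumed to be already estimated, and the HI/LO values here are hypothetical, not from the slides:

```python
# Sketch of the Fellegi-Sunter-style two-threshold linkage decision rule.
# Assumes match probabilities Pr(a=b) have already been estimated somehow.

def linkage_decisions(scored_pairs, hi=0.90, lo=0.30):
    """scored_pairs: iterable of (a_id, b_id, prob)."""
    accepted, rejected, clerical_review = [], [], []
    for a, b, p in scored_pairs:
        if p > hi:                       # Pr(a=b) > HI -> accept pairing
            accepted.append((a, b, p))
        elif p < lo:                     # Pr(a=b) < LO -> reject pairing
            rejected.append((a, b, p))
        else:                            # otherwise -> "clerical review"
            clerical_review.append((a, b, p))
    return accepted, rejected, clerical_review

pairs = [("A17", "B22", 0.99), ("A43", "B07", 0.98), ("A21", "B13", 0.85),
         ("A37", "B44", 0.82), ("A84", "B03", 0.79), ("A83", "B71", 0.63),
         ("A24", "B52", 0.25)]
acc, rej, review = linkage_decisions(pairs)
print(len(acc), "accepted;", len(rej), "rejected;", len(review), "for review")
```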
A Bayesian Looks at Record Linkage
• Every optimal decision boundary is defined by a threshold on the ranked list.
• In other words:
  – 2^(n*m) – n*m linkages can be discarded as impossible*
  – of the remaining n*m, all but those between HI and LO can be discarded as “improbable”
• But wait: why doesn’t the human clerk pick a threshold between LO and HI?
[figure: the n*m pairs, and the 2^(n*m) ways to link them]
A     B     Pr(A=B)   three of the 2^(n*m) possible linkages (M = matched, U = unmatched)
A17   B22   0.99      M  M  U
A43   B07   0.98      M  M  U
A16   B23   0.91      M  U  U
A21   B13   0.85      M  U  U
A37   B44   0.82      U  U  U
A84   B03   0.79      U  U  M
A83   B71   0.63      U  U  M
A91   B21   0.46      U  U  M
A24   B52   0.25      U  U  M
...
A Bayesian Looks at Record Linkage
A     B     Pr(A=B)
A17   B22   0.99      M
A43   B07   0.98      M
A32   B72   0.91      M
A21   B13   0.85      M
A37   B44   0.82      U  M  M  M  U
A84   B03   0.79      U  M  M  U  U
A83   B71   0.63      M  M  U  U  U
A21   B43   0.46      U
A24   B52   0.25      U
An alternate view of the process:
1. F-S’s method answers the question directly for the cases that everyone would agree on.
2. Human effort is used to answer the cases that are a little harder.
A Bayesian Looks at Record Linkage
A     B     Pr(A=B)
A17   B22   0.99      M
A43   B07   0.98      M
A32   B72   0.91      M
A21   B13   0.85      M
A37   B44   0.82
A84   B03   0.79
A83   B71   0.63      ?
A21   B43   0.46      U
A24   B52   0.25      U

Q: is A43 in B?   A: yes (p=0.98)
Q: is A83 in B?   A: not clear…
Q: is A21 in B?   A: unlikely
Passing linkage decisions along to the user
Usual goal: link records and create a single highly accurate database for users to query.
• Equality is often uncertain, given available information about an entity
  – “name: T. Kennedy, occupation: terrorist”
• The interpretation of “equality” may change from user to user and application to application
  – Does “Boston Market” = “McDonalds”?
Alternate goal: wait for a query, then answer it, propagating uncertainty about linkage decisions on that query to the end user.
WHIRL project (1997-2000)
• WHIRL initiated when at AT&T Bell Labs
AT&T Research
AT&T Labs - Research
AT&T Labs
AT&T Research
AT&T Research – Shannon Laboratory
AT&T Shannon Labs
When are two entities the same?
• Bell Labs [1925]
• Bell Telephone Labs
• AT&T Bell Labs
• A&T Labs
• AT&T Labs—Research
• AT&T Labs Research, Shannon Laboratory
• Shannon Labs
• Bell Labs Innovations
• Lucent Technologies/Bell Labs Innovations

History of Innovation: From 1925 to today, AT&T has attracted some of the world's greatest scientists, engineers and developers…. [www.research.att.com]

Bell Labs Facts: Bell Laboratories, the research and development arm of Lucent Technologies, has been operating continuously since 1925… [bell-labs.com]
When are two entities the same?
“Buddhism rejects the key element in folk psychology: the idea of a self (a unified personal identity that is continuous through time)… King Milinda and Nagasena (the Buddhist sage) discuss … personal identity… Milinda gradually realizes that "Nagasena" (the word) does not stand for anything he can point to: … not … the hairs on Nagasena's head, nor the hairs of the body, nor the "nails, teeth, skin, muscles, sinews, bones, marrow, kidneys, ..." etc… Milinda concludes that "Nagasena" doesn't stand for anything… If we can't say what a person is, then how do we know a person is the same person through time? … There's really no you, and if there's no you, there are no beliefs or desires for you to have… The folk psychology picture is profoundly misleading and believing it will make you miserable.” -S. LaFave
Traditional approach: linkage first, then queries. Uncertainty about what to link must be decided by the integration system, not the end user.
WHIRL vision:
SELECT R.a,S.a,S.b,T.b FROM R,S,T WHERE R.a=S.a and S.b=T.b
[Query Q → link items as needed by Q]
R.a       S.a      S.b      T.b
Anhai     Anhai    Doan     Doan      } strongest links: those agreeable to most users
Dan       Dan      Weld     Weld
William   Will     Cohen    Cohn      } weaker links: those agreeable to some users
Steve     Steven   Minton   Mitton
William   David    Cohen    Cohn      } even weaker links…
WHIRL vision:
DB1 + DB2 ≠ DB
SELECT R.a,S.a,S.b,T.b FROM R,S,T WHERE R.a~S.a and S.b~T.b    (~ : TFIDF-similar)
[Query Q → link items as needed by Q]
Incrementally produce a ranked list of possible links, with “best matches” first. User (or downstream process) decides how much of the list to generate and examine.
R.a       S.a      S.b      T.b
Anhai     Anhai    Doan     Doan
Dan       Dan      Weld     Weld
William   Will     Cohen    Cohn
Steve     Steven   Minton   Mitton
William   David    Cohen    Cohn
WHIRL queries
• Assume two relations:
review(movieTitle,reviewText): archive of reviews
listing(theatre, movieTitle, showTimes, …): now showing
review(movieTitle, reviewText):
  The Hitchhiker’s Guide to the Galaxy, 2005 | This is a faithful re-creation of the original radio series – not surprisingly, as Adams wrote the screenplay ….
  Men in Black, 1997                         | Will Smith does an excellent job in this …
  Space Balls, 1987                          | Only a die-hard Mel Brooks fan could claim to enjoy …
  …                                          | …

listing(theatre, movieTitle, showTimes, …):
  movieTitle              theatre               showTimes
  Star Wars Episode III   The Senator Theater   1:00, 4:15, & 7:30pm.
  Cinderella Man          The Rotunda Cinema    1:00, 4:30, & 7:30pm.
  …                       …                     …
WHIRL queries
• “Find reviews of sci-fi comedies” [movie domain]
  FROM review SELECT * WHERE r.text~’sci fi comedy’
  (like standard ranked retrieval of “sci-fi comedy”)
• “Where is [that sci-fi comedy] playing?”
  FROM review as r, LISTING as s, SELECT * WHERE r.title~s.title and r.text~’sci fi comedy’
  (best answers: titles are similar to each other – e.g., “Hitchhiker’s Guide to the Galaxy” and “The Hitchhiker’s Guide to the Galaxy, 2005” – and the review text is similar to “sci-fi comedy”)
WHIRL queries
• Similarity is based on TFIDF: rare words are most important.
• Search for high-ranking answers uses inverted indices….
[figure: matching the review titles (“The Hitchhiker’s Guide to the Galaxy, 2005”, “Men in Black, 1997”, “Space Balls, 1987”, …) against the listing titles (“Star Wars Episode III”, “Hitchhiker’s Guide to the Galaxy”, “Cinderella Man”, …)]
WHIRL queries
• Similarity is based on TFIDF: rare words are most important.
• Search for high-ranking answers uses inverted indices….
  – It is easy to find the (few) items that match on “important” terms
  – Search for strong matches can prune “unimportant” terms
[figure: the same title lists; years are common in the review archive, so they have low weight]
Inverted index:
  hitchhiker → movie00137
  the → movie001, movie003, movie007, movie008, movie013, movie018, movie023, movie0031, …
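A small sketch of the kind of TFIDF similarity join sketched above: candidates are found through an inverted index and scored by TFIDF-weighted cosine similarity, so rare shared tokens dominate. The tokenizer, corpus, and weighting details here are illustrative assumptions, not WHIRL’s actual implementation:

```python
import math
from collections import defaultdict

def tokenize(s):
    return s.lower().replace(",", " ").split()

def build_index(titles):
    """Inverted index: token -> set of title ids, plus IDF weights."""
    index = defaultdict(set)
    for tid, title in enumerate(titles):
        for tok in set(tokenize(title)):
            index[tok].add(tid)
    n = len(titles)
    idf = {tok: math.log(n / len(ids)) for tok, ids in index.items()}
    return index, idf

def tfidf_vector(s, idf):
    v = defaultdict(float)
    for tok in tokenize(s):
        v[tok] += idf.get(tok, 0.0)     # rare tokens get high weight, common ones ~0
    norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
    return {t: w / norm for t, w in v.items()}

def best_matches(query, titles, k=3):
    index, idf = build_index(titles)
    qv = tfidf_vector(query, idf)
    candidates = set()                  # only titles sharing at least one indexed token
    for tok in qv:
        candidates |= index.get(tok, set())
    scored = []
    for tid in candidates:
        tv = tfidf_vector(titles[tid], idf)
        sim = sum(w * tv.get(t, 0.0) for t, w in qv.items())   # cosine similarity
        scored.append((sim, titles[tid]))
    return sorted(scored, reverse=True)[:k]

listings = ["Star Wars Episode III", "Hitchhiker's Guide to the Galaxy", "Cinderella Man"]
print(best_matches("The Hitchhiker's Guide to the Galaxy, 2005", listings))
```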
WHIRL results
• This sort of worked:
  – Interactive speeds (<0.3 s/query) with a few hundred thousand tuples.
  – For 2-way joins, average precision (sort of like area under the precision-recall curve) from 85% to 100% on 13 problems in 6 domains.
  – Average precision better than 90% on 5-way joins.
WHIRL and soft integration
• WHIRL worked for a number of web-based demo applications.
  – e.g., integrating data from 30-50 smallish web DBs with <1 FTE of labor
• WHIRL could link many data types reasonably well, without engineering
• WHIRL generated numerous papers (SIGMOD98, KDD98, Agents99, AAAI99, TOIS2000, AIJ2000, ICML2000, JAIR2001)
• WHIRL was relational
  – But see ELIXIR (SIGIR2001)
• WHIRL users need to know the schema of the source DBs
• WHIRL’s query-time linkage worked only for TFIDF, token-based distance metrics
  – → text fields with few misspellings
• WHIRL was memory-based
  – all data must be centrally stored—no federated data
  – → small datasets only
WHIRL vision: very radical, everything was inter-dependent
To make SQL-like queries, the user must understand the schema of the underlying DB (and hence someone must understand DB1, DB2, DB3, ...)?
Outline
• Two views on data quality:
– Cleaning your data vs living with the mess.
– A lazy/Bayesian view of data cleaning
• A framework for querying dirty data
– Data model
– Query language
– Baseline results (biotext and email)
• How to improve results with learning
– Learning to re-rank query output
• Conclusions
BANKS: Basic Data Model
• Database is modeled as a graph
  – Nodes = tuples
  – Edges = references between tuples
    • foreign key, inclusion dependencies, ..
    • edges are directed
(User need not know the organization of the database to formulate queries.)
BANKS: Keyword search…
[figure: a paper node “MultiQuery Optimization” connected by writes/author edges to author nodes “Charuta”, “S. Sudarshan”, and “Prasan Roy”]
BANKS: Answer to Query
Query: “sudarshan roy” Answer: subtree from graph
[figure: answer subtree — the paper “MultiQuery Optimization”, connected by writes/author edges to “S. Sudarshan” and “Prasan Roy”]
BANKS: Basic Data Model (not quite so basic)
• All information is modeled as a graph
  – Nodes = tuples, or documents, or strings, or words
  – Edges = references between nodes
    • edges are directed, labeled and weighted
    • foreign key, inclusion dependencies, ...
    • doc/string D to word contained by D (TFIDF weighted, perhaps)
    • word W to doc/string containing W (inverted index)
    • [string S to strings ‘similar to’ S]
Similarity in a BANKS-like system
• Motivation: why I’m interested in
  – structured data that is partly text → similarity!
  – structured data represented as graphs; all sorts of information can be poured into this model
  – measuring similarity of nodes in graphs
• Coming up next:
  – a simple query language for graphs;
  – experiments on natural types of queries;
  – techniques for learning to answer queries of a certain type better
Yet another schema-free query language
• Assume data is encoded in a graph with:
  – a node for each object x
  – a type for each object x, T(x)
  – an edge for each binary relation r: x → y
• Queries are of this form:
  – Given type t* and node x, find y: T(y)=t* and y~x  (node similarity)
• We’d like to construct a general-purpose similarity function x~y for objects in the graph.
• We’d also like to learn many such functions for different specific tasks (like “who should attend a meeting”).
Similarity of Nodes in Graphs
Given type t* and node x, find y:T(y)=t* and y~x.
• Similarity defined by a “damped” version of PageRank
• Similarity between nodes x and y:
  – “Random surfer model”: from a node z,
    • with probability α, stop and “output” z
    • pick an edge label r using Pr(r | z) ... e.g. uniform
    • pick y uniformly from { y’ : z → y’ with label r }
    • repeat from node y ....
  – Similarity x~y = Pr( “output” y | start at x )
• Intuitively, x~y is a summation of the weights of all paths from x to y, where the weight of a path decreases exponentially with length.
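A minimal sketch of the damped random-walk similarity just described, computed by propagating a probability distribution for a fixed number of steps; the tiny graph, the step count, and the uniform choice of edge label Pr(r | z) are illustrative assumptions, not the GHIRL implementation:

```python
from collections import defaultdict

# graph[node] = {edge_label: [neighbor, ...]}   (directed, labeled edges)
graph = {
    "doc1":  {"hasTerm": ["cohen", "cmu"]},
    "doc2":  {"hasTerm": ["cohen", "dr"]},
    "cohen": {"inDoc": ["doc1", "doc2"]},
    "cmu":   {"inDoc": ["doc1"]},
    "dr":    {"inDoc": ["doc2"]},
}

def walk_similarity(start, steps=4, alpha=0.5):
    """Pr(output y | start at x): at each node stop with probability alpha,
    otherwise pick an edge label uniformly, then a neighbor uniformly."""
    dist = {start: 1.0}           # probability mass of currently being at node z
    output = defaultdict(float)   # accumulated probability of stopping at y
    for _ in range(steps):
        nxt = defaultdict(float)
        for z, p in dist.items():
            output[z] += alpha * p                 # stop and "output" z
            labels = graph.get(z, {})
            if not labels:
                continue
            for r, ys in labels.items():           # Pr(r | z): uniform over labels
                for y in ys:                       # uniform over {y' : z -r-> y'}
                    nxt[y] += (1 - alpha) * p / (len(labels) * len(ys))
        dist = nxt
    return dict(output)

# longer paths contribute exponentially less weight, via the (1 - alpha) factor per step
sims = walk_similarity("doc1")
print(sorted(sims.items(), key=lambda kv: -kv[1]))
```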
[figure: the strings “William W. Cohen, CMU” and “Dr. W. W. Cohen” both link to the word nodes cohen, william, w, dr, cmu]
optional—strings that are similar in TFIDF/cosine distance will still be “nearby” in the graph (connected by many length=2 paths)
Similarity of Nodes in Graphs
• Random surfer on graphs:
  – natural extension to PageRank
  – closely related to Lafferty’s heat diffusion kernel
    • but generalized to directed graphs
  – somewhat amenable to learning parameters of the walk (gradient search, with various optimization metrics):
    • Toutanova, Manning & Ng, ICML 2004
    • Nie et al, WWW 2005
    • Xi et al, SIGIR 2005
  – can be sped up and adapted to longer walks by sampling approaches to matrix multiplication (e.g. Lewis & E. Cohen, SODA 1998), similar to particle filtering
  – our current implementation (GHIRL): Lucene + Sleepycat with extensive use of memory caching (sampling approaches visit many nodes repeatedly)
Query: “sudarshan roy” Answer: subtree from graph
[figure: the same answer subtree as above]
y: paper(y) & y~“roy”
AND
w: paper(w) & w~“sudarshan”
Evaluation on Personal Information
Management Tasks [Minkov et al, SIGIR 2006]
Many tasks can be expressed as simple, non-conjunctive search queries in this framework, such as:
• Person Name Disambiguation in Email — novel
• Threading [eg Diehl, Getoor, Namata, 2006; Lewis & Knowles 97]
• Finding email-address aliases given a person’s name — novel
• Finding relevant meeting attendees — novel
Example questions: “What is the email address for the person named ‘Halevy’ mentioned in this presentation?” “What files from my home machine will I need for this meeting?” “What people will attend this meeting?” ... ?
Also consider a generalization: x → Vq, where Vq is a distribution over nodes x.
Email as a graph
[figure: an email corpus as a graph — node types: files (messages), dates, person names, email addresses, and terms; edge types: sent_from, sent_to, sent_date, alias, in_file, in_subj, +1_day (between consecutive dates), and their inverses (sf_inv, st_inv, sd_inv, a_inv, if_inv, is_inv)]
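A sketch of how an email corpus might be turned into the labeled, directed graph in the figure; the node naming and the inverse-edge convention (sent_from / sf_inv, etc.) follow the figure, but the message fields and data structures are illustrative assumptions:

```python
from collections import defaultdict

# adjacency: graph[node][edge_label] -> set of neighbor nodes
graph = defaultdict(lambda: defaultdict(set))

def add_edge(src, label, dst, inverse_label):
    graph[src][label].add(dst)
    graph[dst][inverse_label].add(src)      # every edge also gets an inverse edge

def add_message(msg_id, sender, recipients, date, subject, body):
    f = ("file", msg_id)
    add_edge(f, "sent_from", ("email_address", sender), "sf_inv")
    for r in recipients:
        add_edge(f, "sent_to", ("email_address", r), "st_inv")
    add_edge(f, "sent_date", ("date", date), "sd_inv")
    for term in subject.lower().split():
        add_edge(("term", term), "in_subj", f, "is_inv")
    for term in body.lower().split():
        add_edge(("term", term), "in_file", f, "if_inv")

# hypothetical example message
add_message("msg001", "einat@cs.cmu.edu", ["wcohen@cs.cmu.edu"],
            "2006-03-01", "meeting tomorrow", "can we meet about the sigir paper")
print(len(graph), "nodes in the graph")
```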
Person Name Disambiguation
Q: “who is Andy?”
[figure: the term node “andy” links through file nodes to several Person nodes, e.g. Person: Andrew Johns]
• Given: a term that is known to be a personal name but is not mentioned ‘as is’ in the header (otherwise the problem is easy)
• Output: ranked person nodes.
* This task is complementary to person name annotation in email (E. Minkov, R. Wang, W. Cohen, Extracting Personal Names from Emails: Applying Named Entity Recognition to Informal Text, HLT/EMNLP 2005)
Corpora and Datasets
[tables: (a) corpora, (b) types of names — not shown]
Example nicknames: Dave for David, Kai for Keiko, Jenny for Qing
Person Name Disambiguation
1. Baseline: string matching (& common nicknames)
   – Find persons that are similar to the name term (Jaro)
   • Successful in many cases
   • Not successful for some nicknames
   • Cannot handle ambiguity (arbitrary)
2. Graph walk: term
   – Vq: name term node (2 steps)
   • Models co-occurrences
   • Cannot handle ambiguity (dominant)
3. Graph walk: term+file
   – Vq: name term + file nodes (2 steps)
   • The file node is natural available context
   • Solves the ambiguity problem!
   • But, incorporates additional noise
4. Graph walk: term+file, reranked using learning
   – Re-rank the output of (3), using:
     • path-describing features
     • ‘source count’: do the paths originate from a single source node or from two?
     • string similarity
Results
[figures: accuracy on the person-name disambiguation corpora (including Enron execs) for: baseline string match + nicknames, graph walk from name, graph walk from {name, file}, and after learning-to-rank]
Learning
• There is no single “best” measure of similarity:
  – How can you learn to better rank graph nodes for a particular task?
• Learning methods for graph walks:
  – The parameters can be adjusted using gradient descent methods (Diligenti et al, IJCAI 2005)
  – We explored a node re-ranking approach, which can take advantage of a wider range of features (and is complementary to parameter tuning)
• Features of a candidate answer y describe the set of paths from query x to y
Re-ranking overview
Boosting-based reranking, following (Collins and Koo, Computational Linguistics, 2005):
A training example includes:
– a ranked list of l_i nodes
– each node represented through m features
– at least one known correct node
Scoring function: a linear combination of the features, plus the original graph-walk score y~x (reconstructed below).
Find w that minimizes the boosted (exponential-loss) objective; this requires binary features and has a closed-form formula to find the best feature and delta in each iteration.
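The scoring function and loss appeared as formulas on the original slide; a reconstruction in the standard Collins-and-Koo form, consistent with the bullets above, is given here as an assumption (l(y) is the log of the original graph-walk score y~x and f_1…f_m are the binary features):

```latex
% Scoring function: original score plus a linear combination of m binary features
F(y,\bar{w}) \;=\; w_0\, l(y) \;+\; \sum_{j=1}^{m} w_j\, f_j(y)

% Boosted training: over examples i with correct node y_{i,1} and alternatives
% y_{i,2},\dots,y_{i,l_i}, minimize the exponential loss
\mathrm{ExpLoss}(\bar{w}) \;=\; \sum_i \sum_{j=2}^{l_i}
    \exp\!\bigl( -\,(F(y_{i,1},\bar{w}) - F(y_{i,j},\bar{w})) \bigr)
```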
Path-describing Features
• The set of paths to a target node in step k is recovered in full.
[figure: example graph with nodes x1…x5 and walk steps K=0, 1, 2]
Paths (x3, k=2):
  x2 → x1 → x3
  x4 → x1 → x3
  x2 → x2 → x3
  x2 → x3
• ‘Edge unigram’ features: was edge type l used in reaching x from Vq?
• ‘Edge bigram’ features: were edge types l1 and l2 used (in that order) in reaching x from Vq?
• ‘Top edge bigram’ features: were edge types l1 and l2 used (in that order) in reaching x from Vq, among the top two highest scoring paths?
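A small sketch of how the edge-unigram and edge-bigram features above could be read off the recovered paths. Paths are assumed here to be lists of edge labels from Vq to the candidate node; the representation and example labels are illustrative, not the actual GHIRL data structures:

```python
def path_features(paths, top_paths=None):
    """paths: list of paths, each a list of edge labels from Vq to the candidate.
    Returns a set of binary (present/absent) feature names."""
    feats = set()
    for path in paths:
        for label in path:                             # edge unigram features
            feats.add(("edge_unigram", label))
        for l1, l2 in zip(path, path[1:]):             # edge bigram features
            feats.add(("edge_bigram", l1, l2))
    for path in (top_paths or [])[:2]:                 # top two highest-scoring paths
        for l1, l2 in zip(path, path[1:]):
            feats.add(("top_edge_bigram", l1, l2))
    return feats

# hypothetical paths reaching one candidate person node
paths = [["has_term", "if_inv", "sent_from"], ["has_term", "sf_inv"]]
print(sorted(path_features(paths, top_paths=paths)))
```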
Results
Threading
Threading is an interesting problem, because:
• There are often irregularities in thread structural information, so thread discourse should be captured using an intelligent approach (D.E. Lewis and K.A. Knowles, Threading email: A preliminary study, Information Processing and Management, 1997)
• Threading information can improve message categorization into topical folders (B. Klimt and Y. Yang, The Enron corpus: A new dataset for email classification research, ECML, 2004)
• Adjacent messages in a thread can be assumed to be the most similar to each other in the corpus. Therefore, threading is related to the general problem of finding similar messages in a corpus.
The task: given a message, retrieve adjacent messages in the thread.
Some intuition?
[figure, built up over several slides: a message (file x) connects to the adjacent messages in its thread through shared content, the social network (senders and recipients), and the timeline]
Threading: experiments
1. Baseline: TF-IDF similarity — consider all the available information (header & body) as text
2. Graph walk: uniform — start from the file node, 2 steps, uniform edge weights
3. Graph walk: random — start from the file node, 2 steps, random edge weights (best out of 10)
4. Graph walk: reranked — rerank the output of (3) using the graph-describing features
Results
Highly-ranked edge-bigrams:
• sent-from → sent-to⁻¹
• date-of → date-of⁻¹
• has-term → has-term⁻¹
Finding Meeting Attendees
[Minkov et al, CEAS 2006]
Extended graph contains 2 months of calendar data:
Another Task that Can be Formulated as a Graph Query: GeneId-Ranking
• Given:
  – a biomedical paper abstract
• Find:
  – the geneId for every gene mentioned in the abstract
• Method:
  – from paper x, a ranked list of geneIds y: x~y
• Background resources:
  – a “synonym list”: geneId → { name1, name2, ... }
  – one or more protein NER systems
  – training/evaluation data: pairs of (paper, {geneId1, ...., geneIdn})
Sample abstracts and synonyms
[figure: an abstract, the protein names found by the NER extractor, and the true geneId labels]
• MGI:96273 — Htr1a; 5-hydroxytryptamine (serotonin) receptor 1A; 5-HT1A receptor
• MGI:104886 — Gpx5; glutathione peroxidase 5; Arep; ...
• 52,000+ for mouse, 35,000+ for fly
Graph for the task....
[figure: abstracts (e.g. file:doc115) link via hasProtein edges to extracted protein strings (“HT1A”, “CA1”, “HT1”); protein strings and synonyms (“5-HT1A receptor”, “Htr1a”, “eIF-1A”, …) link via hasTerm edges to shared term nodes (term:HT, term:1, term:A, term:CA, term:hippocampus, …); synonyms link via synonym edges to geneIds (MGI:46273, MGI:95298, …)]
[figure: the same graph, extended with noisy training abstracts (file:doc214, file:doc523, file:doc6273, …) linked directly to their labeled geneIds (MGI:46273, MGI:95298, …)]
Experiments
• Data: Biocreative Task 1B
  – mouse: 10,000 train abstracts, 250 devtest (using the first 150 for now); 50,000+ geneIds; graph has 525,000+ nodes
• NER systems:
  – likelyProtein: trained on yapex.train using off-the-shelf NER systems (Minorthird)
  – possibleProtein: same, but modified (on yapex.test) to optimize F3, not F1 (rewards recall over precision; see the formula below)
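For reference, the general F-beta formula (standard, not from the slides) shows why optimizing F3 rather than F1 pushes the extractor toward recall:

```latex
F_{\beta} \;=\; \frac{(1+\beta^{2})\,P\,R}{\beta^{2}\,P + R}
\qquad\text{so } F_{1} \text{ balances precision } P \text{ and recall } R,
\text{ while } F_{3} \text{ weights recall about } \beta^{2}=9 \text{ times as heavily.}
```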
Experiments with NER
              Token                 Span
              Precision   Recall    Precision   Recall    F1
yapex.test
  likely      94.9        64.8      87.2        62.1      72.5
  possible    49.0        97.4      47.2        82.5      60.0
mouse
  likely      81.6        31.3      66.7        26.8      45.3
  possible    43.9        88.5      30.4        56.6      39.6
  dictionary  50.1        46.9      24.5        43.9      31.4
Experiments with Graph Search
• Baseline method (sketched below):
  – extract entities of type x
  – for each string of type x, find the best-matching synonym, and then its geneId
    • consider only synonyms sharing >= 1 token
    • Soft/TFIDF distance
    • break ties randomly
  – rank geneIds by the number of times they are reached
    • rewards multiple mentions (even via alternate synonyms)
• Evaluation:
  – average, over 50 test documents, of
    • non-interpolated average precision (plausible for curators)
    • max F1 over all cutoffs
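A minimal sketch of the baseline described above. The Soft/TFIDF distance is stubbed out with a trivial token-overlap score (an assumption; the actual baseline uses Soft/TFIDF), and the synonym data and extracted strings are hypothetical:

```python
import random
from collections import Counter

def sim(a, b):
    """Stand-in for the Soft/TFIDF distance used in the real baseline."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def baseline_rank(ner_strings, synonyms):
    """synonyms: {gene_id: [name1, name2, ...]}. Rank geneIds by hit count."""
    counts = Counter()
    for s in ner_strings:                              # each extracted protein string
        toks = set(s.lower().split())
        best = []
        for gid, names in synonyms.items():
            for name in names:
                if toks & set(name.lower().split()):   # only synonyms sharing >= 1 token
                    best.append((sim(s, name), random.random(), gid))
        if best:
            best.sort(reverse=True)                    # best match; random tiebreak
            counts[best[0][2]] += 1                    # multiple mentions reward the geneId
    return counts.most_common()

synonyms = {"MGI:96273": ["Htr1a", "5-HT1A receptor"],
            "MGI:104886": ["Gpx5", "glutathione peroxidase 5"]}
print(baseline_rank(["HT1A receptor", "5-HT1A"], synonyms))
```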
Experiments with Graph Search
mouse eval dataset              MAP    max F1
likelyProtein + softTFIDF       45.0   58.1
possibleProtein + softTFIDF     62.6   74.9
graph walk                      51.3   64.3
Baseline vs Graphwalk
• Baseline includes:
  – softTFIDF distances from NER entity to gene synonyms
  – knowledge that the “shortcut” path doc → entity → synonym → geneId is important
• Graph includes:
  – IDF effects, correlations, training data, etc.
• Proposed graph extension:
  – add softTFIDF and “shortcut” edges
• Learning and reranking (sketched below):
  – start with “local” features fi(e) of edges e = u → v
  – for answer y, compute expectations: E( fi(e) | start=x, end=y )
  – use the expectations as feature values and the voted perceptron (Collins, 2002) as the learning-to-rank method
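A sketch of that learning-and-reranking step: local edge features are averaged into expectations over the paths reaching each candidate answer, and a perceptron over those feature vectors reorders the candidates. The path representation and feature names are illustrative, and the voted perceptron of Collins (2002) is approximated here by its common averaged variant:

```python
from collections import defaultdict

def expectation_features(paths):
    """paths: list of (prob, [edge_label, ...]) reaching one candidate y.
    E(f_label | start=x, end=y): probability-weighted frequency of each edge label."""
    feats, total = defaultdict(float), sum(p for p, _ in paths) or 1.0
    for p, labels in paths:
        for lab in labels:
            feats[lab] += p
    return {lab: v / total for lab, v in feats.items()}

def averaged_perceptron(train, epochs=10):
    """train: list of candidate lists; each candidate is (feats, is_correct).
    Learns w so that correct candidates outscore the top-ranked wrong ones."""
    w, w_sum = defaultdict(float), defaultdict(float)
    for _ in range(epochs):
        for candidates in train:
            score = lambda f: sum(w.get(k, 0.0) * v for k, v in f.items())
            best = max(candidates, key=lambda c: score(c[0]))
            gold = max((c for c in candidates if c[1]), key=lambda c: score(c[0]))
            if not best[1]:                      # a wrong candidate ranked first: update
                for k, v in gold[0].items():
                    w[k] += v
                for k, v in best[0].items():
                    w[k] -= v
            for k, v in w.items():               # accumulate weights for averaging
                w_sum[k] += v
    return w_sum

# hypothetical training data: one query with two candidate answers
cand_bad = (expectation_features([(0.3, ["hasTerm", "hasTerm"])]), False)
cand_good = (expectation_features([(0.6, ["hasProtein", "hasTerm"]),
                                   (0.2, ["hasProtein", "synonym"])]), True)
w = averaged_perceptron([[cand_bad, cand_good]])
print(sorted(w.items()))
```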
Experiments with Graph Search
mouse eval dataset              MAP    average max F1
likelyProtein + softTFIDF       45.0   58.1
possibleProtein + softTFIDF     62.6   74.9
graph walk                      51.3   64.3
walk + extra links              73.0   80.7
walk + extra links + learning   79.7   83.9
Experiments with Graph Search
Hot off the presses
Hot off the presses
• Ongoing work: learn an NER system from pairs of (document, geneIdList)
  – much easier to obtain training data than documents in which every occurrence of every gene name is highlighted (the usual NER training data)
  – obtains F1 of 71.1 on mouse data (vs 45.3 by training on YAPEX data, which is from a different distribution)
Experiments with Graph Search
mouse eval dataset              MAP (Yapex trained)
likelyProtein + softTFIDF       45.0
possibleProtein + softTFIDF     62.6
graph walk                      51.3
walk + extra links              73.0
walk + extra links + learning   79.7
Experiments with Graph Search
mouse eval dataset              MAP (Yapex trained)   MAP (MGI trained)
likelyProtein + softTFIDF       45.0                  72.7
possibleProtein + softTFIDF     62.6                  65.7
graph walk                      51.3                  54.4
walk + extra links              73.0                  76.7
walk + extra links + learning   79.7                  84.2
Experiments on BioCreative Blind Test Set
mouse blind test data           MAP (Yapex trained)   Max F1 (Yapex trained)
likelyProtein + softTFIDF       (45.0) 36.8           (58.1) 42.1
possibleProtein + softTFIDF     (62.6) 61.1           (74.9) 67.2
graph walk                      (51.3) 64.0           (64.3) 69.5
walk + extra links + learning   (79.7) 71.1           (83.9) 75.5
(parenthesized numbers: results on the eval dataset, for comparison)
Experiments with Graph Search
mouse blind test data           MAP (Yapex trained)   Max F1 (Yapex trained)
likelyProtein + softTFIDF       36.8                  42.1
possibleProtein + softTFIDF     61.1                  67.2
graph walk                      64.0                  69.5
walk + extra links + learning   71.1                  75.5

mouse blind test data           MAP (MGI trained)     Max F1 (MGI trained)
walk + extra links + learning   80.1                  83.7
Outline
• Two views on data quality:
– Cleaning your data vs living with the mess.
– “A lazy/Bayesian view of data cleaning”
• A framework for querying dirty data
– Data model
– Query language
– Baseline results (biotext and email)
– How to improve results with learning
• Learning to re-rank query output
• Conclusions
Conclusions
• Contributions:
  – a very simple query language for graphs, based on a diffusion-kernel (damped PageRank, ...) similarity metric
  – experiments on natural types of queries:
    • finding likely meeting attendees
    • finding related documents (email threading)
    • disambiguating person and gene/protein entity names
  – techniques for learning to answer queries:
    • reranking using expectations of simple, local features
    • tuning performance to a particular “similarity”
Conclusions
• Some open problems:
  – scalability & efficiency:
    • a K-step walk on a node-node graph with fan-out b is O(KbN)
    • accurate sampling is O(1 min) for 10-step walks with O(10^6) nodes
  – faster, better learning methods:
    • combine re-ranking with learning the parameters of the graph walk
  – add language modeling, topic modeling:
    • extend the graph to include models as well as data
Conclusions
• Don’t forget that there are two views on data quality:
  – Cleaning your data vs living with the mess.
  – “A lazy/Bayesian view of data cleaning”
  – SQL/Oracle vs Google
    • vs something in between .... ?