ACM IEEE Joint Conference on Digital Libraries 2007
Adaptive Graphical Approach to Entity Resolution
Dmitri V. Kalashnikov
Stella Chen, Dmitri V. Kalashnikov, Sharad Mehrotra
Computer Science Department
University of California, Irvine
Additional information is available at http://www.ics.uci.edu/~dvk
Copyright © by Dmitri V. Kalashnikov, 2007
Structure of the Talk
→ Motivation
• Generic Disambiguation Framework
– High-level
• Entity Resolution Approach
– Part of the Framework
• Experiments
Entity Resolution & Data Cleaning
Raw Dataset(s)
A "nice" regular Database
...J. Smith ...
MIT
Intel Inc.
.. John Smith ...
.. Jane Smith ...
?
• Uncertainty
• Errors
• Missing data
Analysis on bad data leads to wrong conclusions!
Why do we need “Entity Resolution”?
Hi, I’m Jane Smith.
I’d like to apply for
a faculty position.
Wow! I am sure we
will accept a strong
candidate like that!
OK, let me check
something quickly …
???
Publications:
1.
……
2.
……
3.
……
Jane Smith – Fresh Ph.D.
CiteSeer Rank
Tom - Recruiter
What is the problem?
Suspicious entries
– Let's go to the DBLP website, which stores bibliographic entries of many CS authors
– Let's check two people
– “A. Gupta”
– “L. Zhang”
CiteSeer: the top-k most cited authors
DBLP
DBLP
Comparing raw and cleaned CiteSeer
Raw CiteSeer’s Top-K
Most Cited Authors
Cleaned CiteSeer’s Top-K
Most Cited Authors
Rank          Author               Location
1 (100.00%)   douglas schmidt      cs@wustl
2 (100.00%)   rakesh agrawal       almaden@ibm
3 (100.00%)   hector garciamolina  @
4 (100.00%)   sally floyd          @aciri
5 (100.00%)   jennifer widom       @stanford
6 (100.00%)   david culler         cs@berkeley
6 (100.00%)   thomas henzinger     eecs@berkeley
7 (100.00%)   rajeev motwani       @stanford
8 (100.00%)   willy zwaenepoel     cs@rice
9 (100.00%)   van jacobson         lbl@gov
10 (100.00%)  rajeev alur          cis@upenn
11 (100.00%)  john ousterhout      @pacbell
12 (100.00%)  joseph halpern       cs@cornell
13 (100.00%)  andrew kahng         @ucsd
14 (100.00%)  peter stadler        tbi@univie
15 (100.00%)  serge abiteboul      @inria
What is the lesson?
“Garbage in, garbage out” principle:
Making decisions based on bad data can lead to wrong results.
– Data should be cleaned first
– E.g., determine the (unique) real authors of publications
– Solving such challenges is not always “easy”
– This explains a large body of work on Entity Resolution
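The effect of cleaning on a top-k list can be illustrated with a small sketch. All records below are made up for illustration (they are not CiteSeer data): three different authors share the printed name "j. smith", so grouping by name string inflates that entry, while grouping by resolved author does not.

```python
from collections import Counter

# (author string as printed, resolved author, citation count); all values hypothetical
records = [
    ("j. smith", "john smith", 40),
    ("j. smith", "jane smith", 35),
    ("j. smith", "jack smith", 30),
    ("r. roe", "r. roe", 90),
]

raw = Counter()      # citations grouped by the raw name string
cleaned = Counter()  # citations grouped by the resolved author
for name, author, cites in records:
    raw[name] += cites
    cleaned[author] += cites

raw_top = raw.most_common(1)[0]          # most cited entry before cleaning
cleaned_top = cleaned.most_common(1)[0]  # most cited entry after cleaning
print(raw_top, cleaned_top)
```

In this toy data the raw ranking is led by the merged "j. smith" entry, while the cleaned ranking is led by a single real author.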
Typical Data Processing Flow
Raw Data
Extraction
Data Cleaning
Representation
Analysis
Two most common types of Entity Resolution
Fuzzy lookup
– match references to objects
– list of all objects is given
– [SDM’05], [TODS’06]
Fuzzy grouping
– group references that co-refer
– [IQIS’05], [JCDL’07]
...J. Smith ...
MIT
Intel Inc.
.. John Smith ...
.. Jane Smith ...
Structure of the Talk
• Motivation
→ Generic Framework
– High-level
• Approach
– Part of the Framework
• Experiments
Traditional Approach to Entity Resolution
[Figure: two references, "J. Smith" and "Jane Smith" (X and Y), are compared feature by feature (f2 with f2, f3 with f3); example feature values: js@mit.edu, sm@yahoo.com]
Traditional Methods: Features and Context
s(X,Y) = f(X,Y)
Similarity = Similarity of Features
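A minimal sketch of feature-based similarity f(X,Y), under stated assumptions: the string measure (difflib's ratio), the feature weights, and the assignment of the example email addresses to each reference are all illustrative choices, not taken from the talk.

```python
from difflib import SequenceMatcher

def string_sim(a: str, b: str) -> float:
    """Similarity of two strings in [0, 1] (difflib ratio; an illustrative choice)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def feature_similarity(x: dict, y: dict, weights: dict) -> float:
    """f(X, Y): weighted average of per-feature similarities over shared features."""
    total, norm = 0.0, 0.0
    for name, w in weights.items():
        if name in x and name in y:
            total += w * string_sim(x[name], y[name])
            norm += w
    return total / norm if norm else 0.0

# Hypothetical references; which email belongs to which reference is assumed.
x = {"name": "J. Smith", "email": "js@mit.edu"}
y = {"name": "Jane Smith", "email": "sm@yahoo.com"}
weights = {"name": 0.7, "email": 0.3}  # assumed weights

s = feature_similarity(x, y, weights)
print(round(s, 3))
```

A score of this kind uses only the two references themselves, which is the limitation the next slides address.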
Key Observation: More Info is Available
Jane Smith
A "nice" regular Database
=
J. Smith
John Smith
Solution: Main Idea
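New Paradigm
[Figure: the traditional feature comparison of X and Y (f1 with f1, f2 with f2, f3 with f3, f4 with f4) is combined (+) with relationship analysis on an ARG in which X and Y are linked through nodes A to F]
Traditional Methods: features and context
Relationship Analysis
s(X,Y) = c(X,Y) + γ f(X,Y)
Similarity = Similarity of Features + “Connection Strength”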
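The combined score s(X,Y) = c(X,Y) + γ f(X,Y) can be sketched as follows. The connection-strength measure here (a sum of decay^length over short simple paths) is a stand-in chosen for illustration, not the measure used by the authors; the graph, `decay`, `max_len`, and the value of f are likewise assumptions.

```python
def connection_strength(graph, x, y, max_len=4, decay=0.5):
    """c(X, Y): sum of decay**k over simple X-Y paths with k <= max_len edges."""
    total = 0.0
    stack = [(x, (x,))]  # (current node, nodes visited on this path)
    while stack:
        node, path = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt in path:
                continue  # keep paths simple (no repeated nodes)
            if nxt == y:
                total += decay ** len(path)  # len(path) edges once Y is reached
            elif len(path) < max_len:
                stack.append((nxt, path + (nxt,)))
    return total

def similarity(c, f, gamma):
    """s(X, Y) = c(X, Y) + gamma * f(X, Y)."""
    return c + gamma * f

# Hypothetical undirected graph: X-A-Y and X-B-C-Y.
graph = {
    "X": ["A", "B"], "A": ["X", "Y"], "B": ["X", "C"],
    "C": ["B", "Y"], "Y": ["A", "C"],
}
c = connection_strength(graph, "X", "Y")  # 0.5**2 + 0.5**3
s = similarity(c, f=0.5, gamma=1.0)       # f and gamma are assumed values
print(c, s)
```

Shorter and more numerous paths between X and Y raise c, so two references with identical feature similarity can still receive different overall scores.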
Illustrative Example
“Indirect connections”
– Suppose your co-worker’s name is “John White”
– Suppose you see on the Web, on my homepage
– My name: “Dmitri …”
– Somebody named: “John White”
– Who is the “John White”?
– From data you might establish a connection:
   Dmitri --Visited--> JCDL’07 <--Visited-- <you> --WorksAT--> <your ORG> <--WorksAT-- John White
– “Dmitri” might be connected to more “John White”’s…
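The indirect connection in this example can be recovered with a plain breadth-first search over the relationship graph. This is only a sketch of the idea: the edge list encodes the four relationships named on the slide, and the function name `shortest_connection` is hypothetical.

```python
from collections import deque

# (node, relationship, node) triples from the example
edges = [
    ("Dmitri", "Visited", "JCDL'07"),
    ("<you>", "Visited", "JCDL'07"),
    ("<you>", "WorksAT", "<your ORG>"),
    ("John White", "WorksAT", "<your ORG>"),
]

adj = {}
for a, label, b in edges:  # treat relationships as undirected for path finding
    adj.setdefault(a, []).append((label, b))
    adj.setdefault(b, []).append((label, a))

def shortest_connection(adj, src, dst):
    """Return the shortest path as [node, label, node, ...], or None if unconnected."""
    queue = deque([src])
    prev = {src: None}  # node -> (previous node, edge label)
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for label, nxt in adj.get(node, ()):
            if nxt not in prev:
                prev[nxt] = (node, label)
                queue.append(nxt)
    if dst not in prev:
        return None
    path = [dst]
    while prev[path[-1]] is not None:
        node, label = prev[path[-1]]
        path.extend([label, node])
    return path[::-1]

path = shortest_connection(adj, "Dmitri", "John White")
print(" ".join(path))
```

If several people named "John White" appear in the data, the same search can be run for each candidate and the resulting connections compared.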
Key Features of the Framework
Our goal is to create a framework with:
– solid theoretic foundation
– lookup
– domain-independent framework
– self-tuning
– scales to large datasets
– robust under uncertainty
– high disambiguation quality
Structure of the Talk
• Motivation
• Generic Framework
– High-level
→ Approach
– Part of the Framework
• Experiments
Approach
• Graph Creation
– Entity-Relationship Graph
• Consolidation Algorithm
– Bottom-up clustering
• Adaptiveness to data
– That is, self-tuning
– Supervised learning
• External Data
– To improve the quality further
– A theoretic possibility
– Not tested yet
ER Graph Creation
Virtual Connected Subgraph (VCS)
• VCS
– Similarity edges form VCSs
– Subgraphs in the ER graph
1. “Virtual”
– Contains only similarity edges
2. “Connected”
– A path between any 2 nodes
3. Completeness
– Adding more nodes/edges would violate (1) and (2)
• Logically, the Goal is
– Partition each VCS properly
[Figure legend: Nodes: publication, person, department, organization. Edges: regular, similarity.]
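Under the three properties above, the VCSs are the maximal connected groups of nodes when only similarity edges are considered. A minimal sketch using union-find follows; the reference names and similarity edges are hypothetical, and omitting single-node groups is a choice made here for brevity.

```python
def virtual_connected_subgraphs(nodes, similarity_edges):
    """Group nodes into maximal sets connected through similarity edges only."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in similarity_edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb  # merge the two groups

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    # Singleton groups have nothing to disambiguate, so they are skipped here.
    return [g for g in groups.values() if len(g) > 1]

# Hypothetical references and similarity edges
refs = ["J. Smith", "John Smith", "Jane Smith", "A. Gupta"]
sim_edges = [("J. Smith", "John Smith"), ("J. Smith", "Jane Smith")]

vcs = virtual_connected_subgraphs(refs, sim_edges)
print(vcs)
```

Each returned group is then a separate partitioning problem: the references inside it may or may not co-refer.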
Consolidation Algorithm: Merging
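The slide content is not preserved here, so the following is only a generic sketch of bottom-up clustering within one VCS, not the authors' merging procedure. The average-linkage criterion, the `threshold` parameter, and the pairwise scores are assumptions for illustration.

```python
from itertools import combinations

def consolidate(refs, sim, threshold):
    """Greedy bottom-up clustering: merge the most similar pair of clusters
    until no pair reaches the threshold (average linkage; an assumed criterion)."""
    clusters = [frozenset([r]) for r in refs]

    def cluster_sim(a, b):
        pairs = [(x, y) for x in a for y in b]
        return sum(sim(x, y) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda p: cluster_sim(*p))
        if cluster_sim(a, b) < threshold:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters

# Hypothetical pairwise similarity scores s(X, Y)
scores = {
    frozenset({"r1", "r2"}): 0.9,
    frozenset({"r1", "r3"}): 0.2,
    frozenset({"r2", "r3"}): 0.3,
}
sim = lambda x, y: scores[frozenset((x, y))]

clusters = consolidate(["r1", "r2", "r3"], sim, threshold=0.5)
print(clusters)
```

Here r1 and r2 are merged first, and r3 stays separate because its average similarity to the merged cluster falls below the threshold.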
Self-tuning via Supervised Learning
Self-tuning (2)
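The details of the self-tuning slides are not preserved here. As a rough illustration of what tuning from labeled data can look like, the sketch below picks γ in s = c + γ f by grid search over labeled reference pairs. The decision rule (co-refer if s ≥ `cut`), the candidate values, and the training pairs are all assumptions, not the authors' learning method.

```python
def tune_gamma(labeled_pairs, candidates, cut=1.0):
    """Return the candidate gamma with the highest pair-classification accuracy.

    labeled_pairs: iterable of (c, f, is_same) with connection strength c,
    feature similarity f, and the ground-truth co-reference label.
    """
    def accuracy(gamma):
        hits = sum((c + gamma * f >= cut) == is_same
                   for c, f, is_same in labeled_pairs)
        return hits / len(labeled_pairs)

    return max(candidates, key=accuracy)

# Hypothetical training pairs: (c, f, co-refer?)
pairs = [
    (0.6, 0.5, True),
    (0.1, 0.8, False),
    (0.4, 0.8, True),
    (0.0, 0.7, False),
]
best = tune_gamma(pairs, candidates=[0.5, 1.0, 2.0])
print(best)
```

With this toy data, a small γ under-merges and a large γ over-merges, so the middle candidate classifies all four pairs correctly.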
External Knowledge to Improve Quality
Structure of the Talk
• Motivation
• Generic Framework
– High-level
• Approach
– Part of the Framework
→ Experiments
Quality
“Context” was proposed in [Bhattacharya et al., DMKD’04]
The two algorithms were proposed in [Dong et al., SIGMOD’05]
Scalability & Efficiency
Impact of Random Relationships
Contact Information
• Info about our disambiguation project
– http://www.ics.uci.edu/~dvk
• Overall design
– Dmitri V. Kalashnikov
– dvk [at] domain
• Implementation details in JCDL’07
– Zhaoqi (Stella) Chen
– chenz [at] domain
– domain = ics.uci.edu