Sharad Mehrotra

Work supported by NSF Grants IIS-0331707 and IIS-0083489
Exploiting Relationships for Object Consolidation
Zhaoqi Chen
Dmitri V. Kalashnikov
Sharad Mehrotra
Computer Science Department
University of California, Irvine
http://www.ics.uci.edu/~dvk/RelDC
http://www.itr-rescue.org (RESCUE)
ACM IQIS 2005
Talk Overview
• Motivation
• Object consolidation problem
• Proposed approach
– RelDC: Relationship-based data cleaning
– Relationship analysis and graph partitioning
• Experiments
2
Why do we need “Data Cleaning”?
Jane Smith (fresh Ph.D.): Hi, my name is Jane Smith. I’d like to apply for a faculty position at your university.
Tom (recruiter): OK, let me check something quickly …
[Tom looks up Jane Smith’s CiteSeer rank and publications: 1. …… 2. …… 3. ……]
Tom (recruiter): Wow! Unbelievable! Are you sure you will join us even if we do not offer you tenure right away?
3
What is the problem?
• Names often do not uniquely identify people
[Figure: CiteSeer list of the top-k most cited authors, shown next to DBLP pages]
4
Comparing raw and cleaned CiteSeer
Rank          Author               Location        # citations
1 (100.00%)   douglas schmidt      cs@wustl        5608
2 (100.00%)   rakesh agrawal       almaden@ibm     4209
3 (100.00%)   hector garciamolina  @               4167
4 (100.00%)   sally floyd          @aciri          3902
5 (100.00%)   jennifer widom       @stanford       3835
6 (100.00%)   david culler         cs@berkeley     3619
6 (100.00%)   thomas henzinger     eecs@berkeley   3752
7 (100.00%)   rajeev motwani       @stanford       3570
8 (100.00%)   willy zwaenepoel     cs@rice         3624
9 (100.00%)   van jacobson         lbl@gov         3468
10 (100.00%)  rajeev alur          cis@upenn       3577
11 (100.00%)  john ousterhout      @pacbell        3290
12 (100.00%)  joseph halpern       cs@cornell      3364
13 (100.00%)  andrew kahng         @ucsd           3288
14 (100.00%)  peter stadler        tbi@univie      3187
15 (100.00%)  serge abiteboul      @inria          3060
[Figure panels: Cleaned CiteSeer top-k; CiteSeer top-k]
5
Object Consolidation Problem
Representations of objects in the database: r1, r2, r3, r4, r5, r6, r7, ..., rN
Real objects in the database: o1, o2, o3, o4, o5, o6, o7, ..., oM
• Cluster the representations that correspond to the same “real” world object/entity
• Two instances of the problem: the real world objects are known, or they are unknown
6
RelDC Approach
• Exploit relationships among objects to disambiguate when the traditional approach of similarity-based clustering does not work
[Figure: RelDC Framework (Relationship-based Data Cleaning) = Traditional Methods (features and context, comparing f1 to f4 of X and Y) + Relationship Analysis (ARG over nodes A to F, X, Y)]
7
Attributed Relational Graph (ARG)
View the database as an ARG
Nodes
– per cluster of representations (if already resolved by feature-based approach)
– per representation (for “tough” cases)
Edges
– Regular: correspond to relationships between entities
– Similarity: created using feature-based methods on representations
[Figure legend, node types: person, publication, department, organization]
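To make the node and edge kinds above concrete, here is a minimal sketch of an ARG as a Python data structure. It is only an illustration: the class name `ARG`, the relationship labels `"writes"` and `"affiliated_with"`, and the similarity scores are assumptions, not taken from the talk.

```python
from collections import defaultdict


class ARG:
    """Toy attributed relational graph with typed nodes and two edge kinds."""

    def __init__(self):
        self.node_type = {}                  # node id -> "person", "publication", ...
        self.regular = defaultdict(dict)     # u -> {v: relationship type}
        self.similarity = defaultdict(dict)  # u -> {v: feature-based similarity score}

    def add_node(self, node, kind):
        self.node_type[node] = kind

    def add_regular_edge(self, u, v, rel_type):
        # Regular edges model relationships between entities (undirected here).
        self.regular[u][v] = rel_type
        self.regular[v][u] = rel_type

    def add_similarity_edge(self, u, v, score):
        # Similarity edges link representations that feature-based methods cannot tell apart.
        self.similarity[u][v] = score
        self.similarity[v][u] = score


arg = ARG()
for node, kind in [("Jane Smith", "person"), ("John Smith", "person"),
                   ("J. Smith", "person"), ("P1", "publication"),
                   ("MIT", "organization")]:
    arg.add_node(node, kind)

arg.add_regular_edge("J. Smith", "P1", "writes")              # hypothetical label
arg.add_regular_edge("John Smith", "MIT", "affiliated_with")  # hypothetical label
arg.add_similarity_edge("J. Smith", "Jane Smith", 0.5)        # hypothetical score
arg.add_similarity_edge("J. Smith", "John Smith", 0.5)        # hypothetical score
```

Keeping the two edge kinds in separate maps makes it easy to traverse only similarity edges or only regular edges, which the later steps of the approach treat differently.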
8
Context Attraction Principle (CAP)
Who is “J. Smith”?
– Jane?
– John?
[Figure: merging a new publication by “J. Smith”; the candidate authors are Jane Smith and John Smith]
9
Questions to Answer
1. Does CAP hold over real datasets? That is, if we consolidate objects based on it, will the quality of consolidation improve?
2. Can we design a generic strategy that exploits CAP for consolidation?
10
Consolidation Algorithm
1. Construct ARG and identify all virtual clusters (VCSs)
– use FBS in constructing the ARG
2. Choose a VCS and compute connection strength between nodes
– for each pair of repr. connected via a similarity edge
3. Partition the VCS
– use a graph partitioning algorithm
– partitioning is based on connection strength
– after partitioning, adjust ARG accordingly
– go to Step 2, if more potential clusters exist
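Step 1 can be sketched as follows, under the assumption (not spelled out on the slide) that a VCS is a connected component of the ARG when only similarity edges are followed. The function name `find_vcss` and the sample data are hypothetical.

```python
def find_vcss(nodes, similarity_edges):
    """Group representations into connected components over similarity edges only."""
    adj = {n: set() for n in nodes}
    for u, v in similarity_edges:
        adj[u].add(v)
        adj[v].add(u)

    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first traversal collects everything reachable from `start`.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(sorted(component))
    return components


nodes = ["r1", "r2", "r3", "r4", "r5"]
sim_edges = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]
vcss = find_vcss(nodes, sim_edges)  # two components: r1-r2-r3 and r4-r5
```

Steps 2 and 3 would then run once per component, computing connection strengths inside it and partitioning it.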
11
Connection Strength c(u,v)
Models for c(u,v)
– many possibilities
– diffusion kernels, random walks, etc
– none is fully adequate
– cannot learn similarity from data
Diffusion kernels
– base similarity of (x,y)
– via direct links (of size 1)
– indirect similarity of order k of (x,y)
– via links of size k
– B: base similarity matrix, where Bxy = B^1xy = base similarity of (x,y)
– B^k: indirect similarity matrix
– K: total similarity matrix, or “kernel”
[Figure: example graph with nodes A to H, u, v, z]
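The relation between B, B^k and K can be illustrated with a small sketch. The slide does not give the formula for K; the version below assumes one common form from the kernel literature, a decayed sum of matrix powers truncated at a maximum length. The `decay` parameter, the truncation, and the function names are assumptions made for illustration.

```python
def matmul(A, B):
    """Plain-Python product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def truncated_kernel(B, max_len, decay):
    """Sum decay**k * B**k for k = 1..max_len (assumed form of the kernel K)."""
    n = len(B)
    K = [[0.0] * n for _ in range(n)]
    Bk = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # identity
    for k in range(1, max_len + 1):
        Bk = matmul(Bk, B)  # Bk now holds B**k, the order-k indirect similarity
        for i in range(n):
            for j in range(n):
                K[i][j] += decay ** k * Bk[i][j]
    return K


# Path graph 0 - 1 - 2 with base similarity 1 on each direct link.
B = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
B2 = matmul(B, B)
K = truncated_kernel(B, max_len=2, decay=0.5)
```

Nodes 0 and 2 have no direct link, so B[0][2] is 0, but they are connected through node 1, which shows up as B2[0][2] = 1 and a nonzero entry K[0][2].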
12
Connection Strength c(u,v) (cont.)
Instantiating parameters
– Determining the base similarity of (x,y)
– regular edges have types T1,...,Tn
– types T1,...,Tn have weights w1,...,wn
– base similarity of (x,y) = wi
– get the type of a given edge
– assign this weight as base similarity
– Handling similarity edges
– base similarity of (x,y) assigned value proportional to similarity (heuristic)
– Approach to learn the base similarity of (x,y) from data (ongoing work)
Implementation
– we do not compute the whole matrix K
– we compute one c(u,v) at a time
– limit path lengths by L
[Figure: John Smith connected via edges of types T1 and T2 to P1, MIT and Alan White (N-2 further nodes elided); example paths (a), (b), (c) over nodes R1:John, R2:J.Smith, R3:John, A1:John, A3:John, A4:Alan, A5:Mike, A6:Tom, A7:Kate, P1 to P4, MIT, Stanford]
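The implementation notes above (one c(u,v) at a time, path lengths limited by L, edge weights taken from edge types) can be sketched as below. This is one possible reading, not the authors' code: it assumes the strength of a path is the product of its edge-type weights and that c(u,v) is the sum over all simple paths of at most L edges. The graph, the weights, and the function names are hypothetical.

```python
def build_adj(edges):
    """Undirected adjacency map: node -> {neighbor: edge type}."""
    adj = {}
    for a, b, edge_type in edges:
        adj.setdefault(a, {})[b] = edge_type
        adj.setdefault(b, {})[a] = edge_type
    return adj


def simple_paths(adj, u, v, max_len):
    """Yield every simple path from u to v that uses at most max_len edges."""
    def dfs(node, path):
        if node == v:
            yield list(path)
            return
        if len(path) - 1 == max_len:  # edge budget used up without reaching v
            return
        for nxt in adj.get(node, {}):
            if nxt not in path:       # simple path: never revisit a node
                path.append(nxt)
                yield from dfs(nxt, path)
                path.pop()
    yield from dfs(u, [u])


def connection_strength(adj, type_weight, u, v, max_len):
    """Sum, over all L-short simple paths, of the product of edge-type weights."""
    total = 0.0
    for path in simple_paths(adj, u, v, max_len):
        strength = 1.0
        for a, b in zip(path, path[1:]):
            strength *= type_weight[adj[a][b]]
        total += strength
    return total


edges = [("u", "a", "T1"), ("a", "v", "T1"),
         ("u", "b", "T2"), ("b", "v", "T2"), ("a", "b", "T1")]
adj = build_adj(edges)
weights = {"T1": 0.5, "T2": 0.25}  # hypothetical weights w1, w2

c_short = connection_strength(adj, weights, "u", "v", max_len=2)
c_long = connection_strength(adj, weights, "u", "v", max_len=3)
```

Raising the limit from 2 to 3 adds the two longer paths u-a-b-v and u-b-a-v, so c_long is larger than c_short. Enumerating paths grows quickly with L, which is why the limit matters in practice.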
13
Consolidation via Partitioning
Observations
– each VCS contains representations of at least 1 object
– if a repr. is in VCS, then the rest of repr. of the same object are in it too
Partitioning
– two cases
– k, the number of entities in VCS, is known
– k is unknown
– when k is known, use any partit. algo
– maximize inside-con, minimize outside-con.
– we use [Shi,Malik’2000]
– normalized cut
– when k is unknown
– split into two: just to see the cut
– compare cut against threshold
– decide “to split” or “not to split”
– Iterate
[Figure: two example graphs, VCS 1 and VCS 2, with connection strengths on the edges]
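The unknown-k case can be sketched as follows. The talk uses the normalized cut of [Shi,Malik’2000]; the sketch below evaluates the same normalized-cut criterion but finds the best bipartition by brute force, which is only practical for small VCSs, rather than by the spectral method. It also assumes that a VCS is split when the best cut value falls below the threshold. The function names, weights, and threshold values are hypothetical.

```python
from itertools import combinations


def ncut_value(weights, part_a, part_b):
    """Normalized cut of a bipartition: cut/assoc(A,V) + cut/assoc(B,V)."""
    def w(x, y):
        return weights.get((x, y), weights.get((y, x), 0.0))

    nodes = list(part_a) + list(part_b)
    cut = sum(w(a, b) for a in part_a for b in part_b)
    assoc_a = sum(w(a, n) for a in part_a for n in nodes if n != a)
    assoc_b = sum(w(b, n) for b in part_b for n in nodes if n != b)
    # A side with no connections at all contributes nothing to the cut.
    term_a = cut / assoc_a if assoc_a else 0.0
    term_b = cut / assoc_b if assoc_b else 0.0
    return term_a + term_b


def best_bipartition(nodes, weights):
    """Try every split into two non-empty parts; return (value, part_a, part_b)."""
    nodes = list(nodes)
    first, rest = nodes[0], nodes[1:]
    best = None
    for size in range(len(rest)):  # part_a always holds `first`; part_b is never empty
        for extra in combinations(rest, size):
            part_a = [first, *extra]
            part_b = [n for n in rest if n not in extra]
            value = ncut_value(weights, part_a, part_b)
            if best is None or value < best[0]:
                best = (value, part_a, part_b)
    return best


def split_or_not(nodes, weights, threshold):
    """Split only when the best normalized cut is below the threshold (assumed rule)."""
    value, part_a, part_b = best_bipartition(nodes, weights)
    return (part_a, part_b) if value < threshold else (list(nodes),)


nodes = ["r1", "r2", "r3", "r4"]
strength = {("r1", "r2"): 5.0, ("r3", "r4"): 5.0, ("r2", "r3"): 1.0}
value, part_a, part_b = best_bipartition(nodes, strength)
decision_split = split_or_not(nodes, strength, threshold=0.5)
decision_keep = split_or_not(nodes, strength, threshold=0.1)
```

Here the two strongly connected pairs are joined by one weak link, so the best cut separates them with a normalized cut of 2/11. Applying `split_or_not` again to each resulting part would give the "Iterate" step.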
14
Measuring Quality of Outcome
– dispersion
– for an entity, into how many clusters its repr. are clustered; ideal is 1
– diversity
– for a cluster, how many distinct entities it covers; ideal is 1
– Entity uncertainty
– for an entity, if out of m represent. m1 go to C1; ...; mn go to Cn, then H = -Σi (mi/m) log2(mi/m)
– Cluster uncertainty
– if a cluster consists of represent.: m1 of E1; ...; mn of En then (same...)
– ideal entropy is zero

Examples: two entities (1 and 2) with 6 representations each, clustered into C1 and C2

Ideal Clustering
C1 = {1,1,1,1,1,1}, C2 = {2,2,2,2,2,2}
      Div   H            Dis   H
C1    1     0      E1    1     0
C2    1     0      E2    1     0

One Misassigned (Example 1)
C1 = {1,1,1,1,1,2}, C2 = {2,2,2,2,2,1}
      Div   H            Dis   H
C1    2     0.65   E1    2     0.65
C2    2     0.65   E2    2     0.65

Half Misassigned
C1 = {1,1,1,2,2,2}, C2 = {2,2,2,1,1,1}
      Div   H            Dis   H
C1    2     1      E1    2     1
C2    2     1      E2    2     1

Dis/Div cannot distinguish the two cases
Entropy can: since 0.65 < 1, first clustering is better

One Misassigned (Example 2)
C1 = {1,1,1,1,1,1,2}, C2 = {2,2,2,2,2}
      Div   H            Dis   H
C1    2     0.592  E1    1     0
C2    1     0      E2    2     0.65

Average entropy decreases (improves), compared to Example 1
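The four measures can be reproduced with a short sketch. It assumes base-2 entropy, which matches the 0.65, 0.592 and 1 values in the examples; the function names and the input layout (each cluster given as the list of true entity labels of its representations) are choices made for illustration.

```python
from collections import Counter
from math import log2


def entropy(counts):
    """Base-2 entropy of a split given as a list of counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts)


def quality(clusters):
    """clusters: {cluster id: list of true entity labels of its representations}."""
    per_cluster = {c: Counter(labels) for c, labels in clusters.items()}
    per_entity = {}
    for c, counts in per_cluster.items():
        for e, n in counts.items():
            per_entity.setdefault(e, Counter())[c] = n
    return {
        "diversity": {c: len(cnt) for c, cnt in per_cluster.items()},
        "cluster_entropy": {c: entropy(cnt.values()) for c, cnt in per_cluster.items()},
        "dispersion": {e: len(cnt) for e, cnt in per_entity.items()},
        "entity_entropy": {e: entropy(cnt.values()) for e, cnt in per_entity.items()},
    }


example1 = quality({"C1": [1] * 5 + [2], "C2": [2] * 5 + [1]})
half = quality({"C1": [1] * 3 + [2] * 3, "C2": [2] * 3 + [1] * 3})
example2 = quality({"C1": [1] * 6 + [2], "C2": [2] * 5})
```

Diversity and dispersion are 2 everywhere for both `example1` and `half`, while the entropies differ (about 0.65 versus 1), which is the point the slide makes about entropy being the more sensitive measure.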
15
Experimental Setup
RealMov
– movies (12K)
– people (22K)
– actors
– directors
– producers
– studios (1K)
– producing
– distributing
Parameters
– L-short simple paths, L = 7
– L is the path-length limit
Note
– The algorithm is applied to “tough cases”, after FBS has already successfully consolidated many entries!
Uncertainty
– d1,d2,...,dn are director entities
– pick a fraction d1,d2,...,dm
– group entries into groups of size k
– e.g. in groups of two {d1,d2}, ..., {d9,d10}
– make all representations of a group indiscernible by FBS, ...
Baseline 1
– one cluster per VCS, regardless
– equivalent to using only FBS
– ideal dispersion & H(E)!
Baseline 2
– knows grouping statistics
– guesses #ent in VCS
– randomly assigns repr. to clusters
16
Sample Movies Data
17
The Effect of L on Quality
Cluster Entropy & Diversity
Entity Entropy & Dispersion
18
Effect of Threshold and Scalability
19
Summary
RelDC
– domain-independent data cleaning framework
– uses relationships for data cleaning
– reference disambiguation [SDM’05]
– object consolidation [IQIS’05]
Ongoing work
– “learning” the importance of relationships from data
– exploiting relationships among entities for other data cleaning problems
20
Contact Information
RelDC project
www.ics.uci.edu/~dvk/RelDC
www.itr-rescue.org (RESCUE)
Zhaoqi Chen
chenz@ics.uci.edu
Dmitri V. Kalashnikov
www.ics.uci.edu/~dvk
dvk@ics.uci.edu
Sharad Mehrotra
www.ics.uci.edu/~sharad
sharad@ics.uci.edu
21
extra slides…
22
Object Consolidation
Notation
– O={o1,...,o|O|} set of entities
– unknown in general
– X={x1,...,x|X|} set of repres.
– d[xi] the entity xi refers to
– unknown in general
– C[xi] all repres. that refer to d[xi]
– “group set”
– unknown in general
– the goal is to find it for each xi
– S[xi] all repres. that can be xi
– “consolidation set”
– determined by FBS
– we assume C[xi] ⊆ S[xi]
24
Object Consolidation Problem
• Let O={o1,...,o|O|} be the set of entities
– unknown in general
• Let X={x1,...,x|X|} be the set of representations
• Map xi to its corresponding entity oj in O
– d[xi] the entity xi refers to
– unknown in general
– C[xi] all repres. that refer to d[xi]
– “group set”
– unknown in general
– the goal is to find it for each xi
– S[xi] all repres. that can be xi
25
RelDC Framework
[Figure: pipeline with boxes labeled Raw Data, Extraction, Tables/ARGs, Data Cleaning, Representation, Analysis, plus an example ARG over nodes A to F, X, Y]
[Figure: RelDC Framework (Relationship-based Data Cleaning) = Traditional Methods (features and context, comparing f1 to f4 of X and Y) + Relationship Analysis on the ARG]
26
Connection Strength
Computation of c(u,v)
Phase 1: Discover connections
– all L-short simple paths between u and v
– bottleneck
– optimizations, not in IQIS’05
Phase 2: Measure the strength
– in the discovered connections
– many c(u,v) models exist
– we use a model similar to diffusion kernels
[Figure: example graph with nodes A to H, u, v, z]
27
Our c(u,v) Model
Our model & Diff. kernels
– virtually identical, but...
– we do not compute the whole matrix K
– we compute one c(u,v) at a time
– we limit path lengths by L
– base similarity of (x,y) is unknown in general
– the analyst assigns them
– learn from data (ongoing work)
Our c(u,v) model
– regular edges have types T1,...,Tn
– types T1,...,Tn have weights w1,...,wn
– base similarity of (x,y) = wi
– get the type of a given edge
– assign this weight as base similarity
– paths with similarity edges
– might not exist, use heuristics
[Figure: John Smith connected via edges of types T1 and T2 to P1, MIT and Alan White (N-2 further nodes elided); example paths (a), (b), (c) over nodes R1:John, R2:J.Smith, R3:John, A1:John, A3:John, A4:Alan, A5:Mike, A6:Tom, A7:Kate, P1 to P4, MIT, Stanford]
28