Work supported by NSF Grants IIS-0331707 and IIS-0083489
Copyright(c) by Dmitri V. Kalashnikov, 2005
Exploiting Relationships
for Object Consolidation
Zhaoqi Chen
Dmitri V. Kalashnikov
Sharad Mehrotra
Computer Science Department
University of California, Irvine
http://www.ics.uci.edu/~dvk/RelDC
http://www.itr-rescue.org (RESCUE)
ACM IQIS 2005
Talk Overview
• Examples
– motivating data cleaning (DC)
– motivating analysis of relationships for DC
• Object consolidation
– one of the DC problems, the one this work addresses
• Proposed approach
– RelDC framework
– Relationship analysis and graph partitioning
• Experiments
Why do we need “Data Cleaning”?
A motivating scenario:

Jane Smith, a fresh Ph.D.: "Hi, my name is Jane Smith. I'd like to apply for a faculty position at your university. Publications: 1. … 2. … 3. …"

Tom, a recruiter: "OK, let me check something quickly…" (looks up her CiteSeer rank) "Wow! Unbelievable! You must be a really hard worker! I am sure we will accept a candidate like that!"
What is the problem?
Suspicious entries
– Let's go to the DBLP website, which stores bibliographic entries of many CS authors
– Let's check two people: "A. Gupta" and "L. Zhang"
(Figure: CiteSeer's list of the top-k most cited authors, with the corresponding DBLP pages.)
Comparing raw and cleaned CiteSeer
Cleaned CiteSeer top-k vs. CiteSeer top-k (as shown on the slide):

Rank            Author               Location       # citations
1  (100.00%)    douglas schmidt      cs@wustl       5608
2  (100.00%)    rakesh agrawal       almaden@ibm    4209
3  (100.00%)    hector garciamolina  @              4167
4  (100.00%)    sally floyd          @aciri         3902
5  (100.00%)    jennifer widom       @stanford      3835
6  (100.00%)    david culler         cs@berkeley    3619
6  (100.00%)    thomas henzinger     eecs@berkeley  3752
7  (100.00%)    rajeev motwani       @stanford      3570
8  (100.00%)    willy zwaenepoel     cs@rice        3624
9  (100.00%)    van jacobson         lbl@gov        3468
10 (100.00%)    rajeev alur          cis@upenn      3577
11 (100.00%)    john ousterhout      @pacbell       3290
12 (100.00%)    joseph halpern       cs@cornell     3364
13 (100.00%)    andrew kahng         @ucsd          3288
14 (100.00%)    peter stadler        tbi@univie     3187
15 (100.00%)    serge abiteboul      @inria         3060
What is the lesson?
“Garbage in, garbage out” principle:
Making decisions based on bad data can lead to wrong results.
– data should be cleaned first
– e.g., determine the (unique) real authors of publications
– solving such challenges is not always "easy"
– that explains the large body of work on data cleaning
Note
– CiteSeer is aware of the problem with its ranking
– there are more issues with CiteSeer, many not related to data cleaning
RelDC Framework
Relationship-based Data Cleaning
– pipeline: Raw Data → Extraction → Representation (tables/ARGs) → Analysis (data cleaning)
– traditional methods compare the features and context (f1, ..., f4) of two representations X and Y
– RelDC augments them with relationship analysis over the ARG
(Figure: a sample ARG in which X and Y are linked through intermediate nodes A-F.)
Object Consolidation
Notation
– O = {o1, ..., o|O|}: set of entities; unknown in general
– X = {x1, ..., x|X|}: set of representations
– d[xi]: the entity that xi refers to; unknown in general
– C[xi]: all representations that refer to d[xi]
  – the "group set"; unknown in general; the goal is to find it for each xi
– S[xi]: all representations that can be xi
  – the "consolidation set"; determined by FBS (feature-based similarity)
– we assume C[xi] ⊆ S[xi]
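This notation can be made concrete with a minimal Python sketch; the entity and representation names below are illustrative, not from the paper:

```python
# d[x] maps a representation to its entity (unknown in general);
# C[x] is the group set; S[x] is the FBS-determined consolidation set.
d = {"x1": "o1", "x2": "o1", "x3": "o2"}  # hypothetical ground truth

def group_set(x, d):
    """C[x]: all representations referring to the same entity as x."""
    return {y for y in d if d[y] == d[x]}

# FBS over-approximates: all three look alike to it here,
# so each consolidation set S[x] contains the group set C[x].
S = {x: {"x1", "x2", "x3"} for x in d}

assert all(group_set(x, d) <= S[x] for x in d)  # C[x] ⊆ S[x]
```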
Attributed Relational Graph (ARG)
ARG in RelDC
Nodes
– one per cluster of representations
– one per representation (for "tough" cases)
Edges
– regular
– similarity
(Figure: a sample ARG with person, publication, department, and organization nodes.)
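A toy ARG of this shape can be sketched with plain Python structures; the node names and relationships are made up for illustration:

```python
# Typed nodes, plus "regular" edges for real-world relationships and
# "similarity" edges for tough co-reference cases.
nodes = {
    "J. Smith":   "person",
    "John Smith": "person",
    "P1":         "publication",
    "CS Dept":    "department",
    "MIT":        "organization",
}

edges = [
    ("J. Smith", "P1", "regular"),              # authored
    ("P1", "CS Dept", "regular"),               # produced at
    ("CS Dept", "MIT", "regular"),              # part of
    ("J. Smith", "John Smith", "similarity"),   # possible co-reference
]

def neighbors(node, etype=None):
    """Adjacent nodes, optionally filtered by edge type."""
    out = {v for u, v, t in edges if u == node and (etype is None or t == etype)}
    out |= {u for u, v, t in edges if v == node and (etype is None or t == etype)}
    return out
```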
Context Attraction Principle (CAP)
Take a guess: who is "J. Smith", Jane Smith or John Smith?
(Figure: merging a new publication by "J. Smith"; in the ARG it could attach to either Jane Smith or John Smith.)
Questions to Answer
1. Does the CAP principle hold over real datasets? That is, if we consolidate objects based on it, will the quality of consolidation improve?
2. Can we design a generic solution to exploiting relationships for disambiguation?
Consolidation Algorithm
1. Construct the ARG and identify all VCS's
– use FBS in constructing the ARG
2. Choose a VCS and compute the c(u,v)'s
– for each pair of representations connected via a similarity edge
3. Partition the VCS
– use a graph partitioning algorithm; partitioning is based on the c(u,v)'s
– after partitioning, adjust the ARG accordingly
– go to Step 2 if more VCS's exist
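The loop above can be sketched in Python. This is a simplistic stand-in, not the paper's algorithm: a VCS is approximated as a connected component of the similarity-edge subgraph, and the strength and partitioning steps are passed in as callbacks:

```python
def find_vcss(sim_edges):
    """Connected components over similarity edges (toy VCS finder)."""
    parent = {}
    def find(a):
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a
    for u, v in sim_edges:
        parent[find(u)] = find(v)
    comps = {}
    for node in parent:
        comps.setdefault(find(node), set()).add(node)
    return list(comps.values())

def consolidate(sim_edges, strength, partition):
    """For each VCS, weight its similarity edges by c(u,v), then split."""
    result = []
    for vcs in find_vcss(sim_edges):                       # Step 1
        weights = {(u, v): strength(u, v)                  # Step 2
                   for u, v in sim_edges if u in vcs and v in vcs}
        result.extend(partition(vcs, weights))             # Step 3
    return result
```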
Connection Strength
Computation of c(u,v)
Phase 1: discover connections
– all L-short simple paths between u and v
– this phase is the bottleneck; optimizations exist (not covered in IQIS'05)
Phase 2: measure the strength in the discovered connections
– many c(u,v) models exist; we use a model similar to diffusion kernels
(Figure: a sample graph in which u and v are connected through intermediate nodes A-H and z.)
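Phase 1 can be illustrated with a small depth-first search that enumerates all simple paths of at most L edges between u and v; this plain-Python sketch (toy graph included) is not the optimized version the talk alludes to:

```python
def l_short_simple_paths(adj, u, v, L):
    """All simple paths from u to v with at most L edges (iterative DFS)."""
    paths, stack = [], [(u, [u])]
    while stack:
        node, path = stack.pop()
        if node == v:
            paths.append(path)
            continue
        if len(path) > L:            # path already has len(path) - 1 edges
            continue
        for nxt in adj.get(node, ()):
            if nxt not in path:      # simple path: no repeated nodes
                stack.append((nxt, path + [nxt]))
    return paths

# Toy undirected graph: u - v, u - A, A - B, A - v, B - v.
adj = {"u": ["v", "A"], "A": ["u", "B", "v"],
       "B": ["A", "v"], "v": ["u", "A", "B"]}
```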
Existing c(u,v) Models
Models for c(u,v)
– many exist: diffusion kernels, random walks, etc.
– none is fully adequate; in particular, they cannot learn similarity from data
Diffusion kernels
– β1(x,y): "base similarity", via direct links (of length 1)
– βk(x,y): "indirect similarity", via links of length k
– B: base similarity matrix, where Bxy = B1xy = β1(x,y)
– Bk: indirect similarity matrix
– K: total similarity matrix, or "kernel"
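As a hedged illustration of one standard formulation (the exponential diffusion kernel, K = Σk (λ^k / k!) B^k = exp(λB); the slide itself does not spell out the formula), here is a truncated series for a tiny 3-node chain:

```python
from math import factorial

def matmul(A, B):
    """Multiply two square matrices given as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def diffusion_kernel(B, lam=0.5, terms=20):
    """Truncated K = sum_k (lam^k / k!) * B^k, i.e. exp(lam * B)."""
    n = len(B)
    K = [[float(i == j) for j in range(n)] for i in range(n)]  # k = 0 term
    P = [row[:] for row in K]                                  # B^0 = I
    for k in range(1, terms):
        P = matmul(P, B)                                       # B^k
        for i in range(n):
            for j in range(n):
                K[i][j] += lam ** k / factorial(k) * P[i][j]
    return K

# Base similarity matrix of the chain 0 - 1 - 2 (no direct 0-2 link).
B = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
K = diffusion_kernel(B)
```

Note how nodes 0 and 2, with no direct link (B[0][2] = 0), still acquire indirect similarity (K[0][2] > 0), which is exactly the behavior the talk wants from c(u,v).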
Our c(u,v) Model
Our model vs. diffusion kernels
– virtually identical, but we do not compute the whole matrix K
– we compute one c(u,v) at a time
– we limit path lengths by L
– β(x,y) is unknown in general; the analyst assigns them (learning them from data is ongoing work)
Our c(u,v) model
– regular edges have types T1, ..., Tn
– types T1, ..., Tn have weights w1, ..., wn
– β(x,y) = wi: look up the type of a given edge and assign its weight as the base similarity
– paths with similarity edges might not exist; use heuristics
(Figure: sample graphs (a)-(c) in which author representations such as R1:John and R2:J.Smith are connected through publications P1-P4 and organizations MIT and Stanford.)
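One simple instantiation of this model (the edge types, weights, and the sum-of-products aggregation below are illustrative assumptions, not the paper's exact formula): give each edge the weight of its type, and let c(u,v) sum the per-path weight products over L-short simple paths:

```python
# Typed, weighted toy graph; the types and weights are made up.
type_weight = {"writes": 0.9, "affiliated": 0.5}
typed_edges = {("John", "P1"): "writes", ("P1", "J.Smith"): "writes",
               ("John", "MIT"): "affiliated", ("MIT", "J.Smith"): "affiliated"}

adj = {}
for (a, b), t in typed_edges.items():
    adj.setdefault(a, {})[b] = type_weight[t]   # base similarity = w_i
    adj.setdefault(b, {})[a] = type_weight[t]

def c(u, v, L):
    """Sum, over simple paths of <= L edges, of the product of weights."""
    total = 0.0
    def dfs(node, seen, prod, depth):
        nonlocal total
        if node == v:
            total += prod
            return
        if depth == L:
            return
        for nxt, w in adj.get(node, {}).items():
            if nxt not in seen:
                dfs(nxt, seen | {nxt}, prod * w, depth + 1)
    dfs(u, {u}, 1.0, 0)
    return total
```

Here c("John", "J.Smith", 2) aggregates two connections: the co-authorship path (0.9 · 0.9) and the affiliation path (0.5 · 0.5), so stronger edge types contribute more to the connection strength.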
Consolidation via Partitioning
Observations
– each VCS contains representations of at least one object
– if a representation is in a VCS, then the rest of the representations of the same object are in it too
Partitioning: two cases
– k, the number of entities in the VCS, is known
  – use any partitioning algorithm that maximizes inside-connectivity and minimizes outside-connectivity
  – we use the normalized cut of [Shi, Malik 2000]
– k is unknown
  – split into two, just to see the cut
  – compare the cut against a threshold
  – then decide whether to actually split
(Figure: two sample VCS's partitioned into clusters of representations.)
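The split-or-not decision can be illustrated by brute force on a tiny VCS. The paper uses the spectral normalized-cut algorithm of Shi and Malik; enumerating all bipartitions, as below, is only feasible at toy sizes, and the 0.5 threshold is a made-up value:

```python
from itertools import combinations

def best_split(nodes, weights):
    """Minimum normalized cut over all bipartitions (brute force)."""
    nodes = sorted(nodes)

    def w_between(a, b):
        return sum(wt for (u, v), wt in weights.items()
                   if (u in a and v in b) or (u in b and v in a))

    def assoc(s):
        # Shi-Malik assoc(S, V): sum of the degrees of the nodes in S.
        return sum(wt for (u, v), wt in weights.items()
                   for end in (u, v) if end in s)

    best_val, best_part = float("inf"), None
    for size in range(1, len(nodes) // 2 + 1):
        for comb in combinations(nodes, size):
            a = set(comb)
            b = set(nodes) - a
            cut = w_between(a, b)
            val = cut / assoc(a) + cut / assoc(b)
            if val < best_val:
                best_val, best_part = val, (a, b)
    return best_val, best_part

# Two tight groups of representations joined by one weak edge.
weights = {("r1", "r2"): 1.0, ("r2", "r3"): 1.0, ("r1", "r3"): 1.0,
           ("r4", "r5"): 1.0, ("r3", "r4"): 0.1}
val, (a, b) = best_split({"r1", "r2", "r3", "r4", "r5"}, weights)
split = val < 0.5   # compare the cut against a (hypothetical) threshold
```

The weak 0.1 edge makes the cheapest cut separate {r4, r5} from {r1, r2, r3}, and the low normalized-cut value falls below the threshold, so the VCS is actually split into two entities.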
Measuring Quality of Outcome
Existing measures
– dispersion [DMKD'04]: for an entity, into how many clusters its representations are clustered; ideal is 1
– diversity: for a cluster, how many distinct entities it covers; ideal is 1
– easy, with clear semantics, but they have problems (see the examples)
Entropy
– for an entity: if, out of its m representations, m1 go to cluster C1, ..., mn go to Cn, then H = -Σi (mi/m) log2 (mi/m)
– for a cluster consisting of m1 representations of E1, ..., mn of En, the same formula applies
– ideal entropy is zero
Examples (two entities E1, E2, six representations each, clustered into C1 and C2)
– Ideal clustering: all of E1 in C1, all of E2 in C2; dispersion and diversity are 1 and entropy is 0
– One misassigned (Example 1): one representation of E1 lands in C2 and one of E2 in C1; dispersion and diversity become 2, and every entity and cluster entropy is 0.65
– Half misassigned: half of each entity's representations land in each cluster; dispersion and diversity are still 2, but the entropies rise to 1
– Dis/Div cannot distinguish the two cases; entropy can: since 0.65 < 1, the first clustering is better
– One misassigned (Example 2): all of E1 plus one representation of E2 are in C1, and the remaining five representations of E2 are in C2; cluster entropies are 0.592 (C1) and 0 (C2), entity entropies are 0 (E1) and 0.65 (E2); average entropy decreases (improves) compared to Example 1
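The entropy measure is straightforward to compute; the counts below reproduce the slide's own numbers: 0.65 for a one-misassigned cluster (5 + 1), 1 for a half-misassigned cluster (3 + 3), and 0.592 for Example 2's mixed cluster (6 + 1), all with a base-2 logarithm:

```python
from math import log2

def entropy(counts):
    """H = -sum_i (m_i / m) * log2(m_i / m) over the nonzero counts."""
    m = sum(counts)
    return -sum(mi / m * log2(mi / m) for mi in counts if mi > 0)

one_misassigned  = entropy([5, 1])   # cluster with 5 of E1, 1 of E2
half_misassigned = entropy([3, 3])   # evenly mixed cluster
example_2        = entropy([6, 1])   # all of E1 plus 1 stray of E2
ideal            = entropy([6])      # pure cluster
```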
Experimental Setup
RealMov
– movies (12K)
– people (22K): actors, directors, producers
– studios (1K): producing, distributing
Uncertainty
– d1, d2, ..., dn are director entities
– pick a fraction, e.g., d1, d2, ..., d10
– group them, e.g., in groups of two: {d1, d2}, ..., {d9, d10}
– make all representations of d1, d2 indiscernible by FBS, and so on
Parameters
– L-short simple paths with L = 7, where L is the path-length limit
Baseline 1
– one cluster per VCS, regardless
– dumb? ... but ideal dispersion and H(E)
Baseline 2
– knows the grouping statistics
– guesses the number of entities in a VCS
– randomly assigns representations to clusters
Note
– the algorithm is applied to the "tough cases", after FBS has already successfully consolidated many entries!
Sample Movies Data
The Effect of L on Quality
(Plots: cluster entropy & diversity, and entity entropy & dispersion, as functions of L.)
Effect of Threshold and Scalability
Summary
RelDC
– developed in Aug 2003 (reference disambiguation)
– domain-independent data cleaning framework
– uses relationships for data cleaning
– reference disambiguation [SDM’05]
– object consolidation [IQIS’05]
Ongoing work
– “learning” the importance of relationships from data
Contact Information
RelDC project
www.ics.uci.edu/~dvk/RelDC
www.itr-rescue.org (RESCUE)
Zhaoqi Chen
chenz@ics.uci.edu
Dmitri V. Kalashnikov
www.ics.uci.edu/~dvk
dvk@ics.uci.edu
Sharad Mehrotra
www.ics.uci.edu/~sharad
sharad@ics.uci.edu