Exploiting Relationships for Object Consolidation

Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra
Computer Science Department, University of California, Irvine
http://www.ics.uci.edu/~dvk/RelDC
http://www.itr-rescue.org (RESCUE)
ACM IQIS 2005
Work supported by NSF Grants IIS-0331707 and IIS-0083489

Talk Overview
• Motivation
• Object consolidation problem
• Proposed approach
  – RelDC: relationship-based data cleaning
  – Relationship analysis and graph partitioning
• Experiments

Why do we need “Data Cleaning”?
Jane Smith (fresh Ph.D.): “Hi, my name is Jane Smith. I’d like to apply for a faculty position at your university.”
Tom (recruiter): “OK, let me check something quickly …”
[Figure: CiteSeer Rank page. Publications: 1. …… 2. …… 3. ……]
Tom: “Wow! Unbelievable! Are you sure you will join us even if we do not offer you tenure right away?”
Jane: “???”

What is the problem?
• Names often do not uniquely identify people
[Figure: CiteSeer, the top-k most cited authors; DBLP]

Comparing raw and cleaned CiteSeer
Rank           Author               Location        # citations
1 (100.00%)    douglas schmidt      cs@wustl        5608
2 (100.00%)    rakesh agrawal       almaden@ibm     4209
3 (100.00%)    hector garciamolina  @               4167
4 (100.00%)    sally floyd          @aciri          3902
5 (100.00%)    jennifer widom       @stanford       3835
6 (100.00%)    david culler         cs@berkeley     3619
6 (100.00%)    thomas henzinger     eecs@berkeley   3752
7 (100.00%)    rajeev motwani       @stanford       3570
8 (100.00%)    willy zwaenepoel     cs@rice         3624
9 (100.00%)    van jacobson         lbl@gov         3468
10 (100.00%)   rajeev alur          cis@upenn       3577
11 (100.00%)   john ousterhout      @pacbell        3290
12 (100.00%)   joseph halpern       cs@cornell      3364
13 (100.00%)   andrew kahng         @ucsd           3288
14 (100.00%)   peter stadler        tbi@univie      3187
15 (100.00%)   serge abiteboul      @inria          3060
[Figure labels: Cleaned CiteSeer top-k; CiteSeer top-k]

Object Consolidation Problem
[Figure: representations of objects in the database (r1, r2, ..., rN) and real objects in the database (o1, o2, ..., oM)]
• Cluster representations that correspond to the same real-world object/entity
• Two instances: real-world objects are known/unknown

RelDC Approach
• Exploit relationships among objects to disambiguate when the traditional approach of clustering based on similarity does not work
[Figure: RelDC Framework (relationship-based data cleaning) = traditional methods (features and context) + relationship analysis (ARG)]

Attributed Relational Graph (ARG)
View the database as an ARG
Nodes
– per cluster of representations (if already resolved by the feature-based approach)
– per representation (for “tough” cases)
Edges
– Regular: correspond to relationships between entities
– Similarity: created using feature-based methods on representations
[Figure: node types person, publication, department, organization]

Context Attraction Principle (CAP)
Who is “J. Smith”?
– Jane?
– John?
[Figure: merging a new publication by “J. Smith”; candidate authors Jane Smith and John Smith]

Questions to Answer
1. Does CAP hold over real datasets? That is, if we consolidate objects based on it, will the quality of consolidation improve?
2. Can we design a generic strategy that exploits CAP for consolidation?

Consolidation Algorithm
1. Construct the ARG and identify all virtual clusters (VCSs)
   – use FBS (feature-based similarity) in constructing the ARG
2. Choose a VCS and compute connection strength between nodes
   – for each pair of representations connected via a similarity edge
3. Partition the VCS
   – use a graph partitioning algorithm
   – partitioning is based on connection strength
   – after partitioning, adjust the ARG accordingly
   – go to Step 2 if more potential clusters exist

Connection Strength c(u,v)
Models for c(u,v)
– many possibilities
  – diffusion kernels, random walks, etc.
– none is fully adequate
  – cannot learn similarity from data
Diffusion kernels
– $\sigma(x,y) = \sigma_1(x,y)$: “base similarity”
  – via direct links (of size 1)
– $\sigma_k(x,y)$: “indirect similarity”
  – via links of size k
– $B$: base similarity matrix, where $B_{xy} = B^1_{xy} = \sigma_1(x,y)$
– $B^k$: indirect similarity matrix
– $K$: total similarity matrix, or “kernel”
[Figure: example graph with nodes A, B, C, D, E, F, G, H, u, v, z]

Connection Strength c(u,v) (cont.)
Instantiating parameters
– Determining $\sigma(x,y)$
  – regular edges have types $T_1, \dots, T_n$
  – types $T_1, \dots, T_n$ have weights $w_1, \dots, w_n$
  – $\sigma(x,y) = w_i$
    – get the type of a given edge
    – assign this weight as base similarity
– Handling similarity edges
  – $\sigma(x,y)$ is assigned a value proportional to similarity (heuristic)
  – approach to learn $\sigma(x,y)$ from data (ongoing work)
Implementation
– we do not compute the whole matrix $K$
– we compute one c(u,v) at a time
– limit path lengths by L
[Figure: example graphs. One has edge types T1 and T2 over nodes John Smith, Alan White, P1, MIT; panels (a), (b), (c) have node labels R1:John, R2:J.Smith, R3:John, A1:John, A3:John, A4:Alan, A5:Mike, A6:Tom, A7:Kate, P1, P2, P3, P4, MIT, Stanford]

Consolidation via Partitioning
Observations
– each VCS contains representations of at least one object
– if a representation is in a VCS, then the rest of the representations of the same object are in it too
Partitioning: two cases
– k, the number of entities in the VCS, is known
– k is unknown
When k is known, use any partitioning algorithm
– maximize inside connections, minimize outside connections
– we use normalized cut [Shi,Malik’2000]
When k is unknown
– split into two, just to see the cut
– compare the cut against a threshold
– decide “to split” or “not to split”
– iterate
[Figure: two example graphs, VCS 1 and VCS 2]

Measuring Quality of Outcome
– dispersion
  – for an entity, into how many clusters its representations are clustered; ideal is 1
– diversity
  – for a cluster, how many distinct entities it covers; ideal is 1
– entity uncertainty
  – for an entity, if out of m representations m1 go to C1; ...; mn go to Cn, then $H = -\sum_{i=1}^{n} \frac{m_i}{m} \log_2 \frac{m_i}{m}$

Ideal Clustering
  C1: 1 1 1 1 1 1    C2: 2 2 2 2 2 2
        Div  H              Dis  H
  C1    1    0        E1    1    0
  C2    1    0        E2    1    0

One Misassigned (Example 1)
  C1: 1 1 1 1 1 2    C2: 2 2 2 2 2 1
        Div  H              Dis  H
  C1    2    0.65     E1    2    0.65
  C2    2    0.65     E2    2    0.65

Half Misassigned
  C1: 1 1 1 2 2 2    C2: 2 2 2 1 1 1
        Div  H              Dis  H
  C1    2    1        E1    2    1
  C2    2    1        E2    2    1

Dis/Div cannot distinguish the last two cases. Entropy can: since 0.65 < 1, the first of them (Example 1) is the better clustering.

– cluster uncertainty
  – if a cluster consists of representations m1 of E1; ...; mn of En, then H is computed with the same formula
– ideal entropy is zero

One Misassigned (Example 2)
  C1: 1 1 1 1 1 1 2    C2: 2 2 2 2 2
        Div  H               Dis  H
  C1    2    0.592     E1    1    0
  C2    1    0         E2    2    0.65

Average entropy decreases (improves) compared to Example 1.

Experimental Setup
RealMov
– movies (12K)
– people (22K)
  – actors
  – directors
  – producers
– studios (1K)
  – producing
  – distributing
Parameters
– L-short simple paths, L = 7
– L is the path-length limit
Note
– The algorithm is applied to “tough cases”, after FBS has already successfully consolidated many entries!
Uncertainty
– d1, d2, ..., dn are director entities
– pick a fraction d1, d2, ..., dm
– group entries into groups of size k
  – e.g. in groups of two: {d1,d2}, ..., {d9,d10}
– make all representations of a group indiscernible by FBS, ...
Baseline 1
– one cluster per VCS, regardless
– equivalent to using only FBS
– ideal dispersion & H(E)!
Baseline 2
– knows grouping statistics
– guesses the number of entities in a VCS
– randomly assigns representations to clusters

Sample Movies Data
[Figure: sample movies data]

The Effect of L on Quality
[Figure panels: Cluster Entropy & Diversity; Entity Entropy & Dispersion]

Effect of Threshold and Scalability
[Figure: effect of threshold; scalability]

Summary
RelDC
– domain-independent data cleaning framework
– uses relationships for data cleaning
– reference disambiguation [SDM’05]
– object consolidation [IQIS’05]
Ongoing work
– “learning” the importance of relationships from data
– exploiting relationships among entities for other data cleaning problems

Contact Information
RelDC project
  www.ics.uci.edu/~dvk/RelDC
  www.itr-rescue.org (RESCUE)
Zhaoqi Chen
  chenz@ics.uci.edu
Dmitri V. Kalashnikov
  www.ics.uci.edu/~dvk
  dvk@ics.uci.edu
Sharad Mehrotra
  www.ics.uci.edu/~sharad
  sharad@ics.uci.edu

extra slides…

Object Consolidation Notation
– $O = \{o_1, \dots, o_{|O|}\}$: set of entities
  – unknown in general
– $X = \{x_1, \dots, x_{|X|}\}$: set of representations
– $d[x_i]$: the entity $x_i$ refers to
  – unknown in general
– $C[x_i]$: all representations that refer to $d[x_i]$
  – “group set”
  – unknown in general
  – the goal is to find it for each $x_i$
– $S[x_i]$: all representations that can be $x_i$
  – “consolidation set”
  – determined by FBS
  – we assume $C[x_i] \subseteq S[x_i]$

Object Consolidation Problem
• Let $O = \{o_1, \dots, o_{|O|}\}$ be the set of entities
  – unknown in general
• Let $X = \{x_1, \dots, x_{|X|}\}$ be the set of representations
• Map $x_i$ to its corresponding entity $o_j$ in $O$
  – $d[x_i]$: the entity $x_i$ refers to
    – unknown in general
  – $C[x_i]$: all representations that refer to $d[x_i]$
    – “group set”
    – unknown in general
    – the goal is to find it for each $x_i$
  – $S[x_i]$: all representations that can be $x_i$

RelDC Framework
[Figure: pipeline with labels Raw Data, Extraction, Data Cleaning, Representation, Analysis, Tables/ARGs; RelDC Framework (relationship-based data cleaning) = traditional methods (features and context) + relationship analysis (ARG)]

Connection Strength
Computation of c(u,v)
Phase 1: Discover connections
– all L-short simple paths between u and v
– bottleneck
– optimizations, not in IQIS’05
Phase 2: Measure the strength
– in the discovered connections
– many c(u,v) models exist
– we use a model similar to diffusion kernels
[Figure: example graph with nodes A, B, C, D, E, F, G, H, u, v, z]

Our c(u,v) Model
Our model & diffusion kernels
– virtually identical, but...
– we do not compute the whole matrix $K$
– we compute one c(u,v) at a time
– we limit path lengths by L
– $\sigma(x,y)$ is unknown in general
  – the analyst assigns them
  – learn from data (ongoing work)
Our c(u,v) model
– regular edges have types $T_1, \dots, T_n$
– types $T_1, \dots, T_n$ have weights $w_1, \dots, w_n$
– $\sigma(x,y) = w_i$
  – get the type of a given edge
  – assign this weight as base similarity
– paths with similarity edges
  – might not exist, use heuristics
[Figure: example graphs. One has edge types T1 and T2 over nodes John Smith, Alan White, P1, MIT; panels (a), (b), (c) have node labels R1:John, R2:J.Smith, R3:John, A1:John, A3:John, A4:Alan, A5:Mike, A6:Tom, A7:Kate, P1, P2, P3, P4, MIT, Stanford]
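
The two-phase computation of c(u,v) described above can be illustrated with a minimal Python sketch. The slides do not give the exact formula, so this rests on assumptions: each L-short simple path contributes the product of the weights of its edge types, and c(u,v) is the sum over all such paths. Similarity edges are ignored. The names `build_graph`, `l_short_simple_paths`, and `connection_strength`, the assignment of T1 and T2 to particular edges, and the weights 0.5 and 0.25 are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

Graph = Dict[str, List[Tuple[str, str]]]  # node -> [(neighbor, edge_type), ...]


def build_graph(edges: List[Tuple[str, str, str]]) -> Graph:
    """Build an undirected adjacency list from (node_a, node_b, edge_type) triples."""
    graph: Graph = defaultdict(list)
    for a, b, edge_type in edges:
        graph[a].append((b, edge_type))
        graph[b].append((a, edge_type))
    return graph


def l_short_simple_paths(graph: Graph, u: str, v: str, L: int) -> Iterator[List[str]]:
    """Phase 1: yield the edge types along every simple u-v path with at most L edges."""
    def dfs(node: str, visited: set, types: List[str]) -> Iterator[List[str]]:
        if node == v:
            yield list(types)
            return
        if len(types) == L:  # path-length limit reached
            return
        for neighbor, edge_type in graph.get(node, []):
            if neighbor not in visited:  # keep the path simple (no repeated nodes)
                visited.add(neighbor)
                types.append(edge_type)
                yield from dfs(neighbor, visited, types)
                types.pop()
                visited.remove(neighbor)

    yield from dfs(u, {u}, [])


def connection_strength(graph: Graph, weights: Dict[str, float],
                        u: str, v: str, L: int) -> float:
    """Phase 2 (assumed model): sum over paths of the product of edge-type weights."""
    total = 0.0
    for path_types in l_short_simple_paths(graph, u, v, L):
        strength = 1.0
        for edge_type in path_types:
            strength *= weights[edge_type]
        total += strength
    return total


# Toy ARG loosely modeled on the John Smith / Alan White figure (edge typing assumed).
edges = [
    ("John Smith", "P1", "T1"),
    ("Alan White", "P1", "T1"),
    ("John Smith", "MIT", "T2"),
    ("Alan White", "MIT", "T2"),
]
weights = {"T1": 0.5, "T2": 0.25}  # hypothetical analyst-assigned weights
arg = build_graph(edges)
print(connection_strength(arg, weights, "John Smith", "Alan White", L=2))  # 0.3125
```

In this toy graph the two nodes are connected through P1 (0.5 × 0.5) and through MIT (0.25 × 0.25), which gives 0.3125. Computing one c(u,v) at a time, as here, avoids building the whole matrix $K$, at the cost of path enumeration, which the slides identify as the bottleneck.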