ACM IEEE Joint Conference on Digital Libraries 2007

Adaptive Graphical Approach to Entity Resolution
Dmitri V. Kalashnikov
Stella Chen, Dmitri V. Kalashnikov, Sharad Mehrotra
Computer Science Department
University of California, Irvine
Additional information is available at http://www.ics.uci.edu/~dvk
Copyright © by Dmitri V. Kalashnikov, 2007

Structure of the Talk
• Motivation
• Generic Disambiguation Framework
  – High-level
• Entity Resolution Approach
  – Part of the Framework
• Experiments

Entity Resolution & Data Cleaning
[Figure: raw dataset(s) with references such as "J. Smith", "John Smith", "Jane Smith", "MIT", "Intel Inc." are to be turned into a "nice" regular database]
• Uncertainty
• Errors
• Missing data
Analysis on bad data leads to wrong conclusions!

Why do we need "Entity Resolution"?
Jane Smith (fresh Ph.D.): "Hi, I'm Jane Smith. I'd like to apply for a faculty position."
Tom (recruiter): "Wow! I am sure we will accept a strong candidate like that!"
Tom (recruiter): "OK, let me check something quickly …"
[Figure: Tom looks up Jane Smith's CiteSeer rank and publications (1. …… 2. …… 3. ……) and reacts with "???"]

What is the problem?
CiteSeer: the top-k most cited authors
Suspicious entries
– Let's go to the DBLP website
  – which stores bibliographic entries of many CS authors
– Let's check two people
  – "A. Gupta"
  – "L. Zhang"
[Figure: DBLP pages for the two names]

Comparing raw and cleaned CiteSeer
Panels: Raw CiteSeer's Top-K Most Cited Authors; Cleaned CiteSeer's Top-K Most Cited Authors

Rank            Author                Location
1 (100.00%)     douglas schmidt       cs@wustl
2 (100.00%)     rakesh agrawal        almaden@ibm
3 (100.00%)     hector garciamolina   @
4 (100.00%)     sally floyd           @aciri
5 (100.00%)     jennifer widom        @stanford
6 (100.00%)     david culler          cs@berkeley
6 (100.00%)     thomas henzinger      eecs@berkeley
7 (100.00%)     rajeev motwani        @stanford
8 (100.00%)     willy zwaenepoel      cs@rice
9 (100.00%)     van jacobson          lbl@gov
10 (100.00%)    rajeev alur           cis@upenn
11 (100.00%)    john ousterhout       @pacbell
12 (100.00%)    joseph halpern        cs@cornell
13 (100.00%)    andrew kahng          @ucsd
14 (100.00%)    peter stadler         tbi@univie
15 (100.00%)    serge abiteboul       @inria

What is the lesson?
• "Garbage in, garbage out" principle: making decisions based on bad data can lead to wrong results.
  – Data should be cleaned first
  – E.g., determine the (unique) real authors of publications
  – Solving such challenges is not always "easy"
  – This explains the large body of work on Entity Resolution

Typical Data Processing Flow
Raw Data → Extraction → Data Cleaning → Representation → Analysis

Two most common types of Entity Resolution
• Fuzzy lookup
  – match references to objects
  – list of all objects is given
  – [SDM'05], [TODS'06]
• Fuzzy grouping
  – group references that co-refer
  – [IQIS'05], [JCDL'07]
[Figure: references "J. Smith", "John Smith", "Jane Smith", "MIT", "Intel Inc."]

Structure of the Talk
• Motivation
• Generic Framework
  – High-level
• Approach
  – Part of the Framework
• Experiments

Traditional Approach to Entity Resolution
[Figure: traditional methods compare references X and Y feature by feature (f2 ? f2, f3 ? f3), using features and context, e.g. "J. Smith" ? "Jane Smith", sm@yahoo.com ? js@mit.edu]
s(X,Y) = f(X,Y)
Similarity = Similarity of Features

Key Observation: More Info is Available
[Figure: in a "nice" regular database, "J. Smith" = "Jane Smith" or "John Smith"?]

Solution: Main Idea
New Paradigm
• Traditional Methods: features and context (f1 ? f1, f2 ? f2, f3 ? f3, f4 ? f4 between X and Y)
• + Relationship Analysis: X and Y are connected in the ARG through nodes A, B, C, D, E, F
s(X,Y) = c(X,Y) + γ f(X,Y)
Similarity = Similarity of Features + "Connection Strength"

Illustrative Example
"Indirect connections"
– Suppose your co-worker's name is "John White"
– Suppose you see on the Web, on my homepage
  – My name: "Dmitri …"
  – Somebody named: "John White"
– Who is this "John White"?
– From data you might establish a connection:
  Dmitri – Visited – JCDL'07 – Visited – <you> – WorksAT – <your ORG> – WorksAT – John White
– "Dmitri" might be connected to more "John White"s…

Key Features of the Framework
Our goal is to create a framework with the following properties:
– solid theoretic foundation
– lookup
– domain-independent framework
– self-tuning
– scales to large datasets
– robust under uncertainty
– high disambiguation quality

Structure of the Talk
• Motivation
• Generic Framework
  – High-level
• Approach
  – Part of the Framework
• Experiments

Approach
• Graph Creation
  – Entity-Relationship Graph
• Consolidation Algorithm
  – Bottom-up clustering
• Adaptiveness to data
  – That is, self-tuning
  – Supervised learning
• External Data
  – To improve the quality further
  – A theoretic possibility
  – Not tested yet

ER Graph Creation

Virtual Connected Subgraph (VCS)
• VCS
  – Subgraphs in the ER graph
  – Similarity edges form VCSs
  1. "Virtual" – Contains only similarity edges
  2. "Connected" – A path between any 2 nodes
  3. Completeness – Adding more nodes/edges would violate (1) and (2)
• Logically, the Goal is
  – Partition each VCS properly
[Figure legend: nodes: publication, person, department, organization; edges: regular, similarity; VCS]

Consolidation Algorithm: Merging

Self-tuning via Supervised Learning

Self-tuning (2)

External Knowledge to Improve Quality

Structure of the Talk
• Motivation
• Generic Framework
  – High-level
• Approach
  – Part of the Framework
• Experiments

Quality
"Context" is proposed in [Bhattacharya et al., DMKD'04]
The two algorithms are proposed in [Dong et al., SIGMOD'05]

Scalability & Efficiency

Impact of Random Relationships

Contact Information
• Info about our disambiguation project
  – http://www.ics.uci.edu/~dvk
• Overall design
  – Dmitri V. Kalashnikov
  – dvk [at] domain
• Implementation details in JCDL'07
  – Zhaoqi (Stella) Chen
  – chenz [at] domain
  – domain = ics.uci.edu
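
Illustrative sketch (not from the talk)
The VCS definition and the similarity formula from the slides can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the function names find_vcs and similarity, the default value of gamma, and the sample edges are hypothetical, and the union-find grouping is a standard way to compute connected components, not necessarily how the authors implemented it.

```python
from collections import defaultdict


def find_vcs(nodes, similarity_edges):
    """Return the connected components formed by similarity edges only.

    Each returned set is "virtual" (built from similarity edges only),
    "connected" (a path exists between any two of its nodes), and
    complete (no further node can be added without breaking either).
    Hypothetical helper: the name and signature are assumptions.
    """
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in similarity_edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    # Nodes without any similarity edge do not form a VCS.
    return [g for g in groups.values() if len(g) > 1]


def similarity(connection_strength, feature_similarity, gamma=1.0):
    """s(X,Y) = c(X,Y) + gamma * f(X,Y); gamma=1.0 is an assumed default."""
    return connection_strength + gamma * feature_similarity


# Sample data reusing names from the slides; the edges are made up.
nodes = ["J. Smith", "John Smith", "Jane Smith", "MIT", "Intel Inc."]
similarity_edges = [("J. Smith", "John Smith"), ("J. Smith", "Jane Smith")]

for vcs in find_vcs(nodes, similarity_edges):
    print(sorted(vcs))
```

Each VCS found this way is a candidate group of references; the consolidation step would then decide how to partition it, for example by merging the pairs with the highest s(X,Y) first.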