Literature Review
1

Record linkage
Runtime reduction techniques
  ◦ Blocking
  ◦ Canopies
  ◦ Sorted Neighborhood
Shift to parallel computing
Research directions
2

Determining if pairs of records refer to the same entity
  ◦ E.g. Distinguishing between data belonging to… Yipeng, the NUS student and Yipeng, the son of PM Lee
3

DB 1, DB 2, DB 3 (example name lists, in slide order): Amanda, Amanda, Amanda; Beverley, Beverley, Beverley; Catherine, Katherine, Katherine; Daniel, David, Amanda; Elaine, Amanda
Dedup Two Lists: O(M*N)
Dedup Single List: O(N^2)
4

Pairwise comparison is increasingly expensive
Blocking techniques
  ◦ Reduce the search space
Example records: Amanda, Amanda, David, Daniel
5

6

Record 1 through Record 10
Comparison Window: 2w−1
7

Pairwise comparison is increasingly expensive
Blocking techniques
  ◦ Reduce the search space
  ◦ Limitations: single node computation, localized data source, conflicting in function
Example records: Amanda, Amanda, David, Daniel
8

Multi node computation
Data source flexibility
Complementary to blocking methods
Frontrunners:
  ◦ P-Febrl (P Christen 2003)
  ◦ P-Swoosh (H Kawai 2006)
  ◦ Parallel Linkage (H Kim 2007)
9

Peter Christen
  ◦ Parallelized Febrl with MPI
  ◦ Linear speedup but did not scale up well
Hideki Kawai
  ◦ Designed P-Swoosh in a simulated environment
  ◦ Match based parallelism
  ◦ 2x speedup with use of domain knowledge
10

Hung-sik Kim, Dongwon Lee
  ◦ Explored parallel record linkage for different input cases in MATLAB
  ◦ Consistent speedup
  ◦ Not validated with very large datasets
11

Handles system level concerns…
  ◦ E.g. Data distribution, fault tolerance, dynamic load balancing, portability and scalability
Convenient model for scaling record linkage
  ◦ Better scaleup on pairwise comparisons (T Elsayed 2008)
  ◦ Runtime increased linearly with dataset size (R Vernica 2010)
12

Tailoring Hadoop for record linkage problems
  ◦ E.g. Bin packing blocks of different sizes
Experimenting with different problem types
  ◦ E.g. Bipartite data centers
Adapting existing parallel clustering algorithms onto the MapReduce model
13

Parallelism is a step in the right direction
  ◦ Complementary to existing approaches
  ◦ Consistent with the object orientation
But…
  ◦ Parallel design and implementation is difficult
  ◦ MapReduce is a viable solution
14
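
As a supplement to the runtime reduction slides, here is a minimal Python sketch contrasting naive single-list deduplication, which compares O(N^2) pairs, with Sorted Neighborhood candidate generation. It is an illustration under stated assumptions, not the method of any cited system: the function names `all_pairs` and `sorted_neighborhood_pairs`, the choice of the raw name as the sort key, and the reading of the window (each record is paired with the next w−1 records in sorted order, so it meets up to w−1 neighbours on each side, a span of 2w−1 records) are all hypothetical choices made for this example.

```python
from itertools import combinations


def all_pairs(records):
    """Naive single-list dedup: every pair is a candidate, N*(N-1)/2 pairs."""
    return list(combinations(records, 2))


def sorted_neighborhood_pairs(records, w, key=lambda r: r):
    """Sort by a key, then pair each record with the next w-1 records only.

    Assumption: 'w' is the one-sided window size, so each record is
    compared with at most w-1 neighbours before and after it.
    """
    ordered = sorted(records, key=key)
    pairs = []
    for i, rec in enumerate(ordered):
        # Slice stops at the end of the list, so no bounds check is needed.
        for other in ordered[i + 1 : i + w]:
            pairs.append((rec, other))
    return pairs


# Names taken from the example slides.
names = ["Daniel", "Amanda", "Katherine", "David", "Amanda", "Catherine"]

naive = all_pairs(names)
windowed = sorted_neighborhood_pairs(names, w=2)

print(len(naive), "candidate pairs without blocking")
print(len(windowed), "candidate pairs with a sorted window")
```

With this sort key, "Catherine" and "Katherine" land far apart in the sorted order and are never compared, which shows why the choice of key matters as much as the window size.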