LitPresentation

Literature Review


Record linkage
Runtime reduction techniques
◦ Blocking
◦ Canopies
◦ Sorted Neighborhood


Shift to parallel computing
Research directions

Determining if pairs of records refer to the same
entity
◦ E.g. Distinguishing between data belonging to…
Yipeng, the NUS student
and
Yipeng, the son of PM Lee
DB 1
DB 2
DB 3
Amanda
Amanda
Amanda
Beverley
Beverley
Beverley
Catherine
Katherine
Katherine
Daniel
David
Amanda
Elaine
Amanda
Dedup Two Lists
O(M*N)
Dedup Single List
O(N^2)
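
The two costs above can be illustrated with a minimal sketch. The function names, the `same_name` matcher, and the way the example names are split into two lists are assumptions made for illustration, not the presenter's setup.

```python
from itertools import combinations, product

def dedup_single_list(records, is_match):
    """Compare every unordered pair within one list: N*(N-1)/2 comparisons, i.e. O(N^2)."""
    return [(a, b) for a, b in combinations(records, 2) if is_match(a, b)]

def link_two_lists(left, right, is_match):
    """Compare every record of one list with every record of the other: M*N comparisons."""
    return [(a, b) for a, b in product(left, right) if is_match(a, b)]

def same_name(a, b):
    # Hypothetical matcher: exact match ignoring case.
    return a.lower() == b.lower()

# Illustrative lists only; the grouping into two databases is assumed.
db1 = ["Amanda", "Beverley", "Catherine", "Daniel", "Elaine"]
db2 = ["Amanda", "Beverley", "Katherine", "David", "Amanda"]

matches = link_two_lists(db1, db2, same_name)       # 5 * 5 = 25 comparisons
duplicates = dedup_single_list(db2, same_name)      # 5 * 4 / 2 = 10 comparisons
```

Both functions touch every possible pair, which is why the cost grows quadratically with the input size.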


Pairwise comparison is increasingly expensive
Blocking techniques
◦ Reduce the search space
Amanda
Amanda
David
Daniel
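
As a rough sketch of how blocking reduces the search space, records can be grouped by a cheap key and compared only within each group. The first-letter key and the helper names are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations

def block_by_key(records, key):
    """Group records into blocks that share the same blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key(rec)].append(rec)
    return blocks

def candidate_pairs(blocks):
    """Yield only the pairs that fall inside the same block."""
    for members in blocks.values():
        yield from combinations(members, 2)

names = ["Amanda", "Amanda", "David", "Daniel"]

# Hypothetical blocking key: the first letter of the name.
blocks = block_by_key(names, key=lambda n: n[0].upper())
pairs = list(candidate_pairs(blocks))
```

With four records, exhaustive comparison needs six pairs; this key leaves two candidate pairs, one per block. The trade-off is that true matches with different keys are never compared.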
[Figure: Sorted Neighborhood. Records 1 to 10 in sorted order, with a sliding comparison window labelled "Window: 2w−1"]
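
A minimal sketch of the sorted neighborhood idea follows, assuming the common formulation in which records are sorted by a key and each record is compared with the next w−1 records. Under that formulation a record meets neighbours up to w−1 positions on either side, a span of 2w−1 records, which may be what the window label in the figure refers to; that reading is an assumption. The function name and sort key are hypothetical.

```python
def sorted_neighborhood_pairs(records, key, w):
    """Sort records by key, then pair each record with the next w-1 records."""
    ordered = sorted(records, key=key)
    pairs = []
    for i, rec in enumerate(ordered):
        # The slice covers positions i+1 .. i+w-1 and shrinks near the end of the list.
        for other in ordered[i + 1 : i + w]:
            pairs.append((rec, other))
    return pairs

records = [f"Record {i}" for i in range(1, 11)]

# Hypothetical sort key: the numeric suffix of the record label.
pairs = sorted_neighborhood_pairs(records, key=lambda r: int(r.split()[1]), w=3)
```

For 10 records and w = 3 this produces 17 candidate pairs instead of the 45 needed for exhaustive comparison.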


Pairwise comparison is increasingly expensive
Blocking techniques
◦ Reduce the search space
◦ Limitations
 Single node computation
 Localized data source
 Conflicting in function

Multi node computation
Data source flexibility
Complementary to blocking methods

Frontrunners:


◦ P-Febrl (P Christen 2003),
◦ P-Swoosh (H Kawai 2006),
◦ Parallel Linkage (H Kim 2007)

Peter Christen
◦ Parallelized Febrl with MPI
◦ Linear speedup, but did not scale up well

Hideki Kawai
◦ Designed P-Swoosh in a simulated environment
◦ Match-based parallelism
◦ 2x speedup with use of domain knowledge

Hung-sik Kim, Dongwon Lee
◦ Explored parallel record linkage for different input cases in MATLAB
◦ Consistent Speedup
◦ Not validated with very large datasets

Handles system level concerns…
◦ E.g. Data distribution, fault tolerance, dynamic load
balancing, portability and scalability

Convenient model for scaling record linkage
◦ Better scale-up on pairwise comparisons (T Elsayed 2008)
◦ Runtime increased linearly with dataset (R Vernica 2010)
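
To show why the model is convenient for record linkage, here is a single-process sketch of the map, shuffle, and reduce steps in plain Python. It does not use the Hadoop API; the function names, the first-letter blocking key, and the matcher are all assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

def map_phase(records, blocking_key):
    """Map: emit a (blocking key, record) pair for every input record."""
    for rec in records:
        yield blocking_key(rec), rec

def shuffle(mapped):
    """Shuffle: group emitted values by key, as the framework would do between phases."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(key, values, is_match):
    """Reduce: compare records pairwise, but only within one key group."""
    return [(a, b) for a, b in combinations(values, 2) if is_match(a, b)]

records = ["Amanda", "amanda", "David", "Daniel", "Catherine", "Katherine"]

groups = shuffle(map_phase(records, lambda r: r[0].upper()))
matches = []
for key, values in groups.items():
    # In a real cluster each key group could be reduced on a different node.
    matches.extend(reduce_phase(key, values, lambda a, b: a.lower() == b.lower()))
```

Each reduce call is independent of the others, so the framework can distribute key groups across nodes while handling the system-level concerns listed above.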

Tailoring Hadoop for record linkage problems
◦ E.g. Bin packing blocks of different sizes

Experimenting with different problem types
◦ E.g. Bipartite data centers

Adapting existing parallel clustering algorithms
onto the MapReduce model
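
The bin-packing direction mentioned above could look something like the following sketch, which assigns blocks of different sizes to workers so that comparison work is roughly balanced. The greedy largest-first heuristic, the cost model, and all names are assumptions; the slides do not specify a particular algorithm.

```python
import heapq

def comparison_cost(block_size):
    """Number of pairwise comparisons inside a block of the given size."""
    return block_size * (block_size - 1) // 2

def assign_blocks(block_sizes, num_workers):
    """Greedy packing: give the most expensive remaining block to the least-loaded worker."""
    heap = [(0, w) for w in range(num_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = {w: [] for w in range(num_workers)}
    by_cost = sorted(block_sizes.items(),
                     key=lambda kv: comparison_cost(kv[1]), reverse=True)
    for name, size in by_cost:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(name)
        heapq.heappush(heap, (load + comparison_cost(size), worker))
    loads = {w: sum(comparison_cost(block_sizes[b]) for b in blocks)
             for w, blocks in assignment.items()}
    return assignment, loads

# Hypothetical block sizes (number of records per block).
block_sizes = {"A": 10, "B": 6, "C": 5, "D": 4, "E": 3}
assignment, loads = assign_blocks(block_sizes, num_workers=2)
```

Because cost grows quadratically with block size, one large block can dominate a worker's load, which is why packing by comparison cost rather than by record count may matter.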

Parallelism is a step in the right direction
◦ Complementary to existing approaches
◦ Consistent with the object orientation

But…
◦ Parallel design and implementation is difficult
◦ MapReduce is a viable solution