Progressive Duplicate Detection

ABSTRACT
Duplicate detection is the process of identifying multiple representations of the same real-world
entity. Today, duplicate detection methods need to process ever larger datasets in ever
shorter time: maintaining the quality of a dataset becomes increasingly difficult. We present two
novel, progressive duplicate detection algorithms that significantly increase the efficiency of
finding duplicates when the execution time is limited: they maximize the gain of the overall
process within the time available by reporting most results much earlier than traditional
approaches.
Existing System
Data are among the most important assets of a company, but due to data changes and
sloppy data entry, errors such as duplicate entries can occur, making data cleansing, and in
particular duplicate detection, indispensable. However, the sheer size of today's datasets renders
duplicate detection processes expensive. Online retailers, for example, offer huge catalogs
comprising a constantly growing set of items from many different suppliers. As independent
persons change the product portfolio, duplicates arise. Although there is an obvious need for
deduplication, online shops that cannot afford downtime cannot run traditional, offline
deduplication.
Disadvantages of Existing System:
1. Ever larger datasets must be processed in ever shorter time.
2. Maintaining the quality of the dataset becomes increasingly difficult.
3. A user has little knowledge about the given data.
Proposed System:
We propose two novel, progressive duplicate detection algorithms: the progressive
sorted neighborhood method (PSNM), which performs best on small and almost clean datasets,
and progressive blocking (PB), which performs best on large and very dirty datasets. Both
enhance the efficiency of duplicate detection even on very large datasets. In comparison to
traditional duplicate detection, progressive duplicate detection satisfies two conditions:
(1) improved early quality and (2) the same eventual quality.
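As a rough illustration of the progressive idea behind PSNM (a minimal sketch in Java, not the authors' exact algorithm; the record list, similarity measure, and threshold below are hypothetical), records are sorted by a key and then compared rank-distance by rank-distance, so the pairs most likely to be duplicates are reported first:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.BiFunction;

// Minimal sketch of a progressive sorted-neighborhood pass.
public class ProgressiveSnmSketch {

    // Emits candidate pairs in increasing rank-distance order, so records that
    // sort closest together (the most duplicate-likely pairs) are compared first.
    static <R> void run(List<R> records, Comparator<R> sortKey,
                        BiFunction<R, R, Double> similarity,
                        double threshold, int maxWindow) {
        List<R> sorted = new ArrayList<>(records);
        sorted.sort(sortKey);                                // sort by the sorting key
        for (int dist = 1; dist < maxWindow; dist++) {       // progressive: distance 1 first
            for (int i = 0; i + dist < sorted.size(); i++) {
                R a = sorted.get(i), b = sorted.get(i + dist);
                if (similarity.apply(a, b) >= threshold) {
                    System.out.println("duplicate? " + a + " <-> " + b);  // report early
                }
            }
        }
    }

    public static void main(String[] args) {
        List<String> recs = Arrays.asList("anna", "ana", "bob", "bobb", "carl");
        // Toy similarity: records sharing their first two characters count as a match.
        run(recs, Comparator.naturalOrder(),
            (x, y) -> x.substring(0, 2).equals(y.substring(0, 2)) ? 1.0 : 0.0,
            1.0, 3);
    }
}

If the execution time runs out, the outer loop can simply be interrupted after any distance; the pairs reported up to that point are exactly the most promising ones, which is the "improved early quality" property stated above.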
Advantages of Proposed System:
1. Increases the efficiency of duplicate detection when the execution time is limited.
2. Reports most duplicates much earlier than traditional approaches.
MODULES
1. Window enlargement interval module
2. Partition caching module
Module description:
Window Enlargement Interval Module:
It defines the enlargement interval, i.e., how many distance-iterations (successive rank-distances) PSNM should execute on each loaded partition, as sketched below.
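A minimal sketch of this idea, assuming a hypothetical in-memory list of partitions: the enlargement interval controls how many distance-iterations run on each partition per load, so each partition is loaded fewer times at the cost of a slightly coarser progressive order.

import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: run `interval` distance-iterations per loaded partition.
public class WindowEnlargementSketch {

    // Toy stand-in for partitions that would normally be loaded from disk.
    static final List<List<String>> PARTITIONS = Arrays.asList(
        Arrays.asList("anna", "ana", "bob"),
        Arrays.asList("carl", "karl", "dora"));

    static void compareAtDistance(List<String> part, int dist) {
        for (int i = 0; i + dist < part.size(); i++) {
            System.out.println("compare " + part.get(i) + " / " + part.get(i + dist));
        }
    }

    public static void main(String[] args) {
        int maxWindow = 5;
        int interval = 2;  // enlargement interval: distance-iterations per partition load
        for (int start = 1; start < maxWindow; start += interval) {
            for (List<String> part : PARTITIONS) {           // one "load" of each partition
                int end = Math.min(start + interval, maxWindow);
                for (int dist = start; dist < end; dist++) { // `interval` iterations per load
                    compareAtDistance(part, dist);
                }
            }
        }
    }
}

With interval = 1 each partition is loaded once per distance (finest progressive order, most I/O); larger intervals trade some progressiveness for fewer partition loads.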
Partition Caching Module:
By using this module, we avoid the problem that all records must be read again, i.e., that
the entire file must be re-iterated, whenever the next partition is loaded; see the sketch below.
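A rough, hypothetical sketch of this idea in Java (not the paper's implementation): by caching each partition's start offset in the input file, reloading a previously visited partition can seek directly to that position instead of re-reading the whole file.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: cache partition offsets so that reloading a partition
// seeks straight to it instead of re-iterating the entire file.
public class PartitionCacheSketch {
    private final RandomAccessFile file;
    private final List<Long> offsets = new ArrayList<>(); // offsets.get(p) = start of partition p
    private final int recordsPerPartition;

    PartitionCacheSketch(String path, int recordsPerPartition) throws IOException {
        this.file = new RandomAccessFile(path, "r");
        this.recordsPerPartition = recordsPerPartition;
        offsets.add(0L);                                  // partition 0 starts at byte 0
    }

    // Loads partition p (partitions must first be visited in order, so that
    // their offsets become known); one line in the file = one record.
    List<String> load(int p) throws IOException {
        file.seek(offsets.get(p));                        // skip all earlier partitions
        List<String> records = new ArrayList<>();
        String line;
        while (records.size() < recordsPerPartition && (line = file.readLine()) != null) {
            records.add(line);
        }
        if (p + 1 == offsets.size()) {
            offsets.add(file.getFilePointer());           // remember where the next one begins
        }
        return records;
    }
}

On every later load of a partition, load(p) jumps straight to the cached offset, so no record of an earlier partition has to be read again.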
SYSTEM REQUIREMENTS
Hardware Requirements:

Processor    -  Pentium IV
Speed        -  1.1 GHz
RAM          -  256 MB
Hard Disk    -  20 GB
Keyboard     -  Standard Windows keyboard
Mouse        -  Two- or three-button mouse
Monitor      -  SVGA
Software Requirements:

Operating System  :  Windows XP
Coding Language   :  Java