Progressive Duplicate Detection

ABSTRACT

Duplicate detection is the process of identifying multiple representations of the same real-world entity. Today, duplicate detection methods need to process ever larger datasets in ever shorter time, and maintaining the quality of a dataset becomes increasingly difficult. We present two novel, progressive duplicate detection algorithms that significantly increase the efficiency of finding duplicates when the execution time is limited: they maximize the gain of the overall process within the time available by reporting most results much earlier than traditional approaches.

Existing System:

Data are among the most important assets of a company. But due to data changes and sloppy data entry, errors such as duplicate entries can occur, making data cleansing, and in particular duplicate detection, indispensable. However, the sheer size of today's datasets renders duplicate detection processes expensive. Online retailers, for example, offer huge catalogs comprising a constantly growing set of items from many different suppliers. As independent persons change the product portfolio, duplicates arise. Although there is an obvious need for deduplication, online shops that cannot tolerate downtime cannot afford traditional deduplication.

Disadvantages of Existing System:

1. Ever larger datasets must be processed in ever shorter time.
2. Maintaining the quality of the dataset becomes increasingly difficult.
3. A user has little knowledge about the given data.

Proposed System:

We propose two novel, progressive duplicate detection algorithms: the progressive sorted neighborhood method (PSNM), which performs best on small and almost clean datasets, and progressive blocking (PB), which performs best on large and very dirty datasets. Both enhance the efficiency of duplicate detection even on very large datasets. In comparison to traditional duplicate detection, progressive duplicate detection satisfies two conditions:

1. Improved early quality, and
2. The same eventual quality.

Advantages of Proposed System:

1. Increased efficiency of duplicate detection within a limited execution time.

MODULES

1. Window enlargement interval module
2. Partition caching module

Module Description:

Window Enlargement Interval Module:
The enlargement interval defines how many distance-iterations PSNM executes on each loaded partition before the next partition is loaded (see the first sketch at the end of this document).

Partition Caching Module:
This module avoids re-reading all records from the input file whenever the next partition is loaded; by caching partitions that have already been materialized, PSNM does not have to re-iterate over the entire file on every load (see the second sketch at the end of this document).

SYSTEM REQUIREMENTS

Hardware Requirements:

Processor  - Pentium IV
Speed      - 1.1 GHz
RAM        - 256 MB
Hard Disk  - 20 GB
Keyboard   - Standard Windows keyboard
Mouse      - Two- or three-button mouse
Monitor    - SVGA

Software Requirements:

Operating System : Windows XP
Coding Language  : Java
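
Window Enlargement Interval Sketch:

The following is a minimal, in-memory Java sketch of PSNM's progressive comparison order together with the enlargement interval. It is a simplification under stated assumptions, not the authors' implementation: the names (PsnmSketch, Rec, run, isDuplicate) are illustrative, partitions are slices of one mutable in-memory list rather than loads from disk, and comparisons across partition boundaries are omitted for brevity.

import java.util.Comparator;
import java.util.List;
import java.util.function.BiPredicate;

public class PsnmSketch {

    // A record is reduced to an id plus its sorting key here.
    record Rec(int id, String key) {}

    // interval = how many distance-iterations to execute on each loaded
    // partition before moving on (the window enlargement interval).
    static void run(List<Rec> records, int partSize, int window,
                    int interval, BiPredicate<Rec, Rec> isDuplicate) {
        records.sort(Comparator.comparing(Rec::key)); // sorted-neighborhood order

        // Enlarge the window in steps of `interval` distance-iterations.
        for (int base = 1; base < window; base += interval) {
            // One pass over all partitions per enlargement step.
            for (int start = 0; start < records.size(); start += partSize) {
                List<Rec> part = records.subList(start,
                        Math.min(start + partSize, records.size()));
                // Execute `interval` distance-iterations on this partition.
                for (int d = base; d < Math.min(base + interval, window); d++) {
                    for (int i = 0; i + d < part.size(); i++) {
                        if (isDuplicate.test(part.get(i), part.get(i + d))) {
                            System.out.println("duplicate: " + part.get(i).id()
                                    + " ~ " + part.get(i + d).id());
                        }
                    }
                }
            }
        }
    }
}

With interval = 1, all rank-distance-1 pairs in every partition are compared before any distance-2 pair, which is what produces the improved early quality; a larger interval executes more distance-iterations per loaded partition and therefore needs fewer partition loads.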
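
Partition Caching Sketch:

A minimal sketch of the caching idea, assuming line-based records and one temporary cache file per partition; the class PartitionCache and its file-naming scheme are hypothetical, not taken from the original work. On the first enlargement step, each partition is materialized in sort order and stored; later steps load the cached file directly instead of scanning the whole input file again.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

public class PartitionCache {

    private final Path dir;

    public PartitionCache(Path dir) { this.dir = dir; }

    private Path fileFor(int partition) {
        return dir.resolve("partition-" + partition + ".cache");
    }

    // Store a partition (one line per record, already in sort order)
    // the first time it is materialized.
    public void store(int partition, List<String> sortedLines) throws IOException {
        Files.write(fileFor(partition), sortedLines);
    }

    // Return the cached partition if present; only on a cache miss does
    // the caller fall back to scanning the entire input file.
    public Optional<List<String>> load(int partition) throws IOException {
        Path f = fileFor(partition);
        return Files.exists(f)
                ? Optional.of(Files.readAllLines(f))
                : Optional.empty();
    }
}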