SYSTEM REQUIREMENTS SPECIFICATION

Abstract:
This project provides a solution for duplicate detection in complex hierarchical structures, such as XML data. We present a novel method for XML duplicate detection, called XMLDup. XMLDup uses a Bayesian network to determine the probability of two XML elements being duplicates, considering not only the information within the elements but also the way that information is structured. In addition, to improve the efficiency of the network evaluation, a novel pruning strategy, capable of significant gains over the unoptimized version of the algorithm, is presented. Through experiments, we show that our algorithm is able to achieve high precision and recall scores on several data sets. XMLDup is also able to outperform another state-of-the-art duplicate detection solution, both in terms of efficiency and effectiveness.

1.0 Aim/Problem definition of the project:
The main aim of the project is to detect duplicates in structured data. The proposed system focuses on a specific type of error, namely fuzzy duplicates, or duplicates for short. Duplicates are multiple representations of the same real-world object (e.g., a person) that differ from each other because, for example, one representation stores an outdated address.

1.01 Existing System:
Most commonly, a relational database is used to store the data. In this case, the detection strategy typically consists of comparing pairs of tuples (each tuple representing an object) by computing a similarity score based on their attribute values. Two tuples are then classified as duplicates if their similarity is above a predefined threshold. However, this narrow view often neglects other available related information, for instance, the fact that data stored in a relational table relates to data in other tables through foreign keys.

1.02 Proposed System:
The proposed system finds duplicates in structured (XML) data. Through experiments, we show that our algorithm is able to achieve high precision and recall scores on several data sets.

1.1 Description of the project in short:
The main aim of the project is to detect duplicates in structured data. We first present a probabilistic duplicate detection algorithm for hierarchical data called XMLDup. Our approach to XML duplicate detection is centered on one basic assumption: the fact that two XML nodes are duplicates depends only on the fact that their values are duplicates and that their children nodes are duplicates. Thus, we say that two XML trees are duplicates if their root nodes are duplicates. The proposed system works with the following conditional probabilities:
a) The probability of the values of the nodes being duplicates, given that each individual pair of values contains duplicates.
b) The probability of two nodes being duplicates, given that their values and their children are duplicates.
c) The probability of the children nodes being duplicates, given that each individual pair of children are duplicates.
d) The probability of a set of nodes of the same type being duplicates, given that each pair of individual nodes in the set are duplicates.

2.0 Process Summary
a. Take two XML files in which a few of the records are duplicates.
b. Parse the XML files using the XML Parser API.
c. Compare the nodes of one file with the nodes of the other.
d. If two nodes are more similar than the threshold value, then those two nodes are duplicates (a minimal sketch of steps b-d is given after this list).
e. Compare the proposed system with the XMLDup algorithm and check that our system gives better recall.
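The following is a minimal sketch of steps b-d, assuming the JDK's built-in DOM parser (JAXP) as the XML Parser API. The class name DuplicateFinder, the input file names file1.xml and file2.xml, the threshold value 0.8, and the prefix-based similarity measure are illustrative placeholders; in the actual system the similarity method would be replaced by the Bayesian probability computation described in Section 1.1.

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class DuplicateFinder {

    // Threshold above which two nodes are treated as duplicates (step d).
    private static final double THRESHOLD = 0.8;

    public static void main(String[] args) throws Exception {
        // Step b: parse both XML files with the standard JAXP DOM parser.
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc1 = builder.parse(new File("file1.xml"));
        Document doc2 = builder.parse(new File("file2.xml"));

        List<Element> nodes1 = childElements(doc1.getDocumentElement());
        List<Element> nodes2 = childElements(doc2.getDocumentElement());

        // Steps c and d: compare every node of file 1 with every node of file 2.
        for (Element a : nodes1) {
            for (Element b : nodes2) {
                double sim = similarity(a.getTextContent(), b.getTextContent());
                if (sim > THRESHOLD) {
                    System.out.println("Duplicate pair: " + a.getNodeName()
                            + " ~ " + b.getNodeName() + " (similarity " + sim + ")");
                }
            }
        }
    }

    // Collects the direct child elements of a node (one record per element).
    private static List<Element> childElements(Node parent) {
        List<Element> result = new ArrayList<Element>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i) instanceof Element) {
                result.add((Element) children.item(i));
            }
        }
        return result;
    }

    // Placeholder string similarity: normalized length of the common prefix.
    // A real system would use an edit-distance or probabilistic measure.
    private static double similarity(String s, String t) {
        s = s.trim();
        t = t.trim();
        int max = Math.max(s.length(), t.length());
        if (max == 0) return 1.0;
        int common = 0;
        while (common < Math.min(s.length(), t.length())
                && s.charAt(common) == t.charAt(common)) {
            common++;
        }
        return (double) common / max;
    }
}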
2.1 Algorithms

3.0 Deliverables
Software: Eclipse, JDK 1.6, MySQL
Application: Desktop application

3.01 Usefulness/Advantages
1. Saves the user's time in finding duplicates in structured data.
2. Automatically extracts the relevant parts from the structured data.

3.1 Operating Environment

Software Requirement
Operating System      : Windows
Technology            : Java (JDK)
Front End             : Swing
Database              : MySQL
Database Connectivity : JDBC
API                   : XML Parser API

Hardware Requirement
Processor : At least a Pentium processor
RAM       : 64 MB
Hard Disk : 2 GB

3.2 Design and Implementation Constraints
We have to take two XML files that contain some duplicate records so that the system can find them. The main motive of the proposed system is simply to find the duplicates in XML. Our contribution work neglects the attributes of the inner nodes.

Contribution work:
The existing system implements the XMLDup algorithm, which considers all attributes of all nodes. In our system, instead of taking all attributes for comparison, we consider only the root node's attributes and leave out the others, so it gives better performance than the previous approach.

4.0 Modules Information
1. Module 1: Contains the basic GUI, which takes the XML files as input.
2. Module 2: Parses the XML files using the XML Parser API and obtains the individual nodes.
3. Module 3: Compares the nodes with each other and checks whether their similarity is greater than the threshold value; if it is, those nodes are taken as duplicates.
4. Module 4: Compares the proposed system with the XMLDup algorithm and checks that our system gives better recall than the previous system.

Project Plan

Modules     Code Delivery Date     Code Delivered (%)
Module 1                           25%
Module 2                           50%
Module 3                           75%
Module 4                           100%

Note from Author:
This project is very useful for finding duplicates in hierarchical data. Showing the XML data in hierarchical form on the GUI is a very challenging part of this project.
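Regarding the GUI challenge mentioned in the note above, one possible approach, given here only as a minimal sketch and not as the project's actual implementation, is to map the parsed DOM tree onto a Swing JTree. The class name XmlTreeViewer and the file name file1.xml are illustrative assumptions; only the standard JDK DOM and Swing APIs are used.

import javax.swing.JFrame;
import javax.swing.JScrollPane;
import javax.swing.JTree;
import javax.swing.tree.DefaultMutableTreeNode;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import java.io.File;

public class XmlTreeViewer {

    public static void main(String[] args) throws Exception {
        // Parse the XML file (the file name is only an example).
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("file1.xml"));

        // Build a Swing tree that mirrors the XML hierarchy.
        DefaultMutableTreeNode root = buildTree(doc.getDocumentElement());
        JTree tree = new JTree(root);

        JFrame frame = new JFrame("XML Hierarchy");
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        frame.add(new JScrollPane(tree));
        frame.setSize(400, 500);
        frame.setVisible(true);
    }

    // Recursively copies DOM elements into Swing tree nodes,
    // labelling element nodes with their tag names and adding
    // non-empty text content as leaf nodes.
    private static DefaultMutableTreeNode buildTree(Element element) {
        DefaultMutableTreeNode treeNode =
                new DefaultMutableTreeNode(element.getNodeName());
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Element) {
                treeNode.add(buildTree((Element) child));
            } else if (child.getNodeType() == Node.TEXT_NODE
                    && child.getTextContent().trim().length() > 0) {
                treeNode.add(new DefaultMutableTreeNode(
                        child.getTextContent().trim()));
            }
        }
        return treeNode;
    }
}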