Software Requirements Specification for <Annotating Search Results from Web Databases>

Annotating Search Results from Web Databases

1.0 Aim of the project / problem definition
To automatically annotate the data units within the SRRs (Search Result Records) returned from WDBs (Web Databases). The aim is to provide data-unit-level annotation and to implement query-based summarization over the SRRs.

1.01 Existing System
Many systems rely on human users to mark the desired information on sample pages and label the marked data at the same time; the system then induces a series of rules (a wrapper) to extract the same set of information from web pages of the same source. Such systems are often referred to as wrapper induction systems. Because of the supervised training and learning process, they can usually achieve high extraction accuracy. However, they scale poorly and are not suitable for applications that need to extract information from a large number of web sources.

1.02 Proposed System
There is a high demand for collecting data of interest from multiple Web Databases (WDBs). The proposed system therefore assigns labels automatically to the data units within the SRRs returned from WDBs. The data units returned from the underlying database are usually encoded into the result pages dynamically for human browsing. For the encoded data units to be machine processable, they need to be extracted and assigned meaningful labels. The automatic annotation solution consists of three phases: Phase 1 Alignment, Phase 2 Annotation, and Phase 3 Annotation Wrapper generation. Query-based summarization is then applied to the SRRs using the Query Specific ROCK Clustering Algorithm.

Description of the project in short:
This automatic annotation solution consists of three phases: Phase 1 Alignment, Phase 2 Annotation, and Phase 3 Annotation Wrapper generation.
In Phase 1, we first identify all data units in the SRRs and then organize them into different groups, with each group corresponding to a different concept.
In Phase 2, we introduce multiple basic annotators, each exploiting one type of feature. Every basic annotator is used to produce a label for the units within a group holistically, and a probability model is adopted to determine the most appropriate label for each group.
In Phase 3, for each identified concept, we generate an annotation rule that describes how to extract the data units of this concept in the result page and what the appropriate semantic label should be. The rules for all aligned groups, collectively, form the annotation wrapper for the corresponding WDB, which can be used to directly annotate the data retrieved from the same WDB in response to new queries without the need to perform the alignment and annotation phases again.

Instead of using only the DOM tree or other HTML tag tree structures of the SRRs to align the data units, we consider important features shared among data units, such as their data types (DT), data contents (DC), presentation styles (PS), and adjacency (AD) information. These features are extracted using ViNTs (Visual information aNd Tag structure based wrapper generator).

Once annotation of the SRRs has been implemented successfully, we will produce a short summary for the entered query using the data from the SRRs.

Clustering based Text Summarization
Clustering is a data mining (machine learning) technique used to place data elements into related groups without advance knowledge of the group definitions. In other words, clustering groups objects into different sets, or partitions a data set into subsets (clusters).
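To make the idea of grouping without predefined group definitions concrete, the following is a minimal sketch that groups sentences by word overlap. It is not the Query Specific ROCK algorithm named in this document: the Jaccard similarity measure, the greedy single-pass strategy, and all class and method names (SentenceClusteringSketch, jaccard, words, cluster) are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; names and similarity measure are assumptions.
class SentenceClusteringSketch {

    // Jaccard similarity between two word sets: |A intersect B| / |A union B|.
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> union = new HashSet<String>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<String>(a);
        intersection.retainAll(b);
        return (double) intersection.size() / union.size();
    }

    // Lower-cases a sentence and splits it into a set of words.
    static Set<String> words(String sentence) {
        Set<String> result = new HashSet<String>();
        for (String w : sentence.toLowerCase().split("\\W+")) {
            if (w.length() > 0) {
                result.add(w);
            }
        }
        return result;
    }

    // Greedy single-pass grouping: a sentence joins the first cluster that
    // already holds a member at least `threshold` similar to it; otherwise
    // it starts a new cluster. No group definitions are given in advance.
    static List<List<String>> cluster(List<String> sentences, double threshold) {
        List<List<String>> clusters = new ArrayList<List<String>>();
        for (String s : sentences) {
            List<String> home = null;
            for (List<String> candidate : clusters) {
                for (String member : candidate) {
                    if (jaccard(words(s), words(member)) >= threshold) {
                        home = candidate;
                        break;
                    }
                }
                if (home != null) {
                    break;
                }
            }
            if (home == null) {
                home = new ArrayList<String>();
                clusters.add(home);
            }
            home.add(s);
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList(
                "java web database search",
                "search results from web database",
                "clustering groups similar text",
                "text clustering groups sentences");
        System.out.println(cluster(nodes, 0.3));
    }
}
```

With the sample sentences in main and a threshold of 0.3, the first two sentences and the last two sentences end up in separate groups. Raising the threshold makes the grouping stricter and tends to produce more, smaller clusters.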
Text summarization systems fall into two categories:
a) The Query-Based Text Summarization System uses standard retrieval methods to map a query against a document collection and to create a summary.
b) The Concept-Based Text Summarization System creates a query-independent document summary.
The main aim is to combine both approaches, document clustering and query-dependent summarization, using natural language processing (NLP). This mainly involves applying different clustering algorithms to a text document, creating a weighted document graph from the resulting graph based on the keywords, and obtaining the optimal tree to produce the summary of the document.

Block 1: Processing input file and generating document graph: This block accepts text files only. It is responsible for uploading the text file and processing it, i.e. forming a node for the contents of every new line. It is also responsible for generating the weight from each node to every other node.
Block 2: Clustering nodes and building clustered graph: This block is responsible for choosing one of two clustering algorithms. It also accepts a threshold against which the similarity between clusters is checked. It is responsible for forming the clusters.
Block 3: Creating weighted document clustered graph: This block is responsible for accepting the fired query. It checks the similarity between the contents of the query and the contents of the clusters, and then builds the weighted clustered document graph.
Block 4: Summary generation: This block is responsible for generating the summary of the clusters formed, in response to the fired query. It generates the minimal clusters and, after finding the weight of each node for the fired query, returns the top-ranked summaries.

2.0 Process Summary:
1. Each SRR extracted by ViNTs has a tag structure that determines how the contents of the SRR are displayed in a web browser.
2. We identify and use five common features shared by the data units:
   a. Data Content (DC)
   b. Presentation Style (PS)
   c. Data Type (DT)
   d. Tag Path (TP)
   e. Adjacency (AD)
3. We consider four types of relationships between a data unit (U) and a text node (T):
   a. One-to-One Relationship
   b. One-to-Many Relationship
   c. Many-to-One Relationship
   d. One-to-Nothing Relationship
4. Data Alignment: match data units by their similarity on the above five features (Alignment Algorithm).
5. Assigning Label
6. Annotation
7. Annotation Wrapper
8. Query Based Summarization

2.1 Algorithms:
1] Alignment Algorithm

3.0 Deliverables:
Web Application.

4. Operating Environment
S/W Specification
Operating System: Windows XP/7/8
Developing language: Java (JDK 1.6)
Database: MySQL
Server: Tomcat 6.0
H/W Specification
Processor: PIV, 500 MHz to 3.0 GHz
RAM: 1 GB
Disk: 20 GB
Monitor: Any color display
Standard keyboard and mouse

3.2 Design and Implementation Constraints:
Because alignment is based on features extracted from the SRRs (data content, presentation style, tag path, data type, and adjacency of data units), accuracy cannot be guaranteed.

3.3 Assumptions and Dependencies:
ViNTs (Visual information aNd Tag structure based wrapper generator) is assumed to be directly available, so we will use it as a web service to extract the SRRs from the result page.

4.0 Modules Information
Module 1: Basic GUI module that takes the query and the search engine name from the user, retrieves data from the web database, and then extracts the top 10 SRRs (Search Result Records) from the result page for further processing.
Module 2: Extract the data units and the text nodes from the SRRs, find their features, and then determine the relationships between them. On the basis of those relationships, align the data units using the Alignment Algorithm.
Module 3: After alignment, assign labels using the basic annotators, such as the Table Annotator, the Query-Based Annotator, and the Frequency-Based Annotator.
Module 4: Combine the annotators so that they are capable of fully labeling all the data units on different result pages. Use these annotated data units to construct an annotation wrapper for the WDB, so that new SRRs retrieved from the same WDB can be annotated quickly using this wrapper without reapplying the entire annotation process.
Module 5: Contribution work. We will process the text nodes extracted from the SRRs and apply the Query Specific ROCK Clustering algorithm to extract useful information for the end user, so that the user gains knowledge about the issued query (query-based summarization).

Project Plan
Modules     Code delivery date     Code delivered (%)
Module 1                           25%
Module 2                           50%
Module 3                           75%
Module 4                           100%
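
The feature-based matching of data units described for Module 2 can be sketched as follows. This is a minimal illustration, not the Alignment Algorithm itself: it assumes that the overall similarity is an equal-weight average of five per-feature scores, and the DataUnit class, its fields, and the per-feature measures (exact match and term overlap) are hypothetical choices made for this example.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only; the weighting and all names are assumptions.
class DataUnitSimilaritySketch {

    // Hypothetical container for the five features of one data unit.
    static class DataUnit {
        final String content;   // data content (DC)
        final String style;     // presentation style (PS), e.g. "bold"
        final String dataType;  // data type (DT), e.g. "Price"
        final String tagPath;   // tag path (TP), e.g. "td/span"
        final String previous;  // adjacency (AD): text of the preceding unit

        DataUnit(String content, String style, String dataType,
                 String tagPath, String previous) {
            this.content = content;
            this.style = style;
            this.dataType = dataType;
            this.tagPath = tagPath;
            this.previous = previous;
        }
    }

    // Splits text on whitespace into a set of lower-cased terms.
    static Set<String> terms(String text) {
        Set<String> result = new HashSet<String>();
        for (String t : text.trim().toLowerCase().split("\\s+")) {
            if (t.length() > 0) {
                result.add(t);
            }
        }
        return result;
    }

    // Share of common terms between two strings (0.0 when both are empty).
    static double termOverlap(String a, String b) {
        Set<String> union = terms(a);
        union.addAll(terms(b));
        if (union.isEmpty()) {
            return 0.0;
        }
        Set<String> common = terms(a);
        common.retainAll(terms(b));
        return (double) common.size() / union.size();
    }

    // 1.0 when both values are identical, 0.0 otherwise.
    static double exact(String a, String b) {
        return a.equals(b) ? 1.0 : 0.0;
    }

    // Equal-weight average of the five per-feature scores (an assumption).
    static double similarity(DataUnit u, DataUnit v) {
        double sum = termOverlap(u.content, v.content)
                + exact(u.style, v.style)
                + exact(u.dataType, v.dataType)
                + exact(u.tagPath, v.tagPath)
                + termOverlap(u.previous, v.previous);
        return sum / 5.0;
    }

    public static void main(String[] args) {
        DataUnit price1 = new DataUnit("$12.99", "bold", "Price", "td/span", "Our Price");
        DataUnit price2 = new DataUnit("$8.50", "bold", "Price", "td/span", "Our Price");
        DataUnit title = new DataUnit("Java Programming", "plain", "Text", "td/a", "");
        System.out.println("price1 vs price2: " + similarity(price1, price2));
        System.out.println("price1 vs title:  " + similarity(price1, title));
    }
}
```

In this sketch, two price units from different SRRs score high because they share style, data type, tag path, and preceding text even though their contents differ, which is the kind of evidence that would place them in the same alignment group. Data units would then be grouped by comparing such scores across SRRs.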