SRS

Software Requirements Specification for < Annotating Search Results from Web Databases>
Page 1
Annotating Search Results from Web Databases
1.0 Aim of the project/ problem definition
The project automatically annotates the data units within the SRRs (Search Result Records)
returned from WDBs (Web Databases). The aim is to provide data unit level annotation and to
implement query based summarization from the SRRs.
1.01 Existing System.
Many systems rely on human users to mark the desired information on sample pages and
label the marked data at the same time, and then the system can induce a series of rules
(wrapper) to extract the same set of information on web pages from the same source.
These systems are often referred to as wrapper induction systems. Because of the
supervised training and learning process, these systems can usually achieve high
extraction accuracy. However, they suffer from poor scalability and are not suitable for
applications that need to extract information from a large number of web sources.
1.02. Proposed System
There is a high demand for collecting data of interest from multiple Web Databases
(WDBs). The proposed system therefore automatically assigns labels to the data units
within the SRRs returned from WDBs. The data units returned from the underlying database
are usually encoded into the result pages dynamically for human browsing. For the encoded
data units to be machine processable, they need to be extracted and assigned meaningful
labels. This automatic annotation solution consists of three phases: Phase 1 Alignment, Phase 2
Annotation, and Phase 3 Annotation Wrapper generation. Query based summarization is then
applied to the SRRs using the Query Specific ROCK Clustering Algorithm.
Description of the project in short:
This automatic annotation solution consists of three phases: Phase 1 Alignment, Phase 2
Annotation, and Phase 3 Annotation Wrapper generation. In Phase 1, we first identify all data
units in the SRRs and then organize them into different groups, with each group corresponding
to a different concept. In Phase 2, we introduce multiple basic annotators, each exploiting
one type of feature. Every basic annotator is used to produce a label for the units within a
group holistically, and a probability model is adopted to determine the most appropriate label
for each group. In Phase 3, for each identified concept, we generate an annotation rule that
describes how to extract the data units of this concept in the result page and what the
appropriate semantic label should be. The rules for all aligned groups, collectively, form the
annotation wrapper for the corresponding WDB, which can be used to directly annotate the
data retrieved from the same WDB in response to new queries without the need to perform
the alignment and annotation phases again.
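As an illustration of how an annotation wrapper can be reused on new result pages, the following is a minimal sketch, not the project's implementation. It assumes that each annotation rule can be reduced to a semantic label plus a regular expression whose first capture group extracts the data unit. The class name AnnotationWrapperSketch, the Rule type, the HTML fragments and the labels are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of an annotation wrapper: one rule per aligned group (concept).
class AnnotationWrapperSketch {

    // Hypothetical rule: a semantic label plus a pattern whose first capture
    // group extracts the data unit of that concept from an SRR.
    static class Rule {
        final String label;
        final Pattern pattern;

        Rule(String label, String regex) {
            this.label = label;
            this.pattern = Pattern.compile(regex);
        }
    }

    // Applies every rule to one SRR and returns label -> extracted data unit.
    // Rules that do not match the SRR contribute no entry.
    static Map<String, String> annotate(List<Rule> wrapper, String srr) {
        Map<String, String> annotated = new LinkedHashMap<String, String>();
        for (Rule rule : wrapper) {
            Matcher m = rule.pattern.matcher(srr);
            if (m.find()) {
                annotated.put(rule.label, m.group(1));
            }
        }
        return annotated;
    }

    public static void main(String[] args) {
        // Rules assumed to have been generated once for a given WDB.
        List<Rule> wrapper = new ArrayList<Rule>();
        wrapper.add(new Rule("Title", "<a[^>]*>(.*?)</a>"));
        wrapper.add(new Rule("Price", "<span class=\"price\">(.*?)</span>"));

        // A new SRR from the same WDB is annotated without re-running alignment.
        String newSrr = "<a href=\"/b/2\">Web Databases</a> <span class=\"price\">$30.00</span>";
        System.out.println(annotate(wrapper, newSrr));
    }
}
```

The rules are built once per WDB and then applied to every later SRR, which is what makes the alignment and annotation phases unnecessary for new queries.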
Instead of using only the DOM tree or other HTML tag tree structures of the SRRs to align the
data units, we consider important features shared among data units, such as their data types
(DT), data contents (DC), presentation styles (PS), and adjacency (AD) information, which are
extracted using ViNTs (Visual information aNd Tag structure based wrapper generator).
After annotation of the SRRs has been implemented successfully, we will generate a short summary
for the entered query using the data from the SRRs.
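To make the data type (DT) feature more concrete, the sketch below assigns a coarse type to the text of a data unit. The type names and the regular expressions are assumptions chosen for this example; they are not taken from ViNTs or from the project code.

```java
import java.util.regex.Pattern;

// Hypothetical helper: assigns a coarse data type (DT) to a data unit's text.
class DataTypeSketch {

    enum DataType { INTEGER, DECIMAL, CURRENCY, DATE, STRING }

    // Assumed patterns; a real system would need locale-aware rules.
    private static final Pattern INTEGER = Pattern.compile("\\d+");
    private static final Pattern DECIMAL = Pattern.compile("\\d+\\.\\d+");
    private static final Pattern CURRENCY = Pattern.compile("\\$\\s?\\d+(\\.\\d+)?");
    private static final Pattern DATE = Pattern.compile("\\d{1,2}/\\d{1,2}/\\d{4}");

    // Returns the first type whose pattern matches the whole trimmed text,
    // falling back to STRING when nothing matches.
    static DataType classify(String text) {
        String t = text.trim();
        if (CURRENCY.matcher(t).matches()) return DataType.CURRENCY;
        if (DATE.matcher(t).matches()) return DataType.DATE;
        if (DECIMAL.matcher(t).matches()) return DataType.DECIMAL;
        if (INTEGER.matcher(t).matches()) return DataType.INTEGER;
        return DataType.STRING;
    }

    public static void main(String[] args) {
        String[] samples = { "1999", "12.50", "$12.50", "03/15/2013", "Data Mining Concepts" };
        for (String s : samples) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

Two data units with the same data type are more likely to belong to the same concept, which is why DT can contribute to alignment alongside the other features.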
Clustering based Text Summarization
Clustering is a data mining (machine learning) technique used to place data elements into
related groups without advance knowledge of the group definitions. In other words, clustering is
the grouping of objects into different sets, or the partitioning of a data set into subsets (clusters).
Text summarization systems are classified into two categories: a) the Query-Based Text
Summarization System, which uses standard retrieval methods to map a query against a document
collection and to create a summary, and b) the Concept-Based Text Summarization System, which
creates a query-independent document summary.
The main aim is to combine both approaches, document clustering and query dependent
summarization, using natural language processing (NLP). This mainly involves applying
different clustering algorithms to a text document, creating a weighted document graph from
the result based on the keywords, and obtaining the optimal tree to get the summary of the
document.
Block 1: Processing input file and generating document graph: This block accepts text files
only. It is responsible for uploading the text file and processing it, i.e. forming a node for the
contents of every new line. It is also responsible for generating a weight from each node to every other node.
Block 2: Clustering nodes and building clustered graph: This block is responsible for choosing one
of two clustering algorithms. It also accepts the threshold, so that the similarity
between the clusters can be checked up to that level. It is responsible for making the clusters.
Block 3: Creating weighted document clustered graph: This block is responsible for accepting the
fired query. It checks the similarities between the query contents and the contents
in the clusters. It then builds the weighted clustered document graph.
Block 4: Summary generation: This block is responsible for generating the summary of the
clusters we formed, as a response to the fired query. It generates the minimal clusters and, after finding
the weight of each node for the fired query, returns the topmost summaries.
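The four blocks above can be sketched end to end as follows. This is a simplified illustration under stated assumptions, not the project's algorithm: the edge weight is assumed to be the number of distinct words two lines share, and the clustering step is a plain threshold-based grouping used as a stand-in for the two selectable clustering algorithms. Stop-word removal and stemming are omitted, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class QuerySummarySketch {

    // Lower-cased words of a text; stop-word removal and stemming are omitted.
    static Set<String> words(String text) {
        Set<String> set = new HashSet<String>();
        for (String w : text.toLowerCase().split("\\W+")) {
            if (w.length() > 0) set.add(w);
        }
        return set;
    }

    // Assumed edge weight: number of distinct words the two texts share.
    static int weight(String a, String b) {
        Set<String> common = words(a);
        common.retainAll(words(b));
        return common.size();
    }

    // Block 1: one node per non-empty line of the input text.
    static List<String> nodes(String text) {
        List<String> nodes = new ArrayList<String>();
        for (String line : text.split("\n")) {
            if (line.trim().length() > 0) nodes.add(line.trim());
        }
        return nodes;
    }

    // Block 2 (stand-in): nodes connected by edges of weight >= threshold,
    // directly or through other nodes, receive the same cluster id.
    static int[] cluster(List<String> nodes, int threshold) {
        int n = nodes.size();
        int[] id = new int[n];
        Arrays.fill(id, -1);
        int next = 0;
        for (int start = 0; start < n; start++) {
            if (id[start] != -1) continue;
            List<Integer> stack = new ArrayList<Integer>();
            stack.add(start);
            id[start] = next;
            while (!stack.isEmpty()) {
                int u = stack.remove(stack.size() - 1);
                for (int v = 0; v < n; v++) {
                    if (id[v] == -1 && weight(nodes.get(u), nodes.get(v)) >= threshold) {
                        id[v] = next;
                        stack.add(v);
                    }
                }
            }
            next++;
        }
        return id;
    }

    // Blocks 3 and 4: weight every cluster against the query and return the
    // lines of the best-scoring cluster, or "" if no cluster shares a word.
    static String summarize(String text, String query, int threshold) {
        List<String> nodes = nodes(text);
        int[] id = cluster(nodes, threshold);
        int[] score = new int[nodes.size()];
        for (int i = 0; i < nodes.size(); i++) {
            score[id[i]] += weight(nodes.get(i), query);
        }
        int best = -1;
        for (int c = 0; c < score.length; c++) {
            if (score[c] > 0 && (best == -1 || score[c] > score[best])) best = c;
        }
        StringBuilder summary = new StringBuilder();
        for (int i = 0; i < nodes.size(); i++) {
            if (id[i] == best) {
                if (summary.length() > 0) summary.append('\n');
                summary.append(nodes.get(i));
            }
        }
        return summary.toString();
    }

    public static void main(String[] args) {
        String text = "Tomcat is a servlet container\nTomcat runs Java servlet code\n"
                + "MySQL stores relational tables\nMySQL tables hold rows";
        System.out.println(summarize(text, "mysql tables", 2));
    }
}
```

Summing the query weights over a cluster favours larger clusters; a real implementation might normalise by cluster size or select nodes through the optimal tree mentioned earlier.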
2.0 Process Summary:
1. Each SRR extracted by ViNTs has a tag structure that determines how the contents of
the SRR are displayed in a web browser.
2. We identify and use five common features shared by the data units:
a. Data Content (DC)
b. Presentation Style (PS)
c. Data Type (DT)
d. Tag Path (TP)
e. Adjacency (AD)
3. We consider four types of relationships between data unit (U) and text node (T):
a. One-to-One Relationship
b. One-to-Many Relationship
c. Many-to-One Relationship
d. One-To-Nothing Relationship
4. Data Alignment
Match data units by their similarity over the above five features (Alignment Algorithm).
5. Assigning Label
6. Annotation
7. Annotation Wrapper.
8. Query Based Summarization
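Step 4 above matches data units by their similarity over the five features. A minimal sketch of such a similarity is given below, assuming the overall similarity is a weighted sum of one similarity per feature. The weights, the per-feature measures (word overlap for DC and AD, exact match for PS, DT and TP) and all names are assumptions for illustration, not the values used by the Alignment Algorithm.

```java
import java.util.HashSet;
import java.util.Set;

class AlignmentSimilaritySketch {

    // Hypothetical feature record for one data unit.
    static class DataUnit {
        final String content;   // DC: text of the data unit
        final String style;     // PS: e.g. font and weight
        final String dataType;  // DT: e.g. STRING, CURRENCY
        final String tagPath;   // TP: tags leading to the unit
        final String previous;  // AD: text of the preceding unit, "" if none

        DataUnit(String content, String style, String dataType, String tagPath, String previous) {
            this.content = content;
            this.style = style;
            this.dataType = dataType;
            this.tagPath = tagPath;
            this.previous = previous;
        }
    }

    // Assumed weights for DC, PS, DT, TP and AD; they sum to 1.
    static final double[] WEIGHTS = { 0.3, 0.2, 0.2, 0.2, 0.1 };

    static Set<String> tokens(String text) {
        Set<String> set = new HashSet<String>();
        for (String t : text.toLowerCase().split("\\s+")) {
            if (t.length() > 0) set.add(t);
        }
        return set;
    }

    // Jaccard overlap of the word sets; two empty texts count as identical.
    static double overlap(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        if (ta.isEmpty() && tb.isEmpty()) return 1.0;
        Set<String> inter = new HashSet<String>(ta);
        inter.retainAll(tb);
        Set<String> union = new HashSet<String>(ta);
        union.addAll(tb);
        return (double) inter.size() / union.size();
    }

    static double exact(String a, String b) {
        return a.equals(b) ? 1.0 : 0.0;
    }

    // Weighted sum of the five per-feature similarities, in [0, 1].
    static double similarity(DataUnit a, DataUnit b) {
        return WEIGHTS[0] * overlap(a.content, b.content)
             + WEIGHTS[1] * exact(a.style, b.style)
             + WEIGHTS[2] * exact(a.dataType, b.dataType)
             + WEIGHTS[3] * exact(a.tagPath, b.tagPath)
             + WEIGHTS[4] * overlap(a.previous, b.previous);
    }

    public static void main(String[] args) {
        DataUnit title1 = new DataUnit("Data Mining", "bold", "STRING", "tr/td/a", "");
        DataUnit title2 = new DataUnit("Web Mining", "bold", "STRING", "tr/td/a", "");
        DataUnit price = new DataUnit("$12.50", "plain", "CURRENCY", "tr/td/span", "Data Mining");
        System.out.println("title1 vs title2: " + similarity(title1, title2));
        System.out.println("title1 vs price:  " + similarity(title1, price));
    }
}
```

With such a measure, two titles from different SRRs score higher against each other than a title does against a price, so they would fall into the same alignment group.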
2.1 Algorithms:
1] Alignment Algorithm
3.0 Deliverables:
Web Application.
4. Operating Environment
S/W Specification
Operating System : - Windows XP/7/8
Developing language : - Java (JDK 1.6)
Database : - MySQL
Server : - Tomcat 6.0
H/W Specification
Processor : - PIV, 500 MHz to 3.0 GHz
RAM : - 1 GB
Disk : - 20 GB
Monitor : - Any Color Display
Standard Keyboard and Mouse
3.2 Design and Implementation Constraints:
Because the alignment is based on features extracted from the SRRs, namely the data content,
presentation style, tag path, data type and adjacency of the data units, we cannot guarantee
its accuracy.
3.3 Assumptions and Dependencies:
ViNTs (Visual information aNd Tag structure based wrapper generator) is directly available, so
we will use it as a web service to extract the SRRs from the result page.
4.0 Modules Information
Module1:
Basic GUI module which takes the query and the search engine name from the user, retrieves data from the web
database, and then extracts the top 10 SRRs (Search Result Records) from the result page for further
processing.
Module2:
Extract the data units and the text nodes from the SRRs, find their features, and
then find the relationships between them.
On the basis of those relationships, align the data units using the Alignment Algorithm.
Module3:
After alignment, assign labels using the basic annotators such as the Table Annotator, Query-Based Annotator, and Frequency-Based Annotator.
Module4:
Combine the annotators so that they are capable of fully labeling all the data units on different result pages. Use
these annotated data units to construct an annotation wrapper for the WDB so that new SRRs
retrieved from the same WDB can be annotated quickly using this wrapper, without reapplying the
entire annotation process.
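One possible way to combine the basic annotators is sketched below. It assumes that each annotator proposes a label for a group together with a success probability, and that the annotators are independent, so the combined confidence in a label is one minus the probability that every annotator proposing it is wrong. The probabilities, the independence assumption and all names are illustrative only and are not taken from the project.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class AnnotatorCombinationSketch {

    // Hypothetical output of one basic annotator for one aligned group.
    static class Proposal {
        final String annotator;
        final String label;
        final double probability; // assumed success rate of this annotator

        Proposal(String annotator, String label, double probability) {
            this.annotator = annotator;
            this.label = label;
            this.probability = probability;
        }
    }

    // Combined confidence per label: 1 - product of (1 - p) over all
    // annotators that proposed that label (independence assumed).
    static Map<String, Double> combine(Proposal[] proposals) {
        Map<String, Double> allWrong = new LinkedHashMap<String, Double>();
        for (Proposal p : proposals) {
            Double soFar = allWrong.get(p.label);
            double base = (soFar == null) ? 1.0 : soFar.doubleValue();
            allWrong.put(p.label, base * (1.0 - p.probability));
        }
        Map<String, Double> combined = new LinkedHashMap<String, Double>();
        for (Map.Entry<String, Double> e : allWrong.entrySet()) {
            combined.put(e.getKey(), 1.0 - e.getValue());
        }
        return combined;
    }

    // Picks the label with the highest combined confidence, or null if
    // there are no proposals.
    static String bestLabel(Proposal[] proposals) {
        String best = null;
        double bestP = -1.0;
        for (Map.Entry<String, Double> e : combine(proposals).entrySet()) {
            if (e.getValue() > bestP) {
                bestP = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Proposal[] proposals = {
            new Proposal("Table Annotator", "Price", 0.6),
            new Proposal("Frequency-Based Annotator", "Price", 0.5),
            new Proposal("Query-Based Annotator", "Title", 0.7)
        };
        System.out.println(combine(proposals));
        System.out.println("Chosen label: " + bestLabel(proposals));
    }
}
```

In this example two weaker annotators that agree on a label outweigh a single stronger annotator that disagrees, which is the main reason for combining them instead of trusting only one.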
Module5: Contribution work.
We will process the text nodes extracted from the SRRs and apply the Query Specific ROCK
Clustering algorithm to extract useful information for the end user, so that the user can learn about the
query issued (Query Based Summarization).
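The sketch below shows only the neighbor and link computations and the goodness measure of the standard ROCK algorithm, applied to text nodes treated as word sets. The query specific part is not shown. The use of Jaccard similarity over whitespace-separated words, the convention that an item is not its own neighbor, the class name and the sample text nodes are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class RockLinksSketch {

    static Set<String> words(String text) {
        return new HashSet<String>(Arrays.asList(text.trim().toLowerCase().split("\\s+")));
    }

    // Jaccard similarity: shared words divided by all distinct words.
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<String>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<String>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    // Two items are neighbors when their similarity is at least theta.
    // In this sketch an item is not counted as its own neighbor.
    static boolean[][] neighbors(String[] items, double theta) {
        int n = items.length;
        boolean[][] nb = new boolean[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                boolean near = jaccard(words(items[i]), words(items[j])) >= theta;
                nb[i][j] = near;
                nb[j][i] = near;
            }
        }
        return nb;
    }

    // link(i, j) = number of neighbors that items i and j have in common.
    static int[][] links(boolean[][] nb) {
        int n = nb.length;
        int[][] link = new int[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (i == j) continue;
                for (int k = 0; k < n; k++) {
                    if (nb[i][k] && nb[j][k]) link[i][j]++;
                }
            }
        }
        return link;
    }

    // ROCK goodness measure for merging two clusters of sizes ni and nj that
    // share crossLinks links, using f(theta) = (1 - theta) / (1 + theta).
    static double goodness(int crossLinks, int ni, int nj, double theta) {
        double e = 1.0 + 2.0 * (1.0 - theta) / (1.0 + theta);
        return crossLinks / (Math.pow(ni + nj, e) - Math.pow(ni, e) - Math.pow(nj, e));
    }

    public static void main(String[] args) {
        String[] textNodes = {
            "java servlet tomcat",
            "java servlet container",
            "java tomcat container",
            "mysql database table"
        };
        int[][] link = links(neighbors(textNodes, 0.5));
        System.out.println("link(0,1) = " + link[0][1]);
        System.out.println("link(0,3) = " + link[0][3]);
        System.out.println("goodness  = " + goodness(link[0][1], 1, 1, 0.5));
    }
}
```

ROCK repeatedly merges the pair of clusters with the highest goodness, so items are grouped by how many neighbors they share and not only by their direct similarity.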
Project Plan
Modules     Code Delivery date     Code delivered (Percentage) %
Module 1                           25%
Module 2                           50%
Module 3                           75%
Module 4                           100%