Informative Content Extraction Using EIFCE (Effective Informative Content Extractor)
ABSTRACT
Internet web pages contain several items that cannot be classified as "informative content," e.g., search and filtering panels, navigation links, advertisements, and so on. Most clients and end-users search for the informative content and largely do not seek the non-informative content. As a result, the need for Informative Content Extraction from web pages becomes evident. Two steps, Web Page Segmentation and Informative Content Extraction, must be carried out for Web Informative Content Extraction. DOM-based Segmentation Approaches often cannot provide satisfactory results, and Vision-based Segmentation Approaches also have some drawbacks. This paper therefore proposes the Effective Visual Block Extractor (EVBE) Algorithm to overcome the problems of DOM-based Approaches and reduce the drawbacks of previous works in Web Page Segmentation. It also proposes the Effective Informative Content Extractor (EIFCE) Algorithm to reduce the drawbacks of previous works in Web Informative Content Extraction. Web Page Indexing Systems, Web Page Classification and Clustering Systems, and Web Information Extraction Systems can achieve significant savings and satisfactory results by applying the Proposed Algorithms.
Existing System
• The problem of information overload:
  – Users have difficulty assimilating the needed knowledge from the overwhelming number of documents.
• The situation is even worse if the needed knowledge is related to a temporal incident:
  – The published documents should be considered together to understand the development of the incident.
Proposed System:
For effective Informative Content Extraction, the web page must first be segmented into semantic blocks correctly. By applying the Proposed EVBE Algorithm, blocks such as BL3 and BL4 can be extracted easily. The VIPS algorithm, however, cannot segment them as separate blocks when the Permitted Degree of Coherence (PDoC) value is low; it can segment them as separate blocks only when the PDoC value is high. But when the PDoC value is high, it segments the page into many small blocks, even though some of those separate blocks should form a single block, which is unreasonable and inconvenient for any further processing. Although BL3 contains the informative content of the web page, BL4 does not contain any informative content of the page. The content nature of BL3 and BL4 is actually different, and they should be segmented as separate blocks; yet when the PDoC value is low, the VIPS algorithm treats BL3 and BL4 as a single block. The rules of the EVBE Algorithm can reduce the drawbacks of previous works and help produce finer results in Web Page Segmentation.
Some solutions proposed DOM-based Approaches to extract the informative content of the web page. Unfortunately, DOM tends to reveal presentation structure rather than content structure, and is often not accurate enough to extract the informative content of the web page. CE needs a learning phase for Informative Content Extraction from web pages, so it cannot extract the informative content from a single random input web page. FE can identify the Informative Content Block of the web page only if there is a dominant feature. The Proposed Approach therefore introduces the EIFCE Algorithm, which can extract informative content that is not necessarily the dominant content, without any learning phase and from a single random page. It simulates how a user understands the layout structure of a web page based on its visual representation. Compared with DOM-based Informative Content Extraction Approaches, it utilizes useful visual cues to obtain a better extraction of the informative content of the web page at the semantic level. The efficient rules of the Proposed EVBE Algorithm in the Web Page Segmentation Phase can help produce finer results in Web Informative Content Extraction. A minimal sketch of the resulting two-phase pipeline follows.
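To make the two-phase design concrete, the following minimal Java sketch shows how a segmentation phase (EVBE) would feed an extraction phase (EIFCE). The interface and method names here are hypothetical placeholders, since the paper does not publish the algorithms' code.

import java.util.List;

// Minimal sketch of the proposed two-phase pipeline. The type and
// method names are hypothetical placeholders; the paper does not
// publish the actual EVBE/EIFCE code.
interface Block { String text(); }
interface Segmenter { List<Block> segment(String html); }   // EVBE phase
interface Extractor { Block select(List<Block> blocks); }   // EIFCE phase

public class Pipeline {
    private final Segmenter segmenter;  // an EVBE implementation
    private final Extractor extractor;  // an EIFCE implementation

    public Pipeline(Segmenter s, Extractor e) {
        segmenter = s;
        extractor = e;
    }

    // Segment the page first, then pick the informative block.
    public String run(String html) {
        List<Block> blocks = segmenter.segment(html);
        return extractor.select(blocks).text();
    }
}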
MODULES:
1. Text Segmentation
2. Text Summarization
3. Web Page Segmentation
4. Informative Content Extraction
Modules Description
1. Text Segmentation
The objective of text segmentation is to partition an input text into non-overlapping segments such that each segment is a subject-coherent unit and any two adjacent units represent different subjects. Depending on the type of input text, segmentation can be classified as story boundary detection or document subtopic identification. The input for story boundary detection is usually a text stream. A simple illustration of boundary detection is sketched below.
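As an illustration only (the paper does not commit to a particular segmentation algorithm), a TextTiling-style Java sketch places a subtopic boundary wherever the cosine similarity between adjacent sentence windows drops below a threshold; the window size and threshold are assumptions.

import java.util.*;

// Illustrative TextTiling-style segmenter. Adjacent sentence windows
// are compared with cosine similarity over bag-of-words vectors; a
// boundary is placed wherever lexical cohesion drops below a threshold.
public class SimpleTextSegmenter {

    // Bag-of-words term frequencies for a window of sentences.
    private static Map<String, Integer> termCounts(List<String> sentences) {
        Map<String, Integer> counts = new HashMap<>();
        for (String s : sentences)
            for (String w : s.toLowerCase().split("\\W+"))
                if (!w.isEmpty()) counts.merge(w, 1, Integer::sum);
        return counts;
    }

    private static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        for (int v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    // Returns the sentence indices at which a subtopic boundary starts.
    public static List<Integer> boundaries(List<String> sentences,
                                           int window, double threshold) {
        List<Integer> cuts = new ArrayList<>();
        for (int i = window; i + window <= sentences.size(); i++) {
            double sim = cosine(
                termCounts(sentences.subList(i - window, i)),
                termCounts(sentences.subList(i, i + window)));
            if (sim < threshold) cuts.add(i);  // low cohesion: new subject
        }
        return cuts;
    }
}

A window of three to five sentences and a threshold tuned on sample data are typical starting points for this kind of baseline.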
2. Text Summarization
Generic text summarization automatically creates a condensed version of one or more
documents that captures the gist of the documents. As a document’s content may contain
many themes, generic summarization methods concentrate on extending the summary’s
diversity to provide wider coverage of the content.
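For illustration, a minimal extractive summarizer in Java scores sentences by the average document-wide frequency of their terms and keeps the top-k in their original order; this is a common baseline, not the method the paper prescribes.

import java.util.*;

// Illustrative extractive summarizer: sentences containing frequent
// document terms are assumed to carry the gist. This is a baseline
// sketch, not the paper's summarization method.
public class SimpleSummarizer {

    public static List<String> summarize(List<String> sentences, int k) {
        // Document-wide term frequencies.
        Map<String, Integer> tf = new HashMap<>();
        for (String s : sentences)
            for (String w : s.toLowerCase().split("\\W+"))
                if (!w.isEmpty()) tf.merge(w, 1, Integer::sum);

        // Score each sentence by its mean term frequency.
        double[] score = new double[sentences.size()];
        for (int i = 0; i < sentences.size(); i++) {
            double sum = 0;
            int n = 0;
            for (String w : sentences.get(i).toLowerCase().split("\\W+"))
                if (!w.isEmpty()) { sum += tf.get(w); n++; }
            score[i] = n == 0 ? 0 : sum / n;
        }

        // Pick the k best sentences, then restore original order.
        Integer[] idx = new Integer[sentences.size()];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, (a, b) -> Double.compare(score[b], score[a]));
        List<Integer> chosen =
            new ArrayList<>(Arrays.asList(idx).subList(0, Math.min(k, idx.length)));
        Collections.sort(chosen);

        List<String> summary = new ArrayList<>();
        for (int i : chosen) summary.add(sentences.get(i));
        return summary;
    }
}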
3. Web Page Segmentation
Several methods have been explored to segment a web page into regions or blocks. In the DOM-based Segmentation Approach, an HTML document is represented as a DOM tree. Useful tags that may represent a block in a page include P (paragraph), TABLE (table), UL (list), H1~H6 (headings), etc. DOM in general provides a useful structure for a web page, but tags such as TABLE and P are used not only for content organization but also for layout presentation. In many cases, DOM tends to reveal presentation structure rather than content structure, and is often not accurate enough to discriminate different semantic blocks in a web page. A further drawback is that such a layout template cannot fit all web pages, and the segmentation is too rough to exhibit semantic coherence. Compared with the above segmentation, Vision-based Page Segmentation (VIPS) excels in both appropriate partition granularity and coherent semantic aggregation. By detecting useful visual cues based on the DOM structure, a tree-like vision-based content structure of the web page is obtained. The granularity is controlled by the Degree of Coherence (DoC), which indicates how coherent each block is. VIPS can efficiently keep related content together while separating semantically different blocks from each other. Visual cues such as font, color and size are used to detect blocks. Each block in VIPS is represented as a node in a tree: the root is the whole page; inner nodes are the top-level coarser blocks; children nodes are obtained by partitioning the parent node into finer blocks; and all leaf nodes together form a flat segmentation of the web page with an appropriate degree of coherence. The stopping of the VIPS algorithm is controlled by the Permitted DoC (PDoC), which acts as a threshold indicating the finest granularity with which we are satisfied. The segmentation stops only when the DoCs of all blocks are not smaller than the PDoC.
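The PDoC stopping rule can be sketched as a short recursion. The VisualBlock type and its doc()/split() methods below are hypothetical stand-ins for the real visual analysis, which the paper does not spell out in code.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the VIPS stopping rule described above:
// a block is split recursively until the Degree of Coherence (DoC)
// of every leaf reaches the permitted threshold (PDoC).
abstract class VisualBlock {
    abstract double doc();                // Degree of Coherence of this block
    abstract List<VisualBlock> split();   // partition into finer visual blocks

    // Collect the flat leaf segmentation for a given PDoC value.
    static List<VisualBlock> segment(VisualBlock root, double pdoc) {
        List<VisualBlock> leaves = new ArrayList<>();
        if (root.doc() >= pdoc) {
            leaves.add(root);             // coherent enough: stop splitting
        } else {
            for (VisualBlock child : root.split()) {
                leaves.addAll(segment(child, pdoc));
            }
        }
        return leaves;
    }
}

This recursion makes the trade-off in the Proposed System section visible: a low PDoC stops early and may fuse blocks like BL3 and BL4, while a high PDoC keeps splitting and may shatter coherent content.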
4. Informative Content Extraction
Informative Content Extraction is the process of determining the parts of a web page that contain the main textual content of the document. A human user performs some kind of Informative Content Extraction almost naturally when reading a web page, by ignoring the parts with additional non-informative content, such as navigation, functional and design elements or commercial banners, at least as long as they are not of interest. Though it is a relatively intuitive task for a human user, it turns out to be difficult to determine the main content of a document automatically. Several approaches deal with the problem under very different circumstances. For example, Informative Content Extraction is used extensively in applications that rewrite web pages for presentation on small-screen devices or for access via screen readers for visually impaired users. Some applications in the fields of Information Retrieval, Information Extraction, Web Mining and Text Summarization use Informative Content Extraction to pre-process the raw data in order to improve accuracy. It becomes obvious that under these circumstances the extraction has to be performed by a general approach rather than a solution tailored to one particular set of HTML documents with a well-known structure.
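A common baseline for this task, shown below only as an illustration and not as the EIFCE algorithm, filters out link-heavy blocks (navigation, advertisements) and keeps the remaining block with the most text. The sketch uses the open-source jsoup parser, an assumption since the paper does not prescribe a library, and the 0.5 link-density threshold is likewise an assumption.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Illustrative link-density heuristic: blocks whose text is dominated
// by anchor text are treated as navigation or advertising, and the
// longest remaining text block is returned as the main content.
public class LinkDensityExtractor {
    public static String mainContent(String html) {
        Document doc = Jsoup.parse(html);
        Element best = null;
        int bestLen = 0;
        for (Element block : doc.select("div, td, p")) {
            int textLen = block.text().length();
            int linkLen = 0;
            for (Element a : block.select("a"))
                linkLen += a.text().length();
            // Skip empty and link-heavy blocks (navigation, ads).
            if (textLen == 0 || (double) linkLen / textLen > 0.5) continue;
            if (textLen > bestLen) { best = block; bestLen = textLen; }
        }
        return best == null ? "" : best.text();
    }
}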
System Configuration:

H/W System Configuration:

Processor      - Pentium III
Speed          - 1.1 GHz
RAM            - 256 MB (min)
Hard Disk      - 20 GB
Floppy Drive   - 1.44 MB
Keyboard       - Standard Windows Keyboard
Mouse          - Two or Three Button Mouse
Monitor        - SVGA
S/W System Configuration:

Operating System        : Windows 95/98/2000/XP
Application Server      : Tomcat 5.0/6.x
Front End               : HTML, Java, JSP
Scripts                 : JavaScript
Server-side Script      : Java Server Pages
Database                : MySQL
Database Connectivity   : JDBC
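A minimal JDBC connectivity check matching the listed stack might look as follows; the database name, user and password are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;

// Minimal JDBC connectivity check for the MySQL/JDBC stack above.
// The database name, user and password are placeholder values.
public class DbCheck {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");  // MySQL Connector/J driver
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/eifce_db", "user", "password")) {
            System.out.println("Connected: " + !con.isClosed());
        }
    }
}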
CONCLUSION
Web pages typically contain non-informative content, noise that can negatively affect the performance of Web Mining tasks. Automatically extracting the informative content of a page is therefore an interesting problem. By applying the Proposed EVBE and EIFCE Algorithms, the informative content of a web page can be extracted effectively. Automatically extracting the Informative Content Block from web pages can help increase the performance of Web Mining tasks. An empirical evaluation of the Proposed Approach is planned as future work.