A Novel Approach for Content Extraction from Web Pages

Summary

Domain: Knowledge and Data Engineering

Base paper: IEEE Transactions, 2014

Language / Platform: Java/J2EE on Linux/Windows OS

Server deployment used: JBoss Application Server / Apache Tomcat web server

Abstract

The rapid development of the Internet and of web publishing techniques has created numerous information sources published as HTML pages on the World Wide Web. However, web pages also carry a great deal of redundant and irrelevant information. Navigation panels, tables of contents (TOC), advertisements, copyright statements, service catalogs, privacy policies, and similar elements are considered irrelevant content. Such information complicates various web mining tasks such as web page crawling, web page classification, link-based ranking, and topic distillation.

This project discusses various approaches for extracting informative content from web pages and proposes a new approach for content extraction based on the word-to-leaf ratio and the density of links.

Existing Systems and their Disadvantages

Many methods have been developed to extract content blocks from web pages.

Lin and Ho proposed a method named InfoDiscoverer, which uses the <table> tag to divide a web page into blocks. Features are extracted from the blocks and the entropy of each feature is calculated; this entropy value is then used to decide whether a block is informative or not. The drawback of this method is that it cannot segment web pages that are laid out with tags other than <table>, such as <div>. In addition, the experiments were performed only on news websites with Chinese pages.
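
To make the entropy idea concrete, here is a minimal Java sketch (illustrative only, not the authors' implementation): a feature whose occurrences are spread evenly across all blocks gets a normalized entropy close to 1 and is likely navigational or redundant, while a feature concentrated in one block gets an entropy near 0.

    import java.util.*;

    // Minimal sketch of the entropy idea behind InfoDiscoverer (names are illustrative):
    // a feature (term) spread evenly across many blocks has high entropy and is likely
    // redundant; a feature concentrated in one block has low entropy and is likely content.
    public class FeatureEntropy {

        // Entropy of one feature given its frequency in each block.
        static double entropy(int[] freqPerBlock) {
            int total = Arrays.stream(freqPerBlock).sum();
            if (total == 0) return 0.0;
            double h = 0.0;
            for (int f : freqPerBlock) {
                if (f == 0) continue;
                double p = (double) f / total;
                h -= p * (Math.log(p) / Math.log(2));
            }
            // Normalize by log2(number of blocks) so the value lies in [0, 1].
            return freqPerBlock.length > 1 ? h / (Math.log(freqPerBlock.length) / Math.log(2)) : h;
        }

        public static void main(String[] args) {
            // "home" appears in every block (navigation word) -> entropy close to 1.
            System.out.println(entropy(new int[]{3, 3, 3, 3}));
            // "earthquake" appears in one block only (content word) -> entropy 0.
            System.out.println(entropy(new int[]{7, 0, 0, 0}));
        }
    }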

Kao and Lin proposed a method that uses the HITS (Hyperlink-Induced Topic Search) algorithm to obtain a concise structure of a web site by removing irrelevant structures. The InfoDiscoverer method is then applied to this filtered structure. This approach improves on InfoDiscoverer because it operates on the filtered structure rather than on whole web pages. HITS works by identifying hub and authority pages, but it has difficulty identifying hub pages that are linked to only a few authority pages.
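
For reference, a minimal sketch of the hub/authority iteration at the core of HITS, in generic textbook form on a toy link graph (not the authors' site-filtering code):

    import java.util.*;

    // Generic HITS iteration: authority(p) = sum of hub scores of pages linking to p,
    // hub(p) = sum of authority scores of pages p links to, with normalization each round.
    public class Hits {
        public static void main(String[] args) {
            // Tiny link graph: index i links to the pages listed in links[i].
            int[][] links = { {1, 2}, {2}, {0}, {2} };
            int n = links.length;
            double[] hub = new double[n], auth = new double[n];
            Arrays.fill(hub, 1.0);
            Arrays.fill(auth, 1.0);

            for (int iter = 0; iter < 20; iter++) {
                double[] newAuth = new double[n];
                for (int p = 0; p < n; p++)
                    for (int q : links[p]) newAuth[q] += hub[p];    // incoming hubs
                double[] newHub = new double[n];
                for (int p = 0; p < n; p++)
                    for (int q : links[p]) newHub[p] += newAuth[q]; // outgoing authorities
                normalize(newAuth);
                normalize(newHub);
                auth = newAuth;
                hub = newHub;
            }
            System.out.println("authority scores: " + Arrays.toString(auth));
            System.out.println("hub scores:       " + Arrays.toString(hub));
        }

        static void normalize(double[] v) {
            double norm = Math.sqrt(Arrays.stream(v).map(x -> x * x).sum());
            if (norm > 0) for (int i = 0; i < v.length; i++) v[i] /= norm;
        }
    }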

Kao proposed the WISDOM (Web Intrapage Informative Structure Mining Based on Document Object Model) method. This method uses information theory to evaluate the amount of information contained in each node of the DOM (Document Object Model) tree. It first divides the original DOM tree into subtrees and chooses candidate subtrees using an assigned threshold. A top-down, greedy algorithm then selects the informative blocks and a skeleton set consisting of the candidate informative structures. Merging and expanding operations are applied to the skeleton set to obtain the required informative blocks, and pseudo-informative nodes are removed during merging.
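
A much-simplified sketch of the candidate-subtree selection step is given below. The per-node information value here is just a placeholder number, not the information-theoretic measure WISDOM actually uses, and the threshold is arbitrary.

    import java.util.*;

    // Much-simplified sketch of candidate-subtree selection: each DOM node carries an
    // information value (a placeholder score here), subtree values are aggregated
    // bottom-up, and subtrees whose aggregated value exceeds a threshold become candidates.
    public class CandidateSubtrees {

        static class Node {
            final String name;
            final double info;            // placeholder per-node information value
            final List<Node> children = new ArrayList<>();
            Node(String name, double info) { this.name = name; this.info = info; }
            Node add(Node c) { children.add(c); return this; }
        }

        // Returns the aggregated information of the subtree and collects candidates.
        static double collect(Node n, double threshold, List<Node> candidates) {
            double sum = n.info;
            for (Node c : n.children) sum += collect(c, threshold, candidates);
            if (sum >= threshold) candidates.add(n);
            return sum;
        }

        public static void main(String[] args) {
            Node root = new Node("html", 0.1)
                    .add(new Node("nav", 0.2).add(new Node("a", 0.1)).add(new Node("a", 0.1)))
                    .add(new Node("article", 1.5).add(new Node("p", 2.0)).add(new Node("p", 1.8)));
            List<Node> candidates = new ArrayList<>();
            collect(root, 3.0, candidates);
            candidates.forEach(n -> System.out.println("candidate subtree rooted at <" + n.name + ">"));
        }
    }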

Debnath gave four algorithms, Content Extractor, Feature Extractor, K-Feature Extractor, and L-Extractor, for separating content blocks from irrelevant content. The Content Extractor algorithm finds redundant blocks based on the occurrence of the same block across multiple web pages. The Feature Extractor algorithm identifies the content block with the help of a particular feature. The K-Feature Extractor algorithm uses K-means clustering and retrieves multiple blocks, whereas Feature Extractor selects a single block. The L-Extractor algorithm combines a block-partitioning algorithm (VIPS, Vision-based Page Segmentation) with a support vector machine to identify content blocks in a web page. Content Extractor and Feature Extractor were evaluated on different websites, and both algorithms performed better than the InfoDiscoverer method in nearly all cases.
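
A minimal sketch of the redundancy test behind Content Extractor is shown below, assuming the pages have already been segmented into text blocks upstream; the 0.8 threshold is a placeholder, not a value from the paper.

    import java.util.*;

    // Minimal sketch of the redundancy idea: a block whose text recurs on a large
    // fraction of the site's pages is treated as template/boilerplate; blocks that
    // appear on few pages are kept as content.
    public class RedundantBlocks {

        // pages: each page is a list of block texts (already segmented upstream).
        static Set<String> findRedundant(List<List<String>> pages, double threshold) {
            Map<String, Integer> pageCount = new HashMap<>();
            for (List<String> page : pages) {
                // Count each distinct block once per page.
                for (String block : new HashSet<>(page)) {
                    pageCount.merge(block, 1, Integer::sum);
                }
            }
            Set<String> redundant = new HashSet<>();
            for (Map.Entry<String, Integer> e : pageCount.entrySet()) {
                if ((double) e.getValue() / pages.size() >= threshold) redundant.add(e.getKey());
            }
            return redundant;
        }

        public static void main(String[] args) {
            List<List<String>> pages = List.of(
                List.of("Home About Contact", "Storm hits the coast ..."),
                List.of("Home About Contact", "Elections scheduled for May ..."),
                List.of("Home About Contact", "New stadium opens downtown ..."));
            // The navigation block appears on every page and is flagged as redundant.
            System.out.println(findRedundant(pages, 0.8));
        }
    }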

Wang proposed a method based on fundamental information of web pages. In the first step, the method extracts information from each web page and combines it to obtain site-level information. The extracted information includes the text nodes (the data present inside tags), word length, menu subtrees (subtrees whose text-node length is less than 5), menu item information, and menu instance information. In the second step, an entropy estimation method is applied to discover the actual information required by users. The information extracted by this method helps in classifying web pages and in domain ontology generation.

Tseng and Kao proposed a method based on three novel features, similarity, density, and diverseness, for identifying information in web pages. The block with the maximum value of these three features is considered the informative block. The similarity feature identifies similar sets of objects in a group, the density feature indicates the degree of similar objects in a particular area, and diverseness measures the distribution of features among various objects.

Huang proposed a method employing block pre-clustering technology. The method consists of two phases: a modeling phase and a matching phase. In the modeling phase, the web page is first partitioned into blocks using VIPS (Vision-based Page Segmentation). A nearest-neighbor clustering algorithm then groups these blocks into clusters of similar structure; an importance degree is associated with each cluster, and the clusters with their importance degrees are stored in a clustered pattern database. In the matching phase, when a new web page arrives it is first partitioned into blocks, and these blocks are matched against the clustered pattern database to obtain their importance degrees. Entropy evaluation is then performed on these blocks to decide whether they are informative or not.

Kang and Choi proposed the RIPB (Recognizing Informative Page Blocks) algorithm using visual block segmentation. This method also partitions a web page into blocks based on VIPS. Blocks with similar structure are grouped into clusters, and a linear weighted function is applied to determine whether a block is informative or not. The function is based on the tokens (bits of text) and the area of the cluster.
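
As a rough illustration of such a linear weighted function, the sketch below combines a token count with a rendered-area fraction; the weights, feature names, and threshold are placeholders, not values from the paper.

    // Illustrative linear weighted score for a block cluster, in the spirit of RIPB:
    // a weighted combination of token count and rendered area; the weights and the
    // decision threshold are arbitrary placeholders.
    public class BlockScore {

        static double score(int tokenCount, double areaFraction,
                            double wTokens, double wArea) {
            return wTokens * tokenCount + wArea * areaFraction;
        }

        public static void main(String[] args) {
            double threshold = 50.0;                    // placeholder threshold
            double s = score(120, 0.35, 0.4, 20.0);     // 120 tokens, 35% of page area
            System.out.println("score = " + s + ", informative = " + (s >= threshold));
        }
    }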

Li proposed a novel algorithm for extracting informative blocks based on a new tree, the CST (Content Structure Tree). The CST is a good source for examining the structure and content of a web page. After creating the CST, the content weight of each block of the web page is calculated. The Extract Informative Content Blocks algorithm proposed by the authors is then used to extract blocks from the CST.

For content extraction in the e-commerce domain, Fei used another type of tree called the SDOM (Semantic DOM). This tree introduces the idea of combining structural information with its semantic meaning for efficient extraction; different wrappers convert tags into the corresponding information.

Thomas proposed the content code vector (CCV) approach, in which the content and the tags of a web page are represented as 1 and 0 respectively, forming the content code vector. A content code ratio (CCR) is then calculated, which determines the amount of content relative to code around each position of the CCV. A high CCR value means there is more text and fewer tags.
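
A minimal sketch of how a content code vector and a local content code ratio could be computed is shown below; the tokenisation and the window size are assumptions of this sketch, not the original implementation.

    import java.util.*;
    import java.util.regex.*;

    // Tokens inside tags map to 0, text tokens map to 1, and a sliding-window average
    // over the vector approximates a local content code ratio.
    public class ContentCodeVector {

        static List<Integer> toCcv(String html) {
            List<Integer> ccv = new ArrayList<>();
            Matcher m = Pattern.compile("<[^>]+>|[^<\\s]+").matcher(html);
            while (m.find()) {
                ccv.add(m.group().startsWith("<") ? 0 : 1);   // tag -> 0, text -> 1
            }
            return ccv;
        }

        // Average of the vector inside a window centred at index i.
        static double ccr(List<Integer> ccv, int i, int radius) {
            int from = Math.max(0, i - radius), to = Math.min(ccv.size(), i + radius + 1);
            double sum = 0;
            for (int k = from; k < to; k++) sum += ccv.get(k);
            return sum / (to - from);
        }

        public static void main(String[] args) {
            String html = "<div><a>Home</a><a>About</a></div><p>Quake shakes the northern coast today</p>";
            List<Integer> ccv = toCcv(html);
            System.out.println(ccv);
            System.out.println("CCR near the paragraph text: " + ccr(ccv, ccv.size() - 3, 3));
        }
    }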

Tim developed another approach that makes use of tag ratios. The tag ratio of a line is computed as the number of non-HTML-tag characters divided by the number of HTML tags in that line. A threshold on the tag ratio values then separates content from non-content blocks. The problem with this approach is that the code of a web page can be indented or unindented, which leads to different tag ratio values depending on how the code is distributed across lines.
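
A small sketch of a per-line tag ratio follows; the exact definition in the original work may differ, this version simply divides the non-tag characters of a line by its tag count.

    import java.util.regex.*;

    // Per-line tag ratio: strip the HTML tags from a line, count the remaining
    // characters, and divide by the number of tags on that line. Lines with a high
    // ratio are candidates for content.
    public class TagRatio {

        static double tagRatio(String line) {
            Matcher m = Pattern.compile("<[^>]*>").matcher(line);
            int tagCount = 0;
            while (m.find()) tagCount++;
            int textChars = line.replaceAll("<[^>]*>", "").trim().length();
            return (double) textChars / Math.max(1, tagCount);
        }

        public static void main(String[] args) {
            String nav  = "<ul><li><a href=\"/\">Home</a></li><li><a href=\"/about\">About</a></li></ul>";
            String body = "<p>The committee approved the new budget after a lengthy debate on Tuesday.</p>";
            System.out.println("navigation line ratio = " + tagRatio(nav));
            System.out.println("article line ratio    = " + tagRatio(body));
        }
    }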

Nguyen creates a template that consists of the paths to content blocks and also stores the non-content blocks that share the same path as a content block. Any new web page is compared against the stored template to determine whether it contains content blocks. However, this approach works only for particular types of web pages.

Yang used three parameters together in the extraction process: node link-text density, non-anchor-text density, and punctuation-mark density. The main idea behind using these three densities is that non-informative blocks contain fewer punctuation marks, more anchor text, and less plain text.
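
A rough sketch of such density features for a block is given below; the formulas are illustrative stand-ins, not the exact definitions from the paper.

    // Per-block density features in the spirit of this approach; illustrative only.
    public class BlockDensities {

        static double linkTextDensity(int anchorChars, int totalChars) {
            return totalChars == 0 ? 0 : (double) anchorChars / totalChars;
        }

        static double nonAnchorTextDensity(int anchorChars, int totalChars, int tagCount) {
            return (double) (totalChars - anchorChars) / Math.max(1, tagCount);
        }

        static double punctuationDensity(int punctuationMarks, int totalChars) {
            return totalChars == 0 ? 0 : (double) punctuationMarks / totalChars;
        }

        public static void main(String[] args) {
            // A navigation-like block: almost all of its text is anchor text, no punctuation.
            System.out.printf("nav:  link=%.2f nonAnchor=%.2f punct=%.3f%n",
                    linkTextDensity(45, 50), nonAnchorTextDensity(45, 50, 12), punctuationDensity(0, 50));
            // An article-like block: little anchor text, plenty of punctuation.
            System.out.printf("text: link=%.2f nonAnchor=%.2f punct=%.3f%n",
                    linkTextDensity(20, 800), nonAnchorTextDensity(20, 800, 6), punctuationDensity(35, 800));
        }
    }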

Uzun proposed a hybrid approach that combines automatic and manual techniques for the extraction process. Machine learning methods are used to derive rules for the extraction, and decision tree learning was found to be the best learning method for creating these rules. Informative content is then extracted by applying these rules and simple string manipulation functions. If these rules are not able to get the informative content

Proposed System

Our approach combines the word-to-leaf ratio (WLR) with the link attributes of nodes for content extraction. Previous techniques counted characters instead of words, which needlessly gives more importance to long words; therefore words are used instead of characters. Only the leaf nodes are examined in this ratio, as they are the only nodes that contain textual information.

Some earlier work does not consider the idea that a block containing more links is less informative than a block containing fewer links. Adding text-link and anchor-text ratios to the word-to-leaf ratio therefore gives a new approach that is more efficient.
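
A simplified sketch of the combined measure is given below, using the jsoup HTML parser for DOM access (jsoup and the thresholds are assumptions of this sketch, not part of the base paper). For each block it computes the word-to-leaf ratio and the share of text that sits inside anchor tags, and prefers blocks with a high WLR and a low link-text ratio.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    // For every element:
    //   WLR       = words in the subtree / number of text-bearing leaf nodes
    //   linkRatio = characters of anchor text / characters of all text
    // Blocks with high WLR and low link ratio are kept as informative. The thresholds
    // below are arbitrary placeholders, not values from the base paper.
    public class WlrExtractor {

        static int countWords(String text) {
            text = text.trim();
            return text.isEmpty() ? 0 : text.split("\\s+").length;
        }

        static int countTextLeaves(Element e) {
            int leaves = 0;
            for (Element child : e.getAllElements()) {
                if (child.children().isEmpty() && !child.ownText().trim().isEmpty()) leaves++;
            }
            return leaves;
        }

        static double wordToLeafRatio(Element e) {
            int leaves = countTextLeaves(e);
            return leaves == 0 ? 0 : (double) countWords(e.text()) / leaves;
        }

        static double linkTextRatio(Element e) {
            int total = e.text().length();
            int anchor = 0;
            for (Element a : e.select("a")) anchor += a.text().length();
            return total == 0 ? 0 : (double) anchor / total;
        }

        public static void main(String[] args) {
            String html = "<div id='nav'><a href='/'>Home</a> <a href='/news'>News</a></div>"
                        + "<div id='story'><p>The council voted to repair the old bridge.</p>"
                        + "<p>Work is expected to start in June and finish by winter.</p></div>";
            Document doc = Jsoup.parse(html);
            for (Element block : doc.select("div")) {
                double wlr = wordToLeafRatio(block);
                double lr = linkTextRatio(block);
                boolean informative = wlr > 3 && lr < 0.5;   // placeholder thresholds
                System.out.printf("%-6s WLR=%.2f linkRatio=%.2f informative=%b%n",
                        block.id(), wlr, lr, informative);
            }
        }
    }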

Hardware Requirements

1.2 GHz CPU

80GB hard disk

2 GB RAM

Software Requirements

JDK 1.7 and JRE 7

Eclipse Luna Integrated Development Environment

Apache Tomcat webserver

JBoss AS 7 Application server

MySQL 5.5

Firebug Debugging tool

Windows/ Linux Operating System
