iv. principle of genetic algorithms

International Journal on Advanced Computer Theory and Engineering (IJACTE)
Cosine Similarity Measure and Genetic Algorithm for extracting main
content from web documents
1Digvijay B. Gautam, 2Pradnya V. Kulkarni
Department of Computer Engineering, Maharashtra Institute of Technology, Pune 411038, India
Email: 1anshul1389@gmail.com, 2pradnya.kulkarni@mitpune.edu.in
Abstract— With the volume of information on the web growing
rapidly, web mining has become a necessity. As a result,
research on web mining has attracted considerable interest
from both industry and academia. Mining and predicting
users' web browsing behavior and deducing the actual
content of a web document are among the active research subjects.
The information on the web is noisy. Apart from useful
information, it contains unwanted items such as
copyright notices and navigation bars that are not part of
the main content of web pages. These items seriously harm web
data mining and hence need to be eliminated. This paper
studies similarity criteria based on cosine similarity
to deduce which parts of the content are more
important than others. Under the vector space model,
information retrieval is based on the similarity
measured between a query and the documents. Documents
with high similarity to the query are judged more relevant to
it and should be retrieved first. Under genetic
algorithms, each query is represented by a chromosome.
These chromosomes are fed into the genetic operator process
(selection, crossover, and mutation) until we obtain an optimized
query chromosome for document retrieval. Our test
results show that information retrieval with 0.8 crossover
probability and 0.01 mutation probability gives the highest
precision, while 0.8 crossover probability and 0.3 mutation
probability gives the highest recall.
Keywords- Cosine Similarity; Genetic Algorithm; Fitness
Function; Web Content Mining.
Information is expanding rapidly, and the web with it.
The web is a collection of abundant information, but that
information is noisy. Apart from useful
information, it contains unwanted items such as
copyright notices and navigation bars that are not part of
the main content of web pages. Although these items
are useful for human viewers and necessary for
web site owners, they can seriously harm automated
information collection and web data mining, e.g. web
page clustering, web page classification, and
information retrieval. Extracting the main content
blocks therefore becomes very important. Web pages contain Div
blocks, Table blocks, and other HTML blocks. This paper
aims at extracting the main content of a web
document that is relevant to the user's query. A new
algorithm is proposed to extract the informative block
from a web page based on DOM-based analysis and a
Content Structure Tree (CST). We then apply the cosine
similarity measure to identify and rank
each block of the page. Further, we use the TF-IDF
(Term Frequency and Inverse Document
Frequency) scheme to calculate the weight of each
node in the CST. Finally, we extract the information
most relevant to the query.
We will discuss two algorithms in this paper:
one for generating the Content Structure Tree and
another for extracting the main content from the Content
Structure Tree.
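As a rough illustration of the first step, splitting a page into Div and Table blocks, the following Python sketch collects the visible text of each outermost block. It is not the Content Structure Tree algorithm of this paper: the class name `BlockCollector`, the function `extract_blocks`, the choice of block tags, and the sample page are all assumptions made for the example.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "table"}  # assumption: only these tags delimit blocks


class BlockCollector(HTMLParser):
    """Collect the visible text of each outermost <div>/<table> block."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside block tags
        self.current = []   # text fragments of the block being read
        self.blocks = []    # finished blocks, in document order

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:  # an outermost block has just closed
                text = " ".join(self.current)
                if text:
                    self.blocks.append(text)
                self.current = []

    def handle_data(self, data):
        # Keep only non-blank text that sits inside some block.
        if self.depth > 0 and data.strip():
            self.current.append(data.strip())


def extract_blocks(html):
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return parser.blocks


# Hypothetical page: a navigation bar, a content block, a copyright notice.
page = (
    "<html><body>"
    "<div>Home | About | Contact</div>"
    "<div><p>Genetic algorithms evolve a population of queries.</p>"
    "<table><tr><td>Crossover</td><td>0.8</td></tr></table></div>"
    "<div>Copyright 2014</div>"
    "</body></html>"
)
blocks = extract_blocks(page)
```

Each entry of `blocks` can then be treated as a candidate block to be weighted and ranked against the query.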
Many extraction systems have been developed over time.
In [1], Swe Swe Nyein proposed a method
of mining the contents of a web page using cosine similarity.
In [2], C. Li et al. propose a method to extract
informative blocks from a web page based on an analysis
of both the layout and the semantic information of the
page. They needed to identify blocks occurring in
a web collection using the Vision-based Page
Segmentation algorithm. In [3], L. Yi et al. propose a
new tree structure, called Style Tree, to capture the actual
contents and the common layouts (or presentation styles)
of the web pages in a web site. Their method has
difficulty capturing the common presentation of
web pages from different web sites. In [4], Y. Fu et al.
propose a method to discover informative content blocks
based on the DOM tree. They removed clutter using XPath,
but could do so only for web pages with a similar
layout. In [5], P. S. Hiremath et al. propose an algorithm
called VSAP (Visual Structure based Analysis of web
Pages) to extract the data region based on visual clue
information (the location on the screen at which the tags of
data regions, data records, and data items are rendered) of web
pages. In [6], S. H. Lin et al. propose a system,
InfoDiscoverer, to discover informative content blocks
in web documents. It first partitions a web page into
several content blocks according to the HTML tag
<TABLE>. In [7], D. Cai et al. propose a Vision-based
Page Segmentation (VIPS) algorithm that segments web
ISSN (Print): 2319-2526, Volume -3, Issue -6, 2014
pages using the DOM tree in combination with human
visual cues, including tag cue, color cue, size cue, and
others. In [8], P. M. Joshi et al. propose an approach that
combines HTML DOM analysis and Natural
Language Processing (NLP) techniques for automated
extraction of the main article, with associated images, from
web pages. Their approach did not require prior
knowledge of website templates and extracted not
only the text but also the associated images, based on the
semantic similarity of image captions to the main text.
In [9], Y. Li et al. propose a tree called the content structure
tree, which captures the importance of the blocks. In
[10], R. R. Mehta et al. propose a page segmentation algorithm
that uses both visual and content information to
obtain semantically meaningful blocks. The output of
the algorithm is a semantic structure tree. In [11], S.
Gupta et al. propose a content extraction technique that can
remove clutter without destroying the webpage layout. It
not only extracts information from large logical units but
also manipulates smaller units, such as specific links,
within the structure of the DOM tree. Most of the
existing approaches are based only on the DOM tree. In [12],
David presents the basic principles of Genetic
Algorithms. In [13], Goldberg highlights the
significance of Genetic Algorithms in search,
optimization, and machine learning. In [14], Kraft et al. use
genetic programming to build queries for
information retrieval. In [15], Martin-Bautista et al. present an
adaptive information retrieval agent that uses genetic
algorithms with fuzzy set genes.
Web mining is the extraction of interesting and
potentially useful patterns and implicit information from
artifacts or activity related to the World Wide Web. It
has become a necessity as the amount of information across the
world, and hence the size of the web, increases
tremendously.
According to the mining objects, there
are roughly three knowledge discovery domains that
pertain to web mining: Web Content Mining, Web
Structure Mining, and Web Usage Mining. Web usage
mining is the application of data mining technology to
the data in web server log files. It can
discover the browsing patterns of users and
correlations between web pages. Web usage mining
supports web site design, the provision of
personalized services, and other business decision
making. Web mining applies data mining,
artificial intelligence, chart technology, and so on
to web data, traces users' visiting characteristics,
and then extracts the users' usage patterns. Web usage
mining generally includes the following steps:
Data collection
Data pretreatment
Establishing interesting model
Pattern analysis
A web mining algorithm based on web usage mining
also yields the design approach for an e-commerce
website application algorithm. This algorithm
is simple, effective, and easy to implement, and it suits
the web usage mining needs of constructing a low-cost
Business-to-Customer (B2C) website. The web mining
algorithm generally includes the following steps:
Collect and pre-treat users' information
Establish the topology structure of the web site
Establish conjunction matrices of the users' visits
Concrete application
A. Web Content Mining: Web Content Mining refers to the
description and detection of useful information from
web contents, data, and documents. There are two views on
web content mining: the information retrieval view and the
database view. From the information retrieval view, the aim
of web content mining is to help the process of
filtering or finding data for the user, usually
based on extraction or on the demands of
users; from the database view, it is an
attempt to model the data on the web and
integrate it so that most of the expert queries required
for searching the information can be executed on this
kind of data model.
B. Web Structure Mining: Web Structure Mining tries to
discover the model underlying the link structures of the
web. This model can be used to categorize web pages
and is useful for generating information such as the
similarity and relationship between different web sites.
C. Web Usage Mining: Web Usage Mining uses the
data derived from users' interactions with the web to detect,
automatically, the behavioral models of users accessing web
services.
GAs are characterized by five basic components:
1) Chromosome representation for the feasible solutions
to the optimization problem.
2) Initial population of the feasible solutions.
3) A fitness function that evaluates each solution.
4) Genetic operators that generate a new population
from the existing population.
5) Control parameters such as population size,
probability of genetic operators, and number of generations.
The information retrieval problem is how to retrieve the
documents a user requires. The fitness functions in Table 1
can be used to calculate the distance
between a document and a query. Table 1 gives two
types of fitness functions: weighted term vector and
binary term vector.
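For the binary term vector case, the three coefficients used as fitness functions in the experiments (Jaccard, cosine, and Dice) can be sketched in Python as follows. The formulas are the standard textbook forms for binary vectors, written in terms of the number of terms in each vector and the number of terms the two vectors share; they are assumed here and are not copied from Table 1. The function names and sample vectors are illustrative.

```python
import math


def shared_terms(x, y):
    """Number of terms that occur in both binary vectors."""
    return sum(a & b for a, b in zip(x, y))


def jaccard(x, y):
    """Shared terms divided by the terms occurring in either vector."""
    both = shared_terms(x, y)
    union = sum(x) + sum(y) - both
    return both / union if union else 0.0


def cosine(x, y):
    """Shared terms divided by the geometric mean of the two term counts."""
    denom = math.sqrt(sum(x) * sum(y))
    return shared_terms(x, y) / denom if denom else 0.0


def dice(x, y):
    """Twice the shared terms divided by the sum of the two term counts."""
    total = sum(x) + sum(y)
    return 2 * shared_terms(x, y) / total if total else 0.0


# Hypothetical query and document over a five-term vocabulary.
query = [1, 1, 0, 1, 0]
document = [1, 0, 0, 1, 1]
scores = (jaccard(query, document), cosine(query, document), dice(query, document))
```

All three scores lie between 0 and 1 and grow with the overlap between the two vectors, which is why they can serve as interchangeable fitness measures.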
We define X = (x1, x2, x3, ..., xn), |X| = number of
terms that occur in X, and |X ∩ Y| = number of terms that
occur in both X and Y.
Table 1. Fitness Function
A GA is an iterative procedure that maintains a
constant-size population of feasible solutions. During
each iteration step, called a generation, the fitness of the
current population is evaluated, and individuals are
selected based on their fitness values. The higher-fitness
chromosomes are selected for reproduction under the
action of crossover and mutation to form a new
population. The lower-fitness chromosomes are
eliminated. The new population is evaluated,
selected, and fed into the genetic operator process again until
we obtain an optimal solution. (See Fig. 1.)
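The generational loop of a GA can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in this paper: chromosomes are binary query vectors, fitness is the mean Jaccard similarity to a small set of known relevant documents, selection is fitness-proportionate, crossover is single-point, and the stopping rule is a fixed number of generations. The names `evolve` and `fitness` and the sample data are hypothetical; only the values Pc = 0.8 and Pm = 0.01 come from the experiments reported here.

```python
import random


def jaccard(x, y):
    both = sum(a & b for a, b in zip(x, y))
    union = sum(x) + sum(y) - both
    return both / union if union else 0.0


def fitness(chromosome, relevant_docs):
    # Assumption: a query is good if it resembles the known relevant documents.
    return sum(jaccard(chromosome, d) for d in relevant_docs) / len(relevant_docs)


def evolve(population, relevant_docs, pc=0.8, pm=0.01, generations=50, seed=0):
    rng = random.Random(seed)
    size = len(population)
    n_genes = len(population[0])
    for _ in range(generations):
        scores = [fitness(c, relevant_docs) for c in population]
        # Selection: fitter chromosomes are more likely to be drawn, so weak
        # ones tend to drop out. The offset avoids an all-zero weight list.
        weights = [s + 1e-9 for s in scores]
        parents = rng.choices(population, weights=weights, k=size)
        next_population = []
        for i in range(0, size, 2):
            a, b = parents[i][:], parents[(i + 1) % size][:]
            if rng.random() < pc:  # single-point crossover
                cut = rng.randrange(1, n_genes)
                a[cut:], b[cut:] = b[cut:], a[cut:]
            for child in (a, b):
                for g in range(n_genes):  # bit-flip mutation
                    if rng.random() < pm:
                        child[g] ^= 1
                next_population.append(child)
        population = next_population[:size]  # keep the population size constant
    return max(population, key=lambda c: fitness(c, relevant_docs))


# Hypothetical data: two relevant documents over a six-term vocabulary.
relevant_docs = [[1, 1, 0, 1, 0, 0], [1, 1, 0, 0, 0, 0]]
init_rng = random.Random(1)
population = [[init_rng.randint(0, 1) for _ in range(6)] for _ in range(10)]
best = evolve(population, relevant_docs)
```

The sketch has no elitism, so the best chromosome of one generation is not guaranteed to survive into the next; adding it would be a separate design choice.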
Fig. 1. The process of Genetic Algorithms
Cosine similarity is a measure of similarity between two
high-dimensional vectors. In essence, it is the cosine
of the angle between the two vectors.
The similarity of content in the web pages is estimated
using the cosine similarity measure, which is the cosine of
the angle between the query vector q and the document
vector dj. Then the weight of each node (term) in the
Content Structure Tree, which represents the contents
as nodes of a tree such as Text nodes, Image
nodes, and Link nodes, is calculated. The TF-IDF scheme
(Term Frequency and Inverse Document Frequency) is
used to calculate the weight. The weight of a term ti in
document dj is the number of times that the term appears in
the document. In this scheme, an arbitrary normalized weight wij is
defined. Then the weight of a content node is obtained as
Content Weight = Text Weight + Image Weight + Link Weight
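To make the weighting and ranking concrete, the following Python sketch weights terms with TF-IDF and scores each block against a query by cosine similarity. It is only an illustration: the smoothed inverse document frequency log(1 + N/df) is one common variant chosen here as an assumption, since the exact normalization of wij is not given, and the function names and sample blocks are hypothetical.

```python
import math
from collections import Counter


def idf_table(blocks):
    """Inverse document frequency of every term seen in the blocks."""
    n = len(blocks)
    df = Counter(term for block in blocks for term in set(block))
    # Assumption: smoothed IDF, so a term found in every block keeps a weight.
    return {term: math.log(1 + n / df[term]) for term in df}


def weight(tokens, idf):
    """TF-IDF vector as a {term: weight} dict; unseen terms are dropped."""
    tf = Counter(tokens)
    return {t: tf[t] * idf[t] for t in tf if t in idf}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical tokenised blocks of one page and a user query.
blocks = [
    "home about contact".split(),
    "genetic algorithm query optimization genetic".split(),
    "copyright notice".split(),
]
query = "genetic algorithm".split()

idf = idf_table(blocks)
q = weight(query, idf)
scores = [cosine(q, weight(block, idf)) for block in blocks]
best = max(range(len(blocks)), key=scores.__getitem__)
```

Here `best` is the index of the block most similar to the query; blocks that share no term with the query score zero.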
A fitness function is a performance measure or reward
function that evaluates how good each solution is.
The content value computed from the children
of an HtmlItem node is added to the element that
is the first child of the HtmlItem node. Before adding it to the
parent node, the system checks whether there are HtmlItem
nodes at the same level. If so, the similarity
of these nodes is computed using their weights, and
all HtmlItem nodes are eliminated except the one with the highest
similarity. Documents are ranked by
their similarity values. The top-ranked documents are
regarded as more relevant to the query.
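Ranking by similarity, and scoring the resulting ranked list with the recall and precision measures used in the experiments, can be sketched in Python as follows. The similarity scores, the relevance judgments, and the cut-off of two documents are made-up values for illustration only; `rank` and `precision_recall` are hypothetical helper names.

```python
def rank(similarities):
    """Document ids sorted by decreasing similarity to the query."""
    return sorted(similarities, key=similarities.get, reverse=True)


def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against relevance judgments."""
    retrieved, relevant = set(retrieved), set(relevant)
    ret_rel = len(retrieved & relevant)  # retrieved and relevant
    precision = ret_rel / len(retrieved) if retrieved else 0.0
    recall = ret_rel / len(relevant) if relevant else 0.0
    return precision, recall


similarities = {"d1": 0.91, "d2": 0.15, "d3": 0.62, "d4": 0.40}  # made-up scores
relevant = {"d1", "d4", "d5"}                                      # made-up judgments

top = rank(similarities)[:2]  # keep the two highest-ranked documents
precision, recall = precision_recall(top, relevant)
```

Retrieving more documents can only keep recall the same or raise it, while precision may fall, which is the trade-off the experiments explore through the mutation probability.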
This experiment tests 21 queries with 3
different fitness functions: the Jaccard coefficient (F1), the
cosine coefficient (F2), and the Dice coefficient (F3). Each
fitness function is tested with a set of parameters,
probability of crossover (Pc = 0.8) and probability of
mutation (Pm = 0.01, 0.10, 0.30), to compare the
efficiency of the retrieval system. Retrieval
efficiency is measured by recall and precision. Recall is
defined as the proportion of relevant documents that are retrieved
(see equation (1)).
Recall = (number of relevant documents retrieved) / (number of relevant documents)    (1)
Precision is defined as the proportion of retrieved
documents that are relevant (see equation (2)).
Precision = (number of relevant documents retrieved) / (number of documents retrieved)    (2)
Preliminary testing indicated that:
1. Experiments with the 3 fitness functions show that the
optimized queries from these fitness functions are all the
same queries, but with different fitness values (F1,
F2, and F3), as shown in Table 2. In Table 2, RetRel
is defined as the number of retrieved relevant documents
and RetNRel as the number of retrieved but not
relevant documents.
2. Information retrieval with Pc = 0.8 and Pm = 0.01
yields the highest precision, 0.746, while
Pm = 0.10 yields a moderate precision of
0.560 and Pm = 0.30 yields
the lowest precision, 0.417, as shown in Fig. 2.
3. Information retrieval with Pc = 0.8 and Pm = 0.30
yields the highest recall, 0.976, while
Pm = 0.01 yields a moderate recall and
Pm = 0.10 yields the lowest
recall, 0.786, as shown in Fig. 2.
Fig. 2. Precision and Recall
Table 2. Information retrieval by 3 Fitness Functions
with Pc = 0.8 and Pm = 0.01
The preliminary experiment indicated that precision
and recall are inversely related. Which parameters to use depends
on what the user wants to
retrieve. When high-precision documents are
preferred, the parameters should be a high crossover probability
and a low mutation probability. When more
relevant documents (high recall) are preferred, the parameters
should be a high mutation probability and a lower crossover
probability. The preliminary experiment also indicated that
GAs can be used in information retrieval.
[1] Swe Swe Nyein, “Mining Contents in Web
Page Using Cosine Similarity”, IEEE,
Copyright 2011.
[2] C. Li, J. Dong, and J. Chen, “Extraction of
Informative Blocks from Web Pages Based on
VIPS”, 1553-9105/ Copyright January 2010.
[3] L. Yi, B. Liu, and X. Li, “Eliminating Noisy
Information in Web Pages for Data Mining”, in
Proc. ACM SIGKDD International Conference
on Knowledge Discovery & Data Mining
[4] Y. Fu, D. Yang, and S. Tang, “Using XPath to
Discover Informative Content Blocks of Web
Pages”, IEEE. DOI 10.1109/SKG, 2007.
[5] P. S. Hiremath, S. S. Benchalli, S. P. Algur, and
R. V. Udapudi, “Mining Data Regions from
Web Pages”, International Conference on
Management of Data COMAD, India,
December 2005.
[6] S. H. Lin and J. M. Ho, “Discovering …”, …
Conference on Knowledge Discovery &
Data Mining, pp. 588-593, July …
[7] D. Cai, S. Yu, J. R. Wen, and W. Y. Ma,
“VIPS: a Vision-based Page Segmentation
Algorithm”, Technical Report, MSR-TR, Nov.
1, 2003.
[8] P. M. Joshi and S. Liu, “Web Document Text
and Images Extraction using DOM Analysis
and Natural Language Processing”, ACM,
DocEng, 2009.
[9] Y. Li and J. Yang, “A Novel Method to Extract
Informative Blocks from Web Pages”, IEEE.
DOI 10.1109/JCAI, 2009.
[10] R. R. Mehta, P. Mitra, and H. Kamick,
“Extracting Semantic Structure of Web
Documents Using Content and Visual
Information”, ACM, Chiba, Japan, May 2005.
[11] S. Gupta, G. Kaiser, D. Neistadt, and P.
Grimm, “DOM-based Content Extraction of
HTML Documents”, Proc. 12th International
Conference on WWW, ISBN: 1-58113-680-3.
[12] David, L. Handbook of Genetic Algorithms.
New York: Van Nostrand Reinhold. 1991.
[13] Goldberg, D.E. Genetic Algorithms: in Search,
Optimization, and Machine Learning. New
York: Addison-Wesley Publishing Co. Inc.
[14] Kraft, D.H. et al. “The Use of Genetic
Programming to Build Queries for Information
Retrieval.” In Proceedings of the First IEEE
Conference on Evolutional Computation. New
York: IEEE Press. 1994. pp. 468-473.
[15] Martin-Bautista, M.J. et al. “An Approach to
An Adaptive Information Retrieval Agent using
Genetic Algorithms with Fuzzy Set Genes.” In
Proceedings of the Sixth International
Conference on Fuzzy Systems. New York:
IEEE Press. 1997. pp. 1227-1232.