Web-page Classification through
Summarization
D. Shen, *Z. Chen, **Q. Yang, *H.J. Zeng, *B.Y. Zhang, Y.H. Lu
and *W.Y. Ma
Tsinghua University, *Microsoft Research Asia,
**Hong Kong University of Science and Technology
(SIGIR 2004)
Abstract


Web-page classification is much more
difficult than pure-text classification due to a
large variety of noisy information embedded
in Web pages
In this paper, we propose a new Web-page
classification algorithm based on Web
summarization to improve the accuracy
2/29
Abstract


Experimental results show that our proposed
summarization-based classification algorithm
achieves an approximately 8.8% improvement
over the pure-text-based classification algorithm
We further introduce an ensemble classifier
using the improved summarization algorithm
and show that it achieves about 12%
improvement over pure-text-based methods
3/29
Introduction

With the rapid growth of the WWW, there is an
increasing need to provide automated assistance to
Web users for Web-page classification and
categorization



Yahoo directory
LookSmart directory
However, this need is difficult to meet without automated
Web-page classification techniques, due to the labor-intensive
nature of human editing
4/29
Introduction



Web pages have their own underlying embedded
structure in the HTML language and typically
contain noisy content such as advertisement banners
and navigation bars
If pure-text classification is applied to these pages,
the noise introduces much bias into the classification
algorithm, which may lose focus on the main topic
and important content
Thus, a critical issue is to design an intelligent
preprocessing technique to extract the main topic of
a Web page
5/29
Introduction


In this paper, we show that using Web-page
summarization techniques for preprocessing
in Web-page classification is a viable and
effective technique
We further show that instead of using off-the-shelf
summarization techniques designed for pure-text
summarization, it is possible to design specialized
summarization methods catering to Web-page structures
6/29
Related Work

Recent works on Web-page summarization



Ocelot (Berger et al., 2000) is a system for summarizing
Web pages, using probabilistic models to generate the “gist”
of a Web page
Buyukkokten et al. (2001) introduce five methods for
summarizing parts of Web pages on handheld devices
Delort (2003) exploits the effect of context in Web-page
summarization, where the context consists of information
extracted from the content of all the documents linking to
a page
7/29
Related Work

Our work is also related to work on removing noise
from Web pages


Yi et al. (2003) propose an algorithm that introduces a
tree structure, called the Style Tree, to
capture the common presentation styles and the
actual contents of the pages
Chen et al. (2000) extract title and meta-data
included in HTML tags to represent the semantic
meaning of the Web pages
8/29
Web-Page Summarization

Adapted Luhn’s Summarization Method


Each sentence is assigned a significance factor, and the
sentences with the highest significance factors are
selected to form the summary
The significance factor of a sentence can be computed as
follows.




A “significant words pool” is built using word frequency
Set a limit L on the distance at which any two significant words
can be considered significantly related
Find the portion of the sentence bracketed by significant
words that are no more than L non-significant words apart
Count the number of significant words in that portion and
divide the square of this number by the total number of words
in the portion
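A minimal Python sketch of this significance factor (an illustration, not the paper's code; tokenization is assumed done and the limit L defaults to an arbitrary value):

```python
def luhn_significance(sentence_words, significant_words, limit=4):
    """Luhn's significance factor for one sentence.

    sentence_words:    list of tokens in the sentence
    significant_words: set of high-frequency, non-stop words
    limit:             max non-significant words allowed between two
                       significant words in one portion (L above)
    """
    # Positions of significant words within the sentence.
    positions = [i for i, w in enumerate(sentence_words)
                 if w in significant_words]
    if not positions:
        return 0.0

    best = 0.0
    start = prev = positions[0]
    count = 1
    for p in positions[1:]:
        if p - prev - 1 <= limit:
            # Still inside the same bracketed portion.
            count += 1
        else:
            # Close the portion: (significant words)^2 / total words.
            best = max(best, count * count / (prev - start + 1))
            start, count = p, 1
        prev = p
    best = max(best, count * count / (prev - start + 1))
    return best
```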
9/29
Web-Page Summarization

In order to customize Luhn’s algorithm for Web pages, we
make a modification



The category of each page is already known in the
training data, so significant-word selection can be performed
per category
We build a significant-words pool for each category by selecting
the words with high frequency after removing stop words
For a test Web page, we do not have the category
information


We calculate the significance factor of each sentence according to
the significant-words pool of each category separately
The significance factor of the sentence is then averaged over all
categories and referred to as S_luhn
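Continuing the sketch above, the per-category averaging for a test page could look as follows (`category_pools`, a dict from category name to its significant-words pool, is a hypothetical structure):

```python
def s_luhn(sentence_words, category_pools, limit=4):
    """S_luhn: the Luhn significance factor of the sentence,
    averaged over the significant-words pools of all training
    categories (category name -> set of significant words)."""
    scores = [luhn_significance(sentence_words, pool, limit)
              for pool in category_pools.values()]
    return sum(scores) / len(scores)
```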
10/29
Web-Page Summarization

Latent Semantic Analysis (LSA)

LSA is applicable in summarization because of two
reasons:



LSA is capable of capturing and modeling the
interrelationships among terms by semantically clustering
terms and sentences
LSA can capture the salient and recurring word
combination patterns in a document that describe a
certain topic or concept
LSA is based on the singular value decomposition
A = U Σ V^T, where A: m×n, U: m×n, Σ: n×n, V: n×n
11/29
Web-Page Summarization

Latent Semantic Analysis (LSA)




Concepts can be represented by one of the singular
vectors, where the magnitude of the corresponding
singular value indicates the importance of this
pattern within the document
Any sentence containing this word combination pattern
will be projected along the corresponding singular vector
The sentence that best represents this pattern will have the
largest index value along this vector
We denote this index value as S_lsa and select the sentences
with the highest S_lsa to form the summary
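A NumPy sketch of this selection, assuming A is a term-by-sentence weight matrix; this follows the common Gong-and-Liu style of LSA summarization, which the slide's exact scoring may only approximate:

```python
import numpy as np

def s_lsa(A, k=1):
    """Score each sentence by its largest index value along the
    top-k right singular vectors of the m-by-n term-by-sentence
    matrix A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Row i of Vt holds the index value of every sentence along the
    # i-th singular vector (ordered by decreasing singular value).
    return np.abs(Vt[:k]).max(axis=0)  # one S_lsa score per sentence
```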
12/29
Web-Page Summarization

Content Body Identification by Page Layout Analysis


The structured character of Web pages makes Web-page
summarization different from pure-text summarization
In order to utilize the structure information of Web pages, we employ
a simplified version of the Function-Based Object Model (Chen et al.,
2001)



Basic Object (BO)
Composite Object (CO)
After detecting all BOs and COs in a Web page, we can identify the
category of each object according to some heuristic rules:





Information Object
Navigation Object
Interaction Object
Decoration Object
Special Function Object
13/29
Web-Page Summarization

Content Body Identification by Page Layout Analysis


In order to make use of these objects, we define the Content Body
(CB) of a Web page, which consists of the main objects related to the
topic of that page
The algorithm for detecting the CB is as follows:






Consider each selected object as a single document and build a
TF*IDF index for the object
Calculate the similarity between any two objects using the cosine
similarity and add a link between them if their similarity is
greater than a threshold (= 0.1)
In the resulting graph, the core object is defined as the object
with the most edges
Extract the CB
If a sentence is included in the CB, S_cb = 1; otherwise, S_cb = 0
All sentences with S_cb = 1 form the summary
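A sketch of these steps with scikit-learn; the final expansion rule (core object plus its direct neighbors) is an assumption, since the slide does not spell out how the CB is read off the graph:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def content_body(object_texts, threshold=0.1):
    """Return indices of the objects forming the Content Body (CB).
    object_texts: the plain text of each detected BO/CO."""
    # Treat each object as a document and build its TF*IDF index.
    tfidf = TfidfVectorizer().fit_transform(object_texts)
    # Link any two objects whose cosine similarity exceeds the threshold.
    sim = cosine_similarity(tfidf)
    np.fill_diagonal(sim, 0.0)
    adj = sim > threshold
    # The core object is the one with the most edges.
    core = int(np.argmax(adj.sum(axis=1)))
    # Assumed extraction rule: the core object plus its neighbors.
    return sorted({core, *np.flatnonzero(adj[core]).tolist()})
```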
14/29
Web-Page Summarization

Supervised Summarization

There are eight features








f_i1 measures the position of sentence S_i within its paragraph
f_i2 measures the length of sentence S_i, i.e., the number of
words in S_i
f_i3 = Σ_w TF_w · SF_w, summed over the words w in S_i
f_i4 is the similarity between S_i and the title
f_i5 is the cosine similarity between S_i and all text in the page
f_i6 is the cosine similarity between S_i and the meta-data of the page
f_i7 is the number of occurrences of words from S_i in a special
word set
f_i8 is the average font size of the words in S_i
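To make the feature definitions concrete, here is a hypothetical sketch of a few of them (f_i1, f_i2, f_i4, f_i8); the tokenized inputs are assumed given:

```python
import numpy as np

def cosine(a_words, b_words):
    """Cosine similarity between two bags of words."""
    vocab = sorted(set(a_words) | set(b_words))
    a = np.array([a_words.count(w) for w in vocab], dtype=float)
    b = np.array([b_words.count(w) for w in vocab], dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def some_features(si_words, title_words, position_in_paragraph, font_sizes):
    """Computes f_i1, f_i2, f_i4, and f_i8 for sentence S_i; the
    remaining features follow the same pattern."""
    f1 = position_in_paragraph          # f_i1: position of S_i in its paragraph
    f2 = len(si_words)                  # f_i2: number of words in S_i
    f4 = cosine(si_words, title_words)  # f_i4: similarity of S_i to the title
    f8 = float(np.mean(font_sizes))     # f_i8: average font size of words in S_i
    return [f1, f2, f4, f8]
```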
15/29
Web-Page Summarization

Supervised Summarization

We apply a Naïve Bayes classifier to train the
summarizer
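A minimal sketch of this step, assuming each sentence has been turned into the 8-dimensional feature vector above and labeled against the editor description (see the experiments section); using the posterior probability as S_sup is an assumption:

```python
from sklearn.naive_bayes import GaussianNB

def train_summarizer(features, labels):
    """features: (n_sentences, 8) array of (f_i1 ... f_i8);
    labels: 1 for positive (summary-worthy) sentences, else 0."""
    return GaussianNB().fit(features, labels)

def s_sup(model, features):
    """S_sup: posterior probability that a sentence belongs to the summary."""
    return model.predict_proba(features)[:, 1]
```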
16/29
Web-Page Summarization

An Ensemble of Summarizers

By combining the four methods presented in the
previous sections, we obtain a hybrid Web-page summarizer
whose sentence score is S = S_luhn + S_lsa + S_cb + S_sup

The sentences with the highest S are chosen for
the summary
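In code, the combination is a weighted sum over the per-sentence scores; equal weights give the formula above, and the w_i correspond to the weights varied in the weighting schemata of the experiments:

```python
import numpy as np

def hybrid_score(s_luhn, s_lsa, s_cb, s_sup, w=(1.0, 1.0, 1.0, 1.0)):
    """Combined score S for each sentence of a page."""
    return (w[0] * np.asarray(s_luhn) + w[1] * np.asarray(s_lsa)
            + w[2] * np.asarray(s_cb) + w[3] * np.asarray(s_sup))
```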
17/29
Experiments

Data Set




We use about 2 million Web pages crawled from the
LookSmart Web directory
Due to the limitation of network bandwidth, we only
downloaded about 500 thousand descriptions of Web
pages that were manually created by human editors
We randomly sampled 30% of the pages with descriptions
for our experiments
The extracted subset includes 153,019 pages, which are
distributed among 64 categories
18/29
Experiments

Data Set

In order to reduce the uncertainty of the data split, a 10-fold
cross-validation procedure is applied
19/29
Experiments

Classifiers

Naïve Bayesian Classifier (NB)

Support Vector Machine (SVM)
20/29
Experiments

Experiment Measure


We employ the standard measures to evaluate the
performance of Web classification: precision, recall, and
the F1-measure
To evaluate the average performance across multiple
categories, there are two conventional methods: micro-average
and macro-average



Micro-average gives equal weight to every document
Macro-average gives equal weight to every category, regardless
of its frequency
Only the micro-average is used to evaluate the
performance of the classifiers
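A toy example of the two averages (illustrative data only): micro-averaging pools all documents, so frequent categories dominate, while macro-averaging weights every category equally:

```python
from sklearn.metrics import f1_score

y_true = ["news"] * 8 + ["arts", "arts"]
y_pred = ["news"] * 8 + ["news", "arts"]
print(f1_score(y_true, y_pred, average="micro"))  # 0.90, document-weighted
print(f1_score(y_true, y_pred, average="macro"))  # ~0.80, category-weighted
```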
21/29
Experiments

Experimental Results and Analysis

Baseline (Full text)


Each Web page is represented as a bag of words, in which the
weight of each word is its term frequency
Human’s summary (Description, Title, meta-data)

We extract the description of each Web page from the
LookSmart website and treat it as the “ideal” summary
22/29
Experiments

Experimental Results and Analysis

Unsupervised summarization


Supervised summarization algorithm


Content Body, adapted Luhn’s algorithm (compression rate: 20%),
and LSA (compression rate: 30%)
We label a sentence as positive if its similarity with the
description is greater than a threshold (= 0.3), and as
negative otherwise
Hybrid summarization algorithm

Uses all of the unsupervised and supervised summarization algorithms
23/29
Experiments

Experimental Results and Analysis
24/29
Experiments

Experimental Results and Analysis

Performance at Different Compression Rates
25/29
Experiments

Experimental Results and Analysis

Effect of different weighting schemata


Schema 1: we assigned the weight of each summarization
method in proportion to the performance of that method
(micro-F1)
Schemata 2–5: in schema i+1, we increased W_i (i = 1, 2, 3, 4)
to 2, respectively, and kept the other weights at one
26/29
Experiments

Case studies



We randomly selected 100 Web pages that are
correctly labeled by all our summarization-based
approaches but wrongly labeled by the baseline
(denoted as set A), and 500 pages randomly from
the testing pages (denoted as set B)
The average size of the pages in A (31.2 KB) is
much larger than that of the pages in B (10.9 KB)
The summarization techniques can help extract
the useful information from such large pages
27/29
Experiments

Case studies
28/29
Conclusions and Future Work



Several Web-page summarization algorithms are
proposed for extracting the most relevant features from
Web pages, improving the accuracy of Web
classification
Experimental results show that automatic summarization
achieves an improvement (about 12.9%) close to
the ideal case obtained by using the
summaries created by human editors
We will investigate methods for multi-document
summarization of hyperlinked Web pages to boost
the accuracy of Web classification
29/29