Web Page Classification

Topic Distillation and Web Page Categorization
Prasanna K. Desikan
(05/29/2002)
Motivation
- The web is a huge repository of information.
- Categorizing web documents facilitates the search and retrieval of pages.
- Topic distillation is the process of finding authoritative Web pages and comprehensive ‘hubs’ which reciprocally endorse each other and are relevant to a given query.

Approaches for Categorization
- Text-based categorization
- Structure- or link-based categorization
- Combination of link and text information

Web Page Categorization Algorithms
- Manual categorization by domain-specific experts.
  - Domain experts analyze the contents of each web page and classify it based on its textual content, as done by Yahoo.
- Content-based categorization, based solely on document content or on a combination of document content and META tags.
  - To classify a document, all stop words are removed and the remaining keywords/phrases are represented as a feature vector.
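The content-based step above can be sketched as follows. This is a minimal illustration, not the method of any cited paper: the stop-word list, the vocabulary, and the function name feature_vector are all assumptions made for the example, and plain term counts stand in for whatever weighting a real system would use.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration; real systems use a longer one.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}

def feature_vector(text, vocabulary):
    """Return term counts over `vocabulary` after removing stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return [counts[term] for term in vocabulary]

# Assumed vocabulary; in practice it would be built from a training corpus.
vocabulary = ["cheese", "ski", "web"]
vector = feature_vector("The web is a repository of cheese and cheese shops", vocabulary)
print(vector)
```

Each position in the resulting list corresponds to one vocabulary term, so documents become comparable vectors that a classifier can consume.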

Web Page Categorization Algorithms
- Link and content analysis.
- Based on the fact that a web page that refers to a document must contain enough hints about its content to induce someone to read it. Such hints can be used to classify the document being referred to.
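As a rough sketch of the idea above, the text that referring pages place around their links can be pooled and matched against per-category keywords. The keyword lists, category names, and the function classify_from_hints are hypothetical and exist only for this illustration; a simple keyword vote stands in for a trained classifier.

```python
from collections import Counter

# Hypothetical keyword lists per category, for illustration only.
CATEGORY_KEYWORDS = {
    "food": {"cheese", "wine", "recipe"},
    "sports": {"ski", "snow", "race"},
}

def classify_from_hints(hints):
    """Pick the category whose keywords occur most often in the referring text."""
    scores = Counter()
    for hint in hints:
        for word in hint.lower().split():
            for category, keywords in CATEGORY_KEYWORDS.items():
                if word in keywords:
                    scores[category] += 1
    # No keyword matched at all: leave the document unclassified.
    return scores.most_common(1)[0][0] if scores else None

# Assumed input: text found near links that point at the target document.
hints = ["French cheese shop", "buy cheese online", "ski holidays"]
label = classify_from_hints(hints)
print(label)
```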

Topic Distillation in Hyperlinked Environment [1]
- Aim: To find quality documents related to a query topic.
- Problems encountered with the HITS approach:
  - Mutually reinforcing relationships between hosts.
  - Automatically generated links.
  - Non-relevant nodes (documents not relevant to the query topic).

Topic Distillation in Hyperlinked Environment [1]
Let the Web be represented as a graph in which each node is a web page and each edge is a link.
Approaches:
- If there are k edges (links) from documents on a first host to a single document on a second host, give each edge an authority weight of 1/k. This limits the influence that any single host can have on a document.
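The 1/k rule above can be sketched in a few lines. This is an illustration under assumptions: the function name authority_weights and the example URLs are made up, the edge list is assumed to contain no duplicates and no links within a single host, and the host is taken to be the network location of the source URL.

```python
from collections import defaultdict
from urllib.parse import urlparse

def authority_weights(edges):
    """Map each (source_url, target_url) edge to 1/k, where k is the number of
    edges from the source's host to that same target document."""
    per_host_target = defaultdict(int)
    for src, dst in edges:
        per_host_target[(urlparse(src).netloc, dst)] += 1
    return {
        (src, dst): 1.0 / per_host_target[(urlparse(src).netloc, dst)]
        for src, dst in edges
    }

# Two pages on a.example and one page on c.example link to the same document.
edges = [
    ("http://a.example/1", "http://b.example/x"),
    ("http://a.example/2", "http://b.example/x"),
    ("http://c.example/1", "http://b.example/x"),
]
weights = authority_weights(edges)
for edge, weight in weights.items():
    print(edge, weight)
```

With these weights the two links from a.example together count as much as the single link from c.example.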

Topic Distillation in Hyperlinked Environment [1]
Approaches (contd…):
- Compute the relevance weight for each node.
- Eliminate non-relevant nodes from the graph by setting a threshold on the relevance weight.
- Regulate the influence of a node based on its relevance.
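A minimal sketch of the two uses of the relevance weight listed above. The relevance values are taken as given inputs because the slides do not say how they are computed; the function names, the threshold, and the single authority update are assumptions for illustration rather than the algorithm of [1].

```python
def prune_by_relevance(edges, relevance, threshold):
    """Drop every edge that touches a node whose relevance weight is below threshold."""
    keep = {node for node, weight in relevance.items() if weight >= threshold}
    return [(p, q) for p, q in edges if p in keep and q in keep]

def authority_scores(edges, hub, relevance):
    """One authority update in which each hub's contribution is scaled by its relevance."""
    auth = {}
    for p, q in edges:
        auth[q] = auth.get(q, 0.0) + hub[p] * relevance[p]
    return auth

# Assumed relevance weights and a toy graph.
relevance = {"a": 0.9, "b": 0.1, "c": 0.5}
edges = [("a", "c"), ("b", "c"), ("a", "b")]

pruned = prune_by_relevance(edges, relevance, threshold=0.3)
auth = authority_scores(pruned, hub={"a": 1.0, "c": 1.0}, relevance=relevance)
print(pruned, auth)
```

Node b falls below the threshold, so both edges that touch it disappear before any scores are computed.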

Topic Distillation in Hyperlinked Environment [1]
Approaches (contd…):
- Partial content analysis.
- Content pruning by analyzing only a part of the graph, i.e. the nodes that are most influential in the outcome.

Automatic Resource Compilation [2]
- Goal: Automatically compile a resource list on any topic that is broad and well-represented on the Web.
- Approach:
  - A search-and-growth phase.
  - A weighting phase.
    - w(p,q) = 1 + n(t).
    - w(p,q): measure of the authority on the topic invested by page ‘p’ in page ‘q’.
    - n(t): number of matches between terms in the topic description and the text in the anchor window of width ‘B’.
  - An iteration-and-reporting phase.
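The weighting formula w(p,q) = 1 + n(t) can be sketched as below. Several details are assumptions for the example: the window is measured in characters on each side of the anchor, every occurrence of a topic term counts as one match, and the function name arc_weight and the sample text are invented.

```python
import re

def arc_weight(page_text, anchor_start, anchor_end, topic, window=50):
    """Return w(p,q) = 1 + n(t), counting topic terms inside the anchor window."""
    lo = max(0, anchor_start - window)
    hi = min(len(page_text), anchor_end + window)
    window_tokens = re.findall(r"[a-z0-9]+", page_text[lo:hi].lower())
    topic_terms = set(topic.lower().split())
    n_t = sum(1 for token in window_tokens if token in topic_terms)
    return 1 + n_t

# Text of page p around its link to page q (the anchor is "Fromages.com").
text = "Best French cheese shops: Fromages.com sells cheese online"
start = text.index("Fromages.com")
end = start + len("Fromages.com")

weight = arc_weight(text, start, end, topic="french cheese")
print(weight)
```

A link with no topic terms near it still receives weight 1, so plain link structure continues to count.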

Relaxation Labeling Technique [3]
- First classify the unclassified documents from the neighborhood (using a terms-only classifier, i.e. using the text from the neighboring documents).
- Iterate until convergence:
  - Recompute the class for each document using both the local text and the class information of the neighbors.
- The relaxation is guaranteed to converge to a consistent state.
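The loop above can be illustrated with a deliberately simplified sketch. It is not the probabilistic formulation of [3]: text evidence and neighbor labels are combined by plain addition, link_weight is an invented parameter, and the iteration cap is a safety measure because this simplified version carries no convergence guarantee of its own.

```python
def relax_labels(text_scores, neighbors, max_iters=20, link_weight=0.5):
    """text_scores: {doc: {cls: score}}; neighbors: {doc: [doc, ...]}."""
    # Step 1: initial labels from the text-only scores.
    labels = {doc: max(scores, key=scores.get) for doc, scores in text_scores.items()}
    for _ in range(max_iters):
        new_labels = {}
        for doc, scores in text_scores.items():
            combined = dict(scores)
            # Step 2: add a vote for the current class of every neighbor.
            for other in neighbors.get(doc, []):
                cls = labels[other]
                combined[cls] = combined.get(cls, 0.0) + link_weight
            new_labels[doc] = max(combined, key=combined.get)
        if new_labels == labels:  # no label changed: stop iterating
            break
        labels = new_labels
    return labels

# Assumed text-only scores; document c is weakly "sports" but links to food pages.
text_scores = {
    "a": {"food": 0.9, "sports": 0.1},
    "b": {"food": 0.8, "sports": 0.2},
    "c": {"food": 0.4, "sports": 0.6},
}
neighbors = {"a": ["c"], "b": ["c"], "c": ["a", "b"]}
labels = relax_labels(text_scores, neighbors)
print(labels)
```

In this toy graph the text alone labels c as sports, but the labels of its two neighbors pull it to food after one round.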

Probabilistic Relational Model [4]
- Web pages and links are modeled as entities and relationships respectively, and each of them is represented as a class.
- A Bayesian network is created using the attributes from the entity-relationship model in order to model uncertainty and make inferences.

Probabilistic Relational Model
- By belief propagation, an approximate inference approach, we can use our prior knowledge to infer the unobserved case.
  - Given new data with some unobserved variables, first assign the most likely values to them.
  - Based on the estimates of those marginal probabilities, predict the most likely classification.

Probabilistic Relational Model
- This approach proved effective when applied to the hypertext classification problem. By utilizing information from both the content and the link structure, it provides more accurate classification and the ability to do probabilistic reasoning.

Integrating the DOM With Hyperlinks for Enhanced Topic Distillation [6]
- A uniform fine-grained model.
- Web pages are represented by their tag trees (also called their Document Object Models (DOMs)).
- DOM trees are interconnected by ordinary hyperlinks.
- Disaggregates mixed hubs.

A new fine-grained model [7]
<html>…<body>…
<table …>
<tr><td>
<table …>
<tr><td><a href="http://art.qaz.com">art</a></td></tr>
<tr><td><a href="http://ski.qaz.com">ski</a></td></tr>…
</table>
</td></tr>
<tr><td>
<ul>
<li><a href="http://www.fromages.com">Fromages.com</a> French cheese…</li>
<li><a href="http://www.teddingtoncheese.co.uk">Teddington…</a> Buy online…</li>
…
</ul>…
</td></tr>
</table>…
</body></html>
[Figure: the Document Object Model (DOM) tree of the page above (html, head, body, table, tr, td, ul, li, a), with its links leading to art.qaz.com, ski.qaz.com, www.fromages.com and www.teddingtoncheese.co.uk. A frontier of differentiation separates the relevant subtree from the irrelevant subtree.]

Integrating the DOM With Hyperlinks for Enhanced Topic Distillation
Figure 6: The fine-grained model of Web linkage which unifies hyperlinks and DOM structure

Integrating the DOM With Hyperlinks for Enhanced Topic Distillation
Benefits
- Reduces topic drift.
- Identifies and extracts regions (DOM subtrees) relevant to the query out of the following:
  - A broader hub.
  - A hub with additional less-relevant contents and links.

Web Page Classification Based on Document Structure
- Web pages that belong to a particular category have some similarity in their structure:
  - Information pages.
  - Research pages.
  - Personal home pages.
- The general structural information of any page can be deduced from the placement of links, text and images, including equations and graphs.

Web Page Categories Based on Structural Similarities
- Information pages
  - A logo at the top, followed by a navigation bar linking the page to other important pages.
  - The ratio of link text (amount of text with links) to normal text tends to be relatively high.
- Research pages
  - Contain large amounts of text, as well as equations and graphs in the form of images.
  - The number of distinctive gray levels/color shades in the images also provides a cue.

Web Page Categories Based on Structural Similarities
- Personal pages
  - The name and address of the person appear prominently at the top of the page.
  - A photograph of the person concerned.
  - Towards the bottom of the page, the person provides links to their publications, if any, and other useful references or links to favorite destinations on the web.

Feature Extraction
(a) Textual Information
- The number and placement of links in a page provide valuable information about the broad category the page belongs to.
- The ratio of the number of characters in links to the total number of characters in the page.
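The link-character ratio above can be computed with Python's standard html.parser module. This is a sketch under assumptions: the class and function names are invented, whitespace at the edges of each text run is ignored, and text inside script or style elements is not filtered out.

```python
from html.parser import HTMLParser

class LinkTextCounter(HTMLParser):
    """Counts characters of visible text, and how many of them sit inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # > 0 while inside an <a> element
        self.link_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.depth > 0:
            self.link_chars += n

def link_text_ratio(html):
    """Ratio of characters in links to all text characters (0.0 for empty pages)."""
    parser = LinkTextCounter()
    parser.feed(html)
    parser.close()
    return parser.link_chars / parser.total_chars if parser.total_chars else 0.0

ratio = link_text_ratio('<p>Read <a href="x.html">more</a> here</p>')
print(round(ratio, 3))
```

A navigation-heavy information page would push this ratio towards 1, while a text-heavy research page would keep it close to 0.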

Feature Extraction
(b) Image Information
- Information pages have more colors than personal home pages, which in turn have more colors than research pages.
- The histogram of synthetic images generally tends to concentrate in a few bands of color shades. In contrast, the histogram of natural images is spread over a larger area.
- Information pages usually contain many natural images, while research pages contain a number of synthetic images.
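The histogram cue above can be illustrated on already-decoded pixel values. The two measures below, the function names, and the choice of the four most frequent levels are assumptions for this sketch; the slides do not specify how the histogram is summarized.

```python
from collections import Counter

def distinct_levels(pixels):
    """Number of different gray levels (or color shades) present in the image."""
    return len(set(pixels))

def histogram_concentration(pixels, top_k=4):
    """Fraction of pixels that fall in the top_k most frequent levels."""
    counts = Counter(pixels)
    top = sum(count for _, count in counts.most_common(top_k))
    return top / len(pixels)

# Toy images as flat lists of gray levels (assumed already decoded).
synthetic = [0] * 90 + [255] * 10   # e.g. a black-and-white plot
natural = list(range(100))          # e.g. a photograph with many shades

print(distinct_levels(synthetic), histogram_concentration(synthetic))
print(distinct_levels(natural), histogram_concentration(natural))
```

A concentration near 1 suggests a synthetic image such as a graph or equation, while a low value suggests a natural image.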

Feature Extraction
(c) Other Information
- Approaches using classification based on video and other multimedia content are presently not implemented.

Results

Details                                               Results
No. of pages on which we tested our implementation    ~4000
Pages categorized                                     ~3700
Pages categorized correctly                           ~3250
% categorized correctly                               87.83%

Web Page Categories Based on Structural Similarities
Conclusions and future work for the approach:
- This approach, augmented with traditional text-based approaches, could be used for effective categorization of web pages.
- Improvement in feature selection.
- Automate the training process.
- The approach has to be tested on more data sets.

References
[1] K. Bharat and M. Henzinger. Improved Algorithms for Topic Distillation in a Hyperlinked Environment. In 21st International ACM SIGIR Conference on Research and Development in Information Retrieval.
[2] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan. Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text. Proceedings of the 7th World-Wide Web Conference, 1998.
[3] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced Hypertext Categorization Using Hyperlinks. Proceedings of ACM SIGMOD 1998.
[4] L. Getoor, E. Segal, B. Taskar, and D. Koller. Probabilistic Models of Text and Link Structure for Hypertext Classification. IJCAI Workshop on "Text Learning: Beyond Supervision", Seattle, WA, August 2001.
[5] A. P. Asirvatham, K. K. Ravi, and C. V. Jawahar. Web Page Classification Based on Document Structure.
[6] S. Chakrabarti. Integrating the Document Object Model with Hyperlinks for Enhanced Topic Distillation and Information Extraction. 10th International World Wide Web Conference, Hong Kong, May 2001.
[7] S. Chakrabarti, M. M. Joshi, and V. B. Tawde. Enhanced Topic Distillation Using Text, Markup Tags, and Hyperlinks. SIGIR 2001, New Orleans, LA, Sep 2001.