Final project: Web page classification

By: Xiaodong Wang
Yanhua Wang
Haitang Wang
University of Cincinnati
Content





Problem formulation
Algorithms
Implementation
Results
Discussion and future work
Problem



If the World Wide Web is clustered into different
subsets that are labeled accordingly, search engine
users can restrict their keyword searches to
these specific subsets.
Clustering of web pages can also be used to
post-process search results.
Efficient clustering of web pages is important


Clustering accuracy: feature selection and web
exploitation
Fast algorithms
Web clustering



Clustering is done based on similarity
between web pages
Clustering can be done in supervised or
unsupervised mode
In our project, we focus on
unsupervised classification (no sample
category labels provided) and compare the
efficiency of algorithms and features for
clustering web pages.
Project overview

In this project, a platform for unsupervised clustering is implemented:





Vector Model is used

TFIDF model (term frequency-inverse document frequency)

Text, meta information, links and linked content can be configured as
features
Similarity measure:

Cosine similarity

Euclidean similarity
Clustering algorithm

K-means

HAC (Hierarchical Agglomerative Clustering)
For a given link list, clustering accuracy and algorithm efficiency are
compared.
The platform is implemented in Java and can be extended easily.
User interface
Major functionalities

Web page preprocessing





Downloading
Parsing: link, meta, and text extraction
Filtering of uninformative words: stop-word removal
and stemming
Put terms into a pool
Clustering
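
The filtering step can be sketched as follows. This is a minimal illustration, not the project's actual parser: the class name TermFilter, the tiny stop list, and the regex-based tag stripping are all assumptions made for the example, and stemming is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: turns raw HTML into a list of filtered terms.
final class TermFilter {
    // Tiny illustrative stop list (an assumption); a real list is much longer.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "the", "of", "to", "in", "is", "for", "on"));

    /** Crudely strips tags, lowercases, splits on non-letters, drops stop words. */
    static List<String> extractTerms(String html) {
        String text = html.replaceAll("<[^>]*>", " ").toLowerCase(Locale.ROOT);
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^a-z]+")) {
            // Skip empty and single-letter tokens as well as stop words.
            if (token.length() > 1 && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }
}
```

The returned list keeps duplicates, so it can feed a term-frequency count directly; a stemmer would be applied to each token before it is added to the pool.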
Feature selection

First, a naïve approach from ranking of query results
is used:
All the unique terms (after text extraction and filtering)
form the feature terms. That is, if there are 1000
terms in total, the vector dimension will be 1000.
This approach works for small sets of links.

Then we use all the unique terms appearing as
meta information in web pages as feature terms.
The dimension can be reduced dramatically.
For 30 links, the dimension is 2384 for the naïve method, but is
reduced to 408 when using meta information.
Hyperlink exploitation


Links in a web page can also be used as features
The content or meta information of linked web pages can
be treated as local content.
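
One way to collect the links of a page is sketched below. The class name LinkExtractor and the regular expression are assumptions for illustration; a regex only approximates HTML parsing and will miss unquoted or script-generated links.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls href targets out of anchor tags.
final class LinkExtractor {
    // Matches href="..." or href='...' inside an <a ...> tag.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    /** Returns the href values in document order, without resolving relative URLs. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }
}
```

Each extracted link can then be added to the term pool as a feature of its own, or downloaded so that its text or meta information is merged into the linking page's terms.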
TFIDF based vector space model

TFIDF(i,j)= TF(i,j)*IDF(i)



TF(i,j): the number of times word i occurs in
document j
DF(i): the number of documents in which word i
occurs at least once
IDF(i) can be calculated from the document
frequency:
IDF(i) = log(|D| / DF(i)), where |D| is the total number of documents
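
The weighting can be sketched in a few lines. This is an illustration under stated assumptions: the class name TfIdf is hypothetical, IDF(i) is taken to be the natural log of |D| / DF(i) with |D| the number of documents, and vectors are stored sparsely as term-to-weight maps.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper: computes TFIDF(i,j) = TF(i,j) * IDF(i) for every document.
final class TfIdf {
    /** Each input document is a list of filtered terms; returns one sparse vector per document. */
    static List<Map<String, Double>> weigh(List<List<String>> docs) {
        // DF(i): number of documents containing term i at least once.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> doc : docs) {
            // TF(i,j): raw count of term i in document j.
            Map<String, Integer> tf = new HashMap<>();
            for (String term : doc) {
                tf.merge(term, 1, Integer::sum);
            }
            Map<String, Double> vec = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                // Assumed form: IDF(i) = ln(|D| / DF(i)).
                double idf = Math.log((double) docs.size() / df.get(e.getKey()));
                vec.put(e.getKey(), e.getValue() * idf);
            }
            vectors.add(vec);
        }
        return vectors;
    }
}
```

Under this form of IDF, a term that occurs in every document gets weight 0, so it contributes nothing to the similarity between pages.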
Similarity measure


Euclidean similarity: given the vector space
defined by all terms, compute the Euclidean
distance between each pair of documents, and then
take the reciprocal.
Cosine similarity = numerator / denominator


Numerator: inner product of the two document vectors
Denominator: product of the Euclidean lengths of the two document vectors
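
Both measures can be sketched over dense vectors of equal length. The class name Similarity is hypothetical, and two conventions are assumptions of this example: cosine similarity returns 0 when either vector is all zeros, and the reciprocal Euclidean distance is left as positive infinity for identical vectors.

```java
// Hypothetical helper: similarity measures over dense term-weight vectors.
final class Similarity {
    /** Inner product divided by the product of the two Euclidean lengths. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // assumed convention for an empty document
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Reciprocal of the Euclidean distance; infinite for identical vectors. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return 1.0 / Math.sqrt(sum);
    }
}
```

Cosine similarity ignores vector length, so a long page and a short page with the same term proportions count as identical; the Euclidean measure does not have this property.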
Cluster algorithms: Hierarchical
Agglomerative Clustering (HAC)

It starts with each document as its own cluster and
successively merges the most similar clusters into groups within
which inter-document similarity is high
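
A naïve version of this procedure is sketched below. The slides do not say which linkage or stopping rule was used, so average linkage and stopping at k clusters are assumptions of this example, as is the class name Hac. The input is a precomputed symmetric similarity matrix.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: bottom-up clustering over a precomputed similarity matrix.
final class Hac {
    /** Merges the two most similar clusters until only k remain; returns document indices per cluster. */
    static List<List<Integer>> cluster(double[][] sim, int k) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be at least 1");
        }
        // Start with every document in its own cluster.
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < sim.length; i++) {
            List<Integer> single = new ArrayList<>();
            single.add(i);
            clusters.add(single);
        }
        while (clusters.size() > k) {
            int bestA = 0, bestB = 1;
            double best = Double.NEGATIVE_INFINITY;
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double s = averageLink(sim, clusters.get(a), clusters.get(b));
                    if (s > best) {
                        best = s;
                        bestA = a;
                        bestB = b;
                    }
                }
            }
            // bestA < bestB, so removing bestB leaves bestA's position unchanged.
            clusters.get(bestA).addAll(clusters.remove(bestB));
        }
        return clusters;
    }

    /** Average similarity over all cross-cluster document pairs (assumed linkage). */
    static double averageLink(double[][] sim, List<Integer> x, List<Integer> y) {
        double total = 0;
        for (int i : x) {
            for (int j : y) {
                total += sim[i][j];
            }
        }
        return total / (x.size() * y.size());
    }
}
```

This sketch recomputes every cluster-pair similarity on each merge, which keeps it short but makes it slower than an implementation that updates a cached matrix.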
Cluster algorithms: K-means

K-means clustering: a nonhierarchical method



The final required number of clusters, k, is chosen in advance
Each component in the population is examined
and assigned to one of the clusters according to
the minimum distance to the cluster centroid
The centroid's position is recalculated every time a
component is added to the cluster, and this
continues until all the components are
grouped into the final required number of
clusters
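
The single-pass variant described above can be sketched as follows. Seeding the k clusters with the first k points, using squared Euclidean distance, and the class name SequentialKMeans are assumptions of this example; the standard batch form of K-means would instead repeat the assignment step until the centroids stop moving.

```java
// Hypothetical helper: one-pass K-means with a centroid update after every assignment.
final class SequentialKMeans {
    /** Returns, for each point, the index of the cluster it was assigned to. */
    static int[] cluster(double[][] points, int k) {
        if (k < 1 || k > points.length) {
            throw new IllegalArgumentException("need 1 <= k <= number of points");
        }
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        int[] sizes = new int[k];
        int[] assignment = new int[points.length];
        // Assumption: the first k points seed the k clusters.
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone();
            sizes[c] = 1;
            assignment[c] = c;
        }
        for (int p = k; p < points.length; p++) {
            // Find the centroid at minimum (squared Euclidean) distance.
            int nearest = 0;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int c = 0; c < k; c++) {
                double d = 0;
                for (int i = 0; i < dim; i++) {
                    double diff = points[p][i] - centroids[c][i];
                    d += diff * diff;
                }
                if (d < bestDist) {
                    bestDist = d;
                    nearest = c;
                }
            }
            assignment[p] = nearest;
            sizes[nearest]++;
            // Running mean: recalculate the centroid with the new member included.
            for (int i = 0; i < dim; i++) {
                centroids[nearest][i] += (points[p][i] - centroids[nearest][i]) / sizes[nearest];
            }
        }
        return assignment;
    }
}
```

Because each point is compared only against the k centroids, one pass costs on the order of kn distance computations; the result depends on the order in which the points are visited.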
Complexity Analysis



HAC methods need to compute the similarity of
all pairs of n individual instances, which is
O(n^2).
In K-means, for each round, n documents
have to be compared against k centroids,
which takes O(kn) time; this is more efficient than
the O(n^2) of HAC when k is much smaller than n.
In our experiments, however, we found that
the clustering results of HAC make more sense
than those of K-means.
Conclusion

Unique features of web pages should be exploited
Link, meta information
HAC is better than K-means in clustering accuracy.
Correct and robust parsing of web pages is
important for web page clustering
Our parser doesn’t work well on all web pages tested.
The overall performance of our implementation is
not satisfactory
Dimension is still large
Space requirement
Parsing accuracy; some pages don’t have meta
information