Final project: Web Page Classification
By: Xiaodong Wang, Yanhua Wang, Haitang Wang
University of Cincinnati

Content
- Problem formulation
- Algorithms
- Implementation
- Results
- Discussion and future work

Problem
- The World Wide Web can be clustered into different subsets and labeled accordingly; search engine users can then restrict their keyword search to these specific subsets.
- Clustering of web pages can also be used to post-process search results.
- Efficient clustering of web pages is important:
  - Clustering accuracy: feature selection and web exploitation
  - Fast algorithm

Web clustering
- Clustering is done based on similarity between web pages.
- Clustering can be done in supervised or unsupervised mode.
- In our project, we focus on unsupervised classification (no sample category labels provided) and compare the efficiency of algorithms and features for clustering web pages.

Project overview
In this project, a platform for unsupervised clustering is implemented:
- A vector model is used: the TFIDF model (term frequency-inverse document frequency)
- Text, meta information, links, and linked content can be configured as features
- Similarity measure:
  - Cosine similarity
  - Euclidean similarity
- Clustering algorithm:
  - K-means
  - HAC (Hierarchical Agglomerative Clustering)
- For a given link list, clustering accuracy and algorithm efficiency are compared.
- It is implemented in Java and can be extended easily.

User interface

Major functionalities
- Web page preprocessing
  - Downloading
  - Parsing: link, meta, and text extraction
  - Filtering of nonsense words: stop word removal and stemming
  - Putting terms into a pool
- Clustering

Feature selection
- First, a naïve approach taken from ranking of query results is used: all the unique terms (after text extraction and filtering) form the feature terms. That is, if there are 1000 terms in total, the vector dimension is 1000. This approach works for small sets of links.
- Then we use all the unique terms appearing as meta information in the web pages as feature terms.
- Using meta terms reduces the dimension dramatically: for 30 links, the dimension is 2384 with the naïve method but only 408 with meta terms.

Hyperlink exploitation
- Links in a web page can also be features.
- The content or meta information of linked web pages can be treated as local content.

TFIDF based vector space model
- TFIDF(i,j) = TF(i,j) * IDF(i)
- TF(i,j): the number of times word i occurs in document j
- DF(i): the number of documents in which word i occurs at least once
- IDF(i) can be calculated from the document frequency, where |D| is the total number of documents:
  $IDF(i) = \log\left(1 + \frac{|D|}{DF(i)}\right)$

Similarity measure
- Euclidean similarity: given the vector space defined by all terms, compute the Euclidean distance between each pair of documents and take its reciprocal.
- Cosine similarity = numerator / denominator
  - Numerator: inner product of the two document vectors
  - Denominator: product of the Euclidean lengths of the two document vectors

Cluster algorithms: Hierarchical Agglomerative Clustering (HAC)
- It starts with all the documents, each as its own cluster, and successively combines them into groups within which inter-document similarity is high.

Cluster algorithms: K-means
- K-means clustering is a nonhierarchical method.
- The final required number of clusters is chosen in advance.
- Each component in the population is examined and assigned to one of the clusters according to the minimum distance.
- The centroid's position is recalculated every time a component is added to the cluster; this continues until all the components are grouped into the final required number of clusters.

Complexity analysis
- HAC methods need to compute the similarity of all pairs of n individual instances, which is O(n^2).
- In K-means, each round compares n documents against k centroids, which takes O(kn) time and is more efficient than the O(n^2) of HAC.
- In our experiments, however, we found that the clustering results of HAC make more sense than those of K-means.

Conclusion
- Unique features of web pages should be exploited.
- HAC is better than K-means in clustering accuracy.
- Correct and robust parsing of web pages is important for web page clustering:
  - Link and meta information
- Our parser doesn't work well on all web pages tested.
- The overall performance of our implementation is not satisfactory:
  - The dimension is still large
  - Space requirement
  - Parsing accuracy; some pages don't have meta information
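
Illustrative sketch
As a rough illustration of the TFIDF vector model and the two similarity measures described in these slides, here is a minimal Java sketch. It is not the project's actual code: the class name TfIdfSketch, the method names, the toy documents, and the smoothed IDF form log(1 + |D| / DF(i)) are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; not the project's actual implementation.
class TfIdfSketch {

    // Builds one TFIDF vector per document over the sorted vocabulary of all terms.
    static double[][] tfidf(List<List<String>> docs) {
        // DF(i): number of documents in which term i occurs at least once.
        Map<String, Integer> df = new TreeMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<String>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        List<String> vocab = new ArrayList<>(df.keySet());
        double[][] vectors = new double[docs.size()][vocab.size()];
        for (int j = 0; j < docs.size(); j++) {
            for (int i = 0; i < vocab.size(); i++) {
                String term = vocab.get(i);
                // TF(i,j): number of times term i occurs in document j.
                int tf = Collections.frequency(docs.get(j), term);
                // Assumed IDF form: log(1 + |D| / DF(i)).
                double idf = Math.log(1.0 + (double) docs.size() / df.get(term));
                vectors[j][i] = tf * idf;
            }
        }
        return vectors;
    }

    // Cosine similarity: inner product divided by the product of the vector lengths.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an all-zero vector has no direction
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Euclidean similarity: reciprocal of the Euclidean distance.
    // Identical vectors have distance 0, so the result is positive infinity.
    static double euclideanSimilarity(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return 1.0 / Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Toy "pages", already tokenized, stop-word filtered, and stemmed.
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("java", "code", "java"),
                Arrays.asList("java", "program"),
                Arrays.asList("soccer", "match"));
        double[][] v = tfidf(docs);
        System.out.printf("cosine(0,1) = %.3f%n", cosine(v[0], v[1]));
        System.out.printf("cosine(0,2) = %.3f%n", cosine(v[0], v[2]));
        System.out.printf("euclidean(0,1) = %.3f%n", euclideanSimilarity(v[0], v[1]));
    }
}
```

In this toy input, the first two pages share the term "java" and so get a positive cosine similarity, while the first and third share no terms and get zero. A pairwise similarity matrix built this way is the kind of input that either HAC or K-means could then consume.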