Web-page Classification through Summarization
D. Shen, *Z. Chen, **Q. Yang, *H.J. Zeng, *B.Y. Zhang, Y.H. Lu and *W.Y. Ma
Tsinghua University, *Microsoft Research Asia, **Hong Kong University
(SIGIR 2004)

Abstract
- Web-page classification is much more difficult than pure-text classification because of the large variety of noisy information embedded in Web pages.
- In this paper, we propose a new Web-page classification algorithm based on Web summarization to improve accuracy.
- Experimental results show that the proposed summarization-based classification algorithm achieves approximately an 8.8% improvement over a pure-text-based classification algorithm.
- We further introduce an ensemble classifier using the improved summarization algorithms and show that it achieves about a 12% improvement over pure-text-based methods.

Introduction
- With the rapid growth of the WWW, there is an increasing need to provide automated assistance to Web users for Web-page classification and categorization (e.g., the Yahoo directory and the LookSmart directory).
- This need is difficult to meet without automated Web-page classification techniques, because of the labor-intensive nature of human editing.
- Web pages have their own underlying embedded structure in the HTML language and typically contain noisy content such as advertisement banners and navigation bars.
- If pure-text classification is applied to such pages directly, the noise biases the classification algorithm, which may lose focus on the main topic and important content.
- Thus, a critical issue is to design an intelligent preprocessing technique that extracts the main topic of a Web page.
- In this paper, we show that using Web-page summarization techniques as a preprocessing step for Web-page classification is viable and effective.
- We further show that, instead of using off-the-shelf summarization techniques designed for pure text, it is possible to design specialized summarization methods catering to Web-page structures.

Related Work
- Web-page summarization: Ocelot (Berger et al., 2000) is a system for summarizing Web pages that uses probabilistic models to generate the “gist” of a page. Buyukkokten et al. (2001) introduce five methods for summarizing parts of Web pages on handheld devices. Delort (2003) exploits the effect of context in Web-page summarization, where the context consists of information extracted from the content of all documents linking to a page.
- Removing noise from Web pages: Yi et al. (2003) propose an algorithm built on a tree structure, called the Style Tree, that captures the common presentation styles and the actual contents of pages. Chen et al. (2000) extract the title and meta-data included in HTML tags to represent the semantic meaning of Web pages.

Web-Page Summarization

Adapted Luhn's Summarization Method
- Every sentence is assigned a significance factor, and the sentences with the highest significance factors are selected to form the summary.
- The significance factor of a sentence is computed as follows:
  1. Build a “significant words pool” using word frequency.
  2. Set a limit L on the distance at which two significant words can still be considered significantly related.
  3. Find the portion of the sentence bracketed by significant words that are no more than L non-significant words apart.
  4. Count the number of significant words in that portion and divide the square of this count by the total number of words in the portion.
- To customize Luhn's algorithm for Web pages, we make one modification: because the category of each page in the training data is already known, significant-word selection is carried out per category. We build a significant words pool for each category by selecting the high-frequency words after removing stop words.
- For a test Web page the category is unknown, so we compute the significance factor of each sentence against the significant words pool of every category separately and average the result over all categories. The averaged score is denoted S_luhn.
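The per-portion scoring in the list above maps directly onto a short routine. The following is a minimal Python sketch under our own assumptions (the pool construction, the default gap limit L and every function name are illustrative, not the authors' released code):

# Hedged sketch: Luhn-style significance factor for one tokenized sentence.
# `significant_words` is a high-frequency pool (assumed here to be built per
# category, as described above) and L is the maximum allowed gap.
def significance_factor(sentence_words, significant_words, L=4):
    positions = [i for i, w in enumerate(sentence_words) if w in significant_words]
    if not positions:
        return 0.0
    best, start, prev, count = 0.0, positions[0], positions[0], 1
    for pos in positions[1:]:
        if pos - prev - 1 <= L:          # still inside the bracketed portion
            count += 1
        else:                            # close the portion and score it
            best = max(best, count * count / (prev - start + 1))
            start, count = pos, 1
        prev = pos
    return max(best, count * count / (prev - start + 1))

def s_luhn(sentence_words, pools_by_category, L=4):
    # Test-time case: average the factor over every category's word pool.
    scores = [significance_factor(sentence_words, pool, L)
              for pool in pools_by_category.values()]
    return sum(scores) / len(scores) if scores else 0.0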
Latent Semantic Analysis (LSA)
- LSA is applicable to summarization for two reasons:
  - It captures and models the interrelationships among terms by semantically clustering terms and sentences.
  - It captures the salient and recurring word-combination patterns in a document that describe a certain topic or concept.
- LSA is based on the singular value decomposition of the m x n term-by-sentence matrix A: A = U Σ V^T, where U is m x n, Σ is an n x n diagonal matrix of singular values, and V is n x n.
- A word-combination pattern (concept) is represented by one of the singular vectors, and the magnitude of the corresponding singular value indicates the importance of this pattern within the document.
- Any sentence containing this word-combination pattern is projected along the corresponding singular vector; the sentence that best represents the pattern has the largest index value on that vector.
- We denote this index value S_lsa and select the sentences with the highest S_lsa to form the summary.
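To make the SVD step concrete, the sketch below builds a raw term-by-sentence frequency matrix and scores each sentence by its singular-value-weighted projections on the right singular vectors. The matrix weighting and the exact way the projections are collapsed into a single S_lsa per sentence are our own assumptions, not the paper's specification:

import numpy as np
from collections import Counter

def lsa_scores(sentences):
    """Score sentences by their projections on the singular vectors of the
    term-by-sentence matrix A (a minimal, assumption-laden illustration)."""
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(sentences)))            # A is m x n
    for j, s in enumerate(sentences):
        for w, c in Counter(s.lower().split()).items():
            A[index[w], j] = c                             # raw term frequency

    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)  # A = U Σ V^T

    # Each row of Vt indexes the sentences along one concept; weight it by its
    # singular value and keep, per sentence, its strongest concept projection.
    return (sigma[:, None] * np.abs(Vt)).max(axis=0)

# Usage: order = np.argsort(-lsa_scores(sents)); keep the top-ranked sentences.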
Content Body Identification by Page Layout Analysis
- The structured character of Web pages makes Web-page summarization different from pure-text summarization.
- To exploit the structural information of Web pages, we employ a simplified version of the Function-Based Object Model (Chen et al., 2001), which decomposes a page into Basic Objects (BOs) and Composite Objects (COs).
- After detecting all BOs and COs in a Web page, we identify the category of each object according to heuristic rules: Information Object, Navigation Object, Interaction Object, Decoration Object, or Special Function Object.
- To make use of these objects, we define the Content Body (CB) of a Web page, which consists of the main objects related to the topic of the page. The CB is detected as follows (a code sketch of these steps appears after the ensemble section below):
  1. Treat each selected object as a single document and build a TF*IDF index for it.
  2. Compute the cosine similarity between every pair of objects and add a link between two objects if their similarity is greater than a threshold (0.1).
  3. In the resulting graph, the core object is defined as the object having the most edges.
  4. Extract the CB from the core object.
- If a sentence is included in the CB, S_cb = 1; otherwise S_cb = 0. All sentences with S_cb = 1 form the summary.

Supervised Summarization
- Each sentence Si is described by eight features:
  - f_i1: the position of Si within its paragraph
  - f_i2: the length of Si, i.e. the number of words in Si
  - f_i3: Σ TF_w * SF_w over the words w in Si
  - f_i4: the similarity between Si and the page title
  - f_i5: the cosine similarity between Si and all the text in the page
  - f_i6: the cosine similarity between Si and the meta-data in the page
  - f_i7: the number of occurrences in Si of words from a special word set
  - f_i8: the average font size of the words in Si
- We apply a Naïve Bayesian classifier to train the summarizer; its score for a sentence is denoted S_sup below.

An Ensemble of Summarizers
- Combining the four methods presented in the previous sections gives a hybrid Web-page summarizer whose sentence score is the weighted sum S = w1*S_luhn + w2*S_lsa + w3*S_cb + w4*S_sup (the weights are studied in the weighting-schemata experiments).
- The sentences with the highest S are chosen for the summary.
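The combination above is just a weighted sum over per-sentence scores. The sketch below is a hedged illustration, not the authors' code: the min-max normalization that puts the four component scores on a comparable scale is our own assumption, and the all-ones default weights mirror the "kept others as one" setting of the weighting-schemata experiments.

import numpy as np

def hybrid_scores(s_luhn, s_lsa, s_cb, s_sup, w=(1.0, 1.0, 1.0, 1.0)):
    """Combine the per-sentence scores of the four summarizers into S."""
    def norm(x):
        # Assumption: min-max normalize each component so they are comparable.
        x = np.asarray(x, dtype=float)
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)
    parts = [norm(s) for s in (s_luhn, s_lsa, s_cb, s_sup)]
    return sum(wi * p for wi, p in zip(w, parts))

# Usage: keep the top-scoring sentences up to the chosen compression rate.
# S = hybrid_scores(luhn, lsa, cb, sup); summary_idx = np.argsort(-S)[:k]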
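As promised in the Content Body list above, here is one way steps 1-3 of the CB detection could look in code. Again this is only a sketch under stated assumptions: the raw TF*IDF weighting, the helper names, and the rule that the CB is the core object together with its linked neighbours are illustrative, not taken from the paper.

import math
from collections import Counter

def tfidf_vectors(objects):
    """Build a TF*IDF vector (dict) per object, treating each object as a document."""
    tokenized = [o.lower().split() for o in objects]
    df = Counter(w for toks in tokenized for w in set(toks))
    n = len(objects)
    return [{w: tf * math.log(n / df[w]) for w, tf in Counter(toks).items()}
            for toks in tokenized]

def cosine(a, b):
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def core_and_content_body(objects, threshold=0.1):
    """Link objects whose cosine similarity exceeds the threshold; the core object
    is the one with the most edges. Returning the core plus its neighbours as the
    CB is an assumption made for this illustration."""
    vecs = tfidf_vectors(objects)
    edges = {i: set() for i in range(len(objects))}
    for i in range(len(objects)):
        for j in range(i + 1, len(objects)):
            if cosine(vecs[i], vecs[j]) > threshold:
                edges[i].add(j)
                edges[j].add(i)
    core = max(edges, key=lambda i: len(edges[i]))
    return core, {core} | edges[core]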
Experiments

Data Set
- We use about 2 million Web pages crawled from the LookSmart Web directory.
- Because of network bandwidth limitations, we only downloaded about 500 thousand page descriptions manually created by human editors.
- We randomly sampled 30% of the pages with descriptions for our experiments; the extracted subset contains 153,019 pages distributed over 64 categories.
- To reduce the uncertainty of the data split, a 10-fold cross-validation procedure is applied.

Classifiers
- Naïve Bayesian classifier (NB)
- Support Vector Machine (SVM)

Evaluation Measures
- We employ the standard measures for Web classification: precision, recall and the F1-measure.
- To average performance across multiple categories there are two conventions: the micro-average, which gives equal weight to every document, and the macro-average, which gives equal weight to every category regardless of its frequency.
- Only the micro-average is used to evaluate the classifiers.

Experimental Results and Analysis
- Baseline (full text): each Web page is represented as a bag of words in which each word is weighted by its term frequency.
- Human summary (description, title, meta-data): we extract the description of each Web page from the LookSmart Web site and treat it as the “ideal” summary.
- Unsupervised summarization: Content Body, the adapted Luhn algorithm (compression rate 20%) and LSA (compression rate 30%).
- Supervised summarization: a sentence is labeled positive if its similarity with the description is greater than a threshold (0.3), and negative otherwise.
- Hybrid summarization: all of the unsupervised and supervised summarization algorithms combined.
- [Result tables and the chart of performance at different compression rates are not reproduced here.]
- Effect of different weighting schemata:
  - Schema 1: the weight of each summarization method is set in proportion to that method's individual performance (micro-F1).
  - Schemas 2-5: the weight wi (i = 1, 2, 3, 4) is increased to 2 in schema 2-5 respectively, while the other weights are kept at one.

Case Studies
- We randomly selected 100 Web pages that are correctly labeled by all of our summarization-based approaches but wrongly labeled by the baseline (set A), and 500 pages drawn at random from the test pages (set B).
- The average size of the pages in set A (31.2 KB) is much larger than that of the pages in set B (10.9 KB); the summarization techniques help extract the useful information from such large pages.

Conclusions and Future Work
- Several Web-page summarization algorithms are proposed to extract the most relevant features from Web pages and thereby improve the accuracy of Web classification.
- Experimental results show that automatic summaries achieve an improvement (about 12.9%) similar to the ideal case obtained with the summaries created by human editors.
- We will investigate methods for multi-document summarization of hyperlinked Web pages to further boost the accuracy of Web classification.