World Wide Knowledge Base Webpage Classification
Midway Report
Bo Feng (108809282), Rong Zou (108661275)

1. Goal
To learn classifiers that predict the type of a webpage from its text, and to develop a probabilistic, symbolic knowledge base that mirrors the content of the World Wide Web. If successful, this will make text information on the web available in computer-understandable form, enabling much more sophisticated information retrieval and problem solving.

2. Relevant work

3. Problem Analysis
We are developing a system that can classify webpages. For this part, our goal is to develop a probabilistic, symbolic knowledge base that mirrors the content of the World Wide Web [1] using three approaches. If successful, we will try to develop an application that adapts pages to mobile devices. This will make text information on the web available in computer-understandable form, enabling much more sophisticated information retrieval and problem solving.
First, we will learn classifiers to predict the type of webpages from their text using Naïve Bayes and SVM (unigram and bigram). We will also try to improve accuracy by exploiting correlations between pages that point to each other [2]. In addition, for segmenting web pages into meaningful parts (bio, publications, etc.), we would like to employ the decision tree method [3], since intuitively there should be rules that distinguish those parts: for example, the biography section of a professor's page may use <img> and <p> HTML tags, while the publications section may use <ul> and <li> tags. Finally, after learning these classifiers, we would like to use them to build an application: given a web page URL, it rearranges the content and generates a page suited to mobile users, which we think will be quite useful.

4. Approach
We are developing a system that can be trained to extract symbolic knowledge from hypertext, using a variety of machine learning methods:
1) Naïve Bayes
2) SVM
3) Bayesian Network
4) HMM
In addition, we will try to improve accuracy by (1) exploiting correlations between pages that point to each other, and/or (2) segmenting the pages into meaningful parts (bio, publications, etc.).

6. Dataset and features
6.1 Dataset
A dataset from four universities containing 8,282 web pages and hyperlink data, labeled with whether they are professor, student, project, or other pages. For each class the dataset contains pages from the four universities: Cornell (867), Texas (827), Washington (1205), and Wisconsin (1263), plus 4,120 miscellaneous pages collected from other universities.

6.2 Test & Train Data Splits
Since each university's web pages have their own idiosyncrasies, we train on three of the universities plus the miscellaneous collection and test on the pages from the fourth, held-out university; this may be called four-fold (leave-one-university-out) cross-validation.

6.3 Feature Selection
We will use two types of methods to select features. The feature set or "vocabulary" size may be reduced by occurrence counts or by average mutual information with the class variable, which we also call "information gain":
1) Word Counts and Probabilities: remove words that occur in N or fewer documents, and remove words that occur fewer than N times.
2) Information Gain: remove all but the top N words, selecting the words with the highest average mutual information with the class variable. The default is N=0, a special case that removes no words.
In order to apply maximum entropy to a domain, we need to select a set of features to use for setting the constraints. For text classification with maximum entropy, we use word counts as our features; more specifically, in this paper we instantiate a feature for each word-class combination.
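A common instantiation of the word-class feature described at the end of Section 6.3, following the standard maximum-entropy text-classification formulation (an assumption here, not a definition taken from this report), is:

    f_{w,c'}(d, c) =
    \begin{cases}
      0 & \text{if } c \neq c', \\
      N(d, w) / N(d) & \text{otherwise,}
    \end{cases}

where N(d, w) is the number of times word w occurs in document d and N(d) is the total number of words in d.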
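As an illustration of Sections 6.2 and 6.3, the sketch below implements the leave-one-university-out split together with the two vocabulary filters, assuming scikit-learn; the concrete parameter values (min_df=3, top_n=2000), the Naïve Bayes baseline, and the input format (parallel lists of page text, class label, and source university) are illustrative assumptions rather than project decisions.

    # Sketch: leave-one-university-out evaluation with the two vocabulary
    # filters from Section 6.3.  Inputs are parallel lists: raw page text,
    # class label, and source university ("misc" for the miscellaneous
    # pages, which therefore always stay in the training set).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.metrics import accuracy_score
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    UNIVERSITIES = ["cornell", "texas", "washington", "wisconsin"]

    def evaluate(texts, labels, sources, min_df=3, top_n=2000):
        scores = {}
        for held_out in UNIVERSITIES:
            train = [i for i, s in enumerate(sources) if s != held_out]
            test = [i for i, s in enumerate(sources) if s == held_out]
            model = Pipeline([
                # drop words that appear in fewer than min_df documents
                ("counts", CountVectorizer(min_df=min_df)),
                # keep the top_n words by mutual information with the class
                ("select", SelectKBest(mutual_info_classif, k=top_n)),
                # simple Naive Bayes baseline classifier
                ("clf", MultinomialNB()),
            ])
            model.fit([texts[i] for i in train], [labels[i] for i in train])
            predictions = model.predict([texts[i] for i in test])
            scores[held_out] = accuracy_score([labels[i] for i in test], predictions)
        return scores

Running evaluate over the four folds gives one held-out accuracy per university, which can then be averaged.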
7. Results
Expected Result:
Milestone Result: learn classifiers using at least two of these methods on a subset of the dataset: 1) develop a system that can be trained to extract symbolic knowledge from hypertext using Naïve Bayes and SVM; 2) exploit correlations between pages that point to each other; 3) segment the pages into meaningful parts using a decision tree.
Actual Milestone Result:
Final Result: combine the three approaches to learn a good classifier for the whole dataset; we hope to improve accuracy by 10% with our new combined algorithm. We will then try to develop an application that adapts web pages to mobile devices using our trained model.

8. References
[1] Learning to Construct Knowledge Bases from the World Wide Web. M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam and S. Slattery. Artificial Intelligence.
[2] Learning to Extract Symbolic Knowledge from the World Wide Web. M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam and S. Slattery. Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98).
[3] Data Mining on Symbolic Knowledge Extracted from the Web. Rayid Ghani, Rosie Jones, Dunja Mladenic, Kamal Nigam and Sean Slattery. KDD-2000 Workshop on Text Mining, 2000.