World Wide Knowledge Based Webpage Classification
Midway Report
Bo Feng (108809282), Rong Zou (108661275)

1. Goal
To learn classifiers that predict the type of a webpage from its text.

2. Relevant Work
Automatic text classification has been an important application and research topic since the inception of digital documents. Intuitively, text classification is the task of assigning a document to a predefined category. More formally, if $d_j$ is a document from the entire set of documents $D$ and $\{c_1, c_2, \ldots, c_n\}$ is the set of all categories, then text classification assigns one category $c_i$ to the document $d_j$.

2.1 Feature Selection
The aim of feature-selection methods is to reduce the dimensionality of the dataset by removing features that are considered irrelevant for classification. Feature-subset-selection methods for text document classification use an evaluation function applied to individual words. Scoring of individual words (Best Individual Features) can be performed with measures such as document frequency, term frequency, mutual information, and information gain.

2.1.1 Term frequency TF(t, d)
For the term frequency TF(t, d), we choose a normalized frequency to prevent a bias towards longer documents, i.e. the raw frequency of a term divided by the maximum raw frequency of any term in the document:

$$TF(t, d) = \frac{f(t, d)}{\max\{f(w, d) : w \in d\}}$$

2.1.2 Inverse document frequency IDF(t, D)
The inverse document frequency measures whether a term is common or rare across all documents. It is obtained by dividing the total number of documents $|D|$ by the number of documents containing the term, $|\{d \in D : t \in d\}|$, and taking the logarithm of that quotient:

$$IDF(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}$$

2.1.3 Term frequency-inverse document frequency TF-IDF(t, d, D) [4]
TF-IDF is the product of the two statistics, term frequency and inverse document frequency:

$$TF\text{-}IDF(t, d, D) = TF(t, d) \times IDF(t, D)$$

A high TF-IDF weight is reached by a high term frequency in the given document and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms.
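To make the three quantities above concrete, the following is a minimal Python sketch of the normalized TF, the IDF, and the TF-IDF weight exactly as defined in this section. The function and variable names (compute_tf, compute_idf, doc_tokens, all_docs_tokens) are ours and purely illustrative, not part of our actual pipeline.

    import math
    from collections import Counter

    def compute_tf(doc_tokens):
        """TF(t, d): raw count of t divided by the maximum raw count of any term in d."""
        counts = Counter(doc_tokens)
        max_count = max(counts.values())
        return {term: count / max_count for term, count in counts.items()}

    def compute_idf(all_docs_tokens):
        """IDF(t, D) = log(|D| / |{d in D : t in d}|)."""
        num_docs = len(all_docs_tokens)
        doc_freq = Counter()
        for tokens in all_docs_tokens:
            doc_freq.update(set(tokens))    # each document counts a term at most once
        return {term: math.log(num_docs / df) for term, df in doc_freq.items()}

    def compute_tf_idf(doc_tokens, idf):
        """TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)."""
        return {term: tf * idf.get(term, 0.0) for term, tf in compute_tf(doc_tokens).items()}

In this scheme, compute_idf would be run once over the whole training collection, and compute_tf_idf once per document to produce its weighted feature vector.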
2.2 Machine Learning Algorithms
After feature selection and transformation, the documents can be represented in a form usable by a machine learning (ML) algorithm. Many text classifiers have been proposed in the literature using machine learning techniques, probabilistic models, etc. They often differ in the approach adopted: decision trees, Naïve Bayes, neural networks, nearest neighbors, and lately, support vector machines. Naïve Bayes is often used in text classification applications and experiments because of its simplicity and effectiveness [6]. Support vector machines (SVMs), when applied to text classification, also usually provide excellent precision but poor recall. One means of customizing SVMs to improve recall is to adjust the threshold associated with the SVM; Shanahan and Roma described an automatic process for adjusting the thresholds of generic SVMs [7] with better results.

3. Problem Analysis
We are developing a system that can classify webpages. For this part, our goal is to develop a probabilistic, symbolic knowledge base that mirrors the content of the World Wide Web [1]. This will make text information on the web available in computer-understandable form, enabling much more sophisticated information retrieval and problem solving.
First, we will learn classifiers to predict the type of a webpage from its text by using Naïve Bayes and SVM, with unigrams as features. We will also try to improve accuracy by exploiting correlations between pages that point to each other [2].

4. Approaches
Figure 1 gives a graphical representation of our webpage classification process.

Figure 1: Webpage classification process (parse web page content → vector representation of web page → feature selection → learning algorithm)

4.1 Parse web page content
We extract text features using TF and TF-IDF (term frequency-inverse document frequency), which reflect how important a word is to a document in a collection or corpus.

4.2 Algorithms
We are developing a system that can be trained to extract symbolic knowledge from hypertext, using two machine learning methods based on unigrams: 1) Naïve Bayes, using Witten-Bell smoothing for the probabilities; 2) SVM, with features based on TF and TF-IDF.

5. Dataset
5.1 Dataset
The WebKB dataset [2] contains 8,282 web pages gathered from the computer science departments of four universities. The pages are labeled with seven categories: student, faculty, staff, course, project, department, and other. The dataset contains pages from the four universities, Cornell (867), Texas (827), Washington (1,205), and Wisconsin (1,263), plus 4,120 miscellaneous pages collected from other universities.

5.2 Train & Test Data Splits
Since each university's web pages have their own idiosyncrasies, we train on three of the universities plus the miscellaneous collection, and test on the pages from the fourth, held-out university; this amounts to 4-fold cross-validation.

6. Results
First, we learn models using only the actual content of the web pages, without considering the HTML tags or the relationships between different pages. In the data pre-processing step, we remove HTML tags and ignore the MIME header information in each web page, then extract unigrams after removing punctuation. We do not stem or remove stop words, because even stop words may carry extra information: for example, web documents in one category may contain more occurrences of "the" or "in" than documents in other categories. After obtaining the unigrams for each web document, we apply the Naïve Bayes and SVM algorithms to build our learning models, and obtain the following classification accuracies:

Table 1: Classification accuracy of the three classifiers

              Naïve Bayes   SVM (TF)   SVM (TF-IDF)
    Fold 1    25.26%        68.86%     69.67%
    Fold 2    16.57%        67.96%     67.47%
    Fold 3    21.49%        68.30%     66.22%
    Fold 4    26.29%        70.55%     72.21%
    Average   22.40%        68.92%     68.89%

6.1 Naïve Bayes classifier
In the Naïve Bayes part, we first estimate the probability of a unigram given a class, P(w|c). For example, to calculate P(professor | faculty) on the training dataset, we use the occurrence count of the word "professor" as the numerator and the total word count as the denominator, over the documents in the faculty category. We also use the Witten-Bell smoothing algorithm [10] to smooth the probabilities, so that we can handle unigrams that do not appear in the training dataset but do occur in the test dataset.
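The smoothing step can be illustrated with a short Python sketch. It follows one common formulation of Witten-Bell smoothing, in which a seen word receives count/(N+T) and the reserved mass T/(N+T) is shared evenly among the Z word types never seen in the class; the function and variable names (witten_bell_prob, class_counts, vocab) are ours, and the sketch illustrates the idea rather than our exact implementation.

    def witten_bell_prob(word, class_counts, vocab):
        """Witten-Bell smoothed estimate of P(word | class).

        class_counts: word -> count within one class in the training data.
        vocab: set of all word types we may ever be asked about (train + test).
        """
        n = sum(class_counts.values())      # N: total tokens observed in the class
        t = len(class_counts)               # T: distinct word types seen in the class
        z = max(len(vocab) - t, 1)          # Z: word types in vocab never seen in the class
        count = class_counts.get(word, 0)
        if count > 0:
            return count / (n + t)          # discounted probability for a seen word
        return t / (z * (n + t))            # reserved mass spread evenly over unseen words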
After we have built P(w|c) for each word and each category, we use the Naïve Bayes rule to classify a new web document. Under the assumption that the words in a document are independent given the category, we simply compute (using cat to denote a category and doc to denote a document):

$$P(cat \mid doc) \propto P(doc \mid cat) \times P(cat) = \prod_{i=1}^{n} P(word_i \mid cat) \times P(cat)$$

After calculating this value for each category, we assign the new document the category with the maximum probability. However, the result of this algorithm is quite poor: after 4-fold cross-validation, the average accuracy is only 22.40%, which is only slightly better than picking a category at random (1/7 = 14.29%). The reasons for the low accuracy might be: 1) the dataset is unbalanced: the category "other", a collection of pages that were not deemed the "main page" representing an instance of the previous six categories, accounts for 45.44% of the data, while the other six categories each account for only 1% to 20%; 2) the feature we used does not model the documents well; an improvement might be to use P(w|c) × TF as the probability, and removing stop words may also help.

6.2 Unigram + SVM classifier (TF/TF-IDF)
In the SVM learning part, also based on our unigrams, we try two different features: one is TF (term frequency), the other is TF × IDF as described above. We use the svm_multiclass tool from SVM-light [9] to build our models. In this training task, we use an SVM with a linear kernel. We try the slack parameter C over the values (0.01, 0.1, 1, 10, 100, 1000, 5000, 10^4, 5 × 10^4, 10^5, 5 × 10^5, 10^6, 5 × 10^6, 10^7, 5 × 10^7, 10^8), find the best C on our holdout data, and then apply that model to our test dataset. To be more specific, we calculate the TF value of a word as its count divided by the total word count in the document, and we calculate IDF as follows: for each word we count the number of documents it appears in, D(w), and with D the total number of web documents in the training set, the IDF value of the word is:

$$IDF(word) = \log\left(\frac{D}{D(word)}\right)$$

The results are much better than the Naïve Bayes classifier. Using TF as the feature, after 4-fold cross-validation and applying our models to the test data, the average accuracy is 68.92%; using TF × IDF as the feature, the average accuracy is 68.89%. They are basically the same: introducing the extra IDF factor does not actually improve our model.
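The model selection over C described above can be sketched as follows. This is a minimal illustration that assumes scikit-learn's LinearSVC as a stand-in for the svm_multiclass tool we actually use, and assumes the TF or TF-IDF feature matrices and label vectors (X_train, y_train, X_holdout, y_holdout, X_test, y_test) have already been built; the names are ours and purely illustrative.

    from sklearn.svm import LinearSVC
    from sklearn.metrics import accuracy_score

    # grid of slack values C searched over
    C_GRID = [0.01, 0.1, 1, 10, 100, 1000, 5000, 1e4, 5e4,
              1e5, 5e5, 1e6, 5e6, 1e7, 5e7, 1e8]

    def select_c_and_test(X_train, y_train, X_holdout, y_holdout, X_test, y_test):
        """Pick C on the holdout pages, then report accuracy on the test pages."""
        best_c, best_holdout_acc = None, -1.0
        for c in C_GRID:
            clf = LinearSVC(C=c)             # linear kernel, one-vs-rest multiclass
            clf.fit(X_train, y_train)
            acc = accuracy_score(y_holdout, clf.predict(X_holdout))
            if acc > best_holdout_acc:
                best_c, best_holdout_acc = c, acc
        final = LinearSVC(C=best_c).fit(X_train, y_train)
        return best_c, accuracy_score(y_test, final.predict(X_test))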
7. Future Work
Simply by using an SVM with unigram TF / TF-IDF features, we already reach about 69% accuracy, but there is still much information we have not used, such as the words' pointwise mutual information (PMI) [11], the words' POS tags, and, more importantly, the HTML tags and the link relationships between pages, which would give us much extra information. Next, we will try using PMI information with the SVM to see whether it gives a better result. Also, with the HTML tags and the pages' link relationships, we would like to employ an HMM or CRF algorithm to build a probabilistic graphical model, which may give us promising accuracy.

8. References
[1] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to Construct Knowledge Bases from the World Wide Web. Artificial Intelligence.
[2] M. Craven et al. Learning to Extract Symbolic Knowledge from the World Wide Web. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), pages 509-516, 1998.
[3] Rayid Ghani, Rosie Jones, Dunja Mladenic, Kamal Nigam, and Sean Slattery. Data Mining on Symbolic Knowledge Extracted from the Web. KDD-2000 Workshop on Text Mining, 2000.
[4] tf-idf, http://en.wikipedia.org/wiki/Tfidf
[5] Zu G., Ohyama W., Wakabayashi T., and Kimura F. Accuracy Improvement of Automatic Text Classification Based on Feature Transformation.
[6] Kim S. B. et al. Effective Methods for Improving Naive Bayes Text Classifiers.
[7] Shanahan J. and Roma N. Improving SVM Text Classification Performance through Threshold Adjustment. LNAI 2837, pages 361-372, 2003.
[8] M. Ikonomakis, S. Kotsiantis, and V. Tampakas. Text Classification Using Machine Learning Techniques. WSEAS Transactions on Computers, issue 8, volume 4, pages 966-974, 2005.
[9] svm_multiclass (SVM-light), http://svmlight.joachims.org/svm_multiclass.html
[10] Witten-Bell smoothing, http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf
[11] Pointwise Mutual Information, http://en.wikipedia.org/wiki/Pointwise_mutual_information