World Wide Knowledge Based Webpage Classification
Midway Report
Bo Feng (108809282), Rong Zou (108661275)

1. Goal
To learn classifiers that predict the type of a webpage from its text.

2. Relevant Work
Automatic text classification has been an important application and research topic since the inception of digital documents. Intuitively, text classification is the task of assigning a document to a predefined category. More formally, if $d_j$ is a document from the entire set of documents $D$ and $\{c_1, c_2, \ldots, c_n\}$ is the set of all categories, then text classification assigns one category $c_i$ to the document $d_j$.

2.1 Feature Selection
The aim of feature-selection methods is to reduce the dimensionality of the dataset by removing features that are considered irrelevant for classification. Feature-subset-selection methods for text document classification use an evaluation function applied to individual words. Scoring of individual words (Best Individual Features) can be performed with measures such as document frequency, term frequency, mutual information, and information gain.

2.1.1 Term frequency TF(t, d)
For the term frequency TF(t, d), we choose a normalized frequency to prevent a bias towards longer documents, i.e. the raw frequency of a term divided by the maximum raw frequency of any term in the document:

$$TF(t, d) = \frac{f(t, d)}{\max\{f(w, d) : w \in d\}}$$

2.1.2 Inverse document frequency IDF(t, D)
The inverse document frequency measures whether a term is common or rare across all documents. It is obtained by dividing the total number of documents $|D|$ by the number of documents containing the term, $|\{d \in D : t \in d\}|$, and taking the logarithm of that quotient:

$$IDF(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}$$

2.1.3 Term frequency-inverse document frequency TF-IDF(t, d, D) [4]
TF-IDF is the product of the two statistics, term frequency and inverse document frequency:

$$TF\text{-}IDF(t, d, D) = TF(t, d) \times IDF(t, D)$$

A high TF-IDF weight is reached by a high term frequency in the given document and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms.
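To make the three quantities above concrete, the following is a minimal Python sketch of the normalized TF, the IDF, and the TF-IDF weight exactly as defined in this section. The function and variable names (compute_tf, compute_idf, doc_tokens, all_docs_tokens) are ours and purely illustrative, not part of our actual pipeline.

    import math
    from collections import Counter

    def compute_tf(doc_tokens):
        """TF(t, d): raw count of t divided by the maximum raw count of any term in d."""
        counts = Counter(doc_tokens)
        max_count = max(counts.values())
        return {term: count / max_count for term, count in counts.items()}

    def compute_idf(all_docs_tokens):
        """IDF(t, D) = log(|D| / |{d in D : t in d}|)."""
        num_docs = len(all_docs_tokens)
        doc_freq = Counter()
        for tokens in all_docs_tokens:
            doc_freq.update(set(tokens))    # each document counts a term at most once
        return {term: math.log(num_docs / df) for term, df in doc_freq.items()}

    def compute_tf_idf(doc_tokens, idf):
        """TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)."""
        return {term: tf * idf.get(term, 0.0) for term, tf in compute_tf(doc_tokens).items()}

In this scheme, compute_idf would be run once over the whole training collection, and compute_tf_idf once per document to produce its weighted feature vector.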
2.2 Machine Learning Algorithms
After feature selection and transformation, the documents can be represented in a form usable by a machine learning (ML) algorithm. Many text classifiers have been proposed in the literature using machine learning techniques, probabilistic models, etc. They often differ in the approach adopted: decision trees, Naïve Bayes, neural networks, nearest neighbors, and lately, support vector machines. Naïve Bayes is often used in text classification applications and experiments because of its simplicity and effectiveness [6]. Support vector machines (SVMs), when applied to text classification, also usually provide excellent precision but poor recall. One means of customizing SVMs to improve recall is to adjust the threshold associated with the SVM; Shanahan and Roma described an automatic process for adjusting the thresholds of generic SVMs [7] with better results.

3. Problem Analysis
We are developing a system that can classify webpages. For this part, our goal is to develop a probabilistic, symbolic knowledge base that mirrors the content of the World Wide Web [1]. This will make text information on the web available in computer-understandable form, enabling much more sophisticated information retrieval and problem solving.
First, we will learn classifiers to predict the type of a webpage from its text by using Naïve Bayes and SVM, with unigrams as features. We will also try to improve accuracy by exploiting correlations between pages that point to each other [2].

4. Approaches
Figure 1 gives a graphical representation of our webpage classification process.

Figure 1: Webpage classification process (parse web page content → vector representation of web page → feature selection → learning algorithm)

4.1 Parse web page content
We extract text features using TF and TF-IDF (term frequency-inverse document frequency), which reflect how important a word is to a document in a collection or corpus.

4.2 Algorithms
We are developing a system that can be trained to extract symbolic knowledge from hypertext, using two machine learning methods based on unigrams: 1) Naïve Bayes, using Witten-Bell smoothing for the probabilities; 2) SVM, with features based on TF and TF-IDF.

5. Dataset
5.1 Dataset
The WebKB dataset [2] contains 8,282 web pages gathered from the computer science departments of four universities. The pages are labeled with seven categories: student, faculty, staff, course, project, department, and other. The dataset contains pages from the four universities, Cornell (867), Texas (827), Washington (1,205), and Wisconsin (1,263), plus 4,120 miscellaneous pages collected from other universities.

5.2 Train & Test Data Splits
Since each university's web pages have their own idiosyncrasies, we train on three of the universities plus the miscellaneous collection, and test on the pages from the fourth, held-out university; this amounts to 4-fold cross-validation.

6. Results
First, we learn models using only the actual content of the web pages, without considering the HTML tags or the relationships between different pages. In the data pre-processing step, we remove HTML tags and ignore the MIME header information in each web page, then extract unigrams after removing punctuation. We do not stem or remove stop words, because even stop words may carry extra information: for example, web documents in one category may contain more occurrences of "the" or "in" than documents in other categories. After obtaining the unigrams for each web document, we apply the Naïve Bayes and SVM algorithms to build our learning models, and obtain the following classification accuracies:

Table 1: Classification accuracy of the three classifiers

              Naïve Bayes   SVM (TF)   SVM (TF-IDF)
    Fold 1    25.26%        68.86%     69.67%
    Fold 2    16.57%        67.96%     67.47%
    Fold 3    21.49%        68.30%     66.22%
    Fold 4    26.29%        70.55%     72.21%
    Average   22.40%        68.92%     68.89%

6.1 Naïve Bayes classifier
In the Naïve Bayes part, we first estimate the probability of a unigram given a class, P(w|c). For example, to calculate P(professor | faculty) on the training dataset, we use the occurrence count of the word "professor" as the numerator and the total word count as the denominator, over the documents in the faculty category. We also use the Witten-Bell smoothing algorithm [10] to smooth the probabilities, so that we can handle unigrams that do not appear in the training dataset but do occur in the test dataset.
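The smoothing step can be illustrated with a short Python sketch. It follows one common formulation of Witten-Bell smoothing, in which a seen word receives count/(N+T) and the reserved mass T/(N+T) is shared evenly among the Z word types never seen in the class; the function and variable names (witten_bell_prob, class_counts, vocab) are ours, and the sketch illustrates the idea rather than our exact implementation.

    def witten_bell_prob(word, class_counts, vocab):
        """Witten-Bell smoothed estimate of P(word | class).

        class_counts: word -> count within one class in the training data.
        vocab: set of all word types we may ever be asked about (train + test).
        """
        n = sum(class_counts.values())      # N: total tokens observed in the class
        t = len(class_counts)               # T: distinct word types seen in the class
        z = max(len(vocab) - t, 1)          # Z: word types in vocab never seen in the class
        count = class_counts.get(word, 0)
        if count > 0:
            return count / (n + t)          # discounted probability for a seen word
        return t / (z * (n + t))            # reserved mass spread evenly over unseen words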
After we have built P(w|c) for each word and each category, we use the Naïve Bayes rule to classify a new web document. Under the assumption that the words in a document are independent given the category, we simply compute (using cat to denote a category and doc to denote a document):

$$P(cat \mid doc) \propto P(doc \mid cat) \times P(cat) = \prod_{i=1}^{n} P(word_i \mid cat) \times P(cat)$$

After calculating this value for each category, we assign the new document the category with the maximum probability. However, the result of this algorithm is quite poor: after 4-fold cross-validation, the average accuracy is only 22.40%, which is only slightly better than picking a category at random (1/7 = 14.29%). The reasons for the low accuracy might be: 1) the dataset is unbalanced: the category "other", a collection of pages that were not deemed the "main page" representing an instance of the previous six categories, accounts for 45.44% of the data, while the other six categories each account for only 1% to 20%; 2) the feature we used does not model the documents well; an improvement might be to use P(w|c) × TF as the probability, and removing stop words may also help.

6.2 Unigram + SVM classifier (TF/TF-IDF)
In the SVM learning part, also based on our unigrams, we try two different features: one is TF (term frequency), the other is TF × IDF as described above. We use the svm_multiclass tool from SVM-light [9] to build our models. In this training task, we use an SVM with a linear kernel. We try the slack parameter C over the values (0.01, 0.1, 1, 10, 100, 1000, 5000, 10^4, 5 × 10^4, 10^5, 5 × 10^5, 10^6, 5 × 10^6, 10^7, 5 × 10^7, 10^8), find the best C on our holdout data, and then apply that model to our test dataset. To be more specific, we calculate the TF value of a word as its count divided by the total word count in the document, and we calculate IDF as follows: for each word we count the number of documents it appears in, D(w), and with D the total number of web documents in the training set, the IDF value of the word is:

$$IDF(word) = \log\left(\frac{D}{D(word)}\right)$$

The results are much better than the Naïve Bayes classifier. Using TF as the feature, after 4-fold cross-validation and applying our models to the test data, the average accuracy is 68.92%; using TF × IDF as the feature, the average accuracy is 68.89%. They are basically the same: introducing the extra IDF factor does not actually improve our model.
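The model selection over C described above can be sketched as follows. This is a minimal illustration that assumes scikit-learn's LinearSVC as a stand-in for the svm_multiclass tool we actually use, and assumes the TF or TF-IDF feature matrices and label vectors (X_train, y_train, X_holdout, y_holdout, X_test, y_test) have already been built; the names are ours and purely illustrative.

    from sklearn.svm import LinearSVC
    from sklearn.metrics import accuracy_score

    # grid of slack values C searched over
    C_GRID = [0.01, 0.1, 1, 10, 100, 1000, 5000, 1e4, 5e4,
              1e5, 5e5, 1e6, 5e6, 1e7, 5e7, 1e8]

    def select_c_and_test(X_train, y_train, X_holdout, y_holdout, X_test, y_test):
        """Pick C on the holdout pages, then report accuracy on the test pages."""
        best_c, best_holdout_acc = None, -1.0
        for c in C_GRID:
            clf = LinearSVC(C=c)             # linear kernel, one-vs-rest multiclass
            clf.fit(X_train, y_train)
            acc = accuracy_score(y_holdout, clf.predict(X_holdout))
            if acc > best_holdout_acc:
                best_c, best_holdout_acc = c, acc
        final = LinearSVC(C=best_c).fit(X_train, y_train)
        return best_c, accuracy_score(y_test, final.predict(X_test))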
7. Future Work
Simply by using an SVM with unigram TF / TF-IDF features, we already reach about 69% accuracy, but there is still much information we have not used, such as the words' pointwise mutual information (PMI) [11], the words' POS tags, and, more importantly, the HTML tags and the link relationships between pages, which would give us much extra information. Next, we will try using PMI information with the SVM to see whether it gives a better result. Also, with the HTML tags and the pages' link relationships, we would like to employ an HMM or CRF algorithm to build a probabilistic graphical model, which may give us promising accuracy.

8. References
[1] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to Construct Knowledge Bases from the World Wide Web. Artificial Intelligence.
[2] M. Craven et al. Learning to Extract Symbolic Knowledge from the World Wide Web. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), pages 509-516, 1998.
[3] Rayid Ghani, Rosie Jones, Dunja Mladenic, Kamal Nigam, and Sean Slattery. Data Mining on Symbolic Knowledge Extracted from the Web. KDD-2000 Workshop on Text Mining, 2000.
[4] tf-idf, http://en.wikipedia.org/wiki/Tfidf
[5] Zu G., Ohyama W., Wakabayashi T., and Kimura F. Accuracy Improvement of Automatic Text Classification Based on Feature Transformation.
[6] Kim S. B. et al. Effective Methods for Improving Naive Bayes Text Classifiers.
[7] Shanahan J. and Roma N. Improving SVM Text Classification Performance through Threshold Adjustment. LNAI 2837, pages 361-372, 2003.
[8] M. Ikonomakis, S. Kotsiantis, and V. Tampakas. Text Classification Using Machine Learning Techniques. WSEAS Transactions on Computers, issue 8, volume 4, pages 966-974, 2005.
[9] svm_multiclass (SVM-light), http://svmlight.joachims.org/svm_multiclass.html
[10] Witten-Bell smoothing, http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf
[11] Pointwise Mutual Information, http://en.wikipedia.org/wiki/Pointwise_mutual_information