Web Page Classification
Bo Feng, Rong Zou

Data Preparation
• Dataset: the WebKB dataset contains 8,282 web pages gathered from the computer science departments of four universities, combined with a miscellaneous collection of web pages.
• 7 categories (quite unbalanced):
  – Student (1641)
  – Faculty (1124)
  – Staff (137)
  – Course (930)
  – Project (504)
  – Department (182)
  – Other (3764)
• 4-fold cross-validation:
  – Train on three of the universities plus the miscellaneous collection
  – Test on the pages from the fourth university

Classification Approaches
• Content-based classification
  – Naïve Bayes
  – SVM + TF / TF-IDF
• Link-based classification
  – Decision tree based on URL
  – SVM based on URL
  – SVM based on link type
  – SVM based on link keywords
• Combination of link-based and content-based classification
  – Content (SVM) + URL (decision tree)

Classification based on content
• Data preprocessing
  – Ignore the MIME header
  – Stop words are not removed
• Features
  – Unigram + TF
  – Unigram + TF * IDF
• Learning methods
  – Naïve Bayes (Witten-Bell smoothing)
  – SVM (linear kernel)

Classification based on content
• Result
  – Naïve Bayes: 22.40%
  – SVM (TF): 68.92%
  – SVM (TF-IDF): 68.89%

Classification based on URL
• http://cs.cornell.edu/course/cse511 vs http://cs.wustl.edu/~rad/
• Features (17 binary attributes)
  – has(course), has_number, is_root_path, has(faculty), has(people), has(user), has(~), has(project), has(staff), has(student), has(pub), has(research), has(paper), endswith(.ps|.pdf), endswith(.html), startwith(ftp), startwith(mailto)
• Learning methods
  – Decision tree
  – SVM (linear kernel)

Classification based on URL
• Suffered from the “other” category:
  – E.g. http://www.cs.wisc.edu/~dyer/cs540/getstarted.html
• Accuracy:

  Classifier            With “other” category   Without “other” category
  Decision tree         61.21%                  78.56%
  SVM (linear kernel)   56.30%                  66.01%

Classification based on link type
• Idea: for each web page, classify the type of each link with the previously trained decision tree model, then count how many times each link type appears.
• Features:
  – <link_type: count> pairs
• Learning method:
  – SVM (linear kernel)
• Result:
  – Accuracy: 50.82% (all categories); 60.24% (with the “other” category removed)

Classification based on link keywords
• Features:
  – # (link contains “course”)
  – # (link contains number)
  – # (link contains “faculty”)
  – # (link contains “~”)
  – …
  – # (link ends with .ps or .pdf)
  – # (link starts with mailto)
  – …
• Learning method:
  – SVM (linear kernel)
• Result:
  – Accuracy: 31.15%

Combine URL and content unigrams
• Combine 2 classifiers:
  – Page content: SVM (unigram + TF)
  – Page URL: decision tree (with the “other” category removed)
• Learning method:
  – If the SVM predicts “other”, predict “other”
  – Otherwise, check the page’s URL: if the decision tree predicts some category with high probability (> 0.5), predict that category; otherwise, use the SVM’s prediction
• Result:
  – Accuracy: 78.28%

Result Comparison

  Group         Classifier                 Accuracy
  Content       Naïve Bayes                16.35% (22.40% on the content result slide)
  Content       SVM (TF)                   68.92%
  Content       SVM (TF-IDF)               68.89%
  Link          Decision tree (URL)        61.21%
  Link          SVM (URL)                  56.30%
  Link          SVM (link type)            50.82%
  Link          SVM (link keywords)        31.15%
  Combination   SVM (TF) + decision tree   78.28%

Comparison with paper [1]

  All categories (our classifier)           78.28%
  Categories without “others” (paper [1])   72.6%
  All categories (paper [1])                53.90%

[1] M. Craven et al. Learning to extract symbolic knowledge from the World Wide Web. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, 1998.
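The rule that combines the content SVM with the URL decision tree can be sketched as a small function. This is a minimal sketch under stated assumptions, not the authors' code: the function name `combine_predictions`, the label strings, and the dictionary of decision-tree class probabilities are hypothetical; only the rule itself and the 0.5 threshold come from the slide.

```python
from typing import Dict

OTHER = "other"  # assumed label string for the "other" category


def combine_predictions(svm_label: str,
                        tree_probs: Dict[str, float],
                        threshold: float = 0.5) -> str:
    """Combine a content-based SVM label with URL decision-tree probabilities.

    svm_label:  label predicted by the content SVM (hypothetical input).
    tree_probs: class -> probability from the URL decision tree, which is
                trained without the "other" category (hypothetical input).
    """
    # Rule 1: an "other" prediction from the SVM is always kept.
    if svm_label == OTHER:
        return OTHER
    # Rule 2: trust the URL decision tree only when it is confident.
    if tree_probs:
        best_label = max(tree_probs, key=tree_probs.get)
        if tree_probs[best_label] > threshold:
            return best_label
    # Rule 3: otherwise fall back to the SVM's prediction.
    return svm_label
```

For example, `combine_predictions("student", {"course": 0.9, "student": 0.1})` returns `"course"`, while a tree probability of exactly 0.5 is not "high" under the strict `>` comparison and leaves the SVM label unchanged.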
Conclusion
• Result analysis:
  – The data is quite unbalanced
  – Data in the “other” category is quite noisy
  – A graph model could not be constructed because the content of the linked web pages is unavailable
• Some methods that might help:
  – Using bigrams / named entities
  – SVM with RBF kernel
  – Training weights for several different SVM classifiers
  – AdaBoost
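To make the class imbalance concrete, the category counts from the data preparation slide can be turned into class shares. The sketch below is illustrative only (the variable names are not from the slides); it also computes the accuracy of always predicting the largest category on the pooled data, which may serve as a rough reference point for the reported accuracies.

```python
# Category counts as listed on the data preparation slide.
counts = {
    "student": 1641, "faculty": 1124, "staff": 137, "course": 930,
    "project": 504, "department": 182, "other": 3764,
}

total = sum(counts.values())  # 8282, matching the stated dataset size
shares = {c: n / total for c, n in counts.items()}

# Accuracy of always predicting the largest category on the pooled data.
# Per-fold values will differ, since each fold tests on a single university.
majority_class = max(counts, key=counts.get)
majority_baseline = shares[majority_class]

for category, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{category:<11}{share:6.2%}")
print(f"majority baseline ({majority_class}): {majority_baseline:.2%}")
```

On the pooled counts, “other” alone accounts for about 45% of the pages, while “staff” and “department” together make up under 4%.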