Web Page Classification

Bo Feng, Rong Zou

Data Preparation
• Dataset: the WebKB dataset contains 8,282 web pages gathered from the computer science departments of four universities, combined with a miscellaneous collection of web pages.
• 7 categories (quite unbalanced):
  – Student (1641)
  – Faculty (1124)
  – Staff (137)
  – Course (930)
  – Project (504)
  – Department (182)
  – Other (3764)
• 4-fold cross-validation:
  – Train on three of the universities plus the miscellaneous collection
  – Test on the pages from the fourth university

Classification Approaches
• Content-based classification
  – Naïve Bayes
  – SVM + TF / TF-IDF
• Link-based classification
  – Decision tree based on URL
  – SVM based on URL
  – SVM based on link type
  – SVM based on link keywords
• Combination of link-based and content-based classification
  – Content (SVM) + URL (decision tree)

Classification based on content
• Data preprocessing
  – Ignore the MIME header
  – Stop words are not removed
• Features
  – Unigram + TF
  – Unigram + TF × IDF
• Learning methods
  – Naïve Bayes (Witten-Bell smoothing)
  – SVM (linear kernel)

Classification based on content
• Result
  – Naïve Bayes: 22.40%
  – SVM (TF): 68.92%
  – SVM (TF-IDF): 68.89%

Classification based on URL
• Example URLs: http://cs.cornell.edu/course/cse511 vs http://cs.wustl.edu/~rad/
• Features (17 binary attributes)
  – has(course), has_number, is_root_path, has(faculty), has(people), has(user), has(~), has(project), has(staff), has(student), has(pub), has(research), has(paper), endswith(.ps|.pdf), endswith(.html), startwith(ftp), startwith(mailto)
• Learning methods
  – Decision tree
  – SVM (linear kernel)

Classification based on URL
• Suffered from the “other” category:
  – E.g.
    http://www.cs.wisc.edu/~dyer/cs540/getstarted.html
• Result with and without the “other” category:
  – Decision tree: 61.21% with “other”, 78.56% without “other”
  – SVM (linear kernel): 56.30% with “other”, 66.01% without “other”

Classification based on link type
• Idea: for each web page, classify the type of every link with the previously trained decision tree model, then count how many times each link type appears.
• Features:
  – <link_type: count> pairs
• Learning method:
  – SVM (linear kernel)
• Result:
  – Accuracy: 50.82% (all categories); 60.24% (after removing the “other” category)

Classification based on link keywords
• Features:
  – # (link contains “course”)
  – # (link contains number)
  – # (link contains “faculty”)
  – # (link contains “~”)
  – …
  – # (link ends with .ps or .pdf)
  – # (link starts with mailto)
  – …
• Learning method:
  – SVM (linear kernel)
• Result:
  – Accuracy: 31.15%

Combine URL and content unigrams
• Combine 2 classifiers:
  – Page content: SVM (unigram + TF)
  – Page URL: decision tree (with the “other” category removed)
• Combination rule:
  – If the SVM’s prediction is “other”, predict “other”
  – Otherwise, check the page’s URL: if the decision tree predicts some category with high probability (> 0.5), predict that category; otherwise, use the SVM’s prediction
• Result:
  – Accuracy: 78.28%

Result Comparison
• Content
  – SVM (TF): 68.92%
  – SVM (TF-IDF): 68.89%
  – Naïve Bayes: 16.35% (the content results slide reports 22.40%)
• Link
  – Decision tree (URL): 61.21%
  – SVM (URL): 56.30%
  – SVM (link type): 50.82%
  – SVM (link keywords): 31.15%
• Combination
  – SVM (TF) + decision tree: 78.28%

Comparison with paper [1]
• All categories (our classifier): 78.28%
• Categories without “others” (paper [1]): 72.6%
• All categories (paper [1]): 53.90%

[1] M. Craven et al. Learning to extract symbolic knowledge from the World Wide Web. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, 1998.
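
The rule for combining the content SVM with the URL decision tree can be sketched in a few lines of Python. This is only an illustration of the rule as stated on the slide, not the authors' code: the function name `combine`, the dictionary of per-category probabilities, and the `threshold` parameter are assumptions introduced here.

```python
def combine(svm_label, tree_probs, threshold=0.5):
    """Combine a content-based and a URL-based prediction.

    svm_label:  category predicted by the content SVM (hypothetical input).
    tree_probs: dict mapping category -> probability from the URL decision
                tree, which has no "other" category (hypothetical input).
    threshold:  the "high probability" cut-off; the slides use 0.5.
    """
    # Rule 1: if the content SVM says "other", keep "other".
    if svm_label == "other":
        return "other"

    # Rule 2: trust the URL decision tree only when it is confident.
    if tree_probs:
        best_label, best_prob = max(tree_probs.items(), key=lambda kv: kv[1])
        if best_prob > threshold:
            return best_label

    # Rule 3: otherwise fall back to the SVM's prediction.
    return svm_label
```

For example, `combine("student", {"course": 0.9, "student": 0.1})` returns `"course"`, while a tree that is split evenly between two categories leaves the SVM's label unchanged.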
Conclusion
• Result analysis:
  – The data is quite unbalanced
  – Data in the “other” category is quite noisy
  – A graph model could not be constructed because the content of the linked web pages is not available
• Some methods that might help:
  – Using bigrams / named entities
  – SVM with an RBF kernel
  – Training weights for several different SVM classifiers
  – AdaBoost
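
As an illustration of the URL-based approach, the 17 binary URL attributes listed earlier could be extracted roughly as follows. The slides give only the attribute names, so the exact semantics here are assumptions: keyword tests are case-insensitive substring matches on the whole URL, `has_number` means any digit anywhere in the URL, and `is_root_path` means an empty or "/" path. The function name `url_features` is also hypothetical.

```python
from urllib.parse import urlparse

# Keywords behind the has(...) attributes named on the slide.
KEYWORDS = ["course", "faculty", "people", "user", "~", "project",
            "staff", "student", "pub", "research", "paper"]


def url_features(url):
    """Return the 17 binary URL attributes as a dict of booleans."""
    lower = url.lower()
    path = urlparse(lower).path

    # has(keyword): assumed to be a substring test on the whole URL.
    features = {f"has({kw})": kw in lower for kw in KEYWORDS}

    # Assumed meaning: the URL contains at least one digit.
    features["has_number"] = any(ch.isdigit() for ch in lower)
    # Assumed meaning: the URL points at the site root.
    features["is_root_path"] = path in ("", "/")

    features["endswith(.ps|.pdf)"] = lower.endswith((".ps", ".pdf"))
    features["endswith(.html)"] = lower.endswith(".html")
    features["startwith(ftp)"] = lower.startswith("ftp")
    features["startwith(mailto)"] = lower.startswith("mailto")
    return features


course_url = url_features("http://cs.cornell.edu/course/cse511")
home_url = url_features("http://cs.wustl.edu/~rad/")
```

On the two example URLs from the slides, the first sets `has(course)` and `has_number`, and the second sets `has(~)`. The resulting dictionaries can be turned into 0/1 vectors for a decision tree or a linear SVM.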
