Web Page Classification

Bo Feng, Rong Zou
Data Preparation
• Dataset:
The WebKB dataset contains 8,282 web pages gathered from the
computer science departments of 4 universities, combined with
a miscellaneous collection of web pages.
• 7 categories (quite unbalanced):
– Student (1641)
– Faculty (1124)
– Staff (137)
– Course (930)
– Project (504)
– Department (182)
– Other (3764)
• 4-fold cross-validation:
– Train on three of the universities plus the miscellaneous collection
– Test on the pages from the fourth university
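The split above can be sketched as a leave-one-university-out loop. This is only an illustration: the `source` field, the `Page` type, and the university names in `UNIVERSITIES` are assumptions, since the slides state only that there are 4 universities plus a miscellaneous collection.

```python
from typing import Dict, Iterator, List, Tuple

Page = Dict[str, str]

# Assumed source labels: the slides only mention 4 universities plus a
# miscellaneous collection, so these names are illustrative.
UNIVERSITIES = ["cornell", "texas", "washington", "wisconsin"]


def leave_one_university_out(
    pages: List[Page],
) -> Iterator[Tuple[str, List[Page], List[Page]]]:
    """Yield (held_out, train, test) once per university."""
    for held_out in UNIVERSITIES:
        # Everything not from the held-out university is training data,
        # which keeps the miscellaneous pages in every training set.
        train = [p for p in pages if p["source"] != held_out]
        test = [p for p in pages if p["source"] == held_out]
        yield held_out, train, test


# Tiny hypothetical corpus: one page per source.
pages = [
    {"source": "cornell", "label": "course"},
    {"source": "texas", "label": "student"},
    {"source": "washington", "label": "faculty"},
    {"source": "wisconsin", "label": "other"},
    {"source": "misc", "label": "other"},
]
folds = list(leave_one_university_out(pages))
```

Because the miscellaneous pages never match a held-out university, they land in the training set of every fold and are never used for testing.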
Classification Approaches
• Content based Classification
– Naïve Bayes
– SVM + TF / TF-IDF
• Link based Classification
– Decision Tree based on URL
– SVM based on URL
– SVM based on link’s type
– SVM based on link’s keyword
• Combination of link and content based Classification
– Content (SVM) + URL (decision tree)
Classification based on content
• Data preprocessing
– MIME headers are ignored
– Stop words are not removed
• Features
– Unigram + TF
– Unigram + TF * IDF
• Learning Methods
– Naïve Bayes (Witten-Bell smoothing)
– SVM (linear kernel)
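The two feature sets can be illustrated with a small pure-Python sketch. The slides do not say how pages were tokenized or which IDF variant was used, so the whitespace tokenizer and the `log(N / df)` weighting below are assumptions, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    # Assumed tokenizer: lowercase and split on whitespace.
    # Stop words are kept, as on the slide.
    return text.lower().split()


def tf_vectors(docs: List[str]) -> List[Counter]:
    """Unigram + TF: raw term counts per document."""
    return [Counter(tokenize(d)) for d in docs]


def idf_weights(tf_docs: List[Counter]) -> Dict[str, float]:
    """Assumed IDF variant: log(N / document frequency)."""
    n = len(tf_docs)
    df: Counter = Counter()
    for tf in tf_docs:
        df.update(tf.keys())  # count each term once per document
    return {term: math.log(n / count) for term, count in df.items()}


def tfidf_vectors(tf_docs: List[Counter], idf: Dict[str, float]) -> List[Dict[str, float]]:
    """Unigram + TF-IDF: term count scaled by the term's IDF weight."""
    return [{t: c * idf.get(t, 0.0) for t, c in tf.items()} for tf in tf_docs]


docs = ["the course syllabus", "the faculty page", "the course homework the"]
tf = tf_vectors(docs)
idf = idf_weights(tf)
tfidf = tfidf_vectors(tf, idf)
```

Under this weighting a word that occurs in every page, such as "the" here, gets weight 0 in the TF-IDF vectors but keeps its raw count in the TF vectors. Either set of sparse vectors could then be passed to a linear-kernel SVM.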
Classification based on content
• Result (accuracy)
– Naive Bayes: 22.40%
– SVM (TF): 68.92%
– SVM (TF-IDF): 68.89%
Classification based on URL
• http://cs.cornell.edu/course/cse511 vs
http://cs.wustl.edu/~rad/
• Features (17 binary attributes)
– has(course), has_number, is_root_path, has(faculty),
has(people), has(user), has(~), has(project), has(staff),
has(student), has(pub), has(research), has(paper),
endswith(.ps|.pdf), endswith(.html), startwith(ftp),
startwith(mailto)
• Learning Methods
– Decision Tree
– SVM (linear kernel)
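A minimal sketch of how the 17 binary attributes might be computed from a URL string. The slides give only the attribute names, so the matching rules are assumptions: keywords are matched as substrings of the lowercased URL, `has_number` looks for a digit in the path, and `is_root_path` means an empty path or `/`. The function name `url_features` is hypothetical.

```python
import re
from typing import Dict
from urllib.parse import urlparse

# Keyword attributes listed on the slide.
KEYWORDS = ["course", "faculty", "people", "user", "~", "project",
            "staff", "student", "pub", "research", "paper"]


def url_features(url: str) -> Dict[str, bool]:
    """Map a URL to 17 binary attributes (matching rules are assumed)."""
    u = url.lower()
    path = urlparse(u).path
    feats = {f"has({k})": k in u for k in KEYWORDS}
    feats["has_number"] = bool(re.search(r"\d", path))
    feats["is_root_path"] = path in ("", "/")
    feats["endswith(.ps|.pdf)"] = u.endswith((".ps", ".pdf"))
    feats["endswith(.html)"] = u.endswith(".html")
    feats["startwith(ftp)"] = u.startswith("ftp")
    feats["startwith(mailto)"] = u.startswith("mailto")
    return feats


course = url_features("http://cs.cornell.edu/course/cse511")
home = url_features("http://cs.wustl.edu/~rad/")
```

For the two example URLs above, the first sets `has(course)` and `has_number`, while the second sets only `has(~)`. Separating pages by attributes like these is what the decision tree and the SVM are asked to do.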
Classification based on URL
• Suffered from the “other” category:
– E.g. http://www.cs.wisc.edu/~dyer/cs540/getstarted.html
• Result (accuracy):
– Decision Tree: 61.21% with “other” category, 78.56% without “other” category
– SVM (linear kernel): 56.30% with “other” category, 66.01% without “other” category
Classification based on link’s type
• Idea:
For each web page, classify the type of each link with the
previously trained Decision Tree model, then count how many
times each link type appears.
• Features:
– <link_type: count> pairs
• Learning methods:
– SVM (linear kernel)
• Result:
– Accuracy: 50.82% (all categories);
60.24% (with the “other” category removed)
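The `<link_type: count>` features can be sketched as follows. The real link classifier is the Decision Tree trained on URLs; `toy_classifier` below is only a hypothetical stand-in so that the example runs, and the category names and their order in `CATEGORIES` are assumptions.

```python
from collections import Counter
from typing import Callable, List

# The 7 page categories, in an assumed fixed order for the feature vector.
CATEGORIES = ["student", "faculty", "staff", "course",
              "project", "department", "other"]


def link_type_counts(links: List[str],
                     classify_url: Callable[[str], str]) -> List[int]:
    """Count how many of a page's links fall into each predicted type."""
    counts = Counter(classify_url(u) for u in links)
    return [counts.get(c, 0) for c in CATEGORIES]


def toy_classifier(url: str) -> str:
    # Hypothetical stand-in for the trained Decision Tree model.
    if "course" in url:
        return "course"
    if "~" in url:
        return "student"
    return "other"


links = [
    "http://cs.example.edu/course/cs101",
    "http://cs.example.edu/~bob/",
    "http://cs.example.edu/~amy/",
    "http://cs.example.edu/index.html",
]
vector = link_type_counts(links, toy_classifier)
```

Each page becomes a 7-dimensional count vector, which is then the input to the linear-kernel SVM.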
Classification based on link’s keyword
• Features:
– # (link contains “course”)
– # (link contains number)
– # (link contains “faculty”)
– # (link contains “~”)
– …
– # (link ends with .ps or .pdf)
– # (link starts with mailto)
– …
• Learning method:
– SVM (linear kernel)
• Result:
– Accuracy : 31.15%
Combine URL and content’s unigram
• Combine 2 classifiers:
– Page content: SVM (unigram + TF)
– Page URL: decision tree (with the “other” category removed)
• Learning method:
– If the SVM’s prediction is the “other” category, predict “other”
– Otherwise, check the page’s URL: if the decision tree predicts
some category with high probability (> 0.5), predict that
category; otherwise, use the SVM’s prediction
• Result:
– Accuracy: 78.28%
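The combination rule above can be written as a short function. This is a sketch of the decision logic only: the function name and argument names are hypothetical, and the two classifiers are assumed to have already produced their predictions.

```python
def combine(svm_label: str, tree_label: str, tree_prob: float,
            threshold: float = 0.5) -> str:
    """Combine the content SVM and the URL decision tree predictions.

    svm_label:  category predicted by the content SVM (may be "other")
    tree_label: category predicted by the URL decision tree
    tree_prob:  probability the decision tree assigns to tree_label
    """
    # The decision tree has no "other" category, so the SVM alone
    # decides whether a page is "other".
    if svm_label == "other":
        return "other"
    # Trust the URL-based prediction only when it is confident.
    if tree_prob > threshold:
        return tree_label
    return svm_label


examples = [
    combine("other", "course", 0.9),     # SVM says "other"
    combine("student", "course", 0.8),   # confident tree overrides SVM
    combine("student", "course", 0.4),   # unsure tree, fall back to SVM
]
```

The ordering matters: checking for “other” first means a confident URL prediction never overrides the SVM's “other” decision.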
Result Comparison
• Combination
– SVM (TF) + Decision Tree: 78.28%
• Content
– SVM (TF): 68.92%
– SVM (TF-IDF): 68.89%
– Naive Bayes: 16.35%
• Link
– Decision Tree (URL): 61.21%
– SVM (URL): 56.30%
– SVM (link’s type): 50.82%
– SVM (link’s keyword): 31.15%
Comparison with paper[1]
• All categories (our classifier): 78.28%
• Categories without “others” (paper [1]): 72.6%
• All categories (paper [1]): 53.90%
[1] M. Craven et al. Learning to extract symbolic knowledge from the World Wide Web. In
Proceedings of the Fifteenth National Conference on Artificial Intelligence, 1998.
Conclusion
• Result Analysis:
– Data is quite unbalanced
– Data in “other” category is quite noisy
– Cannot construct a graph model because the content of the
linked web pages is unavailable
• Some methods might help:
– Using bigrams / named entities
– SVM with RBF kernel
– Train weights for several different SVM classifiers
– AdaBoost