Review of the web page classification approaches and applications

Luu-Ngoc Do
Quang-Nhat Vo
Contents
• Introduction
• Applications
• Features
• Algorithms
• Experiments
Introduction
• Large amount of web pages on the World Wide
Web
• Web information retrieval tasks: crawling, searching, extracting knowledge bases (KBs), …
Introduction
• Subject classification: consider the subject or topic
of web page. Example: “business”, “sports”,…
• Functional classification: role of web pages.
Example: course page, researcher homepage,…
Applications
• Improving quality of search result
• Building focused crawler
• Extracting KBs
Improving Search Results
• Resolve query ambiguity
• The user is asked to specify a category before searching (Chekuri et al. [1997])
• Present a categorized view of the results to users (Kaki [2005])
Building Focused Crawler
• When only domain-specific queries are expected,
performing a full crawl is usually inefficient.
• Only documents relevant to a predefined set of
topics are of interest. (Chakrabarti et al. [1999])
Extracting KBs
• Store complex structured and unstructured information from the World Wide Web to create a computer-understandable environment.
• First step: recognize class instances by classifying web content. (Craven et al. [1998])
Feature Selection
• Textual contents, HTML tags, hyperlinks,
anchor texts
• On-page features
• Neighbors features
On-page Features
• Textual Content
▫ Bag-of-words
▫ N-gram representation: sequences of n consecutive words (Mladenic [1998]). Example: the bigram "new york" in addition to the single words "new" and "york"
• HTML tags: Ardo [2005]
• URL: Kan and Thi [2005], Sujatha [2013].
Advantage: reduces processing time
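The on-page textual features above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any cited paper: the tokenizer (lowercase alphanumeric runs) and the function names `on_page_features` and `url_tokens` are assumptions made for the example.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def on_page_features(text, max_n=2):
    """Count all 1-grams (bag-of-words) up to max_n-grams in the text."""
    tokens = tokenize(text)
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


def url_tokens(url):
    """Split a URL into tokens, so a page can be described without fetching it."""
    return tokenize(url)


feats = on_page_features("New York")
# feats holds the unigrams "new" and "york" plus the bigram "new york"
```

With `max_n=1` the function reduces to a plain bag-of-words count.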
Neighbors Features (1)
• Weak assumption: the neighbors of pages that belong to the same category share common characteristics
• Strong assumption: a page is much more likely to
be surrounded by pages of the same category.
Neighbors Features (2)
Neighbors Features (3)
• Sibling pages are more useful than parents and
children. (Chakrabarti et al. [1998], Qi and
Davison [2006])
• The content of neighbors needs to be sufficiently similar to the target page. (Oh et al. [2000])
• Using a portion of content on parent and child
pages: title, anchor text, and the surrounding
text of anchor text on the parent pages
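As a rough sketch of using only a portion of a parent page, the snippet below collects the parent's title and the anchor text of links pointing at the target page, using Python's standard `html.parser`. The class name `AnchorTextCollector`, the function `neighbor_features`, and the sample HTML are hypothetical; collecting the text surrounding each anchor is left out for brevity.

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collects the <title> text and the anchor text of every link on a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.anchors = {}          # href -> list of anchor texts
        self._in_title = False
        self._current_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            self._current_href = dict(attrs).get("href")
            if self._current_href is not None:
                self.anchors.setdefault(self._current_href, []).append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a":
            self._current_href = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._current_href is not None:
            self.anchors[self._current_href][-1] += data


def neighbor_features(parent_html, target_url):
    """Return the parent title and the anchor texts that link to target_url."""
    parser = AnchorTextCollector()
    parser.feed(parent_html)
    texts = [t.strip() for t in parser.anchors.get(target_url, [])]
    return {"parent_title": parser.title.strip(), "anchor_texts": texts}


# Hypothetical parent page linking to two course pages.
parent = """<html><head><title>CS Department Courses</title></head>
<body><p>Fall offerings: <a href="/cs101.html">Intro to Programming</a> and
<a href="/cs540.html">Machine Learning</a>.</p></body></html>"""

features = neighbor_features(parent, "/cs101.html")
```

The resulting strings can then be tokenized and added to the target page's own feature vector.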
Algorithms
• k-NN
• Co-training
• Naïve Bayes
K-NN
• Kwon and Lee [2000]
• Bag-of-words
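A generic k-NN classifier over bag-of-words vectors might look like the sketch below. It is not Kwon and Lee's specific method: the cosine similarity, the unweighted majority vote, and the tiny training set are all assumptions made for illustration.

```python
from collections import Counter
import math


def bag_of_words(text):
    """Map a text to word counts (whitespace tokenization, assumed)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(text, training, k=3):
    """Label a page by majority vote among its k most similar training pages."""
    query = bag_of_words(text)
    ranked = sorted(training,
                    key=lambda ex: cosine(query, bag_of_words(ex[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training pages, reduced to a few words each.
training = [
    ("lecture notes homework syllabus exam", "course"),
    ("syllabus lecture schedule assignments", "course"),
    ("professor research publications teaching", "faculty"),
    ("professor publications students awards", "faculty"),
]

predicted = knn_classify("homework and exam schedule for the lecture", training, k=3)
```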
Co-training
• Blum and Mitchell [1998]
• Labeled and unlabeled data
• Two classifiers that are trained on different sets
of features are used to classify the unlabeled
instances.
• The prediction of each classifier is used to train
the other.
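The co-training loop described above can be sketched as follows. This is a simplified illustration rather than the exact procedure of Blum and Mitchell: each classifier labels only its single most confident unlabeled page per round, and `WordVoteClassifier` is a toy classifier invented for the example. The two views (page words and anchor words) and the sample data are likewise assumptions.

```python
from collections import Counter, defaultdict


class WordVoteClassifier:
    """Toy classifier: scores a class by how often its words were seen with that class."""

    def fit(self, docs, labels):
        self.counts = defaultdict(Counter)
        for words, label in zip(docs, labels):
            self.counts[label].update(words)
        return self

    def predict(self, words):
        """Return (label, confidence); confidence is the winner's share of all votes."""
        scores = {c: sum(cnt[w] for w in words) for c, cnt in self.counts.items()}
        total = sum(scores.values())
        if total == 0:
            return None, 0.0
        best = max(scores, key=scores.get)
        return best, scores[best] / total


def co_train(labeled, unlabeled, rounds=5):
    """labeled: list of ((view1, view2), label); unlabeled: list of (view1, view2)."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        labels = [y for _, y in labeled]
        clf1 = WordVoteClassifier().fit([x[0] for x, _ in labeled], labels)
        clf2 = WordVoteClassifier().fit([x[1] for x, _ in labeled], labels)
        picked = {}
        for view, clf in ((0, clf1), (1, clf2)):
            # Each classifier nominates the unlabeled page it is most confident about.
            best_i, best_label, best_conf = None, None, 0.0
            for i, x in enumerate(unlabeled):
                label, conf = clf.predict(x[view])
                if conf > best_conf:
                    best_i, best_label, best_conf = i, label, conf
            if best_i is not None and best_i not in picked:
                picked[best_i] = best_label
        if not picked:
            break
        # Newly labeled pages join the shared pool, so each classifier's
        # predictions become training data for the other in the next round.
        labeled += [(unlabeled[i], y) for i, y in picked.items()]
        unlabeled = [x for i, x in enumerate(unlabeled) if i not in picked]
    labels = [y for _, y in labeled]
    clf1 = WordVoteClassifier().fit([x[0] for x, _ in labeled], labels)
    clf2 = WordVoteClassifier().fit([x[1] for x, _ in labeled], labels)
    return clf1, clf2, labeled


# Hypothetical data: view 1 = words on the page, view 2 = words in anchor text.
labeled = [
    ((["syllabus", "homework"], ["course", "cs101"]), "course"),
    ((["research", "publications"], ["professor", "smith"]), "faculty"),
]
unlabeled = [
    (["homework", "exam"], ["course", "cs540"]),
    (["publications", "awards"], ["prof", "jones"]),
    (["exam", "grading"], ["cs540", "notes"]),
]

clf1, clf2, final = co_train(labeled, unlabeled)
```

In this toy run the third unlabeled page is labeled by the anchor-text classifier only after it has learned the token "cs540" from a page that the page-text classifier labeled in the previous round.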
Web Page Classification using Naive Bayes
• Bernoulli model: a document is represented by a feature vector
with binary elements taking value 1 if the corresponding word
is present in the document and 0 if the word is not present
▫ E.g.: consider a six-word vocabulary and the short document “the blue dog ate a blue biscuit”.
If only the first, third, and fifth vocabulary words occur in the document, the Bernoulli feature vector is: b = (1, 0, 1, 0, 1, 0)^T
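• Consider a web page D whose class is given by C. We classify D as the class that has the highest posterior probability P(C | D), which by Bayes' rule is:
\[ P(C \mid D) = \frac{P(D \mid C)\, P(C)}{P(D)} \propto P(D \mid C)\, P(C) \]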
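The Bernoulli feature vector for the example document can be computed as sketched below. The vocabulary list is hypothetical: it is chosen only so that its first, third, and fifth words occur in the document, which reproduces the vector b = (1, 0, 1, 0, 1, 0) given in the example.

```python
# Hypothetical six-word vocabulary (an assumption for this sketch).
VOCAB = ["blue", "red", "dog", "cat", "biscuit", "apple"]


def bernoulli_vector(document, vocab):
    """Return 1 for each vocabulary word present in the document, else 0."""
    present = set(document.lower().split())
    return [1 if word in present else 0 for word in vocab]


b = bernoulli_vector("the blue dog ate a blue biscuit", VOCAB)
```

Note that "blue" occurs twice in the document but still contributes a single 1, because the Bernoulli model records presence, not frequency.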
Web Page Classification using Naive Bayes
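• The document likelihood P(D_i | C):
\[ P(D_i \mid C) = P(\mathbf{b}_i \mid C) = \prod_{t=1}^{|V|} \left[ b_{it}\, P(w_t \mid C) + (1 - b_{it})\,\bigl(1 - P(w_t \mid C)\bigr) \right] \]
Where:
b_i: Bernoulli feature vector of document D_i, with elements b_{it}; |V| is the vocabulary size.
P(w_t | C): the probability of word w_t occurring in a document of class C, estimated as
\[ \hat{P}(w_t \mid C = k) = \frac{n_k(w_t)}{N_k} \]
n_k(w_t): the number of documents of class C = k in which w_t is observed.
N_k: the total number of documents of class C = k.
• The prior term:
\[ \hat{P}(C = k) = \frac{N_k}{N} \]
where N is the total number of training documents.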
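A minimal sketch of training and applying the Bernoulli Naive Bayes model follows. It estimates the word probabilities and priors from the document counts n_k(w_t) and N_k as defined above, with one addition that is not in the slides: a smoothing parameter `alpha` (default 1) that keeps the estimates away from 0 and 1 so the logarithms are defined. The function names and the four-word toy data are assumptions.

```python
import math


def train_bernoulli_nb(vectors, labels, alpha=1.0):
    """Estimate P(C=k) = N_k / N and P(w_t | C=k) from binary feature vectors.

    With alpha > 0 the word estimate is (n_k(w_t) + alpha) / (N_k + 2 * alpha);
    this smoothing is an assumption added to avoid log(0).
    """
    n_total = len(labels)
    n_words = len(vectors[0])
    priors, word_probs = {}, {}
    for k in sorted(set(labels)):
        docs_k = [b for b, y in zip(vectors, labels) if y == k]
        n_k = len(docs_k)
        priors[k] = n_k / n_total
        word_probs[k] = [
            (sum(b[t] for b in docs_k) + alpha) / (n_k + 2 * alpha)
            for t in range(n_words)
        ]
    return priors, word_probs


def classify(b, priors, word_probs):
    """Return the class with the highest log posterior for feature vector b."""
    def log_posterior(k):
        score = math.log(priors[k])
        for b_t, p in zip(b, word_probs[k]):
            # Present words contribute P(w_t|C); absent words contribute 1 - P(w_t|C).
            score += math.log(p if b_t else 1.0 - p)
        return score
    return max(priors, key=log_posterior)


# Toy data over a hypothetical vocabulary:
# ["homework", "exam", "research", "publications"]
vectors = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]]
labels = ["course", "course", "faculty", "faculty"]

priors, word_probs = train_bernoulli_nb(vectors, labels)
predicted = classify([0, 1, 0, 0], priors, word_probs)
```

Working in log space avoids numerical underflow when the vocabulary is large, since the likelihood is a product of one factor per vocabulary word.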
Experimental Results
• Dataset: WebKB
▫ Contains 8145 web pages.
▫ Seven categories: student, faculty, staff, course, project, department, and other.
▫ Data is collected from the departments of four universities (Cornell, Texas, Washington, Wisconsin), plus some pages from other universities.
• Experimental setup:
▫ Select the four most populous categories: student, faculty, course, and project.
▫ Training data: Cornell, Washington, Texas, and miscellaneous pages collected from other universities.
▫ Testing data: Wisconsin.
Experimental Results
• Experimental result:
Classes    # of training pages    # of testing pages    Accuracy
Faculty    1082                   42                    0.8182
Course     845                    85                    0.8851
Student    1485                   156                   0.7595
Project    479                    25                    0.8148
THANK YOU