World Wide Knowledge Base Webpage Classification
Midway Report
Bo Feng (108809282)
Rong Zou (108661275)
1. Goal
To learn classifiers that predict the type of a webpage from its text.
To develop a probabilistic, symbolic knowledge base that mirrors the content of the World Wide Web. If successful, this will make text information on the web available in computer-understandable form, enabling much more sophisticated information retrieval and problem solving.
2. Relevant Work
3. Problem Analysis
We are developing a system that can classify webpages. For this part, our goal is to develop a probabilistic, symbolic knowledge base that mirrors the content of the World Wide Web [1] using three approaches. If successful, we will try to develop an application for fitting pages to mobile devices. This will make text information on the web available in computer-understandable form, enabling much more sophisticated information retrieval and problem solving.

First, we will learn classifiers to predict the type of a webpage from its text using Naïve Bayes and SVM (unigram and bigram features). We will also try to improve accuracy by exploiting correlations between pages that point to each other [2]. In addition, to segment web pages into meaningful parts (bio, publications, etc.), we would like to employ the decision tree method [3]: intuitively, there should be rules that distinguish those parts; for example, the biography section of a professor's page may use <img> and <p> HTML tags, while the "publications" section may use <ul> and <li> tags (see the sketch below). Finally, once we have learned these classifiers, we would like to use them to build an application: given a web page URL, it will rearrange the content and generate a page formatted for mobile users, which we think will be quite useful.
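To make the segmentation idea concrete, here is a minimal sketch, assuming labeled HTML fragments are already available for each section type; the tag list and the fragments/section_labels variables are illustrative placeholders, not part of our system yet:

```python
# Minimal sketch: count a few HTML tags per candidate section and feed the
# counts to a decision tree. Tag list and section labels are illustrative.
from collections import Counter
from html.parser import HTMLParser

from sklearn.tree import DecisionTreeClassifier

TAGS = ["img", "p", "ul", "li", "table", "a"]

class TagCounter(HTMLParser):
    """Counts start tags in an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1

def tag_features(fragment):
    """Turn one HTML fragment into a fixed-length tag-count vector."""
    parser = TagCounter()
    parser.feed(fragment)
    return [parser.counts[t] for t in TAGS]

# Hypothetical training data: fragments labeled "bio", "publications", ...
# X = [tag_features(f) for f in fragments]
# tree = DecisionTreeClassifier(max_depth=5).fit(X, section_labels)
```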
4. Approach
We are developing a system that can be trained to extract symbolic knowledge from
hypertext, using a variety of machine learning methods:
1) Naïve Bayes
2) SVM
3) Bayesian Network
4) HMM
In addition, we will try to improve accuracy by (1) exploiting correlations between pages that point to each other, and/or (2) segmenting the pages into meaningful parts (bio, publications, etc.). A sketch of the first two classifiers is shown below.
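As a concrete starting point for methods 1) and 2), a minimal sketch using scikit-learn; it assumes the pages have already been loaded as raw text (train_texts, train_labels, test_texts are placeholders) and uses LinearSVC as the SVM:

```python
# Minimal sketch: Naive Bayes and SVM over unigram+bigram word counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def build_classifiers():
    # ngram_range=(1, 2) yields both unigram and bigram features.
    return {
        "naive_bayes": make_pipeline(
            CountVectorizer(ngram_range=(1, 2)), MultinomialNB()),
        "svm": make_pipeline(
            CountVectorizer(ngram_range=(1, 2)), LinearSVC()),
    }

# Usage with hypothetical data:
# clf = build_classifiers()["svm"]
# clf.fit(train_texts, train_labels)
# predicted_types = clf.predict(test_texts)
```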
5. Dataset and Features
5.1 Dataset
A dataset from four universities containing 8,282 web pages and hyperlink data, labeled by whether they are professor, student, project, or other pages.
For each class the data set contains pages from the four universities:
 Cornell (867)
 Texas (827)
 Washington (1205)
 Wisconsin (1263)
plus 4,120 miscellaneous pages collected from other universities.
5.2 Test & Train Data Splits
Since each university's web pages have their own idiosyncrasies, we train on three of the universities plus the miscellaneous collection, and test on the pages from the fourth, held-out university. This amounts to a four-fold, leave-one-university-out cross-validation, sketched below.
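A minimal sketch of this split; pages is an assumed dict mapping each source name ("cornell", ..., "misc") to a (texts, labels) pair:

```python
# Minimal sketch: leave-one-university-out splits over the four universities,
# always keeping the miscellaneous pages in the training set.
UNIVERSITIES = ["cornell", "texas", "washington", "wisconsin"]

def leave_one_university_out(pages):
    """Yield (held_out_name, train_set, test_set) for each fold."""
    for held_out in UNIVERSITIES:
        train_texts, train_labels = [], []
        for source in UNIVERSITIES + ["misc"]:
            if source == held_out:
                continue
            texts, labels = pages[source]
            train_texts.extend(texts)
            train_labels.extend(labels)
        yield held_out, (train_texts, train_labels), pages[held_out]
```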
5.3 Feature Selection
We will use two types of methods to select features. The feature set or "vocabulary" size may be reduced by occurrence counts or by average mutual information with the class variable, which we also call "information gain".
1) Word Counts and Probabilities
 Remove words that occur in N or fewer documents.
 Remove words that occur fewer than N times.
2) Information Gain
Remove all but the top N words, selected by highest average mutual information with the class variable. The default is N = 0, a special case that removes no words. Both pruning strategies are sketched below.
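A minimal sketch of both strategies with scikit-learn; the thresholds (min_docs, top_n) are placeholders, and removing words by total occurrence count would need one extra pass over the count matrix:

```python
# Minimal sketch: (1) document-frequency pruning via CountVectorizer's min_df,
# (2) keep the top-N words by mutual information ("information gain") with
# the class via SelectKBest. Threshold values are placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def select_features(train_texts, train_labels, min_docs=3, top_n=1000):
    # Drop words occurring in `min_docs` or fewer documents.
    vectorizer = CountVectorizer(min_df=min_docs + 1)
    X = vectorizer.fit_transform(train_texts)
    # Keep the top_n words by mutual information with the class variable.
    selector = SelectKBest(mutual_info_classif, k=min(top_n, X.shape[1]))
    X_selected = selector.fit_transform(X, train_labels)
    return vectorizer, selector, X_selected
```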
In order to apply maximum entropy to this domain, we need to select a set of features to use for setting the constraints. For text classification with maximum entropy, we use word counts as our features. More specifically, for each word-class combination we instantiate a feature as defined below.
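Assuming the standard maximum-entropy word-count feature, where N(d, w) is the number of times word w occurs in document d and N(d) is the total number of words in d, the feature for word w and class c' is:

\[
f_{w,c'}(d, c) =
\begin{cases}
0 & \text{if } c \neq c', \\[4pt]
\dfrac{N(d, w)}{N(d)} & \text{otherwise.}
\end{cases}
\]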
6. Results
Expected Results:
Milestone Result: Learn classifiers using at least two of these methods on a subset of the dataset: 1) develop a system that can be trained to extract symbolic knowledge from hypertext using Naïve Bayes and SVM; 2) exploit correlations between pages that point to each other; 3) segment the pages into meaningful parts using a decision tree.
Actual Milestone Result:
Final Result: Combine the three approaches to learn a good classifier for the whole dataset; hopefully, we can improve accuracy by 10% with our new combined algorithm. Then we will try to develop an application for fitting webpages to mobile devices using our trained model.
7. References
[1] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam and S. Slattery. Learning to Construct Knowledge Bases from the World Wide Web. Artificial Intelligence, 2000.
[2] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam and S. Slattery. Learning to Extract Symbolic Knowledge from the World Wide Web. Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98), 1998.
[3] R. Ghani, R. Jones, D. Mladenic, K. Nigam and S. Slattery. Data Mining on Symbolic Knowledge Extracted from the Web. KDD-2000 Workshop on Text Mining, 2000.