
World Wide Knowledge Based Webpage Classification
Midway Report
Bo Feng (108809282)
Rong Zou (108661275)
1. Goal
To learn classifiers to predict the type of a webpage from the text.
2. Relevant work
Automatic text classification has always been an important application and research topic since the
inception of digital documents. Intuitively, text classification is the task of classifying a document under
a predefined category. More formally, if d_i is a document in the entire set of documents D and
{c_1, c_2, ..., c_n} is the set of all categories, then text classification assigns one category c_j to a
document d_i.
2.1 Feature Selection
The aim of feature-selection methods is to reduce the dimensionality of the dataset by removing
features that are considered irrelevant for classification. Feature subset selection methods for the text
document classification task use an evaluation function that is applied to a single word. Scoring of
individual words (Best Individual Features) can be performed with measures such as document frequency,
term frequency, mutual information, and information gain.
2.1.1 Term frequency TF(t, d)
For the term frequency TF(t, d), we choose a normalized frequency to prevent a bias towards
longer documents, e.g. the raw frequency divided by the maximum raw frequency of any term in the
document:

TF(t, d) = f(t, d) / max{ f(w, d) : w ∈ d }
2.1.2 Inverse document frequency IDF(t, D)
The inverse document frequency is a measure of whether the term is common or rare across all
documents. It is obtained by dividing the total number of documents |D| by the number of documents
containing the term, |{d ∈ D : t ∈ d}|, and then taking the logarithm of that quotient:

IDF(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )
2.1.3 Term frequency–inverse document frequency TF-IDF(t, d, D) [4]
TF-IDF is the product of the two statistics, term frequency and inverse document frequency, and is calculated as:

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)
A high TF-IDF weight is reached by a high term frequency (in the given document) and a low
document frequency of the term in the whole collection of documents; the weights hence tend to filter out
common terms.
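
As a minimal illustration of the three weights defined above, the sketch below computes them in Python over a small hypothetical tokenized corpus; the function names and example documents are illustrative only, not part of our system.

    # Minimal sketch of the TF / IDF / TF-IDF weights defined above.
    # `docs` is a hypothetical list of tokenized documents (lists of words).
    import math
    from collections import Counter

    def tf(term, doc_tokens):
        # Normalized term frequency: raw count divided by the maximum
        # raw count of any term in the document (Section 2.1.1).
        counts = Counter(doc_tokens)
        return counts[term] / max(counts.values())

    def idf(term, docs):
        # log of (total documents / documents containing the term), Section 2.1.2.
        n_containing = sum(1 for d in docs if term in d)
        return math.log(len(docs) / n_containing) if n_containing else 0.0

    def tf_idf(term, doc_tokens, docs):
        return tf(term, doc_tokens) * idf(term, docs)

    docs = [["machine", "learning", "course"],
            ["faculty", "machine", "learning", "research"]]
    print(tf_idf("course", docs[0], docs))  # ~0.69: present here, absent from the other document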
2.2 Machine Learning Algorithms
After feature selection and transformation, the documents can easily be represented in a form that can be
used by an ML algorithm. Many text classifiers have been proposed in the literature using machine
learning techniques, probabilistic models, etc. They often differ in the approach adopted: Decision Trees,
Naïve Bayes, Neural Networks, Nearest Neighbors, and, lately, Support Vector Machines.
Naïve Bayes is often used in text classification applications and experiments because of its simplicity
and effectiveness [6]. Support vector machines (SVMs), when applied to text classification, also usually
provide excellent precision but poor recall. One means of customizing SVMs to improve recall is to
adjust the threshold associated with the SVM. Shanahan and Roma described an automatic process for
adjusting the thresholds of generic SVMs [7] with better results.
3. Problem Analysis
We are developing a system that can classify webpages. For this part, our goal is to develop a
probabilistic, symbolic knowledge base that mirrors the content of the World Wide Web [1]. This will
make text information on the web available in computer-understandable form, enabling much more
sophisticated information retrieval and problem solving.
First, we will learn classifiers to predict the type of a webpage from its text using Naïve Bayes and
SVM, with unigrams as features. We will also try to improve accuracy by exploiting correlations between
pages that point to each other [2].
4. Approaches
Figure 1 gives a graphical representation of our webpage classification process.

[Figure 1: Webpage classification process. Pipeline: parse web page content → vector representation of web page → feature selection → learning algorithm]
4.1 Parse web page content
We extract text features using TF and TF-IDF (term frequency–inverse document frequency), which
reflects how important a word is to a document in a collection or corpus.
4.2 Algorithms
We are developing a system that can be trained to extract symbolic knowledge from hypertext, using two
machine learning methods based on unigrams: 1) Naïve Bayes, with Witten-Bell smoothing of the
probabilities; 2) SVM, with features based on TF and TF-IDF.
5. Dataset
5.1 Dataset
The WebKB dataset [2] contains 8,282 web pages gathered from the computer science departments of four
universities. The pages are labeled with seven categories: student, faculty, staff, course, project,
department, and other. For each class, the dataset contains pages from four universities, Cornell (867),
Texas (827), Washington (1205), and Wisconsin (1263), plus 4,120 miscellaneous pages collected from other
universities.
5.2 Train & Test Data Splits
Since each university's web pages have their own idiosyncrasies, we train on three of the universities plus
the miscellaneous collection and test on the pages from the fourth, held-out university, i.e. 4-fold
cross-validation.
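
A minimal sketch of this leave-one-university-out split is shown below; the layout of `pages` as (university, text, label) tuples is an assumption made for illustration.

    # Sketch of the leave-one-university-out split: train on three universities
    # plus the miscellaneous pages, test on the held-out one.
    UNIVERSITIES = ["cornell", "texas", "washington", "wisconsin"]

    def folds(pages):
        # pages: hypothetical list of (university, text, label) tuples,
        # where miscellaneous pages carry a university value of "misc".
        for holdout in UNIVERSITIES:
            train = [p for p in pages if p[0] != holdout]   # includes "misc"
            test = [p for p in pages if p[0] == holdout]
            yield holdout, train, test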
6. Results
First we try to learn a model using only the actual content of the web pages, without considering HTML tags
or the relationships between different pages. In the data pre-processing step, we remove HTML tags and
ignore the HTML MIME header information in each web page, then extract unigrams after removing
punctuation. We do not apply stemming or stop-word removal; the reason is that even stop words may give us
extra information. For example, web documents in one category may contain more occurrences of "the" or
"in" than documents in other categories.
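
A rough sketch of this pre-processing step is given below; the regex-based tag stripping is only an assumption about how the HTML could be removed and is not tied to any particular library.

    # Rough sketch of the pre-processing: strip HTML tags, drop punctuation,
    # and keep every remaining lowercase token as a unigram (no stemming,
    # no stop-word removal).
    import re

    def extract_unigrams(html):
        text = re.sub(r"<[^>]+>", " ", html)          # remove HTML tags
        text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # remove punctuation
        return text.lower().split()                   # unigrams, stop words kept

    print(extract_unigrams("<html><body>The CS course page.</body></html>"))
    # ['the', 'cs', 'course', 'page']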
After we obtain the unigrams for each web document, we employ the Naive Bayes and SVM algorithms to build
our learning models, and we obtain the following classification accuracy for each model:
Table 1: Classification accuracy of three classifiers

             Naïve Bayes    SVM (TF)    SVM (TF-IDF)
Fold 1       25.26%         68.86%      69.67%
Fold 2       16.57%         67.96%      67.47%
Fold 3       21.49%         68.30%      66.22%
Fold 4       26.29%         70.55%      72.21%
Average      22.40%         68.92%      68.89%
6.1 Naïve Bayes classifier
In the Naive Bayes part, we first build the probability of a unigram given a class, P(w|c). For example,
to calculate P(professor|faculty) on the training dataset, we use the occurrence count of the word
"professor" as the numerator and the total word count of the documents in the faculty category as the
denominator. We also use the Witten-Bell smoothing algorithm [10] to smooth the probabilities, in order to
handle unigrams that do not appear in the training dataset but do appear in the test dataset.
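
A minimal sketch of this smoothing step is given below, assuming the standard Witten-Bell formulation from [10]: seen words get count/(N+T) and the unseen words share a total mass of T/(N+T), spread evenly over the Z vocabulary words never seen in the class.

    # Witten-Bell smoothed class-conditional unigram model P(w|c).
    # N: token count for the class, T: distinct word types seen in the class,
    # Z: vocabulary words never seen in the class.
    from collections import Counter

    def witten_bell(class_tokens, vocabulary):
        counts = Counter(class_tokens)
        N = sum(counts.values())          # tokens in the class
        T = len(counts)                   # distinct types seen in the class
        Z = max(len(vocabulary) - T, 1)   # unseen types (avoid division by zero)

        def p(word):
            if counts[word] > 0:
                return counts[word] / (N + T)
            return T / (Z * (N + T))
        return p

    p_faculty = witten_bell(["professor", "research", "professor"],
                            {"professor", "research", "student", "course"})
    print(p_faculty("professor"), p_faculty("course"))  # 0.4 and 0.2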
After we build P(w|c) for each word and each category, we use the Naive Bayes algorithm to classify a new
web document. Based on the assumption that the words in a document are independent given its category
(here we use cat to denote a category and doc to denote a document), we simply use:

P(cat|doc) ∝ P(doc|cat) × P(cat) = ∏_{i=1}^{n} P(word_i|cat) × P(cat)

After we calculate this score for each category, we assign the new document the category with the
maximum probability.
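
In practice we evaluate this product in log space to avoid numerical underflow; a small sketch is given below, where `class_models` is assumed to map each category to its smoothed P(w|c) function (e.g. from the Witten-Bell sketch above) and `priors` to map each category to P(cat).

    # Naive Bayes decision rule computed with log probabilities.
    import math

    def classify(doc_tokens, class_models, priors):
        def log_score(cat):
            return math.log(priors[cat]) + sum(
                math.log(class_models[cat](w)) for w in doc_tokens)
        # return the category with the maximum (log) score
        return max(class_models, key=log_score)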
However, the result of this algorithm is quite poor: after 4-fold cross-validation, the average accuracy is
only 22.40%, which is only slightly better than choosing a category at random (1/7 = 14.29%). The reasons
for the low accuracy might be: 1) the dataset is unbalanced: the "other" category, a collection of pages
that were not deemed the "main page" representing an instance of the previous six categories, accounts for
45.44% of the pages, while the remaining six categories account for only 1% to 20% each; 2) the features
we used cannot model the documents well; using P(w|c) × TF as the probability might be an improvement,
and removing stop words may also help.
6.2 Unigram + SVM classifier (TF/TF-IDF)
In the SVM learning part, also based on our unigrams, we try two different features: one is TF (term
frequency), the other is TF × IDF as described above. We use the svm_multiclass tool in SVMlight [9] to
build our model. In this training task we use an SVM with a linear kernel; we tried the slack parameter C
at different values (0.01, 0.1, 1, 10, 100, 1000, 5000, 10^4, 5×10^4, 10^5, 5×10^5, 10^6, 5×10^6, 10^7,
5×10^7, 10^8), found the best C on our holdout data, and then applied that model to our test dataset.
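
The C selection loop can be sketched as follows. Here scikit-learn's LinearSVC is used purely as a stand-in for the svm_multiclass tool we actually run, and the feature matrices and label vectors are assumed to be prepared elsewhere.

    # Pick the slack parameter C by accuracy on the held-out university.
    # LinearSVC is a stand-in for svm_multiclass; X_*, y_* are assumed inputs.
    from sklearn.svm import LinearSVC

    C_GRID = [0.01, 0.1, 1, 10, 100, 1000, 5000,
              1e4, 5e4, 1e5, 5e5, 1e6, 5e6, 1e7, 5e7, 1e8]

    def select_C(X_train, y_train, X_holdout, y_holdout):
        best_C, best_acc = None, -1.0
        for C in C_GRID:
            model = LinearSVC(C=C).fit(X_train, y_train)
            acc = model.score(X_holdout, y_holdout)
            if acc > best_acc:
                best_C, best_acc = C, acc
        return best_C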
To be more specific, we calculate the TF value as a single word's count divided by the total word count in
the document, and we calculate IDF as follows: for each word we count how many documents it appears in,
D(w); given the total number of web documents in the training set, D, the IDF value of a word is:

IDF(word) = log( D / D(w) )
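
A small sketch of this feature construction is given below; the names are illustrative.

    # TF here is count / total words in the document; IDF is log(D / D(w))
    # computed on the training documents only.
    import math
    from collections import Counter

    def idf_table(train_docs):
        D = len(train_docs)
        df = Counter(w for doc in train_docs for w in set(doc))
        return {w: math.log(D / df[w]) for w in df}

    def tf_vector(doc_tokens):
        counts = Counter(doc_tokens)
        total = len(doc_tokens)
        return {w: c / total for w, c in counts.items()}

    def tf_idf_vector(doc_tokens, idf):
        # Words unseen in the training set have no IDF entry and are dropped.
        return {w: tf * idf[w] for w, tf in tf_vector(doc_tokens).items() if w in idf}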
The result is much better than the Naive Bayes classifier. Using TF as the feature, after 4-fold
cross-validation and applying our model to the test dataset, we get an average accuracy of 68.92%; using
TF × IDF as the feature, the average accuracy is 68.89%. They are basically the same: introducing the extra
IDF value does not actually improve our model.
7. Future Work
Here, simply by using an SVM with unigram TF / TF-IDF features, we already reach about 69% accuracy. There
is still much information we did not use, such as each word's pointwise mutual information [11], the words'
POS tags, and, more importantly, the HTML tags and the link relationships between pages, which can give us
much extra information.

Thus, next we will try to use PMI information with the SVM again to see whether it gives a better result.
Also, with the HTML tags and the pages' link relationships, we would like to employ an HMM or CRF algorithm
to build a probabilistic graphical model, which may give us promising accuracy.
8. References
[1] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to
Construct Knowledge Bases from the World Wide Web. Artificial Intelligence.
[2] M. Craven et al. Learning to Extract Symbolic Knowledge from the World Wide Web. In Proceedings of the
Fifteenth National Conference on Artificial Intelligence (AAAI-98), pages 509-516, 1998.
[3] Rayid Ghani, Rosie Jones, Dunja Mladenic, Kamal Nigam, and Sean Slattery. Data Mining on Symbolic
Knowledge Extracted from the Web. KDD-2000 Workshop on Text Mining, 2000.
[4] tf-idf, http://en.wikipedia.org/wiki/Tfidf
[5] G. Zu, W. Ohyama, T. Wakabayashi, and F. Kimura. Accuracy Improvement of Automatic Text Classification
Based on Feature Transformation.
[6] S. B. Kim et al. Effective Methods for Improving Naive Bayes Text Classifiers.
[7] J. Shanahan and N. Roma. Improving SVM Text Classification Performance through Threshold Adjustment.
LNAI 2837, 2003, pages 361-372.
[8] M. Ikonomakis, S. Kotsiantis, and V. Tampakas. Text Classification Using Machine Learning Techniques.
WSEAS Transactions on Computers, issue 8, volume 4, pp. 966-974, 2005.
[9] SVMlight (svm_multiclass), http://svmlight.joachims.org/svm_multiclass.html
[10] Witten-Bell smoothing, http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf
[11] Pointwise mutual information, http://en.wikipedia.org/wiki/Pointwise_mutual_information