
2011 2nd International Conference on Networking and Information Technology
IPCSIT vol.17 (2011) © (2011) IACSIT Press, Singapore
News Classification based on experts’ work knowledge
Yen-Liang Chen 1 and Tung-Lin Yu 2
1,2 Department of Information Management, National Central University, Chung-Li 320, Taiwan
1 ylchen@mgt.ncu.edu.tw, 2 pp_littlefish@yahoo.com.tw
Abstract. The news classification problem deals with correctly categorizing unseen news items. Although
numerous past studies have investigated this problem, they share a common shortcoming: the classification
algorithms were usually designed from a purely technical perspective. In other words, they paid little attention
to analyzing how experts actually classify news in practice. As a result, these algorithms were designed based
solely on technical aspects, without referencing or including experts’ work process knowledge. In this work,
we take a fundamentally different approach to the news classification problem. The difference lies in that we
first observe the work processes of experts. Then, by imitating the experts’ work processes, we develop a
two-phased algorithm for news classification. Finally, a series of experiments are carried out using real data
sets. The experimental results show that our proposed method, which was designed by imitating experts’
work processes, can significantly improve news classification accuracy. This result demonstrates that the
process knowledge learned from experts is a valuable asset that should be employed when classifying news.
Keywords: news classification; knowledge; work process knowledge; text mining
1. Introduction
Given a collection of records with attributes and labels, classification is the technique of assigning a label
to an unseen record using a classifier obtained from training data. Numerous classification approaches have
been proposed; the most famous ones include decision trees, rule-based methods, support vector machines, K-nearest neighbor, Bayesian methods, neural networks, associative classification, and so on [9].
Classification has been applied in many different areas. One of the most famous applications is the news
classification problem [1, 4, 15]. Its importance is evident in the enormous amount of news documents
collected every day by Web news providers, such as Yahoo! News or CNN. For the Web master, it is
laborious and tedious, if not impossible, to classify news items manually. Therefore, automatic news
classification approaches have been proposed to improve classification effectiveness and efficiency. Although
many classification approaches have been used, a common problem of previous work is that their solutions are
designed mainly from a technique-oriented perspective [5, 6, 7, 8, 10, 11, 12, 13, 14]. In other words, their
focus in designing news classification algorithms is on choosing appropriate classification techniques that
have been shown efficient and effective in the past. They pay little attention to how experts actually classify
news in practice. As a result, their algorithms were designed based solely on technical aspects, without
referencing or including experts’ work process knowledge. In light of this problem, this work takes a
fundamentally different approach in dealing with the news classification problem. The difference is explained
below.
According to [2], work process knowledge is an understanding of an organization’s labor process and
production process. Such knowledge is essential to organizations where communication and information
technologies are introduced to enable better use of knowledge assets. Because of its importance, work process
knowledge has drawn extensive attention from many areas, such as business process reengineering, enterprise
resource planning, workflow system design, software engineering, and so on. Basically, the studies in these
areas are concerned with the deployment and usage of process knowledge from an organization or information
system perspective. No one has investigated the application of process knowledge to help solve a micro level
problem, such as news classification.
In this work, our conjecture is that the work process adopted by experts to classify news items is a
valuable asset that should be employed when automatically classifying news. If the work process knowledge
learned from experts can be translated into our algorithm, we can improve the efficiency and effectiveness of
news classification. This paper is the first attempt in the literature to study news classification based on
experts’ work process knowledge. We first take a close look at experts’ work processes in news classification.
Then, imitating the work processes of experts, this work transforms this knowledge into a two-phased
algorithm for news classification.
The remainder of this paper is organized as follows. In Section 2, we design the news classification
algorithms by imitating the work processes of experts. To imitate each major operation in the process,
appropriate data mining techniques are selected and extended so that they work smoothly with other
components of the algorithm. In Section 3, a series of experiments are carried out using real data sets. The
experimental results show that our proposed method can significantly improve news classification accuracy.
This result supports our assumption that the process knowledge learned from experts is a valuable asset
that deserves to be employed when classifying news. Finally, we draw conclusions and discuss future work in
Section 4.
2. The algorithm
According to our observations, experts divide the news categorization problem into two major steps: (1)
scan for any representative keywords that can directly determine the news category; if such keywords exist,
the category is assigned according to them, and if none exist, go to the next step. (2) Carefully check the
news content in its entirety and assign the unseen news item to the category whose content is most similar.
In the first step, experts search for important keywords. If the news item contains important keywords that
determine the news category, they directly classify the document into that category. For example, if the
keyword “MLB” appears in a news item, the item belongs to the category “Sports”, since “MLB” appears
more frequently in “Sports” than in any other category. We call these keywords “representative keywords”.
If no representative keywords can be found, the experts move on to the second step.
In the second step, experts carefully check the entire news item to identify the category with the most
similar content. In this step, we also observe that a category often consists of several subcategories. For
example, “Sports” can be further divided into subcategories such as baseball, football, and basketball. The
keywords in each subcategory of sports usually differ from those in the other subcategories, even though
they all belong to the same category “Sports”. Since each subcategory has different characteristics, it should
be treated as an independent category.
Using our observations, we imitate the work processes of experts and propose a new learning technique
called the two-phase news classification (2PNC) approach. Fig. 1 shows the framework of the algorithm,
where the upper and lower halves are the training and testing phases, respectively. In the training phase, each
news document must go through text preprocessing operations, including stemming, stop words removal,
index term selection, and index term weight computation using the TF×ICF formula. Next, we transform the
TF×ICF weight vector into transaction format, and then apply an association rule mining algorithm to find
associative classification rules, which will assign the unseen news item to a suitable category according to the
representative keywords. In addition, we apply a k-means algorithm to divide each news category into several
subcategories, and find a representative vector for each subcategory. Here, the k value for k-means is
determined by the Calinski-Harabasz (CH) index [3].
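As an illustration of the weighting step, the TF×ICF computation can be sketched in Python. The paper does not spell out the exact formula, so this sketch assumes the common form ICF(t) = log(|C| / cf(t)), where |C| is the number of categories and cf(t) is the number of categories whose documents contain term t; the function name and data layout are our own.

```python
import math
from collections import Counter

def tf_icf(docs_by_category):
    """Compute per-category TF×ICF term weights.

    docs_by_category: {category: [list of token lists, one per document]}
    Returns {category: {term: weight}} with ICF(t) = log(|C| / cf(t)),
    where cf(t) is the number of categories containing term t.
    Terms appearing in every category get weight 0 (log 1 = 0).
    """
    n_cats = len(docs_by_category)
    # Category frequency: in how many categories does each term occur?
    cf = Counter()
    for docs in docs_by_category.values():
        for term in set(t for doc in docs for t in doc):
            cf[term] += 1
    weights = {}
    for cat, docs in docs_by_category.items():
        tf = Counter(t for doc in docs for t in doc)  # term frequency in the category
        weights[cat] = {t: tf[t] * math.log(n_cats / cf[t]) for t in tf}
    return weights
```

Terms that occur in every category receive zero weight, which matches the intuition behind representative keywords: a term shared by all categories cannot discriminate among them.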
In the testing phase, we also conduct text preprocessing and compute the TF×ICF weight of each term in
the document. Then, we transform the TF×ICF vector into transaction format, and find matched associative
classification rules. The confidence of the matched rule can be divided into high, medium, and low levels.
There are three cases for assigning categories using the matched rules. First, if the unseen news item
matches multiple high level rules in different categories, the news is assigned to an appropriate category
using a conflict resolution principle. Second, if medium level but no high level rules
match, we compute the similarity or distance between the unseen news item and the representative
vectors of each subcategory of the medium level rules’ categories, and then apply the nearest neighbor (NN)
strategy to assign the most suitable category to the unseen news item. Finally, if no high or medium level
rules match, we compute the distance between the unseen news item and the representative vectors of each
subcategory of all categories, and again apply the NN strategy to find the most suitable category.
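The three cases above can be sketched as one routine. This is an illustrative sketch, not the authors' implementation: case 1 uses the NR principle, cases 2 and 3 use an A-CkNN-style average over the k nearest subcategory representatives, and the function name, data layout, and default split points are our assumptions.

```python
from collections import Counter

def assign_category(matched_rules, subcat_distances, a=90.0, b=10.0, k=10):
    """Assign a category to an unseen news item.

    matched_rules:     list of (category, confidence%) pairs for matched rules
    subcat_distances:  {category: [distance to each subcategory representative]}
    [a, b]:            split points defining high/medium/low rule levels
    """
    high = [(c, cf) for c, cf in matched_rules if cf >= a]
    medium = [(c, cf) for c, cf in matched_rules if b <= cf < a]
    if high:
        # Case 1: resolve among high level rule categories (NR: most matched rules)
        return Counter(c for c, _ in high).most_common(1)[0][0]
    if medium:
        # Case 2: restrict the subcategory search to the medium level rules' categories
        cats = {c for c, _ in medium}
    else:
        # Case 3: no high or medium level rules match; search all categories
        cats = set(subcat_distances)
    def avg_k_nearest(ds):
        nearest = sorted(ds)[:k]
        return sum(nearest) / len(nearest)
    return min(cats, key=lambda c: avg_k_nearest(subcat_distances[c]))
```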
Figure 1. The framework of algorithm 2PNC.
In the testing phase, we can find a number of matched rules from the associative classification rules after
we transform a news item into a transaction. We separate the matched rules into three levels, high, medium,
and low level rules. We use [a, b] to represent the split points to define the matched rule levels.
• High level rule: the confidence of the matched rule is greater than or equal to a% (100 % ≥
confidence ≥ a %).
• Medium level rule: the confidence of the matched rule is less than a% but no less than b% (a % >
confidence ≥ b %).
• Low level rule: the confidence of the matched rule is less than b% (b % > confidence ≥ 0 %).
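The level assignment above is a simple threshold test; a minimal sketch follows, with the split points [a, b] passed as percentages (the defaults shown are illustrative, not prescribed by the method):

```python
def rule_level(confidence, a=90.0, b=10.0):
    """Map a matched rule's confidence (in percent) to a level,
    given split points [a, b]: high if >= a, medium if in [b, a), low if < b."""
    if confidence >= a:
        return "high"
    if confidence >= b:
        return "medium"
    return "low"
```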
If there are multiple high level rules, we designed three conflict resolution principles to determine the
news category: number of matched rules in each category (NR), average confidence (AvgC), and maximum
confidence (MaxC). NR, AvgC, and MaxC select the category with the largest number of matched rules, the
highest average rule confidence, and the highest single rule confidence, respectively. If no high level rule
matches, we use the kNN method to determine the assigned category.
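The three conflict resolution principles described above can be sketched as follows; the function name and input format are illustrative assumptions.

```python
from collections import defaultdict

def resolve_conflict(matched, principle="NR"):
    """Pick a category from matched high level rules.

    matched: list of (category, confidence%) pairs.
    NR:   category with the most matched rules
    AvgC: category with the highest average rule confidence
    MaxC: category with the highest single rule confidence
    """
    by_cat = defaultdict(list)
    for cat, conf in matched:
        by_cat[cat].append(conf)
    if principle == "NR":
        key = lambda c: len(by_cat[c])
    elif principle == "AvgC":
        key = lambda c: sum(by_cat[c]) / len(by_cat[c])
    else:  # MaxC
        key = lambda c: max(by_cat[c])
    return max(by_cat, key=key)
```

Note that the principles can disagree: a category with many moderately confident rules wins under NR, while a category with one very confident rule wins under MaxC.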
The problem with the traditional kNN algorithm is that news items tend to be assigned to the largest
category, since larger categories contribute more examples to the k-neighbor set. As a result, we propose
another kNN-like method, called A-CkNN (average distance of k nearest subcategories in each category), to
alleviate the influence of imbalanced subcategory sizes. With A-CkNN, we find the k nearest subcategories
in each category and measure their average distance.
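A-CkNN as described above reduces to a per-category average over the k nearest subcategory distances; a minimal sketch, with the function name and input layout as our assumptions:

```python
def a_cknn(distances, k=10):
    """A-CkNN: average distance of the k nearest subcategories per category.

    distances: {category: [distance from the news item to each of that
                category's subcategory representative vectors]}
    Returns the category with the smallest average over its k nearest
    subcategories, which dampens the advantage of categories that simply
    have many subcategories.
    """
    def avg_k_nearest(ds):
        nearest = sorted(ds)[:k]
        return sum(nearest) / len(nearest)
    return min(distances, key=lambda c: avg_k_nearest(distances[c]))
```

Because every category is scored by an average rather than by raw neighbor counts, a small category with one very close subcategory is no longer outvoted by a large category with many distant ones.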
3. Experiments
In order to test the effectiveness of the two-phase classification approach, a series of experiments was
conducted to determine the most appropriate parameter settings and to compare our approach with several
traditional algorithms. All the experiments were implemented in the following environment: (a) Operating
system: Windows Vista, (b) Hardware: Intel Core2 Duo CPU 7500 2.20 GHz, 2.00 GB main memory, (c)
Language: Java, (d) Integrated development environment: Eclipse, (e) Database: MySQL 5.1 (GUI tool:
MySQL Workbench).
This section contains four parts: (1) Data collections, (2) Measure metrics, (3) Pretest to set parameter
values, and (4) Experiment comparison.
3.1. Data Collections
Experiments were conducted with two data collections: Reuters-21578 news collection and WebKB. The
Reuters-21578 news collection is commonly used in text mining and categorization tasks. This collection
includes 21,578 documents with 135 categories. Only ten categories with the highest number of training
examples are usually considered for text categorization tasks. Additionally, documents with more than one
topic are eliminated, since we focus on the single-label classification problem. After eliminating unsuitable
documents, the eight most frequent news categories remain. Table I shows the news distribution in these
eight categories. We use “R8” to represent the Reuters-21578 data set.
The second data set is the WebKB collection, collected by the World Wide Knowledge Base project of the
CMU text learning group. These pages were collected from the computer science departments of various
universities in 1997 and manually classified into seven categories: student,
faculty, staff, department, course, project, and other. We discarded the categories “Department” and “Staff”
since there were only a few pages from each university, and also discarded the category “Other” since the
pages in this category were very diverse. Table II shows the distribution of WebKB documents in the
remaining four categories. We use “WK” to represent the WebKB data set.
TABLE I.
DATA SET REUTERS-21578 (R8)
TABLE II.
DATA SET WEBKB (WK)
3.2. Measure metrics
We used precision, recall, F1-measure, and accuracy to evaluate the performance of our news
classification system. P(Ci), the precision of class Ci, is the percentage of news items classified into Ci that
truly belong to Ci. R(Ci), the recall of class Ci, is the percentage of news items in class Ci that are classified
into Ci. F1-measure(Ci) is a well-known index combining precision with recall. Finally, accuracy is the
percentage of news items classified correctly.
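These metrics follow directly from the definitions above; a minimal sketch (function names are ours):

```python
def per_class_metrics(y_true, y_pred, cls):
    """Precision, recall, and F1 for one class, per the definitions above."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == cls and t == cls)
    pred_c = sum(1 for p in y_pred if p == cls)   # items classified into cls
    true_c = sum(1 for t in y_true if t == cls)   # items truly in cls
    precision = tp / pred_c if pred_c else 0.0
    recall = tp / true_c if true_c else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

def accuracy(y_true, y_pred):
    """Fraction of items classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```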
3.3. Pretest to determine parameter settings
Three control variables in our method could possibly have a significant effect on classification accuracy:
(1) the kNN strategy, (2) the conflict resolution principle and the split points in defining a rule’s match level,
and (3) the value of k in the kNN strategy. We conducted a series of experiments with different parameter
settings on the two data collections to determine the most appropriate settings for our approach. The pretest
results are as follows.
• We use A-CkNN as the kNN strategy in both the R8 and WK data collections.
• We use NR as the conflict resolution principle in both data collections.
• We chose [100, 30] and [90, 10] as the split points of the matched rule levels in the R8 and WK data
sets, respectively.
• We set k to 10 in A-CkNN for both data sets.
3.4. Experiment comparison
In this section, we present the experimental evaluation of the proposed method by comparing its results
with several traditional algorithms, including the nearest neighbor, instance-based KNN, clustering-based
KNN, associative classification rules, and SVM. First, we give a brief introduction to each comparison
algorithm.
• The nearest neighbor: In this classifier, we measure the similarity between the unseen news item and
each training article, and then select the category of the most similar article as the news category. This
method is the same as instance-based KNN with k=1.
• Associative classification rules: We use the traditional Apriori algorithm to mine the associative
classification rules and select the news category using the “NR” conflict resolution principle. Note that
in this algorithm, we no longer separate the matched rules into three levels. We simply assign the news
item to the category with the most matched association rules.
• Instance-based KNN: In this method, we measure the similarity between the unseen news item and
each training news article, and then determine the news category using A-CkNN. According to the
pretest results, we set the value of k to 10 in A-CkNN.
• Clustering-based KNN: We use the traditional k-means algorithm to cluster the news of each
category and select the news category using A-CkNN with k=10. Here, we use the Calinski Harabasz
Index [3] to select the number of clusters in the k-means algorithm.
• SVM: An SVM training algorithm builds a model that predicts which category a new example falls
into. SVM has recently been widely used for classification and regression. We use LIBSVM to run the
SVM algorithm; the source code was downloaded from the LIBSVM Tools home page.
• Associative classification rules + Clustering-based KNN (2PNC): This is the two-phase news
classification approach (2PNC) proposed in this paper.
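The Calinski-Harabasz index [3], used above to choose the number of clusters for k-means, can be sketched in pure Python for one-dimensional data. This is an illustrative sketch under our own data layout, not the exact implementation used in the experiments.

```python
def calinski_harabasz(clusters):
    """Calinski-Harabasz index for a clustering of 1-D points.

    clusters: list of lists, one list of points per cluster.
    CH = (B / (k - 1)) / (W / (n - k)), where B is the between-cluster
    dispersion, W the within-cluster dispersion, k the number of clusters,
    and n the total number of points. Higher is better: well-separated,
    compact clusters give a large CH value.
    """
    n = sum(len(c) for c in clusters)
    k = len(clusters)
    overall = sum(x for c in clusters for x in c) / n
    means = [sum(c) / len(c) for c in clusters]
    B = sum(len(c) * (m - overall) ** 2 for c, m in zip(clusters, means))
    W = sum((x - m) ** 2 for c, m in zip(clusters, means) for x in c)
    return (B / (k - 1)) / (W / (n - k))
```

In practice one runs k-means for a range of k values and keeps the k that maximizes this index.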
The results of comparisons between 2PNC and the other five algorithms in the R8 data set are
summarized in Table III. As seen in the table, compared with the other five traditional algorithms, our
approach (2PNC) has the highest average F1 value and classification accuracy. In addition, we make the
following observations.
• The clustering-based KNN method improves the performance of instance-based KNN. This result
indicates that clustering a category into several subcategories can improve the accuracy of news
classification.
• The instance-based KNN improves the performance of the nearest neighbor method, which is a special
case of the instance-based KNN with k=1. This result indicates that adopting a suitable size of
neighborhood can increase the accuracy of news classification.
• The associative classification rule method achieves very good precision of about 91%. This result
suggests that if we use only the high level associative rules for the first-phase classification, we have a
very high chance of making correct predictions for news with distinctive representative words.
The observations above strongly agree with our initial observations of the experts’ work process, on
which we based the design of the two-phase algorithm for news classification.
We also observed that the F1 value is positively related to the number of training news documents. For
example, categories “earn” and “acq” have higher F1 values in all approaches. One possible reason is that
categories “earn” and “acq” each have more than 1,500 training documents, so more representative keywords can be
found in these two categories. In contrast, categories “grain” and “ship” have lower F1 values no matter which
approach was used, since “grain” and “ship” have only 41 and 108 training news items, respectively.
We also observed the impact of category size on the associative classification rule method. The accuracy
of this method relies on how many representative keywords can be found in the training data; that is, the less
training data in a category, the fewer representative keywords can be found for it. This led to poor
performance for categories “grain” and “ship” when using the associative classification rule method.
The results also indicate that our approach greatly outperforms all traditional algorithms for small
categories, as in the cases of “grain” and “ship”. This means that our approach performs more stably across
categories of different sizes.
The comparisons between 2PNC and the other five algorithms on the WK data set are summarized in
Table IV. As seen in the table, our approach (2PNC) again has the highest average F1 value and
classification accuracy. We also note that the clustering-based KNN method improves on the instance-based
KNN and nearest neighbor methods, and that the integration of clustering-based KNN and associative
classification improves on each separate method.
Again, the category “project” had the lowest F1 value no matter which algorithm was used, since
“project” has only 336 training documents. Our approach, however, still achieved much better performance
than the other approaches in the “project” category. This result confirms once again that our approach
performs stably across all category sizes.
4. Conclusion
Unlike traditional news classification algorithms, which design their solutions from a technique-oriented
perspective, this research first observes how experts classify news in their workplace and then transforms
that process into the proposed algorithm. The experimental comparisons show that the new approach
achieves better classification performance than traditional technique-oriented algorithms and performs more
stably across all category sizes.
In this work, we have proposed a two-phase algorithm for news classification. In the first phase, we apply
the associative classification rules to select appropriate categories based on the representative keywords
contained in the news. In fact, many other data mining methods can serve the same purpose, such as fuzzy
association rules, rough set rules, decision trees, and so on. Therefore, a possible future research
direction is to investigate if the performance of our first phase can be further improved by using other data
mining or statistics methods.
In the second phase, we cluster the news of a category into several subcategories. We did so by using the
most popular clustering method, the k-means algorithm. Since clustering is a well-known and well-studied
area of data mining, numerous clustering algorithms have been proposed. Therefore,
another possible future research direction is to investigate if the performance of our second phase can be
further improved by applying other existing clustering algorithms or creating custom-designed clustering
algorithms.
5. References
[1] C. Aasheim and G.J. Koehler, “Scanning World Wide Web Documents with the Vector Space Model,” Decision
Support Systems, vol. 42, 2006, pp. 690-699.
[2] N. C. Boreham, R. Samurcay, and M. Fischer, Work Process Knowledge, London: Routledge, 2002.
[3] T. Calinski and J. Harabasz, “A Dendrite Method for Cluster Analysis,” Communications in Statistics, vol. 3,
1974, pp. 1-27.
[4] J.P. Caulkins, W. Ding, G. Duncan, R. Krishnan, and E. Nyberg, “A Method for Managing Access to Web Pages:
Filtering by Statistical Classification (FSC) Applied to Text,” Decision Support Systems, vol. 42, 2006, pp. 144-161.
[5] S. Chakrabarti, S. Roy, and M. V. Soundalgekar, “Fast and Accurate Text Classification Via Multiple Linear
Discriminant Projections,” The VLDB Journal, vol. 12, 2003, pp. 170-185.
[6] X.Y. Chen, Y. Chen, R.L. Li, and Y.F. Hu, “An Improvement of Text Association Classification using Rules
Weights,” Lecture Notes in Artificial Intelligence, vol. 3584, 2005, pp. 355-363.
[7] O. Cordon, M. J. Del Jesus, and F. Herrera, “A Proposal on Reasoning Methods in Fuzzy Rule-based Classification
Systems,” International Journal of Approximate Reasoning, vol. 20, 1999, pp. 21-45.
[8] E.H. Han, G. Karypis, and V. Kumar, “Text Categorization using Weight Adjusted K-nearest Neighbor
Classification,” Proc. the Fifth Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining,
2001, pp. 53-65.
[9] J. Han and M. Kamber, Data Mining: Concepts and Techniques, San Francisco, CA, Morgan Kaufmann, 2001.
[10] A. Juan and E. Vidal, “On the Use of Bernoulli Mixture Models for Text Classification,” Pattern Recognition, vol.
35, 2002, pp. 2705-2710.
[11] S.B. Kim, K.S. Han, H. C. Rim, and S. H. Myaeng, “Some Effective Techniques for Naive Bayes Text
Classification,” IEEE Transactions on Knowledge and Data Engineering, vol. 18, 2006, pp. 1457-1466.
[12] H. Li and K. Yamanishi, “Text Classification using ESC-based Stochastic Decision Lists,” Proc. CIKM-99, 8th
ACM International Conference on Information and Knowledge Management, 1999, pp. 122-130.
[13] C. Vens, J. Struyf, L. Schietgat, S. Dzeroski, and H. Blockeel, “Decision Trees for Hierarchical Multi-label
Classification,” Machine Learning, vol. 73, 2008, pp. 185-214.
[14] W. Wei and Y. Bo, “Text Categorization based on Combination of Modified Back Propagation Neural Network
and Latent Semantic Analysis,” Neural Computing and Applications, vol. 18, 2009, pp. 875-881.
[15] Y.L. Zhang, Y. Dang, H.C. Chen, M. Thurmond, and C. Larson, “Automatic Online News Monitoring and
Classification for Syndromic Surveillance,” Decision Support Systems, vol. 47, 2009, pp. 508-517.
TABLE III. EXPERIMENTAL RESULTS (R8)
TABLE IV. EXPERIMENTAL RESULTS (WK)