2011 2nd International Conference on Networking and Information Technology
IPCSIT vol. 17 (2011) © (2011) IACSIT Press, Singapore

News Classification based on experts’ work knowledge

Yen-Liang Chen 1 and Tung-Lin Yu 2
1,2 Department of Information Management, National Central University, Chung-Li 320, Taiwan
1 ylchen@mgt.ncu.edu.tw, 2 pp_littlefish@yahoo.com.tw

Abstract. The news classification problem deals with correctly categorizing unseen news items. Although numerous past studies have investigated this problem, they share a common shortcoming: the classification algorithms were usually designed from a purely technical perspective. In other words, they paid little attention to analyzing how experts actually classify news in practice. As a result, these algorithms were designed based solely on technical considerations, without referencing or incorporating experts’ work process knowledge. In this work, we take a fundamentally different approach to the news classification problem. The difference lies in that we first observe the work processes of experts. Then, by imitating those work processes, we develop a two-phase algorithm for news classification. Finally, a series of experiments is carried out using real data sets. The experimental results show that our proposed method, designed by imitating experts’ work processes, can significantly improve news classification accuracy. This result demonstrates that the process knowledge learned from experts is a valuable asset that should be employed when classifying news.

Keywords: news classification; knowledge; work process knowledge; text mining

1. Introduction

Given a collection of records with attributes and labels, classification is the technique of assigning a label to an unseen record using a classifier obtained from training data.
Numerous classification approaches have been proposed; the most famous include decision trees, rule-based methods, support vector machines, K-nearest neighbor, Bayesian classifiers, neural networks, and associative classification [9]. Classification has been applied in many different areas. One of the most famous applications is the news classification problem [1, 4, 15]. Its importance is evident in the enormous number of news documents collected every day by Web news providers such as Yahoo! News or CNN. For the Web master, it is laborious and tedious, if not impossible, to classify news items manually. Therefore, automatic news classification approaches have been proposed to improve classification effectiveness and efficiency.

Although many classification approaches have been used, a common problem of previous work is that their solutions are designed mainly from a technique-oriented perspective [5, 6, 7, 8, 10, 11, 12, 13, 14]. In other words, their focus in designing news classification algorithms is on choosing appropriate classification techniques that have been shown to be efficient and effective in the past. They pay little attention to how experts actually classify news in practice. As a result, their algorithms were designed based solely on technical aspects, without referencing or incorporating experts’ work process knowledge.

In light of this problem, this work takes a fundamentally different approach to the news classification problem. The difference is explained below. According to [2], work process knowledge is an understanding of an organization’s labor process and production process. Such knowledge is essential to organizations where communication and information technologies are introduced to enable better use of knowledge assets.
Because of its importance, work process knowledge has drawn extensive attention in many areas, such as business process reengineering, enterprise resource planning, workflow system design, and software engineering. Basically, the studies in these areas are concerned with the deployment and usage of process knowledge from an organization or information system perspective. No one has investigated the application of process knowledge to help solve a micro-level problem such as news classification.

In this work, our conjecture is that the work process adopted by experts to classify news items is a valuable asset that should be employed when automatically classifying news. If the work process knowledge learned from experts can be translated into our algorithm, we can improve the efficiency and effectiveness of news classification. This paper is the first attempt in the literature to study news classification based on experts’ work process knowledge. We first take a close look at experts’ work processes in news classification. Then, imitating those work processes, we transform this knowledge into a two-phase algorithm for news classification.

The remainder of this paper is organized as follows. In Section 2, we design the news classification algorithm by imitating the work processes of experts. To imitate each major operation in the process, appropriate data mining techniques are selected and extended so that they work smoothly with the other components of the algorithm. In Section 3, a series of experiments is carried out using real data sets. The experimental results show that our proposed method can significantly improve news classification accuracy. This result supports our assumption that the process knowledge learned from experts is a valuable asset that deserves to be employed when classifying news. Finally, we draw conclusions and discuss future work in Section 4.

2. The algorithm

According to our observations, experts divide the news categorization problem into two major steps: (1) scan for any representative keywords that can help directly determine the news category; if they exist, the category is assigned according to the representative keywords, and if none exist, go to the next step; (2) carefully check the news content in its entirety and assign the unseen news item to the category whose content is most similar.

In the first step, experts search for important keywords. If the news item has important keywords that can determine the news category, they directly classify the document into that category. For example, if the keyword “MLB” appears in the news item, the item belongs to the category “Sports”, since the keyword “MLB” appears more frequently in “Sports” than in any other category. These kinds of keywords are “representative keywords”. If representative keywords cannot be found, the experts move on to the second step. In the second step, experts carefully check the entirety of the news item to identify the category with the most similar content. In this step, we also observe that a category is often constituted of several subcategories. For example, “Sports” can be further divided into a set of subcategories, such as baseball, football, and basketball. In each subcategory of sports, the keywords usually differ from those in other subcategories, even though they all belong to the same category “Sports”. Since each subcategory has different characteristics, it should be treated as an independent category.

Using these observations, we imitate the work processes of experts and propose a new learning technique called the two-phase news classification (2PNC) approach. Fig. 1 shows the framework of the algorithm, where the upper and lower halves are the training and testing phases, respectively.
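The two-step expert process can be sketched in code. The following Python sketch is illustrative only: the keyword sets, the toy binary weighting, and the function names are our assumptions, not part of the original system (which uses TF×ICF weights, association rules, and subcategory clustering, as described next).

```python
import math

def cosine(u, v):
    # Cosine similarity between two sparse term-weight dictionaries.
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify_news(terms, representative_keywords, subcategory_vectors):
    # Step 1: a representative keyword alone decides the category.
    for category, keywords in representative_keywords.items():
        if keywords & terms:
            return category
    # Step 2: otherwise compare the whole document against every
    # subcategory's representative vector and take the closest one.
    doc_vec = {t: 1.0 for t in terms}  # toy binary weighting
    best = max(subcategory_vectors.items(),
               key=lambda item: cosine(doc_vec, item[1]))
    return best[0][0]  # the category half of a (category, subcategory) key

# "MLB" acts as a representative keyword for "Sports" (hypothetical data).
kws = {"Sports": {"MLB", "NBA"}}
subs = {("Sports", "baseball"): {"pitcher": 1.0, "inning": 1.0},
        ("Politics", "elections"): {"vote": 1.0, "senate": 1.0}}
print(classify_news({"MLB", "playoffs"}, kws, subs))         # step 1 fires
print(classify_news({"vote", "senate", "bill"}, kws, subs))  # step 2 fires
```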
In the training phase, each news document goes through text preprocessing operations, including stemming, stop-word removal, index term selection, and index term weight computation using the TF×ICF formula. Next, we transform the TF×ICF weight vector into transaction format, and then apply an association rule mining algorithm to find associative classification rules, which will assign an unseen news item to a suitable category according to its representative keywords. In addition, we apply a k-means algorithm to divide each news category into several subcategories, and find a representative vector for each subcategory. Here, the k value for k-means is determined by the Calinski-Harabasz (CH) index [3].

In the testing phase, we also conduct text preprocessing and compute the TF×ICF weight of each term in the document. Then, we transform the TF×ICF vector into transaction format, and find matched associative classification rules. The confidence of a matched rule can be divided into high, medium, and low levels. Three cases explain how categories are assigned using the matched rules. First, if the unseen news item matches multiple high-level rules in different categories, the news item is assigned to an appropriate category using a conflict resolution principle. Second, if medium-level but no high-level rules match, we compute the similarity or distance between the unseen news item and the representative vectors of each subcategory of the medium-level rules’ categories. After that, we apply a nearest neighbor (NN) strategy to assign the most suitable category to the unseen news item. Finally, if no high- or medium-level rules match, we compute the similarity or distance between the unseen news item and the representative vectors of each subcategory of all categories, and again apply the nearest neighbor (NN) strategy to find the most suitable category.

Figure 1. The framework of algorithm 2PNC.
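The paper names the TF×ICF weighting but does not spell out its formula. A common form of inverse class frequency multiplies the term frequency by the log of (number of categories / number of categories containing the term); the sketch below uses that form purely as an assumption for illustration.

```python
import math
from collections import Counter

def tficf_weights(doc_terms, category_vocab):
    # category_vocab maps each category to the set of terms seen in it.
    # Weight = TF(t, d) * log(C / CF(t)), where C is the number of
    # categories and CF(t) the number of categories containing t.
    # This exact form is an assumption; the paper only names TF×ICF.
    C = len(category_vocab)
    tf = Counter(doc_terms)
    weights = {}
    for term, freq in tf.items():
        cf = sum(1 for vocab in category_vocab.values() if term in vocab)
        weights[term] = freq * (math.log(C / cf) if cf else 0.0)
    return weights

cats = {"Sports": {"mlb", "inning", "score"},
        "Finance": {"stock", "score"},
        "Politics": {"vote", "senate"}}
w = tficf_weights(["mlb", "mlb", "score"], cats)
# "mlb" occurs in only 1 of 3 categories, so it gets a much larger
# weight than "score", which occurs in 2 of 3.
```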
In the testing phase, we can find a number of matched rules among the associative classification rules after we transform a news item into a transaction. We separate the matched rules into three levels: high, medium, and low. We use [a, b] to represent the split points that define the matched rule levels.

• High-level rule: the confidence of the matched rule is greater than or equal to a% (100% ≥ confidence ≥ a%).
• Medium-level rule: the confidence of the matched rule is less than a% but no less than b% (a% > confidence ≥ b%).
• Low-level rule: the confidence of the matched rule is less than b% (b% > confidence ≥ 0%).

If there are multiple high-level rules, we designed three conflict resolution principles to determine the news category: number of matched rules in each category (NR), average confidence (AvgC), and maximum confidence (MaxC). NR, AvgC, and MaxC select the category with the largest number of rules, the maximum average rule confidence, and the maximum rule confidence, respectively. If no high-level rule matches, we use a kNN method to determine the assigned category. The problem with the traditional kNN algorithm is that news items tend to be assigned to the biggest category, since the biggest categories tend to have more examples in the k-neighbor set. As a result, we propose another kNN-like method, called A-CkNN (average distance of the k nearest subcategories in each category), to alleviate the influence of imbalanced subcategory sizes. With A-CkNN, we find the k nearest subcategories in each category, measure their average distance to the unseen news item, and assign the category with the smallest average distance.

3. Experiments

In order to test the effectiveness of the two-phase classification approach, a series of experiments was conducted to determine the most appropriate parameter settings and to compare our approach with several traditional algorithms.
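A minimal sketch of the testing-phase logic described above: the [a, b] split points, the NR conflict resolution principle, and A-CkNN. The concrete split points, the squared-Euclidean distance, and all data values are illustrative assumptions; the paper tunes [a, b] per data set (Section 3.3).

```python
from collections import Counter, defaultdict

def rule_level(confidence, a=90.0, b=10.0):
    # Split points [a, b] define the matched-rule levels.
    if confidence >= a:
        return "high"
    return "medium" if confidence >= b else "low"

def resolve_nr(matched_rules, a=90.0, b=10.0):
    # NR principle: among high-level rules, pick the category matched
    # by the largest number of rules. Returns None when no high-level
    # rule matches (the A-CkNN phase then takes over).
    high = [cat for cat, conf in matched_rules
            if rule_level(conf, a, b) == "high"]
    return Counter(high).most_common(1)[0][0] if high else None

def a_cknn(doc_vec, subcategory_vectors, k=10):
    # A-CkNN: per category, average the distances of the k nearest
    # subcategory representative vectors; pick the smallest average.
    def dist(u, v):  # squared Euclidean over sparse dicts (toy choice)
        keys = set(u) | set(v)
        return sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in keys)
    per_cat = defaultdict(list)
    for (cat, _sub), vec in subcategory_vectors.items():
        per_cat[cat].append(dist(doc_vec, vec))
    return min(per_cat, key=lambda c: sum(sorted(per_cat[c])[:k])
                                      / min(k, len(per_cat[c])))

rules = [("Sports", 95.0), ("Sports", 92.0),
         ("Finance", 91.0), ("Politics", 40.0)]
print(resolve_nr(rules))  # Sports: matched by two high-level rules
```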
All the experiments were run in the following environment: (a) operating system: Windows Vista; (b) hardware: Intel Core2 Duo CPU 7500 2.20 GHz, 2.00 GB main memory; (c) language: Java; (d) integrated development environment: Eclipse; (e) database: MySQL 5.1 (GUI tool: MySQL Workbench). This section contains four parts: (1) data collections, (2) measure metrics, (3) pretest to set parameter values, and (4) experiment comparison.

3.1. Data Collections

Experiments were conducted with two data collections: the Reuters-21578 news collection and WebKB. The Reuters-21578 collection is commonly used in text mining and categorization tasks. It includes 21,578 documents in 135 categories. Usually only the ten categories with the highest number of training examples are considered for text categorization tasks. Additionally, documents with more than one topic are eliminated, since we focus on the single-label classification problem. After eliminating unsuitable documents, the eight most frequent news categories remain. Table I shows the news distribution over the eight categories. We use “R8” to represent this Reuters-21578 data set.

The second data set is the WebKB collection, gathered by the World Wide Knowledge Base project of the CMU text learning group. These pages were collected from the computer science departments of various universities in 1997 and manually classified into seven categories: student, faculty, staff, department, course, project, and other. We discarded the categories “department” and “staff” since there were only a few pages from each university, and also discarded the category “other” since its pages were very diverse. Table II shows the distribution of WebKB documents over the remaining four categories. We use “WK” to represent the WebKB data set.

TABLE I. DATA SET REUTERS-21578 (R8)

TABLE II. DATA SET WEBKB (WK)

3.2. Measure metrics

We used precision, recall, F1-measure, and accuracy to evaluate the performance of our news classification system. P(Ci), the precision of class Ci, is defined as the percentage of news items classified into Ci that truly belong to class Ci. R(Ci), the recall of class Ci, is defined as the percentage of news items in class Ci that are classified into Ci. F1-measure(Ci) is a well-known index combining precision with recall. Finally, accuracy is defined as the percentage of news items classified correctly.

3.3. Pretest to determine parameter settings

Three control variables in our method could have a significant effect on classification accuracy: (1) the kNN strategy, (2) the conflict resolution principle and the split points defining a rule’s match level, and (3) the value of k in the kNN strategy. We conducted a series of experiments with different parameter settings on the two data collections to determine the most appropriate settings for our approach. After pretesting, we obtained the following results.

• We used A-CkNN as the kNN strategy for both the R8 and WK data collections.
• We used NR as the conflict resolution principle for both the R8 and WK data collections.
• We chose [100, 30] and [90, 10] as the split points of the matched rule levels for the R8 and WK data sets, respectively.
• We set k to 10 in A-CkNN for both data sets.

3.4. Experiment comparison

In this section, we present the experimental evaluation of the proposed method by comparing its results with several traditional algorithms, including the nearest neighbor, instance-based kNN, clustering-based kNN, associative classification rule, and SVM methods. We first give a brief introduction of each comparison algorithm.

• Nearest neighbor: in this classifier, we measure the similarity between the unseen news item and each training article, and then select the category of the most similar training news item as the news category.
This method is the same as the instance-based kNN with k = 1.
• Associative classification rules: we use the traditional Apriori algorithm to mine the associative classification rules and select the news category using the NR conflict resolution principle. Note that in this algorithm, we no longer separate the matched rules into three levels; we simply assign the news item to the category with the most matched association rules.
• Instance-based kNN: in this method, we measure the similarity between the unseen news item and each training article, and then determine the news category using A-CkNN. According to the pretest results, we set the value of k to 10 in A-CkNN.
• Clustering-based kNN: we use the traditional k-means algorithm to cluster the news of each category and select the news category using A-CkNN with k = 10. Here, we use the Calinski-Harabasz index [3] to select the number of clusters in the k-means algorithm.
• SVM: an SVM training algorithm builds a model that predicts which category a new example falls into. SVM has recently been widely used in classification and regression. We used LIBSVM to run the SVM algorithm, with the source code downloaded from the LIBSVM Tools home page.
• Associative classification rules + clustering-based kNN (2PNC): this is the two-phase news classification approach (2PNC) proposed in this paper.

The comparisons between 2PNC and the other five algorithms on the R8 data set are summarized in Table III. As seen in the table, compared with the five traditional algorithms, our approach (2PNC) has the highest average F1 value and classification accuracy. In addition, we have the following observations.

• The clustering-based kNN method improves on the performance of instance-based kNN. This result indicates that clustering a category into several subcategories can improve the accuracy of news classification.
• The instance-based kNN improves on the performance of the nearest neighbor method, which is a special case of instance-based kNN with k = 1. This result indicates that adopting a suitable neighborhood size can increase the accuracy of news classification.
• The associative classification rule method has a very good precision of about 91%. This result suggests that if we use only high-level associative rules for the first-phase classification, we have a very high chance of making correct predictions for news items with distinctive representative keywords.

These observations strongly agree with the initial observations of experts’ work processes on which we based the design of the two-phase algorithm. We also observed that the F1 value is positively related to the number of training news documents. For example, the categories “earn” and “acq” have higher F1 values under all approaches. One possible reason is that “earn” and “acq” each have more than 1,500 training documents, so more representative keywords can be found in these two categories. In contrast, the categories “grain” and “ship” have lower F1 values no matter which approach is used, since “grain” and “ship” have only 41 and 108 training news items, respectively. We also observed the impact of category size on the associative classification rule method. The accuracy of this method relies on how many representative keywords can be found in the training data; that is, the less training data in a category, the fewer representative keywords in that category. This led to poor performance on the categories “grain” and “ship” when using the associative classification rule method alone. The results also indicate that our approach greatly outperforms all traditional algorithms on small categories, as in the cases of “grain” and “ship”. This means that our approach has a more stable performance across categories of different sizes.
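The F1 values compared above combine the per-class precision and recall defined in Section 3.2. A minimal, self-contained sketch of those definitions, with made-up labels purely for illustration:

```python
def per_class_metrics(y_true, y_pred, cls):
    # Precision: fraction of items predicted as cls that truly are cls.
    # Recall: fraction of true cls items that were predicted as cls.
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical true and predicted labels over five news items.
y_true = ["earn", "acq", "earn", "ship", "earn"]
y_pred = ["earn", "earn", "earn", "ship", "acq"]
p, r, f1 = per_class_metrics(y_true, y_pred, "earn")
# Accuracy: fraction of all items classified correctly.
accuracy = sum(t == q for t, q in zip(y_true, y_pred)) / len(y_true)
```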
The comparisons between the different algorithms on the WK data set are summarized in Table IV. As seen in the table, compared with the other five traditional algorithms, our approach (2PNC) has the highest average F1 value and classification accuracy. We also noted that the clustering-based kNN method improves on the performance of the instance-based kNN and nearest neighbor methods. The integration of clustering-based kNN and associative classification also improves on the performance of each separate method. Again, the category “project” had the lowest F1 value no matter which algorithm was used, since “project” has only 336 training documents. Our approach, however, still achieved a much better performance than the other approaches in the “project” category. This experimental result shows once again that our approach has a more stable performance across all category sizes.

4. Conclusion

Unlike traditional news classification algorithms, whose solutions are designed from a technique-oriented perspective, this research first observes how experts classify news in their workplace and then transforms that process into the proposed algorithm. In experimental comparisons, the results show that the new approach achieves better classification performance than traditional technique-oriented algorithms and performs more stably across all category sizes.

In this work, we have proposed a two-phase algorithm for news classification. In the first phase, we apply associative classification rules to select appropriate categories based on the representative keywords contained in the news. In fact, many other data mining methods can serve the same purpose, such as fuzzy association rules, rough set rules, and decision trees. Therefore, a possible future research direction is to investigate whether the performance of our first phase can be further improved by using other data mining or statistical methods.
In the second phase, we cluster the news items of a category into several subcategories. We did so using the most popular clustering method, the k-means algorithm. Since clustering is a well-known and well-studied area of data mining, numerous clustering algorithms have been proposed. Therefore, another possible future research direction is to investigate whether the performance of our second phase can be further improved by applying other existing clustering algorithms or by creating custom-designed clustering algorithms.

5. References

[1] C. Aasheim and G.J. Koehler, “Scanning World Wide Web Documents with the Vector Space Model,” Decision Support Systems, vol. 42, 2006, pp. 690-699.
[2] N.C. Boreham, R. Samurcay, and M. Fischer, Work Process Knowledge, London: Routledge, 2002.
[3] R.B. Calinski and J. Harabasz, “A Dendrite Method for Cluster Analysis,” Communications in Statistics, vol. 3, 1974, pp. 1-27.
[4] J.P. Caulkins, W. Ding, G. Duncan, R. Krishnan, and E. Nyberg, “A Method for Managing Access to Web Pages: Filtering by Statistical Classification (FSC) Applied to Text,” Decision Support Systems, vol. 42, 2006, pp. 144-161.
[5] S. Chakrabarti, S. Roy, and M.V. Soundalgekar, “Fast and Accurate Text Classification via Multiple Linear Discriminant Projections,” The VLDB Journal, vol. 12, 2003, pp. 170-185.
[6] X.Y. Chen, Y. Chen, R.L. Li, and Y.F. Hu, “An Improvement of Text Association Classification using Rules Weights,” Lecture Notes in Artificial Intelligence, vol. 3584, 2005, pp. 355-363.
[7] O. Cordon, M.J. Del Jesus, and F. Herrera, “A Proposal on Reasoning Methods in Fuzzy Rule-based Classification Systems,” International Journal of Approximate Reasoning, vol. 20, 1999, pp. 21-45.
[8] E.H. Han, G. Karypis, and V. Kumar, “Text Categorization using Weight Adjusted K-nearest Neighbor Classification,” Proc. the Fifth Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, 2001, pp. 53-65.
[9] J. Han and M. Kamber, Data Mining: Concepts and Techniques, San Francisco, CA: Morgan Kaufmann, 2001.
[10] A. Juan and E. Vidal, “On the Use of Bernoulli Mixture Models for Text Classification,” Pattern Recognition, vol. 35, 2002, pp. 2705-2710.
[11] S.B. Kim, K.S. Han, H.C. Rim, and S.H. Myaeng, “Some Effective Techniques for Naive Bayes Text Classification,” IEEE Transactions on Knowledge and Data Engineering, vol. 18, 2006, pp. 1457-1466.
[12] H. Li and K. Yamanishi, “Text Classification using ESC-based Stochastic Decision Lists,” Proc. CIKM-99, 8th ACM International Conference on Information and Knowledge Management, 1999, pp. 122-130.
[13] C. Vens, J. Struyf, L. Schietgat, S. Dzeroski, and H. Blockeel, “Decision Trees for Hierarchical Multi-label Classification,” Machine Learning, vol. 73, 2008, pp. 185-214.
[14] W. Wei and Y. Bo, “Text Categorization based on Combination of Modified Back Propagation Neural Network and Latent Semantic Analysis,” Neural Computing and Applications, vol. 18, 2009, pp. 875-881.
[15] Y.L. Zhang, Y. Dang, H.C. Chen, M. Thurmond, and C. Larson, “Automatic Online News Monitoring and Classification for Syndromic Surveillance,” Decision Support Systems, vol. 47, 2009, pp. 508-517.

TABLE III. EXPERIMENTAL RESULTS (R8)

TABLE IV. EXPERIMENTAL RESULTS (WK)