I-Shou University
Institute of Information Management
Master's Thesis

Fuzzy Correlation and Support Vector Learning Approach to Multi-Categorization of Documents

Student: Tsui-Feng Hu
Advisor: Dr. Jiann-Horng Lin

A Thesis Submitted to the Institute of Information Management, I-Shou University, in Partial Fulfillment of the Requirements for the Master Degree in Information Management

July, 2004
Kaohsiung, Taiwan

ABSTRACT

In this thesis, we propose a new text categorization method for multi-class and multi-label problems, based on support vector machines in conjunction with fuzzy correlation. Support vector machines (SVMs) are learning systems that use a hypothesis space of linear functions in a high-dimensional feature space, trained with a learning algorithm from optimization theory that implements a learning bias derived from statistical learning theory. SVMs provide efficient and powerful categorization algorithms that are capable of dealing with high-dimensional input spaces. In addition to SVMs, we use the concept of fuzzy correlation, which measures the degree of correlation between two variables or two attributes.
We employ fuzzy correlation to measure the correlation between unclassified documents and predefined categories, and use it to assign unclassified documents to multiple categories. This approach solves not only the multi-class classification problem but also the multi-label categorization problem.

Keywords: Fuzzy Correlation, Support Vector Machines (SVMs), Multi-Categorization of Documents

Acknowledgements

For the successful completion of my master's thesis I am most grateful to my advisor, Dr. Jiann-Horng Lin. Throughout this period he tirelessly corrected my attitude toward study, taught me to read with care and patience, guided my research direction and my thesis writing, and repeatedly discussed and revised the thesis with me. Thanks to him I learned a great deal in just two years, published several papers, and obtained my master's degree. I offer him my sincere respect and gratitude.

I also thank the members of my oral defense committee, Professor Tzung-Pei Hong (洪宗貝) of the National University of Kaohsiung and Professor Wen-Yang Lin (林文揚) of the Institute of Information Management at this university, for taking the time to attend the defense and for their constructive and forward-looking suggestions, which made this thesis more complete.

Two years of graduate life were short and demanding, but they were also a period from which I benefited greatly. I thank the professors who guided my research during these two years, including Professor Wen-Yang Lin, Professor Been-Chian Chien (錢炳全), and the other professors who took part in our seminars, as well as the senior and junior students who took part in them. It is because of your support and help that I was able to complete my studies.

Finally, I thank my parents for their care and support throughout my studies, which allowed me to pursue my degree without worry. I have learned a great deal along the way, and I extend my deepest thanks to everyone who helped me and quietly supported me.

Tsui-Feng Hu
July 2004

Contents

Abstract (Chinese)
ABSTRACT
... AVER-ACCURACY IN THE DIFFERENT LEARNING MACHINES
FIGURE 7.10: COMPUTING KEYWORDS IN THE DIFFERENT CATEGORIES
FIGURE 7.11: KEYWORDS IN THE TEN CATEGORIES FREQUENCY
FIGURE 7.12: CORRELATION MEMBERSHIP IN THE TEN CATEGORIES

List of Tables

TABLE 2.1: CLASSIFICATION ACCURACY, THORSTEN JOACHIMS, 1997, REUTERS-21578 DATASET
TABLE 2.2: CLASSIFICATION ACCURACY, THORSTEN JOACHIMS, WEBKB DATASET
TABLE 2.3: CLASSIFICATION ACCURACY, THORSTEN JOACHIMS, OHSUMED COLLECTION DATASET
TABLE 2.4: CLASSIFICATION ERROR, JASON AND RYAN, 2001
TABLE 2.5: CLASSIFICATION ERROR, JASON AND RYAN, 2001
TABLE 2.6: CLASSIFICATION ERROR, FRIEDHELM SCHWENKER, 2000
TABLE 2.7: CLASSIFICATION ERROR, JOHN AND NELLO, 2000
TABLE 2.8: CLASSIFICATION BETWEEN TWO-UNCLASSIFIED DOCUMENTS AND FOUR-PREDEFINED CATEGORIES
TABLE 4.1: MULTI-CLASS CLASSIFIER COMPARISON
TABLE 7.1: DOCUMENTS IN ONE CATEGORY OWNED
TABLE 7.2: DIFFERENT CATEGORIES WITH TRAINING NUMBERS
TABLE 7.3: ACCURACY COMPARISON WITH DIFFERENT METHODS
TABLE 7.4: ACCURACY IN THE DIFFERENT DIMENSIONS
TABLE 7.5: EVERY DOCUMENT IN THE DIFFERENT CATEGORIES APPEARS FREQUENCY
TABLE 7.6: CORRELATION BETWEEN EVERY DOCUMENT AND TEN CATEGORIES
TABLE 7.7: MULTI-CATEGORIES OF DOCUMENTS

CHAPTER 1 INTRODUCTION

1.1 Research Background and Motivation

There are billions of text documents available in electronic form. These collections represent a massive amount of information that is easily accessible. However, finding relevant information in such a huge collection requires organization. With the rapid growth of online information, document categorization has become one of the key techniques for handling and organizing text data. This can be greatly aided by automated classifier systems.
The accuracy of such systems determines their usefulness.

Text categorization is the task of assigning a text document to one or more appropriate categories from a predefined set. Originally, research in text categorization addressed the binary problem, where a document is either relevant or not relevant to a given category. In real-world situations, however, the great variety of sources, and hence of categories, usually poses a multi-class classification problem, where a document belongs to exactly one category selected from a predefined set [33][84][85][86]. Even more general is the multi-label problem, where a document can be classified into more than one category. While binary and multi-class problems (single-categorization of documents) have been investigated extensively [52], multi-label problems (multi-categorization of documents) have received very little attention [87]. In this thesis, we propose a new text categorization method for multi-class and multi-label problems, based on support vector machines in conjunction with fuzzy correlation.

The concept of support vector machines was proposed by Vapnik in 1995 on the foundation of statistical learning theory. A support vector learning method is similar to the perceptron in neural networks; both are classifier models. Because support vector machines achieve high accuracy and also provide the related models of support vector regression and support vector clustering, they are a very appropriate method for document categorization. Support vector machines (SVMs) are learning systems that use a hypothesis space of linear functions in a high-dimensional feature space, trained with a learning algorithm from optimization theory that implements a learning bias derived from statistical learning theory.
To find out which methods are promising for learning document classifiers, we should look more closely at the properties of text: (1) the input space is high dimensional; (2) there are few irrelevant features; (3) document vectors are sparse; (4) most document categorization problems are linearly separable. In [33], Joachims published results on a set of binary text classification experiments using the SVM. The SVM yielded lower error than many other classification techniques. Yang and Liu [78] followed later with experiments of their own on the same data set. They used improved versions of Naïve Bayes (NB) and k-nearest neighbors (kNN) but still found that the SVM performed at least as well as all the other classifiers they tried. Both papers used the SVM for binary text classification, leaving the multi-class problem (assigning a single label to each example) open for future research.

The multi-class classification problem refers to assigning each observation to one of k classes. As two-class problems are much easier to solve, many authors propose to use two-class classifiers for multi-class classification. Berger and Ghani independently chose to attack the multi-class text classification problem using error-correcting output codes (ECOC) [76][79]. They both chose Naïve Bayes as the binary classifier. ECOC combines the outputs of many individual binary classifiers in an additive fashion to produce a single multi-class output. It works in two stages: first, independently construct many subordinate classifiers, each responsible for removing some uncertainty about the correct class of the input; second, apply a voting scheme to decide upon the correct class, given the output of each weak learner.

In this thesis, we focus on techniques that provide a multi-class classification solution by combining all pairwise comparisons. A common way to combine pairwise comparisons is by voting [80].
It constructs a rule for discriminating between every pair of classes and then selects the class that wins the most two-class decisions. Though the voting procedure requires just pairwise decisions, it only predicts a class label. In many scenarios, however, probability estimates are desired. As numerous (pairwise) classifiers do provide class probabilities, several authors [81][82] have proposed probability estimates obtained by combining the pairwise class probabilities. A parametric approach was proposed by Platt [83], which consists of finding the parameters of a sigmoid function that maps the scores into probability estimates.

SVMs learn a decision boundary between two classes by mapping the training examples into a higher-dimensional space and then determining the optimal separating hyperplane in that space. Given a test example, the SVM outputs a score that gives the distance of the test example from the separating hyperplane. The sign of the score indicates the class to which the test example belongs. In our approach, we want a measure of confidence (belief) in the classification. The final decision is based on the classifier with maximum confidence. To provide an accurate measure of confidence, we adopt the concept of fuzzy correlation to determine the relation between unclassified documents and the categories. Fuzzy correlation not only measures the degree of relation between text documents and categories but also indicates whether that correlation is positive, negative, or absent.

1.2 Contributions of The Thesis

The main contributions of this thesis are:

1. We present an efficient method for producing class membership estimates for the multi-class text categorization problem.
2. Based on SVM binary classifiers in conjunction with the class membership, we propose a measure of confidence given by the fuzzy correlation.
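For context, the pairwise voting rule and the sigmoid score mapping described in Section 1.1 can be sketched as follows. This is a minimal illustration and not the method proposed in this thesis: the class names, the pairwise winners, and the sigmoid parameters `a` and `b` are hypothetical values chosen for the example, whereas in Platt's approach the two parameters would be fitted to held-out scores.

```python
import math
from collections import Counter


def vote(pairwise_winners):
    """Return the class that wins the most pairwise (two-class) decisions."""
    counts = Counter(pairwise_winners.values())
    # Iterate over sorted labels so that ties are broken deterministically.
    return max(sorted(counts), key=lambda label: counts[label])


def sigmoid_probability(score, a=-1.0, b=0.0):
    """Map an SVM score f to 1 / (1 + exp(a * f + b)).

    a and b are hypothetical here; they are normally estimated from data.
    """
    return 1.0 / (1.0 + math.exp(a * score + b))


# Hypothetical winners of the three pairwise classifiers for classes A, B, C.
winners = {("A", "B"): "A", ("A", "C"): "C", ("B", "C"): "C"}
predicted = vote(winners)

p_on_boundary = sigmoid_probability(0.0)  # a score on the hyperplane
p_far_positive = sigmoid_probability(3.0)  # a score far on the positive side
```

Voting returns only a label, while the sigmoid turns a single score into a confidence value; the thesis replaces the latter with a confidence measure based on fuzzy correlation.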
An acknowledged deficiency of SVMs is that their uncalibrated outputs do not provide estimates of the posterior probability of class membership. Our approach solves not only multi-class classification but also multi-label categorization problems.

1.3 Organization of The Thesis

The remainder of this thesis is organized as follows. Chapter 2 briefly reviews related work on document categorization and explains why we use support vector learning and fuzzy correlation. Chapter 3 introduces support vector machines (SVMs). Chapter 4 describes multi-class SVM classifier strategies. Chapter 5 introduces fuzzy correlation. Chapter 6 applies fuzzy correlation and support vector learning to the multi-categorization of documents. Chapter 7 gives experimental results for the new approach. Finally, Chapter 8 gives conclusions and future work.

CHAPTER 2 REVIEW OF RELATED WORKS

In this chapter, we review research related to this thesis, including document categorization, fuzzy correlation, and machine learning technology.

2.1 Document Categorization

In the past, documents were read, analyzed, and classified by groups of specialists. This was slow and costly in labor, so categorization could not be carried out quickly and effectively on a large scale. Recently, with the growth of information technology, huge quantities of information have appeared on networks, and most data are gradually being converted to electronic formats. Consequently, obtaining significant information rapidly and accurately has become an essential issue in this period of information explosion. Document categorization is a subfield of text mining, which, like the fairly mature technology of data mining, is a general approach to analyzing data.
Text mining combines data mining, information extraction, information retrieval, document classification, probabilistic modeling, linear algebra, machine learning, and computational linguistics, among others, to discover structure, patterns, and knowledge in large textual corpora. Text mining is thus a broad domain that draws on research methods from many fields, for example computer science, mathematics and statistics, information retrieval, and artificial intelligence. Applying these methods appropriately to document categorization is not easy, and which method is most appropriate has not yet been clearly established. Because most research is theoretical, applying it in real network environments still requires considerable effort.

2.2 Review of Document Categorization

Document categorization [6][32][54][76] is extremely time-consuming if done manually. For this reason, automatic document categorization techniques arose and developed. The earliest automatic techniques applied the rule-based approach of expert systems to document categorization. This approach requires rules and membership functions to be constructed by hand, and its behavior is difficult to correct. Since then, other categorization methods have appeared in succession. These methods have four advantages that address the shortcomings of the earliest expert systems: 1) they are easy to construct and update; 2) they can easily provide users with data drawn from the source material; 3) they can be built around the categories that interest users; 4) they provide users with accurate decisions.
Figure 2.1 Word frequency in documents and its relation to resolving power (horizontal axis: words in decreasing frequency order; cut-offs C and D separate the nonsignificant high-frequency terms and the nonsignificant low-frequency terms from the significant words, whose presumed resolving power is highest)

Figure 2.2 Automatic document categorization flow chart (training documents, manual classification and confirmation, machine classification, and filing of classified documents into categories)

The application of statistical categorization and machine learning techniques to document categorization [2][4][5][19][31][35][36][42][43][52][53][60][67][76] has gradually matured. Examples include classification and regression trees, multivariate regression models, nearest neighbor classifiers, decision trees, neural networks, genetic algorithms, symbolic rule learning, naïve Bayes rules, Bayes nets, and support vector machines.

2.3 Why Use Support Vector Machines

Accuracy is an important criterion for evaluating document categorization methods. After reviewing many research reports and the literature, we found that support vector machines (SVMs) [1][2][3][7][8][9][11][12][14][16][20][21][23][24][25][26][28][29][37][38][40][41][44][45][46][50][55][56][57][65][66][72][73] applied to document categorization generally achieve higher accuracy than other common machine learning categorization methods. This is our motivation for using SVMs in document categorization. The experimental results reported in the literature and reproduced in the following tables indicate that SVMs achieve better classification accuracy.

Table 2.1 Classification accuracy from [33], Thorsten Joachims, 1997, Reuters-21578 dataset

Method            Earn  Acq   Money-fx  Grain  Crude  Trade  Interest  Ship  Wheat  Corn  Micro-average
Naïve Bayes       95.9  91.5  62.9      72.5   81.0   50.0   58.0      78.7  60.6   47.3  72.0
Rocchio           96.1  92.1  67.6      79.5   81.5   77.4   72.5      83.1  79.4   62.2  79.9
C4.5              96.1  85.3  69.4      89.1   75.5   59.2   49.1      80.9  85.5   87.7  79.4
k-NN              97.3  92.0  78.2      82.2   85.7   77.4   74.0      79.2  76.6   77.9  82.3
SVM (RBF r=0.5)   98.5  95.0  74.0      93.1   88.9   76.9   74.4      85.4  85.2   85.1  86.4
SVM (RBF r=1.0)   98.4  95.3  76.3      91.9   88.9   77.8   76.2      87.6  85.9   85.7  86.3
SVM (RBF r=1.2)   98.3  95.4  75.9      90.6   88.2   76.8   76.1      87.1  85.9   84.5  86.2

Table 2.2 Classification accuracy from [34], Thorsten Joachims, WebKB dataset

Method        Course  Faculty  Project  Student  Average
Naïve Bayes   57.2    42.4     21.4     63.5     46.1
SVM           68.7    52.5     37.5     70.0     57.2

Table 2.3 Classification accuracy from [34], Thorsten Joachims, Ohsumed collection dataset

Method        Pathology  Cardiovascular  Neoplasms  Nervous system  Immunologic  Average
Naïve Bayes   39.6       49.0            53.1       28.1            28.3         39.6
SVM           41.8       58.0            65.1       35.5            42.8         48.6

Table 2.4 Classification error from [48], Jason and Ryan, 2001, 20 Newsgroups

            800             250             100             30
Code        SVM     NB      SVM     NB      SVM     NB      SVM     NB
OVA         0.131   0.146   0.167   0.199   0.214   0.277   0.311   0.455
Dense 15    0.142   0.176   0.193   0.222   0.251   0.282   0.366   0.431
BCH 15      0.145   0.169   0.196   0.225   0.262   0.311   0.415   0.520
Dense 31    0.135   0.168   0.180   0.214   0.233   0.267   0.348   0.428
BCH 31      0.131   0.153   0.173   0.198   0.224   0.259   0.333   0.438
Dense 63    0.129   0.154   0.171   0.198   0.222   0.256   0.326   0.407
BCH 63      0.125   0.145   0.164   0.188   0.213   0.245   0.312   0.390

Table 2.5 Classification error from [48], Jason and Ryan, 2001, Industry Sector

            800             250             100             30
Code        SVM     NB      SVM     NB      SVM     NB      SVM     NB
OVA         0.072   0.357   0.176   0.568   0.341   0.725   0.650   0.885
Dense 15    0.119   0.191   0.283   0.363   0.461   0.542   0.738   0.805
BCH 15      0.106   0.182   0.261   0.352   0.438   0.518   0.717   0.771
Dense 31    0.083   0.145   0.216   0.301   0.394   0.482   0.701   0.769
BCH 31      0.076   0.140   0.198   0.292   0.371   0.462   0.676   0.743
Dense 63    0.072   0.135   0.189   0.279   0.363   0.453   0.674   0.745
BCH 63      0.067   0.128   0.176   0.272   0.343   0.443   0.653   0.734

Table 2.6 Classification error from [51], Friedhelm Schwenker, 2000

Classifier   MLP    5-NN   LVQ    RBF    SVM-1-R  SVM-1-1  SVM-TR
Error (%)    2.41   2.34   3.01   1.51   1.40     1.37     1.39

MLP: multilayer perceptrons
5-NN: 5-nearest neighbor classifier
LVQ: trained with Kohonen's software package (OLVQ1 and LVQ3)
SVM-1-R: one-against-rest strategy
SVM-1-1: one-against-one strategy
SVM-TR: hierarchies or trees of binary SVM classifiers strategy

Table 2.7 Classification error from [47], John and Nello, 2000

Strategy   C     σ      Error Rate (%)
1-v-r      100   3.58   4.7
Max Wins   100   5.06   4.5
DDAG       100   5.06   4.4

1-v-r: one-against-rest strategy
Max Wins: Max Wins algorithm
DDAG: decision directed acyclic graph

From the tables above, and from the statistical significance tests reported by Yang and Liu, {SVM, kNN} > LLSF > multilayer perceptrons >> multinomial naïve Bayes; among these five classifiers, SVM has the best categorization performance. Likewise, by the micro-averaged performance on Reuters reported by Joachims, SVM (0.864) > kNN (0.823) > {Rocchio (0.799), C4.5 (0.794)} > naïve Bayes (0.72), so SVM again has the best categorization performance. We conclude that SVMs are more appropriate for document categorization than the other methods. For multi-class decisions, the one-against-one classifiers strategy is superior to the others. We therefore adopt the one-against-one SVM classifiers strategy (OAO-SVMs) [3][9][25][36][37][38][45][47][48][65], which is similar in structure to the decision directed acyclic graph (DDAG) strategy but is simpler and more appropriate as a multi-class decision structure than the other strategies.

2.4 Why Use Fuzzy Correlation

A fuzzy set is a mathematical model for expressing linguistic information; it is a tool for quantifying ambiguous meaning. We employ fuzzy correlation to measure the correlation between two variables or two attributes.
Using fuzzy correlation [10][17][22][70], we do not need to set a threshold subjectively, which would leave the threshold open to question.

Table 2.8 Correlation between two unclassified documents and four predefined categories

             Categorization A  Categorization B  Categorization C  Categorization D
Document 1   0.90              0.65              0.68              0.92
Document 2   0.41              0.63              0.44              0.62

             Categorization A  Categorization B  Categorization C  Categorization D
Document 1   0.12              0.68              0.69              0.71
Document 2   0.91              0.93              0.23              0.69

As the table above illustrates, we can employ fuzzy correlation to compute the degree of correlation between unclassified documents and predefined categories, and combine it with learning machines for classification. Fuzzy correlation measures the degree of correlation between unclassified documents and predefined categories, and with this measure SVMs can classify unclassified documents into multiple categories, because SVMs by themselves can assign an unclassified document to only one predefined category.

CHAPTER 3 SUPPORT VECTOR MACHINES (SVMS)

Support vector machines (SVMs) are a system for efficiently training linear learning machines in kernel-induced feature spaces, and can be used for pattern categorization and nonlinear regression. In categorization, the main idea of a support vector machine is to construct a hyperplane as the decision surface in such a way that the margin of separation between positive and negative examples is maximized. The machine achieves this desirable property by following a principled approach rooted in statistical learning theory.

From the perspective of statistical learning theory, the motivation for considering binary classifier SVMs comes from theoretical bounds on the generalization error (the theoretical generalization performance on new data). These generalization bounds have two important features. First, the upper bound on the generalization error does not depend on the dimension of the space.
Secondly, the error bound is minimized by maximizing the margin, $\gamma$, i.e. the minimal distance between the hyperplane separating the two classes and the data points closest to the hyperplane, as shown in Figure 3.1. Accordingly, SVMs can provide good generalization performance and avoid the overfitting problem in pattern categorization. This property is characteristic of SVMs.

Figure 3.1 Binary categorization with a support vector machine (the optimal hyperplane and the margin between the two classes)

Support vector machines (SVMs) are linear binary classifiers: the two data sets are labeled with predefined class values (+1 or -1) and then separated by a classification function. The labeled data are fed into a linear function, which is trained to obtain the optimal decision function for the two data sets, namely the hyperplane that separates them with maximum margin. Training an SVM is therefore an optimization problem.

Figure 3.2 Framework of support vector machines (SVMs) (the hyperplane $w^{T}x + b = 0$ separating the classes $+1$ and $-1$)

3.1 The Maximal Margin Classifier

Let us consider a binary categorization task with data points $x_i$ $(i = 1, \ldots, m)$ having corresponding labels $y_i = \pm 1$, and let the decision function [14][50][64][73] be:

\[ f(x) = \operatorname{sign}(w \cdot x + b) \tag{3.1} \]

The hyperplane parameters of the decision function are $(w, b)$. If the data set is separable, then the data will be correctly classified if $y_i (w \cdot x_i + b) > 0 \;\; \forall i$. We implicitly define a scale for $(w, b)$ to give a canonical hyperplane such that $w \cdot x + b = 1$ for the closest points on one side and $w \cdot x + b = -1$ for the closest points on the other side. For the separating hyperplane $w \cdot x + b = 0$ the normal vector is clearly $w / \|w\|_2$. Since $w \cdot x + b = 1$ and $w \cdot x + b = -1$ for the closest points, this means the margin is $\gamma = 1 / \|w\|_2$.
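As a small worked illustration of the canonical scaling, the sketch below rescales a separating hyperplane so that the closest points satisfy $|w \cdot x + b| = 1$ and then evaluates the margin $1/\|w\|_2$. The two data points and the initial hyperplane are hypothetical values chosen for the example; the function names are likewise illustrative.

```python
import math


def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def canonicalize(w, b, points):
    """Rescale (w, b) so that the closest point has |w.x + b| = 1."""
    scale = min(abs(dot(w, x) + b) for x in points)
    return [wi / scale for wi in w], b / scale


def margin(w):
    """Margin 1 / ||w||_2 of a hyperplane in canonical form."""
    return 1.0 / math.sqrt(dot(w, w))


# Hypothetical toy data: one positive and one negative example.
points = [(2.0, 0.0), (0.0, 0.0)]
labels = [1, -1]

# A separating hyperplane that is not yet in canonical form.
w, b = canonicalize([2.0, 0.0], -2.0, points)

separated = all(y * (dot(w, x) + b) > 0 for x, y in zip(points, labels))
gamma = margin(w)
```

For this toy data the canonical hyperplane is the vertical line $x_1 = 1$, which lies at distance 1 from both points, in agreement with the value of `gamma`.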
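To maximize the margin the task is therefore:

\[ \text{Minimize} \quad \tfrac{1}{2} \|w\|_2^2 \tag{3.2} \]

subject to the constraints:

\[ y_i (w \cdot x_i + b) \ge 1 \quad \forall i \tag{3.3} \]

and the learning task can be reduced to minimization of the primal Lagrangian:

\[ L = \tfrac{1}{2} (w \cdot w) - \sum_{i=1}^{N} \alpha_i \bigl( y_i (w \cdot x_i + b) - 1 \bigr) \tag{3.4} \]

where $\alpha_i$ are the Lagrange multipliers (hence $\alpha_i \ge 0$). From Wolfe's theorem, we can take the derivatives with respect to $b$ and $w$ to obtain

\[ \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0, \qquad \frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} \alpha_i y_i x_i \tag{3.5} \]

and re-substitute back in the primal (3.4) to give the Wolfe dual Lagrangian:

\[ W(\alpha) = \sum_{i=1}^{N} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j \,(x_i \cdot x_j) \tag{3.6} \]

which must be maximized with respect to the $\alpha_i$ subject to the constraints:

\[ \alpha_i \ge 0 \tag{3.7} \]

\[ \sum_{i=1}^{N} \alpha_i y_i = 0 \tag{3.8} \]

Solving Equation (3.6) with the constraints (3.7) and (3.8) determines the Lagrange multipliers, and the optimal separating hyperplane is given by

\[ \hat{w} = \sum_{\mathrm{SVs}} \hat{\alpha}_i y_i x_i, \qquad \hat{b} = -\tfrac{1}{2}\, \hat{w} \cdot (x_r + x_s), \]

where SVs denotes the set of support vectors, whose Lagrange multipliers are positive. The bias term $\hat{b}$ is computed here using two support vectors, $x_r$ and $x_s$, which are any support vectors from each class satisfying

\[ \hat{\alpha}_r, \hat{\alpha}_s > 0, \quad y_r = -1, \quad y_s = 1 \tag{3.9} \]

but it can be computed using all the support vectors on the margin for stability.

3.2 Kernel-Induced Feature Spaces

For the dual Lagrangian (3.6) we notice that the data points $x_i$ only appear inside an inner product.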
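The observation that the data enter the dual only through inner products can be checked on a toy problem. The sketch below evaluates the dual objective (3.6) and recovers the hyperplane from the multipliers as in (3.5) and the expression for $\hat{b}$. The two data points and the multiplier values are hypothetical; for this particular data set, $\alpha = (0.5, 0.5)$ maximizes $W(\alpha) = 2a - 2a^2$ along the feasible line $\alpha_1 = \alpha_2 = a$.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def dual_objective(alpha, y, X):
    """Wolfe dual W(alpha) of Eq. (3.6); X is used only through inner products."""
    n = len(X)
    quad = sum(
        alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
        for i in range(n)
        for j in range(n)
    )
    return sum(alpha) - 0.5 * quad


def recover_hyperplane(alpha, y, X, r, s):
    """w = sum_i alpha_i y_i x_i and b = -1/2 w.(x_r + x_s)."""
    dim = len(X[0])
    w = [sum(alpha[i] * y[i] * X[i][k] for i in range(len(X))) for k in range(dim)]
    b = -0.5 * (dot(w, X[r]) + dot(w, X[s]))
    return w, b


# Hypothetical toy data: one support vector per class.
X = [(2.0, 0.0), (0.0, 0.0)]
y = [1, -1]
alpha = [0.5, 0.5]  # satisfies alpha_i >= 0 and sum_i alpha_i y_i = 0

W = dual_objective(alpha, y, X)
w, b = recover_hyperplane(alpha, y, X, r=1, s=0)
primal = 0.5 * dot(w, w)  # value of the primal objective (3.2) at w
```

At the optimum the dual value `W` equals the primal value `primal`, and replacing `dot` by any kernel function leaves the rest of `dual_objective` unchanged.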
To get a better representation of the data we can therefore map the data points into an alternative higher-dimensional space, called feature space, through the replacement:

\[ x_i \cdot x_j \;\rightarrow\; \phi(x_i) \cdot \phi(x_j) \tag{3.10} \]

The functional form of the mapping $\phi(x_i)$ does not need to be known since it is implicitly defined by the choice of kernel [12][14]:

\[ K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j) \tag{3.11} \]

After substituting $K(x_i, x_j)$ for the data inner product $x_i \cdot x_j$, the dual Lagrangian (3.6) becomes

\[ W(\alpha) = \sum_{i=1}^{N} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \tag{3.12} \]

where $K(x, y)$ is the kernel function performing the non-linear mapping into feature space, and the constraints are unchanged:

\[ \alpha_i \ge 0 \tag{3.13} \]

\[ \sum_{i=1}^{N} \alpha_i y_i = 0 \tag{3.14} \]

Solving Equation (3.12) with the constraints (3.13) and (3.14) determines the Lagrange multipliers, and a hard margin classifier in the feature space is given by

\[ f(x) = \operatorname{sign}\Bigl( \sum_{\mathrm{SVs}} \hat{\alpha}_i y_i K(x_i, x) + \hat{b} \Bigr) \tag{3.15} \]

where

\[ \hat{b} = -\tfrac{1}{2} \sum_{\mathrm{SVs}} \hat{\alpha}_i y_i \bigl[ K(x_r, x_i) + K(x_s, x_i) \bigr] \tag{3.16} \]

The bias is computed here using two support vectors, but it can be computed using all the support vectors on the margin for stability.

3.3 Kernel Functions

The idea of the kernel function [14][44][73] is to enable operations to be performed in the input space rather than in the potentially high-dimensional feature space. Hence the inner product does not need to be evaluated in the feature space. This provides a way of addressing the curse of dimensionality. In the following subsections, we introduce some common choices of kernel function.

A. Polynomial

A polynomial mapping is a popular method for non-linear modeling:

\[ K(x, y) = (x \cdot y)^d, \qquad K(x, y) = (x \cdot y + 1)^d \tag{3.17} \]

where $d = 1, 2, \ldots$. The second kernel is usually preferable as it avoids problems with the Hessian matrix becoming zero.
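To make the statement that the mapping is implicitly defined by the kernel concrete, the sketch below compares the homogeneous polynomial kernel of degree 2 with an explicit inner product in feature space. The feature map `phi_degree2` is the standard one for two-dimensional inputs and is written out here only for illustration; the input vectors are hypothetical, and the parameter `c` is an illustrative way of switching between the two kernels in (3.17).

```python
import math


def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(x, y, d, c=0.0):
    """Polynomial kernel (x.y + c)^d; c=0 and c=1 give the two forms in (3.17)."""
    return (dot(x, y) + c) ** d


def phi_degree2(x):
    """Explicit feature map whose inner product equals (x.y)^2 for 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


# Hypothetical input vectors.
x = (1.0, 2.0)
y = (3.0, -1.0)

k_implicit = poly_kernel(x, y, d=2)                # computed in input space
k_explicit = dot(phi_degree2(x), phi_degree2(y))   # computed in feature space
k_inhomogeneous = poly_kernel(x, y, d=2, c=1.0)    # second form in (3.17)
```

Both routes give the same value, but the kernel needs only one inner product in the two-dimensional input space, whereas the explicit route first builds three-dimensional feature vectors; the gap grows quickly with the degree and the input dimension.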
B. Gaussian Radial Basis Function

Radial basis functions have received significant attention, most commonly with a Gaussian of the form

\[ K(x, y) = \exp\!\left( -\frac{\|x - y\|^2}{2\sigma^2} \right) \tag{3.18} \]

An attractive feature [15][30][49][68] of SVMs is that the selection of the basis-function centers is implicit, with each support vector contributing one local Gaussian function centered at that data point.

Figure 3.3 SVMs mapping

3.4 Soft Margin Optimization

Most real-life data sets contain noise, and SVMs [39][59][71] can fit this noise, leading to poor generalization. The effect of outliers and noise can be reduced by introducing a soft margin, and two schemes, the $l_1$ error norm and the $l_2$ error norm, are currently used. The justification for these soft margin techniques comes from statistical learning theory, but they can readily be viewed as a relaxation of the hard margin constraint (3.3). Thus for the $l_1$ error norm we introduce a nonnegative slack variable $\xi_i$ into (3.3):

\[ y_i (w \cdot x_i + b) \ge 1 - \xi_i \quad \forall i \tag{3.19} \]

and the task is now to minimize the sum of errors $\sum_{i=1}^{N} \xi_i$ in addition to $\|w\|^2$:

\[ \min \; \tfrac{1}{2} (w \cdot w) + C \sum_{i=1}^{N} \xi_i \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i \;\; \forall i \tag{3.20} \]

For the $l_2$ error norm, the task is to minimize the sum of squared errors $\sum_{i=1}^{N} \xi_i^2$ in addition to $\|w\|^2$:

\[ \min \; \tfrac{1}{2} (w \cdot w) + C \sum_{i=1}^{N} \xi_i^2 \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i \;\; \forall i \tag{3.21} \]

3.5 Support Vector Classification, Clustering, Regression and Fuzzy Support Vector Machines

Support vector learning [6][11][12][13][14][77] can be applied to support vector classification [1][2][3][7][8][9][16][19][23][24][25][26][28][29][31][33][34], support vector clustering (SVC) [27][49][77], and support vector regression (SVR) [4][20][21][57]. Support vector learning carries the concept of support vector machines (SVMs) over to SVC and SVR, so that SVMs apply to a wider domain. In addition to SVR and SVC, fuzzy theory has recently been incorporated, giving so-called fuzzy support vector machines (FSVMs). In FSVMs, the categorization boundary is drawn at different degrees according to the degree of the fuzzy values.
This method makes categorization more flexible and improves categorization accuracy.

Figure 3.4 Support vector clustering (SVC)

Figure 3.5 Support vector regression (SVR)

Figure 3.6 Fuzzy support vector machines (FSVMs) (class boundary with membership function)

3.6 Support Vector Machines (SVMs) Application

Support vector machines (SVMs) can be applied in the following domains:

A. Document categorization [30][58][61][62][63][69]: a kernel from IR applied to information filtering
B. Image recognition: aspect independent classification; colour-based classification
C. Hand-written digit recognition
D. Bioinformatics: protein homology detection; gene expression
E. Commerce and finance

CHAPTER 4 MULTI-CLASS SUPPORT VECTOR LEARNING

Support vector machines (SVMs) are binary classifiers, and a single classifier cannot separate multiple categories. We therefore need other methods, or a strategy that combines many binary classifiers. At present there are four common classifier strategies: the one-against-one classifiers strategy, the one-against-rest classifiers strategy, the hierarchies or trees of binary SVM classifiers strategy, and the decision directed acyclic graph. Each of these strategies [3][9][25][36][37][43][45][47][48][51][65][66] has advantages and disadvantages. We describe the four strategies below.

4.1 One-against-One Classifiers Strategy (OAO)

For $N$ categories, the one-against-one classifiers strategy uses $N(N-1)/2$ classifiers and employs a majority voting scheme to combine them and determine the final classification result. The number of classifiers never exceeds $N(N-1)/2$. The classifier for categories $i$ and $j$ is obtained by solving:

\[ \text{Minimize:} \quad \tfrac{1}{2} (w^{ij})^{T} w^{ij} + C \sum_{t} \xi_t^{ij} \]

\[ \text{Subject to:} \quad (w^{ij})^{T} x_t + b^{ij} \ge 1 - \xi_t^{ij}, \;\; \text{if } y_t = i \]

\[ (w^{ij})^{T} x_t + b^{ij} \le -1 + \xi_t^{ij}, \;\; \text{if } y_t = j \]

\[ \xi_t^{ij} \ge 0 \]

Ex.
Categories: Superior scholars, Ordinary scholars, Maimed scholars, Aboriginal scholars

Figure 4.1 One-against-one classifiers strategy I, II (majority voting scheme) (six pairwise classifiers, A to F, one for each pair of the four categories; each classifier outputs +1 or -1 for its two categories)

4.2 One-against-Rest Classifiers Strategy (OAR)

The one-against-rest classifiers strategy uses $N - 1$ classifiers if there are $N$ categories. It employs one classifier to separate one category from the rest, and the remaining categories are separated one by one by the other classifiers.

\[ \text{Minimize:} \quad \tfrac{1}{2} (w^{i})^{T} w^{i} + C \sum_{j=1}^{l} \xi_j^{i} \]

\[ \text{Subject to:} \quad (w^{i})^{T} x_j + b^{i} \ge 1 - \xi_j^{i}, \;\; \text{if } y_j = i \]

\[ (w^{i})^{T} x_j + b^{i} \le -1 + \xi_j^{i}, \;\; \text{if } y_j \ne i \]

\[ \xi_j^{i} \ge 0, \quad j = 1, 2, \ldots, l \]

\[ \text{class of } x = \arg\max_{i = 1, 2, \ldots, k} \bigl( (w^{i})^{T} x + b^{i} \bigr) \]

Ex. Categories: Superior scholars, Ordinary scholars, Maimed scholars, Aboriginal scholars

Figure 4.2 One-against-rest classifiers strategy (a cascade of three classifiers with outputs +1 and -1: the first separates Superior scholars from the other three categories, the second separates Ordinary scholars from Maimed and Aboriginal scholars, and the third separates Aboriginal scholars from Maimed scholars)

4.3 Hierarchies or Trees of Binary SVM Classifiers Strategy

This strategy arranges binary classifiers in a hierarchical tree structure, and the number of classifiers is not fixed. The number of classifiers is determined by the distribution of the data, and the final outputs for the individual categories are not related to one another.

Ex.
SS: Superior scholars, OS: Ordinary scholars, MS: Maimed scholars, AS: Aboriginal scholars

The root classifier splits {SS, OS, MS, AS} into {SS, OS} (-1) and {MS, AS} (+1). A second classifier then separates SS (-1) from OS (+1), and a third separates MS (-1) from AS (+1).

Figure 4.3 Hierarchies or trees of binary SVM classifiers strategy

4.4 Decision Directed Acyclic Graph (DDAG)

The decision directed acyclic graph (DDAG) method was proposed by Platt, Cristianini, and Shawe-Taylor. It combines the one-against-one classifiers strategy with the hierarchies or trees of binary SVM classifiers strategy. DDAG and the one-against-one strategy are identical in the training phase. The number of classifiers is fixed at $N(N-1)/2$ at most, and the hierarchical structure remedies the problem that the classifier outputs for the individual categories are not related. DDAG is well suited to multi-class classification because it makes both the number of classifiers and the output structure more complete.

Ex. SS: Superior scholars, OS: Ordinary scholars, MS: Maimed scholars, AS: Aboriginal scholars

The root node tests SS vs AS on {SS, OS, MS, AS}. The branch "Not AS" leads to {SS, OS, MS} and the branch "Not SS" leads to {OS, MS, AS}. Each further node eliminates one more category until a single category (SS, OS, MS, or AS) remains.

Figure 4.4 Decision directed acyclic graph (DDAG)

4.5 Multi-Class Classifier Comparison

Table 4.1 Multi-class classifier comparison

One-against-One classifiers strategy
  Advantages:
    1. Maximum and steady quantity of classification implements
    2. Classification implements decreased
    3. High accuracy
  Disadvantages:
    1. An evaluative measure and strategy (majority voting scheme) is needed

One-against-Rest classifiers strategy
  Advantages:
    1. Easier to understand and construct than the other classifiers
  Disadvantages:
    1. Inaccurate classification

Hierarchies or Trees of Binary SVM classifiers strategy
  Advantages:
    1. The relations among sources can be understood from the hierarchy structure
  Disadvantages:
    1. Uncertain quantity of classification implements
    2. Incomplete, independent structure of the input classes
    3. All classified data points are reassigned to class {+1, -1} every time ※

Decision Directed Acyclic Graph (DDAG)
  Advantages:
    1. Maximum and steady quantity of classification implements
    2. The relations among sources can be understood from the hierarchy structure
    3. The incomplete input classification structure is improved
    4. High accuracy
  Disadvantages:
    1. All classified data points are reassigned to class {+1, -1} every time ※

※ Every classified data point is reassigned to class +1 or -1 each time. This assignment does not follow the properties of the data points and is an unduly subjective judgment.

CHAPTER 5 FUZZY CORRELATION

Correlation is frequently used to find the relationship between two variables or two attributes in data. If the variances of two random variables $X$ and $Y$ exist and are greater than zero ($\sigma_X, \sigma_Y > 0$), the correlation of $X$ and $Y$, denoted $\rho(X, Y)$, is defined as

\[ \rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} \]

An important property of this definition is $-1 \le \rho(X, Y) \le 1$. Correlation estimates the degree of linear correlation between the variables $X$ and $Y$. When $\rho(X, Y)$ is close to +1 or -1, $X$ and $Y$ are highly linearly correlated. A positive value of $\rho(X, Y)$ indicates positive correlation: when the value of $X$ increases, the value of $Y$ also tends to increase. A negative value indicates negative correlation: when the value of $X$ increases, the value of $Y$ tends to decrease. If $\rho(X, Y) = 0$, $X$ and $Y$ are uncorrelated. Correlation analysis therefore lets us easily measure the relationship between two attributes or two variables.

How can we measure the degree of correlation between two undefined attributes or variables? Fuzzy correlation is a measure of the degree of correlation between fuzzy sets [10][74][75]. For example, suppose there are N undefined attributes $A_1, A_2, \ldots, A_N$ and n elements $x_1, x_2, \ldots, x_n$. We do not know the real meaning of each $A$ and $x$; we know only the membership degree of each $x$ in each $A$. How, then, can we measure the degree of correlation between two undefined attributes $A_i$ and $A_j$, $i \ne j$?

5.1 Fuzzy Correlation

There is no absolute formula for evaluating a measurement, because the formula depends on the viewpoint of the measurer. Likewise, there is no absolute formula for the degree of correlation between two undefined attributes, and a correlation measure may be accepted as long as it does not contradict human intuition. The correlation of fuzzy sets, commonly called fuzzy correlation, is therefore an accepted evaluation tool [10][17][22][70]. In 1991, Gerstenkorn and Manko proposed a correlation between two fuzzy sets $A$ and $B$, $k(A, B)$, defined as

\[ k(A, B) = \frac{C(A, B)}{\sqrt{T(A)\,T(B)}} \]
\[ C(A, B) = \sum_{i=1}^{n} \left( \mu_A(x_i)\,\mu_B(x_i) + \nu_A(x_i)\,\nu_B(x_i) \right) \]
\[ T(A) = \sum_{i=1}^{n} \left( \mu_A(x_i)^2 + \nu_A(x_i)^2 \right), \qquad T(B) = \sum_{i=1}^{n} \left( \mu_B(x_i)^2 + \nu_B(x_i)^2 \right) \]

where $\mu_A : X \to [0, 1]$, $\nu_A : X \to [0, 1]$, and $0 \le \mu_A(x) + \nu_A(x) \le 1$; $\mu_B : X \to [0, 1]$, $\nu_B : X \to [0, 1]$, and $0 \le \mu_B(x) + \nu_B(x) \le 1$, for all $x \in X$. Here $\mu_A(x)$ is the membership degree of $x$ in $A$ and $\nu_A(x)$ is the non-membership degree of $x$ in $A$; likewise, $\mu_B(x)$ is the membership degree of $x$ in $B$ and $\nu_B(x)$ is the non-membership degree of $x$ in $B$. According to this definition, the value of $k(A, B)$ lies in $[0, 1]$, and Gerstenkorn and Manko note that if $A = B$, then $k(A, B) = 1$.

However, this definition expresses only a degree of correlation between two fuzzy sets, and it is still inconvenient in real applications. In 1999, Ding-An Chiang and Nancy P. Lin proposed a correlation between two fuzzy sets that combines the concept of correlation in traditional statistics with fuzzy theory [74][75]. Assume that there is a random sample $x_1, x_2, \ldots, x_n$ from a sample space $X$, $x_1, x_2, \ldots, x_n \in X$, along with a sequence of paired data $(\mu_A(x_1), \mu_B(x_1)), \ldots, (\mu_A(x_n), \mu_B(x_n))$ that corresponds to the grades of the membership functions of the fuzzy sets $A$ and $B$ defined on $X$. The correlation coefficient $\rho_{A,B}$ between the fuzzy sets $A$ and $B$ is defined as

\[ \rho_{A,B} = \frac{\sum_{i=1}^{n} \left( \mu_A(x_i) - \bar{\mu}_A \right) \left( \mu_B(x_i) - \bar{\mu}_B \right) / (n - 1)}{S_A\, S_B} \]

where

\[ \bar{\mu}_A = \frac{\sum_{i=1}^{n} \mu_A(x_i)}{n}, \qquad S_A^2 = \frac{\sum_{i=1}^{n} \left( \mu_A(x_i) - \bar{\mu}_A \right)^2}{n - 1}, \qquad S_A = \sqrt{S_A^2}; \]
\[ \bar{\mu}_B = \frac{\sum_{i=1}^{n} \mu_B(x_i)}{n}, \qquad S_B^2 = \frac{\sum_{i=1}^{n} \left( \mu_B(x_i) - \bar{\mu}_B \right)^2}{n - 1}, \qquad S_B = \sqrt{S_B^2}. \]

In particular, $|\rho_{A,B}| = 1$ when $\mu_B(x_i) = a\,\mu_A(x_i) + b$ for some constants $a \ne 0$ and $b$.
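The correlation coefficient between two fuzzy sets defined above can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function name `fuzzy_correlation` and the sample membership grades are hypothetical and are not taken from the thesis, whose experiments were run in Matlab.

```python
import math

def fuzzy_correlation(mu_a, mu_b):
    """Correlation of two membership-grade sequences (sample form, n - 1 divisor)."""
    n = len(mu_a)
    if n != len(mu_b) or n < 2:
        raise ValueError("need two sequences of equal length n >= 2")
    mean_a = sum(mu_a) / n
    mean_b = sum(mu_b) / n
    # Sample variances S_A^2 and S_B^2.
    var_a = sum((a - mean_a) ** 2 for a in mu_a) / (n - 1)
    var_b = sum((b - mean_b) ** 2 for b in mu_b) / (n - 1)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for constant membership grades")
    # Sample covariance of the paired membership grades.
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(mu_a, mu_b)) / (n - 1)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical membership grades of four elements in three fuzzy sets.
mu_a = [0.1, 0.4, 0.7, 0.9]
mu_b = [0.2, 0.5, 0.8, 1.0]   # mu_b = mu_a + 0.1, a linear relation with a > 0
mu_c = [0.9, 0.6, 0.3, 0.1]   # mu_c = 1 - mu_a, a linear relation with a < 0

print(fuzzy_correlation(mu_a, mu_b))   # close to +1
print(fuzzy_correlation(mu_a, mu_c))   # close to -1
```

The two linear examples land at the ends of the range, which matches the property that the coefficient reaches +1 or -1 when one membership function is a linear function of the other.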
According to the above definition, the correlation $\rho_{A,B}$ of the fuzzy sets $A$ and $B$ lies in $[-1, 1]$. Moreover, $\rho_{A,B}$ gives not only the degree of correlation between the fuzzy sets $A$ and $B$ but also its direction: positive correlation ($\rho_{A,B} > 0$), negative correlation ($\rho_{A,B} < 0$), or no correlation ($\rho_{A,B} = 0$).

5.2 Fuzzy Correlation for Multi-Categorization of Documents

Traditionally, document categorization assigns each unclassified document, according to its content, to a single predefined class. This form of document categorization is called single-categorization of documents. However, a document may involve several subjects, and the predefined categories are not completely independent of one another, so it is questionable whether each document should belong to only one specific category. Under certain conditions an unclassified document needs to belong to several categories. This form of document categorization is called multi-categorization of documents.

Many methods have been proposed for the single-categorization problem, but they are not suitable for multi-categorization of documents. To solve the multi-categorization problem, we employ fuzzy correlation. Fuzzy correlation not only gives the degree of correlation but also indicates whether a document and a category are positively correlated, negatively correlated, or uncorrelated. With this correlation we can extend the traditional single-categorization method to the multi-categorization problem.

To apply fuzzy correlation [10], we need to define a keyword set and use it to obtain the degree of correlation between an unclassified document and each category.
First, we define the keyword set that is used to estimate the degree of correlation between unclassified documents and each category. Let the keyword set be $X = \{wk_1, wk_2, \ldots, wk_n\}$. For a document $T$ and a category $C_i$, we obtain two sequences of values, $\mu_T(wk_1), \mu_T(wk_2), \ldots, \mu_T(wk_n)$ and $\mu_{C_i}(wk_1), \mu_{C_i}(wk_2), \ldots, \mu_{C_i}(wk_n)$. Here $\mu_T$ and $\mu_{C_i}$ are membership values, and $\mu_T(wk_j)$ and $\mu_{C_i}(wk_j)$, $j = 1, \ldots, n$, express how significant the keyword $wk_j$ is in the unclassified document $T$ and in the category $C_i$, respectively. We can then use the fuzzy correlation $\rho_{T,C_i}$ as the degree of correlation between the unclassified document $T$ and the category $C_i$:

\[ \rho_{T,C_i} = \frac{\sum_{j=1}^{n} \left( \mu_T(wk_j) - \bar{\mu}_T \right) \left( \mu_{C_i}(wk_j) - \bar{\mu}_{C_i} \right) / (n - 1)}{S_T\, S_{C_i}} \]

where

\[ \bar{\mu}_T = \frac{\sum_{j=1}^{n} \mu_T(wk_j)}{n}, \qquad S_T^2 = \frac{\sum_{j=1}^{n} \left( \mu_T(wk_j) - \bar{\mu}_T \right)^2}{n - 1}, \qquad S_T = \sqrt{S_T^2}; \]
\[ \bar{\mu}_{C_i} = \frac{\sum_{j=1}^{n} \mu_{C_i}(wk_j)}{n}, \qquad S_{C_i}^2 = \frac{\sum_{j=1}^{n} \left( \mu_{C_i}(wk_j) - \bar{\mu}_{C_i} \right)^2}{n - 1}, \qquad S_{C_i} = \sqrt{S_{C_i}^2}; \]
\[ \mu_T(wk_j) = \frac{n_T(wk_j)}{n_T(wk_j) + n_F(wk_j)}, \qquad \mu_{C_i}(wk_j) = \frac{n_{C_i}(wk_j)}{n_{C_i}(wk_j) + n_F(wk_j)} \]

($n_T(wk_j)$ is the frequency with which the keyword $wk_j$ appears in the document $T$; $n_{C_i}(wk_j)$ is the frequency with which $wk_j$ appears in the category $C_i$; $n_F(wk_j)$ is the frequency with which $wk_j$ appears in the Frequency List (FL) table.)

When $\rho_{T,C_i} > 0$, the unclassified document $T$ and the category $C_i$ are positively correlated; when $\rho_{T,C_i} < 0$, they are negatively correlated; and when $\rho_{T,C_i} = 0$, they are uncorrelated.

Alternatively, we can compute the fuzzy correlation [10] between the document $T$ and each category $C_i$ in another way.
We use the membership functions $\mu_{T_m}(wk_i)$ and $\mu_{C_l}(wk_i)$ of fuzzy theory to measure how significant each keyword $wk_i$ is in the document $T_m$ and in the category $C_l$, and we represent the significance of the keywords in a document by a fuzzy keyword set:

\[ FX(wk_i) = \{ (wk_1, \mu_{T_m}(wk_1)), (wk_2, \mu_{T_m}(wk_2)), \ldots, (wk_n, \mu_{T_m}(wk_n)) \} \]
\[ \mu_{T_m}(wk_i) = \frac{n_{T_m}(wk_i)}{n_{T_m}(wk_i) + n_F(wk_i)}, \quad m = 1, 2, \ldots, p; \; i = 1, 2, \ldots, n \]

$n_{T_m}(wk_i)$: frequency with which the keyword $wk_i$ appears in the document $T_m$
$n_F(wk_i)$: frequency with which the keyword $wk_i$ appears in the frequency list FL

In the same way, we use a membership function to represent how significant the keywords $wk_i$ are in each category $C_l$:

\[ FX(wk_i) = \{ (wk_1, \mu_{C_l}(wk_1)), (wk_2, \mu_{C_l}(wk_2)), \ldots, (wk_n, \mu_{C_l}(wk_n)) \} \]
\[ \mu_{C_l}(wk_i) = \frac{n_{C_l}(wk_i)}{n_{C_l}(wk_i) + n_F(wk_i)}, \quad l = 1, 2, \ldots, k; \; i = 1, 2, \ldots, n \]

$n_{C_l}(wk_i)$: frequency with which the keyword $wk_i$ appears in the category $C_l$
$n_F(wk_i)$: frequency with which the keyword $wk_i$ appears in the frequency list FL

Then we can compute the membership of the unclassified document $T_m$ in the category $C_l$:

\[ \mu_{C_l}(T_m) = \frac{\sum_{i=1}^{n} \mu_{T_m}(wk_i)\, \mu_{C_l}(wk_i)}{\sum_{i=1}^{n} \mu_{T_m}(wk_i)}, \quad m = 1, 2, \ldots, p; \; l = 1, 2, \ldots, k; \; i = 1, 2, \ldots, n \]

Once every unclassified document and every category is represented by fuzzy membership values, we can compute the membership degree of each unclassified document in multiple categories, which is what multi-categorization of documents requires. We apply the intersection of fuzzy theory, $C_l \cap C_j$, $l, j = 1, 2, \ldots, k$, to compute this membership degree. However, we compute the intersection only after the multi-class SVMs classifier has assigned each document to one particular category. We then intersect that particular category (given by the multi-class SVMs) with each of the other categories to compute the membership degree.
\[ \mu_{C_l \cap C_j}(T_m) = \min\left( \mu_{C_l}(T_m), \mu_{C_j}(T_m) \right) \]

We then apply an $\alpha$-cut threshold, taking the minimum value that satisfies the restrictive condition, which still leaves appropriate flexibility for multi-categorization:

\[ \alpha = \min \max_{l, j} \mu_{C_l \cap C_j}(T_n) \]

In this way an unclassified document may belong to other categories in addition to the one particular category assigned by the multi-class SVMs:

\[ S_{C_l} = \{ T \mid \mu_{C_l}(T) \ge \alpha \}, \quad l = 1, 2, \ldots, k \]

For the multi-class classification and multi-label categorization problems, we can adopt either measure of fuzzy correlation described in this section together with the outputs of all pairwise-coupled SVM binary classifiers. In both cases, a document is assigned to the classes with the larger confidences for the multi-label categorization problem, and to the class with the maximum confidence for the multi-class classification problem. For multi-label categorization, this post-processing thresholding step is independent of the learning step. The critical step in thresholding is to determine the value, known as the threshold, above which the confidence measures (fuzzy correlations, in our approach) are considered large. We adopt the positive fuzzy correlation coefficients as the larger confidence measures and the $\alpha$-cut as the threshold.

CHAPTER 6 FUZZY CORRELATION AND SUPPORT VECTOR LEARNING

6.1 Framework of The Proposed Approach

Figure 6.1 Framework of document categorization (N documents are processed by machine learning and fuzzy correlation)

Figure 6.1 summarizes document categorization. In the document categorization model, the webpage data are the input documents and the outputs are the predefined categories to which the webpages belong, for example the four categories earn, acquisitions, money-fx, and grain.

Document categorization can be handled by machine learning methods such as fuzzy rules, neural networks, data warehousing, naïve Bayes, Bayesian networks, and rough sets.
Figure 6.2 Framework of the proposed approach (Input: N documents; Preprocess: TFIDF and feature selection; Process: SVMs and fuzzy correlation; Output: Class E, Class S, Class M, Class F)

We take the frequencies of the words in the webpages as input and, in a pre-processing stage, transform them into weights with TFIDF (term frequency inverse document frequency). The vectors obtained after feature selection are the input vectors. They are processed by the multi-class model, support vector learning, which has a training phase and a test phase, and the category is finally decided from the output values {+1, -1}.

In this thesis, we use support vector machines (SVMs) for document categorization. The input data come from the Reuters data set. There are three pre-processing phases: 1) computing word frequency, 2) processing stop words, and 3) feature selection. We employ the OAO-SVMs (one-against-one SVMs) strategy to overcome common shortcomings of the multi-class categorization methods and to make the whole training architecture more reasonable. The output values {+1, -1} finally determine the categories.

Input Data & Approaches:
  Reuters-21578 collection of documents
Pre-process:
  1. Web Frequency Indexer, computing word frequency
  2. Removing stop words manually
  3. Using TFIDF to execute feature selection
Process:
  1. One-against-One (OAO) SVMs (Multi-Class SVMs)
  2. Fuzzy correlation
Output:
  Output class

In our thesis, we use the Reuters-21578 collection as the benchmark data set for document categorization. We use part of the collection, taken from 2,000 documents, to test and verify our approach. The input is a multi-dimensional vector; after the pre-processing stages, the input vectors for the documents $T_1$ to $T_m$ are as follows:

Input vector =

        CTS    CORP   SHR    ...   MTHS
T1      3.1    1.4    4.0    ...   3.4
T2      0      1.2    2.4    ...   0
...     ...    ...    ...    ...   ...
...     2.4    0      5.1    ...   ...
Tm      0      1.4    0      ...   4.21

As shown above, the input is a multi-dimensional matrix.
The top row lists the feature words of all documents. Each document contains at least 50 to 70 features on average, so the dimension is quite large, and hundreds of features remain even after feature selection. The left column lists the documents. Each value in the matrix is the weight of a feature. If a keyword appears in a document, its weight is a positive number (1.4, 4.21, ...) in the matrix; otherwise its weight is zero.

6.2 Pre-processing

Before the input vectors can be built, every document must be transformed; SVMs can process the data only after pre-processing. Our pre-processing phases are normalizing and formatting the data, computing word frequency, processing stop words, and feature selection. These phases are indispensable preparation for document categorization.

A. Computing word frequency

We pick out the significant words in every document with a program and transform these words into values. The frequency and weight of each word show how important it is in each document. Computing the word frequency of every document is therefore the first stage of document categorization. Choosing a method for computing word frequency is difficult because there are so many candidates. For convenience and to save time, we chose a webpage on the network that computes word frequency. It is easy to obtain, transforms the text quickly, needs no external plug-in program, and works on any system. The webpage we chose as the word frequency tool is the Web Frequency Indexer of Georgetown University.
We do not need to input a network address; we simply copy the text of a document, paste it into the page, and run the transformation, and the webpage computes the word frequency as follows:

Web frequency indexer of Georgetown University
( http://www.georgetown.edu/faculty/ballc/webtools/web_freqs.html )

Figure 6.3 Web frequency indexer webpage

Figure 6.4 Web frequency indexer computes word frequency

B. Processing stop words

Stop words are prepositions (of, in, on, about, for, …), conjunctions (when, while, how, and, but, …), articles (an, a, the, this, that, these, those, another, others, …), numerals (#12, 11, 4, …), auxiliary verbs (will, may, should, can, …), expletives (oh, wow, …), symbols (%, $, #, …), and so on. Stop words carry no meaning for document categorization, and too many of them interfere with classification accuracy. We therefore have to remove the stop words from the input vectors to prevent unnecessary interference and to reduce the dimension of the input vectors. However, there is so far no effective automatic way to remove stop words; in practice they are still identified manually, so handling them is quite time-consuming.

Figure 6.5 Picking out stop words in bags of words (for example, the bag {shr, shr, at, $, #12, the, corp, mths, cts, a, corp} of Class M is reduced to {shr, corp, mths, cts})

C. Feature selection

In the pattern recognition domain, there are two ways to reduce dimension: feature selection and feature extraction. In the document categorization domain, dimension reduction concentrates on feature selection; feature extraction is generally not the mainstream approach.

There are two kinds of feature selection methods: threshold methods and information theory methods. In general, threshold methods are the ones commonly used for feature selection.

C.1 Threshold methods:

1. Document frequency thresholding, DF
   DF thresholding is a simple way to reduce the vocabulary. We compute the DF of each term in the training set and remove the terms whose DF is below a preset threshold.
2. Information gain, IG
   IG is commonly used in machine learning to select the best terms or strings. The IG of a term for every category is obtained from the presence or absence of the term.
3. Mutual information, MI
   MI is a common criterion in statistical language modelling. It measures the relation between one term and another.
4. $\chi^2$ statistic, CHI
   The $\chi^2$ statistic is used in statistical analysis to test the independence of events. A contingency table of the occurrences of terms and categories is constructed.
5. Term frequency inverse document frequency, TFIDF
   TFIDF combines term frequency with inverse document frequency. It is the most widely used method in document categorization. TFIDF not only computes the weights of words but also normalizes the length of the document vectors, so it handles both the representation of documents and feature selection.
6. Others: odds ratio, weirdness, term strength (TS), …

C.2 Information theory methods:

1. Signal-to-noise ratio
   This method applies to the words of documents with a particular structure.
2. Feature clustering
   The goal is to find similar features and gather them into a cluster; every new cluster becomes a new feature that is regarded as a concept.

6.3 Fuzzy Correlation and One-against-One (OAO) SVMs for Multi-Categorization of Documents

Our categorization architecture combines fuzzy correlation with OAO SVMs (one-against-one SVMs). The reasons for adopting SVMs were introduced in Chapter 1: the properties of documents make them well suited to support vector learning.
For multi-class categorization, we adopt one-against-one (OAO) SVMs because the OAO classifiers strategy is superior to the other multi-class SVM strategies (one-against-rest (OAR) SVMs, hierarchies or trees of binary SVMs, and decision directed acyclic graph (DDAG) SVMs). Besides OAO, to solve the multi-categorization problem we also apply fuzzy correlation to compute the degree of correlation between every document and every category. Fuzzy correlation shows to what degree each document is related to each category, so that we can classify the documents more appropriately with OAO.

The following example illustrates the inner framework. Assume four categories, music (M), food (F), sport (S), and education (E), with three music webpages ($M_1, M_2, M_3$), four food webpages ($F_1, F_2, F_3, F_4$), five sport webpages ($S_1, S_2, S_3, S_4, S_5$), and six education webpages ($E_1, E_2, E_3, E_4, E_5, E_6$). The architecture uses multiple SVM classifiers because one SVM can separate only two categories: its output is +1 for the category on the left side and -1 for the category on the right side. We train the SVMs on these data.
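The pairing and voting in this example can be sketched as follows. This is a minimal illustration, not the thesis implementation: the helper names `oao_pairs` and `majority_vote`, the pairwise outcomes, and the alphabetical tie-break are assumptions made only for this sketch.

```python
from collections import Counter
from itertools import combinations

def oao_pairs(categories):
    """All unordered category pairs, that is, N(N-1)/2 binary problems."""
    return list(combinations(categories, 2))

def majority_vote(pair_winners):
    """pair_winners maps each (a, b) pair to the category its classifier chose."""
    votes = Counter(pair_winners.values())
    top = max(votes.values())
    # Tie-break alphabetically; this rule is an assumption of the sketch.
    return min(c for c, v in votes.items() if v == top)

categories = ["M", "F", "S", "E"]   # music, food, sport, education
pairs = oao_pairs(categories)
print(len(pairs))                   # 6 = 4 * 3 / 2

# Hypothetical outputs of the six binary classifiers for one test webpage.
winners = {
    ("M", "F"): "M", ("M", "S"): "S", ("M", "E"): "M",
    ("F", "S"): "S", ("F", "E"): "F", ("S", "E"): "S",
}
print(majority_vote(winners))       # S (three votes)
```

With four categories the sketch builds six pairwise problems, and the test webpage is assigned to the category that wins the most pairwise decisions.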
Figure 6.6 Multi-class SVMs (OAO-SVMs) architecture I (general OAO-SVMs: one SVM for each of the six pairs of the categories M, F, S, and E)

Figure 6.7 Multi-class SVMs (OAO-SVMs) architecture II (majority voting scheme: a tree of SVMs with outputs +1 and -1; labels shown in the figure: Dancing Course, Cooking Course)

In Figure 6.7, we can clearly set the music category to +1 and the education category to -1 for training. But should the remaining categories (for example, sport and food) be set to +1 or to -1? If this cannot be decided, support vector learning cannot classify. In practice a researcher decides in advance which category is +1 and which is -1; in other words, the food and sport categories are assigned to +1 or -1 in the first classifier beforehand. In this situation we still do not know whether the food and sport categories are closer to music or to education. The one-against-one classifiers strategy, the hierarchies or trees of binary SVM classifiers strategy, and the decision directed acyclic graph all share this problem: when a webpage is classified, it is unclear to which category the remaining webpages (or documents) should belong in the next-level classifier.

Is there a method that solves this problem? The idea resembles clustering: we need to know whether the remaining webpages (or documents) in the next-level classifier are closer to the music category (+1) or to the education category (-1). Consider the example of movie rating categorization.
A movie can be classified into five categories: general audiences (G), parental guidance suggested (PG), parents strongly cautioned (PG-13), restricted (R), and no one 17 and under admitted (NC-17). If we classify into only two categories, G and NC-17 can be separated immediately. For the other categories (PG, PG-13, and R), however, we cannot clearly define which of the two categories (G (+1) or NC-17 (-1)) they belong to. We therefore need a method to evaluate the remaining types: G, PG, and PG-13 may belong to one category (+1) and R and NC-17 to the other (-1); or G and PG may belong to one category (+1) and PG-13, R, and NC-17 to the other (-1); and so on.

Once the problem is understood, how do we select an appropriate way to classify the data in such a fuzzy region? To solve this problem, we can use a fuzzy membership for every webpage (or document) or modify the training state (we still use the SVMs classifier because of its advantages, described in Chapter 2).

In the multi-class SVMs classifiers strategy, pairwise categorization (OAO SVMs and OAR SVMs classifiers) is used to solve the traditional multi-class SVMs problem of resetting the category (+1 class or -1 class) of unclear data. OAO-SVMs with the majority voting scheme and OAR-SVMs with the maximum distance solve this multi-class SVMs problem (resetting the +1 class and the -1 class). Previously, pairwise categorization [76][77] was used in ECOC (Error-Correcting Output Coding) with a distance measure (e.g., Hamming distance or Hamming code) for classification (text, pattern, image, …) to improve categorization accuracy; another way is to compute, by probability, a likelihood value for every document belonging to every category.

We apply OAO-SVMs with fuzzy correlation to solve multi-categorization of documents. OAO-SVMs is one of the multi-class SVMs classifiers strategies and is also a pairwise categorization classifier. OAO-SVMs has higher accuracy than OAR-SVMs and, at the same accuracy, is easier to understand than DDAG-SVMs.
However, OAO-SVMs has the problem that the {+1, -1} category of every classified data point is reset each time. To improve this, we use fuzzy correlation with the OAO-SVMs classifier to solve both the category reset and the multi-categorization of documents. Our method is as follows: the OAO-SVMs architecture has $N(N-1)/2$ classifiers, and we add a voting scheme (fuzzy correlation) that evaluates which category (music, food, sport, or education) the test data belong to and computes the correlation between document and category.

Figure: OAO-SVMs with a fuzzy correlation voting strategy. The training webpages $M_1, \ldots, M_3, F_1, \ldots, F_4, S_1, \ldots, S_5, E_1, \ldots, E_6$ feed six pairwise SVMs (M/F, F/S, S/E, E/M, M/S, F/E). The voting strategy (fuzzy correlation) outputs, for the test data, the probability of belonging to the music (M), food (F), sport (S), and education (E) classes, together with the fuzzy correlations M & F, F & S, S & E, E & M, M & S, and F & E.

Figure 6.8 Improving pre-set states in training data (margin and optimal hyperplane)

In the classifier phase, we still adopt SVMs because SVMs achieve the best categorization accuracy on webpages (or documents) compared with other classifiers, for example naïve Bayes, K-nearest neighbors, Rocchio, C4.5, LLSF, NNet, DNF, and decision trees: (SVM, KNN > LLSF > multilayer perceptrons >> multinomial naïve Bayes) [31][32], (SVM 0.864 > KNN 0.823 > Rocchio 0.799, C4.5 0.794 > naïve Bayes 0.72) [33][34], and (SVM 0.66 > KNN 0.591 > naïve Bayes 0.57 > Rocchio 0.566 > C4.5 0.50) [69], as described in Chapter 2. In the other phase, however, we can use the fuzzy membership of the data (webpages or documents), so that data with a fuzzy value are classified according to that value.
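One way to read the combination of OAO-SVMs with fuzzy correlation is sketched below. It is an illustration under stated assumptions rather than the thesis implementation: the function name `assign_categories`, the correlation values, and the threshold 0.3 are hypothetical, and the rule "keep the SVM category and add every other category whose correlation is positive and at least alpha" is one interpretation of the positive-correlation and alpha-cut description.

```python
def assign_categories(svm_category, correlations, alpha):
    """Return the SVM category plus every other category whose fuzzy
    correlation with the document is positive and at least alpha."""
    labels = [svm_category]
    # Visit the categories from the highest correlation to the lowest.
    for category, rho in sorted(correlations.items(), key=lambda kv: -kv[1]):
        if category != svm_category and rho > 0 and rho >= alpha:
            labels.append(category)
    return labels

# Hypothetical fuzzy correlations between one test webpage and four categories.
correlations = {"M": 0.62, "F": -0.10, "S": 0.05, "E": 0.41}

# Suppose the OAO-SVMs vote selected music (M) for this webpage.
print(assign_categories("M", correlations, alpha=0.3))   # ['M', 'E']
```

A lower alpha admits more categories per document and a higher alpha fewer, which is the flexibility that the threshold is meant to provide.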
Figure 6.9 Improving two-class SVMs in gray region (a fuzzy membership function and SVMs assign the webpages $M_1, \ldots, M_3, F_1, \ldots, F_4, S_1, \ldots, S_5, E_1, \ldots, E_6$ of the gray region to one of the two classes)

CHAPTER 7 EXPERIMENTAL RESULTS

7.1 Experimental Data Source

A. Experimental Data Source

Our experimental data come from the Reuters-21578 data set, which was provided by David D. Lewis of AT&T Labs in 1997. The data set is freely available at http://www.research.att.com/~lewis. It is the most commonly used benchmark in document categorization. It contains about twenty thousand documents of differing lengths and over one hundred eighteen categories of documents, and every document has about two hundred words on average. Most of the literature uses ten categories. In our thesis we therefore also select ten of the Reuters-21578 categories (earnings, corporate acquisition, money market, grain, crude, trade, interest, wheat, ship, corn). From these ten categories, we randomly select 1,500 training documents and 50 test documents as the research data.

B. Experimental Data Form

The Reuters-21578 data set is in SGML form. SGML tags, matched to the corresponding document type definition (DTD), mark explicit boundaries for each significant portion of a document: title, categories, and content.

1. All documents in the database are classified by five categorization methods; every category is described explicitly and in detail, and every category is given as appropriate a name as possible.
2. Every document is given a new identification number (NEWID) in chronological order, and every one thousand documents are combined into one file.

There are five categorization methods for the documents in Reuters-21578: Topics, Places, People, Orgs, and Exchanges. The Topics categorization is the one most commonly used in document categorization research. The Topics in Reuters-21578 are classified into five macro-categories and one hundred thirty-five micro-categories.

Table 7.1 Documents in one category owned

Owning N documents     Categories
N ≥ 1000               2
1000 > N ≥ 100         21
100 > N > 0            97
N = 0                  15

Table 7.2 Different categories with training numbers

Dumais, S. et al.                     Our thesis
Category name         Num train      Category name    Num train
Acquisition (acq)     1650           Acq              150
Earn (earn)           2877           Earn             150
Grain (grain)         433            Grain            150
Money-fx (money)      538            Money            150
Crude (crude)         389            Crude            150
Trade (trade)         369            Trade            150
Interest (interest)   347            Interest         150
Wheat (wheat)         212            Wheat            150
Ship (ship)           197            Ship             150
Corn (corn)           182            Corn             150

Figure 7.1 Reuters-21578 dataset form (training data and test data)

7.2 Experimental Results

Because the quantity of experimental data is too large to describe in detail, we give a simple example to describe the experimental process. In our thesis, we use the engineering programming software Matlab 6.5, which lets us carry out complex matrix operations on a large scale. In the program, we use some components of the Matlab toolbox and write our own code to transform the training and test data. The experimental process of the simple example is as follows:

Step 1. Randomly select the training and test documents from the Reuters-21578 collection, and use the Web Frequency Indexer to compute the frequency of every word in every document.

Figure 7.2 Web frequency indexer computes every word frequency of every document

Step 2. Use EXCEL to separate the vocabulary from the frequencies and to annotate each entry with the serial number of its document.
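Steps 1 and 2 can be approximated by a short script. This is only a sketch under stated assumptions: it stands in for the Web Frequency Indexer and the EXCEL step rather than reproducing them, the function names `word_frequencies` and `frequency_rows` and the two sample texts are hypothetical, and tokens are taken to be runs of the letters a to z.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Lower-case the text, split it into alphabetic tokens, and count them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

def frequency_rows(documents):
    """Yield (document serial number, word, count) rows, one per distinct word."""
    for doc_id, text in enumerate(documents, start=1):
        for word, count in sorted(word_frequencies(text).items()):
            yield doc_id, word, count

# Two short hypothetical document texts.
docs = ["Shr 12 cts vs 10 cts", "Corp net shr 3 mths"]
for row in frequency_rows(docs):
    print(row)
```

Each printed row carries the document serial number, the word, and its frequency, which is the information that the following steps copy into the input files.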
Figure 7.3 Computing every word frequency and belonging to which one document

Step 3. Copy the vocabulary, frequencies, and document serial numbers into two TXT files, which are the input data read by the Matlab program.

Figure 7.4 Data form before input

Step 4. Use our Matlab program to read the TXT files of the training data and test data and transform them into the input vector form required by the SVMs.

Figure 7.5 Training data and test data before classifying

Step 5. Use the Matlab toolbox training program to read the input vectors of the training data and carry out the multi-class categorization tree process. Training yields the parameters of the decision function (training data, pre-classified categories, $\alpha$, and the value of $b$).

Figure 7.6 Decision function parameter values

Step 6. Input the trained parameters and the test data into the classification component of the Matlab toolbox program, and obtain the category to which every document belongs.

Figure 7.7 Belonging to which one category in the multi-class SVMs

We use 1,500 training documents from ten categories (acq, earn, money, grain, crude, trade, interest, wheat, ship, corn), with 150 training documents in each category, and 50 test documents as the experimental data. We use two different dimensions (50 and 300) for the training data, and the test data use the same two dimensions.
Table 7.3 Accuracy comparison with different methods

             Findsim   NBayes    BayesNets  OAO       Trees     Linear SVM  OAO
K            50        50        50         50        300       300         300
Acq          64.70%    87.80%    88.30%     56.70%    89.70%    93.70%      75.50%
Earn         92.90%    95.90%    95.80%     79.50%    97.80%    98.00%      90.00%
Grain        67.50%    78.80%    81.40%     85.40%    85.00%    94.60%      96.50%
Money        46.70%    56.60%    58.80%     84.50%    66.20%    74.50%      96.50%
Crude        70.10%    79.50%    79.60%     81.70%    85.00%    88.90%      85.50%
Trade        65.10%    63.90%    69.00%     72.00%    72.50%    75.90%      78.50%
Interest     63.40%    64.90%    71.30%     70.20%    67.10%    77.70%      83.50%
Wheat        68.90%    69.70%    82.70%     86.50%    92.50%    91.80%      91.00%
Ship         49.20%    85.40%    84.40%     85.00%    74.20%    85.60%      88.60%
Corn         48.20%    48.20%    76.40%     78.50%    91.80%    90.30%      92.50%
Avg Top 10   63.67%    73.07%    78.77%     78.00%    82.18%    87.09%      87.81%

OAO: Using One-against-One SVM learning method
K: Dimension length

Figure 7.8 Accuracy with ten categories in the 50, 300-dimension (bar charts, y-axis: Accuracy (%); panels: ten categories in the 50-dimension with Findsim, NBayes, BayesNets, and OAO; ten categories in the 300-dimension with Trees, Linear SVM, and OAO)

Table 7.4 Accuracy in the different dimensions

             OAO       OAO
K            50        300
Acq          56.70%    75.50%
Earn         79.50%    90.00%
Grain        85.40%    96.50%
Money        84.50%    96.50%
Crude        81.70%    85.50%
Trade        72.00%    78.50%
Interest     70.20%    83.50%
Wheat        86.50%    91.00%
Ship         85.00%    88.60%
Corn         78.50%    92.50%
Avg Top 10   78.00%    87.81%

(Bar charts, y-axis: Avg-accuracy (%); panels: different learning machines in the 50-dimension with Findsim, NBayes, BayesNets, and OAO; different learning machines in the 300-dimension with Trees, Linear SVM, and OAO)
Figure 7.9
Average accuracy in the different learning machines

Different dimensions give different experimental results, and the higher the dimension, the higher the accuracy we obtain. A lower dimension (50) means that fewer keywords are used than at a higher dimension (300). Therefore, in terms of accuracy, the 50-dimension representation is not as good as the 300-dimension one.

Step 7. Apply fuzzy correlation to the classified categories. It gives different weights to data points of different importance and makes the classification more flexible; fuzzy correlation can reduce overfitting and improve performance.

Figure 7.10 Computing keywords in the different categories

Table 7.5 Keyword frequency of every document in the different categories

             Acq    Earn   Grain  Money  Crude  Trade  Interest  Wheat  Ship  Corn
Document 1   886    1327   460    949    253    138    92        21     45    14
Document 2   743    893    620    1617   557    296    177       45     75    24
Document 3   943    621    324    2128   251    220    393       47     67    7
Document 4   283    114    200    293    130    167    22        17     38    6
Document 5   262    772    1498   752    160    85     10        60     33    56

Figure 7.11 Keyword frequencies in the ten categories (bar chart of the five documents over the ten categories)

Table 7.6 Correlation between every document and ten categories

Pair             Document 1           Document 2           Document 3           Document 4           Document 5
Acq/Earn         0.120426 (Earn)      0.188062 (Acq)       0.097418 (Acq)       0.187947 (Acq)       0.750464 (Earn)
Acq/Grain        0.084163 (Grain)     0.188062 (Acq)       0.097418 (Acq)       0.187947 (Acq)       0.744488 (Acq)
Acq/Money        0.069005 (Acq)       0.188062 (Acq)       0.097418 (Acq)       0.187947 (Acq)       0.746495 (Money)
Acq/Crude        0.119112 (Crude)     0.226254 (Crude)     0.097418 (Acq)       0.187947 (Acq)       0.744488 (Acq)
Acq/Trade        0.069005 (Acq)       0.188062 (Acq)       0.097418 (Acq)       0.072241 (Trade)     0.744488 (Acq)
Acq/Interest     0.069005 (Acq)       0.188062 (Acq)       0.097418 (Acq)       0.086886 (Interest)  0.744488 (Acq)
Acq/Wheat        0.092289 (Wheat)     0.188062 (Acq)       0.097418 (Acq)       0.009547 (Wheat)     0.744488 (Acq)
Acq/Ship         0.069005 (Acq)       0.188062 (Acq)       0.097418 (Acq)       0.187947 (Acq)       0.744488 (Acq)
Acq/Corn         0.069005 (Acq)       0.188062 (Acq)       0.158219 (Corn)      0.075291 (Corn)      0.744488 (Acq)
Earn/Grain       0.120426 (Earn)      0.044604 (Earn)      0.050268 (Grain)     0.216264 (Earn)      0.750464 (Earn)
Earn/Money       0.120456 (Earn)      0.174060 (Money)     0.147439 (Earn)      0.216264 (Earn)      0.750464 (Earn)
Earn/Crude       0.120456 (Earn)      0.226254 (Crude)     0.105767 (Crude)     0.216264 (Earn)      0.750464 (Earn)
Earn/Trade       0.120456 (Earn)      0.044604 (Earn)      0.147439 (Earn)      0.072241 (Trade)     0.750464 (Earn)
Earn/Interest    0.120456 (Earn)      0.150447 (Interest)  0.058989 (Interest)  0.086886 (Interest)  0.750464 (Earn)
Earn/Wheat       0.120456 (Earn)      0.044604 (Earn)      0.075739 (Wheat)     0.009547 (Wheat)     0.750464 (Earn)
Earn/Ship        0.120456 (Earn)      0.109310 (Ship)      0.021144 (Ship)      0.216264 (Earn)      0.750464 (Earn)
Earn/Corn        0.120456 (Earn)      0.044604 (Earn)      0.158219 (Corn)      0.075291 (Corn)      0.750464 (Earn)
Grain/Money      0.084163 (Grain)     0.174060 (Money)     0.050268 (Grain)     0.289866 (Money)     0.746495 (Money)
Grain/Crude      0.119113 (Crude)     0.226254 (Crude)     0.050268 (Grain)     0.271311 (Crude)     0.728464 (Crude)
Grain/Trade      0.084163 (Grain)     0.022044 (Grain)     0.050268 (Grain)     0.072241 (Trade)     0.553465 (Grain)
Grain/Interest   0.084163 (Grain)     0.150447 (Interest)  0.058989 (Interest)  0.086886 (Interest)  0.553465 (Grain)
Grain/Wheat      0.092287 (Wheat)     0.022044 (Grain)     0.050268 (Grain)     0.009547 (Wheat)     0.553465 (Grain)
Grain/Ship       0.084163 (Grain)     0.109310 (Ship)      0.021144 (Ship)      0.257325 (Ship)      0.677528 (Ship)
Grain/Corn       0.084163 (Grain)     0.030244 (Corn)      0.158219 (Corn)      0.075291 (Corn)      0.553465 (Grain)
Money/Crude      0.119113 (Crude)     0.226254 (Crude)     0.105767 (Crude)     0.271311 (Crude)     0.746495 (Money)
Money/Trade      0.057845 (Trade)     0.174060 (Money)     0.151960 (Money)     0.072241 (Trade)     0.746495 (Money)
Money/Interest   0.044430 (Money)     0.174060 (Money)     0.058989 (Interest)  0.086886 (Interest)  0.746495 (Money)
Money/Wheat      0.092289 (Wheat)     0.174060 (Money)     0.075739 (Wheat)     0.009547 (Wheat)     0.746495 (Money)
Money/Ship       0.044430 (Money)     0.174060 (Money)     0.021144 (Ship)      0.257325 (Ship)      0.746495 (Money)
Money/Corn       0.060037 (Corn)      0.174060 (Money)     0.158219 (Corn)      0.075291 (Corn)      0.746495 (Money)
Crude/Trade      0.119113 (Crude)     0.226254 (Crude)     0.105767 (Crude)     0.072241 (Trade)     0.728464 (Crude)
Crude/Interest   0.119113 (Crude)     0.226254 (Crude)     0.058989 (Interest)  0.086886 (Interest)  0.728464 (Crude)
Crude/Wheat      0.119113 (Crude)     0.226254 (Crude)     0.075739 (Wheat)     0.009547 (Wheat)     0.728464 (Crude)
Crude/Ship       0.119113 (Crude)     0.226154 (Crude)     0.021144 (Ship)      0.257325 (Ship)      0.728464 (Crude)
Crude/Corn       0.119113 (Crude)     0.226154 (Crude)     0.158219 (Corn)      0.075291 (Corn)      0.728464 (Crude)
Trade/Interest   0.057845 (Trade)     0.150447 (Interest)  0.058989 (Interest)  0.072241 (Trade)     0.108223 (Trade)
Trade/Wheat      0.092289 (Wheat)     0.077211 (Trade)     0.075739 (Wheat)     0.009547 (Wheat)     0.411780 (Wheat)
Trade/Ship       0.057845 (Trade)     0.109310 (Ship)      0.021144 (Ship)      0.072241 (Trade)     0.677529 (Ship)
Trade/Corn       0.060037 (Corn)      0.030244 (Corn)      0.158219 (Corn)      0.072241 (Trade)     0.179348 (Corn)
Interest/Wheat   0.092289 (Wheat)     0.150447 (Interest)  0.058989 (Interest)  0.009547 (Wheat)     0.503183 (Interest)
Interest/Ship    0.031982 (Ship)      0.150447 (Interest)  0.058989 (Interest)  0.086886 (Interest)  0.677529 (Ship)
Interest/Corn    0.060037 (Corn)      0.150447 (Interest)  0.158219 (Corn)      0.075291 (Corn)      0.503183 (Interest)

We employ fuzzy correlation to measure the relation between every document and every category. With fuzzy correlation we do not need a criterion such as an α-cut (a lowest membership degree that must be satisfied), and we can evaluate the direction of the relation (positive, negative, or uncorrelated) between every document and every category. A correlation coefficient is therefore more objective than a threshold decision (α-cut).
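To make the positive, negative, and uncorrelated cases concrete, the sketch below computes a correlation coefficient between two lists of membership grades. It assumes the sample-correlation form (covariance of the membership grades divided by the product of their standard deviations), which is one common definition of fuzzy correlation. The coefficient used in this thesis is defined in an earlier chapter and may differ, and all membership grades shown are made up for illustration.

```python
import math


def fuzzy_correlation(mu_a, mu_b):
    """Sample correlation of two lists of membership grades in [0, 1].

    Assumed form: r = s_ab / (s_a * s_b), with (n - 1) in the
    denominators of the covariance and variances.
    """
    n = len(mu_a)
    if n != len(mu_b) or n < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    mean_a = sum(mu_a) / n
    mean_b = sum(mu_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(mu_a, mu_b)) / (n - 1)
    var_a = sum((a - mean_a) ** 2 for a in mu_a) / (n - 1)
    var_b = sum((b - mean_b) ** 2 for b in mu_b) / (n - 1)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant memberships are treated here as uncorrelated
    return cov / math.sqrt(var_a * var_b)


# Hypothetical membership grades of four keywords in a document and
# in two categories.
doc = [0.9, 0.7, 0.2, 0.1]
similar_category = [0.8, 0.6, 0.3, 0.2]
opposite_category = [0.1, 0.2, 0.8, 0.9]

print(fuzzy_correlation(doc, similar_category) > 0)   # True: positive relation
print(fuzzy_correlation(doc, opposite_category) < 0)  # True: negative relation
```

The sign of the coefficient carries the decision, so no cut-off value has to be chosen in advance.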
The fuzzy correlation coefficient provides a flexible estimate in the multi-categorization of documents. The fuzzy correlation coefficients obtained by combining the SVM binary classifiers are shown in Table 7.6, which gives the confidence measures between the test documents and the categories. In conjunction with the forty-five pairwise-coupled SVM binary classifiers, a document is assigned to every class for which its fuzzy correlation coefficient is positive. The resulting multi-label categorization is shown in Table 7.7.

Table 7.7 Multi-categories of documents

             Acq   Earn  Grain  Money  Crude  Trade  Interest  Wheat  Ship  Corn
Document 1   ✓     ✓     ✓      ✓      ✓      ✓                ✓      ✓     ✓
Document 2   ✓     ✓     ✓      ✓      ✓      ✓      ✓                ✓     ✓
Document 3   ✓     ✓     ✓      ✓      ✓             ✓         ✓      ✓     ✓
Document 4   ✓     ✓            ✓      ✓      ✓      ✓         ✓      ✓     ✓
Document 5   ✓     ✓     ✓      ✓      ✓      ✓      ✓         ✓      ✓     ✓

C_Acq      = {Document 1, Document 2, Document 3, Document 4, Document 5}
C_Earn     = {Document 1, Document 2, Document 3, Document 4, Document 5}
C_Grain    = {Document 1, Document 2, Document 3, Document 5}
C_Money    = {Document 1, Document 2, Document 3, Document 4, Document 5}
C_Crude    = {Document 1, Document 2, Document 3, Document 4, Document 5}
C_Trade    = {Document 1, Document 2, Document 4, Document 5}
C_Interest = {Document 2, Document 3, Document 4, Document 5}
C_Wheat    = {Document 1, Document 3, Document 4, Document 5}
C_Ship     = {Document 1, Document 2, Document 3, Document 4, Document 5}
C_Corn     = {Document 1, Document 2, Document 3, Document 4, Document 5}

Figure 7.12 Correlation membership in the ten categories (y-axis: correlation membership from 0 to 1; x-axis: ten categories; one series per document, Document 1 to Document 5)

Tables 7.2 and 7.4 show that, with far fewer training data, we obtain accuracy comparable to that of other methods, and that accuracy improves as the dimension increases. After running the multi-class SVMs, we use fuzzy correlation to compute the correlation between every document and every category.
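The assignment rule described above (a document joins every category for which its coefficient is positive) and the per-category document sets can be sketched in a few lines of Python. The function names and the coefficient values below are hypothetical and are not taken from Table 7.6.

```python
def assign_labels(coefficients):
    """Return the categories whose coefficient is strictly positive."""
    return [cat for cat, r in coefficients.items() if r > 0]


def documents_per_category(labels_by_document):
    """Invert {document: [categories]} into {category: [documents]}."""
    sets = {}
    for doc, labels in labels_by_document.items():
        for cat in labels:
            sets.setdefault(cat, []).append(doc)
    return sets


# Made-up coefficients for two documents over four categories.
coefficients = {
    "Document A": {"acq": 0.19, "earn": 0.04, "wheat": -0.12, "corn": 0.0},
    "Document B": {"acq": -0.30, "earn": 0.75, "wheat": 0.41, "corn": 0.18},
}

labels = {doc: assign_labels(r) for doc, r in coefficients.items()}
print(labels["Document A"])  # ['acq', 'earn']
print(documents_per_category(labels)["earn"])  # ['Document A', 'Document B']
```

A coefficient of exactly zero is treated as "not assigned" in this sketch; whether the boundary case should count is a design choice the text does not settle.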
We employ machine learning to classify documents that are not linearly separable and then use fuzzy correlation to evaluate the degree of correlation. Machine learning helps us classify documents; it improves accuracy and reduces program running time. In addition, fuzzy correlation finds the correlation between every document and every category, so that documents can be assigned to multiple categories.

CHAPTER 8 CONCLUSION AND FUTURE WORK

8.1 Concluding Remarks

In document categorization, as science and technology become more diverse and more integrated, more and more documents cover knowledge and technology from several domains at once. The problem of multi-categorization of documents is therefore increasingly important.

In many supervised learning tasks, a learned classifier automatically induces a ranking of test examples, making it possible to determine which test examples are more likely to belong to a certain class than others. However, for many applications this ranking is not sufficient, particularly when the classification decision is cost-sensitive. In this case, it is necessary to convert the output of the classifier into well-calibrated posterior probabilities. An acknowledged deficiency of SVMs is that their uncalibrated outputs do not provide estimates of the posterior probability of class membership.

Many methods have been proposed in automatic document categorization research. However, all of these methods only determine which single pre-defined category an unclassified document should belong to; that is, they address only the single-categorization of documents. Even SVMs, which are the most appropriate method for document categorization, do not yet solve the multi-categorization problem. For multi-categorization of documents, all of these categorization methods perform worse.
In this thesis, we have presented an efficient method for producing class membership estimates for the multi-class text categorization problem. Based on SVM binary classifiers in conjunction with the class membership, our method relies on a measure of confidence given by the fuzzy correlation. This approach solves not only multi-class classification but also multi-label categorization problems.

8.2 Directions for Future Research

However, this thesis still leaves many problems to solve. To improve the learning model, there remains much room for research and application in future work.

1. How to process stop words efficiently.
2.