On the identification of unknown authors: a comparison between SVMs and non-parametric methods

Paola Cerchiello
Department of Economics, University of Pavia
paola.cerchiello@eco.unipv.it

Summary. In the text classification field, great relevance is given to the classification task. In classical settings the SVM classifier shows good performance. However, if a new document, belonging to an unknown author, has to be classified, this method fails, because it is unable to detect the text as an example of a new category. In order to overcome this problem, a new method, based on the combined employment of decision trees and the Kruskal-Wallis test, is proposed. The methodology is evaluated by means of correspondence analysis.

Key words: text classification, SVMs, Kruskal-Wallis test, decision trees, correspondence analysis

1 Introduction

With the rapid growth of online information, text categorization has become one of the key techniques for handling and organizing data in textual format. Text categorization techniques are an essential part of text mining and are used, for instance, to classify news documents and to find interesting information among the many websites online. Since building text classifiers by hand is difficult, time-consuming and often inefficient, it is advantageous to learn classifiers from examples.

Text categorization is the task of assigning a Boolean value to each pair

$(d_j, c_i) \in D \times C$    (1)

where $D$ is a domain of documents and $C = \{c_1, \ldots, c_{|C|}\}$ is a set of predefined categories. A value of $T$ assigned to $(d_j, c_i)$ indicates a decision to file $d_j$ under $c_i$, and vice versa.

In this context, classical classification methods such as support vector machines [J98] and the naive Bayes classifier [KHZ00] have shown performance strictly connected to, and dependent on, their structural elements. However, if a new document, belonging to an unknown author, has to be classified, these methods fail, because they are unable to detect the text as an example of a new category. In order to show the main results of the proposed methodology, a variant of the standard Reuters database, containing 114 different authors and 468 words, has been employed.

2 Classical methods

According to the text classification literature, the best results in the field are obtained by means of support vector machines (SVMs). They are a set of related supervised learning methods used for both classification and regression. When used for classification, the SVM algorithm builds the hyperplane that separates the data into two classes with maximum margin. In the simplest binary case, given training examples labelled either "yes" or "no", a maximum-margin hyperplane is identified which splits the "yes" from the "no" training examples, in such a way that the distance between the hyperplane and the closest examples (the margin) is maximized.

That being so, we applied support vector machines to a sample containing three different authors (numbers 5, 60 and 110), in order to show their good performance. We then used the same fitted models on a new set containing some new documents belonging to the old authors and some others belonging to a new author, never classified before. In our opinion, these classifiers should assign an indecision score to the new author's documents, since they are not examples of any class. However, as Table 1 exemplifies, the SVM assigns inclusion probabilities far from indecision, correspondingly increasing the misclassification error; a minimal sketch of this experiment follows the table.

Table 1. Inclusion probabilities derived by the SVM classifier for the unknown author (one row per document).

Auth 5   Auth 60  Auth 110
0.832    0.011    0.156
0.353    0.196    0.450
0.047    0.865    0.086
0.040    0.813    0.145
0.002    0.958    0.039
0.004    0.929    0.066
0.106    0.343    0.549
0.278    0.388    0.333
0.807    0.144    0.047
0.977    0.004    0.018
0.322    0.519    0.157
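To make the failure concrete, here is a minimal, self-contained sketch of an experiment of this kind. It uses synthetic word counts rather than the Reuters subset; the author labels 5, 60 and 110 are kept only to echo Table 1, and every other name and parameter is an illustrative assumption.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_words = 50  # illustrative vocabulary size, not the paper's 468

def sample_docs(word_probs, n_docs, doc_len=200):
    # Draw bag-of-words count vectors from an author-specific word profile.
    return rng.multinomial(doc_len, word_probs, size=n_docs)

# Three "known" authors plus one "unknown" author, each with its own
# synthetic multinomial word profile (a stand-in for real documents).
profiles = [rng.dirichlet(np.ones(n_words)) for _ in range(4)]
X_train = np.vstack([sample_docs(p, 30) for p in profiles[:3]])
y_train = np.repeat([5, 60, 110], 30)  # author labels as in Table 1

clf = SVC(kernel="linear", probability=True).fit(X_train, y_train)

# Documents by the fourth, never-seen author: the classifier has no
# "none of the above" option, so each probability row must still sum
# to 1 and typically lands far from the uniform indecision point.
X_unknown = sample_docs(profiles[3], 5)
print(clf.predict_proba(X_unknown).round(3))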
As a consequence, SVMs cannot be considered useful for this kind of application, which justifies the implementation of a different methodology.

3 The theoretical background

The first step in text categorization is to transform documents, which typically are strings of characters, into a representation suitable for the learning algorithm and the classification task. Each distinct word $w_i$ corresponds to a feature, with the number of times $w_i$ occurs in the document as its value. Even after parsing and stemming, the number of relevant words remains very high (of the order of $10^3$), so feature selection methods must be applied. They aim to identify the words with the greatest discriminating power, those that make it possible to distinguish different authors. The literature has presented several approaches in this context, for example [GE03], [S03], [F03].

The methodology proposed here is based on the combined employment of two statistical tools rooted in the family of non-parametric models: decision trees and the Kruskal-Wallis test.

The Kruskal-Wallis test [C71] is the non-parametric version of the analysis of variance (ANOVA) and represents a simple generalization of the Wilcoxon test for two independent samples. Given $K$ independent samples of sizes $n_1, \ldots, n_K$ (in this context represented by the different words within the bag of words), a single large sample of size $N = \sum_{i=1}^{K} n_i$ is created by merging the original $K$ samples. The pooled observations are ordered from smallest to largest and each is assigned its rank; then $\bar{R}_i$, the mean of the ranks of the observations in the $i$-th sample, is computed. The test statistic is

$KW = \dfrac{\frac{12}{N(N+1)} \sum_{i=1}^{K} n_i \left( \bar{R}_i - \frac{N+1}{2} \right)^2}{1 - \frac{\sum_{j=1}^{g} (t_j^3 - t_j)}{N^3 - N}}$    (2)

where the denominator corrects for tied observations (typical in text categorization applications), $g$ being the number of groups of ties and $t_j$ the size of the $j$-th group. The null hypothesis that all the $K$ word distributions are the same is rejected if

$KW > \chi^2_{K-1}$    (3)
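As a minimal sketch of this screening step, assuming the documents have already been reduced to a matrix of word counts, scipy's kruskal computes the tie-corrected statistic of equation (2) and the corresponding chi-square p-value for each word; the helper name and interface are assumptions of this sketch.

import numpy as np
from scipy.stats import kruskal

def kw_select(X, authors, alpha=0.01):
    # X: (n_docs, n_words) word-count matrix; authors: array of
    # per-document author labels. Returns the indices of the words for
    # which H0 (same count distribution for every author) is rejected.
    groups = [X[authors == a] for a in np.unique(authors)]
    selected = []
    for j in range(X.shape[1]):
        try:
            stat, pval = kruskal(*[g[:, j] for g in groups])
        except ValueError:  # all counts identical: clearly uninformative
            continue
        if pval < alpha:
            selected.append(j)
    return selected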
Classification or decision tree methods [BFOS84] are a good choice when the task of the analysis is classification or prediction of outcomes and the goal is to generate rules that can be easily understood and explained. In the text mining field, in particular, they are trees in which internal nodes are labelled by terms, the branches departing from them are labelled by tests on the weight that the term has in the test document, and the leaves represent categories. This kind of classifier categorizes a test document $d_j$ by recursively testing the weights that the terms labelling the internal nodes have in the vector $d_j$, until a leaf node is reached; the label of this node is then assigned to $d_j$. Like the Kruskal-Wallis test, the decision tree is a method belonging to the family of non-parametric models, so we do not have to choose a distribution for the terms (words) present in a document, which would be a far from simple problem in this kind of application.

4 Proposal

The methodology proposed here combines the main elements of both models above. For lack of space, we refer the reader to [CG06] for a complete and deeper exposition of the methodological aspects.

In order to show the main results of the proposal, a variant of the standard Reuters database, containing 114 different authors and 468 words, has been employed. For each author we have a different number of documents, at least 120, and each variable, representing a word, is labelled by a progressive number. Figure 1 shows a small extract of the available data set.

Fig. 1. Small extract of the analyzed data set.

First of all, we randomly chose three different authors for the analysis; the Kruskal-Wallis test was then applied to the data set containing them, in order to select the words whose distribution profiles are most heterogeneous and different across the selected authors, so that they can reveal a typical and personal style of writing for every single author. According to the test procedure, each word is assigned a p-value on the basis of which a decision on the test hypothesis is taken: if the computed p-value is below the established threshold, the corresponding variable accepts the alternative, that is, the word distribution behaves differently across the populations (the authors). This part of the analysis is useful for eliminating words used in a constant or semi-constant way in the documents composed by the selected authors.

The words located as shown above are combined, as we will see later, with words chosen by the application of the decision tree. The decision tree is in fact grown on the data set containing the three selected authors, and during every phase the recursive algorithm selects the words that best distinguish the authors from each other. After the application of the two models we have two different sets of words, one selected through the Kruskal-Wallis test and another located through the decision tree.

It is now useful to recall the objective of this analysis: first of all, dimensionality reduction, so as to retain just the words able to reveal the typical and personal style of writing of every single author. Therefore, in order to combine the words selected by the two methods, we finally consider just the words selected either by the Kruskal-Wallis test (using a p-value threshold equal to 0.01) or by the decision tree based on the entropy impurity measure; a sketch of this step follows the table. The table below reports those words:

Table 2. Labels of the words selected either by the test or by the decision tree.

Words selected: 5, 21, 435, 398, 183, 235, 14, 215
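The tree-based half of the selection can be sketched in the same spirit: fit an entropy-based tree on the author labels and keep every word used as a split variable. The entropy criterion matches the impurity measure named above, while the helper names and the union rule for merging the two word sets are assumptions of this sketch.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tree_select(X, authors):
    # Fit an entropy-based tree and return the indices of the words that
    # appear as split variables in at least one internal node.
    tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
    tree.fit(X, authors)
    features = tree.tree_.feature  # leaf nodes are marked with -2
    return sorted(set(int(f) for f in features if f >= 0))

# One reading of the combination rule: the union of the two selections,
# reusing kw_select from the previous sketch.
# combined = sorted(set(kw_select(X, authors)) | set(tree_select(X, authors)))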
As a consequence, we finally obtain a very small set of words, which will be used to represent graphically the single author's profile by means of correspondence analysis.

As we said at the beginning, once the key words have been located, the second objective is to detect documents composed by an unknown author, different from the pre-classified ones. In this context the graphic derived from the correspondence analysis can be helpful in locating those documents: texts belonging to a different author should in fact be placed outside the central cloud of texts belonging to known authors.

For the sake of simplicity, in this part of the analysis we created another data set containing only some documents of one of the three initial authors (number 5), previously used to single out the key words; as the unknown author we randomly chose a few texts belonging to author number 44. Finally, we applied correspondence analysis to this new data set, keeping track of the unknown author's documents.

Fig. 2. Graphic derived from the correspondence analysis.

In Figure 2, the blue points represent the words selected by the two non-parametric methods shown before, and the red points symbolize the authors' documents, distinguished by their code numbers (5 and 44). As the reader can easily observe, 3 out of the 4 texts belonging to author 44 can be considered outliers because of their position with respect to the central cloud. We can therefore conclude that this approach can be usefully employed, first of all to create a specific author's profile, and consequently to use the key words found as important elements in plotting a set of new texts and in locating which ones may belong to one or more unknown authors.
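For completeness, here is a minimal numpy sketch of the correspondence analysis step itself, via the classical singular value decomposition of the standardized residual matrix in the spirit of [B73]; it assumes a count table with no all-zero rows or columns and is not the software actually used to produce Figure 2.

import numpy as np

def correspondence_analysis(N, n_dims=2):
    # N: (n_docs, n_words) count table. Returns the principal coordinates
    # of the rows (documents) and of the columns (words).
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    # Standardized residuals of the independence model, then SVD.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    rows = (U * sv) / np.sqrt(r)[:, None]  # document coordinates
    cols = (Vt.T * sv) / np.sqrt(c)[:, None]  # word coordinates
    return rows[:, :n_dims], cols[:, :n_dims]

Plotted in the first two dimensions, the unknown author's documents should then fall away from the main cloud, as in Figure 2.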
5 Conclusion

We want to stress the key element of this contribution. As said before, classical classification methods are not able to identify and isolate documents belonging to authors never labelled before. As a consequence, during an automatic classification process the incoming documents will be filed under one of the previously located classes, without considering the possible presence of anomalous texts. As shown in Section 2, one of the possible classifiers, the SVM, fails in this context. We therefore suggest, first of all, reducing the dimensionality by means of the combined employment of two non-parametric tools: decision trees and the Kruskal-Wallis test. In the second place, correspondence analysis proves a useful method for representing the documents and the words retained by feature selection, in order to depict the lexical profile of a specific author. In other words, we try to understand the most typical words used by an author, which constitute the key for the identification of new, unknown authors.

The author thanks Paolo Giudici for useful discussions and suggestions, and MIUR for funding within the project "Data mining methods for e-business applications".

References

[B73] Benzécri, J. (1973). L'analyse des données. Dunod, Paris.
[BFOS84] Breiman L., Friedman J.H., Olshen R., and Stone C.J. (1984). Classification and regression trees. Wadsworth, Belmont.
[CFG05] Cerchiello P., Figini S., and Giudici P. (2005). Feature selection: a non parametric approach. In Atti di convegno internazionale S.Co., Bressanone (Corrado Provasi, ed.), 293–298.
[CG06] Cerchiello P., and Giudici P. (2006). Statistical methods for classification of unknown authors. Technical report.
[C71] Conover W.J. (1971). Practical nonparametric statistics. Wiley, New York.
[F03] Forman G. (2003). An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3, 1289–1306.
[G03] Giudici P. (2003). Applied data mining. Wiley.
[GE03] Guyon I., and Elisseeff A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3(3), 1157–1182.
[J98] Joachims T. (1998). Text categorization with support vector machines: learning with many relevant features. In Proceedings of ECML-98, 10th European Conference on Machine Learning (Chemnitz, Germany, 1998), 137–142.
[KHZ00] Kim Y.H., Hahn S.Y., and Zhang B.T. (2000). Text filtering by boosting naive Bayes classifiers. In Proceedings of SIGIR-00, 23rd ACM International Conference on Research and Development in Information Retrieval (Athens, Greece, 2000), 168–175.
[L98] Lewis D.D. (1998). Naive Bayes at forty: the independence assumption in information retrieval. In Proceedings of ECML-98, 10th European Conference on Machine Learning (Chemnitz, Germany, 1998), 4–15.
[MY01] Manevitz L.M., and Yousef M. (2001). One-class SVMs for document classification. Journal of Machine Learning Research, 2, 139–154.
[S02] Sebastiani F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47.
[S03] Stoppiglia H., et al. (2003). Ranking a random feature for variable and feature selection. Journal of Machine Learning Research, 3, 1399–1414.
[Z97] Zani S. (1997). Analisi dei dati statistici, volume 1: osservazioni in una e due dimensioni. Giuffré, Milano.
[Z00] Zani S. (2000). Analisi dei dati statistici, volume 2: osservazioni multidimensionali. Giuffré, Milano.