An Introduction to Text Classification

Toni Giorgino

1 December 2004

Final essay for the third year of the SAFI school. The work is based on the course Statistical Learning: Teoria ed Applicazioni of the Academic Year 2002-03. Coordinator: Prof. Giuseppe De Nicolao.

Abstract

The increased volume of information available in digital form, and the need to organize it, have generated and progressively intensified the interest in automatic text categorization. A widely used research approach to this problem is based on machine learning techniques: an inductive process builds a classifier by automatically learning the characteristics of the categories. Machine learning is more portable and less labour-intensive than the manual definition of classifiers. This essay provides an account of the most prominent features of the machine learning approach to text classification, including data preparation, attribute extraction and selection, learning algorithms and kernel methods, performance measures, and the availability of training corpora.

Contents

1 Introduction
2 Supervised learning
3 Attributes
  3.1 Preprocessing
  3.2 Bag of words
  3.3 N-gram
  3.4 Attribute selection
  3.5 Discussion
4 Algorithms
  4.1 Nearest Neighbours
  4.2 Rocchio
5 Kernel based methods
  5.1 Support Vector Machines
  5.2 Kernel Nearest Neighbour
6 Data sets
  6.1 Reuters 21578
  6.2 Ohsumed
  6.3 Other
7 Performance measures
  7.1 Averaging
8 Conclusions

1 Introduction

Text classification is the problem of automatically assigning zero, one or more labels from a predefined set to a given segment of free text. The labels are chosen to reflect the "meaning" of the text. Selecting the appropriate set of labels may be ambiguous even for a human rater; when a machine tries to mimic this human behaviour, the algorithm has to cope with a large amount of uncertainty coming from various sources. First of all, on a purely lexicographic level, human language is ambiguous per se: it includes words and word combinations with multiple senses, which are disambiguated by context. More importantly, the very notion of the meaning of a text is still vaguely defined, and a matter of debate. One does not want to answer the question of whether a computer has "understood" a text, but rather, operationally, whether it can provide a result which is comparable to what a human would provide (and find useful). Three hallmarks of the problem, namely (a) reasoning in the presence of uncertainty, (b) the complexity of the task, and (c) the heuristics required to approximate human judgement, suggest that text classification tasks belong to the field of Artificial Intelligence [1].
2 Supervised learning

The task of assigning labels to items has been extensively studied in the broader context of machine learning, and a large number of algorithms and methods have been published in the literature. In the machine learning context, the items to be classified are called examples. Each example is identified by a fixed number of attributes, which are usually numeric or discrete.

Some algorithms try to identify features common to subsets of examples, looking only at the attributes and their correlations. These algorithms belong to the class of unsupervised learning. For text classification tasks, instead, one usually wants to reproduce a given classification scheme. The discrete label to be predicted for a given item is called its class. The classification scheme, as discussed above, is not explicitly laid out according to human-designed rules; machine learning algorithms are instead designed to learn it from training examples. In other words, they extract the implicit knowledge contained in the texts and their labels. To train a machine learning algorithm, therefore, a human expert needs to prepare a set of texts, each of which is manually annotated with the "correct" labels. These texts, called the training set, are processed and fed into the learning algorithm along with their labels. The algorithm adjusts a set of internal parameters which, if training is successful, will generalize the training set in a sensible way. Text classification problems therefore belong to the category of supervised learning.

3 Attributes

Most often, text is stored in computer form in the well-known ASCII representation (which may be inappropriate for non-English languages). In this representation, a text is encoded as a sequence of numbers, one per character, in the original order. While this is a faithful representation of the text, causing no loss of semantic information, it is not suited to being fed into machine learning algorithms. These algorithms have been developed to treat examples as vectors of features, or attributes. Each example is assumed to have a fixed number of attributes, and each attribute is supposed to have a specific meaning (say, age, text length, etc.) consistent across all examples. Some algorithms, such as Support Vector Machines, perform well even in the presence of thousands of features. The problem then arises of encoding texts in a suitable way, i.e. of deciding on relevant features and extracting them from the ASCII texts. Feature extraction is of paramount importance, since the chosen attributes will be the only input on which the class decision is based. Two of the most widely used approaches for generating feature vectors from text for information retrieval, bag of words and n-grams, are described in the following.

3.1 Preprocessing

The preparation of input texts for processing is also known in the machine learning context as "data cleaning". One often-applied transformation of the input text is the substitution of characters outside the usual 26-letter English alphabet with a single space. Multiple spaces are then reduced to one, and upper-case letters are folded into lower case. These mappings make any punctuation indistinguishable from white space, which is accepted as a negligible loss of information. This simple transformation is only appropriate for the English language. For other languages, more complex transformation rules are needed; in European languages, for example, accented characters may be replaced with their unaccented equivalents. In more detail, non-English transformations will depend on the encoding of the specific text, the language, and several other factors (collectively known as the locale).

On a more semantic level, an often-applied function is stemming, which reduces each word to the base form from which it originates. Stemming is clearly language dependent; it maps e.g. computer into comput and going into go. A widely known stemming algorithm for English was developed by Porter [2], with variants for several European languages. Before or after stemming, words such as conjunctions can be removed altogether; this transformation is known as stop-word cancellation.
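As an illustration, the following Python fragment is a minimal sketch of such a cleaning pipeline. The stop-word list is a tiny illustrative sample, not a standard one, and a real system would also apply a stemmer such as Porter's; all names are ours.

```python
import re

# Tiny illustrative stop-word sample; real lists contain hundreds of entries.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}

def clean(text):
    """Normalize raw English text as described in section 3.1."""
    text = text.lower()                    # fold upper case into lower case
    text = re.sub(r"[^a-z]+", " ", text)   # non-letters (punctuation, digits) become one space
    return text.strip()

def tokenize(text, remove_stop_words=True):
    """Split cleaned text into words, optionally cancelling stop words."""
    words = clean(text).split()
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return words

print(tokenize("The specs of QuickTime, and a Unix system."))
# ['specs', 'quicktime', 'unix', 'system']
```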
3.2 Bag of words

One widely used method for feature extraction is the bag of words approach (figure 1). Each text is scanned and word frequencies are calculated with reference to a known vocabulary. These word frequencies form a vector characterizing the given text [3]. A standard enhancement to the bag of words approach is to weight the count of each word $w$ by its inverse document frequency $I_w$:

$$I_w = \log(n / f_w)$$

where $n$ is the total number of documents in the corpus, and $f_w$ is the number of documents in which word $w$ appears. The purpose of this weighting is to enhance the contribution of uncommon words with respect to more widely found ones, such as conjunctions and adverbs. Applied together with word stemming, the bag of words approach works well in practice, even though the representation completely discards any information contained in the word ordering.

[Figure 1: Bag-of-words approach to feature extraction, applied to a Usenet post asking for the QuickTime specifications. Part of the language-dependent dictionary is shown along with the feature vector representing the text. In this example word stemming (section 3.1) is not used.]

3.3 N-gram

The n-gram approach provides a natural representation that takes into account the order of characters in the input text [4, 5]. Unlike bag of words plus stemming, the n-gram approach can be applied without prior knowledge of the language of the input. The input sequence is broken down into n-tuples of consecutive characters. For example, when n = 2 the feature vector contains as many elements as there are possible pairs of characters; considering the space symbol (section 3.1), the feature space of character bigrams has dimension $27^2$. Each attribute counts how many times that specific n-gram appears in the given input document. For example, the five-character document "to me" contains the four bigrams "to", "o ", " m", "me", and the three trigrams "to ", "o m", " me".

The choice of n involves an interesting tradeoff. Good results are often quoted in the literature for n = 4, but increasing n does not improve classification accuracy indefinitely. When n grows, the dimensionality of the feature space grows exponentially, yet most of these features will be zero for all documents. Features corresponding to large n-grams, in fact, become increasingly specific, and very soon so specific that each of them is found in only a single text (think of them as whole sentences). All of the examples to be categorized then become very long vectors, composed of zeros and a few ones. No vector would have features in common with another, which means that all documents would be orthogonal to each other, and no useful "distance" could be calculated.
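A minimal sketch of both extraction schemes follows. The helper names are ours; the weighting implements the $I_w = \log(n/f_w)$ formula of section 3.2, and the n-gram extractor works on cleaned text as in section 3.1.

```python
import math
from collections import Counter

def bag_of_words(docs):
    """Term counts per document; docs is a list of token lists (section 3.2)."""
    return [Counter(tokens) for tokens in docs]

def idf(counts):
    """Inverse document frequency I_w = log(n / f_w) for every word w."""
    n, f = len(counts), Counter()
    for c in counts:
        f.update(c.keys())                 # each document contributes once per word
    return {w: math.log(n / df) for w, df in f.items()}

def char_ngrams(text, n=2):
    """Character n-gram counts over the cleaned text (section 3.3)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

docs = [["wheat", "crop", "report"], ["crude", "oil", "report"]]
counts = bag_of_words(docs)
weights = idf(counts)                      # "report" is in every document: log(2/2) = 0
tfidf = [{w: c * weights[w] for w, c in d.items()} for d in counts]
print(sorted(char_ngrams("to me")))        # [' m', 'me', 'o ', 'to']
```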
3.4 Attribute selection

Both the bag of words and the n-gram approach represent a given text as a vector with many features. The number of features in the vector is known as the dimension of the input space. There may be computational or theoretical reasons for reducing the dimensionality of the input space, i.e. for discarding some of the features from all examples. Computational reasons for feature selection include keeping the computation within practical memory or time requirements. Theoretical reasons may apply, for example, because certain algorithms are known to perform poorly in the presence of many irrelevant features (which do not help class prediction) or redundant features (strongly correlated with each other).

One simple approach to feature selection, document frequency thresholding, is based on the document frequency $f_w$ defined above. One sorts words (or whatever feature is used) according to how many documents they appear in, and disregards terms whose document frequency falls below a certain threshold. The underlying assumption is that words appearing in few documents do not carry information, i.e. are either non-informative for category prediction or not influential on global performance. An improvement in categorization accuracy is even possible, if the rare terms happen to be noise terms.

A more sophisticated criterion formally takes into account the information contained in a given feature: it measures the number of bits of information obtained, per category, by knowing the presence or absence of a term in a document. The information gain builds on the definition of entropy. Consider a binary classification problem in which the label may be + or $-$, with probabilities $p_+$ and $p_-$ respectively. The entropy of this document set, i.e. the minimum theoretical number of bits required to store the class of one instance, is

$$H = -p_+ \log p_+ - p_- \log p_-$$

Now suppose that a term $t$ is present in a fraction $s$ of the documents. Within this document subset, the entropy would be

$$H_t = -p_{t+} \log p_{t+} - p_{t-} \log p_{t-}$$

where $p_{t+}$ indicates the fraction of documents belonging to class +, among those containing the term $t$. Analogously one evaluates the entropy $H_{\bar{t}}$ of the documents which do not contain term $t$. How much entropy is removed by the knowledge of $t$? The answer is obtained by evaluating

$$G = H - \left[ s H_t + (1 - s) H_{\bar{t}} \right]$$

This is the information gain provided by the knowledge of the presence or absence of term $t$. Features whose information gain exceeds a certain threshold are then selected. As one would expect, the information gain $G$ is zero for words which appear in no documents, or in all of them. The definition given above is easily extended to the multiple-class case.

3.5 Discussion

Several other feature selection techniques have been proposed and evaluated; a comprehensive review is given in [6]. Results are quoted there for two classifiers and two different domains (Ohsumed and Reuters, see section 6), showing consistently that with the bag-of-words approach, reducing the feature space to 10% of its original size is appropriate and does not affect classification performance (or improves it slightly). This means that classification performance is rather robust with respect to feature selection: most feature selection methods yield results which are unaffected by a reduction of dimensionality from 16000 to 2000 features. Information gain feature selection, in particular, performs equally well with as few as 200 features left. This might mean either that there are relatively few highly relevant features, which are easily found by all selection algorithms, or that many features are equally suited to produce good classification. As it turns out, the latter seems to be true. This can be gathered from figure 2, reproduced from reference [3], in which feature selection was performed in bands: all features are ranked according to their binary mutual information, then a naive Bayes classifier is trained using only the features ranked 1-200, 201-500, 501-1000, 1001-2000, 2001-4000, and 4001-9947. The results show that even the lowest-ranked features still contain considerable information and are somewhat relevant. Using only the 50% "less desirable" features, for example, still yields much better prediction than random class assignment.

[Figure 2: Learning without the "best" features. Classification accuracy (precision/recall breakeven point) of a naive Bayes classifier trained using only the features ranked 1-200, 201-500, 501-1000, 1001-2000, 2001-4000 by mutual information. Random-guess classification is shown as a dotted line.]
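The following sketch computes the binary information gain exactly as derived above; labels are booleans, `term_in_doc` marks which documents contain the term, and the names are ours.

```python
import math

def entropy(labels):
    """H = -p+ log p+ - p- log p- (0 log 0 taken as 0), in bits."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return -sum(q * math.log(q, 2) for q in (p, 1 - p) if q > 0)

def information_gain(labels, term_in_doc):
    """G = H - [s*H_t + (1-s)*H_tbar] for one term over the corpus."""
    with_t = [y for y, has in zip(labels, term_in_doc) if has]
    without_t = [y for y, has in zip(labels, term_in_doc) if not has]
    s = len(with_t) / len(labels)
    return entropy(labels) - (s * entropy(with_t) + (1 - s) * entropy(without_t))

labels = [True, True, False, False]        # two positive, two negative documents
print(information_gain(labels, [True, True, False, False]))  # 1.0 bit: term decides the class
print(information_gain(labels, [True, False, True, False]))  # 0.0 bits: term is uninformative
```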
4 Algorithms

This section briefly reviews two algorithms which have shown state-of-the-art performance on text classification [7]. An intuitive description of the idea underlying each algorithm is provided, leaving the details to the relevant literature. Other important classification algorithms exist, such as the Naive Bayes [8] and decision tree [9] classifiers, which are not discussed here.

4.1 Nearest Neighbours

This rather intuitive classification mechanism has shown good performance on text categorization tasks. The idea is to evaluate a "similarity measure" between the example to be classified and the existing ones. The algorithm formalizes the intuition that the class of an unseen example is likely to be the same as that of the closest known instances. The degree of similarity is defined according to a suitable criterion; when examples are identified by a vector, the Euclidean distance may be used. Classification accuracy is generally improved when one considers not just the single closest instance, but the k closest examples, with a majority rule (figure 3).

[Figure 3: The k nearest neighbour algorithm, when the attribute space has two dimensions. An unknown example is classified with the same label ω as the closest (or the majority of the k closest) examples.]

4.2 Rocchio

Another widely used method in information retrieval is the Rocchio algorithm [10]. In summary, it vectorially averages all of the normalized positive examples, and subtracts the average of the negative ones; it thus finds a sort of "prototype vector" for the positive class. An unseen example can then be classified according to how close it is to the computed prototype, using a threshold (figure 4).

[Figure 4: Rocchio: the positive normalized examples are averaged; the negative normalized examples are also averaged, and subtracted from the former.]
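A compact sketch of both classifiers follows, under our assumptions of dense NumPy feature vectors, a binary +1/-1 labelling, and a zero decision threshold for Rocchio.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k training examples closest to x (Euclidean)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(dists)[:k]]
    return 1 if nearest.sum() > 0 else -1

def rocchio_prototype(X_train, y_train):
    """Average of normalized positives minus average of normalized negatives."""
    Xn = X_train / np.linalg.norm(X_train, axis=1, keepdims=True)
    return Xn[y_train == 1].mean(axis=0) - Xn[y_train == -1].mean(axis=0)

X = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]])
y = np.array([1, 1, -1, -1])
print(knn_predict(X, y, np.array([0.8, 0.3])))           # 1
proto = rocchio_prototype(X, y)
print(1 if proto @ np.array([0.8, 0.3]) > 0 else -1)     # 1 (zero threshold, our choice)
```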
5 Kernel based methods

Sometimes one may find it useful to classify vectors not in their original form, but after transformation through an arbitrary function φ. This may be desirable, for example, because the transformed space is amenable to easier classification, such as linear separation. Say that φ maps the original feature space S into the transformed space T, with the original vectors v, w belonging to S. An interesting situation then arises, because some machine learning schemes depend on the input vectors only through their scalar products. When this is the case, during their operation they only ever access the input vectors through scalar products of the form $K(v, w) = \phi(v) \cdot \phi(w)$. K is called the kernel function, and appears as a function of v and w. In practice, however, one does not need to explicitly devise a mapping φ, nor to carry out the inner product in T (which might be infinite-dimensional). One simply chooses a real-valued kernel function K of two variables on S × S. For K to be an inner product in a transformed space, it suffices to ensure that it satisfies the Mercer condition [11]. The single-variable transformation function φ need not be known explicitly.

Kernel methods are interesting because they allow one to calculate distances between input vectors with arbitrary algorithms or functions of pairs of vectors. The transformation will hopefully generate an image of the input space which can be separated by simpler decision surfaces. In practice, some parametric kernels are commonly used: the polynomial kernel $K = (1 + x \cdot y)^p$, the radial basis function kernel $K = \exp(-\alpha |x - y|^2)$, and the sigmoid kernel $K = \tanh(\alpha x \cdot y + \beta)$. It is worth repeating, however, that K need not be an algebraic function of x and y. A novel kernel, for example, has recently received remarkable interest. Originally proposed for bioinformatics applications, it measures the degree of similarity between two texts (the input vectors) by comparing their common subsequences, allowing for character insertions and deletions. The input space S for this algorithm consists of the texts themselves, and it is called the string subsequence kernel [12].

5.1 Support Vector Machines

Methods depending only on the scalar products of the input vectors, i.e. on the Gram matrix, are called kernel methods. Kernel methods were popularized by Support Vector Machines (SVM) [13]. An SVM is an algorithm which computes the linear separation surface with maximum margin for a given training set (figure 5). Only a subset of the input vectors influences the choice of the margin (circled in the figure); such vectors are called support vectors. When a linear separation surface does not exist, for example in the presence of noisy data, SVM algorithms with slack variables are appropriate [14].

SVM methods have been shown to be excellent on text classification tasks both theoretically and experimentally. Reference [3] shows SVMs outperforming other methods on most categories of the Reuters and Ohsumed datasets, without the need for parameter tuning. (Although not clearly indicated, the experiments appear to have been performed with bag of words feature vectors and no feature selection.)

[Figure 5: Support vector machines find the separating hyperplane h which maximises the distance from the positive and negative training examples. Support vectors are circled.]
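The three parametric kernels above are easy to state in code. A minimal NumPy sketch follows; the parameter values are arbitrary examples, not recommended settings.

```python
import numpy as np

def polynomial_kernel(x, y, p=2):
    """K(x, y) = (1 + x.y)^p"""
    return (1.0 + x @ y) ** p

def rbf_kernel(x, y, alpha=0.5):
    """K(x, y) = exp(-alpha * |x - y|^2)"""
    return np.exp(-alpha * np.sum((x - y) ** 2))

def sigmoid_kernel(x, y, alpha=1.0, beta=0.0):
    """K(x, y) = tanh(alpha * x.y + beta)"""
    return np.tanh(alpha * (x @ y) + beta)

x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
for k in (polynomial_kernel, rbf_kernel, sigmoid_kernel):
    print(k.__name__, k(x, y))
```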
5.2 Kernel Nearest Neighbour

A simple modification of the k nearest neighbour algorithm was recently proposed, which makes it a kernel method [11]. If we express the Euclidean distance of two vectors x and y transformed through φ, we get

$$d^2(\phi(x), \phi(y)) = |\phi(x) - \phi(y)|^2 = \phi(x) \cdot \phi(x) + \phi(y) \cdot \phi(y) - 2\,\phi(x) \cdot \phi(y)$$

which, recalling the definition $K(x, y) = \phi(x) \cdot \phi(y)$, means

$$d^2(\phi(x), \phi(y)) = K(x, x) + K(y, y) - 2 K(x, y).$$

In summary, the distance in the image feature space can be calculated using only a kernel function applied to the input vectors in the original feature space. When the nearest-neighbour algorithm is applied in the transformed space, one obtains a kernel nearest-neighbour classifier (KNN).
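Given any Mercer kernel K, the identity above turns the plain nearest-neighbour classifier of section 4.1 into its kernelized version. A sketch, with names and the example kernel being our own choices:

```python
import numpy as np

def kernel_distance_sq(K, x, y):
    """Squared distance in the transformed space: K(x,x) + K(y,y) - 2 K(x,y)."""
    return K(x, x) + K(y, y) - 2.0 * K(x, y)

def kernel_knn_predict(K, X_train, y_train, x, k=3):
    """k nearest neighbours using kernel-induced distances, majority vote."""
    d2 = np.array([kernel_distance_sq(K, xi, x) for xi in X_train])
    nearest = y_train[np.argsort(d2)[:k]]
    return 1 if nearest.sum() > 0 else -1

rbf = lambda x, y: np.exp(-0.5 * np.sum((x - y) ** 2))   # one valid Mercer kernel
X = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]])
y = np.array([1, 1, -1, -1])
print(kernel_knn_predict(rbf, X, y, np.array([0.8, 0.3])))   # 1
```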
6 Data sets

Supervised learning schemes pose the problem of finding sufficient amounts of training text which have been properly and reliably annotated. The task of manually assigning labels is time-consuming and must, by definition, be performed by humans (at least partially). Annotated data sets may therefore be expensive to obtain. Another peculiar problem with data sets is the repeatability of algorithm assessments: to compare classification performance across algorithms developed by different researchers, one wants to disregard performance variations due to the particular choice of examples. Luckily, both availability and repeatability can be addressed by using pre-annotated sets, often available on the Internet free of charge for research purposes. Most such sets have been collected by interested institutions, and later made available under a non-restrictive distribution license. The texts collected in a corpus tend to be homogeneous, usually reflecting the activity of the original compiler.

Several institutions maintain collections of data sets potentially useful for data mining, for example the University of California, Irvine (UCI Machine Learning Archive) [15]. The open source software project R, besides being a powerful statistics package, also includes a collection of data sets [16]. While there is relatively wide availability of data for instance classification, fewer data sets are available for text classification. The reason is that collections of instance data can be reasonably anonymized in several ways, making it more likely that the owner will be willing to release them free of ownership claims. On the contrary, most text available in electronic form is subject to the intellectual property rights of its authors, and so may not be reproduced without permission. Luckily, there are exceptions to this rule, and data sets for text classification, called corpora, are indeed available.

6.1 Reuters 21578

One of the most widely examined text corpora for text classification is known as Reuters-21578; it comes from the Carnegie Group, Inc. and Reuters, Ltd. It is a collection of 21578 real-world news stories and news-agency headlines in the English language, totalling approximately 25 megabytes. Most of the stories are annotated with zero or more topics, according to their economic subject categories. Other (orthogonal) annotation categories are present, such as people, places, organisations, etc. Each of the annotation categories can be chosen as a prediction task, but topics is preferred in the existing literature because it is more abstract. People, places and organisations can likely be spotted when the corresponding name appears in the story text: typically a document assigned to a category from one of these sets explicitly includes some form of the category name in the document's text (something which is usually not true for topics categories). However, not all documents containing a named entity corresponding to the category name are assigned to that category, since the entity was required to be a focus of the news story [17]. Thus these proper name categories are not as simple to assign correctly as might be thought.

Each text may be given one, more, or zero category labels. A "negative" example for a given category is a text to which that category has not been assigned; texts with no category labels act as negative examples for all categories. As is apparent from figure 6, the majority of stories (47%) are negative examples for all categories. Almost all of the remaining texts have exactly one topic (44%), and the remaining 9% have two or more labels. In summary, the data set is unbalanced towards negative examples. The correctness of so many unlabelled documents has been questioned [18], so restricting the training set to documents with at least one label might be a good choice. Figure 7 shows the frequencies of the most common categories; the most represented is the topic earn, assigned to 17% of the labelled documents.

[Figure 6: Number of topic labels per story for the Reuters-21578 dataset (density histogram, 0 to 16 topics).]

[Figure 7: Topic frequencies for the top categories in the Reuters-21578 dataset (earn, acq, money-fx, crude, grain, trade, interest, wheat, ...).]

State-of-the-art classification algorithms (SVM, Rocchio, Naive Bayes) tend to achieve high accuracy on the Reuters corpus: microaveraged breakeven-point performance is in the order of 80-85% [7], and Reuters-21578 is thus considered an "easy" data set. Performance measures are discussed in section 7.

6.2 Ohsumed

The Ohsumed collection [19] is a clinically-oriented MEDLINE subset of 348,566 references (out of a total of over 7 million), covering all references from 270 medical journals over a five-year period (1987-1991). The test database is about 400 megabytes in size. A number of fields normally present in a MEDLINE record but not pertinent to content-based information retrieval have been deleted; the remaining fields are title, abstract, MeSH indexing terms, author, source, and publication type.

Label prediction on the Ohsumed dataset is harder than on Reuters-21578, arguably due to the specialized nature of the articles, which assume vast background knowledge of the field on the part of the intended readers. Microaveraged breakeven-point performance for state-of-the-art algorithms on this dataset is in the order of 65% [7].

6.3 Other

The DigiTrad (Digital Tradition) database contains folk songs collected from several countries, with annotations added to facilitate song retrieval. The texts typical of DigiTrad are very different from those in Reuters-21578: sentence structure is loose and often non-existent; clever, flowery and rhyming language is generally preferred over clarity; songs are often written in dialect; and the writers of the songs tend to be indirect, often skirting around a topic without explicitly mentioning it. The DigiTrad labels are therefore consistently very hard to predict by automated learning [20].

A novel dataset from Reuters, the RCV1 Reuters Corpus, is likely to supersede the Reuters-21578 dataset in the future. This corpus greatly expands the previous one, including over 800,000 stories and 2.5 GB of text. The corpus is, however, not readily available for download (a special request must be filed with NIST).
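As a practical note, corpora such as Reuters-21578 are distributed as SGML files. The following regex-based sketch of extracting topic labels and story bodies assumes the <REUTERS>/<TOPICS>/<BODY> tag layout as we recall it from the corpus documentation; a robust reader would use a real SGML parser.

```python
import re

def parse_reuters_sgml(sgml):
    """Yield (topics, body) pairs from the text of one Reuters-21578 .sgm file."""
    for story in re.findall(r"<REUTERS.*?</REUTERS>", sgml, re.DOTALL):
        topics_block = re.search(r"<TOPICS>(.*?)</TOPICS>", story, re.DOTALL)
        topics = re.findall(r"<D>(.*?)</D>", topics_block.group(1)) if topics_block else []
        body = re.search(r"<BODY>(.*?)</BODY>", story, re.DOTALL)
        yield topics, body.group(1).strip() if body else ""

sample = ('<REUTERS TOPICS="YES"><TOPICS><D>earn</D></TOPICS>'
          '<TEXT><BODY>Quarterly earnings rose ...</BODY></TEXT></REUTERS>')
for topics, body in parse_reuters_sgml(sample):
    print(topics, body)   # ['earn'] Quarterly earnings rose ...
```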
7 Performance measures

Obtaining a performance measure for classification is important in order to tune the parameters of algorithms and kernels, and to compare methods and datasets. Performance measures build upon the concepts of precision p and recall r. The former is the probability that a document predicted to be in class "+" truly belongs to that class; the latter is the probability that a document belonging to class "+" is classified into that class. When a single performance measure is desired, the harmonic mean of precision and recall, $F_1 = 2/(p^{-1} + r^{-1})$, is sometimes quoted as the performance for each specific class.

Many algorithms have adjustable parameters, such as a confidence threshold, upon which precision and recall depend. When this is the case, a single performance measure can be obtained by varying the parameter and finding the precision-recall breakeven point, which is the (interpolated) value of p obtained when p = r.

7.1 Averaging

Both F1 and the breakeven point apply to single-class classification problems, i.e. those in which an example may only be positive or negative. For text classification, except in the very special case of a single label, the p and r values of the various classes should be combined. There are two possible approaches. Macro averaged precision sums, for each class separately, the fraction of relevant documents among the retrieved ones:

$$P_M = \frac{1}{N} \sum_{i=1}^{N} \frac{\text{relevant documents retrieved for class } i}{\text{documents retrieved for class } i}$$

where N is the number of classes. Macro averaging is therefore simply the average precision over classes, and the precision values of all classes contribute equally to it. Macroaveraging has a problem if no document is retrieved for some class, which makes $P_M$ undefined.

The other way to compute an overall measure of performance for a classification system is to first count the retrieved and the relevant retrieved documents over all classes, sum them, and divide afterwards. The micro averaged precision is computed as

$$p_m = \frac{\text{relevant documents retrieved in all classes}}{\text{total documents retrieved}}$$

Thus, microaveraging circumvents the problem of empty sets and gives every individual document an equal influence on the result. A very comprehensive review of the subject may be found e.g. in [21].
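A short sketch of these measures follows, working from per-class counts of retrieved and relevant-retrieved documents (the input layout and names are our assumptions):

```python
def f1(p, r):
    """Harmonic mean of precision and recall: 2 / (1/p + 1/r)."""
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

def macro_precision(retrieved, relevant_retrieved):
    """Average the per-class precisions; an empty class would divide by zero."""
    precisions = [rel / ret for rel, ret in zip(relevant_retrieved, retrieved)]
    return sum(precisions) / len(precisions)

def micro_precision(retrieved, relevant_retrieved):
    """Pool the counts over all classes, then divide once."""
    return sum(relevant_retrieved) / sum(retrieved)

retrieved          = [100, 10]   # documents retrieved for each of two classes
relevant_retrieved = [90, 5]     # of those, how many were truly in the class
print(macro_precision(retrieved, relevant_retrieved))   # (0.90 + 0.50) / 2 = 0.70
print(micro_precision(retrieved, relevant_retrieved))   # 95 / 110 = 0.86...
```

Note how the micro average is dominated by the larger class, illustrating that every document, rather than every class, carries equal weight.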
8 Conclusions

The widespread availability of digital data has pushed interest in automated text labelling for information retrieval. Text classification is a large field which borrows several ideas from the neighbouring subjects of machine learning and language modelling. This short essay provided an overview of several concepts and techniques currently used in text classification; the topics examined include data cleaning, feature extraction and selection, machine learning schemes, and performance measures. Further discussion can be found in the vast existing literature on the subject, which builds upon the concepts discussed here. A partial list of subjects under current investigation includes the incorporation of "common sense" knowledge through the use of hypernyms and hyponyms [22], efficient implementation of existing algorithms [23, 24], and approximations of kernels for more efficient computation [25]. The availability of efficiently implemented kernel methods and approximation schemes makes the development of new kernels, especially those suited for text comparison [12], a particularly stimulating area, likely to yield results useful in other domains such as bioinformatics and multimedia retrieval.

References

[1] E. Motta, Reusable Components for Knowledge Modelling: Case Studies in Parametric Design Problem Solving. IOS Press, 1999.

[2] M. Porter, "An algorithm for suffix stripping," Program, vol. 14, no. 3, pp. 130-137, 1980.

[3] T. Joachims, "Text categorization with support vector machines: learning with many relevant features," in Proceedings of ECML-98, 10th European Conference on Machine Learning (C. Nédellec and C. Rouveirol, eds.), no. 1398, (Chemnitz, DE), pp. 137-142, Springer Verlag, Heidelberg, DE, 1998.

[4] W. B. Cavnar and J. M. Trenkle, "N-gram-based text categorization," in Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, (Las Vegas, US), pp. 161-175, 1994.

[5] W. B. Cavnar, "Using an n-gram-based document representation with a vector processing retrieval model," in TREC, pp. 269-277, 1994.

[6] Y. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in Proceedings of ICML-97, 14th International Conference on Machine Learning (D. H. Fisher, ed.), (Nashville, US), pp. 412-420, Morgan Kaufmann Publishers, San Francisco, US, 1997.

[7] T. Joachims, "Text categorization with support vector machines: learning with many relevant features," tech. rep., University of Dortmund, Fachbereich Informatik, 1997.

[8] J. Demsar, B. Zupan, M. Kattan, J. Beck, and I. Bratko, "Naive Bayesian-based nomogram for prediction of prostate cancer recurrence," Stud Health Technol Inform, vol. 68, pp. 436-441, 1999.

[9] J. R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[10] J. J. Rocchio, The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice Hall, 1971.

[11] K. Yu, L. Ji, and X. Zhang, "Kernel nearest-neighbor algorithm," Neural Process. Lett., vol. 15, no. 2, pp. 147-156, 2002.

[12] H. Lodhi, J. Shawe-Taylor, N. Cristianini, and C. J. C. H. Watkins, "Text classification using string kernels," in NIPS, pp. 563-569, 2000.

[13] C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121-167, 1998.

[14] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.

[15] C. Blake and C. Merz, "UCI repository of machine learning databases," 1998.

[16] R Development Core Team, R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2004. ISBN 3-900051-07-0.

[17] P. J. Hayes and S. P. Weinstein, "CONSTRUE/TIS: a system for content-based indexing of a database of news stories," in Second Annual Conference on Innovative Applications of Artificial Intelligence, 1990.

[18] Y. Yang, "An evaluation of statistical approaches to text categorization," Information Retrieval, vol. 1, no. 1/2, pp. 69-90, 1999.
[19] W. R. Hersh, C. Buckley, T. J. Leone, and D. H. Hickam, "Ohsumed: An interactive retrieval evaluation and new large test collection for research," in Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, 3-6 July 1994 (W. B. Croft and C. J. van Rijsbergen, eds.), pp. 192-201, ACM/Springer, 1994.

[20] S. Scott, "Feature engineering for a symbolic approach to text classification," Master's thesis, Ottawa, CA, 1998.

[21] F. Sebastiani, "Machine learning in automated text categorization," ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.

[22] S. Scott and S. Matwin, "Feature engineering for text classification," in Proceedings of ICML-99, 16th International Conference on Machine Learning (I. Bratko and S. Dzeroski, eds.), (Bled, SL), pp. 379-388, Morgan Kaufmann Publishers, San Francisco, US, 1999.

[23] T. Joachims, "Making large-scale support vector machine learning practical," in Advances in Kernel Methods: Support Vector Learning (B. Schölkopf, C. J. C. Burges, and A. J. Smola, eds.), MIT Press, Cambridge, MA, 1998.

[24] A. K. McCallum, "Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering." http://www.cs.cmu.edu/~mccallum/bow, 1996.

[25] B. Schölkopf, S. Mika, C. J. C. Burges, P. Knirsch, K.-R. Müller, G. Rätsch, and A. J. Smola, "Input space vs. feature space in kernel-based methods," IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 1000-1017, 1999.