A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization
Thorsten Joachims
Carnegie Mellon University
Presented by Ning Kang

Outline
• Summary
• Text Categorization
• Learning Methods for Text Categorization
• PrTFIDF: A Probabilistic Classifier Derived from TFIDF
• Experiments and Results
• Conclusions

Summary
A probabilistic analysis of the Rocchio relevance feedback algorithm is presented in a text categorization framework. The analysis results in a probabilistic version of the Rocchio classifier and offers an explanation for the TFIDF word weighting heuristic. The Rocchio classifier, its probabilistic variant, and a standard naïve Bayes classifier are compared on three text categorization tasks.

Text Categorization
The goal of text categorization is the classification of documents into a fixed number of predefined categories. The working definition used throughout this paper assumes that each document $d$ is assigned to exactly one category. The formal definition is as follows.

Formal definition
• A set of classes $\mathcal{C}$ and a set of training documents $D$.
• A target concept $T : D \to \mathcal{C}$, which maps documents to a class. $T(d)$ is known for the documents in the training set.
• Through supervised learning, the information contained in the training examples is used to find a model $H : D \to \mathcal{C}$ that approximates $T$.
• $H(d)$ is the class to which the learned hypothesis assigns document $d$; it can be used to classify new documents.
• The objective is to find a hypothesis that maximizes accuracy.

Learning Methods for Text Categorization
• Learning algorithms
  • TFIDF classifier
  • Naïve Bayes classifier
• Bag-of-words representation

Feature selection
A combination of these three methods is used to find a subset of words that helps to discriminate between classes:
• Pruning of infrequent words.
• Pruning of high-frequency words, to remove non-content words.
• Choosing words that have high mutual information with the target concept.
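
The third criterion can be sketched in a few lines of Python. This is only an illustration, assuming the standard definition of mutual information, I(T, w) = E(T) - E(T | w), with each word treated as present or absent in a document. Documents are represented as sets of words; the function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def mutual_information(docs, labels, word):
    """I(T, w) = E(T) - E(T | w), in bits, with w treated as present/absent."""
    n = len(docs)
    class_counts = Counter(labels)
    # How many documents contain the word (True) or do not (False).
    present = Counter(word in doc for doc in docs)
    # Joint counts over (class, word present?).
    joint = Counter((label, word in doc) for doc, label in zip(docs, labels))

    entropy = -sum((k / n) * math.log2(k / n) for k in class_counts.values())
    cond_entropy = -sum(
        (k / n) * math.log2(k / present[has_word])
        for (_, has_word), k in joint.items()
    )
    return entropy - cond_entropy

def top_words(docs, labels, k):
    """Return the k words with the highest mutual information."""
    vocabulary = set().union(*docs)
    ranked = sorted(
        vocabulary,
        key=lambda w: mutual_information(docs, labels, w),
        reverse=True,
    )
    return ranked[:k]

# Toy corpus (made up for illustration only).
docs = [{"wheat", "grain"}, {"wheat", "crop"}, {"oil", "price"}, {"oil", "inc"}]
labels = ["wheat", "wheat", "other", "other"]
print(top_words(docs, labels, 2))
```

In this toy corpus the words "wheat" and "oil" each split the two classes perfectly, so they carry the full one bit of information about the class, while words that occur in a single document carry less.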
Mutual information
$$I(T, w) = E(T) - E(T \mid w) = -\sum_{C \in \mathcal{C}} \Pr(T(d)=C) \log \Pr(T(d)=C) + \sum_{C \in \mathcal{C}} \Pr(T(d)=C, w=0) \log \Pr(T(d)=C \mid w=0) + \sum_{C \in \mathcal{C}} \Pr(T(d)=C, w=1) \log \Pr(T(d)=C \mid w=1)$$
• $E(X)$ is the entropy of the random variable $X$.
• $\Pr(T(d)=C)$ is the probability that an arbitrary article $d$ is in category $C \in \mathcal{C}$.
• $\Pr(T(d)=C, w=0)$ and $\Pr(T(d)=C, w=1)$ are the probabilities that article $d$ is in category $C$ and does not ($w=0$) or does ($w=1$) contain the word $w$.

Top 15 words with the highest mutual information for the topic “wheat” in Reuters
wheat, tonnes, agriculture, grain, usda
washington, department, soviet, export, corn
crop, cts, inc, winter, company

TF-IDF Classifier
• The TFIDF classifier is based on the relevance feedback algorithm introduced by Rocchio [Rocchio, 1971] for the vector space retrieval model [Salton, 1991].
• It is a nearest-neighbor learning method with prototype vectors, using the TFIDF [Salton, 1991] word weighting heuristic.

TF-IDF Classifier
• Term frequency $TF(w_i, d)$ is the number of times word $w_i$ occurs in document $d$.
• Document frequency $DF(w_i)$ is the number of documents in which word $w_i$ occurs at least once.
• The inverse document frequency $IDF(w_i)$ can be calculated from the document frequency:
$$IDF(w_i) = \log\left(\frac{|D|}{DF(w_i)}\right)$$
• Word weight: $d^{(i)} = TF(w_i, d) \cdot IDF(w_i)$

TF-IDF
This word weighting means that a word $w_i$ is an important indexing term for document $d$ if it occurs frequently in it (TF is high). On the other hand, words that occur in many documents are rated as less important indexing terms because of their low IDF.

Prototype vector
Learning is achieved by combining document vectors into a prototype vector $\vec{c}$ for each class $C \in \mathcal{C}$:
$$\vec{c} = \sum_{d \in C} \vec{d}$$
The resulting set of prototype vectors, one vector for each class, represents the learned model.

TF-IDF Classifier
To classify a new document $d'$, the cosine between its document vector $\vec{d'}$ and the prototype vector of each class is calculated. The new document is assigned to the class with which its document vector has the highest cosine. The cosine measures the angle between the vector of the document being classified and the prototype vector of each class.
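
The steps above (TF-IDF weighting, summing document vectors into one prototype per class, and assigning a new document to the class with the highest cosine) can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: documents are assumed to be lists of tokens, vectors are stored as sparse dictionaries, and all function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def idf(docs):
    """IDF(w) = log(|D| / DF(w)), where DF(w) counts documents containing w."""
    df = Counter(w for doc in docs for w in set(doc))
    return {w: math.log(len(docs) / n) for w, n in df.items()}

def tfidf_vector(doc, idf_weights):
    """d(i) = TF(w_i, d) * IDF(w_i); words unseen in training are dropped."""
    tf = Counter(doc)
    return {w: tf[w] * idf_weights[w] for w in tf if w in idf_weights}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is zero)."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def train_prototypes(docs, labels):
    """Sum the TF-IDF vectors of each class into one prototype vector."""
    weights = idf(docs)
    prototypes = defaultdict(lambda: defaultdict(float))
    for doc, label in zip(docs, labels):
        for w, x in tfidf_vector(doc, weights).items():
            prototypes[label][w] += x
    return weights, prototypes

def classify(doc, weights, prototypes):
    """Assign the class whose prototype has the highest cosine with the document."""
    vec = tfidf_vector(doc, weights)
    return max(prototypes, key=lambda c: cosine(vec, prototypes[c]))

# Toy training data (made up for illustration only).
train_docs = [
    ["wheat", "grain", "export"],
    ["wheat", "crop", "tonnes"],
    ["shares", "company", "acquire"],
    ["company", "stake", "shares"],
]
train_labels = ["wheat", "wheat", "acq", "acq"]
weights, prototypes = train_prototypes(train_docs, train_labels)
print(classify(["wheat", "tonnes"], weights, prototypes))
```

The sparse dictionary representation is a design choice for readability. A word that occurs in every training document gets an IDF of zero and therefore no influence on the cosine.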
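$$H_{TFIDF}(d') = \arg\max_{C \in \mathcal{C}} \cos(\vec{d'}, \vec{c})$$

TF-IDF Classifier: summary
Decision rule of this classifier, with the cosine written out:
$$H_{TFIDF}(d') = \arg\max_{C \in \mathcal{C}} \frac{\vec{d'} \cdot \vec{c}}{\|\vec{d'}\| \, \|\vec{c}\|}$$

Naïve Bayes Classifier
The naïve Bayes classifier is based on the bag-of-words representation. It uses a probabilistic model to estimate the likelihood that a given document is in a class, and it uses this probability estimate for decision making.

Naïve Bayes Classifier
• Assumption: words are assumed to occur independently of the other words in the document.
• Bayes' rule [James, 1985] says that, to achieve the highest classification accuracy, $d'$ should be assigned to the class for which $\Pr(C \mid d')$ is highest.
$$H_{BAYES}(d') = \arg\max_{C \in \mathcal{C}} \Pr(C \mid d')$$

Naïve Bayes Classifier
Decision rule for the naïve Bayes classifier, which follows from Bayes' rule and the independence assumption:
$$H_{BAYES}(d') = \arg\max_{C \in \mathcal{C}} \Pr(C) \prod_{w \in d'} \Pr(w \mid C)^{TF(w, d')}$$
Here $\Pr(C)$ is the prior probability of class $C$, $\Pr(w \mid C)$ is the probability of word $w$ given class $C$, and $TF(w, d')$ is the number of times $w$ occurs in $d'$.

PrTFIDF: A Probabilistic Classifier Derived from TFIDF
• PrTFIDF is the TFIDF classifier analyzed in a probabilistic framework, which offers an elegant way to distinguish between a document and its representation.
• A function $\Theta$ maps the document to its representation.
• The classifier uses this representation for decision making.
$$H_{PrTFIDF}(d') = \arg\max_{C \in \mathcal{C}} \Pr(C \mid d', \Theta)$$

PrTFIDF Algorithm
$\Pr(C \mid d', \Theta)$ can be written in two parts:
$$\Pr(C \mid d', \Theta) = \sum_{x} \Pr(C \mid x) \cdot \Pr(x \mid d', \Theta)$$
• $\Pr(x \mid d', \Theta)$ maps document $d'$ to its representation $x$ with a certain probability according to $\Theta$.
• $\Pr(C \mid x)$ is the probability that a document with representation $x$ is in class $C$.
• As a design choice, the representation mapping $\Theta$ represents documents by single words. So when $x = w$, $\Pr(x \mid d', \Theta) = \Pr(w \mid d', \Theta)$.

The PrTFIDF Algorithm
The resulting decision rule for PrTFIDF is
$$H_{PrTFIDF}(d') = \arg\max_{C \in \mathcal{C}} \sum_{w} \Pr(C \mid w) \cdot \Pr(w \mid d', \Theta)$$
where, by using Bayes' theorem,
$$\Pr(C \mid w) = \frac{\Pr(w \mid C) \Pr(C)}{\sum_{C' \in \mathcal{C}} \Pr(w \mid C') \Pr(C')}$$

The equivalence between TFIDF and PrTFIDF
Assumptions to achieve the equivalence:
• Uniform class priors: each class contains an equal number of documents.
• There exists a constant factor such that, for all classes, the Euclidean length of the prototype vector equals that factor times the number of words in the class; that is, the Euclidean length of the prototype vector of each class is a linear function of the number of words in that class.

Assumption to achieve the equivalence
Refined version of IDF(w) suggested by the PrTFIDF algorithm.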
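
A small sketch can make the comparison between the two weights concrete. The expression for the refined weight is not preserved in this slide text, so the code assumes the form IDF'(w) = sqrt(|D| / sum over d of TF(w, d) / |d|), where |d| is the number of words in document d; this form is an assumption here. The function names and the toy documents are hypothetical, and every queried word is assumed to occur at least once in the collection.

```python
import math

def idf(word, docs):
    """Standard weight: log(|D| / DF(w)), DF(w) = documents containing w."""
    df = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / df)

def idf_refined(word, docs):
    """Assumed refined weight: sqrt(|D| / sum_d TF(w, d) / |d|)."""
    # Relative term frequency replaces the 0/1 occurrence count.
    relative_df = sum(doc.count(word) / len(doc) for doc in docs)
    return math.sqrt(len(docs) / relative_df)

# Toy documents as token lists (made up for illustration only).
docs = [
    ["wheat", "wheat", "crop", "grain"],
    ["wheat", "export", "corn", "usda"],
    ["shares", "company", "stake", "inc"],
    ["company", "inc", "cts", "shares"],
]
for word in ("wheat", "crop"):
    print(word, idf(word, docs), idf_refined(word, docs))
```

Under this assumed form, both weights shrink as a word becomes more common in the collection, but the refined weight also reacts to how often the word occurs inside each document.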
Differences:
• Relative term frequency is used instead of the occurrence of the word.
• Square root is used instead of logarithm. Both functions are similar in shape and reduce the impact of high document frequencies.

The Connection between TFIDF and PrTFIDF
The PrTFIDF decision rule can be transformed into a formula that has the format of the TFIDF decision rule.

Implication of the Analysis
The analysis shows how, and under which preconditions, the TFIDF classifier fits into a probabilistic framework. The close relationship to the probabilistic classifier offers a theoretical justification for the vector space model and the TFIDF word weighting heuristic.
• Use of prior probabilities $\Pr(C)$.
• Use of IDF'(w) for word weighting instead of IDF(w).
• Use of the number of words for normalization instead of the Euclidean length.

Experiments
Newsgroup dataset
• The collection contains a total of 20000 (20 × 1000) documents.
• The results reported on this dataset are averaged over a number of random test/training splits, using binomial sign tests to estimate significance.
• In each experiment, 33% of the data was used for testing.

Reuters dataset
A collection of 21450 articles, which are classified into 135 topic categories. Each article can have multiple category labels.
• 31% have no category label.
• 57% have exactly one label.
• 12% have more than one, and up to 12, class labels assigned.

Reuters dataset
The distribution of documents across the classes is very uneven.
Figure: the 20 most frequently used topics.

“acq” and “wheat” in Reuters
• The category “acq” is the one with the second most documents in it.
• The category “wheat” has a very narrow definition. There is a small number of words that are very good clues as to whether a document is in this category or not.
Reuters split: 14704 training documents, 6746 testing documents.

Experimental Results
Maximum accuracy in percent for the best parameter settings:

            20 Newsgroups   Reuters “acq”   Reuters “wheat”
PrTFIDF     90.3            89.3            95.6
Bayes       88.6            89.3            95.6
TFIDF       82.3            87.9            94.0

Results: 20 Newsgroups (TS)
Results: Reuters “acq” (TS)
Results: Reuters “wheat” (TS)
Results: 20 Newsgroups (FV)
Results: Reuters “acq” (FV)
Results: Reuters “wheat” (FV)

Result analysis
• How does the number of training examples influence accuracy?
-- As expected, the accuracy increases with the number of training examples. PrTFIDF, Bayes, and TFIDF differ in how quickly the accuracy increases.
-- For the newsgroup data, PrTFIDF performs well for small numbers of training examples, in contrast to Bayes. The accuracy of the TFIDF classifier increases less quickly than that of the probabilistic methods.
-- For the Reuters category “acq”, Bayes and PrTFIDF perform nearly identically. TFIDF is significantly below those two probabilistic methods. For the Reuters category “wheat”, there is no big difference among the three methods.

Number of Features vs. Accuracy
• What is the influence of the number of features on the accuracy?
-- For the newsgroup data, with the number of training examples kept at its maximum, accuracy rises with the number of features. PrTFIDF and Bayes are significantly above TFIDF. The overall highest performance is achieved using PrTFIDF with the largest feature set.
-- The findings on the Reuters category “acq” are similar.
-- The Reuters category “wheat” shows different characteristics. For the probabilistic methods, the accuracy does not rise with the number of words used. The highest performance is achieved by PrTFIDF and Bayes when only the minimum number of 10 words is used.

Special findings for “wheat”
-- The findings for the “wheat” category are probably due to the different properties of this task.
-- The definition of the category “wheat” is narrower than that of the other categories, and the single word “wheat” is a nearly perfect predictor of class membership. This explains why a small number of words can achieve maximum performance. Adding more words adds noise, since those words have lower predictive power.

Which method is most robust for small numbers of training examples?
-- With a rising number of training examples, the performance of Bayes approaches that of PrTFIDF.
-- Bayes becomes less accurate for large word-vector sizes. Experiments on the newsgroup data have shown that the smaller the word vector, the fewer training examples are needed for Bayes to exceed the performance of PrTFIDF.
-- This is because Bayes is very sensitive to inaccurate probability estimates, which arise with a low number of training examples.

Conclusion
• Although the TFIDF method showed reasonable accuracy on all classification tasks, the two probabilistic methods, Bayes and PrTFIDF, showed clear performance improvements on all three tasks.
• These empirical results suggest that probabilistically founded modeling is preferable to the heuristic TFIDF modeling.
• The probabilistic methods are preferable from a theoretical viewpoint too, since a probabilistic framework allows a clear statement and an easier understanding of the simplifying assumptions made.