INTRODUCTION TO ARTIFICIAL INTELLIGENCE
Massimo Poesio
LECTURE 14: Text categorization with Decision Trees and Naïve Bayes

REMINDER: DECISION TREES
• A DECISION TREE is a classifier in the form of a tree structure, where each node is either:
– a Leaf node, which indicates the value of the target attribute (class) of the examples, or
– a Decision node, which specifies a test on a single attribute value, with one branch and sub-tree for each possible outcome of the test.
• A decision tree can be used to classify an example by starting at the root of the tree and moving through it until a leaf node is reached; the leaf provides the classification of the instance.

Decision Tree Example
Goal: learn when we can play Tennis and when we cannot

Day  Outlook   Temp.  Humidity  Wind    Play Tennis
D1   Sunny     Hot    High      Weak    No
D2   Sunny     Hot    High      Strong  No
D3   Overcast  Hot    High      Weak    Yes
D4   Rain      Mild   High      Weak    Yes
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D7   Overcast  Cool   Normal    Weak    Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cool   Normal    Weak    Yes
D10  Rain      Mild   Normal    Strong  Yes
D11  Sunny     Mild   Normal    Strong  Yes
D12  Overcast  Mild   High      Strong  Yes
D13  Overcast  Hot    Normal    Weak    Yes
D14  Rain      Mild   High      Strong  No

Decision Tree for PlayTennis
Outlook?
– Sunny → Humidity?
  – High → No
  – Normal → Yes
– Overcast → Yes
– Rain → Wind?
  – Strong → No
  – Weak → Yes
• Each internal node tests an attribute
• Each branch corresponds to an attribute value
• Each leaf node assigns a classification

Classifying a new instance with the tree, e.g. (Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Weak): starting at the root, Outlook = Sunny leads to the Humidity test, and Humidity = High leads to the leaf No, so PlayTennis = No.
(Tree example from www.math.tau.ac.il/~nin/Courses/ML04/DecisionTreesCLS.pp)

TEXT CLASSIFICATION WITH DT
• As an example of an actual application of decision trees, we'll consider the problem of TEXT CLASSIFICATION

IS THIS SPAM?
From: "" <takworlld@hotmail.com>
Subject: real estate is the only way... gem oalvgkay
Anyone can buy real estate with no money down
Stop paying rent TODAY !
There is no need to spend hundreds or even thousands for similar courses
I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook.
Change your life NOW !
=================================================
Click Below to order:
http://www.wholesaledaily.com/sales/nmd.htm
=================================================

TEXT CATEGORIZATION
• Given:
– A description of an instance, x ∈ X, where X is the instance language or instance space.
  • Issue: how to represent text documents.
– A fixed set of categories: C = {c_1, c_2, …, c_n}
• Determine:
– The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.
• We want to know how to build categorization functions ("classifiers").

Document Classification
Test document: "planning language proof intelligence"
Classes: ML, Planning, Semantics, Garb.Coll., Multimedia, GUI (grouped under AI, Programming, and HCI)
Training data (typical words per class):
– ML: learning, intelligence, algorithm, reinforcement, network, …
– Planning: planning, temporal, reasoning, plan, language, …
– Semantics: programming, semantics, language, proof, …
– Garb.Coll.: garbage, collection, memory, optimization, region, …
(Note: in real life there is often a hierarchy, not present in the above problem statement; and you get papers on ML approaches to Garb. Coll.)
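To make the classification procedure concrete, here is a minimal sketch (not part of the original slides) that hard-codes the PlayTennis tree shown above as nested Python dictionaries and walks it from the root to a leaf. The data structure and function names are illustrative assumptions; a real learner would induce the tree from the table (e.g., with ID3) rather than write it by hand.

```python
# Minimal sketch: the PlayTennis decision tree above as nested dicts.
# A dict is a decision node; a plain string is a leaf (class label).
play_tennis_tree = {
    "attribute": "Outlook",
    "branches": {
        "Sunny":    {"attribute": "Humidity",
                     "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"attribute": "Wind",
                     "branches": {"Strong": "No", "Weak": "Yes"}},
    },
}

def classify(tree, example):
    """Follow decision nodes until a leaf is reached; return its label."""
    node = tree
    while isinstance(node, dict):            # still at a decision node
        value = example[node["attribute"]]   # test one attribute of the example
        node = node["branches"][value]       # follow the matching branch
    return node                              # leaf = classification

# The query from the slide: Outlook=Sunny, Temperature=Hot, Humidity=High, Wind=Weak
example = {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Wind": "Weak"}
print(classify(play_tennis_tree, example))   # -> "No"
```

Note that the Temperature attribute is never tested: the learned tree only uses Outlook, Humidity, and Wind.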
Text Categorization Examples
Assign labels to each document or web-page:
• Labels are most often topics such as Yahoo-categories
  e.g., "finance", "sports", "news>world>asia>business"
• Labels may be genres
  e.g., "editorials", "movie-reviews", "news"
• Labels may be opinion
  e.g., "like", "hate", "neutral"
• Labels may be domain-specific binary
  e.g., "interesting-to-me" : "not-interesting-to-me"
  e.g., "spam" : "not-spam"
  e.g., "is a toner cartridge ad" : "isn't"

TEXT CATEGORIZATION WITH DT
• Build a separate decision tree for each category
• Use WORD COUNTS as features

Reuters Data Set (21578 - ModApte split)
• 9603 training, 3299 test articles; ave. 200 words
• 118 categories
– An article can be in more than one category
– Learn 118 binary category distinctions
• Common categories (#train, #test):
– Earn (2877, 1087)
– Acquisitions (1650, 179)
– Money-fx (538, 179)
– Grain (433, 149)
– Crude (389, 189)
– Trade (369, 119)
– Interest (347, 131)
– Ship (197, 89)
– Wheat (212, 71)
– Corn (182, 56)

AN EXAMPLE OF REUTERS TEXT
[Sample Reuters newswire article; from Foundations of Statistical Natural Language Processing, Manning and Schuetze]

Decision Tree for Reuters classification
[Example decision tree for a Reuters category; from Foundations of Statistical Natural Language Processing, Manning and Schuetze]

OTHER LEARNING METHODS USED FOR TEXT CLASSIFICATION
• Bayesian methods (Naïve Bayes)
• Neural nets (e.g., perceptron)
• Vector-space methods (k-NN, Rocchio, unsupervised)
• SVMs

BAYESIAN METHODS
• Learning and classification methods based on probability theory.
• Bayes' theorem plays a critical role in probabilistic learning and classification.
• Build a generative model that approximates how data is produced.
• Uses the prior probability of each category given no information about an item.
• Categorization produces a posterior probability distribution over the possible categories given a description of an item.

Bayes' Rule
P(C, X) = P(C | X) P(X) = P(X | C) P(C)
P(C | X) = P(X | C) P(C) / P(X)

Maximum a posteriori Hypothesis
h_MAP = argmax_{h ∈ H} P(h | D)
      = argmax_{h ∈ H} P(D | h) P(h) / P(D)
      = argmax_{h ∈ H} P(D | h) P(h)

Naive Bayes Classifiers
Task: classify a new instance based on a tuple of attribute values x_1, x_2, …, x_n
c_MAP = argmax_{c_j ∈ C} P(c_j | x_1, x_2, …, x_n)
      = argmax_{c_j ∈ C} P(x_1, x_2, …, x_n | c_j) P(c_j) / P(x_1, x_2, …, x_n)
      = argmax_{c_j ∈ C} P(x_1, x_2, …, x_n | c_j) P(c_j)

Naïve Bayes Classifier: Assumptions
• P(c_j)
– Can be estimated from the frequency of classes in the training examples.
• P(x_1, x_2, …, x_n | c_j)
– O(|X|^n · |C|) parameters
– Could only be estimated if a very, very large number of training examples was available.
• Conditional Independence Assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities.

The Naïve Bayes Classifier
Flu example: class C = Flu; features X_1 = runny nose, X_2 = sinus, X_3 = cough, X_4 = fever, X_5 = muscle-ache
• Conditional Independence Assumption: features are independent of each other given the class:
P(X_1, …, X_5 | C) = P(X_1 | C) · P(X_2 | C) · … · P(X_5 | C)

Learning the Model
[Graphical model: class node C with children X_1, …, X_6]
• Common practice: maximum likelihood – simply use the frequencies in the data
P̂(c_j) = N(C = c_j) / N
P̂(x_i | c_j) = N(X_i = x_i, C = c_j) / N(C = c_j)

Problem with Max Likelihood
P(X_1, …, X_5 | C) = P(X_1 | C) · P(X_2 | C) · … · P(X_5 | C)
• What if we have seen no training cases where the patient had no flu but did have muscle aches?
P̂(X_5 = t | C = nf) = N(X_5 = t, C = nf) / N(C = nf) = 0
• Zero probabilities cannot be conditioned away, no matter the other evidence!
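To make the maximum-likelihood estimates and the zero-count problem concrete, here is a minimal sketch (not from the slides). The tiny three-record "flu" data set and the feature name muscle_ache are invented purely for illustration.

```python
# Minimal sketch: maximum-likelihood Naive Bayes estimates and the zero-count problem.
from collections import Counter, defaultdict

# (features, class) pairs; note there is no case with class no_flu AND muscle_ache = True.
training = [
    ({"muscle_ache": True},  "flu"),
    ({"muscle_ache": True},  "flu"),
    ({"muscle_ache": False}, "no_flu"),
]

class_counts = Counter(c for _, c in training)        # N(C = c_j)
feat_counts = defaultdict(Counter)                    # N(X_i = x_i, C = c_j)
for features, c in training:
    for name, value in features.items():
        feat_counts[c][(name, value)] += 1

N = len(training)
P_c = {c: n / N for c, n in class_counts.items()}     # P̂(c_j) = N(C=c_j) / N

def P_x_given_c(name, value, c):
    """Maximum-likelihood estimate P̂(x_i | c_j) -- no smoothing."""
    return feat_counts[c][(name, value)] / class_counts[c]

print(P_c)                                            # {'flu': 0.666..., 'no_flu': 0.333...}
# The problem: this estimate is exactly 0, so the product P̂(c) * prod_i P̂(x_i | c)
# is zeroed out for no_flu no matter what the other features say.
print(P_x_given_c("muscle_ache", True, "no_flu"))     # -> 0.0
```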
ℓ = argmax_c P̂(c) ∏_i P̂(x_i | c)

Smoothing to Avoid Overfitting
P̂(x_i | c_j) = (N(X_i = x_i, C = c_j) + 1) / (N(C = c_j) + k),  where k = number of values of X_i
• Somewhat more subtle version:
P̂(x_{i,k} | c_j) = (N(X_i = x_{i,k}, C = c_j) + m·p_{i,k}) / (N(C = c_j) + m)
where p_{i,k} is the overall fraction of the data in which X_i = x_{i,k}, and m controls the extent of smoothing.

Using Naive Bayes Classifiers to Classify Text: Basic method
• Attributes are text positions, values are words.
c_NB = argmax_{c_j ∈ C} P(c_j) ∏_i P(x_i | c_j)
     = argmax_{c_j ∈ C} P(c_j) P(x_1 = "our" | c_j) · … · P(x_n = "text" | c_j)
• The Naive Bayes assumption is clearly violated. Example?
• Still too many possibilities
• Assume that classification is independent of the positions of the words
– Use the same parameters for each position

Text Classification Algorithms: Learning
• From the training corpus, extract the Vocabulary
• Calculate the required P(c_j) and P(x_k | c_j) terms
– For each c_j in C do:
  • docs_j ← subset of documents for which the target class is c_j
  • P(c_j) ← |docs_j| / |total # documents|
  • Text_j ← single document containing all of docs_j
  • for each word x_k in Vocabulary:
    – n_k ← number of occurrences of x_k in Text_j
    – P(x_k | c_j) ← (n_k + 1) / (n + |Vocabulary|), where n is the total number of word positions in Text_j

Text Classification Algorithms: Classifying
• positions ← all word positions in the current document which contain tokens found in Vocabulary
• Return c_NB, where
c_NB = argmax_{c_j ∈ C} P(c_j) ∏_{i ∈ positions} P(x_i | c_j)

Underflow Prevention
• Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.
• Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.
• The class with the highest final un-normalized log probability score is still the most probable.

Naïve Bayes Posterior Probabilities
• The classification results of naïve Bayes (the class with maximum posterior probability) are usually fairly accurate.
• However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability numerical estimates are not.
– Output probabilities are generally very close to 0 or 1.

READINGS
• Fabrizio Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1-47, 2002.

REMAINING LECTURES
DAY        HOUR   TOPIC
Wed 25/11  12-14  Text classification with Artificial Neural Nets
Fri 27/11  14-16  Lab: Supervised ML with Weka
Fri 4/12   10-12  Unsupervised methods & text classification
Fri 4/12   14-16  Lab: Linear algebra
Wed 9/12   10-12  Lexical acquisition by clustering
Thu 10/12  10-12  Psychological evidence on learning
Fri 11/12  10-12  Psychological evidence on language processing
Fri 11/12  14-16  Lab: Clustering
Mon 14/12  10-12  Intro to NLP
Tue 15/12  10-12  Ling. & psychological evidence on anaphora
Thu 17/12  10-12  Machine learning appr. to anaphora
Fri 18/12  10-12  Corpora for anaphora
Fri 18/12  14-16  Lab: BART
Mon 21/12  10-12  Lexical & commons. knowledge for anaphora
Tue 22/12  10-12  Salience
Tue 22/12  14-16  Discourse new detection

ACKNOWLEDGMENTS
• Several slides come from Chris Manning & Hinrich Schuetze's course on IR and text classification
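As a closing illustration of the Naïve Bayes text classifier described above (words as attribute values, add-one smoothing over the vocabulary, log-space scoring to prevent underflow), here is a minimal sketch. The toy training corpus and the function names train/classify are invented for illustration; this is not code from the lecture.

```python
# Minimal sketch: multinomial Naive Bayes for text with add-one smoothing
# and log-space scoring, following the Learning/Classifying steps above.
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (list_of_words, class_label). Returns priors, word probs, vocabulary."""
    vocab = {w for words, _ in docs for w in words}
    class_docs = Counter(c for _, c in docs)                     # |docs_j|
    word_counts = defaultdict(Counter)                           # n_k per class
    for words, c in docs:
        word_counts[c].update(words)

    priors = {c: n / len(docs) for c, n in class_docs.items()}   # P(c_j)
    cond = {}
    for c in class_docs:
        n = sum(word_counts[c].values())                         # word positions in Text_j
        cond[c] = {w: (word_counts[c][w] + 1) / (n + len(vocab)) # (n_k + 1) / (n + |Vocabulary|)
                   for w in vocab}
    return priors, cond, vocab

def classify(words, priors, cond, vocab):
    """Return argmax_c [log P(c) + sum_i log P(x_i | c)], skipping out-of-vocabulary tokens."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for w in words:
            if w in vocab:
                score += math.log(cond[c][w])
        scores[c] = score
    return max(scores, key=scores.get)

# Toy corpus (invented): two "spam" and two "ham" documents.
docs = [("our cheap meds cheap".split(), "spam"),
        ("meeting agenda for project".split(), "ham"),
        ("cheap real estate offer".split(), "spam"),
        ("project report and agenda".split(), "ham")]
priors, cond, vocab = train(docs)
print(classify("cheap offer for our meds".split(), priors, cond, vocab))   # -> "spam"
```

Summing logs rather than multiplying probabilities leaves the argmax unchanged, exactly as the Underflow Prevention slide notes, while keeping the scores in a numerically safe range.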