Text Classification

Text Classification: Definition
• Text classification is the assignment of text documents to one or more predefined categories based on their content.
[Figure: a classifier maps each incoming text document to one of the classes A, B, or C]
• The classifier:
  – Input: a set of m hand-labeled documents (x1, y1), ..., (xm, ym)
  – Output: a learned classifier f: x → y

Text Classification: Applications
• Classify news stories as World, US, Business, SciTech, Sports, Entertainment, Health, or Other.
• Classify business names by industry.
• Classify student essays as A, B, C, D, or F.
• Classify email as Spam or Other.
• Classify PDF files as ResearchPaper or Other.
• Classify documents as WrittenByReagan or GhostWritten.
• Classify movie reviews as Favorable, Unfavorable, or Neutral.
• Classify technical papers as Interesting or Uninteresting.
• Classify jokes as Funny or NotFunny.
• Classify companies' web sites by Standard Industrial Classification (SIC) code.

Cost of Manual Text Categorization
• Yahoo!: roughly 200 people for manual labeling of Web pages, using a hierarchy of 500,000 categories.
• MEDLINE (National Library of Medicine): $2 million/year for manual indexing of journal articles using Medical Subject Headings (18,000 categories).
• Mayo Clinic: $1.4 million annually for coding patient-record events using the International Classification of Diseases (ICD) for billing insurance companies.
• US Census Bureau, decennial census (1990: 22 million responses): 232 industry categories and 504 occupation categories; roughly $15 million if fully done by hand.

Text Classification Framework
Documents → Preprocessing → Indexing → Feature selection → Applying classification algorithms → Performance measure

Classification Using Vector Spaces (Sec. 14.1)
• Each document is a vector, with one component for each term.
• Terms are axes.
• High dimensionality: hundreds of thousands of dimensions.
• Normalize the vectors (documents) to unit length.
• How can we do classification in this space?
• In vector space classification, the training set corresponds to a labeled set of points (equivalently, vectors).
• Premise 1: documents in the same class form a contiguous region of space.
• Premise 2: documents from different classes don't overlap (much).
• Learning a classifier: build surfaces that delineate the classes in the space.

[Figure: documents from three classes (Government, Science, Arts) plotted in a vector space]
[Figure: a test document is added; which class does it belong to?]
[Figure: the test document is labeled Government; is this similarity hypothesis true in general?]
Our focus: how to find good separators.

Relevance Feedback vs. Text Classification
• Relevance feedback is a form of text classification.
• The principal difference between relevance feedback and text classification:
  – In text classification, the training set is given as part of the input.
  – In relevance feedback, it is created interactively.

Rocchio Classification: Basic Idea (Sec. 14.2)
• Compute a centroid for each class; the centroid is the average of all documents in the class.
• Assign each test document to the class of its closest centroid.

Definition of Centroid
$$\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)$$
• where $D_c$ is the set of all documents that belong to class c, and $\vec{v}(d)$ is the vector space representation of d.
• Note that the centroid will in general not be a unit vector, even when the inputs are unit vectors.
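To make the centroid computation and the nearest-centroid rule concrete, here is a minimal sketch in Python (NumPy assumed; the function names, the toy vectors, and the use of Euclidean distance are illustrative choices, not from the slides):

```python
import numpy as np

def train_rocchio(X, y):
    """Compute one centroid per class: the mean of that class's document vectors."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def classify_rocchio(centroids, x):
    """Assign x to the class of its closest centroid (Euclidean distance)."""
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

# Toy example (illustrative): three documents, normalized to unit length
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])
X = X / np.linalg.norm(X, axis=1, keepdims=True)
y = np.array(["Government", "Government", "Arts"])

centroids = train_rocchio(X, y)
print(classify_rocchio(centroids, np.array([0.7, 0.3])))  # -> "Government"
```

Note the slide's caveat: centroids are generally not unit vectors even when the documents are, so ranking them by cosine similarity and by Euclidean distance can differ; the sketch uses Euclidean distance for simplicity.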
Rocchio Classification (Sec. 14.2)
• Rocchio forms a simple representative for each class: the centroid (prototype).
• Classification: nearest prototype/centroid.
• It does not guarantee that classifications are consistent with the given training data.
• Rocchio cannot handle multimodal classes: if a class forms several separate clusters in the space, its single centroid may lie in a region that belongs to another class.

Rocchio Classification: Discussion
• Little used outside text classification:
  – It has been used quite effectively for text classification.
  – But it is in general worse than Naive Bayes.
• Again, it is cheap to train and to test documents.

Bayes Classifiers
• Consider a document D whose class is given by C. In the case of email spam filtering there are two classes, C = S (spam) and C = H (ham). We classify D as the class with the highest posterior probability P(C|D), which can be re-expressed using Bayes' theorem.
• Task: classify a new instance D, described by a tuple of attribute values D = ⟨x1, x2, ..., xn⟩, into one of the classes c_j ∈ C:
$$c_{MAP} = \underset{c_j \in C}{\operatorname{argmax}}\; P(c_j \mid x_1, x_2, \dots, x_n) = \underset{c_j \in C}{\operatorname{argmax}}\; \frac{P(x_1, x_2, \dots, x_n \mid c_j)\, P(c_j)}{P(x_1, x_2, \dots, x_n)} = \underset{c_j \in C}{\operatorname{argmax}}\; P(x_1, x_2, \dots, x_n \mid c_j)\, P(c_j)$$

Naive Bayes Assumption
• P(c_j) can be estimated from the frequency of classes in the training examples.
• P(x1, x2, ..., xn | c_j): the naive Bayes conditional independence assumption. Assume that the probability of observing the conjunction of attributes equals the product of the individual probabilities P(x_i | c_j).

The Naive Bayes Classifier
[Figure: class Flu with features X1 = runny nose, X2 = sinus, X3 = cough, X4 = fever, X5 = muscle ache]
• Conditional independence assumption: features detect term presence and are independent of each other given the class:
$$P(X_1, \dots, X_5 \mid C) = P(X_1 \mid C)\, P(X_2 \mid C) \cdots P(X_5 \mid C)$$

Learning the Model
• First attempt: maximum likelihood estimates; simply use the frequencies in the data:
$$\hat{P}(c_j) = \frac{N(C = c_j)}{N} \qquad\qquad \hat{P}(x_i \mid c_j) = \frac{N(X_i = x_i,\, C = c_j)}{N(C = c_j)}$$

Problem with Maximum Likelihood
• What if we have seen no training cases where a patient had muscle aches but no flu? Then
$$\hat{P}(X_5 = t \mid C = nf) = \frac{N(X_5 = t,\, C = nf)}{N(C = nf)} = 0$$
• Zero probabilities cannot be conditioned away, no matter the other evidence!

Smoothing to Eliminate Zeros
$$\hat{P}(x_i \mid c_j) = \frac{N(X_i = x_i,\, C = c_j) + k}{N(C = c_j) + k\,|X|}$$
where |X| is the number of values X_i can take.
• Add-one (Laplace) smoothing uses k = 1.
• It acts as a uniform prior (each attribute value occurs once for each class) that is then updated as evidence from the training data comes in.

Two Document Models
• Two probabilistic models of documents, both of which represent documents as a bag of words, using the naive Bayes assumption.
• Both models represent documents using feature vectors whose components correspond to word types. If we have a vocabulary V containing |V| word types, then the feature vector dimension is d = |V|.
• Bernoulli document model: a document is represented by a binary feature vector whose elements take value 1 if the corresponding word is present in the document and 0 if it is not.
• Multinomial document model: a document is represented by a feature vector with integer elements whose values are the frequencies of the corresponding words in the document.

The Bernoulli Model
• In the Bernoulli model a document is represented by a binary vector.
• Let b_i be the feature vector for the i-th document D_i; the t-th element of b_i, written b_it, is either 0 or 1, representing the absence or presence of word w_t in the i-th document.
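As a concrete illustration of the two representations, here is a small sketch (the vocabulary and document are made up for illustration):

```python
from collections import Counter

# Illustrative vocabulary (word types, alphabetical)
vocab = ["beijing", "chinese", "japan", "macao", "shanghai", "tokyo"]

def bernoulli_vector(doc, vocab):
    """1 if the vocabulary word occurs in the document, 0 otherwise."""
    words = set(doc.split())
    return [1 if w in words else 0 for w in vocab]

def multinomial_vector(doc, vocab):
    """The frequency of each vocabulary word in the document."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

doc = "chinese chinese chinese tokyo japan"
print(bernoulli_vector(doc, vocab))    # [0, 1, 1, 0, 0, 1]
print(multinomial_vector(doc, vocab))  # [0, 3, 1, 0, 0, 1]
```

The Bernoulli vector discards how often a word occurs; the multinomial vector keeps the counts, which is exactly the difference between the two models above.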
• Let P(w_t | C) be the probability of word w_t occurring in a document of class C; the probability of w_t not occurring in a document of this class is (1 − P(w_t | C)). If we make the naive Bayes assumption, that the probability of each word occurring in the document is independent of the occurrences of the other words, then we can write the document likelihood P(D_i | C) in terms of the individual word likelihoods P(w_t | C):
$$P(D_i \mid C) = \prod_{t=1}^{|V|} \Big[\, b_{it}\, P(w_t \mid C) + (1 - b_{it})\,\big(1 - P(w_t \mid C)\big) \Big]$$

Training a Bernoulli NB Classifier
1. Define the vocabulary V; the number of words in the vocabulary defines the dimension of the feature vectors.
2. Count the following in the training set:
   • N, the total number of documents;
   • N_k, the number of documents labelled with class C = k, for k = 1, ..., K;
   • n_k(w_t), the number of documents of class C = k containing word w_t, for every class and for each word in the vocabulary.
3. Estimate the likelihoods: P̂(w_t | C = k) = n_k(w_t) / N_k.
4. Estimate the priors: P̂(C = k) = N_k / N.
• To classify an unlabelled document D_j, we estimate the posterior probability for each class and pick the class with the highest posterior, as in the sketch after this list.

NB Example
• c(5) = ?
[Figure: a small labeled training set and test document 5 for the worked example]

Bernoulli NB Classifier
• Feature likelihood estimates and posterior computed from the document counts (shown on the slide).
• Result: c(5) ≠ China.

Parameter Estimation
• Bernoulli model: P̂(X_w = t | c_j) = fraction of documents of topic c_j in which word w appears.
• Multinomial model: P̂(X_i = w | c_j) = fraction of times word w appears across all documents of topic c_j.

Multinomial NB Classifier
• Feature likelihood estimates and posterior computed from the term frequencies (shown on the slide).
• Result: c(5) = China.

Naive Bayes is Not So Naive
• Very fast learning and testing (basically just count words).
• Low storage requirements.
• Very good in domains with many equally important features.
• More robust to irrelevant features than many learning methods.
• Naive Bayes won 1st and 2nd place in the KDD-CUP 97 competition, out of 16 systems. Goal: financial-services direct-mail response prediction, i.e., predict whether the recipient of a mailing will actually respond to the advertisement (750,000 records).
• A good, dependable baseline for text classification (but not the best)!
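The four training steps and the posterior rule translate almost line for line into code. Below is a minimal sketch, assuming whitespace-tokenized documents; the add-one smoothing in step 3 (giving (n_k(w_t) + 1) / (N_k + 2), since a Bernoulli feature has two possible values) and the log-space scoring are implementation choices consistent with the smoothing slide, not part of the slide's algorithm:

```python
import math
from collections import defaultdict

def train_bernoulli_nb(docs, labels):
    """Steps 1-4: vocabulary, counts, likelihoods P(w_t|C=k), priors P(C=k)."""
    vocab = sorted({w for d in docs for w in d.split()})       # step 1
    N = len(docs)                                              # step 2: N
    N_k = defaultdict(int)                                     # step 2: N_k
    n_kw = defaultdict(lambda: defaultdict(int))               # step 2: n_k(w_t)
    for d, k in zip(docs, labels):
        N_k[k] += 1
        for w in set(d.split()):                               # presence, not frequency
            n_kw[k][w] += 1
    # Steps 3-4, with add-one smoothing so no likelihood is exactly 0 or 1
    likelihood = {k: {w: (n_kw[k][w] + 1) / (N_k[k] + 2) for w in vocab} for k in N_k}
    prior = {k: N_k[k] / N for k in N_k}
    return vocab, prior, likelihood

def classify_bernoulli_nb(doc, vocab, prior, likelihood):
    """Pick the class with the highest posterior (log space avoids underflow)."""
    present = set(doc.split())
    scores = {}
    for k in prior:
        s = math.log(prior[k])
        for w in vocab:                   # absent words contribute (1 - P(w|k))
            p = likelihood[k][w]
            s += math.log(p if w in present else 1.0 - p)
        scores[k] = s
    return max(scores, key=scores.get)

# Illustrative tiny corpus (not the slides' example)
docs = ["chinese beijing chinese", "chinese macao", "tokyo japan chinese"]
labels = ["China", "China", "notChina"]
vocab, prior, lik = train_bernoulli_nb(docs, labels)
print(classify_bernoulli_nb("chinese chinese tokyo", vocab, prior, lik))  # -> "China" (narrowly)
```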
k Nearest Neighbor Classification (Sec. 14.3)
• kNN = k nearest neighbor.
• To classify a document d:
  – Define the k-neighborhood as the k nearest neighbors of d.
  – Pick the majority class label in the k-neighborhood.

[Figure: 6NN example; the test document is classified by majority vote among its six nearest neighbors from the Government, Science, and Arts classes]

Nearest-Neighbor Learning
• Learning: just store the labeled training examples D.
• Testing instance x (under 1NN):
  – Compute the similarity between x and all examples in D.
  – Assign x the category of the most similar example in D.
• Computes nothing beyond storing the examples.
• Also called case-based learning, memory-based learning, or lazy learning.
• Rationale of kNN: the contiguity hypothesis.

k Nearest Neighbor
• Using only the closest example (1NN) is subject to errors due to:
  – a single atypical example;
  – noise (i.e., an error) in the category label of a single training example.
• More robust: find the k nearest examples and return the majority category among them.
• k is typically odd to avoid ties (in two-class problems); 3 and 5 are the most common choices.

kNN Decision Boundaries
• Boundaries are in principle arbitrary surfaces, but usually polyhedra.
[Figure: piecewise-linear kNN decision boundaries between the Government, Science, and Arts classes]
• kNN gives locally defined decision boundaries between classes: far-away points do not influence each classification decision (unlike in Naive Bayes, Rocchio, etc.).

kNN: Discussion
• No feature selection necessary.
• No training necessary.
• Scales well with a large number of classes:
  – no need to train n classifiers for n classes.
• Classes can influence each other:
  – small changes to one class can have a ripple effect.
• May be expensive at test time.
• In most cases it is more accurate than NB or Rocchio.
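For completeness, a minimal kNN sketch in the same style (NumPy assumed; cosine similarity on unit-normalized vectors, matching the vector-space setup earlier; the toy data is illustrative):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k=3):
    """Majority vote among the k training documents most similar to x.
    Assumes the rows of X_train and the query x are unit-normalized,
    so the dot product equals cosine similarity."""
    sims = X_train @ x                     # cosine similarity to every example
    top_k = np.argsort(sims)[-k:]          # indices of the k nearest neighbors
    votes = Counter(y_train[i] for i in top_k)
    return votes.most_common(1)[0][0]

# Toy example (illustrative): four labeled documents, k = 3
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
X = X / np.linalg.norm(X, axis=1, keepdims=True)
y = np.array(["Government", "Government", "Arts", "Arts"])
x = np.array([0.8, 0.2]); x = x / np.linalg.norm(x)
print(knn_classify(X, y, x, k=3))          # -> "Government"
```

Note that an odd k rules out voting ties only with two classes; with three or more classes, a practical implementation needs an explicit tie-breaking rule (e.g., summed similarity per class).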