Distributional Clustering of Words for Text Classification
L. Douglas Baker (Carnegie Mellon University)
Andrew Kachites McCallum (Justsystem Pittsburgh Research Center)
Presentation by: Thomas Walsh (Rutgers University)

Clustering
• Define what it means for words to be "similar".
• "Collapse" the word space by grouping similar words into "clusters".
• Key idea for distributional clustering:
  – The class probabilities given the words in a labeled document collection, P(C | w), provide the rules for correlating words with classifications.

Voting
• Can be understood with a voting model:
• Each word in a document casts a weighted vote for the classification.
• Words that normally vote similarly can be clustered together and vote with the average of their weighted votes without negatively impacting performance.

Benefits of Word Clustering
• Useful semantic word clusterings
  – Automatically generates a "thesaurus"
• Higher classification accuracy
  – Sort of; we'll discuss this in the results section
• Smaller classification models
  – Size reductions as dramatic as 50,000 features down to 50

Benefits of Smaller Models
• Easier to compute
  – With the constantly increasing amount of available text, reducing the memory footprint is essential.
• Memory-constrained devices like PDAs could now use text classification algorithms to organize documents.
• More complex algorithms can be unleashed that would be infeasible in 50,000 dimensions.

The Framework
• Start with training data consisting of:
  – A set of classes C = {c1, c2, …, cm}
  – A set of documents D = {d1, …, dn}
  – Each document has a class label

Mixture Models
• f(xi | θ) = Σk πk h(xi | λk)
• The πk's sum to 1.
• h is a distribution function for x (such as a Gaussian) with λk as its parameter ((μ, Σ) in the Gaussian case).
• Thus θ = (π1 … πK, λ1 … λK)

What is θ in this case?
• Assumption: one-to-one correspondence between the mixture model components and the classes.
• The class priors P(cj | θ) are contained in the parameter vector θ.
  – Estimated as the number of instances of each class divided by the number of documents.

What is θ in this case?
• The rest of the entries in θ correspond to disjoint sets. The jth entry contains the probability of each word wt in the vocabulary V given the class cj.
• N(wt, di) is the number of times word wt appears in document di.
• P(cj | di) ∈ {0, 1}

Prob. of a Given Document in the Model
• The mixture model can be used to produce documents with probability:
  – (1) P(di | θ) = Σj P(cj | θ) P(di | cj; θ)
• Just the sum, over each class, of the probability of generating this document from that class.

Documents as Collections of Words
• Treat each document as an ordered collection of word events.
• dik = the word in document di at position k.
• Each word is dependent on the preceding words.

Apply Naïve Bayes Assumption
• Assume each word is independent of both context and position.
• Writing wdik for the word at position k, update formulas (2) and (1):
  – (2) P(di | cj; θ) = Πk P(wdik | cj; θ)
  – (1) P(di | θ) = Σj P(cj | θ) Πk P(wdik | cj; θ)

Incorporate Expanded Formulae for θ
• We can calculate the model parameters θ from the training data.
• Now we wish to calculate P(cj | di; θ), the probability that document di belongs to class cj.

Final Equation
• P(cj | di; θ) = [ P(cj | θ) Πk P(wdik | cj; θ) ] / [ Σr P(cr | θ) Πk P(wdik | cr; θ) ]
  – Numerator: the class prior times (2), the product of the probabilities of each word in the document assuming class cj.
  – Denominator: the sum over all classes cr of the class prior times the product of all word probabilities assuming class cr.
• Maximize over cj; that value of cj is the class assigned to the document.

Shortcomings of the Framework
• In real-world data (documents) there isn't actually an underlying mixture model, and the independence assumption doesn't actually hold.
• But empirical evidence and some theoretical work (Domingos and Pazzani 1997) indicate that the damage from this is negligible.
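After the naïve Bayes assumption, the final equation reduces to scoring each class by its prior times the product of per-word probabilities and taking the argmax. The minimal Python sketch below illustrates that decision rule; the names (classify, class_priors, word_probs) are hypothetical, and the log-space sum and the uniform fallback for unseen words are practical choices for the sketch, not details taken from the slides.

```python
import math

# A minimal sketch of the naive Bayes decision rule described above.
# class_priors[c]  ~ P(c | theta)
# word_probs[c][w] ~ P(w | c; theta), assumed already estimated from training data
def classify(document_words, class_priors, word_probs, vocab_size):
    scores = {}
    for c, prior in class_priors.items():
        log_score = math.log(prior)          # log P(c | theta)
        for w in document_words:
            # P(w | c; theta); fall back to a uniform guess for unseen words
            # (an assumption of this sketch, not something the slides specify).
            p = word_probs[c].get(w, 1.0 / vocab_size)
            log_score += math.log(p)
        scores[c] = log_score
    # The normalizing denominator is shared by every class, so the argmax over
    # the numerators equals the argmax over the full posterior.
    return max(scores, key=scores.get)

# Toy usage with made-up numbers.
priors = {"sports": 0.5, "politics": 0.5}
probs = {
    "sports":   {"ball": 0.30, "game": 0.30, "vote": 0.05},
    "politics": {"ball": 0.05, "game": 0.05, "vote": 0.40},
}
print(classify(["ball", "game", "vote"], priors, probs, vocab_size=1000))  # -> sports
```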
What About Clustering?
• So, assuming the framework holds… how does clustering fit into all this?

How Does Clustering Affect Probabilities?
• The merged cluster's class distribution is a weighted average: the fraction of the cluster's occurrences that come from wt times P(C | wt), plus the fraction that come from ws times P(C | ws).

Vs. Other Forms of Learning
• Measures similarity based on the property it is trying to estimate (the classes).
  – Makes the supervision in the training data really important.
• Clustering is based on the similarity of the class variable distributions.
• Key idea: clustering preserves the "shape" of the class distributions.

Kullback-Leibler Divergence
• Measures the similarity between class distributions:
  – D( P(C | wt) || P(C | ws) ) = Σj P(cj | wt) log( P(cj | wt) / P(cj | ws) )
• If P(cj | wt) = P(cj | ws) for every class, each term contains log(1) = 0, so the divergence is zero.

Problems with K-L Divergence
• Not symmetric.
• The denominator can be 0 if ws does not appear in any documents of class cj.

K-L Divergence from the Mean
• Each word contributes the fraction of the cluster's occurrences it accounts for, times the KL divergence between that word's class distribution and the cluster's.
• New and improved: uses a weighted average rather than a simple mean.
• Justification: fits clustering, because the independent distributions now form combined statistics.

Minimizing Error in Naïve Bayes Scores
• Assuming uniform class priors allows us to drop P(cj | θ) and the whole denominator from (6).
• A little algebra then gets us to cross entropy.
• So error can be measured as the difference in cross entropy caused by clustering. Minimizing this difference yields equation (9), so clustering with this measure minimizes error.

The Clustering Algorithm
• Comparing the similarity of all possible word clusters would be O(V²).
• Instead, a number M is set as the total number of desired clusters.
  – More supervision.
• The M clusters are initialized with the M words having the highest mutual information with the class variable.
• Properties: greedy, scales efficiently.

Algorithm
• [Figure: pseudocode of the greedy clustering procedure, built around the words' class distributions Σ P(C | wt).]
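The slides describe the clustering procedure only at a high level: greedy, M clusters seeded by mutual information, and KL divergence to the mean as the similarity measure. The Python sketch below shows one simplified way those pieces could fit together; it is not the authors' exact procedure, and the names (kl_to_mean, cluster_words, class_dists, counts) are hypothetical.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) between two class distributions
    given as lists of probabilities over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_to_mean(p_t, p_s, n_t, n_s):
    """KL divergence to the (weighted) mean: compare each word's class
    distribution to the count-weighted average of the two, which stays finite
    and avoids the zero-denominator problem of plain KL."""
    w_t = n_t / (n_t + n_s)
    w_s = n_s / (n_t + n_s)
    mean = [w_t * a + w_s * b for a, b in zip(p_t, p_s)]
    return w_t * kl(p_t, mean) + w_s * kl(p_s, mean)

def cluster_words(class_dists, counts, M):
    """Simplified greedy sketch: seed M singleton clusters with the first M
    words (assumed pre-sorted by mutual information with the class, as the
    slides describe), then fold each remaining word into the most similar
    cluster under KL-to-the-mean.
    class_dists[w] ~ P(C | w) as a list; counts[w] ~ occurrences of w."""
    words = list(class_dists)
    clusters = [([w], list(class_dists[w]), counts[w]) for w in words[:M]]
    for w in words[M:]:
        best = min(range(len(clusters)), key=lambda i: kl_to_mean(
            class_dists[w], clusters[i][1], counts[w], clusters[i][2]))
        members, dist, n = clusters[best]
        total = n + counts[w]
        # The merged cluster's class distribution is the count-weighted average,
        # i.e. the "fraction of cluster from wt / ws" combination above.
        merged = [(n * a + counts[w] * b) / total
                  for a, b in zip(dist, class_dists[w])]
        clusters[best] = (members + [w], merged, total)
    return clusters

# Toy usage with made-up distributions over two classes.
dists = {"ball": [0.90, 0.10], "vote": [0.10, 0.90], "game": [0.85, 0.15]}
cnts = {"ball": 50, "vote": 60, "game": 40}
print(cluster_words(dists, cnts, M=2))  # "game" joins the "ball" cluster
```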
Related Work
• ChiMerge / Chi2
  – Use distributional clustering to discretize numeric values.
• Class-based clustering
  – Uses the amount by which mutual information is reduced to decide when to cluster.
  – Not effective for text classification.
• Feature selection by mutual information
  – Cannot capture dependencies between words.
• Markov-blanket-based feature selection
  – Also attempts to preserve the shape of P(C | wt).
• Latent Semantic Indexing
  – Unsupervised, using PCA.

The Experiment: Competitors to Distributional Clustering
• Clustering with LSI
• Information-gain-based feature selection
• Mutual-information feature selection
• Feature selection involves cutting out redundant features; clustering combines these redundancies instead.

The Experiment: Testbeds
• 20 Newsgroups
  – 20,000 articles from 20 Usenet groups (approx. 62,000 words)
• ModApte "Reuters-21578"
  – 9,603 training docs, 3,299 testing docs, 135 topics (approx. 16,000 words)
• Yahoo! Science (July 1997)
  – 6,294 pages in 41 classes (approx. 44,000 words)
  – Very noisy data

20 Newsgroups Results
• Averaged over 5-20 trials.
• Computational constraints forced the Markov blanket method onto a smaller data set (second graph).
• LSI uses only a 1/3 training ratio.

20 Newsgroups Analysis
• Distributional clustering achieves 82.1% accuracy at 50 features, almost as good as having the full vocabulary.
• More accurate than all non-clustering approaches.
• LSI did not add any improvement to clustering (claim: because it is unsupervised).
• On the smaller data set, D.C. reaches 80% accuracy far more quickly than the others, in some cases doubling their performance for small numbers of features.
• Claim: clustering outperforms feature selection because it conserves information rather than discarding it.

Speed in the 20 Newsgroups Test
• Distributional clustering: 7.5 minutes
• LSI: 23 minutes
• Markov blanket: 10 hours
• Mutual-information feature selection (???): 30 seconds

Reuters-21578 Results
• D.C. outperforms the others for small numbers of features.
• Information-gain-based feature selection does better for larger feature sets.
• In this data set, documents can have multiple labels.

Yahoo! Results
• Feature selection performs almost as well or better in these cases.
• Claim: the data is so noisy that it is actually beneficial to "lose data" via feature selection.

Performance Summary
• Only a slight loss in accuracy despite the reduction in feature space.
• Preserves "redundant" information better than feature selection.
• The improvement is not as drastic with noisy data.

Improvements on Earlier D.C. Work
• Does not show much improvement on sparse data, because the performance measure is tied to the data distribution.
  – D.C. preserves class distributions, even if these are poor estimates to begin with.
• Thus the whole method relies on accurate values for P(C | wi).

Future Work
• Improve D.C.'s handling of sparse data (ensure good estimates of P(C | wi)).
• Find ways to combine feature selection and D.C. to exploit the strengths of both (perhaps improving performance on noisy data sets?).

Some Thoughts
• Extremely supervised.
• Needs to be retrained when new documents come in.
• In a paper with a lot of topics, does naïve Bayes (each word independent of its context) make sense?
• Didn't work well on noisy data.
• How can we ensure proper θ values?
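The closing slides hinge on how reliable the estimates of P(C | w) are on sparse data. As a hedged illustration (not taken from the paper), the short Python sketch below estimates P(c | w) from labeled occurrence counts with optional Laplace smoothing; the function name and the smoothing choice are assumptions for the example.

```python
# Illustration of the sparse-data concern: estimating P(C | w) from labeled counts.
# Laplace smoothing is shown only as one common option; the slides leave the
# choice of estimator open.
def class_given_word(word_class_counts, num_classes, smoothing=1.0):
    """word_class_counts[c] = occurrences of the word in class-c documents.
    Returns a smoothed estimate of P(c | w) for each class."""
    total = sum(word_class_counts.values()) + smoothing * num_classes
    return {c: (n + smoothing) / total for c, n in word_class_counts.items()}

# A word seen 300 times yields a stable distribution; a word seen twice does not,
# and distributional clustering inherits whatever noise is in these estimates.
print(class_given_word({"sports": 200, "politics": 100}, num_classes=2))
print(class_given_word({"sports": 2, "politics": 0}, num_classes=2))
```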