Statistical Methods for Text Mining

David Madigan, Rutgers University & DIMACS, www.stat.rutgers.edu/~madigan
David D. Lewis, www.daviddlewis.com
Joint work with Alex Genkin, Vladimir Menkov, Aynur Dayanik, Dmitriy Fradkin

Statistical Analysis of Text
• Statistical text analysis has a long history in literary analysis and in solving disputed authorship problems
• First (?) was Thomas C. Mendenhall in 1887

Mendenhall
• Mendenhall was Professor of Physics at Ohio State and at the University of Tokyo, Superintendent of the US Coast and Geodetic Survey, and later President of Worcester Polytechnic Institute
[Photo: Mendenhall Glacier, Juneau, Alaska]
[Figure: word-length distribution comparison; $\chi^2 = 127.2$, df = 12]
• Used Naïve Bayes with Poisson and negative binomial models
• Out-of-sample predictive performance

Today
• Statistical methods are routinely used for textual analyses of all kinds
• Machine translation, part-of-speech tagging, information extraction, question answering, text categorization, etc.
• Not reported in the statistical literature (no statisticians?)

Outline
• Part-of-speech tagging, entity recognition
• Text categorization
• Logistic regression and friends
• The richness of Bayesian regularization
• Sparseness-inducing priors
• Word-specific priors: stop words, IDF, domain knowledge, etc.
• Polytomous logistic regression

Part-of-Speech Tagging
• Assign grammatical tags to words
• A basic task in the analysis of natural language data
• Phrase identification, entity extraction, etc.
• Ambiguity: "tag" could be a noun or a verb
• "a tag is a part-of-speech label" – context resolves the ambiguity

The Penn Treebank POS Tag Set
[Figure: the Penn Treebank part-of-speech tag set]

POS Tagging Process
[Figure: the tagging process; slide credit: Berlin Chen]

POS Tagging Algorithms
• Rule-based taggers: large numbers of hand-crafted rules
• Probabilistic taggers: use a tagged corpus to train some sort of model, e.g. an HMM in which the tags form a Markov chain (tag1 → tag2 → tag3) and each tag emits its word (word1, word2, word3)
• Clever tricks for reducing the number of parameters (aka priors)

Some details
• Charniak et al. (1993) achieved 95% accuracy on the Brown Corpus with
  p(tag i | word j) = (number of times word j appears with tag i) / (number of times word j appears)
  and, for unknown words,
  p(tag i) = (number of times a word that had never been seen before gets tag i) / (number of such occurrences in total)
• Plus a modification that uses word suffixes

Recent Developments
• Toutanova et al. (2003) use a dependency network and a richer feature set
• Log-linear model for $t_i \mid t_{-i}, \mathbf{w}$
• The model includes, for example, features for whether the word contains a number, uppercase characters, a hyphen, etc.
• Regularization of the estimation process is critical
• 96.6% accuracy on the Penn corpus
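To make the HMM tagger concrete, here is a minimal runnable sketch of bigram Viterbi decoding. The tag set and the transition/emission probabilities below are toy values invented for illustration, not estimates from the Brown Corpus.

```python
# Toy bigram HMM tagger: tags form a Markov chain, each tag emits a word,
# and Viterbi decoding recovers the most probable tag sequence.
from math import log

# Hypothetical smoothed parameters: p(tag | previous tag) and p(word | tag).
trans = {("<s>", "DT"): 0.6, ("<s>", "NN"): 0.4,
         ("DT", "NN"): 0.9, ("DT", "VB"): 0.1,
         ("NN", "VB"): 0.7, ("NN", "NN"): 0.3,
         ("VB", "NN"): 0.5, ("VB", "DT"): 0.5}
emit = {("DT", "the"): 0.7, ("DT", "a"): 0.3,
        ("NN", "tag"): 0.4, ("NN", "dog"): 0.6,
        ("VB", "tag"): 0.5, ("VB", "runs"): 0.5}
TAGS = ["DT", "NN", "VB"]

def viterbi(words):
    """Return the most probable tag sequence under the bigram HMM."""
    # best[t] = (log-probability of the best path ending in tag t, that path)
    best = {t: (log(trans.get(("<s>", t), 1e-12)) +
                log(emit.get((t, words[0]), 1e-12)), [t]) for t in TAGS}
    for w in words[1:]:
        best = {t: max(((lp + log(trans.get((prev, t), 1e-12)) +
                         log(emit.get((t, w), 1e-12)), path + [t])
                        for prev, (lp, path) in best.items()),
                       key=lambda x: x[0])
                for t in TAGS}
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(["the", "dog", "runs"]))  # ['DT', 'NN', 'VB']
print(viterbi(["the", "tag"]))          # "tag" resolved to NN after DT
```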
Named-Entity Classification
• "Mrs. Frank" is a person
• "Steptoe and Johnson" is a company
• "Honduras" is a location
• etc.
• Bikel et al. (1998) from BBN: "Nymble", a statistical approach using HMMs in which the hidden states are name classes (nc1 → nc2 → nc3), each emitting its word (word1, word2, word3)
• Word generation model:
  $p(w_i \mid w_{i-1}, nc_i)$ if $nc_i = nc_{i-1}$
  $p(w_i \mid nc_i, nc_{i-1})$ for the first word of a name class, when $nc_i \ne nc_{i-1}$
• "Name classes": Not-A-Name, Person, Location, etc.
• Smoothing for sparse training data + word features
• Training = 100,000 words from the WSJ; accuracy = 93%
• 450,000 training words gave the same accuracy
• Training-development-test split

Text Categorization
• Automatic assignment of documents with respect to a manually defined set of categories
• Applications: automated indexing, spam filtering, content filters, medical coding, CRM, essay grading
• The dominant technology is supervised machine learning: manually classify some documents, then learn a classification rule from them (possibly with manual intervention)

Terminology, etc.
• Binary versus multi-class
• Single-label versus multi-label
• Document representation via "bag of words": $d = (w_1, \ldots, w_N)$, with $N \approx 10^4$ to $10^5$
• The $w_i$'s might be 0/1, counts, or weights (e.g. tf-idf, LSI)
• Phrases, syntactic information, synonyms, NLP, etc.?
• Stopwords, stemming

Test Collections
• Reuters-21578: 9,603 training documents, 3,299 test documents, 90 categories, ~multi-label
• New Reuters – 800,000 documents
• Medline – 11,000,000 documents; MeSH headings
• TREC conferences and collections
• Newsgroups, WebKB

Reuters Evaluation
• Binary classifiers:

               true 0   true 1
    predict 0    a        b
    predict 1    c        d

  recall = d/(b+d)  ("sensitivity")
  precision = d/(c+d)  ("predictive value positive")

• Multiple binary classifiers, e.g. two categories and two test documents:

                 cat 1            cat 2
                 true  predict    true  predict
    test doc 1    1      1          1     1
    test doc 2    0      0          0     1

  cat 1: p = 1.0, r = 1.0; cat 2: p = 0.5, r = 1.0
  macro-precision = (1.0 + 0.5)/2 = 0.75
  micro-averaged precision = 2/3 (2 correct out of 3 positive predictions overall)

• F1 measure – the harmonic mean of precision and recall

Reuters Results

    Model           F1
    AdaBoost.MH     0.86
    SVM             0.84–0.87
    k-NN            0.82–0.86
    Neural Net      0.84
    "Naïve Bayes"   0.72–0.78
    Rocchio         0.62–0.76

Naïve Bayes
[Figure: graphical model with class node X0 and conditionally independent features X1, X2, ..., Xp]
• Naïve Bayes for document classification dates back to the early 1960s
• The NB model assumes features are conditionally independent given the class
• Estimation is simple and scales well
• Empirical performance is usually not bad
• High bias, low variance (Friedman, 1997; Domingos & Pazzani, 1997)
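The micro/macro averaging example above is easy to verify in code. A minimal sketch in pure Python, with the slide's two-document, two-category predictions hard-coded:

```python
# (true, predicted) pairs for each category, per the evaluation example.
cat1 = [(1, 1), (0, 0)]   # test doc 1, test doc 2
cat2 = [(1, 1), (0, 1)]   # test doc 2 is a false positive for cat 2

def precision(pairs):
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    predicted_pos = sum(1 for _, p in pairs if p == 1)
    return tp / predicted_pos

def f1(p, r):
    return 2 * p * r / (p + r)   # harmonic mean of precision and recall

macro = (precision(cat1) + precision(cat2)) / 2   # (1.0 + 0.5)/2 = 0.75
micro = precision(cat1 + cat2)                    # 2 of 3 predictions correct
print(macro, micro, f1(precision(cat2), 1.0))     # 0.75 0.666... 0.666...
```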
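And a minimal sketch of the Naïve Bayes classifier just described, using scikit-learn's MultinomialNB. The four training documents and their category labels are hypothetical, invented purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["grain wheat corn export", "wheat corn harvest",
              "stocks fell sharply today", "market stocks rally"]
labels = ["grain", "grain", "markets", "markets"]

vec = CountVectorizer()                          # bag-of-words count features
X = vec.fit_transform(train_docs)
clf = MultinomialNB(alpha=1.0).fit(X, labels)    # alpha = add-one smoothing

print(clf.predict(vec.transform(["wheat export rally"])))
```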
Poisson NB
[Figure: graphical model with class node X0 and count-valued features Xc1, ..., Xcp, with $X_{cj} \sim \text{Poisson}(\lambda_{cj})$]
• A natural extension of the binary model to word frequencies
• ML-equivalent to the multinomial model with Poisson-distributed document length
• Bayesian equivalence requires constraints on the conjugate priors (Poisson NB has 2p hyper-parameters per class; Multinomial-Poisson has p+2)

Poisson NB – Reuters

    Model                          μ-Precision  μ-Recall
    SVM                            0.89         0.84
    Multinomial                    0.78         0.76
    Poisson NB                     0.67         0.66
    Multinomial + logspline        0.79         0.76
    Multinomial + negative bin.    0.78         0.75
    Negative Binomial NB           0.77         0.76

• A different story for the FAA dataset: overdispersion

AdaBoost.MH
• Multi-class, multi-label
• At each iteration, learns a simple score-producing classifier on weighted training data and then updates the weights
• The final decision averages over the classifiers
• Example for a single document (new weight ∝ old weight × exp(−label × score), renormalized):

    Data (doc 1):                   A +1    B +1    C −1    D −1
    Initial weights:                A 0.25  B 0.25  C 0.25  D 0.25
    Scores from simple classifier:  A 2     B −2    C −1    D 0.1
    Revised weights:                A 0.02  B 0.82  C 0.04  D 0.12

AdaBoost.MH (Schapire and Singer, 2000)
• AdaBoost.MH's weak learner is a stump
[Figure: an example weak learner, a stump built on just two words!]

AdaBoost.MH Comments
• Software implementation: BoosTexter
• Some theoretical support in terms of bounds on generalization error
• 3 days of CPU time for Reuters with 10,000 boosting iterations

Document Representation
• Documents are usually represented as a "bag of words": $\mathbf{x}_i = (x_{i1}, \ldots, x_{ij}, \ldots, x_{id})$
• The $x_{ij}$'s might be 0/1, counts, or weights (e.g. tf-idf, LSI)
• Many text processing choices: stopwords, stemming, phrases, synonyms, NLP, etc.

Classifier Representation
• For instance, a linear classifier:
  IF $\sum_j \beta_j x_{ij} > \tau$, THEN $y_i = +1$, ELSE $y_i = -1$
• The $x_{ij}$'s are derived from the text of the document
• $y_i$ indicates whether to put the document in the category
• The $\beta_j$'s are parameters chosen to give good classification effectiveness

Logistic Regression Model
• Linear model for the log odds of category membership:
  $\ln \frac{P(y_i = +1 \mid \mathbf{x}_i)}{P(y_i = -1 \mid \mathbf{x}_i)} = \sum_j \beta_j x_{ij} = \boldsymbol{\beta} \cdot \mathbf{x}_i$
• Equivalently:
  $P(y_i = +1 \mid \mathbf{x}_i) = \frac{e^{\boldsymbol{\beta} \cdot \mathbf{x}_i}}{1 + e^{\boldsymbol{\beta} \cdot \mathbf{x}_i}}$
• A conditional probability model

Logistic Regression as a Linear Classifier
• If the estimated probability of category membership is greater than p, assign the document to the category:
  IF $\sum_j \beta_j x_{ij} > \ln \frac{p}{1-p}$, THEN $y_i = +1$
• Choose p to optimize the expected value of your effectiveness measure (may need a different form of test)
• Can change the measure without changing the model

Maximum Likelihood Training
• Choose the parameters ($\beta_j$'s) that maximize the probability (likelihood) of the class labels ($y_i$'s) given the documents ($\mathbf{x}_i$'s):
  $\hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}} \Big( -\sum_i \ln\big(1 + \exp(-\boldsymbol{\beta}^T \mathbf{x}_i \, y_i)\big) \Big)$
• Maximizing the (log-)likelihood can be viewed as minimizing a loss function
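A minimal sketch of this maximum-likelihood fit: minimize the logistic loss above on synthetic data with scipy. No prior or regularization yet; that is what the shrinkage slides add next.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                   # 200 "documents", 5 features
beta_true = np.array([2.0, -1.0, 0.0, 0.5, 0.0])
prob = 1.0 / (1.0 + np.exp(-(X @ beta_true)))   # P(y = +1 | x)
y = np.where(rng.random(200) < prob, 1, -1)     # labels in {-1, +1}

def neg_log_lik(beta):
    # sum_i ln(1 + exp(-y_i * beta'x_i)); logaddexp(0, z) = ln(1 + e^z), stable
    return np.sum(np.logaddexp(0.0, -y * (X @ beta)))

beta_hat = minimize(neg_log_lik, np.zeros(5), method="BFGS").x
print(np.round(beta_hat, 2))   # should land near beta_true
```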
Shrinkage Methods (after Hastie, Friedman & Tibshirani)
► Subset selection is a discrete process – individual variables are either in or out. A combinatorial nightmare.
► It can also have high variance – a different dataset from the same source can result in a totally different model
► Shrinkage methods allow a variable to be partly included in the model: the variable is included, but with a shrunken coefficient
► An elegant way to tackle over-fitting

Ridge Regression
$\hat{\boldsymbol{\beta}}^{\text{ridge}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{N} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2$ subject to $\sum_{j=1}^{p} \beta_j^2 \le s$

Equivalently:
$\hat{\boldsymbol{\beta}}^{\text{ridge}} = \arg\min_{\boldsymbol{\beta}} \Big\{ \sum_{i=1}^{N} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \Big\}$

This leads to:
$\hat{\boldsymbol{\beta}}^{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y$

Choose $\lambda$ by cross-validation. This works even when $X^T X$ is singular.

[Figure: posterior modes with varying hyperparameter $\tau$, Gaussian prior; coefficient paths for intercept, age/100, glu, bp, bmi/100, ped, npreg, skin]

Ridge Regression = Bayesian MAP Regression
► Suppose we believe each $\beta_j$ is a small value near 0
► Encode this belief as separate Gaussian probability distributions over the values of $\beta_j$
► Choosing the maximum a posteriori value of $\boldsymbol{\beta}$ gives the same result as ridge regression:
  $y_i \sim N(\beta_0 + \mathbf{x}_i^T \boldsymbol{\beta}, \sigma^2)$, $\beta_j \sim N(0, \tau^2)$: the same as ridge with $\lambda = \sigma^2 / \tau^2$

Least Absolute Shrinkage & Selection Operator (LASSO)
$\hat{\boldsymbol{\beta}}^{\text{lasso}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{N} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2$ subject to $\sum_{j=1}^{p} |\beta_j| \le s$

A quadratic programming algorithm is needed to solve for the parameter estimates. More generally:
$\tilde{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \Big\{ \sum_{i=1}^{N} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q \Big\}$
q = 0: variable selection; q = 1: lasso; q = 2: ridge. Learn q?

[Figure: posterior modes with varying hyperparameter $\lambda$, Laplace prior; same coefficient paths as above]

Ridge & LASSO – Theory
► Lasso estimates are consistent
► But the lasso does not have the "oracle property": it does not deliver the correct model with probability 1
► Fan & Li's SCAD penalty function does have the oracle property

LARS
► New geometrical insights into the lasso and "stagewise" regression
► Leads to a highly efficient lasso algorithm for linear regression

LARS
► Start with all coefficients $b_j = 0$
► Find the predictor $x_j$ most correlated with $y$
► Increase $b_j$ in the direction of the sign of its correlation with $y$, taking residuals $r = y - \hat{y}$ along the way. Stop when some other predictor $x_k$ has as much correlation with $r$ as $x_j$ does
► Increase $(b_j, b_k)$ in their joint least squares direction until some other predictor $x_m$ has as much correlation with the residual $r$
► Continue until all predictors are in the model
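The closed-form ridge solution above is two lines of NumPy. A sketch on synthetic data ($\lambda = 1$ is arbitrary; in practice it would be chosen by cross-validation):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=50)

lam = 1.0
p = X.shape[1]
# beta = (X'X + lambda I)^{-1} X'y; solvable even when X'X is singular (lam > 0)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(np.round(beta_ridge, 2))
```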
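The LARS procedure above is implemented in scikit-learn as lars_path. A sketch on synthetic data (two informative predictors out of eight, invented for illustration) showing the order in which variables enter the model:

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=60)

# method="lasso" gives the lasso variant of the LARS path
alphas, active, coefs = lars_path(X, y, method="lasso")
print(active)          # indices in entry order; predictors 0 and 3 come first
print(coefs[:, -1])    # coefficients at the least-squares end of the path
```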
Zhang & Oles Results

    Model                             F1
    Naïve Bayes                       0.852
    Ridge Logistic Regression + FS    0.914
    SVM                               0.911

• Reuters-21578 collection
• Ridge logistic regression plus feature selection

Bayes!
• MAP logistic regression with a Gaussian prior gives state-of-the-art text classification effectiveness
• But the Bayesian framework is more flexible than SVMs for combining knowledge with data:
  – Feature selection
  – Stopwords, IDF
  – Domain knowledge
  – Number of classes
• (and kernels.)

Data Sets
• ModApte subset of Reuters-21578: 90 categories; 9,603 training docs; 18,978 features
• Reuters RCV1-v2: 103 categories; 23,149 training docs; 47,152 features
• OHSUMED heart disease categories: 77 categories; 83,944 training docs; 122,076 features
• Cosine-normalized TFxIDF weights

Dense vs. Sparse Models (Macro-averaged F1, Preliminary)

    Model      ModApte  RCV1-v2  OHSUMED
    Lasso      52.03    56.54    51.30
    Ridge      39.71    51.40    42.99
    Ridge/500  38.82    46.27    36.93
    Ridge/50   45.80    41.61    42.59
    Ridge/5    46.20    28.54    41.33
    SVM        53.75    57.23    50.58

[Figure: log(number of errors + 1), with jitter, for ridge, lasso, and SVM on ModApte (90 categories), RCV1-v2 (103 categories), and OHSUMED (77 categories)]
[Figure: histograms of the number of features with non-zero posterior mode per category: ModApte (21,989 features), RCV1 (47,152 features), OHSUMED (122,076 features)]

Bayesian Unsupervised Feature Selection and Weighting
• Stopwords: low-content words that are typically discarded
  – Give them a prior with mean 0 and low variance
• Inverse document frequency (IDF) weighting
  – Rare words are more likely to be content indicators
  – Make the variance of the prior inversely proportional to frequency in the collection
• Experiments in progress

Bayesian Use of Domain Knowledge
• We often believe that certain words are positively or negatively associated with a category
• The prior mean can encode the strength of the positive or negative association
• The prior variance encodes confidence

First Experiments
• 27 RCV1-v2 Region categories
• CIA World Factbook entry for each country
  – Give content words a higher mean and/or variance
• Only 10 training examples per category
  – Shows off prior knowledge
  – Limited data is often the case in applications

Results (Preliminary)

    Model                         Macro F1  ROC
    Gaussian w/ standard prior    0.242     87.2
    Gaussian w/ DK prior #1       0.608     91.2
    Gaussian w/ DK prior #2       0.542     90.0

Polytomous Logistic Regression
• Logistic regression trivially generalizes to 1-of-k problems
  – Cleaner than SVMs, error-correcting codes, etc.
• The Laplace prior is particularly cool here:
  – Suppose 99 classes and a word that predicts class 17
  – The word gets used 100 times if you build 100 models, or if you use polytomous with a Gaussian prior
  – With a Laplace prior and polytomous it is used only once
• Experiments in progress, particularly on author identification

1-of-K Sample Results: brittany-l

    Feature Set                          % errors  Number of Features
    "Argamon" function words, raw tf     74.8      380
    POS                                  75.1      44
    1suff                                64.2      121
    1suff*POS                            50.9      554
    2suff                                40.6      1,849
    2suff*POS                            34.9      3,655
    3suff                                28.7      8,676
    3suff*POS                            27.9      12,976
    3suff+POS+3suff*POS+Argamon          27.6      22,057
    All words                            23.9      52,492

• 89 authors with at least 50 postings; 10,076 training documents, 3,322 test documents
• BMR-Laplace classification, default hyperparameter
• 4.6 million parameters

Future
• Choose the exact number of features desired
• Faster training algorithm for polytomous
  – Currently using cyclic coordinate descent
• Hierarchical models
  – Sharing strength among categories
  – Hierarchical relationships among features
• Stemming, thesaurus classes, phrases, etc.

Text Categorization Summary
• Conditional probability models (logistic, probit, etc.)
• As powerful as other discriminative models (SVMs, boosting, etc.)
• The Bayesian framework provides a much richer ability to insert task knowledge
• Code: http://stat.rutgers.edu/~madigan/BBR
• Polytomous and domain-specific priors soon

The Last Slide
• Statistical methods for text mining work well on certain types of problems
• Many problems remain unsolved:
  – Which financial news stories are likely to impact the market?
  – Where did soccer originate?
  – Attribution

Approximate Online Sparse Bayes

Shooting Algorithm (Fu, 1998)
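For the shooting algorithm named on the final slide, here is a minimal sketch of Fu's cyclic coordinate descent with soft thresholding for the lasso objective. The data are synthetic and the implementation is purely illustrative; it is not the authors' BBR code.

```python
import numpy as np

def soft_threshold(z, g):
    """Soft-thresholding operator: sign(z) * max(|z| - g, 0)."""
    return np.sign(z) * max(abs(z) - g, 0.0)

def shooting_lasso(X, y, lam, n_sweeps=100):
    """Minimize 0.5*||y - Xb||^2 + lam*sum|b_j| by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            # partial residual with feature j's contribution removed
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
    return beta

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 12))
y = 2.0 * X[:, 1] - X[:, 5] + 0.1 * rng.normal(size=80)
print(np.round(shooting_lasso(X, y, lam=5.0), 2))  # sparse: most entries are 0
```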