TEXT MINING
STATISTICAL MODELING OF TEXTUAL DATA
LECTURE 2

Mattias Villani
Division of Statistics
Dept. of Computer and Information Science
Linköping University


OVERVIEW

- Text classification
- Regularization
- R's tm package (demo) [TMPackageDemo.R]


SUPERVISED CLASSIFICATION

- Predict the class label s ∈ S using a set of features.
- Feature = explanatory variable = predictor = covariate.
- Binary classification: s ∈ {0, 1}
  - Movie reviews: S = {pos, neg}
  - E-mail spam: S = {Spam, Ham}
  - Bankruptcy: S = {Not bankrupt, Bankrupt}
- Multi-class classification: s ∈ {1, 2, ..., K}
  - Topic categorization of web pages: S = {'News', 'Sports', 'Entertainment'}
  - POS tagging: S = {VB, JJ, NN, ..., DT}


SUPERVISED CLASSIFICATION, CONT.

- Example data:
  - "Larry Wall, born in British Columbia, Canada, is the original creator of the programming language Perl. Born in 1956, Larry went to ..."
  - "Bjarne Stroustrup is a 62-year-old computer scientist ..."

  Person   Income   Age   Single   Payment remarks   Bankrupt
  Larry    10       58    Yes      Yes               Yes
  Bjarne   15       62    No       Yes               No
  ...      ...      ...   ...      ...               ...
  Guido    27       56    No       No                No

- Classification: construct a prediction machine  Features → Class label
- More generally: Features → Pr(Class label | Features)


FEATURES FROM TEXT DOCUMENTS

- Any quantity computed from a document can be used as a feature:
  - Presence/absence of individual words
  - Number of times an individual word is used
  - Presence/absence of pairs of words
  - Presence/absence of individual bigrams
  - Lexical diversity
  - Word counts
  - Number of web links from the document, possibly weighted by PageRank
  - etc.

  Document   has('ball')   has('EU')   has('political_arena')   wordlen   Lex. Div.   Topic
  Article1   Yes           No          No                       4.1       5.4         Sports
  Article2   No            No          No                       6.5       13.4        Sports
  ...        ...           ...         ...                      ...       ...         ...
  ArticleN   Yes           No          No                       7.4       11.1        News

- Constructing clever discriminating features is the name of the game!


SUPERVISED LEARNING FOR CLASSIFICATION

[Figure]


THE BAYESIAN CLASSIFIER

- Bayesian classification: argmax_{s ∈ S} p(s|x), where x = (x_1, ..., x_n) is a feature vector.
- By Bayes' theorem
    p(s|x) = p(x|s) p(s) / p(x) ∝ p(x|s) p(s)
- Bayesian classification: argmax_{s ∈ S} p(x|s) p(s)
- p(s) can easily be estimated from training data by relative frequencies.
- Even with binary features [has(word)] the outcome space of p(x|s) is huge (= data are sparse).


NAIVE BAYES

- Naive Bayes (NB): features are assumed independent
    p(x|s) = ∏_{j=1}^n p(x_j|s)
- Naive Bayes solution
    argmax_{s ∈ S} [ p(s) ∏_{j=1}^n p(x_j|s) ]
- With binary features, p(x_j|s) can easily be estimated by
    p̂(x_j|s) = C(x_j, s) / C(s)
- Example: s = news, x_j = has('ball')
    p̂(has('ball')|news) = Number of news articles containing the word 'ball' / Number of news articles


NAIVE BAYES, CONT.

- Continuous features (e.g. lexical diversity) can be handled by:
  - Replacing the continuous feature with several binary features (e.g. 1 ≤ lexDiv < 2, 2 ≤ lexDiv ≤ 10 and lexDiv > 10)
  - Estimating p(x_j|s) by a density estimator (e.g. a kernel estimator)
- Finding the most discriminatory features: sort the ratios p(x_j|s = pos) / p(x_j|s = neg) from largest to smallest for j = 1, ..., n.
- Problem with NB: features are seldom independent ⇒ double-counting the evidence of individual features. Extreme example: what happens with naive Bayes if you duplicate a feature?
- Advantages of NB: simple and fast, yet often surprisingly accurate classifications.
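Not part of the slides: a minimal R sketch of the relative-frequency estimate p̂(x_j|s) = C(x_j, s) / C(s) above, mirroring the has('ball') | news example. The data frame train and the column names has_ball and topic are invented for illustration.

# Sketch only: relative-frequency estimate of p(x_j | s) for a binary word feature.
# The toy data frame and its column names are made up for illustration.
train <- data.frame(
  has_ball = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE),
  topic    = c("Sports", "Sports", "News", "News", "News", "Sports")
)

p_hat <- function(data, feature, class_label) {
  in_class <- data$topic == class_label
  sum(data[[feature]] & in_class) / sum(in_class)   # C(x_j, s) / C(s)
}

p_hat(train, "has_ball", "News")    # estimate of p(has('ball') | News)
p_hat(train, "has_ball", "Sports")  # estimate of p(has('ball') | Sports)

prop.table(table(train$topic))      # class priors p(s) by relative frequencies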
MULTINOMIAL REGRESSION

- Logistic regression (Maximum Entropy / MaxEnt):
    p(s = 1|x) = exp(x'β) / (1 + exp(x'β))
- Classification rule: choose s = 0 if p(s = 1|x) < 0.5, otherwise choose s = 1.
  ... at least when the consequences of the different choices of s are the same. Loss/utility function.
- Multinomial regression for multi-class data with K classes:
    p(s = s_j|x) = exp(x'β_j) / ∑_{k=1}^K exp(x'β_k)
- Classification:
    argmax_{s ∈ {s_1, ..., s_K}} p(s|x)
- Classification with text data is like any multi-class regression problem ... but with hundreds or thousands of covariates! Wide data.


REGULARIZATION - VARIABLE SELECTION

- Select a subset of the covariates.
- Old school: forward and backward selection.
- New school: Bayesian variable selection. For each β_i introduce a binary indicator I_i such that
    I_i = 1 if the covariate is in the model, that is β_i ≠ 0
    I_i = 0 if the covariate is not in the model, that is β_i = 0
- Use Markov Chain Monte Carlo (MCMC) simulation to approximate Pr(I_i|Data) for each i.
- Example: S = {News, Sports}, modeling Pr(News|x).

                 has('ball')   has('EU')   has('political_arena')   wordlen   Lex. Div.
  Pr(I_i|Data)   0.2           0.90        0.99                     0.01      0.85


REGULARIZATION - SHRINKAGE

- Keep all covariates, but shrink their β-coefficients toward zero.
- Penalized likelihood
    L_Ridge(β) = LogLik(β) − λ β'β
  where λ is the penalty parameter.
- Maximize L_Ridge(β) with respect to β.
- Trade-off of fit (LogLik(β)) against the complexity penalty β'β.
- Ridge regression if the regression is linear.
- The penalty can be motivated as a Bayesian prior β_i ~ iid N(0, λ^(-1)).
- λ can be estimated by cross-validation or Bayesian methods.


LASSO - SHRINKAGE AND VARIABLE SELECTION

- Replace the Ridge penalty
    L_Ridge(β) = LogLik(β) − λ ∑_{j=1}^n β_j²
  by
    L_Lasso(β) = LogLik(β) − λ ∑_{j=1}^n |β_j|
- The β that maximizes L_Lasso(β) is called the Lasso estimator.
- Some parameters are shrunk exactly to zero ⇒ Lasso does both shrinkage AND variable selection.
- The Lasso penalty is equivalent to a double exponential prior
    p(β_i) = (λ/2) exp(−λ |β_i − 0|)


SUPPORT VECTOR MACHINES

- One of the best classifiers around.
- Finds the line in covariate space that maximally separates the two classes.
- When the points are not linearly separable: add a slack variable ξ_i > 0 for each observation. Allow misclassification, but make it costly.
- Non-linear separating curves can be obtained by basis expansion (think of adding x², x³ and so on).
- The kernel trick makes it possible to handle many covariates.
- Drawback: not so easily extended to multi-class problems.
- svm function in the R package e1071 [or nltk.classify.svm].


LINEAR SVMS

[Figure: candidate separating lines H1, H2 and H3 in the (X1, X2) plane.]
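The slide above points to the svm function in the e1071 package; here is a hedged sketch of how it might be called on binary word features. The data frame docs, its feature columns and the rule generating the toy labels are all invented for illustration, not taken from the lecture.

# Sketch only: a linear SVM via svm() in e1071 (mentioned on the slide).
# The data frame 'docs', its features and the label rule are made up.
library(e1071)

set.seed(1)
n <- 100
docs <- data.frame(
  has_ball = rbinom(n, 1, 0.5),           # presence/absence of the word 'ball'
  has_EU   = rbinom(n, 1, 0.5),           # presence/absence of the word 'EU'
  lex_div  = rnorm(n, mean = 8, sd = 3)   # a continuous feature (lexical diversity)
)
docs$topic <- factor(ifelse(docs$has_ball == 1, "Sports", "News"))  # toy labels

# kernel = "linear" fits a separating line; 'cost' makes slack (misclassification) costly
fit <- svm(topic ~ ., data = docs, kernel = "linear", cost = 1)

# Confusion of predicted vs. true labels on the training documents
table(predicted = predict(fit, docs), truth = docs$topic)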
REGRESSION TREES AND RANDOM FOREST

- Binary partitioning tree. At each internal node decide:
  - Which covariate to split on
  - Where to split the covariate (X_j < c; trivial for binary covariates)
- The optimal splitting variables and split points are chosen to minimize the misclassification rate (or other similar measures).
- Random forest (RF) predicts using an average of many small trees.
- Each tree in an RF is grown on a random subset of the variables. Makes it possible to handle many covariates. Parallel.
- Advantage of RF: better predictions than single trees.
- RF is harder to interpret, but provides variable importance scores.
- R packages: tree and rpart (trees), randomForest (RF).


REGRESSION TREES

[Figure]


EVALUATING A CLASSIFIER: ACCURACY AND ERROR

- Confusion matrix:

                      Truth
  Decision      Spam       Not Spam
  Spam          tp         fp
  Not Spam      fn         tn

  tp = true positive, fp = false positive, fn = false negative, tn = true negative

- Accuracy is the proportion of correctly classified items:
    Accuracy = (tp + tn) / (tp + tn + fn + fp)
- Error is the proportion of wrongly classified items: Error = 1 − Accuracy.
- Accuracy is problematic when tn is large. High accuracy can then be obtained by not acting at all!


EVALUATING A CLASSIFIER: THE F-MEASURE

- Confusion matrix:

                      Truth
  Choice        Spam       Good
  Spam          tp         fp
  Good          fn         tn

- Precision = proportion of selected items that the system got right:
    Precision = tp / (tp + fp)
- Recall = proportion of spam that the system classified as spam:
    Recall = tp / (tp + fn)
- The F-measure is a trade-off between Precision and Recall (harmonic mean):
    F = 1 / ( α (1/Precision) + (1 − α) (1/Recall) )
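As a concrete illustration (not part of the slides), a short R sketch that computes Accuracy, Precision, Recall and the F-measure from the formulas above; the truth and decision vectors are made up.

# Sketch only: evaluation measures from confusion-matrix counts on toy labels.
truth    <- c("Spam", "Spam", "Ham", "Ham", "Spam", "Ham", "Ham", "Spam")
decision <- c("Spam", "Ham",  "Ham", "Spam", "Spam", "Ham", "Ham", "Spam")

tp <- sum(decision == "Spam" & truth == "Spam")   # true positives
fp <- sum(decision == "Spam" & truth == "Ham")    # false positives
fn <- sum(decision == "Ham"  & truth == "Spam")   # false negatives
tn <- sum(decision == "Ham"  & truth == "Ham")    # true negatives

accuracy  <- (tp + tn) / (tp + tn + fn + fp)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)

alpha <- 0.5                                      # balanced F: harmonic mean of P and R
f_measure <- 1 / (alpha / precision + (1 - alpha) / recall)

c(accuracy = accuracy, precision = precision, recall = recall, F = f_measure)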