101035 中文信息处理 Chinese NLP
Lecture 14 应用——文本分类(2) Applications: Text Classification (2)

• 常用文本分类器(Popular classifiers)
• 分类模型训练(Model training)
• 分类结果评测(Evaluation)
• 常用分类工具(Tools)

常用文本分类器 Popular Classifiers

• Types of classifier
  • Statistical-based: Naïve Bayes, kNN, SVM, …
  • Connection-based: ANN
  • Rule-based: Decision Tree, Association Rule, …

• Naïve Bayes(朴素/天真贝叶斯)
  • Bayes theorem: $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$
  • For a document $d$ and a class $c_j$, naively assuming that terms are independent of each other:
    $P(C_j \mid d) = \frac{P(d \mid C_j)\,P(C_j)}{P(d)} = \frac{P(t_1, \ldots, t_{|V|} \mid C_j)\,P(C_j)}{P(t_1, \ldots, t_{|V|})} \approx \frac{P(C_j)\prod_{i=1}^{|V|} P(t_i \mid C_j)}{P(d)}$
    where $V$ is the vocabulary (in the case of word features).
  • Probability estimation:
    $P(C_j) = \frac{N_{c_j}}{|D|}$ is the prior probability of class $j$, where $N_{c_j}$ is the number of documents with class $j$;
    $P(t_i \mid c_j) = \frac{1 + N_{ij}}{|V| + \sum_{k=1}^{|V|} N_{kj}}$, where $N_{ij}$ is the number of documents with $t_i$ that belong to class $j$. Smoothing is used to avoid zeros and overfitting.
  • Assignment of the class:
    $\text{class} = \arg\max_{C_j \in C} P(C_j \mid d) = \arg\max_{C_j \in C} P(C_j) \prod_{i=1}^{|V|} P(t_i \mid C_j)$
  • Multiplying lots of probabilities can result in floating-point underflow; a common tip is to use logs:
    $\text{class} = \arg\max_{C_j \in C} \left[ \log P(C_j) + \sum_{i=1}^{|V|} \log P(t_i \mid C_j) \right]$
  • NB is a very simple classifier that is easy to implement (see the Naïve Bayes sketch at the end of this part).
  • Although the independence assumption does not hold for text, NB works fairly well on text data.

• kNN(k Nearest Neighbors, k近邻)
  • kNN belongs to a family of classifiers called lazy learners, because they do not build explicit declarative representations of the categories.
  • "Training" for kNN consists of simply storing the training documents together with their class labels.
  • To decide whether a document $d$ belongs to class $c$, kNN checks whether the $k$ training documents most similar to $d$ belong to $c$.
  • [Illustration: decisions for k = 1, k = 3, and k = 9]
  • Algorithm (see the kNN sketch at the end of this part):
    • For a new document, find the $k$ most similar documents in the training set. A popular text similarity measure is cosine similarity:
      $s(d_i, d_j) = \frac{\sum_{k=1}^{m} w_{ki}\,w_{kj}}{\sqrt{\sum_{k=1}^{m} w_{ki}^2}\,\sqrt{\sum_{k=1}^{m} w_{kj}^2}}$
      where $w_{ki}$ is the weight of word $k$ in document $i$.
    • Assign a class to $d$ by considering the classes of its $k$ nearest neighbors. For binary classification with a majority-voting scheme, $k$ is usually an odd number, to break ties.
  • kNN is one of the best-performing text classifiers.
  • It is robust in the sense of not requiring the classes to be linearly separable.
  • Its major drawback is the high computational cost at classification time (lazy learning).
  • Another limitation is the choice of $k$, which is hard to decide.

• Decision Tree(DT, 决策树)
  • DT is a decision support tool that uses a graph model of decisions and their possible consequences.
  • Features are chosen one by one at each stage; the tree is grown upside down (root at the top).
  • [Illustration: a small tree that splits first on feature f1, then on f2]
  • At each stage, a feature is chosen that reduces the entropy.
  • For two classes P and N, if there are $p$ instances of P and $n$ instances of N, define
    $I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$
  • Using feature A to partition the total set S into sets $\{S_1, S_2, \ldots, S_v\}$, where $S_i$ contains $p_i$ instances of P and $n_i$ instances of N, the expected entropy is
    $E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)$
  • The information gain at feature A is defined as
    $\mathrm{Gain}(A) = I(p, n) - E(A)$
  • At each stage of tree growth, we choose the most discriminative feature, i.e. the one that maximizes the information gain (as in ID3 and C4.5). Code sketches for this part follow, and then a worked example.
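The Naïve Bayes sketch referenced above: a minimal Python illustration (not from the original slides; the toy data and function names are my own) of training with the add-one smoothed estimates and classifying with log probabilities, following the formulas in the Naïve Bayes part.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """docs: list of token lists; labels: list of class labels."""
    vocab = {t for d in docs for t in d}
    n_term = defaultdict(Counter)        # N_ij: docs of class j containing term i
    for d, c in zip(docs, labels):
        for t in set(d):
            n_term[c][t] += 1
    priors = {c: n / len(docs) for c, n in Counter(labels).items()}  # N_cj / |D|
    return vocab, priors, n_term

def classify_nb(doc, vocab, priors, n_term):
    best, best_score = None, float("-inf")
    for c in priors:
        # log P(Cj) + sum_i log P(ti | Cj), with add-one smoothing;
        # only terms that actually occur in the document contribute
        score = math.log(priors[c])
        denom = len(vocab) + sum(n_term[c].values())
        for t in set(doc) & vocab:
            score += math.log((1 + n_term[c][t]) / denom)
        if score > best_score:
            best, best_score = c, score
    return best

# Toy usage: two classes, word tokens as features
model = train_nb([["足球", "比赛"], ["股票", "市场"], ["足球", "联赛"]],
                 ["体育", "财经", "体育"])
print(classify_nb(["足球", "比赛"], *model))   # -> 体育
```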
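The kNN sketch referenced above, again an illustrative fragment rather than anything from the slides: documents are assumed to be {term: weight} dictionaries (e.g. TF-IDF weights $w_{ki}$), compared with cosine similarity and classified by majority vote.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(doc, train_docs, train_labels, k=3):
    """Majority vote over the k most similar training documents
    (k odd for binary problems, to break ties)."""
    neighbours = sorted(
        ((cosine(doc, d), c) for d, c in zip(train_docs, train_labels)),
        reverse=True,
    )
    votes = Counter(c for _, c in neighbours[:k])
    return votes.most_common(1)[0][0]
```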
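And a small sketch of the entropy and information-gain definitions just given; the counts in the last two lines reproduce the worked example that follows (the slide's 0.246 comes from rounding 0.940 minus 0.694; the unrounded value is about 0.247).

```python
import math

def I(p, n):
    """Entropy of a node with p instances of class P and n of class N."""
    total = p + n
    return -sum((x / total) * math.log2(x / total) for x in (p, n) if x)

def gain(p, n, partitions):
    """Information gain; partitions lists (p_i, n_i) after splitting on a feature."""
    E = sum((pi + ni) / (p + n) * I(pi, ni) for pi, ni in partitions)
    return I(p, n) - E

print(round(I(9, 5), 3))                       # 0.940
print(gain(9, 5, [(2, 3), (4, 0), (3, 2)]))    # ~0.247 (0.246 after rounding)
```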
• Decision Tree (worked example)
  • A set of 14 training instances contains 9 instances of class P and 5 of class N; the feature "age" splits it into three subsets with (2, 3), (4, 0), and (3, 2) instances of P and N:
    $I(9, 5) = 0.940$
    $E(\mathrm{age}) = \frac{5}{14} I(2, 3) + \frac{4}{14} I(4, 0) + \frac{5}{14} I(3, 2) = 0.694$
    $\mathrm{Gain}(\mathrm{age}) = I(9, 5) - E(\mathrm{age}) = 0.246$

In-Class Exercise
• Please calculate the information gain on "student", Gain(student), for the data on the previous page.

• Decision Tree
  • It is one of the best-known classifiers and has wide applications.
  • An advantage of DT is that the trained model can be expressed as IF-THEN rules.
  • To avoid overfitting, a tree should stop growing at a certain point, or pruning must be applied after training stops.

• Support Vector Machine(SVM, 支持向量机)
  • A binary SVM classifier can be seen as a hyperplane in the feature space separating the points that represent the positive instances from the negative ones.
  • SVM selects the hyperplane that maximizes the margin around it.
  • The hyperplane is fully determined by a small subset of the training instances, called the support vectors.
  • Training minimizes the empirical risk (minimize error) as well as the structural risk (maximize the margin).
  • [Illustration: the margin around the separating hyperplane, with the support vectors lying on its boundary]
  • Margin and support vectors:
    $\min \frac{1}{2}\mathbf{w}^{T}\mathbf{w} \quad \text{s.t.} \quad y_i\left(\mathbf{w}^{T}\phi(\mathbf{x}_i) + b\right) \ge 1$
  • SVM is very effective for text classification and is still considered a state-of-the-art classifier.
  • It is able to handle a large feature space.
  • In implementation, make sure that the feature values are on roughly the same scale (e.g. from -1 to 1), because SVM is sensitive to scale.

分类模型训练 Model Training

• Training vs validation vs test: the collection of M documents, each represented by N = |V| features, is split into three parts:
  • Training set (m1 documents): used for training (learning) the classifier.
  • Validation set (m2 documents): used for tuning the parameters of the model.
  • Test set (m3 documents): used for evaluating the classifier.

• Parameter Optimization(参数优化)
  • Many classifiers (DT, SVM, etc.) have critical parameters that must be tuned for best performance.
  • For parameters P1, P2, …, Pk, either run a grid search over candidate values, or initialize from experience and then make adjustments.

分类结果评测 Evaluation

• Cross Validation (see the fold-splitting sketch after the wrap-up)
  • In 5-fold CV, the training data is split into 5 folds; each fold serves once as the test set while the remaining folds are used for training, and the mean of the 5 scores is reported.

• Evaluation Metrics
  • For 2 classes, the counts form a contingency table:

                     Label YES   Label NO
    Classifier YES       a           b
    Classifier NO        c           d

    $\mathrm{accuracy} = \frac{a + d}{a + b + c + d}$
    $\mathrm{recall}\,(R) = \frac{a}{a + c}$
    $\mathrm{precision}\,(P) = \frac{a}{a + b}$
    $F = \frac{2PR}{P + R}$
  • Recall(查全率)is the percentage of correctly classified documents among all documents with that label.
  • Precision(查准率)is the percentage of correctly classified documents among all documents assigned to the class by the classifier.
  • The F-measure (F1) is the harmonic mean of recall and precision.

In-Class Exercise
• It is important to report both recall and precision when evaluating a classifier. Explain why (what happens if only a high recall is reported, or only a high precision?).

• Evaluation Metrics (see the macro/micro sketch after the wrap-up)
  • For more than 2 classes:
  • Macroaveraging(宏平均)computes recall, precision, and F1 for each class and then takes the average; it gives equal weight to every class:
    $r = \frac{\sum_{c \in C} r_c}{|C|} \qquad p = \frac{\sum_{c \in C} p_c}{|C|}$
  • Microaveraging(微平均)first sums $a$, $b$, $c$, and $d$ over all classes, then computes recall, precision, and F1 from the totals; it gives equal weight to every document:
    $r = \frac{\sum_{c \in C} a_c}{\sum_{c \in C} (a_c + c_c)} \qquad p = \frac{\sum_{c \in C} a_c}{\sum_{c \in C} (a_c + b_c)}$

常用分类工具 Classification Tools

Free
• Weka (Java) http://www.cs.waikato.ac.nz/~ml/weka/
• Orange (Python) http://orange.biolab.si/
• R http://www.r-project.org/

Wrap-Up

• 常用文本分类器(Popular classifiers)
  • Naïve Bayes
  • kNN
  • Decision Tree
  • SVM
• 分类模型训练(Model training)
• 分类结果评测(Evaluation)
  • Cross Validation
  • Evaluation Metrics
• 常用分类工具(Tools)
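Appendix sketches (not part of the original slides). First, the fold-splitting sketch referenced in the Cross Validation part: a minimal interleaved k-fold split, where `train_and_score` is a hypothetical stand-in for training any of the classifiers above and returning a test score.

```python
def kfold_indices(n, k=5):
    """Yield (train, test) index lists for k-fold cross validation."""
    folds = [list(range(i, n, k)) for i in range(k)]   # interleaved folds
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]

def cross_validate(train_and_score, data, labels, k=5):
    scores = [
        train_and_score([data[i] for i in tr], [labels[i] for i in tr],
                        [data[i] for i in te], [labels[i] for i in te])
        for tr, te in kfold_indices(len(data), k)
    ]
    return sum(scores) / len(scores)   # mean score over the k folds
```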
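Second, the macro/micro sketch referenced in the Evaluation Metrics part: per-class precision, recall, and F1 from the $a$, $b$, $c$ counts of the contingency table, then both averaging schemes. The counts in the usage line are made up for illustration.

```python
def prf(a, b, c):
    """Precision, recall, F1 from a (true positives), b, c as in the table."""
    p = a / (a + b) if a + b else 0.0
    r = a / (a + c) if a + c else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def macro_micro(per_class):
    """per_class: one (a, b, c) tuple per class."""
    scores = [prf(*t) for t in per_class]
    macro = tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))
    micro = prf(*(sum(t[i] for t in per_class) for i in range(3)))
    return macro, micro

macro, micro = macro_micro([(50, 10, 5), (5, 2, 20)])
print("macro P/R/F1:", macro)   # equal weight to each class
print("micro P/R/F1:", micro)   # equal weight to each document
```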