Text Classification

101035 Chinese Information Processing (Chinese NLP)
Lecture 14
Application: Text Classification (2)

• Popular text classifiers
• Model training
• Evaluation
• Classification tools
Popular Classifiers
• Types of classifier
  • Statistical-based
    • Naïve Bayes
    • kNN
    • SVM
    • …
  • Connection-based
    • ANN
  • Rule-based
    • Decision Tree
    • Association Rule
    • …
• Naïve Bayes
  • Bayes theorem:

    P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}

  • For a document d and a class C_j:

    P(C_j \mid d) = \frac{P(d \mid C_j)\, P(C_j)}{P(d)}
                  = \frac{P(t_1, \ldots, t_{|V|} \mid C_j)\, P(C_j)}{P(t_1, \ldots, t_{|V|})}
                  \approx \frac{\prod_{i=1}^{|V|} P(t_i \mid C_j)\, P(C_j)}{P(d)}

    where V is the vocabulary (in the case of word features), and the last step naively assumes that terms are independent of each other.
• Naïve Bayes
  • For a document d and a class C_j:

    P(C_j \mid d) \propto P(C_j) \prod_{i=1}^{|V|} P(t_i \mid C_j)

  • Probability estimation:

    P(C_j) = \frac{N_{c_j}}{|D|}

    is the prior probability of class j, where N_{c_j} is the number of documents with class j.

    P(t_i \mid C_j) = \frac{1 + N_{ij}}{|V| + \sum_{k=1}^{|V|} N_{kj}}

    where N_{ij} is the number of documents containing t_i that belong to class j; the smoothing is used to avoid zeros and overfitting.

  • Assignment of the class:

    \mathrm{class} = \arg\max_{C_j \in C} P(C_j \mid d) = \arg\max_{C_j \in C} P(C_j) \prod_{i=1}^{|V|} P(t_i \mid C_j)
• Naïve Bayes
  • Multiplying many probabilities can cause floating-point underflow. A common trick is to work with logarithms:

    \mathrm{class} = \arg\max_{C_j \in C} \left[ \log P(C_j) + \sum_{i=1}^{|V|} \log P(t_i \mid C_j) \right]

  • NB is a very simple classifier that is easy to implement.
  • Although the independence assumption does not hold for text, NB works fairly well on text data.
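The following is a minimal from-scratch sketch of this training and log-space scoring procedure in Python; the token lists, labels, and function names are illustrative assumptions, not part of the lecture. It follows the slide's estimate, where N_ij counts the class-j documents containing term t_i.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Estimate log P(C_j) and the smoothed log P(t_i | C_j) from tokenized documents."""
    vocab = {t for d in docs for t in d}
    n_docs_per_class = Counter(labels)            # N_cj: documents per class
    n_docs_with_term = defaultdict(Counter)       # N_ij: class-j documents containing t_i
    for d, c in zip(docs, labels):
        n_docs_with_term[c].update(set(d))
    log_prior = {c: math.log(n / len(docs)) for c, n in n_docs_per_class.items()}
    log_cond = {}
    for c in n_docs_per_class:
        denom = len(vocab) + sum(n_docs_with_term[c].values())   # |V| + sum_k N_kj
        log_cond[c] = {t: math.log((1 + n_docs_with_term[c][t]) / denom) for t in vocab}
    return log_prior, log_cond

def classify_nb(doc, log_prior, log_cond):
    """Return argmax_j [ log P(C_j) + sum_i log P(t_i | C_j) ], ignoring unseen terms."""
    scores = {c: log_prior[c] + sum(log_cond[c][t] for t in doc if t in log_cond[c])
              for c in log_prior}
    return max(scores, key=scores.get)

# Toy usage with made-up documents and labels.
docs = [["cheap", "offer", "now"], ["meeting", "agenda", "notes"], ["cheap", "loan", "offer"]]
labels = ["spam", "ham", "spam"]
prior, cond = train_nb(docs, labels)
print(classify_nb(["cheap", "offer", "meeting"], prior, cond))
```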
• kNN (k Nearest Neighbors)
  • kNN belongs to a family of classifiers called lazy learners, because they do not build explicit declarative representations of the categories.
  • "Training" for kNN consists simply of storing the training documents together with their class labels.
  • To decide whether a document d belongs to class c, kNN checks whether the k training documents most similar to d belong to c.
• kNN
  • Illustration: nearest-neighbor decisions for k = 1, k = 3, and k = 9 (figure).
• kNN
  • Algorithm
    • For a new document d, find the k most similar documents in the training set. A popular text similarity measure is cosine similarity:

      s(d_i, d_j) = \frac{\sum_{k=1}^{m} w_{ki}\, w_{kj}}{\sqrt{\sum_{k=1}^{m} w_{ki}^2}\, \sqrt{\sum_{k=1}^{m} w_{kj}^2}}

      where w_{ki} is the weight of word k in document i.
    • Assign a class to d by considering the classes of its k nearest neighbors.
• kNN
  • Class assignment: for binary classification with a majority-voting scheme, k is usually chosen to be an odd number so that ties cannot occur (a sketch of the whole procedure follows below).
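A small sketch of kNN classification with cosine similarity and majority voting, assuming documents are already represented as {term: weight} dictionaries; the vectors, labels, and function names are illustrative.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two documents given as {term: weight} dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values())) *
            math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def knn_classify(query, train_vecs, train_labels, k=3):
    """Majority vote over the k training documents most similar to the query."""
    nearest = sorted(range(len(train_vecs)),
                     key=lambda i: cosine(query, train_vecs[i]), reverse=True)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy usage with made-up term weights and labels.
train = [{"goal": 2.0, "match": 1.0}, {"election": 1.5, "vote": 2.0}, {"team": 1.0, "goal": 1.0}]
labels = ["sports", "politics", "sports"]
print(knn_classify({"goal": 1.0, "team": 0.5}, train, labels, k=3))
```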
• kNN
  • It is one of the best-performing text classifiers.
  • It is robust in the sense of not requiring the classes to be linearly separable.
  • The major drawback is the high computational cost at classification time (lazy learning).
  • Another limitation is the choice of k, which is hard to decide.
• Decision Tree (DT)
  • DT is a decision support tool that uses a graph model of decisions and their possible consequences.
  • Features are chosen one at a time at each stage. (Illustration: a tree that splits first on feature f2 and then on f1; the tree is grown upside down.)
• Decision Tree
  • At each stage, a feature is chosen that reduces the entropy.
  • For two classes P and N, if there are p instances of P and n instances of N, define

    I(p, n) = -\frac{p}{p+n} \log_2 \frac{p}{p+n} - \frac{n}{p+n} \log_2 \frac{n}{p+n}

  • Using feature A to partition the total set S into sets {S_1, S_2, …, S_v}, if S_i contains p_i instances of P and n_i instances of N, the expected entropy is

    E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)

  • The information gain of feature A is defined as

    \mathrm{Gain}(A) = I(p, n) - E(A)
• Decision Tree
  • At each stage of tree growth, we choose the most discriminative feature, i.e. the one that maximizes the information gain (as in ID3 and C4.5).
  • Example (a table of 14 training instances, 9 positive and 5 negative for the target class, with the feature "age" splitting them into partitions of sizes 5, 4, and 5):

    I(9, 5) = 0.940
    E(age) = 5/14 · I(2, 3) + 4/14 · I(4, 0) + 5/14 · I(3, 2) = 0.694
    Gain(age) = I(9, 5) − E(age) = 0.246
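A short sketch reproducing the entropy and information-gain arithmetic above; the class totals (9, 5) and the partition counts (2, 3), (4, 0), (3, 2) are taken from the slide's example, everything else is illustrative.

```python
import math

def entropy(p, n):
    """I(p, n) = -p/(p+n) log2 p/(p+n) - n/(p+n) log2 n/(p+n)."""
    total = p + n
    result = 0.0
    for x in (p, n):
        if x:
            result -= (x / total) * math.log2(x / total)
    return result

def expected_entropy(partitions):
    """E(A) = sum_i (p_i + n_i)/(p + n) * I(p_i, n_i) over the partitions induced by A."""
    total = sum(p + n for p, n in partitions)
    return sum((p + n) / total * entropy(p, n) for p, n in partitions)

# Counts from the slide: 9 positive and 5 negative instances overall;
# splitting on "age" gives partitions (2, 3), (4, 0), (3, 2).
i_total = entropy(9, 5)                               # ≈ 0.940
e_age = expected_entropy([(2, 3), (4, 0), (3, 2)])    # ≈ 0.694
print(round(i_total, 3), round(e_age, 3), round(i_total - e_age, 3))   # gain ≈ 0.246
```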
In-Class Exercise
• Please calculate the information gain for "student", Gain(student), on the data of the previous page.
• Decision Tree
  • It is one of the best-known classifiers and has wide applications.
  • An advantage of DT is that the trained model can be expressed as IF-THEN rules.
  • To avoid overfitting, the tree should stop growing at a certain point, or pruning must be applied after training.
• Support Vector Machine (SVM)
  • A binary SVM classifier can be seen as a hyperplane in the feature space separating the points that represent positive instances from those that represent negative ones.
  • SVMs select the hyperplane that maximizes the margin around it.
  • The hyperplane is fully determined by a small subset of the training instances, called the support vectors. (Illustration: support vectors and the maximized margin.)
• Support Vector Machine
  • Minimizing the empirical risk means minimizing the training error (figure).
  • Minimizing the structural risk means maximizing the margin (figure).
• Support Vector Machine
  • Margin and support vectors (illustration): the hyperplane is found by solving

    \min_{\mathbf{w}, b}\; \frac{1}{2} \mathbf{w}^{\mathsf{T}} \mathbf{w}
    \quad \text{subject to} \quad y_i \left( \mathbf{w}^{\mathsf{T}} \phi(\mathbf{x}_i) + b \right) \ge 1
• Support Vector Machine
  • It is very effective for text classification and is still considered a state-of-the-art classifier.
  • It is able to handle a large feature space.
  • In implementation, make sure that the feature values are on roughly the same scale (e.g. in [-1, 1]), because SVMs are sensitive to feature scale.
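A hedged sketch of this scaling advice, assuming scikit-learn is available (an assumption; it is not among the tools listed later in these slides): features are rescaled to a common range before fitting a linear SVM. The toy feature matrix and labels are made up.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

# Raw feature values on very different scales (illustrative data).
X = np.array([[10.0, 0.1], [200.0, 0.5], [15.0, 0.2], [180.0, 0.9]])
y = np.array([0, 1, 0, 1])

# Rescale every feature to [-1, 1], then train a linear SVM on the scaled values.
model = make_pipeline(MinMaxScaler(feature_range=(-1, 1)), LinearSVC())
model.fit(X, y)
print(model.predict([[20.0, 0.15]]))
```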
Model Training
• Training vs. Validation vs. Test
  • Given M documents, each represented by N = |V| features, the data are split into three parts:
    • Training set (m1 documents): used for training (learning) the classifier.
    • Validation set (m2 documents): used for tuning the parameters of the model.
    • Test set (m3 documents): used for evaluating the classifier.
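A minimal sketch of such a three-way split; the proportions, seed, and stand-in data are assumptions for illustration.

```python
import random

def three_way_split(items, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle the documents, then cut them into training, validation, and test sets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    m1 = int(train_frac * len(items))
    m2 = int((train_frac + val_frac) * len(items))
    return items[:m1], items[m1:m2], items[m2:]

docs = list(range(100))                 # stand-ins for (document, label) pairs
train, val, test = three_way_split(docs)
print(len(train), len(val), len(test))  # e.g. 70 15 15
```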
• Parameter Optimization
  • Many classifiers (DT, SVM, etc.) have critical parameters that need to be tuned for best performance.
  • Candidate values for each parameter P_1, …, P_k can be laid out as a grid and searched exhaustively (grid search), or initialized from experience and then adjusted.
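A small sketch of grid search over classifier parameters, again assuming scikit-learn; the parameter grid, kernel choice, and toy data are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy data standing in for document feature vectors and labels.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Candidate values for two parameters; every combination is tried with cross-validation.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```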
Evaluation
• Cross Validation
  • The training data are split into folds; in each round one fold is held out for testing and the remaining folds are used for training, and the reported score is the mean over all rounds. (Illustration: 5-fold CV.)
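A sketch of 5-fold cross-validation (train on four folds, test on the held-out fold, average the scores), assuming scikit-learn; the toy data and the choice of classifier are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Toy data standing in for document feature vectors and labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Five train/test splits; the reported result is the mean over the folds.
scores = cross_val_score(GaussianNB(), X, y, cv=5)
print(scores, scores.mean())
```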
• Evaluation Metrics
  • For 2 classes, with confusion counts

                      Label YES   Label NO
      Classifier YES      a           b
      Classifier NO       c           d

    \mathrm{accuracy} = \frac{a + d}{a + b + c + d}, \quad
    \mathrm{precision}\ (P) = \frac{a}{a + b}, \quad
    \mathrm{recall}\ (R) = \frac{a}{a + c}, \quad
    F_1 = \frac{2PR}{P + R}

  • Recall is the percentage of correctly classified documents among all documents with that label.
  • Precision is the percentage of correctly classified documents among all documents assigned to the class by the classifier.
  • The F-measure (F1) is the harmonic mean of recall and precision.
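A direct transcription of these two-class metrics into Python from the confusion counts a, b, c, d; the example counts are made up.

```python
def metrics(a, b, c, d):
    """a: classifier YES & label YES, b: YES & NO, c: NO & YES, d: NO & NO."""
    accuracy = (a + d) / (a + b + c + d)
    precision = a / (a + b)
    recall = a / (a + c)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts.
print(metrics(a=40, b=10, c=20, d=30))
```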
In-Class Exercise
• It is important to report both recall and precision when evaluating a classifier. Explain why (what happens if only a high recall is reported, or only a high precision?).
• Evaluation Metrics
  • For more than 2 classes:
    • Macroaveraging computes recall, precision, and F1 for each class and then takes their average; it gives equal weight to every class:

      \bar{r} = \frac{\sum_{c \in C} r_c}{|C|}, \qquad \bar{p} = \frac{\sum_{c \in C} p_c}{|C|}

    • Microaveraging totals the counts a, b, c, and d over all classes and then computes recall, precision, and F1; it gives equal weight to every document:

      r = \frac{\sum_{c \in C} a_c}{\sum_{c \in C} (a_c + c_c)}, \qquad p = \frac{\sum_{c \in C} a_c}{\sum_{c \in C} (a_c + b_c)}
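A sketch contrasting macro- and micro-averaged recall and precision computed from per-class counts (a, b, c) in the slide's notation; the per-class numbers are illustrative.

```python
def macro_micro(per_class):
    """per_class: list of (a, b, c) tuples, one per class."""
    # Macroaveraging: compute recall/precision per class, then average (equal class weight).
    macro_r = sum(a / (a + c) for a, b, c in per_class) / len(per_class)
    macro_p = sum(a / (a + b) for a, b, c in per_class) / len(per_class)
    # Microaveraging: total the counts first, then compute (equal document weight).
    A = sum(a for a, b, c in per_class)
    B = sum(b for a, b, c in per_class)
    C = sum(c for a, b, c in per_class)
    return (macro_r, macro_p), (A / (A + C), A / (A + B))

# Illustrative per-class confusion counts for three classes.
print(macro_micro([(50, 5, 10), (8, 4, 2), (3, 1, 6)]))
```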
Classification Tools
Free
• Weka (Java)
http://www.cs.waikato.ac.nz/~ml/weka/
• Orange (Python)
http://orange.biolab.si/
• R
  http://www.r-project.org/
Wrap-Up
• Popular text classifiers
  • Naïve Bayes
  • kNN
  • Decision Tree
  • SVM
• Model training
• Evaluation
  • Cross Validation
  • Evaluation Metrics
• Classification tools