Classification (分類)
Naoaki Okazaki
okazaki at ecei.tohoku.ac.jp
http://www.chokkan.org/
http://twitter.com/#!/chokkanorg
#nlptohoku
http://www.cl.ecei.tohoku.ac.jp/index.php?InformationCommunicationTheory
2011-10-11
Information Communication Theory (情報伝達学)
Spam mail
How many spams do you get (per day)?
• My office address: okazaki at ecei.tohoku.ac.jp
• 10 spams out of 80 emails
• 5 filtered automatically out of 10 spams
• My private address (leaked by a company)
• 40 spams out of 40 emails
• 10 filtered out of 40 spams
Spam (Monty Python)
Symantec’s survey
• May 2011 MessageLabs Intelligence Report
• 75.8% of email in the world was spam (72.3% in Japan)
• 1 in 1.32 emails was spam
• http://www.symanteccloud.com/mlireport/MLI_2011_05_May_FINAL-en.pdf
Rule-based spam filtering
• Design heuristic rules to detect spams
def is_spam(text):
    if text.find('my photo attached') != -1:
        return True
    if text.find('very cheap drugs') != -1:
        return True
    ...
    return False

# Difficult to scale to a wide variety of spams
• Pros
  • Good initial cost-performance
  • Understandable internals
  • Configurable internals
• Cons
  • High maintenance cost
  • Dependence on the domain
  • Requires artisanal skill
Supervised spam filtering (with retraining)
[Diagram: incoming emails are classified by the model; predicted spam goes to the spam folder (discarded), non-spam goes to the user. Newly labeled spam/non-spam instances are appended to the training data, which is used to (re)train the model.]
Supervised approach
• Learn from examples (supervision data)
• Spam (positive) and non-spam (negative) emails
• Acquire rules from the supervision data
• Rules (features) are usually induced from the supervision data
• e.g., n-grams of mail contents, mail headers
• (Deliberately) over-generate features, regardless of effectiveness
• Weight rules in terms of their contribution to the task
• Hand-crafted rules for the domain are unnecessary
(except for feature engineering)
• This approach works surprisingly well if we have a large amount of supervision data
Classification and NLP
• Many NLP problems can be formalized as classification!
In March 2005, the New York Times acquired About, Inc.
[Figure: the example sentence annotated with part-of-speech tags (IN, NNP, CD, DT, VBD, ...), NP chunks, named entities (ORG), and semantic roles (who, whom, when) of the predicate "acquire"; tagging, chunking, NER, and role labeling can all be cast as classification.]
Today’s topic
• Linear binary classifier
• Feature extraction
• Tokenization, stop words, stemming, feature extraction
• Training – perceptron
• Training – logistic regression (using SGD)
• Other formalizations
• Evaluation
• Implementations and experiments
Take home messages
• A text is represented by features (as a vector)
• A linear classifier simply computes a linear combination of feature values and feature weights as its score
• Training a linear classifier is very straightforward
  • In principle, when the classifier fails on an instance, we update the feature weights so that the classifier classifies that instance correctly next time!
• Two kinds of classification failures trade off against each other:
• False positives decrease precision
• False negatives decrease recall
A Brief Introduction to the Linear (Binary) Classifier
Linear (binary) classifier (線形二値分類器)
• Input: feature vector 𝒙 ∈ ℛ^𝑚 (𝑚: number of dimensions)
• Output: prediction 𝑦 ∈ {0, 1}
• Using: weight vector 𝒘 ∈ ℛ^𝑚

  𝑦 = 1 if 𝑎(𝒙) = 𝒘 ⋅ 𝒙 > 0, and 𝑦 = 0 otherwise
Feature extraction
• More concrete example:
• Feature space (𝑚 = 6): (darling, honey, my, love, photo, attached)
• Input: “Hi darling, my photo in attached file” ⟶ 𝒙 = (1 0 1 0 1 1)
  𝑎(𝒙) = 𝒘 ⋅ 𝒙 = Σ_(𝑖=1)^𝑚 𝑤𝑖 𝑥𝑖 = 𝑤1 𝑥1 + 𝑤2 𝑥2 + 𝑤3 𝑥3 + 𝑤4 𝑥4 + 𝑤5 𝑥5 + 𝑤6 𝑥6

• Output: the sign of 𝑎(𝒙)
  • Each term 𝑤𝑖 𝑥𝑖 is the contribution of feature 𝑖 (a high positive value makes 𝑦 = 1 more likely)
• Training: find 𝒘 such that it fits the training data well
Discrimination plane (分離平面)
• Draw the region of 𝒙 such that 𝑦 = 1
  𝒘 ⋅ 𝒙 > 0 ⟺ |angle between 𝒘 and 𝒙| < 𝜋/2
[Figure: the weight vector 𝒘 is normal to the discrimination plane; instances 𝒙 on the same side as 𝒘 are classified as 𝑦 = 1, instances on the other side as 𝑦 = 0.]
Two important questions
• Feature extraction (feature engineering)
• Represent an input text with a feature vector
• Using a number of NLP techniques: tokenization, stop-word,
stemming, part-of-speech tagging, parsing, dictionary lookup, etc.
• Important: a linear model can see a text only through features!
• Training
• Find a weight vector that fits the training data well
• The classifier is expected to re-predict all training instances correctly
• Fitness (formalizations; loss functions)
• Logistic Regression, Support Vector Machine (SVM), …
• Finding (training algorithms)
• Perceptron, Gradient Descent, Stochastic Gradient Descent, …
Feature Extraction
Representing a text with an input vector 𝒙
Preparing a space of feature vectors
Tokenization
• Split a sentence into a sequence of tokens (words)
• In English, tokens are roughly delimited by whitespace and punctuation characters
• Sophisticated methods (e.g., word segmentation, morphological
analysis) are necessary for Japanese and Chinese
• Token boundaries are not obvious in these languages
• Penn Treebank tokenization
• http://www.cis.upenn.edu/~treebank/tokenization.html
Hi darling, my photo in attached file
Tokenize and lowercase
[“hi”, “darling”, “,”, “my”, “photo”, “in”, “attached”, “file”]
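A minimal sketch of this step in Python (a crude regular-expression tokenizer, shown only for illustration; it is not the Penn Treebank tokenizer linked above):

import re

def tokenize(text):
    # Split into alphanumeric runs and single punctuation characters, then lowercase.
    return [t.lower() for t in re.findall(r"\w+|[^\w\s]", text)]

tokenize('Hi darling, my photo in attached file')
# -> ['hi', 'darling', ',', 'my', 'photo', 'in', 'attached', 'file']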
Stop words
• Remove words that are irrelevant to the processing
• E.g., http://www.textfixer.com/resources/common-english-words.txt
• a,able,about,across,after,all,almost,also,am,among,an,and,any,are,as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,she,should,since,so,some,than,that,the,their,them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,will,with,would,yet,you,your
• Highly domain dependent
• “my” in the phrase “my photo” may be effective for spam filtering
• This lecture uses punctuation marks and prepositions as stop words
[“hi”, “darling”, “,”, “my”, “photo”, “in”, “attached”, “file”]
Remove punctuations and prepositions
[“hi”, “darling”, “my”, “photo”, “attached”, “file”]
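A tiny sketch of this filtering step (the stop-word set below contains only the punctuation mark and preposition that appear in the example):

STOP_WORDS = {',', 'in'}   # punctuation marks and prepositions used as stop words here
tokens = ['hi', 'darling', ',', 'my', 'photo', 'in', 'attached', 'file']
filtered = [t for t in tokens if t not in STOP_WORDS]
# -> ['hi', 'darling', 'my', 'photo', 'attached', 'file']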
Stemming
• Reducing inflected and derived words to their stems
• Stems are not identical to (morphological) base forms
• It is sufficient for an algorithm to map related words into the same
string, regardless of its morphological validity
• Porter Stemming Algorithm (Porter, 1980)
• http://tartarus.org/~martin/PorterStemmer/index.html
• http://pypi.python.org/pypi/stemming/1.0
• Not perfect: “viruses” → “virus”, but “virus” → “viru”
[“hi”, “darling”, “my”, “photo”, “attached”, “file”]
Porter Stemming Algorithm
[“hi”, “darl”, “my”, “photo”, “attach”, “file”]
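One way to reproduce this step in Python is NLTK's Porter stemmer (an assumption on my part; the slide links the reference implementation and the 'stemming' package instead, and implementations can differ in minor details):

from nltk.stem.porter import PorterStemmer   # assumes NLTK is installed

stemmer = PorterStemmer()
tokens = ['hi', 'darling', 'my', 'photo', 'attached', 'file']
stems = [stemmer.stem(t) for t in tokens]
# expected: ['hi', 'darl', 'my', 'photo', 'attach', 'file'] (as on this slide)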
Feature extraction
• Various characteristics, depending on the target task
• Word n-grams (unigram, bigram, trigram, …), prefixes/suffixes
• Part-of-speech tags, predicate arguments, dependency edges
• Dictionary matching (e.g., predefined ‘black’ words in spams)
• Conditions (e.g., whether the email has a link to a black URL)
• Others (e.g., the sender of the mail, the IP address of the sender)
• We may use (combinations of) multiple characteristics
• E.g., unigram and part-of-speech tag: “photo/NN”
[“hi”, “darl”, “my”, “photo”, “attach”, “file”]
Word bigrams as features
[“hi_darl”, “darl_my”, “my_photo”, “photo_attach”, “attach_file”]
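A minimal sketch of bigram feature generation over the stemmed tokens; the '_' joining convention follows this slide (and the ngram() function of spam_feature.py shown later):

def word_ngrams(tokens, n):
    # Join each window of n consecutive tokens with '_' to form one feature string.
    return ['_'.join(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

word_ngrams(['hi', 'darl', 'my', 'photo', 'attach', 'file'], 2)
# -> ['hi_darl', 'darl_my', 'my_photo', 'photo_attach', 'attach_file']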
Practical considerations
• Bias term
• Include a feature that is always 1 (without any condition)
• The corresponding feature weight represents a threshold
  𝑎′(𝒙) = Σ_(𝑖=1)^𝑚 𝑤𝑖 𝑥𝑖 + 𝑤0 𝑥0 = 𝑠(𝒙) + 𝑤0 > 0 ⟺ 𝑠(𝒙) > −𝑤0
  (𝑥0 is always 1; 𝑤0 is the bias term, so −𝑤0 acts as the threshold)
• Feature space and number of dimensions (𝑚)
• We seldom determine the number of dimensions in advance
• Instead, we do:
• define an algorithm for extracting features from an instance
• extract and enumerate features from all instances in the training data
• 𝑚 = (the number of distinct features extracted from the training data)
• A feature vector extracted from an instance is sparse
  • Only a few of the 𝑚 elements are non-zero
  • We therefore often store just the list of non-zero elements and their values
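A small sketch of such a sparse representation: store only the non-zero elements as a feature-to-value mapping and compute the score by iterating over them. The feature names come from this lecture's running example; the weight values are made up purely for illustration.

def sparse_dot(w, x):
    # w: dict mapping feature -> weight; x: dict mapping feature -> value (non-zeros only)
    return sum(w.get(f, 0.0) * v for f, v in x.items())

x = {'__BIAS__': 1, 'hi_darl': 1, 'darl_my': 1, 'my_photo': 1,
     'photo_attach': 1, 'attach_file': 1}
w = {'my_photo': 1.0, 'photo_attach': 0.5, 'kyoto_photo': -1.0}   # illustrative weights
sparse_dot(w, x)   # -> 1.5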
Example of feature extraction
• Training data (consisting of two instances)
• +1 Hi darling, my photo in attached file
• -1 Hi Mark, Kyoto photo in attached file
  (the negative label 𝑦 = 0 is often represented by -1)
• Feature representations (word bi-grams)
• +1 hi_darl darl_my my_photo photo_attach attach_file
• -1 hi_mark mark_kyoto kyoto_photo photo_attach attach_file
• Feature space
• 1: 1 (bias), 2: hi_darl, 3: darl_my, 4: my_photo, 5: photo_attach,
• 6: attach_file, 7: hi_mark, 8: mark_kyoto, 9: kyoto_photo
• Feature vectors
• (𝒙, 𝑦) = ((1 1 1 1 1 1 0 0 0), 1)
• (𝒙, 𝑦) = ((1 0 0 0 1 1 1 1 1), 0)
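The feature space above can be built mechanically from the training data; a minimal sketch that reproduces the two vectors on this slide (the function and variable names are mine):

def build_feature_space(instances):
    # Assign an index to every distinct feature in the training data.
    # Index 0 is reserved for the bias feature, which is always 1.
    index = {'__BIAS__': 0}
    for features, _label in instances:
        for f in features:
            if f not in index:
                index[f] = len(index)
    return index

def to_vector(features, index):
    x = [0] * len(index)
    x[0] = 1                  # bias feature
    for f in features:
        if f in index:        # features unseen in training are simply ignored
            x[index[f]] = 1
    return x

train_data = [
    (['hi_darl', 'darl_my', 'my_photo', 'photo_attach', 'attach_file'], 1),
    (['hi_mark', 'mark_kyoto', 'kyoto_photo', 'photo_attach', 'attach_file'], 0),
]
index = build_feature_space(train_data)     # m = 9 dimensions
to_vector(train_data[0][0], index)          # -> [1, 1, 1, 1, 1, 1, 0, 0, 0]
to_vector(train_data[1][0], index)          # -> [1, 0, 0, 0, 1, 1, 1, 1, 1]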
Training –
Perceptron
Finding 𝒘 such that it fits the training data well
Training
• We have training data consisting of 𝑁 instances: 𝐷 = {(𝒙1, 𝑦1), …, (𝒙𝑁, 𝑦𝑁)}
  • Notation: 𝒙𝑛 denotes the 𝑛-th instance in the training data; 𝑥𝑖 denotes the 𝑖-th element of a vector 𝒙
• Generalization (汎化): if a weight vector 𝒘 predicts training
instances correctly, it will work for unknown instances
• This is an assumption; the weight vector 𝒘 predicting training
instances perfectly does not necessarily perform the best for
unknown instances → over-fitting (過学習)
• Find the weight vector 𝒘 that predicts the training instances as correctly as possible
  • Ideally, ŷ𝑛 = 𝑦𝑛 for all 𝑛 ∈ [1, 𝑁] in the training data
Perceptron (Rosenblatt, 1957)
1. 𝑤𝑖 = 0 for all 𝑖 ∈ [1, 𝑚]
2. Repeat:
3.   (𝒙𝑛, 𝑦𝑛) ⟵ random sample from the training data 𝐷
4.   ŷ ⟵ predict(𝒘, 𝒙𝑛), where ŷ = 1 if 𝒘 ⋅ 𝒙𝑛 > 0 and ŷ = 0 otherwise
5.   if ŷ ≠ 𝑦𝑛 then:
6.     if 𝑦𝑛 = 1 then:
7.       𝒘 ⟵ 𝒘 + 𝒙𝑛
8.     else:
9.       𝒘 ⟵ 𝒘 − 𝒙𝑛
10. Until convergence (e.g., until no instance updates 𝒘)
How perceptron algorithm works
• Suppose that the current model 𝒘 misclassifies (𝒙𝑛 , 𝑦𝑛 )
• If 𝑦𝑛 = 1 then:
• Update the weight vector 𝒘′ ⟵ 𝒘 + 𝒙𝑛
• If we classify 𝒙𝑛 again with the updated weights 𝒘′ :
• 𝒘′ ⋅ 𝒙𝑛 = (𝒘 + 𝒙𝑛) ⋅ 𝒙𝑛 = 𝒘 ⋅ 𝒙𝑛 + 𝒙𝑛 ⋅ 𝒙𝑛 ≥ 𝒘 ⋅ 𝒙𝑛
• If 𝑦𝑛 = 0 then:
• Update the weight vector 𝒘′ ⟵ 𝒘 − 𝒙𝑛
• If we classify 𝒙𝑛 again with the updated weights 𝒘′ :
• 𝒘′ ⋅ 𝒙𝑛 = (𝒘 − 𝒙𝑛) ⋅ 𝒙𝑛 = 𝒘 ⋅ 𝒙𝑛 − 𝒙𝑛 ⋅ 𝒙𝑛 ≤ 𝒘 ⋅ 𝒙𝑛
Exercise 1: Update 𝒘 using perceptron
• Training instances:
• (𝒙1, 𝑦1) = ((1 1 1 1 1 1 0 0 0), 1) ← Hi darling, my photo in attached file
• (𝒙2, 𝑦2) = ((1 0 0 0 1 1 1 1 1), 0) ← Hi Mark, Kyoto photo in attached file
• Initialization
• 𝒘 = (0 0 0 0 0 0 0 0 0)
• Then?
Answer 1: Update 𝒘 using perceptron
• Training instances:
• (𝒙1, 𝑦1) = ((1 1 1 1 1 1 0 0 0), 1) ← Hi darling, my photo in attached file
• (𝒙2, 𝑦2) = ((1 0 0 0 1 1 1 1 1), 0) ← Hi Mark, Kyoto photo in attached file
• Initialization
• 𝒘 = (0 0 0 0 0 0 0 0 0)
• Iteration #1: pick (𝒙1, 𝑦1)
  • Classification: 𝒘 ⋅ 𝒙1 = 0 (misclassification)
  • Update: 𝒘 ⟵ 𝒘 + 𝒙1 = (1 1 1 1 1 1 0 0 0)
• Iteration #2: pick (𝒙2, 𝑦2)
  • Classification: 𝒘 ⋅ 𝒙2 = 3 (misclassification)
  • Update: 𝒘 ⟵ 𝒘 − 𝒙2 = (0 1 1 1 0 0 −1 −1 −1)
• Now the weight vector can classify both 𝒙1 and 𝒙2 correctly:
• 𝒘 ⋅ 𝒙1 = 3, 𝒘 ⋅ 𝒙2 = −3
Python implementation (perceptron.py)
def update(w, x, y):
    # Compute the score a = w . x and, on a classification error (a * y <= 0),
    # add y * x to the weights. The update rule is this simple because the
    # negative label is represented as -1 (instead of 0).
    a = 0.
    for i in range(len(w)):
        a += w[i] * x[i]
    if a * y <= 0:
        for i in range(len(w)):
            w[i] += y * x[i]

def classify(w, x):
    # Return True (positive) if the score w . x is greater than zero.
    a = 0.
    for i in range(len(w)):
        a += w[i] * x[i]
    return (0. < a)

if __name__ == '__main__':
    w = [0.] * 9
    D = (
        ((1, 1, 1, 1, 1, 1, 0, 0, 0), +1),
        ((1, 0, 0, 0, 1, 1, 1, 1, 1), -1),
    )
    update(w, D[0][0], D[0][1])
    update(w, D[1][0], D[1][1])
    print classify(w, D[0][0])
    print classify(w, D[1][0])
    print w
$ python perceptron.py
True
False
[0.0, 1.0, 1.0, 1.0, 0.0, 0.0, -1.0, -1.0, -1.0]
Practical considerations
• The perceptron algorithm can learn from one training instance at a time (online training)
  • Suitable for spam filtering, which constantly receives new instances
• The perceptron algorithm does not converge if the training set is not linearly separable (i.e., no discrimination plane exists)
  • We often terminate the algorithm after:
    • A fixed number of iterations (tuned on a development set)
    • The number of misclassified instances stops decreasing
• The perceptron algorithm often generalizes poorly
  • We often use ‘averaging’ of the weight vectors over the iterations for better generalization (Freund and Schapire, 1999); see the sketch below
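A hedged sketch of the averaging idea mentioned above. This is the common 'averaged perceptron' approximation rather than the exact voted perceptron of Freund and Schapire; the function name and the choice of averaging after every iteration are mine.

import random

def train_averaged_perceptron(D, m, T):
    # D: list of (x, y) pairs with x a dense list of length m and y in {+1, -1}
    w = [0.0] * m           # current weight vector
    w_sum = [0.0] * m       # running sum of the weight vector over all iterations
    for t in range(T):
        x, y = random.choice(D)
        a = sum(w[i] * x[i] for i in range(m))
        if a * y <= 0:      # misclassified: same update rule as perceptron.py
            for i in range(m):
                w[i] += y * x[i]
        for i in range(m):
            w_sum[i] += w[i]
    return [wi / T for wi in w_sum]   # averaged weights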
Training –
Logistic Regression
Section 6.6.2 Logistic Regression (P231-P234)
Logistic regression (ロジスティック回帰)
• A linear binary classifier (the same form as the perceptron)
  • Input: feature vector 𝒙 ∈ ℛ^𝑚 (𝑚: number of dimensions)
  • Output: prediction 𝑦 ∈ {0, 1}, with 𝑦 = 1 if 𝑎(𝒙) = 𝒘 ⋅ 𝒙 > 0 and 𝑦 = 0 otherwise
  • Using: weight vector 𝒘 ∈ ℛ^𝑚
• In addition, the conditional probability 𝑃(𝑦|𝒙) is defined by:

  𝑃(𝑦 = 1|𝒙) = 1 / (1 + e^(−𝒘⋅𝒙))
  𝑃(𝑦 = 0|𝒙) = 1 − 𝑃(𝑦 = 1|𝒙) = e^(−𝒘⋅𝒙) / (1 + e^(−𝒘⋅𝒙))

• Decision rule for classifying 𝒙 as positive (𝑦 = 1):

  𝑃(𝑦 = 1|𝒙) > 0.5 ⟺ 1 / (1 + e^(−𝒘⋅𝒙)) > 1/2 ⟺ 𝒘 ⋅ 𝒙 > 0
  (the same decision rule as the perceptron)
Sigmoid function (シグモイド関数)
• Sigmoid function: 𝜎(𝑎) = 1 / (1 + e^(−𝑎))
• This function maps a score in (−∞, +∞) to a probability in [0, 1]:
  lim_(𝑎→−∞) 𝜎(𝑎) = 0,  lim_(𝑎→∞) 𝜎(𝑎) = 1,  symmetry point at (0, 0.5)
• In an implementation, avoid the overflow problem when 𝑎 is negative (e^(−𝑎) becomes huge)
[Figure: the S-shaped curve of 𝜎(𝑎) plotted against 𝑎]
http://en.wikipedia.org/wiki/Sigmoid_function
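One common way to avoid that overflow is to evaluate the sigmoid differently depending on the sign of 𝑎 (a sketch; logress.py later in these slides instead clips the score at −100):

import math

def sigmoid(a):
    # exp(-a) overflows for very negative a, so use the algebraically
    # equivalent form exp(a) / (1 + exp(a)) in that case.
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    z = math.exp(a)
    return z / (1.0 + z)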
Interpreting logistic regression
• Compute the score (inner product) 𝑎 = 𝒘 ⋅ 𝒙
• Apply the sigmoid function to map the score 𝑎 into a probability value: 𝑃(𝑦 = 1|𝒙) = 𝜎(𝑎) = 𝜎(𝒘 ⋅ 𝒙)
[Figure: the sigmoid curve plotted over 𝑎 = 𝒘 ⋅ 𝒙; the region 𝑎 > 0 corresponds to 𝑦 = 1 and 𝑎 < 0 to 𝑦 = 0]
(Instance-wise) log-likelihood (対数尤度)
• Given a training instance (𝒙𝑛, 𝑦𝑛), we compute the instance-wise log-likelihood ℓ𝑛 to assess the fitness:

  ℓ𝑛 ≡ log 𝑝𝑛 (if 𝑦𝑛 = 1);  log(1 − 𝑝𝑛) (if 𝑦𝑛 = 0)
     = 𝑦𝑛 log 𝑝𝑛 + (1 − 𝑦𝑛) log(1 − 𝑝𝑛),
  where 𝑝𝑛 ≡ 𝑃(𝑦 = 1|𝒙𝑛) = 1 / (1 + e^(−𝑎𝑛)) and 𝑎𝑛 ≡ 𝒘 ⋅ 𝒙𝑛

• Maximum Likelihood Estimation: we would like to find the weight vector 𝒘 that maximizes Σ_(𝑛=1)^𝑁 ℓ𝑛
Check your understanding with an example
• Model: 𝒘 = (0 1 1 1 0 0 −1 −1 −1)
• 𝒙1 = (1 1 1 1 1 1 0 0 0) ← Hi darling, my photo in attached file
  • 𝑎1 = 𝒘 ⋅ 𝒙1 = 3, 𝑝1 = 1 / (1 + exp(−3)) = 0.953
  • Suppose that this instance is annotated as positive (𝑦1 = 1)
  • ℓ1 = log 𝑝1 = −0.0486 ⟶ 0 (maximize)
• 𝒙2 = (1 0 0 0 1 1 1 1 1) ← Hi Mark, Kyoto photo in attached file
  • 𝑎2 = 𝒘 ⋅ 𝒙2 = −3, 𝑝2 = 1 / (1 + exp(+3)) = 0.047
  • Suppose that this instance is annotated as negative (𝑦2 = 0)
  • ℓ2 = log(1 − 𝑝2) = −0.0486 ⟶ 0 (maximize)
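The numbers above can be checked directly with a few lines of Python (values rounded as on the slide):

import math

w  = [0, 1, 1, 1, 0, 0, -1, -1, -1]
x1 = [1, 1, 1, 1, 1, 1, 0, 0, 0]
x2 = [1, 0, 0, 0, 1, 1, 1, 1, 1]

a1 = sum(wi * xi for wi, xi in zip(w, x1))   #  3
p1 = 1.0 / (1.0 + math.exp(-a1))             #  0.953
l1 = math.log(p1)                            # -0.0486  (y1 = 1)

a2 = sum(wi * xi for wi, xi in zip(w, x2))   # -3
p2 = 1.0 / (1.0 + math.exp(-a2))             #  0.047
l2 = math.log(1.0 - p2)                      # -0.0486  (y2 = 0)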
Stochastic Gradient Descent for training
(since we maximize the objective, this is strictly gradient ascent)
• Optimization problem: Maximum Likelihood Estimation
  • maximize Σ_(𝑛=1)^𝑁 ℓ𝑛 (fortunately, this objective is concave)
• Stochastic Gradient Descent (SGD) (確率的勾配降下法)
  • Compute the gradient ∂ℓ𝑛/∂𝒘 for a single instance (batch size = 1)
  • Update the parameters in the steepest direction: 𝒘 ⟵ 𝒘 + 𝜂𝑡 ∂ℓ𝑛/∂𝒘

1. 𝑤𝑖 = 0 for all 𝑖 ∈ [1, 𝑚]
2. For 𝑡 ⟵ 1 to 𝑇:
3.   𝜂𝑡 ⟵ 1/𝑡
4.   (𝒙𝑛, 𝑦𝑛) ⟵ random sample from 𝐷
5.   𝒘 ⟵ 𝒘 + 𝜂𝑡 ∂ℓ𝑛/∂𝒘

[Figure: the concave objective Σ ℓ𝑛 plotted against 𝒘; each step moves 𝒘 by 𝜂𝑡 ∂ℓ𝑛/∂𝒘 toward the maximum]
Exercise 2: compute the gradient
• Compute the gradients ∂ℓ𝑛/∂𝑝𝑛, ∂𝑝𝑛/∂𝑎𝑛, ∂𝑎𝑛/∂𝒘, and prove:

  ∂ℓ𝑛/∂𝒘 = (∂ℓ𝑛/∂𝑝𝑛)(∂𝑝𝑛/∂𝑎𝑛)(∂𝑎𝑛/∂𝒘) = (𝑦𝑛 − 𝑝𝑛) 𝒙𝑛

• where:

  ℓ𝑛 = 𝑦𝑛 log 𝑝𝑛 + (1 − 𝑦𝑛) log(1 − 𝑝𝑛),  𝑝𝑛 = 1 / (1 + e^(−𝑎𝑛)),  𝑎𝑛 = 𝒘 ⋅ 𝒙𝑛
Answer 2: compute the gradients
∂ℓ𝑛/∂𝑝𝑛 = 𝑦𝑛/𝑝𝑛 + (1 − 𝑦𝑛)/(1 − 𝑝𝑛) ⋅ (−1) = (𝑦𝑛 − 𝑝𝑛) / (𝑝𝑛(1 − 𝑝𝑛)),

∂𝑝𝑛/∂𝑎𝑛 = −1 ⋅ (1 + e^(−𝑎𝑛))^(−2) ⋅ e^(−𝑎𝑛) ⋅ (−1) = 1/(1 + e^(−𝑎𝑛)) ⋅ e^(−𝑎𝑛)/(1 + e^(−𝑎𝑛)) = 𝑝𝑛(1 − 𝑝𝑛),

∂𝑎𝑛/∂𝒘 = 𝒙𝑛

Hence,

∂ℓ𝑛/∂𝒘 = (∂ℓ𝑛/∂𝑝𝑛)(∂𝑝𝑛/∂𝑎𝑛)(∂𝑎𝑛/∂𝒘) = (𝑦𝑛 − 𝑝𝑛)/(𝑝𝑛(1 − 𝑝𝑛)) ⋅ 𝑝𝑛(1 − 𝑝𝑛) ⋅ 𝒙𝑛 = (𝑦𝑛 − 𝑝𝑛) 𝒙𝑛
Interpreting the update formula
• The update formula:

  𝒘 ⟵ 𝒘 + 𝜂𝑡 ∂ℓ𝑛/∂𝒘 = 𝒘 + 𝜂𝑡 (𝑦𝑛 − 𝑝𝑛) 𝒙𝑛
• If 𝑦𝑛 = 𝑝𝑛 , no need for updating 𝒘
• If 𝑦𝑛 = 1 and 𝑝𝑛 < 1, increase the weight 𝒘 by the amount
of the error 𝑦𝑛 − 𝑝𝑛
• If 𝑦𝑛 = 0 and 0 < 𝑝𝑛 , decrease the weight 𝒘 by the
amount of the error 𝑦𝑛 − 𝑝𝑛
Python implementation (logress.py)
import math
import random

def train(w, D, T):
    for t in range(1, T+1):
        x, y = random.choice(D)
        a = sum([w[i] * x[i] for i in range(len(w))])             # 𝑎(𝒙) = 𝒘 ⋅ 𝒙
        g = y - (1. / (1. + math.exp(-a))) if -100. < a else y    # 𝑔 = 𝑦 − 𝑝
        eta = 1. / t
        for i in range(len(w)):
            w[i] += eta * g * x[i]                                # 𝒘 ← 𝒘 + 𝜂𝑔𝒙

def prob(x):
    a = sum([w[i] * x[i] for i in range(len(w))])                 # 𝑎(𝒙) = 𝒘 ⋅ 𝒙
    return 1. / (1 + math.exp(-a)) if -100. < a else 0.           # 𝑝 = 1 / (1 + exp(−𝑎))

if __name__ == '__main__':
    w = [0.] * 9
    D = (
        ((1, 1, 1, 1, 1, 1, 0, 0, 0), 1),
        ((1, 0, 0, 0, 1, 1, 1, 1, 1), 0),
    )
    train(w, D, 10000)
    print prob(D[0][0]), prob(D[1][0])
    print w

$ python logress.py
0.93786130942 0.0601874607952
[-0.0037759046835663321, 0.90852032090533386, 0.90852032090533386, 0.90852032090533386, -0.0037759046835663321, -0.0037759046835663321, -0.91229622558889756, -0.91229622558889756, -0.91229622558889756]
Practical considerations
• MLE often leads to overfitting
  • |𝒘| → ∞ as Σ_(𝑛=1)^𝑁 ℓ𝑛 → 0 when the training data is linearly separable
  • Easily affected by noise in the training data
• Regularization (正則化) (MAP estimation)
  • We introduce a penalty term as 𝒘 becomes large
  • E.g., MAP with L2 regularization: maximize Σ_(𝑛=1)^𝑁 ℓ𝑛 − 𝐶‖𝒘‖²  (see the sketch below)
  • 𝐶 is a parameter that controls the trade-off between over- and under-fitting
• Stopping criterion
  • Detect the convergence of the log-likelihood improvements
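A hedged sketch of how the SGD update in logress.py could incorporate the L2 penalty. Scaling the penalty by 1/N per instance and the function name are my own choices, not something specified on this slide:

import math

def sgd_l2_update(w, x, y, eta, C, N):
    # Per-instance objective (roughly): l_n - (C/N) * ||w||^2, so the ascent
    # direction gains an extra -2*(C/N)*w term on top of (y - p) * x.
    a = sum(w[i] * x[i] for i in range(len(w)))
    p = 1.0 / (1.0 + math.exp(-a)) if -100.0 < a else 0.0
    g = y - p
    for i in range(len(w)):
        w[i] += eta * (g * x[i] - 2.0 * C / N * w[i])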
Other classification models
• Extensions of logistic regression
• Maximum Entropy Modeling (MaxEnt) (最大エントロピー法)
• Multi-class logistic regression
• Conditional Random Fields (CRFs) (条件付き確率場)
• Predict a sequence of labels for a given sequence
• Other formalizations
• Support Vector Machine (SVM) (サポートベクトルマシーン)
• SVM with linear kernel = linear classifier
• Naïve Bayes
• Neural Network
• Decision Tree
• …
Evaluation
Accuracy, precision, and recall
Contingency table:

              Actual +              Actual −
Predicted +   𝑎 (true positive)     𝑏 (false positive)
Predicted −   𝑐 (false negative)    𝑑 (true negative)

• Accuracy (精度): (𝑎 + 𝑑) / (𝑎 + 𝑏 + 𝑐 + 𝑑)  (correctness about positives and negatives)
• Precision (適合率): 𝑝 = 𝑎 / (𝑎 + 𝑏)  (correctness about positives)
• Recall (再現率): 𝑟 = 𝑎 / (𝑎 + 𝑐)  (coverage about positives)
• F1-score: 2𝑝𝑟 / (𝑝 + 𝑟)  (harmonic mean of precision and recall)
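These definitions translate directly into code; the counts in the usage comment are consistent with the Classias run reported later in these slides (a = 41, b = 2, c = 8, d = 240):

def evaluate(a, b, c, d):
    # a, b, c, d are the counts from the contingency table above.
    accuracy  = (a + d) / float(a + b + c + d)
    precision = a / float(a + b) if a + b else 0.0
    recall    = a / float(a + c) if a + c else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

evaluate(41, 2, 8, 240)   # -> (0.9656..., 0.9535..., 0.8367..., 0.8913...)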
Precision – recall curve
• Precision/recall tradeoff
  • If we increase the precision of a system, its recall decreases
  • It is not easy to improve both precision and recall
  • We sometimes prioritize either precision or recall, depending on the task
    • Spam filtering: precision is important
• How can we control the tradeoff of a linear binary classifier?
  • Increasing the threshold improves precision
  • Decreasing the threshold improves recall
[Figure: precision-recall curve with precision 𝑝 on the vertical axis and recall 𝑟 on the horizontal axis; curves toward the upper right are better]
Holdout evaluation
• Use ‘heldout’ data set for evaluation
[Diagram: the data set is separated into a training set and a test set; the model is trained on the training set and evaluated on the held-out test set]
N-fold cross validation (交差検定)
• For example (5-fold cross validation)
[Diagram: the data set is split into 5 groups; in each of 5 runs, a different group serves as the test set and the remaining groups are used for training, yielding Perf #1 through Perf #5, which are then averaged]
Standard experiment procedure
• Three groups of data sets
• Training set: used for training
• Test set: used for evaluating the trained model
• Development set: another test set for tuning parameters of training
• This experimental setting is useful for comparing different systems
• Two groups of data sets
• Training set: used for N-fold cross validation
• Development set: a test set for tuning parameters of training
Implementations
Implementations for classification
• LIBLINEAR: http://www.csie.ntu.edu.tw/~cjlin/liblinear/
• L2-loss SVM, L1-loss SVM, logistic regression
• LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
• SVM with linear, polynomial, radial basis, and sigmoid kernels
• Classias: http://www.chokkan.org/software/classias/
• Averaged perceptron, L1-loss SVM, logistic regression
Ling-spam corpus
• A mixture of 481 spam messages and 2412 messages
sent via the Linguist list
• http://labs-repos.iit.demokritos.gr/skel/i-config/downloads/lingspam_public.tar.gz
• This corpus includes mail messages after preprocessing
(lemmatization and stopword filtering)
• Under ‘lemm_stop’ directory
• This corpus splits the data set into ten groups
• Under part1, part2, …, part10 directories
Files in the data set
Files starting with ‘spmsg’ are spams
Example (lemm_stop/part1/spmsga1.txt.gz)
Subject: great part-time summer job !
* * * * * * * * * * * * * * * display box credit application need place small owneroperate store area . here : 1 . introduce yourself store owner manager . 2 . our
90 % effective script tell little display box save customer hundred dollar , draw
card business , $ 5 . 0 $ 15 . 0 every app send . 3 . spot counter , place box ,
nothing need , need name address company send commission check .
compensaation $ 10 every box place . become representative earn
commission $ 10 each application store . course much profitable plan , pay
month small effort . call 1-888 - 703-5390 code 3 24 hours receive detail ! ! * * *
*****************************************************
*****************************************************
*****************************************************
* * * * * * * * * * * * * * * * * * * * * removed our mailing list , type : b2998 @
hotmail . com ( : ) area ( remove ) subject area e - mail send . * * * * * * * * * * *
*****************************************************
*****************************************************
***************
Feature extraction (spam_feature.py)
import sys
import os
import gzip

def ngram(T, n):
    # Generate word n-gram features, e.g., '2gram=my_photo', from the token list T.
    return ['%dgram=%s' % (n, '_'.join(T[i:i+n])) for i in range(len(T)-n+1)]

def process(fo, fi, label):
    # Read one mail message and write one line: the label followed by its features.
    F = []
    for line in fi:
        fields = line.strip('\n').split(' ')
        F += ngram(fields, 1)
        F += ngram(fields, 2)
    fo.write('%s ' % label)
    fo.write('%s' % ' '.join([f.replace(':', '__COLON__') for f in F]))
    fo.write('\n')

if __name__ == '__main__':
    fo = sys.stdout
    for src in sys.argv[1:]:
        # Files whose names start with 'spmsg' are spam (+1); the rest are non-spam (-1).
        label = '+1' if os.path.basename(src).startswith('spmsg') else '-1'
        fi = gzip.GzipFile(src) if src.endswith('.gz') else open(src)
        process(fo, fi, label)
Extracted features
+1 1gram=Subject__COLON__ 1gram=great 1gram=part-time 1gram=summer 1gram=job 1gram=!
2gram=Subject__COLON___great 2gram=great_part-time 2gram=part-time_summer
2gram=summer_job 2gram=job_! 1gram= 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=*
1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=display
1gram=box 1gram=credit 1gram=application 1gram=need 1gram=place 1gram=small 1gram=owneroperate 1gram=store 1gram=area 1gram=. 1gram=here 1gram=__COLON__ 1gram=1 1gram=.
1gram=introduce 1gram=yourself 1gram=store 1gram=owner 1gram=manager 1gram=. 1gram=2
1gram=. 1gram=our 1gram=90 1gram=% 1gram=effective 1gram=script 1gram=tell 1gram=little
1gram=display 1gram=box 1gram=save 1gram=customer 1gram=hundred 1gram=dollar 1gram=,
1gram=draw 1gram=card 1gram=business 1gram=, 1gram=$ 1gram=5 1gram=. 1gram=0
1gram=$ 1gram=15 1gram=. 1gram=0 1gram=every 1gram=app 1gram=send 1gram=. 1gram=3
1gram=. 1gram=spot 1gram=counter 1gram=, 1gram=place 1gram=box 1gram=, 1gram=nothing
1gram=need 1gram=, 1gram=need 1gram=name 1gram=address 1gram=company 1gram=send
1gram=commission 1gram=check 1gram=. 1gram=compensaation 1gram=$ 1gram=10 1gram=every
1gram=box 1gram=place 1gram=. 1gram=become 1gram=representative 1gram=earn
1gram=commission 1gram=$ 1gram=10 1gram=each 1gram=application 1gram=store 1gram=.
1gram=course 1gram=much 1gram=profitable 1gram=plan 1gram=, 1gram=pay 1gram=month
1gram=small 1gram=effort 1gram=. 1gram=call 1gram=1-888 1gram=- 1gram=703-5390 1gram=code
1gram=3 1gram=24 1gram=hours 1gram=receive 1gram=detail 1gram=! 1gram=! 1gram=* 1gram=*
1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=*
1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=*
1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=*
1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=*
1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=*
Training using Classias
• Training with spam[01-09].txt
• $ classias-train -tb -m spam.model spam0?.txt
• Tagging with spam10.txt
• $ classias-tag -r -m spam.model < spam10.txt
• Performance evaluation
• $ classias-tag -qt -m spam.model < spam10.txt
• Accuracy: 0.9656 (281/291)
• Micro P, R, F1: 0.9535 (41/43), 0.8367 (41/49), 0.8913
Trained feature weights
Features with high positive weights:
• 0.745045   1gram=!
• 0.508344   1gram=free
• 0.483244   1gram=click
• 0.385863   1gram=$
• 0.372262   1gram=site
• 0.340425   1gram=com
• 0.333442   1gram=adult
• 0.305192   2gram=._com
• 0.299545   1gram=want
• 0.298676   1gram=here
• 0.295115   2gram=http___COLON__
• 0.294267   1gram=http
• 0.294198   1gram=our
• 0.293998   1gram=remove
• 0.293348   2gram=__COLON___/
• 0.292054   2gram=click_here
• 0.262475   1gram=%
• 0.256503   1gram=money
• 0.254161   1gram=xxx

Features with high negative weights:
• -2.35981   __BIAS__
• -0.59796   1gram=language
• -0.545332  2gram=!_!
• -0.49624   1gram=@
• -0.43065   1gram=linguistic
• -0.326052  1gram=edu
• -0.325135  2gram=._edu
• -0.302419  1gram=linguist
• -0.299929  1gram=university
• -0.296032  1gram=(
• -0.294959  1gram=)
• -0.278405  1gram=de
• -0.268571  1gram=conference
• -0.26763   1gram=english
• -0.233347  1gram=word
• -0.227835  1gram=available
• -0.226135  1gram=research
• -0.223516  1gram=linguistics
• -0.216163  1gram=__COLON__
References
• Yoav Freund and Robert E. Schapire. (1999) Large margin classification using the perceptron algorithm. Machine Learning.
• Martin F. Porter. (1980) An algorithm for suffix stripping. Program, 14(3):130-137.
• Frank Rosenblatt. (1957) The Perceptron: a perceiving and recognizing automaton. Report 85-460-1, Cornell Aeronautical Laboratory.
Spam filtering in reality
Spam mail (again)
How does SpamAssassin detect spams?
Return-Path: <automail@interloper.com>
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on www234.sakura.ne.jp
X-Spam-Flag: YES
X-Spam-Level: ***************
X-Spam-Status: Yes, score=15.4 required=7.0 tests=BAYES_00,HTML_MESSAGE,
KB_DATE_CONTAINS_TAB,KB_FAKED_THE_BAT,RAZOR2_CF_RANGE_51_100,
RAZOR2_CF_RANGE_E8_51_100,RAZOR2_CHECK,T_SURBL_MULTI1,T_SURBL_MULTI2,
T_SURBL_MULTI3,URIBL_AB_SURBL,URIBL_JP_SURBL,URIBL_PH_SURBL,URIBL_SC_SURBL,
URIBL_WS_SURBL autolearn=no version=3.3.1
X-Spam-Report:
* 0.6 URIBL_PH_SURBL Contains an URL listed in the PH SURBL blocklist
*     [URIs: sonya201010.com.ua]
* 4.5 URIBL_AB_SURBL Contains an URL listed in the AB SURBL blocklist
*     [URIs: sonya201010.com.ua]
* 1.6 URIBL_WS_SURBL Contains an URL listed in the WS SURBL blocklist
*     [URIs: sonya201010.com.ua]
* 1.2 URIBL_JP_SURBL Contains an URL listed in the JP SURBL blocklist
*     [URIs: sonya201010.com.ua]
* 0.6 URIBL_SC_SURBL Contains an URL listed in the SC SURBL blocklist
*     [URIs: sonya201010.com.ua]
* 2.8 KB_DATE_CONTAINS_TAB KB_DATE_CONTAINS_TAB
* -1.9 BAYES_00 BODY: Bayes spam probability is 0 to 1%