Classification (分類) Naoaki Okazaki okazaki at ecei.tohoku.ac.jp http://www.chokkan.org/ http://twitter.com/#!/chokkanorg #nlptohoku http://www.cl.ecei.tohoku.ac.jp/index.php?InformationCommunicationT heory 2011-10-11 Information Communication Theory (情報伝達学) 1 Spam mail 2011-10-11 Information Communication Theory (情報伝達学) 2 How many spams do you get (per day)? • My office address: okazaki at ecei.tohoku.ac.jp • 10 spams out of 80 emails • 5 filtered automatically out of 10 spams • My private address (leaked by a company) • 40 spams out of 40 emails • 10 filtered out of 40 spams Spam (Monty Python) 2011-10-11 Information Communication Theory (情報伝達学) 3 Symantec’s survey • May 2011 MessageLabs Intelligence Report • 75.8% of email in the world was spam (72.3% in Japan) • 1 in 1.32 emails was spam • http://www.symanteccloud.com/mlireport/MLI_2011_05_May_FINAL-en.pdf 2011-10-11 Information Communication Theory (情報伝達学) 4 Rule-based spam filtering • Design heuristic rules to detect spams def is_spam(text): if text.find('my photo attached') != -1: return True if text.find('very cheap drugs') != -1: return True ... Difficult to scale to a wide-variety of spams return False • Pros • Good initial cost-performance • Understandable internals • Configurable internals 2011-10-11 • Cons • High maintenance cost • Dependence to the domain • Artisanal skill Information Communication Theory (情報伝達学) 5 Supervised spam filtering (with retraining) Discard spam emails Classification User Model Incoming emails Training Spam folder Spam Append new instances Non-spam Training data 2011-10-11 Information Communication Theory (情報伝達学) 6 Supervised approach • Learn from examples (supervision data) • Spam (positive) and non-spam (negative) emails • Acquire rules from the supervision data • Rules (features) are usually induced from the supervision data • e.g., n-grams of mail contents, mail headers • (Deliberately) over-generate features, regardless of effectiveness • Weight rules in terms of their contribution to the task • Hand-crafted rules for the domain are unnecessary (except for feature engineering) • This approach work surprisingly well if we have a large supervision data 2011-10-11 Information Communication Theory (情報伝達学) 7 Classification and NLP • Many NLP problems can be formalized as classification! In March 2005, the New York Times acquired About, Inc. IN NNP CD DT NNP NP NNP NNP VBD NP TEMP VP when NNP NP ORG ORG who 2011-10-11 NNP whom acquire Information Communication Theory (情報伝達学) 8 Today’s topic • Linear binary classifier • Feature extraction • Tokenization, stop words, stemming, feature extraction • Training – perceptron • Training – logistic regression (using SGD) • Other formalizations • Evaluation • Implementations and experiments 2011-10-11 Information Communication Theory (情報伝達学) 9 Take home messages • A text is represented by features (as a vector) • Linear classifier simply computes the linear combination of feature weights as a score • Training a linear classifier is very straightforward • In principle, if the classifier fails with an instance, we update feature weights such that the classifier can classify the instance next time! • Two kinds of classification failures that have trade-offs: • False positives decrease precision • False negatives decrease recall 2011-10-11 Information Communication Theory (情報伝達学) 10 Brief Introduction of Linear (Binary) Classifier 2011-10-11 Information Communication Theory (情報伝達学) 11 Linear (binary) classifier (線形二値分類器) • Input: feature vector 𝒙 ∈ ℛ 𝑚 • Output: prediction 𝑦 ∈ {0,1} • Using: weight vector 𝒘 ∈ ℛ 𝑚 1 (if 𝑎 𝒙 = 𝒘 ⋅ 𝒙 > 0) 𝑦= 0 (otherwise) (𝑚: number of dimension) Feature extraction • More concrete example: • Feature space (𝑚 = 6): (darling, honey, my, love, photo, attached) • Input: “Hi darling, my photo in attached file” ⟶ 𝒙 = (1 0 1 0 1 1) 𝑚 𝑎 𝒙 =𝒘⋅𝒙= 𝑤𝑖 𝑥𝑖 = 𝑤1 𝑥1 + 𝑤2 𝑥2 + 𝑤3 𝑥3 + 𝑤4 𝑥4 + 𝑤5 𝑥5 + 𝑤6 𝑥6 𝑖=1 • Output: the sign of 𝑎(𝒙) Contribution of each feature (positive high: likely to yield 𝑦 = 1) • Training: to find 𝒘 such that it fits well to the training data 2011-10-11 Information Communication Theory (情報伝達学) 12 Discrimination plane (分離平面) • Draw the region of 𝒙 such that 𝑦 = 1 𝜋 𝒘 ⋅ 𝒙 > 0 ⟺ (|angle between 𝒘 and 𝒙| < ) 2 𝒘 𝑦=1 𝒙 𝒙 𝒙 𝑦=0 2011-10-11 Discrimination plane Information Communication Theory (情報伝達学) 13 Two important questions • Feature extraction (feature engineering) • Represent an input text with a feature vector • Using a number of NLP techniques: tokenization, stop-word, stemming, part-of-speech tagging, parsing, dictionary lookup, etc. • Important: a linear model can see a text only through features! • Training • Find a weight vector such that it fits well to the training data • The classifier is expected to re-predict all training instances correctly • Fitness (formalizations; loss functions) • Logistic Regression, Support Vector Machine (SVM), … • Finding (training algorithms) • Perceptron, Gradient Descent, Stochastic Gradient Descent, … 2011-10-11 Information Communication Theory (情報伝達学) 14 Feature Extraction Representing a text with an input vector 𝒙 Preparing a space of feature vectors 2011-10-11 Information Communication Theory (情報伝達学) 15 Tokenization • Split a sentence into a sequence of tokens (words) • In English, tokens are separated by whitespaces (e.g., space, punctuation characters) • Sophisticated methods (e.g., word segmentation, morphological analysis) are necessary for Japanese and Chinese • Token boundaries are not obvious in these languages • Penn Treebank tokenization • http://www.cis.upenn.edu/~treebank/tokenization.html Hi darling, my photo in attached file Tokenize and lowercase [“hi”, “darling”, “,”, “my”, “photo”, “in”, “attached”, “file”] 2011-10-11 Information Communication Theory (情報伝達学) 16 Stop words • Remove words that are irrelevant to the processing • E.g., http://www.textfixer.com/resources/common-english-words.txt • a,able,about,across,after,all,almost,also,am,among,an,and,any,are,as,at,be,beca use,been,but,by,can,cannot,could,dear,did,do,does,either,else,ever,every,for,from ,get,got,had,has,have,he,her,hers,him,his,how,however,i,if,in,into,is,it,its,just,least ,let,like,likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,on,only,o r,other,our,own,rather,said,say,says,she,should,since,so,some,than,that,the,their,t hem,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when, where,which,while,who,whom,why,will,with,would,yet,you,your • Highly domain dependent • “my” in the phrase “my photo” may be effective to spam filtering • This lecture employs punctuations and prepositions as stop words [“hi”, “darling”, “,”, “my”, “photo”, “in”, “attached”, “file”] Remove punctuations and prepositions [“hi”, “darling”, “my”, “photo”, “attached”, “file”] 2011-10-11 Information Communication Theory (情報伝達学) 17 Stemming • Reducing inflected and derived words to their stems • Stems are not identical to (morphological) base forms • It is sufficient for an algorithm to map related words into the same string, regardless of its morphological validity • Porter Stemming Algorithm (Porter, 1980) • http://tartarus.org/~martin/PorterStemmer/index.html • http://pypi.python.org/pypi/stemming/1.0 • Not perfect: “viruses” – “virus”, “virus” – “viru” [“hi”, “darling”, “my”, “photo”, “attached”, “file”] Porter Stemming Algorithm [“hi”, “darl”, “my”, “photo”, “attach”, “file”] 2011-10-11 Information Communication Theory (情報伝達学) 18 Feature extraction • Various characteristics, depending on the target task • Word n-grams (unigram, bigram, tri-gram, …), prefixes/postfixes • Part-of-speech tags, predicate arguments, dependency edges • Dictionary matching (e.g., predefined ‘black’ words in spams) • Conditions (e.g., whether the email has a link to a black URL) • Others (e.g., the sender of the mail, the IP address of the sender) • We may use (combinations of) multiple characteristics • E.g., unigram and part-of-speech tag: “photo/NN” [“hi”, “darl”, “my”, “photo”, “attach”, “file”] Word bigrams as features [“hi_darl”, “darl_my”, “my_photo”, “photo_attach”, “attach_file”] 2011-10-11 Information Communication Theory (情報伝達学) 19 Practical considerations • Bias term • Include a feature that is always 1 (without any condition) • The corresponding feature weight presents a threshold 𝑚 𝑎′ 𝒙 = 𝑤𝑖 𝑥𝑖 + 𝑤0 𝑥0 = 𝑠 𝒙 + 𝑤0 > 0 ⇔ 𝑠 𝒙 > −𝑤0 𝑖=1 Always 1 Bias term Threshold • Feature space and number of dimensions (𝑚) • We seldom determine the number of dimensions in advance • Instead, we do: • define an algorithm for extracting features from an instance • extract and enumerate features from all instances in the training data • 𝑚 = (the number of distinct features extracted from the training data) • A feature vector extracted from an instance is sparse • A few non-zero elements in the 𝑚 dimension • We often use a list of non-zero elements and their values 2011-10-11 Information Communication Theory (情報伝達学) 20 Example of feature extraction • Training data (consisting of two instances) • +1 Hi darling, my photo in attached file 𝑦 = 0 is often represented by -1 • -1 Hi Mark, Kyoto photo in attached file • Feature representations (word bi-grams) • +1 hi_darl darl_my my_photo photo_attach attach_file • -1 hi_mark mark_kyoto kyoto_photo photo_attach attach_file • Feature space • 1: 1 (bias), 2: hi_darl, 3: darl_my, 4: my_photo, 5: photo_attach, • 6: attach_file, 7: hi_mark, 8: mark_kyoto, 9: kyoto_photo • Feature vectors • 𝒙, 𝑦 = ( 1 1 1 1 1 1 0 0 0 , 1) • 𝒙, 𝑦 = ( 1 0 0 0 1 1 1 1 1 , 0) 2011-10-11 Information Communication Theory (情報伝達学) 21 Training – Perceptron Finding 𝒘 such that it fits well to the training data 2011-10-11 Information Communication Theory (情報伝達学) 22 Training • We have a training data consisting of 𝑁 instances: 𝑥𝑖 : the 𝑖-th element of the vector 𝒙 • 𝐷 = { 𝒙1 , 𝑦1 , … , 𝒙𝑁 , 𝑦𝑁 } 𝒙𝑛 : the 𝑛-th instance in the training data • Generalization (汎化): if a weight vector 𝒘 predicts training instances correctly, it will work for unknown instances • This is an assumption; the weight vector 𝒘 predicting training instances perfectly does not necessarily perform the best for unknown instances → over-fitting (過学習) • Find the weight vector 𝒘 such that it can predicts training instances as correctly as possible • Ideally, 𝑦𝑛 = 𝑦𝑛 for all 𝑛 ∈ 1, 𝑁 in the training data 2011-10-11 Information Communication Theory (情報伝達学) 23 Perceptron (Rosenblatt, 1957) 𝑤𝑖 = 0 for all 𝑖 ∈ [1, 𝑚] 2. Repeat: 3. (𝒙𝑛 , 𝑦𝑛 ) ⟵ Random sample from the training data 𝐷 4. 𝑦 ⟵ predict 𝒘, 𝒙𝑛 1 (if 𝒘 ⋅ 𝒙𝑛 > 0) 𝑦= 5. if 𝑦 ≠ 𝑦𝑛 then: 0 (otherwise) 6. if 𝑦𝑛 = 1 then: 7. 𝒘 ⟵ 𝒘 + 𝒙𝑛 8. else: 9. 𝒘 ⟵ 𝒘 − 𝒙𝑛 10. Until convergence (e.g., until no instance updates 𝒘) 1. 2011-10-11 Information Communication Theory (情報伝達学) 24 How perceptron algorithm works • Suppose that the current model 𝒘 misclassifies (𝒙𝑛 , 𝑦𝑛 ) • If 𝑦𝑛 = 1 then: • Update the weight vector 𝒘′ ⟵ 𝒘 + 𝒙𝑛 • If we classify 𝒙𝑛 again with the updated weights 𝒘′ : • 𝒘′ ⋅ 𝒙𝑛 = 𝒘 + 𝒙𝑛 ⋅ 𝒙𝑛 = 𝒘 ⋅ 𝒙𝑛 + 𝒙𝑛 ⋅ 𝒙𝑛 ≥ 𝒘 ⋅ 𝒙𝑛 • If 𝑦𝑛 = 0 then: • Update the weight vector 𝒘′ ⟵ 𝒘 − 𝒙𝑛 • If we classify 𝒙𝑛 again with the updated weights 𝒘′ : • 𝒘′ ⋅ 𝒙𝑛 = 𝒘 − 𝒙𝑛 ⋅ 𝒙𝑛 = 𝒘 ⋅ 𝒙𝑛 − 𝒙𝑛 ⋅ 𝒙𝑛 ≤ 𝒘 ⋅ 𝒙𝑛 2011-10-11 Information Communication Theory (情報伝達学) 25 Exercise 1: Update 𝒘 using perceptron • Training instances: • 𝒙1 , 𝑦1 = ( 1 1 1 1 1 1 0 0 0 , 1) ← Hi darling, my photo in attached file • 𝒙2 , 𝑦2 = ( 1 0 0 0 1 1 1 1 1 , 0) ← Hi Mark, Kyoto photo in attached file • Initialization • 𝒘= 0 0 0 0 0 0 0 0 0 • Then? 2011-10-11 Information Communication Theory (情報伝達学) 26 Answer 1: Update 𝒘 using perceptron • Training instances: • 𝒙1 , 𝑦1 = ( 1 1 1 1 1 1 0 0 0 , 1) ← Hi darling, my photo in attached file • 𝒙2 , 𝑦2 = ( 1 0 0 0 1 1 1 1 1 , 0) ← Hi Mark, Kyoto photo in attached file • Initialization • 𝒘= 0 0 0 0 0 0 0 0 0 • Iteration #1: pick 𝒙1 , 𝑦1 • Classification: 𝒘 ⋅ 𝒙1 = 0 (misclassification) • Update: 𝒘 ⟵ 𝒘 + 𝒙1 = 1 1 1 1 1 1 0 • Iteration #2: pick 𝒙2 , 𝑦2 • Classification: 𝒘 ⋅ 𝒙2 = 3 (misclassification) • Update: 𝒘 ⟵ 𝒘 − 𝒙2 = 0 1 1 1 0 0 −1 0 0 −1 −1 • Now the weight vector can classify both 𝒙1 and 𝒙2 correctly: • 𝒘 ⋅ 𝒙1 = 3, 𝒘 ⋅ 𝒙2 = −3 2011-10-11 Information Communication Theory (情報伝達学) 27 Python implementation (perceptron.py) def update(w, x, y): a = 0. for i in range(len(w)): a += w[i] * x[i] if a * y <= 0: for i in range(len(w)): w[i] += y * x[i] Update rule is very simple if we define negative as -1 (instead of 0) def classify(w, x): a = 0. for i in range(len(w)): a += w[i] * x[i] return (0. < a) if __name__ == '__main__': w = [0.] * 9 D = ( ((1, 1, 1, 1, 1, 1, 0, 0, 0), +1), ((1, 0, 0, 0, 1, 1, 1, 1, 1), -1), ) update(w, D[0][0], D[0][1]) update(w, D[1][0], D[1][1]) print classify(w, D[0][0]) print classify(w, D[1][0]) print w 2011-10-11 $ python perceptron.py True False [0.0, 1.0, 1.0, 1.0, 0.0, 0.0, -1.0, -1.0, -1.0] Information Communication Theory (情報伝達学) 28 Practical considerations • Perceptron algorithm can learn a training instance at a time (online training) • Suitable for spam filtering, which constantly receives new instances • Perceptron algorithm does not converge if the training set is not linearly separable (no discrimination plane exists) • We often terminate the algorithm after: • A fixed number of iterations (tuned by a development set) • Number of misclassification instances does not decrease • Perceptron algorithm often leads to poor generalization • We often use ‘averaging’ of weight vectors at each iteration for better generalization (Freund and Schapire, 1999) 2011-10-11 Information Communication Theory (情報伝達学) 29 Training – Logistic Regression Section 6.6.2 Logistic Regression (P231-P234) 2011-10-11 Information Communication Theory (情報伝達学) 30 Logistic regression (ロジスティック回帰) • A linear binary classifier (the same as perceptron) • Input: feature vector 𝒙 ∈ ℛ 𝑚 1 (if 𝑎 𝒙 = 𝒘 ⋅ 𝒙 > 0) 𝑦= • Output: prediction 𝑦 ∈ {0,1} 0 (otherwise) 𝑚 • Using: weight vector 𝒘 ∈ ℛ (𝑚: number of dimension) • In addition, conditional probability 𝑃(𝑦|𝒙) is defined by, 1 𝑃(𝑦 = 1|𝒙) = 1 + 𝑒 −𝒘⋅𝒙 𝑒 −𝒘⋅𝒙 𝑃 𝑦 =0 𝒙 =1−𝑃 𝑦 =1 𝒙 = 1 + 𝑒 −𝒘⋅𝒙 • Decision rule to classify 𝒙 to positive (𝑦 = 1) 1 1 𝑃 𝑦 = 1 𝒙 > 0.5 ⟺ > ⟺ 𝒘⋅𝒙>0 −𝒘⋅𝒙 1+𝑒 2 2011-10-11 Information Communication Theory (情報伝達学) the same 31 Sigmoid function (シグモイド関数) • Sigmoid function 𝜎(a) 1 𝜎(𝑎) = 1 + 𝑒 −𝑎 • This function maps: • Score −∞, +∞ to probability [0, 1] a • In an implementation, avoid the overflow problem when 𝑎 is negative Symmetry point at (0, 0.5) lim 𝜎(𝑎) = 0 lim 𝜎(𝑎) = 1 𝑎→−∞ 𝑎→∞ http://en.wikipedia.org/wiki/Sigmoid_function 2011-10-11 Information Communication Theory (情報伝達学) 32 Interpreting logistic regression • Compute the score (inner product) 𝑎 = 𝒘 ⋅ 𝒙 • Apply the sigmoid function to map the score 𝑎 into a probability value, 𝑃 𝑦 = 1 𝒙 = 𝜎 𝑎 = 𝜎 𝒘 ⋅ 𝒙 𝜎(a) 𝑦=1 𝑦=0 a=𝒘⋅𝒙 2011-10-11 Information Communication Theory (情報伝達学) 33 (Instance-wise) log-likelihood (対数尤度) • Given a training instance 𝒙𝑛 , 𝑦𝑛 , we compute the instance-wise log-likelihood ℓ𝑛 to assess the fitness, ℓ𝑛 ≡ log 𝑝𝑛 log 1 − 𝑝𝑛 𝑝𝑛 ≡ 𝑃 𝑦 = 1 𝒙𝑛 (if 𝑦𝑛 = 1) = 𝑦𝑛 log 𝑝𝑛 + 1 − 𝑦𝑛 log 1 − 𝑝𝑛 , (if 𝑦𝑛 = 0) 1 = , 𝑎 ≡ 𝒘 ⋅ 𝒙𝑛 1 + 𝑒 −𝑎𝑛 𝑛 • Maximum Likelihood Estimation: we would like to find the weight vector 𝑤 such that: 𝑁 maximize ℓ𝑛 𝑛=1 2011-10-11 Information Communication Theory (情報伝達学) 34 Check your understandings with example • Model: 𝒘 = 0 1 1 1 0 0 −1 −1 −1 • 𝒙1 = 1 1 1 1 1 1 0 0 0 ← Hi darling, my photo in attached file 1 • 𝑎1 = 𝒘 ⋅ 𝒙1 = 3, 𝑝1 = = 0.953 1+exp −3 • Suppose that this instance is annotated as positive (𝑦1 = 1) • 𝑙1 = log 𝑝1 = −0.0486 ⟶ 0 (maximize) • 𝒙2 = 1 0 0 0 1 1 1 1 1 ← Hi Mark, Kyoto photo in attached file 1 • 𝑎2 = 𝒘 ⋅ 𝒙2 = −3, 𝑝2 = = 0.047 1+exp +3 • Suppose that this instance is annotated as negative (𝑦2 = 0) • 𝑙2 = log(1 − 𝑝2 ) = −0.0486 ⟶ 0 (maximize) 2011-10-11 Information Communication Theory (情報伝達学) 35 Ascent Stochastic Gradient Descent for training • Optimization problem: Maximum Likelihood Estimation • maximize 𝑁 𝑛=1 ℓ𝑛 (fortunately, this objective is concave) • Stochastic Gradient Descent (SGD) (確率的勾配降下法) • Compute the gradient for an instance (batch size = 1): • Update parameters to the steepest direction 𝜕ℓ𝑛 𝜕𝒘 𝑁 ℓ𝑛 1. 2. 3. 4. 𝑤𝑖 = 0 for all 𝑖 ∈ [1, 𝑚] For 𝑡 ⟵ 1 to 𝑇: 𝜂𝑡 ⟵ 1/𝑡 (𝒙𝑛 , 𝑦𝑛 ) ⟵ Random sample from 𝐷 5. 2011-10-11 𝒘 ⟵ 𝒘 + 𝜂𝑡 𝑛=1 𝜂𝑡 𝜕𝑙𝑛 𝜕𝒘 𝜕𝑙𝑛 𝜕𝒘 Information Communication Theory (情報伝達学) 𝒘 36 Exercise 2: compute the gradient • Compute the gradients 𝜕ℓ𝑛 𝜕𝑝𝑛 𝜕𝑎𝑛 , , , 𝜕𝑝𝑛 𝜕𝑎𝑛 𝜕𝒘 and prove: 𝜕ℓ𝑛 𝜕ℓ𝑛 𝜕𝑝𝑛 𝜕𝑎𝑛 = = 𝑦𝑛 − 𝑝𝑛 𝒙𝑛 𝜕𝒘 𝜕𝑝𝑛 𝜕𝑎𝑛 𝜕𝒘 • where, ℓ𝑛 = 𝑦𝑛 log 𝑝𝑛 + 1 − 𝑦𝑛 log 1 − 𝑝𝑛 , 𝑝𝑛 = 1 , −𝑎 1+𝑒 𝑎 = 𝒘 ⋅ 𝒙𝑛 2011-10-11 Information Communication Theory (情報伝達学) 37 Answer 2: compute the gradients 𝜕ℓ𝑛 𝑦𝑛 1 − 𝑦𝑛 𝑦𝑛− 𝑝𝑛 = + ⋅ −1 = , 𝜕𝑝𝑛 𝑝 𝑛 1 − 𝑝𝑛 𝑝 𝑛 (1 − 𝑝𝑛 ) 𝜕𝑝𝑛 1 = −1 ⋅ 𝜕𝑎𝑛 1 + 𝑒 −𝑎 −𝑎 1 𝑒 −𝑎 ⋅ −1 = ⋅ 𝑒 ⋅ = 𝑝𝑛 1 − 𝑝𝑛 , 2 1 + 𝑒 −𝑎 1 + 𝑒 −𝑎 𝜕𝑎𝑛 = 𝒙𝑛 𝜕𝒘 Hence, 𝜕ℓ𝑛 𝜕ℓ𝑛 𝜕𝑝𝑛 𝜕𝑎𝑛 𝑦𝑛− 𝑝𝑛 = = ⋅ 𝑝 1 − 𝑝𝑛 ⋅ 𝒙𝑛 = 𝑦𝑛 − 𝑝𝑛 𝒙𝑛 𝜕𝒘 𝜕𝑝𝑛 𝜕𝑎𝑛 𝜕𝒘 𝑝 𝑛 (1 − 𝑝𝑛 ) 𝑛 2011-10-11 Information Communication Theory (情報伝達学) 38 Interpreting the update formula • The update formula: 𝜕𝑙𝑛 𝒘 ⟵ 𝒘 + 𝜂𝑡 = 𝒘 + 𝜂𝑡 𝑦𝑛 − 𝑝𝑛 𝒙𝑛 𝜕𝒘 • If 𝑦𝑛 = 𝑝𝑛 , no need for updating 𝒘 • If 𝑦𝑛 = 1 and 𝑝𝑛 < 1, increase the weight 𝒘 by the amount of the error 𝑦𝑛 − 𝑝𝑛 • If 𝑦𝑛 = 0 and 0 < 𝑝𝑛 , decrease the weight 𝒘 by the amount of the error 𝑦𝑛 − 𝑝𝑛 2011-10-11 Information Communication Theory (情報伝達学) 39 Python implementation (logress.py) import math import random def train(w, D, T): for t in range(1, T+1): x, y = random.choice(D) a = sum([w[i] * x[i] for i in range(len(w))]) g = y - (1. / (1. + math.exp(-a))) if -100. < a else y eta = 1. / t for i in range(len(w)): w[i] += eta * g * x[i] # 𝒘 ← 𝒘 + 𝜂𝑔𝒙 # def prob(x): a = sum([w[i] * x[i] for i in range(len(w))]) return 1. / (1 + math.exp(-a)) if -100. < a else 0. # 𝑎 𝒙 = 𝒘⋅𝒙 # 𝑝 = 1/(1 + exp 𝑎 ) if __name__ == '__main__': w = [0.] * 9 D = ( ((1, 1, 1, 1, 1, 1, 0, 0, 0), 1), ((1, 0, 0, 0, 1, 1, 1, 1, 1), 0), ) train(w, D, 10000) print prob(D[0][0]), prob(D[1][0]) print w 2011-10-11 # 𝑎 𝒙 = 𝒘⋅𝒙 # g=𝑦−𝑝 $ python logpress.py 0.93786130942 0.0601874607952 [-0.0037759046835663321, 0.90852032090533386, 0.90852032090533386, 0.90852032090533386, -0.0037759046835663321, -0.0037759046835663321, -0.91229622558889756, -0.91229622558889756, -0.91229622558889756] Information Communication Theory (情報伝達学) 40 Practical considerations • MLE often leads to overfitting • |𝒘| → ∞ as 𝑁 𝑛=1 ℓ𝑛 → 0 when the training data is linearly separable • Subject to be affected by noises in the training data • Regularization (正則化) (MAP estimation) • We introduce a penalty term when 𝒘 becomes large 𝑁 • E.g., MAP with L2 regularization: maximize 𝑛=1 ℓ𝑛 − 𝐶 𝒘 2 • 𝐶 is the parameter to control the trade-off between over/under fitting • Stopping criterion • Detecting the convergence of log-likelihood improvements 2011-10-11 Information Communication Theory (情報伝達学) 41 Other classification models • Extensions of logistic regression • Maximum Entropy Modeling (MaxEnt) (最大エントロピー法) • Multi-class logistic regression • Conditional Random Fields (CRFs) (条件付き確率場) • Predict a sequence of labels for a given sequence • Other formalizations • Support Vector Machine (SVM) (サポートベクトルマシーン) • SVM with linear kernel = linear classifier • Naïve Bayes • Neural Network • Decision Tree • … 2011-10-11 Information Communication Theory (情報伝達学) 42 Evaluation 2011-10-11 Information Communication Theory (情報伝達学) 43 Accuracy, precision, and recall Contingency table + - Predicted + 𝑎 (true positive) 𝑏 (false positive) Predicted - 𝑐 (false negative) 𝑑 (true negative) 𝑎+𝑑 𝑎+𝑏+𝑐+𝑑 𝑎 • Precision (適合率): 𝑝 = 𝑎+𝑏 𝑎 • Recall (再現率): 𝑟 = 𝑎+𝑐 2𝑝𝑟 • F1-score: 𝑝+𝑟 • Accuracy (精度): 2011-10-11 (correctness about positives and negatives) (correctness about positives) (coverage about positives) (harmonic mean of precision and recall) Information Communication Theory (情報伝達学) 44 Precision – recall curve • Precision/recall tradeoff • If we increase the precision of a system, its recall decreases • It is not easy to improve both precision and recall • We sometimes prioritize either precision or recall, depending on the task • Spam filtering: precision is important p • How can we control the tradeoff of a linear binary classifier? • Increasing the threshold: improves precision better • Decreasing the threshold: improves recall r 2011-10-11 Information Communication Theory (情報伝達学) 45 Holdout evaluation • Use ‘heldout’ data set for evaluation Training Model Evaluation Training set Separated Test set 2011-10-11 Information Communication Theory (情報伝達学) 46 N-fold cross validation (交差検定) • For example (5-fold cross validation) Data set Split into 5 groups Perf #1 test Perf #2 test Perf #3 test Perf #4 test test 2011-10-11 Average Information Communication Theory (情報伝達学) Perf #5 47 Standard experiment procedure • Three groups of data sets • Training set: used for training • Test set: used for evaluating the trained model • Development set: another test set for tuning parameters of training • This experimental setting is useful for comparing different systems • Two groups of data sets • Training set: used for N-fold cross validation • Development set: a test set for tuning parameters of training 2011-10-11 Information Communication Theory (情報伝達学) 48 Implementations 2011-10-11 Information Communication Theory (情報伝達学) 49 Implementations for classification • LIBLINEAR: http://www.csie.ntu.edu.tw/~cjlin/liblinear/ • L2-loss SVM, L1-loss SVM, logistic regression • LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/ • SVM with linear, polynominal, radial basis, sigmoid kernels • Classias: http://www.chokkan.org/software/classias/ • Averaged perceptron, L1-loss SVM, logistic regression 2011-10-11 Information Communication Theory (情報伝達学) 50 Ling-spam corpus • A mixture of 481 spam messages and 2412 messages sent via the Linguist list • http://labs-repos.iit.demokritos.gr/skel/i- config/downloads/lingspam_public.tar.gz • This corpus includes mail messages after preprocessing (lemmatization and stopword filtering) • Under ‘lemm_stop’ directory • This corpus splits the data set into ten groups • Under part1, part2, …, part10 directories 2011-10-11 Information Communication Theory (情報伝達学) 51 Files in the data set Files starting with ‘spmsg’ are spams 2011-10-11 Information Communication Theory (情報伝達学) 52 Example (lemm_stop/part1/spmsga1.txt.gz) Subject: great part-time summer job ! * * * * * * * * * * * * * * * display box credit application need place small owneroperate store area . here : 1 . introduce yourself store owner manager . 2 . our 90 % effective script tell little display box save customer hundred dollar , draw card business , $ 5 . 0 $ 15 . 0 every app send . 3 . spot counter , place box , nothing need , need name address company send commission check . compensaation $ 10 every box place . become representative earn commission $ 10 each application store . course much profitable plan , pay month small effort . call 1-888 - 703-5390 code 3 24 hours receive detail ! ! * * * ***************************************************** ***************************************************** ***************************************************** * * * * * * * * * * * * * * * * * * * * * removed our mailing list , type : b2998 @ hotmail . com ( : ) area ( remove ) subject area e - mail send . * * * * * * * * * * * ***************************************************** ***************************************************** *************** 2011-10-11 Information Communication Theory (情報伝達学) 53 Feature extraction (spam_feature.py) import sys import os import gzip def ngram(T, n): return ['%dgram=%s' % (n, '_'.join(T[i:i+n])) for i in range(len(T)-n+1)] def process(fo, fi, label): F = [] for line in fi: fields = line.strip('\n').split(' ') F += ngram(fields, 1) F += ngram(fields, 2) fo.write('%s ' % label) fo.write('%s' % ' '.join([f.replace(':', '__COLON__') for f in F])) fo.write('\n') if __name__ == '__main__': fo = sys.stdout for src in sys.argv[1:]: label = '+1' if os.path.basename(src).startswith('spmsg') else '-1' fi = gzip.GzipFile(src) if src.endswith('.gz') else open(src) process(fo, fi, label) 2011-10-11 Information Communication Theory (情報伝達学) 54 Extracted features +1 1gram=Subject__COLON__ 1gram=great 1gram=part-time 1gram=summer 1gram=job 1gram=! 2gram=Subject__COLON___great 2gram=great_part-time 2gram=part-time_summer 2gram=summer_job 2gram=job_! 1gram= 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=display 1gram=box 1gram=credit 1gram=application 1gram=need 1gram=place 1gram=small 1gram=owneroperate 1gram=store 1gram=area 1gram=. 1gram=here 1gram=__COLON__ 1gram=1 1gram=. 1gram=introduce 1gram=yourself 1gram=store 1gram=owner 1gram=manager 1gram=. 1gram=2 1gram=. 1gram=our 1gram=90 1gram=% 1gram=effective 1gram=script 1gram=tell 1gram=little 1gram=display 1gram=box 1gram=save 1gram=customer 1gram=hundred 1gram=dollar 1gram=, 1gram=draw 1gram=card 1gram=business 1gram=, 1gram=$ 1gram=5 1gram=. 1gram=0 1gram=$ 1gram=15 1gram=. 1gram=0 1gram=every 1gram=app 1gram=send 1gram=. 1gram=3 1gram=. 1gram=spot 1gram=counter 1gram=, 1gram=place 1gram=box 1gram=, 1gram=nothing 1gram=need 1gram=, 1gram=need 1gram=name 1gram=address 1gram=company 1gram=send 1gram=commission 1gram=check 1gram=. 1gram=compensaation 1gram=$ 1gram=10 1gram=every 1gram=box 1gram=place 1gram=. 1gram=become 1gram=representative 1gram=earn 1gram=commission 1gram=$ 1gram=10 1gram=each 1gram=application 1gram=store 1gram=. 1gram=course 1gram=much 1gram=profitable 1gram=plan 1gram=, 1gram=pay 1gram=month 1gram=small 1gram=effort 1gram=. 1gram=call 1gram=1-888 1gram=- 1gram=703-5390 1gram=code 1gram=3 1gram=24 1gram=hours 1gram=receive 1gram=detail 1gram=! 1gram=! 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 1gram=* 2011-10-11 Information Communication Theory (情報伝達学) 55 Training using Classias • Training with spam[01-09].txt • $ classias-train -tb -m spam.model spam0?.txt • Tagging with spam10.txt • $ classias-tag -r -m spam.model < spam10.txt • Performance evaluation • $ classias-tag -qt -m spam.model < spam10.txt • Accuracy: 0.9656 (281/291) • Micro P, R, F1: 0.9535 (41/43), 0.8367 (41/49), 0.8913 2011-10-11 Information Communication Theory (情報伝達学) 56 Trained feature weights Feature with high positive weights • • • • • • • • • • • • • • • • • • • 0.745045 0.508344 0.483244 0.385863 0.372262 0.340425 0.333442 0.305192 0.299545 0.298676 0.295115 0.294267 0.294198 0.293998 0.293348 0.292054 0.262475 0.256503 0.254161 2011-10-11 1gram=! 1gram=free 1gram=click 1gram=$ 1gram=site 1gram=com 1gram=adult 2gram=._com 1gram=want 1gram=here 2gram=http___COLON__ 1gram=http 1gram=our 1gram=remove 2gram=__COLON___/ 2gram=click_here 1gram=% 1gram=money 1gram=xxx Feature with high negative weights • • • • • • • • • • • • • • • • • • • -2.35981 -0.59796 -0.545332 -0.49624 -0.43065 -0.326052 -0.325135 -0.302419 -0.299929 -0.296032 -0.294959 -0.278405 -0.268571 -0.26763 -0.233347 -0.227835 -0.226135 -0.223516 -0.216163 __BIAS__ 1gram=language 2gram=!_! 1gram=@ 1gram=linguistic 1gram=edu 2gram=._edu 1gram=linguist 1gram=university 1gram=( 1gram=) 1gram=de 1gram=conference 1gram=english 1gram=word 1gram=available 1gram=research 1gram=linguistics 1gram=__COLON__ Information Communication Theory (情報伝達学) 57 References • Yoav Freund and Robert E. Schapire. (1999) Large margin classification using the perceptron algorithm, Machine Learning. • Frank Rosenblatt. (1957) The Perceptron - a perceiving and recognizing automaton. Report 85-460-1, Cornell Aeronautical Laboratory. 2011-10-11 Information Communication Theory (情報伝達学) 58 Spam filtering in reality 2011-10-11 Information Communication Theory (情報伝達学) 59 Spam mail (again) 2011-10-11 Information Communication Theory (情報伝達学) 60 How does SpamAssassin detect spams? Return-Path: <automail@interloper.com> X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on www234.sakura.ne.jp X-Spam-Flag: YES X-Spam-Level: *************** X-Spam-Status: Yes, score=15.4 required=7.0 tests=BAYES_00,HTML_MESSAGE, KB_DATE_CONTAINS_TAB,KB_FAKED_THE_BAT,RAZOR2_CF_RANGE_51_100, RAZOR2_CF_RANGE_E8_51_100,RAZOR2_CHECK,T_SURBL_MULTI1,T_SURBL_MULTI2, T_SURBL_MULTI3,URIBL_AB_SURBL,URIBL_JP_SURBL,URIBL_PH_SURBL,URIBL_SC_SURBL, URIBL_WS_SURBL autolearn=no version=3.3.1 X-Spam-Report: * 0.6 URIBL_PH_SURBL Contains an URL listed in the PH SURBL blocklist * [URIs: sonya201010.com.ua] * 4.5 URIBL_AB_SURBL Contains an URL listed in the AB SURBL blocklist * [URIs: sonya201010.com.ua] * 1.6 URIBL_WS_SURBL Contains an URL listed in the WS SURBL blocklist * [URIs: sonya201010.com.ua] * 1.2 URIBL_JP_SURBL Contains an URL listed in the JP SURBL blocklist * [URIs: sonya201010.com.ua] * 0.6 URIBL_SC_SURBL Contains an URL listed in the SC SURBL blocklist * [URIs: sonya201010.com.ua] * 2.8 KB_DATE_CONTAINS_TAB KB_DATE_CONTAINS_TAB * -1.9 BAYES_00 BODY: Bayes spam probability is 0 to 1% 2011-10-11 Information Communication Theory (情報伝達学) 61