EMAIL SPAM DETECTION USING MACHINE LEARNING
Lydia Song, Lauren Steimle, Xiaoxiao Xu

Outline
- Introduction to project
- Pre-processing
- Dimensionality reduction
- Brief discussion of different algorithms: k-nearest neighbors, decision tree, logistic regression, Naïve Bayes
- Preliminary results
- Conclusion

Spam Statistics
- Spam averaged 69.9% of email traffic in February 2014.
- (Figure: percentage of spam in email traffic over time.)
- Source: https://www.securelist.com/en/analysis/204792328/Spam_report_February_2014

Spam vs. Ham
- Spam = unwanted communication
- Ham = normal communication

Pre-processing
- (Figures: an example spam email and its corresponding file in the data set.)
- Steps (a code sketch of these steps appears after the algorithm slides):
  1. Remove meaningless words
  2. Create a "bag of words" used in the data set
  3. Combine similar words
  4. Create a feature matrix

Bag of Words
Each email becomes a row of word counts over the bag of words:

             history  last  ...  service
  Email 1       1      1    ...     1
  Email 2       0      3    ...     5
  ...
  Email m       2      1    ...     6

Pre-processing Example
Original email: "Your history shows that your last order is ready for refilling. Thank you, Sam Mcfarland, Customer Services"
tokens = ['your', 'history', 'shows', 'that', 'your', 'last', 'order', 'is', 'ready', 'for', 'refilling', 'thank', 'you', 'sam', 'mcfarland', 'customer', 'services']
filtered_words = ['history', 'last', 'order', 'ready', 'refilling', 'thank', 'sam', 'mcfarland', 'customer', 'services']

Bag of Words
After combining similar words (stemming), the vocabulary becomes:
bag of words = ['history', 'last', 'order', 'ready', 'refill', 'thank', 'sam', 'mcfarland', 'custom', 'service']
and the feature matrix above is indexed by the stemmed terms (histori, last, ..., servi).

Dimensionality Growth
- Roughly 100 to 150 new features are added for each additional email.
- (Figure: "Growth of Number of Features", showing the feature count rising toward 45,000 as the number of emails considered grows from 50 to 300.)

Dimensionality Reduction
- Add a requirement that a word must appear in x% of all emails to be counted as a feature.
- (Figure: "Growth of Features with Cutoff Requirement", showing the feature count staying below about 600 as the number of emails considered grows from 50 to 300, for cutoffs of 5%, 10%, 15%, and 20%.)

Dimensionality Reduction: Hashing Trick
- Before hashing: 70 x 9403 dimensions
- After hashing: 70 x 1024 dimensions
- Each word (string) is hashed to an integer hash-table index, which becomes its column. (A code sketch follows the algorithm slides.)
- (Image source: Jorge Stolfi.)

(Outline recap: next, a brief discussion of the different algorithms.)

K-Nearest Neighbors
- Goal: classify an unknown sample into one of C classes.
- Idea: to determine the label of an unknown sample x, look at x's k nearest neighbors.
- (Image from MIT OpenCourseWare.)

Decision Tree
- Convert the training data into a tree structure.
- Root node: the first decision node.
- Decision node: an if-then decision based on features of the training sample.
- Leaf node: contains a class label.
- (Image from MIT OpenCourseWare.)

Logistic Regression
- "Regression" over the training examples: y = θᵀx
- Transform the continuous output into a prediction of 1 or 0 using the standard logistic function.
- Predict spam if h_θ(x) = g(θᵀx) = 1 / (1 + e^(-θᵀx)) ≥ 0.5

Naïve Bayes
- Use Bayes' theorem: P(H | e) = P(e | H) P(H) / P(e)
- Hypothesis (H): spam or not spam; event (e): a word occurs.
- For example, the probability an email is spam given that the word "free" appears:
  P(spam | "free") = P("free" | spam) P(spam) / P("free")
- "Naïve": assume the feature values are independent of each other.
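To make the pre-processing steps concrete, here is a minimal Python sketch of steps 1 through 4: tokenize, drop stop words, stem, and build the count matrix with a document-frequency cutoff. The slides do not name an implementation (Matlab and Weka are mentioned only for the planned SVM), so the language, the NLTK stop-word list, and the Porter stemmer are all assumptions; the Porter stemmer is a plausible guess because the stemmed terms shown ('histori', 'refill', 'custom') look like its output.

```python
# Sketch only: assumes NLTK is installed and nltk.download('stopwords') has been run.
import re
from collections import Counter

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(email_text):
    """Steps 1-3: tokenize, remove meaningless (stop) words, combine similar words by stemming."""
    tokens = re.findall(r"[a-z']+", email_text.lower())
    filtered = [t for t in tokens if t not in STOP_WORDS]
    return [STEMMER.stem(t) for t in filtered]

def build_feature_matrix(emails, cutoff=0.15):
    """Step 4: bag-of-words count matrix, keeping only words that appear
    in at least `cutoff` (e.g. 15%) of all emails."""
    stemmed = [preprocess(e) for e in emails]
    doc_freq = Counter()
    for words in stemmed:
        doc_freq.update(set(words))          # count documents, not occurrences
    bag = sorted(w for w, df in doc_freq.items() if df / len(emails) >= cutoff)
    matrix = [[words.count(w) for w in bag] for words in stemmed]
    return bag, matrix

emails = [
    "Your history shows that your last order is ready for refilling. "
    "Thank you, Sam Mcfarland, Customer Services",
]
bag, X = build_feature_matrix(emails)
print(bag)  # stemmed vocabulary, e.g. ['custom', 'histori', 'last', ...]
print(X)    # one row of word counts per email
```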
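The hashing trick replaces the 9,403-word vocabulary with a fixed number of columns (1,024 on the slide) by hashing each word to a bucket index, so new emails never add new columns. A minimal sketch, again in Python; the MD5-based hash is an assumption, since the slides do not specify the hash function:

```python
import hashlib

N_BUCKETS = 1024  # number of hash buckets used on the slide

def bucket(word, n_buckets=N_BUCKETS):
    """Map a word (string) to a column index in [0, n_buckets)."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def hash_features(stemmed_words, n_buckets=N_BUCKETS):
    """Collapse a pre-processed email into a fixed-length count vector.
    Colliding words simply add their counts to the same bucket."""
    vec = [0] * n_buckets
    for w in stemmed_words:
        vec[bucket(w, n_buckets)] += 1
    return vec

row = hash_features(['histori', 'last', 'order', 'readi', 'refil'])
print(sum(row))  # 5: the counts are preserved, only the column layout changes
```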
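A worked sketch of the Naïve Bayes classifier, for illustration only: estimate P(word | spam), P(word | ham), and the class priors from training counts, then combine them with Bayes' theorem under the independence assumption. The Laplace (+1) smoothing is an added assumption so that unseen words do not zero out the product; the slides do not mention it.

```python
import math
from collections import Counter

def train_naive_bayes(emails, labels):
    """emails: lists of pre-processed words; labels: 1 = spam, 0 = ham."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter(labels)
    for words, y in zip(emails, labels):
        word_counts[y].update(words)
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab

def predict(words, word_counts, class_counts, vocab):
    """Return 1 (spam) if P(spam | words) exceeds P(ham | words)."""
    scores = {}
    for y in (0, 1):
        total = sum(word_counts[y].values())
        # log prior plus a sum of log likelihoods ("naive" independence)
        score = math.log(class_counts[y] / sum(class_counts.values()))
        for w in words:
            # Laplace smoothing so unseen words do not give probability 0
            score += math.log((word_counts[y][w] + 1) / (total + len(vocab)))
        scores[y] = score
    return max(scores, key=scores.get)

model = train_naive_bayes([['free', 'order'], ['order', 'readi']], [1, 0])
print(predict(['free'], *model))  # 1: "free" appeared only in the spam example
```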
(Outline recap: next, preliminary results and conclusions.)

Preliminary Results
- 250 emails in the training set, 50 in the testing set.
- 15% is used as the "percentage of emails" cutoff.
- Performance measures:
  - Accuracy: % of predictions that were correct
  - Recall: % of spam emails that were predicted correctly
  - Precision: % of emails classified as spam that were actually spam
  - F-Score: a combined measure of precision and recall (their harmonic mean when they are weighted equally)
- (Results figures: performance vs. the "percentage of emails" cutoff, and results for linear regression and logistic regression.)

Next Steps
- Implement SVM: Matlab vs. Weka
- Hashing trick: try different numbers of buckets
- Regularization

Thank you! Any questions?
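Backup: a minimal Python sketch of the four performance measures above, computed from the true and predicted labels of a test set; the variable names are illustrative, not taken from the project code.

```python
def evaluate(y_true, y_pred):
    """Accuracy, recall, precision, and F-score for spam (1) / ham (0) labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    recall = tp / (tp + fn) if tp + fn else 0.0     # % of spam caught
    precision = tp / (tp + fp) if tp + fp else 0.0  # % of flagged emails that are spam
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)      # harmonic mean of the two
    return accuracy, recall, precision, f_score

print(evaluate([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # approximately (0.6, 0.667, 0.667, 0.667)
```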