Progress Presentation

EMAIL SPAM DETECTION USING MACHINE LEARNING
Lydia Song, Lauren Steimle, Xiaoxiao Xu
Outline

• Introduction to Project
• Pre-processing
• Dimensionality Reduction
• Brief discussion of different algorithms
  – K-nearest neighbors
  – Decision tree
  – Logistic regression
  – Naïve Bayes
• Preliminary results
• Conclusion
Spam Statistics
• The percentage of spam in email traffic averaged 69.9% in February 2014

[Chart: percentage of spam in email traffic over time]
Source: https://www.securelist.com/en/analysis/204792328/Spam_report_February_2014
Spam vs. Ham
• Spam = unwanted communication
• Ham = normal communication
Pre-processing
[Images: an example spam email and its corresponding file in the data set]
Pre-processing
1. Remove meaningless words
2. Create a “bag of words” used in the data set
3. Combine similar words
4. Create a feature matrix
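A minimal Python sketch of steps 1 and 3, assuming NLTK's English stop-word list and Porter stemmer as stand-ins for whatever word list and stemmer the project actually used:

```python
# Sketch of steps 1 and 3. Assumes NLTK is installed (pip install nltk)
# and its stop-word list is downloaded
# (python -c "import nltk; nltk.download('stopwords')").
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP_WORDS = set(stopwords.words("english"))  # the "meaningless" words to drop
STEMMER = PorterStemmer()                     # maps similar words to one stem

def preprocess(email_text):
    """Return the stemmed, stop-word-free tokens of one email."""
    tokens = re.findall(r"[a-z]+", email_text.lower())     # split into lowercase words
    filtered = [w for w in tokens if w not in STOP_WORDS]  # step 1: remove meaningless words
    return [STEMMER.stem(w) for w in filtered]             # step 3: combine similar words

print(preprocess("Your history shows that your last order is ready for refilling."))
# -> stems such as ['histori', 'show', 'last', 'order', 'readi', 'refill']
#    (exact output depends on the stemmer)
```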
Bag of Words
           history   last   ...   service
Email 1       1        1    ...      1
Email 2       0        3    ...      5
...
Email m       2        1    ...      6
Pre-processing Example
Your history shows that your last order is
ready for refilling.
Thank you,
Sam Mcfarland
Customer Services

tokens = ['your', 'history', 'shows', 'that', 'your', 'last', 'order', 'is', 'ready', 'for', 'refilling', 'thank', 'you', 'sam', 'mcfarland', 'customer', 'services']

filtered_words = ['history', 'last', 'order', 'ready', 'refilling', 'thank', 'sam', 'mcfarland', 'customer', 'services']
Bag of Words
           histori   last   ...   servi
Email 1       1        1    ...     1
Email 2       0        3    ...     5
...
Email m       2        1    ...     6

bag of words = ['history', 'last', 'order', 'ready', 'refill', 'thank', 'sam', 'mcfarland', 'custom', 'service']
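Steps 2 and 4 in plain Python, as an illustrative sketch (`bag_of_words_matrix` is a hypothetical helper, not the project's code):

```python
from collections import Counter

def bag_of_words_matrix(stemmed_emails):
    """Build the bag of words and the count matrix (rows = emails, columns = words)."""
    bag = sorted({w for email in stemmed_emails for w in email})  # step 2: bag of words
    counts = [Counter(email) for email in stemmed_emails]
    matrix = [[c[word] for word in bag] for c in counts]          # step 4: feature matrix
    return bag, matrix

# Toy input: two already-stemmed emails.
bag, X = bag_of_words_matrix([["histori", "last", "servic"],
                              ["last", "last", "last", "servic", "servic"]])
print(bag)  # ['histori', 'last', 'servic']
print(X)    # [[1, 1, 1], [0, 3, 2]]
```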
Dimensionality Growth

• Add ~100-150 features for each additional email

[Chart: Growth of Number of Features vs. number of emails considered; the feature count grows into the tens of thousands as the email count rises from 50 to 300]
Dimensionality Reduction

• Add a requirement that words must appear in x% of all emails to be considered a feature

[Chart: Growth of Features with Cutoff Requirement; with cutoffs of 5%, 10%, 15%, and 20%, the feature count stays in the hundreds (below ~600) as the number of emails considered rises from 50 to 300]
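One way to implement the cutoff: count each word's document frequency and keep it only if it reaches the threshold. A sketch with a hypothetical `apply_cutoff` helper:

```python
from collections import Counter

def apply_cutoff(stemmed_emails, cutoff=0.15):
    """Keep a word as a feature only if it appears in at least `cutoff` of all emails."""
    doc_freq = Counter()
    for email in stemmed_emails:
        doc_freq.update(set(email))  # count each word at most once per email
    n = len(stemmed_emails)
    return sorted(w for w, df in doc_freq.items() if df / n >= cutoff)

# Toy example: "last" appears in 2 of 2 emails, the other words in only 1.
print(apply_cutoff([["histori", "last"], ["last", "servic"]], cutoff=0.6))
# -> ['last']
```

scikit-learn's CountVectorizer exposes the same idea through its min_df parameter, which accepts a proportion such as 0.15.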
Dimensionality Reduction: Hashing Trick

• Before hashing: 70 x 9403 dimensions
• After hashing: 70 x 1024 dimensions

[Diagram: a hash table maps each word (string) to an integer index]

Source: Jorge Stolfi
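A sketch of the trick itself: hash each word string to one of a fixed number of buckets (1024 here, matching the slide), so the feature dimension stays constant however many distinct words appear. MD5 is used as a stable stand-in hash because Python's built-in hash() varies between runs:

```python
import hashlib

N_BUCKETS = 1024  # fixed feature dimension after hashing

def bucket(word):
    """Map a word (string) to an integer index via a stable hash."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % N_BUCKETS

def hashed_row(stemmed_email):
    """Fixed-width count vector; colliding words share a bucket."""
    row = [0] * N_BUCKETS
    for word in stemmed_email:
        row[bucket(word)] += 1
    return row

print(len(hashed_row(["histori", "last", "servic"])))  # -> 1024, regardless of vocabulary size
```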
K-Nearest Neighbors

• Goal: classify an unknown sample into one of C classes
• Idea: to determine the label of an unknown sample x, look at x's k nearest neighbors

Image from MIT OpenCourseWare
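A minimal k-NN sketch over toy word-count rows; Euclidean distance and a simple majority vote are assumptions, since the slides do not name the distance measure:

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Label x by majority vote among its k nearest training samples."""
    dists = sorted((math.dist(x, xi), yi) for xi, yi in zip(X_train, y_train))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: rows are word-count vectors; 1 = spam, 0 = ham.
X_train = [[3, 0, 1], [2, 1, 0], [0, 4, 5], [1, 3, 4]]
y_train = [1, 1, 0, 0]
print(knn_predict(X_train, y_train, [2, 0, 1], k=3))  # -> 1 (spam)
```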
Decision Tree

• Convert training data into a tree structure
  – Root node: the first decision node
  – Decision node: an if-then decision based on features of the training sample
  – Leaf node: contains a class label

Image from MIT OpenCourseWare
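A sketch with scikit-learn's DecisionTreeClassifier on hypothetical toy data (using scikit-learn here is an assumption; the slides do not name a library):

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: word-count rows; 1 = spam, 0 = ham.
X_train = [[3, 0, 1], [2, 1, 0], [0, 4, 5], [1, 3, 4]]
y_train = [1, 1, 0, 0]

# Internally, each decision node is an if-then test on one feature,
# and each leaf holds a class label.
tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
print(tree.predict([[2, 0, 1]]))  # -> [1] (spam)
```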
Logistic Regression

• “Regression” over training examples: $y = \theta^T x$
• Transform the continuous y into a prediction of 1 or 0 using the standard logistic function
• Predict spam if

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}} \ge 0.5$$
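The decision rule in a few lines of NumPy, with a hypothetical already-trained weight vector theta (how theta is fit, e.g. by gradient descent, is omitted here):

```python
import numpy as np

def predict_spam(theta, x):
    """h_theta(x) = 1 / (1 + e^(-theta^T x)); classify as spam iff h >= 0.5."""
    h = 1.0 / (1.0 + np.exp(-theta @ x))
    return h >= 0.5

theta = np.array([0.8, -1.2, 0.3])  # hypothetical learned weights
x = np.array([2.0, 0.0, 1.0])       # one email's feature vector
print(predict_spam(theta, x))       # theta^T x = 1.9, h ≈ 0.87 -> True (spam)
```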
Naïve Bayes

• Use Bayes' Theorem: $P(H \mid e) = \frac{P(e \mid H)\,P(H)}{P(e)}$
• Hypothesis (H): spam or not spam
• Event (e): a word occurs
• For example, the probability an email is spam given that the word “free” appears in it:

$$P(\text{spam} \mid \text{“free”}) = \frac{P(\text{“free”} \mid \text{spam})\,P(\text{spam})}{P(\text{“free”})}$$

• “Naïve”: assume the feature values are independent of each other
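A worked version of the “free” example with hypothetical training counts; the three probabilities are estimated from the counts and plugged into Bayes' Theorem:

```python
# Hypothetical counts from a training set, for illustration only.
n_emails = 300
n_spam = 180           # emails labeled spam
n_free = 90            # emails containing "free"
n_free_in_spam = 81    # spam emails containing "free"

p_spam = n_spam / n_emails                   # P(spam)          = 0.60
p_free = n_free / n_emails                   # P("free")        = 0.30
p_free_given_spam = n_free_in_spam / n_spam  # P("free" | spam) = 0.45

# Bayes' Theorem: P(spam | "free") = P("free" | spam) * P(spam) / P("free")
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 2))  # 0.45 * 0.60 / 0.30 = 0.9
```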
Preliminary Results

• 250 emails in the training set, 50 in the testing set
• Use 15% as the “percentage of emails” cutoff
• Performance measures:
  – Accuracy: % of predictions that were correct
  – Recall: % of spam emails that were predicted correctly
  – Precision: % of emails classified as spam that were actually spam
  – F-score: the harmonic mean of precision and recall
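The four measures computed directly from label comparisons (a sketch; spam is encoded as 1 and the F-score shown is F1):

```python
def scores(y_true, y_pred, spam=1):
    """Accuracy, recall, precision, and F1 from true vs. predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == spam and p == spam for t, p in pairs)  # spam caught
    fp = sum(t != spam and p == spam for t, p in pairs)  # ham flagged as spam
    fn = sum(t == spam and p != spam for t, p in pairs)  # spam missed
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f1

print(scores([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
# -> (0.6, 0.666..., 0.666..., 0.666...)
```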
“Percentage of Emails” Performance

[Charts: performance of Linear Regression and Logistic Regression across “percentage of emails” cutoffs]

Preliminary Results

[Table: preliminary results]
Next Steps

• Implement SVM: MATLAB vs. Weka
• Hashing trick: try different numbers of buckets
• Regularization
Thank you! Any questions?