Machine Learning Basics with Applications to Email Spam Detection

UGR Project: Haoyu Li, Brittany Edwards, Wei Zhang, under Xiaoxiao Xu and Arye Nehorai
General background information about the process of machine learning
The process of email detection
⦿ Motivation of this project
⦿ Pre-processing of data
⦿ Classifier Models
● Evaluation of classifiers
Motivation of this project
⦿ Spam email is a nuisance for every personal email account
● 60% of emails in January 2004 were spam
● Fraud and phishing
⦿ Spam vs. ham email
Our Goal
[Figure: spam email example]
[Figure: ham email example]
Pre-processing of data
⦿ Convert capital letters to lowercase
⦿ Remove numbers and extra white space
⦿ Remove punctuation
⦿ Remove stop-words
⦿ Delete terms with length greater than 20
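The cleaning steps listed above could be sketched roughly as follows. This is only an illustration, not the project's actual code: the function name `preprocess`, the tiny `STOP_WORDS` set, and the constant `MAX_TERM_LENGTH` are assumptions introduced here, and a real stop-word list would be much longer.

```python
import re
import string

# Assumption: a tiny illustrative stop-word list; the project's real list is not given.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "you", "for"}

# Terms with length greater than 20 are deleted, as stated on the slide.
MAX_TERM_LENGTH = 20

# Translation table that turns every ASCII punctuation character into a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text: str) -> list[str]:
    """Return the cleaned list of terms for one email body."""
    text = text.lower()                       # capital letters to lowercase
    text = re.sub(r"\d+", " ", text)          # remove numbers
    text = text.translate(_PUNCT_TO_SPACE)    # remove punctuation
    tokens = text.split()                     # split() also collapses extra white space
    return [
        t for t in tokens
        if t not in STOP_WORDS and len(t) <= MAX_TERM_LENGTH
    ]
```

For example, `preprocess("WIN $1000 now!!! Click the link to claim.")` yields the terms `win`, `now`, `click`, `link`, `claim`.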
Pre-processing of data
⦿ Original email
⦿ After pre-processing
⦿ Extract terms
⦿ Reduce terms
● Keep word length < 20
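One plausible way to picture the "extract terms" step is a term list plus a matrix of per-email term counts. The sketch below assumes each email has already been turned into a list of cleaned terms; the function name `build_term_matrix` and the sorted term order are choices made here for illustration, not details taken from the project.

```python
from collections import Counter


def build_term_matrix(docs):
    """Build a term list and a document-term count matrix.

    docs is a list of token lists (one per email).
    Returns (terms, matrix) where matrix[i][j] is how often terms[j]
    occurs in docs[i].
    """
    # Sorted so that the column order is reproducible.
    terms = sorted({term for doc in docs for term in doc})
    column = {term: j for j, term in enumerate(terms)}

    matrix = []
    for doc in docs:
        row = [0] * len(terms)
        for term, count in Counter(doc).items():
            row[column[term]] = count
        matrix.append(row)
    return terms, matrix
```

Each row of the matrix can then serve as the feature vector for one email.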
Different classification methods
⦿ K Nearest Neighbor (KNN)
⦿ Naive Bayes Classifier
⦿ Logistic Regression
⦿ Decision Tree Analysis
What is K Nearest Neighbor
⦿ Use the k "closest" samples (nearest neighbors) to perform classification
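A minimal sketch of the k-nearest-neighbor idea is shown here, assuming emails are represented as numeric term-count vectors and that "closest" means smallest Euclidean distance. The slides do not say which distance or which value of k the project used, so both are assumptions, as is the function name `knn_predict`.

```python
import math
from collections import Counter


def knn_predict(train_vectors, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training samples."""
    # Pair each training sample's distance to the query with its label,
    # then sort so the closest samples come first.
    distances = sorted(
        (math.dist(vector, query), label)
        for vector, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]
```

Choosing an odd k avoids tied votes when there are only two classes (spam and ham).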
Initial outcome and strategies for improvement
⦿ KNN accuracy was ~64%, which is very low
⦿ The KNN classifier does not fit our project
⦿ The term list is still too large
⦿ Try a different classification method and see whether its evaluation results are better than the KNN results
⦿ Continue to reduce the size of the term list by removing terms that are not meaningful
Steps for improvement
⦿ Removed sparse terms
⦿ Reduced the term-length threshold
⦿ Created a hashtable
⦿ Used an alternative classifier
● Naive Bayes classifier
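The sparsity step is not detailed on the slides. One common reading is to drop terms that appear in too few emails, which might look like the sketch below. The function name `remove_sparse_terms` and the threshold `min_doc_fraction` are hypothetical, and the 5% default is an arbitrary illustration rather than the project's setting.

```python
def remove_sparse_terms(terms, matrix, min_doc_fraction=0.05):
    """Keep only terms that occur in at least `min_doc_fraction` of documents.

    terms is the term list; matrix[i][j] is the count of terms[j] in document i.
    Returns the reduced (terms, matrix) pair.
    """
    if not matrix:
        return list(terms), []

    n_docs = len(matrix)
    keep = [
        j for j in range(len(terms))
        if sum(1 for row in matrix if row[j] > 0) / n_docs >= min_doc_fraction
    ]
    kept_terms = [terms[j] for j in keep]
    kept_matrix = [[row[j] for j in keep] for row in matrix]
    return kept_terms, kept_matrix
```

Raising the threshold shrinks the term list further, at the risk of discarding rare but informative words.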
Hashtable
⦿ Calculate a hash key for each term in the term list.
⦿ When a collision occurs, use separate chaining.
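A hash table with separate chaining can be sketched as below. The slides do not give the hash function or table size, so the polynomial string hash, the default of 8 buckets, and the class name `TermHashTable` are all assumptions made for illustration.

```python
class TermHashTable:
    """Maps terms to values; colliding terms share a bucket (separate chaining)."""

    def __init__(self, n_buckets=8):
        # Each bucket is a chain: a list of (term, value) pairs.
        self.buckets = [[] for _ in range(n_buckets)]

    def _hash_key(self, term):
        # Assumed hash: simple polynomial hash reduced modulo the bucket count.
        h = 0
        for ch in term:
            h = (h * 31 + ord(ch)) % len(self.buckets)
        return h

    def insert(self, term, value):
        chain = self.buckets[self._hash_key(term)]
        for i, (existing, _) in enumerate(chain):
            if existing == term:
                chain[i] = (term, value)  # term already stored: overwrite its value
                return
        chain.append((term, value))       # new term (possibly a collision): extend the chain

    def lookup(self, term, default=None):
        for existing, value in self.buckets[self._hash_key(term)]:
            if existing == term:
                return value
        return default
```

With such a table, finding a term's column index takes roughly constant time on average instead of a scan through the whole term list.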
Naive Bayes classifier
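The slides do not say which Naive Bayes variant the project used. The sketch below assumes a multinomial model over term counts with add-one (Laplace) smoothing, computed in log space to avoid underflow. The function names `train_naive_bayes` and `predict_naive_bayes` are introduced here for illustration only.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Collect class frequencies and per-class term counts from labeled token lists."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc)
        vocab.update(doc)
    return class_counts, term_counts, vocab


def predict_naive_bayes(model, doc):
    """Return the class with the highest log posterior for the token list `doc`."""
    class_counts, term_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        total_terms = sum(term_counts[label].values())
        score = math.log(n / n_docs)  # log prior
        for term in doc:
            if term in vocab:  # terms never seen in training are ignored
                # Add-one smoothing keeps unseen (term, class) pairs from zeroing the product.
                score += math.log(
                    (term_counts[label][term] + 1) / (total_terms + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

The "naive" part is the assumption that terms occur independently of one another given the class, which is what lets the per-term log probabilities simply be added.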
Secondary Results
⦿ Accuracy increased from 62% to 82.36%
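Assuming "correctness" here means the fraction of test emails labeled correctly, the figure could be computed as in this small sketch. The function name `accuracy` is chosen for illustration, and the slides do not describe how the data was split for evaluation.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)
```

For spam filtering, accuracy alone can hide an important asymmetry: marking a ham email as spam is usually more costly than letting a spam email through.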
Suggestions for further improvement
⦿Revise pre-processing
⦿Apply additional classifiers
Thank you
⦿Questions?