Machine Learning Basics with Applications to Email Spam Detection
UGR PROJECT - HAOYU LI, BRITTANY EDWARDS, WEI ZHANG
UNDER XIAOXIAO XU AND ARYE NEHORAI

General background information about the process of machine learning

The process of email detection
⦿ Motivation of this project
⦿ Pre-processing of data
⦿ Classifier Models
⦿ Evaluation of classifiers

Motivation of this project
⦿ Spam email plagues every personal email account
● 60% of January 2004 emails were spam
● Fraud & phishing
⦿ Spam vs. ham email

Our Goal

Spam Email example

Ham Email example

Pre-processing of data
⦿ Convert capital letters to lowercase
⦿ Remove numbers and extra white space
⦿ Remove punctuation
⦿ Remove stop-words
⦿ Delete terms with length greater than 20

Pre-processing of data
⦿ Original email

Pre-processing of data
⦿ After pre-processing

Pre-processing of data
⦿ Extract terms

Pre-processing of data
⦿ Reduce terms
● Keep word length < 20

Different classification methods
⦿ K Nearest Neighbor (KNN)
⦿ Naive Bayes Classifier
⦿ Logistic Regression
⦿ Decision Tree Analysis

What is K Nearest Neighbor
⦿ Use the k "closest" samples (nearest neighbors) to perform classification

Initial outcome and strategies for improvement
⦿ KNN accuracy was ~64% - very low
⦿ The KNN classifier does not fit our project
⦿ The term list is still too large
⦿ Try a different classification method and see whether its evaluation results are better than the KNN results
⦿ Continue to reduce the size of the term list by removing terms that are not meaningful

Steps for improvement
⦿ Remove sparsity
⦿ Reduce the length threshold
⦿ Create a hashtable
⦿ Use an alternative classifier
● Naive Bayes Classifier

Hashtable
⦿ Calculate a hash key for each term in the term list
⦿ When a collision occurs, use separate chaining

Naive Bayes Classifier

Secondary Results
⦿ Correctness increases from 62% to 82.36%

Suggestions for further improvement
⦿ Revise pre-processing
⦿ Apply additional classifiers

Thank you
⦿ Questions?
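The slides do not include the project's actual code. As an illustration only, the pre-processing steps listed above (lowercase, remove numbers, punctuation, extra whitespace, and stop-words, then drop over-long terms) could be sketched as follows; the stop-word list here is a placeholder assumption, not the one the project used:

```python
import re

# Placeholder stop-word list; the project's actual list is not specified.
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "you"}

def preprocess(email_text):
    """Apply the pre-processing steps from the slides."""
    text = email_text.lower()                    # capital letters -> lowercase
    text = re.sub(r"\d+", " ", text)             # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)         # remove punctuation
    terms = text.split()                         # split() also collapses extra whitespace
    terms = [t for t in terms if t not in STOP_WORDS]   # remove stop-words
    terms = [t for t in terms if len(t) <= 20]   # delete terms longer than 20 characters
    return terms

print(preprocess("WIN $1,000,000 NOW!!! Click the link to claim your prize."))
# -> ['win', 'now', 'click', 'link', 'claim', 'your', 'prize']
```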
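The KNN idea from the slides, classifying by majority vote among the k closest training samples, can be sketched like this. The toy term-count vectors and the use of Euclidean distance are illustrative assumptions, not the project's actual features:

```python
from collections import Counter

def knn_classify(query_vec, train_vecs, train_labels, k=3):
    """Majority vote among the k training samples with the
    smallest Euclidean distance to the query vector."""
    dists = []
    for vec, label in zip(train_vecs, train_labels):
        d = sum((a - b) ** 2 for a, b in zip(query_vec, vec)) ** 0.5
        dists.append((d, label))
    dists.sort(key=lambda pair: pair[0])          # nearest first
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy feature vectors: spam emails cluster near (1, 1), ham near (0, 0).
train_vecs = [(1.0, 1.0), (0.9, 1.1), (0.0, 0.1), (0.1, 0.0)]
train_labels = ["spam", "spam", "ham", "ham"]
print(knn_classify((0.8, 0.9), train_vecs, train_labels, k=3))  # -> spam
```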
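The hashtable step (hash key per term, separate chaining on collision) might look like the sketch below. The bucket count and the use of Python's built-in `hash` are assumptions standing in for whatever hash function the project chose:

```python
class ChainedHashTable:
    """Minimal hash table that resolves collisions by separate
    chaining: each bucket holds a list of (term, count) pairs."""

    def __init__(self, num_buckets=101):
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, term):
        # Calculate the hash key for a term.
        return hash(term) % len(self.buckets)

    def put(self, term, count):
        bucket = self.buckets[self._index(term)]
        for i, (key, _) in enumerate(bucket):
            if key == term:
                bucket[i] = (term, count)   # term already present: update
                return
        bucket.append((term, count))        # collision: append to the chain

    def get(self, term, default=0):
        for key, value in self.buckets[self._index(term)]:
            if key == term:
                return value
        return default

table = ChainedHashTable()
table.put("free", 12)
table.put("viagra", 7)
print(table.get("free"))  # -> 12
```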
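The alternative classifier that raised accuracy, the Naive Bayes classifier, is not shown in the slides; a minimal multinomial Naive Bayes with Laplace smoothing, operating on the term lists produced by pre-processing, might look like this (the toy training emails are invented for illustration):

```python
import math
from collections import Counter

def train_nb(emails, labels):
    """Count term frequencies per class. `emails` is a list of
    term lists (the output of pre-processing)."""
    counts = {"spam": Counter(), "ham": Counter()}
    docs = Counter(labels)
    for terms, label in zip(emails, labels):
        counts[label].update(terms)
    vocab = set(counts["spam"]) | set(counts["ham"])
    return counts, docs, vocab

def classify_nb(terms, counts, docs, vocab):
    """Pick the class with the highest log posterior,
    using Laplace (add-one) smoothing."""
    total = sum(docs.values())
    best, best_score = None, -math.inf
    for label in counts:
        score = math.log(docs[label] / total)               # log prior
        denom = sum(counts[label].values()) + len(vocab)
        for t in terms:
            score += math.log((counts[label][t] + 1) / denom)  # smoothed likelihood
        if score > best_score:
            best, best_score = label, score
    return best

emails = [["win", "money", "now"], ["cheap", "money"],
          ["meeting", "tomorrow"], ["project", "meeting"]]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(emails, labels)
print(classify_nb(["win", "money"], *model))  # -> spam
```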