Machine Learning Basics with Applications to Email Spam Detection
UGR PROJECT - HAOYU LI, BRITTANY EDWARDS, WEI ZHANG
UNDER XIAOXIAO XU AND ARYE NEHORAI

General background information about the process of machine learning

The process of email detection
⦿ Motivation of this project
⦿ Pre-processing of data
⦿ Classifier Models
⦿ Evaluation of classifiers

Motivation of this project
⦿ Spam email plagues every personal email account
● 60% of January 2004 emails were spam
● Fraud & phishing
⦿ Spam vs. ham email

Our Goal

Spam Email example

Ham Email example

Pre-processing of data
⦿ Convert capital letters to lowercase
⦿ Remove numbers and extra white space
⦿ Remove punctuation
⦿ Remove stop-words
⦿ Delete terms with length greater than 20

Pre-processing of data
⦿ Original email

Pre-processing of data
⦿ After pre-processing

Pre-processing of data
⦿ Extract terms

Pre-processing of data
⦿ Reduce terms
● Keep word length < 20

Different classification methods
⦿ K Nearest Neighbor (KNN)
⦿ Naive Bayes Classifier
⦿ Logistic Regression
⦿ Decision Tree Analysis

What is K Nearest Neighbor
⦿ Use the k "closest" samples (nearest neighbors) to perform classification

Initial outcome and strategies for improvement
⦿ KNN accuracy was ~64% - very low
⦿ The KNN classifier does not fit our project
⦿ The term list is still too large
⦿ Try a different classification method and see whether its evaluation results are better than the KNN results
⦿ Continue to reduce the size of the term list by removing terms that are not meaningful

Steps for improvement
⦿ Remove sparsity
⦿ Reduce the length threshold
⦿ Create a hashtable
⦿ Use an alternative classifier
● Naive Bayes Classifier

Hashtable
⦿ Calculate a hash key for each term in the term list
⦿ When a collision occurs, use separate chaining

Naive Bayes Classifier

Secondary Results
⦿ Correctness increases from 62% to 82.36%

Suggestions for further improvement
⦿ Revise pre-processing
⦿ Apply additional classifiers

Thank you
⦿ Questions?
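The slides do not include the project's actual code. As an illustration only, the pre-processing steps listed above (lowercase, remove numbers, punctuation, extra whitespace, and stop-words, then drop over-long terms) could be sketched as follows; the stop-word list here is a placeholder assumption, not the one the project used:

```python
import re

# Placeholder stop-word list; the project's actual list is not specified.
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "you"}

def preprocess(email_text):
    """Apply the pre-processing steps from the slides."""
    text = email_text.lower()                    # capital letters -> lowercase
    text = re.sub(r"\d+", " ", text)             # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)         # remove punctuation
    terms = text.split()                         # split() also collapses extra whitespace
    terms = [t for t in terms if t not in STOP_WORDS]   # remove stop-words
    terms = [t for t in terms if len(t) <= 20]   # delete terms longer than 20 characters
    return terms

print(preprocess("WIN $1,000,000 NOW!!! Click the link to claim your prize."))
# -> ['win', 'now', 'click', 'link', 'claim', 'your', 'prize']
```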
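The KNN idea from the slides, classifying by majority vote among the k closest training samples, can be sketched like this. The toy term-count vectors and the use of Euclidean distance are illustrative assumptions, not the project's actual features:

```python
from collections import Counter

def knn_classify(query_vec, train_vecs, train_labels, k=3):
    """Majority vote among the k training samples with the
    smallest Euclidean distance to the query vector."""
    dists = []
    for vec, label in zip(train_vecs, train_labels):
        d = sum((a - b) ** 2 for a, b in zip(query_vec, vec)) ** 0.5
        dists.append((d, label))
    dists.sort(key=lambda pair: pair[0])          # nearest first
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy feature vectors: spam emails cluster near (1, 1), ham near (0, 0).
train_vecs = [(1.0, 1.0), (0.9, 1.1), (0.0, 0.1), (0.1, 0.0)]
train_labels = ["spam", "spam", "ham", "ham"]
print(knn_classify((0.8, 0.9), train_vecs, train_labels, k=3))  # -> spam
```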
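The hashtable step (hash key per term, separate chaining on collision) might look like the sketch below. The bucket count and the use of Python's built-in `hash` are assumptions standing in for whatever hash function the project chose:

```python
class ChainedHashTable:
    """Minimal hash table that resolves collisions by separate
    chaining: each bucket holds a list of (term, count) pairs."""

    def __init__(self, num_buckets=101):
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, term):
        # Calculate the hash key for a term.
        return hash(term) % len(self.buckets)

    def put(self, term, count):
        bucket = self.buckets[self._index(term)]
        for i, (key, _) in enumerate(bucket):
            if key == term:
                bucket[i] = (term, count)   # term already present: update
                return
        bucket.append((term, count))        # collision: append to the chain

    def get(self, term, default=0):
        for key, value in self.buckets[self._index(term)]:
            if key == term:
                return value
        return default

table = ChainedHashTable()
table.put("free", 12)
table.put("viagra", 7)
print(table.get("free"))  # -> 12
```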
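The alternative classifier that raised accuracy, the Naive Bayes classifier, is not shown in the slides; a minimal multinomial Naive Bayes with Laplace smoothing, operating on the term lists produced by pre-processing, might look like this (the toy training emails are invented for illustration):

```python
import math
from collections import Counter

def train_nb(emails, labels):
    """Count term frequencies per class. `emails` is a list of
    term lists (the output of pre-processing)."""
    counts = {"spam": Counter(), "ham": Counter()}
    docs = Counter(labels)
    for terms, label in zip(emails, labels):
        counts[label].update(terms)
    vocab = set(counts["spam"]) | set(counts["ham"])
    return counts, docs, vocab

def classify_nb(terms, counts, docs, vocab):
    """Pick the class with the highest log posterior,
    using Laplace (add-one) smoothing."""
    total = sum(docs.values())
    best, best_score = None, -math.inf
    for label in counts:
        score = math.log(docs[label] / total)               # log prior
        denom = sum(counts[label].values()) + len(vocab)
        for t in terms:
            score += math.log((counts[label][t] + 1) / denom)  # smoothed likelihood
        if score > best_score:
            best, best_score = label, score
    return best

emails = [["win", "money", "now"], ["cheap", "money"],
          ["meeting", "tomorrow"], ["project", "meeting"]]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(emails, labels)
print(classify_nb(["win", "money"], *model))  # -> spam
```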