SPAM EMAIL PREDICTION AND DETECTION SYSTEM USING MACHINE LEARNING BY ARIJE JOHN OGO-OLUWA (MATRICULATION NUMBER: CYS/16/9964) SUBMITTED TO THE DEPARTMENT OF CYBER SECURITY, THE FEDERAL UNIVERSITY OF TECHNOLOGY AKURE (FUTA), ONDO STATE, NIGERIA, IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE AWARD OF BACHELOR OF TECHNOLOGY (B. TECH) IN CYBER SECURITY. DECEMBER, 2022. CERTIFICATION I certify that this project work was carried out by me and has not been presented elsewhere for the award of any degree or for any other purpose. STUDENT'S NAME: ARIJE JOHN OGO-OLUWA SIGNATURE …………………… DATE ………………………. This is to certify that this work was carried out by ARIJE JOHN OGO-OLUWA with matriculation number CYS/16/9964 of the Department of Cyber Security, The Federal University of Technology, Akure, Nigeria. SUPERVISOR'S NAME: DR A.F. THOMPSON SIGNATURE …………………. DATE ………………………… DEDICATION This report is dedicated to Almighty God, who granted me good health, guided and protected me all through these years, and made this project a success. I also dedicate it to my parents, Pst. and Mrs. Arije, for their unending support. ACKNOWLEDGEMENT I give God Almighty the glory for His mercy, grace, favour and love that kept me through my undergraduate days. My appreciation goes to my project supervisor and the HOD, Dr. A.F. Thompson, for her correction, direction, guidance and supervision. Special thanks to the entire staff of the Cyber Security Department, Federal University of Technology, Akure, for the knowledge, skills, and values that I have been exposed to, which gave me a bedrock to undertake this project. I also appreciate my loving family; this would have been impossible without them. I pray that God would bless them all. ABSTRACT Email communication is now essential, but the email spam problem is widespread and difficult to control. To detect spam, a collaborative spam detection system is proposed using the Python language and machine learning: a machine learning system that separates spam email from legitimate (ham) email. This project presents a complete collaborative spam detection system built with an efficient, standard machine learning software tool. The resulting spam detection system outperforms prior approaches in detection results and is applicable to real-world use. CHAPTER ONE INTRODUCTION 1.1 Background of the Study Commercialization of the internet and the integration of electronic mail as an accessible means of communication have another face - the influx of unwanted information and mails. As the internet started to gain popularity in the early 1990s, it was quickly recognized as an excellent advertising tool. At practically no cost, a person can use the internet to send an email message to thousands of people. These unsolicited junk electronic mails came to be called 'Spam'. The history of spam is intertwined with the history of electronic mail. While the linguistic significance of the usage of the word 'spam' is attributed to the British comedy troupe Monty Python in a now legendary sketch from their Flying Circus TV series, in which a group of Vikings sing a chorus of "SPAM, SPAM, SPAM..." at increasing volumes, the historic significance lies in it being adopted to refer to unsolicited commercial electronic mail sent to a large number of addresses, in what was seen as drowning out normal communication on the internet. The first known email spam (although not yet called that) was sent on May 3, 1978 to several hundred users on ARPANET.
It was an advertisement for a presentation by Digital Equipment Corporation for their DECSYSTEM-20 products, sent by Gary Thuerk, a marketer of theirs. The reaction to it was almost universally negative, and for a long time there were no further instances. The name "spam" was actually first applied, in April 1993, not to an email, but to unwanted postings on the Usenet newsgroup network. Richard Depew accidentally posted 200 messages to news.admin.policy, and in the aftermath readers of this group were making jokes about the accident, when one person referred to the messages as "spam", coining the term that would later be applied to similar incidents over email. On January 18, 1994, the first large-scale deliberate USENET spam occurred. A message with the subject "Global Alert for All: Jesus is Coming Soon" was cross-posted to every available newsgroup. Its controversial message sparked many debates all across USENET. In April 1994 the first commercial USENET spam arrived. Two lawyers from Phoenix, Canter and Siegel, hired a programmer to post their "Green Card Lottery - Final One?" message to as many newsgroups as possible. What made them different was that they did not hide the fact that they were spammers. They were proud of it, and thought it was great advertising. They even went on to write the book "How to Make a Fortune on the Information Superhighway: Everyone's Guerrilla Guide to Marketing on the Internet and Other On-Line Services". They planned on opening a consulting company to help other people post similar advertisements, but it never took off. In June 2003 Meng Weng Wong started the SPF-discuss mailing list and posted the very first version of the "Sender Permitted From" proposal, which would later become the Sender Policy Framework, a simple email-validation system designed to detect email spoofing as part of the solution to spam. The CAN-SPAM Act of 2003 was signed into law by President George W. Bush on December 16, 2003, establishing the United States' first national standards for the sending of commercial email and requiring the Federal Trade Commission (FTC) to enforce its provisions. The backronym CAN-SPAM derives from the bill's full name: "Controlling the Assault of Non-Solicited Pornography And Marketing Act of 2003". It plays on the word "canning" (putting an end to) spam, as in the usual term for unsolicited email of this type, as well as being a pun in reference to the canned SPAM food product. The bill was sponsored in Congress by Senators Conrad Burns and Ron Wyden. In January 2004 Bill Gates of Microsoft announced that "spam will soon be a thing of the past." In May 2004, Howard Carmack of Buffalo, New York was sentenced to 3½ to 7 years for sending 800 million messages using stolen identities. In May 2003 he also lost a $16 million civil lawsuit to EarthLink. On September 27, 2004, Nicholas Tombros pleaded guilty to charges and became the first spammer to be convicted under the CAN-SPAM Act of 2003. He was sentenced in July 2007 to three years' probation, six months' house arrest, and fined $10,000. On November 4, 2004, Jeremy Jaynes, rated the 8th-most prolific spammer in the world according to Spamhaus, was convicted of three felony charges of using servers in Virginia to send thousands of fraudulent emails. The court recommended a sentence of nine years' imprisonment, which was imposed in April 2005, although the start of the sentence was deferred pending appeals. Jaynes claimed to have an income of $750,000 a month from his spamming activities.
On February 29, 2008 the Supreme Court of Virginia overturned his conviction. On November 8, 2004, Nick Marinellis of Sydney, Australia, was sentenced to 4⅓ to 5¼ years for sending Nigerian 419 emails. On December 31, 2004, British authorities arrested Christopher Pierson in Lincolnshire, UK and charged him with malicious communication and causing a public nuisance. On January 3, 2005, he pleaded guilty to sending hoax emails to relatives of people missing following the Asian tsunami disaster. 1.2 Statement of Problem The sheer volume of junk email being sent every day has normalized the occurrence of spam, and this has become a major problem. In fact, spam emails grossly outnumber legitimate ones. For May 2019 alone, spam emails constituted almost 85% of the total volume of emails being sent globally. That is a whopping 367 billion spam emails per day, compared to a relatively paltry 64 billion emails that are legitimate. For many email users, especially those who have gotten used to seeing unsolicited emails in their inbox day after day, junk mail has evolved from being a cause of alarm to something that is more of a mundane matter. Nowadays, many of us view spam as something normal, something that we just have to learn to deal with. Most people no longer view it as the threat that it actually is. It goes without saying that spam is a nuisance for all of us. Having to individually scroll through and delete unwanted emails wastes valuable time and bandwidth. The time you spend filtering emails in a day may not be much, but over the course of a year, it really does add up. Junk emails waste a lot of time and effort that could have been used for something more productive, but that is not even the worst part. Spam is also a popular means of transferring harmful malware and electronic viruses. And in an age where hacking tools and techniques grow more and more sophisticated by the minute, spam-instigated security attacks become a perpetual threat. Spam emails are also an avenue for marketers to exploit your data privacy. Responding to just one unsolicited email could put you on the mailing lists of many other companies. Before you know it, your spam emails would have already multiplied tenfold. A spam email made to look like it came from a legitimate entity that you trust (like your bank or someone you know) could end up stealing sensitive information if you are not careful. You could be a victim of identity theft, or you could lose all your money if you mistakenly hand over your bank details. 1.3 Motivation We live in an age where technology has revolutionized the way we communicate, but it has also given rise to unwanted and often malicious emails, commonly known as spam. Spam emails can not only clog up our inboxes but also pose a threat to our personal information and security. That is why it is essential to have a reliable and effective method to detect and filter out spam emails. The traditional approach of using a single machine learning algorithm to detect spam emails can be prone to errors such as overfitting and poor robustness, as discussed by Xue Ying et al. (2018), especially when dealing with complex and non-linear relationships in the data. Nayak, Amirali Jiwani & Rajitha (2021) made use of a hybrid strategy that combined Naive Bayes (NB) and Decision Tree (DT) algorithms to identify spam e-mails. They were able to obtain an accuracy of 88.12% using their hybrid approach. That is where ensemble methods come in.
By combining the predictions of multiple machine learning algorithms, an ensemble method can provide a more robust and accurate representation of the underlying relationships in the data. The result is a more reliable and effective method for detecting spam emails. In this project, the aim is to design an ensemble method that combines Logistic Regression, Naive Bayes, and Support Vector Machine models. The goal is to improve the accuracy, robustness and efficiency of the spam email detection system, providing better protection for our personal information and security. 1.4 Objective The aim of this project is to enhance the accuracy of the predictions and increase the stability of the model. The objectives of this project work are to: 1. Design Logistic Regression, Naive Bayes, and Support Vector Machine models independently through the use of machine learning techniques; 2. Design an ensemble method that combines the Logistic Regression, Naive Bayes, and Support Vector Machine models using machine learning techniques; 3. Implement the ensemble model designed in (2). 1.5 Methodology A detailed review of relevant literature on email spamming and prediction systems will be carried out. The integral part of this research is data collection, so the first step consists of gathering data. The next step is feature extraction: extract the attributes, compare them, and retain the features that work best. The datasets consist of instances and attributes which are important for spam email detection. The input data source is fed into data preprocessing, which involves removal of missing fields and outliers, normalization, and transformation of the data into the appropriate form. The fourth step consists of the generation and evaluation of the classification models using the machine learning technique. The machine learning technique uses an ensemble method, which is the combination of the Logistic Regression, Naïve Bayes, and Support Vector Machine algorithms, for detecting spam emails. An ensemble method is a machine learning technique that combines the predictions of multiple individual models to produce a more accurate prediction. The idea behind ensemble methods is to leverage the strengths of multiple models and reduce their weaknesses. Ensemble methods are commonly used in supervised learning, where the goal is to classify or predict a target value based on input features. The mathematical formula for the hard voting ensemble method can be defined as follows. Assume there are N individual models, each of which makes a prediction f_i(x) for a sample x. The final prediction for the sample x using hard voting is given by: f_ensemble(x) = argmax_c sum_{i=1..N} [f_i(x) == c]. In this formula, f_ensemble(x) is the ensemble prediction for sample x, f_i(x) is the prediction made by the i-th model for sample x, c is a class label, argmax_c returns the class label that has the highest number of votes, and the sum counts the votes cast for a particular class label. In other words, for each sample x, each individual model predicts a class label, and the final prediction for the sample x is the class label that receives the majority of votes. 1.6 Contribution to Knowledge Sharma et al. (2021) employed Decision Tree (DT) and K-Nearest Neighbor (K-NN) classifiers to safeguard social media accounts from spam. The performance of the method was evaluated using the UCI machine learning e-mail spam dataset. The Decision Tree classifier achieved a classification accuracy of 90% and an F1-score of 91.5.
The classifier, however, suffers from relatively low accuracy, which reduces the efficiency of the model. The contributions of this project include: 1. Improved prediction accuracy: By combining multiple models, ensemble methods can produce predictions that are more accurate than those of individual models. 2. Reduced overfitting: Overfitting occurs when a model is too complex and fits the training data too well, resulting in poor generalization to new data; combining several models reduces this risk. 3. Increased robustness: Ensemble methods are less likely to be affected by noise or outliers in the data, as the predictions of multiple models are averaged out.
CHAPTER TWO LITERATURE REVIEW 2.1 Introduction This chapter provides the background that is essential to understand the basis of email spamming and what it is all about.
Table 2.1: Some Metrics in Spam Email Detection (Metric: Definition)
Email spam: Email spam, also known as junk email, refers to unsolicited email messages, usually sent in bulk to a large list of recipients.
Prediction: The process of using a trained machine learning model to classify new emails as either spam or not spam.
Anomaly detection: Any process that finds the outliers of a dataset.
Probability: One of the bedrocks of ML; it tells how likely an event is to occur. The value of probability always lies between 0 and 1. It is a core concept as well as a primary prerequisite to understanding ML models and their applications.
Machine Learning (ML) Model Operations: Refers to the implementation of processes to maintain ML models in production environments.
Kernel: A mathematical function used to transform the input data into a higher-dimensional space to facilitate linear separability. Common kernels used in SVM include linear, polynomial, radial basis function (RBF), and sigmoid.
Machine Learning (ML) significance: Machine learning is important because it gives enterprises a view of trends in customer behavior and business operational patterns, as well as supports the development of new products.
Hyperplane: A decision boundary in a high-dimensional space used to separate the data points into different classes. In SVM, the hyperplane is chosen such that it maximizes the margin between the support vectors.
Margin: The distance between the hyperplane and the closest data points, which are known as support vectors. The margin is used to define the optimal hyperplane that separates the classes with maximum separation.
Model training: The process of using labeled data (e.g., emails labeled as spam or not spam) to train machine learning models to identify spam emails.
Model ensemble: The combination of multiple models in an ensemble method, where the predictions of each model are combined to produce a final prediction.
Voting: A model combination technique in which each model in the ensemble casts a vote for the class (e.g., spam or not spam) it predicts for an email, and the class with the most votes is chosen as the final prediction.
Classifier: A machine learning model used to predict the class label of a given data point based on its features. SVM is a type of classifier used for both binary and multi-class classification problems.
Overfitting and Underfitting: Overfitting occurs when the model is too complex and fits the training data too well, resulting in poor generalization to new data. Underfitting occurs when the model is too simple and cannot capture the underlying patterns in the data.
Regularization: The process of adding a penalty term to the loss function to control the complexity of the model and prevent overfitting.
2.2 Views from different scholars In recent times, unwanted commercial bulk emails called spam have become a huge problem on the internet. The person sending the spam messages is referred to as the spammer. There are written reviews of products that are available on social networking sites. According to Liu and Pang (2018), about 30-35% of online reviews are deemed spam. Nikolov (2021) and HaCohen-Kerner, Miller and Yigal (2020) explained that before extracting features from text, it is crucial to remove any unwanted information from the dataset. Such unwanted data within text datasets includes punctuation marks, http links, symbols, and frequently used words with little meaning (known as stop words). According to Ahmad, Rafie and Ghorabie (2021), on a dataset of 2 million spam and non-spam tweets, Multilayer Perceptron (MLP), NB and RFSVM outperformed others with a precision of 0.98 and an accuracy of 0.96. The huge volume of spam mails flowing through computer networks has destructive effects on the memory space of email servers, communication bandwidth, CPU power and user time. The menace of spam email is increasing on a yearly basis and is responsible for over 77% of the whole global email traffic, as reported by the Kaspersky Lab Spam Report in 2017. Users who receive spam emails that they did not request find them very irritating. Spam has also resulted in untold financial loss to many users who have fallen victim to internet scams and other fraudulent practices of spammers who send emails pretending to be from reputable companies with the intention of persuading individuals to disclose sensitive personal information such as passwords, Bank Verification Numbers (BVN) and credit card numbers. According to a report from Kaspersky Lab, in 2015 the volume of spam emails being sent fell to a 12-year low. Spam email volume fell below 50% for the first time since 2003. In June 2015, the volume of spam emails went down to 49.7%, and in July 2015 the figure was further reduced to 46.4%, according to anti-virus software developer Symantec. This decline was attributed to a reduction in the number of major botnets responsible for sending spam emails in billions. Malicious spam email volume was reported to be constant in 2015. The number of spam mails detected by Kaspersky Lab in 2015 was between 3 million and 6 million. Conversely, as the year was about to end, spam email volume escalated. A further report from Kaspersky Lab indicated that spam email messages carrying pernicious attachments such as malware, ransomware, malicious macros, and JavaScript started to increase in December 2015. According to Salminen et al. (2022), an Amazon e-commerce dataset was used for testing and training, where 40,000 samples for training and 10,000 samples for testing were gathered across various categories such as Fashion, Beauty and Automotive. That drift was sustained in 2016, and by March of that year spam email volume had quadrupled with respect to that witnessed in 2015. In March 2016, the volume of spam emails discovered by Kaspersky Lab was 22,890,956. By that time the volume of spam emails had skyrocketed to an average of 56.92% for the first quarter of 2016. The latest statistics show that spam messages accounted for 56.87% of e-mail traffic worldwide, and the most familiar types of spam emails were healthcare and dating spam.
Spam results in unproductive use of resources on Simple Mail Transfer Protocol (SMTP) servers, since they have to process a substantial volume of unsolicited emails. 2.3 How to recognize Spam Emails Francis West (2018) gave an insight on how to recognize spam emails. At present, more than 95% of email messages sent worldwide are believed to be spam. Apart from the amount of junk arriving in inboxes, it can have a more indirect and severe effect on email services and their users. It is something that is unpleasant but also unavoidable. Spam poses a security risk when phishing or malware attacks come along with it. Since spam comes in many varieties, it can easily manipulate the recipient. Thus, it is necessary to bear the following tips in mind to identify spam. These are the various ways to recognize spam: 1. Use anti-spam and anti-virus software: Once you install anti-spam software, you can protect yourself from spam emails. This is software that not only tags emails as spam but also blocks dangerous malware, viruses and phishing attacks. 2. Ensure that you know the sender before opening an email: Avoid any email sent by a website that you don't recognise or an email address from someone you don't know. There's a good chance that it is spam. Another possible way to identify spam is when the sender's address has a bunch of numbers or a domain that you don't recognise (the part after the "@"); then the email is likely spam. Hence, be careful while opening emails, especially if they land in the spam box. 3. Identify spoof email addresses: Attackers who want to try phishing attacks use spoof email addresses to trick the recipient. To make it appear that the email address is from a recognisable source, the attackers may use characters which look like actual letters. The attackers could also create a fake sender address from a trustworthy organisation. E.g., they can send an email from "westtek@rixobalkangrill.co.uk", which sounds like the email has come directly from Westtek. However, legitimate emails from Westtek always end in @westtek.co.uk. Legitimate companies send emails that use your first and last name as a personal salutation. Hence, the email is spam if the salutation is addressed to a vague "Valued Customer." Ensure that you check whether received emails have the complete contact address of the company. 4. Be careful about "urgent" or "threatening" language in the subject: A common phishing tactic is to evoke a sense of fear or urgency in emails. Attackers might write email subjects claiming your account has been suspended, or that someone is trying to make unauthorised login attempts. Due to this, recipients get worried, and they end up opening the spam emails or links. 5. Check the subject for a spam alarm: Make sure you check the subject line before opening an email. The subject sounds exciting and persuades you, generally by offering things like sales or investment opportunities, new treatments, requests for money, information on packages you never ordered, etc. Usually, it sounds like you are receiving a bag of a million bucks for free. These emails are definite signs of spam, designed to make you click links that result in attacks. 6. Avoid requests for personal information: Seldom is a user legitimately requested to "update user information" or sign in "immediately". If an attacker sends a request for personal information, then you know something's not right. These emails contain anonymous links, and it is advisable to avoid all such emails as far as possible.
Legitimate businesses never ask for personal information like credit card details or passwords via email. 7. Look out for typographical mistakes: Attackers write spam in a way to get it past spam filters, i.e. by making typographical errors so that they will not be detected. For example, the spelling PayPal comes across as Paypal, and this way we believe that it is a legitimate email. However, it is not. Hence, we should always check for spelling mistakes, since trusted brands are very serious about their emails. 8. Spot unknown attachments or links: If you are not aware of the source, you should avoid downloading links or attachments. There is a possibility that if you download these links or attachments, a virus or malware can enter your computer and destroy your data. Malicious files are mostly in the .docx or zip format. 9. Watch out for content that is too good to be true: Sometimes there is a spam email whose content is unbelievable, such as a promise that you will get a large sum of money if you download a link. These emails are phishing scams to get information from you. They come in various forms which encourage the recipient to provide personal information. Make sure you dodge such spam emails. Spam is dangerous and can leave your data or computer vulnerable to cyberattacks. Stay alert, stay secure. You can be sure that you are safe by typing out the link that the mail contains to validate/check the content it states rather than clicking on the link. You can also use third-party security sites to check the email for any virus or malware. 2.4 Machine learning According to Andreas C. Müller & Sarah Guido (2016), machine learning is about extracting knowledge from data. It is a research field at the intersection of statistics, artificial intelligence, and computer science, and is also known as predictive analytics or statistical learning. The application of machine learning methods has in recent years become ubiquitous in everyday life. From automatic recommendations of which movies to watch, to what food to order or which products to buy, to personalized online radio and recognizing your friends in your photos, many modern websites and devices have machine learning algorithms at their core. When you look at a complex website like Facebook, Amazon, or Netflix, it is very likely that every part of the site contains multiple machine learning models. 2.4.1 Reason for using Machine learning According to Yuxi Hayden Liu (2017), machine learning is a term coined around 1960, composed of two words: machine, corresponding to a computer, robot, or other device, and learning, an activity or event pattern, which humans are good at. So why do we need machine learning, and why do we want a machine to learn like a human? There are many problems involving huge datasets or complex calculations, for instance, where it makes sense to let computers do all the work. In general, of course, computers and robots don't get tired, don't have to sleep, and may be cheaper. There is also an emerging school of thought called active learning or human-in-the-loop, which advocates combining the efforts of machine learners and humans. The idea is that there are routine boring tasks more suitable for computers, and creative tasks more suitable for humans. According to this philosophy, machines are able to learn, by following rules (or algorithms) designed by humans, and to do the repetitive and logic tasks desired by a human. 2.4.2 Data Mining A notable study is by Liu et al.
(2019) with the objective of developing a deep learning-based framework for data mining. The authors proposed a deep learning-based framework that combined convolutional neural networks and recurrent neural networks to process and analyze large datasets. The methodology included the collection of a large dataset, preprocessing and cleaning of the data, and the application of the deep learning-based framework. The results showed that the deep learning-based framework performed better than traditional machine learning algorithms in terms of accuracy and efficiency. The limitation of the work was the requirement for a large dataset and computational resources. Finally, a study by Li et al. (2022) focused on the application of transfer learning in data mining. The objective of the study was to investigate the potential of transfer learning to improve the performance of machine learning algorithms in data mining. The authors proposed a transfer learning-based framework that used pre-trained models to extract features from the data and applied machine learning algorithms to the extracted features. The methodology included the collection of a large dataset, preprocessing and cleaning of the data, and the application of the transfer learning-based framework. The results showed that the transfer learning-based framework outperformed traditional machine learning algorithms in terms of accuracy and runtime efficiency. The limitation of the work was the need for pre-trained models to be available for the specific domain of the dataset. 2.4.3 Classification Classification, a data mining technique, is the process of classifying and predicting the value of a class attribute based on its predictor values (Romero et al., 2008). A predictor is an attribute used to predict a new record, e.g. spam email, legitimate email, etc. There are two main categories of classification models used for prediction: descriptive and predictive classification models. Descriptive models find relationships or patterns in the data and examine the properties of the data being studied. Examples of techniques that support this include summarization, clustering, association rules, etc. A predictive model, on the other hand, predicts unknown data values by applying a supervised learning function to known values (Jothi et al., 2015). The known data is historical. Examples of such techniques include time series analysis, prediction, classification, regression, etc. Our interest in this study lies in the predictive classification model, where the model is based on the characteristics of historical data and is used to predict future trends (Al-radaideh and Nagi, 2012). Many classification algorithms are used to classify categorical data, e.g. Decision Tree, K-Nearest Neighbor, Naïve Bayes, SVM, J48, Random Forest, Logistic Regression, and many more. In this study, we focus on Naïve Bayes classification techniques. The Naïve Bayes classifier provides an analytical tool that defines a set of model rules that categorize data into different classes using a probabilistic approach. First, it creates a model for each class attribute as a function of the other attributes in the record. Then it attempts to assign classes, using the built models, to unseen and even new datasets (Manjusha et al., 2015). This analysis helps to better understand the data set and predict future trends (Ameta and Jain, 2017).
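As a minimal illustration of the probabilistic classification approach described above, the sketch below trains a Naïve Bayes classifier with scikit-learn to separate spam from ham. It is purely illustrative: the toy messages, labels, and variable names are invented for the example and are not drawn from the project's actual dataset or code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy labelled data: 1 = spam, 0 = ham (legitimate email); examples are invented
messages = [
    "Win a free prize now",                # spam
    "Meeting rescheduled to Monday",       # ham
    "Claim your free lottery reward",      # spam
    "Please review the attached report",   # ham
]
labels = [1, 0, 1, 0]

# Turn each message into word-count features, then fit the probabilistic model
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)
model = MultinomialNB()
model.fit(X, labels)

# Classify a new, unseen message
new = vectorizer.transform(["Free prize waiting for you"])
print(model.predict(new))        # expected to predict class 1 (spam) for this toy data
print(model.predict_proba(new))  # class probabilities derived from Bayes' theorem
```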
2.4.4 Predictive Model In recent years, machine learning has been widely used in various applications to make predictions based on data. A number of studies have been published between 2018 and 2022 that have used machine learning techniques for predictive modeling. One example is the study by Li et al. (2020), who proposed a hybrid machine learning approach for stock price prediction. The authors combined the random forest algorithm with a deep neural network to predict the stock price movement of various companies. The results showed that the proposed method achieved high accuracy in stock price prediction, outperforming traditional machine learning techniques. Another study by Zhang et al. (2019) used machine learning techniques to predict the success of crowdfunding campaigns. The authors used decision trees, random forests, and gradient boosting algorithms to make predictions based on factors such as the funding goal, campaign length, and the number of backers. The results showed that the gradient boosting algorithm had the best performance in terms of accuracy. In a study by Kim et al. (2021), machine learning techniques were used to predict the risk of cardiovascular disease in patients. The authors used logistic regression, random forests, and gradient boosting algorithms to make predictions based on patient data such as age, blood pressure, and cholesterol levels. The results showed that the gradient boosting algorithm had the highest accuracy in predicting cardiovascular disease risk. One limitation of these studies is that they typically used a limited set of data, which may not represent the full range of conditions in real-world applications. Additionally, the accuracy of the predictive models may be influenced by the choice of machine learning techniques and the quality of the data used for training. In conclusion, the literature shows that machine learning techniques have been used for predictive modeling in various applications, with promising results in terms of accuracy. However, more research is needed to address the limitations and to determine the best techniques for specific predictive modeling tasks. 2.4.5 Logistic Regression Logistic Regression is a machine learning algorithm that is used for classification problems; it is a predictive analysis algorithm based on the concept of probability (Pant, 2019). Logistic regression is a classification technique borrowed by machine learning from the field of statistics. It is a statistical method for analyzing a data set in which one or more independent variables determine the outcome. The intention behind using logistic regression is to find the best fitting model to describe the relationship between the dependent and the independent variables (Raj, 2020). 2.4.6 Support Vector Machine Support Vector Machines (SVM) are a type of supervised machine learning algorithm used for classification and regression analysis. The primary goal of SVM is to find the best boundary or hyperplane that separates the data points into different classes or predicts the target value in regression problems. The boundary is determined by finding the maximum margin between the data points of different classes or between the target values and the predicted values. The data points closest to the boundary are called support vectors, and the boundary is referred to as the maximum margin hyperplane. SVM has been applied, for example, in cancer genomics and proteomics (Huang, 2018).
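To make the hyperplane, margin, and support-vector terminology above concrete, the following is a minimal sketch that fits a linear-kernel SVM with scikit-learn. The two-dimensional toy points are invented for illustration and are not related to the project's email data.

```python
import numpy as np
from sklearn import svm

# Toy 2-D points from two classes (illustrative values only)
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],   # class 0
              [6.0, 5.0], [7.0, 8.0], [8.0, 6.0]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# A linear-kernel SVM finds the maximum-margin hyperplane between the two classes
clf = svm.SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.support_vectors_)       # the points closest to the separating hyperplane
print(clf.coef_, clf.intercept_)  # hyperplane parameters w.x + b = 0
print(clf.predict([[4.0, 4.0]]))  # classify a new point
```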
2.4.7 Naïve Bayes Naïve Bayes is a classification technique based on Bayes' theorem, assuming independence between predictors. In simple terms, the Naive Bayes classifier assumes that the existence of a certain trait in one class is not related to the existence of another trait (Ray, 2017). The Naive Bayesian classifier is a simple probabilistic classifier that works by applying Bayes' theorem along with naive assumptions about feature independence (Wang et al., 2010).
2.5 Related Work The related works reviewed (author, objectives, methodology/results) are summarized below:
1. Maguluri et al. (2019): The study aimed to address the problem of spam emails. A prediction system for identifying spam emails was developed using four data mining techniques: Decision Trees, Random Forest, Logistic Regression, and Gradient Boosting.
2. Ahmad et al. (2021): This paper is devoted to giving more accuracy and precision. Using Multilayer Perceptron (MLP), NB and RF, SVM outperformed the others.
3. Kumar et al. (2018): The study aimed at the factors affecting stock prediction. They used machine learning techniques for this task and developed five models, based on SVM, Random Forest, KNN, Naïve Bayes and Softmax.
4. Jayadi et al. (2019): Developed an employee performance prediction system using Naïve Bayes. The result shows that Naïve Bayes correctly classified as many as 95.48% of instances.
5. Li et al. (2021): The authors used a combination of convolutional neural networks (CNNs) and long short-term memory (LSTM) networks to analyze the text and structure of spam emails. The results of their experiments showed that the proposed method outperformed traditional machine learning approaches and achieved an accuracy of 97.5%.
6. Kim et al. (2020): Sought to identify spam emails by analyzing the relationships between sender, recipient, and email content. The results of their experiments showed that the GCN-based approach achieved an accuracy of 95.1%.
7. Masoom et al. (2020): Developed a system to identify bullying using four machine learning algorithms: K-Nearest-Neighbor, SVM (Support Vector Machine), Random Forest Regression, and Logistic Regression. The purpose of the research is to identify the bullies via informed surveys, using data from students of colleges and schools, take the results to concerned authorities or guardians, and list out ways to eradicate them.
8. José et al. (2019): Developed a system to detect harassment using machine learning and fuzzy logic techniques. A fuzzy logic system, based on a set of linguistic input variables, determines whether cyberbullying signs have been detected; the machine learning system deduces from previous training whether the user could be a victim of cyberbullying.
9. Zhang et al. (2019): They aimed to improve the accuracy of spam email detection by using a combination of natural language processing (NLP) techniques and deep learning. The authors used a convolutional neural network (CNN) to extract features from the text of the emails and a long short-term memory (LSTM) network to analyze the structure of the emails. The results of their experiments showed that the proposed method achieved an accuracy of 99.4%.
10. Chen et al. (2018): They aimed to improve the accuracy of spam email detection by using a combination of NLP techniques and deep learning. The authors used a convolutional neural network (CNN) to extract features from the text of the emails and a long short-term memory (LSTM) network to analyze the structure of the emails. The results of their experiments showed that the proposed method achieved an accuracy of 98.7%.
11. Rafiah et al. (2018): Developed a heart disease prediction system using three data mining techniques: Decision Trees, Naïve Bayes, and Neural Network. The result obtained after prediction using the Cleveland Heart Disease Database indicated that Naïve Bayes performed well, followed by Neural Network and Decision Trees.
12. Xu et al. (2021): Social network spam detection based on ALBERT and a combination of Bi-LSTM with self-attention. Logistic Regression, Naive Bayes, and SVM were each combined with ALBERT to construct the spam detection models, proving the superiority of the LSTM neural network in this task.
13. Zhuang et al. (2021): Used a deep belief network to demote web spam (Future Generation Computer Systems). The paper proposed a preference-based learning-to-rank method to address two major issues of score-propagation-based web spam demotion algorithms; the proposal consists of a preference function and an ordering algorithm.
14. Tong et al. (2021): A content-based Chinese spam detection method using a capsule network with long-short attention. Experimental results show that the model outperformed current mainstream methods such as TextCNN, LSTM and even BERT in characterization and detection, achieving an accuracy as high as 98.72% on an unbalanced dataset and 99.30% on a balanced dataset.
15. Tanujay et al. (2022): A machine-learning-assisted security analysis of 5G-network-connected systems. Attacks at the network, protocol, and application layers are combined to generate complex attack vectors; in a case study, these attack vectors were used to find four novel security loopholes in WhatsApp running on a 5G network.
CHAPTER 3 SYSTEM ANALYSIS AND DESIGN 3.1 The Design of the proposed system To successfully design a spam email detection system, the Naïve Bayes, Logistic Regression, and Support Vector Machine models in machine learning, together with the Python programming language, are used to predict spam emails by evaluating the confusion matrix based on datasets of attributes, predicting the outcome, and evaluating the prediction accuracy. Fig 3.1 shows the proposed architecture for the system.
Fig 3.1: Proposed architecture for the spam email prediction system (Data Collection → Dataset Exploration → Data Preprocessing → Feature Extraction → Predictive Analysis (Ensemble method) → Training → Testing → Evaluation)
3.2 The collection of Datasets A collection of sets of data that are related to one another and composed of separate elements which can be manipulated as required is called a dataset. Subjects for this study were selected from spam email attributes. The dataset used in this work was obtained from an openly accessible source. The data consists of SMS messages, which can also be treated as email messages, classified into ham and spam. More specifically, the email dataset was made available on a Kaggle site in CSV format. The emails contain more than two attributes, but only the two main attributes, v1 and v2, are used. The data is divided into two parts for training and testing. The structured and analyzed data extracted from the website was formatted in CSV format. The structured CSV file was then imported into Microsoft Excel.
Fig 3.2 shows the structured training dataset after formatting. Fig 3.2: Email Datasets
3.3 Dataset exploration Dataset exploration in machine learning involves analyzing and understanding the characteristics of the data that will be used to train a model. This includes analyzing the size, structure, and distribution of the data, as well as identifying any potential issues or biases that may impact the performance of the model. During this stage, data preprocessing and cleaning may also be performed, such as handling missing values or outliers. Additionally, visualizations and statistical analyses can be used to gain insights into the data and inform the selection of appropriate models and feature engineering techniques. 3.4 Data preprocessing This is a data mining technique that involves transforming raw data into an understandable format. The raw data will likely have inconsistencies, errors, out-of-range values, impossible data combinations, missing values, or, more fundamentally, data unsuitable for a data mining method. Moreover, the growing rate of data in modern business applications, science, industry, and academia demands more sophisticated frameworks to analyze it. With data preprocessing, converting the unusable into the usable becomes achievable, as the data is transformed to meet the requirements of each data mining algorithm. The data preprocessing stage can take a great deal of processing time. The outcome expected after a sound sequence of data preprocessing steps is a final dataset that can be considered correct and suitable for further data mining algorithms (Nitta et al., 2018). 3.4.1 Latin-1 encoding Text data in emails can contain special characters and symbols, and it is important to encode these characters in a way that the machine learning algorithm can understand and process. The Latin-1 encoding is one way of encoding the text data, and it can be used in conjunction with other preprocessing steps such as converting the text to lowercase, removing stop words, and stemming the words. The Latin-1 encoding is a character encoding that represents the characters used in Western European languages. In the context of spam email detection, the Latin-1 encoding can be used to encode the text of an email before using it as input to a machine learning algorithm. 3.5 Feature extraction Feature extraction is the process of transforming raw data into a set of features that can be used as input to a machine learning model. It involves selecting the most relevant and informative features from the raw data, reducing the dimensionality of the data, and transforming the features into a format that can be used by the machine learning algorithm. The goal of feature extraction is to identify the features that are most indicative of spam emails and to use those features as input to the machine learning algorithm. By selecting the most relevant and informative features, feature extraction can help to improve the accuracy and performance of the machine learning model (Ruskanda, 2019). 3.5.1 Bag of Words (BoW) Bag of Words (BoW) is a widely used and straightforward technique for feature extraction in NLP. It represents a document as a collection of words, and each document is transformed into a vector that shows the frequency of each unique word in the vocabulary. The frequency of words in the document, including repeated words, is used to create the BoW representation.
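A minimal sketch of these preprocessing and feature-extraction steps is shown below. It assumes the Kaggle dataset described in section 3.2 is available locally as spam.csv with label column v1 and text column v2; the file name and the exact cleaning steps are assumptions made for illustration, not the project's exact code.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Load the dataset with Latin-1 encoding so Western European characters decode correctly.
# "spam.csv" and the v1/v2 column names follow the Kaggle dataset described in section 3.2,
# but the file name is an assumption.
data = pd.read_csv("spam.csv", encoding="latin-1")[["v1", "v2"]]
data = data.dropna()                                  # remove rows with missing fields
data["label"] = (data["v1"] == "spam").astype(int)    # ham -> 0, spam -> 1

# Bag of Words: each message becomes a vector of word counts over the vocabulary
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
features = vectorizer.fit_transform(data["v2"])
print(features.shape)   # (number of messages, vocabulary size)
```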
Barushka and Hajek (2019) applied this method in the creation of a spam review detection model that utilized n-grams and the skip-gram word embedding method. 3.6 Training and Testing Principally, the dataset is divided into two parts: a test dataset and a training dataset. The training data is used to build the machine learning model, and the model is then tested with the test dataset to check its accuracy, precision and many other factors. 3.6.1 Training At the heart of the machine learning process is the training of the model. The bulk of the "learning" is done at this stage. 80% of the dataset was allocated for training to teach our model to determine whether an email is spam or not. 3.6.2 Testing Testing refers to the process where the performance of a fully trained model is evaluated on a testing set. That is why 20% of the dataset, set aside for evaluation, is used to check the model's proficiency. The testing set, consisting of a set of testing samples, should be separated from both the training and validation sets, but it should follow the same probability distribution as the training set. 3.7 Hard voting ensemble method Hard voting is a type of ensemble learning method in machine learning and data science. It is a simple yet powerful method for combining the predictions of multiple individual models in order to produce a final prediction. In a hard voting ensemble, the predictions of the individual models are combined by taking the majority vote. This means that the final prediction is the most common prediction made by the individual models. The hard voting ensemble is usually used when the individual models in the ensemble are of the same type and make their predictions in a mutually exclusive manner. 3.7.1 Algorithm and flowchart for spam email detection using the Hard voting ensemble method 3.7.1.1 Algorithm for spam email detection using the Hard voting ensemble method 1. Preprocess the email data: Perform data preprocessing steps such as removing duplicates, handling missing values, normalizing the data, and converting categorical variables into numerical ones. 2. Perform feature extraction: Extract relevant and informative features from the email data, such as the presence or absence of certain words, the frequency of certain words, or the sender's email address. 3. Split the data into training and test sets: Divide the preprocessed and transformed data into a training set and a test set. 4. Train multiple classifiers on the training data: Train different classifiers such as SVM, Naive Bayes, and Logistic Regression on the training data. 5. Make predictions on the test data using each classifier: Each classifier will make its own prediction for each test email. 6. Collect the predictions from each classifier: For each test email, there will be multiple predictions from different classifiers. 7. Choose the majority vote as the final prediction: For each test email, select the class label that has been predicted most frequently by the classifiers. If a majority of classifiers predict an email as spam, the final prediction will be "spam"; otherwise, the final prediction will be "not spam". 8. Evaluate the accuracy of the ensemble classifier: Compare the final predictions of the ensemble classifier with the true class labels to calculate the accuracy. 3.7.1.2 Flowchart of the Model A flowchart is the diagrammatic representation of an algorithm; for the proposed model, the flowchart (Fig 3.3) follows the illustrative code sketch given below.
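The following is a minimal sketch of steps 3-8 of the algorithm above, using scikit-learn's VotingClassifier with voting='hard'. Synthetic numerical features stand in for the extracted email features here, so GaussianNB is used; with Bag-of-Words count features, MultinomialNB would be the natural choice. This is an illustrative sketch under those assumptions, not the project's exact implementation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the extracted email features (the real project uses Bag-of-Words counts)
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Step 3: split into training and test sets (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: train the three individual classifiers inside a hard-voting ensemble
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
        ("svm", SVC()),
    ],
    voting="hard",   # steps 5-7: each model votes and the majority class wins
)
ensemble.fit(X_train, y_train)

# Step 8: evaluate the ensemble's accuracy on the held-out test set
predictions = ensemble.predict(X_test)
print("Ensemble accuracy:", accuracy_score(y_test, predictions))
```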
Fig 3.3: Flowchart of the model (Start → Data Preprocessing → training/test data split → Support Vector Machine, Logistic Regression, Naïve Bayes → Hard Voting Ensemble method → Predict the value → Spam / Ham → End)
3.8 Logistic Regression Logistic Regression is a type of generalized linear model that is used for binary classification problems, where the target variable can take one of two possible values, such as "Yes" or "No". It is used to model the relationship between a set of independent variables (also known as features or predictors) and the probability of a binary outcome. The basic idea behind logistic regression is to use a mathematical function, known as the logistic function or sigmoid function, to model the probability that a given input belongs to a certain class. The logistic function maps any real-valued number to a value between 0 and 1, which can be interpreted as a probability. The logistic regression formula can be written generally in a linear equation form as: ln(P/(1-P)) = β0 + β1X1 + β2X2 + …, where P is the probability of the event, β0, β1, β2, … are the regression coefficients, and X1, X2, … are the values of the independent variables. Solving for the probability of the event gives: P = 1/(1 + e^-(β0 + β1X1 + β2X2 + …)), where P is an output between 0 and 1 (a probability estimate) and e is the base of the natural logarithm. Logistic regression is used to predict a binary outcome based on a set of independent variables. A binary outcome is one where there are only two possible scenarios: either the event happens (1) or it does not happen (0). Independent variables are those variables or factors which may influence the outcome (or dependent variable). So: logistic regression is the correct type of analysis to use when you're working with binary data. You know you're dealing with binary data when the output or dependent variable is dichotomous or categorical in nature; in other words, if it fits into one of two categories (such as "yes" or "no", "pass" or "fail", and so on). So, in order to determine if logistic regression is the correct type of analysis to use, we check for the following: 1. Is the dependent variable dichotomous? In other words, does it fit into one of two set categories? Remember: the dependent variable is the outcome, the thing that you're measuring or predicting. 2. Are the independent variables either interval, ratio, or ordinal? Remember: the independent variables are those which may impact, or be used to predict, the outcome. In addition to the two criteria mentioned above, there are some further requirements that must be met in order to correctly use logistic regression. These requirements are known as "assumptions"; in other words, when conducting logistic regression, you're assuming that these criteria have been met. Let's take a look at those now. 3.8.1 Advantages of logistic regression 1. Logistic regression is much easier to implement than other methods, especially in the context of machine learning: A machine learning model can be described as a mathematical depiction of a real-world process. The process of setting up a machine learning model requires training and testing the model. Training is the process of finding patterns in the input data, so that the model can map a particular input (say, an image) to some kind of output, like a label. Logistic regression is easier to train and implement compared to other methods. 2.
Logistic regression works well for cases where the dataset is linearly separable: A dataset is said to be linearly separable if it is possible to draw a straight line that can separate the two classes of data from each other. Logistic regression is used when your Y variable can take only two values, and if the data is linearly separable, it is more efficient to classify it into two separate classes. 3. Logistic regression provides useful insights: Logistic regression not only gives a measure of how relevant an independent variable is (i.e. the coefficient size), but also tells us about the direction of the relationship (positive or negative). Two variables are said to have a positive association when an increase in the value of one variable also increases the value of the other variable. For example, the more hours you spend training, the better you become at a particular sport. However, it is important to be aware that correlation does not necessarily indicate causation! In other words, logistic regression may show you that there is a positive correlation between outdoor temperature and sales, but this doesn't necessarily mean that sales are rising because of the temperature. 3.8.2 Disadvantages of logistic regression 1. Logistic regression fails to predict a continuous outcome. Let's consider an example to better understand this limitation. In medical applications, logistic regression cannot be used to predict how high a pneumonia patient's temperature will rise. This is because the scale of measurement is continuous (logistic regression only works when the dependent or outcome variable is dichotomous). 2. Logistic regression assumes linearity between the predicted (dependent) variable and the predictor (independent) variables. Why is this a limitation? In the real world, it is highly unlikely that the observations are linearly separable. Let's imagine you want to classify the iris plant into one of two species: setosa or versicolor. In order to distinguish between the two categories, you're going by petal size and sepal size. You want to create an algorithm to classify the iris plant, but there's actually no clear distinction; a petal size of 2 cm could qualify the plant for both the setosa and versicolor categories. So, while linearly separable data is the assumption for logistic regression, in reality it's not always truly possible. 3. Logistic regression may not be accurate if the sample size is too small. If the sample size is on the small side, the model produced by logistic regression is based on a smaller number of actual observations. This can result in overfitting. In statistics, overfitting is a modeling error which occurs when the model is too closely fit to a limited set of data because of a lack of training data. Or, in other words, there is not enough input data available for the model to find patterns in it. In this case, the model is not able to accurately predict the outcomes of a new or future dataset. 3.8.3 Reasons for using Logistic Regression 1. Logistic regression is much easier to implement than other methods, especially in the context of machine learning. 2. Logistic regression works well for cases where the dataset is linearly separable. 3. Logistic regression provides useful insights. 3.9 Ensemble method Ensemble methods in machine learning are techniques that combine multiple individual models to produce improved predictions compared to those of a single model.
The basic idea behind ensemble methods is that by combining the predictions of multiple models, we can reduce the variance, bias, or improve the overall performance of the model. There are several types of ensemble methods, including: 1. Bagging (Bootstrap Aggregating): This involves creating multiple instances of the same model with different random subsets of the training data. The final prediction is made by taking the average or majority vote of the predictions of all models. 2. Boosting: This involves training multiple models sequentially, where each model tries to correct the errors of the previous model. The final prediction is made by taking the weighted average of the predictions of all models. 3. Random Forest: This is an extension of bagging that involves creating multiple decision trees and combining their predictions. 4. Stacking: This involves training multiple models and then using the predictions of those models as features for another model. 5. Adaboost: This is a type of boosting algorithm that assigns higher weights to samples that are difficult to predict, so that subsequent models pay more attention to them. 6. Gradient Boosting: This is another type of boosting algorithm that combines multiple decision trees to make predictions. The final prediction is made by taking a weighted average of the predictions of all the trees, where the weights are determined by the gradient descent optimization algorithm. 3.9.1 Voting ensemble method Voting ensemble is a type of ensemble learning in which multiple models are combined to make a prediction. The basic idea behind voting ensemble is to combine the outputs of multiple models to produce a more accurate and robust prediction. There are two main types of voting ensembles: 1. Hard voting ensemble: In this type of voting ensemble, the outputs of multiple models are combined through a majority vote. The final prediction is made based on the majority class predicted by the individual models. For example, if three models predict class 1, class 2, and class 1 respectively, then the final prediction will be class 1. 2. Soft voting ensemble: In this type of voting ensemble, the outputs of multiple models are combined through a weighted average of their predictions. The weights are determined based on the accuracy of the individual models. For example, if one model is more accurate than the others, it will have a higher weight in the final prediction. Soft voting ensembles are often used with models that produce continuous output, such as regression or probability estimates. 3.9.2 Advantages of ensemble method Ensemble methods have several advantages over single-model approaches in machine learning: 1. Improved accuracy: Ensemble methods can often produce more accurate predictions compared to individual models, as they combine the strengths of multiple models and reduce the impact of any single model's weaknesses. 2. Reduced overfitting: Ensemble methods can reduce the risk of overfitting, as they average out the predictions of multiple models and thereby reduce the impact of any one model's tendency to fit the training data too closely. 3. Increased stability: Ensemble methods can be more stable and robust to changes in the training data compared to individual models, as they average out the predictions of multiple models and thereby reduce the impact of any one model's fluctuations. 4. 
3.9.3 Disadvantages of ensemble method
While ensemble methods offer several advantages over single-model approaches, there are also some disadvantages to consider:
1. Increased computational cost: Ensemble methods can be computationally expensive, as they require training multiple models and combining their predictions. This increases the time and resources needed to build and use an ensemble model.
2. Complexity: Ensemble methods can be more complex and harder to understand than single-model approaches, as they require combining the predictions of multiple models. This can make it more difficult to interpret the results and understand why a particular prediction was made.
3. Difficult to train: Ensemble methods can be more difficult to train than single-model approaches, as they require selecting and combining multiple models in a way that improves overall performance.
4. High variance: Ensemble methods can be sensitive to the specific models selected and their weighting, which can make it challenging to ensure consistent performance across different datasets.
5. Potential for increased bias: Ensemble methods can introduce bias if not properly constructed, as the combination of predictions from multiple models can amplify any biases present in the individual models.
3.10 Evaluation
With the model trained, it needs to be tested to see whether it would operate well in real-world situations. This puts the model in a scenario where it encounters situations that were not part of its training. In this research, that means trying to identify a type of email that is completely new to the model; through its training, the model should be able to extrapolate from what it has learned and decide whether the new email is spam or not. Model evaluation metrics are used to measure the goodness of fit between model and data, to compare different models in the context of model selection, and to estimate how accurate future predictions are expected to be.
CHAPTER FOUR
IMPLEMENTATION AND TESTING
4.1 Introduction
A system is not useful unless it is implemented and tested to ensure that it works correctly and that all its functionalities are in place and effective. This chapter highlights the minimum system requirements (both hardware and software) used to implement the developed model; the model is then tested and evaluated for accuracy against the results obtained.
4.2 System Requirements
The following are the minimum system requirements to ensure the smooth and quick running of the model to be implemented.
4.2.1 Hardware Requirements
1. Architecture: 32-bit/64-bit personal computer with a minimum 1.75 GHz processor
2. Memory: Minimum of 2 GB RAM
3. Hard Disk: 100 GB of free hard disk space
4. Mouse and keyboard
4.2.2 Software Requirements
1. Microsoft Windows 7/8/10 Operating System
2. Microsoft Excel
3. Jupyter Notebook
4.3 Model Implementation
4.3.1 Model Prediction
The model was implemented by collating the analyzed datasets and then training on those datasets using a frequency table, a likelihood table, and the conditional probability of each instance of the analyzed datasets. Fig 4.1 shows the spam dataset, analyzed into CSV format in Microsoft Excel.
Fig 4.1: Email Datasets
After the analysis of the datasets, the normalizing constant and the prediction percentage were derived using logistic regression, naïve Bayes and support vector machine. Recalling,
Accuracy = (TP + TN) / (TP + FP + TN + FN)
Where:
TP – True positive
TN – True negative
FP – False positive
FN – False negative
4.3.2 Model Prediction Evaluation
After deriving a prediction probability percentage from the analysis of attributes in the trained datasets, we evaluate the accuracy of the model as the percentage of predictions on the test datasets that are correct out of all the predictions made, and tie these steps together to form our model.
4.3.3 Classification
The classification involves two processes:
i. Load the datasets in CSV format.
ii. Convert the attributes that were loaded from the trained datasets from strings into numbers so that we can work with them.
This can be implemented using Python in the Jupyter Notebook, which provides everything needed for the implementation.
4.4 Implementation of Logistic Regression on Data
Jupyter Notebook was used as the framework to run the Python code that implements logistic regression on the data. Several Python libraries were used for the implementation: pandas, NumPy, scikit-learn (sklearn), and pickle. Pandas is a Python module used for data processing and for viewing data. NumPy is a Python library for numerical computing that represents data as arrays. Sklearn is the Python module that provides the machine learning classifiers used in this work. Pickle is a Python library used to save models after training. In the process of executing this work the following steps were taken:
1. Import and load data
2. Data encoding
3. Training using Logistic Regression
4. Evaluate model based on prediction
5. Comparison between Naïve Bayes, Logistic Regression and Support Vector Machine
6. Save model with pickle
1. Import the dependencies: All the libraries needed for this program were imported, and the pandas function pd.read_csv was used to load the dataset into the program.
Fig 4.2: Datasets Load Interface
2. Data collection and cleaning: The data is loaded from a CSV file into a pandas DataFrame. The data is printed out for visibility and all null values are removed for improved test accuracy.
Fig 4.3: Data Collection
3. Convert labels to binary variables: 0 represents 'ham' (not spam) and 1 represents 'spam'.
Fig 4.4: Convert labels to binary variables
4. Split into training and testing sets: train_test_split from sklearn.model_selection is used to split the data into training and testing data. The test size used is 20%.
Fig 4.5: Split into training and testing sets
5. Frequency distribution: Feature extraction using CountVectorizer.
Fig 4.6: Frequency distribution
6. Evaluation of trained model: Accuracy score, precision score, recall score and F1 score are computed from the data.
7. Evaluation of ensemble trained model: Combination of Support Vector Machine, Naïve Bayes and Logistic Regression.
Fig 4.7: Evaluation of logistic trained model
8. Confusion matrix: A confusion matrix is plotted using the matplotlib library.
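For concreteness, the following is a minimal end-to-end sketch of the steps just described (load, clean, encode labels, 80/20 split, CountVectorizer features, logistic regression, evaluation, pickle). The file name spam.csv and the column names Category and Message are assumptions made for illustration and may differ from the actual dataset; the sketch is not a copy of the project's notebook.

```python
import pickle
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# 1. Import and load data (file and column names are illustrative).
data = pd.read_csv("spam.csv", encoding="latin-1")
data = data.dropna(subset=["Category", "Message"])  # remove null values

# 2. Data encoding: 0 = ham, 1 = spam.
data["label"] = (data["Category"] == "spam").astype(int)

# 3. Split into training and testing sets (test size 20%).
X_train, X_test, y_train, y_test = train_test_split(
    data["Message"], data["label"], test_size=0.2, random_state=42)

# 4. Frequency distribution: feature extraction with CountVectorizer.
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

# 5. Training using logistic regression.
model = LogisticRegression(max_iter=1000)
model.fit(X_train_counts, y_train)

# 6. Evaluate the trained model on the test set.
y_pred = model.predict(X_test_counts)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1 score :", f1_score(y_test, y_pred))

# 7. Save the vectorizer and model with pickle.
with open("spam_model.pkl", "wb") as f:
    pickle.dump((vectorizer, model), f)
```

The Naïve Bayes and Support Vector Machine models compared in section 4.5 can be substituted for LogisticRegression in the same pipeline without changing the other steps.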
4.5 Implementation of Naïve Bayes and Support Vector Machine on Data
The performance of Naïve Bayes and Support Vector Machine is compared to determine which gives the better performance.
Naïve Bayes performance model
Support Vector Machine performance model
9. Performance metrics evaluation: A confusion matrix is a table that is often used to describe the performance of a classification algorithm. The table layout is as follows:

Table: Confusion matrix
                    Actual Spam    Actual Ham
Predicted Spam      TP             FP
Predicted Ham       FN             TN

Each entry in the table counts the observations that were predicted to be in a certain class against the class they actually belong to. The diagonal entries (True Positive and True Negative) represent correct predictions, while the off-diagonal entries (False Positive and False Negative) represent incorrect predictions. The information in a confusion matrix can be used to compute various metrics, such as accuracy, precision, recall, and F1 score. These metrics give a more detailed understanding of the performance of a classification model than a single number like accuracy.
True positive (TP): the number of positive predictions that are true.
False positive (FP): the number of positive predictions that are false.
False negative (FN): the number of negative predictions that are false.
True negative (TN): the number of negative predictions that are true.

Accuracy = (TP + TN) / (TP + FP + TN + FN)    (3.5)
Recall = TP / (TP + FN)    (3.6)
False positive rate = FP / (FP + TN)    (3.7)
Precision = TP / (TP + FP)    (3.8)

Table: Confusion matrix for Naïve Bayes
                    Actual Spam    Actual Ham
Predicted Spam      1587           0
Predicted Ham       56             196

Accuracy = (1587 + 196) / (1587 + 56 + 196 + 0) = 0.969549
Recall = 1587 / (1587 + 56) = 0.965916
False positive rate = 0 / (0 + 196) = 0.000000
Precision = 1587 / (1587 + 0) = 1.000000

Table: Confusion matrix for Support Vector Machine (SVM)
                    Actual Spam    Actual Ham
Predicted Spam      1586           1
Predicted Ham       37             215

Accuracy = (1586 + 215) / (1586 + 37 + 215 + 1) = 0.979337
Recall = 1586 / (1586 + 37) = 0.977203
False positive rate = 1 / (1 + 215) = 0.004630
Precision = 1586 / (1586 + 1) = 0.999370

Table: Confusion matrix for Logistic Regression
                    Actual Spam    Actual Ham
Predicted Spam      973            3
Predicted Ham       8              131

Accuracy = (973 + 131) / (973 + 8 + 131 + 3) = 0.990135
Recall = 973 / (973 + 8) = 0.991845
False positive rate = 3 / (3 + 131) = 0.022388
Precision = 973 / (973 + 3) = 0.996926
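As a small, hedged illustration of how equations (3.5) to (3.8) relate to the confusion-matrix entries, the sketch below recomputes the metrics from the Naïve Bayes counts reported above, and then shows how scikit-learn's confusion_matrix produces the same four entries from true and predicted labels; the short label lists at the end are toy values for illustration only.

```python
from sklearn.metrics import confusion_matrix

# Counts taken from the Naïve Bayes confusion matrix reported above.
tp, fp, fn, tn = 1587, 0, 56, 196

accuracy = (tp + tn) / (tp + fp + tn + fn)   # equation (3.5)
recall = tp / (tp + fn)                      # equation (3.6)
fpr = fp / (fp + tn)                         # equation (3.7)
precision = tp / (tp + fp)                   # equation (3.8)
print(f"accuracy={accuracy:.6f} recall={recall:.6f} "
      f"fpr={fpr:.6f} precision={precision:.6f}")

# The same entries can be read off a confusion matrix computed by scikit-learn
# from true and predicted labels, with 1 (spam) treated as the positive class.
y_true = [1, 1, 0, 0, 1, 0]   # toy labels, illustrative only
y_pred = [1, 0, 0, 0, 1, 0]
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
tp, fn = cm[0, 0], cm[0, 1]
fp, tn = cm[1, 0], cm[1, 1]
print(cm)
```

Running the first block reproduces the Naïve Bayes figures reported above (accuracy 0.969549, recall 0.965916, false positive rate 0.000000, precision 1.000000).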
CHAPTER FIVE
CONCLUSION AND RECOMMENDATION
5.1 Conclusion
Email spamming is a common occurrence affecting many people around the world. With the increase in spamming there is a huge demand for advanced systems and new approaches that improve email spam analytics and better protect individual email accounts. The development of an email spam prediction system using Logistic Regression, Support Vector Machine, and Naïve Bayes models in machine learning has been carried out in this project. Our system combines past spamming records from particular zones with Logistic Regression, Support Vector Machine and Naïve Bayes models to predict email spam. The proposed approach is implemented using Jupyter Notebook, and its performance is evaluated using accuracy, precision, and recall. The results obtained from the model prediction, accuracy evaluation, and classification show that the three-model approach to email spam prediction performed well, and the system was tested for anomalies.
5.2 Recommendation
As future research, the model implemented could be improved to work effectively in Nigeria by training it on data collected in Nigeria. Future research on email spam prediction should also consider a wider variety of attributes that can be applied to form a dataset. Special consideration should also be given to known spamming factors in order to prevent their occurrence.