SPAM EMAIL PREDICTION AND DETECTION SYSTEM USING MACHINE
LEARNING
BY
ARIJE JOHN OGO-OLUWA
(MATRICULATION NUMBER: CYS/16/9964)
SUBMITTED TO
THE DEPARTMENT OF CYBER SECURITY,
THE FEDERAL UNIVERSITY OF TECHNOLOGY AKURE (FUTA),
ONDO STATE, NIGERIA.
IN PARTIAL FULFILLMENT OF THE REQUIREMENT FOR THE
AWARD OF BACHELOR OF TECHNOLOGY (B. TECH)
IN CYBER SECURITY
DECEMBER, 2022
CERTIFICATION
I certify that this project work was carried out by me and has not been presented elsewhere for
the award of any degree or for any other purpose.
STUDENT’S NAME: ARIJE JOHN OGO-OLUWA
SIGNATURE ……………………
DATE ……………………….
This is to certify that this work was carried out by ARIJE JOHN OGO-OLUWA with
matriculation number CYS/16/9964 of the Department of Cyber Security, The Federal
University of Technology, Akure, Nigeria.
SUPERVISOR’S NAME: DR. A.F. THOMPSON
SIGNATURE ………………….
DATE …………………………
DEDICATION
This report is dedicated to the almighty God, who granted me good health, guided and protected
me all through these years and made this project a success. I also dedicate this to my parents,
Pst. and Mrs. Arije for their unending support.
ACKNOWLEDGEMENT
I give God Almighty the glory for His mercy, grace, favor and love that kept me through my
undergraduate days. My appreciation goes to my project supervisor and the HOD, Dr. A.F.
Thompson, for her correction, direction, guidance and supervision.
Special thanks to the entire staff of the Cyber Security Department, Federal University of
Technology, Akure, for the knowledge, skills, and values that I have been exposed to, which
gave me a bedrock to undertake this project.
I also appreciate my loving family; this would have been impossible without them. I pray that
God will bless them all.
ABSTRACT
Email communication is now indispensable, but the email spam problem is widespread and
difficult to control. To detect spam, a collaborative spam detection system is proposed using
the Python language and machine learning. The machine learning system separates spam email
from legitimate (ham) email. This project presents a complete collaborative spam detection
system built with efficient, standard machine learning software tools. The resulting spam
detection system outperforms prior approaches in detection results and is applicable to the real
world.
CHAPTER ONE
INTRODUCTION
1.1 Background of the Study
The commercialization of the internet and the integration of electronic mail as an accessible means of
communication have another face: the influx of unwanted information and mail. As the internet
started to gain popularity in the early 1990s, it was quickly recognized as an excellent
advertising tool. At practically no cost, a person can use the internet to send an email message
to thousands of people. These unsolicited junk electronic mails came to be called 'Spam'. The
history of spam is intertwined with the history of electronic mail.
While the linguistic significance of the usage of the word 'spam' is attributed to the British
comedy troupe Monty Python in a now legendary sketch from their Flying Circus TV series,
in which a group of Vikings sing a chorus of "SPAM, SPAM, SPAM..." at increasing volumes,
the historic significance lies in it being adopted to refer to unsolicited commercial electronic
mail sent to a large number of addresses, in what was seen as drowning out normal
communication on the internet.
The first known email spam (although not yet called that), was sent on May 3, 1978 to several
hundred users on ARPANET. It was an advertisement for a presentation by Digital Equipment
Corporation for their DECSYSTEM-20 products sent by Gary Thuerk, a marketer of theirs.
The reaction to it was almost universally negative, and for a long time there were no further
instances.
The name "spam" was actually first applied, in April 1993, not to an email, but to unwanted
postings on the Usenet newsgroup network. Richard Depew accidentally posted 200 messages to
the news.admin.policy newsgroup, and in the aftermath readers of the group were making jokes
about the accident, when one person referred to the messages as "spam", coining the term that
would later be applied to similar incidents over email.
On January 18, 1994, the first large-scale deliberate USENET spam occurred. A message with
the subject “Global Alert for All: Jesus is Coming Soon” was cross-posted to every available
newsgroup. Its controversial message sparked many debates all across USENET.
In April 1994 the first commercial USENET spam arrived. Two lawyers from Phoenix, Canter
and Siegel, hired a programmer to post their "Green Card Lottery - Final One?" message to as
many newsgroups as possible. What made them different was that they did not hide the fact
that they were spammers. They were proud of it, and thought it was great advertising. They
even went on to write the book "How to Make a Fortune on the Information Superhighway:
Everyone's Guerrilla Guide to Marketing on the Internet and Other On-Line Services". They
planned on opening a consulting company to help other people post similar advertisements, but
it never took off.
In June 2003 Meng Weng Wong started the SPF-discuss mailing list and posted the very first
version of the "Sender Permitted From" proposal, that would later become the Sender Policy
Framework, a simple email-validation system designed to detect email spoofing as part of the
solution to spam.
The CAN-SPAM Act of 2003 was signed into law by President George W. Bush on December
16, 2003, establishing the United States' first national standards for the sending of commercial
email and requiring the Federal Trade Commission (FTC) to enforce its provisions. The
backronym CAN-SPAM derives from the bill's full name: "Controlling the Assault of Non-Solicited Pornography And Marketing Act of 2003". It plays on the word "canning" (putting
an end to) spam, as in the usual term for unsolicited email of this type; as well as a pun in
reference to the canned SPAM food product. The bill was sponsored in Congress by Senators
Conrad Burns and Ron Wyden.
In January 2004 Bill Gates of Microsoft announced that "spam will soon be a thing of the past."
In May 2004, Howard Carmack of Buffalo, New York was sentenced to 3½ to 7 years for
sending 800 million messages, using stolen identities. In May 2003 he also lost a $16 million
civil lawsuit to EarthLink.
On September 27, 2004, Nicholas Tombros pleaded guilty to charges and became the first
spammer to be convicted under the CAN-SPAM Act of 2003. He was sentenced in July 2007
to three years' probation, six months' house arrest, and fined $10,000.
On November 4, 2004, Jeremy Jaynes, rated the 8th-most prolific spammer in the world,
according to Spamhaus, was convicted of three felony charges of using servers in Virginia to
send thousands of fraudulent emails. The court recommended a sentence of nine years'
imprisonment, which was imposed in April 2005 although the start of the sentence was deferred
pending appeals. Jaynes claimed to have an income of $750,000 a month from his spamming
activities. On February 29, 2008 the Supreme Court of Virginia overturned his conviction.
On November 8, 2004, Nick Marinellis of Sydney, Australia, was sentenced to 4⅓ to 5¼ years
for sending Nigerian 419 emails.
On December 31, 2004, British authorities arrested Christopher Pierson in Lincolnshire, UK
and charged him with malicious communication and causing a public nuisance. On January 3,
2005, he pleaded guilty to sending hoax emails to relatives of people missing following the
Asian tsunami disaster.
1.2 Statement of Problem
The sheer volume of junk email being sent every day has normalized the occurrence of spam,
and this has become a major problem. In fact, spam emails grossly outnumber legitimate ones.
In May 2019 alone, spam emails constituted almost 85% of the total volume of emails sent
globally: roughly 367 billion spam emails per day, compared to a relatively paltry 64 billion
legitimate emails.
For many email users, especially those who have gotten used to seeing unsolicited emails in
their inbox day after day, junk mail has evolved from being a cause of alarm to something
that's more of a mundane matter.
Nowadays, many of us view spam as something normal, something that we just have to learn
to deal with. Most people no longer view it as the threat that it actually is.
It goes without saying that spam is a nuisance for all of us. Having to individually scroll through
and delete unwanted emails wastes valuable time and bandwidth. The time you spend filtering
emails in a day may not be much, but over the course of a year, it really does add up.
Junk emails waste a lot of time and effort that could have been used for something more
productive, but that’s not even the worst part. Spam is also a popular means of transferring
harmful malware and electronic viruses. And in an age where hacking tools and techniques
grow more and more sophisticated by the minute, spam-instigated security attacks become a
perpetual threat.
Spam emails are also an avenue for marketers to exploit your data’s privacy. Responding to
just one unsolicited email could put you in the mailing lists of many other companies. Before
you know it, your spam emails would’ve already multiplied tenfold.
A spam email made to look like it came from a legitimate entity that you trust (like
your bank or someone you know) could end up stealing sensitive information if you're not
careful. You could become a victim of identity theft, or you could lose all your money if you
mistakenly hand over your bank details.
1.3 Motivation
We live in an age where technology has revolutionized the way we communicate, but it has
also given rise to unwanted and often malicious emails, commonly known as spam. Spam
emails can not only clog up our inboxes but also pose a threat to our personal information and
security.
That's why it's essential to have a reliable and effective method to detect and filter out spam
emails. The traditional approach of using a single machine learning algorithm to detect spam
emails can be prone to issues such as overfitting and poor robustness, as discussed by Xue Ying
et al. (2018), especially when dealing with complex and non-linear relationships in the data.
Nayak, Amirali Jiwani and Rajitha (2021) made use of a hybrid strategy that combined Naive
Bayes (NB) and Decision Tree (DT) algorithms to identify spam e-mails. They were able to
obtain an accuracy of 88.12% using their hybrid approach.
That's where ensemble methods come in. By combining the predictions of multiple machine
learning algorithms, an ensemble method can provide a more robust and accurate
representation of the underlying relationships in the data. The result is a more reliable and
effective method for detecting spam emails.
In this project, the aim is to design an ensemble method that combines Logistic Regression,
Naive Bayes, and Support Vector Machine models. The goal is to improve the accuracy,
robustness and efficiency of the spam email detection system, providing better protection for
our personal information and security.
1.4 Objective
The objective of this project is to enhance both the accuracy of the data and the accuracy of the
results, as well as increase the stability of the model.
The objectives of this project work are to:
1. Design Logistic Regression, Naive Bayes, and Support Vector Machine models
independently through the use of machine learning techniques;
2. Design an Ensemble method that combines Logistic Regression, Naive Bayes, and
Support Vector Machine models using machine learning techniques;
3. Implement the ensemble model designed in (2).
1.5 Methodology
A detailed review of relevant literature on email spamming and prediction systems will be
carried out. An integral part of this research is data collection, so the first step consists of
gathering data. The next step is feature extraction: extracting the attributes, comparing them,
and reviewing the features that work best. The datasets consist of instances and attributes that
are important for spam email detection. The input data is fed into data preprocessing, which
involves removal of missing fields and outliers, normalization, and transformation of the data
into the appropriate form. The fourth step consists of the generation and evaluation of the
classification models using the machine learning technique. The machine learning technique
uses an ensemble method, the combination of the Logistic Regression, Naïve Bayes, and
Support Vector Machine algorithms, for detecting spam emails.
Ensemble method is a machine learning technique that combines the predictions of multiple
individual models to produce a more accurate prediction. The idea behind ensemble methods
is to leverage the strengths of multiple models and reduce their weaknesses. Ensemble methods
are commonly used in supervised learning, where the goal is to classify or predict a target value
based on input features.
The mathematical formula for Hard Voting ensemble method can be defined as follows:
Let's assume we have N individual models, each of which makes a prediction f_i for a sample
x. The final prediction for the sample x using Hard Voting is given by:
f_ensemble(x) = argmax(sum(f_i(x) == c))
In this formula, f_ensemble(x) is the ensemble prediction for sample x, f_i(x) is the prediction
made by the i-th model for sample x, c is a class label, argmax returns the class label that has
the highest number of votes, and sum is the sum of all the votes for a particular class label.
In other words, for each sample x, the individual models make a prediction for each class label.
The final prediction for the sample x is the class label that has the majority of votes.
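As a minimal illustration of this majority-vote rule, the following Python sketch (the function name and example labels are purely illustrative) returns the class label predicted by most of the individual models:

from collections import Counter

def hard_vote(predictions):
    # predictions: list of class labels, one per individual model,
    # e.g. ["spam", "ham", "spam"]
    votes = Counter(predictions)
    # most_common(1) returns [(label, count)]; keep only the label
    return votes.most_common(1)[0][0]

# Three hypothetical models vote on one email; the majority label wins
print(hard_vote(["spam", "ham", "spam"]))  # prints "spam"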
1.6 Contribution to Knowledge
Sharma et al. (2021) employed Decision Tree (DT) and K-Nearest Neighbor (K-NN) classifiers
to safeguard social media accounts from spam. The performance of the method was evaluated
using the UCI machine learning e-mail spam dataset. The Decision Tree classifier achieved a
classification accuracy of 90% and an F1-score of 91.5. The classifier suffers from relatively low
accuracy, which reduces the efficiency of the model. The contributions of this project
include:
1. Improved prediction accuracy: By combining multiple models, ensemble methods can
produce predictions that are more accurate than those of individual models.
2. Reduced overfitting: Overfitting occurs when a model is too complex and fits the
training data too well, resulting in poor generalization to new data.
3. Increased robustness: Ensemble methods are less likely to be affected by noise or
outliers in the data, as the predictions of multiple models are averaged out.
CHAPTER TWO
LITERATURE REVIEW
2.1 Introduction
This chapter provides the background that is essential to understand the basis of email
spamming and what it is all about.
Table 2.1. Some Metrics in Spam Email Detection
Email spam: Email spam, also known as junk email, refers to unsolicited email messages, usually sent in bulk to a large list of recipients.
Prediction: The process of using a trained machine learning model to classify new emails as either spam or not spam.
Anomaly detection: Any process that finds the outliers of a dataset.
Probability: Probability is one of the bedrocks of ML; it tells how likely an event is to occur. The value of probability always lies between 0 and 1. It is a core concept as well as a primary prerequisite to understanding ML models and their applications.
Machine Learning (ML) Model Operations: The implementation of processes to maintain ML models in production environments.
Kernel: A mathematical function used to transform the input data into a higher-dimensional space to facilitate linear separability. Common kernels used in SVM include linear, polynomial, radial basis function (RBF), and sigmoid.
Machine Learning (ML) significance: Machine learning is important because it gives enterprises a view of trends in customer behavior and business operational patterns, as well as supports the development of new products.
Hyperplane: A decision boundary in a high-dimensional space used to separate the data points into different classes. In SVM, the hyperplane is chosen such that it maximizes the margin between the support vectors.
Margin: The distance between the hyperplane and the closest data points, which are known as support vectors. The margin is used to define the optimal hyperplane that separates the classes with maximum separation.
Model training: The process of using labeled data (e.g., emails labeled as spam or not spam) to train machine learning models to identify spam emails.
Model ensemble: The combination of multiple models in an ensemble method, where the predictions of each model are combined to produce a final prediction.
Voting: A model combination technique in which each model in the ensemble casts a vote for the class (e.g., spam or not spam) it predicts for an email, and the class with the most votes is chosen as the final prediction.
Classifier: A machine learning model used to predict the class label of a given data point based on its features. SVM is a type of classifier used for both binary and multi-class classification problems.
Overfitting and Underfitting: Overfitting occurs when the model is too complex and fits the training data too well, resulting in poor generalization to new data. Underfitting occurs when the model is too simple and cannot capture the underlying patterns in the data.
Regularization: The process of adding a penalty term to the loss function to control the complexity of the model and prevent overfitting.
2.2 Views from different scholars
In recent times, unwanted commercial bulk email, called spam, has become a huge problem on
the internet. The person sending the spam messages is referred to as the spammer. There are
written reviews of products that are available on social networking sites. According to Liu and
Pang (2018), about 30–35% of online reviews are deemed spam. Nikolov (2021) and HaCohen-Kerner,
Miller and Yigal (2020) explained that before extracting features from text, it is crucial
to remove any unwanted information from the dataset. Such unwanted data within text datasets
includes punctuation marks, http links, symbols, and frequently used words with little meaning
(known as stop words).
According to Ahmad, Rafie and Ghorabie (2021), on a dataset of 2 million spam and non-spam
tweets, SVM outperformed other classifiers, including Multilayer Perceptron (MLP), NB and RF,
with a precision of 0.98 and an accuracy of 0.96.
The huge volume of spam mail flowing through computer networks has destructive
effects on the memory space of email servers, communication bandwidth, CPU power and user
time.
The menace of spam email increases yearly and is responsible for over 77%
of global email traffic, as reported by the Kaspersky Lab Spam Report in 2017. Users
who receive spam emails that they did not request find it very irritating. It has also resulted in
untold financial loss to many users who have fallen victim to internet scams and other
fraudulent practices of spammers who send emails pretending to be from reputable companies
with the intention of persuading individuals to disclose sensitive personal information such as
passwords, Bank Verification Numbers (BVN) and credit card numbers.
According to a report from Kaspersky Lab, in 2015 the volume of spam emails being sent
fell to a 12-year low. Spam email volume fell below 50% for the first time since 2003. In
June 2015, the volume of spam emails went down to 49.7%, and in July 2015 the figure was
further reduced to 46.4%, according to anti-virus software developer Symantec. This decline
was attributed to a reduction in the number of major botnets responsible for sending spam emails
in the billions. Malicious spam email volume was reported to be constant in 2015. The number of
spam mails detected by Kaspersky Lab in 2015 was between 3 million and 6 million.
Conversely, as the year was about to end, spam email volume escalated. Further reports from
Kaspersky Lab indicated that spam email messages carrying pernicious attachments such as
malware, ransomware, malicious macros, and JavaScript started to increase in December 2015.
That drift was sustained in 2016, and by March of that year spam email volume had quadrupled
with respect to that witnessed in 2015. In March 2016, the volume of spam emails discovered by
Kaspersky Lab was 22,890,956. By that time the volume of spam emails had skyrocketed to an
average of 56.92% for the first quarter of 2016. The latest statistics show that spam messages
accounted for 56.87% of e-mail traffic worldwide, and the most familiar types of spam emails
were healthcare and dating spam. Spam results in the unproductive use of resources on Simple
Mail Transfer Protocol (SMTP) servers, since they have to process a substantial volume of
unsolicited emails.
According to Salminen et al. (2022), an Amazon e-commerce dataset was used for testing and
training, where 40,000 samples for training and 10,000 samples for testing were gathered from
various categories such as Fashion, Beauty and Automotive.
2.3 How to recognize Spam Emails
Francis West (2018) gave insight into how to recognize spam emails.
At present, more than 95% of email messages sent worldwide are believed to be spam. Apart
from the amount of junk arriving in users' inboxes, spam can have more indirect and severe effects
on email services and their users. It is something that is unpleasant but also unavoidable. Spam
poses a security risk when phishing or malware attacks come along with it. Since spam comes
in many varieties, it can easily manipulate the recipient. Thus, it is necessary to bear the following
tips in mind to identify spam. These are the various ways to recognize spam:
1. Use anti-spam and anti-virus software: Once you install anti-spam software, you
can protect yourself from spam emails. This is software that not only tags emails as
spam but also blocks dangerous malware, viruses and phishing attacks.
2. Ensure that you know the sender before opening an email: Avoid any email sent by a
website that you don’t recognise or an email address from someone you don’t know.
There's a good chance that it is spam. Another possible sign of spam is a sender's
address that contains a string of numbers or a domain that you don't recognise (the
part after the "@"). Hence, be careful while opening emails, especially if they land in
the spam box.
3. Identify spoof email address: Attackers who want to try phishing attacks use spoof
email addresses to trick the recipient. To show that the email address is from a
recognisable source, the attackers may use characters which look like actual letters.
The attackers could also create fake sender addresses from trustworthy organisations.
E.g., they can send an email from "westtek@rixobalkangrill.co.uk", which sounds like
the email has come directly from Westtek. However, legitimate emails from Westtek
always end in @westtek.co.uk. Legitimate companies send emails that use your first
and last name as a personal salutation. Hence, an email is likely spam if the salutation is
addressed to a vague "Valued Customer." Ensure that you check whether received
emails have the complete contact address of the company.
4. Be careful about “urgent” or “threatening” language in the subject: A common
phishing tactic is to evoke a sense of fear or urgency in emails. Attackers might write
email subjects like your account has been suspended, or someone is trying to make
unauthorised login attempts. Due to this, recipients get worried, and they end up
opening the spam emails or links.
5. Check the subject for a spam alarm: Make sure you check the subject line before
opening an email. The subject sounds exciting and persuades you generally by offering
things like sales or investment opportunities, new treatments, requests for money,
information on packages you never ordered, etc. Usually, it sounds like you are
receiving a million bucks for free. These emails are definite signs of spam designed to
get you to click links that result in attacks.
6. Avoid requests for personal information: Occasionally a user is asked to "update user
information" or sign in "immediately". If an attacker sends a request for personal
information, then you know something's not right. These emails contain anonymous
links and it is advisable to avoid all such emails as far as possible. Legitimate
businesses never ask for personal information like credit card details or passwords via
email.
7. Look out for typographical mistakes: Attackers write spam in a way designed to get it past spam
filters, i.e. by making deliberate typographical errors so that the messages will not be detected.
For example, the spelling PayPal comes across as Paypal, and this way we believe that it is a
legitimate email. However, it is not. Hence, we should always check for spelling
mistakes, since trusted brands are very serious about their emails.
8. Spot unknown attachments or links: If you are not aware of the source, you should
avoid downloading links or attachments. There is a possibility that if you download
these links or attachments, virus or malware can enter your computer and destroy your
data. Malicious files mostly come in the .docx or .zip format.
9. Watch out for content that is too good to be true: Sometimes a spam email contains
unbelievable content, such as a claim that you will get a large sum of money if you
click a link. These emails are phishing scams designed to get information from you. These emails
come in various forms which encourage the recipient to provide personal information.
Make sure you dodge such spam emails.
Spam is dangerous and can leave your data or computer vulnerable to cyberattacks. Stay alert,
stay secure. You can stay safe by typing out the link that the mail contains to
validate/check the content it states, rather than clicking on the link. You can also use third-party
security sites to check the email for any virus or malware.
2.4 Machine learning
According to Andreas C. Müller and Sarah Guido (2016), machine learning is about extracting
knowledge from data. It is a research field at the
intersection of statistics, artificial intelligence, and computer science and is also known as
predictive analytics or statistical learning. The application of machine learning methods has in
recent years become ubiquitous in everyday life. From automatic recommendations of which
movies to watch, to what food to order or which products to buy, to personalized online radio
and recognizing your friends in your photos, many modern websites and devices have machine
learning algorithms at their core. When you look at a complex website like Facebook, Amazon,
or Netflix, it is very likely that every part of the site contains multiple machine learning models.
2.4.1 Reason for using Machine learning
According to Yuxi Hayden Liu (2017), machine learning is a term coined around 1960,
composed of two words: machine, corresponding to a computer, robot, or other device, and
learning, an activity or event pattern that humans are good at. So why do we need machine
learning, and why do we want a machine to learn like a human? There are many problems
involving huge datasets or complex calculations, for instance, where it makes sense to let
computers do all the work. In general, of course,
computers and robots don't get tired, don't have to sleep, and may be cheaper. There is also an
emerging school of thought called active learning or human-in-the-loop, which advocates
combining the efforts of machine learners and humans. The idea is that there are routine boring
tasks more suitable for computers, and creative tasks more suitable for humans. According to
this philosophy, machines are able to learn, by following rules (or algorithms) designed by
humans and to do repetitive and logic tasks desired by a human.
2.4.2 Data Mining
A notable study is by Liu et al. (2019) with the objective of developing a deep learning-based
framework for data mining. The authors proposed a deep learning-based framework that
combined convolutional neural networks and recurrent neural networks to process and analyze
large datasets. The methodology included the collection of a large dataset, preprocessing and
cleaning of the data, and the application of the deep learning-based framework. The results
showed that the deep learning-based framework performed better than traditional machine
learning algorithms in terms of accuracy and efficiency. The limitation of the work was the
requirement for a large dataset and computational resources.
Finally, a study by Li et al. (2022) focused on the application of transfer learning in data mining.
The objective of the study was to investigate the potential of transfer learning to improve the
performance of machine learning algorithms in data mining. The authors proposed a transfer
learning-based framework that used pre-trained models to extract features from the data and
applied machine learning algorithms to the extracted features. The methodology included the
collection of a large dataset, preprocessing and cleaning of the data, and the application of the
transfer learning-based framework. The results showed that the transfer learning-based
framework outperformed traditional machine learning algorithms in terms of accuracy and
runtime efficiency. The limitation of the work was the need for pre-trained models to be
available for the specific domain of the dataset.
2.4.3 Classification
Classification, a data mining technique, is the process of classifying and predicting the value
of a class attribute based on its predictor value (Romero et al., 2008). A predictor is an attribute
used to predict a new record, e.g. Spam email, legit email, etc. There are two main categories
of classification models used for prediction: descriptive and predictive classification models.
Descriptive models find relationships or models in the data and even examine the properties of
the data being examined. Examples of techniques that support this include summarization,
clustering, association rules, etc. Predictive models, on the other hand, predict unknown data
values by applying a supervised learning function to known values (Jothi et al., 2015). The
known data is historical. Examples of such techniques include Time series analysis, Prediction,
Classification, Regression, etc. Our interest in this study lies in the predictive classification
model, where the model is based on the characteristics of historical data and is used to predict
future trends (Al-radaideh and Nagi, 2012). Many classification algorithms are used to classify
categorical data, e.g. Decision Tree, K-Nearest Neighbor, Naïve Bayes, SVM, J48, Random
Forest, Logistic Regression, and many more. In this study, we focus on Naïve Bayes
classification techniques. The Naïve Bayes classifier provides an analytical tool that defines a
set of model rules that categorize data into different classes using a probabilistic approach.
First, it creates a model for each class attribute as a function of the other attributes in the record.
Then it attempts to assign classes to each data set, using the ready-made models for
unseen and even new datasets (Manjusha et al., 2015). This analysis helps to better
understand the data set and predict future trends (Ameta and Jain, 2017).
2.4.4 Predictive Model
In recent years, machine learning has been widely used in various applications to make
predictions based on data. A number of studies have been published between 2018 and 2022
that have used machine learning techniques for predictive modeling.
One example is the study by Li et al. (2020) who proposed a hybrid machine learning approach
for stock price prediction. The authors combined the random forest algorithm with a deep
neural network to predict the stock price movement of various companies. The results showed
that the proposed method achieved high accuracy in stock price prediction, outperforming
traditional machine learning techniques.
Another study by Zhang et al. (2019) used machine learning techniques to predict the success
of crowdfunding campaigns. The authors used decision trees, random forests, and gradient
boosting algorithms to make predictions based on factors such as the funding goal, campaign
length, and the number of backers. The results showed that the gradient boosting algorithm had
the best performance in terms of accuracy.
In a study by Kim et al. (2021), machine learning techniques were used to predict the risk of
cardiovascular disease in patients. The authors used logistic regression, random forests, and
gradient boosting algorithms to make predictions based on patient data such as age, blood
pressure, and cholesterol levels. The results showed that the gradient boosting algorithm had
the highest accuracy in predicting cardiovascular disease risk.
One limitation of these studies is that they typically used a limited set of data, which may not
represent the full range of conditions in real-world applications. Additionally, the accuracy of
the predictive models may be influenced by the choice of machine learning techniques and the
quality of the data used for training.
In conclusion, the literature review shows that machine learning techniques have been used for
predictive modeling in various applications, with promising results in terms of accuracy.
However, more research is needed to address the limitations and to determine the best
techniques for specific predictive modeling tasks.
2.4.5 Logistic Regression
Logistic Regression is a machine learning algorithm that is used for classification problems;
it is a predictive analysis algorithm based on the concept of probability (Pant, 2019).
Logistic regression is a classification technique borrowed by machine learning from the field
of statistics. Logistic regression is a statistical method for analyzing a data set in which one or
more independent variables determine the outcome. The intention behind using logistic
regression is to find the best fitting model to describe the relationship between the dependent
and the independent variable (Raj, 2020).
2.4.6 Support Vector Machine
Support Vector Machines (SVM) is a type of supervised machine learning algorithm used for
classification and regression analysis. The primary goal of SVM is to find the best boundary
or hyperplane that separates the data points into different classes or predicts the target value in
regression problems. The boundary is determined by finding the maximum margin between
the data points of different classes or between the target values and the predicted values. The
data points closest to the boundary are called support vectors, and the boundary is referred to
as the maximum margin hyperplane. SVM has been applied in areas such as cancer genomics
and proteomics (Huang, 2018).
2.4.7 Naïve Bayes
It is a classification technique based on Bayes' theorem, assuming independence between
predictors. In simple terms, the Naive Bayes Classifier assumes that the existence of a certain
trait in one class is not related to the existence of another trait (Ray, 2017). Naive Bayesian
classifier is a simple probabilistic classifier that works by applying the Bayes’ theorem along
with Naive assumptions about feature independence (Wang et al., 2010).
2.5 Related Work
The related studies reviewed are summarized below by author, objective, and methodology.
1. Maguluri et al. (2019)
Objective: The study aimed to address the problem of spam emails.
Methodology: A prediction system for identifying spam emails was developed using four data mining techniques: Decision Trees, Random Forest, Logistic Regression, and Gradient Boosting.
2. Ahmad et al. (2021)
Objective: The paper is devoted to achieving higher accuracy and precision.
Methodology: SVM outperformed the other classifiers used, Multilayer Perceptron (MLP), NB and RF.
3. Kumar et al. (2018)
Objective: The study examined the factors affecting stock prediction.
Methodology: Machine learning techniques were used for this task. Five models were developed, based on SVM, Random Forest, KNN, Naïve Bayes and Softmax.
4. Jayadi et al. (2019)
Objective: Developed an employee performance prediction system using Naïve Bayes.
Methodology: The result shows that Naïve Bayes correctly classified instances with an accuracy as high as 95.48%.
5. Li et al. (2021)
Objective: The authors used a combination of convolutional neural networks (CNNs) and long short-term memory (LSTM) networks to analyze the text and structure of spam emails.
Methodology: The results of their experiments showed that the proposed method outperformed traditional machine learning approaches and achieved an accuracy of 97.5%.
6. Kim et al. (2020)
Objective: Sought to identify spam emails by analyzing the relationships between sender, recipient, and email content.
Methodology: The results of their experiments showed that the GCN-based approach achieved an accuracy of 95.1%.
7. Masoom et al. (2020)
Objective: Developed a system to identify bullying using four machine learning algorithms: K-Nearest-Neighbor, SVM (Support Vector Machine), Random Forest Regression, and Logistic Regression.
Methodology: The purpose of the research is to identify bullies via informed surveys, using data from students of colleges and schools, take the results to concerned authorities or guardians, and list out ways to eradicate bullying.
8. José et al. (2019)
Objective: Developed a system to detect harassment using machine learning and fuzzy logic techniques.
Methodology: A fuzzy logic system, based on a set of linguistic input variables, determines whether cyberbullying signs have been detected. A machine learning system deduces from previous training whether the user could be a victim of cyberbullying.
9. Zhang et al. (2019)
Objective: Aimed to improve the accuracy of spam email detection by using a combination of natural language processing (NLP) techniques and deep learning.
Methodology: The authors used a convolutional neural network (CNN) to extract features from the text of the emails and a long short-term memory (LSTM) network to analyze the structure of the emails. The results of their experiments showed that the proposed method achieved an accuracy of 99.4%.
10. Chen et al. (2018)
Objective: Aimed to improve the accuracy of spam email detection by using a combination of NLP techniques and deep learning.
Methodology: The authors used a convolutional neural network (CNN) to extract features from the text of the emails and a long short-term memory (LSTM) network to analyze the structure of the emails. The results of their experiments showed that the proposed method achieved an accuracy of 98.7%.
11. Rafiah et al. (2018)
Objective: Developed a heart disease prediction system using three data mining techniques: Decision Trees, Naïve Bayes, and Neural Network.
Methodology: The result obtained after prediction using the Cleveland Heart Disease Database indicated that Naïve Bayes performed well, followed by Neural Network and Decision Trees.
12. Xu et al. (2021)
Objective: Social network spam detection based on ALBERT and a combination of Bi-LSTM with self-attention.
Methodology: Logistic regression, naive Bayes, and SVM were each combined with ALBERT to construct the spam detection model, demonstrating the superiority of the LSTM neural network in this task.
13. Zhuang et al. (2021)
Objective: Used a deep belief network to demote web spam (Future Generation Computer Systems).
Methodology: The paper proposed a preference-based learning-to-rank method to address two major issues of score-propagation-based web spam demotion algorithms. The proposal consists of a preference function and an ordering algorithm.
14. Tong et al. (2021)
Objective: A content-based Chinese spam detection method using a capsule network with long-short attention.
Methodology: Experimental results show that the model outperformed current mainstream methods such as TextCNN, LSTM and even BERT in characterization and detection; it achieved an accuracy as high as 98.72% on an unbalanced dataset and 99.30% on a balanced dataset.
15. Tanujay et al. (2022)
Objective: A machine-learning-assisted security analysis of 5G-network-connected systems.
Methodology: Attacks at the network, protocol, and application layers are combined to generate complex attack vectors. In a case study, these attack vectors are used to find four novel security loopholes in WhatsApp running on a 5G network.
CHAPTER THREE
SYSTEM ANALYSIS AND DESIGN
3.1 The Design of the proposed system
To design the spam email detection system, the Naïve Bayes, Logistic Regression, and Support
Vector Machine models and the Python programming language are used to predict spam emails:
the models are built from a dataset of attributes, the outcome is predicted, and the prediction
accuracy is evaluated using a confusion matrix.
Fig 3.1 shows the proposed architecture for the system, which proceeds through the following
stages: Data Collection, Dataset Exploration, Data Preprocessing, Feature Extraction, Predictive
Analysis (Ensemble method), Training, Testing, and Evaluation.
Fig 3.1: Proposed architecture for spam email prediction system
3.2 The collection of Datasets
A dataset is a collection of related data composed of separate elements that can be manipulated
as required. Subjects for this study were selected from spam email attributes. The dataset used
in this work was obtained from an openly accessible source. The data consists of SMS messages,
which can also serve as email messages, classified into ham and spam. More specifically, the
dataset was made available on a Kaggle site in CSV format. The messages contain more than two
attributes, but only two main attributes, v1 (the label) and v2 (the message text), are used. The
data is divided into two parts for training and testing. The structured data extracted from the
website was formatted as a CSV file and then imported into Microsoft Excel. Fig 3.2 shows a
sample of the structured dataset.
Fig 3.2: Email Datasets
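As an illustrative sketch, the dataset described above can be loaded with pandas; the file name spam.csv, the latin-1 encoding, and the column renaming are assumptions based on the commonly distributed version of this Kaggle dataset:

import pandas as pd

# File name and encoding are assumptions; adjust to the downloaded file.
df = pd.read_csv("spam.csv", encoding="latin-1")

# Keep only the two attributes used in this work: v1 (label) and v2 (message text).
df = df[["v1", "v2"]].rename(columns={"v1": "label", "v2": "message"})
print(df.head())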
3.3 Dataset exploration
Dataset exploration in machine learning involves analyzing and understanding the
characteristics of the data that will be used to train a model. This includes analyzing the size,
structure, and distribution of the data, as well as identifying any potential issues or biases that
may impact the performance of the model. During this stage, data preprocessing and cleaning
may also be performed, such as handling missing values or outliers. Additionally,
visualizations and statistical analyses can be used to gain insights into the data and inform the
selection of appropriate models and feature engineering techniques.
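The kind of exploration described above can be sketched in a few lines of pandas, assuming the dataframe df loaded in the previous sketch:

print(df.shape)                            # number of rows and columns
print(df["label"].value_counts())          # class distribution: ham vs spam
print(df.isnull().sum())                   # check for missing values
print(df["message"].str.len().describe())  # message length statistics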
3.4 Data preprocessing
This is a data mining technique that involves transforming raw data into an understandable
format. Raw data will likely contain inconsistencies, errors, out-of-range values, impossible
data combinations, missing values, or, more fundamentally, data unsuitable for a data mining
method. Moreover, the growing rate of data in modern business applications, science, industry,
and academia demands more sophisticated frameworks to analyze it. With data preprocessing,
the unusable can be made usable by transforming the data to meet the requirements of each
data mining algorithm. The data preprocessing stage can take a great deal of processing time.
The expected outcome of a reliable chain of data preprocessing steps is a final dataset that can
be considered correct and suitable for advanced data mining algorithms (Nitta et al., 2018).
3.4.1 Latin-1 encoding
Text data in emails can contain special characters and symbols, and it is important to encode
these characters in a way that the machine learning algorithm can understand and process. The
Latin-1 encoding is one way of encoding the text data, and it can be used in conjunction with
other preprocessing steps such as converting the text to lowercase, removing stop words, and
stemming the words.
The Latin-1 encoding is a character encoding that represents the characters used in the Western
European languages. In the context of spam email detection, the Latin-1 encoding can be used
to encode the text of an email before using it as input to a machine learning algorithm.
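A minimal sketch of these preprocessing steps (lowercasing, punctuation removal, stop word removal, and stemming) is shown below; it assumes the NLTK library is available and that the dataframe df from the earlier sketches holds the Latin-1 decoded messages:

import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# nltk.download("stopwords") may be needed on first use.
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text):
    # Lowercase, strip punctuation, drop stop words, and stem the remaining words.
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    words = [stemmer.stem(w) for w in text.split() if w not in stop_words]
    return " ".join(words)

df["clean"] = df["message"].apply(preprocess)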
3.5 Feature extraction
Feature extraction is the process of transforming raw data into a set of features that can be used
as input to a machine learning model. It involves selecting the most relevant and informative
features from the raw data, reducing the dimensionality of the data, and transforming the
features into a format that can be used by the machine learning algorithm.
The goal of feature extraction is to identify the features that are most indicative of spam emails
and to use those features as input to the machine learning algorithm. By selecting the most
relevant and informative features, feature extraction can help to improve the accuracy and
performance of the machine learning model (Ruskanda 2019).
3.5.1 Bag of Words (BoW)
Bag of Words (BoW) is a widely used and straightforward technique for feature extraction in
NLP. It represents a document as a collection of words, and each document is transformed into
a vector that shows the frequency of each unique word in the vocabulary. The frequency of
words in the document, including repeated words, are used to create the BoW representation.
Barushka and Hajek (2019) applied this method in the creation of a spam review detection
model that utilized n-grams and the skip-gram word embedding method.
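As a small sketch of the Bag of Words step, assuming scikit-learn is used, a CountVectorizer can turn the cleaned messages into the document-term count matrix that the classifiers consume (the column names follow the earlier sketches):

from sklearn.feature_extraction.text import CountVectorizer

# Build a Bag of Words representation from the cleaned messages.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df["clean"])   # document-term count matrix
y = (df["label"] == "spam").astype(int)     # 1 for spam, 0 for ham
print(X.shape)                              # (number of messages, vocabulary size)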
3.6 Training and Testing
Principally, the dataset is separated into two parts: a test dataset and a training dataset. The
training data is used to build the machine learning model, and the model is then tested with the
test dataset to check its accuracy, precision and many other factors.
3.6.1 Training
At the heart of the machine learning process is the training of the model. The bulk of the
"learning" is done at this stage. 80% of the dataset was allocated for training to teach the model
to determine whether an email is spam or not.
3.6.2 Testing
Testing is referred to as the process where the performance of a fully trained model is evaluated
on a testing set. That is why 20% of the dataset, set aside for evaluation, is used to check the
model's proficiency. The testing set, consisting of a set of testing samples, should be kept
separate from both the training and validation sets, but it should follow the same probability
distribution as the training set.
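The 80/20 split described above can be sketched with scikit-learn's train_test_split, assuming the feature matrix X and labels y from the earlier sketches:

from sklearn.model_selection import train_test_split

# 80% of the data for training, 20% held out for testing, stratified by class.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)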
3.7 Hard voting ensemble method
Hard voting is a type of ensemble learning method in machine learning and data science. It is
a simple yet powerful method for combining the predictions of multiple individual models in
order to produce a final prediction.
In a hard voting ensemble, the predictions of the individual models are combined by taking the
majority vote. This means that the final prediction is the most common prediction made by the
individual models. The hard voting ensemble is usually used when the individual models in the
ensemble each output a single discrete class label rather than a probability.
3.7.1 Algorithm and flowchart for spam email detection using Hard voting ensemble
method
3.7.1.1 Algorithm for spam email detection using Hard voting ensemble method
1. Preprocess the email data: Perform data preprocessing steps such as removing
duplicates, handling missing values, normalizing the data, and converting categorical
variables into numerical ones.
2. Perform feature extraction: Extract relevant and informative features from the email
data, such as the presence or absence of certain words, the frequency of certain words,
or the sender's email address
3. Split the data into training and test sets: Divide the preprocessed and transformed data
into a training set and a test set.
4. Train multiple classifiers on the training data: Train different classifiers such as SVM,
Naive Bayes, and logistic regression on the training data.
5. Make predictions on the test data using each classifier: Each classifier will make its
own prediction for each test email.
6. Collect the predictions from each classifier: For each test email, you will have multiple
predictions from different classifiers.
7. Choose the majority vote as the final prediction: For each test email, select the class
label that has been predicted most frequently by the classifiers. If a majority of
classifiers predict an email as spam, the final prediction will be "spam"; otherwise, the
final prediction will be "not spam".
8. Evaluate the accuracy of the ensemble classifier: Compare the final predictions of the
ensemble classifier with the true class labels to calculate the accuracy.
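The steps above can be sketched with scikit-learn's VotingClassifier, assuming scikit-learn is the library used and that X_train, X_test, y_train and y_test come from the earlier preprocessing and splitting sketches; this is an illustrative sketch rather than the exact implementation:

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Combine the three classifiers through hard (majority) voting.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("nb", MultinomialNB()),
        ("svm", LinearSVC()),
    ],
    voting="hard",
)
ensemble.fit(X_train, y_train)

# Majority vote of the three classifiers on the test set.
y_pred = ensemble.predict(X_test)
print("Ensemble accuracy:", accuracy_score(y_test, y_pred))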
3.7.1.2 Flowchart of the Model
A flowchart is the diagrammatic representation of an algorithm. For the proposed model, the
flowchart is given in Fig 3.3: after data preprocessing, the data is split into training and test
data; the Support Vector Machine, Logistic Regression, and Naïve Bayes models are trained and
combined through the Hard Voting ensemble method, which predicts whether each test email is
spam or ham.
Fig 3.3: Flowchart of the model
3.8 Logistic Regression
Logistic Regression is a type of generalized linear model that is used for binary classification
problems, where the target variable can take one of two possible values, such as "Yes" or "No".
It is used to model the relationship between a set of independent variables (also known as
features or predictors) and the probability of a binary outcome.
The basic idea behind logistic regression is to use a mathematical function, known as the
logistic function or sigmoid function, to model the probability that a given input belongs to a
certain class. The logistic function maps any real-valued number to a value between 0 and 1,
which can be interpreted as a probability.
This logistic regression formula can be written generally in linear equation form as:
ln(P/(1-P)) = ẞ0 + ẞ1X1 + ẞ2X2 + …
where P is the probability of the event, ẞ0, ẞ1, ẞ2, … are the regression coefficients, and X1,
X2, … are the values of the independent variables. Solving for the probability gives:
Probability of event (P) = 1/(1 + e^-(ẞ0 + ẞ1X1 + ẞ2X2 + …))
where P is an output between 0 and 1 (the probability estimate) and e is the base of the natural
logarithm.
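As a small numerical illustration of this formula (the coefficient and feature values below are purely hypothetical), the probability can be computed with the sigmoid function in Python:

import math

def predicted_probability(intercept, coefficients, features):
    # Linear combination ẞ0 + ẞ1X1 + ẞ2X2 + ... passed through the sigmoid.
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values: ẞ0 = 0.5, ẞ1 = 0.8, ẞ2 = -1.2, X1 = 2.0, X2 = 1.0
print(predicted_probability(0.5, [0.8, -1.2], [2.0, 1.0]))  # about 0.71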
Logistic regression is used to predict a binary outcome based on a set of independent variables.
A binary outcome is one where there are only two possible scenarios: either the event
happens (1) or it does not happen (0). Independent variables are those variables or factors which
may influence the outcome (or dependent variable).
So: Logistic regression is the correct type of analysis to use when you’re working with binary
data. You know you’re dealing with binary data when the output or dependent variable is
dichotomous or categorical in nature; in other words, if it fits into one of two categories (such
as “yes” or “no”, “pass” or “fail”, and so on).
So, in order to determine if logistic regression is the correct type of analysis to use, we check
for the following:
1. Is the dependent variable dichotomous? In other words, does it fit into one of two set
categories? Remember: The dependent variable is the outcome; the thing that you’re
measuring or predicting.
2. Are the independent variables either interval, ratio, or ordinal? See the examples above
for a reminder of what these terms mean. Remember: The independent variables are
those which may impact, or be used to predict, the outcome.
In addition to the two criteria mentioned above, there are some further requirements that must
be met in order to correctly use logistic regression. These requirements are known as
“assumptions”; in other words, when conducting logistic regression, you’re assuming that these
criteria have been met. Let’s take a look at those now.
3.8.1 Advantages of logistic regression
1. Logistic regression is much easier to implement than other methods, especially in the
context of machine learning: A machine learning model can be described as a
mathematical depiction of a real-world process. The process of setting up a machine
learning model requires training and testing the model. Training is the process of
finding patterns in the input data, so that the model can map a particular input (say, an
image) to some kind of output, like a label. Logistic regression is easier to train and
implement as compared to other methods.
2. Logistic regression works well for cases where the dataset is linearly separable: A
dataset is said to be linearly separable if it is possible to draw a straight line that can
separate the two classes of data from each other. Logistic regression is used when your
Y variable can take only two values, and if the data is linearly separable, it is more
efficient to classify it into two separate classes.
3. Logistic regression provides useful insights: Logistic regression not only gives a
measure of how relevant an independent variable is (i.e. the coefficient size), but also
tells us about the direction of the relationship (positive or negative). Two variables are
said to have a positive association when an increase in the value of one variable also
increases the value of the other variable. For example, the more hours you spend
training, the better you become at a particular sport. However: It is important to be
aware that correlation does not necessarily indicate causation! In other words, logistic
regression may show you that there is a positive correlation between outdoor
temperature and sales, but this doesn’t necessarily mean that sales are rising because
of the temperature.
3.8.2 Disadvantages of logistic regression
1. Logistic regression fails to predict a continuous outcome. Let’s consider an example
to better understand this limitation. In medical applications, logistic regression cannot
be used to predict how high a pneumonia patient’s temperature will rise. This is
because the scale of measurement is continuous (logistic regression only works when
the dependent or outcome variable is dichotomous).
2. Logistic regression assumes linearity between the predicted (dependent) variable and
the predictor (independent) variables. Why is this a limitation? In the real world, it is
highly unlikely that the observations are linearly separable. Let’s imagine you want to
classify the iris plant into one of two species: setosa or versicolor. In order to
distinguish between the two categories, you’re going by petal size and sepal size. You
want to create an algorithm to classify the iris plant, but there’s actually no clear
distinction; a petal size of 2cm could qualify the plant for both the setosa and
versicolor categories. So, while linearly separable data is the assumption for logistic
regression, in reality, it’s not always truly possible.
3. Logistic regression may not be accurate if the sample size is too small. If the sample
size is on the small side, the model produced by logistic regression is based on a
smaller number of actual observations. This can result in overfitting. In statistics,
overfitting is a modeling error which occurs when the model is too closely fit to a
limited set of data because of a lack of training data. Or, in other words, there is not
enough input data available for the model to find patterns in it. In this case, the model
is not able to accurately predict the outcomes of a new or future dataset.
3.8.3 Reasons for using Logistic Regression
1. Logistic regression is much easier to implement than other methods, especially in the
context of machine learning.
2. Logistic regression works well for cases where the dataset is linearly separable.
3. Logistic regression provides useful insights.
3.9 Ensemble method
Ensemble methods in machine learning are techniques that combine multiple individual models
to produce improved predictions compared to those of a single model. The basic idea behind
ensemble methods is that by combining the predictions of multiple models, we can reduce the
variance, bias, or improve the overall performance of the model. There are several types of
ensemble methods, including:
1. Bagging (Bootstrap Aggregating): This involves creating multiple instances of the same
model with different random subsets of the training data. The final prediction is made
by taking the average or majority vote of the predictions of all models.
2. Boosting: This involves training multiple models sequentially, where each model tries
to correct the errors of the previous model. The final prediction is made by taking the
weighted average of the predictions of all models.
3. Random Forest: This is an extension of bagging that involves creating multiple decision
trees and combining their predictions.
4. Stacking: This involves training multiple models and then using the predictions of those
models as features for another model.
5. Adaboost: This is a type of boosting algorithm that assigns higher weights to samples
that are difficult to predict, so that subsequent models pay more attention to them.
6. Gradient Boosting: This is another type of boosting algorithm that combines multiple
decision trees to make predictions. The final prediction is made by taking a weighted
average of the predictions of all the trees, where the weights are determined by the
gradient descent optimization algorithm.
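To make these concrete, the following is a minimal illustrative sketch (not the project's code) of how some of these ensembles can be constructed and compared with scikit-learn; the synthetic dataset from make_classification merely stands in for extracted email features and is an assumption for illustration.

# Illustrative sketch only: common ensemble classifiers in scikit-learn,
# compared with 5-fold cross-validation on a synthetic two-class dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for extracted email features (assumption)
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

models = {
    "bagging": BaggingClassifier(n_estimators=50),              # bootstrap samples + majority vote
    "random forest": RandomForestClassifier(n_estimators=100),  # bagged trees + random feature subsets
    "adaboost": AdaBoostClassifier(n_estimators=50),            # re-weights hard-to-classify samples
    "gradient boosting": GradientBoostingClassifier(),          # trees fitted to residual errors
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)                  # 5-fold cross-validation accuracy
    print(name, round(scores.mean(), 4))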
3.9.1 Voting ensemble method
Voting ensemble is a type of ensemble learning in which multiple models are combined to
make a prediction. The basic idea behind voting ensemble is to combine the outputs of
multiple models to produce a more accurate and robust prediction.
There are two main types of voting ensembles:
1. Hard voting ensemble: In this type of voting ensemble, the outputs of multiple models
are combined through a majority vote. The final prediction is made based on the
majority class predicted by the individual models. For example, if three models predict
class 1, class 2, and class 1 respectively, then the final prediction will be class 1.
2. Soft voting ensemble: In this type of voting ensemble, the outputs of multiple models
are combined through a weighted average of their predictions. The weights are
determined based on the accuracy of the individual models. For example, if one model
is more accurate than the others, it will have a higher weight in the final prediction. Soft
voting ensembles are often used with models that produce continuous output, such as
regression or probability estimates.
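As an illustration of the difference between the two voting schemes, the following is a minimal sketch (again on synthetic stand-in data, with arbitrarily chosen base models and weights, all assumptions) of hard and soft voting with scikit-learn's VotingClassifier.

# Illustrative sketch only: hard vs soft voting ensembles in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

estimators = [("lr", LogisticRegression(max_iter=1000)),
              ("nb", GaussianNB()),
              ("dt", DecisionTreeClassifier())]

# Hard voting: each model casts one vote and the majority class wins.
hard = VotingClassifier(estimators=estimators, voting="hard").fit(X_train, y_train)

# Soft voting: predicted class probabilities are averaged (optional weights, assumed here).
soft = VotingClassifier(estimators=estimators, voting="soft",
                        weights=[2, 1, 1]).fit(X_train, y_train)

print("hard voting accuracy:", hard.score(X_test, y_test))
print("soft voting accuracy:", soft.score(X_test, y_test))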
3.9.2 Advantages of ensemble method
Ensemble methods have several advantages over single-model approaches in machine
learning:
1. Improved accuracy: Ensemble methods can often produce more accurate predictions
compared to individual models, as they combine the strengths of multiple models and
reduce the impact of any single model's weaknesses.
2. Reduced overfitting: Ensemble methods can reduce the risk of overfitting, as they
average out the predictions of multiple models and thereby reduce the impact of any
one model's tendency to fit the training data too closely.
3. Increased stability: Ensemble methods can be more stable and robust to changes in the
training data compared to individual models, as they average out the predictions of
multiple models and thereby reduce the impact of any one model's fluctuations.
4. Increased diversity: Ensemble methods can increase the diversity of models used for
prediction, which can help to capture a wider range of patterns in the data and increase
the overall accuracy of the predictions.
5. Improved interpretability: In some cases, ensemble methods can provide more
interpretable predictions compared to individual models, as they provide a clear
summary of the predictions made by multiple models.
3.9.3 Disadvantages of ensemble method
While ensemble methods offer several advantages over single-model approaches, there are
also some disadvantages to consider:
1. Increased computational cost: Ensemble methods can be computationally expensive, as
they require training multiple models and combining their predictions. This can
increase the time and resources needed to build and use an ensemble model.
2. Complexity: Ensemble methods can be more complex and harder to understand
compared to single-model approaches, as they require combining the predictions of
multiple models. This can make it more difficult to interpret the results and understand
why a particular prediction was made.
3. Difficult to train: Ensemble methods can be more difficult to train compared to single-model approaches, as they require selecting and combining multiple models in a way that improves overall performance.
4. High variance: Ensemble methods can have high variance, as they can be sensitive to
the specific models selected and their weighting. This can make it challenging to ensure
consistent performance across different datasets.
5. Potential for increased bias: Ensemble methods can introduce bias if not properly
constructed, as the combination of predictions from multiple models can amplify any
biases present in the individual models.
3.10 Evaluation
With the model trained, it needs to be tested to see whether it would perform well in real-world
situations. This puts the model in a scenario where it encounters cases that were not a part
of its training. In this research, that means presenting it with emails that are completely new
to the model. Through its training, however, the model should be capable of generalizing and
determining whether a new email is spam or not. Model evaluation metrics are used to measure
the goodness of fit between model and data, to compare different models in the context of model
selection, and to estimate how accurate the model's predictions are expected to be.
CHAPTER FOUR
IMPLEMENTATION AND TESTING
4.1 Introduction
A system is not useful unless it is implemented and tested to ensure that it works correctly
and that all its functionalities are in place and effective. This chapter highlights the
minimum system requirements (both hardware and software) used to implement the developed
model; the model is then tested and evaluated for accuracy against the results obtained.
4.2 System Requirements
The following are the minimum system requirements to ensure the smooth and quick running
of the model to be implemented.
4.2.1 Hardware Requirements
1. Architecture: 32-bit/64-bit personal computer with a minimum 1.75 GHz processor
2. Memory: Minimum of 2GB RAM
3. Hard Disk: 100GB of free Hard disk space
4. Mouse and Keyboard
4.2.2 Software Requirements
1. Microsoft Windows 7/8/10 Operating System
2. Microsoft Excel
3. Jupyter Notebook
4.3 Model Implementation
4.3.1 Model Prediction
The model was implemented by collating the analyzed datasets and then training on those datasets
using a frequency table, a likelihood table, and the conditional probability of each instance of the
analyzed datasets. Fig 4.1 shows the spam dataset, processed into CSV format in
Microsoft Excel.
Fig 4.1: Email Datasets
After the analysis of the datasets, the normalizing constant and the prediction percentage were
derived using Logistic Regression, Naïve Bayes and Support Vector Machine.
Recalling,

Accuracy = (TP + TN) / (TP + FP + TN + FN)

where:
TP – True positive
TN – True negative
FP – False positive
FN – False negative
4.3.2 Model Prediction Evaluation
After deriving a prediction probability percentage from our analysis of attributes from the
trained datasets. We then evaluate the accuracy of the predictions made from the test datasets
as a percentage is correct out of all the predictions made and tie together to form our model.
4.3.3 Classification
The classification involves two processes:
i. Load the dataset in CSV format.
ii. Convert the attributes loaded from the trained dataset from strings into numbers so that we
can work with them. This is implemented using Python in the Jupyter Notebook, which provides
everything needed for the programming.
4.4 Implementation of Logistic Regression on Data
The Jupyter Notebook was used as the environment to run the Python code implementing logistic
regression on our data. Several Python libraries were used for the implementation, such as
pandas, NumPy, sklearn, and pickle. Pandas is a Python library used for data processing and for
viewing data. NumPy is a Python library for numerical computing that represents data as arrays.
Sklearn (scikit-learn) is the Python library that provides the machine learning classifiers used.
Pickle is a Python module used to save models after training.
In the process of executing this work the following steps were taken:
1. Import and load data
2. Data encoding
3. Training using Logistic Regression
4. Evaluate model based on prediction
5. Comparison between Naïve Bayes, Logistic Regression and Support Vector Machine
6. Save model with pickle
1. Import the Dependencies
All the libraries needed for this program were imported, and the pandas function pd.read_csv
was used to load the dataset into the program, as shown in Fig 4.2.
Fig 4.2: Datasets Load Interface
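A minimal sketch of this step is given below; the file name spam.csv, the latin-1 encoding, and the column names used in later sketches are assumptions for illustration and may differ from the actual project dataset.

# Sketch of step 1 (file name and encoding assumed): import the dependencies
# and load the raw CSV into a pandas DataFrame.
import numpy as np
import pandas as pd
import pickle

data = pd.read_csv("spam.csv", encoding="latin-1")  # load the dataset (file name assumed)
print(data.shape)                                   # number of rows and columns
print(data.head())                                  # preview the first few records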
2. Data collection and cleaning
Data is loaded from a CSV file into a pandas DataFrame named data.
The data is printed out for visibility, and all null values are removed for improved test accuracy.
Fig 4.3: Data Collection
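A minimal sketch of the cleaning described above, continuing from the previous sketch:

# Sketch of step 2: inspect the data and remove rows containing null values.
print(data.isnull().sum())   # count missing values per column
data = data.dropna()         # drop rows with null values for improved test accuracy
print(data.shape)            # confirm the remaining number of records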
3. Convert labels to binary variables
0 represents ‘ham’ (that is, not spam) and 1 represents ‘spam’.
Fig 4.4: Convert labels to binary variables
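A minimal sketch of the label conversion, assuming the label column is named "Category" (an assumption; the actual column name may differ):

# Sketch of step 3: map text labels to binary values (0 = ham, 1 = spam).
data["label"] = data["Category"].map({"ham": 0, "spam": 1})  # column name assumed
print(data["label"].value_counts())                          # class distribution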
4. Split into training and testing sets
The train_test_split function from sklearn.model_selection is used to split the data into training and testing sets. The test size used is 20%.
Fig 4.5: Split into training and testing sets
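A minimal sketch of the split, assuming the message text is in a column named "Message" (an assumption) and the binary labels are in the "label" column created above:

# Sketch of step 4: 80/20 split into training and testing sets.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    data["Message"], data["label"], test_size=0.20, random_state=42)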
5. Frequency distribution
Feature extraction is performed using CountVectorizer.
Fig 4.6: Frequency distribution
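A minimal sketch of the feature extraction, continuing from the split above:

# Sketch of step 5: convert each message into a vector of word counts.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words="english", lowercase=True)
X_train_counts = vectorizer.fit_transform(X_train)  # learn vocabulary on training data only
X_test_counts = vectorizer.transform(X_test)        # reuse the same vocabulary for the test set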
6. Evaluation of trained model
Accuracy score, precision score, recall score, and F1 score are computed from the data.
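A minimal sketch of this evaluation (the solver setting is an assumption), continuing from the count features above:

# Sketch of step 6: train logistic regression on the count features and report
# accuracy, precision, recall and F1 score on the held-out test set.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

model = LogisticRegression(max_iter=1000)
model.fit(X_train_counts, y_train)
y_pred = model.predict(X_test_counts)

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1 score :", f1_score(y_test, y_pred))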
7. Evaluation of Ensemble trained model
Combination of Support Vector Machine, Naïve Bayes and Logistic Regression
Fig 4.7: Evaluation of logistic trained model
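One plausible way to combine the three classifiers is sketched below with hard voting on the same count features; the exact configuration used in the project may differ.

# Sketch of step 7: hard-voting ensemble of SVM, Naive Bayes and Logistic Regression.
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

ensemble = VotingClassifier(
    estimators=[("svm", LinearSVC()),
                ("nb", MultinomialNB()),
                ("lr", LogisticRegression(max_iter=1000))],
    voting="hard")                                   # majority vote of the three models

ensemble.fit(X_train_counts, y_train)
print("ensemble accuracy:", accuracy_score(y_test, ensemble.predict(X_test_counts)))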
8. Confusion matrix label
Confusion matrix plotted using the matplotlib library.
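A minimal sketch of the confusion matrix plot, using the test predictions y_pred from the logistic regression sketch above:

# Sketch of step 8: plot the confusion matrix with matplotlib.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

cm = confusion_matrix(y_test, y_pred)                             # rows: actual, columns: predicted
ConfusionMatrixDisplay(cm, display_labels=["ham", "spam"]).plot()
plt.show()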
4.5 Implementation of Naïve Bayes and Support Vector Machine on Data
Comparing the performance of Naïve Bayes and Support Vector Machine to determine which gives
the better performance.
Naïve Bayes performance model
Support Vector Machine performance model
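A minimal sketch of this comparison on the same count features (LinearSVC is used here as the SVM variant, which is an assumption):

# Sketch: train Naive Bayes and an SVM on the count features and compare test accuracy.
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

for name, clf in [("Naive Bayes", MultinomialNB()), ("SVM", LinearSVC())]:
    clf.fit(X_train_counts, y_train)
    print(name, "accuracy:", accuracy_score(y_test, clf.predict(X_test_counts)))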
9. Performance metrics evaluation
A confusion matrix is a table that is often used to describe the performance of a classification
algorithm. The table layout is as follows:
Table: Confusion matrix

                     Actual: Spam     Actual: Ham
  Predicted: Spam         TP               FP
  Predicted: Ham          FN               TN
Each entry in the table represents the number of observations with a particular combination of
predicted class and actual class. The diagonal entries (True Positive and True
Negative) represent correct predictions, while the off-diagonal entries (False Positive and False
Negative) represent incorrect predictions.
The information in a confusion matrix can be used to compute various metrics, such as
accuracy, precision, recall, and F1 score. These metrics give a more detailed understanding of
the performance of a classification model than a single number like accuracy.
True positive: This is the number of positive (P) predictions that are true (T)
False positive: This is the number of positive (P) predictions that are false (F)
False negative: This is the number of negative (N) predictions that are false (F)
True negative: This is the number of negative (N) predictions that are true (T)
Accuracy = (TP + TN) / (TP + FP + TN + FN)    (3.5)

Recall = TP / (TP + FN)    (3.6)

False positive rate = FP / (FP + TN)    (3.7)

Precision = TP / (TP + FP)    (3.8)
Table: Confusion matrix for Naïve Bayes

                     Actual: Spam     Actual: Ham
  Predicted: Spam        1587                0
  Predicted: Ham           56              196

Accuracy = (1587 + 196) / (1587 + 56 + 196 + 0) = 0.969549
Recall = 1587 / (1587 + 56) = 0.965916
False positive rate = 0 / (0 + 196) = 0.000000
Precision = 1587 / (1587 + 0) = 1.000000
Table: Confusion matrix for Support Vector Machine (SVM)

                     Actual: Spam     Actual: Ham
  Predicted: Spam        1586                1
  Predicted: Ham           37              215

Accuracy = (1586 + 215) / (1586 + 37 + 215 + 1) = 0.979337
Recall = 1586 / (1586 + 37) = 0.977203
False positive rate = 1 / (1 + 215) = 0.004630
Precision = 1586 / (1586 + 1) = 0.999370
Table: Confusion matrix for Logistic Regression

                     Actual: Spam     Actual: Ham
  Predicted: Spam         973                3
  Predicted: Ham            8              131

Accuracy = (973 + 131) / (973 + 8 + 131 + 3) = 0.990135
Recall = 973 / (973 + 8) = 0.991845
False positive rate = 3 / (3 + 131) = 0.022388
Precision = 973 / (973 + 3) = 0.996926
CHAPTER FIVE
CONCLUSION AND RECOMMENDATION
5.1 Conclusion
Email spamming is a common occurrence affecting many people around the world. With the
increase in spamming, there is a huge demand for advanced systems and new approaches that
improve email spam analytics and better protect individual mailboxes. In this project, an email
spam prediction system using Logistic Regression, Support Vector Machine, and Naïve Bayes
models has been developed. The system combines records of past spam with Logistic Regression,
Support Vector Machine, and Naïve Bayes classifiers to predict email spam. The proposed
approach is implemented in Jupyter Notebook, and its performance is evaluated using accuracy,
precision, and recall. The results obtained from model prediction, accuracy evaluation, and
classification show that the three-model approach to email spam prediction performs well, and
the system was tested for anomalies.
5.2 Recommendation
As future research, the implemented model could be improved to work effectively in Nigeria
by using data obtained from Nigeria. Future research on email spamming should also focus on
other areas of prediction, including consideration of a wider variety of attributes that can be
used to form a dataset. Special consideration should also be given to known spamming factors
in order to prevent their occurrence.