Department of Electrical and Computer Engineering
North South University
Senior Design Project
Botnet Attack Detection in IoT Networks Using Machine Learning
Azmine Mahtab Chowdhury, ID# 1911517642
Md. Rakibul Hasan, ID# 1912435042
Faculty Advisor:
Dr. Mohammad Monirujjaman Khan
Associate Professor
ECE Department
Spring, 2023
LETTER OF TRANSMITTAL
June, 2023
To
Dr. Rajesh Palit
Chairman,
Department of Electrical and Computer Engineering
North South University, Dhaka
Subject: Submission of 499A Final Report on "Botnet Attack Detection in IoT Networks
Using Machine Learning"
Dear Sir,
With due respect, we would like to submit our 499A Final Report on "Botnet Attack
Detection in IoT Networks Using Machine Learning" as part of our BSc program. This report
showcases the work completed in 499A, specifically focusing on the model training process
and its significance in our research project. During this phase, we have gained valuable
knowledge and practical skills related to detecting botnet attacks using machine learning. We
have learned how to build and train machine learning models to identify and address botnet
attacks in IoT networks. We have made every effort to meet all the requirements
set for this report. We have strived to ensure that the report covers all the
necessary aspects related to the model training process for detecting botnet attacks in IoT
networks using machine learning.
We would be highly obliged if you would kindly receive this report and provide your valuable
judgment. It would be our immense pleasure if you find this report useful and informative in
gaining a clear perspective on the issue.
Sincerely yours,
.........................................................
Azmine Mahtab Chowdhury
ECE Department
North South University, Bangladesh
ID: 1911517642
........................................................
Md. Rakibul Hasan
ECE Department
North South University, Bangladesh
ID: 1912435042
APPROVAL
Azmine Mahtab Chowdhury (ID # 1911517642) and Md. Rakibul Hasan (ID # 1912435042)
from the Electrical and Computer Engineering Department of North South University have worked
on the Senior Design Project titled “Botnet Attack Detection in IoT Networks Using Machine
Learning” under the supervision of Dr. Mohammad Monirujjaman Khan in partial fulfillment of the
requirement for the degree of Bachelor of Science in Engineering, and the project has been
accepted as satisfactory.
Supervisor’s Signature
…………………………………….
Dr. Mohammad Monirujjaman Khan
Associate Professor
Department of Electrical and Computer Engineering
North South University
Dhaka, Bangladesh.
Chairman’s Signature
…………………………………….
Dr. Rajesh Palit
Professor
Department of Electrical and Computer Engineering
North South University
Dhaka, Bangladesh.
DECLARATION
This is to declare that this project is our original work. No part of this work has
been submitted elsewhere partially or fully for the award of any other degree or
diploma. All project related information will remain confidential and shall not be
disclosed without the formal consent of the project supervisor. Relevant previous
works presented in this report have been properly acknowledged and cited. The
plagiarism policy, as stated by the supervisor, has been maintained.
Students’ names & Signatures
1. Azmine Mahtab Chowdhury
2. Md. Rakibul Hasan
ACKNOWLEDGEMENTS
The authors would like to express their heartfelt gratitude towards their project and
research supervisor, Dr. Mohammad Monirujjaman Khan, Associate Professor,
Department of Electrical and Computer Engineering, North South University,
Bangladesh, for his invaluable support, precise guidance and advice pertaining to
the experiments, research and theoretical studies carried out during the course of
the current project and also in the preparation of the current report.
Furthermore, the authors would like to thank the Department of Electrical and
Computer Engineering, North South University, Bangladesh for facilitating the
research. We would also like to thank our friend Md. Rakibul Hasan for helping us
in this project. The authors would also like to thank their loved ones for their
countless sacrifices and continual support.
ABSTRACT
The rapid growth of the Internet of Things (IoT) has brought intelligent systems
and energy-aware sensing devices to the forefront of our daily lives. However, the
limitations of IoT devices have resulted in an increase in botnet attacks, where
compromised devices form a network controlled by malicious entities. To address
this, we propose an ensemble learning model that profiles IoT network behavior to
detect unusual traffic from compromised devices. Our project has focused on
training the model using the N-BaIoT-2021 dataset and evaluating the accuracy of
individual decision tree models, such as AdaBoost, RUSBoost, and bagged models.
Preliminary results show promising performance, and we will further evaluate the
ensemble model's ability to detect real-world botnet attacks. The impact and
significance of our research lie in the development of a robust botnet detection
system for IoT networks. By leveraging ensemble learning techniques and
profiling IoT network behavior features, we aim to contribute to the mitigation of
botnet attacks, ensuring the integrity and privacy of connected IoT devices.
TABLE OF CONTENTS
LETTER OF TRANSMITTAL
APPROVAL
DECLARATION
ACKNOWLEDGEMENTS
ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
Chapter 1 Introduction
1.1 Background and Motivation
1.2 Purpose and Goal of the Project
1.3 Organization of the Report
Chapter 2 Research Literature Review
2.1 Existing Research and Limitations
Chapter 3 Methodology
3.1 System Design
Figure 1: Block Diagram of the Proposed System
3.1.1 Dataset:
3.1.2 Data Preparation (DP):
Table 1: Label encoding for the target classes (output labels).
3.1.3 Learning Process (LP):
Figure 2: Diagram of AdaBoost Classifier
Figure 3: Diagram of RUSBoost Classifier
Figure 3: Diagram of Bagged decision tree
Figure 4: Soft Voting Classifier with Equal Weights
3.1.4 Evaluation Process (EP):
3.2 Hardware and/or Software Components
3.3 Hardware and/or Software Implementation
Chapter 4 Investigation/Experiment, Result, Analysis and Discussion
4.1 Data Processing and Feature Selection
4.2 Model Training and Performance Evaluation
4.2.1 Model Accuracy
4.2.2 Model Comparison:
Chapter 5 Impacts of the Project
5.1 Impact of this project on societal, health, safety, legal and cultural issues
5.2 Impact of this project on environment and sustainability
Chapter 6 Project Planning and Budget
Chapter 7 Complex Engineering Problems and Activities
7.1 Complex Engineering Problems (CEP)
7.2 Complex Engineering Activities (CEA):
Table 4 demonstrates A Sample Complex Engineering Problem Activities.
REFERENCES
Chapter 1 Introduction
1.1 Background and Motivation
The Internet of Things (IoT) has witnessed remarkable growth, enabling the interconnection of
devices and the exchange of data across various domains. This interconnectedness has
revolutionized industries, homes, and cities, leading to the exponential proliferation of IoT
devices. However, along with the widespread adoption of IoT technologies, new security
challenges have emerged, particularly in the form of botnet attacks.
A botnet refers to a network of compromised devices controlled by malicious entities known as
botmasters [1]. These compromised devices, commonly referred to as bots or zombies, are
infected with malware, allowing the botmasters to remotely manipulate and coordinate attacks on
targeted systems or networks. Botnet attacks pose significant threats to the integrity, availability,
and confidentiality of IoT networks, enabling various malicious activities, including distributed
denial-of-service (DDoS) attacks, data theft, and unauthorized access [2].
The integration of IoT devices into various aspects of daily life and industrial operations has led
to their extensive growth. Statistical data reveals the exponential increase in the number of
connected devices worldwide. In 2019, there were approximately 24.15 billion IoT devices, and
this number is projected to reach 76.45 billion by 2026 [3]. This rapid growth underscores the
significant role IoT technologies play in enabling smart environments and transforming
industries.
In addition to its widespread deployment, the IoT is expected to have a substantial financial
impact on the global economy. Projections indicate that by 2025, the IoT could contribute
between $3.9 and $11.1 trillion to the global economy [4]. This economic potential emphasizes the
need for robust security measures to protect the integrity and functionality of IoT networks. As
IoT networks continue to expand, the associated security vulnerabilities become more prominent.
Traditional security mechanisms, such as firewalls and intrusion detection systems, are often
inadequate to address the unique challenges posed by botnet attacks in IoT environments.
Therefore, specialized detection systems are required to effectively identify and counteract
botnet activities within IoT networks [5].
Ensuring the security and privacy of connected devices and the data they generate is of utmost
importance in IoT networks. With the rapid growth of IoT and the increasing prevalence of
botnet attacks, the development of a robust botnet detection system becomes crucial. Such a
system plays a pivotal role in identifying and preventing botnet activities, thereby safeguarding
the integrity and availability of IoT networks. By minimizing the risks associated with disruptive
attacks, data breaches, and unauthorized access, the system enhances the overall security posture
of IoT environments. Moreover, it instills trust and confidence in users, enabling the continued
adoption and utilization of IoT technologies. Recognizing the unique security challenges posed
by botnet attacks, the implementation of an effective detection system is indispensable for
maintaining the integrity, privacy, and functionality of IoT networks.
1.2 Purpose and Goal of the Project
The purpose of this project is to develop an effective botnet attack detection system specifically
tailored for IoT networks. The primary goal is to enhance the security and privacy of connected
devices and the data they generate by detecting and mitigating botnet activities in IoT
environments.
The project aims to make several contributions to the field of IoT security. Firstly, by leveraging
machine learning techniques and analyzing network behavior patterns, the proposed detection
system intends to distinguish between legitimate IoT traffic and botnet-related activities. This
approach enables the identification of botnet-infected devices and the mitigation of potential
threats, thereby strengthening the overall security posture of IoT networks.
1.3 Organization of the Report
There are eight chapters in the report. The project's background, motivation, and problem
statement are presented in Chapter 1 as an introduction. The literature review in Chapter 2
examines previous studies on botnet attack detection in IoT networks using machine learning.
The system design, including the flowchart, block diagram, dataset, algorithm, and
hardware/software components, is discussed in Chapter 3. Data preprocessing, feature selection,
model training, testing, and accuracy evaluation are all covered in Chapter 4's results
presentation. The project's effects on society, health, safety, and the law are covered in detail in
Chapter 5. Chapter 6 describes the project planning and includes a Gantt chart to show the
timeline. Chapter 7 discusses intricate engineering issues and tasks unique to the project. The
report is concluded in Chapter 8 with a summary, a list of limitations, and suggestions for further
advancements.
Chapter 2 Research Literature Review
2.1 Existing Research and Limitations
Recent articles have made significant contributions to the field of IoT security, particularly in the
detection and classification of IoT attacks. Valverde et al. [6] developed a transfer learning-based
CNN model for automatic glaucoma classification, achieving an AUC of 94% using color fundus
images from the DRISHTI-GS and RIM-ONE datasets. In the context of IoT security, Abu
Al-Haija and Al-Badawi [16,17] proposed a comprehensive architecture for IoT intrusion
detection and classification. They evaluated their models using two well-known IoT attack
datasets, distilled-Kitsune-2018 and NSL-KDD, and their best results exceeded all prior work
by 1% to 20%. While some studies focused on specific types of attacks, such as port scanning
attacks [18] or Linux IoT botnet detection [26], others explored broader approaches. Tsogbaatar
et al. [19] developed DeL-IoT, a deep ensemble learning approach that uses autoencoders to
uncover anomalies in IoT. Rezaei [20] used ensemble learning techniques for detecting botnets
in IoT. Yang and Shami [23] proposed a lightweight concept drift detection and adaptation
framework for IoT data streams. Qaddoura et al. [24] proposed a multi-stage classification
approach for IoT intrusion detection, combining clustering with oversampling techniques. The
datasets used to evaluate these four approaches, and their characteristics, are not detailed here.
The investigation of the existing literature revealed several limitations in the field: the use of
individual open-source or custom datasets of small size, the lack of comparison between
multiple deep learning techniques, and the absence of real-time analysis on embedded devices
[16,17]. These limitations motivated the proposed research.
In conclusion, recent literature has highlighted advancements in IoT attack detection and
classification, but limitations remain in dataset size, the use of multiple techniques, and
real-time analysis. Motivated by these research gaps, the proposed investigation focuses on
developing an ensemble model to detect botnet attacks in IoT networks. The ensemble model
combines multiple machine learning techniques, including supervised learning, unsupervised
learning, and ensemble learning. By leveraging the strengths of each technique, it aims to
enhance the accuracy and efficiency of detecting several types of botnet attacks in IoT
networks.
Chapter 3 Methodology
In this chapter, we will delve into the detailed approach and procedures used to achieve the
objectives of the project. This section outlines the system design, hardware and software
components utilized, and the implementation process.
3.1 System Design
Constructing the model requires a preprocessed dataset and machine learning algorithms, so the
data must be available once the preprocessing stage is complete. The block diagram of the
proposed system is shown in Fig. 1.
Figure 1: Block Diagram of the Proposed System
The block diagram visually presents the system's architecture, showcasing the key subsystems
and their connections. It provides an overview of the sequential flow, from data preparation to
model training, evaluation, and performance analysis. Each subsystem plays a crucial role in
ensuring accurate anomaly detection. The diagram highlights processes like data hosting,
cleansing, feature selection, standardization, shuffling, and distribution. Supervised machine
learning algorithms and evaluation metrics are employed for effective detection and assessment.
This concise diagram offers a foundation for understanding the subsequent sections of the report.
3.1.1 Dataset:
The N-BaIoT dataset provides a valuable resource for detecting IoT botnet attacks. It comprises
real traffic data collected from 9 commercial IoT devices infected with Mirai and BASHLITE
botnets. With a total of 7,062,606 instances and 115 attributes, this multivariate and sequential
dataset offers insights into distinguishing between benign and malicious traffic data using
anomaly detection techniques. Additionally, it enables multi-class classification, with 10 attack
classes and a 'benign' class. The dataset includes features such as stream aggregation, time-frame
statistics, weight, mean, std, radius, magnitude, cov, and pcc. This dataset is a result of the
research conducted by Meidan et al. [15].
3.1.2 Data Preparation (DP):
At the initial stage, the raw IoT traffic data is preprocessed using Python and the PyCharm IDE.
This involves transforming the data into labeled features, making it suitable for subsequent
machine learning tasks.
1. Data Hosting Process:
Instead of MATLAB, Python and PyCharm are utilized as the platform for data storage,
training, and evaluation. The raw IoT traffic data is imported into Python data structures,
such as pandas DataFrames, where each record represents an IoT traffic instance with
corresponding feature columns.
2. Data Cleansing Process: The data cleansing process analyzes the dataset in order to
understand it better and to correct incorrectly interpreted data. It improves data quality by
finding and eliminating flaws and inconsistencies. In this study, we applied a number of
data cleansing steps to the imported data, including missing value checks (finding
null-value cells and replacing them with zero), corrupted value checks (looking for
misinterpreted data and removing it), fixing the attribute names for the main features
(data imported from CSV typically has no attribute names), and maintaining an atomic
representation of the data (making sure all attributes are simple and filled with
values), among others.
Classifier           Normal       Botnet(s)
Binary Classifier    0 (normal)   1 (anomaly)
Ternary Classifier   0 (normal)   1 (MIRAI)
                                  2 (BASHLITE)
Multi Classifier     0 (normal)   1 (MIRAI_DANMINI_DOORBELL)
                                  2 (MIRAI_ECOBEE_THERMOSTAT)
                                  3 (MIRAI_PHILIPS_BABY_MONITOR)
                                  4 (MIRAI_PROVISION_SECURITY_CAMERA)
                                  5 (MIRAI_SAMSUNG_WEBCAM)
                                  6 (BASHLITE_DANMINI_DOORBELL)
                                  7 (BASHLITE_ECOBEE_THERMOSTAT)
                                  8 (BASHLITE_PHILIPS_BABY_MONITOR)
                                  9 (BASHLITE_PROVISION_SECURITY_CAMERA)
                                  10 (HAJIME_PROVISION_WEBCAM)
                                  11 (HAJIME_PHILIPS_BABY_DOORBELL)
                                  12 (HAJIME_PROVISION_SECURITY_CAMERA)
Table 1: Label encoding for the target classes (output labels).
For the binary classifier, the output classes are given the values 0 and 1; for the ternary
classifier, the values 0 to 2; and for the multi classifier, the values 0 to 12. The encoding of
the target labels is summarized in Table 1. Erroneous and inaccurate data entries were also
corrected at this stage.
3. Feature Selection Process: Relevant features for Botnet attack detection are carefully
selected using techniques like correlation coefficient scores. Features with high
correlation to the target variable are chosen to optimize the machine learning model.
4. Data Standardization Process: Z-score normalization is applied to standardize the data.
This process scales the features to have a mean of 0 and a standard deviation of 1.
Standardization ensures that all features contribute equally and facilitates effective data
processing by machine learning algorithms.
5. Data Shuffling Process: The data points are shuffled so that the original ordering of the
records does not influence training. A uniform shuffle is performed at each epoch to
eliminate bias and allow the model to learn from a representative sample.
6. Data Distribution Process: The dataset is randomly divided into training, validation,
and testing subsets. This division ensures the optimal allocation of data for training and
evaluating the model's performance. Additionally, k-fold cross-validation is employed to
assess the model's effectiveness.
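The label encoding of Table 1 and the random data distribution step can be sketched in a few lines of plain Python. This is a minimal illustration rather than the project's actual code: the helper names (`encode_label`, `split_indices`), the key `"normal"`, and the 70/15/15 split ratios are assumptions made for the example, and Table 1 defines no ternary code for the HAJIME classes, so the sketch returns `None` for them.

```python
import random

# Multi-class codes copied from Table 1 (0 is reserved for normal traffic).
MULTI_CLASS_CODES = {
    "normal": 0,
    "MIRAI_DANMINI_DOORBELL": 1,
    "MIRAI_ECOBEE_THERMOSTAT": 2,
    "MIRAI_PHILIPS_BABY_MONITOR": 3,
    "MIRAI_PROVISION_SECURITY_CAMERA": 4,
    "MIRAI_SAMSUNG_WEBCAM": 5,
    "BASHLITE_DANMINI_DOORBELL": 6,
    "BASHLITE_ECOBEE_THERMOSTAT": 7,
    "BASHLITE_PHILIPS_BABY_MONITOR": 8,
    "BASHLITE_PROVISION_SECURITY_CAMERA": 9,
    "HAJIME_PROVISION_WEBCAM": 10,
    "HAJIME_PHILIPS_BABY_DOORBELL": 11,
    "HAJIME_PROVISION_SECURITY_CAMERA": 12,
}

# Ternary codes from Table 1; no code is listed there for HAJIME.
TERNARY_CODES = {"MIRAI": 1, "BASHLITE": 2}


def encode_label(class_name):
    """Return the (binary, ternary, multi) codes for one class name."""
    multi = MULTI_CLASS_CODES[class_name]
    if multi == 0:
        return 0, 0, 0
    family = class_name.split("_")[0]
    # .get() yields None when Table 1 defines no ternary code for the family.
    return 1, TERNARY_CODES.get(family), multi


def split_indices(n_rows, train=0.70, val=0.15, seed=42):
    """Shuffle row indices and cut them into train/validation/test parts.

    The ratios are illustrative assumptions, not values from the report.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = round(n_rows * train)
    n_val = round(n_rows * val)
    return (
        indices[:n_train],
        indices[n_train:n_train + n_val],
        indices[n_train + n_val:],
    )


train_idx, val_idx, test_idx = split_indices(100)
print(encode_label("BASHLITE_DANMINI_DOORBELL"))
print(len(train_idx), len(val_idx), len(test_idx))
```

Keeping the three encodings in one function makes it harder for the binary, ternary, and multi-class targets to drift apart when the class list changes.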
3.1.3 Learning Process (LP):
The Learning Process involves the utilization of machine learning models, including AdaBoosted
decision tree, RUSBoosted decision tree, bagged decision tree, and Soft Voting, for the detection
of Botnet attacks in IoT networks. These models collectively contribute to the development of an
ensemble learning model, which combines their predictions to enhance the overall accuracy and
robustness of the detection system.
1. AdaBoosted Decision Tree: AdaBoost (Adaptive Boosting) algorithm combines
multiple weak learners, such as decision trees, to create a strong learner. Each decision
tree is trained iteratively, and subsequent trees focus on the instances misclassified by
previous trees. The AdaBoosted decision tree model assigns higher weights to
misclassified instances to improve their classification in subsequent iterations [16]. Fig. 2
shows the diagram of AdaBoost decision tree.
Figure 2: Diagram of AdaBoost Classifier
Fig. 2 illustrates how the initial model is constructed and how the algorithm identifies the
errors of that first model. The misclassified records are used as input for the subsequent
model, and this procedure continues until the stopping condition is satisfied. As seen in Fig. 2,
three models are created, each incorporating the mistakes of the preceding model. This is the
mechanism through which boosting operates.
2. RUSBoosted Decision Tree: RUSBoost (Random Under Sampling Boosting) algorithm
addresses class imbalance by undersampling the majority class and boosting the minority
class. Fig. 3 shows the diagram of RUSBoost decision tree.
Figure 3: Diagram of RUSBoost Classifier
Fig. 3 illustrates how it incorporates random under-sampling into the AdaBoost framework to
mitigate the dominance of the majority class during training. This helps the RUSBoosted
decision tree model handle imbalanced datasets more effectively [17].
3. Bagged Decision Tree: Bagging (Bootstrap Aggregating) algorithm creates multiple
subsets of the original dataset through bootstrap sampling. Each subset is used to train an
independent decision tree, and the final prediction is obtained by aggregating the
predictions of all decision trees. Bagging helps reduce overfitting and improves the
stability and accuracy of the model [18]. The diagram of the bagged decision tree is
shown below.
Figure 3: Diagram of Bagged decision tree
The diagram illustrates how each of the K subsets of the original dataset is used to train an
independent decision tree, and how the final prediction is obtained by aggregating the
predictions of all decision trees.
4. Soft Voting: Soft Voting is an ensemble learning technique where each model in the
ensemble assigns probabilities to each class instead of making a deterministic prediction.
The final prediction is obtained by averaging or weighted averaging the predicted
probabilities across all models. Soft Voting allows the ensemble model to consider the
collective wisdom of the individual models and make more confident and accurate
predictions [19]. This aggregation approach helps to reduce bias and variance, enhance
the robustness of the system, and improve the overall accuracy of Botnet attack detection.
The combination of these machine learning models, along with Soft Voting, forms a
powerful ensemble learning framework that leverages their complementary strengths to
achieve more accurate and reliable detection results.
Soft voting classifiers sort input data into categories based on the class probabilities
predicted by each classifier. Based on the equation shown in Fig. 1, weights are
suitably assigned to each classifier. As an example, assume three binary classifiers,
clf1, clf2, and clf3. For a given record, the classifiers predict the following
probabilities for classes [0, 1]:
clf1 predicts [0.2, 0.8], clf2 predicts [0.1, 0.9], and clf3 predicts [0.8, 0.2].
With equal weights (one third each, rounded to 0.33), the probabilities are:
Probability of Class 0 = 0.33*0.2 + 0.33*0.1 + 0.33*0.8 = 0.363
Probability of Class 1 = 0.33*0.8 + 0.33*0.9 + 0.33*0.2 = 0.627
The probability predicted by the ensemble classifier will be [36.3%, 62.7%]. The class will
most likely be class 1 if the threshold is 0.5. A soft voting classifier with equal
weights looks like this:
Figure 4: Soft Voting Classifier with Equal Weights
With unequal weights [0.6, 0.4] applied to two classifiers that predict [0.2, 0.8] and
[0.6, 0.4], the probabilities are calculated as follows:
Prob of Class 0 = 0.6*0.2 + 0.4*0.6 = 0.36
Prob of Class 1 = 0.6*0.8 + 0.4*0.4 = 0.64
The probability predicted by the ensemble classifier will be [0.36, 0.64]. The class will
most likely be class 1 if the threshold is 0.5 [20].
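The two worked examples above can be reproduced with a short, dependency-free sketch. The function names `soft_vote` and `predict` are hypothetical helpers introduced for illustration. The second call assumes two classifiers predicting [0.2, 0.8] and [0.6, 0.4], which is what the arithmetic of the unequal-weight example implies.

```python
def soft_vote(probabilities, weights):
    """Combine per-classifier class probabilities with a weighted sum."""
    n_classes = len(probabilities[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probabilities))
        for c in range(n_classes)
    ]


def predict(class_probabilities):
    """Pick the class with the highest combined probability."""
    return max(range(len(class_probabilities)),
               key=class_probabilities.__getitem__)


# Three classifiers, equal weights (0.33 each, as in the text).
equal = soft_vote([[0.2, 0.8], [0.1, 0.9], [0.8, 0.2]], [0.33, 0.33, 0.33])

# Two classifiers, unequal weights.
unequal = soft_vote([[0.2, 0.8], [0.6, 0.4]], [0.6, 0.4])

print([round(p, 3) for p in equal], predict(equal))      # [0.363, 0.627] 1
print([round(p, 3) for p in unequal], predict(unequal))  # [0.36, 0.64] 1
```

Because the text rounds one third to 0.33, the equal-weight probabilities sum to 0.99 rather than 1. Using exact weights of 1/3 would give roughly [0.367, 0.633] and the same predicted class.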
3.1.4 Evaluation Process (EP):
The evaluation process (EP) is a critical step in this project to ensure that the system
meets its requirements and objectives. We will use standard evaluation metrics on k-fold
datasets to measure the performance of the four machine learning models we developed:
AdaBoosted decision tree (ABDT), RUSBoosted decision tree (RBDT), bagged decision tree
(BGDT), and the ensemble learning model.
1. Confusion Matrix Analysis: Assesses the performance of the detection system by
analyzing true positive, true negative, false positive, and false negative predictions.
2. Performance Metrics: Measures the system's effectiveness, including network coverage
score (NCS#), network miss score (NMS#), detection accuracy score (DAC%), detection
precision score (DPR%), detection sensitivity score (DSN%), detection hit score
(DHS%), input anomaly error rate (IAE%), input miss error rate (IME%), and input
non-error rate (INE%).
3. Inference Overhead Evaluation: Evaluates the prediction speed (PRS) and prediction time
(PRT) of the detection system.
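As an illustration of the confusion matrix analysis, the sketch below counts the four outcome types for a binary detector and derives three standard scores from them. It assumes that DAC, DPR, and DSN correspond to the usual accuracy, precision, and sensitivity (recall) formulas, as their names suggest. The remaining metrics (NCS, NMS, DHS, IAE, IME, INE) are not defined in this section and are therefore not sketched. The function names and the toy labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary detector."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth != positive and pred != positive:
            tn += 1
        elif truth != positive and pred == positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def detection_scores(tp, tn, fp, fn):
    """Standard accuracy, precision, and sensitivity from the four counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy labels: 1 = botnet traffic, 0 = normal traffic.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
scores = detection_scores(*counts)
print(counts, scores)
```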
3.2 Hardware and/or Software Components
This section describes the software used to develop the system.
The implementation of the project "Botnet Attack Detection in IoT Networks Using Machine
Learning" relied on the following software components, each of which played a role in the
development and execution of the project.
● Python: Python programming language served as the foundation for implementing the
project. Python is widely used in the field of data science and machine learning due to its
extensive libraries and frameworks, making it suitable for developing machine learning
models and conducting data analysis.
● PyCharm: PyCharm, an integrated development environment (IDE) for Python, was used
for coding, debugging, and managing the project. It provides features such as code
completion, error highlighting, and version control integration, enhancing the
development process.
● Anaconda: Anaconda, a popular distribution of Python, was utilized to create an isolated
environment for the project. It simplifies package management and ensures consistent
dependencies across different systems. Additionally, Anaconda includes Jupyter
Notebook, a web-based interactive coding environment, which was used for exploratory
data analysis and sharing code snippets.
● Libraries and Packages: Several software libraries and packages were employed to
implement specific functionalities in the project. These include:
○ Pandas: Pandas facilitated data manipulation, preprocessing, and exploratory data
analysis tasks.
○ Scikit-learn: Scikit-learn provided a range of machine learning algorithms, tools
for model training and evaluation, and data preprocessing techniques.
○ Matplotlib and Seaborn: Matplotlib and Seaborn were used for data visualization,
enabling the creation of various charts, graphs, and plots.
○ NumPy: NumPy was utilized for numerical computations and handling
multidimensional arrays efficiently.
3.3 Hardware and/or Software Implementation
Throughout the software implementation phase, widely used software tools and frameworks
such as Python, Jupyter Notebook, and scikit-learn were utilized. Python, a versatile
programming language, served as the primary language for coding the software modules.
Jupyter Notebook provided an interactive development environment for data exploration,
model development, and result analysis. The scikit-learn library offered a comprehensive set
of tools and algorithms for machine learning tasks, enabling efficient implementation of the
selected models.
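A minimal sketch of how such an ensemble might be assembled with scikit-learn is given below. It is not the project's code: the data is synthetic (generated with `make_classification` as a stand-in for N-BaIoT features), the hyperparameters are arbitrary, and only AdaBoost and bagging are combined. RUSBoost is not part of scikit-learn itself (the separate imbalanced-learn package provides a `RUSBoostClassifier`), so it is left out to keep the example dependent on scikit-learn only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    VotingClassifier,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the report trains on N-BaIoT traffic features.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Both ensembles use decision trees as their default base learner.
ada = AdaBoostClassifier(n_estimators=50, random_state=0)
bag = BaggingClassifier(n_estimators=25, random_state=0)

# Soft voting averages the class probabilities of the member models.
ensemble = VotingClassifier(
    estimators=[("ada", ada), ("bag", bag)], voting="soft"
)
ensemble.fit(X_train, y_train)

test_accuracy = ensemble.score(X_test, y_test)
probabilities = ensemble.predict_proba(X_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

Unequal weights, as in the second soft voting example of Section 3.1.3, can be supplied through the `weights` parameter of `VotingClassifier`.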
Chapter 4 Investigation/Experiment, Result, Analysis and Discussion
In this chapter we delve into the investigation and experimentation phase of our project on
botnet attack detection in IoT networks using machine learning. This chapter focuses on
presenting the experiments conducted, discussing the obtained results, and providing a
comprehensive analysis of the findings. Additionally, we engage in constructive discussions
to interpret the outcomes and gain valuable insights into the effectiveness of our approach.
4.1 Data Processing and Feature Selection
Data Cleansing: We began by performing data cleansing processes to address any errors or
inconsistencies in the dataset. This involved detecting and removing redundant rows,
ensuring that each data sample was unique and representative of the underlying information.
Additionally, we conducted checks for null values within the dataset and discovered that
there were no missing values, indicating a high level of data completeness.
Feature Selection: To improve the efficiency and effectiveness of our machine learning
models, we employed feature selection techniques to reduce the dimensionality of the dataset
and focus on the key attributes that contribute significantly to the detection of botnet attacks.
We utilized Pearson's correlation coefficient as our feature selection method. This statistical
measure quantifies the linear correlation between two variables, indicating the strength and
direction of their relationship. By calculating the correlation between each feature and the
target variable (botnet attack labels), we were able to assess their predictive power and select
the most influential features.
The feature selection process involved the following steps:
1. Computing the correlation coefficient: We computed the Pearson's correlation coefficient
between each feature and the target variable. This measure provides a value between -1
and 1, where values close to 1 indicate a strong positive correlation, values close to -1
indicate a strong negative correlation, and values close to 0 indicate a weak or no
correlation.
2. Setting the correlation threshold: We defined a correlation threshold, which determines
the minimum correlation value that a feature must exhibit with the target variable to be
considered relevant. In our case, we chose a threshold of 0.9, indicating that features with
a correlation coefficient greater than or equal to 0.9 would be selected.
3. Selecting the relevant features: We iterated through each feature and compared its
correlation coefficient with the correlation threshold. Features that exceeded the threshold
were retained, while those below the threshold were discarded. This process resulted in a
reduced feature set consisting of the most important attributes for botnet attack detection.
Figure 5: Correlation Matrix Heatmap before feature selection
Figure 6: Correlation matrix heatmap after feature selection
By applying Pearson's correlation coefficient and setting a correlation threshold, we
identified the 30 most significant features out of the initial 115 features in our dataset. This
feature selection process ensures that our models focus on the attributes that have the
strongest impact on botnet attack detection, enhancing their performance and interpretability.
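The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the project's actual code: the function name `select_by_pearson` and the toy data are hypothetical, and the comparison follows the text literally (coefficient greater than or equal to the threshold). Comparing the absolute value instead would also retain strongly negative correlations.

```python
import numpy as np

def select_by_pearson(X, y, threshold=0.9):
    """Return (kept column indices, per-column Pearson r with the target)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Step 1: Pearson r = sum(xc * yc) / sqrt(sum(xc^2) * sum(yc^2)), per column
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    with np.errstate(divide="ignore", invalid="ignore"):
        r = (Xc * yc[:, None]).sum(axis=0) / denom
    r = np.nan_to_num(r)  # constant columns have undefined r; treat as 0
    # Steps 2 and 3: keep the features whose coefficient meets the threshold
    keep = np.flatnonzero(r >= threshold)
    return keep, r

# Hypothetical toy data: column 0 tracks the label, column 1 does not,
# column 2 is constant.
y_toy = np.array([0, 0, 0, 1, 1, 1])
X_toy = np.array([
    [0.1, 1.0, 5.0],
    [0.0, 0.0, 5.0],
    [0.2, 1.0, 5.0],
    [0.9, 0.0, 5.0],
    [1.0, 1.0, 5.0],
    [1.1, 0.0, 5.0],
])
kept, coeffs = select_by_pearson(X_toy, y_toy, threshold=0.9)
X_reduced = X_toy[:, kept]  # reduced feature set
```

On the toy data only the first column passes the 0.9 threshold; the same call on the real feature matrix would return the indices of the retained features.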
4.2 Model Training and Performance Evaluation
After the feature selection process, we proceeded with the next steps in our project, which
involved standardizing the data using Z-score normalization and shuffling the dataset. This
normalization technique ensures that all features have a mean of 0 and a standard deviation
of 1, thereby preventing any bias due to differences in the scale of the features. Shuffling the
data helps in randomizing the order of the samples, which is important to avoid any bias
introduced by the original ordering of the dataset.
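The standardization and shuffling steps described above can be sketched as follows. This is an illustrative sketch only; the function name `zscore_and_shuffle` and the fixed seed are assumptions and do not come from the project code.

```python
import numpy as np

def zscore_and_shuffle(X, y, seed=0):
    """Standardize each feature to mean 0 / std 1, then shuffle rows and labels together."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns at 0 instead of dividing by zero
    Z = (X - mean) / std
    # One shared permutation keeps every sample paired with its own label
    order = np.random.default_rng(seed).permutation(len(Z))
    return Z[order], y[order]

# Hypothetical example: two features on very different scales
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y_demo = np.array([0, 1, 0, 1])
Z_demo, y_shuffled = zscore_and_shuffle(X_demo, y_demo, seed=1)
```

When a separate test split is used, computing the mean and standard deviation on the training portion only and reusing them for the test portion avoids leaking test statistics into training.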
4.2.1 Model Accuracy
This section discusses the training and test accuracy of the different models used.
1. Bagged boost: Bagging (Bootstrap Aggregating) is an ensemble learning technique
that combines multiple models trained on different subsets of the training data. Each model is
trained independently, and the final prediction is determined by combining the predictions of
all models (e.g., majority voting for classification tasks). Fig. 6 depicts the Bagging
classifier report.
Figure 6: Training and test accuracy for the Bagged boost classifier report
In this project, Bagging achieved a training accuracy of 100.0% and a test accuracy of
99.96%.
2. RUS boost: RUSBoost randomly selects a subset of instances from the majority class to
balance the class distribution, then applies AdaBoost to the balanced dataset. Fig. 7 depicts
the RUS Boost classifier report.
Figure 7: Training and test accuracy for the RUS boost classifier
RUSBoost achieved a training accuracy of 100.0% and a test accuracy of 99.93% in this
project.
3. Ada boost: AdaBoost assigns higher weights to misclassified instances
to improve their classification in subsequent iterations. Fig. 8 depicts the
AdaBoost classifier report.
Figure 8: Training and test accuracy for the Ada boost classifier
In this training phase, AdaBoost achieved a training accuracy of 100.0% and a test
accuracy of 99.84%.
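The two mechanisms named in items 2 and 3 above, random undersampling of the majority class and AdaBoost's re-weighting of misclassified instances, can be sketched as below. This is a simplified illustration and not the implementation used in the project: the function names are hypothetical, the re-weighting shown is the classic binary form of AdaBoost, and the report does not state which library or variant was actually used.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Randomly drop rows so every class keeps as many rows as the smallest class."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order among the kept rows
    return X[keep], y[keep]

def adaboost_reweight(weights, y_true, y_pred):
    """One boosting round: raise weights of misclassified samples, lower the rest."""
    weights = np.asarray(weights, dtype=float)
    miss = np.asarray(y_true) != np.asarray(y_pred)
    # Weighted error of the current weak learner, clipped to keep the log finite
    err = np.clip(weights[miss].sum() / weights.sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # weight of this learner in the ensemble
    new_w = weights * np.exp(np.where(miss, alpha, -alpha))
    return new_w / new_w.sum(), alpha

# Hypothetical imbalanced toy set: 8 samples of class 0, 2 of class 1
X_imb = np.arange(10).reshape(-1, 1)
y_imb = np.array([0] * 8 + [1] * 2)
X_bal, y_bal = random_undersample(X_imb, y_imb)

# One re-weighting round where the weak learner gets the last sample wrong
w_next, alpha = adaboost_reweight([0.25] * 4, [1, 1, 0, 0], [1, 1, 0, 1])
```

After the re-weighting round, the misclassified sample carries half of the total weight, which is what pushes the next weak learner to focus on it.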
4.2.2 Model Comparison:
This section compares the results obtained for the different models and examines how soft
voting combines them into the ensemble model.
Table 2 below shows the accuracy achieved by each model:
Algorithm | Training Accuracy (%) | Test Accuracy (%)
AdaBoost | 100.0 | 99.84
RUSBoost | 100.0 | 99.93
Bagging | 100.0 | 99.96
Soft Voting | 100.0 | 99.98
Table 2: Comparison of train and test accuracy
In this project, Soft Voting achieved a training accuracy of 100.0% and a test accuracy of
99.98%.
These impressive accuracy scores demonstrate the effectiveness of all the algorithms in
accurately classifying instances in the dataset. Here are the key observations and analysis:
● High Training Accuracy: All algorithms achieved a perfect training accuracy of
100.0%, indicating their ability to fit the training data accurately. This suggests that the
models have successfully learned the patterns and features in the training dataset.
● Excellent Test Accuracy: The test accuracies of all algorithms are very close to the
training accuracies, ranging from 99.84% to 99.98%. This indicates that the models
generalize well to unseen data and can make accurate predictions on new instances.
● Comparable Performances: The differences in test accuracies among the algorithms are
minimal, with the highest accuracy achieved by the Voting algorithm (99.98%). This
suggests that all the algorithms are robust and perform exceptionally well in classifying
the instances.
● Reliable Ensemble Methods: AdaBoost, RUSBoost, Bagging, and Voting are all
ensemble methods that combine multiple models to make predictions. These methods
leverage the collective knowledge and diversity of the individual models, leading to
improved performance and higher accuracy.
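To make the soft voting step concrete, the sketch below averages the class probabilities produced by several models and picks the class with the highest mean, and contrasts this with a plain majority (hard) vote. It is a minimal sketch under assumptions: the function names and the probability values are hypothetical, and it does not reproduce the project's trained models.

```python
import numpy as np

def soft_vote(prob_list, weights=None):
    """Average per-model class probabilities; predict the class with the highest mean."""
    probs = np.asarray(prob_list, dtype=float)  # (n_models, n_samples, n_classes)
    avg = np.average(probs, axis=0, weights=weights)
    return avg.argmax(axis=1), avg

def hard_vote(prob_list):
    """Majority vote over each model's own predicted class (ties go to the lowest index)."""
    probs = np.asarray(prob_list, dtype=float)
    votes = probs.argmax(axis=2)                # (n_models, n_samples)
    n_classes = probs.shape[2]
    counts = np.stack([(votes == c).sum(axis=0) for c in range(n_classes)])
    return counts.argmax(axis=0)

# Hypothetical probabilities from three models for two samples
# (class 0 = benign, class 1 = attack)
p_model_a = [[0.90, 0.10], [0.40, 0.60]]
p_model_b = [[0.60, 0.40], [0.45, 0.55]]
p_model_c = [[0.20, 0.80], [0.95, 0.05]]

soft_labels, mean_probs = soft_vote([p_model_a, p_model_b, p_model_c])
hard_labels = hard_vote([p_model_a, p_model_b, p_model_c])
```

For the second sample, two models lean weakly towards class 1 while the third is very confident in class 0, so the hard vote picks class 1 and the soft vote picks class 0. This is the sense in which soft voting takes each model's confidence into account.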
Overall, the experimental results validate the effectiveness of the implemented algorithms in
accurately classifying instances in the dataset. The high training and test accuracies demonstrate
the models' ability to learn from the data, generalize well, and make accurate predictions. The
minimal differences in test accuracies among the algorithms indicate that they are all reliable and
perform at a high level. These findings are promising for the application of these algorithms in
real-world scenarios requiring accurate classification. In the future, the model will undergo a
comprehensive evaluation to demonstrate its ability to detect seven different types of attacks.
This evaluation will focus on assessing the model's performance in accurately identifying and
classifying each attack type.
Chapter 5 Impacts of the Project
In this chapter, we will explore the impact of the project on societal, health, safety, legal,
cultural, and environmental aspects. We will discuss how the project's outcomes can
contribute to positive changes in these areas. Additionally, we will address the project's
implications for sustainability and the environment. Furthermore, we will provide insights
into future considerations and potential directions for further development. This chapter aims
to provide a comprehensive understanding of the project's broader significance and the
potential it holds for addressing key challenges in different domains.
5.1 Impact of this project on societal, health, safety, legal and cultural issues
This project has several impacts on societal, health, safety, legal and cultural issues:
● Societal: The project has a positive social impact by improving the security of IoT
networks, which are becoming increasingly prevalent in our daily lives. The project can
help prevent the hijacking of IoT devices by botnets, which can cause damage to personal
and public property, and also compromise sensitive data.
● Health: The project does not have a direct impact on health.
● Safety: The project can indirectly contribute to improving the safety of IoT devices by
preventing botnet attacks that can compromise the functionality and safety of these
devices.
● Legal: The project is in compliance with relevant legal requirements, such as data privacy
and protection regulations. It can also help organizations comply with legal requirements
related to the security of their networks and data.
5.2 Impact of this project on environment and sustainability
The impact of the project on the environment and on sustainability is outlined below.
● Environment: The project itself does not have a significant impact on the environment as
it is a software-based solution. However, the detection and prevention of botnet attacks in
IoT networks can indirectly contribute to the reduction of the environmental impact of
such attacks. Botnets can be used to launch DDoS attacks, which can cause significant
energy consumption due to the high traffic volume. By preventing such attacks, the
project can indirectly contribute to reducing energy consumption and carbon emissions.
● Sustainability: First of all, the project is focused on detecting and preventing botnet
attacks on IoT devices. As the number of IoT devices continues to grow, the threat of
botnet attacks will continue to increase, making this project relevant and necessary in the
long term.
Secondly, the project is based on machine learning algorithms and anomaly detection
techniques. These technologies are constantly evolving and improving, meaning that the
project can be updated and enhanced over time to keep up with the latest developments.
Thirdly, the project aims to provide a comprehensive solution for network security
professionals, IT administrators, and cybersecurity researchers. By offering a platform for
collecting and preprocessing network traffic data, extracting relevant features, and training
machine learning models, the project can help these professionals to stay ahead of the
evolving threat landscape of botnet attacks.
Finally, the proposed solution is innovative in that it combines an ensemble learning
approach with machine learning and anomaly detection. This approach has not been widely
explored in the field of network security and has the potential to improve the accuracy of
botnet detection.
Overall, the project is sustainable in the sense that it addresses an ongoing and growing
problem, is based on evolving technologies, offers a comprehensive solution, and is
innovative in its approach.
To further ensure the sustainability of the project, the following measures can be taken:
● Regular Maintenance: The project should be regularly maintained to keep up with the
latest developments in the field of cybersecurity. This includes updating the software and
algorithms to adapt to new threats and improving the accuracy of the system.
● Collaboration with Industry Experts: Collaborating with industry experts and
incorporating feedback from users will help in improving the performance of the system
and make it more effective in detecting botnet traffic.
● Integration with Existing Infrastructure: The system should be integrated with existing
network security infrastructure, such as firewalls and intrusion detection systems, to
provide a comprehensive solution for detecting and preventing botnet traffic.
● Continuous Improvement: The system should be continuously improved by incorporating
new features and algorithms to make it more effective in detecting and preventing botnet
traffic.
● Training and Education: Training and educating network administrators and security
personnel on the use of the system will help in improving its adoption and effectiveness.
By implementing these measures, the project can be sustained in the long term and continue
to provide value to its users in the fight against botnet traffic.
Chapter 6 Project Planning and Budget
In Chapter 6, we will focus on the project planning and budget considerations. We will
provide an overview of the project timeline and outline the tasks and activities to be
completed in the upcoming semester. The Gantt chart will be presented, depicting the
project milestones, dependencies, and the estimated duration for each task.
Figure 9: A sample Gantt chart regarding the planned project
The project has reached a significant stage with the completion of model training. In the
upcoming semester, the focus will shift towards model evaluation, final report writing, final
presentation, and project demonstration. These activities are crucial for evaluating the
performance of the developed models, documenting the project findings, and showcasing the
project outcomes.
Regarding the budget, it is important to note that no specific budget was required for this
project. The project primarily relied on open-source software tools, publicly available
datasets, and existing hardware resources. This allowed us to conduct the project without
incurring any financial costs. The emphasis was placed on leveraging existing resources and
skills to ensure cost-effectiveness while still achieving the project goals.
Chapter 7 Complex Engineering Problems and Activities
7.1 Complex Engineering Problems (CEP)
Table 3 demonstrates a sample complex engineering problem attribute:
Attributes | Addressing the complex engineering problems (P) in the project
P1 | Depth of knowledge required (K3-K8) | The project requires in-depth knowledge of machine learning algorithms, IoT network architecture, network security concepts, and programming languages such as Python.
P2 | Range of conflicting requirements | The project involves balancing conflicting requirements, such as accuracy of detection, computational resources, and real-time processing, to ensure effective botnet detection without compromising system performance.
P3 | Depth of analysis required | The project requires extensive analysis and evaluation of various machine learning models, feature selection techniques, and data preprocessing methods to identify the most suitable approach for botnet detection in IoT networks.
P4 | Familiarity of issues | The project necessitates familiarity with IoT protocols, network traffic analysis, and machine learning algorithms specifically tailored for anomaly detection in IoT environments.
P5 | Extent of applicable codes | The project relies on programming languages such as Python and frameworks/libraries like scikit-learn and TensorFlow to implement machine learning models and conduct data analysis for botnet detection.
P6 | Extent of stakeholder involvement | The project involves stakeholders such as network administrators, IoT device manufacturers, and cybersecurity experts who need to collaborate to ensure the successful implementation and deployment of the botnet detection system.
P7 | Interdependence | The project comprises interdependent components, including data collection from IoT devices, preprocessing techniques, model training, and performance evaluation, which need to be integrated to create an effective botnet detection solution.
Table 3: Table of Complex Engineering Problems (CEP)
7.2 Complex Engineering Activities (CEA):
Table 4 demonstrates a sample of the complex engineering activities.
Attributes | Addressing the complex engineering activities (A) in the project
A1 | Range of resources | This project involves the allocation of various resources, including computing resources (CPU, memory), dataset for training and testing, software libraries (Python, scikit-learn), and IoT devices for experimentation.
A2 | Level of interactions | The project requires interactions among different stakeholders, including cybersecurity experts, network administrators, IoT device manufacturers, and end-users to gather requirements, share insights, and validate the system's effectiveness.
A3 | Innovation | The project employs innovative techniques and algorithms to detect and mitigate botnet attacks in IoT networks, contributing to the advancement of network security solutions in the IoT domain.
A4 | Consequences to society/environment | The project has positive consequences for society and the environment by enhancing the security of IoT networks, protecting sensitive data, ensuring the integrity of connected devices, and mitigating potential risks associated with botnet attacks.
A5 | Familiarity | Familiarity with IoT network architecture, network protocols (e.g., MQTT, CoAP), machine learning algorithms (e.g., decision trees, ensemble methods), programming languages (Python), and cybersecurity concepts is crucial for successful implementation and evaluation of the botnet detection system.
Table 4: Table of Complex Engineering Activities (CEA)
REFERENCES
1. S. Karim, et al., "A Comprehensive Survey on IoT Botnet Attacks and Their
Detection Approaches," IEEE Communications Surveys & Tutorials, vol.
22, no. 3, pp. 1892-1910, 2020.
2. G. Oikonomou, "IoT Botnets: Recent Advances, Open Challenges, and
Countermeasures," IEEE Internet of Things Journal, vol. 7, no. 8, pp.
7244-7262, 2020.
3. Gartner. (2019). Gartner Says Worldwide IoT Installed Base to Grow to 25.1
Billion Units by 2021. [Online]. Available:
https://www.gartner.com/en/newsroom/press-releases/2017-02-07-gartner-says-worldwide-iot-security-spending-will-reach-348-million-in-2016
4. McKinsey Global Institute. (2019). The Internet of Things: Mapping the
Value Beyond the Hype. [Online]. Available:
https://www.mckinsey.com/business-functions/mckinsey-digital/our-insights/the-internet-of-things-mapping-the-value-beyond-the-hype
5. N. Moustafa, et al., "A Deep Learning Approach for IoT Botnet Detection
Using NetFlow-Like Traffic Features," Future Generation Computer
Systems, vol. 99, pp. 430-443, 2019.
6. Valverde, R.; et al. (Year). Title of the Article. Journal Name, Volume(Issue),
Page Numbers. [CrossRef]
7. Abu Al-Haija, Q.; Al-Badawi, A. (Year). Attack-Aware IoT Network Traffic
Routing Leveraging Ensemble Learning. Sensors, 22(1), 241. [CrossRef]
[PubMed]
8. Abu Al-Haija, Q. (Year). Top-Down Machine Learning-Based Architecture
for Cyberattacks Identification and Classification in IoT Communication
Networks. Front. Big Data, 4, 782902. [CrossRef]
9. Al-Haija, Q.A.; Saleh, E.; Alnabhan, M. (Year). Detecting Port Scan Attacks
Using Logistic Regression. In Proceedings of the 2021 4th International
Symposium on Advanced Electrical and Communication Technologies
(ISAECT), Khobar, Saudi Arabia, 6–8 December 2021; pp. 1–5. [CrossRef]
10. Tsogbaatar, E.; et al. (Year). DeL-IoT: A deep ensemble learning approach to
uncover anomalies in IoT. Internet Things, 14. [CrossRef]
11. Rezaei, A. (Year). Using Ensemble Learning Technique for Detecting Botnet
on IoT. SN Comput. Sci., 4, 148. [CrossRef]
12. Yang, L.; Shami, A. (Year). A Lightweight Concept Drift Detection and
Adaptation Framework for IoT Data Streams. IEEE Internet Things Mag., 4,
96–101. [CrossRef]
13. Qaddoura, R.; et al. (Year). A Multi-Stage Classification Approach for IoT
Intrusion Detection Based on Clustering with Oversampling. Appl. Sci., 11,
3022. [CrossRef]
14. Huy-Trung Nguyen et al. (Year). Title of the Article. Journal Name,
Volume(Issue), Page Numbers. [CrossRef]
15. Y. Meidan, M. Bohadana, Y. Mathov, Y. Mirsky, A. Shabtai, D.
Breitenbacher, and Y. Elovici, "N-BaIoT: Network-based Detection of IoT
Botnet Attacks Using Deep Autoencoders," IEEE Pervasive Computing, vol.
17, no. 3, pp. 12-22, 2018.
16. Freund, Y., Schapire, R.E. (1997). A Decision-Theoretic Generalization of
On-Line Learning and an Application to Boosting. Journal of Computer and
System Sciences, 55(1), 119-139.
17. Seiffert, C., Khoshgoftaar, T.M., Van Hulse, J., Napolitano, A. (2010).
RUSBoost: A Hybrid Approach to Alleviating Class Imbalance. IEEE
Transactions on Systems, Man, and Cybernetics - Part A: Systems and
Humans, 40(1), 185-197.
18. Breiman, L. (1996). Bagging Predictors. Machine Learning, 24(2), 123-140.
19. Zou, H., & Hastie, T. (2006). Regularization and variable selection via the
elastic net. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 67(2), 301-320.
20. Kumar, A. (2020, September 7). Hard vs Soft Voting Classifier Python Example - Data
Analytics. Data Analytics.
https://vitalflux.com/hard-vs-soft-voting-classifier-python-example/