Department of Electrical and Computer Engineering
North South University

Senior Design Project

Botnet Attack Detection in IoT Networks Using Machine Learning

Azmine Mahtab Chowdhury, ID# 1911517642
Md. Rakibul Hasan, ID# 1912435042

Faculty Advisor:
Dr. Mohammad Monirujjaman Khan
Associate Professor
ECE Department

Spring, 2023

LETTER OF TRANSMITTAL

June, 2023

To
Dr. Rajesh Palit
Chairman, Department of Electrical and Computer Engineering
North South University, Dhaka

Subject: Submission of 499A Final Report on "Botnet Attack Detection in IoT Networks Using Machine Learning"

Dear Sir,

With due respect, we would like to submit our 499A Final Report on "Botnet Attack Detection in IoT Networks Using Machine Learning" as part of our BSc program. This report showcases the work completed in 499A, specifically focusing on the model training process and its significance in our research project. During this phase, we have gained valuable knowledge and practical skills related to detecting botnet attacks using machine learning. We have learned how to build and train machine learning models to identify and address botnet attacks in IoT networks. We have made every effort to meet all the requirements set out for this report, and we have strived to ensure that it covers all the necessary aspects of the model training process for detecting botnet attacks in IoT networks using machine learning.

We will be highly obliged if you kindly receive this report and provide your valuable judgment. It would be our immense pleasure if you find this report useful and informative in gaining a clear perspective on the issue.

Sincerely yours,

.........................................................
Azmine Mahtab Chowdhury
ECE Department
North South University, Bangladesh
ID: 1911517642

........................................................
Md. Rakibul Hasan
ECE Department
North South University, Bangladesh
ID: 1912435042

APPROVAL

Azmine Mahtab Chowdhury (ID # 1911517642) and Md. Rakibul Hasan (ID # 1912435042) from the Electrical and Computer Engineering Department of North South University have worked on the Senior Design Project titled “Botnet Attack Detection in IoT Networks Using Machine Learning” under the supervision of Dr. Mohammad Monirujjaman Khan, in partial fulfillment of the requirement for the degree of Bachelor of Science in Engineering, and the project has been accepted as satisfactory.

Supervisor’s Signature
…………………………………….
Dr. Mohammad Monirujjaman Khan
Associate Professor
Department of Electrical and Computer Engineering
North South University
Dhaka, Bangladesh.

Chairman’s Signature
…………………………………….
Dr. Rajesh Palit
Professor
Department of Electrical and Computer Engineering
North South University
Dhaka, Bangladesh.

DECLARATION

This is to declare that this project is our original work. No part of this work has been submitted elsewhere partially or fully for the award of any other degree or diploma. All project related information will remain confidential and shall not be disclosed without the formal consent of the project supervisor. Relevant previous works presented in this report have been properly acknowledged and cited. The plagiarism policy, as stated by the supervisor, has been maintained.

Students’ names & Signatures
1. Azmine Mahtab Chowdhury
2. Md. Rakibul Hasan

ACKNOWLEDGEMENTS

The authors would like to express their heartfelt gratitude towards their project and research supervisor, Dr. Mohammad Monirujjaman Khan, Associate Professor, Department of Electrical and Computer Engineering, North South University, Bangladesh, for his invaluable support, precise guidance and advice pertaining to the experiments, research and theoretical studies carried out during the course of the current project and also in the preparation of the current report.
Furthermore, the authors would like to thank the Department of Electrical and Computer Engineering, North South University, Bangladesh for facilitating the research. The authors would also like to thank Md. Rakibul Hasan for his help in this project. The authors would also like to thank their loved ones for their countless sacrifices and continual support.

ABSTRACT

The rapid growth of the Internet of Things (IoT) has brought intelligent systems and energy-aware sensing devices to the forefront of our daily lives. However, the limitations of IoT devices have resulted in an increase in botnet attacks, where compromised devices form a network controlled by malicious entities. To address this, we propose an ensemble learning model that profiles IoT network behavior to detect unusual traffic from compromised devices. Our project has focused on training the model using the N-BaIoT-2021 dataset and evaluating the accuracy of individual decision tree models, such as AdaBoost, RUSBoost, and bagged models. Preliminary results show promising performance, and we will further evaluate the ensemble model's ability to detect real-world botnet attacks. The impact and significance of our research lie in the development of a robust botnet detection system for IoT networks. By leveraging ensemble learning techniques and profiling IoT network behavior features, we aim to contribute to the mitigation of botnet attacks, ensuring the integrity and privacy of connected IoT devices.
TABLE OF CONTENTS

LETTER OF TRANSMITTAL
APPROVAL
DECLARATION
ACKNOWLEDGEMENTS
ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
Chapter 1 Introduction
1.1 Background and Motivation
1.2 Purpose and Goal of the Project
1.3 Organization of the Report
Chapter 2 Research Literature Review
2.1 Existing Research and Limitations
Chapter 3 Methodology
3.1 System Design
Figure 1: Block Diagram of the Proposed System
3.1.1 Dataset
3.1.2 Data Preparation (DP)
Table 1: Label encoding for the target classes (output labels).
3.1.3 Learning Process (LP)
Figure 2: Diagram of AdaBoost Classifier
Figure 3: Diagram of RUSBoost Classifier
Figure 3: Diagram of Bagged decision tree
Figure 4: Soft Voting Classifier with Equal Weights
3.1.4 Evaluation Process (EP)
3.2 Hardware and/or Software Components
3.3 Hardware and/or Software Implementation
Chapter 4 Investigation/Experiment, Result, Analysis and Discussion
4.1 Data Processing and Feature Selection
4.2 Model Training and Performance Evaluation
4.2.1 Model Accuracy
4.2.2 Model Comparison
Chapter 5 Impacts of the Project
5.1 Impact of this project on societal, health, safety, legal and cultural issues
5.2 Impact of this project on environment and sustainability
Chapter 6 Project Planning and Budget
Chapter 7 Complex Engineering Problems and Activities
7.1 Complex Engineering Problems (CEP)
7.2 Complex Engineering Activities (CEA)
Table 4 demonstrates A Sample Complex Engineering Problem Activities.
REFERENCES

Chapter 1 Introduction

1.1 Background and Motivation

The Internet of Things (IoT) has witnessed remarkable growth, enabling the interconnection of devices and the exchange of data across various domains. This interconnectedness has revolutionized industries, homes, and cities, leading to the exponential proliferation of IoT devices. However, along with the widespread adoption of IoT technologies, new security challenges have emerged, particularly in the form of botnet attacks.

A botnet refers to a network of compromised devices controlled by malicious entities known as botmasters [1]. These compromised devices, commonly referred to as bots or zombies, are infected with malware, allowing the botmasters to remotely manipulate and coordinate attacks on targeted systems or networks. Botnet attacks pose significant threats to the integrity, availability, and confidentiality of IoT networks, enabling various malicious activities, including distributed denial-of-service (DDoS) attacks, data theft, and unauthorized access [2].

The integration of IoT devices into various aspects of daily life and industrial operations has led to their extensive growth. Statistical data reveals the exponential increase in the number of connected devices worldwide. In 2019, there were approximately 24.15 billion IoT devices, and this number is projected to reach 76.45 billion by 2026 [3]. This rapid growth signifies the significant role IoT technologies play in enabling smart environments and transforming industries.

In addition to its widespread deployment, the IoT is expected to have a substantial financial impact on the global economy. Projections indicate that by 2025, the IoT could contribute between $3.9 trillion and $11.1 trillion to the global economy [4]. This economic potential emphasizes the need for robust security measures to protect the integrity and functionality of IoT networks.
As IoT networks continue to expand, the associated security vulnerabilities become more prominent. Traditional security mechanisms, such as firewalls and intrusion detection systems, are often inadequate to address the unique challenges posed by botnet attacks in IoT environments. Therefore, specialized detection systems are required to effectively identify and counteract botnet activities within IoT networks [5].

Ensuring the security and privacy of connected devices and the data they generate is of utmost importance in IoT networks. With the rapid growth of IoT and the increasing prevalence of botnet attacks, the development of a robust botnet detection system becomes crucial. Such a system plays a pivotal role in identifying and preventing botnet activities, thereby safeguarding the integrity and availability of IoT networks. By minimizing the risks associated with disruptive attacks, data breaches, and unauthorized access, the system enhances the overall security posture of IoT environments. Moreover, it instills trust and confidence in users, enabling the continued adoption and utilization of IoT technologies. Recognizing the unique security challenges posed by botnet attacks, the implementation of an effective detection system is indispensable for maintaining the integrity, privacy, and functionality of IoT networks.

1.2 Purpose and Goal of the Project

The purpose of this project is to develop an effective botnet attack detection system specifically tailored for IoT networks. The primary goal is to enhance the security and privacy of connected devices and the data they generate by detecting and mitigating botnet activities in IoT environments. The project aims to make several contributions to the field of IoT security. Firstly, by leveraging machine learning techniques and analyzing network behavior patterns, the proposed detection system intends to distinguish between legitimate IoT traffic and botnet-related activities.
This approach enables the identification of botnet-infected devices and the mitigation of potential threats, thereby strengthening the overall security posture of IoT networks.

1.3 Organization of the Report

There are eight chapters in the report. Chapter 1 introduces the project's background, motivation, and problem statement. The literature review in Chapter 2 examines previous studies on botnet attack detection in IoT networks using machine learning. Chapter 3 discusses the system design, including the flowchart, block diagram, dataset, algorithm, and hardware/software components. Chapter 4 presents the results, covering data preprocessing, feature selection, model training, testing, and accuracy evaluation. Chapter 5 covers the project's effects on society, health, safety, and the law. Chapter 6 describes the project planning and includes a Gantt chart to show the timeline. Chapter 7 discusses the complex engineering problems and activities specific to the project. Chapter 8 concludes the report with a summary, a list of limitations, and suggestions for further advancements.

Chapter 2 Research Literature Review

2.1 Existing Research and Limitations

Recent articles have made significant contributions to the field of IoT security, particularly in the detection and classification of IoT attacks. Valverde et al. [6] developed a transfer learning-based CNN model for automatic glaucoma classification, achieving an AUC of 94% using color fundus images from the DRISHTI-GS and RIM-ONE datasets. In the context of IoT security, Abu Al-Haija and Al-Badawi [16,17] proposed a comprehensive architecture for IoT intrusion detection and classification. They evaluated their models using two well-known IoT attack datasets, distilled-Kitsune-2018 and NSL-KDD. Their best results were better than any prior art by 1 to 20%.
While some studies focused on specific types of attacks, such as port scanning attacks [18] or Linux IoT botnet detection [26], others explored broader approaches. Tsogbaatar et al. [19] developed DeL-IoT, a deep ensemble learning approach that uses autoencoders to uncover anomalies in IoT; the dataset used for its evaluation is not detailed here. Rezaei [20] used ensemble learning techniques to detect botnets in IoT; the evaluation dataset and its characteristics are likewise not detailed here. Yang and Shami [23] proposed a lightweight concept drift detection and adaptation framework for IoT data streams, and Qaddoura et al. [24] proposed a multi-stage classification approach for IoT intrusion detection that combines clustering with oversampling techniques. In both cases the dataset used for evaluation and its characteristics are not detailed here.

The investigation of the existing literature revealed several limitations in the field. These include the use of individual open-source or custom datasets of small size, the lack of comparison between multiple deep learning techniques, and the absence of real-time analysis on embedded devices [16,17]. These limitations motivated the proposed research, which aims to address them.

In conclusion, recent literature has highlighted advancements in IoT attack detection and classification. However, there are limitations in terms of dataset sizes, the utilization of multiple techniques, and real-time analysis. The proposed research aims to address these limitations and contribute to the field by developing an ensemble model for detecting botnet attacks in IoT networks.
Motivated by the research gaps, the proposed investigation focuses on developing an ensemble model to detect botnet attacks in IoT networks. The ensemble model combines multiple machine learning techniques, including supervised learning, unsupervised learning, and ensemble learning. By leveraging the strengths of each technique, the proposed ensemble model aims to enhance the accuracy and efficiency of detecting several types of botnet attacks in IoT networks.

Chapter 3 Methodology

In this chapter, we describe the detailed approach and procedures used to achieve the objectives of the project. The chapter outlines the system design, the hardware and software components utilized, and the implementation process.

3.1 System Design

Constructing the model requires a preprocessed dataset and machine learning algorithms, so the data must be available once preprocessing is complete. Fig. 1 shows the block diagram of the proposed system.

Figure 1: Block Diagram of the Proposed System

The block diagram visually presents the system's architecture, showcasing the key subsystems and their connections. It provides an overview of the sequential flow, from data preparation to model training, evaluation, and performance analysis. Each subsystem plays a crucial role in ensuring accurate anomaly detection. The diagram highlights processes like data hosting, cleansing, feature selection, standardization, shuffling, and distribution. Supervised machine learning algorithms and evaluation metrics are employed for effective detection and assessment. This concise diagram offers a foundation for understanding the subsequent sections of the report.

3.1.1 Dataset:

The N-BaIoT dataset provides a valuable resource for detecting IoT botnet attacks. It comprises real traffic data collected from 9 commercial IoT devices infected with Mirai and BASHLITE botnets.
With a total of 7,062,606 instances and 115 attributes, this multivariate and sequential dataset offers insights into distinguishing between benign and malicious traffic data using anomaly detection techniques. Additionally, it enables multi-class classification, with 10 attack classes and a 'benign' class. The dataset includes features such as stream aggregation, time-frame statistics, weight, mean, std, radius, magnitude, cov, and pcc. This dataset is a result of the research conducted by Meidan et al. [15].

3.1.2 Data Preparation (DP):

At the initial stage, the raw IoT traffic data is preprocessed using Python and the PyCharm IDE. This involves transforming the data into labeled features, making it suitable for subsequent machine learning tasks.

1. Data Hosting Process: Python and PyCharm, rather than MATLAB, are used as the platform for data storage, training, and evaluation. The raw IoT traffic data is imported into Python data structures, such as pandas DataFrames, where each record represents an IoT traffic instance with corresponding feature columns.

2. Data Cleansing Process: The data cleansing process involves analyzing the dataset in order to better understand it and to correct incorrectly interpreted data. To increase the quality of the data, this process focuses on finding and eliminating flaws and inconsistencies. In this study, we applied a number of data cleansing steps to the imported data, such as missing value checks (finding null-value cells and replacing them with zero numerical values), corrupted value checks (looking for misinterpreted data and removing it), fixing the attribute names for the main features (data imported from CSV typically has no names for its attributes), and maintaining an atomic representation of the data (making sure all attributes are straightforward and filled with values).

For the binary classifier, the output classes are given the values 0 and 1, and for the multi classifier, the values 0 to 9. The encoding of the target labels, applied after erroneous and inaccurate data entries were corrected, is summarized in Table 1 below.

Classifier | Normal | Botnet(s)
--- | --- | ---
Binary Classifier | 0 (normal) | 1 (anomaly)
Ternary Classifier | 0 (normal) | 1 (MIRAI); 2 (BASHLITE)
Multi Classifier | 0 (normal) | 1 MIRAI_DANMINI_DOORBELL; 2 MIRAI_ECOBEE_THERMOSTAT; 3 MIRAI_PHILIPS_BABY_MONITOR; 4 MIRAI_PROVISION_SECURITY_CAMERA; 5 MIRAI_SAMSUNG_WEBCAM; 6 BASHLITE_DANMINI_DOORBELL; 7 BASHLITE_ECOBEE_THERMOSTAT; 8 BASHLITE_PHILIPS_BABY_MONITOR; 9 BASHLITE_PROVISION_SECURITY_CAMERA; 10 HAJIME_PROVISION_WEBCAM; 11 HAJIME_PHILIPS_BABY_DOORBELL; 12 HAJIME_PROVISION_SECURITY_CAMERA

Table 1: Label encoding for the target classes (output labels).

3. Feature Selection Process: Relevant features for botnet attack detection are selected using correlation coefficient scores. Features with high correlation to the target variable are chosen to optimize the machine learning model.

4. Data Standardization Process: Z-score normalization is applied to standardize the data. This process scales the features to have a mean of 0 and a standard deviation of 1. Standardization ensures that all features contribute equally and facilitates effective data processing by machine learning algorithms.

5. Data Shuffling Process: The data points are shuffled to remove any bias introduced by their original ordering. A uniform shuffle is performed at each epoch so that the model learns from a representative sample.

6. Data Distribution Process: The dataset is randomly divided into training, validation, and testing subsets, so that separate data is available for training the model and for evaluating its performance. Additionally, k-fold cross-validation is employed to assess the model's effectiveness.
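The standardization, shuffling, and distribution steps above can be sketched in a few lines of plain Python. This is a minimal illustration and not the project's actual code: the 70/15/15 split ratio, the fixed seed, the five folds, and all function names are assumptions made for the example, and the data is shuffled once rather than at each epoch. In practice, scikit-learn's `StandardScaler`, `train_test_split`, and `KFold` cover the same ground.

```python
import random
import statistics


def zscore(column):
    """Scale one feature column to mean 0 and standard deviation 1."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        # A constant column has no spread; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]


def shuffle_and_split(rows, train=0.70, val=0.15, seed=42):
    """Shuffle rows once, then cut them into train/validation/test parts.

    The ratios and the seed are illustrative assumptions.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train)
    n_val = round(len(rows) * val)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]


def k_folds(rows, k=5):
    """Yield (training part, held-out part) pairs for k-fold cross-validation."""
    for i in range(k):
        held_out = rows[i::k]
        training = [r for j, r in enumerate(rows) if j % k != i]
        yield training, held_out


# Toy values standing in for one feature column and for row indices.
feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = zscore(feature)

rows = list(range(20))
train_rows, val_rows, test_rows = shuffle_and_split(rows)
folds = list(k_folds(train_rows, k=5))
```

Each row of the training part is held out in exactly one fold, so every sample is used for validation once across the k rounds.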
3.1.3 Learning Process (LP):

The Learning Process involves the utilization of machine learning models, including AdaBoosted decision tree, RUSBoosted decision tree, bagged decision tree, and Soft Voting, for the detection of botnet attacks in IoT networks. These models collectively contribute to the development of an ensemble learning model, which combines their predictions to enhance the overall accuracy and robustness of the detection system.

1. AdaBoosted Decision Tree: The AdaBoost (Adaptive Boosting) algorithm combines multiple weak learners, such as decision trees, to create a strong learner. Each decision tree is trained iteratively, and subsequent trees focus on the instances misclassified by previous trees. The AdaBoosted decision tree model assigns higher weights to misclassified instances to improve their classification in subsequent iterations [16]. Fig. 2 shows the diagram of the AdaBoost decision tree.

Figure 2: Diagram of AdaBoost Classifier

Fig. 2 illustrates how the initial model is constructed and how the algorithm identifies the errors of the first model. The misclassified records are used as input for the subsequent model. This procedure continues until the stated stopping condition is satisfied. As seen in Fig. 2, three models are created, each incorporating the mistakes of the preceding model. This is the mechanism through which boosting operates.

2. RUSBoosted Decision Tree: The RUSBoost (Random Under Sampling Boosting) algorithm addresses class imbalance by undersampling the majority class and boosting the minority class. Fig. 3 shows the diagram of the RUSBoost decision tree.

Figure 3: Diagram of RUSBoost Classifier

Fig. 3 illustrates how it incorporates random under-sampling into the AdaBoost framework to mitigate the dominance of the majority class during training. This helps the RUSBoosted decision tree model handle imbalanced datasets more effectively [17].

3. Bagged Decision Tree: The Bagging (Bootstrap Aggregating) algorithm creates multiple subsets of the original dataset through bootstrap sampling. Each subset is used to train an independent decision tree, and the final prediction is obtained by aggregating the predictions of all decision trees. Bagging helps reduce overfitting and improves the stability and accuracy of the model [18]. The figure below shows the diagram of the bagged decision tree.

Figure 3: Diagram of Bagged decision tree

The figure illustrates how each of K subsets of the original dataset is used to train an independent decision tree, with the final prediction obtained by aggregating the predictions of all decision trees.

4. Soft Voting: Soft Voting is an ensemble learning technique where each model in the ensemble assigns probabilities to each class instead of making a deterministic prediction. The final prediction is obtained by taking a plain or weighted average of the predicted probabilities across all models. Soft Voting allows the ensemble model to consider the collective wisdom of the individual models and make more confident and accurate predictions [19]. This aggregation approach helps to reduce bias and variance, enhance the robustness of the system, and improve the overall accuracy of botnet attack detection. The combination of these machine learning models, along with Soft Voting, forms an ensemble learning framework that leverages their complementary strengths to achieve more accurate and reliable detection results.

Soft voting classifiers sort input data into categories based on the probabilities that each classifier assigns to each class. Based on the equation shown in Fig. 1, weights are suitably assigned to each classifier. To illustrate, assume three binary classifiers, clf1, clf2, and clf3.
The classifiers predict the following probabilities for classes [0, 1] on a given record: clf1 gives [0.2, 0.8], clf2 gives [0.1, 0.9], and clf3 gives [0.8, 0.2]. With equal weights, the ensemble probabilities are determined as follows:

Probability of Class 0 = 0.33*0.2 + 0.33*0.1 + 0.33*0.8 = 0.363
Probability of Class 1 = 0.33*0.8 + 0.33*0.9 + 0.33*0.2 = 0.627

The probability predicted by the ensemble classifier is therefore [36.3%, 62.7%]; the two values sum to 0.99 rather than 1 because the weight 1/3 is rounded to 0.33. With a threshold of 0.5, the predicted class is class 1. A soft voting classifier with equal weights looks like this:

Figure 4: Soft Voting Classifier with Equal Weights

With unequal weights [0.6, 0.4], applied here to two classifiers that predict [0.2, 0.8] and [0.6, 0.4], the probabilities are calculated as follows:

Prob of Class 0 = 0.6*0.2 + 0.4*0.6 = 0.36
Prob of Class 1 = 0.6*0.8 + 0.4*0.4 = 0.64

The probability predicted by the ensemble classifier is [0.36, 0.64]. With a threshold of 0.5, the predicted class is again class 1 [20].

3.1.4 Evaluation Process (EP):

The evaluation process (EP) is a critical step in this project to ensure that the system meets its requirements and objectives. We will use standard evaluation metrics on k-fold datasets to measure the performance of the four machine learning models we developed: AdaBoosted decision tree (ABDT), RUSBoosted decision tree (RBDT), bagged decision tree (BGDT), and the ensemble learning model.

1. Confusion Matrix Analysis: Assesses the performance of the detection system by analyzing true positive, true negative, false positive, and false negative predictions.

2. Performance Metrics: Measures the system's effectiveness, including network coverage score (NCS#), network miss score (NMS#), detection accuracy score (DAC%), detection precision score (DPR%), detection sensitivity score (DSN%), detection hit score (DHS%), input anomaly error rate (IAE%), input miss error rate (IME%), and input non-error rate (INE%).

3. Inference Overhead Evaluation: Evaluates the prediction speed (PRS) and prediction time (PRT) of the detection system.

3.2 Hardware and/or Software Components

This section discusses the software used to develop the system. The implementation of the project "Botnet Attack Detection in IoT Networks Using Machine Learning" involved various software components, each of which played a role in the development and execution of the project.

● Python: Python programming language served as the foundation for implementing the project. Python is widely used in the field of data science and machine learning due to its extensive libraries and frameworks, making it suitable for developing machine learning models and conducting data analysis.

● PyCharm: PyCharm, an integrated development environment (IDE) for Python, was used for coding, debugging, and managing the project. It provides features such as code completion, error highlighting, and version control integration, enhancing the development process.

● Anaconda: Anaconda, a popular distribution of Python, was utilized to create an isolated environment for the project. It simplifies package management and ensures consistent dependencies across different systems. Additionally, Anaconda includes Jupyter Notebook, a web-based interactive coding environment, which was used for exploratory data analysis and sharing code snippets.

● Libraries and Packages: Several software libraries and packages were employed to implement specific functionalities in the project. These include:

○ Pandas: Pandas facilitated data manipulation, preprocessing, and exploratory data analysis tasks.

○ Scikit-learn: Scikit-learn provided a range of machine learning algorithms, tools for model training and evaluation, and data preprocessing techniques.
○ Matplotlib and Seaborn: Matplotlib and Seaborn were used for data visualization, enabling the creation of various charts, graphs, and plots.

○ NumPy: NumPy was utilized for numerical computations and handling multidimensional arrays efficiently.

3.3 Hardware and/or Software Implementation

Throughout the software implementation phase, widely used software tools and frameworks such as Python, Jupyter Notebook, and scikit-learn were utilized. Python, a versatile programming language, served as the primary language for coding the software modules. Jupyter Notebook provided an interactive development environment for data exploration, model development, and result analysis. The scikit-learn library offered a comprehensive set of tools and algorithms for machine learning tasks, enabling efficient implementation of the selected models.

Chapter 4 Investigation/Experiment, Result, Analysis and Discussion

In this chapter we delve into the investigation and experimentation phase of our project on botnet attack detection in IoT networks using machine learning. This chapter focuses on presenting the experiments conducted, discussing the obtained results, and providing a comprehensive analysis of the findings. Additionally, we engage in constructive discussions to interpret the outcomes and gain valuable insights into the effectiveness of our approach.

4.1 Data Processing and Feature Selection

Data Cleansing: We began by performing data cleansing processes to address any errors or inconsistencies in the dataset. This involved detecting and removing redundant rows, ensuring that each data sample was unique and representative of the underlying information. Additionally, we conducted checks for null values within the dataset and discovered that there were no missing values, indicating a high level of data completeness.
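The cleansing steps described above (removing redundant rows and checking for null values) might look as follows in pandas. This is a hedged sketch rather than the project's actual script: the helper name `cleanse`, the toy DataFrame, and its column names are chosen purely for illustration, and the zero-fill follows the missing-value rule described in Section 3.1.2.

```python
import pandas as pd


def cleanse(df: pd.DataFrame):
    """Drop exact duplicate rows, count missing cells, and zero-fill them.

    Returns the cleaned frame together with the number of duplicate rows
    removed and the number of missing cells found.
    """
    n_duplicates = int(df.duplicated().sum())
    cleaned = df.drop_duplicates().reset_index(drop=True)
    n_missing = int(cleaned.isnull().sum().sum())
    # Section 3.1.2 describes replacing null-value cells with zero.
    cleaned = cleaned.fillna(0)
    return cleaned, n_duplicates, n_missing


# Toy frame standing in for the traffic data (illustrative column names).
toy = pd.DataFrame({
    "MI_dir_L5_weight": [1.0, 1.0, 2.5, None],
    "MI_dir_L5_mean": [60.0, 60.0, 74.0, 90.0],
    "label": [0, 0, 1, 1],
})
cleaned, n_duplicates, n_missing = cleanse(toy)
```

On a dataset with no missing values, as reported for this project, `n_missing` would come back as 0 and the fill step would leave the data unchanged.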
Feature Selection: To improve the efficiency and effectiveness of our machine learning models, we employed feature selection techniques to reduce the dimensionality of the dataset and focus on the key attributes that contribute significantly to the detection of botnet attacks.

We utilized Pearson's correlation coefficient as our feature selection method. This statistical measure quantifies the linear correlation between two variables, indicating the strength and direction of their relationship. By calculating the correlation between each feature and the target variable (botnet attack labels), we were able to assess their predictive power and select the most influential features. The feature selection process involved the following steps:

1. Computing the correlation coefficient: We computed the Pearson's correlation coefficient between each feature and the target variable. This measure provides a value between -1 and 1, where values close to 1 indicate a strong positive correlation, values close to -1 indicate a strong negative correlation, and values close to 0 indicate a weak or no correlation.

2. Setting the correlation threshold: We defined a correlation threshold, which determines the minimum correlation value that a feature must exhibit with the target variable to be considered relevant. In our case, we chose a threshold of 0.9, indicating that features with a correlation coefficient greater than or equal to 0.9 would be selected.

3. Selecting the relevant features: We iterated through each feature and compared its correlation coefficient with the correlation threshold. Features that met the threshold were retained, while those below the threshold were discarded. This process resulted in a reduced feature set consisting of the most important attributes for botnet attack detection.
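The three steps above can be written out directly. The sketch below is an illustration in plain Python, not the project's code: the function names, the toy columns, and their values are hypothetical. It compares the signed coefficient against the threshold, as the text states; comparing the absolute value instead would also keep strongly negatively correlated features.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column has no linear relationship to report.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, target, threshold=0.9):
    """Return the names of columns whose correlation with target meets the threshold."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    return [name for name, r in scores.items() if r >= threshold]


# Toy data; the column names are hypothetical, not taken from the dataset.
target = [0, 0, 0, 1, 1, 1]
columns = {
    "rate_mean": [0.1, 0.2, 0.1, 0.9, 1.0, 0.9],  # tracks the label closely
    "noise": [0.5, 0.1, 0.9, 0.4, 0.6, 0.2],      # unrelated to the label
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],   # zero variance
}
selected = select_features(columns, target)
```

With pandas, the same scores can be obtained for a whole DataFrame at once via `DataFrame.corrwith`, which is likely the more convenient route for 115 features.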
Figure 5: Correlation Matrix Heatmap before feature selection

Figure 6: Correlation matrix heatmap after feature selection

By applying Pearson's correlation coefficient and setting a correlation threshold, we identified the 30 most significant features out of the initial 115 features in our dataset. This feature selection process ensures that our models focus on the attributes that have the strongest impact on botnet attack detection, enhancing their performance and interpretability.

4.2 Model Training and Performance Evaluation

After the feature selection process, we standardized the data using Z-score normalization and shuffled the dataset. This normalization technique ensures that all features have a mean of 0 and a standard deviation of 1, thereby preventing any bias due to differences in the scale of the features. Shuffling the data randomizes the order of the samples, which is important to avoid any bias introduced by the original ordering of the dataset.

4.2.1 Model Accuracy

This section discusses the training and test accuracy of each model used.

1. Bagged boost: Bagging (Bootstrap Aggregating) is an ensemble learning technique that combines multiple models trained on different subsets of the training data. Each model is trained independently, and the final prediction is determined by combining the predictions of all models (e.g., majority voting for classification tasks). Fig. 6 depicts the Bagged boost classifier report.

Figure 6: Training and test accuracy for the Bagged boost classifier report

In this project, Bagging achieved a training accuracy of 100.0% and a test accuracy of 99.96%.

2. RUS boost: RUSBoost randomly selects a subset of instances from the majority class to balance the class distribution and then applies AdaBoost to the balanced dataset. Fig. 7 depicts the RUS Boost classifier report.

Figure 7: Training and test accuracy for the RUS boost classifier

RUSBoost achieved a training accuracy of 100.0% and a test accuracy of 99.93% in this project.

3. Ada boost: AdaBoost assigns higher weights to misclassified instances to improve their classification in subsequent iterations. Fig. 8 depicts the Ada Boost classifier report.

Figure 8: Training and test accuracy for the Ada boost classifier

In this training phase, AdaBoost achieved a training accuracy of 100.0% and a test accuracy of 99.84%.

4.2.2 Model Comparison:

This section compares the results obtained for the different models and examines how soft voting is used to create the ensemble model. Table 2 below shows the accuracy achieved by each model:

Algorithm | Training Accuracy (%) | Test Accuracy (%)
--- | --- | ---
AdaBoost | 100.0 | 99.84
RUSBoost | 100.0 | 99.93
Bagging | 100.0 | 99.96
Soft Voting | 100.0 | 99.98

Table 2: Comparison of train and test accuracy

In this project, Soft Voting achieved a training accuracy of 100.0% and a test accuracy of 99.98%. These accuracy scores demonstrate the effectiveness of all the algorithms in accurately classifying instances in the dataset. Here are the key observations and analysis:

● High Training Accuracy: All algorithms achieved a perfect training accuracy of 100.0%, indicating their ability to fit the training data accurately. This suggests that the models have successfully learned the patterns and features in the training dataset.

● Excellent Test Accuracy: The test accuracies of all algorithms are very close to the training accuracies, ranging from 99.84% to 99.98%. This indicates that the models generalize well to unseen data and can make accurate predictions on new instances.
● Comparable Performances: The differences in test accuracy among the algorithms are minimal, with the highest accuracy achieved by Soft Voting (99.98%). This suggests that all the algorithms are robust and perform well in classifying the instances.
● Reliable Ensemble Methods: AdaBoost, RUSBoost, Bagging, and Soft Voting are all ensemble methods that combine multiple models to make predictions. These methods leverage the collective knowledge and diversity of the individual models, leading to improved performance and higher accuracy.

Overall, the experimental results validate the effectiveness of the implemented algorithms in classifying instances in the dataset. The high training and test accuracies demonstrate the models' ability to learn from the data and generalize to unseen samples. The minimal differences in test accuracy among the algorithms indicate that they all perform at a high level. These findings are promising for the application of these algorithms in real-world scenarios requiring accurate classification.

In future work, the model will undergo a comprehensive evaluation to demonstrate its ability to detect seven different types of attacks. This evaluation will assess how accurately the model identifies and classifies each attack type.

Chapter 5 Impacts of the Project

In this chapter, we explore the impact of the project on societal, health, safety, legal, cultural, and environmental aspects. We discuss how the project's outcomes can contribute to positive changes in these areas. Additionally, we address the project's implications for sustainability and the environment. Furthermore, we provide insights into future considerations and potential directions for further development.
This chapter aims to provide a comprehensive understanding of the project's broader significance and the potential it holds for addressing key challenges in different domains.

5.1 Impact of this project on societal, health, safety, legal and cultural issues

The project has several impacts on societal, health, safety, legal and cultural issues:
● Societal: The project has a positive social impact by improving the security of IoT networks, which are becoming increasingly prevalent in our daily lives. The project can help prevent the hijacking of IoT devices by botnets, which can damage personal and public property and compromise sensitive data.
● Health: The project does not have a direct impact on health.
● Safety: The project can indirectly improve the safety of IoT devices by preventing botnet attacks that can compromise the functionality and safety of these devices.
● Legal: The project is in compliance with relevant legal requirements, such as data privacy and protection regulations. It can also help organizations comply with legal requirements related to the security of their networks and data.

5.2 Impact of this project on environment and sustainability

The impact of the project on the environment and on sustainability is as follows:
● Environment: The project itself does not have a significant impact on the environment, as it is a software-based solution. However, the detection and prevention of botnet attacks in IoT networks can indirectly reduce the environmental impact of such attacks. Botnets can be used to launch DDoS attacks, which can cause significant energy consumption due to the high traffic volume. By preventing such attacks, the project can indirectly contribute to reducing energy consumption and carbon emissions.
● Sustainability: First of all, the project is focused on detecting and preventing botnet attacks on IoT devices.
As the number of IoT devices continues to grow, the threat of botnet attacks will continue to increase, making this project relevant and necessary in the long term. Secondly, the project is based on machine learning algorithms and anomaly detection techniques. These technologies are constantly evolving and improving, meaning that the project can be updated and enhanced over time to keep up with the latest developments. Thirdly, the project aims to provide a comprehensive solution for network security professionals, IT administrators, and cybersecurity researchers. By offering a platform for collecting and preprocessing network traffic data, extracting relevant features, and training machine learning models, the project can help these professionals stay ahead of the evolving threat landscape of botnet attacks. Finally, the proposed solution is innovative in that it combines an ensemble learning approach with machine learning and anomaly detection. This approach has not been widely explored in the field of network security and has the potential to improve the accuracy of botnet detection.

Overall, the project is sustainable in the sense that it addresses an ongoing and growing problem, is based on evolving technologies, offers a comprehensive solution, and is innovative in its approach. To further ensure the sustainability of the project, the following measures can be taken:
● Regular Maintenance: The project should be regularly maintained to keep up with the latest developments in the field of cybersecurity. This includes updating the software and algorithms to adapt to new threats and improving the accuracy of the system.
● Collaboration with Industry Experts: Collaborating with industry experts and incorporating feedback from users will help improve the performance of the system and make it more effective in detecting botnet traffic.
● Integration with Existing Infrastructure: The system should be integrated with existing network security infrastructure, such as firewalls and intrusion detection systems, to provide a comprehensive solution for detecting and preventing botnet traffic.
● Continuous Improvement: The system should be continuously improved by incorporating new features and algorithms to make it more effective in detecting and preventing botnet traffic.
● Training and Education: Training and educating network administrators and security personnel on the use of the system will help improve its adoption and effectiveness.

By implementing these measures, the project can be sustained in the long term and continue to provide value to its users in the fight against botnet traffic.

Chapter 6 Project Planning and Budget

In Chapter 6, we focus on project planning and budget considerations. We provide an overview of the project timeline and outline the tasks and activities to be completed in the upcoming semester. The Gantt chart in Figure 9 depicts the project milestones, dependencies, and the estimated duration of each task.

Figure 9: A sample Gantt chart regarding the planned project

The project has reached a significant stage with the completion of model training. In the upcoming semester, the focus will shift towards model evaluation, final report writing, final presentation, and project demonstration. These activities are crucial for evaluating the performance of the developed models, documenting the project findings, and showcasing the project outcomes.

Regarding the budget, no specific budget was required for this project. The project relied primarily on open-source software tools, publicly available datasets, and existing hardware resources. This allowed us to conduct the project without incurring any financial costs.
The emphasis was placed on leveraging existing resources and skills to ensure cost-effectiveness while still achieving the project goals.

Chapter 7 Complex Engineering Problems and Activities

7.1 Complex Engineering Problems (CEP)

Table 3 lists the complex engineering problem attributes addressed in the project:

Attributes | Addressing the complex engineering problems (P) in the project
P1 Depth of knowledge required (K3-K8) | The project requires in-depth knowledge of machine learning algorithms, IoT network architecture, network security concepts, and programming languages such as Python.
P2 Range of conflicting requirements | The project involves balancing conflicting requirements, such as accuracy of detection, computational resources, and real-time processing, to ensure effective botnet detection without compromising system performance.
P3 Depth of analysis required | The project requires extensive analysis and evaluation of various machine learning models, feature selection techniques, and data preprocessing methods to identify the most suitable approach for botnet detection in IoT networks.
P4 Familiarity of issues | The project necessitates familiarity with IoT protocols, network traffic analysis, and machine learning algorithms specifically tailored for anomaly detection in IoT environments.
P5 Extent of applicable codes | The project relies on programming languages such as Python and frameworks/libraries like scikit-learn and TensorFlow to implement machine learning models and conduct data analysis for botnet detection.
P6 Extent of stakeholder involvement | The project involves stakeholders such as network administrators, IoT device manufacturers, and cybersecurity experts who need to collaborate to ensure the successful implementation and deployment of the botnet detection system.
P7 Interdependence | The project comprises interdependent components, including data collection from IoT devices, preprocessing techniques, model training, and performance evaluation, which need to be integrated to create an effective botnet detection solution.

Table 3: Table of Complex Engineering Problems (CEP)

7.2 Complex Engineering Activities (CEA)

Table 4 lists the complex engineering activity attributes addressed in the project:

Attributes | Addressing the complex engineering activities (A) in the project
A1 Range of resources | This project involves the allocation of various resources, including computing resources (CPU, memory), datasets for training and testing, software libraries (Python, scikit-learn), and IoT devices for experimentation.
A2 Level of interactions | The project requires interactions among different stakeholders, including cybersecurity experts, network administrators, IoT device manufacturers, and end-users, to gather requirements, share insights, and validate the system's effectiveness.
A3 Innovation | The project employs innovative techniques and algorithms to detect and mitigate botnet attacks in IoT networks, contributing to the advancement of network security solutions in the IoT domain.
A4 Consequences to society/environment | The project has positive consequences for society and the environment by enhancing the security of IoT networks, protecting sensitive data, ensuring the integrity of connected devices, and mitigating potential risks associated with botnet attacks.
A5 Familiarity | Familiarity with IoT network architecture, network protocols (e.g., MQTT, CoAP), machine learning algorithms (e.g., decision trees, ensemble methods), programming languages (Python), and cybersecurity concepts is crucial for successful implementation and evaluation of the botnet detection system.

Table 4: Table of Complex Engineering Activities (CEA)