A Survey on Use of Data Mining for Detecting Cyber Attacks

Samkeet Shah
Department of Computer Engineering
Dwarkadas J. Sanghvi COE
Mumbai, India
samkeet@outlook.in

Chetashri Bhadane
Assistant Professor
Department of Computer Engineering
Dwarkadas J. Sanghvi COE
Mumbai, India
chetashri@gmail.com

ABSTRACT
The widespread use of internet-based services has led to extensive transmission of data. Cyberattacks hence pose a major threat to the network and its infrastructure. Data mining is the process of analyzing the available data and deriving patterns from that analysis. The available data is classified on the basis of similarities in common attribute values. The pattern obtained from the classification is useful information that can be used to construct an analytical or prediction model. Data mining summarizes the data based upon an identified pattern in the data set. In information security, data mining algorithms can be applied to intrusion detection and to detecting denial of service attacks. This review paper highlights and compares data mining techniques, explains different types of cyberattacks, and describes how data mining techniques are used to detect attacks over a network.

Keywords
Data Mining, Security, Denial of Service, Intrusion detection, Malware, Probing.

1. INTRODUCTION
Data mining is used to analyze huge data sets by detecting patterns in the relevant data so as to provide useful information that can be applied in various fields. The size of the data set is variable; it can range from a few hundred thousand values to millions or billions of values. In general, a larger data set improves the accuracy of the data mining algorithm. The selection of the correct data mining algorithm determines the correctness and accuracy of the end result.

Information security is one of the major concerns today. Smart devices now collect and synchronize data related to major aspects of human life. Hence it becomes necessary to identify and detect possible threats to the security and integrity of the internet and the services provided to its users. The information transmitted in packets across the transmission lines can be intercepted and modified by attackers, so appropriate measures must be taken to prevent these attacks. Data mining techniques and tools can be used for this purpose, which makes data mining an integral part of security.

2. CYBERATTACKS AND THEIR TYPES
Cyberattacks are any type of attack that may threaten the integrity of systems or compromise their security. Cyberattacks can lead to loss or theft of data, which may contain sensitive user information, and malware can even result in a complete lockout of a system, forcing the user either to hard reset the system or to give in to the demands of the attacker. Hence it becomes necessary to know the types of attacks possible and how data mining can be used for early detection of an attack. The types of cyberattacks are as follows:

2.1 Software based attacks
1. Malware: Malware can be defined as a wide range of intrusion software, including Trojans, worms and viruses, that is intended to perform malicious activities on the host system. It is a piece of malicious code intended to exploit the vulnerabilities in the host system.
2. Adware: Adware programs are activated when you visit a particular webpage or use an application. They are intended to redirect you to a malicious website or to collect marketing data about the user in order to provide personalized advertisements.
3. Spyware: Spyware is malicious code within a piece of software, or a piece of software itself, that is intended to collect sensitive user information, which may include personal details as well as bank and credit card details. This information is sent to the attacker without the knowledge of the user.
4. Ransomware: Ransomware is a type of malware that locks a user out of their computer and sometimes encrypts all of the user's files. The user is then required to pay a certain amount of ransom to regain control of the system from the attacker.

2.2 Remote Attacks
1. Denial of Service: A Denial of Service (DoS) attack is a server-side attack in which, due to excessive load or an excessive number of page requests, the server crashes and the end user is unable to access the service. DoS attacks can be executed in multiple ways. One is a vulnerability attack, where an attacker exploits a vulnerability in the application by sending a malicious packet. The second type of DoS attack works either by flooding the server with excessive requests so as to exhaust its resources or by causing congestion in the network bandwidth. [12]
2. Botnet: A botnet is a set of computers that have been covertly configured by the attacker to forward spam or unwanted data packets as directed by the attacker. The infected computers are 'bots' or 'robots' over the 'net' or 'network'. The user of an infected computer is unaware of the infection. Botnets are one of the major factors in widespread attacks across the world. A series of botnets spread over a large area can be used for a Distributed DoS attack, resulting in spam and page requests arriving from different locations around the world.
3. User to Root (U2R): U2R attacks involve gaining root access or administrative privileges on a host computer without the user's knowledge, which can result in the attacker demanding a ransom or other benefits to unlock the system or return control to the user.
4. Remote to Local (R2L): An attack that allows the attacker to gain remote access to a local client machine without having an account, by taking advantage of bugs on the client machine, is called a remote to local attack.

3. CLASSIFICATION ALGORITHMS
Classification of data is the process of grouping the data into predefined classes or groups.
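To make the idea of grouping records into predefined classes concrete, the following is a minimal sketch and not a method taken from any of the surveyed papers. It uses a k-nearest-neighbour vote (KNN, one of the algorithms listed in Table 1) purely for brevity. The two features (requests per second and failed logins), the training records, and the function name `classify` are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical labelled connection records:
# (requests_per_second, failed_logins) -> predefined class.
TRAINING = [
    ((5.0, 0.0), "normal"),
    ((8.0, 1.0), "normal"),
    ((6.0, 0.0), "normal"),
    ((900.0, 0.0), "dos"),
    ((1200.0, 1.0), "dos"),
    ((850.0, 0.0), "dos"),
    ((4.0, 30.0), "r2l"),
    ((7.0, 45.0), "r2l"),
    ((5.0, 38.0), "r2l"),
]


def classify(record, training=TRAINING, k=3):
    """Return the majority class among the k training records closest to `record`."""
    # Sort the labelled records by Euclidean distance to the new record.
    by_distance = sorted(training, key=lambda item: math.dist(item[0], record))
    # Let the k nearest neighbours vote on the class.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(classify((1000.0, 0.0)))  # a very high request rate
    print(classify((6.0, 40.0)))    # many failed logins
```

With these made-up records, a very high request rate is assigned to the "dos" class and a high number of failed logins to the "r2l" class. The features are deliberately left unscaled to keep the sketch short; with real traffic data they would normally be normalized so that no single attribute dominates the distance.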
The most common and obvious parameters (attributes) are used to classify the data in the data sets. Each class corresponds to a unique value of the attribute used to classify the data. Some of the data mining algorithms used for classification are Bayesian Classification or Bayesian Networks, Support Vector Machines, Artificial Neural Networks and Decision Trees.

1. Decision Tree (DT)
The Decision Tree algorithm makes use of a graph-like structure that has one root node and multiple leaf nodes. The root node is a node that has no incoming edges; it is the starting point of the tree structure. An internal node has incoming as well as outgoing edges and denotes a test condition that determines the resultant outcome. The leaf nodes denote the final outcome of the tree. The learning and classification steps of a decision tree are simple and fast.

2. Bayesian Networks (BN)
A Bayesian Network is based upon the Bayes theorem of statistical classification. It uses probabilities to check the likelihood that a particular data item belongs to a predefined class. A Bayesian Network allows class-conditional dependencies to be defined between subsets of variables and provides a graphical model of causal relationships on which a learning model can be designed. [1]

3. Support Vector Machine (SVM)
A Support Vector Machine uses the concept of a decision plane to define boundaries that classify data. For example, in the accompanying diagram a line separates the data set into a red part and a green part; the diagram shows how the data is classified based upon its attributes and how a partition is constructed to separate the classes. SVM performs classification primarily by constructing hyperplanes in a multidimensional space, and thus it separates and classifies data. [4]

4. Artificial Neural Networks (ANN)
Artificial Neural Networks simulate the learning process by extracting patterns or detecting trends from complicated or imprecise data through the use of neurons.
They can be used to represent complex relationships and to find patterns in those relationships. They are classified as feed-forward networks, feedback networks, and self-organizing networks. Each edge between two neurons has a variable quantity called a weight. The change in weight due to previous learning of the network is studied as a part of artificial learning. Through repeated iterations, the network improves upon the previously calculated values of the weights. The interconnections among the neurons are described by the weights in the form of a weight matrix. The artificial neural network compares the obtained output of a layer with the expected output. The resulting error is propagated back to the preceding layer, and the weights of the interconnections are modified to obtain new weights. This propagation of error continues until the difference between the expected and obtained output is minimized. [3][9]

Comparison of Classification Algorithms
Each algorithm discussed above has its advantages and disadvantages. Some of the factors taken into consideration while comparing them are adaptiveness, speed, and the ability to handle large, complex data. Table 1 compares the above data mining classification algorithms. [12-14]

Table 1. Comparison of classification algorithms
Classification Algorithm | Advantages | Disadvantages
Decision Trees | Easy to implement; can handle high-dimensional, categorical, text and numeric data | Difficult to create trees for complex problems; not adaptive (a change in input is not easily reflected in the output); sensitive to irrelevant attributes or features
Bayesian Network | High accuracy and speed in processing large data sets; simple to implement; not affected by irrelevant attributes | Inability to handle missing data links, which may lead to bad output
SVM | Can be used dynamically; does not exaggerate minor fluctuations | Long training and testing time along with excessive memory usage
ANN | Able to handle complex, continuous relations; adaptive (a change in input is reflected in the output) | Complex, and training time is lengthy
KNN | Easy to implement and adaptive | Sensitive to noise

4. CLUSTERING ALGORITHMS
The data classified into different categories needs to be clustered based upon the similarities in the attribute values of each data item. Clustering means finding groups of objects such that the objects in a group are similar to one another and unrelated to the objects in other groups.

1. Fuzzy Clustering
In this algorithm, the clustering of data items is based on the probabilities with which their attribute values belong to a particular cluster. Data elements can belong to more than one cluster, and associated with each element is a set of membership levels. Fuzzy clustering thus associates a membership function with each member. Points that lie on the edge of a cluster may belong less to that cluster and more to another cluster. [5]

2. K-means
The K-means clustering algorithm is an unsupervised learning algorithm that begins with the selection of k random centroids. Euclidean distance, cosine similarity and correlation are some of the measures used to map data items from the data sets into different categories based upon the centroids. As new data items are added to the data sets, the centroid is recalculated based upon the value of the new data item. [5]

3. Genetic Clustering Algorithm (GCA)
The genetic clustering algorithm, based upon the procedures of natural genetics and evolution, operates on a search space in which each parameter is encoded in the form of a string. These strings are collectively called a population. The process begins with the initialization of the population: a randomly selected group of strings is chosen as the population. Following this, the fitness associated with each string is computed. Associated with each string is a goodness factor given by an objective (fitness) function. The clustering metric used is usually the Euclidean distance of each point in a cluster from the center of its cluster. If the termination condition is met, the output is obtained; otherwise the process continues until a terminating condition is reached. [6]

4. Swarm Intelligence
The interactive behavior among a colony or swarm of data elements forms the basis of the Swarm Intelligence clustering technique. The Ant Colony optimization algorithm involves placing the initial data randomly; the new incoming packets are then classified by comparing the incoming data with the preexisting data classes. This algorithm is based upon a similar behavior in ants. Ants initially dump the corpses of other ants at random. As the corpses continue to pile up, they are placed in the existing groups based upon their similarity to the corpses already there. A similar principle can be applied to the clustering of data packets over a network. The packets can be clustered based upon their normal daily behavior, and any unusual packet that arrives can be flagged or placed into a separate cluster. The overall accuracy of this algorithm depends on the number of packets monitored: the greater the number of deviations in packet types, the greater the number of clusters, and hence the better the clustering and detection of malicious data packets.

Table 2. Comparison of clustering algorithms
Algorithm | Advantages | Disadvantages
Fuzzy clustering | Better results in fewer iterations | Not suitable for very large datasets
Genetic algorithm | Easy to implement, adaptable | Takes longer time for rule execution
K-means | Simple to implement and adaptable | Affected by noise
Swarm intelligence | Resistant to noise and outliers, adaptable | Difficult to understand

5. RELEVANT WORK AND RESULTS
Mokhrane and Rachid [7] have used a variation of the Ant Miner algorithm, combining inputs from the CAC algorithm and association rules. CAC stands for Communicating Ant for Clustering; the algorithm has fewer parameters and functions similarly to the Swarm Intelligence clustering technique. They applied the CAC algorithm to the KDD dataset. The features of the KDD dataset were initially selected to differentiate normal behavior from attacks. The Apriori algorithm was then used to generate association rules by first searching for sets that show greater reliability than a set threshold. This is followed by extraction of the rules whose precision meets the user's requirements. Their results are shown in Table 3 and Table 4.

Table 3. Attack detection success rate using CAC [7]
Attack | Rate | False Negative
UDPstorm | 50% | 50%
Smurf | 78% | 22%
DoS detection rate | 96.21% | 3.79%
DoS detection rate of new attack | 99.33% | 0.67%
R2L detection rate | 66.14% | 33.86%
R2L detection rate of new attack | 29.55% | 70.45%

Table 4. Attack detection success rate using CAC [7]
Attacks (method: CAC) | Detection rate
Known attacks | 96.6%
Unknown attacks overall | 99.58%
U2R (unknown attack) | 94.05%
Probing (unknown attack) | 89.39%
DoS (unknown attack) | 99.6%
R2L (unknown attack) | 64.3%

Jongsuebsuk et al. [10] have used a fuzzy genetic algorithm to detect unknown attacks. Their experimental results indicate that the fuzzy genetic algorithm has the highest detection rate. Genetic algorithms, like the genes in biological beings, are constantly evolving. They have a strong ability to learn and are thus able to detect unknown attacks. The genetic algorithm was used to make the fuzzy algorithm learn about new attacks by itself. The results are summarized in Table 5.

Table 5. DoS detection accuracy (%) [10]
Attacks | Decision tree | Fuzzy GA
HTTP flood | 85.5 | 100
UDP flood | 26.7 | 100
Smurf | 100 | 98.5
Average | 70.73 | 99.5

Wenying et al. [8] have used CSVACN (Combining support vector with Ant Colony networks), a hybrid of an active learning support vector machine algorithm and the Clustering for Self-Organized Ant Colony Network (CSOACN) algorithm, to support their findings. SVM is used to classify data by constructing a hyperplane. CSOACN takes the different network connection records as its objects, each of which can belong either to the normal class or to an abnormal (intrusion) class. The CSOACN algorithm is used to generate models from the SVM training set for normal data as well as for each class of abnormal data. Results of this technique are shown in Table 6.

Table 6. IDS detection accuracy (%) [8]
Technique | Detection rate | False Negative
SVM | 66.702 | 21.900
CSOACN | 80.100 | 0.360

Agravat et al. [11] have used a modified ant miner algorithm (MACO), which is reported to have a higher accuracy for detecting unknown attacks, including DOS, PROBE, R2L and U2R attacks. The algorithm starts with rule construction, where each rule is of the form if <condition1 and condition2 … then class>. Results are shown in Table 7.

Table 7. Attack detection accuracy (%) [11]
Attack | MACO | SVM
DOS | 98.83 | 97.20
Probe | 97.84 | 98.50
R2L | 93.53 | 66.01
U2R | 99.04 | 98.99

Reda M. Elbasiony et al. [15] have used a hybrid framework that uses random forests and the K-means clustering algorithm for network intrusion detection. The hybrid approach is divided into two phases, the online phase and the offline phase.
During the online phase, i.e. misuse detection, if an intrusion is detected, the attack features are sent to the random attack selector component of anomaly detection. [10] The offline phase, on the other hand, takes the obtained feature pattern and compares it with the values from the training dataset. The proposed hybrid framework gives a 98.3% detection rate on the 1% KDD dataset, compared to individual anomaly detection rates of 92.73% and 95%.

Lin et al. [18] have used anomaly detection and the classification algorithms Naïve Bayes, decision tree (J48) and Bayes Network, and compared their results in a study of the detection of P2P botnets. The Trojan Peacomm virus was used as the study object. Using internet flow traffic in a controlled environment, they generated a P2P botnet. Considering the packet flow characteristics, they defined the important attributes for data analysis. The J48 algorithm shows the highest accuracy in the detection of botnets, with 98% accuracy compared to 89% for Naïve Bayes and 87% for Bayes Network.

6. CONCLUSION
Of all the data mining algorithms compared, classification using the ant colony algorithm is the most efficient and accurate for detecting cyberattacks and intrusions. Based upon the research surveyed above, the overall accuracy of the combination of the ant miner algorithm and association rule generation is higher than that of the other algorithms as well as other variations of the ant miner algorithm. The ant miner algorithm guarantees convergence to a solution. Because it is adaptive, it automatically adjusts its results to changing input, and it allows parallelism, which speeds up the process and provides faster feedback for training the algorithm. However, the time it takes to converge to a solution is uncertain. Thus, in this paper we reviewed the main data mining techniques and some of their popular algorithms. We further compared the advantages and disadvantages of each presented algorithm.
Through a comparison based on key criteria, we showed that learning algorithms are adaptable and good for unknown pattern recognition but have long processing times, while linear algorithms are faster but not suitable for unknown attacks. Further, we discussed different types of cyberattacks and the applications of data mining techniques to cyberattack detection.

7. REFERENCES
[1] N. Fenton and M. Neil, Risk Assessment and Decision Analysis with Bayesian Networks. CRC Press, 2012.
[2] L. Rokach and O. Maimon, "Decision trees," in O. Maimon and L. Rokach, Eds. Springer US, 2005, pp. 165-192.
[3] Priyanka Gaur, "Neural Networks in Data Mining," International Journal of Electronics and Computer Science Engineering, 1449, ISSN 2277-1956.
[4] J. Dukart, "Support Vector Machine Classification Basic Principles and Application," pp. 19-19, 2012.
[5] T. Velmurugan, "Performance based analysis between k-Means and Fuzzy C-Means clustering algorithms for connection oriented telecommunication data," 2014.
[6] Ujjwal Maulik and Sanghamitra Bandyopadhyay, "Genetic algorithm-based clustering technique," Pattern Recognition 33 (2000) 1455-1465.
[7] "CAC-UA: A Communicating Ant for Clustering to detect Unknown Attacks," Science and Information Conference 2014, August 27-29, 2014, London, UK.
[8] "Mining network data for intrusion detection through combining SVMs with ant colony networks," Future Generation Computer Systems 37 (2014) 127-140.
[9] "A Review of Cyber Attack Classification Technique Based on Data Mining and Neural Network Approach," International Journal of Computer Trends and Technology (IJCTT), volume 7, number 2, Jan 2014, ISSN 2231-2803.
[10] P. Jongsuebsuk, N. Wattanapongsakorn and C. Charnsripinyo, "Network intrusion detection with fuzzy genetic algorithm for unknown attacks," in Information Networking (ICOIN), 2013 International Conference On, 2013.
[11] D. Agravat, U. Vaishnav and P. Swadas, "Modified ant miner for intrusion detection," in Machine Learning and Computing (ICMLC), 2010 Second International Conference On, 2010.
[12] "A Survey of Defense Mechanisms Against Distributed Denial of Service (DDoS) Flooding Attacks," IEEE Communications Surveys & Tutorials, vol. 15, no. 4, Fourth Quarter 2013.
[13] S. Garg and A. K. Sharma, "Comparative Analysis of Data Mining Techniques on Educational Dataset," International Journal of Computer Applications, vol. 74, 2013.
[14] "A Survey and Comparative Analysis of Data Mining Techniques for Network Intrusion Detection Systems," International Journal of Soft Computing and Engineering (IJSCE), ISSN 2231-2307, Volume-2, Issue-1, March 2012.
[15] Reda M. Elbasiony, Elsayed A. Sallam, Tarek E. Eltobely and Mahmoud M. Fahmy (Tanta University, Faculty of Engineering, Tanta, Gharbia, Egypt), "A hybrid network intrusion detection framework based on random forests and weighted k-means," Ain Shams Engineering Journal, Ain Shams University.
[16] M.-S. Yang, "A Survey of Fuzzy Clustering," Department of Mathematics, Chung Yuan Christian University, Chungli, Taiwan 32023, R.O.C.
[17] Hoa Dinh Nguyen and Qi Cheng, "An Efficient Feature Selection Method For Distributed Cyber Attack Detection and Classification," IEEE, 2013, pp. 1-6.
[18] S. Lin, P. Chen and C. Chang, "A novel method of mining network flow to detect P2P botnets," Peer-to-Peer Networking and Applications, pp. 1-10, 2012.