A Survey on Use of Data Mining for Detecting Cyber Attacks

Samkeet Shah
Department of Computer Engineering
Dwarkadas J. Sanghvi COE
Mumbai, India
samkeet@outlook.in

Chetashri Bhadane
Assistant Professor
Department of Computer Engineering
Dwarkadas J. Sanghvi COE
Mumbai, India
chetashri@gmail.com
ABSTRACT
The widespread use of internet-based services has led to extensive transmission of data. Cyberattacks hence pose a major threat to the network and its infrastructure. Data mining is the process of analyzing the available data and deriving patterns from the analysis. The available data is classified on the basis of similarities in common attribute values. The patterns obtained from the classification are useful information that can be used to construct an analytical or predictive model. Data mining summarizes the data based upon the patterns identified in the data set. In information security, data mining algorithms can be applied to intrusion detection and to the detection of denial of service attacks. This review paper highlights and compares data mining techniques, explains different types of cyberattacks, and describes how data mining techniques are used to detect attacks over a network.
Keywords
Data Mining, Security, Denial of Service, Intrusion detection, Malware, Probing.
1. INTRODUCTION
Data mining is used to analyze huge data sets by detecting patterns in the relevant data so as to provide useful information that can be applied in various fields. The size of the data set is variable; it can range from a few hundred thousand values to millions or billions of values. In general, a larger data set improves the accuracy of the data mining algorithm, and the selection of the correct data mining algorithm determines the correctness and accuracy of the end result. Information security is one of the major concerns today. Smart devices now collect and synchronize data related to major aspects of human life. Hence it becomes necessary to identify and detect possible threats to the security and integrity of the internet and of the services provided to its users. The information transmitted in packets across transmission lines can be intercepted and modified by attackers, so appropriate measures must be taken to prevent these attacks. Data mining techniques and tools can be used to take such measures, which makes data mining an integral part of security.
2. CYBERATTACKS AND THEIR TYPES
Cyberattacks are attacks of any type that may threaten the integrity of systems or compromise their security. Cyberattacks can lead to the loss or theft of data, which may contain sensitive user information, and malware can even result in a complete lockout of a system, forcing the user either to hard reset the system or to give in to the demands of the attacker. Hence it becomes necessary to know the types of attacks that are possible and how data mining can be used for the early detection of an attack. The types of cyberattacks are as follows:
2.1 Software based attacks
1. Malware: Malware can be defined as a wide range of intrusion software, including Trojans, worms and viruses, that is intended to perform malicious activities on the host system. It is a piece of malicious code intended to exploit the vulnerabilities in the host system.
2. Adware: Adware are programs that are activated when you visit a particular webpage or use an application, and that are intended to redirect you to a malicious website or to collect marketing data about the user in order to provide personalized advertisements.
3. Spyware: Spyware is malicious code within a piece of software, or a piece of software itself, that is intended to collect sensitive user information, which may include personal details as well as bank and credit card details. This information is sent to the attacker without the user's knowledge.
4. Ransomware: Ransomware is a type of malware that locks a user out of his computer and sometimes encrypts all of the user's files. The user is then required to pay a certain amount of ransom to regain control of his system from the attacker.
2.2 Remote Attacks
1. Denial of Service: A Denial of Service (DoS) attack is a server-side attack in which, due to excessive load or page requests, the server crashes and the end user is unable to access the service. DoS attacks can be executed in multiple ways. One is a vulnerability attack, in which the attacker aims to confuse the target application by exploiting a vulnerability in it with a malicious packet. The second is a flooding attack, which either floods the server with excessive requests so as to exhaust its resources or causes congestion in the network bandwidth. [12]
2. Botnet: A botnet is a series of computers that have been configured by the attacker, without their owners' knowledge, to forward spam or unwanted data packets as directed by the attacker. The infected computers are 'bots' or 'robots' over the 'net' or 'network'. The user of an infected computer is unaware of the infection. Botnets are one of the major factors in widespread attacks across the world: a series of botnets spread over a large area can be used for a Distributed DoS attack, resulting in spam and page requests from different locations around the world.
3. User to Root (U2R): User to root attacks involve gaining root access or administrative privileges on a remote host computer without the user's knowledge, which can then be used to demand a ransom or other benefits to unlock the system or return control to the user.
4. Remote to Local (R2L): An attack that allows the attacker to gain remote access to the local client machine, without having an account on it, by taking advantage of bugs on the client machine is called a remote to local attack.
3. CLASSIFICATION ALGORITHMS
Classification of data is the process of grouping the data into predefined classes or groups. The most common and obvious parameters are used to classify the data in the data sets. Each class symbolizes a unique value of the attribute used to classify the data. Some of the data mining algorithms used for classification are Bayesian Classification (Bayesian Network), Support Vector Machine, Artificial Neural Networks and Decision Trees.
1. Decision Tree (DT)
The Decision Tree algorithm makes use of a graph-like structure that has one root node and multiple leaf nodes. The root node is a node that has no incoming edges; it is the starting point of the tree structure. An internal node has incoming as well as outgoing edges and denotes a test condition that leads toward the resultant outcome. The leaf nodes denote the final outcome of the tree. The learning and classification steps of a decision tree are simple and fast.
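As an illustration of the root, internal and leaf nodes described above, the following minimal sketch walks a record through a small hand-built tree. The tree itself, the feature names (conn_per_sec, syn_error_rate) and the thresholds are hypothetical and chosen only for illustration; they are not learned from any dataset and are not taken from the surveyed papers.

```python
# Hypothetical hand-built tree: features and thresholds are illustrative only.
TREE = {
    "feature": "conn_per_sec",       # root node: no incoming edge
    "threshold": 100,
    "left": {"label": "normal"},     # leaf node: final outcome
    "right": {                       # internal node: a further test condition
        "feature": "syn_error_rate",
        "threshold": 0.5,
        "left": {"label": "normal"},
        "right": {"label": "dos"},
    },
}


def classify(node, record):
    """Walk from the root to a leaf, applying one test per internal node."""
    while "label" not in node:
        go_left = record[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


# A record with a high connection rate and a high SYN error rate ends in the "dos" leaf.
result = classify(TREE, {"conn_per_sec": 500, "syn_error_rate": 0.9})
```

Classification only follows one path from the root to a leaf, which is why the classification step is fast.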
2. Bayesian Networks (BN)
A Bayesian Network is based upon Bayes' theorem of statistical classification. It uses probabilities to check the likelihood that a particular data item belongs to a predefined class. A Bayesian Network allows class-conditional dependencies to be defined between subsets of variables and provides a graphical model of causal relationships on which a learning model can be designed. [1]
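The probability of class membership mentioned above follows from Bayes' theorem. The sketch below applies the theorem to a single piece of evidence; it is not a full Bayesian network, and the prior and likelihood values are assumed numbers used only for illustration.

```python
def posterior(priors, likelihoods):
    """Bayes' theorem: P(class | evidence) for every class.

    priors[c]      = P(c)
    likelihoods[c] = P(evidence | c)
    """
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    total = sum(joint.values())  # P(evidence)
    return {c: joint[c] / total for c in joint}


# Assumed values: 5% of connections are attacks; the observed evidence
# appears in 60% of attacks and in 1% of normal connections.
priors = {"normal": 0.95, "attack": 0.05}
likelihoods = {"normal": 0.01, "attack": 0.60}
post = posterior(priors, likelihoods)
```

With these assumed numbers the posterior probability of the attack class is 0.03 / 0.0395, roughly 0.76, even though the prior was only 0.05.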
3. Support Vector Machine (SVM)
A Support Vector Machine uses the concept of a decision plane to define boundaries that classify data. For example, in the accompanying diagram a line separates the data set into a red part and a green part; the diagram shows how the data is classified based upon its attributes and how a partition is constructed to separate the classes. SVM performs classification primarily by constructing hyperplanes in a multidimensional space, and thus it separates and classifies data. [4]
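The role of the separating hyperplane can be sketched as follows. The weight vector w and offset b are hypothetical values fixed by hand, not the result of SVM training, and the "red" and "green" labels simply reuse the two classes from the example above.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b; its sign tells on which side of the hyperplane x lies."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Assign a class according to the side of the hyperplane."""
    return "red" if decision_value(w, b, x) >= 0 else "green"


def distance_to_hyperplane(w, b, x):
    """Geometric distance of x from the hyperplane w.x + b = 0."""
    return abs(decision_value(w, b, x)) / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical hyperplane: in two dimensions this is the line x1 = x2.
w, b = [1.0, -1.0], 0.0
label_a = classify(w, b, [3.0, 1.0])
label_b = classify(w, b, [1.0, 3.0])
margin_a = distance_to_hyperplane(w, b, [3.0, 1.0])
```

Training an SVM amounts to choosing w and b so that the distance from the hyperplane to the nearest points of each class is as large as possible; that optimization step is omitted here.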
4. Artificial Neural Networks (ANN)
Artificial Neural Networks simulate the learning process by using neurons to extract patterns or detect trends from complicated or imprecise data. They can be used to represent complex relationships and to find patterns in those relationships. They are classified as feed-forward networks, feedback networks and self-organizing networks. Each edge between two neurons has a variable quantity called a weight, and the interconnections among the neurons are described by these weights in the form of a weight matrix. The change of weight due to previous learning of the network is studied as a part of artificial learning. Through repeated iterations, the network calculates and improves upon the previously calculated values of the weights. The artificial neural network compares the obtained output with the expected output. The resulting error is propagated back to the preceding layer, and the weights of the interconnections are modified to obtain new weights. This process of error propagation continues until the difference between the expected and the obtained output is minimal. [3][9]
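The iterative weight update described above can be sketched with a single sigmoid neuron trained by the delta rule, which is the one-layer special case of error propagation. The training data (a logical OR), the learning rate and the number of epochs are assumptions made only for this illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, inputs):
    """Obtained output of the neuron for one input vector."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_neuron(samples, epochs=5000, lr=0.5):
    """Repeatedly compare obtained and expected output and adjust the weights."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, expected in samples:
            obtained = predict(weights, bias, inputs)
            error = expected - obtained
            # Gradient of the squared error through the sigmoid activation.
            grad = error * obtained * (1.0 - obtained)
            weights = [w + lr * grad * x for w, x in zip(weights, inputs)]
            bias += lr * grad
    return weights, bias


# Hypothetical training set: the logical OR of two binary inputs.
OR_SAMPLES = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train_neuron(OR_SAMPLES)
```

In a multi-layer network the same error term is passed backwards layer by layer, so that every weight matrix receives its own correction.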
5. Comparison of Classification Algorithms
Each algorithm discussed above has its advantages and disadvantages. Some of the factors taken into consideration while comparing them are adaptiveness, speed and the ability to handle large, complex data. Table 1 compares the above data mining classification algorithms. [12-14]
Table 1. Comparison of Classification algorithms
Classification Algorithm | Advantages | Disadvantages
Decision Trees | Easy to implement; can handle categorical, text and numeric data | Difficult to create trees for complex problems; not adaptive (a change in input is not easily reflected in the output); sensitive to irrelevant attributes or features
Bayesian Network | High accuracy and speed in processing of large data sets; simple to implement; not affected by irrelevant attributes | Inability to handle missing data links, which may lead to bad output
SVM | Ability to be used dynamically; does not exaggerate minor fluctuations | Long training and testing time along with excessive memory usage
ANN | Ability to handle complex, continuous relations; adaptive (a change in input is reflected in the output) | Complex, and training time is lengthy
KNN | Easy to implement and adaptive | Sensitive to noise
4. CLUSTERING ALGORITHMS
The data classified into different categories needs to be clustered based upon similarities in the attribute values of each data item. Clustering is the task of finding groups of objects such that the objects in a group are similar to one another and unrelated to the objects in other groups.
1. Fuzzy Clustering
In this algorithm, the clustering of data items is based on the probabilities with which the attribute values belong to a particular cluster. Data elements can belong to more than one cluster, and associated with each element is a set of membership levels. Thus fuzzy clustering associates a membership function with each member. Points that lie on the edge of a cluster may belong less to that cluster and more to another cluster. [5]
2. K-means
The K-means clustering algorithm is an unsupervised learning algorithm that begins with the selection of k random centroids. Euclidean distance, cosine similarity and correlation are some of the measures used to map data items from the data sets into different categories based upon the centroids. As new data items are added to the data sets, the centroid is recalculated based upon the value of the new data item. [5]
3. Genetic Clustering Algorithm (GCA)
The genetic clustering algorithm is based upon the procedures of natural genetics and evolution. It operates on a search space in which each parameter is encoded in the form of a string; these strings are collectively called a population. The process begins with the initialization of the population, for which a randomly selected group of strings is chosen. Following this, the fitness associated with each string is computed: an objective and fitness function associated with each string represents its degree of goodness. The clustering metric used is usually the Euclidean distance of each point in a cluster from the center of that cluster. If the termination condition is met, the output is obtained; otherwise the process continues until a terminating condition is reached. [6]
4. Swarm Intelligence
The interactive behavior among a colony or swarm of data elements forms the basis of the Swarm Intelligence clustering technique. The Ant Colony optimization algorithm begins by dumping the initial data randomly. New incoming packets are then classified by comparing the new data with the preexisting data classes. This algorithm is based upon a similar behavior in ants. Ants initially dump the corpses of other ants at random. As the corpses continue to pile up, they are placed in the existing groups based upon their similarity to the corpses already there. A similar principle can be applied to the clustering of data packets over a network. The packets can be clustered based upon their normal daily behavior, and any unusual packet that arrives can be dumped or flagged into a separate cluster. The overall accuracy of this algorithm depends on the number of packets monitored: the greater the number of deviations in packet types, the greater the number of clusters, and hence the better the clustering and the detection of malicious data packets.
Table 2. Comparison of clustering algorithms
Algorithm | Advantages | Disadvantages
Fuzzy clustering | Better results in fewer iterations | Not suitable for very large datasets
Genetic algorithm | Easy to implement, adaptable | Takes longer time for rule execution
K-means | Simple to implement and adaptable | Affected by noise
Swarm intelligence | Resistant to noise and outliers, adaptable | Difficult to understand
5. RELEVANT WORK AND RESULTS
1. Mokhrane and Rachid [7] have used a variation of the Ant Miner algorithm that combines inputs from the CAC algorithm and association rules. CAC stands for Communicating Ant for Clustering; the algorithm has fewer parameters and functions similarly to the Swarm Intelligence clustering technique. They applied the CAC algorithm to the KDD dataset, whose features were initially selected to differentiate normal behavior from attacks. The Apriori algorithm was then used to generate association rules, first by searching for sets that show greater reliability than the set threshold. This is followed by the extraction of rules whose precision is appropriate for the user requirements. Their results are shown in Table 3 and Table 4.
Table 3. Attack detection success rate using CAC [7]
Attack | Rate | False Negative
UDPstorm | 50% | 50%
Smurf | 78% | 22%
DoS Detection Rate | 96.21% | 3.79%
DoS Detection rate of new attack | 99.33% | 0.67%
R2L Detection Rate | 66.14% | 33.86%
Rate Detection rate of new attack | 29.55% | 70.45%
Table 4. Attack detection success rate using CAC [7]
Attacks | Method: CAC
Known Attacks | 96.6%
Unknown attacks overall | 99.58%
U2R (unknown attack) | 94.05%
Probing (unknown attack) | 89.39%
DoS (unknown attack) | 99.6%
R2L (unknown attack) | 64.3%
2. Jongsuebsuk et al. [10] have used a fuzzy genetic algorithm to detect unknown attacks. Their experimental results indicate that the fuzzy genetic algorithm has the highest detection rate. Genetic algorithms, like the genes in biological beings, are constantly evolving. They have a strong ability to learn and are thus able to detect unknown attacks. The genetic algorithm was used to make the fuzzy algorithm learn about new attacks by itself. The results are summarized in Table 5.
Table 5. DoS Detection accuracy [10]
Attacks | Decision tree | Fuzzy GA
UDP flood | 26.7 | 100
HTTP flood | 85.5 | 100
Smurf | 100 | 98.5
Average | 70.73 | 99.5
3. Wenying et al. [8] have used CSVACN (Combining support vector with Ant Colony networks), a hybrid of an active learning support vector machine algorithm and the Clustering for Self-Organized Ant Colony Network (CSOACN) algorithm. SVM is used to classify data by constructing a hyperplane. CSOACN can use the different network connection records as its objects, each of which belongs either to the normal class or to an abnormal class. The CSOACN algorithm is used to generate models from the SVM training set for normal data as well as for each class of abnormal data. The results of this technique are shown in Table 6.
Table 6. IDS Detection Accuracy [8]
Technique | Detection rate | False Negative
SVM | 66.702 | 21.900
CSOACN | 80.100 | 0.360
4. Agravat et al. [11] have used a modified ant miner algorithm (MACO), which is reported to have a higher accuracy in detecting unknown attacks, including DOS, PROBE, R2L and U2R attacks. The algorithm starts with rule construction, where each rule is of the form IF <condition1 AND condition2 … THEN class>. The results are shown in Table 7.
Table 7. Attack Detection Accuracy [11]
Attack | MACO | SVM
DOS | 98.83 | 97.20
Probe | 97.84 | 98.50
R2L | 93.53 | 66.01
U2R | 99.04 | 98.99
5. Reda M. Elbasiony et al. [15] have used a hybrid framework that combines random forests and the K-means clustering algorithm for network intrusion detection. The hybrid approach is divided into two phases, the online phase and the offline phase. During the online phase, i.e. misuse detection, if an intrusion is detected, the attack features are sent to the random attack selector component of anomaly detection. [10] The offline phase, on the other hand, takes the obtained feature pattern and compares it with the values from the training dataset. The proposed hybrid framework gives a 98.3% detection rate on the 1% KDD dataset, compared with individual values of 92.73% and 95% in anomaly detection rate.
6. Lin et al. [18] have used anomaly detection and the classification algorithms Naïve Bayes, decision tree (J48) and Bayes Network, and compared their results in a study of the detection of P2P botnets. The Trojan Peacomm virus was used as the study object. Using internet flow traffic in a controlled environment, they generated a P2P botnet. Considering the packet flow characteristics, they defined important attributes for data analysis. The J48 algorithm shows the highest accuracy in the detection of botnets, with 98% accuracy compared to 89% for Naïve Bayes and 87% for Bayes Network.
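Several of the systems surveyed above rely on a clustering step, for example the K-means phase of the hybrid framework in [15]. The basic loop of the K-means algorithm described in the clustering section (assign each data item to its nearest centroid by Euclidean distance, then recalculate the centroids) can be sketched as follows. This is plain, unweighted K-means, not the weighted variant used in [15]; the sample points are hypothetical, and the initial centroids are passed in explicitly rather than chosen at random so that the run is reproducible.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Basic K-means: assign points to the nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Index of the centroid with the smallest Euclidean distance to p.
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = mean of the cluster; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical two-dimensional data: one group near (1, 1), one near (8, 8).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, [(1, 1), (8, 8)])
```

In an intrusion detection setting each point would be a feature vector describing a network connection, and connections that fall far from every centroid of normal traffic would be candidates for closer inspection.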
6. CONCLUSION
Of all the data mining algorithms compared, classification using the ant colony algorithm is the most efficient and accurate for detecting cyberattacks and intrusions. Based upon the research surveyed above, the overall accuracy of the combination of the ant miner algorithm and association rule generation is higher than that of the other algorithms as well as of other variations of the ant miner algorithm. The ant miner algorithm guarantees convergence to a solution. Because it is adaptive, it automatically adjusts its results to changing input, and it allows parallelism, thereby speeding up the process and providing faster feedback to train the algorithm. It suffers, however, from uncertainty in the time taken to converge to a solution.
In this paper we reviewed the main data mining techniques and some of their popular algorithms, and compared the advantages and disadvantages of each presented algorithm. Through a comparison based on key criteria we showed that learning algorithms are adaptable and good at recognizing unknown patterns but have long processing times, while linear algorithms are faster but not suited to unknown attacks. We further discussed different types of cyberattacks and the application of data mining techniques to cyberattack detection.
7. REFERENCES
[1] N. Fenton and M. Neil, Risk Assessment and Decision Analysis with Bayesian Networks. CRC Press, 2012.
[2] L. Rokach and O. Maimon, "Decision trees," in, O. Maimon and L. Rokach, Eds. Springer US, 2005, pp. 165-192.
[3] P. Gaur, "Neural Networks in Data Mining," International Journal of Electronics and Computer Science Engineering, 1449, ISSN 2277-1956.
[4] J. Dukart, "Support Vector Machine Classification Basic Principles and Application," pp. 19-19, 2012.
[5] T. Velmurugan, "Performance based analysis between k-Means and Fuzzy C-Means clustering algorithms for connection oriented telecommunication data," 2014.
[6] U. Maulik and S. Bandyopadhyay, "Genetic algorithm-based clustering technique," Pattern Recognition, vol. 33, pp. 1455-1465, 2000.
[7] "CAC-UA: A Communicating Ant for Clustering to detect Unknown Attacks," Science and Information Conference 2014, August 27-29, 2014, London, UK.
[8] "Mining network data for intrusion detection through combining SVMs with ant colony networks," Future Generation Computer Systems, vol. 37, pp. 127-140, 2014.
[9] "A Review of Cyber Attack Classification Technique Based on Data Mining and Neural Network Approach," International Journal of Computer Trends and Technology (IJCTT), vol. 7, no. 2, Jan. 2014, ISSN 2231-2803.
[10] P. Jongsuebsuk, N. Wattanapongsakorn and C. Charnsripinyo, "Network intrusion detection with fuzzy genetic algorithm for unknown attacks," in Information Networking (ICOIN), 2013 International Conference On, 2013.
[11] D. Agravat, U. Vaishnav and P. Swadas, "Modified ant miner for intrusion detection," in Machine Learning and Computing (ICMLC), 2010 Second International Conference On, 2010.
[12] "A Survey of Defense Mechanisms Against Distributed Denial of Service (DDoS) Flooding Attacks," IEEE Communications Surveys & Tutorials, vol. 15, no. 4, Fourth Quarter 2013.
[13] S. Garg and A. K. Sharma, "Comparative Analysis of Data Mining Techniques on Educational Dataset," International Journal of Computer Applications, vol. 74, 2013.
[14] "A Survey and Comparative Analysis of Data Mining Techniques for Network Intrusion Detection Systems," International Journal of Soft Computing and Engineering (IJSCE), ISSN 2231-2307, vol. 2, no. 1, March 2012.
[15] R. M. Elbasiony, E. A. Sallam, T. E. Eltobely and M. M. Fahmy, "A hybrid network intrusion detection framework based on random forests and weighted k-means," Ain Shams Engineering Journal.
[16] M.-S. Yang, "A Survey of Fuzzy Clustering," Department of Mathematics, Chung Yuan Christian University, Chungli, Taiwan 32023, R.O.C.
[17] H. D. Nguyen and Q. Cheng, "An Efficient Feature Selection Method For Distributed Cyber Attack Detection and Classification," IEEE, 2013, pp. 1-6.
[18] S. Lin, P. Chen and C. Chang, "A novel method of mining network flow to detect P2P botnets," Peer-to-Peer Networking and Applications, pp. 1-10, 2012.