Survey on different techniques of Anomaly based
Intrusion Detection
Parth Shah
Department of Computer Engg
Dwarkadas J. Sanghvi COE
Mumbai, India
shahp13594@gmail.com

Harsh Shah
Department of Computer Engg
Dwarkadas J. Sanghvi COE
Mumbai, India
harshshah30032@gmail.com

Sindhu Nair
Department of Computer Engg
Dwarkadas J. Sanghvi COE
Mumbai, India
sindhu.nair@djsce.ac.in
ABSTRACT:
Anomaly detection is a well-studied problem across
different application domains. Widely used examples
include identifying fraud, changes in customer behaviour,
and manufacturing flaws. Many anomaly detection methods
are specific to particular applications or research
domains. The methods presented here are based on data
mining and machine learning. We present an overview of an
anomaly detection system along with its key components,
and discuss different anomaly detection techniques.
Keywords:
Data Mining, Anomaly Detection, Anomaly Based
Intrusion Detection, Machine Learning
1. INTRODUCTION:
An enormous amount of data is stored in files, databases,
and other repositories, so it is increasingly important to
develop powerful means for the explanation and
interpretation of such data. Data mining is a process
that enables the discovery of knowledge in databases and
is hence also known as Knowledge Discovery in Databases
(KDD). It can be defined as the nontrivial extraction of
hidden, unknown and potentially useful information from
data in databases [5]. The terms data mining and KDD are
often used interchangeably, although data mining is
strictly one step of the KDD process. The major steps of
the KDD process are data integration, data cleaning, data
selection, data transformation, data mining, pattern
evaluation and knowledge presentation. Data mining
primarily involves anomaly detection, association rule
learning, classification, regression, summarization and
clustering. It is concerned with how we can summarize and
generalize the data, for instance to determine which
observations do not belong and which are interesting.
2. ANOMALIES
Anomalies are aberrant data that do not comply with the
normal data [1][5]. Fig. 1 illustrates anomalies in a
simple 2-dimensional data set. N1 and N2 are the two
normal regions, since most observations lie in these two
regions. Points that are sufficiently far away from the
regions, e.g., points d1 and d2 and the points in region
d3, are the anomalies.
Fig. 1. A simple example of anomalies in a two-dimensional data
set [1].
Anomaly detection is also known as outlier detection. It
is the detection of items, events or observations that do
not conform to an expected pattern [5]. Outliers,
novelties, noise, aberrations and exceptions in data are
also considered examples of anomalies. Applications
include detecting aberrant images in surveillance
footage, identifying abnormal organic compounds, data
cleaning, and identifying flaws in manufactured
materials. The basic steps remain the same across these
applications:
1) To identify normality by calculating some "signature"
of the data.
2) To measure how far each observation deviates from
that signature.
3) To set a threshold which, if exceeded by an
observation's metric measurement, marks the observation
as anomalous.
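The three steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the "signature" is taken to be the mean and standard deviation of known-normal values, the metric is a z-score, and the threshold of 3.0 and the sample readings are hypothetical choices that do not come from the surveyed papers.

```python
from statistics import mean, stdev

def fit_signature(normal_values):
    """Step 1: summarise normal data by its mean and standard deviation."""
    return mean(normal_values), stdev(normal_values)

def abnormality(value, signature):
    """Step 2: measure how far an observation lies from the signature (z-score)."""
    mu, sigma = signature
    return abs(value - mu) / sigma

def is_anomaly(value, signature, threshold=3.0):
    """Step 3: flag the observation if its metric exceeds the threshold."""
    return abnormality(value, signature) > threshold

# Hypothetical sensor readings assumed to represent normal behaviour.
normal_readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]
signature = fit_signature(normal_readings)

for reading in (10.1, 14.0):
    print(reading, is_anomaly(reading, signature))
```

Other signatures (for example a cluster model or a density estimate) fit the same three-step shape; only the functions for steps 1 and 2 change.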
3. VARIOUS CHALLENGES IN
ANOMALY DETECTION:
● The straightforward approach is to define a region
representing normal behaviour and to declare any
observation that falls outside this region an
anomaly.
● To define a normal region which encompasses
every possible normal behaviour is very
difficult [1].
● In many domains normal behaviour keeps
evolving, and a current notion of normal
behaviour might not be sufficiently
representative in the future.
● The exact notion of an anomaly is different for
various application domains [3].
● Availability of labelled data for training and
validation of the models used is usually a
major issue.
● It is difficult to distinguish and remove noise
in the data, because noise tends to be similar to
the actual anomalies.
Fig.2. The key components of anomaly detection
technique [2].
4. DIFFERENT ASPECTS OF
ANOMALY DETECTION
This section identifies the different aspects of anomaly
detection. Factors such as the structure and nature of
the input data and the availability or unavailability of
labels determine how a specific anomaly detection problem
is formulated.
4.1. Structure and Nature of the Input Data
The input is a collection of data instances, also
referred to as objects, records, points, vectors,
patterns, events, cases, samples, observations, or
entities. Each data instance is described by a set of
attributes, which may be binary, categorical or
continuous. A data instance that consists of only one
attribute is known as univariate, and a data instance
that consists of multiple attributes is known as
multivariate. In multivariate data instances the
attributes may all be of the same type or may be a
mixture of different data types. The nature and structure
of the attributes determine the applicability of anomaly
detection techniques. For example, for statistical
techniques different statistical models have to be used
for continuous and categorical data [3]. Similarly, for
nearest neighbour based techniques, the nature of the
attributes determines the distance measure to be
used [3]. The pairwise distance between data instances
might be provided in the form of a similarity matrix
instead of the actual data. Input data can also be
categorized based on the relationship present among data
instances. Most existing anomaly detection techniques
deal with record data, in which no relationship is
assumed among the data instances; in general, however,
data instances can be related to each other. Sequence
data, multidimensional data, and graph data are some
examples.
4.2 Types of Anomaly
Anomalies can be classified into the following categories:
1. Point Anomalies: When an individual data instance
can be considered abnormal with respect to the rest of
the data, it is known as a point anomaly [2][5]. For
example, if an individual's credit card spending is
normally in the range of $50000 to $100000, a payment of
$500000 is by itself a point anomaly and therefore worth
investigating.
2. Contextual Anomalies: A contextual anomaly is a data
instance that is anomalous in a specific context but not
otherwise [2]. Both point and collective anomalies can be
contextual. For example, if a person's usual credit card
spending is $100 per week, spending $1200 during Diwali
week is considered normal, whereas the same $1200 during
a week in May is not [5]. Each data instance is defined
using the following sets of attributes:
● Contextual attributes: The contextual attributes
are used to determine the context or
neighbourhood of an instance [2][5]. For
instance, in data sets pertaining to space the
coordinates of a location are the contextual
attributes, and in sequence data the contextual
attribute determines the position of an
instance in the entire sequence [5].
● Behavioural attributes: The behavioural
attributes define the non-contextual
characteristics of the data [2][5]. The
behavioural attributes are used to find
anomalous behaviour within a given context [5].
A data instance might be a contextual anomaly in a given
context, but an identical data instance could be
considered normal in a different context [5].
3. Collective Anomalies: A collective anomaly is a
collection of related data instances that is anomalous
with respect to the entire data set [2][5]. In a
collective anomaly the individual data instances may not
be anomalies by themselves, but their occurrence together
as a collection is anomalous.
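The credit card example for contextual anomalies can be made concrete with a short Python sketch. It assumes, purely for illustration, that the contextual attribute is the kind of week ("ordinary" or "festival"), that the behavioural attribute is the amount spent, and that an amount is anomalous when its z-score within its own context exceeds 3.0. The spending history below is invented and the function names are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical weekly spending history, grouped by contextual attribute.
history = {
    "ordinary": [90, 110, 100, 95, 105],
    "festival": [1100, 1300, 1200, 1250, 1150],
}

def zscore(value, reference):
    """Deviation of a behavioural value from a reference sample."""
    return abs(value - mean(reference)) / stdev(reference)

def is_contextual_anomaly(amount, context, threshold=3.0):
    """Compare the amount only against past amounts from the same context."""
    return zscore(amount, history[context]) > threshold

# The same behavioural value is judged differently in different contexts.
print(is_contextual_anomaly(1200, "festival"))
print(is_contextual_anomaly(1200, "ordinary"))
```

A point anomaly detector would pool all weeks into one reference sample; the contextual variant differs only in that the reference sample is selected by the contextual attribute.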
4.3 Data Labels
The labels associated with a data instance denote whether
it is normal or anomalous [1]. Labelling is usually
carried out by human experts. Typically, obtaining a
labelled set of anomalous data that covers all possible
types of anomalous behaviour is more difficult than
getting labels for normal behaviour. Moreover, anomalous
behaviour is often dynamic in nature. Depending on the
labels available, anomaly detection is classified into
three categories:
1. Supervised anomaly detection: In this type, the data
set is labelled as "normal" and "abnormal", and a
classifier is trained on it [1][5]. An important issue
with supervised anomaly detection is that the anomalous
data instances are very few compared to the normal data
instances in the training data.
2. Semi-supervised anomaly detection: Techniques that
operate in a semi-supervised mode assume that the
training data has labelled instances only for the normal
class. Since they do not require labels for the anomaly
class, they are more widely applicable than supervised
techniques.
3. Unsupervised anomaly detection: The task in this type
of anomaly detection is to find which parts of a
collection or document are most anomalous with respect to
the rest of the collection. For example, if we had a
collection of many news stories with a fictional story
inserted, we would wish to identify the fictional story
as anomalous, because its language is abnormal with
respect to the rest of the data in the collection.
4.4 Anomaly detection output:
An important aspect of any anomaly detection technique is
the manner in which anomalies are reported. Typically,
the output is one of the following two types:
1) Scores: Scoring techniques assign an anomaly score
to each instance in the test data, depending on
the degree to which that instance is considered
an anomaly [1]. The analyst may then use a
threshold on the scores to select anomalies.
2) Labels: Labelling techniques assign a label, normal
or anomalous, to each test instance [1]. Techniques
that provide binary labels to the test instances do
not directly allow analysts to use a specific
threshold to select the most relevant anomalies,
although this can be controlled indirectly through
parameter choices within each technique.
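The difference between the two output types can be illustrated with a short Python sketch. The scores below are invented and the helper names are hypothetical. The point is only that scores let the analyst choose a cut-off or a top-k list after the fact, whereas labels fix that choice inside the technique.

```python
def top_k(scores, k):
    """Return the ids of the k instances with the highest anomaly score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def to_labels(scores, threshold):
    """Collapse scores into binary labels using an analyst-chosen threshold."""
    return {
        instance: "anomalous" if score > threshold else "normal"
        for instance, score in scores.items()
    }

# Hypothetical anomaly scores produced by some scoring technique.
scores = {"a": 0.10, "b": 0.92, "c": 0.35, "d": 0.78}

print(top_k(scores, 2))
print(to_labels(scores, threshold=0.5))
```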
5. CLASSIFICATION TECHNIQUES
FOR ANOMALY BASED INTRUSION
DETECTION SYSTEM
Classification is widely used for anomaly detection. The
classification process is divided into two steps. In the
first step, a training set made up of data instances and
their associated class labels is used to build a
classifier. In the second step, the classifier is used to
assign class labels to new data instances. Classification
techniques used to classify intrusion detection data
include Bayesian classification, Decision Tree, k-Nearest
Neighbour, Support Vector Machine, Neural Network and
Rule Induction methods.
A. Naïve Bayes
Naïve Bayes is a probabilistic classifier that is mainly
used to predict the probability of class membership [4].
It assumes class-conditional independence, i.e. that
within a class the probability of one attribute does not
affect the probability of another. The algorithm first
computes the prior probability and then the
class-conditional probability for the given intrusion
data set [3]. The next step is to find the class with the
highest probability, after which the detection rate and
false positive rate are calculated [3]. Figure 3 shows
the framework of a naïve Bayesian model for intrusion
detection.
Fig 3: The framework of intrusion detection model [4]
To handle missing attribute values, the naïve Bayes
classifier omits the corresponding probability when
calculating the likelihood of membership in each class.
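The procedure described above (priors, class-conditional probabilities, highest class probability, missing values omitted) can be sketched as follows. This is a minimal illustration, not the implementation used in [3] or [4]: it assumes categorical attributes, adds Laplace smoothing so that unseen values do not zero out a class (an assumption not stated in the source), and uses invented connection records with hypothetical attribute names.

```python
from collections import Counter, defaultdict

def train(records, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    priors = Counter(labels)
    cond = defaultdict(Counter)   # (class, attribute) -> value counts
    values = defaultdict(set)     # attribute -> distinct values seen
    for record, label in zip(records, labels):
        for attr, val in record.items():
            cond[(label, attr)][val] += 1
            values[attr].add(val)
    return priors, cond, values

def predict(model, record):
    """Return the class with the highest (unnormalised) posterior probability."""
    priors, cond, values = model
    total = sum(priors.values())
    best_label, best_p = None, -1.0
    for label, n_label in priors.items():
        p = n_label / total                      # prior probability
        for attr, val in record.items():
            if val is None:                      # missing value: omit its factor
                continue
            # Class-conditional probability with Laplace smoothing (assumption).
            p *= (cond[(label, attr)][val] + 1) / (n_label + len(values[attr]))
        if p > best_p:
            best_label, best_p = label, p
    return best_label

# Hypothetical connection records; attribute names are illustrative only.
records = [
    {"protocol": "tcp", "flag": "SF"},
    {"protocol": "tcp", "flag": "SF"},
    {"protocol": "udp", "flag": "SF"},
    {"protocol": "tcp", "flag": "S0"},
    {"protocol": "icmp", "flag": "S0"},
]
labels = ["normal", "normal", "normal", "attack", "attack"]
model = train(records, labels)

print(predict(model, {"protocol": "tcp", "flag": "S0"}))
print(predict(model, {"protocol": None, "flag": "SF"}))
```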
B. Decision Tree:
The main advantage of decision trees is that they can
learn a model from the training data and can predict
future data as one of the attack types [3]. Because of
this they can be used for misuse intrusion detection. Two
prerequisites for the analysis are data collection and
tool acquisition and selection [3]. The figure below
shows the process to implement a decision tree for
intrusion detection.
[Figure: decision tree intrusion detection process.
Labels: Network Traffic, Pre-processing, Alerts,
Detector, Pattern Building, Data Set]
To classify a new instance, start at the topmost (root)
node and follow the branch indicated by the outcome of
each test until a bottommost (leaf) node is reached. The
label of the leaf node is the result of the
classification. Decision trees are useful in real-time
intrusion detection because of their high performance.
Owing to their generalization accuracy, they are also
able to detect new intrusions.
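The root-to-leaf classification walk described above can be sketched in Python. The tree below is hand-built for illustration only, not learned from data, and the attribute names and the cut-off of 3 failed logins are hypothetical.

```python
# Each internal node tests one attribute; each leaf carries a class label.
tree = {
    "attribute": "failed_logins",
    "test": lambda v: "high" if v > 3 else "low",
    "branches": {
        "high": {"label": "attack"},
        "low": {
            "attribute": "protocol",
            "test": lambda v: v,
            "branches": {
                "tcp": {"label": "normal"},
                "icmp": {"label": "attack"},
            },
        },
    },
}

def classify(node, instance):
    """Follow the branch chosen by each test until a leaf node is reached."""
    while "label" not in node:
        outcome = node["test"](instance[node["attribute"]])
        node = node["branches"][outcome]
    return node["label"]

print(classify(tree, {"failed_logins": 5, "protocol": "tcp"}))
print(classify(tree, {"failed_logins": 0, "protocol": "tcp"}))
```

Classification cost is bounded by the depth of the tree, which is consistent with the high performance noted above for real-time detection.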
The 1-Nearest Neighbour (1NN) classifier is based on the
following key points. The training samples are taken as
representative points, and the distance between a test
sample and each representative point is computed [3]. The
class label of the nearest representative point is then
assigned to the test sample. k-NN is an extension of 1NN
in which the label of a test sample is determined from
its k nearest neighbours. It is mainly used together with
statistical schemes for intrusion detection.
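A minimal Python sketch of the 1NN and k-NN rules follows. It assumes Euclidean distance and a simple majority vote among the k nearest training samples; the two-dimensional feature vectors are invented for illustration.

```python
from collections import Counter
from math import dist  # Euclidean distance, available from Python 3.8

def knn_predict(train_set, query, k=3):
    """Label a query point by majority vote among its k nearest training samples."""
    nearest = sorted(train_set, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical (feature vector, label) training samples.
train_set = [
    ((1.0, 1.0), "normal"),
    ((1.2, 0.9), "normal"),
    ((0.8, 1.1), "normal"),
    ((5.0, 5.2), "attack"),
    ((5.1, 4.9), "attack"),
]

print(knn_predict(train_set, (1.1, 1.0), k=1))  # 1NN: label of the closest sample
print(knn_predict(train_set, (4.8, 5.0), k=3))  # k-NN: majority of 3 neighbours
```

No model is built in advance: all distance computations happen at query time, so the cost of each prediction grows with the size of the training set.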
C. Support Vector Machine
For classification in an IDS, an SVM model has to be
constructed. The main aim of SVM is to produce a model
that predicts the target value of a data instance using
various kernel functions [3][4]. The three major SVM
kernel functions are the Gaussian kernel, the polynomial
kernel, and the sigmoid kernel. In the classification
phase, the SVM training model is built and SVM functions
are used to generate the classification results [3].
Speed and accuracy are the main reasons for using SVM for
intrusion detection [6]. Training and testing are the two
phases in the implementation of an SVM intrusion
detection system [3]. Whenever a new pattern is detected
during classification, the training patterns are updated
dynamically. SVM provides high accuracy rates.
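For reference, the three kernel functions named above can be written out in plain Python using their common textbook forms. The parameter names and default values (gamma, degree, coef0, alpha) are illustrative assumptions, not settings taken from [3] or [4], and the sketch covers only the kernels, not SVM training.

```python
import math

def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def gaussian_kernel(x, y, gamma=0.5):
    """exp(-gamma * ||x - y||^2)"""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """(x . y + coef0) ^ degree"""
    return (dot(x, y) + coef0) ** degree

def sigmoid_kernel(x, y, alpha=0.1, coef0=0.0):
    """tanh(alpha * x . y + coef0)"""
    return math.tanh(alpha * dot(x, y) + coef0)

x, y = (1.0, 2.0), (3.0, 4.0)
print(gaussian_kernel(x, y), polynomial_kernel(x, y), sigmoid_kernel(x, y))
```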
D. K-Nearest Neighbour
K-NN classification is applied when reliable parametric
estimates of probability densities are unknown or
difficult to determine [3]. In this method objects are
classified based on the closest training examples in the
feature space [3]. It is also known as a lazy learning
technique, because the function is approximated locally
and all computation is deferred until classification. The
figure below shows the method for deciding the nearest
neighbour.
Fig. 3 Majority voting scheme [3]
E. Artificial Neural Networks
Neural networks are used in both anomaly intrusion
detection and misuse intrusion detection [3][6]. In
anomaly IDS, neural networks identify variations from a
user's established behaviour, while in misuse IDS they
are designed to receive data from the network and analyse
it for any occurrence of misuse [3]. In the latter case
the neural network acts as a standalone misuse detection
system: data is received from a network stream and then
analysed for misuse intrusions. Even if the data is
incomplete or distorted, a neural network is capable of
analysing the data from the network.
6. POTENTIAL ISSUES FOR ANOMALY
BASED INTRUSION DETECTION
SYSTEMS
● Feature extraction
● Classifier construction
● Sequential pattern prediction
● Human intervention
● False positive and false negative alarm rates
7. CONCLUSION
In this paper we have presented a generalized definition
of anomaly detection along with its different aspects,
methods and techniques. We have also studied the
different types of intrusion detection systems, with a
brief introduction to each category of anomaly detection
methods. As future work, we suggest investigating these
techniques further and evaluating their efficiency
against existing methods.
8. REFERENCES
[1] Varun Chandola, Arindam Banerjee, and Vipin
Kumar, “Anomaly Detection: A Survey”, ACM
Computing Surveys, Vol. 41, No. 3, Article 15,
July 2009
[2] Vaishali V. Khandagale, Yoginath Kalshetty, “Review
and Discussion on different techniques of Anomaly
Detection Based and Recent Work”, International
Journal of Engineering Research &
Technology (IJERT), Vol. 2, Issue 10, October 2013,
ISSN: 2278-0181
[3] Anju, Pardeep Kumar Mittal, Shalini Aggarwal, “A
Review of Various Classification Techniques Based
Data Mining for Intrusion Detection”, International
Journal of Advanced Research in Computer Science
and Software Engineering, Volume 5, Issue 3, March
2015, ISSN: 2277 128X
[4] Patel Hemant, Bharat Sarkhedi, Hiren Vaghamshi,
“Intrusion Detection in Data Mining With
Classification Algorithm”, International Journal of
Advance Research in Electrical, Electronics and
Instrumentation Engineering, ISSN: 2320-3765, Vol.
2, Issue 7, July 2013
[5] Wong, W., Moore, A., Cooper, G., and Wagner, M. 2003.
Bayesian network anomaly pattern detection for disease
outbreaks. In Proceedings of the 20th International
Conference on Machine Learning. AAAI Press, Menlo
Park, California, 808-815
[6] P. Amudha, S. Karthik, S. Sivakumari,
“Classification Techniques for Intrusion Detection –
An Overview”, International Journal of Computer
Applications (0975-8887), Vol. 76, No. 16, August 2013
[7] S. Neelima, N. Satyanarayana and P. Krishna
Murthy, “Data Mining Techniques in Intrusion
detection”, International Journal of Emerging
Technology and Advanced Engineering, ISSN: 2250-2459,
Vol. 4, Issue 2, Feb 2014, pp. 631-634.
[8] Shyara Taruna R., Mrs. Saroj Hiranwal, “Enhanced
Naïve Bayes Algorithm for Intrusion Detection in
Data Mining”, International Journal of Computer
Science