Performance Assessment of Robust Ensemble Model for Intrusion Detection using Decision Tree Techniques

Reshamlal Pradhan
M.Tech. Scholar (CSE)
MATS University, Raipur (C.G.), INDIA
reshamlalpradhan6602@gmail.com

Deepak Kumar Xaxa
Asst. Professor, Department of Computer Science
MATS University, Raipur (C.G.), INDIA
xaxadeepak@gmail.com

ABSTRACT
Intrusion Detection System (IDS) is one of the major research concerns in network security. It is the process of detecting different security violations by monitoring and analyzing the events occurring in a computer system or in a network. An IDS can be developed using various machine learning techniques such as classification and prediction. As a classifier, an IDS classifies the data as normal or anomalous. In this paper we present a performance assessment of a robust ensemble model [1] for intrusion detection using decision tree techniques. The algorithms and decision tree techniques tested are J48, Random Forest, Stacking, Bagging and Boosting, applied to the NSL-KDD dataset using the WEKA tool. WEKA is an open-source software package that consists of a collection of machine learning algorithms for data mining tasks.
General Terms
Algorithm, Classification, Ensemble technique, Intrusion
Detection, Network security.
Keywords
Bagging, Boosting, Confusion metrics, Intrusion detection
system (IDS), J48, Random Forest, Stacking, WEKA.
1. INTRODUCTION
The security of our computer systems and data is at continual risk. The extensive growth of the Internet and the increasing availability of tools and tricks for intruding into and attacking networks have made intrusion detection a critical component of network administration. An intrusion can be defined as any set of actions that threaten the integrity, confidentiality, or availability of a network resource (such as user accounts, file systems, system kernels, and so on). Intrusion detection systems (IDSs) [1,2,3] are software or hardware systems that automate the process of monitoring the events occurring in a computer system or network and analyzing them for signs of security problems (intrusions). An IDS can be developed using various machine learning techniques such as classification and prediction. Classification is one of the most common applications of data mining, in which similar samples are grouped together in a supervised manner. An IDS is a classifier which classifies the data as normal or attack. A general framework is depicted in Figure 1.
Fig 1: General Framework of IDS
2. DECISION TREE TECHNIQUE
Decision tree [1,5,6,7] is a popular data mining technique because the construction of decision tree classifiers does not require any domain knowledge or parameter setting, and it is therefore appropriate for exploratory knowledge discovery. Decision trees can also handle high-dimensional data.
2.1 J48
J48 [2,4,13] is an open-source Java implementation of the C4.5 algorithm in the WEKA data mining tool. C4.5 is a program that creates a decision tree based on a set of labeled input data. The decision trees generated by C4.5 can be used for classification, and for this reason C4.5 is often referred to as a statistical classifier.

The J48 decision tree classifier follows a simple algorithm. In order to classify a novel item, it first needs to create a decision tree based on the attribute values of the available training data. So, whenever it encounters a set of items (a training set) it finds the attribute that discriminates the instances most clearly. This feature, which tells us the most about the data instances so as to classify them best, is said to have the highest information gain.

Pseudo code
1. Check for base cases.
2. For each attribute a:
   a. Find the normalized information gain from splitting on a.
   b. Let a_best be the attribute with the highest normalized information gain.
3. Create a decision node that splits on a_best.
4. Recurse on the sublists obtained by splitting on a_best, and add those nodes as children of the node.

Now, among the possible values of this feature, if there is any value for which there is no ambiguity, that is, for which the data instances falling within its category have the same value for the target variable, then terminate that branch and assign to it the target value that has been obtained.
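As an illustration only, the following minimal sketch shows how the J48 classifier described above can be invoked through the WEKA Java API used later in Section 5; the ARFF file name is a placeholder, not a file referenced in this paper.

   import weka.core.Instances;
   import weka.core.converters.ConverterUtils.DataSource;
   import weka.classifiers.trees.J48;

   public class J48Example {
       public static void main(String[] args) throws Exception {
           // Load the labeled training data (placeholder file name).
           Instances data = DataSource.read("KDDTrain.arff");
           // The last attribute is assumed to hold the class label (normal/attack).
           data.setClassIndex(data.numAttributes() - 1);

           // Build the C4.5-style decision tree.
           J48 tree = new J48();
           tree.setConfidenceFactor(0.25f); // default pruning confidence
           tree.buildClassifier(data);

           // Print the induced tree.
           System.out.println(tree);
       }
   }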
2.2 Random Forest
Random Forest [7,9,14] is an ensemble classifier. It constructs a series of classification trees which are used to classify a new example. The idea used to create the classifier model is to construct multiple decision trees, each of which uses a subset of attributes randomly selected from the whole original set of attributes.

Random Forest is an effective prediction tool in data mining. It employs the bagging method to produce a randomly sampled set of training data for each of the trees. The Random Forest method also semi-randomly selects splitting features: a random subset of a given size is produced from the space of possible splitting features, and the best splitting feature is deterministically selected from that subset. A pseudo code for random forest construction is given below. To classify a test instance, the Random Forest classifies the instance by simply combining all results from each of the trees in the forest. The method used to combine the results can be as simple as predicting the class obtained from the highest number of trees.

Pseudo code
To generate c classifiers:
for i = 1 to c do
   Randomly sample the training data D with replacement to produce Di
   Create a root node, Ni, containing Di
   Call BuildTree(Ni)
end for

BuildTree(N):
if N contains instances of only one class then
   return
else
   Randomly select x% of the possible splitting features in N
   Select the feature F with the highest information gain to split on
   Create f child nodes of N, N1, ..., Nf, where F has f possible values (F1, ..., Ff)
   for i = 1 to f do
      Set the contents of Ni to Di, where Di is all instances in N that match Fi
      Call BuildTree(Ni)
   end for
end if
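A similar hedged sketch for WEKA's RandomForest implementation is given below; the file name is again a placeholder and the default number of trees is used.

   import weka.core.Instances;
   import weka.core.converters.ConverterUtils.DataSource;
   import weka.classifiers.trees.RandomForest;

   public class RandomForestExample {
       public static void main(String[] args) throws Exception {
           Instances data = DataSource.read("KDDTrain.arff"); // placeholder file name
           data.setClassIndex(data.numAttributes() - 1);

           // Each tree is grown on a bootstrap sample with a random feature subset,
           // as in the pseudo code above.
           RandomForest forest = new RandomForest();
           forest.buildClassifier(data);

           // Classify one instance by majority vote over the trees.
           double predicted = forest.classifyInstance(data.instance(0));
           System.out.println("Predicted class: "
                   + data.classAttribute().value((int) predicted));
       }
   }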
3. ENSEMBLE TECHNIQUES
An ensemble model [1] is a combination of two or more models intended to avoid the drawbacks of the individual models and to achieve high accuracy. Bagging, Boosting and Stacking are techniques that use such a combination of models.

3.1 Bagging
The term bagging [1,2,5,10] stands for bootstrap aggregation. Given a set D of d tuples, bagging works as follows. For iteration i (i = 1, 2, ..., k), a training set Di of d tuples is sampled with replacement from the original set of tuples D. Each training set is a bootstrap sample. Because sampling with replacement is used, some of the original tuples of D may not be included in Di, whereas others may occur more than once.

A classifier model, Mi, is learned for each training set Di. To classify an unknown tuple X, each classifier Mi returns its class prediction, which counts as one vote. The bagged classifier, M*, counts the votes and assigns the class with the most votes to X. Bagging can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple. The algorithm is given below.
Algorithm: Bagging. Create an ensemble of models (classifiers or predictors) for a learning scheme where each model gives an equally weighted prediction.

Input:
   D, a set of d training tuples;
   k, the number of models in the ensemble;
   a learning scheme (e.g., decision tree algorithm, back propagation, etc.)
Output: A composite model, M*.
Method:
   (1) for i = 1 to k do // create k models:
   (2)    create bootstrap sample, Di, by sampling D with replacement;
   (3)    use Di to derive a model, Mi;
   (4) end for

To use the composite model on a tuple, X:
   (1) if classification then
   (2)    let each of the k models classify X and return the majority vote;
   (3) if prediction then
   (4)    let each of the k models predict a value for X and return the average predicted value;
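As a sketch under the same assumptions (placeholder file name, illustrative number of iterations), the bagging procedure above corresponds to WEKA's Bagging meta-classifier with J48 as the base learner:

   import weka.core.Instances;
   import weka.core.converters.ConverterUtils.DataSource;
   import weka.classifiers.meta.Bagging;
   import weka.classifiers.trees.J48;

   public class BaggingExample {
       public static void main(String[] args) throws Exception {
           Instances data = DataSource.read("KDDTrain.arff"); // placeholder file name
           data.setClassIndex(data.numAttributes() - 1);

           Bagging bagger = new Bagging();
           bagger.setClassifier(new J48()); // model Mi learned on each bootstrap sample Di
           bagger.setNumIterations(10);     // k, the number of models in the ensemble
           bagger.buildClassifier(data);

           // The prediction is the majority vote of the k bagged trees.
           double predicted = bagger.classifyInstance(data.instance(0));
           System.out.println(data.classAttribute().value((int) predicted));
       }
   }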
3.2 Boosting
In boosting [1,2,5,10], weights are assigned to each training tuple. A series of k classifiers is iteratively learned. After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to "pay more attention" to the training tuples that were misclassified by Mi. The final boosted classifier, M*, combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy. The boosting algorithm can be extended for the prediction of continuous values. AdaBoost is a popular boosting algorithm; its pseudo code is given below.
Algorithm: AdaBoost. A boosting algorithm that creates an ensemble of classifiers, each of which gives a weighted vote.

Input:
   D, a set of d class-labeled training tuples;
   k, the number of rounds (one classifier is generated per round);
   a classification learning scheme.
Output: A composite model.
Method:
   (1) initialize the weight of each tuple in D to 1/d;
   (2) for i = 1 to k do // for each round:
   (3)    sample D with replacement according to the tuple weights to obtain Di;
   (4)    use training set Di to derive a model, Mi;
   (5)    compute error(Mi), the error rate of Mi;
   (6)    if error(Mi) > 0.5 then
   (7)       reinitialize the weights to 1/d;
   (8)       go back to step 3 and try again;
   (9)    end if
   (10)   for each tuple in Di that was correctly classified do
   (11)      multiply the weight of the tuple by error(Mi)/(1 - error(Mi)); // update weights
   (12)   normalize the weight of each tuple;
   (13) end for

To use the composite model to classify tuple X:
   (1) initialize the weight of each class to 0;
   (2) for i = 1 to k do // for each classifier:
   (3)    wi = log((1 - error(Mi))/error(Mi)); // weight of the classifier's vote
   (4)    c = Mi(X); // get class prediction for X from Mi
   (5)    add wi to the weight for class c;
   (6) end for
   (7) return the class with the largest weight;
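A comparable sketch for boosting uses WEKA's AdaBoostM1 meta-classifier, which implements the algorithm above; the number of rounds shown is illustrative.

   import weka.core.Instances;
   import weka.core.converters.ConverterUtils.DataSource;
   import weka.classifiers.meta.AdaBoostM1;
   import weka.classifiers.trees.J48;

   public class BoostingExample {
       public static void main(String[] args) throws Exception {
           Instances data = DataSource.read("KDDTrain.arff"); // placeholder file name
           data.setClassIndex(data.numAttributes() - 1);

           AdaBoostM1 booster = new AdaBoostM1();
           booster.setClassifier(new J48()); // base classifier re-trained on re-weighted tuples
           booster.setNumIterations(10);     // k rounds, one classifier per round
           booster.buildClassifier(data);

           // The final model combines the weighted votes of the k classifiers.
           System.out.println(booster);
       }
   }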
3.3 Stacking
Stacking [2,10] is an abbreviation for Stacked Generalization. Unlike bagging and boosting, it uses different learning algorithms to generate the ensemble of classifiers. The main idea of stacking is to build classifiers from different learners, such as decision trees, instance-based learners, etc. Since each one uses a different knowledge representation and different learning biases, the hypothesis space is explored differently and different classifiers are found.

Once the classifiers have been generated they must be combined. One way to combine their outputs is by voting, the same mechanism used in bagging. However, (unweighted) voting only makes sense if the learning schemes perform comparably well; if, for example, the majority of the classifiers make bad predictions, this leads to a bad final classification, and it is not clear which classifier to trust. To resolve this problem, stacking replaces the voting procedure with the concept of a meta learner.

Stacking tries to learn which classifiers are the reliable ones, using another learning algorithm, the meta learner, to discover how best to combine the output of the base learners. The input to the meta model, also called the level-1 model, is the predictions of the base models, or level-0 models. A level-1 instance has as many attributes as there are level-0 learners, and the attribute values give the predictions of these learners on the corresponding level-0 instance. When the stacked learner is used for classification, an instance is first fed into the level-0 models, and each one guesses a class value. These guesses are fed into the level-1 model, which combines them into the final prediction. The pseudo code is given below.

Input: Data set D = {(x1, y1), (x2, y2), ..., (xm, ym)};
   First-level learning algorithms L1, ..., LT;
   Second-level learning algorithm L.
Process:
   for t = 1, ..., T:
      ht = Lt(D)      % Train a first-level individual learner ht by applying the
                      % first-level learning algorithm Lt to the original data set D
   end;
   D' = {};           % Generate a new data set
   for i = 1, ..., m:
      for t = 1, ..., T:
         zit = ht(xi) % Use ht to classify the training example xi
      end;
      D' = D' ∪ {((zi1, zi2, ..., ziT), yi)}
   end;
   h' = L(D')         % Train the second-level learner h' by applying the second-level
                      % learning algorithm L to the new data set D'
Output: H(x) = h'(h1(x), ..., hT(x))
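The stacking procedure corresponds to WEKA's Stacking meta-classifier. The sketch below uses two level-0 learners and Logistic regression as the level-1 (meta) learner; the choice of meta learner and the file name are illustrative assumptions, not taken from this paper.

   import weka.core.Instances;
   import weka.core.converters.ConverterUtils.DataSource;
   import weka.classifiers.Classifier;
   import weka.classifiers.meta.Stacking;
   import weka.classifiers.trees.J48;
   import weka.classifiers.trees.RandomForest;
   import weka.classifiers.functions.Logistic;

   public class StackingExample {
       public static void main(String[] args) throws Exception {
           Instances data = DataSource.read("KDDTrain.arff"); // placeholder file name
           data.setClassIndex(data.numAttributes() - 1);

           Stacking stacker = new Stacking();
           // Level-0 (base) learners: their predictions form the level-1 instances.
           stacker.setClassifiers(new Classifier[] { new J48(), new RandomForest() });
           // Level-1 (meta) learner that learns how to combine the base predictions.
           stacker.setMetaClassifier(new Logistic());
           stacker.buildClassifier(data);

           double predicted = stacker.classifyInstance(data.instance(0));
           System.out.println(data.classAttribute().value((int) predicted));
       }
   }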
4. PROPOSED FRAMEWORK
The overall objective of the proposed research work is to propose a robust ensemble model [1] for the classification of data. The classification model consists of two phases.

(1) Model building
Model building is performed on the training set of data. It is the supervised learning of a training set of data to build a model. To build the ensemble model, individual data mining techniques (such as J48 and Random Forest) are first applied to the dataset as classifiers. Then, through ensemble techniques (bagging, boosting, stacking), the outputs of the individual models are combined to form a robust ensemble model. The ensemble model then works as a classifier which classifies the data as normal or attack.
(2) Model validation
Model validation is performed on the test set of data. It classifies the data according to the model built in the model-building phase.
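A hedged sketch of this two-phase workflow with the WEKA API is given below; it instantiates the ensemble as a stacking combination of J48 and Random Forest (one possible instantiation of the proposed model), and the training and test file names are placeholders.

   import weka.core.Instances;
   import weka.core.converters.ConverterUtils.DataSource;
   import weka.classifiers.Classifier;
   import weka.classifiers.Evaluation;
   import weka.classifiers.meta.Stacking;
   import weka.classifiers.trees.J48;
   import weka.classifiers.trees.RandomForest;
   import weka.classifiers.functions.Logistic;

   public class EnsembleFramework {
       public static void main(String[] args) throws Exception {
           // Phase 1: model building on the training set (placeholder file names).
           Instances train = DataSource.read("KDDTrain.arff");
           Instances test  = DataSource.read("KDDTest.arff");
           train.setClassIndex(train.numAttributes() - 1);
           test.setClassIndex(test.numAttributes() - 1);

           Stacking ensemble = new Stacking();
           ensemble.setClassifiers(new Classifier[] { new J48(), new RandomForest() });
           ensemble.setMetaClassifier(new Logistic());
           ensemble.buildClassifier(train);

           // Phase 2: model validation, classifying the test data with the built model.
           Evaluation eval = new Evaluation(train);
           eval.evaluateModel(ensemble, test);
           System.out.println(eval.toSummaryString());
       }
   }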
Fig 2: Ensemble model

5. EXPERIMENTAL SETUP
5.1 Experimental Design
The well-known NSL-KDD [8,11] dataset is considered here for our experiment. The NSL-KDD dataset is a new dataset consisting of selected records of the complete KDD dataset. The data is collected over a TCP/IP network, in which there are 41 quantitative and qualitative features and one feature belonging to the class (attack type). There are 22 types of attack in the training dataset and 37 types of attack in the test dataset.
We used WEKA 3.7.10, a machine learning tool, to measure the classification performance of the ensemble techniques. WEKA [3,4,12] is a data mining system developed by the University of Waikato in New Zealand that implements data mining algorithms in the Java language. It is a collection of machine learning algorithms for data mining tasks, and the algorithms can be applied directly to a dataset. WEKA implements algorithms for data pre-processing, classification, regression, clustering and association rules; it also includes visualization tools. We chose the decision tree classifiers (J48, Random Forest) and the ensemble classifiers with the full training set and 10-fold cross-validation for testing purposes. In 10-fold cross-validation, the available data is randomly divided into 10 disjoint subsets of approximately equal size. One of the subsets is used as the test set and the remaining 9 subsets are used for building the classifier; the test set is then used to estimate the accuracy. This is repeated 10 times so that each subset is used as a test subset once. The accuracy estimate is then the mean of the estimates for each of the classifiers.
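The 10-fold cross-validation procedure described above can be reproduced with WEKA's Evaluation class, as in the following sketch (the file name and random seed are illustrative assumptions):

   import java.util.Random;
   import weka.core.Instances;
   import weka.core.converters.ConverterUtils.DataSource;
   import weka.classifiers.Evaluation;
   import weka.classifiers.trees.J48;

   public class CrossValidationExample {
       public static void main(String[] args) throws Exception {
           Instances data = DataSource.read("KDDTrain.arff"); // placeholder file name
           data.setClassIndex(data.numAttributes() - 1);

           // Split the data into 10 folds; each fold is used once as the test set
           // while the remaining 9 folds build the classifier.
           Evaluation eval = new Evaluation(data);
           eval.crossValidateModel(new J48(), data, 10, new Random(1));

           System.out.println("Accuracy: " + eval.pctCorrect() + " %");
           System.out.println(eval.toMatrixString()); // confusion matrix
       }
   }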
5.2 Confusion Metrics
The confusion matrix [1,2,3] is used for the evaluation of a classifier. It is commonly encountered in a two-class format, but can be generated for any number of classes. A single prediction by a classifier can have four outcomes, which are displayed in the confusion matrix. Table 1 depicts the confusion matrix.

Table 1: Confusion metrics

                  PREDICTED
ACTUAL       NEGATIVE    POSITIVE
NEGATIVE     TN          FP
POSITIVE     FN          TP

The entries in the confusion matrix have the following meaning in the context of our study:
TN is the number of correct predictions that an instance is negative,
FP is the number of incorrect predictions that an instance is positive,
FN is the number of incorrect predictions that an instance is negative, and
TP is the number of correct predictions that an instance is positive.
During the testing phase, the testing dataset is given as input to the proposed technique and the obtained result is estimated with the evaluation metrics precision, recall, accuracy and f-measure.

The accuracy (AC) is the proportion of the total number of predictions that were correct. It is determined using the equation:
AC = (TP + TN) / (TP + FN + FP + TN)

The recall, sensitivity or true positive rate (TPR) is the proportion of positive cases that were correctly identified, as calculated using the equation:
TPR = TP / (TP + FN)

The precision (P) is the proportion of the predicted positive cases that were correct, as calculated using the equation:
Precision = TP / (TP + FP)

The F-measure is the harmonic mean of precision and recall:
F = 2 * Recall * Precision / (Recall + Precision)
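For clarity, the sketch below computes these measures directly from the four confusion-matrix counts; the counts themselves are arbitrary example values, not results from this study.

   public class Metrics {
       public static void main(String[] args) {
           // Arbitrary example counts from a two-class confusion matrix.
           double tp = 90, tn = 80, fp = 10, fn = 20;

           double accuracy  = (tp + tn) / (tp + tn + fp + fn);
           double recall    = tp / (tp + fn);          // true positive rate
           double precision = tp / (tp + fp);
           double fMeasure  = 2 * precision * recall / (precision + recall);

           System.out.printf("AC=%.3f TPR=%.3f P=%.3f F=%.3f%n",
                   accuracy, recall, precision, fMeasure);
       }
   }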
6. RESULTS
We used two decision tree techniques, J48 and Random Forest, as individual classifiers. We also used these two classifiers with the ensemble techniques Stacking, Bagging and Boosting to form ensemble classifiers. The classifier results are evaluated using the confusion matrix.

Table 2 depicts the accuracy of the different classifiers, which shows that the ensemble classifiers perform better than the individual classifiers.
Table 2. Accuracy of classifiers on NSL-KDD dataset

S. No.   Data Mining Technique   Accuracy (%)
1        J48                     98.59
2        Random Forest           98.55
3        Stacking                98.66
4        Boosting                98.60
5        Bagging                 98.7

A graph plot of the accuracy of the different classifiers is also given in Figure 3.

Fig 3: Accuracy of Different Classifiers

Table 3 shows the performance of the different classifiers using the precision, recall and f-measure parameters of the confusion matrix.

Table 3: Performance of classifiers on NSL-KDD dataset

Data Mining Technique   Attack Type   Precision   Recall   F-measure
J48                     Normal        98.2        98.6     98.4
J48                     Attack        98.9        98.6     98.8
RF                      Normal        98.2        98.5     98.3
RF                      Attack        98.8        98.6     98.7
Stacking                Normal        98.4        98.5     98.4
Stacking                Attack        98.9        98.8     98.8
Bagging                 Normal        98.5        98.5     98.5
Bagging                 Attack        98.9        98.9     98.9
Boosting                Normal        98.4        98.3     98.4
Boosting                Attack        98.8        98.8     98.7

A graph plot of the performance of the different classifiers is also provided in Figure 4.

Fig 4: Performance of Different Classifiers
Tables 2 and 3 show that the ensemble classifiers perform better than the individual classifiers, not only in accuracy but also in the other parameters of the confusion matrix: precision, recall and f-measure.
7. CONCLUSION
This research aims to discover the best-performing classification algorithm for intrusion detection. The experimental results show that the Bagging classifier provides the highest accuracy, 98.71%, Stacking provides an accuracy of 98.66% and Boosting provides an accuracy of 98.60%, which is better than the accuracy of the individual classifiers J48 and Random Forest. Not only in accuracy, but also in precision, recall and f-measure, the ensemble techniques provide better results than the individual classifiers.
In the present study we focused on the decision tree techniques J48 and Random Forest and considered only a few parameters for model evaluation. For further research, different data mining techniques can be tested within the ensemble model. Feature selection techniques can also be tested on the NSL-KDD dataset to gain improved performance with reduced feature subsets.

8. REFERENCES
[1] Pradhan, R. L., et al. (2014). "Robust Ensemble Model for Intrusion Detection using Data Mining Techniques".
[2] Nagle, M. K., et al. (2013). "Feature Extraction Based Classification Technique for Intrusion Detection System", International Journal of Engineering Research and Development.
[3] Mukherjee, S. (2012). "Intrusion Detection using Bayes Classifier with Feature Reduction", Procedia Technology.
[4] Kalyani, G., et al. (2012). "Performance Assessment of Different Classification Techniques for Intrusion Detection".
[5] Han, J., and Kamber, M. (2006). Data Mining: Concepts and Techniques, Second Edition, Morgan Kaufmann Publishers, San Francisco, USA.
[6] Pujari, A. K. (2001). Data Mining Techniques, 4th edition, Universities Press (India) Private Limited.
[7] Panda, M. (2011). "A Hybrid Intelligent Approach for Network Intrusion Detection", Procedia Engineering.
[8] Revathi, S., et al. (2013). "A Detailed Analysis on NSL-KDD Dataset Using Various Machine Learning Techniques".
[9] Sirikulviriya, N. (2011). "Integration of Rules from a Random Forest".
[10] Zhou, Z.-H. "Ensemble Learning".
[11] http://nsl.cs.unb.ca/NSL-KDD
[12] http://www.cs.waikato.ac.nz/~ml/weka
[13] http://en.wikipedia.org/wiki/C4.5_algorithm
[14] Breiman, L. (2001). "Random Forests", Machine Learning, 45(1):5-32.