A Lightweight Intrusion Detection Model Based on Feature Selection and Maximum Entropy Model

Yang Li (1,2), Bin-Xing Fang (1), You Chen (1,2), Li Guo (1)
(1) Software Division, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China
(2) Graduate School of Chinese Academy of Sciences, Beijing 100080, China
Email: liyang@software.ict.ac.cn
Abstract—Intrusion detection is a critical component of secure information systems. Current intrusion detection systems (IDS), especially network intrusion detection systems (NIDS), examine all data features to detect intrusions. However, some of the features may be redundant or contribute little to the detection process, and they can therefore significantly degrade system performance. This paper proposes a lightweight intrusion detection model, based on feature selection and the Maximum Entropy (ME) model, that is computationally efficient and effective. Firstly, the issue of identifying important input features is addressed: eliminating insignificant and/or useless inputs simplifies the problem and therefore leads to faster and more accurate detection. Secondly, a classic ME model is used to learn and detect intrusions using the selected important features. Experimental results on the well-known KDD 1999 dataset show that the proposed model is effective and can be applied to real-time intrusion detection environments.
I. INTRODUCTION
An Intrusion Detection System (IDS) plays a vital role in detecting various kinds of attacks. The main purpose of an IDS is to find intrusions among normal audit data, which can be considered a classification problem.
The two basic methods of detection are signature based and anomaly based [1]. The signature-based method, also known as misuse detection, looks for a specific signature to match, signaling an intrusion. Signature-based systems can detect many or all known attack patterns, but they are of little use against as yet unknown attack methods. Most popular intrusion detection systems fall into this category. The other approach to intrusion detection is called anomaly detection. Anomaly detection systems are computationally expensive because of the overhead of keeping track of, and possibly updating, several system profile metrics. Many IDSs have been developed during the past three decades, and most commercial and freeware IDS tools are signature based.
As new attacks appear and the amount of audit data increases, an IDS should counteract them. In addition, as network speeds become faster, there is an emerging need for security analysis techniques that can keep up with the increased network throughput [2]. Therefore, the IDS itself should be lightweight (that is, have a relatively low computational cost) while guaranteeing high detection rates. One of the main problems with IDSs is this overhead [3]; detecting intrusions in real time is therefore a difficult task.

This paper is funded by the National Grand Fundamental Research 973 Program of China (No. 2004CB318109).
In this paper, we propose a new lightweight intrusion detection model. Our model mainly focuses on how to effectively detect intrusions in a real-time network environment. As our preliminary work, we mainly focus on intrusion detection at the network traffic level. Firstly, we extract several necessary and highly important features (named core features) from the KDD 1999 dataset by means of the Information Gain and Chi-Square approaches. Secondly, we adopt the Maximum Entropy model to learn and form a classifier that detects attacks based on the selected features. The results of experiments on the KDD 1999 dataset indicate the feasibility of our model.
The remainder of this paper is organized as follows. In Section 2, we propose the overall model based on feature selection and the Maximum Entropy model. Section 3 and Section 4 detail the feature selection approach and the Maximum Entropy model used in our system, respectively. Section 5 discusses the relevant experiments and evaluations. We conclude our work in Section 6.
II. OUR INTRUSION DETECTION MODEL
The overall model of our approach is depicted in Fig. 1. In the training process, network traffic data is preprocessed (data packets are labeled with classes such as normal or abnormal) and passed to our feature-selection engine, which uses both the Information Gain and Chi-Square approaches. Afterwards, the dataset is used to build the Maximum Entropy based intrusion detection model using the selected features. In the testing process, network traffic data is sent directly to our intrusion detection model for detection.
The main advantage of our lightweight model is that, by means of feature selection, it greatly reduces the number of redundant and less important features used for intrusion detection, and therefore reduces the computational cost of the detection process. Moreover, the Maximum Entropy model has proved to be a good classifier when provided with enough input features, and it is very effective in the field of intrusion detection.
Fig. 1 Overall intrusion detection model. (The figure shows training data and testing traffic data passing through a network preprocessor, then through Information Gain based and Chi-Square based feature selection, and finally into a Maximum Entropy model based classifier that performs the detection.)
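As a concrete illustration of the flow in Fig. 1, the following sketch wires a feature-selection step and a maximum-entropy-style classifier into one training/testing pipeline. It is only an outline under stated assumptions: the paper's experiments were run in Weka, whereas this sketch uses scikit-learn; the arrays X and y are hypothetical placeholders for preprocessed traffic records and their labels; and mutual information stands in for the Information Gain ranking.

```python
# Illustrative sketch of the Fig. 1 pipeline (not the authors' Weka setup).
# X and y are hypothetical placeholders for preprocessed traffic records.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((1000, 41))             # placeholder for the 41 KDD features
y = rng.integers(0, 5, size=1000)      # placeholder for the 5 traffic classes

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    # feature-selection engine (information-gain-like ranking)
    ("select", SelectKBest(score_func=mutual_info_classif, k=6)),
    # maximum-entropy-style classifier (multinomial logistic regression)
    ("clf", LogisticRegression(max_iter=100)),
])
pipeline.fit(X_train, y_train)          # training process
print(pipeline.score(X_test, y_test))   # testing process
```

The point of using a single pipeline is that feature selection is fitted on the training data only, as in the training half of Fig. 1, and then reused unchanged when scoring test traffic.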
III. FEATURE SELECTION

In this section, we use both the classic Information Gain (IG) and the Chi-Square feature selection approaches to fulfill the task of feature selection from the KDD 1999 dataset. The reason for combining the IG and Chi-Square approaches is that we want to eliminate the possibility that a single feature selection approach produces biased results. Therefore, combining them is a good and reasonable choice.

A. Feature Selection Based on Information Gain (IG)

The theory of Information Gain can be described as follows [4]. Let S be a set of training samples with their corresponding labels. Suppose there are m classes, the training set contains s_i samples of class i, and s is the total number of samples in the training set. The expected information needed to classify a given sample is calculated by:

I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}    (1)

A feature F with values \{f_1, f_2, \ldots, f_v\} can divide the training set into v subsets \{S_1, S_2, \ldots, S_v\}, where S_j is the subset in which every sample has the value f_j for feature F. Furthermore, let S_j contain s_{ij} samples of class i. The entropy of the feature F is:

E(F) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} I(s_{1j}, \ldots, s_{mj})    (2)

The information gain for F can then be calculated as:

Gain(F) = I(s_1, \ldots, s_m) - E(F)    (3)

Therefore, in our experiments, information gain is calculated for class labels by employing a binary discrimination for each class. That is, for each class, a dataset instance is considered in-class if it has the same label; otherwise, it is considered out-class. Consequently, as opposed to calculating one information gain as a general measure of the relevance of the feature for all classes, we calculate an information gain for each class. This signifies how well the feature can discriminate the given class (i.e., normal or an attack type) from the other classes. Section 5 will give the detailed results.
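The computation in (1)-(3), together with the one-vs-rest (in-class/out-class) scheme described above, can be sketched in a few lines of NumPy. This is an illustrative reimplementation rather than the authors' Weka code; it assumes the feature has already been discretized, and the toy feature and label arrays below are made up.

```python
# Minimal sketch of Eqs. (1)-(3): information gain of one discrete feature,
# computed with the one-vs-rest (in-class / out-class) scheme described above.
import numpy as np

def entropy(labels):
    """I(s_1, ..., s_m) = -sum_i (s_i/s) * log2(s_i/s), Eq. (1)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_values, labels):
    """Gain(F) = I(S) - E(F), Eqs. (2) and (3)."""
    total = len(labels)
    expected = 0.0
    for value in np.unique(feature_values):
        subset = labels[feature_values == value]
        expected += (len(subset) / total) * entropy(subset)   # Eq. (2)
    return entropy(labels) - expected                         # Eq. (3)

# Toy example: one discretized feature column and multi-class labels.
feature = np.array(["low", "low", "high", "high", "mid", "mid"])
labels  = np.array(["normal", "normal", "DoS", "DoS", "Probe", "normal"])

# One IG score per class via binary (in-class vs. out-class) discrimination.
for cls in np.unique(labels):
    binary = np.where(labels == cls, "in", "out")
    print(cls, round(information_gain(feature, binary), 3))
```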
B. Feature Selection Based on Chi-Square

The Chi-Square approach is a simple and general algorithm that uses the \chi^2 statistic to discretize numeric features repeatedly until some inconsistencies are found in the data, and thus achieves feature selection via discretization [5]. The measure is defined as:

\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{k} \frac{(A_{ij} - E_{ij})^2}{E_{ij}}    (4)

where:
k = number of classes,
A_{ij} = number of patterns in the i-th interval and j-th class,
R_i = number of patterns in the i-th interval = \sum_{j=1}^{k} A_{ij},
C_j = number of patterns in the j-th class = \sum_{i=1}^{2} A_{ij},
N = total number of patterns = \sum_{i=1}^{2} R_i,
E_{ij} = expected frequency of A_{ij} = R_i \cdot C_j / N.

If either R_i or C_j is 0, E_{ij} is set to 0.1. The number of degrees of freedom of the \chi^2 statistic is one less than the number of classes.
Therefore, in this paper, we use the Chi-Square approach as described above to fulfill the feature selection task so as to best distinguish the five classes (i.e., normal, DoS, Probe, U2R, R2L). Section 5 will give the detailed results.
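A minimal sketch of the \chi^2 measure in (4) for one pair of adjacent intervals follows, including the rule that E_ij is replaced by 0.1 when R_i or C_j is zero. The 2 x 5 contingency counts are hypothetical and stand for one discretized feature over the five traffic classes; the full Chi2 algorithm of [5] would apply this measure repeatedly while merging intervals, which is not shown here.

```python
# Sketch of Eq. (4): chi-square over 2 adjacent intervals and k classes,
# with the E_ij = 0.1 substitution when a row or column total is zero.
import numpy as np

def chi2_measure(A):
    """A is a 2 x k matrix: A[i, j] = patterns in interval i with class j."""
    R = A.sum(axis=1)              # R_i: patterns per interval
    C = A.sum(axis=0)              # C_j: patterns per class
    N = A.sum()                    # total number of patterns
    E = np.outer(R, C) / N         # E_ij = R_i * C_j / N
    E[(R == 0)[:, None] | (C == 0)[None, :]] = 0.1   # degenerate rows/columns
    return np.sum((A - E) ** 2 / E)

# Hypothetical counts: 2 adjacent intervals of one feature, 5 classes
# (normal, DoS, Probe, U2R, R2L).  Degrees of freedom = k - 1 = 4.
A = np.array([[30,  5, 3, 0, 2],
              [ 4, 25, 6, 1, 4]])
print(round(chi2_measure(A), 3))
```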
IV. INTRUSION DETECTION BASED ON MAXIMUM ENTROPY MODEL
As described in Section 2, after selecting the important and necessary features using the feature selection approaches, we use them as input features to the Maximum Entropy model. Having formed a classifier based on these features and realistic network traffic data, we can use the Maximum Entropy model-based classifier to detect intrusions in a real-time environment.
Maximum entropy (ME) modeling [6] has been successfully used in fields such as machine learning, information retrieval, computer vision, and econometrics. From a practitioner's point of view, its advantage is that it can handle a large set of interdependent features. Features are automatically weighted by an optimization process, resulting in a model that maximizes the conditional likelihood of the class labels x given the training data y, i.e., p(x | y).
In more detail, the goal of the ME principle is the following: given a set of features, a set of their corresponding functions f_i (i = 1, 2, ..., k), whose role is to measure the contribution of each feature to the model, and a set of constraints, we have to find the probability distribution that satisfies the constraints and minimizes the relative entropy. That is, given the constraint set (5) and the information entropy defined in (6), we can describe the problem as (7):

P = \{\, p \mid p(f_i) = p^*(f_i), \ i = 1, 2, \ldots, k \,\}    (5)

H(p) = -\sum_{x, y} p^*(y) \, p(x \mid y) \log p(x \mid y)    (6)

In (6), 0 \le H(p) \le \log |y|.

p^* = \arg\max_{p \in P} H(p)    (7)

In order to obtain the optimal solution, we use the Lagrange multiplier method, which results in (8):

p^*(x \mid y) = \frac{1}{Z(y)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big)    (8)

Z(y) = \sum_x \exp\Big( \sum_i \lambda_i f_i(x, y) \Big)    (9)

In (8), p is the probability distribution obtained from training and p^* stands for the probability distribution model constructed by us; p^*(x | y) denotes the probability of predicting an outcome x in the given context y with constraint feature functions f_i(x, y). Formula (9) is a normalization factor that ensures \sum_x p(x \mid y) = 1. The parameters \lambda_i can be derived by an iterative algorithm called Generalized Iterative Scaling (GIS), which updates the weights by scaling the existing parameters to increase the likelihood of the observed features. In this paper, we carry out 100 GIS iterations in our experiments.

Moreover, ME modeling can also suffer from overfitting. This problem is most noticeable when features in the training data occur infrequently, resulting in sparse data. In these cases, the weights derived by the ME model for these sparse features may not accurately reflect the test data. To address this weakness, smoothing with a Gaussian prior has been advocated [7]. We use a Gaussian prior (with σ = 1) to smooth our models in all of our experiments.

The maximum entropy approach described in this work exhibits many advantages. First, it provides IDS administrators with a multi-dimensional view of the network traffic by classifying packets according to a set of features carried by each packet. Second, it detects intrusions that cause abrupt changes in the network traffic as well as those that increase traffic slowly: a large deviation from the baseline distribution can only be caused by packets that make up an unusual portion of the traffic, so no matter how slowly an intrusion increases its traffic, it can be detected once the relative entropy increases to a certain level. Third, it provides information about the type of the intrusion detected.
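Since (8)-(9) have exactly the form of multinomial logistic regression, an ME-style classifier over the selected features can be sketched as below. This is an approximation of the setup described above rather than a reproduction of it: scikit-learn's L-BFGS solver replaces GIS, the L2 penalty plays the role of the Gaussian prior (the mapping C ≈ σ² is an assumption), and the feature matrix is a random placeholder.

```python
# Sketch of an ME-style classifier: Eqs. (8)-(9) are exactly the form of
# multinomial logistic regression.  The paper trains with GIS and a Gaussian
# prior (sigma = 1); here L-BFGS is used instead, and the L2 penalty stands
# in for the prior (C ~ sigma^2 is an approximate, assumed mapping).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.random((500, 6))             # stand-in for the 6 selected features
y = rng.integers(0, 5, size=500)     # stand-in for the 5 traffic classes

maxent = LogisticRegression(
    penalty="l2", C=1.0,             # Gaussian-prior-style smoothing
    max_iter=100,                    # mirrors the 100 iterations in the paper
)
maxent.fit(X, y)

# p*(x | y) for one new record: exp(sum_i lambda_i * f_i) / Z(y), Eqs. (8)-(9)
print(maxent.predict_proba(X[:1]))
```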
V. EXPERIMENTS AND EVALUATIONS
In our preliminary work, we select the KDD 1999 dataset to test the performance of our ME model-based approach, because this dataset is still a common benchmark for evaluating IDS techniques. Moreover, considering that our approach is independent of any particular dataset, it is reasonable to select it as the benchmark.
A. Experimental Environment and Dataset
All experiments were performed on a Windows machine with an Intel(R) Pentium(R) 4 processor at 1.73 GHz and 1 GB of RAM; the operating system is Microsoft Windows XP Professional (SP2). We used the open source machine learning framework Weka [8] (the latest Windows version is Weka 3.4). It is a collection of machine learning algorithms for data mining tasks and contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.
For feature selection, we randomly selected a subset of the KDD 1999 dataset (it contains 496,201 records with 41 features and a class label; its classes include normal, DoS, Probe, U2R and R2L; approximately 20% of the patterns are normal and the remaining 80% are attacks belonging to the four attack classes) and applied the IG and Chi-Square approaches to it to acquire the most relevant and necessary features. To identify attacks, we adopted 10-fold cross validation to verify the feature selection results. Afterwards, we also adopted 10-fold cross validation to evaluate our ME-based lightweight detection model.
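The evaluation protocol, 10-fold cross validation once over all 41 features and once over the selected subset, can be sketched as follows. The paper ran this in Weka 3.4; the scikit-learn version below is only illustrative, the data arrays are random placeholders for the sampled KDD records, and mutual information is used as a stand-in ranking for choosing the top 6 features.

```python
# Sketch of the evaluation protocol: 10-fold cross validation on the sampled
# KDD records, once with all 41 features and once with the 6 selected ones.
# X_all and y are illustrative placeholders, not the real 496,201 records.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X_all = rng.random((2000, 41))          # placeholder feature matrix
y = rng.integers(0, 5, size=2000)       # placeholder labels for 5 classes

clf = LogisticRegression(max_iter=100)

# All 41 features (cf. Table 3).
acc_all = cross_val_score(clf, X_all, y, cv=10, scoring="accuracy")

# Top-6 features by an information-gain-like ranking (cf. Table 4).
selected = SelectKBest(mutual_info_classif, k=6).fit_transform(X_all, y)
acc_sel = cross_val_score(clf, selected, y, cv=10, scoring="accuracy")

print(acc_all.mean(), acc_sel.mean())
```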
B. Results and Evaluations
Table 1 and Table 2 give the feature selection results obtained using the IG and Chi-Square approaches, respectively, over all 41 features in the KDD 1999 dataset. Table 3 shows the performance of our ME model-based approach using all 41 features, and Table 4 shows the performance using only the selected important features.
From Table 1 and Table 2, we can clearly see that IG and Chi-Square based feature selection produce almost the same results, i.e., both identify the same top 12 features as the most important under their own ranking criteria. This demonstrates that the results are reasonable and independent of the feature selection algorithm.
TABLE I. FEATURE SELECTION RESULTS BASED ON IG
Rank  Feature                       IG
1     dst_host_rerror_rate          0.286
2     src_bytes                     0.283
3     dst_bytes                     0.278
4     hot                           0.263
5     dst_host_srv_rerror_rate      0.258
6     num_compromised               0.231
7     srv_rerror_rate               0.077
8     rerror_rate                   0.076
9     count                         0.075
10    srv_count                     0.074
11    dst_host_srv_diff_host_rate   0.046
12    service                       0.036

TABLE II. FEATURE SELECTION RESULTS BASED ON CHI-SQUARE
Rank  Feature                       Chi-Square
1     src_bytes                     17586.107
2     dst_host_rerror_rate          17368.831
3     dst_bytes                     17073.438
4     dst_host_srv_rerror_rate      17032.989
5     hot                           16503.031
6     num_compromised               14357.396
7     srv_count                     5060.741
8     count                         3125.14
9     dst_host_srv_diff_host_rate   2607.774
10    srv_rerror_rate               2421.084
11    rerror_rate                   2209.594
12    service                       2078.218
It must be noted that the results in Table 3 and Table 4 are obtained using the top 6 features selected from Table 1 and Table 2 as the input learning features for the ME model, because in the first two tables both the IG and Chi-Square scores of the top 6 features are clearly separated from those of the remaining features, i.e., the features other than the top 6 have little or no effect on intrusion detection.
The results in Table 3 and Table 4 are remarkably good (especially for the detection of DoS attacks, whose accuracy is 100%), and they demonstrate two important facts: i) the selected features play the same important role in intrusion detection; ii) the computational cost can be greatly reduced without losing any effectiveness when we use the selected features instead of all 41 features. Therefore, they can be used in a real-time lightweight intrusion detection environment.
TABLE III. DETECTION RESULTS ON ALL 41 FEATURES
Class    Testing Time (Sec)    Accuracy (%)
Normal   1.28                  99.75
Probe    2.09                  99.80
DoS      1.93                  100
U2R      1.05                  99.89
R2L      1.02                  99.78

TABLE IV. DETECTION RESULTS ON SELECTED FEATURES
Class    Testing Time (Sec)    Accuracy (%)
Normal   0.78                  99.73
Probe    1.25                  99.76
DoS      1.03                  100
U2R      0.70                  99.87
R2L      0.68                  99.75
VI. CONCLUSIONS
In this paper, we proposed a new lightweight intrusion detection model. First, feature selection based on the Information Gain and Chi-Square approaches is performed on the KDD 1999 training set, which is widely used by machine learning researchers. Second, using the ME model, these selected features were learned and used for intrusion detection. Experimental results on the KDD 1999 dataset demonstrate that the results are good and the model is reasonable. Moreover, its computational cost is relatively low owing to the adoption of feature selection and the Maximum Entropy model, so it can be applied to real-time intrusion detection environments (as shown in Table 3 and Table 4).
In future work, we will apply our model in a realistic environment to verify its real-time performance and effectiveness.
REFERENCES
[1] M. Bykova, S. Ostermann and B. Tjaden, "Detecting network intrusions via a statistical analysis of network packet characteristics", in Proc. of the 33rd Southeastern Symposium on System Theory, Athens, OH, IEEE, 2001.
[2] C. Kruegel and F. Valeur, "Stateful Intrusion Detection for High-Speed Networks", in Proc. of the IEEE Symposium on Research on Security and Privacy, pp. 285–293, 2002.
[3] T. Bass, "Intrusion detection systems and multisensor data fusion", Communications of the ACM, 43(4), pp. 99–105, 2000.
[4] D. Włodzisław, W. Tomasz, B. Jacek and K. Adam, "Feature Selection and Ranking Filters", 2003, http://metet.polsl.katowice.pl/~jbiesiada/prace/selekcja/03Istambul.pdf.
[5] H. Liu and R. Setiono, "Chi2: feature selection and discretization of numeric attributes", in Proc. of the Seventh International Conference on Tools with Artificial Intelligence, pp. 388–391, 1995.
[6] R. Adwait, "A Simple Introduction to Maximum Entropy Models for Natural Language Processing", University of Pennsylvania, Tech. Rep., 1997.
[7] K. Min-Yen and T. Hoang Oanh Nguyen, "Fast Webpage Classification Using URL Features", in Proc. of CIKM '05, ACM, 2005.
[8] "Weka Machine Learning Project", http://www.cs.waikato.ac.nz/~ml/.