A novel approach to intrusion detection based on Support Vector Data Description
XINMIN TAO
Communication Tech Institute
Harbin Institute of Technology
150001
CHINA
Abstract: - This paper presents a novel one-class classification approach to intrusion detection based on support vector data description (SVDD). The approach separates target class data from other possible outlier class data which are unknown to us. SVDD-based intrusion detection enables the determination of an arbitrarily shaped region that encloses the target class of a dataset. This paper analyzes the behavior of the classifier under different parameter choices and proposes a genetic-algorithm-based way to determine the optimal parameters. Finally, results are reported on the DARPA'99 evaluation data. The results demonstrate that the proposed method outperforms other two-class classifiers.
Key-Words: - support vector, intrusion detection, kernel function, support vector data description
1 Introduction
With the growing rate of interconnections among
computer systems, network security is becoming a
major challenge. In order to meet this challenge,
Intrusion Detection Systems (IDS) are being designed
to protect the availability, confidentiality and
integrity of critical networked information systems.
They protect computer networks against denial-of-service (DoS) attacks, unauthorized disclosure of information and the modification or destruction of data. Early in the research into IDS, two major principles known as anomaly detection and signature detection emerged: the former flags behavior that is abnormal for an entity, while the latter flags behavior that is close to some previously defined pattern signature of a known intrusion. Accordingly, many approaches have been proposed, including statistical, machine learning, data mining, neural-network and immunologically inspired techniques. The IDS based
on anomaly detection can be treated as one-class classification. In one-class classification, one set of data, called the target set, has to be distinguished from the rest of the feature space. In many anomaly detection applications, however, negative (abnormal) samples are not available at the training stage. For instance, in a computer security application, it is difficult, if not impossible, to have information about all possible attacks. In machine learning approaches, the lack of samples from the abnormal class makes it difficult to apply supervised techniques (e.g. two-class classification). Therefore, the obvious machine learning solution is to use a one-class classifier. In one-class
classification, the task is not to distinguish between
classes of objects like in classification problems or to
produce a desired outcome for each input object like
in regression problems, but to give a description of a set of objects, called the target class. This description should be able to distinguish between the class of objects represented by the training set and all other possible objects in the object space, called the outlier
class. In general the problem of one-class
classification is harder than the problem of normal
two-class classification. For normal classification the
decision boundary is supported from both sides by
examples of each of the classes. Because in the case
of one-class classification only one set of data is
available, only one side of the boundary is covered.
One-class classification is often solved using density estimation or a model-based approach. In this paper
we propose a novel approach to anomaly detection
based on support vector data description inspired by
the Support Vector Classifier (Vapnik 1998). Instead
of using a hyperplane to distinguish between two
classes, a hypersphere around the target set is used.
This paper analyzes the geometric character of the kernel space and the influence of different kernel parameters on the behavior of the classifier. Finally, the paper presents a method based on a genetic algorithm to determine the optimal kernel parameter.
We will start with an introduction of mathematical
prerequisites in section 2. In section 3 an explanation
of the Support Vector Data Description will be given.
In section 4 we discuss how to perform model selection for different kernel parameters. In section 5 the results of the experiments are presented, and we conclude in section 6.
2 Mathematical Basis

Definition 2.1 (Convex Set) A set $X$ in a vector space is called convex if for any $x, x' \in X$ and any $\lambda \in [0,1]$ we have

$\lambda x + (1-\lambda) x' \in X$.   (1)

Definition 2.2 (Convex Function) A function $f$ defined on a set $X$ is called convex if, for any $x, x' \in X$ and any $\lambda \in [0,1]$ such that $\lambda x + (1-\lambda) x' \in X$, we have

$f(\lambda x + (1-\lambda) x') \le \lambda f(x) + (1-\lambda) f(x')$.   (2)

A function $f$ is called strictly convex if for $x \neq x'$ and $\lambda \in (0,1)$ inequality (2) is strict.

Definition 2.3 (Constrained Problem) Minimize $f(x)$ subject to $c_i(x) \le 0$ for all $i \in [n]$, where $f$ and the $c_i$ are convex functions and $n \in \mathbb{N}$. In some cases we additionally have equality constraints $e_j(x) = 0$ for some $j \in [n']$. Then the optimization problem can be written as

$\min_x f(x)$ subject to $c_i(x) \le 0$ for all $i \in [n]$, $e_j(x) = 0$ for all $j \in [n']$.   (3)

Theorem 2.1 (KKT for Differentiable Convex Problems) A solution to the optimization problem (3) with convex, differentiable $f, c_i$ is given by $\bar{x}$, if there exists some $\bar{\alpha} \in \mathbb{R}^n$ with $\bar{\alpha}_i \ge 0$ for $i \in [n]$ such that the following conditions are satisfied:

$\partial_x L(\bar{x}, \bar{\alpha}) = \partial_x f(\bar{x}) + \sum_{i=1}^{n} \bar{\alpha}_i \, \partial_x c_i(\bar{x}) = 0$   (saddle point in $\bar{x}$)   (4)

$\partial_{\alpha_i} L(\bar{x}, \bar{\alpha}) = c_i(\bar{x}) \le 0$   (saddle point in $\bar{\alpha}$)   (5)

$\sum_{i=1}^{n} \bar{\alpha}_i c_i(\bar{x}) = 0$   (vanishing KKT gap)   (6)

Definition 2.4 (Normal Space) A set of feature vectors $Self \subseteq S$ represents the normal states of the system. Its complement is called $Non\_Self$ and is defined as $Non\_Self = S - Self$. In many cases we will define the $Self$ (or $Non\_Self$) set using its characteristic function $\chi_{self} : [x_{min}, x_{max}]^n \to \{0,1\}$:

$\chi_{self}(x) = 1$ if $x \in Self$, and $\chi_{self}(x) = 0$ if $x \in Non\_Self$.
3 Support Vector Data Description
The Support Vector Data Description (SVDD) is the method which we will use to describe our data. It is inspired by the Support Vector Classifier of Vapnik (see [3]). The SVDD is explained in more detail in [1]; here we just give a quick impression of the method.
The idea of the method is to find the sphere with minimal volume which contains all data. Assume we have a data set containing $N$ data objects, $\{x_i, i = 1, \ldots, N\}$, and the sphere is described by center $a$ and radius $R$. We now try to minimize an
error function containing the volume of the sphere. The constraints that objects are within the sphere are imposed by applying Lagrange multipliers:

$L(R, a, \alpha_i) = R^2 - \sum_i \alpha_i \{ R^2 - (x_i^2 - 2 a \cdot x_i + a^2) \}$   (7)

with Lagrange multipliers $\alpha_i \ge 0$. This function has to be minimized with respect to $R$ and $a$ and maximized with respect to $\alpha_i$.
Setting the partial derivatives of $L$ with respect to $R$ and $a$ to zero gives:

$\frac{\partial L}{\partial R} = 2R - 2R \sum_i \alpha_i = 0: \quad \sum_i \alpha_i = 1$

$\frac{\partial L}{\partial a} = -2 \sum_i \alpha_i x_i + 2 a \sum_i \alpha_i = 0: \quad a = \frac{\sum_i \alpha_i x_i}{\sum_i \alpha_i} = \sum_i \alpha_i x_i$   (8)

This shows that the center of the sphere $a$ is a linear combination of the data objects $x_i$. Resubstituting these values in the Lagrangian gives the function to maximize with respect to $\alpha_i$:

$L = R^2 - \sum_i \alpha_i R^2 + \sum_i \alpha_i (x_i \cdot x_i) - 2 \sum_{i,j} \alpha_i \alpha_j (x_i \cdot x_j) + \sum_{i,j} \alpha_i \alpha_j (x_i \cdot x_j) = \sum_i \alpha_i (x_i \cdot x_i) - \sum_{i,j} \alpha_i \alpha_j (x_i \cdot x_j)$   (9)

with $\alpha_i \ge 0$, $\sum_i \alpha_i = 1$. Applying the KKT condition (6) to the Lagrangian (7) gives:

$\alpha_i \left( R^2 - (x_i^2 - 2 a \cdot x_i + a^2) \right) = 0$   (10)

This function should be maximized with respect to $\alpha_i$. In practice this means that a large fraction of the $\alpha_i$ become zero. For a small fraction $\alpha_i > 0$, and the corresponding objects are called support objects. We see that the center of the sphere depends just on the few support objects; objects with $\alpha_i = 0$ can be disregarded.
An object $z$ is accepted when:

$(z - a)(z - a)^T = (z - \sum_i \alpha_i x_i) \cdot (z - \sum_i \alpha_i x_i) = (z \cdot z) - 2 \sum_i \alpha_i (z \cdot x_i) + \sum_{i,j} \alpha_i \alpha_j (x_i \cdot x_j) \le R^2$   (11)

In general this does not give a very tight description. Analogous to the method of Vapnik [3], we can replace the inner products $(x \cdot y)$ in equations (9) and (11) by kernel functions $K(x, y)$, which gives a much more flexible method. When we replace the inner products by Gaussian kernels, for instance, we obtain:

$(x \cdot y) \to K(x, y) = \exp(-(x - y)^2 / s^2)$   (12)

Equation (9) now changes into:

$L = 1 - \sum_i \alpha_i^2 - \sum_{i \neq j} \alpha_i \alpha_j K(x_i, x_j)$   (13)

and the formula to check whether a new object $z$ is within the sphere (equation (11)) becomes:

$1 - 2 \sum_i \alpha_i K(z, x_i) + \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j) \le R^2$   (14)

Other kernel functions, for example the polynomial kernel $k(x, y) = (x \cdot y)^d$, will be discussed in section 5.
Consider the distribution of the data mapped into the Gaussian kernel feature space by $\phi(x)$. Because the norm of the mapped data objects $\phi(x)$ is $\langle \phi(x), \phi(x) \rangle = K(x, x) = \exp(0) = 1$, all data objects in the high-dimensional feature space lie on a spherical surface, as shown in Figure 1.

Fig.1 Distribution of the data objects in the inner-product feature space (figure annotations: the sphere with minimal volume, normal data, outlier, origin)

With the kernel we obtain a more flexible description than the rigid sphere description. In Figure 2 both methods are shown applied to the same two-dimensional data set. The sphere description on the left includes all objects, but is by no means very tight: it includes large areas of the feature space where no target patterns are present. The right figure shows the data description using Gaussian kernels, which clearly gives a superior description. No empty areas are included, which minimizes the chance of accepting outlier patterns.

Fig.2 Left graph: margin of the linear-kernel SVDD; right graph: margin of the Gaussian-kernel SVDD
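To make the optimization above concrete, the following is a minimal Python sketch that solves the dual (13) with an off-the-shelf solver (SciPy's SLSQP) and applies the acceptance test (14). The solver choice, the helper names fit_svdd and accept, and the support-vector threshold 1e-6 are illustrative assumptions, not the implementation used in the paper.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import cdist

def gaussian_kernel(X, Y, s):
    # K(x, y) = exp(-||x - y||^2 / s^2), equation (12)
    return np.exp(-cdist(X, Y, "sqeuclidean") / s ** 2)

def fit_svdd(X, s):
    """Solve the SVDD dual: maximize sum_i a_i K(x_i,x_i) - sum_{i,j} a_i a_j K(x_i,x_j)
    subject to a_i >= 0 and sum_i a_i = 1 (equations (9)/(13))."""
    N = len(X)
    K = gaussian_kernel(X, X, s)
    d = np.diag(K)                        # all ones for the Gaussian kernel

    def neg_dual(a):                      # SciPy minimizes, so negate the dual
        return -(a @ d - a @ K @ a)

    res = minimize(neg_dual, x0=np.full(N, 1.0 / N), method="SLSQP",
                   bounds=[(0.0, None)] * N,
                   constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1.0}])
    alpha = res.x
    sv = np.where(alpha > 1e-6)[0]        # support objects: alpha_i > 0
    # R^2 = squared distance from the centre to any support object, via (14)
    r2 = 1.0 - 2.0 * K[sv[0]] @ alpha + alpha @ K @ alpha
    return alpha, r2

def accept(Z, X, alpha, r2, s):
    """Equation (14): accept z when 1 - 2 sum_i a_i K(z,x_i) + sum_{i,j} a_i a_j K(x_i,x_j) <= R^2."""
    K = gaussian_kernel(X, X, s)
    dist2 = 1.0 - 2.0 * gaussian_kernel(Z, X, s) @ alpha + alpha @ K @ alpha
    return dist2 <= r2

# toy usage: describe 30 normal points, then test one nearby and one distant point
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
alpha, r2 = fit_svdd(X, s=2.0)
print(accept(np.array([[0.1, 0.2], [6.0, 6.0]]), X, alpha, r2, s=2.0))  # expected: [ True False]
```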
4 Model Selection
In the case of the Gaussian kernel there is one extra free parameter, the width parameter $s$ in the kernel (equation (12)). We now discuss two limiting cases.

Case 1: as $s^2 \to 0$, the kernel (12) converges to

$\exp\!\left(-\frac{\|x_i - x_j\|^2}{s^2}\right) \to \delta_{ij}, \quad \delta_{ij} = 1 \text{ for } i = j, \ \delta_{ij} = 0 \text{ for } i \neq j$   (15)

and equation (13) is maximized when $\alpha_i = 1/N$, where each object supports a kernel.

Case 2: as $s^2 \to \infty$, a Taylor expansion of the Gaussian kernel (12) shows:

$\exp\!\left(-\frac{\|x - y\|^2}{s^2}\right) \approx 1 - \frac{\|x - y\|^2}{s^2} + o\!\left(\frac{\|x - y\|^2}{s^2}\right) = 1 - \frac{\|x\|^2}{s^2} - \frac{\|y\|^2}{s^2} + \frac{2 x^T y}{s^2} + o\!\left(\frac{\|x - y\|^2}{s^2}\right)$   (16)

For very large $s$, $K(x, y) \to 1$ and equation (13) is maximized when all $\alpha_i = 0$ except for one $\alpha_j = 1$.

The exact influence of the parameter $s$ on the decision boundary is shown in Figure 3 for $s = 0.3$, $0.4$, $0.6$ and $6$. Circle points stand for support objects; the gray level represents the distance to the sphere center.
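The two limiting cases can be checked numerically. The short sketch below (illustrative; the sample data and width values are arbitrary assumptions) prints the Gaussian kernel matrix for a very small, a moderate and a very large width.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 2))                    # a few arbitrary data objects
D = cdist(X, X, "sqeuclidean")                 # pairwise ||x_i - x_j||^2

for s in (0.05, 1.0, 100.0):
    K = np.exp(-D / s ** 2)                    # equation (12)
    print(f"s = {s}\n{np.round(K, 3)}\n")
# s -> 0:   off-diagonal entries vanish, K -> identity (Case 1: every object a support vector)
# s -> inf: every entry tends to 1 (Case 2)
```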
To study the generalization or the overfitting
characteristics of the SVDD, we have to get an
indication of (1) the number of target patterns that
will be rejected (errors of the first kind) and (2) of the
number of outlying patterns that will be accepted
(errors of the second kind).
We can estimate the error of the first kind by
applying the leave-one-out method on the training set
containing the target class. When leaving out an
object from the training set which is no support
object, the original description is found. When a
support object is left out, the optimal sphere
description can be made smaller and this left-out
object will then be rejected. Thus the error can be
estimated by:
# SV
E[P(error)]= N
(17)
#
SV
Where
is the number of support vectors.
Using a Gaussian kernel, we can regulate the number of support vectors, and therefore also the error of the first kind, by changing the width parameter $s$. When the number of support vectors is too large we have to increase $s$; when it is too low we have to decrease $s$. This guarantees that the width parameter of the SVDD is adapted to the problem at hand for a given target error, as shown in Figure 3.
Fig.3 Margin of the Gaussian-kernel SVDD vs. the Gaussian width parameter
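As a rough illustration of this rule, the sketch below estimates the error of the first kind from the support-vector count (equation (17)) and adjusts $s$ by bisection toward a target rate. It reuses fit_svdd from the earlier sketch and assumes the number of support vectors decreases roughly monotonically with $s$; that monotonicity is an assumption of the sketch, not a statement of the paper.

```python
def first_kind_error(alpha, tol=1e-6):
    # leave-one-out estimate of the target-rejection rate, equation (17): #SV / N
    return np.count_nonzero(alpha > tol) / len(alpha)

def tune_width(X, target_error, s_lo=0.1, s_hi=50.0, iters=15):
    """Bisection on s: too many support vectors -> increase s, too few -> decrease s."""
    for _ in range(iters):
        s = 0.5 * (s_lo + s_hi)
        alpha, _ = fit_svdd(X, s)          # fit_svdd from the SVDD sketch above
        if first_kind_error(alpha) > target_error:
            s_lo = s                       # too many support vectors: search larger widths
        else:
            s_hi = s                       # acceptable: try a tighter (smaller) width
    return 0.5 * (s_lo + s_hi)
```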
The chance that outlying objects will be accepted by the description, the error of the second kind, cannot be estimated by this measure, because we assumed that only a training set of the target class is available. In an intrusion detection application, however, we can use simulated intrusion data, so that the error function can be represented as:

$v(s) = \frac{\#SV}{N_{target}} + \frac{\#JD}{N_{outlier}}$   (18)

where $\#JD$ is the number of outlier patterns accepted.
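A direct reading of (18) as code might look like the following sketch, which again leans on fit_svdd and accept from the SVDD sketch above; the helper name v_error and the support-vector threshold are illustrative.

```python
def v_error(s, X_target, X_outlier, tol=1e-6):
    """Equation (18): v(s) = #SV / N_target + #JD / N_outlier,
    where #JD is the number of simulated intrusion samples that are accepted."""
    alpha, r2 = fit_svdd(X_target, s)
    n_sv = np.count_nonzero(alpha > tol)
    n_jd = np.count_nonzero(accept(X_outlier, X_target, alpha, r2, s))
    return n_sv / len(X_target) + n_jd / len(X_outlier)
```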
We use a genetic algorithm (GA) to find the optimal solution. GAs are parallel, iterative global optimizers and have been successfully applied to a broad spectrum of optimization problems, including many pattern recognition and classification tasks. The algorithm is as follows (a rough sketch in code is given after the steps):

1 Set the evaluation function that calculates each individual's fitness value (equation (18)), the population size $N$, the crossover rate $P_c$, the mutation rate $P_m$ and the iteration number $L$.

2 Initialization: randomly select the initial population $X(0) = \{x_1(0), x_2(0), \ldots, x_N(0)\}$ and set $t = 0$.

3 Calculate the fitness values: for the $t$-th population $X(t) = \{x_1(t), x_2(t), \ldots, x_N(t)\}$, calculate the fitness value $f_i = f(x_i(t))$ of every individual $x_i(t)$ $(1 \le i \le N)$.

4 Genetic operation: for every individual $x_i(t)$ $(1 \le i \le N)$ in the population $X(t)$, set a copy rate $P_i(t)$ according to its fitness value, and execute $N/2$ steps of the following operations:

4.1 Selection: select two individuals $x_{i1}(t)$ and $x_{i2}(t)$ from the population $X(t)$ according to each individual's fitness value.

4.2 Crossover: cross $x_{i1}(t)$ and $x_{i2}(t)$ to generate two new individuals $x'_{i1}(t)$ and $x'_{i2}(t)$ according to the crossover rate $P_c$.

4.3 Mutation: mutate $x'_{i1}(t)$ and $x'_{i2}(t)$ to generate two new individuals $x''_{i1}(t)$ and $x''_{i2}(t)$ according to the mutation rate $P_m$.

4.4 The new $(t+1)$-th generation population $X(t+1) = \{x_1(t+1) = x''_{11}(t), \ldots, x_N(t+1) = x''_{(N/2)2}(t)\}$ consists of the $N$ new individuals $x''_{i1}(t)$ and $x''_{i2}(t)$ $(1 \le i \le N/2)$. Remember the best individuals.

5 Set $L = L - 1$; if $L = 0$, stop; otherwise go to step 3.
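The following sketch ties steps 1-5 together for the one-dimensional case where each individual encodes the Gaussian width $s$ and the fitness is v_error from the sketch above. Fitness-proportional selection, arithmetic crossover and Gaussian mutation are stand-in operator choices; the paper does not specify which operators were used, so all parameter values here are illustrative.

```python
def ga_optimize_width(X_target, X_outlier, pop_size=20, p_c=0.8, p_m=0.1,
                      iters=30, s_range=(0.1, 50.0), seed=0):
    """Minimize v(s) over the Gaussian width s with a simple real-coded GA (steps 1-5)."""
    rng = np.random.default_rng(seed)
    lo, hi = s_range
    pop = rng.uniform(lo, hi, size=pop_size)                 # step 2: initial population
    best_s, best_v = None, np.inf
    for _ in range(iters):                                   # step 5: iterate
        fit = np.array([v_error(s, X_target, X_outlier) for s in pop])   # step 3
        if fit.min() < best_v:                               # step 4.4: remember the best
            best_v, best_s = fit.min(), pop[fit.argmin()]
        w = (fit.max() - fit) + 1e-9                         # step 4.1: smaller v(s) -> larger weight
        parents = rng.choice(pop, size=pop_size, p=w / w.sum())
        children = parents.copy()
        for i in range(0, pop_size - 1, 2):                  # step 4.2: arithmetic crossover
            if rng.random() < p_c:
                lam = rng.random()
                a, b = parents[i], parents[i + 1]
                children[i] = lam * a + (1 - lam) * b
                children[i + 1] = lam * b + (1 - lam) * a
        mut = rng.random(pop_size) < p_m                     # step 4.3: Gaussian mutation
        children[mut] += rng.normal(0.0, 0.1 * (hi - lo), mut.sum())
        pop = np.clip(children, lo, hi)                      # next generation
    return best_s, best_v
```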
5 Experiments
We selected a training set containing 300 normal cases. The evaluation data consisted of 5000 abnormal samples, and the test data comprised 1500 normal samples. The samples were already preprocessed. Figure 4 demonstrates the influence of the Gaussian width parameter on the fraction of normal data rejected and on the number of support vectors.
Fig.5 ROC curves of the Gaussian-kernel and polynomial-kernel SVDD
When the width is decreased, the acceptance of normal data decreases while the rejection of outlier data increases. When the polynomial kernel order is increased, the acceptance of normal data decreases while the rejection of outlier data increases. Figure 6 demonstrates the influence of the width parameter on the normal data rejected, the outlier data accepted and the number of support vectors. Figure 7 demonstrates the influence of the order parameter on the normal data rejected, the outlier data accepted and the number of support vectors.
Fig.6 Target class rejected, number of support vectors and outliers accepted vs. the Gaussian width parameter
When the Gaussian kernel is selected, we optimize the width parameter by the genetic algorithm, as shown in Figure 8.
Fig.4 Target class rejected and number of support vectors vs. the Gaussian kernel width parameter
In the experiments we find that as the Gaussian width increases, the number of support vectors and the rate of rejection of normal samples decrease. The results are consistent with the principles described in section 4. The ROC curves of the Gaussian kernel and the polynomial kernel are shown in Figure 5.
The following table compares the results of different methods. The results demonstrate that the proposed approach outperforms the other two-class classifiers.
Table 2: Comparison of results of different methods

Method                    Recognition rate
Gaussian kernel SVDD      99.02%
Linear kernel SVDD        98.66%
MLP                       85.5%
Negative immunogenetic    96.4%
Fig.7 Target class rejected, number of support vectors and outliers accepted vs. the polynomial order parameter

6 Conclusion
This paper described the implementation of a novel one-class classification approach to intrusion detection based on support vector data description. We discussed the behavior of the classifier under different parameter choices and presented a genetic-algorithm-based way to determine the optimal parameters. Finally, results were reported on the DARPA'99 evaluation data. The results demonstrate that when the set of normal training samples is sufficiently large, the proposed method outperforms other two-class classifiers.
Fig.8 Fitness value vs. iteration
Finally, we ran the experiment with different numbers of training samples. The results are shown in Table 1.

Table 1: Results for different numbers of training samples
(each entry: normal rejected (%) / abnormal accepted (%))

Number of training samples   Kernel = rbf                Kernel = linear
200                          0.117333 / 0                0.107333 / 0
300                          0.0493333 / 0.00261523      0.0486667 / 0.00326904
400                          0.036 / 0.00261523          0.028 / 0.00326904
600                          0.028 / 0.00294214          0.0266667 / 0.00359595
800                          0.00266667 / 0.00686499     0.00333333 / 0.0104609
1000                         0 / 0.00980713              0.002 / 0.0114416
The results show that the performance of the proposed approach based on support vector data description is better when the amount of normal training data is sufficiently large; when the training set is small, the result is not as good. But in an intrusion detection application it is relatively easy to obtain enough normal training data.

References:
[1] D.M.J. Tax and R.P.W. Duin. Data domain description using support vectors. In M. Verleysen, editor, Proceedings of the European Symposium on Artificial Neural Networks 1999, pages 251-256, Brussels, April 1999. D-Facto.
[2] D.M.J. Tax, A. Ypma, and R.P.W. Duin. Support vector data description applied to machine vibration analysis. To appear in the Proceedings of ASCI'99.
[3] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
[4] G. Baudat and F. Anouar. Generalized discriminant analysis using a kernel approach. Neural Computation, Vol. 12, No. 1, 2000.
[5] B. Scholkopf, J.C. Platt, J. Shawe-Taylor, A.J. Smola, and R.C. Williamson. Estimating the support of a high-dimensional distribution. Microsoft Research Technical Report MSR-TR-99-87, 1999.
[6] C. Campbell and K. Bennett. A linear programming approach to novelty detection. To appear in Advances in Neural Information Processing Systems 14, Morgan Kaufmann, 2001.
[7] O. Chapelle and V. Vapnik. Model selection for support vector machines. To appear in Advances in Neural Information Processing Systems 12, eds. S.A. Solla, T.K. Leen and K.-R. Muller, MIT Press, 2000.