A novel approach to intrusion detection based on Support Vector Data Description

XINMIN TAO
Communication Tech Institute
Harbin Institute of Technology
150001 CHINA

Abstract: - This paper presents a novel one-class classification approach to intrusion detection based on support vector data description (SVDD). The approach separates the target class data from all other possible outlier class data, which are unknown at training time. SVDD-based intrusion detection determines an arbitrarily shaped region that encloses the target class of a dataset. The paper analyzes the behavior of the classifier with respect to parameter selection and proposes a method based on a genetic algorithm to determine the optimal parameters. Finally, results are reported on the DARPA'99 evaluation data. The results demonstrate that the proposed method outperforms other two-class classifiers.

Key-Words: - support vector, intrusion detection, kernel function, support vector data description

1 Introduction

With the growing rate of interconnection among computer systems, network security is becoming a major challenge. In order to meet this challenge, Intrusion Detection Systems (IDS) are being designed to protect the availability, confidentiality and integrity of critical networked information systems. They protect computer networks against denial-of-service (DoS) attacks, unauthorized disclosure of information and the modification or destruction of data. Early in the research into IDS, two major principles known as signature detection and anomaly detection were established: the former relies on flagging all behavior that matches a previously defined pattern signature of a known intrusion, while the latter flags behavior that is abnormal for an entity. Accordingly, many approaches have been proposed, including statistical, machine learning, data mining, neural network and immunologically inspired techniques.

An IDS based on anomaly detection can be treated as a one-class classifier. In one-class classification, one set of data, called the target set, has to be distinguished from the rest of the feature space. In many anomaly detection applications, however, negative (abnormal) samples are not available at the training stage. For instance, in a computer security application it is difficult, if not impossible, to have information about all possible attacks. In machine learning terms, the lack of samples from the abnormal class makes supervised techniques (e.g. two-class classification) hard to apply, so the obvious solution is to use a one-class classifier. In one-class classification, the task is not to distinguish between classes of objects as in classification problems, nor to produce a desired outcome for each input object as in regression problems, but to give a description of a set of objects, called the target class. This description should be able to distinguish between the class of objects represented by the training set and all other possible objects in the object space, called the outlier class. In general, the problem of one-class classification is harder than the problem of normal two-class classification: for normal classification the decision boundary is supported from both sides by examples of each of the classes, whereas in one-class classification only one set of data is available, so only one side of the boundary is covered. One-class classification is often solved using density estimation or a model-based approach.
In this paper we propose a novel approach to anomaly detection based on support vector data description, inspired by the Support Vector Classifier (Vapnik, 1998). Instead of using a hyperplane to distinguish between two classes, a hypersphere around the target set is used. The paper analyzes the geometric character of the kernel space and the influence of the kernel parameters on the behavior of the classifier, and presents a method based on a genetic algorithm to determine the optimal kernel parameter. We start with an introduction of the mathematical prerequisites in section 2. In section 3 an explanation of the Support Vector Data Description is given. In section 4 we discuss how model selection is performed for different kernel parameters. In section 5 the results of the experiments are presented, and section 6 concludes the paper.

2 Mathematical basis

Definition 2.1 (Convex set). A set X in a vector space is called convex if for any x, x' ∈ X and any λ ∈ [0, 1] we have

\lambda x + (1 - \lambda) x' \in X. \qquad (1)

Definition 2.2 (Convex function). A function f defined on a set X is called convex if, for any x, x' ∈ X and any λ ∈ [0, 1] such that λx + (1 − λ)x' ∈ X, we have

f(\lambda x + (1 - \lambda) x') \le \lambda f(x) + (1 - \lambda) f(x'). \qquad (2)

A function f is called strictly convex if (2) holds with strict inequality for x ≠ x' and λ ∈ (0, 1).

Definition 2.3 (Constrained problem). Minimize f(x) subject to c_i(x) ≤ 0 for all i ∈ [n], where f and the c_i are convex functions and n ∈ N. In some cases we additionally have equality constraints e_j(x) = 0 for some j ∈ [n']. The optimization problem can then be written as

\min_x f(x) \quad \text{subject to} \quad c_i(x) \le 0 \ \text{for all } i \in [n], \quad e_j(x) = 0 \ \text{for all } j \in [n']. \qquad (3)

Theorem 2.1 (KKT for differentiable convex problems). A solution to the optimization problem (3) with convex, differentiable f and c_i is given by x̄ if there exists an ᾱ ∈ R^n with ᾱ_i ≥ 0 for i ∈ [n] such that the following conditions are satisfied:

\partial_x L(\bar{x}, \bar{\alpha}) = \partial_x f(\bar{x}) + \sum_{i=1}^{n} \bar{\alpha}_i \, \partial_x c_i(\bar{x}) = 0 \quad \text{(saddle point in } x\text{)} \qquad (4)

\partial_{\alpha_i} L(\bar{x}, \bar{\alpha}) = c_i(\bar{x}) \le 0 \quad \text{(saddle point in } \alpha\text{)} \qquad (5)

\sum_{i=1}^{n} \bar{\alpha}_i \, c_i(\bar{x}) = 0 \quad \text{(vanishing KKT gap)} \qquad (6)

Definition 2.4 (Normal space). A set of feature vectors Self ⊂ S represents the normal states of the system. Its complement is called Non_Self and is defined as Non_Self = S \ Self. In many cases we define the Self (or Non_Self) set through its characteristic function

\chi_{self} : [x_{min}, x_{max}]^n \to \{0, 1\}, \qquad \chi_{self}(x) = 1 \ \text{if } x \in Self, \quad \chi_{self}(x) = 0 \ \text{if } x \in Non\_Self.
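To make Theorem 2.1 concrete, the following minimal sketch projects a point onto the unit disk with SciPy and checks the three KKT conditions numerically. The toy problem, the variable names and the use of scipy.optimize.minimize are illustrative assumptions, not part of the original paper.

```python
# Toy illustration of Theorem 2.1 (not from the paper): project a point p onto the unit
# disk, i.e. minimize f(x) = ||x - p||^2 subject to c(x) = ||x||^2 - 1 <= 0, then check
# the KKT conditions (4)-(6) numerically.
import numpy as np
from scipy.optimize import minimize

p = np.array([2.0, 1.0])                    # point outside the feasible set
f = lambda x: float(np.sum((x - p) ** 2))   # convex objective
c = lambda x: float(np.sum(x ** 2) - 1.0)   # convex inequality constraint c(x) <= 0

# SLSQP expects inequality constraints of the form g(x) >= 0, so we pass -c(x).
res = minimize(f, x0=np.zeros(2), method="SLSQP",
               constraints=[{"type": "ineq", "fun": lambda x: -c(x)}])
x_bar = res.x

grad_f = 2.0 * (x_bar - p)   # gradient of f at the solution
grad_c = 2.0 * x_bar         # gradient of c at the solution
# Recover the multiplier from the stationarity condition grad_f + alpha * grad_c = 0 (eq. 4).
alpha = -float(grad_f @ grad_c) / float(grad_c @ grad_c)

print("x_bar        :", x_bar)                    # approximately p / ||p||
print("alpha >= 0   :", alpha)                    # approximately ||p|| - 1
print("saddle in x  :", grad_f + alpha * grad_c)  # approximately 0  (eq. 4)
print("feasibility  :", c(x_bar))                 # <= 0 up to tolerance  (eq. 5)
print("KKT gap      :", alpha * c(x_bar))         # approximately 0  (eq. 6)
```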
3 Support Vector Data Description

The Support Vector Data Description (SVDD) is the method we use to describe our data. It is inspired by the Support Vector Classifier of Vapnik (see [3]). The SVDD is explained in more detail in [1]; here we only give a quick impression of the method. The idea is to find the sphere with minimal volume which contains all the data. Assume we have a data set containing N data objects {x_i, i = 1, ..., N}, and that the sphere is described by center a and radius R. We minimize an error function containing the volume of the sphere. The constraints that the objects lie within the sphere are imposed by applying Lagrange multipliers:

L(R, a, \alpha_i) = R^2 - \sum_i \alpha_i \{ R^2 - (x_i^2 - 2 a \cdot x_i + a^2) \} \qquad (7)

with Lagrange multipliers α_i ≥ 0. This function has to be minimized with respect to R and a and maximized with respect to the α_i. Setting the partial derivatives of L with respect to R and a to zero gives:

\frac{\partial L}{\partial R} = 2R - 2R \sum_i \alpha_i = 0: \quad \sum_i \alpha_i = 1

\frac{\partial L}{\partial a} = 2 \sum_i \alpha_i x_i - 2 a \sum_i \alpha_i = 0: \quad a = \frac{\sum_i \alpha_i x_i}{\sum_i \alpha_i} = \sum_i \alpha_i x_i \qquad (8)

This shows that the center of the sphere a is a linear combination of the data objects x_i. Resubstituting these values into the Lagrangian gives the function to maximize with respect to the α_i:

L = \sum_i \alpha_i (x_i \cdot x_i) - \sum_{i,j} \alpha_i \alpha_j (x_i \cdot x_j) \qquad (9)

together with the KKT condition following from equation (7):

\alpha_i \{ R^2 - (x_i^2 - 2 a \cdot x_i + a^2) \} = 0 \qquad (10)

and the constraints 0 ≤ α_i, Σ_i α_i = 1. This function should be maximized with respect to the α_i. In practice this means that a large fraction of the α_i become zero; for a small fraction α_i > 0, and the corresponding objects are called support objects. The center of the sphere depends only on these few support objects; objects with α_i = 0 can be disregarded. An object z is accepted when:

(z - a)(z - a)^T = \left(z - \sum_i \alpha_i x_i\right)\left(z - \sum_i \alpha_i x_i\right)^T \le R^2,

i.e.

(z \cdot z) - 2 \sum_i \alpha_i (z \cdot x_i) + \sum_{i,j} \alpha_i \alpha_j (x_i \cdot x_j) \le R^2 \qquad (11)

In general this does not give a very tight description. Analogous to the method of Vapnik [3], we can replace the inner products (x · y) in equations (9) and (11) by kernel functions K(x, y), which gives a much more flexible method. When we replace the inner products by Gaussian kernels, for instance, we obtain:

(x \cdot y) \to K(x, y) = \exp(-(x - y)^2 / s^2) \qquad (12)

We now observe the distribution of the data φ(x) mapped into the Gaussian kernel feature space. Because the norm of the mapped data objects φ(x) is ⟨φ(x), φ(x)⟩ = K(x, x) = exp(0) = 1, all data objects in the high-dimensional feature space lie on a spherical surface around the origin, as shown in Figure 1.

Fig. 1: Distribution of the data objects in the inner-product feature space (normal data, outliers, the origin, and the sphere with minimal volume).

Equation (9) now changes into:

L = 1 - \sum_i \alpha_i^2 - \sum_{i \ne j} \alpha_i \alpha_j K(x_i, x_j) \qquad (13)

and the formula to check whether a new object z is within the sphere (equation (11)) becomes:

1 - 2 \sum_i \alpha_i K(z, x_i) + \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j) \le R^2 \qquad (14)

Other kernel functions, for example the polynomial kernel k(x, y) = (x · y)^d, will be discussed in section 5.

With the Gaussian kernel we obtain a more flexible description than the rigid sphere description. In Figure 2 both methods are shown applied to the same two-dimensional data set. The sphere description on the left includes all objects, but is by no means tight: it includes large areas of the feature space where no target patterns are present. The right graph shows the data description using Gaussian kernels, which clearly gives a superior description; no empty areas are included, which minimizes the chance of accepting outlier patterns.

Fig. 2: Left graph: boundary of the linear-kernel SVDD; right graph: boundary of the Gaussian-kernel SVDD.
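As a minimal sketch of how the dual (13) and the decision rule (14) can be evaluated in practice, the following Python code fits a Gaussian-kernel SVDD with a generic constrained optimizer. The helper names (gauss_kernel, svdd_fit, svdd_accept), the use of scipy.optimize.minimize and the synthetic data are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of a Gaussian-kernel SVDD fitted via equations (13)-(14). The helper
# names, the SciPy optimizer and the synthetic data are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize

def gauss_kernel(X, Y, s):
    """K(x, y) = exp(-||x - y||^2 / s^2) for all pairs of rows of X and Y (eq. 12)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / s ** 2)

def svdd_fit(X, s):
    """Maximize eq. (13) subject to alpha_i >= 0 and sum_i alpha_i = 1."""
    N = len(X)
    K = gauss_kernel(X, X, s)
    # Since K(x, x) = 1 and sum(alpha) = 1, maximizing (13) = 1 - alpha^T K alpha
    # is the same as minimizing alpha^T K alpha.
    res = minimize(lambda a: a @ K @ a, x0=np.full(N, 1.0 / N),
                   jac=lambda a: 2.0 * K @ a,
                   bounds=[(0.0, 1.0)] * N,
                   constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1.0}],
                   method="SLSQP")
    alpha = res.x
    sv = alpha > 1e-6                       # support objects have alpha_i > 0
    # R^2 is the left-hand side of eq. (14) evaluated at the support objects (eq. 10).
    R2 = np.mean(1.0 - 2.0 * gauss_kernel(X[sv], X, s) @ alpha + alpha @ K @ alpha)
    return alpha, R2

def svdd_accept(Z, X, alpha, R2, s):
    """Eq. (14): accept z when 1 - 2 sum_i alpha_i K(z, x_i) + alpha^T K alpha <= R^2."""
    K = gauss_kernel(X, X, s)
    dist2 = 1.0 - 2.0 * gauss_kernel(Z, X, s) @ alpha + alpha @ K @ alpha
    return dist2 <= R2

# Toy usage on synthetic two-dimensional "normal" data (purely illustrative).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))
alpha, R2 = svdd_fit(X_train, s=2.0)
print("support objects:", int((alpha > 1e-6).sum()), " R^2:", R2)
print(svdd_accept(np.array([[0.0, 0.0], [6.0, 6.0]]), X_train, alpha, R2, s=2.0))
```

Any off-the-shelf quadratic programming routine could replace the SLSQP call here; the essential ingredients are only the kernel matrix, the simplex constraints on α and the threshold R².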
4 Model selection

With the Gaussian kernel there is one extra free parameter, the width parameter s in the kernel (equation (12)). We discuss two limiting cases.

Case 1: as s → 0, the Gaussian kernel (12) converges to

\exp(-\|x_i - x_j\|^2 / s^2) \to \delta_{ij}, \qquad \delta_{ij} = 1 \ \text{for } i = j, \quad \delta_{ij} = 0 \ \text{for } i \ne j \qquad (15)

and equation (13) is maximized when each object supports its own kernel, i.e. α_i = 1/N.

Case 2: as s → ∞, a Taylor expansion of the Gaussian kernel (12) gives

\exp(-\|x - y\|^2 / s^2) = 1 - \frac{\|x - y\|^2}{s^2} + o\!\left(\frac{\|x - y\|^2}{s^2}\right) = 1 - \frac{\|x\|^2 - 2 x \cdot y + \|y\|^2}{s^2} + o\!\left(\frac{1}{s^2}\right) \qquad (16)

For very large s, K(x, y) → 1 and equation (13) is maximized when all α_i = 0 except for one α_j = 1.

The exact influence of the parameter s on the decision boundary is shown in Figure 3, where circled points stand for support objects and the gray level represents the distance to the sphere center.

Fig. 3: Boundary of the Gaussian-kernel SVDD for width parameters s = 0.3, 0.4, 0.6 and 6; circled points are support objects, the gray level indicates the distance to the sphere center.

To study the generalization and overfitting characteristics of the SVDD, we need an indication of (1) the number of target patterns that will be rejected (errors of the first kind) and (2) the number of outlying patterns that will be accepted (errors of the second kind). The error of the first kind can be estimated by applying the leave-one-out method to the training set containing the target class. When an object that is not a support object is left out of the training set, the original description is found. When a support object is left out, the optimal sphere description can be made smaller and the left-out object will then be rejected. Thus the error can be estimated by:

E[P(\text{error})] = \frac{\#SV}{N} \qquad (17)

where #SV is the number of support vectors. Using a Gaussian kernel, we can regulate the number of support vectors, and therefore the error of the first kind, by changing the width parameter s: when the number of support vectors is too large we increase s, and when it is too low we decrease s. This guarantees that the width parameter of the SVDD is adapted to the problem at hand given the target error, as illustrated in Figure 3.

The chance that outlying objects are accepted by the description, the error of the second kind, cannot be estimated by this measure, because we assumed that only a training set of the target class is available. In intrusion detection applications, however, we can use simulated intrusion data, so that the error function becomes:

v(s) = \frac{\#SV}{N_{target}} + \frac{\#JD}{N_{outlier}} \qquad (18)

where #JD is the number of outlier patterns accepted.

We use a genetic algorithm (GA) to find the optimal width parameter. GAs are parallel, iterative global optimizers that have been successfully applied to a broad spectrum of optimization problems, including many pattern recognition and classification tasks. The algorithm is as follows (a sketch implementing these steps is given after the list):

1. Set the evaluation function that calculates each individual's fitness value (18), the population size N, the crossover rate P_c, the mutation rate P_m and the number of iterations L.
2. Initialization: randomly select an initial population X(0) = {x_1(0), x_2(0), ..., x_N(0)} and set t = 0.
3. Fitness evaluation: for the t-th population X(t) = {x_1(t), x_2(t), ..., x_N(t)}, compute the fitness value f_i = f(x_i(t)) of each individual x_i(t) (1 ≤ i ≤ N).
4. Genetic operations: for every individual x_i(t) in X(t), compute a reproduction probability P_i(t) according to its fitness value, and then execute the following steps N/2 times:
   4.1 Selection: select two individuals x_{i1}(t) and x_{i2}(t) from the population X(t) according to the individuals' fitness values.
   4.2 Crossover: cross x_{i1}(t) and x_{i2}(t) to generate two new individuals x'_{i1}(t) and x'_{i2}(t) according to the crossover rate P_c.
   4.3 Mutation: mutate x'_{i1}(t) and x'_{i2}(t) to generate two new individuals x''_{i1}(t) and x''_{i2}(t) according to the mutation rate P_m.
   4.4 New generation: the (t+1)-th population X(t+1) = {x_1(t+1), ..., x_N(t+1)} consists of the N new individuals x''_{i1}(t) and x''_{i2}(t) (1 ≤ i ≤ N/2). Remember the best individual found so far.
5. Set L = L − 1; if L = 0, stop; otherwise go to step 3.
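The following is a minimal sketch of steps 1-5 applied to the Gaussian width s. The real fitness would train the SVDD for each candidate s and evaluate v(s) from equation (18) by counting #SV and accepted outliers; here a made-up smooth stand-in is used so the sketch is self-contained, and all rates, bounds and names are illustrative assumptions.

```python
# Minimal sketch of the genetic algorithm of section 4 applied to the Gaussian width s.
# The fitness below is a placeholder for v(s) in eq. (18); smaller is better.
import numpy as np

rng = np.random.default_rng(0)

def fitness(s):
    # Placeholder for v(s) = #SV / N_target + #JD / N_outlier; a smooth stand-in with a
    # minimum near s = 2, used purely for illustration.
    return (np.log(s) - np.log(2.0)) ** 2

def ga_optimize(pop_size=20, p_cross=0.8, p_mut=0.1, iters=50, s_min=0.05, s_max=20.0):
    pop = rng.uniform(s_min, s_max, size=pop_size)          # step 2: initial population
    best = min(pop, key=fitness)
    for _ in range(iters):                                   # step 5: iteration counter
        fit = np.array([fitness(s) for s in pop])            # step 3: fitness values
        prob = fit.max() - fit + 1e-12                       # step 4: reproduction probabilities
        prob /= prob.sum()                                    #         (low v(s) -> high probability)
        new_pop = []
        for _ in range(pop_size // 2):
            i1, i2 = rng.choice(pop_size, size=2, p=prob)     # step 4.1: selection
            a, b = pop[i1], pop[i2]
            if rng.random() < p_cross:                        # step 4.2: arithmetic crossover
                w = rng.random()
                a, b = w * a + (1 - w) * b, w * b + (1 - w) * a
            if rng.random() < p_mut:                          # step 4.3: mutation
                a = np.clip(a + rng.normal(scale=0.5), s_min, s_max)
            if rng.random() < p_mut:
                b = np.clip(b + rng.normal(scale=0.5), s_min, s_max)
            new_pop += [a, b]
        pop = np.array(new_pop)                               # step 4.4: next generation
        best = min([best, *pop], key=fitness)                 # remember the best individual
    return best

print("optimal width s =", ga_optimize())
```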
5 Experiments

We selected a training set containing 300 normal cases. The evaluation data consisted of 5000 abnormal samples, and the test data comprised 1500 normal samples. The samples were already preprocessed.

Figure 4 demonstrates the influence of the Gaussian width parameter on the fraction of normal data rejected and on the number of support vectors. In the experiments we find that as the Gaussian width increases, the number of support vectors and the rejection rate of normal samples both decrease. These results are consistent with the principles described in section 4.

Fig. 4: Target class rejected and number of support vectors vs. Gaussian kernel width parameter.

The ROC curves of the Gaussian kernel and the polynomial kernel are shown in Figure 5. When the width is decreased, the acceptance of normal data decreases while the rejection of outlier data increases. Likewise, when the polynomial kernel order is increased, the acceptance of normal data decreases while the rejection of outlier data increases.

Fig. 5: ROC of the Gaussian-kernel and polynomial-kernel SVDD.

Figure 6 demonstrates the influence of the width parameter on the normal data rejected, the outlier data accepted and the number of support vectors. Figure 7 demonstrates the influence of the polynomial order parameter on the same quantities.

Fig. 6: Target class rejected, number of support vectors and outliers accepted vs. Gaussian width parameter.

Fig. 7: Target class rejected, number of support vectors and outliers accepted vs. polynomial order parameter.

When the Gaussian kernel is selected, we optimize the width parameter with the genetic algorithm; the evolution of the fitness value is shown in Figure 8.

Fig. 8: Fitness value vs. iteration.

We also ran the experiment with different numbers of training samples. The results are shown in Table 1.

Table 1: Results for different numbers of training samples
(entries are normal rejected (%) / abnormal accepted (%))
  Number of training samples   Kernel = RBF                Kernel = linear
  200                          0.117333 / 0                0.107333 / 0
  300                          0.0493333 / 0.00261523      0.0486667 / 0.00326904
  400                          0.036 / 0.00261523          0.028 / 0.00326904
  600                          0.028 / 0.00294214          0.0266667 / 0.00359595
  800                          0.00266667 / 0.00686499     0.00333333 / 0.0104609
  1000                         0 / 0.00980713              0.002 / 0.0114416

The results show that the performance of the proposed approach based on support vector data description is better when the normal training data set is sufficiently large; when the number of training samples is small, the result is not as good. In intrusion detection applications, however, it is relatively easy to obtain enough normal training data.

Finally, Table 2 compares the results of the different methods. The results demonstrate that the proposed approach outperforms the other two-class classifiers.

Table 2: Comparison of results of different methods
  Method                    Recognition rate
  Gaussian kernel SVDD      99.02%
  Linear kernel SVDD        98.66%
  MLP                       85.5%
  Negative immunogenetic    96.4%
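For completeness, here is a tiny sketch of how the two rates reported in Table 1 (normal rejected and abnormal accepted) can be computed for any trained one-class decision function, e.g. the svdd_accept helper sketched after section 3. The dummy decision rule and the synthetic data are placeholders, not the DARPA'99 sets.

```python
# Illustrative sketch of the two rates reported in Table 1: the fraction of normal test
# data rejected (error of the first kind) and the fraction of abnormal data accepted
# (error of the second kind). 'accept' stands for any trained one-class decision function.
import numpy as np

def error_rates(accept, normal_test, abnormal_test):
    normal_rejected = 1.0 - np.mean(accept(normal_test))    # error of the first kind
    abnormal_accepted = np.mean(accept(abnormal_test))      # error of the second kind
    return normal_rejected, abnormal_accepted

# Toy usage with a dummy decision function that accepts everything inside the unit disk.
rng = np.random.default_rng(1)
accept = lambda Z: (Z ** 2).sum(axis=1) <= 1.0
normal_test = rng.normal(scale=0.5, size=(1500, 2))
abnormal_test = rng.normal(loc=3.0, size=(5000, 2))
print(error_rates(accept, normal_test, abnormal_test))
```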
6 Conclusion

This paper described the implementation of a novel one-class classification approach to intrusion detection based on support vector data description. We discussed the behavior of the classifier with respect to parameter selection and presented a method based on a genetic algorithm to determine the optimal parameters. Results were reported on the DARPA'99 evaluation data. They demonstrate that when the set of normal training samples is sufficiently large, the proposed method outperforms other two-class classifiers.

References:
[1] D.M.J. Tax and R.P.W. Duin. Data domain description using support vectors. In M. Verleysen, editor, Proceedings of the European Symposium on Artificial Neural Networks 1999, pages 251-256, Brussels, April 1999. D-Facto.
[2] D.M.J. Tax, A. Ypma and R.P.W. Duin. Support vector data description applied to machine vibration analysis. To appear in the Proceedings of ASCI'99.
[3] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
[4] G. Baudat and F. Anouar. Generalized discriminant analysis using a kernel approach. Neural Computation, Vol. 12, No. 1, 2000.
[5] B. Scholkopf, J.C. Platt, J. Shawe-Taylor, A.J. Smola and R.C. Williamson. Estimating the support of a high-dimensional distribution. Microsoft Research Technical Report MSR-TR-99-87, 1999.
[6] C. Campbell and K. Bennett. A linear programming approach to novelty detection. To appear in Advances in Neural Information Processing Systems 14, Morgan Kaufmann, 2001.
[7] O. Chapelle and V. Vapnik. Model selection for support vector machines. To appear in Advances in Neural Information Processing Systems 12, eds. S.A. Solla, T.K. Leen and K.-R. Muller, MIT Press, 2000.