A Lightweight Intrusion Detection Model Based on Feature Selection and Maximum Entropy Model

Yang Li1,2, Bin-Xing Fang1, You Chen1,2, Li Guo1
1 Software Division, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China
2 Graduate School of Chinese Academy of Sciences, Beijing 100080, China
Email: liyang@software.ict.ac.cn

Abstract: Intrusion detection is a critical component of secure information systems. Current intrusion detection systems (IDS), especially network intrusion detection systems (NIDS), examine all data features to detect intrusions. However, some of the features may be redundant or contribute little to the detection process, and examining them degrades system performance. This paper proposes a lightweight intrusion detection model, based on feature selection and the Maximum Entropy (ME) model, that is both computationally efficient and effective. First, the issue of identifying important input features is addressed: eliminating insignificant or useless inputs simplifies the problem and therefore leads to faster and more accurate detection. Second, the classic ME model is used to learn and detect intrusions using the selected important features. Experimental results on the well-known KDD 1999 dataset show that the proposed model is effective and can be applied to real-time intrusion detection environments.

I. INTRODUCTION

An Intrusion Detection System (IDS) plays a vital role in detecting various kinds of attacks. The main purpose of an IDS is to find intrusions among normal audit data, which can be considered a classification problem. The two basic methods of detection are signature based and anomaly based [1]. The signature-based method, also known as misuse detection, looks for a specific signature to match, signaling an intrusion. Such systems can detect many or all known attack patterns, but they are of little use for as yet unknown attack methods.
Most popular intrusion detection systems fall into this category. Another approach to intrusion detection is called anomaly detection. Anomaly detection systems are computationally expensive because of the overhead of keeping track of, and possibly updating, several system profile metrics. Many IDSs have been developed during the past three decades, and most of the commercial and freeware IDS tools are signature based. As new attacks appear and the amount of audit data increases, an IDS must keep pace with both. In addition, as network speed increases, there is an emerging need for security analysis techniques that can keep up with the increased network throughput [2]. Therefore, the IDS itself should be lightweight (that is, have relatively low computational cost) while guaranteeing high detection rates. One of the main problems with IDSs is their overhead [3]; detecting intrusions in real time is therefore a difficult task.

This paper is funded by the National Grand Fundamental Research 973 Program of China (No. 2004CB318109).

In this paper, we propose a new lightweight intrusion detection model. Our model mainly focuses on how to detect intrusions effectively in a real-time network environment. As preliminary work, we focus on intrusion detection at the network traffic level. First, we extract a small set of necessary and highly important features (called core features) from the KDD 1999 dataset by means of the Information Gain and Chi-Square approaches. Second, we adopt the Maximum Entropy model to learn a classifier that detects attacks based on the selected features. The results of experiments on the KDD 1999 dataset indicate the feasibility of our model.

The remainder of this paper is organized as follows. In Section 2, we propose the overall model based on feature selection and the Maximum Entropy model. Section 3 and Section 4 detail the feature selection approach and the Maximum Entropy model used in our system, respectively.
Section 5 discusses the relevant experiments and evaluations. We conclude our work in Section 6.

II. OUR INTRUSION DETECTION MODEL

The overall model of our approach is depicted in Fig. 1. In the training process, network traffic data is preprocessed (data packets are labeled with classes such as normal and abnormal) and passed to our feature selection engine, which uses both the Information Gain and the Chi-Square approach. Afterwards, the dataset is used to build the Maximum Entropy based intrusion detection model using the selected features. In the testing process, network traffic data is sent directly to our intrusion detection model for detection.

The main advantage of our lightweight model is that feature selection removes redundant and unimportant features and thereby reduces the computational cost of intrusion detection. Moreover, the Maximum Entropy model has proved to be a good classifier when given adequate input features, which makes it effective in the field of intrusion detection.

[Fig. 1. Overall intrusion detection model. Blocks: network traffic data, preprocessor, training data, Information Gain based feature selection, Chi-Square based feature selection, Maximum Entropy Model based classifier, testing data, detection.]

III. FEATURE SELECTION

In this section, we use both the classic Information Gain (IG) and the Chi-Square feature selection approach to select features from the KDD 1999 dataset. We combine IG and Chi-Square because relying on a single feature selection approach could produce biased results; agreement between the two gives more confidence in the selected features.

A. Feature Selection Based on Information Gain (IG)

The theory of Information Gain can be described as follows [4]. Let S be a set of training samples with their corresponding labels. Suppose there are m classes, the training set contains $s_i$ samples of class i, and s is the total number of samples in the training set. The expected information needed to classify a given sample is calculated by:

$$I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2\!\left(\frac{s_i}{s}\right) \tag{1}$$

A feature F with values $\{f_1, f_2, \ldots, f_v\}$ can divide the training set into v subsets $\{S_1, S_2, \ldots, S_v\}$, where $S_j$ is the subset which has the value $f_j$ for feature F. Furthermore, let $S_j$ contain $s_{ij}$ samples of class i. The entropy of the feature F is:

$$E(F) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj}) \tag{2}$$

The information gain for F can be calculated as:

$$Gain(F) = I(s_1, \ldots, s_m) - E(F) \tag{3}$$

In our experiments, information gain is calculated for class labels by employing a binary discrimination for each class. That is, for each class, a dataset instance is considered in-class if it has the same label; otherwise, it is considered out-class. Consequently, instead of calculating one information gain as a general measure of the relevance of the feature for all classes, we calculate an information gain for each class. This signifies how well the feature can discriminate the given class (i.e. normal or an attack type) from the other classes. Section 5 gives the detailed results.

B. Feature Selection Based on Chi-Square

The Chi-Square approach is a simple and general algorithm that uses the $\chi^2$ statistic to discretize numeric features repeatedly until some inconsistencies are found in the data, and achieves feature selection via discretization [5]. The $\chi^2$ measure is defined to be:

$$\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{k} \frac{(A_{ij} - E_{ij})^2}{E_{ij}} \tag{4}$$

where:
k = number of classes,
$A_{ij}$ = number of patterns in the i-th interval, j-th class,
$R_i$ = number of patterns in the i-th interval = $\sum_{j=1}^{k} A_{ij}$,
$C_j$ = number of patterns in the j-th class = $\sum_{i=1}^{2} A_{ij}$,
N = total number of patterns = $\sum_{i=1}^{2} R_i$,
$E_{ij}$ = expected frequency of $A_{ij}$ = $R_i \cdot C_j / N$.

If either $R_i$ or $C_j$ is 0, $E_{ij}$ is set to 0.1. The number of degrees of freedom of the $\chi^2$ statistic is one less than the number of classes.

In this paper, we use the Chi-Square approach described above to select the features that best distinguish the five classes (normal, DoS, Probe, U2R, and R2L). Section 5 gives the detailed results.

IV. INTRUSION DETECTION BASED ON MAXIMUM ENTROPY MODEL

As described in Section 2, after selecting the important and necessary features with the feature selection approaches, we use them as input features to the Maximum Entropy model. Having formed a classifier based on these features and realistic network traffic data, we can use the Maximum Entropy model-based classifier to detect intrusions in a real-time environment.

Maximum entropy (ME) modeling [6] has been successfully used in the fields of machine learning, information retrieval, computer vision, and econometrics. From a practitioner's point of view, its advantage is that it can handle a large set of features that are interdependent. Features are automatically weighted by an optimization process, resulting in a model that maximizes the conditional likelihood of the class labels x, given the training data y, or $p(x \mid y)$.

In more detail, the ME principle works as follows. Given a set of features, a set of corresponding functions $f_i$ ($i = 1, 2, 3, \ldots, k$), each of which measures the contribution of one feature to the model, and a set of constraints, we have to find the probability distribution that satisfies the constraints and maximizes the entropy (equivalently, minimizes the relative entropy to the uniform distribution). That is, given the constraint set (5) and the information entropy (6), we can describe the problem as (7).

$$P = \{\, p \mid p(f_i) = p^*(f_i),\ i = 1, 2, 3, \ldots, k \,\} \tag{5}$$

$$H(p) = -\sum_{x, y} p^*(y)\, p(x \mid y) \log p(x \mid y) \tag{6}$$

In (6), $0 \le H(p) \le \log |x|$, where $|x|$ is the number of possible outcomes x.
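As a quick numerical check of the entropy in (6), consider a hypothetical case with a single context y (so that its weight in (6) is 1) and two outcomes $x_1$ and $x_2$. The probabilities below are illustrative assumptions, not values from the experiments, and natural logarithms are used:

```latex
\begin{align*}
p(x_1 \mid y) = p(x_2 \mid y) = 0.5 &: \quad H(p) = -2 \cdot 0.5 \ln 0.5 = \ln 2 \approx 0.693 \\
p(x_1 \mid y) = 0.9,\ p(x_2 \mid y) = 0.1 &: \quad H(p) = -(0.9 \ln 0.9 + 0.1 \ln 0.1) \approx 0.325 \\
p(x_1 \mid y) = 1,\ p(x_2 \mid y) = 0 &: \quad H(p) = 0 \quad (\text{taking } 0 \ln 0 := 0)
\end{align*}
```

The uniform prediction attains the largest value and the fully certain one the smallest, which is why maximizing H(p) under the constraints favors the least committed distribution that still matches the training statistics.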
$$p^* = \arg\max_{p \in P} H(p) \tag{7}$$

To obtain the solution, we use the method of Lagrange multipliers, which yields (8):

$$p^*(x \mid y) = \frac{1}{Z(y)} \exp\!\Big(\sum_i \lambda_i f_i(x, y)\Big) \tag{8}$$

$$Z(y) = \sum_{x} \exp\!\Big(\sum_i \lambda_i f_i(x, y)\Big) \tag{9}$$

In (8), p is the probability distribution obtained from training and $p^*$ stands for the probability distribution model that we construct. $p^*(x \mid y)$ denotes the probability of predicting an outcome x in the given context y with the constraint feature functions $f_i(x, y)$. Formula (9) is a normalization factor that ensures $\sum_x p^*(x \mid y) = 1$. The parameters $\lambda_i$ can be derived from an iterative algorithm called Generalized Iterative Scaling (GIS). It updates the weights by scaling the existing parameters to increase the likelihood of the observed features. In this paper, we carry out 100 iterations of GIS in our experiments.

Moreover, ME modeling can also suffer from overfitting. This problem is most noticeable when the features in the training data occur infrequently, resulting in sparse data. In these cases, the weights derived by the ME model for these sparse features may not accurately reflect the test data. To address this weakness, smoothing with a Gaussian prior has been advocated [7]. We use a Gaussian prior (with σ = 1) to smooth our models in all of our experiments.

The maximum entropy approach described in this work exhibits several advantages. First, it provides IDS administrators with a multi-dimensional view of the network traffic by classifying packets according to a set of features carried by a packet. Second, it detects intrusions that cause abrupt changes in the network traffic, as well as those that increase traffic slowly. A large deviation from the baseline distribution can only be caused by packets that make up an unusual portion of the traffic. If an intrusion occurs, no matter how slowly it increases its traffic, it can be detected once the relative entropy increases to a certain level. Third, it provides information about the type of the intrusion detected.

V. EXPERIMENTS AND EVALUATIONS

In our preliminary work, we use the KDD 1999 dataset to test the performance of our ME-based approach, because the dataset is still a common benchmark for evaluating IDS techniques. Moreover, since our approach does not depend on any particular dataset, it is reasonable to use it as the benchmark.

A. Experimental Environment and Dataset

All experiments were performed on a Windows machine with an Intel(R) Pentium(R) 4 1.73 GHz processor and 1 GB RAM, running Microsoft Windows XP Professional (SP2). We used the open source machine learning framework Weka [8] (the latest Windows version at the time of writing was Weka 3.4). It is a collection of machine learning algorithms for data mining tasks and contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.

For feature selection, we randomly selected a subset of the KDD 1999 dataset and applied the IG and Chi-Square approaches to it to obtain the most relevant and necessary features. The subset contains 496,201 records, each with 41 features and a class label. Its classes are normal, DoS, Probe, U2R and R2L; approximately 20% of the records represent normal patterns and the remaining 80% are attacks belonging to the four attack classes. We adopted 10-fold cross validation to verify the feature selection results, and we likewise adopted 10-fold cross validation to evaluate our ME-based lightweight detection model.

B. Results and Evaluations

Table 1 and Table 2 give the feature selection results obtained with the IG and Chi-Square approaches, respectively, from all 41 features in the KDD 1999 dataset.
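The scores behind such rankings follow the definitions in Section 3. As a minimal sketch, assuming each feature has already been discretized into a small number of values, the IG of (1)-(3), its per-class binary variant, and the χ² statistic of (4) could be computed as follows. The function names and the toy data are illustrative assumptions; they are not taken from the paper or from Weka.

```python
from collections import Counter
from math import log2


def expected_info(labels):
    """I(s_1, ..., s_m) from (1): entropy of the class distribution, in bits."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels):
    """Gain(F) from (3) for one discrete feature F."""
    total = len(labels)
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    # E(F) from (2): entropy of each subset S_j, weighted by its relative size.
    e_f = sum(len(s) / total * expected_info(s) for s in subsets.values())
    return expected_info(labels) - e_f


def one_vs_rest_gain(values, labels, target):
    """Binary (in-class vs. out-class) information gain for one class."""
    binary = ["in" if label == target else "out" for label in labels]
    return information_gain(values, binary)


def chi_square(table):
    """Chi-square statistic from (4); table[i][j] = A_ij (interval i, class j)."""
    row = [sum(r) for r in table]          # R_i
    col = [sum(c) for c in zip(*table)]    # C_j
    n = sum(row)                           # N
    stat = 0.0
    for i, counts in enumerate(table):
        for j, a_ij in enumerate(counts):
            if row[i] == 0 or col[j] == 0:
                e_ij = 0.1                 # convention stated with (4)
            else:
                e_ij = row[i] * col[j] / n
            stat += (a_ij - e_ij) ** 2 / e_ij
    return stat


if __name__ == "__main__":
    # Toy data (hypothetical): one discretized feature and its class labels.
    values = ["low", "low", "high", "high"]
    labels = ["normal", "normal", "dos", "dos"]
    print("IG:", information_gain(values, labels))
    print("IG (dos vs. rest):", one_vs_rest_gain(values, labels, "dos"))
    print("chi-square:", chi_square([[2, 0], [0, 2]]))
```

In this toy case the feature separates the two classes perfectly, so the gain equals the full class entropy of 1 bit. Ranking real features would repeat these computations for each of the 41 features and sort by score.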
Table 3 shows the performance of our ME model-based approach using all 41 features, and Table 4 shows the performance using only the selected important features.

From Table 1 and Table 2, we can clearly see that IG and Chi-Square lead to almost the same results, i.e. both rank the same 12 features as the most important. This indicates that the results are reasonable and largely independent of the feature selection algorithm.

TABLE I. FEATURE SELECTION RESULTS BASED ON IG

Rank  Feature                       IG
1     dst_host_rerror_rate          0.286
2     src_bytes                     0.283
3     dst_bytes                     0.278
4     hot                           0.263
5     dst_host_srv_rerror_rate      0.258
6     num_compromised               0.231
7     srv_rerror_rate               0.077
8     rerror_rate                   0.076
9     count                         0.075
10    srv_count                     0.074
11    dst_host_srv_diff_host_rate   0.046
12    service                       0.036

TABLE II. FEATURE SELECTION RESULTS BASED ON CHI-SQUARE

Rank  Feature                       Chi-Square
1     src_bytes                     17586.107
2     dst_host_rerror_rate          17368.831
3     dst_bytes                     17073.438
4     dst_host_srv_rerror_rate      17032.989
5     hot                           16503.031
6     num_compromised               14357.396
7     srv_count                     5060.741
8     count                         3125.14
9     dst_host_srv_diff_host_rate   2607.774
10    srv_rerror_rate               2421.084
11    rerror_rate                   2209.594
12    service                       2078.218

It should be noted that the results in Table 4 were obtained using the top 6 features from Table 1 and Table 2 as input features for the ME model. We chose six because, in both tables, the IG and Chi-Square scores of the top 6 features are clearly separated from those of the remaining features (for example, IG drops from 0.231 at rank 6 to 0.077 at rank 7), i.e. the remaining features have little or no effect on intrusion detection. The results in Table 3 and Table 4 are very good (in particular, the accuracy for DoS attacks is 100%), and they demonstrate two important facts: i) the selected features play the same important role in intrusion detection as the full feature set; ii) using the selected features instead of all 41 features reduces the computational cost substantially (testing time falls by roughly 33% to 47% per class) with negligible loss of accuracy (at most 0.04 percentage points). Therefore, the selected features can be used in a real-time lightweight intrusion detection environment.

TABLE III. DETECTION RESULTS ON ALL 41 FEATURES

Class   Testing Time (Sec)  Accuracy (%)
Normal  1.28                99.75
Probe   2.09                99.80
DoS     1.93                100
U2R     1.05                99.89
R2L     1.02                99.78

TABLE IV. DETECTION RESULTS ON SELECTED FEATURES

Class   Testing Time (Sec)  Accuracy (%)
Normal  0.78                99.73
Probe   1.25                99.76
DoS     1.03                100
U2R     0.70                99.87
R2L     0.68                99.75

VI. CONCLUSIONS

In this paper, we proposed a new lightweight intrusion detection model. First, feature selection based on the Information Gain and Chi-Square approaches is performed on the KDD 1999 training set, which is widely used by machine learning researchers. Second, an ME model learns from the selected features and uses them for intrusion detection. Experimental results on the KDD 1999 dataset show that detection accuracy is high and that the model is reasonable. Moreover, its computational cost is relatively low, owing to the adoption of feature selection and Maximum Entropy, so it can be applied to real-time intrusion detection environments (see Table 3 and Table 4). In future work, we will apply our model in a realistic environment to verify its real-time performance and effectiveness.

REFERENCES

[1] M. Bykova, S. Ostermann and B. Tjaden, "Detecting network intrusions via a statistical analysis of network packet characteristics", in Proc. of the 33rd Southeastern Symp. on System Theory, Athens, OH, IEEE, 2001.
[2] C. Kruegel and F. Valeur, "Stateful Intrusion Detection for High-Speed Networks", in Proc. of the IEEE Symposium on Research on Security and Privacy, pp. 285–293, 2002.
[3] T. Bass, "Intrusion detection systems and multisensor data fusion", Communications of the ACM, 43 (4), pp. 99–105, 2000.
[4] W. Duch, T. Winiarski, J. Biesiada and A. Kachel, "Feature Selection and Ranking Filters", 2003, http://metet.polsl.katowice.pl/~jbiesiada/prace/selekcja/03Istambul.pdf.
[5] H. Liu and R. Setiono, "Chi2: feature selection and discretization of numeric attributes", in Proc. of the Seventh International Conference on Tools with Artificial Intelligence, pp. 388–391, 1995.
[6] A. Ratnaparkhi, "A Simple Introduction to Maximum Entropy Models for Natural Language Processing", University of Pennsylvania, Tech. Rep., 1997.
[7] M.-Y. Kan and H. O. N. Thoi, "Fast Webpage Classification Using URL Features", CIKM'05 of ACM, 2005.
[8] "Weka Machine Learning Project", http://www.cs.waikato.ac.nz/~ml/.