Performance Assessment of Robust Ensemble Model for Intrusion Detection using Decision Tree Techniques

Reshamlal Pradhan, M.Tech. Scholar (CSE), MATS University, Raipur (C.G.), INDIA, reshamlalpradhan6602@gmail.com
Deepak Kumar Xaxa, Asst. Professor, Department of Computer Science, MATS University, Raipur (C.G.), INDIA, xaxadeepak@gmail.com

ABSTRACT
Intrusion Detection System (IDS) is one of the major research concerns in network security. It is the process of detecting security violations by monitoring and analyzing the events occurring in a computer system or network. An IDS can be developed using various machine learning techniques such as classification and prediction. As a classifier, an IDS labels data as normal or anomalous. In this paper we present a performance assessment of a robust ensemble model [1] for intrusion detection using decision tree techniques. The techniques tested are J48, Random Forest, Stacking, Bagging and Boosting on the NSL-KDD dataset using the WEKA tool. WEKA is an open source software package that provides a collection of machine learning algorithms for data mining tasks.

General Terms
Algorithm, Classification, Ensemble technique, Intrusion Detection, Network security.

Keywords
Bagging, Boosting, Confusion matrix, Intrusion detection system (IDS), J48, Random Forest, Stacking, WEKA.

1. INTRODUCTION
The security of our computer systems and data is at continual risk. The extensive growth of the Internet and the increasing availability of tools and tricks for intruding into and attacking networks have made intrusion detection a critical component of network administration. An intrusion can be defined as any set of actions that threaten the integrity, confidentiality, or availability of a network resource (such as user accounts, file systems, system kernels, and so on). Intrusion detection systems (IDSs) [1,2,3] are software or hardware systems that automate the process of monitoring the events occurring in a computer system or network and analyzing them for signs of security problems (intrusions). An IDS can be developed using various machine learning techniques such as classification and prediction. Classification is one of the most common applications of data mining, in which similar samples are grouped together in a supervised manner. An IDS is a classifier that classifies data as normal or attack. A general framework is depicted in Figure 1.

Fig 1: General Framework of IDS

2. DECISION TREE TECHNIQUE
A decision tree [1,5,6,7] is a data mining technique. It is popular because the construction of decision tree classifiers does not require any domain knowledge or parameter setting, and it is therefore appropriate for exploratory knowledge discovery. Decision trees can also handle high-dimensional data.

2.1 J48
J48 [2,4,13] is an open source Java implementation of the C4.5 algorithm in the WEKA data mining tool. C4.5 is a program that creates a decision tree from a set of labeled input data. The decision trees generated by C4.5 can be used for classification, and for this reason C4.5 is often referred to as a statistical classifier. The J48 decision tree classifier follows a simple algorithm. In order to classify a novel item it first builds a decision tree from the attribute values of the available training data. Whenever it encounters a set of items (a training set), it finds the attribute that discriminates the instances most clearly. The feature that tells us the most about the data instances, so that they can be classified best, is said to have the highest information gain. The steps are:

Pseudo code:
1. Check for base cases.
2. For each attribute a:
   a. Find the normalized information gain from splitting on a.
   b. Let a_best be the attribute with the highest normalized information gain.
3. Create a decision node that splits on a_best.
4. Recurse on the sublists obtained by splitting on a_best, and add those nodes as children of the node.

Now, among the possible values of this feature, if there is any value for which there is no ambiguity, that is, for which all data instances falling within its category have the same value for the target variable, then that branch is terminated and assigned the target value obtained.
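As an illustration of this workflow, the short Java sketch below builds a J48 tree with the WEKA API and estimates its accuracy with 10-fold cross-validation. It is only a minimal sketch: the ARFF file name and the assumption that the class label is the last attribute are illustrative and not taken from this paper.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Example {
    public static void main(String[] args) throws Exception {
        // Load the training data (hypothetical ARFF export of the NSL-KDD training set).
        Instances data = DataSource.read("KDDTrain.arff");
        // Assumption: the class label (normal / attack) is the last attribute.
        data.setClassIndex(data.numAttributes() - 1);

        // J48 is WEKA's open source implementation of the C4.5 decision tree.
        J48 tree = new J48();
        tree.buildClassifier(data);

        // Estimate performance with 10-fold cross-validation, as described in Section 5.1.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println("Accuracy: " + eval.pctCorrect() + " %");
        System.out.println(eval.toMatrixString("Confusion matrix"));
    }
}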
2.2 Random Forest
Random Forest [7,9,14] is an ensemble classifier. It constructs a series of classification trees which are then used to classify new examples. The idea is to build multiple decision trees, each of which uses a subset of attributes randomly selected from the whole original set of attributes. Random Forest is an effective prediction tool in data mining. It employs the bagging method to produce a randomly sampled set of training data for each of the trees. The method also semi-randomly selects splitting features: a random subset of a given size is drawn from the space of possible splitting features, and the best splitting feature is deterministically selected from that subset. To classify a test instance, Random Forest simply combines the results from all of the trees in the forest; the combination can be as simple as predicting the class returned by the largest number of trees. A pseudo code of random forest construction is given below.

To generate c classifiers:
for i = 1 to c do
    Randomly sample the training data D with replacement to produce Di
    Create a root node, Ni, containing Di
    Call BuildTree(Ni)
end for

BuildTree(N):
if N contains instances of only one class then
    return
else
    Randomly select x% of the possible splitting features in N
    Select the feature F with the highest information gain to split on
    Create f child nodes of N, N1, ..., Nf, where F has f possible values (F1, ..., Ff)
    for i = 1 to f do
        Set the contents of Ni to Di, where Di is all instances in N that match Fi
        Call BuildTree(Ni)
    end for
end if

3. ENSEMBLE TECHNIQUES
An ensemble model [1] is a combination of two or more models intended to avoid the drawbacks of the individual models and to achieve higher accuracy. Bagging, Boosting and Stacking are techniques that use a combination of models.

3.1 Bagging
The term bagging [1,2,5,10] stands for bootstrap aggregation. Given a set D of d tuples, bagging works as follows. For iteration i (i = 1, 2, ..., k), a training set Di of d tuples is sampled with replacement from the original set of tuples D. Each training set is a bootstrap sample. Because sampling with replacement is used, some of the original tuples of D may not be included in Di, whereas others may occur more than once. A classifier model Mi is learned for each training set Di. To classify an unknown tuple X, each classifier Mi returns its class prediction, which counts as one vote. The bagged classifier M* counts the votes and assigns the class with the most votes to X. Bagging can also be applied to the prediction of continuous values by taking the average of the predictions for a given test tuple. The algorithm is:

Algorithm: Bagging. The bagging algorithm creates an ensemble of models (classifiers or predictors) for a learning scheme where each model gives an equally weighted prediction.
Input:
    D, a set of d training tuples;
    k, the number of models in the ensemble;
    a learning scheme (e.g., decision tree algorithm, back propagation, etc.)
Output: A composite model, M*.
Method:
(1) for i = 1 to k do // create k models:
(2)     create bootstrap sample Di by sampling D with replacement;
(3)     use Di to derive a model, Mi;
(4) end for
To use the composite model on a tuple X:
(1) if classification then
(2)     let each of the k models classify X and return the majority vote;
(3) if prediction then
(4)     let each of the k models predict a value for X and return the average predicted value;
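A minimal WEKA sketch of this procedure is given below, using J48 as the base learner Mi. The ARFF file name and the choice of k = 10 bagging iterations are assumptions made for illustration only.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.Bagging;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BaggingExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("KDDTrain.arff"); // hypothetical ARFF file
        data.setClassIndex(data.numAttributes() - 1);

        // Bagging: draw k bootstrap samples of D, learn one J48 model per sample,
        // and classify unseen tuples by majority vote.
        Bagging bagger = new Bagging();
        bagger.setClassifier(new J48()); // base learner Mi
        bagger.setNumIterations(10);     // k, the number of models in the ensemble

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(bagger, data, 10, new Random(1));
        System.out.println("Bagged J48 accuracy: " + eval.pctCorrect() + " %");
    }
}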
3.2 Boosting
In boosting [1,2,5,10] weights are assigned to each training tuple. A series of k classifiers is iteratively learned. After a classifier Mi is learned, the weights are updated to allow the subsequent classifier Mi+1 to "pay more attention" to the training tuples that were misclassified by Mi. The final boosted classifier M* combines the votes of the individual classifiers, where the weight of each classifier's vote is a function of its accuracy. The boosting algorithm can be extended to the prediction of continuous values. AdaBoost is a popular boosting algorithm.

Algorithm: AdaBoost. A boosting algorithm that creates an ensemble of classifiers, each of which gives a weighted vote.
Input:
    D, a set of d class-labeled training tuples;
    k, the number of rounds (one classifier is generated per round);
    a classification learning scheme.
Output: A composite model.
Method:
(1) initialize the weight of each tuple in D to 1/d;
(2) for i = 1 to k do // for each round:
(3)     sample D with replacement according to the tuple weights to obtain Di;
(4)     use training set Di to derive a model, Mi;
(5)     compute error(Mi), the error rate of Mi;
(6)     if error(Mi) > 0.5 then
(7)         reinitialize the weights to 1/d;
(8)         go back to step 3 and try again;
(9)     end if
(10)    for each tuple in Di that was correctly classified do
(11)        multiply the weight of the tuple by error(Mi) / (1 - error(Mi)); // update weights
(12)    normalize the weight of each tuple;
(13) end for
To use the composite model to classify tuple X:
(1) initialize the weight of each class to 0;
(2) for i = 1 to k do // for each classifier:
(3)     wi = log((1 - error(Mi)) / error(Mi)); // weight of the classifier's vote
(4)     c = Mi(X); // get the class prediction for X from Mi
(5)     add wi to the weight for class c;
(6) end for
(7) return the class with the largest weight;
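The following sketch shows how the same idea can be run in WEKA through its AdaBoostM1 implementation, again with J48 as the base learner. The file name and the number of rounds are illustrative assumptions, not values reported in this paper.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BoostingExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("KDDTrain.arff"); // hypothetical ARFF file
        data.setClassIndex(data.numAttributes() - 1);

        // AdaBoost: reweight the tuples after each round so the next J48 focuses on
        // previously misclassified instances; each classifier votes with weight
        // log((1 - error) / error).
        AdaBoostM1 booster = new AdaBoostM1();
        booster.setClassifier(new J48()); // base learner Mi
        booster.setNumIterations(10);     // k rounds

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(booster, data, 10, new Random(1));
        System.out.println("Boosted J48 accuracy: " + eval.pctCorrect() + " %");
    }
}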
3.3 Stacking
Stacking [2,10] is the abbreviation for Stacked Generalization. Unlike bagging and boosting, it uses different learning algorithms to generate the ensemble of classifiers. The main idea of stacking is to combine classifiers from different learners, such as decision trees, instance-based learners, etc. Since each learner uses a different knowledge representation and different learning biases, the hypothesis space is explored differently and different classifiers are found. Once the classifiers have been generated they must be combined. One way to combine their outputs is by voting, the same mechanism used in bagging, but (unweighted) voting only makes sense if the learning schemes perform comparably well; if the majority of the classifiers make bad predictions, voting leads to a bad final classification, and it is not clear which classifier to trust. To resolve this problem, stacking introduces the concept of a meta learner, which replaces the voting procedure. Stacking tries to learn which classifiers are the reliable ones, using another learning algorithm, the meta learner, to discover how best to combine the outputs of the base learners. The input to the meta model, also called the level-1 model, is the predictions of the base models, or level-0 models. A level-1 instance has as many attributes as there are level-0 learners, and the attribute values give the predictions of these learners on the corresponding level-0 instance. When the stacked learner is used for classification, an instance is first fed into the level-0 models, and each one guesses a class value. These guesses are fed into the level-1 model, which combines them into the final prediction. The procedure is:

Input: Data set D = {(x1, y1), (x2, y2), ..., (xm, ym)};
       First-level learning algorithms L1, ..., LT;
       Second-level learning algorithm L.
Process:
for t = 1, ..., T:
    ht = Lt(D)      % Train a first-level individual learner ht by applying the
                    % first-level learning algorithm Lt to the original data set D
end;
D' = Ø;             % Generate a new data set
for i = 1, ..., m:
    for t = 1, ..., T:
        zit = ht(xi)    % Use ht to classify the training example xi
    end;
    D' = D' ∪ {((zi1, zi2, ..., ziT), yi)}
end;
h' = L(D')          % Train the second-level learner h' by applying the second-level
                    % learning algorithm L to the new data set D'
Output: H(x) = h'(h1(x), ..., hT(x))
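In WEKA this scheme corresponds to the Stacking meta-classifier. The sketch below uses J48 and Random Forest as the level-0 learners, as in the proposed model; the level-1 (meta) learner shown here, logistic regression, and the file name are assumptions made only for illustration.

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.Logistic;
import weka.classifiers.meta.Stacking;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class StackingExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("KDDTrain.arff"); // hypothetical ARFF file
        data.setClassIndex(data.numAttributes() - 1);

        // Level-0 (base) learners: J48 and Random Forest.
        Stacking stacker = new Stacking();
        stacker.setClassifiers(new Classifier[] { new J48(), new RandomForest() });
        // Level-1 (meta) learner that learns how to combine the base predictions;
        // logistic regression is an assumed choice, not one specified by the paper.
        stacker.setMetaClassifier(new Logistic());

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(stacker, data, 10, new Random(1));
        System.out.println("Stacked ensemble accuracy: " + eval.pctCorrect() + " %");
    }
}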
4. PROPOSED FRAMEWORK
The overall objective of the proposed research work is to build a robust ensemble model [1] for the classification of data. The classification model consists of two phases.

(1) Model building. Model building is performed on the training set. It is the supervised learning of a training set of data to build a model. To build the ensemble model, individual data mining techniques (J48, Random Forest) are first applied to the dataset as classifiers. Then, through ensemble techniques (bagging, boosting, stacking), the outputs of the individual models are combined to form a robust ensemble model. The ensemble model then works as a classifier that classifies the data as normal or attack.

(2) Model validation. Model validation is performed on the test set. It consists of classifying the data according to the model built in the model building phase.

The ensemble model is depicted in Figure 2.

Fig 2: Ensemble model

5. EXPERIMENTAL SETUP
5.1 Experimental Design
We used WEKA 3.7.10, a machine learning tool, to measure the classification performance of the ensemble techniques. WEKA [3,4,12] is a data mining system developed by the University of Waikato in New Zealand that implements data mining algorithms in the Java language. It is a collection of machine learning algorithms for data mining tasks, and the algorithms can be applied directly to a dataset. WEKA implements algorithms for data preprocessing, classification, regression, clustering and association rules; it also includes visualization tools. We chose the decision tree classifiers (J48, Random Forest) and the ensemble classifiers with the full training set and 10-fold cross-validation for testing purposes. In 10-fold cross-validation, the available data is randomly divided into 10 disjoint subsets of approximately equal size. One of the subsets is used as the test set and the remaining 9 are used for building the classifier; the test set is then used to estimate the accuracy. This is repeated 10 times so that each subset is used as a test set once. The accuracy estimate is the mean of the estimates obtained for each of the classifiers.

The well-known NSL-KDD [8,11] dataset is used for our experiment. NSL-KDD is a newer dataset consisting of selected records of the complete KDD data set. The data is collected over a TCP/IP network; each record has 41 quantitative and qualitative features plus one class feature (attack type). There are 22 types of attack in the training dataset and 37 types of attack in the test dataset.

5.2 Confusion Matrix
The confusion matrix [1,2,3] is used for the evaluation of a classifier. It is most commonly encountered in a two-class format, but can be generated for any number of classes. A single prediction by a classifier can have four outcomes, which are displayed in the confusion matrix. Table 1 depicts the confusion matrix.

Table 1: Confusion matrix

                      PREDICTED
ACTUAL        NEGATIVE        POSITIVE
NEGATIVE      TN              FP
POSITIVE      FN              TP

The entries in the confusion matrix have the following meaning in the context of our study: TN is the number of correct predictions that an instance is negative, FP is the number of incorrect predictions that an instance is positive, FN is the number of incorrect predictions that an instance is negative, and TP is the number of correct predictions that an instance is positive.

During the testing phase, the testing dataset is given as input to the proposed technique and the obtained result is assessed with the evaluation metrics precision, recall and accuracy.

The accuracy (AC) is the proportion of the total number of predictions that were correct. It is determined using the equation:
AC = (TP + TN) / (TP + FN + FP + TN)

The recall or sensitivity or true positive rate (TPR) is the proportion of positive cases that were correctly identified:
TPR = TP / (TP + FN)

The precision (P) is the proportion of the predicted positive cases that were correct:
Precision = TP / (TP + FP)

The F-measure is the harmonic mean of precision and recall:
F = 2 * Recall * Precision / (Recall + Precision)
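For clarity, the small Java sketch below computes these four measures from illustrative confusion-matrix counts; the numbers are placeholders, not the counts obtained in this study.

public class ConfusionMetrics {
    public static void main(String[] args) {
        // Illustrative counts only; the paper does not list its raw confusion-matrix values.
        double tp = 900, tn = 950, fp = 40, fn = 30;

        double accuracy  = (tp + tn) / (tp + tn + fp + fn);            // AC
        double recall    = tp / (tp + fn);                             // TPR / sensitivity
        double precision = tp / (tp + fp);                             // P
        double fMeasure  = 2 * recall * precision / (recall + precision);

        System.out.printf("Accuracy = %.4f, Recall = %.4f, Precision = %.4f, F = %.4f%n",
                accuracy, recall, precision, fMeasure);
    }
}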
6. RESULTS
We used the two decision tree techniques J48 and Random Forest as individual classifiers. We also used these two classifiers with the ensemble techniques Stacking, Bagging and Boosting to form ensemble classifiers. The classifier results are evaluated using the confusion matrix. Table 2 gives the accuracy of the different classifiers and shows that the ensemble classifiers perform better than the individual classifiers.

Table 2: Accuracy of classifiers on NSL-KDD dataset

S. No.    Data Mining Technique    Accuracy (%)
1         J48                      98.59
2         Random Forest            98.55
3         Stacking                 98.66
4         Boosting                 98.60
5         Bagging                  98.71

A graph plot of the accuracy of the different classifiers is given in Figure 3.

Fig 3: Accuracy of Different Classifiers

Table 3 shows the performance of the different classifiers using the precision, recall and F-measure parameters derived from the confusion matrix.

Table 3: Performance of classifiers on NSL-KDD dataset

Data Mining Technique    Attack Type    Precision    Recall    F-measure
J48                      Normal         98.2         98.6      98.4
                         Attack         98.9         98.6      98.8
Random Forest            Normal         98.2         98.5      98.3
                         Attack         98.8         98.6      98.7
Stacking                 Normal         98.4         98.5      98.4
                         Attack         98.9         98.8      98.8
Bagging                  Normal         98.5         98.5      98.5
                         Attack         98.9         98.9      98.9
Boosting                 Normal         98.4         98.3      98.4
                         Attack         98.8         98.8      98.7

A graph plot of the performance of the different classifiers is provided in Figure 4.

Fig 4: Performance of Different Classifiers

Tables 2 and 3 show that the ensemble classifiers perform better than the individual classifiers not only in accuracy but also in the other confusion-matrix parameters: precision, recall and F-measure.

7. CONCLUSION
This research set out to discover the best-performing classification algorithm for intrusion detection. The experimental results show that the Bagging classifier provides the highest accuracy at 98.71%, Stacking provides an accuracy of 98.66% and Boosting provides an accuracy of 98.60%, all better than the accuracy of the individual classifiers J48 and Random Forest. The ensemble techniques also provide better results than the individual classifiers in precision, recall and F-measure.

In the present study we focused on the decision tree techniques J48 and Random Forest and considered only a few parameters for model evaluation. For further research, different data mining techniques can be tested in the ensemble model. Feature selection techniques can also be applied to the NSL-KDD dataset to gain improved performance with reduced feature subsets.

8. REFERENCES
[1] Pradhan, R. L., et al. (2014). "Robust ensemble model for intrusion detection using data mining techniques".
[2] Nagle, M. K., et al. (2013). "Feature Extraction Based Classification Technique for Intrusion Detection System", International Journal of Engineering Research and Development.
[3] Mukherjee, S. (2012). "Intrusion detection using Bayes classifier with feature reduction", Procedia Technology.
[4] Kalyani, G., et al. (2012). "Performance assessment of different classification techniques for intrusion detection".
[5] Han, J., Kamber, M. (2006). Data Mining: Concepts and Techniques, Second edition, Morgan Kaufmann Publishers, San Francisco, USA.
[6] Pujari, A. K. (2001). Data Mining Techniques, 4th edition, Universities Press (India) Private Limited.
[7] Panda, M. (2011). "A hybrid intelligent approach for network intrusion detection", Procedia Engineering.
[8] Revathi, S., et al. (2013). "A Detailed Analysis on NSL-KDD Dataset Using Various Machine Learning Techniques".
[9] Sirikulviriya, N. (2011). "Integration of rules from a random forest".
[10] Zhou, Z.-H. Ensemble Learning.
[11] http://nsl.cs.unb.ca/NSL-KDD
[12] http://www.cs.waikato.ac.nz/~ml/weka
[13] http://en.wikipedia.org/wiki/C4.5_algorithm
[14] Breiman, L. (2001). Random Forests. Machine Learning, 45(1):5-32.