International Journal of Engineering Trends and Technology - Volume 3 Issue 5 - 2012

Improved Apriori and KNN approach for Virtual machine based intrusion detection

Suneetha Valluru#1, Mrs. N. Rajeswari#2
#1 M.Tech (CSE), Gudlavalleru Engineering College, Gudlavalleru
#2 Associate Professor (CSE), Gudlavalleru Engineering College, Gudlavalleru

ABSTRACT: In this paper we present a new mechanism for building storage-based intrusion detection systems. As information systems become more accessible from the World Wide Web, the importance of secure networks has increased tremendously. New intelligent Intrusion Detection Systems (IDSs) that are based on sophisticated algorithms, as an alternative to current signature-based detection, are in demand. Intrusion detection is one of the main research directions in network security. Data mining technology applied to a network intrusion detection system (NIDS) can discover new patterns in massive network data and reduce the workload of manually compiling intrusion behavior patterns and normal behavior patterns. Virtualization is becoming an increasingly popular service hosting platform, and IDSs that utilize virtualization have recently been introduced. This work considers a particular challenge in current virtualization-based IDSs. The proposed system uses a new chi-square based feature selection that evaluates the relative importance of individual features. After feature selection, KNN and modified Apriori are applied to the data, with lower false positive rates.

1. INTRODUCTION
Research on and deployment of intrusion detection systems (IDS) have driven the development of increasingly advanced approaches to defeating malicious activities. Normally, these systems are either host-based or network-based.
The host-based IDS is integrated into the host operating environment [1, 2], while the network-based IDS monitors network traffic for suspicious content [3, 4, 5]. Recently, a few projects have focused on storage-based IDS. As shown in [6], several types of intrusion persist after reboot because the persistent data on disk is altered, and such changes can be detected by storage systems. Moreover, storage systems are well suited to this use because they can keep working even when the host system has been intruded. The storage-based IDS is regarded as a promising candidate alongside host-based and network-based technologies.

However, the IDS designer faces a dilemma: if the IDS runs inside the host, it has a whole view of the system but is threatened by more possible attacks. Conversely, if the IDS runs outside the host, it has a poor view of the system. On the one hand, the proposed IDS provides good visibility into the state of the monitored host's file system, based on our file-aware block-level storage. On the other hand, it maintains strong isolation for the IDS by taking advantage of virtual machine technologies. The proposed IDS can be used as an integrated part of the virtual block-level storage. Our approach employs the virtual machine monitor (VMM) to pull the IDS "outside" of the monitored host, which puts a barrier between the IDS and an attacker's malicious activities. The VMM constructs virtual hardware interfaces for the upper-level OS and applications, including the common block-level storage interface. However, attackers mostly deal with the files of the system, so existing storage-based IDS access rules are always defined in terms of files rather than data sectors. How to bridge this gap is the key issue. We present the file-aware block-level storage to address it.
This kind of storage can identify which file is being accessed by the upper-level OS, so that the corresponding file-level intrusion detection (ID) rules can be applied. In this solution, a procedure first helps the virtual storage discover the file system structure in advance; the storage system can then maintain a sector-to-file mapping table and use this knowledge on-line to transform the file-level ID rules into sector-level rules.

2 RELATED WORK
System intelligence continues to move away from the CPU and into devices, and storage system designers use this trend toward excess computing power to perform more complex processing inside storage devices. In a word, storage is becoming more and more intelligent. However, while smart disk systems have great promise, realizing their full potential is difficult because of the narrow interface between storage and the upper-level system. One solution is to adopt a new storage interface. Research projects including NASD [10], Attribute-based Storage [11], and Active Disks [12] put forward the notion of the Object-based Storage Device (OSD) [13], which provides more sophisticated access interfaces than traditional disks and supports variable-length data objects. Another example is CAS [14] [5], which uses content hashes to represent files in the file system. For most legacy systems, however, this solution is impractical, since the storage interfaces of existing desktop operating systems and applications would need to be modified. Therefore, smart disk systems have to acquire knowledge of the upper-level file system and exploit that knowledge to enhance functionality or increase performance while the interface stays the same.

ISSN: 2231-5381    Page 557

Many intrusion detection systems employ techniques for both anomaly and misuse intrusion detection.
The techniques these systems use to detect anomalies vary. Some are based on predicting future patterns of behavior from patterns seen so far, while others rely mainly on statistical approaches to determine anomalous behavior. In both cases, observed behavior that does not match expected behavior is flagged because an intrusion might be indicated. The main techniques used for misuse detection include expert systems, model-based reasoning systems, state transition analysis, and keystroke monitoring.

Xen [4] is another hypervisor that runs on the bare hardware, but it takes a different approach, called paravirtualization. It requires rewriting portions of an operating system written for a Pentium machine, because the interface exported by Xen is not identical to that of a Pentium machine. Xen is open source, and work is currently in progress to port ISIS to Xen.

3 PROPOSED ARCHITECTURE
The k-nearest neighbor algorithm is among the simplest of all machine learning algorithms. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors. k is a positive integer, typically small. If k = 1, the object is simply assigned to the class of its nearest neighbor. In binary (two-class) classification problems, it is helpful to choose k to be an odd number, as this avoids tied votes. The same method can be used for regression, simply by assigning the property value for the object to be the average of the values of its k nearest neighbors. It can also be useful to weight the contributions of the neighbors, so that nearer neighbors contribute more to the average than more distant ones. The neighbors are taken from a set of objects for which the correct classification (or, in the case of regression, the value of the property) is known.
This set can be thought of as the training set for the algorithm, though no explicit training step is required. To identify neighbors, the objects are represented by position vectors in a multidimensional feature space. It is usual to use the Euclidean distance, though other distance measures, such as the Manhattan distance, could in principle be used instead. The k-nearest neighbor algorithm is sensitive to the local structure of the data.

The best choice of k depends on the data; generally, larger values of k reduce the effect of noise on the classification but make boundaries between classes less distinct. A good k can be selected by various heuristic techniques, for example cross-validation. The special case where the class is predicted to be the class of the closest training sample (i.e. when k = 1) is called the nearest neighbor algorithm. The accuracy of the k-NN algorithm can be severely degraded by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their importance. Much research effort has been put into selecting or scaling features to improve classification. A particularly popular approach is the use of evolutionary algorithms to optimize feature scaling. Another popular approach is to scale features by the mutual information of the training data with the training classes.

The K-Nearest Neighbors (K-NN) algorithm is a nonparametric method in that no parameters are estimated as, for example, in the multiple linear regression model. Instead, the proximity of neighboring input (x) observations in the training data set and their corresponding output values (y) are used to predict (score) the output values of cases in the validation data set.
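The k-NN description above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`knn_classify`, `knn_regress`, `loo_accuracy`) and the tiny two-feature data set are made up for the example, and leave-one-out cross-validation stands in for whichever heuristic is used to pick k.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_classify(train, query, k, dist=euclidean):
    """Majority vote among the k training points closest to `query`.

    `train` is a list of (feature_vector, label) pairs.
    """
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def knn_regress(train, query, k, dist=euclidean):
    """Inverse-distance weighted average of the k nearest target values."""
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    # The small constant avoids division by zero for an exact match.
    weights = [1.0 / (dist(x, query) + 1e-9) for x, _ in neighbors]
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)

def loo_accuracy(train, k, dist=euclidean):
    """Leave-one-out cross-validation accuracy for a given k."""
    hits = 0
    for i, (x, label) in enumerate(train):
        rest = train[:i] + train[i + 1:]
        hits += knn_classify(rest, x, k, dist) == label
    return hits / len(train)

# Hypothetical, already-scaled 2-D points (illustrative only, not the paper's data).
samples = [
    ((0.10, 0.20), "normal"), ((0.20, 0.10), "normal"), ((0.15, 0.25), "normal"),
    ((0.80, 0.90), "attack"), ((0.90, 0.80), "attack"), ((0.85, 0.95), "attack"),
]

print(knn_classify(samples, (0.2, 0.2), k=3))
print(knn_classify(samples, (0.9, 0.9), k=3, dist=manhattan))
best_k = max((1, 3, 5), key=lambda k: loo_accuracy(samples, k))
print("k chosen by leave-one-out:", best_k)
```

With only three points per class, k = 5 forces every held-out point to be outvoted by the other class, which illustrates why larger k blurs class boundaries on small data sets.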
The output variables can be interval variables, in which case the K-NN algorithm is used for prediction, or categorical (nominal or ordinal), in which case it is used for classification.

Steps:
• Determine the parameter K = number of nearest neighbors
• Calculate the distance between the query instance and all the training samples
• Sort the distances and determine the nearest neighbors based on the K-th minimum distance
• Gather the categories of the nearest neighbors
• Use the simple majority of the categories of the nearest neighbors as the prediction value of the query instance

To review, the chi-square method of hypothesis testing has seven basic steps.[1]
1. State the null and research/alternative hypotheses.
2. Specify the decision rule and the level of statistical significance for the test, i.e., .05, .01, or .001. (A significance level of .01 would mean that the probability of the chi-square value must be .01 or less to reject the null hypothesis, a more stringent criterion than .05.)
3. Compute the expected values.
4. Compute the chi-square statistic.
5. Determine the degrees of freedom for the table. Then identify the critical value of chi-square at the specified level of significance and appropriate degrees of freedom.
6. Compare the computed chi-square statistic with the critical value of chi-square; reject the null hypothesis if the chi-square is equal to or larger than the critical value; retain (fail to reject) the null hypothesis if the chi-square is less than the critical value.
7. State a substantive conclusion, i.e., describe the meaning and importance of the test results in terms of the historical problem under investigation.

IMPROVED APRIORI:
The major steps in association rule mining are:
1. Frequent itemset generation
2. Rules derivation

The Apriori algorithm uses the downward closure property to prune unnecessary branches from further consideration. It needs two parameters, minSupp and minConf. minSupp is used for generating frequent itemsets and minConf is used for rule derivation.

The Apriori algorithm:
1. k = 1;
2. Find the frequent itemsets L_k from C_k, the set of all candidate itemsets;
3. Form C_(k+1) from L_k;
4. k = k+1;
5. Repeat 2-4 until C_k is empty;

Step 2 is called the frequent itemset generation step. Step 3 is called the candidate itemset generation step. Details of these two steps are given below.

Frequent itemset generation
Scan the transaction database D and count each itemset in C_k; if the count is greater than minSupp, add that itemset to L_k.

Candidate itemset generation
For k = 1, C_1 = all itemsets of length 1. For k > 1, generate C_k from L_(k-1) as follows:
The join step: C_k = (k-2)-way join of L_(k-1) with itself. If both {a_1, ..., a_(k-2), a_(k-1)} and {a_1, ..., a_(k-2), a_k} are in L_(k-1), then add {a_1, ..., a_(k-2), a_(k-1), a_k} to C_k. The items are always stored in sorted order.
The prune step: Remove {a_1, ..., a_(k-2), a_(k-1), a_k} if it contains a non-frequent (k-1)-subset.

Modified measure:
The correlation coefficient (CRC) between itemsets can be defined as

CRC(A, B) = SUPP(A ∪ B) / (SUPP(A) × SUPP(B))

where A and B are itemsets. When CRC(A, B) = 1, A and B are independent. When CRC(A, ~B) < 1, A and ~B are negatively correlated. When CRC(~A, B) < 1, ~A and B are negatively correlated.

4. EXPERIMENTAL RESULTS
All experiments were performed on an Intel(R) Core(TM)2 CPU at 2.13 GHz with 2 GB RAM; the operating system platform is Microsoft Windows XP Professional (SP2). The dataset is taken from a real-time PHP-based web analyzer. Some of the results from the dynamic web statistics are given below.

KNN RESULTS:
KNN APPROACH FOR MALICIOUS DATA
======================================
IPLEN = 20 AND TTL = 121: 65.456.567.654.87 (5.0/4.0)
IPLEN = 20 AND TTL = 128: (5.0/4.0)
SEQ != 0xC2B1EBD5 AND ID = 21243 AND TTL = 125: 55.556.567.345.23 (2.0/1.0)
SEQ != 0xC2B1EBD5 AND DgmLen = 45 AND TTL = 123: 34.234.345.234.23 (2.0/1.0)
SEQ != 0xC2B1EBD5 AND PROTOCOL = TCP AND ID = 21243: 666.656.566.566.55 (2.0/1.0)
SEQ != 0xC2B1EBD5 AND DgmLen = 54 AND ID = 41212: (2.0/1.0)
SEQ != 0xC2B1EBD5 AND TTL = 123: 435.675.66.765.55 (4.0/3.0)
IPLEN = 20: 54.655.654.345.345 (5.0/4.0)
IPLEN = 30 AND TTL = 123: (2.0/1.0)
SEQ != 0xC2B1EBD5 AND TTL = 125: 767.55.777.777.777 (3.0/2.0)
SEQ != 0xC2B1EBD5 AND TTL = 132: 666.777.777.776.777 (3.0/2.0)
TTL = 121 AND IPLEN = 50: 665.66.55.666.55 (2.0/1.0)
PROTOCOL = TCP AND TTL = 128 AND IPLEN = 40: 564.666.554.678.45 (2.0/1.0)
PROTOCOL = TCP AND IPLEN = 40: 44.445.654.765.345 (3.0/2.0)
IPLEN = 30: 56.675.65.454.567 (3.0/2.0)
IPLEN = 40 AND TTL != 125 AND TTL = 134: 32.345.33.333.12 (3.0/2.0)
ID = 31232 AND IPLEN = 40: 44.554.654.444.55 (2.0/1.0)
TTL = 125: 43.456.554.343.33 (3.0/2.0)
DgmLen = 56: 34.654.678.654.34 (3.0/2.0)
: 66.777.556.656.44 (3.0/2.0)
Number of Rules : 20

=== ATTACK PROPERTIES ===
Attack Rate   DetectionAccuracy   Class
0.052         0.474               (none listed)
0             0.5                 34.234.345.234.23
0.034         0.483               32.345.33.333.12
0.017         0.491               (none listed)
0.052         0.474               65.456.567.654.87
0.086         0.457               55.556.567.345.23
0.034         0.483               34.654.678.654.34
0.069         0.466               435.675.66.765.55
0.052         0.474               54.655.654.345.345
0.034         0.483               56.675.65.454.567
0.017         0.491               777.777.456.466.56
0.069         0.466               43.456.554.343.33
0             0.5                 33.444.443.333.32
0.034         0.483               44.554.654.444.55
0             0.491               45.554.554.224.45
0.052         0.474               44.445.654.765.345
0             0.474               32.33.453.345.234

IMPROVED APRIORI RESULTS:
ARM RULES:
=======
IMPROVED APRIORI RULES
1. SEQ=0xC2B1EBD5 33 ==> TCPLEN=32 33 CRC:(1)
2. SRC= 25 ==> TCPLEN=32 25 CRC:(1)
3. IPLEN=40 22 ==> TCPLEN=32 22 CRC:(1)
4. PROTOCOL=TCP 20 ==> TCPLEN=32 20 CRC:(1)
5. PROTOCOL=HTTP 20 ==> TCPLEN=32 20 CRC:(1)
6. PROTOCOL=UDP 19 ==> TCPLEN=32 19 CRC:(1)
7. IPLEN=20 15 ==> TCPLEN=32 15 CRC:(1)
8. IPLEN=50 14 ==> TCPLEN=32 14 CRC:(1)
9. SEQ=0xC2B1EBD5 SRC= 13 ==> TCPLEN=32 13 CRC:(1)
10. IPLEN=40 SEQ=0xC2B1EBD5 12 ==> TCPLEN=32 12 CRC:(1)
11. SEQ=0xC2B1EBD5 PROTOCOL=TCP 11 ==> TCPLEN=32 11 CRC:(1)
12. SEQ=0xC2B1EBD5 PROTOCOL=UDP 11 ==> TCPLEN=32 11 CRC:(1)
13. SEQ=0xC2B1EBD5 PROTOCOL=HTTP 11 ==> TCPLEN=32 11 CRC:(1)
14. TTL=121 10 ==> TCPLEN=32 10 CRC:(1)
15. TTL=123 10 ==> TCPLEN=32 10 CRC:(1)
16. TTL=123 10 ==> PROTOCOL=UDP 10 CRC:(1)
17. TTL=128 10 ==> TCPLEN=32 10 CRC:(1)
18. TTL=128 10 ==> PROTOCOL=TCP 10 CRC:(1)
19. TTL=132 10 ==> TCPLEN=32 10 CRC:(1)
20. TTL=132 10 ==> PROTOCOL=TCP 10 CRC:(1)
21. TTL=134 10 ==> TCPLEN=32 10 CRC:(1)
22. TTL=134 10 ==> PROTOCOL=HTTP 10 CRC:(1)
23. TTL=123 PROTOCOL=UDP 10 ==> TCPLEN=32 10 CRC:(1)
24. TTL=123 TCPLEN=32 10 ==> PROTOCOL=UDP 10 CRC:(1)
25. TTL=123 10 ==> TCPLEN=32 PROTOCOL=UDP 10 CRC:(1)
26. TTL=128 PROTOCOL=TCP 10 ==> TCPLEN=32 10 CRC:(1)
27. TTL=128 TCPLEN=32 10 ==> PROTOCOL=TCP 10 CRC:(1)
28. TTL=128 10 ==> TCPLEN=32 PROTOCOL=TCP 10 CRC:(1)
29. TTL=132 PROTOCOL=TCP 10 ==> TCPLEN=32 10 CRC:(1)
30. TTL=132 TCPLEN=32 10 ==> PROTOCOL=TCP 10 CRC:(1)
31. TTL=132 10 ==> TCPLEN=32 PROTOCOL=TCP 10 CRC:(1)
32. TTL=134 PROTOCOL=HTTP 10 ==> TCPLEN=32 10 CRC:(1)
33. TTL=134 TCPLEN=32 10 ==> PROTOCOL=HTTP 10 CRC:(1)
34. TTL=134 10 ==> TCPLEN=32 PROTOCOL=HTTP 10 CRC:(1)
35. IPLEN=40 SRC= 10 ==> TCPLEN=32 10 CRC:(1)
36. SRC= PROTOCOL=HTTP 10 ==> TCPLEN=32 10 CRC:(1)
37. TTL=125 9 ==> TCPLEN=32 9 CRC:(1)
38. TTL=125 9 ==> PROTOCOL=HTTP 9 CRC:(1)
39. DgmLen=56 9 ==> TCPLEN=32 9 CRC:(1)
40. TTL=121 PROTOCOL=UDP 9 ==> TCPLEN=32 9 CRC:(1)
41. TTL=125 PROTOCOL=HTTP 9 ==> TCPLEN=32 9 CRC:(1)
42. TTL=125 TCPLEN=32 9 ==> PROTOCOL=HTTP 9 CRC:(1)
43. TTL=125 9 ==> TCPLEN=32 PROTOCOL=HTTP 9 CRC:(1)
44. IPLEN=20 SEQ=0xC2B1EBD5 9 ==> TCPLEN=32 9 CRC:(1)
45. SRC= PROTOCOL=TCP 9 ==> TCPLEN=32 9 CRC:(1)
46. ID=21243 8 ==> TCPLEN=32 8 CRC:(1)
47. ID=23323 8 ==> TCPLEN=32 8 CRC:(1)
48. ID=24407 8 ==> TCPLEN=32 8 CRC:(1)
49. ID=31232 8 ==> TCPLEN=32 8 CRC:(1)
50. ID=31234 8 ==> TCPLEN=32 8 CRC:(1)
51. ID=41212 8 ==> TCPLEN=32 8 CRC:(1)
52. ID=51323 8 ==> TCPLEN=32 8 CRC:(1)
53. IPLEN=30 8 ==> TCPLEN=32 8 CRC:(1)

Attack detection rate study:
[Figure: Attack detection rate. Series: Attack rate, Detection accuracy. x-axis: attack rates (1 to 6); y-axis: detection rates.]
[Figure: Attack detection rate. Series: Attack rate, Detection accuracy. x-axis: attack rates (1 to 6); y-axis: detection rates. Data table shown with the figure:]
                     1     2     3     4     5     6
Attack rate          0.1   0     0     0     0     0.1
Detection accuracy   0.5   0.5   0.5   0.3   0.5   0.6

5. Conclusion and Future scope:
This system effectively tests the performance of the improved KNN approach and the Apriori approach. Experimental results demonstrate an improved attack rate compared with existing approaches. The system easily filters the attacked rules using the correlation measure. In future, this approach will be extended to VMware machines in order to detect attacks in the virtualization layer and patches in VMware-based tools.

REFERENCES
[1] S. A. Hofmeyr, A. Somayaji, and S. Forrest, "Intrusion Detection using Sequences of System Calls", Journal of Computer Security, Vol. 6, pp. 151-180, 1998.
[2] C. Warrender, S. Forrest, and B. Pearlmutter, "Detecting Intrusions Using System Calls: Alternative Data Models", IEEE Symposium on Security and Privacy, pp. 133-145, 1999.
[3] J. Balthrop, F. Esponda, etc., "Coverage and Generalization in an Artificial Immune System", Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2002), Morgan Kaufmann, New York, pp. 3-10, 2002.
[5] G. W. Dunlap, S. T. King, S. Cinar, M. Basrai, and P. M. Chen, "Revirt: Enabling intrusion analysis through virtual machine logging and replay", Proceedings of 2002 Symposium on Operating Systems Design and Implementation (OSDI'02), December 2002, pp. 211-224.
[6] The Xen Team, Xen Interface Manual - Xen v3.0 for x86, University of Cambridge, Technical Report, June 2006.
[7] Apk, "Interface promiscuity obscurity", Phrack, July 1998, pp. 135-145.
[8] Halflife, "Abuse of the Linux Kernel for Fun and Profit", Phrack, 7(50): article 9 of 17, April 1997, pp. 121-127.
[9] Nmap.
[10] Netcat.
[11] S. Berger, R. Caceres, et al., "vTPM: Virtualizing the Trusted Platform Module", Proceedings of the USENIX Annual Technical Conference (USENIX'06), USENIX, May 30 - June 3, 2006, Boston, MA, USA, pp. 21-21.
[12] D. P. Bovet and M. Cesati, Understanding the Linux Kernel, O'Reilly Press, 3rd edition, 2007.
[13] D. Chisnall, The Definitive Guide to the Xen Hypervisor, Prentice Hall, 2007.
[14] S. Berger, R. Caceres, K. Goldman, R. Perez, R. Sailer, and L. van Doorn, "vTPM: Virtualizing the Trusted Platform Module", Proceedings of 15th USENIX Security Symposium, USENIX, Vancouver, B.C., Canada, 2006, pp. 305-320.
[15] Trusted Computing Group.
[16] G. Vigna, S. T. Eckmann, and R. A. Kemmerer, "Attack Languages", Proceedings of the Information Survivability Workshop, IEEE Computer Society, Boston, USA, 2000, pp. 163-166.
[17] P. Ning, Y. Cui, and D. S. Reeves, "Constructing attack scenarios through correlation of intrusion alerts", Proceedings of the 9th Conference on Computer and Communications Security, ACM, Washington D.C., USA, 2002, pp. 245-254.