Improved Apriori and KNN approach for Virtual machine based intrusion detection

International Journal of Engineering Trends and Technology, Volume 3, Issue 5, 2012
ISSN: 2231-5381 http://www.internationaljournalssrg.org
Suneetha Valluru#1, Mrs N. Rajeswari#2
#1 M.Tech (CSE), Gudlavalleru Engineering College, Gudlavalleru
#2 Associate Professor (CSE), Gudlavalleru Engineering College, Gudlavalleru
In this paper we present a new mechanism for building storage-based intrusion detection systems.
ABSTRACT:
As information systems become more accessible from the World Wide Web, the importance of secure networks has increased tremendously. Intelligent Intrusion Detection Systems (IDSs) based on sophisticated algorithms, as an alternative to current signature-based detection, are in demand. Intrusion detection is one of the main research directions in network security. Applying data mining technology to a network intrusion detection system (NIDS) can uncover new patterns in massive network data and reduce the workload of manually compiling intrusion behavior patterns and normal behavior patterns.
Virtualization is becoming an increasingly popular service hosting platform, and IDSs that utilize virtualization have recently been introduced. This work considers a particular challenge in current virtualization-based IDSs. The proposed system uses a chi-square based feature selection that evaluates the relative importance of individual features. After feature selection, KNN and a modified Apriori algorithm are applied to the data to obtain lower false positive rates.
1. INTRODUCTION
The wide study and deployment of intrusion detection systems (IDSs) has driven the development of increasingly advanced approaches to defeating malicious activities. IDSs are normally host-based or network-based. A host-based IDS is integrated into the host operating environment [1, 2], whereas a network-based IDS monitors network traffic for suspicious content [3, 4, 5]. Recently, a few projects have focused on storage-based IDSs. As shown in [6], several types of intrusion persist after a reboot because persistent data on disk has been altered, and such changes can easily be detected by storage systems. Moreover, storage systems are well suited to this use because they can keep working even when the host system has been compromised. The storage-based IDS is therefore regarded as a promising complement to host-based and network-based technologies. However, the IDS designer faces a dilemma: if the IDS runs inside the host, it has a full view of the system but is exposed to more possible attacks; if the IDS runs outside the host, it has a poor view of the system.
On the one hand, the proposed mechanism provides good visibility into the state of the monitored host's file system through our file-aware block-level storage. On the other hand, it maintains strong isolation for the IDS by taking advantage of virtual machine technologies. The proposed IDS can be used as an integrated part of the virtual block-level storage. Our approach employs the virtual machine monitor (VMM) to pull the IDS “outside” the monitored host, which places a barrier between the IDS and an attacker's malicious activities. Typically, the VMM constructs virtual hardware interfaces for the upper-level OS and applications, including the common block-level storage interface. Attackers, however, mostly operate on system files, so existing storage-based IDS access rules are defined in terms of files rather than data sectors. How to bridge this gap is the key issue. We present file-aware block-level storage to address it. This kind of storage can identify which file is being accessed by the upper-level OS, so that the corresponding file-level intrusion detection (ID) rules can be applied. In this solution, a preparatory procedure helps the virtual storage discover the file system structure in advance; the storage system can then maintain a sector-to-file mapping table and use this knowledge online to transform the file-level ID rules into sector-level rules.
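The sector-to-file mapping idea can be illustrated with a minimal sketch. It assumes the file system layout is already known as a dictionary from file path to sector numbers. The names FileRule, build_sector_map, to_sector_rules and check_write, the example paths and the sector numbers are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRule:
    path: str          # file the rule protects (hypothetical example paths below)
    deny_write: bool   # True if a write to this file should raise an alert


def build_sector_map(file_extents):
    """Invert {path: [sector, ...]} into the sector-to-file table {sector: path}."""
    sector_to_file = {}
    for path, sectors in file_extents.items():
        for sector in sectors:
            sector_to_file[sector] = path
    return sector_to_file


def to_sector_rules(file_rules, file_extents):
    """Translate file-level rules into the set of write-protected sectors."""
    protected = set()
    for rule in file_rules:
        if rule.deny_write:
            protected.update(file_extents.get(rule.path, []))
    return protected


def check_write(sector, protected, sector_to_file):
    """Return an alert string for a write to a protected sector, else None."""
    if sector in protected:
        return f"ALERT: write to sector {sector} ({sector_to_file[sector]})"
    return None


# Assumed layout, discovered in advance by the preparatory procedure.
file_extents = {"/etc/passwd": [100, 101], "/var/log/app.log": [200, 201, 202]}
rules = [FileRule("/etc/passwd", deny_write=True)]

sector_to_file = build_sector_map(file_extents)
protected = to_sector_rules(rules, file_extents)
```

In this sketch the block-level layer only ever sees sector numbers, and the table is what lets a file-level rule fire on a sector-level write.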
2. RELATED WORK
System intelligence continues to move away from the CPU and into devices, and storage system designers exploit this trend by using the excess computing power to perform more complex processing inside storage devices. In a word, storage is becoming more and more intelligent. However, while smart disk systems hold great promise, realizing their full potential is difficult because of the narrow interface between storage and the upper-level system. One solution is to adopt a new storage interface. Research projects including NASD [10], Attribute-based Storage [11] and Active Disks [12] put forward the notion of the Object-based Storage Device (OSD) [13], which provides richer access interfaces than traditional disks and supports variable-length data objects. Another example is CAS [14] [5], which uses content hashes to represent files in the file system. For most legacy systems, however, this solution is impractical, because the storage interfaces of existing desktop operating systems and applications would have to be modified. Smart disk systems therefore have to acquire knowledge of the upper-level file system and exploit that knowledge to enhance functionality or improve performance while the interface stays exactly the same.
Many intrusion detection systems employ techniques for both anomaly and misuse detection. The techniques these systems use to detect anomalies vary. Some are based on predicting future patterns of behavior from the patterns seen so far, while others rely mainly on statistical approaches to determine anomalous behavior. In both cases, observed behavior that does not match expected behavior is flagged because it may indicate an intrusion. The principal techniques used for misuse detection include expert systems, model-based reasoning systems, state transition analysis, and keystroke monitoring.
Xen [4] is another hypervisor that runs on the bare hardware, but it takes a different approach, called paravirtualization. This requires rewriting portions of an operating system written for a Pentium machine, because the interface exported by Xen is not identical to that of a Pentium machine. Xen is open source, and work is currently in progress to port ISIS to Xen.
3. PROPOSED ARCHITECTURE
The k-nearest neighbor (k-NN) algorithm is among the simplest of all machine learning algorithms. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors. k is a positive integer, typically small. If k = 1, the object is simply assigned to the class of its nearest neighbor. In binary (two-class) classification problems, it is helpful to choose k to be an odd number, as this avoids tied votes.
The same method can be used for regression, simply by assigning the property value for the object to be the average of the values of its k nearest neighbors. It can also be useful to weight the contributions of the neighbors, so that nearer neighbors contribute more to the average than more distant ones.
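A minimal sketch of k-NN regression with optional distance weighting follows. The inverse-distance weight 1/d is one common choice and is an assumption here, since the text does not name a weighting scheme; the function name knn_regress and the toy data are hypothetical.

```python
import math


def knn_regress(train_x, train_y, query, k=3, weighted=True):
    """Predict a value as the (optionally distance-weighted) mean of the k nearest targets."""
    nearest = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))[:k]
    if not weighted:
        return sum(y for _, y in nearest) / len(nearest)
    # An exact match takes all the weight, which also avoids dividing by zero.
    exact = [y for d, y in nearest if d == 0.0]
    if exact:
        return sum(exact) / len(exact)
    weights = [1.0 / d for d, _ in nearest]  # assumed weighting: inverse distance
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)


# Toy one-dimensional data, made up for illustration.
train_x = [(0.0,), (1.0,), (3.0,)]
train_y = [0.0, 10.0, 30.0]

plain = knn_regress(train_x, train_y, (2.0,), k=2, weighted=False)
near = knn_regress(train_x, train_y, (1.5,), k=2, weighted=True)
```

With weighting, the prediction at 1.5 is pulled towards the target of the closer sample at 1.0 rather than sitting at the plain average of its two neighbors.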
The neighbors are taken from a set of objects for which the correct classification (or, in the case of regression, the value of the property) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required. To identify neighbors, the objects are represented by position vectors in a multidimensional feature space. It is usual to use the Euclidean distance, though other distance measures, such as the Manhattan distance, could in principle be used instead. The k-nearest neighbor algorithm is sensitive to the local structure of the data.
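For reference, the two distance measures named here can be written as follows for feature vectors x and y with n components. The symbols x, y and n are notation introduced for this illustration and do not appear in the paper.

```latex
d_{\text{Euclidean}}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2},
\qquad
d_{\text{Manhattan}}(x, y) = \sum_{i=1}^{n} \lvert x_i - y_i \rvert
```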
The best choice of k depends on the data; generally, larger values of k reduce the effect of noise on the classification, but make boundaries between classes less distinct. A good k can be selected by various heuristic techniques, for example cross-validation. The special case where the class is predicted to be the class of the closest training sample (i.e. when k = 1) is called the nearest neighbor algorithm.
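One way to select k by cross-validation is sketched below using leave-one-out validation. This is an illustrative choice and not necessarily the procedure the authors used; the function names and the toy data (two clusters plus one deliberately mislabeled point) are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """train is a list of (features, label) pairs; return the majority label of the k nearest."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def loo_accuracy(data, k):
    """Leave-one-out accuracy: classify each sample using all the other samples."""
    hits = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, x, k) == y
    return hits / len(data)


def choose_k(data, candidates):
    """Return the candidate k with the highest leave-one-out accuracy (first wins ties)."""
    return max(candidates, key=lambda k: loo_accuracy(data, k))


data = [
    ((0, 0), "normal"), ((0, 1), "normal"), ((1, 0), "normal"), ((1, 1), "normal"),
    ((5, 5), "attack"), ((5, 6), "attack"), ((6, 5), "attack"), ((6, 6), "attack"),
    ((0.5, 0.5), "attack"),  # noisy label placed inside the "normal" cluster
]
best_k = choose_k(data, [1, 3])
```

On this toy data the single noisy point misleads every nearby sample when k = 1, while k = 3 outvotes it, which matches the remark that larger k reduces the effect of noise.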
The accuracy of the k-NN algorithm can be severely degraded by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their importance. Much research effort has been put into selecting or scaling features to improve classification. A particularly popular approach is the use of evolutionary algorithms to optimize feature scaling. Another popular approach is to scale features by the mutual information of the training data with the training classes.
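The effect of inconsistent feature scales can be seen with a much simpler technique than the two named above. The sketch below uses plain min-max normalization, which is a swapped-in baseline and neither the evolutionary nor the mutual-information approach; the function name and the packet values are hypothetical.

```python
def min_max_scale(rows):
    """Rescale each column of a list of numeric rows to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    # A constant column has zero range; use 1.0 so its values all map to 0.
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [tuple((v - lo) / s for v, lo, s in zip(row, lows, spans)) for row in rows]


# Hypothetical packet features: (TTL, datagram length in bytes).
packets = [(121, 40), (128, 1500), (125, 60)]
scaled = min_max_scale(packets)
```

Without scaling, the datagram length (range 1460) would dominate the Euclidean distance over TTL (range 7); after scaling both features contribute on the same footing.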
The K-Nearest Neighbors (K-NN) algorithm is a nonparametric method, in that no parameters are estimated as they are, for example, in the multiple linear regression model. Instead, the proximity of neighboring input (x) observations in the training data set and their corresponding output values (y) are used to predict (score) the output values of cases in the validation data set. The output variables can be interval variables, in which case the K-NN algorithm is used for prediction, or categorical (nominal or ordinal), in which case it is used for classification.
Steps:
• Determine the parameter K, the number of nearest neighbors.
• Calculate the distance between the query instance and all the training samples.
• Sort the distances and determine the nearest neighbors based on the K-th minimum distance.
• Gather the categories of the nearest neighbors.
• Use the simple majority of the categories of the nearest neighbors as the prediction value for the query instance.
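The steps above can be sketched as a short function. This is a minimal illustration using Euclidean distance, not the authors' implementation; the (IPLEN, TTL) feature vectors and their labels are made up for the example.

```python
import math
from collections import Counter


def knn_classify(train_x, train_y, query, k):
    # Step 2: distance from the query instance to every training sample.
    dists = [(math.dist(x, query), label) for x, label in zip(train_x, train_y)]
    # Step 3: sort by distance and keep the K closest samples.
    dists.sort(key=lambda pair: pair[0])
    # Step 4: gather the categories of those neighbors.
    labels = [label for _, label in dists[:k]]
    # Step 5: the simple majority of the categories is the prediction.
    return Counter(labels).most_common(1)[0][0]


# Hypothetical (IPLEN, TTL) feature vectors with made-up labels.
train_x = [(20, 121), (20, 128), (40, 125), (40, 134), (30, 123)]
train_y = ["malicious", "malicious", "normal", "normal", "malicious"]

# Step 1: choose K = 3.
label = knn_classify(train_x, train_y, query=(22, 124), k=3)
```

An odd K is used here so that a two-class vote cannot tie.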
To review, the chi-square method of hypothesis testing has seven basic steps [1]:
1. State the null and research/alternative hypotheses.
2. Specify the decision rule and the level of statistical significance for the test, i.e., .05, .01, or .001. (A significance level of .01 means that the probability of the chi-square value must be .01 or less to reject the null hypothesis, a more stringent criterion than .05.)
3. Compute the expected values.
4. Compute the chi-square statistic.
5. Determine the degrees of freedom for the table. Then identify the critical value of chi-square at the specified level of significance and appropriate degrees of freedom.
6. Compare the computed chi-square statistic with the critical value of chi-square; reject the null hypothesis if the chi-square is equal to or larger than the critical value; fail to reject the null hypothesis if the chi-square is less than the critical value.
7. State a substantive conclusion, i.e., describe the meaning and importance of the test results in terms of the problem under investigation.
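Steps 3 to 6 can be sketched for a single feature as follows. The sketch assumes the feature and the class have been cross-tabulated into a contingency table with no empty row or column; how the authors actually rank or threshold features is not stated in the paper, and the counts below are invented for illustration.

```python
def chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total  # step 3
            stat += (observed - expected) ** 2 / expected      # step 4
    dof = (len(table) - 1) * (len(table[0]) - 1)               # step 5
    return stat, dof


# Hypothetical counts: rows are PROTOCOL = TCP / not TCP,
# columns are malicious / normal records.
table = [[30, 10],
         [10, 30]]
stat, dof = chi_square(table)

CRITICAL_05_DF1 = 3.841  # standard critical value for significance .05 and 1 degree of freedom
keep_feature = stat >= CRITICAL_05_DF1                         # step 6
```

A feature whose statistic exceeds the critical value is treated as dependent on the class and is therefore a candidate to keep; features can also simply be ranked by the statistic.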
IMPROVED APRIORI:
The major steps in association rule mining are:
1. Frequent itemset generation
2. Rule derivation
The APRIORI algorithm uses the downward closure property to prune unnecessary branches from further consideration. It needs two parameters, minSupp and minConf. minSupp is used for generating frequent itemsets and minConf is used for rule derivation.
The APRIORI algorithm:
1. k = 1;
2. Find the frequent itemsets, $L_k$, from $C_k$, the set of all candidate itemsets;
3. Form $C_{k+1}$ from $L_k$;
4. k = k+1;
5. Repeat 2-4 until $C_k$ is empty;
Step 2 is called the frequent itemset generation step. Step 3 is called the candidate itemset generation step. Details of these two steps are given below.
Frequent itemset generation
Scan D and count each itemset in $C_k$; if the count is greater than minSupp, then add that itemset to $L_k$.
Candidate itemset generation
For k = 1, $C_1$ = all itemsets of length 1.
For k > 1, generate $C_k$ from $L_{k-1}$ as follows:
The join step:
$C_k$ = (k-2)-way join of $L_{k-1}$ with itself.
If both $\{a_1, \ldots, a_{k-2}, a_{k-1}\}$ and $\{a_1, \ldots, a_{k-2}, a_k\}$ are in $L_{k-1}$, then add $\{a_1, \ldots, a_{k-2}, a_{k-1}, a_k\}$ to $C_k$.
The items are always stored in sorted order.
The prune step:
Remove $\{a_1, \ldots, a_{k-2}, a_{k-1}, a_k\}$ if it contains a non-frequent (k-1)-subset.
Modified measure:
The correlation coefficient (CRC) between itemsets can be defined as
$\mathrm{CRC}(A, B) = \dfrac{\mathrm{SUPP}(A \cup B)}{\mathrm{SUPP}(A) \cdot \mathrm{SUPP}(B)}$
where A and B are itemsets.
When $\mathrm{CRC}(A, B) = 1$, A and B are independent.
When $\mathrm{CRC}(A, \lnot B) < 1$, A and B are negatively correlated.
When $\mathrm{CRC}(\lnot A, B) < 1$, A and B are negatively correlated.
4. EXPERIMENTAL RESULTS
All experiments were performed on an Intel(R) Core(TM)2 CPU at 2.13 GHz with 2 GB RAM, running Microsoft Windows XP Professional (SP2). The dataset is taken from a real-time PHP-based web analyzer. Some of the results from the dynamic web statistics are given below.
KNN RESULTS:
KNN APPROACH FOR MALICIOUS DATA
======================================
IPLEN = 20 AND
TTL = 121: 65.456.567.654.87 (5.0/4.0)
IPLEN = 20 AND
TTL = 128: 74.125.236.173:80 (5.0/4.0)
SEQ != 0xC2B1EBD5 AND
ID = 21243 AND
TTL = 125: 55.556.567.345.23 (2.0/1.0)
SEQ != 0xC2B1EBD5 AND
DgmLen = 45 AND
TTL = 123: 34.234.345.234.23 (2.0/1.0)
SEQ != 0xC2B1EBD5 AND
PROTOCOL = TCP AND
ID = 21243: 666.656.566.566.55 (2.0/1.0)
SEQ != 0xC2B1EBD5 AND
DgmLen = 54 AND
ID = 41212: 23.234.12.432.21 (2.0/1.0)
SEQ != 0xC2B1EBD5 AND
TTL = 123: 435.675.66.765.55 (4.0/3.0)
IPLEN = 20: 54.655.654.345.345 (5.0/4.0)
IPLEN = 30 AND
TTL = 123: 65.77.55.777.555 (2.0/1.0)
SEQ != 0xC2B1EBD5 AND
TTL = 125: 767.55.777.777.777 (3.0/2.0)
SEQ != 0xC2B1EBD5 AND
TTL = 132: 666.777.777.776.777 (3.0/2.0)
TTL = 121 AND
IPLEN = 50: 665.66.55.666.55 (2.0/1.0)
PROTOCOL = TCP AND
TTL = 128 AND
IPLEN = 40: 564.666.554.678.45 (2.0/1.0)
PROTOCOL = TCP AND
IPLEN = 40: 44.445.654.765.345 (3.0/2.0)
IPLEN = 30: 56.675.65.454.567 (3.0/2.0)
IPLEN = 40 AND
TTL != 125 AND
TTL = 134: 32.345.33.333.12 (3.0/2.0)
ID = 31232 AND
IPLEN = 40: 44.554.654.444.55 (2.0/1.0)
TTL = 125: 43.456.554.343.33 (3.0/2.0)
DgmLen = 56: 34.654.678.654.34 (3.0/2.0)
: 66.777.556.656.44 (3.0/2.0)
Number of Rules : 20
=== ATTACK PROPERTIES ===
Attack Rate   Detection Accuracy   Class
0.052         0.474                74.125.236.173:80
0             0.5                  34.234.345.234.23
0.034         0.483                32.345.33.333.12
0.017         0.491                23.234.12.432.21
0.052         0.474                65.456.567.654.87
0.086         0.457                55.556.567.345.23
0.034         0.483                34.654.678.654.34
0.069         0.466                435.675.66.765.55
0.052         0.474                54.655.654.345.345
0.034         0.483                56.675.65.454.567
0.017         0.491                777.777.456.466.56
0.069         0.466                43.456.554.343.33
0             0.5                  33.444.443.333.32
0.034         0.483                44.554.654.444.55
0             0.491                45.554.554.224.45
0.052         0.474                44.445.654.765.345
0             0.474                32.33.453.345.234
Attack detection rate study:
[Figure: Attack detection rate. x-axis: attack rates (1 to 6); y-axis: detection rates (0 to 0.5); series: Attackrate, Detectionaccuracy.]
IMPROVED APRIORI RESULTS:
ARM RULES:
=======
1. SEQ= 0xC2B1EBD5 33 ==> TCPLEN=32 33 CRC:(1)
2. SRC=192.168.2.3:56007 25 ==> TCPLEN=32 25 CRC:(1)
3. IPLEN=40 22 ==> TCPLEN=32 22 CRC:(1)
4. PROTOCOL=TCP 20 ==> TCPLEN=32 20 CRC:(1)
5. PROTOCOL=HTTP 20 ==> TCPLEN=32 20 CRC:(1)
6. PROTOCOL=UDP 19 ==> TCPLEN=32 19 CRC:(1)
7. IPLEN=20 15 ==> TCPLEN=32 15 CRC:(1)
8. IPLEN=50 14 ==> TCPLEN=32 14 CRC:(1)
9. SEQ= 0xC2B1EBD5 SRC=192.168.2.3:56007 13 ==> TCPLEN=32 13 CRC:(1)
10. IPLEN=40 SEQ= 0xC2B1EBD5 12 ==> TCPLEN=32 12 CRC:(1)
11. SEQ= 0xC2B1EBD5 PROTOCOL=TCP 11 ==> TCPLEN=32 11 CRC:(1)
12. SEQ= 0xC2B1EBD5 PROTOCOL=UDP 11 ==> TCPLEN=32 11 CRC:(1)
13. SEQ= 0xC2B1EBD5 PROTOCOL=HTTP 11 ==> TCPLEN=32 11 CRC:(1)
14. TTL=121 10 ==> TCPLEN=32 10 CRC:(1)
15. TTL=123 10 ==> TCPLEN=32 10 CRC:(1)
16. TTL=123 10 ==> PROTOCOL=UDP 10 CRC:(1)
17. TTL=128 10 ==> TCPLEN=32 10 CRC:(1)
18. TTL=128 10 ==> PROTOCOL=TCP 10 CRC:(1)
19. TTL=132 10 ==> TCPLEN=32 10 CRC:(1)
20. TTL=132 10 ==> PROTOCOL=TCP 10 CRC:(1)
21. TTL=134 10 ==> TCPLEN=32 10 CRC:(1)
22. TTL=134 10 ==> PROTOCOL=HTTP 10 CRC:(1)
23. TTL=123 PROTOCOL=UDP 10 ==> TCPLEN=32 10 CRC:(1)
24. TTL=123 TCPLEN=32 10 ==> PROTOCOL=UDP 10 CRC:(1)
25. TTL=123 10 ==> TCPLEN=32 PROTOCOL=UDP 10 CRC:(1)
26. TTL=128 PROTOCOL=TCP 10 ==> TCPLEN=32 10 CRC:(1)
27. TTL=128 TCPLEN=32 10 ==> PROTOCOL=TCP 10 CRC:(1)
28. TTL=128 10 ==> TCPLEN=32 PROTOCOL=TCP 10 CRC:(1)
29. TTL=132 PROTOCOL=TCP 10 ==> TCPLEN=32 10 CRC:(1)
30. TTL=132 TCPLEN=32 10 ==> PROTOCOL=TCP 10 CRC:(1)
31. TTL=132 10 ==> TCPLEN=32 PROTOCOL=TCP 10 CRC:(1)
32. TTL=134 PROTOCOL=HTTP 10 ==> TCPLEN=32 10 CRC:(1)
33. TTL=134 TCPLEN=32 10 ==> PROTOCOL=HTTP 10 CRC:(1)
34. TTL=134 10 ==> TCPLEN=32 PROTOCOL=HTTP 10 CRC:(1)
35. IPLEN=40 SRC=192.168.2.3:56007 10 ==> TCPLEN=32 10 CRC:(1)
36. SRC=192.168.2.3:56007 PROTOCOL=HTTP 10 ==> TCPLEN=32 10 CRC:(1)
37. TTL=125 9 ==> TCPLEN=32 9 CRC:(1)
38. TTL=125 9 ==> PROTOCOL=HTTP 9 CRC:(1)
39. DgmLen =56 9 ==> TCPLEN=32 9 CRC:(1)
40. TTL=121 PROTOCOL=UDP 9 ==> TCPLEN=32 9 CRC:(1)
41. TTL=125 PROTOCOL=HTTP 9 ==> TCPLEN=32 9 CRC:(1)
42. TTL=125 TCPLEN=32 9 ==> PROTOCOL=HTTP 9 CRC:(1)
43. TTL=125 9 ==> TCPLEN=32 PROTOCOL=HTTP 9 CRC:(1)
44. IPLEN=20 SEQ= 0xC2B1EBD5 9 ==> TCPLEN=32 9 CRC:(1)
45. SRC=192.168.2.3:56007 PROTOCOL=TCP 9 ==> TCPLEN=32 9 CRC:(1)
46. ID=21243 8 ==> TCPLEN=32 8 CRC:(1)
47. ID=23323 8 ==> TCPLEN=32 8 CRC:(1)
48. ID=24407 8 ==> TCPLEN=32 8 CRC:(1)
49. ID=31232 8 ==> TCPLEN=32 8 CRC:(1)
50. ID=31234 8 ==> TCPLEN=32 8 CRC:(1)
51. ID=41212 8 ==> TCPLEN=32 8 CRC:(1)
52. ID=51323 8 ==> TCPLEN=32 8 CRC:(1)
53. IPLEN=30 8 ==> TCPLEN=32 8 CRC:(1)
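Rules of the form shown in this listing come from level-wise frequent itemset mining followed by scoring with the CRC measure from Section 3. The sketch below is a minimal illustration of that pipeline and not the authors' implementation: the function names, the four toy transactions and the threshold are hypothetical, and CRC is read as SUPP(A ∪ B) divided by the product SUPP(A) × SUPP(B), with supports expressed as fractions of the transaction count.

```python
from itertools import combinations


def support_counts(transactions, candidates):
    """Scan the transactions once and count how many contain each candidate."""
    return {c: sum(1 for t in transactions if c <= t) for c in candidates}


def generate_candidates(frequent, k):
    """Join step then prune step: build size-k candidates from frequent (k-1)-itemsets."""
    prev = sorted(tuple(sorted(s)) for s in frequent)  # items kept in sorted order
    candidates = set()
    for a, b in combinations(prev, 2):
        if a[:-1] == b[:-1]:  # join: the first k-2 items agree
            cand = frozenset(a) | frozenset(b)
            # prune: every (k-1)-subset must itself be frequent
            if all(frozenset(sub) in frequent for sub in combinations(cand, k - 1)):
                candidates.add(cand)
    return candidates


def apriori(transactions, min_supp):
    """Return {itemset: count} for every itemset whose count is greater than min_supp."""
    transactions = [frozenset(t) for t in transactions]
    candidates = {frozenset([item]) for t in transactions for item in t}  # C1
    frequent_all, k = {}, 1
    while candidates:  # repeat until Ck is empty
        counts = support_counts(transactions, candidates)
        frequent = {c: n for c, n in counts.items() if n > min_supp}     # Lk
        frequent_all.update(frequent)
        k += 1
        candidates = generate_candidates(set(frequent), k)               # next Ck
    return frequent_all


def crc(counts, n, a, b):
    """CRC(A, B) = SUPP(A u B) / (SUPP(A) * SUPP(B)), supports as fractions of n."""
    return (counts[a | b] / n) / ((counts[a] / n) * (counts[b] / n))


# Toy packet records, invented for illustration.
transactions = [
    {"TTL=128", "PROTOCOL=TCP", "TCPLEN=32"},
    {"TTL=128", "PROTOCOL=TCP", "TCPLEN=32"},
    {"TTL=123", "PROTOCOL=UDP", "TCPLEN=32"},
    {"TTL=123", "PROTOCOL=UDP"},
]
freq = apriori(transactions, min_supp=1)
score = crc(freq, len(transactions), frozenset({"TTL=128"}), frozenset({"PROTOCOL=TCP"}))
```

A score above 1 indicates that the two itemsets occur together more often than independence would predict; a rule could then be kept or filtered by comparing this value with 1.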
[Figure: Attack detection rate (improved Apriori rules). x-axis: attack rates (1 to 6); y-axis: detection rates (0 to 0.8); series: Attack rate, Detection accuracy.]
                     1     2     3     4     5     6
Attack rate          0.1   0     0     0     0     0.1
Detection accuracy   0.5   0.5   0.5   0.3   0.5   0.6
5. CONCLUSION AND FUTURE SCOPE
This system tests the performance of the improved KNN approach and the Apriori approach. Experimental results demonstrate an improved attack detection rate compared with existing approaches. The system filters attack rules using the correlation measure. In future, this approach will be extended to VMware machines in order to detect attacks on the virtualization layer and patches in VMware-based tools.
REFERENCES
[1] S. A. Hofmeyr, A. Somayaji, and S. Forrest, "Intrusion Detection using Sequences of System Calls", Journal of Computer Security, Vol. 6, pp. 151-180, 1998.
[2] C. Warrender, S. Forrest, and B. Pearlmutter, "Detecting Intrusions Using System Calls: Alternative Data Models", IEEE Symposium on Security and Privacy, pp. 133-145, 1999.
[3] J. Balthrop, F. Esponda, et al., "Coverage and Generalization in an Artificial Immune System", Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2002), Morgan Kaufmann, New York, pp. 3-10, 2002.
[5] G. W. Dunlap, S. T. King, S. Cinar, M. Basrai, and P. M. Chen, "ReVirt: Enabling intrusion analysis through virtual machine logging and replay", Proceedings of the 2002 Symposium on Operating Systems Design and Implementation (OSDI'02), December 2002, pp. 211-224.
[6] The Xen Team, Xen Interface Manual: Xen v3.0 for x86, University of Cambridge, Technical Report, June 2006.
[7] Apk, "Interface promiscuity obscurity", Phrack, July 1998, pp. 135-145.
[8] Halflife, "Abuse of the Linux Kernel for Fun and Profit", Phrack, 7(50): article 9 of 17, April 1997, pp. 121-127.
[9] Nmap. http://nmap.org.
[10] Netcat. http://netcat.sourceforge.net.
[11] S. Berger, R. Caceres, et al., "vTPM: Virtualizing the Trusted Platform Module", Proceedings of the USENIX Annual Technical Conference (USENIX'06), USENIX, May 30 - June 3, 2006, Boston, MA, USA, pp. 21-21.
[12] D. P. Bovet and M. Cesati, Understanding the Linux Kernel, 3rd edition, O'Reilly, 2007.
[13] D. Chisnall, The Definitive Guide to the Xen Hypervisor, Prentice Hall, 2007.
[14] S. Berger, R. Caceres, K. Goldman, R. Perez, R. Sailer, and L. van Doorn, "vTPM: Virtualizing the Trusted Platform Module", Proceedings of the 15th USENIX Security Symposium, USENIX, Vancouver, B.C., Canada, 2006, pp. 305-320.
[15] Trusted Computing Group, http://www.trustedcomputinggroup.org.
[16] G. Vigna, S. T. Eckmann, and R. A. Kemmerer, "Attack Languages", Proceedings of the Information Survivability Workshop, IEEE Computer Society, Boston, USA, 2000, pp. 163-166.
[11] P. Ning, Y. Cui, and D. S. Reeves, "Constructing attack scenarios through correlation of intrusion alerts", Proceedings of the 9th Conference on Computer and Communications Security, ACM, Washington D.C., USA, 2002, pp. 245-254.