Proceedings of 2nd International Conference on Intelligent Knowledge Systems (IKS-2005), 06-08 July 2005
CLASSIFICATION MODELS FOR INTRUSION DETECTION SYSTEMS
Srinivas Mukkamala
email: srinivas@cs.nmt.edu
Andrew H. Sung
email: sung@cs.nmt.edu
Rajeev Veeraghattam
email: rajeev@nmt.edu
Department of Computer Science, New Mexico Tech, Socorro, NM 87801, USA
Institute of Complex Additive Systems Analysis, New Mexico Tech, Socorro, NM 87801, USA
Key words: Machine learning, Intrusion detection systems, CART, MARS, TreeNet
ABSTRACT
This paper describes results concerning the classification
capability of supervised machine learning techniques in
detecting intrusions using network audit trails. We
investigate three well-known machine learning techniques:
classification and regression trees (CART), multivariate
adaptive regression splines (MARS), and TreeNet. The best
model is chosen based on classification accuracy (ROC curve
analysis). The results show that high classification
accuracies can be achieved in a fraction of the time required
by the well-known support vector machines and artificial
neural networks. TreeNet performs the best for normal
traffic, probe, and denial of service (DoS) attacks. CART
performs the best for user to super user (U2Su) and remote
to local (R2L) attacks.
I. INTRODUCTION
Since the ability of an Intrusion Detection System (IDS) to
identify a large variety of intrusions in real time with high
accuracy is of primary concern, in this paper we consider
the performance of machine learning-based IDSs with respect
to classification accuracy and false alarm rates.
AI techniques have been used to automate the intrusion
detection process; they include neural networks, fuzzy
inference systems, evolutionary computation, machine
learning, support vector machines, etc. [1-6]. Model
selection using SVMs and other popular machine learning
methods often requires extensive resources and long
execution times [7,8]. In this paper, we present a few
machine learning methods (MARS, CART, TreeNet) that can
perform model selection with higher or comparable
accuracies in a fraction of the time required by SVMs.
MARS is a nonparametric regression procedure that is
based on the “divide and conquer” strategy, which
partitions the input space into regions, each with its own
regression equation [9]. CART is a tree-building algorithm
that determines a set of if-then logical (split) conditions that
permit accurate prediction or classification of cases [10].
TreeNet is a tree-building algorithm that uses stochastic
gradient boosting to combine trees via a weighted voting
scheme, to achieve accuracy without the drawback of a
tendency to be misled by bad data [11,12].
We performed experiments using MARS, CART, and TreeNet
for classifying each of the five classes (normal, probe,
denial of service, user to super-user, and remote to local) of
network traffic patterns in the DARPA data.
A brief introduction to MARS and model selection is given in
section II. CART and a tree generated for classifying
normal vs. intrusions in the DARPA data are explained in
section III. TreeNet is briefly described in section IV.
The intrusion detection data used for experiments is explained
in section V. In section VI, we analyze the classification
accuracies of MARS, CART, and TreeNet using ROC curves.
Conclusions of our work are given in section VII.
II. MARS
Multivariate Adaptive Regression Splines (MARS) is a
nonparametric regression procedure that makes no
assumption about the underlying functional relationship
between the dependent and independent variables. Instead,
MARS constructs this relation from a set of coefficients
and basis functions that are entirely “driven” from the data.
The method is based on the “divide and conquer” strategy,
which partitions the input space into regions, each with its
own regression equation. This makes MARS particularly
suitable for problems with higher input dimensions, where
the curse of dimensionality would likely create problems
for other techniques.
Basis functions: MARS uses two-sided truncated functions
of the form (t-x)+ and (x-t)+ as basis functions for linear
or nonlinear expansion, which approximates the relationships
between the response and predictor variables [9,11].
Parameter t is the knot of the basis functions (defining the
"pieces" of the piecewise linear regression); these knots
(parameters) are also determined from the data. The "+"
signs next to the terms (t-x) and (x-t) simply denote that
only positive results of the respective equations are
considered; otherwise the respective functions evaluate to
zero.
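As a minimal sketch of the truncated basis functions described above, the following Python snippet evaluates the pair (t-x)+ and (x-t)+ at a knot t and combines them into a one-variable piecewise-linear model. The function names, the coefficients, and the knot value are hypothetical choices for illustration only; they are not taken from the experiments in this paper.

```python
def hinge_pair(x, t):
    """Return the two truncated basis functions (t - x)+ and (x - t)+ for knot t."""
    return max(0.0, t - x), max(0.0, x - t)


def piecewise_predict(x, intercept, terms):
    """Evaluate an intercept plus a weighted sum of hinge basis functions.

    terms is a list of (coefficient, knot, sign) triples; sign > 0 selects
    (x - t)+ and sign < 0 selects (t - x)+.
    """
    total = intercept
    for coef, knot, sign in terms:
        left, right = hinge_pair(x, knot)
        total += coef * (right if sign > 0 else left)
    return total


# Hypothetical model: y = 1 + 2*(x - 3)+ - 0.5*(3 - x)+
example_terms = [(2.0, 3.0, +1), (-0.5, 3.0, -1)]
values = [piecewise_predict(x, 1.0, example_terms) for x in (1.0, 3.0, 5.0)]
```

Each basis function is zero on one side of the knot, so the fitted curve changes slope only at x = 3 in this example.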
The MARS Model
The basis functions together with the model parameters
(estimated via least squares estimation) are combined to
produce the predictions given the inputs. The general
MARS model equation is given by:
$$ y = f(X) = \beta_0 + \sum_{m=1}^{M} \beta_m h_m(X) $$
where the summation is over the M nonconstant terms in
the model. y is predicted as a function of the predictor
variables X (and their interactions); this function consists of
an intercept parameter ($\beta_0$) and the sum, weighted by
$\beta_m$, of one or more basis functions $h_m(X)$.
Model Selection
After implementing the forward stepwise selection of basis
functions, a backward procedure is applied in which the
model is pruned by removing those basis functions that are
associated with the smallest increase in the (least squares)
goodness-of-fit. A least squares error function (inverse of
goodness-of-fit) is computed. The so-called Generalized
Cross Validation (GCV) error is a measure of the goodness
of fit that takes into account not only the residual error but
also the model complexity. It is given by
$$ GCV = \frac{\sum_{i=1}^{N} \left( y_i - f(x_i) \right)^2}{\left( 1 - C/N \right)^2}, \qquad C = 1 + cd $$
where N is the number of cases in the data set and d is the
effective degrees of freedom, which is equal to the number
of independent basis functions. The quantity c is the
penalty for adding a basis function. Experiments have
shown that the best value for c can be found somewhere in
the range 2 < c < 3 [9].
III. CART
CART builds classification and regression trees for
predicting continuous dependent variables (regression) and
categorical dependent variables (classification) [10,11].
The decision tree begins with a root node t derived from
whichever variable in the feature space minimizes a
measure of the impurity of the two sibling nodes. The
measure of the impurity or entropy at node t, denoted by
i(t), is as shown in the following equation [11]:
$$ i(t) = -\sum_{j=1}^{k} p(w_j \mid t) \log p(w_j \mid t) $$
where $p(w_j \mid t)$ is the proportion of patterns $x_i$
allocated to class $w_j$ at node t. Each non-terminal node is
then divided into two further nodes, $t_L$ and $t_R$, such that
$p_L$ and $p_R$ are the proportions of entities passed to the
new nodes $t_L$ and $t_R$ respectively. The best division is
that which maximizes the difference given in [11]:
$$ \Delta i(s, t) = i(t) - p_L \, i(t_L) - p_R \, i(t_R) $$
The decision tree grows by means of successive
subdivisions until a stage is reached in which there is no
significant decrease in the measure of impurity when a
further division s is implemented. When this stage is
reached, the node t is not subdivided further, and
automatically becomes a terminal node. The class $w_j$
associated with the terminal node t is that which maximizes
the conditional probability $p(w_j \mid t)$. The number of nodes
generated and the terminal node values for each class, for
the DARPA data set described in section V, are presented in
Table 1.
CART analysis consists of four basic steps¹ [12]:
• The first step consists of tree building, during which a
  tree is built using recursive splitting of nodes. Each
  resulting node is assigned a predicted class, based on
  the distribution of classes in the learning dataset which
  would occur in that node and the decision cost matrix.
• The second step consists of stopping the tree building
  process. At this point a “maximal” tree has been
  produced, which probably greatly overfits the
  information contained within the learning dataset.
• The third step consists of tree “pruning,” which results
  in the creation of a sequence of simpler and simpler
  trees, through the cutting off of increasingly important
  nodes.
• The fourth step consists of optimal tree selection,
  during which the tree which fits the information in the
  learning dataset, but does not overfit the information,
  is selected from among the sequence of pruned trees.
¹ Reference [12] was accidentally omitted during the
editing process of the original manuscript. Complete
reference is: R. J. Lewis. An Introduction to Classification
and Regression Tree (CART) Analysis. Annual Meeting of
the Society for Academic Emergency Medicine, 2000.
Figure 1. Tree for classifying normal vs. intrusions
Figure 1 represents a classification tree generated for the
DARPA data described in section V for classifying normal
activity vs. intrusive activity. Each terminal node
describes a data value; each record is classified into one of
the terminal nodes through the decisions made at the
non-terminal nodes that lead from the root to that leaf.
Table 1. Summary of tree splitters for all five classes.
Class     No. of Nodes   Terminal Node Value
Normal    23             0.016
Probe     22             0.019
DoS       16             0.004
U2Su      7              0.113
R2L       10             0.025
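The entropy impurity i(t) and the split criterion from Section III can be sketched in a few lines of Python. This is an illustrative toy, not the CART implementation used for the experiments: the helper names, the single-feature threshold scan, the use of the natural logarithm, and the toy data are all assumptions made for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Impurity i(t) = -sum_j p(w_j|t) log p(w_j|t) over the labels at a node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def impurity_decrease(labels, go_left):
    """Decrease i(t) - p_L i(t_L) - p_R i(t_R) for a boolean split mask."""
    left = [y for y, m in zip(labels, go_left) if m]
    right = [y for y, m in zip(labels, go_left) if not m]
    p_left = len(left) / len(labels)
    p_right = len(right) / len(labels)
    return entropy(labels) - p_left * entropy(left) - p_right * entropy(right)


def best_threshold(values, labels):
    """Scan thresholds on one feature; return (threshold, impurity decrease)."""
    best = (None, 0.0)
    for t in sorted(set(values))[:-1]:
        gain = impurity_decrease(labels, [v <= t for v in values])
        if gain > best[1]:
            best = (t, gain)
    return best


# Toy data (hypothetical): one feature, two classes.
feature = [1, 2, 3, 10, 11, 12]
classes = ["normal", "normal", "normal", "attack", "attack", "attack"]
threshold, decrease = best_threshold(feature, classes)
```

On this toy data the split at 3 separates the two classes completely, so both child nodes have zero impurity and the decrease equals the impurity of the parent node.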
IV. TREENET
In a TreeNet model, classification and regression models are
built up gradually through a potentially large collection of
small trees. Models typically consist of a few dozen to
several hundred trees, each normally no larger than two to
eight terminal nodes. The model is similar to a long series
expansion (such as a Fourier or Taylor series): a sum of
factors that becomes progressively more accurate as the
expansion continues. The expansion can be written as
[11,13]:
$$ F(X) = F_0 + \beta_1 T_1(X) + \beta_2 T_2(X) + \dots + \beta_M T_M(X) $$
where each $T_i$ is a small tree.
Each tree improves on its predecessors through an
error-correcting strategy. Individual trees may be as small as
one split, but the final models can be accurate and are
resistant to overfitting.
V. DATA USED FOR ANALYSIS
A subset of the DARPA intrusion detection data set is used
for offline analysis. In the DARPA intrusion detection
evaluation program, an environment was set up to acquire
raw TCP/IP dump data for a network by simulating a
typical U.S. Air Force LAN. The LAN was operated like a
real environment, but was blasted with multiple attacks
[14,15]. For each TCP/IP connection, 41 quantitative and
qualitative features were extracted [16] for intrusion
analysis.
The 41 features extracted fall into three categories:
“intrinsic” features that describe the individual TCP/IP
connections and can be obtained from network audit trails;
“content-based” features that describe the payload of the
network packet and can be obtained from the data portion
of the network packet; and “traffic-based” features that are
computed using a specific window (connection time or
number of connections). As DoS and Probe attacks involve
several connections in a short time frame, whereas R2L and
U2Su attacks are embedded in the data portions of the
connection and often involve just a single connection,
“traffic-based” features play an important role in deciding
whether a particular network activity is engaged in probing
or not.
Attack types fall into four main categories:
• Denial of Service (DoS) Attacks: A denial of service
  attack is a class of attacks in which an attacker makes
  some computing or memory resource too busy or too
  full to handle legitimate requests, or denies legitimate
  users access to a machine. Examples are Apache2,
  Back, Land, Mail bomb, SYN Flood, Ping of death,
  Process table, Smurf, Syslogd, Teardrop, Udpstorm.
• User to Superuser or Root Attacks (U2Su): User to
  root exploits are a class of attacks in which an attacker
  starts out with access to a normal user account on the
  system and is able to exploit a vulnerability to gain root
  access to the system. Examples are Eject, Ffbconfig,
  Fdformat, Loadmodule, Perl, Ps, Xterm.
• Remote to User Attacks (R2L): A remote to user
  attack is a class of attacks in which an attacker who
  does not have an account on a machine sends packets
  to that machine over a network and exploits some
  vulnerability to gain local access as a user of that
  machine. Examples are Dictionary, Ftp_write, Guest,
  Imap, Named, Phf, Sendmail, Xlock, Xsnoop.
• Probing (Probe): Probing is a class of attacks in which
  an attacker scans a network of computers to gather
  information or find known vulnerabilities. An attacker
  with a map of machines and services that are available
  on a network can use this information to look for
  exploits. Examples are Ipsweep, Mscan, Nmap, Saint,
  Satan.
In our experiments, we perform 5-class classification. The
(training and testing) data set contains 11982 randomly
generated points from the data set representing the five
classes, with the number of data from each class
proportional to its size, except that the smallest class is
completely included. The set of 5092 training data and
6890 testing data is divided into five classes: normal,
probe, denial of service attacks, user to super user, and
remote to local attacks. The attack data is a collection of
22 different types of instances that belong to the four
attack classes described above, and the rest is the normal
data. Note that two separate randomly generated data sets,
of sizes 5092 and 6890, are used for training and testing
MARS, CART, and TreeNet respectively. Section VI
summarizes the classifier accuracies.
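The sampling scheme described above (each class drawn in proportion to its size, with the smallest class included in full) could be sketched as follows. This is only one plausible reading of that description; the function name, the rounding rule, the fixed seed, and the toy class sizes are assumptions, and the actual procedure used to build the 11982-point set is not specified in this paper.

```python
import random


def proportional_sample(records_by_class, total, seed=0):
    """Draw about `total` records: the smallest class is kept whole and the
    other classes are sampled in proportion to their sizes."""
    rng = random.Random(seed)
    smallest = min(records_by_class, key=lambda c: len(records_by_class[c]))
    sample = {smallest: list(records_by_class[smallest])}
    remaining = total - len(sample[smallest])
    others = {c: r for c, r in records_by_class.items() if c != smallest}
    pool = sum(len(r) for r in others.values())
    for name, recs in others.items():
        # Rounding means the final size can differ slightly from `total`.
        k = min(len(recs), round(remaining * len(recs) / pool))
        sample[name] = rng.sample(recs, k)
    return sample


# Toy class sizes (hypothetical), not the DARPA class counts.
toy = {
    "normal": list(range(1000)),
    "dos": list(range(4000)),
    "u2su": list(range(10)),
}
picked = proportional_sample(toy, total=110)
sizes = {name: len(recs) for name, recs in picked.items()}
```

Keeping the rarest class whole avoids ending up with only a handful of its instances, which proportional sampling alone would produce for very small classes.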
VI. ROC CURVES
Detection rates and false alarms are evaluated for the
five-class pattern in the DARPA data set and the obtained
results are used to form the ROC curves. The point (0,1)
is the perfect classifier, since it classifies all positive
cases and negative cases correctly. Thus an ideal
system will start by identifying all the positive
examples, so the curve will rise to (0,1)
immediately, having a zero rate of false positives,
and then continue along to (1,1).
Figures 2 to 6 show the ROC curves of the detection
models by attack categories as well as on all intrusions. In
each of these ROC plots, the x-axis is the false positive
rate, calculated as the percentage of normal connections
considered as intrusions; the y-axis is the detection rate,
calculated as the percentage of intrusions detected. A data
point in the upper left corner corresponds to optimal high
performance, i.e., high detection rate with low false alarm
rate. The areas under the ROC curves and the numbers of
false positives and false negatives are presented in Tables 2
to 6.
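To make the construction of such a curve concrete, the sketch below computes ROC points (false positive rate, detection rate) from classifier scores and estimates the area under the curve with the trapezoidal rule. The function names and the example scores are hypothetical, and this is not necessarily how the areas in Tables 2 to 6 were computed.

```python
def roc_points(scores, labels):
    """Return ROC points (false positive rate, detection rate) from (0,0) to (1,1).

    labels: 1 for intrusion (positive), 0 for normal (negative).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Process tied scores together so they produce a single point.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / negatives, tp / positives))
        i = j
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points ordered by x."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical scores: higher means "more likely an intrusion".
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
auc = area_under_curve(curve)
```

A classifier that ranks every intrusion above every normal connection yields a curve through (0,1) and an area of 1; each normal connection ranked above an intrusion pulls the area below 1.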
Table 2. Summary of classification accuracy for normal.
Model     Curve Area   False Positives   False Negatives
MARS      0.993        56                4
CART      0.991        75                5
TreeNet   0.997        18                0
Figure 2. Classification accuracy for normal (ROC curves for MARS, CART, and TreeNet; x-axis: "Specificity (false positives)", y-axis: "Sensitivity (true positives)")
Table 3. Summary of classification accuracy for probe.
Model     Curve Area   False Positives   False Negatives
MARS      0.777        64                305
CART      0.998        24                0
TreeNet   0.999        14                0
Figure 3. Classification accuracy for probe (ROC curves for MARS, CART, and TreeNet; x-axis: "Specificity (false positives)", y-axis: "Sensitivity (true positives)")
Table 4. Summary of classification accuracy for DoS.
Model     Curve Area   False Positives   False Negatives
MARS      0.945        185               169
CART      0.998        1                 16
TreeNet   0.998        3                 9
Figure 4. Classification accuracy for DoS (ROC curves for MARS, CART, and TreeNet; x-axis: "Specificity (false positives)", y-axis: "Sensitivity (true positives)")
Table 5. Summary of classification accuracy for U2Su.
Model     Curve Area   False Positives   False Negatives
MARS      0.700        3                 15
CART      0.720        3                 14
TreeNet   0.699        7                 16
Figure 5. Classification accuracy for U2Su (ROC curves for MARS, CART, and TreeNet; x-axis: "Specificity (false positives)", y-axis: "Sensitivity (true positives)")
Table 6. Summary of classification accuracy for R2L.
Model     Curve Area   False Positives   False Negatives
MARS      0.992        17                7
CART      0.993        15                6
TreeNet   0.992        19                7
Figure 6. Classification accuracy for R2L (ROC curves for MARS, CART, and TreeNet; x-axis: "Specificity (false positives)", y-axis: "Sensitivity (true positives)")
VII. CONCLUSIONS
A number of observations and conclusions are drawn from
the results reported in this paper:
• TreeNet easily achieves high detection accuracy
  (higher than 99%) for each of the 5 classes of DARPA
  data. TreeNet performed the best for normal with 18
  false positives (FP) and 0 false negatives (FN), probe
  with 14 FP and 0 FN, and denial of service attacks
  (DoS) with 3 FP and 9 FN.
• CART performed the best for user to super user
  (U2Su) with 3 FP and 14 FN and remote to local
  (R2L) with 15 FP and 6 FN.
We demonstrate that using these fast execution machine
learning methods we can achieve high classification
accuracies in a fraction of the time required by the
well-known support vector machines and artificial neural
networks.
We note, however, that the differences in accuracy figures
tend to be small and may not be statistically significant,
especially in view of the fact that the 5 classes of patterns
differ tremendously in their sizes. More definitive
conclusions perhaps can only be drawn after analyzing
more comprehensive sets of network data.
ACKNOWLEDGEMENTS
Partial support for this research received from ICASA
(Institute for Complex Additive Systems Analysis, a
division of New Mexico Tech), a DoD IASP grant, and an
NSF SFS Capacity Building grant is gratefully acknowledged.
REFERENCES
1. S. Mukkamala, G. Janowski, A. H. Sung, Intrusion
   Detection Using Neural Networks and Support Vector
   Machines. Proceedings of IEEE International Joint
   Conference on Neural Networks 2002, IEEE Press, pp.
   1702-1707, 2002.
2. M. Fugate, J. R. Gattiker, Computer Intrusion
   Detection with Classification and Anomaly Detection,
   Using SVMs. International Journal of Pattern
   Recognition and Artificial Intelligence, Vol. 17(3), pp.
   441-458, 2003.
3. W. Hu, Y. Liao, V. R. Vemuri, Robust Support Vector
   Machines for Anomaly Detection in Computer
   Security. International Conference on Machine
   Learning, pp. 168-174, 2003.
4. K. A. Heller, K. M. Svore, A. D. Keromytis, S. J.
   Stolfo, One Class Support Vector Machines for
   Detecting Anomalous Windows Registry Accesses.
   Proceedings of IEEE Conference Data Mining
   Workshop on Data Mining for Computer Security,
   2003.
5. A. Lazarevic, L. Ertoz, A. Ozgur, J. Srivastava, V.
   Kumar, A Comparative Study of Anomaly Detection
   Schemes in Network Intrusion Detection. Proceedings
   of Third SIAM Conference on Data Mining, 2003.
6. S. Mukkamala, A. H. Sung, Feature Selection for
   Intrusion Detection Using Neural Networks and
   Support Vector Machines. Journal of the
   Transportation Research Board of the National
   Academies, Transportation Research Record No. 1822,
   pp. 33-39, 2003.
7. S. J. Stolfo, F. Wei, W. Lee, A. Prodromidis, P. K.
   Chan, Cost-based Modeling and Evaluation for Data
   Mining with Application to Fraud and Intrusion
   Detection. Results from the JAM Project, 1999.
8. S. Mukkamala, B. Ribeiro, A. H. Sung, Model
   Selection for Kernel Based Intrusion Detection
   Systems. Proceedings of International Conference on
   Adaptive and Natural Computing Algorithms
   (ICANNGA), Springer-Verlag, pp. 458-461, 2005.
9. T. Hastie, R. Tibshirani, J. H. Friedman, The Elements
   of Statistical Learning: Data Mining, Inference, and
   Prediction. Springer, 2001.
10. L. Breiman, J. H. Friedman, R. A. Olshen, C. J.
    Stone, Classification and Regression Trees. Wadsworth
    and Brooks/Cole Advanced Books and Software,
    1986.
11. Salford Systems. TreeNet, CART, MARS Manual.
12. R. J. Lewis, An Introduction to Classification and
    Regression Tree (CART) Analysis. Annual Meeting of
    the Society for Academic Emergency Medicine, 2000.
13. J. H. Friedman, Stochastic Gradient Boosting. Journal
    of Computational Statistics and Data Analysis,
    Elsevier Science, Vol. 38, pp. 367-378, 2002.
14. K. Kendall, A Database of Computer Attacks for the
    Evaluation of Intrusion Detection Systems. Master's
    Thesis, Massachusetts Institute of Technology (MIT),
    1998.
15. S. E. Webster, The Development and Analysis of
    Intrusion Detection Algorithms. Master's Thesis, MIT,
    1998.
16. W. Lee, S. J. Stolfo, A Framework for Constructing
    Features and Models for Intrusion Detection Systems.
    ACM Transactions on Information and System
    Security, Vol. 3, pp. 227-261, 2000.