Design of a Network-Based Anomaly Detection System using VFDT Algorithm
M. Salamah and N. Alwan
Computer Engineering Department, Eastern Mediterranean University, TRNC, Mersin 10, TURKEY
{[email protected], [email protected] }
Abstract: With the tremendous growth of network-based services
and sensitive information on networks, network security is getting
more and more importance than ever. Intrusion poses a serious
security risk in a network environment. The ever growing new
intrusion types have a serious problem for their detection. In this
paper, we apply one of the efficient data mining algorithms called
Very Fast Decision Tree (VFDT) for anomaly based network
intrusion detection. Experimental results on the Knowledge
Discovery Data Mining KDDcup’99 data set show that our
approach is highly effective in detecting network intrusion. It is
observed that the proposed technique performs better in terms of
detection rate, false alarm rate, and computational time compared
with other techniques.
Keywords: Very Fast Decision Tree Algorithm, Classifier,
Network Security, Intrusion Detection System, KDD CUP99
dataset.
I. INTRUDUCTION
In our society, information is becoming increasingly
dependent on rapid access and interactive processing. As
this demand increases, more information is being stored on
computers. The proliferation of inexpensive computers and
computer networks has worsened the problem of
unauthorized access and tampering with data. With an
increased understanding of how systems work, intruders
have become highly skilled at diagnosing weaknesses in
systems, and exploiting them to obtain such privileges that
they can do anything on the system. They also use patterns
that are difficult to trace and identify. Computer systems are
therefore not likely to remain safe in the nearest future
because of the intruder’s acts. Therefore, we must have
measures in place to detect security breaches, i.e., identify
intruders and their methods of intrusion [1].
An Intrusion Detection System (IDS) inspects the activities
in a system for suspicious behavior or patterns that may
indicate system attack or misuse. There are two main
categories of intrusion detection techniques; Anomaly
detection and Misuse detection [2]. The former analyses the
information gathered and compares it to a defined baseline
of what is seen as “normal” service behavior, so it has the
ability to learn how to detect network attacks that are
currently unknown. Misuse Detection is based on signatures
for known attacks, so it is only as good as the database of
attack signatures that it uses for comparison. Misuse
detection has low false alarm rate, but cannot detect novel
attacks. However, anomaly detection can detect unknown
attacks, but has a bit higher false alarm rate. In this paper,
we review the performance of classifiers when trained to
identify signatures of specific attacks.
The simulated attacks are classified as follows [3]:
Denial of Service Attack (DoS): this is an attack which
makes some computing or memory resource too busy or too
full to handle legitimate requests, or denies legitimate users
access to a machine.
Probing Attack (prob): this is an attempt to gather
information about a network of computers for the apparent
purpose of circumventing its security controls.
User to Root Attack (U2R): this is a class of exploit in
which the attacker starts out with the access to a normal
user’s account on the system (perhaps gained by sniffing
passwords, a dictionary attack, or social engineering) and is
able to exploit some vulnerability to gain root access to the
system.
Remote to Local Attack (R2L): it occurs when an attacker
has the ability to send packets to a machine over a network,
but does not have an account on that machine to exploit
some vulnerability to gain local access as a user of that
machine.
In [4], Audit Data Analysis and Mining (ADAM) is an
intrusion detector system built to detect intrusions using data
mining techniques. It first absorbs training data known to be
free of attacks. Next, it uses an algorithm to group attacks,
unknown behavior, and false alarms. ADAM has several
useful capabilities, namely;
- Classifying an item as a known attack
- Classifying an item as a normal event,
- Classifying an item as an unknown attack,
The Intrusion Detection using Data Mining Technique
(IDDM) [5] is a real-time network IDS (NIDS) for misuse
and anomaly detection. It applies association rules, Meta
rules, and characteristic rules. It employs data mining to
produce description of network data and uses this
information for deviation analysis. Mining Audit Data for
Automated Models for Intrusion Detection (MADAM ID)
[6] is one of the best known data mining projects in
intrusion detection. It is an off-line IDS to produce anomaly
and misuse intrusion detection models. Association rules
and frequent episodes are applied in MADAM ID to replace
hand-coded intrusion patterns and profiles with the learned
rules.
In [7], the author proposes a method of intrusion detection
using an evolving fuzzy neural network. This type of
learning algorithm combines Artificial Neural Network
(ANN) and Fuzzy Inference Systems (FIS), as well as
evolutionary algorithms. They create an algorithm that uses
fuzzy rules and allow new neurons to be created in order to
accomplish this. They use Snort to gather data for training
the algorithm and then compare their technique with that of
an augmented neural network.
In [8], a statistical neural network classifier for anomaly
detection is developed, which can identify UDP flood
attacks. Comparing different neural network classifiers, the
Back Propagation Neural Network (BPN) has shown to be
more efficient in developing IDS. In [9], the author uses the
back propagation method by Sample Query and Attribute
Query for the Intrusion Detection, whereby analyzing and
identifying the most important components of training data.
It could reduce processing time, storage requirement, etc.
In [10], the authors combine multiple host-based detectors
using decision tree. They use conventional measures for
intrusion detection and modeling methods appropriate to
each measure. System calls, resource usage and file access
events are used to measure user’s behavior and Hidden
Markov model. Statistical Rule-base method is used to
model these measures which are combined with diction tree.
The proposed detection method is expected to have a better
performance because it can model normal behaviors from
various perspectives. The decision tree used here C4.5
algorithm.
This paper gives a comparative study of several anomaly
detection schemes for identifying novel network intrusion
detections. We present experimental results on Knowledge
Discovery Data Mining KDDCup’99 data set. The
experimental results have demonstrated that using the
VFDT classifier model is much more efficient in the
detection of network intrusions, compared with other
classification techniques.
The remainder of this paper is organized as follow. Section
II presents the proposed Network-based Anomaly Detection
System (NADS), and also gives general information about
IDS. Section III presents the experimental results. Finally,
section IV concludes the paper.
II. THE PROPOSED NADS SYSTEM
The proposed NADS system has the possibility of working
in either online or offline modes. This depends on the type
of features used to train and test the system. The online
mode works on the basic features of an extraction using the
header of packets only. (e.g. protocol type, service, flag,
src_bytes, dst_bytes). These features are organized as
connection records to enter into the model directly to
determine the nature of the connection as soon as possible,
and give the appropriate response. The offline mode works
on the basic statistical features extracted from the header
and the content of the packets. The statistical features are
determined when the session is terminated, and they are
organized as connection records. A large number of packets
are needed to identify accurate connections, and then store
them in the file in the form of connection records in order to
be entered into the classified model to test their classes as
normal. This mode gives accurate results, but at the expense
of time. Both modes requires adequate connection
information to train the proposed model. Therefore the
system working update gets information at any time to
obtain a sufficient number of connections to build a decision
tree for a classification of attacks. VFDT is a highperformance data mining algorithm based on Hoeffding
trees. Many classification learning methods have been
proposed, of which the decision tree learning method is
commonly used, because it is fast and the description of
classifiers that it derives is easily understood. As data
arrives, this data stream grows gradually while the data is
classified [6]. VFDT allows the use of either information
gain or the Gini index as attribute evaluation measure. It
includes a number of refinements to the algorithm [6].
VFDT does not accumulate examples in main memory,
because it can gradually grow without waiting for the arrival
of all the examples [7].
The architecture of the Network-based Anomaly Detection
System (NADS) is shown in the Figure 1. Although
different NADS approaches exist, but all of them has these
basic modules or stages [7]:
Parameterization: The representation of the observed
instances in the target system as a pre-established from.
Training stage: The process of educating the system and
grouping the connections as normal or abnormal, and the
building of the corresponding model.
Detection stage: A comparison between the models for the
system built with the parameterized or observed traffic. If
any deviation found exceeds a given threshold, an alarm will
be announced.
Table 1: Confusion Matrix (CM) [10].
Normal
Intrusion (Attack)
Normal
TN
FP
Intrusion (Attack)
FN
TP
The performance metrics used are: the detection rate (DR),
the false alarm rate (FAR), the total training time (TT), and
the true positive percentage (TP%).
The DR is the percentage of the truly (correctly) classified
attacks as shown in equation (1). The FAR is the percentage
of wrongly classified normal connections as shown in
equation (2). The TT is the total time used for training of the
data set.
Fig.1. Network-based NADS Anomaly Detection System
 =

+
III. EXPERIMENTS AND RESULTS
In this section, we summarize our experimental results in
detecting network intrusion using the VFDT algorithm over
KDDCup’99 data set. We first describe the data set used in
the experiments and then discuss the results obtained.
Finally, we evaluate our approach and compare the results
with the results obtained by other researchers.
We carried out four experiments on KDD CUP99 dataset in
order to get the best results. The conditions of these
experiments are as follows:
Experiment 1: We used the whole KDD CUP99 dataset for
training, but we selected 20 features only.
Experiment 2: We used the whole KDD CUP99 dataset for
training, but we choose 15 features.
Experiment 3: We used 10% of the KDD CUP99 dataset
for training, and we selected 15 or 20 features.
Experiment 4: We used the whole KDD CUP99 dataset for
training, but we choose only the basic features.
The Confusion Matrix (CM) shown in Table 1 is used to
measure of performances of IDS systems and it has the
following entries:
1.
2.
3.
4.
True Positive (TP): Number of connections that
were correctly classified as attacks.
True Negative (TN): Number of connections
that were correctly classified as normal.
False Positive (FP): Number of normal
connections that were classified as attacks.
False Negative (FN): Number of attack
connections that were classified as normal.
 =

+
(1)
(2)
For the attacks DoS, probe, U2R, and R2L, the True
Positive percentage (TP %) is defined as the number of the
related attack TP’s over the total number of TP's.
The results of these experiments can be seen in Table 2.
The first experimental result indicated that the proposed
NADS system achieved a DR percent of 93.83 % for attacks
by using the VFDT algorithm, whereas the achieved FAR
rate is 0.608%. In addition, the TP% for DoS, Probe, U2R,
and R2L is 93.10%, 79.04%, 22.84%, and 36.64%
respectively.
The result of the second experiment is close to the first
experiment, although the detection rate for U2R becomes
zero.
The third experimental result indicated that the proposed
NADS system achieved a DR percent of 95.53 % and a FAR
percent of 0.993%. In addition, the TP% for DoS, Probe,
U2R, and R2L is 96.4%, 24.2%, 75.6%, and 81.6%
respectively.
The fourth experimental result indicated that the proposed
NADS system achieved a DR percent of 90.34 % and a FAR
rate of 0.78%. In addition, the TP% for DoS, Probe, U2R,
and R2L is 92.67%, 75.4%, 0%, and 31.32% respectively.
Table 2: Performance metrics for the four experiments
Experiment DR% FAR%
DoS
Probe U2R
No.
TP%
TP%
TP%
1
93.83
0.608
93.10 79.04 22.84
2
92.56
0.720
88.10 80.00
0
3
95.53
0.993
96.40 24.20 75.60
4
90.34
0.780
92.67 75.40
0
R2L
TP%
36.64
33.60
81.60
31.32
The following parameters are discussed to clarify our results
precisely in terms of system accuracy, classification speed,
and memory allocation.
A. System Accuracy Results
Receiver Operating Characteristic (ROC) which is a
procedure, derived from statistical decision theory that was
developed in the context of electronic signal detection. It is
used to evaluate the predictive ability of classifiers. ROCs
are plotted in coordinates which are spanned by the rates of
DR, and FAR classification.
The proposed NADS system has a high accuracy
performance when compared with other systems. The ROC
curve of the system is shown in Figure 2. The goal is to
detect many attacks while minimizing the generation of
FAR. It is clear from Figure 2 that our system is able to
detect most of the attacks with a DR rate of 93.83%, and at a
low FAR rate of 0.608%.
Fig.3. Training time versus number of connections
Fig.4. Number of created nodes versus number of connections
Fig.2.The ROC curve for the NADS system
B. Classification Speed Results
Figure 3 shows the training time versus number of
connections. The NADS system uses 100,000 connections at
time 4.09s, whereas it uses1000, 000 connections at time
38.97s.
C. Memory Allocation Results
Figure 4 shows the number of nodes which are created by
using the VFDT algorithm. The unexpected increasing at the
beginning of the chart represents the beginning of the
construction of a tree. The sufficient number of connections
is collected to be able to classify all types of attacks in the
same tree. It continues in increasing gradually, while the
number of nodes is almost stable.
Finally, table 2. Shows the comparison of the proposed
system with other systems in terms of DR and TT. The
results indicated that the proposed NADS system achieved a
high classification accuracy rate of 93% by using the VFDT
algorithm. It is the highest performance compared with all
other algorithm except the Genetic Programming (GP)
algorithm where it has a higher DR rate of 98%. The speed
of training (building and testing) didn't exceed 39.88
seconds for the proposed system, whereas it is in terms of
hours for others systems.
Table 2: Comparison of the proposed NADS system with
other systems
system
DR%
TT in sec
NADS
93.825
39.88
ANN
92.268
780
FL
91.25
87.9
BN
90.62
6.28
MLP
92.03
350.15
SOM
91.65
192.16
SVM
57.6
1040.4
GP
98
6480
IV. CONCLUSION
In this paper, we have proposed a Network-based Anomaly
Detection System (NADS) which helps to take insidious
attacks under control. This technique works in two modes,
on-line and offline modes. The proposed NADS system uses
the Very Fast Decision Tree algorithm (VFDT), and it
classifies the connections as normal or abnormal. It is worth
mentioning that the proposed NADS system outperforms
other schemes in terms of training time per example since it
has the lowest time. The proposed system was able to detect
most of the attacks with a DR rate of 93.83%, and at a low
FAR rate of 0.608%.
As a future work, this study can be extended in different
aspects such as:
-
-
The proposed NADS system used the VFDT algorithm for
detection. Hence, it can be analyzed using other detection
algorithms as well.
As the proposed NADS system uses “anomaly detection
technique”, it can be modified to use the misuse technique
as well.
REFERENES
[1] Sandeep k., “classification and detection of computer
intrusions”, PhD Thesis, Purdue University, August 1995.
[2] Ali A., Wei L and Mahbod T., “Network Intrusion detection
and prevention concepts and Techniques”, Springer Media, New
York LLC, 2010.
[3] Terry E., “intrusion detection: network security Beyond the
Firewall,” John Wiley and sons, New York, USA, 1998.
[4] Andrew G. Blank, “TCP/IP Jumpstart: internet protocol
Basics”,2nd Edition, john wiley and Sons, 2002.
[5] Hamdan o. Alanazi, Rafidah Md. Noor, B, b. Zaidan and A. A.
Zaidan,”intrusion detection system: Overview”, Journal of
computing, Vol. 2, Issue 2, pp.32-48, February 2010.
[6] Tasuya M. Ayahiko N., “ detection of Fraud use of credit card
by Extended VFDT ,” IEEE Transactions on knowledge and data
Engineering(TKDE), Vol. 21, No. 11, pp. 1505-1514, 2011.
[7] Zhi-Hua Z., Hang L. and Qiang Y., “Advances is Knowledge
Discovery and Data Mining,” 11th Pacific-Asia Conference,
PAKDD, Springer, China, Vol. 4426, 2007.
[8] Wenke L., Salvatore J. Stolfo and Kui W. Mok, “A Data
Mining Framework for Building intrusion detection Models,“
IEEE Symposium on Security and Privacy, pp. 120-132, 1999.
[9] Pang-Ning T., Michael S. and Vipin K., “introduction to data
mining”, 2nd Edition, Addison Wesley, March 2006.
[10] Sang-Jun H., sung-Bae C., “combining Multiple Host-Based
Detectors using decision tree,” 16th Australian conference on
Artificial intelligence, Springer, Vol. 2903, pp.208-220, 2003.
Muhammed SALAMAH received the B.S., M.S.,
and PhD. degrees in Electrical and Electronics
Engineering from Middle East Technical University
in 1988, 1990, and 1995, respectively. From 1988 to
1996, he worked as a Research Assistant and
Lecturer in METU. In 1995, he worked as a visiting
researcher in the Networks Department of Ecole
Nationale Superieure des Telecommunications in
Paris, France. Since 1996, he has been with Department of Computer
Engineering at Eastern Mediterranean University, TRNC, Turkey, where he
is an Assoc. Professor. He served as the Assistant Chairman of the
Computer Engineering Department of EMU from 1997 to 2004; the chair
of the Student Project and Design Center in EMU from 2005 to 2008;
Rectorate Coordinator for Student Affairs and Informatics from 2008 to
2010; head of the Computer Engineering Department of EMU from 2010 to
2013, and since 2013 he has been serving as the Rector's Advisor for
Middle East in EMU. His current research interests are computer networks
and wireless networks with emphasis on resource allocation and
management, performance evaluation, multiprocessing, and hardwareoriented algorithms.
Naseer Alwan Hussein received the MS degree
in Computer Engineering from Eastern
Mediterranean University, TRNC, Turkey, in
2014. His current research interests are in
wireless.
Communications,
molecular
communications, bio inspired communications,
and nano networks.