Detecting Intrusions in the Cloud Using Bayesian
Methods of Anomaly Detection
Andre Torres
Edinboro University of Pennsylvania
Edinboro, PA, USA
Email: andrentorres@gmail.com
Abstract—Cloud computing has recently taken off in IT, but
with it come new security concerns that demand precautions.
By detecting anomalies indicative of intrusions before or as
they happen, otherwise irreversible damage can be stopped in
its tracks. Intrusion detection systems (IDSs) use a multitude
of methods, though many are ill-suited to cloud computing
because of the immense resources they require or their
inability to detect new types of intrusions. Using a
Tree-Augmented Naïve (TAN) Bayesian network, which has low
computational requirements, we can train a supervised machine
learning algorithm that additionally considers dependencies
among attributes to analyze network packets and detect both
known and new types of intrusions.
Index Terms—cloud computing, anomaly detection, Bayesian
I. INTRODUCTION
In recent years, the concept of moving “to the cloud” has taken
hold in the IT industry, changing the playing field by giving
anyone in the world access to whatever technological resources
they may need. A lone developer is given the ability to
innovate, a small business a doorway to the world, a worldwide
corporation a way to connect its branches to a global center;
the possibilities of what we have come to know as cloud
computing, defined by the National Institute of Standards and
Technology (NIST) as “a model for enabling ubiquitous,
convenient, on-demand network access to a shared pool of
configurable computing resources that can be rapidly
provisioned and released with minimal management effort or
service provider interaction” [1], seem all but infinite.
With this new capability, unfortunately, come new concerns, as
network-based attacks increase in both frequency and severity.
Anomalies, deviations from constructed profiles of the normal
behavior of systems and their users, are often indicative of a
potential attack. Employing anomaly detection lets us catch new
attacks as they happen, though this is harder than it seems: a
system that raises false alarms on a recurring basis wastes time
and labor on investigation. Still, by detecting intrusions as
they happen rather than in some offline system, or worse, after
the intrusion has occurred, damage can be prevented before it is
done.
II. BACKGROUND
Today's intrusion detection systems (IDSs) typically use
supervised machine learning techniques such as data mining,
fuzzy logic, genetic algorithms, neural networks, and support
vector machines to identify intrusions [2].
Common IDS types include network IDSs (NIDSs) which
investigate incoming and outgoing network traffic, host-based
IDSs (HBIDSs) which audit internal interfaces related to the
machine, protocol-based IDSs (PIDSs) which monitor and
analyze the HTTP protocol stream, application protocol-based
IDSs (APIDSs) which look for the correct use of specific
application or process protocols, and hybrid IDSs (HIDSs)
which combine two or more intrusion detection approaches [2].
One anomaly detection strategy incorporated by IDSs uses the
Iterative Dichotomiser 3 (ID3) technique to generate a decision
tree from a dataset, selecting the attributes that give the
highest information gain [2]. The idea is that the amount of
information associated with an attribute value relates to the
probability that some occurrence may happen, and the objective
is to iteratively separate the dataset into subsets until all
elements in each final subset are of the same class [2].
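As a sketch of the information-gain criterion that ID3 relies on, the following Python fragment (illustrative only; the attribute layout and toy records are our own, not from [2]) computes the entropy reduction obtained by splitting a dataset on one attribute:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(records, labels, attr_index):
    """Entropy reduction gained by splitting the records on one attribute."""
    subsets = {}
    for rec, lab in zip(records, labels):
        subsets.setdefault(rec[attr_index], []).append(lab)
    remainder = sum(len(s) / len(labels) * entropy(s)
                    for s in subsets.values())
    return entropy(labels) - remainder

# Toy connection records (protocol, flag) labeled normal/attack
records = [("tcp", "SF"), ("tcp", "REJ"), ("udp", "SF"), ("tcp", "REJ")]
labels = ["normal", "attack", "normal", "attack"]
print(information_gain(records, labels, 1))  # flag separates the classes: 1.0
```

ID3 would split first on the attribute with the largest gain (here the flag, which separates the two classes perfectly) and recurse on each subset.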
Another common approach that many IDSs employ is data mining
via cluster analysis. Clusters are sets of objects organized
such that objects in the same group are more similar to each
other than to those in other groups. Methods in cluster
analysis identify sparse regions in point cloud data as
starting points for an anomaly search [3]. Variants include
distance-based methods, where a data point far away from its
neighbors is an anomaly, and density-based methods, where the
idea is the same but anomalies are sought in less dense regions
[3]. The issue with these cluster methods is the amount of
context that gets ignored. For instance, suppose a set of data
points is graphed with income on the X-axis and expenditure on
the Y-axis, forming four clusters: O1 (small; low income, high
expenditure), O2 (large; low to medium income, low to medium
expenditure), O3 (moderate; large income, small expenditure),
and O4 (small; large income, large expenditure). A
distance-based method would flag O4 as anomalous, while the
points that are truly suspicious in context, those in O1, would
be missed. In addition, the immense number of calculations
required to determine the distance from a new data point to
existing clusters or to hundreds, maybe thousands, of other
data points is hardly ideal for active anomaly detection in
cloud computing, given the resources constantly consumed. For
these reasons, we turn to a reliable, low-resource solution: an
algorithm that uses a Bayesian network to determine anomalies.
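The distance-based variant described above can be sketched as a simple neighbor count; the radius and neighbor threshold below are hypothetical parameters for illustration, not values taken from [3]:

```python
import math

def is_distance_outlier(point, data, radius, min_neighbors):
    """Flag a point as anomalous when fewer than min_neighbors other
    points lie within radius of it (the distance-based criterion)."""
    neighbors = sum(1 for other in data
                    if other is not point
                    and math.dist(point, other) <= radius)
    return neighbors < min_neighbors

# (income, expenditure) points: one tight cluster plus one stray point
cluster = [(30.0, 28.0), (31.0, 30.0), (29.0, 29.0), (30.5, 29.5)]
stray = (90.0, 90.0)
points = cluster + [stray]
print(is_distance_outlier(stray, points, radius=5.0, min_neighbors=2))       # True
print(is_distance_outlier(cluster[0], points, radius=5.0, min_neighbors=2))  # False
```

Even this toy version shows the cost problem: every query point is compared against every other point, which scales poorly for continuous monitoring.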
Bayes Rule, shown in Equation 1, is fundamental to the
formation of a Bayesian network and provides a way to
calculate the probability of some hypothesis based on its prior
probability, where the most probable hypothesis is the best
hypothesis [4]. Observed data D is taken into consideration in
addition to any initial knowledge of the prior probabilities of
the various hypotheses h.
Bayes Rule:   P(h|D) = P(D|h) P(h) / P(D)      (1)
P(h|D), the probability of some event after the relevant
evidence is taken into account, is determined using the formula
with P(h), the prior probability associated with hypothesis h,
P(D), the probability of the occurrence of data D, and P(D|h),
the probability of an event given another event [4]. What
concerns us in our research is a space of candidate hypotheses,
H, and the most probable hypothesis h belonging to H given the
data D, known as the maximum a posteriori (MAP) hypothesis [4].
Equation 2 gives the related Maximum Likelihood (ML)
hypothesis: any hypothesis that maximizes the likelihood
P(D|h) of the data D given h.

h_ML ≡ argmax_{h ∈ H} P(D|h)      (2)
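As a worked illustration of Equations 1 and 2, the following Python sketch (all probabilities are made-up numbers, not measurements) computes a posterior via Bayes Rule and contrasts the ML hypothesis, which ignores the prior, with the MAP hypothesis, which weights it in:

```python
def posterior(prior_h, likelihood, evidence):
    """Bayes Rule: P(h|D) = P(D|h) P(h) / P(D)."""
    return likelihood * prior_h / evidence

# Made-up numbers: h = "connection is an intrusion",
# D = "packet stream shows an unusual burst of urgent packets".
print(posterior(0.01, 0.5, 0.02))  # posterior P(h|D) = 0.25

# ML picks argmax P(D|h); MAP also weights in the prior P(h).
H = {"intrusion": {"prior": 0.01, "lik": 0.5},
     "normal":    {"prior": 0.99, "lik": 0.015}}
h_ml = max(H, key=lambda h: H[h]["lik"])
h_map = max(H, key=lambda h: H[h]["lik"] * H[h]["prior"])
print(h_ml, h_map)  # intrusion normal
```

The two answers differ because intrusions are rare: the likelihood alone favors "intrusion", but once the small prior is factored in, "normal" remains the more probable explanation unless the evidence is strong.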
Pseudo-Bayes estimators, which follow Bayes Rule, are a
technique in discrete multivariate analysis used to provide
estimated cell values for contingency tables that may contain a
large number of sampling zeros [4]. Sampling zeros mislead: for
example, reporting 0/5 and 0/500 both as zero hides the very
different evidence the two ratios carry. Pseudo-Bayes
estimators can be built in a multitude of ways [4], but their
goal here is to build a naïve Bayesian (NB) classifier. NB
classifiers, although they rely on over-simplified
independence assumptions, are often well suited to complex
real-world situations because of how efficiently they can be
trained in a supervised learning environment [2]. One
limitation is that NB classifiers cannot provide accurate
metric attribution information [5]. Adding directed edges
between attributes in our Bayesian network, thereby taking
dependencies among attributes into account, allows us to
classify system states as normal or abnormal and to produce a
ranked list of the metrics most indicative of the anomaly [5].
With the supervised NB classifier alone we could detect only
recurring anomalies; new anomalies can be discovered by
including data on dependencies among attributes, creating a
Tree-Augmented Naïve (TAN) Bayesian network [5].
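A minimal sketch of the supervised NB classifier that the TAN network builds on is below. The toy records are our own, and the add-one smoothing is a simple stand-in for the pseudo-Bayes handling of sampling zeros; a full TAN network would additionally give each attribute one parent attribute and condition on its value.

```python
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal supervised NB classifier over discrete attributes.
    A TAN network would also condition each attribute on one parent
    attribute; this sketch conditions on the class alone."""

    def fit(self, records, labels):
        self.class_counts = Counter(labels)
        self.total = len(labels)
        self.counts = defaultdict(lambda: defaultdict(Counter))
        for rec, lab in zip(records, labels):
            for i, value in enumerate(rec):
                self.counts[lab][i][value] += 1
        return self

    def predict(self, rec):
        def score(c):
            # prior P(c) times smoothed per-attribute likelihoods;
            # add-one smoothing avoids zeroing out unseen values
            s = self.class_counts[c] / self.total
            for i, value in enumerate(rec):
                s *= (self.counts[c][i][value] + 1) / (self.class_counts[c] + 2)
            return s
        return max(self.class_counts, key=score)

records = [("tcp", "SF"), ("tcp", "REJ"), ("udp", "SF"), ("tcp", "REJ")]
labels = ["normal", "attack", "normal", "attack"]
nb = NaiveBayes().fit(records, labels)
print(nb.predict(("udp", "SF")))  # normal
```

Because the score factorizes over attributes, training and prediction are linear in the number of attributes, which is what makes the approach attractive for a low-resource cloud setting.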
III. PROJECT / PLAN OF ACTION
The KDD99 dataset, used to evaluate intrusion detection
techniques as recently as 2010 [6], will be used to train and
then measure the effectiveness of a real-time Bayesian
algorithm for detecting intrusion anomalies. The set of features
defined for each connection record in KDD99 are as follows:
duration (length, in seconds, of the connection), protocol_type
(tcp, udp, etc.), service (network service on the destination,
e.g., http, telnet, etc.), src_bytes (number of data bytes from
source to destination), dst_bytes (number of data bytes from
destination to source), flag (normal or error status of the
connection), land (1 if connection is from/to the same
host/port; 0 otherwise), wrong_fragment (number of “wrong”
fragments), urgent (number of urgent packets) [6]. KDD99 was
originally built for a competition task of constructing a
network intrusion detector, and it contains millions of sample
records with which to train our algorithm [6], so it will
suffice for our purposes.
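A record in KDD99's comma-separated format could be parsed along these lines, assuming the fields appear in the order listed above (the raw files may order the columns differently, so the layout here is an assumption):

```python
# Field order follows the listing above; verify against the raw files.
FIELDS = ["duration", "protocol_type", "service", "src_bytes",
          "dst_bytes", "flag", "land", "wrong_fragment", "urgent"]
NUMERIC = {"duration", "src_bytes", "dst_bytes", "land",
           "wrong_fragment", "urgent"}

def parse_record(line):
    """Split one comma-separated connection record into a feature dict,
    converting the numeric fields to int."""
    values = line.strip().split(",")
    return {name: (int(v) if name in NUMERIC else v)
            for name, v in zip(FIELDS, values)}

rec = parse_record("0,tcp,http,181,5450,SF,0,0,0")
print(rec["protocol_type"], rec["src_bytes"], rec["flag"])  # tcp 181 SF
```

Records parsed this way feed directly into training: the symbolic fields become discrete attributes and the numeric fields can be discretized into bins.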
The current research plan, subject to change, is to run a
server from my home computer hosting a PC gaming server and a
website, generating a flow of normal traffic in and out of the
machine. Within this flow, attacks such as denial-of-service
attacks, buffer overflows, and probes will be simulated [6]. A
program such as tcpdump or Wireshark will be used to dump
packet information, which will be read in at intervals of no
more than two minutes and analyzed by our Tree-Augmented Naïve
Bayesian network through a program written in C++. A log of the
alarms raised by the program will be kept and checked against
the planned simulated attacks on the server, and of course
against any real attacks that may occur. Our current goal is a
90% detection rate for intrusion anomalies represented in the
KDD99 training data, and a 75% detection rate for intrusion
anomalies not represented in it.
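The detection-rate goal could be measured by comparing the program's alarm log against the schedule of injected attacks; the sketch below uses hypothetical two-minute window indices rather than real logs:

```python
def detection_metrics(alarm_windows, attack_windows, total_windows):
    """Compare logged alarm windows against known attack windows.
    Returns (detection_rate, false_alarm_rate)."""
    alarms, attacks = set(alarm_windows), set(attack_windows)
    detected = len(alarms & attacks)
    false_alarms = len(alarms - attacks)
    return (detected / len(attacks),
            false_alarms / (total_windows - len(attacks)))

# Hypothetical run: 100 two-minute windows, attacks injected in 10
attacks = {3, 17, 25, 31, 42, 55, 60, 71, 88, 93}
alarms = {3, 17, 25, 31, 42, 55, 60, 71, 88, 12}  # 9 hits, 1 false alarm
dr, far = detection_metrics(alarms, attacks, total_windows=100)
print(dr)  # 0.9 -- would meet the 90% goal on taught anomalies
```

Tracking the false-alarm rate alongside the detection rate matters here, since the introduction notes that recurring false alarms waste investigation time.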
REFERENCES
[1] Mell, Peter and Grance, Timothy, “The NIST Definition of
Cloud Computing,” NIST, Gaithersburg, MD, Rep. 800-145,
2011.
[2] Farid, Dewan Md and Rahman, Mohammad Zahidur, “Anomaly
network intrusion detection based on improved self adaptive
Bayesian algorithm,” in Journal of Computers, vol. 5, 2010, pp.
23-31.
[3] Babbar, Sakshi and Chawla, Sanjay, “On Bayesian Network and
Outlier Detection,” in COMAD, 2010, pp. 125-136.
[4] Barbara, Daniel et al., “Detecting novel network intrusions
using Bayes estimators,” in First SIAM Conference on Data
Mining, 2001.
[5] Tan, Yongmin et al., "PREPARE: Predictive performance
anomaly prevention for virtualized cloud systems," in 2012
IEEE 32nd International Conference on Distributed Computing
Systems (ICDCS), 2012, pp.285-294.
[6] The KDD Archive. KDD99 Cup 1999 Data. 1999.
https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html