
ABSTRACT
Title of dissertation:
DESIGN AND EVALUATION OF DECISION MAKING
ALGORITHMS FOR INFORMATION SECURITY
Alvaro A. Cárdenas, Doctor of Philosophy, 2006
Dissertation directed by: Professor John S. Baras
Department of Electrical and Computer Engineering
The evaluation and learning of classifiers is of particular importance in several computer
security applications such as intrusion detection systems (IDSs), spam filters, and watermarking
of documents for fingerprinting or traitor tracing. There are, however, relevant considerations
that are sometimes ignored by researchers who apply machine learning techniques to security-related
problems. In this work we identify and address two problems that are prevalent in
security-related applications. The first problem is the usually large class imbalance between
normal events and attack events. We address this problem with a unifying view of previously
proposed metrics and with the introduction of Bayesian Receiver Operating Characteristic (B-ROC) curves.
The second problem is the fact that the classifier or learning rule will be deployed in an
adversarial environment. This implies that good average performance is not an adequate
performance measure; instead, we look for good performance under the worst type of adversarial
attack. We develop a general methodology that we apply to the design and evaluation of IDSs
and watermarking applications.
DESIGN AND EVALUATION OF DECISION MAKING ALGORITHMS
FOR INFORMATION SECURITY
by
Alvaro A. Cárdenas
Dissertation submitted to the Faculty of the Graduate School of the
University of Maryland, College Park in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
2006
Advisory Committee:
Professor John S. Baras, Chair/Advisor
Professor Virgil D. Gligor
Professor Bruce Jacob
Professor Carlos Berenstein
Professor Nuno Martins
© Copyright by
Alvaro A. Cárdenas
2006
TABLE OF CONTENTS

List of Figures

1 Introduction

2 Performance Evaluation Under the Class Imbalance Problem
   I    Overview
   II   Introduction
   III  Notation and Definitions
   IV   Evaluation Metrics
   V    Graphical Analysis
   VI   Conclusions

3 Secure Decision Making: Defining the Evaluation Metrics and the Adversary
   I    Overview
   II   A Set of Design and Evaluation Guidelines
   III  A Black Box Adversary Model
   IV   A White Box Adversary Model
   V    Conclusions

4 Performance Comparison of MAC layer Misbehavior Schemes
   I    Overview
   II   Introduction
   III  Problem Description and Assumptions
   IV   Sequential Probability Ratio Test (SPRT)
   V    Performance analysis of DOMINO
   VI   Theoretical Comparison
   VII  Nonparametric CUSUM statistic
   VIII Experimental Results
   IX   Conclusions and future work

5 Secure Data Hiding Algorithms
   I    Overview
   II   General Model
   III  Additive Watermarking and Gaussian Attacks
        B.1  Summary
        C.1  Minimization with respect to Re
        C.2  Minimization with respect to Γ
   IV   Towards a Universal Adversary Model

Bibliography
LIST OF FIGURES

2.1  Isoline projections of CID onto the ROC curve. The optimal CID value is CID = 0.4565. The associated costs are C(0, 0) = 3 × 10−5, C(0, 1) = 0.2156, C(1, 0) = 15.5255 and C(1, 1) = 2.8487. The optimal operating point is PFA = 2.76 × 10−4 and PD = 0.6749.
2.2  As the cost ratio C increases, the slope of the optimal isoline decreases.
2.3  As the base-rate p decreases, the slope of the optimal isoline increases.
2.4  The PPV isolines in the ROC space are straight lines that depend only on θ. The PPV values of interest range from 1 to p.
2.5  The NPV isolines in the ROC space are straight lines that depend only on φ. The NPV values of interest range from 1 to 1 − p.
2.6  PPV and NPV isolines for the ROC of an IDS with p = 6.52 × 10−5.
2.7  B-ROC for the ROC of Figure 2.6.
2.8  Mapping of ROC to B-ROC.
2.9  An empirical ROC (ROC2) and its convex hull (ROC1).
2.10 The B-ROC of the concave ROC is easier to interpret.
2.11 Comparison of two classifiers.
2.12 B-ROC comparison for the p of interest.
3.1  Probability of error for hi vs. p.
3.2  The optimal operating point.
3.3  Robust expected cost evaluation.
3.4  Robust B-ROC evaluation.
4.1  Form of the least favorable pmf p1∗ for two different values of g. When g approaches 1, p1∗ approaches p0. As g decreases, more mass of p1∗ is concentrated towards the smaller backoff values.
4.2  Tradeoff curve between the expected number of samples for a false alarm E[TFA] and the expected number of samples for detection E[TD]. For fixed a and b, as g increases (low intensity of the attack) the time to detection or to false alarms increases exponentially.
4.3  For K = 3, the state of the variable cheat count can be represented as a Markov chain with five states. When cheat count reaches the final state (4 in this case), DOMINO raises an alarm.
4.4  Exact and approximate values of p as a function of m.
4.5  DOMINO performance for K = 3; m ranges from 1 to 60. γ is shown explicitly. As γ tends to either 0 or 1, the performance of DOMINO decreases. The SPRT outperforms DOMINO regardless of γ and m.
4.6  DOMINO performance for various thresholds K, γ = 0.7 and m in the range from 1 to 60. The performance of DOMINO decreases as m increases. For fixed γ, the SPRT outperforms DOMINO for all values of the parameters K and m.
4.7  The best possible performance of DOMINO is obtained when m = 1 and K changes in order to accommodate the desired level of false alarms. The best γ must be chosen independently.
4.8  Tradeoff curves for each of the proposed algorithms. DOMINO has parameters γ = 0.9 and m = 1 while K is the variable parameter. The nonparametric CUSUM algorithm has as variable parameter c, and the SPRT has b = 0.1 and a as the variable parameter.
4.9  Tradeoff curves for the default DOMINO configuration with γ = 0.9, the best performing DOMINO configuration with γ = 0.7, and the SPRT.
4.10 Tradeoff curves for the best performing DOMINO configuration with γ = 0.7, the best performing CUSUM configuration with γ = 0.7, and the SPRT.
4.11 Comparison between theoretical and experimental results: theoretical analysis with a linear x-axis closely resembles the experimental results.
5.1  Let a = y − x1 for the detector. We can now see the Lagrangian function over a, where the adversary tries to distribute the density h such that L(λ, h) is maximized while satisfying the constraints (i.e., minimizing L(λ, h) over λ).
5.2  Piecewise linear decision function, where ρ(0) = ρ0(0) = ρ1(0).
5.3  The discontinuity problem in the Lagrangian is solved by using piecewise linear continuous decision functions. It is now easy to shape the Lagrangian such that the maxima created form a saddle point equilibrium.
5.4  New decision function.
5.5  With ρ defined in Figure 5.4 the Lagrangian is able to exhibit three local maxima, one of them at the point a = 0, which implies that the adversary will use this point whenever the distortion constraints are too severe.
5.6  ρ¬ represents a decision stating that we do not possess enough information to make a reliable selection between the two hypotheses.
Chapter 1
Introduction
Look if I can raise the money fast, can I set up my own lab?
-The Smithsonian Institution. Gore Vidal
The algorithms used to ensure several information security goals, such as authentication,
integrity and secrecy, have often been designed and analyzed with the help of formal mathematical models. One of the most successful examples is the use of theoretical cryptography for
encryption, integrity and authentication. By assuming that some basic primitives hold, (such
as the existence of one-way functions), some cryptographic algorithms can formally be proven
secure.
Information security models have, however, theoretical limits, since it cannot always be
proven whether an algorithm satisfies certain security conditions. For example, as formalized by Harrison, Ruzzo, and Ullman, the safety problem in the access matrix model is undecidable, and Rice's
theorem implies that static analysis problems are also undecidable. Because of similar results,
such as the undecidability of detecting computer viruses, there is reason to believe that several
intrusion detection problems are also undecidable.
Undecidability is not the only obstacle to formally characterizing the security
properties of an algorithm. Not only are some problems intractable or undecidable, but
several security-related problems, such as biometrics, data hiding (watermarking), fraud detection,
and spam filtering, also have inherent uncertainties that are impossible to factor out.
Any algorithm trying to solve these problems will make a non-negligible number of decision errors.
These errors arise from approximations and can consequently be formulated
as a tradeoff between the costs of operation (e.g., the resources required,
such as the number of false alarms) and the correctness of the output (e.g., the security level
achieved, or the probability of detection). In practice, however, it is very difficult to assess both
the real costs of these security solutions and their actual security guarantees. Most of the research therefore relies on ad hoc solutions and heuristics that cannot be shared between the security
fields trying to address these hard problems.
In this work we provide a framework for reasoning about the security
and the costs of algorithms that must decide between two hypotheses when this decision
cannot be made without errors.
Our framework is composed of two main parts.
Evaluation Metrics: We introduce, in a unified framework, several metrics that have been previously proposed in the literature. We give intuitive interpretations of them and provide
new metrics that address two of the main problems for decision algorithms: first, the
large class imbalance between the two hypotheses, and second, the uncertainty of several
parameters, including the costs and the severity of the class imbalance.
Security Model: In order to reason formally about the security level of a decision algorithm,
we need to introduce a formal model of the adversary and of the system being evaluated.
We therefore clearly define the feasible design space and the properties our algorithm
needs to satisfy, as captured by an evaluation metric. Then we model the considered adversary class
by clearly defining the information available to the adversary, the capabilities of the adversary, and its goal. Finally, since the easiest way to break the security of a system is
to step outside the box, i.e., to break the rules under which the algorithm was evaluated,
we clearly identify the assumptions made in our model. These assumptions then play
an important part in the evaluation of the algorithm. Our tools for solving and analyzing these
models are based on robust detection theory and game theory, and the aim is always the
same: to minimize the advantage of the adversary (defined according to the metric used)
over all possible adversaries in a given adversary class.
We apply our model to different problems in intrusion detection, selfish packet forwarding in ad hoc networks, selfish MAC layer misbehavior in wireless access points, and watermarking
algorithms. Our results show that formally modeling these problems and obtaining
the least favorable attack distribution has merit: in the best cases our framework outperforms
previously proposed heuristic solutions (analytically and in simulations), and in the worst cases
the resulting designs can be seen as pessimistic decisions based on risk aversion.
The layout of this dissertation is the following. In chapter 2 we review, in a unified framework, traditional metrics used to evaluate the performance of intrusion detection systems and
introduce a new evaluation curve called the B-ROC curve. In chapter 3 we introduce the notion
of security of a decision algorithm and define the adversarial models that are used
in the following chapters to design and evaluate decision making algorithms. In chapter
4 we compare the performance of two classification algorithms used for detecting MAC layer
misbehavior in wireless networks. This chapter is a validation of our approach, since it shows
how designing classification algorithms with robust detection theory and formal adversarial
modeling can outperform, both theoretically and empirically, previously proposed algorithms based
on heuristics. Finally, in chapter 5 we apply our framework to the design and evaluation of
data hiding algorithms.
Chapter 2
Performance Evaluation Under the Class Imbalance Problem
For there seem to be many empty alarms in war
-Nicomachean Ethics, Aristotle
I Overview
The evaluation of the classification accuracy of intrusion detection systems (IDSs) involves fundamental problems such as how to compare two or more IDSs, how to evaluate the performance of an IDS,
and how to determine the best configuration of the IDS. In an effort to analyze and solve these
related problems, evaluation metrics such as the Bayesian detection rate, the expected cost, the
sensitivity and the intrusion detection capability have been introduced. In this chapter, we study
the advantages and disadvantages of each of these performance metrics and analyze them in a
unified framework. Additionally, we introduce the Bayesian Receiver Operating Characteristic
(B-ROC) curves as a new IDS performance tradeoff which combines in an intuitive way the
variables that are more relevant to the intrusion detection evaluation problem.
II Introduction
Consider a company that, in an effort to improve its information technology security
infrastructure, wants to purchase either intrusion detector 1 (IDS1) or intrusion detector 2
(IDS2). Furthermore, suppose that the algorithms used by each IDS are kept private, and therefore the only way to determine the performance of each IDS (unless some reverse engineering
is done [1]) is through empirical tests determining how many intrusions are detected by each
scheme while providing an acceptable level of false alarms. Suppose these tests show with high
confidence that IDS1 detects one-tenth more attacks than IDS2, but at the cost of producing
one hundred times more false alarms. The company needs to decide, based on these estimates,
which IDS will provide the best return on investment for its needs and its operational environment.
This general problem is more concisely stated as the intrusion detection evaluation problem, and its solution usually depends on several factors. The most basic of these factors are the
false alarm rate and the detection rate, and their tradeoff can be intuitively analyzed with the
help of the receiver operating characteristic (ROC) curve [2, 3, 4, 5, 6]. However, as pointed
out in [7, 8, 9], the information provided by the detection rate and the false alarm rate alone
might not be enough to provide a good evaluation of the performance of an IDS. Therefore,
the evaluation metrics need to consider the environment the IDS is going to operate in, such
as the maintenance costs and the hostility of the operating environment (the likelihood of an
attack). In an effort to provide such an evaluation method, several performance metrics such
as the Bayesian detection rate [7], expected cost [8], sensitivity [10] and intrusion detection
capability [9], have been proposed in the literature.
Yet despite the fact that each of these performance metrics makes its own contribution to the analysis of intrusion detection systems, they are rarely applied in the literature when
proposing a new IDS. It is our belief that the lack of widespread adoption of these metrics stems
from two main reasons. Firstly, each metric is proposed in a different framework (e.g. information theory, decision theory, cryptography etc.) and in a seemingly ad hoc manner. Therefore
an objective comparison between the metrics is very difficult.
The second reason is that the proposed metrics usually assume the knowledge of some
uncertain parameters like the likelihood of an attack, or the costs of false alarms and missed
detections. Moreover, these uncertain parameters can also change during the operation of an
IDS. Therefore the evaluation of an IDS under some (wrongly) estimated parameters might not
be of much value.
A. Our Contributions
In this chapter, we introduce a framework for the evaluation of IDSs in order to address
the concerns raised in the previous section. First, we identify the intrusion detection evaluation
problem as a multi-criteria optimization problem. This framework will let us compare several
of the previously proposed metrics in a unified manner. To this end, we recall that there are
in general two ways to solve a multi-criteria optimization problem. The first approach is to
combine the criteria to be optimized in a single optimization problem. We then show how
the intrusion detection capability, the expected cost and the sensitivity metrics all fall into this
category. The second approach to solve a multi-criteria optimization problem is to evaluate a
tradeoff curve. We show how the Bayesian rates and the ROC curve analysis are examples of
this approach.
To address the uncertainty of the parameters assumed in each of the metrics, we then
present a graphical approach that allows the comparison of the IDS metrics for a wide range of
uncertain parameters. For the single optimization problem we show how the concept of isolines
can capture in a single value (the slope of the isoline) the uncertainties like the likelihood of an
attack and the operational costs of the IDS. For the tradeoff curve approach, we introduce a new
tradeoff curve we call the Bayesian ROC (B-ROC). We believe the B-ROC curve combines in
a single graph all the relevant (and intuitive) parameters that affect the practical performance of
an IDS.
In an effort to make this evaluation framework accessible to other researchers and in order
to complement our presentation, we started the development of a software application available
at [11] to implement the graphical approach for the expected cost and our new B-ROC analysis
curves. We hope this tool can grow to become a valuable resource for research in intrusion
detection.
III Notation and Definitions
In this section we present the basic notation and definitions which we use throughout this
document.
We assume that the input to an intrusion detection system is a feature-vector x ∈ X . The
elements of x can include basic attributes like the duration of a connection, the protocol type,
the service used etc. It can also include specific attributes selected with domain knowledge
such as the number of failed logins, or if a superuser command was attempted. Examples of
x used in intrusion detection are sequences of system calls [12], sequences of user commands
[13], connection attempts to local hosts [14], proportion of accesses (in terms of TCP or UDP
packets) to a given port of a machine over a fixed period of time [15] etc.
Let I denote whether a given instance x was generated by an intrusion (represented by
I = 1 or simply I) or not (denoted as I = 0 or equivalently ¬I). Also let A denote whether
the output of an IDS is an alarm (denoted by A = 1 or simply A) or not (denoted by A = 0, or
equivalently ¬A). An IDS can then be defined as an algorithm IDS that receives a continuous
data stream of computer event features X = {x[1], x[2], . . .} and classifies each input x[j] as
being either a normal event or an attack, i.e., IDS : X → {A, ¬A}. In this chapter we do not
address how the IDS is designed. Our focus will be on how to evaluate the performance of a
given IDS.
Intrusion detection systems are commonly classified as either misuse detection schemes
or anomaly detection schemes. Misuse detection systems use a number of attack signatures describing attacks; if an event feature x matches one of the signatures, an alarm is raised. Anomaly
detection schemes on the other hand rely on profiles or models of the normal operation of the
system. Deviations from these established models raise alarms.
The empirical results of a test for an IDS are usually recorded in terms of how many
attacks were detected and how many false alarms were produced by the IDS, in a data set
containing both normal data and attack data. The percentage of alarms out of the total number
of normal events monitored is referred to as the false alarm rate (or the probability of false
alarm), whereas the percentage of detected attacks out of the total attacks is called the detection
rate (or probability of detection) of the IDS. In general we denote the probability of false alarm
and the probability of detection (respectively) as:
PFA ≡ Pr[A = 1|I = 0]   and   PD ≡ Pr[A = 1|I = 1]   (2.1)
These empirical results are sometimes shown with the help of the ROC curve; a graph
whose x-axis is the false alarm rate and whose y-axis is the detection rate. The graphs of
misuse detection schemes generally correspond to a single point denoting the performance of
the detector. Anomaly detection schemes on the other hand, usually have a monitored statistic
which is compared to a threshold τ in order to determine if an alarm should be raised or not.
Therefore their ROC curve is obtained as a parametric plot of the probability of false alarm
(PFA ) versus the probability of detection (PD ) (with parameter τ) as in [2, 3, 4, 5, 6].
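To make these definitions concrete, the following short Python sketch (our own illustration, not part of the original evaluation; the scores and labels are synthetic) estimates PFA and PD from labeled data and traces an ROC curve by sweeping the threshold τ of an anomaly score.

import numpy as np

# Synthetic anomaly scores and ground-truth labels (I = 1 for intrusion events).
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.0, 1.0, 10000),   # normal events
                         rng.normal(2.0, 1.0, 100)])    # attack events
labels = np.concatenate([np.zeros(10000), np.ones(100)])

def empirical_rates(scores, labels, tau):
    # P_FA = Pr[A=1 | I=0] and P_D = Pr[A=1 | I=1] when an alarm is raised for score >= tau.
    alarms = scores >= tau
    p_fa = alarms[labels == 0].mean()
    p_d = alarms[labels == 1].mean()
    return p_fa, p_d

# ROC curve: parametric plot of (P_FA(tau), P_D(tau)) as the threshold tau varies.
roc = [empirical_rates(scores, labels, tau) for tau in np.linspace(-3.0, 6.0, 200)]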
IV Evaluation Metrics
In this section we first introduce metrics that have been proposed in previous work. Then
we discuss how we can use these metrics to evaluate the IDS by using two general approaches,
that is the expected cost and the tradeoff approach. In the expected cost approach, we give intuition of the expected cost metric by relating all the uncertain parameters (such as the probability
of an attack) to a single line that allows the IDS operator to easily find the optimal tradeoff. In
the second approach, we identify the main parameters that affect the quality of the performance
of the IDS. This will allow us to later introduce a new evaluation method that we believe better
captures the effect of these parameters than all previously proposed methods.
A. Background Work
Expected Cost
In this section we present the expected cost of an IDS by combining some of the ideas
originally presented in [8] and [16]. The expected cost is used as an evaluation method for
IDSs in order to assess the investment of an IDS in a given IT security infrastructure. In
addition to the rates of detection and false alarm, the expected cost of an IDS can also depend
on the hostility of the environment, the IDS operational costs, and the expected damage done
by security breaches.

                        Detector's report
State of the system     No Alarm (A = 0)    Alarm (A = 1)
No Intrusion (I = 0)    C(0, 0)             C(0, 1)
Intrusion (I = 1)       C(1, 0)             C(1, 1)

Table 2.1: Costs of the IDS reports given the state of the system
A quantitative measure of the consequences of the output of the IDS for a given event,
which can be an intrusion or not, is given by the costs shown in Table 2.1. Here C(0, 1) corresponds to
the cost of responding as though there was an intrusion when there is none, C(1, 0) corresponds
to the cost of failing to respond to an intrusion, C(1, 1) is the cost of acting upon an intrusion
when it is detected (which can be defined as a negative value and therefore be considered as a
profit for using the IDS), and C(0, 0) is the cost of not reacting to a non-intrusion (which can
also be defined as a profit, or simply left as zero.)
Adding costs to the different outcomes of the IDS is a way to generalize the usual tradeoff
between the probability of false alarm and the probability of detection to a tradeoff between the
expected cost for a non-intrusion
R(0, PFA) ≡ C(0, 0)(1 − PFA) + C(0, 1)PFA
and the expected cost for an intrusion
R(1, PD) ≡ C(1, 0)(1 − PD) + C(1, 1)PD
It is clear that if we only penalize errors of classification with unit costs (i.e. if C(0, 0) =
C(1, 1) = 0 and C(0, 1) = C(1, 0) = 1) the expected cost for non-intrusion and the expected
cost for intrusion become respectively, the false alarm rate and the detection rate.
The question of how to select the optimal tradeoff between the expected costs is still
open. However, if we let the hostility of the environment be quantified by the likelihood of
an intrusion p ≡ Pr[I = 1] (also known as the base-rate [7]), we can average the expected
non-intrusion and intrusion costs to give the overall expected cost of the IDS:
E[C(I, A)] = R(0, PFA )(1 − p) + R(1, PD )p
(2.2)
It should be pointed out that R() and E[C(I, A)] are also known as the risk and Bayesian
risk functions (respectively) in Bayesian decision theory.
Given an IDS, the costs from Table 2.1 and the likelihood of an attack p, the problem now
is to find the optimal tradeoff between PD and PFA in such a way that E[C(I, A)] is minimized.
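A minimal Python sketch of this search over a set of candidate operating points follows; the ROC points and cost values are hypothetical, chosen only to illustrate equation (2.2), and are not taken from the dissertation's experiments.

# Expected cost E[C(I,A)] of an operating point (P_FA, P_D), following Eq. (2.2).
def expected_cost(p_fa, p_d, p, C):
    # p is the base-rate Pr[I = 1]; C[i][j] is the cost C(i, j) from Table 2.1.
    r0 = C[0][0] * (1 - p_fa) + C[0][1] * p_fa    # R(0, P_FA): expected cost given no intrusion
    r1 = C[1][0] * (1 - p_d) + C[1][1] * p_d      # R(1, P_D): expected cost given an intrusion
    return (1 - p) * r0 + p * r1

# Hypothetical ROC points and costs, chosen only for demonstration.
roc_points = [(0.0, 0.0), (1e-4, 0.55), (1e-3, 0.80), (1e-2, 0.95), (1.0, 1.0)]
costs = [[0.0, 1.0], [10.0, 0.0]]                 # [[C(0,0), C(0,1)], [C(1,0), C(1,1)]]
p = 6.52e-5                                       # base-rate of the DARPA evaluation

# The optimal tradeoff is the ROC point with the smallest expected cost.
best = min(roc_points, key=lambda pt: expected_cost(pt[0], pt[1], p, costs))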
The Intrusion Detection Capability
The main motivation for introducing the intrusion detection capability CID as an evaluation metric originates from the fact that the costs in Table 2.1 are chosen in a subjective way
[9]. Therefore the authors propose the use of the intrusion detection capability as an objective
metric motivated by information theory:
CID = I(I; A) / H(I)   (2.3)
where I and H respectively denote the mutual information and the entropy [17]. The H(I) term
in the denominator is a normalizing factor so that the value of CID will always be in the [0, 1]
interval. The intuition behind this metric is that by fine tuning an IDS based on CID we are
finding the operating point that minimizes the uncertainty of whether an arbitrary input event x
was generated by an intrusion or not.
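For a single operating point, CID can be computed directly from p, PFA and PD. The sketch below is our own illustration of equation (2.3) (natural logarithms are used; the choice of logarithm base does not affect the ratio).

import math

def c_id(p_fa, p_d, p):
    # C_ID = I(I;A)/H(I) = (H(I) - H(I|A)) / H(I), Eq. (2.3).
    def h(q):                                      # binary entropy
        return 0.0 if q in (0.0, 1.0) else -q * math.log(q) - (1 - q) * math.log(1 - q)
    pr_alarm = p * p_d + (1 - p) * p_fa            # Pr[A = 1]
    h_i_given_a = 0.0
    for a, pa in ((1, pr_alarm), (0, 1 - pr_alarm)):
        if pa > 0:
            pr_i1_given_a = (p * p_d if a == 1 else p * (1 - p_d)) / pa
            h_i_given_a += pa * h(pr_i1_given_a)
    return (h(p) - h_i_given_a) / h(p)

print(c_id(2.76e-4, 0.6749, 6.52e-5))              # ~0.456 at the operating point of Figure 2.1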
The main drawback of CID is that it obscures the intuition that is to be expected when
evaluating the performance of an IDS. This is because the notion of reducing the uncertainty
of an attack is difficult to quantify in practical values of interest such as false alarms or detection rates. Information theory has been very useful in communications because the entropy
and mutual information can be linked to practical quantities, like the number of bits saved by
compression (source coding) or the number of bits of redundancy required for reliable communications (channel coding). However it is not clear how these metrics can be related to
quantities of interest for the operator of an IDS.
The Base-Rate Fallacy and Predictive Value Metrics
In [7] Axelsson pointed out that one of the causes for the large amount of false alarms
that intrusion detectors generate is the enormous difference between the amount of normal
events compared to the small amount of intrusion events. Intuitively, the base-rate fallacy
states that because the likelihood of an attack is very small, even if an IDS fires an alarm,
the likelihood of having an intrusion remains relatively small. Formally, when we compute
the posterior probability of intrusion (a quantity known as the Bayesian detection rate, or the
positive predictive value (PPV)) given that the IDS fired an alarm, we obtain:
PPV ≡ Pr[I = 1|A = 1] = Pr[A = 1|I = 1] Pr[I = 1] / (Pr[A = 1|I = 1] Pr[I = 1] + Pr[A = 1|I = 0] Pr[I = 0]) = PD p / ((PD − PFA)p + PFA)   (2.4)
Therefore, if the rate of incidence of an attack is very small, for example on average only
1 out of 105 events is an attack (p = 10−5 ), and if our detector has a probability of detection
of one (PD = 1) and a false alarm rate of 0.01 (PFA = 0.01), then Pr[I = 1|A = 1] = 0.000999.
That is on average, of 1000 alarms, only one would be a real intrusion.
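These numbers can be checked directly from equation (2.4); a two-line Python verification:

# PPV (Bayesian detection rate) of Eq. (2.4) for p = 1e-5, P_D = 1 and P_FA = 0.01.
p, p_d, p_fa = 1e-5, 1.0, 0.01
print((p_d * p) / ((p_d - p_fa) * p + p_fa))   # ~0.000999: about 1 real intrusion per 1000 alarms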
It is easy to demonstrate that the PPV value is maximized when the false alarm rate of
our detector goes to zero, even if the detection rate also tends to zero! Therefore as mentioned
in [7] we require a trade-off between the PPV value and the negative predictive value (NPV):
NPV ≡ Pr[I = 0|A = 0] = (1 − p)(1 − PFA) / (p(1 − PD) + (1 − p)(1 − PFA))   (2.5)

B. Discussion
The concept of finding the optimal tradeoff of the metrics used to evaluate an IDS is an
instance of the more general problem of multi-criteria optimization. In this setting, we want
to maximize (or minimize) two quantities that are related by a tradeoff, which can be done via
two approaches. The first approach is to find a suitable way of combining these two metrics in
a single objective function (such as the expected cost) to optimize. The second approach is to
directly compare the two metrics via a trade-off curve.
We therefore classify the above defined metrics into two general approaches that will
be explored in the rest of this chapter: the minimization of the expected cost and the tradeoff
approach. We consider these two approaches as complementary tools for the analysis of IDSs,
each providing its own interpretation of the results.
Minimization of the Expected Cost
Let ROC denote the set of allowed (PFA, PD) pairs for an IDS. The expected cost approach
will include any evaluation metric that can be expressed as
r∗ = min(PFA,PD)∈ROC E[C(I, A)]   (2.6)
where r∗ is the expected cost of the IDS. Given IDS1 with expected cost r1∗ and IDS2
with expected cost r2∗, we can say IDS1 is better than IDS2 for our operational environment
if r1∗ < r2∗.
We now show how CID and the tradeoff between the PPV and NPV values can be expressed as expected cost problems. For the CID case, note that the entropy of an intrusion
H(I) is independent of our optimization parameters (PFA, PD); therefore we have:
(PFA∗, PD∗) = arg max(PFA,PD)∈ROC I(I; A) / H(I)
            = arg max(PFA,PD)∈ROC I(I; A)
            = arg min(PFA,PD)∈ROC H(I|A)
            = arg min(PFA,PD)∈ROC E[− log Pr[I|A]]
It is now clear that CID is an instance of the expected cost problem with costs given by
C(i, j) = − log Pr[I = i|A = j]. By finding the costs of CID we are making the CID metric more
intuitively appealing, since any optimal point that we find for the IDS will have an explanation
in terms of cost functions (as opposed to the vague notion of diminishing the uncertainty of the
intrusions).
Finally, in order to combine the PPV and the NPV in an average cost metric, recall that
we want to maximize both Pr[I = 1|A = 1] and Pr[I = 0|A = 0]. Our average gain for each
operating point of the IDS is therefore
ω1 Pr[I = 1|A = 1] Pr[A = 1] + ω2 Pr[I = 0|A = 0] Pr[A = 0]
where ω1 (ω2) is a weight representing a preference towards maximizing PPV (NPV). This
equation is equivalent to the minimization of
−ω1 Pr[I = 1|A = 1] Pr[A = 1] − ω2 Pr[I = 0|A = 0] Pr[A = 0]   (2.7)
Comparing equation (2.7) with equation (2.2), we identify the costs as being C(1, 1) = −ω1 ,
C(0, 0) = −ω2 and C(0, 1) = C(1, 0) = 0. Relating the predictive value metrics (PPV and NPV)
with the expected cost problem will allow us to examine the effects of the base-rate fallacy on
the expected cost of the IDS in future sections.
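As a quick numerical sanity check (with arbitrary illustrative values for p, PFA, PD and the weights), the following sketch verifies that the weighted predictive-value objective above is indeed an expected cost of the form of equation (2.2) with C(1, 1) = −ω1, C(0, 0) = −ω2 and C(0, 1) = C(1, 0) = 0.

# Weighted predictive-value objective vs. the equivalent expected cost of Eq. (2.2).
p, p_fa, p_d = 1e-3, 0.01, 0.9          # illustrative values only
w1, w2 = 1.0, 0.5                       # weights for PPV and NPV
pr_a1 = p * p_d + (1 - p) * p_fa        # Pr[A = 1]
ppv = p * p_d / pr_a1
npv = (1 - p) * (1 - p_fa) / (1 - pr_a1)
gain_as_cost = -(w1 * ppv * pr_a1 + w2 * npv * (1 - pr_a1))
expected_cost = (1 - p) * (-w2) * (1 - p_fa) + p * (-w1) * p_d   # costs C(0,0) = -w2, C(1,1) = -w1
assert abs(gain_as_cost - expected_cost) < 1e-12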
IDS classification tradeoffs
An alternate approach in evaluating intrusion detection systems is to directly compare the
tradeoffs in the operation of the system by a tradeoff curve, such as ROC, or DET curves [18] (a
reinterpretation of the ROC curve where the y-axis is 1 − PD , as opposed to PD ). As mentioned
in [7], another tradeoff to consider is between the PPV and the NPV values. However, we do
not know of any tradeoff curves that combine these two values to aid the operator in choosing
a given operating point.
We point out in section V-B that a tradeoff between PFA and PD (as in the ROC curves) as
well as a tradeoff between PPV and NPV can be misleading for cases where p is very small,
since very small changes in the PFA and NPV values for our points of interest will have drastic
performance effects on the PD and the PPV values. Therefore, in the next section we introduce
the B-ROC as a new tradeoff curve between PD and PPV.
The class imbalance problem
A way to relate our approach with traditional techniques used in machine learning is to
identify the base-rate fallacy as just another instance of the class imbalance problem. The term
class imbalance refers to the case when in a classification task, there are many more instances
of some classes than others. The problem is that under this setting, classifiers in general perform
poorly because they tend to concentrate on the large classes and disregard the ones with few
examples.
Given that this problem is prevalent in a wide range of practical classification problems,
there has been recent interest in trying to design and evaluate classifiers faced with imbalanced
data sets [19, 20, 21].
A number of approaches on how to address these issues have been proposed in the literature. Ideas such as data sampling methods, one-class learning (i.e. recognition-based learning),
and feature selection algorithms, appear to be the most active research directions for learning
classifiers. On the other hand the issue of how to evaluate binary classifiers in the case of class
imbalances appears to be dominated by the use of ROC curves [22, 23] (and to a lesser extent,
by error curves [24]).
V Graphical Analysis
We now introduce a graphical framework that allows the comparison of different metrics in the analysis and evaluation of IDSs. This graphical framework can be used to adaptively
change the parameters of the IDS based on its actual performance during operation. The framework also allows for the comparison of different IDSs under different operating environments.
Throughout this section we use one of the ROC curves analyzed in [8] and in [9], namely
the ROC curve describing the performance of the COLUMBIA team intrusion detector for the
1998 DARPA intrusion detection evaluation [25]. Unless otherwise stated, we assume for our
analysis the base-rate present in the DARPA evaluation which was p = 6.52 × 10−5 .
A. Visualizing the Expected Cost: The Minimization Approach
The biggest drawback of the expected cost approach is that the assumptions and information about the likelihood of attacks and costs might not be known a priori. Moreover, these
parameters can change dynamically during the system operation. It is thus desirable to be able
to tune the uncertain IDS parameters based on feedback from its actual system performance in
order to minimize E[C(I, A)].
We select the use of ROC curves as the basic 2-D graph because they illustrate the behavior of a classifier without regard to the uncertain parameters, such as the base-rate p and the
operational costs C(i, j). Thus the ROC curve decouples the classification performance from
these factors [26]. ROC curves are also general enough such that they can be used to study
anomaly detection schemes and misuse detection schemes (a misuse detection scheme has only
one point in the ROC space).
Figure 2.1: Isoline projections of CID onto the ROC curve. The optimal CID value is CID = 0.4565. The associated costs are C(0, 0) = 3 × 10−5, C(0, 1) = 0.2156, C(1, 0) = 15.5255 and C(1, 1) = 2.8487. The optimal operating point is PFA = 2.76 × 10−4 and PD = 0.6749.
In the graphical framework, the relation of these uncertain factors with the ROC curve of
an IDS will be reflected in the isolines of each metric, where isolines refer to lines that connect
pairs of false alarm and detection rates such that any point on the line has equal expected
cost. The evaluation of an IDS is therefore reduced to finding the point of the ROC curve that
intercepts the optimal isoline of the metric (for signature detectors the evaluation corresponds
to finding the isoline that intercepts their single point in the ROC space and the point (0,0) or
(1,1)). In Figure 2.1 we can see as an example the isolines of CID intercepting the ROC curve
of the 1998 DARPA intrusion detection evaluation.
One limitation of the CID metric is that it specifies the costs C(i, j) a priori. However,
in practice these costs are rarely known in advance and moreover the costs can change and be
dynamically adapted based on the performance of the IDS. Furthermore the nonlinearity of CID
makes it difficult to analyze the effect different p values will have on CID in a single 2-D graph.
To make the graphical analysis of the cost metrics as intuitive as possible, we will assume from
now on (as in [8]) that the costs are tunable parameters and yet once a selection of their values
is made, they are constant values. This new assumption will let us at the same time see the
effect of different values of p in the expected cost metric.
Under the assumption of constant costs, we can see that the isolines for the expected cost
E[C(I, A)] are in fact straight lines whose slope depends on the ratio between the costs and the
likelihood ratio of an attack. Formally, if we want the pair of points (PFA1 , PD1 ) and (PFA2 , PD2 )
to have the same expected cost, they must be related by the following equation [27, 28, 26]:
mC,p ≡ (PD2 − PD1) / (PFA2 − PFA1) = [(1 − p)/p] · [C(0, 1) − C(0, 0)] / [C(1, 0) − C(1, 1)] = [(1 − p)/p] · (1/C)   (2.8)
where in the last equality we have implicitly defined C to be the ratio between the costs, and
mC,p to be the slope of the isoline. The set of isolines of E[C(I, A)] can be represented by
ISOE = {mC,p × PFA + b : b ∈ [0, 1]}
(2.9)
For fixed C and p, it is easy to prove that the optimal operating point of the ROC is the
point where the ROC intercepts the isoline in ISOE with the largest b (note however that there
are ROC curves that can have more than one optimal point.) The optimal operating point in the
ROC is therefore determined only by the slope of the isolines, which in turn is determined by
p and C. Therefore we can readily check how changes in the costs and in the likelihood of an
attack will impact the optimal operating point.
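Under the constant-cost assumption, finding the optimal operating point reduces to a simple search: compute the isoline slope mC,p of equation (2.8) and pick the ROC point with the largest intercept b = PD − mC,p PFA. The sketch below uses hypothetical ROC points and costs purely for illustration.

# Slope of the expected-cost isolines (Eq. 2.8) and the corresponding optimal ROC point.
def isoline_slope(p, C):
    cost_ratio = (C[1][0] - C[1][1]) / (C[0][1] - C[0][0])   # the ratio C of Eq. (2.8)
    return (1 - p) / (p * cost_ratio)

def optimal_point(roc_points, p, C):
    m = isoline_slope(p, C)
    # The tangent isoline in ISO_E is the one with the largest intercept b = P_D - m * P_FA.
    return max(roc_points, key=lambda pt: pt[1] - m * pt[0])

roc_points = [(0.0, 0.0), (1e-4, 0.55), (1e-3, 0.80), (1e-2, 0.95), (1.0, 1.0)]  # hypothetical
costs = [[0.0, 1.0], [10.0, 0.0]]                                                # hypothetical
print(optimal_point(roc_points, 6.52e-5, costs))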
Figure 2.2: As the cost ratio C increases, the slope of the optimal isoline decreases
Effect of the Costs
In Figure 2.2, consider the operating point corresponding to C = 58.82, and assume that
after some time, the operators of the IDS realize that the number of false alarms exceeds their
response capabilities. In order to reduce the number of false alarms they can increase the cost
of a false alarm C(0, 1) and obtain a second operating point at C = 10. If however the situation
persists (i.e. the number of false alarms is still much more than what operators can efficiently
respond to) and therefore they keep increasing the cost of a false alarm, there will be a critical
slope mc such that the intersection of the ROC and the isoline with slope mc will be at the point
(PFA , PD ) = (0, 0). The interpretation of this result is that we should not use the IDS being
evaluated since its performance is not good enough for the environment it has been deployed
in. In order to solve this problem we need to either change the environment (e.g. hire more IDS
operators) or change the IDS (e.g. shop for a more expensive IDS).
The Base-Rate Fallacy Implications on the Costs of an IDS
A similar scenario occurs when the likelihood of an attack changes. In Figure 2.3 we can
see how as p decreases, the optimal operating point of the IDS tends again to (PFA , PD ) = (0, 0)
(again the evaluator must decide not to use the IDS for its current operating environment).
Therefore, for small base-rates the operation of an IDS will be cost efficient only if we have
an appropriately large C∗ such that mC∗,p∗ ≤ mc. A large C∗ can be justified if the cost of a false
alarm is much smaller than the cost of a missed detection: C(1, 0) ≫ C(0, 1) (e.g., the case of a
government network that cannot afford undetected intrusions and has enough resources to sort
through the false alarms).
Figure 2.3: As the base-rate p decreases, the slope of the optimal isoline increases
Generalizations
This graphical method of cost analysis can also be applied to other metrics in order to get
some insight into the expected cost of the IDS. For example, in [10] the authors define an IDS
with input space X to be σ-sensitive if there exists an efficient algorithm with the same input
space E : X → {¬A, A}, such that PD^E − PFA^E ≥ σ. This metric can be used to find the optimal
point of an ROC because it has a very intuitive explanation: as long as the rate of detected
intrusions increases faster than the rate of false alarms, we keep moving the operating point
of the IDS towards the right in the ROC. The optimal sensitivity problem for an IDS with a
receiver operating characteristic ROC is thus:
max(PFA,PD)∈ROC (PD − PFA)   (2.10)
It is easy to show that this optimal sensitivity point is the same optimal point obtained with the
isolines method for mC,p = 1 (i.e. C = (1 − p)/p).
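In code, the optimal sensitivity point of equation (2.10) is simply the ROC point that maximizes PD − PFA; the ROC points below are again hypothetical.

# Optimal sensitivity point, Eq. (2.10): maximize P_D - P_FA over the ROC.
roc_points = [(0.0, 0.0), (1e-4, 0.55), (1e-3, 0.80), (1e-2, 0.95), (1.0, 1.0)]  # hypothetical
best_sensitivity = max(roc_points, key=lambda pt: pt[1] - pt[0])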
B. The Bayesian Receiver Operating Characteristic: The Tradeoff Approach
Figure 2.4: The PPV isolines in the ROC space are straight lines that depend only on θ. The PPV values of interest range from 1 to p
Although the graphical analysis introduced so far can be applied to analyze the cost
efficiency of several metrics, the intuition for the tradeoff between the PPV and the NPV is
still not clear. Therefore we now extend the graphical approach by introducing a new pair of
isolines, those of the PPV and the NPV metrics.
Lemma 1: Two sets of points (PFA1, PD1) and (PFA2, PD2) have the same PPV value if and only if
PFA1 / PD1 = PFA2 / PD2 = tan θ   (2.11)
where θ is the angle between the line PFA = 0 and the isoline. Moreover, the PPV value of an isoline at angle θ is
PPVθ,p = p / (p + (1 − p) tan θ)   (2.12)
Similarly, two sets of points (PFA1, PD1) and (PFA2, PD2) have the same NPV value if and only if
(1 − PD1) / (1 − PFA1) = (1 − PD2) / (1 − PFA2) = tan φ   (2.13)
where φ is the angle between the line PD = 1 and the isoline. Moreover, the NPV value of an isoline at angle φ is
NPVφ,p = (1 − p) / (p(tan φ − 1) + 1)   (2.14)

Figure 2.5: The NPV isolines in the ROC space are straight lines that depend only on φ. The NPV values of interest range from 1 to 1 − p
Figures 2.4 and 2.5 show the graphical interpretation of Lemma 1. It is important to note
the range of the PPV and NPV values as a function of their angles. In particular notice that as θ
goes from 0◦ to 45◦ (the range of interest), the value of PPV changes from 1 to p. We can also
see from Figure 2.5 that as φ ranges from 0◦ to 45◦ , the NPV value changes from one to 1 − p.
If p is very small, then NPV ≈ 1.
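Lemma 1 can also be checked numerically: computing PPV and NPV from the isoline angles of a given operating point yields the same values as a direct Bayes computation. The short sketch below uses the operating point reported in Figure 2.1 as an example.

import math

def ppv_from_angle(theta, p):
    # PPV of the isoline at angle theta, Eq. (2.12).
    return p / (p + (1 - p) * math.tan(theta))

def npv_from_angle(phi, p):
    # NPV of the isoline at angle phi, Eq. (2.14).
    return (1 - p) / (p * (math.tan(phi) - 1) + 1)

p, p_fa, p_d = 6.52e-5, 2.76e-4, 0.6749           # operating point of Figure 2.1
theta = math.atan(p_fa / p_d)                     # Eq. (2.11)
phi = math.atan((1 - p_d) / (1 - p_fa))           # Eq. (2.13)
ppv_direct = p_d * p / ((p_d - p_fa) * p + p_fa)
npv_direct = (1 - p) * (1 - p_fa) / (p * (1 - p_d) + (1 - p) * (1 - p_fa))
assert abs(ppv_from_angle(theta, p) - ppv_direct) < 1e-9
assert abs(npv_from_angle(phi, p) - npv_direct) < 1e-9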
Figure 2.6 shows the application of Lemma 1 to a typical ROC curve of an IDS. In this figure
we can see the tradeoff of four variables of interest: PFA , PD , PPV , and NPV . Notice that if
we choose the optimal operating point based on PFA and PD , as in the typical ROC analysis, we
might obtain misleading results because we do not know how to interpret intuitively very low
false alarm rates, e.g. is PFA = 10−3 much better than PFA = 5 × 10−3 ? The same reasoning
applies to the study of PPV vs. NPV as we cannot interpret precisely small variations in NPV
values, e.g. is NPV = 0.9998 much better than NPV = 0.99975? Therefore we conclude that
the most relevant metrics to use for a tradeoff in the performance of a classifier are PD and PPV,
since they have an easily understandable range of interest.
However, even when the PPV and PD values are selected as tradeoff parameters, the isoline
analysis shown in Figure 2.6 still has one deficiency: there is no efficient
way to account for the uncertainty of p. In order to solve this problem we introduce the B-ROC
as a graph that shows how the two variables of interest, PD and PPV, are related under different
severities of class imbalance. In order to follow the intuition of the ROC curves, instead of using
PPV for the x-axis we prefer to use 1 − PPV. We use this quantity because it can be interpreted
as the Bayesian false alarm rate: BFA ≡ Pr[I = 0|A = 1]. For example, for IDSs BFA can be a
measure of how likely it is that the operators of the detection system will waste their time each
time they respond to an alarm. Figure 2.7 shows the B-ROC for the ROC presented in Figure
2.6. Notice also how the values of interest for the x-axis have changed from [0, 10−3] to [0, 1].

Figure 2.6: PPV and NPV isolines for the ROC of an IDS with p = 6.52 × 10−5

Figure 2.7: B-ROC for the ROC of Figure 2.6.
Figure 2.8: Mapping of ROC to B-ROC
In order to be able to interpret the B-ROC curves, Figure 2.8 shows how the ROC points
map to points in the B-ROC. The vertical line defined by 0 < PD ≤ 1 and PFA = 0 in the ROC
maps exactly to the same vertical line 0 < PD ≤ 1 and BFA = 0 in the B-ROC. Similarly, the top
horizontal line 0 ≤ PFA ≤ 1 and PD = 1 maps to the line 0 ≤ BFA ≤ 1− p and PD = 1. A classifier
that performs random guessing is represented in the ROC as the diagonal line PD = PFA , and
this random guessing classifier maps to the vertical line defined by BFA = 1 − p and PD > 0
in the B-ROC. Finally, to understand where the point (0, 0) in the ROC maps into the B-ROC,
let α and f (α) denote PFA and the corresponding PD in the ROC curve. Then, as the false
alarm rate α tends to zero (from the right), the Bayesian false alarm rate tends to a value that
depends on p and the slope of the ROC close to the point (0, 0). More specifically, if we let
f 0 (0+ ) = limα→0+ f 0 (α), then:
lim BFA = lim
α→0+
α→0+
1− p
α(1 − p)
=
0
p f (α) + α(1 − p) p( f (0+ ) − 1) + 1
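A short sketch of this mapping, with a hypothetical concave ROC, produces one B-ROC curve per candidate base-rate p (curves analogous to those of Figure 2.7).

# Map an ROC curve to a B-ROC curve for a given base-rate p:
# B_FA = Pr[I = 0 | A = 1] = (1 - p) P_FA / (p P_D + (1 - p) P_FA).
def broc(roc_points, p):
    curve = []
    for p_fa, p_d in roc_points:
        denom = p * p_d + (1 - p) * p_fa
        if denom > 0:
            curve.append(((1 - p) * p_fa / denom, p_d))
    return curve

roc_points = [(1e-5, 0.30), (1e-4, 0.55), (1e-3, 0.80), (1e-2, 0.95), (1.0, 1.0)]  # hypothetical
for p in (1e-3, 1e-4, 1e-5):            # one B-ROC per candidate base-rate
    print(p, broc(roc_points, p))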
It is also important to recall that a necessary condition for a classifier to be optimal, is
that its ROC curve should be concave. In fact, given any non-concave ROC, by following
Neyman-Pearson theory, you can always get a concave ROC curve by randomizing decisions
between optimal points [27]. This idea has been recently popularized in the machine learning
community by the notion of the ROC convex hull [26].
The importance of this observation is that in order to guarantee that the B-ROC is a well
defined continuous and non-decreasing function, we map only concave ROC curves to B-ROCs.
Figure 2.9: An empirical ROC (ROC2) and its convex hull (ROC1)
Figure 2.10: The B-ROC of the concave ROC is easier to interpret
Figures 2.9 and 2.10 show the only example in this document of the type of B-ROC curve
that is obtained when the ROC is not concave.
Figure 2.11: Comparison of two classifiers
We also point out that B-ROC curves can be very useful for the comparison
of classifiers. A typical comparison problem using ROCs is shown in Figure 2.11. Several
ideas have been proposed in order to solve this comparison problem. For example by using
decision theory as in [29, 26] we can find the optimal classifier between the two by assuming a
given prior p and given misclassification costs. However, a big problem with this approach is
that the misclassification costs are sometimes uncertain and difficult to estimate a priori. With
a B-ROC on the other hand, you can get a better comparison of two classifiers without the
assumption of any misclassification costs, as can be seen in Figure 2.12.
VI Conclusions
We believe that the B-ROC provides a better way to evaluate and compare classifiers
in the case of class imbalances or uncertain values of p. First, for selecting operating points
in heavily imbalanced environments, B-ROCs use tradeoff parameters that are easier to understand than the variables considered in ROC curves (they provide better intuition for the
performance of the classifier). Second, since the exact class distribution p might not be known
a priori, or accurately enough, the B-ROC allows the plot of different curves for the range of
interest of p. Finally, when comparing two classifiers, there are cases in which by using the
B-ROC, we do not need cost values in order to decide which classifier would be better for given
values of p. Note also that B-ROCs consider parameters that are directly related to exact quantities that the operator of a classifier can measure. In contrast, the exact interpretation of the
expected cost of a classifier is more difficult to relate to the real performance of the classifier
(the costs depend on many other unknown factors).
Figure 2.12: B-ROC comparison for the p of interest (p = 1 × 10−3)
Chapter 3
Secure Decision Making: Defining the Evaluation Metrics and the Adversary
They did not realize that because of the quasi-reciprocal and circular nature of all
Improbability calculations, anything that was Infinitely Improbable was actually very likely to
happen almost immediately.
-Life the Universe and Everything. Douglas Adams
I Overview
As we have seen in the previous chapter, the traditional evaluation of IDSs assumes that
the intruder will behave similarly before and after the deployment and configuration of the IDS
(i.e., during the evaluation it is assumed that the intruder is non-adaptive). In practice,
however, this assumption does not hold, since once an IDS is deployed, intruders will adapt and
however this assumption does not hold, since once an IDS is deployed, intruders will adapt and
try to evade detection or launch attacks against the IDS.
This lack of a proper adversarial model is a big problem for the evaluation of any decision
making algorithm used in security applications, since without the definition of an adversary we
cannot reason about, much less measure, the security of the algorithm.
We therefore lay out in this chapter the overall design and evaluation goals that will be
used throughout the rest of this dissertation.
The basic idea is to introduce a formal framework for reasoning about the performance
and security of a decision making algorithm. In particular we do not want to find or design the
best performing decision making algorithm on average, but the algorithm that performs best
against the worst type of attacks.
The chapter layout is the following. In section II we introduce a general set of guidelines
for the design and evaluation of decision making algorithms. In section III we introduce a
black box adversary model. This model is very simple yet very useful and robust for cases
where having an exact adversarial model is intractable. Finally in section IV we introduce a
detailed model of an adversary which we will use in chapters 4 and 5.
II A Set of Design and Evaluation Guidelines
In this chapter we focus on designing a practical methodology for dealing with attackers.
In particular we propose the use of a framework where each of the following components is clearly
defined:
Desired Properties Intuitive definition of the goal of the system.
Feasible Design Space The design space S for the classification algorithm.
Information Available to the Adversary Identify which pieces of information can be available to an attacker.
Capabilities of the Adversary Define a feasible class of attackers F based on the assumed
capabilities.
Evaluation Metric The evaluation metric should be a reasonable measure of how well the designed system meets our desired properties. We call a system secure if its metric outcome
is satisfactory against any feasible attacker.
Objective of the Adversary An attacker can use its capabilities and the information available
in order to perform two main classes of attacks:
• Evaluation attack. The goal of the attacker is opposite to the goal defined by
the evaluation metric. For example, if the goal of the classifier is to minimize
E[L(C, A)], then the goal of the attacker is to maximize E[L(C, A)].
• Base system attack. The goal of an attacker is not the opposite goal of the classifier.
For example, even if the goal of the classifier is to minimize E[L(C, A)], the goal of
the attacker is still to minimize the probability of being detected.
Model Assumptions Identify clearly the assumptions made during the design and evaluation
of the classifier. It is important to realize that when we borrow tools from other fields,
they come with a set of assumptions that might not hold in an adversarial setting, because
the first thing that an attacker will do is violate the set of assumptions that the classifier
is relying on for proper operation. After all, the UCI machine learning repository never
launched an attack against your classifier. Therefore one of the most important ways to
deal with an adversarial environment is to limit the number of assumptions made, and to
evaluate the resiliency of the remaining assumptions to attacks.
Before we use these guidelines for the evaluation of classifiers, we describe a very simple
example of the use of these guidelines and their relationship to cryptography. Cryptography is
one of the best examples in which a precise framework has been developed in order to define
properly what a secure system means, and how to model an adversary. We believe therefore
that this example will help us in identifying the generality and use of the guidelines as a step
towards achieving sound designs1 .
A. Example: Secret key Encryption
In secret key cryptography, Alice and Bob share a single key sk. Given a message m
(called plaintext) Alice uses an encryption algorithm to produce unintelligible data C (called
ciphertext): C ← Esk (m). After receiving C, Bob then uses sk and a decryption algorithm to
recover the secret message m = Dsk (C).
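As a toy illustration of this interface (not part of the original example), the following Python sketch instantiates E and D as a one-time pad over byte strings; the function names and the message are hypothetical.

import secrets

def keygen(length: int) -> bytes:
    """Generate a fresh secret key sk shared by Alice and Bob."""
    return secrets.token_bytes(length)

def E(sk: bytes, m: bytes) -> bytes:
    """Encryption: XOR the plaintext with the key (one-time pad)."""
    assert len(sk) == len(m)
    return bytes(k ^ b for k, b in zip(sk, m))

def D(sk: bytes, c: bytes) -> bytes:
    """Decryption: XOR again with the same key recovers the plaintext."""
    return bytes(k ^ b for k, b in zip(sk, c))

m = b"attack at dawn"          # hypothetical plaintext
sk = keygen(len(m))
C = E(sk, m)
assert D(sk, C) == m           # correctness: D_sk(E_sk(m)) = m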
Desired Properties E and D should enable Alice and Bob to communicate secretly, that is, a
feasible adversary should not get any information about m given C except with very small
probability.
Feasible Design Space E and D have to be efficient probabilistic algorithms. They also need
to satisfy correctness: for any sk and m, Dsk (Esk (m)) = m.
1 We avoid the precise formal treatment of cryptography because our main objective here is to present the intuition behind the principles rather than the specific technical details.
Information Available to the Adversary It is assumed that an adversary knows the encryption
and decryption algorithms. The only information not available to the adversary is the
secret key sk shared between Alice and Bob.
Capabilities of the Adversary The class of feasible adversaries F is the set of algorithms running in a reasonable amount of time.
Evaluation Metric For any messages m0 and m1 , given a ciphertext C which is known to be an
encryption of either m0 or m1 , no adversary A ∈ F can guess correctly which message
was encrypted with probability significantly better than 1/2.
Goal of the Adversary Perform an evaluation attack. That is, design an algorithm A ∈ F that
can guess with probability significantly better than 1/2 which message corresponds to
the given ciphertext.
Model Assumptions The security of an encryption scheme usually relies on a set of cryptographic primitives, such as one-way functions.
Another interesting aspect of cryptography is the different notions of security when the
adversary is modified. In the previous example it is sometimes reasonable to assume that the
attacker will obtain valid plaintext and ciphertext pairs: {(m0 ,C0 ), (m1 ,C1 ), . . . , (mk ,Ck )}. This
new setting is modeled by giving the adversary more capabilities: the feasible set F will now
consist of all efficient algorithms that have access to the ciphertexts of chosen plaintexts. An
encryption algorithm is therefore secure against chosen-plaintext attacks if even with this new
capability, the adversary still cannot break the encryption scheme.
III A Black Box Adversary Model
In this section we introduce one of the simplest adversary models against classification
algorithms. We refer to it as a black box adversary model since we do not care at the moment
how the adversary creates the observations X. This model is particularly suited to cases where
trying to model the creation of the inputs to the classifier is intractable.
Recall first that for our evaluation analysis in the previous chapter we were assuming
three quantities that can be, up to a certain extent, controlled by the intruder. They are the
base-rate p, the false alarm rate PFA and the detection rate PD . The base-rate can be modified
by controlling the frequency of attacks. The perceived false alarm rate can be increased if the
intruder finds a flaw in any of the signatures of an IDS that allows him to send maliciously
crafted packets that trigger alarms at the IDS but that look benign to the IDS operator. Finally,
the detection rate can be modified by the intruder with the creation of new attacks whose signatures do not match those of the IDS, or simply by evading the detection scheme, for example
by the creation of a mimicry attack [30, 31].
In an effort towards understanding the advantage an intruder has by controlling these parameters, and to provide a robust evaluation framework, we now present a formal framework
to reason about the robustness of an IDS evaluation method. Our work in this section is in
some sense similar to the theoretical framework presented in [10], which was inspired by cryptographic models. However, we see two main differences in our work. First, we introduce the
role of an adversary against the IDS, and thereby introduce a measure of robustness for the
metrics. In the second place, our work is more practical and is applicable to more realistic
evaluation metrics. Furthermore we also provide examples of practical scenarios where our
methods can be applied.
In order to be precise in our presentation, we need to extend the definitions introduced in section III. For our modeling purposes we decompose the IDS algorithm into two parts: a detector D and a decision maker DM. For the case of an anomaly detection scheme, D(x[j]) outputs the anomaly score y[j] on input x[j], and DM represents the threshold that determines whether to consider the anomaly score as an intrusion or not, i.e. DM(y[j]) outputs an alarm or it does not. For a misuse detection scheme, DM has to decide to use the signature to report alarms, or to decide that the performance of the signature is not good enough to justify its use and therefore to ignore all alarms (e.g. it is not cost-efficient to purchase the misuse scheme being evaluated).
Definition 1 An IDS algorithm is the composition of algorithms D (an algorithm from which we can obtain an ROC curve) and DM (an algorithm responsible for selecting an operating point). During operation, an IDS receives a continuous data stream of event features x[1], x[2], . . . and classifies each input x[j] by raising an alarm or not. Formally:2
IDS(x)
    y = D(x)
    A ← DM(y)
    Output A (where A ∈ {0, 1})
♦
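As a purely illustrative rendering of this composition, the following Python sketch wires a placeholder anomaly-score detector D to a threshold decision maker DM; the score function, threshold and data stream are invented for the example and are not part of the dissertation's experiments.

import random

def D(x):
    """Detector: map an event feature x to an anomaly score y.
    (Placeholder score: simply the magnitude of the feature.)"""
    return abs(x)

def DM(y, threshold=5.0):
    """Decision maker: turn the anomaly score into an alarm A in {0, 1}.
    In general DM may be probabilistic (hence the arrow notation)."""
    return 1 if y > threshold else 0

def IDS(x):
    """IDS = composition of D and DM, classifying each input x."""
    y = D(x)
    A = DM(y)
    return A

# Classify a continuous stream x[1], x[2], ...
stream = [random.gauss(0, 3) for _ in range(10)]
alarms = [IDS(x) for x in stream]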
We now study the performance of an IDS under an adversarial setting. We remark that
our intruder model does not represent a single physical attacker against the IDS. Instead our
model represents a collection of attackers whose average behavior will be studied under the
worst possible circumstances for the IDS.
The first thing we consider, is the amount of information the intruder has. A basic assumption to make in an adversarial setting is to consider that the intruder knows everything
that we know about the environment and can make inferences about the situation the same way
as we can. Under this assumption we assume that the base-rate p̂ estimated by the IDS, its
estimated operating condition (P̂FA , P̂D ) selected during the evaluation, the original ROC curve
(obtained from D ) and the cost function C(I, A) are public values (i.e. they are known to the
intruder).
We model the capability of an adaptive intruder by defining some confidence bounds. We
assume an intruder can deviate p̂ − δl , p̂ + δu from the expected p̂ value. Also, based on our
confidence in the detector algorithm and how hard we expect it to be for an intruder to evade
the detector, or to create non-relevant false positives (this also models how the normal behavior of the system being monitored can produce new, previously unseen, false alarms), we define α
and β as bounds to the amount of variation we can expect during the IDS operation from the
false alarms and the detection rate (respectively) we expected, i.e. the amount of variation from
(P̂FA , P̂D ) (although in practice estimating these bounds is not an easy task, testing approaches
like the one described in [32] can help in their determination).
The intruder also has access to an oracle Feature(·, ·) that simulates an event to input into
the IDS. Feature(0, ζ) outputs a feature vector modeling the normal behavior of the system that
will raise an alarm with probability ζ (or a crafted malicious feature to only raise alarms in the
2 The usual arrow notation a ← DM(y) implies that DM can be a probabilistic algorithm.
case Feature(0, 1)). And Feature(1, ζ) outputs the feature vector of an intrusion that will raise
an alarm with probability ζ.
Definition 2 A (δ, α, β)-intruder is an algorithm I that can select its frequency of intrusions p1 from the interval δ = [p̂ − δl, p̂ + δu]. If it decides to attempt an intrusion, then with probability p3 ∈ [0, β] it creates an attack feature x that will go undetected by the IDS (otherwise this intrusion is detected with probability P̂D). If it decides not to attempt an intrusion, then with probability p2 ∈ [0, α] it creates a feature x that will raise a false alarm in the IDS.
I(δ, α, β)
    Select p1 ∈ [p̂ − δl, p̂ + δu]
    Select p2 ∈ [0, α]
    Select p3 ∈ [0, β]
    I ← Bernoulli(p1)
    If I = 1
        B ← Bernoulli(p3)
        x ← Feature(1, min{(1 − B), P̂D})
    Else
        B ← Bernoulli(p2)
        x ← Feature(0, max{B, P̂FA})
    Output (I, x)
where Bernoulli(ζ) outputs a Bernoulli random variable with probability of success ζ.
Furthermore, if δl = p and δu = 1 − p we say that I has the ability to make a chosen-intrusion rate attack.
♦
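A minimal Python sketch of the sampling procedure above, assuming the oracle Feature(·, ·) is summarized by the probability with which the generated feature would fire an alarm; the boundary choices p1 = p̂ + δu, p2 = α, p3 = β are one (typically worst-case) selection, not the only one allowed by the definition.

import random

def intruder(p_hat, delta_u, alpha, beta, P_D, P_FA):
    """One draw from a (delta, alpha, beta)-intruder.
    Returns (I, alarm_prob): the ground truth I and the probability with
    which the simulated feature raises an alarm at the IDS."""
    p1 = p_hat + delta_u          # frequency of intrusions (upper boundary)
    p2 = alpha                    # prob. of crafting a false-alarm feature
    p3 = beta                     # prob. of crafting an undetected attack
    I = 1 if random.random() < p1 else 0
    if I == 1:
        B = 1 if random.random() < p3 else 0
        alarm_prob = min(1 - B, P_D)    # evaded attack never fires an alarm
    else:
        B = 1 if random.random() < p2 else 0
        alarm_prob = max(B, P_FA)       # crafted event always fires an alarm
    return I, alarm_prob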
We now formalize what it means for an evaluation scheme to be robust. We stress the
fact that we are not analyzing the security of an IDS, but rather the security of the evaluation of
an IDS, i.e. how confident we are that the IDS will behave during operation similarly to what
we assumed in the evaluation.
A. Robust Expected Cost Evaluation
We start with the general decision theoretic framework of evaluating the expected cost
(per input) E[C(I, A)] for an IDS.
Definition 3 An evaluation method that claims the expected cost of an IDS is at most r is robust against a (δ, α, β)-intruder if the expected cost of the IDS during the attack, E_{δ,α,β}[C(I, A)], is no larger than r, i.e.
    E_{δ,α,β}[C(I, A)] = ∑_{i,a} C(i, a) × Pr[(I, x) ← I(δ, α, β); A ← IDS(x) : I = i, A = a] ≤ r
♦
Now recall that the traditional evaluation framework finds an evaluation value r∗ by using
equation (2.6). So by finding r∗ we are basically finding the best performance of an IDS and
claiming the IDS is better than others if r∗ is smaller than the evaluation of the other IDSs.
In this section we claim that an IDS is better than others if its expected value under the worst
performance is smaller than the expected value under the worst performance of other IDSs. In
short
Traditional Evaluation Given a set of IDSs {IDS_1, IDS_2, . . . , IDS_n} find the best expected cost for each:
    r_i* = min_{(P_FA, P_D) ∈ ROC_i} E[C(I, A)]    (3.1)
Declare that the best IDS is the one with smallest expected cost r_i*.
Robust Evaluation Given a set of IDSs {IDS_1, IDS_2, . . . , IDS_n} find the best expected cost for each IDS_i when being under the attack of a (δ, α_i, β_i)-intruder3. Therefore we find the best IDS as follows:
    r_i^robust = min_{(P_FA^{α_i}, P_D^{β_i}) ∈ ROC_i^{(α_i,β_i)}} max_{I(δ, α_i, β_i)} E_{δ,α_i,β_i}[C(I, A)]    (3.2)
Several important questions can be raised by the above framework. In particular we are
interested in finding the least upper bound r such that we can claim the evaluation of I DS to
be robust. Another important question is how can we design an evaluation of I DS satisfying
this least upper bound? Solutions to these questions are partially based on game theory.
Lemma 2: Given an initial estimate of the base-rate p̂, an initial ROC curve obtained from D, and constant costs C(I, A), the least upper bound r such that the expected cost evaluation of the IDS is r-robust is given by
    r = R(0, P̂_FA^α)(1 − p̂_δ) + R(1, P̂_D^β) p̂_δ    (3.3)
where
    R(0, P̂_FA^α) ≡ C(0, 0)(1 − P̂_FA^α) + C(0, 1) P̂_FA^α    (3.4)
is the expected cost of the IDS under no intrusion and
    R(1, P̂_D^β) ≡ C(1, 0)(1 − P̂_D^β) + C(1, 1) P̂_D^β    (3.5)
is the expected cost of the IDS under an intrusion, and p̂_δ, P̂_FA^α and P̂_D^β are the solution to a zero-sum game between the intruder (the maximizer) and the IDS (the minimizer), whose solution can be found in the following way:
1. Let (P_FA, P_D) denote any point of the initial ROC obtained from D and let ROC^{(α,β)} be the ROC curve defined by the points (P_FA^α, P_D^β), where P_D^β = P_D(1 − β) and P_FA^α = α + P_FA(1 − α).
2. Using p̂ + δ_u in the isoline method, find the optimal operating point (x_u, y_u) in ROC^{(α,β)}, and using p̂ − δ_l in the isoline method, find the optimal operating point (x_l, y_l) in ROC^{(α,β)}.
3 Note that different IDSs might have different α and β values. For example, if IDS_1 is an anomaly detection scheme, then we can expect that the probability that new normal events will generate alarms, α_1, is larger than the same probability α_2 for a misuse detection scheme IDS_2.
3. Find the points (x*, y*) in ROC^{(α,β)} that intersect the line
    y = [C(1, 0) − C(0, 0)] / [C(1, 0) − C(1, 1)] + x [C(0, 0) − C(0, 1)] / [C(1, 0) − C(1, 1)]
(under the natural assumptions C(1, 0) > R(0, x*) > C(0, 0), C(0, 1) > C(0, 0) and C(1, 0) > C(1, 1)). If there are no points that intersect this line, then set x* = y* = 1.
4. If x* ∈ [x_l, x_u] then find the base-rate parameter p* such that the optimal isoline of Equation (2.9) intercepts ROC^{(α,β)} at (x*, y*), and set p̂_δ = p*, P̂_FA^α = x* and P̂_D^β = y*.
5. Else, if R(0, x_u) < R(1, y_u), find the base-rate parameter p_u such that the optimal isoline of Equation (2.9) intercepts ROC^{(α,β)} at (x_u, y_u), and then set p̂_δ = p_u, P̂_FA^α = x_u and P̂_D^β = y_u. Otherwise, find the base-rate parameter p_l such that the optimal isoline of Equation (2.9) intercepts ROC^{(α,β)} at (x_l, y_l), and then set p̂_δ = p_l, P̂_FA^α = x_l and P̂_D^β = y_l.
The proof of this lemma is very straightforward. The basic idea is that if the uncertainty range of p is large enough, the Nash equilibrium of the game is obtained by selecting the point intercepting the line of step 3. Otherwise one of the strategies for the intruder is always a dominant strategy of the game and therefore we only need to find which one it is: either p̂ + δ_u or p̂ − δ_l. For most practical cases it will be p̂ + δ_u. Also note that the optimal operating point in the original ROC can be found by obtaining (P̂_FA, P̂_D) from (P̂_FA^α, P̂_D^β).
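The following Python sketch illustrates the computation behind Lemma 2 on a made-up ROC, assuming (as the text notes is common in practice) that the attacker's dominant strategy is the base-rate p̂ + δ_u; it applies the transformation to ROC^{(α,β)} and picks the operating point with the smallest worst-case expected cost. All numbers are illustrative.

def R(i, prob, C):
    """Expected cost of the IDS conditioned on I=i.
    For i=0, prob is the false alarm rate; for i=1, the detection rate."""
    if i == 0:
        return C[(0, 0)] * (1 - prob) + C[(0, 1)] * prob
    return C[(1, 0)] * (1 - prob) + C[(1, 1)] * prob

def robust_point(roc, p_hat, delta_u, alpha, beta, C):
    """Return (worst-case cost, original operating point) of the point that
    minimizes the expected cost against an intruder shifting the base-rate to
    p_hat + delta_u and degrading (PFA, PD) to (alpha + PFA(1-alpha), PD(1-beta))."""
    p = p_hat + delta_u
    best = None
    for pfa, pd in roc:
        pfa_a = alpha + pfa * (1 - alpha)   # P_FA^alpha
        pd_b = pd * (1 - beta)              # P_D^beta
        cost = R(0, pfa_a, C) * (1 - p) + R(1, pd_b, C) * p
        if best is None or cost < best[0]:
            best = (cost, (pfa, pd))
    return best

# Illustrative numbers only (not from the dissertation's experiments).
C = {(0, 0): 0, (0, 1): 100, (1, 0): 850, (1, 1): 0}
roc = [(0.0, 0.0), (1e-4, 0.7), (3e-4, 0.9), (1e-3, 1.0), (1.0, 1.0)]
print(robust_point(roc, p_hat=1e-4, delta_u=1e-5, alpha=1e-5, beta=0.1, C=C))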
B. Robust B-ROC Evaluation
Similarly we can now also analyze the robustness of the evaluation done with the B-ROC
curves. In this case it is also easy to see that the worst attacker for the evaluation is an intruder
I that selects p1 = p̂ − δl , p2 = α and p3 = β.
Corollary 1: For any point (P̂PV, P̂D) corresponding to p̂ in the B-ROC curve, a (δ, α, β)-intruder can decrease the detection rate and the positive predictive value to the pair (P̂PV_{δ,α,β}, P̂_D^β), where P̂_D^β = P̂_D(1 − β) and where
    P̂PV_{δ,α,β} = (P̂_D^β p̂ − P̂_D^β δ_l) / (P̂_D^β p̂ + P̂_FA^α(1 − p̂) + δ_l P̂_FA^α − δ_l P̂_D^β)    (3.6)
C. Example: Minimizing the Cost of a Chosen Intrusion Rate Attack
In this example we introduce probably one of the easiest formulations of an attacker
against a classifier: we assume that the attacker cannot change its feature vectors x, but rather
only its frequency of attacks: p. This example also shows the generality of lemma 2 and
presents a compelling scenario of when a probabilistic IDS makes sense.
Assume an ad hoc network scenario similar to [33, 34, 35, 36] where nodes monitor and
distribute reputation values of other nodes’ behavior at the routing layer. The monitoring nodes
report selfish actions (e.g. nodes that agree to forward packets in order to be accepted in the
network, but then fail to do so) or attacks (e.g. nodes that modify routing information before
forwarding it).
Now suppose that there is a network operator considering implementing a watchdog monitoring scheme to check the compliance of nodes forwarding packets as in [33]. The operator
then plans an evaluation period of the method where trusted nodes will be the watchdogs reporting the misbehavior of other nodes. Since the detection of misbehaving nodes is not perfect,
during the evaluation period the network operator is going to measure the consistency of reports
given by several watchdogs and decide if the watchdog system is worth keeping or not.
During this trial period, it is in the interest of selfish nodes to behave as deceptively as they can, so that the neighboring watchdogs obtain largely different results and the system is not permanently established. As stated in [33] the watchdogs might not detect a misbehaving node in
the presence of 1) ambiguous collisions, 2) receiver collisions, 3) limited transmission power,
4) false misbehavior, 5) collusion or 6) partial dropping. False alarms are also possible in several cases, for example when a node moves out of the previous node’s listening range before
forwarding on a packet. Also if a collision occurs while the watchdog is waiting for the next
node to forward a packet, it may never overhear the packet being transmitted.
We now briefly describe this model according to our guidelines:
Desired Properties Assume the operator wants to find a strategy that minimizes the probability
of making errors. This is an example of the expected cost metric function with C(0, 0) =
C(1, 1) = 0 and C(1, 0) = C(0, 1) = 1.
Feasible Design Space DM = {πi ∈ [0, 1] : π1 + π2 + π3 + π4 = 1}.
Information Available to the Adversary We assume the adversary knows everything that we
know and can make inferences about the situation the same way as we can. In game
theory these adversaries are usually referred to as intelligent.
Capabilities of the Adversary The adversary has complete control over the base-rate p (its frequency of attacks). The feasible set is therefore F = [0, 1].
Goal of the Adversary Evaluation attack.
Evaluation Metric
    r* = min_{π_i ∈ DM} max_{p ∈ F} E[C(I, A)]
Note the order in optimization of the evaluation metric. In this case we are assuming that
the operator of the IDS makes the first decision, and that this information is then available
to the attacker when selecting the optimal p. We call the strategy of the operator secure if
the expected cost (probability of error) is never greater than r∗ for any feasible adversary.
Model Assumptions We have assumed that the attacker will not be able to change P̂FA and P̂D .
This results from its assumed inability to directly modify the feature vector x.
Notice that in this case the detector algorithm D is the watchdog mechanism that monitors
the medium to see if the packet was forwarded F or if it did not hear the packet being forwarded
(unheard U) during a specified amount of time. Following [33] (where it is shown that the
number of false alarms can be quite high) we assume that a given watchdog D has a false alarm
rate of P̂FA = 0.5 and a detection rate of P̂D = 0.75. Given this detector algorithm, a (non-randomized) decision maker DM has to be one of the following rules (where intuitively, h3 is the most appealing):
h1 (F) = 0 h1 (U) = 0
h2 (F) = 1 h2 (U) = 0
h3 (F) = 0 h3 (U) = 1
h4 (F) = 1 h4 (U) = 1
         h1        h2           h3           h4
I = 0    R(0, 0)   R(0, P̂D)    R(0, P̂FA)   R(0, 1)
I = 1    R(1, 0)   R(1, P̂FA)   R(1, P̂D)    R(1, 1)
Table 3.1: Matrix for the zero-sum game theoretic formulation of the detection problem
Since the operator wants to check the consistency of the reports, the selfish nodes will
try to maximize the probability of error (i.e. C(0, 0) = C(1, 1) = 0 and C(0, 1) = C(1, 0) = 1)
of any watchdog with a chosen intrusion rate attack. As stated in lemma 2, this is a zero-sum
game where the adversary is the maximizer and the watchdog is the minimizer. The matrix of
this game is given in Table 3.1.
Figure 3.1: Probability of error for h_i vs. p
It is a well known fact that in order to achieve a Nash equilibrium of the game, the players should consider mixed strategies (i.e. consider probabilistic choices). For our example the optimal mixed strategy for the selfish node (see Figure 3.1) is to drop a packet with probability p* = P̂FA/(P̂FA + P̂D). On the other hand the optimal strategy for DM is to select h3 with probability 1/(P̂FA + P̂D) and h1 with probability (P̂FA − (1 − P̂D))/(P̂FA − (1 − P̂D) + 1). This example shows that sometimes, in order to minimize the probability of error (or any general cost) against an adaptive attacker, DM has to be a probabilistic algorithm.
Lemma 2 also presents a way to get this optimal point from the ROC; however, it is not obvious at first how to get the same results, as there appear to be only three points in the ROC: (PFA = 0, PD = 0) (by selecting h1), (P̂FA = 1/2, P̂D = 3/4) (by selecting h3) and (PFA = 1, PD = 1) (by selecting h4). The key property of ROC curves to remember is that the (optimal) ROC curve is a continuous and concave function [27], and that in fact, the points that do not correspond to deterministic decisions are joined by a straight line whose points can be achieved by a mixture of probabilities of the extreme points. In our case, the line y = 1 − x intercepts the (optimal) ROC at the optimal operating points P̂*_FA = P̂FA/(P̂FA + P̂D) and P̂*_D = P̂D/(P̂FA + P̂D) (see Figure 3.2). Also note that p* is the value required to make the slope of the isoline parallel to the ROC line intersecting (P*_FA, P*_D).
The optimal strategy for the intruder is therefore p* = 2/5, while the optimal strategy for DM is to select h1 with probability 1/5 and h3 with probability 4/5. In the robust operating point we have P*_FA = 2/5 and P*_D = 3/5. Therefore, after fixing DM, it does not matter if p deviates from p* because we are guaranteed that the probability of error will be no worse (but no better either) than 2/5; therefore the IDS can be claimed to be 2/5-robust.

Figure 3.2: The optimal operating point (x*, y*) on the ROC of a watchdog.
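The numbers above can be checked with a short Python computation (a sketch, not part of the original text) that evaluates the worst-case probability of error of a mixture between h1 and h3:

P_FA, P_D = 0.5, 0.75   # watchdog detector parameters from [33]

def worst_case_error(q):
    """Mixed strategy q = Pr[choose h3], 1-q = Pr[choose h1].
    The attacker best-responds with p in {0, 1}."""
    err_p0 = q * P_FA                   # attacker never misbehaves: false alarms only
    err_p1 = q * (1 - P_D) + (1 - q)    # attacker always misbehaves: misses only
    return max(err_p0, err_p1)

# Equilibrium: make the attacker indifferent between p=0 and p=1.
q_star = 1 / (P_FA + P_D)               # = 4/5 (probability of choosing h3)
p_star = P_FA / (P_FA + P_D)            # = 2/5 (optimal attack frequency)
print(q_star, p_star, worst_case_error(q_star))   # 0.8 0.4 0.4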
D. Example: Robust Evaluation of IDSs
As a second example, we chose to perform an intrusion detection experiment with the
1998 MIT/Lincoln Labs data set [37]. Although several aspects of this data set have been
criticized in [38], we still chose it for two main reasons. On one hand, it has been (and arguably
still remains) the most used large-scale data set to evaluate IDSs. In the second place we are
not claiming to have a better IDS to detect attacks and then proving our claim with its good
performance in the MIT data set (a feat that would require further testing in order to be assured
on the quality of the IDS). Our aim on the other hand is to illustrate our methodology, and
since this data set is publicly available and has been widely studied and experimented with
(researchers can in principle reproduce any result shown in a paper), we believe it provides the
basic background and setting to exemplify our approach.
Of interest are the Solaris system log files, known as BSM logs. The first step of the
experiment was to record every instance of a program being executed in the data set. Next, we
created a very simple tool to perform buffer overflow detection. To this end, we compared the
buffer length of each execution with a buffer threshold; if the buffer size of the execution was larger than the threshold, we report an alarm.
We divided the data set into two sets. In the first one (weeks 6 and 7), our IDS performs very well and thus we assume that this is the "evaluation" period. The previous three weeks were used as the period of operation of the IDS. Figure 3.3(a)4 shows the results for the "evaluation" period when the buffer threshold ranges between 64 and 780. The dotted lines represent the suboptimal points of the ROC or, equivalently, the optimal points that can be achieved through randomization. For example, the dotted line of Figure 3.3(a) can be achieved by selecting with probability λ the detector with threshold 399 and with probability 1 − λ the detector with threshold 773, and letting λ range from zero to one.
4 Care must always be taken when looking at the results of ROC curves due to the "unit of analysis" problem [38]. For example, comparing the ROC of Figure 3.3(a) with the ROC of [6] one might arrive at the erroneous conclusion that the buffer threshold mechanism produces an IDS that is better than the more sophisticated IDS based on Bayesian networks. The difference lies in the fact that we are monitoring the execution of every program while the experiments in [6] only monitor the attacked programs (eject, fbconfig, fdformat and ps). Therefore although we raise more false alarms, our false alarm rate (number of false alarms divided by the total number of honest executions) is smaller.
Figure 3.3: Robust expected cost evaluation. (a) Original ROC obtained during the evaluation period; (b) effective ROC during operation time; (c) original ROC under adversarial attack, ROCα,β.
During the evaluation weeks there were 81108 executions monitored and 13 attacks,
therefore p̂ = 1.6 × 10−4 . Assuming that our costs (per execution) are C(0, 0) = C(1, 1) = 0,
C(1, 0) = 850 and C(0, 1) = 100 we find that the slope given by equation 2.8 is mC, p̂ = 735.2,
and therefore the optimal point is (2.83 × 10−4 , 1), which corresponds to a threshold of 399
(i.e. all executions with buffer sizes bigger than 399 raise alarms). Finally, with these operating conditions we find out that the expected cost (per execution) of the IDS is E[C(I, A)] =
2.83 × 10−2 .
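This expected cost can be reproduced with a short computation using the numbers quoted above (a sketch only):

C01, C10 = 100, 850          # cost of a false alarm / of a missed detection
p_hat = 1.6e-4               # base-rate during the evaluation weeks
P_FA, P_D = 2.83e-4, 1.0     # operating point of the threshold-399 detector

E_cost = (1 - p_hat) * (C01 * P_FA) + p_hat * (C10 * (1 - P_D))
print(E_cost)                # ~2.83e-2 per execution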
In the previous three weeks, used as the "operation" period, our buffer threshold does not perform as well, as can be seen from its ROC (shown in Figure 3.3(b)). Therefore if we use the point recommended in the evaluation (i.e. the threshold of 399) we get an expected cost of E_operation[C(I, A)] = 6.934 × 10−2. Notice how much larger the expected cost per execution is than the one we had evaluated. This is particularly noticeable because the base-rate is smaller during the operation period (p̂_operation = 7 × 10−5) and a smaller base-rate should have given us a smaller cost.
To understand the new ROC let us take a closer look at the performance of one of the
thresholds. For example, the buffer length of 773 which was able to detect 10 out of the 13
attacks at no false alarm in Figure 3.3(a) does not perform well in Figure 3.3(b) because some
programs such as grep, awk, find and ld were executed under normal operation with long
string lengths. Furthermore, a larger percentage of attacks was able to get past this threshold. This
is in general the behavior modeled by the parameters α and β that the adversary has access to
in our framework.
Let us begin the evaluation process from scratch by assuming a ([1 × 10−5, 0], 1 × 10−4, 0.1)-intruder, where δ = [1 × 10−5, 0] means the IDS evaluator believes that the base-rate during operation will be at most p̂ and at least p̂ − 1 × 10−5. α = 1 × 10−5 means that the
IDS evaluator believes that new normal behavior will have the chance of firing an alarm with
probability 1 × 10−5 . And β = 0.1 means that the IDS operator has estimated that ten percent
of the attacks during operation will go undetected. With these parameters we get the ROCα,β
shown in Figure 3.3(c).
Note that in this case, p is bounded in such a way that the equilibrium of the game is
achieved via a pure strategy. In fact, the optimal strategy of the intruder is to attack with
frequency p̂ + δu (and of course, generate missed detections with probability β and false alarms
with probability α) whereas the optimal strategy of DM is to find the point in ROCα,β that
minimizes the expected cost by assuming that the base-rate is p̂ + δu .
The optimal point for the ROCα,β curve corresponds to the one with threshold 773, having
an expected cost Eδ,α,β [C(I, A)] = 5.19 × 10−2 . Finally, by using the optimal point for ROCα,β ,
as opposed to the original one, we get during operation an expected cost of Eoperation [C(I, A)] =
2.73 × 10−2. Therefore in this case, not only have we maintained our expected 5.19 × 10−2-security of the evaluation, but in addition the new optimal point actually performed better than the original one.
Notice that the evaluation of Figure 3.3 relates exactly to the problem we presented in
the introduction, because it can be thought of as the evaluation of two IDSs. One IDS having
a buffer threshold of length 399 and another IDS having a buffer threshold of length 773.
Under ideal conditions we choose the IDS of buffer threshold length of 399 since it has a lower
expected cost. However after evaluating the worst possible behavior of the IDSs we decide to
select the one with buffer threshold length of 773.
An alternative view can be achieved by the use of B-ROC curves. In Figure 3.4(a) we see the original B-ROC curve during the evaluation period.
Figure 3.4: Robust B-ROC evaluation. (a) Original B-ROC obtained during the evaluation period; (b) B-ROC obtained from ROCα,β; (c) B-ROC during operation time.
These curves give a false sense of
confidence in the IDS. Therefore we study the B-ROC curves based on ROCα,β in Figure 3.4(b).
In Figure 3.4(c) we can see how the B-ROC of the actual operating environment follows more
closely the B-ROC based on ROCα,β than the original one.
IV A White Box Adversary Model
So far we have been treating the decision making process without considering the exact way that an adversary can modify or create the input x to the classifier. However, following the same statistical framework we have been considering, there is a way to formulate this problem that sheds light on the exact behavior and optimal adversarial strategies.
In order to define this problem recall first that an optimal decision rule (optimal in the sense that it minimizes several evaluation metrics such as the probability of error, the expected cost, or the probability of missed positives given an upper bound on the probability of false positives) is the log-likelihood ratio test
    ln[f(x|1)/f(x|0)] ≷_{H0}^{H1} τ,
where H_i denotes the hypothesis that I = i, and τ depends on the particular evaluation metric and its assumptions (e.g., misclassification costs, base-rates, etc.). If the log-likelihood ratio is greater than τ then the detector outputs A = 1 and if it is less than τ, the detector outputs A = 0 (if it is equal to τ the detector randomly selects the hypothesis based on the evaluation metric being considered).
This new detailed formulation allows us to model the fact that an attacker is able to
modify its attack strategy f(x|1). The pdf f(x|0) of the normal behavior can be learnt via one-class learning; however, since the attacker can change its strategy, we cannot trust a machine
learning technique to estimate f (x|1), since learning a candidate density f (x|1) for an attacker
without some proper analysis may result in serious performance degradation if the attacker’s
strategy diverges from our estimated model.
It is therefore one of our aims in the following two chapters to evaluate and design algorithms that try to estimate the least favorable pdf f(x|1) within a feasible adversary class F. The adversary class F will consist of pdfs f(x|1) constrained by a requirement that an adversary has to satisfy. For example, in the next chapter it is a desired level of wireless channel access, and in chapter 5 the feasible class F is composed of pdfs that satisfy certain distortion constraints.
An abstract example of how to model the problem of finding the least favorable f(x|1) is now considered. Assume an adversary whose objective is to create mimicry attacks: i.e., it wants to minimize the probability of being detected. Furthermore assume x takes only discrete values, so the f(x|i) are in fact probability mass functions (pmfs) (as opposed to density functions). This problem can be formulated with the help of information-theoretic inequalities [39]:
    P_FA log(P_FA/P_D) + (1 − P_FA) log((1 − P_FA)/(1 − P_D)) ≤ KL[f(x|0)||f(x|1)]    (3.7)
where KL[f_0||f_1] denotes the Kullback-Leibler divergence between two pmfs f_0 and f_1. From this inequality it can be deduced that if we fix P_FA to be very small (P_FA → 0), then
    P_D ≤ 1 − 2^{−KL[f(x|0)||f(x|1)]}    (3.8)
The task of the adversary would therefore be the following:
    h* = arg min_{h ∈ F} KL[f(x|0)||h(x)]
It can in fact be shown that some of the examples studied in the next two chapters can be formulated as exactly performing this optimization problem.
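As a small illustration of the bound in Eq. (3.8), the Python sketch below computes KL[f(x|0)||h(x)] for a toy normal pmf and a hypothetical mimicry pmf h, and the corresponding cap on P_D; the pmfs are invented for the example.

from math import log2

def kl(f0, f1):
    """Kullback-Leibler divergence KL[f0 || f1] in bits over a common support."""
    return sum(p * log2(p / q) for p, q in zip(f0, f1) if p > 0)

def pd_upper_bound(f0, h):
    """Bound of Eq. (3.8): with PFA -> 0, PD <= 1 - 2^{-KL[f0||h]}."""
    return 1 - 2 ** (-kl(f0, h))

f0 = [0.25, 0.25, 0.25, 0.25]      # normal behavior (toy pmf)
h  = [0.40, 0.30, 0.20, 0.10]      # candidate mimicry attack pmf
print(kl(f0, h), pd_upper_bound(f0, h))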
Besides estimating least favorable distributions, another basic notion in minimax game approaches that is going to be very useful in the next two chapters is that of a saddle point. A strategy (detection rule) d* and an operating point (attack) f(x|1)* in the uncertainty class F form a saddle point if:
1. For the attack f(x|1)*, any detection rule d other than d* has worse performance. Namely, d* is the optimal detection rule for attack f(x|1)* in terms of minimizing the objective function (evaluation).
2. For the detection rule d*, any attack f(x|1) from the uncertainty class other than f(x|1)* gives better performance. Namely, detection rule d* has its worst performance for attack f(x|1)*.
Implicit in the minimax approach is the assumption that the attacker has full knowledge
of the employed detection rule. Thus, it can create a misbehavior strategy that maximizes
the cost of the performance of the classifier. Therefore, our approach refers to the case of an
intelligent attacker that can adapt its misbehavior policy so as to avoid detection. One issue
that needs to be clarified is the structure of this attack strategy. Subsequently, by deriving the
detection rule and the performance for that case, we can obtain an (attainable) upper bound on
performance over all possible attacks.
Even though we did not point it out before, it can in fact be shown that the optimal operating points in the two IDS examples presented in section III are in fact saddle point equilibria!
V Conclusions
There are two main problems that any empirical test of a classifier will face. The first problem relates to the inferences that one can make about any classifier system based on experiments alone. An example is the low confidence on the estimate for the probability of detection in the ROC. A typical way to improve this estimate in other classification tasks is through the use of error bars in the ROC. However, if tests of classifiers include very few attacks and their variations, there is not enough data to provide an accurate significance level for the bars. Furthermore, the use of error bars and any other cross-validation technique gives the average performance of the classifier. This brings us to the second problem, which is the fact that, since the classifiers are subject to an adversarial environment, evaluating such a decision maker based on its average performance is not enough. Our adversary models try to address these two problems, since they provide a principled approach to give us the worst case performance of a classifier.
The extent to which the analysis with a (δ, α, β)-intruder will follow the real operation of the IDS will depend on how accurately the person doing the evaluation of the IDS understands the IDS and its environment; for example, to what extent the IDS can be evaded, how well the signatures are written (e.g. how likely it is that normal events fire alarms), etc. However, by assuming robust parameters we are actually assuming a pessimistic setting, and if this pessimistic scenario never happens, we might be operating at a suboptimal point (i.e. we might have been too pessimistic in the evaluation).
Finally, although the white box framework for adversarial modeling gives us a fine-grained evaluation procedure, it is not always a better alternative to the black box adversary model. There are basically two problems with the white box adversary model. The first problem is the fact that finding the optimal adversarial distribution is usually an intractable problem. The second problem is that, in order to avoid the intractability of the problem, f(x|1) is sometimes assumed to follow a certain parametric model. Therefore instead of searching for optimal adversarial pdfs f(x|1) the problem is replaced with one of finding the optimal parameters of a prescribed distribution. The problem with this approach is that assuming a distribution form creates extra assumptions about the attacker, and as explained in the guidelines defined at the beginning of this chapter, any extra assumption must be taken with care, especially because in practice the adversary will most likely not follow the parametric distribution assumed a priori.
Chapter 4
Performance Comparison of MAC layer Misbehavior Schemes
If anything could dissipate my love to humanity, it would be ingratitude. In short, I am a hired
servant, I expect my payment at once- that is, praise, and the repayment of love with love.
Otherwise I am incapable of loving anyone.
- The Brothers Karamazov, Dostoevsky
I Overview
This chapter revisits the problem of detecting greedy behavior in the IEEE 802.11 MAC
protocol by evaluating the performance of two previously proposed schemes: DOMINO and
the Sequential Probability Ratio Test (SPRT). The evaluation is carried out in four steps. We
first derive a new analytical formulation of the SPRT that takes into account the discrete nature
of the problem. Then we develop a new tractable analytical model for DOMINO. As a third
step, we evaluate the theoretical performance of SPRT and DOMINO with newly introduced
metrics that take into account the repeated nature of the tests. This theoretical comparison
provides two major insights into the problem: it confirms the optimality of SPRT and motivates
us to define yet another test, a nonparametric CUSUM statistic that shares the same intuition
as DOMINO but gives better performance. We conclude this chapter with experimental results,
confirming the correctness of our theoretical analysis and validating the introduction of the new
nonparametric CUSUM statistic.
II Introduction
The problem of deviation from legitimate protocol operation in wireless networks and
efficient detection of such behavior has become a significant issue in recent years. In this work
we address and quantify the impact of MAC layer attacks that aim at disrupting critical network
functionalities and information flow in wireless networks.
It is important to note that parameters used for deriving a decision of whether a protocol
participant misbehaves or not should be carefully chosen. For example, choosing the percentage of time the node accesses the channel as a misbehavior metric can result in a high number
of false alarms due to the fact that the other protocol participants might not have anything (or
have significantly less traffic) to transmit within a given observation period. This could easily
lead to false accusations of legitimate nodes that have a large amount of data to send. Measuring throughput offers no indication of misbehavior since it is impossible to define “legitimate”
throughput. Therefore, it is reasonable to use either fixed protocol parameters (such as SIFS or
DIFS window size) or parameters that belong to a certain (fixed) range of values for monitoring
misbehavior (such as backoff).
In this work we derive analytical bounds for two previously proposed protocols for detecting random access misbehavior, DOMINO [40] and SPRT-based tests [41, 42], and show the optimality of the SPRT against the worst-case adversary for all configurations of DOMINO.
Following the main idea of DOMINO, we introduce a nonparametric CUSUM statistic that
shares the same intuition as DOMINO but gives better performance for all configurations of
DOMINO.
A. Background Work
MAC layer protocol misbehavior has been studied in various scenarios and mathematical
frameworks. The random nature of access protocols coupled with the highly volatile nature of
wireless medium poses the major obstacle in generation of the unified framework for misbehavior detection. The goals of a misbehaving peer can range from exploitation of available network
resources for its own benefit up to network disruption. An efficient Intrusion Detection System
should exhibit a capability to detect a wide range of misbehavior policies with an acceptable
False Alarm rate. This presents a major challenge due to the nature of wireless protocols.
The current literature offers two major approaches in the field of misbehavior detection.
The first set of approaches provides solutions based on modification of the current IEEE 802.11
MAC layer protocol by making each protocol participant aware of the backoff values of its
neighbors. The approach proposed in [43] assumes existence of a trustworthy receiver that
can detect misbehavior of the sender and penalize it by assigning it higher back-off values
for subsequent transmissions. A decision about protocol deviation is reached if the observed
number of idle slots of the sender is smaller than a pre-specified fraction of the allocated backoff. The sender is labeled as misbehaving if it turns out to deviate continuously based on a
cumulative metric over a sliding window. The work in [44] attempts to prevent scenarios of
colluding sender-receiver pairs by ensuring randomness in the course of the MAC protocol.
A different line of thought is followed in [40, 41, 42], where the authors propose a misbehavior detection scheme without making any changes to the actual protocol. In [40] the authors
focus on multiple misbehavior policies in the wireless environment and put emphasis on detection of backoff misbehavior. They propose a sequence of conditions on available observations
for testing the extent to which MAC protocol parameters have been manipulated. The proposed
scheme does not address the scenarios that include intelligent adaptive cheaters or collaborating misbehaving nodes. The authors in [41, 42] address the detection of an adaptive intelligent
attacker by casting the problem of misbehavior detection within the minimax robust detection
framework. They optimize the system’s performance for the worst-case instance of uncertainty
by identifying the least favorable operating point of a system and derive the strategy that optimizes the system’s performance when operating in that point. The system performance is
measured in terms of number of required observation samples to derive a decision (detection
delay).
However, DOMINO and SPRT were presented independently, without direct comparison
or performance analysis. Additionally, both approaches evaluate the detection scheme performance under unrealistic conditions, such as probability of false alarm being equal to 0.01,
which in our simulations results in roughly 700 false alarms per minute (in saturation conditions), a rate that is unacceptable in any real-life implementation. Our work contributes to the
current literature by: (i) deriving a new pmf for the worst case attack using an SPRT-based
detection scheme, (ii) providing new performance metrics that address the large number of
alarms in the evaluation of previous proposals, (iii) providing a complete analytical model of
DOMINO in order to obtain a theoretical comparison to SPRT-based tests and (iv) proposing
an improvement to DOMINO based on the CUSUM test.
The rest of the chapter is organized as follows. Sect. III outlines the general setup of the
problem. In Sect. IV we propose a minimax robust detection model and derive an expression for
the worst-case attack in discrete time. In Sect. V we provide extensive analysis of DOMINO,
followed by the theoretical comparison of two algorithms in Sect. VI. Motivated by the main
idea of DOMINO, we offer a simple extension to the algorithm that significantly improves its
performance in Sect. VII. In Sect. VIII we present the experimental performance comparison
of all algorithms. Finally, Sect. IX concludes our study. In subsequent sections, the terms
“attacker” and “adversary” will be used interchangeably with the same meaning.
III Problem Description and Assumptions
Throughout this work we assume existence of an intelligent adaptive attacker that is
aware of the environment and its changes over a given period of time. Consequently, the
attacker is able to adjust its access strategy depending on the level of congestion in its environment. Namely, we assume that, in order to minimize the probability of detection, the
attacker chooses legitimate over selfish behavior when the level of congestion in the network is
low. Similarly, the attacker chooses adaptive selfish strategy in congested environments. Due to
the previously mentioned reasons, we assume a benchmark scenario where all the participants
are backlogged, i.e., have packets to send at any given time in both theoretical and experimental
evaluations. We assume that the attacker will employ the worst-case misbehavior strategy in
this setting, and consequently the detection system can estimate the maximal detection delay.
It is important to mention that this setting represents the worst-case scenario with regard to the
number of false alarms per unit of time due to the fact that the detection system is forced to
make maximum number of decisions per time unit.
In order to characterize the strategy of an intelligent attacker, we assume that both the misbehaving and the legitimate nodes attempt to access the channel simultaneously. Consequently,
each station generates a sequence of random backoffs X1 , X2 , . . . , Xi over a fixed period of time.
Accordingly, the backoff values, X1 , X2 , . . . , Xi , of each legitimate protocol participant are distributed according to the probability mass function (pmf) p0 (x1 , x2 , . . . , xi ). The pmf of the misbehaving participants is unknown to the system and is denoted with p1 (x1 , x2 , . . . , xi ), where
X1 , X2 , . . . , Xi represent the sequence of backoff values generated by the misbehaving node over
the same period of time.
The assumption that holds throughout this chapter is that a detection agent (e.g., the
access point) monitors and collects the backoff values of a given station. It is important to note
that observations are not perfect and can be hindered by concurrent transmissions or external
sources of noise. It is impossible for a passive monitoring agent to know the backoff stage of
a given monitored station due to collisions and to the fact that in practice, nodes might not be
constantly backlogged. Furthermore, in practical applications the number of false alarms in
anomaly detection schemes is very high. Consequently, instead of building a “normal” profile
of network operation with anomaly detection schemes, we utilize specification based detection.
In our setup we identify “normal” (i.e., a behavior consistent with the 802.11 specification)
profile of a backlogged station in the IEEE 802.11 without any competing nodes, and notice
that its backoff process X1 , X2 , . . . , Xi can be characterized with the pmf p0 (xi ) = 1/(W + 1) for
xi ∈ {0, 1, . . . ,W } and zero otherwise. We claim that this assumption minimizes the probability
of false alarms due to imperfect observations. At the same time, a safe upper bound on the
amount of damaging effects a misbehaving station can cause to the network is maintained.
Although our theoretical results utilize the above expression for p0 , the experimental
setting utilizes the original implementation of the IEEE 802.11 MAC. In this case, the detection
agent needs to deal with observed values of xi larger than W , which can be due to collisions or
due to the exponential backoff specification in the IEEE 802.11. We further discuss this issue
in Sect. VIII.
IV Sequential Probability Ratio Test (SPRT)
Due to the nature of the IEEE 802.11 MAC, the back-off measurements are enhanced by
an additional sample each time a node attempts to access the channel. Intuitively, this gives rise
to the employment of a sequential detection scheme in the observed problem. The objective of
the detection test is to derive a decision as to whether or not misbehavior occurs with the least
number of observations. A sequential detection test is therefore a procedure which decides
whether or not to receive more samples with each new piece of information it obtains. If sufficient information for deriving a decision has been gathered (i.e. the desired levels of the probability
of false alarm and probability of miss are satisfied), the test proceeds to the phase of making a
decision.
It is now clear that two quantities are involved in decision making: a stopping time N and
a decision rule dN which at the time of stopping decides between hypotheses H0 (legitimate
behavior) and H1 (misbehavior). We denote the above combination with D=(N, dN ).
In order to proceed with our analysis we first define the properties of an efficient detector.
Intuitively, the starting point in defining a detector should be minimization of the probability
of false alarms P0 [dN = 1]. Additionally, each detector should be able to derive the decision as
soon as possible (minimize the number of samples it collects from a misbehaving station) before
calling the decision function E1 [N]. Finally, it is also necessary to minimize the probability of
deciding that a misbehaving node is acting normally P1 [dN = 0]. It is now easy to observe that
E1 [N], P0 [dN = 1], P1 [dN = 0] form a multi-criteria optimization problem. However, not all
of the above quantities can be optimized at the same time. Therefore, a natural approach is to
define the accuracy of each decision a priori and minimize the number of samples collected:
    inf_{D ∈ T_{a,b}} E_1[N]    (4.1)
where
    T_{a,b} = {(N, d_N) : P_0[d_N = 1] ≤ a and P_1[d_N = 0] ≤ b}
The solution D* (optimality is assured when the data is i.i.d. in both classes) to the above problem is the SPRT [41] with:
    S_n = ln[p_1(x_1, . . . , x_n) / p_0(x_1, . . . , x_n)]  and  N = inf{n : S_n ∉ (L, U)}.
The decision rule d_N is defined as:
    d_N = 1 if S_N ≥ U;  0 if S_N ≤ L    (4.2)
where L ≈ ln(b/(1 − a)) and U ≈ ln((1 − b)/a). Furthermore, by Wald's identity:
    E_j[N] = E_j[S_N] / E_j[ln(p_1(X)/p_0(X))] = E_j[S_N] / (∑_{x=0}^{W} p_j(x) ln(p_1(x)/p_0(x)))    (4.3)
with E_1[S_N] = Lb + U(1 − b) and E_0[S_N] = L(1 − a) + Ua. We note that the coefficients j = 0, 1 in Eq.(4.3) correspond to legitimate and adversarial behavior respectively.
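A minimal Python sketch of the SPRT just described; the uniform p0 follows the specification-based profile of Sect. III, while the attack pmf used to generate the stream is an invented placeholder (the least favorable pmf is derived in the next subsection).

import math, random

def sprt(samples, p0, p1, a=1e-6, b=0.1):
    """Run the SPRT on a stream of backoff samples.
    Returns (decision, n): decision 1 = misbehavior, 0 = legitimate."""
    L = math.log(b / (1 - a))          # lower threshold
    U = math.log((1 - b) / a)          # upper threshold
    S = 0.0
    for n, x in enumerate(samples, 1):
        S += math.log(p1[x] / p0[x])   # log-likelihood ratio increment
        if S >= U:
            return 1, n
        if S <= L:
            return 0, n
    return None, len(samples)          # undecided: needs more samples

W = 31
p0 = [1.0 / (W + 1)] * (W + 1)                               # uniform on {0,...,W}
p1 = [2 * (W + 1 - x) / ((W + 1) * (W + 2)) for x in range(W + 1)]  # toy attack pmf
stream = random.choices(range(W + 1), weights=p1, k=10_000)
print(sprt(stream, p0, p1))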
A. Adversary Model
To our knowledge, the current literature does not address the discrete nature of the misbehavior
detection problem. This section sets a theoretical framework of the problem in discrete time.
Due to the different nature of the problem, the relations derived in [41, 42] no longer hold
and a new pmf p∗1 that maximizes the performance of the adversary is derived. We assume
the adversary has full control over the probability mass function p1 and the backoff values
it generates. In addition to that, we assume that the adversary is intelligent, i.e., it knows
everything the detection agent knows and can infer the same conclusions as the detection agent.
Goal of the Adversary: we assume the objective of the adversary is to design an access
policy with the resulting probability of channel access PA , while minimizing the probability of
detection. As it has already been mentioned, the optimal access policy results in generation of
backoff sequences according to the pmf p∗1 (x).
Theorem 1: The probability that the adversary accesses the channel before any other terminal when competing with n neighboring (honest) terminals for channel access in saturation condition is:
    Pr[Access] ≡ P_A = 1 / (1 + n E_1[X]/E_0[X])    (4.4)
Note that when E_1[X] = E_0[X] the probability of access is equal for all n + 1 competing nodes (including the adversary), i.e., all of them will have access probability equal to 1/(n + 1). We omit the proof of this theorem and refer the reader to [42] for the detailed derivation.
Solving the above equation for E_1[X] gives us a constraint on p_1. That is, p_1 must satisfy the following equation:
    E_1[X] = E_0[X] (1 − P_A)/(n P_A)    (4.5)
We now let g = (1 − P_A)/(n P_A) in order to be able to parametrize the adversary by the scalar g, which intuitively denotes the level of misbehavior by the adversary. For P_A ∈ [1/(1 + n), 1], g ∈ [0, 1]. Therefore, g = 1 and g = 0 correspond to legitimate behavior and complete misbehavior, respectively. Now, for any given g, p_1 must belong to the class of allowed probability mass functions A_g, where
    A_g ≡ { q : ∑_{x=0}^{W} q(x) = 1 and ∑_{x=0}^{W} x q(x) = g E_0[X] }    (4.6)
After defining its desired access probability PA (or equivalently g), the second objective of the
attacker is to maximize the amount of time it can misbehave without being detected. Assuming
that the adversary has full knowledge of the employed detection test, it attempts to find the
access strategy (with pmf p1 ) that maximizes the expected duration of misbehavior before
an alarm is fired. By looking at Eq.(4.3), the attacker thus needs to minimize the following objective function:
    min_{p_1 ∈ A_g} ∑_{x=0}^{W} p_1(x) ln(p_1(x)/p_0(x))    (4.7)
Theorem 2: Let g ∈ [0, 1] denote the objective of the adversary. The pmf p*_1 that minimizes Eq.(4.7) can be expressed as:
    p*_1(x) = r^x (r^{−1} − 1) / (r^{−1} − r^W) for x ∈ {0, 1, . . . , W}, and p*_1(x) = 0 otherwise    (4.8)
where r is the solution to the following equation:
    [W r^W − r^{−1}(W r^W + r^W − 1)] / [(r^{−1} − 1)(r^{−1} − r^W)] = g W/2    (4.9)
Proof: Notice first that the objective function is convex in p_1. We let q_ε(x) = p*_1(x) + ε h(x) and construct the Lagrangian of the objective function and the constraints:
    ∑_{x=0}^{W} q_ε(x) ln(q_ε(x)/p_0(x)) + µ_1(∑_{x=0}^{W} q_ε(x) − 1) + µ_2(∑_{x=0}^{W} x q_ε(x) − g E_0[X])    (4.10)
By taking the derivative with respect to ε and equating this quantity to zero for all possible sequences h(x), we find that the optimal p*_1 has to be of the form:
    p*_1(x) = p_0(x) e^{−µ_2 x − µ_0}    (4.11)
where µ_0 = µ_1 + 1. In order to obtain the values of the Lagrange multipliers µ_0 and µ_2 we utilize the fact that p_0(x) = 1/(W + 1). Additionally, we utilize the constraints in A_g. The first constraint states that p*_1 must be a pmf, and therefore by summing Eq.(4.11) over x, setting the result equal to one and solving for µ_0 we have
    µ_0 = ln ∑_{x=0}^{W} p_0(x) r^x = ln [ (1/(W + 1)) (r^{−1} − r^W)/(r^{−1} − 1) ]    (4.12)
where r = e^{−µ_2}. Replacing this solution in Eq.(4.11) we get
    p*_1(x) = r^x (r^{−1} − 1) / (r^{−1} − r^W)    (4.13)
The second constraint in A_g is rewritten in terms of Eq.(4.13) as
    [(r^{−1} − 1)/(r^{−1} − r^W)] ∑_{x=0}^{W} x r^x = g E_0[X]    (4.14)
from where Eq.(4.9) follows.
Fig. 4.1 illustrates the optimal distribution p∗1 for two values of the parameter g.
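The least favorable pmf can be computed numerically. The sketch below (illustrative only) does not evaluate the closed form of Eq. (4.9) directly; instead it solves the equivalent mean-backoff constraint E_1[X] = gW/2 for r by bisection and then builds p*_1 from Eq. (4.8).

def p1_star(r, W):
    """Eq. (4.8): least favorable pmf, parametrized by r (0 < r < 1)."""
    weights = [r ** x for x in range(W + 1)]
    Z = sum(weights)
    return [w / Z for w in weights]

def mean(pmf):
    return sum(x * p for x, p in enumerate(pmf))

def solve_r(g, W, tol=1e-10):
    """Find r such that the mean backoff equals g*E0[X] = g*W/2,
    which is exactly the constraint expressed by Eq. (4.9)."""
    target = g * W / 2.0
    lo, hi = 1e-9, 1.0 - 1e-9       # the mean is increasing in r
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean(p1_star(mid, W)) < target:
            lo = mid                # mean too small -> increase r
        else:
            hi = mid
    return 0.5 * (lo + hi)

W, g = 31, 0.5
r = solve_r(g, W)
p1 = p1_star(r, W)                  # approaches the uniform p0 as g -> 1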
B. SPRT Optimality for any Adversary in Ag
Let Φ(D, p1 ) = E1 [N]. We notice that the above solution was obtained in the form
    max_{p_1 ∈ A_g} min_{D ∈ T_{a,b}} Φ(D, p_1)    (4.15)
That is, we first minimized Φ(D, p1 ) with the SPRT (minimization for any p1 ) and then found
the p∗1 that maximized Φ(SPRT, p∗1 ).
Figure 4.1: Form of the least favorable pmf p*_1 for two different values of g (g = 0.98 and g = 0.5038). When g approaches 1, p*_1 approaches p_0. As g decreases, more mass of p*_1 is concentrated towards the smaller backoff values.
However, an optimal detector needs to minimize all losses due to the worst-case attacker.
That is, the optimal test should in principle be obtained by the following optimization problem
    min_{D ∈ T_{a,b}} max_{p_1 ∈ A_g} Φ(D, p_1)    (4.16)
Fortunately, our solution also satisfies this optimization problem since it forms a saddle point
equilibrium, resulting in the following theorem:
Theorem 3: For every D ∈ Ta,b and every p1 ∈ Ag
    Φ(D*, p_1) ≤ Φ(D*, p*_1) ≤ Φ(D, p*_1)    (4.17)
We omit the proof of the theorem since its derivation follows reasoning similar to the one in
[42]. As a consequence of this theorem, no incentive for deviation from (D∗ , p∗1 ) for any of the
players (the detection agent or the misbehaving node) is offered.
C. Evaluation of Repeated SPRT
The original setup of SPRT-based misbehavior detection proposed in [41] was better suited for
on-demand monitoring of suspicious nodes (e.g., when a higher layer monitoring agent requests
the SPRT to monitor a given node because it is behaving suspiciously, and once it reaches a
decision it stops monitoring) and was not implemented as a repeated test.
On the other hand, the configuration of DOMINO is suited for continuous monitoring of
neighboring nodes. In order to obtain fair performance comparison of both tests, a repeated
SPRT algorithm is implemented: whenever dN = 0, the SPRT restarts with S0 = 0. This setup
allows a detection agent to detect misbehavior for both short and long-term attacks. The major
problem that arises from this setup is that continuous monitoring can raise a large number of
false alarms if the parameters of the test are not chosen appropriately.
This section proposes a new evaluation metric for continuous monitoring of misbehaving
nodes. We believe that the performance of the detection algorithms is appropriately captured by
employing the expected time before detection E[TD ] and the average time between false alarms
E[TFA ] as the evaluation parameters.
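The repeated SPRT can be summarized by the following Python sketch (our own illustration, assuming Wald's standard thresholds ln((1−b)/a) and ln(b/(1−a))); running it on streams drawn from p0 and from an assumed attacker pmf p1 gives empirical estimates of the two evaluation parameters.

    import numpy as np

    rng = np.random.default_rng(0)

    def repeated_sprt_time_to_alarm(samples, p0, p1, a=1e-6, b=0.1):
        """Run the repeated SPRT on a stream of backoff samples and return the number
        of samples consumed until the first alarm (None if no alarm was raised)."""
        upper = np.log((1 - b) / a)          # accept H1 (misbehavior)
        lower = np.log(b / (1 - a))          # accept H0 and restart (repeated SPRT)
        llr = np.log(p1 / p0)                # per-sample log-likelihood ratio, indexed by backoff
        S = 0.0
        for n, xi in enumerate(samples, start=1):
            S += llr[xi]
            if S >= upper:
                return n                     # alarm raised
            if S <= lower:
                S = 0.0                      # decide "honest" and restart with S0 = 0
        return None

    # Illustrative use with a uniform p0 and some assumed attacker pmf p1:
    W = 31
    p0 = np.full(W + 1, 1.0 / (W + 1))
    r = 0.9
    p1 = r ** np.arange(W + 1); p1 /= p1.sum()
    honest_stream = rng.integers(0, W + 1, size=200_000)      # samples from p0
    attack_stream = rng.choice(W + 1, size=200_000, p=p1)     # samples from p1
    print("time to false alarm:", repeated_sprt_time_to_alarm(honest_stream, p0, p1))
    print("time to detection:  ", repeated_sprt_time_to_alarm(attack_stream, p0, p1))

For very small a the false-alarm run may not raise any alarm within the stream length, which is exactly the behavior the metric E[TFA] is meant to capture.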
Figure 4.2: Tradeoff curve between the expected number of samples before a false alarm E[TFA] and the expected number of samples for detection E[TD], for the SPRT with a = 1×10^{−6} and b = 0.1 and attack intensities ranging from g = 0.98 down to g = 0.005. For fixed a and b, as g increases (lower intensity of the attack) the time to detection or to a false alarm increases exponentially.
The above quantities are straightforward to compute for the SPRT. Namely, each time the SPRT stops, the decision function can be modeled as a Bernoulli trial with parameters a and 1 − b; the waiting time until the first success is then a geometric random variable. Therefore:

E[TFA] = E0[N]/a    and    E[TD] = E1[N]/(1 − b)    (4.18)

Fig. 4.2 illustrates the tradeoff between these variables for different values of the parameter g. It is important to note that the chosen values of the parameter a in Fig. 4.2 are small. We claim that this represents an accurate estimate of the false alarm rates that need to be satisfied in actual anomaly detection systems [29, 7], a fact that was not taken into account in the evaluation of previously proposed systems.
V Performance analysis of DOMINO
We now present the general outline of the DOMINO detection algorithm. The first step of the algorithm is the computation of the average value of the backoff observations: Xac = (1/m) ∑_{i=1}^{m} Xi. In the next step, the averaged value is compared to a scaled reference backoff value: Xac < γB, where the parameter γ (0 < γ < 1) is a threshold that controls the tradeoff between the false alarm rate and missed detections. The algorithm uses the variable cheat_count, which stores the number of times the average backoff falls below the threshold γB. DOMINO raises an alarm after the threshold is violated more than K times. A forgetting factor is applied to cheat_count if the monitored station behaves normally in the next monitoring period; that is, the node is partially forgiven: cheat_count = cheat_count − 1 (as long as cheat_count remains greater than zero).

We now present the actual detection algorithm from [40]. The algorithm is initialized with cheat_count = 0 and, after collecting m samples, the following detection procedure is executed, where condition is defined as (1/m) ∑_{i=1}^{m} Xi ≤ γB:
    if condition
        cheat_count = cheat_count + 1
        if cheat_count > K
            raise alarm
        end
    elseif cheat_count > 0
        cheat_count = cheat_count - 1
    end

Figure 4.3: For K = 3, the state of the variable cheat_count can be represented as a Markov chain with five states and transition probabilities p and 1 − p. When cheat_count reaches the final state (4 in this case) DOMINO raises an alarm.
It is now easy to observe that DOMINO is a sequential test, with N = m · Nt, where Nt
represents the number of steps cheat_count takes to exceed K and dN = 1 every time the
test stops. We evaluate DOMINO and SPRT with the same performance metrics. However,
unlike SPRT where a controls the number of false alarms and b controls the detection rate,
the parameters m, γ and K in DOMINO are difficult to tune because there has not been any
analysis of their performance. The correlation between DOMINO and SPRT parameters is
further addressed in Sec. VIII.
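For concreteness, the DOMINO test just described can also be written as the following Python sketch (our restatement, not the code of [40]); resetting cheat_count after an alarm is our assumption for continuous monitoring.

    import numpy as np

    def domino_monitor(backoffs, B, gamma=0.7, K=3, m=10):
        """Sketch of the DOMINO test: average every m backoff samples, compare against
        gamma*B, and apply the +1 / -1 (forgetting) update to cheat_count.
        Returns the sample indices at which alarms are raised."""
        alarms = []
        cheat_count = 0
        for start in range(0, len(backoffs) - m + 1, m):
            window = backoffs[start:start + m]
            condition = np.mean(window) <= gamma * B     # averaged backoff below the reference
            if condition:
                cheat_count += 1
                if cheat_count > K:
                    alarms.append(start + m)             # alarm after the threshold K is exceeded
                    cheat_count = 0                      # restart (our assumption for continuous monitoring)
            elif cheat_count > 0:
                cheat_count -= 1                         # partial forgiveness
        return alarms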
In order to provide an analytical model for the performance of the algorithm, we model the detection mechanism in two steps:

1. We first define p := Pr[ (1/m) ∑_{i=1}^{m} Xi ≤ γB ].

2. We define a Markov chain with transition probabilities p and 1 − p. The absorbing state represents the case when misbehavior is detected (note that we assume m is fixed, so p does not depend on the number of observed backoff values). A Markov chain for K = 3 is shown in Fig. 4.3.

We can now write

p = p_j = P_j[ (1/m) ∑_{i=1}^{m} Xi ≤ γB ],    j ∈ {0, 1}
where j = 0 corresponds to the scenario where the samples Xi are generated by a legitimate station (pmf p0(x)) and j = 1 corresponds to samples generated by p∗1(x). In the remainder of this section we assume B = E0[Xi] = W/2.
We now derive the expression for p for the case of a legitimate monitored node. Following the reasoning from Sect. III, we assume that each Xi is uniformly distributed on {0, 1, . . . ,W }.
It is important to note that this analysis provides a lower bound on the probability of false alarms
when the minimum contention window of size W + 1 is assumed. Using the definition of p we
derive the following expression:
p = P0[ ∑_{i=1}^{m} Xi ≤ mγB ] = ∑_{k=0}^{⌊mγB⌋} P0[ ∑_{i=1}^{m} Xi = k ] = ∑_{k=0}^{⌊mγB⌋} |{(x1, . . . , xm) : ∑_{i=1}^{m} xi = k}| / (W + 1)^m    (4.19)
where the last equality follows from the fact that the Xi's are i.i.d. with pmf p0(xi) = 1/(W+1) for all xi ∈ {0, 1, . . . , W}.

The number of ways that m nonnegative integers can sum up to k is (m+k−1 choose k), and ∑_{k=0}^{L} (m+k−1 choose k) = (m+L choose L). An additional constraint is imposed by the fact that each Xi can only take values up to W, which is in general smaller than k, and thus the above combinatorial formula cannot be applied directly. Furthermore, a direct computation of the number of ways m bounded integers can sum up to k is very expensive. As an example, let W + 1 = 32 = 2^5 and m = 10. A direct summation needed for the calculation of p yields at least 2^50 iterations.
Fortunately, an efficient alternative way of computing P0[∑_{i=1}^{m} Xi = k] exists. We first define Y := ∑_{i=1}^{m} Xi. It is well known that the moment generating function of Y, MY(s) = MX(s)^m, can be computed as follows:

MY(s) = [ (1/(W+1)) (1 + e^s + · · · + e^{sW}) ]^m = (1/(W+1)^m) ∑_{k0,...,kW : ∑ ki = m} (m choose k0; k1; · · · ; kW) 1^{k0} e^{s k1} · · · e^{sW kW}

where (m choose k0; k1; · · · ; kW) is the multinomial coefficient. By comparing terms in this expansion of MY(s) we observe that Pr[Y = k] is the coefficient of the term e^{ks}. This result can be used for the efficient computation of p via Eq. (4.19).
Alternatively, we can approximate the computation of p for large values of m. The approximation arises from the fact that, as m increases, Y converges to a Gaussian random variable by the Central Limit Theorem. Thus,

p = Pr[Y ≤ mγB] ≈ Φ(z),    where    z = (mγB − mW/2) / √( W(W+2)m/12 )

and Φ(z) is the standard Gaussian cumulative distribution function. Fig. 4.4 illustrates the exact and approximate calculation of p as a function of m, for γ = 0.9 and W + 1 = 32, and shows the accuracy of the above approximation for both small and large values of m.

Figure 4.4: Exact and approximate values of p as a function of m, for γ = 0.9 and (W + 1) = 32.
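The following Python sketch (ours) mirrors this computation: the repeated convolution of the uniform pmf plays the role of reading off the coefficients of MY(s), and the approximation uses the z defined above.

    import numpy as np
    from scipy.stats import norm

    def p_exact(m, W, gamma, B):
        """P0[sum of m i.i.d. Uniform{0..W} backoffs <= m*gamma*B], via repeated convolution."""
        single = np.full(W + 1, 1.0 / (W + 1))
        pmf = np.array([1.0])
        for _ in range(m):
            pmf = np.convolve(pmf, single)          # pmf of the running sum
        k_max = int(np.floor(m * gamma * B))
        return pmf[:k_max + 1].sum()

    def p_gauss(m, W, gamma, B):
        """Central-limit approximation used in the text: p ~ Phi(z)."""
        z = (m * gamma * B - m * W / 2.0) / np.sqrt(W * (W + 2) * m / 12.0)
        return norm.cdf(z)

    W, gamma = 31, 0.9
    B = W / 2.0                                      # reference backoff of an honest node
    for m in (5, 10, 20, 40):
        print(m, p_exact(m, W, gamma, B), p_gauss(m, W, gamma, B))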
The computation of p = p1 follows the same steps (although the moment generating
function cannot be easily expressed in analytical form, it is still computationally tractable) and
is therefore omitted.
A. Expected Time to Absorption in the Markov Chain
We now derive the expression for the expected time to absorption for a Markov chain with states 0, 1, . . . , K + 1. Let µi be the expected number of transitions until absorption, given that the process starts at state i. In order to compute the stopping times E[TD] and E[TFA], it is necessary to find the expected time to absorption starting from state zero, µ0. Therefore, E[TD] = m × µ0 (computed under p = p1) and E[TFA] = m × µ0 (computed under p = p0).

The expected times to absorption µ0, µ1, . . . , µK+1 are the unique solution of the equations

µK+1 = 0
µi = 1 + ∑_{j=0}^{K+1} pij µj    for i ∈ {0, 1, . . . , K}
where pij is the transition probability from state i to state j. For any K, the equations can be represented in matrix form:

    [ −p     p     0    ···    0  ] [ µ0 ]   [ −1 ]
    [ 1−p   −1     p    ···    0  ] [ µ1 ]   [ −1 ]
    [  0    1−p   −1     p     0  ] [ µ2 ] = [ −1 ]
    [  ⋮                 ⋱     ⋮  ] [  ⋮ ]   [  ⋮ ]
    [  0     ···   0    1−p   −1  ] [ µK ]   [ −1 ]

For example, solving the above equations for µ0 with K = 3, the following expression is derived:

µ0 = (1 − p + 2p² + 2p³) / p⁴

The expression for µ0 for any other value of K is obtained in similar fashion.
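For other values of K, µ0 can be obtained numerically by solving the same linear system, as in the following sketch (ours); E[TD] and E[TFA] then follow by multiplying by m, with p evaluated under p1 and p0 respectively.

    import numpy as np

    def domino_mu0(p, K):
        """Expected number of monitoring rounds until DOMINO raises an alarm, starting
        from cheat_count = 0, by solving the linear system for the absorption times mu_i."""
        n = K + 1                                   # transient states 0, 1, ..., K
        A = np.zeros((n, n))
        for i in range(n):
            A[i, i] = -1.0 if i > 0 else -p         # state 0 stays put with probability 1 - p
            if i + 1 < n:
                A[i, i + 1] = p
            if i > 0:
                A[i, i - 1] = 1.0 - p
        mu = np.linalg.solve(A, -np.ones(n))
        return mu[0]

    p = 0.3
    print(domino_mu0(p, K=3), (1 - p + 2 * p**2 + 2 * p**3) / p**4)   # matches the K = 3 closed form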
Figure 4.5: DOMINO performance for K = 3 and an attacker with g = 0.503759; m ranges from 1 to 60 and γ (0.5, 0.7, 0.8, 0.9) is shown explicitly, together with the SPRT curve (a ranging from 10^{−0.5} to 10^{−10}). As γ tends to either 0 or 1, the performance of DOMINO decreases. The SPRT outperforms DOMINO regardless of γ and m.
VI Theoretical Comparison
In this section we compare the tradeoff curves between E[TD ] and E[TFA ] for both algorithms. For the sake of concreteness we compare both algorithms for an attacker with g = 0.5.
Similar results were observed for other values of g.
For the SPRT we set b = 0.1 arbitrarily and vary a from 10^{−1/2} down to 10^{−10} (motivated by the realistic low false alarm rates required by actual intrusion detection systems [29]). Due to
the fact that in DOMINO it is not clear how the parameters m, K and γ affect our metrics, we
vary all the available parameters in order to obtain a fair comparison. Fig. 4.5 illustrates the
performance of DOMINO for K = 3 (the default threshold used in [40]). Each curve for γ has
m ranging between 1 and 60, where m represents the number of samples needed for reaching
a decision. Observing the results in Fig. 4.5, it is easy to conclude that the best performance
of DOMINO is obtained for γ = 0.7, regardless of m. Therefore, this value of γ is adopted
as an optimal threshold in further experiments in order to obtain fair comparison of the two
algorithms.
Fig. 4.6 represents the evaluation of DOMINO for γ = 0.7 with varying threshold K. For
each value of K, m ranges from 1 to 60. In this figure, however, we notice that with the increase
of K, the point with m = 1 forms a performance curve that is better than any other point with
m > 1.
Consequently, Fig. 4.7 represents the best possible performance for DOMINO; that is, we
let m = 1 and change K from one up to one hundred. We again test different γ values for this
configuration, and conclude that the best γ is still close to the optimal value of 0.7 derived from
experiments in Fig. 4.5. However, even with the optimal setting, DOMINO is outperformed by
the SPRT.
Figure 4.6: DOMINO performance for various thresholds K (0, 1, 2, 10), γ = 0.7, and m in the range from 1 to 60, against an attacker with g = 0.5037. The performance of DOMINO decreases as m increases. For fixed γ, the SPRT outperforms DOMINO for all values of the parameters K and m.

Figure 4.7: The best possible performance of DOMINO is obtained when m = 1 and K changes in order to accommodate the desired level of false alarms (attacker with g = 0.503759; curves for γ = 0.7 and γ = 0.9). The best γ must be chosen independently.

VII Nonparametric CUSUM statistic

As concluded in the previous section, DOMINO exhibits suboptimal performance for every possible configuration of its parameters. However, the original idea of DOMINO is very intuitive and simple: it compares the observed backoff of the monitored nodes with the expected backoff of honest nodes within a given period of time. This section extends this idea to create a test that we believe will have better performance than DOMINO, while still preserving its simplicity.
Inspired by the notion of nonparametric statistics for change detection, we adapt the nonparametric cumulative sum (CUSUM) statistic and apply it in our analysis. The nonparametric CUSUM is initialized with y0 = 0 and updates its value as follows:

yi = (yi−1 − xi + γB)^+    (4.20)

An alarm is fired whenever yi > c.

Assuming E0[X] > γB and E1[X] < γB (i.e., the expected backoff value of an honest node is always larger than the threshold, while that of a misbehaving node is smaller), the properties of the CUSUM test with regard to the expected false alarm and detection times are captured by the following theorem.

Theorem 4: The probability of firing a false alarm decreases exponentially with c. Formally, as c → ∞,

sup_i | ln( P0[Yi > c] ) | = O(c)    (4.21)

Furthermore, the delay in detection increases only linearly with c. Formally, as c → ∞,

TD = c / (γB − E1[X])    (4.22)
The proof is a straightforward extension of the case originally considered in [45].
It is easy to observe that the CUSUM test is similar to DOMINO, with c being equivalent
to the upper threshold K in DOMINO and the statistic y in CUSUM being equivalent to the
variable cheat_count in DOMINO when m = 1.
The main difference between DOMINO with m = 1 and the CUSUM statistic is that every time there is a "suspicious event" (i.e., whenever xi ≤ γB), cheat_count is increased by one, whereas in CUSUM yi is increased by an amount proportional to the level of suspected misbehavior. Similarly, when xi > γB, cheat_count is decreased only by one (or kept at zero), while yi is decreased by xi − γB (or reset to zero if yi−1 + γB − xi < 0).
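The statistic is simple to implement; the following sketch (ours) returns the sample index of the first CUSUM alarm.

    def cusum_alarm_time(backoffs, gamma, B, c):
        """Nonparametric CUSUM of Eq. (4.20): y_i = (y_{i-1} - x_i + gamma*B)^+,
        alarm when y_i > c. Returns the index of the first alarm, or None."""
        y = 0.0
        for i, x in enumerate(backoffs, start=1):
            y = max(0.0, y - x + gamma * B)
            if y > c:
                return i
        return None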
VIII Experimental Results
We now proceed to the experimental evaluation of the analyzed detection schemes. It has already been mentioned that we assume the existence of an intelligent adaptive attacker that is able to adjust its access strategy depending on the level of congestion in the environment. Namely, we assume that, in order to minimize the probability of detection, the attacker chooses legitimate over selfish behavior when the congestion level is low, and chooses an adaptive selfish strategy in congested environments. For these reasons, in constructing the experiments we assume a scenario where all the participants are backlogged, i.e., have packets to send at any given time. We assume that the attacker employs the worst-case misbehavior strategy in this setting, enabling the detection system to estimate the maximal detection delay. It is important to mention that this setting also represents the worst-case scenario with regard to the number of false alarms per unit of time, because the detection system is forced to make the maximum number of decisions per unit of time. We expect the number of alarms to be smaller in practice.
The backoff distribution of an optimal attacker was implemented in the network simulator Opnet, and tests were performed for various levels of false alarms. We note that the simulations were performed with nodes that followed the standard IEEE 802.11 access protocol (with exponential backoff). The results presented in this work correspond to the scenario consisting of two legitimate nodes and one selfish node competing for channel access. The detection agent was implemented such that any backoff value Xi > W was set to W. We know this is an arbitrary approximation, but our experiments show that it works well in practice. It is important to mention that the resulting performance comparison of DOMINO, CUSUM and SPRT does not change for any number of competing nodes: the SPRT always exhibits the best performance. In order to demonstrate the performance of all detection schemes for more aggressive attacks, we choose to present the results for the scenario where the attacker attempts to access the channel 60% of the time (as opposed to 33% if it behaved legitimately). The backlogged environment in Opnet was created by employing a relatively high packet arrival rate: the results were collected for an exponential(0.01) packet arrival rate and a packet size of 2048 bytes. The results for both legitimate and malicious behavior were collected over a fixed period of 100 s.

In order to obtain a fair performance comparison, a performance metric different from the one in [42] was adopted. The evaluation was performed as a tradeoff between the average time to detection and the average time to false alarm. It is important to mention that the theoretical performance evaluation of both DOMINO and SPRT was measured in number of samples. Here, however, we take advantage of the experimental setup and measure time in seconds, a quantity that is more meaningful and intuitive in practice.
We now proceed to the experimental performance analysis of the SPRT, CUSUM and DOMINO-based detection schemes. Fig. 4.8 represents the first step in our evaluation. We evaluated the performance of the SPRT using the same parameters as in the theoretical analysis in Sect. VI. DOMINO was evaluated for fixed γ = 0.9, which corresponds to the value used in the experimental evaluation in [40]. In order to compare its performance to the SPRT, we vary the value of K, which essentially determines the number of false alarms. We observe the performance of DOMINO for two different values of the parameter m. As can be seen from Fig. 4.8, the SPRT outperforms DOMINO for all values of K and m. We note that the best performance of DOMINO was obtained for m = 1 (the detection delay is smaller when the decision is made after every sample). Therefore, we adopt m = 1 for further analysis of DOMINO. Fig. 4.9 reproduces the setting used for the theoretical analysis in Fig. 4.7. Naturally, we obtain the same results as in Fig. 4.7 and choose γ = 0.7 for the final performance analysis of DOMINO.

After finding the optimal values of γ and m, we now perform the final evaluation of DOMINO, CUSUM and SPRT. The results are presented in Fig. 4.10. We observe that even for the optimal setting of DOMINO, the SPRT outperforms it for all values of K. We also note that, due to the reasons explained in Sect. VII, the CUSUM test experiences detection delays similar to those of the SPRT.

If the logarithmic x-axis in the tradeoff curves in Sect. VI is replaced with a linear one, we can better appreciate how accurately our theoretical results match the experimental evidence (Fig. 4.11).
Figure 4.8: Tradeoff curves for each of the proposed algorithms. DOMINO has parameters γ = 0.9 and m = 1 or m = 20, with K as the variable parameter; the nonparametric CUSUM algorithm has c as its variable parameter; and the SPRT has b = 0.1 with a as the variable parameter.

Figure 4.9: Tradeoff curves for the default DOMINO configuration with γ = 0.9, the best performing DOMINO configuration with γ = 0.7, and the SPRT.

Figure 4.10: Tradeoff curves for the best performing DOMINO configuration with γ = 0.7, the best performing CUSUM configuration with γ = 0.7, and the SPRT.

Figure 4.11: Comparison between theoretical and experimental results: the theoretical analysis (DOMINO with K = 3 and an attacker with g = 0.503759, for γ = 0.7, 0.8, 0.9, together with the SPRT curve) plotted with a linear x-axis closely resembles the experimental results.

IX Conclusions and future work

In this work, we performed an extensive analytical and experimental comparison of the existing misbehavior detection schemes in the IEEE 802.11 MAC. We confirm the optimality of
the SPRT-based detection schemes and provide analytical and intuitive explanation of why the
other schemes exhibit suboptimal performance when compared to the SPRT schemes. In addition to that, we offer an extension to the DOMINO algorithm that still preserves its original
idea and simplicity, while significantly improving its performance. Our results show the value of a rigorous formulation of the problem and of a formal adversarial model, since the resulting schemes can outperform heuristic solutions. We believe our model applies not only to the MAC layer but to more general adversarial settings. In several practical security applications, such as biometrics, spam filtering and watermarking, the attacker has control over the attack distribution, and this distribution can be modeled in a similar fashion to our approach.
We now mention some issues for further study. This chapter focused on the performance
analysis and comparison of the existing detection schemes. A first issue concerns employment
of penalizing functions against misbehaving nodes once an alarm is raised. When an alarm
is raised, penalties such as the access point never acknowledging the receipt of the packet (in
order to rate-limit the access of the node) or denying access to the medium for a limited period
of time should be considered. If constant misbehavior (even after being penalized) is exhibited,
the system should apply more severe penalties, such as revocation from the network. A second
issue concerns defining alternative objectives of the adversary, such as maximizing throughput
while minimizing the probability of detection.
Chapter 5
Secure Data Hiding Algorithms
It will readily be seen that in this case the alleged right of the Duke to the whale was a
delegated one from the Sovereign. We must needs inquire then on what principle the Sovereign
is originally invested with that right.
-Moby Dick, Herman Melville
I Overview
Digital watermarking allows hidden data, such as a fingerprint or message, to be placed
on a media signal (e.g., a sound recording or a digital image). When a watermark detector is
given this media signal, it should be able to correctly decode the original embedded fingerprint.
In this chapter we give a final example of the application of our framework to the problem
of watermark verification. In the next section we provide a general formulation of the problem.
Then in section III we present a watermark verification problem where the adversary is parameterized by a Gaussian distribution. Finally, in section IV we model a watermark verification
problem where the adversary is given complete control over its attack distribution.
II General Model
A. Problem Description
The watermark verification problem consists of two main algorithms: an embedding algorithm E and a detection algorithm D.
• The embedder E receives as inputs a signal s and a bit m. The objective of the embedder
is to produce a signal x with no perceptual loss of information or major differences from
s, but that carries also the information about m. The general formulation of this required
property is to force the embedder to satisfy the following constraint: D(x, s) < Dw , where
D() is a distortion function and Dw is an upper bound on the amount of distortion allowed
for embedding.
• The detection algorithm D receives a signal y, which is assumed to be x or an altered
version of x either by natural errors (e.g., channel losses) or by an adversary. The detector
has to determine if y was embedded with m = 1 or not.
• In order to facilitate the detection process, the embedder and the detector usually share a
random secret key k (e.g., the seed for a pseudorandom number generator that is used to
create an embedding pattern p). This key can be initialized in the devices or exchanged
via a secure out of band channel.
B. Adversary Model
• Information available to the adversary: We assume the adversary knows the embedding
and detection algorithms E and D . Furthermore we assume the adversary knows the
distribution of K, S, M, however it does not know the particular realizations of these
values. That is, the adversary does not know the particular realization of k, s, or m and
therefore does not know the input arguments of E .
• Capabilities of the adversary: The adversary is a man-in-the-middle between the embedder and the detector. Therefore it can intercept the signal x produced by E , and can send
in its place, a signal y to the detector. However the output of the adversary should satisfy
a distortion constraint D(y, s) < Da , since y should be perceptually similar to s.
C. Design Space
The previous description can be summarized by the following diagram:

Ek(m, s): x ← f(x|s, m, k)   --x-->   A: y ← f(y|x)   --y-->   Dk: ln( f(y|k, m=1) / f(y|k, m=0) ) ≷_{H0}^{H1} τ

where i ← f(i|j) denotes that i is sampled from a distribution with pdf f(i|j). Alternatively, i can be understood to be the output of an efficient probabilistic algorithm with input j and output i. Therefore, from now on we will use Ek(m, s) and A(x) interchangeably with f(x|s, m, k) and f(y|x), respectively.
Notice that an optimal detection algorithm (optimal in the sense that it minimizes several evaluation metrics such as the probability of error, the expected cost, or the probability of missed positives given an upper bound on the probability of false positives) is the log-likelihood ratio test

ln( f(y|k, m=1) / f(y|k, m=0) ) ≷_{H0}^{H1} τ

where Hi denotes the hypothesis that m = i, and τ depends on the evaluation metric. If the log-likelihood ratio is greater than τ then the detector decides for m = 1, and if it is less than τ the detector decides for m = 0 (if it is equal to τ the detector randomly selects the hypothesis based on the evaluation metric being considered).
Of course, in order to implement this optimal detection strategy, the detector requires knowledge of f(y|k, m), which can be computed as follows:

f(y|k, m) = ∫_S ∫_X f(y|s, x, k, m) f(x|s, k, m) f(s) dx ds = ∫_S ∫_X f(y|x) f(x|s, k, m) f(s) dx ds

Therefore an optimal detection algorithm requires knowledge of the embedding distribution f(x|s, k, m), the attack algorithm f(y|x), and the pdf f(s) of s. Note that f(s) is fixed, since neither the embedding algorithm nor the adversary can control it. As a result, the overall performance of the detection scheme is only a function of the variable f(y|x) (i.e., of the adversary A) and of the variable f(x|s, k, m) (i.e., the embedding algorithm E). The performance can then be evaluated by a metric which takes into account these two variable parameters: Ψ(E, A).
Assuming Ψ(E, A) represents the losses of the system, our task is to find E∗ to minimize Ψ. However, since A is unknown and arbitrary, the natural choice is to find the embedding scheme E∗ that minimizes the possible damage that can be done by an adversary A∗:

(E∗, A∗) = arg min_{E ∈ FDw} max_{A ∈ FDa} Ψ(E, A)    (5.1)

where the feasible design space FDw and the feasible adversary class FDa consist of embedding distributions and attack distributions that satisfy certain distortion constraints. This formulation guarantees that

∀A  Ψ(E∗, A) ≤ Ψ(E∗, A∗).

Whenever possible we are also interested in finding out whether E∗ is indeed the best embedding distribution against A∗, and therefore we would like to satisfy a saddle point equilibrium:

Ψ(E∗, A) ≤ Ψ(E∗, A∗) ≤ Ψ(E, A∗)
D. Evaluation metric

The evaluation metric should reflect good properties of the system for the objective it was designed for. In our particular case we are interested in the probability of error Pr[Dk(y) ≠ m] when the following random experiment is performed: k is sampled from a uniform distribution on {0, 1}^n (where n is the length of k), s is sampled from a distribution with pdf f(s), m is sampled from its prior distribution f(m), and then the embedding algorithm and the adversary algorithm are executed. This random experiment is usually expressed in the following notation:

Ψ(E, A) ≡ Pr[ k ← {0, 1}^n; s ← f(s); m ← f(m); x ← Ek(s, m); y ← A(x) : Dk(y) ≠ m ]    (5.2)

The min-max formulation with this evaluation metric minimizes the damage of an adversary whose objective is to produce a signal y that removes the embedded information m. In particular, the adversary wants the detection algorithm, on input y, to make an incorrect decision on whether the original signal s was embedded with m = 1 or not.
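The random experiment of Eq. (5.2) can be estimated by straightforward Monte Carlo simulation, as in the following sketch (ours); embed, attack, detect and sample_s are hypothetical callables standing in for E, A, D and f(s).

    import numpy as np

    rng = np.random.default_rng(1)

    def estimate_error(embed, attack, detect, sample_s, n_trials=10_000, key_bits=128):
        """Monte Carlo estimate of the evaluation metric of Eq. (5.2).
        embed(k, m, s) -> x, attack(x) -> y and detect(k, y) -> m_hat are
        hypothetical callables provided by the caller."""
        errors = 0
        for _ in range(n_trials):
            k = rng.integers(0, 2, size=key_bits)       # k <- {0,1}^n
            s = sample_s()                               # s <- f(s)
            m = int(rng.integers(0, 2))                  # m <- f(m), uniform prior assumed
            y = attack(embed(k, m, s))
            errors += int(detect(k, y) != m)
        return errors / n_trials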
III Additive Watermarking and Gaussian Attacks
The model described in the previous section is often intractable (depending on the specific objective function, distortion functions, prior distributions, etc.), and therefore several approximations are made in order to obtain a solution.

For example, in practice there exist several popular embedding algorithms, such as spread spectrum watermarking and QIM watermarking, and researchers often try to optimize the parameters of these embedding schemes, as opposed to finding new embedding algorithms. Furthermore, modeling an adversary that can select any algorithm to produce its output is also very difficult, because we would have to consider non-parametric and arbitrary non-linear attacks. Therefore researchers often assume a linear (additive) adversary that is parameterized by a Gaussian random process. This assumption is motivated by several arguments, including information theoretic arguments claiming that a Gaussian distribution is the least favorable noise in a channel, or as an approximation justified by the central limit theorem.
In this chapter we follow one such model, originally proposed in [46]. Contrary, however, to the results in [46], we relax two assumptions. First, we relax the assumption of spread spectrum watermarking and instead search for the optimal embedding algorithm in this model formulation. Second, we relax the assumption of diagonal processors (an assumption mostly due to the fact that the embedding algorithm used was spread spectrum watermarking) and obtain results for the general case. The end result is that our algorithms achieve a lower objective function value than [46] for any possible attacker in the feasible attacker class.

In the following section we describe the model of the problem and obtain our results. Then in section E we discuss our results and compare them to [46].
A. Mathematical Model
Given s ∈ R^N and m ∈ {0, 1}, we assume an additive embedder E that outputs x = Φ(s + pm), where Φ is an N × N matrix and p ∈ R^N is a pattern sampled from a distribution with pdf h(p). Since p is the only random element in the watermarking algorithm, it is assumed to depend on the key k, and therefore from now on we will replace k with p without loss of generality.

The attacker A is modeled by y = Γx + e, where Γ is an N × N matrix and e is a zero-mean (since any non-zero-mean attack is suboptimal [46]) Gaussian random vector with correlation matrix Re.

Finally, the detection algorithm has to perform the following hypothesis test:

H0 : y = ΓΦs + e
H1 : y = ΓΦs + e + ΓΦp

If the objective function Ψ(E, A) the detector wants to minimize is the probability of error, then we know that an optimal detection algorithm is the log-likelihood ratio test. By assuming that the two hypotheses are equally likely, we find that τ = 0. Furthermore, in order to compute f(y|p, m), it is assumed that s is a Gaussian random vector with zero mean (without loss of generality) and correlation matrix Rs.
The following diagram summarizes the model:

E_p(m, s): x = Φ(s + pm)   --x-->   A: y = Γx + e   --y-->   D_p: p^t ϒ^t Ry^{−1} y − (1/2) p^t ϒ^t Ry^{−1} ϒ p ≷_{H0}^{H1} 0

where Ry = ΓΦRsΦ^tΓ^t + Re and ϒ = ΓΦ.

The distortion constraints that E and A need to satisfy are selected to be the squared error distortion:

FDw = { E : E||X − S||² ≤ N Dw }    and    FDa = { A : E||Y − S||² ≤ N Da }

where E = (Φ, Rp, h) and A = (Γ, Re).
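Once the model matrices are fixed, the detector in this diagram is easy to implement; the following numpy sketch (ours, with τ = 0 and ties resolved towards H1 for simplicity) computes the test statistic above.

    import numpy as np

    def detect(y, p, Phi, Gamma, Rs, Re):
        """Likelihood-ratio detector of the model above: decide m = 1 when
        p^T Y^T Ry^{-1} y - 0.5 p^T Y^T Ry^{-1} Y p >= 0, with Y = Gamma Phi."""
        Ups = Gamma @ Phi
        Ry = Ups @ Rs @ Ups.T + Re
        w = np.linalg.solve(Ry, Ups @ p)          # Ry^{-1} (Upsilon p)
        stat = w @ y - 0.5 * w @ (Ups @ p)
        return int(stat >= 0)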
B. Optimal Embedding Distribution
In the model described in the previous section, the probability of error can be found to be:

Ψ(E, A) = Pr[Dp(y) ≠ m] = E_p[ Q( √(p^t Ω p) ) ] = ∫ Q( √(p^t Ω p) ) h(p) dp

where

Ω = (1/2) Φ^t Γ^t (ΓΦRsΦ^tΓ^t + Re)^{−1} ΓΦ

and where p is the random pattern (watermark) and h(p) is its unknown pdf. Ω is a function of the signal and noise covariances Rs, Re, the watermark covariance Rp and the scaling matrices Γ and Φ. If we fix all these quantities, then we would like to determine the form of h(p) that minimizes the error probability (since this is the goal of the decision maker).
To solve the previous problem we rely on the following property of the Q(·) function, which can be verified by direct differentiation.

Lemma 3: The function Q(√x) is a convex function of x.

We can now use this convexity property and apply Jensen's inequality to conclude that

E_x[ Q(√x) ] ≥ Q( √(E_x[x]) );

we have equality iff x is a constant with probability 1 (wp1). Using this result in our case we get

Pe = E_p[ Q( √(p^t Ω p) ) ] ≥ Q( √(E_p[p^t Ω p]) ) = Q( √(tr{Ω Rp}) ).    (5.3)

Relation (5.3) provides a lower bound on the error probability for any pdf (which of course satisfies the covariance constraint). We have equality in (5.3) iff wp1 we have

p^t Ω p = tr{Ω Rp}.    (5.4)

In other words, every realization of p (that is, every watermark p) must satisfy this equality. Notice that if we can find a pdf for p which satisfies (5.4) under the constraint that E[p p^t] = Rp (remember we fixed the covariance Rp), then we will attain the lower bound in Equation (5.3).
To find a random vector p that achieves what we want, we must do the following. Consider the SVD of the matrix

Rp^{1/2} Ω Rp^{1/2} = U Σ U^t    (5.5)

where U is orthonormal and Σ = diag{σ1, . . . , σK} is diagonal with nonnegative elements. The nonnegativity of the σi is assured because the matrix is nonnegative definite. Let A be a random vector with i.i.d. elements that take the values ±1 with probability 0.5. For every vector A we can then define an embedding vector p as follows:

p = Rp^{1/2} U A    (5.6)
Let us see if this definition satisfies our requirements. First consider the covariance matrix, which must be equal to Rp. Indeed we have

E[p p^t] = Rp^{1/2} U E[A A^t] U^t Rp^{1/2} = Rp^{1/2} U I U^t Rp^{1/2} = Rp^{1/2} I Rp^{1/2} = Rp,

where we used the independence of the elements of the vector A and the orthonormality of U (I denotes the identity matrix). So our random vector has the correct covariance structure. Let us now see whether it also satisfies the constraint p^t Ω p = tr{Ω Rp} wp1. Indeed, for every realization of the random vector A we have

p^t Ω p = A^t U^t Rp^{1/2} Ω Rp^{1/2} U A = A^t Σ A = A1² σ1 + A2² σ2 + · · · + AK² σK = σ1 + σ2 + · · · + σK,

where we use the fact that the elements Ai of A are equal to ±1. Notice also that

tr{Ω Rp} = tr{Rp^{1/2} Ω Rp^{1/2}} = tr{U Σ U^t} = tr{Σ U^t U} = tr{Σ} = σ1 + · · · + σK,

which proves the desired equality. So we conclude that, although p is a random vector (our possible watermarks), all its realizations satisfy the equality

p^t Ω p = tr{Ω Rp}.

This of course suggests that this specific choice of watermarking attains the lower bound in (5.3).
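The construction of Eq. (5.6) can be checked numerically; in the following sketch (ours) every sampled watermark attains the deterministic value p^tΩp = tr{ΩRp}, as required by Eq. (5.4).

    import numpy as np

    rng = np.random.default_rng(2)

    def sample_watermark(Rp, Omega):
        """Draw a watermark p = Rp^{1/2} U A with A in {-1,+1}^N, where U comes from
        the eigendecomposition of Rp^{1/2} Omega Rp^{1/2} (Eqs. 5.5 and 5.6)."""
        vals, vecs = np.linalg.eigh(Rp)                       # symmetric square root of Rp
        Rp_half = vecs @ np.diag(np.sqrt(np.clip(vals, 0, None))) @ vecs.T
        M = Rp_half @ Omega @ Rp_half
        _, U = np.linalg.eigh(M)                              # M = U Sigma U^t (PSD matrix)
        A = rng.choice([-1.0, 1.0], size=Rp.shape[0])
        return Rp_half @ U @ A

    # Illustrative check with random PSD matrices Rp and Omega:
    N = 4
    G = rng.normal(size=(N, N)); Rp = G @ G.T + np.eye(N)
    H = rng.normal(size=(N, N)); Omega = H @ H.T
    p = sample_watermark(Rp, Omega)
    print(p @ Omega @ p, np.trace(Omega @ Rp))                # the two numbers should agree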
B.1 Summary

We have found the optimum embedding distribution. It is a random mixture of the columns of the matrix Rp^{1/2} U, of the form Rp^{1/2} U A, where A is a vector with elements ±1. This of course suggests that we can have 2^N different patterns. Rp is the final matrix we end up with from the max-min game, and U comes from the SVD of the corresponding final matrix Rp^{1/2} Ω Rp^{1/2}.

Once we are given h∗, the game the embedder and the attacker play is the following:

max_{Rp,Φ} min_{Re,Γ} tr{Ω Rp}

More specifically:

max_{Rp,Φ} min_{Re,Γ} tr{ (ΓΦRsΦ^tΓ^t + Re)^{−1} ΓΦRpΦ^tΓ^t }    (5.7)

subject to the distortion constraints:

tr{ (Φ − I)Rs(Φ − I)^t + ΦRpΦ^t } ≤ N Dw    (5.8)
tr{ (ΓΦ − I)Rs(ΓΦ − I)^t + ΓΦRpΦ^tΓ^t + Re } ≤ N Da    (5.9)
C. Least Favorable Attacker Parameters
C.1 Minimization with respect to Re

Assuming ϒ = ΓΦ is fixed, we start by minimizing (5.7) with respect to Re. This minimization problem is addressed with the use of variational techniques. Let Rεe = Roe + ε∆. By forming the Lagrangian of the optimization problem (5.7) under constraint (5.9), the goal of the attacker is to minimize with respect to ε the following objective function:

f(ε) = tr{ (ϒRsϒ^t + Rεe)^{−1} ϒRpϒ^t } + µ( tr{ (ϒ − I)Rs(ϒ − I)^t + ϒRpϒ^t + Rεe } − N Da )    (5.10)

A necessary condition for the optimality of R∗e is

d f(ε)/dε |_{ε=0} = 0

which implies

−tr{ (ϒRsϒ^t + R∗e)^{−1} ∆ (ϒRsϒ^t + R∗e)^{−1} ϒRpϒ^t } + µ tr{∆} = 0.

This must be zero for any ∆, and thus we need

(ϒRsϒ^t + R∗e)^{−1} ϒRpϒ^t (ϒRsϒ^t + R∗e)^{−1} = µ I,

from where we solve for R∗e = (1/√µ) (ϒRpϒ^t)^{1/2} − ϒRsϒ^t.

The dual problem is then to maximize with respect to µ the following function:

√µ tr{ (ϒRpϒ^t)^{1/2} } + µ( tr{ (ϒ − I)Rs(ϒ − I)^t + ϒRpϒ^t + µ^{−1/2}(ϒRpϒ^t)^{1/2} − ϒRsϒ^t } − N Da )

which after some simplification becomes

2√µ tr{ (ϒRpϒ^t)^{1/2} } − 2µ tr{ϒRs} + µ tr{Rs} + µ tr{ϒRpϒ^t} − µ N Da.

Taking the derivative with respect to µ and equating it to zero we obtain:

1/√µ = ( 2tr{ϒRs} − tr{Rs} − tr{ϒRpϒ^t} + N Da ) / tr{ (ϒRpϒ^t)^{1/2} }
C.2 Minimization with respect to Γ

Since ϒ = ΓΦ, we note that the attacker can completely control ϒ by an appropriate choice of Γ (assuming Φ is invertible); therefore we only need to consider the minimization over ϒ of the following function:

( tr{ (ϒRpϒ^t)^{1/2} } )² / ( 2tr{ϒRs} − tr{Rs} − tr{ϒRpϒ^t} + N Da )

Proceeding similarly to the previous case, we use variational techniques with ϒε = ϒo + ε∆. After differentiating the objective function with respect to ε and equating it to zero, we obtain the following equation:

tr{ (∆Rpϒo^t + ϒoRp∆^t)(ϒoRpϒo^t)^{−1/2} } Cd − ( 2tr{∆Rs} − tr{∆Rpϒo^t + ϒoRp∆^t} ) Cn = 0

where Cd and Cn are scalar factors that do not depend on ∆ (and will be determined later). Since the above equation must be equal to zero for any ∆, we need

2Cd ( Rpϒo^t (ϒoRpϒo^t)^{−1/2} ) − 2Cn ( Rs − Rpϒo^t ) = 0

or alternatively, letting C = Cd/Cn,

C (ϒoRpϒo^t)^{1/2} + ϒoRpϒo^t = ϒoRs

Letting Σ = ϒoRp^{1/2} and A = Rp^{−1/2}Rs we obtain

C (ΣΣ^t)^{1/2} = ΣA − ΣΣ^t

and after squaring both sides we have

C² Σ^t = (A − Σ^t) Σ (A − Σ^t)

or equivalently,

(A − Σ^t) Σ (A − Σ^t) − C² Σ^t = 0
D. Optimal Embedding Parameters

max_{Φ,Rp}  ( tr{ (ΦRpΦ^t)^{1/2} } )² / ( 2tr{ΦRs} − tr{Rs} − tr{ΦRpΦ^t} + N Da )    (5.11)

subject to:

tr{ (Φ − I)Rs(Φ − I)^t + ΦRpΦ^t } ≤ N Dw    (5.12)

For any Φ and any Rp we have, by the Schwarz inequality, that:

( tr{ (ΦRpΦ^t)^{1/2} } )² / ( 2tr{ΦRs} − tr{Rs} − tr{ΦRpΦ^t} + N Da )    (5.13)
    ≤  N tr{ΦRpΦ^t} / ( 2tr{ΦRs} − tr{Rs} − tr{ΦRpΦ^t} + N Da )    (5.14)

Equality is achieved if and only if

ΦRpΦ^t = κ I,  i.e.,  R∗p = κ (Φ^tΦ)^{−1},    (5.15)

where κ is a constant that can be determined by the following argument. Notice first that the bound in (5.14) is an increasing function of tr{ΦRpΦ^t}. The maximum value is therefore achieved when the constraint (5.12) is satisfied with equality. Solving for κ from this constraint we obtain

κ = ( N Dw − tr{ (Φ − I)Rs(Φ − I)^t } ) / N

Replacing the values of κ and Rp into the original objective function and letting λ be the maximum value the objective function can achieve (as a function of Φ), we have:

N ( N Dw − tr{ (Φ − I)Rs(Φ − I)^t } ) / ( 2tr{ΦRs} − tr{Rs} + N Da − N Dw + tr{ (Φ − I)Rs(Φ − I)^t } )  ≤  λ

After some algebraic manipulations we obtain the following:

N Dw/(λ + 1) − λ(Da − Dw)/(λ + 1)  ≤  tr{ (Φ − (λ + 1)^{−1} I) Rs (Φ − (λ + 1)^{−1} I)^t } + λ tr{Rs}/(λ + 1)²

The minimum value the right hand side of this inequality can achieve (as a function of Φ) is attained at Φ∗ = (λ + 1)^{−1} I. With Φ∗, the following equation can be used to solve for λ:

( λ/(λ + 1) ) tr{Rs} + λ(Da − Dw) = N Dw

Solving for λ we obtain

λ = ( Dw − Da + N Dw − tr{Rs} + √( 4(Da − Dw) N Dw + (Da − Dw − N Dw + tr{Rs})² ) ) / ( 2(Da − Dw) )
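The closed-form solution above is straightforward to evaluate; the following sketch (ours, assuming Da > Dw so that the denominator is positive) computes λ, Φ∗, κ and R∗p from Rs, Dw and Da.

    import numpy as np

    def optimal_embedder_params(Rs, Dw, Da):
        """Closed-form parameters derived above: lambda from the quadratic, then
        Phi* = (lambda+1)^{-1} I, kappa, and Rp* = kappa (Phi*^T Phi*)^{-1}.
        Assumes Da > Dw."""
        N = Rs.shape[0]
        tr_Rs = np.trace(Rs)
        diff = Da - Dw
        lam = (Dw - Da + N * Dw - tr_Rs
               + np.sqrt(4 * diff * N * Dw + (diff - N * Dw + tr_Rs) ** 2)) / (2 * diff)
        Phi = np.eye(N) / (lam + 1)
        kappa = (N * Dw - np.trace((Phi - np.eye(N)) @ Rs @ (Phi - np.eye(N)).T)) / N
        Rp = kappa * np.linalg.inv(Phi.T @ Phi)
        return lam, Phi, kappa, Rp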
E. Discussion

Notice that so far we have solved the problem in the following way:

min_{Φ,Rp} max_{Γ,Re} min_{h} Ψ(E, A)    (5.16)

where (as we mentioned before) E = (h, Rp, Φ) and A = (Γ, Re). This means that, given Φ and Rp, the adversary will select Γ and Re in order to maximize the probability of error, and then, given the parameters chosen by the adversary, we finally find the embedding distribution h minimizing the probability of error.

The problem with this solution is that in practice the embedding algorithm will be given in advance, and the adversary will have the opportunity to change its behavior based on the given embedding algorithm (including h).

For notational simplicity assume Φ and Rp are fixed, so that we can replace E with h in the remainder of this chapter. Furthermore, let h(A) denote the embedding distribution as a function of the parameters A = (Γ, Re) (recall that h depends on A through the selection of U in Equation 5.6). Now we can easily express the problem we have solved:

∀A  Ψ(h∗(A), A) ≤ Ψ(h(A), A)

This is true in particular for A∗, the solution to the full optimization problem in Equation 5.16. Moreover, the above is also true for the distribution used in the previous work [46], which assumed a Gaussian embedding distribution hG:

∀A  Ψ(h∗(A), A) ≤ Ψ(hG, A)

Notice also that in [46] the solution obtained was

A^G = arg max_A Ψ(hG, A)    (5.17)
Due to some approximations made in [46], Ψ(hG, A) turns out to be the same objective function given in Equation 5.7. Furthermore, in [46] there were further approximations in order to obtain linear processors (diagonal matrices). In this work we relaxed this assumption in order to obtain the full solution to Equation 5.7. Therefore the general solution (without extra assumptions such as diagonal matrices) in both cases is the same:

A^G = arg max_A Ψ(hG, A) = A∗ = arg max_A min_h Ψ(h, A)    (5.18)
One of the problems with our solution, however, is that there might exist an A′ such that

Ψ(h∗(A∗), A∗) < Ψ(h∗(A′), A′)

However, even in this case it is easy to show that h∗ is still better than hG, since the performance achieved by hG is not as good as the performance obtained with h∗, even over the worst choice of A:

max_A Ψ(h∗(A), A) < max_A Ψ(hG, A)
The main problem is that, in order to obtain this optimal performance guarantee, the embedding distribution h∗ needs to know the final strategy of the adversary A. In particular we are interested in two questions. With regard to the previous work in [46], we would like to know whether the following is true:

∀A  Ψ(h∗(A∗), A) ≤ Ψ(hG, A∗)    (5.19)

that is, whether, once we have fixed the operating point A∗ (the optimal adversary according to Equation 5.7), there is no other adversarial strategy that will make h∗ perform worse than the previous work.
The second question is in fact more general, and relates to the original intention of minimizing the worst possible error created by the adversary:

min_h max_A Ψ(h, A)    (5.20)

yet we have only solved the problem in a way where h is dependent on A:

(h∗, A∗) = arg max_A min_h Ψ(h, A)

A way to show that (h∗, A∗) satisfies Equation 5.20 (and therefore also satisfies Equation 5.19) is to show that the pair (h∗, A∗) forms a saddle point equilibrium:

∀(h, A)  Ψ(h∗, A) ≤ Ψ(h∗, A∗) ≤ Ψ(h, A∗)    (5.21)

To be more specific, let E denote again the triple (h, Rp, Φ). Then we are interested in showing that

∀(E, A)  Ψ(E∗, A) ≤ Ψ(E∗, A∗) ≤ Ψ(E, A∗)    (5.22)

where (E∗, A∗) = (h∗, R∗p, Φ∗, R∗e, Γ∗) is the solution to Equation (5.16).
It is easy to show how the right hand side inequality of Equation (5.22) is satisfied:

Ψ(E, A∗) = E_p[ Q( √( (1/2) p^t Φ^t Γ∗^t (Γ∗ΦRsΦ^tΓ∗^t + R∗e)^{−1} Γ∗Φ p ) ) ]
    ≥ Q( √( (1/2) tr{ RpΦ^tΓ∗^t (Γ∗ΦRsΦ^tΓ∗^t + R∗e)^{−1} Γ∗Φ } ) )    by Jensen's inequality
    ≥ Q( √( (1/2) tr{ R∗pΦ∗^tΓ∗^t (Γ∗Φ∗RsΦ∗^tΓ∗^t + R∗e)^{−1} Γ∗Φ∗ } ) )    by Equation (5.7)
    = Ψ(E∗, A∗)    by the definition of h∗
The left hand side of Equation (5.22) is more difficult to satisfy. A particular case where it is satisfied is the scalar case, i.e., when N = 1. In this case we have the following:

Ψ(E∗, A∗) = Q( √( R∗p (Φ∗Γ∗)² / ( 2((Γ∗Φ∗)² Rs + R∗e) ) ) )
    ≥ Q( √( R∗p (Φ∗Γ)² / ( 2((ΓΦ∗)² Rs + Re) ) ) )    by Equation (5.7)
    = E_p[ Q( √( p² (Φ∗Γ)² / ( 2((ΓΦ∗)² Rs + Re) ) ) ) ]    since p is independent of A
    = Ψ(E∗, A)

The independence of p from A in the scalar case comes from the fact that Equation (5.6) yields in this case p = √(Rp) with probability 1/2 and p = −√(Rp) with probability 1/2. With this distribution, Equation (5.4) is always satisfied (and the adversary has no control over it).
This result can in fact be seen as a counterexample to the optimality of spread spectrum watermarking against Gaussian attacks: if the attack is Gaussian, then the embedding distribution should not be spread spectrum, or conversely, if the embedding distribution is spread spectrum, then the attack should not be Gaussian.
In future work we plan to investigate under which conditions or assumptions the left inequality in Equation (5.22) is satisfied. An easier goal we also plan to investigate is whether Equation (5.19) is true, since this would also show an improvement over previous work. We also plan to extend the work to other evaluation metrics, such as the case when one of the errors is more important than the other. In that case we can set an arbitrary level of false alarms and find the parameters of the embedder that maximize the probability of detection while the adversary tries to minimize it.
IV Towards a Universal Adversary Model
In the previous section we attempted to relax the usual assumptions regarding the optimal embedding distributions (spread spectrum or QIM) and to find new embedding distributions that achieve better performance against attacks. The problem with the formulation in the previous section is that the optimal embedding distribution h∗ depends on the strict assumptions made to keep the problem tractable. In particular, it assumes the correctness of the source signal model f(s) and, even more troubling, it assumes the adversary will perform only a Gaussian and scaling attack. This limitation on the capabilities of the adversary is a big problem in practice, since any data hiding algorithm that is shown to perform well under a parametric adversary (e.g., Gaussian attacks) can give a false sense of security: in reality the adversary will not be confined to Gaussian attacks or to any other parametric distribution prescribed by the analysis.
In this section we focus on another version of the embedding-detection problem: non-blind watermarking. This formulation is very important for several problems such as fingerprinting and traitor tracing. In non-blind watermarking both the embedder and the detector have access to the source signal s, and therefore the problem can again be represented as:

Ek(m, s): x ← f(x|k, m, s)   --x-->   A: y ← f(y|x)   --y-->   Dk(s): ln( f(y|s, k, m=1) / f(y|s, k, m=0) ) ≷_{H0}^{H1} τ

where A should satisfy some distortion constraint, for example, for quadratic distortion constraints: A ∈ FD = { A : E[(x − y)²] ≤ D }.

Our main objective is to model and understand the optimal strategy of a non-parametric adversary. To the best of our knowledge this is the first attempt to model this all-powerful adversary.
In order to gain better insight into the problem we start with the scalar case: N = 1, i.e., s, x and y are in R. Furthermore, we assume E is fixed and parameterized by a distance d between the two embeddings: that is, for all k and s, d = |Ek(1, s) − Ek(0, s)|. Since for every output of the attacker y ← f(y|x) there exists a random realization a with pdf h such that y = x + a, we can replace the adversarial model with an additive random variable a sampled from an attacker distribution h. Finally, the decision function ρ outputs an estimate of m (m = 0 or m = 1) given the output y of the adversary.

E(m, s)   --x-->   A: y = x + a   --y-->   D(s): ρ(y)
Having fixed the embedding algorithm this time, our objective is to find a pair (ρ∗, h∗) such that

∀ρ and ∀h ∈ FD  Ψ(ρ∗, h) ≤ Ψ(ρ∗, h∗) ≤ Ψ(ρ, h∗)    (5.23)

where Ψ(ρ, h) is again the probability of error:

Pr[ m ← {0, 1}; x = E(m, s); a ← h(a) : ρ(x + a) ≠ m ]

and where FD simplifies to { h : E[a²] ≤ D, ∫ h(a) da = 1, h ≥ 0 }.

Notice that this problem is significantly more difficult than the problem of finding the optimal parameters of an adversary, since in this case we need to perform the optimization over infinite dimensional spaces, because h is a continuous function.
A. On the Necessity of Randomized Decisions for Active Distortion Constraints
Let xi = E(i, s). Then we know that, to satisfy Ψ(ρ∗, h∗) ≤ Ψ(ρ, h∗), ρ∗ should select the larger of the likelihood of y given x1, f(y|x1), and the likelihood of y given x0, f(y|x0), and should randomly flip a coin when both likelihoods are equal (such a decision is called Bayes optimal). Therefore, in order to find the saddle point equilibrium, we are going to assume a given ρ, maximize over h (subject to the distortion constraints), and then check whether ρ is indeed Bayes optimal.

Before we obtain a saddle point solution, we think it is informative to show our attempt to solve the problem with a non-randomized decision function ρ. A typical non-randomized decision function divides the decision space into two sets: R and its complement Rc. If y ∈ R then ρ(y) = 1, otherwise ρ(y) = 0.
Assume without loss of generality that d = x0 − x1 > 0. The probability of error can then be expressed as:

Ψ(ρ, h) = Pr[ρ = 1|M = 0] Pr[M = 0] + Pr[ρ = 0|M = 1] Pr[M = 1]
    = (1/2) ( ∫_R h(y − x0) dy + ∫_{Rc} h(y − x1) dy )
    = (1/2) ( ∫_R h(a − d) da + ∫_{Rc} h(a) da )
    = (1/2) ( ∫_{R−d} h(a) da + ∫_{Rc} h(a) da )
    = (1/2) ∫ ( 1_{R−d}(a) + 1_{Rc}(a) ) h(a) da

where 1_R is the indicator function of the set R (i.e., 1_R(a) = 1 if a ∈ R and 1_R(a) = 0 otherwise) and where R − d is defined as the set {a − d : a ∈ R}.
The objective function is therefore:

min_R max_{h ∈ FD} (1/2) ∫ ( 1_{R−d}(a) + 1_{Rc}(a) ) h(a) da

subject to:

E[a²] ≤ D

The Lagrangian is

L(λ, h) = ∫ [ (1/2) ( 1_{R−d}(a) + 1_{Rc}(a) ) − λa² ] h(a) da + λD

where Ψ(ρ, h∗) = L∗(λ∗) = L(λ∗, h∗).
By looking at the form of the Lagrangian in Figure 5.1 (for λ > 0), it is clear that a necessary condition for optimality is that (R − d) ∩ Rc = ∅, since otherwise the adversary would put all the mass of h in this intersection. Under this condition we assume R − d = (−∞, −d/2].

Now notice that for D ≥ (d/2)², λ∗ = 0, and therefore there will always be an h∗ such that Ψ(ρ, h∗) = 1/2. The interpretation of this case is that the distortion constraints are not strict enough, and the adversary can create error rates up to 0.5. It is impossible to find a saddle point solution in this case, since any such ρ will not be Bayes optimal, and if it is Bayes optimal then h∗ is not a maximizing distribution. However, by assuming R − d = (−∞, −d/2] we guarantee that the probability of error is not greater than 0.5. Having such a high false alarm rate is unfeasible in practice, and thus the embedding scheme should be designed with a d such that D < (d/2)².

Figure 5.1: Let us define a = y − x1 for the detector. We can now see the Lagrangian function over a, where the adversary tries to distribute the density h such that L(λ, h) is maximized while satisfying the constraints (i.e., minimizing L(λ, h) over λ).
Assuming λ > 0, it is now clear from Figure 5.1 that an optimal solution is of the form

h∗(a) = p0 δ(−d/2) + p1 δ(0) + p2 δ(d/2)

where p0 + p1 + p2 = 1. For D < (d/2)², λ∗ = 2/d² and thus Ψ(ρ, h∗) = 2D/d², where

h∗(a) = (2D/d²) δ(−d/2) + ((d² − 4D)/d²) δ(0) + (2D/d²) δ(d/2)

Notice however that in this case ρ is not optimal, since the solution assumes that at the boundary between R and Rc, ρ decides for both m = 0 and m = 1 at the same time, and thus this is clearly not a Bayes optimal decision rule (in fact it is not a decision rule at all). A naive approach to solving this problem is to randomize the decision at the boundary, i.e., to flip an unbiased coin whenever y is on the boundary between R and Rc. This decision is then Bayes optimal for h∗; however, it can be shown that this h∗ is then not a distribution that maximizes the probability of error, and thus we cannot achieve a saddle point solution. In the next section we introduce a more elaborate randomized decision that achieves a saddle point equilibrium.
B. Achieving Saddle Point Equilibria with Active Constraints

Let ρ(a) = 0 with probability ρ0(a) and ρ(a) = 1 with probability ρ1(a). In order to have a well defined decision function we require ρ0 = 1 − ρ1. Consider now the decision function given in Figure 5.2. The Lagrangian is now:

L(λ, h) = ∫ [ (1/2) ( ρ1(a + d) + ρ0(a) ) − λa² ] h(a) da + λD
Figure 5.2: Piecewise linear decision function, where ρ(0) = ρ0(0) = ρ1(0) = 1/2.

Figure 5.3: The discontinuity problem in the Lagrangian is solved by using piecewise linear continuous decision functions. It is now easy to shape the Lagrangian such that the maxima created form a saddle point equilibrium.
Figure 5.4: New decision function (parameterized by z).
In order to have active distortion constraints, the maxima of L(λ, h) should be in the interval a ∈ [−d, d]. Looking at Figure 5.3 we see that

(1/2) ( ρ1(a + d) + ρ0(a) ) − λa²

achieves its maximum value at a∗ = ±1/(4λd). Therefore

h∗(a) = (1/2) [ δ(−1/(4λd)) + δ(1/(4λd)) ]

Notice however that under this h∗, ρ will be Bayes optimal if and only if 1/(4λd) = d/2, which occurs if and only if λ∗ = 1/(2d²), which in turn occurs if and only if D = d²/4 (since λ∗ = 1/(4d√D) minimizes the Lagrangian).

In summary, for D = d²/4, (ρ∗, h∗) form a saddle point equilibrium when ρ∗ is defined as in Figure 5.3 and

h∗(a) = (1/2) [ δ(−d/2) + δ(d/2) ]

Furthermore, the probability of error is Ψ(ρ∗, h∗) = 1/(16λd²) + λD = 1/8 + D/(2d²) = 1/4.
It is still an open question whether there are saddle point equilibria for the distortion constraint E[a²] ≤ D with D > d²/4; however, for D < d²/4 we can obtain a saddle point by considering the decision function shown in Figure 5.4. For z ∈ (3, ∞), the local maxima of the Lagrangian occur at a = 0 and a = ±z/(4d(z−2)λ), where the second value is obtained as the solution to

d/da [ (1/2) ( (z/((z−2)d)) a − 1/(z−2) ) − λa² ] |_{a = a∗} = 0

i.e., the derivative evaluated at a∗ must be equal to zero. By symmetry of the Lagrangian, another local maximum occurs at a = −a∗.

The value of λ∗ that makes all these local maxima equal (and thus gives the adversary the opportunity of an optimal attack with three delta functions, one at each local maximum) is the solution to:

(1/2) ( (z/((z−2)d)) · (z/(4d(z−2)λ∗)) − 1/(z−2) ) − λ∗ ( z/(4d(z−2)λ∗) )² = ( z² − 8d²(z−2)λ∗ ) / ( 16d²(z−2)²λ∗ ) = 0
which gives λ∗ = z²/(8d²(z−2)). Any other λ would have implied inactive constraints (D too large) or D = 0. See Figure 5.5.

Figure 5.5: With ρ defined as in Figure 5.4, the Lagrangian is able to exhibit three local maxima, one of them at the point a = 0, which implies that the adversary will use this point whenever the distortion constraints are too severe.

The optimal adversary has the form

h∗(a) = p0 δ(−2d/z) + (1 − 2p0) δ(0) + p0 δ(2d/z)
We can see that ρ defined as in Figure 5.4 and h∗ can only form a saddle point if z = 4. Notice also that h∗ is an optimal strategy for the adversary as long as

E[a²] = 2 p0 (d/2)² = D

that is, p0 = 2D/d². Since the maximum value that p0 can attain is 1/2, this implies that this is the optimal strategy for the adversary for any D ≤ d²/4. The probability of error for this saddle point equilibrium is

Ψ(ρ∗, h∗) = L∗(λ∗) = λ∗D = D/d² ≤ 1/4

C. Saddle Point Solutions for D > d²/4
In the previous section we saw how the adversary can create pdfs h that generate points y where the decision function makes large errors in classification; therefore the idea of using an "indecision" region can help the decision function in regions where deciding between the two hypotheses is prone to errors. In this framework we allow ρ(y) to output ¬ when y does not carry enough information to decide between m = 0 and m = 1.

Let C(i, j) represent the cost of deciding for i (ρ = i) when the true hypothesis was m = j. By using the probability of error as the evaluation metric, we have so far been minimizing the expected cost E[C(ρ, m)] with C(0, 0) = C(1, 1) = 0 and C(1, 0) = C(0, 1) = 1. We now extend this evaluation metric by incorporating the cost of not making a decision: C(¬, 0) = C(¬, 1) = α.
Figure 5.6: Decision function with an indecision region (breakpoints at d/2, β and d); ρ¬ represents a decision stating that we do not possess enough information to make a reliable selection between the two hypotheses.
If we let α < 1/2, it is easy to show (assuming Pr[m = 0] = Pr[m = 1] = 0.5) that a decision function that minimizes Ψ(ρ, h) = E[C(ρ, m)] has the following form:

ρ∗(y) =  1   if f(y|x1)/f(y|x0) > (1 − α)/α
         ¬   if (1 − α)/α > f(y|x1)/f(y|x0) > α/(1 − α)    (5.24)
         0   if f(y|x1)/f(y|x0) < α/(1 − α)

and whenever f(y|x1)/f(y|x0) equals either (1 − α)/α or α/(1 − α), the decision is randomized between 1 and ¬, or between ¬ and 0, respectively.
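The three-way rule of Eq. (5.24) is easy to state in code; the following sketch (ours) omits the randomization at the two boundary values for brevity.

    def decide_with_indecision(lik1, lik0, alpha):
        """Three-way rule of Eq. (5.24): return 1, 0, or None ("no decision")
        depending on how the likelihood ratio compares with (1-alpha)/alpha."""
        ratio = lik1 / lik0
        hi = (1 - alpha) / alpha
        lo = alpha / (1 - alpha)
        if ratio > hi:
            return 1
        if ratio < lo:
            return 0
        return None      # the indecision output, written as ¬ in the text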
Under our non-blind watermarking model the expected cost becomes:

Ψ(ρ, h) = (1/2) ∫ { [ρ1(x + d) + α ρ¬(x + d)] + [ρ0(x) + α ρ¬(x)] } h(x) dx

where ρi is the probability of deciding for i and where ρ0(x) + ρ¬(x) + ρ1(x) = 1.

Given ρ, the Lagrangian for the optimization problem of the adversary is:

L(λ, h) = ∫ [ (1/2) ( ρ1(a + d) + α ρ¬(a + d) + ρ0(a) + α ρ¬(a) ) − λa² ] h(a) da + λD
Consider now the decision function given in Figure 5.6. Following the same reasoning
as in the previous chapter, it is easy to show that for β = 32 d, the maximum values for L(λ, h)
occur for a = 0 and a = ±d. The optimal distribution h has the following form:
p
p
h∗ (a) = δ(−d) + (1 − p)δ(0) + δ(d)
2
2
The decision function ρ is Bayes optimal for this attack distribution only if the likelihood ratio for a = 0 is equal to (1 − α)/α (i.e., if (1 − p)/(p/2) = (1 − α)/α) and if the likelihood ratio for a = ±d is equal to α/(1 − α) (i.e., (p/2)/(1 − p) = α/(1 − α)).
This optimality requirement places a constraint on α: α = 2p − 1. Furthermore, the distortion constraint implies the adversary will select E[a²] = pd² = D. Since we need α < 1/2 in order to make use of the "indecision" region, the above formulation is thus satisfied for D ≤ (3/4)d².
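A short arithmetic check of these relations (illustrative values of d and D; the relation α = 2p − 1 and the three-point form of h∗ are taken from the text above):

```python
# For h*(a) = (p/2) delta(-d) + (1-p) delta(0) + (p/2) delta(d) with p = D / d^2,
# confirm E[a^2] = p d^2 = D and that alpha = 2p - 1 < 1/2 exactly when D < (3/4) d^2.
d = 1.0
for D in (0.50, 0.70, 0.74, 0.76):          # illustrative distortion levels
    p = D / d**2
    attack = {-d: p / 2, 0.0: 1.0 - p, d: p / 2}
    distortion = sum(a * a * w for a, w in attack.items())
    alpha = 2 * p - 1
    print(D, round(distortion, 2), alpha < 0.5, D < 0.75 * d**2)
```

The last two columns agree on every row, matching the condition derived above.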
BIBLIOGRAPHY
[1] C. Kruegel, D. Mutz, W. Robertson, G. Vigna, and R. Kemmerer. Reverse Engineering of
Network Signatures. In Proceedings of the AusCERT Asia Pacific Information Technology
Security Conference, Gold Coast, Australia, May 2005.
[2] W. Lee and S. J. Stolfo. Data mining approaches for intrusion detection. In Proceedings
of the 7th USENIX Security Symposium, 1998.
[3] W. Lee, S. J. Stolfo, and K. Mok. A data mining framework for building intrusion detection models. In Proceedings of the IEEE Symposium on Security & Privacy, pages
120–132, Oakland, CA, USA, 1999.
[4] C. Warrender, S. Forrest, and B. Pearlmutter. Detecting intrusions using system calls:
Alternative data models. In Proceedings of the 1999 IEEE Symposium on Security &
Privacy, pages 133–145, Oakland, CA, USA, May 1999.
[5] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. A geometric framework for
unsupervised anomaly detection: Detecting intrusions in unlabeled data. In D. Barbara
and S. Jajodia, editors, Data Mining for Security Applications. Kluwer, 2002.
[6] C. Kruegel, D. Mutz, W. Robertson, and F. Valeur. Bayesian event classification for
intrusion detection. In Proceedings of the 19th Annual Computer Security Applications
Conference (ACSAC), pages 14–24, December 2003.
[7] Stefan Axelsson. The base-rate fallacy and its implications for the difficulty of intrusion
detection. In Proceedings of the 6th ACM Conference on Computer and Communications
Security (CCS ’99), pages 1–7, November 1999.
[8] John E. Gaffney and Jacob W. Ulvila. Evaluation of intrusion detectors: A decision theory
approach. In Proceedings of the 2001 IEEE Symposium on Security and Privacy, pages
50–61, Oakland, CA, USA, 2001.
[9] Guofei Gu, Prahlad Fogla, David Dagon, Wenke Lee, and Boris Skoric. Measuring intrusion detection capability: An information-theoretic approach. In Proceedings of ACM
Symposium on InformAtion, Computer and Communications Security (ASIACCS ’06),
Taipei, Taiwan, March 2006.
[10] Giovanni Di Crescenzo, Abhrajit Ghosh, and Rajesh Talpade. Towards a theory of intrusion detection. In ESORICS 2005, 10th European Symposium on Research in Computer
Security, pages 267–286, Milan, Italy, September 12-14 2005. Lecture Notes in Computer
Science 3679 Springer.
[11] Software for empirical evaluation of IDSs. Available at http://www.cshcn.umd.edu/research/IDSanalyzer.
[12] S. Forrest, S. Hofmeyr, A. Somayaji, and T. A. Longstaff. A sense of self for Unix processes. In Proceedings of the 1996 IEEE Symposium on Security & Privacy, pages 120–128, Oakland, CA, USA, 1996. IEEE Computer Society Press.
[13] M. Schonlau, W. DuMouchel, W.-H. Ju, A. F. Karr, M. Theus, and Y. Vardi. Computer
intrusion: Detecting masquerades. Technical Report 95, National Institute of Statistical
Sciences, 1999.
[14] J. Jung, V. Paxson, A. Berger, and H. Balakrishnan. Fast portscan detection using sequential hypothesis testing. In IEEE Symposium on Security & Privacy, pages 211–225,
Oakland, CA, USA, 2004.
[15] D. J. Marchette. A statistical method for profiling network traffic. In USENIX Workshop
on Intrusion Detection and Network Monitoring, pages 119–128, 1999.
[16] Salvatore Stolfo, Wei Fan, Wenke Lee, Andreas Prodromidis, and Phil Chan. Cost-based
modeling for fraud and intrusion detection: Results from the JAM project. In Proceedings
of the 2000 DARPA Information Survivability Conference and Exposition, pages 130–144,
January 2000.
[17] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, Inc,
1991.
[18] A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki. The DET curve in
assessment of detection task performance. In Proceedings of the 5th European Conference
on Speech Communication and Technology (Eurospeech’97), pages 1895–1898, Rhodes,
Greece, 1997.
[19] N. Japkowicz, editor. Proceedings of the AAAI’2000 Workshop on Learning from Imbalanced Data Sets, 2000.
[20] N. V. Chawla, N. Japkowicz, and A. Kołcz, editors. Proceedings of the International
Conference for Machine Learning Workshop on Learning from Imbalanced Data Sets,
2003.
[21] Nitesh V. Chawla, Nathalie Japkowicz, and Aleksander Kołcz. Editorial: Special issue on
learning from imbalanced data sets. Sigkdd Explorations Newsletter, 6(1):1–6, June 2004.
[22] Cèsar Ferri, Peter Flach, José Hernández-Orallo, and Nicolas Lachiche, editors. First
Workshop on ROC Analysis in AI, 2004.
[23] Cèsar Ferri, Nicolas Lachiche, Sofus A. Macskassy, and Alain Rakotomamonjy, editors.
Second Workshop on ROC Analysis in ML, 2005.
[24] C. Drummond and R. Holte. Explicitly representing expected cost: An alternative to ROC
representation. In Proceedings of the Sixth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pages 198–207, 2001.
[25] Richard P. Lippmann, David J. Fried, Isaac Graf, Joshua W. Haines, Kristopher R.
Kendall, David McClung, Dan Weber, Seth E. Webster, Dan Wyschogrod, Robert K.
Cunningham, and Marc A. Zissman. Evaluating intrusion detection systems: The 1998
DARPA off-line intrusion detection evaluation. In DARPA Information Survivability Conference and Exposition, volume 2, pages 12–26, January 2000.
[26] Foster Provost and Tom Fawcett. Robust classification for imprecise environments. Machine Learning, 42(3):203–231, March 2001.
[27] H. V. Poor. An Introduction to Signal Detection and Estimation. Springer-Verlag, 2nd
edition, 1988.
[28] H. L. Van Trees. Detection, Estimation and Modulation Theory, Part I. Wiley, New York,
1968.
[29] Alvaro A. Cárdenas, John S. Baras, and Karl Seamon. A framework for the evaluation
of intrusion detection systems. In Proceedings of the 2006 IEEE Symposium on Security
and Privacy, Oakland, California, May 2006.
[30] D. Wagner and P. Soto. Mimicry attacks on host-based intrusion detection systems. In
Proceedings of the 9th ACM Conference on Computer and Communications Security
(CCS), pages 255–264, Washington D.C., USA, 2002.
[31] C. Kruegel, E. Kirda, D. Mutz, W. Robertson, and G. Vigna. Automating mimicry attacks
using static binary analysis. In Proceedings of the 2005 USENIX Security Symposium,
pages 161–176, Baltimore, MD, August 2005.
[32] G. Vigna, W. Robertson, and D. Balzarotti. Testing Network-based Intrusion Detection
Signatures Using Mutant Exploits. In Proceedings of the ACM Conference on Computer
and Communication Security (ACM CCS), pages 21–30, Washington, DC, October 2004.
[33] Sergio Marti, T. J. Giuli, Kevin Lai, and Mary Baker. Mitigating routing misbehavior in
mobile ad hoc networks. In Proceedings of the 6th annual international conference on
Mobile computing and networking, pages 255–265. ACM Press, 2000.
[34] Y. Zhang, W. Lee, and Y. Huang. Intrusion detection techniques for mobile wireless
networks. ACM/Kluwer Mobile Networks and Applications (MONET), 9(5):545–556,
September 2003.
[35] G. Vigna, S. Gwalani, K. Srinivasan, E. Belding-Royer, and R. Kemmerer. An Intrusion
Detection Tool for AODV-based Ad Hoc Wireless Networks. In Proceedings of the Annual Computer Security Applications Conference (ACSAC), pages 16–27, Tucson, AZ,
December 2004.
[36] Sonja Buchegger and Jean-Yves Le Boudec. Nodes bearing grudges: Towards routing
security, fairness, and robustness in mobile ad hoc networks. In Proceedings of Tenth
Euromicro PDP (Parallel, Distributed and Network-based Processing), pages 403 – 410,
Gran Canaria, January 2002.
[37] The MIT Lincoln Labs evaluation data set, DARPA intrusion detection evaluation. Available at http://www.ll.mit.edu/IST/ideval/index.html.
[38] J. McHugh. Testing intrusion detection systems: A critique of the 1998 and 1999 DARPA
intrusion detection system evaluations as performed by the Lincoln Laboratory. ACM
Transactions on Information and System Security (TISSEC), 3(4):262–294, November
2000.
[39] R. E. Blahut. Principles and Practice of Information Theory. Addison-Wesley, 1987.
[40] Maxim Raya, Jean-Pierre Hubaux, and Imad Aad. DOMINO: A system to detect greedy
behavior in IEEE 802.11 hotspots. In Proceedings of the Second International Conference
on Mobile Systems, Applications and Services (MobiSys 2004), Boston, Massachusetts,
June 2004.
[41] Svetlana Radosavac, John S. Baras, and Iordanis Koutsopoulos. A framework for MAC
protocol misbehavior detection in wireless networks. In Proceedings of the 4th ACM
workshop on Wireless Security (WiSe 05), pages 33–42, 2005.
[42] S. Radosavac, G. V. Moustakides, J. S. Baras, and I. Koutsopoulos. An analytic framework
for modeling and detecting access layer misbehavior in wireless networks. Submitted to
ACM Transactions on Information and System Security (TISSEC), 2006.
[43] Pradeep Kyasanur and Nitin Vaidya. Detection and handling of MAC layer misbehavior
in wireless networks. In Proceedings of the International Conference on Dependable
Systems and Networks, June 2003.
[44] Alvaro A. Cárdenas, Svetlana Radosavac, and John S. Baras. Detection and prevention of
MAC layer misbehavior in ad hoc networks. In Proceedings of the 2nd ACM Workshop on
Security of Ad Hoc and Sensor Networks (SASN '04), 2004.
[45] B. E. Brodsky and B. S. Darkhovsky. Nonparametric Methods in Change-Point Problems,
volume 243 of Mathematics and Its Applications. Kluwer Academic Publishers, 1993.
[46] Pierre Moulin and Aleksandar Ivanović. The zero-rate spread-spectrum watermarking
game. IEEE Transactions on Signal Processing, 51(4):1098–1117, April 2003.