ABSTRACT

Title of dissertation: DESIGN AND EVALUATION OF DECISION MAKING ALGORITHMS FOR INFORMATION SECURITY

Alvaro A. Cárdenas, Doctor of Philosophy, 2006

Dissertation directed by: Professor John S. Baras, Department of Electrical and Computer Engineering

The evaluation and learning of classifiers is of particular importance in several computer security applications, such as intrusion detection systems (IDSs), spam filters, and the watermarking of documents for fingerprinting or traitor tracing. There are, however, relevant considerations that are sometimes ignored by researchers who apply machine learning techniques to security-related problems. In this work we identify and address two problems that are prevalent in security-related applications. The first is the usually large class imbalance between normal events and attack events. We address this problem with a unifying view of previously proposed metrics and with the introduction of Bayesian Receiver Operating Characteristic (B-ROC) curves. The second is the fact that the classifier or learning rule will be deployed in an adversarial environment. This implies that good performance on average might not be an appropriate performance measure; instead, we look for good performance under the worst type of adversarial attack. We develop a general methodology that we apply to the design and evaluation of IDSs and watermarking applications.

DESIGN AND EVALUATION OF DECISION MAKING ALGORITHMS FOR INFORMATION SECURITY

by Alvaro A. Cárdenas

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2006

Advisory Committee:
Professor John S. Baras, Chair/Advisor
Professor Virgil D. Gligor
Professor Bruce Jacob
Professor Carlos Berenstein
Professor Nuno Martins

© Copyright by Alvaro A.
Cárdenas 2006

TABLE OF CONTENTS

List of Figures iv

1 Introduction 1

2 Performance Evaluation Under the Class Imbalance Problem 3
   I Overview 3
   II Introduction 3
   III Notation and Definitions 4
   IV Evaluation Metrics 5
   V Graphical Analysis 9
   VI Conclusions 18

3 Secure Decision Making: Defining the Evaluation Metrics and the Adversary 20
   I Overview 20
   II A Set of Design and Evaluation Guidelines 20
   III A Black Box Adversary Model 22
   IV A White Box Adversary Model 32
   V Conclusions 34

4 Performance Comparison of MAC layer Misbehavior Schemes 35
   I Overview 35
   II Introduction 35
   III Problem Description and Assumptions 37
   IV Sequential Probability Ratio Test (SPRT) 38
   V Performance analysis of DOMINO 42
   VI Theoretical Comparison 46
   VII Nonparametric CUSUM statistic 46
   VIII Experimental Results 48
   IX Conclusions and future work 49

5 Secure Data Hiding Algorithms 52
   I Overview 52
   II General Model 52
   III Additive Watermarking and Gaussian Attacks 54
      B.1 Summary 57
      C.1 Minimization with respect to Re 58
      C.2 Minimization with respect to Γ 58
   IV Towards a Universal Adversary Model 62

Bibliography 70

LIST OF FIGURES

2.1 Isoline projections of CID onto the ROC curve. The optimal CID value is CID = 0.4565. The associated costs are C(0, 0) = 3 × 10^−5, C(0, 1) = 0.2156, C(1, 0) = 15.5255 and C(1, 1) = 2.8487. The optimal operating point is PFA = 2.76 × 10^−4 and PD = 0.6749. 10
2.2 As the cost ratio C increases, the slope of the optimal isoline decreases. 11
2.3 As the base-rate p decreases, the slope of the optimal isoline increases. 12
2.4 The PPV isolines in the ROC space are straight lines that depend only on θ. The PPV values of interest range from 1 to p. 13
2.5 The NPV isolines in the ROC space are straight lines that depend only on φ. The NPV values of interest range from 1 to 1 − p. 14
2.6 PPV and NPV isolines for the ROC of an IDS with p = 6.52 × 10^−5. 15
2.7 B-ROC for the ROC of Figure 2.6. 15
2.8 Mapping of ROC to B-ROC. 16
2.9 An empirical ROC (ROC2) and its convex hull (ROC1). 17
2.10 The B-ROC of the concave ROC is easier to interpret. 17
2.11 Comparison of two classifiers. 18
2.12 B-ROC comparison for the p of interest. 19
3.1 Probability of error for hi vs. p. 28
3.2 The optimal operating point. 29
3.3 Robust expected cost evaluation. 30
3.4 Robust B-ROC evaluation. 32
4.1 Form of the least favorable pmf p1* for two different values of g. When g approaches 1, p1* approaches p0. As g decreases, more mass of p1* is concentrated towards the smaller backoff values. 41
4.2 Tradeoff curve between the expected number of samples for a false alarm E[TFA] and the expected number of samples for detection E[TD]. For fixed a and b, as g increases (low intensity of the attack) the time to detection or to false alarms increases exponentially. 42
4.3 For K = 3, the state of the variable cheat count can be represented as a Markov chain with five states. When cheat count reaches the final state (4 in this case) DOMINO raises an alarm. 43
4.4 Exact and approximate values of p as a function of m. 45
4.5 DOMINO performance for K = 3; m ranges from 1 to 60. γ is shown explicitly. As γ tends to either 0 or 1, the performance of DOMINO decreases. The SPRT outperforms DOMINO regardless of γ and m. 46
4.6 DOMINO performance for various thresholds K, γ = 0.7 and m in the range from 1 to 60. The performance of DOMINO decreases as m increases. For fixed γ, the SPRT outperforms DOMINO for all values of the parameters K and m. 47
4.7 The best possible performance of DOMINO is when m = 1 and K changes in order to accommodate the desired level of false alarms. The best γ must be chosen independently. 47
4.8 Tradeoff curves for each of the proposed algorithms. DOMINO has parameters γ = 0.9 and m = 1 while K is the variable parameter. The nonparametric CUSUM algorithm has variable parameter c, and the SPRT has b = 0.1 with a as the variable parameter. 50
4.9 Tradeoff curves for the default DOMINO configuration with γ = 0.9, the best performing DOMINO configuration with γ = 0.7, and the SPRT. 50
4.10 Tradeoff curves for the best performing DOMINO configuration with γ = 0.7, the best performing CUSUM configuration with γ = 0.7, and the SPRT. 50
4.11 Comparison between theoretical and experimental results: theoretical analysis with linear x-axis closely resembles the experimental results. 51
5.1 We define a = y − x1 for the detector and plot the Lagrangian function over a; the adversary tries to distribute the density h such that L(λ, h) is maximized while satisfying the constraints (i.e., minimizing L(λ, h) over λ). 65
5.2 Piecewise linear decision function, where ρ(0) = ρ0(0) = ρ1(0). 66
5.3 The discontinuity problem in the Lagrangian is solved by using piecewise linear continuous decision functions. It is now easy to shape the Lagrangian so that the maxima created form a saddle point equilibrium. 66
5.4 New decision function. 67
5.5 With ρ defined in Figure 5.4 the Lagrangian exhibits three local maxima, one of them at the point a = 0, which implies that the adversary will use this point whenever the distortion constraints are too severe. 68
5.6 ρ¬ represents a decision stating that we do not possess enough information to make a reliable selection between the two hypotheses. 69

Chapter 1

Introduction

"Look, if I can raise the money fast, can I set up my own lab?"
-The Smithsonian Institution, Gore Vidal

The algorithms used to ensure several information security goals, such as authentication, integrity and secrecy, have often been designed and analyzed with the help of formal mathematical models. One of the most successful examples is the use of theoretical cryptography for encryption, integrity and authentication: by assuming that some basic primitives hold (such as the existence of one-way functions), certain cryptographic algorithms can be formally proven secure. Information security models have, however, theoretical limits, since it cannot always be proven whether an algorithm satisfies certain security conditions. For example, as formalized by Harrison, Ruzzo and Ullman, safety in the access matrix model is undecidable, and Rice's theorem implies that static analysis problems are also undecidable. Because of similar results, such as the undecidability of detecting computer viruses, there is reason to believe that several intrusion detection problems are also undecidable. Undecidability is not the only obstacle to formally characterizing the security properties of an algorithm. Not only are some problems intractable or undecidable, but there are also inherent uncertainties in several security-related problems, such as biometrics, data hiding (watermarking), fraud detection and spam filtering, that are impossible to factor out. All algorithms trying to solve these problems will make a non-negligible number of decision errors.
These errors occur because of unavoidable approximations, and the problem can consequently be formulated as a tradeoff between the costs of operation (e.g., the resources necessary for operation, such as the number of false alarms) and the correctness of the output (e.g., the security level achieved, or the probability of detection). It is, however, very difficult in practice to assess both the real costs of these security solutions and their actual security guarantees. Most of the research therefore relies on ad hoc solutions and heuristics that cannot be shared between the security fields trying to address these hard problems. In this work we address the problem of providing the framework necessary to reason about the security and the costs of an algorithm for problems where the decision between two hypotheses cannot be made without errors. Our framework is composed of two main parts.

Evaluation Metrics: We present in a unified framework several metrics that have been previously proposed in the literature. We give intuitive interpretations of them and provide new metrics that address two of the main problems for decision algorithms: first, the large class imbalance between the two hypotheses, and second, the uncertainty of several parameters, including the costs and the severity of the class imbalance.

Security Model: In order to reason formally about the security level of a decision algorithm, we need to introduce a formal model of an adversary and of the system being evaluated. We therefore clearly define the feasible design space and the properties our algorithm needs to satisfy, as captured by an evaluation metric. Then we model the considered adversary class by clearly defining the information available to the adversary, the capabilities of the adversary, and its goal. Finally, since the easiest way to break the security of a system is to step outside the box, i.e., to break the rules under which the algorithm was evaluated, we clearly identify the assumptions made in our model.
These assumptions then play an important part in the evaluation of the algorithm. Our tools for solving and analyzing these models are based on robust detection theory and game theory, and the aim is always the same: to minimize the advantage of the adversary (defined according to the metric used) for any possible adversary in a given adversary class. We apply our model to different problems in intrusion detection, selfish packet forwarding in ad hoc networks, selfish MAC layer misbehavior in wireless access points, and watermarking algorithms. Our results show that formally modeling these problems and obtaining the least favorable attack distribution has merit: in the best cases our solutions outperform previously proposed heuristic solutions (analytically and in simulations), and in the worst cases they can be seen as pessimistic decisions based on risk aversion. The layout of this dissertation is as follows. In chapter 2 we view in a unified framework the traditional metrics used to evaluate the performance of intrusion detection systems and introduce a new evaluation curve called the B-ROC curve. In chapter 3 we introduce the notion of security of a decision algorithm and define the adversarial models that are used in the subsequent chapters to design and evaluate decision making algorithms. In chapter 4 we compare the performance of two classification algorithms used for detecting MAC layer misbehavior in wireless networks. This chapter is a validation of our approach, since it shows how the design of classification algorithms using robust detection theory and formal adversarial modeling can outperform, both theoretically and empirically, previously proposed algorithms based on heuristics. Finally, in chapter 5 we apply our framework to the design and evaluation of data hiding algorithms.
Chapter 2

Performance Evaluation Under the Class Imbalance Problem

"For there seem to be many empty alarms in war."
-Nicomachean Ethics, Aristotle

I Overview

Classification accuracy in intrusion detection systems (IDSs) deals with such fundamental problems as how to compare two or more IDSs, how to evaluate the performance of an IDS, and how to determine the best configuration of an IDS. In an effort to analyze and solve these related problems, evaluation metrics such as the Bayesian detection rate, the expected cost, the sensitivity and the intrusion detection capability have been introduced. In this chapter, we study the advantages and disadvantages of each of these performance metrics and analyze them in a unified framework. Additionally, we introduce the Bayesian Receiver Operating Characteristic (B-ROC) curves as a new IDS performance tradeoff that combines in an intuitive way the variables most relevant to the intrusion detection evaluation problem.

II Introduction

Consider a company that, in an effort to improve its information technology security infrastructure, wants to purchase either intrusion detector 1 (IDS1) or intrusion detector 2 (IDS2). Furthermore, suppose that the algorithms used by each IDS are kept private, and therefore the only way to determine the performance of each IDS (unless some reverse engineering is done [1]) is through empirical tests determining how many intrusions are detected by each scheme while providing an acceptable level of false alarms. Suppose these tests show with high confidence that IDS1 detects one-tenth more attacks than IDS2, but at the cost of producing one hundred times more false alarms. The company needs to decide, based on these estimates, which IDS will provide the best return on investment for its needs and its operational environment. This general problem is more concisely stated as the intrusion detection evaluation problem, and its solution usually depends on several factors.
The most basic of these factors are the false alarm rate and the detection rate, and their tradeoff can be intuitively analyzed with the help of the receiver operating characteristic (ROC) curve [2, 3, 4, 5, 6]. However, as pointed out in [7, 8, 9], the information provided by the detection rate and the false alarm rate alone might not be enough to provide a good evaluation of the performance of an IDS. The evaluation metrics therefore need to consider the environment the IDS is going to operate in, such as the maintenance costs and the hostility of the operating environment (the likelihood of an attack). In an effort to provide such an evaluation method, several performance metrics, such as the Bayesian detection rate [7], the expected cost [8], the sensitivity [10] and the intrusion detection capability [9], have been proposed in the literature. Yet despite the fact that each of these performance metrics makes its own contribution to the analysis of intrusion detection systems, they are rarely applied in the literature when a new IDS is proposed. It is our belief that the lack of widespread adoption of these metrics stems from two main reasons. First, each metric is proposed in a different framework (e.g., information theory, decision theory, cryptography) and in a seemingly ad hoc manner, so an objective comparison between the metrics is very difficult. Second, the proposed metrics usually assume knowledge of some uncertain parameters, such as the likelihood of an attack or the costs of false alarms and missed detections; moreover, these uncertain parameters can change during the operation of an IDS. The evaluation of an IDS under some (wrongly) estimated parameters might therefore not be of much value.

A. Our Contributions

In this chapter, we introduce a framework for the evaluation of IDSs in order to address the concerns raised in the previous section.
First, we identify the intrusion detection evaluation problem as a multi-criteria optimization problem. This framework lets us compare several of the previously proposed metrics in a unified manner. To this end, we recall that there are in general two ways to solve a multi-criteria optimization problem. The first approach is to combine the criteria to be optimized into a single optimization problem; we show how the intrusion detection capability, the expected cost and the sensitivity metrics all fall into this category. The second approach is to evaluate a tradeoff curve; we show how the Bayesian rates and the ROC curve analysis are examples of this approach. To address the uncertainty of the parameters assumed in each of the metrics, we then present a graphical approach that allows the comparison of the IDS metrics over a wide range of uncertain parameters. For the single optimization problem, we show how the concept of isolines captures in a single value (the slope of the isoline) uncertainties such as the likelihood of an attack and the operational costs of the IDS. For the tradeoff curve approach, we introduce a new tradeoff curve we call the Bayesian ROC (B-ROC). We believe the B-ROC curve combines in a single graph all the relevant (and intuitive) parameters that affect the practical performance of an IDS. In an effort to make this evaluation framework accessible to other researchers, and in order to complement our presentation, we have started the development of a software application, available at [11], that implements the graphical approach for the expected cost and our new B-ROC analysis curves. We hope this tool can grow to become a valuable resource for research in intrusion detection.

III Notation and Definitions

In this section we present the basic notation and definitions used throughout this document. We assume that the input to an intrusion detection system is a feature vector x ∈ X.
The elements of x can include basic attributes, such as the duration of a connection, the protocol type, or the service used. They can also include specific attributes selected with domain knowledge, such as the number of failed logins or whether a superuser command was attempted. Examples of x used in intrusion detection are sequences of system calls [12], sequences of user commands [13], connection attempts to local hosts [14], and the proportion of accesses (in terms of TCP or UDP packets) to a given port of a machine over a fixed period of time [15]. Let I denote whether a given instance x was generated by an intrusion (represented by I = 1 or simply I) or not (denoted as I = 0 or, equivalently, ¬I). Also let A denote whether the output of an IDS is an alarm (denoted by A = 1 or simply A) or not (denoted by A = 0 or, equivalently, ¬A). An IDS can then be defined as an algorithm IDS that receives a continuous data stream of computer event features X = {x[1], x[2], . . .} and classifies each input x[j] as being either a normal event or an attack, i.e., IDS : X → {A, ¬A}. In this chapter we do not address how the IDS is designed; our focus is on how to evaluate the performance of a given IDS. Intrusion detection systems are commonly classified as either misuse detection schemes or anomaly detection schemes. Misuse detection systems use a number of signatures describing known attacks; if an event feature x matches one of the signatures, an alarm is raised. Anomaly detection schemes, on the other hand, rely on profiles or models of the normal operation of the system; deviations from these established models raise alarms. The empirical results of a test of an IDS are usually recorded in terms of how many attacks were detected and how many false alarms were produced, on a data set containing both normal data and attack data.
The percentage of alarms out of the total number of normal events monitored is referred to as the false alarm rate (or the probability of false alarm), whereas the percentage of detected attacks out of the total number of attacks is called the detection rate (or probability of detection) of the IDS. In general we denote the probability of false alarm and the probability of detection (respectively) as:

PFA ≡ Pr[A = 1|I = 0]   and   PD ≡ Pr[A = 1|I = 1]   (2.1)

These empirical results are sometimes shown with the help of the ROC curve: a graph whose x-axis is the false alarm rate and whose y-axis is the detection rate. The graphs of misuse detection schemes generally correspond to a single point denoting the performance of the detector. Anomaly detection schemes, on the other hand, usually have a monitored statistic that is compared to a threshold τ in order to determine whether an alarm should be raised, so their ROC curve is obtained as a parametric plot of the probability of false alarm (PFA) versus the probability of detection (PD), with parameter τ, as in [2, 3, 4, 5, 6].

IV Evaluation Metrics

In this section we first introduce metrics that have been proposed in previous work. We then discuss how these metrics can be used to evaluate an IDS via two general approaches: the expected cost approach and the tradeoff approach. In the expected cost approach, we give intuition for the expected cost metric by relating all the uncertain parameters (such as the probability of an attack) to a single line that allows the IDS operator to easily find the optimal tradeoff. In the second approach, we identify the main parameters that affect the quality of the performance of the IDS. This will later allow us to introduce a new evaluation method that we believe better captures the effect of these parameters than all previously proposed methods.

A.
Background Work

Expected Cost

In this section we present the expected cost of an IDS by combining some of the ideas originally presented in [8] and [16]. The expected cost is used as an evaluation method for IDSs in order to assess the return on the investment an IDS represents in a given IT security infrastructure. In addition to the rates of detection and false alarm, the expected cost of an IDS can also depend on the hostility of the environment, the operational costs of the IDS, and the expected damage done by security breaches. A quantitative measure of the consequences of the output of the IDS for a given event (which can be an intrusion or not) is given by the costs shown in Table 2.1.

                              Detector's report
  State of the system       No Alarm (A = 0)   Alarm (A = 1)
  No Intrusion (I = 0)      C(0, 0)            C(0, 1)
  Intrusion (I = 1)         C(1, 0)            C(1, 1)

Table 2.1: Costs of the IDS reports given the state of the system

Here C(0, 1) corresponds to the cost of responding as though there were an intrusion when there is none, C(1, 0) corresponds to the cost of failing to respond to an intrusion, C(1, 1) is the cost of acting upon an intrusion when it is detected (which can be defined as a negative value and therefore be considered a profit for using the IDS), and C(0, 0) is the cost of not reacting to a non-intrusion (which can also be defined as a profit, or simply left as zero). Adding costs to the different outcomes of the IDS generalizes the usual tradeoff between the probability of false alarm and the probability of detection to a tradeoff between the expected cost for a non-intrusion

R(0, PFA) ≡ C(0, 0)(1 − PFA) + C(0, 1)PFA

and the expected cost for an intrusion

R(1, PD) ≡ C(1, 0)(1 − PD) + C(1, 1)PD

It is clear that if we only penalize classification errors with unit costs (i.e., if C(0, 0) = C(1, 1) = 0 and C(0, 1) = C(1, 0) = 1), the expected cost for non-intrusion and the expected cost for intrusion become, respectively, the false alarm rate PFA and the missed detection rate 1 − PD.
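As an illustration, the two risk functions above (and the overall expected cost they combine into, equation (2.2)) can be sketched in a few lines of Python; the numeric values used below are illustrative placeholders, not values taken from the text.

```python
# Sketch (illustrative, not from the dissertation) of the conditional risks
# and the overall expected cost of equation (2.2).

def risk_no_intrusion(p_fa, c00, c01):
    """R(0, PFA) = C(0,0)(1 - PFA) + C(0,1) PFA"""
    return c00 * (1 - p_fa) + c01 * p_fa

def risk_intrusion(p_d, c10, c11):
    """R(1, PD) = C(1,0)(1 - PD) + C(1,1) PD"""
    return c10 * (1 - p_d) + c11 * p_d

def expected_cost(p_fa, p_d, p, c00, c01, c10, c11):
    """E[C(I,A)] = R(0, PFA)(1 - p) + R(1, PD) p,  equation (2.2)"""
    return (risk_no_intrusion(p_fa, c00, c01) * (1 - p)
            + risk_intrusion(p_d, c10, c11) * p)

# With unit error costs (C(0,0) = C(1,1) = 0, C(0,1) = C(1,0) = 1) the two
# risks reduce to the false alarm rate and the missed detection rate.
print(risk_no_intrusion(0.01, 0, 1), risk_intrusion(0.9, 1, 0))
```

Sweeping an empirical ROC through `expected_cost` and keeping the minimizer is exactly the optimization discussed next.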
The question of how to select the optimal tradeoff between the expected costs is still open. However, if we let the hostility of the environment be quantified by the likelihood of an intrusion p ≡ Pr[I = 1] (also known as the base-rate [7]), we can average the expected non-intrusion and intrusion costs to obtain the overall expected cost of the IDS:

E[C(I, A)] = R(0, PFA)(1 − p) + R(1, PD)p   (2.2)

It should be pointed out that R(·, ·) and E[C(I, A)] are also known (respectively) as the risk and the Bayesian risk in Bayesian decision theory. Given an IDS, the costs from Table 2.1, and the likelihood of an attack p, the problem now is to find the optimal tradeoff between PD and PFA such that E[C(I, A)] is minimized.

The Intrusion Detection Capability

The main motivation for introducing the intrusion detection capability CID as an evaluation metric is that the costs in Table 2.1 are chosen in a subjective way [9]. The authors therefore propose the intrusion detection capability as an objective metric motivated by information theory:

CID = I(I; A) / H(I)   (2.3)

where I and H respectively denote the mutual information and the entropy [17]. The H(I) term in the denominator is a normalizing factor, so that the value of CID always lies in the [0, 1] interval. The intuition behind this metric is that by fine-tuning an IDS based on CID we find the operating point that minimizes the uncertainty of whether an arbitrary input event x was generated by an intrusion or not. The main drawback of CID is that it obscures the intuition to be expected when evaluating the performance of an IDS, because the notion of reducing the uncertainty of an attack is difficult to quantify in terms of practical values of interest such as false alarms or detection rates.
Information theory has been very useful in communications because the entropy and the mutual information can be linked to practical quantities, such as the number of bits saved by compression (source coding) or the number of bits of redundancy required for reliable communications (channel coding). However, it is not clear how these quantities can be related to values of interest for the operator of an IDS.

The Base-Rate Fallacy and Predictive Value Metrics

In [7] Axelsson pointed out that one of the causes of the large number of false alarms that intrusion detectors generate is the enormous imbalance between the amount of normal events and the small amount of intrusion events. Intuitively, the base-rate fallacy states that, because the likelihood of an attack is very small, even if an IDS fires an alarm the likelihood of an actual intrusion remains relatively small. Formally, when we compute the posterior probability of intrusion given that the IDS fired an alarm (a quantity known as the Bayesian detection rate, or the positive predictive value (PPV)), we obtain:

PPV ≡ Pr[I = 1|A = 1]
    = Pr[A = 1|I = 1]Pr[I = 1] / (Pr[A = 1|I = 1]Pr[I = 1] + Pr[A = 1|I = 0]Pr[I = 0])
    = PD p / ((PD − PFA)p + PFA)   (2.4)

Therefore, if the rate of incidence of an attack is very small, for example if on average only 1 out of 10^5 events is an attack (p = 10^−5), and if our detector has a probability of detection of one (PD = 1) and a false alarm rate of 0.01 (PFA = 0.01), then Pr[I = 1|A = 1] = 0.000999. That is, on average, out of 1000 alarms only one corresponds to a real intrusion. It is easy to demonstrate that the PPV is maximized when the false alarm rate of our detector goes to zero, even if the detection rate also tends to zero! Therefore, as mentioned in [7], we require a tradeoff between the PPV and the negative predictive value (NPV):

NPV ≡ Pr[I = 0|A = 0] = (1 − p)(1 − PFA) / (p(1 − PD) + (1 − p)(1 − PFA))   (2.5)

B.
Discussion

The concept of finding the optimal tradeoff between the metrics used to evaluate an IDS is an instance of the more general problem of multi-criteria optimization, in which we want to maximize (or minimize) two quantities that are related by a tradeoff. This can be done via two approaches. The first approach is to find a suitable way of combining the two metrics into a single objective function (such as the expected cost) to optimize. The second approach is to compare the two metrics directly via a tradeoff curve. We therefore classify the metrics defined above into two general approaches that are explored in the rest of this chapter: the minimization of the expected cost and the tradeoff approach. We consider these two approaches complementary tools for the analysis of IDSs, each providing its own interpretation of the results.

Minimization of the Expected Cost

Let ROC denote the set of allowed (PFA, PD) pairs for an IDS. The expected cost approach includes any evaluation metric that can be expressed as

r* = min_{(PFA, PD) ∈ ROC} E[C(I, A)]   (2.6)

where r* is the expected cost of the IDS. Given IDS1 with expected cost r1* and IDS2 with expected cost r2*, we say IDS1 is better than IDS2 for our operational environment if r1* < r2*. We now show how CID and the tradeoff between the PPV and NPV values can be expressed as expected cost problems. For the CID case, note that the entropy of an intrusion H(I) is independent of our optimization parameters (PFA, PD); therefore we have:

(PFA*, PD*) = argmax_{(PFA, PD) ∈ ROC} I(I; A)/H(I)
            = argmax_{(PFA, PD) ∈ ROC} I(I; A)
            = argmin_{(PFA, PD) ∈ ROC} H(I|A)
            = argmin_{(PFA, PD) ∈ ROC} E[−log Pr[I|A]]

It is now clear that CID is an instance of the expected cost problem, with costs given by C(i, j) = −log Pr[I = i|A = j].
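The equivalence just derived can be checked numerically. The sketch below (illustrative values of our own choosing, not from the text) builds the cost table C(i, j) = −log Pr[I = i|A = j] from an arbitrary operating point and verifies that the resulting expected cost equals the conditional entropy H(I|A), so minimizing it indeed maximizes CID for fixed H(I).

```python
import math

def joint(p, p_fa, p_d):
    # Pr[I = i, A = j] from the base-rate p and the operating point (PFA, PD).
    return {(0, 0): (1 - p) * (1 - p_fa), (0, 1): (1 - p) * p_fa,
            (1, 0): p * (1 - p_d), (1, 1): p * p_d}

def log_costs(pij):
    # The cost table C(i, j) = -log Pr[I = i | A = j].
    pa = {j: pij[(0, j)] + pij[(1, j)] for j in (0, 1)}
    return {(i, j): -math.log(pij[(i, j)] / pa[j]) for (i, j) in pij}

def avg_cost(pij, costs):
    # E[C(I, A)] under the joint distribution.
    return sum(pij[ij] * costs[ij] for ij in pij)

def cond_entropy(pij):
    # H(I|A) = -sum_{i,j} Pr[I=i, A=j] log Pr[I=i | A=j]
    pa = {j: pij[(0, j)] + pij[(1, j)] for j in (0, 1)}
    return -sum(q * math.log(q / pa[j]) for (i, j), q in pij.items() if q > 0)

pij = joint(p=1e-3, p_fa=0.01, p_d=0.8)
assert abs(avg_cost(pij, log_costs(pij)) - cond_entropy(pij)) < 1e-12
```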
By finding the costs associated with C_ID we make the C_ID metric more intuitively appealing, since any optimal point that we find for the IDS will have an explanation in terms of cost functions (as opposed to the vague notion of diminishing the uncertainty of the intrusions). Finally, in order to combine the PPV and the NPV into an average cost metric, recall that we want to maximize both Pr[I = 1|A = 1] and Pr[I = 0|A = 0]. Our average gain for each operating point of the IDS is therefore ω1 Pr[I = 1|A = 1] Pr[A = 1] + ω2 Pr[I = 0|A = 0] Pr[A = 0], where ω1 (ω2) is a weight representing a preference towards maximizing PPV (NPV). Maximizing this gain is equivalent to the minimization of

-ω1 Pr[I = 1|A = 1] Pr[A = 1] - ω2 Pr[I = 0|A = 0] Pr[A = 0]   (2.7)

Comparing equation (2.7) with equation (2.2), we identify the costs as C(1, 1) = -ω1, C(0, 0) = -ω2 and C(0, 1) = C(1, 0) = 0. Relating the predictive value metrics (PPV and NPV) to the expected cost problem will allow us to examine the effects of the base-rate fallacy on the expected cost of the IDS in later sections.

IDS classification tradeoffs

An alternative approach to evaluating intrusion detection systems is to compare the tradeoffs in the operation of the system directly via a tradeoff curve, such as ROC or DET curves [18] (a reinterpretation of the ROC curve where the y-axis is 1 - PD rather than PD). As mentioned in [7], another tradeoff to consider is between the PPV and the NPV values. However, we do not know of any tradeoff curve that combines these two values to aid the operator in choosing an operating point. We point out in section B that a tradeoff between PFA and PD (as in ROC curves), as well as a tradeoff between PPV and NPV, can be misleading when p is very small, since very small changes in the PFA and NPV values at our points of interest have drastic effects on the PD and PPV values.
Therefore, in the next section we introduce the B-ROC as a new tradeoff curve between PD and PPV.

The class imbalance problem

A way to relate our approach to traditional techniques used in machine learning is to identify the base-rate fallacy as another instance of the class imbalance problem. The term class imbalance refers to the case when, in a classification task, there are many more instances of some classes than of others. The problem is that under this setting classifiers generally perform poorly, because they tend to concentrate on the large classes and disregard the ones with few examples. Given that this problem is prevalent in a wide range of practical classification tasks, there has been recent interest in designing and evaluating classifiers faced with imbalanced data sets [19, 20, 21]. A number of approaches to these issues have been proposed in the literature. Ideas such as data sampling methods, one-class learning (i.e. recognition-based learning) and feature selection algorithms appear to be the most active research directions for learning classifiers. On the other hand, the question of how to evaluate binary classifiers in the presence of class imbalance appears to be dominated by the use of ROC curves [22, 23] (and, to a lesser extent, by error curves [24]).

V Graphical Analysis

We now introduce a graphical framework that allows the comparison of different metrics in the analysis and evaluation of IDSs. This graphical framework can be used to adaptively change the parameters of the IDS based on its actual performance during operation. The framework also allows for the comparison of different IDSs under different operating environments. Throughout this section we use one of the ROC curves analyzed in [8] and [9], namely the ROC curve describing the performance of the COLUMBIA team intrusion detector in the 1998 DARPA intrusion detection evaluation [25].
Unless otherwise stated, we assume for our analysis the base-rate present in the DARPA evaluation, which was p = 6.52 × 10^-5.

A. Visualizing the Expected Cost: The Minimization Approach

The biggest drawback of the expected cost approach is that the assumptions and information about the likelihood of attacks and the costs might not be known a priori. Moreover, these parameters can change dynamically during system operation. It is thus desirable to be able to tune the uncertain IDS parameters based on feedback from its actual system performance in order to minimize E[C(I, A)]. We select ROC curves as the basic 2-D graph because they illustrate the behavior of a classifier without regard to the uncertain parameters, such as the base-rate p and the operational costs C(i, j); the ROC curve thus decouples the classification performance from these factors [26]. ROC curves are also general enough to be used to study both anomaly detection schemes and misuse detection schemes (a misuse detection scheme occupies only one point in the ROC space).

[Figure 2.1: Isoline projections of C_ID onto the ROC curve, with p = 6.52 × 10^-5. The optimal C_ID value is C_ID = 0.4565. The associated costs are C(0, 0) = 3 × 10^-5, C(0, 1) = 0.2156, C(1, 0) = 15.5255 and C(1, 1) = 2.8487. The optimal operating point is PFA = 2.76 × 10^-4 and PD = 0.6749.]

In the graphical framework, the relation of these uncertain factors to the ROC curve of an IDS will be reflected in the isolines of each metric, where isolines refer to lines that connect pairs of false alarm and detection rates such that any point on the line has equal expected cost.
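The minimization of equation (2.6) can be sketched directly before turning to the isolines. The ROC points and cost matrix below are hypothetical illustrations, not values from the DARPA evaluation:

```python
def expected_cost(p, pfa, pd, C):
    """E[C(I, A)] for one operating point and base-rate p.

    C is a dict mapping (intrusion, alarm) pairs to costs."""
    return ((1 - p) * ((1 - pfa) * C[(0, 0)] + pfa * C[(0, 1)])
            + p * ((1 - pd) * C[(1, 0)] + pd * C[(1, 1)]))

def best_operating_point(p, roc, C):
    """Equation (2.6): the ROC point minimizing the expected cost."""
    return min(roc, key=lambda pt: expected_cost(p, pt[0], pt[1], C))

# Hypothetical discrete ROC (list of (PFA, PD) pairs) and example costs:
# a false alarm costs 1, a missed intrusion costs 10.
roc = [(0.0, 0.0), (1e-4, 0.4), (1e-3, 0.7), (1e-2, 0.9), (1.0, 1.0)]
C = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 10.0, (1, 1): 0.0}
print(best_operating_point(6.52e-5, roc, C))
```

With the base-rate this small, the cheapest point sits very close to the PFA = 0 axis; raising the cost of missed detections moves it to the right.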
The evaluation of an IDS is therefore reduced to finding the point of the ROC curve that intercepts the optimal isoline of the metric (for signature detectors the evaluation corresponds to finding the isoline that intercepts their single point in the ROC space and the point (0,0) or (1,1)). In Figure 2.1 we see as an example the isolines of C_ID intercepting the ROC curve of the 1998 DARPA intrusion detection evaluation.

One limitation of the C_ID metric is that it specifies the costs C(i, j) a priori. In practice, however, these costs are rarely known in advance; moreover, the costs can change and be dynamically adapted based on the performance of the IDS. Furthermore, the nonlinearity of C_ID makes it difficult to analyze in a single 2-D graph the effect that different values of p have on C_ID. To make the graphical analysis of the cost metrics as intuitive as possible, we will assume from now on (as in [8]) that the costs are tunable parameters which, once selected, remain constant. This assumption will also let us see the effect of different values of p on the expected cost metric.

Under the assumption of constant costs, the isolines of the expected cost E[C(I, A)] are in fact straight lines whose slope depends on the ratio between the costs and the likelihood ratio of an attack. Formally, if we want the pair of points (P_FA1, P_D1) and (P_FA2, P_D2) to have the same expected cost, they must be related by the following equation [27, 28, 26]:

m_{C,p} \equiv \frac{P_{D2} - P_{D1}}{P_{FA2} - P_{FA1}} = \frac{1-p}{p} \cdot \frac{C(0,1) - C(0,0)}{C(1,0) - C(1,1)} = \frac{1-p}{p} \cdot \frac{1}{C}   (2.8)

where in the last equality we have implicitly defined C to be the ratio between the costs, C ≡ (C(1,0) - C(1,1))/(C(0,1) - C(0,0)), and m_{C,p} to be the slope of the isoline.
The set of isolines of E[C(I, A)] can be represented by

ISO_E = { m_{C,p} P_{FA} + b : b ∈ [0, 1] }   (2.9)

For fixed C and p, it is easy to prove that the optimal operating point of the ROC is the point where the ROC intercepts the isoline in ISO_E with the largest b (note, however, that some ROC curves can have more than one optimal point). The optimal operating point in the ROC is therefore determined only by the slope of the isolines, which in turn is determined by p and C. Therefore we can readily check how changes in the costs and in the likelihood of an attack will impact the optimal operating point.

[Figure 2.2: Isolines of the expected cost under different C values (with p = 6.52 × 10^-5), for C = 58.82, C = 10.0 and C = 2.50. As the cost ratio C increases, the slope of the optimal isoline decreases.]

Effect of the Costs. In Figure 2.2, consider the operating point corresponding to C = 58.82, and assume that after some time the operators of the IDS realize that the number of false alarms exceeds their response capabilities. In order to reduce the number of false alarms they can increase the cost of a false alarm C(0, 1) and obtain a second operating point at C = 10. If the situation persists (i.e. the number of false alarms is still more than the operators can efficiently respond to) and they keep increasing the cost of a false alarm, there will be a critical slope mc such that the intersection of the ROC and the isoline with slope mc is at the point (PFA, PD) = (0, 0). The interpretation of this result is that we should not use the IDS being evaluated, since its performance is not good enough for the environment in which it has been deployed. To solve this problem we need to either change the environment (e.g. hire more IDS operators) or change the IDS (e.g. shop for a more expensive IDS).
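The isoline selection rule can be sketched as follows: compute the slope m_{C,p} of equation (2.8) and keep the ROC point lying on the isoline of equation (2.9) with the largest intercept b = PD - m·PFA. The ROC points below are hypothetical; the three cost ratios mirror the ones in Figure 2.2:

```python
def isoline_slope(p, C_ratio):
    # Equation (2.8): m_{C,p} = ((1 - p) / p) * (1 / C)
    return (1 - p) / (p * C_ratio)

def optimal_by_isoline(p, C_ratio, roc):
    """Pick the ROC point touched by the isoline of equation (2.9)
    with the largest intercept b = PD - m * PFA."""
    m = isoline_slope(p, C_ratio)
    return max(roc, key=lambda pt: pt[1] - m * pt[0])

# Hypothetical ROC; C_ratio = (C(1,0) - C(1,1)) / (C(0,1) - C(0,0)).
roc = [(0.0, 0.0), (1e-4, 0.4), (1e-3, 0.7), (1e-2, 0.9), (1.0, 1.0)]
p = 6.52e-5
for C_ratio in (58.82, 10.0, 2.50):
    print(C_ratio, optimal_by_isoline(p, C_ratio, roc))
```

As the text describes, decreasing C (i.e. making false alarms relatively more expensive) steepens the isolines and eventually drives the optimal operating point to (0, 0): the IDS should not be used.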
The Base-Rate Fallacy and Its Implications for the Costs of an IDS. A similar scenario occurs when the likelihood of an attack changes. In Figure 2.3 we can see how, as p decreases, the optimal operating point of the IDS tends again to (PFA, PD) = (0, 0) (again the evaluator must decide not to use the IDS for its current operating environment). Therefore, for small base-rates the operation of an IDS will be cost efficient only if we have an appropriately large C* such that m_{C*,p*} ≤ mc. A large C* can be explained when the cost of a false alarm is much smaller than the cost of a missed detection, C(0, 1) ≪ C(1, 0) (e.g. the case of a government network that cannot afford undetected intrusions and has enough resources to sort through the false alarms).

[Figure 2.3: Isolines of the probability of error under different p values (p = 0.1, 0.01, 0.001, 0.0001). As the base-rate p decreases, the slope of the optimal isoline increases.]

Generalizations. This graphical method of cost analysis can also be applied to other metrics in order to get some insight into the expected cost of the IDS. For example, in [10] the authors define an IDS with input space X to be σ-sensitive if there exists an efficient algorithm with the same input space, E : X → {¬A, A}, such that P_D^E - P_{FA}^E ≥ σ. This metric can be used to find the optimal point of an ROC because it has a very intuitive explanation: as long as the rate of detected intrusions increases faster than the rate of false alarms, we keep moving the operating point of the IDS towards the right in the ROC. The optimal sensitivity problem for an IDS with receiver operating characteristic ROC is thus:

\max_{(P_{FA}, P_D) \in ROC} (P_D - P_{FA})   (2.10)

It is easy to show that this optimal sensitivity point is the same optimal point obtained with the isoline method for m_{C,p} = 1 (i.e. C = (1 - p)/p). B.
The Bayesian Receiver Operating Characteristic: The Tradeoff Approach

[Figure 2.4: The PPV isolines in the ROC space are straight lines that depend only on θ. The PPV values of interest range from 1 to p.]

Although the graphical analysis introduced so far can be applied to analyze the cost efficiency of several metrics, the intuition for the tradeoff between the PPV and the NPV is still not clear. Therefore we now extend the graphical approach by introducing a new pair of isolines: those of the PPV and the NPV metrics.

Lemma 1: Two points (P_FA1, P_D1) and (P_FA2, P_D2) have the same PPV value if and only if

\frac{P_{FA1}}{P_{D1}} = \frac{P_{FA2}}{P_{D2}} = \tan\theta   (2.11)

where θ is the angle between the line PFA = 0 and the isoline. Moreover, the PPV value of an isoline at angle θ is

PPV_{\theta,p} = \frac{p}{p + (1-p)\tan\theta}   (2.12)

Similarly, two points (P_FA1, P_D1) and (P_FA2, P_D2) have the same NPV value if and only if

\frac{1-P_{D1}}{1-P_{FA1}} = \frac{1-P_{D2}}{1-P_{FA2}} = \tan\phi   (2.13)

where φ is the angle between the line PD = 1 and the isoline. Moreover, the NPV value of an isoline at angle φ is

NPV_{\phi,p} = \frac{1-p}{p(\tan\phi - 1) + 1}   (2.14)

[Figure 2.5: The NPV isolines in the ROC space are straight lines that depend only on φ. The NPV values of interest range from 1 to 1 - p.]

Figures 2.4 and 2.5 show the graphical interpretation of Lemma 1. It is important to note the range of the PPV and NPV values as a function of their angles. In particular, notice that as θ goes from 0° to 45° (the range of interest), the value of PPV changes from 1 to p. We can also see from Figure 2.5 that as φ ranges from 0° to 45°, the NPV value changes from 1 to 1 - p. If p is very small, then NPV ≈ 1.
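Lemma 1 can be checked numerically: the PPV computed via Bayes' rule at a point (PFA, PD) agrees with equation (2.12) evaluated at the angle θ of the isoline through that point. The operating point used here is a hypothetical example:

```python
import math

def ppv_direct(p, pfa, pd):
    # Bayes: PPV = PD p / (PD p + PFA (1 - p))
    return pd * p / (pd * p + pfa * (1 - p))

def ppv_from_angle(p, theta):
    # Equation (2.12)
    return p / (p + (1 - p) * math.tan(theta))

def npv_from_angle(p, phi):
    # Equation (2.14)
    return (1 - p) / (p * (math.tan(phi) - 1) + 1)

p, pfa, pd = 6.52e-5, 2e-4, 0.6
theta = math.atan(pfa / pd)            # angle of the PPV isoline through the point
phi = math.atan((1 - pd) / (1 - pfa))  # angle of the NPV isoline through the point
assert math.isclose(ppv_from_angle(p, theta), ppv_direct(p, pfa, pd))
print(ppv_direct(p, pfa, pd), npv_from_angle(p, phi))
```

At θ = 45° the PPV isoline coincides with the random-guessing diagonal and equation (2.12) gives PPV = p, matching the stated range of interest.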
Figure 2.6 shows the application of Lemma 1 to a typical ROC curve of an IDS. In this figure we can see the tradeoff of four variables of interest: PFA, PD, PPV and NPV. Notice that if we choose the optimal operating point based on PFA and PD, as in typical ROC analysis, we might obtain misleading results, because we do not know how to interpret intuitively very low false alarm rates: e.g. is PFA = 10^-3 much better than PFA = 5 × 10^-3? The same reasoning applies to the study of PPV vs. NPV, as we cannot interpret precisely small variations in NPV values: e.g. is NPV = 0.9998 much better than NPV = 0.99975? Therefore we conclude that the most relevant metrics to use for a tradeoff in the performance of a classifier are PD and PPV, since they have an easily understandable range of interest. However, even when we select the PPV and PD values as tradeoff parameters, the isoline analysis shown in Figure 2.6 still has one deficiency: there is no efficient way to account for the uncertainty of p. In order to solve this problem we introduce the B-ROC as a graph that shows how the two variables of interest, PD and PPV, are related under different severities of class imbalance. To follow the intuition of ROC curves, instead of using PPV for the x-axis we prefer to use 1 - PPV. We use this quantity because it can be interpreted as the Bayesian false alarm rate: BFA ≡ Pr[I = 0|A = 1].
For example, for IDSs the BFA can be a measure of how likely it is that the operators of the detection system will waste their time each time they respond to an alarm.

[Figure 2.6: PPV and NPV isolines for the ROC of an IDS with p = 6.52 × 10^-5, with a zoom of the area of interest.]

[Figure 2.7: B-ROC for the ROC of Figure 2.6, for p ranging from 0.1 down to 1 × 10^-7.]

Figure 2.7 shows the B-ROC for the ROC presented in Figure 2.6. Notice also how the values of interest for the x-axis have changed from [0, 10^-3] to [0, 1].

[Figure 2.8: Mapping of the ROC to the B-ROC.]

In order to be able to interpret the B-ROC curves, Figure 2.8 shows how ROC points map to points in the B-ROC. The vertical line defined by 0 < PD ≤ 1 and PFA = 0 in the ROC maps exactly to the same vertical line 0 < PD ≤ 1 and BFA = 0 in the B-ROC. Similarly, the top horizontal line 0 ≤ PFA ≤ 1 and PD = 1 maps to the line 0 ≤ BFA ≤ 1 - p and PD = 1. A classifier that performs random guessing is represented in the ROC by the diagonal line PD = PFA, and this random guessing classifier maps to the vertical line defined by BFA = 1 - p and PD > 0 in the B-ROC. Finally, to understand where the point (0, 0) in the ROC maps into the B-ROC, let α and f(α) denote PFA and the corresponding PD on the ROC curve. Then, as the false alarm rate α tends to zero (from the right), the Bayesian false alarm rate tends to a value that depends on p and the slope of the ROC close to the point (0, 0).
More specifically, if we let f'(0^+) = \lim_{\alpha \to 0^+} f'(\alpha), then:

\lim_{\alpha \to 0^+} BFA = \lim_{\alpha \to 0^+} \frac{\alpha(1-p)}{p f(\alpha) + \alpha(1-p)} = \frac{1-p}{p(f'(0^+) - 1) + 1}

It is also important to recall that a necessary condition for a classifier to be optimal is that its ROC curve be concave. In fact, given any non-concave ROC, by following Neyman-Pearson theory one can always obtain a concave ROC curve by randomizing decisions between optimal points [27]. This idea has recently been popularized in the machine learning community by the notion of the ROC convex hull [26]. The importance of this observation is that, in order to guarantee that the B-ROC is a well defined, continuous and non-decreasing function, we map only concave ROC curves to B-ROCs.

[Figure 2.9: An empirical ROC (ROC2) and its convex hull (ROC1).]

[Figure 2.10: The B-ROC of the concave ROC is easier to interpret.]

In Figures 2.9 and 2.10 we show the only example in this document of the type of B-ROC curve obtained when the ROC is not concave.

[Figure 2.11: Comparison of two classifiers.]

We also point out that B-ROC curves can be very useful for the comparison of classifiers. A typical comparison problem using ROCs is shown in Figure 2.11. Several ideas have been proposed to solve this comparison problem. For example, by using decision theory as in [29, 26] we can find the optimal classifier of the two by assuming a given prior p and given misclassification costs. However, a big problem with this approach is that the misclassification costs are sometimes uncertain and difficult to estimate a priori.
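The two steps above (concavify the empirical ROC, then map it to a B-ROC) can be sketched as follows. The empirical ROC points are hypothetical, and the hull routine is a standard monotone-chain upper hull, not the dissertation's own code:

```python
def roc_convex_hull(points):
    """Upper convex hull of empirical ROC points (monotone chain).

    Randomizing decisions between two operating points realizes any point
    on the segment joining them, so the hull is achievable (Neyman-Pearson)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for pt in pts:
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # Keep the chain concave: stop on a clockwise turn, otherwise
            # drop the dominated middle point.
            if (x2 - x1) * (pt[1] - y1) - (y2 - y1) * (pt[0] - x1) < 0:
                break
            hull.pop()
        hull.append(pt)
    return hull

def b_roc(p, roc):
    """Map concave ROC points (PFA, PD) to B-ROC points (BFA, PD),
    where BFA = Pr[I = 0 | A = 1] = 1 - PPV."""
    curve = []
    for pfa, pd in roc:
        if pfa == 0.0:
            curve.append((0.0, pd))  # the PFA = 0 axis maps to BFA = 0
        else:
            curve.append((pfa * (1 - p) / (pd * p + pfa * (1 - p)), pd))
    return curve

# Hypothetical empirical ROC with a non-concave dip at (0.4, 0.45).
empirical = [(0.1, 0.5), (0.4, 0.45), (0.6, 0.9)]
hull = roc_convex_hull(empirical)
print(hull)               # the dominated dip drops out
print(b_roc(1e-3, hull))  # the point (1, 1) maps to BFA = 1 - p
```

This also makes the boundary mappings of Figure 2.8 concrete: (PFA, PD) = (1, 1) lands at BFA = 1 - p, and the PFA = 0 axis stays at BFA = 0.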
With a B-ROC, on the other hand, we can get a better comparison of two classifiers without assuming any misclassification costs, as can be seen in Figure 2.12.

VI Conclusions

We believe that the B-ROC provides a better way to evaluate and compare classifiers in the case of class imbalances or uncertain values of p. First, for selecting operating points in heavily imbalanced environments, B-ROCs use tradeoff parameters that are easier to understand than the variables considered in ROC curves (they provide better intuition for the performance of the classifier). Second, since the exact class distribution p might not be known a priori, or accurately enough, the B-ROC allows plotting different curves for the range of interest of p. Finally, when comparing two classifiers, there are cases in which, by using the B-ROC, we do not need cost values in order to decide which classifier is better for given values of p. Note also that B-ROCs consider parameters that are directly related to exact quantities that the operator of a classifier can measure. In contrast, the exact interpretation of the expected cost of a classifier is more difficult to relate to the real performance of the classifier (the costs depend on many other unknown factors).

[Figure 2.12: B-ROC comparison (B-ROC1 vs. B-ROC2) for the p of interest, p = 1 × 10^-3.]

Chapter 3

Secure Decision Making: Defining the Evaluation Metrics and the Adversary

They did not realize that because of the quasi-reciprocal and circular nature of all Improbability calculations, anything that was Infinitely Improbable was actually very likely to happen almost immediately.
- Life, the Universe and Everything, Douglas Adams

I Overview

As we have seen in the previous chapter, the traditional evaluation of IDSs assumes that the intruder will behave similarly before and after the deployment and configuration of the IDS (i.e.
during the evaluation it is assumed that the intruder is non-adaptive). In practice, however, this assumption does not hold, since once an IDS is deployed, intruders will adapt and try to evade detection or launch attacks against the IDS. This lack of a proper adversarial model is a big problem for the evaluation of any decision making algorithm used in security applications, since without the definition of an adversary we cannot reason about, much less measure, the security of the algorithm. We therefore lay out in this chapter the overall design and evaluation goals that will be used throughout the rest of this dissertation. The basic idea is to introduce a formal framework for reasoning about the performance and security of a decision making algorithm. In particular, we do not want to find or design the decision making algorithm that performs best on average, but the algorithm that performs best against the worst type of attacks. The chapter layout is the following. In section II we introduce a general set of guidelines for the design and evaluation of decision making algorithms. In section III we introduce a black box adversary model. This model is very simple, yet very useful and robust for cases where having an exact adversarial model is intractable. Finally, in section IV we introduce a detailed model of an adversary, which we will use in chapters 4 and 5.

II A Set of Design and Evaluation Guidelines

In this chapter we focus on designing a practical methodology for dealing with attackers. In particular, we propose the use of a framework where each of the following components is clearly defined:

Desired Properties An intuitive definition of the goal of the system.

Feasible Design Space The design space S for the classification algorithm.

Information Available to the Adversary Identify which pieces of information can be available to an attacker.

Capabilities of the Adversary Define a feasible class of attackers F based on the assumed capabilities.
Evaluation Metric The evaluation metric should be a reasonable measure of how well the designed system meets our desired properties. We call a system secure if its metric outcome is satisfied for any feasible attacker.

Objective of the Adversary An attacker can use its capabilities and the information available in order to perform two main classes of attacks:

• Evaluation attack. The goal of the attacker is the opposite of the goal defined by the evaluation metric. For example, if the goal of the classifier is to minimize E[L(C, A)], then the goal of the attacker is to maximize E[L(C, A)].

• Base system attack. The goal of the attacker is not the opposite of the classifier's goal. For example, even if the goal of the classifier is to minimize E[L(C, A)], the goal of the attacker may still be to minimize the probability of being detected.

Model Assumptions Identify clearly the assumptions made during the design and evaluation of the classifier. It is important to realize that when we borrow tools from other fields, they come with a set of assumptions that might not hold in an adversarial setting, because the first thing an attacker will do is violate the assumptions that the classifier relies on for proper operation. After all, the UCI machine learning repository never launched an attack against your classifier. Therefore one of the most important ways to deal with an adversarial environment is to limit the number of assumptions made, and to evaluate the resiliency of the remaining assumptions to attacks.

Before we use these guidelines for the evaluation of classifiers, we describe a very simple example of their use and their relationship to cryptography. Cryptography is one of the best examples in which a precise framework has been developed in order to define properly what a secure system means, and how to model an adversary.
We believe therefore that this example will help us in identifying the generality and use of the guidelines as a step towards achieving sound designs. (We avoid a precise formal treatment of cryptography because our main objective here is to present the intuition behind the principles rather than the specific technical details.)

A. Example: Secret Key Encryption

In secret key cryptography, Alice and Bob share a single key sk. Given a message m (called the plaintext), Alice uses an encryption algorithm to produce unintelligible data C (called the ciphertext): C ← E_sk(m). After receiving C, Bob then uses sk and a decryption algorithm to recover the secret message: m = D_sk(C).

Desired Properties E and D should enable Alice and Bob to communicate secretly; that is, a feasible adversary should not get any information about m given C, except with very small probability.

Feasible Design Space E and D have to be efficient probabilistic algorithms. They also need to satisfy correctness: for any sk and m, D_sk(E_sk(m)) = m.

Information Available to the Adversary It is assumed that the adversary knows the encryption and decryption algorithms. The only information not available to the adversary is the secret key sk shared between Alice and Bob.

Capabilities of the Adversary The class of feasible adversaries F is the set of algorithms running in a reasonable amount of time.

Evaluation Metric For any messages m0 and m1, given a ciphertext C which is known to be an encryption of either m0 or m1, no adversary A ∈ F can guess correctly which message was encrypted with probability significantly better than 1/2.

Goal of the Adversary Perform an evaluation attack; that is, design an algorithm A ∈ F that can guess with probability significantly better than 1/2 which message corresponds to the given ciphertext.

Model Assumptions The security of an encryption scheme usually relies on a set of cryptographic primitives, such as one-way functions.
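The evaluation metric above can be illustrated with a deliberately toy simulation (this is only an illustration of the guessing game, not a construction from this dissertation): encrypt one of two messages under a fresh one-time pad and measure how often an adversary guesses which one it was. Since the pad leaks nothing, no strategy does significantly better than 1/2:

```python
import random

def otp_encrypt(key, message):
    # One-time pad: XOR each message byte with a fresh key byte
    return bytes(k ^ m for k, m in zip(key, message))

def indistinguishability_game(adversary, m0, m1, trials=2000):
    """Fraction of games won by the adversary: guessing which of m0, m1
    (equal length) a fresh one-time-pad ciphertext encrypts."""
    wins = 0
    for _ in range(trials):
        b = random.randrange(2)
        key = random.randbytes(len(m0))  # fresh key per game (Python 3.9+)
        c = otp_encrypt(key, [m0, m1][b])
        wins += adversary(m0, m1, c) == b
    return wins / trials

random.seed(0)
rate = indistinguishability_game(lambda m0, m1, c: random.randrange(2),
                                 b"attack", b"normal")
print(rate)  # close to 1/2: the ciphertext carries no information about b
```

Replacing the one-time pad with a weak scheme (e.g. a fixed reused key) would let an adversary win with probability close to 1, which is exactly what the evaluation metric is designed to expose.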
Another interesting aspect of cryptography is that the notion of security changes when the adversary model is modified. In the previous example it is sometimes reasonable to assume that the attacker can obtain valid plaintext and ciphertext pairs: {(m0, C0), (m1, C1), . . . , (mk, Ck)}. This new setting is modeled by giving the adversary more capabilities: the feasible set F now consists of all efficient algorithms that have access to the ciphertexts of chosen plaintexts. An encryption algorithm is therefore secure against chosen-plaintext attacks if, even with this new capability, the adversary still cannot break the encryption scheme.

III A Black Box Adversary Model

In this section we introduce one of the simplest adversary models against classification algorithms. We refer to it as a black box adversary model since we do not care, for the moment, how the adversary creates the observations X. This model is particularly suited to cases where modeling the creation of the inputs to the classifier is intractable. Recall first that in our evaluation analysis in the previous chapter we assumed three quantities that can be, up to a certain extent, controlled by the intruder: the base-rate p, the false alarm rate PFA and the detection rate PD. The base-rate can be modified by controlling the frequency of attacks. The perceived false alarm rate can be increased if the intruder finds a flaw in any of the signatures of an IDS that allows him to send maliciously crafted packets that trigger alarms at the IDS but look benign to the IDS operator. Finally, the detection rate can be modified by the intruder through the creation of new attacks whose signatures do not match those of the IDS, or simply by evading the detection scheme, for example through a mimicry attack [30, 31].
In an effort towards understanding the advantage an intruder gains by controlling these parameters, and to provide a robust evaluation framework, we now present a formal framework to reason about the robustness of an IDS evaluation method. Our work in this section is in some sense similar to the theoretical framework presented in [10], which was inspired by cryptographic models. However, we see two main differences in our work. First, we introduce the role of an adversary against the IDS, and thereby introduce a measure of robustness for the metrics. Second, our work is more practical and is applicable to more realistic evaluation metrics; furthermore, we provide examples of practical scenarios where our methods can be applied.

In order to be precise in our presentation, we need to extend the definitions introduced in section III. For our modeling purposes we decompose the IDS algorithm into two parts: a detector D and a decision maker DM. For the case of an anomaly detection scheme, D(x[j]) outputs the anomaly score y[j] on input x[j], and DM represents the threshold that determines whether to consider the anomaly score an intrusion or not, i.e. DM(y[j]) either outputs an alarm or it does not. For a misuse detection scheme, DM has to decide either to use the signature to report alarms, or that the performance of the signature is not good enough to justify its use, in which case all its alarms will be ignored (e.g. it is not cost-efficient to purchase the misuse scheme being evaluated).

Definition 1 An IDS algorithm is the composition of algorithms D (an algorithm from which we can obtain an ROC curve) and DM (an algorithm responsible for selecting an operating point). During operation, an IDS receives a continuous data stream of event features x[1], x[2], . . . and classifies each input x[j] by raising an alarm or not.
Formally:

IDS(x)
  y = D(x)
  A ← DM(y)
  Output A (where A ∈ {0, 1})
♦

We now study the performance of an IDS under an adversarial setting. We remark that our intruder model does not represent a single physical attacker against the IDS. Instead, our model represents a collection of attackers whose average behavior will be studied under the worst possible circumstances for the IDS. The first thing we consider is the amount of information the intruder has. A basic assumption to make in an adversarial setting is that the intruder knows everything that we know about the environment and can make inferences about the situation the same way we can. Under this assumption, the base-rate p̂ estimated by the IDS, its estimated operating condition (P̂FA, P̂D) selected during the evaluation, the original ROC curve (obtained from D) and the cost function C(I, A) are all public values (i.e. they are known to the intruder).

We model the capability of an adaptive intruder by defining some confidence bounds. We assume an intruder can deviate within [p̂ - δl, p̂ + δu] from the expected value p̂. Also, based on our confidence in the detector algorithm and on how hard we expect it to be for an intruder to evade the detector or to create non-relevant false positives (this also models how the normal behavior of the monitored system can produce new, previously unseen false alarms), we define α and β as bounds on the amount of variation we can expect during IDS operation in the false alarm and detection rates (respectively), i.e. the amount of variation from (P̂FA, P̂D). (Although in practice estimating these bounds is not an easy task, testing approaches like the one described in [32] can help in their determination.) The intruder also has access to an oracle Feature(·, ·) that simulates an event to input into the IDS.
Feature(0, ζ) outputs a feature vector modeling the normal behavior of the system that will raise an alarm with probability ζ (or a crafted malicious feature that always raises an alarm, in the case Feature(0, 1)), and Feature(1, ζ) outputs the feature vector of an intrusion that will raise an alarm with probability ζ. (The arrow notation a ← DM(y) indicates that the algorithm on the right can be probabilistic; the same convention is used below.)

Definition 2 A (δ, α, β)-intruder is an algorithm I that can select its frequency of intrusions p1 from the interval δ = [p̂ - δl, p̂ + δu]. If it decides to attempt an intrusion, then with probability p3 ∈ [0, β] it creates an attack feature x that will go undetected by the IDS (otherwise the intrusion is detected with probability P̂D). If it decides not to attempt an intrusion, then with probability p2 ∈ [0, α] it creates a feature x that will raise a false alarm in the IDS:

I(δ, α, β)
  Select p1 ∈ [p̂ - δl, p̂ + δu]
  Select p2 ∈ [0, α]
  Select p3 ∈ [0, β]
  I ← Bernoulli(p1)
  If I = 1
    B ← Bernoulli(p3)
    x ← Feature(1, min{1 - B, P̂D})
  Else
    B ← Bernoulli(p2)
    x ← Feature(0, max{B, P̂FA})
  Output (I, x)

where Bernoulli(ζ) outputs a Bernoulli random variable with probability of success ζ. Furthermore, if δl = p and δu = 1 - p, we say that I has the ability to make a chosen-intrusion-rate attack. ♦

We now formalize what it means for an evaluation scheme to be robust. We stress that we are not analyzing the security of an IDS, but rather the security of the evaluation of an IDS, i.e. how confident we are that the IDS will behave during operation as assumed in the evaluation.

A. Robust Expected Cost Evaluation

We start with the general decision theoretic framework of evaluating the expected cost (per input) E[C(I, A)] of an IDS.
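Definition 2 can be made runnable with the Feature oracle stubbed out as a Bernoulli alarm draw (consistent with the black box view, where only the alarm probability of an event matters). The parameter values in the demo are hypothetical:

```python
import random

def feature(intrusion, zeta):
    """Black-box stand-in for the Feature oracle: we only model whether
    the event raises an alarm (with probability zeta)."""
    alarm = random.random() < zeta
    return (intrusion, alarm)

def intruder(p1, p2, p3, pd_hat, pfa_hat):
    """One event from a (delta, alpha, beta)-intruder (Definition 2):
    p1 = intrusion frequency, p3 <= beta = evasion probability,
    p2 <= alpha = crafted-false-alarm probability."""
    I = random.random() < p1
    if I:
        B = random.random() < p3            # evade with probability p3
        x = feature(1, min(1 - B, pd_hat))  # undetected whenever B = 1
    else:
        B = random.random() < p2            # craft a false alarm with prob. p2
        x = feature(0, max(B, pfa_hat))     # always alarms whenever B = 1
    return int(I), x

random.seed(1)
n = 200_000
alarms_no_intrusion = sum(intruder(0.0, 0.3, 0.0, 0.9, 0.05)[1][1]
                          for _ in range(n))
# With p2 = 0.3 and PFA_hat = 0.05 the perceived false alarm rate rises to
# roughly 0.3 + 0.7 * 0.05 = 0.335.
print(alarms_no_intrusion / n)
```

The simulation makes the definition's effect concrete: even without changing the detector, the intruder alone can inflate the operational false alarm rate well beyond the evaluated P̂FA.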
Definition 3: An evaluation method that claims the expected cost of an IDS is at most r is robust against a (δ, α, β)-intruder if the expected cost of the IDS during the attack is no larger than r, i.e.,

  E_{δ,α,β}[C(I, A)] = Σ_{i,a} C(i, a) · Pr[(I, x) ← I(δ, α, β); A ← IDS(x) : I = i, A = a] ≤ r ♦

Now recall that the traditional evaluation framework finds an evaluation value r* by using equation (2.6). So by finding r* we are basically finding the best performance of an IDS, and we claim that an IDS is better than others if its r* is smaller than the evaluation of the other IDSs. In this section we instead claim that an IDS is better than others if its expected cost under the worst performance is smaller than the expected cost under the worst performance of the other IDSs. In short:

Traditional Evaluation: Given a set of IDSs {IDS1, IDS2, ..., IDSn}, find the best expected cost for each:

  ri* = min_{(PFA, PD) ∈ ROCi} E[C(I, A)]   (3.1)

Declare that the best IDS is the one with the smallest expected cost ri*.

Robust Evaluation: Given a set of IDSs {IDS1, IDS2, ..., IDSn}, find the best expected cost for each IDSi when under the attack of a (δ, αi, βi)-intruder.³ Therefore we find the best IDS as follows:

  ri^robust = min_{(PFA^{αi}, PD^{βi}) ∈ ROCi^{(αi,βi)}} max_{I(δ,αi,βi)} E_{δ,αi,βi}[C(I, A)]   (3.2)

Several important questions are raised by the above framework. In particular, we are interested in finding the least upper bound r such that we can claim the evaluation of an IDS to be robust. Another important question is how we can design an evaluation of an IDS satisfying this least upper bound. Solutions to these questions are partially based on game theory.
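As a concrete (and deliberately simplified) sketch of the robust evaluation in equation (3.2), suppose the ROC of an IDS is known at finitely many points. The worst-case intruder inflates each false alarm rate to PFA^α = α + PFA(1 − α), deflates each detection rate to PD^β = PD(1 − β), and, in the common case where attacking as often as possible is its dominant strategy, plays the base-rate p̂ + δu. The helper names below are ours, not the dissertation's:

```python
def roc_alpha_beta(points, alpha, beta):
    """Transform original ROC points (P_FA, P_D) into ROC^(alpha, beta):
    P_FA^alpha = alpha + P_FA*(1 - alpha),  P_D^beta = P_D*(1 - beta)."""
    return [(alpha + pfa * (1 - alpha), pd * (1 - beta)) for pfa, pd in points]

def expected_cost(pfa, pd, p, C):
    """E[C] = R(0, P_FA)(1 - p) + R(1, P_D) p, with C[i][a] = C(i, a)."""
    r0 = C[0][0] * (1 - pfa) + C[0][1] * pfa   # cost under no intrusion
    r1 = C[1][0] * (1 - pd) + C[1][1] * pd     # cost under an intrusion
    return r0 * (1 - p) + r1 * p

def robust_cost(points, alpha, beta, p_upper, C):
    """Best achievable worst-case cost when p = p_hat + delta_u is the
    intruder's dominant strategy (a simplifying assumption of this sketch)."""
    return min(expected_cost(x, y, p_upper, C)
               for x, y in roc_alpha_beta(points, alpha, beta))
```

This covers only the dominant-strategy case; the general solution requires the zero-sum game analysis developed next.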
Lemma 2: Given an initial estimate of the base-rate p̂, an initial ROC curve obtained from D, and constant costs C(I, A), the least upper bound r such that the expected cost evaluation of an IDS is r-robust is given by

  r = R(0, P̂FA^α)(1 − p̂δ) + R(1, P̂D^β) p̂δ   (3.3)

where

  R(0, P̂FA^α) ≡ C(0, 0)(1 − P̂FA^α) + C(0, 1) P̂FA^α   (3.4)

is the expected cost of the IDS under no intrusion,

  R(1, P̂D^β) ≡ C(1, 0)(1 − P̂D^β) + C(1, 1) P̂D^β   (3.5)

is the expected cost of the IDS under an intrusion, and p̂δ, P̂FA^α and P̂D^β are the solution to a zero-sum game between the intruder (the maximizer) and the IDS (the minimizer), whose solution can be found in the following way:

1. Let (PFA, PD) denote any point of the initial ROC obtained from D, and let ROC^(α,β) be the ROC curve defined by the points (PFA^α, PD^β), where PD^β = PD(1 − β) and PFA^α = α + PFA(1 − α).

2. Using p̂ + δu in the isoline method, find the optimal operating point (xu, yu) in ROC^(α,β), and using p̂ − δl in the isoline method, find the optimal operating point (xl, yl) in ROC^(α,β).

3. Find the point (x*, y*) in ROC^(α,β) that intersects the line

  y = [C(1, 0) − C(0, 0)]/[C(1, 0) − C(1, 1)] + x [C(0, 0) − C(0, 1)]/[C(1, 0) − C(1, 1)]

(under the natural assumptions C(1, 0) > R(0, x*) > C(0, 0), C(0, 1) > C(0, 0) and C(1, 0) > C(1, 1)). If no point intersects this line, then set x* = y* = 1.

4. If x* ∈ [xl, xu], then find the base-rate parameter p* such that the optimal isoline of Equation (2.9) intercepts ROC^(α,β) at (x*, y*), and set p̂δ = p*, P̂FA^α = x* and P̂D^β = y*.

5. Else, if R(0, xu) < R(1, yu), find the base-rate parameter pu such that the optimal isoline of Equation (2.9) intercepts ROC^(α,β) at (xu, yu), and then set p̂δ = pu, P̂FA^α = xu and P̂D^β = yu. Otherwise, find the base-rate parameter pl such that the optimal isoline of Equation (2.9) intercepts ROC^(α,β) at (xl, yl), and then set p̂δ = pl, P̂FA^α = xl and P̂D^β = yl.

³ Note that different IDSs might have different α and β values. For example, if IDS1 is an anomaly detection scheme, then we can expect the probability α1 that new normal events will generate alarms to be larger than the corresponding probability α2 for a misuse detection scheme IDS2.

The proof of this lemma is straightforward. The basic idea is that if the uncertainty range of p is large enough, the Nash equilibrium of the game is obtained by selecting the point intersecting the line of step 3. Otherwise, one of the strategies for the intruder is always a dominant strategy of the game, and therefore we only need to find which one it is: either p̂ + δu or p̂ − δl. For most practical cases it will be p̂ + δu. Also note that the optimal operating point in the original ROC can be found by obtaining (P̂FA, P̂D) from (P̂FA^α, P̂D^β).

B. Robust B-ROC Evaluation

Similarly, we can now analyze the robustness of an evaluation done with the B-ROC curves. In this case it is also easy to see that the worst attacker for the evaluation is an intruder I that selects p1 = p̂ − δl, p2 = α and p3 = β.

Corollary 1: For any point (PPV̂, P̂D) corresponding to p̂ in the B-ROC curve, a (δ, α, β)-intruder can decrease the detection rate and the positive predictive value to the pair (PPV̂_{δ,α,β}, P̂D^β), where P̂D^β = P̂D(1 − β) and

  PPV̂_{δ,α,β} = [P̂D^β p̂ − P̂D^β δl] / [P̂D^β p̂ + P̂FA^α(1 − p̂) + δl P̂FA^α − δl P̂D^β]   (3.6)

C. Example: Minimizing the Cost of a Chosen-Intrusion-Rate Attack

In this example we introduce probably one of the simplest formulations of an attacker against a classifier: we assume that the attacker cannot change its feature vectors x, but only its frequency of attacks p. This example also shows the generality of Lemma 2 and presents a compelling scenario of when a probabilistic IDS makes sense.

Assume an ad hoc network scenario similar to [33, 34, 35, 36], where nodes monitor and distribute reputation values of other nodes' behavior at the routing layer.
The monitoring nodes report selfish actions (e.g., nodes that agree to forward packets in order to be accepted in the network, but then fail to do so) or attacks (e.g., nodes that modify routing information before forwarding it).

Now suppose that there is a network operator considering the implementation of a watchdog monitoring scheme to check the compliance of nodes with packet forwarding, as in [33]. The operator then plans an evaluation period of the method where trusted nodes will act as the watchdogs reporting the misbehavior of other nodes. Since the detection of misbehaving nodes is not perfect, during the evaluation period the network operator is going to measure the consistency of the reports given by several watchdogs and decide whether the watchdog system is worth keeping or not. During this trial period, it is in the interest of selfish nodes to behave as deceptively as they can, so that the neighboring watchdogs produce largely different results and the system is not permanently established.

As stated in [33], the watchdogs might not detect a misbehaving node in the presence of 1) ambiguous collisions, 2) receiver collisions, 3) limited transmission power, 4) false misbehavior, 5) collusion or 6) partial dropping. False alarms are also possible in several cases, for example when a node moves out of the previous node's listening range before forwarding a packet. Also, if a collision occurs while the watchdog is waiting for the next node to forward a packet, it may never overhear the packet being transmitted.

We now briefly describe this model according to our guidelines:

Desired Properties: Assume the operator wants to find a strategy that minimizes the probability of making errors. This is an instance of the expected cost metric with C(0, 0) = C(1, 1) = 0 and C(1, 0) = C(0, 1) = 1.

Feasible Design Space: DM = {πi ∈ [0, 1] : π1 + π2 + π3 + π4 = 1}.
Information Available to the Adversary: We assume the adversary knows everything that we know and can make the same inferences about the situation that we can. In game theory such adversaries are usually referred to as intelligent.

Capabilities of the Adversary: The adversary has complete control over the base-rate p (its frequency of attacks). The feasible set is therefore F = [0, 1].

Goal of the Adversary: An evaluation attack.

Evaluation Metric:

  r* = min_{πi ∈ DM} max_{p ∈ F} E[C(I, A)]

Note the order of optimization in the evaluation metric. In this case we are assuming that the operator of the IDS makes the first decision, and that this information is then available to the attacker when selecting the optimal p. We call the strategy of the operator secure if the expected cost (probability of error) is never greater than r* for any feasible adversary.

Model Assumptions: We have assumed that the attacker is not able to change P̂FA and P̂D. This follows from its assumed inability to directly modify the feature vector x. Notice that in this case the detector algorithm D is the watchdog mechanism that monitors the medium to see whether the packet was forwarded (F) or was not heard being forwarded (unheard, U) during a specified amount of time. Following [33] (where it is shown that the number of false alarms can be quite high), we assume that a given watchdog D has a false alarm rate of P̂FA = 0.5 and a detection rate of P̂D = 0.75.
Given this detector algorithm, a (nonrandomized) decision maker DM has to be one of the following rules (where, intuitively, h3 is the most appealing):

  h1(F) = 0  h1(U) = 0
  h2(F) = 1  h2(U) = 0
  h3(F) = 0  h3(U) = 1
  h4(F) = 1  h4(U) = 1

Each rule induces an operating point: h1 never raises alarms, h2 alarms on F, h3 alarms on U, and h4 always alarms. The resulting matrix of expected costs is:

          h1       h2              h3          h4
  I = 0   R(0, 0)  R(0, 1 − P̂FA)  R(0, P̂FA)  R(0, 1)
  I = 1   R(1, 0)  R(1, 1 − P̂D)   R(1, P̂D)   R(1, 1)

Table 3.1: Matrix for the zero-sum game-theoretic formulation of the detection problem

[Figure 3.1: Probability of error for hi vs. p]

Since the operator wants to check the consistency of the reports, the selfish nodes will try to maximize the probability of error (i.e., C(0, 0) = C(1, 1) = 0 and C(0, 1) = C(1, 0) = 1) of any watchdog with a chosen-intrusion-rate attack. As stated in Lemma 2, this is a zero-sum game where the adversary is the maximizer and the watchdog is the minimizer. The matrix of this game is given in Table 3.1.

It is a well-known fact that in order to achieve a Nash equilibrium of the game, the players should consider mixed strategies (i.e., probabilistic choices). For our example, the optimal mixed strategy for the selfish node (see Figure 3.1) is to drop a packet with probability p* = P̂FA/(P̂FA + P̂D). On the other hand, the optimal strategy for DM is to select h3 with probability 1/(P̂FA + P̂D) and h1 with probability (P̂FA − (1 − P̂D))/(P̂FA + P̂D).

This example shows that sometimes, in order to minimize the probability of error (or any general cost) against an adaptive attacker, DM has to be a probabilistic algorithm. Lemma 2 also presents a way to get this optimal point from the ROC; however, it is not obvious at first how to get the same results, as there appear to be only three points in the ROC: (PFA = 0, PD = 0) (by selecting h1), (P̂FA = 1/2, P̂D = 3/4) (by selecting h3) and (PFA = 1, PD = 1) (by selecting h4).
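The equilibrium just described can be computed in closed form. The following sketch (our own helper, with the formulas taken from the discussion above) returns the attacker's optimal drop frequency, DM's mixing probabilities over h1 and h3, and the value of the game:

```python
def watchdog_saddle_point(p_fa, p_d):
    """Closed-form saddle point of the zero-sum game of Table 3.1 when DM
    mixes between h1 (never alarm) and h3 (alarm on U).

    Returns (p_star, q_h1, q_h3, value)."""
    p_star = p_fa / (p_fa + p_d)     # attacker's optimal intrusion rate
    q_h3 = 1.0 / (p_fa + p_d)        # DM plays h3 with this probability
    q_h1 = 1.0 - q_h3                # and h1 otherwise
    value = p_star                   # probability of error at equilibrium
    return p_star, q_h1, q_h3, value
```

With the watchdog figures P̂FA = 0.5 and P̂D = 0.75, this reproduces p* = 2/5, the (1/5, 4/5) mixture over (h1, h3), and an error probability of 2/5.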
The key property of ROC curves to remember is that the (optimal) ROC curve is a continuous and concave function [27], and that the points that do not correspond to deterministic decisions are joined by a straight line whose points can be achieved by a mixture of the probabilities of the extreme points. In our case, the line y = 1 − x intersects the (optimal) ROC at the optimal operating point P̂FA* = P̂FA/(P̂FA + P̂D) and P̂D* = P̂D/(P̂FA + P̂D) (see Figure 3.2). Also note that p* is the value required to make the slope of the isoline parallel to the ROC segment containing (P̂FA*, P̂D*).

The optimal strategy for the intruder is therefore p* = 2/5, while the optimal strategy for DM is to select h1 with probability 1/5 and h3 with probability 4/5. In the robust operating point we have P̂FA* = 2/5 and P̂D* = 3/5. Therefore, after fixing DM, it does not matter if p deviates from p*, because we are guaranteed that the probability of error will be no worse (but no better either) than 2/5; the IDS can therefore be claimed to be 2/5-robust.

[Figure 3.2: The optimal operating point]

D. Example: Robust Evaluation of IDSs

As a second example, we chose to perform an intrusion detection experiment with the 1998 MIT/Lincoln Labs data set [37]. Although several aspects of this data set have been criticized in [38], we still chose it for two main reasons. First, it has been (and arguably still remains) the most widely used large-scale data set for evaluating IDSs. Second, we are not claiming to have a better IDS to detect attacks and then proving our claim with its good performance on the MIT data set (a feat that would require further testing in order to assess the quality of the IDS).
Our aim, rather, is to illustrate our methodology, and since this data set is publicly available and has been widely studied and experimented with (researchers can in principle reproduce any result shown in a paper), we believe it provides the basic background and setting to exemplify our approach. Of interest are the Solaris system log files, known as BSM logs.

The first step of the experiment was to record every instance of a program being executed in the data set. Next, we created a very simple tool to perform buffer overflow detection. To this end, we compared the buffer length of each execution with a buffer threshold; if the buffer size of the execution was larger than the threshold, we report an alarm.

We divided the data set into two sets. In the first one (weeks 6 and 7), our IDS performs very well, and thus we assume that this is the "evaluation" period. The previous three weeks were used as the period of operation of the IDS. Figure 3.3(a) shows the results for the "evaluation" period when the buffer threshold ranges between 64 and 780.⁴ The dotted lines represent the suboptimal points of the ROC or, equivalently, the optimal points that can be achieved through randomization. For example, the dotted line of Figure 3.3(a) can be achieved by selecting with probability λ the detector with threshold 399 and with probability 1 − λ the detector with threshold 773, and letting λ range from zero to one.

⁴ Care must always be taken when looking at the results of ROC curves due to the "unit of analysis" problem [38]. For example, comparing the ROC of Figure 3.3(a) with the ROC of [6], one might arrive at the erroneous conclusion that the buffer threshold mechanism produces an IDS that is better than the more sophisticated IDS based on Bayesian networks. The difference lies in the fact that we are monitoring the execution of every program, while the experiments in [6] only monitor the attacked programs (eject, fbconfig, fdformat and ps). Therefore, although we raise more false alarms, our false alarm rate (number of false alarms divided by the total number of honest executions) is smaller.

[Figure 3.3: Robust expected cost evaluation. (a) Original ROC obtained during the evaluation period. (b) Effective ROC during operation time. (c) Original ROC under adversarial attack, ROC^(α,β).]

During the evaluation weeks there were 81108 executions monitored and 13 attacks; therefore p̂ = 1.6 × 10−4. Assuming that our costs (per execution) are C(0, 0) = C(1, 1) = 0, C(1, 0) = 850 and C(0, 1) = 100, we find that the slope given by equation (2.8) is m_{C,p̂} = 735.2, and therefore the optimal point is (2.83 × 10−4, 1), which corresponds to a threshold of 399 (i.e., all executions with buffer sizes bigger than 399 raise alarms). Finally, with these operating conditions we find that the expected cost (per execution) of the IDS is E[C(I, A)] = 2.83 × 10−2.

In the previous three weeks, used as the "operation" period, our buffer threshold does not perform as well, as can be seen from its ROC (shown in Figure 3.3(b)). Therefore, if we use the point recommended in the evaluation (i.e., the threshold of 399), we get an expected cost of E_operation[C(I, A)] = 6.934 × 10−2. Notice how much larger the expected cost per execution is than the one we had evaluated. This is particularly noticeable because the base-rate is smaller during the operation period (p̂_operation = 7 × 10−5), and a smaller base-rate should have given us a smaller cost. To understand the new ROC, let us take a closer look at the performance of one of the thresholds.
For example, the buffer length of 773, which was able to detect 10 out of the 13 attacks with no false alarms in Figure 3.3(a), does not perform well in Figure 3.3(b), because some programs such as grep, awk, find and ld were executed under normal operation with long string lengths. Furthermore, a larger percentage of attacks was able to get past this threshold. This is, in general, the behavior modeled by the parameters α and β that the adversary has access to in our framework.

Let us begin the evaluation process from scratch by assuming a ([1 × 10−5, 0], 1 × 10−5, 0.1)-intruder, where δ = [1 × 10−5, 0] means the IDS evaluator believes that the base-rate during operation will be at most p̂ and at least p̂ − 1 × 10−5, α = 1 × 10−5 means that the IDS evaluator believes that new normal behavior will have the chance of firing an alarm with probability 1 × 10−5, and β = 0.1 means that the IDS operator has estimated that ten percent of the attacks during operation will go undetected. With these parameters we get the ROC^(α,β) shown in Figure 3.3(c).

Note that in this case, p is bounded in such a way that the equilibrium of the game is achieved via a pure strategy. In fact, the optimal strategy of the intruder is to attack with frequency p̂ + δu (and, of course, generate missed detections with probability β and false alarms with probability α), whereas the optimal strategy of DM is to find the point in ROC^(α,β) that minimizes the expected cost by assuming that the base-rate is p̂ + δu. The optimal point for the ROC^(α,β) curve corresponds to the one with threshold 773, having an expected cost E_{δ,α,β}[C(I, A)] = 5.19 × 10−2. Finally, by using the optimal point for ROC^(α,β), as opposed to the original one, we get during operation an expected cost of E_operation[C(I, A)] = 2.73 × 10−2. Therefore, in this case, not only have we maintained the 5.19 × 10−2 robustness guarantee of the evaluation, but the new optimal point actually performed better than the original one.
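The evaluation-period figures above can be re-derived with a few lines of Python. This sketch assumes, consistently with the quoted value m_{C,p̂} = 735.2, that the isoline slope of equation (2.8) has the form C(0, 1)(1 − p̂)/(C(1, 0) p̂); the function names are ours:

```python
def expected_cost(p_fa, p_d, p, c_fa=100.0, c_miss=850.0):
    """Expected cost per execution with C(0,0) = C(1,1) = 0,
    C(0,1) = c_fa (false alarm) and C(1,0) = c_miss (missed detection)."""
    return (1 - p) * c_fa * p_fa + p * c_miss * (1 - p_d)

def isoline_slope(p, c_fa=100.0, c_miss=850.0):
    """Slope m_{C, p_hat} of the optimal isoline (assumed form of eq. 2.8)."""
    return c_fa * (1 - p) / (c_miss * p)

p_hat = 1.6e-4
# Threshold 399 operates at (P_FA, P_D) = (2.83e-4, 1.0) during evaluation.
cost_eval = expected_cost(2.83e-4, 1.0, p_hat)   # ~2.83e-2
slope = isoline_slope(p_hat)                     # ~735.2
```

The same `expected_cost` call with the operation-period rates reproduces the degraded costs discussed above.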
Notice that the evaluation of Figure 3.3 relates exactly to the problem we presented in the introduction, because it can be thought of as the evaluation of two IDSs: one IDS having a buffer threshold of length 399 and another IDS having a buffer threshold of length 773. Under ideal conditions we choose the IDS with buffer threshold length 399, since it has a lower expected cost. However, after evaluating the worst possible behavior of the IDSs, we decide to select the one with buffer threshold length 773.

An alternative view can be obtained through the use of B-ROC curves. In Figure 3.4(a) we see the original B-ROC curve during the evaluation period. These curves give a false sense of confidence in the IDS. Therefore, we study the B-ROC curves based on ROC^(α,β) in Figure 3.4(b). In Figure 3.4(c) we can see how the B-ROC of the actual operating environment follows the B-ROC based on ROC^(α,β) more closely than the original one.

[Figure 3.4: Robust B-ROC evaluation. (a) Original B-ROC obtained during the evaluation period. (b) B-ROC obtained from ROC^(α,β). (c) B-ROC during operation time.]

IV A White Box Adversary Model

So far we have treated the decision-making process without considering the exact way in which an adversary can modify or create the input x to the classifier. However, following the same statistical framework we have been considering, there is a way to formulate this problem that sheds light on the exact behavior and optimal strategies of the adversary.
In order to define this problem, recall first that an optimal decision rule (optimal in the sense that it minimizes several evaluation metrics, such as the probability of error, the expected cost, or the probability of missed positives given an upper bound on the probability of false positives) is the log-likelihood ratio test

  ln [f(x|1)/f(x|0)] ≷_{H0}^{H1} τ

where Hi denotes the hypothesis that I = i, and τ depends on the particular evaluation metric and its assumptions (e.g., misclassification costs, base-rates, etc.). If the log-likelihood ratio is greater than τ, then the detector outputs A = 1, and if it is less than τ, the detector outputs A = 0 (if it is equal to τ, the detector randomly selects the hypothesis, based on the evaluation metric being considered).

This more detailed formulation allows us to model the fact that an attacker is able to modify its attack strategy f(x|1). The pdf f(x|0) of the normal behavior can be learned via one-class learning; however, since the attacker can change its strategy, we cannot trust a machine learning technique to estimate f(x|1): learning a candidate density f(x|1) for an attacker without proper analysis may result in serious performance degradation if the attacker's strategy diverges from our estimated model. It is therefore one of our aims in the following two chapters to evaluate and design algorithms that try to estimate the least favorable pdf f(x|1) within a feasible adversary class F. The adversary class F will consist of pdfs f(x|1) constrained by a requirement that the adversary has to satisfy. For example, in the next chapter it is a desired level of wireless channel access, and in chapter 5 the feasible class F is composed of pdfs that satisfy certain distortion constraints. An abstract example of how to model the problem of finding the least favorable f(x|1) is now considered.
Assume an adversary whose objective is to create mimicry attacks, i.e., it wants to minimize the probability of being detected. Furthermore, assume x takes only discrete values, so the f(x|i) are in fact probability mass functions (pmfs) (as opposed to density functions). This problem can be formulated with the help of information-theoretic inequalities [39]:

  PFA log (PFA/PD) + (1 − PFA) log [(1 − PFA)/(1 − PD)] ≤ KL[f(x|0) || f(x|1)]   (3.7)

where KL[f0 || f1] denotes the Kullback-Leibler divergence between two pmfs f0 and f1. From this inequality it can be deduced that if we fix PFA to be very small (PFA → 0), then

  PD ≤ 1 − 2^{−KL[f(x|0) || f(x|1)]}   (3.8)

The task of the adversary would therefore be the following:

  h* = arg min_{h ∈ F} KL[f(x|0) || h(x)]

It can in fact be shown that some of the examples studied in the next two chapters can be formulated as exactly this optimization problem.

Besides estimating least favorable distributions, another basic notion in minimax game approaches that will be very useful in the next two chapters is that of a saddle point. A strategy (detection rule) d* and an operating point (attack) f(x|1)* in the uncertainty class F form a saddle point if:

1. For the attack f(x|1)*, any detection rule d other than d* has worse performance. Namely, d* is the optimal detection rule for attack f(x|1)* in terms of minimizing the objective function (evaluation).

2. For the detection rule d*, any attack f(x|1) from the uncertainty class other than f(x|1)* gives better performance. Namely, detection rule d* has its worst performance for attack f(x|1)*.

Implicit in the minimax approach is the assumption that the attacker has full knowledge of the employed detection rule. Thus, it can create a misbehavior strategy that maximizes the cost of the performance of the classifier. Therefore, our approach refers to the case of an intelligent attacker that can adapt its misbehavior policy so as to avoid detection.
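The bound (3.8) is easy to explore numerically. The sketch below (our helper names; the KL divergence is computed in bits to match the base-2 exponent of (3.8)) shows, for instance, that a perfect mimic h = f(x|0) drives the achievable detection rate to zero at a vanishing false alarm rate:

```python
import math

def kl_bits(p, q):
    """Kullback-Leibler divergence KL[p || q] in bits, for pmfs given as
    dicts over a common discrete support."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

def detection_ceiling(f0, f1):
    """Upper bound on P_D as P_FA -> 0, from equation (3.8):
    P_D <= 1 - 2^(-KL[f0 || f1])."""
    return 1.0 - 2.0 ** (-kl_bits(f0, f1))
```

Note the asymmetry of the divergence: it is KL[f(x|0) || h] that the adversary minimizes, matching the optimization h* above.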
One issue that needs to be clarified is the structure of this attack strategy. Subsequently, by deriving the detection rule and the performance for that case, we can obtain an (attainable) upper bound on performance over all possible attacks. Even though we did not point it out before, it can be shown that the optimal operating points in the two IDS examples presented in Section III are in fact saddle-point equilibria!

V Conclusions

There are two main problems that any empirical test of a classifier will face. The first problem relates to the inferences that one can make about any classifier based on experiments alone. An example is the low confidence on the estimate of the probability of detection in the ROC. A typical way to improve this estimate in other classification tasks is through the use of error bars in the ROC. However, if tests of classifiers include very few attacks and their variations, there is not enough data to provide an accurate significance level for the bars. Furthermore, the use of error bars and any other cross-validation technique gives the average performance of the classifier. This brings us to the second problem: since classifiers are subject to an adversarial environment, evaluating a decision maker based on its average performance is not enough. Our adversary model tries to address these two problems, since it provides a principled approach that gives us the worst-case performance of a classifier.

The extent to which the analysis with a (δ, α, β)-intruder will follow the real operation of the IDS depends on how accurately the person doing the evaluation understands the IDS and its environment: for example, to what extent the IDS can be evaded, how well the signatures are written (e.g., how likely it is that normal events fire alarms), etc.
However, by assuming robust parameters we are actually assuming a pessimistic setting, and if this pessimistic scenario never happens, we might be operating at a suboptimal point (i.e., we might have been too pessimistic in the evaluation).

Finally, although the white box framework for adversarial modeling gives us a fine-grained evaluation procedure, it is not always a better alternative to the black box adversary model. There are basically two problems with the white box adversary model. The first is that finding the optimal adversarial distribution is usually an intractable problem. The second is that, in order to avoid this intractability, f(x|1) is sometimes assumed to follow a certain parametric model; instead of searching for optimal adversarial pdfs f(x|1), the problem is then replaced with one of finding the optimal parameters of a prescribed distribution. The problem with this approach is that assuming a distributional form creates extra assumptions about the attacker, and as explained in the guidelines defined at the beginning of this chapter, any extra assumption must be taken with care, especially because in practice the adversary will most likely not follow the parametric distribution assumed a priori.

Chapter 4
Performance Comparison of MAC Layer Misbehavior Schemes

If anything could dissipate my love to humanity, it would be ingratitude. In short, I am a hired servant, I expect my payment at once, that is, praise, and the repayment of love with love. Otherwise I am incapable of loving anyone.
- The Brothers Karamazov, Dostoevsky

I Overview

This chapter revisits the problem of detecting greedy behavior in the IEEE 802.11 MAC protocol by evaluating the performance of two previously proposed schemes: DOMINO and the Sequential Probability Ratio Test (SPRT). The evaluation is carried out in four steps.
We first derive a new analytical formulation of the SPRT that takes into account the discrete nature of the problem. Then we develop a new tractable analytical model for DOMINO. As a third step, we evaluate the theoretical performance of the SPRT and DOMINO with newly introduced metrics that take into account the repeated nature of the tests. This theoretical comparison provides two major insights into the problem: it confirms the optimality of the SPRT and motivates us to define yet another test, a nonparametric CUSUM statistic, that shares the same intuition as DOMINO but gives better performance. We conclude this chapter with experimental results confirming the correctness of our theoretical analysis and validating the introduction of the new nonparametric CUSUM statistic.

II Introduction

The problem of deviation from legitimate protocol operation in wireless networks, and the efficient detection of such behavior, has become a significant issue in recent years. In this work we address and quantify the impact of MAC layer attacks that aim at disrupting critical network functionalities and information flow in wireless networks.

It is important to note that the parameters used for deciding whether a protocol participant misbehaves or not should be carefully chosen. For example, choosing the percentage of time a node accesses the channel as a misbehavior metric can result in a high number of false alarms, due to the fact that the other protocol participants might not have anything (or might have significantly less traffic) to transmit within a given observation period. This could easily lead to false accusations against legitimate nodes that have a large amount of data to send. Measuring throughput offers no indication of misbehavior, since it is impossible to define "legitimate" throughput.
Therefore, it is reasonable to monitor misbehavior using either fixed protocol parameters (such as the SIFS or DIFS window size) or parameters that belong to a certain (fixed) range of values (such as the backoff). In this work we derive analytical bounds for two previously proposed protocols for detecting random access misbehavior, DOMINO [40] and SPRT-based tests [41, 42], and show the optimality of the SPRT against the worst-case adversary for all configurations of DOMINO. Following the main idea of DOMINO, we introduce a nonparametric CUSUM statistic that shares the same intuition as DOMINO but gives better performance for all configurations of DOMINO.

A. Background Work

MAC layer protocol misbehavior has been studied in various scenarios and mathematical frameworks. The random nature of access protocols, coupled with the highly volatile nature of the wireless medium, poses the major obstacle to developing a unified framework for misbehavior detection. The goals of a misbehaving peer can range from exploiting available network resources for its own benefit up to disrupting the network. An efficient intrusion detection system should exhibit the capability to detect a wide range of misbehavior policies with an acceptable false alarm rate. This presents a major challenge due to the nature of wireless protocols.

The current literature offers two major approaches in the field of misbehavior detection. The first set of approaches provides solutions based on modifying the current IEEE 802.11 MAC layer protocol by making each protocol participant aware of the backoff values of its neighbors. The approach proposed in [43] assumes the existence of a trustworthy receiver that can detect misbehavior of the sender and penalize it by assigning it higher backoff values for subsequent transmissions. A decision about protocol deviation is reached if the observed number of idle slots of the sender is smaller than a pre-specified fraction of the allocated backoff.
The sender is labeled as misbehaving if it deviates continuously, based on a cumulative metric over a sliding window. The work in [44] attempts to prevent scenarios of colluding sender-receiver pairs by ensuring randomness in the course of the MAC protocol.

A different line of thought is followed in [40, 41, 42], where the authors propose misbehavior detection schemes without making any changes to the actual protocol. In [40] the authors focus on multiple misbehavior policies in the wireless environment and put emphasis on the detection of backoff misbehavior. They propose a sequence of conditions on the available observations for testing the extent to which MAC protocol parameters have been manipulated. The proposed scheme does not address scenarios that include intelligent adaptive cheaters or collaborating misbehaving nodes. The authors in [41, 42] address the detection of an adaptive intelligent attacker by casting the problem of misbehavior detection within the minimax robust detection framework. They optimize the system's performance for the worst-case instance of uncertainty by identifying the least favorable operating point of the system and deriving the strategy that optimizes the system's performance at that point. The system performance is measured in terms of the number of observation samples required to reach a decision (detection delay).

However, DOMINO and the SPRT were presented independently, without direct comparison or performance analysis. Additionally, both approaches evaluate detection performance under unrealistic conditions, such as a probability of false alarm equal to 0.01, which in our simulations results in roughly 700 false alarms per minute (in saturation conditions), a rate that is unacceptable in any real-life implementation.
Our work contributes to the current literature by: (i) deriving a new pmf for the worst-case attack against an SPRT-based detection scheme, (ii) providing new performance metrics that account for the large number of alarms overlooked in the evaluation of previous proposals, (iii) providing a complete analytical model of DOMINO in order to obtain a theoretical comparison to SPRT-based tests, and (iv) proposing an improvement to DOMINO based on the CUSUM test. The rest of the chapter is organized as follows. Sect. III outlines the general setup of the problem. In Sect. IV we propose a minimax robust detection model and derive an expression for the worst-case attack in discrete time. In Sect. V we provide an extensive analysis of DOMINO, followed by the theoretical comparison of the two algorithms in Sect. VI. Motivated by the main idea of DOMINO, we offer a simple extension to the algorithm that significantly improves its performance in Sect. VII. In Sect. VIII we present the experimental performance comparison of all algorithms. Finally, Sect. IX concludes our study. In subsequent sections, the terms "attacker" and "adversary" are used interchangeably.

III Problem Description and Assumptions

Throughout this work we assume the existence of an intelligent adaptive attacker that is aware of the environment and its changes over a given period of time. Consequently, the attacker is able to adjust its access strategy depending on the level of congestion in its environment. Namely, we assume that, in order to minimize the probability of detection, the attacker chooses legitimate over selfish behavior when the level of congestion in the network is low, and chooses an adaptive selfish strategy in congested environments. For these reasons, in both the theoretical and the experimental evaluations we assume a benchmark scenario where all the participants are backlogged, i.e., have packets to send at any given time.
We assume that the attacker will employ the worst-case misbehavior strategy in this setting, and consequently the detection system can estimate the maximal detection delay. It is important to mention that this setting also represents the worst-case scenario with respect to the number of false alarms per unit of time, because the detection system is forced to make the maximum number of decisions per unit of time. In order to characterize the strategy of an intelligent attacker, we assume that both the misbehaving and the legitimate nodes attempt to access the channel simultaneously. Consequently, each station generates a sequence of random backoffs X1, X2, ..., Xi over a fixed period of time. The backoff values X1, X2, ..., Xi of each legitimate protocol participant are distributed according to the probability mass function (pmf) p0(x1, x2, ..., xi). The pmf of a misbehaving participant is unknown to the system and is denoted by p1(x1, x2, ..., xi), where X1, X2, ..., Xi represent the sequence of backoff values generated by the misbehaving node over the same period of time. The assumption that holds throughout this chapter is that a detection agent (e.g., the access point) monitors and collects the backoff values of a given station. It is important to note that observations are not perfect and can be hindered by concurrent transmissions or external sources of noise. It is impossible for a passive monitoring agent to know the backoff stage of a monitored station, due to collisions and to the fact that, in practice, nodes might not be constantly backlogged. Furthermore, in practical applications the number of false alarms raised by anomaly detection schemes is very high. Consequently, instead of building a "normal" profile of network operation with anomaly detection schemes, we utilize specification-based detection.
In our setup we identify the "normal" profile (i.e., behavior consistent with the 802.11 specification) of a backlogged station in IEEE 802.11 without any competing nodes, and note that its backoff process X1, X2, ..., Xi can be characterized by the pmf p0(xi) = 1/(W + 1) for xi ∈ {0, 1, ..., W} and zero otherwise. We claim that this assumption minimizes the probability of false alarms due to imperfect observations, while at the same time maintaining a safe upper bound on the amount of damage a misbehaving station can cause to the network. Although our theoretical results utilize the above expression for p0, the experimental setting utilizes the original implementation of the IEEE 802.11 MAC. In this case, the detection agent needs to deal with observed values of xi larger than W, which can be due to collisions or to the exponential backoff specification in IEEE 802.11. We further discuss this issue in Sect. VIII.

IV Sequential Probability Ratio Test (SPRT)

Due to the nature of the IEEE 802.11 MAC, the backoff measurements are enhanced by an additional sample each time a node attempts to access the channel. Intuitively, this motivates the employment of a sequential detection scheme for the problem at hand. The objective of the detection test is to reach a decision as to whether or not misbehavior occurs with the least number of observations. A sequential detection test is therefore a procedure that decides, with each new observation it obtains, whether or not to collect more samples. Once sufficient information for a decision has been gathered (i.e., the desired levels of the probability of false alarm and the probability of a miss are satisfied), the test proceeds to make a decision. Two quantities are thus involved in decision making: a stopping time N and a decision rule dN which, at the time of stopping, decides between hypotheses H0 (legitimate behavior) and H1 (misbehavior).
We denote the above combination by D = (N, dN). In order to proceed with our analysis we first define the properties of an efficient detector. Intuitively, the starting point in defining a detector should be the minimization of the probability of false alarm P0[dN = 1]. Additionally, a detector should reach its decision as soon as possible, i.e., minimize the expected number of samples E1[N] it collects from a misbehaving station before calling the decision function. Finally, it is also necessary to minimize the probability of deciding that a misbehaving node is acting normally, P1[dN = 0]. The quantities E1[N], P0[dN = 1] and P1[dN = 0] thus form a multi-criteria optimization problem, and not all of them can be optimized at the same time. A natural approach is therefore to fix the accuracy of each decision a priori and minimize the number of samples collected:

inf_{D ∈ T_{a,b}} E1[N]   (4.1)

where T_{a,b} = {(N, dN) : P0[dN = 1] ≤ a and P1[dN = 0] ≤ b}. The solution D* to the above problem (optimality is assured when the data is i.i.d. under both hypotheses) is the SPRT [41], with

S_n = ln [ p1(x1, ..., xn) / p0(x1, ..., xn) ]  and  N = inf { n : S_n ∉ (L, U) }.

The decision rule dN is defined as:

dN = 1 if SN ≥ U,  dN = 0 if SN ≤ L   (4.2)

where L ≈ ln[b/(1 − a)] and U ≈ ln[(1 − b)/a]. Furthermore, by Wald's identity:

E_j[N] = E_j[SN] / E_j[ ln(p1(X)/p0(X)) ] = E_j[SN] / ∑_{x=0}^{W} p_j(x) ln[p1(x)/p0(x)]   (4.3)

with E1[SN] = Lb + U(1 − b) and E0[SN] = L(1 − a) + Ua. The index j = 0, 1 in Eq. (4.3) corresponds to legitimate and adversarial behavior, respectively.

A. Adversary Model

To our knowledge, the current literature does not address the discrete nature of the misbehavior detection problem. This section sets up a theoretical framework for the problem in discrete time. Due to the different nature of the problem, the relations derived in [41, 42] no longer hold, and a new pmf p1* that maximizes the performance of the adversary is derived.
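Wald's identity in Eq. (4.3) can be evaluated numerically once the two per-sample pmfs are fixed. The sketch below (plain Python; the helper names and the truncated-geometric attacker pmf are our own illustrative choices, not part of the original analysis) computes the expected stopping times under both hypotheses for the uniform legitimate pmf of Sect. III.

```python
import math

def sprt_thresholds(a, b):
    """Approximate SPRT thresholds: L = ln(b/(1-a)), U = ln((1-b)/a)."""
    return math.log(b / (1 - a)), math.log((1 - b) / a)

def expected_samples(p0, p1, a, b):
    """E_j[N] via Wald's identity: E_j[S_N] / E_j[ln(p1(X)/p0(X))], for j = 1, 0."""
    L, U = sprt_thresholds(a, b)
    # drifts of the log-likelihood ratio under each hypothesis
    drift1 = sum(q1 * math.log(q1 / q0) for q0, q1 in zip(p0, p1) if q1 > 0)
    drift0 = sum(q0 * math.log(q1 / q0) for q0, q1 in zip(p0, p1) if q0 > 0 and q1 > 0)
    E1_SN = L * b + U * (1 - b)   # expected value of S_N under H1
    E0_SN = L * (1 - a) + U * a   # expected value of S_N under H0
    return E1_SN / drift1, E0_SN / drift0

W = 31
p0 = [1.0 / (W + 1)] * (W + 1)            # uniform pmf of a legitimate station
raw = [0.9 ** x for x in range(W + 1)]    # hypothetical attacker favoring small backoffs
p1 = [v / sum(raw) for v in raw]
E1N, E0N = expected_samples(p0, p1, a=1e-6, b=0.1)
```

E1N is the expected number of backoff samples before this attacker is flagged; E0N is the expected number of samples per decision on a legitimate node.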
We assume the adversary has full control over the probability mass function p1 and the backoff values it generates. In addition, we assume that the adversary is intelligent, i.e., it knows everything the detection agent knows and can infer the same conclusions as the detection agent.

Goal of the Adversary: we assume the objective of the adversary is to design an access policy with a resulting probability of channel access PA, while minimizing the probability of detection. As already mentioned, the optimal access policy generates backoff sequences according to the pmf p1*(x).

Theorem 1: The probability that the adversary accesses the channel before any other terminal, when competing with n neighboring (honest) terminals for channel access in saturation conditions, is

Pr[Access] ≡ PA = 1 / (1 + n E1[X]/E0[X])   (4.4)

Note that when E1[X] = E0[X] the probability of access is equal for all n + 1 competing nodes (including the adversary), i.e., all of them have access probability 1/(n + 1). We omit the proof of this theorem and refer the reader to [42] for the detailed derivation. Solving the above equation for E1[X] gives us a constraint on p1. That is, p1 must satisfy

E1[X] = [(1 − PA)/(n PA)] E0[X]   (4.5)

We now let g = (1 − PA)/(n PA) in order to parametrize the adversary by the scalar g, which intuitively denotes the level of misbehavior of the adversary. For PA ∈ [1/(1 + n), 1], we have g ∈ [0, 1]; g = 1 corresponds to legitimate behavior and g = 0 to complete misbehavior. Now, for any given g, p1 must belong to the class of allowed probability mass functions A_g, where

A_g ≡ { q : ∑_{x=0}^{W} q(x) = 1 and ∑_{x=0}^{W} x q(x) = g E0[X] }   (4.6)

After defining its desired access probability PA (or equivalently g), the second objective of the attacker is to maximize the amount of time it can misbehave without being detected.
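The mapping between PA and g, and membership in the class A_g, are one-line computations. The sketch below (plain Python, hypothetical helper names) evaluates g for a legitimate node and for an attacker taking 60% of the accesses, and checks the two constraints of Eq. (4.6) for a candidate pmf.

```python
def misbehavior_coefficient(P_A, n):
    """g = (1 - P_A) / (n * P_A); g = 1 is legitimate, g = 0 is full misbehavior."""
    return (1.0 - P_A) / (n * P_A)

def in_A_g(q, g, W, tol=1e-9):
    """Check q against A_g: q is a pmf on {0..W} whose mean is g * E0[X] = g * W / 2."""
    return (abs(sum(q) - 1.0) < tol
            and abs(sum(x * qx for x, qx in enumerate(q)) - g * W / 2.0) < tol)

n = 2                                                  # two honest competitors
g_legit = misbehavior_coefficient(1.0 / (n + 1), n)    # fair share: g = 1
g_greedy = misbehavior_coefficient(0.6, n)             # 60% of accesses: g = 1/3
```

The uniform pmf itself trivially belongs to A_1, since its mean is exactly E0[X] = W/2.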
Assuming that the adversary has full knowledge of the employed detection test, it attempts to find the access strategy (with pmf p1) that maximizes the expected duration of misbehavior before an alarm is fired. From Eq. (4.3), the attacker thus needs to minimize the following objective function:

min_{p1 ∈ A_g} ∑_{x=0}^{W} p1(x) ln[p1(x)/p0(x)]   (4.7)

Theorem 2: Let g ∈ (0, 1) denote the objective of the adversary. The pmf p1* that minimizes Eq. (4.7) can be expressed as

p1*(x) = r^x (r^{-1} − 1)/(r^{-1} − r^W)  for x ∈ {0, 1, ..., W}, and 0 otherwise   (4.8)

where r is the solution to the following equation:

[W r^W − r^{-1}(W r^W + r^W − 1)] / [(r^{-1} − 1)(r^{-1} − r^W)] = g W/2   (4.9)

Proof: Notice first that the objective function is convex in p1. We let qε(x) = p1*(x) + ε h(x) and construct the Lagrangian of the objective function and the constraints:

∑_{x=0}^{W} qε(x) ln[qε(x)/p0(x)] + µ1 ( ∑_{x=0}^{W} qε(x) − 1 ) + µ2 ( ∑_{x=0}^{W} x qε(x) − g E0[X] )   (4.10)

By taking the derivative with respect to ε and equating this quantity to zero for all possible sequences h(x), we find that the optimal p1* has to be of the form

p1*(x) = p0(x) e^{−µ2 x − µ0}   (4.11)

where µ0 = µ1 + 1. In order to obtain the values of the Lagrange multipliers µ0 and µ2, we use the fact that p0(x) = 1/(W + 1), together with the constraints in A_g. The first constraint states that p1* must be a pmf; setting the sum of Eq. (4.11) over x equal to one and solving for µ0, we obtain

µ0 = ln ∑_{x=0}^{W} p0(x) r^x = ln [ (1/(W + 1)) (r^{-1} − r^W)/(r^{-1} − 1) ]   (4.12)

where r = e^{−µ2}. Replacing this solution in Eq. (4.11) we get

p1*(x) = r^x (r^{-1} − 1)/(r^{-1} − r^W)   (4.13)

The second constraint in A_g is rewritten in terms of Eq. (4.13) as

[(r^{-1} − 1)/(r^{-1} − r^W)] ∑_{x=0}^{W} x r^x = g E0[X]   (4.14)

from which Eq. (4.9) follows. Fig. 4.1 illustrates the optimal distribution p1* for two values of the parameter g.

B. SPRT Optimality for any Adversary in A_g

Let Φ(D, p1) = E1[N].
We notice that the above solution was obtained in the form

max_{p1 ∈ A_g} min_{D ∈ T_{a,b}} Φ(D, p1)   (4.15)

That is, we first minimized Φ(D, p1) with the SPRT (minimization for any p1) and then found the p1* that maximized Φ(SPRT, p1*).

Figure 4.1: Form of the least favorable pmf p1* for two different values of g (g = 0.98 and g = 0.5038). When g approaches 1, p1* approaches p0. As g decreases, more of the mass of p1* concentrates on the smaller backoff values.

However, an optimal detector needs to minimize all losses due to the worst-case attacker. That is, the optimal test should in principle be obtained from the following optimization problem:

min_{D ∈ T_{a,b}} max_{p1 ∈ A_g} Φ(D, p1)   (4.16)

Fortunately, our solution also satisfies this optimization problem, since it forms a saddle point equilibrium. This results in the following theorem:

Theorem 3: For every D ∈ T_{a,b} and every p1 ∈ A_g,

Φ(D*, p1) ≤ Φ(D*, p1*) ≤ Φ(D, p1*)   (4.17)

We omit the proof of the theorem, since its derivation follows reasoning similar to that in [42]. As a consequence of this theorem, neither player (the detection agent or the misbehaving node) has an incentive to deviate from (D*, p1*).

C. Evaluation of Repeated SPRT

The original setup of SPRT-based misbehavior detection proposed in [41] was better suited for on-demand monitoring of suspicious nodes (e.g., a higher-layer monitoring agent requests the SPRT to monitor a given node because it is behaving suspiciously, and once the SPRT reaches a decision it stops monitoring) and was not implemented as a repeated test. On the other hand, the configuration of DOMINO is suited for continuous monitoring of neighboring nodes. In order to obtain a fair performance comparison of both tests, a repeated SPRT algorithm is implemented: whenever dN = 0, the SPRT restarts with S0 = 0. This setup allows a detection agent to detect misbehavior for both short and long-term attacks.
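The repeated SPRT just described can be sketched in a few lines. In the sketch below (plain Python; the function name, the small W, and the decision to also restart after an alarm are our own illustrative assumptions for continuous monitoring), the statistic restarts at zero whenever it crosses the lower threshold, and an alarm is recorded whenever it crosses the upper one.

```python
import math

def repeated_sprt(samples, p0, p1, a, b):
    """Return the sample indices at which the repeated SPRT raises an alarm."""
    L = math.log(b / (1 - a))
    U = math.log((1 - b) / a)
    S = 0.0
    alarms = []
    for i, x in enumerate(samples):
        S += math.log(p1[x] / p0[x])   # per-sample log-likelihood ratio
        if S >= U:
            alarms.append(i)
            S = 0.0                    # restart after an alarm (continuous monitoring)
        elif S <= L:
            S = 0.0                    # dN = 0: restart the test (repeated SPRT)
    return alarms

W = 7
p0 = [1.0 / (W + 1)] * (W + 1)
raw = [0.5 ** x for x in range(W + 1)]   # hypothetical attacker pmf
p1 = [v / sum(raw) for v in raw]
# a stream of constant minimal backoffs trips the alarm after only a few samples
alarms = repeated_sprt([0] * 50, p0, p1, a=0.01, b=0.1)
```

With these parameters each alarm requires four zero-backoff samples, so the first alarm fires at index 3 and the test keeps re-firing every four samples.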
The major problem that arises from this setup is that continuous monitoring can raise a large number of false alarms if the parameters of the test are not chosen appropriately. This section therefore proposes a new evaluation metric for the continuous monitoring of misbehaving nodes. We believe that the performance of the detection algorithms is appropriately captured by employing the expected time before detection E[TD] and the average time between false alarms E[TFA] as the evaluation parameters.

Figure 4.2: Tradeoff curve between the expected number of samples for a false alarm E[TFA] and the expected number of samples for detection E[TD], for the SPRT with a = 1×10^{−6} and b = 0.1. For fixed a and b, as g increases (low intensity of the attack) the time to detection or to a false alarm increases exponentially.

The above quantities are straightforward to compute for the SPRT. Each time the SPRT stops, the decision function can be modeled as a Bernoulli trial with parameters a and 1 − b; the waiting time until the first success is then a geometric random variable. Therefore:

E[TFA] = E0[N]/a  and  E[TD] = E1[N]/(1 − b)   (4.18)

Fig. 4.2 illustrates the tradeoff between these variables for different values of the parameter g. It is important to note that the chosen values of the parameter a in Fig. 4.2 are small. We claim that this represents an accurate estimate of the false alarm rates that need to be satisfied in actual anomaly detection systems [29, 7], a fact that was not taken into account in the evaluation of previously proposed systems.

V Performance analysis of DOMINO

We now present the general outline of the DOMINO detection algorithm. The first step of the algorithm is the computation of the average value of the backoff observations: Xac = (1/m) ∑_{i=1}^{m} Xi.
In the next step, the averaged value is compared to a reference backoff value: Xac < γB, where the parameter γ (0 < γ < 1) is a threshold that controls the tradeoff between the false alarm rate and missed detections. The algorithm maintains a variable cheat_count which stores the number of times the average backoff has fallen below the threshold γB; DOMINO raises an alarm once cheat_count exceeds the threshold K. A forgetting factor is applied to cheat_count if the monitored station behaves normally in the next monitoring period. That is, the node is partially forgiven: cheat_count = cheat_count − 1 (as long as cheat_count remains greater than zero). We now present the actual detection algorithm from [40]. The algorithm is initialized with cheat_count = 0, and after collecting m samples the following routine is executed, where condition is defined as (1/m) ∑_{i=1}^{m} Xi ≤ γB:

if condition
    cheat_count = cheat_count + 1
    if cheat_count > K
        raise alarm
    end
elseif cheat_count > 0
    cheat_count = cheat_count − 1
end

Figure 4.3: For K = 3, the state of the variable cheat_count can be represented as a Markov chain with five states. When cheat_count reaches the final state (4 in this case) DOMINO raises an alarm.

It is now easy to observe that DOMINO is a sequential test with N = m · Nt, where Nt represents the number of steps cheat_count takes to exceed K, and dN = 1 every time the test stops. We evaluate DOMINO and the SPRT with the same performance metrics. However, unlike the SPRT, where a controls the number of false alarms and b controls the detection rate, the parameters m, γ and K in DOMINO are difficult to tune because there has been no analysis of their performance. The correlation between DOMINO and SPRT parameters is further addressed in Sect. VIII. In order to provide an analytical model for the performance of the algorithm, we model the detection mechanism in two steps.
First, we define p := Pr[(1/m) ∑_{i=1}^{m} Xi ≤ γB]. Second, we define a Markov chain with transition probabilities p and 1 − p, whose absorbing state represents the case when misbehavior is detected (note that we assume m is fixed, so p does not depend on the number of observed backoff values). A Markov chain for K = 3 is shown in Fig. 4.3. We can now write

p = p_j = P_j[ (1/m) ∑_{i=1}^{m} Xi ≤ γB ],  j ∈ {0, 1}

where j = 0 corresponds to the scenario where the samples Xi are generated by a legitimate station with pmf p0(x), and j = 1 corresponds to the samples being generated by p1*(x). In the remainder of this section we assume B = E0[Xi] = W/2. We now derive the expression for p for the case of a legitimate monitored node. Following the reasoning of Sect. III, we assume that each Xi is uniformly distributed on {0, 1, ..., W}. It is important to note that this analysis provides a lower bound on the probability of false alarm when the minimum contention window of size W + 1 is assumed. Using the definition of p we derive the following expression:

p = P0[ ∑_{i=1}^{m} Xi ≤ mγB ]
  = ∑_{k=0}^{⌊mγB⌋} P0[ ∑_{i=1}^{m} Xi = k ]   (4.19)
  = ∑_{k=0}^{⌊mγB⌋} |{(x1, ..., xm) : ∑_{i=1}^{m} xi = k}| / (W + 1)^m

where the last equality follows from the fact that the Xi's are i.i.d. with pmf p0(xi) = 1/(W + 1) for all xi ∈ {0, 1, ..., W}. The number of ways that m nonnegative integers can sum up to k is C(m + k − 1, k), and ∑_{k=0}^{L} C(m + k − 1, k) = C(m + L, L). An additional constraint is imposed by the fact that each xi can only take values up to W, which is in general smaller than k, and thus the above combinatorial formula cannot be applied directly. Furthermore, a direct computation of the number of ways m bounded integers sum up to k is very expensive. As an example, let W + 1 = 32 = 2^5 and m = 10; the direct summation needed for the calculation of p requires on the order of (W + 1)^m = 2^50 iterations. Fortunately, an efficient alternative way of computing P0[∑_{i=1}^{m} Xi = k] exists. We first define Y := ∑_{i=1}^{m} Xi.
The moment generating function of Y, MY(s) = MX(s)^m, can be computed as follows:

MY(s) = [1/(W + 1)^m] (1 + e^s + ··· + e^{Ws})^m
      = [1/(W + 1)^m] ∑_{k0 + k1 + ··· + kW = m} ( m; k0, k1, ..., kW ) e^{s(k1 + 2k2 + ··· + W kW)}   (4.20)

where (m; k0, k1, ..., kW) = m!/(k0! k1! ··· kW!) is the multinomial coefficient. By comparing terms, we observe that Pr[Y = k] is the coefficient of the term e^{ks} in Eq. (4.20). This result can be used for the efficient computation of p via Eq. (4.19). Alternatively, we can approximate p for large values of m: as m increases, Y converges to a Gaussian random variable by the Central Limit Theorem. Thus,

p = Pr[Y ≤ mγB] ≈ Φ(z),  where  z = (mγB − mW/2) / sqrt( m W(W + 2)/12 )

and Φ is the cumulative distribution function of the standard normal distribution.

Figure 4.4: Exact and approximate values of p as a function of m, for γ = 0.9 and W + 1 = 32.

Fig. 4.4 illustrates the exact and approximate calculation of p as a function of m, for γ = 0.9 and W + 1 = 32, and shows the accuracy of the above approximation for both small and large values of m. The computation of p = p1 follows the same steps (although the moment generating function cannot be easily expressed in analytical form, it is still computationally tractable) and is therefore omitted.

A. Expected Time to Absorption in the Markov Chain

We now derive the expression for the expected time to absorption for a Markov chain with K + 2 states. Let µi be the expected number of transitions until absorption, given that the process starts at state i. In order to compute the stopping times E[TD] and E[TFA], it is necessary to find the expected time to absorption starting from state zero, µ0. Then E[TD] = m × µ0 (computed under p = p1) and E[TFA] = m × µ0 (computed under p = p0). The expected times to absorption µ0, µ1, ..., µK+1 are the unique solution of the equations

µ_{K+1} = 0
µ_i = 1 + ∑_{j=0}^{K+1} p_{ij} µ_j  for i ∈ {0, 1, ..., K}

where p_{ij} is the transition probability from state i to state j. For any K, these equations can be represented in matrix form:

[ −p    p    0   ···   0  ] [ µ0 ]   [ −1 ]
[ 1−p  −1    p   ···   0  ] [ µ1 ]   [ −1 ]
[  0   1−p  −1    p    0  ] [ µ2 ] = [ −1 ]
[  ⋮         ⋱    ⋱    p  ] [  ⋮ ]   [  ⋮ ]
[  0   ···   0   1−p  −1  ] [ µK ]   [ −1 ]

For example, solving the above equations for µ0 with K = 3, the following expression is derived:

µ0 = (1 − p + 2p² + 2p³) / p⁴

The expression for µ0 for any other value of K is obtained in similar fashion.

Figure 4.5: DOMINO performance for K = 3 and an attacker with g = 0.503759; m ranges from 1 to 60 and γ is shown explicitly. As γ tends to either 0 or 1, the performance of DOMINO decreases. The SPRT outperforms DOMINO regardless of γ and m.

VI Theoretical Comparison

In this section we compare the tradeoff curves between E[TD] and E[TFA] for both algorithms. For the sake of concreteness we compare both algorithms for an attacker with g = 0.5. Similar results were observed for other values of g. For the SPRT we set b = 0.1 arbitrarily and vary a from 10^{−1/2} down to 10^{−10} (motivated by the realistically low false alarm rates required by actual intrusion detection systems [29]). Since it is not clear how the DOMINO parameters m, K and γ affect our metrics, we vary all the available parameters in order to obtain a fair comparison. Fig. 4.5 illustrates the performance of DOMINO for K = 3 (the default threshold used in [40]). Each curve for γ has m ranging between 1 and 60, where m represents the number of samples needed for reaching a decision. Observing the results in Fig. 4.5, it is easy to conclude that the best performance of DOMINO is obtained for γ = 0.7, regardless of m.
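The two quantities behind the DOMINO model above can be checked numerically. The sketch below (plain Python, hypothetical names) computes p = P0[∑ Xi ≤ mγB] exactly by repeated convolution of the uniform pmf, the efficient alternative to direct enumeration, and solves the absorption-time system via a telescoping recursion that reproduces the closed form of µ0 for K = 3.

```python
from math import floor

def pmf_sum_uniform(m, W):
    """pmf of Y = X1 + ... + Xm with the Xi uniform on {0..W}, via convolution."""
    base = [1.0 / (W + 1)] * (W + 1)
    pmf = [1.0]
    for _ in range(m):
        out = [0.0] * (len(pmf) + W)
        for i, pi in enumerate(pmf):
            for j, bj in enumerate(base):
                out[i + j] += pi * bj
        pmf = out
    return pmf

def false_alarm_p(m, gamma, W):
    """p = P0[(1/m) * sum(Xi) <= gamma * B] with B = W/2."""
    pmf = pmf_sum_uniform(m, W)
    cutoff = floor(m * gamma * W / 2.0)
    return sum(pmf[:cutoff + 1])

def mu0(p, K):
    """Expected time to absorption from state 0 of the cheat_count chain.

    Solves mu_{K+1} = 0, mu_i = 1 + p*mu_{i+1} + (1-p)*mu_{i-1} (with mu_{-1} := mu_0)
    through the differences d_i = mu_i - mu_{i+1} = 1/p + ((1-p)/p) * d_{i-1}.
    """
    q = 1.0 - p
    d, total = 0.0, 0.0
    for _ in range(K + 1):
        d = 1.0 / p + (q / p) * d
        total += d
    return total   # mu_0 = d_0 + d_1 + ... + d_K
```

For m = 1, γ = 0.9 and W + 1 = 32 this gives p = 14/32, consistent with the left end of Fig. 4.4, and mu0 matches (1 − p + 2p² + 2p³)/p⁴ for K = 3.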
Therefore, this value of γ is adopted as the optimal threshold in further experiments, in order to obtain a fair comparison of the two algorithms. Fig. 4.6 presents the evaluation of DOMINO for γ = 0.7 with varying threshold K. For each value of K, m ranges from 1 to 60. In this figure, however, we notice that as K increases, the points with m = 1 form a performance curve that is better than any other point with m > 1. Consequently, Fig. 4.7 presents the best possible performance of DOMINO; that is, we let m = 1 and change K from one up to one hundred. We again test different γ values for this configuration, and conclude that the best γ is still close to the optimal value of 0.7 found in the experiments of Fig. 4.5. However, even with the optimal setting, DOMINO is outperformed by the SPRT.

Figure 4.6: DOMINO performance for various thresholds K, γ = 0.7 and m in the range from 1 to 60, against an attacker with g = 0.5037. The performance of DOMINO decreases as m increases. For fixed γ, the SPRT outperforms DOMINO for all values of the parameters K and m.

Figure 4.7: The best possible performance of DOMINO is obtained when m = 1 and K changes in order to accommodate the desired level of false alarms. The best γ must be chosen independently.

VII Nonparametric CUSUM statistic

As concluded in the previous section, DOMINO exhibits suboptimal performance for every possible configuration of its parameters. However, the original idea of DOMINO is very intuitive and simple: it compares the observed backoff of the monitored nodes with the expected backoff of honest nodes within a given period of time.
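The DOMINO update rule recalled here (Sect. V) takes only a few lines. The sketch below is a plain Python restatement of the algorithm from [40] with hypothetical names; resetting cheat_count after an alarm is our own assumption, made so the test can keep running under continuous monitoring.

```python
def domino(backoffs, m, gamma, K, B):
    """Run DOMINO over a stream of backoff samples; return the indices of the
    monitoring periods at which an alarm is raised. cheat_count is clamped at 0."""
    cheat_count = 0
    alarms = []
    for period in range(len(backoffs) // m):
        window = backoffs[period * m:(period + 1) * m]
        if sum(window) / m <= gamma * B:    # suspicious monitoring period
            cheat_count += 1
            if cheat_count > K:
                alarms.append(period)
                cheat_count = 0             # restart after an alarm (our assumption)
        elif cheat_count > 0:
            cheat_count -= 1                # partial forgiveness
    return alarms
```

With m = 1, K = 3, γ = 0.7 and B = 15.5 (i.e., W + 1 = 32), a stream of zero backoffs triggers an alarm every fourth period, while consistently large backoffs never do.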
This section extends this idea to create a test that we believe has better performance than DOMINO, while still preserving its simplicity. Inspired by the notion of nonparametric statistics for change detection, we adapt the nonparametric cumulative sum (CUSUM) statistic and apply it in our analysis. The nonparametric CUSUM is initialized with y0 = 0 and updates its value as follows:

y_i = (y_{i−1} − x_i + γB)^+   (4.20)

The alarm is fired whenever y_i > c. Assuming E0[X] > γB and E1[X] < γB (i.e., the expected backoff value of an honest node is always larger than the given threshold, and that of a misbehaving node is smaller), the properties of the CUSUM test with regard to the expected false alarm and detection times are captured by the following theorem.

Theorem 4: The probability of firing a false alarm decreases exponentially with c; formally, as c → ∞,

sup_i | ln( P0[Y_i > c] ) | = O(c)   (4.21)

Furthermore, the delay in detection increases only linearly with c; formally, as c → ∞,

T_D = c / (γB − E1[X])   (4.22)

The proof is a straightforward extension of the case originally considered in [45]. It is easy to observe that the CUSUM test is similar to DOMINO, with c being equivalent to the upper threshold K in DOMINO, and the statistic y in CUSUM being equivalent to the variable cheat_count in DOMINO when m = 1. The main difference between DOMINO with m = 1 and the CUSUM statistic is that every time there is a "suspicious event" (i.e., whenever x_i ≤ γB), cheat_count is increased by one, whereas y_i is increased by an amount proportional to the level of suspected misbehavior. Similarly, when x_i > γB, cheat_count is decreased only by one (or kept at zero), while y_i is decreased by x_i − γB (or reset to zero if y_{i−1} − x_i + γB < 0).

VIII Experimental Results

We now proceed to the experimental evaluation of the analyzed detection schemes.
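The nonparametric CUSUM statistic of Eq. (4.20) is equally compact in code. The sketch below (plain Python, hypothetical names) applies the update y_i = (y_{i−1} − x_i + γB)^+ and, as with the repeated SPRT, restarts the statistic after each alarm so that monitoring can continue.

```python
def cusum(backoffs, gamma, B, c):
    """Nonparametric CUSUM: y_i = max(y_{i-1} - x_i + gamma*B, 0); alarm when y_i > c.
    The statistic restarts at zero after each alarm (continuous monitoring)."""
    y = 0.0
    alarms = []
    for i, x in enumerate(backoffs):
        y = max(y - x + gamma * B, 0.0)   # grows by gamma*B - x_i, clamped at zero
        if y > c:
            alarms.append(i)
            y = 0.0
    return alarms
```

With γ = 0.7 and B = 15.5, each zero backoff adds 10.85 to the statistic, so a threshold of c = 30 is crossed every third sample of an all-zero stream, while large honest backoffs keep y pinned at zero.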
As already mentioned, we assume the existence of an intelligent adaptive attacker that is able to adjust its access strategy depending on the level of congestion in the environment. Namely, we assume that, in order to minimize the probability of detection, the attacker chooses legitimate over selfish behavior when the congestion level is low, and an adaptive selfish strategy in congested environments. For the above reasons, the experiments are constructed around the scenario where all the participants are backlogged, i.e., have packets to send at any given time. We assume that the attacker will employ the worst-case misbehavior strategy in this setting, enabling the detection system to estimate the maximal detection delay. It is important to mention that this setting also represents the worst-case scenario with regard to the number of false alarms per unit of time, because the detection system is forced to make the maximum number of decisions per unit of time; we expect the number of alarms to be smaller in practice. The backoff distribution of an optimal attacker was implemented in the Opnet network simulator, and tests were performed for various levels of false alarms. We note that the simulations were performed with nodes that followed the standard IEEE 802.11 access protocol (with exponential backoff). The results presented in this work correspond to the scenario consisting of two legitimate nodes and one selfish node competing for channel access. The detection agent was implemented such that any backoff value Xi > W was truncated to W. We acknowledge that this is an arbitrary approximation, but our experiments show that it works well in practice. It is important to mention that the resulting performance comparison of DOMINO, CUSUM and the SPRT does not change with the number of competing nodes: the SPRT always exhibits the best performance.
In order to demonstrate the performance of all detection schemes for more aggressive attacks, we choose to present the results for the scenario where the attacker attempts to access channel for 60% of the time (as opposed to 33% if it was behaving legitimately). The backlogged environment in Opnet was created by employing a relatively high packet arrival rate per unit of time: the results were collected for the exponential(0.01) packet arrival rate and the packet size was 2048 bytes. The results for both legitimate and malicious behavior were collected over a fixed period of 100s. In order to obtain fair performance comparison, a performance metric different from the one in [42] was adopted. The evaluation was performed as a tradeoff between the average time to detection and the average time to false alarm. It is important to mention that the theoretical performance evaluation of both DOMINO and SPRT was measured in number of samples. Here, however, we take advantage of the experimental setup and measure time in number of seconds, a quantity that is more meaningful and intuitive in practice. We now proceed to the experimental performance analysis of SPRT, CUSUM and DOMINObased detection schemes. Fig. 4.8 represents the first step in our evaluation. We evaluated the performance of the SPRT using the same parameters as in the theoretical analysis in Sect. VI. DOMINO was evaluated for fixed γ = 0.9, which corresponds to the value used in the experimental evaluation in [40]. In order to compare the performance to SPRT, we vary the value of K, which essentially determines the number of false alarms. We observe the performance of DOMINO for 2 different values of parameter m. As it can be seen from Fig 4.8, SPRT outperforms DOMINO for all values of K and m. We note that the best performance of DOMINO was obtained for m = 1 (the detection delay is smaller when the decision is made after every sample). Therefore, we adopt m = 1 for further analysis of DOMINO. Fig. 
4.9 reproduces the setting used for theoretical analysis in Fig. 4.7. Naturally, we obtain the same results as in Fig. 4.7 and choose γ = 0.7 for the final performance analysis of DOMINO. After finding the optimal values of γ and m we now perform final evaluation of DOMINO, CUSUM and SPRT. The results are presented in Fig. 4.10. We observe that even for the optimal setting of DOMINO, the SPRT outperforms it for all values of K. We also note that due to the reasons explained in Sect. VII, the CUSUM test experiences detection delays similar to the ones of the SPRT. If the logarithmic x-axis in the tradeoff curves in Sect. VI is replaced with a linear one, we can better appreciate how accurately our theoretical results match the experimental evidence (Fig. 4.11). IX Conclusions and future work In this work, we performed extensive analytical and experimental comparison of the existing misbehavior detection schemes in the IEEE 802.11 MAC. We confirm the optimality of 49 Tradeoff curves for DOMINO, CUSUM and SPRT 0.8 DOMINO, m=20 DOMINO, m=1 CUSUM SPRT 0.7 0.6 Td 0.5 0.4 0.3 0.2 0.1 0 0 20 40 60 80 100 Tfa Figure 4.8: Tradeoff curves for each of the proposed algorithms. DOMINO has parameters γ = 0.9 and m = 1 while K is the variable parameter. The nonparametric CUSUM algorithm has as variable parameter c and the SPRT has b = 0.1 and a is the variable parameter. Tradeoff curves for SPRT, DOMINO withγ=0.9 and γ=0.7 DOMINO: γ=0.9 DOMINO: γ=0.7 SPRT 0.5 T d 0.4 0.3 0.2 0.1 0 0 20 40 60 80 100 Tfa Figure 4.9: Tradeoff curves for default DOMINO configuration with γ = 0.9, best performing DOMINO configuration with γ = 0.7 and SPRT. Tradeoff curves for optimal DOMINO configuration, CUSUM and SPRT DOMINO: γ=0.7 CUSUM: γ=0.7 SPRT 0.2 0.18 0.16 0.14 Td 0.12 0.1 0.08 0.06 0.04 0.02 0 0 20 40 60 80 100 Tfa Figure 4.10: Tradeoff curves for best performing DOMINO configuration with γ = 0.7, best performing CUSUM configuration with γ = 0.7 and SPRT. 
Figure 4.11: Comparison between theoretical and experimental results for DOMINO with K = 3 and an attacker with g = 0.503759: the theoretical analysis with a linear x-axis closely resembles the experimental results.

the SPRT-based detection schemes and provide an analytical and intuitive explanation of why the other schemes exhibit suboptimal performance compared to the SPRT schemes. In addition, we offer an extension to the DOMINO algorithm that preserves its original idea and simplicity while significantly improving its performance. Our results show the value of a rigorous formulation of the problem and of providing a formal adversarial model, since the resulting solutions can outperform heuristic ones. We believe our model applies not only to the MAC layer but to more general adversarial settings. In several practical security applications, such as biometrics, spam filtering, and watermarking, the attacker has control over the attack distribution, and this distribution can be modeled in a fashion similar to our approach. We now mention some issues for further study. This chapter focused on the performance analysis and comparison of the existing detection schemes. A first issue concerns the employment of penalizing functions against misbehaving nodes once an alarm is raised. When an alarm is raised, penalties such as the access point never acknowledging the receipt of a packet (in order to rate-limit the access of the node) or denying access to the medium for a limited period of time should be considered. If constant misbehavior is exhibited even after being penalized, the system should apply more severe penalties, such as revocation from the network. A second issue concerns defining alternative objectives of the adversary, such as maximizing throughput while minimizing the probability of detection.
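The SPRT at the core of this comparison can be sketched in a few lines. The Bernoulli channel-access model and the specific probabilities below (33% legitimate versus 60% misbehaving access, matching the attack scenario above) are illustrative assumptions, not the exact Opnet configuration:

```python
import math
import random

def sprt(samples, p0=0.33, p1=0.6, a=0.01, b=0.1):
    """Wald's SPRT over Bernoulli channel-access indicators.

    a bounds the false-alarm probability, b the missed-detection
    probability. Returns ("attack" or "legit", samples consumed).
    """
    upper = math.log((1 - b) / a)   # cross -> accept H1 (misbehavior)
    lower = math.log(b / (1 - a))   # cross -> accept H0 (legitimate)
    s = 0.0
    for n, x in enumerate(samples, 1):
        # accumulate the log-likelihood ratio of one observation
        s += math.log(p1 / p0) if x else math.log((1 - p1) / (1 - p0))
        if s >= upper:
            return "attack", n
        if s <= lower:
            return "legit", n
    return "undecided", len(samples)

random.seed(0)
attacker = [random.random() < 0.6 for _ in range(10_000)]
print(sprt(attacker))   # misbehavior is flagged after a handful of samples
```

The number of samples consumed is the quantity plotted (converted to seconds) as the detection delay in the tradeoff curves.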
Chapter 5

Secure Data Hiding Algorithms

It will readily be seen that in this case the alleged right of the Duke to the whale was a delegated one from the Sovereign. We must needs inquire then on what principle the Sovereign is originally invested with that right. -Moby Dick, Herman Melville

I Overview

Digital watermarking allows hidden data, such as a fingerprint or message, to be placed on a media signal (e.g., a sound recording or a digital image). When a watermark detector is given this media signal, it should be able to correctly decode the original embedded fingerprint. In this chapter we give a final example of the application of our framework, to the problem of watermark verification. In the next section we provide a general formulation of the problem. Then, in section III, we present a watermark verification problem where the adversary is parameterized by a Gaussian distribution. Finally, in section IV, we model a watermark verification problem where the adversary is given complete control over its attack distribution.

II General Model

A. Problem Description

The watermark verification problem consists of two main algorithms, an embedding algorithm E and a detection algorithm D.

• The embedder E receives as inputs a signal s and a bit m. The objective of the embedder is to produce a signal x with no perceptual loss of information or major differences from s, but that also carries the information m. The general formulation of this requirement is to force the embedder to satisfy the constraint D(x, s) < Dw, where D(·) is a distortion function and Dw is an upper bound on the amount of distortion allowed for embedding.

• The detection algorithm D receives a signal y, which is assumed to be x or a version of x altered either by natural errors (e.g., channel losses) or by an adversary. The detector has to determine whether y was embedded with m = 1 or not.
• In order to facilitate the detection process, the embedder and the detector usually share a random secret key k (e.g., the seed for a pseudorandom number generator that is used to create an embedding pattern p). This key can be initialized in the devices or exchanged via a secure out-of-band channel.

B. Adversary Model

• Information available to the adversary: We assume the adversary knows the embedding and detection algorithms E and D. Furthermore, we assume the adversary knows the distributions of K, S and M; however, it does not know the particular realizations of these values. That is, the adversary does not know the particular realization of k, s, or m, and therefore does not know the input arguments of E.

• Capabilities of the adversary: The adversary is a man-in-the-middle between the embedder and the detector. It can therefore intercept the signal x produced by E and send in its place a signal y to the detector. However, the output of the adversary should satisfy a distortion constraint D(y, s) < Da, since y should be perceptually similar to s.

C. Design Space

The previous description can be summarized by the following diagram:

Ek(m, s): x ← f(x|s, m, k)  --x-->  A: y ← f(y|x)  --y-->  Dk: ln[ f(y|k, m=1) / f(y|k, m=0) ] ≷ τ  (decide H1 if above τ, H0 if below)

where i ← f(i|j) means that i is sampled from a distribution with pdf f(i|j). Alternatively, i can be understood to be the output of an efficient probabilistic algorithm with input j and output i. Therefore, from now on we will use Ek(m, s) and A(x) interchangeably with f(x|s, m, k) and f(y|x) (respectively).
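As a toy instantiation of this diagram (a sketch only: the additive key-derived pattern, the Gaussian attack channel, and the correlation detector below are illustrative assumptions, not the optimal schemes derived later in the chapter):

```python
import numpy as np

rng = np.random.default_rng(42)

def embed(s, m, key, strength=1.0):
    """E_k(m, s): additively embed bit m using a key-derived pattern p."""
    p = np.random.default_rng(key).standard_normal(s.size)
    return s + strength * m * p

def attack(x, sigma=0.5):
    """A: a man-in-the-middle perturbation y <- f(y|x), here Gaussian."""
    return x + sigma * rng.standard_normal(x.size)

def detect(y, key, strength=1.0):
    """D_k(y): correlate y against the key pattern (a simple stand-in
    for the log-likelihood ratio test)."""
    p = np.random.default_rng(key).standard_normal(y.size)
    return 1 if y @ p > 0.5 * strength * (p @ p) else 0

s = rng.standard_normal(1024)        # host signal s ~ f(s)
y = attack(embed(s, m=1, key=7))
print(detect(y, key=7))              # recovers m = 1
```

The shared key (here a PRNG seed) plays exactly the role of k above: both endpoints regenerate the same pattern p from it.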
Notice that an optimal detection algorithm (optimal in the sense that it minimizes several evaluation metrics, such as the probability of error, the expected cost, or the probability of missed positives given an upper bound on the probability of false positives) is the log-likelihood ratio test

ln[ f(y|k, m=1) / f(y|k, m=0) ] ≷ τ

where Hi denotes the hypothesis that m = i and τ depends on the evaluation metric. If the log-likelihood ratio is greater than τ the detector decides m = 1, and if it is less than τ the detector decides m = 0 (if it is equal to τ the detector randomly selects the hypothesis, based on the evaluation metric being considered). Of course, in order to implement this optimal detection strategy the detector requires knowledge of f(y|k, m), which can be computed as follows:

f(y|k, m) = ∫_S ∫_X f(y|s, x, k, m) f(x|s, k, m) f(s) dx ds = ∫_S ∫_X f(y|x) f(x|s, k, m) f(s) dx ds

Therefore an optimal detection algorithm requires knowledge of the embedding distribution f(x|s, k, m), the attack algorithm f(y|x), and the pdf of s: f(s). Note that f(s) is fixed, since neither the embedding algorithm nor the adversary can control it. As a result, the overall performance of the detection scheme is a function only of the variable f(y|x) (i.e., of the adversary A) and of the variable f(x|s, k, m) (i.e., the embedding algorithm E). The performance can then be evaluated by a metric that takes into account these two variable parameters: Ψ(E, A). Assuming Ψ(E, A) represents the losses of the system, our task is to find E* to minimize Ψ.
However, since A is unknown and arbitrary, the natural choice is to find the embedding scheme E* that minimizes the possible damage that can be done by a worst-case adversary A*:

(E*, A*) = arg min_{E ∈ F_Dw} max_{A ∈ F_Da} Ψ(E, A)     (5.1)

where the feasible design space F_Dw and the feasible adversary class F_Da consist of embedding distributions and attack distributions that satisfy certain distortion constraints. This formulation guarantees that ∀A: Ψ(E*, A) ≤ Ψ(E*, A*). Whenever possible we are also interested in finding out whether E* is indeed the best embedding distribution against A*, and therefore we would like to satisfy a saddle point equilibrium:

Ψ(E*, A) ≤ Ψ(E*, A*) ≤ Ψ(E, A*)

D. Evaluation Metric

The evaluation metric should reflect good properties of the system for the objective it was designed for. For our particular case we are interested in the probability of error Pr[Dk(y) ≠ m] when the following random experiment is performed: k is sampled from a uniform distribution on {0, 1}^n (where n is the length of k), s is sampled from a distribution with pdf f(s), m is sampled from its prior distribution f(m), and then the embedding algorithm and the adversary algorithm are executed. This random experiment is usually expressed in the following notation:

Ψ(E, A) ≡ Pr[ k ← {0, 1}^n; s ← f(s); m ← f(m); x ← Ek(s, m); y ← A(x) : Dk(y) ≠ m ]     (5.2)

The min-max formulation with this evaluation metric minimizes the damage of an adversary whose objective is to produce a signal y that removes the embedded information m. In particular, the adversary wants the detection algorithm, on input y, to make an incorrect decision on whether the original signal s was embedded with m = 1 or not.

III Additive Watermarking and Gaussian Attacks

The model described in the previous section is often intractable (depending on the specific objective function, distortion functions, prior distributions, etc.)
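The min-max program (5.1) can be made concrete in a deliberately simplified scalar instance. The error expression Q(√(Rp/(2(Rs + Re)))) below is the scalar specialization that appears later in this chapter (with the scaling factors fixed to 1, a simplifying assumption), and the grid search is only a sketch of the formulation, not of the general solution:

```python
import math
import numpy as np

def Q(x):
    """Gaussian tail function Q(x) = P[N(0,1) > x]."""
    return 0.5 * math.erfc(x / math.sqrt(2))

# Scalar toy instance: the embedder picks a pattern power Rp <= Dw,
# the attacker picks a noise power Re <= Da (scalings fixed to 1).
Rs, Dw, Da = 1.0, 0.5, 2.0

def psi(Rp, Re):           # probability of error Psi(E, A)
    return Q(math.sqrt(Rp / (2 * (Rs + Re))))

Rp_grid = np.linspace(0.01, Dw, 50)
Re_grid = np.linspace(0.0, Da, 50)

# outer min over the embedder, inner max over the attacker, as in (5.1)
Rp_star = min(Rp_grid, key=lambda Rp: max(psi(Rp, Re) for Re in Re_grid))
Re_star = max(Re_grid, key=lambda Re: psi(Rp_star, Re))
print(Rp_star, Re_star)    # both players spend their full distortion budgets
```

In this toy case the solution sits at the corner of the feasible sets: the embedder uses all of its allowed embedding distortion and the attacker all of its allowed attack distortion.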
and therefore several approximations are made in order to obtain a solution. For example, in practice there exist several popular embedding algorithms, such as spread spectrum watermarking and QIM watermarking, and researchers often try to optimize the parameters of these embedding schemes, as opposed to finding new embedding algorithms. Furthermore, modeling an adversary that can select any algorithm to produce its output is also very difficult, because we would have to consider non-parametric and arbitrary non-linear attacks. Therefore researchers often assume a linear (additive) adversary that is parameterized by a Gaussian random process. This assumption is motivated by several arguments, including information-theoretic arguments claiming that a Gaussian distribution is the least favorable noise in a channel, or as an approximation justified by the central limit theorem. In this chapter we follow one such model, originally proposed in [46]. In contrast to the results in [46], however, we relax two assumptions. First, we relax the assumption of spread spectrum watermarking and instead search for the optimal embedding algorithm in this model formulation. Second, we relax the assumption of diagonal processors (an assumption mostly due to the fact that the embedding algorithm used spread spectrum watermarking) and obtain results for the general case. The end result is that our algorithms achieve a lower objective function value than [46] for any possible attacker in the feasible attacker class. In the following section we describe the model of the problem and obtain our results. Then, in section E, we discuss our results and compare them to [46].

A. Mathematical Model

Given s ∈ R^N and m ∈ {0, 1}, we assume an additive embedder E that outputs x = Φ(s + pm), where Φ is an N × N matrix and p ∈ R^N is a pattern sampled from a distribution with pdf h(p).
Since p is the only random element in the watermarking algorithm, it is assumed to depend on the key k, and therefore from now on we will replace k with p without loss of generality. The attacker A is modeled by y = Γx + e, where Γ is an N × N matrix and e is a zero-mean (since any non-zero-mean attack is suboptimal [46]) Gaussian random vector with correlation matrix Re. Finally, the detection algorithm has to perform the following hypothesis test:

H0: y = ΓΦs + e
H1: y = ΓΦs + e + ΓΦp

If the objective function Ψ(E, A) the detector wants to minimize is the probability of error, then we know that an optimal detection algorithm is the log-likelihood ratio test. By assuming that the two hypotheses are equally likely, we find that τ = 0. Furthermore, in order to compute f(y|p, m), s is assumed to be a Gaussian random vector with zero mean (assumed without loss of generality) and correlation matrix Rs. The following diagram summarizes the model:

Ep(m, s): x = Φ(s + pm)  --x-->  A: y = Γx + e  --y-->  Dp: pᵗϒᵗRy⁻¹y − ½ pᵗϒᵗRy⁻¹ϒp ≷ 0  (decide H1 if positive, H0 if negative)

where Ry = ΓΦRsΦᵗΓᵗ + Re and ϒ = ΓΦ. The distortion constraints that E and A need to satisfy are selected to be the squared error distortion:

F_Dw = { E : E||X − S||² ≤ N Dw }  and  F_Da = { A : E||Y − S||² ≤ N Da }

where E = (Φ, Rp, h) and A = (Γ, Re).

B. Optimal Embedding Distribution

In the model described above, the probability of error can be found to be

Ψ(E, A) = Pr[Dp(y) ≠ m] = E_p[ Q(√(pᵗΩp)) ] = ∫ Q(√(pᵗΩp)) h(p) dp

where

Ω = ½ ΦᵗΓᵗ(ΓΦRsΦᵗΓᵗ + Re)⁻¹ΓΦ

and where p is the random pattern (watermark) and h(p) is its unknown pdf. Ω is a function of the signal and noise covariances Rs and Re, the watermark covariance Rp, and the scaling matrices Γ and Φ. If we fix all these quantities, then we would like to determine the form of h(p) that minimizes the error probability (since this is the goal of the decision maker).
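To see why the choice of h(p) matters, the following sketch estimates E_p[Q(√(pᵗΩp))] by Monte Carlo when h is Gaussian and compares it with the bound Q(√(tr{ΩRp})) derived next. The matrices Ω and Rp below are arbitrary positive-definite choices, an assumption made purely for illustration:

```python
import numpy as np
from math import erfc, sqrt

def Q(x):
    return 0.5 * erfc(x / sqrt(2))

rng = np.random.default_rng(0)
N = 4
M1 = rng.standard_normal((N, N)); Omega = M1 @ M1.T / N   # arbitrary PSD Omega
M2 = rng.standard_normal((N, N)); Rp = M2 @ M2.T / N      # arbitrary covariance Rp

# Gaussian pattern pdf h(p) with covariance Rp: Monte Carlo error probability
L = np.linalg.cholesky(Rp)
ps = (L @ rng.standard_normal((N, 100_000))).T
pe_gauss = float(np.mean([Q(sqrt(p @ Omega @ p)) for p in ps]))

# Jensen lower bound, attained only when p' Omega p is constant w.p. 1
bound = Q(sqrt(np.trace(Omega @ Rp)))
print(pe_gauss, bound)     # the Gaussian choice of h(p) sits strictly above the bound
```

The gap between the two numbers is exactly what the construction in the next paragraphs closes.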
To solve the previous problem we rely on the following property of the Q(·) function, which can be verified by direct differentiation.

Lemma 3: The function Q(√x) is a convex function of x.

We can now use this convexity property, apply Jensen's inequality, and conclude that E_x[Q(√x)] ≥ Q(√(E_x[x])), with equality iff x is a constant with probability 1 (wp1). Using this result in our case we get

Pe = E_p[ Q(√(pᵗΩp)) ] ≥ Q(√(E_p[pᵗΩp])) = Q(√(tr{ΩRp})).     (5.3)

Relation (5.3) provides a lower bound on the error probability for any pdf (which of course satisfies the covariance constraint). We have equality in (5.3) iff wp1

pᵗΩp = tr{ΩRp}.     (5.4)

In other words, every realization of p (that is, every watermark p) must satisfy this equality. Notice that if we can find a pdf for p which satisfies (5.4) under the constraint that E[ppᵗ] = Rp (remember that we fixed the covariance Rp), then we attain the lower bound in (5.3). To find a random vector p that achieves this, consider the SVD of the matrix

Rp^{1/2} Ω Rp^{1/2} = U Σ Uᵗ     (5.5)

where U is orthonormal and Σ = diag{σ1, ..., σK} is diagonal with nonnegative elements. The nonnegativity of the σi is assured because the matrix is nonnegative definite. Let A be a random vector with i.i.d. elements that take the values ±1 with probability 0.5. For every vector A we can then define an embedding vector p as follows:

p = Rp^{1/2} U A     (5.6)

Let us see if this definition satisfies our requirements. First consider the covariance matrix, which must be equal to Rp. Indeed we have

E[ppᵗ] = Rp^{1/2} U E[AAᵗ] Uᵗ Rp^{1/2} = Rp^{1/2} U I Uᵗ Rp^{1/2} = Rp^{1/2} I Rp^{1/2} = Rp,

where we used the independence of the elements of the vector A and the orthonormality of U (I denotes the identity matrix). So our random vector has the correct covariance structure. Let us now see whether it also satisfies the constraint that pᵗΩp = tr{ΩRp} wp1.
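The construction (5.5)-(5.6) can be checked numerically. The matrices below are arbitrary positive-definite choices (an assumption for illustration); the check enumerates all 2^N sign patterns and verifies both the covariance constraint and Equation (5.4):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
N = 4
M1 = rng.standard_normal((N, N)); Omega = M1 @ M1.T / N
M2 = rng.standard_normal((N, N)); Rp = M2 @ M2.T / N

# Rp^{1/2} via the symmetric eigendecomposition
w, V = np.linalg.eigh(Rp)
Rp_half = V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T

# Equation (5.5): for the symmetric PSD matrix Rp^{1/2} Omega Rp^{1/2},
# the eigendecomposition U Sigma U^t coincides with its SVD
sig, U = np.linalg.eigh(Rp_half @ Omega @ Rp_half)

target = np.trace(Omega @ Rp)
cov = np.zeros((N, N))
for signs in product([-1.0, 1.0], repeat=N):     # all 2^N watermarks
    p = Rp_half @ U @ np.array(signs)            # Equation (5.6)
    assert abs(p @ Omega @ p - target) < 1e-9    # Equation (5.4) holds w.p. 1
    cov += np.outer(p, p) / 2**N
assert np.allclose(cov, Rp)                      # E[pp^t] = Rp
print("all", 2**N, "patterns satisfy p'Op = tr{O Rp}")
```

Every one of the 2^N realizations yields the same quadratic form, so the Jensen bound (5.3) is met with equality.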
Indeed, for every realization of the random vector A we have

pᵗΩp = Aᵗ Uᵗ Rp^{1/2} Ω Rp^{1/2} U A = Aᵗ Σ A = A1²σ1 + A2²σ2 + ··· + AK²σK = σ1 + σ2 + ··· + σK,

where we use the fact that the elements Ai of A are equal to ±1. Notice also that

tr{ΩRp} = tr{Rp^{1/2} Ω Rp^{1/2}} = tr{UΣUᵗ} = tr{ΣUᵗU} = tr{Σ} = σ1 + ··· + σK,

which proves the desired equality. So we conclude that, although p is a random vector (our possible watermarks), all its realizations satisfy the equality pᵗΩp = tr{ΩRp}. This of course means that this specific choice of watermarking attains the lower bound in (5.3).

B.1 Summary

We have found the optimal embedding distribution. It is a random mixture of the columns of the matrix Rp^{1/2}U, of the form Rp^{1/2}UA, where A is a vector with elements ±1. This means that we can have 2^N different patterns. Rp is the final matrix we end up with from the max-min game, and U comes from the SVD of the corresponding final matrix Rp^{1/2} Ω Rp^{1/2}. Once we are given h*, the game the embedder and the attacker play is the following:

max_{Rp,Φ} min_{Re,Γ} tr{ΩRp}

More specifically:

max_{Rp,Φ} min_{Re,Γ} tr{(ΓΦRsΦᵗΓᵗ + Re)⁻¹ ΓΦRpΦᵗΓᵗ}     (5.7)

subject to the distortion constraints:

tr{(Φ − I)Rs(Φ − I)ᵗ + ΦRpΦᵗ} ≤ N Dw     (5.8)

tr{(ΓΦ − I)Rs(ΓΦ − I)ᵗ + ΓΦRpΦᵗΓᵗ + Re} ≤ N Da     (5.9)

C. Least Favorable Attacker Parameters

C.1 Minimization with respect to Re

Assuming ϒ = ΓΦ is fixed, we start by minimizing (5.7) with respect to Re. This minimization problem is addressed with the use of variational techniques. Let Rεe = R⁰e + ε∆. By forming the Lagrangian of the optimization problem (5.7)
under constraint (5.9), the goal of the attacker is to minimize with respect to ε the following objective function:

f(ε) = tr{(ϒRsϒᵗ + Rεe)⁻¹ ϒRpϒᵗ} + µ ( tr{(ϒ − I)Rs(ϒ − I)ᵗ + ϒRpϒᵗ + Rεe} − N Da )     (5.10)

A necessary condition for the optimality of R*e is

d f(ε)/dε |_{ε=0} = 0,

which implies

−tr{(ϒRsϒᵗ + R*e)⁻¹ ∆ (ϒRsϒᵗ + R*e)⁻¹ ϒRpϒᵗ} + µ tr{∆} = 0.

This must hold for any ∆, and thus we need

(ϒRsϒᵗ + R*e)⁻¹ ϒRpϒᵗ (ϒRsϒᵗ + R*e)⁻¹ = µI,

from which we solve for R*e = (1/√µ)(ϒRpϒᵗ)^{1/2} − ϒRsϒᵗ. The dual problem is then to maximize with respect to µ the following function:

√µ tr{(ϒRpϒᵗ)^{1/2}} + µ ( tr{(ϒ − I)Rs(ϒ − I)ᵗ + ϒRpϒᵗ + µ^{−1/2}(ϒRpϒᵗ)^{1/2} − ϒRsϒᵗ} − N Da )

which after some simplification becomes

2√µ tr{(ϒRpϒᵗ)^{1/2}} − 2µ tr{ϒRs} + µ tr{Rs} + µ tr{ϒRpϒᵗ} − µN Da.

Taking the derivative with respect to µ and equating it to zero we obtain:

1/√µ = ( 2 tr{ϒRs} − tr{Rs} − tr{ϒRpϒᵗ} + N Da ) / tr{(ϒRpϒᵗ)^{1/2}}

C.2 Minimization with respect to Γ

Since ϒ = ΓΦ, we note that the attacker can completely control ϒ by an appropriate choice of Γ (assuming Φ is invertible); therefore we only need to consider the minimization over ϒ of the following function:

( tr{(ϒRpϒᵗ)^{1/2}} )² / ( 2 tr{ϒRs} − tr{Rs} − tr{ϒRpϒᵗ} + N Da )

Proceeding similarly to the previous case, we use variational techniques with ϒε = ϒo + ε∆. After differentiating the objective function with respect to ε and equating it to zero, we obtain the following equation:

tr{(∆Rpϒoᵗ + ϒoRp∆ᵗ)(ϒoRpϒoᵗ)^{−1/2}} Cd − ( 2 tr{∆Rs} − tr{∆Rpϒoᵗ + ϒoRp∆ᵗ} ) Cn = 0

where Cd and Cn are scalar factors not dependent on ∆ (they will be determined later).
Since the above equation must be equal to zero for any ∆, we need

2Cd ( Rpϒoᵗ(ϒoRpϒoᵗ)^{−1/2} ) − 2Cn ( Rs − Rpϒoᵗ ) = 0,

or alternatively, if we let C = Cd/Cn,

C (ϒoRpϒoᵗ)^{1/2} + ϒoRpϒoᵗ = ϒoRs.

Letting Σ = ϒoRp^{1/2} and A = Rp^{−1/2}Rs we obtain

C (ΣΣᵗ)^{1/2} = ΣA − ΣΣᵗ;

after squaring both sides we have

C²Σᵗ = (A − Σᵗ)Σ(A − Σᵗ),

or equivalently,

(A − Σᵗ)Σ(A − Σᵗ) − C²Σᵗ = 0.

D. Optimal Embedding Parameters

The embedder now has to solve:

max_{Φ,Rp} ( tr{(ΦRpΦᵗ)^{1/2}} )² / ( 2 tr{ΦRs} − tr{Rs} − tr{ΦRpΦᵗ} + N Da )     (5.11)

subject to:

tr{(Φ − I)Rs(Φ − I)ᵗ + ΦRpΦᵗ} ≤ N Dw     (5.12)

For any Φ and any Rp we have, by the Schwarz inequality, that

( tr{(ΦRpΦᵗ)^{1/2}} )² / ( 2 tr{ΦRs} − tr{Rs} − tr{ΦRpΦᵗ} + N Da ) ≤ N tr{ΦRpΦᵗ} / ( 2 tr{ΦRs} − tr{Rs} − tr{ΦRpΦᵗ} + N Da )     (5.13)

Equality is achieved if and only if ΦRpΦᵗ = κI (i.e., R*p = κ(ΦᵗΦ)⁻¹), where κ is a constant that can be determined by the following argument. Notice first that the right-hand side of (5.13) is an increasing function of tr{ΦRpΦᵗ}. The maximum value is therefore achieved when constraint (5.12) is satisfied with equality. Solving for κ from this constraint we obtain

κ = ( N Dw − tr{(Φ − I)Rs(Φ − I)ᵗ} ) / N.

Replacing the values of κ and Rp into the original objective function, and letting λ be the maximum value the objective function can achieve (as a function of Φ), we have:

N ( N Dw − tr{(Φ − I)Rs(Φ − I)ᵗ} ) / ( 2 tr{ΦRs} − tr{Rs} + N Da − N Dw + tr{(Φ − I)Rs(Φ − I)ᵗ} ) ≤ λ

After some algebraic manipulations we obtain the following:

N Dw/(λ + 1) − λ(Da − Dw)/(λ + 1) ≤ tr{ (Φ − (λ + 1)⁻¹I) Rs (Φ − (λ + 1)⁻¹I)ᵗ } + λ tr{Rs}/(λ + 1)²

We can see that the minimum value the right-hand side of this inequality can achieve (as a function of Φ) occurs at Φ* = (λ + 1)⁻¹I. With Φ*, the following equation can be used to solve for λ:

λ tr{Rs}/(λ + 1) + λ(Da − Dw) = N Dw

Solving for λ we obtain

λ = [ Dw − Da + N Dw − tr{Rs} + √( 4(Da − Dw)N Dw + (Da − Dw − N Dw + tr{Rs})² ) ] / ( 2(Da − Dw) )

E.
Discussion

Notice that so far we have solved the problem in the following way:

min_{Φ,Rp} max_{Γ,Re} min_h Ψ(E, A)     (5.16)

where (as mentioned before) E = (h, Rp, Φ) and A = (Γ, Re). This means that given Φ and Rp, the adversary will select Γ and Re in order to maximize the probability of error, and then, given the parameters chosen by the adversary, we finally find the embedding distribution h minimizing the probability of error. The problem with this solution is that in practice the embedding algorithm will be given in advance, and the adversary will have the opportunity to change its behavior based on the given embedding algorithm (including h). For notational simplicity assume Φ and Rp are fixed, so that we can replace E with h in the remainder of this chapter. Furthermore, let h(A) denote the embedding distribution as a function of the parameters A = (Γ, Re) (recall that h depends on A through the selection of U in Equation (5.6)). Now we can easily express the problem we have solved:

∀A  Ψ(h*(A), A) ≤ Ψ(h(A), A)

This is true in particular for A*, the solution to the full optimization problem in Equation (5.16). Moreover, the above is also true for the distribution used in previous work [46], which assumed a Gaussian embedding distribution hG:

∀A  Ψ(h*(A), A) ≤ Ψ(hG, A)

Notice also that in [46] the solution obtained was

A^G = arg max_A Ψ(hG, A)     (5.17)

Due to some approximations made in [46], Ψ(hG, A) turns out to be the same objective function given in Equation (5.7). Furthermore, [46] made additional approximations in order to obtain linear processors (diagonal matrices). In this work we relaxed this assumption in order to obtain the full solution to Equation (5.7).
Therefore the general solution (without extra assumptions such as diagonal matrices) is in both cases the same:

A^G = arg max_A Ψ(hG, A) = A* = arg max_A min_h Ψ(h, A)     (5.18)

One problem with our solution, however, is that there might exist A′ such that

Ψ(h*(A*), A*) < Ψ(h*(A′), A′).

However, even in this case it is easy to show that h* is still better than hG, since the performance achieved by hG is not as good as the performance obtained with h*, even over any other A:

max_A Ψ(h*(A), A) < max_A Ψ(hG, A)

The main problem is that in order to obtain this optimal performance guarantee, the embedding distribution h* needs to know the final strategy of the adversary A. In particular, we are interested in two questions. With regard to the previous work in [46], we would like to know whether the following is true:

∀A  Ψ(h*(A*), A) ≤ Ψ(hG, A*)     (5.19)

that is, once we have fixed the operating point A* (the optimal adversary according to Equation (5.7)), there is no other adversarial strategy that will make h* perform worse than the previous work. The second question is in fact more general, and it relates to the original intention of minimizing the worst possible error created by the adversary:

min_h max_A Ψ(h, A)     (5.20)

yet we have only solved the problem in a way where h is dependent on A:

(h*, A*) = arg max_A min_h Ψ(h, A)

A way to show that (h*, A*) satisfies Equation (5.20) (and therefore also satisfies Equation (5.19)) is to show that the pair (h*, A*) forms a saddle point equilibrium:

∀(h, A)  Ψ(h*, A) ≤ Ψ(h*, A*) ≤ Ψ(h, A*)     (5.21)

To be more specific, let E again denote the triple (h, Rp, Φ). Then we are interested in showing that

∀(E, A)  Ψ(E*, A) ≤ Ψ(E*, A*) ≤ Ψ(E, A*)     (5.22)

where (E*, A*) = (h*, R*p, Φ*, R*e, Γ*) is the solution to Equation (5.16).
It is easy to show that the right-hand side inequality of Equation (5.22) is satisfied:

Ψ(E, A*) = E_p[ Q( √( ½ pᵗΦᵗΓ*ᵗ(Γ*ΦRsΦᵗΓ*ᵗ + R*e)⁻¹Γ*Φp ) ) ]
 ≥ Q( √( ½ tr{ RpΦᵗΓ*ᵗ(Γ*ΦRsΦᵗΓ*ᵗ + R*e)⁻¹Γ*Φ } ) )   by Jensen's inequality
 ≥ Q( √( ½ tr{ R*pΦ*ᵗΓ*ᵗ(Γ*Φ*RsΦ*ᵗΓ*ᵗ + R*e)⁻¹Γ*Φ* } ) )   by Equation (5.7)
 = Ψ(E*, A*)   by the definition of h*

The left-hand side of Equation (5.22) is more difficult to satisfy. A particular case where it is satisfied is the scalar case, i.e., when N = 1. In this case we have the following:

Ψ(E*, A*) = Q( √( (Φ*Γ*)² R*p / ( 2((Γ*Φ*)² Rs + R*e) ) ) )
 ≥ Q( √( (Φ*Γ)² R*p / ( 2((ΓΦ*)² Rs + Re) ) ) )   by Equation (5.7)
 = E_p[ Q( √( p²(Φ*Γ)² / ( 2((ΓΦ*)² Rs + Re) ) ) ) ]   since p is independent of A
 = Ψ(E*, A)

The independence of p in the scalar case comes from the fact that Equation (5.6) yields p = √Rp with probability ½ and p = −√Rp with probability ½. With this distribution, Equation (5.4) is always satisfied (and the adversary has no control over it). This result can in fact be seen as a counterexample to the optimality of spread spectrum watermarking against Gaussian attacks: if the attack is Gaussian, then the embedding distribution should not be spread spectrum, or conversely, if the embedding distribution is spread spectrum, then the attack should not be Gaussian. In future work we plan to investigate under which conditions or assumptions the left inequality in Equation (5.22) is satisfied. An easier goal we also plan to investigate is whether Equation (5.19) is true, since this would also show an improvement over previous work. We also plan to extend the work to other evaluation metrics, such as the case when one of the errors is more important than the other. In this case we can set an arbitrary level of false alarms and find the parameters of the embedder that maximize the probability of detection, while the adversary tries to minimize detection.
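The scalar counterexample can be checked numerically. The per-realization error Q(|p|/(2σ)) used below is the textbook error probability for deciding between means 0 and p in Gaussian noise of variance σ² (normalization constants differ across conventions, but the comparison between the two embedding distributions is unaffected):

```python
import numpy as np
from math import erfc, sqrt

def Q(x):
    return 0.5 * erfc(x / sqrt(2))

Rs, Re, Rp = 1.0, 1.0, 1.0
sigma = sqrt(Rs + Re)            # scalar case with the scalings fixed to 1

# binary watermark p = +-sqrt(Rp): every realization gives the same error
pe_binary = Q(sqrt(Rp) / (2 * sigma))

# Gaussian (spread-spectrum-style) watermark with the same power Rp
rng = np.random.default_rng(3)
ps = sqrt(Rp) * rng.standard_normal(200_000)
pe_gauss = float(np.mean([Q(abs(p) / (2 * sigma)) for p in ps]))

print(pe_binary, pe_gauss)       # the binary watermark strictly outperforms
```

Under a Gaussian attack of the same power, the two-point distribution of Equation (5.6) beats the Gaussian (spread-spectrum-style) embedding, exactly as the convexity argument predicts.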
IV Towards a Universal Adversary Model

In the previous section we attempted to relax the usual assumptions regarding the optimal embedding distributions (spread spectrum or QIM) and to find new embedding distributions that achieve better performance against attacks. The problem with the formulation in the previous section is that the optimal embedding distribution h* depends on the strict assumptions made to keep the problem tractable. In particular, it assumes the correctness of the source signal model f(s) and, even more troubling, it assumes the adversary will perform only a Gaussian and scaling attack. This limitation on the capabilities of the adversary is a big problem in practice: any data hiding algorithm that is shown to perform well under a parametric adversary (e.g., Gaussian attacks) will give a false sense of security, since in reality the adversary will never be confined to Gaussian attacks or to any other parametric distribution prescribed by the analysis. In this section we focus on another version of the embedding detection problem: non-blind watermarking. This formulation is very important for several problems such as fingerprinting and traitor tracing. In non-blind watermarking both the embedder and the detector have access to the source signal s, and therefore the problem can again be represented as:

Ek(m, s): x ← f(x|k, m, s)  --x-->  A: y ← f(y|x)  --y-->  Dk(s): ln[ f(y|s, k, m=1) / f(y|s, k, m=0) ] ≷ τ

where A should satisfy some distortion constraint, for example a quadratic distortion constraint:

A ∈ F_D = { A : E[(x − y)²] ≤ D }

Our main objective is to model and understand the optimal strategy of a non-parametric adversary. To the best of our knowledge this is the first attempt to model this all-powerful adversary. In order to gain better insight into the problem, we start with the scalar case: N = 1, i.e., s, x and y are in R.
Furthermore, we assume E is fixed and parameterized by a distance d between the different embeddings: that is, for all k and s, d = |Ek(1, s) − Ek(0, s)|. Since for every output of the attacker y ← f(y|x) there exists a random realization of a with pdf h such that y = x + a, we can replace the adversarial model with an additive random variable a sampled from an attacker distribution h. Finally, the decision function ρ outputs an estimate of m (m = 0 or m = 1) given the output y of the adversary:

E(m, s): x  --x-->  A: y = x + a  --y-->  D(s): ρ(y)

Having fixed the embedding algorithm this time, our objective is to find a pair (ρ*, h*) such that

∀ρ and ∀h ∈ F_D  Ψ(ρ*, h) ≤ Ψ(ρ*, h*) ≤ Ψ(ρ, h*)     (5.23)

where Ψ(ρ, h) is again the probability of error:

Pr[ m ← {0, 1}; x = E(m, s); a ← h(a) : ρ(x + a) ≠ m ]

and where F_D simplifies to { h : E[a²] ≤ D, ∫h(a)da = 1, h ≥ 0 }. Notice that this problem is significantly more difficult than the problem of finding the optimal parameters of an adversary, since in this case we need to perform the optimization over infinite-dimensional spaces, because h is a continuous function.

A. On the Necessity of Randomized Decisions for Active Distortion Constraints

Let xi = E(i, s). Then we know that to satisfy Ψ(ρ*, h*) ≤ Ψ(ρ, h*), ρ* should select the larger of the likelihood of y given x1, f(y|x1), and the likelihood of y given x0, f(y|x0), and should randomly flip a coin if both likelihoods are equal (such a decision is called Bayes optimal). Therefore, in order to find the saddle point equilibrium we are going to assume a given ρ, maximize over h (subject to the distortion constraints), and then check whether ρ is indeed Bayes optimal. Before we obtain a saddle point solution, we think it is informative to show our attempt to solve the problem with a non-randomized decision function ρ.
A typical non-randomized decision function will divide the decision space into two sets, R and its complement Rc. If y ∈ R then ρ(y) = 1; otherwise ρ(y) = 0. Assume without loss of generality that d = x0 − x1 > 0. The probability of error can then be expressed as:

Ψ(ρ, h) = Pr[ρ = 1|M = 0]Pr[M = 0] + Pr[ρ = 0|M = 1]Pr[M = 1]
 = ½ ( ∫_R h(y − x0)dy + ∫_{Rc} h(y − x1)dy )
 = ½ ( ∫_R h(a − d)da + ∫_{Rc} h(a)da )
 = ½ ( ∫_{R−d} h(a)da + ∫_{Rc} h(a)da )
 = ½ ∫ ( 1_{R−d}(a) + 1_{Rc}(a) ) h(a)da

where 1_R is the indicator function of the set R (i.e., 1_R(a) = 1 if a ∈ R and 1_R(a) = 0 otherwise) and where R − d is defined as the set {a − d : a ∈ R}. The objective function is therefore:

min_R max_{h ∈ F_D} ½ ∫ ( 1_{R−d}(a) + 1_{Rc}(a) ) h(a)da  subject to E[a²] ≤ D

The Lagrangian is

L(λ, h) = ∫ [ ½( 1_{R−d}(a) + 1_{Rc}(a) ) − λa² ] h(a)da + λD

where Ψ(ρ, h*) = L*(λ*) = L(λ*, h*). By looking at the form of the Lagrangian in Figure 5.1 (for λ > 0), it is clear that a necessary condition for optimality is that (R − d) ∩ Rc = ∅, since otherwise the adversary would put all the mass of h in this intersection. Under this condition we assume R − d = (−∞, −d/2]. Now notice that for D ≥ (d/2)², λ* = 0, and therefore there will always be an h* such that Ψ(ρ, h*) = ½. The interpretation of this case is that the distortion constraints are not strict enough, and the adversary can create error rates up to 0.5. It is impossible to find a saddle point solution in this case, since any such ρ will not be Bayes optimal, and if ρ is Bayes optimal then h* is not a maximizing distribution. However, by assuming R − d = (−∞, −d/2] we guarantee that the probability of error is not greater than 0.5.

Figure 5.1: Let us define a = y − x1 for the detector. We can now see the Lagrangian function over a, where the adversary tries to distribute the density h such that L(λ, h) is maximized while satisfying the constraints (i.e., minimizing L(λ, h) over λ).
Having such a high error rate is unacceptable in practice, and thus the embedding scheme should be designed with a d such that D < (d/2)², i.e., d > 2√D. Assuming λ > 0, it is now clear from Figure 5.1 that an optimal attack is of the form

h*(a) = p0 δ(−d/2) + p1 δ(0) + p2 δ(d/2), with p0 + p1 + p2 = 1.

For D < (d/2)², λ* = 2/d², and thus Ψ(ρ, h*) = 2D/d², where

h*(a) = (2D/d²) δ(−d/2) + ((d² − 4D)/d²) δ(0) + (2D/d²) δ(d/2).

Notice, however, that ρ is not optimal in this case, since the solution assumes that at the boundary between R and Rc, ρ decides both m = 0 and m = 1 at the same time, which is clearly not a Bayes optimal decision rule (in fact it is not a decision rule at all). A naive approach to fix this problem is to randomize the decision at the boundary, i.e., to flip an unbiased coin whenever y is on the boundary between R and Rc. This decision is then Bayes optimal for h*; however, it can be shown that this h* no longer maximizes the probability of error, and thus we cannot achieve a saddle point solution. In the next section we introduce a more elaborate randomized decision that achieves a saddle point equilibrium.

B. Achieving Saddle Point Equilibria with Active Constraints

Let ρ(a) = 0 with probability ρ0(a) and ρ(a) = 1 with probability ρ1(a). In order to have a well defined decision function we require ρ0 = 1 − ρ1. Consider now the decision function given in Figure 5.2. The Lagrangian is now:

L(λ, h) = ∫ [ ½( ρ1(a + d) + ρ0(a) ) − λa² ] h(a)da + λD

Figure 5.2: Piecewise linear decision function, where ρ0(0) = ρ1(0) = ½.

Figure 5.3: The discontinuity problem in the Lagrangian is solved by using piecewise linear continuous decision functions. It is now easy to shape the Lagrangian such that the maxima created form a saddle point equilibrium.
Figure 5.4: New decision function, parameterized by $z$: $\rho_1(a) = 1$ for $a \le \frac{d}{z}$, decreasing linearly with slope $-\frac{z}{(z-2)d}$ to 0 at $a = \frac{(z-1)d}{z}$, with $\rho_0(a) = \frac{z}{(z-2)d}a - \frac{1}{z-2}$ on the ramp and $\rho_0 = 1 - \rho_1$ throughout.

In order to have active distortion constraints, the maxima of $L(\lambda, h)$ should lie in the interval $a \in [-d, d]$. Looking at Figure 5.3 we see that

$$
\frac{1}{2}\left(\rho_1(a + d) + \rho_0(a)\right) - \lambda a^2
$$

achieves its maximum value at $a^* = \pm\frac{1}{4\lambda d}$. Therefore

$$
h^*(a) = \frac{1}{2}\left[\delta\!\left(a + \frac{1}{4\lambda d}\right) + \delta\!\left(a - \frac{1}{4\lambda d}\right)\right]
$$

Notice, however, that under $h^*$, $\rho$ is Bayes optimal if and only if $\frac{1}{4\lambda d} = \frac{d}{2}$, which occurs if and only if $\lambda^* = \frac{1}{2d^2}$, which in turn occurs if and only if $D = \frac{d^2}{4}$ (since $\lambda^* = \frac{1}{4d\sqrt{D}}$ minimizes the Lagrangian).

In summary, for $D = \frac{d^2}{4}$, $(\rho^*, h^*)$ form a saddle point equilibrium when $\rho^*$ is defined as in Figure 5.2 and

$$
h^*(a) = \frac{1}{2}\left[\delta\!\left(a + \frac{d}{2}\right) + \delta\!\left(a - \frac{d}{2}\right)\right]
$$

Furthermore, the probability of error is $\Psi(\rho^*, h^*) = \frac{1}{16\lambda d^2} + \lambda D = \frac{1}{8} + \frac{D}{2d^2} = \frac{1}{4}$.

It is still an open question whether there are saddle point equilibria for the distortion constraint $E[a^2] \le D$ with $D > \frac{d^2}{4}$; however, for $D < \frac{d^2}{4}$ we can obtain a saddle point by considering the decision function shown in Figure 5.4. For $z \in (3, \infty)$, the local maxima of the Lagrangian occur at $a = 0$ and $a = \pm\frac{z}{4d(z-2)\lambda}$, where the second value is obtained as the solution to

$$
\left.\frac{d}{da}\left[\frac{1}{2}\left(\frac{z}{(z-2)d}a - \frac{1}{z-2}\right) - \lambda a^2\right]\right|_{a^*} = 0
$$

i.e., the derivative evaluated at $a^*$ must be equal to zero. By the symmetry of the Lagrangian, another local maximum occurs at $a = -a^*$. The value of $\lambda^*$ that makes all these local maxima equal (and thus gives the adversary the opportunity of an optimal attack with three delta functions, one at each local maximum) is the solution to:

$$
\frac{1}{2}\left(\frac{z}{(z-2)d}\cdot\frac{z}{4d(z-2)\lambda^*} - \frac{1}{z-2}\right) - \lambda^*\left(\frac{z}{4d(z-2)\lambda^*}\right)^2
= \frac{z^2 - 8d^2(z-2)\lambda^*}{16d^2(z-2)^2\lambda^*} = 0
$$

which yields $\lambda^* = \frac{z^2}{8d^2(z-2)}$.

Figure 5.5: With $\rho$ defined as in Figure 5.4, the Lagrangian exhibits three local maxima, one of them at the point $a = 0$, which implies that the adversary will use this point whenever the distortion constraints are too severe.
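The equalization of the three local maxima can be checked numerically. The sketch below is my own illustration: the explicit ramp form of $\rho_1$ (value 1 below $\frac{d}{z}$, linear to 0 at $\frac{(z-1)d}{z}$) is reconstructed from Figure 5.4, and $z = 5$ is an arbitrary choice. It evaluates the Lagrangian integrand $f(a) = \frac{1}{2}(\rho_1(a + d) + \rho_0(a)) - \lambda^* a^2$ on a grid and confirms that its maximum value 0 is attained at $a \in \{-\frac{2d}{z}, 0, \frac{2d}{z}\}$.

```python
def rho1(a, d, z):
    # Decision rule of Figure 5.4 (reconstruction): 1 below d/z,
    # linear ramp down to 0 at (z-1)d/z, and 0 above.
    lo, hi = d / z, (z - 1) * d / z
    if a <= lo: return 1.0
    if a >= hi: return 0.0
    return (hi - a) / (hi - lo)

def f(a, d, z, lam):
    # Lagrangian integrand: (1/2)(rho1(a+d) + rho0(a)) - lam * a^2
    return 0.5 * (rho1(a + d, d, z) + 1.0 - rho1(a, d, z)) - lam * a * a

d, z = 1.0, 5.0
lam = z ** 2 / (8 * d ** 2 * (z - 2))    # candidate lambda*
a_star = z / (4 * d * (z - 2) * lam)     # second local maximum, equals 2d/z

# Scan a in [-2d, 2d]; the true maximum of f should be 0, at {-a*, 0, +a*}.
grid_max = max(f(-2 * d + i * 4 * d / 100000, d, z, lam) for i in range(100001))
print(grid_max, f(0, d, z, lam), f(a_star, d, z, lam), f(-a_star, d, z, lam))
```

Up to floating-point noise, all three local maxima evaluate to 0 and no grid point exceeds them, which is exactly the equalization condition that determines $\lambda^*$.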
Any other $\lambda$ would have implied inactive constraints ($D$ too large) or $D = 0$. See Figure 5.5. The optimal adversary has the form

$$
h^*(a) = p_0\,\delta\!\left(a + \frac{2d}{z}\right) + (1 - 2p_0)\,\delta(a) + p_0\,\delta\!\left(a - \frac{2d}{z}\right)
$$

We can see that $\rho$ defined as in Figure 5.4 and $h^*$ can only form a saddle point if $z = 4$. Notice also that $h^*$ is an optimal strategy for the adversary as long as

$$
E[a^2] = 2p_0\left(\frac{d}{2}\right)^2 = D
$$

that is, $p_0 = \frac{2D}{d^2}$. Since the maximum value that $p_0$ can attain is $\frac{1}{2}$, this is the optimal strategy for the adversary for any $D \le \frac{d^2}{4}$. The probability of error for this saddle point equilibrium is

$$
\Psi(\rho^*, h^*) = L^*(\lambda^*) = \lambda^* D = \frac{D}{d^2} \le \frac{1}{4}
$$

C. Saddle Point Solutions for $D > \frac{d^2}{4}$

In the previous sections we saw how the adversary can create pdfs $h$ that generate points $y$ where the decision function makes large classification errors; therefore, the idea of using an "indecision" region can help the decision function in regions where deciding between the two hypotheses is prone to errors. In this framework we allow $\rho(y)$ to output $\neg$ when $y$ does not provide enough information to decide between $m = 0$ and $m = 1$.

Let $C(i, j)$ represent the cost of deciding $i$ ($\rho = i$) when the true hypothesis is $m = j$. By using the probability of error as our evaluation metric, we have so far been minimizing the expected cost $E[C(\rho, m)]$ with $C(0, 0) = C(1, 1) = 0$ and $C(1, 0) = C(0, 1) = 1$. We now extend this evaluation metric by incorporating the cost of not making a decision: $C(\neg, 0) = C(\neg, 1) = \alpha$.

Figure 5.6: $\rho_\neg$ represents a decision stating that we do not possess enough information to make a reliable selection between the two hypotheses.
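A quick numerical check of the $z = 4$ saddle point follows. This is my own sketch: the explicit form of $\rho_1$ (a ramp from 1 at $a = \frac{d}{4}$ to 0 at $a = \frac{3d}{4}$) is reconstructed from Figure 5.4, and the values $d = 2$, $D = 0.5$ are arbitrary choices with $D \le \frac{d^2}{4}$.

```python
def rho1(a, d):
    # z = 4 rule from Figure 5.4 (reconstruction): 1 below d/4, linear to 0 at 3d/4.
    lo, hi = d / 4, 3 * d / 4
    if a <= lo: return 1.0
    if a >= hi: return 0.0
    return (hi - a) / (hi - lo)

def error_prob(attack, d):
    # Psi = (1/2) E[rho1(a + d) + rho0(a)] under a discrete attack distribution.
    return 0.5 * sum(m * (rho1(a + d, d) + 1.0 - rho1(a, d)) for m, a in attack)

d, D = 2.0, 0.5
p0 = 2 * D / d ** 2                                # mass on each point +-d/2
attack = [(p0, -d / 2), (1 - 2 * p0, 0.0), (p0, d / 2)]

distortion = sum(m * a ** 2 for m, a in attack)
print(distortion, error_prob(attack, d))           # prints 0.5 0.125
```

The attack meets the distortion budget exactly and the error probability comes out linear in the budget, matching $\Psi(\rho^*, h^*) = \lambda^* D = \frac{D}{d^2}$.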
If we let $\alpha < \frac{1}{2}$, it is easy to show (assuming $\Pr[m = 0] = \Pr[m = 1] = 0.5$) that a decision function minimizing $\Psi(\rho, h) = E[C(\rho, m)]$ has the following form:

$$
\rho^*(y) = \begin{cases}
1 & \text{if } \frac{f(y|x_1)}{f(y|x_0)} > \frac{1-\alpha}{\alpha} \\[4pt]
\neg & \text{if } \frac{1-\alpha}{\alpha} > \frac{f(y|x_1)}{f(y|x_0)} > \frac{\alpha}{1-\alpha} \\[4pt]
0 & \text{if } \frac{f(y|x_1)}{f(y|x_0)} < \frac{\alpha}{1-\alpha}
\end{cases}
\qquad (5.24)
$$

and whenever $\frac{f(y|x_1)}{f(y|x_0)}$ equals $\frac{1-\alpha}{\alpha}$ or $\frac{\alpha}{1-\alpha}$, the decision is randomized between 1 and $\neg$, or between $\neg$ and 0, respectively. Under our non-blind watermarking model the expected cost becomes:

$$
\Psi(\rho, h) = \frac{1}{2}\int \Big\{\big[\rho_1(x + d) + \alpha\rho_\neg(x + d)\big] + \big[\rho_0(x) + \alpha\rho_\neg(x)\big]\Big\}\, h(x)\,dx
$$

where $\rho_i$ is the probability of deciding $i$, and $\rho_0(x) + \rho_\neg(x) + \rho_1(x) = 1$. Given $\rho$, the Lagrangian for the optimization problem of the adversary is:

$$
L(\lambda, h) = \int \left[\frac{1}{2}\big(\rho_1(a + d) + \alpha\rho_\neg(a + d) + \rho_0(a) + \alpha\rho_\neg(a)\big) - \lambda a^2\right] h(a)\,da + \lambda D
$$

Consider now the decision function given in Figure 5.6. Following the same reasoning as in the previous sections, it is easy to show that for $\beta = \frac{3}{2}d$, the maximum values of $L(\lambda, h)$ occur at $a = 0$ and $a = \pm d$. The optimal distribution $h$ has the following form:

$$
h^*(a) = \frac{p}{2}\,\delta(a + d) + (1 - p)\,\delta(a) + \frac{p}{2}\,\delta(a - d)
$$

The decision function $\rho$ is Bayes optimal against this attack distribution only if the likelihood ratio at $a = 0$ equals $\frac{1-\alpha}{\alpha}$ (i.e., $\frac{1-p}{p/2} = \frac{1-\alpha}{\alpha}$), and the likelihood ratio at $a = d$ equals $\frac{\alpha}{1-\alpha}$ (i.e., $\frac{p/2}{1-p} = \frac{\alpha}{1-\alpha}$). This optimality requirement places a constraint on $\alpha$: $\alpha = \frac{p}{2-p}$. Furthermore, the distortion constraint implies the adversary will select $E[a^2] = pd^2 = D$. Since we need $\alpha < \frac{1}{2}$ in order to make use of the "indecision" region, we require $p < \frac{2}{3}$, and the above formulation is thus satisfied for $D < \frac{2}{3}d^2$.
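Rule (5.24) at the operating point just derived can be checked by enumerating the joint distribution of $(m, y)$ under the discrete attack $h^*$. The sketch below is my own numerical illustration (the values $d = 1$ and $p = \frac{1}{2}$ are arbitrary choices): it takes $\alpha = \frac{p}{2-p}$, the value at which the two likelihood-ratio conditions above hold with equality, resolves the threshold ties toward indecision (any mix of the bordering decisions has the same posterior cost there), and computes the expected cost exactly with rational arithmetic.

```python
from fractions import Fraction as F

d = 1
p = F(1, 2)                            # = D/d^2; needs p < 2/3 so that alpha < 1/2
alpha = p / (2 - p)                    # no-decision cost matching the LR thresholds

h = {-d: p / 2, 0: 1 - p, d: p / 2}    # optimal attack h*
f1 = {u: h[u] for u in h}              # likelihood of u = y - x1 under m = 1
f0 = {u + d: h[u] for u in h}          # likelihood of u = y - x1 under m = 0
points = sorted(set(f1) | set(f0))

def decide(u):
    # Rule (5.24); at the threshold ratios any mix of the two bordering
    # decisions has equal posterior cost, so we pick the indecision side.
    num, den = f1.get(u, F(0)), f0.get(u, F(0))
    if den == 0 or num * alpha > den * (1 - alpha):   # LR > (1-alpha)/alpha
        return 1
    if num * (1 - alpha) < den * alpha:               # LR < alpha/(1-alpha)
        return 0
    return 'indecision'

def cost(dec, m):
    if dec == 'indecision':
        return alpha
    return F(0) if dec == m else F(1)

total = sum(F(1, 2) * (f1.get(u, F(0)) * cost(decide(u), 1)
                       + f0.get(u, F(0)) * cost(decide(u), 0)) for u in points)
print(total, total == p / 2)           # prints 1/4 True
```

The expected cost is $\frac{p}{2} = \frac{D}{2d^2}$, half the error probability $\frac{D}{d^2}$ obtained above without the indecision option, which illustrates the benefit of the indecision region.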