A Data Mining Approach for Building Cost-Sensitive and Light Intrusion Detection Models

Quarterly Review – November 2000
North Carolina State University
Columbia University
Florida Institute of Technology

Outline
• Project description
• Progress report:
– Cost-sensitive modeling (NCSU/Columbia/FIT).
– Automated feature and model construction (NCSU).
– Anomaly detection (NCSU/Columbia/FIT).
– Attack "clustering" and light modeling (FIT).
– Real-time architecture and systems (NCSU/Columbia).
– Correlation (NCSU).
– Collaboration with industry (NCSU/Columbia).
– Publications and software distribution.
– Effort and budget.
• Plan of work for next quarter

New Ideas and Hypotheses (1/2)
• High-volume automated attacks can overwhelm a real-time IDS and its staff:
– An IDS needs to consider cost factors:
• Damage cost, response cost, operational cost, etc.
• Pure statistical accuracy is not an ideal metric:
– Base-rate fallacy of anomaly detection.
– Alternative: the cost (saving) of an IDS.

New Ideas and Hypotheses (2/2)
• Thorough analysis cannot always be done in real time by one sensor:
– Correlation of multiple sensor outputs.
– Trend or scenario analysis.
• Need better theories and tools for building misuse and anomaly detection models:
– Characteristics of normal data and attack signatures can be measured and utilized.

Main Approaches (1/2)
• Cost-sensitive models and architecture:
– Optimized for the cost metrics defined by users.
• Cost-sensitive machine learning algorithms.
– Multiple specialized and light sensors dynamically activated/configured at run time.
• "Load balancing" of models and data.
• Aggregation and correlation.
• Cost-effectiveness as the guiding principle and multi-model correlation as the architectural approach.

Main Approaches (2/2)
• Theories and tools for more effective anomaly and misuse detection:
– Information-theoretic measures for anomaly detection.
• The "regularity" of normal data is used to build the model.
– New algorithms, e.g.:
• Unsupervised learning over "noisy" data.
• Using "artificial anomalies".
– An automated system that integrates all these algorithms/tools.

Project Impacts (1/2)
• A better understanding of the cost factors, cost models, and cost metrics related to intrusion detection.
• Modeling techniques and deployment strategies for cost-effective IDSs:
– Provide the "best-valued" protection.
• "Clustering" techniques for grouping intrusions and building specialized and light sensors.
• An architecture for dynamically activating, configuring, and correlating sensors.

Project Impacts (2/2)
• More effective misuse and anomaly detection models:
– With sound theoretical foundations and automation tools.
• Analysis/correlation techniques for understanding, recognizing, and predicting complex attack scenarios.

Cost-Sensitive Modeling
• In previous quarters:
– Cost factors and metrics definition and analysis.
– Cost model definition.
– Cost-sensitive modeling with machine learning.
– Evaluation using DARPA off-line data.
• Current quarter:
– Real-time architecture.
– Dynamic cost-sensitive deployment and correlation of sensors.

A Multi-Layer/Component Architecture
[Diagram: an ID model builder distributes models to a backend IDS, a real-time IDS, remote IDSs/sensors, and a firewall, coordinated by dynamic cost-sensitive decision making.]

Next Steps
• Study "realistic" cost metrics in the real world.
• Implement a prototype system:
– Demonstrate the advantage of cost-sensitive modeling and dynamic cost-effective deployment.
• Use representative scenarios for evaluation.
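To make the decision-making component concrete, here is a minimal sketch of a cost-based response rule. The cost tables, numbers, and function name are hypothetical illustrations rather than the project's actual metrics; the idea is simply that an alert triggers a response only when the expected damage prevented outweighs the cost of responding.

```python
# Per-attack-class cost estimates; the categories come from the project's
# cost factors (damage cost, response cost), but the numbers are
# purely illustrative.
DAMAGE_COST = {"u2r": 100, "r2l": 50, "dos": 30, "probe": 2}
RESPONSE_COST = {"u2r": 40, "r2l": 40, "dos": 10, "probe": 5}

def should_respond(attack_class: str, confidence: float) -> bool:
    """Respond only when the expected damage prevented exceeds the
    response cost; otherwise just log, saving operational cost."""
    expected_damage = confidence * DAMAGE_COST[attack_class]
    return expected_damage > RESPONSE_COST[attack_class]

# A low-confidence probe is ignored; a confident U2R alert draws a response.
print(should_respond("probe", 0.6))  # False: 0.6 * 2   = 1.2 <= 5
print(should_respond("u2r", 0.9))    # True:  0.9 * 100 = 90  > 40
```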
An Automated System for Feature and Model Construction

The Data Mining Process of Building ID Models
[Diagram: raw audit data → packets/events (ASCII) → connection/session records → detection models.]

Feature Construction From Patterns
[Diagram: mining new intrusion records yields intrusion patterns; comparing them against patterns mined from normal and historical intrusion records yields features, which produce training data for learning detection models.]

Status and Next Steps
• The effectiveness of the algorithms/tools (process steps) has been validated:
– 1998 DARPA Evaluation.
• Automating the process:
– Process steps "chained" together.
– Process iteration: under development.
• Field test:
– Advanced Technology Systems, General Dynamics.
– Planned public release: 2Q 2001.
• Dealing with "unlabeled" data:
– Integrate the "anomaly detection over noisy data" algorithms (Columbia).

Information-Theoretic Measures for Anomaly Detection
• Motivations:
– Need formal understandings.
• Hypothesis:
– Anomaly detection is based on the "regularity" of normal data.
• Approach:
– Entropy and conditional entropy measure regularity:
• Determine how to build a model.
– Relative (conditional) entropy measures how the regularities of the training and test datasets relate:
• Determine the performance of a model on test data.

Case Studies
• Anomaly detection for Unix processes:
– "Short sequences" as the normal profile.
– A classification approach:
• Given the first k system calls, predict the (k+1)st system call.
– How to determine the "sequence length" k? Will including other information help?
– UNM sendmail system call traces.
– MIT Lincoln Lab BSM data.
• Anomaly detection for networks:
– How to partition the data: refining this complex subject.
– MIT Lincoln Lab tcpdump data.

Entropy and Conditional Entropy
H(X) = -\sum_{x} P(x) \log P(x)
• "Impurity" of the dataset:
– The smaller (the more regular), the better.
H(X|Y) = -\sum_{x,y} P(x,y) \log P(x|y)
• "Irregularity" of sequential dependencies:
– The "uncertainty" of a sequence after seeing its prefix (subsequences).
– The smaller (the more regular), the better.

Relative (Conditional) Entropy
relEntropy(p|q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}
relCondEntropy(p|q) = \sum_{x,y} p(x,y) \log \frac{p(x|y)}{q(x|y)}
• How different p is from q:
– How different the regularity of the test data is from that of the training data.
– The smaller, the better.

Information Gain and Classification
Gain(X, A) = H(X) - \sum_{v \in Values(A)} \frac{|X_v|}{|X|} H(X_v)
• How much attribute/feature A contributes to the classification process:
– The reduction of entropy when the dataset is partitioned according to the values of A.
– The larger, the better.
• If A = the first k events in a sequence (i.e., Y) and the class label is the (k+1)st event:
– The conditional entropy H(X|Y) is just the second term of Gain(X, A).
– The smaller the conditional entropy, the better the performance of the classifier.
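For concreteness, here is a small sketch of computing H(X|Y) over sliding windows of system calls, as in the case studies above; the function name and the toy trace are illustrative, not the UNM data.

```python
import math
from collections import Counter

def conditional_entropy(trace, k):
    """H(X|Y): uncertainty of the next system call given the previous k.

    Computed over sliding windows as
    H(X|Y) = -sum_{x,y} P(x, y) log2 P(x | y).
    """
    joint = Counter()   # counts of (k-prefix, next-call) pairs
    prefix = Counter()  # counts of k-length prefixes
    for i in range(len(trace) - k):
        y = tuple(trace[i:i + k])
        x = trace[i + k]
        joint[(y, x)] += 1
        prefix[y] += 1
    n = sum(joint.values())
    # log(prefix[y] / c) = -log P(x|y); this form avoids printing -0.0.
    return sum((c / n) * math.log(prefix[y] / c, 2)
               for (y, x), c in joint.items())

# Sweep the window size k, as in the UNM experiments: regularity improves
# (entropy falls) as k grows, at increasing computational cost.
trace = [3, 5, 3, 5, 7, 3, 5, 3, 5, 7] * 50  # toy trace, not real data
for k in (1, 3, 5):
    print(k, round(conditional_entropy(trace, k), 4))
```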
Conditional Entropy of Training Data (UNM)
[Chart: conditional entropy of the UNM sendmail training traces (bounce-1.int, bounce.int, queue.int, plus.int, sendmail.int, total, mean) over sliding window sizes 1–17.]

Misclassification Rate: Training Data
[Chart: misclassification rate on the training data for the same traces over sliding window sizes 1–17.]

Conditional Entropy vs. Misclassification Rate
[Chart: total and mean conditional entropy plotted alongside total and mean misclassification rate over sliding window sizes 1–17.]

Misclassification Rate of Testing Data and Intrusion Data
[Chart: misclassification rates for the normal test traces (bounce-1.int, bounce.int, queue.int, plus.int, sendmail.int, total, sm-10763.int) and the intrusion traces (syslog-local-1.int, fwd-loops-1.int through fwd-loops-5.int) over sliding window sizes 1–17.]

Relative Conditional Entropy btw. Training and Testing Normal Data
[Chart: relative conditional entropy between the training and testing normal traces over sliding window sizes 1–17.]

(Real and Estimated) Accuracy/Cost (Time) Trade-off
[Chart: real and estimated accuracy/cost ratios (total and mean) over sliding window sizes 1–17.]

Conditional Entropy of In- and Out-bound Email (MIT/LL BSM)
[Chart: conditional entropy of the s-o-in0, s-in0, so-in0, s-o-out0, s-out0, and so-out0 datasets over sliding window sizes 1–17.]

Relative Conditional Entropy
[Chart: relative conditional entropy for the same six BSM datasets over sliding window sizes 1–17.]

Misclassification Rate of Inbound Email
[Chart: misclassification rates for s-o-in0, s-in0, and so-in0 (80% and 20% variants) over sliding window sizes 1–17.]

Misclassification Rate of Outbound Email
[Chart: the corresponding rates for s-o-out0, s-out0, and so-out0.]

Accuracy/Cost Trade-off
[Chart: accuracy/cost ratios for the six BSM datasets and their mean over sliding window sizes 1–17.]

Estimated Accuracy/Cost Trade-off
[Chart: estimated accuracy/cost ratios for the same datasets and their mean.]

Key Findings
• The "regularity" of data can guide how to build a model:
– For sequential data, conditional entropy directly influences the detection performance.
• It determines the (best) sequence length, and whether to include more information, before building a model.
• When cost is also considered, the "optimal" model can be determined.
• Detection performance on test data can be attained only if its regularity is similar to that of the training data.

Next Steps
• Study how to measure more complex environments:
– Network topology/configuration/traffic, etc.
• Extend the principle/approach to misuse detection:
– Measure normal data, attack data, and their relationship:
• "Parameter adjustment", performance prediction.

New Anomaly Detection Approaches
• Unsupervised training methods:
– Build models over noisy (not clean) data.
• Artificial anomalies:
– Improve the performance of misuse and anomaly detection methods.
• Network traffic anomaly detection.

AD over Noisy Data
• Builds normal models over data containing some anomalies.
• Motivating assumptions:
– Intrusions are extremely rare compared to normal behavior.
– Intrusions are quantitatively different.

Approach Overview
• Mixture model:
– Normal component.
– Anomalous component.
• Build a probabilistic model of the data.
• Maximum likelihood test for detection.

Mixture Model of Anomalies
• Assume a generative model:
– The data is generated with a probability distribution D.
• Each element originates from one of two components:
– M, the majority distribution (x ∈ M).
– A, the anomalous distribution (x ∈ A).
• Thus: D = (1 − λ)M + λA.
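The following toy sketch illustrates one way the maximum likelihood test can be realized: an element is moved from M to A when doing so increases the likelihood of the data under D. The multinomial estimate for P_M and the uniform P_A are simplifying assumptions for illustration; any probability modeling method can be plugged in (next slide).

```python
import math
from collections import Counter

def detect_anomalies(data, lam=0.05):
    """Toy mixture-model detector: move x from the majority set M to the
    anomaly set A when doing so increases the log-likelihood of the data
    under D = (1 - lam) * M + lam * A.  P_M is a simple multinomial
    estimate and P_A is uniform; both are illustrative choices."""
    majority = list(data)
    anomalies = []
    p_a = 1.0 / len(set(data))  # uniform anomalous component

    def log_likelihood(m, a):
        counts, n = Counter(m), max(len(m), 1)
        ll = sum(c * math.log((1 - lam) * c / n) for c in counts.values())
        return ll + len(a) * math.log(lam * p_a)

    for x in list(majority):
        m_without = list(majority)
        m_without.remove(x)
        if log_likelihood(m_without, anomalies + [x]) > \
           log_likelihood(majority, anomalies):
            majority = m_without
            anomalies.append(x)
    return anomalies

# 2% of the elements are "quantitatively different" and get flagged.
print(detect_anomalies(["a"] * 98 + ["b"] * 2))  # ['b', 'b']
```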
Modeling Probability Distributions
• Train probability distributions over the current sets M and A:
– PM(X) = probability distribution for the majority.
– PA(X) = probability distribution for the anomalies.
• Any probability modeling method can be used:
– Naïve Bayes, Max Entropy, etc.

Experiments
• Two sets of experiments:
– Measured performance against comparison methods over noisy data.
– Measured performance trained over noisy data against comparison methods trained over clean data.
• The method was robust in both comparisons.

AD Using Artificial Anomalies
• Generate abnormal behavior artificially, as sketched below:
– Assume the given normal data are representative.
– "Near misses" of normal behavior are considered abnormal.
– Change the value of only one feature in an instance of normal behavior.
– Sparsely represented values are sampled more frequently.
– "Near misses" help define a tight boundary enclosing the normal behavior.
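A minimal sketch of this "near miss" generation, with hypothetical names and toy records; weighting candidate values inversely to their frequency is one plausible reading of "sparsely represented values are sampled more frequently".

```python
import random
from collections import Counter

def artificial_anomalies(records, n_samples, seed=0):
    """Generate "near misses": copy a normal record and replace the value
    of exactly one feature with a different value seen elsewhere in the
    data.  Candidate values are weighted inversely to their frequency, so
    sparsely represented values are sampled more often."""
    rng = random.Random(seed)
    n_features = len(records[0])
    counts = [Counter(rec[f] for rec in records) for f in range(n_features)]
    anomalies = []
    while len(anomalies) < n_samples:
        rec = list(rng.choice(records))
        f = rng.randrange(n_features)
        candidates = [v for v in counts[f] if v != rec[f]]
        if not candidates:
            continue  # this feature has a single value; nothing to swap in
        weights = [1.0 / counts[f][v] for v in candidates]
        rec[f] = rng.choices(candidates, weights=weights, k=1)[0]
        anomalies.append(tuple(rec))
    return anomalies

normal = [("tcp", "http", "SF"), ("tcp", "http", "SF"), ("udp", "dns", "SF")]
print(artificial_anomalies(normal, 4))  # e.g. [('udp', 'http', 'SF'), ...]
```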
Experimental Results
• Learning algorithm: RIPPER.
• Data: 1998 DARPA evaluation:
– U2R, R2L, DOS, PRB: 22 "clusters".
• Training data: normal data and artificial anomalies.
• Results:
– Overall detection rate: 94.26%.
– Overall false alarm rate: 2.02%.
– 100% detection: buffer_overflow, guess_passwd, phf, back.
– 0% detection: perl, spy, teardrop, ipsweep, nmap.
– 50+% detection: 13 out of 22 intrusion subclasses.

Combining Anomaly and Misuse Detection
• Training data: normal data, artificially generated anomalies, and known intrusion data.
• The learned model can predict normal, anomaly, or a known intrusion subclass.
• Experiments were performed on increasing subsets of known intrusion subclasses in the training data (simulating intrusions identified over time).

Combining Anomaly and Misuse Detection (continued)
• Suppose phf, pod, teardrop, spy, and smurf are unknown (absent from the training data).
• Anomaly detection rates: phf = 25%, pod = 100%, teardrop = 93.91%, spy = 50%, smurf = 100%.
• Overall false alarm rate: 0.20%.
• The false alarm rate dropped from 2.02% to 0.20% when some known attacks were included for training.

Adaptive Combined Anomaly and Misuse Detection
• Completely re-training the model whenever a new intrusion is found is a very expensive and slow process.
• An effective and fast remedy is very important to thwart these attacks.
• Full re-training is still necessary when time and resources permit.

Multiple Model Adaptive Approach
• Generate an additional detection module specialized for detecting the newly discovered intrusion:
– Method 1: trained from normal and new intrusion data.
– Method 2: trained from new intrusion data and artificial anomalies.
• When the old classifier predicts "anomaly", the new classifier further examines the record to check whether it is the new intrusion.

Multiple Model Adaptive Experiment
• The "old model" is trained from n intrusions.
• A lightweight model is trained from one new intrusion type.
• They are combined as an ensemble.
• The accuracy and training time are compared with one model trained from n + 1 intrusions.

Multiple Model Adaptive Experiment Result
• The accuracy difference is very small:
– Recall: +3.4%.
– Precision: −16%.
– In other words, the ensemble approach detects more of the new intrusion, but also misidentifies more anomalies as the new intrusion.
• Training time: roughly a 150-fold difference! A cup of coffee versus one or two days.

Detecting Anomalies in Network Traffic (1/2)
• Can we detect intrusions by identifying novel values in network packets?
• Anomaly detection is potentially useful in detecting novel attacks.
• Our model is trained on attack-free tcpdump data.
• Fields in the Transport layer or below are considered.

Detecting Anomalies in Network Traffic (2/2)
• Normal field values are learned.
• During evaluation, a function scores a packet based on the likelihood of encountering novel field values.
• Initial results indicate our learned model compares favorably with other systems on the 1999 DARPA evaluation data.

Packet Fields
• Fields in the Data-link, Network, and Transport layers:
– (The Application layer will be considered later.)
• Ethernet: source, destination, protocol.
• IP: header length, TOS, fragment ID, TTL, transport protocol …
• TCP: header length, UAPRSF flags, URG pointer …
• UDP: length …
• ICMP: type, code …

Anomaly Scoring Function (1/2)
• N1 = number of unique values seen in a field in the training data.
• N = number of packets in the training data.
• The likelihood of observing a novel value in a field is N1 / N (the escape probability; Witten and Bell, 1991).

Anomaly Scoring Function (2/2)
• Non-stationary model: consider the last occurrence of novel values.
• t = number of seconds since the last novel value in the same field.
• Likelihood of observing an anomaly: P = (N1 / N) × (1 / t).
• Field anomaly score: Sf = 1 / P.
• Packet anomaly score = Σf Sf, summed over the packet's fields.
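A compact sketch of this scoring function; the class and method names are hypothetical. Training tracks N and N1 per field, and a novel value observed t seconds after the last novelty scores Sf = 1/P = t × N / N1.

```python
class FieldModel:
    """Per-field novelty model from the slides above (names hypothetical).

    Training counts N packets and the N1 distinct values seen for a field;
    at detection time, a never-seen value observed t seconds after the
    last novel value scores S_f = 1 / P = (N / N1) * t.
    """

    def __init__(self):
        self.values = set()
        self.n_packets = 0
        self.last_novel = 0.0  # timestamp of the most recent novel value

    def train(self, value, timestamp):
        self.n_packets += 1
        if value not in self.values:
            self.values.add(value)
            self.last_novel = timestamp

    def score(self, value, timestamp):
        if value in self.values:
            return 0.0  # known value: nothing anomalous
        t = max(timestamp - self.last_novel, 1e-6)
        p = (len(self.values) / self.n_packets) * (1.0 / t)
        self.values.add(value)          # non-stationary update
        self.last_novel = timestamp
        return 1.0 / p                  # S_f = t * N / N1

def packet_score(field_models, packet, timestamp):
    """Packet anomaly score: the sum of S_f over the packet's fields."""
    return sum(m.score(packet[f], timestamp) for f, m in field_models.items())

fm = FieldModel()
for ts, v in [(0.0, 80), (1.0, 80), (2.0, 443)]:  # toy training stream
    fm.train(v, ts)
print(fm.score(8080, 10.0))  # novel value: t=8, N=3, N1=2 -> score 12.0
```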
Experiments
• 1999 DARPA evaluation data (from Lincoln Lab).
• Same mechanism as DARPA in determining detection (correct IP address of the victim, within 60 seconds before and after an attack).
• Score thresholds of our system and others are lowered to produce no more than 100 false alarms.
• Some of the other systems use binary scoring.

Initial Results

IDS         TP/FP (All)  TP/FP (Network)  IDS Type
Oracle      200/0        72/0             ideal
FIT         64/100       51/100           anomaly
GMU         51/22        27/22            anomaly+signature
NYU         20/80        14/80            signature
SUNY        24/9         19/9             signature
NetSTAT     70/995       35/995           signature
EmeraldTCP  83/23        35/23            signature

Discussion
• All attacks: more detections with 100 or fewer false alarms than most systems, except Emerald and NetSTAT.
• Network attacks: more detections with 100 or fewer false alarms than the other systems.
• 57 out of 72 attacks were detected with 100 false alarms.
• Our initial experiments did not look at fields in the Application protocol layer.

Summary of Progress
• Florida Tech's official start date: August 30, 2000.
• Near-term objective: using learning techniques to build anomaly detection models that can identify intrusions.
• Progress: initial experimental results on the 1999 DARPA evaluation data indicate that our techniques compare favorably with the other systems in detecting network attacks.

Plans for the Next Quarter
• Investigate an entropy approach to detecting anomalies.
• Study methods that incorporate more information from packets prior to the current packet.
• Examine how effective our techniques are with respect to individual attack types.
• Devise techniques to catch attack types that are undetected.
• Incorporate fields in the Application protocol layer into our model.

Anomaly Detection: Summary and Plans
• Anomaly detection is a main focus.
• Both theories and new approaches.
• Will integrate:
– Theories applied to develop new AD sensors.
– Cost-sensitive measures.
– Real-time architecture/performance studies.
– The automated feature and model construction system.

Correlation Analysis of Attack Scenarios
• Motivations:
– Detecting individual attack actions is not adequate:
• Damage assessment, trend prediction, etc.
• Hypothesis:
– Attacks are related, and such correlation can be learned.
• Approach:
– Start with crude knowledge models.
– Use data mining to validate/refine the models.
– An IETF/IDWG-based architecture/system.

Objectives (1/2)
• Local/low-layer correlations in an IDS:
– Multiple sources of raw (audit) data:
• Raw information: tcpdump data, BSM records …
• Based on specific attack signatures, system/user normal profiles …
– Benefits:
• Better accuracy: higher TP, lower FP.
• More alarm information for higher-level and global analysis.

Objectives (2/2)
• Global/high-layer correlations:
– Multiple sources of alarms from IDSs.
– The bigger picture:
• What really happened in our networks?
• What can we learn from these cases?
– Benefits:
• What is the intention of the attacks?
• What will happen next? When? Where?
• What can we do to prevent it from happening?

Architecture of Global Correlation System
[Diagram: alarms from IDSs flow into an alarm collection center, through an alarm pre-processor to the correlation engine, then to an alarm post-processor and report center; a knowledge base with a knowledge controller supports the correlation engine.]

Correlation Techniques from Network Management Systems (1/2)
• Rule-Based Reasoning (RBR):
– If-then rules based on domain knowledge and expertise.
– Sufficient for small, non-changing, and well-understood systems.
• Model-Based Reasoning (MBR):
– Models both physical and logical entities, such as hubs, routers …
– Correlation is a result of the collaboration among models.

Correlation Techniques from Network Management Systems (2/2)
• State-Transition Graph (STG):
– Logical connections via state transitions.
– May lead to unexpected behavior if the collaborating STGs are not carefully defined.
• Case-Based Reasoning (CBR):
– Learns from experience and offers solutions to novel problems based on past cases.
– Needs a similarity metric to retrieve useful cases from the library.

Correlation Techniques for IDS
• Combination of different correlation techniques:
– Network complexity.
– Wide variety of attack motives and tools.
• Adaptation of different correlation techniques:
– Different perspectives between NMS and IDS.

Challenges of Correlation (1/2)
• Knowledge representation:
– How to represent objects such as alarms, log files, and network entities?
– How to model knowledge such as network topology, network history, the intrusion library, and previous cases?

Challenges of Correlation (2/2)
• Knowledge base construction:
– What kind of knowledge base do we need?
– How to construct the knowledge base?
• Case library.
• Network knowledge.
• Intrusion knowledge.
– Pattern discovery (domain knowledge/expert system, data mining …).

A Case Study: DDoS
• An attack scenario from MIT/LL:
– Phase 1: IPSweep of the AFB from a remote site.
– Phase 2: Probe of live IPs to look for the 'sadmind' daemon running on Solaris hosts.
– Phase 3: Break-ins via the 'sadmind' vulnerability.
– Phase 4: Installation of the trojan 'mstream' DDoS software on three hosts at the AFB.
– Phase 5: Launching the DDoS.

Alarm Model
• Object-oriented.
• Alarm A: {feature1, feature2, …}.
• Features of an alarm:
– Attack type.
– Time stamp.
– Service.
– Source IP / domain.
– Target IP / domain.
– Target number.
– Source type (router, host, server …).
– Target type (router, host, server …).
– Duration.
– Frequency within a time window.

Alarm Model (Example)
• IP sweep: 09:51:51 ICMP ppp5-23.iawhk.com 172.16.115.x 20 hosts/servers 9 1:
– Attack type: IP sweep.
– Time stamp: 09:51:51.
– Service: ICMP.
– Source IP: ppp5-23.iawhk.com.
– Target IP: 172.16.115.x.
– Target number: 20.
– Source type: n/a.
– Target type: hosts and servers.
– Duration: 9 seconds.
– Frequency: 1.
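Rendered as code, the alarm object might look like the sketch below; the field names are illustrative renderings of the features listed above, not a defined interchange format (alarm formats are being coordinated with IETF/IDWG, as noted later).

```python
from dataclasses import dataclass

@dataclass
class Alarm:
    """Object-oriented alarm carrying the features listed above."""
    attack_type: str
    timestamp: str       # time stamp, e.g. "09:51:51"
    service: str
    source_ip: str       # source IP or domain
    target_ip: str       # target IP or domain
    target_number: int   # number of targets
    source_type: str     # router, host, server, ... or "n/a"
    target_type: str
    duration: int        # seconds
    frequency: int       # occurrences within the time window

# The IP-sweep example from the slide, encoded as an Alarm instance.
ip_sweep = Alarm(
    attack_type="IP Sweep", timestamp="09:51:51", service="ICMP",
    source_ip="ppp5-23.iawhk.com", target_ip="172.16.115.x",
    target_number=20, source_type="n/a", target_type="hosts and servers",
    duration=9, frequency=1,
)
```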
Scenario Representation (1/2)
• Attack scenario graph:
– Constructed from domain knowledge:
• Can be validated/augmented via data mining.
– Describes attack scenarios via state transitions.
– Each transition carries a probability P.
– Modifiable by experts.
– Adaptive to new cases.

Scenario Representation (2/2)
• Example of an attack scenario graph:
[Diagram: states include IP Sweep, Port Scan, Buffer Overflow, Trojan Installation, and DDoS variants (TFN2K, Trinoo, Mstream, SMURF, Syn Flood, UDP Flood).]

Correlation Rule Sets
• Based on:
– The attack scenario graph.
– Domain knowledge and expertise.
– The case library.
• Two layers of rule sets:
– A lower layer for matching/correlating specific alarms.
– A higher layer for trend prediction.
– Probabilities assigned.

Correlation Rule Sets (Lower Layer)
• Example of lower-layer rules:
– If (A1.type = "IP Sweep" & A2.type = "Port Scan") & (A1.time < A2.time) & (A1.domain = A2.domain) & (A2.target# > 10), then correlate A1 & A2.
– If (A2.type = "Port Scan" & A3.type = "Buffer Overflow") & (A2.time < A3.time) & (A3.DestIP belongs to A2.domain) & (A3.target# >= 2), then correlate A2 & A3.

Correlation Rule Sets (Higher Layer)
• Example of higher-layer rules:
– If (A1 & A2) and (A2 & A3), then A1 & A2 & A3.
– If (A1 & A2 & A3), then the attack scenario is:
A1 -> A2 -> A3 -> A4 with probability P1, or
A1 -> A2 -> A3 -> A4 -> A5 with probability P2.
– E.g., if ("IP Sweep" & "Port Scan" & "Buffer Overflow"), then next1 = "Trojan Installation" with P1, and next2 = "DDoS" with P2.
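For concreteness, the first lower-layer rule might be encoded as follows, reusing the hypothetical Alarm class from the earlier sketch; the domain test and the pairwise matching pass are illustrative assumptions, not the project's rule engine.

```python
def same_domain(addr_a, addr_b):
    """Illustrative domain test: compare everything before the last label."""
    return addr_a.rsplit(".", 1)[0] == addr_b.rsplit(".", 1)[0]

def ip_sweep_then_port_scan(a1, a2):
    """First lower-layer rule above: an IP Sweep followed by a Port Scan
    in the same domain that touched more than 10 targets.  Same-day
    "HH:MM:SS" strings compare correctly in lexicographic order."""
    return (a1.attack_type == "IP Sweep"
            and a2.attack_type == "Port Scan"
            and a1.timestamp < a2.timestamp
            and same_domain(a1.target_ip, a2.target_ip)
            and a2.target_number > 10)

def correlate(alarms, rules):
    """Pairwise pass of the lower-layer rules; the matched pairs would
    feed the higher-layer rules for scenario recognition and prediction."""
    return [(a1, a2) for a1 in alarms for a2 in alarms
            for rule in rules if a1 is not a2 and rule(a1, a2)]
```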
Status and Next Steps
• At the very beginning of this research.
• Attack scenario graph:
– How to construct it automatically?
– How to model the statistical properties of attack scenario state transitions?
• How to automatically generate the correlation rule sets?
• Collaboration with other groups:
– Alarm formats, architecture, IETF/IDWG.

Real-time System Implementation
• Motivations:
– Validate our algorithms and models in the real world.
– Faster technology transfer and greater impact.
• Approach:
– Collaboration with industry:
• Reuse available "building blocks" as much as possible.

Conceptual Architecture
[Diagram: sensors send data to a data warehouse; the adaptive model generator builds models from warehouse data and distributes them to detectors, which also receive sensor data.]

System Architecture
[Diagram: an adaptive model generation layer (supervised machine learning, unsupervised machine learning, real-time data mining) sits above the data warehouse; sensors include NT, Linux, and Solaris host-based IDSs, an NFR network-based IDS, a malicious email filter, a "meta" IDS, file system wrappers, and software wrappers.]

Sensor: Host-Based IDS
• Generic interface to sensors:
– BAM (Basic Auditing Module).
– Sends data to the data warehouse.
– Receives models from the data warehouse.
• NT system:
– Fully operational.
• Linux system & BSM (Solaris) system:
– Sensor operational.
– Under construction.
• Plan to finish construction by the end of the semester.

Sensor: Network IDS
• NFR-based sensor:
– Data mining based.
• Efficient evaluation architecture:
– Multiple models.
• System operational and integrated with the larger system.

Sensor: Malicious Email Filter
• Monitors email (sendmail):
– Detects malicious emails entering the domain.
• Key features:
– Model based.
– Generalizes to unknown malicious attachments.
– Models distributed automatically to filters.
• Status:
– Prototype operational.
– Open source release by the end of the semester.

Sensor: Advanced IDS Sensors
• File wrappers.
• Software wrappers.
• Monitor other aspects of the system.
• Status:
– File wrappers almost finished.
– Software wrappers under development.

Data Warehouse
• Stores data collected from sensors:
– Generic IDS data format.
– Data can be manipulated in the database.
– Cross-references data from attacks.
• Stores generated models.
• Status:
– Currently operational.
– Refining the interface and data transfer protocol.
– Completed by the end of the semester.

Adaptive Model Generator
• Builds models from data in the data warehouse.
• Uses both supervised and unsupervised data.
• Can build models based on the data collected.
• XML-based data exchange format.
• Status:
– Exchange formats defined.
– Prototype developed.
– Completion by the end of the semester.

Collaboration with Industry
• NFR.
• Cigital (RST).
• SAS.
• General Dynamics.
• Aprisma/Cabletron.
• HRL.

Publications and Software, etc.
• 4 journal and 10+ conference papers:
– One best paper and two runner-ups.
• JAM.
• MADAMID.
• PhDs: two graduated, one graduating, five in the pipeline …
• More to come …

Efforts: Current Tasks
• Cost-sensitive modeling (NCSU/Columbia/FIT).
• Automated feature and model construction (NCSU/Columbia/FIT):
– Integration of all algorithms and tools.
• Anomaly detection (NCSU/Columbia/FIT).
• Attack "clustering" and light modeling (FIT).
• Real-time architecture and systems (NCSU/Columbia).
• Correlation (NCSU).
• Collaboration with industry (NCSU/Columbia/FIT).