Learning to Detect Computer Intrusions with (Extremely) Few False Alarms
Jude Shavlik and Mark Shavlik

Two Basic Approaches for Intrusion Detection Systems (IDS)
- Pattern matching: if packet contains "site exec" and ... then sound alarm
  - Famous example: SNORT.org
  - Weakness: we don't (yet) have patterns for new attacks
- Anomaly detection: usually based on statistics measured during normal behavior
  - Weakness: does anomaly = intrusion?
- Both approaches often suffer from too many false alarms; admins ignore an IDS when flooded by false alarms

How to Get Training Examples for Machine Learning?
- Ideally, get measurements during (1) normal operation vs. (2) intrusions
- However, it is hard to define the space of possible intrusions
- Instead, learn from "positive examples only": learn what's normal and define all else as anomalous

Behavior-Based Intrusion Detection
- Need to go beyond looking solely at external network traffic and log files:
  - File-access patterns
  - Typing behavior
  - Choice of programs run
  - ...
- Like the human immune system, continually monitor and notice "foreign" behavior

Our General Approach
- Identify ≈unique characteristics of each user's/server's behavior
- Every second, measure hundreds of Windows 2000 properties: in/out network traffic, programs running, keys pressed, kernel usage, etc.
- Predict Prob(normal | measurements)
- Raise an alarm if recent measurements seem unlikely for this user/server

Goal: Choose a "Feature Space" that Widely Separates the User from the General Population
[Figure: probability distributions of a specific user vs. the general population over the possible measurements in the chosen space]
- Choose a separate set of "features" for each user

What We're Measuring (in Windows 2000)
- Performance Monitor (Perfmon) data: file bytes written per second; TCP/IP/UDP/ICMP segments sent per second; system calls per second; number of processes, threads, events, ...
- Event-Log entries
- Programs running, CPU usage, working-set size: MS Office, Wordpad, Notepad; browsers: IE, Netscape; program-development tools, ...
- Keystroke and mouse events
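The once-per-second monitoring loop in the approach above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `measure` and `prob_normal` are hypothetical stand-ins for reading the hundreds of Windows 2000 properties and for the learned per-user model.

```python
import time

def monitor(measure, prob_normal, threshold, seconds, sleep=time.sleep):
    """Poll measurements once per second and collect those the learned
    per-user model considers unlikely for this user (candidate alarms)."""
    alarms = []
    for _ in range(seconds):
        m = measure()             # hypothetical: read hundreds of properties
        if prob_normal(m) < threshold:
            alarms.append(m)      # recent measurements look unlikely: alarm
        sleep(1.0)                # sample once per second
    return alarms
```

Passing `sleep` as a parameter lets tests (or replayed logs, as in the paper's methodology) run without real-time delays.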
Temporal Aggregates (computed for each measured property)
- Actual value measured
- Average of the previous 10 values
- Average of the previous 100 values
- Difference between current value and previous value
- Difference between current value and average of last 10
- Difference between current value and average of last 100
- Difference between averages of previous 10 and previous 100

Using (Naïve) Bayesian Networks
- Learning network structure is too CPU-intensive; plus, naïve Bayes frequently works best
- Test-set results: 59.2% of intrusions detected, about 2 false alarms per day per user
- This paper's approach: 93.6% detected, 0.3 false alarms per day per user

Our Intrusion-Detection Template
- Consider the last W (window width) measurements, taken once per second
- If score(current measurements) > T, then raise a "mini" alarm
- If the number of "mini" alarms in the window > N, then predict an intrusion
- Use a tuning set to choose good per-user values for T and N

Methodology for Training and Evaluating Learned Models
- Replay user X's behavior
- An alarm from the model of user Y counts as a detected "intrusion"
- An alarm from the model of user X counts as a false alarm

Learning to Score Windows 2000 Measurements (done for each user)
1. Initialize the weight on each feature to 1
2. For each training example:
   a. Set weightedVotesFOR = 0 and weightedVotesAGAINST = 0
   b. If measurement i is "unlikely" (i.e., low probability), add weight_i to weightedVotesFOR; else add weight_i to weightedVotesAGAINST
   c. If weightedVotesFOR > weightedVotesAGAINST, raise a "mini" alarm
   d. If the decision about the intrusion is incorrect, multiply the weights of all measurements that voted incorrectly by ½ (the Winnow algorithm)

Choosing Good Parameter Values
- For each user:
  - Use the training data to estimate probabilities and weight individual measurements
  - Try 20 values for T and 20 values for N
  - For each T×N pairing, compute detection and false-alarm rates on the tuning set
  - Select the T×N pairing whose (1) false-alarm rate is less than spec (e.g., 1 per day) and (2) detection rate is highest
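One pass of the Winnow-style scorer above can be sketched as follows, assuming the per-measurement "unlikely" flags (low-probability indicators) have already been computed. This is a sketch of the slide's steps, not the authors' code; the ½ demotion factor is the one named on the slide.

```python
def winnow_step(weights, unlikely, is_intrusion, demote=0.5):
    """One Winnow update: weighted vote, then halve weights of wrong voters.

    weights      -- per-measurement weights (initialized to 1.0)
    unlikely     -- booleans: True if measurement i looks unlikely (votes FOR)
    is_intrusion -- the true label for this training example
    Returns whether a "mini" alarm was raised; mutates weights in place.
    """
    votes_for = sum(w for w, u in zip(weights, unlikely) if u)
    votes_against = sum(w for w, u in zip(weights, unlikely) if not u)
    raised_alarm = votes_for > votes_against
    if raised_alarm != is_intrusion:
        # Demote every measurement that voted on the incorrect side.
        for i, voted_for_alarm in enumerate(unlikely):
            if voted_for_alarm != is_intrusion:
                weights[i] *= demote
    return raised_alarm

# Example: three measurements, unit starting weights, a correct decision
# (no update) on a normal-behavior example.
w = [1.0, 1.0, 1.0]
alarm = winnow_step(w, unlikely=[True, False, False], is_intrusion=False)
```

Because only the weights of incorrect voters shrink, measurements that reliably separate this user's behavior keep their influence while noisy ones fade.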
Experimental Data
- Subjects: insiders: 10 employees at Shavlik Technologies; outsiders: 6 additional Shavlik employees
- Unobtrusively collected data for 6 weeks; 7 GBytes archived
- Task: are the current measurements from user X?

Training, Tuning, and Testing Sets
- Very important in machine learning not to use testing data to optimize parameters!
- Train set (first two weeks of data): build a (statistical) model
- Tune set (middle two weeks): choose good parameter settings; can tune to zero false alarms and high detection rates!
- Test set (last two weeks): evaluate the "frozen" model

Experimental Results on the Testset
[Chart: insider detection rate, outsider detection rate, and false alarms (%) vs. window width W from 0 to 1200 seconds; a reference line marks one false alarm per day per user]

Highly Weighted Measurements (% of time in the top ten, across users and experiments)
- Number of Semaphores (43%)
- Logon Total (43%)
- Print Jobs (41%)
- System Driver Total Bytes (39%)
- CMD: Handle Count (35%)
- Excel: Handle Count (26%)
- Number of Mutexes (25%)
- Errors Access Permissions (24%)
- Files Opened Total (23%)
- TCP Connections Passive (23%)
- Notepad: % Processor Time (21%)
- 73 measurements occur > 10% of the time

Confusion Matrix: Detection Rates (3 of 10 subjects shown)

            Intruder A   Intruder B   Intruder C
  Owner A       --          100%          91%
  Owner B      100%          --          100%
  Owner C       25%          94%          --

Some Questions
- What if user behavior changes? (Called "concept drift" in machine learning.) One solution: assign a "half-life" to the counts used to compute probabilities, multiplying the counts by f < 1 each day (so 10/20 behaves differently from 1000/2000)
- Are the CPU and memory demands too large? Measuring features and updating counts takes < 1% CPU; tuning of parameters needs to be done off-line
- How often to check for intrusions?
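The half-life idea above can be sketched as follows. The decay factor f = 0.9 and the simulated behavior shift are illustrative, not values from the paper; the point is that 10/20 and 1000/2000 give the same probability estimate (0.5), but the smaller counts adapt much faster once behavior changes.

```python
def decay(counts, f=0.9):
    """Apply one day's decay: multiply every count by f < 1."""
    return {k: v * f for k, v in counts.items()}

def prob(counts):
    return counts["hits"] / counts["total"]

light = {"hits": 10.0, "total": 20.0}      # same ratio...
heavy = {"hits": 1000.0, "total": 2000.0}  # ...but far more inertia

# Simulate 50 days in which the measured feature is now always present
# (one new hit per day); decay first, then record the day's observation.
for _ in range(50):
    light = decay(light)
    heavy = decay(heavy)
    light["hits"] += 1.0
    light["total"] += 1.0
    heavy["hits"] += 1.0
    heavy["total"] += 1.0

# The lightly weighted estimate has moved much closer to the new behavior.
```

Without decay, both estimates would drift toward the new rate at speeds fixed by their accumulated totals; the half-life bounds how much old behavior can outvote recent behavior.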
- Only check when the window is full, then clear the window; else too many false alarms

Future Directions
- Measure the system while applying various known intrusion techniques; compare to measurements during normal operation
- Train on known methods 1, ..., N−1; test using data from known method N
- Analyze simultaneous measurements from a network of computers
- Analyze the impact of an intruder's behavior changing the recorded statistics (current results give the probability of detecting an intruder in the first W seconds)

Some Related Work on Anomaly Detection
- Machine learning for intrusion detection: Lane & Brodley (1998), Ghosh et al. (1999), Lee et al. (1999), Warrender et al. (1999), Agarwal & Joshi (2001)
  - Typically Unix-based; streams of programs invoked or network traffic analyzed
- Analysis of keystroke dynamics: Monrose & Rubin (1997), for authenticating passwords

Conclusions
- Can accurately characterize individual user behavior using simple models based on measuring many system properties
- Such "profiles" can provide protection without too many false alarms
- Separate the data into train, tune, and test sets
- "Let the data decide" good parameter settings on a per-user basis (including which measurements to use)

Acknowledgements
- DARPA's Insider Threat Active Profiling (ITAP) program within the ATIAS program
- Mike Fahland for help with data collection
- Shavlik, Inc. employees who allowed collection of their usage data

Using Relative Probabilities
- Alarm when Prob(keystrokes | machine owner) / Prob(keystrokes | population) is low
[Chart: detection rate on the testset vs. window width W from 10 to 640 seconds, comparing relative and absolute probabilities]

Value of Relative Probabilities
- Using relative probabilities separates "rare for this user" from "rare for everyone"
- An example of variance reduction: reduce the variance in a measurement by comparing it to another (e.g., paired t-tests)

Tradeoff between False Alarms and Detected Intrusions (ROC Curve)
[Chart: detection rate on the testset vs. false alarms per day from 0.0 to 1.5, with the spec marked, for W = 160]
- Note: the left-most value results from zero tune-set false alarms
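The relative-probability score above can be sketched as follows for a discretized measurement. The distributions and threshold here are hypothetical illustrations, not data from the experiments.

```python
def relative_score(value, user_prob, population_prob, eps=1e-9):
    """Score = P(value | this user) / P(value | general population).

    A low ratio means the value is rare for this user but common in
    general -- exactly the case worth alarming on; eps guards against
    division by zero for unseen values.
    """
    return user_prob.get(value, eps) / population_prob.get(value, eps)

# Hypothetical distributions over a binned keystroke-rate feature.
user_prob = {"low": 0.70, "medium": 0.25, "high": 0.05}
population_prob = {"low": 0.30, "medium": 0.40, "high": 0.30}

# "high" is rare for this user but common in the population: low ratio.
score_high = relative_score("high", user_prob, population_prob)
# "low" is typical for this user: high ratio, no alarm.
score_low = relative_score("low", user_prob, population_prob)
```

Dividing by the population probability is the variance-reduction step the deck describes: a value that is rare for everyone no longer looks anomalous for this particular user.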
Conclusions
- Can accurately characterize individual user behavior using simple models based on measuring many system properties
- Such "profiles" can provide protection without too many false alarms
- Separate the data into train, tune, and test sets; "let the data decide" good parameter settings on a per-user basis (including which measurements to use)
- Normalize probabilities by general-population probabilities: separates "rare for this user (or server)" from "rare for everyone"

Outline
- Approaches for building intrusion-detection systems
- A bit more on what we measure
- Experiments with Windows 2000 data
- Wrapup
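The per-user parameter selection described earlier (try a grid of T and N values on the tuning set, keep the pairing whose false-alarm rate is under spec and whose detection rate is highest) can be sketched as follows. `toy_evaluate` is a synthetic stand-in for running the detector over the tuning set, not results from the paper.

```python
def choose_parameters(t_values, n_values, evaluate, fa_spec=1.0):
    """Pick the (T, N) pairing with the highest tuning-set detection rate
    among pairings whose false-alarm rate is below fa_spec (e.g., 1/day).

    evaluate(T, N) -> (detection_rate, false_alarms_per_day) on the tune set.
    Returns (T, N), or None if no pairing meets the false-alarm spec.
    """
    best = None
    for t in t_values:
        for n in n_values:
            detection, false_alarms = evaluate(t, n)
            if false_alarms >= fa_spec:
                continue  # violates the false-alarm spec
            if best is None or detection > best[0]:
                best = (detection, t, n)
    return None if best is None else (best[1], best[2])

# Toy stand-in: raising T and N suppresses both kinds of alarms.
def toy_evaluate(t, n):
    detection = max(0.0, 1.0 - 0.01 * (t + n))
    false_alarms = max(0.0, 3.0 - 0.1 * (t + n))
    return detection, false_alarms

best_t, best_n = choose_parameters(range(1, 21), range(1, 21), toy_evaluate)
```

Because the search only ever consults tuning-set rates, the test set stays untouched until the final "frozen" model is evaluated, as the deck insists.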