Learning to Detect Computer Intrusions with (Extremely) Few False Alarms

Jude Shavlik
Mark Shavlik

Two Basic Approaches for Intrusion Detection Systems (IDS)

- Pattern Matching
  - If packet contains “site exec” and … then sound alarm
  - Famous example: SNORT.org
  - Weakness: don’t (yet) have patterns for new attacks
- Anomaly Detection
  - Usually based on statistics measured during normal behavior
  - Weakness: does anomaly = intrusion?
- Both approaches often suffer from too many false alarms
  - Admins ignore an IDS when flooded by false alarms

How to Get Training Examples for Machine Learning?

- Ideally, get measurements during
  1. Normal operation vs.
  2. Intrusions
- However, hard to define the space of possible intrusions
- Instead, learn from “positive examples only”
  - Learn what’s normal and define all else as anomalous

Behavior-Based Intrusion Detection

- Need to go beyond looking solely at external network traffic and log files
  - File-access patterns
  - Typing behavior
  - Choice of programs run
  - …
- Like the human immune system, continually monitor and notice “foreign” behavior

Our General Approach

- Identify ≈unique characteristics of each user/server’s behavior
- Every second, measure 100’s of Windows 2000 properties
  - in/out network traffic, programs running, keys pressed, kernel usage, etc.
- Predict Prob( normal | measurements )
- Raise alarm if recent measurements seem unlikely for this user/server

Goal: Choose “Feature Space” that Widely Separates User from General Population

- Choose a separate set of “features” for each user

[Figure: probability distributions over the possible measurements in the chosen space, with the Specific User’s distribution well separated from the General Population’s]

What We’re Measuring (in Windows 2000)

- Performance Monitor (Perfmon) data
  - File bytes written per second
  - TCP/IP/UDP/ICMP segments sent per second
  - System calls per second
  - # of processes, threads, events, …
- Event-Log entries
- Programs running, CPU usage, working-set size
  - MS Office, Wordpad, Notepad
  - Browsers: IE, Netscape
  - Program development tools, …
- Keystroke and mouse events

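A minimal sketch of the kind of once-per-second sampling loop this slide describes, using the cross-platform psutil package as a stand-in for the Windows 2000 Performance Monitor counters (the particular properties sampled here are illustrative, not the paper’s actual counter set):

```python
import time
import psutil  # stand-in for Perfmon counters; illustrative only

def sample_properties():
    """Collect a handful of system properties (an illustrative subset)."""
    net = psutil.net_io_counters()
    disk = psutil.disk_io_counters()
    return {
        "bytes_sent": net.bytes_sent,
        "bytes_recv": net.bytes_recv,
        "disk_write_bytes": disk.write_bytes,
        "num_processes": len(psutil.pids()),
        "cpu_percent": psutil.cpu_percent(interval=None),
    }

if __name__ == "__main__":
    prev = sample_properties()
    for _ in range(5):                     # measure every second, as in the talk
        time.sleep(1)
        cur = sample_properties()
        # convert cumulative counters into per-second rates
        rates = {k: cur[k] - prev[k]
                 for k in ("bytes_sent", "bytes_recv", "disk_write_bytes")}
        rates["num_processes"] = cur["num_processes"]
        rates["cpu_percent"] = cur["cpu_percent"]
        prev = cur
        print(rates)
```
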
Temporal Aggregates

- Actual value measured
- Average of the previous 10 values
- Average of the previous 100 values
- Difference between current value and previous value
- Difference between current value and average of last 10
- Difference between current value and average of last 100
- Difference between averages of previous 10 and previous 100

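A small sketch (not the authors’ code) of how these temporal aggregates could be derived from a stream of raw one-second measurements, using deques as rolling windows:

```python
from collections import deque

def temporal_aggregates(stream):
    """Yield the seven temporal aggregates for each raw measurement in `stream`."""
    last10, last100 = deque(maxlen=10), deque(maxlen=100)
    prev = None
    for x in stream:
        avg10 = sum(last10) / len(last10) if last10 else x
        avg100 = sum(last100) / len(last100) if last100 else x
        yield {
            "value": x,
            "avg_prev_10": avg10,
            "avg_prev_100": avg100,
            "diff_prev": x - prev if prev is not None else 0.0,
            "diff_avg_10": x - avg10,
            "diff_avg_100": x - avg100,
            "diff_avg10_avg100": avg10 - avg100,
        }
        prev = x
        last10.append(x)
        last100.append(x)

# Example: aggregates for a toy sequence of per-second measurements
for feats in temporal_aggregates([3.0, 4.0, 10.0, 2.0]):
    print(feats)
```
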
Using (Naïve) Bayesian Networks

- Learning network structure too CPU-intensive
- Plus, naïve Bayes frequently works best
- Test-set results
  - Naïve Bayes: 59.2% of intrusions detected, about 2 false alarms per day per user
  - This paper’s approach: 93.6% detected, 0.3 false alarms per day per user

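For intuition, a naïve Bayes baseline of the Prob( normal | measurements ) style mentioned earlier could look like the sketch below; the discretized feature values and Laplace-smoothed per-feature counts are assumptions for illustration, not the paper’s exact setup:

```python
import math
from collections import defaultdict

class NaiveBayesProfile:
    """Per-user profile: estimates how likely a vector of discretized measurements is."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # feature -> value -> count
        self.totals = defaultdict(int)                       # feature -> total count

    def update(self, measurements):
        """Train on one second of (already discretized) normal-behavior measurements."""
        for feature, value in measurements.items():
            self.counts[feature][value] += 1
            self.totals[feature] += 1

    def log_prob(self, measurements, num_bins=10):
        """Log P(measurements | this user), assuming feature independence (naïve Bayes)."""
        lp = 0.0
        for feature, value in measurements.items():
            c = self.counts[feature][value]
            n = self.totals[feature]
            lp += math.log((c + 1) / (n + num_bins))  # Laplace smoothing
        return lp

# Toy usage
profile = NaiveBayesProfile()
profile.update({"files_opened": 2, "tcp_segments": 1})
profile.update({"files_opened": 2, "tcp_segments": 0})
print(profile.log_prob({"files_opened": 2, "tcp_segments": 5}))
```
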
Our Intrusion-Detection Template

[Figure: sliding window over the last W (window width) one-second measurements]

- If score(current measurements) > T, then raise a “mini” alarm
- If # of “mini” alarms in the window > N, then predict intrusion
- Use tuning set to choose good per-user values for T and N

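A minimal sketch of this template (the function name and the boolean encoding of mini alarms are my own, not the paper’s):

```python
from collections import deque

def detect_intrusions(scores, W, T, N):
    """Apply the window template: a per-second score above T is a mini alarm;
    more than N mini alarms among the last W seconds predicts an intrusion."""
    window = deque(maxlen=W)
    for t, score in enumerate(scores):
        window.append(score > T)            # mini alarm for this second?
        if sum(window) > N:
            yield t                         # second at which an intrusion is predicted

# Example: scores for 8 seconds, W=5, T=0.7, N=2
print(list(detect_intrusions([0.1, 0.9, 0.8, 0.2, 0.95, 0.1, 0.9, 0.85],
                             W=5, T=0.7, N=2)))
```
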
Methodology – for Training and Evaluating Learned Models

[Figure: User X’s recorded behavior is replayed against the learned models. An alarm from the model of a different user Y counts as an “intrusion” detected; an alarm from User X’s own model counts as a false alarm.]

Learning to Score Windows 2000 Measurements (done for each user)

1. Initialize weights on each feature to 1
2. For each training example do
   1. Set weightedVotesFOR = 0 and weightedVotesAGAINST = 0
   2. For each measurement i: if it is “unlikely” (i.e., low probability), add weight_i to weightedVotesFOR, else add weight_i to weightedVotesAGAINST
   3. If weightedVotesFOR > weightedVotesAGAINST, then raise a “mini alarm”
   4. If the decision about intrusion was incorrect, multiply the weights by ½ on all measurements that voted incorrectly (Winnow algorithm)

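A minimal sketch of this Winnow-style weighted vote; the demotion factor of ½ and the “unlikely” test come from the slide, while the particular probability threshold used to call a measurement unlikely is an assumption:

```python
def train_winnow(examples, unlikely_threshold=0.05, demotion=0.5):
    """examples: list of (probs, is_intrusion) pairs, where probs maps each
    measurement name to its estimated probability under the user's profile."""
    weights = {}
    for probs, is_intrusion in examples:
        for m in probs:
            weights.setdefault(m, 1.0)              # step 1: initialize weights to 1
        votes_for = votes_against = 0.0
        unlikely = {m: (p < unlikely_threshold) for m, p in probs.items()}
        for m, is_unlikely in unlikely.items():     # step 2: weighted vote per measurement
            if is_unlikely:
                votes_for += weights[m]
            else:
                votes_against += weights[m]
        mini_alarm = votes_for > votes_against      # step 3: raise "mini alarm"?
        if mini_alarm != is_intrusion:              # step 4: demote weights that voted wrong
            for m, is_unlikely in unlikely.items():
                if is_unlikely != is_intrusion:     # its vote disagreed with the true label
                    weights[m] *= demotion
    return weights

# Toy example: two training seconds, the second labeled as an intrusion
data = [({"files_opened": 0.8, "tcp_segments": 0.9}, False),
        ({"files_opened": 0.01, "tcp_segments": 0.9}, True)]
print(train_winnow(data))
```
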
Choosing Good Parameter Values

- For each user
  - Use training data to estimate probabilities and weight individual measurements
  - Try 20 values for T and 20 values for N
  - For each T × N pairing, compute detection and false-alarm rates on the tuning set
  - Select the T × N pairing that
    1. has a false-alarm rate less than spec (e.g., 1 per day), and
    2. has the highest detection rate

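A sketch of this per-user grid search; the evaluate() callback and the candidate grids are placeholders for however the tuning-set rates are computed:

```python
def choose_parameters(evaluate, T_candidates, N_candidates, max_false_alarms_per_day=1.0):
    """Pick the (T, N) pair that meets the false-alarm spec with the highest detection rate.
    `evaluate(T, N)` must return (detection_rate, false_alarms_per_day) on the tuning set."""
    best = None
    for T in T_candidates:
        for N in N_candidates:
            detection, false_alarms = evaluate(T, N)
            if false_alarms >= max_false_alarms_per_day:
                continue                                  # violates the spec
            if best is None or detection > best[0]:
                best = (detection, T, N)
    return best  # (detection_rate, T, N), or None if no pairing meets the spec

# Toy usage with a made-up evaluation function
fake_eval = lambda T, N: (min(1.0, T * 0.1 + N * 0.02), max(0.0, 2.0 - T))
print(choose_parameters(fake_eval, T_candidates=range(1, 21), N_candidates=range(1, 21)))
```
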
Experimental Data

- Subjects
  - Insiders: 10 employees at Shavlik Technologies
  - Outsiders: 6 additional Shavlik employees
- Unobtrusively collected data for 6 weeks
  - 7 GBytes archived
- Task: are the current measurements from user X?

Training, Tuning, and Testing Sets

- Very important in machine learning to not use testing data to optimize parameters!
- Train set: first two weeks of data
  - Build a (statistical) model
- Tune set: middle two weeks of data
  - Choose good parameter settings
  - Can tune to zero false alarms and high detection rates!
- Test set: last two weeks of data
  - Evaluate the “frozen” model

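A trivial sketch of the chronological split described above; the thirds here simply stand in for the two-week boundaries on a time-ordered stream of measurements:

```python
def chronological_split(records, train_frac=1/3, tune_frac=1/3):
    """Split time-ordered records into train / tune / test pieces (first, middle, last)."""
    n = len(records)
    a, b = int(n * train_frac), int(n * (train_frac + tune_frac))
    return records[:a], records[a:b], records[b:]

# 42 placeholder "days" of data -> three two-"week" pieces
train, tune, test = chronological_split(list(range(42)))
print(len(train), len(tune), len(test))
```
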
Experimental Results on the Testset

[Figure: insider detection rate, outsider detection rate, and false alarms (all in %) plotted against window width W in seconds (0 to 1200); a horizontal reference line marks one false alarm per day per user]

Highly Weighted Measurements (% of time in Top Ten across users & experiments)

- Number of Semaphores (43%)
- Logon Total (43%)
- Print Jobs (41%)
- System Driver Total Bytes (39%)
- CMD: Handle Count (35%)
- Excel: Handle Count (26%)
- Number of Mutexes (25%)
- Errors Access Permissions (24%)
- Files Opened Total (23%)
- TCP Connections Passive (23%)
- Notepad: % Processor Time (21%)

73 measurements appear in the Top Ten more than 10% of the time

Confusion Matrix – Detection Rates (3 of 10 Subjects Shown)

[Table: detection rates when one subject’s data (the “intruder”) is replayed against another subject’s model (the “owner”), for subjects A, B, and C; the six off-diagonal rates shown are 100%, 91%, 100%, 100%, 25%, and 94%]

Some Questions

- What if user behavior changes? (Called concept drift in machine learning)
  - One solution: assign a “half life” to the counts used to compute probabilities
  - Multiply the counts by f < 1 each day (10/20 vs. 1000/2000); see the sketch after this list
- CPU and memory demands too large?
  - Measuring features and updating counts takes < 1% CPU
  - Tuning of parameters needs to be done off-line
- How often to check for intrusions?
  - Only check when the window is full, then clear the window
  - Else too many false alarms

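A minimal sketch of the count-decay idea above; the daily factor f is derived from whatever half life is desired (f = 0.5 ** (1 / half_life_days)), which is an illustrative choice rather than the paper’s setting:

```python
def decay_counts(counts, f):
    """Multiply every count by f < 1 so old observations gradually lose influence.
    E.g., 1000/2000 decays toward 10/20: same ratio, but newer data shifts it faster."""
    return {feature: {value: c * f for value, c in values.items()}
            for feature, values in counts.items()}

# Decay factor for a 7-day half life, applied once per day
half_life_days = 7
f = 0.5 ** (1 / half_life_days)
counts = {"files_opened": {0: 1000.0, 1: 2000.0}}
for _ in range(7):
    counts = decay_counts(counts, f)
print(counts)  # roughly {"files_opened": {0: 500.0, 1: 1000.0}} after one half life
```
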
Future Directions

- Measure the system while applying various known intrusion techniques
  - Compare to measurements during normal operation
  - Train on known methods 1, …, N-1
  - Test using data from known method N
- Analyze simultaneous measurements from a network of computers
- Analyze the impact of an intruder’s behavior changing the recorded statistics
- Current results: probability of detecting the intruder in the first W seconds

Some Related Work on Anomaly Detection

- Machine learning for intrusion detection
  - Lane & Brodley (1998)
  - Ghosh et al. (1999)
  - Lee et al. (1999)
  - Warrender et al. (1999)
  - Agarwal & Joshi (2001)
  - Typically Unix-based
  - Streams of programs invoked or network traffic analyzed
- Analysis of keystroke dynamics
  - Monrose & Rubin (1997)
  - For authenticating passwords

Conclusions

- Can accurately characterize individual user behavior using simple models based on measuring many system properties
- Such “profiles” can provide protection without too many false alarms
- Separate data into train, tune, and test sets
- “Let the data decide” good parameter settings on a per-user basis (including which measurements to use)

Acknowledgements

- DARPA’s Insider Threat Active Profiling (ITAP) program within the ATIAS program
- Mike Fahland for help with data collection
- Shavlik Technologies employees who allowed collection of their usage data

Using Relative Probabilities

Alarm based on the ratio:

    Prob( keystrokes | machine owner ) / Prob( keystrokes | population )

[Figure: detection rate on the test set vs. window width W (10 to 640 seconds); the Relative Prob curve lies well above the Absolute Prob curve]

Value of Relative Probabilities

- Using relative probabilities separates rare for this user from rare for everyone
- Example of variance reduction
  - Reduce variance in a measurement by comparing to another (e.g., paired t-tests)

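A small sketch of a relative-probability score as a log ratio of the user model to a population model; representing both models as smoothed count tables is an assumption made for illustration:

```python
import math

def relative_log_prob(measurements, user_counts, population_counts, num_bins=10):
    """log [ P(measurements | this user) / P(measurements | general population) ].
    A strongly negative value means the behavior is rare for this user but not rare overall."""
    score = 0.0
    for feature, value in measurements.items():
        u_c = user_counts.get(feature, {}).get(value, 0)
        u_n = sum(user_counts.get(feature, {}).values())
        p_c = population_counts.get(feature, {}).get(value, 0)
        p_n = sum(population_counts.get(feature, {}).values())
        p_user = (u_c + 1) / (u_n + num_bins)          # Laplace-smoothed estimates
        p_pop = (p_c + 1) / (p_n + num_bins)
        score += math.log(p_user) - math.log(p_pop)
    return score

# Toy usage: a value that is rare for the user but common in the population
user = {"print_jobs": {0: 95, 1: 5}}
population = {"print_jobs": {0: 40, 1: 60}}
print(relative_log_prob({"print_jobs": 1}, user, population))  # negative => suspicious
```
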
Tradeoff between False Alarms and Detected Intrusions (ROC Curve)

[Figure: ROC-style curve of detection rate on the test set (0–100%) vs. false alarms per day (0.00 to 1.50), for W = 160; “spec” marks the target false-alarm rate on the x-axis]

Note: the left-most value results from ZERO tune-set false alarms

Conclusions

- Can accurately characterize individual user behavior using simple models based on measuring many system properties
- Such “profiles” can provide protection without too many false alarms
- Separate data into train, tune, and test sets
- “Let the data decide” good parameter settings on a per-user basis (including which measurements to use)
- Normalize probabilities by general-population probabilities
  - Separates rare for this user (or server) from rare for everyone

Outline

- Approaches for Building Intrusion-Detection Systems
- A Bit More on What We Measure
- Experiments with Windows 2000 Data
- Wrapup