Scalable Data Exploration and Novelty Detection

NIMD Grand Finale PI Meeting
April 18, 2006
Main contacts:
Prof. Jaime Carbonell, Carnegie Mellon University
Dr. Santosh Ananthraman, DYNAMiX Technologies
Project ARGUS Progression
1. Fast multi-dimensional matching of structured data
• Exact and approximate matching
• Scalable: Up to a trillion (10^12) records
• Profile matching for streaming data: Up to a million (10^6) profiles
2. Novelty detection in structured databases or data streams
• Detection and tracking of situation-specific “alert-watch” patterns
• Cluster analysis to establish background (normal) models
• Cluster density and locus analysis for early detection of new pattern onset, or meaningful changes to established patterns
3. Test Applications
• FED: Fedwire money-transfer database (simulated), which allows tracking of suspicious transaction patterns
• MED: Massachusetts hospital admission database to detect attacks by biological agents
• NED: Network-flow databases from two sources (CERT® at CMU, and the MIT Lincoln Labs), which allow detection of network attacks such as denial of service
4. ARGUS Data Explorer – a prototype analyst interface
• v0.8 = evaluation at SAIC RDEC PP and MITRE/NIST
• Framework for intensive, analyst-directed data exploration
• Challenge of harnessing the technologies into a robust, user-friendly package that helps analysts in significantly reducing the size of the proverbial “haystack”
Role of ARGUS in a Hypothetical End-to-End Multifunctional Architecture
[Figure: architecture diagram. Labeled components:
• Analysts, Mobile Agents, Analyst Workstation
• Raw data sources: Biometric, News, Customs, Materials, Reports, Financial, RSS, Annotations, Net Traffic, Other
• Analyst Interface: Analyst Collaboration, Data Source Prioritization, Active Context Control, Validation, Exploration, Query Generation, Hypothesis Management, Hypothesis Evaluation
• Analysis Subsystem – Text & Data: Data Normalization, Text Extraction & Modeling, Situation Assessment
• Distributed Structured Search Engines: Massive Search Control, Novelty Detection, Structured Data Search (exact, approximate, massive data, streaming data)
• Structured data stores: Banking Transactions, Network Traffic, Hospital Admissions, Extracted News Archives, Extracted Agency Reports
• Labeled flows: Profile Queries, Matched Events, Events and Alerts]
Information Flow
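[Figure: information-flow diagram. Labeled steps: Select, Create Background Model, Detect Novel Events, Re-cluster, Match, Generate Profiles, Update Profiles. Labeled data: Historical Data, New Data, Background Model, Novel Events, Novel Clusters, Alerts, Tracked Events, Profiles, New Profiles. The Analyst appears at both the alert and the profile-generation ends of the flow.]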
Novelty Detection
• Objective
– Detect the onset of novel events in incoming data streams
– Generate alerts for the analyst (with justifications, priorities, etc.)
– If judged significant, then track developments, else discard
• Properties
– Need a model of “business as usual” to detect divergences from it,
which is done by clustering recent history
– Control points (tradeoff in precision-recall)
• Degree of deviation from normalcy required (radial density functions)
• Amount of data support (e.g. number of observations) before alerting
• Statistical model of normal “noise” in data streams
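The control points above can be illustrated with a minimal sketch. It is not the ARGUS implementation: it assumes each background cluster is summarized by a centroid and a single radial spread, and the names `Cluster`, `is_outlier`, `detect_novel`, `max_sigmas`, and `min_support` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Cluster:
    """Background ("business as usual") cluster: centroid plus radial spread."""
    centroid: Sequence[float]
    sigma: float  # assumed: one standard deviation of member distances


def is_outlier(point, clusters, max_sigmas):
    """True if the point lies beyond max_sigmas * sigma of every cluster."""
    return all(math.dist(point, c.centroid) > max_sigmas * c.sigma
               for c in clusters)


def detect_novel(points, clusters, max_sigmas=3.0, min_support=5):
    """Return the outlying points, but only once enough of them accumulate.

    max_sigmas  -- degree of deviation from normalcy required
    min_support -- number of observations needed before alerting
    """
    outliers = [p for p in points if is_outlier(p, clusters, max_sigmas)]
    return outliers if len(outliers) >= min_support else []


background = [Cluster(centroid=(0.0, 0.0), sigma=1.0)]
stream = [(0.1, 0.2), (9.0, 9.0), (9.1, 8.9), (8.8, 9.2)]
print(len(detect_novel(stream, background, min_support=3)))  # 3
print(len(detect_novel(stream, background, min_support=5)))  # 0
```

Raising `max_sigmas` or `min_support` favors precision (fewer, better-supported alerts); lowering them favors recall.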
Cluster Evolution and Density Change Detection
[Figure: four panels labeled Constant Event, New Unobfuscated Event, New Obfuscated Event, and Growing Event]
ARGUS Data Explorer Implementation
• A Client-Server System
– Data Explorer Client
• GUI components, i.e., DYNAMiX’s 2-D GUI embedded with ManTech’s
3-D GUI module
• Interface connecting the GUI components to the Web Service API
– ARGUS Server
• Web Service API that delegates tasks to the application layer
• Application Layer, which encompasses the core application functionality
including clustering, novelty detection, re-clustering, exact and
approximate matching
• Data Access Layer, which includes application functionality such as set
operations and data exchanges between the application layer and the
database
• DYNAMiX iX server used as a component for matching structured data
• Data store (database)
ARGUS Data Explorer
• Typical Hardware
– Server
• Processor: Intel® Xeon™, 3.0GHz, 2MB Cache
• Memory: 8GB DDR2 400MHz (4X2GB), Dual Ranked DIMMs
• Disk Space: 300 GB
– Client
• Processor: Pentium 3 or higher
• Memory: 512MB
• Disk Space: 100MB
• Graphics subsystem: High-performance, supporting 1280 x 1024 resolution
• Typical Software
– Server
• Operating System: Red Hat Enterprise Linux ES v4
• Other: Oracle 9i; JSDK V1.4.2 (Tomcat); DYNAMiX iX
– Client
• Operating System: Windows 2000; Mac OS 10.4
• Other: Java3D; JRE V1.4.2_11 (Swing Application)
3-D Data Cluster Display
3-D data and cluster display region. In this region the display also shows the X, Y, and Z axes, with axis labels for the quantitative data dimensions chosen by the user.
Two sliders control rotation about the horizontal (X) axis and the vertical (Y) axis. They allow the user to rotate the data set to obtain a favorable view of the clusters of greatest interest.
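As an illustration of what the two rotation sliders compute, the sketch below rotates 3-D points about the X and Y axes. It is an assumption-laden example, not the ManTech 3-D module: the function names are hypothetical, and applying the X rotation before the Y rotation is an arbitrary choice.

```python
import math


def rotate_x(point, degrees):
    """Rotate a 3-D point about the horizontal (X) axis."""
    x, y, z = point
    a = math.radians(degrees)
    return (x,
            y * math.cos(a) - z * math.sin(a),
            y * math.sin(a) + z * math.cos(a))


def rotate_y(point, degrees):
    """Rotate a 3-D point about the vertical (Y) axis."""
    x, y, z = point
    a = math.radians(degrees)
    return (x * math.cos(a) + z * math.sin(a),
            y,
            -x * math.sin(a) + z * math.cos(a))


def apply_sliders(points, x_degrees, y_degrees):
    """Apply both slider settings to every point (X rotation first)."""
    return [rotate_y(rotate_x(p, x_degrees), y_degrees) for p in points]


# A point on the Y axis rotated 90 degrees about X lands on the Z axis.
print(apply_sliders([(0.0, 1.0, 0.0)], x_degrees=90, y_degrees=0))
```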
Cluster Control
Cluster size control allows the user to set the radius of the transparent sphere to one or three standard deviations.
Cluster selection controls allow the user to select any cluster and then hide non-selected clusters, to focus on the relevant data.
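One way such a sphere radius could be computed is sketched below. The source does not say how ARGUS defines a cluster's standard deviation; this example assumes it is the root-mean-square distance of the members from the centroid, and `centroid` and `sphere_radius` are hypothetical names.

```python
import math


def centroid(points):
    """Component-wise mean of the cluster members."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def sphere_radius(points, n_sigmas=1):
    """Radius of the display sphere: one or three standard deviations."""
    if n_sigmas not in (1, 3):
        raise ValueError("the control offers only 1 or 3 standard deviations")
    c = centroid(points)
    # Assumed definition: RMS distance of members from the centroid.
    sigma = math.sqrt(sum(math.dist(p, c) ** 2 for p in points) / len(points))
    return n_sigmas * sigma


members = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0)]
print(sphere_radius(members, 1))  # 1.0
print(sphere_radius(members, 3))  # 3.0
```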
Novelty Detection
[Figure: three panels labeled T0 (0-15°), T1 (16-30°), and T2 (31-45°)]
ARGUS Data Explorer: Menu Items
DEMO PROBLEM: The MIT LL Test Dataset
• Network-flow dataset from the MIT Lincoln Labs, to experiment with the detection of network attacks
• 805,049 records x 42 independent fields (derived features of raw data) x 1 dependent field (post hoc classification label)
• The dependent field is a discrete value representing Connection Type
– 0 is normal
– 1, 2, 3 and 4 are malicious (1:probe, 2:denial_of_service, 3:user_to_root, and 4:remote_to_local)
• The goal is to use the ARGUS Data Explorer to learn a “normal” background model and then detect the onset of “malicious” attacks
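The label encoding above maps directly onto a lookup table. The sketch below is illustrative only: `CONNECTION_TYPES`, `is_malicious`, and `summarize` are hypothetical names, and the labels are assumed to arrive as integers.

```python
from collections import Counter

# Codes and names as listed on the slide.
CONNECTION_TYPES = {
    0: "normal",
    1: "probe",
    2: "denial_of_service",
    3: "user_to_root",
    4: "remote_to_local",
}


def is_malicious(connection_type):
    """Any known connection type other than 0 (normal) is malicious."""
    if connection_type not in CONNECTION_TYPES:
        raise ValueError(f"unknown connection type: {connection_type}")
    return connection_type != 0


def summarize(labels):
    """Count records per connection-type name, e.g. to score alerts post hoc."""
    return Counter(CONNECTION_TYPES[label] for label in labels)


print(summarize([0, 0, 2, 1, 2]))
```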
DEMO PROBLEM: The MIT LL Dataset
MITLLT0 = 494,020 records x 43 fields
MITLLT1 = 311,029 records x 43 fields
 #  Name                         Type        Ex. Rec. 1  Ex. Rec. 2  Description
 1  RECID                        discrete             1           2  id
 2  DURATION                     continuous           0           0  length (number of seconds) of the connection
 3  PROTOCOL_TYPE                discrete             2           2  type of the protocol, e.g. tcp, udp, etc.
 4  SERVICE                      discrete            23          23  network service on the destination, e.g., http, telnet, etc.
 5  FLAG                         discrete            10          10  normal or error status of the connection
 6  SRC_BYTES                    continuous         181         239  number of data bytes from source to destination
 7  DST_BYTES                    continuous        5450         486  number of data bytes from destination to source
 8  LAND                         discrete             0           0  1 if connection is from/to the same host/port; 0 otherwise
 9  WRONG_FRAGMENT               continuous           0           0  number of "wrong" fragments
10  URGENT                       continuous           0           0  number of urgent packets
11  HOT                          continuous           0           0  number of "hot" indicators
12  NUM_FAILED_LOGINS            continuous           0           0  number of failed login attempts
13  LOGGED_IN                    discrete             1           1  1 if successfully logged in; 0 otherwise
14  NUM_COMPROMISED              continuous           0           0  number of "compromised" conditions
15  ROOT_SHELL                   discrete             0           0  1 if root shell is obtained; 0 otherwise
16  SU_ATTEMPTED                 discrete             0           0  1 if "su root" command attempted; 0 otherwise
17  NUM_ROOT                     continuous           0           0  number of "root" accesses
18  NUM_FILE_CREATIONS           continuous           0           0  number of file creation operations
19  NUM_SHELLS                   continuous           0           0  number of shell prompts
20  NUM_ACCESS_FILES             continuous           0           0  number of operations on access control files
21  NUM_OUTBOUND_CMDS            continuous           0           0  number of outbound commands in an ftp session
22  IS_HOT_LOGIN                 discrete             0           0  1 if the login belongs to the "hot" list; 0 otherwise
23  IS_GUEST_LOGIN               discrete             0           0  1 if the login is a "guest" login; 0 otherwise
24  COUNT                        continuous           8           8  number of connections to the same host as the current connection in the past two seconds
25  SRV_COUNT                    continuous           8           8  number of connections to the same service as the current connection in the past two seconds
26  SERROR_RATE                  continuous           0           0  % of connections that have "SYN" errors
27  SRV_SERROR_RATE              continuous           0           0  % of connections that have "SYN" errors
28  RERROR_RATE                  continuous           0           0  % of connections that have "REJ" errors
29  SRV_RERROR_RATE              continuous           0           0  % of connections that have "REJ" errors
30  SAME_SRV_RATE                continuous           1           1  % of connections to the same service
31  DIFF_SRV_RATE                continuous           0           0  % of connections to different services
32  SRV_DIFF_HOST_RATE           continuous           0           0  % of connections to different hosts
33  DST_HOST_COUNT               continuous           9          19
34  DST_HOST_SRV_COUNT           continuous           9          19
35  DST_HOST_SAME_SRV_RATE       continuous           1           1
36  DST_HOST_DIFF_SRV_RATE       continuous           0           0
37  DST_HOST_SAME_SRC_PORT_RATE  continuous           0           0
38  DST_HOST_SRV_DIFF_HOST_RATE  continuous           0           0
39  DST_HOST_SERROR_RATE         continuous           0           0
40  DST_HOST_SRV_SERROR_RATE     continuous           0           0
41  DST_HOST_RERROR_RATE         continuous           0           0
42  DST_HOST_SRV_RERROR_RATE     continuous           0           0
43  CONNECTION_TYPE              discrete             0           0  0:normal, 1:probe, 2:denial_of_service, 3:user_to_root, 4:remote_to_local
DEMO PROBLEM: The MIT LL Test Dataset
• Recipe
– Learn the baseline “normal” set of clusters
– Cluster the incremental data, including “malicious” records, into this baseline
– Assess quantitative and qualitative changes in clusters and test the capability of the system to detect novelties and alert the user
– Iterate through this “learn ↔ test” cycle dynamically over time, progressively building domain knowledge based on data empirics
• Results
– Please visit us at our DEMO & POSTER session on April 19th between 0800-1300
– We will demonstrate successful alerting on denial-of-service and unauthorized-entry attempts by detecting cluster density changes (as in the adjacent graph)
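The recipe above can be sketched as a comparison of cluster occupancy before and after new data arrives. This is a simplified illustration, not the ARGUS algorithm: it assumes baseline clusters are given as centroids, treats a cluster's "density" as its share of the records, and uses the hypothetical names `nearest`, `cluster_shares`, `density_alerts`, and `min_ratio`.

```python
import math
from collections import Counter


def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))


def cluster_shares(points, centroids):
    """Fraction of the points that fall into each baseline cluster."""
    if not points:
        raise ValueError("need at least one point")
    counts = Counter(nearest(p, centroids) for p in points)
    return [counts[i] / len(points) for i in range(len(centroids))]


def density_alerts(baseline, incoming, centroids, min_ratio=2.0):
    """Indices of clusters whose share of records grew by at least min_ratio."""
    before = cluster_shares(baseline, centroids)
    after = cluster_shares(incoming, centroids)
    alerts = []
    for i, (b, a) in enumerate(zip(before, after)):
        # A cluster that was empty in the baseline alerts on any new member.
        grew = a > 0 if b == 0 else a / b >= min_ratio
        if grew:
            alerts.append(i)
    return alerts


centroids = [(0.0, 0.0), (10.0, 10.0)]
baseline = [(0.0, 0.0)] * 9 + [(10.0, 10.0)]      # shares: 0.9, 0.1
incoming = [(0.0, 0.0)] * 5 + [(10.0, 10.0)] * 5  # shares: 0.5, 0.5
print(density_alerts(baseline, incoming, centroids))  # [1]
```

Repeating the comparison on each new batch gives the iterative “learn ↔ test” cycle; the incoming batch can then be folded into the baseline before the next round.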
ARGUS Data Explorer: Challenges
• Goal
– The Data Explorer should be a robust, intuitive, user-friendly
system that provides decision support for an analyst who is an expert
in the problem domain, but not an expert in advanced statistics and
pattern recognition technologies
• Challenges
– Human-in-the-loop automation of the underlying algorithms
– Ensuring the transfer of maximally relevant data through the “data
pipe” connecting the server to the client
– Balancing act: iterative historical batch processing versus
near-real-time processing
– Iterative scaling of the application: increase the amount of data
handled; ensure user expectation is still maintained; repeat cycle
– Setting the right expectation: a “work in progress” prototype
ARGUS Data Explorer: Where do we go from here?
• Combining the “top-down” and “bottom-up” analysis
approaches in ARGUS II
• “Top-down” – Hypothesis creation using first principles and
process heuristics
• “Bottom-up” – Probabilistic hypothesis validation and
refutation using data empirics