Exploring Massive Structured Data with ARGUS: PI Meeting

November 29, 2005
Main contacts:
Prof. Jaime Carbonell – carbon8918@yahoo.com
Dr. Santosh Ananthraman – santosh@dynamixtechnologies.com
NIMD
1
Project ARGUS Objectives
1. Novelty detection in structured databases or data streams
   • Detect and track situation-specific “alert-watch” patterns
   • Cluster analysis to establish background (normal) models
   • Cluster density and locus analysis for early detection of new pattern onset, or meaningful change to established pattern
2. Data Explorer - analyst interface
   • Framework for intensive, analyst-directed data exploration
   • Applications
     – MED: Massachusetts hospital admission database to detect attacks by biological agents
     – NED: Network anomaly/attack detection with CERT®, the federally funded computer security incident response center at CMU
3. Fast multi-dimensional structured-data matching
   • Exact and approximate matching
   • Scalable: O(10^6) to O(10^12) records
   • Profile matching for streaming data: O(1) to O(10^6) profiles
Role of ARGUS in Hypothetical End-to-End Multifunctional Architecture
[Architecture diagram; labels recovered from the figure]
• Actors: Analysts, Mobile Agents, Analyst Workstation
• Raw data sources: Biometric, News, Customs, Materials, Reports, Financial, RSS, Annotations, Net Traffic, Other
• Analyst Interface: Analyst Collaboration, Data Source Prioritization, Active Context Control, Validation, Exploration, Query Generation, Hypothesis Management, Hypothesis Evaluation
• Analysis Subsystem – Text & Data: Data Normalization, Text Extraction & Modeling, Situation Assessment, Events and Alerts, Matched Events, Profile Queries
• Distributed Structured Search Engines: Massive Search Control, Novelty Detection, Structured Data Search (exact, approximate, massive data, streaming data)
• Structured data stores: Banking Transactions, Network Traffic, Hospital Admissions, Extracted News Archives, Extracted Agency Reports
Novelty Detection
• Objective:
– Detect the onset of novel events in incoming data streams
– Generate alert for analyst (with justification)
– If judged significant, track developments; else discard
• Properties
– Need a model of “business as usual” to detect divergences
therefrom (done by clustering recent history)
– Control points (tradeoff in precision-recall)
• Degree of deviation from normalcy required
• Amount of data support (e.g. # of observations) before alerting
• Statistical model of normal “noise” in data streams
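
The two control points above (degree of deviation and amount of data support) can be illustrated with a minimal sketch. It assumes a single-cluster background model summarized by its centroid and by the mean and standard deviation of radial distances; the function names, the z-score test and the sample points are illustrative assumptions, not the ARGUS implementation.

```python
import math
import statistics

def fit_background(history):
    """Summarize 'business as usual' as one cluster: centroid plus the
    mean and standard deviation of distances to that centroid."""
    dims = len(history[0])
    centroid = tuple(sum(p[d] for p in history) / len(history) for d in range(dims))
    radii = [math.dist(p, centroid) for p in history]
    return centroid, statistics.mean(radii), statistics.pstdev(radii)

def novel_points(model, stream, deviation=3.0):
    """Return stream points whose radial distance exceeds the background
    mean by more than `deviation` standard deviations."""
    centroid, mean_r, std_r = model
    std_r = std_r or 1e-9  # avoid division by zero for a degenerate history
    return [p for p in stream
            if (math.dist(p, centroid) - mean_r) / std_r > deviation]

def should_alert(model, stream, deviation=3.0, min_support=3):
    """Alert only when enough deviating observations have accumulated."""
    return len(novel_points(model, stream, deviation)) >= min_support

# Synthetic points, for illustration only.
history = [(0, 1), (1, 0), (0, -1), (-1, 0), (1, 1), (-1, -1)]
stream = [(5, 5), (5.1, 4.9), (0.5, 0.5)]
model = fit_background(history)
```

Raising `deviation` or `min_support` suppresses more false alarms at the cost of later or missed detections, which is the precision-recall tradeoff named on the slide.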
Cluster Evolution and Density Change Detection
[Figure: four panels showing a Constant Event, a New Unobfuscated Event, a New Obfuscated Event, and a Growing Event]
Visualizations in Display Area
Sample Application: Monitoring for Bioterrorism
• Database of all Massachusetts hospital stays discharged between 10/2000 and 9/2001 (835,895 records)
• 18 fields per record, including:
  – provider (hospital)
  – patient (gender, age, birthdate, race, ZIP)
  – timing (admit date, length of stay)
  – diagnoses (up to 8 with one primary)
  – payment source
• Cluster to form background models
• Inject new streaming data that may include potential threats (e.g. SARS, anthrax, toxin-based attack, …)
• SARS outbreak simulation
  – Added new records for patients from a small geographical region diagnosed with influenza in 9/2001
  – Graph shows resulting secondary peak in the pulmonary disease density function
• New Mini-Cluster Analysis reveals outbreaks of:
  • Tularemia
  • Dengue Fever
  • Myiasis
  • Chagas Disease
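
The secondary peak mentioned in the SARS simulation can be sketched as a search for extra local maxima in a histogram of distances from a cluster centroid. The bin width, the `min_count` threshold and the synthetic distances below are assumptions made for illustration; the slides do not specify how the density function is computed.

```python
def histogram(values, bin_width, n_bins):
    """Count values into n_bins equal-width bins starting at zero."""
    counts = [0] * n_bins
    for v in values:
        i = int(v // bin_width)
        if 0 <= i < n_bins:
            counts[i] += 1
    return counts

def secondary_peaks(counts, min_count=2):
    """Return indices of strict local maxima other than the tallest bin.
    Plateaus are ignored in this sketch."""
    main = counts.index(max(counts))
    peaks = []
    for i, c in enumerate(counts):
        left = counts[i - 1] if i > 0 else -1
        right = counts[i + 1] if i + 1 < len(counts) else -1
        if i != main and c >= min_count and c > left and c > right:
            peaks.append(i)
    return peaks

# Synthetic radial distances: a background that thins out with distance,
# then the same background plus a small injected group far from the centroid.
background = [0.5] * 10 + [1.5] * 6 + [2.5] * 3 + [3.5]
injected = background + [4.5] * 4

baseline_peaks = secondary_peaks(histogram(background, 1.0, 6))
outbreak_peaks = secondary_peaks(histogram(injected, 1.0, 6))
```

With the background alone no secondary peak is reported; the injected group produces one in the bin covering distances 4 to 5.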
CERT Collaboration
• Working with CERT on NetFlow data for scalable detection of network attack patterns (viruses, denial of service, unauthorized entry attempts, etc.)
CERT: Preliminary Data Analysis
• Principal component analysis is used for data reduction: the 11 input features are reduced to 3 principal components (PC1, PC2 and PC3), which capture 54%, 25% and 13%, respectively, of the variance in the original 11 features
• For example, PC2 consists mainly of DST FLOWS, PKTS, and BYTES, and PC3 consists mainly of UNIQ_PORTS, SUBNETS and DSTPORT
• Clustering in the principal-component space to explore automatically generated aggregations and abstractions of data for meaningful matching and pattern detection
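
The three components reported above together capture 54% + 25% + 13% = 92% of the variance. As a minimal sketch of how such per-component shares are obtained, the code below computes explained-variance ratios from a covariance matrix using power iteration with deflation. The tiny two-feature dataset and all function names are illustrative assumptions; the slides do not say which PCA implementation was used.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of rows (one observation per row)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_eigenpairs(matrix, k, iters=200):
    """Largest k eigenpairs of a symmetric matrix via power iteration
    with deflation."""
    d = len(matrix)
    m = [row[:] for row in matrix]
    pairs = []
    for _component in range(k):
        v = [float(i + 1) for i in range(d)]  # fixed, non-uniform start vector
        for _step in range(iters):
            w = [sum(m[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient of the (unit-length) vector gives the eigenvalue.
        lam = sum(v[i] * m[i][j] * v[j] for i in range(d) for j in range(d))
        pairs.append((lam, v))
        for i in range(d):
            for j in range(d):
                m[i][j] -= lam * v[i] * v[j]  # remove the found component
    return pairs

def explained_variance_ratios(rows, k):
    """Share of total variance captured by each of the top k components."""
    cov = covariance_matrix(rows)
    total = sum(cov[i][i] for i in range(len(cov)))  # trace = total variance
    return [lam / total for lam, _vec in top_eigenpairs(cov, k)]

# Synthetic data: the first feature varies four times as much as the second.
rows = [(-2.0, 0.0), (2.0, 0.0), (0.0, -1.0), (0.0, 1.0)]
ratios = explained_variance_ratios(rows, 2)
```

For this toy data the first component carries 80% of the variance and the second 20%.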
Scalable Matcher
• In-memory matchers faster than 1/100th of a second
• Disk matcher faster than 1/10th of a second until disk access barrier → 1 second per match above 10^8 records (on a 2-year-old processor)

Matcher version   Record volume    Time complexity          Status
In-memory         10^6 to 10^8     Logarithmic              Mature
Disk-based        10^7 to 10^10    Low power-law            Algorithmically stable
Distributed       10^9 to 10^12    As underlying matcher    Initial prototype only
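
The slides do not describe the matcher's internal algorithm. As a generic illustration of how logarithmic-time exact and approximate lookups on one numeric field are possible in memory, the sketch below uses a sorted key list with binary search. The class name `SortedIndex`, the `amount` field and the sample records are hypothetical.

```python
import bisect

class SortedIndex:
    """In-memory index over one numeric field, searched by bisection."""

    def __init__(self, records, key):
        self._items = sorted(records, key=key)
        self._keys = [key(r) for r in self._items]

    def exact(self, value):
        """All records whose key equals value (two O(log n) bisections)."""
        lo = bisect.bisect_left(self._keys, value)
        hi = bisect.bisect_right(self._keys, value)
        return self._items[lo:hi]

    def within(self, value, tolerance):
        """Approximate match: records with key in [value - tol, value + tol]."""
        lo = bisect.bisect_left(self._keys, value - tolerance)
        hi = bisect.bisect_right(self._keys, value + tolerance)
        return self._items[lo:hi]

# Hypothetical records, for illustration only.
records = [
    {"id": 4, "amount": 900},
    {"id": 2, "amount": 250},
    {"id": 1, "amount": 100},
    {"id": 3, "amount": 255},
]
index = SortedIndex(records, key=lambda r: r["amount"])
exact_ids = [r["id"] for r in index.exact(250)]
near_ids = [r["id"] for r in index.within(250, 10)]
```

Locating the range costs O(log n); returning the matches adds time proportional to the number of hits. Multi-dimensional matching, as targeted by ARGUS, would need a richer structure than this single-field sketch.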
Matching Data Streams to Profiles
[Diagram components: Data Streams, Profiles, Matcher, Novelty Detection, New Profiles, Analyst]
Profile = “alert-watch” pattern
• Generated by analyst
• Generated by novelty detection and vetted
• Need rapid matching for 10^5+ simultaneously active profiles
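
One common way to keep per-record cost low when many profiles are active is to index profiles by an equality condition, so each incoming record is tested only against the profiles that could match it. The sketch below illustrates that idea; it is an assumption for illustration, not the ARGUS or Dynamix matcher. The `ProfileIndex` class and the profile names are hypothetical; the `type_code` and `amount` fields are borrowed from the predicate examples later in the deck.

```python
from collections import defaultdict

class ProfileIndex:
    """Groups profiles by the value they require on one key field."""

    def __init__(self, key_field):
        self.key_field = key_field
        self._by_key = defaultdict(list)

    def add(self, name, key_value, predicate):
        """Register a profile that applies only when record[key_field] == key_value."""
        self._by_key[key_value].append((name, predicate))

    def match(self, record):
        """Return names of profiles matched by this record, checking only
        profiles registered under the record's key value."""
        candidates = self._by_key.get(record.get(self.key_field), [])
        return [name for name, predicate in candidates if predicate(record)]

index = ProfileIndex("type_code")
index.add("large-transfer", 1000, lambda r: r["amount"] > 1000000)
index.add("mid-transfer", 1000, lambda r: r["amount"] > 500000)
index.add("other-type", 2000, lambda r: True)

# Hypothetical stream records.
stream = [
    {"type_code": 1000, "amount": 750000},
    {"type_code": 1000, "amount": 2000000},
    {"type_code": 3000, "amount": 5},
]
alerts = [index.match(record) for record in stream]
```

The third record touches no profile at all, which is the point of the index: work per record scales with the number of candidate profiles rather than with all active profiles.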
Profile Sharing Framework
[Diagram components: Data Tables, Data Streams, Dynamix Matcher, Query Network, Analyst, Query, ARGUS Query Network Manager, System Catalog, Identified Threats]
Evaluation
MED: Bio-surveillance
Average time per query, in seconds (565 queries):
  NonJoinS:          0.20
  MatchPlan+NCanon:  0.12
  AllSharing:        0.11
FED: Fedwire suspicious transaction pattern tracking
Average time per query, in seconds (768 queries):
  NonJoinS:          0.25
  MatchPlan+NCanon:  0.12
  AllSharing:        0.04
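
The timings above can be turned into speedup factors with simple arithmetic. The sketch below takes NonJoinS as the reference configuration, which is an assumption about how to read the table; the dictionary layout and the `speedup` helper are illustrative.

```python
# Average seconds per query, copied from the evaluation figures above.
avg_seconds = {
    "MED": {"NonJoinS": 0.20, "MatchPlan+NCanon": 0.12, "AllSharing": 0.11},
    "FED": {"NonJoinS": 0.25, "MatchPlan+NCanon": 0.12, "AllSharing": 0.04},
}

def speedup(reference, improved):
    """How many times faster `improved` is than `reference`."""
    return reference / improved

all_sharing_speedup = {
    app: speedup(times["NonJoinS"], times["AllSharing"])
    for app, times in avg_seconds.items()
}
```

Under that reading, AllSharing is roughly 1.8 times faster than NonJoinS on MED and about 6.25 times faster on FED.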
ARGUS Achievements: Summary
• Solid scientific underpinnings
  – Efficient algorithms for approximate search and exploration
  – Efficient matching of complex patterns on streaming data
  – Novelty detection via radial cluster-density function analysis
• Prototype development
  – User validation of utility of techniques (at NIST)
  – Analyst GUI - Data Explorer (under development)
  – Applications
    • MED: Massachusetts hospital admission database for detection of attacks by biological agents
    • FED: Fedwire Money Transfer database for suspicious transaction pattern tracking
    • NED: NetFlow database from CERT® for scalable detection of network attack patterns
• Sufficient progress to interest operational IC
  – Exploring collaboration with GDAIS (their client has >10^8 transactions daily, >10^10 records total)
  – Getting ready for stage 1 RDEC insertion
Additional Slides
for Q & A Session
Cluster Evolution
[Figure: four panels showing a Constant Event, a New Unobfuscated Event, a New Obfuscated Event, and a Growing Event]
Novelty Detection: Functionality and Technology
• Build background model (modeling methods)
  – Expected events (clusters)
  – (Hierarchical) k-means
• Find divergences (divergence metrics)
  – Radial density gradients from cluster centroid
  – Temporally-adaptive distance measures
  – Secondary peaks in density function
  – Individual outliers (but many false positives)
  – New mini-clusters (more reliable, unobfuscated new-event detection)
  – Detect when a novel event is masked by ordinary happenings or intentionally obfuscated
• Create analyst profiles
• Trigger alerts
  – RETE-based SAMs methods (last PI-meeting ARGUS paper)
  – Route & prioritize
  – Formulate hypotheses for analyst
ARGUS Query Network Manager
[Diagram components: Query, Query Network, System Catalog, Common Computation Identifier, Sharing Optimizer, Projection Manager, Coordinator, Network Topology & Operation Manager, Query Rewriter, Query Optimizer, Code Assembler]
Recording & Identifying Common Computations

Query predicates (input):
  r2.type_code = 1000
  r3.type_code = 1000
  r1.type_code = 1000
  r1.amount > 1000000
  r1.rbank_aba = r2.sbank_aba
  r1.benef_account = r2.orig_account
  r2.amount * 2 > r1.amount
  r1.tran_date <= r2.tran_date
  r2.tran_date <= r1.tran_date + 10
  r2.rbank_aba = r3.sbank_aba
  r2.benef_account = r3.orig_account
  r2.amount = r3.amount
  r2.tran_date <= r3.tran_date
  r3.tran_date <= r2.tran_date + 10

After Inference & Classification (adds r2.amount > 500000 and r3.amount > 500000):
  r1.type_code = 1000
  r1.amount > 1000000
  r2.type_code = 1000
  r2.amount > 500000
  r3.type_code = 1000
  r3.amount > 500000
  r1.rbank_aba = r2.sbank_aba
  r1.benef_account = r2.orig_account
  r2.amount * 2 > r1.amount
  r1.tran_date <= r2.tran_date
  r2.tran_date <= r1.tran_date + 10
  r2.rbank_aba = r3.sbank_aba
  r2.benef_account = r3.orig_account
  r2.amount = r3.amount
  r2.tran_date <= r3.tran_date
  r3.tran_date <= r2.tran_date + 10

After Canonicalization (examples):
  r1.amount - r2.amount * 2 < 0
  r3.tran_date - r2.tran_date <= 10

Common Computation Identification, backed by System Catalog tables:
  (NodeID, PredSetID, …)
  (PredSetID, PredID, …)
  (PredID, CanonicalForm, …)
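
The canonicalization step can be sketched for linear comparison predicates: move every term to the left-hand side, flip `>` into `<`, and sort the terms so that differently written but equivalent predicates receive the same identifier. The coefficient-dictionary representation, the `PredicateCatalog` class and its numbering are assumptions for illustration; the slides name the catalog tables but not how canonical forms are computed.

```python
def canonicalize(lhs, op, rhs):
    """Normalize `lhs op rhs` to (sorted variable terms, operator, constant).

    lhs and rhs map variable names to integer coefficients; the key ""
    holds a constant term. '>' and '>=' are rewritten as '<' and '<='.
    """
    coeffs = {}
    for var, c in lhs.items():
        coeffs[var] = coeffs.get(var, 0) + c
    for var, c in rhs.items():
        coeffs[var] = coeffs.get(var, 0) - c
    if op in (">", ">="):  # flip the comparison by negating both sides
        coeffs = {var: -c for var, c in coeffs.items()}
        op = "<" if op == ">" else "<="
    const = -coeffs.pop("", 0)  # move the constant back to the right side
    terms = tuple(sorted((v, c) for v, c in coeffs.items() if c != 0))
    return terms, op, const

class PredicateCatalog:
    """Assigns one PredID per canonical form (hypothetical stand-in for
    the System Catalog's PredID / CanonicalForm table)."""

    def __init__(self):
        self._ids = {}

    def pred_id(self, lhs, op, rhs):
        form = canonicalize(lhs, op, rhs)
        return self._ids.setdefault(form, len(self._ids) + 1)

catalog = PredicateCatalog()
# r2.amount * 2 > r1.amount
id_a = catalog.pred_id({"r2.amount": 2}, ">", {"r1.amount": 1})
# r1.amount - r2.amount * 2 < 0 (same predicate, already canonical)
id_b = catalog.pred_id({"r1.amount": 1, "r2.amount": -2}, "<", {})
# r3.tran_date <= r2.tran_date + 10
id_c = catalog.pred_id({"r3.tran_date": 1}, "<=", {"r2.tran_date": 1, "": 10})
```

Here `id_a` and `id_b` coincide, so a query network could evaluate that predicate once and share the result. This sketch does not normalize equalities or rescale coefficients (for example, `2x < 4` and `x < 2` stay distinct).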
Preliminary Data Analysis
CERT: The Data
• Exploratory data for this exercise comprised a matrix of 65k rows and 24 columns, aggregated as follows:
• For every SCAN_HOUR, for every unique SCAN_ID:
  record the {independent, input features - time element}
    TIME             DATETIME - FIRST TIME THIS (SCAN, PORT, HOST) WAS SEEN THIS HOUR
    STIME            DATETIME - START TIME OF THE FIRST FLOW IN THE SCAN
    ETIME            DATETIME - START TIME OF THE LAST FLOW IN THE SCAN
  record the {independent, input features - Source details}
    SRCADDR          ADDRESS - SOURCE IP ADDRESS
    COUNTRY          CHAR - TWO-LETTER COUNTRY CODE OF THE SRC (FROM GEOIP)
    UNIQ_DSTS        INTEGER - NUMBER OF UNIQUE (PORT, ADDR) PAIRS SCANNED
    FLOWS            INTEGER - TOTAL NUMBER OF FLOWS IN THE SCAN
    PKTS             INTEGER - TOTAL NUMBER OF PACKETS IN THE SCAN
    BYTES            INTEGER - TOTAL NUMBER OF BYTES IN THE SCAN
    UNIQ_PORTS       INTEGER - NUMBER OF UNIQUE PORTS SCANNED
    UNIQ_HOSTS       INTEGER - NUMBER OF UNIQUE HOSTS SCANNED
    SUBNETS          INTEGER - NUMBER OF UNIQUE CLASS /24 PREFIXES SCANNED
    HAS_EXPLOIT      INTEGER - 1 IF ANY OF THE TARGETS "TALKED BACK"
  record the {independent, input features - Destination details}
    DSTPORT          INTEGER - DESTINATION PORT
    FLOWS            INTEGER - NUMBER OF FLOWS FOR THIS (SCAN, HOUR, PORT)
    PKTS             INTEGER - NUMBER OF PACKETS FOR THIS (SCAN, HOUR, PORT)
    BYTES            INTEGER - NUMBER OF BYTES FOR THIS (SCAN, HOUR, PORT)
    DSTADDR          ADDRESS - DESTINATION IP ADDRESS
    EXPLOIT          INTEGER - 1 IF THE DESTINATION HOST "TALKED BACK" TO THE SOURCE
  record the {dependent, output features - SCAN classification labels based on CERT expert heuristics}
    SCAN_PROB        FLOAT - PROBABILITY THAT THIS EVENT REPRESENTS A SCAN
    SCAN_FP          INTEGER - 0: UNKNOWN, 1: HORIZONTAL, 2: VERTICAL
    SCAN_TYPE        INTEGER - 0: NOT A SCAN, 1: SYN SCAN, 2: SYN-FIN SCAN, 3: NULL SCAN, 4: XMAS SCAN, 5: FIN SCAN, 6: UNIDENTIFIED SCAN
    HAS_TROJAN_PORT  INTEGER - 1 IF ANY DSTPORT IS USED BY A KNOWN TROJAN
    IS_WORM          INTEGER - 1 IF THE SCAN APPEARS TO BE A WORM
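
The integer codes of the output labels listed above can be decoded into named values as follows. This is a minimal sketch: the enum and function names are hypothetical (including reading SCAN_FP as a footprint code), the sample row is invented for illustration, and only the code-to-label mappings come from the field list.

```python
from enum import IntEnum

class ScanFootprint(IntEnum):
    """SCAN_FP codes as listed in the field descriptions."""
    UNKNOWN = 0
    HORIZONTAL = 1
    VERTICAL = 2

class ScanType(IntEnum):
    """SCAN_TYPE codes as listed in the field descriptions."""
    NOT_A_SCAN = 0
    SYN_SCAN = 1
    SYN_FIN_SCAN = 2
    NULL_SCAN = 3
    XMAS_SCAN = 4
    FIN_SCAN = 5
    UNIDENTIFIED_SCAN = 6

def decode_labels(row):
    """Convert the raw output-label columns of one row into typed values."""
    return {
        "scan_prob": float(row["SCAN_PROB"]),
        "footprint": ScanFootprint(int(row["SCAN_FP"])),
        "scan_type": ScanType(int(row["SCAN_TYPE"])),
        "has_trojan_port": int(row["HAS_TROJAN_PORT"]) == 1,
        "is_worm": int(row["IS_WORM"]) == 1,
    }

# Invented example row with string values, as they might arrive from a text export.
sample = {
    "SCAN_PROB": "0.97",
    "SCAN_FP": "1",
    "SCAN_TYPE": "4",
    "HAS_TROJAN_PORT": "0",
    "IS_WORM": "1",
}
labels = decode_labels(sample)
```

An out-of-range code raises `ValueError` when the enum is constructed, which surfaces malformed rows early.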