Exploring Massive Structured Data with ARGUS
PI Meeting
November 29, 2005
Main contacts:
Prof. Jaime Carbonell – carbon8918@yahoo.com
Dr. Santosh Ananthraman – santosh@dynamixtechnologies.com
NIMD

Project ARGUS Objectives
1. Novelty detection in structured databases or data streams
   • Detect and track situation-specific “alert-watch” patterns
   • Cluster analysis to establish background (normal) models
   • Cluster density and locus analysis for early detection of new pattern onset, or meaningful change to established pattern
2. Data Explorer - analyst interface
   • Framework for intensive, analyst-directed data exploration
   • Applications
     – MED: Massachusetts hospital admission database to detect attacks by biological agents
     – NED: Network anomaly/attack detection with CERT®, the federally funded computer security incident response center at CMU
3. Fast multi-dimensional structured-data matching
   • Exact and approximate matching
   • Scalable: O(10^6) to O(10^12) records
   • Profile matching for streaming data: O(1) to O(10^6) profiles

Role of ARGUS in Hypothetical End-to-End Multifunctional Architecture
[Architecture diagram. Labels:]
• Analysts; Mobile Agents; Analyst Workstation
• Analyst Interface; Analyst Collaboration; Data Source; Active Context; Prioritization; Control; Validation; Exploration; Query Generation; Hypothesis Management; Hypothesis Evaluation
• Raw Data - Biometric; News; Customs; Materials; Reports; Financial; RSS; Annotations; Net Traffic; Other
• Profile Queries
• Analysis Subsystem – Text & Data; Data Normalization; Text Extraction & Modeling; Situation Assessment; Events and Alerts; Matched Events
• Distributed Structured Search Engines; Massive Search Control; Novelty Detection
• Structured Data Search: Exact; Approximate; Massive data; Streaming data
• Structured Data Banking Transactions
• Structured Data Network Traffic
• Structured Data Hospital Admissions
• Structured Data Extracted News Archives
• Structured Data Extracted Agency Reports

Novelty Detection
• Objective:
  – Detect the onset of novel events in incoming data streams
  – Generate alert for analyst (with justification)
  – If judged significant, track developments; else discard
• Properties
  – Need a model of “business as usual” to detect divergences therefrom (done by clustering recent history)
  – Control points (tradeoff in precision-recall)
    • Degree of deviation from normalcy required
    • Amount of data support (e.g. # of observations) before alerting
    • Statistical model of normal “noise” in data streams

Cluster Evolution and Density Change Detection
[Figure panels: Constant Event; New Unobfuscated Event; New Obfuscated Event; Growing Event]

Visualizations in Display Area
[Figure]

Sample Application: Monitoring for Bioterrorism
• Database of all Massachusetts hospital stays discharged between 10/2000 and 9/2001 (835,895 records)
• 18 fields per record, including:
  – provider (hospital)
  – patient (gender, age, birthdate, race, ZIP)
  – timing (admit date, length of stay)
  – diagnoses (up to 8 with one primary)
  – payment source
• Cluster to form background models
• Inject new streaming data that may include potential threats (e.g. SARS, Anthrax, toxin-based attack,…)
• SARS Outbreak simulation
  – Added new records for patients from a small geographical region diagnosed with influenza in 9/2001
  – Graph shows resulting secondary peak in the pulmonary disease density function
• New Mini-Cluster Analysis reveals outbreaks of:
  – Tularemia
  – Dengue Fever
  – Myiasis
  – Chagas Disease

CERT Collaboration
• Working with CERT on NetFlow data for scalable detection of network attack patterns (viruses, denial of service, unauthorized entry attempts, etc.)
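
As a rough illustration of two of the novelty-detection control points listed on the Novelty Detection slide (degree of deviation from normalcy, and amount of data support before alerting), the Python sketch below flags incoming records that lie far from every background cluster and raises an alert only once enough of them fall close together. It is not the ARGUS implementation: the function names, the grid-cell grouping of outliers, and all parameter values are assumptions made for illustration, and the background centroids and radii are taken as given rather than learned by clustering recent history.

```python
import math
from collections import defaultdict


def nearest(point, centroids):
    """Return (index, distance) of the background centroid closest to point."""
    dists = [math.dist(point, c) for c in centroids]
    i = min(range(len(dists)), key=dists.__getitem__)
    return i, dists[i]


def detect_novelty(stream, centroids, radii, deviation=3.0, min_support=3, cell=1.0):
    """Return alerts as (grid_cell, supporting_points) pairs.

    deviation   -- how many cluster radii away a point must be to count
                   as a divergence (hypothetical control point 1)
    min_support -- how many divergent points must share a grid cell
                   before alerting (hypothetical control point 2)
    cell        -- grid cell size used to group nearby outliers
    """
    buckets = defaultdict(list)
    alerts = []
    for p in stream:
        i, d = nearest(p, centroids)
        if d <= deviation * radii[i]:
            continue  # "business as usual": explained by the background model
        key = tuple(math.floor(x / cell) for x in p)
        buckets[key].append(p)
        if len(buckets[key]) == min_support:
            # Enough support: a candidate new mini-cluster for the analyst.
            alerts.append((key, list(buckets[key])))
    return alerts


# Toy background model: two clusters with unit radius (made-up values).
centroids = [(0.0, 0.0), (10.0, 10.0)]
radii = [1.0, 1.0]
stream = [(0.5, 0.2), (10.1, 9.8), (5.1, 5.2), (5.3, 5.4),
          (0.1, -0.3), (5.2, 5.6), (20.0, 20.0)]
alerts = detect_novelty(stream, centroids, radii)
print(alerts)
```

Raising `deviation` or `min_support` trades recall for precision: isolated outliers such as the last point in the toy stream are held back until more observations support them.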

CERT: Preliminary Data Analysis
• Principal component analysis is used for data reduction: the 11 input features are reduced to 3 principal component features (PC1, PC2 and PC3), which capture 54%, 25% and 13%, respectively, of the variance in the original 11 features
• For example, PC2 is composed mainly of DST FLOWS, PKTS, and BYTES, and PC3 is composed mainly of UNIQ_PORTS, SUBNETS and DSTPORT
• Clustering in the principal components dimension to explore automatically-generated aggregations and abstractions of data for meaningful matching and pattern detection

Scalable Matcher
• In-memory matchers faster than 1/100th of a second
• Disk matcher faster than 1/10th of a second until disk access barrier
  – 1 second per match above 10^8 records (on a 2-year-old processor)

Matcher Versions   Record Volume    Time complexity          Status
In-memory          10^6 to 10^8     Logarithmic              Mature
Disk-based         10^7 to 10^10    Low power-law            Algorithmically stable
Distributed        10^9 to 10^12    As underlying matcher    Initial prototype only

Matching Data Streams to Profiles
[Diagram labels: Data Streams; Profiles; Matcher; Novelty Detection; New Profiles; Analyst]
Profile = “alert-watch” pattern
• Generated by analyst
• Novelty detection & vetted
• Need rapid matching for 10^5+ simultaneously active profiles

Profile Sharing Framework
[Diagram labels: Data Tables; Data Streams; Dynamix Matcher; Query Network; Analyst; Query; ARGUS Query Network Manager; System Catalog; Identified Threats]

Evaluation
MED: Bio-surveillance
AvgTime/Query with 565 queries, in seconds:
• NonJoinS: 0.20
• MatchPlan+NCanon: 0.12
• AllSharing: 0.11
FED: Fedwire suspicious transaction pattern tracking
AvgTime/Query with 768 queries, in seconds:
• NonJoinS: 0.25
• MatchPlan+NCanon: 0.12
• AllSharing: 0.04

ARGUS Achievements: Summary
• Solid scientific underpinnings
  – Efficient algorithms for approximate search and exploration
  – Efficient matching of complex patterns on streaming data
  – Novelty detection via radial cluster-density function analysis
• Prototype development
  – User validation of utility of techniques (at NIST)
  – Analyst GUI - Data Explorer (under development)
  – Applications
    • MED: Massachusetts hospital admission database for detection of attacks by biological agents
    • FED: Fedwire Money Transfer database for suspicious transaction pattern tracking
    • NED: NetFlow database from CERT® for scalable detection of network attack patterns
• Sufficient progress to interest operational IC
  – Exploring collaboration with GDAIS (their client has >10^8 transactions daily, >10^10 records total)
  – Getting ready for stage 1 RDEC insertion

Additional Slides for Q & A Session

Cluster Evolution
[Figure panels: Constant Event; New Unobfuscated Event; New Obfuscated Event; Growing Event]

Novelty Detection
Functionality / Technology
• Build background model
• Modeling methods
  – Expected Events (clusters)
  – (Hierarchical) k-means
• Find divergences
• Divergence metrics
  – Radial density gradients from cluster centroid
  – Temporally-adaptive distance measures
  – Secondary peaks in density function
  – Individual outliers (but many false positives)
  – New Mini-clusters (more reliable, unobfuscated new-event detection)
  – Detect when a novel event is masked by ordinary happenings or intentionally obfuscated
• Create analyst profiles
• Trigger Alerts
  – RETE-based SAMs methods (last PI-meeting ARGUS paper)
  – Route & Prioritize
  – Formulate hypotheses for Analyst

ARGUS Query Network Manager
[Diagram labels: Common Computation Identifier; Sharing Optimizer; Query; Query Network; Projection Manager; Coordinator; Network Topology & Operation Manager; Query Rewriter; Query Optimizer; Code Assembler; System Catalog]

Recording & Identifying Common Comps
Inference & Classification
r2.type_code = 1000
r3.type_code = 1000
r1.type_code = 1000
r1.amount > 1000000
r1.rbank_aba = r2.sbank_aba
r1.benef_account = r2.orig_account
r2.amount * 2 > r1.amount
r1.tran_date <= r2.tran_date
r2.tran_date <= r1.tran_date + 10
r2.rbank_aba = r3.sbank_aba
r2.benef_account = r3.orig_account
r2.amount = r3.amount
r2.tran_date <= r3.tran_date
r3.tran_date <= r2.tran_date + 10

r1.type_code = 1000
r1.amount > 1000000
r2.type_code = 1000
r2.amount > 500000
r3.type_code = 1000
r3.amount > 500000
r1.rbank_aba = r2.sbank_aba
r1.benef_account = r2.orig_account
r2.amount * 2 > r1.amount
r1.tran_date <= r2.tran_date
r2.tran_date <= r1.tran_date + 10
r2.rbank_aba = r3.sbank_aba
r2.benef_account = r3.orig_account
r2.amount = r3.amount
r2.tran_date <= r3.tran_date
r3.tran_date <= r2.tran_date + 10

Canonicalization
r1.amount - r2.amount * 2 < 0
r3.tran_date - r2.tran_date <= 10

Common Computation Identification
System Catalog tables:
  NodeID      PredSetID       …
  PredSetID   PredID          …
  PredID      CanonicalForm   …

Preliminary Data Analysis
CERT: The Data
• Exploratory data for this exercise comprised a matrix of 65k rows and 24 columns, which was aggregated as follows
• For every SCAN_HOUR, for every unique SCAN_ID:

record the {independent, input features - time element}
  TIME             DATETIME - FIRST TIME THIS (SCAN, PORT, HOST) WAS SEEN THIS HOUR
  STIME            DATETIME - START TIME OF THE FIRST FLOW IN THE SCAN
  ETIME            DATETIME - START TIME OF THE LAST FLOW IN THE SCAN

record the {independent, input features - Source details}
  SRCADDR          ADDRESS  - SOURCE IP ADDRESS
  COUNTRY          CHAR     - TWO-LETTER COUNTRY CODE OF THE SRC (FROM GEOIP)
  UNIQ_DSTS        INTEGER  - NUMBER OF UNIQUE (PORT, ADDR) PAIRS SCANNED
  FLOWS            INTEGER  - TOTAL NUMBER OF FLOWS IN THE SCAN
  PKTS             INTEGER  - TOTAL NUMBER OF PACKETS IN THE SCAN
  BYTES            INTEGER  - TOTAL NUMBER OF BYTES IN THE SCAN
  UNIQ_PORTS       INTEGER  - NUMBER OF UNIQUE PORTS SCANNED
  UNIQ_HOSTS       INTEGER  - NUMBER OF UNIQUE HOSTS SCANNED
  SUBNETS          INTEGER  - NUMBER OF UNIQUE CLASS /24 PREFIXES SCANNED
  HAS_EXPLOIT      INTEGER  - 1 IF ANY OF THE TARGETS "TALKED BACK"

record the {independent, input features - Destination details}
  DSTPORT          INTEGER  - DESTINATION PORT
  FLOWS            INTEGER  - NUMBER OF FLOWS FOR THIS (SCAN, HOUR, PORT)
  PKTS             INTEGER  - NUMBER OF PACKETS FOR THIS (SCAN, HOUR, PORT)
  BYTES            INTEGER  - NUMBER OF BYTES FOR THIS (SCAN, HOUR, PORT)
  DSTADDR          ADDRESS  - DESTINATION IP ADDRESS
  EXPLOIT          INTEGER  - 1 IF THE DESTINATION HOST "TALKED BACK" TO THE SOURCE

record the {dependent, output features - SCAN classification labels based on CERT expert heuristics}
  SCAN_PROB        FLOAT    - PROBABILITY THAT THIS EVENT REPRESENTS A SCAN
  SCAN_FP          INTEGER  - 0: UNKNOWN, 1: HORIZONTAL, 2: VERTICAL
  SCAN_TYPE        INTEGER  - 0: NOT A SCAN, 1: SYN SCAN, 2: SYN-FIN SCAN, 3: NULL SCAN, 4: XMAS SCAN, 5: FIN SCAN, 6: UNIDENTIFIED SCAN
  HAS_TROJAN_PORT  INTEGER  - 1 IF ANY DSTPORT IS USED BY A KNOWN TROJAN
  IS_WORM          INTEGER  - 1 IF THE SCAN APPEARS TO BE A WORM
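
To make the output (label) fields of the data dictionary concrete, the Python sketch below decodes the integer codes for SCAN_FP and SCAN_TYPE into the meanings listed on the slide. The code tables are copied from the field descriptions; the `ScanLabels` class, its `describe` method, and the example values are hypothetical names and data introduced only for illustration, not part of the CERT data set or the ARGUS software.

```python
from dataclasses import dataclass

# Code tables copied from the SCAN_FP and SCAN_TYPE field descriptions.
SCAN_FP = {0: "unknown", 1: "horizontal", 2: "vertical"}
SCAN_TYPE = {
    0: "not a scan",
    1: "SYN scan",
    2: "SYN-FIN scan",
    3: "NULL scan",
    4: "XMAS scan",
    5: "FIN scan",
    6: "unidentified scan",
}


@dataclass(frozen=True)
class ScanLabels:
    """Output (label) fields of one aggregated scan record (hypothetical wrapper)."""
    scan_prob: float
    scan_fp: int
    scan_type: int
    has_trojan_port: int
    is_worm: int

    def describe(self) -> str:
        """Render the coded labels as a short human-readable summary."""
        parts = [
            SCAN_TYPE.get(self.scan_type, "invalid type"),
            f"fp={SCAN_FP.get(self.scan_fp, 'invalid fp')}",
            f"p={self.scan_prob:.2f}",
        ]
        if self.has_trojan_port:
            parts.append("trojan port")
        if self.is_worm:
            parts.append("worm-like")
        return ", ".join(parts)


# Made-up example values, not taken from the CERT data.
example = ScanLabels(scan_prob=0.93, scan_fp=1, scan_type=1,
                     has_trojan_port=0, is_worm=1)
print(example.describe())  # SYN scan, fp=horizontal, p=0.93, worm-like
```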