Scalable Data Exploration and Novelty Detection
NIMD Grand Finale PI Meeting
April 18, 2006

Main contacts:
Prof. Jaime Carbonell, Carnegie Mellon University
Dr. Santosh Ananthraman, DYNAMiX Technologies

Project ARGUS Progression
1. Fast multi-dimensional matching of structured data
   • Exact and approximate matching
   • Scalable: up to a trillion (10^12) records
   • Profile matching for streaming data: up to a million (10^6) profiles
2. Novelty detection in structured databases or data streams
   • Detection and tracking of situation-specific “alert-watch” patterns
   • Cluster analysis to establish background (normal) models
   • Cluster density and locus analysis for early detection of new pattern onset, or meaningful changes to established patterns
3. Test Applications
   • FED: Fedwire money-transfer database (simulated), which allows tracking of suspicious transaction patterns
   • MED: Massachusetts hospital admission database to detect attacks by biological agents
   • NED: Network-flow databases from two sources (CERT® at CMU, and the MIT Lincoln Labs), which allow detection of network attacks such as denial of service
4. ARGUS Data Explorer – a prototype analyst interface
   • v0.8 = evaluation at SAIC RDEC PP and MITRE/NIST
   • Framework for intensive, analyst-directed data exploration
   • Challenge of harnessing the technologies into a robust, user-friendly package that helps analysts in significantly reducing the size of the proverbial “haystack”

Role of ARGUS in a Hypothetical End-to-End Multifunctional Architecture
[Figure: architecture diagram. Component labels:
 Analysts; Mobile Agents; Analyst Workstation (Analyst Interface, Analyst Collaboration, Data Source, Active Context, Prioritization, Control, Validation, Exploration, Query Generation, Hypothesis Management, Hypothesis Evaluation);
 Raw Data (Biometric, News, Customs, Materials, Reports, Financial, RSS, Annotations, Net Traffic, Other);
 Analysis Subsystem – Text & Data (Data Normalization, Text Extraction & Modeling, Situation Assessment);
 Profile Queries; Events and Alerts; Matched Events;
 Distributed Structured Search Engines (Massive Search Control, Novelty Detection, Structured Data Search: exact, approximate, massive data, streaming data);
 Structured Data (Banking Transactions, Network Traffic, Hospital Admissions, Extracted News Archives, Extracted Agency Reports)]

Information Flow
[Figure: information-flow diagram. Labels: Historical Data, Select, Create Background Model, Background Model, New Data, Detect Novel Events, Novel Events, Re-cluster, Novel Clusters, Alerts, Analyst, Match, Tracked Events, Profiles, Update Profiles, New Profiles, Generate Profiles]

Novelty Detection
• Objective
  – Detect the onset of novel events in incoming data streams
  – Generate alerts for the analyst (with justifications, priorities, etc.)
  – If judged significant, then track developments, else discard
• Properties
  – Need a model of “business as usual” to detect divergences from it, which is done by clustering recent history
  – Control points (tradeoff in precision-recall)
    • Degree of deviation from normalcy required (radial density functions)
    • Amount of data support (e.g. number of observations) before alerting
    • Statistical model of normal “noise” in data streams

Cluster Evolution and Density Change Detection
[Figure: four panels labeled Constant Event, New Unobfuscated Event, New Obfuscated Event, Growing Event]

ARGUS Data Explorer Implementation
• A Client-Server System
  – Data Explorer Client
    • GUI components, i.e., DYNAMiX’s 2-D GUI embedded with ManTech’s 3-D GUI module
    • Interface connecting the GUI components to the Web Service API
  – ARGUS Server
    • Web Service API that delegates tasks to the application layer
    • Application Layer, which encompasses the core application functionality including clustering, novelty detection, re-clustering, exact and approximate matching
    • Data Access Layer, which includes application functionality such as set operations and data exchanges between the application layer and the database
    • DYNAMiX iX server used as a component for matching structured data
    • Data store (database)

ARGUS Data Explorer
• Typical Hardware
  – Server
    • Processor: Intel® Xeon™, 3.0GHz, 2MB Cache
    • Memory: 8GB DDR2 400MHz (4X2GB), Dual Ranked DIMMs
    • Disk Space: 300 GB
  – Client
    • Processor: Pentium 3 or higher
    • Memory: 512MB
    • Disk Space: 100MB
    • Graphics subsystem: High-performance, supporting 1280 x 1024 resolution
• Typical Software
  – Server
    • Operating System: Red Hat Enterprise Linux ES v4
    • Other: Oracle 9i; JSDK V1.4.2 (Tomcat); DYNAMiX iX
  – Client
    • Operating System: Windows 2000; Mac OS 10.4
    • Other: Java3D; JRE V1.4.2_11 (Swing Application)

ARGUS Data Explorer
[Figure: two image-only slides; no text recoverable]

3-D Data Cluster Display
3-D data and cluster display region. This region also shows the X, Y, and Z axes, with axis labels for the quantitative data dimensions chosen by the user.
Slider to control rotation about the horizontal (X) axis. This allows the user to rotate the data set to obtain a favorable view of the clusters of greatest interest.
Slider to control rotation about the vertical (Y) axis.
This allows the user to rotate the data set to obtain a favorable view of the clusters of greatest interest.

Cluster Control
Cluster size control allows the user to adjust the radius of the transparent sphere to one or three standard deviations.
Cluster selection controls allow the user to select any cluster and then hide the non-selected clusters, to focus on relevant data.

Novelty Detection
[Figure: three panels labeled T0 (0-15°), T1 (16-30°), T2 (31-45°)]

ARGUS Data Explorer: Menu Items
[Figure: image-only slide; no text recoverable]

DEMO PROBLEM: The MIT LL Test Dataset
• Network-flow dataset from the MIT Lincoln Labs, to experiment with the detection of network attacks
• 805,049 records x 42 independent fields (derived features of raw data) x 1 dependent field (post hoc classification label)
• The dependent field is a discrete value representing Connection Type
  – 0 is normal
  – 1, 2, 3 and 4 are malicious (1:probe, 2:denial_of_service, 3:user_to_root, and 4:remote_to_local)
• The goal is to use the ARGUS Data Explorer to learn a “normal” background model and then detect the onset of “malicious” attacks

DEMO PROBLEM: The MIT LL Dataset
MITLLT0 = 494,020 records x 43 fields
MITLLT1 = 311,029 records x 43 fields
(Ex. 1 and Ex. 2 are two example records.)

 #  Name                         Type        Ex. 1  Ex. 2  Description
 1  RECID                        discrete    1      2      id
 2  DURATION                     continuous  0      0      length (number of seconds) of the connection
 3  PROTOCOL_TYPE                discrete    2      2      type of the protocol, e.g., tcp, udp, etc.
 4  SERVICE                      discrete    23     23     network service on the destination, e.g., http, telnet, etc.
 5  FLAG                         discrete    10     10     normal or error status of the connection
 6  SRC_BYTES                    continuous  181    239    number of data bytes from source to destination
 7  DST_BYTES                    continuous  5450   486    number of data bytes from destination to source
 8  LAND                         discrete    0      0      1 if connection is from/to the same host/port; 0 otherwise
 9  WRONG_FRAGMENT               continuous  0      0      number of “wrong” fragments
10  URGENT                       continuous  0      0      number of urgent packets
11  HOT                          continuous  0      0      number of “hot” indicators
12  NUM_FAILED_LOGINS            continuous  0      0      number of failed login attempts
13  LOGGED_IN                    discrete    1      1      1 if successfully logged in; 0 otherwise
14  NUM_COMPROMISED              continuous  0      0      number of “compromised” conditions
15  ROOT_SHELL                   discrete    0      0      1 if root shell is obtained; 0 otherwise
16  SU_ATTEMPTED                 discrete    0      0      1 if “su root” command attempted; 0 otherwise
17  NUM_ROOT                     continuous  0      0      number of “root” accesses
18  NUM_FILE_CREATIONS           continuous  0      0      number of file creation operations
19  NUM_SHELLS                   continuous  0      0      number of shell prompts
20  NUM_ACCESS_FILES             continuous  0      0      number of operations on access control files
21  NUM_OUTBOUND_CMDS            continuous  0      0      number of outbound commands in an ftp session
22  IS_HOT_LOGIN                 discrete    0      0      1 if the login belongs to the “hot” list; 0 otherwise
23  IS_GUEST_LOGIN               discrete    0      0      1 if the login is a “guest” login; 0 otherwise
24  COUNT                        continuous  8      8      number of connections to the same host as the current connection in the past two seconds
25  SRV_COUNT                    continuous  8      8      number of connections to the same service as the current connection in the past two seconds
26  SERROR_RATE                  continuous  0      0      % of connections that have “SYN” errors
27  SRV_SERROR_RATE              continuous  0      0      % of connections that have “SYN” errors
28  RERROR_RATE                  continuous  0      0      % of connections that have “REJ” errors
29  SRV_RERROR_RATE              continuous  0      0      % of connections that have “REJ” errors
30  SAME_SRV_RATE                continuous  1      1      % of connections to the same service
31  DIFF_SRV_RATE                continuous  0      0      % of connections to different services
32  SRV_DIFF_HOST_RATE           continuous  0      0      % of connections to different hosts
33  DST_HOST_COUNT               continuous  9      19
34  DST_HOST_SRV_COUNT           continuous  9      19
35  DST_HOST_SAME_SRV_RATE       continuous  1      1
36  DST_HOST_DIFF_SRV_RATE       continuous  0      0
37  DST_HOST_SAME_SRC_PORT_RATE  continuous  0      0
38  DST_HOST_SRV_DIFF_HOST_RATE  continuous  0      0
39  DST_HOST_SERROR_RATE         continuous  0      0
40  DST_HOST_SRV_SERROR_RATE     continuous  0      0
41  DST_HOST_RERROR_RATE         continuous  0      0
42  DST_HOST_SRV_RERROR_RATE     continuous  0      0
43  CONNECTION_TYPE              discrete    0      0      0:normal, 1:probe, 2:denial_of_service, 3:user_to_root, 4:remote_to_local

DEMO PROBLEM: The MIT LL Test Dataset
• Recipe
  – Learn the baseline “normal” set of clusters
  – Cluster the incremental data, including “malicious” records, into this baseline
  – Assess quantitative and qualitative changes in clusters and test the capability of the system to detect novelties and alert the user
  – Iterate through this “learn ↔ test” cycle dynamically over time, progressively building domain knowledge based on data empirics
• Results
  – Please visit us at our DEMO & POSTER session on April 19th between 0800-1300
  – We will demonstrate successful alerting on denial-of-service and unauthorized-entry attempts, triggered by cluster density changes (as in the adjacent graph)

ARGUS Data Explorer: Challenges
• Goal
  – The Data Explorer should be a robust, intuitive, user-friendly system that provides decision support for an analyst who is an expert in the problem domain, but not an expert in advanced statistics and pattern recognition technologies
• Challenges
  – Human-in-the-loop automation of the underlying algorithms
  – Ensuring the transfer of maximally relevant data through the “data pipe” connecting the server to the client
  – Balancing act: iterative historical batch and near-real-time processing
  – Iterative scaling of the application: increase the amount of data handled; ensure user expectation is still maintained; repeat cycle
  – Setting the right expectation: a “work in progress” prototype

ARGUS Data Explorer: Where do we go from here?
• Combining the “top-down” and “bottom-up” analysis approaches in ARGUS II
• “Top-down” – Hypothesis creation using first principles and process heuristics
• “Bottom-up” – Probabilistic hypothesis validation and refutation using data empirics
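
Illustrative sketch (not from the slides): the novelty-detection recipe described earlier in the deck (learn a baseline of “normal” clusters from history, then test incoming data against it) might look roughly like the following in Python. The slides do not say which clustering algorithm ARGUS uses, so plain k-means, the RMS “radial sigma”, all function names, and the toy 2-D data are assumptions made only for illustration. The two parameters mirror the control points named on the Novelty Detection slide: n_sigma is the degree of deviation from normalcy required, and min_support is the amount of data support before alerting.

```python
import math
from statistics import mean

def sqdist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: sqdist(point, centroids[i]))

def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-first seeding (an assumption)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sqdist(p, c) for c in centroids)))
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [tuple(mean(d) for d in zip(*g)) if g else c
                     for g, c in zip(groups, centroids)]
    return centroids

def fit_background(history, k):
    """Background model: one (centroid, radial sigma) pair per non-empty cluster."""
    centroids = kmeans(history, k)
    model = []
    for i, c in enumerate(centroids):
        members = [p for p in history if nearest(p, centroids) == i]
        if members:
            # RMS distance to the centroid, used here as the cluster's radial sigma.
            sigma = math.sqrt(mean(sqdist(p, c) for p in members))
            model.append((c, sigma))
    return model

def detect_novel(model, batch, n_sigma=3.0, min_support=5):
    """Return (novel points, alert flag) for one batch of new records.

    A point is novel if it lies more than n_sigma radial sigmas from every
    background cluster; an alert is raised only with enough supporting points.
    """
    novel = [p for p in batch
             if all(sqdist(p, c) > (n_sigma * s) ** 2 for c, s in model)]
    return novel, len(novel) >= min_support

# Toy history: two "normal" clusters on small grids around (0, 0) and (10, 10).
grid = [(x * 0.5, y * 0.5) for x in range(-2, 3) for y in range(-2, 3)]
history = grid + [(10 + x, 10 + y) for x, y in grid]
model = fit_background(history, k=2)

# New batch: three normal-looking records plus five records in a new region.
batch = [(0.2, 0.1), (9.8, 10.3), (0.5, -0.5),
         (5, -5), (5.2, -5.1), (4.9, -4.8), (5.1, -5.3), (4.8, -5.0)]
novel, alert = detect_novel(model, batch)
print(len(novel), alert)  # 5 True
```

Raising min_support or n_sigma trades recall for precision: fewer alerts, but each one is backed by more evidence or a larger deviation. In the recipe on the slides, points flagged as novel would then be re-clustered and, if the analyst judges them significant, tracked.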