Big Data Technologies for InfoSec
Dive Deeper. See Further.
Ram Sripracha (rsriprac@ucla.edu)
UCLA / Sift Security

Experiences
RR Systems

What are “Big Data” systems?
• XXL in Size
• Data Volume
• TBs - PBs
• Computation Scalability
• Horizontally Scalable
• Multi-host Deployment
• Commodity Hardware

Why now?
• Rich Ecosystem
• Well Supported Open Source Software
• High Adoption Rate
• Commercial Backing
• “Red Hat” Model
• Heavily Invested Platform Providers

Technologies

Is it a “Big Data” problem?
• Many moving parts
• May be overwhelming at first
• 100s of configuration settings
• Requires some level of expertise
• Overkill for some problems
• Larger resource footprint

Big Data Stack

DFS
• NoSQL
• Columnar
• Sits on HDFS
• Million Rows x Million Columns
• Cell-level Security

Titan
• Graph-based Datastore
• Optimized for (E, V)
• Key/Value attributes for vertices and edges
• 100s of millions of vertices x 100s of billions of edges
• Capturing relationships
• Sits on top of HBase, Cassandra, …

Map-Reduce
• Resilient Distributed Dataset (RDD)
• In-Memory RDD
• Iterative Algorithms
• Machine Learning

Impala
• Near-real-time analysis
• Micro-batch processing
• Pipelining of micro-batches
• Stream annotations
• Sits on top of

• Distributed indexing and search
• Indexes
  • Raw text files from HDFS
  • HBase content
  • Titan properties
  • Other replicated data streams

Application Log Search
• Full Text Indexes
• Flexible Faceting
• Automatic field extraction
• Dashboard-able search interface
• Low-cost alternative to Splunk and other search solutions

Real-time Blacklist Alerting
• Fault tolerance
• Netflow annotation
• Match alerting
• Application access alerting
• Authentication alerting
• Network metrics

Netflow Data Warehouse
• 3 nodes
• 2 x 8-core Intel E5-2450 per node
• 16 GB RAM per node
• 72 TB storage total
• ~5B Netflow records/day
• >1 year retention
• Supports complex SQL-like queries

Netflow Data Warehouse
• Continuous scanning
• Direct querying of delimited files on DFS
• Perform metrics and diffs
• Compute trending
• Firewall rule validations
• Long retention

EMR Access Anomalies
• Category of insider threat
• Relational networks of
  • Users/Groups
  • Department
  • Document Access
• Community structure-based anomaly detection
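
The idea of flagging EMR accesses that cut across community structure can be illustrated with a toy sketch. The slides do not describe the actual algorithm, so this is only one possible reading: it treats each user's department as a stand-in for a detected community and flags an access when that department accounts for a small share of all accesses to the document. The function name `flag_cross_community_access`, the `min_share` threshold, and the sample users and records are all hypothetical.

```python
from collections import Counter, defaultdict


def flag_cross_community_access(access_log, department_of, min_share=0.2):
    """Return (user, document) pairs where the user's department makes up
    less than `min_share` of all accesses to that document.

    Assumption: department membership approximates community structure;
    a real system would likely detect communities from the access graph.
    """
    # Count accesses to each document, broken down by department.
    dept_counts = defaultdict(Counter)
    for user, doc in access_log:
        dept_counts[doc][department_of[user]] += 1

    flagged = []
    for user, doc in access_log:
        counts = dept_counts[doc]
        share = counts[department_of[user]] / sum(counts.values())
        if share < min_share:
            flagged.append((user, doc))
    return flagged


# Hypothetical sample data: three clinicians and one billing user.
department_of = {
    "alice": "cardiology",
    "bob": "cardiology",
    "carol": "cardiology",
    "dave": "billing",
}
access_log = [
    ("alice", "record-1"), ("bob", "record-1"), ("carol", "record-1"),
    ("alice", "record-1"), ("bob", "record-1"),
    ("dave", "record-1"),   # billing user touching a cardiology record
    ("dave", "invoice-7"),  # billing user touching a billing document
]

print(flag_cross_community_access(access_log, department_of))
```

In this sample, only the billing user's access to the cardiology record falls below the threshold. At the scale discussed in the slides, the same counts would more plausibly be computed over the graph store or a distributed job than in a single in-memory pass.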