uccsc2014-20140804

Big Data Technologies for InfoSec
Dive Deeper. See Further.
Ram Sripracha ([email protected])
UCLA / Sift Security
Experiences
RR Systems
What are “Big Data” systems?
• XXL in Size
• Data Volume
• TBs - PBs
• Computation Scalability
• Horizontally Scalable
• Multi-host Deployment
• Commodity Hardware
Why now?
• Rich Ecosystem
• Well Supported Open Source Software
• High Adoption Rate
• Commercial Backings
• “Red Hat” Model
• Heavily Invested
Platform Providers
Technologies
Is it a “Big Data” problem?
• Many moving parts
• May be overwhelming initially
• 100s of configuration settings
• Requires some level of expertise
• Overkill for some problems
• Larger resource footprint
Big Data Stack
DFS
• NoSQL
• Columnar
• Sits on HDFS
• Million Rows x Million Columns
• Cell-level Security
Titan
• Graph-based Datastore
• Optimized for (E, V)
• Key/Value attributes for vertices and edges
• 100s of millions of vertices x 100s of billions of edges
• Capturing relationships
• Sits on top of HBase, Cassandra, …
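The property-graph model described above (vertices and edges that both carry key/value attributes) can be sketched in a few lines. This is a toy in-memory illustration, not Titan's API; the class `PropertyGraph`, its methods, and the sample vertex names are all hypothetical.

```python
class PropertyGraph:
    """Toy in-memory property graph (hypothetical, not Titan's API)."""

    def __init__(self):
        self.vertices = {}   # vertex id -> attribute dict
        self.out_edges = {}  # source id -> list of (label, target id, attribute dict)

    def add_vertex(self, vid, **attrs):
        self.vertices[vid] = dict(attrs)

    def add_edge(self, src, label, dst, **attrs):
        # Edges capture relationships, so both endpoints must already exist.
        if src not in self.vertices or dst not in self.vertices:
            raise KeyError("both endpoints must exist before adding an edge")
        self.out_edges.setdefault(src, []).append((label, dst, dict(attrs)))

    def neighbors(self, vid, label=None):
        """Targets reachable over one outgoing edge, optionally filtered by label."""
        return [dst for (lbl, dst, _) in self.out_edges.get(vid, [])
                if label is None or lbl == label]


# Made-up example data: a user, two hosts, and the relationships between them.
g = PropertyGraph()
g.add_vertex("user:alice", kind="user")
g.add_vertex("host:10.0.0.5", kind="host")
g.add_vertex("host:10.0.0.9", kind="host")
g.add_edge("user:alice", "logged_into", "host:10.0.0.5", count=3)
g.add_edge("host:10.0.0.5", "connected_to", "host:10.0.0.9", port=22)
```

A real deployment would persist the adjacency lists in the backing store rather than in Python dictionaries; the sketch only shows the data model.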
Map-Reduce
• Resilient Distributed Dataset (RDD)
• In-Memory RDD
• Iterative Algorithms
• Machine Learning
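Why in-memory datasets suit iterative algorithms can be shown with a small sketch: the same partitions are scanned on every pass, so keeping them in memory avoids re-reading them from disk each iteration. The code below is plain Python, not a cluster API; the list `partitions` stands in for a distributed dataset, and the 1-D k-means routine and its sample values are assumptions chosen only for illustration.

```python
# Each inner list stands in for one partition of a distributed dataset.
partitions = [[1.0, 2.0, 3.0], [10.0, 11.0], [12.0, 2.5]]


def assign_and_sum(partition, centers):
    """Map step: per-partition partial sums and counts for each nearest center."""
    sums = [0.0] * len(centers)
    counts = [0] * len(centers)
    for x in partition:
        nearest = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
        sums[nearest] += x
        counts[nearest] += 1
    return sums, counts


def kmeans_1d(partitions, centers, iterations=10):
    """Iteratively refine the centers; every pass re-scans the same partitions."""
    for _ in range(iterations):
        partials = [assign_and_sum(p, centers) for p in partitions]
        new_centers = []
        for i, old in enumerate(centers):
            # Reduce step: combine the partial results from all partitions.
            total = sum(s[i] for s, _ in partials)
            count = sum(n[i] for _, n in partials)
            new_centers.append(total / count if count else old)
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers


centers = kmeans_1d(partitions, [0.0, 5.0])
```

Only the small per-partition sums and counts move between steps; the data itself stays where it is, which is the property that makes repeated passes cheap.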
Impala
• Near-real-time analysis
• Micro-batch processing
• Pipelining of micro-batches
• Stream annotations
• Sits on top of
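Micro-batching and stream annotation, as listed above, can be sketched with generators. This is a minimal illustration and not any streaming engine's API: the functions `micro_batches`, `annotate`, and `pipeline` are hypothetical, and batches are cut by record count here, whereas a streaming engine would more typically cut them by time interval.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Group an event stream into small fixed-size batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def annotate(batch, batch_id):
    """Attach derived fields to every event in a batch."""
    return [{"batch": batch_id, "event": e, "length": len(e)} for e in batch]


def pipeline(stream, batch_size=2):
    """Pipeline the stages: batches flow through annotation one at a time."""
    for batch_id, batch in enumerate(micro_batches(stream, batch_size)):
        yield annotate(batch, batch_id)


# Made-up events for illustration.
events = ["login alice", "login bob", "logout alice"]
results = list(pipeline(events))
```

Because each stage is a generator, a batch is annotated as soon as it is complete, without waiting for the rest of the stream.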
• Distributed indexing and search
• Indexes:
• Raw text files from HDFS
• HBase content
• Titan properties
• Other replicated data streams
Application Log Search
• Full Text Indexes
• Flexible Faceting
• Automatic field extraction
• Dashboard-able search interface
• Low-cost alternative to Splunk and other search solutions
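Full-text indexing with faceting can be sketched as an inverted index plus a per-field count over the matching documents. This is a toy illustration, not the API of any search product; the log records, field names, and the functions `build_index` and `search` are made up for the example.

```python
import re
from collections import Counter, defaultdict

# Made-up log records for illustration.
logs = [
    {"host": "web1", "level": "ERROR", "msg": "login failed for user alice"},
    {"host": "web2", "level": "INFO", "msg": "login ok for user bob"},
    {"host": "web1", "level": "ERROR", "msg": "disk quota exceeded"},
]


def build_index(docs):
    """Inverted index: term -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for term in re.findall(r"\w+", doc["msg"].lower()):
            index[term].add(doc_id)
    return index


def search(index, docs, query, facet_field):
    """Return ids matching all query terms, plus hit counts per facet value."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return [], Counter()
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    ordered = sorted(hits)
    facets = Counter(docs[i][facet_field] for i in ordered)
    return ordered, facets


index = build_index(logs)
hits, facets = search(index, logs, "login", "level")
```

The facet field is chosen at query time, so the same index can be sliced by host, level, or any other extracted field.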
Real-time Blacklist Alerting
• Fault tolerance
• Netflow annotation
• Match alerting
• Application access alerting
• Authentication alerting
• Network metrics
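Netflow annotation with blacklist match alerting can be sketched as follows. This is a minimal single-process illustration: the blacklist entries are documentation-range addresses chosen for the example, and the flow record fields (`src`, `dst`, `dport`) and function names are assumptions, not the format used in the deployment described here.

```python
import ipaddress

# Hypothetical blacklist built from documentation-only address ranges.
BLACKLIST = [ipaddress.ip_network(n)
             for n in ("203.0.113.0/24", "198.51.100.7/32")]


def blacklisted(ip):
    """True if the address falls inside any blacklisted network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLACKLIST)


def annotate_flow(flow):
    """Return a copy of the flow tagged with which endpoints matched."""
    matches = [side for side in ("src", "dst") if blacklisted(flow[side])]
    return {**flow, "blacklist_match": matches}


# Made-up flow records for illustration.
flows = [
    {"src": "10.0.0.5", "dst": "203.0.113.9", "dport": 443},
    {"src": "10.0.0.6", "dst": "192.0.2.1", "dport": 80},
]
annotated = [annotate_flow(f) for f in flows]
alerts = [f for f in annotated if f["blacklist_match"]]
```

Every flow keeps its annotation, so the same output can feed both alerting (non-empty matches) and network metrics (all records).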
Netflow Data Warehouse
• 3x Nodes
• 2x 8-Core Intel E5-2450 per node
• 16 GB RAM per node
• 72TB Storage Total
• ~5B Netflow records/day
• >1 year retention
• Supports complex SQL-like queries
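The figures above imply a tight per-record storage budget, which a short calculation makes explicit. The sketch assumes decimal terabytes (10^12 bytes), exactly 365 days of retention, and an even ingest rate over the day; the `replication` parameter is an assumption, since the slides do not state a replication factor.

```python
RECORDS_PER_DAY = 5_000_000_000      # ~5B Netflow records/day
RETENTION_DAYS = 365                 # assumes exactly one year of retention
TOTAL_STORAGE_BYTES = 72 * 10**12    # 72 TB total, assuming decimal terabytes


def bytes_per_record(replication=1):
    """Storage available per retained record for a given replication factor."""
    retained = RECORDS_PER_DAY * RETENTION_DAYS
    return TOTAL_STORAGE_BYTES / (retained * replication)


# Average ingest rate if records arrive evenly over 24 hours.
records_per_second = RECORDS_PER_DAY / 86_400

print(f"{bytes_per_record():.1f} bytes/record without replication")
print(f"{bytes_per_record(3):.1f} bytes/record with 3x replication")
print(f"{records_per_second:,.0f} records/second on average")
```

Under these assumptions the budget is roughly 39 bytes per record before any replication, and the average ingest rate is about 58,000 records per second, which suggests that compact encoding or compression is needed to reach a year of retention.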
Netflow Data Warehouse
• Continuous scanning
• Direct querying of delimited files
DFS
• Perform metrics and diffs
• Compute trending
• Firewall rule validations
• Long retention
EMR Access Anomalies
• Category of insider threat
• Relational networks of
• Users/Groups
• Department
• Document Access
• Community structure-based anomaly detection
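The idea of flagging accesses that cut across community structure can be sketched with a deliberately simplified stand-in. Here departments are used as the communities, which is an assumption: a real community-detection method would infer the groups from the access graph itself. All users, departments, and record ids below are made up.

```python
from collections import Counter, defaultdict

# Made-up users, departments, and document accesses for illustration.
department = {
    "alice": "cardiology", "bob": "cardiology",
    "carol": "oncology", "dave": "oncology",
}
accesses = [
    ("alice", "rec1"), ("bob", "rec1"), ("alice", "rec2"),
    ("carol", "rec3"), ("dave", "rec3"), ("dave", "rec4"),
    ("bob", "rec3"),
]


def home_community(accesses, department):
    """Assign each document to the community that accesses it most often."""
    votes = defaultdict(Counter)
    for user, doc in accesses:
        votes[doc][department[user]] += 1
    return {doc: counts.most_common(1)[0][0] for doc, counts in votes.items()}


def anomalies(accesses, department):
    """Accesses where the user's community differs from the document's."""
    home = home_community(accesses, department)
    return [(u, d) for u, d in accesses if department[u] != home[d]]


flagged = anomalies(accesses, department)
```

In this toy data the only flagged access is a cardiology user opening a record otherwise used by oncology. A flag of this kind is a lead for review, not proof of misuse.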