Hyperion: High Volume Stream Archival for Retrospective Querying
Peter Desnoyers and Prashant Shenoy
University of Massachusetts, Amherst

Packet monitoring with history

- Packet monitor: capture and search packet headers (e.g., Snort, tcpdump, Gigascope)
- ... with history: capture, index, and store packet headers, with interactive queries on the stored data
- Provides new capabilities:
  - Network forensics: when was a system compromised? From where? How?
  - Management: after-the-fact debugging

Challenges

- Speed: 1 Gbit/s × 80% utilization ÷ 400 B/pkt = 250,000 pkts/s
- Storage: the rate and capacity to store data without loss and retain it long enough, for each link monitored
- Queries: must search millions of packet records
- Indexing: in real time, to support online queries
- Commodity hardware

Existing approaches

  System                                           Event rates  Archive  Index, query  Commodity
  Streaming query systems (GigaScope, Bro, Snort)  Yes          No       No            Yes
  Peer-to-peer systems (MIND, PIER)                No           Yes      Yes           Yes
  Conventional DBMS                                No           Yes      Yes           Yes
  CoMo                                             Yes          Yes      No            Yes
  Proprietary systems*                             ?            Yes      Yes           No

  *Niksun NetDetector, Sandstorm NetInterceptor

Packet monitoring with history requires a new system.

Outline of talk

- Introduction and motivation
- Design
- Implementation
- Results
- Conclusions

Hyperion Design

- Multiple monitor systems
- High-speed storage system
- Local index
- Distributed index for query routing

[Figure: a Hyperion node, comprising monitor/capture, index, distributed index, and storage components]

Storage Requirements

- Real-time: writes must keep up or data is lost
- Prioritized: reads shouldn't interfere with writes
- Aging: old data is replaced by new

Behavior: typical application vs. Hyperion stream storage

  Behavior          Typical app    Hyperion
  Likely deletes    Newest files   Oldest data
  File size         Random, small  Streaming
  Sequential reads  Yes            No

Packet monitoring is different from typical applications.

Log-structured stream storage

- Goal: minimize seeks despite interleaved writes on multiple streams
- A log-structured file system minimizes seeks: writes are interleaved at an advancing frontier, and free space is collected by a segment cleaner
- But a general-purpose segment cleaner performs poorly on streams

[Figure: segments from streams A, B, and C interleaved at the write frontier, shown by disk position]

Hyperion StreamFS

- How to improve on a general-purpose file system? Rely on application use patterns and eliminate unneeded features
- StreamFS: log structure with no segment cleaner
  - No deletes (just overwrite), hence no fragmentation and no segment-cleaning overhead
- Operation: write a fixed-size segment, then advance the write frontier to the next segment that is ready for deletion (a minimal sketch of this write path follows below)
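To make the write path concrete, here is a minimal sketch of the scheme described above, assuming a hypothetical StreamDisk class; the names and sizes are illustrative, not StreamFS's actual code:

```python
# Sketch of StreamFS-style allocation: the disk is an array of fixed-size
# segments, and all writes land at an advancing frontier that wraps around,
# overwriting the oldest segment instead of running a segment cleaner.

class StreamDisk:
    def __init__(self, num_segments, segment_size):
        self.segment_size = segment_size
        self.segments = [None] * num_segments  # slot -> (stream_id, data)
        self.frontier = 0                      # next segment to write

    def append(self, stream_id, data):
        """Write one fixed-size segment for a stream at the write frontier."""
        assert len(data) <= self.segment_size
        # The segment at the frontier always holds the oldest data on disk,
        # so it is "ready for deletion" and can simply be overwritten.
        slot = self.frontier
        self.segments[slot] = (stream_id, data)
        self.frontier = (self.frontier + 1) % len(self.segments)
        return slot

# Interleaved writes from several streams all land at the frontier, so the
# disk sees near-sequential I/O no matter how many streams are captured.
# The fifth write below wraps around and overwrites the oldest segment.
disk = StreamDisk(num_segments=4, segment_size=1 << 20)
for stream in ["A", "C", "C", "A", "B"]:
    disk.append(stream, b"packet header batch")
```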
StreamFS Design

- Record: a single write, packed into a segment
- Segment: fixed-size and single-stream, interleaved into a region
- Region: carries a region map
- Region map: identifies the segments in the region; used when the write frontier wraps
- Directory: locates streams on disk

StreamFS optimizations

- Data retention: controls how much history is saved; a reservation lets the filesystem make the delete decisions, so old data is dropped as new data arrives
- Speed balancing: worst-case speed is set by the slowest tracks; the solution is to interleave fast and slow sections of the disk, so worst-case speed is instead set by the average track

Local Index

Requirements: high insertion speed and interactive query response.

  Index/search mechanism  Search speed  Insert speed
  Exhaustive search       No            Yes
  B-tree                  Yes           No
  Hash index              Yes           No
  Signature index         Yes           Yes

Signature Index

- Compress the data (keys) into a signature, stored separately from the records
- Search the signature, not the data; retrieve the data itself only on a match
- Signature algorithm: Bloom filter
  - No false negatives: never misses a result
  - False positives: extra read overhead

Signature index efficiency

- Overhead = bytes searched: the index scan itself plus the data scans caused by false positives
- Concise index: low index scan cost, but a high false-positive scan cost
- Verbose index: high index scan cost, but a low false-positive scan cost

[Figure: bytes searched vs. index size, split into index-scan and false-positive (data scan) components]

Multi-level signature index

- Concise index: low scan overhead
- Verbose index: low false-positive overhead
- Use both: scan the concise index, check its positives against the verbose index, and fetch data records only on a verbose-index match (see the sketch below)

[Figure: concise index → verbose index → data records]
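As a concrete illustration of the two-level scheme, here is a minimal sketch assuming per-batch Bloom filters; the class names and filter parameters are hypothetical, not Hyperion's actual index code:

```python
# Two-level signature index sketch: a small "concise" Bloom filter is
# scanned first; only batches it matches are checked against the larger
# "verbose" filter, and only verbose matches cause the records to be read.

import hashlib

class BloomFilter:
    def __init__(self, nbits, nhashes):
        self.nbits, self.nhashes = nbits, nhashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.nhashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # May report false positives, but never false negatives.
        return all((self.bits >> p) & 1 for p in self._positions(key))

class SignatureIndex:
    """One concise + one verbose filter per batch of stored records."""

    def __init__(self):
        self.batches = []  # list of (concise, verbose, records)

    def add_batch(self, records, keys):
        concise = BloomFilter(nbits=256, nhashes=2)    # cheap to scan
        verbose = BloomFilter(nbits=8192, nhashes=4)   # few false positives
        for k in keys:
            concise.add(k)
            verbose.add(k)
        self.batches.append((concise, verbose, records))

    def query(self, key):
        for concise, verbose, records in self.batches:
            if concise.might_contain(key) and verbose.might_contain(key):
                # Only now pay the cost of reading the data itself.
                yield from (r for r in records if r["key"] == key)

# Example: index one batch of packet-header records and query it.
idx = SignatureIndex()
idx.add_batch([{"key": "10.0.0.1:80", "hdr": b"..."}], ["10.0.0.1:80"])
print(list(idx.query("10.0.0.1:80")))
```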
Distributed Index

- Query routing: send queries only to the nodes holding matches, using the signature index
- Index distribution: aggregate indexes at a cluster head; route queries through the cluster head; rotate the cluster head for load sharing

Implementation

- Components: StreamFS, index, capture, RPC with query and index distribution, and the query API
- A Python framework provides the RPC, query and index distribution, the query API, and the index; capture and StreamFS run in the Linux kernel

[Figure: Hyperion components layered on the Linux OS]

Outline of talk

- Introduction and motivation
- Design
- Implementation
- Results
- Conclusions

Experimental Setup

- Hardware: Linux cluster; dual 2.4 GHz Xeon CPUs; 1 GB memory; 4 × 10K RPM SCSI disks; SysKonnect SK98xx NIC with the U. Cambridge driver
- Test data: packet traces from the UMass Internet gateway* at 400 Mbit/s (100K pkts/s)

  *http://traces.cs.umass.edu

StreamFS – write performance

- Tested configurations: NetBSD/LFS, Linux/XFS (SGI), and StreamFS
- Workload: multiple streams and rates; logfile rotation used for LFS and XFS
- Results: a 50% boost in worst-case throughput; fast enough to store 1,000,000 packet headers/s

StreamFS – read/write

- Workload: continuous writes with concurrent random reads
- StreamFS sustains its write throughput; XFS throughput collapses
- StreamFS can handle combined stream read+write traffic without data loss; XFS cannot.

Index Performance

- Index calculation benchmark: 250,000 pkts/s insertion rate
- Query benchmark: 380M packet headers (26 GB of data), with a selective query (1 packet returned)
- Query result: 13 MB of data fetched to query 26 GB (a 1:2000 ratio)

[Figure: data fetched (MB) vs. index size]

System Performance

- Workload: trace replay with simultaneous queries, at 100–200K pkts/s
- Packet loss measured as #transmitted − #received

  Packets/s  Loss rate
  110,000    0
  130,000    0
  150,000    2·10⁻⁶
  160,000    4·10⁻⁶
  175,000    10·10⁻⁶
  200,000    1·10⁻³

Up to 175K pkts/s with negligible packet loss.

Conclusions

- Hyperion: packet monitoring with retrospective queries
- Key components:
  - Storage: a 50% improvement over general-purpose file systems
  - Index: inserts at 250K pkts/s; interactive queries over hundreds of millions of packets
  - System: captures, indexes, and queries at 175K pkts/s

Questions?