Hyperion

Hyperion: High-Volume Stream Archival
Divya Muthukumaran
Area
  Network Monitoring
  Identify problems due to overloaded and/or crashed servers, network connections, or other devices
  Example: to determine the status of a web server, monitoring software may periodically send an HTTP request to fetch a page
Live Monitoring
 Packets are examined in real time
 Compute and continually update traffic statistics
 Discard the captured packet headers once examined
 Why the need to store packet headers?
  Example: network forensics
      Go back and examine the root cause of a problem
      E.g., see how an intruder gained entry or how a worm infection happened
Why is such a system needed?
  Querying and examining live data
  Data archival
      Capture the data at wire speed, then index and store it
      Efficiently support retrieval and processing of archived data
      Designed specifically for the needs of high-volume stream archival
Why not traditional databases?
  Some statistics
      A single gigabit link can generate over 100,000 packets per second and tens of MB per second of archival data
      A monitor may record from multiple links
Design Principles
  Support queries, not reads
      Implies the need to maintain indexes
  Writes
      Sequential and immutable
  Archive locally, summarize globally
      Scalability vs. the need to avoid flooding
      Scalability: favors local archiving and indexing to avoid network writes
      Need to answer distributed queries: favors sharing information across nodes
Hyperion
Three key components
  Stream file system
      High-volume archiving and querying
  Multi-level index structure
      High update rates + reasonable lookup performance
  Distributed index layer
      Distributes a summary of local indices to enable distributed querying
Design choices for the Hyperion storage system
  Storage of multiple high-speed traffic streams without loss
  Support for concurrent read activity without loss of write performance
  Reuse of storage in a buffer-like fashion
Stream File System
  Stores streams as opposed to files
  Characteristics
      Recycled: when storage is full, new data replaces old data
          In a general-purpose (GP) file system, new data is lost and old data is retained
      Immutable
      Record-oriented: data is written in fixed- or variable-length records
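A minimal sketch of the "recycled" behaviour, assuming an in-memory store with a byte budget. The class name RecycledStream and the capacity are made up for illustration and are not part of StreamFS: when a new record does not fit, the oldest records are dropped instead of the write being refused.

```python
from collections import deque


class RecycledStream:
    """Toy stream store: records are appended, never modified, and the
    oldest ones are evicted once the byte budget is exhausted."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.records = deque()  # oldest record sits on the left

    def append(self, record: bytes):
        if len(record) > self.capacity:
            raise ValueError("record larger than total capacity")
        # Recycle: evict the oldest records until the new one fits.
        while self.used + len(record) > self.capacity:
            old = self.records.popleft()
            self.used -= len(old)
        self.records.append(record)
        self.used += len(record)


stream = RecycledStream(capacity_bytes=10)
for rec in (b"aaaa", b"bbbb", b"cccc"):
    stream.append(rec)
print(list(stream.records))  # [b'bbbb', b'cccc']
```

A general-purpose file system with the same 10-byte budget would instead fail the third write and keep the two oldest records.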
Can we use a GP FS?
 Need to map streams <=>files
LogFile Rotation
Stream FS
Stream FS Organization
  Log-structured FS
  What problem?
      Cleaning/garbage collection
  StreamFS solves the cleaning problem
      Guarantee: storage guarantee for each stream
      Small segment size (1 or ½ MB)
      Check whether the next segment is surplus. If yes, overwrite it; otherwise skip it.
  Advantages?
      Storage reservation
      Best-effort use of remaining storage
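The overwrite-or-skip rule can be sketched as follows. This is an illustration under stated assumptions, not the StreamFS allocator: the disk is modelled as a ring of segments, each labelled with its owning stream (or None if free), and a segment counts as "surplus" when it is free or when its owner would hold more segments than its guarantee. The function name write_segment and the data layout are hypothetical.

```python
from collections import Counter


def write_segment(owners, guarantee, frontier, stream):
    """Write one segment for `stream`, starting the search at `frontier`.

    owners    : list with the owning stream of each segment (None = free)
    guarantee : dict mapping stream -> number of guaranteed segments
    Returns the new write frontier.
    """
    held = Counter(o for o in owners if o is not None)
    held[stream] += 1  # count the segment this write is about to claim
    n = len(owners)
    for step in range(n):
        i = (frontier + step) % n
        owner = owners[i]
        # Surplus: free, or owned by a stream above its guarantee.
        if owner is None or held[owner] > guarantee.get(owner, 0):
            owners[i] = stream  # overwrite in place, no cleaning pass
            return (i + 1) % n
        # Otherwise the segment is protected by a guarantee: skip it.
    raise RuntimeError("no surplus segment available")


owners = ["A", "A", "B", "A"]
guarantee = {"A": 2, "B": 1}
frontier = write_segment(owners, guarantee, 0, "B")         # A is over quota: overwrite segment 0
frontier = write_segment(owners, guarantee, frontier, "B")  # segment 1 is protected: skip to 2
print(owners, frontier)  # ['B', 'A', 'B', 'A'] 3
```

Because a protected segment is simply skipped, no live data ever has to be copied elsewhere, which is how the cleaning problem of a log-structured layout is avoided in this sketch.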
Reads
  First get index
  Use index to get data
  Persistent handles
      Returned from each write operation
      Passed to the read operation to retrieve data
  What does the handle contain?
      Disk location, approximate length
      Allows data to be retrieved directly
Handle issues
  Validate the handle. How?
  Self-certifying record header
      ID of the stream
      Permissions of the stream
      Record length
      Hash (used for validating the handle)
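One way a self-certifying header could validate a handle is sketched below. The slides do not say how the hash is computed, so the keyed hash (HMAC-SHA256 with a file-system secret), the field widths, and the function names are all assumptions made for illustration. A handle is valid only if the bytes at its disk location form a header whose stored hash matches the recomputed one.

```python
import hashlib
import hmac
import struct

SECRET = b"hypothetical-fs-key"   # assumption: key known only to the file system
HEADER = struct.Struct("<IHI")    # stream ID, permissions, record length
TAG_LEN = 16                      # bytes of the keyed hash stored in the header


def _tag(fields: bytes) -> bytes:
    return hmac.new(SECRET, fields, hashlib.sha256).digest()[:TAG_LEN]


def write_record(disk: bytearray, stream_id: int, perms: int, payload: bytes):
    """Append header + payload and return a handle (disk location, length)."""
    offset = len(disk)
    fields = HEADER.pack(stream_id, perms, len(payload))
    disk.extend(fields + _tag(fields) + payload)
    return offset, len(payload)


def read_record(disk: bytearray, handle):
    """Return (stream_id, perms, payload), or None if the handle is invalid."""
    offset, _approx_len = handle
    fields = bytes(disk[offset:offset + HEADER.size])
    tag = bytes(disk[offset + HEADER.size:offset + HEADER.size + TAG_LEN])
    if len(fields) < HEADER.size or len(tag) < TAG_LEN:
        return None
    if not hmac.compare_digest(tag, _tag(fields)):
        return None  # location does not hold a genuine record header
    stream_id, perms, length = HEADER.unpack(fields)
    start = offset + HEADER.size + TAG_LEN
    return stream_id, perms, bytes(disk[start:start + length])


disk = bytearray()
handle = write_record(disk, stream_id=7, perms=0o640, payload=b"pkt-header")
print(read_record(disk, handle))                       # (7, 416, b'pkt-header')
print(read_record(disk, (handle[0] + 1, handle[1])))   # None: stale or forged handle
```

The stream ID and permissions recovered from the validated header are what a real system would then check against the caller before returning data.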
Stream FS Organization
  Record
      Variable length
      On-disk record + header
  Block
      Fixed length
      Multiple records of the same stream
  Block map
      Every nth block
      Stream ID + in-stream sequence number for each of the preceding n-1 blocks
      Used for easy write allocation
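The block-map layout can be illustrated with a small sketch, assuming n = 4 and representing each data block only by its (stream ID, in-stream sequence number) pair. The names N, layout, and owners_from_maps are hypothetical. The point is that ownership of every block can be recovered by reading the map blocks alone.

```python
N = 4  # assumption for the example: every 4th block is a block map


def layout(data_blocks):
    """Lay out data blocks in write order, inserting a map as every Nth block."""
    disk = []
    for blk in data_blocks:
        disk.append(("data", blk))
        if len(disk) % N == N - 1:
            # The next slot is a map block: summarize the preceding N-1 blocks.
            entries = [b for _kind, b in disk[-(N - 1):]]
            disk.append(("map", entries))
    return disk


def owners_from_maps(disk):
    """Recover block index -> (stream ID, sequence number) from map blocks only."""
    owners = {}
    for i in range(N - 1, len(disk), N):
        _kind, entries = disk[i]
        for j, entry in enumerate(entries):
            owners[i - (N - 1) + j] = entry
    return owners


disk = layout([("A", 0), ("B", 0), ("A", 1), ("A", 2)])
print(disk[3])                 # ('map', [('A', 0), ('B', 0), ('A', 1)])
print(owners_from_maps(disk))  # {0: ('A', 0), 1: ('B', 0), 2: ('A', 1)}
```

Block 4 in this example is not yet covered by a map, since its map block would only be written after two more data blocks.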
Indexing
  Uses signature-based indices
  Signature for each segment
      Can check whether a record with key k is present in the segment
      Does not tell you where in the segment the record is
Multi-level Indices
  Uses a Bloom filter
      Hash(key) -> b bits
      In the b bits, k bits are set to 1
      Signature: Hs = H(key1) OR H(key2) OR … OR H(keyn)
  How to check for the presence of a record?
      Compute the hash of its key kr, H(kr)
      If a bit is set in H(kr) but not in Hs, then the value is not present
      Otherwise the record may be present: false positives are possible
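The signature check above can be sketched in a few lines. The parameters B = 64 and K = 3 and the use of SHA-256 to pick bit positions are assumptions for the example; the slides do not specify which hash functions are used.

```python
import hashlib

B = 64  # signature width in bits (assumed)
K = 3   # bit positions chosen per key (assumed); collisions may set fewer bits


def key_hash(key: bytes) -> int:
    """H(key): a B-bit word with up to K bits set."""
    h = 0
    for i in range(K):
        digest = hashlib.sha256(bytes([i]) + key).digest()
        h |= 1 << (int.from_bytes(digest[:8], "big") % B)
    return h


def signature(keys) -> int:
    """Hs: bitwise OR of H(key) over all keys in the segment."""
    hs = 0
    for key in keys:
        hs |= key_hash(key)
    return hs


def may_contain(hs: int, key: bytes) -> bool:
    """False means definitely absent; True means possibly present."""
    h = key_hash(key)
    return h & ~hs == 0  # every bit set in H(key) is also set in Hs


segment_keys = [b"10.0.0.1", b"10.0.0.2", b"192.168.1.9"]
hs = signature(segment_keys)
print(all(may_contain(hs, k) for k in segment_keys))  # True
```

A key that was never inserted usually fails the check, but it can pass by chance when its bits happen to be covered by other keys. That is the false-positive case, and it only costs an unnecessary look into the segment.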
Distributed Index
  How to handle distributed queries without flooding?
  Maintain a distributed index
  Integrated view of all nodes
  Coarse-grained summary of data at each node is needed
  Can use the top-level index in Hyperion
  One index node per time interval
  All nodes send their top-level indices to this node
  Temporally distributed index
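A rough sketch of a temporally distributed index follows. Several details are assumptions rather than Hyperion's actual protocol: intervals are fixed-length, the index node for an interval is picked round-robin over the sorted node list, and a plain Python set stands in for each node's top-level signature. The class and method names are hypothetical.

```python
from collections import defaultdict


class Cluster:
    def __init__(self, nodes, interval_seconds):
        self.nodes = sorted(nodes)
        self.interval = interval_seconds
        # store[index_node][interval_no][source_node] = summary of that source
        self.store = {n: defaultdict(dict) for n in self.nodes}

    def interval_no(self, ts):
        return int(ts) // self.interval

    def index_node(self, interval_no):
        # Assumption: round-robin assignment of intervals to nodes.
        return self.nodes[interval_no % len(self.nodes)]

    def publish(self, source, ts, summary):
        """A monitor sends its top-level summary to the interval's index node."""
        i = self.interval_no(ts)
        self.store[self.index_node(i)][i][source] = summary

    def candidates(self, key, t0, t1):
        """Ask only the index nodes for [t0, t1] which monitors may hold `key`."""
        hits = set()
        for i in range(self.interval_no(t0), self.interval_no(t1) + 1):
            summaries = self.store[self.index_node(i)].get(i, {})
            for source, summary in summaries.items():
                if key in summary:
                    hits.add(source)
        return hits


cluster = Cluster(["mon-a", "mon-b", "mon-c"], interval_seconds=60)
cluster.publish("mon-a", 10, {"10.0.0.1"})
cluster.publish("mon-b", 70, {"10.0.0.1", "10.0.0.2"})
cluster.publish("mon-c", 70, {"10.0.0.3"})
print(sorted(cluster.candidates("10.0.0.1", 0, 119)))  # ['mon-a', 'mon-b']
```

The query contacts one index node per interval and then only the monitors it names, instead of flooding every node.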