Sailfish: A Framework For Large Scale Data Processing

Sriram Rao
CISL@Microsoft
Oct. 15, 2012
Joint Work With Colleagues…
• Raghu Ramakrishnan (at Yahoo and now at
Microsoft)
• Adam Silberstein (at Yahoo and now at
LinkedIn)
• Mike Ovsiannikov and Damian Reeves (at
Quantcast)
Motivation
• “Big data” is a booming industry:
– Collect massive amounts of data (10’s of TB/day)
– Use data intensive compute frameworks (Hadoop,
Cosmos, Map-Reduce) to extract value from the
collected data
– Volume of data processed has become a bragging right
• How do frameworks handle data at scale?
– Not well studied in the literature
M/R Dataflow
[Figure: M/R dataflow. Map tasks (M0, M1) read input from the DFS and write intermediate data; reduce tasks (R0, R1, R2) fetch it and write output back to the DFS. Skew and disk seeks arise in the intermediate data transfer.]
Disk Overheads
• Intermediate data transfer is seek intensive => I/Os are small/random (worked example below)
– # of disk seeks for transferring intermediate data is proportional to M * R
[Figure: each of the M map tasks produces output for each of the R reduce tasks, so the intermediate transfer involves M * R small, random disk reads.]
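To make the overhead concrete, a rough back-of-the-envelope calculation (the particular values of M and R below are assumed for illustration; they are not measurements from this deck):

\[
\#\text{seeks} \propto M \times R, \qquad \text{e.g. } M = 1000,\; R = 1000 \;\Rightarrow\; \sim 10^{6} \text{ small, random I/Os}
\]

Because each of those I/Os carries only a small piece of the intermediate data, disk time is dominated by seeking rather than by sequential transfer.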
Why Is Scale Important?
• Yahoo cluster workload characteristics:
– Vast majority of jobs (about 95%) are small
– A minority of jobs (about 5%) are big
• Involve 1000’s of tasks that are run on many
machines in the cluster
• Run for several hours processing TB’s of data
• Size of intermediate data (i.e., map output) is at
least as big as the input
Can We Minimize Seeks?
• Problem space: Size of intermediate data
exceeds amount of RAM in cluster
• In the best case, a reducer reads its (contiguous) input from disk with one seek per reducer
• Lower bound on seeks is therefore proportional to R
Solving A Seek Problem…
• Minimizing disk seeks via “group commit” is a well-known technique
• Why isn’t this idea implemented?
• Difficult to implement in the past
– Datacenter bandwidth is a contended resource
– Any solution that mentions “network” was the beginning of a (futile) negotiation
What Is New…
• Network b/w in a datacenter is going up…
– Lower “over-subscription”
• 1/5/10-Gbps between any pair
• Can we leverage this trend to do distributed
aggregation and improve disk performance?
• Building systems that exploit this trend is being explored:
– Flat Datacenter Storage (OSDI’12): Blob store
– ThemisMR (SOCC’12): M/R at scale
Key Ideas
[Figure: map tasks (M0, M1, M2) append intermediate data over the network into per-partition I-files, which buffer it in RAM before committing to disk; reduce tasks (R0, R1, R2) read from the I-files.]
1. I-files, a network-wide data aggregation mechanism
2. Observe intermediate data in I-files during the map phase to plan the reduce phase
Our Work
• Build I-files by extending a DFS
• Build Sailfish (by modifying Hadoop), in which I-files are used to transport intermediate data
– Leverage I-files to gather statistics on intermediate
data to plan reduce phase:
• (1) # of reducers depends on the data, (2) handle skew
– Eliminate tuning parameters
• No more map-side tuning, choosing # of reducers
• Results show 20% to 5x speedup on a
representative mix of (large) real jobs/datasets at
Yahoo
Talk Outline
• Motivation
• I-files: A data aggregation mechanism
• Sailfish: Map-Reduce using I-files
• Experimental Evaluation
• Summary and On-going work
Using I-files for Intermediate Data
[Figure: map tasks (M0, M1, M2) write to per-partition I-files; an aggregator per I-file buffers the incoming records in RAM and commits them to disk for reduce tasks (R0, R1, R2).]
• I-files are a container
for data aggregation in
general
• Per I-file aggregator:
– Buffers data from writers
in RAM
– “Group commit” data to
disk
• # of disk seeks is
proportional to R
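A minimal sketch of the per-I-file “group commit” idea in the bullets above; the class name, buffer size, and method names are illustrative assumptions, not the actual KFS/Sailfish code:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical per-I-file aggregator: many map-task writers append small
// records, which are buffered in RAM and committed to disk in one large
// sequential write ("group commit"), keeping seeks proportional to R.
class IFileAggregator {
    private static final int FLUSH_THRESHOLD = 64 << 20;   // 64 MB buffer (assumed)
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final OutputStream disk;

    IFileAggregator(OutputStream disk) { this.disk = disk; }

    synchronized void append(byte[] record) throws IOException {
        buffer.write(record);                 // small appends land in RAM
        if (buffer.size() >= FLUSH_THRESHOLD) {
            flush();                          // one large sequential disk write
        }
    }

    synchronized void flush() throws IOException {
        buffer.writeTo(disk);                 // group commit of all buffered records
        disk.flush();
        buffer.reset();
    }
}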
Issues
• Fault-tolerance
• Scale
• Skew: Suppose there is
skew in data written to
I-files
• Hot-spots:
– Suppose a partition
becomes hot
– All map tasks generate
data for that partition
Big data => Scale out!
“Scale out” Aggregation
• Build using distributed aggregation (scale out)
– Rather than 1 aggregator per I-file, use multiple
• Bind subset of mappers to each aggregator
[Figure: mappers M0 … M10 append through one aggregator and M11 … M20 through another, both writing to the same I-file.]
What Does This Get?
• Fault-tolerance
– When an aggregator fails, need to re-run the subset of
maps that wrote to that aggregator
• Re-run in parallel…
• Mitigates skew and hot-spots; scales better
• Seeks: go up (!)
– Reducer input is now stored at multiple aggregators
• Read from multiple places
– Will come back to this issue…
I-files Design -> Implementation
• Extend DFS to support data aggregation
– Use KFS in our work (KFS ≅ HDFS)
• Single metaserver (≅ HDFS NameNode)
• Multiple Chunkservers (≅ HDFS DataNodes)
• Files are striped across nodes in chunks
– Chunks can be variable in size with a fixed maximal value
– Currently, max size of a chunk: 128MB
• Adapt KFS to support I-files
– Leverage multi-writer capabilities of KFS
KFS I-files Characteristics
• I-files provide a record-oriented interface
• Append-only
– Clients append records to an I-file: record_append(fd, <key, value>)
• Append on a file translates to append on a chunk
• Records do not span chunk boundaries
– Chunkserver is the aggregator
• Supports data retrieval by key:
– scan(fd, <key range>)
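A sketch of what client-side use of this record-oriented interface might look like. The deck only names record_append() and scan(), so the Java types, method signatures, and path below are illustrative assumptions (KFS itself has its own client API):

import java.util.Iterator;
import java.util.Map;

// Illustrative, hypothetical Java view of the I-file operations named above.
interface IFileClient {
    int open(String path);                                // returns an fd-like handle
    void recordAppend(int fd, byte[] key, byte[] value);  // record_append(fd, <key, value>)
    Iterator<Map.Entry<byte[], byte[]>> scan(             // scan(fd, <key range>)
            int fd, byte[] lowKey, byte[] highKey);
}

class IFileUsage {
    static void example(IFileClient client) {
        int fd = client.open("/ifiles/partition-0");      // hypothetical I-file path
        client.recordAppend(fd, "k1".getBytes(), "v1".getBytes());

        // A reducer later retrieves only the key range assigned to it.
        Iterator<Map.Entry<byte[], byte[]>> records =
                client.scan(fd, "k0".getBytes(), "k9".getBytes());
        while (records.hasNext()) {
            Map.Entry<byte[], byte[]> kv = records.next();
            // reduce(kv.getKey(), kv.getValue()) would be invoked here
        }
    }
}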
Appending To KFS I-files
[Figure: a map task asks the KFS metaserver to allocate a chunk, binds to it, and then issues record_append() calls to that chunk; several map tasks append to Chunk1 and Chunk2 concurrently.]
• Multiple appenders per chunk
• Multiple chunks appended to concurrently
KFS I-files
• I-file constructed via sequential I/O in a
distributed manner
– Network-wide batching
– Multiple appenders per chunk
– Multiple chunks appended to concurrently
• On a per-chunk basis, the chunkserver responsible for that chunk is the aggregator
– Chunkserver aggregates records and commits to disk
– Append is atomic: Chunkserver serializes concurrent
appends
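A minimal sketch of the chunk-side append behavior described above (serialized appends, records never spanning a chunk boundary); the names and structure are assumptions rather than the real chunkserver code:

// Hypothetical chunk-side logic: the chunkserver serializes concurrent
// appends by assigning offsets under a lock, and rejects records that would
// cross the 128MB chunk boundary so the client allocates a new chunk.
class ChunkAppender {
    static final int MAX_CHUNK_SIZE = 128 << 20;   // 128 MB max chunk size (from the deck)
    private int nextOffset = 0;                    // current end of this chunk

    // Returns the offset the record is committed at, or -1 if it does not fit.
    synchronized int append(byte[] record) {
        if (nextOffset + record.length > MAX_CHUNK_SIZE) {
            return -1;                             // record must go to a fresh chunk
        }
        int offset = nextOffset;
        // stage(record, offset) would buffer the bytes for the next group commit
        nextOffset += record.length;
        return offset;
    }
}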
Distributed Aggregation With KFS I-files
Design
• Bind subset of mappers to an aggregator
• Use multiple aggregators per I-file
• Minimize # of aggregators per I-file
Implementation
• Multiple writers/chunk
• Multiple chunks appended to concurrently
• # of chunks per I-file scales based on data
• Chunk allocation is key:
– Need to pack data into as few chunks as possible
Talk Outline
• Motivation
• I-files: A data aggregation mechanism
• Sailfish: Map-Reduce using I-files
• Experimental Evaluation
• Summary and On-going work
Sailfish: MapReduce Using I-files
• Modify Hadoop-0.20.2 to use I-files
– Mappers append their output to per-partition I-files using record_append()
• Map output is appended concurrently with task execution
– During map phase, gather statistics on
intermediate data and plan reduce phase
• # of reducers, task assignment done at run-time
– Reducer scans its input from a per-partition I-file
• Merge records from chunks and call reduce() (sketch below)
• For efficient scan(), sort and index an I-file chunk
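As a rough illustration of the merge step above: each I-file chunk is sorted and indexed, so the reducer only has to merge already-sorted streams. The iterator and record types below are assumptions; Sailfish's actual merger differs:

import java.util.Iterator;
import java.util.List;

// Hypothetical reducer-side merge over per-chunk record streams that are
// already sorted by key. A real implementation would use a heap; a linear
// scan over the current head of each stream is enough to show the idea.
class ChunkMerger {
    // Each record is modeled as {key, value}.
    static void mergeAndReduce(List<Iterator<String[]>> sortedChunks) {
        String[][] heads = new String[sortedChunks.size()][];
        for (int i = 0; i < sortedChunks.size(); i++) {
            heads[i] = sortedChunks.get(i).hasNext() ? sortedChunks.get(i).next() : null;
        }
        while (true) {
            int min = -1;
            for (int i = 0; i < heads.length; i++) {      // pick smallest current key
                if (heads[i] != null && (min < 0 || heads[i][0].compareTo(heads[min][0]) < 0)) {
                    min = i;
                }
            }
            if (min < 0) break;                           // every chunk stream exhausted
            // reduce(heads[min][0], heads[min][1]) would run here, grouping equal keys
            heads[min] = sortedChunks.get(min).hasNext() ? sortedChunks.get(min).next() : null;
        }
    }
}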
Sailfish Dataflow
[Figure: Sailfish dataflow. Map tasks (M0, M1) read input from the DFS and append intermediate data to I-files in the DFS; reduce tasks (R0, R1, R2, R3) read from the I-files and write output back to the DFS.]
Sailfish Dataflow
[Figure: map side. Map tasks (M0, M1, M2, M3) read input from the DFS and append intermediate data to chunkservers acting as aggregators; a sorter processes each chunk.]
Sailfish Dataflow
[Figure: reduce side. Reduce tasks (R1, R2) scan their input from the chunkservers and write output to the DFS.]
Leveraging I-files
• Gather statistics on intermediate data
whenever a chunk is sorted
– Statistics are gathered during map phase as part
of execution
• During sorting, augment each chunk with an index
– Index supports efficient scans
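A simplified sketch of what sorting a chunk and attaching an index could look like; the in-memory record layout, index granularity, and class names are assumptions, not the on-disk KFS chunk format:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sorter pass: sort the chunk's records by key, then record a
// sparse index (key of every Nth record -> its position) so a later
// scan(<key range>) can jump close to the start of the range.
class ChunkIndexer {
    record Rec(String key, String value) {}

    static TreeMap<String, Integer> sortAndIndex(List<Rec> records, int every) {
        records.sort(Comparator.comparing(Rec::key));       // sort chunk by key
        TreeMap<String, Integer> index = new TreeMap<>();
        for (int pos = 0; pos < records.size(); pos += every) {
            index.put(records.get(pos).key(), pos);         // sparse index entry
        }
        return index;
    }

    // Return records whose keys fall in [low, high), starting from the index hint.
    static List<Rec> scan(List<Rec> sorted, TreeMap<String, Integer> index,
                          String low, String high) {
        int start = index.floorEntry(low) == null ? 0 : index.floorEntry(low).getValue();
        List<Rec> out = new ArrayList<>();
        for (int i = start; i < sorted.size(); i++) {
            String k = sorted.get(i).key();
            if (k.compareTo(high) >= 0) break;              // past the requested range
            if (k.compareTo(low) >= 0) out.add(sorted.get(i));
        }
        return out;
    }
}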
Reduce Phase Implementation
• Plan reduce phase based on data:
– # of reduce tasks per I-file = Size of I-file / Work
per task
– # of tasks scale based on data; handles skew
• On a per I-file basis, partition key space by
constructing “split points”
• Each reduce task processes a range of keys
within an I-file
– “Hierarchical partitioning” of data in an I-file
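A simplified sketch of this planning step; the statistics format (approximate bytes per key) and the names below are assumptions about how the gathered statistics could be consumed, not Sailfish's actual planner:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical reduce-phase planner for one I-file: size the number of reduce
// tasks from the data volume, then walk the (sorted) key statistics to pick
// split points so each task gets roughly equal bytes, even under skew.
class ReducePlanner {
    static List<String> planSplits(TreeMap<String, Long> bytesPerKey, long workPerTaskBytes) {
        long total = 0;
        for (long b : bytesPerKey.values()) total += b;
        int numTasks = (int) Math.max(1, (total + workPerTaskBytes - 1) / workPerTaskBytes);
        long target = total / numTasks;                  // bytes per reduce task

        List<String> splitPoints = new ArrayList<>();    // keys where the next task begins
        long acc = 0;
        for (Map.Entry<String, Long> e : bytesPerKey.entrySet()) {
            if (acc >= target && splitPoints.size() < numTasks - 1) {
                splitPoints.add(e.getKey());             // task boundary at this key
                acc = 0;
            }
            acc += e.getValue();
        }
        return splitPoints;                              // at most numTasks - 1 boundaries
    }
}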
Sailfish: Reduce Phase
Objectives
• Avoid specifying # of
reducers in a job at job
submission time
• Handle skew
• Auto-scale
Implementation
• Gather statistics on I-files to
plan reduce phase
– # of reduce tasks is
determined at run-time in a
data dependent manner
• Hierarchical scheme
– Partition map output into a
large number of I-files
– Assign key-ranges within an I-file to a reduce task
• Reducers/I-file is data
dependent
How many seeks?
• Goal: # of seeks proportional to R
• With Sailfish,
– A reducer reads input from all chunks of a single I-file
– Suppose that each I-file has c chunks
– # of seeks during read is proportional to c * R
• # of seeks during appends is also proportional to c * R
– But sorters also cause seeks
• # of seeks during sorting is proportional to 2 * c * R
• Packing data into as few chunks as possible is
critical for I-file effectiveness
Talk Outline
• Motivation
• I-files: A data aggregation mechanism
• Sailfish: Map-Reduce using I-files
• Experimental Evaluation
• Summary and On-going work
Experimental Evaluation
• Cluster comprises ~150 machines (5 racks)
– 2008-vintage machines
• 8 cores, 16GB RAM, 4 x 750GB drives per node
– 1Gbps between any pair of nodes
• Used lzo compression for handling intermediate
data (for both Hadoop and Sailfish)
• Evaluations involved:
– Synthetic benchmark that generates its own data
– Actual jobs/data run at Yahoo
How did we do? (Synthetic Benchmark)
[Chart: job runtime (min.) of Sailfish vs. Hadoop as intermediate data size grows from 1TB to 64TB.]
How Many Seeks…
Hadoop
• With Stock Hadoop, it is
proportional to M * R
• # of map tasks generating
64TB data
• M = (64TB / 1GB per map
task) = 65536
Sailfish
• With Sailfish, it is proportional
to c * R
• 64TB of intermediate data split
over 512 I-files
– Chunks per I-file: (64TB / 512) / 128MB = 1024
– In practice: c varies from 1032
to 1048
• Results show that chunks are
packed
– Chunk allocation policy works
well in practice 
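Putting the two columns together as back-of-the-envelope arithmetic, using only the numbers quoted above:

\[
\frac{\text{Hadoop seeks}}{\text{Sailfish seeks}} \;\approx\; \frac{M \cdot R}{c \cdot R} \;=\; \frac{M}{c} \;\approx\; \frac{65536}{1024} \;=\; 64
\]

That is, for the same number of reducers R, reading the intermediate data in this 64TB run involves roughly 64x fewer seeks.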
Sailfish: More data read per seek…
[Chart: data read by a reduce task per disk I/O (MB, log scale) vs. intermediate data size (TB), Stock Hadoop vs. Sailfish.]
Sailfish: Faster reduce phase…
[Chart: disk read throughput (MB/s) over the course of the reduce phase (time in hours), Stock Hadoop vs. Sailfish.]
Sailfish In Practice
• Use actual jobs+datasets that are used in
production
Job Name            Characteristics        Input size (TB)   Int. data size (TB)
LogCount            Data reduction         1.1               0.04
LogProc             Skew in map output     1.1               1.1
LogRead             Skew in reduce input   1.1               1.1
Nday Model          Incr. computation      3.54              3.54
Behavior Model      Big data               3.6               9.47
Click Attribution   Big data               6.8               8.2
Segment Exploder    Data explosion         14.1              25.2
How Did We Do…
[Chart: job runtime (min.) of Sailfish vs. Hadoop for each job: LogCount, LogProc, LogRead, Nday-Model, BehaviorModel, ClickAttr, SegmentExploder.]
Handling Skew In Reducer Input
[Chart: work per reduce task (MB) by task # for the LogRead job, Hadoop vs. Sailfish.]
Fault-tolerance
• Sailfish handles (temporary) loss of
intermediate data via recomputes
– Bookkeeping tracks which map tasks wrote to the lost block and re-runs those (sketch below)
• “Scale out” mitigates the impact of data loss
– 15% increase in run-time for a run with failure
• Described in detail in the paper…
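A minimal sketch of the bookkeeping idea referenced above; the class and the identifier types are hypothetical:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical recompute bookkeeping: remember which map tasks appended to
// each chunk of intermediate data, so losing a chunk re-runs only that
// subset of maps (and they can be re-run in parallel).
class RecomputeTracker {
    private final Map<String, Set<String>> writersByChunk = new HashMap<>();

    void recordAppend(String chunkId, String mapTaskId) {
        writersByChunk.computeIfAbsent(chunkId, k -> new HashSet<>()).add(mapTaskId);
    }

    // On loss of a chunk, re-run exactly these map tasks to regenerate it.
    Set<String> tasksToRerun(String lostChunkId) {
        return writersByChunk.getOrDefault(lostChunkId, Set.of());
    }
}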
Related Work
• ThemisMR (SOCC’12) addresses same problem as Sailfish
– Does not (yet) support fault-tolerance: its design targets small clusters where failures are rare
– Design requires reducer input to fit in RAM
• [Starfish] Parameter tuning (for Hadoop)
– Construct a job profile and use that to tune Hadoop parameters
– Gains are limited by Hadoop’s intermediate data handling
mechanisms
• Lots of work in the DB literature on handling skew
– Run job on a sample of input and collect statistics to construct
partition boundaries
– Use statistics to drive actual run
Summary
• Explore idea of network-wide aggregation to
improve disk subsystem performance
• Develop I-files as a data aggregation construct
– Implement I-files in KFS (a distributed filesystem)
• Use I-files to build Sailfish, an M/R infrastructure
– Sailfish improves job completion times: 20% to 5x
On-going Work
• Extending Sailfish to support
elasticity/preemption (Amoeba, SOCC’12)
• Working on integrating many of the core ideas
in Sailfish into Hadoop 2.x (aka YARN)
– Work started at Yahoo! Labs
– Being continued in CISL@Microsoft
• http://issues.apache.org/jira/browse/MAPREDUCE-4584
• http://issues.apache.org/jira/browse/YARN-45
Software Available
• Sailfish released as open source project
– http://code.google.com/p/sailfish