
I-Files: Handling Intermediate Data In
Parallel Dataflow Graphs
Sriram Rao
November 2, 2011
Joint Work With…
 Raghu Ramakrishnan, Adam Silberstein: Yahoo Labs
 Mike Ovsiannikov, Damian Reeves: Quantcast
Motivation
 Massive growth in online advertising (read: display ads)
 Companies are reacting to this opportunity via behavioral ad-targeting
› Collect click-stream logs, mine the data, build models, show ads
 “Petabyte-scale data mining” using computational frameworks (such as Hadoop, Dryad) is commonplace
 Analysis of Hadoop job history logs shows:
› Over 95% of jobs are small (run for a few minutes, process small data)
› About 5% of jobs are large (run for hours, process big data)
Where have my cycles gone? 5% of jobs take 90% of cycles!
Who is using my network? 5% of jobs account for 99% of network traffic!
So…
 Analysis shows 5% of the jobs are “big”:
› 5% of jobs use 90% of cluster compute cycles
› 5% of jobs shuffle 99% of data (i.e., 99% of network bandwidth)
 To improve cluster performance, improve M/R performance for large jobs
 Faster, faster, faster: a virtuous cycle
› Cluster throughput goes up
› Users will run bigger jobs
 Our work: Focus on handling intermediate data at scale in parallel dataflow graphs
Handling Intermediate Data in M/R
 In an M/R computation, map output is intermediate data
 For transferring intermediate data from map to reduce:
› Map tasks generate data, write to disk
› When a reduce task pulls map output,
• Data has to be read from disk
• Transferred over the network
– Cannot assume that mappers/reducers can be scheduled concurrently
 Transporting intermediate data:
› Intermediate data size < RAM size: RAM masks disk I/O
› Intermediate data size > RAM size: Cache hit rate masks disk I/O
› Intermediate data size >> RAM size: Disk overheads affect performance
Handling Intermediate data at scale
 Intermediate Data Transfer: Distributed Merge Sort
› # of disk seeks for transferring intermediate data ∝ M * R
› Avg. amount of data a reducer pulls from a mapper ∝ 1 / R
[Diagram: Map tasks write their output to the distributed file system; Reduce tasks pull from every map output, incurring on the order of M * R disk seeks]
Disk Overheads (More detail)
 “Fix” the amount of data generated by a map task
• Size RAM such that the map output fits in memory and can be sorted in 1 pass
– For example, use 1GB
 “Fix” the amount of data consumed by a reduce task
• Size RAM for a 1-pass merge
– For example, use 1GB
 Now…
• For a job with 1TB of data
 1024 mappers generate 1GB each; 1024 reducers consume 1GB each
– On average, data generated by a map for a given reducer = 1GB / 1024 = 1MB
• For a job with 16TB of data
 16K mappers generate 1GB each; 16K reducers consume 1GB each
– On average, data generated by a map for a given reducer = 1GB / 16K = 64KB
 With scale, # of seeks increases; data read/seek decreases
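To make that last point concrete, here is a minimal back-of-the-envelope sketch in Python that reproduces the arithmetic on this slide; the 1GB per-task sizing and the 1TB / 16TB job sizes come from the slide, the rest is plain arithmetic.

# Back-of-the-envelope: seeks and data-per-seek for the shuffle,
# assuming each mapper produces 1GB and each reducer consumes ~1GB.
GB = 1 << 30

def shuffle_stats(job_bytes, per_task_bytes=1 * GB):
    m = job_bytes // per_task_bytes          # number of map tasks (1GB each)
    r = job_bytes // per_task_bytes          # number of reduce tasks (~1GB each)
    seeks = m * r                            # one pull (and seek) per mapper/reducer pair
    data_per_seek = per_task_bytes // r      # map output destined for one reducer
    return m, r, seeks, data_per_seek

for tb in (1, 16):
    m, r, seeks, per_seek = shuffle_stats(tb * 1024 * GB)
    print(f"{tb:>2} TB job: {m} mappers x {r} reducers -> "
          f"{seeks:,} seeks, {per_seek // 1024} KB read per seek")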
Disk Overheads
 As the volume of intermediate data scales,
› Amount of data read per seek decreases
› # of disk seeks increases non-linearly
 Net result: Job performance will be affected by the disk overheads in handling intermediate data
› Intermediate data increases by 2x
› Job run time increases by 2.5x
What is new?
[Diagram: Map tasks append to I-Files in the distributed file system; a network-wide merge builds one intermediate file per reducer, instead of one per mapper, so the Reduce side incurs fewer seeks]
Our work
 New approach for efficient handling of intermediate data at large scale
• Minimize the number of seeks
• Maximize the amount of data read/written per seek
• Primarily geared towards LARGE M/R jobs:
– 10’s of TB of intermediate data
– 1000’s of mapper/reducer tasks
 I-files: Filesystem support for intermediate data
› Atomic record append primitive that allows write parallelism at scale
› Network-wide batching of intermediate data
 Built Sailfish (by modifying Hadoop-0.20.2), in which intermediate data is transported using I-files
How did we do? (Benchmark job)
How did we do? (Actual Job)
Talk Outline
 Properties of Intermediate data
 I-files implementation
 Sailfish: M/R implementation that uses I-files for
intermediate data
 Experimental Evaluation
 Summary
Organizing Intermediate Data
 Hadoop organizes intermediate data in a format convenient for the mapper
 What if we went the opposite way: organize it in a format convenient for the reducer? (see the sketch after this list)
› Mappers write their output to a per-partition I-file
› Data destined for a reducer is in a single file
› Build the intermediate data file in a manner that is suitable for the reader rather than the writer
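As a rough illustration of this reorganization, the sketch below (Python; partition_for, IFileAppender, and the /ifiles path layout are hypothetical names, not the Sailfish API) has every mapper append each record to the I-file of that record's partition, so all data destined for one reducer accumulates in a single file.

# Sketch: mappers append each record to the I-file of its partition,
# so a reducer's entire input ends up in one file.
NUM_PARTITIONS = 512                          # e.g., 512 I-files per job (see the evaluation slides)

def partition_for(key):
    return hash(key) % NUM_PARTITIONS

class IFileAppender:
    """Stand-in for an appender to one per-partition I-file; the real system
    appends to KFS I-files with atomic record append (server picks the offset)."""
    def __init__(self, partition):
        self.path = f"/ifiles/part-{partition:05d}"   # hypothetical path layout
    def append(self, key, value):
        pass                                   # hand (key, value) to the filesystem

appenders = [IFileAppender(p) for p in range(NUM_PARTITIONS)]

def emit(key, value):                          # called by a map task for each output record
    appenders[partition_for(key)].append(key, value)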
Intermediate data
 Reducer input is generated by multiple mappers
 The file is a container into which mapper output needs to be stored
› Write order is k1, k2, k3, k4
› Processing order is k3, k1, k4, k2
 Because the reducer imposes the processing order, the writer does not care where the output is stored in the file
 Once a mapper emits a record, the output is committed
› There is no “withdraw”
[Diagram: multiple mappers (M) append keys k1, k2, k3, k4 to a single file in write order; the reducer (R) consumes them in processing order k3, k1, k4, k2]
Properties of Intermediate data file
 Multiple mappers generate data that will be consumed by a single reducer
› Need low-latency multi-writer support
 Writers are doing append-only writes
› Contents of the I-file are never overwritten
 Arbitrary interleaving of data is OK:
› The writer does not care where the data goes in the file
› Any ordering we need can be done post facto
 No ordering guarantees for the writes from a single client
• Follows from arbitrary interleaving of writes
Atomic Record Append
 Multi-writer support => need an atomic primitive
› Intermediate data is append-only… so, need atomic append
 With the atomic record append primitive, clients provide just the data and the server chooses the offset, with arbitrary interleaving
› In contrast, in a traditional write clients provide data + offset
 Since the server is choosing the offset, the design is lock-free
 To scale atomic record append with the number of writers, allow
› Multiple writers to append to a single block of the file
› Multiple blocks of the file to be concurrently appended to
Atomic Record Append
[Diagram: Client1 and Client2 issue atomic record appends (ARA: <A, offset = -1> and ARA: <B, offset = -1>); the server chooses the offsets, e.g. B lands at offset 300, A at 350, C at 400, D at 500, and returns the chosen offset to each client]
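A minimal sketch of the primitive, in Python with illustrative names (ChunkMaster and record_append are not the KFS API): clients supply only the data (offset = -1 on the wire), and the server serializes appends to the chunk and returns the offset it chose.

import threading

CHUNK_SIZE = 128 * 1024 * 1024                # 128MB chunks, as in the evaluation

class ChunkMaster:
    """Toy chunk master: clients send only data; this end picks the offset."""
    def __init__(self):
        self.next_offset = 0
        self.mutex = threading.Lock()         # local serialization only; clients hold no locks

    def record_append(self, data: bytes) -> int:
        """Return the offset assigned to this record, or -1 if the chunk is full."""
        with self.mutex:
            if self.next_offset + len(data) > CHUNK_SIZE:
                return -1                     # caller retries on a newly allocated chunk
            offset = self.next_offset
            self.next_offset += len(data)
        # replicate `data` at `offset` into the replicas' volatile memory, then ack;
        # the replicas commit to disk asynchronously and the chunk is later made stable
        return offset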
Implementing I-files
 Have implemented I-files in the context of the Kosmos distributed filesystem (KFS)
› Why KFS?
• KFS has multi-writer support
• We have designed/implemented/deployed KFS to manage PBs of storage
 KFS is similar to GFS/HDFS
› Chunks are striped across nodes and replicated for fault-tolerance
– The chunk master serializes all writes to a chunk
› For atomic append, the chunk master assigns the offset
› With KFS I-files, multiple chunks of the I-file can be concurrently modified
Atomic Record Append
 Writers are routed to a chunk that is open for writing
› For scale, limit the # of concurrent writers to a chunk
 When a client gets an ack back from the chunk master, the data is replicated in volatile memory at all the replicas
› Chunkservers are free to commit data to disk asynchronously
 Eventually, the chunk is made stable
› Data is committed to disk at all the replicas
› Replicas are byte-wise identical
 Stable chunks are not appended to again
Talk Outline
 Properties of Intermediate data
 I-files implementation
 Sailfish: M/R implementation that uses I-files for
intermediate data
 Experimental Evaluation
 Summary
The Elephant Can Dance…
[Diagram: the Hadoop shuffle pipeline vs. the Sailfish shuffle pipeline, showing the map() and reduce() stages and the (de)serialization steps between them]
Sailfish Overview
 Modify Hadoop-0.20.2 to use I-files for MapReduce
› Mappers write their output to a per-partition I-file
• Replication factor of 1 for all the chunks of an I-file
• At-least-once semantics for append; filter duplicates on the reduce side
› Data destined for a reducer is in a single file
› Build the intermediate data file in a manner that is suitable for the reader rather than the writer
 Automatically parallelize execution of the reduce phase: set the number of reduce tasks and their work assignment dynamically
› Assign key ranges to reduce tasks rather than whole partitions
› Extend I-files to support key-based retrieval
Sailfish Map Phase Execution
Sailfish Reduce Phase Execution
Atomic “Record” Append For M/R
 M/R computations are about processing records
› Intermediate data consists of key/value pairs
 Extend atomic append to support “records” (see the sketch after this list)
› Mappers emit <key, record>
• Per-record framing that identifies the mapper task that generated a record
› The system stores a per-chunk index
• After a chunk is stable, the chunk is sorted and an index is built by the sorter
– Sorting is a completely local operation: read a block from disk, sort in RAM, and write it back to disk
› Reducers can retrieve data by <key>
• Use the per-record framing to discard data from dead mappers
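A minimal sketch of what per-record framing could look like (the field layout below is an assumption for illustration, not the actual Sailfish wire format): each appended record carries the identity of the map task and attempt that produced it, which is what lets the reduce side drop duplicates and data from dead mappers.

import struct

# Illustrative frame layout: map task id, attempt #, key length, value length
HEADER = struct.Struct("!IHII")

def frame_record(task_id: int, attempt: int, key: bytes, value: bytes) -> bytes:
    return HEADER.pack(task_id, attempt, len(key), len(value)) + key + value

def unframe_record(buf: bytes, offset: int = 0):
    task_id, attempt, klen, vlen = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    key = buf[start:start + klen]
    value = buf[start + klen:start + klen + vlen]
    return task_id, attempt, key, value, start + klen + vlen   # also return next offset

# On the read path, records whose (task_id, attempt) does not belong to a
# successfully completed map attempt are discarded, filtering out duplicates
# left behind by re-run or dead mappers.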
Sailfish Architecture
[Diagram: a job is submitted to the Hadoop JT; map and reduce tasks ask the workbuilder “What do I do?”; each Mapper Task writes via an IAppender to the KFS I-files; each Reducer Task (e.g. the key range [a, d) of I-file 5) reads and merges its input via an IMerger]
Handling Failures
 Whenever a chunk of an I-file is lost, need to re-generate the lost data
 With I-files, we have multiple mappers writing to a block
 For fault-tolerance (see the sketch after this list),
› The workbuilder tracks the set of chunks modified by each mapper task
› Whenever a chunk is lost, the workbuilder notifies the JT of the set of map tasks that have to be re-run
› Reducers reading from the I-file with the lost chunk wait until the data is re-generated
 For fault-containment, Sailfish uses per-rack I-files
› Mappers running in a rack write to chunks of the I-file stored in that rack
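A minimal sketch of the bookkeeping this implies (the class and method names are assumptions, not the workbuilder's actual interface): remember which map tasks appended to which chunks, and invert that mapping when a chunk is reported lost.

from collections import defaultdict

class RecoveryTracker:
    """Toy version of the workbuilder's recovery bookkeeping (illustrative only)."""
    def __init__(self):
        self.chunk_to_tasks = defaultdict(set)

    def record_append(self, chunk_id, map_task_id):
        # called whenever a map task appends to a chunk of an I-file
        self.chunk_to_tasks[chunk_id].add(map_task_id)

    def tasks_to_rerun(self, lost_chunk_id):
        """Map tasks whose output landed in the lost chunk; the JT re-runs these,
        while reducers on the affected I-file wait for the data to be re-generated."""
        return self.chunk_to_tasks.get(lost_chunk_id, set())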
Fault-tolerance With Sailfish
 An alternate option is to replicate map output
› Use atomic record append to write to two chunkservers
• Probability of data loss due to a (concurrent) double failure is low
› The performance hit for replicating data is low
• Data is replicated using RAM and written to disk asynchronously
› However, network traffic increases substantially
• Sailfish causes network traffic to double compared to Stock Hadoop
– Map output is written to the network and reduce input is read over the network
• With replication, data traverses the network three times
– An alternate strategy is to selectively replicate map output
 Replicate in response to data loss
 Replicate output that was generated the earliest
Sailfish Reduce Phase
 # of reducers/job and their task assignment is determined by the workbuilder in a data-dependent manner (see the sketch after this list)
› Dynamically set the # of reducers per job after the map phase execution is complete
 # of reducers/I-file = (size of I-file) / (work per reducer)
› Work per reducer is set based on RAM (in experiments, 1GB per reduce task)
› If the data assigned to a task exceeds the size of RAM, the merger does a network-wide merge by appropriately streaming the data
 The workbuilder uses the per-chunk index to determine split points
› Each reduce task is assigned a range of keys within an I-file
• Data for a reduce task is in multiple chunks and requires a merge
• Since chunks are sorted, data read by a reducer from a chunk is all sequential I/O
 Skew in reduce input is handled seamlessly
› An I-file with more data has more tasks assigned to it
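A minimal sketch of the planning rule stated above (Python; the per-chunk indexes are modeled here as a sorted key sample, which is an assumption for illustration, and plan_ifile is not the workbuilder's real interface):

from math import ceil

GB = 1 << 30
WORK_PER_REDUCER = 1 * GB                     # sized so a reduce task can do a 1-pass merge

def plan_ifile(ifile_size_bytes, sampled_keys):
    """Return (num_tasks, key ranges) for one I-file.
    sampled_keys: sorted keys drawn from the per-chunk indexes (illustrative stand-in)."""
    num_tasks = max(1, ceil(ifile_size_bytes / WORK_PER_REDUCER))
    keys = sorted(sampled_keys)
    ranges = []
    for i in range(num_tasks):
        lo = keys[i * len(keys) // num_tasks] if i > 0 and keys else None          # None = open end
        hi = keys[(i + 1) * len(keys) // num_tasks] if i < num_tasks - 1 and keys else None
        ranges.append((lo, hi))
    return num_tasks, ranges

# A 5GB I-file gets 5 reduce tasks; an I-file with more data gets more tasks,
# which is how skew across partitions is absorbed.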
Experimental Evaluation
 The cluster comprises ~150 machines
› 6 map tasks, 6 reduce tasks per node
• With Hadoop M/R tasks, a JVM is given 1.5GB RAM for a one-pass sort/merge
› 8 cores, 16GB RAM, 4 x 750GB drives, 1Gbps between any pair of nodes
› Jobs use all the nodes in the cluster
 Evaluate with a benchmark as well as a real M/R job
› A simple benchmark that generates its own data (similar to terasort)
• Measures only the overhead of transporting intermediate data
• The job generates records with a random 10-byte key and a 90-byte value
› Experiments vary the size of intermediate data (1TB – 64TB)
• Mappers generate 1GB of data and reducers consume ~1GB of data
I-files in practice
 150 map tasks/rack
 128 map tasks concurrently appending to a block of an I-file
 2 blocks of an I-file are concurrently appended to in a rack
 512 I-files per job
› Beyond 512 I-files, we hit system limitations in the cluster (too many open files, too many connections)
 KFS chunkservers use direct I/O with the disk subsystem, bypassing the buffer cache
How did we do? (Benchmark job)
How many seeks?
 With Stock Hadoop, the number of seeks is ∝ M * R
 With Sailfish, it is the product of:
› # of chunks per I-file (c)
› # of reduce tasks per I-file (R / I)
› # of I-files (I)
 We get: c * I * (R / I) = c * R (see the sketch after this list)
 # of chunks per I-file: 64TB of intermediate data split over 512 I-files, where the chunksize is 128MB
› c = 64TB / (512 * 128MB) = 1024
 # of map tasks at 64TB: 65536 (64TB / 1GB per mapper), so c << M
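Continuing the back-of-the-envelope arithmetic from earlier, the sketch below plugs in the 64TB numbers from this slide and compares c * R against M * R:

# Seek counts at 64TB: Stock Hadoop (~M * R) vs. Sailfish (~c * R).
GB, TB, MB = 1 << 30, 1 << 40, 1 << 20

intermediate = 64 * TB
num_ifiles   = 512
chunk_size   = 128 * MB
per_task     = 1 * GB                          # data per mapper and per reducer

M = intermediate // per_task                   # 65,536 map tasks
R = intermediate // per_task                   # ~64K reduce tasks at ~1GB each
c = intermediate // (num_ifiles * chunk_size)  # 1,024 chunks per I-file

print(f"Stock Hadoop: ~{M * R:,} seeks (M * R)")
print(f"Sailfish    : ~{c * R:,} seeks (c * R), roughly a {M // c}x reduction")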
Why does Sailfish work?
 Where are the gains coming from?
› The write path is low latency and is able to keep the disk arms and NICs busy
› The read path:
• Lowered the number of disk seeks
• Reads are large/sequential
 Compared to Hadoop, the read path in Sailfish is very efficient
› An efficient disk read path leads to better network utilization
Data read per seek
Disk Throughput (during Reduce phase)
Using Sailfish In Practice
 Use a job and data from one of the behavioral ad-targeting pipelines at Yahoo
› BT-Join: Build a sliding N-day model of user behavior
• Take 1 day of clickstream logs, join it with the previous N days, and produce a new N-day model
 Input datasets are compressed using bz2:
› Dataset A: 1000 files, 50MB apiece (10:1 compression)
› Dataset B: 1000 files, 1.2GB apiece (10:1 compression)
 Extended Sailfish to support compression for intermediate data (see the sketch after this list)
• Mappers generate up to 256KB of records, compress, and “append record”
• Sorters read compressed data, decompress, sort, and recompress
• The merger reads compressed data, decompresses, merges, and passes records to the reducer
• For performance, use LZO from the Intel IPP package
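A minimal sketch of the map-side step just described (Python; zlib stands in here for the LZO/Intel IPP codec actually used, and record_append is an illustrative name): batch roughly 256KB of framed records, compress the batch, and issue one append per compressed batch.

import zlib

BATCH_BYTES = 256 * 1024                       # ~256KB of records per compressed batch

class CompressingAppender:
    """Wraps an appender that exposes record_append(bytes) (illustrative name)."""
    def __init__(self, appender):
        self.appender = appender
        self.buf = bytearray()

    def emit(self, framed_record: bytes):
        self.buf += framed_record
        if len(self.buf) >= BATCH_BYTES:
            self.flush()

    def flush(self):
        if self.buf:
            # one atomic record append per compressed batch of records
            self.appender.record_append(zlib.compress(bytes(self.buf)))
            self.buf.clear()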
How did we do? (BT-Join)
BT-Join Analysis
 Speedup in the Reduce phase is due to better batching
 Speedup in the Map phase:
› Stock Hadoop: if the map output doesn’t fit in RAM, mappers do an external sort
› Sailfish: Sorting is outside the map task and hence there is no limit on the amount of map output generated by a map task
 Net result: The job with Sailfish is about 2x faster when compared to Stock Hadoop
Related Work
 Atomic append was introduced in the GFS paper (SOSP’03)
› GFS, however, seems to have moved away from atomic append, as they say it was not usable (at-least-once semantics and replicas can diverge)
 Balanced systems: TritonSort
› A stage-based sort engine in which the hardware is balanced
• 8 cores, 24GB RAM, 10Gbps NIC, 16 drives/box
• The software is then constructed in a way that balances hardware use
› Follow-on work on building an M/R engine on top of TritonSort
• Not clear how general their M/R engine is (seems specific to sort)
› Sailfish tries to achieve balance via software and is a general M/R engine
Summary
 Designed I-files for intermediate data and built Sailfish for doing large-scale M/R
› Sailfish will be released as open source
 Build Sailfish on top of YARN
› Utilize the per-chunk index:
• Improve reduce task planning based on key distributions
• “Checkpoint” reduce tasks on key-based boundaries and allow better resource sharing
• Support aggregation trees
 Having the intermediate data outside an M/R job allows new debugging possibilities
› Debug just the reduce phase