
I-Files: Handling Intermediate Data In
Parallel Dataflow Graphs
Sriram Rao
November 2, 2011
Joint Work With…
 Raghu Ramakrishnan, Adam Silberstein: Yahoo Labs
 Mike Ovsiannikov, Damian Reeves: Quantcast
 Massive growth in online advertising (read…display ads)
 Companies are reacting to this opportunity via behavioral ad-targeting
Collect click-stream logs, mine the data, build models, show ads
 “Petabyte scale data mining” using computational
frameworks (such as Hadoop, Dryad) is commonplace
 Analysis of Hadoop job history logs shows:
Over 95% of jobs are small (run for a few mins, process small data)
About 5% of jobs are large (run for hours, process big data)
Where have my cycles gone?
5% of jobs take 90% of cycles!
Who is using my network?
5% of jobs account for 99% of
network traffic!
 Analysis shows 5% of the jobs are “big”:
5% of jobs use 90% cluster compute cycles
5% of jobs shuffle 99% of data (i.e., 99% network bandwidth)
 To improve cluster performance, improve M/R
performance for large jobs
 Faster, faster, faster: virtuous cycle
Cluster throughput goes up
Users will run bigger jobs
 Our work: Focus on handling intermediate data at scale in
parallel dataflow graphs
Handling Intermediate Data in M/R
 In a M/R computation, map output is intermediate data
 For transferring intermediate data from map to reduce:
Map tasks generate data, write to disk
When a reduce task pulls map output,
• Data has to be read from disk
• Transferred over the network
– Cannot assume that mappers/reducers can be scheduled concurrently
 Transporting intermediate data:
Intermediate data size < RAM size: RAM masks disk I/O
Intermediate data size > RAM size: Cache hit rate masks disk I/O
Intermediate data size >> RAM size: Disk overheads affect perf
Handling Intermediate data at scale
 Intermediate Data Transfer: Distributed Merge Sort
# of disk seeks for transferring intermediate data ∝ M * R
Avg. amount of data reducer pulls from a mapper ∝ 1 / R
Distributed File System
(M * R)
Disk Overheads (More detail)
 “Fix” the amount of data generated by a map task
• Size RAM such that the map output fits in-memory and can be sorted in 1-pass
– For example, use 1GB
 “Fix” the amount of data consumed by a reduce task
• Size RAM for a 1-pass merge
– For example, use 1GB
 Now…
• For a job with 1TB of data
1024 mappers generate 1G each; 1024 reduces consume 1G each
– On average, data generated by a map for a given reducer = 1G / 1024 = 1M
• For a job with 16TB of data
16K mappers generate 1G each; 16K reduces consume 1G each
– On average, data generated by a map for a given reducer = 1G / 16K = 64K
 With scale, # of seeks increases; data read/seek decreases
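The arithmetic on this slide can be reproduced with a short sketch. It assumes the simple model stated above (every map task writes 1GB, every reduce task reads 1GB, one seek per mapper/reducer pair); the function name `shuffle_shape` is illustrative only.

```python
GB = 1 << 30
TB = 1 << 40

def shuffle_shape(intermediate_bytes, per_task_bytes=GB):
    """Return (mappers, reducers, seeks, bytes_per_seek) for the slide's model."""
    mappers = intermediate_bytes // per_task_bytes
    reducers = intermediate_bytes // per_task_bytes
    seeks = mappers * reducers                    # one seek per (mapper, reducer) pair
    bytes_per_seek = per_task_bytes // reducers   # a mapper's output split evenly over reducers
    return mappers, reducers, seeks, bytes_per_seek

for size_tb in (1, 16, 64):
    m, r, seeks, per_seek = shuffle_shape(size_tb * TB)
    print(f"{size_tb:>2} TB: M={m}, R={r}, seeks={seeks}, data/seek={per_seek // 1024} KB")
```

Growing the job 16x multiplies the seek count by 256 while shrinking each read from 1MB to 64KB.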
Disk Overheads
 As the volume of intermediate data increases:
Amount of data read per seek decreases
# of disk seeks increases non-linearly
 Net result: Job performance will be
affected by the disk overheads in
handling intermediate data
Intermediate data increases by 2x
Job run time increases by 2.5x
What is new?
Distributed File System
Fewer Seeks!
One intermediate file per reducer, instead of
one per mapper
Our work
 New approach for efficient handling of intermediate data
at large scale
• Minimize the number of seeks
• Maximize the amount of data read/written per seek
• Primarily geared towards LARGE M/R jobs:
– 10’s of TB of intermediate data
– 1000’s of mapper/reducer tasks
 I-files: Filesystem support for intermediate data
Atomic record append primitive that allows write parallelism at scale
Network-wide batching of intermediate data
 Build Sailfish (by modifying Hadoop-0.20.2) where
intermediate data is transported using I-files
How did we do? (Benchmark job)
How did we do? (Actual Job)
Talk Outline
 Properties of Intermediate data
 I-files implementation
 Sailfish: M/R implementation that uses I-files for
intermediate data
 Experimental Evaluation
 Summary
Organizing Intermediate Data
 Hadoop organizes intermediate data in a format
convenient for the mapper
 What if we went the opposite way: organize it in a format
convenient for the reducer?
Mappers write their output to a per-partition I-file
Data destined for a reducer is in a single file
Build the intermediate data file in a manner that is suitable for the
reader rather than the writer
Intermediate data
 Reducer input is generated by multiple mappers
 File is a container into which mapper
output needs to be stored
Write order is k1, k2, k3, k4
Processing order is k3, k1, k4, k2
 Because reducer imposes processing
order, writer does not care where the
output is stored in the file
 Once a mapper emits a record, the
output is committed
There is no “withdraw”
Properties of Intermediate data file
 Multiple mappers generate data that will be consumed by
a single reducer
Need low latency multi-writer support
 Writers are doing append-only writes
Contents of the I-file are never overwritten
 Arbitrary interleaving of data is ok:
Writer does not care where the data goes in the file
Any ordering we need can be done post-facto
 No ordering guarantees for the writes from a single client
• Follows from arbitrary interleaving of writes
Atomic Record Append
 Multi-writer support => need an atomic primitive
Intermediate data is append only…so, need atomic append
 With atomic record append primitive, clients provide just
the data but the server chooses the offset, with arbitrary interleaving
In contrast, in a traditional write clients provide data+offset
 Since server is choosing the offset, design is lock-free
 To scale atomic record append with writers, allow
Multiple writers append to a single block of the file
Multiple blocks of the file concurrently appended to
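A minimal in-memory sketch of the primitive described above, not the KFS implementation: the client hands over only the payload, and the server side picks the offset. The class names `Chunk` and `IFileSketch`, the round-robin routing, and the `ChunkFull` exception are assumptions made for illustration.

```python
import itertools

class ChunkFull(Exception):
    """Raised when a payload does not fit in the remaining space of a chunk."""

class Chunk:
    """Toy chunk: the server side chooses the offset for every append."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = bytearray()

    def record_append(self, payload: bytes) -> int:
        if len(self.data) + len(payload) > self.capacity:
            raise ChunkFull()
        offset = len(self.data)      # server-chosen offset; the client never supplies one
        self.data += payload
        return offset

    def read(self, offset, length):
        return bytes(self.data[offset:offset + length])

class IFileSketch:
    """Toy file with several chunks open for append at the same time."""

    def __init__(self, chunk_capacity, open_chunks=2):
        self.chunks = [Chunk(chunk_capacity) for _ in range(open_chunks)]
        self._next = itertools.cycle(range(open_chunks))

    def record_append(self, payload: bytes):
        idx = next(self._next)       # hypothetical routing policy: round-robin
        return idx, self.chunks[idx].record_append(payload)
```

Because the writer learns the location only from the reply, two writers never need to coordinate on where their records land.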
Atomic Record Append
ARA: <A, offset = -1>
ARA: <B, offset = -1>
Offset =
Implementing I-files
 Have implemented I-files in context of Kosmos distributed
filesystem (KFS)
Why KFS?
• KFS has multi-writer support
• We have designed/implemented/deployed KFS to manage PBs of data
 KFS is similar to GFS/HDFS
Chunks are striped across nodes and replicated for fault-tolerance
– Chunk master serializes all writes to a chunk
For atomic append, chunk master assigns the offset
With KFS I-files, multiple chunks of the I-file can be concurrently appended to
Atomic Record Append
 Writers are routed to a chunk that is open for writing
For scale, limit the # of concurrent writers to a chunk
 When client gets an ack back from chunk master, data is
replicated in the volatile memory at all the replicas
Chunkservers are free to commit data to disk asynchronously
 Eventually, chunk is made stable
Data is committed to disk at all the replicas
Replicas are byte-wise identical
 Stable chunks are not appended to again
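The routing and chunk lifecycle above can be sketched as follows. This is an illustrative model only: `AppendRouter`, its methods, and the first-fit policy are assumptions, and replication and the disk commit are left out.

```python
class ChunkState:
    def __init__(self, chunk_id):
        self.chunk_id = chunk_id
        self.writers = set()   # writers currently routed to this chunk
        self.stable = False    # stable chunks accept no further appends

class AppendRouter:
    """Route writers to open chunks, capping concurrent writers per chunk."""

    def __init__(self, max_writers_per_chunk):
        self.max_writers = max_writers_per_chunk
        self.chunks = []

    def route(self, writer_id):
        # Keep a writer on the chunk it is already using, if that chunk is still open.
        for chunk in self.chunks:
            if not chunk.stable and writer_id in chunk.writers:
                return chunk.chunk_id
        # Otherwise pick the first open chunk with spare writer capacity.
        for chunk in self.chunks:
            if not chunk.stable and len(chunk.writers) < self.max_writers:
                chunk.writers.add(writer_id)
                return chunk.chunk_id
        # No open chunk has room: allocate a new one.
        chunk = ChunkState(len(self.chunks))
        chunk.writers.add(writer_id)
        self.chunks.append(chunk)
        return chunk.chunk_id

    def make_stable(self, chunk_id):
        chunk = self.chunks[chunk_id]
        chunk.stable = True
        chunk.writers.clear()
```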
The Elephant Can Dance…
(De) Serialization
Hadoop Shuffle Pipeline
Sailfish Shuffle Pipeline
Sailfish Overview
 Modify Hadoop-0.20.2 to use I-files for MapReduce
Mappers write their output to a per-partition I-file
• Replication factor of 1 for all the chunks of an I-file
• At-least-once semantics for append; filter dups on the reduce side
Data destined for a reducer is in a single file
Build the intermediate data file in a manner that is suitable for the
reader rather than the writer
 Automatically parallelize execution of the reduce phase:
Set the number of reduce tasks and work assignment
Assign key-ranges to reduce tasks rather than whole partitions
Extend I-files to support key-based retrieval
Sailfish Map Phase Execution
Sailfish Reduce Phase Execution
Atomic “Record” Append For M/R
 M/R computations are about processing records
Intermediate data consists of key/value pairs
 Extend atomic append to support “records”
Mappers emit <key, record>
• Per-record framing that identifies the mapper task that generated a record
System stores per-chunk index
• After chunk is stable, chunk is sorted and an index is built by the sorter
– Sorting is a completely local operation: read a block from disk, sort in RAM, and write back
to disk
Reducers can retrieve data by <key>
• Use per-record framing to discard data from dead mappers
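A small sketch of the idea on this slide, under simplifying assumptions: each framed record is a `(mapper_id, key, value)` tuple, the chunk fits in memory, and a dense sorted key list stands in for the per-chunk index (a real index would likely be sparser). The function names are hypothetical.

```python
import bisect

def sort_chunk(records):
    """Sort a stable chunk by key and build a lookup index.

    records: list of (mapper_id, key, value) in append order.
    Returns (sorted_records, keys), where keys[i] is the key of sorted_records[i].
    """
    ordered = sorted(records, key=lambda rec: rec[1])
    keys = [rec[1] for rec in ordered]
    return ordered, keys

def read_range(ordered, keys, lo, hi, live_mappers):
    """Return (key, value) pairs with lo <= key < hi, dropping output of dead mappers."""
    start = bisect.bisect_left(keys, lo)
    end = bisect.bisect_left(keys, hi)
    return [(k, v) for m, k, v in ordered[start:end] if m in live_mappers]
```

For example, a reducer assigned the key range [a, d) reads one contiguous slice of the sorted chunk and filters on the mapper id carried in each record's framing.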
Sailfish Architecture
[Figure: Sailfish architecture. Labels: Submit Job; Hadoop JT; What do I; Mapper Task; I Appender; KFS I-files; I-file 5, key range [a, d); Reducer Task]
Handling Failures
 Whenever a chunk of an I-file is lost, need to re-generate
lost data
 With I-file, we have multiple mappers writing to a block
 For fault-tolerance,
Workbuilder tracks the set of chunks modified by a mapper task
Whenever a chunk is lost, workbuilder notifies the JT of the set of
map tasks that have to be re-run
Reducers reading from the I-file with the lost chunk wait until data is re-generated
 For fault-containment, in Sailfish, use per-rack I-files
Mappers running in a rack write to chunks of the I-file stored in the same rack
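The bookkeeping needed for recovery can be sketched as a map from chunk to the map tasks that appended to it. This is an illustrative model, not the workbuilder's actual code; `ChunkTracker` and its method names are assumptions.

```python
from collections import defaultdict

class ChunkTracker:
    """Track which map tasks have appended to which chunks."""

    def __init__(self):
        self._writers = defaultdict(set)   # chunk_id -> set of map task ids

    def record_append(self, map_task, chunk_id):
        self._writers[chunk_id].add(map_task)

    def tasks_to_rerun(self, lost_chunk):
        """Map tasks whose output was (partly) stored in the lost chunk."""
        return set(self._writers.get(lost_chunk, ()))
```

Since many mappers share a chunk, losing one chunk can force several map tasks to re-run, which is why limiting the set of writers per chunk (for example, per rack) contains the damage.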
Fault-tolerance With Sailfish
 Alternate option is to replicate map-output
Use atomic record append to write to two chunkservers
• Probability of data loss due to (concurrent) double failure is low
Performance hit for replicating data is low
• Data is replicated using RAM and written to disk async
However, network traffic increases substantially
• Sailfish causes network traffic to double compared to Stock Hadoop
– Map output is written to the network and reduce input is read over the network
• With replication, data traverses the network three times
– Alternate strategy is to selectively replicate map output
Replicate in response to data loss
Replicate output that was generated the earliest
Sailfish Reduce Phase
 # of reducers/job and their task assignment is determined by the workbuilder
in a data-dependent manner
Dynamically set the # of reducers per job after the map phase execution is complete
 # of reducers/I-file = (size of I-file) / (work per reducer)
Work per reducer is set based on RAM (in experiments, 1GB per reduce task)
If data assigned to a task exceeds size of RAM, merger does a network-wide merge by
appropriately streaming the data
 Workbuilder uses the per-chunk index to determine split points
Each reduce task is assigned a range of keys within an I-file
• Data for a reduce task is in multiple chunks and requires a merge
• Since chunks are sorted, data read by a reducer from a chunk is all sequential I/O
 Skew in reduce input is handled seamlessly
I-file with more data has more tasks assigned to it
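The planning step above can be sketched in a few lines. The reducer count follows the formula on the slide (rounded up); the split-point routine assumes, hypothetically, that the per-chunk indexes can be flattened into `(key, bytes)` entries, which the slides do not specify.

```python
GB = 1 << 30

def reducers_for_ifile(ifile_bytes, work_per_reducer=GB):
    """# of reducers per I-file = size of I-file / work per reducer, rounded up."""
    return max(1, (ifile_bytes + work_per_reducer - 1) // work_per_reducer)

def split_points(index_entries, num_reducers):
    """Pick num_reducers - 1 boundary keys so each key range holds similar bytes.

    index_entries: list of (key, bytes) pairs gathered from per-chunk indexes.
    Reducer i is assigned keys in [splits[i-1], splits[i]).
    """
    entries = sorted(index_entries)
    total = sum(size for _, size in entries)
    target = total / num_reducers
    splits, running = [], 0
    for key, size in entries:
        if len(splits) < num_reducers - 1 and running >= target * (len(splits) + 1):
            splits.append(key)
        running += size
    return splits
```

An I-file holding 8GB gets eight reduce tasks while one holding 1GB gets a single task, which is how skew across partitions is absorbed.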
Experimental Evaluation
 Cluster comprises ~150 machines
6 map tasks, 6 reduce tasks per node
• With Hadoop M/R tasks, a JVM is given 1.5G RAM for one pass sort/merge
8 cores, 16GB RAM, 4 x 750GB drives, 1Gbps between any pair of nodes
Job uses all the nodes in the cluster
 Evaluate with benchmark as well as real M/R job
Simple benchmark that generates its own data (similar to terasort)
• Measure only the overhead with transporting intermediate data
• Job generates records with random 10-byte key, 90-byte value
Experiments vary the size of intermediate data (1TB – 64TB)
• Mappers generate 1GB of data and reducers consume ~1GB of data
I-files in practice
 150 map tasks/rack
 128 map tasks concurrently appending to a block of an I-file
 2 blocks of an I-file are concurrently appended to in a rack
 512 I-files per job
Beyond 512 I-files hit system limitations in the cluster (too many
open files, too many connections)
 KFS chunkservers use direct I/O with the disk subsystem,
by-passing the buffer cache
How did we do? (Benchmark job)
How many seeks?
 With Stock Hadoop, number of seeks is ∝ M * R
 With Sailfish, it is the product of:
# of chunks per I-file (c)
# of reduce tasks per I-file (R / I)
# of I-files (I)
 We get: c * I * (R / I) = c * R
 # of chunks per I-file: 64TB intermediate data split over
512 I-files, where the chunksize is 128MB
c = (64TB / (512 * 128MB)) = 1024
 # of map tasks at 64TB: 65536 (64TB / 1GB per mapper):
c << M
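The comparison can be checked numerically. The sketch below plugs in the numbers from this slide (128MB chunks, 1GB per map and reduce task) and treats the seek counts as the proportional estimates M * R and c * R; the function name is illustrative.

```python
MB, GB, TB = 1 << 20, 1 << 30, 1 << 40

def seek_estimates(intermediate_bytes, num_ifiles, chunk_size=128 * MB, per_task=GB):
    """Proportional seek counts for Stock Hadoop (M * R) and Sailfish (c * R)."""
    M = intermediate_bytes // per_task                    # map tasks
    R = intermediate_bytes // per_task                    # reduce tasks
    c = intermediate_bytes // (num_ifiles * chunk_size)   # chunks per I-file
    return {"chunks_per_ifile": c, "hadoop": M * R, "sailfish": c * R}

est = seek_estimates(64 * TB, num_ifiles=512)
print(est, "ratio:", est["hadoop"] // est["sailfish"])
```

At 64TB the estimate for Sailfish is M / c = 64 times smaller than for Stock Hadoop.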
Why does Sailfish work?
 Where are the gains coming from?
Write-path is low-latency and is able to keep many disk arms and
NICs busy
• Lowered the number of disk seeks
• Reads are large/sequential
 Compared to Hadoop, read path in Sailfish is very efficient
Efficient disk read path leads to better network utilization
Data read per seek
Disk Thruput (during Reduce phase)
Using Sailfish In Practice
 Use a job+data from one of the behavioral ad-targeting pipelines at
BT-Join: Build a sliding N-day model of user behavior
• Take 1 day of clickstream logs and join with previous N days and
produce a new N-day model
 Input datasets compressed using bz2:
Dataset A: 1000 files, 50MB apiece (10:1 compression)
Dataset B: 1000 files, 1.2GB apiece (10:1 compression)
 Extended Sailfish to support compression for intermediate data
• Mappers generate up to 256K of records, compress, and “append record”
• Sorters read compressed data, decompress, sort, and recompress
• Merger reads compressed data, decompresses, merges, and passes to reducer
• For performance, use LZO from Intel IPP package
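The mapper-side batching can be sketched as below. This is an illustration under stated assumptions: `zlib` from the Python standard library stands in for LZO, the length-prefixed record framing is invented for the example, and `append_fn` is a hypothetical callback representing the atomic record append.

```python
import struct
import zlib

BATCH_LIMIT = 256 * 1024   # batch up to 256K of records before compressing

class CompressingAppender:
    """Buffer framed records, then compress and append each batch as one record."""

    def __init__(self, append_fn, limit=BATCH_LIMIT):
        self._append = append_fn
        self._limit = limit
        self._buf = bytearray()

    def emit(self, key: bytes, value: bytes):
        framed = struct.pack(">II", len(key), len(value)) + key + value
        if self._buf and len(self._buf) + len(framed) > self._limit:
            self.flush()
        self._buf += framed

    def flush(self):
        if self._buf:
            self._append(zlib.compress(bytes(self._buf)))
            self._buf.clear()

def decode_batch(blob: bytes):
    """Reader side (sorter or merger): decompress one batch back into (key, value) pairs."""
    raw = zlib.decompress(blob)
    pos, out = 0, []
    while pos < len(raw):
        klen, vlen = struct.unpack_from(">II", raw, pos)
        pos += 8
        out.append((raw[pos:pos + klen], raw[pos + klen:pos + klen + vlen]))
        pos += klen + vlen
    return out
```

Compressing a whole batch rather than individual records keeps the compression ratio reasonable and turns many small appends into fewer large ones.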
How did we do? (BT-Join)
BT-Join Analysis
 Speedup in Reduce phase is due to better batching
 Speedup in Map-phase:
Stock Hadoop: if map output doesn’t fit in RAM, mappers do an
external sort
Sailfish: Sorting is outside the map task and hence, no limits on the
amount of map output generated by a map task
 Net result: Job with Sailfish is about 2x faster when
compared to Stock Hadoop
Related Work
 Atomic append was introduced in GFS paper (SOSP’03)
GFS however seems to have moved away from atomic append as
they say it is not usable (at least once semantics and replicas can
 Balanced systems: TritonSort
Stage-based Sort engine in which the hardware is balanced
• 8 cores, 24GB RAM, 10Gig NIC, 16 drives/box
• Software is then constructed in a way that balances hardware use
Follow-on work on building an M/R on top of TritonSort
• Not clear how general their M/R engine is (seems specific to sort)
Sailfish tries to achieve balance via software and is a general M/R engine
 Designed I-files for intermediate data and built Sailfish for
doing large-scale M/R
Sailfish will be released as open-source
 Build Sailfish on top of YARN
Utilize the per-chunk index:
• Improve reduce task planning based on key distributions
• “Checkpoint” reduce tasks on key-based boundaries and allow better resource
• Support aggregation trees
 Having the intermediate data outside a M/R job allows
new debugging possibilities
Debug just the reduce phase