On Availability of Intermediate Data in Cloud Computations

Steven Y. Ko, Imranul Hoque, Brian Cho, and Indranil Gupta
Distributed Protocols Research Group (DPRG)
University of Illinois at Urbana-Champaign
Our Position

• Intermediate data as a first-class citizen for dataflow programming frameworks in clouds
◦ Dataflow programming frameworks
◦ The importance of intermediate data
◦ Outline of a solution

• This talk
◦ Builds up the case
◦ Emphasizes the need, not the solution
Dataflow Programming Frameworks

• Runtime systems that execute dataflow programs
◦ MapReduce (Hadoop), Pig, Hive, etc.
◦ Gaining popularity for massive-scale data processing
◦ Distributed and parallel execution on clusters

• A dataflow program consists of
◦ Multi-stage computation
◦ Communication patterns between stages
Example 1: MapReduce

• Two-stage computation with all-to-all communication
◦ Introduced by Google, open-sourced by Yahoo! (Hadoop)
◦ Two functions, Map and Reduce, supplied by the programmer
◦ Massively parallel execution of Map and Reduce

[Diagram: Stage 1: Map → Shuffle (all-to-all) → Stage 2: Reduce]
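For readers new to the model, here is a minimal Hadoop word-count sketch (my illustration, not part of the talk); the Map output is exactly the intermediate data this talk is about:

    // Minimal Hadoop word-count sketch. The Map output is the intermediate
    // data: it is written to the mapper's local disk, then shuffled.
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
      // Stage 1 (Map): emit (word, 1) for every word in the input split.
      public static class TokenMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            word.set(token);
            ctx.write(word, ONE);  // becomes intermediate data
          }
        }
      }

      // Stage 2 (Reduce): after the all-to-all shuffle, sum counts per word.
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
            Context ctx) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get();
          ctx.write(key, new IntWritable(sum));
        }
      }
    }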
Example 2: Pig and Hive

• Pig from Yahoo! & Hive from Facebook
• Built atop MapReduce
• Declarative, SQL-style languages
• Automatic generation & execution of multiple MapReduce jobs
Example 2: Pig and Hive

• Multi-stage computation with either all-to-all or 1-to-1 communication

[Diagram: Stage 1: Map → Shuffle (all-to-all) → Stage 2: Reduce → 1-to-1 comm. → Stage 3: Map → Stage 4: Reduce]
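Concretely, such a plan runs as a chain of MapReduce jobs. The sketch below is my illustration (hypothetical class and path names, using Hadoop's identity map/reduce defaults): job 1's Reduce output directory becomes job 2's input, giving the 1-to-1 hand-off between stages 2 and 3.

    // Illustrative two-job plan in the style Pig/Hive would generate.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TwoStagePlan {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path between = new Path(args[1]);  // intermediate data between jobs
        Path output = new Path(args[2]);

        // Stages 1-2: first MapReduce job (Map, shuffle, Reduce).
        Job job1 = Job.getInstance(conf, "stage-1-2");
        FileInputFormat.addInputPath(job1, input);
        FileOutputFormat.setOutputPath(job1, between);
        if (!job1.waitForCompletion(true)) System.exit(1);

        // Stages 3-4: the second job reads job 1's output (1-to-1 into
        // its Maps), then shuffles all-to-all into its own Reduces.
        Job job2 = Job.getInstance(conf, "stage-3-4");
        FileInputFormat.addInputPath(job2, between);
        FileOutputFormat.setOutputPath(job2, output);
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
      }
    }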
Usage

• Google (MapReduce)
◦ Indexing: a chain of 24 MapReduce jobs
◦ ~200K jobs processing 50 PB/month (in 2006)

• Yahoo! (Hadoop + Pig)
◦ WebMap: a chain of 100 MapReduce jobs

• Facebook (Hadoop + Hive)
◦ ~300 TB total, adding 2 TB/day (in 2008)
◦ 3K jobs processing 55 TB/day

• Amazon
◦ Elastic MapReduce service (pay-as-you-go)

• Academic clouds
◦ Google-IBM Cluster at UW (Hadoop service)
◦ CCT at UIUC (Hadoop & Pig service)
One Common Characteristic

• Intermediate data
◦ Intermediate data = the data passed between stages

• Similarities to traditional intermediate data
◦ E.g., .o files
◦ Critical to produce the final output
◦ Short-lived, written once, read once, and used immediately
One Common Characteristic

• Intermediate data
◦ Written locally & read remotely
◦ Potentially very large in volume (depending on the workload)
◦ Acts as a computational barrier between stages

[Diagram: Stage 1: Map → (Computational Barrier) → Stage 2: Reduce]
Computational Barrier + Failures

• Availability becomes critical.
◦ Loss of intermediate data before or during the execution of a task => the task can't proceed

[Diagram: Stage 1: Map → Stage 2: Reduce]
Current Solution

• Store locally & re-generate when lost
◦ Re-run the affected Map & Reduce tasks
◦ No support from a storage system

• Assumption: re-generation is cheap and easy (a sketch of this policy follows)

[Diagram: Stage 1: Map → Stage 2: Reduce]
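A minimal sketch of this store-locally-and-regenerate policy (my illustration; the types and names are hypothetical, not Hadoop's internals):

    // Illustrative sketch: intermediate data lives only on the producer's
    // local disk, so a lost output is re-generated by re-running the task.
    import java.util.Optional;
    import java.util.function.Supplier;

    interface LocalStore {
      Optional<byte[]> read(String taskId);     // empty if the data was lost
      void write(String taskId, byte[] output);
    }

    class Regenerator {
      private final LocalStore store;
      Regenerator(LocalStore store) { this.store = store; }

      // Serve the intermediate output if it survives; otherwise re-execute
      // the producing task. Note the result is again stored only locally.
      byte[] fetchOrRerun(String taskId, Supplier<byte[]> rerunTask) {
        return store.read(taskId).orElseGet(() -> {
          byte[] output = rerunTask.get();  // re-run the Map or Reduce task
          store.write(taskId, output);
          return output;
        });
      }
    }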
Hadoop Experiment

• Emulab setting (for all plots in this talk)
◦ 20 machines sorting 36 GB
◦ 4 LANs and a core switch (all 100 Mbps)

• Normal execution: Map–Shuffle–Reduce

[Plot: timeline of a normal run through the Map, Shuffle, and Reduce phases]
Hadoop Experiment

• 1 failure after Map
◦ Re-execution of Map–Shuffle–Reduce

• ~33% increase in completion time

[Plot: the normal Map–Shuffle–Reduce timeline followed by a second Map–Shuffle–Reduce re-executed after the failure]
Re-Generation for Multi-Stage

• Cascaded re-execution: expensive

[Diagram: Stage 1: Map → Stage 2: Reduce → Stage 3: Map → Stage 4: Reduce, with a loss at a later stage forcing re-execution of earlier stages]
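To see why the cascade is expensive, here is a toy cost model (entirely my own back-of-the-envelope assumption, not a result from the talk): with local-only storage, regenerating data lost at stage k can require re-running stages 1 through k.

    // Toy model of cascaded re-execution cost. All numbers are assumptions.
    public class CascadedReexecution {
      public static void main(String[] args) {
        double stageMinutes = 10.0; // assumed time per stage
        int stages = 4;             // e.g., Map, Reduce, Map, Reduce
        int lostAtStage = 4;        // intermediate data lost at stage 4

        // With replicated intermediate data: re-run only the affected stage.
        double withReplication = stages * stageMinutes + stageMinutes;

        // With local-only storage: re-run stages 1..lostAtStage to rebuild
        // the lost chain of inputs (the cascade), on top of the normal run.
        double localOnly = stages * stageMinutes + lostAtStage * stageMinutes;

        System.out.printf("replicated: %.0f min, local-only: %.0f min%n",
            withReplication, localOnly);  // 50 vs. 80 in this toy setting
      }
    }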
Importance of Intermediate Data

• Why does it matter?
◦ Critical for execution (it forms a barrier)
◦ Very costly when lost

• Current systems handle it themselves.
◦ Re-generate when lost: can lead to expensive cascaded re-execution
◦ No support from the storage layer

• We believe the storage layer is the right abstraction, not the dataflow frameworks.
Our Position

• Intermediate data as a first-class citizen for dataflow programming frameworks in clouds
◦ Dataflow programming frameworks
◦ The importance of intermediate data
◦ Outline of a solution
  ▪ Why is storage the right abstraction?
  ▪ Challenges
  ▪ Research directions
Why is Storage the Right Abstraction?

• Replication stops cascaded re-execution.

[Diagram: Stage 1: Map → Stage 2: Reduce → Stage 3: Map → Stage 4: Reduce, with intermediate data replicated at each stage boundary]
So, Are We Done?

• No!

• Challenge: minimal interference
◦ The network is heavily utilized during Shuffle.
◦ Replication requires network transmission too.
◦ Minimizing interference is critical for the overall job completion time.

• Any existing approaches?
◦ HDFS (Hadoop's default file system): much interference (next slide)
◦ Background replication with TCP-Nice: not designed for network utilization & control (not discussed further here; please refer to our paper)
Modified HDFS Interference

• Unmodified HDFS
◦ Much overhead with synchronous replication

• Modification for asynchronous replication
◦ Evaluated with an increasing level of interference (a generic sketch follows)

• Four levels of interference
◦ Hadoop: original, no replication, no interference
◦ Read: disk read, no network transfer, no actual replication
◦ Read-Send: disk read & network send, no actual replication
◦ Rep.: full replication
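For concreteness, a generic sketch of asynchronous, rate-bounded replication (my own construction to illustrate the idea; it is not the authors' actual HDFS modification, and all names are hypothetical):

    // Asynchronous replication with crude rate limiting: writers return
    // immediately, and a background thread paces network use so that
    // replication leaves bandwidth for the Shuffle.
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class ThrottledReplicator implements Runnable {
      private final BlockingQueue<byte[]> pending = new LinkedBlockingQueue<>();
      private final long bytesPerSecond;  // replication bandwidth budget

      public ThrottledReplicator(long bytesPerSecond) {
        this.bytesPerSecond = bytesPerSecond;
      }

      // Called by the writer; returns immediately (asynchronous replication).
      public void enqueue(byte[] block) { pending.add(block); }

      @Override public void run() {
        try {
          while (true) {
            byte[] block = pending.take();
            sendToRemoteReplica(block);
            // Sleep long enough that the average send rate stays within
            // the budget, bounding interference with foreground traffic.
            Thread.sleep(block.length * 1000L / bytesPerSecond);
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }

      private void sendToRemoteReplica(byte[] block) {
        // Placeholder: a real system would write to a remote node here.
      }
    }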
Modified HDFS Interference

• Asynchronous replication
◦ Network utilization makes the difference

• Both Map & Shuffle get affected
◦ Some Maps need to read remotely
Our Position

• Intermediate data as a first-class citizen for dataflow programming frameworks in clouds
◦ Dataflow programming frameworks
◦ The importance of intermediate data
◦ Outline of a new storage system design
  ▪ Why is storage the right abstraction?
  ▪ Challenges
  ▪ Research directions
Research Directions

• Two requirements
◦ Intermediate data availability, to stop cascaded re-execution
◦ Interference minimization, focusing on network interference

• Solution
◦ Replication with minimal interference
Research Directions

• Replication using spare bandwidth
◦ Not much network activity during Map & Reduce computation
◦ Tight bandwidth monitoring & control

• Deadline-based replication
◦ Replicate every N stages

• Replication based on a cost model (see the sketch below)
◦ Replicate only when re-execution is more expensive
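A hedged sketch of the cost-model direction (my reading of the bullet above, not the authors' design): replicate a stage's output only when the expected cost of cascaded re-execution exceeds the cost of replicating now.

    // Cost-model replication decision. All parameters are assumptions that
    // a real system would have to estimate online.
    public class ReplicationPolicy {
      /**
       * @param replicationCost time to replicate this stage's output (s)
       * @param reexecutionCost time to re-run the stages needed to
       *                        regenerate it, including the cascade (s)
       * @param failureProb     probability the data is lost before use
       */
      static boolean shouldReplicate(double replicationCost,
                                     double reexecutionCost,
                                     double failureProb) {
        // Expected cost of not replicating = P(loss) * re-execution cost.
        return failureProb * reexecutionCost > replicationCost;
      }

      public static void main(String[] args) {
        // Example: 60 s to replicate vs. 0.05 * 2400 s expected re-execution.
        System.out.println(shouldReplicate(60, 2400, 0.05));  // prints true
      }
    }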
Summary

• Our position
◦ Intermediate data as a first-class citizen for dataflow programming frameworks in clouds
◦ Problem: cascaded re-execution

• Requirements
◦ Intermediate data availability
◦ Interference minimization

• Further research needed
BACKUP
Default HDFS Interference

• Replication of Map and Reduce outputs
Default HDFS Interference

• Replication policy: local first, then remote-rack
• Synchronous replication