Efficient On-Demand Operations in
Large-Scale Infrastructures
Final Defense Presentation
by Steven Y. Ko
Thesis Focus

One class of operations that faces one
common set of challenges
◦ cutting across four diverse and popular types
of distributed infrastructures
2
Large-Scale Infrastructures

Are everywhere
Research testbeds
• CCT, PlanetLab, Emulab, …
• For computer scientists
Grids
• TeraGrid, Grid5000, …
• For scientific communities
Clouds
• Amazon, Google, HP, IBM, …
• For Web users, app developers
Wide-area peer-to-peer
• BitTorrent, LimeWire, etc.
• For Web users
3
On-Demand Operations

Operations that act upon the most up-to-date state
of the infrastructure
◦ E.g., what’s going on right now in my data center?

Four operations in the thesis
◦ On-demand monitoring (prelim)
 Clouds, research testbeds
◦ On-demand replication (this talk)
 Clouds, research testbeds
◦ On-demand scheduling
 Grids
◦ On-demand key/value store
 Peer-to-peer
 Diverse infrastructures & essential operations
 All have one common set of challenges
4
Common Challenges
 Scale and dynamism
 On-demand monitoring
◦ Scale: 200K machines (Google)
◦ Dynamism: values can change any time (e.g., CPU-util)

On-demand replication
◦ Scale: ~300TB total, adding 2TB/day (Facebook HDFS)
◦ Dynamism: resource availability (failures, b/w, etc.)

On-demand scheduling
◦ Scale: 4K CPUs running 44,000 tasks accessing
588,900 files (Coadd)
◦ Dynamism: resource availability (CPU, mem, disk, etc.)

On-demand key/value store
◦ Scale: Millions of peers (BitTorrent)
◦ Dynamism: short-term unresponsiveness
5
Common Challenges

Scale and dynamism
[Figure: 2x2 space with Scale on one axis and Dynamism (static vs. dynamic) on the other; on-demand operations sit in the large-scale, dynamic quadrant]
6
On-Demand Monitoring (Recap)

Need
◦ Users and admins need to query & monitor the most up-to-date attributes (CPU-util, disk-cap, etc.) of an infrastructure

Scale
◦ # of machines, the amount of monitoring data, etc.

Dynamism
◦ Static attributes (e.g., CPU-type) vs. dynamic
attributes (e.g., CPU-util)

Moara
◦ Expressive queries, e.g., avg. CPU-util where disk > 10G or (mem-util < 50% and # of processes < 20) (illustrated below)
◦ Quick response time and bandwidth-efficiency
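As a rough illustration of what such a query computes (a local Python sketch of the semantics only, with hypothetical attribute names; it does not show Moara's actual API or its in-network aggregation):

    # Hypothetical sketch of the query's semantics (attribute names invented);
    # Moara itself evaluates this with in-network aggregation, not a local loop.
    def avg_cpu_util(machines):
        # WHERE disk > 10G OR (mem-util < 50% AND # of processes < 20)
        selected = [m for m in machines
                    if m["disk_gb"] > 10
                    or (m["mem_util"] < 0.5 and m["num_processes"] < 20)]
        return sum(m["cpu_util"] for m in selected) / len(selected) if selected else None

    machines = [
        {"cpu_util": 0.7, "disk_gb": 40, "mem_util": 0.3, "num_processes": 12},
        {"cpu_util": 0.2, "disk_gb": 5,  "mem_util": 0.8, "num_processes": 90},
        {"cpu_util": 0.5, "disk_gb": 15, "mem_util": 0.6, "num_processes": 35},
    ]
    print(avg_cpu_util(machines))  # ~0.6: averages only the 1st and 3rd machines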
7
Common Challenges

On-demand monitoring: DB vs. Moara
[Figure: 2x2 space of (Data) Scale vs. (Attribute) Dynamism, from static attributes (e.g., CPU type) to dynamic attributes (e.g., CPU util.); centralized approaches (e.g., a DB) cover small scale, scaling centralized approaches (e.g., a replicated DB) cover large scale with static attributes, and my solution (Moara) targets large scale with dynamic attributes]
8
Common Challenges

On-demand replication of intermediate data (in detail later)
[Figure: 2x2 space of (Data) Scale vs. (Availability) Dynamism; local file systems (e.g., for compilers), distributed file systems (e.g., Hadoop w/ HDFS), and replication (e.g., Hadoop w/ HDFS) cover the other quadrants, while my solution (ISS) targets large scale with dynamic availability]
9
Thesis Statement

On-demand operations can be
implemented efficiently in spite of scale
and dynamism in a variety of distributed
systems.
10
Thesis Statement

On-demand operations can be
implemented efficiently in spite of scale
and dynamism in a variety of distributed
systems.
◦ Efficiency: responsiveness & bandwidth
11
Thesis Contributions
 Identifying on-demand operations as an important class of operations in large-scale infrastructures
 Identifying scale and dynamism as two common challenges
 Arguing the need for an on-demand operation in each case
 Showing how efficient each can be by simulations and real deployments
12
Thesis Overview

ISS (Intermediate Storage System)
◦ Implements on-demand replication
◦ USENIX HotOS ’09

Moara
◦ Implements on-demand monitoring
◦ ACM/IFIP/USENIX Middleware ’08

Worker-centric scheduling
◦ Implements on-demand scheduling
◦ ACM/IFIP/USENIX Middleware ’07

MPIL (Multi-Path Insertion and Lookup)
◦ Implements on-demand key/value store
◦ IEEE DSN ’05
13
Thesis Overview
[Figure: the four solutions mapped onto the infrastructures they target]
 Moara & ISS: Clouds (Amazon, Google, HP, IBM, …; for Web users, app developers) and Research testbeds (CCT, PlanetLab, Emulab, …; for computer scientists)
 Worker-Centric Scheduling: Grids (TeraGrid, Grid5000, …; for scientific communities)
 MPIL: Wide-area peer-to-peer (BitTorrent, LimeWire, etc.; for Web users)
14
Related Work
 Management operations: MON [Lia05], vxargs, PSSH [Pla], etc.
 Scheduling [Ros04, Ros05, Vis04]
 Gossip-based multicast [Bir02, Gup02, Ker01]
 On-demand scaling
◦ Amazon AWS, Google AppEngine, MS Azure, RightScale, etc.
15
ISS: Intermediate Storage
System
Our Position

Intermediate data as a first-class citizen
for dataflow programming frameworks in
clouds
17
Our Position

Intermediate data as a first-class citizen
for dataflow programming frameworks in
clouds
◦ Dataflow programming frameworks
18
Our Position

Intermediate data as a first-class citizen
for dataflow programming frameworks in
clouds
◦ Dataflow programming frameworks
◦ The importance of intermediate data
 Scale and dynamism
19
Our Position

Intermediate data as a first-class citizen
for dataflow programming frameworks in
clouds
◦ Dataflow programming frameworks
◦ The importance of intermediate data
 Scale and dynamism
◦ ISS (Intermediate Storage System) with on-demand replication
20
Dataflow Programming Frameworks

Runtime systems that execute dataflow
programs
◦ MapReduce (Hadoop), Pig, Hive, etc.
◦ Gaining popularity for massive-scale data
processing
◦ Distributed and parallel execution on clusters

A dataflow program consists of
◦ Multi-stage computation
◦ Communication patterns between stages
21
Example 1: MapReduce

Two-stage computation with all-to-all comm.
◦ Google introduced, Yahoo! open-sourced (Hadoop)
◦ Two functions – Map and Reduce – supplied by a programmer (sketched below)
◦ Massively parallel execution of Map and Reduce
[Figure: Stage 1 (Map) → Shuffle (all-to-all) → Stage 2 (Reduce)]
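As a concrete illustration (a generic single-process word-count sketch, not the thesis's code), the two programmer-supplied functions and the shuffle between them look roughly like this:

    # Generic word-count sketch of the two user-supplied functions in MapReduce.
    # The framework runs many map tasks in parallel (Stage 1), shuffles the
    # intermediate (key, value) pairs so that each key reaches one reducer
    # (all-to-all communication), then runs reduce tasks in parallel (Stage 2).
    from collections import defaultdict

    def map_fn(line):
        # Emit one intermediate pair per word.
        for word in line.split():
            yield word, 1

    def reduce_fn(word, counts):
        # Combine all intermediate values that share the same key.
        yield word, sum(counts)

    def run_job(lines):
        # Simplified single-process stand-in for the distributed runtime.
        shuffle = defaultdict(list)
        for line in lines:                         # Stage 1: Map
            for key, value in map_fn(line):
                shuffle[key].append(value)         # Shuffle (all-to-all)
        return dict(kv for key, values in shuffle.items()
                       for kv in reduce_fn(key, values))   # Stage 2: Reduce

    print(run_job(["a rose is a rose", "is a rose"]))
    # {'a': 3, 'rose': 3, 'is': 2}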
22
Example 2: Pig and Hive

Multi-stage with either all-to-all or 1-to-1
[Figure: Stage 1 (Map) → Shuffle (all-to-all) → Stage 2 (Reduce) → 1-to-1 comm. → Stage 3 (Map) → Stage 4 (Reduce)]
23
Usage

Google (MapReduce)
◦ Indexing: a chain of 24 MapReduce jobs
◦ ~200K jobs processing 50PB/month (in 2006)

Yahoo! (Hadoop + Pig)
◦ WebMap: a chain of 100 MapReduce jobs

Facebook (Hadoop + Hive)
◦ ~300TB total, adding 2TB/day (in 2008)
◦ 3K jobs processing 55TB/day

Amazon
◦ Elastic MapReduce service (pay-as-you-go)

Academic clouds
◦ Google-IBM Cluster at UW (Hadoop service)
◦ CCT at UIUC (Hadoop & Pig service)
24
One Common Characteristic

Intermediate data
◦ Intermediate data = the data passed between stages

Similarities to traditional intermediate data [Bak91, Vog99]
◦ E.g., .o files
◦ Critical to produce the final output
◦ Short-lived, written-once and read-once, &
used-immediately
◦ Computational barrier
25
One Common Characteristic

Computational Barrier
[Figure: Stage 1 (Map) → computational barrier → Stage 2 (Reduce)]
26
Why Important?

Scale and Dynamism
◦ Large-scale: possibly very large amount of
intermediate data
◦ Dynamism: Loss of intermediate data
=> the task can’t proceed
[Figure: Stage 1 (Map) → Stage 2 (Reduce), with a lost intermediate block blocking the dependent task]
27
Failure Stats
 5 average worker deaths per MapReduce job (Google in 2006)
 One disk failure in every run of a 6-hour MapReduce job with 4000 machines (Google in 2008)
 50 machine failures out of a 20K-machine cluster (Yahoo! in 2009)
28
Hadoop Failure Injection
Experiment

Emulab setting
◦ 20 machines sorting 36GB
◦ 4 LANs and a core switch (all 100 Mbps)

1 failure after Map
◦ Re-execution of Map-Shuffle-Reduce

~33% increase in completion time
[Figure: execution timeline — Map → Shuffle → Reduce, with Map → Shuffle → Reduce re-executed after the failure]
29
Re-Generation for Multi-Stage

Cascaded re-execution: expensive
[Figure: four-stage dataflow (Map → Reduce → Map → Reduce); losing intermediate data triggers cascaded re-execution of earlier stages]
30
Importance of Intermediate Data

Why?
◦ Scale and dynamism
◦ A lot of data + when lost, very costly

Current systems handle it themselves.
◦ Re-generate when lost: can lead to expensive
cascaded re-execution

We believe that the storage can provide a
better solution than the dataflow
programming frameworks
31
Our Position

Intermediate data as a first-class citizen
for dataflow programming frameworks in
clouds
Dataflow programming frameworks
The importance of intermediate data
◦ ISS (Intermediate Storage System)
 Why storage?
 Challenges
 Solution hypotheses
 Hypotheses validation
32
Why Storage? On-Demand
Replication

Stops cascaded re-execution
[Figure: Stage 1 (Map) → Stage 2 (Reduce) → Stage 3 (Map) → Stage 4 (Reduce), with replicated intermediate data between stages]
33
So, Are We Done?
 No!
 Challenge: minimal interference
◦ Network is heavily utilized during Shuffle.
◦ Replication requires network transmission
too, and needs to replicate a large amount.
◦ Minimizing interference is critical for the
overall job completion time.
◦ Efficiency: completion time + bandwidth

HDFS (Hadoop Distributed File System):
much interference
34
Default HDFS Interference

On-demand replication of Map and
Reduce outputs (2 copies in total)
35
Background Transport Protocols

TCP-Nice [Ven02] & TCP-LP [Kuz06]
◦ Support background & foreground flows

Pros
◦ Background flows do not interfere with
foreground flows (functionality)

Cons
◦ Designed for wide-area Internet
◦ Application-agnostic
◦ Not designed for data center replication

Can do better!
36
Revisiting Common Challenges
 Scale: need to replicate a large amount of intermediate data
 Dynamism: failures, foreground traffic (bandwidth availability)
 Two challenges create much interference.
37
Revisiting Common Challenges

Intermediate data management
[Figure: 2x2 space of (Data) Scale vs. (Availability) Dynamism; local file systems (e.g., for compilers), distributed file systems (e.g., Hadoop w/ HDFS), and replication (e.g., Hadoop w/ HDFS) cover the other quadrants, while ISS targets large scale with dynamic availability]
38
Our Position

Intermediate data as a first-class citizen
for dataflow programming frameworks in
clouds
Dataflow programming frameworks
The importance of intermediate data
◦ ISS (Intermediate Storage System)
 Why is storage the right abstraction?
 Challenges
 Solution hypotheses
 Hypotheses validation
39
Three Hypotheses
1. Asynchronous replication can help.
◦ HDFS replication works synchronously.
2. The replication process can exploit the inherent bandwidth heterogeneity of data centers (next).
3. Data selection can help (later).
40
Bandwidth Heterogeneity

Data center topology: hierarchical
◦ Top-of-the-rack switches (under-utilized)
◦ Shared core switch (fully-utilized)
41
Data Selection

Only replicate locally-consumed data
[Figure: Stage 1 (Map) → Stage 2 (Reduce) → Stage 3 (Map) → Stage 4 (Reduce), with the locally-consumed intermediate data highlighted]
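A minimal sketch of the selection rule (hypothetical helper, not ISS's actual code), assuming that a block produced and consumed on the same machine is the one that exists in only one place and therefore needs the extra copy:

    # Sketch of the data-selection rule: only intermediate blocks produced and
    # consumed on the same machine get an extra copy; blocks shuffled across
    # machines presumably already exist at both producer and consumer.
    def blocks_to_replicate(blocks):
        # blocks: list of dicts with 'id', 'producer_node', 'consumer_node'.
        return [b["id"] for b in blocks
                if b["producer_node"] == b["consumer_node"]]

    blocks = [
        {"id": "part-0", "producer_node": "n1", "consumer_node": "n1"},  # local
        {"id": "part-1", "producer_node": "n1", "consumer_node": "n7"},  # shuffled
        {"id": "part-2", "producer_node": "n2", "consumer_node": "n2"},  # local
    ]
    print(blocks_to_replicate(blocks))  # ['part-0', 'part-2']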
42
Three Hypotheses
1. Asynchronous replication can help.
2. The replication process can exploit the inherent bandwidth heterogeneity of data centers.
3. Data selection can help.
 The question is not if, but how much.
 If effective, these become techniques.
43
Experimental Setting

Emulab with 80 machines
◦ 4 LANs with 20 machines each
◦ 4 x 100Mbps top-of-the-rack switches
◦ 1 x 1Gbps core switch
◦ Various configurations give similar results.
 Input data: 2GB/machine, randomly generated
 Workload: sort
 5 runs
◦ Std. dev. ~ 100 sec.: small compared to the overall completion time
 2 replicas of Map outputs in total
44
Asynchronous Replication

Modification for asynchronous replication
◦ With an increasing level of interference

Four levels of interference
◦ Hadoop: original, no replication, no
interference
◦ Read: disk read, no network transfer, no actual
replication
◦ Read-Send: disk read & network send, no
actual replication
◦ Rep.: full replication
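A rough sketch of what "asynchronous" means here (hypothetical structure, not the actual Hadoop modification): finished intermediate files are handed to a background thread, so the task itself never blocks on replication.

    # Rough sketch of asynchronous replication (hypothetical, not the actual
    # Hadoop modification): the task enqueues finished intermediate files and a
    # background thread copies them, so the task never blocks on the network.
    import queue, threading

    replication_queue = queue.Queue()

    def replicate(path):
        # Placeholder for "read the local block and send a copy to another node".
        print("replicating", path)

    def replication_worker():
        while True:
            replicate(replication_queue.get())
            replication_queue.task_done()

    threading.Thread(target=replication_worker, daemon=True).start()

    def on_map_output_written(path):
        # Called when a map task finishes an intermediate file; returns at once,
        # so replication stays off the task's critical path (a synchronous
        # variant would call replicate(path) inline here instead).
        replication_queue.put(path)

    on_map_output_written("/local/map_0.out")
    replication_queue.join()   # wait only so the demo prints before exiting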
45
Asynchronous Replication
 Network utilization makes the difference
 Both Map & Shuffle get affected
◦ Some Maps need to read remotely
46
Three Hypotheses (Validation)
 Asynchronous
replication can help, but
still can’t eliminate the interference.
 The replication process can exploit the
inherent bandwidth heterogeneity of data
centers.
 Data selection can help.
47
Rack-Level Replication

Rack-level replication is effective.
◦ Only 20~30 rack failures per year, mostly
planned (Google 2008)
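A minimal sketch of rack-local placement (hypothetical helper, not ISS code): pick the replica target from the producer's own rack, so the copy crosses only the under-utilized top-of-the-rack switch rather than the shared core switch.

    # Sketch of rack-local replica placement: choose a different machine in the
    # producer's rack so replication traffic stays off the core switch.
    import random

    def pick_replica_target(producer, topology):
        # topology: node -> rack, e.g., {"n1": "rack-A", ...}
        same_rack = [n for n, rack in topology.items()
                     if rack == topology[producer] and n != producer]
        return random.choice(same_rack) if same_rack else None

    topology = {"n1": "rack-A", "n2": "rack-A", "n3": "rack-B", "n4": "rack-B"}
    print(pick_replica_target("n1", topology))   # "n2": stays within rack-A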
48
Three Hypotheses (Validation)
 Asynchronous
replication can help, but
still can’t eliminate the interference
 The rack-level replication can reduce the
interference significantly.
 Data selection can help.
49
Locally-Consumed Data Replication

It significantly reduces the amount of
replication.
50
Three Hypotheses (Validation)
 Asynchronous
replication can help, but
still can’t eliminate the interference
 The rack-level replication can reduce the
interference significantly.
 Data selection can reduce the
interference significantly.
51
ISS Design Overview
 Implements asynchronous rack-level selective replication (all three hypotheses)
 Replaces the Shuffle phase
◦ MapReduce does not implement Shuffle.
◦ Map tasks write intermediate data to ISS, and
Reduce tasks read intermediate data from ISS.

Extends HDFS (next)
52
ISS Design Overview

Extends HDFS
◦ iss_create()
◦ iss_open()
◦ iss_write()
◦ iss_read()
◦ iss_close()
 Map tasks
◦ iss_create() => iss_write() => iss_close()
 Reduce tasks
◦ iss_open() => iss_read() => iss_close()
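A self-contained sketch of this interface in use (an in-memory Python stand-in that only mirrors the shape of the five calls; the real ISS extends HDFS's Java client API):

    # In-memory stand-in for the five ISS calls listed above, followed by the
    # writer-side (Map) and reader-side (Reduce) usage patterns.
    _store = {}

    def iss_create(path):
        _store[path] = []
        return path

    def iss_open(path):
        return path

    def iss_write(handle, record):
        _store[handle].append(record)   # real ISS also replicates asynchronously

    def iss_read(handle):
        return list(_store[handle])

    def iss_close(handle):
        pass                            # real ISS finishes replication in background

    # Map task: iss_create() => iss_write() => iss_close()
    h = iss_create("/iss/job42/map-0")
    for word in "a rose is a rose".split():
        iss_write(h, (word, 1))
    iss_close(h)

    # Reduce task: iss_open() => iss_read() => iss_close()
    h = iss_open("/iss/job42/map-0")
    counts = {}
    for word, one in iss_read(h):
        counts[word] = counts.get(word, 0) + one
    iss_close(h)
    print(counts)   # {'a': 2, 'rose': 2, 'is': 1}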
53
Hadoop + Failure

75% slowdown compared to no-failure
Hadoop
54
Hadoop + ISS + Failure

(Emulated) 10% slowdown compared to
no-failure Hadoop
◦ Speculative execution can leverage ISS.
55
Replication Completion Time

Replication completes before Reduce
◦ ‘+’ indicates replication time for each block
56
Summary
 Our position
◦ Intermediate data as a first-class citizen for dataflow programming frameworks in clouds
 Problem: cascaded re-execution
 Requirements
◦ Intermediate data availability (scale and dynamism)
◦ Interference minimization (efficiency)
 Asynchronous replication can help, but still can't eliminate the interference.
 The rack-level replication can reduce the interference significantly.
 Data selection can reduce the interference significantly.
 Hadoop & Hadoop + ISS show comparable completion times.
57
Thesis Overview
[Figure: the four solutions mapped onto the infrastructures they target]
 Moara & ISS: Clouds (Amazon, Google, HP, IBM, …; for Web users, app developers) and Research testbeds (CCT, PlanetLab, Emulab, …; for computer scientists)
 Worker-Centric Scheduling: Grids (TeraGrid, Grid5000, …; for scientific communities)
 MPIL: Wide-area peer-to-peer (BitTorrent, LimeWire, etc.; for Web users)
58
On-Demand Scheduling

One-line summary
◦ On-demand scheduling of Grid tasks for data-intensive applications

Scale
◦ # of tasks, # of CPUs, the amount of data

Dynamism
◦ resource availability (CPU, mem, disk, etc.)

Worker-centric scheduling
◦ Worker’s availability is the first-class criterion
in scheduling tasks.
59
On-Demand Scheduling

Taxonomy
[Figure: 2x2 space of Scale vs. Dynamism; static scheduling and task-centric scheduling cover the other quadrants, while worker-centric scheduling targets large scale with dynamism]
60
On-Demand Key/Value Store

One-line summary
◦ On-demand key/value lookup algorithm under
dynamic environments

Scale
◦ # of peers

Dynamism
◦ short-term unresponsiveness

MPIL
◦ A DHT-style lookup algorithm that can operate
over any topology and is resistant to
perturbation
61
On-Demand Key/Value Store

Taxonomy
[Figure: 2x2 space of Scale vs. Dynamism; Napster-style central directories and DHTs cover the other quadrants, while MPIL targets large scale with dynamism]
62
Conclusion
 On-demand operations can be implemented efficiently in spite of scale and dynamism in a variety of distributed systems.
 Each on-demand operation has a different need, but a common set of challenges.
63
References
 [Lia05] J. Liang, S. Y. Ko, I. Gupta, and K. Nahrstedt. MON: On-demand Overlays for Distributed System Management. In USENIX WORLDS, 2005.
 [Pla] PlanetLab. http://www.planet-lab.org/
 [Ros05] A. L. Rosenberg and M. Yurkewych. Guidelines for Scheduling Some Common Computation-Dags for Internet-Based Computing. IEEE TC, Vol. 54, No. 4, April 2005.
 [Ros04] A. L. Rosenberg. On Scheduling Mesh-Structured Computations for Internet-Based Computing. IEEE TC, 53(9), September 2004.
 [Vis04] S. Viswanathan, B. Veeravalli, D. Yu, and T. G. Robertazzi. Design and Analysis of a Dynamic Scheduling Strategy with Resource Estimation for Large-Scale Grid Systems. In IEEE/ACM GRID, 2004.
 [Bir02] K. Birman, M. Hayden, O. Ozkasap, Z. Xiao, M. Budiu, and Y. Minsky. Bimodal Multicast. ACM TOCS, 17(2):41–88, May 2002.
 [Gup02] I. Gupta, A.-M. Kermarrec, and A. J. Ganesh. Efficient Epidemic-Style Protocols for Reliable and Scalable Multicast. In SRDS, 2002.
 [Ker01] A.-M. Kermarrec, L. Massoulié, and A. J. Ganesh. Probabilistic Reliable Dissemination in Large-Scale Systems. IEEE TPDS, 14:248–258, 2001.
 [Vog99] W. Vogels. File System Usage in Windows NT 4.0. In SOSP, 1999.
 [Bak91] M. G. Baker, J. H. Hartman, M. D. Kupfer, K. W. Shirriff, and J. K. Ousterhout. Measurements of a Distributed File System. SIGOPS OSR, 25(5), 1991.
 [Ven02] A. Venkataramani, R. Kokku, and M. Dahlin. TCP Nice: A Mechanism for Background Transfers. In OSDI, 2002.
 [Kuz06] A. Kuzmanovic and E. W. Knightly. TCP-LP: Low-Priority Service via End-Point Congestion Control. IEEE/ACM TON, 14(4):739–752, 2006.
64
Backup Slides
65
Hadoop Experiment

Emulab setting
◦ 20 machines sorting 36GB
◦ 4 LANs and a core switch (all 100 Mbps)

Normal execution: Map–Shuffle–Reduce
[Figure: execution timeline — Map → Shuffle → Reduce]
66
Worker-Centric Scheduling

One-line summary
◦ On-demand scheduling of Grid tasks for data-intensive applications

Background
◦ Data-intensive Grid applications are divided into
chunks.
◦ Data-intensive applications access a large set of files –
file transfer time is a bottleneck (from Coadd
experience)
◦ Different tasks access many files together (locality)

Problem
◦ How to schedule tasks in a data-intensive application
that access a large set of files?
67
Worker-Centric Scheduling

Solution: on-demand scheduling based on worker availability and reuse of files (sketched below)
◦ Task-centric vs. worker-centric: consideration of worker availability for execution
◦ We have proposed a series of worker-centric scheduling heuristics that minimize file transfers
 Result
◦ Tested our heuristics with a real trace
◦ Compared them with state-of-the-art task-centric scheduling
◦ Showed task completion time reductions of up to 20%
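A minimal sketch of the worker-centric idea (hypothetical, not one of the thesis's actual heuristics): scheduling is triggered when a worker becomes available, and the chosen task is the one that reuses the most files the worker already holds, minimizing new file transfers.

    # Sketch of worker-centric task selection: the requesting worker's state
    # (its cached files) drives which task it receives.
    def pick_task_for_worker(worker_files, pending_tasks):
        # worker_files: set of files cached on the requesting worker.
        # pending_tasks: dict task_id -> set of files the task needs.
        def transfers_needed(task_id):
            return len(pending_tasks[task_id] - worker_files)
        return min(pending_tasks, key=transfers_needed) if pending_tasks else None

    pending_tasks = {
        "t1": {"f1", "f2", "f3"},
        "t2": {"f4", "f5"},
    }
    worker_files = {"f1", "f2"}
    task = pick_task_for_worker(worker_files, pending_tasks)
    print(task)                               # "t1": only f3 must be transferred
    worker_files |= pending_tasks.pop(task)   # worker now also caches f3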
68
MPIL (Multi-Path Insertion/Lookup)

One-line summary
◦ On-demand key/value lookup algorithm under
dynamic environments

Background
◦ Various DHTs (Distributed Hash Tables) have
been proposed.
◦ Real-world trace studies show churn is a threat
to DHTs’ performance.
◦ DHTs employ aggressive maintenance algorithms
to combat churn – low lookup cost, but high
maintenance cost
69
MPIL (Multi-Path Insertion/Lookup)

Alternative: MPIL
◦ Is a new DHT routing algorithm
◦ Has low maintenance cost, but slightly higher
lookup cost

Features
◦ Uses multi-path routing to combat dynamism
(perturbation in the environment).
◦ Runs over any overlay – no need for a
maintenance algorithm
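A rough and greatly simplified sketch of the multi-path idea (not MPIL's actual routing rule): each hop forwards the lookup to several neighbors that are closer to the key, so one unresponsive peer on a path does not stall the lookup.

    # Greatly simplified multi-path lookup over an arbitrary overlay: every hop
    # forwards to up to `fanout` neighbors whose IDs are closer to the key.
    def lookup(key, start, neighbors, store, fanout=2, alive=lambda n: True):
        # neighbors: node -> list of overlay neighbors (any topology).
        # store: node -> dict of key/value pairs held at that node.
        frontier, visited = [start], {start}
        while frontier:
            next_frontier = []
            for node in frontier:
                if key in store.get(node, {}):
                    return store[node][key]
                closer = sorted((n for n in neighbors[node]
                                 if n not in visited and alive(n)
                                 and abs(n - key) < abs(node - key)),
                                key=lambda n: abs(n - key))[:fanout]
                visited.update(closer)
                next_frontier.extend(closer)
            frontier = next_frontier
        return None

    # Toy 5-node overlay with integer IDs; key 14 is stored at node 15.
    neighbors = {1: [4, 9], 4: [1, 9, 15], 9: [1, 4, 20], 15: [4, 20], 20: [9, 15]}
    store = {15: {14: "value-at-15"}}
    print(lookup(14, start=1, neighbors=neighbors, store=store))  # "value-at-15"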
70