Improving MapReduce
Performance in Heterogeneous
Environments
Sturzu Antonio-Gabriel, SCPD
Table of Contents
Intro
Scheduling in Hadoop
Heterogeneity in Hadoop
The LATE Scheduler (Longest Approximate
Time to End)
Evaluation of Performance
Intro
The large volume of data that internet services work
on has led to a need for parallel processing
The leading example is Google, which uses
MapReduce to process 20 petabytes of data per
day
MapReduce breaks a computation into small tasks
that run in parallel on multiple machines and
scales easily to very large clusters
The two key benefits of MapReduce are:
– Fault tolerance
– Speculative execution
Intro
Google has noted that speculative execution
can improve job response times by 44%
The paper shows an efficient way to do
speculative execution in order to maximize
performance
It also shows that Hadoop’s simple speculative
algorithm, based on comparing each task’s
progress to the average progress, breaks down in
heterogeneous systems
Intro
The proposed scheduling algorithm improves
Hadoop’s response time by up to a factor of two
Most of the examples and tests are done on
Amazon’s Elastic Compute Cloud (EC2)
The paper addresses two important problems in
speculative execution:
– Choosing the best node to run the speculative task
– Distinguishing between nodes slightly slower than the
mean and stragglers
Scheduling in Hadoop
Hadoop divides each MapReduce job into
tasks
The input file is split into even-sized chunks
replicated for fault tolerance
Each chunk of input is first processed by a
map task that outputs a set of key-value
pairs
Map outputs are split into buckets based on
the key
Scheduling in Hadoop
When all map tasks finish, reducers apply a
reduce function to the set of values
associated with each key
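The map, bucket and reduce steps above can be sketched in a few lines of plain Python. This is a single-process illustration only, not Hadoop's API: the function names (`map_words`, `partition`, `reduce_counts`, `run_job`) and the CRC32-based bucketing are assumptions made for the example.

```python
import zlib
from collections import defaultdict


def map_words(chunk):
    """Map task: emit a (word, 1) pair for every word in one input chunk."""
    return [(word, 1) for word in chunk.split()]


def partition(pairs, num_reducers):
    """Split map output into buckets by key, one bucket per reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        # Hypothetical partitioner: a stable hash of the key picks the bucket.
        index = zlib.crc32(key.encode("utf-8")) % num_reducers
        buckets[index][key].append(value)
    return buckets


def reduce_counts(bucket):
    """Reduce task: apply the reduce function (here, sum) to each key's values."""
    return {key: sum(values) for key, values in bucket.items()}


def run_job(chunks, num_reducers=2):
    # All map tasks complete before any reduce function is applied.
    pairs = [pair for chunk in chunks for pair in map_words(chunk)]
    result = {}
    for bucket in partition(pairs, num_reducers):
        result.update(reduce_counts(bucket))
    return result
```

Calling `run_job(["a b a", "b c"])` counts each word across both chunks; every key lands in exactly one bucket, so the per-bucket results can simply be merged.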
Scheduling in Hadoop
Hadoop runs several maps and reduces
concurrently on each slave in order to
overlap I/O with computation
Each slave tells the master when it has
empty task slots; the scheduler then assigns
it a task in this order:
– First, any failed task is given priority
– Second, a non-running task
– Third, a task to execute speculatively
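The priority order above can be written as a small selection function. This is a sketch under the assumption that the scheduler keeps three simple queues; `pick_task` and its argument names are hypothetical and do not correspond to Hadoop's actual classes.

```python
def pick_task(failed, pending, speculative_candidates):
    """Choose work for a free slot: failed tasks first, then tasks that have
    not run yet, then a speculative copy. Returns (kind, task) or None."""
    if failed:
        return ("retry", failed.pop(0))
    if pending:
        return ("new", pending.pop(0))
    if speculative_candidates:
        return ("speculative", speculative_candidates.pop(0))
    return None  # nothing to run; the slot stays idle
```

A speculative copy is only considered once there is no failed or unstarted task left, which is why speculation mostly happens toward the end of a job.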
Scheduling in Hadoop
To select speculative tasks, Hadoop uses a
progress score between 0 and 1
For a map, the progress score is the fraction of
input data read
For a reduce task, the execution is divided into 3
phases, each of which accounts for 1/3 of the
score:
– The copy phase
– The sort phase
– The reduce phase
Scheduling in Hadoop
In each phase the score is the fraction of
data processed
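The scoring rules above translate directly into code. The following is a minimal sketch; the function names and the `REDUCE_PHASES` tuple are illustrative assumptions, not Hadoop identifiers.

```python
REDUCE_PHASES = ("copy", "sort", "reduce")


def map_progress(bytes_read, bytes_total):
    """Progress score of a map task: fraction of its input data read."""
    return bytes_read / bytes_total


def reduce_progress(phase, fraction_done):
    """Progress score of a reduce task: each phase contributes 1/3, and the
    current phase contributes in proportion to the data it has processed."""
    completed_phases = REDUCE_PHASES.index(phase)
    return (completed_phases + fraction_done) / len(REDUCE_PHASES)
```

For example, a reduce task halfway through its sort phase has finished one full phase plus half of another, giving a score of (1 + 0.5) / 3 = 0.5.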
Hadoop calculates an average score for
each category of tasks in order to define a
threshold for speculative execution
When a task’s progress score is less than
the average for its category minus 0.2, and
the task has run for at least one minute, it is
marked as a straggler
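The straggler rule can be sketched as follows. The task representation (dicts with `id`, `progress` and `start` keys) and the function name are assumptions made for the example; the 0.2 margin and one-minute minimum come from the rule above.

```python
def find_stragglers(tasks, now, margin=0.2, min_runtime=60.0):
    """Return ids of tasks that Hadoop's rule would mark as stragglers.

    tasks: tasks of ONE category (all maps or all reduces), each a dict with
    'id', 'progress' (0..1) and 'start' (start time in seconds).
    """
    if not tasks:
        return []
    average = sum(t["progress"] for t in tasks) / len(tasks)
    threshold = average - margin
    return [
        t["id"]
        for t in tasks
        if t["progress"] < threshold and now - t["start"] >= min_runtime
    ]
```

Note that the threshold is fixed relative to the average: a task at 0.1 and a task at 0.35 are treated as equally slow when the threshold is 0.4.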
Scheduling in Hadoop
All tasks beyond the threshold are
considered equally slow and ties between
them are broken by data locality
This threshold works well in homogeneous
systems because tasks tend to start and
finish in “waves” at roughly the same times
When running multiple jobs, Hadoop uses a
FIFO discipline (the earliest submitted job is
asked for a task first)
Scheduling in Hadoop
Assumptions made by the Hadoop scheduler:
– Nodes can perform work at roughly the same
rate
– Tasks progress at a constant rate over
time
– There is no cost to launching a speculative task
on a node that would otherwise have an idle slot
Scheduling in Hadoop
– A task’s progress score is representative of the
fraction of its total work that it has done
– Tasks tend to finish in waves, so a task with a
low progress score is likely a straggler
– Tasks in the same category (map or reduce)
require roughly the same amount of work
Heterogeneity in Hadoop
Too many speculative tasks are launched
because of the fixed threshold, taking resources
away from useful tasks (assumption 3, that
speculation costs nothing, fails)
Because the scheduler uses data locality to
rank candidates for speculative execution,
the wrong tasks may be chosen first
Assumptions 3, 4 and 5 can fail in both
homogeneous and heterogeneous clusters
The LATE Scheduler
The main idea is that LATE speculatively
executes the task expected to finish farthest in
the future
It estimates each task’s progress rate as
ProgressScore / T, where T is the amount of
time the task has been running
The time to completion is estimated as
(1 - ProgressScore) / ProgressRate
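The estimate is simple enough to compute by hand, but a short sketch makes the units explicit. The function names are illustrative assumptions; the formulas are the ones given above.

```python
def progress_rate(progress_score, elapsed):
    """ProgressRate = ProgressScore / T, with T in seconds."""
    if elapsed <= 0:
        raise ValueError("elapsed running time must be positive")
    return progress_score / elapsed


def estimated_time_left(progress_score, elapsed):
    """Time to completion = (1 - ProgressScore) / ProgressRate."""
    rate = progress_rate(progress_score, elapsed)
    if rate == 0:
        return float("inf")  # no progress yet, so no finite estimate
    return (1.0 - progress_score) / rate
```

For instance, a task with a progress score of 0.25 after 50 seconds has a rate of 0.005 per second and roughly 150 seconds left, while a task at 0.9 after 300 seconds has only about 33 seconds left. LATE would rank the first task higher for speculation.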
The LATE Scheduler
To give the speculative copy the best chance of
beating the original task, LATE launches
speculative tasks only on fast nodes
It does this with a SlowNodeThreshold: nodes whose
total work performed is below the threshold do not
receive speculative tasks
Because speculative tasks cost resources, LATE
uses two additional heuristics:
– A limit on the number of speculative tasks running
at once (SpeculativeCap)
– A SlowTaskThreshold that determines whether a task is slow
enough to be speculated (compared against the task’s
progress rate)
The LATE Scheduler
When a node asks for a new task and fewer than
SpeculativeCap speculative tasks are running:
– If the node’s total progress is below
SlowNodeThreshold, ignore the request
– Rank currently running tasks that are not being
speculated by estimated time to completion
– Launch a copy of the highest-ranked task whose
progress rate is below SlowTaskThreshold
LATE does not take data locality into account
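The decision procedure above can be sketched in Python. This is an illustration under stated assumptions, not the authors' implementation: the function name, the dict-based task records, and the choice to pass all three thresholds as absolute values are hypothetical simplifications.

```python
def late_pick(node_total_progress, running, speculative_running, now,
              speculative_cap, slow_node_threshold, slow_task_threshold):
    """Return the id of the task to speculate on this node, or None.

    running: list of dicts with 'id', 'progress' (0..1), 'start' (seconds)
    and 'speculated' (True if a speculative copy already exists).
    """
    if speculative_running >= speculative_cap:
        return None  # SpeculativeCap reached
    if node_total_progress < slow_node_threshold:
        return None  # slow node: ignore the request
    candidates = []
    for task in running:
        elapsed = now - task["start"]
        if task["speculated"] or elapsed <= 0:
            continue
        rate = task["progress"] / elapsed
        if rate >= slow_task_threshold:
            continue  # not slow enough to be worth speculating
        time_left = float("inf") if rate == 0 else (1.0 - task["progress"]) / rate
        candidates.append((time_left, task["id"]))
    if not candidates:
        return None
    # Longest approximate time to end wins.
    return max(candidates)[1]
```

Filtering by SlowTaskThreshold before ranking gives the same result as ranking first and then taking the highest-ranked task below the threshold.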
The LATE Scheduler
Advantages of the algorithm:
– Robust to node heterogeneity because it relaunches only
the slowest tasks, and only a few of them
– Prioritizes among slow tasks based on how much they hurt
response time
– Takes node heterogeneity into account when choosing
the node on which to run a speculative task
– Speculatively executes only tasks that will improve the total
response time, not just any slow task
The LATE Scheduler
The completion time estimate can be
wrong when a task’s progress rate
decreases, but in general it gives good
approximations for typical MapReduce jobs
Evaluation
To create heterogeneity, the authors
mapped a variable number of virtual
machines (from 1 to 8) onto each host in the
EC2 cluster
They measured the impact of contention on
I/O performance and application-level
performance
Evaluation
For application-level performance they sorted 100 GB of
random data using Hadoop’s Sort benchmark with
speculative execution disabled
With isolated VMs the job completed in 408 s, and
with VMs packed densely onto physical hosts (7
VMs per host) it took 1094 s
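As a quick check of the two timings above, the slowdown caused by packing VMs densely works out to:

```latex
\[
\frac{1094\ \text{s}}{408\ \text{s}} \approx 2.7
\]
```

That is, contention alone made the same sort job roughly 2.7 times slower.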
To evaluate the scheduling algorithms they
used clusters of about 200 VMs and
performed 5-7 runs
Evaluation
Results for scheduling in a heterogeneous
cluster of 243 VMs using the Sort job (128 MB
per host, for a total of 30 GB of data):
Evaluation
On average LATE finished jobs 27% faster
than Hadoop’s native scheduler and 31%
faster than no speculation
Results for scheduling with stragglers
To simulate stragglers, they manually
slowed down eight VMs in a cluster of 100
Evaluation
For each run they sorted 256 MB per host,
for a total of 25 GB:
Evaluation
They also ran two other workloads on a
heterogeneous cluster with stragglers:
– Grep
– WordCount
They used a 204-node cluster with 1 to 8
VMs per host
For the Grep test they searched 43 GB of
text data, or about 200 MB per host
Evaluation
On average LATE finished jobs 36% faster
than Hadoop’s native scheduler and 56%
faster than no speculation
For the WordCount test they used a data set
of 21 GB, or 100 MB per host
Evaluation
Sensitivity analysis
– SpeculativeCap
– SlowTaskThreshold
– SlowNodeThreshold
SpeculativeCap results
They ran experiments at six SpeculativeCap
values from 2.5% to 100%, repeating each
experiment 5 times
Evaluation
Sensitivity to SlowTaskThreshold
The idea here is to avoid speculating tasks that
are progressing fast when they are the only
tasks left
They tested six values from 5% to 100%
Evaluation
We observe that values past 25% all work
well, with 25% being the optimum value
Sensitivity to SlowNodeThreshold
Evaluation