Improving MapReduce Performance in Heterogeneous Environments

Sturzu Antonio-Gabriel, SCPD
Table of Contents
 Intro
 Scheduling in Hadoop
 Heterogeneity in Hadoop
 The LATE Scheduler (Longest Approximate Time to End)
 Evaluation of Performance
Intro
 The large volume of data that Internet services
process has created a need for parallel processing
 The leading example is Google, which uses
MapReduce to process 20 petabytes of data per
day
 MapReduce breaks a computation into small tasks
that run in parallel on multiple machines and
scales easily to very large clusters
 The two key benefits of MapReduce are:
– Fault tolerance
– Speculative execution
Intro
 Google has noted that speculative execution
improves response time by 44%
 The paper shows an efficient way to do
speculative execution in order to maximize
performance
 It also shows that Hadoop’s simple speculative
algorithm, based on comparing each task’s
progress to the average progress, breaks down in
heterogeneous systems
Intro
 The proposed scheduling algorithm improves
Hadoop’s response time by a factor of two
 Most of the examples and tests are done on
Amazon’s Elastic Compute Cloud (EC2)
 The paper addresses two important problems in
speculative execution:
– Choosing the best node to run the speculative task
– Distinguishing between nodes slightly slower than the
mean and stragglers
Scheduling in Hadoop
 Hadoop divides each MapReduce job into
tasks
 The input file is split into even-sized chunks
replicated for fault tolerance
 Each chunk of input is first processed by a
map task that outputs a set of key-value
pairs
 Map outputs are split into buckets based on
the key
Scheduling in Hadoop
 When all map tasks finish, reducers apply a
reduce function to the set of values
associated with each key
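A minimal sketch of the map, bucket and reduce steps described above, using word counting as the example job. The function names (`map_task`, `partition`, `reduce_task`, `run_job`) and the toy partitioner are illustrative assumptions, not Hadoop's API, and everything runs sequentially in one process, whereas a real cluster runs the tasks in parallel on many machines.

```python
from collections import defaultdict


def map_task(chunk):
    """Emit a (word, 1) pair for every word in one input chunk."""
    return [(word, 1) for word in chunk.split()]


def partition(pairs, num_reducers):
    """Split map output into one bucket per reducer, based on the key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        # Toy partitioner (an assumption): sum of the key's bytes modulo
        # the number of reducers, so the same key always lands in the
        # same bucket.
        index = sum(key.encode()) % num_reducers
        buckets[index][key].append(value)
    return buckets


def reduce_task(bucket):
    """Apply the reduce function (a sum) to the values of each key."""
    return {key: sum(values) for key, values in bucket.items()}


def run_job(chunks, num_reducers=2):
    """Run all map tasks first, then the reduce tasks."""
    map_output = []
    for chunk in chunks:
        map_output.extend(map_task(chunk))
    result = {}
    for bucket in partition(map_output, num_reducers):
        result.update(reduce_task(bucket))
    return result


counts = run_job(["a b a", "b c"])
print(counts)
```

Each key ends up in exactly one bucket, which is why the per-bucket results can be merged without conflicts.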
Scheduling in Hadoop
 Hadoop runs several maps and reduces
concurrently on each slave in order to
overlap I/O with computation
 Each slave tells the master when it has
empty task slots; the scheduler then assigns
it a task in the following order:
– First, any failed task is given priority
– Second, a non-running task
– Third, a task to execute speculatively
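The priority order above can be sketched as a small selection function. The name `pick_task` and the three plain lists are hypothetical; Hadoop's real scheduler keeps richer state and also considers data locality.

```python
def pick_task(failed, pending, speculative_candidates):
    """Choose work for one free slot.

    Returns a (kind, task) tuple, or None when there is nothing to run.
    The three arguments are lists that this function consumes from the front.
    """
    if failed:
        # 1. Failed tasks are retried first.
        return ("retry", failed.pop(0))
    if pending:
        # 2. Then tasks that have not run yet.
        return ("new", pending.pop(0))
    if speculative_candidates:
        # 3. Only then is a speculative copy launched.
        return ("speculative", speculative_candidates.pop(0))
    return None


failed = ["t3"]
pending = ["t7", "t8"]
candidates = ["t1"]
order = [pick_task(failed, pending, candidates) for _ in range(5)]
print(order)
```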
Scheduling in Hadoop
 To select speculative tasks, Hadoop uses a
progress score between 0 and 1
 For a map, the progress score is the fraction of
input data read
 For a reduce task, the execution is divided into 3
phases, each of which accounts for 1/3 of the
score:
– The copy phase
– The sort phase
– The reduce phase
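A short sketch of how such a progress score could be computed, assuming that the score inside a phase is the fraction of that phase's data already processed. The function names and the `REDUCE_PHASES` tuple are illustrative, not Hadoop identifiers.

```python
REDUCE_PHASES = ("copy", "sort", "reduce")


def map_progress(bytes_read, bytes_total):
    """Progress of a map task: fraction of its input data read."""
    return bytes_read / bytes_total


def reduce_progress(phase, fraction_in_phase):
    """Progress of a reduce task.

    Every completed phase contributes 1/3, and the current phase
    contributes 1/3 times the fraction of its data processed so far.
    """
    completed_phases = REDUCE_PHASES.index(phase)
    return (completed_phases + fraction_in_phase) / len(REDUCE_PHASES)


print(map_progress(32, 128))
print(reduce_progress("copy", 0.5))
print(reduce_progress("sort", 0.5))
```

Under these assumptions, a reduce task halfway through its sort phase has finished the copy phase (1/3) plus half of the sort phase (1/6), so its score is 1/2.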
Scheduling in Hadoop
 In each phase the score is the fraction of
data processed
 Hadoop calculates an average score for
each category of tasks in order to define a
threshold for speculative execution
 When a task’s progress score is less than
the average for its category minus 0.2, and
the task has run for at least one minute, it is
marked as a straggler
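The straggler rule can be sketched as follows. The task representation (dicts with `id`, `progress` and `start` keys) and the function name `find_stragglers` are assumptions made for the example; the 0.2 gap and the one-minute minimum come from the rule above.

```python
def find_stragglers(tasks, now, gap=0.2, min_runtime=60.0):
    """Return the ids of tasks that count as stragglers.

    `tasks` holds running tasks of a single category (all maps or all
    reduces). Each task is a dict with 'id', 'progress' (0 to 1) and
    'start' (start time in seconds).
    """
    if not tasks:
        return []
    average = sum(t["progress"] for t in tasks) / len(tasks)
    threshold = average - gap
    return [
        t["id"]
        for t in tasks
        if t["progress"] < threshold and now - t["start"] >= min_runtime
    ]


tasks = [
    {"id": "t1", "progress": 0.9, "start": 0},
    {"id": "t2", "progress": 0.8, "start": 0},
    {"id": "t3", "progress": 0.7, "start": 0},
    {"id": "t4", "progress": 0.2, "start": 0},
    {"id": "t5", "progress": 0.1, "start": 100},  # has run for only 20 s
]
stragglers = find_stragglers(tasks, now=120)
print(stragglers)
```

In this made-up data the average is 0.54, so the threshold is 0.34. Both `t4` and `t5` are below it, but `t5` has not yet run for a minute and is not marked.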
Scheduling in Hadoop
 All tasks below the threshold are
considered equally slow, and ties between
them are broken by data locality
 This threshold works well in homogeneous
systems because tasks tend to start and
finish in “waves” at roughly the same times
 When running multiple jobs Hadoop uses a
FIFO discipline
Scheduling in Hadoop
 Assumptions made by the Hadoop scheduler:
1. Nodes can perform work at roughly the same
rate
2. Tasks progress at a constant rate throughout
time
3. There is no cost to launching a speculative task
on a node that would otherwise have an idle slot
Scheduling in Hadoop
4. A task’s progress score is representative of
the fraction of its total work that it has done
5. Tasks tend to finish in waves, so a task with a
low progress score is likely a straggler
6. Tasks in the same category (map or reduce)
require roughly the same amount of work
Heterogeneity in Hadoop
 Too many speculative tasks are launched
because of the fixed threshold, taking resources
away from useful tasks (assumption 3 fails)
 Because the scheduler uses data locality to
rank candidates for speculative execution,
the wrong tasks may be chosen first
 Assumptions 3, 4 and 5 can fail on both
homogeneous and heterogeneous clusters
The LATE Scheduler
 The main idea is that it speculatively
executes the task that will finish farthest in
the future
 Estimates the progress rate of each task as
ProgressScore/T, where T is the amount of
time the task has been running
 The estimated time to completion is
(1 - ProgressScore)/ProgressRate
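The two formulas translate directly into code. This is a minimal sketch: the function names are illustrative, and the elapsed time is assumed to be greater than zero.

```python
def progress_rate(progress_score, elapsed):
    """ProgressScore / T, where T is the running time in seconds (T > 0)."""
    return progress_score / elapsed


def estimated_time_left(progress_score, elapsed):
    """(1 - ProgressScore) / ProgressRate, in seconds."""
    rate = progress_rate(progress_score, elapsed)
    if rate == 0:
        # A task that has made no progress has no finite estimate.
        return float("inf")
    return (1.0 - progress_score) / rate


# Made-up example: task A is 25% done after 50 s, task B is 80% done after 100 s.
left_a = estimated_time_left(0.25, 50.0)
left_b = estimated_time_left(0.80, 100.0)
print(left_a, left_b)
```

Task A is estimated to need about 150 more seconds and task B about 25, so A is the one expected to finish farthest in the future.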
The LATE Scheduler
 To give the speculative copy the best chance of
beating the original task, the algorithm launches
speculative tasks only on fast nodes
 It does this using a SlowNodeThreshold, a
threshold on the total work a node has performed
 Because speculative tasks cost resources, LATE
uses two additional heuristics:
– A limit on the number of speculative tasks running
at once (SpeculativeCap)
– A SlowTaskThreshold that determines whether a task
is slow enough to be speculated (compared against
the task’s progress rate)
The LATE Scheduler
 When a node asks for a new task and the number
of running speculative tasks is below SpeculativeCap:
– If the node’s total progress is below
SlowNodeThreshold, ignore the request
– Rank currently running tasks that are not being
speculated by estimated time to completion
– Launch a copy of the highest-ranked task whose
progress rate is below SlowTaskThreshold
 LATE does not take data locality into account
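The decision procedure above can be sketched as one function. This is an illustration under simplifying assumptions, not the scheduler's actual implementation: the `Task` class and `late_pick` are hypothetical names, and the three thresholds are passed in as already-resolved absolute values (how they are derived from percentages is left out).

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    progress: float           # progress score in [0, 1]
    elapsed: float            # seconds the task has been running, assumed > 0
    speculated: bool = False  # True if a speculative copy already exists

    @property
    def rate(self):
        return self.progress / self.elapsed

    @property
    def time_left(self):
        if self.rate == 0:
            return float("inf")
        return (1.0 - self.progress) / self.rate


def late_pick(node_total_progress, running, num_speculative,
              speculative_cap, slow_node_threshold, slow_task_threshold):
    """Return the task to speculate on the requesting node, or None."""
    if num_speculative >= speculative_cap:
        return None  # too many speculative copies are already running
    if node_total_progress < slow_node_threshold:
        return None  # the requesting node is itself slow
    candidates = [t for t in running if not t.speculated]
    # Longest estimated time to completion first.
    candidates.sort(key=lambda t: t.time_left, reverse=True)
    for task in candidates:
        if task.rate < slow_task_threshold:
            return task
    return None


running = [
    Task("a", 0.9, 100.0),
    Task("b", 0.2, 100.0),
    Task("c", 0.5, 100.0, speculated=True),
    Task("d", 0.4, 100.0),
]
choice = late_pick(node_total_progress=5.0, running=running, num_speculative=1,
                   speculative_cap=2, slow_node_threshold=2.0,
                   slow_task_threshold=0.003)
print(choice.task_id if choice else None)
```

With these made-up numbers, task `b` has the longest estimated time to completion and a progress rate below the threshold, so it is the one chosen.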
The LATE Scheduler
 Advantages of the algorithm:
– Robust to node heterogeneity because it relaunches only
the slowest tasks, and only a few of them
– Prioritizes among slow tasks based on how much they hurt
response time
– Takes into account node heterogeneity when choosing
on which node to run a speculative task
– Speculatively executes only tasks that will improve the
total response time, not any slow task
The LATE Scheduler
 The completion time estimate can be wrong
when a task’s progress rate decreases, but in
general it gives correct approximations for
typical MapReduce jobs
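A small numeric illustration of the failure case, with entirely made-up numbers: a task reaches 60% progress in its first 10 seconds and then slows to 1% of progress per second. The estimate assumes the past rate continues, so it is far too optimistic. The function names are hypothetical.

```python
def estimated_time_left(progress_score, elapsed):
    """Estimate that assumes the average rate so far continues."""
    rate = progress_score / elapsed
    return (1.0 - progress_score) / rate


def true_time_left(progress_score, future_rate):
    """Remaining time if the task really proceeds at future_rate from now on."""
    return (1.0 - progress_score) / future_rate


estimate = estimated_time_left(0.6, 10.0)  # past rate: 0.06 per second
actual = true_time_left(0.6, 0.01)         # future rate: 0.01 per second
print(round(estimate, 2), round(actual, 2))
```

Here the estimate is under 7 seconds while the task actually needs about 40, so a scheduler ranking by the estimate would underrate how late this task will finish.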
Evaluation
 To create heterogeneity, they
mapped a variable number of virtual
machines (from 1 to 8) onto each host in the
EC2 cluster
 They measured the impact of contention on
I/O performance and application-level
performance
Evaluation
 For application-level performance they sorted 100 GB of
random data using Hadoop’s Sort benchmark with
speculative execution disabled
 With isolated VMs the job completed in 408 s, and
with VMs packed densely onto physical hosts (7
VMs per host) it took 1094 s, about 2.7 times longer
 For evaluating the scheduling algorithms they
used clusters of about 200 VMs and
performed 5-7 runs
Evaluation
 Results for scheduling in a heterogeneous
cluster of 243 VMs using the Sort job (128 MB
per host, for a total of 30 GB of data)
Evaluation
 On average, LATE finished jobs 27% faster
than Hadoop’s native scheduler and 31%
faster than no speculation
 Results for scheduling with stragglers
 To simulate stragglers, they manually
slowed down eight VMs in a cluster of 100
Evaluation
 For each run they sorted 256 MB per host,
for a total of 25 GB
Evaluation
 They also ran two other workloads on a
heterogeneous cluster with stragglers:
– Grep
– WordCount
 They used a 204-node cluster with 1 to 8
VMs per host
 For the Grep test they searched 43 GB of
text data, or about 200 MB per host
Evaluation
 In the Grep test, LATE on average finished jobs
36% faster than Hadoop’s native scheduler and 56%
faster than no speculation
 For the WordCount test they used a data set
of 21 GB, or 100 MB per host
Evaluation
 Sensitivity analysis
– SpeculativeCap
– SlowTaskThreshold
– SlowNodeThreshold
 SpeculativeCap results
 They ran experiments at six SpeculativeCap
values from 2.5% to 100%, repeating each
experiment 5 times
Evaluation
 Sensitivity to SlowTaskThreshold
 Here the idea is to avoid speculating tasks that
are progressing fast when they are the only
tasks left
 They tested 6 values from 5% to 100%
Evaluation
 We observe that values past 25% all work
well, with 25% being the optimum value
 Sensitivity to SlowNodeThreshold