Locality and Network-Aware Reduce Task Scheduling for Data-Intensive Applications

Engin Arslan, University at Buffalo (SUNY), enginars@buffalo.edu
Mrigank Shekhar, Intel Corporation, mrigank.shekhar@intel.com
Tevfik Kosar, University at Buffalo (SUNY), tkosar@buffalo.edu
ABSTRACT
MapReduce is one of the leading programming frameworks
to implement data-intensive applications by splitting the
map and reduce tasks to distributed servers. Although there
has been a substantial amount of work on map task scheduling
and optimization in the literature, the work on reduce task
scheduling is very limited. Effective scheduling of the reduce tasks to the resources becomes especially important for
the performance of data-intensive applications where large
amounts of data are moved between the map and reduce
tasks. In this paper, we propose a new algorithm (LoNARS)
for reduce task scheduling, which takes both data locality
and network traffic into consideration. Data locality awareness aims to schedule the reduce tasks closer to the map
tasks to decrease the delay in data access as well as the
amount of traffic pushed to the network. Network traffic
awareness intends to distribute the traffic over the whole
network and minimize the hotspots to reduce the effect of
network congestion in data transfers. We have integrated
LoNARS into Hadoop-1.2.1. Using our LoNARS algorithm,
we achieved up to 15% gain in data shuffling time and up to
3-4% improvement in total job completion time compared to
the other reduce task scheduling algorithms. Moreover, we
reduced the amount of traffic on network switches by 15%
which helps reduce energy consumption considerably.
1. INTRODUCTION
The increasing data requirements of commercial and scientific applications have led to new programming paradigms
and complex scheduling algorithms for efficient processing of
these data sets. MapReduce [1] is one of the programming
paradigms proposed to overcome challenges of big data processing by effectively dividing big jobs into small tasks and
distributing them over a set of compute nodes. In such a
setting, the compute and data storage nodes can be different, which brings up the problem of co-locating data and
computation for efficient end-to-end computing.
To overcome the data-computation scheduling problem,
locality-aware data placement is considered a desired feature, and different solutions were proposed to optimize it
by arranging data location in a way that compute tasks are
always assigned to nodes which store the relevant data [12,
10]. On the other hand, MapReduce and its open source
implementation Hadoop [3] offload the data placement task
to HDFS which does not take data locality optimization into
consideration when making data placement decisions. Thus,
low data localization leads to high amounts of data to be
transferred to and from the compute nodes.
A typical MapReduce application is composed of three
phases: map, shuffle and reduce. Since data placement tasks
are managed mostly by the underlying filesystem, such as
HDFS in Hadoop, there are two main operations that heavily use the network to transfer data between computation
units. The first one is the data transfer during the map
phase where the task and the relevant data are located at
different parts of the datacenter. This would require copying the data from where it is stored to where the task is
scheduled. Zaharia et al. [13] proposed a delay scheduling algorithm to improve data locality for map tasks; their results showed that simple delay scheduling can raise map task data locality to near-maximum levels. The
second main operation that causes large network usage is
the shuffle phase where the outputs of the map tasks are
transferred to the reduce task locations in order to combine
all results. Unlike the optimization of map task scheduling, shuffle optimization has several interacting components to consider, which makes the problem more complex.
Hammoud et al. proposed the CoGRS algorithm [4] to increase shuffle locality by scheduling reduce tasks to nearby
map tasks. CoGRS calculates the optimal task tracker for
each reduce task by finding the center of gravity of associated
map tasks. During the shuffle phase, reduce tasks transfer
data from the map tasks and each reduce task could transfer a different amount of data from each map task. Hence,
reduce tasks should be placed nearby to locations of map
tasks that they will transfer output data from.
In this paper, we propose a locality and network-aware
reduce task scheduling algorithm (LoNARS) in order to optimize the shuffle phase of data-intensive MapReduce applications. We combine data locality awareness with network
traffic awareness in order to decrease the shuffle phase duration, as well as to lower the network traffic caused by these
data transfers. Our solution has two distinctive features.
First, it takes network bandwidth capacity and congestion
into account when comparing two potential paths for reduce
input data movement. This is crucial for end-to-end perfor-
mance of MapReduce applications since the heterogeneous
nature of the network connectivity between any two nodes
could result in a severe performance penalty. Also, it is possible to observe high load or congestion on a specific part of
the cluster or on one specific port of a switch. In this case,
it would be crucial to consider the impact of network congestion on the end-to-end performance and make scheduling
decisions accordingly. As the second distinctive feature of
LoNARS, instead of finding one optimal task tracker for a reduce task as in [4] (which could lead to sub-optimal scheduling when that task tracker is not available), we classify all candidate task trackers using a cost function and then choose from the optimal category, which typically contains more than one choice.
The rest of the paper is organized as follows. Section
2 introduces the background on locality based scheduling
methods in Hadoop. Our locality and network-aware reduce
task scheduling algorithm (LoNARS) is presented in Section
3. In Section 4, we analyze and evaluate the performance
gain by our solution. Section 5 concludes the paper
with a summary of our contributions.
2. BACKGROUND
There has been extensive research in the area of locality-awareness optimization in MapReduce. Some of these efforts
focused on data placement optimization [10, 12], some on
map task scheduling [14, 15, 11], and some on reduce task
optimizations [6, 10, 4].
Park et al. [11] proposed a locality-aware solution to the
map task scheduling in the context of VM resource provisioning. When multiple VMs are running on the same physical machines, all VMs obtain a certain portion of physical
resources. This work aimed to dynamically allocate physical
CPUs to VMs in such a way that underutilized CPUs can be
moved to busy VMs so that better resource utilization can
be accomplished. More importantly, in this approach tasks
are always assigned to the node with data. When a task is
scheduled to the VM where data resides, if all the cores are
busy, the task is kept in a queue to wait for an available resource. If other VMs on the same machine are not busy and
have underutilized CPUs, then the CPU resources are dynamically moved to the VMs with tasks in the queue so that
the tasks are always assigned in a locality-aware manner and
CPU resources are utilized better.
Palanisamy et al. [10] proposed Purlieus which couples
data placement and task scheduling to achieve higher locality. They claim that without considering the data placement scheme, it is hard to achieve high data locality since
random data placement might cause some nodes to be more
congested. They also claim that job characteristics, such as
how long it takes and how much data is processed in map
and reduce phases, can be obtained beforehand. An efficient
data placement scheme needs to take these characteristics
into consideration so that the data of long jobs are placed
on nodes with the least possible load.
Hammoud et al. [6] proposed LARTS, which focuses on the reduce task data locality problem. Native Hadoop implementations have an option to enable/disable early shuffling (H_ESON and H_ESOF), which determines whether or not to schedule reduce tasks while map tasks are still running. This improves the overall turnaround time but leads to inefficient reduce task assignment, since without knowledge of which nodes generated more reduce input, it is hard to decide where to launch the reduce tasks. LARTS tries to employ a solution which enables early shuffling along with efficient reduce task assignment. It does so by enabling early shuffling after a certain number of map tasks have finished. This lets LARTS efficiently choose the nodes on which to launch the reduce tasks, reducing network traffic and improving data locality. Later they proposed CoGRS [4], which calculates the optimal task tracker for each reduce task, similar to LARTS, but instead of rejecting a task tracker if no reduce task prefers it, it schedules one of the reduce tasks based on the closeness of the task tracker to the reduce task's optimal task tracker. CoGRS performs well under light network congestion and when servers have free slots most of the time. However, these assumptions might not hold true all the time, especially in production clusters where many jobs run simultaneously and network congestion is inevitable. Our proposed algorithm categorizes task trackers based on the cost function and finds multiple scheduling choices; it performs very close to CoGRS in finding the optimal task tracker, and outperforms CoGRS in lowering network traffic.

3. SYSTEM DESIGN
In this section, we explain the methodology of our proposed reduce task scheduling algorithm. In order to estimate the best task tracker for a reduce task, we define a cost function which gives the cost of assigning each reduce task r to each task tracker (TT). We define δ_{TT_r} as the cost of scheduling reduce task r on task tracker TT,

$$\delta_{TT_r} = \sum_{m=0}^{n} \frac{D_m}{BW(TT_m, r)} \times H_{TT_m,r} \qquad (1)$$
where D_m is the estimated shuffle output size of map task m for reduce task r, TT_m is the task tracker that executes map task m, and BW(TT_m, r) is an estimate of the network bandwidth available for the data transfer between task tracker TT_m and the task tracker chosen for reduce task r. H_{TT_m,r} is the number of hops between task tracker TT and TT_m. The cost function measures candidacy by two factors: (i) how much time it would take to complete the shuffling phase for this reduce task, and (ii) how close TT is to the map tasks.
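To make Equation 1 concrete, a minimal sketch in Java (Hadoop's implementation language) follows. MapTaskInfo, bandwidthEstimate, and hopCount are illustrative stand-ins of our own, not Hadoop or LoNARS APIs; the loop simply evaluates the sum in Equation 1.

```java
import java.util.List;

// A minimal sketch of Equation 1, not the actual LoNARS source.
public class CostFunction {
    static class MapTaskInfo {
        String trackerHost;           // TT_m: task tracker running map task m
        double estimatedShuffleBytes; // D_m: estimated shuffle output for reduce task r
    }

    // delta_{TT_r}: sum over feeding map tasks of (D_m / BW(TT_m, r)) * H_{TT_m, r}
    static double scheduleCost(String candidateTT, List<MapTaskInfo> feedingMaps) {
        double cost = 0.0;
        for (MapTaskInfo m : feedingMaps) {
            double bw = bandwidthEstimate(m.trackerHost, candidateTT); // e.g., from SNMP counters
            int hops = hopCount(m.trackerHost, candidateTT);           // network distance
            cost += (m.estimatedShuffleBytes / bw) * hops;             // transfer time weighted by distance
        }
        return cost;
    }

    static double bandwidthEstimate(String src, String dst) { return 1e8; } // stub: SNMP-based estimate
    static int hopCount(String src, String dst) { return 2; }               // stub: from cluster topology
}
```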
In order to find the transfer time of a shuffle, we estimate
the data size to be shuffled from the map task to each reduce
task. Although the exact shuffle data size can be learned
once the map task is completed, it is inefficient to wait for
all map tasks to finish execution [6]. Thus we extrapolate it
by proportioning the size of the currently available output
to the map task’s progress level. To estimate the network
bandwidth share for a shuffle operation, we set up an SNMP
client which regularly obtains utilization information from
the switches.
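The extrapolation itself is a single proportion; a sketch under the assumption that a map task's partial output size and progress fraction are observable from its status reports (the method name is ours, not a Hadoop API):

```java
// Linearly extrapolate a running map task's final shuffle output size.
public class ShuffleSizeEstimator {
    static double estimateShuffleSize(double currentOutputBytes, double progress) {
        if (progress <= 0.0) {
            return 0.0; // no information yet; the caller may fall back to a default
        }
        return currentOutputBytes / progress; // e.g., 12 MB at 40% progress -> ~30 MB
    }
}
```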
Our reduce task scheduling algorithm tries to choose the
optimal task tracker for each pending reduce task. Thus,
once a job is ready to schedule the reduce task, we wait one
heartbeat time to receive the request from all task trackers
which are available to accept a reduce task. During this time
period, we evaluate the cost function of each task tracker
with respect to each pending reduce task. As a result, we
are able to quantify each task tracker’s candidacy for each
reduce task based on cost function values. Then, for each
reduce task, we can partition task trackers into groups using
cost values via K-Means clustering. This helps us find more than one optimal candidate with only slight cost variations. Although the task tracker with the lowest cost is the best candidate for a reduce task, it is possible that multiple reduce tasks prefer the same task tracker. So, by grouping
task trackers, we can have more than one task tracker as
an optimal candidate. We used three levels of optimality
for task trackers with free reduce slots. Task trackers with
map tasks generally fall into the first group unless links that
are on the path of these task trackers are more congested
than those of the other candidate task trackers. The task trackers
which are located close to task trackers with map tasks and
have available network bandwidth fall into the second group.
The third group contains the rest of the task trackers with
available reduce slots.
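The grouping step can be sketched as one-dimensional Lloyd's k-means with k = 3 and a fixed iteration count, which is what gives the O(cN) complexity discussed later in this section. This is our own illustration, not the LoNARS source; partition() returns the per-group cost thresholds that play the role of r.costPartitions in Algorithm 1.

```java
import java.util.Arrays;

// Partition 1-D cost values into k groups with Lloyd's algorithm (fixed iterations).
// Assumes costs is non-empty; a production version would also handle empty groups.
public class CostPartitioner {
    // Returns the upper cost bound of each group, sorted in increasing order.
    static double[] partition(double[] costs, int k, int iterations) {
        double[] sorted = costs.clone();
        Arrays.sort(sorted);
        double[] centers = new double[k];
        for (int i = 0; i < k; i++) {                 // spread initial centers over the range
            centers[i] = sorted[(sorted.length - 1) * i / Math.max(1, k - 1)];
        }
        int[] assign = new int[costs.length];
        for (int it = 0; it < iterations; it++) {
            for (int i = 0; i < costs.length; i++) {  // assignment step: nearest center
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (Math.abs(costs[i] - centers[c]) < Math.abs(costs[i] - centers[best])) best = c;
                }
                assign[i] = best;
            }
            for (int c = 0; c < k; c++) {             // update step: mean of each group
                double sum = 0; int n = 0;
                for (int i = 0; i < costs.length; i++) {
                    if (assign[i] == c) { sum += costs[i]; n++; }
                }
                if (n > 0) centers[c] = sum / n;
            }
        }
        double[] bounds = new double[k];
        Arrays.fill(bounds, Double.NEGATIVE_INFINITY);
        for (int i = 0; i < costs.length; i++) {      // max cost within each group
            bounds[assign[i]] = Math.max(bounds[assign[i]], costs[i]);
        }
        Arrays.sort(bounds);
        return bounds; // bounds[0] = threshold of the optimal group
    }
}
```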
Algorithm 1 LoNARS Task Scheduling Algorithm
 1: function ScheduleReduceTask(TaskTracker TT)
 2:   for Reduce Task r in job.pendingReduceTasks do
 3:     δ_TT_r = calculateCost(TT, r)
 4:     r.TTCostList.put(TT, δ_TT_r)
 5:   end for
 6:   if job.timer == -1 then            ▷ All job timers are set to -1 at the beginning
 7:     job.timer = current_t
 8:     return false                     ▷ Skip this job
 9:   end if
10:   r.costPartitions = partition r.TTCostList into 3 groups via KMeans
11:   sort r.costPartitions in increasing order
12:   if current_t - job.timer > HEARTBEAT_FREQ and r.TTCostList.get(TT) ≤ r.costPartitions[0] then
13:     scheduleTask r to TT             ▷ Optimal scheduling
14:     return true
15:   else if current_t - job.timer > 2 × HEARTBEAT_FREQ and r.TTCostList.get(TT) ≤ r.costPartitions[1] then
16:     scheduleTask r to TT             ▷ Sub-optimal scheduling
17:     return true
18:   else if current_t - job.timer > 3 × HEARTBEAT_FREQ then
19:     scheduleTask r to TT             ▷ Non-optimal scheduling
20:     return true
21:   end if
22:   return false
23: end function
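For readers who prefer code over pseudocode, the escalation test at the core of Algorithm 1 can be sketched in Java as follows; this is our own illustrative rendering, and the names do not correspond to an actual Hadoop class.

```java
// Decide whether candidate tracker TT may take reduce task r, given how long
// the job has been waiting. The acceptance threshold widens by one heartbeat per tier.
public class TierAcceptance {
    static boolean accept(double cost, double[] tierBounds, long waitedMs, long heartbeatMs) {
        if (waitedMs > 3 * heartbeatMs) return true;                  // non-optimal: accept any tracker
        if (waitedMs > 2 * heartbeatMs) return cost <= tierBounds[1]; // sub-optimal group allowed
        if (waitedMs > heartbeatMs)     return cost <= tierBounds[0]; // optimal group only
        return false;                                                 // still within the first heartbeat
    }
}
```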
When a task tracker TT asks for a reduce task from the
job tracker, a job in the job list is chosen, and if the job
is ready to schedule its reduce task (by default, 5% of map
tasks have to complete before a job becomes ready to schedule reduce tasks), we call the ScheduleReduceTask function.
This function evaluates the cost of choosing this task tracker
for each pending reduce task of the job. Each pending reduce task keeps a list of costs of task trackers and updates
the cost of a task tracker every time the task tracker sends
heartbeat. After the cost of each pending reduce task of the
job is calculated for the requesting task tracker TT, we check if
the timer for this job has been initiated yet. If not, we set
it to current time and reject the task tracker for this job. If
the timer is already set and the difference of current_t and
timer is between one and two heartbeat times, then we accept only the task tracker from the first optimal cost group.
If none of the task trackers from the first group are available, we start accepting requests from the first and second
cost group after current_t − timer becomes greater than two
heartbeat times. In our experiments, for small and medium
size jobs, LoNARS is able to assign the task to a task tracker
from the first and second group most of the time. For jobs
with too many reduce tasks, task trackers from the third
optimal level are also chosen.
In terms of algorithm complexity, we used Lloyd's clustering to generate three groups from one-dimensional data. With a fixed number of iterations, this leads to O(cN) complexity. The complexity of the cost function evaluation is similar to CoGRS, except that LoNARS measures the cost for all available task trackers as opposed to considering only feeding nodes. This difference does not impose significant overhead for large jobs, since the number of feeding nodes of a reduce task can be expected to be very close to the total number of task trackers when a job has many map tasks. Thus, CoGRS and LoNARS have similar complexity for large jobs.
4. EVALUATION
We used Apache Hadoop 1.2.1 [2] to implement our custom reduce task scheduling method. We also tuned the heartbeat interval to 0.3 seconds. Although this might lead to high load on the job tracker when thousands of servers are in place, we leave finding the minimal acceptable heartbeat interval for a given cluster size to future work. We benchmarked LoNARS at micro and macro scales. Micro-scale benchmarking is conducted on the 12-server cluster topology shown in Figure 1, and macro-scale benchmarking on a 100-server cluster simulation using a MapReduce simulator.
4.1 Micro-benchmarking
We set up a 12-server cluster with the topology shown
in Figure 1. Four 1G SNMP-enabled switches (three ToR
switches and one Aggregate switch) were used to build a
two-layer network topology. We used 12 Ivy Bridge servers
with the following specifications: 10-core processors with
2.9 GHz CPU, 16GB RAM, and a single partition 2TB hard
disk. CentOS 6.2 is used as the operating system, without virtualization. Each server runs one task tracker, and each task
tracker is configured to have eight map and reduce slots.
One of the servers is configured to run the job tracker along
with its task tracker. The same server is also configured as
an SNMP client since the job tracker uses traffic information from switches to calculate the cost function. The Fair
Scheduler [13] is used in all experiments and simulations in
order to eliminate network congestion caused by missed map
task locality.
Figure 1: Cluster topology used in the micro experiments (one Aggregate switch connected over 1G links to three ToR switches serving Racks 1-3).
We benchmarked with three types of jobs to observe the
effect of LoNARS on data transfer time and total job time.
The jobs are listed in Table 1 along with some of their characteristics.

Table 1: Jobs used in micro-benchmarking

Job Type      Dataset Size  Map Tasks  Reduce Tasks  Map Dur. (sec)  Reduce Dur. (sec)  Map Output Size
WordCount 1   182 MB        3          1             17.6            2.8                7.8 MB
WordCount 2   940 MB        15         1             18.1            2.8                7.8 MB
WordCount 3   3.7 GB        76         1             20.3            3.2                7.8 MB
TeraSort 1    1 GB          16         5             3.5             4.5                63 MB
TeraSort 2    4.8 GB        59         15            8.3             6.8                67 MB
TeraSort 3    20 GB         298        30            9.2             15.7               68 MB
Recommender   20 M          1          1             21.3            5.7                10.8 MB

WordCount is an application to count the number of occurrences of each word in a given text. TeraSort sorts
a given set of input data generated by TeraGen. Recommender is a well-known machine learning application used in many real-life systems [8]. It is important to know how much time is spent in the different phases of job execution and the amount of data shuffled during the shuffle phase in order to evaluate the gain or loss imposed by a reduce scheduling algorithm. Map and reduce execution times are calculated by taking the average over all map tasks of the given job. Reduce task time does not cover shuffling time; it only covers sorting time and reduce function execution time. We wanted to see the effect of resource contention between tasks running on the same task tracker by running the same job at multiple sizes.
Figure 2: Total job completion time of jobs used in micro-benchmarking (Fifo, Fifo-Sim, LoNARS, LoNARS-Sim).

Figure 3: Transfer time per shuffle of jobs used in micro-benchmarking.
Based on the results of micro-benchmarking on the 12-server cluster, we implemented a MapReduce simulator which we later used to run our algorithm on larger-scale networks. In order to make sure the simulator generates close-to-real results, we ran the same benchmarks listed in Table 1 in the simulator (using the same network topology and taking resource contention into consideration) and compared the results with the 12-server cluster's results. Figures 2 and 3 show the results of benchmarking for the FIFO and LoNARS algorithms when run on both the cluster and the simulator. Shuffle times are calculated by dividing the total time a job spent on data transfer by the number of shuffle operations. We compared LoNARS with FIFO, which is Hadoop's default reduce scheduling method and schedules reduce tasks on a first-come, first-served basis.
Although a 12-server cluster is not an ideal comparison environment for reduce task scheduling algorithms, we still observed that LoNARS outperforms the FIFO algorithm in shuffle time for almost all job types. However, overall job completion time might be affected adversely, because the LoNARS algorithm requires at least one heartbeat period of waiting time before it chooses an optimal server. If the shuffle data size is very small and the network is not very congested, the gain from the shuffling phase turns out to be less than one heartbeat time, so overall job time increases slightly. On the other hand, in production clusters it is rarely possible to have no network congestion, so LoNARS will not perform worse than FIFO in most cases. Moreover, LoNARS also decreases the amount of traffic on the network by considering data locality. The benefit of reduced traffic size is explained in Section 4.2.3.
The simulator is capable of generating total job completion time results within a 5% error rate for all jobs. For data shuffling times, the error rate is less than 5% for TeraSort. However, the error rate can reach up to 20% for the WordCount and Recommender applications. This is because we use flow-based network simulation, which fails to achieve fine-grained results for short-lived flows. On the other hand, the correlation of simulator and cluster results is 99.8% for total job times and 98.7% for shuffle times, which is enough to see that even though the error rate is larger for shuffle times, the simulator matches the trends of the cluster.
4.2 Macro-benchmarking
We used the simulator to compare the algorithms on larger-scale networks. The network topology used in macro-benchmarking is shown in Figure 4. There are three network layers: ToR switches, where each ToR switch is connected to 10 servers; Aggregate switches, where each Aggregate switch is connected to five ToR switches; and a Core switch which connects the two Aggregate switches. In order to simulate over-subscription in datacenters, link capacities are set to 1G, 6G, and 10G from servers to ToR switches, ToR switches to Aggregate switches, and Aggregate switches to the Core switch, respectively. For example, a ToR switch can receive up to 10 Gbps from its 10 servers but has only a 6 Gbps uplink, an oversubscription ratio of roughly 1.7:1.
Table 2: Jobs used in macro-benchmarking

Job Name        Dataset Size  Map Tasks  Reduce Tasks  # of Jobs
WordCount 1     1.8 GB        30         2             30
WordCount 2     3.6 GB        60         10            10
WordCount 3     12.8 GB       200        30            5
TeraSort 1      300 MB        5          2             30
TeraSort 2      1.8 GB        30         5             10
TeraSort 3      12.8 GB       200        30            5
Recommender 1   60 MB         1          1             10
Recommender 2   300 MB        5          5             5
Recommender 3   640 MB        10         10            2
Permutation 1   300 MB        5          2             30
Permutation 2   640 MB        10         3             5
Permutation 3   1.2 GB        20         6             2

Figure 4: Network topology used in macro-benchmarking (a Core switch connected to two Aggregate switches, each serving five ToR switches over Racks 1-10).
We compared LoNARS with FIFO, RackAware, and CoGRS (Center of Gravity) [5]. The RackAware method distributes
reduce tasks across racks in order to achieve network load
balancing between racks. It starts scheduling jobs from any
task tracker in any rack, and then waits for task trackers from different racks so that reduce tasks can be distributed evenly
among racks. Finally, CoGRS determines the optimal task
tracker for each reduce task and tries to schedule the reduce
task on its optimal task tracker. Since it is possible that multiple reduce tasks prefer the same task tracker, one of them will be scheduled on its optimal task tracker; the others will be scheduled to task trackers located near the optimal task tracker in terms of network distance. However, we found that instead of choosing a task tracker based on network distance when the optimal task tracker is not available, categorizing task trackers and choosing one that falls into the same category as the optimal task tracker leads to higher data locality as well as less network traffic. Secondly, CoGRS does not take network congestion into consideration, which might cause hotspots in the network. To minimize hotspots, LoNARS takes congestion into account in the cost function.
In the simulator, we assumed that we can measure the number of flows on switches, which is possible with NetFlow [9]. This helps us estimate the network bandwidth of a given path more precisely, since we can estimate how much bandwidth a new flow can obtain even if the link is fully utilized. SNMP, on the other hand, provides only the amount of traffic passing through switch ports and gives no indication of how many flows are sharing the same port. We query NetFlow for traffic information every time we calculate the cost function of a task tracker.
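As an illustration of why flow counts help, a hedged sketch of such an estimate follows. The heuristic (take the larger of the fair share under flow counting and the idle capacity under utilization) is our own assumption, not the paper's exact formula.

```java
// Estimate the bandwidth a new shuffle flow would obtain on a link,
// given NetFlow-style flow counts and SNMP-style utilization.
public class BandwidthEstimator {
    static double newFlowBandwidth(double linkCapacityBps, int activeFlows, double utilization) {
        double fairShare = linkCapacityBps / (activeFlows + 1); // fully loaded link, fair sharing
        double idle = linkCapacityBps * (1.0 - utilization);    // spare capacity is free to take
        return Math.max(fairShare, idle);
    }
}
```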
We used a pool of four different job types for benchmarking, as shown in Table 2. In addition to the WordCount, TeraSort, and Recommender applications, we used the Permutation application in the simulator in order to see the effect of LoNARS on applications that generate heavy shuffle output in the map phase. Since we propose an algorithm to optimize the shuffling phase of MapReduce jobs, shuffle-heavy jobs are a good benchmark for observing the effect of LoNARS.
Figure 6: Shuffle locality ratios when map outputs shuffle uniformly (Fifo, RackAware, CoGRS, LoNARS; local and rack locality levels).

The job characteristics for the Permutation application are as follows: the average map task time is 90 seconds, the average reduce task time (excluding the shuffle phase) is 22 seconds, and the approximate shuffle output is 700 MB per map task. We had a total of 144 jobs, and they are scheduled with exponentially distributed inter-arrival times with a mean of 14 seconds, as in Facebook's cluster [13].
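For reference, this arrival process can be reproduced with inverse-transform sampling; the snippet below is a generic sketch, not the simulator's actual code.

```java
import java.util.Random;

// Sample exponentially distributed job inter-arrival times by inverse transform;
// with meanSec = 14 this matches the arrival process described above.
public class JobArrivals {
    static double nextArrivalGapSec(Random rng, double meanSec) {
        return -meanSec * Math.log(1.0 - rng.nextDouble());
    }
}
```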
4.2.1 Uniform Shuffle Outputs
Firstly, we considered a case where map task output shuffles to reduce tasks uniformly. For example, if a map task
generates 30MB of output to be shuffled to reduce tasks and
there are six reduce tasks, then each reduce task will receive
exactly 5MB.
Figures 5 (a), (b), (c), and (d) show the total job time comparison of each job type listed in Table 2 for the four different reduce scheduling algorithms. Figures 5 (e), (f), (g), and (h) show the average time for one shuffle operation of a given job. If there are 10 concurrent transfers running and it takes 10 seconds to complete all of them, then the per-shuffle time is one second.
The WordCount and Recommender applications shuffle a small amount of data during the shuffle phase. Thus, regardless of improvements in reduce task scheduling, it does not considerably impact overall job time. The TeraSort application shuffles more data, but its average per-shuffle time is still less than 0.5 seconds. Also, as the number of map tasks increases, the impact of the reduce task scheduling algorithm decreases, since map tasks will be distributed almost evenly over the whole cluster. Thus, we see a slight improvement in TeraSort 1. The Permutation application, on the other hand, shuffles a significant amount of data, and reduce task scheduling makes more of a difference for it than for the other jobs.
Figure 5: Benchmarking results with uniform shuffle. Panels (a)-(d) show total job times and panels (e)-(h) show per-shuffle times for the WordCount, KMeans, TeraSort, and Permutation jobs under the Fifo, RackAware, CoGRS, and LoNARS algorithms.
LoNARS delivers a 2-3% improvement in total job time for shuffle-heavy jobs, as shown in Figure 5 (d) for the Permutation job. In Figure 5 (h), it can be seen that the per-shuffle time of LoNARS shows up to a 15% improvement compared to FIFO and a 7% improvement compared to CoGRS. The basic reason LoNARS performs better than CoGRS is CoGRS's inability to choose a good alternative task tracker when multiple reduce tasks prefer the same task tracker. With the help of categorization, the LoNARS algorithm successfully finds an alternative task tracker whose cost is close to the optimal task tracker's cost.
Figure 6 shows the ratio of local- and rack-level shuffle operations among all shuffles. If a map task and a reduce task are on the same server, then the only operation is reading data from disk into memory for the reduce task; we call this case a "local-level" shuffle. If the map and reduce tasks are on the same rack, the data transfer stays within the rack; we call this a "rack-level" shuffle. The rest of the shuffles are called "off-rack" shuffles, and they constitute around 90% of all shuffles in most cases. LoNARS approximately doubles the level of local-level shuffling compared to FIFO and RackAware, and improves local-level shuffling by 20% compared to CoGRS. LoNARS also increases the rack-level shuffle ratio by 22% compared to FIFO, 19% compared to RackAware, and 17% compared to CoGRS. Increases in local- and rack-level shuffling significantly reduce the amount of traffic pushed to the network.
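These three levels map directly onto a host and rack comparison between the two task trackers; a minimal sketch, with illustrative topology-lookup inputs of our own:

```java
// Classify a shuffle into the three locality levels defined above by
// comparing the map-side and reduce-side task trackers.
public class LocalityClassifier {
    enum ShuffleLocality { LOCAL, RACK, OFF_RACK }

    static ShuffleLocality classify(String mapHost, String mapRack,
                                    String reduceHost, String reduceRack) {
        if (mapHost.equals(reduceHost)) return ShuffleLocality.LOCAL; // disk read only
        if (mapRack.equals(reduceRack)) return ShuffleLocality.RACK;  // transfer stays in rack
        return ShuffleLocality.OFF_RACK;                              // crosses the aggregation layer
    }
}
```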
4.2.2 Non-uniform Shuffle Outputs
CoGRS performs best when map outputs are divided among reduce tasks non-uniformly. That is, each map task's output is partitioned unevenly across reduce tasks, and some reduce tasks receive more data from a given map task. This leads reduce tasks to choose different task trackers as their optimal task trackers. FIFO and RackAware perform the same regardless of
the distribution pattern of map outputs. We again used
the benchmarking jobs listed in Table 2, and the results
are given in Figure 7. CoGRS performs better than the
FIFO and RackAware algorithms in this case, as expected.
Thus, CoGRS increases locality and shortens shuffling time.
The LoNARS algorithm slightly outperforms CoGRS in this case. Although CoGRS can find different task trackers for most reduce tasks, it fails to find good alternative task trackers when many jobs are active and the optimal task trackers do not have free reduce slots. Furthermore, it is important to be aware of network traffic in order to avoid network congestion and to utilize network capacity effectively. LoNARS successfully finds alternative task trackers and frees up more network capacity, which improves shuffle time. However, total job time is not significantly improved, since the reduce phase starts only after all shuffles are completed, and even one slow shuffle can cause a long total job time.
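In Hadoop, the map-to-reduce data distribution is controlled by the job's Partitioner. A skewed partitioner such as the hedged sketch below (the vowel rule is purely illustrative, not taken from the paper's workloads) produces exactly the kind of non-uniform pattern evaluated here.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// A deliberately skewed partitioner: keys starting with a vowel all land in
// partition 0, so one reduce task receives far more data than the others.
public class SkewedPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String k = key.toString();
        char c = k.isEmpty() ? 'x' : Character.toLowerCase(k.charAt(0));
        if ("aeiou".indexOf(c) >= 0) return 0;                      // heavy partition
        return (k.hashCode() & Integer.MAX_VALUE) % numPartitions;  // spread the rest
    }
}
```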
In Figure 8, we can see that CoGRS achieves a better local-level shuffle ratio than it does in the uniform shuffling case. LoNARS again outperforms FIFO, RackAware, and CoGRS in local- and rack-level shuffle ratios. Its local-level shuffle ratio is again double that of FIFO and RackAware, and 10% higher than that of CoGRS. In terms of rack-level shuffle ratio, LoNARS outperforms FIFO by 20%, RackAware by 18%, and CoGRS by 13%.
Figure 7: Benchmarking results with non-uniform shuffle. Panels (a)-(d) show total job times and panels (e)-(h) show per-shuffle times for the WordCount, KMeans, TeraSort, and Permutation jobs under the Fifo, RackAware, CoGRS, and LoNARS algorithms.
Figure 8: Shuffle locality ratios when map outputs have a non-uniform pattern (Fifo, RackAware, CoGRS, LoNARS; local and rack locality levels).
4.2.3 Network Traffic Comparison
Besides improvements in shuffle and total job times, LoNARS reduces network traffic considerably. A reduction in network traffic means less load on switches and reduced power consumption. Mahadevan et al. [7] analyzed the power consumption of network switches: 1G network switches consume around 120-180 watts, 10G switches consume around 300 watts, and switch utilization contributes 5-15% of total switch power consumption. Although this ratio seems low, as green technology becomes popular, vendors will focus more on energy-efficient products. Hence, lowering network traffic has a significant impact on lowering datacenter networking costs.
Figure 9 (a) shows a traffic size comparison of the LoNARS and FIFO algorithms on the 12-server cluster shown in Figure 1, using the jobs listed in Table 1. The number of jobs of a given size is inversely proportional to the job size. Jobs arrive with exponentially distributed inter-arrival times with a mean of 14 seconds. Traffic information is gathered via SNMP polling. Although the traffic size values also include HDFS block copying, both methods produce similar outputs, so the difference comes from the reduce task scheduling algorithms. LoNARS reduced the network traffic processed by the ToR and Aggregate switches by around 15%.
In Figure 9 (b) and (c), we compare the traffic size of the four algorithms for the benchmarks explained in Section 4.2. RackAware reduces network traffic slightly more than FIFO, and CoGRS outperforms both FIFO and RackAware by 8-10% at all network levels.
For the uniform map shuffling case, compared to the FIFO, RackAware, and CoGRS algorithms, LoNARS reduces the network traffic processed by the core switch by 27%, 25%, and 20%, respectively. Since higher-capacity switches consume more power per megabit [7], increasing locality and reducing network traffic at the upper layers matters more than at the lower layers.
Figure 9: Size of traffic at different layers of the network: (a) 12-server mixed jobs, (b) uniform shuffle, and (c) non-uniform shuffle.

For the aggregate layer, compared to the FIFO, RackAware, and CoGRS algorithms, LoNARS cuts the amount
of traffic pushed to aggregate switches by 20.9%, 19.3%, and
13% respectively. Finally, LoNARS lowers traffic size processed by ToR switches by 13%, 11% and 6% respectively
compared to FIFO, RackAware, and CoGRS.
In the non-uniform map shuffling case, CoGRS increases its locality ratio, which reduces its network traffic. LoNARS again reduces network traffic size by almost the same ratios compared to FIFO and RackAware at all layers. Compared to CoGRS, LoNARS reduces the traffic size through core, aggregate, and ToR switches by 8%, 5%, and 2%, respectively.
5. CONCLUSIONS AND FUTURE WORK
In this paper, we propose the LoNARS algorithm for reduce task scheduling in MapReduce. We conducted micro-benchmarking on a 12-server cluster and showed that LoNARS outperforms the native Hadoop reduce task scheduling algorithm. We also ran macro-benchmarking on a simulated 100-server cluster and compared LoNARS to the FIFO, RackAware, and CoGRS algorithms. The results showed that LoNARS consistently reduces network traffic more than these three algorithms, by up to 25%, which has a significant impact on the power consumption of network switches. LoNARS also reduces shuffle time in almost all cases; the reduction ratio depends on how much data a job shuffles in the shuffle phase. Regarding total job completion time, LoNARS yields an improvement when the shuffle transfer time exceeds one heartbeat time, and does not usually perform worse in the other cases.
As future work, we plan to explore how to tune the heartbeat time based on cluster size and the processing power of the job tracker in order to minimize the overhead of LoNARS. We also plan to test LoNARS on different datacenter topologies, such as Fat-Tree, to evaluate its performance on such networks.
6. ACKNOWLEDGMENTS
This project is in part sponsored by NSF under award
number CNS-1131889 (CAREER).
7. REFERENCES
[1] Dean, J., and Ghemawat, S. Mapreduce: Simplified data
processing on large clusters. In Proceedings of the 6th
Conference on Symposium on Operating Systems Design &
Implementation - Volume 6 (Berkeley, CA, USA, 2004),
OSDI’04, USENIX Association, pp. 10–10.
[2] Hadoop 1.2.1 Documentation.
https://hadoop.apache.org/docs/r1.2.1/.
[3] Hadoop. http://hadoop.apache.org/.
[4] Hammoud, M., Rehman, M. S., and Sakr, M. F.
Center-of-gravity reduce task scheduling to lower
mapreduce network traffic. In IEEE CLOUD (2012),
pp. 49–58.
[5] Hammoud, M., Rehman, M. S., and Sakr, M. F.
Center-of-gravity reduce task scheduling to lower
mapreduce network traffic. In Proceedings of the 2012
IEEE Fifth International Conference on Cloud Computing
(Washington, DC, USA, 2012), CLOUD ’12, IEEE
Computer Society, pp. 49–58.
[6] Hammoud, M., and Sakr, M. F. Locality-aware reduce
task scheduling for mapreduce. In Proceedings of the 2011
IEEE Third International Conference on Cloud Computing
Technology and Science (Washington, DC, USA, 2011),
CLOUDCOM ’11, IEEE Computer Society, pp. 570–576.
[7] Mahadevan, P., Sharma, P., Banerjee, S., and
Ranganathan, P. A power benchmarking framework for
network devices. In Proceedings of the 8th International
IFIP-TC 6 Networking Conference (Berlin, Heidelberg,
2009), NETWORKING ’09, Springer-Verlag, pp. 795–808.
[8] Apache Mahout. http://mahout.apache.org/.
[9] NetFlow. http://en.wikipedia.org/wiki/NetFlow.
[10] Palanisamy, B., Singh, A., Liu, L., and Jain, B.
Purlieus: locality-aware resource allocation for mapreduce
in a cloud. In Proceedings of 2011 International Conference
for High Performance Computing, Networking, Storage
and Analysis (New York, NY, USA, 2011), SC ’11, ACM,
pp. 58:1–58:11.
[11] Park, J., Lee, D., Kim, B., Huh, J., and Maeng, S.
Locality-aware dynamic vm reconfiguration on mapreduce
clouds. In Proceedings of the 21st international symposium
on High-Performance Parallel and Distributed Computing
(New York, NY, USA, 2012), HPDC ’12, ACM, pp. 27–36.
[12] Xie, J., Yin, S., Ruan, X., Ding, Z., Tian, Y., Majors,
J., Manzanares, A., and Qin, X. Improving mapreduce
performance through data placement in heterogeneous
hadoop clusters. In Parallel Distributed Processing,
Workshops and Phd Forum (IPDPSW), 2010 IEEE
International Symposium on (2010), pp. 1–9.
[13] Zaharia, M., Borthakur, D., Sen Sarma, J.,
Elmeleegy, K., Shenker, S., and Stoica, I. Delay
scheduling: A simple technique for achieving locality and
fairness in cluster scheduling. In Proceedings of the 5th
European Conference on Computer Systems (New York,
NY, USA, 2010), EuroSys ’10, ACM, pp. 265–278.
[14] Zaharia, M., Borthakur, D., Sen Sarma, J.,
Elmeleegy, K., Shenker, S., and Stoica, I. Delay
scheduling: a simple technique for achieving locality and
fairness in cluster scheduling. In Proceedings of the 5th
European conference on Computer systems (New York,
NY, USA, 2010), EuroSys ’10, ACM, pp. 265–278.
[15] Zhang, X., Feng, Y., Feng, S., Fan, J., and Ming, Z. An
effective data locality aware task scheduling method for
mapreduce framework in heterogeneous environments. In
Proceedings of the 2011 International Conference on Cloud
and Service Computing (Washington, DC, USA, 2011),
CSC ’11, IEEE Computer Society, pp. 235–242.