Locality and Network-Aware Reduce Task Scheduling for Data-Intensive Applications

Engin Arslan, University at Buffalo (SUNY), enginars@buffalo.edu
Mrigank Shekhar, Intel Corporation, mrigank.shekhar@intel.com
Tevfik Kosar, University at Buffalo (SUNY), tkosar@buffalo.edu

ABSTRACT

MapReduce is one of the leading programming frameworks used to implement data-intensive applications by splitting the map and reduce tasks across distributed servers. Although there has been a substantial amount of work on map task scheduling and optimization in the literature, the work on reduce task scheduling is very limited. Effective scheduling of the reduce tasks to the resources becomes especially important for the performance of data-intensive applications, where large amounts of data are moved between the map and reduce tasks. In this paper, we propose a new algorithm (LoNARS) for reduce task scheduling, which takes both data locality and network traffic into consideration. Data locality awareness aims to schedule the reduce tasks closer to the map tasks to decrease the delay in data access as well as the amount of traffic pushed to the network. Network traffic awareness intends to distribute the traffic over the whole network and minimize hotspots to reduce the effect of network congestion on data transfers. We have integrated LoNARS into Hadoop-1.2.1. Using our LoNARS algorithm, we achieved up to a 15% gain in data shuffling time and up to a 3-4% improvement in total job completion time compared to the other reduce task scheduling algorithms. Moreover, we reduced the amount of traffic on network switches by 15%, which helps to save a considerable amount of energy.

1. INTRODUCTION

The increasing data requirements of commercial and scientific applications have led to new programming paradigms and complex scheduling algorithms for efficient processing of these data sets. MapReduce [1] is one of the programming paradigms proposed to overcome the challenges of big data processing by effectively dividing big jobs into small tasks and distributing them over a set of compute nodes. In such a setting, the compute and data storage nodes can be different, which brings up the problem of co-locating data and computation for efficient end-to-end computing. To overcome the data-computation scheduling problem, locality-aware data placement is considered a desired feature, and different solutions have been proposed to optimize it by arranging data locations in such a way that compute tasks are always assigned to nodes which store the relevant data [12, 10]. On the other hand, MapReduce and its open source implementation Hadoop [3] offload the data placement task to HDFS, which does not take data locality optimization into consideration when making data placement decisions. Thus, low data localization leads to high amounts of data being transferred to and from the compute nodes. A typical MapReduce application is composed of three phases: map, shuffle, and reduce.
Since data placement tasks are managed mostly by the underlying filesystem, such as HDFS in Hadoop, there are two main operations that heavily use the network to transfer data between computation units. The first one is the data transfer during the map phase, where the task and the relevant data are located at different parts of the datacenter. This requires copying the data from where it is stored to where the task is scheduled. Zaharia et al. [13] proposed a delay scheduling algorithm to improve data locality for map tasks, and their results showed that with simple delay scheduling, map task data locality can be increased to maximum values. The second main operation that causes large network usage is the shuffle phase, where the outputs of the map tasks are transferred to the reduce task locations in order to combine all results. Unlike the optimization of map task scheduling, shuffle optimization has more than two components to consider, which makes the problem more complex. Hammoud et al. proposed the CoGRS algorithm [4] to increase shuffle locality by scheduling reduce tasks near their map tasks. CoGRS calculates the optimal task tracker for each reduce task by finding the center of gravity of the associated map tasks. During the shuffle phase, reduce tasks transfer data from the map tasks, and each reduce task could transfer a different amount of data from each map task. Hence, reduce tasks should be placed near the map tasks from which they will transfer output data.

In this paper, we propose a locality and network-aware reduce task scheduling algorithm (LoNARS) in order to optimize the shuffle phase of data-intensive MapReduce applications. We combine data locality awareness with network traffic awareness in order to decrease the shuffle phase duration, as well as to lower the network traffic caused by these data transfers. Our solution has two distinctive features. First, it takes network bandwidth capacity and congestion into account when comparing two potential paths for reduce input data movement. This is crucial for the end-to-end performance of MapReduce applications, since the heterogeneous nature of the network connectivity between any two nodes could result in a severe performance penalty. Also, it is possible to observe high load or congestion on a specific part of the cluster or on one specific port of a switch. In this case, it is crucial to consider the impact of network congestion on end-to-end performance and make scheduling decisions accordingly. As the second distinctive feature of LoNARS, instead of finding a single optimal task tracker for each reduce task as in [4] (which could lead to sub-optimal scheduling if the optimal task tracker is not available), we classify all candidate task trackers using a cost function and then choose from the optimal category, which will typically contain more than one choice.

The rest of the paper is organized as follows. Section 2 introduces the background on locality-based scheduling methods in Hadoop. Our locality and network-aware reduce task scheduling algorithm (LoNARS) is presented in Section 3. In Section 4, we analyze and evaluate the performance gain of our solution. Section 5 concludes the paper with a summary of our contributions.
2. BACKGROUND

There has been extensive research in the area of locality awareness optimization in MapReduce. Some of these efforts focused on data placement optimization [10, 12], some on map task scheduling [14, 15, 11], and some on reduce task optimizations [6, 10, 4].

Park et al. [11] proposed a locality-aware solution to map task scheduling in the context of VM resource provisioning. When multiple VMs are running on the same physical machine, each VM obtains a certain portion of the physical resources. This work aimed to dynamically allocate physical CPUs to VMs in such a way that underutilized CPUs can be moved to busy VMs so that better resource utilization can be accomplished. More importantly, in this approach tasks are always assigned to the node holding the data. When a task is scheduled to the VM where the data resides and all the cores are busy, the task is kept in a queue to wait for an available resource. If other VMs on the same machine are not busy and have underutilized CPUs, then the CPU resources are dynamically moved to the VMs with tasks in the queue, so that tasks are always assigned in a locality-aware manner and CPU resources are utilized better.

Palanisamy et al. [10] proposed Purlieus, which couples data placement and task scheduling to achieve higher locality. They claim that without considering the data placement scheme, it is hard to achieve high data locality, since random data placement might cause some nodes to become more congested. They also claim that job characteristics, such as how long a job takes and how much data is processed in the map and reduce phases, can be obtained beforehand. An efficient data placement scheme needs to take these characteristics into consideration so that the data of long jobs is placed on the nodes with the least possible load.

Hammoud et al. [6] proposed LARTS, which focuses on the reduce task data locality problem. Native Hadoop implementations have an option to enable/disable early shuffling (H_ESON and H_ESOF), which determines whether or not to schedule reduce tasks while map tasks are still running. Early shuffling improves the overall turnaround time but leads to inefficient reduce task assignment, since without knowledge of which nodes generate more reduce input, it is hard to decide where to launch the reduce tasks. LARTS employs a solution which enables early shuffling along with efficient reduce task assignment, by enabling early shuffling only after a certain number of map tasks have finished. This lets LARTS efficiently choose the nodes to launch the reduce tasks on, reducing network traffic and improving data locality. Later, the same authors proposed CoGRS [4], which calculates the optimal task tracker for each reduce task, similar to LARTS, but, instead of rejecting a task tracker if no reduce task prefers it, schedules one of the reduce tasks based on the closeness of the task tracker to the reduce task's optimal task tracker. CoGRS performs well under light network congestion and when servers have free slots most of the time. However, these assumptions might not hold all the time, especially in production clusters where many jobs run simultaneously and network congestion is inevitable. Our proposed algorithm categorizes task trackers based on a cost function and is able to find multiple scheduling choices, which perform very close to CoGRS in terms of finding the optimal task tracker while outperforming CoGRS in lowering network traffic.

3. SYSTEM DESIGN

In this section, we explain the methodology of our proposed reduce task scheduling algorithm. In order to estimate the best task tracker for a reduce task, we define a cost function which gives the cost of assigning each reduce task r to each task tracker (TT).
We define δ_{TT,r} as the cost of scheduling reduce task r on task tracker TT:

δ_{TT,r} = Σ_{m=0}^{n} ( D_m × H_{TT_m,r} ) / BW(TT_m, r)        (1)

where D_m is the estimated shuffle output size of map task m destined for reduce task r, TT_m is the task tracker that executes map task m, BW(TT_m, r) is an estimate of the network bandwidth that can be obtained for the data transfer between task tracker TT_m and the task tracker being considered for reduce task r, and H_{TT_m,r} is the number of hops between task trackers TT and TT_m. The cost function measures candidacy by two factors: (i) how much time it would take to complete the shuffling phase for this reduce task, and (ii) how close TT is to the map tasks. In order to find the transfer time of a shuffle, we estimate the data size to be shuffled from each map task to each reduce task. Although the exact shuffle data size can be learned once the map task is completed, it is inefficient to wait for all map tasks to finish execution [6]. Thus, we extrapolate it by scaling the size of the currently available output by the map task's progress level. To estimate the network bandwidth share available to a shuffle operation, we set up an SNMP client which regularly obtains utilization information from the switches.

Our reduce task scheduling algorithm tries to choose the optimal task tracker for each pending reduce task. Thus, once a job is ready to schedule its reduce tasks, we wait one heartbeat period to receive requests from all task trackers which are available to accept a reduce task. During this time period, we evaluate the cost function of each task tracker with respect to each pending reduce task. As a result, we are able to quantify each task tracker's candidacy for each reduce task based on the cost function values. Then, for each reduce task, we partition the task trackers into groups based on their costs via K-Means clustering. This helps us find more than one near-optimal candidate with only slight cost variations. Although the task tracker with the lowest cost is the best candidate for a reduce task, it is possible that multiple reduce tasks prefer the same task tracker. So, by grouping task trackers, we can have more than one task tracker as an optimal candidate. We use three levels of optimality for task trackers with free reduce slots. Task trackers running map tasks generally fall into the first group, unless the links on the paths to these task trackers are more congested than those of other candidates. Task trackers which are located close to the task trackers running map tasks and have available network bandwidth fall into the second group. The third group contains the rest of the task trackers with available reduce slots.

Algorithm 1 LoNARS Task Scheduling Algorithm
1:  function ScheduleReduceTask(TaskTracker TT)
2:    for reduce task r in job.pendingReduceTasks do
3:      δ_{TT,r} = calculateCost(TT, r)
4:      r.TTCostList.put(TT, δ_{TT,r})
5:    end for
6:    if job.timer == -1 then                      ▷ All job timers are set to -1 at the beginning
7:      job.timer = current_t
8:      return false                               ▷ Skip this job
9:    end if
10:   r.costPartitions = partition r.TTCostList into 3 groups via K-Means
11:   sort r.costPartitions in increasing order
12:   if current_t - job.timer > HEARTBEAT_FREQ and r.TTCostList.get(TT) ≤ r.costPartitions[0] then
13:     scheduleTask r to TT                       ▷ Optimal scheduling
14:     return true
15:   else if current_t - job.timer > 2 × HEARTBEAT_FREQ and r.TTCostList.get(TT) ≤ r.costPartitions[1] then
16:     scheduleTask r to TT                       ▷ Sub-optimal scheduling
17:     return true
18:   else if current_t - job.timer > 3 × HEARTBEAT_FREQ then
19:     scheduleTask r to TT                       ▷ Non-optimal scheduling
20:     return true
21:   end if
22:   return false
23: end function
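To make Eq. (1) and the tiered selection of Algorithm 1 more concrete, the following is a minimal, self-contained sketch of the same logic outside Hadoop. The names (MapTaskInfo, estimate_shuffle_size, three_groups, accept), the input structures, and the example numbers are illustrative assumptions made for this sketch; they are not the actual Hadoop-1.2.1/LoNARS code.

```python
# Illustrative sketch of the LoNARS cost function (Eq. 1), the three-group
# cost partitioning, and the tiered acceptance of Algorithm 1.
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class MapTaskInfo:
    tracker: str          # task tracker running this map task (TT_m)
    bytes_so_far: float   # shuffle output produced so far for reduce task r
    progress: float       # map progress in (0, 1]


def estimate_shuffle_size(m: MapTaskInfo) -> float:
    """Extrapolate the final shuffle output D_m from the map task's progress."""
    return m.bytes_so_far / max(m.progress, 1e-6)


def cost(tt: str, map_tasks: List[MapTaskInfo],
         hops: Dict[Tuple[str, str], int],
         bw: Dict[Tuple[str, str], float]) -> float:
    """delta_{TT,r} = sum_m  D_m * H_{TT_m,r} / BW(TT_m, r)   (Eq. 1)."""
    return sum(estimate_shuffle_size(m) * hops[(m.tracker, tt)]
               / bw[(m.tracker, tt)] for m in map_tasks)


def three_groups(costs: Dict[str, float], iters: int = 10) -> List[float]:
    """Partition 1-D cost values into three groups with Lloyd's k-means
    (fixed number of iterations) and return each group's upper bound."""
    values = sorted(costs.values())
    centers = [values[0], values[len(values) // 2], values[-1]]
    for _ in range(iters):
        buckets = [[] for _ in centers]
        for v in values:
            i = min(range(3), key=lambda k: abs(v - centers[k]))
            buckets[i].append(v)
        centers = [sum(b) / len(b) if b else centers[j]
                   for j, b in enumerate(buckets)]
    bounds = sorted(max(b) for b in buckets if b)
    while len(bounds) < 3:            # degenerate case: fewer than 3 groups
        bounds.append(bounds[-1])
    return bounds


def accept(tt: str, costs: Dict[str, float],
           elapsed: float, heartbeat: float) -> bool:
    """Tiered acceptance mirroring lines 12-21 of Algorithm 1: optimal group
    after one heartbeat, sub-optimal after two, any tracker after three."""
    bounds = three_groups(costs)
    if elapsed > heartbeat and costs[tt] <= bounds[0]:
        return True
    if elapsed > 2 * heartbeat and costs[tt] <= bounds[1]:
        return True
    return elapsed > 3 * heartbeat


if __name__ == "__main__":
    maps = [MapTaskInfo("tt1", 20e6, 0.5), MapTaskInfo("tt2", 60e6, 1.0)]
    hops = {("tt1", "tt3"): 2, ("tt2", "tt3"): 4}
    bw = {("tt1", "tt3"): 100e6, ("tt2", "tt3"): 50e6}   # bytes/sec estimates
    print(cost("tt3", maps, hops, bw))                    # 5.6
    costs = {"tt1": 1.0, "tt2": 1.2, "tt3": 5.6, "tt4": 9.0}
    print(accept("tt1", costs, elapsed=0.4, heartbeat=0.3))   # True (optimal)
```

Because the clustering runs on one-dimensional cost values with a fixed number of iterations, the grouping step stays linear in the number of candidate task trackers, which matches the complexity argument given later in this section.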
When a task tracker TT asks the job tracker for a reduce task, a job in the job list is chosen, and if the job is ready to schedule its reduce tasks (by default, 5% of map tasks have to complete before a job becomes ready to schedule reduce tasks), we call the ScheduleReduceTask function. This function evaluates the cost of choosing this task tracker for each pending reduce task of the job. Each pending reduce task keeps a list of task tracker costs and updates the cost of a task tracker every time that task tracker sends a heartbeat. After the cost of each pending reduce task of the job is calculated for the requesting task tracker TT, we check whether the timer for this job has been initiated yet. If not, we set it to the current time and reject the task tracker for this job. If the timer is already set and the difference between current_t and the timer is between one and two heartbeat times, then we accept only task trackers from the first (optimal) cost group. If none of the task trackers from the first group is available, we start accepting requests from the first and second cost groups after current_t - timer becomes greater than two heartbeat times. In our experiments, for small and medium size jobs, LoNARS is able to assign the task to a task tracker from the first or second group most of the time. For jobs with a large number of reduce tasks, task trackers from the third optimality level are also chosen.

In terms of algorithm complexity, we use Lloyd's clustering to generate three groups from one-dimensional data. With a fixed number of iterations, this leads to O(cN) complexity. The complexity of the cost function evaluation is similar to that of CoGRS, except that LoNARS measures the cost for all available task trackers as opposed to considering only feeding nodes. This difference does not impose significant overhead for large jobs, since one can expect the number of feeding nodes of a reduce task to be very close to the total number of task trackers, as large jobs have many map tasks. Thus, CoGRS and LoNARS have similar complexity for large jobs.
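The bandwidth term BW(TT_m, r) in Eq. (1) comes from periodic measurements of switch ports. As a rough illustration of how such an estimate can be derived, the sketch below computes used and free bandwidth from two samples of a port's byte counter (the kind of utilization data an SNMP poller returns) and, alternatively, from an active-flow count, as assumed later for NetFlow in the simulator (Section 4.2). The sampling interface and the fair-share formula are simplifying assumptions for this example, not the exact behavior of our SNMP client.

```python
# Illustrative bandwidth estimation for BW(TT_m, r) from switch measurements.
# Only the arithmetic is meaningful here; the counter-sampling interface is
# an assumption made for this sketch.

def used_bps(octets_t0: int, octets_t1: int, interval_s: float) -> float:
    """Bandwidth in use on a port, from two samples of its byte counter
    taken one polling interval apart (e.g., by an SNMP poller)."""
    return (octets_t1 - octets_t0) * 8 / interval_s


def available_bw_utilization(capacity_bps: float, octets_t0: int,
                             octets_t1: int, interval_s: float) -> float:
    """Utilization-based estimate: capacity minus measured usage."""
    return max(capacity_bps - used_bps(octets_t0, octets_t1, interval_s), 0.0)


def available_bw_flow_count(capacity_bps: float, active_flows: int) -> float:
    """Flow-count estimate (simulator assumption): the fair share a new flow
    would get on a port already carrying `active_flows` flows."""
    return capacity_bps / (active_flows + 1)


if __name__ == "__main__":
    # 1 Gbit/s port that carried 30 MB during a 0.3 s polling interval
    print(available_bw_utilization(1e9, 0, 30_000_000, 0.3))   # 2e8 bps free
    print(available_bw_flow_count(1e9, 4))                      # 2e8 bps share
```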
4. EVALUATION

We used Apache Hadoop 1.2.1 [2] to implement our custom reduce task scheduling method. We also tuned the heartbeat interval to 0.3 seconds. Although this might lead to high load on the job tracker when thousands of servers are in place, we leave finding the minimal acceptable heartbeat interval for a given cluster size to future work. We benchmarked LoNARS at micro and macro scales. Micro-scale benchmarking is conducted on the 12-server cluster topology shown in Figure 1, and macro-scale benchmarking on a 100-server cluster simulation using a MapReduce simulator.

4.1 Micro-benchmarking

We set up a 12-server cluster with the topology shown in Figure 1. Four 1G SNMP-enabled switches (three ToR switches and one Aggregate switch) were used to build a two-layer network topology. We used 12 Ivy Bridge servers with the following specifications: 10-core 2.9 GHz processors, 16 GB RAM, and a single-partition 2 TB hard disk. CentOS 6.2 is used as the operating system, without virtualization. Each server runs one task tracker, and each task tracker is configured to have eight map and reduce slots. One of the servers is configured to run the job tracker along with its task tracker. The same server is also configured as an SNMP client, since the job tracker uses traffic information from the switches to calculate the cost function. The Fair Scheduler [13] is used in all experiments and simulations in order to eliminate network congestion caused by missed map task locality.

Figure 1: Cluster topology used in the micro experiment (three racks connected through 1G ToR switches to a 1G Aggregate switch)

We benchmarked with three types of jobs to observe the effect of LoNARS on data transfer time and total job time. The jobs are listed in Table 1 along with some of their characteristics. WordCount counts the number of occurrences of each word in a given text. TeraSort sorts a set of input data generated by TeraGen. Recommender is a well-known machine learning application used by many real-life services [8]. It is important to know how much time is spent in the different phases of job execution and how much data is shuffled during the shuffle phase in order to evaluate the gain or loss introduced by a reduce scheduling algorithm. Map and reduce execution times are calculated by averaging over all tasks of the given job. The reduce task time does not cover shuffling time; it only covers sorting time and reduce function execution time. We also wanted to see the effect of resource contention between tasks running on the same task tracker, so we ran each job type at multiple sizes.

Table 1: Jobs used in micro-benchmarking
Job Type     | Dataset Size | Map Tasks | Reduce Tasks | Map Duration (sec) | Reduce Duration (sec) | Map Output Size
WordCount 1  | 182 MB       | 3         | 1            | 17.6               | 2.8                   | 7.8 MB
WordCount 2  | 940 MB       | 15        | 1            | 18.1               | 2.8                   | 7.8 MB
WordCount 3  | 3.7 GB       | 76        | 1            | 20.3               | 3.2                   | 7.8 MB
TeraSort 1   | 1 GB         | 16        | 5            | 3.5                | 4.5                   | 63 MB
TeraSort 2   | 4.8 GB       | 59        | 15           | 8.3                | 6.8                   | 67 MB
TeraSort 3   | 20 GB        | 298       | 30           | 9.2                | 15.7                  | 68 MB
Recommender  | 20 M         | 1         | 1            | 21.3               | 5.7                   | 10.8 MB

Figure 2: Total job completion time of jobs used in micro-benchmarking
Figure 3: Transfer time per shuffle of jobs used in micro-benchmarking

Based on the results of micro-benchmarking on the 12-server cluster, we implemented a MapReduce simulator which we later used to run our algorithm on larger scale networks. In order to make sure the simulator can generate close-to-real results, we ran the same benchmarks listed in Table 1 using the simulator (we used the same network topology in the simulator and took resource contention into consideration) and compared the results with the 12-server cluster's results. Figures 2 and 3 show the results of benchmarking for the FIFO and LoNARS algorithms when run on both the cluster and the simulator. Shuffle times are calculated by dividing the total time a job spent on data transfer by the number of shuffle operations. We compared LoNARS with FIFO, which is Hadoop's default reduce scheduling method and schedules reduce tasks on a first-come, first-served basis. Although a 12-server cluster is not an ideal environment for comparing reduce task scheduling algorithms, we still observed that LoNARS outperforms the FIFO algorithm in shuffle time for almost all job types.
However, overall job completion time might be affected adversely, because the LoNARS algorithm requires at least one heartbeat period of waiting time before it chooses an optimal server. If the shuffle data size is very small and the network is not very congested, then the gain from the shuffling phase turns out to be less than one heartbeat time, so the overall job time increases slightly. On the other hand, in production clusters it is rarely possible to have no network congestion, so LoNARS will not perform worse than FIFO in most cases. Moreover, LoNARS also decreases the amount of traffic on the network by considering data locality. The benefit of the reduced traffic size is explained in Section 4.2.3.

The simulator is capable of generating total job completion time results within a 5% error rate for all jobs. For data shuffling times, the error rate is less than 5% for TeraSort; however, it can reach up to 20% for the WordCount and Recommender applications. This is because we use flow-based network simulation, which fails to achieve fine-grained results for short-lived flows. On the other hand, the correlation between simulator and cluster results is 99.8% for total job times and 98.7% for shuffle times, which is enough to see that, even though the error rate is larger for shuffle times, the simulator matches the trends of the cluster.

4.2 Macro-benchmarking

We used the simulator to compare the algorithms on larger scale networks. The network topology used in macro-benchmarking is shown in Figure 4. There are three network layers: ToR switches, where each ToR switch is connected to 10 servers; Aggregate switches, where each Aggregate switch is connected to five ToR switches; and a Core switch which connects the two Aggregate switches. In order to simulate oversubscription in datacenters, link capacities are set to 1G, 6G, and 10G from servers to ToR switches, from ToR switches to Aggregate switches, and from Aggregate switches to the Core switch, respectively.

Figure 4: Network topology used in macro-benchmarking

Table 2: Jobs used in macro-benchmarking
Job Name       | Dataset Size | Map Tasks | Reduce Tasks | # of Jobs
WordCount 1    | 1.8 GB       | 30        | 2            | 30
WordCount 2    | 3.6 GB       | 60        | 10           | 10
WordCount 3    | 12.8 GB      | 200       | 30           | 5
TeraSort 1     | 300 MB       | 5         | 2            | 30
TeraSort 2     | 1.8 GB       | 30        | 5            | 10
TeraSort 3     | 12.8 GB      | 200       | 30           | 5
Recommender 1  | 60 MB        | 1         | 1            | 10
Recommender 2  | 300 MB       | 5         | 5            | 5
Recommender 3  | 640 MB       | 10        | 10           | 2
Permutation 1  | 300 MB       | 5         | 2            | 30
Permutation 2  | 640 MB       | 10        | 3            | 5
Permutation 3  | 1.2 GB       | 20        | 6            | 2

We compared LoNARS with FIFO, Rack-Aware, and CoGRS (Center of Gravity) [5]. The Rack-Aware method distributes reduce tasks across racks in order to achieve network load balancing between racks. It starts scheduling from any task tracker in any rack and then waits for task trackers from other racks, so that reduce tasks are distributed evenly among the racks. Finally, CoGRS determines the optimal task tracker for each reduce task and tries to schedule the reduce task on its optimal task tracker. Since it is possible that multiple reduce tasks prefer the same task tracker, one of them is scheduled on its optimal task tracker, and the others are scheduled on task trackers located near the optimal task tracker in terms of network distance.
However, we realized that instead of choosing a task tracker based on network distance when the optimal task tracker is not available, categorizing task trackers and choosing a task tracker that falls into the same category as the optimal task tracker leads to higher data locality as well as less network traffic. Secondly, CoGRS does not take network congestion into consideration, which might cause hotspots in the network. To minimize hotspots, LoNARS takes congestion into account in the cost function. In the simulator, we assumed that we can measure the number of flows on switches, which is possible with NetFlow [9]. This helps us estimate the network bandwidth of a given path more precisely, since we can estimate how much bandwidth a new flow can obtain even if the link is fully utilized. SNMP, on the other hand, provides only the amount of traffic passing through switch ports and gives no indication of how many flows are sharing the same port. We query NetFlow for traffic information every time we calculate the cost function of a task tracker.

We used a pool of four different jobs for benchmarking, as shown in Table 2. In addition to the WordCount, TeraSort, and Recommender applications, we used the Permutation application in the simulator in order to see the effect of LoNARS on applications that generate heavy shuffle output in the map phase. Since we propose an algorithm to optimize the shuffling phase of MapReduce jobs, shuffle-heavy jobs are a good benchmark for observing the effect of LoNARS. The job characteristics for the Permutation application are as follows: the average map task time is 90 seconds, the average reduce task time (excluding the shuffle phase) is 22 seconds, and the approximate shuffle output size is 700 MB per map task. We had a total of 144 jobs, and they arrive according to an exponential distribution with a mean of 14 seconds, as in Facebook's cluster [13].

4.2.1 Uniform Shuffle Outputs

First, we considered the case where map task output is shuffled to reduce tasks uniformly. For example, if a map task generates 30 MB of output to be shuffled and there are six reduce tasks, then each reduce task will receive exactly 5 MB. Figures 5 (a), (b), (c), and (d) show a total job time comparison for each job type listed in Table 2 under the four reduce scheduling algorithms. Figures 5 (e), (f), (g), and (h) show the average time for one shuffle operation of a given job: if there are 10 concurrent transfers running and it takes 10 seconds to complete all of them, then the per-shuffle time is one second. The WordCount and Recommender applications shuffle a small amount of data during the shuffle phase; thus, improvements in reduce task scheduling do not considerably impact their overall job time. The TeraSort application shuffles more data; however, its average per-shuffle time is still less than 0.5 seconds. Also, as the number of map tasks increases, the impact of the reduce task scheduling algorithm decreases, since map tasks are distributed over the whole cluster almost evenly. Thus, we see a slight improvement in TeraSort 1.
The Permutation application, on the other hand, shuffles a significant amount of data, and reduce task scheduling makes a bigger difference for it than for the other jobs. LoNARS provides a 2-3% improvement in total job time for shuffle-heavy jobs, as shown in Figure 5 (d). In Figure 5 (h), it can be seen that the per-shuffle time of LoNARS is up to 15% better than FIFO and 7% better than CoGRS. The basic reason LoNARS performs better than CoGRS is CoGRS's inability to choose a good alternative task tracker when multiple reduce tasks prefer the same task tracker. With the help of categorization, the LoNARS algorithm successfully finds an alternative task tracker whose cost is close to the optimal task tracker's cost.

Figure 5: Benchmarking results with uniform shuffle; panels (a)-(d) show total job time and panels (e)-(h) show per-shuffle time for the WordCount, KMeans, TeraSort, and Permutation jobs under the Fifo, RackAware, CoGRS, and LoNARS algorithms

Figure 6: Shuffle locality ratios when map outputs shuffle uniformly

Figure 6 shows the ratio of local- and rack-level shuffle operations among all shuffles. If a map task and a reduce task are on the same server, then the only operation is reading data from disk into memory for the reduce task; we call this case a "local-level" shuffle. If the map and reduce tasks are on the same rack, the data transfer stays within the rack; we call this a "rack-level" shuffle. The rest of the shuffles are called "off-rack" shuffles and constitute around 90% of all shuffles in most cases. LoNARS approximately doubles the amount of local-level shuffling compared to FIFO and RackAware, and improves local-level shuffling by 20% compared to CoGRS. LoNARS also increases the rack-level shuffle ratio by 22% compared to FIFO, 19% compared to RackAware, and 17% compared to CoGRS. Increases in the amount of local- and rack-level shuffling significantly reduce the amount of traffic pushed to the network.
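The local/rack/off-rack breakdown reported in Figures 6 and 8 can be computed directly from task placements. The short sketch below shows one way to classify each shuffle and derive these ratios; the input format and names are illustrative assumptions, not code from our implementation.

```python
# Illustrative classification of shuffles into the locality levels of
# Figures 6 and 8. The input format is an assumption made for this sketch.
from collections import Counter


def locality_level(map_host: str, reduce_host: str, rack_of: dict) -> str:
    """Return 'local', 'rack', or 'off-rack' for one map->reduce shuffle."""
    if map_host == reduce_host:
        return "local"      # same server: data is read from the local disk
    if rack_of[map_host] == rack_of[reduce_host]:
        return "rack"       # same rack: traffic stays below the ToR switch
    return "off-rack"       # traffic crosses the Aggregate/Core layers


def locality_ratios(shuffles, rack_of):
    """shuffles: iterable of (map_host, reduce_host) pairs; returns percentages."""
    counts = Counter(locality_level(m, r, rack_of) for m, r in shuffles)
    total = sum(counts.values()) or 1
    return {lvl: 100.0 * counts[lvl] / total
            for lvl in ("local", "rack", "off-rack")}


if __name__ == "__main__":
    rack_of = {"s1": "r1", "s2": "r1", "s3": "r2"}
    shuffles = [("s1", "s1"), ("s1", "s2"), ("s1", "s3"), ("s2", "s3")]
    print(locality_ratios(shuffles, rack_of))
    # {'local': 25.0, 'rack': 25.0, 'off-rack': 50.0}
```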
4.2.2 Non-uniform Shuffle Outputs

CoGRS performs best when map outputs are divided among reduce tasks non-uniformly. That is, the partitioning of a map task's output across reduce tasks differs, and some reduce tasks receive more data from a given map task than others. This leads reduce tasks to choose different task trackers as their optimal task trackers. FIFO and RackAware perform the same regardless of the distribution pattern of map outputs. We again used the benchmarking jobs listed in Table 2, and the results are given in Figure 7. CoGRS performs better than the FIFO and RackAware algorithms in this case, as expected, since it increases locality and shortens shuffling time. The LoNARS algorithm outperforms CoGRS slightly in this case. Although CoGRS can find different task trackers for most reduce tasks, it fails to find good alternative task trackers when many jobs are active and the optimal task trackers do not have free reduce slots. Furthermore, it is important to be aware of network traffic in order to avoid network congestion and to utilize network capacity effectively. LoNARS successfully finds alternative task trackers and obtains more network capacity, which improves shuffle time. However, total job time is not significantly improved, since the reduce phase starts only after all shuffles are completed, and even one slow shuffle can cause a long total job time.

Figure 7: Benchmarking with non-uniform shuffle

In Figure 8, we can see that CoGRS achieves a better local-level shuffle ratio than it does in the uniform shuffling case. LoNARS again outperforms FIFO, RackAware, and CoGRS in local- and rack-level shuffle ratios. Its local-level shuffle ratio is again double that of FIFO and RackAware and 10% more than that of CoGRS. In terms of rack-level shuffle ratio, LoNARS outperforms FIFO by 20%, RackAware by 18%, and CoGRS by 13%.

Figure 8: Shuffle locality ratios when map outputs have a non-uniform pattern

4.2.3 Network Traffic Comparison

Besides improvements in shuffle and total job times, LoNARS reduces network traffic considerably. A reduction in network traffic means less load on switches and reduced power consumption. Mahadevan et al. [7] analyzed the power consumption of network switches: 1G network switches consume around 120-180 watts, 10G switches consume around 300 watts, and switch utilization contributes 5-15% of the total switch power consumption. According to the authors, although this ratio seems low, as green technology becomes popular, vendors will focus more on energy-efficient products. Hence, lowering network traffic has a significant impact on lowering datacenter networking costs.

Figure 9 (a) shows a traffic size comparison of the LoNARS and FIFO algorithms on the 12-server cluster shown in Figure 1. We used the jobs listed in Table 1; the number of jobs of a given size is inversely proportional to the job size, and jobs arrive according to an exponential distribution with a mean of 14 seconds. Traffic information is gathered via SNMP polling. Although the traffic size values also include HDFS block copying, both methods produce similar amounts of output, so the difference comes from using different reduce task scheduling algorithms. LoNARS reduced the network traffic processed by the ToR and Aggregate switches by around 15%.
In Figures 9 (b) and (c), we compare the traffic sizes of the four algorithms for the macro-benchmarks explained in Section 4.2. RackAware reduces network traffic slightly more than FIFO, and CoGRS outperforms both FIFO and RackAware by 8-10% at all network levels. For the uniform map shuffling case, compared to the FIFO, RackAware, and CoGRS algorithms, LoNARS reduces the network traffic processed by the Core switch by 27%, 25%, and 20%, respectively.

Figure 9: Size of traffic at different layers of the network; (a) 12-server cluster with mixed jobs, (b) uniform shuffle, (c) non-uniform shuffle

Since higher capacity switches consume more power per megabit [7], increasing locality and reducing network traffic at the upper layers is more important than at the lower layers. For the Aggregate layer, compared to the FIFO, RackAware, and CoGRS algorithms, LoNARS cuts the amount of traffic pushed to the Aggregate switches by 20.9%, 19.3%, and 13%, respectively. Finally, LoNARS lowers the traffic size processed by the ToR switches by 13%, 11%, and 6%, respectively, compared to FIFO, RackAware, and CoGRS. In the non-uniform map shuffling case, CoGRS increases its locality ratio, which reduces network traffic. LoNARS again reduces network traffic by almost the same ratios compared to FIFO and RackAware at all layers. However, compared to CoGRS, LoNARS reduces the traffic size through the Core, Aggregate, and ToR switches by 8%, 5%, and 2%, respectively.

5. CONCLUSIONS AND FUTURE WORK

In this paper, we propose the LoNARS algorithm for reduce task scheduling in MapReduce. We conducted micro-benchmarking to measure the performance of LoNARS using a 12-server cluster and showed that LoNARS outperforms the native Hadoop reduce task scheduling algorithm. We also ran macro-benchmarks using a simulated 100-server cluster and compared LoNARS to the FIFO, RackAware, and CoGRS algorithms. The results showed that LoNARS consistently reduces network traffic compared to these three algorithms, by up to 25%, which has a significant impact on the power consumption of network switches. LoNARS also reduces shuffle time in almost all cases, and the reduction ratio depends on how much data a job shuffles in the shuffle phase. Regarding total job completion time, LoNARS yields an improvement if the shuffle transfer time is more than one heartbeat time, and it does not necessarily perform worse in the other cases. As future work, we plan to explore how to tune the heartbeat time based on cluster size and the processing power of the job tracker in order to minimize the overhead of LoNARS. We also plan to test LoNARS on different datacenter topologies, such as fat-tree, to evaluate its performance on those networks.

6. ACKNOWLEDGMENTS

This project is in part sponsored by NSF under award number CNS-1131889 (CAREER).

7. REFERENCES

[1] Dean, J., and Ghemawat, S. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation (Berkeley, CA, USA, 2004), OSDI '04, USENIX Association, pp. 10–10.
[2] Hadoop 1.2.1 Documentation. https://hadoop.apache.org/docs/r1.2.1/.
[3] Apache Hadoop. http://hadoop.apache.org/.
[4] Hammoud, M., Rehman, M. S., and Sakr, M. F. Center-of-gravity reduce task scheduling to lower MapReduce network traffic. In IEEE CLOUD (2012), pp. 49–58.
[5] Hammoud, M., Rehman, M. S., and Sakr, M. F. Center-of-gravity reduce task scheduling to lower MapReduce network traffic. In Proceedings of the 2012 IEEE Fifth International Conference on Cloud Computing (Washington, DC, USA, 2012), CLOUD '12, IEEE Computer Society, pp. 49–58.
[6] Hammoud, M., and Sakr, M. F. Locality-aware reduce task scheduling for MapReduce. In Proceedings of the 2011 IEEE Third International Conference on Cloud Computing Technology and Science (Washington, DC, USA, 2011), CLOUDCOM '11, IEEE Computer Society, pp. 570–576.
[7] Mahadevan, P., Sharma, P., Banerjee, S., and Ranganathan, P. A power benchmarking framework for network devices. In Proceedings of the 8th International IFIP-TC 6 Networking Conference (Berlin, Heidelberg, 2009), NETWORKING '09, Springer-Verlag, pp. 795–808.
[8] Apache Mahout. http://mahout.apache.org/.
[9] NetFlow. http://en.wikipedia.org/wiki/NetFlow.
[10] Palanisamy, B., Singh, A., Liu, L., and Jain, B. Purlieus: Locality-aware resource allocation for MapReduce in a cloud. In Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (New York, NY, USA, 2011), SC '11, ACM, pp. 58:1–58:11.
[11] Park, J., Lee, D., Kim, B., Huh, J., and Maeng, S. Locality-aware dynamic VM reconfiguration on MapReduce clouds. In Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing (New York, NY, USA, 2012), HPDC '12, ACM, pp. 27–36.
[12] Xie, J., Yin, S., Ruan, X., Ding, Z., Tian, Y., Majors, J., Manzanares, A., and Qin, X. Improving MapReduce performance through data placement in heterogeneous Hadoop clusters. In Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010 IEEE International Symposium on (2010), pp. 1–9.
[13] Zaharia, M., Borthakur, D., Sen Sarma, J., Elmeleegy, K., Shenker, S., and Stoica, I. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In Proceedings of the 5th European Conference on Computer Systems (New York, NY, USA, 2010), EuroSys '10, ACM, pp. 265–278.
[14] Zaharia, M., Borthakur, D., Sen Sarma, J., Elmeleegy, K., Shenker, S., and Stoica, I. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In Proceedings of the 5th European Conference on Computer Systems (New York, NY, USA, 2010), EuroSys '10, ACM, pp. 265–278.
[15] Zhang, X., Feng, Y., Feng, S., Fan, J., and Ming, Z. An effective data locality aware task scheduling method for MapReduce framework in heterogeneous environments. In Proceedings of the 2011 International Conference on Cloud and Service Computing (Washington, DC, USA, 2011), CSC '11, IEEE Computer Society, pp. 235–242.