Balanced Data Layout in Hadoop

Weiping Zhang, Ke Xu, Kyungmin (Jason) Lee

Introduction

The performance of MapReduce programs in Hadoop depends in part on how data is stored in HDFS, the Hadoop Distributed File System. For example, during the map phase of a MapReduce job, each task needs to read its input data. The read is more efficient if the input already resides on the local node; otherwise the data must be fetched across the network, causing extra delay and network traffic. The problem is compounded when the input data is not stored evenly across the cluster: not only must more nodes fetch data over the network, but the nodes that store large chunks of the data are hit heavily, and performance suffers as a consequence. In a MapReduce workflow, where the output of one job is used as the input to the next, imbalanced outputs are highly likely, and performance degrades further as the data becomes more and more imbalanced.

Some previous work addresses the data imbalance problem. One solution, the HDFS rebalancer, implements an algorithm at the HDFS level that detects imbalance in the cluster from storage utilization data. Its shortcoming is that it is purely reactive: it only fixes the imbalance after the problem already exists. A more efficient solution would prevent such imbalance from materializing in the first place. A second solution modifies Hadoop's block placement policy; instead of writing locally, it chooses write locations in round-robin order so that data stays balanced. While this method does prevent data imbalance, it introduces unnecessary network traffic, since the majority of writes are unlikely to be local. Our solution builds on the round-robin placement policy by further optimizing write locations while maintaining data balance.

Balanced Block Placement Policy

Our solution changes the policy that determines how and where data is stored in Hadoop. Whenever a datanode writes a block to HDFS, it asks the master namenode for the target locations for that block. Our policy places these writes greedily as long as the cluster remains fairly balanced. By greedily, we mean that target nodes are prioritized by location: when possible, the local node is preferred over another node on the same rack, which in turn is preferred over nodes on remote racks. By fairly balanced, we mean that we tolerate some variation in data balance as long as the differences are not significant. Concretely, we define a windowSize parameter that bounds how far a node's stored data size may fall below the rest of the cluster while still being considered balanced. The upper bound of this range is the data size of the node storing the most data (maxSize), and the lower bound is maxSize minus windowSize. Because the range is tied to the most-loaded node, the window grows as the amount of data in the cluster grows. Since no node can store more than the most-loaded node, the cluster is considered fairly balanced as long as every node stores more than maxSize - windowSize.
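This balance check can be expressed concisely. The following is a minimal sketch of the check under a simplified model in which each node is represented only by the number of bytes it currently stores; the helper name isFairlyBalanced is illustrative and not taken from the actual implementation.

// Illustrative sketch of the "fairly balanced" window check (hypothetical helper,
// not the actual policy class). The cluster is fairly balanced when every node's
// stored data size lies within windowSize of the most-loaded node.
static boolean isFairlyBalanced(long[] nodeSizes, long windowSize) {
    if (nodeSizes.length == 0) {
        return true; // an empty cluster is trivially balanced
    }
    long max = Long.MIN_VALUE;
    long min = Long.MAX_VALUE;
    for (long size : nodeSizes) {
        max = Math.max(max, size);
        min = Math.min(min, size);
    }
    // The lower bound of the window (maxSize - windowSize) grows with the most-loaded node.
    return min >= max - windowSize;
}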
For our policy, we first sort all the live nodes in the HDFS cluster by stored data size, then compute the threshold value by subtracting windowSize from the maximum data size of any node. Because the nodes are sorted, we immediately know whether all nodes fall above the threshold simply by checking the size of the smallest node. If the smallest node is below the threshold, some nodes are underutilized and we need to do some work to restore balance; otherwise all nodes are above the lower bound, the cluster is fairly balanced, and we can safely assign the write to the local node. In the former case, we perform a further check: if the local node itself is underutilized, we pick it over the least-utilized node; otherwise the write goes to the least-utilized node. The idea is that as long as the cluster is fairly balanced, the write goes to the local node; if the cluster is imbalanced, we still prioritize the local node when it is underutilized, and otherwise the write goes to the least-utilized node. For the second replica, we choose the node with the least storage utilization; if the cluster spans more than one rack, however, we exclude from consideration all nodes on the same rack as the first replica for reliability reasons. For the third replica and beyond, we simply choose the least-utilized nodes, excluding those chosen for previous replicas. A code sketch summarizing this selection procedure is given at the end of the next section.

Benchmark Workloads

To benchmark the performance of our balanced block placement policy (BP), we compared it against the Hadoop default block placement policy (DP). We initially deployed a 4-node cluster on Amazon EC2 and later a 16-node cluster to simulate a more realistic use case. On the 4-node cluster, we ran two MapReduce jobs, a balanced sort and a skewed sort, each with a 2 GB input dataset. In the balanced sort, the partitioner is set up so that an approximately equal number of records is shuffled to each reducer, making the data size output by each reducer essentially uniform. By contrast, the skewed sort partitions the data so that some reduce tasks receive a disproportionate amount of data; reducer outputs therefore vary, introducing data skew into the cluster.

For each of the two jobs, we ran two types of workloads: a single run and a cascaded workflow. For the single run, we ran the MapReduce job once, varying the number of reducers (1, 2, 4, 10, and 12). The cascaded workflow consists of three runs of the MapReduce job in series, with the output of each job taken as the input to the next; for this case we fixed the number of reducers at 10. We also varied the replication factor (RF), set to either 1, meaning only one copy of the data is written to HDFS, or 3, the default value in Hadoop, and we ran with speculative execution either enabled (SE) or disabled (SD).

On the 16-node cluster, we ran only the skewed sorting workflow, with a 20 GB input dataset, 44 reducers, and speculative execution disabled. This experiment is meant to simulate a more realistic Hadoop workload, and it is also the configuration most likely to yield the biggest gains for our balanced block placement policy.

For each experiment, we are interested in the data distribution across cluster nodes as well as the running time of the job. We use the standard deviation of per-node data size to quantify data balance: a higher standard deviation implies more imbalanced data, while a standard deviation closer to 0 means the data distribution is close to uniform.
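Before describing the cluster setup used for these benchmarks, the selection procedure from the previous section can be summarized in code. The sketch below is illustrative only: the NodeInfo type and the chooseTargets helper are hypothetical simplifications, and the real implementation plugs into HDFS's pluggable block placement policy interface rather than this standalone form.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical, simplified node model: the actual policy works with HDFS datanode
// descriptors, but for this sketch a node is just a name, a rack, and a data size.
class NodeInfo {
    final String name;
    final String rack;
    long bytesStored;

    NodeInfo(String name, String rack, long bytesStored) {
        this.name = name;
        this.rack = rack;
        this.bytesStored = bytesStored;
    }
}

class BalancedPlacementSketch {
    // Choose target nodes for one block at the given replication factor.
    // "local" is the node issuing the write and is assumed to be one of liveNodes.
    static List<NodeInfo> chooseTargets(List<NodeInfo> liveNodes, NodeInfo local,
                                        int replication, long windowSize) {
        // Sort live nodes by how much data they currently store (least first).
        List<NodeInfo> sorted = new ArrayList<>(liveNodes);
        sorted.sort(Comparator.comparingLong(n -> n.bytesStored));

        long maxSize = sorted.get(sorted.size() - 1).bytesStored;
        long threshold = maxSize - windowSize; // lower bound of the balance window

        List<NodeInfo> targets = new ArrayList<>();

        // First replica: prefer the local node when the cluster is fairly balanced
        // (smallest node at or above the threshold) or when the local node itself is
        // underutilized; otherwise fall back to the least-utilized node.
        boolean fairlyBalanced = sorted.get(0).bytesStored >= threshold;
        if (fairlyBalanced || local.bytesStored < threshold) {
            targets.add(local);
        } else {
            targets.add(sorted.get(0));
        }

        // Second replica: least-utilized node not yet chosen; if the cluster spans
        // more than one rack, also exclude the first replica's rack for reliability.
        if (replication >= 2) {
            NodeInfo first = targets.get(0);
            boolean multiRack = liveNodes.stream().anyMatch(n -> !n.rack.equals(first.rack));
            for (NodeInfo candidate : sorted) {
                if (targets.contains(candidate)) continue;
                if (multiRack && candidate.rack.equals(first.rack)) continue;
                targets.add(candidate);
                break;
            }
        }

        // Third replica and beyond: simply the least-utilized nodes not yet chosen.
        for (NodeInfo candidate : sorted) {
            if (targets.size() >= replication) break;
            if (!targets.contains(candidate)) targets.add(candidate);
        }
        return targets;
    }
}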
Cluster Configuration

1. Export the jar file for the new balanced block placement policy into a file named BalancedBlockPlacement.jar.

2. Launch the Amazon EC2 cluster. This step is exactly the same as in our assignment, except that in hadoop-ec2-env.sh we change the ID of the image to hadoop-0.21.0.

3. Upload the jar file. Push BalancedBlockPlacement.jar to $HADOOP_HOME/lib on each node in the cluster, including the namenode.

4. Change the HDFS configuration. Modify $HADOOP_HOME/conf/hdfs-site.xml to enable our policy by adding the following property, which sets our block placement policy class as the block replicator:

<property>
  <name>dfs.block.replicator.classname</name>
  <value>org.apache.hadoop.hdfs.server.namenode.BalancedBlockPlacementPolicy</value>
</property>

Additionally, we set the dfs.replication property to change the number of replicas in HDFS, and in some test cases we set mapred.reduce.tasks.speculative.execution to false to disable speculative execution. These modifications were made on each node, including the namenode.

5. Restart HDFS. Execute $HADOOP_HOME/bin/stop-all.sh to stop the namenode and all datanodes in the cluster, and then execute $HADOOP_HOME/bin/start-all.sh to restart all nodes. After the restart, the HDFS configuration changes take effect.

6. Run the benchmarking jobs.

Benchmarking Results

For the single MapReduce job experiments, we tested five configurations: the default policy (DP RF1 SE) and the balanced policy (BP RF1 SE) with replication factor 1 and speculative execution enabled, the balanced policy with replication factor 1 and speculative execution disabled (BP RF1 SD), and the default policy (DP RF3 SE) and the balanced policy (BP RF3 SE) with replication factor 3 and speculative execution enabled.

For the balanced sorting experiments, there are no significantly noticeable trends, though the line for DP RF1 SE has a dramatic spike with 1 reducer that falls to near zero at 4 reducers. This makes sense because, under the default policy, all data from one reducer is written only to the local node, so with fewer reducers than the four nodes in the cluster, large differences in output size are expected. Otherwise, the data points indicate that the standard deviation of the amount of data written to each node hovers around 0.1 GB; this is consistent with the fact that the job should produce balanced data, so standard deviations are low. Note that for the balanced policy the standard deviations rose when the number of reducers increased above 10. This is especially true for the BP cases with a replication factor of 1, and we hypothesize that it is a side effect of speculative execution on our policy, which is discussed later.

Figure 1. Standard Deviation of HDFS Data Written Across Nodes for Balanced Sort Single Run (L) and Skewed Sort Single Run (R)

For skewed sorting, the differences are more pronounced: the DP RF1 SE case has a distribution standard deviation of roughly 0.7 GB, while the balanced policies with either speculative execution off or a replication factor of 3 have standard deviations below 0.1 GB. These trends show that our balanced policy is more successful at balancing data than the default policy. Another trend is that a replication factor of 3 also seemed to help achieve more balance throughout the cluster.
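The balance figures reported here and below are standard deviations of the amount of HDFS data stored per node, in GB. As a point of reference, the metric can be computed as in the short sketch below; this is an illustrative helper, not the actual measurement script used in the experiments.

// Illustrative computation of the balance metric: the standard deviation of
// per-node HDFS data sizes (hypothetical helper, not the measurement script).
static double dataSizeStdDev(double[] nodeSizesGb) {
    double mean = 0.0;
    for (double size : nodeSizesGb) {
        mean += size;
    }
    mean /= nodeSizesGb.length;

    double variance = 0.0;
    for (double size : nodeSizesGb) {
        variance += (size - mean) * (size - mean);
    }
    variance /= nodeSizesGb.length;

    return Math.sqrt(variance); // closer to 0 means a more uniform distribution
}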
In both the balanced sort and skewed sort cases, no significant running-time differences were observed. For balanced sort, the running time in all five cases fell from around 9.5 minutes with 1 reducer to around 5 minutes with 4 reducers, and then remained roughly constant for more than 4 reducers. For skewed sort, all running times were between 7.5 and 9.5 minutes with no clear pattern.

For the workflow experiments, we ran the same five configurations as in the single-job case. For the balanced workflow, data stays relatively well balanced in all five cases. The balanced policy with replication factor 1 seems to balance data slightly better than the default policy at replication factor 1, while the reverse is observed at replication factor 3, but these differences are small and negligible. For the skewed sort workflow, the improvement of the balanced policy over the default policy is much more pronounced. The default policy showed a data distribution standard deviation of 0.675 GB, while the balanced policies had standard deviations of 0.1 GB and 0.02 GB with speculative execution enabled and disabled, respectively. In the replication factor 3 cases, the default policy had a standard deviation of 0.6 GB while the balanced policy had a standard deviation of 0.02 GB, both with speculative execution on. Despite the obvious balance improvements of the balanced policy over the default policy on skewed data, there were no significant improvements in running time; balanced sorts took around 15 minutes each, while skewed sorts took around 25 minutes each.

Figure 2. HDFS Data Written Distribution for Balanced Workflow (L) and Skewed Workflow (R)

Finally, we expanded to a 16-node cluster, and the results show drastic improvements in favor of the balanced policy. Figure 3 shows the cluster's overall storage utilization after each successive sorting run in the workflow, for both the default and the balanced policy. While the balanced policy yields a largely uniform distribution, the default policy shows significant spikes on just a few nodes while some other nodes remain largely unutilized. Under the balanced policy, the standard deviation in node data size actually decreases from 0.06 to 0.03 to 0.02 GB after the first, second, and third sort in the workflow. Comparatively, the default policy shows standard deviations of 3, 4, and 6 GB, largely due to the few nodes holding more than 6 times as much data as the other nodes.

Figure 3. Overall Cluster Storage Utilization Running Skewed Workflow Using Default Policy (L) and Balanced Policy (R)

Focusing on the amount of data written during each sorting job in the workflow, we notice that in the default case one node writes almost 11 GB of data while the others write about 0.2 GB each. This is the intended outcome of our skewed sorting algorithm and represents a worst-case scenario for data imbalance. Meanwhile, the balanced policy dealt with the inherent reducer output skew by distributing writes uniformly across all nodes. In terms of running time, we observed an improvement of roughly 8 minutes for each individual sort in the workflow: the balanced policy took about 46 minutes per sort on average while the default policy took about 54 minutes, roughly a 15% performance gain.

Figure 4. HDFS Data Written Distribution Running Skewed Workflow Using Default Policy (L) and Balanced Policy (R)

Analysis

In running the single-job benchmarking experiments, we noticed that having speculative execution enabled caused slight imbalance in the cluster, particularly under our policy.
Speculative execution is a Hadoop performance feature that runs the same task on two nodes concurrently and uses the output of whichever copy completes first. The feature improves performance by preventing the system from being slowed down when one or a few nodes are running slowly. With regard to data balance, we observed that during the course of a MapReduce job the data size grows uniformly on all nodes, and then, right before all reducers finish, some nodes see a sudden drop in data size, causing imbalance. We conjecture that this is caused by speculative execution. The likely scenario, based on our observations, is that if nodes A and B are running the same task, both write to HDFS until one copy finishes; the other copy is then killed and the data it wrote is removed from HDFS, causing imbalance. When we disabled speculative execution, data was more balanced under our policy, but in practice we would like both the data balance provided by our policy and the performance boost provided by speculative execution.

One solution is to create a hybrid of the balanced policy and the round-robin policy. Speculative execution essentially penalizes the balanced policy for greedily writing to local nodes; the round-robin policy avoids this problem because target nodes are chosen in round-robin fashion, so data output by a single reduce task is, on average, spread evenly across all nodes. Thus, even if such a task is killed, its data is removed roughly equally from all nodes. If we implemented a hybrid system in which some datanodes run the balanced policy and others run the round-robin policy, we could achieve better balance in the cluster. The tradeoff is balance versus network traffic: with more nodes running the balanced policy, network traffic is reduced but extra data imbalance may be introduced, whereas with more nodes running the round-robin policy, data stays more balanced but writes may cause extra network traffic.

A second solution to improve data balance is to assign the writes of both copies of a speculatively executed task to the same target nodes. Even though there is no way to know which copy will finish first, it would not matter: the output is written to the same nodes regardless, so the data each copy leaves on each node is the same.

In the current implementation, we consider only the common case in which a cluster begins with no data (or in a data-balanced state). As data is written into HDFS, the balanced policy ensures that the data distribution remains balanced, which implies that the output of each individual MapReduce job is also balanced. We did not consider cases where the cluster itself changes; for example, new nodes may join the cluster, or nodes may go down for some time and then re-enter the cluster. These events would introduce unintended data imbalance. The question becomes what should be done in this case: is it more important to maintain total data balance, in the sense that every node stores the same amount of data, or to maintain data balance per MapReduce job, where each job's output is distributed uniformly across all nodes participating in the cluster at that time? The answer should be the latter, since if the former were the goal and a new node entered the cluster, all writes for a subsequent MapReduce job would go to the new node to bring its utilization up to that of the rest of the cluster.
However, this behavior would actually be undesirable: when that output is later used as input to another job, all reads would have to be fetched from a single node, which is precisely the scenario we want to prevent. It makes much more sense to write equally to each node, including the new node, so that mappers have approximately equal amounts of input residing on each node and minimal network fetching of input data is required. The current implementation of the balanced placement policy does not take this issue into account and would in fact perform the former action, so in the future it will need to be modified to balance per-job output rather than total storage.

Another factor not considered in the current implementation is the effect of windowSize on data balance and performance. Currently, windowSize is set to 8 MB with a block size of 64 MB, meaning the cluster is kept strictly balanced. Some experiments were run on the 4-node cluster, varying windowSize from 8 MB to 64 MB to 128 MB while keeping the block size the same. However, due to the small cluster and dataset sizes, no significant differences were observed as the window was relaxed.

Conclusion

We implemented a new block placement policy that focuses on maintaining data balance in a Hadoop cluster while maximizing performance by keeping HDFS writes local as much as possible. Our benchmark experiments show that the new balanced policy is successful at balancing the data distribution in the cluster, and the difference is especially pronounced when the output of MapReduce jobs naturally causes data skew. While the running-time results show no great improvement on the 4-node cluster, we observed a 15% performance increase on skewed data in a 16-node cluster with a 20 GB dataset instead of 2 GB. This supports our expectation that the gains of the balanced policy over the default policy will grow as dataset and cluster sizes increase. Finally, while we did note a slight increase in data imbalance due to speculative execution, part of that imbalance was mitigated by extra data replication, and working with larger datasets should also help offset it.