Balanced Data Layout in Hadoop

Weiping Zhang, Ke Xu, Kyungmin (Jason) Lee
Introduction
The performance of MapReduce programs in Hadoop can be dependent on how data is stored
in HDFS, the Hadoop Distributed File System. For example, during MapReduce’s map phase, a
datanode will need to fetch its input data. The operation is more efficient if the input data already
resides on the local node; otherwise the datanode must fetch the data across the network, causing
extra delay and network traffic. This problem is further compounded if the input data is not stored
evenly across the cluster: not only do more nodes need to fetch data across the network, but the
nodes that store large chunks of the data are hit especially hard, and performance suffers as a
consequence. In a MapReduce workflow, where the output of one job is used as input to the next,
it is highly likely that the outputs become imbalanced, and performance degrades further as the
data becomes more and more imbalanced.
Some previous works exist that try to address the data imbalance problem. One solution, the
HDFS rebalancer, implements an algorithm on the HDFS level to detect data imbalance in the
cluster based on storage utilization data. The shortcoming of this solution is that it is a purely
reactive solution in the sense that it only fixes the data imbalance problem once the problem
exists. A more efficient solution would prevent such imbalance from materializing in the first
place. A second solution modifies Hadoop’s block placement policy; instead of writing locally,
this policy chooses the write location in a round robin format so that data will be balanced.
While this method does indeed prevent data imbalance from happening, it introduces
unnecessary network traffic, since the majority of writes are unlikely to be local. Our
solution builds on the round robin placement policy by further optimizing write locations while
maintaining data balance.
Balanced Block Placement Policy
Our solution to data balance involves changing the policy on how and where data is stored in
Hadoop. Whenever a datanode tries to write a block to HDFS, it asks the master namenode for
the target locations to place the block. Our solution is to place these writes greedily as long as
the cluster remains fairly balanced. By greedily, we mean that our policy will prioritize target
nodes based on location; if possible, the local node will be given priority over another node on
the same rack, which will be given priority over other nodes on remote racks. We use the term
fairly balanced to mean that we allow some variation in data balance as long as the differences
are not significant.
Concretely, we define a windowSize parameter, which defines the boundaries for how much
data is stored on a node for it to be considered balanced with respect to all other nodes in a
cluster. The upper bound of this range is the data size of the node with the most data, and the
lower bound is the upper bound minus the windowSize. Because the range is anchored to the size
of the fullest node, the window grows as the amount of data on the cluster grows. Since no node
can hold more data than the fullest node, we consider the cluster fairly balanced as long as every
node holds more than maxSize – windowSize.
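In code, the check reduces to comparing the least-utilized node against this lower bound. The snippet below is a minimal Java sketch of that check; the names (isFairlyBalanced, minUsed, maxUsed) are illustrative and not taken from our implementation.

// Minimal sketch of the "fairly balanced" check; illustrative names only.
// maxUsed is the data size of the fullest node, minUsed that of the emptiest
// node, and windowSize is the configured tolerance.
static boolean isFairlyBalanced(long minUsed, long maxUsed, long windowSize) {
    long lowerBound = maxUsed - windowSize; // the upper bound is maxUsed itself
    return minUsed >= lowerBound;           // every node lies inside the window
}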
For our policy, we first sort all the live nodes in the HDFS cluster by the amount of data they
store. We then compute the threshold value by subtracting the windowSize from the maximum
data size of a node. Since all
nodes are sorted, we immediately know if all nodes fall above the data size threshold by simply
checking the size of the smallest node. If it is below the threshold value, we know some nodes
are underutilized and we need to perform some work to ensure balance; otherwise all nodes are
above our lower bound, the cluster is fairly balanced, and we are safe to assign the write to the local
node. If the former case occurs, we do a further check to see if the local node is underutilized; if
so, we pick the local node over the least utilized node, otherwise the write would go to the least
utilized node. The idea here is that as long as the cluster is fairly balanced the write would go to
the local node; if the cluster is imbalanced, we still prioritize the local node if it is under-utilized,
otherwise the write will go to the least-utilized node. For the second replica, we choose the
node with the least storage utilization; if more than one rack exists in the cluster however, we
will eliminate from consideration all nodes residing on the same rack as the first replica for
system reliability reasons. For the third replica and beyond, we simply choose the least-utilized
nodes, excluding those chosen for previous replicas.
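The sketch below summarizes this selection logic in Java. It uses a simplified Node record in place of Hadoop's DatanodeDescriptor objects, and the class, method, and field names are illustrative; it is meant to convey the decision order rather than reproduce our implementation.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified sketch of the balanced target selection described above.
class BalancedPlacementSketch {

    // Stand-in for Hadoop's DatanodeDescriptor (illustrative only).
    record Node(String name, String rack, long usedBytes) {}

    static List<Node> chooseTargets(Node localNode, List<Node> liveNodes,
                                    long windowSize, int replication) {
        // Sort live nodes by the amount of data they currently store.
        List<Node> sorted = new ArrayList<>(liveNodes);
        sorted.sort(Comparator.comparingLong(Node::usedBytes));

        long maxUsed = sorted.get(sorted.size() - 1).usedBytes();
        long threshold = maxUsed - windowSize;
        Node least = sorted.get(0);

        List<Node> targets = new ArrayList<>();

        // First replica: write locally while the cluster is fairly balanced,
        // or while the local node itself is still below the threshold;
        // otherwise send the write to the least-utilized node.
        boolean clusterBalanced = least.usedBytes() >= threshold;
        boolean localUnderutilized = localNode.usedBytes() < threshold;
        Node first = (clusterBalanced || localUnderutilized) ? localNode : least;
        targets.add(first);

        // Second replica: least-utilized node, skipping the first replica's
        // rack when the cluster spans more than one rack (for reliability).
        boolean multiRack = liveNodes.stream().map(Node::rack).distinct().count() > 1;
        for (Node n : sorted) {
            if (targets.size() >= Math.min(2, replication)) break;
            if (targets.contains(n)) continue;
            if (multiRack && n.rack().equals(first.rack())) continue;
            targets.add(n);
        }

        // Third replica and beyond: least-utilized nodes not yet chosen.
        for (Node n : sorted) {
            if (targets.size() >= replication) break;
            if (!targets.contains(n)) targets.add(n);
        }
        return targets;
    }
}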
Benchmark Workloads
To benchmark our balanced block placement policy’s (BP) performance we compared against
the Hadoop default block placement policy (DP). We initially deployed a 4-node cluster on
Amazon EC2, and then deployed a 16-node cluster to simulate a more realistic use-case
scenario.
On the 4-node cluster, we ran two MapReduce jobs, a balanced sorting job and a skewed sorting
job with a 2GB input dataset. In the balanced sort, the partitioner is set up so that an
approximately equal number of records is shuffled to each reducer, making the data size output
by each reducer essentially uniform. By contrast, the skewed sort partitions the data
such that some reducer tasks will receive a disproportionate amount of data, so that reducer
outputs will vary, introducing data skew into the cluster. For each of the two jobs, we ran two
types of workloads, a single run workflow and a cascaded workflow. For the single run, we run
the MapReduce once, varying the number of reducers used (1, 2, 4, 10, and 12). The cascaded
workflow is 3 runs of the MapReduce job in series, such that the output from the first job is
taken as input into the second, so on and so forth. For this case, we set the number of reducers
to be 10. Other parameters that we altered were the replication factor (RF), which we set to
either 1, meaning only one copy of the data is written to HDFS, or 3, the default value in Hadoop.
We also allowed speculative execution to be either enabled (SE) or disabled (SD).
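As an illustration of the kind of partitioner used for the skewed sort, the sketch below routes roughly half of all records to the first reducer so that its output dwarfs the others'. It uses Hadoop's standard Partitioner API, but the class name and the Text key/value types are assumptions; it is not the benchmark's actual code.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative skewed partitioner: about half of the records go to reducer 0,
// so one reducer's output (written locally under the default policy) is far
// larger than the rest.
public class SkewedPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        int hash = key.hashCode() & Integer.MAX_VALUE; // non-negative hash
        if (numPartitions == 1 || hash % 2 == 0) {
            return 0;                                  // the overloaded reducer
        }
        return 1 + hash % (numPartitions - 1);         // spread the rest evenly
    }
}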
On the 16-node cluster, we ran only the skewed sorting MapReduce workflow job, with a 20GB
input dataset. We set the number of reducers to 44 and disabled speculative execution. This
experiment is meant to simulate a more realistic Hadoop workload, and it is also the
configuration most likely to yield the biggest gains for our balanced block placement policy.
For each experiment, we are interested in the data distribution across cluster nodes and also the
running time of the job. We use the standard deviation of the per-node data sizes to quantify
data balance; a higher standard deviation implies that the data is more imbalanced, while a
standard deviation closer to 0 means that the data distribution is close to uniform.
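For concreteness, the metric is simply the standard deviation of the per-node data sizes; a minimal sketch (assuming the population form and an illustrative helper name) is shown below.

// Balance metric: standard deviation of per-node data sizes (sizes in GB).
static double dataSizeStdDev(double[] nodeSizesGb) {
    double mean = 0;
    for (double s : nodeSizesGb) mean += s;
    mean /= nodeSizesGb.length;

    double variance = 0;
    for (double s : nodeSizesGb) variance += (s - mean) * (s - mean);
    variance /= nodeSizesGb.length;

    return Math.sqrt(variance);
}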
Cluster Configuration
1. Export jar file for new balanced block placement policy into a file named
BalancedBlockPlacement.jar
2. Launch Amazon EC2 cluster. This step is exactly the same as in our assignment, except
that in hadoop-ec2-env.sh we change the ID of the image to be hadoop-0.21.0.
3. Upload jar file. Push BalancedBlockPlacement.jar to $HADOOP_HOME/lib on each
node in the cluster, including the namenode.
4. Change HDFS configurations. Modify $HADOOP_HOME/conf/hdfs-site.xml to
enable our policy by adding the following code to set our block placement policy class as
the block replicator.
<property>
  <name>dfs.block.replicator.classname</name>
  <value>org.apache.hadoop.hdfs.server.namenode.BalancedBlockPlacementPolicy</value>
</property>
Additionally, we set the dfs.replication property to change the number of replicas in
HDFS, and set mapred.reduce.tasks.speculative.execution to false to disable
speculative execution in some test cases. The modifications were made on each node,
including the namenode.
5. Restart HDFS. Execute $HADOOP_HOME/bin/stop-all.sh to stop the namenode and all
datanodes in the cluster, and then execute $HADOOP_HOME/bin/start-all.sh to restart all
nodes. After the restart, the modified HDFS configuration is in effect.
6. Run benchmarking jobs.
Benchmarking Results
For the single MapReduce job experiments, we tested five configurations: default (DP RF1 SE)
and balanced policy (BP RF1 SE) with replication factor 1 and speculative execution enabled,
balanced policy with replication factor 1 and speculative execution disabled (BP RF1 SD), and
default (DP RF3 SE) and balanced policy (BP RF3 SE) with replication factor 3 and speculative
execution enabled. For the experiments running the balanced sort, there are no significantly
noticeable trends, though the curve for DP RF1 SE shows a dramatic spike at 1 reducer that falls
to near zero at 4 reducers. This is expected: under the default policy all data from one reducer is
written only to the local node, so for reducer counts below 4 there are large differences in
output size per node. Otherwise, the data points
indicate that the standard deviation for the amount of data written to each node hovers around
0.1 GB; this result is consistent with the fact that the job should produce balanced data, so that
standard deviations are low. A note should be made that for the balanced policy, the standard
deviations seemed to rise when the number of reducers increased above 10. This is especially
true for the BP cases with replication factor of 1, and we hypothesize this is a side-effect of
speculative execution on our policy, which will be discussed later.
Figure 1. Standard Deviation for HDFS Written Across Nodes for Balanced Sort Single Run (L) and
Skewed Sort Single Run (R)
For skewed sorting, the trend differences are more pronounced as we see that the DP RF1 SE
case has distribution standard deviation of roughly 0.7 GB, while the balanced policies with
either speculative execution off or replication factor 3 have standard deviations less than 0.1 GB.
These trends show that our balanced policy is more successful at balancing data than the default
policy. Another trend is that a replication factor of 3 seemed to help achieve more balance
throughout the cluster.
In both the balanced sort and skewed sort case, no significant running time differences were
observed. For balanced sort, running time for all 5 cases fell slightly from around 9.5 minutes
with 1 reducer to around 5 minutes for 4 reducers, and then remained constant for reducers
greater than 4. For skewed sort, all running times were between 7.5 and 9.5 minutes with no
patterns observed.
For the workflow experiments, we ran the same 5 configurations as in the single job case. For
the balanced workflow, data stays relatively well balanced in all 5 cases; the balanced policy with
replication factor 1 seems to balance data slightly better than the default policy at replication
factor 1, while the reverse is observed at replication factor 3, but these differences are small and
negligible. For the skewed sort workflow, a much more pronounced improvement is
noticed in comparing the balanced policy versus the default policy. The default policy showed
data distribution standard deviation of 0.675, while the balanced policies had standard
deviations of 0.1 and 0.02 with speculative execution enabled and disabled, respectively. In the
replication factor of 3 cases, default policy had standard deviation of 0.6 while balanced policy
had standard deviation of 0.02, both with speculative execution on. Despite the obvious balance
improvements of the balanced policy over default policy on skewed data, there were no
significant improvements in running time; balanced sorts took around 15 minutes each, while
skewed sorts took around 25 minutes each.
Figure 2. HDFS Data Written Distribution for Balanced Workflow (L) and Skewed Workflow (R)
Finally, we expanded to a 16-node cluster and the results show drastic improvements in favor of
the balanced policy. Figure 3 shows the cluster’s overall storage utilization after each successive
sorting run in the workflow, for both default and balanced policy. While the balanced policy has
largely uniform distribution, the default policy shows significant spikes on just a few nodes,
while some other nodes are largely unutilized. In the balanced policy, the standard deviation in
node data size actually decreases from 0.06 to 0.03 to 0.02 after the first, second, and third sort
in the workflow. Comparatively, the default policy shows standard deviations of 3, 4, and 6,
largely due to the few nodes with more than 6 times as much data as the other nodes.
Figure 3. Overall Cluster Storage Utilization Running Skewed Workflow Using Default Policy (L)
and Balanced Policy (R)
Focusing on the amount of data written during each sorting job in the workflow, we notice that
in the default case one node writes almost 11 GB of data while the others write about 0.2 GB.
This is the intended outcome of our skewed sorting algorithm and shows a worst-case scenario
for data imbalance. Meanwhile, the balanced policy has dealt with the inherent reducer output
skew by distributing writes across all nodes uniformly. In terms of running time, we observed an
improvement of roughly 8 minutes for each individual sort in the workflow: the balanced policy
took about 46 minutes per sort on average, while the default policy took about 54 minutes,
roughly a 15% performance gain.
Figure 4. HDFS Data Written Distribution Running Skewed Workflow Using Default Policy (L) and
Balanced Policy (R)
Analysis
In running some single-job benchmarking experiments, we noticed that having speculative
execution enabled caused slight imbalance in the cluster, particularly in our policy. Speculative
execution is a Hadoop performance feature that runs the same task on two nodes concurrently,
using the data from the task that completes first. This feature improves performance by ensuring
the system is not slowed down when one or a few nodes run slowly. With regard to data balance,
we notice that during the course of a MapReduce job the data size grows uniformly on all nodes,
but right before all reducers finish some nodes see a sudden drop in data size, causing imbalance.
We conjecture that this is caused by speculative execution. The likely scenario is that, if nodes A
and B are running the same task, both write to HDFS until one finishes; the other task is then
killed and the data it wrote is removed from HDFS, causing imbalance. When we disabled
speculative execution, data was more balanced using our policy, but in real practice we would
like both the data balance provided by our policy and the performance boost provided by
speculative execution.
One solution revolves around creating a hybrid of the balanced policy and the round robin policy.
Speculative execution essentially penalizes the balanced policy for greedily writing to local nodes;
the round robin policy avoids this problem because nodes are chosen in round robin fashion, so
on average the data output by a given reduce task is spread evenly across all
nodes. Thus, even if such a task was killed, data would be removed roughly equally among all
nodes. If we implemented a hybrid system where some datanodes are running the balanced
policy and others running the round robin policy, we would be able to achieve higher balance in
the cluster. The tradeoff would be balance versus network traffic; by having more nodes running
balanced policy, network traffic would be reduced but extra data imbalance may be introduced,
whereas running more nodes with round robin policy would keep data more balanced but may
cause extra network traffic for writes.
A second way to improve data balance is to assign the writes of both copies of a speculatively
executed task to the same target nodes. Even though there is no way to know which copy will
finish first, it would not matter, since the output is written to the same nodes regardless, so the
data remaining on each node is the same whichever copy wins.
In the current implementation, we consider only the common case that a cluster begins with no
data (or a data-balanced state). As data is written into HDFS, the balanced policy will ensure that
the data distribution remains balanced, implying that data output will be balanced for each
individual MapReduce task. We did not consider cases where the cluster may change; for
example, new nodes may be introduced into the cluster, or nodes may be down for some time
then re-enter the cluster. These actions would introduce unintended data imbalance. The
question becomes what should be done in this case: is it more important to maintain total data
balance, in the sense that each node should hold the same amount of data, or is it more important
to have data balance per MapReduce job, where each job's output is distributed uniformly across
all the nodes participating in the cluster at that time? The answer should be the latter, since if the
former were true and a new node entered the cluster, all writes for a subsequent MapReduce
job would go to the new node to bring its capacity up to the rest of the cluster. However, this
action would actually be undesirable, since this would imply that when this output is used as
input to another job, all reads will have to be fetched from one node, the precise scenario we
want to prevent. It makes much more sense to write equally to each node, including the new
node, so that mappers have approximately equal amounts of input residing on each node and
minimal network fetching of input data is required. The current implementation of the balanced
placement policy does not take this issue into account and will actually perform the former
action, so in the future it will need to be modified to balance data per job output rather than in
total.
Another factor not yet considered is the effect of windowSize on data balance and performance.
In the current implementation, the windowSize is set to 8 MB with a block size of 64 MB, implying
that the cluster will be kept strictly balanced. Some experiments were run on the 4-node cluster,
varying the windowSize from 8 MB to 64 MB to 128 MB while keeping the block size the same.
However, due to the small cluster and dataset sizes, no significant differences were observed as
the window was relaxed.
Conclusion
We were able to implement a new block placement policy that focuses on maintaining data
balance in a Hadoop cluster while maximizing performance by keeping HDFS writes local as
much as possible. In our benchmark experiments we were able to show that the new balanced
policy is successful at balancing data distribution in the cluster, and the difference is especially
pronounced when the output of MapReduce jobs naturally causes data skew. While the running
time results show no great improvement on the 4-node cluster, we were able to observe a 15%
performance increase on skewed data in a 16-node cluster with a 20 GB dataset instead of 2 GB.
This also supports our theory that the balanced policy will see increasing gains over the default
policy as dataset and cluster sizes increase. Finally, while we did note a slight increase in data
imbalance due to speculative execution, part of the imbalance was mitigated by keeping extra
replicas of the data, and working with bigger datasets should also help to offset some of the
imbalance.