Overview of Hadoop Performance Tuning

Hadoop Job Optimization
In this document, we describe specific types of Hadoop job configurations that can act as bottlenecks and
hurt overall performance. We provide techniques that you can use to remediate each type of bottleneck,
and provide suggestions for general optimization of your Hadoop jobs.
Section 1 provides a short introduction to how MapReduce works internally.
Section 2 defines a common performance tuning framework that you can use to guide repeatable
processes. We explain how to identify resource bottlenecks based on performance indicators.
Section 3 describes tuning techniques in detail, to help you to address and remove performance
bottlenecks. We also provide a matrix to guide selection of which technique to apply to each
performance problem.
Table of Contents
Introduction
Overview of Hadoop Performance Tuning
Common Resource Bottlenecks
Identifying Bottlenecks
Choosing a Solution
Resolving Bottlenecks
Conclusion
Appendix A: Join Optimization in Hive
Appendix B: List of Tuning Properties in Shuffle
Appendix C: Miscellaneous Tuning Properties
References and Appendices
List of Figures
Figure 1. Stages of a Hadoop job
Figure 2. The performance tuning cycle
Figure 3. Large input data
Figure 4. Counters showing spilled records
Figure 5. Log showing spilled records in TeraSort job
Microsoft IT SES Enterprise Data Architect Team
Hadoop Job Optimization
Figure 6. Large map output
Figure 7. Java exception in log file
Figure 8. Checking volume of data as HDFS bytes
Figure 9. Java exception caused by large job output
Figure 10. Cluster’s total map/reduce task capacity
Figure 11. Details of a job
Figure 12. Long-tailed tasks
Figure 13. Hive sort on single reducer
Figure 14. Default TeraSort job on 100 GB (before)
Figure 15. TeraSort job after spill optimization
Figure 16. Records transferred from Map to Reduce phase, before use of combiner
Figure 17. Records transferred from Map to Reduce phase, after use of combiner
Figure 18. Bytes written in Map phase, before compression
Figure 19. Bytes written in Map phase, after compression
Figure 20. Word Count job before job output compression
Figure 21. Word Count job after job output compression
Figure 22. Data being copied when replicated to three nodes
Figure 23. Reduce output before Hive compression
Figure 24. Reduce output after Hive compression
Figure 25. Intermediate output before compression
Figure 26. Intermediate output after compression
Figure 27. Number of Map tasks before
Figure 28. Number of Map tasks after
Figure 29. Maximum Map tasks before
Figure 30. Number of running Map tasks before
Figure 31. Maximum Map tasks after
Figure 32. Running Map tasks after
Hadoop is an Open Source platform for data processing that enables reliable, scalable and distributed
computing on clusters of inexpensive servers. Hadoop is a big ecosystem, but the power of Hadoop comes
from its two key components: HDFS and MapReduce.
HDFS is a distributed file system that offers built-in replication to store vast amounts of data.
MapReduce is a flexible programming model for developers to write applications to perform parallel
data processing.
Combined, HDFS and MapReduce make Hadoop the perfect tool to solve today’s big data analytics
challenges in terms of Volume, Variety and Velocity.
However, as with any new tool, there is much to learn. Our team is responsible for building solutions for
Microsoft internal customers, and we have experimented extensively with Hadoop. In this process, we have
learned that it is important to design and configure Hadoop jobs appropriately to maximize performance.
The goal of this document is to describe the types of configurations that can hurt performance, and to
provide techniques that you can use to optimize your Hadoop jobs.
Section 1 explains how MapReduce works internally, to set the technical context.
Section 2 defines a common performance tuning framework that you can use to guide repeatable
processes. We also explain how to identify resource bottlenecks based on performance indicators.
Section 3 describes tuning techniques in detail, to help you to address and remove performance
bottlenecks. We also provide a matrix to guide selection of which technique to apply to each
performance problem.
Software Used
We used the following software in our tests.
Microsoft Hadoop on Windows version 1.0.175 (the May 2012 CTP release; code-named Isotope).
The core engine of this version corresponds to Apache Hadoop version
Microsoft Isotope Deployment Tool version 1.0.175
Windows Server 2008 R2
The platform that we used in our experiments was Microsoft Hadoop for Windows, now known as
HDInsight. At the time of this writing, HDInsight had not been released to the general public, but was
available as a downloadable CTP version (May 2012 and later).
Because the core of the HDInsight release is based on the Apache distribution of Hadoop, and it includes
most of the native Hadoop components such as HDFS, MapReduce, Hive, Pig, and Mahout, we expect that
the contents of this document also apply to optimization of “native” Hadoop jobs. Therefore, we offer our
experiences as a generic guide for performance tuning of Hadoop jobs in general.
We must emphasize that all the optimization techniques described in this document are intended to
maximize performance of MapReduce jobs at the level of the individual job. Global tuning techniques that
apply more to the Hadoop cluster level, such as cluster capacity allocation, are out of scope for this document.
Overview of Hadoop Performance Tuning
The popularity of Hadoop is growing rapidly. One major reason is that MapReduce is a very developer-friendly framework. The MapReduce paradigm was designed to encapsulate all the underlying complexities
of distributed computing within a single framework, which allows developers to focus on the logic of data
analytics rather than worry about how to distribute and process data on many machines.
There is no doubt that MapReduce greatly simplifies development of large-scale data processing tasks.
However, writing functional MapReduce code is just one part of your overall development effort. When you
are working with Big Data, you are typically processing data at the terabyte or petabyte scale. Given the
scale of the data, you can expect your MapReduce job to run for hours or even days to deliver the results.
Therefore, understanding how to analyze, fix, and fine-tune the performance of jobs is an extremely
important skill for Hadoop developers.
Benefits of Tuning
In our experiments, it became clear that performance of the same MapReduce job can differ significantly
before and after tuning.
For example, we wrote a simple Hive query that took more than 10 hours to complete on its first execution.
Hive provides a high-level, SQL-like query language whose queries are executed as MapReduce jobs. The query
that we ran collected results from a 3 TB dataset on a Hadoop cluster with 10 nodes. That means that each
machine processed about 0.5 GB/minute.
With this much data, you might expect it to be slow. In fact, Hadoop is not built for fast queries, but to
provide a scalable system. Therefore, the expectation is that, if one machine takes N hours to perform a
task, N machines would take about 1 hour to perform the same task.
However, in the course of analyzing performance, we determined that disk I/O was the major bottleneck,
and that this forced other cluster resources (CPU and RAM) to wait. Because of the data size, far too much time
was spent in massive disk reads and writes. Based on this discovery we reconfigured the Hadoop job to
enable data compression at several job stages.
On rerunning the same query, it took only 2.5 hours to complete, with well-balanced CPU, RAM,
and I/O utilization during job execution. This example illustrates how tuning can significantly improve job
performance.
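The figures in this example can be checked with a few lines of arithmetic. The following Python sketch is illustrative only: the function names are ours and are not part of any Hadoop API. It assumes that 1 TB equals 1,000 GB, that the data is spread evenly across the nodes, and that the first run took exactly 10 hours (the actual run took somewhat longer).

```python
def per_node_throughput_gb_per_min(dataset_tb, nodes, hours):
    """Average GB processed per node per minute (assumes 1 TB = 1000 GB, even spread)."""
    gb_per_node = dataset_tb * 1000 / nodes
    return gb_per_node / (hours * 60)


def speedup(hours_before, hours_after):
    """How many times faster the tuned run is than the original run."""
    return hours_before / hours_after


before = per_node_throughput_gb_per_min(3, 10, 10)    # first run
after = per_node_throughput_gb_per_min(3, 10, 2.5)    # run with compression enabled
print(before, after, speedup(10, 2.5))                # 0.5 2.0 4.0
```

Under these assumptions, the tuned job moves about four times as much data per node per minute as the original job.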
How a MapReduce Job Works
For the purposes of this document, we shall consider a Hadoop job to include any of the following: a
MapReduce application, a Hive query, or a PigLatin script.
For any MapReduce job, Shuffle and Sort are the key processes that are internally handled by the
MapReduce framework; they are also the key area of focus for performance tuning. Depending on the
particular job, these processes might consume considerable resources including CPU, RAM, disk I/O, and
network bandwidth, as the job execution shifts from the Map phase to the Reduce phase.
Because Shuffle and Sort is a very complex, multi-stage pipeline, it is important to have a good
understanding of how this part of a job works, before performing MapReduce job tuning. To gain this
foundation, we recommend that you read the book by Tom White, Hadoop: The Definitive Guide 2. There is
a detailed description of the Shuffle and Sort process starting on page 177, so we will not repeat the same
content in this document.
The Definitive Guide also provides an excellent diagram, which illustrates the stages of the process that are
relevant to tuning. We have provided an annotated version below, but you can refer to the original
on page 178 of Hadoop: The Definitive Guide 2 (Figure 6-4. Shuffle and Sort in MapReduce).
Figure 1. Stages of a Hadoop job
The important steps of a job are identified as follows:
Stage 1 – Input for Map phase
Stage 2 – Computation for Map phase
Stage 3 – Partition and sort for Map phase
Stage 4 – Output for Map phase
Stage 5 – Copying Map output
Stage 6 – Merging and sorting during the Reduce phase
Stage 7 – Computation for the Reduce phase
Stage 8 – Output of the Reduce phase
We will refer to these stages frequently, as we illustrate the problems that occur in each phase, and discuss
tuning techniques that can be applied at each step.
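If you record tuning observations in scripts, it can help to give the eight stages stable names. The following Python sketch is one possible encoding. The identifiers are our own and do not correspond to any Hadoop class. Treating Stage 5 as part of the Reduce side is our grouping, based on the fact that Reduce tasks fetch the Map output.

```python
from enum import IntEnum


class Stage(IntEnum):
    """The eight job stages, numbered as in the list above (names are illustrative)."""
    MAP_INPUT = 1
    MAP_COMPUTE = 2
    MAP_PARTITION_SORT = 3
    MAP_OUTPUT = 4
    COPY_MAP_OUTPUT = 5
    REDUCE_MERGE_SORT = 6
    REDUCE_COMPUTE = 7
    REDUCE_OUTPUT = 8


def phase(stage):
    """Stages 1-4 run inside Map tasks; stages 5-8 are grouped with the Reduce side."""
    return "map" if stage <= Stage.MAP_OUTPUT else "reduce"


print(Stage(3).name, phase(Stage(3)))   # MAP_PARTITION_SORT map
print(Stage(5).name, phase(Stage(5)))   # COPY_MAP_OUTPUT reduce
```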
Performance Tuning Cycle
Hadoop resources can be classified into computation, memory, network bandwidth, and input and output
to storage (I/O). A job can run slowly if any of these resources performs badly. Therefore, improving
performance in Hadoop jobs is about more than turning a few knobs; it is about understanding your
configuration and the requirements of your job, and achieving the right balance.
As our earlier example shows, having balanced resources on your cluster is key to achieving high job
performance. If you notice that a job takes an unreasonably long time to complete, very likely some
resources have reached their limits, which prevents the job from running faster.
The first step in optimizing a job is to determine which of these resources is the principal bottleneck, and
to find the cause of the resource limitation. We describe the tools to do this in the section, Identifying
Bottlenecks.
Figure 2. The performance tuning cycle
Once you have determined the source of the principal resource limitation, you can apply the tuning
techniques we describe, either to remove bottlenecks, or to change the way that resources are used for a
particular job. We provide techniques for resolving each type of issue in the section, Resolving Bottlenecks.
Typically, optimization is an iterative process. After you have solved one bottleneck, you might find
another one. We recommend that you run many cycles of a job to compare performance indicators for
each run, until you are certain that you have located all bottlenecks and have applied the most effective
tuning techniques.
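The cycle described above can be sketched as a simple loop: run the job, measure, fix the principal bottleneck, and repeat until no bottleneck remains. The Python below is a minimal illustration, not a tool we used. The helper functions, the utilization numbers, the 0.95 saturation limit, and the "compress" setting are all hypothetical stand-ins for real job runs and real performance counters.

```python
def tuning_cycle(run_job, pick_bottleneck, fixes, max_rounds=10):
    """Run, measure, fix the principal bottleneck, and repeat until none remains."""
    config = {}
    history = []
    for _ in range(max_rounds):
        metrics = run_job(config)
        history.append(metrics)
        bottleneck = pick_bottleneck(metrics)
        if bottleneck is None:
            break
        config.update(fixes[bottleneck])
    return config, history


# Toy stand-in for a real job run: returns made-up utilization fractions.
def fake_run(config):
    if config.get("compress"):
        return {"cpu": 0.70, "io": 0.60}
    return {"cpu": 0.25, "io": 0.98}


# Toy stand-in for bottleneck analysis: the busiest resource, if it is saturated.
def saturated(metrics, limit=0.95):
    busiest = max(metrics, key=metrics.get)
    return busiest if metrics[busiest] >= limit else None


final_config, runs = tuning_cycle(fake_run, saturated, {"io": {"compress": True}})
print(final_config, len(runs))   # {'compress': True} 2
```

In this toy run, the first cycle finds I/O saturated and applies the compression fix; the second cycle finds no saturated resource and stops.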
Common Resource Bottlenecks
The figure presented earlier, which was adapted from Tom White’s book, Hadoop: The Definitive Guide,
reminds us that a MapReduce job is in fact a pipeline with many stages, each stage requiring different types
of resources. From this point of view, if you want your MapReduce job to run at full speed, you must ensure
that the pipeline is clear throughout, without any resource bottlenecks.
However, remember that it can be counterproductive to focus on a single technique. The fundamental goal
of job tuning is to ensure that, given a particular job and a given cluster configuration, all available
resources (CPU, RAM, I/O, and network) are used in a balanced way. Slowdowns typically occur when one
resource becomes a bottleneck and forces other resources to wait.
For example, during an I/O intensive job (such as a simple SELECT…WHERE Hive query that writes a huge
number of rows to HDFS), the MapReduce framework spends much more time on disk reads and writes
than on executing the actual MapReduce code. In such jobs, CPU resources might be under-utilized, while
I/O is over-utilized and acts as the bottleneck that slows down the overall job.
However, over-utilization of I/O can happen in many stages in the pipeline, so you need to identify the
exact source of the bottleneck before applying any solution. Depending on the phase where the intense I/O
is occurring, techniques could include enabling compression or reducing the amount of data spilled during
the intermediate Map phase.
In this section we describe most of the resources that are required by Hadoop to complete a MapReduce job.
CPU Usage
CPU is the key resource used when processing data in Stage 2 (Map computation) and Stage 7 (Reduce
computation). We monitor CPU utilization by using Windows performance counters.
High CPU utilization is the result of intensive computation in the Map code or Reduce code. A
computationally intensive job usually requires more CPU resources but little network bandwidth and less I/O.
High CPU utilization can act as one type of bottleneck. However, our experience tells us that the most
common computation bottleneck is actually insufficient CPU utilization. The reason should be obvious: with
today’s hardware, the CPU can process massive data much faster than other resources, such as storage I/O
and network, can move the data. So most of the time, while data is flowing through the MapReduce
pipeline, the CPU is waiting for other resources to feed in data before it can proceed to the actual
computation.
Low CPU utilization can also be caused by inappropriate values in the job configuration. For example, if you
set the number of Map or Reduce tasks to a value that is too conservative, the job might under-utilize the
computational capacity of the cluster.
The amount of RAM available on the task tracker nodes is another potential bottleneck that can have
significant effects. If you have memory-intensive workloads and/or the memory settings are not properly
configured, the Map and Reduce tasks might be initiated but immediately fail. Depending on whether or
not this is a transient issue, Hadoop might eventually succeed in running the task, but the retry operation
itself imposes overhead, and in many cases the job will simply fail to execute. You must be careful to
configure memory usage based on the number of tasks and the physical memory you have on each node.
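A quick sanity check for this kind of configuration is to compare the combined heap of all task slots on a node with the physical memory that remains after the operating system and the Hadoop daemons take their share. The Python sketch below is a rough lower bound only, because a JVM uses more memory than its heap. The function name and the 2,048 MB reserve are our assumptions. In Apache Hadoop 1.x, the inputs typically come from properties such as mapred.tasktracker.map.tasks.maximum, mapred.tasktracker.reduce.tasks.maximum, and the -Xmx option in mapred.child.java.opts.

```python
def memory_fits(map_slots, reduce_slots, heap_mb_per_task, physical_mb, reserved_mb=2048):
    """True if all task JVM heaps fit in the RAM left after a reserve for OS and daemons.

    reserved_mb is an assumed allowance, not a value taken from this document.
    """
    needed = (map_slots + reduce_slots) * heap_mb_per_task
    return needed <= physical_mb - reserved_mb


# Hypothetical node with 16 GB of RAM and 8 Map + 4 Reduce slots.
print(memory_fits(8, 4, 1024, 16384))   # True: 12 GB of heap, 14 GB available
print(memory_fits(8, 4, 2048, 16384))   # False: 24 GB of heap, 14 GB available
```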
Network Bandwidth
High network utilization occurs when large amounts of data travel among nodes. Most often, this happens
when Reduce tasks pull data from Map tasks in the Shuffle phase, and also when the job outputs the final
results into HDFS.
To ensure that network utilization does not become a bottleneck, you need to be aware of the current
network bandwidth used in a particular job, as well as the maximum network bandwidth available. This
number can be determined by performing a stress test of the Hadoop cluster. By constantly monitoring
MapReduce network utilization, you will be able to figure out if the cluster has sufficient network bandwidth
to move data efficiently.
Storage I/O
Storage I/O is probably the most important resource for running a MapReduce job. In our experience, I/O
throughput is also one of the most common bottlenecks. It can decrease MapReduce job performance
across the board, and become a bottleneck at every stage of the execution pipeline.
Storage I/O utilization can be monitored by using Windows performance counters. We recommend that
you learn your cluster’s maximum I/O throughput by running a stress test beforehand, so that you can
determine when your job has encountered a bottleneck. Storage I/O utilization heavily depends on the
volume of input, intermediate data, and final output data. If the bottleneck is in storage I/O, almost all
techniques for reducing data size will help.
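Once you have stress-test maximums for a cluster, comparing them with the values observed during a job is straightforward. The following Python sketch shows one way to flag resources that are running close to their measured limit. The function names, the 0.9 threshold, and the sample numbers (in MB/s) are hypothetical; real values would come from your own stress tests and performance counters.

```python
def utilization(observed, maximum):
    """Fraction of the stress-tested maximum in use, per resource."""
    return {name: observed[name] / maximum[name] for name in observed}


def likely_bottlenecks(observed, maximum, threshold=0.9):
    """Resources running at or above the threshold fraction of their measured maximum."""
    usage = utilization(observed, maximum)
    return sorted(name for name, fraction in usage.items() if fraction >= threshold)


baseline = {"disk": 400.0, "network": 110.0}    # hypothetical stress-test maximums
during_job = {"disk": 390.0, "network": 35.0}   # hypothetical values seen during a job
print(likely_bottlenecks(during_job, baseline))   # ['disk']
```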
Identifying Bottlenecks
In this section, we will describe some of the common issues in Hadoop jobs, and explain the methods that
we use to monitor a job and to identify bottlenecks as they occur.
In the following section, Resolving Bottlenecks, we will provide ways to resolve these common issues.
Overview of the Performance Tuning Cycle
When you consider optimizing a MapReduce job, you probably already feel that the job is taking too long,
and are looking for changes that will make the job finish faster. However, before you put any effort into
solving individual bottlenecks, you must always take into account the overall configuration of the cluster,
and the type of job. Having a good balance of resources across your cluster is the real key to achieving
high job performance.
Hadoop resources can be classified into computation, memory, network bandwidth, and storage I/O. If you
notice that a job runs for an unreasonably long time, very likely the cause is a resource limit at some phase
of the job, which prevents the job from running faster. Such resource limitations are the principal
bottlenecks that will be described in the rest of this section.
The first step towards job optimization is identifying bottlenecks and finding resource limitations. The next
step is to remove bottlenecks by applying the tuning techniques described in this paper, to change the way
that resources are used in the context of a particular job.
Typically, optimization is a repeatable process that you have to run for many cycles, comparing the
benchmarks for each run, until you are certain that you have located the bottlenecks and found the most
effective tuning techniques.
Massive I/O Caused by Large Input Data in Map Input Stage
In the Map input phase (Stage 1), source data is read into the Map tasks and used for calculation. This
process of reading data consumes disk I/O. If disk I/O is not fast enough, computation resources will be idle
and spend most of the job time waiting for the incoming data.
Therefore, performance can be constrained by disk I/O. This problem happens most often on jobs with light
computation and large volumes of source data.
The following indicators can be used to look for I/O issues:
Job counters: Bytes Read, HDFS_BYTES_READ
Windows performance counters: I/O counters, network counters, processor counters
Possible exceptions: Not encountered during our tests.
To view the volume of input data for Map tasks, you can use the job counters Bytes Read and
HDFS_BYTES_READ, which are circled in red in the following log excerpt:
Figure 3. Large input data
This snapshot of the log shows the two counters that you should focus on. In this particular case, you can
see that the input data is only around 100 GB, not large at all by Hadoop standards. However, when you
see huge numbers for both of these counters, the job is dealing with massive I/O while reading from a large
input file. Therefore, to reduce the overall job completion time, it might be helpful to apply the tuning
techniques described in Section 3 to improve I/O read performance.
Tip: When I/O is a bottleneck, check I/O performance counters in Windows Task Manager. These counters
will usually exhibit high utilization, especially for read operations.
Massive I/O Caused by Spilled Records in the Partition and Sort Stage
During the Partition and Sort stage (Stage 3), if you want to optimize the Map task, your goal should be to
ensure that records are spilled (that is, written from the in-memory buffer to local disk) only once. The
reason is that when records are spilled more than once, the data must be read back in and then written out
again, causing multiple drains on I/O.
Records typically have to be spilled to disk more than once when the configured buffer size is too small.
In other words, the buffer is used up too quickly, leading to multiple spills. Each extra spill
generates a large volume of data in intermediate bytes.
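To get a feel for when multiple spills occur, you can estimate the number of spill files from the size of a Map task's output and the size of the sort buffer. The Python sketch below is a deliberate simplification: it ignores the portion of the buffer that is reserved for record metadata, and the function name is ours. The defaults of 100 MB and 0.80 correspond to the io.sort.mb and io.sort.spill.percent defaults in Apache Hadoop 1.x; check the values in your own configuration.

```python
import math


def estimated_spills(map_output_mb, sort_buffer_mb=100, spill_threshold=0.80):
    """Rough number of spill files that one Map task writes.

    Simplification: ignores the share of the buffer reserved for record metadata.
    """
    usable_mb = sort_buffer_mb * spill_threshold
    return max(1, math.ceil(map_output_mb / usable_mb))


print(estimated_spills(64))                        # 1
print(estimated_spills(256))                       # 4
print(estimated_spills(256, sort_buffer_mb=400))   # 1
```

In this estimate, a Map task that produces 256 MB of output spills about four times with a 100 MB buffer, but only once if the buffer is enlarged to 400 MB.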
The following indicators can help you identify spilled records:
Job counters: FILE_BYTES_READ, FILE_BYTES_WRITTEN, Spilled Records
Windows performance counters: I/O performance counters
Possible exceptions: Not encountered during our tests.
The following example shows the log from a job with 100 GB data processed throughout the entire pipeline.
You can view the data size in these counters: HDFS_BYTES_READ (Map phase), FILE_BYTES_READ (Reduce
phase), and Bytes Written (Reduce phase).
Figure 4. Counters showing spilled records
Note the numbers that are circled in red. From these counters, you can see that Map tasks read 152 GB of
data from the local disk and wrote another 223 GB to disk. Typically, the counter FILE_BYTES_READ for the
Map phase should have a very small value, close to zero. The value for FILE_BYTES_WRITTEN in the
Map phase should also be close to the size of the data input for the Reduce phase, around 100 GB. The
fact that there is a gap between the numbers suggests that a large amount of intermediate data is being
generated as part of sorting in the Map phase.
The other highlighted number, Spilled Records, confirms that the job has very high disk I/O caused by the
large amount of intermediate records written to storage after the Map phase.
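The size of the gap can be worked out directly from the counter values quoted above. The following Python sketch is our own illustration, and the function name is hypothetical. It treats all local reads in the Map phase as spill overhead, and treats any Map-side writes beyond the Reduce input size as extra spill writes.

```python
def spill_overhead_gb(map_file_read_gb, map_file_written_gb, reduce_input_gb):
    """Extra local-disk traffic in the Map phase beyond a single spill."""
    extra_read = map_file_read_gb                           # ideally close to zero
    extra_written = map_file_written_gb - reduce_input_gb   # ideally close to zero
    return extra_read, extra_written, extra_read + extra_written


# Counter values from this job: 152 GB read, 223 GB written, about 100 GB of Reduce input.
print(spill_overhead_gb(152, 223, 100))   # (152, 123, 275)
```

By this estimate, the extra spills caused roughly 275 GB of avoidable disk traffic on a job that processes only 100 GB of data.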
When this type of bottleneck is happening, review the I/O counters in Windows Task Manager. Typically
they will exhibit high utilization for both read and write operations.
To get more detailed information about spilled records, you can open the syslog logs for the mapper. For
example, the following screenshot shows the log from a TeraSort job. (For more information about TeraSort
and other jobs that are commonly used for measuring Hadoop job performance, see the related paper,
Performance of Hadoop in Hyper-V.)
Figure 5. Log showing spilled records in TeraSort job
The heavy disk I/O is due to spilled records. Note that the value for buffer size during the I/O sorting
process is 100 MB. You can also see where the mapper has spilled records multiple times when the buffer is
full.
Massive I/O or Network Traffic Caused by Large Map Output
In the Map output phase (Stage 4), Hadoop background daemons merge sorted partitions to disk. Next,
this Map output is read and copied to the Reduce task or tasks. In the process, many I/O
operations occur as the data is written, read, or transferred among processes. Thus, large output from the
Map phase can cause longer I/O and data transfer time, and in extreme cases can raise exceptions, if all the
I/O throughput channels are saturated or if network bandwidth is exhausted.
You can use the following indicators to investigate the volume of Map output:
Job counters: FILE_BYTES_WRITTEN, FILE_BYTES_READ, Combine Input Records
Windows performance counters: I/O counters, network counters, processor counters
Possible exceptions: java.io.IOException
The following log shows the values of the counters FILE_BYTES_WRITTEN (Map phase) and
FILE_BYTES_READ (Reduce phase). Ideally, these values should be fairly close. But if you notice a big gap
between the numbers, the problem might be the amount of intermediate spilled data, from Stage 3.
By examining the byte size in these two counters, you can judge whether the volume of output data from
the Map phase is too big, which could possibly lead to a performance problem.
Figure 6. Large map output
When this bottleneck occurs, you can view the I/O and network performance counters in the Windows Task
Manager. Both of these counters should exhibit high utilization, and most likely one of these counters will
have hit its maximum value, depending on which is faster, I/O or network.
Based on our experiments, very large output from the Map phase can sometimes trigger various kinds of
java.io.IOException, which can cause the task to fail. The following excerpt shows one exception
for a failed task, as written to the log file.
Figure 7. Java exception in log file
Massive I/O and Network Traffic Caused by Large Output Data in Reduce Output
In the output stage of the Reduce phase (Stage 8), reducers write their output to HDFS, requiring a lot of
I/O write operations.
The values for the counters Bytes Written (Reduce phase) and HDFS_BYTES_WRITTEN (Reduce phase)
indicate the volume of data. However, it is important to note that these two counters do not include the
replication factor. If the replication setting is larger than one, it means that blocks of data will be replicated
to different nodes, which consumes more I/O for read and write operations, and which also uses network
bandwidth. Therefore, bottlenecks on disk I/O or network bandwidth could cause slow job performance or
even exceptions in affected Reduce tasks.
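Because the counters exclude replication, it is worth estimating the real write volume yourself. The Python sketch below multiplies the reported output size by the replication factor. The function name is ours, and the network estimate assumes that the first replica of each block is written on the node that runs the Reduce task, so that only the remaining replicas cross the network. In Apache Hadoop, the replication factor is normally set with the dfs.replication property.

```python
def replicated_write_gb(hdfs_bytes_written_gb, replication):
    """Total disk writes and approximate network transfer for the job output.

    Assumes the first replica is written locally, so only the other
    (replication - 1) copies travel over the network.
    """
    disk_gb = hdfs_bytes_written_gb * replication
    network_gb = hdfs_bytes_written_gb * (replication - 1)
    return disk_gb, network_gb


print(replicated_write_gb(100, 1))   # (100, 0)
print(replicated_write_gb(100, 3))   # (300, 200)
```

With a replication factor of 3, a job that reports 100 GB in HDFS_BYTES_WRITTEN actually writes about 300 GB to disk across the cluster, and sends about 200 GB of that over the network.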
You can view the following indicators to check for this condition:
Job counters: Bytes Written, HDFS_BYTES_WRITTEN
Windows performance counters: I/O counters, network counters, processor counters
Possible exceptions: java.io.IOException
Figure 8. Checking volume of data as HDFS bytes
If this bottleneck occurs, and if the replication factor is 1 (that is, data is not being copied among nodes),
you should be able to observe high values for I/O counters in the Windows Task Manager. However,
network counters could be low due to lack of substantial network traffic.
If the job uses a replication factor greater than 1, data will be replicated across the network, so you will
usually observe high values in counters for both I/O and network usage. In some cases, the large job output
can trigger a java.io.IOException and cause the task to fail.
The following log excerpt shows the exception and the task failure messages.
Figure 9. Java exception caused by large job output
Insufficient Concurrent Tasks Caused by Improper Configuration
If the number of tasks running concurrently is insufficient to perform the job, the job can leave many
resources idle. Increasing the number of concurrent tasks helps to accelerate the overall job speed by better
utilizing resources.
The number of concurrent running tasks is determined by two configuration factors: the first is the total
capacity of Map and Reduce slots for the cluster, and the second is the number of Map and Reduce tasks
configured for each particular job.
You can use the following indicators to check for underutilization of resources:
Task Summary List: Num Tasks, Running, Map Task Capacity, Reduce Task Capacity
Windows performance counters: I/O counters, network counters, processor counters
Possible exceptions: Not encountered during our tests.
To view the total Map and Reduce task capacity for the cluster, use the Map/Reduce Administration page,
shown in the following diagram.
Figure 10. Cluster’s total map/reduce task capacity
The Job Detail page, shown below, provides the number of Map and Reduce tasks available to the job.
In cases where either the cluster or the job has been configured with insufficient capacity for concurrent
tasks, you will usually observe low utilization on all counters in Windows Task Manager, due to under-configured computation resources. For example, if your data node machine has 12 CPU cores but you
configured only 4 mappers on each machine, the remaining 8 cores will be idle because no workload will be
assigned to them. As a result, only a limited portion of I/O throughput and network bandwidth will
actually be utilized, because there are no requests coming from the other CPU cores.
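The relationship between slots, tasks, and idle cores can be sketched as follows. This is an illustration only: the class and method names are hypothetical, and it assumes for simplicity that each running task keeps exactly one core busy.

```java
// Sketch only: shows how slot and task counts bound concurrency.
final class SlotUtilization {
    // Tasks that can run at once are limited by both the cluster slots
    // and the number of tasks configured for the job.
    static int concurrentTasks(int clusterSlots, int jobTasks) {
        return Math.min(clusterSlots, jobTasks);
    }

    // Cores left idle on one node (simplifying assumption: one core per task).
    static int idleCores(int coresPerNode, int mappersPerNode) {
        return Math.max(0, coresPerNode - mappersPerNode);
    }

    public static void main(String[] args) {
        // Values from the example in the text: 12 cores, 4 mappers per machine.
        System.out.println("Idle cores per node: " + idleCores(12, 4));
        // Hypothetical job: 8 slots available, but only 2 tasks configured.
        System.out.println("Concurrent tasks: " + concurrentTasks(8, 2));
    }
}
```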
Figure 11. Details of a job
Long-Tailed Reducers Caused by Skew Data Distribution
In the computation stage of a Reduce task (Stage 7), partitions resulting from multiple mappers are merged
into sorted key-value pair groups, and are fed to different reducers depending on the way that keys have
been distributed.
However, the size of each data group can vary significantly if the keys are not evenly distributed. When this
happens, the amount of data received by different reducers might be imbalanced, a phenomenon which is
called skew data distribution. Skew data distribution causes heavy workloads for some Reduce tasks, while
leaving light workloads for others.
During our experiments with some extreme cases, it often happened that most of the Reduce tasks finished
within a few minutes, while a few remaining tasks ran for days. As a result of this imbalance in the
distribution of data, the job also might take days to complete.
We termed this type of bottleneck the “long-tailed reducer” and attempted to minimize or eliminate
skewed distribution to improve performance.
You can view the following indicators to check for skew distribution and for long-running Reduce tasks:
Job Detail page: Num Tasks, Running
Windows performance counters: None
Possible exceptions: None during our testing.
The following excerpt from the Job Detail page shows some long-tailed tasks.
Figure 12. Long-tailed tasks
Such tasks can be easily identified, because most of the other Reduce tasks have finished, but there are a
few tasks that continue running for a much longer time.
We recommend that you use the task counter pages to determine how serious the skew
is in your job. To do this, open the task counter page for each Reduce task, save the values of counters
such as Reduce input records, and then compare those values across tasks. This review process will help
you identify differences in workload among different Reduce tasks.
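Once the per-task counter values have been collected, the comparison can be reduced to a single number. The following sketch computes the ratio of the largest Reduce input records value to the median value; the class name, the metric, and the sample counts are assumptions for illustration, not something Hadoop reports directly.

```java
import java.util.Arrays;

// Sketch only: summarizes skew from per-task "Reduce input records" values.
final class ReducerSkew {
    // Ratio of the largest per-task record count to the median count.
    // Requires a non-empty array; a ratio near 1.0 means an even distribution.
    static double skewRatio(long[] reduceInputRecords) {
        long[] sorted = reduceInputRecords.clone();
        Arrays.sort(sorted);
        long max = sorted[sorted.length - 1];
        long median = sorted[sorted.length / 2];
        if (median == 0) {
            return max == 0 ? 1.0 : Double.POSITIVE_INFINITY;
        }
        return (double) max / median;
    }

    public static void main(String[] args) {
        // Hypothetical counter values copied from five Reduce task pages.
        long[] counts = {1_000, 1_100, 950, 1_050, 250_000};
        System.out.printf("max/median skew ratio = %.1f%n", skewRatio(counts));
    }
}
```

A large ratio suggests that one or more reducers receive far more records than the rest and are likely to become long-tailed tasks.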
Single Reducer Caused by Hive ‘ORDER BY’ Clause
Hive is a SQL-like language that helps non-developers easily perform ad-hoc queries against data stored in
Hadoop. However, one limitation of Hive is that the ORDER BY clause has not been optimized for
leveraging multiple reducers. In its current implementation, the Hive ORDER BY clause can only run as a
single reducer job; therefore, if you perform a global data sort while output is being written, performance
can suffer. When you run a query that requires global sorting against a large data set, the query is likely to
be extremely slow.
You can view the following indicators to check for this problem:
Job detail page: Number of Reduce Tasks
Windows performance counters: I/O counters, network counters, processor counters
Possible exceptions: Not experienced during our tests.
The following diagram shows a portion of the Job Detail page. From this you can see that only one Reduce
task has been assigned to the ORDER BY operation.
Figure 13. Hive sort on single reducer
When a Hive query containing an ORDER BY clause is running using a single reducer, Windows
performance counters on all nodes except the query node should exhibit very low or no utilization, because
most resources will have no work to do. However, the node that is running the single reducer will exhibit
high utilization of I/O and very high computation.
Insufficient Memory and Task Hangs During Attempt Caused by Misconfigured Memory Allocation in Tasks
The amount of RAM that is available on the task tracker nodes can act as a bottleneck that can have
significant effects. A task tracker node can be starved for memory when an attempted job requires a lot of
memory and the configuration doesn’t allow it to respond.
For the purpose of diagnostics, this problem tends to manifest itself in the form of memory related error
messages, such as “java.lang.OutOfMemoryError: Java heap space”.
For example, one customer migrated their workload from an on-premises environment with large-memory
nodes to an Azure-hosted solution. By default, the Azure workers are configured to run four (4) mappers
per node, so each worker processes four separate streams of data. The data was compressed with a non-splittable format and the blobs were quite large (1 GB). The worker ran out of memory while attempting to
decompress the data.
In the worst-case scenario, the workers can hang at the attempt phase, if large amounts of memory are
allocated to the attempts, and memory pressure pushes the system into an unhealthy or unresponsive state.
This happens because Hadoop has too many attempts running at the same time, or the JVM settings allow
the attempts to expand beyond healthy limits, or both.
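A quick sanity check for this condition is to compare the worst-case heap demand of all concurrent attempts with the memory available on the node. The sketch below is a simplification that counts JVM heap only; the class name, the method names, and every number except the four mappers per node are hypothetical.

```java
// Sketch only: checks whether concurrent attempts can exhaust node memory.
final class AttemptMemoryCheck {
    // Worst-case heap demand if every concurrent attempt grows to its JVM maximum.
    static long worstCaseHeapMb(int concurrentAttempts, long maxHeapPerAttemptMb) {
        return concurrentAttempts * maxHeapPerAttemptMb;
    }

    // True if the worst case still fits after reserving memory for the OS and daemons.
    static boolean fits(int concurrentAttempts, long maxHeapPerAttemptMb,
                        long nodeRamMb, long reservedMb) {
        return worstCaseHeapMb(concurrentAttempts, maxHeapPerAttemptMb)
                <= nodeRamMb - reservedMb;
    }

    public static void main(String[] args) {
        int mappersPerNode = 4;      // default mentioned in the text
        long nodeRamMb = 7_168;      // hypothetical node size
        long reservedMb = 1_024;     // hypothetical reserve for OS and Hadoop daemons
        for (long heapMb : new long[] {1_024, 2_048}) {
            System.out.printf("heap=%d MB per attempt, fits=%b%n",
                    heapMb, fits(mappersPerNode, heapMb, nodeRamMb, reservedMb));
        }
    }
}
```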
When this problem happens, you should be able to see high memory values in the Perfmon traces, while
other resources (I/O, network utilization, and CPU) remain relatively low. Another scenario in which this
problem was reported was a system with high latency on storage.
You can view the following indicators to check for this problem:
Windows performance counters: RAM counters
Possible exceptions: “java.lang.OutOfMemoryError: Java heap space”
Choosing a Solution
Once bottlenecks are identified, they should be addressed by applying the appropriate tuning techniques.
The following table provides some recommendations for potential tuning techniques for each of the
bottlenecks described in the previous section. Each technique is described in more detail in the next section.
Bottleneck or Issue | Tuning Techniques
Massive I/O caused by large input data (Stage 1) | Compress source data
Massive I/O caused by large spilled records (Stage 3) | Reduce spilled records from Map tasks
Massive I/O or network traffic caused by large Map output (Stage 4) | Compress Map output
Massive I/O and network traffic caused by large output data (Stage 8) | Compress job output; Change replication; Hive – Compress query intermediate output; Implement a combiner; Hive – Compress query output
Insufficient running tasks caused by improper configuration | Change number of MapReduce tasks
Insufficient memory; job hangs at attempt due to misconfigured memory allocation in tasks | Adjust memory settings
Long-tailed reducers caused by skew data distribution (Stage 7) | Hive – Auto Map Join
Single reducer caused by Hive ORDER BY operation (Stage 7) | Global sorting; Change number of MapReduce slots
However, we must stress that there is not a one-to-one mapping between the bottlenecks and the solutions
listed, and there is no one-size-fits-all technique for tuning Hadoop jobs.
Because of the architecture of Hadoop, achieving balance among resources is often more effective
than addressing a single problem.
Each of the techniques in the list can be applied to address more than one type of bottleneck.
Depending on the type of job you are running and the amount of data you are moving, the solution
might be quite different.
Therefore, we highly recommend that you try combining different techniques to determine which tuning
solutions are most efficient in the context of your job. In our experience, improving job performance is an
iterative effort that usually takes several cycles before you find the best combination.
Resolving Bottlenecks
This section describes the different techniques we have used. In each section, we summarize the
performance issues, list the configurations that were changed, and compare performance after tuning.
Reduce Spilled Records from Map Tasks
A Map task can spill (meaning, write data to the file system or disk) a large amount of data during the
internal sorting process. Therefore, your goal for optimization should be to ensure that records are spilled
only once. Multiple data spills generate a lot of I/O stress and slow down overall job performance.
Before tuning
The following graphic shows the log of a sample TeraSort job against 100 GB of data, using the default
configuration. Notice that the value of the counter Spilled Records is almost twice as large as the value of
the counter Map output records. From this log you can see that multiple spills occurred for this untuned job.
Total time for this job was 00:21:14.
Figure 14. Default TeraSort job on 100 GB (before)
Configurations to change
To reduce the amount of data spilled during the intermediate Map phase, we adjusted the following
properties for controlling sorting and spilling behavior. The meanings of those properties are taken from
Hadoop: The Definitive Guide 2:
io.sort.mb — Size, in megabytes, of the memory buffer to use while sorting map output.
io.sort.record.percent — Proportion of the memory buffer defined in io.sort.mb that is reserved for storing
record boundaries of the Map outputs. The remaining space is used for the Map output records
io.sort.spill.percent — value that represents the soft limit in either the buffer or record collection buffers.
Once this threshold is reached, a thread will begin to spill the contents to disk. Applies to both the map
output memory buffer and the record boundaries index.
It was not straightforward or intuitive to define the best values for those properties, so in our experiments
we spent some time investigating how to calculate and then set new values on these parameters to
overwrite the default values.
What we learned is that when Map output is being sorted, 16 bytes of metadata are added immediately
before each key-value pair. These 16 bytes include 12 bytes for the key-value offset and 4 bytes for the
indirect-sort index. Therefore, the total buffer space defined in io.sort.mb can be divided into two parts:
metadata buffer and key-value buffer.
Based on this understanding, we used the calculations shown below to derive a new potentially optimized
value setting for these properties:
Calculate io.sort.record.percent.
This value represents the proportion between metadata buffer and key-value buffer.
io.sort.record.percent = 16 / (16 + R)
In this formula, R is the average length of the key-value pairs and can be calculated by dividing the Map
output bytes by the number of Map output records.
In our example, R = 100,000,000,000 / 1,000,000,000 = 100. The formula therefore gives:
io.sort.record.percent = 16 / (16 + 100) = 0.138
Calculate io.sort.mb
Based on the average length of key-value pairs (R) calculated in the preceding step, and the average
number of records per mapper (N), we used the following formula:
io.sort.mb = (16 + R) * N / 1,048,576
N is calculated by dividing the Map output records by the number of map tasks.
The property io.sort.mb must be defined in megabytes, therefore the result should be converted to
megabytes by dividing by 1,048,576.
In our example, the number of map tasks is 373. Therefore, N is 1,000,000,000 records divided by 373
mappers, which gives 2,680,965 records per mapper on average. Hence, io.sort.mb should be
(16 + 100) * 2,680,965 / 1,048,576, or about 297 MB. For convenience, we rounded this up to 300 MB.
Update the io.sort.spill.percent property
The property io.sort.spill.percent defines the threshold after which the data from the Map output buffer will
be spilled to disk. The valid range is 0 to 1.0, with a default value of 0.8. There is no reason to reserve this
additional space, so we configured it to 1.0, to ensure that all buffer space was used.
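The two formulas above can be checked with a few lines of code. The following sketch reproduces the arithmetic for the TeraSort example, using the counter values quoted in the text; the class and method names are illustrative only and are not part of Hadoop.

```java
// Sketch of the spill-tuning arithmetic; class and method names are illustrative.
final class SpillTuning {
    static final double METADATA_BYTES = 16.0;       // per-record metadata described in the text
    static final double BYTES_PER_MB = 1_048_576.0;

    // io.sort.record.percent = 16 / (16 + R)
    static double recordPercent(double avgRecordBytes) {
        return METADATA_BYTES / (METADATA_BYTES + avgRecordBytes);
    }

    // io.sort.mb = (16 + R) * N / 1,048,576
    static double sortMb(double avgRecordBytes, double recordsPerMapper) {
        return (METADATA_BYTES + avgRecordBytes) * recordsPerMapper / BYTES_PER_MB;
    }

    public static void main(String[] args) {
        long mapOutputBytes = 100_000_000_000L;   // counter: Map output bytes
        long mapOutputRecords = 1_000_000_000L;   // counter: Map output records
        int mapTasks = 373;                       // number of map tasks in the job

        double r = (double) mapOutputBytes / mapOutputRecords;  // average record length R
        double n = (double) mapOutputRecords / mapTasks;        // records per mapper N

        System.out.printf("io.sort.record.percent = %.3f%n", recordPercent(r));
        System.out.printf("io.sort.mb (rounded up) = %.0f%n", Math.ceil(sortMb(r, n)));
    }
}
```

To apply the same calculation to another job, substitute that job's Map output bytes, Map output records, and number of map tasks.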
After tuning
The following figure shows the log when the same job was run after tuning.
Figure 15. TeraSort job after spill optimization
The number of spilled records in the Map phase was reduced to 1,000,000,000, as specified in the
configuration. This value is identical to the number of Map output records, which means that spilling from
the Map output occurred just once. This optimization alone roughly halved the time spent on the job, to
Implement a Combiner
We used the sample Word Count2 application to demonstrate how to implement tuning on a job that
includes a combiner. You can run the application without a combiner, but including a combiner can
improve performance, as we will explain.
Before tuning
The following figure shows the log from a Word Count job without a combiner implementation. Because we
did not use a combiner, the values for Combine input records and Combine output records are zero, and
the number for Reduce input records is the same as for Map output records.
To learn more about this sample application, see the MapReduce Tutorial, or see our related paper, Performance of Hadoop on
Windows Hyper-V.
Figure 16. Records transferred from Map to Reduce phase, before use of combiner
Configurations to change
Implementing a combiner would improve performance in this job. To enable a combiner in the job, we
added the highlighted line into the existing code:
Add a combiner
job.setCombinerClass(Reduce.class); //new line of code to set a combiner
The combiner aggregates the map output first, and then the reducer takes pre-aggregated key-value
groups from the combiner’s output.
Adding a combiner requires that you create a combiner class and implement it in your MapReduce code. In
this example, we used the same class for the combiner as for the reducer, since they share the same
aggregation algorithm. This is a common way of implementing a combiner.
After tuning
The table below shows the improvement in performance after running the same job with a combiner
implemented as part of the job.
Figure 17. Records transferred from Map to Reduce phase, after use of combiner
The number of Map output records remained unchanged (629,187). However, the values for Combine input
records and Combine output records are now nonzero, due to the addition of the combiner in the job.
Notably, the number of Reduce input records dropped from 629,187 to 102,300. This means that I/O
from mapper output writes and from reducer reads both declined by a factor of about 6,
compared to the original process. This demonstrates that, in big jobs that deal with large data, enabling a
combiner to reduce I/O pressure is a very efficient way to improve overall performance.
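The effect of a combiner on record counts can be illustrated without a cluster. The sketch below is plain Java, not Hadoop code: it treats each string as the input of one hypothetical Map task and counts how many records would be handed to the Reduce phase with and without local pre-aggregation. The class and method names are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: simulates Word Count record counts with and without a combiner.
final class CombinerSketch {
    // Splits one map input into non-empty whitespace-separated words.
    static List<String> tokens(String split) {
        List<String> out = new ArrayList<>();
        for (String t : split.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Without a combiner, every word becomes one (word, 1) record for the reducers.
    static int recordsWithoutCombiner(List<String> splits) {
        int records = 0;
        for (String split : splits) {
            records += tokens(split).size();
        }
        return records;
    }

    // With a combiner, each map task emits one (word, count) record per distinct word.
    static int recordsWithCombiner(List<String> splits) {
        int records = 0;
        for (String split : splits) {
            Map<String, Integer> counts = new HashMap<>();
            for (String t : tokens(split)) {
                counts.merge(t, 1, Integer::sum);
            }
            records += counts.size();
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> splits = List.of("to be or not to be", "be quick be brief");
        System.out.println("Without combiner: " + recordsWithoutCombiner(splits));
        System.out.println("With combiner:    " + recordsWithCombiner(splits));
    }
}
```

The more often words repeat within a single map input, the larger the reduction, which is why the benefit grows with data volume.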
Compress Map Output
Using compression in Hadoop jobs is an important way to achieve increased performance at very little
For more detailed information about different types of compression and their effectiveness in different phases of a Hadoop job,
see our white paper, Compression in Hadoop.
The next example shows the results when we tested the effects of Map output compression by running a
sample TeraSort job against 100 GB of data, with and without compression.
Before tuning
The following table shows the log before compression on the map output was enabled. Around 100 GB of
data was written to the file system by the TeraSort job, and almost the same amount of data was read by
the reducers. This is very inefficient.
Figure 18. Bytes written in Map phase, before compression
Configurations to change
To improve performance on this job, we enabled compression on the output of the Map phase.
In the default Hadoop configuration, the value of the parameter, mapred.compress.map.output, is false,
and the property mapred.output.compression.type is set to RECORD. To compress the output of the Map
phase, we simply need to edit the job configuration file and change the first property to true. We also
changed the value for mapred.output.compression.type from RECORD to BLOCK, because our tests found
that a higher compression ratio is achieved with the type BLOCK.
mapred.compress.map.output — Change value to true to enable compression.
mapred.output.compression.type — Change value to BLOCK.
Enable compression on map blocks
After tuning
The following figure shows the log after we ran the same job but with compression enabled on the Map
output. Note that the values for FILE_BYTES_READ and FILE_BYTES_WRITTEN dropped to around 15 GB,
compared to over 102 GB before.
Figure 19. Bytes written in Map phase, after compression
The benefits of this change are two-fold: enabling compression on the Map output enables relatively higher
computation usage on the Hadoop cluster, and it also significantly reduces network bandwidth and storage
I/O. As an overall result, any jobs that generate large output in the Map phase can potentially run faster
with compression enabled on the Map output.
Compress Job Output
We also studied the effect of compressing output of the overall job.
Before tuning
The following figure shows the log from a job that ran the Word Count application, with no changes made
to configuration.
Figure 20. Word Count job before job output compression
Note that before any job tuning, the total size of files output to HDFS was around 880 KB. (The time of the
job is not included here because the file we processed was already fairly small. Our point is really to show
the compression ratio.)
Configurations to change
To enable compression of the job’s output, we added the following lines of code into the source code for
the Word Count application. Note that these are essentially the same parameters described in the previous
section. Namely, we enabled compression, and we changed compression to use blocks instead of on
Enable compression on job output
job.getConfiguration().set("mapred.output.compression.type", "BLOCK");
Tip: You can also change these settings by using the Hadoop command line and the ‘-D’ parameters4.
For the complete list of commands in Hadoop, see: http://hadoop.apache.org/docs/r0.19.1/commands_manual.html. For more
about compression, the tools and its uses, see our white paper, Compression in Hadoop.
After tuning
The following log shows the same job, run after compression was enabled on the job output. The size of
the output files written to HDFS was reduced to 304 KB, compared to 880 KB before.
Figure 21. Word Count job after job output compression
The benefit of enabling compression on job output might not seem so impressive on a small Word Count
application, but if the final phase of your job performs intensive I/O with many writes, enabling compression
might save a lot of storage and make your job complete faster.
However, one drawback of this technique is that the final output data will be stored in HDFS as compressed
data; therefore, you must apply decompression before the data can be consumed by other processes.
Compress Input Source Data
Compression of input files provides two major benefits: it saves storage space (especially when using
replication) and speeds up data transfer.
Before tuning
In one experiment, we applied the DEFLATE codec, which is the default compression format used in
Hadoop, to compress input data, and reduced its size from 152 GB to 33 GB. Our MapReduce job
completion time dropped from 376 seconds when processing the original data to 108 seconds when
processing the compressed data.
Configurations to change
Files can be compressed before they are loaded into Hadoop, or the files can be compressed inside
Hadoop processes, to leverage the advantage of parallel computing. You can also use Hive to compress
Because compression provides so many advantages, we wrote a separate white paper, titled Compression in
Hadoop Jobs, that describes the tools and processes that we have tested for enabling compression in all
phases of Hadoop jobs.
We also recommend that readers review Tom White’s book, Hadoop: The Definitive Guide,
particularly the section on “Compression” beginning on page 77. It describes the choices for compression
formats and codecs.
After tuning
On the other hand, “all compression algorithms exhibit a space/time trade-off”5. From the resource point of
view, the choice of whether or not to use compression is a trade-off between reducing network traffic and
disk I/O on one side, and increasing CPU demand on the other, because compressing and decompressing
files costs extra CPU time.
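The space/time trade-off can be observed with the JDK's own DEFLATE implementation. The following sketch uses java.util.zip.Deflater, not Hadoop's codec classes, to compress a block of repetitive sample data at two compression levels and print the resulting size and elapsed time. The class name and the sample data are assumptions; actual ratios and timings depend on your data and hardware.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Sketch only: measures DEFLATE output size and time at different levels.
final class DeflateSketch {
    // Compresses the input at the given level and returns the compressed size in bytes.
    static int compressedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);  // bytes produced in this pass
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical, highly repetitive sample data.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20_000; i++) {
            sb.append("record-").append(i % 100).append('\n');
        }
        byte[] input = sb.toString().getBytes(StandardCharsets.UTF_8);

        for (int level : new int[] {Deflater.BEST_SPEED, Deflater.BEST_COMPRESSION}) {
            long start = System.nanoTime();
            int size = compressedSize(input, level);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.printf("level=%d input=%d compressed=%d time=%dus%n",
                    level, input.length, size, micros);
        }
    }
}
```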
Change Number of Replications
Replication in the context of Hadoop means that each block of a file is copied to multiple nodes to provide
data redundancy. Although this is an important feature of Hadoop, copying the data among nodes
costs a large amount of network bandwidth and storage I/O. If your data does not require this level of
redundancy, reducing the number of replications (that is, copying data to fewer nodes) is a good way to
reduce job running time and avoid performance issues.
Before tuning
The following screenshot shows the file copying activity when a 7 GB data file was written into HDFS with
three (3) replications.
Figure 22. Data being copied when replicated to three nodes
Just copying the data to the nodes took 249 seconds in total.
Hadoop: The Definitive Guide 2, page 78.
Configurations to change
The property dfs.replication controls the number of replications of a job output.
Change replications in MapReduce
To change the number of replications in a MapReduce job, you can add the following lines to the job
configuration file.
Change replications in Hive
To change the number of replications used in a Hive job, use the following command in the session:
hive> set dfs.replication=1;
After tuning
After changing the job to use a single replication, we measured the job activity when processing the same 7
GB of data. This time the job wrote all results to HDFS and ended in only 116 seconds. Reducing the
replication factor from 3 to 1 clearly shortened completion time, from around 4 minutes (249 seconds) to less
than 2 minutes.
Hive – Compress Query Output
The principle underlying the use of compression in MapReduce jobs also applies to compression in Hive,
because internally Hive queries are compiled to MapReduce jobs. The only difference is that Hive has its
own set of parameters to control compression.
In this test, we used a simple GROUP BY query to test the effects of Hive output compression.
Before tuning
The following log shows the job before we changed any parameters. The size of the output file was around
Figure 23. Reduce output before Hive compression
Configurations to change
Enable compression within Hive process
Next, we changed the setting, hive.exec.compress.output, as follows:
hive> set hive.exec.compress.output=true;
For more information about this setting, see our white paper, Compression in Hadoop Jobs.
After tuning
The following log shows the job after enabling compression.
Figure 24. Reduce output after Hive compression
Hive – Compress Intermediate Output
A complex Hive query is usually converted to a series of multi-stage MapReduce jobs after submission, and
these jobs will be chained up by the Hive engine to complete the entire query. So “intermediate output”
here refers to the output from the previous MapReduce job, which will be used to feed the next
MapReduce job as input data.
In this test, we assessed the effect of compressing the intermediate output of a Hive job, by using a two-stage query against 500 MB of data. Because the query is compiled as a two-stage chained MapReduce
job, the intermediate output is actually the output of the Reduce phase of the first job, which in turn serves
as the input to the second job.
Before tuning
The following figure illustrates the size of output data from the first job. In the original job, the size of files in the
Reduce phase (see HDFS_BYTES_WRITTEN) is around 89 MB.
Figure 25. Intermediate output before compression
Configurations to change
We used two properties, hive.exec.compress.intermediate and mapred.output.compression.type, to
compress the intermediate output. We changed the value of mapred.output.compression.type to BLOCK
because, in our tests, higher compression ratios were achieved when using BLOCK.
Compress intermediate output
hive> set hive.exec.compress.intermediate=true;
hive> set mapred.output.compression.type=BLOCK;
After tuning
The following diagram shows the log of the same job, after compression had been enabled on the
intermediate Hive files.
Figure 26. Intermediate output after compression
With Hive compression enabled, the size of the intermediate output files dropped to around 23 MB,
meaning that a 75% reduction in size was achieved through compression.
Hive – Auto Map Join
In Hive, Auto Map-Join is a very useful feature when you are joining a big table to a small table. When you
enable this feature, the small table will be saved in the local cache on each node, and then joined with the
big table in the Map phase.
In terms of performance, enabling Auto Map Join in Hive provides at least two advantages. First, loading a
small table into cache will save read time on each data node. Second, it avoids skew joins (meaning
imbalanced joins) in the Hive query, since the join operation has been already done in the Map phase for
each block of data.
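The idea behind a map-side join can be sketched in plain Java. This is a conceptual illustration, not Hive's implementation: the small table is held in an in-memory hash map (standing in for the cached copy on each node), and each row of the big table is joined by a lookup as it streams through the Map phase. All names and sample rows are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: joins streamed big-table rows against a cached small table.
final class MapSideJoinSketch {
    // Each big-table row is {joinKey, value}; the small table maps joinKey to its value.
    static List<String> join(Map<String, String> smallTable, List<String[]> bigTableRows) {
        List<String> out = new ArrayList<>();
        for (String[] row : bigTableRows) {
            String match = smallTable.get(row[0]);  // hash lookup, no shuffle to reducers
            if (match != null) {
                out.add(row[0] + "," + row[1] + "," + match);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> countries = new HashMap<>();  // hypothetical small table
        countries.put("1", "US");
        countries.put("2", "DE");

        List<String[]> orders = new ArrayList<>();        // hypothetical big-table block
        orders.add(new String[] {"1", "order-100"});
        orders.add(new String[] {"3", "order-101"});
        orders.add(new String[] {"2", "order-102"});

        for (String line : join(countries, orders)) {
            System.out.println(line);
        }
    }
}
```

Because every block of the big table is joined independently, no single reducer has to process all rows for a heavily used key.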
To assess the effectiveness of using Auto Map Join in a Hive query, we ran a two-stage Hive query that
contains both GROUP BY and JOIN operations, to process a 200 GB dataset.
Before tuning
The original query, without Auto Map Join enabled, took 2869 seconds (or 47.8 minutes) to complete. The
longest Reduce task lasted 34 minutes.
Configurations to change
To enable the Auto Map-Join feature, we just changed the value of the property hive.auto.convert.join to
true. Under this configuration, any table that is smaller than the size defined in the property,
hive.smalltable.filesize, will be treated as a small table, and will be copied to each of the task nodes as a
cached table.
Enable Auto Map Join
hive> set hive.auto.convert.join=true;
hive> set hive.smalltable.filesize=<filesize_threshold>; (optional)
The default value of hive.smalltable.filesize is 25 MB.
After tuning
After enabling Auto Map Join, the same query completed in 1101 seconds (or 18.3 minutes). The problem
with one Reduce task lasting a very long time was eliminated, because the job did not need any Reduce
tasks at all in the first stage.6
Change the Number of Map Tasks and Reduce Tasks
For a MapReduce job, changing the number of Map tasks and/or Reduce tasks can significantly affect
performance, because those two settings are directly associated with the amount of cluster resources used
in a job. Performance can suffer whether you allocate too many resources or too few, so be careful to
choose the right values, and try different combinations to determine the right balance of resources.
For our tests of this tuning technique, we changed the number of Map tasks in a job running the TeraGen
In Appendix A we describe a related tuning technique, called bucketed mapjoin, which we expect to be
useful; however, we have not had the opportunity to test its effectiveness and so have not included it here.
For information about the TeraGen test, see the related paper, Performance of Hadoop on Hyper-V.
Before tuning
The following table shows the log of the Map and Reduce tasks during the job, before any tuning.
Figure 27. Number of Map tasks before
Note the number of Map tasks. This is a typical example of what we call under-configuration. In this
particular case, the cluster has enough computational resources to run at least eight Map tasks
concurrently, but the job is only configured to use two map tasks, which fails to utilize available CPU
Configurations to change
To change the number of Map tasks, the following properties need to be set, using an appropriate value, in
the job configuration file.
To change the number of Reduce tasks you use the property, mapred.reduce.tasks. Note that this just
shows you where to change the values. The actual calculation of what value would be appropriate depends
on your cluster configuration.
After tuning
After the properties were changed, we re-ran the job. The following table shows the log from the same job.
Note that the number of Map and Reduce tasks has changed.
Figure 28. Number of Map tasks after
After tuning the configuration, the number of Map tasks increased to eight. Since the cluster had the
resources to support more mappers, we noticed a job performance boost immediately, with more mappers
working in parallel.
Change the Number of Map Reduce Slots
Increasing the number of slots available to a Hadoop cluster’s total Map and Reduce tasks will enable more
tasks to run concurrently, which leverages more CPU, RAM, I/O and network resources to work on the job.
As long as your cluster has enough resource capacity to support more Map and/or Reduce tasks, increasing
the number of total task slots generally improves job performance.
However, providing too many resources can hurt performance if resources are unbalanced, because such a
configuration can lead to resource bottlenecks and more failures. Therefore, choosing the right number of
slots is the key to getting the best performance out of your cluster, and to keeping your cluster running
To help you determine potential values for the appropriate number of Map and Reduce slots, we have
provided links to resources that describe cluster tuning strategies in more detail. See the Resources section
for details; the next section provides a summary and highlights a few tips.
Before tuning
The following figure shows the configuration of a single-node Hadoop cluster configured with a maximum
of four Map tasks.
Figure 29. Maximum Map tasks before
The next figure shows the log while the job was running with the configuration. As you can see, the job is
using the maximum number of concurrent mappers that is allowed. If the job runs slowly, you might try
increasing this value.
Figure 30. Number of running Map tasks before
Configurations to change
There are two global properties that you can change to affect the number of Map slots or Reduce slots:
mapred.tasktracker.reduce.tasks.maximum and mapred.tasktracker.map.tasks.maximum. Note that the total
Map task capacity of the cluster is the sum of each data node’s capacity. For this example, we were using a
two-node cluster.
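The capacity arithmetic described above can be sketched in Python as follows. The node names, the function names, and the task counts are hypothetical, chosen only to illustrate how per-node slot limits add up and how many "waves" of Map tasks a job would need.

```python
import math


def cluster_map_capacity(map_slots_per_node):
    """Total Map slots: the sum of each node's
    mapred.tasktracker.map.tasks.maximum value."""
    return sum(map_slots_per_node.values())


def map_waves(num_map_tasks, capacity):
    """Number of sequential waves needed to run all Map tasks."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return math.ceil(num_map_tasks / capacity)


if __name__ == "__main__":
    # Hypothetical two-node cluster, four Map slots per node.
    slots = {"node1": 4, "node2": 4}
    capacity = cluster_map_capacity(slots)
    print(capacity, map_waves(20, capacity))
```

Fewer waves generally means less elapsed time, provided that the nodes have the resources to run that many tasks at once.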
Increase Map slots
The following procedure describes how to change the Map slot capacity on a data node. The method for
increasing the number of Reduce slots in Hadoop is almost the same as the method for increasing the Map slots.
1. Stop the cluster.
2. On each node, edit the file mapred-site.xml.
3. Change the property mapred.tasktracker.map.tasks.maximum as shown in the following code:
4. Start the cluster.
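If you need to apply the change in step 3 to many nodes, a small script can help. The following Python sketch updates (or adds) a property in the text of a Hadoop configuration file such as mapred-site.xml. The helper name set_property, the sample XML, and the values 2 and 4 are assumptions for illustration only; the sketch works on a string so that it can be tried without touching a real cluster.

```python
import xml.etree.ElementTree as ET

# Hypothetical starting content of mapred-site.xml.
SAMPLE_SITE_XML = """<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>"""


def set_property(config_xml, name, value):
    """Return config_xml with the named property set to value."""
    root = ET.fromstring(config_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = str(value)
            break
    else:
        # The property was not present, so append a new entry.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(set_property(SAMPLE_SITE_XML,
                       "mapred.tasktracker.map.tasks.maximum", 4))
```

In practice you would read the file on each node, pass its content to the helper, and write the result back before restarting the cluster.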
After tuning
After making this change, the number of available (maximum) Map slots increased to eight, because we
were using a two-node cluster configured to use a maximum of four tasks per node. For performance, this
means that eight Map tasks can run concurrently on the cluster, as shown in the following job configuration.
Figure 31. Maximum Map tasks after
The following figure shows the log of the job running after more Map slots have been added. From this log,
it is clear that the MapReduce job is now using the maximum Map slots that are available, which was eight.
Figure 32. Running Map tasks after
Improve Global Sorting
The ORDER BY clause is supported in Hive to globally sort the data and order output by key values.
However, Hive is not optimized to perform global sorting with multiple reducers. At the time of this writing,
Hive can use only one reducer to sort the entire data set, which can make the query extremely slow when
sorting large data volumes.
Before tuning
An example will illustrate how slow a job can be when using Hive to perform an ORDER BY operation.
In this job, a 20 GB source data file in an HDFS directory was mapped to a Hive table
named t1. The following ORDER BY query took 3823 seconds to sort t1.
hive> insert overwrite table t2 select * from t1 order by col1;
Configurations to change
Perform global sorting using PigLatin
The easiest way to perform fast global sorting is to use PigLatin, which is a programming language and
runtime environment for working with Hadoop data. Unlike Hive, PigLatin has been optimized to
automatically perform parallel sorting. Therefore, you can generally expect to see some performance
improvement after conversion to PigLatin.
The following code shows a PigLatin script which is equivalent to a Hive query containing a global ORDER
BY clause.
-- Load the source data for t1 with an explicit schema.
a = load '<HDFS_PATH_TO_t1>' as (col1:chararray, col2:chararray, col3:chararray, col4:chararray,
    col5:chararray, col6:chararray, col7:chararray, col8:chararray, col9:chararray, col10:chararray,
    col11:chararray, col12:chararray, col13:chararray, col14:long);
-- Sort all rows globally by col1.
b = order a by col1;
-- Write the sorted rows to the location of t2.
store b into '<HDFS_PATH_TO_t2>';
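To show why a global sort can be spread over several reducers, the following Python sketch illustrates the general technique of sampling keys, choosing range boundaries, sorting each range independently, and concatenating the results. This is a simplified illustration of range-partitioned sorting, not Pig's actual implementation; all function names and parameters are hypothetical.

```python
import bisect
import random


def choose_boundaries(sample, num_partitions):
    """Pick num_partitions - 1 split points from a sample of keys."""
    ordered = sorted(sample)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def range_partitioned_sort(keys, num_partitions, sample_size=100, seed=0):
    """Sort keys by range-partitioning them across 'reducers'."""
    if not keys:
        return []
    rng = random.Random(seed)
    sample = rng.sample(keys, min(sample_size, len(keys)))
    boundaries = choose_boundaries(sample, num_partitions)
    partitions = [[] for _ in range(num_partitions)]
    for key in keys:
        # Every key in partition i sorts before every key in partition i + 1.
        partitions[bisect.bisect_right(boundaries, key)].append(key)
    result = []
    for part in partitions:
        # Each partition could be sorted by a separate reducer in parallel.
        result.extend(sorted(part))
    return result


if __name__ == "__main__":
    data = [random.randint(0, 1000) for _ in range(20)]
    print(range_partitioned_sort(data, num_partitions=4))
```

Because the partitions cover disjoint key ranges, the concatenated output is globally ordered even though no single worker sorts the whole data set.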
After tuning
The PigLatin script took 1250 seconds to return the results, which is significantly faster than the Hive ORDER
BY query (3823 seconds).
In another experiment, we compared the performance of equivalent Hive and PigLatin queries, like those
above, using a 120 GB data set. When performing a global sort of this larger data set, the Hive query took
13.5 hours, whereas the same job using Pig and PigLatin completed the sorting operation in only 2 hours
and 10 minutes.
Adjust Memory Settings
Problems with insufficient memory can arise when a task attempt requires more memory than the
configuration provides, which can leave the worker node unable to respond. The solution
is to increase the memory limit for the JVMs used for the attempt, by configuring the JVM settings of the
Map or Reduce tasks so that the job obtains more memory.
For example, the following configuration property sets the maximum JVM heap size for each Reduce task to
2 GB:
mapred.reduce.child.java.opts = -Xmx2g
A job can hang on the first or subsequent attempts if large amounts of memory are allocated to the
attempts, and memory pressure pushes the system into an unhealthy or unresponsive state. The suggested
solution is to scale down the number of Mappers and Reducers by configuring the properties,
mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum.
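Before changing heap sizes or slot counts, it can help to estimate the worst-case heap demand on a node when every slot is busy. The following Python sketch does that arithmetic. The function names are hypothetical, the sketch considers only the -Xmx heap limit (not other JVM or operating system memory), and the example slot counts are assumptions.

```python
import re

_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_xmx(java_opts):
    """Return the -Xmx heap limit in bytes, or None if it is not set."""
    match = re.search(r"-Xmx(\d+)([kKmMgG]?)", java_opts)
    if match is None:
        return None
    number, unit = match.groups()
    return int(number) * _UNITS.get(unit.lower(), 1)


def worst_case_heap_bytes(map_slots, reduce_slots, map_opts, reduce_opts):
    """Heap demand on one node if every Map and Reduce slot is in use."""
    map_heap = parse_xmx(map_opts)
    reduce_heap = parse_xmx(reduce_opts)
    if map_heap is None or reduce_heap is None:
        raise ValueError("both option strings must contain -Xmx")
    return map_slots * map_heap + reduce_slots * reduce_heap


if __name__ == "__main__":
    # Hypothetical node: four Map slots at 1 GB, two Reduce slots at 2 GB.
    total = worst_case_heap_bytes(4, 2, "-Xmx1024m", "-Xmx2g")
    print(total / 1024 ** 3, "GB")
```

If the estimate approaches or exceeds the physical memory of the node, reduce the slot counts or the heap sizes.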
In any memory-related scenario, you should also consider adjusting the value of
mapred.job.reuse.jvm.num.tasks. This property specifies the number of tasks that can reuse the
same JVM. Reusing a JVM reduces the execution time, since there is no need to create a new JVM for each task. Conversely,
not reusing a JVM reduces memory pressure, because each attempt starts in a new JVM that does not have any
memory already allocated. You can consider it a tradeoff between JVM startup overhead and memory usage.
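The overhead side of this tradeoff can be sketched with a simple count of JVM launches. The following Python function is a simplified model that assumes the tasks belong to one job and run one after another in a single slot; the function name is hypothetical.

```python
import math


def jvm_launches(num_tasks, reuse_limit):
    """JVMs started for num_tasks sequential tasks in one slot.

    reuse_limit models mapred.job.reuse.jvm.num.tasks:
    1 means a new JVM per task, -1 means no limit.
    """
    if num_tasks < 0:
        raise ValueError("num_tasks must not be negative")
    if num_tasks == 0:
        return 0
    if reuse_limit == -1:
        return 1
    if reuse_limit < 1:
        raise ValueError("reuse_limit must be -1 or at least 1")
    return math.ceil(num_tasks / reuse_limit)


if __name__ == "__main__":
    for limit in (1, 4, -1):
        print(limit, jvm_launches(10, limit))
```

A lower launch count saves startup time, but each long-lived JVM may carry memory allocated by earlier tasks.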
Note: Our team did not encounter memory issues in our testing,
since we were fairly cautious in determining requirements and allocating resources. However, testing by
other teams revealed some scenarios where memory can be an issue. Although we have not validated those
findings, after reviewing the test results provided by the other teams we felt that it was important to share
the scenarios and recommendations above.
Conclusion
We have described a selection of techniques that you can use to identify bottlenecks in Hadoop jobs. We
also provided the corresponding techniques that you can apply to tune your Hadoop jobs and potentially
improve performance.
There are many ways to improve the performance of Hadoop jobs, but in our testing we have found that
the process of identifying bottlenecks and applying solutions is one you will repeat often. It is also critical to
approach performance in a balanced manner.
Future Work
We were unable to test all the potential tuning techniques and combinations of techniques, due to time
constraints. We have summarized only the most common and useful tuning techniques, along with a few
MapReduce properties.
However, if you are interested in other properties that might be useful for tuning, we have provided a full
list of tuning properties for Shuffle and Sort. Please refer to Appendix B. We encourage you to experiment
with these and to report your results.
In addition to the techniques covered in this document, the previously cited book by Tom White, Hadoop:
The Definitive Guide, describes a number of other tuning properties and techniques. Because the properties
are described in different chapters, and can be hard to find, we have provided a summary of all the
properties from The Definitive Guide for your reference as well (see Appendix C).
About the Authors
The Microsoft IT Big Data Program is the result of collaboration between multiple enterprise services
organizations working in partnership with Microsoft’s HDInsight product development team. This group has
been tasked with assessing the usefulness of Hadoop for corporate and client applications, and with
providing a reliable, cost-effective platform for future Big Data solutions within Microsoft. Towards this end,
the group has extensively researched the Hadoop offerings from Apache, and has implemented Hadoop
production and test environments of all sizes, using the HDInsight platform for Windows Azure and for on-premises Windows.
The Microsoft IT Big Data Program is publishing its research to the community at large so that others can
benefit from this experience and make more efficient use of Hadoop for data-intensive applications.
We would like to gratefully acknowledge our colleagues who read these papers and provided technical
feedback and support: Pedro Urbina Escos (HDInsight test), Karan Gulati (CSS), Larry Franks, and Cindy
Gross (SQL CAT).
Appendix A: Join Optimization in Hive
The following information was originally published on the HDInsight team blog when the release was code-named Isotope. We have taken the liberty of editing, summarizing, and clarifying where possible.
Hive can perform joins on big data sets (TB range). However, it takes time for a complex join on big data
sets to finish since the underlying MapReduce job involves shuffling large amounts of data in the Reduce
phase. There are a couple of ways to optimize the join operation in Hive. MapJoin is one of them.
The idea of MapJoin is that, if one of the tables in the join operation is relatively small (in the range of MB or
GB), you can avoid the Reduce phase entirely by copying the smaller table to all the nodes and
performing the join in the mapper.
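The idea can be sketched in Python as a hash join: the small table is loaded into an in-memory lookup structure, and each row of the large table is matched as it streams through the mapper. This is an illustration of the concept only, not Hive's implementation; the function name and the (key, value) row layout are assumptions.

```python
from collections import defaultdict


def map_side_join(big_rows, small_rows):
    """Inner-join two iterables of (key, value) rows without a shuffle."""
    # Build the in-memory lookup from the small table (copied to every node).
    lookup = defaultdict(list)
    for key, value in small_rows:
        lookup[key].append(value)
    # Stream the big table; each mapper only sees its own split.
    for key, value in big_rows:
        for small_value in lookup.get(key, ()):
            yield key, value, small_value


if __name__ == "__main__":
    big = [(1, "a"), (2, "b"), (3, "c")]
    small = [(1, "x"), (3, "y"), (3, "z")]
    print(list(map_side_join(big, small)))
```

The approach works only while the small table fits comfortably in the memory of each task.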
Further improvement can be achieved if both tables in the join are bucketed (satisfying certain conditions).
In this case, instead of the entire smaller table being copied to all nodes, only a portion of the data will be
copied to the nodes, with each node getting a different portion of the data.
The following figure shows the performance of different joins on a single node cluster.
This particular comparison was done on a desktop computer used for development, using two tables, one
about 7GB and the smaller one only 11KB. Other background processes were running, so you should not
read too much into the actual numbers, but focus instead on the relative difference between the results.
To perform a MapJoin in Hive, use the following syntax:
SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a join b on a.key = b.key;
The table name used in /*+ MAPJOIN(…) */ is the smaller of the two tables, and will be distributed to all
nodes.
When both tables are bucketed and the number of buckets in one table is a multiple of the number of buckets in the other, the same MapJoin statement
above will be further optimized as a bucketed MapJoin if the following parameter is set:
set hive.optimize.bucketmapjoin = true;
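The reason for the "multiple" condition on bucket counts can be seen with a little arithmetic. The following Python sketch assumes that a row is assigned to a bucket by taking its key modulo the bucket count, and uses non-negative integer keys for simplicity. The function name is hypothetical, and the sketch illustrates the arithmetic rather than Hive's internal code.

```python
def small_buckets_for(big_bucket, big_count, small_count):
    """Buckets of the small table that can match one bucket of the big table.

    Assumes bucket = key % bucket_count for both tables.
    """
    if big_count % small_count == 0:
        # Several big buckets map onto the same small bucket.
        return [big_bucket % small_count]
    if small_count % big_count == 0:
        # One big bucket needs several small buckets.
        return [b for b in range(small_count) if b % big_count == big_bucket]
    raise ValueError("bucket counts must be multiples of each other")


if __name__ == "__main__":
    print(small_buckets_for(5, big_count=8, small_count=4))
    print(small_buckets_for(1, big_count=4, small_count=8))
```

Because each mapper needs only the matching buckets, far less of the small table has to be copied to each node.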
If the two tables are not only bucketed but also sorted, and if the two tables have the same number of
buckets, the above MapJoin will be even further optimized to a sort-merge join (whose result is not
shown in the figure above).
To enable a sort merge join, we need to set the following parameters in Hive:
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
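To illustrate why sorted buckets help, the following Python sketch joins two inputs that are already sorted by key in a single pass, without building an in-memory lookup of either side. It illustrates the general sort-merge technique, not Hive's implementation; the function name and the (key, value) row layout are assumptions.

```python
def sort_merge_join(left, right):
    """Inner-join two lists of (key, value) rows, each sorted by key."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of equal keys on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == left_key:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Emit every pairing within the two runs.
            for _, left_value in left[i:i_end]:
                for _, right_value in right[j:j_end]:
                    result.append((left_key, left_value, right_value))
            i, j = i_end, j_end
    return result


if __name__ == "__main__":
    left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
    right = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
    print(sort_merge_join(left, right))
```

Each input is read once from start to end, which is why matching, sorted buckets can be joined with very little memory.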
The limitation of MapJoin is that we cannot perform a FULL or RIGHT OUTER JOIN.
MapJoin in Hive is enabled in the most recent build of HDInsight. Please try it out. As always, your
feedback is welcome.
Appendix B: List of Tuning Properties in Shuffle
The following table lists additional properties that can be used for tuning. Although we did not experience
specific bottlenecks or alter these particular properties in our experiments, modifying these
properties might affect performance in some scenarios.
Table 1. Map phase tuning properties
Property name
Default value
The size, in megabytes, of the memory buffer to use
while sorting map output.
The proportion of io.sort.mb reserved for storing record
boundaries of the map outputs. The remaining space is
used for the map output records themselves.
The threshold usage proportion for both the map
output memory buffer and the record boundaries index
to start the process of spilling to disk.
The maximum number of streams to merge at once
when sorting files. It’s fairly common to increase this to
This property is also used in the Reduce phase.
min.num.spills.for.combine*
The minimum number of spill files needed for the
combiner to run (if a combiner is specified).
The number of worker threads per tasktracker for
serving the map outputs to reducers. This is a cluster-wide setting and cannot be set by individual jobs.
Class name
The compression codec to use for map outputs.
mapred.compress.map.output
Compress map outputs.
1. The values for the properties listed in the table are the default values in Hadoop. For additional
detail on each, please refer to Hadoop: The Definitive Guide, “Shuffle and Sort” (starting on page
2. The properties marked with an asterisk (*) are those that we did not test during our tuning
Table 2. Reduce phase tuning properties
Property name
mapred.reduce.parallel.copies
The number of threads used to copy map outputs to the reducer.
mapred.reduce.copy.backoff
The maximum amount of time, in seconds, to spend retrieving one
map output for a reducer before declaring it as failed. The reducer
may repeatedly reattempt a transfer within this time if it fails (using
exponential backoff).
The maximum number of streams to merge at once when sorting
files. This property is also used in the Map phase.
The proportion of total heap size to be allocated to the map
outputs buffer during the copy phase of the shuffle.
The threshold usage proportion for the map outputs buffer
(defined by mapred.job.shuffle.input.buffer.percent) for starting the
process of merging the outputs and spilling to disk.
The threshold number of map outputs for starting the process of
merging the outputs and spilling to disk. A value of 0 or less means
there is no threshold, and the spill behavior is governed solely by
The proportion of total heap size to be used for retaining map
outputs in memory during the Reduce phase. For the Reduce
phase to begin, the size of Map outputs in memory must be no
more than this size. By default, all Map outputs are merged to disk
before the Reduce phase begins, to give the reducers as much
memory as possible. However, if your reducers require less
memory, this value may be increased to minimize the number of
trips to disk.
Appendix C: Miscellaneous Tuning Properties
The table below lists most of the properties related to performance tuning that are provided in various
places in Tom White's book, Hadoop: The Definitive Guide 2.
Please note that the default values of some properties in HDInsight might be different from the values listed
in this table. For example, in HDInsight, the default value of mapred.tasktracker.map.tasks.maximum is 4
and the default value of mapred.child.java.opts is “-Xmx1024m”.
Property name
Default Value

The smallest valid size in bytes for a file split.
    p 202: FileInputFormat input
    p 203: Small files and
    p 204: Preventing splitting

The largest valid size in bytes for a file split.
    same as above

64 MB, that is
The size of a block in HDFS in bytes.
    p 202: FileInputFormat input
    p 203: Small files and
    p 204: Preventing splitting
    p 279: HDFS block size

Whether extra instances of map tasks may be launched if a task is making slow progress.
    p 183: Speculative Execution

Whether extra instances of reduce tasks may be launched if a task is making slow progress.
    p 183: Speculative Execution
The number of map tasks that may be run on a tasktracker at any one time.
    p 273: Important Hadoop Daemon Properties

The number of reduce tasks that may be run on a task tracker at any one time.
    p 269: Memory
    same as above

The maximum number of tasks to run for a given job for each JVM on a task tracker. A value of -1 indicates no limit: the same JVM may be used for all tasks for a job.
    p 184: Task JVM Reuse

mapred.child.java.opts
The JVM options used to launch the task tracker child process that runs map and reduce tasks. This property can be set on a per-job basis, which can be useful for setting JVM properties for debugging, for
    p 273: Important Hadoop Daemon Properties
    p 180: Configuration Tuning
    p 269: Memory
    p 280: Task memory limits
Addendum: One additional property that can affect performance is
Reducers launch when a certain percentage of the mappers are done. The
reducers are then busy pre-fetching data, but they cannot start the actual work of reducing until all the mappers
finish. Therefore, this pre-fetching phase can be highly inefficient, depending on the distribution of key
values and the number of reducers. The result is an unnecessary load on the system, as reducers occupy
slots in concurrent jobs.
The default value is 0%, meaning that the reducers' pre-fetch process starts as soon as even one
mapper that has completed. The following JIRA has been filed for this property:
References and Appendices
Nolan, Carl. Hadoop Streaming and Azure Blob Storage.
Beresford, James. Using Azure Blob Storage as a Data Source for Hadoop on Azure.
MSDN Library. System.IO.Compression
MSDN Library. System.IO.Packaging
White, Tom. Hadoop: The Definitive Guide, 2nd edition. O’Reilly Media, Inc., 2011.
Guinebertière, Benjamin, Philippe Beraud, Rémi Olivier. Leveraging a Hadoop cluster from SQL Server
Integration Services (SSIS)