Hadoop Job Optimization
Microsoft IT SES Enterprise Data Architect Team
2016-02-08

In this document, we describe specific types of Hadoop job configurations that can act as bottlenecks and hurt overall performance. We provide techniques that you can use to remediate each type of bottleneck, and provide suggestions for general optimization of your Hadoop jobs.

Section 1 provides a short introduction to how MapReduce works internally. Section 2 defines a common performance tuning framework that you can use to guide repeatable processes. We explain how to identify resource bottlenecks based on performance indicators. Section 3 describes tuning techniques in detail, to help you address and remove performance bottlenecks. We also provide a matrix to guide selection of which technique to apply to each performance problem.

Table of Contents

Introduction
Overview of Hadoop Performance Tuning
Common Resource Bottlenecks
Identifying Bottlenecks
Choosing a Solution
Resolving Bottlenecks
Conclusion
Appendix A: Join Optimization in Hive
Appendix B: List of Tuning Properties in Shuffle
Appendix C: Miscellaneous Tuning Properties
References and Appendices

List of Figures

Figure 1. Stages of a Hadoop job
Figure 2. The performance tuning cycle
Figure 3. Large input data
Figure 4. Counters showing spilled records
Figure 5. Log showing spilled records in TeraSort job
Figure 6. Large map output
Figure 7. Java exception in log file
Figure 8. Checking volume of data as HDFS bytes
Figure 9. Java exception caused by large job output
Figure 10. Cluster's total map/reduce task capacity
Figure 11. Details of a job
Figure 12. Long-tailed tasks
Figure 13. Hive sort on single reducer
Figure 14. Default TeraSort job on 100 GB (before)
Figure 15. TeraSort job after spill optimization
Figure 16. Records transferred from Map to Reduce phase, before use of combiner
Figure 17. Records transferred from Map to Reduce phase, after use of combiner
Figure 18. Bytes written in Map phase, before compression
Figure 19. Bytes written in Map phase, after compression
Figure 20. Word Count job before job output compression
Figure 21. Word Count job after job output compression
Figure 22. Data being copied when replicated to three nodes
Figure 23. Reduce output before Hive compression
Figure 24. Reduce output after Hive compression
Figure 25. Intermediate output before compression
Figure 26. Intermediate output after compression
Figure 27. Number of Map tasks before
Figure 28. Number of Map tasks after
Figure 29. Maximum Map tasks before
Figure 30. Number of running Map tasks before
Figure 31. Maximum Map tasks after
Figure 32. Running Map tasks after

Introduction

Hadoop is an open source platform for data processing that enables reliable, scalable, and distributed computing on clusters of inexpensive servers. Hadoop is a big ecosystem, but the power of Hadoop comes from its two key components: HDFS and MapReduce. HDFS is a distributed file system that offers built-in replication to store vast amounts of data. MapReduce is a flexible programming model that lets developers write applications to perform parallel data processing. Combined, HDFS and MapReduce make Hadoop well suited to today's big data analytics challenges in terms of Volume, Variety, and Velocity.

However, as with any new tool, there is much to learn. Our team is responsible for building solutions for Microsoft internal customers, and we have experimented extensively with Hadoop. In this process, we have learned that it is important to design and configure Hadoop jobs appropriately to maximize performance.

The goal of this document is to describe the types of configurations that can hurt performance, and to provide techniques that you can use to optimize your Hadoop jobs. Section 1 explains how MapReduce works internally, to set the technical context. Section 2 defines a common performance tuning framework that you can use to guide repeatable processes. We also explain how to identify resource bottlenecks based on performance indicators. Section 3 describes tuning techniques in detail, to help you address and remove performance bottlenecks. We also provide a matrix to guide selection of which technique to apply to each performance problem.

Software Used

We used the following software in our tests:

Microsoft Hadoop on Windows version 1.0.175 (the May 2012 CTP release; code-named Isotope). The core engine of this version corresponds to Apache Hadoop version 0.20.203.1.
Microsoft Isotope Deployment Tool version 1.0.175
Windows Server 2008 R2

The platform that we used in our experiments was Microsoft Hadoop for Windows, now known as HDInsight. At the time of this writing, HDInsight had not been released to the general public, but was available as a downloadable CTP version (May 2012 and later).
Because the core of the HDInsight release is based on the Apache distribution of Hadoop, and it includes most of the native Hadoop components such as HDFS, MapReduce, Hive, Pig, and Mahout, we expect that the contents of this document also apply to optimization of "native" Hadoop jobs. Therefore, we offer our experiences as a generic guide for performance tuning of Hadoop jobs in general.

Scope

We must emphasize that all the optimization techniques described in this document are intended to maximize performance of MapReduce jobs at the level of the individual job. Global tuning techniques that apply more to the Hadoop cluster level, such as cluster capacity allocation, are out of scope for this document.

Overview of Hadoop Performance Tuning

The popularity of Hadoop is growing rapidly. One major reason is that MapReduce is a very developer-friendly framework. The MapReduce paradigm was designed to encapsulate all the underlying complexities of distributed computing within a single framework, which allows developers to focus on the logic of data analytics rather than worry about how to distribute and process data on many machines.

There is no doubt that MapReduce greatly simplifies development of large-scale data processing tasks. However, writing functional MapReduce code is just one part of your overall development effort. When you are working with big data, you are typically processing data at the terabyte or petabyte scale. Given the scale of the data, you can expect your MapReduce job to run for hours or even days to deliver the results. Therefore, understanding how to analyze, fix, and fine-tune the performance of jobs is an extremely important skill for Hadoop developers.

Benefits of Tuning

In our experiments, it became clear that performance of the same MapReduce job can differ significantly before and after tuning. For example, we wrote a simple Hive query which in its first execution took more than 10 hours to complete. (Hive is a high-level, SQL-like query language whose queries are compiled into MapReduce jobs.) The query that we ran collected results from a 3 TB dataset on a Hadoop cluster with 10 nodes, which means that each machine processed about 0.5 GB per minute.

With this much data, you might expect it to be slow. In fact, Hadoop is not built for fast queries, but to provide a scalable system; the expectation is that, if one machine takes N hours to perform a task, N machines would take about 1 hour to perform the same task. However, in the course of analyzing performance, we determined that disk I/O was the major bottleneck, which forced other cluster resources (CPU and RAM) to wait. Because of the data size, far too much time was spent in massive disk reads and writes. Based on this discovery, we reconfigured the Hadoop job to enable data compression at several job stages. On re-running the same query, the query took only 2.5 hours to complete, with well-balanced CPU, RAM, and I/O utilization during job execution. This example illustrates how tuning can significantly improve job speed.

How a MapReduce Job Works

For the purposes of this document, we consider a Hadoop job to include any of the following: a MapReduce application, a Hive query, or a Pig Latin script.
For any MapReduce job, Shuffle and Sort are the key processes that are handled internally by the MapReduce framework, and they are also the key area of focus for performance tuning. Depending on the particular job, these processes might consume considerable resources, including CPU, RAM, disk I/O, and network bandwidth, as the job execution shifts from the Map phase to the Reduce phase.

Because Shuffle and Sort is a very complex, multi-stage pipeline, it is important to have a good understanding of how this part of a job works before performing MapReduce job tuning. To gain this foundation, we recommend that you read the book by Tom White, Hadoop: The Definitive Guide, Second Edition. There is a detailed description of the Shuffle and Sort process starting on page 177, so we will not repeat the same content in this document. The Definitive Guide also provides an excellent diagram, which illustrates the different stages of the process that are involved during tuning. We have provided an annotated version below, but you can refer to the original on page 178 of Hadoop: The Definitive Guide (Figure 6-4, Shuffle and Sort in MapReduce).

Figure 1. Stages of a Hadoop job

The important steps of a job are identified as follows:

Stage 1 – Input for the Map phase
Stage 2 – Computation for the Map phase
Stage 3 – Partition and sort for the Map phase
Stage 4 – Output for the Map phase
Stage 5 – Copying Map output
Stage 6 – Merging and sorting during the Reduce phase
Stage 7 – Computation for the Reduce phase
Stage 8 – Output of the Reduce phase

We will refer to these stages frequently, as we illustrate the problems that occur in each phase and discuss tuning techniques that can be applied at each step.

Performance Tuning Cycle

Hadoop resources can be classified into computation, memory, network bandwidth, and storage input/output (I/O). A job can run slowly if any of these resources performs badly. Therefore, improving performance in Hadoop jobs is about more than turning a few knobs; it is about understanding your configuration and the requirements of your job, and achieving the right balance.

As our earlier example shows, having balanced resources on your cluster is key to achieving high job performance. If you notice that a job takes an unreasonably long time to complete, very likely some resources have reached their limits, which prevents the job from running faster. The first step in optimizing a job is to determine which of these resources is the principal bottleneck, and to find the cause of the resource limitation. We describe the tools for doing this in the section Identifying Bottlenecks.

Figure 2. The performance tuning cycle

Once you have determined the source of the principal resource limitation, you can apply the tuning techniques we describe, either to remove bottlenecks or to change the way that resources are used for a particular job. We provide techniques for resolving each type of issue in the section Resolving Bottlenecks.

Typically, optimization is a repeatable process. After you have solved one bottleneck, you might find another one. We recommend that you run many cycles of a job to compare performance indicators for each run, until you are certain that you have located all bottlenecks and have applied the most effective tuning techniques.
Common Resource Bottlenecks

The figure presented earlier, which was adapted from Tom White's book, Hadoop: The Definitive Guide, reminds us that a MapReduce job is in fact a pipeline with many stages, each stage requiring different types of resources. From this point of view, if you want your MapReduce job to run at full speed, you must ensure that the pipeline is clear throughout, without any resource bottlenecks.

However, remember that it can be counterproductive to focus on a single technique. The fundamental goal of job tuning is to ensure that, given a particular job and a given cluster configuration, all available resources (CPU, RAM, I/O, and network) are used in a balanced way. Slowdowns typically occur when one resource becomes a bottleneck and forces other resources to wait. For example, during an I/O-intensive job (such as a simple SELECT…WHERE Hive query that writes a huge number of rows to HDFS), the MapReduce framework spends much more time on disk reads and writes than on executing the actual MapReduce code. In such jobs, CPU resources might be under-utilized, while I/O is over-utilized and acts as the bottleneck that slows down the overall job. However, over-utilization of I/O can happen in many stages of the pipeline, so you need to identify the exact source of the bottleneck before applying any solution. Depending on the phase where the intense I/O is occurring, techniques could include enabling compression or reducing the amount of data spilled during the intermediate Map phase.

In this section we describe most of the resources that are required by Hadoop to complete a MapReduce job.

CPU Usage

CPU is the key resource used when processing data in Stage 2 (Map computation) and Stage 7 (Reduce computation). We monitor CPU utilization by using Windows performance counters. High CPU utilization is the result of intensive computation in the Map code or Reduce code. A computationally intensive job usually requires more CPU resources but little network bandwidth and less I/O throughput.

High CPU utilization can act as one type of bottleneck. However, our experience tells us that the most common computation bottleneck is actually insufficient CPU utilization. The reason should be obvious: with today's hardware, the CPU can process massive data much faster than other resources, such as storage I/O and network, can move the data. So most of the time, while data is flowing through the MapReduce pipeline, the CPU is waiting for other resources to feed in data before it can proceed to the actual computation. Low CPU utilization can also be caused by inappropriate values in the job configuration. For example, if you set the number of Map or Reduce tasks to a value that is too conservative, the job might under-utilize the computational capacity of the cluster.

RAM

The amount of RAM available on the task tracker nodes is another potential bottleneck that can have significant effects. If you have memory-intensive workloads and/or the memory settings are not properly configured, the Map and Reduce tasks might be initiated but immediately fail. Depending on whether or not this is a transient issue, Hadoop might eventually succeed in running the task, but the retry operation itself imposes overhead, and in many cases the job will simply fail to execute. You must be careful to configure memory usage based on the number of tasks and the physical memory you have on each node, as sketched below.
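As a simple illustration of the kind of setting involved (the values here are examples only, not recommendations, and job submission is omitted), the per-task JVM heap can be adjusted at the job level through the old mapred API:

import org.apache.hadoop.mapred.JobConf;

public class MemorySettingsSketch {
    public static void main(String[] args) {
        JobConf conf = new JobConf(MemorySettingsSketch.class);

        // Heap given to each child task JVM (map or reduce attempt). The per-node
        // map and reduce slot counts multiplied by this value should stay well below
        // the node's physical RAM; 512 MB here is purely an example value.
        conf.set("mapred.child.java.opts", "-Xmx512m");

        // Optionally reuse the JVM for several tasks of the same job to reduce
        // start-up overhead for short tasks (-1 means no limit on reuse).
        conf.setInt("mapred.job.reuse.jvm.num.tasks", 1);
    }
}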
Network Bandwidth

High network utilization occurs when large amounts of data travel among nodes. Most often, this happens when Reduce tasks pull data from Map tasks in the Shuffle phase, and also when the job writes the final results into HDFS. To ensure that network utilization does not become a bottleneck, you need to be aware of the current network bandwidth used in a particular job, as well as the maximum network bandwidth available. The maximum can be determined by performing a stress test of the Hadoop cluster. By constantly monitoring MapReduce network utilization, you will be able to figure out whether the cluster has sufficient network bandwidth to move data efficiently.

Storage I/O

Storage I/O is probably the most important resource for running a MapReduce job. In our experience, I/O throughput is also one of the most common bottlenecks. It can decrease MapReduce job performance across the board, and become a bottleneck at every stage of the execution pipeline. Storage I/O utilization can be monitored by using Windows performance counters. We recommend that you learn your cluster's maximum I/O throughput by running a stress test beforehand, so that you can determine when your job has encountered a bottleneck. Storage I/O utilization depends heavily on the volume of input, intermediate data, and final output data. If the bottleneck is in storage I/O, almost all techniques for reducing data size will help.

Identifying Bottlenecks

In this section, we describe some of the common issues in Hadoop jobs, and explain the methods that we use to monitor a job and to identify bottlenecks as they occur. In the following section, Resolving Bottlenecks, we provide ways to resolve these common issues.

Overview of the Performance Tuning Cycle

When you consider optimizing a MapReduce job, you probably already feel that the job is taking too long, and are looking for changes that will make the job finish faster. However, before you put any effort into solving individual bottlenecks, you must always take into account the overall configuration of the cluster and the type of job. Having a good balance of resources across your cluster is the real key to achieving high job performance.

Hadoop resources can be classified into computation, memory, network bandwidth, and storage I/O. If you notice that a job runs for an unreasonably long time, very likely the cause is resource limits at some phase of the job, which prevent the job from running faster. Such resource limitations are the principal bottlenecks described in the rest of this section. The first step towards job optimization is identifying bottlenecks and finding resource limitations. The next step is to remove bottlenecks by applying the tuning techniques described in this paper, to change the way that resources are used in the context of a particular job. Typically, optimization is a repeatable process that you have to run for many cycles, comparing the benchmarks for each run, until you are certain that you have located the bottlenecks and found the most effective tuning techniques.

Massive I/O Caused by Large Input Data in Map Input Stage

In the Map input phase (Stage 1), source data is read into the Map tasks and used for calculation. This process of reading data consumes disk I/O.
If disk I/O is not fast enough, computation resources will be idle and spend most of the job time waiting for the incoming data. Therefore, performance can be constrained by disk I/O. This problem happens most often on jobs with light computation and large volumes of source data.

Indicators

The following indicators can be used to look for I/O issues:

Job counters: Bytes Read, HDFS_BYTES_READ
Windows performance counters: I/O counters, network counters, processor counters
Possible exceptions: Not encountered during our tests.

To view the volume of input data for Map tasks, you can use the job counters Bytes Read and HDFS_BYTES_READ, which are circled in red in the following log excerpt:

Figure 3. Large input data

This snapshot of the log shows the two counters that you should focus on. In this particular case, you can see that the input data is only around 100 GB, not large at all by Hadoop standards. However, when you see huge numbers for both of these counters, the job is dealing with massive I/O while reading from a large input file. Therefore, to reduce the overall job completion time, it might be helpful to apply the tuning techniques described in Section 3 to improve I/O read performance.

Tip: When I/O is a bottleneck, check the I/O performance counters in Windows Task Manager. These counters will usually exhibit high utilization, especially for read operations.

Massive I/O Caused by Spilled Records in the Partition and Sort Stage

During the Partition and Sort stage (Stage 3), if you want to optimize the Map task, your goal should be to ensure that records are spilled (meaning, written to disk) only once. The reason is that when records are spilled more than once, the data must be read back in and then written out again, causing multiple drains on I/O. Records typically have to be spilled to disk more than once when the configured size of the buffer is inadequate. In other words, the buffer is used up too quickly, leading to multiple spills. Each extra spill generates a large volume of intermediate bytes.

Indicators

The following indicators can help you identify spilled records:

Job counters: FILE_BYTES_READ, FILE_BYTES_WRITTEN, Spilled Records
Windows performance counters: I/O performance counters
Possible exceptions: Not encountered during our tests.

The following example shows the log from a job with 100 GB of data processed throughout the entire pipeline. You can view the data size in these counters: HDFS_BYTES_READ (Map phase), FILE_BYTES_READ (Reduce phase), and Bytes Written (Reduce phase).

Figure 4. Counters showing spilled records

Note the numbers that are circled in red. From these counters, you can see that Map tasks read 152 GB of data from the local disk and wrote another 223 GB to disk. Typically, the FILE_BYTES_READ counter for the Map phase should have a very small value, close to zero, and the value of FILE_BYTES_WRITTEN in the Map phase should be close to the size of the data input for the Reduce phase, around 100 GB in this case. The fact that there is a gap between the numbers suggests that a large amount of intermediate data is being generated as part of sorting in the Map phase.
The other highlighted number, Spilled Records, confirms that the job has very high disk I/O, caused by the large number of intermediate records written to storage after the Map phase. When this type of bottleneck is happening, review the I/O counters in Windows Task Manager. Typically they will exhibit high utilization for both read and write operations.

To get more detailed information about spilled records, you can open the syslog logs for the mapper. For example, the following screenshot shows the log from a TeraSort job. (For more information about TeraSort and other jobs that are commonly used for measuring Hadoop job performance, see the related paper, Performance of Hadoop in Hyper-V.)

Figure 5. Log showing spilled records in TeraSort job

The heavy disk I/O is due to spilled records. Note that the value for the buffer size during the I/O sorting process is 100 MB. You can also see where the mapper has spilled records multiple times when the buffer is full.

Massive I/O or Network Traffic Caused by Large Map Output

In the Map output phase (Stage 4), Hadoop background daemons merge sorted partitions to disk. Next, this Map output will be read and copied to the Reduce task or tasks. In the process, many I/O operations occur as the data is written, read, or transferred among processes. Thus, large output from the Map phase can cause longer I/O and data transfer time, and in extreme cases can raise exceptions, if all the I/O throughput channels are saturated or if network bandwidth is exhausted.

Indicators

You can use the following indicators to investigate the volume of Map output:

Job counters: FILE_BYTES_WRITTEN, FILE_BYTES_READ, Combine Input Records
Windows performance counters: I/O counters, network counters, processor counters
Possible exceptions: java.io.IOException

The following log shows the values of the counters FILE_BYTES_WRITTEN (Map phase) and FILE_BYTES_READ (Reduce phase). Ideally, these values should be fairly close. But if you notice a big gap between the numbers, the problem might be the amount of intermediate spilled data from Stage 3. By examining the byte size in these two counters, you can judge whether the volume of output data from the Map phase is too big, which could lead to a performance problem.

Figure 6. Large map output

When this bottleneck occurs, you can view the I/O and network performance counters in Windows Task Manager. Both of these counters should exhibit high utilization, and most likely one of them will have hit its maximum value, depending on which is faster, I/O or network. Based on our experiments, very large output from the Map phase can sometimes trigger various kinds of java.io.IOException, which can cause the task to fail. The following excerpt shows one exception for a failed task, as written to the log file.

Figure 7. Java exception in log file

Massive I/O and Network Traffic Caused by Large Output Data in Reduce Output

In the output stage of the Reduce phase (Stage 8), reducers write their output to HDFS, requiring a lot of I/O write operations. The values for the counters Bytes Written (Reduce phase) and HDFS_BYTES_WRITTEN (Reduce phase) indicate the volume of data.
However, it is important to note that these two counters do not include the replication factor. If the replication setting is larger than one, blocks of data will be replicated to different nodes, which consumes more I/O for read and write operations, and which also uses network bandwidth. Therefore, bottlenecks on disk I/O or network bandwidth could cause slow job performance, or even exceptions in affected Reduce tasks.

Indicators

You can view the following indicators to check for this condition:

Job counters: Bytes Written, HDFS_BYTES_WRITTEN
Windows performance counters: I/O counters, network counters, processor counters
Possible exceptions: java.io.IOException

Figure 8. Checking volume of data as HDFS bytes

If this bottleneck occurs, and if the replication factor is 1 (that is, data is not being copied among nodes), you should be able to observe high values for I/O counters in Windows Task Manager. However, network counters could be low due to the lack of substantial network traffic. If the job uses a replication factor greater than 1, data will be replicated across the network, so you will usually observe high values in counters for both I/O and network usage. In some cases, the large job output can trigger a java.io.IOException and cause the task to fail. The following log excerpt shows the exception and the task failure messages.

Figure 9. Java exception caused by large job output

Insufficient Concurrent Tasks Caused by Improper Configuration

If the number of tasks running concurrently is insufficient to perform the job, the job can leave many resources idle. Increasing the number of concurrent tasks helps to accelerate the overall job speed by better utilizing resources. The number of concurrently running tasks is determined by two configuration factors: the first is the total capacity of Map and Reduce slots for the cluster, and the second is the number of Map and Reduce tasks configured for each particular job.

Indicators

You can use the following indicators to check for underutilization of resources:

Task Summary List: Num Tasks, Running, Map Task Capacity, Reduce Task Capacity
Windows performance counters: I/O counters, network counters, processor counters
Possible exceptions: Not encountered during our tests.

To view the total Map and Reduce task capacity for the cluster, use the Map/Reduce Administration page, shown in the following diagram. You can also check the same numbers programmatically, as sketched after Figure 11.

Figure 10. Cluster's total map/reduce task capacity

The Job Detail page, shown below, provides the number of Map and Reduce tasks available to the job. In cases where either the cluster or the job has been configured with insufficient capacity for concurrent tasks, you will usually observe low utilization on all counters in Windows Task Manager, due to under-configured computation resources. For example, if your data node machine has 12 CPU cores but you only configured 4 mappers on each machine, the remaining 8 cores will be idle because no workload will be assigned to them. As a result, only a limited portion of I/O throughput and network bandwidth will actually be utilized, because there are no requests coming from the other CPU cores.

Figure 11. Details of a job
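If you prefer to check these numbers from code instead of the administration page, the old mapred API exposes the same information through JobClient. The following is an illustrative sketch; it assumes the jobtracker address is available from the Hadoop configuration on the classpath:

import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ClusterCapacityCheck {
    public static void main(String[] args) throws Exception {
        // Reads the jobtracker address from the mapred configuration on the classpath.
        JobClient client = new JobClient(new JobConf());
        ClusterStatus status = client.getClusterStatus();

        // Total slot capacity on the cluster versus tasks currently running.
        System.out.println("Task trackers:        " + status.getTaskTrackers());
        System.out.println("Map slot capacity:    " + status.getMaxMapTasks());
        System.out.println("Maps running:         " + status.getMapTasks());
        System.out.println("Reduce slot capacity: " + status.getMaxReduceTasks());
        System.out.println("Reduces running:      " + status.getReduceTasks());
    }
}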
Long-Tailed Reducers Caused by Skewed Data Distribution

In the computation stage of a Reduce task (Stage 7), partitions resulting from multiple mappers are merged into sorted key-value pair groups, and are fed to different reducers depending on the way that keys have been distributed. However, the size of each data group can vary significantly if the keys are not evenly distributed. When this happens, the amount of data received by different reducers might be imbalanced, a phenomenon called skewed data distribution.

Skewed data distribution causes heavy workloads for some Reduce tasks, while leaving light workloads for others. During our experiments with some extreme cases, it often happened that most of the Reduce tasks finished within a few minutes, while a few remaining tasks ran for days. As a result of this imbalance in the distribution of data, the job also might take days to complete. We termed this type of bottleneck the "long-tailed reducer" and attempted to minimize or eliminate skewed distribution to improve performance.

Indicators

You can view the following indicators to check for skewed distribution and for long-running Reduce tasks:

Job Detail page: Num Tasks, Running
Windows performance counters: None
Possible exceptions: None during our testing.

The following excerpt from the Job Detail page shows some long-tailed tasks.

Figure 12. Long-tailed tasks

Such tasks can be easily identified, because most of the other Reduce tasks have finished, but a few tasks continue running for a much longer time. We recommend that you use the page that contains task counters to figure out how serious the skewed distribution is in your job. To do this, open the task counter page for each Reduce task, save the values, and then compare them with other counters, such as Reduce input records. This review process will help you identify differences in workload among different Reduce tasks.

Single Reducer Caused by Hive 'ORDER BY' Clause

Hive is a SQL-like language that helps non-developers easily perform ad-hoc queries against data stored in Hadoop. However, one limitation of Hive is that the ORDER BY clause has not been optimized to leverage multiple reducers. In its current implementation, the Hive ORDER BY clause can only run as a single-reducer job; therefore, if you perform a global data sort while output is being written, performance can suffer. When you perform a query that requires global sorting against big data, there is no doubt that the query will be extremely slow.

Indicators

You can view the following indicators to check for this problem:

Job Detail page: Number of Reduce Tasks
Windows performance counters: I/O counters, network counters, processor counters
Possible exceptions: Not experienced during our tests.

The following diagram shows a portion of the Job Detail page. From this you can see that only one Reduce task has been assigned to the ORDER BY operation.

Figure 13. Hive sort on single reducer

When a Hive query containing an ORDER BY clause is running using a single reducer, Windows performance counters on all nodes except the query node should exhibit very low or no utilization, because most resources will have no work to do. However, the node that is running the single reducer will exhibit high utilization of I/O and very high computation.
Insufficient Memory and Task Hangs During Attempt Due to Misconfigured Memory Allocation in Tasks

The amount of RAM that is available on the task tracker nodes can act as a bottleneck with significant effects. A task tracker node can be starved for memory when an attempted job requires a lot of memory and the configuration does not allow it to respond. For the purpose of diagnostics, this problem tends to manifest itself in the form of memory-related error messages, such as "java.lang.OutOfMemoryError: Java heap space".

For example, one customer migrated their workload from an on-premises deployment with large-memory nodes to an Azure-hosted solution. By default, the Azure workers are configured to run four (4) mappers per node, so each worker processes four separate streams of data. The data was compressed with a non-splittable format and the blobs were quite large (1 GB). The workers ran out of memory while attempting to decompress the data.

In the worst-case scenario, the workers can hang at the attempt phase, if large amounts of memory are allocated to the attempts and memory pressure pushes the system into an unhealthy or unresponsive state. This happens because Hadoop has too many attempts running at the same time, or the JVM settings allow the attempts to expand beyond healthy limits, or both. When this problem happens, you should be able to see high memory values in the Perfmon traces, while other resources (I/O, network utilization, and CPU) remain relatively low. Another scenario in which this problem was reported was a system with high latency on storage.

Indicators

You can view the following indicators to check for this problem:

Windows performance counters: RAM counters
Possible exceptions: "java.lang.OutOfMemoryError: Java heap space"

Choosing a Solution

Once bottlenecks are identified, they should be addressed by applying the appropriate tuning techniques. The following table provides some recommendations for potential tuning techniques for each of the bottlenecks described in the previous section. Each technique is described in more detail in the next section.

Bottleneck or Issue | Tuning Techniques
Massive I/O caused by large input data (Stage 1) | Compress source data
Massive I/O caused by large spilled records (Stage 3) | Reduce spilled records from Map tasks
Massive I/O or network traffic caused by large Map output (Stage 4) | Compress Map output
Massive I/O and network traffic caused by large output data (Stage 8) | Compress job output; Change replication; Hive – compress query intermediate output; Implement a combiner; Hive – compress query output
Insufficient running tasks caused by improper configuration | Change number of Map and Reduce tasks
Insufficient memory; job hangs at attempt due to misconfigured memory allocation in tasks | Adjust memory settings
Long-tailed reducers caused by skewed data distribution (Stage 7) | Hive – Auto Map Join
Single reducer caused by Hive ORDER BY operation (Stage 7) | Global sorting; Change number of MapReduce slots

However, we must stress that there is not a one-to-one mapping between the bottlenecks and the solutions listed, and there is no one-size-fits-all technique for tuning Hadoop jobs.
Because of the architecture of Hadoop, achieving balance among resources is often more effective than addressing a single problem. Each of the techniques in the list can be applied to address more than one type of bottleneck. Depending on the type of job you are running and the amount of data you are moving, the solution might be quite different. Therefore, we highly recommend that you try combining different techniques to figure out which tuning solutions are most efficient in the context of your job. In our experience, improving job performance is a repeatable effort that usually takes several cycles until you can find the best combination.

Resolving Bottlenecks

This section describes the different techniques we have used. In each section, we summarize the performance issues, list the configurations that were changed, and compare performance after tuning.

Reduce Spilled Records from Map Tasks

A Map task can spill (meaning, write data to disk) a large amount of data during the internal sorting process. Therefore, your goal for optimization should be to ensure that records are spilled only once. Multiple data spills generate a lot of I/O stress and slow down overall job performance.

Before tuning

The following graphic shows the log of a sample TeraSort job against 100 GB of data, using the default configuration. Notice that the value of the Spilled Records counter is almost twice as big as the Map output records counter. From this log you can see that multiple spills occurred for this un-tuned job. Total time for this job was 00:21:14.

Figure 14. Default TeraSort job on 100 GB (before)

Configurations to change

To reduce the amount of data spilled during the intermediate Map phase, we adjusted the following properties for controlling sorting and spilling behavior. The meanings of these properties are taken from Hadoop: The Definitive Guide:

io.sort.mb — Size, in megabytes, of the memory buffer to use while sorting Map output.
io.sort.record.percent — Proportion of the memory buffer defined in io.sort.mb that is reserved for storing record boundaries of the Map outputs. The remaining space is used for the Map output records themselves.
io.sort.spill.percent — Soft limit for either the Map output memory buffer or the record boundaries index. Once this threshold is reached, a thread begins to spill the contents to disk.

It was not straightforward or intuitive to define the best values for these properties, so in our experiments we spent some time investigating how to calculate and then set new values on these parameters to override the defaults.

What we learned is that when Map output is being sorted, 16 bytes of metadata are added immediately before each key-value pair. These 16 bytes include 12 bytes for the key-value offset and 4 bytes for the indirect-sort index. Therefore, the total buffer space defined in io.sort.mb can be divided into two parts: a metadata buffer and a key-value buffer.
Based on this understanding, we used the calculations shown below to derive new, potentially optimized values for these properties.

Calculate io.sort.record.percent. This value represents the proportion of the buffer reserved for metadata:

io.sort.record.percent = 16 / (16 + R)

In this formula, R is the average length of the key-value pairs, and can be calculated by dividing the Map output bytes by the number of Map output records. In our example, R = 100,000,000,000 / 1,000,000,000 = 100, so the formula gives:

io.sort.record.percent = 16 / (16 + 100) = 0.138

Calculate io.sort.mb. Based on the average length of the key-value pairs (R) calculated in the preceding step, and the average number of records per mapper (N), we used the following formula:

io.sort.mb = (16 + R) * N / 1,048,576

N is calculated by dividing the number of Map output records by the number of Map tasks. The property io.sort.mb must be defined in megabytes, so the result is converted to megabytes by dividing by 1,048,576. In our example, the number of Map tasks is 373; therefore, N is 1,000,000,000 records divided by 373 mappers, which is about 2,680,965 records per mapper on average. Hence, io.sort.mb should be 297 MB. For convenience, we rounded this up to 300 MB.

Update io.sort.spill.percent. The property io.sort.spill.percent defines the threshold after which the data from the Map output buffer will be spilled to disk. The valid range is 0 to 1.0, with a default value of 0.8. There is no reason to reserve this additional space, so we configured it to 1.0, to ensure that all buffer space was used.

<property>
  <name>io.sort.mb</name>
  <value>300</value>
</property>
<property>
  <name>io.sort.record.percent</name>
  <value>0.138</value>
</property>
<property>
  <name>io.sort.spill.percent</name>
  <value>1.0</value>
</property>

After tuning

The following figure shows the log when the same job was run after tuning.

Figure 15. TeraSort job after spill optimization

The number of spilled records in the Map phase was reduced to 1,000,000,000, as expected with the new configuration. This value is identical to the number of Map output records, which means that spilling from the Map output occurred just once. This optimization alone roughly halved the time spent on the job, to 00:11:29.
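For reference, the arithmetic above can also be carried out in the job driver itself. The following is an illustrative sketch (not part of our original TeraSort run) that recomputes the two values from the counter numbers used in this example and applies all three properties through the old mapred API:

import org.apache.hadoop.mapred.JobConf;

public class SpillTuningSketch {
    public static void main(String[] args) {
        // Counter values from the 100 GB TeraSort example above.
        long mapOutputBytes   = 100000000000L;  // Map output bytes
        long mapOutputRecords = 1000000000L;    // Map output records
        int  mapTasks         = 373;

        long   r             = mapOutputBytes / mapOutputRecords;  // average key-value length: 100 bytes
        long   n             = mapOutputRecords / mapTasks;        // ~2,680,965 records per mapper
        double recordPercent = 16.0 / (16 + r);                    // ~0.138
        long   sortMb        = (16 + r) * n / 1048576;             // ~297 MB

        System.out.println("io.sort.record.percent = " + recordPercent);
        System.out.println("io.sort.mb (exact)     = " + sortMb + " MB");

        JobConf conf = new JobConf(SpillTuningSketch.class);
        conf.setInt("io.sort.mb", 300);              // 297 MB rounded up for convenience
        conf.set("io.sort.record.percent", "0.138");
        conf.set("io.sort.spill.percent", "1.0");
    }
}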
Implement a Combiner

We used the sample Word Count application to demonstrate how to implement tuning on a job that includes a combiner. (To learn more about this sample application, see the MapReduce Tutorial, or see our related paper, Performance of Hadoop on Windows Hyper-V.) You can run the application without a combiner, but including a combiner can improve performance, as we will explain.

Before tuning

The following figure shows the log from a Word Count job without a combiner implementation. Because we did not use a combiner, the values for Combine input records and Combine output records are zero, and the number of Reduce input records is the same as the number of Map output records.

Figure 16. Records transferred from Map to Reduce phase, before use of combiner

Configurations to change

Implementing a combiner would improve performance in this job. To enable a combiner in the job, we added the highlighted line to the existing code:

Add a combiner

job.setMapperClass(Map.class);
job.setCombinerClass(Reduce.class); // new line of code to set a combiner
job.setReducerClass(Reduce.class);

The combiner aggregates the Map output first, and then the reducer takes pre-aggregated key-value groups from the combiner's output. Adding a combiner requires that you create a combiner class and implement it in your MapReduce code. In this example, we used the same class for the combiner as for the reducer, since they share the same aggregation algorithm. This is a common way of implementing a combiner.

After tuning

The table below shows the improvement in performance after running the same job with a combiner implemented as part of the job.

Figure 17. Records transferred from Map to Reduce phase, after use of combiner

The number of Map output records remained unchanged (629,187). However, the values for Combine input records and Combine output records have now increased, due to the addition of the combiner to the job. Notably, the number of Reduce input records dropped from 629,187 to 102,300. What this means is that I/O usage from the mapper output writes and from the reducer reads both declined by a factor of about 6, compared to the original process. This demonstrates that, in big jobs that deal with large data, enabling a combiner to reduce I/O pressure is a very efficient way to improve overall performance.

Compress Map Output

Using compression in Hadoop jobs is an important way to achieve increased performance at very little cost. (For more detailed information about different types of compression and their effectiveness in different phases of a Hadoop job, see our white paper, Compression in Hadoop.) The next example shows the results when we tested the effects of Map output compression by running a sample TeraSort job against 100 GB of data, with and without compression.

Before tuning

The following table shows the log before compression of the Map output was enabled. Around 100 GB of data was written to the file system by the TeraSort job, and almost the same amount of data was read by the reducers. This is very inefficient.

Figure 18. Bytes written in Map phase, before compression

Configurations to change

To improve performance on this job, we enabled compression on the output of the Map phase. In the default Hadoop configuration, the value of the parameter mapred.compress.map.output is false, and the property mapred.output.compression.type is set to RECORD. To compress the output of the Map phase, we simply need to edit the job configuration file and change the first property to true. We also changed the value of mapred.output.compression.type from RECORD to BLOCK, because our tests found that a higher compression ratio is achieved with the type BLOCK.

mapred.compress.map.output — Change value to true to enable compression.
mapred.output.compression.type — Change value to BLOCK.

Enable compression on map blocks

<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
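If you build the job configuration in Java code rather than editing the configuration file, the same two settings can be applied programmatically. This is an illustrative sketch using the old mapred API; it simply mirrors the XML above and is not additional tuning:

import org.apache.hadoop.mapred.JobConf;

public class MapOutputCompressionSketch {
    public static void main(String[] args) {
        JobConf conf = new JobConf(MapOutputCompressionSketch.class);

        // Equivalent to mapred.compress.map.output = true
        conf.setCompressMapOutput(true);

        // Equivalent to mapred.output.compression.type = BLOCK
        conf.set("mapred.output.compression.type", "BLOCK");
    }
}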
After tuning

The following figure shows the log after we ran the same job with compression enabled on the Map output. First, note that the values for FILE_BYTES_READ and FILE_BYTES_WRITTEN dropped to around 15 GB, compared to over 102 GB before.

Figure 19. Bytes written in Map phase, after compression

The benefits of this change are two-fold: enabling compression on the Map output allows relatively higher computation usage on the Hadoop cluster, and it also significantly reduces network bandwidth and storage I/O. As an overall result, any job that generates large output in the Map phase can potentially run faster with compression enabled on the Map output.

Compress Job Output

We also studied the effect of compressing the output of the overall job.

Before tuning

The following figure shows the log from a job that ran the Word Count application, with no changes made to the configuration.

Figure 20. Word Count job before job output compression

Note that before any job tuning, the total size of the files output to HDFS was around 880 KB. (The time of the job is not included here because the file we processed was already fairly small. Our point is really to show the compression ratio.)

Configurations to change

To enable compression of the job's output, we added the following lines of code to the source code for the Word Count application. Note that these are essentially the same parameters described in the previous section: we enabled compression, and we changed compression to use blocks instead of records.

Enable compression on job output

job.getConfiguration().setBoolean("mapred.output.compress", true);
job.getConfiguration().set("mapred.output.compression.type", "BLOCK");

Tip: You can also change these settings by using the Hadoop command line and the -D parameters. (For the complete list of commands in Hadoop, see http://hadoop.apache.org/docs/r0.19.1/commands_manual.html. For more about compression, the tools, and their uses, see our white paper, Compression in Hadoop.)

After tuning

The following log shows the same job, run after compression was enabled on the job output. The size of the output files written to HDFS was reduced to 304 KB, compared to 880 KB before.

Figure 21. Word Count job after job output compression

The benefit of enabling compression on job output might not seem so impressive on a small Word Count application, but if the final phase of your job performs intensive I/O with many writes, enabling compression might save a lot of storage and make your job complete faster. However, one drawback of this technique is that the final output data will be stored in HDFS as compressed data; therefore, you must apply decompression before the data can be consumed by other processes.

Compress Input Source Data

Compression of input files provides two major benefits: it saves storage space (especially when using replication) and speeds up data transfer.

Before tuning

In one experiment, we applied the DEFLATE codec, which is the default compression format used in Hadoop, to compress input data, and achieved a reduction from 152 GB to 33 GB. Our MapReduce job completion time dropped from 376 seconds when processing the original data to 108 seconds when processing the compressed data.
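For readers who want a concrete starting point, the following is a minimal, illustrative sketch of compressing a file with the DEFLATE codec through the Hadoop compression API before a job consumes it. It is not the exact tool we used in this experiment, and the paths are placeholders; the broader options are summarized under Configurations to change below and in our compression white paper.

import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CompressToDeflate {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);           // default file system from the configuration

        Path input  = new Path(args[0]);                 // source file, e.g. an uncompressed text file
        Path output = new Path(args[0] + ".deflate");    // DEFLATE files use the .deflate extension

        // DefaultCodec is the DEFLATE implementation shipped with Hadoop.
        CompressionCodec codec = ReflectionUtils.newInstance(DefaultCodec.class, conf);

        InputStream in   = fs.open(input);
        OutputStream out = codec.createOutputStream(fs.create(output));
        try {
            IOUtils.copyBytes(in, out, conf);   // stream-copy while compressing
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}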
Configurations to change

Files can be compressed before they are loaded into Hadoop, or the files can be compressed inside Hadoop processes, to take advantage of parallel computing. You can also use Hive to compress data.

Because compression provides so many advantages, we wrote a separate white paper, titled Compression in Hadoop Jobs, that describes the tools and processes that we have tested for enabling compression in all phases of Hadoop jobs. We also recommend that readers review Tom White's book, Hadoop: The Definitive Guide, particularly the section on compression beginning on page 77. It describes the choices for compression formats and codecs.

After tuning

On the other hand, "all compression algorithms exhibit a space/time trade-off" (Hadoop: The Definitive Guide, page 78). From the resource point of view, the choice of whether or not to use compression is a trade-off between reducing network traffic and disk I/O on one hand, and increasing CPU demand on the other. The reason is obvious: compressing or decompressing files costs extra CPU time.

Change Number of Replications

Replication in the context of Hadoop means that blocks of data are copied to multiple nodes to provide data redundancy. Although this is an important feature of Hadoop, copying the data among nodes does cost a massive amount of network bandwidth and storage I/O. If your data might not require this level of redundancy, reducing the number of replications (that is, copying data to fewer nodes) is a good way to reduce job running time and avoid performance issues.

Before tuning

The following screenshot shows the file copying activity when a 7 GB data file was written into HDFS with three (3) replications.

Figure 22. Data being copied when replicated to three nodes

Just copying the data to the nodes took 249 seconds in total.

Configurations to change

The property dfs.replication controls the number of replications of a job's output.

Change replications in MapReduce

To change the number of replications in a MapReduce job, you can add the following lines to the job configuration file.

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

Change replications in Hive

To change the number of replications used in a Hive job, use the following command in the session:

hive> set dfs.replication=1;

After tuning

After changing the job to use a single replication, we measured the job activity when processing the same 7 GB of data. This time the job wrote all results to HDFS and finished in only 116 seconds. Reducing the replication factor from 3 to 1 clearly shortened job completion time, going from around 3 minutes to less than 2 minutes.

Hive – Compress Query Output

The principle underlying the use of compression in MapReduce jobs also applies to compression in Hive, because internally Hive queries are compiled to MapReduce jobs. The only difference is that Hive has its own set of parameters to control compression. In this test, we used a simple GROUP BY query to test the effects of Hive output compression.

Before tuning

The following log shows the job before we changed any parameters. The size of the output file was around 50 KB.
Hive – Compress Query Output
The principle underlying the use of compression in MapReduce jobs also applies to compression in Hive, because internally Hive queries are compiled to MapReduce jobs. The only difference is that Hive has its own set of parameters to control compression. In this test, we used a simple GROUP BY query to test the effects of Hive output compression.

Before tuning
The following log shows the job before we changed any parameters. The size of the output file was around 50 KB.

Figure 23. Reduce output before Hive compression

Configurations to change

Enable compression within the Hive process
Next, we changed the setting hive.exec.compress.output, as follows:
hive> set hive.exec.compress.output=true;
For more information about this setting, see our white paper, Compression in Hadoop Jobs.

After tuning
The following log shows the job after enabling compression.

Figure 24. Reduce output after Hive compression

Hive – Compress Intermediate Output
A complex Hive query is usually converted to a series of multi-stage MapReduce jobs after submission, and these jobs are chained together by the Hive engine to complete the entire query. "Intermediate output" here refers to the output of one MapReduce job in the chain, which is used to feed the next MapReduce job as input data.

In this test, we assessed the effect of compressing the intermediate output of a Hive job, by using a two-stage query against 500 MB of data. Because the query is compiled as a two-stage chained MapReduce job, the intermediate output is actually the output of the Reduce phase of the first job, which in turn serves as the input to the second job.

Before tuning
The following figure illustrates the size of the output data from the first job. In the original job, the size of the files written in the Reduce phase (see HDFS_BYTES_WRITTEN) is around 89 MB.

Figure 25. Intermediate output before compression

Configurations to change
We used two properties, hive.exec.compress.intermediate and mapred.output.compression.type, to compress the intermediate output. We changed the value of mapred.output.compression.type to BLOCK because, in our tests, higher compression ratios were achieved when using BLOCK.

Compress intermediate output
hive> set hive.exec.compress.intermediate=true;
hive> set mapred.output.compression.type=BLOCK;

After tuning
The following diagram shows the log of the same job, after compression had been enabled on the intermediate Hive files.

Figure 26. Intermediate output after compression

With Hive compression enabled, the size of the intermediate output files dropped to around 23 MB, a reduction of roughly 75 percent.

Hive – Auto Map Join
In Hive, Auto Map Join is a very useful feature when you are joining a big table to a small table. When you enable this feature, the small table is saved in the local cache on each node, and then joined with the big table in the Map phase.

In terms of performance, enabling Auto Map Join in Hive provides at least two advantages. First, loading a small table into the cache saves read time on each data node. Second, it avoids skew joins (meaning imbalanced joins) in the Hive query, because the join operation has already been done in the Map phase for each block of data.

To assess the effectiveness of using Auto Map Join in a Hive query, we ran a two-stage Hive query that contains both GROUP BY and JOIN operations, to process a 200 GB dataset.
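Conceptually, Auto Map Join does for you what a hand-written map-side join does in plain MapReduce: the small table is shipped to every node, loaded into memory, and probed as records of the big table stream through the mappers, so no Reduce phase is needed for the join. The following hypothetical mapper sketch (illustrative tab-separated layout, old-style DistributedCache API) is meant only to illustrate that idea, not to reproduce Hive's implementation.

A hand-written map-side join for comparison (sketch)
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Joins each record of the big table with a small lookup table held in memory,
// so the join itself needs no Reduce phase.
public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> smallTable = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException {
        // The small table was added to the distributed cache by the driver, e.g.:
        // DistributedCache.addCacheFile(new Path("/data/small_table.txt").toUri(), conf);
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] fields = line.split("\t");   // assumed layout: key <tab> value
            smallTable.put(fields[0], fields[1]);
        }
        reader.close();
    }

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        String[] fields = record.toString().split("\t");  // assumed: join key in column 0
        String match = smallTable.get(fields[0]);
        if (match != null) {                              // inner-join semantics
            context.write(new Text(fields[0]), new Text(record.toString() + "\t" + match));
        }
    }
}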
Before tuning
The original query, without Auto Map Join enabled, took 2869 seconds (or 47.8 minutes) to complete. The longest Reduce task lasted 34 minutes.

Configurations to change
To enable the Auto Map Join feature, we just changed the value of the property hive.auto.convert.join to true. Under this configuration, any table that is smaller than the size defined in the property hive.smalltable.filesize will be treated as a small table, and will be copied to each of the task nodes as a cached table.

Enable Auto Map Join
hive> set hive.auto.convert.join=true;
hive> set hive.smalltable.filesize=<filesize_threshold>; (optional)

The default value of hive.smalltable.filesize is 25 MB.

After tuning
After enabling Auto Map Join, the same query completed in 1101 seconds (or 18.3 minutes). The problem of one Reduce task lasting a very long time was eliminated, because the job did not need any Reduce tasks at all in the first stage.6

6 In Appendix A we describe a related tuning technique, called bucketed map join, which we expect to be useful; however, we have not had the opportunity to test its effectiveness and so have not included it here.

Change the Number of Map Tasks and Reduce Tasks
For a MapReduce job, changing the number of Map tasks and/or Reduce tasks can significantly affect performance, because those two settings are directly associated with the amount of cluster resources used by the job. Performance can suffer whether you allocate too many resources or too few, so be careful to choose the right values, and try different combinations to determine the right balance of resources.

For our tests of this tuning technique, we changed the number of Map tasks in a job running the TeraGen test.7

7 For information about the TeraGen test, see the related paper, Performance of Hadoop on Hyper-V.

Before tuning
The following figure shows the log of the Map and Reduce tasks during the job, before any tuning.

Figure 27. Number of Map tasks before

Note the number of Map tasks. This is a typical example of what we call under-configuration. In this particular case, the cluster has enough computational resources to run at least eight Map tasks concurrently, but the job is configured to use only two Map tasks, which fails to utilize the available CPU resources.

Configurations to change
To change the number of Map tasks, set the following property to an appropriate value in the job configuration file.
<property>
  <name>mapred.map.tasks</name>
  <value>8</value>
</property>

To change the number of Reduce tasks, you use the property mapred.reduce.tasks. Note that this just shows you where to change the values; the actual calculation of an appropriate value depends on your cluster configuration.
<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value>
</property>

After tuning
After the properties were changed, we re-ran the job. The following figure shows the log from the same job. Note that the number of Map and Reduce tasks has changed.

Figure 28. Number of Map tasks after

After tuning the configuration, the number of Map tasks increased to eight. Because the cluster had the resources to support more mappers, we noticed a job performance boost immediately, with more mappers working in parallel.
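The same two properties can also be set per job from the driver code instead of the configuration file. The following is a minimal, hypothetical driver fragment; the values are illustrative, and note that mapred.map.tasks is only a hint, because the actual number of Map tasks ultimately follows the number of input splits, whereas the Reduce task count is honored exactly.

Set task counts in the job driver (sketch)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TaskCountExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapred.map.tasks", 8);   // hint only; input splits decide the real count

        Job job = new Job(conf, "tuned job");
        job.setNumReduceTasks(2);             // this value is honored exactly
        // ... set the mapper, reducer, and input/output paths as usual, then submit.
    }
}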
Change the Number of Map and Reduce Slots
Increasing the number of slots available for a Hadoop cluster's total Map and Reduce tasks enables more tasks to run concurrently, which brings more CPU, RAM, I/O, and network resources to bear on the job. As long as your cluster has enough resource capacity to support more Map and/or Reduce tasks, increasing the number of total task slots generally improves job performance. However, providing too many resources can hurt performance if resources are unbalanced, because such a configuration can lead to resource bottlenecks and more failures. Therefore, choosing the right number of slots is the key to getting the best performance out of your cluster, and to keeping your cluster running reliably.

To help you determine appropriate numbers of Map and Reduce slots, we have provided links to resources that describe cluster tuning strategies in more detail (see the References section); the rest of this section provides a summary and highlights a few tips.

Before tuning
The following figure shows the cluster configuration before tuning, with a maximum of four Map tasks.

Figure 29. Maximum Map tasks before

The next figure shows the log while the job was running with this configuration. As you can see, the job is using the maximum number of concurrent mappers allowed. If the job runs slowly, you might try increasing this value.

Figure 30. Number of running Map tasks before

Configurations to change
There are two global properties that you can change to affect the number of Map slots or Reduce slots: mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum. Note that the total Map task capacity of the cluster is the sum of each data node's capacity. For this example, we were using a two-node cluster.

Increase Map slots
The following procedure describes how to change the Map slot capacity on a data node. The method for increasing the number of Reduce slots in Hadoop is almost the same as the method for increasing the Map slots.
1. Stop the cluster.
2. On each node, edit the file mapred-site.xml.
3. Change the property mapred.tasktracker.map.tasks.maximum as shown in the following code:
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
4. Start the cluster.

After tuning
After making this change, the number of available (maximum) Map slots increased to eight, because we were using a two-node cluster configured with a maximum of four Map tasks per node (2 nodes x 4 slots = 8 concurrent Map tasks). For performance, this means that eight Map tasks can run concurrently on the cluster, as shown in the following job configuration.

Figure 31. Maximum Map tasks after

The following figure shows the log of the job running after more Map slots were added. From this log, it is clear that the MapReduce job is now using the maximum number of Map slots available, which was eight.

Figure 32. Running Map tasks after

Improve Global Sorting
The ORDER BY clause is supported in Hive to globally sort the data and order the output by key values. However, Hive is not optimized to perform global sorting with multiple reducers. At the time of this writing, Hive can use only one reducer to sort the entire data set, which can make the query extremely slow when sorting large data volumes.

Before tuning
An example will illustrate how slow a job can be when using Hive to perform an ORDER BY operation.
In this job, a 20 GB source data file was available in an HDFS directory that was mapped to a Hive table named t1. The following ORDER BY query took 3823 seconds to sort t1.
hive> insert overwrite table t2 select * from t1 order by col1;

Configurations to change

Perform global sorting using PigLatin
The easiest way to perform fast global sorting is to use PigLatin, a programming language and runtime environment for working with Hadoop data. Unlike Hive, PigLatin has been optimized to perform parallel sorting automatically. Therefore, you can generally expect to see some performance improvement after conversion to PigLatin. The following code shows a PigLatin script that is equivalent to a Hive query containing a global ORDER BY clause.

a = load '<HDFS_PATH_TO_t1>' as (col1:chararray, col2:chararray, col3:chararray, col4:chararray, col5:chararray, col6:chararray, col7:chararray, col8:chararray, col9:chararray, col10:chararray, col11:chararray, col12:chararray, col13:chararray, col14:long);
b = order a by col1;
store b into '<HDFS_PATH_TO_t2>';

After tuning
The PigLatin script took 1250 seconds to return the results, which is significantly faster than the Hive ORDER BY query (3823 seconds). In another experiment, we compared the performance of equivalent Hive and PigLatin queries, like those above, using a 120 GB data set. When performing a global sort of this larger data set, the Hive query took 13.5 hours, whereas the same job using Pig and PigLatin completed the sorting operation in only 2 hours and 10 minutes.

Adjust Memory Settings
Memory problems can arise when a task attempt requires a large amount of memory but the configuration does not provide enough, leaving the worker node unable to respond. The solution is to increase the memory limit for the JVMs used by the task attempts, so that each attempt can obtain more physical memory. For example, the following configuration property sets the amount of memory available to each Reduce task to 2 GB:
mapred.reduce.child.java.opts = -Xmx2g

A job can also hang on the first or subsequent attempts if large amounts of memory are allocated to the attempts, and memory pressure pushes the system into an unhealthy or unresponsive state. The suggested solution in that case is to scale down the number of Mappers and Reducers by configuring the properties mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum.

In any memory-related scenario, you should also consider adjusting the value of mapred.job.reuse.jvm.num.tasks: this property specifies the number of tasks that can reuse the same JVM. Reusing a JVM reduces execution time, because there is no need to start a new JVM for each task. Conversely, not reusing a JVM reduces memory pressure, because each attempt starts with a fresh JVM that has no memory already allocated. You can consider it a trade-off between startup overhead and memory usage.

Note: We would like to state for the record that our team did not encounter memory issues in our testing, because we were fairly cautious in determining requirements and allocating resources. However, testing by other teams revealed some scenarios where memory can be an issue. Although we have not validated those findings, after reviewing the test results provided by other teams, we felt that it was important to share these scenarios and recommendations above.
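These memory-related settings can also be applied per job from the driver, rather than cluster-wide. The following hypothetical fragment shows where each property discussed above would be set; the values are placeholders for illustration, not recommendations.

Apply memory settings per job (sketch)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MemoryTuningExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Give each Reduce task attempt a 2 GB heap (Map tasks can be sized separately).
        conf.set("mapred.reduce.child.java.opts", "-Xmx2g");
        // Allow up to 10 tasks of this job to reuse the same JVM; -1 means no limit.
        conf.setInt("mapred.job.reuse.jvm.num.tasks", 10);

        Job job = new Job(conf, "memory-tuned job");
        // ... configure the mapper, reducer, and input/output paths, then submit.
    }
}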
Conclusion
We have described a selection of techniques that you can use to identify bottlenecks in Hadoop jobs, along with the corresponding techniques that you can apply to tune your jobs and potentially improve performance. There are many ways to improve the performance of Hadoop jobs, but in our testing we have found that the process of identifying bottlenecks and applying solutions is one you will repeat often. It is also critical to approach performance in a balanced manner.

Future Work
We were unable to test all the potential tuning techniques and combinations of techniques, due to time constraints. We have summarized only the most common and useful tuning techniques, along with a few MapReduce properties. However, if you are interested in other properties that might be useful for tuning, we have provided a full list of tuning properties for Shuffle and Sort in Appendix B. We encourage you to experiment with these and to report your results.

In addition to the techniques covered in this document, the previously cited book by Tom White, Hadoop: The Definitive Guide, describes a number of other tuning properties and techniques. Because those properties are described in different chapters, and can be hard to find, we have provided a summary of them for your reference as well (see Appendix C).

About the Authors
The Microsoft IT Big Data Program is the result of collaboration between multiple enterprise services organizations working in partnership with Microsoft's HDInsight product development team. This group has been tasked with assessing the usefulness of Hadoop for corporate and client applications, and with providing a reliable, cost-effective platform for future Big Data solutions within Microsoft. Toward this end, the group has extensively researched the Hadoop offerings from Apache, and has implemented Hadoop production and test environments of all sizes, using the HDInsight platform for Windows Azure and for on-premises Windows. The Microsoft IT Big Data Program is publishing its research to the community at large so that others can benefit from this experience and make more efficient use of Hadoop for data-intensive applications.

Acknowledgments
We would like to gratefully acknowledge our colleagues who read these papers and provided technical feedback and support: Pedro Urbina Escos (HDInsight test), Karan Gulati (CSS), Larry Franks, and Cindy Gross (SQL CAT).

Appendix A: Join Optimization in Hive
The following information was originally published on the HDInsight team blog when the release was code-named Isotope. We have taken the liberty of editing, summarizing, and clarifying where possible.
http://isotope/?s=join+optimization

Hive can perform joins on big data sets (in the TB range). However, a complex join on big data sets takes time to finish, because the underlying MapReduce job involves shuffling large amounts of data in the Reduce phase. There are a couple of ways to optimize the join operation in Hive. MapJoin is one of them.
The idea of MapJoin is that, if one of the tables in the join operation is relatively small (in the range of MB or GB), you can remove the join from the Reduce phase entirely by copying the smaller table to all the nodes and performing the join in the mapper. Further improvement can be achieved if both tables are bucketed (and satisfy certain conditions). In this case, instead of the entire smaller table being copied to all nodes, only a portion of the data is copied to each node, with each node getting a different portion.

The following figure shows the performance of different joins on a single-node cluster. This particular comparison was done on a desktop computer used for development, using two tables, one about 7 GB and the smaller one only 11 KB. Other background processes were running, so you should not read too much into the actual numbers, but focus instead on the relative difference between the results.

To perform a MapJoin in Hive, use the following syntax:
SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a join b on a.key = b.key;
The table name used in /*+ MAPJOIN(...) */ is the smaller of the two tables, and it will be distributed to all nodes.

When both tables are bucketed and the bucket counts are a multiple of each other, the same MapJoin statement above will be further optimized into the bucketed MapJoin if the following parameter is set:
set hive.optimize.bucketmapjoin = true;

If the two tables are not only bucketed but also sorted, and if the two tables have the same number of buckets, the MapJoin above will be even further optimized into a sort-merge join (whose timing is not shown in the chart above). To enable a sort-merge join, we need to set the following parameters in Hive:
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;

Notes: A limitation of MapJoin is that you cannot perform a FULL or RIGHT OUTER JOIN. MapJoin in Hive is enabled in the most recent build of HDInsight. Please try it out. As always, your feedback is welcome.

Appendix B: List of Tuning Properties in Shuffle
The following tables list additional properties that can be used for tuning. Although we did not experience specific bottlenecks or alter these particular properties in our experiments, modifying them might affect performance in some scenarios.

Table 1. Map phase tuning properties

io.sort.mb (int, default 100): The size, in megabytes, of the memory buffer to use while sorting map output.
io.sort.record.percent (float, default 0.05): The proportion of io.sort.mb reserved for storing record boundaries of the map outputs. The remaining space is used for the map output records themselves.
io.sort.spill.percent (float, default 0.80): The threshold usage proportion for both the map output memory buffer and the record boundaries index at which the process of spilling to disk starts.
io.sort.factor* (int, default 10): The maximum number of streams to merge at once when sorting files. It is fairly common to increase this to 100. This property is also used in the Reduce phase.
min.num.spills.for.combine* (int, default 3): The minimum number of spill files needed for the combiner to run (if a combiner is specified).
tasktracker.http.threads (int, default 40): The number of worker threads per tasktracker for serving the map outputs to reducers. This is a cluster-wide setting and cannot be set by individual jobs.
mapred.map.output.compression.codec* (class name, default org.apache.hadoop.io.compress.DefaultCodec): The compression codec to use for map outputs.
mapred.compress.map.output (Boolean, default false): Compress map outputs.

Notes:
1. The values listed above are the default values in Hadoop. For additional detail on each property, please refer to Hadoop: The Definitive Guide, "Shuffle and Sort" (starting on page 177).
2. The properties marked with an asterisk (*) are those that we did not test during our tuning experiments.

Table 2. Reduce phase tuning properties

mapred.reduce.parallel.copies (int, default 5): The number of threads used to copy map outputs to the reducer.
mapred.reduce.copy.backoff (int, default 300): The maximum amount of time, in seconds, to spend retrieving one map output for a reducer before declaring it as failed. The reducer may repeatedly reattempt a transfer within this time if it fails (using exponential backoff).
io.sort.factor (int, default 10): The maximum number of streams to merge at once when sorting files. This property is also used in the Map phase.
mapred.job.shuffle.input.buffer.percent (float, default 0.70): The proportion of total heap size to be allocated to the map outputs buffer during the copy phase of the shuffle.
mapred.job.shuffle.merge.percent (float, default 0.66): The threshold usage proportion for the map outputs buffer (defined by mapred.job.shuffle.input.buffer.percent) at which the process of merging the outputs and spilling to disk starts.
mapred.inmem.merge.threshold (int, default 1000): The threshold number of map outputs at which the process of merging the outputs and spilling to disk starts. A value of 0 or less means there is no threshold, and the spill behavior is governed solely by mapred.job.shuffle.merge.percent.
mapred.job.reduce.input.buffer.percent (float, default 0.0): The proportion of total heap size to be used for retaining map outputs in memory during the Reduce phase. For the Reduce phase to begin, the size of map outputs in memory must be no more than this size. By default, all map outputs are merged to disk before the Reduce phase begins, to give the reducers as much memory as possible. However, if your reducers require less memory, this value may be increased to minimize the number of trips to disk.

Appendix C: Miscellaneous Tuning Properties
The list below covers most of the properties related to performance tuning that are mentioned in various places in Tom White's book, Hadoop: The Definitive Guide, 2nd edition; page references are to that book. Please note that the default values of some properties in HDInsight might be different from the values listed here. For example, in HDInsight, the default value of mapred.tasktracker.map.tasks.maximum is 4, and mapred.child.java.opts is "-Xmx1024m".

mapred.min.split.size (int, default 1): The smallest valid size in bytes for a file split. References: p. 202, FileInputFormat input splits; p. 203, Small files and CombineFileInputFormat; p. 204, Preventing splitting.
mapred.max.split.size (long, default Long.MAX_VALUE, that is 9223372036854775807): The largest valid size in bytes for a file split. References: same as above.
dfs.block.size (long, default 64 MB, that is 67108864): The size of a block in HDFS, in bytes. References: p. 202, FileInputFormat input splits; p. 203, Small files and CombineFileInputFormat; p. 204, Preventing splitting; p. 279, HDFS block size.
mapred.map.tasks.speculative.execution (Boolean, default true): Whether extra instances of map tasks may be launched if a task is making slow progress. Reference: p. 183, Speculative Execution.
mapred.reduce.tasks.speculative.execution (Boolean, default true): Whether extra instances of reduce tasks may be launched if a task is making slow progress. Reference: p. 183, Speculative Execution.
mapred.tasktracker.map.tasks.maximum (int, default 2): The number of map tasks that may be run on a tasktracker at any one time. Reference: p. 273, Important Hadoop Daemon Properties.
mapred.tasktracker.reduce.tasks.maximum (int, default 2): The number of reduce tasks that may be run on a tasktracker at any one time. References: p. 269, Memory; p. 273, Important Hadoop Daemon Properties.
mapred.job.reuse.jvm.num.tasks (int, default 1): The maximum number of tasks to run for a given job for each JVM on a tasktracker. A value of -1 indicates no limit: the same JVM may be used for all tasks for a job. Reference: p. 184, Task JVM Reuse.
mapred.child.java.opts (string, default -Xmx200m): The JVM options used to launch the tasktracker child process that runs map and reduce tasks. This property can be set on a per-job basis, which can be useful for setting JVM properties for debugging, for example. References: p. 180, Configuration Tuning; p. 269, Memory; p. 273, Important Hadoop Daemon Properties; p. 280, Task memory limits.

Addendum: One additional property that can affect performance is mapred.reduce.slowstart.completed.maps. Reducers launch when a certain percentage of the mappers are done; they are then busy pre-fetching map output, but they cannot start the actual work of reducing until all the mappers finish. This pre-fetching phase can be highly inefficient, depending on the distribution of key values and the number of reducers, and it places an unnecessary load on the system, because the idle reducers occupy slots that concurrent jobs could use. The default value is 0%, meaning that the pre-fetch process starts as soon as even one mapper has completed. The following JIRA has been filed for this property: https://issues.apache.org/jira/browse/MAPREDUCE-1184

References and Appendices
1. Nolan, Carl. Hadoop Streaming and Azure Blob Storage. http://blogs.msdn.com/b/carlnol/archive/2012/01/07/hadoop-streaming-and-azure-blobstorage.aspx
2. Beresford, James. Using Azure Blob Storage as a Data Source for Hadoop on Azure. http://www.bimonkey.com/2012/07/using-azure-blob-storage-as-a-data-source-for-hadoop-onazure/
3. MSDN Library. System.IO.Compression. http://msdn.microsoft.com/en-us/library/system.io.compression(v=vs.80).aspx
4. MSDN Library. System.IO.Packaging. http://msdn.microsoft.com/en-us/library/system.io.packaging.package.aspx
5. White, Tom. Hadoop: The Definitive Guide, 2nd edition. O'Reilly Media, Inc., 2011.
6. Guinebertière, Benjamin, Philippe Beraud, Rémi Olivier. Leveraging a Hadoop cluster from SQL Server Integration Services (SSIS). http://msdn.microsoft.com/en-us/library/jj720569.aspx