Eric Yang
Jim Huang
Stewart Tate
Simon Harris
Michael Ahern
John Poelman
BigInsights Tuning
– Validating and tuning a new cluster
Performance Improvements with Adaptive Map/Reduce
Special Topic: "HBase tuning experience with NICE" by Bruce Brown
Tuning Big SQL: Results and lessons learned
Q & A
Topics covered later in the week
– More on tuning Big SQL (e.g. hints)
– GPFS performance
© 2013 IBM Corporation
The performance of a new cluster should be validated and tuned
– Slow nodes will drag down the performance of the entire cluster
Things to validate:
– Network
– Storage
– BIOS
– CPU
– Memory
– Ensure expected sustained bandwidth between cluster nodes
– There are many tools that can be used to validate the network
– One tool we frequently use is iperf ( http://en.wikipedia.org/wiki/Iperf )
• Test bandwidth between every pair of nodes
• Example of how to invoke iperf client: iperf --client <server hostname> --time 30 --interval 5 --parallel 1 --dualtest
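The pairwise test above can be sketched as a small script. HOSTS is a hypothetical node list; substitute your cluster's hostnames, and start `iperf --server` on each target node before running its tests. The script only prints the commands, so it is side-effect free:

```shell
# Generate the iperf client command for every ordered pair of nodes.
# HOSTS is a placeholder list -- replace with your cluster's hostnames.
HOSTS="node1 node2 node3"
pairwise_iperf_cmds() {
  for server in $HOSTS; do
    for client in $HOSTS; do
      [ "$server" = "$client" ] && continue
      # Run "iperf --server" on $server first, then from $client:
      echo "ssh $client iperf --client $server --time 30 --interval 5 --parallel 1 --dualtest"
    done
  done
}
pairwise_iperf_cmds
```

With 3 nodes this emits 6 commands (one per ordered pair), which makes it easy to spot a single slow link.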
Storage validation
– JBOD is the preferred storage setup. Avoid volume groups, logical volumes, etc.
– Each local device should be set up as JBOD or (if the storage controller does not support JBOD) as a RAID-0 single-device array.
– Check device settings
• Are write caches enabled? Write caches are usually enabled except for storage that needs high durability, such as storage for the NameNode
– Based on the type of file system (e.g. ext), check file system settings (block size, etc.)
• Example (ext4): /sbin/dumpe4fs /dev/sdX
– Check mount options
• Usually you want to see that access-time updates ("atime") and user extended attributes are disabled
Storage validation (continued)
– Use Linux dd command to test read/write throughput of every device or file system that BigInsights will use
– (Optional) Test raw device read/write performance BEFORE creating file system
• Warning: Writing to a raw device will clobber existing data (if any)
– (Mandatory) Test read/write performance after creating and mounting a file system on the device
• Test with and without Direct I/O
• With Direct I/O
dd if=/dev/zero oflag=direct of=<path>/ddfile.out bs=128k count=1024000
dd if=<path>/ddfile.out iflag=direct of=/dev/null bs=128k count=1024000
• Without Direct I/O
dd if=/dev/zero of=<path>/ddfile.out bs=128k count=1024000
dd if=<path>/ddfile.out of=/dev/null bs=128k count=1024000
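To cover every file system BigInsights will use, the dd tests can be generated in a loop. MOUNTS is a hypothetical list of mount points; the sketch prints the direct-I/O commands rather than running them, so it is safe to review before executing:

```shell
# Print the direct-I/O dd write/read test commands for each mount point.
# MOUNTS is a placeholder -- list the file systems your DataNodes will use.
MOUNTS="/data1 /data2 /data3"
dd_test_cmds() {
  for m in $MOUNTS; do
    echo "dd if=/dev/zero oflag=direct of=$m/ddfile.out bs=128k count=1024000"
    echo "dd if=$m/ddfile.out iflag=direct of=/dev/null bs=128k count=1024000"
  done
}
dd_test_cmds
```

Comparing the per-device throughput numbers side by side is the quickest way to spot a slow disk.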
In the BIOS, consider turning off power saving options
Validate CPU and memory performance
– "hdparm -T <device>" reads data from the device's cache as fast as possible (a quick sanity check of memory/cache throughput)
Cluster and BigInsights performance tuning is a broad topic
We will touch on select topics and settings in the slides that follow
How to check current settings
– ifconfig
– sysctl -a | grep net
ifconfig settings commonly tuned include
– MTU (Maximum Transmission Unit) -- often set to a larger value such as 9000 (jumbo frames)
– txqueuelen (transmit queue length) -- a high value, e.g. 2000, is recommended for servers that perform large data transfers
Other network settings we typically check/tune:
– Read/write memory buffers (e.g. net.ipv4.tcp_rmem, net.ipv4.tcp_wmem)
Big SQL users: it is a good idea to tune the TCP "keep alive" and "tcp_fin_timeout" settings for improved stability (fewer timeouts and hangs)
There are more network settings than we can cover here
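As an illustration, the buffer and timeout settings mentioned above might be persisted in /etc/sysctl.conf like this. The specific values are common Hadoop-community starting points, not validated recommendations for every cluster; test before adopting them:

```
# Hypothetical /etc/sysctl.conf fragment -- values are illustrative
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 600
```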
Here are examples of things to tune in the Linux kernel
For clusters mostly doing Map/Reduce and a lot of sequential I/O, consider:
– Using the "deadline" scheduler (instead of the default "cfq", Completely Fair Queuing)
• Check: cat /sys/block/sdX/queue/scheduler
• How to change: echo deadline > /sys/block/sdX/queue/scheduler
– Tune read ahead buffer size
• By default the Linux OS reads 128 KB of data in advance so that it is already in memory before the program needs it. Increasing this value can improve sequential read performance.
• Check: cat /sys/block/sdX/queue/read_ahead_kb
• How to change: echo 3072 > /sys/block/sdX/queue/read_ahead_kb
There are many more kernel settings not covered here
Tune the file system based on your file system type
Examples of file system type: ext3, ext4, xfs
– Note: GPFS tuning will be covered in a later session
For ext4 file systems, consider settings like “dir_index” and “extent”
– mkfs.ext4 -O dir_index,extent /dev/sdX
• dir_index: Use hashed b-trees to speed up lookups in large directories
• extent: Instead of using the indirect block scheme for storing the location of data blocks in an inode, use extents instead. This is a much more efficient encoding which speeds up filesystem access, especially for large files.
Mount options
– Turn off atime
• noatime: Access timestamps are not updated when a file is read
• nodiratime: Turn off directory time stamps
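Putting the ext4 options and mount options together, a DataNode disk entry in /etc/fstab might look like the following. The device name and mount point are placeholders:

```
# Hypothetical /etc/fstab entry for a DataNode data disk
/dev/sdb1  /data1  ext4  defaults,noatime,nodiratime  0 0
```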
redhat_transparent_hugepage/enabled (Redhat 6.2 or later)
– To bring "huge page" performance benefits to legacy software, Red Hat implemented a custom Linux kernel extension called "redhat_transparent_hugepage".
– Recommendation: turn transparent huge pages off, since this functionality is still maturing.
• This is a well-known tuning optimization in the Hadoop community
• echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
vm.swappiness
– This control defines how aggressively the kernel swaps memory pages. Higher values increase aggressiveness; lower values decrease the amount of swapping. Default is 60.
– Recommend a lower value in the range 5 to 20 (test in increments of 5)
vm.min_free_kbytes (came up during Big SQL testing)
– min_free_kbytes changes the page reclaim thresholds. When this number is increased the system starts reclaiming memory earlier; when it is lowered, it starts reclaiming memory later.
– If you see "page allocation" failures in /var/log/messages then chances are this is set too low.
– For Big SQL testing on machines with 128 GB of physical memory, we set this to 2621440 (2.5 GB)
– (If using GPFS) GPFS documentation recommends setting vm.min_free_kbytes to 5 to 6% of total physical memory
There are many more vm kernel settings you can explore
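The GPFS guideline above (5 to 6% of physical memory) is easy to compute. The sketch below assumes a 128 GB machine, matching the Big SQL example; note the result is larger than the 2.5 GB value used in that test, so the two recommendations differ:

```shell
# Illustrative arithmetic: vm.min_free_kbytes at 5-6% of physical memory.
MEM_KB=$((128 * 1024 * 1024))        # 128 GB expressed in KB
MIN_FREE_LOW=$((MEM_KB * 5 / 100))   # 5% of physical memory
MIN_FREE_HIGH=$((MEM_KB * 6 / 100))  # 6% of physical memory
echo "vm.min_free_kbytes range: $MIN_FREE_LOW - $MIN_FREE_HIGH"
```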
Give HDFS as many paths as possible, where each path is on a different device
Can be configured at install time
– Can also be changed later using parameter “dfs.data.dir” in hdfs-site.xml
In the installer, under “DataNode/TaskTracker”
– Installer default path: /hadoop/hdfs/data
– You almost always want to change this to one or more paths
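A multi-path configuration in hdfs-site.xml might look like the following sketch, with one path per physical device. The /dataN mount points are placeholders for your own device mounts:

```xml
<!-- Hypothetical hdfs-site.xml entry: one path per physical device -->
<property>
  <name>dfs.data.dir</name>
  <value>/data1/hdfs/data,/data2/hdfs/data,/data3/hdfs/data</value>
</property>
```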
Tune the underlying devices and file systems as discussed on previous slides
When running large Big SQL & Hive workloads, the following settings have proved useful in improving cluster stability for both Apache M-R and Platform Symphony M-R:
– dfs.datanode.handler.count=40 (default is 10). The number of server threads for the DataNode; usually about the same as the number of CPUs.
– dfs.datanode.max.xcievers=65536 (default is 8192). Maximum number of files that a DataNode will serve concurrently.
– dfs.datanode.socket.write.timeout=960000 (default is 480000). Units are milliseconds; 960000 is 16 minutes. Increase to avoid tasks failing due to write timeout errors.
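As a sketch, the three overrides above could be placed in hdfs-site.xml like this (note "xcievers" is the property's actual, historically misspelled name):

```xml
<!-- Hypothetical hdfs-site.xml fragment with the stability overrides above -->
<property>
  <name>dfs.datanode.handler.count</name>
  <value>40</value>
</property>
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>65536</value>
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>960000</value>
</property>
```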
“Intermediate” data gets shuffled between map and reduce tasks
Intermediate data is kept on local storage as configured by Hadoop parameter “mapred.local.dir”
The speed of this storage directly impacts M/R performance
For better performance, spread across multiple devices
Configured at install time on “File System” panel
– Referred to as “Cache directory”
– Can also be updated post-install in mapred-site.xml
In the installer, expand “MapReduce general settings”
– Default path: /hadoop/mapred/local
– Change this to one or more paths, where each path corresponds to a different device
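A multi-device layout in mapred-site.xml might look like this sketch; the /dataN mount points are placeholders, one per physical device:

```xml
<!-- Hypothetical mapred-site.xml entry: one path per physical device -->
<property>
  <name>mapred.local.dir</name>
  <value>/data1/mapred/local,/data2/mapred/local,/data3/mapred/local</value>
</property>
```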
Intermediate storage (mapred.local.dir) can share the same disks as HDFS (dfs.data.dir)
[Diagram: DataNode 1 through DataNode N, each node's local disks shared between dfs.data.dir, intermediate data, logs, etc.]
We only have time to cover a few M/R settings in this presentation
– Number of slots
– JVM heap sizes
Backup slides cover more topics
– Number of reduce tasks
– Compression
– Sorting
– Block and split sizes
– Making clusters rack aware
Number of map and reduce slots on each TaskTracker
Setting: mapred.tasktracker.map.tasks.maximum
Default value (formula):
<%= Math.min(Math.ceil(numOfCores * 1.0), Math.ceil(maxPartition*0.66*totalMem/1000)) %>

Setting: mapred.tasktracker.reduce.tasks.maximum
Default value (formula):
<%= Math.min(Math.ceil(numOfCores * 0.5), Math.ceil(maxPartition*0.33*totalMem/1000)) %>
Used to tune the degree of Map/Reduce concurrency on the cluster
Defines the maximum number of concurrently occupied slots on a node
– Too high and the cluster becomes unstable; too low and you are wasting machine resources
– Tuned in association with the map and reduce task JVM size (mapred.child.java.opts)
– On large machines, particularly those with many CPU cores and/or hyper-threading enabled, the default values may be too large
• Example: Simon tested Big SQL on Power machines with 16 cores and SMT=4, resulting in 64 virtual CPUs. By default, BigInsights configured 64 map slots and 32 reduce slots (96 slots in all). This was too many and resulted in high context switching and cluster instability. After tuning, Simon found 24 map slots and 12 reduce slots to be much closer to optimal.
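The default map-slot formula above can be evaluated by hand. The sketch below assumes a hypothetical node with 16 cores and 64 GB of memory (totalMem in MB), and assumes maxPartition=1; the real value of maxPartition depends on the node's disk layout:

```shell
# Evaluate min(ceil(numOfCores*1.0), ceil(maxPartition*0.66*totalMem/1000))
numOfCores=16
totalMem=65536      # MB (64 GB); illustrative
maxPartition=1      # assumed value for this sketch
map_slots=$(awk -v c="$numOfCores" -v m="$totalMem" -v p="$maxPartition" '
  function ceil(x) { return (x == int(x)) ? x : int(x) + 1 }
  BEGIN {
    a = ceil(c * 1.0)               # CPU-based bound
    b = ceil(p * 0.66 * m / 1000)   # memory-based bound
    min = (a < b) ? a : b           # formula takes the minimum
    print min
  }')
echo "default map slots: $map_slots"
```

For this node the CPU bound (16) is smaller than the memory bound (44), so the formula yields 16 map slots.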
Additional notes about these two Hadoop parameters:
1. The formulas for "mapred.tasktracker.map.tasks.maximum" and "mapred.tasktracker.reduce.tasks.maximum" are evaluated on each TaskTracker and thus can vary by node
2. If changed, be sure to restart all TaskTrackers
3. Since these are TaskTracker settings, overriding them on individual Hadoop jobs has no effect
Individual Hadoop jobs can run with custom JVM arguments
Setting: mapred.child.java.opts
Default value: -Xmx1000m (new default in BI 2.1; BI 2.0 default: -Xmx600m -Xshareclasses)

Setting: mapred.map.child.java.opts
Default value: none. If set, overrides "mapred.child.java.opts" for map tasks.

Setting: mapred.reduce.child.java.opts
Default value: none. If set, overrides "mapred.child.java.opts" for reduce tasks.
Tune the amount of memory assigned to each map and reduce task
– the –Xmx property defines the maximum Java heap size
– Conservative setting: (60% of real memory) / (mapred.tasktracker.map.tasks.maximum + mapred.tasktracker.reduce.tasks.maximum)
– Aggressive setting: (80% of real memory) / (mapred.tasktracker.map.tasks.maximum + mapred.tasktracker.reduce.tasks.maximum)
– Often tuned in association with mapred.tasktracker.map.tasks.maximum & mapred.tasktracker.reduce.tasks.maximum
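Worked through for the conservative rule, using the 24/12 slot counts from the earlier Power example and an illustrative 64 GB of real memory:

```shell
# Conservative heap rule: 60% of real memory divided across all slots.
REAL_MEM_MB=65536   # 64 GB; illustrative
MAP_SLOTS=24
REDUCE_SLOTS=12
HEAP_MB=$((REAL_MEM_MB * 60 / 100 / (MAP_SLOTS + REDUCE_SLOTS)))
echo "mapred.child.java.opts=-Xmx${HEAP_MB}m"
```

The aggressive rule is the same calculation with 80 in place of 60.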
Kernel settings:
– sysctl -w <parameter>=<value>, or
• Always confirm the setting was actually changed
– Add to /etc/sysctl.conf
• vm.swappiness = 5
ifconfig settings
– Edit /etc/sysconfig/network-scripts/ifcfg-ethX, change lines like:
• MTU="9000"
– Edit /etc/rc.local, add lines like:
• ifconfig ethX txqueuelen 2000
A Hadoop job by default will only have one reduce task, which is usually not what you want
Setting: mapred.reduce.tasks
Default value: 1 (not set in mapred-site.xml)
Tune the number of reducer tasks for your Hadoop jobs
– The default number of reduce tasks is 1, which is OK only for very small jobs
– Recommendation is to override “mapred.reduce.tasks” to a value greater than 1 for most jobs
– If number of reduce tasks is too few, then each reduce task will take a long time and reduce time will dominate overall job time
– If too many, then reduce tasks will tie up slots and hold back other jobs
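For jobs that parse arguments via Hadoop's ToolRunner, the override can be supplied at submission time. This is an illustration only; the jar name, class, and paths are placeholders, and it requires a running cluster:

```shell
# Hypothetical job submission overriding the reducer count
hadoop jar myjob.jar MyJob -Dmapred.reduce.tasks=32 /input /output
```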
Use compression to save space and in some cases improve performance
Setting: mapred.map.output.compression.codec
Notes: the compression codec to use (a Java class). Must be listed in parameter "io.compression.codecs". Example: com.ibm.biginsights.compress.CmxCodec

Setting: mapred.output.compress
Notes: Boolean. Default is false. Whether or not to compress the final output from a Hadoop job.

Setting: mapred.compress.map.output
Notes: Boolean. Default is false. Whether or not to compress intermediate data shuffled from map to reduce tasks.

Setting: mapred.output.compression.type
Notes: default is RECORD. How to compress sequence files. Other values are NONE and BLOCK.

Setting: io.seqfile.compress.blocksize
Notes: default is 1 million (bytes). Block size when compressing a sequence file and mapred.output.compression.type=BLOCK.
Consider benefits and drawbacks of compression
– Tradeoff between storage space and CPU time
– Compressing map output can speed up the shuffle phase
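A sketch of enabling intermediate (map output) compression with the bundled CmxCodec, either per job or in mapred-site.xml; treat it as a starting point to benchmark, not a universal recommendation:

```xml
<!-- Hypothetical fragment enabling compression of shuffled map output -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.ibm.biginsights.compress.CmxCodec</value>
</property>
```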
Hadoop jobs can run with custom sort parameters
Setting: io.sort.mb
Default value: 256

Setting: io.sort.factor
Default value: 10 (new default in BI 2.1; BI 2.0 default: 64)
Tune the amount of memory and concurrency to use for sorting using io.sort.mb and io.sort.factor
– io.sort.mb is the size, in megabytes, of the memory buffer to use while sorting map output
– io.sort.factor is the maximum number of streams to merge at once when sorting files
– Should be tuned in conjunction with the -Xmx value in mapred.child.java.opts
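As a sketch, both settings could be raised in mapred-site.xml; the values below are illustrative starting points for sequential-I/O-heavy workloads, not universal answers:

```xml
<!-- Hypothetical mapred-site.xml fragment; values are illustrative -->
<property>
  <name>io.sort.mb</name>
  <value>256</value>
</property>
<property>
  <name>io.sort.factor</name>
  <value>64</value>
</property>
```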
Block size is an important characteristic of your data on HDFS
Setting: dfs.block.size
Default value: 128 MB

Setting: mapred.min.split.size
Default value: 0

Setting: mapred.max.split.size
Default value: no limit
Choose an appropriate block size for your files using dfs.block.size
– The size of one block of data on HDFS
– Block size is a property of each file on HDFS
– Usually defines the size of input data (split) passed to a map task for processing
– Split sizes can be further tuned using "mapred.min.split.size" and "mapred.max.split.size"
– If the split size is too small, you incur the overhead of creating and destroying many small tasks
– If the split size is too large, the concurrency of the cluster is reduced
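The split/block relationship is simple arithmetic. An illustrative example, assuming the default split equals the block size:

```shell
# With a 128 MB block size, a 10 GB file splits into about 80 blocks,
# so a job scanning it runs roughly 80 map tasks.
FILE_MB=10240   # 10 GB; illustrative
BLOCK_MB=128    # dfs.block.size
SPLITS=$(( (FILE_MB + BLOCK_MB - 1) / BLOCK_MB ))  # ceiling division
echo "approximate map tasks: $SPLITS"
```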
Additional Hadoop parameters (defaults in mapred-site.xml) that are often overridden for individual Hadoop jobs:
Setting: mapred.reduce.slowstart.completed.maps
Default value: 0.5
Notes: fraction of the map tasks in the job that should complete before reduces are scheduled. Starting reducers earlier can help some jobs finish faster, but the reducers tie up slots.
Rack awareness
Setting: topology.script.file.name
Default value: no script
Notes: add to core-site.xml. Used to make Hadoop rack-aware.
Enable rack awareness for large clusters spanning several racks, where all racks have roughly the same number of nodes.
Warning: it may hurt performance to enable rack awareness on small clusters with an uneven number of nodes per rack. Racks with fewer nodes tend to be busier.
The number of slots on each compute node has a huge impact on performance
– Cluster is under-utilized if too few slots
– Cluster can become over-committed if too many slots
– The number of slots needs to be configured based on:
• Compute node hardware (CPU, memory, number of disks, etc.)
• Workload characteristics (memory or CPU or storage intensive)
Default number of slots:
– Adaptive M/R: Equal to number of CPUs
• Single slot pool for map and reduce tasks
– Apache M/R: Separate pools for map and reduce slots with default ratio of 2 map slots for every reduce slot
How to change the number of slots (Adaptive M/R)
– Command line
• Example: Configure 16 slots per compute node
Edit $BIGINSIGHTS_HOME/hdm/components/HAManager/conf/ResourceGroups.xml and change two lines:
> <ResourceGroup ResourceGroupName="ManagementHosts" availableSlots="16" >
> <ResourceGroup ResourceGroupName="ComputeHosts" availableSlots="16" >
Run $BIGINSIGHTS_HOME/bin/syncconf.sh HAManager
Check that changes appear in $EGO_CONFDIR/ResourceGroups.xml
• Web UI (not officially externalized in BigInsights)
http://<master node>:18080/Platform
Bring up Map/Reduce UI
Resources –> Resource Groups -> ComputeHosts
How to change the number of slots (Apache M/R)
– Edit $BIGINSIGHTS_HOME/hdm/hadoop-conf-staging/mapred-site.xml
– Change mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum
– Run $BIGINSIGHTS_HOME/bin/syncconf.sh hadoop
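The edit to mapred-site.xml might look like this sketch; the values match the tuned Power example from earlier in the deck and are illustrative, not a general recommendation:

```xml
<!-- Hypothetical mapred-site.xml fragment: 24 map slots, 12 reduce slots -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>24</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>12</value>
</property>
```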
Adaptive M/R honors property “mapred.child.java.opts”
– Default in mapred-site.xml
• Out-of-box, mapred.child.java.opts = -Xmx1000m
– Hadoop jobs often override “mapred.child.java.opts” value
Since there is no distinction between map slots and reduce slots in Adaptive M/R, special consideration is needed
– Never oversubscribe your system memory ( number of slots * memory per slot )
– mapred.reduce.slowstart.completed.maps
io.sort.mb (100) / io.sort.factor (10)
– Increasing io.sort.mb when map output is large should decrease map-side I/O; it sets the buffer size for map-side sorting. Recommend a default of 256, or higher depending on block size
– Increasing io.sort.factor should decrease the number of spills to disk and shorten the sort/shuffle phase; the penalty is extra GC. Recommend a default of 64
Compression: mapred.map.output.compression.codec, mapred.compress.map.output
– Enabling compression of map output decreases disk and network activity
– On most workloads, when spare CPU is available, compression will improve overall performance; in some environments, dramatically
– Out of box, com.ibm.biginsights.compress.CmxCodec, provides good compression ratio without too much CPU overhead
mapred.reduce.input.buffer.percent (default 0.00)
– Sets percentage of memory relative to heap size
– With default, all in-memory segments are persisted to disk and then read back when calling user reduce function
– Increasing this to 0.96 would allow reducers to use up to 96% of available memory to keep segments in memory while calling the reduce function; consider this when map output is large, as in Terasort
– Conversely, TPC-H workloads do not benefit, since their map outputs are smaller
mapred.job.shuffle.merge.percent
– % of memory allocated to storing in-memory map outputs
mapred.job.shuffle.input.buffer.percent
– % of memory allocated from max heap for storing map outputs during shuffle
mapred.local.dir / dfs.data.dir
– the more the merrier!!
– On dedicated cluster, using all disks (X) on a compute host should yield better results than using X/2 disks for mapred.local.dir and the other X/2 for dfs.data.dir