BI Performance Tuning Settings

BigInsights Tuning

Eric Yang

Jim Huang

Stewart Tate

Simon Harris

Michael Ahern

John Poelman

Agenda

 BigInsights Tuning

– Validating and tuning a new cluster

 Performance Improvements with Adaptive Map/Reduce

 Special Topic: “HBase tuning experience with NICE” by Bruce Brown

 Tuning Big SQL: Results and lessons learned

 Q & A

 Topics covered later in the week

– More on tuning Big SQL (e.g. hints)

– GPFS performance

Validating a new cluster

 The performance of a new cluster should be validated and tuned

– Slow nodes will drag down the performance of the entire cluster

 Things to validate:

– Network

– Storage

– BIOS

– CPU

– Memory

Validating a new cluster (Network)

 Network validation

– Ensure expected sustained bandwidth between cluster nodes

– There are many tools that can be used to validate the network

– One tool we frequently use is iperf ( http://en.wikipedia.org/wiki/Iperf )

• Test bandwidth between every pair of nodes

• Example of how to invoke iperf client: iperf --client <server hostname> --time 30 --interval 5 --parallel 1 --dualtest
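Pairwise testing can be scripted; a rough sketch (assuming an iperf server, “iperf --server”, is already running on every node and hosts.txt lists all cluster hostnames):

  for server in $(cat hosts.txt); do
    for client in $(cat hosts.txt); do
      [ "$server" = "$client" ] && continue
      echo "=== $client -> $server ==="
      ssh $client "iperf --client $server --time 30 --interval 5 --parallel 1 --dualtest"
    done
  done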

Validating a new cluster (Storage)

 Storage validation

– JBOD is the preferred storage setup. Avoid volume groups, logical volumes, and similar abstractions.

– Each local device should be set up JBOD or (if the storage controller does not support JBOD) as a RAID-0 single-device array.

– Check device settings

• Are write caches enabled? Write caches are usually enabled except for storage that needs high durability, such as storage for the NameNode

– Based on the type of file system (e.g. ext), check file system settings (block size, etc.)

• Example (ext4): /sbin/dumpe4fs /dev/sdX

– Check mount options

• Usually you want to see that “atime” and “user extended attributes” are disabled

Validating a new cluster (Storage)

 Storage validation (continued)

– Use the Linux dd command to test read/write throughput of every device or file system that BigInsights will use

– (Optional) Test raw device read/write performance BEFORE creating file system

• Warning: Writing to a raw device will clobber existing data (if any)

– (Mandatory) Test read/write performance after creating and mounting a file system on the device

• Test with and without Direct I/O

• With Direct I/O

 dd if=/dev/zero oflag=direct of=<path>/ddfile.out bs=128k count=1024000

 dd if=<path>/ddfile.out iflag=direct of=/dev/null bs=128k count=1024000

• Without Direct I/O

 dd if=/dev/zero of=<path>/ddfile.out bs=128k count=1024000

 dd if=<path>/ddfile.out of=/dev/null bs=128k count=1024000
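A minimal sketch for exercising every data mount point in turn (the mount points are assumptions; run as root so the cache drop works, and note each pass writes roughly 128 GB):

  for mp in /data1 /data2 /data3; do
    echo "=== $mp write ==="
    dd if=/dev/zero of=$mp/ddfile.out bs=128k count=1024000
    sync; echo 3 > /proc/sys/vm/drop_caches   # drop the page cache so the read test hits disk
    echo "=== $mp read ==="
    dd if=$mp/ddfile.out of=/dev/null bs=128k count=1024000
    rm -f $mp/ddfile.out
  done

dd reports throughput (MB/s) on completion; compare the numbers across devices and nodes to spot slow hardware.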

Validating a new cluster (BIOS/CPU/Memory)

 In the BIOS, consider turning off power saving options

 Validate CPU and memory performance

– “hdparm -T <device>” reads data from the device’s caches as fast as possible
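For example (the device name is an assumption):

  hdparm -T /dev/sda   # cached reads: exercises CPU and memory, not the disk
  hdparm -t /dev/sda   # buffered device reads, for comparison

On Linux you can also spot-check that power saving is not throttling the CPUs:

  cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor   # “performance” preferred over “ondemand”/“powersave”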

Tuning a new cluster

 Cluster and BigInsights performance tuning is a broad topic

 We will touch on select topics and settings in the slides that follow

Tuning the network

 How to check current settings

– ifconfig

– sysctl -a | grep net

 ifconfig settings commonly tuned include

– MTU (Maximum Transmission Unit) -- Often set to a larger value such as 9000 (jumbo frames), provided the network supports it.

– txqueuelen (Transmit Queue Length) -- A higher value, e.g. 2000, is recommended for servers that perform large data transfers.

 Other network settings we typically check/tune:

– Read/write memory buffers (e.g. net.ipv4.tcp_rmem, net.ipv4.tcp_wmem)

 Big SQL users: it is a good idea to tune the “keep alive” and “tcp_fin_timeout” TCP/IP settings for improved stability (fewer timeouts and hangs); example values are sketched below
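As an illustration only (the values are assumptions to validate for your environment, not BigInsights-mandated defaults):

  sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
  sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
  sysctl -w net.ipv4.tcp_keepalive_time=600
  sysctl -w net.ipv4.tcp_fin_timeout=30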

 There are more network settings than we can cover here

Tuning I/O in the Linux Kernel

 Here are examples of things to tune in the Linux kernel

 For clusters mostly doing Map/Reduce and a lot of sequential I/O, consider:

– Using the “deadline” scheduler (instead of the default “completely fair” scheduler, CFQ)

• Check: cat /sys/block/sdX/queue/scheduler

• How to change: echo deadline > /sys/block/sdX/queue/scheduler

– Tune read ahead buffer size

• By default the Linux OS reads 128 KB of data in advance so that it is already in the memory cache before the program needs it. Increasing this value can improve sequential read performance.

• Check: cat /sys/block/sdX/queue/read_ahead_kb

• How to change: echo 3072 > /sys/block/sdX/queue/read_ahead_kb
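A small sketch for applying both settings to every data device (the device names are assumptions; note these /sys writes do not survive a reboot, see the persistence slide later in this deck):

  for dev in sdb sdc sdd; do
    echo deadline > /sys/block/$dev/queue/scheduler
    echo 3072 > /sys/block/$dev/queue/read_ahead_kb
  done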

 There are many more kernel settings not covered here

Tuning the file system

 Tune the file system based on your file system type

 Examples of file system type: ext3, ext4, xfs

– Note: GPFS tuning will be covered in a later session

 For ext4 file systems, consider settings like “dir_index” and “extent”

– mkfs.ext4 -O dir_index,extent /dev/sdX

• dir_index: Use hashed b-trees to speed up lookups in large directories

• extent: Instead of using the indirect block scheme for storing the location of data blocks in an inode, use extents instead. This is a much more efficient encoding which speeds up filesystem access, especially for large files.

 Mount options

– Turn off atime

• noatime: Access timestamps are not updated when a file is read

• nodiratime: Turn off directory time stamps
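For example, an /etc/fstab entry with these options might look like the following (device and mount point are assumptions):

  /dev/sdb1  /data1  ext4  defaults,noatime,nodiratime  0 0

On recent kernels noatime implies nodiratime, but listing both makes the intent explicit.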

Tuning Kernel Virtual Machine Memory

 redhat_transparent_hugepage/enabled (Red Hat 6.2 or later)

– In order to bring "huge page" performance benefits to legacy software, Red Hat has implemented a custom Linux kernel extension called "redhat_transparent_hugepage".

– Recommendation: Turn off transparent huge pages, since this functionality is still maturing.

• This is a well-known tuning optimization in the Hadoop community

• echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled

 vm.swappiness

– This control defines how aggressively the kernel swaps memory pages. Higher values increase aggressiveness; lower values decrease the amount of swapping. Default is 60.

– Recommend a lower value in the range 5 to 20 (test in increments of 5)

 vm.min_free_kbytes (came up during Big SQL testing)

– min_free_kbytes changes the page reclaim thresholds. When this number is increased the system starts reclaiming memory earlier; when it is lowered, it starts reclaiming memory later.

– If you see "page allocation" failures in /var/log/messages then chances are this is set too low.

– For Big SQL testing on machines with 128 GB of physical memory, we set this to 2621440 (2.5 GB)

– (If using GPFS) GPFS documentation recommends setting vm.min_free_kbytes to 5 to 6% of the total physical memory

 There are many more vm kernel settings you can explore
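Pulling the above together as commands (the values are the ones discussed on this slide; verify them for your own hardware):

  echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
  sysctl -w vm.swappiness=5              # test values in the 5 to 20 range
  sysctl -w vm.min_free_kbytes=2621440   # 2.5 GB, per the Big SQL testing above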

Tuning HDFS

 Give HDFS as many paths as possible, where each path is on a different device

 Can be configured at install time

– Can also be changed later using parameter “dfs.data.dir” in hdfs-site.xml

 In the installer, under “DataNode/TaskTracker”

– Installer default path: /hadoop/hdfs/data

– You almost always want to change this to a list of paths, one per physical device
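A sketch of the resulting hdfs-site.xml entry, assuming four data disks mounted at /data1 through /data4:

  <property>
    <name>dfs.data.dir</name>
    <value>/data1/hadoop/hdfs/data,/data2/hadoop/hdfs/data,/data3/hadoop/hdfs/data,/data4/hadoop/hdfs/data</value>
  </property>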

Tuning HDFS

 Tune the underlying devices and file systems as discussed on previous slides

 When running large Big SQL & Hive workloads, the following settings have proved useful in improving the stability of a cluster for both Apache M-R & Platform Symphony M-R:

– dfs.datanode.handler.count=40 (default is 10). The number of server threads for the datanode. Usually about the same as number of CPUs.

– dfs.datanode.max.xcievers=65536 (default is 8192). Maximum number of files that a DataNode will serve concurrently.

– dfs.datanode.socket.write.timeout=960000 (default is 480000). Units are milliseconds; 960000 is 16 minutes. Increase to avoid tasks failing due to write timeout errors.
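In hdfs-site.xml form (the values are the ones listed above):

  <property><name>dfs.datanode.handler.count</name><value>40</value></property>
  <property><name>dfs.datanode.max.xcievers</name><value>65536</value></property>
  <property><name>dfs.datanode.socket.write.timeout</name><value>960000</value></property>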

Tuning Map/Reduce (Intermediate storage)

 “Intermediate” data gets shuffled between map and reduce tasks

 Intermediate data is kept on local storage as configured by Hadoop parameter “mapred.local.dir”

 The speed of this storage directly impacts M/R performance

 For better performance, spread across multiple devices

 Configured at install time on “File System” panel

– Referred to as “Cache directory”

– Can also be updated post-install in mapred-site.xml

 In the installer, expand “MapReduce general settings”

– Default path: /hadoop/mapred/local

– Change this to one or more paths, where each path corresponds to a different device
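For example, in mapred-site.xml (the mount points are assumptions):

  <property>
    <name>mapred.local.dir</name>
    <value>/data1/hadoop/mapred/local,/data2/hadoop/mapred/local,/data3/hadoop/mapred/local</value>
  </property>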

Tuning HDFS and Intermediate storage

 Intermediate storage (mapred.local.dir) can share the same disks as HDFS (dfs.data.dir)

[Diagram: DataNodes 1 through N, each with local disks shared between HDFS (dfs.data.dir), intermediate storage, and logs]

Tuning Map/Reduce

 We only have time to cover a few M/R settings in this presentation

– Number of slots

– JVM heap sizes

 Backup slides cover more topics

– Number of reduce tasks

– Compression

– Sorting

– Block and split sizes

– Making clusters rack aware

Tuning Map/Reduce (Number of slots)

 Number of map and reduce slots on each TaskTracker

Setting: mapred.tasktracker.map.tasks.maximum

Default value (formula): <%= Math.min(Math.ceil(numOfCores * 1.0), Math.ceil(maxPartition*0.66*totalMem/1000)) %>

Setting: mapred.tasktracker.reduce.tasks.maximum

Default value (formula): <%= Math.min(Math.ceil(numOfCores * 0.5), Math.ceil(maxPartition*0.33*totalMem/1000)) %>

 Used to tune the degree of Map/Reduce concurrency on the cluster

 Defines the maximum number of concurrently occupied slots on a node

– Too high and the cluster becomes unstable, too low and you are wasting machine resources

– Tuned in association with the map and reduce task JVM size (mapred.child.java.opts)

– On large machines, particularly those with many CPU cores and/or hyper-threading enabled, the default values may be too large

• Example: Simon tested Big SQL on Power machines with 16 cores and SMT=4, resulting in 64 virtual CPUs. By default, BigInsights configured 64 map slots and 32 reduce slots (96 slots in all). This was too many and resulted in high context switching and cluster instability. After tuning, Simon found that 24 map slots and 12 reduce slots worked much better.

Tuning Map/Reduce (Number of slots - Continued)

 Additional notes about these two Hadoop parameters:

1. The formulas for “mapred.tasktracker.map.tasks.maximum” and “mapred.tasktracker.reduce.tasks.maximum” are evaluated on each TaskTracker and thus can vary by node

2. If changed, be sure to restart all TaskTrackers

3. Since these are TaskTracker settings, overriding them on individual Hadoop jobs has no effect

Tuning Map/Reduce (Task JVM heap size)

 Individual Hadoop jobs can run with custom JVM arguments

Setting: mapred.child.java.opts

Default value: -Xmx1000m (new default value in BI 2.1; BI 2.0 default: -Xmx600m -Xshareclasses)

Setting: mapred.map.child.java.opts

Default value: none. If set, overrides “mapred.child.java.opts” for map tasks.

Setting: mapred.reduce.child.java.opts

Default value: none. If set, overrides “mapred.child.java.opts” for reduce tasks.

 Tune the amount of memory assigned to each map and reduce task

– The -Xmx option defines the maximum Java heap size

– A conservative setting would be (60% of real memory) / (mapred.tasktracker.map.tasks.maximum + mapred.tasktracker.reduce.tasks.maximum)

– An aggressive setting would be (80% of real memory) / (mapred.tasktracker.map.tasks.maximum + mapred.tasktracker.reduce.tasks.maximum)

– Often tuned in association with mapred.tasktracker.map.tasks.maximum & mapred.tasktracker.reduce.tasks.maximum
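A worked example under assumed conditions (a node with 128 GB of RAM configured with 24 map slots and 12 reduce slots, as in the earlier slot-tuning example): the conservative rule gives 0.6 × 128 GB / 36 ≈ 2.1 GB per task, so -Xmx2100m is a reasonable starting point; the aggressive rule gives 0.8 × 128 GB / 36 ≈ 2.8 GB.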

Making tuning changes persist after reboot

 Kernel settings:

– sysctl -w <parameter>=<value> (takes effect immediately but does not survive a reboot on its own), or

• Always confirm the setting was actually changed

– Add to /etc/sysctl.conf

• vm.swappiness = 5

 ifconfig settings

– Edit /etc/sysconfig/network-scripts/ifcfg-ethX, change lines like:

• MTU="9000"

– Edit /etc/rc.local, add lines like:

• ifconfig ethX txqueuelen 2000
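The /sys tunables discussed earlier (I/O scheduler, read ahead, transparent huge pages) are not sysctl parameters; one common approach is to re-apply them from /etc/rc.local as well, for example (device name is an assumption):

  echo deadline > /sys/block/sdb/queue/scheduler
  echo 3072 > /sys/block/sdb/queue/read_ahead_kb
  echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled

After editing /etc/sysctl.conf, running "sysctl -p" loads the new values without a reboot.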

Tuning Map/Reduce (Number of reduce tasks)

 A Hadoop job by default will only have one reduce task, which is usually not what you want

Setting: mapred.reduce.tasks

Default value: 1 (not set in mapred-site.xml)

 Tune the number of reducer tasks for your Hadoop jobs

– The default number of reduce tasks is 1, which is OK only for very small jobs

– Recommendation is to override “mapred.reduce.tasks” to a value greater than 1 for most jobs

– If number of reduce tasks is too few, then each reduce task will take a long time and reduce time will dominate overall job time

– If too many, then reduce tasks will tie up slots and hold back other jobs
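For instance, a job that uses ToolRunner can override this on the command line (the jar name, class, and paths are placeholders):

  hadoop jar myjob.jar MyJob -D mapred.reduce.tasks=48 /input /output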

Tuning Map/Reduce (Compression)

 Use compression to save space and in some cases improve performance

Setting: mapred.map.output.compression.codec

The compression codec to use (a Java class). Must be listed in parameter “io.compression.codecs”. Example: com.ibm.biginsights.compress.CmxCodec

Setting: mapred.output.compress

Boolean. Default is false. Whether or not to compress the final output of a Hadoop job.

Setting: mapred.compress.map.output

Boolean. Default is false. Whether or not to compress intermediate data shuffled from map to reduce tasks.

Setting: mapred.output.compression.type

Default is RECORD. How to compress sequence files. Other values are NONE and BLOCK.

Setting: io.seqfile.compress.blocksize

Default is 1 million (bytes). Block size when compressing a sequence file and mapred.output.compression.type=BLOCK.

 Consider benefits and drawbacks of compression

– Tradeoff between storage space and CPU time

– Compressing map output can speed up the shuffle phase
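To enable compression of map output with the codec named above, for example (set in mapred-site.xml, or per job with -D):

  <property><name>mapred.compress.map.output</name><value>true</value></property>
  <property><name>mapred.map.output.compression.codec</name><value>com.ibm.biginsights.compress.CmxCodec</value></property>

Remember the codec class must also appear in io.compression.codecs.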

Tuning Map/Reduce (Sorting)

 Hadoop jobs can run with custom sort parameters

Setting: io.sort.mb

Default value: 256

Setting: io.sort.factor

Default value: 10 (new default value in BI 2.1; BI 2.0 default: 64)

 Tune the amount of memory and concurrency to use for sorting using io.sort.mb and io.sort.factor

– io.sort.mb is the size, in megabytes, of the memory buffer to use while sorting map output

– io.sort.factor is the maximum number of streams to merge at once when sorting files

– Should be tuned in conjunction with the -Xmx value in mapred.child.java.opts (see the sketch below)
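As a hedged illustration, a job producing large map output might raise these together (the values are assumptions; io.sort.mb must fit comfortably inside the task heap set by -Xmx):

  hadoop jar myjob.jar MyJob -D io.sort.mb=256 -D io.sort.factor=64 -D mapred.child.java.opts=-Xmx1536m /input /output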

Tuning Map/Reduce (Block Size/Split Size)

 Block size is an important characteristic of your data on HDFS

Setting: dfs.block.size

Default value: 128 MB

Setting: mapred.min.split.size

Default value: 0

Setting: mapred.max.split.size

Default value: no limit

 Choose an appropriate block size for your files using dfs.block.size

– The size of one block of data on HDFS

– Block size is a property of each file on HDFS

– Usually defines the size of input data (split) passed to a map task for processing

– Split sizes can be further tuned using “mapred.min.split.size” and “mapred.max.split.size”

– If split size is too small, then incur overhead of creating/destroying many small tasks

– If split size is too large, then concurrency of the cluster is reduced
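For example, the block size can be chosen per file at write time (path and size are placeholders; 268435456 bytes is 256 MB):

  hadoop fs -D dfs.block.size=268435456 -put bigfile.dat /user/biadmin/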

Tuning Map/Reduce

 Additional Hadoop parameters (defaults in mapred-site.xml) that are often overridden for individual Hadoop jobs:

Setting: mapred.reduce.slowstart.completed.maps

Default value: 0.5

Notes: Fraction of the maps in the job which should be complete before reduces are scheduled for the job. Starting reducers earlier can help some jobs finish faster, but the reducers tie up slots.

Tuning Map/Reduce (General Settings)

 Rack awareness

Setting: topology.script.file.name

Default value: no script

Notes: Add to core-site.xml. Use to make Hadoop rack-aware.

Enable rack awareness for large clusters with several racks, when all racks have roughly the same number of nodes.

Warning: It may hurt performance to enable rack awareness for small clusters with an uneven number of nodes in each rack. Racks with fewer nodes tend to be busier.
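A minimal sketch of such a script (the rack assignments are made up; real scripts typically consult a lookup table):

  #!/bin/bash
  # Hadoop passes one or more hostnames/IPs as arguments;
  # print one rack path per argument, in order.
  for host in "$@"; do
    case $host in
      192.168.1.*) echo /rack1 ;;
      192.168.2.*) echo /rack2 ;;
      *)           echo /default-rack ;;
    esac
  done

Point topology.script.file.name at the script in core-site.xml and restart the NameNode and JobTracker so the mapping takes effect.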

Tuning Adaptive M/R – Number of slots

 The number of slots on each compute node has a huge impact on performance

– Cluster is under-utilized if too few slots

– Cluster can become over-committed if too many slots

– The number of slots needs to be configured based on:

• Compute node hardware (CPU, memory, number of disks, etc.)

• Workload characteristics (memory or CPU or storage intensive)

 Default number of slots:

– Adaptive M/R: Equal to number of CPUs

• Single slot pool for map and reduce tasks

– Apache M/R: Separate pools for map and reduce slots with default ratio of 2 map slots for every reduce slot

Tuning Adaptive M/R – Number of slots

 How to change the number of slots (Adaptive M/R)

– Command line

• Example: Configure 16 slots per compute node

 Edit $BIGINSIGHTS_HOME/hdm/components/HAManager/conf/ResourceGroups.xml and change two lines:

<ResourceGroup ResourceGroupName="ManagementHosts" availableSlots="16" >

<ResourceGroup ResourceGroupName="ComputeHosts" availableSlots="16" >

 Run $BIGINSIGHTS_HOME/bin/syncconf.sh HAManager

 Check that changes appear in $EGO_CONFDIR/ResourceGroups.xml

• Web UI (not officially externalized in BigInsights)

 http://<master node>:18080/Platform

 Bring up the Map/Reduce UI

 Resources -> Resource Groups -> ComputeHosts

Tuning Apache M/R – Number of slots

 How to change the number of slots (Apache M/R)

– Edit $BIGINSIGHTS_HOME/hdm/hadoop-conf-staging/mapred-site.xml

– Change mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum

– Run $BIGINSIGHTS_HOME/bin/syncconf.sh hadoop

Adaptive M/R: Tuning Map and Reduce tasks JVM size

 Adaptive M/R honors property “mapred.child.java.opts”

– Default in mapred-site.xml

• Out-of-box, mapred.child.java.opts = -Xmx1000m

– Hadoop jobs often override “mapred.child.java.opts” value

Tuning Adaptive M/R

 Since there is no distinction between map slots and reduce slots in Adaptive M/R, special consideration should be taken

– Never oversubscribe your system memory (number of slots * memory per slot should stay below physical memory)

– mapred.reduce.slowstart.completed.maps

 io.sort.mb (100) / io.sort.factor (10)

– Increasing io.sort.mb (the buffer size for map-side sorting) should decrease map-side I/O when map output is large – recommend 256 as a default, or higher depending on block size

– Increasing io.sort.factor should decrease the number of merge passes over spilled data, speeding up sort/shuffle – the penalty is extra GC – recommend 64 as a default

 Compression (mapred.map.output.compression.codec, mapred.compress.map.output)

– Enabling compression of map output decreases disk and network activity

– On most workloads, when CPU is available, compression will improve overall performance – in some environments, dramatically

– Out of the box, com.ibm.biginsights.compress.CmxCodec provides a good compression ratio without too much CPU overhead

Tuning Adaptive M/R

 mapred.reduce.input.buffer.percent (default 0.00)

– Sets a percentage of the reduce task heap that may be used to retain map segments in memory

– With the default, all in-memory segments are persisted to disk and then read back when calling the user reduce function

– Increasing this to 0.96 would allow reducers to use up to 96% of available memory to keep segments in memory while calling the reduce function – consider this when map output is large, as in Terasort

– Conversely, TPC-H workloads benefit little because their map outputs are smaller

 mapred.job.shuffle.merge.percent

– Usage threshold, as a fraction of the shuffle buffer, at which an in-memory merge of map outputs is initiated

 mapred.job.shuffle.input.buffer.percent

– Fraction of the maximum heap allocated to storing map outputs during the shuffle
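A sketch of how these might be set for a workload with large map output such as Terasort (0.96 follows the discussion above; the other two values are believed to be the stock defaults, shown only as a starting point):

  <property><name>mapred.reduce.input.buffer.percent</name><value>0.96</value></property>
  <property><name>mapred.job.shuffle.input.buffer.percent</name><value>0.70</value></property>
  <property><name>mapred.job.shuffle.merge.percent</name><value>0.66</value></property>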

 mapred.local.dir / dfs.data.dir

– the more the merrier!!

– On dedicated cluster, using all disks (X) on a compute host should yield better results than using X/2 disks for mapred.local.dir and the other X/2 for dfs.data.dir
