RHEL6 Tuning Guide for Mellanox Ethernet Cards
IBM X3650 M4 Performance Tuning
October 2012, UniOne INC Co., Ltd. (유니원아이앤씨주식회사)
© 2012 UniOne INC Co., Ltd. All Rights Reserved.

1.1 X3650 BIOS Settings (Processor)

  Category    Item                               Setting
  General     Power Profile / Operating Modes    Max Performance
  Processor   C-States                           Disabled
  Processor   Turbo Mode                         Enabled / Performance Optimized
  Processor   Hyper-Threading                    Disabled
  Processor   CPU Frequency Select               Max Performance

C-states limit: the C-state value controls how much power the processor saves.
At C0 the core draws the most power and stays continuously in the working
state. The higher the C-state, the lower the power consumption; the core
enters a sleep state and needs correspondingly more time to return to the
working state. With C-states disabled, power consumption is not throttled at
all.

1.2 X3650 BIOS Settings (Memory)

  Category    Item                        Setting
  Memory      Memory Speed                Max Performance
  Memory      Memory Channel Mode         Independent
  Memory      Socket Interleaving         NUMA / Disabled
  Memory      Memory Node Interleaving    OFF
  Memory      Patrol Scrubbing            Disabled
  Memory      Demand Scrubbing            Enabled
  Memory      Thermal Mode                Performance

1.3 RHEL6 OS Tuning (Networking)

All of the following items belong to the Network category; run each command as
root.

  Disable TCP timestamps:
    sysctl -w net.ipv4.tcp_timestamps=0

  Disable TCP selective acks:
    sysctl -w net.ipv4.tcp_sack=0

  Increase the length of the processor input queue:
    sysctl -w net.core.netdev_max_backlog=250000

  Increase the TCP buffer sizes used by setsockopt():
    sysctl -w net.core.rmem_max=16777216
    sysctl -w net.core.wmem_max=16777216
    sysctl -w net.core.rmem_default=16777216
    sysctl -w net.core.wmem_default=16777216
    sysctl -w net.core.optmem_max=16777216

  Increase memory thresholds to prevent packet dropping:
    sysctl -w net.ipv4.tcp_mem="16777216 16777216 16777216"

  Enable auto-tuning of TCP buffer limits:
    sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
    sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

  Enable low latency mode for TCP:
    sysctl -w net.ipv4.tcp_low_latency=1

To keep the changed values after a reboot, add the corresponding entries to
/etc/sysctl.conf, for example:

  net.ipv4.tcp_timestamps = 0    # add the corresponding entry

(A consolidated /etc/sysctl.conf fragment is sketched after section 1.4.)

1.4 RHEL6 OS Tuning (Power Management)

Check that the CPU frequency reported for each core equals the maximum
supported frequency and that all core frequencies are consistent.

Check the maximum supported CPU frequency:

  # cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_max_freq

Check that the core frequencies are consistent:

  # cat /proc/cpuinfo | grep "cpu MHz"

The reported frequencies should all equal the maximum supported value. If the
CPU frequency is not at the maximum, review the BIOS settings against the
tables in sections 1.1 and 1.2.

Check the current CPU frequency to verify that the performance setting has
taken effect:

  # cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq

In the OS, compare the current CPU frequency with the maximum value; if the
current value differs from the maximum, change the CPU frequency setting in
the BIOS.
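To run the comparison above across every core in one pass, a short loop over
the cpufreq sysfs files can be used. This is a minimal sketch, assuming the
standard sysfs layout shown above; note that cpuinfo_cur_freq is normally
readable only by root.

  #!/bin/bash
  # Sketch: report any core whose current frequency differs from its maximum.
  for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
      [ -d "$cpu/cpufreq" ] || continue           # skip cores without cpufreq
      max=$(cat "$cpu/cpufreq/cpuinfo_max_freq")
      cur=$(cat "$cpu/cpufreq/cpuinfo_cur_freq")  # usually requires root
      if [ "$cur" -ne "$max" ]; then
          echo "$(basename "$cpu"): current ${cur} kHz, max ${max} kHz"
      fi
  done

If the loop prints nothing, every core is already running at its maximum
supported frequency.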
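Returning to the networking parameters of section 1.3: to make them persistent
across reboots, the same values can be collected in /etc/sysctl.conf. The
fragment below is a sketch that simply restates the values from the section
1.3 list; adjust them to your own workload before applying.

  # /etc/sysctl.conf fragment (values from section 1.3)
  net.ipv4.tcp_timestamps = 0
  net.ipv4.tcp_sack = 0
  net.core.netdev_max_backlog = 250000
  net.core.rmem_max = 16777216
  net.core.wmem_max = 16777216
  net.core.rmem_default = 16777216
  net.core.wmem_default = 16777216
  net.core.optmem_max = 16777216
  net.ipv4.tcp_mem = 16777216 16777216 16777216
  net.ipv4.tcp_rmem = 4096 87380 16777216
  net.ipv4.tcp_wmem = 4096 65536 16777216
  net.ipv4.tcp_low_latency = 1

The entries take effect at the next boot, or immediately after running
sysctl -p.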
1.5 RHEL6 OS Tuning (Kernel Idle Loop Tuning)

The mlx4_en kernel module has an optional parameter that can tune the kernel
idle loop for better latency. This improves the CPU wakeup time but may result
in higher power consumption. To tune the kernel idle loop, set the following
option in /etc/modprobe.d/mlx4.conf:

  options mlx4_en enable_sys_tune=1

1.6 RHEL6 OS Tuning (OS Controlled Power Management)

Some operating systems can override the BIOS power management configuration
and enable C-states by default, which results in higher latency. When the OS
power management settings take precedence over the BIOS settings in this way,
resolve the higher latency as follows:

1. Edit /boot/grub/grub.conf (or the configuration file of whatever bootloader
   is in use).
2. Add the following kernel parameters to the bootloader command line:
   intel_idle.max_cstate=0 processor.max_cstate=1
3. Reboot the system.

Example grub.conf entry:

  title RH6.2x64
    root (hd0,0)
    kernel /vmlinuz-RH6.2x64-2.6.32-220.el6.x86_64 root=UUID=817c207b-c0e8-4ed9-9c33-c589c0bb566f console=tty0 console=ttyS0,115200n8 rhgb intel_idle.max_cstate=0 processor.max_cstate=1

1.7 RHEL6 OS Tuning (Interrupt Moderation)

Interrupt moderation is used to decrease the frequency of network adapter
interrupts to the CPU. Mellanox network adapters use an adaptive interrupt
moderation algorithm by default. The algorithm checks the transmission (Tx)
and receive (Rx) packet rates and modifies the Rx interrupt moderation
settings accordingly.

To set Tx and/or Rx interrupt moderation manually, use the ethtool utility.
For example, the following commands first show the current (default) interrupt
moderation settings on interface eth1, then turn off Rx interrupt moderation,
and finally show the new settings.

  # ethtool -c eth1
  Coalesce parameters for eth1:
  Adaptive RX: on  TX: off
  ...
  pkt-rate-low: 400000
  pkt-rate-high: 450000
  rx-usecs: 16
  rx-frames: 88
  rx-usecs-irq: 0
  rx-frames-irq: 0
  ...

  # ethtool -C eth1 adaptive-rx off rx-usecs 0 rx-frames 0

  # ethtool -c eth1
  Coalesce parameters for eth1:
  Adaptive RX: off  TX: off
  ...
  pkt-rate-low: 400000
  pkt-rate-high: 450000
  rx-usecs: 0
  rx-frames: 0
  rx-usecs-irq: 0
  rx-frames-irq: 0
  ...

1.8 RHEL6 OS Tuning (Tuning for NUMA Architecture)

Tuning for the Intel® microarchitecture code-named Sandy Bridge: the Intel
Sandy Bridge processor has an integrated PCI Express controller, so every PCIe
adapter is connected directly to a NUMA node. On a system with more than one
NUMA node, performance is better when the adapter is driven from the local
NUMA node to which it is connected.

To identify which NUMA node is the adapter's node, the system BIOS must
support the proper ACPI feature. To see whether your system supports detection
of a PCIe adapter's NUMA node, run one of the following commands:

  # cat /sys/devices/[PCI root]/[PCIe function]/numa_node
or
  # cat /sys/class/net/[interface]/device/numa_node

Example on a supported system:

  # cat /sys/devices/pci0000\:00/0000\:00\:05.0/numa_node
  0

Example on an unsupported system:

  # cat /sys/devices/pci0000\:00/0000\:00\:05.0/numa_node
  -1
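To survey every network interface at once instead of querying one PCI device
at a time, the same numa_node attribute can be read through /sys/class/net.
The loop below is a minimal sketch; it only reads the files already shown
above and prints -1 wherever the BIOS does not expose the NUMA information.

  #!/bin/bash
  # Sketch: print the NUMA node reported for each network interface.
  for dev in /sys/class/net/*; do
      node_file="$dev/device/numa_node"
      [ -r "$node_file" ] || continue   # skip virtual interfaces (lo, bonds, ...)
      echo "$(basename "$dev"): NUMA node $(cat "$node_file")"
  done

An interface reporting node -1 corresponds to the unsupported case above; on a
multi-node system the reported node is the one from which the adapter should
be driven.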
1.9 RHEL6 OS Tuning (IRQ Affinity)

The affinity of an interrupt is defined as the set of processor cores that
service that interrupt. To improve application scalability and latency, it is
recommended to distribute interrupt requests (IRQs) across the available
processor cores. To prevent the Linux IRQ balancer from interfering with the
interrupt affinity scheme, the IRQ balancer must be turned off.

The following command turns off the IRQ balancer:

  > /etc/init.d/irqbalance stop

The following command assigns the affinity of a single interrupt vector:

  > echo <hexadecimal bit mask> > /proc/irq/<irq vector>/smp_affinity

where bit i in <hexadecimal bit mask> indicates whether processor core i is in
<irq vector>'s affinity or not.

1.9.1 RHEL6 OS Tuning (IRQ Affinity Configuration)

For Intel Sandy Bridge systems, set the IRQ affinity to the adapter's NUMA
node:

To optimize single-port traffic, run:

  # set_irq_affinity_bynode.sh <numa node> <interface>

To optimize dual-port traffic, run:

  # set_irq_affinity_bynode.sh <numa node> <interface1> <interface2>

To show the current IRQ affinity settings, run:

  # show_irq_affinity.sh <interface>

These scripts can be downloaded from the Mellanox web site (www.mellanox.com).
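The set_irq_affinity_bynode.sh and show_irq_affinity.sh scripts above are the
recommended way to configure this. Purely as an illustration of the mechanism
described in section 1.9 (this is not the Mellanox script), the sketch below
stops irqbalance and writes a CPU mask into smp_affinity for every interrupt
vector belonging to one interface. The interface name eth1 and the single-core
mask 1 are placeholders, and the vector names in /proc/interrupts depend on
the driver version.

  #!/bin/bash
  # Illustration only: pin every IRQ vector of eth1 to core 0 (mask 0x1).
  # The Mellanox scripts instead spread the vectors across the cores of the
  # adapter's local NUMA node; only the smp_affinity mechanism is shown here.
  IFACE=eth1   # placeholder interface name
  MASK=1       # hexadecimal bit mask; bit i selects core i

  /etc/init.d/irqbalance stop

  for irq in $(grep "$IFACE" /proc/interrupts | awk -F: '{print $1}'); do
      echo "$MASK" > "/proc/irq/$irq/smp_affinity"
  done

The resulting masks can be confirmed with show_irq_affinity.sh <interface> or
by reading /proc/irq/<irq vector>/smp_affinity directly.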