
NetXtreme-E-UG600

NetXtreme-E®
Tuning Guide for AMD® EPYC™ 7002 Series on Linux
User Guide
Broadcom Confidential
NetXtreme-E-UG600
October 1, 2020
Broadcom, the pulse logo, Connecting everything, Avago Technologies, Avago, and the A logo are among the trademarks
of Broadcom and/or its affiliates in the United States, certain other countries, and/or the EU. AMD and EPYC are trademarks
of AMD.
Copyright © 2020 Broadcom. All Rights Reserved.
The term “Broadcom” refers to Broadcom Inc. and/or its subsidiaries. For more information, please visit www.broadcom.com.
Broadcom reserves the right to make changes without further notice to any products or data herein to improve reliability,
function, or design. Information furnished by Broadcom is believed to be accurate and reliable. However, Broadcom does
not assume any liability arising out of the application or use of this information, nor the application or use of any product or
circuit described herein, neither does it convey any license under its patent rights nor the rights of others.
Table of Contents
1 Introduction
2 AMD EPYC 7002 Series Microarchitecture
    2.1 Core Cache Dies (CCD) and Core Cache Complexes (CCX)
    2.2 NUMA and NUMA Per Socket (NPS)
        2.2.1 NPS=1
        2.2.2 NPS=2
        2.2.3 NPS=4
    2.3 Memory Optimizations
        2.3.1 Platforms Specifically Designed for AMD EPYC 7002
3 BIOS Tuning
    3.1 NPS (NUMA Per Socket)
    3.2 X2APIC
    3.3 Determinism Control and Determinism Slider
    3.4 APBDIS
    3.5 Preferred I/O and Enhanced Preferred I/O
    3.6 PCIe Ten Bit Tag
    3.7 Memory Clock Speed
    3.8 L3 LLC (Last Level Cache) as NUMA
    3.9 Socket/Inter-Chip Global Memory Interconnect (xGMI)
4 TCP Performance Tuning
    4.1 BIOS Tuning
    4.2 NIC Tuning
        4.2.1 NUMA: Local vs. Non Local
        4.2.2 Configuring Queues
        4.2.3 Configure IRQ and Application Affinity
        4.2.4 TX and RX Flow Steering
        4.2.5 TX/RX Queue Size
        4.2.6 Interrupt Moderation
        4.2.7 GRO (Generic Receive Offload)
        4.2.8 TX-NoCache-Copy
        4.2.9 Relaxed Ordering
        4.2.10 PCIe MRRS (Maximum Read Request Size)
    4.3 OS Tuning (Linux)
        4.3.1 IOMMU
        4.3.2 Performance Governor
        4.3.3 TCP Memory Configuration
        4.3.4 nohz=off
        4.3.5 TCP Example with the BCM957508-P2100G
5 DPDK Tunings
    5.1 BIOS Tuning
    5.2 Kernel Tuning
    5.3 PCIe Configuration
    5.4 DPDK Configuration
    5.5 DPDK Results
6 IP Forwarding Tunings
    6.1 BIOS Tuning
    6.2 Kernel Tuning
    6.3 NIC Tuning
    6.4 IP Forwarding Results
7 RDMA Tunings
    7.1 BIOS Tuning
    7.2 OS Configuration
    7.3 RDMA Results
8 Debugging Performance Issues
    8.1 Identifying CPU Bottleneck
    8.2 Checking the PCIe Gen4 Speed and Width
9 References
Revision History
    NetXtreme-E-UG600; October 1, 2020
1 Introduction
This document provides detailed steps to configure the Broadcom NetXtreme-E 100G/200G adapters for optimal
performance on AMD EPYC PCIe Gen4 architecture.
There is no single configuration that works optimally for all scenarios, and custom tuning may be required to achieve the desired results for specific workloads. This tuning guide broadly covers each tunable item and provides a specific example for common workloads.
2 AMD EPYC 7002 Series Microarchitecture
2.1 Core Cache Dies (CCD) and Core Cache Complexes (CCX)
The AMD EPYC 7002 series processor has a unique CPU microarchitecture, and the tuning steps in this document leverage this architecture.
The processor is designed as a multi-chip package with up to nine CPU dies: a central 14 nm I/O die surrounded by up to eight 7 nm core chiplets, also called Core Cache Dies (CCDs). Each CCD contains two Core Cache Complexes (CCXs), each with up to four cores and 16 MB of L3 cache.
The I/O die provides the I/O and memory functions as well as the Infinity Fabric connection between sockets. Up to eight memory channels are supported per socket.
Figure 1: 2nd Generation EPYC
Another important aspect to consider in dual-socket systems is interconnect distance. A CPU might be accessing memory connected to the local memory controller (local memory) or memory from across the socket (non-local memory). The following command provides details about the NUMA distances:
numactl -H
For the best performance and latency, use local access only, as memory access across the socket incurs a significant performance penalty.
2.2 NUMA and NUMA Per Socket (NPS)
The AMD EPYC 7002 Series Processor uses the NUMA (Non Uniform Memory Access) architecture. The architecture is
divided into four logical quadrants per socket that can be further partitioned into different NUMA domains called NPS (NUMA
Per Socket).
2.2.1 NPS=1
The processor is a single NUMA domain. For example, all the cores on the processor, memory and PCIe devices connected
to it are in one NUMA domain. Memory is interleaved across eight local memory channels so this mode offers the maximum
memory bandwidth.
Figure 2: EPYC 7002 Series Processors
NOTE:
Dual-socket machine in NPS=1 configuration.
2.2.2 NPS=2
NPS2 divides the processor into two NUMA domains. Half of the cores and half of the memory channels connected to the processor are grouped into each NUMA domain. Memory is interleaved across the four memory channels in each NUMA domain.
Figure 3: EPYC 7002 Series Processors
NOTE:
Dual-socket machine in NPS=2 configuration.
2.2.3 NPS=4
NPS4 partitions the CPU into four NUMA domains. As shown in Figure 4, each quadrant is a NUMA domain and memory is interleaved across the two memory channels in that quadrant. PCIe devices are local to one of the four NUMA domains on the socket, depending on which quadrant of the I/O die has the PCIe root for that device. In other words, only two memory channels are local to a PCIe device, which may not offer enough memory bandwidth for 200 Gb/s.
Figure 4: EPYC 7002 Series Processors
NOTE:
Certain EPYC 7002 series SKUs may not support NPS4 (specifically, SKUs manufactured with six CCD dies), and lower-stack SKUs manufactured with four CCD dies are not recommended for 200 Gb/s.
NOTE:
In many cases, it is advisable to use the cores local to your PCIe device to get the optimal throughput and latency. Going across CCDs and across NUMA domains can result in performance degradation in some cases.
2.3 Memory Optimizations
For optimal performance, populate 8 DIMMs for 1 DPC (DIMMs per Channel) configuration, or 16 DIMMs for 2 DPC (DIMMs
per Channel) configuration, per processor. Other configurations, such as 12 DIMMs per processor, do not provide optimal
performance. 1 DPC configuration runs the memory DIMMs at a higher speed when compared to 2 DPC. OEM servers
supporting AMD EPYC 7002 Series processors are built to either support previous generations of AMD EPYC (EPYC 7001
series) or are specifically designed for AMD EPYC 7002. Contact your OEM to determine the characteristics of your servers.
2.3.1 Platforms Specifically Designed for AMD EPYC 7002
These platforms are specifically designed for AMD EPYC 7002 and are not compatible with AMD EPYC 7001 processors.
• The maximum memory bus frequency supported on these platforms is 3200 MT/s.
• For throughput-sensitive applications, to obtain higher I/O throughput, the maximum memory bus frequency can be set to the maximum allowed (3200 MT/s), provided the memory DIMM hardware supports it. However, the Infinity Fabric clock on these platforms does not optimally synchronize with a memory bus frequency of 3200 MT/s, which can lead to higher latency.
• For latency-sensitive applications, better performance is obtained by setting the maximum memory bus frequency down to 2933 MT/s, 2667 MT/s, or 2400 MT/s, since these frequencies synchronize with the Infinity Fabric clock. The best latency is observed by setting the maximum memory bus frequency to 2933 MT/s, synchronized with the Infinity Fabric clock frequency, in NPS=4 mode, although the total memory bandwidth is limited.
3 BIOS Tuning
This section covers the various BIOS configuration options to tune the system for optimal performance. The BIOS screens in this section are for reference only and were captured on an AMD EPYC reference platform. Find the equivalent settings in the target system's BIOS console.
3.1 NPS (NUMA Per Socket)
NOTE:
NPS=1 is recommended for 200 Gb/s and above. NPS=4 is recommended for up to 100 Gb/s, as it provides better CPU and memory locality.
To access the NUMA nodes per socket setting (see Figure 5), select Advanced → AMD CBS → DF Common Options → Memory Addressing → NUMA Nodes Per Socket → NPS1
Figure 5: NUMA Nodes Per Socket Settings
3.2 X2APIC
Set X2APIC = Enabled to allow the OS to work with 256 threads and improve performance over legacy APIC.
NOTE:
Disable SMT if you are running an OS that does not support X2APIC on a dual-socket system with 64-core processors.
To access the Local APIC Mode setting (see Figure 6), select Advanced → AMD CBS → CPU Common Options → Local APIC Mode → X2APIC
Figure 6: Local APIC Mode Settings
3.3 Determinism Control and Determinism Slider
Set Determinism Control to Manual and the Determinism Slider to Performance (see Figure 7) to ensure consistent performance across a fleet of similarly configured systems.
1. To access the Determinism Control setting, select Advanced → AMD CBS → NBIO Common Options → SMU
Common Options → Determinism Control → Manual
2. To access the Determinism Slider setting, select Advanced → AMD CBS → NBIO Common Options → SMU
Common Options → Determinism Slider → Performance
Figure 7: Determinism Control/Determinism Slider Settings
3.4 APBDIS
Set APBDIS=1 to disable Algorithmic Performance Boost (APB) (see Figure 8). This disables P-state switching in the Infinity Fabric (CPU P-states remain unaffected) and forces the fabric into the P0 state, which is the highest-performing Infinity Fabric P-state. The APBDIS values are as follows:
• 0: APBDIS disabled – APB is active, and the Infinity Fabric clock can switch between boosted and non-boosted speeds.
• 1: APBDIS enabled – APB is disabled, and the Infinity Fabric clock is locked to the boosted (P0) speed.
• Auto (default setting) – Use the default value for APBDIS. The default value is 0.
1. To access the APBDIS setting, select Advanced → AMD CBS → NBIO Common Options → SMU Common Options
→ APBDIS → 1
2. To access the Fixed SOC Pstate setting, select Advanced → AMD CBS → NBIO Common Options → SMU Common
Options → Fixed SOC Pstate → P0
Figure 8: APBDIS Settings
3.5 Preferred I/O and Enhanced Preferred I/O
Preferred I/O (see Figure 9) is a new capability in the EPYC 7002 series BIOS that prioritizes traffic from the selected I/O device and facilitates the ordering of PCIe packets, which reduces overhead and results in better adapter performance.
Enhanced Preferred I/O (see Figure 9) further ensures that the same configured I/O device remains at the highest performance by keeping its clocks at the maximum frequency.
1. To access the Preferred I/O setting, select Advanced → AMD CBS → NBIO Common Options → Preferred I/O → Manual
2. To access the Preferred I/O Bus setting select Advanced → AMD CBS → NBIO Common Options → Preferred I/O
Bus → [PCIe Bus Number]
3. To access the Enhanced Preferred I/O setting, select Advanced → AMD CBS → NBIO Common Options →
Enhanced Preferred I/O Mode → Auto/Enable P0
Figure 9: Preferred I/O and Enhanced Preferred I/O Settings
3.6 PCIe Ten Bit Tag
Enable the PCIe Ten Bit Tag (see Figure 10) to increase the number of outstanding non-posted requests from 256 to 768 for better performance. As latency increases, more unique tags are required to maintain peak performance at 16 GT/s.
To access the PCIe Ten Bit Tag setting, select Advanced → AMD CBS → NBIO Common Options → PCIe Ten Bit Tag → Enable
Figure 10: PCIe Ten Bit Tag Settings
3.7 Memory Clock Speed
Set the Memory Clock Speed (see Figure 11) to match the maximum fabric clock speed supported by the installed EPYC 7002 series processor, which is either 1467 MHz or 1333 MHz (the double data rate is twice this clock – for example, MCLK = 1467 means a 2933 MT/s data rate).
NOTE:
A platform may be capable of supporting higher-speed memory (for example, a 1600 MHz memory clock), and while this may increase the overall platform memory bandwidth, the average memory latency is higher.
1. To access the Overclock setting, select Advanced → AMD CBS → UMC Common Options → DDR4 Common
Options → DRAM Timing Configuration → Accept → Overclock → Enabled
2. To access the Memory Clock setting, select Advanced → AMD CBS → UMC Common Options → DDR4 Common
Options → DRAM Timing Configuration → Accept → Memory Clock Speed → 1467MHz
Figure 11: Memory Clock Speed Settings
3.8 L3 LLC (Last Level Cache) as NUMA
Enable L3 as NUMA to create NUMA nodes equal to the number of L3 caches (CCXs). This helps the operating system scheduler maintain locality to the LLC without causing unnecessary cache-to-cache transactions, and improves performance.
NOTE:
Currently this is a benchmarking feature meant for isolating L3 caches and is not recommended for production
deployments.
To access the ACPI settings (see Figure 12), select Advanced → AMD CBS → DF Common Options → ACPI → ACPI
SRAT L3 cache As NUMA Domain → Enabled
Figure 12: ACPI SRAT L3 cache As NUMA Domain Setting
3.9 Socket/Inter-Chip Global Memory Interconnect (xGMI)
xGMI Dynamic Link Width Management (see Figure 13) saves power during periods of low socket-to-socket data traffic by reducing the number of active xGMI lanes per link from 16 to 8. However, in certain scenarios involving low-bandwidth but latency-sensitive traffic, the transition from low-power to full-power xGMI can adversely impact latency. Setting xGMI Link Width Control to Manual and specifying a maximum link width of 16 forces the xGMI interface into full-power mode, eliminating latency jitter.
NOTE:
The Socket/Inter-Chip Global Memory Interconnect option only applies to a 2P (dual-socket) system.
1. To access the xGMI Link Width Control setting, select Advanced → AMD CBS → SMU Common Options → xGMI
Link Width Control→ Manual
2. To access the xGMI Force Link Width setting, select Advanced → AMD CBS → SMU Common Options → xGMI
Force Link Width → 2
3. To access the xGMI Force Link Width Control setting, select Advanced → AMD CBS → SMU Common Options →
xGMI Force Link Width Control → Force
4. To access the xGMI Max Link Width Control setting, select Advanced → AMD CBS → SMU Common Options →
xGMI Max Link Width Control → Auto
Figure 13: Socket/Inter-Chip Global Memory Interconnect (xGMI) Settings
Applications that are known to be insensitive to both socket-to-socket bandwidth and latency can set a fixed link width of
eight to save power, which can divert more power to the cores for boost.
4 TCP Performance Tuning
4.1 BIOS Tuning
Configure all BIOS settings as shown in BIOS Tuning.
4.2 NIC Tuning
Contact the OEM to get the latest driver, firmware, and tools and follow the installation instructions.
4.2.1 NUMA: Local vs. Non Local
Non-Uniform Memory Access (NUMA) is a memory architecture in which each CPU is connected to its own local memory. CPUs in the local NUMA node have faster access to the local memory (shorter distance); accessing memory on a remote NUMA node is possible, but slower.
• Local NUMA node: cat /sys/class/net/[interface]/device/numa_node
• Local CPUs: cat /sys/class/net/[interface]/device/local_cpulist
Alternatively, install hwloc and use the output of lstopo to find the local NUMA nodes and cpulist:
yum install hwloc hwloc-gui
lstopo --logical --output-format png > cpu.png
NOTE:
Single Socket NPS=1, all cores appear local.
NOTE:
Single Socket NPS=4, only a single node is local.
4.2.2 Configuring Queues
The Broadcom NIC supports both combined and separate queue configurations using ethtool. For most cases, the use of combined queues is recommended. With combined queues, the TX and RX queues share a single IRQ.
ethtool -L [interface] combined 8 tx 0 rx 0
For more granular control of packet processing, at the cost of more resources, separate queues can be used. In this configuration, the TX and RX queues are separate and have independent IRQs, giving finer control over the IRQ affinities.
ethtool -L [interface] combined 0 tx 4 rx 4
In either case, it is recommended to have no more than a single IRQ per physical core. It may be beneficial to only allocate
a single IRQ per local core in a CCD.
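The resulting channel configuration can be confirmed with ethtool, which reports the current and maximum queue counts; a quick check, assuming the interface is named enp65s0f0 (substitute your interface name):
ethtool -l enp65s0f0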
4.2.3 Configure IRQ and Application Affinity
IRQ affinity refers to the binding of interrupts from a specific device to one or multiple logical processors. Distributing the IRQs across different local logical cores improves performance due to better CPU utilization.
Use the following steps for IRQ affinity configuration:
1. Disable irqbalance (to prevent the service from dynamically moving your IRQs) using the following commands:
systemctl stop irqbalance
systemctl disable irqbalance (to keep the setting persistent across reboots)
2. Identify local CPUs using the following command:
cat /sys/class/net/[interface]/device/local_cpulist
3. Identify IRQ numbers using the following command:
cat /proc/interrupts | grep [interface] | awk -F ":" '{print $1}'
4. Pin each of the interrupts to a different local NUMA CPU using the following command:
echo [cpu_core] > /proc/irq/[irq number]/smp_affinity_list
NOTE:
It is preferred to use the same CPUs for application affinity, which allows cache locality between interrupts and application threads and reduces processing overhead. The taskset and numactl tools or application-specific options (for example, netperf with -T) can be used for configuring application locality:
taskset -c [cpu_core list] application
or
numactl -C [cpu_cores list] application
or
application-specific options, for example:
If using netperf there is a -T option to handle both server and client application affinity.
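The IRQ-pinning steps above can also be scripted. The following is a minimal sketch that pins each IRQ of an interface to one local CPU in a round-robin fashion; the interface name (enp65s0f0) is an assumption, and the CPU-list expansion should be verified against the local_cpulist format on your system:
#!/bin/bash
# Minimal sketch: pin each IRQ of $IFACE to a local CPU, round-robin.
IFACE=enp65s0f0

systemctl stop irqbalance

# Expand the local CPU list (for example "0-15,128-143") into individual CPU numbers.
CPUS=($(cat /sys/class/net/$IFACE/device/local_cpulist | tr ',' '\n' | \
    awk -F- '{ if (NF == 2) { for (c = $1; c <= $2; c++) print c } else print $1 }'))

# Collect the IRQ numbers registered for this interface.
IRQS=($(grep "$IFACE" /proc/interrupts | awk -F: '{print $1}' | tr -d ' '))

i=0
for irq in "${IRQS[@]}"; do
    cpu=${CPUS[$((i % ${#CPUS[@]}))]}
    echo "$cpu" > /proc/irq/$irq/smp_affinity_list
    i=$((i + 1))
done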
4.2.4 TX and RX Flow Steering
For the best performance, it is important to distribute traffic across multiple software queues. This spreads the processing load and allows multiple CPU cores to handle it. There are several ways to distribute traffic across multiple software queues.
Enabled by default, RSS (Receive Side Scaling) provides a good mechanism for RX load distribution, as it hashes different streams to separate RX queues to spread the load evenly. However, it does not consider application locality. For example, a flow could hash to queue 0, which is processed on core 0, while the application consuming that data runs on core 64; this does not benefit from any locality. (If the processing is highly CPU intensive, there can still be a benefit to having the application and IRQ on different processors.) RFS (Receive Flow Steering) overcomes this shortcoming by steering packets to the CPU cores where the application thread is running, thus increasing the data cache hit rate. Furthermore, Broadcom NICs support this steering in hardware as aRFS (Accelerated RFS).
The TX flow steering can be achieved by configuring XPS (Transmit Packet Steering) which guarantees that TX Interrupts
are generated on the same core running the application.
Configure XPS by setting the appropriate local CPU mask for every TX queue as shown in the following steps:
1. Configure XPS for each TX queue using the following command:
echo [cpumask] > /sys/class/net/$ifname/queues/tx-$i/xps_cpus
NOTE:
Traffic generated by applications running on the CPUs in this CPU mask is transmitted on the ring tx-$i specified here.
To configure aRFS, use the following steps:
1. Enable ntuple filtering (required for aRFS) using the following command:
ethtool -K [interface] ntuple on
2. Configure RFS for each RX queue using the following commands:
echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
rps_flow_value = 32768 / (number of rings)
echo [rps_flow_value] > /sys/class/net/[interface]/queues/rx-$i/rps_flow_cnt
NOTE:
See the Linux kernel Scaling.txt for more details on both of these mechanisms.
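As a convenience, the XPS and RFS settings can be applied per queue in a loop. The following is a minimal sketch assuming 8 combined queues pinned to CPUs 0 through 7 on interface enp65s0f0; adjust the interface name, queue count, and CPU masks for your configuration:
#!/bin/bash
# Minimal sketch: configure XPS and RFS for all queues of $IFACE.
IFACE=enp65s0f0
NUM_QUEUES=8
FLOW_ENTRIES=32768

echo $FLOW_ENTRIES > /proc/sys/net/core/rps_sock_flow_entries
FLOW_CNT=$((FLOW_ENTRIES / NUM_QUEUES))

for ((i = 0; i < NUM_QUEUES; i++)); do
    # XPS: traffic generated on CPU $i is transmitted on queue tx-$i.
    printf '%x' $((1 << i)) > /sys/class/net/$IFACE/queues/tx-$i/xps_cpus
    # RFS: per-queue flow table size (rps_sock_flow_entries / number of rings).
    echo $FLOW_CNT > /sys/class/net/$IFACE/queues/rx-$i/rps_flow_cnt
done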
4.2.5 TX/RX Queue Size
Increasing the TX and RX queue sizes allows more data to be queued for transmit and receive and helps avoid packet drops during high data transfer.
Increase the TX and RX queue size to 2047 using the following command:
ethtool -G [interface] tx 2047 rx 2047
However, this is not suggested for all cases, as it also results in higher latency due to a bigger buildup in the software rings. There can be other side effects, such as poor cache utilization, if the ring size exceeds the cache size.
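The currently configured and maximum supported ring sizes can be verified before and after the change; a quick check, assuming the interface is named enp65s0f0:
ethtool -g enp65s0f0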
4.2.6 Interrupt Moderation
Interrupt moderation controls the rate of interrupts to the CPU during TX and RX. Too many interrupts (one interrupt per packet) increase CPU usage, adversely impacting throughput, while too few interrupts (coalesced by time or number of packets) increase latency.
Enabling adaptive-rx improves RX latency at low packet-receiving rates and improves throughput at high packet-receiving rates, providing a good performance balance.
ethtool -C [interface] adaptive-rx on rx-usecs 40 rx-frames 40
4.2.7 GRO (Generic Receive Offload)
GRO is an aggregation technique that coalesces several received packets from the same stream into a single large packet, saving CPU cycles as fewer packets need to be processed by the kernel. By default, GRO is performed in the Linux kernel; however, Broadcom NICs also support hardware GRO.
ethtool -K [interface] rx-gro-hw on lro off gro on
Broadcom NICs support this aggregation in hardware, and it can co-exist with software GRO.
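Which aggregation path is active can be confirmed by inspecting the offload feature flags; a quick check, assuming the interface is named enp65s0f0:
ethtool -k enp65s0f0 | grep -E "generic-receive-offload|rx-gro-hw|large-receive-offload"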
4.2.8 TX-NoCache-Copy
TX-NoCache-Copy helps reduce CPU cycles by bypassing the local cache and writing user-space data directly into memory. However, it must be enabled carefully when advanced features are in use. For example, it may cause a significant performance penalty if the data must be pulled back into the cache for VXLAN checksum calculation when packets are VXLAN-encapsulated but the hardware does not support VXLAN checksum offloads.
NOTE:
It is recommended to enable this feature carefully and keep it disabled by default.
To enable this feature, use the following command:
ethtool -K [interface] tx-nocache-copy on
4.2.9 Relaxed Ordering
Relaxed ordering allows PCIe transactions to complete out of order when possible while maintaining data consistency, and it improves performance in high-bandwidth cases. Relaxed ordering can be queried and enabled using the bnxtnvm tool:
To query the RO status, use the following command:
bnxtnvm -dev=[interface] getoption=pcie_relaxed_ordering
To enable RO, use the following command:
bnxtnvm -dev=[interface] setoption=pcie_relaxed_ordering enable
Ensure the following criteria are met before enabling RO:
• Memory accesses to the same location remain in order.
• All devices between the endpoint (NIC) and the root complex support relaxed ordering.
• No applications are in use that poll on the last byte address of the sink buffer to mark completion (for example, latency-sensitive RDMA applications).
It is recommended to enable this feature if all three of the conditions above are satisfied; this option significantly improves throughput on AMD EPYC 7002 for high-bandwidth scenarios.
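Whether relaxed ordering is enabled in the PCIe Device Control register of the NIC (and of any switches in the path) can be checked with lspci; a quick check, assuming the NIC is at PCIe address 41:00.0 as in the next section (RlxdOrd+ indicates the bit is set):
lspci -s 41:00.0 -vvv | grep -i RlxdOrd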
4.2.10 PCIe MRRS (Maximum Read Request Size)
This parameter specifies the maximum size of a memory read request. The MRRS can be used to enforce a more uniform
allocation of bandwidth by imposing a ceiling on the read requests. The MRRS can be queried and set dynamically using
the following commands:
To identify the PCIe bus for Broadcom NICs, use the following commands:
lspci | grep Broadcom
41:00.0 Ethernet controller: Broadcom Limited Device 1750
To query the current MRRS value, use the following commands:
lspci -s 0000:41:00.0 -vvv | grep MaxReadReq
MaxPayload 512 bytes, MaxReadReq 4096 bytes
To identify the MRRS size selector, use the following commands:
setpci -s 41:00.0 b4.w
5d57
The first digit (shown in the previous command example) is the MRRS size selector, and the number 5 represents the MRRS
value of 4096B. Other acceptable values are as follows:
0 -> 128B, 1 -> 256B, 2 -> 512B, 3 -> 1024B, 4 -> 2048B and 5 -> 4096B
To change MRRS from 4096B, use the following commands:
setpci -s 41:00.0 b4.w=3d57
lspci -s 0000:41:00.0 -vvv | grep MaxReadReq
MaxPayload 512 bytes, MaxReadReq 1024 bytes
NOTE:
Do not change the last three digits of the setpci value (d57 in the previous example); changing them may crash the system.
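To avoid accidentally altering the lower digits, the register can be read back and only the MRRS selector nibble rewritten. The following is a minimal sketch assuming the NIC is at 41:00.0 and that the MRRS selector occupies the top nibble of register 0xb4, as in the example above; verify the register layout on your adapter before using it:
#!/bin/bash
# Minimal sketch: change the MRRS selector while preserving the rest of the register.
BDF=41:00.0
MRRS_SEL=3    # 3 -> 1024B (see the value table above)

cur=$(setpci -s $BDF b4.w)                                    # e.g. 5d57
new=$(printf '%x' $(( (0x$cur & 0x0fff) | (MRRS_SEL << 12) )))
setpci -s $BDF b4.w=$new
lspci -s $BDF -vvv | grep MaxReadReq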
4.3 OS Tuning (Linux)
4.3.1 IOMMU
It is recommended to use the IOMMU in pass-through (pt) mode. This disables DMA remapping (DMAR) to memory and improves host performance. Enable it by adding iommu=pt to the kernel command line in the grub file, as shown in the following commands:
vi /etc/default/grub
GRUB_CMDLINE_LINUX="nofb splash=quiet console=tty0 ... iommu=pt"
grub2-mkconfig -o /boot/grub2/grub.cfg
Reboot the system and ensure that iommu=pt is in /proc/cmdline using the following command:
cat /proc/cmdline | grep -i iommu=pt
4.3.2 Performance Governor
The CPU frequency performance governor statically sets the CPU to the highest frequency within the bounds of scaling_min_freq and scaling_max_freq for the highest performance.
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
Check that the CPUs are running at highest frequency using the following command:
cat /proc/cpuinfo | grep -i mhz
4.3.3 TCP Memory Configuration
Increase the memory buffers for TCP sockets. This can improve performance for long-RTT connections by allowing more data in flight at a time; smaller buffers may not cover the BDP (bandwidth-delay product), resulting in gaps in transmission. The three values in each echo statement represent the minimum, default, and maximum buffer sizes for each TCP socket.
echo "4096 131072 268435456" > /proc/sys/net/ipv4/tcp_rmem
echo "4096 131072 63108864" > /proc/sys/net/ipv4/tcp_wmem
The rmem_max and wmem_max values are the maximum receive and send buffer sizes for socket memory. These buffers hold the data until it is read by the application or transmitted.
echo 268435456 > /proc/sys/net/core/rmem_max
echo 63108864 > /proc/sys/net/core/wmem_max
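To make these values persistent across reboots, the equivalent sysctl keys can be placed in a configuration file. The following is a minimal sketch using a hypothetical file name and the same values as the echo commands above:
cat > /etc/sysctl.d/99-tcp-tuning.conf << 'EOF'
net.core.rmem_max = 268435456
net.core.wmem_max = 63108864
net.ipv4.tcp_rmem = 4096 131072 268435456
net.ipv4.tcp_wmem = 4096 131072 63108864
EOF
sysctl -p /etc/sysctl.d/99-tcp-tuning.conf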
4.3.4 nohz=off
This is a boot-time kernel parameter that disables dyntick idle mode. With nohz=off, the kernel sends the timer tick periodically to all CPUs irrespective of their state and prevents a CPU from remaining idle for a long time, which results in increased power consumption.
NOTE:
This configuration must be tested extensively, as results may vary depending on the workload and applications. Refer to the kernel documentation for more detail.
vi /etc/default/grub
GRUB_CMDLINE_LINUX="nofb splash=quiet console=tty0 ... nohz=off"
grub2-mkconfig -o /boot/grub2/grub.cfg
4.3.5 TCP Example with the BCM957508-P2100G
This section provides the BCM957508-P2100G configuration used for running a bi-directional dual-port test with netperf using 2 queues and 2 sessions per port.
NOTE:
The commands below are for reference and, for brevity, may not be a complete set of commands.
BIOS Settings:
Configure NPS=1
Enable X2APIC
Set Determinism Slider to Performance
Configure APBDIS = 1
Configure Preferred IO and Enhanced Preferred IO
Enable PCIe Ten Bit Tag
Configure Memory Clock Speed
Enable L3 as NUMA
Configure xGMI
Adapter Settings:
ethtool -L enp65s0f0 combined 4
ethtool -C enp65s0f0 adaptive-rx on rx-usecs 50 rx-frames 50
ethtool -C enp65s0f0 adaptive-tx on tx-usecs 50 tx-frames 50
ethtool -G enp65s0f0 rx 2047 tx 2047
ethtool -K enp65s0f0 ntuple on
ethtool -K enp65s0f0 tx-nocache-copy on
ethtool -K enp65s0f0 rx-gro-hw on lro off gro on
ethtool -L enp65s0f1 combined 4
ethtool -C enp65s0f1 adaptive-rx on rx-usecs 50 rx-frames 50
ethtool -C enp65s0f1 adaptive-tx on tx-usecs 50 tx-frames 50
ethtool -G enp65s0f1 rx 2047 tx 2047
ethtool -K enp65s0f1 ntuple on
ethtool -K enp65s0f1 tx-nocache-copy on
ethtool -K enp65s0f1 rx-gro-hw on lro off gro on
bnxtnvm -dev=enp65s0f0 setoption=pcie_relaxed_ordering enable
bnxtnvm -dev=enp65s0f1 setoption=pcie_relaxed_ordering enable
OS Settings:
echo 268435456 > /proc/sys/net/core/rmem_max
echo 67108864 > /proc/sys/net/core/wmem_max
echo "4096 131072 268435456" > /proc/sys/net/ipv4/tcp_rmem
echo "4096 131072 63108864" > /proc/sys/net/ipv4/tcp_wmem
Configure Affinities:
IRQ list for Port 0: 160, 161
IRQ list for Port 1: 163, 164
CPU cores to be assigned – Port 0: 0, 1; Port 1: 16, 17:
echo 0 > /proc/irq/160/smp_affinity_list
echo 1 > /proc/irq/161/smp_affinity_list
echo 16 > /proc/irq/163/smp_affinity_list
echo 17 > /proc/irq/164/smp_affinity_list
aRFS and XPS Configuration:
echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
echo 16384 > /sys/class/net/p1p1/queues/rx-0/rps_flow_cnt
echo 16384 > /sys/class/net/p1p1/queues/rx-1/rps_flow_cnt
XPS setting on Port0:
echo 1 > /sys/class/net/p1p1/queues/tx-0/xps_cpus
echo 2 > /sys/class/net/p1p1/queues/tx-1/xps_cpus
XPS setting on Port1:
echo 10000 > /sys/class/net/p1p1/queues/tx-0/xps_cpus
echo 20000 > /sys/class/net/p1p1/queues/tx-1/xps_cpus
Netserver Commands:
netserver -L 25.0.0.1 -p 12800
netserver -L 25.0.0.1 -p 12801
netserver -L 35.0.0.1 -p 13051
netserver -L 35.0.0.1 -p 13052
Netperf Commands:
netperf -H 25.0.0.2 -t TCP_SENDFILE -T 0,0 -l 60 -L 25.0.0.1 -P 0 -p 12800,32100 -- -S '512K' -s '512K' -m '64k' -P '12900,32200' &
netperf -H 25.0.0.2 -t TCP_SENDFILE -T 1,1 -l 60 -L 25.0.0.1 -P 0 -p 12801,32101 -- -S '512K' -s '512K' -m '64k' -P '12903,32211' &
netperf -H 35.0.0.1 -t TCP_SENDFILE -T 16,16 -l 60 -L 35.0.0.2 -P 0 -p 13051,32351 -- -S '512K' -s '512K' -m '64k' -P '13151,32451' &
netperf -H 35.0.0.1 -t TCP_SENDFILE -T 17,17 -l 60 -L 35.0.0.2 -P 0 -p 13052,32352 -- -S '512K' -s '512K' -m '64k' -P '13154,32462' &
Table 1: Server Configuration
Item                   Description
Test                   Linux bi-directional TCP, 2x 100 Gb/s
Server                 AMD EPYC 7002 Series Reference Platform
CPU                    AMD EPYC 7742
RAM                    128 GB, 16 GB x 8 DIMMs at 3200 MHz
BIOS                   REX1006G
NIC                    BCM957508-P2100G
Operating System       Linux RHEL 7.6
Kernel                 5.3.4
Kernel command line    BOOT_IMAGE=/vmlinuz-5.3.4 root=/dev/mapper/rhel-root ro crashkernel=auto rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap rhgb quiet iommu=pt nohz=off
Table 2: Throughput Numbers
Speed      MTU     Bidi Throughput
2x100G     1500    339 Gb/s
5 DPDK Tunings
Broadcom publishes a DPDK performance report to dpdk.org; this report contains the achieved performance numbers and configuration details. The latest version of the report can be accessed at http://core.dpdk.org/perf-reports/.
5.1 BIOS Tuning
See BIOS Tuning and set the following BIOS options:
• Local APIC Mode – x2APIC
• NUMA nodes per socket – NPS1
• L3 Cache as NUMA – Disabled
• Memory Clock Speed – 1467
• PCIe Ten Bit Tag – Enabled
• Preferred I/O – Manual
• Preferred I/O Bus – (Provide Bus ID)
• Enhanced Preferred I/O – Auto
• Determinism Control – Manual
• Determinism Slider – Performance
• xGMI Link Width Control – Manual
• xGMI Force Link Width – 2
• xGMI Force Link Width Control – Force
• xGMI Max Link Width Control – Auto
• APBDIS – 1 and Fixed SOC Pstate – P0
• SMT Control – Enabled
5.2 Kernel Tuning
Add the following entries to the kernel command line:
amd_iommu=on iommu=pt nohz=off rcu_nocbs=32-47 isolcpus=32-47 selinux=0 numa_balancing=disable
processor.max_cstate=0 nosoftlockup default_hugepagesz=1G hugepagesz=1G hugepages=64
5.3 PCIe Configuration
Reduce the MRRS to 1024B using the following command (for a 4 x 25G adapter, keep the default setting of 4K):
setpci -s 41:00.0 b4.w=3d57
NOTE:
See PCIe MRRS (Maximum Read Request Size) as incorrect usage may cause a system crash.
Enable relaxed ordering on the adapter.
Example: bnxtnvm -dev=[interface] setoption=pcie_relaxed_ordering enable
5.4 DPDK Configuration
This section provides information on DPDK configuration.
Driver: vfio-pci
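Before launching testpmd, the NIC ports must be bound to the vfio-pci driver. The following is a minimal sketch assuming the ports are at PCIe addresses 0000:41:00.0 and 0000:41:00.1 and that the dpdk-devbind.py utility from the DPDK usertools directory is in the PATH; adjust the addresses for your system:
modprobe vfio-pci
dpdk-devbind.py --status
dpdk-devbind.py --bind=vfio-pci 0000:41:00.0 0000:41:00.1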
Testpmd Command Line
testpmd -l 32,33,34,35,36,37,38,39,63 -n 4 --socket-mem=4096 --master-lcore 63 -- --txq=8 --rxq=8 --rxd=4096 --txd=4096 --nb-cores=8 -i
5.5 DPDK Results
• Performance reports: http://core.dpdk.org/perf-reports/
• BCM5741X (25G) Results:
– Forwarding rate is 30 Mp/s using 64B frames
– Line rate from 128B onwards
• BCM575XX (200G) Results:
– Forwarding rate is 102 Mp/s using 64B frames
– Line rate with 1518B frames
6 IP Forwarding Tunings
This section provides information on IP forwarding tunings.
6.1 BIOS Tuning
This section provides information on BIOS tuning.
• SVM Mode – Disabled
• SMEE – Disabled
• SR-IOV Support – Disabled
• Custom Pstate0 – Auto
• Custom Pstate1 – Disabled
• Custom Pstate2 – Disabled
• SMT Control – Disabled
• Local APIC Mode – x2APIC
• NUMA nodes per socket – NPS4
• DDR Timing Configuration → Overclock – Enabled
• Memory Clock Speed – 2666 MHz
• DDR Power Options → Power Down Enable – Disabled
• IOMMU – Auto
• Determinism Control – Manual
• Determinism Slider – Performance
6.2 Kernel Tuning
This section provides information on kernel tuning.
• Add the following entries to the kernel command line:
amd_iommu=off iommu=off nohz=off (for a 100G link, set iommu=pt)
• Map interrupts to CPUs (CPUs that belong to the local NUMA node, one IRQ per CPU).
• Disable the following services:
– firewalld
systemctl stop firewalld
– SELinux
echo 0 > /selinux/enforce
– Set the CPUs to run at maximum frequency
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
– irqbalance
systemctl stop irqbalance
– NetworkManager
systemctl stop NetworkManager
• Ensure the NAT/IP table modules are unloaded (list of modules that need to be unloaded):
– iptable_raw
– iptable_security
– kvm_amd
– ip6table_mangle
– ip6table_security
– ip6table_raw
– iptable_mangle
– iptable_filter
– ip_tables
– ip6table_filter
– ip6_tables
– ipt_REJECT
– ebtable_nat
– ebtable_filter
– ebtables
– kvm_intel
– kvm
To prevent these modules from loading automatically, open /etc/modprobe.d/blacklist.conf and turn off auto-loading using the following syntax:
alias driver-name off
NOTE:
Blacklist any dependent modules that prevent the previous modules from unloading.
6.3 NIC Tuning
This section provides information on NIC tuning.
• Increase the combined rings to 16 (one ring per physical core in the local CCD; determine the local cores using the lscpu command):
ethtool -L [interface] combined 16 rx 0 tx 0
• Disable pause frames:
ethtool -A [interface] tx off rx off
• Disable LRO and GRO:
ethtool -K [interface] rx-gro-hw off lro off gro off
• Turn on TX no-cache copy:
ethtool -K [interface] tx-nocache-copy on
• Increase the TX/RX ring size to 2047:
ethtool -G [interface] tx 2047 rx 2047
• Configure ntuple filters to have an even distribution across all rings (see the example after this list):
ethtool -N [interface] flow-type [udp4/tcp4] src-ip [sip] dst-ip [dip] src-port [sport] dst-port [dport] action [Queue to redirect]
• Interrupt moderation (not required for 25G):
ethtool -C [interface] rx-usecs 512 tx-usecs 512 rx-frames 512 tx-frames 512
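As an illustration of the ntuple filter syntax in the list above, the following steers one hypothetical UDP flow to queue 4; the interface name, addresses, and ports are placeholders:
ethtool -N enp65s0f0 flow-type udp4 src-ip 10.0.0.1 dst-ip 10.0.0.2 src-port 5001 dst-port 6001 action 4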
6.4 IP Forwarding Results
This section provides information on IP forwarding results.
• IP forwarding is typically limited by the Linux kernel. Therefore, the results scale with the number of physical cores utilized. It is common to expect roughly 600 Kp/s to 800 Kp/s per physical core utilized.
• BCM5741X (25G) Results:
– Forwarding rate is ~18 Mp/s using 64B frames
– Line rate from 256B onwards
• BCM575XX (100G) Results:
– Forwarding rate is ~18 Mp/s using 64B frames
7 RDMA Tunings
This section provides information on RDMA tunings.
7.1 BIOS Tuning
This section provides information on BIOS tuning.
• Preferred I/O – Enabled
• NPS – 1
• System Profile – Performance
• x2APIC – Enabled
7.2 OS Configuration
This section provides information on OS configuration.
• Install one of the IB stacks. The RoCE driver can work with either of the following IB stacks:
– Native IB support (recommended)
This is the standard IB stack and utilities distributed by the OS vendors.
Debian/Ubuntu:
sudo apt install automake autoconf libtool libibverbs-dev ibverbs-utils infiniband-diags perftest ethtool
RedHat/Fedora/CentOS:
sudo yum install -y libibverbs-devel qperf perftest infiniband-diags make gcc kernel kernel-devel autoconf aclocal libtool libibverbs-utils rdma-core-devel
– OpenFabrics Enterprise Distribution (OFED)
IB stack and utilities distributed by OpenFabrics Alliance
Extract and install it as:
tar xf OFED-4.8.tgz
cd OFED-4.8
./install.pl
Select 2) All Packages
• Install the Broadcom user space library:
tar xf libbnxt_re-x.y.z.tar.gz
cd libbnxt_re-x.y.z
sh autogen.sh
./configure --sysconfdir=/etc
make
make install all
sudo sh -c "echo /usr/local/lib >> /etc/ld.so.conf"
sudo ldconfig
Fedora/RHEL/CentOS/Ubuntu:
sudo rm -f /usr/lib/x86_64-linux-gnu/libbnxt_re.a /usr/lib/x86_64-linux-gnu/libibverbs/libbnxt_re*
• OS tuning:
– iommu=pt
7.3 RDMA Results
This section provides results for the BCM575XX at 2x100G link speed:
• Write bandwidth, uni-directional – 187 Gb/s
• Write bandwidth, bi-directional – 367 Gb/s
Refer to the NetXtreme RoCE User Guide (available from docSAFE) for additional information.
8 Debugging Performance Issues
This section provides information on debugging performance issues.
8.1 Identifying CPU Bottleneck
Use the following command to check the CPU utilization while the test scenario is running:
mpstat 1 -u -P ALL | grep -v 100.00$
Ensure that none of the CPUs are 100% utilized. The most likely causes of CPU contention are:
• Incorrect IRQ and application affinities are configured.
– Double-check the affinity settings.
• The Broadcom NIC supports packet aggregation in hardware (TPA), meaning the NIC can combine incoming packets into larger packets. This improves CPU efficiency because the CPU needs to process fewer packets. If the TPA feature is not operating, throughput is lower and the CPU might be over-utilized. Check that the TPA counters are incrementing to make sure that TPA is working properly using the following command:
watch -n1 -d "ethtool -S [interface] | grep rx_tpa_pkt"
• Packet steering is not working, and cat /proc/interrupts shows that the workload is not evenly balanced across the assigned cores.
• Check that the IRQ affinities are set correctly and that irqbalance is stopped.
• Try with one stream and make sure that packet counters increment on a single ring:
watch -n1 -d "ethtool -S [interface] | grep rx_ucast_packets"
Increase the number of flows and check that different rings are now utilized.
8.2 Checking the PCIe Gen4 Speed and Width
To take full advantage of PCIe Gen4, it is important that the NIC trains to the correct PCIe width and speed. For Gen4, the width should be x16 and the speed should be 16 GT/s, as follows:
lspci -s 41:00.0 -vv | grep -i speed
LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM not supported, Exit Latency L0s <1us, L1 <2us
LnkSta: Speed 16GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt-
9 References
See the following references for additional information:
• https://developer.amd.com/wp-content/resources/56827-1-0.pdf
• https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/1280442391/AMD+2nd+Gen+EPYC+CPU+Tuning+Guide+for+InfiniBand+HPC
• https://www.dell.com/support/article/en-us/sln319015/amd-rome-is-it-for-real-architecture-and-initial-hpc-performance?lang=en
• https://developer.amd.com/wp-content/resources/56745_0.75.pdf
• https://software.intel.com/en-us/articles/network-recipes-for-an-evolving-data-center
• https://iopscience.iop.org/article/10.1088/1742-6596/664/9/092010/pdf
• https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/performance_tuning_guide/s-cpu-irq
• https://www.kernel.org/doc/Documentation/networking/scaling.txt
• https://www.redhat.com/en/blog/pushing-limits-kernel-networking
• https://documentation.suse.com/sbp/all/html/SBP-AMD-EPYC-2-SLES15SP1/index.html
Revision History
NetXtreme-E-UG600; October 1, 2020
Initial release.