=======================================
P7 System/Network/GPFS AIX Tuning Guide
=======================================
This document describes the current tuning recommendations for P7
clusters running AIX.
Most parameters operate at the AIX level, and are set either in the
diskless boot image, by an xcat postscript, or in the software
component (e.g., GPFS, LL) configuration. The exception is the memory
interleave setting, which is accomplished through PHYP.
Some parameters depend on the intended function of the LPAR or CEC.
There are 4 node types/functions:
* GPFS server node. Contains SAS HBAs connected to one or more
P7 Disk Enclosures.
* Compute node. Generic application node and GPFS client.
* Login node. GPFS client where users login and launch jobs.
* Service node. NFS server that provides the diskless nodes
with boot images, paging space, and statelite resources.
Some of the parameters depend on the size of the P7 cluster, by which
is meant the number of ml0 interfaces on the common HFI fabric (i.e.,
not the size of an individual GPFS cluster when more than one GPFS
cluster is defined within an HFI fabric, but the number of all HFI
nodes).
Several of the parameters are dependent upon each other, so care must
be taken to match, for example, the settings for rpoolsize, spoolsize,
the TCP socket buffer sizes (both in "no" and in GPFS), and the number
of large pages.
--------------------------------------
P7 Hardware Memory Controller Setting.
--------------------------------------
Set through the FSP.
GPFS server LPARs/CECs need to be in "8MC" mode through an FSP
setting.
All other nodes need the "2MC" value.
These values are:
PendingMemoryInterleaveMode=1 for NSD servers (8MC)
PendingMemoryInterleaveMode=2 for compute/all others (2MC)
------------
Boot images.
------------
Set in the EMS/service node NIM config.
Two boot images are required, one for GPFS servers and one for
compute/login nodes. Service nodes are diskful.
The reason for separate GPFS server and GPFS client boot images is that
these node functions have different boot time "vmo" and "no" values
(among others) and different statelite resources.
The GPFS servers need a persistent AIX ODM (the file /etc/basecust) to
preserve customized device attributes. It is recommended that
compute/other diskless nodes not have a persistent ODM; they should
boot with default device attributes (a persistent ODM could lead to
strange situations where what should be interchangeable, anonymous
compute nodes exhibit different behavior because a customized device
attribute was somehow changed on one node but not the others).
GPFS server LPAR boot image only:       The file /etc/basecust
GPFS server and client/compute LPARs:   The file /var/adm/ras/errlog
GPFS server and client/compute LPARs:   The directory /var/mmfs
GPFS server and client/compute LPARs:   The directory /var/spool/cron
Only the GPFS server LPAR boot image should have the file /etc/basecust
as persistent statelite. We do not want the interchangeable generic
compute nodes to be able to individually preserve AIX ODM attributes.
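For reference, a minimal sketch of how these statelite resources might
appear in the xCAT litefile table (format as shown by "tabdump
litefile"; the osimage names and the "persistent" option are assumptions
that must be adapted to the site's actual image names and configuration):
# Hypothetical "tabdump litefile" output; adjust to the site's setup.
#image,file,options,comments,disable
"gpfs_server_image","/etc/basecust","persistent",,
"gpfs_server_image","/var/adm/ras/errlog","persistent",,
"gpfs_server_image","/var/mmfs/","persistent",,
"gpfs_server_image","/var/spool/cron/","persistent",,
"compute_image","/var/adm/ras/errlog","persistent",,
"compute_image","/var/mmfs/","persistent",,
"compute_image","/var/spool/cron/","persistent",,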
Footnote from Frank Mangione: This is a recent reevaluation based on
the C2 experience, where even a switched /tmp/conslog could still leave
activity going to /var/adm/ras/conslog and cause NFS hangs. I had been
recommending that all of /var/adm/ras be statelite, because errlog did
not seem to work as an individual persistent file, but it does now seem
to work on C2.
-----------------
NFS paging space.
-----------------
Set in the EMS/service node boot images.
The default diskless boot image paging space of 64M is insufficient for
a GPFS server.
Two reasons to increase the default service node NFS-based paging
space: (1) GPFS servers pin 48 to 64 and maybe more gigs of pagepool
and temporarily need real swap to stage the real memory pins; and (2)
some benchmark applications use real swap space to run.
From a GPFS-only perspective, 256M of NFS swap is probably enough on
the servers (especially when they are configured to use 64K instead of
4K pages) to handle the pinning of 64G of pagepool, and the default of
64M is enough for a GPFS client to handle the pinning of 4G of 4K
pages. Even so, from a GPFS-only perspective we have been recommending
512M of swap across the board.
From the applications perspective, compute nodes would appear to need
1024M of NFS swap (even 2048M has been suggested).
So lately we have been configuring 1024M of NFS swap across the board.
For a service node with hundreds of clients, that's hundreds of gigs of
NFS exported swap.
The recommendation is that GPFS server nodes be configured with 512M of
paging space and that all other diskless clients be configured with
1024M of paging space.
Care must be taken to be sure that there is enough space on the service
nodes to accommodate all the required swap (especially in the case of
reassigning service nodes for failover).
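As a rough sizing sketch, the NFS-exported paging space one service node
must hold can be estimated as follows (the node counts are illustrative
assumptions, not values from this guide):
#!/usr/bin/ksh
# Estimate the NFS-exported paging space one service node must hold.
NSERVERS=4        # assumed GPFS server LPARs served by this node, 512M each
NCLIENTS=250      # assumed compute/login diskless clients, 1024M each
TOTAL_MB=$(( NSERVERS * 512 + NCLIENTS * 1024 ))
print "NFS paging space required: $TOTAL_MB MB (about $(( TOTAL_MB / 1024 )) GB), plus failover headroom"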
--------------------------------------------------------
Special note for diskful nodes only (e.g., service nodes).
--------------------------------------------------------
For any command in this document which changes the boot image (e.g.,
vmo), the bosboot command is also required to effect the modification
of the boot image. For example: bosboot -a -d /dev/ipldevice
----------------------------
HFI rpoolsize and spoolsize.
----------------------------
Set by xcat postscript.
For GPFS servers and service nodes, the maximum value of 512 MiB
(0x20000000 -- hex 2 followed by 7 zeros) should be used.
For GPFS client compute and login nodes, the value depends on the
GPFS/TCP socket buffer size and the number of GPFS server LPARs that
the GPFS client must communicate with.
For GPFS clients, the HFI rpoolsize and spoolsize should be:
    poolsize = MINIMUM OF (i) and (ii)
        (i)  Number of GPFS servers * socket buffer size
        (ii) 0x4000000 (each client must have at least 64 MB)
rounded to the nearest MiB expressed in hex. If the result is less than
64 MiB, use the minimum value defined (0x2000000 -- hex 2 followed by 6
zeros). Note that for any non-GPFS sockets which use the r/spool, the
r/spool may be overcommitted by 2x.
For example, with 3 GPFS servers and a socket buffer of 4198400 bytes,
this amounts to rpool/spool of 3 * 4198400 = 12.01 MiB. This is less
than 64 MiB, so the default of 64 MiB (0x4000000) should be used for
both the HFI rpoolsize and spoolsize on the GPFS client nodes. The value
of (i) is below the 64 MiB in (ii), so 64 MiB is used; this provides a
cushion, because in general the receive and send pools are also used by
non-GPFS sockets.
As another example, with 10 GPFS servers and a socket buffer of 4198400
bytes, this amounts to rpool/spool of 10 * 4198400 = 40.04 MiB. This is
greater than 32 MiB, so a non-default value of 40.04 MiB (0x280A000)
should be used for both the HFI rpoolsize and spoolsize on the GPFS
client nodes.
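The arithmetic above can be sketched as follows (a rough helper;
NSERVERS is a placeholder for the actual number of GPFS server LPARs,
and 4198400 is the socket buffer size from the examples):
#!/usr/bin/ksh
# Compute the client rpoolsize/spoolsize candidate in bytes and hex,
# mirroring the examples above.
NSERVERS=10                      # assumed number of GPFS server LPARs
SOCKBUF=4198400                  # GPFS/TCP socket buffer size in bytes
BYTES=$(( NSERVERS * SOCKBUF ))
printf "candidate rpoolsize/spoolsize: %d bytes (0x%X)\n" $BYTES $BYTES
# Compare the result against the 64 MiB default and the minimum defined
# above before applying it with chghfi.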
The HFI rpoolsize and spoolsize are set at boot time by an xcat
postscript using the following commands:
SERVER: /usr/lib/methods/chghfi -l hfi0 -a rpoolsize=0x20000000
SERVER: /usr/lib/methods/chghfi -l hfi0 -a spoolsize=0x20000000
CLIENT: /usr/lib/methods/chghfi -l hfi0 -a rpoolsize=SEE_ABOVE
CLIENT: /usr/lib/methods/chghfi -l hfi0 -a spoolsize=SEE_ABOVE
(Remember that service nodes should have the maximum pool values, just
like GPFS servers. But service nodes are diskful, so the chghfi
commands should only need to be executed once and will be preserved
in the ODM.)
The HFI receive and send pools are backed by system VMM large pages,
so a minimum number of 16 MiB large pages to cover both pools is
required, plus a cushion for other system programs, and those that
might be required by user applications. See the "vmo" section.
-----------------------------------------
VMM ("vmo") and non-network AIX tunables.
-----------------------------------------
Set using "xcatchroot" in the boot images.
The size of a large page is 16 MiB in all boot images:
vmo -r -o lgpg_size=16777216
The number of large pages on a GPFS server or a service node should
be 64 + 8 (a cushion for any other OS resources to claim) = 72:
SERVER: vmo -r -o lgpg_regions=72
The number of large pages on a GPFS client should be:
((rpoolsize + spoolsize) / 16) + 8 + user app requirement
For example, if the rpoolsize and spoolsize are 32 MiB and there are no
user application requirements for large pages, then the number of large
pages for a GPFS client node would be ((32 + 32) / 16) + 8 = 12:
CLIENT: vmo -r -o lgpg_regions=12
As another example, if the rpoolsize and spoolsize are 80 MiB and there
are no user application requirements for large pages, then the number
of large pages for a GPFS client node would be ((80 + 80) / 16) + 8 =
18:
CLIENT: vmo -r -o lgpg_regions=18
The large page values should be set in the boot image so that the pages
are available when the HFI receive and send pools are configured.
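As an illustration of setting these inside a boot image (the osimage
name below is a placeholder; substitute the actual client image name):
# Run vmo inside the diskless client boot image via xcatchroot so the
# large pages exist at boot, before the HFI pools are configured.
xcatchroot -i <client_osimage> "/usr/sbin/vmo -r -o lgpg_size=16777216 -o lgpg_regions=12"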
---------------------------------------
Additional AIX tuning for client nodes.
---------------------------------------
Booting in SMT2 should be set using "xcatchroot" in the boot image:
smtctl -t 2 -w boot
The remainder of this section can be set by an xcat postscript (a
sketch follows the list of tunables below).
Increase hardware memory prefetch to max:
dscrctl -n -s 0x1e
Enable users to use 16M pages and the BSR:
chuser capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE,CAP_NUMA_ATTACH <user>
Allow 512 tasks per user (minimum):
chdev -l sys0 -a maxuproc=512
Paging space threshold tuning:
vmo -o nokilluid=1
vmo -o npskill=64 (if system has less than 1G of pg space, set to 4)
vmo -o npswarn=256 (if system has less than 1G of pg space, set to 16)
Allow shared memory segments to be pinned:
vmo -o v_pinshm=1
Disable enhanced memory affinity:
vmo -o enhanced_affinity_affin_time=0
Disable processor folding:
schedo -o vpm_fold_policy=4
To help performance when applications use primary CPUs:
schedo -o shed_primrunq_mload=0
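A minimal sketch of an xcat postscript collecting the runtime-settable
client tunables above (the script name and the user list are
assumptions; site postscripts may differ):
#!/usr/bin/ksh
# tune_client: sketch of an xcat postscript for compute/login nodes.
dscrctl -n -s 0x1e                                   # max hardware memory prefetch
chdev -l sys0 -a maxuproc=512                        # at least 512 tasks per user
vmo -o nokilluid=1 -o npskill=64 -o npswarn=256      # paging space thresholds
vmo -o v_pinshm=1 -o enhanced_affinity_affin_time=0
schedo -o vpm_fold_policy=4                          # disable processor folding
schedo -o shed_primrunq_mload=0
for u in user1 user2; do                             # placeholder user names
    chuser capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE,CAP_NUMA_ATTACH $u
done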
In the client node boot image, edit the file /etc/environment and add
the following line so that user programs default to 64k pages:
LDR_CNTRL=TEXTPSIZE=64K@STACKPSIZE=64K@DATAPSIZE=64K@SHMPSIZE=64K
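One way to make this edit non-interactively in the image (a sketch; the
osimage name is a placeholder, and the command assumes xcatchroot runs
the string through a shell inside the image):
xcatchroot -i <client_osimage> \
  'echo "LDR_CNTRL=TEXTPSIZE=64K@STACKPSIZE=64K@DATAPSIZE=64K@SHMPSIZE=64K" >> /etc/environment'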
------------------------
Network ("no") tunables.
------------------------
Some of these must, and most should, be set using "xcatchroot" in the
boot images.
ARP table values need to be set in the boot image, as they cannot be
changed at runtime.
ARP table values are the same in all boot images and on the service
node.
The number of buckets (arptab_nb) is always 149 (a reasonably large
prime number for the hash table). The size of each bucket (arptab_bsiz)
depends on the total number of nodes on the HFI fabric.
no -r -o arptab_nb=149 -o arptab_bsiz=X
Where X is (8 * number of LPARs on the HFI) / 149, with a minimum value
of 9.
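For example, the bucket size can be computed as follows (the LPAR count
below is illustrative, not a value from this guide):
#!/usr/bin/ksh
# Sketch of the arptab_bsiz arithmetic.
NLPARS=1536                          # assumed total LPAR count on the HFI
BSIZ=$(( (8 * NLPARS) / 149 ))
[ $BSIZ -lt 9 ] && BSIZ=9            # enforce the minimum value of 9
no -r -o arptab_nb=149 -o arptab_bsiz=$BSIZ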
The maximum socket buffer size (sb_max) is the same in all boot images
(that includes service nodes, along with compute, login, and GPFS
storage nodes), and must be set in the boot image so that the ml0
driver configures with it. It should be 4 times the GPFS/TCP socket
buffer size; if tcp_sendspace and tcp_recvspace are to be set to
4198400, the value of sb_max should be 16793600.
no -r -o sb_max=16793600
The following should be set prior to GPFS startup. There is no reason
not to set them in the boot image, and they must be the same in all
boot images (this includes the service, compute, login, and GPFS
storage images):
no -o rfc1323=1
no -o tcp_recvspace=1048576
no -o tcp_sendspace=1048576
no -o sack=0
TCP send space and TCP receive space are the default socket buffer
sizes for TCP connections. Note that the values used for GPFS socket
buffer sizes are defined in the GPFS tuning section.
--------------------------
GPFS Tuning Considerations
--------------------------
1) mmchconfig socketRcvBufferSize=4198400
2) mmchconfig socketSndBufferSize=4198400
The value of 4198400 is chosen to be one quarter of a 16 MiB GPFS data
block, plus the NSD network checksum (4 MiB plus 4 KiB = 4194304 + 4096
= 4198400). This allows a 16 MiB GPFS data block to be sent or received
using 4 buffers, an 8 MiB data block using 2 buffers, and a data block
of 4 MiB or less using 1 buffer. These values should be no larger than
sb_max/4.
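A quick check of this constraint on a running node might look like the
following sketch (assuming the GPFS socket buffer size of 4198400 from
items 1 and 2):
#!/usr/bin/ksh
# Confirm that the GPFS socket buffer size is no larger than sb_max/4.
SOCKBUF=4198400
SBMAX=$(no -o sb_max | awk '{print $3}')     # output is "sb_max = 16793600"
if [ $SOCKBUF -gt $(( SBMAX / 4 )) ]; then
    print "WARNING: socket buffer $SOCKBUF exceeds sb_max/4 ($(( SBMAX / 4 )))"
fi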
3) To reduce the overhead of metadata operations and provide more
responsive filesystems, the customer may decide that they do not need
exact mtime values to be tracked by GPFS. In this case the modification
time on a file is updated when the file is closed, but not each time
the file is modified. For each GPFS filesystem where this behavior is
desired, run:
mmchfs device_name_of_GPFS_filesystem -E no
4) For login nodes and NFS export nodes:
mmchconfig maxFilesToCache=10000,maxStatCache=50000 -N node1,node2,...,nodeN
If compute nodes operate on large numbers of files, consider making the
above change (or similar) on them as well.
5) Large system customers may want to consider increasing the GPFS
lease to avoid nodes dropping off a filesystem due to a lease timeout,
e.g.:
mmchconfig leaseRecoveryWait=120
mmchconfig maxMissedPingTimeout=120
--------------------------
Passwordless root ssh/scp.
--------------------------
Every LPAR--server, compute, other--that is to be part of the P7 GPFS
cluster must be able to passwordlessly ssh/scp as root to all the
others so that GPFS commands and configuration work. This can be
configured either in the boot image or through an xcat postscript.
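One possible approach, sketched with xCAT's xdsh/xdcp and a shared root
key (the node group name "gpfsnodes" is a placeholder; existing keys and
ssh host-key handling are not addressed, and sites may prefer their own
key-distribution method):
# On the EMS: generate one root key pair (skip if one already exists),
# distribute it to the node group, and authorize it on every node.
ssh-keygen -t rsa -N "" -f /root/.ssh/id_rsa
xdcp gpfsnodes /root/.ssh/id_rsa /root/.ssh/id_rsa
xdcp gpfsnodes /root/.ssh/id_rsa.pub /root/.ssh/id_rsa.pub
xdsh gpfsnodes 'cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys'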
---------------------------------------------
GPFS Server ODM changes for HBAs and hdisks.
---------------------------------------------
Note: The 0x400000 (4 MiB) value below assumes a 16 MiB GPFS blocksize.
This is almost certainly the case in any P7 GPFS installation.
Inside the GPFS server boot image, the following odmdelete/odmadd
commands must be run to add the max_coalesce attribute to the hdisks:
# odmdelete -o PdAt -q 'uniquetype=disk/sas/nonmpioscsd and attribute=max_coalesce'
The delete is just to make sure a second max_coalesce attribute is not
added; it's okay if it results in "0 object deleted."
Create a file 'max_coalesce.add' with the following contents:
PdAt:
uniquetype = "disk/sas/nonmpioscsd"
attribute = "max_coalesce"
deflt = "0x400000"
values = "0x10000,0x20000,0x40000,0x80000,0x100000,0x200000,0x400000,0x800000,0xfff000,0x1000000"
width = ""
type = "R"
generic = "DU"
rep = "nl"
nls_index = 137
The first and last lines of the file 'max_coalesce.add' must be blank.
Then run the following odmadd inside the server boot image:
# odmadd < max_coalesce.add
The above newly-created max_coalesce attribute will have the required
default value of 4 MiB (0x400000).
The default for the max_transfer hdisk attribute also needs to be
changed to 0x400000, using the /usr/sbin/chdef command, which must be
run inside the xcatchroot for the server boot image:
# /usr/sbin/chdef -a max_transfer=0x400000 -c disk -s sas -t nonmpioscsd
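Once a server boots from the updated image, the new defaults can be
spot-checked on one of its disks (hdisk2 below is just an example
device name):
# Confirm the hdisk attribute values picked up from the modified ODM.
lsattr -El hdisk2 -a max_coalesce -a max_transfer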
Optionally, the HBA attributes max_commands and max_transfer can have
their default values changed.
Both the 2-port HBA (uniquetype 001072001410ea0) and the 4-port HBA
(uniquetype 001072001410f60) require a max_transfer of 0x400000.
This can be set with the chdef command in the server boot image chroot
environment:
# chdef -a max_transfer=0x400000 -c adapter -s pciex -t 001072001410ea0
# chdef -a max_transfer=0x400000 -c adapter -s pciex -t 001072001410f60
The 2-port HBA takes a max_commands of 248 and the 4-port HBA takes a
max_commands of 124. Again, these must be set in the change root
environment of the GPFS server boot image:
# chdef -a max_commands=248 -c adapter -s pciex -t 001072001410ea0
# chdef -a max_commands=124 -c adapter -s pciex -t 001072001410f60