=======================================
P7 System/Network/GPFS AIX Tuning Guide
=======================================

This document describes the current tuning recommendations for P7 clusters
running AIX. Most parameters operate at the AIX level, and are set either in
the diskless boot image, by an xcat postscript, or in the software component
(e.g., GPFS, LL) configuration. The exception is the memory interleave
setting, which is accomplished through PHYP.

Some parameters depend on the intended function of the LPAR or CEC. There are
4 node types/functions:

* GPFS server node. Contains SAS HBAs connected to one or more P7 Disk
  Enclosures.
* Compute node. Generic application node and GPFS client.
* Login node. GPFS client where users log in and launch jobs.
* Service node. NFS server that provides the diskless nodes with boot images,
  paging space, and statelite resources.

Some of the parameters depend on the size of the P7 cluster, by which is
meant the number of ml0 interfaces on the common HFI fabric (i.e., not the
size of an individual GPFS cluster when more than one GPFS cluster is defined
within an HFI fabric, but the number of all HFI nodes).

Several of the parameters are dependent upon each other, so care must be
taken to match, for example, the settings for rpoolsize, spoolsize, the TCP
socket buffer sizes (both in "no" and in GPFS), and the number of large
pages.

--------------------------------------
P7 Hardware Memory Controller Setting.
--------------------------------------
Set through the FSP.

GPFS server LPARs/CECs need to be in "8MC" mode through an FSP setting. All
other nodes need the "2MC" value. These values are:

   PendingMemoryInterleaveMode=1 for NSD servers (8MC)
   PendingMemoryInterleaveMode=2 for compute/all others (2MC)

------------
Boot images.
------------
Set in the EMS/service node NIM config.

Two boot images are required, one for GPFS servers and one for compute/login
nodes. Service nodes are diskful. The reason for separate GPFS server and
GPFS client boot images is that these node functions have different boot time
"vmo" and "no" values (among others) and different statelite resources.

The GPFS servers need a persistent AIX ODM (the file /etc/basecust) to
preserve customized device attributes. It is recommended that compute/other
diskless nodes not have a persistent ODM; they should boot with default
device attributes. A persistent ODM could lead to strange situations where
what should be interchangeable, anonymous compute nodes exhibit different
behaviors because a customized device attribute was somehow changed on one
but not the others.

The persistent statelite resources are:

   GPFS server LPAR boot image only:       The file /etc/basecust
   GPFS server and client/compute LPARs:   The file /var/adm/ras/errlog
   GPFS server and client/compute LPARs:   The directory /var/mmfs
   GPFS server and client/compute LPARs:   The directory /var/spool/cron

Only the GPFS server LPAR boot image should have the file /etc/basecust as
persistent statelite. We do not want the interchangeable generic compute
nodes to be able to individually preserve AIX ODM attributes.

Footnote from Frank Mangione: This is a sudden reevaluation based on the C2
experience, where even a switched /tmp/conslog possibly still leaves activity
going to /var/adm/ras/conslog and causes NFS hangs. I had been recommending
that all of /var/adm/ras be statelite, because errlog did not seem to work as
an individual persistent file, but it does seem to work on C2 now.
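The statelite definitions live in the xCAT litefile table on the EMS. As a
hedged illustration only (the image names below are placeholders, and the
standard image,file,options column layout is assumed; check the litefile man
page for the exact options supported by the xCAT level in use), the
persistent resources above might be expressed as entries along these lines:

   # Hypothetical litefile entries; edit with "tabedit litefile" on the EMS.
   # "gpfs-server-image" and "compute-image" are placeholder osimage names.
   "gpfs-server-image","/etc/basecust","persistent",,
   "gpfs-server-image","/var/adm/ras/errlog","persistent",,
   "gpfs-server-image","/var/mmfs/","persistent",,
   "gpfs-server-image","/var/spool/cron/","persistent",,
   "compute-image","/var/adm/ras/errlog","persistent",,
   "compute-image","/var/mmfs/","persistent",,
   "compute-image","/var/spool/cron/","persistent",,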
-----------------
NFS paging space.
-----------------
Set in the EMS/service node boot images.

The default diskless boot image paging space of 64M is insufficient for a
GPFS server. There are two reasons to increase the default service node
NFS-based paging space: (1) GPFS servers pin 48 to 64 (and maybe more)
gigabytes of pagepool and temporarily need real swap to stage the real memory
pins; and (2) some benchmark applications use real swap space to run.

From a GPFS-only perspective, 256M of NFS swap is probably enough on the
servers (especially when they are configured to use 64K instead of 4K pages)
to handle the pinning of 64G of pagepool, and the default of 64M is enough
for a GPFS client to handle the pinning of 4G of 4K pages. But from a
GPFS-only perspective we have been recommending 512M of swap across the
board. From the applications perspective, compute nodes would appear to need
1024M of NFS swap (even 2048M has been suggested), so lately we have been
configuring 1024M of NFS swap across the board. For a service node with
hundreds of clients, that is hundreds of gigabytes of NFS-exported swap.

The recommendation is that GPFS server nodes be configured with 512M of
paging space and that all other diskless clients be configured with 1024M of
paging space. Care must be taken to be sure that there is enough space on the
service nodes to accommodate all the required swap (especially in the case of
reassigning service nodes for failover).

--------------------------------------------------------
Special note for diskful nodes only (e.g., service nodes)
--------------------------------------------------------
For any command in this document which changes the boot image (e.g., vmo),
the bosboot command is also required to effect the modification of the boot
image. For example:

   bosboot -a -d /dev/ipldevice

----------------------------
HFI rpoolsize and spoolsize.
----------------------------
Set by xcat postscript.

For GPFS servers and service nodes, the maximum value of 512 MiB (0x20000000
-- hex 2 followed by 7 zeros) should be used.

For GPFS client compute and login nodes, the value depends on the GPFS/TCP
socket buffer size and the number of GPFS server LPARs that the GPFS client
must communicate with. For GPFS clients, the HFI rpoolsize and spoolsize
should be:

   poolsize = MAXIMUM OF (i) and (ii)
      (i)  Number of GPFS servers * socket buffer size
      (ii) 0x4000000 (each client must have at least 64 MB)

rounded to the nearest MiB and expressed in hex. The minimum value defined
for the pools is 0x2000000 (hex 2 followed by 6 zeros, 32 MiB).

Note that when the value of (i) is less than the 64 MiB in (ii), this
provides a cushion, because in general the r/spool may be overcommitted by 2x
by any non-GPFS sockets which use the r/spool.

For example, with 3 GPFS servers and a socket buffer of 4198400 bytes, this
amounts to an rpool/spool of 3 * 4198400 = 12.01 MiB. This is less than
64 MiB, so the default of 64 MiB (0x4000000) should be used for both the HFI
rpoolsize and spoolsize on the GPFS client nodes.

As another example, with 10 GPFS servers and a socket buffer of 4198400
bytes, this amounts to an rpool/spool of 10 * 4198400 = 40.04 MiB. This is
greater than the 32 MiB minimum, so a non-default value of 40.04 MiB
(0x280A000) should be used for both the HFI rpoolsize and spoolsize on the
GPFS client nodes.
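To avoid hand arithmetic, the computation in the examples above can be done
in a couple of lines of shell. This is an illustrative sketch only, with
NSERVERS and SOCKBUF as placeholders for the number of GPFS server LPARs and
the GPFS socket buffer size (rounding to the nearest MiB is omitted):

   NSERVERS=10            # placeholder: number of GPFS server LPARs
   SOCKBUF=4198400        # placeholder: GPFS socket buffer size in bytes
   need=$(( NSERVERS * SOCKBUF ))
   printf "computed rpool/spool requirement: 0x%X bytes\n" "$need"
   # For NSERVERS=10 this prints 0x280A000, as in the second example above;
   # compare the result against the 64 MiB default (0x4000000) as described.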
The HFI rpoolsize and spoolsize are set at boot time by an xcat postscript
using the following commands:

   SERVER: /usr/lib/methods/chghfi -l hfi0 -a rpoolsize=0x20000000
   SERVER: /usr/lib/methods/chghfi -l hfi0 -a spoolsize=0x20000000

   CLIENT: /usr/lib/methods/chghfi -l hfi0 -a rpoolsize=SEE_ABOVE
   CLIENT: /usr/lib/methods/chghfi -l hfi0 -a spoolsize=SEE_ABOVE

(Remember that service nodes should have the maximum pool values, just like
GPFS servers. But service nodes are diskful, so the chghfi commands should
only need to be executed once and will be preserved in the ODM.)

The HFI receive and send pools are backed by system VMM large pages, so a
minimum number of 16 MiB large pages is required to cover both pools, plus a
cushion for other system programs and any that might be required by user
applications. See the "vmo" section.

-----------------------------------------
VMM ("vmo") and non-network AIX tunables.
-----------------------------------------
Set using "xcatchroot" in the boot images.

The size of a large page is 16 MiB in all boot images:

   vmo -r -o lgpg_size=16777216

The number of large pages on a GPFS server or a service node should be 64
(to back the two 512 MiB pools: (512 + 512) / 16) + 8 (a cushion for any
other OS resources to claim) = 72:

   SERVER: vmo -r -o lgpg_regions=72

The number of large pages on a GPFS client should be:

   ((rpoolsize + spoolsize) / 16) + 8 + user app requirement

For example, if the rpoolsize and spoolsize are 32 MiB and there are no user
application requirements for large pages, then the number of large pages for
a GPFS client node would be ((32 + 32) / 16) + 8 = 12:

   CLIENT: vmo -r -o lgpg_regions=12

As another example, if the rpoolsize and spoolsize are 80 MiB and there are
no user application requirements for large pages, then the number of large
pages for a GPFS client node would be ((80 + 80) / 16) + 8 = 18:

   CLIENT: vmo -r -o lgpg_regions=18

The large page values should be set in the boot image so that the pages are
available when the HFI receive and send pools are configured.

---------------------------------------
Additional AIX tuning for client nodes.
---------------------------------------
Booting in SMT2 should be set using "xcatchroot" in the boot image:

   smtctl -t 2 -w boot

The remainder of this section can be set by an xcat postscript (see the
sketch at the end of this section).

Increase hardware memory prefetch to max:

   dscrctl -n -s 0x1e

Enable users to use 16M pages and the BSR:

   chuser capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE,CAP_NUMA_ATTACH <user>

Allow 512 tasks per user (minimum):

   chdev -l sys0 -a maxuproc=512

Paging space threshold tuning:

   vmo -o nokilluid=1
   vmo -o npskill=64    (if the system has less than 1G of paging space, set to 4)
   vmo -o npswarn=256   (if the system has less than 1G of paging space, set to 16)

Allow shared memory segments to be pinned:

   vmo -o v_pinshm=1

Disable enhanced memory affinity:

   vmo -o enhanced_affinity_affin_time=0

Disable processor folding:

   schedo -o vpm_fold_policy=4

To help performance when applications use primary CPUs:

   schedo -o shed_primrunq_mload=0

In the client node boot image, edit the file /etc/environment and add the
following line so that user programs default to 64k pages:

   LDR_CNTRL=TEXTPSIZE=64K@STACKPSIZE=64K@DATAPSIZE=64K@SHMPSIZE=64K
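Most of the runtime tunables above could be collected into a single xcat
postscript; the smtctl and /etc/environment changes belong in the boot image
itself. The following is a sketch only: the script name and the user list are
placeholders, and the values simply restate the commands listed above.

   #!/bin/ksh
   # Hypothetical postscript, e.g. /install/postscripts/p7_client_tuning,
   # applying the runtime client tunables from this section.
   dscrctl -n -s 0x1e                        # hardware memory prefetch to max
   chdev -l sys0 -a maxuproc=512             # allow 512 tasks per user
   # Paging space thresholds (use 4/16 instead on systems with <1G paging space):
   vmo -o nokilluid=1 -o npskill=64 -o npswarn=256
   vmo -o v_pinshm=1                         # allow pinned shared memory
   vmo -o enhanced_affinity_affin_time=0     # disable enhanced memory affinity
   schedo -o vpm_fold_policy=4               # disable processor folding
   schedo -o shed_primrunq_mload=0           # primary run queue tuning
   # Grant 16M page and BSR capabilities to each application user
   # ("user1 user2" is a placeholder list):
   for u in user1 user2; do
       chuser capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE,CAP_NUMA_ATTACH $u
   done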
------------------------
Network ("no") tunables.
------------------------
Some of these must, and most should, be set using "xcatchroot" in the boot
images. ARP table values need to be set in the boot image, as they cannot be
changed at runtime. ARP table values are the same in all boot images and on
the service node.

The number of buckets (arptab_nb) is always 149 (a reasonably large prime
number for the hash table). The size of each bucket (arptab_bsiz) depends on
the total number of nodes on the HFI fabric:

   no -r -o arptab_nb=149 -o arptab_bsiz=X

where X is (8 * number of LPARs on the HFI) / 149, with a minimum value of 9.

The max socket buffer is the same in all boot images (that includes service
nodes, along with compute, login, and GPFS storage nodes), and must be set in
the boot image so that the ml0 driver configures with it. It should be 4
times the GPFS/TCP socket buffer size. If tcp_sendspace and tcp_recvspace are
to be set to 4198400, the value of sb_max should be 16793600:

   no -r -o sb_max=16793600

The following should be set prior to GPFS startup. There is no reason not to
set them in the boot image, and they must be the same in all boot images
(note this includes the images for service, compute, login, and GPFS storage
nodes):

   no -o rfc1323=1
   no -o tcp_recvspace=1048576
   no -o tcp_sendspace=1048576
   no -o sack=0

TCP send space and TCP receive space are the default socket buffer sizes for
TCP connections. Note that the values used for GPFS socket buffer sizes are
defined in the GPFS tuning section.

--------------------------
GPFS Tuning Considerations
--------------------------
1) mmchconfig socketRcvBufferSize=4198400
2) mmchconfig socketSndBufferSize=4198400

   The value of 4198400 is chosen to be one quarter of a 16 MiB GPFS data
   block, plus the NSD network checksum (4 MiB plus 4 KiB = 4194304 + 4096 =
   4198400). This will allow a 16 MiB GPFS data block to be sent or received
   using 4 buffers, an 8 MiB GPFS data block to be transferred using 2
   buffers, and a 4 MiB or smaller GPFS data block to be transferred using 1
   buffer. These values should be no larger than sb_max/4.

3) In order to reduce the overhead of metadata operations and provide more
   responsive filesystems, the customer may decide they do not need exact
   mtime values to be tracked by GPFS. In this case the modification time on
   a file will be changed when it is closed, but it will not be changed each
   time the file is modified. For each GPFS filesystem where this behavior is
   desired, run:

      mmchfs device_name_of_GPFS_filesystem -E no

4) For login nodes and NFS export nodes:

      mmchconfig maxFilesToCache=10000,maxStatCache=50000 node1,node2,...,nodeN

   If compute nodes operate on large numbers of files, consider making the
   above change (or similar) on them as well.

5) Large system customers may want to consider increasing the GPFS lease to
   avoid nodes dropping off a filesystem due to a lease timeout, e.g.:

      mmchconfig leaseRecoveryWait=120
      mmchconfig maxMissedPingTimeout=120

--------------------------
Passwordless root ssh/scp.
--------------------------
Every LPAR (server, compute, other) that is to be part of the P7 GPFS cluster
must be able to passwordlessly ssh/scp as root to all the others so that GPFS
commands and configuration work. This can be configured either in the boot
image or through an xcat postscript.
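One common way to provide this is to share a single root keypair across all
LPARs in the cluster. The following is a hedged sketch only (the node group
name "p7nodes" and the use of xdcp from the EMS are assumptions; sites may
already have their own key distribution mechanism, for example as part of the
boot image build):

   # Generate one root keypair (no passphrase) on the EMS, if not already present:
   ssh-keygen -t rsa -N "" -f /root/.ssh/id_rsa
   cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
   # Push the keypair and authorized_keys to every cluster LPAR so any node
   # can ssh/scp to any other ("p7nodes" is a placeholder xCAT node group):
   xdcp p7nodes /root/.ssh/id_rsa /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys /root/.ssh/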
--------------------------------------------
GPFS Server ODM changes for HBAs and hdisks.
--------------------------------------------
Note: The 0x400000 (4 MiB) value below assumes a 16 MiB GPFS blocksize. This
is almost certainly the case in any P7 GPFS installation.

Inside the GPFS server boot image, the following odmdelete/odmadd commands
must be run to add the max_coalesce attribute to the hdisks:

   # odmdelete -o PdAt -q 'uniquetype=disk/sas/nonmpioscsd and attribute=max_coalesce'

The delete is just to make sure a second max_coalesce attribute is not added;
it's okay if it results in "0 object deleted."

Create a file 'max_coalesce.add' with the following contents (the first and
last lines of the file must be blank):

   PdAt:
           uniquetype = "disk/sas/nonmpioscsd"
           attribute = "max_coalesce"
           deflt = "0x400000"
           values = "0x10000,0x20000,0x40000,0x80000,0x100000,0x200000,0x400000,0x800000,0xfff000,0x1000000"
           width = ""
           type = "R"
           generic = "DU"
           rep = "nl"
           nls_index = 137

Then run the following odmadd inside the server boot image:

   # odmadd < max_coalesce.add

The newly created max_coalesce attribute will have the required default value
of 4 MiB (0x400000).

The default for the max_transfer hdisk attribute also needs to be changed to
0x400000 using the /usr/sbin/chdef command, which must be run inside the
xcatchroot for the server boot image:

   # /usr/sbin/chdef -a max_transfer=0x400000 -c disk -s sas -t nonmpioscsd

Optionally, the HBA attributes max_commands and max_transfer can have their
default values changed. Both the 2-port HBA (uniquetype 001072001410ea0) and
the 4-port HBA (uniquetype 001072001410f60) require a max_transfer of
0x400000. This can be set with the chdef command in the server boot image
chroot environment:

   # chdef -a max_transfer=0x400000 -c adapter -s pciex -t 001072001410ea0
   # chdef -a max_transfer=0x400000 -c adapter -s pciex -t 001072001410f60

The 2-port HBA takes a max_commands of 248 and the 4-port HBA takes a
max_commands of 124. Again, these must be set in the chroot environment of
the GPFS server boot image:

   # chdef -a max_commands=248 -c adapter -s pciex -t 001072001410ea0
   # chdef -a max_commands=124 -c adapter -s pciex -t 001072001410f60
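After a server boots from the updated image, the effective device values can
be spot-checked with lsattr and odmget. A hedged example follows; hdisk10 and
sissas0 are placeholder device names for whichever instances actually appear
on the node:

   # lsattr -El hdisk10 -a max_coalesce -a max_transfer
   # lsattr -El sissas0 -a max_commands -a max_transfer
   # odmget -q "uniquetype=disk/sas/nonmpioscsd and attribute=max_coalesce" PdAt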