HP-UX Swap and Dump Unleashed
By Unix/Linux Apprentice with 26 Years of Experience
Dusan Baljevic
Sydney, Australia
Aug 2011

Why This Document?
• Frequent “abuse” of good design principles.
• “A friend in need is a friend indeed”: why the standard swap/dump design fails in real scenarios.
• Everyone has a different opinion, so why not help system administrators and architects stop implementing bad practices?
• Especially important on large-RAM servers.
• Based on 26 years of practical experience in Unix/Linux.

This Document is Not:
• A replacement for HP’s official statements.
• A written manual to learn HP-UX and its design principles in detail.
• Glorified personal experience to prove that I “know best” (rather the opposite).

HP-UX Current Official Recommendations - Part 1
Use the following guidelines when configuring swap logical volumes:
• Interleave device swap areas for better performance.
• Two swap areas on different disks perform better than one swap area with the equivalent amount of space. This configuration allows interleaved swapping, which means the swap areas are written to concurrently, thus enhancing performance.
• When using LVM, set up secondary swap areas within logical volumes that are on different disks using lvextend.
• If you have only one disk and must increase swap space, try to move the primary swap area to a larger contiguous region.

HP-UX Current Official Recommendations - Part 2
• Similar-sized device swap areas work best. Device swap areas must have similar sizes for best performance. Otherwise, when all space in the smaller device swap area is used, only the larger swap area is available, making interleaving impossible.
• By default, primary swap is located on the same disk as the root file system. The kernel configuration file contains the configuration information for primary swap.
• If you are using logical volumes as secondary swap, allocate the secondary swap to reside on a disk other than the root disk for better performance.
• Disable mirror consistency checking for a mirrored primary swap device (swap contents do not need to be recovered after a failure).
• Use priority 0 device swap to bypass swap on the root disk.

HP-UX How Much Swap is Enough?
• Every admin and architect has a different opinion.
• Traditional views typically use the formula: SWAP = 1 or 2 x RAM
• Some old designs and applications required even 3 x RAM (or more).
• Old HP-UX releases had a serious issue with the (now obsolete) kernel parameter swapmem_on (see the HP-UX Pseudoswap slide).

HP-UX How Much Dump is Enough? - Part 1
The vast majority of problems are found in the kernel area. Only rarely do the program data areas need to be examined; even more rarely, the shared memory areas; and virtually never the buffer/file cache and shared libraries.
If a full crash dump is taken, the total space needed will be as high as RAM (and a bit more).
With a compressed dump, the overall time taken is reduced by about 1/3, and the disk space required should also be reduced by at least 1/3 for the default selection of page classes (the default page class selection usually covers around 20% of memory).

HP-UX How Much Dump is Enough? - Part 2
# crashconf -v
Crash dump configuration has been changed since boot.

CLASS     PAGES      INCLUDED IN DUMP  DESCRIPTION
--------  ---------  ----------------  -------------------------
UNUSED      9572004  no,  by default   unused pages
USERPG      1341553  no,  by default   user process pages
BCACHE         1980  no,  by default   buffer cache pages
KCODE          9142  no,  by default   kernel code pages
USTACK         1567  yes, by default   user process stacks
FSDATA           12  yes, by default   file system metadata
KDDATA      1492949  yes, by default   kernel dynamic data
KSDATA         8816  yes, by default   kernel static data
SUPERPG      128677  no,  by default   unused kernel super pages

Total pages on system:         12556700
Total pages included in dump:   1503344

Dump compressed:  ON
Dump Parallel:    ON

DEVICE       OFFSET(kB)   SIZE (kB)    LOGICAL VOL.  NAME
------------ ------------ ------------ ------------  ----------------
1:0x000005        2349920      4194304 64:0x000002   /dev/vg00/lvol2
                          ------------
                               4194304

# getconf PAGESIZE
4096

With a 4096-byte page, the 1503344 included pages amount to 6013376 KB, roughly 5.7 GB before compression.

HP-UX Pseudoswap
Pseudoswap allows the kernel to treat a portion of physical memory as if it were swap space in order to satisfy the swap reservation policy. Pseudoswap is enabled by default in all current versions of HP-UX, and the kernel parameter that controlled it (swapmem_on) was removed in 11i v3.

I have 2 GB of swap and 8 GB of available memory. Can I start a 4 GB process on an idle server?
With Pseudoswap (swapmem_on=1): Yes!
  2 GB device swap + 6 GB pseudoswap (75% of 8 GB) = 8 GB reservable swap
Without Pseudoswap (swapmem_on=0): No!
  2 GB device swap + 0 GB pseudoswap = 2 GB reservable swap

Example of an Application's Swap Requirements
• Please see SAP note 1112627 for a detailed explanation of swap sizing and pseudoswap.
• In general, device swap configurations of 1.5 or 2 x RAM have proven appropriate for the majority of SAP installations. The recommendation is to set device swap to 2 x RAM (minimum 20 GB).
• Please refer to SAP note 153641 for a detailed explanation of swap requirements on a per SAP instance basis.

Basics of Crash Dumps

Bad Example of Swap Design
# /usr/sbin/swapinfo -tm
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE  PRI  NAME
dev       30464       0   30464    0%       0       -    1  /dev/vg00/lvol2
dev       30464       0   30464    0%       0       -    1  /dev/vg00/swap1
dev       30464       0   30464    0%       0       -    1  /dev/vg00/swap2
dev       30464       0   30464    0%       0       -    1  /dev/vg00/swap3
reserve       -   46202  -46202
memory    98292    2278   96014    2%
total    220148   48480  171668   22%       -       0    -

HP-UX Maximum Swap
• Swap space in the kernel is managed using 'chunks' of physical device space. These chunks contain one or more (usually more) pages of memory, but provide another layer of indexing (similar to inodes in file systems) to keep the global swap table relatively small, as opposed to a large table indexed by swap page.
• swchunk controls the size of each chunk, in physical disk blocks (which are defined as 1 KB).

Maximum Swap on HP-UX Before 11i v3
• The total bytes of swap space manageable by the system on older HP-UX 11i releases is:
  swchunk x 1 KB x 16384
  where 16384 is the system maximum number of swap chunks in the swap table, as defined by the kernel parameter maxswapchunks. swchunk has allowed values between 2048 and 65536 blocks. For example, with swchunk = 2048 the limit is 2048 x 1 KB x 16384 = 32 GB.

Maximum Swap on HP-UX 11i v3
• The total bytes of swap space manageable by the system on HP-UX 11i v3 is:
  swchunk x 1 KB x 2147483648

Dump Terms
• Dump unit: A thread of execution during dump. A dump unit requires its own set of CPUs, dump devices, and other resources, which are non-overlapping with other dump units.
• Reentrancy: Capability of a dump driver to issue multiple I/Os simultaneously, one I/O per HBA port, during dump.
• Concurrency: Capability of a dump driver to issue multiple I/Os simultaneously per HBA port, during dump. In HP-UX 11i v3 this capability means that the driver can issue I/Os simultaneously to multiple devices under a given HBA port, one I/O per device.
• Parallel Dump: HP-UX 11i v3 dump infrastructure which enables the parallelism features.
• Reentrant HBA port or device: An HBA port or device controlled by a reentrant driver.
• Concurrent HBA port or device: An HBA port or device controlled by a concurrent driver.

Dump Unit - Part 1
• A Dump Unit is an independent sequential unit of execution within the dump process.
• Each dump unit is assigned an exclusive subset of the system resources needed to perform the dump, including CPUs, a portion of the physical memory to be dumped, and a subset of the configured dump devices. The dump infrastructure in HP-UX 11i v3 automatically partitions system resources at dump time into dump units.
• Each dump unit operates sequentially.
• Parallelism is achieved by multiple dump units executing in parallel.

Dump Unit - Part 2
• A dump device cannot be shared across multiple dump units.
• Multiple “reentrant devices” can be accessed in parallel only if the devices are configured through separate HBA ports. Thus all “reentrant devices” on the same HBA port will be assigned to a single dump unit.
• Each “concurrent device” can be accessed in parallel. Each can therefore be assigned to a separate dump unit, even if configured through a single HBA port.
• Multiple dump volumes on a single physical volume will not allow for parallelism. Parallelism at dump time can only be achieved across multiple physical devices (LUNs).
• Logical volumes configured as dump devices: all logical volumes which reside on the same physical device (LUN) are assigned to the same dump unit.

Dump Options Overview
− Selective
  • Based on classes/uses of memory
− Compressed
  • >= 5 CPUs per dump unit
  • Mixed compressed/non-compressed images
− Parallel (concurrent)
  • Faster dump with multiple “monarchs”
  • Influenced by memory availability and dump devices
  • HP Integrity Servers only
− Live dump
  • Crash-dump a live system without forced shutdown or panic
  • System stays up, running and stable
  • Offline analysis of system
  • Memory image -> file
    − Extra load during this save
  • HP Integrity Servers only

Dump Parallelism
I/O support during dump is provided via dump drivers, and each configured dump driver reports its parallelism capabilities to the dump infrastructure:
  Legacy:      new parallelism feature is not supported
  Reentrant:   supports parallelism per HBA port
  Concurrent:  supports parallelism per dump device
These requirements can be distilled into the following formulas for calculating the number of dump units that can be achieved:
• CPU Parallelism = (number of CPUs available at dump time) / (1 or 5, depending on whether or not compression is enabled)
• Device Parallelism = (number of reentrant dump HBA ports) + (number of concurrent dump devices) + (1 if there are any legacy dump devices)
• Number of Dump Units = Minimum (CPU Parallelism, Device Parallelism)

Dump Driver Parallelism Capability
Examples of HP-provided dump drivers on HP-UX 11.31:
  fcd                               Concurrent
  td, mpt, c8xx, ciss, sasd, fclp   Reentrant

# crashconf -l
DEVICE       LOGICAL VOL.  NAME             LUNPATH HANDLE
------------ ------------  ---------------  -----------------------
1:0x000002   64:0x000002   /dev/vg00/lvol2  40/0/2/0/0/0/0/4/0/0/0.0x247000c0ffdb3fb9.0x4001000000000000

# ioscan -fNk | grep "40/0/2/0/0/0/0/4/0/0/0 "
fc  0  40/0/2/0/0/0/0/4/0/0/0  fclp  CLAIMED  INTERFACE  HP AD22260001 PCIe Fibre Channel 2-port 4Gb FC/2-port 1000B-T Combo Adapter

Dump Driver Capability
# scsimgr get_attr -a capability -H 40/0/2/0/0/0/0/4/0/0/0
        SCSI ATTRIBUTES FOR CONTROLLER : 40/0/2/0/0/0/0/4/0/0/0
name = capability
current = "Boot Dump"
default =
saved =

Uncompressed vs. Compressed Dump - One Dump Device
[diagram]
Uncompressed vs. Compressed Dump - Three Dump Devices
[diagram]
Uncompressed vs. Compressed Dump - Legacy Devices
[diagram]
Uncompressed Dump - Reentrant Devices
[diagram]
Uncompressed vs. Compressed Dump - Complex Example
[diagram]

Compressed Dump Configuration
# crashconf -v
Crash dump configuration has been changed since boot.

CLASS     PAGES      INCLUDED IN DUMP  DESCRIPTION
--------  ---------  ----------------  -------------------------
UNUSED      1514754  no,  by default   unused pages
USERPG       112614  no,  by default   user process pages
BCACHE        26235  no,  by default   buffer cache pages
KCODE         10389  yes, forced       kernel code pages
USTACK         1136  yes, by default   user process stacks
FSDATA           40  no,  forced       file system metadata
KDDATA       386358  yes, by default   kernel dynamic data
KSDATA         6933  yes, by default   kernel static data
SUPERPG       21546  no,  by default   unused kernel super pages

Total pages on system:          2080005
Total pages included in dump:    404816

Dump compressed:  ON
Dump Parallel:    ON

DEVICE       OFFSET(kB)   SIZE (kB)    LOGICAL VOL.  NAME
------------ ------------ ------------ ------------  ----------------
3:0x000000        2349920      8388608 64:0x000002   /dev/vg00/lvol2
3:0x000000       30677856       114688 64:0x000009   /dev/vg00/v3Dump
                          ------------
                               8503296

Dump device configuration mode is config_deprecated_mode.
Use crashconf -s option to change the mode.

# crashconf -c off              to turn compression off until reboot
# crashconf -c on               to turn compression on until reboot

# kctune dump_compress_on
Tunable            Value  Expression  Changes
dump_compress_on       1  Default     Immed

# crashconf -tc off             to change tunable to 0
# kctune dump_compress_on=0
# crashconf -tc on              to set tunable to 1
# kctune dump_compress_on=1

Compressed Dump Algorithm
• HP-UX uses one processor to do all disk writes and four processors for compression.
• The algorithm for compression is Lempel-Ziv-Welch (LZW).
• LZW is a universal lossless data compression algorithm; it is simple to implement and has the potential for very high throughput in hardware implementations.
• Two of the reasons for selecting LZW: HP has a license to use it, and it achieves a fairly good compression ratio.

Concurrent Dump Configuration
# crashconf -v
crashconf -v reports the same page classes, totals, and dump devices as in the Compressed Dump Configuration listing; the line of interest here is:
Dump Parallel:    ON

# crashconf -p off              to turn concurrent dump off until reboot
# crashconf -p on               to turn concurrent dump on until reboot

# kctune dump_concurrent_on
Tunable              Value  Expression  Changes
dump_concurrent_on       1  1           Immed

# crashconf -tp off             to change tunable to 0
# kctune dump_concurrent_on=0
# crashconf -tp on              to set tunable to 1
# kctune dump_concurrent_on=1

HP-UX Kernel Parameters alwaysdump and dontdump
On rare occasions, the system may panic before crashconf(1M) is run during the boot process. On those occasions, the configuration can be set using the alwaysdump and dontdump tunables.

# kctune -v -q alwaysdump
Tunable             alwaysdump
Description         Bitmap of memory page classes to include in a crash dump
Module              dump
Current Value       0 [Default]
Value at Next Boot  1024
Value at Last Boot  0
Default Value       0
Can Change          Immediately or at Next Boot

HP-UX Typical Crash Dump Configuration
# crashconf -v
Crash dump configuration has been changed since boot.

CLASS     PAGES      INCLUDED IN DUMP  DESCRIPTION
--------  ---------  ----------------  -------------------------
UNUSED      9571877  no,  by default   unused pages
USERPG      1340875  no,  by default   user process pages
BCACHE         2309  no,  by default   buffer cache pages
KCODE          9142  no,  by default   kernel code pages
USTACK         1567  yes, by default   user process stacks
FSDATA           12  yes, by default   file system metadata
KDDATA      1493845  yes, by default   kernel dynamic data
KSDATA         8816  yes, by default   kernel static data
SUPERPG      128257  no,  by default   unused kernel super pages

Total pages on system:         12556700
Total pages included in dump:   1504240

Dump compressed:  ON
Dump Parallel:    ON

DEVICE       OFFSET(kB)   SIZE (kB)    LOGICAL VOL.  NAME
------------ ------------ ------------ ------------  ----------------
1:0x000005        2349920      4194304 64:0x000002   /dev/vg00/lvol2
                          ------------
                               4194304

Dump device configuration mode is config_deprecated_mode.
Use crashconf -s option to change the mode.

HP-UX Savecrash Locking
Dump devices are often used as paging devices (primary swap is one such example). If savecrash determines that a dump device is already enabled for paging, and that paging activity has already taken place on that device, a warning message will indicate that the dump may be invalid.
If a dump device has not already been enabled for paging, savecrash prevents paging from being enabled on the device by creating the file /var/adm/crash/.savecrash.LCK. swapon does not enable the device for paging if the device is locked in /var/adm/crash/.savecrash.LCK.
As savecrash finishes saving the image from each dump device, it updates the /var/adm/crash/.savecrash.LCK file and optionally executes swapon to enable paging on the device.

HP-UX Dump Device in Non-Root VGs
• As of HP-UX 11.00, additional dump devices can be configured online (without the need for a reboot). These dump LVs must not be configured using lvlnboot -d but with crashconf(1M).
• We are no longer restricted to choosing a dump LV from the root VG only. The configuration of such dump devices is similar to the configuration of secondary swap devices.

Example of Classical Swap/Dump Design on HP-UX
[Diagram: Primary PV and Alternate PV, each holding /stand, primary swap/dump, and other LVs]
Potential Issues
• If RAM is short, boot disks experience severe I/O performance problems due to swap usage.
• If more RAM is added, it is not easy to resize primary swap (contiguous blocks).
• Long reboot due to savecrash(1M) export to /var/adm/crash.
• More swap is added in other VGs, often different in size from the primary.
• A large amount of disk space is wasted on swap.
[Diagram, continued: the Alternate PV holds the mirrors of /stand, primary swap/dump, and other LVs]
• RAID-1 for boot disk
• 32 GB RAM
• Swap = 1 or 2 x RAM
• Swap/dump shared

Example of Different Swap/Dump Design on HP-UX with Internal Boot Disks
  Primary PV:             /stand, primary swap, other LVs
  Alternate PV:           /stand, primary swap mirror, other LVs
  SAN-based LUNs or LVs:  secondary swap, secondary swap, dump area, dump area
• 32 GB RAM
• Primary swap = 4-8 GB
• Total swap = 1 x RAM
• RAID-1 for boot disks
• Swap NOT shared with dump
• Dump areas set up on different LUNs or PVs in non-root VGs (dump PVs are NEVER RAID-1)

Example of Different Swap/Dump Design on HP-UX with SAN Boot Disk
  Boot PV:                /stand, primary swap, other LVs
  SAN-based LUNs or LVs:  secondary swap, secondary swap, dump area, dump area
• 32 GB RAM
• Primary swap = 4-8 GB
• Total swap = 1 x RAM
• Swap NOT shared with dump
• Dump areas set up on different LUNs or PVs in non-root VGs (dump PVs are NEVER RAID-1)

HP-UX Persistent Dump Devices - Part 1
• Persistent dump devices are those that are configured automatically after a reboot. Persistent dump device information is maintained in the kernel registry services (KRS, see krs(5)).
• To mark the dump devices as persistent, there are two configuration modes available.

config_crashconf_mode
In this mode crashconf(1M) and crashconf(2) are the only mechanisms available to mark dump devices as persistent. Logical volumes marked for dump using lvlnboot(1M) or vxvmboot(1M) and devices marked in /stand/system for dump will be ignored during boot-up. This is the preferred method for dump device configuration and will be used from this HP-UX release onwards. This mode can be enabled using the crashconf -s option.
VxVM stores extent information of persistent dump logical volumes in lif(4). Up to ten VxVM logical volumes can be marked persistent. The logical volumes which are not part of the root volume group cannot be configured as persistent dump devices.
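Which of the two modes is in effect can be read from the last lines of the crashconf -v listings shown in this document. The following is a minimal sketch, not an HP-supplied tool: the function name dump_config_mode is hypothetical, and it assumes the mode is always reported in the sentence "Dump device configuration mode is <mode>." exactly as in those listings.

```shell
#!/bin/sh
# Hypothetical helper: print the dump device configuration mode found in
# `crashconf -v` output read from standard input (nothing if absent).
# Assumes the message format shown in this document's listings.
dump_config_mode() {
    sed -n 's/^.*Dump device configuration mode is \([A-Za-z_]*\)\..*$/\1/p'
}

# Sample text copied from the crashconf -v listings in this document.
sample='Dump device configuration mode is config_deprecated_mode.
Use crashconf -s option to change the mode.'

mode=$(printf '%s\n' "$sample" | dump_config_mode)

case "$mode" in
    config_crashconf_mode)  echo "persistent devices are marked with crashconf" ;;
    config_deprecated_mode) echo "devices come from lvlnboot/vxvmboot or /stand/system" ;;
    *)                      echo "mode line not found" ;;
esac
```

On a live system the same function would be fed from the real command, for example `crashconf -v | dump_config_mode`.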
HP-UX Persistent Dump Devices - Part 2
config_deprecated_mode
The logical volumes marked for dump using lvlnboot(1M) or vxvmboot(1M) and devices marked in /stand/system for dump will be configured as dump devices during boot-up. Devices marked as persistent, using crashconf -s, will be ignored during boot-up. Marking devices using lvlnboot(1M), vxvmboot(1M), and /stand/system will be obsolete in the next HP-UX release.
This mode is deprecated on HP-UX 11.31 and will be obsolete in the next HP-UX release. This is the default mode for dump and can be enabled using the crashconf -o option.

HP-UX Dump Devices and Bad Block Relocation
• From the HP-UX 11.23 release onwards, the LVM bad block relocation feature is obsolete. However, for compatibility reasons the value is maintained as a logical volume attribute.
• If bad block relocation is not disabled when the dump device is created, HP-UX complains about “unsupported disk layout”.
• Hence, the correct procedure to create a dump device in LVM is:
# lvcreate -C y -r n -L 16000 -n dump2 /dev/vgdump

HP-UX Crashconf Fails with Unsupported Disk Layout Error - VxVM
The volume dumpvol was added to the /etc/fstab file and crashconf was issued to increase the total dump area, but crashconf failed with the message below:
/dev/vx/dsk/rootdg/dumpvol: error: unsupported disk layout
The crashconf error is due to the dump area not being contiguous:
# vxprint -g rootdg -ht
v  dumpvol        -           ENABLED     ACTIVE   204800  SELECT  -       swap
pl dumpvol-01     dumpvol     ENABLED     ACTIVE   204800  CONCAT  -       RW
sd rootdisk01-07  dumpvol-01  rootdisk01  1081344  102400  0       c1t4d0  ENA
sd rootdisk01-17  dumpvol-01  rootdisk01  5702418  102400  102400  c1t4d0  ENA
The dumpvol volume has two areas on c1t4d0. The first is rootdisk01-07, which starts at 1081344 and is 102400 KB in size, and the second is rootdisk01-17, which starts at 5702418 and is also 102400 KB in size. The volume dumpvol needs to be contiguous, so the last 102400 KB should be removed from dumpvol.
To reduce dumpvol:
# vxassist shrinkby dumpvol 102400

HP-UX Crashconf Fails with Unsupported Disk Layout Error - LVM
/dev/vg01/lvswap: error: unsupported disk layout
# lvdisplay /dev/vg01/lvswap
....
Bad block                   on
Allocation                  strict
A dump device is required to be contiguous and to have bad block relocation turned off:
# lvchange -C y -r n /dev/vg01/lvswap

HP-UX VxVM Dump Device Creation - Part 1
With Volume Manager 5.0 on HP-UX 11.31, the disk must be initialized with the vxdisksetup -ifB <disk> command; vxdiskadm is unable to initialize the disk correctly for use with crashconf. Please note that CDS disk groups are not affected. Those can still be initialized via vxdiskadm.
# vxdisk list
DEVICE       TYPE         DISK        GROUP   STATUS
c2t0d0s2     auto:none    -           -       online invalid
c2t1d0s2     auto:hpdisk  rootdisk01  rootdg  online
# vxdisk -f init c2t0d0s2 format=hpdisk
# vxdg init dumpdg c2t0d0s2 cds=off
# vxassist -g dumpdg -U swap make dumpvol 3g

HP-UX VxVM Dump Device Creation - Part 2
# crashconf -s /dev/vx/dsk/dumpdg/dumpvol
# crashconf -v
Crash dump configuration has been changed since boot.

CLASS     PAGES      INCLUDED IN DUMP  DESCRIPTION
--------  ---------  ----------------  -------------------------
UNUSED        10197  no,  by default   unused pages
USERPG       115131  no,  by default   user process pages
BCACHE        14359  no,  by default   buffer cache pages
KCODE         10819  no,  by default   kernel code pages
USTACK          890  yes, by default   user process stacks
FSDATA           26  yes, by default   file system metadata
KDDATA       100591  yes, by default   kernel dynamic data
KSDATA         7238  yes, by default   kernel static data
SUPERPG        1100  no,  by default   unused kernel super pages

Total pages on system:           260351
Total pages included in dump:    108745

Dump compressed:  ON
Dump Parallel:    ON

DEVICE       OFFSET(kB)   SIZE (kB)    LOGICAL VOL.  NAME
------------ ------------ ------------ ------------  --------------------------
3:0x000001        2350176      2097152 4:0x000001    /dev/vx/dsk/rootdg/swapvol
3:0x000000         544896      3145728 4:0x414ad8    /dev/vx/dsk/dumpdg/dumpvol
                          ------------
                               5242880

HP-UX Better Swap and Dump Design - Part 1
• Set up primary swap between 4 and 8 GB ONLY, no matter how large the RAM is!
• The primary swap device should NOT be shared with dump.
• Initially, set up primary swap only. In pre-production testing, verify whether that is enough and avoid creating other swap areas unless absolutely necessary.
• Secondary swaps (if you need to have them!) are created as 4-8 GB LUNs (could be LVs in LVM or plexes in VxVM) on SAN (if practicable). Ensure that secondary swaps match the size of primary swap. That way, if the server ever needs to use swap, the performance of the swap devices will be excellent and boot disk I/O will never “suffer”.
• If primary swap is left at 4-8 GB, then allocate separate dump areas in other volume groups to match the size of physical memory if compression is disabled or not possible (due to lack of available CPUs), or less if compression is enabled and possible.

HP-UX Better Swap and Dump Design - Part 2
• Disable savecrash(1M) at boot (/etc/rc.config.d/savecrash):
  SAVECRASH=0
  If you do so, do not forget to run savecrash(1M) after the reboot.
• A dedicated dump device will not shorten the time required to write from memory to the dump volume during the crash, but it will shorten the reboot time. This is because the crash image is not at risk of being overwritten by paging or swap activity, and savecrash(1M) can run in the background to save the crash files into the crash dump directory.
• If the dump device is also configured as one of the swap devices, the device cannot be enabled for paging until savecrash(1M) has finished saving the image from the device to the crash dump directory. Therefore, the boot time will be longer if savecrash is run in foreground.
  This extra time will be even greater if vPars are configured because multiple dump images may have to be saved.

HP-UX Better Swap and Dump Design - Part 3
• When dump and swap areas are separated, there is no need to save the crash images at boot time. Therefore, savecrash(1M) at (re)boot can be disabled!
• The reduction in reboot time achieved by configuring a separate dump device (close to 50% over the classical design with savecrash running in foreground) is likely to provide a worthwhile return on investment when system availability is a priority.
• Using identical sizes and types of dump devices and HBAs in the dump configuration is one way to avoid inequalities in dump speeds or times across the dump units. This tends to produce more predictable results and better overall parallelism.

HP-UX Better Swap and Dump Design - Part 4
• It is recommended that shared swap and dump devices or volumes not be used with parallel dump. Using a shared swap/dump device can significantly increase the subsequent reboot time because such devices result in swap being disabled while the corresponding dump data is being saved (for example, in /var/adm/crash).
• Avoid file system swap altogether if possible.
• Set the priorities of SAN-based secondary swaps to a lower value than the primary swap (lower values are used first), and use the identical value across all secondary swaps. That way, if there is a serious shortage of RAM, swap will perform as a “perfectly striped” volume.

HP-UX Better Swap and Dump Design - Part 5
• If compressed dumps are required, ensure that there are five CPUs for each dump unit.
• Set up multiple dump units on SAN (non-root volume groups), and enable parallel dumps. Note that, currently, logical volumes which are not part of the root volume group in LVM cannot be configured as persistent dump devices. However, a non-root disk group with VxVM can be used for persistent dump devices.
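The five-CPUs-per-dump-unit rule interacts with the number of dump devices through the formulas on the Dump Parallelism slide. The sketch below simply evaluates those formulas for planning purposes. The function name dump_units and its argument order are hypothetical, and rounding the CPU division down to a whole number is an assumption (the slide does not say how fractions are handled).

```shell
#!/bin/sh
# Hypothetical planning helper: estimate the number of dump units using the
# formulas from the Dump Parallelism slide.
#   $1  CPUs available at dump time
#   $2  compression: 1 = enabled (5 CPUs per unit), 0 = disabled (1 CPU per unit)
#   $3  number of reentrant dump HBA ports
#   $4  number of concurrent dump devices
#   $5  number of legacy dump devices
dump_units() {
    cpus=$1; compressed=$2; reentrant_ports=$3; concurrent_devs=$4; legacy_devs=$5

    if [ "$compressed" -eq 1 ]; then per_unit=5; else per_unit=1; fi
    cpu_par=$((cpus / per_unit))        # assumption: fractions are rounded down

    dev_par=$((reentrant_ports + concurrent_devs))
    if [ "$legacy_devs" -gt 0 ]; then dev_par=$((dev_par + 1)); fi

    # Number of dump units = minimum of the two parallelism figures.
    if [ "$cpu_par" -lt "$dev_par" ]; then echo "$cpu_par"; else echo "$dev_par"; fi
}

dump_units 8 1 0 3 0    # 8 CPUs, compression on,  three concurrent devices
dump_units 8 0 0 3 0    # 8 CPUs, compression off, three concurrent devices
```

With these made-up inputs the first call prints 1 and the second prints 3: on an 8-CPU system, enabling compression leaves only enough CPUs for one dump unit, so the extra dump devices add no parallelism.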
HP-UX Better Swap and Dump Design - Part 6
• For a kernel dump, the usual requirement is:
  − Kernel text/static data
  − Kernel dynamic data in use
  − User-space kernel thread stacks (UAREA)
  Kernel dynamic memory which is free-and-cached (Super Page Pool) is only needed when there is a problem in the SPP itself (pretty rare).
  User data is very rarely needed. In addition, most users do not want HP support reading their application's private data, for security reasons (classified data, customer-sensitive data, and so on).
  The default configuration for crashconf is good enough for most situations.
• If enough disk space is available and no other constraints are imposed, you might enable all page classes in dumps (check crashconf(1M)).

Guidelines for Selecting Device Swap
• Two swap areas on different disks are better than one single swap area
• Only configure one swap area per disk
• Device swap areas should be of similar size
• Consider the speed of the disks
[Diagram: swap LVs sharing one disk (No!) versus swap LVs on separate disks (Yes!)]

HP-UX Post-crash Manual Dump Export
• If the dump was not saved completely due to lack of space in the crash directory, you can save the dump again. The -r option (resave) needs to be included when this is not the first time that savecrash runs.
# savecrash -v [-r] <crash directory>
• It is also possible to save the dump directly to a tape:
# savecrash -v [-r] -t <tapedevice>

HP-UX Manual Dump Export from a Specific Dump Device
To manually extract the dump, specify either the persistent DSF or the legacy DSF of the whole disk along with the offset:
DEVICE       OFFSET(kB)   SIZE (kB)    LOGICAL VOL.  NAME
------------ ------------ ------------ ------------  -----------------
3:0x000000        2612064      8364032 64:0x000002   /dev/vg00/lvol2
3:0x000001       18168692        40956 64:0x02000b   /dev/vg01/dump_3
3:0x000001       18127732        40956 64:0x02000a   /dev/vg01/dump_2
3:0x000001       18086772        40956 64:0x020009   /dev/vg01/dump_1
# savecrash -D /dev/rdisk/disk4 -O 18086772 -r -v .
or
# savecrash -D /dev/rdsk/c2t1d0 -O 18086772 -r -v .

Swapoff
• Available with HP-UX 11.31.
• The swapoff(1M) command disables swapping on the specified swap device(s) for the current boot. The term swap refers to an obsolete implementation of virtual memory; HP-UX actually implements virtual memory by way of paging rather than swapping. This command and others retain names derived from swap for historical reasons.
• Does not remove the swap device from /etc/fstab.
• Will not succeed if the swap space is still needed, for example to cover the reserve space reported by swapinfo(1M).
• Example:
# /usr/sbin/swapoff /dev/vg00/lvol2

Swapoff - Real Life Example - Part 1
• Remove primary swap and move it into another volume group. To remove the primary swap, we need to ensure that the new swap device has at least as much space as “reserve” requires. Otherwise, the swapoff(1M) command will fail!
# lvcreate -C y -r n -L 8192 -n lvswap2 /dev/vgswap
# swapon -f /dev/vgswap/lvswap2
# swapinfo -tm
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE  PRI  NAME
dev        8192       0    8192    0%       0       -    1  /dev/vg00/lvol2
dev        8192       0    8192    0%       0       -    1  /dev/vgswap/lvswap2
reserve       -    1301   -1301
memory     3876     963    2913   25%
total     20260    2264   17996   11%       -       0    -

Swapoff - Real Life Example - Part 2
• Remove primary swap on-line:
# swapoff /dev/vg00/lvol2
# lvrmboot -s vg00
# swapinfo -tm
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE  PRI  NAME
dev        8192       0    8192    0%       0       -    1  /dev/vgswap/lvswap2
reserve       -    1291   -1291
memory     3876     963    2913   25%
total     12068    2254    9814   19%       -       0    -

Swapoff - Real Life Example - Part 3
• Add a line to /etc/fstab for the new primary swap and reboot the server:
/dev/vg00/lvol3 / vxfs delaylog 0 1
/dev/vg00/lvol1 /stand vxfs tranflush 0 1
/dev/vg00/lvol4 /home vxfs delaylog 0 2
/dev/vg00/lvol5 /tmp vxfs delaylog 0 2
/dev/vg00/lvol6 /usr vxfs delaylog 0 2
/dev/vg00/lvol7 /var vxfs delaylog 0 2
/dev/vg00/lvol8 /var/tmp vxfs delaylog 0 2
#/dev/vg00/lvdump3 / dump defaults 0 0
/dev/vgswap/lvswap2 / swap defaults 0 0

Swapoff - Real Life Example - Part 4
• After the reboot, check swap status and confirm that the non-root volume is now the primary swap:
# swapinfo -tm
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE  PRI  NAME
dev        8192       0    8192    0%       0       -    1  /dev/vgswap/lvswap2
reserve       -    1283   -1283
memory     3876     950    2926   25%
total     12068    2233    9835   19%       -       0    -

Swapoff - Real Life Example - Part 5
• However, because we did not initialize the disk in vgswap with the “-B” option, it does not contain the Boot Area, and the volume cannot be added with “lvlnboot -s /dev/vgswap/lvswap2”. As a result, this is reported:
# lvlnboot -v
Boot Definitions for Volume Group /dev/vg00:
Physical Volumes belonging in Root Volume Group:
        /dev/disk/disk6_p2 -- Boot Disk
Boot: lvol1     on:     /dev/disk/disk6_p2
Root: lvol3     on:     /dev/disk/disk6_p2
No Swap Logical Volume configured
No Dump Logical Volume configured

Swapoff - Real Life Example - Part 6
• We still have one persistent dump device, which is NOT listed in /etc/fstab:
# crashconf -v
Crash dump configuration has been changed since boot.

CLASS     PAGES      INCLUDED IN DUMP  DESCRIPTION
--------  ---------  ----------------  -------------------------
UNUSED       584997  no,  by default   unused pages
USERPG       171077  no,  by default   user process pages
BCACHE         7529  no,  by default   buffer cache pages
KCODE         11892  no,  by default   kernel code pages
USTACK         1128  yes, by default   user process stacks
FSDATA           16  yes, by default   file system metadata
KDDATA       238003  yes, by default   kernel dynamic data
KSDATA        10563  yes, by default   kernel static data
SUPERPG       18286  no,  by default   unused kernel super pages

Total pages on system:          1043491
Total pages included in dump:    249710

Dump compressed:  ON
Dump Parallel:    ON

DEVICE       OFFSET(kB)   SIZE (kB)    LOGICAL VOL.  NAME
------------ ------------ ------------ ------------  -----------------
1:0x000002       57023328      4096000 64:0x000009   /dev/vg00/lvdump3
                          ------------
                               4096000

Persistent dump device list:
        /dev/vg00/lvdump3

Crash Dump - Two Dump Unit Example - Part 1
# crashconf -v
Crash dump configuration has been changed since boot.

CLASS     PAGES      INCLUDED IN DUMP  DESCRIPTION
--------  ---------  ----------------  -------------------------
UNUSED      4264207  no,  by default   unused pages
USERPG       185052  no,  by default   user process pages
BCACHE        45250  no,  by default   buffer cache pages
KCODE         11859  no,  by default   kernel code pages
USTACK         1271  yes, by default   user process stacks
FSDATA           16  yes, by default   file system metadata
KDDATA       581797  yes, by default   kernel dynamic data
KSDATA        10569  yes, by default   kernel static data
SUPERPG      107834  no,  by default   unused kernel super pages

Total pages on system:          5207855
Total pages included in dump:    593653

Dump compressed:  OFF
Dump Parallel:    ON

DEVICE       OFFSET(kB)   SIZE (kB)    LOGICAL VOL.  NAME
------------ ------------ ------------ ------------  -----------------
1:0x000004        2612064      8388608 64:0x000002   /dev/vgroot/lvol2
1:0x000003           2496      1048576 64:0x010001   /dev/vgdump/dump2
1:0x000003       16386496      1048576 64:0x010002   /dev/vgdump/dump3
                          ------------
                              10485760

Persistent dump device list:
        /dev/vgroot/lvol2

Crash Dump - Two Dump Unit Example - Part 2
*** A system crash has occurred. (See the above messages for details.)
*** The system is now preparing to dump physical memory to disk, for use
*** in debugging the crash.
*** The dump will be a SELECTIVE dump with compression OFF and concurrency ON: 2320 of 20344 megabytes.
Primary Dump Header Location :
Device details: Major number: 0x1f Minor number: 0xb0000 Offset: 16386496.
    *** Dumping: 100% complete (2320 of 2320 MB) time: 84 seconds, Number of Dump units: 2

Crash Dump Without Primary Swap, No Persistent Devices, and No Dump Devices in /etc/fstab

Console logs at boot time after a crash:

    No crash dump devices defined.
    Persistent dump device list is empty.

All subsequent crashes will fail to collect data into dump volumes:

    Swap device table: (start & size given in 512-byte blocks)
    entry 0 - auto-configured on root device; ignored - no room
    WARNING: No swap device configured, so dump cannot be defaulted to primary swap.
    WARNING: No dump devices are configured. Dump is disabled.

Message buffer contents after system crash:

    These messages are the contents of msgbuf, which should have been saved
    in the dump. They are output to the console, as the dump was not taken.

How to Set the Dump Order for Saving System Crash – Part 1

• The current dump configuration saves the crash first to dump2, then to dump1, and finally to lvol2:

    # crashconf
    Crash dump configuration is changed after boot:

    CLASS        PAGES  INCLUDED IN DUMP  DESCRIPTION
    --------  --------  ----------------  -------------------------------------
    UNUSED      570458  no, by default    unused pages
    USERPG      136677  no, by default    user process pages
    BCACHE       10426  no, by default    buffer cache pages
    KCODE         7764  yes, forced       kernel code pages
    USTACK        1172  yes, by default   user process stacks
    FSDATA           8  yes, by default   file system metadata
    KDDATA      192353  yes, by default   kernel dynamic data
    KSDATA        3641  yes, by default   kernel static data
    SUPERPG     120995  no, by default    unused kernel super pages

    Total pages on system:           4173976
    Total pages included in dump:    1253843

    Dump compressed: ON

    DEVICE        OFFSET(kB)   SIZE (kB)  LOGICAL VOL.  NAME
    ------------  ----------  ----------  ------------  -------------------------
    31:0x021000       924532     4194300  64:0x000002   /dev/vg00/lvol2
    31:0x021000     27843444     2097150  64:0x00000a   /dev/vg00/dump1
    31:0x021000     27859828     1048575  64:0x00000b   /dev/vg00/dump2
                              ----------
                                 7340025

How to Set the Dump Order for Saving System Crash – Part 2

SOLUTION:

• /etc/fstab does not list vg00/lvol2, because it is the default dump volume:

    /dev/vg00/dump1 ... dump defaults 0 0
    /dev/vg00/dump2 ... dump defaults 0 0

• Edit /etc/fstab to set the new order of the dump LVs. Dump LVs are used in the opposite order to their placement in the file, so vg00/lvol2 needs to be listed last to be used as the first dump lvol.

    New listing of dump areas in /etc/fstab
    -------------------------------------------------------------
    /dev/vg00/dump2 ... dump defaults 0 0    # last dump area used
    /dev/vg00/dump1 ... dump defaults 0 0    # second dump area used
    /dev/vg00/lvol2 ... dump defaults 0 0    # first dump area used

• Edit /etc/rc.config.d/crashconf:

    CRASHCONF_ENABLED=1
    CRASHCONF_READ_FSTAB=1
    CRASHCONF_REPLACE=1

How to Set the Dump Order for Saving System Crash – Part 3

• Put the new dump configuration in place (when a crash is saved, the first dump area is lvol2, followed by dump1, then by dump2):

    # /sbin/rc1.d/S080crashconf start

• Check the new configuration:

    # crashconf

    CLASS        PAGES  INCLUDED IN DUMP  DESCRIPTION
    --------  --------  ----------------  -------------------------------------
    UNUSED      169224  no, by default    unused pages
    USERPG      500811  no, by default    user process pages
    BCACHE       10412  no, by default    buffer cache pages
    KCODE         7764  yes, forced       kernel code pages
    USTACK        1218  yes, by default   user process stacks
    FSDATA          20  yes, by default   file system metadata
    KDDATA      241200  yes, by default   kernel dynamic data
    KSDATA        3641  yes, by default   kernel static data
    SUPERPG     109204  no, by default    unused kernel super pages

    Total pages on system:           4173976
    Total pages included in dump:    1253843

    Dump compressed: ON

    DEVICE        OFFSET(kB)   SIZE (kB)  LOGICAL VOL.  NAME
    ------------  ----------  ----------  ------------  -------------------------
    31:0x021000     27859828     1048575  64:0x00000b   /dev/vg00/dump2
    31:0x021000     27843444     2097150  64:0x00000a   /dev/vg00/dump1
    31:0x021000       924532     4194300  64:0x000002   /dev/vg00/lvol2
                              ----------

Example of Distributed Swap Design

    # /usr/sbin/swapinfo -t
                 Kb        Kb         Kb   PCT  START/      Kb
    TYPE      AVAIL      USED       FREE  USED   LIMIT RESERVE  PRI  NAME
    dev     4194304         0    4194304    0%       0       -    1  /dev/vgroot/lvol2   (4096MB)
    dev     4194304         0    4194304    0%       0       -    0  /dev/vgswap1/swap1  (4096MB)
    dev     4194304         0    4194304    0%       0       -    0  /dev/vgswap2/swap2  (4096MB)
    reserve       -  13417244  -13417244
    memory 25135192   5363876   19771316   21%

I am also very passionate about naming volume groups and logical volumes in a meaningful manner. *

Paginglist Command

    # /usr/sam/lbin/paginglist
    /dev/vg00/lvol2|dev|4194304|4.0 GB|0|0.0 KB|4194304|4.0 GB|0%|0|-|1|no|now|
    reserve|reserve|0|0.0 KB|2019848|1.9 GB|-2019848|-2019848.`KB||0||0|no|now|
    total|total|4194304|4.0 GB|2019848|1.9 GB|2174456|2.1 GB|48%|0|0|0|no|now|

Patch Servers Regularly

Some of the HP-UX 11.31 dump patches:

PHKL_41977: HANG OTHER
    crashconf(1M) hangs when trying to configure more than 32 dump devices. This patch also allows a logical volume to be configured as primary swap and provides support for FCD and FCLP NPIV (N_Port ID Virtualization) enablement.
PHKL_41257: HANG
    During MCA handling, the system hangs in the process of generating a crash dump.
PHKL_39740: OTHER
    System fails to dump memory into dump devices.
PHKL_38628: PANIC
PHKL_38414: ABORT
    If the kernel base page size is configured greater than 4k, the dump may be aborted prematurely, which affects debugging of the crash.

Add Timestamps to RC scripts – Part 1

• If there are RC startup problems, /etc/rc.log is usually the first place we need to check. The output from RC scripts can be found there, but rc.log has no timestamp for each RC script.
• To give rc.log a timestamp for each RC script, we could put a date command into each RC script, but that is a poor choice because so many files would need updating. A better option is to modify /sbin/rc.utils.
  The rc.utils script intercepts the output of RC scripts and logs it to /etc/rc.log, so we can make it log timestamps as well.
• Back up /sbin/rc.utils before you make changes, and ensure its permissions stay unchanged:

    # cp -p /sbin/rc.utils /sbin/rc.utils.bak

• Edit /sbin/rc.utils, find the two lines "echo >> $LOGFILE" (one is in the routine do_screen_mode, the other in do_line_mode), and insert a new line at each place:

    date >> $LOGFILE

Add Timestamps to RC scripts – Part 2

/etc/rc.log reports:

    Thu Aug 18 12:22:28 EST 2011
    Configure system crash dumps
    Output from "/sbin/rc1.d/S080crashconf start":
    ----------------------------
    EXIT CODE: 0
    ...
    Thu Aug 18 12:22:33 EST 2011
    Save system crash dump if needed
    Output from "/sbin/rc1.d/S440savecrash start":
    ----------------------------
    savecrash directory not set; defaulting to: /var/adm/crash
    *EXIT: parse_args
    ENTER: open_source
    ENTER: read_header
    ENTER: get_hdr_loc
    *EXIT: get_hdr_loc
    savecrash: Finished Reading Header From: device : /dev/rdsk/c11t0d0 offset:16386496

Crash Dump Scenarios – Part 1

• If there are no crash dump devices on HP-UX, by design, the server will default to primary swap for saving crash dumps!
• Persistent crash dump devices must be in the root volume group in LVM, but can be in any data group in VxVM (Symantec documentation confirms it too).
• If the crash dump devices are not persistent, are not listed in /etc/fstab, and swap is not in the root volume group, HP-UX will happily use non-persistent dump devices from other volume groups AS LONG as they are defined in the currently running kernel configuration (check with the crashconf(1M) command).

Crash Dump Scenarios – Part 2

• To enable non-persistent dump devices permanently, they need to be added to /etc/fstab and “switched on” via the crashconf(1M) command and/or /etc/rc.config.d/crashconf BEFORE a crash happens. Otherwise, if there is no fall-back to primary swap, the crash dump will FAIL.
• A dump can be saved to both non-persistent and persistent dump devices*.
• If there are persistent crash dump devices (they must be in the root volume group in LVM, but can be in any data group in VxVM), they will be used for saving crash dumps even if they are not listed in /etc/fstab.

Thank You!

Dusan Baljevic
Sydney, Australia
Aug 2011
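
As a closing aid, the crash dump scenarios listed above can be condensed into a small decision sketch. This is only an illustration of the rules as the slides state them, not HP-UX code: the function name dump_targets and its three parameters are hypothetical, the device lists would have to be collected by hand (for example from crashconf(1M) output), and the sketch does not model the order in which dump devices are used.

```python
from typing import List, Optional


def dump_targets(persistent: List[str],
                 configured: List[str],
                 primary_swap: Optional[str]) -> List[str]:
    """Return the devices a crash dump could be written to.

    persistent   -- devices in the persistent dump device list
    configured   -- non-persistent dump devices defined in the currently
                    running kernel configuration
    primary_swap -- the primary swap device, or None if none is configured

    All three inputs are assumptions of this sketch; nothing here queries
    a real system.
    """
    # Persistent devices are used even if /etc/fstab does not list them,
    # and a dump can go to persistent and non-persistent devices together.
    targets = list(persistent)
    for dev in configured:
        if dev not in targets:
            targets.append(dev)
    if targets:
        return targets

    # No dump devices at all: by design the dump defaults to primary swap.
    if primary_swap is not None:
        return [primary_swap]

    # No dump devices and no primary swap: the dump is disabled.
    return []
```

An empty result corresponds to the "Dump is disabled" console warning shown earlier, which is the situation worth catching before a crash rather than after it.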