www.novell.com
Technical White Paper
LINUX OPERATING SYSTEMS

Determining the Correct Usage of Swap in Linux 2.6 Kernels

Table of Contents:
Proper Sizing of Virtual Memory
Choosing the Right Swap Size to Use on Linux 2.6 Kernels
Appendix

Proper Sizing of Virtual Memory

In modern computing, memory comes in a wide range of speeds and capacities. There are cache chips with very fast access but small capacities, random access memory (RAM) chips with moderate speed but larger capacities, and hard disks that are slow but offer extremely large capacities. Data stored in RAM, as the name implies, is accessed at random. Access does not depend on its physical location or spatial relation to other data, which generally means shorter and more consistent data-retrieval times.

On a Linux-based system there are two different kinds of memory:
1. Physical memory: installed on the system as memory modules or chips
2. Swap memory: either a separate partition or a separate file on the hard drive

What is Swap?

Swap memory, sometimes loosely called virtual memory, is a block of space on the hard drive that increases the available memory space. Specifically, swap frees physical memory by copying temporarily unused memory pages over to swap memory. A high demand for memory, often caused by memory-consuming applications or by file or network input/output (I/O), can also trigger memory pages to move to swap. This action is called swapout. Because applications only see the amount of memory available, but not which part of memory they are using, the operating system can use swap to page out pages of idle processes, such as inactive applications, unused pages or shared libraries. Note, however, that the kernel code and its data are never moved to swap.

To calculate the total amount of virtual memory available on a system, add the swap space size to the physical RAM size.
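As a minimal sketch of this calculation, the following Python snippet adds the MemTotal and SwapTotal fields from text in /proc/meminfo format. The function names and the sample values are assumptions made for illustration and do not come from a measured system; on a live Linux system the text would be read from /proc/meminfo instead.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def total_virtual_memory_kb(meminfo):
    """Total virtual memory = physical RAM + swap space (both in kB)."""
    return meminfo["MemTotal"] + meminfo["SwapTotal"]


# Made-up sample values for illustration only (4 GB RAM, 2 GB swap).
SAMPLE = """MemTotal:        4194304 kB
MemFree:         1048576 kB
SwapTotal:       2097152 kB
SwapFree:        2097152 kB
"""

if __name__ == "__main__":
    info = parse_meminfo(SAMPLE)
    print(total_virtual_memory_kb(info), "kB")
```

Note that MemTotal is slightly lower than the installed RAM because the kernel reserves some memory for itself, so the result is an estimate rather than an exact figure.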
Swap memory was once needed to extend physical memory: it was used in addition to RAM to create more available memory. Now that RAM sizes have grown, using swap to extend RAM is no longer the necessity it once was. Because of this, some people assume that if there is enough physical memory on the machine, swap is not needed at all. This assumption is incorrect; swap optimizes memory usage.

To influence how the kernel balances between paging out runtime memory, swapping applications and dropping pages from the system page cache, you can use the parameter swappiness. Swappiness can have a value between 0 and 100; the default value is 60. With a value of 100, the kernel will prefer to swap out inactive pages rather than drop pages from the page cache. Swappiness is a sysctl parameter (sysctl is a utility to configure or modify kernel parameters) and can be set either at runtime through /proc/sys/vm/swappiness or persistently in /etc/sysctl.conf. (See also below: VM tuning sysctl parameters.)

Whether or not a swapout occurs also depends on how much application memory is in use and how well the cache is finding and releasing inactive items. Swapped-out data is not swapped back in automatically; it is swapped in only when it is needed. Some applications and databases, however, lock their memory and never allow it to be swapped.

Linux* tries to keep as many cached pages in memory as possible, so that essential data can be accessed from faster memory. You will see the most free memory right after boot or shortly after a large application is closed. If a process that occupies large amounts of physical memory terminates and frees its memory all at once, the memory display may likewise show more free physical memory.

You can allocate swap during or after installation.
If you choose to allocate swap during installation and in partitions, you first need to ensure that you have an adequate amount of disk space. You can also add swap flexibly after installation. For example, when you need more swap later, you can set it up with an extra swap file. This avoids keeping the entire swap allocated all the time. You can also use several swap partitions and files at the same time. This allows for more flexibility: you do not need a fixed partition layout from the beginning, and additional swap can be allocated later as usage requires.

Accessing swap, like any disk I/O, is much slower than accessing physical memory. To enhance performance, consider putting your swap partition or files on the fastest available disk. Several swap partitions or files should be distributed equally across drives of similar speed.

You can assign different priorities to swap spaces. Swap pages are allocated from areas in priority order, with the highest-priority area used first. For areas with different priorities, a higher-priority area is exhausted before a lower-priority area is used. If two or more areas have the same priority, and it is the highest priority available, pages are allocated on a round-robin basis between them. In an ideal situation, several swap spaces would be given the same priority, so they would be used equally; otherwise, the swap spaces are filled one by one in order of priority. For more detailed information about how to specify swap usage and priorities, see the swapon man page.

The mkswap command sets up a swap area on either a device or a file. Unless the swap partition or file is listed in the configuration file /etc/fstab and therefore activated automatically during boot, it must be activated manually.
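The priority rules described above can be illustrated with a small simulation. This is a toy model and not kernel code: the function allocate_pages, the area names, the sizes and the priorities are all hypothetical, and the real kernel allocates swap in clusters rather than single pages. The sketch fills the highest-priority areas first and alternates round-robin among areas that share a priority.

```python
from itertools import groupby


def allocate_pages(areas, pages):
    """Toy model of swap-area selection.

    areas: list of (name, priority, capacity_in_pages) tuples
    pages: number of pages to swap out
    Returns a dict mapping area name -> number of pages placed there.
    Pages that do not fit anywhere are silently left unplaced.
    """
    used = {name: 0 for name, _, _ in areas}
    capacity = {name: cap for name, _, cap in areas}
    # Highest priority first; areas with equal priority form one group.
    ordered = sorted(areas, key=lambda area: area[1], reverse=True)
    for _, group in groupby(ordered, key=lambda area: area[1]):
        names = [name for name, _, _ in group]
        while pages > 0:
            placed = False
            # One page per area per pass: round-robin within the group.
            for name in names:
                if pages > 0 and used[name] < capacity[name]:
                    used[name] += 1
                    pages -= 1
                    placed = True
            if not placed:
                break  # this group is full, continue with the next priority
    return used


if __name__ == "__main__":
    # Hypothetical layout: two equal-priority partitions and one
    # lower-priority swap file, each holding 100 pages.
    layout = [("sda2", 10, 100), ("sdb2", 10, 100), ("swapfile", 5, 100)]
    print(allocate_pages(layout, 150))
    print(allocate_pages(layout, 250))
```

With 150 pages, the two equal-priority areas share the load evenly and the lower-priority file stays empty; with 250 pages, the file is used only after both partitions are full.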
After creating the swap area, you need to activate it before it is used. You can activate or deactivate additional swap space with these two commands: swapon specifies the devices and files on which paging and swapping are to take place; swapoff disables swapping on the specified devices and files. Swap can be turned on or off while the system is in use.

Choosing the Right Swap Size to Use on Linux 2.6 Kernels

In earlier kernel versions (such as 2.4), the required swap size was twice the size of the RAM. However, this formula has been obsolete since kernel version 2.4.10. Today, the swap size depends entirely on the tradeoff you want to achieve. You can choose to allow memory allocations to fail, or choose to run with the memory bandwidth of the hard disk instead of the bandwidth of the RAM. The hard disk is significantly slower than RAM and, worst of all, handles seeks inefficiently. (The order in which data is swapped out is not necessarily the order in which it is swapped in later, so the initial swapout bandwidth can be statistically higher than the later swapin bandwidth.)

The total virtual memory that can be allocated by applications on Linux can be considered as FREEABLE_RAM+FREE_SWAP. For example, if you have 1 TB of RAM, you might not need to set up any swap at all, unless you really need to allocate more than 1 TB of virtual memory with malloc or tmpfs (or similar anonymous or shared memory allocations).

To understand why setting the swap too large is an issue, consider the following situation: you set your swap to 1 TB, and you have 1 TB of RAM. By doing so, you are telling the kernel that you are satisfied to have 50 percent of your malloced memory (actually 1 TB of virtual memory) backed by your hard disk instead of RAM.
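The arithmetic behind the 50 percent figure can be sketched as follows. The function names are invented for this illustration, and the sketch makes the simplifying assumption that all RAM is freeable and all swap is free.

```python
def allocatable_virtual_memory(freeable_ram, free_swap):
    """FREEABLE_RAM + FREE_SWAP, in whatever unit both arguments share."""
    return freeable_ram + free_swap


def disk_backed_share(freeable_ram, free_swap):
    """Fraction of the allocatable virtual memory that is backed by disk."""
    total = allocatable_virtual_memory(freeable_ram, free_swap)
    if total == 0:
        raise ValueError("no allocatable memory at all")
    return free_swap / total


if __name__ == "__main__":
    TB = 1  # work in terabytes, as in the example above
    print(allocatable_virtual_memory(1 * TB, 1 * TB))  # 2
    print(disk_backed_share(1 * TB, 1 * TB))           # 0.5
    print(disk_backed_share(1 * TB, 0))                # 0.0
```

The larger the swap is relative to RAM, the larger the share of allocations that the kernel is allowed to serve at hard disk speed.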
With this type of setup, a malicious program could render many system applications unusable by forcing swap-outs of hundreds of gigabytes of virtual memory belonging to useful applications. This could make the system many hundreds or thousands of times slower than it would be without swap. Even worse: a simple malloc infinite loop would need to write 1 TB to the hard disk before being killed by the out-of-memory (OOM) killer (a kernel mechanism that terminates one or more processes in order to free up memory for the system). This would take hours, even on a fast hard disk and even with the rest of the system idle and not triggering swap-ins.

Additionally, the larger the swap space, the harder it is for the virtual memory subsystem to detect that the system is out of memory. OOM detection will probably improve in future versions, but currently the “thrashing” type of denial-of-service (DoS) attack (which attempts to make a computer resource unavailable to its intended users) and the excessive time before malicious or buggy applications are killed cannot be entirely fixed. The system cannot declare OOM until all swap has been allocated, not even when an excessive amount of swap is being allocated. This behavior is a system feature, not a bug.

At best, you could partition the system to give swap access only to certain applications. If all applications can access the whole swap space and you allocate an excessive amount of swap, you run the risk of making the system slow for extended periods of time, even if just a single application malfunctions. Setting ulimits (ulimit is a command to show and set various restrictions on resource usage) to restrict problematic applications could also help.

The old formula SWAP=2*RAM, and even the more reasonable SWAP=RAM (which is generally well-suited for small systems with 4 GB of RAM or less), are probably not good for most large systems with 16 GB of RAM or more.
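To put the estimate that filling 1 TB of swap takes hours into numbers, here is a rough calculation. The write rate of 100 MB/s is an assumption chosen for illustration, not a figure from this paper, and the function name is hypothetical. The result is a lower bound, because real swapout also involves seeks.

```python
def hours_to_fill_swap(swap_bytes, write_bytes_per_second):
    """Lower bound on the time a runaway allocator needs to fill the swap."""
    if write_bytes_per_second <= 0:
        raise ValueError("write rate must be positive")
    return swap_bytes / write_bytes_per_second / 3600


if __name__ == "__main__":
    swap = 10 ** 12        # 1 TB of swap, as in the example above
    rate = 100 * 10 ** 6   # ASSUMPTION: 100 MB/s sustained sequential writes
    print(round(hours_to_fill_swap(swap, rate), 1))  # 2.8
```

During all of that time the OOM killer does not act, because the system is not considered out of memory until the swap is exhausted.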
For large systems, there is no easy formula you can use to auto-size the swap. Sizing the swap is as hard as sizing the RAM for large systems. What is clear, however, is that the swap size should not be a function of the RAM size alone.

In recent 2.6 Linux kernels, it is very safe, and almost equally efficient, to add swap using files in the file system instead of using block devices. This makes changing the swap size as needed a legitimate recommendation. (Using a separate hard disk, if available, for the swap space remains a good idea; it helps you avoid seeks between data I/O and swapin/swapout.)

On larger servers that must always function at peak performance, we recommend starting with a small swap. If the swap becomes full or nearly full, that is an indicator to increase the RAM in the system, or to increase the swap if an even worse performance penalty is acceptable. However, if swap is being used lightly, that is a good sign. It means the kernel selected portions of RAM that were unused and moved them to swap on the hard disk to make space for more useful data in RAM.

There are cases where 16 GB of swap or more makes sense. For example, where a server handles many services with multiple clients connected to each service, and each client has its own shared memory cache and private anonymous memory areas, the extra swap could make the system more efficient. If the common pattern is that only a few clients are active simultaneously, using swap instead of RAM for the inactive clients can be very cost effective. But that still leaves the system more vulnerable to malicious virtual-memory-thrashing programs, and it will take longer for the system to recover if a real OOM condition is triggered. Those kinds of memory-intensive applications, however, tend to run on single-user systems where virtual memory utilization can be controlled by the application.
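The recommendation to start small and watch whether the swap becomes full or nearly full can be sketched as a simple check on text in /proc/swaps format. The function names, the 80 percent threshold and the sample values are assumptions made for this illustration; on a live system the text would be read from /proc/swaps.

```python
def swap_usage(swaps_text):
    """Return (total_kb, used_kb) summed over all areas in /proc/swaps text."""
    total = used = 0
    for line in swaps_text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) >= 5:
            total += int(fields[2])  # Size column
            used += int(fields[3])   # Used column
    return total, used


def nearly_full(total_kb, used_kb, threshold=0.8):
    """True when swap usage reaches the (assumed) warning threshold."""
    return total_kb > 0 and used_kb / total_kb >= threshold


# Made-up sample for illustration only.
SAMPLE_SWAPS = """Filename                Type        Size     Used    Priority
/dev/sda2               partition   1000000  900000  10
/swapfile               file        1000000  100000  5
"""

if __name__ == "__main__":
    total, used = swap_usage(SAMPLE_SWAPS)
    print(total, used, nearly_full(total, used))
```

A check like this only reports the symptom; whether to add RAM or enlarge the swap remains the tradeoff described above.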
The opposite types of workloads are systems that run out of memory frequently because of unpredictable simulations that evolve freely at runtime. These workloads also benefit from the default overcommit behavior of the Linux virtual memory (VM) subsystem. Alternatively, you could set overcommit_memory to 2 so that allocations fail with -ENOMEM (the error code for “not enough space”) and the OOM killer is not invoked. In those scenarios, a fairly small swap is recommended so that an OOM condition is detected quickly and the system does not spend a long time filling the entire swap (regardless of the size of the RAM).

Appendix

The following sections explain different utilities and methods for gathering system information about memory and swap.

System Memory Output Details

Below we show the output of two different systems.

System 1 has been in use for several days with uninterrupted uptime:

System 2 has not been in use and just booted:

The following partitions are in use as swap partitions:

Note: swapon -s will use /proc/swaps and show the same information.

Memory Usage

The utility free examines RAM usage and shows the use of physical and swap memory. Details of both free and used memory and swap areas are shown:

Explanation of output:
total: total amount of available physical memory in KB. The number is lower than the actual physical memory because the kernel itself uses some memory.
used: amount of memory in use.
free: memory not used and available.
shared/buffers/cached: detailed information on how memory is used. The shared column shows the amount of memory shared by several processes; the buffers column shows the current size of the disk buffer cache.
-/+ buffers/cache: memory used to cache data for applications; parts of it can be freed when memory is needed for other purposes.
swap: utilization of swap memory, as total, used and free amounts.
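In the classic free output described here, the -/+ buffers/cache row is derived from the first row: buffers and cache are subtracted from used and added to free. The following sketch shows that derivation. The function name and the sample numbers are made up for illustration and are not output from a real system.

```python
def buffers_cache_row(total_kb, free_kb, buffers_kb, cached_kb):
    """Compute the '-/+ buffers/cache' row of the classic free output.

    Returns (used_by_applications_kb, available_to_applications_kb).
    """
    used_kb = total_kb - free_kb
    used_by_apps = used_kb - buffers_kb - cached_kb
    available = free_kb + buffers_kb + cached_kb
    return used_by_apps, available


if __name__ == "__main__":
    # Made-up values in KB: most "used" memory is really buffers and cache.
    used_by_apps, available = buffers_cache_row(
        total_kb=4000000, free_kb=500000, buffers_kb=200000, cached_kb=1800000
    )
    print(used_by_apps, available)  # 1500000 2500000
```

In this example only 500000 KB are reported as free, yet 2500000 KB could be made available to applications, which is why a low free value alone is not a sign of memory shortage.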
When the amount of free RAM as reported by free becomes low, this is not necessarily a problem. However, when the “buffers” and “cached” values go down and the “swap used” value goes up, the system may be running out of RAM. At this point, you might want to consider adding more RAM.

Note: The options -b, -k, -m and -g show output in bytes, KB, MB or GB, respectively. The option -s delay refreshes the display every delay seconds. For example, free -s 2 produces an update every two seconds.

The same information is available from meminfo. The command cat /proc/meminfo shows information about physical and swap memory usage.

The command vmstat reports virtual memory statistics and activity. It reports information and statistics about processes, memory, paging, block I/O, traps and CPU activity. vmstat can be used to monitor and evaluate system activity and helps to discover a physical memory shortage. Monitoring the si (swapin) and so (swapout) columns of vmstat shows the rate at which data is swapped in from and out to swap. vmstat uses /proc/meminfo to get some of its information.

Explanation of output:

Memory:
swpd: the amount of virtual memory used
free: the amount of idle memory
buff: the amount of memory used as buffers
cache: the amount of memory used as cache
inact/active: the amount of inactive/active memory (-a option)

Note: If you want to display the amount of active and inactive memory, use the -a option with vmstat. The -a switch displays active/inactive memory on kernel 2.5.41 or later. For example:

Swap:
si: amount of memory swapped in from disk (/s): data transferred from swap space to physical memory
so: amount of memory swapped to disk (/s): data transferred from physical memory to swap space

The top command can be used to find programs that use a lot of memory. When you type ‘F’ ‘n’ ‘enter’ in the top screen, the applications that consume the most memory will appear at the top of the list.

The sar command can generate extensive reports on almost all important system activities. It can log and report over a period of time, and it reports on CPU usage, memory, IRQ usage, I/O or networking. For example, on a machine that is starting to swap, the percentage of time that the CPU waits for I/O will increase.

Virtual Memory Configuration Options

The behavior of the virtual memory subsystem of the Linux kernel (which is responsible for swapping) can be tuned through the /proc filesystem (see /proc/sys/vm/). The list below covers the most important virtual memory tuning sysctl parameters. It may not be complete and may not cover every parameter that tunes virtual memory.

hugepages=N: Number of hugepages (2 MB for i386) to reserve for the hugetlbfs file system at boot time. Later allocation is possible by writing the value to /proc/sys/vm/nr_hugepages, but is likely to fail if the system has been in use for some time, due to memory fragmentation.

/proc/sys/vm/swappiness: The default value 60 causes idle applications and services to be swapped out of memory, allowing the memory to be used for I/O caching instead. If you do not want applications to be removed from memory in favor of I/O caching and want to avoid swapping, set swappiness to a lower value. Note: Some databases lock the memory they use. In this case, the memory is not swapped, no matter the value of this parameter.

/proc/sys/vm/lower_zone_protection: This parameter ensures that the kernel will avoid using lower memory areas if address space is available in the high memory area. The default value is 0. It should be set to a high value (1024) on systems with memory of more than 8 GB.
With this setting, there will be sufficient memory (under 4 GB) to ensure that the kernel continues to be available in critical situations on i386 systems. If the I/O hardware does not support 64-bit addressing, zerocopy I/O becomes less likely and bounce buffering (copying the buffers to and fro) more likely, which in turn reduces the I/O performance.

/proc/sys/vm/disable_cap_mlock: To avoid DoS attacks by normal users, memory can usually only be locked by root (CAP_IPC_LOCK). This parameter bypasses that restriction, enabling every user to lock address space in memory. Some databases that do not otherwise need to run with root permissions might need this setting.

/proc/sys/kernel/shm_use_hugepages: SysV IPC (interprocess communication) shared memory can be prompted to use hugetlb pages instead of normal pages without having to use SHM_HUGETLB. More information about this subject is available in /usr/src/linux/Documentation/vm/hugetlbpage.txt.

overcommit_memory: This value contains a flag that controls memory overcommitment. When this flag is 0, the kernel attempts to estimate the amount of free memory left when userspace requests more memory. When this flag is 1, the kernel pretends there is always enough memory until it actually runs out. When this flag is 2, the kernel uses a “strict overcommit” policy that attempts to prevent any overcommitment of memory. Overcommitment can be very useful because many programs allocate huge amounts of memory in advance with malloc() but do not really use much of it. To prevent the system from allocating too much memory, set overcommit_memory accordingly. The default value is 0.

overcommit_ratio: When overcommit_memory is set to 2, the committed address space is not permitted to exceed swap plus this percentage of physical RAM. See above.
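The strict overcommit rule for overcommit_memory = 2 can be expressed as a short calculation. This is a simplified sketch: the function names are hypothetical, the ratio of 50 and the memory sizes are example values only, and the kernel's own accounting includes further details (such as huge pages) that are ignored here.

```python
def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio):
    """Commit limit under strict overcommit: swap + ratio percent of RAM."""
    if not 0 <= overcommit_ratio:
        raise ValueError("overcommit_ratio must not be negative")
    return swap_kb + ram_kb * overcommit_ratio // 100


def allocation_allowed(committed_kb, request_kb, limit_kb):
    """True if a new request keeps the committed address space within the limit."""
    return committed_kb + request_kb <= limit_kb


if __name__ == "__main__":
    ram = 16 * 1024 * 1024   # example: 16 GB of RAM, in KB
    swap = 2 * 1024 * 1024   # example: 2 GB of swap, in KB
    limit = commit_limit_kb(ram, swap, overcommit_ratio=50)
    print(limit)  # 10485760
    print(allocation_allowed(committed_kb=10000000, request_kb=1000000,
                             limit_kb=limit))  # False
```

With these example values, a small swap together with a ratio of 50 yields a commit limit below the installed RAM, so swap size and overcommit_ratio need to be chosen together when strict overcommit is enabled.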
bdflush params: /proc/sys/vm/XXX

dirty_expire_centisecs: Time after which dirty pages are scheduled for write-out (writing the data to the hard disk). The default of 30 s is good for database loads, under which rewrites within 30 s are not uncommon. This can be reduced to trigger earlier write-out.

dirty_ratio: Write-out begins when this percentage of memory is dirty.

dirty_background_ratio: Percentage of dirty pages under which background write-out stops. It is the threshold at which the pdflush background writeback daemon starts writing out dirty data. This value should be a bit under the value for dirty_ratio.

dirty_writeback_centisecs: The frequency with which the background write-out of dirty pages takes place, if the percentage of dirty pages is higher than specified in dirty_ratio. This value can be reduced to trigger the write-out earlier, especially if the storage medium is fast. The ratio is often set to a high value, which is good for databases. In databases, pages often change and then change again after some time. If no data is written in the meantime, a write is saved. The ratio can be reduced or narrowed for other loads. In general, write-outs can be triggered more often to keep the memory clean and the disks busy.

block_dump: Debugging information on virtual memory

laptop_mode: Correlates VM flushes with other physical I/O

lowmem_reserve_ratio: How much low memory should be reserved

page-cluster: The Linux VM subsystem avoids excessive disk seeks by reading multiple pages on a page fault. The number of pages it reads depends on the amount of memory in your machine. The number of pages the kernel reads in at once is equal to 2 ^ page-cluster. Values above 2 ^ 5 do not make much sense for swap because swap data is clustered only in 32-page groups.

max_map_count: This file contains the maximum number of memory map areas a process may have.
Memory map areas are used as a side effect of calling malloc, directly by mmap and mprotect, and also when loading shared libraries. While most applications need fewer than a thousand maps, certain programs, particularly malloc debuggers, may consume many of them (e.g., up to one or two maps per allocation). The default value is 65536.

min_free_kbytes: This is used to force the Linux VM to keep a minimum number of kilobytes free. The VM uses this number to compute a pages_min value for each lowmem zone in the system. Each lowmem zone gets a number of reserved free pages in proportion to its size.

462-002076-001 | 10/07 | © 2007 Novell, Inc. All rights reserved. Novell, the Novell logo and the N logo are registered trademarks of Novell, Inc. in the United States and other countries.

*Linux is a registered trademark of Linus Torvalds. All other third-party trademarks are the property of their respective owners.

Contact your local Novell® Solutions Provider, or call Novell at:
1 800 714 3400 U.S./Canada
1 801 861 1349 Worldwide
1 801 861 8473 Facsimile

Novell, Inc.
404 Wyman Street
Waltham, MA 02451 USA