Determining the Correct Usage of Swap in Linux* 2.6 Kernels

www.novell.com
Technical White Paper
LINUX OPERATING SYSTEMS
Table of Contents:
Proper Sizing of Virtual Memory
Choosing the Right Swap Size to Use on Linux 2.6 Kernels
Appendix
Proper Sizing of Virtual Memory
In modern computing, memory comes in widely varying forms.
There are cache chips that are very fast but hold only small
amounts of data, random access memory (RAM) chips with moderate
speed but larger capacities, and hard disks that are slow but
offer extremely large storage capacities. Data stored in RAM,
as the name implies, can be accessed at random. Access does not
depend on the physical location of the data or its spatial
relation to other data, which generally means shorter and more
consistent data-retrieval times.
On a Linux-based system there are two
different kinds of memory:
1. Physical memory: installed on the system
as memory bars or chips
2. Swap memory: either a separate partition
or a separate file on the hard drive
What is Swap?
Swap memory, or virtual memory, is a
block of memory, or a space, on the hard
drive. To calculate the total amount of virtual
memory available on a system, you must
add the swap space size to the physical
RAM size.
Swap memory was once needed to extend
physical memory: it was used in addition to
RAM to create more available memory. Now,
with the growth of the physical size of RAM,
using swap memory to extend RAM isn’t the
necessity it once was. However, because of
the innovations in RAM technology, some
people assume that if there is enough
physical memory on the machine, swap
isn’t needed at all. This assumption is
incorrect; swap optimizes memory usage
and increases available memory space.
Specifically, swap frees physical memory
by copying temporarily unused memory
pages over to swap memory.
A high demand for memory, often caused
by memory-consuming applications, file or
network input/output (I/O), can also trigger
memory pages to move to swap. This action
is called swapout. Because applications
only see the amount of memory available,
but not which part of memory they are using,
the operating system uses swap to page
out pages of idle processes such as inactive
applications, unused pages or shared libraries.
You should note, however, that the kernel
code and its data are never moved to swap.
To influence how the kernel balances between paging out runtime
memory, swapping applications and dropping pages from the system
page cache, you can use the parameter swappiness. Swappiness can
have a value between 0 and 100; the default value is 60. With a
value set to 100, the kernel will prefer to swap out inactive pages.
Swappiness can be set as a sysctl parameter (sysctl is a utility to
configure or modify kernel parameters), either at runtime through
/proc/sys/vm/swappiness or in /etc/sysctl.conf. (See also below:
VM tuning sysctl parameters.) Whether or not a swapout occurs also
depends on how much application memory is in use and how well the
cache is finding and releasing inactive items.
Swapped-out data is not swapped back in automatically; it stays in
swap until it is needed again. Some applications and databases,
however, lock their memory and never allow it to be swapped.
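The following Python sketch shows one way to look up the swappiness value configured in a sysctl.conf-style file and to check it against the 0 to 100 range described above. The function name read_swappiness and the sample file content are assumptions made for illustration; they are not part of any Linux tool.

```python
def read_swappiness(sysctl_conf_text, default=60):
    """Return the vm.swappiness value set in sysctl.conf-style text.

    Falls back to `default` (60, the default named in the text) when
    the key is absent. Raises ValueError for values outside 0..100.
    """
    value = default
    for raw_line in sysctl_conf_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line[0] in "#;":
            continue
        key, sep, val = line.partition("=")
        if sep and key.strip() == "vm.swappiness":
            value = int(val.strip())  # a later entry overrides an earlier one
    if not 0 <= value <= 100:
        raise ValueError(f"swappiness must be between 0 and 100, got {value}")
    return value


# Hypothetical /etc/sysctl.conf fragment, used only as sample input.
sample_conf = """\
# tuning for a database host
kernel.sysrq = 0
vm.swappiness = 20
"""
print(read_swappiness(sample_conf))
```

On a live system, the currently active value can be read from /proc/sys/vm/swappiness instead of being parsed from the configuration file.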
Linux* tries to keep as many cached pages in memory as possible.
It does this so that essential data can be served from faster
memory rather than from disk. You will see the most free memory
right after boot or shortly after a large application is closed.
If a process that occupies large amounts of physical memory
terminates and frees its memory all at once, the system will
likewise report more free physical memory.
You can allocate swap during or after installation. If you choose to allocate swap during
installation and in partitions, you will first
need to ensure that you have an adequate
amount of disk space. You can also add
and allocate swap flexibly after installation.
For example, when you need more swap
after installation, you can easily set it up
with an extra swap file. This helps you avoid
keeping the entire swap allocated all the time.
You can also opt to use several swap partitions and files at the same time. This allows
for more flexibility, and you don’t need to have
a fixed-partition layout from the beginning.
For example, if more swap is suddenly needed
after installation, it can be allocated later on
a per-usage basis.
Accessing swap and disk I/O is much slower
than accessing physical memory. To enhance
performance, consider putting your swap
partition or files on the fastest available
disk. Several swap partitions or files should
be distributed equally and on drives with
similar speed.
You can assign different priorities to swap spaces. Swap pages are
allocated from areas in priority order, with the highest-priority
area used first. For areas with different priorities, a
higher-priority area is exhausted before a lower-priority area is
used. If two or more areas have the same priority, and it is the
highest priority available, pages are allocated on a round-robin
basis between those areas. In an ideal situation, several swap
spaces would be given the same priority, so they would be used
equally; otherwise, the swap spaces will be filled one by one in
order of priority.
For more detailed information about how to
specify swap usages and priorities, please
see the swapon man page.
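To illustrate the priority rules, the following Python sketch groups swap areas by priority in the order in which they would be filled. It assumes input in the layout of /proc/swaps (a header row, then one area per line with the priority in the last column). The function name swap_fill_order and all device names and sizes in the sample are hypothetical.

```python
from itertools import groupby


def swap_fill_order(proc_swaps_text):
    """Group swap areas by priority, highest priority first.

    Each returned group lists the areas that share a priority and are
    therefore filled round-robin; the groups appear in the order in
    which they would be exhausted.
    """
    areas = []
    for line in proc_swaps_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue
        areas.append((int(fields[-1]), fields[0]))  # (priority, name)
    # Stable sort: areas with equal priority keep their original order.
    areas.sort(key=lambda area: -area[0])
    return [[name for _, name in group]
            for _, group in groupby(areas, key=lambda area: area[0])]


# Hypothetical /proc/swaps content; names and sizes are made up.
sample = """\
Filename                Type        Size     Used  Priority
/dev/sda2               partition   2097144  0     5
/dev/sdb2               partition   2097144  0     5
/var/swapfile           file        1048568  0     1
"""
print(swap_fill_order(sample))
# prints [['/dev/sda2', '/dev/sdb2'], ['/var/swapfile']]
```

In this sample, the two partitions share priority 5 and are used equally; the swap file is touched only after both partitions are full.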
The command mkswap sets up a swap area, either on a device or in
a file.
Unless a swap partition or file is listed in the configuration file
/etc/fstab and activated automatically during boot, it must be
activated manually. After creating the swap area, you need to use
the swapon command to start using it. You can activate or
deactivate additional swap space by using these two commands:
swapon: specifies the devices and files on which paging and
swapping are to take place.
swapoff: disables swapping on the specified devices and files.
Swap can be turned on or off while the system is being used.
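The steps for adding a swap file can be sketched as follows. This Python example only builds the command lines and the /etc/fstab entry; it does not run anything, because the commands require root. The helper names swapfile_commands and fstab_entry, the path /var/swapfile and the size are assumptions chosen for illustration.

```python
def swapfile_commands(path, size_mb, priority=None):
    """Build (but do not run) the commands that add a swap file.

    Returns a list of argument lists, one per command.
    """
    swapon = ["swapon", path]
    if priority is not None:
        swapon = ["swapon", "-p", str(priority), path]
    return [
        # Allocate the file by writing zeros so that it contains no holes.
        ["dd", "if=/dev/zero", f"of={path}", "bs=1M", f"count={size_mb}"],
        # Restrict access, since swap can hold data from any process.
        ["chmod", "600", path],
        ["mkswap", path],  # set up the swap area inside the file
        swapon,            # start using the new swap area
    ]


def fstab_entry(path, priority=None):
    """Return an /etc/fstab line that activates the swap area at boot."""
    options = "sw" if priority is None else f"sw,pri={priority}"
    return f"{path} none swap {options} 0 0"


# Hypothetical path, size and priority.
for argv in swapfile_commands("/var/swapfile", 512, priority=1):
    print(" ".join(argv))
print(fstab_entry("/var/swapfile", priority=1))
```

Running swapoff on the same path deactivates the swap file again, after which it can be deleted or resized.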
Choosing the Right Swap Size
to Use on Linux 2.6 Kernels
In earlier kernel versions (such as 2.4),
the required swap size was twice as large
as the RAM. However, this formula has been
obsolete since kernel version 2.4.10. Today,
the swap size depends entirely on the tradeoff you want to achieve. You can choose to
allow memory allocations to fail or choose to
run with the memory bandwidth of the hard
disk, instead of the bandwidth of the RAM.
The hard disk is significantly slower than RAM,
and worst of all, the hard disk handles seeks
inefficiently. (The order in which the data is
swapped out is not necessarily the order in
which the data is swapped in later, so the
initial swapout bandwidth can be statistically
higher than the later swapin bandwidth.)
The total virtual memory that can be allocated by applications on
Linux can be considered as FREEABLE_RAM+FREE_SWAP. For example, if
you have 1 TB of RAM, you might not need to set up any swap at all,
unless you really need to allocate more than 1 TB of virtual memory
through malloc, tmpfs or similar anonymous/shared memory
allocations.
To understand why setting the swap too large
is an issue, consider the following situation:
you are setting your swap to 1 TB, and you
have 1 TB of RAM. By doing so, you are
telling the kernel that you are satisfied to
have 50 percent of your malloced memory
(actually 1 TB of virtual memory) backed by
your hard disk instead of RAM. With this
type of setup, a malicious program could
render many system applications unusable by
forcing swap-outs of hundreds of gigabytes
of virtual memory on useful applications.
This could render the system many hundreds
or thousands of times slower than it would
run without a swap. Even worse: a simple
malloc infinite loop would need to write 1 TB
to the hard disk before being killed by the
out-of-memory (OOM) killer (a function or
mechanism that has to terminate one or more
processes in order to free up memory for the
system). This would take hours—even on a
fast hard disk and even with the rest of the
system idle and not triggering swap-ins.
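The time estimate above can be checked with simple arithmetic. The following Python sketch assumes a sustained write bandwidth of 100 MB/s, a figure chosen only for illustration; real swap-out is usually slower because of seeks and competing I/O.

```python
def hours_to_fill_swap(swap_bytes, disk_bytes_per_second):
    """Hours needed to write `swap_bytes` at a sustained sequential rate."""
    return swap_bytes / disk_bytes_per_second / 3600


TB = 10**12
MB = 10**6

# Assumption: the disk sustains 100 MB/s for the entire write.
print(round(hours_to_fill_swap(1 * TB, 100 * MB), 2))  # prints 2.78
```

Even under this optimistic assumption, a runaway process needs close to three hours to fill 1 TB of swap before the OOM killer can act.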
Additionally, the larger the swap space, the harder it is for the
virtual memory subsystem to detect that the system is out of
memory. The OOM detection will probably improve in future versions,
but currently, the “thrashing” type of denial-of-service (DoS)
attack (which attempts to make a computer resource unavailable to
its intended users) and the excessive time before malicious or
buggy applications are killed cannot be entirely fixed. The system
cannot declare OOM until all swap has been allocated, not even when
an excessive amount of swap has been set up. This behavior is a
system feature, not a bug.
At best, you could partition the system to give swap access only
to certain applications. If all applications can access the whole
swap space and you have allocated an excessive amount of swap, you
run the risk of making the system slow for extended periods of
time, even if just a single application malfunctions. Setting
limits with ulimit (a command to show and set various restrictions
on resource usage) to restrict problematic applications could
also help.
The old formula SWAP=2*RAM and the more reasonable SWAP=RAM (which
is generally well-suited for small systems with 4 GB of RAM or
less) are probably not good for most large systems with 16 GB of
RAM or more. For large systems, there is no easy formula you can
use to auto-size the swap. Sizing the swap is as hard as sizing the
RAM for large systems. What is clear, however, is that the swap
size should not be a function of the RAM size alone.
In recent 2.6 Linux kernels, it is very safe—
and almost equally efficient—to add swap
using files in the file system instead of using
block devices. This makes changing the swap
size as needed a legitimate recommendation.
(Using a separate hard disk, if available,
for the swap space remains a good idea;
it helps you avoid seeks between data I/O
and swapin/swapout.) On larger servers that
must always function at peak performance,
we recommend starting with a small swap.
If the swap becomes full or nearly full, that should be an
indicator to increase the RAM in the system, or to increase the
swap if an even worse performance penalty is acceptable. However,
if swap is being used lightly, that is a good sign. It means the
kernel selected portions of RAM that were unused and moved them to
the hard disk to make space for more useful data in RAM.
There are cases where 16 GB of swap or
more makes sense. For example, in situations where certain servers handle many
services with multiple clients connected to
each service—and each client has its own
shared memory cache and private anonymous
memory areas—having the extra swap could
make the system more efficient. If the common pattern is that only a few clients are
active simultaneously, using swap instead
of RAM for the inactive clients can be very
cost effective. But that still leaves the system more vulnerable
to malicious virtual-memory-thrashing programs, and it will take
longer for the system to recover if a real OOM condition is
triggered. Those kinds of memory-intensive applications, however,
tend to run on single-user systems where virtual memory utilization
can be controlled by the application.
The opposite types of workloads are systems that will run out of
memory frequently because of unpredictable simulations that evolve
freely at runtime. These workloads also benefit from the default
overcommit behavior of the Linux virtual memory (VM) subsystem.
Alternatively, you could set overcommit_memory to 2 so that
allocations fail with -ENOMEM (the error code for “not enough
space”) instead of invoking the OOM killer. In those scenarios, a
fairly small swap is recommended so that the OOM killer can act
quickly instead of waiting a long time for the entire swap to fill
(regardless of the size of the RAM).
Appendix
The following sections explain different utilities and methods for gathering system information
about memory and swap.
System Memory Output Details
Below we show the output of two different systems.
System 1 has been in use for several days with uninterrupted uptime:
System 2 has not been in use and just booted:
The following partitions are in use as swap partitions:
Note: swapon -s reads /proc/swaps and shows the same information.
Memory Usage
The utility free examines RAM usage and also shows the use of physical and swap memory.
Details of both free and used memory and swap areas are shown:
Explanation of output:
total: total amount of available physical
memory in KB. The number is lower than
the actual physical memory because the
kernel itself uses some memory.
used: amount of memory in use.
free: memory not used and available.
shared/buffers/cached: detailed information
on how memory is used. The shared
column shows the amount of memory
shared by several processes; the buffers
column shows the current size of the disk
buffer cache.
-/+ buffers/cache: used and free memory
after discounting buffers and cache. Memory
used to cache data for applications can
partly be freed when memory is needed for
other purposes.
swap: utilization of swap memory, shown as
total, used and free amounts.
When the amount of free RAM as reported
by free becomes low, this is not necessarily
a problem. However, when the “buffers” and
“cached” values go down and the “swap
used” value goes up, the system may be
running out of RAM. At this point, you might
want to consider adding more RAM.
Note: The options -b, -k, -m and -g show output in bytes, KB, MB
or GB, respectively. The option -s delay refreshes the display
every delay seconds. For example, free -s 2 produces an update
every two seconds.
The same information is available from /proc/meminfo. The command
cat /proc/meminfo shows information about physical and swap
memory usage.
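As a minimal sketch, the following Python code derives the figures discussed in this paper from /proc/meminfo-style text: the total virtual memory (RAM plus swap), the free memory once buffers and cache are counted as reclaimable, and the swap in use. The function names and all sample values are hypothetical; only the field names MemTotal, MemFree, Buffers, Cached, SwapTotal and SwapFree come from /proc/meminfo itself.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (KB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and rest.split():
            values[key.strip()] = int(rest.split()[0])
    return values


def summarize(info):
    """Derive summary figures from parsed meminfo values."""
    cache = info["Buffers"] + info["Cached"]
    return {
        # RAM plus swap: the total virtual memory of the system.
        "total_virtual_kb": info["MemTotal"] + info["SwapTotal"],
        # Free memory once buffers and cache are counted as reclaimable.
        "free_plus_cache_kb": info["MemFree"] + cache,
        "swap_used_kb": info["SwapTotal"] - info["SwapFree"],
    }


# Hypothetical values; on a live system, pass the content of
# /proc/meminfo instead.
sample = """\
MemTotal:      2074112 kB
MemFree:        102400 kB
Buffers:         51200 kB
Cached:        1024000 kB
SwapTotal:     1048568 kB
SwapFree:      1040000 kB
"""
print(summarize(parse_meminfo(sample)))
```

In this sample, little memory is reported as free, but more than 1 GB is held in buffers and cache and only about 8 MB of swap is in use, which matches the healthy pattern described above.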
The command vmstat reports virtual memory statistics and activity. It reports information and
statistics about processes, memory, paging, block I/O, traps and CPU activity. vmstat can
be used to monitor and evaluate system activity and will help to discover physical memory
shortage. Monitoring the si (swapin) and so (swapout) columns of vmstat shows the rate at
which data is swapped in from and out to swap. vmstat uses /proc/meminfo to get some of the
information.
Explanation of output:
Memory:
swpd: the amount of virtual memory used
free: the amount of idle memory
buff: the amount of memory used as buffers
cache: the amount of memory used as cache
inact/active: the amount of inactive/active memory (-a option)
Note: If you want to display the amount of active and inactive memory, use the -a option with
vmstat. The -a switch displays active/inactive memory when used with kernel 2.5.41 or later.
For example:
Swap:
si: Amount of memory swapped in from
disk (/s): data transferred from swap
space to physical memory
so: Amount of memory swapped to disk
(/s): data transferred from physical memory to swap space
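The si and so rates can also be approximated by hand. The following Python sketch assumes the pswpin and pswpout counters of /proc/vmstat (pages swapped in and out since boot) and a page size of 4 KB; both the page size and the sample counter values are assumptions, and vmstat itself may compute its columns differently.

```python
def swap_counters(proc_vmstat_text):
    """Return (pswpin, pswpout): pages swapped in and out since boot."""
    counters = dict(line.split() for line in proc_vmstat_text.splitlines()
                    if len(line.split()) == 2)
    return int(counters["pswpin"]), int(counters["pswpout"])


def swap_rates_kb(before, after, interval_s, page_kb=4):
    """Convert two counter snapshots into (si, so) rates in KB per second."""
    si = (after[0] - before[0]) * page_kb / interval_s
    so = (after[1] - before[1]) * page_kb / interval_s
    return si, so


# Hypothetical snapshots taken 2 seconds apart; on a live system,
# read /proc/vmstat twice instead.
first = swap_counters("pgpgin 1000\npswpin 100\npswpout 400\n")
second = swap_counters("pgpgin 1500\npswpin 110\npswpout 900\n")
print(swap_rates_kb(first, second, interval_s=2))  # prints (20.0, 1000.0)
```

A so value that stays high over many intervals, as in this sample, points to the physical memory shortage that vmstat is meant to reveal.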
The top command can be used to find programs that use a lot of
memory. When you type ‘F’, ‘n’ and Enter in the top screen, the
applications that consume the most memory will appear at the top
of the list.
The sar command can generate extensive
reports on almost all important system activities. It can log and report over a period of
time, and it reports on CPU usage, memory,
IRQ usage, I/O or networking. For example,
on a machine that is starting to swap, the
percentage of time that the CPU waits for
I/O will increase.
Virtual Memory Configuration Options
The behavior of the virtual memory subsystem
of the Linux kernel (which is responsible for
swapping) can be tuned through the /proc
filesystem (see /proc/sys/vm/). See the list
below for the most important virtual memory
tuning sysctl parameters. This list, however,
may not be complete and may not cover all
parameters to tune the virtual memory.
hugepages=N: Number of hugepages (2 MB for i386) to reserve for
the hugetlbfs file system at boot time. Later allocation is
possible by writing the value to /proc/sys/vm/nr_hugepages, but
it is likely to fail if the system has been in use for some time,
because of memory fragmentation.
/proc/sys/vm/swappiness: The default value 60 causes idle
applications and services to be removed from the memory, allowing
the memory to be used for I/O caching instead. If you do not want
applications to be removed from the memory in favor of I/O caching
and want to avoid swapping, set swappiness to a lower value.
Note: Some databases lock the memory they use. In this case, the
memory is not swapped, no matter the value of this parameter.
/proc/sys/vm/lower_zone_protection: This
parameter ensures that the kernel will avoid
using lower memory areas if address space is
available in the high memory area. The default
value is 0. It should be set to a high value
(1024) on systems with memory of more than
8 GB. With this setting, there will be sufficient
memory (under 4 GB) to ensure that the
kernel continues to be available in critical
situations on i386 systems.
If the I/O hardware does not support 64-bit
addressing, the possibility of performing
zerocopy I/O will be less likely, resulting in a
higher likelihood of bounce buffering (copying
the buffers to and fro), which in turn reduces
the I/O performance.
/proc/sys/vm/disable_cap_mlock: To avoid DoS attacks by normal
users, memory can usually only be locked by root (CAP_IPC_LOCK).
Some databases that have no other reason to run with root
permissions might need this setting. This parameter bypasses the
restriction, enabling every user to lock address space into
memory.
/proc/sys/kernel/shm_use_hugepages: SysV
IPC (interprocess communication) Shared
memory can be prompted to use hugetlb
pages instead of normal pages without having
to use SHM_HUGETLB. More information
about this subject is available in /usr/src/linux/
Documentation/vm/hugetlbpage.txt.
overcommit_memory: This value contains a
flag that enables memory overcommitment.
When this flag is 0, the kernel attempts to
estimate the amount of free memory left when
userspace requests more memory. When this
flag is 1, the kernel pretends there is always
enough memory until it actually runs out.
When this flag is 2, the kernel uses a “strict
overcommit” policy that attempts to prevent
any overcommitment of memory. This feature
can be very useful because there are many
programs that allocate huge amounts of
memory in advance with malloc(), but do not
really use much of this allocated memory. To
prevent the system from allocating too much
memory, you can use the overcommit_memory
option. The default value is 0.
overcommit_ratio: When overcommit_memory
is set to 2, the committed address space
is not permitted to exceed swap plus this
percentage of physical RAM. See above.
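The limit described for overcommit_memory = 2 can be computed directly. The following Python sketch applies the rule as stated (swap plus a percentage of physical RAM) and ignores huge pages and other kernel reservations. The function name commit_limit_kb and the machine sizes are hypothetical, and the ratio of 50 is an assumed value.

```python
def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio=50):
    """Address-space limit under strict overcommit (overcommit_memory = 2).

    Applies the rule from the text: swap plus `overcommit_ratio`
    percent of physical RAM.
    """
    return swap_kb + ram_kb * overcommit_ratio // 100


GB = 1024 * 1024  # KB per GB

# Hypothetical machine: 16 GB of RAM, 2 GB of swap, assumed ratio of 50.
print(commit_limit_kb(16 * GB, 2 * GB, overcommit_ratio=50) // GB)  # prints 10
```

With these numbers, applications could commit at most 10 GB of address space, so on a system with little swap the ratio may have to be raised to let applications use most of the RAM.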
bdflush params: /proc/sys/vm/XXX
dirty_expire_centisecs: Time after which
dirty pages are scheduled for write-out
(writing the data to the hard disk). The
default of 30 s is good for DB loads
under which rewrites within 30 s are
not uncommon. This can be reduced
to trigger earlier write-out.
dirty_ratio: Write-out begins when this
percentage of memory is dirty.
dirty_background_ratio: Percentage of
memory filled with dirty pages at which the
pdflush background writeback daemon will
start writing out dirty data; background
write-out stops once the share of dirty
pages falls below this percentage.
This value should be a bit under the
value for dirty_ratio.
dirty_writeback_centisecs: The frequency
with which the background write-out of
dirty pages takes place, if the percentage
of dirty pages is higher than specified in
dirty_ratio. This value can be reduced to
trigger the write-out earlier, especially if
the storage medium is fast.
The ratio is often set to a high value,
which is good for databases. In databases,
pages often change and then change again
after some time. If no data is written in the
meantime, a write is saved. The ratio can
be reduced or narrowed for other loads.
In general, write-outs can be triggered
more often to keep the memory clean
and the disks busy.
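A small consistency check for these write-out settings might look like the following Python sketch. It only encodes two points from the text: the values ending in centisecs are hundredths of a second, and dirty_background_ratio should stay below dirty_ratio. The function names and the sample values are hypothetical.

```python
def centisecs_to_seconds(centisecs):
    """Convert a *_centisecs setting to seconds."""
    return centisecs / 100


def check_dirty_settings(settings):
    """Return warnings for write-out settings that contradict the text.

    `settings` maps parameter names under /proc/sys/vm to integers.
    """
    warnings = []
    if settings["dirty_background_ratio"] >= settings["dirty_ratio"]:
        warnings.append("dirty_background_ratio should be below dirty_ratio")
    return warnings


# Hypothetical settings, chosen only for illustration.
settings = {
    "dirty_ratio": 40,
    "dirty_background_ratio": 10,
    "dirty_expire_centisecs": 3000,
}
print(centisecs_to_seconds(settings["dirty_expire_centisecs"]))  # prints 30.0
print(check_dirty_settings(settings))                            # prints []
```

On a live system, the dictionary could be filled by reading the files of the same names under /proc/sys/vm.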
block_dump: Debugging information on
virtual memory
laptop_mode: Correlates VM flushes with
other physical I/O
lowmem_reserve_ratio: How much low
memory should be reserved
page-cluster: The Linux VM subsystem
avoids excessive disk seeks by reading
multiple pages on a page fault. The number
of pages it reads is dependent on the amount
of memory in your machine. The number of
pages the kernel reads in at once is equal to
2 ^ page-cluster. Values above 2 ^ 5 don’t
make much sense for swap because swap
data is only clustered in 32-page groups.
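The relation between page-cluster and the amount of data read at once can be worked out as follows. This Python sketch assumes a page size of 4 KB, which is typical for i386 but not stated in the text; the function name readahead is hypothetical.

```python
def readahead(page_cluster, page_kb=4, max_useful_cluster=5):
    """Return (pages, useful_pages, useful_kb) for a page-cluster value.

    `pages` is 2 ^ page-cluster; `useful_pages` caps this at
    2 ^ 5 = 32 pages, the swap cluster size named in the text.
    """
    pages = 2 ** page_cluster
    useful_pages = 2 ** min(page_cluster, max_useful_cluster)
    return pages, useful_pages, useful_pages * page_kb


print(readahead(3))  # prints (8, 8, 32)
print(readahead(6))  # prints (64, 32, 128)
```

With page-cluster set to 3, each swap-in reads 8 pages (32 KB under the assumed page size); raising the value beyond 5 requests more pages than one swap cluster holds.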
max_map_count: This file contains the
maximum number of memory map areas
a process may have. Memory map areas
are used as a side effect of calling malloc,
directly by mmap and mprotect, and also
when loading shared libraries. While most
applications need less than a thousand
maps, certain programs, particularly malloc
debuggers, may consume many of them
(e.g., up to one or two maps per allocation).
The default value is 65536.
min_free_kbytes: This is used to force the
Linux VM to keep a minimum number of
kilobytes free. The VM uses this number
to compute a pages_min value for each
lowmem zone in the system. Each lowmem
zone gets a number of reserved free pages
based proportionally on its size.
462-002076-001 | 10/07 | © 2007 Novell, Inc. All rights reserved. Novell, the Novell logo and the N logo are registered trademarks of
Novell, Inc. in the United States and other countries.
*Linux is a registered trademark of Linus Torvalds. All other third-party trademarks are the property of their respective owners.
Contact your local Novell®
Solutions Provider, or call
Novell at:
1 800 714 3400 U.S./Canada
1 801 861 1349 Worldwide
1 801 861 8473 Facsimile
Novell, Inc.
404 Wyman Street
Waltham, MA 02451 USA