MALLOCFAIL Errors and General Memory
Problems Troubleshoot
Document ID: 116467
Contributed by Brandon Lynch, Cisco TAC Engineer.
Sep 30, 2013
Contents
Introduction
MALLOCFAIL Errors
Processor Pool
Causes and What to Collect
I/O Pool
Causes and What to Collect
Items to Investigate
Summary
Related Information
Introduction
This document discusses MALLOCFAIL errors on native Cisco IOS®, as well as steps to take and
information to gather before you open a Cisco Technical Assistance Center (TAC) case or reload the device in
order to expedite problem resolution. This document is not exhaustive, but provides a general guideline used
in order to troubleshoot memory issues with many routers and switches.
MALLOCFAIL Errors
Memory problems manifest themselves in several ways on switches and routers. In many instances, a device
that experiences memory errors is reloaded before the appropriate data is gathered.
Memory issues generally appear in the form of MALLOCFAIL errors in the logs of your router or switch.
These errors are important because they provide "road signs" to direct the investigation. Here is a sample
MALLOCFAIL error:
%SYS-2-MALLOCFAIL: Memory allocation of 65536 bytes failed from 0x60103098,
alignment 0
Pool: Processor Free: 5453728 Cause: Memory fragmentation
Alternate Pool: None Free: 0 Cause: No Alternate pool
The first thing to notice is how much memory the device tried to allocate and how much free memory the pool has.
This example shows an attempt to allocate 65,536 bytes (64KB) from a pool that has only approximately
5.45MB free. The output indicates that, even though there is enough free memory in total, the largest contiguous
block is smaller than the requested 64KB, so the memory allocation failed. While, by definition, this is
memory fragmentation, fragmentation is not usually the root cause. Most often, the failure is simply caused by
low memory in the pool itself.
The second thing to notice is the pool type. The previous example dealt with the Processor pool. The pool type is
important because it is the first road sign that directs the investigation and determines what needs to be
checked. The pool specified should be either Processor or I/O. Here is an example of an I/O memory error:
%SYS-2-MALLOCFAIL: Memory allocation of 65548 bytes failed from 0x400B8564,
alignment 32
Pool: I/O Free: 39696 Cause: Not enough free memory
Alternate Pool: None Free: 0 Cause: No Alternate pool
The next sections detail these pools further. Once the pool is identified, you can focus your efforts
accordingly.
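The two road signs described above (requested size versus free memory, and the pool type) can be pulled out of a logged message with a short script. This is a minimal sketch that assumes the message layout matches the samples in this document. The function name parse_mallocfail and the dictionary keys are hypothetical labels chosen for this illustration, not part of Cisco IOS.

```python
import re

# Pattern for the MALLOCFAIL layout shown in the samples in this document.
# The group names are labels chosen for this sketch.
MALLOCFAIL_RE = re.compile(
    r"%SYS-(?:[A-Z]+-)*2-MALLOCFAIL: Memory allocation of (?P<size>\d+) bytes failed"
    r"\s+from (?P<caller>0x[0-9A-Fa-f]+),\s*alignment (?P<alignment>\d+)"
    r"\s+Pool: (?P<pool>\S+)\s+Free: (?P<free>\d+)\s+Cause: (?P<cause>.+?)"
    r"\s+Alternate Pool:"
)


def parse_mallocfail(text):
    """Return a dict of fields from one MALLOCFAIL message, or None."""
    # Copies of logs taken from PDF or web pages sometimes carry U+2212
    # instead of an ASCII hyphen, so normalize before matching.
    match = MALLOCFAIL_RE.search(text.replace("\u2212", "-"))
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("size", "alignment", "free"):
        fields[key] = int(fields[key])
    # True means the pool reports more free bytes than were requested, so
    # the failure points at the lack of a large enough contiguous block.
    fields["free_exceeds_request"] = fields["free"] >= fields["size"]
    return fields


PROCESSOR_SAMPLE = """%SYS-2-MALLOCFAIL: Memory allocation of 65536 bytes failed from 0x60103098,
alignment 0
Pool: Processor Free: 5453728 Cause: Memory fragmentation
Alternate Pool: None Free: 0 Cause: No Alternate pool"""

IO_SAMPLE = """%SYS-2-MALLOCFAIL: Memory allocation of 65548 bytes failed from 0x400B8564,
alignment 32
Pool: I/O Free: 39696 Cause: Not enough free memory
Alternate Pool: None Free: 0 Cause: No Alternate pool"""

for sample in (PROCESSOR_SAMPLE, IO_SAMPLE):
    info = parse_mallocfail(sample)
    print(info["pool"], info["size"], info["free"], info["free_exceeds_request"])
```

For the Processor sample the free byte count exceeds the request, while for the I/O sample it does not, which matches the Cause field reported in each message.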
Processor Pool
The Processor pool is used, as the name implies, for the various processes that run on the router or switch.
Certain processes that use memory underlie most Cisco IOS versions and platforms. For
example, Init is a process established on boot-up of most devices, and is present across various platforms.
Other processes that might be present are based on the configuration of the individual device. For example, on
platforms in which voice is configured and used, voice-specific processes consume memory, while in more
generalized configurations without voice, these processes hold less memory or none at all.
Certain processes hold more memory than others. If there are questions or concerns about a particular process,
it is best to open a TAC case to have it investigated.
Causes and What to Collect
1. If the device has recently undergone a Cisco IOS upgrade, the first thing to check is the minimum
required DRAM for the new image. This should be equal to or less than the amount of DRAM
installed on the device itself. The minimum required DRAM is listed under the image within the
Software Download Tool. Enter the show version command in order to confirm the amount of DRAM
installed:
Cisco 2821 (revision 53.51) with 210944K/51200K bytes of memory.
In order to determine the total DRAM, add these two numbers: 210944K + 51200K = 262144K. This particular
Cisco router has 256MB of DRAM.
2. Another possible cause is a memory leak caused by a Cisco IOS bug. In this situation, one process
consumes an excessive amount of memory until it runs out. Enter these commands when the memory
is low in order to collect information:
show clock
show mem stat
show proc mem sorted
show mem all totals
show log
The show proc mem sorted command lists all processes in descending order from the highest amount
of memory held to the lowest. Identify the process that holds the most memory, but exclude Init. Once you
identify it, find the Process ID (PID) for that process on the left-hand side of the output, and collect
this information:
show proc mem <PID #>
If the highest process is Dead, collect this information in addition to the previous outputs:
show mem dead totals
show mem dead
Certain processes require more in-depth investigation, but they are not covered in this document.
3. Another potential cause of memory issues is that the device simply runs out of memory because of the
processes and configuration on the device. One example of this is the Border Gateway Protocol
(BGP) router. In some instances, BGP holds a large amount of memory because of the number of
routes that it takes in. This is not caused by a Cisco IOS bug. This problem must be corrected by
altering the configuration in order to achieve optimal routing and reduce memory consumption.
If you are unsure, collect the outputs listed previously (exclude show mem dead totals and show mem
dead), and open a TAC case, because this problem will probably require further investigation.
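The DRAM check from step 1 of this list can be sketched as a short script. This is only an illustration: it assumes the memory line of show version has the form shown in the sample, and the names total_dram_kb and required_mb are hypothetical. The value assigned to required_mb is a placeholder; the real figure is the one listed for your image in the Software Download Tool.

```python
import re


def total_dram_kb(show_version_line):
    """Add the two K figures from the memory line of show version."""
    match = re.search(r"with (\d+)K/(\d+)K bytes of memory", show_version_line)
    if match is None:
        raise ValueError("no memory figures found in this line")
    return int(match.group(1)) + int(match.group(2))


line = "Cisco 2821 (revision 53.51) with 210944K/51200K bytes of memory."
installed_kb = total_dram_kb(line)
installed_mb = installed_kb // 1024
print(installed_kb, installed_mb)  # 262144 256

# Placeholder value for illustration only; look up the real minimum for
# the image in the Software Download Tool.
required_mb = 256
print("meets minimum:", installed_mb >= required_mb)
```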
I/O Pool
The I/O pool refers to the I/O buffers seen with the show buffers command. These buffers are used for,
among other things, process-switched traffic such as routing updates or broadcasts. I/O memory is broken
down into pools, which are shown in the show buffers command output. These pools are based on packet size,
which allows more efficient allocation of memory based on the needs.
Causes and What to Collect
1. The first thing to check with I/O memory issues is a potential buffer leak caused by a Cisco IOS bug.
This often manifests itself as a particular pool that increases its number of buffers without releasing
them back into the I/O pool once they are no longer needed. Here is an example of this:
--------- show buffers --------
Buffer elements:
500 in free list (500 max allowed)
3220350364 hits, 0 misses, 0 created
Public buffer pools:
Small buffers, 104 bytes (total 6144, permanent 6144):
3867 in free list (2048 min, 8192 max allowed)
248913132 hits, 0 misses, 0 trims, 0 created
0 failures (0 no memory)
Medium buffers, 256 bytes (total 86401, permanent 3000, peak 86401 @ 05:18:11):
0 in free list (64 min, 3000 max allowed)
9697361 hits, 203293 misses, 2208 trims, 85609 created
167633 failures (651288 no memory)
Middle buffers, 600 bytes (total 512, permanent 512):
0 in free list (64 min, 1024 max allowed)
9284431 hits, 237750 misses, 0 trims, 0 created
224619 failures (680486 no memory)
Big buffers, 1536 bytes (total 1000, permanent 1000):
0 in free list (64 min, 1000 max allowed)
69471745 hits, 895218 misses, 0 trims, 0 created
842142 failures (1821074 no memory)
VeryBig buffers, 4520 bytes (total 10, permanent 10, peak 122 @ 1w3d):
0 in free list (0 min, 100 max allowed)
2120517 hits, 1632477 misses, 112 trims, 112 created
1632421 failures (3272987 no memory)
Large buffers, 9240 bytes (total 8, permanent 8, peak 18 @ 1w3d):
0 in free list (0 min, 10 max allowed)
9593 hits, 832217 misses, 44 trims, 44 created
832195 failures (1651309 no memory)
Huge buffers, 18024 bytes (total 2, permanent 2):
0 in free list (0 min, 4 max allowed)
1325 hits, 831497 misses, 0 trims, 0 created
831494 failures (1649904 no memory)
The previous output clearly shows that the problem is with the Medium pool. Its total value (86401) is much
higher than the permanent amount set for that pool (3000). The output shows that, even with over 86,000
buffers in the pool, there are 0 available in the free list. Finally, the output shows that the number of
trims (2208) is much lower than the number created (85609), which indicates that these buffers have not been
released back into the I/O pool for further consumption. For further explanation of these fields, see the
Definitions for Buffer Pool Fields link in the Related Information section at the end of this document.
For this scenario, first capture these outputs:
show clock
show mem stat
show buffers
show log
Once the problematic pool or pools are determined, enter this command in order to focus on the
problem pool(s):
show buffer pool <pool name> packet
This command might provide extensive output. You can usually determine which packets reside in
these buffers and who allocated them within a few pages of the output.
2. Another possible cause is a network or traffic event. This often manifests itself as excessive utilization in
multiple pools. It is recommended that you collect the previous outputs, along with the show buffer
pool <pool name> packet command output for the pools that show this utilization, and open
a TAC case. This is often caused by an abnormal or unexpected traffic flow that must be
process-switched by the device. Because the flow might be bursty and quick, the device can run out of I/O
memory in a relatively short period of time. In order to troubleshoot this type of problem, you must usually
identify the source of the traffic in order to see if this flow is abnormal and, if so, eliminate or
block it.
3. A rarer cause is that a specific pool is more heavily utilized because of certain traffic that
is needed in the network environment. This traffic might, for some reason, need to be process-switched,
and there is no way to avoid this in the network. This scenario must be confirmed further, and then
appropriate action must be taken. The same outputs from step 1 apply here.
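The leak signs described in step 1 of this list (a total far above the permanent value, nothing in the free list, and far fewer trims than buffers created) can be checked against a saved show buffers output. This is a rough sketch under stated assumptions: the line layout is taken from the sample in this document, the names parse_pools and leak_suspects are hypothetical, and the growth_factor threshold of 2 is an arbitrary choice for illustration, not a Cisco recommendation.

```python
import re

HEADER_RE = re.compile(
    r"^\s*(?P<name>\w+) buffers, (?P<size>\d+) bytes "
    r"\(total (?P<total>\d+), permanent (?P<permanent>\d+)"
)
FREE_RE = re.compile(r"^\s*(?P<free>\d+) in free list")
STATS_RE = re.compile(r"(?P<trims>\d+) trims, (?P<created>\d+) created")


def parse_pools(output):
    """Return one dict per buffer pool found in show buffers output."""
    pools = []
    for line in output.splitlines():
        header = HEADER_RE.match(line)
        if header:
            pools.append(
                {k: (v if k == "name" else int(v)) for k, v in header.groupdict().items()}
            )
            continue
        if not pools:
            continue  # lines before the first pool, such as Buffer elements
        for pattern in (FREE_RE, STATS_RE):
            found = pattern.search(line)
            if found:
                pools[-1].update({k: int(v) for k, v in found.groupdict().items()})
    return pools


def leak_suspects(pools, growth_factor=2):
    """Names of pools that show all three leak signs (threshold is assumed)."""
    return [
        p["name"]
        for p in pools
        if p["total"] > growth_factor * p["permanent"]
        and p.get("free") == 0
        and p.get("trims", 0) < p.get("created", 0)
    ]


SAMPLE = """Buffer elements:
500 in free list (500 max allowed)
3220350364 hits, 0 misses, 0 created
Public buffer pools:
Small buffers, 104 bytes (total 6144, permanent 6144):
3867 in free list (2048 min, 8192 max allowed)
248913132 hits, 0 misses, 0 trims, 0 created
0 failures (0 no memory)
Medium buffers, 256 bytes (total 86401, permanent 3000, peak 86401 @ 05:18:11):
0 in free list (64 min, 3000 max allowed)
9697361 hits, 203293 misses, 2208 trims, 85609 created
167633 failures (651288 no memory)
VeryBig buffers, 4520 bytes (total 10, permanent 10, peak 122 @ 1w3d):
0 in free list (0 min, 100 max allowed)
2120517 hits, 1632477 misses, 112 trims, 112 created
1632421 failures (3272987 no memory)"""

print(leak_suspects(parse_pools(SAMPLE)))  # ['Medium']
```

A pool that this check flags is only a candidate. The show buffer pool <pool name> packet output is still needed in order to see what occupies the buffers.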
Items to Investigate
On most routers, MALLOCFAIL errors follow the format of the examples presented previously. On Cisco Catalyst
6500 Series switches and 7600 Series routers with Supervisor Engines (SUPs) or Route Switch Processors
(RSPs), these errors might vary. For example, this error was taken from the Route Processor (RP) logs on a
6500 Series switch:
%SYS-SP-2-MALLOCFAIL: Memory allocation of 820 bytes failed from 0x40C83B60,
alignment 32
Pool: I/O Free: 48 Cause: Not enough free memory
Alternate Pool: None Free: 0 Cause: No Alternate pool
The MALLOCFAIL error shows that the Switch Processor (SP) of the SUP reports the problem, not the RP. If
the problem is associated with the RP, the SP designation in the error is not present. For this reason, the
previous outputs must be taken from the SP. In order to accomplish this, precede the commands with:
remote command switch
The error message might also refer to the RP or SP of the standby SUP/RSP, as denoted by STDBY, in which case
the outputs need to be collected from that processor.
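The choice of where to run the collection commands can be sketched as follows. This is an illustration only: the function name commands_for is hypothetical, the command lists are the ones given earlier in this document, and the script assumes that the SP designation appears between the facility and the severity exactly as in the sample message. The STDBY case is deliberately not handled, because this document does not give the command used to reach the standby processor.

```python
import re

# Command lists as given in the Processor Pool and I/O Pool sections.
PROCESSOR_COMMANDS = [
    "show clock",
    "show mem stat",
    "show proc mem sorted",
    "show mem all totals",
    "show log",
]
IO_COMMANDS = ["show clock", "show mem stat", "show buffers", "show log"]


def commands_for(message):
    """Return the commands to collect for one MALLOCFAIL message."""
    text = message.replace("\u2212", "-")  # normalize PDF-style minus signs
    tag = re.search(r"%SYS-((?:[A-Z]+-)*)2-MALLOCFAIL", text)
    pool = re.search(r"Pool: (\S+)", text)
    if tag is None or pool is None:
        raise ValueError("not a MALLOCFAIL message")
    commands = IO_COMMANDS if pool.group(1) == "I/O" else PROCESSOR_COMMANDS
    # An SP designation means the Switch Processor reported the problem,
    # so each command is preceded with remote command switch.
    if "SP" in tag.group(1).split("-"):
        commands = ["remote command switch " + c for c in commands]
    return commands


SP_SAMPLE = """%SYS-SP-2-MALLOCFAIL: Memory allocation of 820 bytes failed from 0x40C83B60,
alignment 32
Pool: I/O Free: 48 Cause: Not enough free memory
Alternate Pool: None Free: 0 Cause: No Alternate pool"""

for command in commands_for(SP_SAMPLE):
    print(command)
```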
Summary
You might speed up case resolution and bring stability to your device more quickly if you collect the outputs
listed in this document. If any questions arise or if there is uncertainty about memory performance on a
device, it is best to open a TAC case in order to have it investigated.
Related Information
• Troubleshooting Memory Problems
• Understanding Buffer Misses and Failures - Buffer Pool
• Technical Support & Documentation − Cisco Systems
Updated: Sep 30, 2013