MALLOCFAIL Errors and General Memory Problems Troubleshoot

Document ID: 116467
Contributed by Brandon Lynch, Cisco TAC Engineer.
Sep 30, 2013

Contents

Introduction
MALLOCFAIL Errors
Processor Pool
    Causes and What to Collect
I/O Pool
    Causes and What to Collect
Items to Investigate
Summary
Related Information

Introduction

This document discusses MALLOCFAIL errors on native Cisco IOS®, as well as the steps to take and the information to gather before you open a Cisco Technical Assistance Center (TAC) case or reload the device, in order to expedite problem resolution. This document is not exhaustive, but it provides a general guideline for troubleshooting memory issues on many routers and switches.

MALLOCFAIL Errors

Memory problems manifest themselves in several ways on switches and routers. In many instances, a device that experiences memory errors is reloaded before the appropriate data is gathered. Memory issues generally appear in the form of MALLOCFAIL errors in the logs of your router or switch. These errors are important because they provide "road signs" that direct the investigation.

Here is a sample MALLOCFAIL error:

    %SYS-2-MALLOCFAIL: Memory allocation of 65536 bytes failed from 0x60103098, alignment 0
    Pool: Processor  Free: 5453728  Cause: Memory fragmentation
    Alternate Pool: None  Free: 0  Cause: No Alternate pool

The first thing to notice is how much memory the device tried to allocate and how much free memory the pool has. In this example, the device tried to allocate 65,536 bytes (64 KB) from a pool that has approximately 5.45 MB free. The output indicates that, even though the total free memory is sufficient, the largest contiguous block is smaller than the 65,536 bytes requested, so the memory allocation failed. While this is, by definition, memory fragmentation, fragmentation is not usually the root cause. Most often, the failure is simply caused by low memory in the pool itself.

The second thing to notice is the pool type. The previous example dealt with the Processor pool.
This is important because it is the first road sign that directs the investigation and determines what needs to be checked. The pool specified should be either Processor or I/O.

Here is an example of an I/O memory error:

    %SYS-2-MALLOCFAIL: Memory allocation of 65548 bytes failed from 0x400B8564, alignment 32
    Pool: I/O  Free: 39696  Cause: Not enough free memory
    Alternate Pool: None  Free: 0  Cause: No Alternate pool

The next sections describe these pools in more detail. Once the pool is identified, you can focus your efforts on the right areas.

Processor Pool

The Processor pool is used, as the name implies, for the various processes that run on the router or switch. Certain memory-consuming processes underlie most Cisco IOS versions and platforms. For example, Init is a process established at boot-up on most devices and is present across various platforms. Other processes might be present based on the configuration of the individual device. For example, on platforms on which voice is configured and used, voice-specific processes consume memory, while in more generalized configurations without voice, these processes hold little or no memory.

Certain processes hold more memory than others. If there are questions or concerns about a particular process, it is best to open a TAC case to have it investigated.

Causes and What to Collect

1. If the device has recently undergone a Cisco IOS upgrade, the first thing to check is the minimum required DRAM for the new image. This should be equal to or less than the amount of DRAM installed on the device itself. The minimum required DRAM is listed under the image within the Software Download Tool. Enter the show version command in order to confirm the amount of DRAM installed:

    Cisco 2821 (revision 53.51) with 210944K/51200K bytes of memory.

In order to determine the total DRAM, add these two numbers: 210944K + 51200K = 262144K. This particular Cisco router has 256 MB of DRAM.

2.
Another possible cause is a memory leak caused by a Cisco IOS bug. In this situation, one process consumes an increasing amount of memory until the pool runs out. Enter these commands while the memory is low in order to collect information:

    show clock
    show mem stat
    show proc mem sorted
    show mem all totals
    show log

The show proc mem sorted command lists all processes in descending order, from the highest amount of memory held to the lowest. Identify the process that holds the most memory, but exclude Init. Once that process is identified, find its Process ID (PID) in the leftmost column of the output, and collect this information:

    show proc mem <PID #>

If the highest process is Dead, collect this information in addition to the previous outputs:

    show mem dead totals
    show mem dead

Certain processes require more in-depth investigation, but they are not covered in this document.

3. Another potential cause of memory issues is that the device simply runs out of memory because of the processes and configuration it carries. One example of this is a router that runs the Border Gateway Protocol (BGP). In some instances, BGP holds a large amount of memory because of the number of routes that it takes in. This is not caused by a Cisco IOS bug. This problem must be corrected by altering the configuration in order to achieve optimal routing and reduce memory consumption. If you are unsure, collect the outputs listed previously (exclude show mem dead totals and show mem dead), and open a TAC case, because this problem will probably require further investigation.

I/O Pool

The I/O pool refers to the I/O buffers seen with the show buffers command. These buffers are used for process-switched traffic, such as routing updates or broadcasts, among other things. I/O memory is broken down into pools, which are shown in the show buffers command output. These pools are based on packet size, which allows memory to be allocated more efficiently according to need.

Causes and What to Collect

1.
The first thing to check with I/O memory issues is a potential buffer leak caused by a Cisco IOS bug. This often manifests itself as a particular pool that increases its number of buffers without releasing them back into the I/O pool once they are no longer needed. Here is an example of this:

    --------- show buffers --------

    Buffer elements:
         500 in free list (500 max allowed)
         3220350364 hits, 0 misses, 0 created

    Public buffer pools:
    Small buffers, 104 bytes (total 6144, permanent 6144):
         3867 in free list (2048 min, 8192 max allowed)
         248913132 hits, 0 misses, 0 trims, 0 created
         0 failures (0 no memory)
    Medium buffers, 256 bytes (total 86401, permanent 3000, peak 86401 @ 05:18:11):
         0 in free list (64 min, 3000 max allowed)
         9697361 hits, 203293 misses, 2208 trims, 85609 created
         167633 failures (651288 no memory)
    Middle buffers, 600 bytes (total 512, permanent 512):
         0 in free list (64 min, 1024 max allowed)
         9284431 hits, 237750 misses, 0 trims, 0 created
         224619 failures (680486 no memory)
    Big buffers, 1536 bytes (total 1000, permanent 1000):
         0 in free list (64 min, 1000 max allowed)
         69471745 hits, 895218 misses, 0 trims, 0 created
         842142 failures (1821074 no memory)
    VeryBig buffers, 4520 bytes (total 10, permanent 10, peak 122 @ 1w3d):
         0 in free list (0 min, 100 max allowed)
         2120517 hits, 1632477 misses, 112 trims, 112 created
         1632421 failures (3272987 no memory)
    Large buffers, 9240 bytes (total 8, permanent 8, peak 18 @ 1w3d):
         0 in free list (0 min, 10 max allowed)
         9593 hits, 832217 misses, 44 trims, 44 created
         832195 failures (1651309 no memory)
    Huge buffers, 18024 bytes (total 2, permanent 2):
         0 in free list (0 min, 4 max allowed)
         1325 hits, 831497 misses, 0 trims, 0 created
         831494 failures (1649904 no memory)

The previous output clearly shows that the problem is with the Medium pool. Its total value (86401) is much higher than the permanent amount set for that pool (3000). The output also shows that, even with over 86,000 buffers in the pool, there are 0 available in the free list.
Finally, the output shows that the number of trims is much lower than the number created, which indicates that these buffers have not been released back into the I/O pool for further use. For further explanation of these fields, see the Definitions for Buffer Pool Fields link in the Related Information section at the end of this document.

For this scenario, first capture these outputs:

    show clock
    show mem stat
    show buffers
    show log

Once the problematic pool or pools are determined, enter this command for each problem pool:

    show buffer pool <pool name> packet

This command might provide extensive output. Within a few pages of the output, you can usually determine which packets reside in these buffers and which process allocated them.

2. Another possible cause is a network or traffic event. This often manifests itself as excessive utilization in multiple pools. It is recommended that you collect the previous outputs, along with the show buffer pool <pool name> packet command output for the pools that show this utilization, and that you open a TAC case. This is often caused by an abnormal or unexpected traffic flow that must be process-switched by the device. Because the flow might be bursty and short-lived, the device can run out of I/O memory in a relatively short period of time. In order to troubleshoot this type of problem, you usually must identify the source of the traffic in order to see whether the flow is abnormal and, if so, eliminate or block it.

3. A rarer case is that a specific pool is more heavily utilized because of certain traffic that is needed in the network environment. This traffic might, for some reason, need to be process-switched, and there is no way to avoid this in the network. This scenario must be confirmed further, and then appropriate action must be taken. The same outputs from step 1 apply here.

Items to Investigate

On most routers, the MALLOCFAIL error examples presented previously are standard.
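As a rough illustration of how the road signs in a standard MALLOCFAIL message can be read programmatically, here is a minimal Python sketch. It is not a Cisco tool. The regular expression is an assumption derived only from the sample messages in this document, and the names MALLOCFAIL_RE, parse_mallocfail, SAMPLE, and free_covers_request are hypothetical.

```python
import re

# Assumed message layout, inferred only from the samples in this document.
# Optional uppercase segments between SYS and the severity are captured as a
# qualifier; it is empty in the standard format.
MALLOCFAIL_RE = re.compile(
    r"%SYS-(?P<qualifier>(?:[A-Z]+-)*)(?P<severity>\d)-MALLOCFAIL:\s+"
    r"Memory allocation of (?P<requested>\d+) bytes failed from "
    r"(?P<caller>0x[0-9A-Fa-f]+), alignment (?P<alignment>\d+)\s+"
    r"Pool: (?P<pool>.+?)\s+Free: (?P<free>\d+)\s+"
    r"Cause: (?P<cause>.+?)\s+Alternate Pool:"
)

SAMPLE = (
    "%SYS-2-MALLOCFAIL: Memory allocation of 65536 bytes failed from "
    "0x60103098, alignment 0 Pool: Processor Free: 5453728 "
    "Cause: Memory fragmentation Alternate Pool: None Free: 0 "
    "Cause: No Alternate pool"
)


def parse_mallocfail(message):
    """Return the main fields of a MALLOCFAIL message, or None if no match."""
    # Text copied from a PDF may carry U+2212 minus signs instead of hyphens.
    text = message.replace("\u2212", "-")
    match = MALLOCFAIL_RE.search(text)
    if match is None:
        return None
    requested = int(match.group("requested"))
    free = int(match.group("free"))
    return {
        "qualifier": match.group("qualifier").rstrip("-"),
        "pool": match.group("pool"),
        "requested": requested,
        "free": free,
        "cause": match.group("cause"),
        # True means the pool had enough total free memory, so no single
        # contiguous block was large enough for the request.
        "free_covers_request": free >= requested,
    }


if __name__ == "__main__":
    print(parse_mallocfail(SAMPLE))
```

The pool field tells you which of the two collection procedures applies, and free_covers_request separates the "enough free memory but no contiguous block" case from plain exhaustion of the pool.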
On Cisco Catalyst 6500 Series switches and 7600 Series routers with Supervisor Engines (SUPs) or Route Switch Processors (RSPs), these errors might vary. For example, this error was taken from the Route Processor (RP) logs on a 6500 Series switch:

    %SYS-SP-2-MALLOCFAIL: Memory allocation of 820 bytes failed from 0x40C83B60, alignment 32
    Pool: I/O  Free: 48  Cause: Not enough free memory
    Alternate Pool: None  Free: 0  Cause: No Alternate pool

The SP designation in this MALLOCFAIL error shows that the Switch Processor (SP) of the SUP reports the problem, not the RP. If the problem is associated with the RP, the SP designation is not present in the error. For this reason, the previous outputs must be taken from the SP. In order to accomplish this, precede the commands with:

    remote command switch

For example, enter remote command switch show buffers in order to collect the show buffers output from the SP.

The error message might also refer to the RP or SP of the standby SUP/RSP, as denoted by STDBY. In that case, the outputs need to be collected from the standby processor that reports the error.

Summary

You might speed up case resolution and bring stability to your device more quickly if you collect the outputs listed in this document. If any questions arise or if there is uncertainty about memory performance on a device, it is best to open a TAC case in order to have it investigated.

Related Information

• Troubleshooting Memory Problems
• Understanding Buffer Misses and Failures - Buffer Pool
• Technical Support & Documentation - Cisco Systems

Updated: Sep 30, 2013