The Essentials Series
Solving Network Problems Before They Occur
sponsored by KNOW YOUR NETWORK
by Greg Shields

Article 1: How to Use SNMP in Network Problem Resolution
    SNMP, the Solution
    SNMP, Total Network Awareness
    SNMP, Disaster Protection
    SNMP, Easy Implementation
Article 2: How to Use WMI in Network Problem Resolution
    The Network Rosetta Stone
    WMI, Finger‐Pointer Preventer
    WMI, Keeping Email Operational
    WMI, Network Monitoring for Servers and Applications
Article 3: How Effective Configuration Management Aids in Network Problem Resolution
    Config Management, Little Problems with Big Impact
    Config Management, When the Fix Is Harder than the Problem
    Solving Network Problems Requires the Right Vision

Copyright Statement

© 2009 Realtime Publishers. All rights reserved. This site contains materials that have been created, developed, or commissioned by, and published with the permission of, Realtime Publishers (the “Materials”) and this site and any such Materials are protected by international copyright and trademark laws.

THE MATERIALS ARE PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. The Materials are subject to change without notice and do not represent a commitment on the part of Realtime Publishers or its web site sponsors. In no event shall Realtime Publishers or its web site sponsors be held liable for technical or editorial errors or omissions contained in the Materials, including without limitation, for any direct, indirect, incidental, special, exemplary or consequential damages whatsoever resulting from the use of any information contained in the Materials.

The Materials (including but not limited to the text, images, audio, and/or video) may not be copied, reproduced, republished, uploaded, posted, transmitted, or distributed in any way, in whole or in part, except that one copy may be downloaded for your personal, noncommercial use on a single computer. In connection with such use, you may not modify or obscure any copyright or other proprietary notice.

The Materials may contain trademarks, services marks and logos that are the property of third parties. You are not permitted to use these trademarks, services marks or logos without prior written consent of such third parties.

Realtime Publishers and the Realtime Publishers logo are registered in the US Patent & Trademark Office. All other product or service names are the property of their respective owners.
If you have any questions about these terms, or if you would like information about licensing materials from Realtime Publishers, please contact us via e-mail at info@realtimepublishers.com.

Article 1: How to Use SNMP in Network Problem Resolution

I’ve spent almost 15 years of my life as an IT professional. In that time I’ve been a phone support operator, field technician, systems administrator, consultant, and now an independent technology author and presenter. Through those experiences, I’ve seen a wide range of very different environments in very different businesses. Those IT environments range from the exceptionally simple, installed into actual closets within small business offices, all the way to multi‐enterprise, multi‐national collaborative networks.

What’s interesting about all of them is their similarity. Some networks have more applications than others. Some have faster connections between sites. Some use more remote applications. Yet there’s a common thread in all of them: from time to time, they all have problems.

There’s also something remarkably strange about those networks I’ve seen. Even though we can all agree that every network occasionally has its problems, relatively few have the tools in place to find and fix them. For reasons of cost, or time, or lack of subject knowledge, many IT organizations haven’t implemented unified and comprehensive network monitoring solutions.

It is my goal in this Essentials Series to explain why you should. With the right platform in place, you’ll experience less downtime, more customer satisfaction, and fewer late nights tracking down the network problems of the day. Using a series of examples from my own experience, I want to show you how effective network monitoring can help to solve network problems before they occur.

SNMP, the Solution

Let’s start by looking at actual solutions to your network’s visibility problem. Networks are by nature very opaque.
You can’t simply peer through cables or into routers to see the behaviors going on during their operation. To see what’s going on in your network, you need tools that do the peering for you.

Those tools start with the individual devices themselves. For example, if you queried the interface statistics on a Cisco router, you would be greeted with information about that interface’s traffic:

    router1#show int
    Ethernet0 is up, line protocol is up
    […snip…]
      37592 packets input, 2859273 bytes, 0 no buffer
      Received 15938 broadcasts, 0 runts, 0 giants
      0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort
      0 input packets with dribble condition detected
      15288 packets output, 1395393 bytes, 0 underruns
      0 output errors, 0 collisions, 1 interface resets, 0 restarts
      0 output buffer failures, 0 output buffers swapped out

That information is descriptive of the individual device you’ve logged into, but stops there. Today’s network devices natively include all the necessary capabilities to gather and report on their network traffic statistics. You can request this information from each device and manually build a picture of how your network is operating. However, the complexity of doing so rises dramatically as your network’s count of interconnected devices goes much past one.

To combat these complexities, the Simple Network Management Protocol (SNMP) was ratified in the early 1990s. This protocol enables a request‐response framework between individual devices and a central Network Management Solution (NMS). Individual devices can be polled for their information through a GET request by the NMS. Device information is stored and can be addressed via its globally‐unique Management Information Base (MIB) Object Identifier (OID). An OID’s long string of digits represents the “address” for the unit of information being stored on that device.
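To make the OID-as-address idea concrete, here is a minimal sketch in Python. It does not speak SNMP on the wire; it only models a device's MIB as a lookup table keyed by OID and answers a GET-style request against it. The OIDs are the standard MIB-II identifiers for sysDescr, ifInOctets, and ifOutOctets. Everything else is an assumption made for illustration: the names `FAKE_MIB`, `parse_oid`, and `snmp_get` are hypothetical, the description string is invented, and the octet counts are simply the byte counts from the router output above, assuming Ethernet0 is interface index 1.

```python
from typing import Dict, Tuple, Union

Oid = Tuple[int, ...]
Value = Union[int, str]


def parse_oid(dotted: str) -> Oid:
    """Turn a dotted string such as '1.3.6.1.2.1.1.1.0' into a tuple of ints."""
    return tuple(int(part) for part in dotted.strip(".").split("."))


# Hypothetical device state, keyed by OID. The values are illustrative only.
FAKE_MIB: Dict[Oid, Value] = {
    parse_oid("1.3.6.1.2.1.1.1.0"): "router1 (illustrative description)",  # sysDescr.0
    parse_oid("1.3.6.1.2.1.2.2.1.10.1"): 2859273,  # ifInOctets.1
    parse_oid("1.3.6.1.2.1.2.2.1.16.1"): 1395393,  # ifOutOctets.1
}


def snmp_get(mib: Dict[Oid, Value], dotted_oid: str) -> Value:
    """Answer a GET-style request: return the value stored at exactly this OID."""
    oid = parse_oid(dotted_oid)
    if oid not in mib:
        # A real agent would report an error status; the sketch raises instead.
        raise KeyError(f"no such object: {dotted_oid}")
    return mib[oid]


if __name__ == "__main__":
    print(snmp_get(FAKE_MIB, "1.3.6.1.2.1.2.2.1.10.1"))
```

An NMS does essentially this across many devices on a schedule, storing each answer so that it can be graphed and compared over time.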
Information being stored can relate to network statistics, details about that device’s configuration, performance and throughput metrics, or really any information that the device’s manufacturer has enabled.

SNMP’s poll‐based nature means that information must be requested if it is to be sent back to the NMS. Because polling alone would leave the NMS unaware of a problem until its next request, SNMP also has a unidirectional alert component. An SNMP “trap” represents a preconfigured alert from a device back to its NMS, reporting on conditions that the NMS should know about. This setup enables SNMP clients to rapidly notify the NMS when problems exist.

SNMP also comes in several versions, with later versions adding desired features over those in the previous. SNMP v3 is the version most environments use today because it adds a suite of critical security features that protect its data in transit and authenticate servers prior to communication. Its encryption ensures that data, sent as clear text in earlier versions, is protected from prying eyes, while servers must prove their identity before they’re communicated with.

You’ll probably recognize that this information on SNMP is neither new nor revolutionary in the way it works. With SNMP rapidly approaching its 20th birthday, its protocol is mature and its capabilities are well known. Yet if that is so, why are so many IT organizations still not using it? Perhaps they don’t understand its true power in solving network problems before they occur. Consider a few examples…

SNMP, Total Network Awareness

Recognizing how SNMP does its job is far less exciting than realizing how it can spot and solve network problems. The information gained through SNMP connections and stored in a central NMS enables a situational awareness of your network. This awareness illuminates the behaviors on all devices through a single console, providing you a single heads‐up display of your network’s health.
As an example of this, I used to work for a company that built satellite ground stations. This company’s complex development activity required the cooperation of multiple business units and even multiple companies, all in different locations. To ensure that everyone was working on the same page, we architected a centralized collaboration environment that brought all parties together to the same set of applications.

This remote application infrastructure was a perfect solution for its users, enabling them to share documents and work together whether they were in Colorado, California, Massachusetts, or anywhere. Perfect, that is, until the network began experiencing problems.

Remote application infrastructures, such as Microsoft Terminal Services or Citrix XenApp, by nature perform well over low‐bandwidth connections. They enable users to work on remote applications as if they were installed locally, even over the slowest of network lines. Yet although they do well in low‐bandwidth situations, the streaming nature of their protocols means they do not do well across connections that are highly latent.

In this environment, it was well known that certain WAN connections to certain sites would experience latency from time to time. This project’s network traffic was only a portion of the traffic sourcing from each site. Rather than waiting for administrators to get phone calls when users’ experience degraded, this environment elected instead to configure SNMP across each remote device. Each device was configured to report to a central NMS. That NMS queried each device for its interface utilization and ping latency statistics on a regular basis. Traps and subsequent administrator alerts were additionally set up to notify the central NMS when metrics moved outside acceptable thresholds.

Figure 1: SNMP enables the creation of ping latency graphs across multiple devices.

The result was the creation of a real‐time graph similar to that shown in Figure 1.
There, you can see where ping latency information across devices was graphed, giving administrators information about the health of each connection. Because the right people were also alerted as conditions crossed those thresholds, they were able to compensate as necessary to maintain their users’ experience.

SNMP, Disaster Protection

Although SNMP is most commonly associated with gathering network statistics and configurations, it extends to non‐network devices as well. SNMP was originally developed as a communications framework between all kinds of networked devices. Thus, any device with a network connection can potentially receive and respond to SNMP requests or send its own traps.

Nowhere is this more valuable than with the environmental sensors used in many data centers today. These environmental sensors regularly check the temperature, humidity, and (in the case of accidental flooding) water level present in the data center room. The installation and use of these sensors is critical to ensuring that your expensive IT investment doesn’t melt down if your data center air conditioning stops functioning.

That exact situation happened to me at another former client. I had the lucky privilege of stepping into their data center on the very day their air conditioning unit experienced a massive, yet unnoticed, failure. Walking into that data center, the massive outpouring of heat made me immediately recognize that something was terribly wrong. I looked over to the room’s temperature sensor (a cheap model more often found attached to the outside of your bedroom window) to discover that the temperature had crossed the 80° threshold and was increasing at a rate of 1° every 10 minutes. Humidity was similarly affected. Although the problem was quickly resolved through the forced shutdown of non‐essential equipment and the introduction of backup air conditioning, the problem could have been dramatically worse had my timing been different.
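The check that a networked sensor and an NMS could have run in that room can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the readings are hard-coded rather than polled from a sensor, the 80° threshold comes from the story above, and the function names and the rate limit of 0.05° per minute are hypothetical choices, not values from any product.

```python
from typing import List, Tuple

Reading = Tuple[float, float]  # (minutes since first sample, temperature in degrees)


def rate_per_minute(readings: List[Reading]) -> float:
    """Average change in degrees per minute between the first and last reading."""
    (t0, v0), (t1, v1) = readings[0], readings[-1]
    if t1 == t0:
        return 0.0
    return (v1 - v0) / (t1 - t0)


def check_temperature(
    readings: List[Reading],
    threshold: float = 80.0,   # from the story above
    max_rate: float = 0.05,    # hypothetical limit, degrees per minute
) -> List[str]:
    """Return alert messages for a threshold breach and for a rapid rise."""
    alerts = []
    latest = readings[-1][1]
    if latest >= threshold:
        alerts.append(f"temperature {latest:.1f} at or above threshold {threshold:.1f}")
    rate = rate_per_minute(readings)
    if rate >= max_rate:
        alerts.append(f"temperature rising {rate * 10:.1f} degrees per 10 minutes")
    return alerts


if __name__ == "__main__":
    # Illustrative samples: one degree of rise every ten minutes.
    for message in check_temperature([(0, 79.0), (10, 80.0), (20, 81.0)]):
        print(message)
```

Checking the rate of change as well as the absolute value matters here: a rise of 1° every 10 minutes is worth an alert well before any fixed threshold is crossed.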
The network‐enabling of data center sensors using protocols such as SNMP illuminates another of this protocol’s key value propositions. With the right tools in place, an alert could have notified administrators immediately when temperature conditions in the data center started their deviation. Consolidating SNMP’s data into a unified network management solution enables the real‐time alerting of problems directly to network administrators.

SNMP, Easy Implementation

As I travel across IT environments, I find that a common hurdle in implementing comprehensive monitoring relates to its perceived difficulty in implementation. Although numerous enterprise‐scale monitoring solutions are available today, their implementation often installs little more than an empty shell to be later populated by dedicated monitoring administrators. Environments that aren’t necessarily “enterprise” need cost‐effective solutions that can be implemented quickly and without specialized knowledge.

The right solutions for your environment will immediately begin gathering useful data with a minimum of daily maintenance. As you’ll discover in the next article of this series, such solutions integrate with servers and applications as well as networking devices to provide a complete view into your network.

Article 2: How to Use WMI in Network Problem Resolution

I’ve found myself constantly amazed at the language barrier we experience in the world of IT. I’m not talking here about the barrier between the technologists and the non‐technologists, the geek and non‐geek. I’m speaking about the language barrier we’ve all experienced between an organization’s “server” administrators and their “network” administrators.

You’ve probably been in the same situation as you’ve been called in to work together on a big problem. Your network team sits at one side of the conference table, while the server admins take over the other.
Although some problem is preventing your users from getting their job done, the two opposing teams pull out domain‐specific vocabulary the other doesn’t understand in an effort to prove that the problem isn’t their fault.

This circle‐the‐wagons approach to solving IT problems has been around as long as the problems themselves. When a major problem occurs in the environment, a common approach is to gather everyone that is potentially involved and focus them on today’s firefight. Yet solving problems in this way is expensive in terms of people’s time and in cost to your business. There’s got to be a better way.

The Network Rosetta Stone

With the right Network Management Solution (NMS) in place, it is possible to improve your resolution of large‐scale problems without the finger pointing. The right solutions leverage integration to servers and applications as well as network components to provide complete visibility into your operating environment. The result is that your NMS becomes a kind of Rosetta Stone or universal translation device between IT teams; the NMS helps the network team understand the impact of servers and applications, while giving systems administrators a perspective on the network infrastructure.

One way in which an NMS, acting as a Rosetta Stone, translates your Microsoft Windows computers is through Windows Management Instrumentation (WMI) integration. Microsoft’s WMI is a platform‐specific service that enables third‐party devices to query the Microsoft OS for details about its behaviors. As a rough analogy, if you consider the Simple Network Management Protocol (SNMP) the request/response tool for network device monitoring, WMI performs the same actions within the Windows OS. A typical WMI query might look like this:

    Select FreeSpace from Win32_LogicalDisk

In this query, the targeted machine is asked to provide the amount of free space on its installed volumes.
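What an NMS might do with the answer to such a query can be sketched as follows. The sketch does not run WMI at all; it assumes the rows have already been fetched by some WMI client using a slightly wider query (DeviceID, FreeSpace, and Size, all real properties of Win32_LogicalDisk) and handed over as dictionaries. The sample rows, the function names, and the 10 percent threshold are hypothetical. Numeric values are converted with `int()` because WMI clients may deliver 64-bit counters as strings.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]

# Hypothetical result rows for:
#   SELECT DeviceID, FreeSpace, Size FROM Win32_LogicalDisk
SAMPLE_ROWS: List[Row] = [
    {"DeviceID": "C:", "FreeSpace": "4000000000", "Size": "100000000000"},
    {"DeviceID": "D:", "FreeSpace": "250000000000", "Size": "500000000000"},
    {"DeviceID": "E:", "FreeSpace": None, "Size": None},  # e.g. an empty drive
]


def percent_free(row: Row) -> Optional[float]:
    """Return free space as a percentage of size, or None if it is unknown."""
    size, free = row.get("Size"), row.get("FreeSpace")
    if size in (None, "", "0") or free in (None, ""):
        return None
    return 100.0 * int(free) / int(size)


def low_disks(rows: List[Row], min_percent_free: float = 10.0) -> List[str]:
    """List the volumes whose free space has fallen below the threshold."""
    flagged = []
    for row in rows:
        pct = percent_free(row)
        if pct is not None and pct < min_percent_free:
            flagged.append(str(row["DeviceID"]))
    return flagged


if __name__ == "__main__":
    print(low_disks(SAMPLE_ROWS))
```

Expressing the threshold as a percentage rather than a byte count lets one rule cover volumes of very different sizes.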
A WMI query looks different from SNMP’s numerical Object Identifier (OID) approach, but the result is the same. An NMS queries for information across multiple Windows machines, storing the results into its local database and reporting them along its management consoles.

Because multiple Windows machines can be queried by a central monitoring server, that server becomes the locus of analysis for behaviors across servers and network devices (see Figure 1). As a result, it becomes dramatically easier to locate or prevent network problems because their root cause can be tracked to very specific endpoints and behaviors.

Figure 1: A unified dashboard that displays information about servers, applications, and network devices in one place.

WMI, Finger‐Pointer Preventer

It is the intersection of WMI and SNMP monitoring where an NMS provides great value. It also helps out with the historical problem of teams pointing fingers at each other. Consider another situation I experienced not long ago with one of my consulting clients. At this client, a particular Windows virtual machine was experiencing an intermittent problem with its network connection. That network problem would occur only irregularly; however, when it did occur, it impacted a large number of users. Thus, resolving this problem was extremely important for this client.

The client was very focused on the perceived network source for this problem, pointing their attention and resources to the network and its behaviors. “There must be something wrong with the network cards, their drivers, or their firmware,” they would tell me.

Yet what they did not recognize was how virtualization tends to significantly increase the complexity of troubleshooting these types of problems. With multiple virtual machines co‐located atop a single virtual host, a simple network problem’s root cause can be something as seemingly unrelated as a shortage of system memory or too much consumption of processor cycles.
To resolve the situation, a unified NMS was implemented that enabled the collection of and reporting on metrics through SNMP and WMI statistics. This same solution integrated with the virtualization platform to provide additional data about its processing as well (see Figure 2).

The cause of the problem was discovered the very next time it occurred. WMI queries to the virtual host revealed that the host’s processor utilization spiked dramatically at the very moment the networking problem occurred. The solution was to offload some of that virtual host’s workload to other servers to prevent the resource‐overuse situation.

Figure 2: A single view with SNMP, WMI, and even virtualization counters provides a holistic view of the entire environment.

WMI, Keeping Email Operational

Another averted crisis that may strike home in your own network environment has to do with keeping email servers up and running. Although most businesses can endure the loss of file servers for a day, or even a few databases for a few hours, the loss of the email system usually sends a business’ executives into orbit. That’s why in organizations both large and small, the email system is often considered one of the most important services to remain up and operational.

Email at the same time can be one of the most dynamic data processing systems in your data center. Handling thousands of messages a day in even the smallest of environments, email systems must effortlessly deal with large attachments, malware, and addressing failures while preserving the users’ experience within their desktop email clients.

I was once called in to architect a monitoring solution for a company in the financial services industry. Although this client needed the monitoring solution for their entire multi‐site infrastructure, the real reason for its implementation was a series of regular and painful problems with the email server.
Implementing the right kind of tools for this small business of fewer than 100 employees was a trivial installation. Connecting it to network devices, identified servers, and even a few clients was not difficult because the system included preconfigured templates for each type of device. We completed the installation and initial configuration in less than a day.

The next morning, I returned to the client to find an extremely tired but extremely happy systems administrator sitting at his desk. It turns out that the majority of the problems with the email system were related to users overfilling it with data to the point where it would consume all its available disk space. That very night after the installation of this monitoring system, the administrator received an alert notifying him that the email server’s disk drive was within a few percentage points of full consumption.

Unlike in each of the previous incidents, this administrator was able to add the necessary disk space prior to the email server’s database shutting down. The right level of monitoring across network, server, and even application facets of the IT environment prevented the problem from ever occurring.

WMI, Network Monitoring for Servers and Applications

As with the previous article in this series, these stories are told to explain why effective monitoring goes far in preventing problems. With the right monitoring that spans every part of an IT environment, you gain much‐needed visibility into areas where you otherwise would have none. By integrating the network focus traditionally associated with SNMP with the server and application focus commonly used with WMI, that vision spans the entire environment. In the end, it may bring your network and server teams closer together as a cohesive unit for better managing your IT infrastructure.
Article 3: How Effective Configuration Management Aids in Network Problem Resolution

The third focus of this Essentials Series is on the need for effective configuration management, a common feature across many Network Management Solutions (NMSs) but one that sometimes gets missed. In this instance, what do I mean by configuration management? I mean the unified storage and uniform distribution of configurations to each of the devices on your network.

There is a certain brilliance in the way that most network devices can be and are configured. Using little more than text files, a smart administrator can set up their interfaces, ACLs, and essentially every other setting within these devices. Their use of text files means that one device’s configuration can very easily be replicated on another device through a file copy. Their editing is also trivial, accomplished with a simple text editor or SSH application. As an example, the following code snippet shows the simplicity of a Cisco device’s initial configuration:

    no service password-encryption
    !
    hostname Router
    !
    enable secret 5 $2m$FJdHx53V$t7rQJop3jjbXIB7n3
    !
    interface FastEthernet0/0
     ip address 192.168.1.1 255.255.255.0
     duplex auto
     speed auto
    !
    interface FastEthernet0/1
     no ip address
     duplex auto
     speed auto
     shutdown
    !
    interface Vlan1
     no ip address
     shutdown

Yet there’s a certain level of pain that comes with this simplicity. That pain grows as the number of devices and their individual configurations increases. Managing the configuration of just a few devices means that you’re responsible for just a few text files and their individual settings. But as your network grows in size and complexity, your number of elements under management grows geometrically. At some point, no one person can safely handle the sheer volume of text files and their settings that are required by a production network.
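Because device configurations are plain text, ordinary text tooling applies to them. As a minimal sketch of how two saved versions of a configuration might be compared, the Python below uses the standard library's `difflib` module. The two versions are illustrative: the first is the FastEthernet0/0 stanza from the snippet above, and the second hard-codes duplex and speed purely as a made-up change. The labels and the function names are hypothetical, and this is not a description of how any particular NMS stores or compares configurations.

```python
import difflib
from typing import List

OLD_CONFIG = """interface FastEthernet0/0
 ip address 192.168.1.1 255.255.255.0
 duplex auto
 speed auto
"""

# Illustrative edit: someone hard-codes duplex and speed on the interface.
NEW_CONFIG = OLD_CONFIG.replace(" duplex auto", " duplex half").replace(
    " speed auto", " speed 100"
)


def config_diff(old: str, new: str) -> List[str]:
    """Return a unified diff of two configuration texts, one string per line."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="config.v1",  # hypothetical version labels
            tofile="config.v2",
            lineterm="",
        )
    )


def changed_lines(old: str, new: str) -> List[str]:
    """Keep only added and removed lines, skipping the two file-header lines."""
    body = config_diff(old, new)[2:]
    return [line for line in body if line.startswith(("+", "-"))]


if __name__ == "__main__":
    for line in config_diff(OLD_CONFIG, NEW_CONFIG):
        print(line)
```

Run against every stored version of every device, a comparison like this turns "what changed, and when?" into a lookup rather than an investigation.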
It is in just this situation that the configuration management elements of an effective NMS grow extremely valuable to the IT organization. An effective NMS will include the database storage of configurations, versioning and version control of individual config files, analysis tools for comparing those files, and the ability to rapidly deploy changes to devices all across the network. In much the same way that most people program their favorite phone numbers into their cellular phones, managing a network through an NMS ensures you don’t accidentally call the wrong person, forget a phone number, or misconfigure a device in such a way that brings down the LAN.

Config Management, Little Problems with Big Impact

This workflow wraps around the traditional actions associated with changing a device config and adds a lot of value to the process. Consider a situation I experienced a number of years ago in the network of a major governmental defense contractor. There, a network condition began occurring where some servers intermittently lost their connection with the network. When those servers could talk to the network, their connection speeds were dramatically lower than expected. Network bandwidth rates were so slow that network applications began to suffer, users began calling into the help desk, and fellow administrators started contacting loved ones to report they’d be spending the night.

In this situation, the entire staff of systems administrators was tasked with resolving the problem. As the problem affected a large percentage of servers on the network, every eye was needed on the problem.

After a full day of troubleshooting by the entire staff, the problem was eventually tracked to an incorrect configuration on a particular switch in the data center. That configuration mismatched the duplex settings between the switch and its connected servers, with one side inexplicably reset to 100/Half duplex with the other at Auto/Auto.
As a result, the two sides found themselves repeatedly renegotiating their communication channel, with the resulting loss in service and performance.

In the end, a half‐dozen systems administrators lost a full day of productive work as a result of a very simple misconfiguration. This misconfiguration was set into place by a well‐meaning network engineer, who manually made a small change to a config file and accidentally introduced the error. Because the engineer completed the change using a traditional SSH connection directly to the device, the change wasn’t logged into any change management system. No one knew about the change, and so no one was looking in that location for the problem. Conversely, had the engineer made the change using an NMS’ change control engine, the error would have been found before it was released into production.

Config Management, When the Fix Is Harder than the Problem

Another story that is relatively common with network engineers involves an enterprise client of mine and their massively distributed network. This client was a single business unit of a much larger corporate network, responsible for the network traffic for many thousands of people across dozens of sites. As you can imagine, the level of networking equipment required to support the infrastructure was large and exceedingly complex.

This client and I were working on a widespread network slowdown situation. This situation was not necessarily that the network had gotten slow, or for some reason stopped operating at its expected level. In this environment, the network was slow, had been slow, and its users had grown to accept its slowness as baseline. The network engineer and I recognized that its baseline performance simply did not make sense based on the kinds of equipment in the infrastructure and the bandwidth rates between sites. In this environment, even the intra‐LAN traffic itself was slow beyond comprehension.
After a substantial amount of time peering through reports and looking through device statistics, we realized that a small but important misconfiguration had been propagated into the config files of each and every device on the network and across every site.

The specific misconfiguration is less important than the realization that the scope of the fix was far greater than our group of individuals could take on. With literally thousands of devices spanning dozens of sites, the steps needed to locate each device, log in, make and confirm the change, and move on to the next device were anticipated to take between 5 and 10 minutes per device. Multiplying that number across each device meant that the solution could take literally months of constant manual effort to resolve.

Adding to the complexity of the resolution was the nature of the fix itself. Due to the specific change required, a rapid fix was necessary to preserve network connections between sites. Although the fix was trivial, the network engineers were baffled as to how to implement it.

The solution arrived with the implementation of an NMS not unlike those discussed in this Essentials Series. By adding the NMS to the environment and instructing it to automatically discover and map the network infrastructure (see Figure 1), the organization was able to very quickly bring each individual device under centralized management. Using the NMS’ bulk change feature enabled the team to quickly implement and distribute the change across the infrastructure. The result was a massive improvement in performance across the business unit, and a promotion for the engineer.

Figure 1: An NMS’ automated discovery and mapping features can quickly bring a large network infrastructure under management.

Solving Network Problems Requires the Right Vision

As stated earlier, the goal of this Essentials Series has been to illustrate why effective monitoring and management is necessary for a healthy network.
That need is the case irrespective of the size of your network. Whether you’re a small business with a few devices or a large enterprise with many thousands, not having this vision prevents you from actually understanding what’s going on inside your network. As this article has shown, not having these tools also inhibits you from cohesively managing the configuration of your network devices when you need them the most. When looking for an NMS, look for one that is scoped to the needs of your environment, with the right features and integrations you require for a complete situational awareness across the IT landscape.