
Network Management: Fault Management, Performance Management and Planned Maintenance
Jennifer M. Yates, Zihui Ge
AT&T Labs - Research, Florham Park, NJ
Email: jyates@research.att.com; gezihui@research.att.com
1. Introduction
As the Internet grew from a fledgling network interconnecting a few University
computers to a massive infrastructure that interconnects users across the globe, the
focus was primarily on providing connectivity to the masses. And IP has certainly
achieved this. As of June 2008, it was estimated that approximately 1.5 billion people were connected to the Internet [ref: http://www.internetworldstats.com/stats.htm] – approximately 1 in 5 people across the globe.
But as networks using the Internet Protocol (IP) became increasingly widespread,
an increasingly diverse set of services also came to use IP as the underlying communications protocol. IP now supports a diverse set of services, including eCommerce, voice, mission critical business applications, TV (IPTV), as well as email
and Internet web browsing. Nowadays, even people’s lives depend on IP networks
– as IP increasingly supports emergency 9-1-1 services and other medical applications.
This tremendous diversity in applications places an equally diverse set of requirements on the underlying network infrastructure, particularly with respect to
bandwidth consumption, reliability and performance. At one extreme, email applications are resilient to network impairments – the key requirement being basic
connectivity. Email servers will continually attempt to retransmit emails, even in
the face of potentially lengthy outages. In contrast, real-time video services typically require high bandwidth and are sensitive to even very short term “glitches”
and performance degradations.
IP is also a relatively new technology – especially if we compare it with technology such as the telephone network, which has now been around for over 130
years. As IP technology has matured over recent years, network availability has
been driven up – a result of hardware and software maturing, as well as continued
improvements in network management practices, tools and network design. This
improvement has enabled a shift in emphasis to focus beyond availability and
faults to managing performance metrics – for example, eliminating short term
“glitches” which may not be at all relevant to many applications (e.g., email), but
can cause degradation in service quality to applications such as video and gaming.
The transformation of IP from a technology designed to support best effort data
delivery to one that supports a diverse range of mission-critical applications is a
testament to the industry and to the network operators who have created technologies, automation and procedures to ensure high reliability and performance.
In this chapter, we focus on the network management systems and the tasks involved in supporting the day-to-day operation of an ISP network. The goal of
network operations is to keep the network up and running, and performing at or
above designed levels of service performance. Achieving this goal involves responding to issues as they arise, proactively making network changes to prevent
issues from occurring and evolving the network over time to introduce new technologies, services and capacity. We can loosely classify these tasks into categories using traditional definitions discussed within the literature – namely fault
management, performance management, and planned maintenance.
At a high level, fault management is easy to understand: it includes the
“break/fix” functions – if something breaks, fix it. More precisely, fault management encompasses the systems and work flows necessary to ensure that failures (or faults) are rapidly detected, the root causes are localized, the impacts of
the faults on customers are minimized and the faults are rapidly resolved so that
the network resources and customer service are returned to normal. The goal of
restoring customer service quickly may involve a temporary fix or “work around”
until network maintenance can be scheduled to repair the underlying problem.
Performance management can be defined in several different ways: for example, (1) designing, planning, monitoring and taking action to prevent / repair overload conditions and (2) monitoring both end-to-end network performance and per-element performance and taking action to address performance degradation. The first definition focuses on traffic management, and encompasses roles executed by network
engineering organizations (capacity planning) and Operations (real-time responses
to network conditions). In this chapter, we will focus on the second definition of
performance management: monitoring network performance and taking appropriate actions when performance is degraded. This performance degradation may be
a result of traffic-related conditions, or may alternatively result from degraded
network elements (e.g., high bit error rates).
Both network faults and degraded performance may require intervention by
network operations, and the line between fault and performance management is
blurred in practice. In this chapter, we refer to a network condition that may require the intervention of network operations as a “network event.” The events in
question may have a diverse set of causes – including hardware failures, software
bugs, external factors (e.g., flash crowds, outages in peer networks) or combinations of these. The resulting impact also varies drastically, ranging from outages during which impacted customers have no connectivity for potentially lengthy periods of time (known as “hard outages”), to loss of network capacity (where network elements or capacity are out of service, but traffic can be re-routed around them and thus customers experience little to no impact), to performance degradation (packet loss, or increased delay and/or jitter).
In addition to rapidly reacting to issues that arise, daily network operation also
incorporates taking planned actions to proactively ensure that the network continues to deliver high service levels, and to evolve network services and technologies. We refer to such scheduled activities as planned maintenance. Planned
maintenance includes a wide variety of activities, ranging from changing a fan filter in a router to hardware or software upgrades, preventative maintenance, or rearranging connections in response to external factors, including even highway
maintenance being carried out where cables are laid. Such activities are usually
performed at a specific, scheduled time, typically during non-business hours when
network utilization is low and the impact of the maintenance activity on customers
can be minimized.
This chapter covers both fault and performance management, and planned
maintenance. Section 2 focuses on fault and performance management – detecting,
troubleshooting and repairing network faults and performance impairments. Section 3 examines how process automation is incorporated in fault and performance
management to automate many of the tasks that were originally executed by humans. Process automation is the key ingredient that enables a relatively small Operations group to manage a rapidly expanding number of network elements, customer ports and complexity. Section 4 discusses tracking and managing network
availability and performance over time, looking across larger numbers of network
events to identify opportunities for performance improvements. Section 5 then focuses on planned maintenance. Finally, in Section 6 we discuss opportunities for
new innovations and technology enhancements, including new areas for Research.
We conclude in Section 7 with a set of “best practices”.
2. Real-time Fault and Performance Management
Fault and performance management comprise a large and complex set of functions
that are required to support the day-to-day operation of a network. As mentioned
in the previous section, network events can be divided into two broad categories:
faults and performance impairments. The term fault is used to refer to a “hard”
failure – for example, when an interface goes from being operational (up) to failed
(down). Performance events denote situations where a network element is operating with degraded performance – for example, a router interface is dropping packets, or when there is an undesirably high loss and/or delay being experienced
along a given route. We use the term “event management” to generically define
the set of functions that detect, isolate and repair network malfunctions – covering
both faults and performance events.
The primary goal of event management is to rapidly mitigate issues as they
arise so that any resulting customer impact can be minimized. However, achieving
this within the complex environment of ISP networks today requires a carefully
architected and scalable event management architecture. Figure 1 presents a simplified view of such an architecture. The goal of the architecture is to reliably detect events and alarm on them, so that action can be taken to mitigate the issue
within the network. However, as anyone who has experience with large networks
can attest, this is in itself a challenging problem.
Figure 1. Simplified event management architecture
An extensive and diverse network instrumentation layer provides the foundation of the event management architecture. The instrumentation layer, illustrated
in Figure 1, collects network data from both network elements (routers, switches,
lower layer equipment) and network monitoring systems designed to monitor network performance from vantage points across the network. The measurements
collected are critical for a wide range of different purposes, including customer
billing, capacity planning, event detection and alarming, and troubleshooting.
These latter two functions are critical parts of daily network operations. Here, the goal is to ensure that the wide range of potential fault and performance “events” are reliably detected and that there are sufficient measurements
available to troubleshoot network events.
Both the routers and the collectors contained in the instrumentation layer generate alarms to notify the central event management system when an event of interest is detected. Events detected include faults (e.g., link or router down) and
performance impairments (such as excessive network loss or delay). These performance alarms are created by looking for anomalous, “bad” conditions amongst
the mounds of performance data and logs collected within the instrumentation layer.
Given the diversity of the instrumentation layer, any given network event will
likely be detected by multiple monitoring points, and even as multiple symptoms
within an individual network element. Thus, it is likely that a single event will result in a deluge of alarms, which would swamp Operations personnel trying to investigate. Significant automation is thus introduced in the event management system to suppress and correlate multiple alarms associated with the same network
event to create a single alarm indicative of what is often referred to within the
fault management literature as the “root cause” of the event. This correlated alarm
is used to trigger the creation of a ticket in the network ticketing system – these
tickets are used to notify Operations personnel of the event.
Once notified of an issue, Operations personnel are responsible for troubleshooting and repair. Troubleshooting and repair have two primary goals: (1) restoring customer service and performance and (2) fully resolving the underlying
issue and returning network resources into service. These two goals may or may
not be distinct – where redundancy exists, the network is typically designed to automatically re-route around failed elements, thus negating the need for manual intervention in the first step. Troubleshooting and repair can then instead focus on
returning the failed network resources into service so that there is sufficient capacity available to absorb future outages. In other situations, Operations needs to resolve the outage to restore the customer’s service. Rapid action from Operations
personnel is then imperative – until Operations can identify what is causing the
problem and repair it, the customer may be out of service or experiencing degraded performance. This typically occurs at the edge of a network, where the customer connects to the ISP and redundancy may be limited. For example, if the nonredundant line card to which a customer connects fails, the customer may be out
of service until a replacement line card can be installed.
We now discuss the above steps associated with event management in more detail.
2.1. Instrumentation Layer, Event Detection and Alarms
The foundation of an event management system is the reliable detection and notification of network faults and performance impairments. We thus start by considering how these network events are detected, and the resulting alarm notifications
that are generated.
The primary goal is to ensure that the event management system is reliably informed of network impairments. But the issues of interest may display a wide
range of different symptoms, and may occur anywhere across the vast network(s)
being monitored. Depending on the impairments at hand, it may be that the routers
themselves observe the symptoms and can report them accordingly. However, limitations in router monitoring capabilities, challenges in adding new monitoring capabilities without major router upgrades and the need for independent network
monitoring have driven a wide range of external monitoring systems into the network. These systems have the added benefit of being able to obtain different perspectives on the network compared with the individual routers. For example, end
to end performance monitoring systems obtain a network-wide view of performance, which is not readily captured from within a single router.
Thus, the reliable detection of network events requires a comprehensive instrumentation layer to collect network measurements. Chapters 10 and 11 discussed the wide range of network monitoring capabilities available in large-scale
operational IP/MPLS networks. These are incorporated within the instrumentation
layer to support a diverse range of applications, ranging from network engineering (e.g., capacity planning) and customer billing to fault and performance management. Within fault and performance management, these measurements support real-time alarm generation, as well as other tasks such as troubleshooting and postmortem analysis, which are discussed later in this chapter.
As illustrated in Figure 1, the instrumentation layer collects measurements from
both network elements (e.g., routers) and from external monitoring devices, such
as route monitors and application / end to end performance measurement tools.
Although logically depicted as a single layer, the instrumentation layer actually
consists of multiple different collectors, each focused on a single monitoring capability. These collectors are discussed in more detail in the remainder of this section.
2.1.1. Router Fault and Event Notifications
Network routers themselves have an ideal vantage point for detecting faults local to them. They are privy to events which occur on inter-router links which terminate on them, and also to events which occur inside the routers themselves.
They can thus identify all sorts of hardware (link, line card and router chassis failures) and software issues (including software protocol crashes and routing protocol failures). Routers themselves detect events and notify the event management
systems – one can view the instrumentation layer for these events as residing directly within the routers themselves, as opposed to in an external instrumentation layer.
In addition to creating alarms about events detected in the router, routers also
write log messages which describe a wide range of events observed on the router.
These are known as syslogs; they are akin to the syslogs created on servers. These
syslogs can be collected at an external syslog collector as part of the instrumentation layer. They serve multiple purposes, including the generation of events which
feed the event management system, and supporting the troubleshooting of network
events.
The syslog protocol [1] is used to deliver log messages from network elements
to a central repository, namely the syslog collector depicted in Figure 1. Syslog
messages collected from elements cover a diverse range of conditions observed
within the element, from link and protocol-related state changes (down, up), environment measurements (voltage, temperature), and warning messages (e.g., denoting when customers send more routes than the router is configured to allow).
Some of these events relate to conditions which need to be alarmed on, whilst others provide useful information when it comes to troubleshooting an event. Syslog
messages are basically free-form text, with some structure (e.g., indicating
date/time of event, location and event priority). The form and format of the syslog
messages vary between router vendors. Figure 2 illustrates some example syslog
messages, taken from Cisco routers – details regarding Cisco message formats can
be found in [2].
Mar 15 00:00:06 ROUTER_A 627423: Mar 15 00:00:00.554: %CI-6-ENVNORMAL: +24 Voltage measured at 24.63
Mar 15 00:00:06 ROUTER_B 289883: Mar 15 00:00:00.759: %LINK-3-UPDOWN: Interface Serial13/0.6/16:0, changed state to up
Mar 15 00:00:13 ROUTER_C 2267435: Mar 15 00:00:12.473: %CONTROLLER-5-UPDOWN: Controller T3 10/1/1 T1 18, changed state to DOWN
Mar 15 00:00:06 ROUTER_D 852136: Mar 15 00:00:00.155: %PIM-6-INVALID_RP_JOIN: VRF 13979:28858: Received (*, 224.0.1.40) Join from 1.2.3.4 for invalid RP 5.6.7.8
Mar 15 00:00:07 ROUTER_E 801790: -Process= "PIM Process", ipl= 0, pid=218
Mar 15 00:25:26 ROUTER_Z bay0007:SUN MAR 15 00:25:24 2009 [03004EFF] MINOR:snmp-traps:Module in bay 4 slot 6, temp 65 deg C at or above minor threshold of 65 degC.
Figure 2. Example (normalized) syslog messages
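
To be useful for alarming and troubleshooting, these free-form messages are typically parsed into structured fields. The fragment below is a minimal sketch of such parsing, assuming the Cisco-style %FACILITY-SEVERITY-MNEMONIC convention illustrated in Figure 2; a production collector would recognize many more vendor-specific formats.

import re

# Matches the Cisco-style "%FACILITY-SEVERITY-MNEMONIC: message" portion of a
# syslog line, as illustrated in Figure 2.
CISCO_MSG = re.compile(
    r"%(?P<facility>\w+)-(?P<severity>\d)-(?P<mnemonic>\w+):\s*(?P<text>.*)")

def parse_syslog(line):
    """Return facility, severity, mnemonic and free text as a dict, or None."""
    match = CISCO_MSG.search(line)
    return match.groupdict() if match else None

line = ("Mar 15 00:00:13 ROUTER_C 2267435: Mar 15 00:00:12.473: "
        "%CONTROLLER-5-UPDOWN: Controller T3 10/1/1 T1 18, changed state to DOWN")
print(parse_syslog(line))
# {'facility': 'CONTROLLER', 'severity': '5', 'mnemonic': 'UPDOWN',
#  'text': 'Controller T3 10/1/1 T1 18, changed state to DOWN'}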
2.1.2. External Fault Detection and Route Monitoring
Router fault detection mechanisms are complemented by other mechanisms to
identify issues that are not detected and/or reported by the routers either because
they don’t have visibility into the event, their detection mechanisms fail to detect
it (maybe because of a software bug), or because they are unable to report the issue to the relevant parties. For example, basic periodic ICMP “ping” tests are used
to validate the liveness of routers and their interfaces – if a router or interface becomes unresponsive to ICMP pings, then this is reason for concern and an alarm
should be generated.
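
As a minimal sketch of such a liveness test (the target addresses below are placeholders, and the flags shown are those of the common Linux iputils ping), the following loop pings each address and flags any that stop responding:

import subprocess

TARGETS = ["192.0.2.1", "192.0.2.2"]  # placeholder router / interface addresses

def is_alive(address, count=3, timeout_sec=1):
    """Return True if the address answers at least one ICMP echo request."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout_sec), address],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

for target in TARGETS:
    if not is_alive(target):
        # In practice this would raise an alarm to the event management system.
        print("ALARM: %s unresponsive to ICMP ping" % target)

A real monitoring system would, of course, schedule such tests periodically and require several consecutive failures before alarming, to avoid reacting to a single lost probe.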
Route monitors, as discussed in Chapter 11, also provide visibility into control
plane activity; activity that may not always be reported by the routers. IGP monitors, such as OSPFmon [2], will learn about link and router up/down events and
can be used to complement the same events reported by routers. However, route
monitors extend beyond simple detection of link up / down events, and can provide information about weight changes which also impact traffic routing. Even
further, the route(s) between any given source / destination can be inferred using
routing data collected from route monitors. This information is vital when trying
to understand events which impacted a given traffic flow.
2.1.3. Router-Reported Performance Measurements
Although faults have traditionally been the primary focus of event management
systems, performance events can also cause significant customer distress and thus
must be addressed. The goal here is to identify and report when network performance deviates from desired operating regions and requires investigation. But detecting performance issues is really quite different from detecting faults. Performance statistics are continually collected from the network; from these
measurements we can then determine when performance moves outside the desired operating regions.
As discussed in Chapter 10, routers track a wide range of different performance
parameters – such as counts of the number of packets and bytes flowing on each
router interface, different types of interface error counts (including buffer overflow, mal-formatted packets, CRC check violations) and CPU and memory utilization on the central router CPU and its line cards. These performance parameters
are stored in Management Information Bases (MIBs), which are accessed via the Simple Network Management Protocol (SNMP). SNMP [3] is the Internet community’s de facto standard management protocol. It was defined by the IETF, and is used to convey management information between software agents on managed devices (e.g., routers) and the managing systems (e.g., the event management systems in Figure 1).
External systems poll router MIBs on a regular basis, such as every five
minutes. Thus, with a five minute polling interval, the CPU measurements would
represent the average CPU over a five minute interval; the packet counts would
represent the total number of packets that have flowed on the interface in a given
five minute interval.
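
As a concrete illustration of how such polled counters are used, the sketch below (names and numbers are illustrative only) converts two successive samples of an interface’s octet counter, taken five minutes apart, into an average utilization figure, allowing for the wrap-around that SNMP Counter64 values eventually undergo:

# Sketch: derive average link utilization from two successive polls of an
# interface octet counter. All names and values are illustrative.
COUNTER64_MAX = 2 ** 64  # SNMP Counter64 values wrap modulo 2^64

def delta_octets(prev_count, curr_count):
    """Octets transferred between polls, allowing for a single counter wrap."""
    if curr_count >= prev_count:
        return curr_count - prev_count
    return (COUNTER64_MAX - prev_count) + curr_count

def avg_utilization(prev_count, curr_count, interval_sec, link_speed_bps):
    """Average utilization (0.0 to 1.0) over the polling interval."""
    bits = delta_octets(prev_count, curr_count) * 8
    return bits / (interval_sec * link_speed_bps)

# Two polls taken five minutes apart on a 10 Gb/s link.
print(avg_utilization(prev_count=1_000_000_000, curr_count=76_000_000_000,
                      interval_sec=300, link_speed_bps=10_000_000_000))  # 0.2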
The SNMP information collected from routers is used for a wide variety of
purposes, including customer billing, detecting anomalous network conditions
which could be impacting customers (congestion, excessively high CPU, router
memory leaks) and troubleshooting network events.
2.1.4. Traffic Measurements
SNMP measurements provide aggregate statistics about link loads – e.g.,
counts of total number of bytes or equivalently link utilization over a fixed time
interval (e.g., five minutes). However, SNMP measurements provide little insight
into how this traffic is distributed across different applications or different source /
destination pairs. Instead, as discussed in Chapter 10, Netflow and deep packet inspection (DPI) are used to obtain much more detailed measurements of how traffic
is distributed across network links and routes. Few if any network management alarms are created today based on either Netflow or DPI measurements. However, both measurement capabilities are critical in troubleshooting various network conditions. DPI in particular can be used to obtain unique visibility into the traffic carried on a link, which is useful in trying to understand what traffic may be related to a given network issue and how. For example, DPI could be used to identify
malformed packets, or identify what traffic is destined to an overloaded router
CPU.
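
As a simple illustration of the kind of summarization performed during troubleshooting, the sketch below groups flow records by destination to highlight where traffic is concentrated; the record format here is a deliberately simplified stand-in, not the actual Netflow export format.

from collections import Counter

# Simplified flow records: (src, dst, dst_port, bytes).
flows = [
    ("198.51.100.7", "203.0.113.1", 179, 4_200),
    ("198.51.100.9", "203.0.113.1", 179, 3_900),
    ("192.0.2.55",   "203.0.113.9", 80,  1_200_000),
]

bytes_by_dst = Counter()
for src, dst, port, nbytes in flows:
    bytes_by_dst[dst] += nbytes

# Top destinations by volume: for example, is an unexpected amount of traffic
# headed towards a router's own control address?
for dst, total in bytes_by_dst.most_common(2):
    print(dst, total)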
2.1.5. Application and Service Monitoring
Although monitoring the state and performance of individual network elements
is critical to managing the overall health of the network, it is not in itself sufficient
for understanding the performance as perceived by customers. It is also imperative
that we are able to monitor and detect events based on the end to end service performance.
End to end measurements – even as simple as that achieved by sending test
traffic across the network from one edge of the network to another – provide the
closest representation of what customers experience as can be measured from
within the ISP network. End to end measurements were discussed in more detail in
Chapter 10. In the context of fault and performance management, they are used to
identify anomalous network conditions, such as when there is excessive end to end
packet loss, delay and/or jitter. These events are reported to the event management
system via alarms, and used to trigger further investigation.
As discussed in Chapter 10, performance monitoring can be achieved using either active or passive measurements. Active measurements send test probes (traffic) across the network, measuring statistics related to the network and/or service
performance. In contrast, passive measurements examine the performance of customer traffic, which is not always easily achieved on an end to end basis unless the
ISP has access to the customer’s end device. End to end performance measurements can be used both to understand the impact of known network events, and to
identify events which may not have been detected via other means. When known
events, such as congestion, occur, end to end performance measurements provide
direct performance measures which present insight into the event’s impact on customers. But end to end measurements also provide an overall test of the network
health – and can be used to identify the rare but potentially critical issues which
the network may have failed to detect. These are known as silent failures, and
have historically been an artifact of immature router technologies – where the
routers simply fail to detect issues that they should have detected. For example,
consider an internal router fault (e.g., corrupted memory on a line card) which
causes a link to simply drop all traffic. If the router fails to detect that this is occurring, then it will fail to both re-route the traffic away from the failed link, and
to report this as a fault condition. Thus, traffic will continue to be sent to a failed
interface and be dropped – a condition known as black-holing. End to end active
measurements (test probes) provide a means to proactively detect these issues; the
test probes will be dropped along with the customer traffic – this can be detected and alarmed on. Thus, the event can be detected even when the network elements fail to report it, and (hopefully!) before customers complain.
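
A bare-bones active probe might look like the sketch below. It assumes a UDP echo responder is listening at the far end (the address is a placeholder); it sends sequenced probes and reports the observed loss and average round-trip delay, which could then be compared against alarm thresholds.

import socket
import struct
import time

TARGET = ("192.0.2.10", 7)   # placeholder address of a UDP echo responder
PROBES, TIMEOUT = 100, 1.0

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(TIMEOUT)

received, rtts = 0, []
for seq in range(PROBES):
    sent_at = time.time()
    sock.sendto(struct.pack("!I", seq), TARGET)
    try:
        data, _ = sock.recvfrom(64)
        if struct.unpack("!I", data[:4])[0] == seq:  # match reply to probe
            received += 1
            rtts.append(time.time() - sent_at)
    except socket.timeout:
        pass                                         # counted as loss

loss = 1 - received / PROBES
print("loss = %.1f%%" % (loss * 100))
if rtts:
    print("average RTT = %.1f ms" % (sum(rtts) / len(rtts) * 1000))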
Simple test probes sent across the network are extremely effective at auditing
the integrity of the network – identifying when there are loss or delay issues being
experienced, and even detecting failures not identified by the network elements.
However, extrapolating from simple loss and delay measures to understand the
impact of a network event on any given network application is most often an extremely complex if not impossible task, involving the intimate details of each application in question.
Ideally, we would like to be able to directly measure the performance of each
and every application which operates across a network. This is clearly an impossible task in networks which support a plethora of different services and applications. But if a network is critical to supporting a specific application – such as
IPTV – then that application also needs to be monitored across the network, and
appropriate mechanisms be available to detect issues. After all, the goal is to ensure that the service is operating correctly – ascertaining this requires direct monitoring of the service of interest.
2.1.6. Alarms
The instrumentation layer thus provides network operators with an immense
volume of network measurements. But how do we transform this data to identify
network issues which require investigation and action?
Let’s start by identifying the events that we are interested in detecting. We
clearly need to detect faults – conditions which may be causing customer outages
or taking critical network resources out of service. We also aim to detect performance issues which would be causing degraded customer experiences. And finally, we wish to detect network element health concerns, which are either impacting
customers or are risking customer impact.
The majority of faults are relatively easy to detect – we need to be able to detect when network elements or components are out of service (down).
It gets a little more complicated when it comes to defining performance impairments of interest. If we are too sensitive in the issues identified – for example,
reporting very minor, short conditions – then Operations personnel risk expending
tremendous effort attempting to troubleshoot meaningless events, potentially missing events that are of great significance amongst all the noise and false alarms.
Short term impairments, such as packet losses or CPU anomalies, are often too
short to even enable real-time investigation – the event has more often than not
cleared before Operations personnel could even be informed about it, let alone investigate. And short term events are expected – no matter how careful network
planning and design is, occasional short term packet losses, for example, will occur.
Instead, we need to focus on identifying events which are sufficiently large and
persist for a significant period of time. Simple thresholding is typically used to
achieve this – an event is identified when a parameter of interest exceeds a predefined value for a specified period of time. For example, an event may be declared when packet loss exceeds 0.3% over 3 consecutive polling periods across
an IP/MPLS backbone network. However, note that more complicated signatures
can and are used to detect some conditions. Section 4.1 discusses this in some
more detail, including how appropriate thresholds can be identified.
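
A minimal sketch of such a persistence threshold is shown below, using the loss example from the text (0.3% sustained over three consecutive polling periods); in a deployed system the declared event would be emitted as a SET alarm, with a corresponding CLEAR once the condition abates.

def detect_sustained_loss(loss_series, threshold=0.003, periods=3):
    """Yield the index at which loss has exceeded the threshold for the
    required number of consecutive polling periods (i.e., declare an event)."""
    consecutive = 0
    for i, loss in enumerate(loss_series):
        consecutive = consecutive + 1 if loss > threshold else 0
        if consecutive == periods:
            yield i

# Five-minute loss samples; an event is declared at the third consecutive
# sample above 0.3%.
samples = [0.000, 0.004, 0.005, 0.006, 0.001]
print(list(detect_sustained_loss(samples)))  # [3]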
Once we have detected an issue – whether it is a fault or performance impairment – our goal is to report it so that appropriate action can be taken as needed.
Event notification is realized through the generation of an alarm which is sent to
the event management platform as illustrated in Figure 1. An alarm is a symptom
of a fault or performance degradation. Alarms themselves have a life span – they
start with a SET alarm and typically end through an explicit CLEAR. Thus, explicit notifications of both the start of an event (the SET) and the end of an event
(CLEAR) need to be conveyed to the event management system.
SNMP traps or informs (depending on the SNMP version) are used by routers
to generate alarms to the event management system. Given the predominance of
SNMP in IP/MPLS, this same mechanism is also often used between other collectors in the instrumentation layer and the event management layer. Traps / informs
represent asynchronous reports of events and are used to notify network management systems of state changes, such as links and protocols going down or coming
up. For example, if a link fails, then the routers at either end of the link will send
SNMP traps to the network management system collecting these. These traps typically include information about the source of the event, details of the type of
event, parameters and priority. Figure 3 depicts logs detailing traps collected from
a router. The logs report the time and location of the event, the type of event
which has occurred and the SNMP MIB Object ID (OID). In this particular case,
we are observing the symptoms associated with a line card failure, where the line
card constitutes a number of physical and logical ports. Each port is individually
reported as having failed. This includes both the physical interfaces (denoted in
the Cisco router example below as T3 4/0/0, T3 4/0/1, Serial4/1/0/3:0, Serial4/1/0/4:0, Serial4/1/0/5:0, Serial4/1/0/23:0, Serial4/1/0/24:0, Serial4/1/0/25:0)
and the logical interfaces (denoted in the example below as Multilink6164 and
Multilink6168). In the example shown in Figure 3, the OID that denotes the link
down event is “4.1.3.6.1.6.3.1.1.5.3.1.3.6.1.4.1.9.1.46 0” – and can be seen on
each event in Figure 3.
1238081113 10 Thu Feb 21 03:10:07 2009 Router_A - Cisco LinkDown Trap on Interface T3 4/0/0;4.1.3.6.1.6.3.1.1.5.3.1.3.6.1.4.1.9.1.46 0
1238081117 10 Thu Mar 26 15:25:17 2009 Router_A - Cisco LinkDown Trap on Interface T3 4/0/1;4.1.3.6.1.6.3.1.1.5.3.1.3.6.1.4.1.9.1.46 0
1238081118 10 Thu Mar 26 15:25:18 2009 Router_A - Cisco LinkDown Trap on Interface Serial4/1/0/3:0;4.1.3.6.1.6.3.1.1.5.3.1.3.6.1.4.1.9.1.46 0
1238081118 10 Thu Mar 26 15:25:18 2009 Router_A - Cisco LinkDown Trap on Interface Serial4/1/0/4:0;4.1.3.6.1.6.3.1.1.5.3.1.3.6.1.4.1.9.1.46 0
1238081119 10 Thu Mar 26 15:25:19 2009 Router_A - Cisco LinkDown Trap on Interface Serial4/1/0/5:0;4.1.3.6.1.6.3.1.1.5.3.1.3.6.1.4.1.9.1.46 0
1238081124 10 Thu Mar 26 15:25:24 2009 Router_A - Cisco LinkDown Trap on Interface Serial4/1/0/23:0;4.1.3.6.1.6.3.1.1.5.3.1.3.6.1.4.1.9.1.46 0
1238081124 10 Thu Mar 26 15:25:24 2009 Router_A - Cisco LinkDown Trap on Interface Serial4/1/0/24:0;4.1.3.6.1.6.3.1.1.5.3.1.3.6.1.4.1.9.1.46 0
1238081125 10 Thu Mar 26 15:25:25 2009 Router_A - Cisco LinkDown Trap on Interface Serial4/1/0/25:0;4.1.3.6.1.6.3.1.1.5.3.1.3.6.1.4.1.9.1.46 0
1238081125 10 Thu Mar 26 15:25:25 2009 Router_A - Cisco LinkDown Trap on Interface Multilink6164;4.1.3.6.1.6.3.1.1.5.3.1.3.6.1.4.1.9.1.46 0
1238081125 10 Thu Mar 26 15:25:25 2009 Router_A - Cisco LinkDown Trap on Interface Multilink6168;4.1.3.6.1.6.3.1.1.5.3.1.3.6.1.4.1.9.1.46 0
Figure 3. Example logs from (anonymized) SNMP traps
2.2. Event Management System
The preceding section discussed the vast monitoring infrastructure deployed in
modern IP/MPLS networks and used to create alarms. However, the number of
alarms created by such an extensive monitoring infrastructure would simply overwhelm a network operator – completely obscuring the real event in an avalanche
of alarms. Manually weeding out the noise from the true root cause could take
hours or even longer for a single event – time during which critical customers may
be unable to effectively communicate, watch TV and/or access the Internet. This is
simply not an acceptable mode of operation.
By way of a simple example, consider a fiber cut impacting a link between two
adjacent routers. The failure of this link (the fault) will be observed at the physical
layer (e.g., SONET, Ethernet) on the routers at both ends of the link. But it will also be observed in the different protocols running over that link – for example, in
PPP, in the intra-domain routing protocol (OSPF or IS-IS), in LDP and in PIM (if
multi-cast is enabled) and potentially even through BGP. These will all be separately reported via syslog messages; failure notifications (traps) would be generated by the routers at both ends of the link in question to indicate that the link is
down. The same event would be captured by route monitors (in this case, by the
intra-domain route protocol monitors [2]). Network management systems monitoring the lower layer (e.g., layer one / physical layer) technologies will also detect the event, and will alarm. And finally – should congestion result from this outage – performance monitoring tools may report packet losses observed from the routers,
end to end performance degradations may be identified via end to end performance monitoring and application monitors would presumably start alarming if
customer services are impacted. Thus, even a single fiber cut can result in a plethora of alarms. The last thing we need is for the network operator to be manually
delving into each one of these to identify the one that falls at the heart of the issue
– in this case, the alarm that denotes a loss of incoming signal at the physical layer.
Instead, automated event management systems as depicted in Figure 1 are used
to automatically identify both an incident’s root cause and its impact from
amongst the flood of incoming alarms. The key to achieving this is alarm correlation – taking the incoming flood of alarms, automatically identifying which alarms
are associated with a common event, and then analyzing them to identify the most
likely explanation and any relevant impact. The resulting correlated alarms are input to the ticket creation process; tickets are used to notify Operations personnel
about an issue requiring their attention. In the above example, the root cause can
be crisply identified as being a physical layer link failure; the impact of interest to
the network operator being the extent to which any service and/or network performance degradation is impacting customers, and how many and which customers are impacted. The greater the impact of an event, the more urgent is the need
for the network operator to instigate repair.
2.2.1. Managing alarm floods: alarm correlation and root cause analysis
Not all alarms received by an event management system need to be correlated
and translated into tickets to trigger human investigation. For example, alarms corresponding to known maintenance events can be logged and discarded – there is
no need to investigate these as they are expected side effects of the planned activities. Similarly, duplicate alarms can be discarded – if an alarm is received multiple
times for the same condition, then only one of these alarms needs to be passed on
for further analysis. Thus, as alarms are received by the event management system, they are filtered to identify those which should be forwarded for further correlation.
Similarly, one off events which are very short in nature – effectively “naturally” resolving themselves – can often be discarded as there is no action to be taken
by the time that the alarm makes it for further analysis. Thus, if an alarm SET is
followed almost immediately by an alarm CLEAR, then the event management
system may choose to discard the event entirely. Such events can occur, for example, due to someone accidentally touching a fiber in an office, causing a very rapid, one-off impairment. There is little point in expending significant resources investigating this event – it has disappeared and is not something that someone in
Operations can do anything about – unless it continues to happen. Short conditions
which continually re-occur are classified as chronics, and do require investigation.
Thus, if an event keeps occurring, then a correlated alarm and corresponding ticket
is created so that the incident can be properly investigated.
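
A rough sketch of this filtering logic is given below; the time window and chronic count are illustrative only, and a production system would also forward SET alarms that remain active once a short soak window expires.

from collections import defaultdict

def filter_alarms(events, flap_sec=60, chronic_count=5):
    """events: time-ordered list of (key, kind, timestamp) tuples, where kind
    is "SET" or "CLEAR". Ignores duplicate SETs, drops one-off SET/CLEAR pairs
    that clear almost immediately, and escalates keys that keep flapping."""
    active = {}               # alarm key -> timestamp of the pending SET
    flaps = defaultdict(int)  # alarm key -> number of short-lived occurrences
    forwarded = []
    for key, kind, ts in events:
        if kind == "SET":
            if key not in active:     # duplicate SET for an active condition
                active[key] = ts
        elif kind == "CLEAR" and key in active:
            duration = ts - active.pop(key)
            if duration >= flap_sec:
                forwarded.append((key, "EVENT", ts))    # long-lived: correlate and ticket
            else:
                flaps[key] += 1
                if flaps[key] >= chronic_count:
                    forwarded.append((key, "CHRONIC", ts))
    return forwarded

# A link that flaps briefly five times is escalated as a chronic condition.
events = []
for t in range(0, 500, 100):
    events += [("linkX", "SET", t), ("linkX", "CLEAR", t + 5)]
print(filter_alarms(events))  # [('linkX', 'CHRONIC', 405)]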
Alarm correlation, alternatively known as root cause analysis, follows the
alarm filtering process. Put simply, alarm correlation examines the incoming
stream of network alarms to identify those which occur at approximately the same
time, and can be associated with a common explanation. These are then grouped
together as being a correlated alarm. The goal here is to identify the “root cause”
alarm – effectively equating to identifying the type and location of the event being
reported.
There are numerous commercial products available that implement alarm correlation for IP/MPLS networks. The basic idea behind these tools is the notion of
causal relationships – understanding what underlying events cause which symptoms. Building this model requires detailed real-time knowledge and models of
device and network behavior, and of network topology. Given that network topology varies over time, this must be automatically derived through auto-discovery.
However, how the products implement alarm correlations varies. Some tools use
defined rules or policies to perform the correlation. The rules are defined as “if
condition then conclusion” – conditions relate to observed events and information
about the current network state, whilst the conclusion would denote the root cause
event and associated actions (e.g., ticket creation). These rules capture domain
expertise from humans and can become extremely complex.
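
A toy illustration of such "if condition then conclusion" rules is sketched below; the rules, alarm fields and conclusions are invented for illustration, and commercial correlation engines are far richer.

# Each rule is a (condition, conclusion) pair: the condition inspects the set
# of alarms gathered in a correlation window, the conclusion names the
# inferred root cause.
RULES = [
    (lambda alarms: any(a["type"] == "SONET_LOS" for a in alarms),
     "physical-layer failure (loss of signal)"),
    (lambda alarms: alarms
                    and all(a["type"] == "LINK_DOWN" for a in alarms)
                    and len({a["card"] for a in alarms}) == 1,
     "line card failure"),
]

def correlate(alarms):
    """Return the conclusion of the first rule whose condition matches."""
    for condition, conclusion in RULES:
        if condition(alarms):
            return conclusion
    return "unclassified: forward raw alarms"

alarms = [{"type": "LINK_DOWN", "card": "Serial4"} for _ in range(10)]
print(correlate(alarms))  # line card failure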
The Codebook approach [4; 5] applies techniques from graph and coding theory to
alarm correlation. A codebook is pre-calculated based on network topology and
models of network elements and network behavior to model signatures of different possible network events. The alarms received by an event management system implemented using codebooks are referred to as symptoms. These symptoms
are compared against the set of different known signatures; the signature that best
matches the observed symptoms is identified as the root cause.
Event signatures are created by determining which unique combination of
alarms would be observed for each possible failure. This can be inferred by looking at the network topology and how components within a router and between
routers are connected. A model of each router is (automatically) created – denoting the set of line cards deployed within a router, the set of physical ports in existence on a line card and the set of logical interfaces which terminate on a given
physical port. The network topology then indicates which interfaces on which line
cards on which routers are connected together to create links. In large IP networks,
such information is automatically discovered by crawling through the network.
Once the set of routers in the network is known (either through external systems or
via auto-discovery), SNMP MIBs can be walked to identify all of the interfaces on
the router, and the relationships between interfaces (for example, which interfaces
are on which port and which ports are on which line card).
Figure 4. Signature identification
Given knowledge of topology and router structure, failure signatures can be
identified by examining what symptoms would be observed upon each individual
component failure. To illustrate this, let us consider the simple four node scenario
depicted in Figure 4. In this example, we consider what would be observed if line
card 1 failed on router 2. Such a failure would be observed on all of the four interfaces contained within the first line card on router 2 (card 1), and also on interface
I1 on port P1 and interface I1 on port P2 on card 1 (C1) of router 1, and on interface I2 on port P1 and interface I2 on port P2 on card 1 (C1) of router 3. These are
all interfaces which are directly connected to interfaces on the failed line card on
router 2. If the failure of line card 1 on router 2 were to happen, then alarms would
be generated by the three routers involved and sent to the event management system. This combination of symptoms thus represents the signature of the line card
failure on router 2. This signature, or combination of alarms, is unique – if this is
observed, we conclude that the issue (root cause) is the failure of line card 1 on
router 2.
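
The codebook idea can be sketched in a few lines. In the fragment below, the topology, alarm names and scoring are a simplified stand-in for the Figure 4 scenario: each candidate root cause maps to the set of alarms (symptoms) it would produce, and the observed symptoms are matched to the closest signature.

# Pre-computed codebook: candidate root cause -> expected symptoms (alarm
# identifiers), derived offline from topology and element models.
CODEBOOK = {
    "R2 card C1 failure": {"R2:C1:P1:I1 down", "R2:C1:P1:I2 down",
                           "R2:C1:P2:I1 down", "R2:C1:P2:I2 down",
                           "R1:C1:P1:I1 down", "R1:C1:P2:I1 down",
                           "R3:C1:P1:I2 down", "R3:C1:P2:I2 down"},
    "R1-R2 link failure": {"R2:C1:P1:I1 down", "R1:C1:P1:I1 down"},
}

def best_match(observed):
    """Return the root cause whose signature best matches the observed symptoms."""
    def score(signature):
        # symptoms explained, penalized by expected symptoms that are missing
        return len(observed & signature) - len(signature - observed)
    return max(CODEBOOK, key=lambda cause: score(CODEBOOK[cause]))

observed = {"R2:C1:P1:I1 down", "R2:C1:P1:I2 down", "R2:C1:P2:I1 down",
            "R2:C1:P2:I2 down", "R1:C1:P1:I1 down", "R1:C1:P2:I1 down",
            "R3:C1:P1:I2 down", "R3:C1:P2:I2 down"}
print(best_match(observed))  # R2 card C1 failure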
Let’s now return to the SNMP traps provided in Figure 3. This example depicts the traps sent by router Router_A to the event management system upon detecting the failure of a set of interfaces. Traps were sent corresponding to eight
physical interfaces and two logical (multilink) interfaces – all of these being interfaces which directly connect to external customers. The ISP has visibility into only the local ISP interfaces (the ones reported in the traps in Figure 3), and does not
receive alarms from the remote ends of the links – the ends within customer premises. Thus, the symptoms observed for the given failure mode are purely those
coming from the one ISP router – nothing from the remote ends of the links. This
set of received alarms is compared with the set of available signatures; comparison
will provide a match between this set of alarms (symptoms) and the signature associated with the failure of a line card (Serial4) on Router_A. The resulting correlated alarm output from the event management system identifies the line card failure, and incorporates the associated symptoms of the individual interface failures.
Figure 5 illustrates an example correlated alarm, taken from a production event
management system. This particular example reports that 10 out of a total of 10 configured interfaces have failed on the line card Serial4 – indicating a slot (line card) failure. This is a critical alarm, indicating that immediate
attention is required.
03/26/2009 15:25:28 RC500_170: Incoming Alarm:
Router_A:Interfaces|Slot Threshold Alarm: Router_A:Serial4 Down. There
are 10 out of 10 interfaces down. Down Interface List: Serial4/0/0, Serial4/0/1, Serial4/1/0/3:0, Serial4/1/0/4:0, Serial4/1/0/5:0, Serial4/1/0/23:0,
Serial10/1/0/24:0, Serial10/1/0/24:0, Multilink6164, Multilink6168|Critical
Figure 5. Correlated alarm associated with the event in Figure 3.
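
The kind of roll-up shown in Figure 5 can be approximated by counting reported-down interfaces per slot against the configured inventory. The sketch below makes some simplifying assumptions (the inventory numbers are invented, and the slot is extracted naively from the interface name; a real system would use its inventory to map both physical and logical interfaces to the owning card):

from collections import defaultdict

# Hypothetical inventory: number of configured interfaces per (router, slot).
CONFIGURED = {("Router_A", "Serial4"): 10}

def slot_of(interface):
    """Naive slot extraction, e.g. "Serial4/1/0/3:0" -> "Serial4" (assumption)."""
    return interface.split("/")[0]

def slot_alarms(linkdown_traps):
    """linkdown_traps: list of (router, interface) pairs. Raise a critical slot
    alarm when every configured interface on a slot is reported down."""
    down = defaultdict(set)
    for router, interface in linkdown_traps:
        down[(router, slot_of(interface))].add(interface)
    alarms = []
    for (router, slot), interfaces in down.items():
        total = CONFIGURED.get((router, slot))
        if total and len(interfaces) >= total:
            alarms.append("CRITICAL: %s on %s down (%d of %d interfaces)"
                          % (slot, router, len(interfaces), total))
    return alarms

traps = [("Router_A", "Serial4/1/0/%d:0" % n) for n in range(10)]
print(slot_alarms(traps))  # ['CRITICAL: Serial4 on Router_A down (10 of 10 interfaces)']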
Correlation can also be used across layers or across network domains to effectively localize faults for more efficient troubleshooting. Consider the example illustrated in Figure 6. The actual outage occurs in the layer one (physical) network
which is used to directly interconnect two routers. Let’s specifically consider an
example of a fiber cut. If the Layer one and the IP/MPLS network are maintained
by different organizations (a common situation), then there is little that can be
done in the IP organization to repair the failure. However, the IP routers all detect
the issue. Cross-layer correlation can be used to identify that the issue is indeed in
a different layer; the IP organization should thus be notified, but with the clarification that this is informational only – another organization (layer one) is responsible for repair. Such correlations can save a tremendous amount of unnecessary resource expenditure, by accurately and clearly identifying the issue, and notifying
the appropriate organization of the need to actuate repair.
Figure 6. Cross-layer correlation
The “root cause” alarm created by the event management systems effectively
identifies approximately where an event occurred and what type of event it was.
However, this is still a long way from identifying the true root cause of the event,
and thus being able to rectify it. If we consider the example depicted in Figure 4,
automated root cause analysis is able to successfully isolate the problem as being
related to line card 1 on router 2. However, we still need to determine how and
why the line card failed. Was the problem in hardware or software? What in the
hardware or software failed and why? Tremendous work is still typically required
before reaching true issue resolution.
2.3. Ticketing
Tickets are used to notify Operations personnel of events that require investigation, and to track ongoing investigations. Tickets are at the heart of troubleshooting network events, and recording actions taken and their results.
If the issue being reported has been detected within the network, then the tickets are automatically opened by the event management systems. However, if the
issue is, for example, reported by a customer, then the ticket will likely have been
opened by a human, whether that be the customer or a representative from the
ISP’s customer care organization.
The tickets are opened with basic information regarding the event at hand, such
as the date, time and location of where the issue was observed, and the details of
the correlated alarms and symptoms that triggered the ticket creation. From there,
Operations personnel carefully record in the ticket the tests that they execute in
troubleshooting the event, and the observations made. They record interactions
with other organizations, such as Operations groups responsible for other layers in
the network, or employees from equipment manufacturers (vendors). Clearly,
carefully tracking observations made whilst troubleshooting a complex issue is
critical for the person(s) investigating. However, the tickets also serve the purpose
of allowing easy handoff of investigations across personnel, such as may happen
when an issue must be handed off at, say, a shift change, or between different organizations.
Once an investigation has reached its conclusion and the issue has been rectified, then the ticket is closed. A resolution classification is typically assigned,
primarily for use in tracking aggregate statistics and for offline analysis of overall
network performance. Network reliability modeling is discussed in more detail in
Chapter 4 and in Section 4 of this chapter.
2.4. Troubleshooting
Troubleshooting a network outage is analogous to being a private investigator –
hunting down the offender and taking appropriate action to rectify the issue. Drilling down to root cause can require keen detective instincts, knowing where to look
and what to look for. Operations teams often pull upon vast experience and domain knowledge to rapidly delve into the vast depths of the network and related
network data to crystallize upon the symptoms, delve deep for additional information and theorize over potential root causes. This is often under extreme pres-
18
sure – the clock is ticking; customer service may be impaired until the issue is repaired.
The first step of troubleshooting a network event is to collect as much information about the event as possible that will help with reasoning about what is
happening or has happened. Clearly, a fundamental part of this involves looking
at the event symptoms and impact. The major symptoms are provided in the alarm
which created the ticket. Additional information can be collected as the result of
tests that the operator performs to further investigate the outage, or may incorporate historical data pulled from support systems to investigate earlier behavior and
actions taken within the network (e.g., maintenance) which could be related.
The tests invoked by the network operator range considerably in nature, depending on the type of incident being investigated. But in general they may include ping tests (“ping” a remote end to see if it is reachable), different types of
status checks (executed as “show” commands on Cisco routers, for example), and
on demand end-to-end performance tests. In addition to information about the
event, it may also be important to find out about potentially related events and activities – for example, did something change in the network that could have invoked the issue? This could have been a recent change, or something that changed
a while ago, waiting like a time bomb until conditions were ripe for an appearance.
Armed with this additional information, the network operator works towards
identifying what is causing the issue, and how it can be mitigated. In the majority
of situations, this can be achieved quickly – leading to rapid, permanent repair.
However, some outages are more complex to troubleshoot. This is when the full
power of an extended Operations team comes into force.
Let’s consider a hypothetical example to illustrate the process of troubleshooting a complex outage. In this scenario, let’s assume that a number of edge routers
across the network fail one after another over a short period of time. As these are
the routers which connect to the customers, it is likely that the majority of those customers are out of service until the issue is resolved – the pressure is really on to resolve the issue as fast as possible!
Given that we are assuming that routers are failing one after another, it would
take two or more routers to fail before it becomes apparent that this is a cascading
issue, involving multiple routers. But once this becomes apparent, then the individual router issues should be combined to treat the event as a whole.
As discussed earlier, the first goal is to identify how to bring the routers back
up, operational and stable so that customer service is restored. But achieving this
requires at least some initial idea about the event’s root cause. And then once the
routers have been brought back into service, it is critical to drill down to fully understand the issue and root cause so that permanent repair can be achieved – permanently eliminating this particular failure mode from the network.
The first step in troubleshooting such an outage is to collect as much information as possible regarding the symptoms observed, identify any interesting information that may shed light on the trouble, and create a timeline of events detailing when and where they occurred. This information is typically collated by running diagnostic commands on the router, and from information collected within the various collectors contained within the instrumentation layer. Syslogs provide a wealth of information about events observed on network routers. This information is complemented by that obtained from route monitors and
performance data collected over time, and from logs detailing actions taken on the
router by Operations personnel and automated systems.
The biggest challenge now becomes how to find something useful in amongst
the huge amount of data created by the network. Operations personnel would
painstakingly sift through all of this data, raking through syslogs, examining critical router statistics (CPU, memory, utilization, etc.), identifying what actions were
being performed within the network during the time interval before the event, and
examining route monitoring logs and performance statistics. Depending on the
failure mode, and whether the routers are reachable (e.g., out of band), diagnostic
tests and experimentation with potential actions to repair the issue would also be
performed within the network.
In a situation where multiple elements are involved, it is also important to focus
on what the routers involved have in common. Are they all made by a common
router vendor, are they a common router model? Do they share a common software version? Are they in a common physical or logical location within the network? Do they have similar numbers of interfaces, or load, or something which
may relate to the outage at hand? Were there any recent network changes made on
these particular routers that could have triggered the issue? Are there customers
that are in common across these routers? Identifying the factors which are common and those which are not can be critical in focusing in on what may be the root
cause of the issue at hand. For example, if the routers involved are from different
vendors, then it is less likely to be a software bug causing the issue. And if there is
a common customer associated with all of the impacted routers, then it would
make sense to look further into whether there is anything about the specific customer(s) which could have induced the issue or contributed in some way. For example, was the customer making any significant changes at the time of the incident?
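
This kind of commonality check is straightforward to automate once router attributes are available from inventory systems. A rough sketch follows (the attribute names and values are invented):

# Attributes of the impacted routers, e.g., pulled from an inventory database.
impacted = {
    "er1.nyc": {"vendor": "A", "model": "X1", "os": "12.4(7)", "recent_change": "os-upgrade"},
    "er3.chi": {"vendor": "A", "model": "X1", "os": "12.4(7)", "recent_change": "none"},
    "er7.dal": {"vendor": "A", "model": "X2", "os": "12.4(7)", "recent_change": "none"},
}

def common_attributes(routers):
    """Return the attribute/value pairs shared by every impacted router."""
    items = [set(attrs.items()) for attrs in routers.values()]
    return set.intersection(*items)

print(common_attributes(impacted))
# {('vendor', 'A'), ('os', '12.4(7)')} - a shared OS version is a natural suspect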
The initial goal is to extract sufficient insight into the situation at hand to indicate how service can be restored. Once this is achieved, appropriate actions can be
taken to rectify the situation. However, in many situations, such as those induced
by router bugs, the initial actions may restore service, but would not necessarily
solve the underlying issue. Further (lengthy) analysis – potentially involving extensive lab reproduction and/or detailed software analysis – may be necessary to
truly identify the real root cause.
2.5. Restore then repair
Thus, troubleshooting and repair typically involves two goals: (1) restoring customer service and performance and (2) fully resolving the underlying issue and returning network resources into service. These two goals may or may not be distinct – where redundancy exists, the network is typically designed to automatically
re-route around outages, thus negating the need for manual intervention in the first
step. Troubleshooting and repair can instead focus on returning the failed network
resources into service so that there is sufficient capacity available to absorb future
outages. In other situations, customer service restoral and outage resolution may
be one and the same. Let’s consider two examples in more detail:
2.5.1. Core network outage
Consider the example of a fiber cut impacting a link between two routers in the
core of an IP/MPLS network. IP networks are designed with redundancy and spare
capacity, so that traffic can be re-routed around failed network resources. This re-routing is automatically initiated by the routers themselves – the exact details of which depend on the mechanisms deployed, as discussed in Chapter 2. In the example in Figure 7, traffic between routers E and D is normally routed via router C. However, in the event that the link between routers C and D fails, then the traffic is re-routed via F and H.
Figure 7. Core network outage.
Assuming that the network has sufficient capacity to re-route all traffic without
congestion, then the impetus for rapidly restoring the failed resources is to ensure
that the capacity is available to handle future failures and/or planned maintenance
events. If, however, congestion results from the outage, then immediate intervention is required by Operations personnel to restore the customer experience. Immediate action may be taken to re-route or re-balance traffic load, to eliminate the
performance issues whilst the resources are being repaired – for example, by tuning the IGP weights (e.g., OSPF weight tuning). In the example of Figure 7, if
there were congestion observed on, say, the link between routers F and H, then
Operations personnel may need to re-route the traffic via routers F, I and G. Note
that this requires that Operations have an appropriate network modeling tool
available to simulate potential actions before they are taken. This is necessary to
ensure that the actions to be taken will achieve the desired result of eliminating the
performance issues being observed.
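
In spirit, such a what-if check boils down to recomputing paths under candidate weight changes before touching the network (real tools also model traffic volumes and link capacities). The sketch below uses invented IGP weights on a topology loosely resembling Figure 7, with the C-D link already failed, and shows the path between E and D before and after raising the weight of the congested F-H link:

import heapq

def shortest_path(graph, src, dst):
    """Dijkstra over {node: {neighbor: weight}}; returns the path as a node list."""
    queue, seen = [(0, src, [src])], set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return path
        if node in seen:
            continue
        seen.add(node)
        for neigh, weight in graph[node].items():
            if neigh not in seen:
                heapq.heappush(queue, (cost + weight, neigh, path + [neigh]))
    return None

def make_graph(weights):
    graph = {}
    for (a, b), w in weights.items():
        graph.setdefault(a, {})[b] = w
        graph.setdefault(b, {})[a] = w
    return graph

# Invented weights; the failed C-D link is simply omitted.
weights = {("E", "C"): 10, ("C", "F"): 10, ("E", "F"): 10, ("F", "H"): 10,
           ("H", "D"): 10, ("F", "I"): 10, ("I", "G"): 10, ("G", "D"): 10}
print(shortest_path(make_graph(weights), "E", "D"))   # ['E', 'F', 'H', 'D']

weights[("F", "H")] = 100   # candidate change: push traffic off the congested F-H link
print(shortest_path(make_graph(weights), "E", "D"))   # ['E', 'F', 'I', 'G', 'D']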
Permanent repair is achieved when the fiber cut is repaired. This requires that a
technician travel to the location of the cut, and re-splice the impacted fiber(s).
2.5.2. Customer-facing outage
Now consider an outage on a customer-facing line card in a service provider’s
edge router. We focus on a customer that has only a single connection between the
customer router and the ISP edge router, as illustrated in Figure 8.
Figure 8. Failure on customer-facing interface
Cost effective line card protection mechanisms simply do not exist for router
technologies today. Instead, providing redundancy on customer facing interfaces
requires that an ISP deploy dedicated backup line cards for each working line
card. Because this is expensive, customers that need higher reliability can choose
a multi-homed redundancy option. In situations where redundancy is not deployed, a failure of the ISP router line card facing the customer will result in the
customer being out of service until the line card can be returned to service.
If the line card failure was caused by faulty hardware, the customer may be out
of service until the failed hardware is repaired, necessitating a rapid technician
visit to the router in question. However, if the issue is in software, then service could potentially be restored simply by rebooting the line card. If the
issue is extreme, and a result of a newly introduced software release on the router,
then it is possible that the software should be “rolled back” to a previous version
that does not suffer from the software bug. Although this apparently fixes the issue
and restores customer service, it is not a permanent repair. If this is likely to occur
again, then the issue must be permanently resolved – the software must be debugged, re-certified for network deployment, and then installed on each relevant
router network-wide before permanent repair is achieved. This could involve upgrading potentially hundreds of routers – a major task, to say the least. In this case,
the repair of the underlying root cause takes time, but is necessary to ensure that
the failure mode does not occur again within the network.
3. Process automation
The previous section described the basic architecture and processes involved in
detecting and troubleshooting fault and performance impairments. These issues often need to be resolved under immense time pressures, especially when customers
are being directly impacted.
Computer systems can perform well-defined tasks much more rapidly than humans, and are necessary to support the scale of a large ISP network. Although
human reasoning about extremely complex situations is often difficult, if not impossible to automate, automation can be used to support human analysis and to aid
in performing simple, repetitive tasks. This is termed process automation.
Process automation is widely used in many aspects of network management,
such as in customer service provisioning. Relevant to this chapter, it is also applied to the processes executed in network troubleshooting and even repair. Process automation, in combination with the alarm filtering and correlation discussed in Section 2.2.1, is what enables a small Operations team to manage a rapidly growing and already extremely large-scale network, with increasing complexity
and diversity. The automation also speeds trouble resolution, thereby minimizing
customer service disruptions, and eliminates human errors, which are a fact of life,
no matter how much process is put in place in a bid to minimize them.
Over time, Operations personnel have identified large numbers of tasks which
are executed repeatedly and are very time consuming when executed by hand.
Where possible, these tasks are automated in the process automation capabilities
illustrated in the modified event management architecture of Figure 9.
Figure 9. Event management architecture incorporating process automation
The process automation system lies between the event management system and
the ticketing system. One of its major roles is to provide the interface between
these two systems – listening to incoming correlated alarms and opening, closing
and updating tickets. But rather than simply using the output of the event management system to trigger ticket creation and updates, process automation can take
various actions based on network state and the situation at hand. It collects additional diagnostic information, and reasons about relatively complex scenarios before then taking action. Upon opening or updating a ticket, the process automation system automatically populates the ticket with relevant information that it
collected in evaluating the issue. This means that there is already a significant
amount of diagnostic information available in the ticket as it is opened by a human
being to initiate investigation into an incident. This can dramatically speed up
event investigation, by eliminating the need for humans to go out and manually
execute commands on the network elements to collect diagnostic information.
The process automation system interfaces with a wide range of different systems to execute tasks. In addition to the event management and ticketing systems,
process automation also interacts closely with network databases and with the
network itself (either directly, or indirectly through another system). For example,
collecting diagnostic information related to an incoming event will likely involve
reaching out to the network elements and executing commands to retrieve the information of interest. This could be done either directly by the process automation
system, or via an external network interface, as illustrated in Figure 9.
Process automation is often implemented using an expert system – a system
which attempts to mimic the decision making process and actions that a human
would execute. The actions taken by the system are defined by rules, which are
created and managed by experts. The number of rules in a complex process automation system is typically of the order of 100s to 1000s, given the complexity
and extensive range of the tasks at hand. Rules are continually updated and managed by the relevant experts as new opportunities for automation are conceived
and implemented, processes are updated and improved, and the technologies used
within the network evolve and change.
The example in Figure 10 demonstrates the process automation steps executed
upon receipt of a basic notification that a slot is down (failed). A slot here refers to
a single line card within a router; the slot is assumed to support multiple interfaces. Each interface terminates a connection to a customer or adjacent network router. As can be seen from this example, the system executes a series of different
tests and then takes different actions depending on the outcomes of each test. The
tests and actions executed are specific to the type of alarm triggering the event,
and also to the router model in question. Note that in this particular case, the router in question is a Cisco router, and thus Cisco command line interface (CLI)
commands are executed to test the line card.
Figure 10. Example process automation rule - router slot down.
The process automation example here is focused on support for the network
operations team. This team is focused on managing the network elements and is
not responsible for troubleshooting individual customer issues – those are handled
by the customer care organization. Thus, a primary goal of the process automation
is to identify issues with network elements, weeding out individual customer issues (and not creating tickets on them for this particular team). In this example of
a customer-facing line card with multiple customers carried on different interfaces
on the same card, the automation ensures that there are multiple interfaces that are
simultaneously experiencing issues, as opposed to being associated with a single
customer. This makes it likely that the issue at hand relates to the line card (part of
the network), and is not an individual customer issue.
In addition to ensuring that only relevant tickets are created, process automation also distinguishes between new issues and new symptoms associated with
an existing, known problem. Thus, if a ticket has already been opened for the current event, such as may occur when a new symptom is detected for an ongoing
event, then the process automation will associate the new information with the existing ticket, as opposed to opening a new ticket.
The results of the tests executed are presented to the Operations team within the
ticket that is opened or updated. Thus, as an Operations team member opens a new
ticket ready to troubleshoot an issue, he/she is immediately presented with a solid
set of diagnostic information, instead of having to start logging into
routers and running these same tests. This can significantly reduce the investigation time, the customer impact and the load on the Operations team.
For a more complex scenario, now consider a customer initiated ticket, created
either directly by the customer or by a customer’s service representative. The process automation system picks up the ticket automatically and, with support from
other systems, launches tests in an effort to automatically localize the issue. If the
system can identify the issue, it will either automatically dispatch workforce to the
field or refer the ticket out to the appropriate organization, which may even be another company (e.g., where another telecommunications provider may be providing access). Upon receiving confirmation that the problem has been resolved, the
expert system also executes tests to validate, and then closes the ticket if all tests
succeed. If the expert system is unable to resolve the issue, then it would collate
information regarding the diagnostics tests and create a ticket to trigger human investigation and troubleshooting.
The opportunity space for process automation is almost limitless. As technologies and modeling capabilities improve, automation can and is being extended beyond troubleshooting and event management, and into service recovery and repair
and even outage prevention. Although it is clearly hard to imagine automated fiber
repairs in the near future, there are scenarios in which actions can be automatically
taken to restore service. For example, network elements or network interfaces can
be rebooted automatically where such action is likely to restore service. One
scenario when this may be an attractive solution is where a known software bug is
impacting customers and the operator is waiting for a fix from the vendor. In the
meantime, if an issue occurs, then the most rapid way of recovering service would
be through the automated response – not waiting for human investigation and intervention.
As another example, consider errors in router configurations (misconfigurations) which can be automatically fixed via process automation. Regular auditing, such as that discussed in Chapter ???, may identify what
we refer to as “ticking time bombs” – misconfigurations that could cause significant customer impact under specific circumstances. These misconfigurations are
automatically detected, and can also be automatically repaired – the “bad” configuration being replaced with “good” configuration, thereby preventing the potentially nasty repercussions.
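A deliberately simplified illustration of such an audit-and-repair loop is sketched below: the live configuration of each interface is compared against the intended ("gold") values held in an inventory database, and repair actions are generated for any mismatch. The data model, attribute names and values are invented for illustration; real audit rules are far richer and vendor specific.

```python
# Toy configuration audit: compare live config values against intended
# ("gold") values and emit repair actions for mismatches. The data model is
# hypothetical; real audit rules are far richer and vendor specific.

intended = {  # from the inventory / provisioning database (assumed)
    ("per7.nyc", "ge-1/0/0"): {"mtu": 9000, "bfd": "enabled"},
    ("per7.nyc", "ge-1/0/1"): {"mtu": 9000, "bfd": "enabled"},
}

live = {      # parsed from the running router configuration (assumed)
    ("per7.nyc", "ge-1/0/0"): {"mtu": 9000, "bfd": "enabled"},
    ("per7.nyc", "ge-1/0/1"): {"mtu": 1500, "bfd": "disabled"},  # "time bomb"
}

def audit(intended, live):
    repairs = []
    for key, gold in intended.items():
        actual = live.get(key, {})
        for attr, want in gold.items():
            have = actual.get(attr)
            if have != want:
                repairs.append((key, attr, have, want))
    return repairs

for (router, ifc), attr, have, want in audit(intended, live):
    print(f"repair {router} {ifc}: set {attr} {have!r} -> {want!r}")
```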
Although automated responses to troubles are an attractive proposition, extreme care must clearly be taken with automation, making sure that the logic is carefully
thought out and that safeguards are introduced to prevent potentially larger issues
from arising with the wrong automated response. Meticulous tracking of automated actions is also critical, to ensure that automated repair does not hide underlying
chronic issues which need to be addressed. However, even with these caveats and
warnings, automation offers an opportunity to far more rapidly recover from issues than a human being could achieve through manually initiated actions.
The value of automation throughout the event management process (from event
detection through to advanced troubleshooting) is unquestionable – reducing millions of alarms to a couple of hundred or fewer tickets that require human investigation and intervention. This automation is what allows a small network operations team to successfully manage a massive network, with rapidly growing
numbers of network elements, customers and complexity.
4. Managing Network Performance over Time
Event management systems primarily focus on troubleshooting individual large
network events that persist over extended periods of time, such as link failures, or
recurring intermittent (chronic) flaps on individual links. However, this narrow
view of looking at each individual event in isolation risks leaving network issues
flying under the event management radar, while potentially impacting customers’
performance.
Let’s consider an analogy of financial management within a large corporate or
government organization. In keeping a tight rein on a budget, an organization
would likely be extremely careful about managing large transactions – potentially
going to great lengths to approve and track each of the individual large purchases
made across the organization. However, tracking individual transactions without
looking at history can result in patterns being missed. A single user’s individual
transaction may be fine in isolation, but analysis over a longer time interval may
uncover an excessively large number of such transactions – something which may
need further investigation. In addition, this approach does not scale to small
transactions, and thus smaller transactions would likely receive less attention before they are made. However, if no one is tracking the bottom line – the total expenditure – it may well be that the small expenditures add up considerably, and
could lead to financial troubles in the long run. Instead, careful tracking of how
money is spent across the board is critical, characterizing at an aggregate level
how, where and why this money is spent, whether it is appropriate and (in situations where money is tight) where there may be opportunities for reductions in expenditure. New processes or policies may well be introduced to address issues
identified. But this can only be seen with careful analysis of longer term spending
patterns – large and small. Returning to the network, carefully managing network
performance also requires examining network events holistically – exploring series of events – large and small - instead of purely focusing on each large event in
isolation. The end goal is to identify actions which can be taken to improve overall
network performance and reliability. Such actions can take many forms – for example, driving a subset of network impairments either entirely out of the network,
or minimizing their impact, or by incorporating better methods for detecting issues
in the event management system so that Operations can be rapidly informed of a
network issue. For example, if a particular failure mode is identified that is causing performance issues over time (e.g., repeated line card failures, excessive BGP
flaps, unexpectedly high packet losses over time), then it is desirable to work to
permanently eliminate this condition, where possible. The response depends entirely on the issues identified, but could involve software bug fixes, hardware redesigns, process changes or technology changes.
An important step towards driving network performance is to carefully track
performance over time, with periods of poor performance being identified so that
intervention can be initiated. For example, if it is observed that the network has recently been demonstrating unacceptably high unavailability due to excessive numbers of line card failures, or higher packet loss than expected or than normal, then investigation should be initiated in an effort to determine why this is occurring, and to take appropriate actions to rectify the situation. But we don't
need to wait for performance to degrade – regular root cause analysis of network
impairments can identify areas for improvements, and potentially even uncover
previously unrecognized yet undesirable network behaviors which could be eliminated. For example, consider a scenario where ongoing root cause analysis of network packet loss uncovered chronically slower than expected recovery times in the
face of network failures. Once identified, efforts can be initiated to identify why
this is occurring (router software bug?, hardware issue?, fundamental technology
limitations?) and to then drive either new technologies or enhancements to existing technologies into the network to permanently rectify this situation.
Tracking network performance over time and drilling into root causes of network impairments typically involves delving into large amounts of historical network data. Exploratory Data Mining (EDM) is thus used to complement real-time
event management through detailed analysis across large numbers of events, identifying patterns in failures and performance impairments and in the root causes of
these network events. However, this is clearly a challenging goal; it typically involves mining tremendous volumes of network data across very diverse data
sources to identify patterns either in individual time series, or across time series.
4.1. Trending Key Performance Indicators (KPI)
So let’s start by asking a relatively simple question - how well is my network
performing? The first step here is to clearly define what we mean by network performance, so that we can provide metrics which can be evaluated using available
network measurements. We refer to these metrics as Key Performance Indicators,
or KPIs. KPIs are very general metrics, but in the context here are really designed
for two primary purposes: (1) to track the network performance, and (2) to track
network health.
When tracking how well the network is performing, it is important to ensure
that metrics obtain a view which is as close as possible to what customers are experiencing. However, how well the network is performing is in the eye of the beholder – and different beholders have different vantage points, and different criteria. Some applications (e.g., email or file transfers) are extremely resilient to short
term impairments whilst others, such as video, are extremely sensitive to even extremely short term outages. Thus, KPIs need to capture and track a range of different performance metrics which reflect the diversity of applications being supported. Ideally, KPIs should track application measures – such as the frequency and
severity of video impairments that would be observable to viewers. However,
network-based metrics are also critical, particularly in networks where there is a
vast array of different applications being supported. Thus, KPIs should include,
but not be limited to, metrics tracking network availability (DPM – see Chapter 3
for details), application performance, end-to-end network packet loss, delay and jitter, and even network operations responses to tickets.
KPIs can also capture non-customer impacting measures of network health,
such as the utilization of network resources. These provide us with the ability to
track network health before we hit customer-impacting issues. There are many
limited resources in a router – link capacity is an obvious one, but router processing power and memory are two other key examples. Capacity is traditionally
tracked as part of the capacity management process discussed in Chapter 5 and is
therefore not discussed further here. But router CPU and memory utilization –
both on the central route processor and on individual line cards – are also critical
limited resources which are often less well analyzed than link capacity. One of the
most critical functions that the central route processor is responsible for is control
plane management – ensuring that routing updates are successfully received and
sent by the router, and that the internal forwarding within the router is appropriately configured to ensure successful routing of traffic across the network. Thus, the
integrity of our control plane is very much dependent on router CPU usage – if
CPUs become overloaded with high priority tasks, the integrity of the network
control plane could be put at risk. Router memory is similarly critical – if memory
becomes fully utilized within a router, then the router is at risk of crashing, causing a nasty outage. Thus, these limited resources must be tracked to ensure that
they are not hitting exhaust either in the long term, or over shorter periods of time.
KPIs must be measurable – thus, a given KPI must map to a set of measurements that can be made within the network or using data collected from network
operations. Availability-based KPIs can be readily calculated based on logs from
the event management system and troubleshooting analyses. Chapters 3 and 4 discuss availability modeling and associated metrics, which clearly depend on network measurements. This is not discussed further here.
Obtaining application dependent performance metrics requires extensive monitoring at the application level. This can be achieved either by using “test” measurements executed via sample devices strategically located across the network, or
by collecting statistics from user devices (in situations where the user devices are
accessible by the service provider). Such application measurements are a must for
networks which support a limited set of applications – such as a network supporting IPTV distribution. However, it is practically impossible to scale this to every
possible application type that may ride over a general purpose IP/MPLS backbone,
especially if application performance depends on a plethora of different customer
end devices.
Instead, end to end measurements of key network performance criteria, namely
packet loss, delay and jitter, are the closest general network measure of customer-perceived network performance. These measurements provide a generic, application-independent measure of network performance, which can (ideally) be applied
to estimate the performance of a given application.
End to end packet loss, delay and jitter would likely be captured from an end-to-end monitoring system, such as described in Chapter 10 and [6]. In an active
monitoring infrastructure, for example, large numbers of end to end test probes are
sent across the network; loss and delay measurements are calculated based on
whether these probes are successfully received at each remote end, and how long
they take to be propagated across the network. From these measurements, network
loss, delay and jitter can be estimated over time for each different source / destination pair tested. By aggregating these measures over larger time intervals, we can
calculate a set of metrics, such as average or 95th percentile loss / delay for each
individual source / destination pair tested. These can be further aggregated across
source / destination pairs to obtain network-wide metrics. KPIs can also examine
other characteristics of the loss beyond simple loss rate measurements – for example, tracking loss intervals (continuous periods of packet loss). How loss is distributed over time can really matter to some applications – a continuous period of
loss may have greater (or lesser) impact on a given application than the same total
loss distributed over a longer period of time. Thus, metrics which characterize the
number of short versus long duration outages may be used to characterize the performance of various network applications.
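A minimal sketch of this aggregation step is shown below: given per-probe loss and delay samples for each source / destination pair, it computes average and 95th percentile KPIs per pair, a simple count of loss intervals, and a network-wide average. The input layout and sample values are assumptions for illustration.

```python
# Sketch: aggregate active-probe measurements into KPIs. Input layout,
# sample values and aggregation choices are illustrative assumptions.
from statistics import mean, quantiles

# probe results per (source, destination) pair: list of (loss_fraction, delay_ms)
samples = {
    ("nyc", "lax"): [(0.0, 31.2), (0.0, 30.9), (0.02, 35.7), (0.0, 31.0)],
    ("nyc", "chi"): [(0.0, 12.1), (0.0, 12.3), (0.0, 12.0), (0.01, 14.8)],
}

def p95(values):
    return quantiles(values, n=20)[-1]   # 95th percentile cut point

pair_kpis = {}
for pair, obs in samples.items():
    loss = [l for l, _ in obs]
    delay = [d for _, d in obs]
    pair_kpis[pair] = {
        "avg_loss": mean(loss), "p95_loss": p95(loss),
        "avg_delay_ms": mean(delay), "p95_delay_ms": p95(delay),
        # loss intervals: maximal runs of consecutive lossy probes
        "loss_intervals": sum(1 for prev, cur in zip([0.0] + loss, loss)
                              if cur > 0 and prev == 0),
    }

# Network-wide KPI: aggregate across all source / destination pairs
network_avg_loss = mean(k["avg_loss"] for k in pair_kpis.values())
print(pair_kpis)
print("network-wide average loss:", network_avg_loss)
```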
Ideally, end to end measurements should extend out as far to the customer as
possible, preferably into the customer domain. However, this is often not practical
– scaling to large numbers of customers can be infeasible, and the customer devices are often not accessible to the service provider. Thus, comprehensive end to end
measurements are often only available between network routers – leaving the connection between the customer and the network beyond the scope of the end to end
measurements. Tracking performance at the edge of the ISP network thus requires
different metrics. One popular metric is BGP flap frequency aggregated over different dimensions (e.g., across the network, per customer etc). However, it is important to note that by definition the ISP/customer and ISP/peer interfaces cross
trust domains - the ISP often only has visibility and control over its own side of
this boundary, and not the customer/peer domain. It is actually often extremely
challenging to distinguish between customer/peer-induced issues and ISP-induced
issues. Thus, without knowing about or being responsible for customer/peer activities on the other side of the trust boundary, BGP and other similar metrics can be
seriously skewed. A customer interface that is flapping incessantly can significantly distort these metrics, making it extremely challenging to distinguish real patterns that may be attributed to the ISP.
KPIs can also be calculated on more complex transformations of time series,
beyond simple statistical measures such as total counts or averages. Let’s again illustrate this by way of an example.
As discussed in Section 2.1, router performance measures, including CPU
measurements, are obtained from routers using pollers which collect these measurements at regular time intervals (e.g., every 5 minutes). Thus, we have time series of router CPU measurements averaged over 5 minute intervals. These time series are typically extremely noisy – varying dramatically as a function of time as
different processes pound the routers, collecting information and configuring the
device, and as the routers respond to topology updates and major routing changes
and so on. An example daily CPU plot is illustrated in Figure 11. From this plot,
we can see short periods where the CPU "spikes" to high values, and longer periods where the CPU measurements remain clearly elevated.
Figure 11. Example CPU time series
Simply looking at metrics such as average CPU would provide limited insight
into the overall router behavior over time, given the complexity of the process. Instead, we can carefully examine the anomalies as a function of time, even trending
the anomalies themselves. KPIs can then track different aspects of the CPU, including, for example, the rate of “bad” anomalies, and identify routers with higher
than expected rates of “bad” anomalies.
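One deliberately naive way to derive such a KPI is sketched below: a sample is flagged as anomalous when it sits well above a rolling-median baseline, and the per-router anomaly rate then becomes the quantity that is trended. The window size, threshold and sample data are assumptions; production schemes are considerably more sophisticated.

```python
# Naive sketch: flag CPU anomalies against a rolling-median baseline and
# compute a per-router anomaly rate as a trendable KPI.
# Window size and thresholds are illustrative assumptions.
from statistics import median

def cpu_anomalies(series, window=12, delta=30.0):
    """Return indices of samples sitting `delta` CPU-% above the rolling median."""
    flagged = []
    for i in range(window, len(series)):
        baseline = median(series[i - window:i])
        if series[i] - baseline > delta:
            flagged.append(i)
    return flagged

# 5-minute CPU samples per router (synthetic example data)
routers = {
    "cr1.dfw": [20, 22, 25, 21, 23, 24, 22, 21, 26, 23, 22, 24, 85, 90, 24, 23],
    "cr2.dfw": [30, 32, 31, 29, 30, 33, 31, 30, 29, 31, 32, 30, 31, 29, 30, 32],
}

for name, series in routers.items():
    anomalies = cpu_anomalies(series)
    rate = len(anomalies) / len(series)          # anomalies per sample
    print(f"{name}: {len(anomalies)} anomalies, rate {rate:.2%}")
```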
Once we have defined our key metrics, we can then track how these change
over time. This is known as trending. Trending of KPIs is critical for driving the
network to higher levels of reliability and performance, and for identifying areas
and opportunities for improvement. The goal is to see these KPIs improve over
time, corresponding to network performance improvements. However, if KPIs
turn south, indicating worsening network conditions, investigation would likely be
required. The KPIs can thus be used to focus Operations attention to areas that
need most immediate attention.
Let’s consider a simple example of end to end loss. If the loss-related KPIs
(e.g., average loss) degrade, then investigation would be required to understand
the underlying root cause and to (hopefully) initiate action(s) to reverse the negative trend. Obviously, the actions taken depend on the root cause identified – but
could include capacity augments, elimination of failure modes (e.g., if loss is being introduced by router hardware or software issues), or may even require the
introduction of new technology, such as faster failure recovery mechanisms.
Examining KPIs over time can also enable detection of anomalous network
conditions – thereby detecting issues which may be flying under the radar of the
event management systems described in Section 2.2. Let’s consider an example
where the rate of protocol flaps has increased within a given region of the network. The individual flaps are too short to alarm on – their associated alarms have
cleared even before a human can be informed of the event, let alone investigate.
Thus, alarms would only be created if there are multiple events exceeding a
threshold during a given time duration, at which point the flapping is determined to be chronic. If the flaps on individual interfaces do not cross this threshold, then an
aggregate increase in flaps across a region may go undetected by the event management system. However, this aggregate increase could be indicative of an unexpected condition, and be impacting customers. It would thus require investigation.
Careful trending and analysis of KPIs can also be used to identify new and improved signatures which can be used to better detect events that should be alarmed
within the event management systems discussed in Section 2. The identification
of new signatures is a continual process, typically leveraging the vast and evolving
experience gained by network Operators as they manage the network on a day-to-day basis. However, this human experience can be complemented by data mining.
It is far from easy to specify what issues should be alarmed on amidst the mound
of performance data collected from the routers. For example, under what conditions should we consider CPU load to be excessive and thus alarm on it to trigger
intervention? And at what point does a series of “one off” events become a chronic condition that requires immediate attention?
Performance events used to trigger alarms are typically detected using simple
threshold crossings – an event is identified when the parameter of interest exceeds
a pre-defined value (threshold) for a given period of time. However, even selecting this threshold is often extremely challenging. How bad and for how long
should an event be before Operations should be informed for immediate attention?
Low level events more often than not clear themselves; if we are overly sensitive in picking up events to react to, then we generate too many false alarms – Operations personnel would spend most of their time chasing these, and will likely miss the real issues amongst the noise. But if we are not sensitive
enough, then critical issues may fly under the radar and not be appropriately reacted to in a timely fashion. Selecting good thresholds for alarm generation is typically achieved through a combination of tremendous Operational experience,
combined with detailed analysis of KPIs over time. However, note that the
thresholds selected may not actually be a constant value – in some cases they
could vary over time, or may vary over different parts of the network. Thus, in
some cases, it may be sensible to actually have these thresholds be automatically
learned and adjusted as the network evolves over time.
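A minimal version of this threshold-with-duration logic is sketched below: an alarm is raised only when the monitored value stays above the threshold for a minimum number of consecutive polling intervals. The threshold and persistence values are illustrative assumptions and, as noted above, might in practice vary by router, by region and over time.

```python
# Sketch of threshold-crossing alarm generation with a persistence requirement:
# alarm only if the value exceeds `threshold` for at least `persistence`
# consecutive polling intervals. Parameter values are illustrative assumptions.

def threshold_alarms(samples, threshold=90.0, persistence=3):
    """Yield (start_index, end_index) of runs that warrant an alarm."""
    run_start = None
    for i, value in enumerate(samples + [float("-inf")]):  # sentinel flushes last run
        if value > threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= persistence:
                yield (run_start, i - 1)
            run_start = None

# 5-minute CPU utilization samples (synthetic)
cpu = [40, 95, 50, 92, 93, 96, 97, 60, 91, 92, 45]
print(list(threshold_alarms(cpu)))   # only the 4-interval run at indices 3..6 alarms
```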
Simple thresholding techniques can also be complemented by more advanced
anomaly detection mechanisms, particularly when detecting degrading conditions
which may be indicative of an impending issue. For example, consider a router
experiencing a process memory leak. In such a situation, the available router
memory will (gradually) decrease – the rate of decrease being indicative of the
point at which the router will hit memory exhaust and likely cause a nasty router
crash. Under normal conditions, router memory utilization is relatively flat; a
router with a memory leak can be detected by a non-zero gradient in the memory
utilization curve. Predicting the impending issue well in advance provides Operations with the opportunity to deal with the issue before it becomes customer impacting. These are known as predictive alarms, and can be used to nip a problem
in the bud, thereby preventing a potentially nasty issue. Again, detailed analysis
over extended periods of data is required to identify appropriate anomaly detection
schemes which can accurately detect issues, with minimal false alarms. There is a
vast array of publications focused on anomaly detection using network data, although the majority of the work focuses on traffic anomalies [7; 8; 9; 10; 11].
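The sketch below gives a bare-bones version of such a predictive check: it fits a least-squares line to recent memory utilization samples and, if the slope is positive, estimates the time remaining until exhaustion. The polling interval, sample values and warning horizon are assumptions for illustration.

```python
# Predictive-alarm sketch: fit a least-squares line to recent memory
# utilization and estimate the time remaining until exhaustion (100%).
# Sample data, polling interval and warning horizon are assumptions.

def linear_fit(y):
    """Least-squares slope/intercept for equally spaced samples y[0..n-1]."""
    n = len(y)
    xs = range(n)
    x_mean, y_mean = (n - 1) / 2, sum(y) / n
    num = sum((x - x_mean) * (v - y_mean) for x, v in zip(xs, y))
    den = sum((x - x_mean) ** 2 for x in xs)
    slope = num / den
    return slope, y_mean - slope * x_mean

def hours_to_exhaust(mem_percent, poll_minutes=5):
    slope, intercept = linear_fit(mem_percent)
    if slope <= 0:                      # flat or decreasing: no leak suspected
        return None
    n = len(mem_percent)
    samples_left = (100.0 - (intercept + slope * (n - 1))) / slope
    return samples_left * poll_minutes / 60.0

# Utilization creeping up ~0.5% per 5-minute sample -> leak suspected
mem = [62.0 + 0.5 * i for i in range(24)]
eta = hours_to_exhaust(mem)
if eta is not None and eta < 72:        # warn if exhaustion predicted within 3 days
    print(f"predictive alarm: memory exhaustion in ~{eta:.1f} hours")
```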
4.2. Root Cause Analysis
KPIs track network performance; they do not typically provide any insight into
what is causing a condition or how to remediate it. But if we want to drive network improvements, then we must understand the root cause of performance issues – potentially down to the smallest individual events.
Troubleshooting persistent network events was discussed in detail in Section
2.4. However, short term impairments, such as packet losses or CPU anomalies,
are often too short to even enable real-time investigation – the event has more often than not cleared before Operations personnel could even be informed about it,
let alone investigate. And short term events are expected – no matter how careful
network planning and design is, occasional short term packet losses, for example,
will occur. There is nothing that can be done about a short event that occurred
historically; Operations can only address potential future events. Thus, short impairments can be examined from the point of view of understanding how often they
occur, whether recent events are more or less frequent than expected and what are
the most significant root causes of these impairments. If events are more frequent
than normal or expected, or if opportunities can be identified to eliminate some of
these short term impairments, then actions can be taken. Such analyses have the
potential to uncover the need for hardware or software repairs, configuration
changes, or maintenance procedural changes, to name but a few.
Characterizing the root causes of network events (e.g., packet loss) – even the
small individual events - and then creating aggregate views across many such
events can enable insights into the underlying behavior of the network and uncover opportunities for improvements. Obviously, investments made in improving
the network to address performance should ideally be focused where they will
have the greatest impact. For example, by characterizing the root causes of the
different packet loss conditions observed from end to end performance measurements, a network operator can identify the most common root causes, and focus
energies on permanently eliminating or at least reducing these. If many of these
root causes were, for example, congestion-related, then additional network capacity may be required. If, instead, significant loss correlated with a previously unidentified issue within the network elements, then additional analysis would be required. A Pareto analysis [] is a formal technique used to guide this process – formally evaluating the benefits of potential actions which can be taken, and then selecting those which together come close to having the maximal possible
impact. To understand the root cause of recurring symptoms, such as hardware
failures, packet loss or routing flaps, we need to drill down into individual events
in a bid to come up with the best explanation of their likely root cause. Root cause
analysis here is similar to that described above in Section 2.4 – with a couple of
important distinctions. Specifically, we are typically examining large numbers of
individual events, as opposed to a single large event, and we are typically examining historical events.
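As a concrete, if simplified, illustration of a Pareto-style prioritization, the sketch below takes a hypothetical breakdown of customer-impact minutes by root cause, ranks the causes, and identifies the smallest set accounting for roughly 80% of the total impact. The categories and numbers are invented.

```python
# Pareto-style prioritization sketch: rank root causes by impact and select
# the smallest set covering ~80% of the total. Categories/values are invented.

impact_minutes = {           # customer-impact minutes attributed to each root cause
    "congestion": 4200,
    "line card failures": 2600,
    "fiber cuts": 1900,
    "router software bugs": 900,
    "planned maintenance overruns": 350,
    "unexplained": 250,
}

total = sum(impact_minutes.values())
ranked = sorted(impact_minutes.items(), key=lambda kv: kv[1], reverse=True)

cumulative, focus = 0, []
for cause, minutes in ranked:
    cumulative += minutes
    focus.append((cause, minutes, cumulative / total))
    if cumulative / total >= 0.8:
        break

for cause, minutes, cum_share in focus:
    print(f"{cause:30s} {minutes:6d} min  cumulative {cum_share:.0%}")
```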
As with troubleshooting large persistent individual issues, root cause analysis
of recurring symptoms typically commences with detailed analysis of available
network data. If the symptoms being investigated were actually already individually analyzed by Operations staff, as would be the case for large, persistent network faults (e.g., line card failures etc), then analysis would rely extensively on
observations made by the Operations personnel during this process. More generally, scalable data mining techniques are required to effectively make the most of
the available data, given the large number of events presumably involved, and the
diversity of possible root causes. But data mining alone cannot always lead to the
underlying root cause(s) – especially in scenarios where anomalous network conditions or unexpected network behaviors are being identified. Instead, such analy-
34
sis must be complemented by targeted network measurement, lab reproduction
and detailed software and/or hardware analysis. Targeted network measurements
can be used to complement the regular network monitoring infrastructure in situations where the general monitoring is insufficiently fine grained or focused to provide the detailed information required to troubleshoot an issue. The data collection infrastructure in place across a large network aims to capture information
critical to a wide range of tasks, including network troubleshooting. But such an
infrastructure has to scale across the whole network – there is thus a careful
tradeoff required between the feasibility of capturing measurements at scale and
the need for this information. Thus, the data collected by a general infrastructure
may not always suffice when troubleshooting a very specific issue; it may not be
granular enough, or certain measurements may simply not be made at all on a regular basis. Targeted network measurements can instead be used to obtain more detailed information as required. For example, when troubleshooting an issue related
to router process management, finer grained CPU measurements may be temporarily obtained on a small subset of network routers through targeted measurements (e.g., recording measurements every 5 seconds instead of every 5 minutes).
Similarly, detailed inspection of specific traffic carried over a network link may be
required if malformed packets are causing erroneous network behavior – identifying what this traffic is and where it is coming from may be fundamental to addressing the issue. However, note that targeted measurements will most likely need to be taken when the event of interest occurs – which could be challenging for an intermittent issue, unless one can be explicitly induced.
But once such measurements are available, they can complement the regularly collected data and be fed into the network analyses.
Lab testing aimed at reproducing a problem observed in the field and detailed
software/hardware analysis can also be used to complement data analysis. Large
ISPs typically maintain a lab containing the same hardware and software configurations that are running in the field, providing a controlled environment in which testers can try to reproduce and understand the issue at hand
– without the risk of impacting customer traffic. Lab testing can be used to understand the conditions under which the issue occurs, and to experiment with potential solutions to address it. Examining software and hardware designs and implementations can also provide tremendous insight into why problems are occurring
and how they can be addressed.
However, lab testing and detailed software/hardware analysis are more often
than not immense and extremely time consuming efforts which should not be entered into lightly. It is thus critical to glean as much information as possible from available
network data to effectively guide these other efforts. EDM techniques are at the
heart of this.
Constructing a good view of what is happening in a network requires looking
across a wide range of different data sources, where each data source provides a
different perspective and insight. Manually pulling all of this data together and
then applying reasoning to hypothesize about an individual event’s root cause is
excessively challenging and time consuming. The data itself is typically available
in a range of different tools, as depicted in Figure 12. These tools are often created and managed by different teams or organizations, and often present information
in different formats and via different interfaces. In the example in Figure 12, performance data collected from SNMP MIBs is viewed via one web site, router syslogs may be collected either directly from a router or accessed from an archive
stored on a network server, a different server collates work flow logs and end to
end performance data is available in yet another web site. Thus, manually collecting data from all these different locations and then correlating it to build a complete view of what is happening in any given situation can be a painstaking process, to say the least. This situation may be further complicated in scenarios
which involve multiple network layers or technologies which are managed by different organizations. It is entirely possible that information across network layers /
organizations is only accessible via human communication with an expert in the
other organization. Thus, obtaining information about lower layer network performance (e.g., the SONET network) may involve reaching out via the phone to a layer one
specialist. Of course, different data sources also use different conventions for
timestamps (e.g., different time zones) and in naming network devices (e.g., router
names may be specified as IP addresses, abbreviated names, domain name extensions etc). These further compound the complexity of correlating across different
data sources. Thus, analyzing even a single event could potentially take hours by
hand - simply in pulling the relevant data together. This is barely practical when
troubleshooting a complex individual event, but becomes completely impractical
when troubleshooting recurring events with potentially large numbers of root
causes. However, it has historically been the state of the art.
Figure 12. Troubleshooting of network events is difficult if data is stored in
separate “data silos”
Data integration and automation is thus absolutely critical to scaling network
troubleshooting and data mining in support of root cause analysis. It would be difficult to overstate the importance of data integration: making all of the relevant
network and systems data readily available. Ideally the data should be made
available in a form that makes it easy to correlate significant numbers of very diverse data sources across extended time intervals. It is also extremely valuable to
simply expose the data to Operations though a set of integrated data browsing and
visualization tools. AT&T Labs have taken a practical approach to achieving this
- collecting data from large numbers of individual “data silos” and integrating
them into a common infrastructure. Above this common data infrastructure is a
set of scalable tools and applications, typically designed to work over many very
diverse data sources. This architecture is illustrated in Figure 13. Scalable and automated feed management and database management tools [12; 13] are used to
amass and then load data into a massive data warehouse. The data warehouse is
archiving data collected from various configuration, routing, fault, performance
and application measurement tools, across multiple networks. Above the data
warehouse resides a set of scalable and reusable tool sets, which provide core capabilities in support of various applications. These tool sets include a reporting
engine (used for making data and more complex analyses available via web reports to end users), anomaly detection schemes, and different correlation infrastructures (rules-based correlations and techniques for correlation testing and
learning [14; 15]). Above the reusable tool sets lies a wide and rapidly expanding
range of data mining applications - including troubleshooting tools, network topology and time series visualization, trending reports, and in-depth analysis of recurring conditions (e.g., packet loss, BGP flaps, service quality impairments).
The key to the infrastructure is scale – both in terms of the amount and the diversity of the network data being collated, and the range of different applications
being created. One of the key ingredients to achieving this is simple normalization of the data, performed as the data is ingested into the database – ensuring, for example, that a single time zone is used across all data sources, and
common naming conventions are used to describe network elements, networks etc.
Operations and applications alike thus do not have to continually convert between
different time zones, naming formats etc – dramatically simplifying human and automated analyses. Normalization and data pre-processing are done once – and thus
not required by each individual application and user. Although an enormous undertaking, such an infrastructure enables scale in terms of the very diverse analyses which can be performed [12; 13].
Figure 13. Scaling network troubleshooting and root cause analysis
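The kind of ingest-time normalization described above can be sketched very simply, as below: timestamps carrying arbitrary time zones are converted to UTC, and the various forms a router name may take (IP address, short name, fully qualified name) are mapped to a single canonical name via a lookup built from inventory data. The alias table and record formats are assumptions for illustration.

```python
# Ingest-time normalization sketch: convert timestamps to a single time zone
# (UTC) and map router-name variants to one canonical name. The alias table
# and record formats are illustrative assumptions.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# alias -> canonical router name (would be built from inventory data)
ROUTER_ALIASES = {
    "192.0.2.17": "cr1.dfw",
    "cr1": "cr1.dfw",
    "cr1.dfw.example.net": "cr1.dfw",
}

def normalize(record):
    # 1. Normalize the timestamp: interpret it in its source time zone, store UTC.
    local = datetime.strptime(record["time"], "%Y-%m-%d %H:%M:%S")
    local = local.replace(tzinfo=ZoneInfo(record["tz"]))
    record["time_utc"] = local.astimezone(timezone.utc).isoformat()

    # 2. Normalize the device name to the canonical form.
    record["router"] = ROUTER_ALIASES.get(record["router"], record["router"])
    return record

syslog_event = {"router": "cr1", "time": "2009-03-01 02:15:09", "tz": "America/Chicago"}
snmp_sample = {"router": "192.0.2.17", "time": "2009-03-01 08:15:00", "tz": "UTC"}
print(normalize(syslog_event))
print(normalize(snmp_sample))
```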
So let’s now return to the challenge of uncovering root causes for a series of recurring events. We specify these events as a time series, referred to as a “symptom
time series". This time series is characterized by temporal and spatial information describing when and where an event symptom was observed, and by event-specific information. For example, a time series describing unicast end-to-end packet loss
measurements would have associated timing information (when each event occurred and how long it lasted), spatial information (the two end points between
which loss was observed) and event specific information (e.g., the magnitude of
each event – in this case, how much loss was observed). Note that given the infrastructure depicted in Figure 13, the time series is typically formed using a simple
database query.
The most likely root cause of each of our symptom events is then identified by
correlating each event with the set of diagnostic information (time series) available in our data infrastructure. Domain knowledge is used to specify which events
should be correlated and how. The analysis effectively then becomes an automation of what a person would have executed – taking each symptom event in turn
and applying potentially very complex rules to identify the most likely explanation
from within the mound of available data. Aggregate statistics can then be calculated across multiple events, to characterize the breakdown of the different root
causes. Appropriate remedies and actions can then be identified. Again, the data
infrastructure depicted in Figure 13 makes scaling this infrastructure far more
practical, as the root causes are typically identified from a wide range of different
data sources; a common data warehouse with normalized naming conventions
means that the analysis infrastructure does not need to be painfully aware of the
data origins and conventions.
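A stripped-down version of this attribution step is sketched below: each symptom event (here, an end-to-end loss event) is matched against candidate diagnostic events that fall within a time window and satisfy a spatial constraint (the diagnostic event must involve a router on the symptom's path), and the resulting root-cause labels are then aggregated. The matching rule, time window and event data are all invented for illustration.

```python
# Sketch: attribute each symptom event to the most likely root cause by joining
# it with diagnostic events within a time window and a spatial constraint
# (diagnostic router must lie on the symptom's path). Data and rules are invented.
from collections import Counter

WINDOW_S = 300  # only consider diagnostics within +/- 5 minutes of the symptom

symptoms = [   # end-to-end loss events: (epoch seconds, path routers, loss %)
    {"t": 1000, "path": ["E", "C", "D"], "loss": 2.0},
    {"t": 5000, "path": ["E", "F", "H"], "loss": 0.8},
    {"t": 9000, "path": ["E", "C", "D"], "loss": 1.1},
]

diagnostics = [  # candidate explanations drawn from other data sources
    {"t": 950,  "router": "C", "cause": "link failure"},
    {"t": 5100, "router": "F", "cause": "line card reset"},
]

def attribute(symptom):
    for diag in diagnostics:
        in_window = abs(diag["t"] - symptom["t"]) <= WINDOW_S
        on_path = diag["router"] in symptom["path"]     # spatial constraint
        if in_window and on_path:
            return diag["cause"]
    return "unexplained"

breakdown = Counter(attribute(s) for s in symptoms)
print(breakdown)   # Counter({'link failure': 1, 'line card reset': 1, 'unexplained': 1})
```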
However, it is often far from clear as to what all the potential causes or impacts
of a given set of symptom events are. Take, for example, anomalous router CPU
utilization events. Once CPU anomalies are defined (a challenge unto itself), we
are then faced with the question of what causes these anomalous events, and what
impacts – if any – do they have. Domain knowledge can be heavily drawn upon
here – network experts can often draw on both knowledge of how things should work and their experience in operating networks to create an initial
list. However, network operators will rapidly report that networks do not always
operate as expected nor as desired – routers are complicated beasts with bugs
which can result in behaviors that violate the fundamentals of “networking 101”
principles. Examination of real network data, and lots of it, is necessary to truly
understand what is occurring.
Domain knowledge can be successfully augmented through EDM, which can
be used to automatically identify these relationships – specifically, to learn about
the root causes and impacts associated with a given time series of interest. Such
analyses are instrumental in advancing the effectiveness of network analyses, both
for troubleshooting recurring conditions and even for detecting failure modes
which are flying under the radar. Although there are numerous data mining techniques available, the key is to identify relationships or correlations between different time series [12; 13] and across different spatial domains. We are interested in
identifying those time series which are statistically correlated with our symptom
time series; these are likely to be root causes or impacts of our time series of interest. However, given the enormous set of time series and potential correlations involved here, domain knowledge is generally necessary to guide the analysis –
where to look and under what spatial constraints to correlate events (e.g., testing
correlations of events on a common router, common path, common interface). If
looking for anomalous correlations, the real challenge is in defining what is normal / desired behavior and what is not (and thus what may be worth investigating).
Figure 14. Customer - ISP access
Let’s consider another hypothetical example as an application of exploratory
data mining. We focus here on the connectivity between customers and an ISP –
specifically, between a customer router (CR) and a provider edge router (PER).
The connectivity between these two routers is provided by metropolitan and access networks as illustrated in Figure 14 – which in turn may be made up of a variety of different layer one and layer two technologies (see Chapter 2). These metro/access networks will often provide protection against faults within the lower
layers – the goal being that issues at the lower layer will be recovered without the
higher layer (layer three) being impacted. Thus, a protection event at the lower
layer should not result in any impact (barring a few lost packets) at layers above.
So now let us consider a situation where there is an underlying issue such that
protection events at the lower layer are in fact causing IP links to flap (go down
and then come back up very shortly later) at the layer above. This breaks the principle of layered network restoration – protection events at the lower layer should
not cause links between routers to fail – even for a short time. The result is unnecessary customer impact – instead of seeing a few tens of milliseconds break in
connectivity as the protection switching is executed, the customer may now be
impacted for up to a couple of minutes, depending on the routing protocols used
over the interface. This example is further complicated by the fact that it is
over a trust boundary - specifically between a customer and the ISP. The ISP has
limited visibility - it is typically not informed of actions taken on the customer
end. A customer may, for example, reboot a router and would not inform the ISP
to which it is connected. This makes it very challenging to create reliable alarms on customer-facing interfaces on ISP routers – the majority of events observed may not be service provider issues at all, but simply a result of activities
within the customer’s domain. Extensive investigation of such events is simply
impractical, given the tremendous number of such events. Thus, the fact that there
is an IP link flap on this interface is not sufficient to initiate detailed investigation.
So there is a very real risk that, without a great deal of care, this particular failure mode may fly under the radar. This condition may be being experienced across
large numbers of interfaces – but may not be occurring on any one interface often
enough to cause detection as a chronic condition and thus cause an alert to be triggered and manual investigation. In fact, on customer interfaces, it is entirely possible that even chronic conditions are not investigated without an explicit customer complaint, as more often than not these are caused by customer activities.
Let's now consider how EDM might be used to detect this condition. Data mining can expose that the IP link flap events are often occurring at the same time as
protection events associated with the same link between the PER and the CR – more often than could be explained as purely coincidences. This would be revealed by a strong statistical correlation between IP layer flaps and protection events
on the corresponding lower layer -- a correlation that violates normal operating
behavior, and is indicative of erroneous network behavior. Note the need here to
bring in spatial constraints – we are explicitly interested in behavior happening
across layers for each individual connection between a PER and a given CR – it is
this correlation that is not expected and indicates undesirable network behavior. It
is actually entirely expected that there be correlations between protection switching events on some PER – CR links and flaps on other, different interfaces on the
router – this would be a result of a common lower layer event impacting both protected and unprotected lower layer circuits. This highlights the complexity here
and the need for detailed domain knowledge in analyzing correlation results – the
spatial models used for correlation analyses are critical.
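A bare-bones version of this kind of correlation test is sketched below: time is divided into bins, and for a given PER – CR link the IP flap rate in bins containing a lower-layer protection event is compared with the rate in all other bins. A large lift suggests the layered-restoration principle is being violated on that link; in practice a formal significance test (for example, Fisher's exact test on the corresponding 2x2 table) would be applied. The bin size and event times are assumptions.

```python
# Sketch: per-link correlation between lower-layer protection events and IP
# link flaps. Time is binned; we compare the flap rate in bins containing a
# protection event against the rate in bins without one. Data is synthetic.
BIN_S = 60  # 1-minute bins (illustrative)

def bins(timestamps):
    return {int(t // BIN_S) for t in timestamps}

def flap_lift(protection_times, flap_times, total_bins):
    prot = bins(protection_times)
    flap = bins(flap_times)
    both = len(prot & flap)
    rate_with = both / len(prot) if prot else 0.0
    flaps_elsewhere = len(flap - prot)
    rate_without = flaps_elsewhere / (total_bins - len(prot))
    return rate_with, rate_without

# Synthetic events on one PER-CR link over a day (86400 s -> 1440 bins):
protection = [3600, 18000, 40000, 71000]          # lower-layer protection switches
flaps = [3605, 18010, 40010, 55000, 71015]        # IP link flaps on the same link

rate_with, rate_without = flap_lift(protection, flaps, total_bins=1440)
print(f"flap rate in protection bins: {rate_with:.2f}, elsewhere: {rate_without:.4f}")
# All 4 protection bins also saw a flap, versus roughly 0.0007 elsewhere -- far
# more often than chance, i.e. the layered-restoration principle looks violated.
```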
Statistical correlation testing can also be used in troubleshooting this behavior,
once detected. For example, correlation testing can be used to identify how the
strong correlation between IP flaps and protection events varies across technologies – does it only exist for certain types of lower layer technologies or certain
types of router technologies (routers, line cards)? If the same behavior is observed
across lower layer technologies from multiple vendors, then it is unlikely to be a
result of erroneous behavior in the lower layer equipment (for example, longer
than expected protection outages). If, however, the issue correlates to only a single
type of router, then it is highly likely that the issue is a router bug and the interface
is not reacting as designed. Thus, analysis of what is common and not common
across the symptom observations can help in guiding troubleshooting. Targeted
lab testing and detailed software analysis can then proceed so that the underlying
cause of the issue can be identified and rectified, with the intent that the failure
mode be permanently driven out of the network.
In general, the opportunity for EDM is tremendous in large-scale IP/MPLS
networks. The immense scale, complexity of network technologies, tight interaction across networking layers, and the rapid evolution of network software mean
that we risk having critical issues flying under the radar, and that network issues
are extremely complex to troubleshoot, particularly at scale. Driving network performance to higher and higher levels will necessitate significant advances in applying data mining to the extremely diverse set of network. This area is ripe for
further innovation.
5. Planned Maintenance
The previous sections focused on reacting to events which occur in the network.
However, a large proportion of the events which occur in the network are actually
a result of planned activities. Managing a large scale IP/MPLS network indeed requires regular planned maintenance – the network is continuously evolving as
network elements are added, new functionality is introduced, and hardware and
software are upgraded. External events, such as major road works, can also impact
areas where fibers are laid, and thus necessitate network maintenance.
There are two primary requirements of planned maintenance: (1) successfully completing the work required and (2) minimizing customer impact. As such, planned
network maintenance is typically executed during hours when expected customer
impact and network load is lowest, which typically equates to overnight and early
hours in the morning (e.g., midnight to 6am). However, the extent to which customers
would be impacted by any given planned maintenance activity also depends on the
location of the resources being maintained, and what maintenance is being performed. Redundant capacity would be used to service traffic during planned
maintenance activities – where redundancy exists, such as in the core of an IP
network. However, as discussed in Section 2.5, such redundancy does not always
exist and thus some planned maintenance activities can result in a service interruption. This is most likely to occur at the edge of an IP/MPLS network, where cost
effective redundancy is simply not available within router technologies today.
5.1. Preparing for planned maintenance activities
The scale of a large ISP network and complexity of competing planned maintenance activities occurring across multiple network layers could be a recipe for disaster. Meticulous preparation is thus completed in advance of each and every
planned maintenance event, ensuring that activities do not clash, and that there are
sufficient network and human resources available to handle the scheduled work.
Planning for network activities involves careful scheduling, impact assessment,
coordination with customers and other network organizations, and identifying appropriate mechanisms for minimizing customer impact. We consider these in more
detail here.
Scheduling planned maintenance activities often requires juggling a range of
different resources, including Operations personnel and network capacity – across
different organizations and layers of the network. For example, in an IP/MPLS
network segment where protection is not available in the lower layer network
(e.g., transport network interconnecting routers), planned maintenance in the lower layer will impact the IP/MPLS network. IP/MPLS traffic will be re-routed, requiring spare IP network capacity. This same network capacity could also be required if IP router maintenance should occur simultaneously. This is illustrated in
Figure 15, where maintenance is required both on the link between routers C and
D (executed by layer-1 technicians) and on router H (executed by layer-3 technicians). Assuming simple IGP routing, much of the IP/MPLS traffic normally carried on the link between routers C and D would re-route over the path E – F – H when the link between C and D is taken down as part of the planned maintenance. But if router H is unavailable due to simultaneous planned maintenance, this path would not be available. Thus, both the traffic normally carried on the link between routers C and D and that carried via router H would be competing for the remaining network resources – potentially causing significant congestion and corresponding customer impact. Clearly, this is not an acceptable situation – the two maintenance activities need to be coordinated so that
they do not occur at the same time. Thus, careful scheduling of planned maintenance across network layers is essential – also necessitating careful processes to
communicate and coordinate maintenance activities across organizations managing the different network layers.
Detailed “what if” simulation tools that can emulate the planned maintenance
activities are key to ensuring that customer traffic will not be impacted. Such
tools are executed in advance of the planned activities, taking into account current
network conditions, traffic matrices and planned activities. In situations where the
planned maintenance activities would cause impact, the simulation tools are also
used to evaluate potential actions which can be taken to ensure survivability (e.g.,
tweaking network routing).
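As a rough illustration of what such a what-if evaluation involves, the sketch below (Python, using the networkx library) removes the elements scheduled for maintenance from a model of the topology, re-runs shortest-path (IGP-style) routing over a traffic matrix, and flags any link whose projected utilization exceeds a threshold. The topology, weights, capacities and demands are invented for illustration; production tools operate on live network state and much richer routing and failure models.

# Sketch: "what if" capacity check for a planned maintenance window.
import networkx as nx
from collections import defaultdict

def whatif_utilization(graph, traffic_demands, down_nodes=(), down_links=()):
    """Project per-link load after removing maintained elements and rerouting
    all demands over IGP-style (shortest-path by weight) routes."""
    g = graph.copy()
    g.remove_nodes_from(down_nodes)
    g.remove_edges_from(down_links)
    load = defaultdict(float)
    for (src, dst), demand in traffic_demands.items():
        path = nx.shortest_path(g, src, dst, weight="weight")
        for u, v in zip(path, path[1:]):
            load[tuple(sorted((u, v)))] += demand
    return {link: load[link] / g.edges[link]["capacity"] for link in load}

# Illustrative topology loosely following Figure 15 (weights/capacities invented).
g = nx.Graph()
links = [("C", "D"), ("C", "E"), ("E", "F"), ("F", "H"), ("H", "D")]
g.add_edges_from(links, weight=1, capacity=10.0)   # capacity in Gb/s (hypothetical)

demands = {("C", "D"): 6.0, ("E", "H"): 3.0}       # hypothetical traffic matrix

# Evaluate taking the C-D link out of service for maintenance.
for link, util in whatif_utilization(g, demands, down_links=[("C", "D")]).items():
    flag = "  <-- over threshold" if util > 0.8 else ""
    print(f"link {link}: projected utilization {util:.0%}{flag}")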
Figure 15. Competing planned maintenance activities on a network link (between routers C and D) and a router (router H)
If the planned maintenance is instead occurring at the provider edge router (the
router to which a customer connects), then it may be necessary to coordinate with
or at least communicate the planned maintenance to the customer. This isn’t typically done in consumer markets, but is a necessary step when serving enterprise
customers. This communication is typically done well in advance of the planned
activities – often many weeks. If the work needs to be repeated across many edge
routers, as may occur when upgrading router software network-wide, then human resources must also be scheduled to manage the work across the different
routers. This can become a relatively complex planning process, with numerous
other constraints.
Once tentative schedules are in place, extensive network analysis and simulation studies are executed in anticipation of the planned maintenance event, to ensure that the network has sufficient available resources to successfully absorb the
traffic impacted by the planned maintenance activities. Notifications are also sent
to impacted enterprise customers well in advance of planned maintenance events
that would directly impact the customer (e.g., those at the edge of the network
where redundancy may be lacking) so that customers can make appropriate arrangements.
5.2. Executing planned maintenance
Planning for a scheduled maintenance event may occur well in advance of the
scheduled event, ensuring adequate time for customers to react and capacity evaluations to be completed. Maintenance can proceed after a successful last-minute check of current network conditions. Beyond this, service providers go to great
lengths to carefully manage traffic in real time so as to further minimize customer
impact. In locations where redundancy exists, such as in the core of the IP/MPLS
network, gracefully removing traffic away from impacted network links in advance of the maintenance can result in significantly smaller (and potentially negligible) customer impact compared with having links simply fail whilst still carrying traffic. Forcing traffic off the links which are due to be impacted by planned
maintenance also eliminates potential traffic re-routes that would result should the
link flap excessively during the maintenance activities. Removing all traffic before the maintenance commences thus minimizes customer impact. How this is achieved depends on the protocols used for routing traffic. For example, if simple IGP routing alone is used, then traffic can be re-routed away from the links by increasing the weight of the links to a very high value. This act is known as costing out the link. Once the maintenance is completed, the link can be costed in (reducing the IGP weight back down to the normal value, thereby re-attracting traffic).
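To illustrate the mechanism, the toy model below (Python, networkx) assumes unit IGP weights on the topology of Figure 15 and shows how raising the weight of the C-D link to a very high value moves traffic onto the alternate path before the link is physically touched, and how restoring the weight re-attracts it. The weight values are purely illustrative; in practice the cost-out is applied through the routers' IGP configuration.

# Sketch: IGP "cost out" of a link by raising its weight to a very high value.
import networkx as nx

COST_OUT_WEIGHT = 65535   # illustrative "very high" IGP metric

g = nx.Graph()
g.add_edge("C", "D", weight=1)
g.add_edge("C", "E", weight=1)
g.add_edge("E", "F", weight=1)
g.add_edge("F", "H", weight=1)
g.add_edge("H", "D", weight=1)

def path_c_to_d():
    return nx.shortest_path(g, "C", "D", weight="weight")

print("before cost-out:", path_c_to_d())      # traffic uses the direct C-D link

# Cost out the C-D link ahead of maintenance: traffic drains to the alternate path.
g.edges["C", "D"]["weight"] = COST_OUT_WEIGHT
print("after cost-out: ", path_c_to_d())      # traffic now routes C-E-F-H-D

# Cost the link back in after maintenance completes, re-attracting traffic.
g.edges["C", "D"]["weight"] = 1
print("after cost-in:  ", path_c_to_d())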
Continual monitoring of network performance and network resources is also
critical during and after the planned maintenance procedure, to ensure that any unexpected conditions that arise are rapidly detected, isolated and repaired. Think
about taking your car to a mechanic – how often has a mechanic fixed one problem only to introduce another as part of their maintenance activities? Networks are
the same – human beings are prone to make mistakes, even when taking the utmost care. Thus, network operators are particularly vigilant after maintenance activities, and put significant process and automated auditing in place to ensure that any issues are rapidly captured and addressed. For example, in the previously discussed scenario where links are costed out before commencing maintenance activities, it is critical that monitoring and maintenance procedures ensure that these
network resources are successfully returned to normal network operation after
completion of the maintenance activities. Accidentally leaving resources out of
service can result in significant network vulnerabilities, such as having insufficient
network capacity to handle future network faults.
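A simple automated audit of this kind might compare post-maintenance link metrics against their intended baseline values and flag anything left costed out. The sketch below is a minimal illustration under assumed data structures; a real audit would pull current IGP metrics from the routers or from a route monitor rather than from hard-coded dictionaries.

# Sketch: post-maintenance audit that flags links left costed out.
COST_OUT_WEIGHT = 65535   # illustrative "costed out" IGP metric

# Hypothetical snapshots: intended (baseline) vs. currently configured IGP weights.
baseline_weights = {("C", "D"): 1, ("C", "E"): 1, ("E", "F"): 1, ("F", "H"): 1}
current_weights  = {("C", "D"): COST_OUT_WEIGHT, ("C", "E"): 1, ("E", "F"): 1, ("F", "H"): 1}

def audit_link_weights(baseline, current):
    """Return links whose current weight does not match the baseline."""
    return [(link, baseline[link], current.get(link))
            for link in baseline
            if current.get(link) != baseline[link]]

for link, expected, actual in audit_link_weights(baseline_weights, current_weights):
    print(f"ALERT: link {link} weight is {actual}, expected {expected} "
          f"- was it left costed out after maintenance?")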
6. The Importance of Continued Innovation
Over the past ten years or so, we have witnessed tremendous improvements in
network reliability and performance. This has been achieved through massive investments across the board, including in router technologies, support systems for
detecting and alarming on events, process automation and procedures for planned
and unplanned maintenance. However, there still remains room for improvement
and considerable opportunities for innovation. The already vast and yet rapidly
expanding scale of the Internet will continue to pose increasing challenges, as will
the increasingly diverse application set and the often correspondingly stringent application requirements. The network reliability and performance requirements imposed
are likely to become increasingly stringent as IP continues to become the single
technology of choice for such a plethora of applications and services. We herein
outline a few directions in which advances in the state of the art promise to produce great operational benefits. We hope to use this as an opportunity to stimulate
greater attention and innovation in these areas.
The first area that we focus on is continued improvements in router reliability.
This was also discussed in Chapter [Yakov], but we include it here as well in view
of the tremendous impact that enhancements would have on both the performance
and the ease of operating a large scale network. The edge (PE) routers in an ISP
network are today a single point of failure (see Figure 8) – if either the router or
the line cards facing the customers are out of service, then customer service is impacted. And routers have not as yet attained the same level of reliability as other
more mature technologies (e.g., voice switches, transport equipment). Although
service providers have invested heavily in ensuring that this lack of network element reliability has minimal impact in the core – through the introduction of fully redundant hardware – such a solution does not scale at the edge of the network.
One case in point is router software upgrades, which today typically introduce
router downtime as the routers are rebooted. In the case of edge routers and singly
connected customers, this becomes a lengthy yet (currently) unavoidable service
outage. As discussed in Section 5, not only does this cause interruptions in customer service, but it also adds considerable operational overhead, as maintenance activities need to be coordinated and communicated with customers well in advance of the planned activities. Truly hitless upgrades – being able to update the software on routers without impacting router operation – are thus critical to dramatically improving availability and simplifying network operations. Similarly,
ISPs are experiencing a dire need for cost-effective line card protection. Consider
the scenario in which a customer connects to a provider edge router using only a
single circuit (see Figure 8). Without any line card redundancy, a failure of the PE
line card facing the customer would result in the customer being out of service until the issue can be repaired. The most significant consequence is obviously an unacceptable customer outage. But in addition, if this is a
hardware issue requiring intervention at the router in question, a service technician
must either be available on site or be rapidly deployed to the fault location –
very costly alternatives. However, if the customer using the failed resource can be
recovered using redundant hardware without physical human intervention at the
fault location, then the replacement of the failed hardware can be planned and executed in a timely yet more convenient fashion. The work could even be
scheduled along with other service requests at the same location. Thus, automated
(or remotely initiated) failure recovery can both significantly improve customer
performance (through fast failure recovery) and reduce the network operational
overheads. Router technology today supports dedicated protection (1:1) – where
for each working interface, there is a dedicated backup interface. However, this
approach is very expensive, essentially doubling the line card costs (which are
typically one of the most expensive components). There has been research demonstrating the feasibility of more cost-effective, shared protection [16; 17] –
implementing such capabilities in routers would result in significant reliability and
operational enhancements. We look forward to advances in this area leading to
fully integrated cost-effective fast-failure-protection capability on next generation
routers.
As demonstrated in earlier sections within this chapter, service providers have
typically mastered the detection, troubleshooting and repair of commonly occurring faults. For events which are regularly observed – say, fiber cuts or hardware failures – troubleshooting is typically a rapid process, aided by process automation capabilities such as those discussed in Section 3. However, the same
cannot be said for dealing with the more esoteric faults and performance issues –
those for which the root cause is far from apparent. Tools for effectively and rapidly aiding in troubleshooting such issues are sorely lacking; significant innovation is well overdue here. This is primarily because it is such a fundamentally
challenging problem and one best understood by the small teams of highly skilled
engineers to which such issues are escalated. The environment that these teams
work in is extremely demanding – many of the more esoteric scenarios investigated are vastly different from previous observations, and may (at least initially) appear to defy all logic and understanding. The network elements involve massive
amounts of complex software and hardware, complicated interactions exist across
network technologies and layers and across network boundaries (customers,
peers), and tremendous quantities of network data are spawned by the network elements and monitoring systems – and yet the key information needed to understand a very specific issue may not be something readily observed in the information regularly collected. Operations personnel, while under tremendous
pressure, have to sift through immense quantities of data from diverse tools, collect additional information, and reason and theorize about potential root causes. Tools
which effectively aid in this process would have great operational value.
We also highlight exciting opportunities for exploratory data mining in network
performance management. Section 4 discussed some of the activities which are
carried out today, including some more recent research pursuits. There is a vast
amount of interesting data collected on a regular basis from large operational networks, and correspondingly interesting opportunities for detecting and analyzing
issues which risk “flying under the radar”. The biggest challenges are probably
the sheer volume of data, and the complexity of the systems being analyzed. How
can we most effectively identify relevant and critical issues amongst the millions
(or more) of potential time series we could create from the network data? Our experience [14; 15; 11] indicates that pairing advanced data analytic techniques with strong networking domain knowledge and experience is a powerful combination for effectively driving continued network performance improvement.
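As a tiny example of the kind of analysis involved, the sketch below scans a collection of KPI time series and flags points that deviate strongly from a robust, median-based baseline. This is a deliberately simple stand-in for the far more sophisticated techniques referenced above; the series names, values and threshold are hypothetical.

# Sketch: flag large deviations across many KPI time series using a robust z-score.
import numpy as np

def robust_anomalies(series, threshold=5.0):
    """Return indices where the series deviates from its median by more than
    `threshold` times the median absolute deviation (MAD)."""
    x = np.asarray(series, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median)) or 1e-9   # guard against zero MAD
    scores = 0.6745 * (x - median) / mad          # approximate z-score
    return np.where(np.abs(scores) > threshold)[0]

# Hypothetical per-interface KPI series (e.g., packet loss counts per interval).
kpis = {
    "router_A:ge-0/0/1:loss": [0, 1, 0, 2, 1, 0, 1, 0, 40, 1, 0],
    "router_B:ge-0/0/3:loss": [3, 1, 2, 4, 2, 3, 1, 2, 3, 4, 2],
}

for name, series in kpis.items():
    hits = robust_anomalies(series)
    if len(hits):
        print(f"{name}: anomalous intervals at indices {hits.tolist()}")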
As a final topic, we consider process automation – specifically, how far can we
and should we proceed with automating actions taken by Operations teams? As
highlighted in Section 3, process automation is already an integral part of at least
some large ISP network operations. The ultimate goal would be to fully automate common fault
and performance management actions, closing the control loop of issue detection,
isolation, mitigation strategy evaluation, and actuation of the devised responses in
the network (e.g., re-routing traffic, repairing databases, fixing router configuration
files). We refer to this as adaptive maintenance. Consider, by way of example,
a situation in which Operations have repeatedly observed that a software reload of
a line card can restore service under certain failure conditions – Operations have
adopted this strategy to restore service while the root cause is investigated. Following the adaptive maintenance concept, a network management
system could automatically detect the signature for this failure mode, emulate the
validation process that a human would follow before taking action (e.g., determine
whether the problem is already being worked under an existing active ticket and whether the
symptoms are consistent with the proposed response), launch the reload command,
verify whether it achieved the expected result and finally report the incident for
further investigation.
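A closed loop of this kind could be structured roughly as follows. The sketch below is illustrative Python only: every helper (the signature match, the ticket lookup, the reload command and the post-check) is a hypothetical placeholder standing in for integrations with real fault management, ticketing and element management systems.

# Sketch: adaptive-maintenance control loop for a known "line card reload" signature.
# The stubs below are hypothetical placeholders for real system integrations.

class TicketSystemStub:
    def has_active_ticket(self, device): return False
    def open_ticket(self, event, note, priority="normal"):
        print(f"ticket opened for {event['device']}: {note} (priority {priority})")

class DeviceApiStub:
    def reload_line_card(self, device, slot): print(f"reloading {device} slot {slot}")
    def service_restored(self, device, slot): return True

def safe_to_act(event, tickets):
    """Emulate the validation a human would perform before taking action."""
    if tickets.has_active_ticket(event["device"]):
        return False                      # already being worked - do not interfere
    return event.get("symptoms_consistent", False)

def adaptive_maintenance(event, tickets, devices):
    if event.get("signature") != "linecard_wedged":          # known failure signature?
        return "escalate_to_operations"
    if not safe_to_act(event, tickets):
        return "defer_to_existing_ticket"
    devices.reload_line_card(event["device"], event["slot"]) # the devised response
    if devices.service_restored(event["device"], event["slot"]):
        tickets.open_ticket(event, note="auto-reload restored service; root cause TBD")
        return "restored_and_reported"
    tickets.open_ticket(event, note="auto-reload failed", priority="high")
    return "escalate_to_operations"

event = {"device": "er1.nyc", "slot": 3, "signature": "linecard_wedged",
         "symptoms_consistent": True}
print(adaptive_maintenance(event, TicketSystemStub(), DeviceApiStub()))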
While adaptive maintenance promises to greatly reduce recovery times and can
eliminate human errors that are inevitable in manual operations, a flawed adaptive
maintenance capability can create damage at a scale and speed which is unlikely
to be matched by humans. It is thus crucial to carefully design and implement
such an adaptive maintenance system, and thoroughly and systematically verify it
against all situations that could arise within the network, a non-trivial task to say
the least. That being said, the overall result has the potential to dramatically improve network reliability and performance. Identifying promising new opportunities for automated recovery and repair of both networks and supporting systems
(e.g., databases, configuration) is an area wide open for innovation.
7. Conclusions
We conclude with some final thoughts on “best practice” principles for both
fault and performance management, and planned maintenance.
Text Box: Fault and Performance Management “Best Practice” Principles:
• Plan carefully as new technologies are to be deployed – don’t make network measurement, fault and performance management an afterthought!
• Detection: deploy an extensive and diverse monitoring infrastructure that will enable rapid detection and troubleshooting of network events
• Scalable correlation to rapidly drill down to root cause: automation is required so that faults can be repaired rapidly
• Fault management systems: ensure that they are reliable and easily extensible
• Process automation: automate commonly executed tasks – but be careful
• Ensure cost-effective (shared) redundancy is available where possible – it can (1) improve customer performance through rapid failure recovery and (2) significantly reduce Operations overheads by eliminating the need for express technical callouts, replacing individual callouts with planned maintenance events that can deal with multiple tasks in the one callout
• Execute continual ongoing analysis, looking for new signatures and improving existing signatures for detecting network issues
• Trending: be on top of key performance indicators – both those indicative of customer impact (loss, delay) and those associated with network health (CPU, memory)
• Create a scalable data infrastructure for data mining – avoid silos and create infrastructure for rapidly obtaining data for both rapid troubleshooting and scaling troubleshooting efforts
• Ensure that adequate tools are available for troubleshooting the wide range of different network events which occur
Text Box: Planned Maintenance “Best Practice” Principles:
• Ensure that human beings “think twice” as they are touching the network (in a bid to minimize mistakes)
• Plan and schedule maintenance activities carefully to minimize customer impact
• Ensure careful scheduling and coordination of maintenance events, including across network layers where necessary
• Ensure careful coordination and validation of planned activities to ensure that there is sufficient spare network capacity to absorb load during backbone network maintenance
• Validate network state before executing planned maintenance, to ensure that maintenance is not executed when the network is already impaired
• Minimize customer impact by taking routers and network resources “gracefully” out of service where possible (e.g., in the network backbone)
• Suppress alarms during planned maintenance to avoid Operations chasing events which they are inducing!
• Carefully monitor performance during and after maintenance, ensuring that all resources are successfully returned to operation
• Validate planned maintenance actions after completion (e.g., verify router configuration files) to ensure that the correct actions were taken and that configuration errors or bad conditions were not introduced during the activities
• Ensure mechanisms are available to rapidly back out of maintenance activities in case issues are encountered
8. Acknowledgments
The authors would like to thank the AT&T Operations teams for invaluable collaborations with us, their Research partners, over the years. In particular, we
would like to thank Bobbi Bailey, Heather Robinett and Joanne Emmons (AT&T)
for detailed discussions related to this chapter and beyond. And finally, we would
like to acknowledge Stuart Mackie from EMC, for discussions regarding alarm
correlation.
9. References
1. C. Lonvick. The BSD Syslog Protocol. IETF, 2009. RFC 5424.
2. P. Della, C. Elliott, R. Pavone, K. Phelps and J. Thompson. Performance and Fault Management. Cisco Press, 2000.
3. A. Shaikh and A. Greenberg. OSPF Monitoring: Architecture, Design and Deployment Experience. Proc. USENIX Symposium on Networked Systems Design and Implementation (NSDI), March 2004.
4. D. Mauro and K. Schmidt. Essential SNMP. O'Reilly, 2005.
5. S. Kliger, S. Yemini, Y. Yemini, D. Ohsie and S. Stolfo. A Coding Approach to Event Correlation. Proceedings of the Fourth International Symposium on Integrated Network Management, pp. 266-277, 1995.
6. S. Yemini, S. Kliger, E. Mozes, Y. Yemini and D. Ohsie. High Speed and Robust Event Correlation. IEEE Communications Magazine, pp. 82-90, May 1996.
7. G. Ramachandran, L. Ciavattone and A. Morton. Standardized Active Measurements on a Tier 1 IP Backbone. IEEE Communications Magazine, Vol. 41, No. 6, June 2003.
8. P. Barford, J. Kline, D. Plonka and A. Ron. A Signal Analysis of Network Traffic. Internet Measurement Workshop, 2002.
9. Y. Huang, N. Feamster, A. Lakhina and J. J. Xu. Diagnosing Network Disruptions with Network-wide Analysis. SIGMETRICS, 2007.
10. A. Lakhina, M. Crovella and C. Diot. Mining Anomalies Using Traffic Feature Distributions. SIGCOMM, 2005.
11. Y. Zhang, Z. Ge, A. Greenberg and M. Roughan. Network Anomography. Internet Measurement Workshop, 2005.
12. S. Venkataraman, Caballero, D. Song, A. Blum and J. Yates. Black Box Anomaly Detection: Is It Utopian? 5th Workshop on Hot Topics in Networking (HotNets), 2006.
13. L. Golab, T. Johnson and V. Shkapenyuk. Scheduling Updates in a Real-time Stream Warehouse. International Conference on Data Engineering (ICDE), 2009.
14. L. Golab, T. Johnson, J. Seidel and V. Shkapenyuk. Stream Warehousing with DataDepot. SIGMOD, 2009.
15. A. Mahimkar, J. Yates, Y. Zhang, A. Shaikh, J. Wang, Z. Ge and C. Ee. Troubleshooting Chronic Conditions in Large IP Networks. 4th ACM International Conference on Emerging Network Experiments and Technologies (CoNEXT), 2008.
16. A. Mahimkar, Z. Ge, A. Shaikh, J. Wang, J. Yates, Y. Zhang and Q. Zhao. Towards Automated Performance Diagnosis in a Large IPTV Network. SIGCOMM, 2009.
17. P. Sebos, J. Yates, G. Li, D. Rubenstein and M. Lazer. An Integrated IP/Optical Approach for Efficient Access Router Failure Recovery. IEEE/LEOS Optical Fiber Communications Conference, 2004.
18. P. Sebos, J. Yates, G. Li, D. Wang, A. Greenberg, M. Lazer, C. Kalmanek and D. Rubenstein. Ultra-fast IP Link and Interface Provisioning with Applications to IP Restoration. IEEE/LEOS Optical Fiber Communications Conference, 2003.
19. O. Maggiora, C. Elliott, R. Pavone, K. Phelps and J. Thompson. Performance and Fault Management. Cisco Press, 2000.