
Network Management: Fault Management, Performance Management and Planned Maintenance
Jennifer M. Yates, Zihui Ge
AT&T Labs - Research, Florham Park, NJ
Email: jyates@research.att.com; gezihui@research.att.com
1. Introduction
As the Internet grew from a fledgling network interconnecting a few University
computers to a massive infrastructure that interconnects users across the globe, the
focus was primarily on providing connectivity to the masses. And IP has certainly
achieved this. As of June 2008, it was estimated that approximately 1.5 billion people were connected to the Internet [ref: http://www.internetworldstats.com/stats.htm] – approximately 1 in 5 people across the globe.
But as networks using the Internet Protocol (IP) became increasingly widespread,
an increasingly diverse set of services also came to use IP as the underlying communications protocol. IP now supports a diverse set of services, including eCommerce, voice, mission critical business applications, TV (IPTV), as well as email
and Internet web browsing. Nowadays, even people’s lives depend on IP networks
– as IP increasingly supports emergency 9-1-1 services and other medical applications.
This tremendous diversity in applications places an equally diverse set of requirements on the underlying network infrastructure, particularly with respect to
bandwidth consumption, reliability and performance. At one extreme, email applications are resilient to network impairments – the key requirement being basic
connectivity. Email servers will continually attempt to retransmit emails, even in
the face of potentially lengthy outages. In contrast, real-time video services typically require high bandwidth and are sensitive to even very short term “glitches”
and performance degradations.
IP is also a relatively new technology – especially if we compare it with technology such as the telephone network, which has now been around for over 130
years. As IP technology has matured over recent years, network availability has
been driven up – a result of hardware and software maturing, as well as continued
improvements in network management practices, tools and network design. This
improvement has enabled a shift in emphasis to focus beyond availability and
faults to managing performance metrics – for example, eliminating short term
“glitches” which may not be at all relevant to many applications (e.g., email), but
can cause degradation in service quality to applications such as video and gaming.
The transformation of IP from a technology designed to support best effort data
delivery to one that supports a diverse range of mission-critical applications is a
testament to the industry and to the network operators who have created technologies, automation and procedures to ensure high reliability and performance.
In this chapter, we focus on the network management systems and the tasks involved in supporting the day-to-day operation of an ISP network. The goal of
network operations is to keep the network up and running, and performing at or
above designed levels of service performance. Achieving this goal involves responding to issues as they arise, proactively making network changes to prevent
issues from occurring and evolving the network over time to introduce new technologies, services and capacity. We can loosely classify these tasks into categories using traditional definitions discussed within the literature – namely fault
management, performance management, and planned maintenance.
At a high level, fault management is easy to understand: it includes the
“break/fix” functions – if something breaks, fix it. More precisely, fault management encompasses the systems and work flows necessary to ensure that failures (or faults) are rapidly detected, the root causes are localized, the impacts of
the faults on customers are minimized and the faults are rapidly resolved so that
the network resources and customer service are returned to normal. The goal of
restoring customer service quickly may involve a temporary fix or “work around”
until network maintenance can be scheduled to repair the underlying problem.
Performance management can be defined in several different ways: for example, (1) designing, planning, monitoring and taking action to prevent / repair overload conditions and (2) monitoring both end-to-end network performance and per-element performance and taking action to address performance degradation. The first definition focuses on traffic management, and encompasses roles executed by network
engineering organizations (capacity planning) and Operations (real-time responses
to network conditions). In this chapter, we will focus on the second definition of
performance management: monitoring network performance and taking appropriate actions when performance is degraded. This performance degradation may be
a result of traffic-related conditions, or may alternatively result from degraded
network elements (e.g., high bit error rates).
Both network faults and degraded performance may require intervention by
network operations, and the line between fault and performance management is
blurred in practice. In this chapter, we refer to a network condition that may require the intervention of network operations as a “network event.” The events in
question may have a diverse set of causes – including hardware failures, software
bugs, external factors (e.g., flash crowds, outages in peer networks) or combinations of these. The resulting impact also varies drastically, ranging from outages during which impacted customers have no connectivity for potentially lengthy periods of time (known as “hard outages”), to loss of network capacity (where network elements or capacity are out of service, but traffic can be re-routed around them and thus customers experience little to no impact), to performance degradation (packet loss, or increased delay and/or jitter).
In addition to rapidly reacting to issues that arise, daily network operation also
incorporates taking planned actions to proactively ensure that the network continues to deliver high service levels, and to evolve network services and technologies. We refer to such scheduled activities as planned maintenance. Planned
maintenance includes a wide variety of activities, ranging from changing a fan filter in a router to hardware or software upgrades, preventative maintenance, or rearranging connections in response to external factors, including even highway
maintenance being carried out where cables are laid. Such activities are usually
performed at a specific, scheduled time, typically during non-business hours when
network utilization is low and the impact of the maintenance activity on customers
can be minimized.
This chapter covers both fault and performance management, and planned
maintenance. Section 2 focuses on fault and performance management – detecting,
troubleshooting and repairing network faults and performance impairments. Section 3 examines how process automation is incorporated in fault and performance
management to automate many of the tasks that were originally executed by humans. Process automation is the key ingredient that enables a relatively small Operations group to manage a rapidly expanding number of network elements, customer ports and complexity. Section 4 discusses tracking and managing network
availability and performance over time, looking across larger numbers of network
events to identify opportunities for performance improvements. Section 5 then focuses on planned maintenance. Finally, in Section 6 we discuss opportunities for
new innovations and technology enhancements, including new areas for Research.
We conclude in Section 7 with a set of “best practices”.
2. Real-time Fault and Performance Management
Fault and performance management comprise a large and complex set of functions
that are required to support the day-to-day operation of a network. As mentioned
in the previous section, network events can be divided into two broad categories:
faults and performance impairments. The term fault is used to refer to a “hard”
failure – for example, when an interface goes from being operational (up) to failed
(down). Performance events denote situations where a network element is operating with degraded performance – for example, a router interface is dropping packets, or when there is an undesirably high loss and/or delay being experienced
along a given route. We use the term “event management” to generically define
the set of functions that detect, isolate and repair network malfunctions – covering
both faults and performance events.
The primary goal of event management is to rapidly mitigate issues as they
arise so that any resulting customer impact can be minimized. However, achieving
this within the complex environment of ISP networks today requires a carefully
architected and scalable event management architecture. Figure 1 presents a simplified view of such an architecture. The goal of the architecture is to reliably detect events and alarm on them, so that action can be taken to mitigate the issue
within the network. However, as anyone who has experience with large networks
can attest, this is in itself a challenging problem.
Figure 1. Simplified event management architecture
An extensive and diverse network instrumentation layer provides the foundation of the event management architecture. The instrumentation layer, illustrated
in Figure 1, collects network data from both network elements (routers, switches,
lower layer equipment) and network monitoring systems designed to monitor network performance from vantage points across the network. The measurements
collected are critical for a wide range of different purposes, including customer
billing, capacity planning, event detection and alarming, and troubleshooting.
These latter two functions are critical parts of daily network operations. Here, the goal is to ensure that the wide range of potential fault and performance “events” are reliably detected and that there are sufficient measurements
available to troubleshoot network events.
Both the routers and the collectors contained in the instrumentation layer generate alarms to notify the central event management system when an event of interest is detected. Events detected include faults (e.g., link or router down) and
performance impairments (such as excessive network loss or delay). These performance alarms are created by looking for anomalous, “bad” conditions amongst
the mounds of performance data and logs collected within the instrumentation layer.
Given the diversity of the instrumentation layer, any given network event will
likely be detected by multiple monitoring points, and even as multiple symptoms
within an individual network element. Thus, it is likely that a single event will result in a deluge of alarms, which would swamp Operations personnel trying to investigate. Significant automation is thus introduced in the event management system to suppress and correlate multiple alarms associated with the same network
event to create a single alarm indicative of what is often referred to within the
fault management literature as the “root cause” of the event. This correlated alarm
is used to trigger the creation of a ticket in the network ticketing system – these
tickets are used to notify Operations personnel of the event.
Once notified of an issue, Operations personnel are responsible for troubleshooting and repair. Troubleshooting and repair have two primary goals: (1) restoring customer service and performance and (2) fully resolving the underlying
issue and returning network resources into service. These two goals may or may
not be distinct – where redundancy exists, the network is typically designed to automatically re-route around failed elements, thus negating the need for manual intervention in the first step. Troubleshooting and repair can then instead focus on
returning the failed network resources into service so that there is sufficient capacity available to absorb future outages. In other situations, Operations needs to resolve the outage to restore the customer’s service. Rapid action from Operations
personnel is then imperative – until Operations can identify what is causing the
problem and repair it, the customer may be out of service or experiencing degraded performance. This typically occurs at the edge of a network, where the customer connects to the ISP and redundancy may be limited. For example, if the nonredundant line card to which a customer connects fails, the customer may be out
of service until a replacement line card can be installed.
We now discuss the above steps associated with event management in more detail.
2.1. Instrumentation Layer, Event Detection and Alarms
The foundation of an event management system is the reliable detection and notification of network faults and performance impairments. We thus start by considering how these network events are detected, and the resulting alarm notifications
that are generated.
The primary goal is to ensure that the event management system is reliably informed of network impairments. But the issues of interest may display a wide
range of different symptoms, and may occur anywhere across the vast network(s)
being monitored. Depending on the impairments at hand, it may be that the routers
themselves observe the symptoms and can report them accordingly. However, limitations in router monitoring capabilities, challenges in adding new monitoring capabilities without major router upgrades and the need for independent network
monitoring have driven a wide range of external monitoring systems into the network. These systems have the added benefit of being able to obtain different perspectives on the network compared with the individual routers. For example, end
to end performance monitoring systems obtain a network-wide view of performance, which is not readily captured from within a single router.
Thus, the reliable detection of network events requires a comprehensive instrumentation layer to collect network measurements. Chapters 10 and 11 discussed the wide range of network monitoring capabilities available in large-scale
operational IP/MPLS networks. These are incorporated within the instrumentation
layer to support a diverse range of applications, ranging from network engineering (e.g., capacity planning) and customer billing to fault and performance management. Within fault and performance management, these measurements support real-time alarm generation, as well as other tasks such as troubleshooting and postmortem analysis, which are discussed later in this chapter.
As illustrated in Figure 1, the instrumentation layer collects measurements from
both network elements (e.g., routers) and from external monitoring devices, such
as route monitors and application / end to end performance measurement tools.
Although logically depicted as a single layer, the instrumentation layer actually
consists of multiple different collectors, each focused on a single monitoring capability. These collectors are discussed in more detail in the remainder of this section.
2.1.1. Router Fault and Event Notifications
Network routers themselves have an ideal vantage point for detecting faults local to them. They are privy to events which occur on inter-router links which terminate on them, and also to events which occur inside the routers themselves.
They can thus identify all sorts of hardware (link, line card and router chassis failures) and software issues (including software protocol crashes and routing protocol failures). Routers themselves detect events and notify the event management
systems – one can view the instrumentation layer for these events as residing directly within the routers themselves, as opposed to in an external instrumentation layer.
In addition to creating alarms about events detected in the router, routers also
write log messages which describe a wide range of events observed on the router.
These are known as syslogs; they are akin to the syslogs created on servers. These
syslogs can be collected at an external syslog collector as part of the instrumentation layer. They serve multiple purposes, including the generation of events which
feed the event management system, and supporting the troubleshooting of network
events.
The syslog protocol [1] is used to deliver log messages from network elements
to a central repository, namely the syslog collector depicted in Figure 1. Syslog
messages collected from elements cover a diverse range of conditions observed
within the element, from link and protocol-related state changes (down, up), environment measurements (voltage, temperature), and warning messages (e.g., denoting when customers send more routes than the router is configured to allow).
Some of these events relate to conditions which need to be alarmed on, whilst others provide useful information when it comes to troubleshooting an event. Syslog
messages are basically free-form text, with some structure (e.g., indicating
date/time of event, location and event priority). The form and format of the syslog
messages vary between router vendors. Figure 2 illustrates some example syslog
messages, taken from Cisco routers – details regarding Cisco message formats can
be found in [2].
Mar 15 00:00:06 ROUTER_A 627423: Mar 15 00:00:00.554: %CI-6-ENVNORMAL: +24 Voltage measured at 24.63
Mar 15 00:00:06 ROUTER_B 289883: Mar 15 00:00:00.759: %LINK-3-UPDOWN: Interface Serial13/0.6/16:0, changed state to up
Mar 15 00:00:13 ROUTER_C 2267435: Mar 15 00:00:12.473: %CONTROLLER-5-UPDOWN: Controller T3 10/1/1 T1 18, changed state to DOWN
Mar 15 00:00:06 ROUTER_D 852136: Mar 15 00:00:00.155: %PIM-6-INVALID_RP_JOIN: VRF 13979:28858: Received (*, 224.0.1.40) Join from 1.2.3.4 for invalid RP 5.6.7.8
Mar 15 00:00:07 ROUTER_E 801790: -Process= "PIM Process", ipl= 0, pid=218
Mar 15 00:25:26 ROUTER_Z bay0007:SUN MAR 15 00:25:24 2009 [03004EFF] MINOR:snmp-traps:Module in bay 4 slot 6, temp 65 deg C at or above minor threshold of 65 degC.
Figure 2. Example (normalized) syslog messages
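
To be useful for alarming and troubleshooting, these free-form messages are typically parsed into structured fields. The fragment below is a minimal sketch of such parsing, assuming the Cisco-style %FACILITY-SEVERITY-MNEMONIC convention illustrated in Figure 2; a production collector would recognize many more vendor-specific formats.

import re

# Matches the Cisco-style "%FACILITY-SEVERITY-MNEMONIC: message" portion of a
# syslog line, as illustrated in Figure 2.
CISCO_MSG = re.compile(
    r"%(?P<facility>\w+)-(?P<severity>\d)-(?P<mnemonic>\w+):\s*(?P<text>.*)")

def parse_syslog(line):
    """Return facility, severity, mnemonic and free text as a dict, or None."""
    match = CISCO_MSG.search(line)
    return match.groupdict() if match else None

line = ("Mar 15 00:00:13 ROUTER_C 2267435: Mar 15 00:00:12.473: "
        "%CONTROLLER-5-UPDOWN: Controller T3 10/1/1 T1 18, changed state to DOWN")
print(parse_syslog(line))
# {'facility': 'CONTROLLER', 'severity': '5', 'mnemonic': 'UPDOWN',
#  'text': 'Controller T3 10/1/1 T1 18, changed state to DOWN'}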
2.1.2. External Fault Detection and Route Monitoring
Router fault detection mechanisms are complemented by other mechanisms to
identify issues that are not detected and/or reported by the routers either because
they don’t have visibility into the event, their detection mechanisms fail to detect
it (maybe because of a software bug), or because they are unable to report the issue to the relevant parties. For example, basic periodic ICMP “ping” tests are used
to validate the liveness of routers and their interfaces – if a router or interface becomes unresponsive to ICMP pings, then this is reason for concern and an alarm
should be generated.
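
As a minimal sketch of such a liveness test (the target addresses below are placeholders, and the flags shown are those of the common Linux iputils ping), the following loop pings each address and flags any that stop responding:

import subprocess

TARGETS = ["192.0.2.1", "192.0.2.2"]  # placeholder router / interface addresses

def is_alive(address, count=3, timeout_sec=1):
    """Return True if the address answers at least one ICMP echo request."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout_sec), address],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

for target in TARGETS:
    if not is_alive(target):
        # In practice this would raise an alarm to the event management system.
        print("ALARM: %s unresponsive to ICMP ping" % target)

A real monitoring system would, of course, schedule such tests periodically and require several consecutive failures before alarming, to avoid reacting to a single lost probe.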
Route monitors, as discussed in Chapter 11, also provide visibility into control
plane activity; activity that may not always be reported by the routers. IGP monitors, such as OSPFmon [2], will learn about link and router up/down events and
can be used to complement the same events reported by routers. However, route
monitors extend beyond simple detection of link up / down events, and can provide information about weight changes which also impact traffic routing. Even
further, the route(s) between any given source / destination can be inferred using
routing data collected from route monitors. This information is vital when trying
to understand events which impacted a given traffic flow.
2.1.3. Router-Reported Performance Measurements
Although faults have traditionally been the primary focus of event management
systems, performance events can also cause significant customer distress and thus
must be addressed. The goal here is to identify and report when network performance deviates from desired operating regions and requires investigation. But detecting performance issues is really quite different from detecting faults. Performance statistics are continually collected from the network; from these
measurements we can then determine when performance moves outside the desired operating regions.
As discussed in Chapter 10, routers track a wide range of different performance
parameters – such as counts of the number of packets and bytes flowing on each
router interface, different types of interface error counts (including buffer overflow, mal-formatted packets, CRC check violations) and CPU and memory utilization on the central router CPU and its line cards. These performance parameters
are stored in Management Information Bases (MIBs), which are accessed via the Simple Network Management Protocol (SNMP). SNMP [3] is the Internet community’s de facto standard management protocol. It was defined by the IETF, and is used to convey management information between software agents on managed devices (e.g., routers) and the managing systems (e.g., the event management systems in Figure 1).
External systems poll router MIBs on a regular basis, such as every five
minutes. Thus, with a five minute polling interval, the CPU measurements would
represent the average CPU over a five minute interval; the packet counts would
represent the total number of packets that have flowed on the interface in a given
five minute interval.
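
As a concrete illustration of how such polled counters are used, the sketch below (names and numbers are illustrative only) converts two successive samples of an interface’s octet counter, taken five minutes apart, into an average utilization figure, allowing for the wrap-around that SNMP Counter64 values eventually undergo:

# Sketch: derive average link utilization from two successive polls of an
# interface octet counter. All names and values are illustrative.
COUNTER64_MAX = 2 ** 64  # SNMP Counter64 values wrap modulo 2^64

def delta_octets(prev_count, curr_count):
    """Octets transferred between polls, allowing for a single counter wrap."""
    if curr_count >= prev_count:
        return curr_count - prev_count
    return (COUNTER64_MAX - prev_count) + curr_count

def avg_utilization(prev_count, curr_count, interval_sec, link_speed_bps):
    """Average utilization (0.0 to 1.0) over the polling interval."""
    bits = delta_octets(prev_count, curr_count) * 8
    return bits / (interval_sec * link_speed_bps)

# Two polls taken five minutes apart on a 10 Gb/s link.
print(avg_utilization(prev_count=1_000_000_000, curr_count=76_000_000_000,
                      interval_sec=300, link_speed_bps=10_000_000_000))  # 0.2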
The SNMP information collected from routers is used for a wide variety of
purposes, including customer billing, detecting anomalous network conditions
which could be impacting customers (congestion, excessively high CPU, router
memory leaks) and troubleshooting network events.
2.1.4. Traffic Measurements
SNMP measurements provide aggregate statistics about link loads – e.g.,
counts of total number of bytes or equivalently link utilization over a fixed time
interval (e.g., five minutes). However, SNMP measurements provide little insight
into how this traffic is distributed across different applications or different source /
destination pairs. Instead, as discussed in Chapter 10, Netflow and deep packet inspection (DPI) are used to obtain much more detailed measurements of how traffic
is distributed across network links and routes. Few if any network management alarms are created today based on either Netflow or DPI measurements. However, both measurement capabilities are critical in troubleshooting various network conditions. DPI in particular can be used to obtain unique visibility into the traffic carried on a link, which is useful in trying to understand what traffic may be related to a given network issue and how. For example, DPI could be used to identify
malformed packets, or identify what traffic is destined to an overloaded router
CPU.
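
As a simple illustration of the kind of summarization performed during troubleshooting, the sketch below groups flow records by destination to highlight where traffic is concentrated; the record format here is a deliberately simplified stand-in, not the actual Netflow export format.

from collections import Counter

# Simplified flow records: (src, dst, dst_port, bytes).
flows = [
    ("198.51.100.7", "203.0.113.1", 179, 4_200),
    ("198.51.100.9", "203.0.113.1", 179, 3_900),
    ("192.0.2.55",   "203.0.113.9", 80,  1_200_000),
]

bytes_by_dst = Counter()
for src, dst, port, nbytes in flows:
    bytes_by_dst[dst] += nbytes

# Top destinations by volume: for example, is an unexpected amount of traffic
# headed towards a router's own control address?
for dst, total in bytes_by_dst.most_common(2):
    print(dst, total)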
2.1.5. Application and Service Monitoring
Although monitoring the state and performance of individual network elements
is critical to managing the overall health of the network, it is not in itself sufficient
for understanding the performance as perceived by customers. It is also imperative
that we are able to monitor and detect events based on the end to end service performance.
End to end measurements – even as simple as that achieved by sending test
traffic across the network from one edge of the network to another – provide the
closest representation of what customers experience as can be measured from
within the ISP network. End to end measurements were discussed in more detail in
Chapter 10. In the context of fault and performance management, they are used to
identify anomalous network conditions, such as when there is excessive end to end
packet loss, delay and/or jitter. These events are reported to the event management
system via alarms, and used to trigger further investigation.
As discussed in Chapter 10, performance monitoring can be achieved using either active or passive measurements. Active measurements send test probes (traffic) across the network, measuring statistics related to the network and/or service
performance. In contrast, passive measurements examine the performance of customer traffic, which is not always easily achieved on an end to end basis unless the
ISP has access to the customer’s end device. End to end performance measurements can be used both to understand the impact of known network events, and to
identify events which may not have been detected via other means. When known
events, such as congestion, occur, end to end performance measurements provide
direct performance measures which present insight into the event’s impact on customers. But end to end measurements also provide an overall test of the network
health – and can be used to identify the rare but potentially critical issues which
the network may have failed to detect. These are known as silent failures, and
have historically been an artifact of immature router technologies – where the
routers simply fail to detect issues that they should have detected. For example,
consider an internal router fault (e.g., corrupted memory on a line card) which
causes a link to simply drop all traffic. If the router fails to detect that this is occurring, then it will fail to both re-route the traffic away from the failed link, and
to report this as a fault condition. Thus, traffic will continue to be sent to a failed
interface and be dropped – a condition known as black-holing. End to end active
measurements (test probes) provide a means to proactively detect these issues; the
test probes will be dropped along with the customer traffic – this can be detected and alarmed on. Thus, the event can be detected even when the network elements fail to report it, and (hopefully!) before customers complain.
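
A bare-bones active probe might look like the sketch below. It assumes a UDP echo responder is listening at the far end (the address is a placeholder); it sends sequenced probes and reports the observed loss and average round-trip delay, which could then be compared against alarm thresholds.

import socket
import struct
import time

TARGET = ("192.0.2.10", 7)   # placeholder address of a UDP echo responder
PROBES, TIMEOUT = 100, 1.0

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(TIMEOUT)

received, rtts = 0, []
for seq in range(PROBES):
    sent_at = time.time()
    sock.sendto(struct.pack("!I", seq), TARGET)
    try:
        data, _ = sock.recvfrom(64)
        if struct.unpack("!I", data[:4])[0] == seq:  # match reply to probe
            received += 1
            rtts.append(time.time() - sent_at)
    except socket.timeout:
        pass                                         # counted as loss

loss = 1 - received / PROBES
print("loss = %.1f%%" % (loss * 100))
if rtts:
    print("average RTT = %.1f ms" % (sum(rtts) / len(rtts) * 1000))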
Simple test probes sent across the network are extremely effective at auditing
the integrity of the network – identifying when there are loss or delay issues being
experienced, and even detecting failures not identified by the network elements.
However, extrapolating from simple loss and delay measures to understand the
impact of a network event on any given network application is most often an extremely complex if not impossible task, involving the intimate details of each application in question.
Ideally, we would like to be able to directly measure the performance of each
and every application which operates across a network. This is clearly an impossible task in networks which support a plethora of different services and applications. But if a network is critical to supporting a specific application – such as
IPTV – then that application also needs to be monitored across the network, and
appropriate mechanisms be available to detect issues. After all, the goal is to ensure that the service is operating correctly – ascertaining this requires direct monitoring of the service of interest.
2.1.6. Alarms
The instrumentation layer thus provides network operators with an immense
volume of network measurements. But how do we transform this data to identify
network issues which require investigation and action?
Let’s start by identifying the events that we are interested in detecting. We
clearly need to detect faults – conditions which may be causing customer outages
or taking critical network resources out of service. We also aim to detect performance issues which would be causing degraded customer experiences. And finally, we wish to detect network element health concerns, which are either impacting
customers or are risking customer impact.
The majority of faults are relatively easy to detect – we need to be able to detect when network elements or components are out of service (down).
It gets a little more complicated when it comes to defining performance impairments of interest. If we are too sensitive in the issues identified – for example,
reporting very minor, short conditions – then Operations personnel risk expending
tremendous effort attempting to troubleshoot meaningless events, potentially missing events that are of great significance amongst all the noise and false alarms.
Short term impairments, such as packet losses or CPU anomalies, are often too
short to even enable real-time investigation – the event has more often than not
cleared before Operations personnel could even be informed about it, let alone investigate. And short term events are expected – no matter how careful network
planning and design is, occasional short term packet losses, for example, will occur.
Instead, we need to focus on identifying events which are sufficiently large and
persist for a significant period of time. Simple thresholding is typically used to
achieve this – an event is identified when a parameter of interest exceeds a predefined value for a specified period of time. For example, an event may be declared when packet loss exceeds 0.3% over 3 consecutive polling periods across
an IP/MPLS backbone network. However, note that more complicated signatures
can and are used to detect some conditions. Section 4.1 discusses this in some
more detail, including how appropriate thresholds can be identified.
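
A minimal sketch of such a persistence threshold is shown below, using the loss example from the text (0.3% sustained over three consecutive polling periods); in a deployed system the declared event would be emitted as a SET alarm, with a corresponding CLEAR once the condition abates.

def detect_sustained_loss(loss_series, threshold=0.003, periods=3):
    """Yield the index at which loss has exceeded the threshold for the
    required number of consecutive polling periods (i.e., declare an event)."""
    consecutive = 0
    for i, loss in enumerate(loss_series):
        consecutive = consecutive + 1 if loss > threshold else 0
        if consecutive == periods:
            yield i

# Five-minute loss samples; an event is declared at the third consecutive
# sample above 0.3%.
samples = [0.000, 0.004, 0.005, 0.006, 0.001]
print(list(detect_sustained_loss(samples)))  # [3]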
Once we have detected an issue – whether it is a fault or performance impairment – our goal is to report it so that appropriate action can be taken as needed.
Event notification is realized through the generation of an alarm which is sent to
the event management platform as illustrated in Figure 1. An alarm is a symptom
of a fault or performance degradation. Alarms themselves have a life span – they
start with a SET alarm and typically end through an explicit CLEAR. Thus, explicit notifications of both the start of an event (the SET) and the end of an event
(CLEAR) need to be conveyed to the event management system.
SNMP traps or informs (depending on the SNMP version) are used by routers
to generate alarms to the event management system. Given the predominance of
SNMP in IP/MPLS, this same mechanism is also often used between other collectors in the instrumentation layer and the event management layer. Traps / informs
represent asynchronous reports of events and are used to notify network management systems of state changes, such as links and protocols going down or coming
up. For example, if a link fails, then the routers at either end of the link will send
SNMP traps to the network management system collecting these. These traps typically include information about the source of the event, details of the type of
event, parameters and priority. Figure 3 depicts logs detailing traps collected from
a router. The logs report the time and location of the event, the type of event
which has occurred and the SNMP MIB Object ID (OID). In this particular case,
we are observing the symptoms associated with a line card failure, where the line
card constitutes a number of physical and logical ports. Each port is individually
reported as having failed. This includes both the physical interfaces (denoted in
the Cisco router example below as T3 4/0/0, T3 4/0/1, Serial4/1/0/3:0, Serial4/1/0/4:0, Serial4/1/0/5:0, Serial4/1/0/23:0, Serial4/1/0/24:0, Serial4/1/0/25:0)
and the logical interfaces (denoted in the example below as Multilink6164 and
Multilink6168). In the example shown in Figure 3, the OID that denotes the link
down event is “4.1.3.6.1.6.3.1.1.5.3.1.3.6.1.4.1.9.1.46 0” – and can be seen on
each event in Figure 3.
1238081113 10 Thu Feb 21 03:10:07 2009 Router_A - Cisco LinkDown Trap on Interface T3 4/0/0;4.1.3.6.1.6.3.1.1.5.3.1.3.6.1.4.1.9.1.46 0
1238081117 10 Thu Mar 26 15:25:17 2009 Router_A - Cisco LinkDown Trap on Interface T3 4/0/1;4.1.3.6.1.6.3.1.1.5.3.1.3.6.1.4.1.9.1.46 0
1238081118 10 Thu Mar 26 15:25:18 2009 Router_A - Cisco LinkDown Trap on Interface Serial4/1/0/3:0;4.1.3.6.1.6.3.1.1.5.3.1.3.6.1.4.1.9.1.46 0
1238081118 10 Thu Mar 26 15:25:18 2009 Router_A - Cisco LinkDown Trap on Interface Serial4/1/0/4:0;4.1.3.6.1.6.3.1.1.5.3.1.3.6.1.4.1.9.1.46 0
1238081119 10 Thu Mar 26 15:25:19 2009 Router_A - Cisco LinkDown Trap on Interface Serial4/1/0/5:0;4.1.3.6.1.6.3.1.1.5.3.1.3.6.1.4.1.9.1.46 0
1238081124 10 Thu Mar 26 15:25:24 2009 Router_A - Cisco LinkDown Trap on Interface Serial4/1/0/23:0;4.1.3.6.1.6.3.1.1.5.3.1.3.6.1.4.1.9.1.46 0
1238081124 10 Thu Mar 26 15:25:24 2009 Router_A - Cisco LinkDown Trap on Interface Serial4/1/0/24:0;4.1.3.6.1.6.3.1.1.5.3.1.3.6.1.4.1.9.1.46 0
1238081125 10 Thu Mar 26 15:25:25 2009 Router_A - Cisco LinkDown Trap on Interface Serial4/1/0/25:0;4.1.3.6.1.6.3.1.1.5.3.1.3.6.1.4.1.9.1.46 0
1238081125 10 Thu Mar 26 15:25:25 2009 Router_A - Cisco LinkDown Trap on Interface Multilink6164;4.1.3.6.1.6.3.1.1.5.3.1.3.6.1.4.1.9.1.46 0
1238081125 10 Thu Mar 26 15:25:25 2009 Router_A - Cisco LinkDown Trap on Interface Multilink6168;4.1.3.6.1.6.3.1.1.5.3.1.3.6.1.4.1.9.1.46 0
Figure 3. Example logs from (anonymized) SNMP traps
2.2. Event Management System
The preceding section discussed the vast monitoring infrastructure deployed in
modern IP/MPLS networks and used to create alarms. However, the number of
alarms created by such an extensive monitoring infrastructure would simply overwhelm a network operator – completely obscuring the real event in an avalanche
of alarms. Manually weeding out the noise from the true root cause could take
hours or even longer for a single event – time during which critical customers may
be unable to effectively communicate, watch TV and/or access the Internet. This is
simply not an acceptable mode of operation.
By way of a simple example, consider a fiber cut impacting a link between two
adjacent routers. The failure of this link (the fault) will be observed at the physical
layer (e.g., SONET, Ethernet) on the routers at both ends of the link. But it will also be observed in the different protocols running over that link – for example, in
PPP, in the intra-domain routing protocol (OSPF or IS-IS), in LDP and in PIM (if
multi-cast is enabled) and potentially even through BGP. These will all be separately reported via syslog messages; failure notifications (traps) would be generated by the routers at both ends of the link in question to indicate that the link is
down. The same event would be captured by route monitors (in this case, by the
intra-domain route protocol monitors [2]). Network management systems monitoring the lower layer (e.g., layer one / physical layer) technologies will also detect the event, and will alarm. And finally – should congestion result from this outage – performance monitoring tools may report packet losses observed from the routers,
end to end performance degradations may be identified via end to end performance monitoring and application monitors would presumably start alarming if
customer services are impacted. Thus, even a single fiber cut can result in a plethora of alarms. The last thing we need is for the network operator to be manually
delving into each one of these to identify the one that falls at the heart of the issue
– in this case, the alarm that denotes a loss of incoming signal at the physical layer.
Instead, automated event management systems as depicted in Figure 1 are used
to automatically identify both an incident’s root cause and its impact from
amongst the flood of incoming alarms. The key to achieving this is alarm correlation – taking the incoming flood of alarms, automatically identifying which alarms
are associated with a common event, and then analyzing them to identify the most
likely explanation and any relevant impact. The resulting correlated alarms are input to the ticket creation process; tickets are used to notify Operations personnel
about an issue requiring their attention. In the above example, the root cause can
be crisply identified as being a physical layer link failure; the impact of interest to
the network operator being the extent to which any service and/or network performance degradation is impacting customers, and how many and which customers are impacted. The greater the impact of an event, the more urgent is the need
for the network operator to instigate repair.
2.2.1. Managing alarm floods: alarm correlation and root cause analysis
Not all alarms received by an event management system need to be correlated
and translated into tickets to trigger human investigation. For example, alarms corresponding to known maintenance events can be logged and discarded – there is
no need to investigate these as they are expected side effects of the planned activities. Similarly, duplicate alarms can be discarded – if an alarm is received multiple
times for the same condition, then only one of these alarms needs to be passed on
for further analysis. Thus, as alarms are received by the event management system, they are filtered to identify those which should be forwarded for further correlation.
Similarly, one off events which are very short in nature – effectively “naturally” resolving themselves – can often be discarded as there is no action to be taken
by the time that the alarm makes it for further analysis. Thus, if an alarm SET is
followed almost immediately by an alarm CLEAR, then the event management
system may choose to discard the event entirely. Such events can occur, for example, due to someone accidentally touching a fiber in an office, causing a very rapid, one-off impairment. There is little point in expending significant resources investigating this event – it has disappeared and is not something that someone in
Operations can do anything about – unless it continues to happen. Short conditions
which continually re-occur are classified as chronics, and do require investigation.
Thus, if an event keeps occurring, then a correlated alarm and corresponding ticket
is created so that the incident can be properly investigated.
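
A rough sketch of this filtering logic is given below; the time window and chronic count are illustrative only, and a production system would also forward SET alarms that remain active once a short soak window expires.

from collections import defaultdict

def filter_alarms(events, flap_sec=60, chronic_count=5):
    """events: time-ordered list of (key, kind, timestamp) tuples, where kind
    is "SET" or "CLEAR". Ignores duplicate SETs, drops one-off SET/CLEAR pairs
    that clear almost immediately, and escalates keys that keep flapping."""
    active = {}               # alarm key -> timestamp of the pending SET
    flaps = defaultdict(int)  # alarm key -> number of short-lived occurrences
    forwarded = []
    for key, kind, ts in events:
        if kind == "SET":
            if key not in active:     # duplicate SET for an active condition
                active[key] = ts
        elif kind == "CLEAR" and key in active:
            duration = ts - active.pop(key)
            if duration >= flap_sec:
                forwarded.append((key, "EVENT", ts))    # long-lived: correlate and ticket
            else:
                flaps[key] += 1
                if flaps[key] >= chronic_count:
                    forwarded.append((key, "CHRONIC", ts))
    return forwarded

# A link that flaps briefly five times is escalated as a chronic condition.
events = []
for t in range(0, 500, 100):
    events += [("linkX", "SET", t), ("linkX", "CLEAR", t + 5)]
print(filter_alarms(events))  # [('linkX', 'CHRONIC', 405)]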
Alarm correlation, alternatively known as root cause analysis, follows the
alarm filtering process. Put simply, alarm correlation examines the incoming
stream of network alarms to identify those which occur at approximately the same
time, and can be associated with a common explanation. These are then grouped
together as being a correlated alarm. The goal here is to identify the “root cause”
alarm – effectively equating to identifying the type and location of the event being
reported.
There are numerous commercial products available that implement alarm correlation for IP/MPLS networks. The basic idea behind these tools is the notion of
causal relationships – understanding what underlying events cause which symptoms. Building this model requires detailed real-time knowledge and models of
device and network behavior, and of network topology. Given that network topology varies over time, this must be automatically derived through auto-discovery.
However, how the products implement alarm correlations varies. Some tools use
defined rules or policies to perform the correlation. The rules are defined as “if
condition then conclusion” – conditions relate to observed events and information
about the current network state, whilst the conclusion would denote the root cause
event and associated actions (e.g., ticket creation). These rules capture domain
expertise from humans and can become extremely complex.
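
A toy illustration of such "if condition then conclusion" rules is sketched below; the rules, alarm fields and conclusions are invented for illustration, and commercial correlation engines are far richer.

# Each rule is a (condition, conclusion) pair: the condition inspects the set
# of alarms gathered in a correlation window, the conclusion names the
# inferred root cause.
RULES = [
    (lambda alarms: any(a["type"] == "SONET_LOS" for a in alarms),
     "physical-layer failure (loss of signal)"),
    (lambda alarms: alarms
                    and all(a["type"] == "LINK_DOWN" for a in alarms)
                    and len({a["card"] for a in alarms}) == 1,
     "line card failure"),
]

def correlate(alarms):
    """Return the conclusion of the first rule whose condition matches."""
    for condition, conclusion in RULES:
        if condition(alarms):
            return conclusion
    return "unclassified: forward raw alarms"

alarms = [{"type": "LINK_DOWN", "card": "Serial4"} for _ in range(10)]
print(correlate(alarms))  # line card failure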
The Codebook approach [4; 5] applies techniques from graph and coding theory to
alarm correlation. A codebook is pre-calculated based on network topology and
models of network elements and network behavior to model signatures of different possible network events. The alarms received by an event management system implemented using codebooks are referred to as symptoms. These symptoms
are compared against the set of different known signatures; the signature that best
matches the observed symptoms is identified as the root cause.
Event signatures are created by determining which unique combination of
alarms would be observed for each possible failure. This can be inferred by looking at the network topology and how components within a router and between
routers are connected. A model of each router is (automatically) created – denoting the set of line cards deployed within a router, the set of physical ports in existence on a line card and the set of logical interfaces which terminate on a given
physical port. The network topology then indicates which interfaces on which line
cards on which routers are connected together to create links. In large IP networks,
such information is automatically discovered by crawling through the network.
Once the set of routers in the network is known (either through external systems or
via auto-discovery), SNMP MIBs can be walked to identify all of the interfaces on
the router, and the relationships between interfaces (for example, which interfaces
are on which port and which ports are on which line card).
Figure 4. Signature identification
Given knowledge of topology and router structure, failure signatures can be
identified by examining what symptoms would be observed upon each individual
component failure. To illustrate this, let us consider the simple four node scenario
depicted in Figure 4. In this example, we consider what would be observed if line
card 1 failed on router 2. Such a failure would be observed on all of the four interfaces contained within the first line card on router 2 (card 1), and also on interface
I1 on port P1 and interface I1 on port P2 on card 1 (C1) of router 1, and on interface I2 on port P1 and interface I2 on port P2 on card 1 (C1) of router 3. These are
all interfaces which are directly connected to interfaces on the failed line card on
router 2. If the failure of line card 1 on router 2 were to happen, then alarms would
be generated by the three routers involved and sent to the event management system. This combination of symptoms thus represents the signature of the line card
failure on router 2. This signature, or combination of alarms, is unique – if this is
observed, we conclude that the issue (root cause) is the failure of line card 1 on
router 2.
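
The codebook idea can be sketched in a few lines. In the fragment below, the topology, alarm names and scoring are a simplified stand-in for the Figure 4 scenario: each candidate root cause maps to the set of alarms (symptoms) it would produce, and the observed symptoms are matched to the closest signature.

# Pre-computed codebook: candidate root cause -> expected symptoms (alarm
# identifiers), derived offline from topology and element models.
CODEBOOK = {
    "R2 card C1 failure": {"R2:C1:P1:I1 down", "R2:C1:P1:I2 down",
                           "R2:C1:P2:I1 down", "R2:C1:P2:I2 down",
                           "R1:C1:P1:I1 down", "R1:C1:P2:I1 down",
                           "R3:C1:P1:I2 down", "R3:C1:P2:I2 down"},
    "R1-R2 link failure": {"R2:C1:P1:I1 down", "R1:C1:P1:I1 down"},
}

def best_match(observed):
    """Return the root cause whose signature best matches the observed symptoms."""
    def score(signature):
        # symptoms explained, penalized by expected symptoms that are missing
        return len(observed & signature) - len(signature - observed)
    return max(CODEBOOK, key=lambda cause: score(CODEBOOK[cause]))

observed = {"R2:C1:P1:I1 down", "R2:C1:P1:I2 down", "R2:C1:P2:I1 down",
            "R2:C1:P2:I2 down", "R1:C1:P1:I1 down", "R1:C1:P2:I1 down",
            "R3:C1:P1:I2 down", "R3:C1:P2:I2 down"}
print(best_match(observed))  # R2 card C1 failure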
Let’s now return to the SNMP traps provided in Figure 3. This example depicts the traps sent by router Router_A to the event management system upon detecting the failure of a set of interfaces. Traps were sent corresponding to eight
physical interfaces and two logical (multilink) interfaces – all of these being interfaces which directly connect to external customers. The ISP has visibility into only the local ISP interfaces (the ones reported in the traps in Figure 3), and does not
receive alarms from the remote ends of the links – the ends within customer premises. Thus, the symptoms observed for the given failure mode are purely those
coming from the one ISP router – nothing from the remote ends of the links. This
set of received alarms is compared with the set of available signatures; comparison
will provide a match between this set of alarms (symptoms) and the signature associated with the failure of a line card (Serial4) on Router_A. The resulting correlated alarm output from the event management system identifies the line card failure, and incorporates the associated symptoms of the individual interface failures.
Figure 5 illustrates an example correlated alarm, taken from a production event
management system. This particular example reports that 10 out of a total of 10 configured interfaces have failed on the line card Serial4 – indicating a slot (line card) failure. This is a critical alarm, indicating that immediate
attention is required.
03/26/2009 15:25:28 RC500_170: Incoming Alarm:
Router_A:Interfaces|Slot Threshold Alarm: Router_A:Serial4 Down. There
are 10 out of 10 interfaces down. Down Interface List: Serial4/0/0, Serial4/0/1, Serial4/1/0/3:0, Serial4/1/0/4:0, Serial4/1/0/5:0, Serial4/1/0/23:0,
Serial10/1/0/24:0, Serial10/1/0/24:0, Multilink6164, Multilink6168|Critical
Figure 5. Correlated alarm associated with the event in Figure 3.
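
The kind of roll-up shown in Figure 5 can be approximated by counting reported-down interfaces per slot against the configured inventory. The sketch below makes some simplifying assumptions (the inventory numbers are invented, and the slot is extracted naively from the interface name; a real system would use its inventory to map both physical and logical interfaces to the owning card):

from collections import defaultdict

# Hypothetical inventory: number of configured interfaces per (router, slot).
CONFIGURED = {("Router_A", "Serial4"): 10}

def slot_of(interface):
    """Naive slot extraction, e.g. "Serial4/1/0/3:0" -> "Serial4" (assumption)."""
    return interface.split("/")[0]

def slot_alarms(linkdown_traps):
    """linkdown_traps: list of (router, interface) pairs. Raise a critical slot
    alarm when every configured interface on a slot is reported down."""
    down = defaultdict(set)
    for router, interface in linkdown_traps:
        down[(router, slot_of(interface))].add(interface)
    alarms = []
    for (router, slot), interfaces in down.items():
        total = CONFIGURED.get((router, slot))
        if total and len(interfaces) >= total:
            alarms.append("CRITICAL: %s on %s down (%d of %d interfaces)"
                          % (slot, router, len(interfaces), total))
    return alarms

traps = [("Router_A", "Serial4/1/0/%d:0" % n) for n in range(10)]
print(slot_alarms(traps))  # ['CRITICAL: Serial4 on Router_A down (10 of 10 interfaces)']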
Correlation can also be used across layers or across network domains to effectively localize faults for more efficient troubleshooting. Consider the example illustrated in Figure 6. The actual outage occurs in the layer one (physical) network
which is used to directly interconnect two routers. Let’s specifically consider an
example of a fiber cut. If the Layer one and the IP/MPLS network are maintained
by different organizations (a common situation), then there is little that can be
done in the IP organization to repair the failure. However, the IP routers all detect
the issue. Cross-layer correlation can be used to identify that the issue is indeed in
a different layer; the IP organization should thus be notified, but with the clarification that this is informational only – another organization (layer one) is responsible for repair. Such correlations can save a tremendous amount of unnecessary resource expenditure, by accurately and clearly identifying the issue, and notifying
the appropriate organization of the need to actuate repair.
Figure 6. Cross-layer correlation
The “root cause” alarm created by the event management systems effectively
identifies approximately where an event occurred and what type of event it was.
However, this is still a long way from identifying the true root cause of the event,
and thus being able to rectify it. If we consider the example depicted in Figure 4,
automated root cause analysis is able to successfully isolate the problem as being
related to line card 1 on router 2. However, we still need to determine how and
why the line card failed. Was the problem in hardware or software? What in the
hardware or software failed and why? Tremendous work is still typically required
before reaching true issue resolution.
2.3. Ticketing
Tickets are used to notify Operations personnel of events that require investigation, and to track ongoing investigations. Tickets are at the heart of troubleshooting network events, and recording actions taken and their results.
If the issue being reported has been detected within the network, then the tickets are automatically opened by the event management systems. However, if the
issue is, for example, reported by a customer, then the ticket will likely have been
opened by a human, whether that be the customer or a representative from the
ISP’s customer care organization.
The tickets are opened with basic information regarding the event at hand, such
as the date, time and location of where the issue was observed, and the details of
the correlated alarms and symptoms that triggered the ticket creation. From there,
Operations personnel carefully record in the ticket the tests that they execute in
troubleshooting the event, and the observations made. They record interactions
with other organizations, such as Operations groups responsible for other layers in
the network, or employees from equipment manufacturers (vendors). Clearly,
carefully tracking observations made whilst troubleshooting a complex issue is
critical for the person(s) investigating. However, the tickets also serve the purpose
of allowing easy handoff of investigations across personnel, such as may happen
when an issue must be handed off at, say, a shift change, or between different organizations.
Once an investigation has reached its conclusion and the issue has been rectified, then the ticket is closed. A resolution classification is typically assigned,
primarily for use in tracking aggregate statistics and for offline analysis of overall
network performance. Network reliability modeling is discussed in more detail in
Chapter 4 and in Section 4 of this chapter.
2.4. Troubleshooting
Troubleshooting a network outage is analogous to being a private investigator –
hunting down the offender and taking appropriate action to rectify the issue. Drilling down to root cause can require keen detective instincts, knowing where to look
and what to look for. Operations teams often pull upon vast experience and domain knowledge to rapidly delve into the vast depths of the network and related
network data to crystallize upon the symptoms, delve deep for additional information and theorize over potential root causes. This is often under extreme pres-
18
sure – the clock is ticking; customer service may be impaired until the issue is repaired.
The first step of troubleshooting a network event is to collect as much information about the event as possible that will help with reasoning about what is
happening or has happened. Clearly, a fundamental part of this involves looking
at the event symptoms and impact. The major symptoms are provided in the alarm
which created the ticket. Additional information can be collected as the result of
tests that the operator performs to further investigate the outage, or may incorporate historical data pulled from support systems to investigate earlier behavior and
actions taken within the network (e.g., maintenance) which could be related.
The tests invoked by the network operator range considerably in nature, depending on the type of incident being investigated. But in general they may include ping tests (“ping” a remote end to see if it is reachable), different types of
status checks (executed as “show” commands on Cisco routers, for example), and
on demand end-to-end performance tests. In addition to information about the
event, it may also be important to find out about potentially related events and activities – for example, did something change in the network that could have invoked the issue? This could have been a recent change, or something that changed
a while ago, waiting like a time bomb until conditions were ripe for an appearance.
Armed with this additional information, the network operator works towards
identifying what is causing the issue, and how it can be mitigated. In the majority
of situations, this can be achieved quickly – leading to rapid, permanent repair.
However, some outages are more complex to troubleshoot. This is when the full
power of an extended Operations team comes into force.
Let’s consider a hypothetical example to illustrate the process of troubleshooting a complex outage. In this scenario, let’s assume that a number of edge routers
across the network fail one after another over a short period of time. As these are
the routers which connect to the customers, it is likely that the majority of those customers are out of service until the issue is resolved – the pressure is really on to resolve the issue as fast as possible!
Given that we are assuming that routers are failing one after another, it would
take two or more routers to fail before it becomes apparent that this is a cascading
issue, involving multiple routers. But once this becomes apparent, then the individual router issues should be combined to treat the event as a whole.
As discussed earlier, the first goal is to identify how to bring the routers back
up, operational and stable so that customer service is restored. But achieving this
requires at least some initial idea about the event’s root cause. And then once the
routers have been brought back into service, it is critical to drill down to fully understand the issue and root cause so that permanent repair can be achieved – permanently eliminating this particular failure mode from the network.
The first step in troubleshooting such an outage is to collect as much information as possible regarding the symptoms observed, identify any interesting information that may shed light on the trouble, and create a timeline of events detailing when and where they occurred. This information is typically collated by running diagnostic commands on the router, and from information collected within the various collectors contained within the instrumentation layer. Syslogs provide a wealth of information about events observed on network routers. This information is complemented by that obtained from route monitors and
performance data collected over time, and from logs detailing actions taken on the
router by Operations personnel and automated systems.
The biggest challenge now becomes how to find something useful in amongst
the huge amount of data created by the network. Operations personnel would
painstakingly sift through all of this data, raking through syslogs, examining critical router statistics (CPU, memory, utilization, etc.), identifying what actions were
being performed within the network during the time interval before the event, and
examining route monitoring logs and performance statistics. Depending on the
failure mode, and whether the routers are reachable (e.g., out of band), diagnostic
tests and experimentation with potential actions to repair the issue would also be
performed within the network.
In a situation where multiple elements are involved, it is also important to focus
on what the routers involved have in common. Are they all made by a common
router vendor, are they a common router model? Do they share a common software version? Are they in a common physical or logical location within the network? Do they have similar numbers of interfaces, or load, or something which
may relate to the outage at hand? Were there any recent network changes made on
these particular routers that could have triggered the issue? Are there customers
that are in common across these routers? Identifying the factors which are common and those which are not can be critical in focusing in on what may be the root
cause of the issue at hand. For example, if the routers involved are from different
vendors, then it is less likely to be a software bug causing the issue. And if there is
a common customer associated with all of the impacted routers, then it would
make sense to look further into whether there is anything about the specific customer(s) which could have induced the issue or contributed in some way. For example, was the customer making any significant changes at the time of the incident?
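
This kind of commonality check is straightforward to automate once router attributes are available from inventory systems. A rough sketch follows (the attribute names and values are invented):

# Attributes of the impacted routers, e.g., pulled from an inventory database.
impacted = {
    "er1.nyc": {"vendor": "A", "model": "X1", "os": "12.4(7)", "recent_change": "os-upgrade"},
    "er3.chi": {"vendor": "A", "model": "X1", "os": "12.4(7)", "recent_change": "none"},
    "er7.dal": {"vendor": "A", "model": "X2", "os": "12.4(7)", "recent_change": "none"},
}

def common_attributes(routers):
    """Return the attribute/value pairs shared by every impacted router."""
    items = [set(attrs.items()) for attrs in routers.values()]
    return set.intersection(*items)

print(common_attributes(impacted))
# {('vendor', 'A'), ('os', '12.4(7)')} - a shared OS version is a natural suspect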
The initial goal is to extract sufficient insight into the situation at hand to indicate how service can be restored. Once this is achieved, appropriate actions can be
taken to rectify the situation. However, in many situations, such as those induced
by router bugs, the initial actions may restore service, but would not necessarily
solve the underlying issue. Further (lengthy) analysis – potentially involving extensive lab reproduction and/or detailed software analysis – may be necessary to
truly identify the real root cause.
2.5. Restore then repair
Thus, troubleshooting and repair typically involves two goals: (1) restoring customer service and performance and (2) fully resolving the underlying issue and returning network resources into service. These two goals may or may not be distinct – where redundancy exists, the network is typically designed to automatically
re-route around outages, thus negating the need for manual intervention in the first
step. Troubleshooting and repair can instead focus on returning the failed network
resources into service so that there is sufficient capacity available to absorb future
outages. In other situations, customer service restoral and outage resolution may
be one and the same. Let’s consider two examples in more detail:
2.5.1. Core network outage
Consider the example of a fiber cut impacting a link between two routers in the
core of an IP/MPLS network. IP networks are designed with redundancy and spare
capacity, so that traffic can be re-routed around failed network resources. This re-routing is automatically initiated by the routers themselves – the exact details of which depend on the mechanisms deployed, as discussed in Chapter 2. In the example in Figure 7, traffic between routers E and D is normally routed via router C. However, in the event that the link between routers C and D fails, then the traffic is re-routed via F and H.
Figure 7. Core network outage.
Assuming that the network has sufficient capacity to re-route all traffic without
congestion, then the impetus for rapidly restoring the failed resources is to ensure
that the capacity is available to handle future failures and/or planned maintenance
events. If, however, congestion results from the outage, then immediate intervention is required by Operations personnel to restore the customer experience. Immediate action may be taken to re-route or re-balance traffic load, to eliminate the
performance issues whilst the resources are being repaired – for example, by tuning the IGP weights (e.g., OSPF weight tuning). In the example of Figure 7, if
there were congestion observed on, say, the link between routers F and H, then
Operations personnel may need to re-route the traffic via routers F, I and G. Note
that this requires that Operations have an appropriate network modeling tool
available to simulate potential actions before they are taken. This is necessary to
ensure that the actions to be taken will achieve the desired result of eliminating the
performance issues being observed.
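
In spirit, such a what-if check boils down to recomputing paths under candidate weight changes before touching the network (real tools also model traffic volumes and link capacities). The sketch below uses invented IGP weights on a topology loosely resembling Figure 7, with the C-D link already failed, and shows the path between E and D before and after raising the weight of the congested F-H link:

import heapq

def shortest_path(graph, src, dst):
    """Dijkstra over {node: {neighbor: weight}}; returns the path as a node list."""
    queue, seen = [(0, src, [src])], set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return path
        if node in seen:
            continue
        seen.add(node)
        for neigh, weight in graph[node].items():
            if neigh not in seen:
                heapq.heappush(queue, (cost + weight, neigh, path + [neigh]))
    return None

def make_graph(weights):
    graph = {}
    for (a, b), w in weights.items():
        graph.setdefault(a, {})[b] = w
        graph.setdefault(b, {})[a] = w
    return graph

# Invented weights; the failed C-D link is simply omitted.
weights = {("E", "C"): 10, ("C", "F"): 10, ("E", "F"): 10, ("F", "H"): 10,
           ("H", "D"): 10, ("F", "I"): 10, ("I", "G"): 10, ("G", "D"): 10}
print(shortest_path(make_graph(weights), "E", "D"))   # ['E', 'F', 'H', 'D']

weights[("F", "H")] = 100   # candidate change: push traffic off the congested F-H link
print(shortest_path(make_graph(weights), "E", "D"))   # ['E', 'F', 'I', 'G', 'D']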
Permanent repair is achieved when the fiber cut is repaired. This requires that a
technician travel to the location of the cut, and re-splice the impacted fiber(s).
2.5.2. Customer-facing outage
Now consider an outage on a customer-facing line card in a service provider’s
edge router. We focus on a customer that has only a single connection between the
customer router and the ISP edge router, as illustrated in Figure 8.
Figure 8. Failure on customer-facing interface
Cost effective line card protection mechanisms simply do not exist for router
technologies today. Instead, providing redundancy on customer facing interfaces
requires that an ISP deploy dedicated backup line cards for each working line
card. Because this is expensive, customers that need higher reliability can choose
a multi-homed redundancy option. In situations where redundancy is not deployed, a failure of the ISP router line card facing the customer will result in the
customer being out of service until the line card can be returned to service.
If the line card failure was caused by faulty hardware, the customer may be out
of service until the failed hardware is repaired, necessitating a rapid technician
visit to the router in question. However, if the issue is in software, then service could potentially be restored simply by rebooting the line card. If the
issue is extreme, and a result of a newly introduced software release on the router,
then it is possible that the software should be “rolled back” to a previous version
that does not suffer from the software bug. Although this apparently fixes the issue
and restores customer service, it is not a permanent repair. If this is likely to occur
again, then the issue must be permanently resolved – the software must be debugged, re-certified for network deployment, and then installed on each relevant
router network-wide before permanent repair is achieved. This could involve upgrading potentially hundreds of routers – a major task, to say the least. In this case,
the repair of the underlying root cause takes time, but is necessary to ensure that
the failure mode does not occur again within the network.
3. Process automation
The previous section described the basic architecture and processes involved in
detecting and troubleshooting fault and performance impairments. These issues often need to be resolved under immense time pressures, especially when customers
are being directly impacted.
Computer systems can perform well-defined tasks much more rapidly than humans, and are necessary to support the scale of a large ISP network. Although
human reasoning about extremely complex situations is often difficult, if not impossible to automate, automation can be used to support human analysis and to aid
in performing simple, repetitive tasks. This is termed process automation.
Process automation is widely used in many aspects of network management,
such as in customer service provisioning. Relevant to this chapter, it is also applied to the processes executed in network troubleshooting and even repair. Process automation, in combination with the alarm filtering and correlation discussed in Section 2.2.1, is what enables a small Operations team to manage a rapidly growing and already extremely large-scale network, with increasing complexity
and diversity. The automation also speeds trouble resolution, thereby minimizing
customer service disruptions, and eliminates human errors, which are a fact of life,
no matter how much process is put in place in a bid to minimize them.
Over time, Operations personnel have identified large numbers of tasks which
are executed repeatedly and are very time consuming when executed by hand.
Where possible, these tasks are automated in the process automation capabilities
illustrated in the modified event management architecture of Figure 9.
Figure 9. Event management architecture incorporating process automation
The process automation system lies between the event management system and
the ticketing system. One of its major roles is to provide the interface between
these two systems – listening to incoming correlated alarms and opening, closing
and updating tickets. But rather than simply using the output of the event management system to trigger ticket creation and updates, process automation can take
various actions based on network state and the situation at hand. It collects additional diagnostic information, and reasons about relatively complex scenarios before then taking action. Upon opening or updating a ticket, the process automation system automatically populates the ticket with relevant information that it
collected in evaluating the issue. This means that there is already a significant
amount of diagnostic information available in the ticket as it is opened by a human
being to initiate investigation into an incident. This can dramatically speed up
event investigation, by eliminating the need for humans to go out and manually
execute commands on the network elements to collect diagnostic information.
The process automation system interfaces with a wide range of different systems to execute tasks. In addition to the event management and ticketing systems,
process automation also interacts closely with network databases and with the
network itself (either directly, or indirectly through another system). For example,
collecting diagnostic information related to an incoming event will likely involve
reaching out to the network elements and executing commands to retrieve the information of interest. This could be done either directly by the process automation
system, or via an external network interface, as illustrated in Figure 9.
Process automation is often implemented using an expert system – a system
which attempts to mimic the decision making process and actions that a human
would execute. The actions taken by the system are defined by rules, which are
created and managed by experts. The number of rules in a complex process automation system is typically of the order of 100s to 1000s, given the complexity
and extensive range of the tasks at hand. Rules are continually updated and managed by the relevant experts as new opportunities for automation are conceived
and implemented, processes are updated and improved, and the technologies used
within the network evolve and change.
The example in Figure 10 demonstrates the process automation steps executed
upon receipt of a basic notification that a slot is down (failed). A slot here refers to
a single line card within a router; the slot is assumed to support multiple interfaces. Each interface terminates a connection to a customer or adjacent network router. As can be seen from this example, the system executes a series of different
tests and then takes different actions depending on the outcomes of each test. The
tests and actions executed are specific to the type of alarm triggering the event,
and also to the router model in question. Note that in this particular case, the router in question is a Cisco router, and thus Cisco command line interface (CLI)
commands are executed to test the line card.
Figure 10. Example process automation rule - router slot down.
The process automation example here is focused on support for the network
operations team. This team is focused on managing the network elements and is
not responsible for troubleshooting individual customer issues – those are handled
by the customer care organization. Thus, a primary goal of the process automation
is to identify issues with network elements, weeding out individual customer issues (and not creating tickets on them for this particular team). In this example of
a customer-facing line card with multiple customers carried on different interfaces
on the same card, the automation ensures that there are multiple interfaces that are
simultaneously experiencing issues, as opposed to being associated with a single
customer. This makes it likely that the issue at hand relates to the line card (part of
the network), and is not an individual customer issue.
In addition to ensuring that only relevant tickets are created, process automation also distinguishes between new issues and new symptoms associated with
an existing, known problem. Thus, if a ticket has already been opened for the current event, such as may occur when a new symptom is detected for an ongoing
event, then the process automation will associate the new information with the existing ticket, as opposed to opening a new ticket.
The results of the tests executed are presented to the Operations team within the
ticket that is opened or updated. Thus, as an Operations team member opens a new
ticket ready to troubleshoot an issue, he/she is immediately presented with a solid
set of diagnostic information, instead of having to start logging into
routers and running these same tests. This can significantly reduce the investigation time, the customer impact and the load on the Operations team.
For a more complex scenario, now consider a customer initiated ticket, created
either directly by the customer or by a customer’s service representative. The process automation system picks up the ticket automatically and, with support from
other systems, launches tests in an effort to automatically localize the issue. If the
system can identify the issue, it will either automatically dispatch workforce to the
field or refer the ticket out to the appropriate organization, which may even be another company (e.g., where another telecommunications provider may be providing access). Upon receiving confirmation that the problem has been resolved, the
expert system also executes tests to validate, and then closes the ticket if all tests
succeed. If the expert system is unable to resolve the issue, then it would collate
information regarding the diagnostics tests and create a ticket to trigger human investigation and troubleshooting.
The opportunity space for process automation is almost limitless. As technologies and modeling capabilities improve, automation can and is being extended beyond troubleshooting and event management, and into service recovery and repair
and even outage prevention. Although it is clearly hard to imagine automated fiber
repairs in the near future, there are scenarios in which actions can be automatically
taken to restore service. For example, network elements or network interfaces can
be rebooted automatically where such action is likely to restore service. One
scenario when this may be an attractive solution is where a known software bug is
impacting customers and the operator is waiting for a fix from the vendor. In the
meantime, if an issue occurs, then the most rapid way of recovering service would
be through the automated response – not waiting for human investigation and intervention.
As another example, consider errors in router configurations (misconfigurations) which can be automatically fixed via process automation. Regular auditing, such as that discussed in Chapter ???, may identify what
we refer to as “ticking time bombs” – misconfigurations that could cause significant customer impact under specific circumstances. These misconfigurations are
automatically detected, and can also be automatically repaired – the “bad” configuration being replaced with “good” configuration, thereby preventing the potentially nasty repercussions.
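A deliberately simplified illustration of such an audit-and-repair loop is sketched below: the live configuration of each interface is compared against the intended ("gold") values held in an inventory database, and repair actions are generated for any mismatch. The data model, attribute names and values are invented for illustration; real audit rules are far richer and vendor specific.

```python
# Toy configuration audit: compare live config values against intended
# ("gold") values and emit repair actions for mismatches. The data model is
# hypothetical; real audit rules are far richer and vendor specific.

intended = {  # from the inventory / provisioning database (assumed)
    ("per7.nyc", "ge-1/0/0"): {"mtu": 9000, "bfd": "enabled"},
    ("per7.nyc", "ge-1/0/1"): {"mtu": 9000, "bfd": "enabled"},
}

live = {      # parsed from the running router configuration (assumed)
    ("per7.nyc", "ge-1/0/0"): {"mtu": 9000, "bfd": "enabled"},
    ("per7.nyc", "ge-1/0/1"): {"mtu": 1500, "bfd": "disabled"},  # "time bomb"
}

def audit(intended, live):
    repairs = []
    for key, gold in intended.items():
        actual = live.get(key, {})
        for attr, want in gold.items():
            have = actual.get(attr)
            if have != want:
                repairs.append((key, attr, have, want))
    return repairs

for (router, ifc), attr, have, want in audit(intended, live):
    print(f"repair {router} {ifc}: set {attr} {have!r} -> {want!r}")
```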
Although automated responses to troubles are an attractive proposition, extreme care must clearly be taken with automation, making sure that the logic is carefully
thought out and that safeguards are introduced to prevent potentially larger issues
from arising with the wrong automated response. Meticulous tracking of automated actions is also critical, to ensure that automated repair does not hide underlying
chronic issues which need to be addressed. However, even with these caveats and
warnings, automation offers an opportunity to far more rapidly recover from issues than a human being could achieve through manually initiated actions.
The value of automation throughout the event management process (from event
detection through to advanced troubleshooting) is unquestionable – reducing millions of alarms to a couple of hundred or fewer tickets that require human investigation and intervention. This automation is what allows a small network operations team to successfully manage a massive network, with rapidly growing
numbers of network elements, customers and complexity.
4. Managing Network Performance over Time
Event management systems primarily focus on troubleshooting individual large
network events that persist over extended periods of time, such as link failures, or
recurring intermittent (chronic) flaps on individual links. However, this narrow
view of looking at each individual event in isolation risks leaving network issues
flying under the event management radar, while potentially impacting customers’
performance.
Let’s consider an analogy of financial management within a large corporate or
government organization. In keeping a tight rein on a budget, an organization
would likely be extremely careful about managing large transactions – potentially
going to great lengths to approve and track each of the individual large purchases
made across the organization. However, tracking individual transactions without
looking at history can result in patterns being missed. A single user’s individual
transaction may be fine in isolation, but analysis over a longer time interval may
uncover an excessively large number of such transactions – something which may
need further investigation. In addition, this approach does not scale to small
transactions, and thus smaller transactions would likely receive less attention before they are made. However, if no one is tracking the bottom line – the total expenditure – it may well be that the small expenditures add up considerably, and
could lead to financial troubles in the long run. Instead, careful tracking of how
money is spent across the board is critical, characterizing at an aggregate level
how, where and why this money is spent, whether it is appropriate and (in situations where money is tight) where there may be opportunities for reductions in expenditure. New processes or policies may well be introduced to address issues
identified. But this can only be seen with careful analysis of longer term spending
patterns – large and small. Returning to the network, carefully managing network
performance also requires examining network events holistically – exploring series of events – large and small - instead of purely focusing on each large event in
isolation. The end goal is to identify actions which can be taken to improve overall
network performance and reliability. Such actions can take many forms – for example, driving a subset of network impairments either entirely out of the network,
or minimizing their impact, or by incorporating better methods for detecting issues
in the event management system so that Operations can be rapidly informed of a
network issue. For example, if a particular failure mode is identified that is causing performance issues over time (e.g., repeated line card failures, excessive BGP
flaps, unexpectedly high packet losses over time), then it is desirable to work to
permanently eliminate this condition, where possible. The response depends entirely on the issues identified, but could involve software bug fixes, hardware redesigns, process changes or technology changes.
An important step towards driving network performance is to carefully track
performance over time, with periods of poor performance being identified so that
intervention can be initiated. For example, if it is observed that the network has recently been demonstrating unacceptably high unavailability due to excessive numbers of line card failures, or higher packet loss than expected or than normal, then investigation should be initiated in an effort to determine why this is occurring, and to take appropriate actions to rectify the situation. But we don't
need to wait for performance to degrade – regular root cause analysis of network
impairments can identify areas for improvements, and potentially even uncover
previously unrecognized yet undesirable network behaviors which could be eliminated. For example, consider a scenario where ongoing root cause analysis of network packet loss uncovered chronically slower than expected recovery times in the
face of network failures. Once identified, efforts can be initiated to identify why
this is occurring (router software bug?, hardware issue?, fundamental technology
limitations?) and to then drive either new technologies or enhancements to existing technologies into the network to permanently rectify this situation.
Tracking network performance over time and drilling into root causes of network impairments typically involves delving into large amounts of historical network data. Exploratory Data Mining (EDM) is thus used to complement real-time
event management through detailed analysis across large numbers of events, identifying patterns in failures and performance impairments and in the root causes of
these network events. However, this is clearly a challenging goal; it typically involves mining tremendous volumes of network data across very diverse data
sources to identify patterns either in individual time series, or across time series.
4.1. Trending Key Performance Indicators (KPI)
So let’s start by asking a relatively simple question - how well is my network
performing? The first step here is to clearly define what we mean by network performance, so that we can provide metrics which can be evaluated using available
network measurements. We refer to these metrics as Key Performance Indicators,
or KPIs. KPIs are very general metrics, but in the context here are really designed
for two primary purposes: (1) to track the network performance, and (2) to track
network health.
When tracking how well the network is performing, it is important to ensure
that metrics obtain a view which is as close as possible to what customers are experiencing. However, how well the network is performing is in the eye of the beholder – and different beholders have different vantage points, and different criteria. Some applications (e.g., email or file transfers) are extremely resilient to short
term impairments whilst others, such as video, are extremely sensitive to even extremely short term outages. Thus, KPIs need to capture and track a range of different performance metrics which reflect the diversity of applications being supported. Ideally, KPIs should track application measures – such as the frequency and
severity of video impairments that would be observable to viewers. However,
network-based metrics are also critical, particularly in networks where there is a
vast array of different applications being supported. Thus, KPIs should include,
but not be limited to, metrics tracking network availability (DPM – see Chapter 3
for details), application performance, end-to-end network packet loss, delay and jitter, and even network operations responses to tickets.
KPIs can also capture non-customer impacting measures of network health,
such as the utilization of network resources. These provide us with the ability to
track network health before we hit customer-impacting issues. There are many
limited resources in a router – link capacity is an obvious one, but router processing power and memory are two other key examples. Capacity is traditionally
tracked as part of the capacity management process discussed in Chapter 5 and is
therefore not discussed further here. But router CPU and memory utilization –
both on the central route processor and on individual line cards – are also critical
limited resources which are often less well analyzed than link capacity. One of the
most critical functions that the central route processor is responsible for is control
plane management – ensuring that routing updates are successfully received and
sent by the router, and that the internal forwarding within the router is appropriately configured to ensure successful routing of traffic across the network. Thus, the
integrity of our control plane is very much dependent on router CPU usage – if
CPUs become overloaded with high priority tasks, the integrity of the network
control plane could be put at risk. Router memory is similarly critical – if memory
becomes fully utilized within a router, then the router is at risk of crashing, causing a nasty outage. Thus, these limited resources must be tracked to ensure that
they are not hitting exhaust either in the long term, or over shorter periods of time.
KPIs must be measurable – thus, a given KPI must map to a set of measurements that can be made within the network or using data collected from network
operations. Availability-based KPIs can be readily calculated based on logs from
the event management system and troubleshooting analyses. Chapters 3 and 4 discuss availability modeling and associated metrics, which clearly depend on network measurements. This is not discussed further here.
Obtaining application dependent performance metrics requires extensive monitoring at the application level. This can be achieved either by using “test” measurements executed via sample devices strategically located across the network, or
by collecting statistics from user devices (in situations where the user devices are
accessible by the service provider). Such application measurements are a must for
networks which support a limited set of applications – such as a network supporting IPTV distribution. However, it is practically impossible to scale this to every
possible application type that may ride over a general purpose IP/MPLS backbone,
especially if application performance depends on a plethora of different customer
end devices.
Instead, end to end measurements of key network performance criteria, namely
packet loss, delay and jitter, are the closest general network measure of customer-perceived network performance. These measurements provide a generic, application-independent measure of network performance, which can (ideally) be applied
to estimate the performance of a given application.
End to end packet loss, delay and jitter would likely be captured from an end-to-end monitoring system, such as described in Chapter 10 and [6]. In an active
monitoring infrastructure, for example, large numbers of end to end test probes are
sent across the network; loss and delay measurements are calculated based on
whether these probes are successfully received at each remote end, and how long
they take to be propagated across the network. From these measurements, network
loss, delay and jitter can be estimated over time for each different source / destination pair tested. By aggregating these measures over larger time intervals, we can
calculate a set of metrics, such as average or 95th percentile loss / delay for each
individual source / destination pair tested. These can be further aggregated across
source / destination pairs to obtain network-wide metrics. KPIs can also examine
other characteristics of the loss beyond simple loss rate measurements – for example, tracking loss intervals (continuous periods of packet loss). How loss is distributed over time can really matter to some applications – a continuous period of
loss may have greater (or lesser) impact on a given application than the same total
loss distributed over a longer period of time. Thus, metrics which characterize the
number of short versus long duration outages may be used to characterize the performance of various network applications.
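A minimal sketch of this aggregation step is shown below: given per-probe loss and delay samples for each source / destination pair, it computes average and 95th percentile KPIs per pair, a simple count of loss intervals, and a network-wide average. The input layout and sample values are assumptions for illustration.

```python
# Sketch: aggregate active-probe measurements into KPIs. Input layout,
# sample values and aggregation choices are illustrative assumptions.
from statistics import mean, quantiles

# probe results per (source, destination) pair: list of (loss_fraction, delay_ms)
samples = {
    ("nyc", "lax"): [(0.0, 31.2), (0.0, 30.9), (0.02, 35.7), (0.0, 31.0)],
    ("nyc", "chi"): [(0.0, 12.1), (0.0, 12.3), (0.0, 12.0), (0.01, 14.8)],
}

def p95(values):
    return quantiles(values, n=20)[-1]   # 95th percentile cut point

pair_kpis = {}
for pair, obs in samples.items():
    loss = [l for l, _ in obs]
    delay = [d for _, d in obs]
    pair_kpis[pair] = {
        "avg_loss": mean(loss), "p95_loss": p95(loss),
        "avg_delay_ms": mean(delay), "p95_delay_ms": p95(delay),
        # loss intervals: maximal runs of consecutive lossy probes
        "loss_intervals": sum(1 for prev, cur in zip([0.0] + loss, loss)
                              if cur > 0 and prev == 0),
    }

# Network-wide KPI: aggregate across all source / destination pairs
network_avg_loss = mean(k["avg_loss"] for k in pair_kpis.values())
print(pair_kpis)
print("network-wide average loss:", network_avg_loss)
```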
Ideally, end to end measurements should extend out as far to the customer as
possible, preferably into the customer domain. However, this is often not practical
– scaling to large numbers of customers can be infeasible, and the customer devices are often not accessible to the service provider. Thus, comprehensive end to end
measurements are often only available between network routers – leaving the connection between the customer and the network beyond the scope of the end to end
measurements. Tracking performance at the edge of the ISP network thus requires
different metrics. One popular metric is BGP flap frequency aggregated over different dimensions (e.g., across the network, per customer etc). However, it is important to note that by definition the ISP/customer and ISP/peer interfaces cross
trust domains - the ISP often only has visibility and control over its own side of
this boundary, and not the customer/peer domain. It is actually often extremely
challenging to distinguish between customer/peer-induced issues and ISP-induced
issues. Thus, without knowing about or being responsible for customer/peer activities on the other side of the trust boundary, BGP and other similar metrics can be
seriously skewed. A customer interface that is flapping incessantly can significantly distort these metrics, making it extremely challenging to distinguish real patterns that may be attributed to the ISP.
KPIs can also be calculated on more complex transformations of time series,
beyond simple statistical measures such as total counts or averages. Let’s again illustrate this by way of an example.
As discussed in Section 2.1, router performance measures, including CPU
measurements, are obtained from routers using pollers which collect these measurements at regular time intervals (e.g., every 5 minutes). Thus, we have time series of router CPU measurements averaged over 5 minute intervals. These time series are typically extremely noisy – varying dramatically as a function of time as
different processes pound the routers, collecting information and configuring the
device, and as the routers respond to topology updates and major routing changes
and so on. An example daily CPU plot is illustrated in Figure 11. From this plot,
we can see short periods where the CPU "spikes" to high values, and longer periods where the CPU measurements remain clearly elevated.
Figure 11. Example CPU time series
Simply looking at metrics such as average CPU would provide limited insight
into the overall router behavior over time, given the complexity of the process. Instead, we can carefully examine the anomalies as a function of time, even trending
the anomalies themselves. KPIs can then track different aspects of the CPU, including, for example, the rate of “bad” anomalies, and identify routers with higher
than expected rates of “bad” anomalies.
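One deliberately naive way to derive such a KPI is sketched below: a sample is flagged as anomalous when it sits well above a rolling-median baseline, and the per-router anomaly rate then becomes the quantity that is trended. The window size, threshold and sample data are assumptions; production schemes are considerably more sophisticated.

```python
# Naive sketch: flag CPU anomalies against a rolling-median baseline and
# compute a per-router anomaly rate as a trendable KPI.
# Window size and thresholds are illustrative assumptions.
from statistics import median

def cpu_anomalies(series, window=12, delta=30.0):
    """Return indices of samples sitting `delta` CPU-% above the rolling median."""
    flagged = []
    for i in range(window, len(series)):
        baseline = median(series[i - window:i])
        if series[i] - baseline > delta:
            flagged.append(i)
    return flagged

# 5-minute CPU samples per router (synthetic example data)
routers = {
    "cr1.dfw": [20, 22, 25, 21, 23, 24, 22, 21, 26, 23, 22, 24, 85, 90, 24, 23],
    "cr2.dfw": [30, 32, 31, 29, 30, 33, 31, 30, 29, 31, 32, 30, 31, 29, 30, 32],
}

for name, series in routers.items():
    anomalies = cpu_anomalies(series)
    rate = len(anomalies) / len(series)          # anomalies per sample
    print(f"{name}: {len(anomalies)} anomalies, rate {rate:.2%}")
```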
Once we have defined our key metrics, we can then track how these change
over time. This is known as trending. Trending of KPIs is critical for driving the
network to higher levels of reliability and performance, and for identifying areas
and opportunities for improvement. The goal is to see these KPIs improve over
time, corresponding to network performance improvements. However, if KPIs
turn south, indicating worsening network conditions, investigation would likely be
required. The KPIs can thus be used to focus Operations attention to areas that
need most immediate attention.
Let’s consider a simple example of end to end loss. If the loss-related KPIs
(e.g., average loss) degrade, then investigation would be required to understand
the underlying root cause and to (hopefully) initiate action(s) to reverse the negative trend. Obviously, the actions taken depend on the root cause identified – but
could include capacity augments, elimination of failure modes (e.g., if loss is being introduced by router hardware or software issues), or may even require the
introduction of new technology, such as faster failure recovery mechanisms.
Examining KPIs over time can also enable detection of anomalous network
conditions – thereby detecting issues which may be flying under the radar of the
event management systems described in Section 2.2. Let’s consider an example
where the rate of protocol flaps has increased within a given region of the network. The individual flaps are too short to alarm on – their associated alarms have
cleared even before a human can be informed of the event, let alone investigate.
Thus, alarms would only be created if there are multiple events exceeding a
threshold during a given time duration, at which point the flapping is determined to be chronic. If the flaps on individual interfaces do not cross this threshold, then an
aggregate increase in flaps across a region may go undetected by the event management system. However, this aggregate increase could be indicative of an unexpected condition, and be impacting customers. It would thus require investigation.
Careful trending and analysis of KPIs can also be used to identify new and improved signatures which can be used to better detect events that should be alarmed
within the event management systems discussed in Section 2. The identification
of new signatures is a continual process, typically leveraging the vast and evolving
experience gained by network Operators as they manage the network on a day-to-day basis. However, this human experience can be complemented by data mining.
It is far from easy to specify what issues should be alarmed on amidst the mound
of performance data collected from the routers. For example, under what conditions should we consider CPU load to be excessive and thus alarm on it to trigger
intervention? And at what point does a series of “one off” events become a chronic condition that requires immediate attention?
Performance events used to trigger alarms are typically detected using simple
threshold crossings – an event is identified when the parameter of interest exceeds
a pre-defined value (threshold) for a given period of time. However, even selecting this threshold is often extremely challenging. How bad and for how long
should an event be before Operations should be informed for immediate attention?
Low level events more often than not clear themselves; if we are overly sensitive in picking up events to react to, then we generate too many false alarms – Operations personnel would spend most of their time chasing these, and will likely miss the real issues amongst the noise. But if we are not sensitive
enough, then critical issues may fly under the radar and not be appropriately reacted to in a timely fashion. Selecting good thresholds for alarm generation is typically achieved through a combination of tremendous Operational experience,
combined with detailed analysis of KPIs over time. However, note that the
thresholds selected may not actually be a constant value – in some cases they
could vary over time, or may vary over different parts of the network. Thus, in
some cases, it may be sensible to actually have these thresholds be automatically
learned and adjusted as the network evolves over time.
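A minimal version of this threshold-with-duration logic is sketched below: an alarm is raised only when the monitored value stays above the threshold for a minimum number of consecutive polling intervals. The threshold and persistence values are illustrative assumptions and, as noted above, might in practice vary by router, by region and over time.

```python
# Sketch of threshold-crossing alarm generation with a persistence requirement:
# alarm only if the value exceeds `threshold` for at least `persistence`
# consecutive polling intervals. Parameter values are illustrative assumptions.

def threshold_alarms(samples, threshold=90.0, persistence=3):
    """Yield (start_index, end_index) of runs that warrant an alarm."""
    run_start = None
    for i, value in enumerate(samples + [float("-inf")]):  # sentinel flushes last run
        if value > threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= persistence:
                yield (run_start, i - 1)
            run_start = None

# 5-minute CPU utilization samples (synthetic)
cpu = [40, 95, 50, 92, 93, 96, 97, 60, 91, 92, 45]
print(list(threshold_alarms(cpu)))   # only the 4-interval run at indices 3..6 alarms
```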
Simple thresholding techniques can also be complemented by more advanced
anomaly detection mechanisms, particularly when detecting degrading conditions
which may be indicative of an impending issue. For example, consider a router
experiencing a process memory leak. In such a situation, the available router
memory will (gradually) decrease – the rate of decrease being indicative of the
point at which the router will hit memory exhaust and likely cause a nasty router
crash. Under normal conditions, router memory utilization is relatively flat; a
router with a memory leak can be detected by a non-zero gradient in the memory
utilization curve. Predicting the impending issue well in advance provides Operations with the opportunity to deal with the issue before it becomes customer impacting. These are known as predictive alarms, and can be used to nip a problem
in the bud, thereby preventing a potentially nasty issue. Again, detailed analysis
over extended periods of data is required to identify appropriate anomaly detection
schemes which can accurately detect issues, with minimal false alarms. There is a
vast array of publications focused on anomaly detection using network data, although the majority of the work focuses on traffic anomalies [7; 8; 9; 10; 11].
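The sketch below gives a bare-bones version of such a predictive check: it fits a least-squares line to recent memory utilization samples and, if the slope is positive, estimates the time remaining until exhaustion. The polling interval, sample values and warning horizon are assumptions for illustration.

```python
# Predictive-alarm sketch: fit a least-squares line to recent memory
# utilization and estimate the time remaining until exhaustion (100%).
# Sample data, polling interval and warning horizon are assumptions.

def linear_fit(y):
    """Least-squares slope/intercept for equally spaced samples y[0..n-1]."""
    n = len(y)
    xs = range(n)
    x_mean, y_mean = (n - 1) / 2, sum(y) / n
    num = sum((x - x_mean) * (v - y_mean) for x, v in zip(xs, y))
    den = sum((x - x_mean) ** 2 for x in xs)
    slope = num / den
    return slope, y_mean - slope * x_mean

def hours_to_exhaust(mem_percent, poll_minutes=5):
    slope, intercept = linear_fit(mem_percent)
    if slope <= 0:                      # flat or decreasing: no leak suspected
        return None
    n = len(mem_percent)
    samples_left = (100.0 - (intercept + slope * (n - 1))) / slope
    return samples_left * poll_minutes / 60.0

# Utilization creeping up ~0.5% per 5-minute sample -> leak suspected
mem = [62.0 + 0.5 * i for i in range(24)]
eta = hours_to_exhaust(mem)
if eta is not None and eta < 72:        # warn if exhaustion predicted within 3 days
    print(f"predictive alarm: memory exhaustion in ~{eta:.1f} hours")
```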
4.2. Root Cause Analysis
KPIs track network performance; they do not typically provide any insight into
what is causing a condition or how to remediate it. But if we want to drive network improvements, then we must understand the root cause of performance issues – potentially down to the smallest individual events.
Troubleshooting persistent network events was discussed in detail in Section
2.4. However, short term impairments, such as packet losses or CPU anomalies,
are often too short to even enable real-time investigation – the event has more often than not cleared before Operations personnel could even be informed about it,
let alone investigate. And short term events are expected – no matter how careful
network planning and design is, occasional short term packet losses, for example,
will occur. There is nothing that can be done about a short event that occurred
historically; Operations can only address potential future events. Thus, short impairments can be examined from the point of view of understanding how often they
occur, whether recent events are more or less frequent than expected and what are
the most significant root causes of these impairments. If events are more frequent
than normal or expected, or if opportunities can be identified to eliminate some of
these short term impairments, then actions can be taken. Such analyses have the
potential to uncover the need for hardware or software repairs, configuration
changes, or maintenance procedural changes, to name but a few.
Characterizing the root causes of network events (e.g., packet loss) – even the
small individual events - and then creating aggregate views across many such
events can enable insights into the underlying behavior of the network and uncover opportunities for improvements. Obviously, investments made in improving
the network to address performance should ideally be focused where they will
have the greatest impact. For example, by characterizing the root causes of the
different packet loss conditions observed from end to end performance measurements, a network operator can identify the most common root causes, and focus
energies on permanently eliminating or at least reducing these. If many of these
root causes were, for example, congestion-related, then additional network capacity may be required. If, instead, significant loss correlated with a previously unidentified issue within the network elements, then additional analysis would be required. A Pareto analysis [] is a formal technique used to guide this process – formally evaluating the benefits of potential actions which can be taken, and then selecting those which together come close to having the maximal possible
impact. To understand the root cause of recurring symptoms, such as hardware
failures, packet loss or routing flaps, we need to drill down into individual events
in a bid to come up with the best explanation of their likely root cause. Root cause
analysis here is similar to that described above in Section 2.4 – with a couple of
important distinctions. Specifically, we are typically examining large numbers of
individual events, as opposed to a single large event, and we are typically examining historical events.
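As a concrete, if simplified, illustration of a Pareto-style prioritization, the sketch below takes a hypothetical breakdown of customer-impact minutes by root cause, ranks the causes, and identifies the smallest set accounting for roughly 80% of the total impact. The categories and numbers are invented.

```python
# Pareto-style prioritization sketch: rank root causes by impact and select
# the smallest set covering ~80% of the total. Categories/values are invented.

impact_minutes = {           # customer-impact minutes attributed to each root cause
    "congestion": 4200,
    "line card failures": 2600,
    "fiber cuts": 1900,
    "router software bugs": 900,
    "planned maintenance overruns": 350,
    "unexplained": 250,
}

total = sum(impact_minutes.values())
ranked = sorted(impact_minutes.items(), key=lambda kv: kv[1], reverse=True)

cumulative, focus = 0, []
for cause, minutes in ranked:
    cumulative += minutes
    focus.append((cause, minutes, cumulative / total))
    if cumulative / total >= 0.8:
        break

for cause, minutes, cum_share in focus:
    print(f"{cause:30s} {minutes:6d} min  cumulative {cum_share:.0%}")
```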
As with troubleshooting large persistent individual issues, root cause analysis
of recurring symptoms typically commences with detailed analysis of available
network data. If the symptoms being investigated were actually already individually analyzed by Operations staff, as would be the case for large, persistent network faults (e.g., line card failures etc), then analysis would rely extensively on
observations made by the Operations personnel during this process. More generally, scalable data mining techniques are required to effectively make the most of
the available data, given the large number of events presumably involved, and the
diversity of possible root causes. But data mining alone cannot always lead to the
underlying root cause(s) – especially in scenarios where anomalous network conditions or unexpected network behaviors are being identified. Instead, such analy-
34
sis must be complemented by targeted network measurement, lab reproduction
and detailed software and/or hardware analysis. Targeted network measurements
can be used to complement the regular network monitoring infrastructure in situations where the general monitoring is insufficiently fine grained or focused to provide the detailed information required to troubleshoot an issue. The data collection infrastructure in place across a large network aims to capture information
critical to a wide range of tasks, including network troubleshooting. But such an
infrastructure has to scale across the whole network – there is thus a careful
tradeoff required between the feasibility of capturing measurements at scale and
the need for this information. Thus, the data collected by a general infrastructure
may not always suffice when troubleshooting a very specific issue; it may not be
granular enough, or certain measurements may simply not be made at all on a regular basis. Targeted network measurements can instead be used to obtain more detailed information as required. For example, when troubleshooting an issue related
to router process management, finer grained CPU measurements may be temporarily obtained on a small subset of network routers through targeted measurements (e.g., recording measurements every 5 seconds instead of every 5 minutes).
Similarly, detailed inspection of specific traffic carried over a network link may be
required if malformed packets are causing erroneous network behavior – identifying what this traffic is and where it is coming from may be fundamental to addressing the issue. However, note that targeted measurements will most likely need to be taken when the event of interest occurs – which could be challenging for an intermittent issue, unless one can be explicitly induced.
But once such measurements are available, they can complement the regularly collected data and be fed into the network analyses.
Lab testing aimed at reproducing a problem observed in the field and detailed
software/hardware analysis can also be used to complement data analysis. Large
ISPs typically maintain a lab containing the same hardware and software configurations that are running in the field, providing a controlled environment in which testers can try to reproduce and understand the issue at hand
– without the risk of impacting customer traffic. Lab testing can be used to understand the conditions under which the issue occurs, and to experiment with potential solutions to address it. Examining software and hardware designs and implementations can also provide tremendous insight into why problems are occurring
and how they can be addressed.
However, lab testing and detailed software/hardware analysis are more often
than not immense and extremely time consuming efforts which should not be entered into lightly. It is thus critical to glean as much information as possible from available
network data to effectively guide these other efforts. EDM techniques are at the
heart of this.
Constructing a good view of what is happening in a network requires looking
across a wide range of different data sources, where each data source provides a
different perspective and insight. Manually pulling all of this data together and
then applying reasoning to hypothesize about an individual event’s root cause is
excessively challenging and time consuming. The data itself is typically available
in a range of different tools, as depicted in Figure 12. These tools are often created and managed by different teams or organizations, and often present information
in different formats and via different interfaces. In the example in Figure 12, performance data collected from SNMP MIBs is viewed via one web site, router syslogs may be collected either directly from a router or accessed from an archive
stored on a network server, a different server collates work flow logs and end to
end performance data is available in yet another web site. Thus, manually collecting data from all these different locations and then correlating it to build a complete view of what is happening in any given situation can be a painstaking process, to say the least. This situation may be further complicated in scenarios
which involve multiple network layers or technologies which are managed by different organizations. It is entirely possible that information across network layers /
organizations is only accessible via human communication with an expert in the
other organization. Thus, obtaining information about lower layer network performance (e.g., the SONET network) may involve reaching out via the phone to a layer one
specialist. Of course, different data sources also use different conventions for
timestamps (e.g., different time zones) and in naming network devices (e.g., router
names may be specified as IP addresses, abbreviated names, domain name extensions etc). These further compound the complexity of correlating across different
data sources. Thus, analyzing even a single event could potentially take hours by
hand - simply in pulling the relevant data together. This is barely practical when
troubleshooting a complex individual event, but becomes completely impractical
when troubleshooting recurring events with potentially large numbers of root
causes. However, it has historically been the state of the art.
Figure 12. Troubleshooting of network events is difficult if data is stored in
separate “data silos”
Data integration and automation is thus absolutely critical to scaling network
troubleshooting and data mining in support of root cause analysis. It would be difficult to overstate the importance of data integration: making all of the relevant
network and systems data readily available. Ideally the data should be made
available in a form that makes it easy to correlate significant numbers of very diverse data sources across extended time intervals. It is also extremely valuable to
simply expose the data to Operations though a set of integrated data browsing and
visualization tools. AT&T Labs have taken a practical approach to achieving this
- collecting data from large numbers of individual “data silos” and integrating
them into a common infrastructure. Above this common data infrastructure is a
set of scalable tools and applications, typically designed to work over many very
diverse data sources. This architecture is illustrated in Figure 13. Scalable and automated feed management and database management tools [12; 13] are used to
amass and then load data into a massive data warehouse. The data warehouse is
archiving data collected from various configuration, routing, fault, performance
and application measurement tools, across multiple networks. Above the data
warehouse resides a set of scalable and reusable tool sets, which provide core capabilities in support of various applications. These tool sets include a reporting
engine (used for making data and more complex analyses available via web reports to end users), anomaly detection schemes, and different correlation infrastructures (rules-based correlations and techniques for correlation testing and
learning [14; 15]). Above the reusable tool sets lies a wide and rapidly expanding
range of data mining applications - including troubleshooting tools, network topology and time series visualization, trending reports, and in-depth analysis of recurring conditions (e.g., packet loss, BGP flaps, service quality impairments).
The key to the infrastructure is scale – both in terms of the amount and the diversity of the network data being collated, and the range of different applications
being created. One of the key ingredients to achieving this is simple normalization of the data, performed as the data is ingested into the database – ensuring, for example, that a single time zone is used across all data sources, and
common naming conventions are used to describe network elements, networks etc.
Operations and applications alike thus do not have to continually convert between
different time zones, naming formats etc – dramatically simplifying human and automated analyses. Normalization and data pre-processing are done once – and thus
not required by each individual application and user. Although an enormous undertaking, such an infrastructure enables scale in terms of the very diverse analyses which can be performed [12; 13].
Figure 13. Scaling network troubleshooting and root cause analysis
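The kind of ingest-time normalization described above can be sketched very simply, as below: timestamps carrying arbitrary time zones are converted to UTC, and the various forms a router name may take (IP address, short name, fully qualified name) are mapped to a single canonical name via a lookup built from inventory data. The alias table and record formats are assumptions for illustration.

```python
# Ingest-time normalization sketch: convert timestamps to a single time zone
# (UTC) and map router-name variants to one canonical name. The alias table
# and record formats are illustrative assumptions.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# alias -> canonical router name (would be built from inventory data)
ROUTER_ALIASES = {
    "192.0.2.17": "cr1.dfw",
    "cr1": "cr1.dfw",
    "cr1.dfw.example.net": "cr1.dfw",
}

def normalize(record):
    # 1. Normalize the timestamp: interpret it in its source time zone, store UTC.
    local = datetime.strptime(record["time"], "%Y-%m-%d %H:%M:%S")
    local = local.replace(tzinfo=ZoneInfo(record["tz"]))
    record["time_utc"] = local.astimezone(timezone.utc).isoformat()

    # 2. Normalize the device name to the canonical form.
    record["router"] = ROUTER_ALIASES.get(record["router"], record["router"])
    return record

syslog_event = {"router": "cr1", "time": "2009-03-01 02:15:09", "tz": "America/Chicago"}
snmp_sample = {"router": "192.0.2.17", "time": "2009-03-01 08:15:00", "tz": "UTC"}
print(normalize(syslog_event))
print(normalize(snmp_sample))
```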
So let’s now return to the challenge of uncovering root causes for a series of recurring events. We specify these events as a time series, referred to as a “symptom
time series". This time series is characterized by temporal and spatial information describing when and where an event symptom was observed, and by event-specific information. For example, a time series describing unicast end-to-end packet loss
measurements would have associated timing information (when each event occurred and how long it lasted), spatial information (the two end points between
which loss was observed) and event specific information (e.g., the magnitude of
each event – in this case, how much loss was observed). Note that given the infrastructure depicted in Figure 13, the time series is typically formed using a simple
database query.
The most likely root cause of each of our symptom events is then identified by
correlating each event with the set of diagnostic information (time series) available in our data infrastructure. Domain knowledge is used to specify which events
should be correlated and how. The analysis effectively then becomes an automation of what a person would have executed – taking each symptom event in turn
and applying potentially very complex rules to identify the most likely explanation
from within the mound of available data. Aggregate statistics can then be calculated across multiple events, to characterize the breakdown of the different root
causes. Appropriate remedies and actions can then be identified. Again, the data
infrastructure depicted in Figure 13 makes scaling this infrastructure far more
practical, as the root causes are typically identified from a wide range of different
data sources; a common data warehouse with normalized naming conventions
means that the analysis infrastructure does not need to be painfully aware of the
data origins and conventions.
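A stripped-down version of this attribution step is sketched below: each symptom event (here, an end-to-end loss event) is matched against candidate diagnostic events that fall within a time window and satisfy a spatial constraint (the diagnostic event must involve a router on the symptom's path), and the resulting root-cause labels are then aggregated. The matching rule, time window and event data are all invented for illustration.

```python
# Sketch: attribute each symptom event to the most likely root cause by joining
# it with diagnostic events within a time window and a spatial constraint
# (diagnostic router must lie on the symptom's path). Data and rules are invented.
from collections import Counter

WINDOW_S = 300  # only consider diagnostics within +/- 5 minutes of the symptom

symptoms = [   # end-to-end loss events: (epoch seconds, path routers, loss %)
    {"t": 1000, "path": ["E", "C", "D"], "loss": 2.0},
    {"t": 5000, "path": ["E", "F", "H"], "loss": 0.8},
    {"t": 9000, "path": ["E", "C", "D"], "loss": 1.1},
]

diagnostics = [  # candidate explanations drawn from other data sources
    {"t": 950,  "router": "C", "cause": "link failure"},
    {"t": 5100, "router": "F", "cause": "line card reset"},
]

def attribute(symptom):
    for diag in diagnostics:
        in_window = abs(diag["t"] - symptom["t"]) <= WINDOW_S
        on_path = diag["router"] in symptom["path"]     # spatial constraint
        if in_window and on_path:
            return diag["cause"]
    return "unexplained"

breakdown = Counter(attribute(s) for s in symptoms)
print(breakdown)   # Counter({'link failure': 1, 'line card reset': 1, 'unexplained': 1})
```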
However, it is often far from clear as to what all the potential causes or impacts
of a given set of symptom events are. Take, for example, anomalous router CPU
utilization events. Once CPU anomalies are defined (a challenge unto itself), we
are then faced with the question of what causes these anomalous events, and what
impacts – if any – do they have. Domain knowledge can be heavily drawn upon
here – network experts can often draw on both knowledge of how things should work and their experience in operating networks to create an initial
list. However, network operators will rapidly report that networks do not always
operate as expected nor as desired – routers are complicated beasts with bugs
which can result in behaviors that violate the fundamentals of “networking 101”
principles. Examination of real network data, and lots of it, is necessary to truly
understand what is occurring.
Domain knowledge can be successfully augmented through EDM, which can
be used to automatically identify these relationships – specifically, to learn about
the root causes and impacts associated with a given time series of interest. Such
analyses are instrumental in advancing the effectiveness of network analyses, both
for troubleshooting recurring conditions and even for detecting failure modes
which are flying under the radar. Although there are numerous data mining techniques available, the key is to identify relationships or correlations between different time series [12; 13] and across different spatial domains. We are interested in
identifying those time series which are statistically correlated with our symptom
time series; these are likely to be root causes or impacts of our time series of interest. However, given the enormous set of time series and potential correlations involved here, domain knowledge is generally necessary to guide the analysis –
where to look and under what spatial constraints to correlate events (e.g., testing
correlations of events on a common router, common path, common interface). If
looking for anomalous correlations, the real challenge is in defining what is normal / desired behavior and what is not (and thus what may be worth investigating).
Figure 14. Customer - ISP access
Let’s consider another hypothetical example as an application of exploratory
data mining. We focus here on the connectivity between customers and an ISP –
specifically, between a customer router (CR) and a provider edge router (PER).
The connectivity between these two routers is provided by metropolitan and access networks as illustrated in Figure 14 – which in turn may be made up of a variety of different layer one and layer two technologies (see Chapter 2). These metro/access networks will often provide protection against faults within the lower
layers – the goal being that issues at the lower layer will be recovered without the
higher layer (layer three) being impacted. Thus, a protection event at the lower
layer should not result in any impact (barring a few lost packets) at layers above.
So now let us consider a situation where there is an underlying issue such that
protection events at the lower layer are in fact causing IP links to flap (go down
and then come back up very shortly later) at the layer above. This breaks the principle of layered network restoration – protection events at the lower layer should
not cause links between routers to fail – even for a short time. The result is unnecessary customer impact – instead of seeing a few tens of milliseconds break in
connectivity as the protection switching is executed, the customer may now be
impacted for up to a couple of minutes, depending on the routing protocols used
over the interface. This example is further complicated by the fact that it is
over a trust boundary - specifically between a customer and the ISP. The ISP has
limited visibility - it is typically not informed of actions taken on the customer
end. A customer may, for example, reboot a router and would not inform the ISP
to which it is connected. This makes it very challenging to create reliable alarms on customer-facing interfaces on ISP routers – the majority of events observed may not be service provider issues at all, but simply a result of activities
within the customer’s domain. Extensive investigation of such events is simply
impractical, given the tremendous number of such events. Thus, the fact that there
is an IP link flap on this interface is not sufficient to initiate detailed investigation.
So there is a very real risk that, without a great deal of care, this particular failure mode may fly under the radar. This condition may be being experienced across
large numbers of interfaces – but may not be occurring on any one interface often
enough to cause detection as a chronic condition and thus cause an alert to be triggered and manual investigation. In fact, on customer interfaces, it is entirely possible that even chronic conditions are not investigated without an explicit customer complaint, as more often than not these are caused by customer activities.
Let's now consider how EDM might be used to detect this condition. Data mining can expose that the IP link flap events are often occurring at the same time as
protection events associated with the same link between the PER and the CR – more often than could be explained as purely coincidences. This would be revealed by a strong statistical correlation between IP layer flaps and protection events
on the corresponding lower layer -- a correlation that violates normal operating
behavior, and is indicative of erroneous network behavior. Note the need here to
bring in spatial constraints – we are explicitly interested in behavior happening
across layers for each individual connection between a PER and a given CR – it is
this correlation that is not expected and indicates undesirable network behavior. It
is actually entirely expected that there be correlations between protection switching events on some PER – CR links and flaps on other, different interfaces on the
router – this would be a result of a common lower layer event impacting both protected and unprotected lower layer circuits. This highlights the complexity here
and the need for detailed domain knowledge in analyzing correlation results – the
spatial models used for correlation analyses are critical.
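A bare-bones version of this kind of correlation test is sketched below: time is divided into bins, and for a given PER – CR link the IP flap rate in bins containing a lower-layer protection event is compared with the rate in all other bins. A large lift suggests the layered-restoration principle is being violated on that link; in practice a formal significance test (for example, Fisher's exact test on the corresponding 2x2 table) would be applied. The bin size and event times are assumptions.

```python
# Sketch: per-link correlation between lower-layer protection events and IP
# link flaps. Time is binned; we compare the flap rate in bins containing a
# protection event against the rate in bins without one. Data is synthetic.
BIN_S = 60  # 1-minute bins (illustrative)

def bins(timestamps):
    return {int(t // BIN_S) for t in timestamps}

def flap_lift(protection_times, flap_times, total_bins):
    prot = bins(protection_times)
    flap = bins(flap_times)
    both = len(prot & flap)
    rate_with = both / len(prot) if prot else 0.0
    flaps_elsewhere = len(flap - prot)
    rate_without = flaps_elsewhere / (total_bins - len(prot))
    return rate_with, rate_without

# Synthetic events on one PER-CR link over a day (86400 s -> 1440 bins):
protection = [3600, 18000, 40000, 71000]          # lower-layer protection switches
flaps = [3605, 18010, 40010, 55000, 71015]        # IP link flaps on the same link

rate_with, rate_without = flap_lift(protection, flaps, total_bins=1440)
print(f"flap rate in protection bins: {rate_with:.2f}, elsewhere: {rate_without:.4f}")
# All 4 protection bins also saw a flap, versus roughly 0.0007 elsewhere -- far
# more often than chance, i.e. the layered-restoration principle looks violated.
```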
Statistical correlation testing can also be used in troubleshooting this behavior,
once detected. For example, correlation testing can be used to identify how the
strong correlation between IP flaps and protection events varies across technologies – does it only exist for certain types of lower layer technologies or certain
types of router technologies (routers, line cards)? If the same behavior is observed
across lower layer technologies from multiple vendors, then it is unlikely to be a
result of erroneous behavior in the lower layer equipment (for example, longer
than expected protection outages). If, however, the issue correlates to only a single
type of router, then it is highly likely that the issue is a router bug and the interface
is not reacting as designed. Thus, analysis of what is common and not common
across the symptom observations can help in guiding troubleshooting. Targeted
lab testing and detailed software analysis can then proceed so that the underlying
cause of the issue can be identified and rectified, with the intent that the failure
mode be permanently driven out of the network.
In general, the opportunity for EDM is tremendous in large-scale IP/MPLS
networks. The immense scale, complexity of network technologies, tight interaction across networking layers, and the rapid evolution of network software mean
that we risk having critical issues flying under the radar, and that network issues
are extremely complex to troubleshoot, particularly at scale. Driving network performance to higher and higher levels will necessitate significant advances in applying data mining to the extremely diverse set of network. This area is ripe for
further innovation.
5. Planned Maintenance
The previous sections focused on reacting to events which occur in the network.
However, a large proportion of the events which occur in the network are actually
a result of planned activities. Managing a large scale IP/MPLS network indeed requires regular planned maintenance – the network is continuously evolving as
network elements are added, new functionality is introduced, and hardware and
software are upgraded. External events, such as major road works, can also impact
areas where fibers are laid, and thus necessitate network maintenance.
There are two primary requirements of planned maintenance: (1) successfully completing the work required and (2) minimizing customer impact. As such, planned
network maintenance is typically executed during hours when expected customer
impact and network load is lowest, which typically equates to overnight and early
hours in the morning (e.g., midnight to 6am). However, the extent to which customers
would be impacted by any given planned maintenance activity also depends on the
location of the resources being maintained, and what maintenance is being performed. Redundant capacity would be used to service traffic during planned
maintenance activities – where redundancy exists, such as in the core of an IP
network. However, as discussed in Section 2.5, such redundancy does not always
exist and thus some planned maintenance activities can result in a service interruption. This is most likely to occur at the edge of an IP/MPLS network, where cost
effective redundancy is simply not available within router technologies today.
5.1. Preparing for planned maintenance activities
The scale of a large ISP network and complexity of competing planned maintenance activities occurring across multiple network layers could be a recipe for disaster. Meticulous preparation is thus completed in advance of each and every
planned maintenance event, ensuring that activities do not clash, and that there are
sufficient network and human resources available to handle the scheduled work.
Planning for network activities involves careful scheduling, impact assessment,
coordination with customers and other network organizations, and identifying appropriate mechanisms for minimizing customer impact. We consider these in more
detail here.
Scheduling planned maintenance activities often requires juggling a range of
different resources, including Operations personnel and network capacity – across
different organizations and layers of the network. For example, in an IP/MPLS
network segment where protection is not available in the lower layer network
(e.g., transport network interconnecting routers), planned maintenance in the lower layer will impact the IP/MPLS network. IP/MPLS traffic will be re-routed, requiring spare IP network capacity. This same network capacity could also be required if IP router maintenance should occur simultaneously. This is illustrated in
Figure 15, where maintenance is required both on the link between routers C and
D (executed by layer-1 technicians) and on router H (executed by layer-3 technicians). Assuming simple IGP routing, much of the IP/MPLS traffic normally carried on the link between routers C and D would re-route over the path E – F – H when the link between C and D is taken down as part of the planned maintenance. But if router H is unavailable due to simultaneous planned maintenance, this path would not be available. Thus, both the traffic normally carried on the link between routers C and D and that carried via router H would be competing for the remaining network resources – potentially causing significant congestion and corresponding customer impact. Clearly, this is not an acceptable situation – the two maintenance activities need to be coordinated so that
they do not occur at the same time. Thus, careful scheduling of planned maintenance across network layers is essential – also necessitating careful processes to
communicate and coordinate maintenance activities across organizations managing the different network layers.
Detailed “what if” simulation tools that can emulate the planned maintenance
activities are key to ensuring that customer traffic will not be impacted. Such
tools are executed in advance of the planned activities, taking into account current
network conditions, traffic matrices and planned activities. In situations where the
planned maintenance activities would cause impact, the simulation tools are also
used to evaluate potential actions which can be taken to ensure survivability (e.g.,
tweaking network routing).
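As a rough illustration of what such a what-if evaluation involves, the sketch below (Python, using the networkx library) removes the elements scheduled for maintenance from a model of the topology, re-runs shortest-path (IGP-style) routing over a traffic matrix, and flags any link whose projected utilization exceeds a threshold. The topology, weights, capacities and demands are invented for illustration; production tools operate on live network state and much richer routing and failure models.

# Sketch: "what if" capacity check for a planned maintenance window.
import networkx as nx
from collections import defaultdict

def whatif_utilization(graph, traffic_demands, down_nodes=(), down_links=()):
    """Project per-link load after removing maintained elements and rerouting
    all demands over IGP-style (shortest-path by weight) routes."""
    g = graph.copy()
    g.remove_nodes_from(down_nodes)
    g.remove_edges_from(down_links)
    load = defaultdict(float)
    for (src, dst), demand in traffic_demands.items():
        path = nx.shortest_path(g, src, dst, weight="weight")
        for u, v in zip(path, path[1:]):
            load[tuple(sorted((u, v)))] += demand
    return {link: load[link] / g.edges[link]["capacity"] for link in load}

# Illustrative topology loosely following Figure 15 (weights/capacities invented).
g = nx.Graph()
links = [("C", "D"), ("C", "E"), ("E", "F"), ("F", "H"), ("H", "D")]
g.add_edges_from(links, weight=1, capacity=10.0)   # capacity in Gb/s (hypothetical)

demands = {("C", "D"): 6.0, ("E", "H"): 3.0}       # hypothetical traffic matrix

# Evaluate taking the C-D link out of service for maintenance.
for link, util in whatif_utilization(g, demands, down_links=[("C", "D")]).items():
    flag = "  <-- over threshold" if util > 0.8 else ""
    print(f"link {link}: projected utilization {util:.0%}{flag}")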
Figure 15. Competing planned maintenance activities on a network link (between routers C and D) and a router (router H)
If the planned maintenance is instead occurring at the provider edge router (the
router to which a customer connects), then it may be necessary to coordinate with
or at least communicate the planned maintenance to the customer. This isn’t typically done in consumer markets, but is a necessary step when serving enterprise
customers. This communication is typically done well in advance of the planned
activities – often many weeks. If the work needs to be repeated across many edge
routers, as may occur when upgrading router software network-wide, then human resources must also be scheduled to manage the work across the different
routers. This can become a relatively complex planning process, with numerous
other constraints.
Once tentative schedules are in place, extensive network analysis and simulation studies are executed in anticipation of the planned maintenance event, to ensure that the network has sufficient available resources to successfully absorb the
traffic impacted by the planned maintenance activities. Notifications are also sent
to impacted enterprise customers well in advance of planned maintenance events
that would directly impact the customer (e.g., those at the edge of the network
where redundancy may be lacking) so that customers can make appropriate arrangements.
5.2. Executing planned maintenance
Planning for a scheduled maintenance event may occur well in advance of the
scheduled event, ensuring adequate time for customers to react and capacity evaluations to be completed. Maintenance can proceed after a successful last-minute check of current network conditions. Beyond this, service providers go to great
lengths to carefully manage traffic in real time so as to further minimize customer
impact. In locations where redundancy exists, such as in the core of the IP/MPLS
network, gracefully removing traffic away from impacted network links in advance of the maintenance can result in significantly smaller (and potentially negligible) customer impact compared with having links simply fail whilst still carrying traffic. Forcing traffic off the links which are due to be impacted by planned
maintenance also eliminates potential traffic re-routes that would result should the
link flap excessively during the maintenance activities. Removing all traffic before the maintenance commences thus minimizes customer impact. How this is achieved depends on the protocols used for routing traffic. For example, if simple IGP routing alone is used, then traffic can be re-routed away from the links by increasing the weight of the links to a very high value. This act is known as costing out the link. Once the maintenance is completed, the link can be costed in (reducing the IGP weight back down to the normal value, thereby re-attracting traffic).
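To illustrate the mechanism, the toy model below (Python, networkx) assumes unit IGP weights on the topology of Figure 15 and shows how raising the weight of the C-D link to a very high value moves traffic onto the alternate path before the link is physically touched, and how restoring the weight re-attracts it. The weight values are purely illustrative; in practice the cost-out is applied through the routers' IGP configuration.

# Sketch: IGP "cost out" of a link by raising its weight to a very high value.
import networkx as nx

COST_OUT_WEIGHT = 65535   # illustrative "very high" IGP metric

g = nx.Graph()
g.add_edge("C", "D", weight=1)
g.add_edge("C", "E", weight=1)
g.add_edge("E", "F", weight=1)
g.add_edge("F", "H", weight=1)
g.add_edge("H", "D", weight=1)

def path_c_to_d():
    return nx.shortest_path(g, "C", "D", weight="weight")

print("before cost-out:", path_c_to_d())      # traffic uses the direct C-D link

# Cost out the C-D link ahead of maintenance: traffic drains to the alternate path.
g.edges["C", "D"]["weight"] = COST_OUT_WEIGHT
print("after cost-out: ", path_c_to_d())      # traffic now routes C-E-F-H-D

# Cost the link back in after maintenance completes, re-attracting traffic.
g.edges["C", "D"]["weight"] = 1
print("after cost-in:  ", path_c_to_d())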
Continual monitoring of network performance and network resources is also
critical during and after the planned maintenance procedure, to ensure that any unexpected conditions that arise are rapidly detected, isolated and repaired. Think
about taking your car to a mechanic – how often has a mechanic fixed one problem only to introduce another as part of their maintenance activities? Networks are
the same – human beings are prone to make mistakes, even when taking the utmost care. Thus, network operators are particularly vigilant after maintenance activities, and put significant process and automated auditing in place to ensure that any issues are rapidly captured and addressed. For example, in the previously discussed scenario where links are costed out before commencing maintenance activities, it is critical that monitoring and maintenance procedures ensure that these
network resources are successfully returned to normal network operation after
completion of the maintenance activities. Accidentally leaving resources out of
service can result in significant network vulnerabilities, such as having insufficient
network capacity to handle future network faults.
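A simple automated audit of this kind might compare post-maintenance link metrics against their intended baseline values and flag anything left costed out. The sketch below is a minimal illustration under assumed data structures; a real audit would pull current IGP metrics from the routers or from a route monitor rather than from hard-coded dictionaries.

# Sketch: post-maintenance audit that flags links left costed out.
COST_OUT_WEIGHT = 65535   # illustrative "costed out" IGP metric

# Hypothetical snapshots: intended (baseline) vs. currently configured IGP weights.
baseline_weights = {("C", "D"): 1, ("C", "E"): 1, ("E", "F"): 1, ("F", "H"): 1}
current_weights  = {("C", "D"): COST_OUT_WEIGHT, ("C", "E"): 1, ("E", "F"): 1, ("F", "H"): 1}

def audit_link_weights(baseline, current):
    """Return links whose current weight does not match the baseline."""
    return [(link, baseline[link], current.get(link))
            for link in baseline
            if current.get(link) != baseline[link]]

for link, expected, actual in audit_link_weights(baseline_weights, current_weights):
    print(f"ALERT: link {link} weight is {actual}, expected {expected} "
          f"- was it left costed out after maintenance?")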
6. The Importance of Continued Innovation
Over the past ten years or so, we have witnessed tremendous improvements in
network reliability and performance. This has been achieved through massive investments across the board, including in router technologies, support systems for
detecting and alarming on events, process automation and procedures for planned
and unplanned maintenance. However, there still remains room for improvement
and considerable opportunities for innovation. The already vast and yet rapidly
expanding scale of the Internet will continue to pose increasing challenges, as will
the increasingly diverse application set and the often correspondingly stringent application requirements. The network reliability and performance requirements imposed
are likely to become increasingly stringent as IP continues to become the single
technology of choice for such a plethora of applications and services. We herein
outline a few directions in which advances in the state of the art promise to produce great operational benefits. We hope to use this as an opportunity to stimulate
greater attention and innovation in these areas.
The first area that we focus on is continued improvements in router reliability.
This was also discussed in Chapter [Yakov], but we include it here as well in view
of the tremendous impact that enhancements would have on both the performance
and the ease of operating a large scale network. The edge (PE) routers in an ISP
network are today a single point of failure (see Figure 8) – if either the router or
the line cards facing the customers are out of service, then customer service is impacted. And routers have not as yet attained the same level of reliability as other
more mature technologies (e.g., voice switches, transport equipment). Although
service providers have invested heavily in ensuring that this lack of network element reliability has minimal impact in the core – through the introduction of fully redundant hardware – such a solution does not scale at the edge of the network.
One case in point is router software upgrades, which today typically introduce
router downtime as the routers are rebooted. In the case of edge routers and singly
connected customers, this becomes a lengthy yet (currently) unavoidable service
outage. As discussed in Section 5, not only does this cause interruptions in customer service, but it also adds considerable operational overhead, as maintenance activities need to be coordinated and communicated with customers well in advance of the planned activities. Truly hitless upgrades – being able to update the software on routers without impacting router operation – are thus critical to dramatically improving availability and simplifying network operations. Similarly,
ISPs are experiencing a dire need for cost-effective line card protection. Consider
the scenario in which a customer connects to a provider edge router using only a
single circuit (see Figure 8). Without any line card redundancy, a failure of the PE
line card facing the customer would result in the customer being out of service until the issue can be repaired. The most significant consequence is obviously an unacceptable customer outage. But in addition, if this is a
hardware issue requiring intervention at the router in question, a service technician
must either be available on site or be rapidly deployed to the fault location –
very costly alternatives. However, if the customer using the failed resource can be
recovered using redundant hardware without physical human intervention at the
fault location, then the replacement of the failed hardware can be planned and executed in a timely yet more convenient fashion. The work could even be
scheduled along with other service requests at the same location. Thus, automated
(or remotely initiated) failure recovery can both significantly improve customer
performance (through fast failure recovery) and reduce the network operational
overheads. Router technology today supports dedicated protection (1:1) – where
for each working interface, there is a dedicated backup interface. However, this
approach is very expensive, essentially doubling the line card costs (which are
typically one of the most expensive components). There has been research demonstrating the feasibility of more cost-effective, shared protection [16; 17] –
implementing such capabilities in routers would result in significant reliability and
operational enhancements. We look forward to advances in this area leading to
fully integrated cost-effective fast-failure-protection capability on next generation
routers.
As demonstrated in earlier sections within this chapter, service providers have
typically mastered the detection, troubleshooting and repair of commonly occurring faults. For events which are regularly observed – say, fiber cuts or hardware failures – troubleshooting is typically a rapid process, aided by process automation capabilities such as those discussed in Section 3. However, the same
cannot be said for dealing with the more esoteric faults and performance issues –
those for which the root cause is far from apparent. Tools for effectively and rapidly aiding in troubleshooting such issues are sorely lacking; significant innovation is well overdue here. This is primarily because it is such a fundamentally
challenging problem and one best understood by the small teams of highly skilled
engineers to which such issues are escalated. The environment that these teams
work in is extremely demanding – many of the more esoteric scenarios investigated are vastly different from previous observations, and may (at least initially) appear to defy all logic and understanding. The network elements involve massive
amounts of complex software and hardware, complicated interactions exist across
network technologies and layers and across network boundaries (customers,
peers), and tremendous quantities of network data are spawned by the network elements and monitoring systems – and yet the key information needed to understand a very specific issue may not be something readily observed in the information regularly collected. Operations personnel, while under tremendous
pressure, have to sift through immense quantities of data from diverse tools, collect additional information, and reason and theorize about potential root causes. Tools
which effectively aid in this process would have great operational value.
We also highlight exciting opportunities for exploratory data mining in network
performance management. Section 4 discussed some of the activities which are
carried out today, including some more recent research pursuits. There is a vast
amount of interesting data collected on a regular basis from large operational networks, and correspondingly interesting opportunities for detecting and analyzing
issues which risk “flying under the radar”. The biggest challenges are probably
the sheer volume of data, and the complexity of the systems being analyzed. How
can we most effectively identify relevant and critical issues amongst the millions
(or more) of potential time series we could create from the network data? Our experience [14; 15; 11] indicates that pairing advanced data analytic techniques with strong networking domain knowledge and experience is a powerful combination for effectively driving continued network performance improvement.
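As a tiny example of the kind of analysis involved, the sketch below scans a collection of KPI time series and flags points that deviate strongly from a robust, median-based baseline. This is a deliberately simple stand-in for the far more sophisticated techniques referenced above; the series names, values and threshold are hypothetical.

# Sketch: flag large deviations across many KPI time series using a robust z-score.
import numpy as np

def robust_anomalies(series, threshold=5.0):
    """Return indices where the series deviates from its median by more than
    `threshold` times the median absolute deviation (MAD)."""
    x = np.asarray(series, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median)) or 1e-9   # guard against zero MAD
    scores = 0.6745 * (x - median) / mad          # approximate z-score
    return np.where(np.abs(scores) > threshold)[0]

# Hypothetical per-interface KPI series (e.g., packet loss counts per interval).
kpis = {
    "router_A:ge-0/0/1:loss": [0, 1, 0, 2, 1, 0, 1, 0, 40, 1, 0],
    "router_B:ge-0/0/3:loss": [3, 1, 2, 4, 2, 3, 1, 2, 3, 4, 2],
}

for name, series in kpis.items():
    hits = robust_anomalies(series)
    if len(hits):
        print(f"{name}: anomalous intervals at indices {hits.tolist()}")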
As a final topic, we consider process automation – specifically, how far can we
and should we proceed with automating actions taken by Operations teams? As
highlighted in Section 3, process automation is already an integral part of at least
some large ISP network operations. The ultimate goal would be to fully automate common fault
and performance management actions, closing the control loop of issue detection,
isolation, mitigation strategy evaluation, and actuation of the devised responses in
the network (e.g., re-routing traffic, repairing databases, fixing router configuration
files). We refer to this as adaptive maintenance. Consider, by way of example,
a situation in which Operations have repeatedly observed that a software reload of
a line card can restore service under certain failure conditions – Operations have
adopted this strategy to restore service while the root cause is investigated. Following the adaptive maintenance concept, a network management
system could automatically detect the signature for this failure mode, emulate the
validation process that a human would follow before taking action (e.g., determine
whether the problem is already being worked under an existing active ticket and whether the
symptoms are consistent with the proposed response), launch the reload command,
verify whether it achieved the expected result and finally report the incident for
further investigation.
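A closed loop of this kind could be structured roughly as follows. The sketch below is illustrative Python only: every helper (the signature match, the ticket lookup, the reload command and the post-check) is a hypothetical placeholder standing in for integrations with real fault management, ticketing and element management systems.

# Sketch: adaptive-maintenance control loop for a known "line card reload" signature.
# The stubs below are hypothetical placeholders for real system integrations.

class TicketSystemStub:
    def has_active_ticket(self, device): return False
    def open_ticket(self, event, note, priority="normal"):
        print(f"ticket opened for {event['device']}: {note} (priority {priority})")

class DeviceApiStub:
    def reload_line_card(self, device, slot): print(f"reloading {device} slot {slot}")
    def service_restored(self, device, slot): return True

def safe_to_act(event, tickets):
    """Emulate the validation a human would perform before taking action."""
    if tickets.has_active_ticket(event["device"]):
        return False                      # already being worked - do not interfere
    return event.get("symptoms_consistent", False)

def adaptive_maintenance(event, tickets, devices):
    if event.get("signature") != "linecard_wedged":          # known failure signature?
        return "escalate_to_operations"
    if not safe_to_act(event, tickets):
        return "defer_to_existing_ticket"
    devices.reload_line_card(event["device"], event["slot"]) # the devised response
    if devices.service_restored(event["device"], event["slot"]):
        tickets.open_ticket(event, note="auto-reload restored service; root cause TBD")
        return "restored_and_reported"
    tickets.open_ticket(event, note="auto-reload failed", priority="high")
    return "escalate_to_operations"

event = {"device": "er1.nyc", "slot": 3, "signature": "linecard_wedged",
         "symptoms_consistent": True}
print(adaptive_maintenance(event, TicketSystemStub(), DeviceApiStub()))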
While adaptive maintenance promises to greatly reduce recovery times and can
eliminate human errors that are inevitable in manual operations, a flawed adaptive
maintenance capability can create damage at a scale and speed which is unlikely
to be matched by humans. It is thus crucial to carefully design and implement
such an adaptive maintenance system, and thoroughly and systematically verify it
against all situations that could arise within the network, a non-trivial task to say
the least. That being said, the overall result has the potential to dramatically improve network reliability and performance. Identifying promising new opportunities for automated recovery and repair of both networks and supporting systems
(e.g., databases, configuration) is an area wide open for innovation.
7. Conclusions
We conclude with some final thoughts on “best practice” principles for both
fault and performance management, and planned maintenance.
Text Box: Fault and Performance Management “Best Practice” Principles:
• Plan carefully as new technologies are to be deployed – don’t make network measurement, fault and performance management an afterthought!
• Detection: deploy an extensive and diverse monitoring infrastructure that will enable rapid detection and troubleshooting of network events
• Scalable correlation to rapidly drill down to root cause: automation is required so that faults can be repaired rapidly
• Fault management systems: ensure that they are reliable and easily extensible
• Process automation: automate commonly executed tasks – but be careful
• Ensure cost-effective (shared) redundancy is available where possible – it can (1) improve customer performance through rapid failure recovery and (2) significantly reduce Operations overheads by eliminating the need for express technical callouts, replacing individual callouts with planned maintenance events that can deal with multiple tasks in the one callout
• Execute continual ongoing analysis, looking for new signatures and improving existing signatures for detecting network issues
• Trending: be on top of key performance indicators – both those indicative of customer impact (loss, delay) and those associated with network health (CPU, memory)
• Create a scalable data infrastructure for data mining – avoid silos and create infrastructure for rapidly obtaining data for both rapid troubleshooting and scaling troubleshooting efforts
• Ensure that adequate tools are available for troubleshooting the wide range of different network events which occur
Text Box: Planned Maintenance “Best Practice” Principles:
• Ensure that human beings “think twice” as they are touching the network (in a bid to minimize mistakes)
• Plan and schedule maintenance activities carefully to minimize customer impact
• Ensure careful scheduling and coordination of maintenance events, including across network layers where necessary
• Ensure careful coordination and validation of planned activities to ensure that there is sufficient spare network capacity to absorb load during backbone network maintenance
• Validate network state before executing planned maintenance, to ensure that maintenance is not executed when the network is already impaired
• Minimize customer impact by taking routers and network resources “gracefully” out of service where possible (e.g., in the network backbone)
• Suppress alarms during planned maintenance to avoid Operations chasing events which they are inducing!
• Carefully monitor performance during and after maintenance, ensuring that all resources are successfully returned to operation
• Validate planned maintenance actions after completion (e.g., verify router configuration files) to ensure that the correct actions were taken and that configuration errors or bad conditions were not introduced during the activities
• Ensure mechanisms are available to rapidly back out of maintenance activities in case issues are encountered
8. Acknowledgments
The authors would like to thank the AT&T Operations teams for invaluable collaborations with us, their Research partners, over the years. In particular, we
would like to thank Bobbi Bailey, Heather Robinett and Joanne Emmons (AT&T)
for detailed discussions related to this chapter and beyond. And finally, we would
like to acknowledge Stuart Mackie from EMC, for discussions regarding alarm
correlation.
9. References
1. C. Lonvick. The BSD Syslog Protocol. IETF, 2009. RFC 5424.
2. P. Della, C. Elliott, R. Pavone, K. Phelps and J. Thompson. Performance and Fault Management. Cisco Press, 2000.
3. A. Shaikh and A. Greenberg. OSPF Monitoring: Architecture, Design and Deployment Experience. Proc. USENIX Symposium on Networked Systems Design and Implementation (NSDI), March 2004.
4. D. Mauro and K. Schmidt. Essential SNMP. O'Reilly, 2005.
5. S. Kliger, S. Yemini, Y. Yemini, D. Ohsie and S. Stolfo. A Coding Approach to Event Correlation. Proceedings of the Fourth International Symposium on Integrated Network Management, pp. 266-277, 1995.
6. S. Yemini, S. Kliger, E. Mozes, Y. Yemini and D. Ohsie. High Speed and Robust Event Correlation. IEEE Communications Magazine, pp. 82-90, May 1996.
7. G. Ramachandran, L. Ciavattone and A. Morton. Standardized Active Measurements on a Tier 1 IP Backbone. IEEE Communications Magazine, Vol. 41, No. 6, June 2003.
8. P. Barford, J. Kline, D. Plonka and A. Ron. A Signal Analysis of Network Traffic. Internet Measurement Workshop, 2002.
9. Y. Huang, N. Feamster, A. Lakhina and J. J. Xu. Diagnosing Network Disruptions with Network-wide Analysis. SIGMETRICS, 2007.
10. A. Lakhina, M. Crovella and C. Diot. Mining Anomalies Using Traffic Feature Distributions. SIGCOMM, 2005.
11. Y. Zhang, Z. Ge, A. Greenberg and M. Roughan. Network Anomography. Internet Measurement Workshop, 2005.
12. S. Venkataraman, Caballero, D. Song, A. Blum and J. Yates. Black Box Anomaly Detection: Is It Utopian? 5th Workshop on Hot Topics in Networking (HotNets), 2006.
13. L. Golab, T. Johnson and V. Shkapenyuk. Scheduling Updates in a Real-time Stream Warehouse. International Conference on Data Engineering (ICDE), 2009.
14. L. Golab, T. Johnson, J. Seidel and V. Shkapenyuk. Stream Warehousing with DataDepot. SIGMOD, 2009.
15. A. Mahimkar, J. Yates, Y. Zhang, A. Shaikh, J. Wang, Z. Ge and C. Ee. Troubleshooting Chronic Conditions in Large IP Networks. 4th ACM International Conference on Emerging Network Experiments and Technologies (CoNEXT), 2008.
16. A. Mahimkar, Z. Ge, A. Shaikh, J. Wang, J. Yates, Y. Zhang and Q. Zhao. Towards Automated Performance Diagnosis in a Large IPTV Network. SIGCOMM, 2009.
17. P. Sebos, J. Yates, G. Li, D. Rubenstein and M. Lazer. An Integrated IP/Optical Approach for Efficient Access Router Failure Recovery. IEEE/LEOS Optical Fiber Communications Conference, 2004.
18. P. Sebos, J. Yates, G. Li, D. Wang, A. Greenberg, M. Lazer, C. Kalmanek and D. Rubenstein. Ultra-fast IP Link and Interface Provisioning with Applications to IP Restoration. IEEE/LEOS Optical Fiber Communications Conference, 2003.
19. O. Maggiora, C. Elliott, R. Pavone, K. Phelps and J. Thompson. Performance and Fault Management. Cisco Press, 2000.