Timely & accurate IP network diagnostics
By Huang Bin
IP network troubleshooting is typically an uphill climb. Fault location is hindered by
an excess or absence of alarms, while fault determination is basically a tedious
process of elimination. Both are slow and inefficient; a solution would be welcome.
Alarm excess
The architecture of an IP network is highly stratified. A virtual leased line (VLL)
service must traverse multiple layers of processing, including the physical layer, link
layer, routing protocol, MPLS, and VLL. If a physical fiber breaks, the physical layer,
link layer, IP transport layer, and VLL will all be affected. Each will send out a large
number of TRAP messages.
Correlation between protocols is also complex. A fiber break will generally cause the
routing protocol to reconverge, which in turn changes multiprotocol label switching
(MPLS) and Label Distribution Protocol (LDP) state and triggers a further plethora of
TRAP messages.
Alarm absence
The absence of an alarm, however, is much more complex. A fault can be defined as
the failure of a network element to perform as expected, but what if there are no
expectations? IP architecture, thanks to its fluid nature, makes expectations nearly
impossible to systematically define.
The control plane determines the path from the source to the destination. On a
traditional circuit-switched (CS) network, the administrator configures the active and
standby paths. For each packet, its next hop can be clearly expected, along either the
active or standby path. With IP networking, the routing protocol selects the path; the
router 'knows' only the next hop, not the expected service path.
Therefore, when a breakdown causes route convergence or a path computation error
results in divergence, the router fails to generate an alarm.
Huawei once encountered an NGN voice service failure that lasted for more than 40
minutes, yet the IP bearer network generated no alarm. The culprit was found to be an
error in label-switched path (LSP) computation, which produced a mismatch between
the computed LSP and the Intermediate System to Intermediate System (IS-IS) result.
No alarm occurred because the protocol that established the LSP did not know the
expected path.
In the forwarding plane, IP networking is asynchronous, so the forwarding mechanism
cannot set clear expectations either. A packet from router X may be destined for
router Y, but router Y is not aware that the packet is coming and therefore will not
generate an alarm if it fails to arrive.
The most common fault of this kind involves degradation between routers, resulting
in packet loss. Without an alarm, such a fault may not be noticed or located for quite
some time. Engineers will have to check each router along the path, without any hint
as to where to begin.
Root alarm identification
When a flood of alarms comes in, the root alarm must be determined. According to
Huawei statistics, most IP network faults stem from hardware and link degradation.
Long-distance links are particularly sensitive to their surrounding environment;
breakdowns occur from time to time. This generates a large number of alarms and
causes interior gateway protocol (IGP) convergence, which leads to an increase in the
number of alarms as IGP alarms trigger LSP alarms. In other words, a link alarm can
bring about a multitude of protocol alarms.
For this issue, Huawei proposes a two-pronged approach. First, alarms must be
classified (environment, hardware, software, interface, link, protocol, or service), with
environment and hardware alarms having priority. When a higher-level alarm is
resolved, the correlated protocol alarms should disappear automatically. This
approach is simple and practical, and should produce the desired result in a short time.
Second, vendors should establish an alarm correlation system based on protocol and
service. Correlated alarms would be displayed under the root alarm, so the
administrator need only deal with the root alarm directly. This approach is
certainly more complete, but it is arduous and time-consuming to establish.
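
The two approaches above can be illustrated with a minimal Python sketch. It is not
Huawei's implementation. The category names come from the list above, but the ordering
of the categories after environment and hardware, the Alarm fields, and the rule of
grouping alarms by a shared resource are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Categories named in the article; the article only states that environment
# and hardware come first, so the order of the rest is an assumption.
CATEGORY_PRIORITY = ["environment", "hardware", "software", "interface",
                     "link", "protocol", "service"]


@dataclass
class Alarm:
    alarm_id: str
    category: str                      # one of CATEGORY_PRIORITY
    resource: str                      # hypothetical: affected link or device
    children: list = field(default_factory=list)


def sort_by_priority(alarms):
    """Approach 1: order alarms so higher-level categories are handled first."""
    return sorted(alarms, key=lambda a: CATEGORY_PRIORITY.index(a.category))


def group_under_root(alarms):
    """Approach 2 (simplified): per resource, treat the highest-priority alarm
    as the root and attach the remaining alarms beneath it."""
    roots = {}
    for alarm in sort_by_priority(alarms):
        root = roots.get(alarm.resource)
        if root is None:
            roots[alarm.resource] = alarm
        else:
            root.children.append(alarm)
    return list(roots.values())
```

A real correlation system would derive the parent/child relationship from protocol and
service dependencies rather than from a shared resource label, which is why the second
approach takes far longer to build.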
Path expectation & detection
When a fault fails to trigger an alarm, the current status must be compared against
expectations, from both the control plane and forwarding plane perspectives.
Although the IP control plane uses dynamic protocols, it is still based on the physical
topology and relies on the shortest path first (SPF) algorithm. The simpler
a network is, the clearer the path expectation will be. Small and medium-sized
metropolitan area networks (MANs) typically have fewer layers, while active/standby
links are usually adopted between layers for protection purposes. For such a network,
faults can be effectively handled as long as network topology diagrams are accurate.
For a large and complex network, the service path is extremely hard to identify from
the distribution of physical links. Network simulation can help calculate the expected
path once network configurations and topologies are imported into the simulation
software.
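
As a rough illustration of how an expected path can be calculated from an imported
topology, the sketch below runs a shortest path first computation (Dijkstra's
algorithm) over a dictionary of link metrics. The function name and the dictionary
layout are hypothetical stand-ins for a link-state database, not part of any
simulation product.

```python
import heapq


def spf_path(links, source, destination):
    """Return (cost, path) for the lowest-cost path, or None if unreachable.

    links: dict mapping node -> {neighbor: metric}; a simplified, hypothetical
    representation of the imported topology and its configured metrics.
    """
    queue = [(0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == destination:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, metric in links.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + metric, neighbor, path + [neighbor]))
    return None


topology = {"A": {"B": 10, "C": 5}, "B": {"D": 10}, "C": {"D": 10}, "D": {}}
expected = spf_path(topology, "A", "D")
```

Re-running the computation with a link removed shows which path the service should
take after convergence, which is the expectation that later detection is compared
against.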
After a path expectation is determined, OSS software regularly obtains path status for
comparison against it; any mismatch triggers an alarm and prompts the administrator.
Tracert can be used for small and medium-sized networks with simple architecture,
but for large and complex networks where equal-cost multipath (ECMP) routing occurs,
Tracert should be combined with forwarding table queries to assess the service path.
The assessment can also be done by analyzing flooded IGP packets and calculating the
forwarding path from the routing algorithm and configuration.
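
The comparison step might look like the following sketch, assuming the observed hops
have already been parsed from Tracert output or forwarding table queries (parsing is
out of scope here). The function name and return format are hypothetical. Allowing
several expected paths is one simple way to account for ECMP.

```python
def check_path(expected_paths, observed_hops):
    """Compare an observed hop list against the set of acceptable paths.

    expected_paths: iterable of hop lists; more than one entry models ECMP.
    observed_hops: hop list obtained from the network by some other means.
    Returns None if the path is expected, otherwise a short alarm string.
    """
    expected = {tuple(p) for p in expected_paths}
    observed = tuple(observed_hops)
    if observed in expected:
        return None
    # Report the first hop at which the observed path leaves every expected path.
    for i, hop in enumerate(observed):
        prefix = observed[:i + 1]
        if not any(p[:i + 1] == prefix for p in expected):
            return f"path mismatch at hop {i + 1}: unexpected {hop}"
    return "path mismatch: observed path ends before reaching the destination"


ecmp_paths = [["R1", "R2", "R4"], ["R1", "R3", "R4"]]
```

Reporting the first diverging hop gives the engineer a starting point, instead of
checking every router along the path.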
Forwarding expectation & detection
In the forwarding plane, expectation is closely related to detection, which can be done
through non-service-aware OAM, service-aware OAM, or service quality monitoring.
Non-service-aware OAM – This involves OAM-detection packet injection into the
network so that expectations are predefined. The recipient therefore already has
details concerning the detection packets, including their size and interval. When a
received packet defies expectations, it qualifies as a fault.
This method is easy to deploy, as each network layer has an OAM protocol, such as
Bidirectional Forwarding Detection (BFD), EthOAM, Internet Control Message
Protocol (ICMP) ping, or MPLS OAM. However, mere OAM packets cannot fully
illustrate the service situation, as they may not reflect certain service failures, at least
not immediately.
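
A minimal sketch of the receiving side of such a probe-based check is shown below. It
assumes a BFD-style rule in which a fault is declared when no probe arrives within a
multiple of the agreed interval. The class and parameter names are hypothetical, and
timestamps are passed in explicitly to keep the example deterministic.

```python
class ProbeMonitor:
    """Declares a fault when expected detection packets stop arriving.

    interval_ms and detect_mult are hypothetical parameter names; the rule
    "no probe within detect_mult * interval" is an assumption modeled on how
    BFD detection time is usually described.
    """

    def __init__(self, interval_ms, detect_mult=3):
        self.timeout_ms = interval_ms * detect_mult
        self.last_seen_ms = None

    def on_probe(self, now_ms):
        """Record the arrival time of a detection packet."""
        self.last_seen_ms = now_ms

    def is_faulty(self, now_ms):
        if self.last_seen_ms is None:
            return False  # no probe seen yet, so there is no expectation
        return (now_ms - self.last_seen_ms) > self.timeout_ms
```

Because the receiver knows the interval in advance, silence itself becomes a signal,
which is exactly what ordinary IP forwarding lacks.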
Service-aware OAM – This method directly measures the service stream. A typical
example would be the loss measurement function defined in the ITU-T Y.1731
standard. Simply put, it is a conservation principle for packets: the number of
packets received should equal the number sent. In terms of implementation, the sender and
recipient both tally the service packets. The sender regularly sends the count to the
recipient for double-checking; if their tallies don't match, a fault is declared.
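
The tally comparison can be sketched as follows. The function names are hypothetical,
and the sketch assumes both ends take counter snapshots at the same two points in the
service stream, with no counter wrap-around. Using differences between snapshots,
rather than absolute counts, means the two counters need not start from the same value.

```python
def frame_loss(tx_prev, tx_now, rx_prev, rx_now):
    """Frames lost in one measurement period, from counter snapshots.

    tx_*: sender's count of service frames transmitted.
    rx_*: recipient's count of service frames received.
    """
    sent = tx_now - tx_prev
    received = rx_now - rx_prev
    return sent - received


def loss_ratio(tx_prev, tx_now, rx_prev, rx_now):
    """Fraction of frames lost in the period; 0.0 if nothing was sent."""
    sent = tx_now - tx_prev
    if sent == 0:
        return 0.0
    return frame_loss(tx_prev, tx_now, rx_prev, rx_now) / sent
```

A non-zero result for a period indicates that the conservation principle was violated
somewhere between the two measurement points.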
Service quality monitoring – With this method, service data is measured and
compared with predefined thresholds. For an IPTV service, for example, dedicated
hardware is connected to the device port to measure IPTV traffic directly across the
network; for VoIP, the indicators might include the mean opinion score (MOS). This
method best reflects real-world conditions, but the deployment and maintenance of
dedicated equipment is expensive. It also requires deep packet inspection (DPI), as
actual packets must be sampled or analyzed against predefined expectations.
The three methods can coexist; whether to combine them depends on the operator's
service-level agreement (SLA) goals and its procurement and maintenance budget.
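
The threshold comparison behind service quality monitoring can be sketched as below.
The indicator names and threshold values are purely illustrative assumptions. Real
limits would come from the operator's SLA.

```python
def check_service_quality(measured, thresholds):
    """Return the names of indicators that violate their threshold.

    thresholds maps indicator -> (kind, limit), where kind is "min" (value
    must not fall below limit) or "max" (value must not exceed limit).
    """
    violations = []
    for name, (kind, limit) in thresholds.items():
        value = measured.get(name)
        if value is None:
            continue  # indicator not measured in this sample
        if (kind == "min" and value < limit) or (kind == "max" and value > limit):
            violations.append(name)
    return violations


# Hypothetical thresholds for a VoIP service.
voip_thresholds = {"mos": ("min", 3.5), "packet_loss_pct": ("max", 1.0)}
```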
Furthermore, the control plane interrelates with the forwarding plane, as the operation
of the former directly affects traffic distribution in the latter. Any problems with the
control plane may lead to traffic congestion and device/link faults. Huawei, by
integrating expectation determination and status detection for the control and
forwarding planes, has developed a visualized IP network O&M solution
("path+traffic") which provides all-around fault monitoring and location capabilities.
China Mobile, for example, has been able to slash its average MAN fault ticket tally
from 500 per day to 10, thanks to Huawei's ability to reduce the number of false
alarms. This solution has also helped operators better visualize their IP network O&M.
Usually, it takes operators hours or even days to troubleshoot the most common faults,
such as link error, link interruption, component failure, and route error. With Huawei
"path+traffic," such common faults can be rectified within minutes, according to
internal testing.
Currently, this solution operates on several commercial networks on a pilot basis.
Results thus far have been significant: the maintenance process has been simplified
and the troubleshooting process accelerated.