MPLS-TE IN NETWORKS WITH VARYING CAPACITY
CONSTRAINTS
Authors: Kyriaki Levanti (CMU), Vijay Gopalakrishnan (AT&T Labs – Research), Hyong S. Kim (CMU),
Seungjoon Lee (AT&T Labs – Research), Aman Shaikh (AT&T Labs – Research)
1.1 Motivation
Multiprotocol Label Switching (MPLS) is a widely deployed forwarding mechanism in backbone networks.
Unlike popular intradomain routing protocols such as OSPF and IS-IS, which perform shortest-path routing
given a single cost metric (e.g., a cost derived from link capacity), MPLS also allows for fine-grained traffic-engineering (TE)
and fast-reroute (FRR) during failures. Further, constraint-based routing (CB-R) allows the selection of
paths other than the shortest paths chosen by OSPF/ISIS, and tunnel priorities allow the differentiated
treatment of traffic. These two capabilities enable important network objectives such as load-balancing
and Quality-of-Service (QoS) for specific flows.
The ability to balance load and provide QoS is becoming increasingly important in networks. As traffic
demands increase, network providers can no longer afford to overprovision network resources. Network
providers need to efficiently use their network resources by balancing traffic across their links. Similarly,
as traffic demands increase and the Internet is required to support many different types of services, it is
essential for network providers to be able to offer service-specific QoS guarantees to different traffic
flows.
Network providers are currently using MPLS in the following ways: 1) QoS service to customer networks:
customer networks specify their requirements in terms of the volume of traffic to be forwarded,
destination points, and quality of service, and provider networks set up and manage the tunnels that will
meet the customer’s needs. 2) Pinning down important or large traffic flows: network providers set up
static tunnels between Points-of-Presence (PoPs) in their network in order to closely monitor the traffic
that flows between these PoPs. Operators gain traffic awareness once they pin down the traffic on
specific known paths. In network operations, traffic awareness translates to simple and effective traffic
management. 3) Creating an iBGP-routing-free core backbone network: operators create a full mesh of
tunnels between the network’s PoPs. The paths of the tunnels are either statically defined in
configuration or calculated by the intradomain routing protocol. In this architecture, the core backbone
routers only perform forwarding and do not need to run internal BGP (iBGP). We note that iBGP
distributes the routes learned through external BGP to the network’s backbone routers. The iBGP-routing-free core backbone architecture results in less processing overhead on the network’s core routers.
Additionally, this architecture allows for path protection through MPLS fast-reroute.
Although the full-mesh tunnel design described above offers considerable benefits, it does not by itself
achieve load-balancing, one of the key motivating factors for deploying MPLS. MPLS-TE can achieve
network-wide load-balancing when the tunnel configurations include bandwidth requirements. If the
per-tunnel bandwidth reservations are statically defined, however, there is a disconnect between the
routing design and the network traffic demands. For the routing design to follow fluctuating traffic
demands, operators need to regularly monitor the demands and manually reconfigure the bandwidth of
the tunnels as needed.
Network-wide load-balancing is also feasible without manual intervention through a self-adjusting tunnel
design: the monitoring of a tunnel’s traffic and the adjustment of its bandwidth reservation can be
automated to track changes in the per-tunnel traffic volumes. In particular, the auto-bandwidth option
automatically adjusts a tunnel’s bandwidth reservation according to its incoming traffic rate. Upon a
bandwidth adjustment, a tunnel may stay on the same path or it may change to a new path that better
accommodates the adjusted bandwidth requirement. Auto-bandwidth is one of the few router mechanisms
that dynamically changes the network configuration without human intervention, and it is implemented by
multiple router vendors.
Dynamic traffic-engineering refers to the traffic-engineering practices that adjust to changes in the
network’s traffic demands. Dynamic traffic-engineering, as implemented by auto-bandwidth, reduces the
management overhead of intradomain routing in backbone networks but it also includes challenges and
risks. First of all, the automated adjustment of tunnel bandwidth can increase the number of tunnel path
changes in the network. Consequently, it can reduce the tunnel path visibility. Tunnel path visibility
refers to the operators’ awareness of the paths that the tunnels take in the network at all times.
Additionally, when the routing depends on external factors such as the traffic volumes forwarded by
neighboring networks, then the risk of network-wide routing instabilities increases. A single tunnel with
highly variable bandwidth requirements can potentially trigger multiple tunnel path changes. A tunnel
path change follows the make-before-break paradigm and does not cause packet loss. However, the path
change can cause packet reordering, reduce the application throughput, and deteriorate the data plane
performance. Also, tunnel path changes can cause tunnel preemptions. Tunnel preemption refers to tearing
down one or more low-priority tunnels and using the previously reserved bandwidth to set up a high-priority tunnel. Tunnel preemption does not follow the make-before-break paradigm, and therefore, it
intermittently disrupts low-priority traffic. Overall, it is important to investigate the challenges and risks
involved with dynamic traffic-engineering in order to ensure that it is both an effective as well as a safe
traffic-engineering solution.
In this chapter, we propose the tunnel visibility system, a system that calculates the network’s tunnel
paths in an offline manner, even when the network deploys dynamic traffic-engineering. The system uses
as input the network’s configuration and traffic demands. Then, we apply the tunnel visibility system on
the backbone network of a major tier-1 provider and investigate the impact of dynamic traffic-engineering
on the stability of the tunnel setup.
1.2 Related Work
First, we present two recent network measurement studies on the prevalence of MPLS in the Internet and
on the performance of MPLS-TE in a service provider network. The results of these works further
motivate the investigation of dynamic traffic-engineering. We also present previous works on MPLS
traffic-engineering (MPLS-TE). We observe that the early works on MPLS-TE focus on the tunnel
placement problem and address the static traffic-engineering problem. Also, many works assume multiple
tunnels between two edge nodes. However, the operational practice is to have one or only a few tunnels
per ingress-egress pair. The proposed traffic-engineering algorithms output explicitly-routed tunnels,
whereas the common practice is to use dynamic tunnel path calculation so that the tunnel setup
automatically adjusts according to network topology changes. Overall, these works do not consider the
operational aspects in MPLS-TE deployments but they focus on the traffic-engineering algorithms. Finally,
we present some previous works on load-sensitive routing. These works investigate packet-based load-sensitive routing, whereas auto-bandwidth implements flow-based load-sensitive routing.
Two recent works perform measurement studies of operational MPLS deployments. [1] measures the
prevalence and characteristics of MPLS deployments in the Internet over the past 3.5 years. This study is
based on traceroute-style path measurements that include MPLS label stacks. However, the MPLS label
stack is not always visible to a public user of traceroute. Therefore, this measurement methodology
underestimates the number of MPLS tunnels in the Internet. Nevertheless, they find that 7% of all ASes
have been consistently deploying MPLS over the past 3.5 years. 25% of all paths in 2011 cross at least
one MPLS tunnel. Also, they find that the largest deployments are found in tier-1 providers, and that
many ASes deploy traffic classification and engineering in their tunnels. Overall, this study shows that a
growing number of ASes adopt MPLS traffic-engineering. Thus, MPLS traffic-engineering practices
become increasingly important.
[2] studies the performance of MPLS in Microsoft’s online service network (MSN). This network connects
Microsoft’s data centers to each other and to peering ISPs. MSN uses dynamic traffic-engineering in order
to route the changing traffic demands between the data centers using auto-bandwidth. The study focuses
on the increased latency caused by routing the traffic along paths with sufficient bandwidth, instead of
routing the traffic along shortest paths. They find that 80% of the increase in latency occurs due to
tunnel path changes presumably caused by auto-bandwidth. This increase in latency reflects the tunnel
path changes where the new path is longer than the old path but the new path has sufficient capacity for
the new bandwidth reservation. In our work, we perform a thorough investigation of auto-bandwidth
focusing on the risks involved in deploying a dynamic traffic-engineering scheme in backbone networks.
[3] presents a set of techniques for network-wide tunnel path optimization subject to the routing
constraints imposed by QoS requirements. The goal is to find the explicit routes that will jointly optimize
a set of tunnel demands. They use multi-commodity flow solution methods as primitives and focus on the
scalability and speed-of-response of the proposed techniques. [4] addresses the same problem but with a
non-linear programming approach. Both works suggest a centralized network-wide approach with the
assumption of pre-programmed Label Switch Paths (LSPs) that change on relatively long time scales.
The following works target specific traffic-engineering problems that are not common in operational
environments: in backbone networks, the tunnel specifications are usually static, and there are only a
few, if any, tunnel pairs between two edge nodes. [5] focuses on the problem of dynamically routing
bandwidth guaranteed tunnels when the tunnel requests arrive one-by-one and there is no a priori
knowledge of what bandwidth requirements the future LSP requests will have. Their approach is routing a
new tunnel on a path that results in minimum interference to potential future LSP setup requests. In
particular, they propose an algorithm that outputs explicit routes that avoid the loading of the bottleneck
links. [6] focuses on bidirectional LSP setup and proposes an optimal LSP pair (upward and downward
LSP) selection method in the case where there are multiple LSP pairs between two edge nodes.
The following works implement dynamic traffic-engineering but are not operationally viable solutions.
MATE does not respond to real-time traffic changes, and TeXCP assumes multiple tunnels between two
edge nodes. We note that, although ISPs may deploy more than one tunnel between two edge nodes, they
generally maintain a small number of such tunnels for scalability reasons. In detail, MATE and TeXCP require active
monitoring of the network’s state. Specifically, MATE focuses on load-balancing short-term traffic
fluctuations among multiple LSPs between two edge nodes. They assume that the network-wide LSP
setup is determined using long-term traffic matrices and they achieve effective load-balancing by probing
the current state of the network. However, their approach cannot react to real-time traffic changes.
TeXCP, on the other hand, proposes an online distributed traffic-engineering protocol that performs the
load-balancing in real-time. TeXCP requires an agent residing at each ingress router and multiple tunnels
to deliver the traffic to the egress node. The agent moves traffic from the over-utilized to the under-utilized
paths between the same source/destination pair. This work is the first to address the stability
requirement of online TE protocols. Finally, COPE is a class of traffic-engineering algorithms that
optimize for the expected traffic demands and provide a worst-case guarantee for the unexpected traffic
demands. As the authors note, their future work includes developing an efficient implementation of COPE
that can be integrated with the current operational practices, namely MPLS, OSPF, and online
traffic-engineering.
Next, we present works on packet-based load-sensitive routing. In Section 1.6.2.2, we compare packet-based load-sensitive routing with flow-based load-sensitive routing. The latter is implemented by auto-bandwidth. [7][8][9] investigate packet-based load-sensitive routing in the early days of the Internet. In
particular, they investigate link cost metrics for intradomain routing protocols that reflect the network
load. They find that dynamic metrics lead to routing instabilities and oscillations. We elaborate on these
works and compare packet-based with flow-based load-sensitive routing in Section 1.6.2.2.
1.3 Contributions
Our contributions in intradomain routing management for backbone networks are:

• An offline tunnel visibility system that is applicable to networks deploying dynamic traffic-engineering: To the best of our knowledge, this is the first system that performs lightweight
simulations of operational MPLS domains. The tunnel visibility system predicts the paths of all the
tunnels in the network given the network topology, the tunnels’ configuration, and the tunnels’
traffic volume measurements. In other words, it predicts the network-wide tunnel setup.
Additionally, it enables the network-wide simulation of the auto-bandwidth mechanism. The
purpose of this system is to provide offline visibility of the MPLS functionality to network
operators.

• The measurement of the impact of factors that contribute to unpredictable tunnel setups: We
verify the simulation of dynamic traffic-engineering with network measurement data from the
backbone network of a major ISP. We expose the factors that lead to unpredictable, or so called
non-deterministic, tunnel setups. The network under analysis is configured with a full mesh of
tunnels between edge routers with different priorities and network-wide deployment of the auto-bandwidth mechanism. We find that, when auto-bandwidth is enabled, the following factors play
a significant role in the tunnel path predictions: (i) Tunnel traffic measurements missing from our
dataset reduce the accuracy with which we can predict auto-bandwidth events by 15%. (ii) The
timing factor that is reflected in the order with which tunnels establish their paths can reduce the
tunnel reroutes in a two-month period by 8%. (iii) The tiebreaking process between paths of the
same cost makes 5% of the tunnel paths unpredictable. Due to these three factors, the tunnel
visibility system cannot predict the exact tunnel setup. However, we find that the predicted
tunnel dynamics, i.e. number of tunnel reroutes and failures, follow the real network operation.

• The investigation of dynamic traffic-engineering using the tunnel visibility system: The auto-bandwidth option exposes the intradomain routing to external factors, namely the network’s
traffic demands. This makes the tunnel setup vulnerable to network-wide instabilities. Tunnel
path instabilities can be harmful to the data plane performance. We identify the risks involved in
MPLS tunnel designs that automatically adjust the tunnel bandwidth reservations according to the
observed traffic demands, and we investigate their impact through extensive simulations based
on a real network dataset. Our dataset includes the network topology, the network topology
changes, the tunnel configuration, and the tunnel traffic demands. We focus on various aspects
of auto-bandwidth: the impact of the mechanism on the amount of tunnel reroutes, failures, and
preemptions, the responsiveness of the mechanism to highly variable traffic patterns, and the
stability of the mechanism.

• The analysis of the impact of dynamic traffic-engineering on networks with varying capacity
constraints: The tunnel visibility system along with a large dataset from a tier-1 ISP allows us to
investigate the operational practices of dynamic traffic-engineering. We find that as the capacity
of the network reduces to 50% of its initial capacity: (i) Auto-bandwidth increases the tunnel
reroutes by almost 41% and the tunnel preemptions by 34 times. (ii) Most of the additional
reroutes occur because of the preemption of lower-priority traffic as higher-priority tunnels adjust
their bandwidth. These reroutes are indirectly caused by auto-bandwidth. So, they represent the
cascading effect of auto-bandwidth on the network-wide tunnel setup. (iii) Auto-bandwidth
causes the reroute of increasingly large tunnels and the preemption of an increasing number of
small lower-priority tunnels. Overall, the total size of the tunnels that are rerouted increases by
three times. Thus, large amounts of traffic are impacted by the increased tunnel dynamics.

• The identification of a routing design detail with major impact on networks with stringent
capacity constraints: When tunnels cannot find a path to accommodate their adjusted bandwidth
reservation, they maintain their old reservation and do not resize to the maximum bandwidth
that can be accommodated. We call this an auto-bandwidth failure. Auto-bandwidth failures are soft
failures because the tunnels do establish a path but the bandwidth reservation does not
correspond to the amount of incoming traffic. We find that their impact worsens in networks
that have less spare capacity. In detail, auto-bandwidth failures barely occur in the real
network but, when the capacity of the network decreases by 50%, large amounts of traffic are
impacted by the auto-bandwidth failures.

• The investigation of the auto-bandwidth responsiveness and stability: We test the behavior of the
auto-bandwidth mechanism in the presence of extreme traffic shifts and we find that the
bandwidth adjustments are significantly delayed by the moving average algorithm run by the
routers of a major vendor for calculating the tunnel traffic rates. In terms of the stability of auto-bandwidth, we analyze the previous experiences with packet-based load-sensitive routing and
conclude that auto-bandwidth is stable. This means that auto-bandwidth does not cause
recurring tunnel path changes to a single tunnel. However, this holds provided that the tunnel
path changes do not trigger a reaction from end-to-end traffic-engineering that is external to the
network.
1.4 Background
1.4.1 MPLS-TE Basics
Multiprotocol Label Switching (MPLS) [10] uses labels to forward traffic across the MPLS domain.
When a packet enters an MPLS domain, the router imposes a label on the packet and the label, as
opposed to the IP header, determines the next-hop. The label is removed at the egress point of the MPLS
domain. When a packet arrives at a Label Switching Router (LSR), the incoming label determines the
outgoing interface. The LSR may also swap the incoming label to the appropriate outgoing label. Labels
are assigned to packets based on Forwarding Equivalence Classes (FECs). Packets belonging to the same
FEC are forwarded in the same way within the MPLS domain. The Label Switched Path (LSP) that a
packet takes in an MPLS domain is what we call an MPLS tunnel.
Traffic-engineering (TE) refers to the process of selecting the path for data traffic belonging to some
service so that the service’s objectives are satisfied. TE mostly aims at simultaneously optimizing the
network utilization and the traffic performance. Existing Interior Gateway Protocols (IGPs) are inadequate
for traffic-engineering because their routing decisions are based on shortest path algorithms and do not
take into account bandwidth requirements or other service-specific requirements.
In Figure 3-A, we show the block diagram of MPLS Traffic-engineering (MPLS-TE) as implemented at
the head-end router, the LSR that originates a tunnel. MPLS-TE relies on the IGP to distribute the
network topology and network state information: TE-specific link metrics, available link bandwidth, link
attributes (e.g. core-core link or edge-core link), etc. This information forms the TE Topology Database.
Path selection is based on the collected TE information and the Constraint-Based Routing (CB-R)
algorithm. CB-R calculates the best path for a tunnel. Path selection takes place when (i) there is a new
tunnel request, (ii) an existing tunnel has failed, or (iii) the path of an existing tunnel is reoptimized [11].
Figure 3-A: MPLS-TE system block diagram.
http://www.cisco.com/en/US/prod/collateral/iosswrel/ps6537/ps6557/ps6608/whitepaper_c11-551235.html
Tunnel reoptimization refers to the re-running of the CB-R algorithm by the head-end router. The
head-end router may find a better path for the tunnel and reroute it. Reoptimization takes place periodically, and the
reoptimization frequency is specified in the head-end router’s configuration.
In one example CB-R implementation [12], the algorithm ignores the links that do not have sufficient
resources or that violate policy constraints, and then runs Dijkstra on the remaining topology. This
returns the shortest path (or paths) that also satisfies the routing constraints specified by the operator. In
IGPs, traffic uses multiple paths with the same cost to the destination if such paths exist (Equal-Cost
Multi-Path). However, CB-R looks for a single path from the tunnel’s source to the tunnel’s destination.
Therefore, in the case of multiple paths, the following tiebreakers determine the tunnel’s path [12]:
• Choose the path with the highest minimum available bandwidth.
• Then, choose the path with the lowest number of routers on the path.
• Then, choose a random path. This is usually the first path in the list of chosen paths.
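To make the selection concrete, the sketch below prunes links that cannot satisfy the bandwidth constraint, enumerates the equal-cost shortest paths on the remaining topology, and applies the tiebreakers listed above. It is a minimal illustration under our own assumptions about the data representation (Python, with links given as tuples); it is not the vendor implementation of CB-R.

import heapq
import random

def cspf(links, src, dst, bandwidth):
    # links: iterable of (node_a, node_b, igp_cost, available_bw) tuples, assumed bidirectional.
    # 1) Prune links that cannot accommodate the requested bandwidth (policy and
    #    affinity constraints would be checked here as well).
    graph = {}
    for a, b, cost, avail in links:
        if avail >= bandwidth:
            graph.setdefault(a, []).append((b, cost, avail))
            graph.setdefault(b, []).append((a, cost, avail))
    # 2) Dijkstra on the pruned topology, collecting all equal-cost shortest paths to dst.
    best = {src: 0}
    candidates = []                      # (path, minimum available bandwidth along the path)
    heap = [(0, src, [src], float("inf"))]
    while heap:
        cost, node, path, min_avail = heapq.heappop(heap)
        if cost > best.get(node, float("inf")):
            continue                     # a cheaper way to reach this node is already known
        if node == dst:
            candidates.append((path, min_avail))
            continue
        for nxt, lcost, avail in graph.get(node, []):
            if nxt in path:
                continue                 # avoid loops
            ncost = cost + lcost
            if ncost <= best.get(nxt, float("inf")):
                best[nxt] = ncost
                heapq.heappush(heap, (ncost, nxt, path + [nxt], min(min_avail, avail)))
    if not candidates:
        return None                      # no path satisfies the bandwidth constraint
    # 3) Tiebreakers: highest minimum available bandwidth, then fewest hops, then an arbitrary pick.
    top_bw = max(bw for _, bw in candidates)
    candidates = [c for c in candidates if c[1] == top_bw]
    min_hops = min(len(p) for p, _ in candidates)
    candidates = [c for c in candidates if len(c[0]) == min_hops]
    return random.choice(candidates)[0]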
Afterwards, the head-end router signals the chosen path and sets up the tunnel. The tunnel setup is
implemented by a reservation protocol that supports TE, namely RSVP-TE [13] or CR-LDP [14]. We
elaborate on RSVP-TE because it is the most widely used protocol for MPLS-TE. RSVP-TE is an extension
of the RSVP protocol. The head-end router sends an RSVP-TE PATH message along the selected path in
order to request resources for the tunnel. The tail-end router sends back an RSVP-TE RESV message
upon the receipt of the PATH message and initiates the label distribution process. Once the LSP is
established, traffic can be forwarded across the tunnel.
Finally, we elaborate on the tunnel attributes. Every tunnel is specified in the configuration file of the
head-end router. We describe the most important tunnel attributes:
1) Path calculation: The tunnel’s path can be explicitly specified in router configuration or it can be
dynamically calculated by the head-end router using the CB-R algorithm.
2) Bandwidth requirement: The tunnel’s bandwidth requirement can be statically defined in the
router configuration or it can dynamically adjust to the tunnel’s traffic rate. In Section 1.4.2, we
elaborate on the auto-bandwidth option that implements the automatic bandwidth adjustment.
3) Priority: Tunnels can preempt each other to acquire bandwidth based upon their defined priority
value. Priority values range from 0 to 7.
4) Adaptability: This is the frequency of reoptimizing the tunnel’s path.
5) Resiliency: Fast-reroute (FRR) is the dominant mechanism for tunnel path protection. When FRR
is deployed, more options regarding the backup path calculation need to be specified.
6) Affinity flags: Affinities are the properties that the tunnel requires in its links. The permitted or
excluded link affinities (otherwise known as link colors) introduce additional constraints in the CB-R algorithm. For example, an operator may specify an affinity flag that does not allow edge-to-core links to be included in the tunnel’s path.
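As a concrete illustration of how these attributes might be grouped, the following sketch defines a tunnel specification record. The field names and defaults are our own and do not correspond to any vendor’s configuration syntax.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TunnelSpec:
    name: str
    head_end: str                                # LSR that originates the tunnel
    tail_end: str                                # egress LSR
    explicit_path: Optional[List[str]] = None    # None = dynamic path calculation via CB-R
    bandwidth_kbps: Optional[int] = None         # None = auto-bandwidth adjusts the reservation
    setup_priority: int = 7                      # 0 (highest) to 7 (lowest)
    hold_priority: int = 7
    reoptimization_interval_s: int = 3600        # how often the head-end re-runs CB-R
    frr_enabled: bool = False                    # fast-reroute path protection
    include_affinity: int = 0x0                  # link-color bits the path must include
    exclude_affinity: int = 0x0                  # link-color bits the path must avoid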
Next, we elaborate on tunnel priorities and on how preemptions are performed in the MPLS domain.
Each tunnel is configured with a setup and a hold priority. In particular, a new tunnel with high setup
priority can be established by preempting tunnels with lower hold priority. This happens when there is
insufficient reservable bandwidth for the new tunnel to establish but there is sufficient reservable
bandwidth if tunnels of lower hold priority are torn down. The head-end routers implement proprietary
decision logic algorithms when deciding which tunnels to preempt. We also note that, when a tunnel is
preempted, its head-end router tries to re-establish it on another path with its previously signalled
bandwidth.
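The following sketch shows one possible admission check of this kind on a single link: a new tunnel can be admitted if enough bandwidth can be freed by tearing down tunnels whose hold priority is numerically higher (i.e., lower priority) than the new tunnel’s setup priority. The victim-selection rule here (largest reservations first) is our own simplification; as noted above, head-end routers use proprietary decision logic.

def preemptable_set(link_capacity, reserved, new_bw, new_setup_prio):
    # reserved: list of (tunnel_name, hold_priority, reserved_bw) currently on the link.
    # Returns the tunnels to preempt, or None if the new tunnel cannot be admitted.
    free = link_capacity - sum(bw for _, _, bw in reserved)
    if free >= new_bw:
        return []                                   # fits without preempting anything
    # Only tunnels with a numerically higher hold priority (= lower priority) may be preempted.
    victims = [t for t in reserved if t[1] > new_setup_prio]
    victims.sort(key=lambda t: t[2], reverse=True)  # largest reservations first
    chosen = []
    for t in victims:
        if free >= new_bw:
            break
        chosen.append(t)
        free += t[2]
    return chosen if free >= new_bw else None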
1.4.2 Auto-bandwidth Mechanism
Figure 3-B: Auto-bandwidth example.
http://s-tools1.juniper.net/solutions/literature/app_note/350080.pdf
The auto-bandwidth mechanism enables dynamic traffic-engineering. Auto-bandwidth automatically adjusts the bandwidth reservation of an MPLS tunnel based
on how much traffic is flowing through the tunnel. Hence, it automates the monitoring and
reconfiguration of the tunnel’s bandwidth. Auto-bandwidth adjusts the tunnel bandwidth based on the
largest average output rate observed during the last adjust-interval, as long as this rate stays within the
configured minimum and maximum bandwidth values. The output rate is estimated by sampling with a
configured frequency value. When the tunnel’s bandwidth is adjusted to a new bandwidth constraint, the
head-end router generates a new RSVP-TE PATH message. If the new bandwidth reservation cannot be
satisfied by any path in the network, the current path will continue to be used and the tunnel’s bandwidth
reservation will remain unchanged. Figure 3-B shows an example of auto-bandwidth in time. Next, we
present an elaborate list of the auto-bandwidth parameters that control the operation of the mechanism.
The auto-bandwidth operation is determined per tunnel by the following parameters:
• Adjust-interval: the interval between bandwidth adjustments.
• Collection-interval: the interval of collecting output rate information for the tunnel.
• Maximum-bandwidth: the maximum automatic bandwidth for the tunnel.
• Minimum-bandwidth: the minimum automatic bandwidth for the tunnel.
• Adjust-threshold percentage: the bandwidth change percentage threshold that triggers an adjustment if the maximum bandwidth in the adjust-interval is higher or lower than the current bandwidth reservation.
• Adjust-threshold minimum: the bandwidth change value threshold that triggers an adjustment if the maximum bandwidth in the adjust-interval is higher or lower than the current bandwidth reservation.
• Overflow/underflow threshold percentage and minimum: the bandwidth change percentage and value thresholds that trigger an immediate adjustment of the tunnel’s bandwidth.
• Overflow/underflow limit: the number of consecutive collection-intervals that must exceed the specified overflow/underflow thresholds in order to trigger an immediate bandwidth adjustment.
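For concreteness, a hypothetical parameter set might be represented as follows; the field names and default values are illustrative and do not reflect any vendor’s configuration syntax. The adjustment logic that consumes these parameters is sketched in Section 1.5.1.3.

from dataclasses import dataclass

@dataclass
class AutoBwParams:
    adjust_interval_s: int = 300              # seconds between regular adjustments
    collection_interval_s: int = 60           # seconds between output-rate samples
    min_bw_kbps: int = 0
    max_bw_kbps: int = 10_000_000
    adjust_threshold_pct: float = 0.10        # 10% change needed for a regular adjustment
    adjust_threshold_min_kbps: int = 1_000
    overflow_threshold_pct: float = 0.50      # 50% change triggers an immediate adjustment
    overflow_threshold_min_kbps: int = 5_000
    overflow_limit: int = 1                   # consecutive intervals above the overflow thresholds

For example, with a 100 Mb/s reservation and the 10% / 1 Mb/s adjust thresholds above, a measured maximum of 112 Mb/s deviates by 12% and 12 Mb/s, exceeds both thresholds, and would therefore trigger a regular adjustment at the end of the adjust-interval.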
Finally, we comment on how the router estimates the output rate of a tunnel. Routers from a major
vendor use a five-second moving average algorithm [15]. This algorithm has the following effect: the
output rate rises less quickly on traffic spikes and falls less rapidly if traffic drops suddenly, as opposed to
not using this algorithm and setting the output rate equal to the actual output rate. The output rate in a
five-second interval is given by the following weighted average formula:
new rate = ((old rate – current rate) * exp (-5secs / 5mins)) + current rate
where new rate is the output rate reported to the auto-bandwidth mechanism, old rate is the output rate
five seconds ago, and current rate is the actual output rate in the last five seconds. The exponential
factor applies exponential decay to the deviation of the current rate from the previously reported
output rate.
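A short sketch of this smoothing, using the five-second interval and five-minute time constant from the formula above, illustrates the dampening effect on a sudden traffic spike (the rates are made-up example values):

import math

def smoothed_rate(old_rate, current_rate, interval_s=5, time_constant_s=300):
    # Exponentially decay the deviation of the current rate from the previous estimate.
    decay = math.exp(-interval_s / time_constant_s)
    return (old_rate - current_rate) * decay + current_rate

# Example: the actual rate jumps from 100 Mb/s to 200 Mb/s and stays there.
rate = 100.0
for _ in range(12):                # one minute of 5-second intervals
    rate = smoothed_rate(rate, 200.0)
print(round(rate, 1))              # ~118.1 Mb/s: the reported rate still lags far behind the spike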
1.5 Tunnel Visibility
The visibility of a routing mechanism is essential in network operations. It often happens that the routing
mechanism and protocol implementations do not match the network operators’ perception of how
these mechanisms or protocols function. In these cases, it is difficult to redesign the routing when
network problems appear or when the network objectives change. When operators can accurately predict
the routes selected by the routing mechanism given sufficient information about the network’s state, then
the routing mechanism is visible and easier to manage. Here, by network state, we refer to the network
topology, the routing configuration, and the traffic demands.
Tunnel visibility refers to the accurate determination of the tunnel setup. In detail, a tunnel visibility
system estimates the network-wide tunnel setup in an offline manner based on the tunnel specifications
in the network’s configuration. When dynamic traffic-engineering is deployed, the tunnel visibility system
also requires as input the tunnel traffic demands. If the tunnel visibility system cannot provide an
accurate view of the real tunnel setup, then it must be ignoring factors that affect the tunnel setup in an
operational environment. We implement a tunnel visibility system for networks deploying dynamic traffic-engineering and investigate the presence and impact of such factors.
In the rest of this section, we describe and evaluate the tunnel visibility system. This system simulates
the network-wide MPLS deployment. In a nutshell, the tunnel visibility system predicts the paths that the
tunnels take based on an offline constraint-based routing algorithm implementation. We verify the tunnel
visibility system’s accuracy by applying it to a tier-1 network both when dynamic traffic-engineering is
enabled and when it is not. We use multiple data sources in order to compare the system’s output with
the real network operation. We further investigate the determinism of the tunnel setup with respect to
the sequence that the head-end routers reoptimize their tunnels. Finally, we discuss our findings.
To summarize, dynamic traffic-engineering reduces the tunnel visibility because: (i) It requires detailed
traffic volume measurements in order to accurately predict the operation of the auto-bandwidth
mechanism. In our dataset, 20% of the auto-bandwidth adjustments are mispredicted because the
available traffic volume measurements are more coarse-grained than the ones used by the auto-bandwidth mechanism in the real network.
Figure 3-C: Tunnel visibility system architecture. The system includes two main components: the
AutoBw Simulator and the modified and extended TOTEM toolbox. The input is the network’s state: the
network topology, the tunnels’ configuration, and the per-tunnel traffic volume measurements. The
output is the network-wide tunnel setup, that is, the tunnel paths.
(ii) When dynamic traffic-engineering is enabled, the CB-R
tiebreaking introduces non-determinism in the tunnel setup. 5% of the tunnel paths are inaccurately
estimated. (iii) Timing factors that are not controlled by the operators, such as the order with which the
head-end routers reoptimize their tunnels, introduce small but not negligible non-determinism in the
tunnel setup. For example, we find one order of tunnel reoptimization that results in 8% fewer tunnel
reroutes than the average number of tunnel reroutes measured over three random tunnel reoptimization
orders in a two-month period.
1.5.1 System
The tunnel visibility system consists of an integrated set of components: (i) an open source toolbox for
traffic-engineering methods (TOTEM) [16][17], and (ii) the AutoBw Simulator, an implementation of the
auto-bandwidth mechanism. In detail, we modify and extend TOTEM to convert it from a simulator of
traffic engineering methods into a simulator of network-wide MPLS operation. We extend TOTEM to
enable the simulation of dynamic traffic-engineering and of other mechanisms that determine the MPLS
functionality in production networks. We implement the AutoBw Simulator according to the auto-bandwidth implementation of a major router vendor. Figure 3-C illustrates the architecture of the tunnel
visibility system. In the rest of this section, we present TOTEM, the open source toolbox for traffic
engineering methods. Then, we present the extensions and modifications introduced to TOTEM. Finally,
we describe the AutoBw Simulator.
1.5.1.1 TOTEM
The TOolbox for Traffic Engineering Methods (TOTEM) [16][17] simulates how traffic is routed on a
network using Shortest-Path-First (SPF), Constrained-Shortest-Path-First (CSPF), and other traffic-engineering routing algorithms. CSPF implements Constraint-Based Routing (CB-R). From now on, we use
the terms CSPF and CB-R interchangeably. TOTEM is a flow-based event-driven simulator and not a
packet-based simulator. This means that it simulates the calculation of paths for network flows but not
the per-packet per-hop behavior. In other words, it does not simulate the routing of the packets in the
network but it simulates the control-plane functionality that defines the tunnel paths through which
packets flow.
In detail, TOTEM includes a repository of traffic-engineering methods grouped into several
categories: 1) IP: algorithms using only IP information such as the IGP weights, 2) MPLS: algorithms
using MPLS-TE functionalities such as the MPLS source-based routing, the computation of primary LSP
paths using CSPF, and the computation of backup LSP paths using link/node disjoint constraints, 3) BGP:
a BGP decision process simulator that combines interdomain routing information with intradomain routing
algorithms, 4) generic: classic optimization and search algorithms used by various components of the
toolbox.
TOTEM also includes a set of components that interface with the algorithms repository and allow the
flexible use of the toolbox in an operational setting. The three most essential components are the
topology manager, the scenario manager, and the traffic matrix manager. The topology manager handles
the network representation. The scenario manager handles the automatic execution of simulation events.
Some events already integrated with TOTEM are the LSP creation, link and node failures, and link
utilization computations. The traffic matrix manager handles the traffic representation. Traffic flow
volumes can be combined with the traffic-engineering algorithms to generate network utilization
statistics. In addition to these three components, TOTEM includes a topology and traffic generator, a web
service interface for remote online control of the toolbox, a native interface for toolbox extensions, and a
graphical user interface for network operators. Figure 3-D summarizes the TOTEM architecture.
Figure 3-D: TOTEM toolbox architecture. TOTEM includes the algorithms repository, the topology,
scenario and traffic matrix managers, and various interfaces for the flexible usage of the toolbox.
1.5.1.2 TOTEM Extensions and Modifications
TOTEM simulates various path calculation algorithms but it omits MPLS-TE functionality performed by
head-end routers that is important for the accurate representation of the network operation. Hence, we
extend and modify TOTEM to reflect the operation of an MPLS domain.
We extend TOTEM to handle tunnel auto-bandwidth events and network-wide reoptimization events. In
detail, we simulate a tunnel bandwidth adjustment by rerunning the CSPF algorithm with the tunnel’s
adjusted bandwidth reservation. The auto-bandwidth events are generated by the AutoBw Simulator that
we describe next. After the CSPF algorithm runs with the adjusted bandwidth requirement, the tunnel
may: 1) resize its bandwidth but stay on the same path, 2) change path in order to accommodate the
new bandwidth reservation, 3) stay on the same path without adjusting its bandwidth reservation
because no path is found to accommodate the new bandwidth reservation. We extend TOTEM to
simulate this behavior.
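A minimal sketch of this extension logic, under our own assumptions about the simulator’s interfaces (the cspf() call refers to the illustrative function sketched in Section 1.4.1, and a real implementation would first exclude the tunnel’s own current reservation from the availability check), could be:

def apply_bandwidth_adjustment(tunnel, new_bw, topology):
    # tunnel: an object with head_end, tail_end, current_path, and reserved_bw attributes.
    # Re-run CSPF with the adjusted reservation; three outcomes are possible.
    path = cspf(topology, tunnel.head_end, tunnel.tail_end, new_bw)
    if path is None:
        # 3) Auto-bandwidth failure: keep the old path and the old reservation.
        return "autobw_failure"
    if path == tunnel.current_path:
        # 1) Resize in place: the current path accommodates the new reservation.
        tunnel.reserved_bw = new_bw
        return "resized"
    # 2) Reroute (make-before-break) to the path that fits the new reservation.
    tunnel.current_path = path
    tunnel.reserved_bw = new_bw
    return "rerouted"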
We simulate tunnel reoptimization by rerunning the CSPF algorithm for each tunnel. The sequence of
reoptimizing the tunnels can be inserted in TOTEM or calculated by TOTEM according to some criteria.
We elaborate on the tunnel reoptimization sequence in Section 1.5.3. In reality, each tunnel maintains its
own reoptimization timer and this timer expires according to the reoptimization period that is specified in
the router configuration.
In addition to these extensions, we make four major modifications to TOTEM. These modifications are
necessary to accurately simulate the stateful router operation in production networks:

• CSPF tiebreaking: TOTEM randomly picks one of the multiple paths returned by the constrained-shortest-path-first calculation, whereas routers run specific tiebreaking rules before choosing the
single path on which they will signal the tunnel. We implement CSPF tiebreaking according to the
major router vendor’s implementation.

• Preemption decision logic: TOTEM randomly picks a sufficient set of lower-priority tunnels that
are on the path where the higher priority tunnel establishes. However, head-end routers in
production networks run specific preemption decision logic algorithms. We implement the
preemption decision logic of the router vendor.

• Handling preempted tunnels: TOTEM tears down the preempted tunnels, whereas routers attempt
to re-establish these tunnels on an alternative path. We implement this functionality.

• Maintenance of tunnel state: TOTEM does not keep state of the tunnels that fail to establish
and/or are torn down. However, head-end routers periodically attempt to re-establish the tunnels
that have failed to set up a path in the past. We add tunnel state to TOTEM and simulate the
periodic router attempts to re-establish failed tunnels.
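The last two modifications can be illustrated with a small state-keeping sketch; it is our own simplification (TOTEM’s and the vendor’s actual logic differ in detail), and it reuses the illustrative cspf() function from Section 1.4.1.

class TunnelState:
    def __init__(self):
        self.down = {}          # tunnel name -> last signalled bandwidth before going down

    def on_preempted(self, tunnel, topology):
        # Try to re-establish the preempted tunnel on an alternative path
        # with its previously signalled bandwidth.
        path = cspf(topology, tunnel.head_end, tunnel.tail_end, tunnel.reserved_bw)
        if path is not None:
            tunnel.current_path = path
        else:
            self.down[tunnel.name] = tunnel.reserved_bw   # remember it for later retries

    def on_reoptimization(self, tunnels, topology):
        # Periodically retry the tunnels that previously failed to establish a path.
        for name, bw in list(self.down.items()):
            t = tunnels[name]
            path = cspf(topology, t.head_end, t.tail_end, bw)
            if path is not None:
                t.current_path = path
                t.reserved_bw = bw
                del self.down[name]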
Overall, these extensions and modifications result in an operational version of TOTEM, a
version that simulates the operation of an MPLS-TE-enabled production network.
1.5.1.3 AutoBw Simulator
The AutoBw Simulator simulates the auto-bandwidth mechanism as described in Section 1.4.2. The
AutoBw Simulator takes as input a series of traffic rates (one value per collection-interval) and returns
a sequence of bandwidth adjustment events for the tunnel. Below, we sketch the AutoBw Simulator logic
as simplified Python-style pseudocode, using the illustrative parameter names from Section 1.4.2. We
assume that the overflow/underflow limit is set to 1 for simplicity.
At the end of each adjust-interval, there is a bandwidth adjustment if the maximum bandwidth measured
during this interval (MaxAvgBW) deviates from the currently signalled bandwidth by more than both the
adjust-threshold percentage (Adjust-Threshold Perc) and the adjust-threshold value (Adjust-Threshold Min).
At the end of each collection-interval, there is an immediate bandwidth adjustment if MaxAvgBW deviates
from the currently signalled bandwidth by more than both the overflow/underflow-threshold percentage
(Over/Underflow-Threshold Perc) and the overflow/underflow-threshold value (Over/Underflow-Threshold Min).
def autobw_events(rates, signalled_bw, p):
    # rates: one traffic rate per collection-interval; p: an AutoBwParams instance
    # (overflow/underflow limit assumed to be 1; a single overflow threshold is used
    # for both directions, and clamping to min/max bandwidth is omitted for brevity).
    events = []                                # (collection-interval index, new signalled bandwidth)
    window = []                                # rates observed in the current adjust-interval
    per_adjust = p.adjust_interval_s // p.collection_interval_s
    for i, bw in enumerate(rates):             # at the end of each collection-interval
        window.append(bw)
        max_avg_bw = max(window)
        if len(window) == per_adjust:          # end of adjust-interval: regular adjustment
            if ((max_avg_bw > (1 + p.adjust_threshold_pct) * signalled_bw
                 and max_avg_bw - signalled_bw > p.adjust_threshold_min_kbps)
                or (max_avg_bw < (1 - p.adjust_threshold_pct) * signalled_bw
                    and signalled_bw - max_avg_bw > p.adjust_threshold_min_kbps)):
                signalled_bw = max_avg_bw
                events.append((i, signalled_bw))
            window = []                        # restart the adjust-interval
        else:                                  # overflow/underflow: immediate adjustment
            if ((max_avg_bw > (1 + p.overflow_threshold_pct) * signalled_bw
                 and max_avg_bw - signalled_bw > p.overflow_threshold_min_kbps)
                or (max_avg_bw < (1 - p.overflow_threshold_pct) * signalled_bw
                    and signalled_bw - max_avg_bw > p.overflow_threshold_min_kbps)):
                signalled_bw = max_avg_bw
                events.append((i, signalled_bw))
                window = []                    # restart the adjust-interval
    return events
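As a usage illustration, with the hypothetical AutoBwParams defaults from Section 1.4.2 and made-up per-minute rates:

params = AutoBwParams(adjust_interval_s=300, collection_interval_s=60,
                      adjust_threshold_pct=0.10, adjust_threshold_min_kbps=1_000)
# Ten minutes of per-minute rates (kb/s): traffic ramps up during the second adjust-interval.
rates = [100_000, 101_000, 99_000, 100_500, 100_000,
         120_000, 125_000, 130_000, 128_000, 126_000]
print(autobw_events(rates, 100_000, params))   # -> [(9, 130000)]: one regular adjustment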
1.5.2 Verification
We investigate the accuracy of the tunnel visibility system by simulating the MPLS deployment of a tier-1
backbone network and by verifying the simulation observables with the real network operation, as
extracted from various network measurements. In detail, we simulate the MPLS-TE functionality for two
months in 2011. We extract the network’s state from router configurations, IGP messages, and SNMP
traffic measurements, and use this data as input to the tunnel visibility system. During simulation, we
capture the tunnel activity (i.e., the tunnel path changes) and compare it against router logs on MPLS-TE
and tunnel state measurements.
Next, we describe our dataset, the simulation setup, and the results from comparing the simulated tunnel
dynamics with the real tunnel dynamics. We find that the AutoBw Simulator accurately predicts on
average 80% of the real auto-bandwidth adjustments and that its prediction accuracy is heavily
dependent on the completeness of the traffic volumes dataset. We also find that the simulated tunnel
paths match with the real tunnel paths in 95% of the cases and that the mismatches are attributed to the
CSPF tiebreaking. Finally, although the tunnel visibility system provides good estimates of the tunnel
activity, it cannot predict the exact tunnel setup. The factors that contribute to this non-determinism are
the granularity of the traffic measurements, the CSPF tiebreaking, as well as the tunnel reoptimization
order analyzed in Section 1.5.3.
1.5.2.1 Dataset
Network topology: We obtain the network topology using IGP topology snapshots generated by a
network-specific IGP tool that monitors the exchange of IGP protocol messages. This network uses OSPF
as the intradomain routing protocol. The OSPF topology includes the links that tunnels can traverse. We
retrieve specific link attributes, such as link cost and link capacity, from parsed daily snapshots of the
router configuration files and convert the generated topology into the TOTEM-specific format. However,
this topology snapshot is valid for as long as there are no node up/down events, link up/down events,
and link cost changes. In order to maintain an accurate network topology at all times, we track such
events by processing the OSPF messages collected by the OSPF monitoring tool and note the changes in
the topology during the two-month simulation period. These topology changes are included in the
simulation scenarios, as described in the next section.
Tunnel configuration: The configuration of the tunnels is available in both router configuration files
and in high-level design documents. Tunnel configuration also includes the auto-bandwidth parameters
used by the AutoBw Simulator. In the first month of our dataset, auto-bandwidth is gradually deployed
on the network’s tunnels. In the second month of our dataset, auto-bandwidth is enabled throughout the
network. Except for this change, the tunnel configuration remains static throughout the two-month
period.
Traffic measurements: We have per-tunnel traffic volume measurements for 3-6 minute intervals. This
dataset is collected by the Simple Network Management Protocol (SNMP), and it is used both for
generating the network traffic matrices and for simulating the auto-bandwidth mechanism.
SYSLOG messages: The SYSLOG messages that pertain to MPLS-TE routing are available from all the
routers in the network. These alarms include tunnel events such as reroutes after the expiration of the
reoptimization timer, bandwidth adjustments because of the auto-bandwidth mechanism, fast-reroutes,
and changes in tunnel state (i.e., a tunnel going down or coming back up). Each alarm includes the accurate
timing of the event, the affected tunnel name, and the event type.
Tunnel path measurements: SNMP provides us with tunnel path snapshots along with incremental
updates of the tunnel path changes. We note that this dataset only shows the path changes but includes
no information about the root-cause of the change. Table 0-I provides an overview of our dataset.
Table 0-I: Dataset Overview
Network Data               | Description
Network Topology           | Router configuration files, IGP topology snapshots and updates
Tunnel Configuration       | Router configuration files, MPLS-TE design documents
Traffic Measurements       | SNMP data including 3-6 min traffic volume rates per tunnel
SYSLOG Messages            | MPLS-TE SYSLOG messages per router
Tunnel Path Measurements   | SNMP data including tunnel path snapshots and updates
1.5.2.2 Simulation Setup
In this section, we describe the scenario generation for TOTEM and the setup for the AutoBw Simulator.
The scenarios include the daily events ordered in time. Therefore, for each day we perform the following
steps to generate the scenario that is representative of the MPLS domain activity for that day:
1. Scenario initialization: We (i) load the topology snapshot of the network at the beginning of the
day, (ii) start the routing algorithm, that is CSPF where the path length is determined by the
OSPF costs and the constraint is the tunnel’s bandwidth reservation, and (iii) establish the
tunnels. Since our dataset does not include per-tunnel bandwidth reservations, we assume that
the initial tunnel reservations coincide with the output rate of each tunnel at the beginning of the
day. In reality, the signalled bandwidth of the tunnel and its output rate should be close in value
because of the auto-bandwidth mechanism.
2. Topology updates: We insert OSPF events, such as router up, router down, link up, link down,
and link cost change. When TOTEM executes these events, it updates the loaded network
topology but it does not act upon the tunnel setup. However, when routers are informed of such
events from the OSPF protocol messages, they recalculate the tunnel paths. Therefore, after
each OSPF event, we also insert a tunnel reoptimization event. We note that, in reality, when a router
or link goes down, fast-reroute (FRR) is triggered within 50ms. In our simulations, we focus on
the dynamics introduced by dynamic traffic-engineering, and thus, omit FRR, the prevalent MPLS-TE network restoration mechanism. This will inevitably lead to an inaccurate prediction of the
tunnel setup during network failures because the tunnels that are fast-rerouted in reality will be
torn down in simulation. However, this does not interfere with our goal of investigating the auto-bandwidth mechanism. Nevertheless, the tunnel visibility system can be extended to include the
FRR mechanism. We leave this as future work.
3. Auto-bandwidth events: We insert the per-tunnel bandwidth adjustment events that are
generated by the AutoBw Simulator. We elaborate on the setup of the AutoBw Simulator next.
4. Periodic tunnel reoptimization: We insert a network-wide reoptimization event according to the
frequency of reoptimization that is specified in the router configuration. During each
reoptimization, the extended version of TOTEM (i) reruns CSPF for all the tunnels that are up and
(ii) attempts to re-establish all the tunnels that are down, given their latest signalled bandwidth
before going down.
5. Show link utilization: We output network-wide link utilization statistics after each reoptimization
event in order to track the network’s utilization. Although we have performed a network-wide
analysis of the utilization levels in the network, we do not disclose this information because it is
proprietary for the network under analysis.
We order the topology updates, auto-bandwidth events, and periodic tunnel reoptimizations in time and
convert them into the TOTEM-specific scenario format. Finally, we set the AutoBw Simulator parameters
to the values specified in router configuration and run the simulator on a per-tunnel basis given the
tunnel’s traffic volume measurements during a single day. Again, we assume that (i) the tunnel holds an
initial bandwidth reservation equal to the tunnel’s traffic rate at the beginning of the day, and (ii) the
auto-bandwidth timer that defines the adjustment intervals starts at the beginning of the day.
Furthermore, we find that the auto-bandwidth collection interval is smaller than the traffic volume
measurement interval in our dataset. Therefore, we feed the simulator with the
same traffic rate value for multiple collection intervals. This assumes that the rate remains constant
during the measurement window. We expect that the lack of fine-grained traffic volume measurements
will result in inaccuracies between the auto-bandwidth simulator and the auto-bandwidth mechanism
deployed in the network. Figure 3-E summarizes the simulation setup for the tunnel visibility system.
Figure 3-E: Overview of the scenario generation.
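A sketch of this scenario-assembly step, with hypothetical event records (the real TOTEM scenario format is richer), could be:

def build_daily_scenario(topology_events, autobw_events, reopt_times):
    # topology_events: [(timestamp, kind, details)] with kind in {"link_up", "link_down", ...}
    # autobw_events:   [(timestamp, tunnel_name, new_bw)] produced by the AutoBw Simulator
    # reopt_times:     timestamps of periodic network-wide reoptimizations
    events = []
    for ts, kind, details in topology_events:
        events.append((ts, kind, details))
        events.append((ts, "reoptimize_all", None))     # routers react to OSPF changes
    for ts, tunnel, bw in autobw_events:
        events.append((ts, "adjust_bandwidth", (tunnel, bw)))
    for ts in reopt_times:
        events.append((ts, "reoptimize_all", None))
        events.append((ts, "show_link_utilization", None))
    events.sort(key=lambda e: e[0])                     # order all events in time
    return events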
1.5.2.3 Simulation Observables
After the execution of each scenario event, we output the impact of the event on the MPLS domain. On a
high level, changes in tunnel state are caused by (i) topology changes, (ii) traffic volume changes, and
(iii) periodic reoptimization. Tunnel state changes refer to tunnel (i) reroutes, (ii) failures to establish a
path, and (iii) setups after failed attempts to establish a path. Figure 3-F illustrates the MPLS domain as a
black-box where network events result in tunnel state changes.
In detail, topology updates can cause tunnels to (i) reroute, (ii) fail (i.e., go down), or (iii) re-establish
(i.e., come back up). Auto-bandwidth events can cause a tunnel to (i) resize but stay on its path, (ii)
reroute, or (iii) fail to resize and stay on its path. We call the last case an auto-bandwidth failure. Finally,
a reoptimization event can cause tunnels to (i) reroute or (ii) re-establish (head-end routers periodically
attempt to re-establish failed tunnels). Additionally, we perform a separate analysis of a particular type of
reroutes and failures, the reroutes and failures of lower-priority tunnels because of preemption. After
each event, we output the impact of the event including detailed information about the affected
tunnel(s), the path changes, the bandwidth reservation changes, and the inflicted preemptions.
Figure 3-F: The MPLS domain as a black-box of tunnel dynamics.
We elaborate on the metrics calculated after processing TOTEM’s output for each daily scenario. We
focus on tunnel reroutes and failures and on the amount of traffic impacted by these events. We omit
tunnel setups because they follow the tunnel failures. Thus, they do not provide any additional insights
on the tunnel dynamics. Finally, we analyze tunnel preemptions because of their impact on low-priority
traffic. Note that preemptions follow the break-before-make paradigm. Thus, when a tunnel is
preempted, its traffic is temporarily disrupted. Table 0-II summarizes the simulation observables:

• Tunnel Reroutes: Reroutes represent tunnel path changes in the MPLS domain. There are four
causes for a tunnel to change paths: (i) topology update (OSPF reroute), (ii) auto-bandwidth
event (AutoBw reroute), (iii) tunnel reoptimization (ReOpt reroute), (iv) tunnel preemption
(preemption reroute).

• Reroute Impact: The reroute impact represents the amount of traffic that is affected by the
reroute. When auto-bandwidth is enabled, the impact is reflected by the tunnel’s bandwidth
reservation. Therefore, we estimate the impact of the reroute using the bandwidth reservation of
the affected tunnel. We further discuss the impact of a tunnel reroute on the data plane when we
analyze the dynamic traffic-engineering in Section 1.6.

• Tunnel Failures: Tunnels experience hard and soft failures. In hard failures, tunnels fail to find a
path and go down after (i) a setup attempt (ReEstabl failure), (ii) a topology change (OSPF
failure), or (iii) being preempted (preemption failure). In soft failures, we include the auto-bandwidth failures (AutoBw failure) where a tunnel maintains its path but fails to resize.

• Failure Impact: The failure impact represents the amount of traffic that is affected by the failure.
Again, when auto-bandwidth is enabled, the impact is reflected by the tunnel’s bandwidth
reservation.

• Tunnel Preemptions: There are four causes for a low-priority tunnel to be preempted by a higher-priority tunnel: (i) high-priority tunnel setup or re-establishment (ReEstabl preemption), (ii) high-priority tunnel reroute because of a topology change (OSPF preemption), (iii) high-priority tunnel
bandwidth adjustment (AutoBw preemption), (iv) high-priority tunnel reroute because of a
reoptimization event (ReOpt preemption).
Table 0-II: Simulation Observables
Observable           | Sub-Classification Based on Root-Cause
Tunnel Reroutes      | OSPF, AutoBw, ReOpt, preemption
Reroute Impact       | OSPF, AutoBw, ReOpt, preemption
Tunnel Failures      | ReEstabl, OSPF, preemption, AutoBw
Failure Impact       | ReEstabl, OSPF, preemption, AutoBw
Tunnel Preemptions   | ReEstabl, OSPF, AutoBw, ReOpt
1.5.2.4 Verification Results
We verify the accuracy of the tunnel visibility system by comparing the TOTEM output with the various
real network measurements collected during the simulated period. The verification process includes
multiple steps. First, we verify the AutoBw Simulator by comparing the simulated auto-bandwidth events
with the real auto-bandwidth events. Then, we verify the SPF and CSPF algorithms by comparing the
TOTEM tunnel paths with the real tunnel paths. Finally, we verify the TOTEM functionality as a whole by
comparing the aggregate values of the TOTEM observables per day with the corresponding values
inferred from the real network measurements.
We find that the auto-bandwidth mechanism cannot be accurately simulated unless fine-grained traffic
volume measurements are available. Given our dataset, the AutoBw Simulator predicts 80% of the real
bandwidth adjustments. The SPF and CSPF algorithms are deterministic except for their tiebreaking
process that results in 5% unpredictable paths. These two factors, along with the tunnel reoptimization
order analyzed next, challenge the offline tunnel path calculation. However, the general levels of tunnel
activity, given by the total number of reroutes and failures within a day, match the levels predicted by the
tunnel visibility system.
AutoBw Simulator: We compare the number of auto-bandwidth adjustment events observed in the
SYSLOG messages with the number of auto-bandwidth adjustment events generated by the AutoBw
Simulator on a daily basis. We do so only for the month when auto-bandwidth is fully enabled throughout
the network. From the SYSLOG messages, we exclude the underflow and overflow events because the
granularity of the traffic volume measurements does not allow the observation of these events (the auto-bandwidth collection interval is shorter than the measurement collection window). We also note that the
completeness of the traffic volume measurements varies per day. By completeness, we refer to the
number of available traffic volume measurements per day in comparison to a complete dataset (i.e., 288
5-minute measurements for each tunnel in the network).
Figure 3-G illustrates the real auto-bandwidth adjustments, the simulated auto-bandwidth adjustments,
the matching events, and the completeness percentage of our measurements on a daily basis. In
particular, we count a matching event when a SYSLOG bandwidth adjustment matches with a simulated
event in a time window of two adjust-intervals around the real timing of the event. When excluding the
days with low measurement completeness (less than 75%), on average 80% of the real events are also
predicted by the AutoBw Simulator. This percentage increases to 85% on days 26, 27, 28, when the
completeness percentage is almost 100%. Even then, the AutoBw Simulator cannot predict all auto-bandwidth adjustments because (i) it is not aware of the initial tunnel bandwidth reservations, (ii) it does not
know the exact timing of the adjust-intervals for each tunnel, and (iii) the auto-bandwidth mechanism uses more fine-grained measurements than the ones available in our per-tunnel traffic volume dataset.
We conclude that the AutoBw Simulator yields reasonable accuracy given the granularity of the
measurement data that serves as input to the simulator. We also investigate the simulated auto-bandwidth events that do not match with any real auto-bandwidth event, and we find that these events
are very close to the thresholds that trigger an adjustment. This is justified because these events may
pass the auto-bandwidth thresholds in simulation but not in reality. It is for these borderline cases that
we need fine-grained measurements in order to accurately predict the behavior of the mechanism.
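The matching criterion used above can be stated precisely with a small sketch; timestamps are in seconds, and the interpretation of the two-adjust-interval window (plus or minus one adjust-interval around the real event) is our own formalization of the description above.

def count_matches(real_events, sim_events, adjust_interval_s=300):
    # real_events, sim_events: lists of (timestamp, tunnel_name) bandwidth-adjustment events.
    half_window = adjust_interval_s          # +/- one adjust-interval: a two-interval window overall
    unmatched = list(sim_events)
    matches = 0
    for ts, tunnel in real_events:
        for cand in unmatched:
            if cand[1] == tunnel and abs(cand[0] - ts) <= half_window:
                matches += 1
                unmatched.remove(cand)       # each simulated event can match at most one real event
                break
    return matches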
Figure 3-G: Verification of the AutoBw Simulator. We show the real bandwidth adjustments as inferred
by the SYSLOG messages, the bandwidth adjustments predicted by the AutoBw Simulator, and the
bandwidth adjustment events which are present in both datasets. We also show the traffic volume
measurement completeness.
SPF and CSPF algorithms: The second step towards verifying the tunnel visibility system compares the
routing algorithm implementations. First, we verify the SPF implementation by comparing the TOTEM tunnel paths with the OSPF paths for 30 snapshots during the first month. During this time, auto-bandwidth is not fully enabled and the routers signal nominal bandwidth reservations. Because of the
static nominal bandwidth reservations, tunnels take the shortest path to their destination routers. We
obtain these paths from the network-specific OSPF monitoring tool. In the case of ECMP, the tool returns
multiple paths. We consider it a match when the TOTEM path matches one of the ECMP paths. We find
that the TOTEM paths match the OSPF paths in 100% of the cases. Therefore, the SPF algorithm implementations in the
operational environment and in the tunnel visibility system are consistent.
Then, we compare the CSPF implementations by comparing the TOTEM tunnel paths with the real paths
that tunnels take in the network, as given by the tunnel path measurements, for 30 snapshots during the
second month when auto-bandwidth is fully enabled. We assume that each tunnel holds a bandwidth
reservation equal to the tunnel’s traffic volume at the time of the snapshot. Again, even though auto-bandwidth is enabled, tunnels are expected to take the shortest path to their destination routers because
the network has enough spare capacity to accommodate multiple concurrent network failures. Thus, all
tunnel bandwidth requirements can be fulfilled by the shortest paths.
Figure 3-H shows the percentage of the path matches, the path mismatches, and the paths that we do
not verify because the tunnel path measurements do not include the corresponding information. On
average, 95% of the tunnel paths estimated by TOTEM coincide with the real tunnel paths in the
network for the time snapshots under investigation. We look into the path mismatches and we find that
the path reported by TOTEM and the path extracted from the tunnel path measurements are equal-cost
paths. This is justified because the first CSPF tiebreaker depends on the exact tunnel reservations in the
network. The simulated and real tunnel reservations do not match exactly because of the granularity of
the traffic volume measurements and the assumption we make for the initial tunnel reservations. Thus,
the CSPF algorithm implementations are consistent but, for this network, the CSPF tiebreaking in
conjunction with dynamic traffic-engineering reduces the determinism of the tunnel setup by up to 5%.
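To make the tiebreaking discussion concrete, the following is a minimal sketch of constrained shortest-path selection in the widest-shortest-path style: links that cannot fit the requested bandwidth are pruned, the shortest path is computed on what remains, and ties between equal-cost paths are broken in favor of the path with the larger bottleneck of available bandwidth, in the spirit of the CSPF tiebreaker of Section 1.4.1. This is an illustration, not the TOTEM or router implementation.

import heapq

# Illustrative CSPF sketch (not the TOTEM or router code): links that cannot
# fit the requested bandwidth are pruned, and Dijkstra runs on labels
# (path cost, -bottleneck bandwidth) so that, among equal-cost feasible paths,
# the one with the larger bottleneck of available bandwidth wins.
def cspf(links, src, dst, demand):
    # links: dict (u, v) -> {"cost": additive metric, "avail": available bandwidth}
    adj = {}
    for (u, v), a in links.items():
        if a["avail"] >= demand:  # constraint: the link must fit the reservation
            adj.setdefault(u, []).append((v, a["cost"], a["avail"]))

    heap = [(0, -float("inf"), src, [src])]  # (cost, -bottleneck, node, path)
    settled = set()
    while heap:
        cost, neg_bw, node, path = heapq.heappop(heap)
        if node in settled:
            continue
        settled.add(node)
        if node == dst:
            return path
        for nxt, link_cost, link_bw in adj.get(node, []):
            if nxt not in settled:
                bottleneck = min(-neg_bw, link_bw)
                heapq.heappush(heap, (cost + link_cost, -bottleneck, nxt, path + [nxt]))
    return None  # no feasible path: the tunnel cannot be placed

# Two equal-cost paths from R1 to R4; the one through R3 has more spare bandwidth.
links = {
    ("R1", "R2"): {"cost": 1, "avail": 6.0}, ("R2", "R4"): {"cost": 1, "avail": 6.0},
    ("R1", "R3"): {"cost": 1, "avail": 9.0}, ("R3", "R4"): {"cost": 1, "avail": 9.0},
}
print(cspf(links, "R1", "R4", demand=2.0))  # ['R1', 'R3', 'R4']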
Figure 3-H: Verification of the CSPF algorithm.
TOTEM functionality: The last step for verifying that the tunnel visibility system yields behavior similar to that of the real network involves comparing the number of reroutes, failures, and preemptions observed in
the TOTEM simulations with the same numbers as extracted from the real network measurements.
Specifically, we infer the real values of the three metrics from the MPLS-TE SYSLOG messages and the
tunnel path measurements. Given that the simulated auto-bandwidth events include events that do not
match real auto-bandwidth events, we do not expect these numbers to match 100%.
Our analysis includes aggregate values over the course of a day for the month when auto-bandwidth is
enabled throughout the network. We exclude a few days when our measurement dataset is less than
75% complete. We elaborate on the challenges involved in collecting complete measurement datasets in
Section 1.5.4. Finally, we note that we run the simulations using five different sequences of tunnel
reoptimization. We analyze the impact of this parameter next. In the rest of this section, when we
present TOTEM numbers, we present the average numbers over all five simulation sets.
Reroutes: In Figure 3-I, we plot (i) the number of SYSLOG messages noting reroutes, (ii) the number of
tunnel path changes observed in the tunnel path measurements, and (iii) the number of TOTEM reroutes.
We make the following observations: 1) TOTEM reroutes exhibit the same trend as the reroutes in the real network, but TOTEM overestimates the number of reroutes. 2) The number of SYSLOG messages and the number of tunnel path changes are very close to each other but do not coincide. This shows that neither of these measurement datasets is 100% complete.
Figure 3-I: Normalized number of reroutes per day in reality and in simulation. The values are normalized to the maximum value, TOTEM reroutes on day 17.
Failures: In Figure 3-J, we plot (i) the number of SYSLOG messages noting when a tunnel is down, and
(ii) the number of TOTEM tunnel failures. We observe that TOTEM overestimates the number of tunnel
failures but follows the actual network behavior. We further investigate the tunnel failures predicted by
TOTEM and we find that the vast majority of these tunnel failures occur because of router maintenance
operations and not because of insufficient capacity. When a router is taken off the OSPF topology, all
tunnels originating from and destined to this router go down. However, the tunnel traffic is not impacted
because it has already been rerouted to other tunnels by routing mechanisms external to the MPLS
domain. Nevertheless, in this case, the router under maintenance may not report the tunnel failures to
the SYSLOG server. This likely explains the discrepancy between the tunnel failures predicted by TOTEM
and the ones reported by the SYSLOG messages.
Figure 3-J: Normalized number of failures per day in reality and in simulation. The values are
normalized to the maximum value, TOTEM failures on day 15.
Preemptions: In Figure 3-K, we plot (i) the number of SYSLOG messages noting when a tunnel has been
preempted, (ii) the number of TOTEM tunnel preemptions, (iii) the number of SYSLOG messages noting
when a tunnel has been fast-rerouted (FRR). We explain why we show the FRR SYSLOG messages next.
We observe that tunnel preemptions are not frequent. Also, we observe that the preemptions projected
by TOTEM do not match with the preemptions observed in the real network on days 9 and 28. On these
same days, we observe a high number of SYSLOG FRR messages. We find that this happens because
TOTEM misses the preemptions caused by FRR. In detail, when a tunnel is fast-rerouted, it may preempt
lower-priority tunnels on the backup path. However, since we do not simulate FRR, TOTEM
underestimates the number of preemptions that take place after links go down. Additionally, we note that
the number of preemptions varies significantly per simulation set. Note that each simulation set runs with
a different tunnel reoptimization order. We present more details on this in the following Section 1.5.3.
Figure 3-K: Normalized number of preemptions per day in reality and in simulation. The values are
normalized to the maximum value, SYSLOG FRR alarms on day 9.
To summarize, the tunnel visibility system provides good estimates of the total number of tunnel reroutes
that take place in a specified time window. But it does not provide good estimates of the total number of
tunnel failures and preemptions because it does not simulate the FRR mechanism. However, the FRR
mechanism can be easily included in the tunnel visibility system for the purpose of investigating the fault-tolerance of the network. Next, we investigate the impact of the tunnel reoptimization order on the
network-wide tunnel dynamics.
1.5.3 Tunnel Reoptimization Order
After verifying the accuracy of the tunnel visibility system, we investigate the impact of the tunnel
reoptimization order on the MPLS deployment. The order that the per-tunnel reoptimization timers expire
is a factor that cannot be controlled by the network operators. This order could potentially be defined by
a centralized system that controls the tunnel setup in a network-wide manner.
We investigate the impact of the tunnel reoptimization order by running the simulation of the two-month
period with five different tunnel reoptimization orders: i) three different random orders (O1-random, O2-
random, O3-random), ii) before each reoptimization, TOTEM orders the tunnels based on their priority
and bandwidth reservation (O4-minFirst and O5-maxFirst). In O4-minFirst, the tunnel with the lowest
priority and lowest bandwidth reservation is reoptimized first and so on. In O5-maxFirst, the tunnel with
the highest priority and highest bandwidth reservation is reoptimized first and so on. O4-minFirst and O5-
maxFirst assume that there is a centralized system with knowledge of all the tunnel priorities and
reservations and that it controls the reoptimization process throughout the network.
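A minimal sketch of how the five orders can be generated is shown below; the tunnel attributes and the convention that a lower numeric priority value denotes a more important tunnel (as in MPLS-TE setup/holding priorities) are assumptions for illustration.

import random
from collections import namedtuple

# Sketch of the five reoptimization orders used in the simulations. A tunnel
# is assumed to expose .priority and .bandwidth attributes, and we assume the
# MPLS-TE convention that a lower numeric priority value means a more
# important tunnel; both are assumptions for illustration.
def reoptimization_order(tunnels, scheme, seed=None):
    if scheme in ("O1-random", "O2-random", "O3-random"):
        order = list(tunnels)
        random.Random(seed).shuffle(order)  # an independent random order per scheme/seed
        return order
    if scheme == "O4-minFirst":  # least important, smallest reservation first
        return sorted(tunnels, key=lambda t: (-t.priority, t.bandwidth))
    if scheme == "O5-maxFirst":  # most important, largest reservation first
        return sorted(tunnels, key=lambda t: (t.priority, -t.bandwidth))
    raise ValueError("unknown scheme: " + scheme)

Tunnel = namedtuple("Tunnel", "name priority bandwidth")
demo = [Tunnel("t1", 7, 50.0), Tunnel("t2", 0, 900.0), Tunnel("t3", 7, 200.0)]
print([t.name for t in reoptimization_order(demo, "O5-maxFirst")])  # ['t2', 't3', 't1']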
In Figure 3-L, we present the impact of the tunnel reoptimization order on the number of reroutes, the
reroute impact, and the number of preemptions. Tunnel failures are not impacted by the tunnel
reoptimization order because tunnels fail due to topology changes that effectively are maintenance
operations. In these cases, the head-end router cannot find any path to the tunnel’s destination as the
destination is not available. Therefore, we do not include tunnel failures in our results. In detail, we show
values that are normalized to the average value of the metric for the three random orders O1-random,
O2-random, O3-random. We make the following observations:
1. In terms of tunnel reroutes, O4-minFirst and O5-maxFirst do not significantly differ from the
random orders O1-random, O2-random, O3-random. O4-minFirst only causes 1% more reroutes
than the random orders, and O5-maxFirst only causes 2% fewer reroutes than the random orders.
Also, we observe that the reroute impact is not affected by the tunnel reoptimization order. This
is justified in the network under analysis as follows: there are only a few large high-priority tunnels that, unless they are reoptimized first, cause the preemption – and reroute – of lower-priority tunnels. On a high level, we conjecture that in a network with sufficient spare capacity no tunnel reoptimization order is particularly harmful or beneficial to the network in terms of increased or reduced tunnel dynamics.
Figure 3-L: Impact of the tunnel reoptimization order in the backbone network.
2. The tunnel reoptimization order mostly affects the number of preemptions. O5-maxFirst results in
57% fewer preemptions. Hence, O5-maxFirst has a significant impact on preemptions. Placing
high-bandwidth high-priority tunnels first avoids the preemption of already placed low-priority
tunnels. In this network, only a few preemptions take place because there is enough capacity to
accommodate both high-priority and low-priority traffic. So, the preemption reroutes (low-priority
tunnels that reroute after being preempted) are low in absolute number. The preemption reroutes
account for the 2% reduction in tunnel reroutes.
Overall, we find that, for this network, the tunnel reoptimization order does not have significant impact
on the tunnel dynamics. However, we need to take into account that this network is overprovisioned for
fault-tolerance. In other words, the network is provisioned with additional capacity in order to
accommodate multiple concurrent link or node failures. This means that, when there are no network
failures, capacity constraints rarely come into play and constraint-based routing approximates shortest-path-first routing.
We extend our analysis to non-overprovisioned networks (nop networks) and investigate how the
reoptimization order affects the tunnel activity in the presence of stringent capacity constraints. We run
the two-month simulations with the five different orders on two what-if network topologies: (i) nop75:
we reduce the capacity of all the links in the network by 25%. This results in a network with 75% of its
initial capacity. (ii) nop50: we reduce the capacity of all the links in the network by 50%. This results in
a network with 50% of its initial capacity.
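Deriving the what-if topologies amounts to scaling every link capacity by a constant factor, as in the following sketch (the link dictionary layout and values are illustrative):

# Sketch of deriving the what-if topologies from the baseline: every link
# capacity is scaled by a constant factor (0.75 for nop75, 0.50 for nop50).
baseline_links = {("R1", "R2"): {"capacity": 10000.0, "cost": 1}}  # capacity in Mbps, illustrative

def scale_capacities(links, factor):
    return {link: {**attrs, "capacity": attrs["capacity"] * factor}
            for link, attrs in links.items()}

nop75 = scale_capacities(baseline_links, 0.75)  # 75% of the initial capacity
nop50 = scale_capacities(baseline_links, 0.50)  # 50% of the initial capacity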
Figure 3-M illustrates the impact of the tunnel reoptimization order on the number of reroutes, the
reroute impact, and the number of preemptions in the nop75 and nop50 networks. We make the
following observations:
1. The impact of the O4-minFirst and O5-maxFirst orders on the number of reroutes increases in
the less overprovisioned networks. In detail, O5-maxFirst results in 8% less tunnel reroutes in the
nop50 network. This is because the number of high-priority tunnels that can cause preemptions
unless they are placed first increases. In this way, preemption reroutes decrease even more when
capacity constraints become tighter. Also, the reroute impact is not affected by the tunnel
reoptimization order in the non-overprovisioned networks. This is intuitive since the number of
reroutes does not change radically.
2. The impact of the O5-maxFirst order on the number of preemptions decreases in the less
overprovisioned networks. With O5-maxFirst, the number of preemptions decreases by 53% in
the real network, by 42% in the nop75 network, and by 20% in the nop50 network. This is
because, in the non-overprovisioned networks, preemptions caused by auto-bandwidth
adjustments increase significantly and account for the majority of the preemptions in the
network. However, these preemptions cannot be avoided by the O5-maxFirst order.
Figure 3-M: Impact of the tunnel reoptimization order in the (a) nop75 and (b) nop50 networks.
1.5.4 Discussion
In this section, we discuss the availability of complete network measurement datasets, the determinism
of the tunnel setup in a network practicing dynamic traffic-engineering, and the impact of the tunnel
reoptimization order on the tunnel dynamics.
Running and verifying the tunnel visibility system requires a wide set of measurement data. Although this
data is available, we find that the measurement data completeness remains a challenge. Measurement
accuracy affects both the fidelity with which the tunnel visibility system simulates the network and the
subsequent verification of the system. For example, the SYSLOG data source is known to be an unreliable
source of information about the router’s operation. Network measurement is a low-priority task for the
CPU of network routers. Also, SYSLOG messages are sent via UDP to the SYSLOG server. Thus, it is
possible that these messages are not delivered to the SYSLOG server especially when there is a network
failure. Overall, our experiences with the application of the tunnel visibility system to a large-scale
production network reveal that network monitoring is an increasingly important network management
operation. It is important for both the offline path calculation (as implemented by the tunnel visibility
system) and the online tunnel path retrieval (i.e., tunnel path measurements). However, detailed network
monitoring remains difficult even for networks that are well-instrumented to collect network
measurements.
We find that the SPF and CSPF algorithms are fairly deterministic and allow for an accurate projection of
the network-wide tunnel setup. In detail, less than 5% of the tunnel paths have equal-cost alternatives,
in which case the tunnel path prediction is challenging. When bandwidth reservations are dynamic, it is
hard to predict which path will be chosen among the equal-cost paths because we need to have an
accurate view of the bandwidth reservations on the equal-cost paths in order to pick the one with the
maximum available bandwidth. This is according to the CSPF tiebreaking process described in Section
1.4.1. Therefore, dynamic bandwidth reservations result in a small percentage of non-deterministic tunnel
paths and reduce the visibility of the tunnel setup.
Finally, we quantify the impact of the tunnel reoptimization order. This order represents the different
timing of events in a network where the head-end routers perform their MPLS functionality in an
independent fashion but with common knowledge of the network state (assuming that all routers are up-to-date with respect to the network topology and the bandwidth reservations). We find that the total
number of tunnel reroutes in a two-month period can vary by 8% because of the tunnel reoptimization
order. For network operations this means that: (i) There is no significant benefit in having a centralized
control of the tunnel reoptimization order. (ii) Although the tunnel reoptimization order does not have
high impact on the aggregated tunnel dynamics, it still reduces the network-wide tunnel visibility because
different tunnel reoptimization orders result in different tunnel reroutes and different tunnel setups.
In detail, we find that the better “packing” of the tunnels implemented by O5-maxFirst reduces the
number of preemptions in the network. But the total number of preemptions in the overprovisioned
network is very low. The tunnel reoptimization order becomes more important as capacity constraints
become tighter. Tunnel reroutes exhibit more variability, but the aggregate effect on the tunnel
dynamics is not significant. Nevertheless, the different reoptimization orders result in different tunnel
setups that cannot be predicted by a network management system.
Overall, our results show that the tunnel visibility system closely follows the real network operation but it
cannot provide the exact setup of the tunnels in the network. Auto-bandwidth causes tunnel
reservations to become dynamic. Unless the tunnel visibility system has an accurate view of the tunnel
output rates and the timing of the auto-bandwidth and reoptimization timers, it cannot reproduce
the exact reservation requests. On the other hand, the predicted tunnel dynamics closely follow the
tunnel dynamics in the real network. Thus, we can use the system to perform further analysis of the
dynamic traffic-engineering.
1.6 Dynamic Traffic-Engineering
In this section, we investigate the dynamic traffic-engineering implemented by the MPLS technology and
the auto-bandwidth option. In particular, we focus on load-based routing. In load-based routing, the
routing of traffic depends on the traffic demands. The auto-bandwidth option is an operational
implementation of load-based routing. When auto-bandwidth is enabled, the bandwidth of a tunnel
dynamically adjusts to the traffic demands. Then, the path of the tunnel may change in order to satisfy the
new bandwidth requirement.
In general, a tunnel may change its path when the network topology changes or when its configuration
changes. Network topology changes occur when there is a failure, planned maintenance of a network
component, or addition of a network component. Tunnel configuration changes occur when there is a
change in network requirements or objectives. For example, a network provider may split the traffic into
two classes of service and route each service type on a tunnel of different priority in order to provide
different levels of QoS. Although network topology and configuration changes occur often, they are
generally not recurring (with the exception of a network component with unstable behavior).
Dynamic traffic-engineering is equivalent to tunnel configuration changes because the tunnel’s signalled
bandwidth, which is normally specified statically in configuration, changes according to the tunnel’s incoming traffic.
So with dynamic traffic-engineering a tunnel may also change its path when its traffic volume changes.
Common reasons for a change in a tunnel’s traffic volume are: (i) changes in end-user behavior (e.g.,
diurnal pattern), (ii) remote network failures (e.g., eBGP session failure), (iii) remote routing policy
changes or remote traffic-engineering. Figuring out the cause of a tunnel volume change is cumbersome,
if at all possible, because it depends on multiple factors that are external to a network’s operation.
However, it is important that the tunnel path changes caused by the traffic volume changes do not occur
with high frequency and that the side-effects of the dynamic traffic-engineering practice are not harmful
to the network operation.
Our goal is to investigate the impact and potential risks of dynamic traffic-engineering on a large-scale
MPLS deployment. In detail, we investigate the impact of the auto-bandwidth mechanism on the
network-wide tunnel dynamics. We run simulations in order to monitor the tunnel reroutes and failures
and classify them according to their root-cause. In this way, we can isolate the auto-bandwidth effect.
We note that real tunnel measurements do not provide such information. Additionally, we run simulations
in order to investigate the impact of the auto-bandwidth mechanism in networks with stringent capacity
constraints but with adequate restoration capacity. This is essential because the ultimate goal of dynamic
traffic-engineering is to provide load-balancing in highly-utilized networks. Finally, we note a few other
issues that are related to dynamic traffic-engineering and should be considered when investigating the
safety of auto-bandwidth: (i) responsiveness: the behavior of the auto-bandwidth mechanism in the case
of abrupt traffic shifts, (ii) stability: the non-recurring adjustment of a tunnel’s bandwidth.
Our analysis reveals the magnitude of the side effects that auto-bandwidth has on tunnel dynamics. We
find that, as the network’s capacity constraints become tighter, both auto-bandwidth and preemption
reroutes increase. The reoptimization reroutes are also indirectly caused by the auto-bandwidth
mechanism. The dynamic tunnel reservations change the available bandwidth in the network. So, when
the head-end routers reoptimize their tunnels’ paths, they may find newly available bandwidth on shorter
paths. However, we find that reoptimization reroutes rarely take place even in networks with less spare
capacity. Then, the tunnels that are rerouted due to auto-bandwidth become increasingly large and
preempt an increasing number of smaller lower-priority tunnels. Although this is expected, this
phenomenon becomes unexpectedly intense with respect to the decrease in network capacity. Finally, the
auto-bandwidth failures also increase rapidly in the networks that have less spare capacity.
In terms of auto-bandwidth responsiveness, we find that the mechanism exhibits delayed reaction to
extreme traffic volume variations because of a vendor-specific output rate calculation algorithm. Finally,
in terms of auto-bandwidth stability and given the current end-to-end traffic-engineering techniques, we
observe that the mechanism is stable and does not cause high-frequency bandwidth adjustments. But
this stability is not guaranteed if end-to-end traffic-engineering becomes more sophisticated and adjusts
to traffic path changes.
1.6.1 Auto-Bandwidth in Non-Overprovisioned Networks
We investigate the tunnel reroutes and failures that are directly or indirectly caused by the bandwidth
adjustment of the tunnels in networks with different capacity constraints. Auto-bandwidth can directly
cause the reroute of a tunnel by changing its bandwidth reservation. But it can also indirectly cause the
reroute of a lower-priority tunnel in the following way: a higher-priority tunnel changes its path due to a
bandwidth adjustment and preempts a lower-priority tunnel on its new path. Then, the lower-priority
tunnel reroutes to an alternate path.
Specifically, a tunnel with volatile bandwidth requirements can cause multiple tunnel path changes. We
illustrate an example of multiple tunnel path changes caused by a single tunnel flap, a tunnel bandwidth
increase and a subsequent tunnel bandwidth decrease to its initial value. In Figure 3-N, the bandwidth of
tunnel B decreases and stays on the same path since this remains its best option. This bandwidth
adjustment causes tunnel A to migrate after its reoptimization timer expires because path R8-R2-R3-R4-
R9 is shorter and can now satisfy its bandwidth requirement. When tunnel A migrates to path R8-R2-R3-R4-R9, it preempts lower-priority tunnels X and Y, which get rerouted to path R8-R2-R6-R7-R4-R9. The bandwidth of tunnel B increases again and this results in tunnel B's migration to a longer path with
sufficient bandwidth (path R1-R2-R6-R7-R4-R9). Again, this bandwidth adjustment causes the
preemption and reroute of tunnels X and Y. The bottom line is that the variation in bandwidth
requirements of a single tunnel can trigger multiple tunnel path changes and have a network-wide effect.
Given this observation, it is worth investigating the impact of multiple realistic bandwidth adjustments on
the network-wide tunnel setup.
Before investigating the dynamics introduced by auto-bandwidth in a real operational environment, we
emphasize the impact of a tunnel path change on data plane performance. Tunnel path changes
follow the make-before-break paradigm: tunnels are torn down only after a new path has been found
so that packet loss is avoided. However, tunnel path changes can have an impact on TCP
performance. In particular, a change from a longer path to a shorter path can cause packet reordering to
the flows that are routed through the affected tunnel. Previous works reveal that packet reordering can
Figure 3-N: Example of the tunnel path changes caused by a single bandwidth flap. The bandwidth of
tunnel B (red) changes from 0.7 to 0.5 to 0.7. Tunnels A (blue) and B (red) have the same priority.
Tunnels X and Y (green) have lower priority than tunnels A and B.
be very harmful to application throughput [18][19]. Tunnel preemptions follow the break-before-make
paradigm: preempted tunnels are first torn down and then re-established on an alternative path, if one is
found. Thus, lower-priority traffic experiences higher disruption during tunnel path changes.
Next, we analyze the effect of auto-bandwidth in the tier-1 backbone network. We perform our analysis
by simulating a realistic tunnel setup for a two-month period: the simulation includes real topology
changes and real traffic volume changes. By only using tunnel state measurements, we cannot isolate the
cascading effect of the auto-bandwidth events on the tunnel dynamics. In detail, inferring the root-cause
of the measured tunnel path changes is a cumbersome task whose accuracy depends heavily on the
availability of multiple network measurement data. Instead, we perform simulations and present the
impact of auto-bandwidth on tunnel reroutes, failures, and preemptions.
Furthermore, given that the network under analysis is overprovisioned, we also run simulations for the
real network (baseline) but with 25% reduced capacity (nop75) and 50% reduced capacity (nop50). This
analysis is essential in order to fully investigate dynamic traffic-engineering. Without stringent capacity
constraints, CSPF approximates SPF and the impact of auto-bandwidth is low. Finally, we note that the
numbers presented below are averaged over the five tunnel reoptimization orders ( O1-random, O2-
random, O3-random, O4-minFirst, O5-maxFirst).
In a nutshell, we find that, as network capacity decreases by 50%, (i) auto-bandwidth causes a 41% increase in tunnel reroutes, (ii) the additional tunnel reroutes are mostly a side-effect of the auto-bandwidth mechanism, i.e., they are preemption reroutes and not auto-bandwidth reroutes, (iii) the total amount of traffic that is rerouted increases by three times, (iv) the auto-bandwidth failures increase dramatically, leading to suboptimal tunnel setups, and (v) the number of preemptions caused by bandwidth adjustments increases by 35 times.
1.6.1.1 Reroutes
Tunnel reroutes are caused by: (i) topology changes (OSPF changes), (ii) auto-bandwidth adjustments,
(iii) periodic reoptimization, and (iv) preemptions. We elaborate on the tunnel preemptions in Section
1.6.1.3. Note that reoptimization reroutes are indirectly caused by auto-bandwidth adjustments in the
following way: auto-bandwidth adjustments, independent of whether they cause the tunnel to reroute or
not, change the available bandwidths on the network paths. Therefore, when the head-end router
reoptimizes a tunnel’s path, it may find a new shorter path because some auto-bandwidth adjustments
freed up previously reserved bandwidth.
In Figure 3-O, we present normalized numbers of tunnel reroutes throughout the two-month period for
the real network (baseline) and the non-overprovisioned networks (nop75 and nop50). We make the
following observations:
1. Decreasing the capacity of the network by 25% increases the number of reroutes by 10%, but decreasing the capacity of the network by 50% increases the number of reroutes by 41%. The increase in reroutes is mostly attributed to auto-bandwidth. In detail, in the nop50 network, OSPF reroutes increase by 15%, AutoBw reroutes increase by 15 times, and preemption reroutes increase by 24 times. Most of the preemption reroutes are attributed to auto-bandwidth (see Section 1.6.1.3). ReOpt reroutes are negligible because AutoBw reroutes – which can cause further ReOpt reroutes – do not occur often.
Figure 3-O: Number of reroutes per root cause for the baseline, nop75, and nop50 networks. The numbers are normalized to the number of reroutes in the baseline network.
2. Although AutoBw reroutes account only for a small percentage of all the reroutes, the auto-bandwidth mechanism accounts for an increasing number of tunnel reroutes as network capacity decreases. AutoBw and preemption reroutes account for 1% of all the reroutes in the baseline network, 4% in nop75, and 19% in nop50. This shows that the auto-bandwidth mechanism has a cascading
effect on the tunnel dynamics that increases non-linearly as network capacity decreases.
Next, we analyze the impact of the tunnel reroutes. The larger the tunnel that is rerouted, the higher the
impact of the reroute on the data plane. In order to accurately estimate the data plane impact, we would
need to perform packet-based simulations. We leave this as future work. We focus on the reroute impact,
that is, the total bandwidth of the rerouted tunnels. In Figure 3-P, we observe that:
1. In the baseline network, 5% of the reroute impact is attributed to the AutoBw, ReOpt and
preemption reroutes. In detail, we find that the AutoBw reroutes account for 4% of the rerouted
traffic. This percentage is high given that the AutoBw reroutes account for less than 1% of all
reroutes. Hence, the tunnels that are rerouted due to an auto-bandwidth adjustment are large.
This is because the large tunnels have higher probability of not fitting in their path when their
bandwidth reservation changes.
2. When capacity constraints become tight, auto-bandwidth has tremendous impact on the amount
of traffic that changes paths within the network. As capacity decreases to 50%, the reroute
impact increases by three times, and this increase is entirely attributed to auto-bandwidth. The
tunnels that are rerouted due to an auto-bandwidth adjustment become on average three times
larger. We derive this number by dividing the AutoBw reroute impact by the number of AutoBw
reroutes in the baseline and nop50 networks and by comparing the two average tunnel sizes.
Figure 3-P: Total bandwidth of rerouted tunnels per root cause for the baseline, nop75, and nop50
networks. The numbers are normalized to the reroute impact in the baseline network.
3. As capacity decreases to 50%, the tunnels that are rerouted due to preemption reduce to 85% of
their average size in the baseline network. Thus, smaller tunnels are preempted to accommodate
the high-priority tunnel reroutes.
1.6.1.2 Failures
We investigate the impact of auto-bandwidth on tunnel failures. We divide failures into the following
categories: (i) OSPF failures, topology changes can cause tunnels to not find a path to their destination,
(ii) preemption failures, preempted tunnels cannot find an alternative path because of insufficient
capacity, (iii) auto-bandwidth failures, tunnels fail to resize because there is no alternative path to
accommodate their adjusted bandwidth reservation. OSPF and preemption failures are hard failures,
whereas the auto-bandwidth failures are soft failures. In Figure 3-Q, we present the normalized number
of failures subcategorized per root-cause for the baseline, nop75, and nop50 networks. We observe the
following:
Figure 3-Q: Number of failures per root cause for the baseline, nop75, and nop50 networks. The
numbers are normalized to the number of failures in the baseline network.
1. The vast majority of tunnel failures in the baseline network occur because the tunnel destination
router is removed from the topology. Whenever this is the result of a maintenance operation,
tunnel traffic is not impacted because, prior to bringing down the router, traffic is rerouted
towards another destination router.
2. Decreasing the capacity of the network by 25% barely affects the number of failures but
decreasing the capacity of the network by 50% increases the number of failures by 9%. Thus,
the total reduction in capacity does not cause a significant increase in failures because the
network still has spare capacity to accommodate failures. Instead, the capacity reduction
increases the tunnel reroutes and the size of the rerouted tunnels.
3. As the network capacity decreases by 50%, hard failures increase by 2%. The OSPF failures are
not impacted because they occur when a tunnel cannot find a path to its destination because of a
maintenance event. However, the preemption failures increase by almost 10 times because there
is not enough capacity to accommodate the lower-priority traffic. The preemption failures reflect
the cost of dynamic traffic-engineering. Lower-priority traffic is disrupted when the network does
not have enough capacity to accommodate the whole traffic matrix.
Figure 3-R: Total bandwidth of failed tunnels per root cause for the baseline, nop75, and nop50
networks. The numbers are normalized to the failure impact in the baseline network.
4. As the network capacity decreases by 50%, we observe a considerable increase in soft failures.
In the nop50 network, the AutoBw failures account for 6% of the failures whereas they barely
occur in the baseline network. Hence, AutoBw failures illustrate a considerable increase when
capacity constraints become stringent.
Finally, in Figure 3-R, we present the normalized impact of failures subcategorized per root-cause for the
baseline, nop75, and nop50 networks. We observe the following:
1. As network capacity decreases by 50%, the failure impact increases by a factor of 6. OSPF and
preemption failures account for almost 1/4 of this increase, but AutoBw failures are responsible for most of the increase in failure impact. As already mentioned, these tunnels are not torn down; only part of their incoming traffic is served by the bandwidth reservation. The failed tunnel bandwidth attributed to the AutoBw failures is the portion of the tunnel traffic that flows outside of the tunnel reservation because the tunnel failed to adjust to its new bandwidth requirement. This is effectively the requested (new) bandwidth reservation minus the old bandwidth reservation.
2. In the nop50 network, the failure impact of the preempted tunnels (low-priority tunnels that do
not find a path after preemption) increases by 17 times, and the preempted tunnels are two times larger than the ones in the baseline network. The impact of preemption failures shows the trade-off between reduced network capacity and serving the low-priority traffic even in the presence of
network failures.
1.6.1.3 Preemptions
Low-priority tunnels can be preempted when high-priority tunnels are (i) initially established or re-established after a failure (ReEstabl preemption), (ii) rerouted due to a topology change (OSPF preemption), (iii) rerouted due to an auto-bandwidth adjustment (AutoBw preemption), or (iv) rerouted due to tunnel reoptimization (ReOpt preemption). The auto-bandwidth mechanism causes the last two cases of preemption. In Figure 3-S, we present the normalized number of preemptions subcategorized per root-cause for the baseline, nop75, and nop50 networks. We observe that:
1. A very small number of preemptions take place in the baseline network. This shows that the
network has enough capacity to accommodate both high-priority and low-priority traffic.
2. Preemptions increase by four times in the nop75 network and by 20 times in the nop50 network.
Thus, when the capacity of the network decreases, the number of preemptions increases
dramatically.
3. AutoBw preemptions increase by four times in the nop75 network and by more than 34 times in
the nop50 network in comparison with the baseline network. Hence, the cascading effect of auto-bandwidth on lower-priority traffic increases dramatically.
Figure 3-S: Number of preemptions per root cause for the baseline, nop75, and nop50 networks. The
numbers are normalized to the number of preemptions in the baseline network.
1.6.2 Other Issues with Dynamic Traffic-Engineering
We elaborate on the responsiveness of the auto-bandwidth mechanism to abrupt traffic volume changes
and on the stability of the mechanism in comparison to packet-based load-sensitive routing. First, we
form what-if scenarios in order to project the behavior of the mechanism to extreme traffic volume
variations. We find that, although the parameters that control auto-bandwidth are set in such a way that
the mechanism is not too sensitive to the external traffic volume variations, in the case of extreme traffic
volume variations, the vendor-specific calculation of traffic rates delays the bandwidth adjustments.
Then, in terms of mechanism stability, we observe that a tunnel may experience recurring path changes
if the following happens: the change in the tunnel’s path interacts with external traffic-engineering and
the external traffic-engineering causes a traffic volume change that in turn causes a tunnel path change.
1.6.2.1 Auto-bandwidth Responsiveness
The values of the auto-bandwidth parameters define the sensitivity of the mechanism to traffic volume
changes. The auto-bandwidth mechanism is sensitive when it triggers a bandwidth adjustment even with
a small change in the tunnel’s traffic volume. As with all dynamic adjustment mechanisms, there is a
trade-off between accuracy and adjustment overhead. In particular for auto-bandwidth, the more
sensitive the mechanism is to traffic volume changes, the more accurate the per-tunnel bandwidth
reservations. Then, the more accurate the bandwidth reservations are, the more traffic goes through
paths with guaranteed bandwidth and the network efficiently fulfills its QoS requirements for the traffic
flow. However, the more sensitive the mechanism is to traffic volume changes, the more control signaling
is needed to update the tunnel reservation. This poses processing overhead on the network’s routers.
Given the network’s traffic patterns, topology, and tunnel configuration, parameter tuning analysis leads
to the permissible levels of RSVP signaling in the network. For the auto-bandwidth mechanism, the trade-off is the following: accurate tunnel bandwidth reservations versus excessive RSVP signaling overhead on
the routers. Although we cannot disclose the parameter setting process for the auto-bandwidth
mechanism deployed by this network provider, we do analyze the tunnel path changes observed in the
network. We verify both through simulation and through processing of the real tunnel path
measurements that the tunnel paths do not change with high frequency. However, this holds for the
common daily traffic demands. We do not present further information about the auto-bandwidth
parameters and their impact on the frequency of tunnel path changes because this is proprietary
information for the network under analysis.
Next, we explore the responsiveness of the auto-bandwidth mechanism when there are extreme traffic
shifts. By responsiveness, we refer to the speed with which auto-bandwidth adjusts the bandwidth
reservations of the affected tunnels. We investigate a particular set of what-if scenarios that aim at
projecting the behavior of the mechanism when there are extremely volatile traffic demands.
Extreme traffic shifts can occur for a number of reasons usually associated with interdomain routing.
For example, eBGP session failures can cause traffic to shift from one network ingress point to another.
Such a shift is potentially equivalent to moving traffic between cities. Remote interdomain routing policy
changes can also cause traffic to shift between ingress points. Finally, internal interdomain routing policy
changes can cause traffic to shift from one egress point to another. Although such events do happen
often, they may not manifest themselves as abrupt traffic shifts because, in backbone networks, traffic is highly aggregated. However, even in this network environment, remote interdomain changes
can cause unexpected and sudden changes in the network’s traffic patterns. We explore such events by
running the following set of simulations.
We form three what-if scenario sets: (i) tunnel-to-tunnel traffic shifts: the incoming traffic of tunnel A
shifts to tunnel B, (ii) ingress-to-ingress traffic shifts: the outgoing traffic of head-end router X shifts to
head-end router Y, (iii) egress-to-egress traffic shifts: the incoming traffic of tail-end router X shifts
towards tail-end router Y. In detail, in scenario (i), we pick the tunnel pair such that the head-end routers
are nearby in terms of geographical location and maintain the same tail-end router. In scenario (ii), we
simulate tunnel-to-tunnel traffic shifts for all the tunnels originating at the head-end router and assume
that their traffic is shifted to a geographically nearby head-end router. We do the same for scenario (iii)
but here we simulate tunnel-to-tunnel traffic shifts between nearby tail-end routers. These scenarios are
unlikely to happen but are not impossible. The purpose of the what-if study is to test the auto-bandwidth
mechanism and predict its behavior under stressful conditions.
We insert the traffic shift events at random times in the two-month period and observe the tunnel
dynamics caused by the traffic shift. We note that there is a huge number of traffic shift scenarios that
we can simulate. We choose to simulate the following traffic shifts: (i) for tunnel-to-tunnel traffic shifts,
we pick the tunnels with the largest bandwidth reservations at randomly chosen times, (ii) for router-to-router traffic shifts, we randomly pick one router per day. Table 0-III summarizes the what-if scenarios.
Table 0-III: Extreme Traffic Shift Scenarios
What-If Scenario   | Number | Description
Tunnel-to-tunnel   | 300    | Large-tunnel traffic shifts to another tunnel (nearby ingress, same egress).
Ingress-to-ingress | 61     | Random ingress router picked. Outgoing traffic shifts to nearby ingress router.
Egress-to-egress   | 61     | Random egress router picked. Incoming traffic shifts to nearby egress router.
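A tunnel-to-tunnel shift can be injected into the per-tunnel traffic series as in the sketch below: from the shift time onwards, the traffic of tunnel A is added to tunnel B and tunnel A's series drops to zero. The names and data layout are illustrative, not the simulator's actual input format.

# Sketch of injecting a tunnel-to-tunnel traffic shift into the per-tunnel
# 5-minute traffic series ('series' maps a tunnel id to a list of volumes).
def apply_tunnel_shift(series, src_tunnel, dst_tunnel, shift_index):
    shifted = {t: list(v) for t, v in series.items()}
    for i in range(shift_index, len(shifted[src_tunnel])):
        shifted[dst_tunnel][i] += shifted[src_tunnel][i]  # traffic moves to tunnel B...
        shifted[src_tunnel][i] = 0.0                       # ...and tunnel A drops to zero
    return shifted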
Next, we elaborate on the auto-bandwidth events that follow the traffic shifts. The behavior of the auto-bandwidth mechanism is not straightforward because of the moving average calculation of the tunnel’s
output rate. In detail, let us assume that tunnel A has an output rate of BWA Mbps and tunnel B has an
output rate of BWB Mbps. After the sudden traffic shift, the output rate of tunnel A suddenly drops to
zero and the output rate of tunnel B increases to BWA + BWB Mbps. These are the real output rates for
the tunnels. However, the output rates used as input to the auto-bandwidth mechanism are adjusted by
the moving average algorithm described in Section 1.4.2. We find that this algorithm significantly delays
the adjustment of the tunnel reservations.
In Table 0-IV, we present our results on the auto-bandwidth behavior when running the tunnel-to-tunnel
and the router-to-router traffic shift scenarios. We show (i) the average and maximum time elapsed in
multiples of the collection-interval until the bandwidth reservations adjust to the real traffic volumes after
the shift, and (ii) the average and maximum number of auto-bandwidth adjustments that follow the
traffic shift.
Table 0-IV: Extreme Traffic Shift Results
What-If Scenario (# Scenarios) | Time To Adjust in collection-intervals (Average / Max) | AutoBw Adjustments (Average / Max)
Tunnel-to-tunnel (x300)        | 40 / 49                                                 | 28 / 69
Router-to-router (x122)        | 20 / 34                                                 | 21 / 348
According to our results, the bandwidth reservations of the tunnels involved in the traffic shift adjust only after multiple collection-intervals and auto-bandwidth events. Although a router-to-router traffic shift includes multiple tunnel-to-tunnel traffic shifts, the tunnel-to-tunnel traffic shifts take longer to adjust and require more bandwidth adjustments than the router-to-router traffic shifts. This happens because we simulate traffic shifts of high-bandwidth tunnels, so it takes more time for the large traffic rate to drop to zero.
Also, although not shown in the results, we observe that the tunnel whose bandwidth increases adjusts
faster than the tunnel whose traffic drops. Thus, the most serious effect of the traffic shift is that, until
the bandwidth of the tunnel whose traffic drops to zero adjusts to this low value, there is bandwidth
reserved that is not utilized by traffic. Finally, we note that the mechanism reacts slowly not because of
the collection-interval but because of the vendor-specific moving average algorithm that calculates the
traffic rate of the tunnel.
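Since the vendor-specific rate calculation cannot be disclosed, the following generic moving-average sketch only illustrates why an averaged rate, and therefore the triggered adjustment, lags an abrupt traffic shift by several collection intervals; the window length and traffic values are arbitrary.

# Generic illustration (not the vendor's proprietary algorithm): when the rate
# fed to auto-bandwidth is a moving average of the measured samples, the
# reported rate lags an abrupt shift by several collection intervals, which in
# turn delays the bandwidth adjustment.
def moving_average_rates(samples, window):
    rates = []
    for i in range(len(samples)):
        recent = samples[max(0, i - window + 1): i + 1]
        rates.append(sum(recent) / len(recent))
    return rates

# Tunnel A carries 700 Mbps, then its traffic shifts away entirely.
samples = [700.0] * 6 + [0.0] * 12
for interval, rate in enumerate(moving_average_rates(samples, window=6)):
    print(interval, round(rate, 1))  # the averaged rate needs ~6 intervals to reach zero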
1.6.2.2 Auto-bandwidth Stability
Prior research on packet-based load-sensitive routing [7][8][9] shows that load-sensitive routing can be
unstable. Here, we comment on the stability of flow-based load-sensitive routing with respect to the
instability of the packet-based load-sensitive routing. By instability, we refer to packets with the same
destination constantly changing routes to reach this destination. In flow-based routing, this is equivalent
to tunnel path flapping, that is, a tunnel experiencing recurring path changes.
In packet-based load-sensitive routing, the shortest-path algorithm selects next hops based on a dynamic
metric that reflects the load in the network. Prior work initially focuses on queue length, i.e. the number
of packets queued for transmission on the router’s lines [7]. However, this type of delay measurement
has the following defects: (i) The queue length does not take into account the link propagation delay or
the packet size. Thus, it is not a good indicator of delay. (ii) If the maximum queue length is short, the
metric is again a poor indicator of the delay. (iii) There is a significant real-time fluctuation in queue
lengths at any traffic level [8].
For these reasons, the queue length metric used for implementing load-sensitive routing was revisited.
Packet delay averaged over a ten-second interval overcame the limitations of the queue length metric [8].
However, this revised metric results in routing instabilities in heavily loaded networks [9]. In detail, let us
assume that packets take path X towards some destination in the network because it is the fastest, and
thus shortest, path to the destination. When traffic increases and the path becomes gradually congested,
the packet delay increases. At some point, alternative path Y becomes shorter because it is less
congested and exhibits less packet delay. Then, traffic gradually reroutes at a per-hop level onto path Y.
This causes path X to become less utilized and this in turn reduces packet delay. Traffic gradually moves
back to path X and so on. This is how routing instability manifests itself in packet-based load-sensitive
routing. Additionally, since routing decisions are taken on a per-hop level, there is a possibility for the
formation of routing loops.
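The oscillation can be reproduced with a toy model in which, at every interval, all traffic follows the path that reported the lower delay in the previous interval and the loaded path becomes congested; the delay values below are arbitrary and serve only to illustrate the feedback loop.

# Toy model of the oscillation described above: each interval, all traffic is
# sent over the path that reported the lower delay in the previous interval,
# and the loaded path becomes congested. All values are arbitrary.
def simulate(intervals=8, congestion_penalty=0.9):
    base_delay = {"X": 1.0, "Y": 1.2}  # path Y is slightly longer when idle
    choice = "X"                       # traffic initially takes the shorter path X
    for t in range(intervals):
        delay = {p: base_delay[p] + (congestion_penalty if p == choice else 0.0)
                 for p in base_delay}
        print(t, "traffic on", choice, {p: round(d, 2) for p, d in delay.items()})
        choice = min(delay, key=delay.get)  # next interval follows the lower measured delay

simulate()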
We observe that the source of instability for packet-based load-sensitive routing is the existence of a
feedback loop between the routing mechanism and the network load. In other words, the network load
represented by the number of packets traversing a link affects the routing of packets, and the routing of
packets affects the network load.
In the context of flow-based load-sensitive routing, the volume of a flow affects its routing in the
network. If the routing of the flow can influence the flow’s volume, then we would have a feedback loop
and an unstable routing mechanism. This raises the question of whether external traffic engineering that
controls the amounts of traffic flowing through other networks adjusts to path changes within those
networks. For example, if a remote network that is the source of traffic monitors the characteristics of
remote path segments and changes the amount of traffic forwarded to these paths according to some
criteria, then dynamic traffic-engineering can create a feedback loop between the internal dynamic
traffic-engineering mechanism that adjusts the flow paths within the backbone network and the external
end-to-end traffic-engineering mechanism that controls the flow volumes and the end-to-end paths.
To the best of our knowledge, such external traffic-engineering is not used. But altogether dynamic
traffic-engineering is vulnerable to such instability. Finally, we note that in flow-based routing the
formation of routing loops is impossible because head-end routers are aware of the end-to-end path
within the MPLS domain and do not allow loops in the path.
1.6.3 Discussion
In this section, we discuss our results from analyzing the auto-bandwidth mechanism both in simulation
and in theory. Our analysis focuses on various aspects of auto-bandwidth: the effect of auto-bandwidth
adjustments on the network-wide tunnel dynamics, and the responsiveness and stability of the auto-bandwidth mechanism.
By simulating a two-month period, we isolate the effect of auto-bandwidth in networks with varying
capacity constraints. In detail, we find that, for the network under analysis that is both overprovisioned
for fault-tolerance and fine-tuned in order to experience a reasonable number of auto-bandwidth
adjustments, only 3% of the tunnel reroutes occur because of bandwidth adjustments.
We extend our analysis and project the auto-bandwidth behavior in networks with more stringent
capacity constraints. The ultimate objective of dynamic traffic-engineering is the better usage of the
network resources. Thus, we expect auto-bandwidth to be deployed in networks operating at higher
levels of utilization. We find that, as network capacity decreases by 50%: (i) the number of auto-bandwidth reroutes increases by 15 times, and (ii) the number of preemptions caused by auto-bandwidth
events increases by 34 times. The preemption of the lower-priority traffic reflects the cost of operating a
network at higher utilization. Finally, it is noteworthy that, although reroutes only increase by 41%, the
reroute impact increases three times. This means that it is mostly tunnels with large bandwidth
reservations that are impacted by the increased tunnel dynamics. It is important to further investigate the
impact of these reroutes on data plane performance.
We also expose a routing design limitation relevant to the auto-bandwidth mechanism, namely the auto-bandwidth failure. When the auto-bandwidth mechanism fails to resize the tunnel’s bandwidth, the tunnel remains
on the current path with the old bandwidth reservation, even though there could be available bandwidth
either on another path or on the current path. This limitation leads to suboptimal tunnel path placements.
There are multiple ways to deal with this limitation (resize the tunnel to the maximum available
bandwidth, split the tunnel into two tunnels with smaller bandwidth reservations etc.) that need to be
investigated. However, we emphasize the importance of routing visibility systems in revealing
limitations of the deployed routing mechanisms.
Then, we comment on the responsiveness of the auto-bandwidth mechanism to extreme traffic demand
changes. We observe that it is not the auto-bandwidth parameters but the tunnel output rate calculation that determines how fast the bandwidth reservations adjust to highly-variable traffic demands; the rate calculation slows down the adjustment and results in the temporary holding of unused bandwidth. Also,
until the bandwidth reservations adjust to the traffic demand change, the incoming traffic of a tunnel
exceeds the tunnel’s reservation. This means that there is no guaranteed bandwidth for this traffic.
Finally, auto-bandwidth implements flow-based load-sensitive routing. Currently, the mechanism is stable
because there is no observed usage of sophisticated end-to-end traffic-engineering. A sophisticated end-to-end traffic-engineering practice reroutes traffic according to the traffic’s path. Such a practice would
create a feedback loop between the traffic-engineering and the auto-bandwidth, and therefore,
jeopardize the stability of the dynamic traffic-engineering. In order to further analyze this issue, we would
need to identify the path characteristics that could potentially be used by external traffic-engineering
mechanisms and analyze how these characteristics vary among the network’s paths.
1.7 Summary
MPLS is an intradomain forwarding mechanism widely deployed in backbone networks. One of the key
capabilities of MPLS is traffic-engineering. MPLS traffic-engineering allows for efficient network resource
utilization, a network objective of increasing importance for backbone networks. MPLS can achieve load-balancing both statically and dynamically. In a static MPLS configuration, operators reconfigure the tunnel
attributes in order to accommodate changes in traffic demands. In a dynamic MPLS configuration, the
bandwidth requirements of the tunnels adjust automatically to the recently measured traffic demands.
We focus on the routing visibility and stability problem in operational environments deploying MPLS and
dynamic traffic engineering. First, we build a tunnel visibility system that takes as input the network
topology, the tunnel configuration, and the tunnel traffic volumes. The goal of this system is to provide
network operators with tunnel path visibility, that is, a view of the paths that the configured tunnels take
in the network. We apply the system to the backbone network of a tier-1 ISP and compare the tunnel
setup predicted by the system with the real tunnel setup as inferred from various network
measurements.
We find that, although the SPF and CSPF algorithms produce deterministic outputs, an MPLS deployment
with auto-bandwidth enabled is harder to predict, and thus, less visible. We identify three major reasons
that contribute to this lack of visibility: (i) the granularity of the measurement data required in order to
accurately predict the tunnel bandwidth requirements, (ii) the CSPF tiebreaking process, and (iii) the
timing of tunnel path calculations in the network.
We leverage the tunnel visibility system to analyze the impact of the auto-bandwidth mechanism on
tunnel reroutes and failures in networks with varying capacity constraints. We find that as network
capacity decreases by 50%, the tunnel reroutes caused by auto-bandwidth increase by 41% but the total
size of the rerouted tunnels increases by a factor of three. Furthermore, our simulations reveal that
preemption reroutes and preemption failures increase dramatically in networks with less spare capacity.
Auto-bandwidth failures only occur in networks with tight capacity constraints and result in inaccurate
tunnel reservations for the affected tunnels. Thus, it is important to address this issue for networks that
operate at higher levels of utilization.
Furthermore, we perform simulations to investigate the responsiveness of the auto-bandwidth
mechanism to extreme traffic demand variability and find that the vendor-specific algorithm that
calculates the traffic rates used by the auto-bandwidth mechanism introduces considerable delay in the
bandwidth adjustments. The benefit of this algorithm is that it makes auto-bandwidth less sensitive to
traffic spikes. The negative side-effect is that, because of the delayed reaction of auto-bandwidth,
bandwidth is reserved without being utilized by network traffic.
Finally, we focus on the stability of the auto-bandwidth mechanism. Both real tunnel measurements and
simulation results show that, after carefully setting the auto-bandwidth parameter values, tunnels do not
change paths with high frequency. However, auto-bandwidth can become unstable if a feedback loop
between the internal and the end-to-end traffic-engineering practices is formed.
Overall, we find that dynamic traffic-engineering currently allows for a stable deployment but it achieves better network utilization at the cost of reduced tunnel visibility. Also, as network capacity decreases, it increases
the reroutes of both large high-priority tunnels and small low-priority tunnels, and it causes a significant
number of soft failures.
BIBLIOGRAPHY
[1] J. Sommers, B. Eriksson, and P. Barford, "On the Prevalence and Characteristics of MPLS Deployments in the Open Internet," in Proc. ACM IMC 2011.
[2] A. Pathak et al., "Latency Inflation with MPLS-based Traffic Engineering," in Proc. ACM IMC 2011.
[3] D. Mitra and K. G. Ramakrishnan, "A case study of multiservice, multipriority traffic-engineering design for data networks," in Proc. IEEE GLOBECOM 1999.
[4] J. E. Burns et al., "Path selection and bandwidth allocation in MPLS networks," Performance Evaluation, 52(2-3), April 2003.
[5] M. Kodialam and T. V. Lakshman, "Minimum interference routing with applications to MPLS traffic-engineering," in Proc. IEEE INFOCOM 2000.
[6] S. Kuribayashi and S. Tsumura, "Optimal LSP selection method in MPLS networks," in Proc. Communications, Computers and Signal Processing 2007.
[7] J. M. McQuillan, I. Richer, and E. C. Rosen, "ARPANET Routing Study - Final Report," BBN Report 3641, September 1977.
[8] J. M. McQuillan, I. Richer, and E. C. Rosen, "The New Routing Algorithm for the ARPANET," IEEE Transactions on Communications, pp. 711-719, May 1980.
[9] A. Khanna and J. Zinky, "The revised ARPANET routing metric," in Proc. ACM SIGCOMM 1989.
[10] Multiprotocol Label Switching Architecture. http://www.ietf.org/rfc/rfc3031.txt
[11] Cisco Tunnel Reoptimization. http://cisco-press-traffic-engineering.org.ua/1587050315/ch04lev1sec3.html
[12] Cisco MPLS Traffic Engineering. http://www.ciscopress.com/articles/article.asp?p=426640&seqNum=3
[13] RSVP-TE: Extensions to RSVP for LSP Tunnels. http://www.ietf.org/rfc/rfc3209.txt
[14] Constraint-Based LSP Setup using LDP. http://www.ietf.org/rfc/rfc3212.txt
[15] Cisco definition of bits per second. http://www.cisco.com/en/US/products/sw/iosswrel/ps1818/products_tech_note09186a0080191323.shtml
[16] TOTEM Project. TOolbox for Traffic Engineering Methods. http://totem.run.montefiore.ulg.ac.be/
[17] G. Leduc et al., "An open source traffic engineering toolbox," Computer Communications, 29(5):593-610, March 2006.
[18] M. Laor and L. Gendel, "The effect of packet reordering in a backbone link on application throughput," IEEE Network, Sep./Oct. 2002.
[19] S. Jaiswal et al., "Measurement and Classification of Out-of-Sequence Packets in a Tier-1 IP Backbone," IEEE/ACM Transactions on Networking, vol. 15, no. 1, Feb. 2007.