Understanding High Availability
Implementing a Highly Available Network
© 2009 Cisco Systems, Inc. All rights reserved.
SWITCH v1.0—5-1
Components of High Availability
 The objective of high availability
is to prevent outages and
minimize downtime.
 Achieving high availability
integrates multiple components:
– Redundancy
– Technology
– People
– Processes
– Tools
 The first two components are
relatively easy to integrate.
 The last three components are
usually where gaps lead to
outages.
Redundancy
 Redundancy is used to reduce or
eliminate the effects of a failure.
 Design of redundancy attempts to
eliminate single points of failure:
– Avoid single causes of failure.
– Use geographic diversity and
path diversity.
– Use dual devices and links.
– Use dual WAN providers.
– As appropriate, implement dual
data centers.
– As appropriate, use dual
colocations, dual central office
facilities, and dual power
substations.
 Design of redundancy needs to trade
off cost versus benefit:
– Hours of downtime compared to
the costs of redundancy,
planning, etc.
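The cost-versus-benefit comparison above can be sketched as a small calculation. This is a rough illustration only: it assumes failures of redundant components are independent, and the function names and all figures are hypothetical, not taken from the course.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def annual_downtime_hours(availability):
    """Expected hours of downtime per year for an availability between 0 and 1."""
    return (1.0 - availability) * HOURS_PER_YEAR


def parallel_availability(availability, copies):
    """Availability of `copies` redundant components, assuming independent failures."""
    return 1.0 - (1.0 - availability) ** copies


def redundancy_pays_off(availability, copies, cost_per_hour_down, annual_redundancy_cost):
    """True if the downtime cost avoided per year exceeds the yearly cost of redundancy."""
    before = annual_downtime_hours(availability)
    after = annual_downtime_hours(parallel_availability(availability, copies))
    return (before - after) * cost_per_hour_down > annual_redundancy_cost
```

With a hypothetical single device at 99 percent availability, duplicating it pays off when an hour of downtime costs 1000 and the redundancy costs 50000 per year, but not when an hour of downtime costs only 100.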
Technology
 Cisco routing continuity
options:
– Cisco Nonstop Forwarding
(NSF)
– Stateful Switchover (SSO)
– Catalyst 3750 Series
Switches with Cisco
StackWise technology
– Catalyst 6500 VSS 1440
 Techniques for detecting failure
and triggering failover:
– Monitoring
– IP SLAs and object tracking
 Other technologies:
– Fast-routing convergence
People
 Staff work habits and skills can impact high availability.
– Attention to detail.
– Reliability and consistency.
 Good skills and ongoing technical training are needed:
– Lab time working with technology, practical skills, troubleshooting
challenging scenarios, etc.
– Communication and documentation are important.
 Define what other groups expect.
 Define why the network is designed the way it is, how it is supposed to
work.
 If people are not given the time to do the job right, they cut corners:
– If the design target is just “adequate,” falling short leads to poor design.
 Staff team should align with services.
– Owner and experts for each key service application and other
components should be identified and included.
Processes
 Build repeatable processes.
– Document change procedures, failover
planning and lab testing, and
implementation procedures.
 Use labs appropriately.
– Lab equipment reflects the production network, failover
mechanisms are tested and understood, and new code is
validated before deployment.
 Use meaningful change controls.
– Test all changes before deployment, use good planning with
rollback plans, and conduct realistic and thorough risk
analysis.
 Manage operation changes.
– Perform regular capacity management audits, manage Cisco
IOS versions, track design compliance as recommended
practices change, and develop disaster recovery plans.
Tools
 Monitor availability and key statistics
for devices and links.
– Use performance thresholds, Top N reporting, and trending to
spot potential problems.
– Monitor packet loss, latency, jitter, and drops.
 Good documentation is a powerful tool.
– Maintain updated network diagrams.
– Have network design write-ups.
– Document key addresses, VLANs, and servers.
– Tie services to applications, applications to virtual servers, and
virtual servers to real server tables.
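Threshold checks and Top N reporting, as listed above, might look like the following minimal sketch. The device names, metric names, and values are hypothetical placeholders; a real deployment would pull these figures from SNMP, syslog, or IP SLA data rather than a literal dictionary.

```python
def threshold_violations(samples, thresholds):
    """Return (device, metric, value) for every metric above its threshold."""
    violations = []
    for device, metrics in samples.items():
        for metric, limit in thresholds.items():
            value = metrics.get(metric)
            if value is not None and value > limit:
                violations.append((device, metric, value))
    return violations


def top_n(samples, metric, n):
    """Return the n devices with the highest value for one metric."""
    ranked = sorted(samples, key=lambda device: samples[device].get(metric, 0), reverse=True)
    return ranked[:n]


# Hypothetical samples, for illustration only.
samples = {
    "sw1": {"loss_pct": 0.1, "latency_ms": 12},
    "sw2": {"loss_pct": 2.5, "latency_ms": 80},
    "sw3": {"loss_pct": 0.0, "latency_ms": 30},
}
thresholds = {"loss_pct": 1.0, "latency_ms": 50}
```

Storing the same samples over time would allow trending, which this sketch does not attempt.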
Resiliency for High Availability
High availability is implemented with the following
components:
Network-level resiliency
 Redundant links
 Redundant devices
System-level resiliency
 Integrated hardware resiliency
 Redundant power supply
 Stackable switches
Management and monitoring
 Detection of failure
Supported features depend on switch family.
Network-Level Resiliency
Link redundancy
 Redundant links
 EtherChannel
Fast convergence
 Optimized link implementation
 Tuning of Layer 2 and routing protocols
Power redundancy
 External redundant power supply
 Uninterruptible power supply
Monitoring
 SNMP
 Syslog
 IP SLA
 Time synchronization via NTP
High Availability and Failover Times
The overall failover time is the
combination of convergence at
Layer 1, Layer 2, Layer 3, and
higher layer components.
 Layer 1
– Link
 Layer 2
– STP
 Layer 3
– Routing protocol
– First Hop Redundancy Protocol
(FHRP)
 Higher layers
– Firewall failover
– Server failover
Failover time at each layer can be reduced by tuning timers.
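A failover budget can be sketched by adding the per-layer convergence times. This assumes the worst case, in which each layer recovers only after the one below it; in practice some recovery overlaps, so the sum is an upper bound. The component names and millisecond values are hypothetical placeholders, not measurements.

```python
def worst_case_failover_ms(convergence_ms):
    """Sum per-component convergence times, assuming strictly sequential recovery."""
    return sum(convergence_ms.values())


def slowest_component(convergence_ms):
    """Component that contributes most to the total, so the first candidate for timer tuning."""
    return max(convergence_ms, key=convergence_ms.get)


# Hypothetical placeholder values in milliseconds.
example = {"link": 50, "stp": 1000, "routing": 500, "fhrp": 3000}
```

In this made-up example the FHRP dominates the budget, so tuning its timers would help more than tuning any other component.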
Optimal Redundancy
 Core and distribution have
redundant switches and links.
 Access switches have
redundant links.
 Network bandwidth and
capacity can withstand single
switch or link failure.
 Network bandwidth and
capacity allow the network
to converge around most
events within 200–500 ms.
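Whether capacity can withstand a single link failure can be checked with a simple calculation. This sketch assumes traffic rebalances perfectly across the surviving links, which real load-sharing only approximates; the function name and figures are hypothetical.

```python
def survives_single_link_failure(link_capacities_mbps, offered_load_mbps):
    """True if the offered load still fits after losing the largest single link."""
    if len(link_capacities_mbps) < 2:
        return False  # no redundant link, so any failure drops all traffic
    remaining = sum(link_capacities_mbps) - max(link_capacities_mbps)
    return offered_load_mbps <= remaining
```

For example, two 1000 Mb/s uplinks carrying 800 Mb/s in total survive the loss of either link, while the same uplinks carrying 1200 Mb/s do not.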
Provide Alternate Paths
 With a single path to the core,
one failure causes traffic to be
dropped.
 A redundant link to the core
resolves this issue.
 Recommended practice:
Use a redundant link to the
core with a Layer 3
link between distribution
switches.
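The effect of an alternate path can be illustrated by testing each link in a topology: remove it, then check whether the access switch can still reach the core. This is a minimal sketch over a hypothetical topology; the node names are placeholders, and it models only link failures, not device failures.

```python
from collections import deque


def is_reachable(links, src, dst):
    """Breadth-first search over undirected links from src to dst."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


def critical_links(links, src, dst):
    """Links whose loss alone disconnects src from dst."""
    return [link for i, link in enumerate(links)
            if not is_reachable(links[:i] + links[i + 1:], src, dst)]


# Hypothetical topologies: one path to the core versus redundant paths.
single_path = [("access", "dist1"), ("dist1", "core")]
redundant = [("access", "dist1"), ("access", "dist2"),
             ("dist1", "core"), ("dist2", "core"), ("dist1", "dist2")]
```

In the single-path topology every link is critical; in the redundant one, no single link failure cuts the access switch off from the core.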
Avoid Too Much Redundancy
Too much redundancy can
lead to design issues:
 Root placement
 Number of blocked links
 Convergence process
 Complex fault resolution
 Cost
Avoid Single Points of Failure
 The access layer is a candidate for supervisor redundancy.
 Layer 2 access layer SSO.
 Layer 3 access layer SSO and Cisco NSF.
 Reduces network outage to 1 to 3 seconds.
 Supported with Cisco Catalyst 4500 and 6500 Series Switches.
Cisco NSF with SSO
 The standby RP takes control
of the router after a hardware
or software fault on the active
RP.
 SSO allows the standby RP
to take immediate control and
maintain connectivity
protocols.
 Cisco NSF continues to
forward packets until route
convergence is complete.
Routing Protocol Requirements for Cisco NSF
 Cisco NSF enhancements to routing protocols are designed to prevent
routing flaps.
 Adjacencies must not be reset when switchover is complete; otherwise,
protocol state is not maintained.
 FIB must remain unchanged during switchover.
– Current routes are marked as stale during restart.
– Routes are refreshed after Cisco NSF convergence is complete.
– Transient routing loops or black holes may be introduced if the network
topology changes before the FIB is updated.
 Switchover must be completed before dead or hold timer expires; otherwise,
peers will reset the adjacency and reroute the traffic.
 Cisco NSF-capable routers are configured to support Cisco NSF.
 Routers that are aware of Cisco NSF know that a Cisco NSF-capable router
can still forward packets during a switchover.
 Supported with EIGRP, OSPF, BGP, IS-IS.
 Supported with Cisco Catalyst 4500 and 6500 Series Switches.
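The timer requirement above, that switchover must finish before a peer's dead or hold timer expires, can be expressed as a simple check. The peer names and timer values below are hypothetical and for illustration only; actual timers depend on the protocol and configuration.

```python
def peers_that_reset(switchover_s, peer_hold_timers_s):
    """Peers whose dead or hold timer expires before the switchover completes."""
    return sorted(peer for peer, hold in peer_hold_timers_s.items()
                  if hold <= switchover_s)


# Hypothetical peers and timer values in seconds.
timers = {"ospf-peer": 40, "eigrp-peer": 15, "bgp-peer": 180}
```

With these placeholder values, a 3-second switchover keeps every adjacency up, while a 20-second switchover would cause the peer with the 15-second hold timer to reset the adjacency and reroute traffic.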
Summary
 High availability involves several elements: redundancy,
technology, people, processes, and tools.
 At the network level, high availability involves making sure that
there is always a possible path between two endpoints.
 High availability minimizes the impact of link and node failures, and
therefore downtime, by implementing link and node redundancy, providing
alternate paths for traffic, and avoiding single points of failure.