Understanding High Availability
Implementing a Highly Available Network

© 2009 Cisco Systems, Inc. All rights reserved. SWITCH v1.0—5-1

Components of High Availability

The objective of high availability is to prevent outages and minimize downtime.
Achieving high availability integrates multiple components:
  – Redundancy
  – Technology
  – People
  – Processes
  – Tools
The first two components are relatively easy to integrate.
The last three components are usually where gaps lead to outages.

Redundancy

Redundancy is used to reduce or eliminate the effects of a failure.
Design of redundancy attempts to eliminate single points of failure:
  – Avoid single causes of failure.
  – Use geographic diversity and path diversity.
  – Use dual devices and links.
  – Use dual WAN providers.
  – As appropriate, implement dual data centers.
  – As appropriate, use dual colocations, dual central office facilities, and dual power substations.
Design of redundancy needs to trade off cost versus benefit:
  – Hours of downtime compared to the costs of redundancy, planning, etc.

Technology

Cisco routing continuity options:
  – Cisco Nonstop Forwarding (NSF)
  – Stateful Switchover (SSO)
  – Catalyst 3750 Series Switches with Cisco StackWise technology
  – Catalyst 6500 VSS 1440
Techniques for detecting failure and triggering failover:
  – Monitoring
  – IP SLAs and object tracking
Other technologies:
  – Fast-routing convergence

People

Staff work habits and skills can impact high availability.
  – Attention to detail.
  – Reliability and consistency.
Good skills and ongoing technical training are needed:
  – Lab time working with technology, practical skills, troubleshooting challenging scenarios, etc.
  – Communication and documentation are important. Define what other groups expect.
    Define why the network is designed the way it is, how it is supposed to work.
If people are not given the time to do the job right, they cut corners:
  – If the design target is just “adequate,” falling short leads to poor design.
Staff team should align with services.
  – Owner and experts for each key service application and other components should be identified and included.

Processes

Build repeatable processes.
  – Document change procedures, failover planning and lab testing, and implementation procedures.
Use labs appropriately.
  – Lab equipment reflects the production network, failover mechanisms are tested and understood, and new code is validated before deployment.
Use meaningful change controls.
  – Test all changes before deployment, use good planning with rollback plans, and conduct realistic and thorough risk analysis.
Manage operation changes.
  – Perform regular capacity management audits, manage Cisco IOS versions, track design compliance as recommended practices change, and develop disaster recovery plans.

Tools

Monitor availability and key statistics for devices and links.
  – Use performance thresholds, Top N reporting, and trending to spot potential problems.
  – Monitor packet loss, latency, jitter, and drops.
Good documentation is a powerful tool.
  – Maintain updated network diagrams.
  – Have network design write-ups.
  – Document key addresses, VLANs, and servers.
  – Tie services to applications, applications to virtual servers, and virtual servers to real server tables.
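The threshold and Top N ideas from the Tools slide can be sketched in a few lines of Python. This is only an illustration: the `LinkSample` record, the threshold values, and the function names are hypothetical and do not come from the course material or from any Cisco tool. Real limits depend on the applications the network carries.

```python
from dataclasses import dataclass


@dataclass
class LinkSample:
    """One measurement for a link (hypothetical record layout)."""
    link: str
    loss_pct: float
    latency_ms: float
    jitter_ms: float


# Assumed example thresholds, not values taken from the slides.
THRESHOLDS = {"loss_pct": 1.0, "latency_ms": 150.0, "jitter_ms": 30.0}


def violations(sample, thresholds=THRESHOLDS):
    """Return the names of the metrics that exceed their threshold."""
    return [name for name, limit in thresholds.items()
            if getattr(sample, name) > limit]


def top_n_by_latency(samples, n):
    """Top N reporting: the n samples with the highest latency."""
    return sorted(samples, key=lambda s: s.latency_ms, reverse=True)[:n]


samples = [
    LinkSample("dist1-core1", loss_pct=0.0, latency_ms=12.0, jitter_ms=1.0),
    LinkSample("dist1-core2", loss_pct=2.5, latency_ms=200.0, jitter_ms=5.0),
    LinkSample("dist2-core1", loss_pct=0.1, latency_ms=40.0, jitter_ms=2.0),
]

for s in samples:
    exceeded = violations(s)
    if exceeded:
        print(f"{s.link}: over threshold for {', '.join(exceeded)}")
```

Trending would apply the same check to a series of samples over time rather than to a single snapshot.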
Resiliency for High Availability

High availability is implemented with the following components:
Network-level resiliency
  – Redundant links
  – Redundant devices
System-level resiliency
  – Integrated hardware resiliency
  – Redundant power supply
  – Stackable switches
Management and monitoring
  – Detection of failure
Supported features depend on switch family.

Network-Level Resiliency

Link redundancy
  – Redundant links
  – EtherChannel
Fast convergence
  – Optimized link implementation
  – Tuning of Layer 2 and routing protocols
Power redundancy
  – External redundant power supply
  – Uninterruptible power supply
Monitoring
  – SNMP
  – Syslog
  – IP SLA
  – Time synchronization via NTP

High Availability and Failover Times

The overall failover time is the combination of convergence at Layer 1, Layer 2, Layer 3, and higher layer components.
Layer 1
  – Link
Layer 2
  – STP
Layer 3
  – Routing protocol
  – First Hop Redundancy Protocol (FHRP)
Higher layers
  – Firewall failover
  – Server failover
Failover times can be reduced by tuning timers.

Optimal Redundancy

Core and distribution have redundant switches and links.
Access switches have redundant links.
Network bandwidth and capacity can withstand a single switch or link failure.
The network converges around most events within 200–500 ms.

Provide Alternate Paths

With a single path to the core, one failure causes traffic to be dropped.
A redundant link to the core resolves this issue.
Recommended practice: Use a redundant link to the core with a Layer 3 link between distribution switches.
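The benefit of an alternate path can be estimated with standard availability arithmetic. The sketch below is a simplified model, not part of the course material: it assumes that path failures are independent, and the 99.9 percent per-path availability is an invented example figure. Function names are hypothetical.

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def serial_availability(*components):
    """Availability of components in series: all must be up."""
    result = 1.0
    for a in components:
        result *= a
    return result


def parallel_availability(*paths):
    """Availability of redundant paths: down only if every path is down.

    Assumes the paths fail independently of each other.
    """
    all_down = 1.0
    for a in paths:
        all_down *= (1.0 - a)
    return 1.0 - all_down


def downtime_hours_per_year(availability):
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


single = 0.999                                # assumed availability of one uplink
redundant = parallel_availability(single, single)

print(f"single path:    {downtime_hours_per_year(single):.2f} h/year")
print(f"redundant path: {downtime_hours_per_year(redundant) * 3600:.1f} s/year")
```

Under these assumptions a single uplink accounts for about 8.76 hours of expected downtime per year, while two independent uplinks bring that to roughly half a minute. Shared failure causes (common power, common conduit, common software fault) break the independence assumption, which is why the Redundancy slide also calls for path and geographic diversity.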
Avoid Too Much Redundancy

Too much redundancy can lead to design issues:
  – Root placement
  – Number of blocked links
  – Convergence process
  – Complex fault resolution
  – Cost

Avoid Single Points of Failure

The access layer is a candidate for supervisor redundancy.
  – Layer 2 access layer: SSO.
  – Layer 3 access layer: SSO and Cisco NSF.
Supervisor redundancy reduces the network outage to 1 to 3 seconds.
Supported with Cisco Catalyst 4500 and 6500 Series Switches.

Cisco NSF with SSO

The standby RP takes control of the router after a hardware or software fault on the active RP.
SSO allows the standby RP to take immediate control and maintain connectivity protocols.
Cisco NSF continues to forward packets until route convergence is complete.

Routing Protocol Requirements for Cisco NSF

Cisco NSF enhancements to routing protocols are designed to prevent routing flaps.
Adjacencies must not be reset when switchover is complete; otherwise, protocol state is not maintained.
FIB must remain unchanged during switchover.
  – Current routes are marked as stale during restart.
  – Routes are refreshed after Cisco NSF convergence is complete.
  – Transient routing loops or black holes may be introduced if the network topology changes before the FIB is updated.
Switchover must be completed before the dead or hold timer expires; otherwise, peers will reset the adjacency and reroute the traffic.
Cisco NSF-capable routers are configured to support Cisco NSF.
Routers that are aware of Cisco NSF know that a Cisco NSF-capable router can still forward packets.
Supported with EIGRP, OSPF, BGP, and IS-IS.
Supported with Cisco Catalyst 4500 and 6500 Series Switches.

Summary

High availability involves several elements: redundancy, technology, people, processes, and tools.
At the network level, high availability involves making sure that there is always a possible path between two endpoints.
High availability minimizes the downtime caused by link and node failures by implementing link and node redundancy, providing alternate paths for traffic, and avoiding single points of failure.
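Worked example: the Cisco NSF timing requirement

The Cisco NSF slide states that switchover must be completed before the dead or hold timer expires, or peers reset the adjacency. The Python sketch below expresses that condition as a simple comparison. It is an illustration only: the function names are hypothetical, and the timer values are assumed examples based on commonly cited Cisco IOS defaults, which should be checked against the actual configuration. It also ignores the time already elapsed since the peer last heard from the router, so it is optimistic by up to one hello interval.

```python
# Assumed peer timers in seconds (illustrative; verify on the real devices).
PEER_TIMERS_S = {
    "OSPF dead interval": 40.0,
    "EIGRP hold time": 15.0,
    "BGP hold time": 180.0,
}


def adjacency_survives(switchover_s, timer_s):
    """True if the switchover finishes before the peer's timer expires."""
    return switchover_s < timer_s


def at_risk(switchover_s, timers=PEER_TIMERS_S):
    """Names of the timers that would expire before switchover completes."""
    return [name for name, timer_s in timers.items()
            if not adjacency_survives(switchover_s, timer_s)]


# The slides quote a 1 to 3 second outage with supervisor redundancy.
for switchover_s in (3.0, 20.0):
    expired = at_risk(switchover_s)
    status = ", ".join(expired) if expired else "no adjacency reset expected"
    print(f"switchover {switchover_s:.0f} s: {status}")
```

With the assumed values, a 3 second switchover stays inside every timer, while a hypothetical 20 second switchover would outlast the EIGRP hold time. Aggressively tuned hello and dead timers shrink this margin, so timer tuning and NSF need to be planned together.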