TECCRS-2001
Enterprise High Availability Design and Architectures

Speakers:
• Dana Daum, Communications Architect
• Maren Kostede, Technical Solutions Architect
• Junmei Zhang, Technical Marketing Engineer
• Samer Theodossy, Principal Engineer

TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public

High Availability World Coverage

Agenda
• Designing High Availability Networks for the Enterprise
• System Hardware and Software Resiliency
• Foundations of the Structured Network Design
• High Availability Architectures:
  • Enterprise Wired LAN
  • Enterprise Wireless LAN
  • Enterprise Data Center
• High Availability System Recovery Analysis

Agenda Schedule & Logistics
Slide markers used in this deck: "For Your Reference" and "Key Concept or Design Point"
• 08:30 - 10:30: Samer
• 10:30 - 10:45: Break
• 10:45 - 12:45: Dana
• 12:45 - 14:30: Lunch
• 14:30 - 16:30: Maren
• 16:30 - 16:45: Break
• 16:45 - 18:45: Junmei
Hurray, we are done!
We value your feedback: don't forget to complete your online session evaluations.

Cisco Webex Teams
Questions? Use Cisco Webex Teams (formerly Cisco Spark) to chat with the speaker after the session.
How:
1. Find this session in the Cisco Events Mobile App
2. Click "Join the Discussion"
3. Install Webex Teams or go directly to the team space
4. Enter messages/questions in the team space
cs.co/ciscolivebot#TECCRS-2001
[Figure: enterprise reference topology covering headquarters, regional site, remote sites, teleworker/mobile worker, data center, Internet edge, DMZ, and WAN aggregation (MPLS WANs), built from access, distribution, and core switches, Nexus data center switches, WAN and Internet routers, firewalls, RA-VPN, wireless LAN controllers, WAAS, web and email security appliances, Communications Managers, and UCS servers with storage]

Enterprise-Class Availability
Campus Systems Approach to High Availability
Ultimate goal: 100%
• System-level resiliency
• Network-level redundancy
• Enhanced management

Applications drive requirements for high availability networking:
• Desktop apps: e-mail, file and print
• Mission-critical apps: databases, order entry, CRM, ERP
• Next-generation apps: video conferencing, unified messaging, global outsourcing, e-business, wireless ubiquity

Voice and video sensitivity:
• The human ear notices the difference in voice within 150–200 msec
  • This corresponds to the loss of 10 consecutive G.711 packets
• Video loss is even more noticeable
• 200-msec end-to-end campus convergence
Cisco HA Evolution
• No redundancy
  • No redundant units
  • Failure on the Supervisor causes a reload
  • Line cards reload on failure
  • Outage: tens of minutes
• Redundancy with RPR
  • Adds redundant units
  • Failure on the active Supervisor causes a switchover
  • Standby unit is in STANDBY_COLD state
  • Line cards reload after switchover
  • Startup configuration synchronized to peer
  • Outage: several minutes
• Redundancy with RPR+
  • Adds redundant units
  • Failure on the active Supervisor causes a switchover
  • Standby unit is in STANDBY_WARM state
  • Line cards reload after switchover
  • Startup configuration synchronized to peer
  • Running configuration synchronized to peer and applied after switchover
  • Outage: several seconds
• Redundancy with SSO
  • Adds redundant units
  • Failure on the active Supervisor causes a switchover
  • Standby unit is in STANDBY_HOT state
  • Line cards stay up after switchover
  • Startup configuration synchronized to peer
  • Running configuration synchronized to peer and applied
  • Outage: on the order of milliseconds

Defining Levels of Availability*
• Continuous Availability (CA): the system is designed to operate 7 days a week, 24 hours a day, with resiliency and redundancy mechanisms to handle all unplanned or planned events
• Continuous Operations (CO): the system is designed to operate 7 days a week, 24 hours a day, with resiliency and redundancy mechanisms to handle both unplanned faults and planned maintenance events
• High Availability (HA): the system is designed to a specified service level, with resiliency and redundancy mechanisms to handle unplanned faults
* References de facto industry terminology
Measure Availability End-to-End from the User Perspective
Measurement tools by OSI layer:
• Application through Transport: custom application scripts (HTML, TCL, Python, many others)
• Network: ICMP ping, IP traceroute, Bidirectional Forwarding Detection, IP SLA
• Data-Link: UDLD, STP, REP
• Physical: cable testers / power meters
* Layer 8 is not an official part of the OSI reference model

Measure and Analyze Every Event
Total service downtime runs from the moment the actual fault starts until up time is restored. It is the sum of:
• Failure detection time
• Notification time
• Diagnosis time
• Dispatch time
• Arrival time
• Repair time
Analyze and automate:
• Measure all of the points above and note each in trouble tickets
• Analyze trends
• Automation: trouble ticketing, technology/database, electronic bonding
• Redundant network design and resiliency features are required for very high availability

What to Automate?
• Day 0: device provisioning (deployment automation)
• Day 1: configuration (open programmable interfaces)
• Day 2: monitoring (telemetry)
Source: 2016 Cisco Study

Main Operational Challenges
• 95% of network changes are performed manually
• 70% of policy violations are due to human error
• 75% of OpEx is spent on network visibility and troubleshooting
Manual operations cannot keep pace with the demands of digital business.
Source: 2016 Cisco Study

High Availability Design Principles
Key Principles
• Enterprise network design architectures continue to evolve to meet business and technology needs, but the key principles of highly available network design still apply.
• Add redundancy and resiliency components as needed to meet the business requirements.
• Simplify network designs and configurations through virtualization techniques.
• Implement network-monitoring tools with automation where appropriate, and analyze all aspects of network outages for indications of where improvement is needed.

Agenda
• Designing High Availability Networks for the Enterprise
• System Hardware and Software Resiliency
  • Availability Modeling
  • Stateful Switchover, Non-Stop Forwarding, and Non-Stop Routing
  • Stackwise480 and Stackwise
  • In Service Software Upgrades
• Foundations of the Structured Network Design
• High Availability Architectures
• High Availability System Recovery Analysis

Why Use System and Network Availability Modeling?
• Planning and engineering
• Architecture validation
• Design tradeoff analysis/decisions (for example, Option 1 at $, Option 2 at $$, Option 3 at $$$)
• Request for Proposal (RFP)
• Service Level Agreement (SLA)

Predicted Availability Ratings Are Not Guarantees
• Predicted availability ratings are not guarantees of network availability.
• Ratings are based on industry-standard methodologies and statistical analysis.
• Ratings are useful in making design decisions and comparing different options.

Predicted Availability Rating
The rating is a function of Mean Time Between Failure (MTBF) and Mean Time to Repair (MTTR). To raise availability, increase MTBF and decrease MTTR.
Predicted Availability Equation (Basic)

$$\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$$

MTBF = Mean Time Between Failure
MTTR = Mean Time To Repair

Predicted Availability Equations
Worked examples with MTBF = 74,116 hrs:
• No spare, MTTR = 24 hrs: 74,116 hrs / (74,116 hrs + 24 hrs) = 0.999676, or 2 hrs 50.3 min of downtime per year
• Spare available, MTTR = 4 hrs: 74,116 hrs / (74,116 hrs + 4 hrs) = 0.999946, or 28 min per year
• Redundancy, MTTR = .00833 (sub-second): 74,116 hrs / (74,116 hrs + .00833) = 0.999999, or .526 min per year

The Redundancy Effect
• Single points of failure (blocks in series): Unit 1 (Linecard, 99.999%, ~5 min/yr) followed by Unit 2 (Supervisor, 99.999%, ~5 min/yr). Availability = 99.998%, downtime = ~10 min/yr.
• Redundant components (blocks in parallel): Unit 1 (Supervisor, 99.999%, ~5 min/yr) alongside Unit 2 (Supervisor, 99.999%, ~5 min/yr). Availability = 99.999999%, downtime = ~0.0053 min/yr.

Example of Predicted Availability Rating (No Redundancy)
Catalyst 2960XR-48TS-I (For Your Reference)

Part | MTBF (hours) | MTTR | Predicted Availability | Annual Downtime
Catalyst 2960XR-48TS-I | 438,130 hrs | 4 hrs | 99.99908704% | --
Power Supply | 1,000,000 hrs | 4 hrs | 99.99960000% | --
SFP-10GSR Uplink | 2,294,776 hrs | 4 hrs | 99.99982569% | --
System MTBF | 268,947 hrs | | 99.99851274% | 7.82 min

All single points of failure are combined in a series calculation: Chassis × Power Supply × Uplink = System MTBF
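The basic equation and the series calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the tool used to produce the deck's tables: the function names are hypothetical, and the series rule (failure rates, 1/MTBF, add for blocks in series) is the standard reliability assumption rather than something the slides spell out.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def availability(mtbf_hours, mttr_hours):
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_minutes(avail):
    """Expected downtime per year for a given availability (0..1)."""
    return (1.0 - avail) * MINUTES_PER_YEAR


def series_mtbf(mtbfs):
    """Blocks in series: failure rates (1/MTBF) add up (assumed rule)."""
    return 1.0 / sum(1.0 / m for m in mtbfs)


def series_availability(avails):
    """Blocks in series: every block must be up, so availabilities multiply."""
    result = 1.0
    for a in avails:
        result *= a
    return result


if __name__ == "__main__":
    # Catalyst 2960XR-48TS-I example: chassis, power supply, uplink.
    system_mtbf = series_mtbf([438_130, 1_000_000, 2_294_776])
    system_avail = availability(system_mtbf, 4)
    print(f"System MTBF: {system_mtbf:,.0f} hrs")
    print(f"Availability: {system_avail * 100:.8f}%")
    print(f"Annual downtime: {annual_downtime_minutes(system_avail):.2f} min")
```

Running the script reproduces the 2960XR system row to within rounding, which is a useful sanity check before applying the same functions to other part lists.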
Example of Predicted Availability Rating (With Redundancy)
Catalyst 2960XR-48TS-I (For Your Reference)

Part | MTBF (hours) | MTTR | Switchover time (sec) | Combined MTBF (hrs) | Predicted Availability | Annual Downtime
Catalyst 2960XR-48TS-I | 438,130 | 4 hrs | -- | 438,130 | 99.99908704% | --
Power Supply (Redundant) | 1,000,000 | 4 hrs | 0 | 125,001,000,002 | 100.00000000% | --
SFP-10GSR Uplink (Redundant) | 2,294,776 | 4 hrs | .500 | 658,251,906,130 | 100.00000000% | --
System MTBF | | | | 438,128 | 99.99908704% | 4.80 min

Redundant components are combined in a parallel calculation: Chassis × Combined Power Supply × Combined Uplink = System MTBF

Example of Predicted Availability Rating (Catalyst 3850 No Redundancy)
Catalyst WS-C3850-48F (For Your Reference)

Part | MTBF (hours) | MTTR | Predicted Availability | Annual Downtime
Catalyst C3850-48F | 241,050 | 4 hrs | 99.99834062% | --
Power Supply PWR-C11100WAC | 392,174 | 4 hrs | 99.99898005% | --
C3850-NM-2-10G | 4,319,170 | 4 hrs | 99.99990732% | --
SFP-10GSR Uplink | 2,294,776 | 4 hrs | 99.99982569% | --
System MTBF | 135,761 | | 99.99705371% | 15.50 min

All single points of failure are combined in a series calculation: Chassis × Power Supply × Uplink = System MTBF

Example of Predicted Availability Rating (With Component Redundancy)
Catalyst WS-C3850-48F (For Your Reference)

Part | MTBF (hours) | MTTR | Switchover time (sec) | Combined MTBF (hrs) | Predicted Availability | Annual Downtime
Catalyst C3850-48F | 241,050 | 4 hrs | -- | 241,050 | 99.99834062% | --
Power Supply PWR-C11100WAC | 392,174 | 4 hrs | 0 | 19,225,447,959 | 99.99999999% | --
SFP-10GSR Uplink | 2,294,776 | 4 hrs | .500 | 658,251,906,038 | 100.00000000% | --
C3850-NM-2-10G | 4,319,170 | 4 hrs | -- | 4,319,170 | 99.99990732% | --
System MTBF | | | | 228,297 | 99.99824793% | 9.22 min

Redundant components are combined in a parallel calculation: Chassis × Combined Power Supply × Combined Uplink × Uplink Module = System MTBF

Example of Predicted Availability Rating (With Stackwise480 Redundancy, Single Attached)
Catalyst WS-C3850-48F (For Your Reference)

Part | MTBF (hours) | MTTR | Switchover time (sec) | Combined MTBF (hrs) | Combined Availability | Annual Downtime
Catalyst C3850-48F | 241,050 | 4 hrs | -- | 241,050 | 99.99834062% | --
Power Supply PWR-C11100WAC | 392,174 | 4 hrs | .001 | 19,225,447,959 | 99.99999999% | --
C3850-NM-2-10G | 4,319,170 | 4 hrs | .500 | 2,328,453,946,134 | 100.0000000% | --
SFP-10GSR Uplink | 2,294,776 | 4 hrs | .500 | 658,251,906,038 | 100.0000000% | --
System MTBF | | | | 241,047 | 99.99834061% | 8.73 min

Redundant components are combined in a parallel calculation: Combined Chassis × Combined Power Supply × Combined Uplink = System MTBF

Example of Predicted Availability Rating (Catalyst 4507R+E Non Redundant)
Catalyst WS-C4507R+E (For Your Reference)

Part | MTBF (hours) | MTTR | Combined MTBF | Combined Availability | Annual Downtime
Chassis with Fans WS-C4507R+E | 248,630 | 4 hrs | 248,630 hrs | 99.99839121% | --
Power Supply PWR-C45-6000ACV | 341,356 | 4 hrs | 341,356 hrs | 99.99882822% | --
WS-X45-SUP8-E | 451,610 | 4 hrs | 451,610 hrs | 99.99911429% | --
SFP-10GSR Uplink | 2,294,776 | 4 hrs | 658,251,906,038 | 99.99999956% | --
WS-X4748-RJ45-E | 402,386 | 4 hrs | 402,386 hrs | 99.99900594% | --
System MTBF | | | 82,735 hrs | 99.99516543% | 25.43 min

Components are combined in a series calculation: Chassis × Power Supply × Line Card × Supervisor Module × SFP Uplink = System MTBF
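The "combined in parallel" figures in the redundancy tables can be approximated with a short sketch. The deck does not state its exact parallel formula, so the version below is an assumption: it uses the common textbook approximation for two identical redundant units, MTBF² / (2 × MTTR), and ignores switchover time. The function names are hypothetical. The results land close to the Combined MTBF column for the redundant power supply and uplink, but they are not expected to match digit for digit.

```python
def parallel_mtbf(mtbf_hours, mttr_hours):
    """Approximate MTBF of two identical redundant units.

    Assumption: the pair fails only if the second unit fails while the
    first is still being repaired, giving MTBF^2 / (2 * MTTR).
    """
    return mtbf_hours ** 2 / (2.0 * mttr_hours)


def parallel_availability(a1, a2):
    """Blocks in parallel: the pair is down only when both units are down."""
    return 1.0 - (1.0 - a1) * (1.0 - a2)


def series_mtbf(mtbfs):
    """Blocks in series: failure rates (1/MTBF) add up."""
    return 1.0 / sum(1.0 / m for m in mtbfs)


if __name__ == "__main__":
    # Catalyst 2960XR-48TS-I with redundant power supplies and uplinks.
    chassis = 438_130
    power = parallel_mtbf(1_000_000, 4)
    uplink = parallel_mtbf(2_294_776, 4)
    system = series_mtbf([chassis, power, uplink])
    print(f"Combined power supply MTBF: {power:,.0f} hrs")
    print(f"Combined uplink MTBF: {uplink:,.0f} hrs")
    print(f"System MTBF: {system:,.0f} hrs")
```

Once the redundant parts are combined, the single chassis dominates the series result, which is why the system MTBF in the redundant 2960XR table is almost identical to the chassis MTBF.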
Example of Predicted Availability Rating (Catalyst 4507R+E With Redundancy)
Catalyst WS-C4507R+E with Redundancy (For Your Reference)

Part | MTBF (hours) | MTTR | Switchover time (sec) | Combined MTBF | Combined Availability | Annual Downtime
Chassis with Fans WS-C4507R+E | 248,630 | 4 hrs | -- | 248,630 hrs | 99.99839121% | --
Power Supply PWR-C45-6000ACV | 341,356 | 0 hrs | 0 | 14,565,831,200 hrs | 99.99882822% | --
WS-X45-SUP8-E | 451,610 | 0 hrs | .500 | 25,494,400,625 hrs | 99.99911429% | --
SFP-10GSR Uplink | 2,294,776 | 0 hrs | .500 | 658,251,906,038 | 99.99999956% | --
WS-X4748-RJ45-E | 402,386 | 4 hrs | -- | 402,386 hrs | 99.99900594% | --
System MTBF | | | | 153,673 hrs | 99.99739714% | 13.69 min

Redundant components are combined in a parallel calculation: Chassis × Combined Power Supply × Line Card × Combined Supervisor Module × Combined SFP Uplink = System MTBF

Example of Predicted Availability Rating (Catalyst 6800XL Non Redundant)
Catalyst 6800XL (For Your Reference)

Part | MTBF (hours) | MTTR | Combined MTBF (hrs) | Combined Availability | Annual Downtime
Chassis C6807-XL | 638,440 | 4 hrs | 638,440 | 99.99937348% | --
C6807-XL-FAN= | 3,077,880 | 4 hrs | 3,077,880 | 99.99987004% | --
SFP-10GSR | 2,294,776 | 4 hrs | 2,294,776 | 99.99982569% | --
Supervisor VS-S2T-10G | 231,910 | 4 hrs | 231,910 | 99.99827522% | --
WS-X6904-40G2T | 256,490 | 4 hrs | 256,490 | 99.99844051% | --
C6800-XL-3KWAC* | 3,000,000 | 4 hrs | 3,000,000 | 99.99986667% | --
System MTBF | | | 91,987 | 99.99565168% | 22.87 min

Components are combined in a series calculation: Chassis × Fan Tray × Power Supply × Line Card × Supervisor Module × SFP Uplink = System MTBF

Example of Predicted Availability Rating (Catalyst 6800XL With Redundancy)
Catalyst 6800XL with Redundancy (For Your Reference)
The MTBF column below uses the per-part values from the non-redundant 6800XL table. The extracted slide text shows shifted values in this column for the chassis, SFP, Supervisor, and line card rows, and those values do not agree with the Combined MTBF column.

Part | MTBF (hours) | MTTR | Switchover time (sec) | Combined MTBF (hrs) | Combined Availability | Annual Downtime
Chassis C6807-XL | 638,440 | 4 hrs | -- | 638,440 | 99.99937348% | --
C6807-XL-FAN= | 3,077,880 | 4 hrs | -- | 3,077,880 | 99.99987004% | --
SFP-10GSR | 2,294,776 | 4 hrs | .500 | 2,633,000,739,868 | 100.00000000% | --
Supervisor VS-S2T-10G | 231,910 | 4 hrs | .500 | 26,891,355,961 | 99.99999997% | --
WS-X6904-40G2T | 256,490 | 4 hrs | .500 | 32,893,816,541 | 99.99999998% | --
C6800-XL-3KWAC* | 3,000,000 | 4 hrs | 0 | 4,500,003,000,001 | 100.00000000% | --
System MTBF | | | | 528,687 | 99.99924347% | 3.98 min

Redundant components are combined in a parallel calculation: Chassis × Combined Power Supply × Combined Line Card × Combined Supervisor Module × Combined SFP Uplink = System MTBF

Choosing the Right Platform and Network Design
It Is More Than Just Predicted Availability Ratings
• Design to business requirements
• Use predicted availability ratings as part of your overall design considerations
• Common factors that dictate platform selection:
  • Backplane throughput and performance
  • Interface types and port densities
  • Scalability for future growth / investment protection
  • Software upgrade procedures
  • Software feature support
  • Simplicity / ease of use

Control Plane and Data Plane
• Control plane: CPU, software, memory. Hosts EIGRP, OSPF, LDP, BGP, SNMP, STP, CDP, and the FIB.
• Data plane: ASICs, high-speed TCAMs. Holds a copy of the FIB.
[Figure builds: the same control plane / data plane diagram with hosts A and B attached and a packet (SRC A, DST B) added]

Control Plane and Data Plane (For Your Reference)
Definitions for our context
• Control Plane: protocols or signaling traffic associated with routing. Typically this is traffic sourced from a router or destined to a router. Examples include BGP, OSPF, EIGRP, ICMP, etc.
  • Processed by a CPU
  • May also include exception traffic that needs special services applied
  • Commonly referred to as the "Slow Path"
  • May also include management protocols such as SNMP, Telnet, and HTTP (also known as the "Management Plane")
• Data Plane: traffic forwarded through a device.
  • Processed by hardware ASICs
  • In the context of a switching device, typically this is traffic processed completely by the device's hardware ASICs
  • Commonly referred to as the "Fast Path"

Stateful Switchover (SSO) (For Your Reference)
• Stateful Switchover (SSO): a software facility within Cisco IOS that synchronizes specific Cisco IOS processes between an Active Supervisor Engine and a redundant Standby Supervisor Engine for the purpose of redundancy.
  • Redundancy Facility: synchronizes application states
  • Checkpointing Facility: synchronizes data structures
Redundant Supervisors: IOS Active-Standby Model
• Active Supervisor
  • Control plane: console access, manages configurations, manages chassis environmentals, L2-L3 protocols
  • Data plane: hardware-based switching
• Standby Supervisor
  • Not part of the active forwarding path
  • Multiple redundancy modes: COLD standby, WARM standby, HOT standby
• Synchronization between the two Supervisors
  • CF: Checkpoint Facility
  • RF: Redundancy Facility

Stateful Switchover Mode: IOS
SSO-Aware and SSO-Compliant IOS Applications
• SSO-compliant applications: routing protocols, NetFlow, Cisco Discovery Protocol, and more
• SSO-aware applications: Forwarding Information Base, IEEE 802.1x, PAgP / LACP, and more
• The Redundancy Facility and Checkpointing Facility in Cisco IOS connect the Active Supervisor to the Standby Hot Supervisor, which runs the same set of applications
SSO Compliant Redundancy Clients: IOS
Partial list example

Router# show redundancy clients
 clientID = 0 clientSeq = 0
 clientID = 1319 clientSeq = 1
 clientID = 5030 clientSeq = 2

Client names shown on the slide, grouped under Management & Services, Platform Specific, L2 Services, and L3 Services:
• Platform specific: Cat6k Inline Power, Cat6k OIR, Cat6k QoS Manager, CWAN VLAN RF Client, Cat6k Feature Manager, Cat6k SPA TSM, Cat6k Online Diag HA, Cat6k Platform SLB RF Client, Config Sync RF client, Cat6k Startup Config
• Other clients, in slide order: RF_INTERNAL_MSG, Cat6k Platform, Swove, Redundancy Mode RF, EEM Server RF CLIENT, SNMP HA RF Client, Switch SPAN client, MQC QoS, Call-Home RF, Port Security Client, IKE RF Client, IPSEC RF Client, CRYPTO RSA, LAN-Switch PAgP/LACP, LAN-Switch Private V, VLAN Mapping, CTS HA, Network RF Client, HSRP, GLBP, Cat6k PAgP/LACP, BFD RF Client, Spanning-Tree Protocol, DHCP Snooping, Cat6k MLS Multicast, Frame Relay, HDLC, IPROUTING NSF RF, ARP, LLDP, PPP RF, L3 Mobility Manager, IP multicast RF Client, MPLS VPN HA Client, LDP HA, AToM manager

SSO by Itself Does Not Provide Redundancy for the Routing Protocols
Graceful Restart, Non-Stop Forwarding and Non-Stop Routing
• Non-Stop Forwarding was developed by Cisco to maintain traffic forwarding by a router experiencing a control plane switchover event. The router synchronizes its Forwarding Information Base between the Active and Standby Route Processors, and signals its routing neighbors to continue forwarding traffic while routing topology information is exchanged.
• The IETF developed standards-based implementations similar to Cisco NSF.
• The IETF implementations use different terminology, including the term "Graceful Restart" to describe the signaling used between the routers.
• Graceful Restart (GR) and Non-Stop Forwarding (NSF) are terms often used interchangeably.
• Graceful Restart/Non-Stop Forwarding and Non-Stop Routing (NSR) all allow data packets to keep being forwarded along known routes following a processor switchover, while the routing protocol information is being restored (in the case of Graceful Restart) or refreshed (in the case of Non-Stop Routing).
• Each routing protocol has its own unique implementation and signaling mechanisms.

Routing Protocol Redundancy With NSF
State before switchover, with the SSO Redundancy Facility and Checkpoint Facility linking the two Supervisors:
• Active Supervisor Engine (Slot 1)
  • EIGRP RIB (prefix, next hop): 10.0.0.0 via 10.1.1.1; 10.1.0.0 via 10.1.1.1; 10.20.0.0 via 10.1.1.1
  • OSPF RIB (prefix, next hop): 192.168.0 via 192.168.0.1; 192.168.55.0 via 192.168.55.1; 192.168.32.0 via 192.168.32.1
  • ARP table (IP, MAC): 10.1.1.1 aabbcc:ddee32; 10.1.1.2 adbb32:d34e43; 10.20.1.1 aa25cc:ddeee8
  • FIB table (prefix, next hop): 10.1.1.1 aabbcc:ddee32; 10.1.1.2 adbb32:d34e43; 192.168.0.0 aa25cc:ddeee8
• Standby Supervisor Engine (Slot 2)
  • EIGRP RIB, OSPF RIB, ARP table: empty
  • FIB table: synchronized with the Active Supervisor (same three entries)
Routing Protocol Redundancy With NSF (after switchover)
The Supervisor Engine in Slot 2 is now active. The slide builds show the following sequence:
1. Immediately after switchover, the EIGRP RIB, OSPF RIB, and ARP table are empty. The FIB table still holds its entries (10.1.1.1 aabbcc:ddee32; 10.1.1.2 adbb32:d34e43; 192.168.0.0 aa25cc:ddeee8).
2. GR/NSF signaling per protocol and synchronization per protocol begin.
3. The EIGRP RIB is repopulated: 10.0.0.0, 10.1.0.0, and 10.20.0.0, each via 10.1.1.1.
4. The OSPF RIB is repopulated: 192.168.0 via 192.168.0.1; 192.168.55.0 via 192.168.55.1; 192.168.32.0 via 192.168.32.1.
5. The ARP table is repopulated: 10.1.1.1 aabbcc:ddee32; 10.1.1.2 adbb32:d34e43; 10.20.1.1 aa25cc:ddeee8.

Non Stop Forwarding Router Roles
• Non-Stop Forwarding (NSF) allows a router to continue forwarding data along routes that are already known while the routing protocol information is being restored.
• NSF Aware router or NSF Helper router*
  • A router running NSF-compatible software, capable of assisting a neighbor router in performing an NSF restart
• NSF Capable router
  • A router configured to perform an NSF restart, and therefore able to rebuild routing information from a neighboring NSF-aware or NSF-capable router
[Figure: an NSF Capable device with redundant Supervisors between NSF Aware neighbors]
* NSF Helper: this term is used in IETF terminology
NSF/SSO Switchover Operation: IOS
[Figure: numbered switchover sequence (steps 1-12). The active Supervisor fails and the standby becomes the newly active Supervisor. In the control plane, the RP CPU runs OSPF, EIGRP, IS-IS, and BGP, which feed the Routing Information Base and ARP table over the control path to an NSF Aware router. The Cisco IOS CEF tables (FIB table with prefix, next hop, interface, and epoch; adjacency table with next hop, MAC, and epoch; Global Epoch = 1) feed the hardware FIB and adjacency tables in the data plane, which keep the forwarding path up.]

Non-Stop Forwarding OSPF Implementation Example
[Figure: message exchanges between an NSF Capable router and an NSF Aware router for the two OSPF mechanisms]
• IETF NSF (Graceful Restart): OSPF discovery (Hello to 224.0.0.5, Database Description, LSA Requests/Update); announce graceful restart with an LS Update (Grace LSA), acknowledged by an LS ACK (Grace LSA); restart event; then Hello, Database Description, and LSA Requests/Update again.
• Cisco NSF: restart event; Fast Hello (2-sec interval, RS bit set); out-of-band sync (Database Description, LSA Requests/Update); Fast Hello (2-sec interval, RS bit clear); Hello (RS bit clear) and database exchange.

NSF Configuration: IOS
Capable vs. Helper Configuration
• Configuration is required to enable "NSF Capable"
• Configuration is NOT required to enable "NSF Helper" with default settings
• The helper supports both types (IETF and Cisco) on the device

    router eigrp 1
     nsf
    !
    router ospf 1
     nsf ietf
    !
    router isis 1
     nsf cisco

    core1# show ip ospf nsf
    Routing Process "ospf 1"
    IETF Non-Stop Forwarding enabled
       restart-interval limit: 120 sec
    IETF NSF helper support enabled
       IETF NSF helper strict-lsa-checking enabled
    Cisco NSF helper support enabled
       OSPF restart state is NO_RESTART
       Handle 2162698, Router ID 1.1.1.1, checkpoint Router ID 1.1.1.1
       Config wait timer interval 10, timer not running
       Dbase wait timer interval 120, timer not running

NSF Interoperability
Interoperability between different Cisco devices
• The Graceful Restart extensions used in NX-OS are based on the IETF RFCs, except for EIGRP, which is Cisco proprietary and can interoperate with Cisco NSF.
• This implies that for OSPFv2, OSPFv3, and BGP, the GR extensions are compatible with versions of IOS that use the RFC-based extensions.
[Figure: peer pairings by OSPF configuration. "router ospf 1 / graceful-restart" paired with "router ospf 1 / graceful-restart", and paired with "router ospf 1 / nsf ietf", are marked ✔; "router ospf 1 / nsf cisco" is shown without a ✔.]

Non-Stop Routing (NSR)
• Cisco IOS Non-Stop Routing preserves the state information (prefixes and related data) in the Routing Information Base across Supervisor Engine (Route Processor) switchover events.
• Helpful in environments where peer routers are not managed by the same entity or are not capable of supporting NSF awareness.
• Non-Stop Routing consumes more control plane resources, such as memory and CPU cycles, than NSF.
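When the NSF stanzas from the IOS configuration slide have to be applied to many devices, a small template function keeps them consistent. The sketch below is illustrative only: the function name and the keyword table are hypothetical, and the keyword table copies only the three commands that appear in this deck (`nsf`, `nsf ietf`, `nsf cisco`). Platform and release support should be checked before using any generated text.

```python
# NSF keyword per routing protocol, copied from the configuration slide.
# This mapping is an assumption for illustration; other modes exist.
NSF_KEYWORDS = {
    "eigrp": "nsf",
    "ospf": "nsf ietf",
    "isis": "nsf cisco",
}


def render_nsf_config(processes):
    """Render IOS-style NSF stanzas.

    processes: iterable of (protocol, process_id) tuples,
    for example [("ospf", 1)].
    """
    stanzas = []
    for protocol, process_id in processes:
        keyword = NSF_KEYWORDS[protocol]  # KeyError for unknown protocols
        stanzas.append(f"router {protocol} {process_id}\n {keyword}")
    return "\n!\n".join(stanzas)


if __name__ == "__main__":
    print(render_nsf_config([("eigrp", 1), ("ospf", 1), ("isis", 1)]))
```

Only the NSF-capable device needs this output. According to the slide, NSF helper behavior is enabled by default on neighbors.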
Routing Protocol Redundancy With NSR
[Figure: the Active Supervisor Engine (slot 1) and the Standby Supervisor Engine (slot 2) hold identical copies of the tables below, kept in sync through the SSO Redundancy Facility and the Checkpoint Facility.]

EIGRP RIB
    Prefix         Next Hop
    10.0.0.0       10.1.1.1
    10.1.0.0       10.1.1.1
    10.20.0.0      10.1.1.1

OSPF RIB
    Prefix         Next Hop
    192.168.0      192.168.0.1
    192.168.55.0   192.168.55.1
    192.168.32.0   192.168.32.1

ARP Table
    IP             MAC
    10.1.1.1       aabbcc:ddee32
    10.1.1.2       adbb32:d34e43
    10.20.1.1      aa25cc:ddeee8

FIB Table
    Prefix         Next Hop
    10.1.1.1       aabbcc:ddee32
    10.1.1.2       adbb32:d34e43
    192.168.0.0    aa25cc:ddeee8

Routing Protocol Redundancy With NSR
[Figure: after the switchover, the Supervisor Engine in slot 2 is Active and holds the same EIGRP RIB, OSPF RIB, ARP table, and FIB table contents.]
No additional signaling required to maintain topology

NSR Deployment Scenario
Case Study: MPLS VPN Provider Edge
• The provider PE device can use NSR for peering with the CE devices
• It can use NSF for peering with the internal P devices or Route Reflectors
• The CE devices do not need to be NSF-aware peers
• Control plane resources can be optimized by using NSF and NSR together
[Figure: MPLS VPN cloud with P and PE routers and attached CE devices.]
NSR Configuration - IOS
• Configuration is required to enable NSR

    router eigrp 1
     nsr
    !
    router ospf 1
     nsr
    !
    router isis 1
     nsr

Comparing NSF and NSR

Metric: Configuration required
    Non-Stop Forwarding: Yes, per protocol instance on the NSF-capable device. No configuration is required on NSF-aware devices for Interior Gateway Protocols. BGP requires GR configuration on both the NSF-capable and the NSF-helper device.
    Non-Stop Routing: Yes, per protocol instance. BGP also requires per-peer configuration.
Metric: Which routing protocols are supported
    Non-Stop Forwarding: EIGRP, OSPFv2, OSPFv3, ISIS, BGP, LDP, etc.
    Non-Stop Routing: ISIS, BGP, OSPFv2, etc.
Metric: Synchronizes routing protocol state and RIB information across redundant control planes
    Non-Stop Forwarding: No
    Non-Stop Routing: Yes
Metric: Consumes additional CPU and memory resources
    Non-Stop Forwarding: Negligible
    Non-Stop Routing: Yes, in proportion to the number of routes per protocol
Metric: Requires specific feature support on peer routers
    Non-Stop Forwarding: Yes
    Non-Stop Routing: No

High Availability At Different Layers
Standalone Chassis, Redundant Core
Redundant Supervisors: Yes or No? (Catalyst 6500)
• Redundant topologies with equal-cost multipaths (ECMP) provide sub-second convergence
• NSF/SSO provides superior availability in environments with non-redundant paths
• RP convergence is dependent on IGP and tuning
[Chart: seconds of lost voice for Link Failure, Node Failure, NSF/SSO, and OSPF Convergence.]

Redundant Supervisors: Yes or No? (Catalyst 6500)
• HSRP doesn't flap on Supervisor SSO switchover
• Reduces the need for sub-second HSRP timers
• SSO-aware HSRP
    • 6500-E - 12.2(33)SXH
    • 4500 - 12.2(31)SG
[Chart: seconds of lost voice.]

Design Considerations for NSF/SSO
Where Does It Make Sense?
• Access switch is the single point of failure in best practices HA design
• Supervisor failure is the most common cause of access switch service outages
• Recommended design with NSF/SSO provides for sub-600 msec recovery of voice and data traffic
[Chart: seconds of lost voice.]

Agenda
• Designing High Availability Networks for the Enterprise
• System Hardware and Software Resiliency
    • Availability Modeling
    • Stateful Switchover, Non-Stop Forwarding, and Non-Stop Routing
    • Stackwise480 and Stackwise
    • In Service Software Upgrades
• Foundations of the Structured Network Design
• High Availability Architectures
• High Availability System Recovery Analysis

Catalyst 9300 Series
Cisco Stackwise-480

Stacking Cable - Close-up
Cable lengths
• 0.5m
• 1m
• 3m

Understanding the Stack Ring
Assuming 4 x 24-port Cat9K switches
• 6 rings in total
    • 3 rings go East
    • 3 rings go West
• Each ring is 40G
• Total stack BW = 240G
• With spatial reuse = 480G
Packets are segmented/reassembled in HW (256 byte segments)
[Figure: stack interface ASIC and stack interface rings; callout "Is math really an opinion?"]

Understanding Spatial Reuse
Doubling the capacity of my stack
Assuming 4 x 24-port 9300 switches
• Destination stripping: a packet travels ½ the rings and is taken out of the stack by the destination
[Figure: two four-member rings, switches 1-4.]
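The ring arithmetic on the Stack Ring and Spatial Reuse slides can be checked with a short sketch. The function name and parameters are illustrative only; the factor of two for spatial reuse is taken from the slide's "doubling the capacity" statement, not derived from the hardware.

```python
# Minimal sketch of the StackWise-480 bandwidth arithmetic from the slides.
# Names are hypothetical; values (6 rings, 40G each) come from the deck.

RING_GBPS = 40      # each ring is 40G
RINGS_EAST = 3      # 3 rings go East
RINGS_WEST = 3      # 3 rings go West


def stack_bandwidth_gbps(ring_gbps=RING_GBPS,
                         rings=RINGS_EAST + RINGS_WEST,
                         spatial_reuse=True):
    """Return total stack bandwidth in Gbps.

    Without spatial reuse the total is simply rings * ring speed.
    With spatial reuse the slides state the capacity doubles, because
    destination stripping removes a packet after it has crossed only
    part of the ring.
    """
    raw = ring_gbps * rings
    return raw * 2 if spatial_reuse else raw


if __name__ == "__main__":
    print(stack_bandwidth_gbps(spatial_reuse=False))  # raw ring capacity
    print(stack_bandwidth_gbps())                     # with spatial reuse
```

The raw figure corresponds to the "Total Stack BW" bullet and the doubled figure to the "With Spatial Reuse" bullet.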
Stack Ring Healing
Example shows: 4 x 24-port Cat9K switches
• Detection is by hardware
• Software is notified immediately
• Ring wrap is initiated immediately (1-2 ms)
• For recovery, hardware detects the other side; software validates the link and then brings up the connection gracefully
• Unwrap is slower than wrap
• All rings wrap
• 240 Gbps when wrapped

IOS XE Software Internals Overview
[Figure: software block diagram. Infra Domain, LC Domain, and RP Domain on top of the Kernel. Labels shown: Service Location, Wireless Controller, Consolidated Logging, Forwarding & Feature Mgr (FFM), Stack Manager (3K), Features PD, Platform Drivers, HA, UADP ASIC Drivers, External Transports (TCP/SCTP/UDP), Internal IPC, Licensing Services, Libraries/Utilities, Comet Services, Low Level APIs, Forwarding Engine Driver, Packet Delivery Service, Platform Manager, System Manager, Availability Framework, IOSd, RP Interface Manager.]

UADP
UADP provides an unparalleled degree of flexibility in an access switch
• Designed for flexibility
• Excellent for encapsulations, which often need recirculation
• Parse depth of 256 bytes
• 15 programmable stages
• Up to 250 frames across stages at one time
• Ability to handle current and future protocols: extremely flexible and capable

VXLAN as a protocol had not even been invented when the UADP ASIC was designed … Yet UADP forwards VXLAN in hardware, at high performance, in IOS-XE 16.3+ … thanks to the FlexParser
[Figure: VXLAN encapsulation, field widths in bits. Callouts repeat the UADP properties: parse depth of 256 bytes, 15 programmable stages, up to 250 frames across stages at one time. Callout text "VXLAN is a complex protocol … in" is truncated in the source.]

Underlay
    Outer MAC Header (14 bytes, plus 4 bytes optional)
        Dest. MAC             48   Next-hop MAC address
        Source MAC            48   Src VTEP MAC address
        VLAN Type 0x8100      16
        VLAN ID               16
        Ether Type 0x0800     16
    Outer IP Header (20 bytes)
        IP Header Misc. Data  72
        Protocol 0x11 (UDP)    8
        Header Checksum       16
        Source IP             32   Src RLOC IP address
        Dest. IP              32   Dst RLOC IP address
    UDP Header (8 bytes)
        Source Port           16   Hash of inner L2/L3/L4 headers of original frame; enables entropy for ECMP load balancing
        VXLAN Port            16   UDP 4789
        UDP Length            16
        Checksum 0x0000       16
    VXLAN Header (8 bytes)
        VXLAN Flags RRRRIRRR   8
        Segment ID            16   Allows 64K possible SGTs
        VN ID                 24   Allows 16M possible VRFs
        Reserved               8
Overlay
    Inner (Original) MAC Header
    Inner (Original) IP Header
    Original Payload

Stack Discovery
• Switches boot
• Stack interfaces are brought online
• Infra and LC domains boot in parallel
• Stack Discovery Protocol discovers the stack topology: broadcast, followed by neighbor-cast
• Active election begins after discovery exits
[Figure: four stack members, each with Infra and LC domains.]

Stack Active Election
Rules of election, in order:
• The stack (or switch) whose member has the higher user-configurable priority (1-15)
• The switch or stack whose member has the lowest MAC address

    %IOSXE-1-PLATFORM: process stack-mgr: %STACKMGR-1-ACTIVE_ELECTED: Switch 1 has been elected ACTIVE.
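The two election rules above can be sketched as a simple ordering. This is only an illustration under stated assumptions: the slide lists just these two rules, the real stack manager may apply further criteria, and the `Member` type, the sample MAC addresses, and the function names are hypothetical.

```python
# Illustrative sketch of the two election rules listed on the slide:
# highest priority wins, lowest MAC address breaks a tie.
# Member data and names are assumptions, not Cisco APIs.
from typing import NamedTuple


class Member(NamedTuple):
    number: int      # switch number in the stack
    priority: int    # user-configurable, 1-15
    mac: str         # dotted-hex MAC, e.g. "0011.2233.4455"


def mac_to_int(mac: str) -> int:
    """Convert a dotted-hex MAC string to an integer for comparison."""
    return int(mac.replace(".", ""), 16)


def elect_active(members):
    """Pick the member with the highest priority, then the lowest MAC."""
    for m in members:
        if not 1 <= m.priority <= 15:
            raise ValueError(f"priority out of range for switch {m.number}")
    return min(members, key=lambda m: (-m.priority, mac_to_int(m.mac)))


if __name__ == "__main__":
    stack = [
        Member(1, 10, "0011.2233.4480"),
        Member(2, 10, "0011.2233.4400"),   # same priority, lower MAC
        Member(3, 1, "0011.2233.4000"),    # lowest MAC but default priority
    ]
    print(elect_active(stack).number)
```

Because priority is compared first, a switch with the lowest MAC address still loses to any member with a higher configured priority, which is why explicitly setting priorities makes the outcome predictable.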
Define Stack Roles
Minimal Downtime
• Power up the first switch that you want to make Active
• Configure the priority of the switch (1-15)
    • 1 by default; the higher the better
• Power up the second member that you want to make Standby
• Configure a priority lower than the Active
• Power up the rest of the members

    Catalyst9300#switch 1 priority 15
    Catalyst9300#switch 2 priority 14
    Catalyst9300#switch 3 priority 13
    Catalyst9300#switch 4 priority 12

*Priority command is a global command

Catalyst 9K Stack similarity to Catalyst 6500
Catalyst 6500
• Active and Standby Supervisors
• Run IOS on Supervisors
• Synchronize information
• Active programs all DFCs
• DFCs run a subset of IOS for LCs
Catalyst 9K stack
• Active and Standby units
• Run IOSd, WCM, etc. on Active/Standby
• Synchronize information
• Active programs the data plane for members
• Member switches act as line cards, connected via the stack cable

Show switch with SSO
Stack MAC follows the Active initially

    Switch# show switch
    Switch/Stack Mac Address : 2037.06cf.0e80
                                               H/W      Current
    Switch#  Role     Mac Address     Priority Version  State
    ------------------------------------------------------------
    *1       Active   2037.06cf.0e80  10       V01      Ready
     2       Standby  2037.06cf.3380  8        V00      Ready
     3       Member   2037.06cf.1400  6        V00      Ready
     4       Member   2037.06cf.3000  4        V00      Ready

* Indicates which member is providing the "stack identity" (aka "stack MAC")
Show switch detail output

    Switch# show switch detail
    Switch/Stack Mac Address : 2037.06cf.0e80
                                               H/W      Current
    Switch#  Role     Mac Address     Priority Version  State
    ------------------------------------------------------------
    *1       Active   2037.06cf.0e80  10       V01      Ready
     2       Standby  2037.06cf.3380  8        V00      Ready
     3       Member   2037.06cf.1400  6        V00      Ready
     4       Member   2037.06cf.3000  4        V00      Ready

             Stack Port Status      Neighbors
    Switch#  Port 1    Port 2       Port 1   Port 2
    --------------------------------------------------------
     1       OK        OK           2        4
     2       OK        OK           3        1
     3       OK        OK           4        2
     4       OK        OK           1        3

Catalyst 9000 - HA State Machine
• Active starts RP Domain locally
• Programs hardware on all LC Domains
• Traffic starts once hardware is programmed
• Starts 2 min timer to elect Standby in parallel
• Active elects Standby
• Standby starts RP Domain locally
• Starts bulk sync with Active RP
• Standby reaches "Standby Hot"
[Figure: four stack members with Infra, LC, and RP domains; Active (A) and Standby (S) marked; 2 min timer.]

Show redundancy states

    Switch# show redundancy states
    my state = 13 -ACTIVE                        (terminal state for the Active unit)
    peer state = 8 -STANDBY HOT                  (terminal state for the Standby unit for SSO)
    Mode = Duplex
    Unit ID = 1                                  (slot number of the Active unit)
    Redundancy Mode (Operational) = SSO
    Redundancy Mode (Configured) = SSO
    Redundancy State = SSO
    Manual Swact = enabled
    Communications = Up                          (communication channel status between the Active/Standby RP units)
    client count = 76
    client_notification_TMR = 360000 milliseconds
    keep_alive TMR = 9000 milliseconds
    keep_alive count = 0
    keep_alive threshold = 9
    RF debug mask = 0
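The neighbor table in the `show switch detail` output above is enough to tell whether the stack cabling forms a complete ring. The sketch below walks that table; it assumes the neighbor columns have already been parsed into a dictionary, and it represents a port without a neighbor as Python `None`, which is a modelling choice here and not the CLI's own notation.

```python
# Sketch: decide whether a stack is cabled as a full ring, given the
# per-switch neighbor pairs (port 1 neighbor, port 2 neighbor).
# The dictionary layout and function name are assumptions for illustration.

def is_full_ring(neighbors):
    """Return True if every switch has two neighbors and all switches
    sit on one closed loop."""
    if not neighbors:
        return False
    # Every stack port must see a neighbor.
    if any(n is None for ports in neighbors.values() for n in ports):
        return False
    # Each adjacency must be reported by both ends.
    for switch, ports in neighbors.items():
        for n in ports:
            if n not in neighbors or switch not in neighbors[n]:
                return False
    # Walk the ring from any switch and make sure it closes after
    # visiting every member exactly once.
    start = next(iter(neighbors))
    prev, cur, seen = None, start, set()
    while cur not in seen:
        seen.add(cur)
        a, b = neighbors[cur]
        nxt = a if a != prev else b
        prev, cur = cur, nxt
    return cur == start and len(seen) == len(neighbors)


if __name__ == "__main__":
    # Neighbor columns taken from the show switch detail capture.
    ring = {1: (2, 4), 2: (3, 1), 3: (4, 2), 4: (1, 3)}
    print(is_full_ring(ring))
```

A stack with one cable missing fails the first check, since the two switches at the break each report a port with no neighbor.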
Show Redundancy Command Output

    Switch#sh redundancy
    Redundant System Information :
    ------------------------------
           Available system uptime = 29 weeks, 2 days, 11 hours, 47 minutes
    Switchovers system experienced = 2
                  Standby failures = 0
            Last switchover reason = user_forced

                     Hardware Mode = Duplex
        Configured Redundancy Mode = SSO
         Operating Redundancy Mode = SSO
                  Maintenance Mode = Disabled
                    Communications = Up

    Current Processor Information :
    ------------------------------
                   Active Location = slot 1
            Current Software state = ACTIVE
           Uptime in current state = 1 week, 4 days, 22 hours, 38 minutes
                     Image Version = Cisco IOS Software, IOS-XE Software, Catalyst L3 Switch Software (CAT3K_CAA-UNIVERSALK9-M), Version 03.03.03E RELEASE SOFTWARE (fc1)

    Peer Processor Information :
    ------------------------------
                  Standby Location = slot 2
            Current Software state = STANDBY HOT
           Uptime in current state = 1 week, 4 days, 22 hours, 34 minutes
                     Image Version = Cisco IOS Software, IOS-XE Software, Catalyst L3 Switch Software (CAT3K_CAA-UNIVERSALK9-M), Version 03.03.03E RELEASE SOFTWARE (fc1)

Callouts: "System uptime" (Available system uptime) and "Image version of current unit" (Image Version).

StackWise Virtual Architecture
Extending StackWise Architecture
• Cisco StackWise Virtual extends proven back-panel technology over front-panel network ports
• Cisco StackWise Virtual simplifies the distribution layer by combining two common Cat 9K series chassis into a single logical entity
[Figure: Dist-1 formed by SW-1 and SW-2 (Cat 9k) joined by 40G/10G links, compared with VSS; callout "Does it look familiar?"]

StackWise Virtual Architecture
Resilient Software Design
• Cisco StackWise Virtual supports 1+1 inter-chassis SSO redundancy, providing non-stop communication
• Consistent SSO and NSF capable protocols and features on both deployment models
[Figure: Dist-1 formed by SW-1 and SW-2 (Cat 9k) joined by 40G/10G links.]
StackWise Virtual Architecture
Simplified. Scalable.
• Cisco StackWise Virtual supports a unified control and management plane architecture
• Complex network designs get simplified with Multi-Chassis EtherChannels (MEC)
• Improved application performance with deterministic network resiliency during various planned or unplanned failures
[Figure: Core, Distribution (Dist-1: SW-1 and SW-2, Cat 9k, 40G/10G), and Access (Cat 9k) layers.]

SW Patching in IOS-XE
Patching support is only for the Cat9K product family

Adding a SMU file

    9300#install add file flash:cat9k-universalk9.2017-0317_21.53_zhangyu.301.CSCuo76464.SSA.smu.bin
    install_add: START Sun Mar 26 01:13:29 UTC 2017
    SUCCESS: Finished copying package(s) to the selected switch(es)
    SUCCESS: install_add /flash/cat9k-universalk9.2017-0317_21.53_zhangyu.301.CSCuo76464.SSA.smu.bin Sun Mar 26 01:13:31 UTC 2017

Activating SMU

    9300#install activate file flash:cat9k-universalk9.2017-0317_21.53_zhangyu.301.CSCuo76464.SSA.smu.bin
    install_activate: START Sun Mar 26 01:14:12 UTC 2017
    2 install_activate: Activating SMU...
    This operation requires a reload of the system. Do you want to proceed? [y/n]y
    2 install_activate: Reloading the box to complete activation of the SMU...

Committing it

    9300#install commit
    install_commit: START Sun Mar 26 01:24:41 UTC 2017
    SUCCESS: install_commit Sun Mar 26 01:24:43 UTC 2017

Any failures/reloads between activate and commit result in a rollback

SMU Deployment Experience with Cisco DNA Center
• Download SMU to APIC-EM file server
• Analyze SMU impact
• Test SMU on pilot setup
• Schedule SMU deployment
[Figure: the network admin uses the Cisco DNA Center app; the SMU and its ReadMe come from Cisco.com to the SMU file server (APIC-EM server) and are deployed to the pilot site and then the production sites.]
Stackable Best Practices

Stacking Convergence
Multi-Layer Access (not a recommended design)
• An Active unit with an uplink introduces two failures when it fails
    • Active control plane
    • Uplink interface
• When the Active fails, the Standby will take over
• Upstream, HSRP/GLBP will detect link down, and D2 will start answering to the virtual MAC 0000.0c07.ac00
• Downstream traffic is re-routed to D2 via the L3 link
[Figure: distribution switches D1 (HSRP ACTIVE) and D2 (HSRP STANDBY) advertising summary subnets, vIP 10.0.0.10, vMAC 0000.0c07.ac00; L2 access stack S1, S2, S3 acting as a single logical switch with Active and Standby members marked. Host 1: IP 10.0.0.1, MAC aaaa.aaaa.aa01, GW 10.0.0.10, ARP 0000.0c07.ac00. Host 2: IP 10.0.0.3, MAC aaaa.aaaa.aa03, GW 10.0.0.10, ARP 0000.0c07.ac00.]

Stacking Convergence
Multi-Layer Access
• Active unit failure (without uplink)
• When the Active fails, the Standby will take over
• No HSRP/GLBP failover: while the new Active is being elected, the HSRP/GLBP MAC address is still used by the rest of the stack for data forwarding
• No downstream re-route convergence
[Figure: same topology as the previous slide, with the Active and Standby members on switches without uplinks.]
Catalyst 9300 StackWise
Routed Access
• CLI "stack-mac persistent timer 0" enables MAC consistency
    • This is the default value for 3850/9300
    • This is a change from the existing stacking model
• New Active inherits the MAC address of the previous Active
    • No MAC changes for end hosts and adjacent routers, which significantly improves upstream recovery
• Caution
    • Do not re-introduce the 3x50/9300 (the previous Active) elsewhere, in order to avoid a duplicate MAC in your network
[Figure: distribution switches advertising summary subnets; L3 access stack S1, S2, S3 acting as a single logical switch with Active and Standby members marked; "NO MAC changes". Host 1: IP 10.0.0.1, MAC aaaa.aaaa.aa01, GW 10.0.0.10, ARP 000c.cece.7c80. Host 2: IP 10.0.0.3, MAC aaaa.aaaa.aa03, GW 10.0.0.10, ARP 000c.cece.7c80.]

Changing Stack MAC on Cat9K Switches
• By default the timer value is set to indefinite (0)
    • System continues to keep the selected stack MAC after switchover
    • Avoids protocol flapping
• How to change it
    • A new command was introduced

    switch#stack-mac update force

Before the switchover:

    Catalyst9k#show switch
    Switch/Stack Mac Address : 2037.06cf.0e80
    Mac persistency wait time: Indefinite
                                               H/W      Current
    Switch#  Role     Mac Address     Priority Version  State
    ------------------------------------------------------------
    *1       Active   2037.06cf.0e80  10       V01      Ready
     2       Standby  2037.06cf.3380  8        V00      Ready
     3       Member   2037.06cf.1400  6        V00      Ready
     4       Member   2037.06cf.3000  4        V00      Ready

After switch 1 is removed (the slide overlays the stack MAC 2037.06cf.0e80 with 2037.06cf.3380, the value after the forced update):

    Catalyst9k#show switch
    Switch/Stack Mac Address : 2037.06cf.0e80 (2037.06cf.3380 after the update)
    Mac persistency wait time: Indefinite
                                               H/W      Current
    Switch#  Role     Mac Address     Priority Version  State
    ------------------------------------------------------------
    *1       Member   0000.0000.0000  10       V01      Removed
     2       Active   2037.06cf.3380  8        V00      Ready
     3       Member   2037.06cf.1400  6        V00      Ready
     4       Member   2037.06cf.3000  4        V00      Ready
Key Recommendations for Stacking
• Run the stack in full ring mode to get full bandwidth
• Configure the Active switch priority and Standby switch priority
    • Predetermine which switch is the Active, and which Standby will become the Active should the Active fail
    • Simplifies operations
• Configure Active and Standby units without uplinks if possible
    • If deploying a stack of 4 or more switches, keep the Active and Standby switches without uplinks; this will simplify the convergence and reduce the outage time
• Do not change the stack-mac timer value
    • By default the value is 0 (indefinite)
    • Avoids protocol flapping
    • There is a command to change the stack-mac when needed

Agenda
• Designing High Availability Networks for the Enterprise
• System Hardware and Software Resiliency
    • Availability Modeling
    • Stateful Switchover, Non-Stop Forwarding, and Non-Stop Routing
    • Stackwise480 and Stackwise
    • In Service Software Upgrades
• Foundations of the Structured Network Design
• High Availability Architectures
• High Availability System Recovery Analysis

ISSU Overview
• ISSU provides a mechanism to perform software upgrades and downgrades without taking the switch out of service
• Leverages the capabilities of NSF and SSO to allow the switch to forward traffic during Supervisor IOS upgrade (or downgrade)
• Key technology is the ISSU infrastructure
    • Allows SSO between different versions
[Figure: Catalyst 9400 with Active Sup, Standby Sup (SSO), and line cards.]

In Service Software Upgrades
Streamlined Process for Software Upgrades/Downgrades
1. ISSU Loadversion
2. ISSU Runversion
3. ISSU Acceptversion (Optional)
4. ISSU Commitversion
Stateful Switchover Mode - IOS
ISSU Client and Versioning Infrastructure
[Figure: the Active Supervisor runs Cisco IOS Version 1 and the Standby Hot Supervisor runs Cisco IOS Version 2. Each side shows ISSU Versioning, ISSU Clients, the Redundancy Facility, and the Checkpointing Facility. HA-compliant applications: routing protocols, NetFlow, Cisco Discovery Protocol, and more. HA-aware applications: Forwarding Information Base, Port Manager, PAgP / LACP, and more.]

ISSU Client and Infrastructure Interactions - IOS
[Figure: Application XYZ runs as ISSU client V1 on the Active Supervisor (ISSU endpoint V1) and as ISSU client V3 on the Hot Standby Supervisor (ISSU endpoint V3). Each client registers with its versioning infrastructure, which stores the client info.]
• Register client info: client ID, message capabilities, message versions, card type…
• Propose capabilities / capabilities negotiation: endpoints agree on a common set of capabilities
• Propose message version / message version negotiation: endpoints agree on a common message version
• The V1 endpoint is compatible with V1; the V3 endpoint is compatible with V1, V2, and V3; they agree on V1
• If compatible, then message exchange can proceed
• Message transformation converts between MSG V1 and MSG V3
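The message version negotiation in the figure can be sketched as a set intersection. This is a minimal illustration: the function name is hypothetical, and picking the highest common version is an assumption that happens to match the slide's V1/V3 example, not a statement about how the ISSU infrastructure chooses in general.

```python
# Sketch of message version negotiation between two ISSU endpoints.
# Assumption: the endpoints settle on the highest version both support.

def negotiate_message_version(local_versions, peer_versions):
    """Return the agreed message version, or None if incompatible."""
    common = set(local_versions) & set(peer_versions)
    return max(common) if common else None


if __name__ == "__main__":
    active_v1 = {1}           # ISSU client V1 understands only V1 messages
    standby_v3 = {1, 2, 3}    # ISSU client V3 understands V1, V2, and V3
    agreed = negotiate_message_version(active_v1, standby_v3)
    if agreed is None:
        print("incompatible: SSO between these versions cannot proceed")
    else:
        print(f"agreed on message version V{agreed}")
```

When the result is not `None`, the newer endpoint transforms its messages to the agreed version before exchange, which is the "message transformation" step in the figure.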
ISSU Dual Supervisor - Catalyst 9400
ISSU Process Dual Supervisors
• ISSU process leverages the SSO/NSF architecture
• Uplinks on both the active and standby SUP are forwarding traffic
• Convergence is less than 200 msec
[Figure: Catalyst 9400 with Active Supervisor, Standby Supervisor (SSO), uplinks, and line card; "Start ISSU".]

C9K ISSU
Dual Supervisor ISSU
3-step process: granular control of the upgrade process with the ability to roll back
• install add file <tftp/ftp/flash/disk:*.bin>
• install activate issu
• install commit
1-step process: single command to perform the complete ISSU
• install add file <tftp/ftp/flash/disk:*.bin> activate issu commit

C9K ISSU Workflow
Dual Supervisor ISSU
1. ISSU started; the image is expanded on Active and Standby (S1 Active on V1, S2 Standby on V1)
2. Standby reloads with the new V2 image (S1 Active on V1, S2 Standby on V2)
3. Auto-switchover causes S2 to become the new Active, and S1 reloads with the new V2 image (S2 Active on V2, S1 Standby on V2)
4. The 'commit' keyword stops the abort timer
5. ISSU complete (S1 and S2 on V2)
Figure annotations:
• Abort timer starts
• If S2 fails to become standby, it will revert back to step 1
• An expired abort timer will revert to step 2 and then step 1
• Abort timer stopped

Stackwise Virtual - ISSU
C9K ISSU
Stackwise Virtual ISSU and Dual Supervisor ISSU
3-step process: granular control of the upgrade process with the ability to roll back
• install add file <tftp/ftp/flash/disk:*.bin>
• install activate issu
• install commit
1-step process: single command to perform the complete ISSU
• install add file <tftp/ftp/flash/disk:*.bin> activate issu commit
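The five-step C9K ISSU workflow and its rollback rules can be modelled as a small state walk. This is a sketch under assumptions: the `Supervisors` type, the `run_issu` function, and its boolean inputs are hypothetical, and the rollback is simplified to "return both supervisors to the old version with S1 active", following the slide's "revert to step 2 and then step 1" annotation.

```python
# Simplified model of the dual-supervisor ISSU workflow from the slide.
# All names are illustrative; this is not how IOS XE implements ISSU.
from dataclasses import dataclass


@dataclass
class Supervisors:
    s1_version: str
    s2_version: str
    active: str  # "S1" or "S2"


def run_issu(old, new, standby_comes_up=True, committed_in_time=True):
    """Walk the workflow and return (final state, log of steps)."""
    state = Supervisors(old, old, "S1")
    log = ["1: image expanded on active and standby"]

    # Step 2: the standby reloads with the new image.
    if not standby_comes_up:
        log.append("2: S2 failed to become standby, reverted to step 1")
        return state, log
    state.s2_version = new
    log.append("2: standby S2 reloaded with new image")

    # Step 3: auto-switchover, then the old active reloads with the new image.
    state.active = "S2"
    state.s1_version = new
    log.append("3: S2 is active, S1 reloaded with new image")

    # Step 4: commit must arrive before the abort timer expires.
    if not committed_in_time:
        state = Supervisors(old, old, "S1")
        log.append("abort timer expired, rolled back to step 1")
        return state, log
    log.append("4: commit stopped the abort timer")
    log.append("5: ISSU complete")
    return state, log


if __name__ == "__main__":
    final, steps = run_issu("V1", "V2")
    print(final)
    print("\n".join(steps))
```

Calling `run_issu("V1", "V2", committed_in_time=False)` shows the rollback path, which is the behavior that gives the 3-step process its trial period.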
Stackwise Virtual ISSU
ISSU Process
[Figure: two Catalyst 9500-24Q switches joined by a StackWise Virtual link and a dual-active detection link, each upgrading from 16.8.1 to 16.8.2 after "install ISSU". Sequence: 1st sub-second traffic convergence, auto-switchover, 2nd sub-second traffic convergence.]

Enhanced Fast Software Upgrade - Catalyst 9000
Achieving High Availability on Catalyst 9300
Enhanced Fast Software Upgrade
• eFSU provides a mechanism to upgrade and downgrade the software image by segregating the control plane and data plane updates
• It updates the control plane by leveraging the NSF/GR architecture, with a flush and re-learn mechanism to reduce the impact on the data plane
[Figure: control-plane RIB (prefixes 10.0.0.0, 10.1.0.0, 10.20.0.0, next hop 10.1.1.1) above the data-plane FIB table (10.1.1.1 aabbcc:ddee32, 10.1.1.2 adbb32:d34e43, 192.168.0.0 aa25cc:ddeee8).]

Enhanced Fast Software Upgrade
Regular Upgrade vs. Enhanced Fast Software Upgrade Process (16.10.1*)
Regular upgrade
    #install add file image activate commit
    Traffic is impacted throughout the upgrade cycle
Enhanced Fast Software Upgrade
    #install add file image activate reloadfast enhanced commit
    < 30 seconds of traffic impact
* Limited Controlled Availability in 16.10.1

Enhanced Fast Software Upgrade
CLI commands
• FSU is supported only in install mode
• One-step command which activates the fast software upgrade and commits it

    9300# install add file flash:cat9k_iosxe.BLD_V1610 activate reloadfast enhanced commit

• Fast reload without software upgrade

    9300# Reload Fast Enhanced
Enhanced Fast Software Upgrade - VSS system

VSS Software Upgrade on Catalyst 6500
Enhanced Fast Software Upgrade (EFSU)
Preparation steps
1. Before the ISSU software upgrade, VSS Switch-1 and Switch-2 will be running the old software image.
2. Install the new image to the same location on the file systems of both Supervisors.
3. Make sure the boot register is configured for auto boot (0x2102).
Execute upgrade
1. ISSU Loadversion
2. ISSU Runversion
3. ISSU Acceptversion (Optional)
4. ISSU Commitversion
[Figure: three animation builds of the same slide. Switch-1 and Switch-2 (WS-X6708-10G line cards) are joined by the VSL and move between VSS Active, VSS Standby Hot, and Standby Cold as each step runs. Legend: old version, new version, R = reload, SO = switchover. A chart shows available bandwidth (100% / 50%) across steps 1-4 for SW1 and SW2.]

VSS Quad SUP SSO - Catalyst 6500
• In-chassis standby SUP in each switch
• This will keep the unit up and running when the other chassis is reloaded
• We take advantage of this for EFSU
• There are 2 upgrade modes
    • Standard EFSU
    • Staggered EFSU
[Figure: Switch ID 1 with ICA (SSO Active) and ICS; Switch ID 2 with ICA (SSO Standby) and ICS.]

EFSU Quad Sup
Normal Quad Sup Upgrade vs. Staggered Quad Sup Upgrade
Normal Quad Sup upgrade
1. ISSU Loadversion (whole Standby Sw2 chassis reload)
2. ISSU Runversion (whole Active Sw1 chassis reload)
3. ISSU Acceptversion (Optional)
4. ISSU Commitversion (whole Standby Sw1 chassis reload)
Staggered Quad Sup upgrade
1. ISSU Loadversion (2nd Sup on Standby chassis - ICS)
2. ISSU Loadversion - Step 2 (switchover with the Standby chassis, LCs reload)
3. ISSU Runversion (chassis S/O)
4. ISSU Commitversion (ICS on new Standby chassis)
5. ISSU Commitversion - Step 2 (reload on the new Standby chassis LC)
[Charts: available bandwidth (100% / 50%) for SW1 and SW2 across steps 1-4 (normal) and steps 1-5 (staggered).]

Cisco IOS ISSU Summary
• ISSU is a software upgrade/downgrade procedure
    • Changes the risk assessment criteria
    • Minimizes the impact of upgrades/downgrades
    • Allows for a trial period with automated rollback
    • Less downtime
• Both software versions must be ISSU compatible in order to achieve an SSO-based upgrade
• Software version compatibility includes
    • 18 month rolling window between software releases of the same train
    • Same license level required between versions
Graceful Insertion and Removal - GIR

Graceful Insertion and Removal for Catalyst 9000
Isolation of Switch from network
[Figure: change window begins. "Start Maintenance": one command! A pre-change system snapshot is taken.]

Graceful Insertion and Removal for Catalyst 9000
Return Switch into network
[Figure: change window begins. "Stop Maintenance": one command! Pre-change system snapshot.]

Graceful Insertion and Removal
Isolation of Switch from network
• Isolate a switch from the network in order to perform debugging or an upgrade.
• Isolate: all protocols are gracefully brought down, but the switch is not shut down.
• Entering maintenance mode:
    • EGP -> IGP in parallel -> L2 (shutdown port)
• Exiting maintenance mode:
    • L2 -> IGP in parallel -> EGP

Graceful Insertion and Removal
Default and Customizable Templates
• Default template
    • System-generated profile based on the switch configuration

    9300L#show system mode maintenance template default
    System Mode: Normal
    default maintenance-template details:
    router isis 1
    shutdown l2

• Customized template
    • User-configured profile based on a specific configuration or use case

    9300L#show system mode maintenance template test
    System Mode: Normal
    Maintenance Template test details:
    shutdown l2

Graceful Insertion and Removal
Snapshots
• Automatic snapshots
    • Snapshots are automatically generated when entering and exiting maintenance mode
    • Capture operational data from the running system, such as VLANs, routes, etc.
• User Configured Snapshots
  • Snapshots can be collected manually for comparing and troubleshooting
Switch#show system snapshots compare before_maintenance after_maintenance
================================================================================
Feature Tag                  .before_maintenance     .after_maintenance
================================================================================
[interface]
--------------------------------------------------------------------------------
[Name:Vlan1]
packetsinput                 181587                  **181589**
[Name:GigabitEthernet1/0/3]
packetsinput                 101531                  **101550**
broadcasts                   80893                   **80910**
packetsoutput                211568                  **211594**
[Name:GigabitEthernet1/0/8]
output                       00:00:00,               **00:00:04,**
packetsinput                 6915                    **6918**
packetsoutput                57677                   **57706**
[Name:GigabitEthernet1/0/17]
packetsinput                 101528                  **101550**
broadcasts                   80891                   **80910**
packetsoutput                211570                  **211600**

GIR Summary
• GIR used to isolate a switch
  • Maintenance
  • HW upgrade
  • SW upgrade
• Works well in an L3 end to end network
  • Order of maintenance is EGP -> IGP (in parallel) -> L2 shutdown
  • HSRP/VRRP can be leveraged without causing issue on switchover
• If StackWise Virtual is deployed, you don't need to do GIR to upgrade those switches
  • Leverage ISSU with StackWise Virtual
Agenda
• Designing High Availability Networks for the Enterprise
• System Hardware and Software Resiliency
• Foundations of the Structured Network Design
  • Modularity, Hierarchy, and Structure
  • Leveraging Hardware-Based Path Restoration
• High Availability Architectures
• High Availability System Recovery Analysis

[Figure: enterprise network overview covering headquarters, regional site, remote sites, teleworker/mobile worker, data center, Internet edge, and WAN aggregation]

Hierarchical network design
High availability using modularity, hierarchy, and structure
• Each layer in hierarchy has a specific role
• Modular topology: building blocks
• Modularity makes it easy to grow, understand, and troubleshoot
• Structure creates small fault domains and predictable network behavior: clear demarcations and isolation
• Promotes load balancing and resilience
[Figure: building blocks of access and distribution layers joined through the core]
Hierarchical network design
• Core
  • Connectivity, availability and scalability
• Distribution
  • Aggregation for wiring and traffic flows
  • Policy and network control point (FHRP, L3 summarization)
• Access
  • Physical: Ethernet wired 10/100/1000(802.3z)/mGig(802.3bz); 802.3af (PoE), 802.3at (PoE+), and Cisco Universal PoE (UPOE)
  • Policy enforcement: security (802.1x, port security, DAI, IPSG, DHCP snooping); identification (CDP/LLDP); QoS (policing, marking, queuing)
  • Traffic control: IGMP snooping, broadcast control

Hierarchical network design
Do I need a core layer?
• It is a question of operational complexity and a question of scale
  • n x (n-1) scaling
  • Routing peers
  • Fiber, line cards and port counts ($,€,£)
• Capacity planning considerations
  • Easier to track traffic flows from a block to the common core than to 'n' other blocks
• Geographic factors may also influence the design
  • Multi-building interconnections may have fiber limitations

Structured campus network design
• Optimize data load-sharing and redundancy design for best application performance
• Diversify uplink network paths with cross-stack and dual-sup access-layer switches
• Build distributed and full-mesh network paths between distribution and access-layer switches
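Returning to the "Do I need a core layer?" question, the n x (n-1) scaling argument can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes one link between every pair of distribution blocks in the meshed case, and one uplink from each block to each of two core switches in the core case. The function names and the two-core-switch default are assumptions, not part of the original slides.

```python
def full_mesh_ports(blocks: int) -> int:
    """Interfaces (link ends) needed to mesh every block with every other: n x (n-1)."""
    return blocks * (blocks - 1)


def full_mesh_links(blocks: int) -> int:
    """Physical links in the same full mesh; each link consumes two ports."""
    return full_mesh_ports(blocks) // 2


def core_links(blocks: int, core_switches: int = 2) -> int:
    """Links needed when each block only uplinks to a common core (assumed: 2 core switches)."""
    return blocks * core_switches


# Compare growth as distribution blocks are added.
for n in (2, 4, 8, 16):
    print(f"{n:>2} blocks: full mesh = {full_mesh_links(n):>3} links "
          f"({full_mesh_ports(n):>3} ports), via core = {core_links(n):>2} links")
```

With these assumptions the meshed design grows quadratically while the core design grows linearly, which is the scale half of the question; routing peer counts follow the same pattern.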
High availability design optimization of the elements
• Optimize the interaction of the physical redundancy with the network protocols
  • Provide the necessary amount of redundancy
  • Pick the right protocol for the requirement
  • Optimize the tuning of the protocol
• The network looks like this so that we can map the protocols onto the physical topology

What we are trying to avoid!
[Figure]

Optimizing network convergence
Failure detection and recovery
• Optimal high availability network design attempts to leverage 'local' switch fault detection and recovery
• Design should leverage the hardware capabilities of the switches to detect and recover traffic flows based on these 'local' events
• Design principle: hardware failure detection and recovery is both faster and more deterministic
• Design principle: software failure detection mechanisms provide a secondary, not primary, fault detection and recovery mechanism in the optimal design
Optimizing network convergence
Layer 1 link redundancy and failure detection
• Direct point to point fiber provides for fast failure detection
• Do not disable auto-negotiation on GigE and 10GigE interfaces
• IEEE 802.3z and 802.3ae link negotiation define the use of Remote Fault Indicator and Link Fault Signaling mechanisms
• IOS debounce
  • 10 msec for GigE and 10GigE fiber ports
  • Minimum for copper is 300 msec
• NX-OS debounce: currently 100 msec by default
  • All 1G and 10G SFP / SFP+ based interfaces (MM, SM, CX-1) changing to a default of 10 msec
  • RJ45 based copper interfaces on NX-OS will remain at 100 msec
• Design principle: understand how hardware choices and tuning impact fault detection and response to link failures

Optimizing network convergence
Layer 2 software fault detection (e.g. UDLD)
• While 802.3z and 802.3ae link negotiation provide for L1 fault detection, hardware ASIC failures can still occur
• UDLD provides an L2 based keep-alive mechanism that confirms bi-directional L2 connectivity
• Each switch port configured for UDLD will send UDLD protocol packets (at L2) containing the port's own device / port ID, and the neighbor's device / port IDs seen by UDLD on that port
• If the port does not see its own device / port ID echoed in the incoming UDLD packets, the link is considered unidirectional and is shut down
• Design principle: redundant fault detection mechanisms are required (SW as a backup to HW where possible)
[Figure: UDLD keepalives exchanged over the Tx/Rx fiber pair between two switches]

Optimizing network convergence
Layer 2 and 3: Why use routed interfaces?
L3 routed interfaces allow faster convergence than an L2 switchport with an associated L3 SVI

L3 routed interface:
21:38:37.042 UTC: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet3/1, changed state to down
21:38:37.050 UTC: %LINK-3-UPDOWN: Interface GigabitEthernet3/1, changed state to down
21:38:37.050 UTC: IP-EIGRP(Default-IP-Routing-Table:100): Callback: route_adjust GigabitEthernet3/1

L2 switchport with L3 SVI:
21:32:47.813 UTC: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet2/1, changed state to down
21:32:47.821 UTC: %LINK-3-UPDOWN: Interface GigabitEthernet2/1, changed state to down
21:32:48.069 UTC: %LINK-3-UPDOWN: Interface Vlan301, changed state to down
21:32:48.069 UTC: IP-EIGRP(Default-IP-Routing-Table:100): Callback: route_adjust Vlan301

Agenda
• Designing High Availability Networks for the Enterprise
• System Hardware and Software Resiliency
• High Availability Architectures:
  • Enterprise Wired LAN
    • Multilayer Campus Distribution and HA Considerations
    • Simplified Distribution and HA Advantages
    • Extending HA Advantages by Simplifying Virtualization
  • Enterprise Data Center
  • Enterprise Wireless LAN
• High Availability System Recovery Analysis

Optimizing the Layer 2 design: spanning tree
VLANs spanning access switches:
• At least some VLANs span multiple access switches
• Layer 2 loops
• Layer 2 and 3 running over link between distribution
• Blocked links
• More typical of a "classic" data center design
Unique VLANs per access switch:
• Each access switch has unique VLANs
• No Layer 2 loops
• Layer 3 link between distribution
• No blocked links
• More typical of a campus LAN design
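Looking back at the two syslog excerpts on the "Why use routed interfaces?" slide, the difference is easiest to see as the time between the first link-down message and the EIGRP route_adjust callback. The sketch below simply subtracts the timestamps copied from those excerpts; the helper name `delta_ms` is an illustrative choice, and the result reflects only this one captured event, not a general guarantee.

```python
from datetime import datetime, timedelta


def delta_ms(first: str, last: str) -> int:
    """Milliseconds between two HH:MM:SS.mmm syslog timestamps on the same day."""
    fmt = "%H:%M:%S.%f"
    elapsed = datetime.strptime(last, fmt) - datetime.strptime(first, fmt)
    return elapsed // timedelta(milliseconds=1)


# Timestamps taken from the log excerpts: first message -> EIGRP callback.
routed_ms = delta_ms("21:38:37.042", "21:38:37.050")  # routed interface Gi3/1
svi_ms = delta_ms("21:32:47.813", "21:32:48.069")     # switchport Gi2/1 + Vlan301

print(f"routed interface: {routed_ms} ms, switchport with SVI: {svi_ms} ms")
```

In this capture the routed interface notifies EIGRP within a few milliseconds, while the SVI case waits for the VLAN interface to go down first.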
Optimizing the Layer 2 design
Non-STP-blocking topologies converge fastest
• When STP is not blocking uplinks, recovery of access to distribution link failures is accomplished based on L2 CAM updates, not on the Spanning Tree protocol recovery
• Time to restore traffic flows is based on:
  • Time to detect link failure + time to purge the HW CAM table and begin to flood the traffic
• No dependence on external events (no need to wait for Spanning Tree convergence)
• Behavior is deterministic

Optimizing the Layer 2 design
PVST+, Rapid PVST+, MST
• PVST+ (pre 802.1D-2004): traditional spanning tree
• Rapid-PVST+ (802.1w) greatly improves the restoration times for any VLAN that requires a topology convergence due to link UP
• Rapid-PVST+ also greatly improves convergence time over BackboneFast for any indirect link failures
• Rapid PVST+
  • Scales to large size (up to 16,000 logical ports)
  • Easy to implement, proven, scales
• MST (802.1s)
  • Permits very large scale STP implementations (up to 75,000 logical ports)

Optimizing the Layer 2 design
Complex topologies take longer to converge
• Time to converge is dependent on the protocol implemented: 802.1D, 802.1s, or 802.1w
• It is also dependent on:
  • Size and shape of the L2 topology (how deep is the tree)
  • Number of VLANs being trunked across each link
  • Number of logical ports in the VLAN on each switch
• Non-congruent topologies take longer to converge. Restricting the topology is necessary to reduce convergence times
• Prune all unnecessary VLANs from trunk configuration
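The advice to prune unnecessary VLANs from trunks ties directly to the logical port limits quoted for Rapid PVST+ and MST. As a rough sketch, and assuming the common counting convention that each VLAN carried on a trunk counts as one logical port while each access port counts once, the effect of pruning can be estimated as follows. The function name and the example port counts are hypothetical; only the 16,000 and 75,000 limits come from the slides.

```python
RAPID_PVST_LIMIT = 16_000  # logical port limit quoted on the slide
MST_LIMIT = 75_000         # logical port limit quoted on the slide


def stp_logical_ports(vlans_per_trunk: list[int], access_ports: int) -> int:
    """Assumed convention: sum of VLANs carried on each trunk, plus one per access port."""
    return sum(vlans_per_trunk) + access_ports


# Hypothetical distribution switch: 40 trunks and 20 access ports.
unpruned = stp_logical_ports([300] * 40, access_ports=20)  # every trunk carries 300 VLANs
pruned = stp_logical_ports([10] * 40, access_ports=20)     # trunks pruned to 10 VLANs each

for label, count in (("unpruned", unpruned), ("pruned", pruned)):
    print(f"{label}: {count} logical ports, "
          f"{count / RAPID_PVST_LIMIT:.0%} of the Rapid PVST+ limit")
```

Platform documentation should be checked for the exact counting rule and limit that apply to a given switch and software release.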
Optimizing the Layer 2 design
STP toolkit: PortFast and BPDU guard
• PortFast is configured on edge ports to let them move quickly to forwarding, bypassing listening and learning, and to avoid TCN (Topology Change Notification) messages
• BPDU guard can prevent loops by moving PortFast configured interfaces that receive BPDUs to errdisable state
• BPDU guard prevents ports configured with PortFast from being incorrectly connected to another switch
• When enabled globally, BPDU guard applies to all interfaces that are in an operational PortFast state
Switch(config-if)#spanning-tree portfast
Switch(config-if)#spanning-tree bpduguard enable
1w2d: %SPANTREE-2-BLOCK_BPDUGUARD: Received BPDU on port FastEthernet3/1 with BPDU Guard enabled. Disabling port.
1w2d: %PM-4-ERR_DISABLE: bpduguard error detected on Fa3/1, putting Fa3/1 in err-disable state

Optimizing the Layer 2 design
STP best practices for campus
• The root bridge should stay where you put it
  • Define the STP primary (and backup) root
  • Rootguard
  • Loopguard or bridge assurance
  • UDLD
• There is a reasonable limit to broadcast and multicast traffic volumes
  • Configure storm control on backup links to aggressively rate limit broadcast and multicast

Layer 2 access with Layer 3 distribution
First hop redundancy protocols (FHRP)
• HSRP, GLBP, and VRRP are used to provide a resilient default gateway / first hop address to end stations
• A group of routers acts as a single logical router providing first hop router redundancy
• Protect against multiple failures
  • Distribution switch failure
  • Uplink failure
• Default recovery is ~10 seconds
First Hop Redundancy
Sub-Second Timers Improve Convergence

interface Vlan4
 ip address 10.120.4.2 255.255.255.0
 standby 1 ip 10.120.4.1
 standby 1 timers msec 250 msec 750
 standby 1 priority 150
 standby 1 preempt
 standby 1 preempt delay minimum 180

interface Vlan4
 ip address 10.120.4.2 255.255.255.0
 glbp 1 ip 10.120.4.1
 glbp 1 timers msec 250 msec 750
 glbp 1 priority 150
 glbp 1 preempt
 glbp 1 preempt delay minimum 180

interface Vlan4
 ip address 10.120.4.1 255.255.255.0
 vrrp 1 description Master VRRP
 vrrp 1 ip 10.120.4.1
 vrrp 1 timers advertise msec 250
 vrrp 1 preempt delay minimum 180

HSRP preemption: why it is desirable
• Spanning tree root and HSRP primary aligned
• When spanning tree root is re-introduced, traffic will take a two-hop path to HSRP active
• HSRP preemption will allow HSRP to follow the spanning tree topology

FHRP design considerations
Preempt delay needs to be longer than boot time
• HSRP is not always aware of the status of the entire switch and network
• Ensure that you provide enough time for the entire system to be up: diagnostics (full or partial), L1 (line cards), L2 (STP), L3 (IGP convergence)
• Tune delay and preempt delay conservatively as the network is already forwarding data

interface Vlan402
 . . .
 standby delay minimum 60 reload 600
 standby 1 ip 10.147.102.1
 standby 1 timers msec 250 msec 750
 standby 1 priority 110
 standby 1 preempt delay minimum 60 reload 600
 standby 1 authentication ese
 standby 1 name HSRP-Voice
 hold-queue 2048 in

'standby delay' controls how long the interface needs to be up before HSRP starts, and 'preempt delay' controls how long to wait after HSRP establishes a neighbor relationship. You should configure both.
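The rule that preempt delay must exceed boot time can be turned into a quick sanity check. The sketch below sums per-phase startup times (diagnostics, line cards, STP, IGP) and applies a safety margin before comparing against the configured `reload` value. All phase durations, the 1.5x margin, and the function name are hypothetical placeholders; real values have to be measured on the platform in question.

```python
import math


def min_preempt_delay(phase_seconds: dict[str, int], margin: float = 1.5) -> int:
    """Smallest preempt delay (s) covering all boot phases with an assumed safety margin."""
    return math.ceil(sum(phase_seconds.values()) * margin)


# Hypothetical measurements for one distribution switch, in seconds.
boot_phases = {
    "diagnostics": 180,
    "line_cards_l1": 120,
    "stp_l2": 30,
    "igp_l3": 60,
}

configured_reload_delay = 600  # from 'standby 1 preempt delay minimum 60 reload 600'
required = min_preempt_delay(boot_phases)
status = "sufficient" if configured_reload_delay >= required else "too short"
print(f"required >= {required} s, configured {configured_reload_delay} s: {status}")
```

Erring on the long side costs little here, because the peer is already forwarding traffic while the rebooted switch finishes coming up.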
Sub-second timer considerations
HSRP, GLBP, OSPF, PIM
• Evaluate your network before implementing any sub-second timers
• Certain events can impact the ability of the switch to process sub-second timers
  • Application of large ACL
  • OIR of line cards in Catalyst 6500/6800
• The volume of control plane traffic can also impact the ability to process
  • 250 / 750 msec GLBP and HSRP timers are only valid in designs with fewer than 150 VLAN instances (Catalyst 6x00 in the distribution)
  • Spanning Tree size

FHRP design considerations: asymmetric routing (unicast flooding)
• Alternating HSRP Active between distribution switches can be used for upstream load balancing
• This can cause a problem with unicast flooding
• ARP timer defaults to four hours and CAM timer defaults to five minutes
• ARP entry is valid, but no matching L2 CAM table entry exists
• In many cases when the HSRP standby needs to forward a frame, it will have to unicast flood the frame since its CAM table is empty

FHRP design considerations: asymmetric routing (unicast flooding) solutions
• Using a 'V' based design with unique voice and data VLANs per access switch, this problem has no user impact
• Don't deploy stacking switches (i.e. daisy-chained switches) that depend on spanning tree for managing interconnects in the stack
• Tune the ARP timer to 270 seconds and leave the CAM timer at its default, unless there are more than 10,000 ARP entries, in which case change the CAM timers instead
• Deploy MultiChassis EtherChannel (with VSS or vPC) in the distribution block
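The unicast flooding problem comes from the mismatch between the two timers: once a CAM entry ages out, the ARP entry can keep the switch forwarding toward a MAC address it no longer has a port for. A small sketch, assuming the worst case where the switch sees no further frames from the host after the CAM entry is learned, shows why tuning ARP below the CAM timer closes the window. The function name is illustrative; the timer values are the defaults and the tuned value quoted on the slides.

```python
def flood_window_seconds(arp_timeout_s: int, cam_aging_s: int) -> int:
    """Worst-case time an ARP entry outlives its CAM entry, during which frames are flooded."""
    return max(0, arp_timeout_s - cam_aging_s)


DEFAULT_ARP_S = 4 * 60 * 60  # ARP timer default: four hours
DEFAULT_CAM_S = 5 * 60       # CAM timer default: five minutes
TUNED_ARP_S = 270            # recommended ARP timer from the slide

default_window = flood_window_seconds(DEFAULT_ARP_S, DEFAULT_CAM_S)
tuned_window = flood_window_seconds(TUNED_ARP_S, DEFAULT_CAM_S)

print(f"default timers: up to {default_window} s of possible flooding")
print(f"ARP tuned to {TUNED_ARP_S} s: {tuned_window} s")
```

With ARP at 270 seconds, the ARP refresh happens before the 300 second CAM aging expires, so the reply from the host re-learns the CAM entry in time.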
Even with faster convergence from RPVST+ we still have to wait for FHRP convergence
• FHRP protocol based forwarding topologies
  • Load balancing based on per-port or per-VLAN
• Protocol-based fault detection and recovery
  • Recommended to configure per-VLAN aggressive timers to keep user experience impact within a <1 second boundary
  • Limited network scale for system reliability
  • Sub-second protocol timers must be avoided on SSO capable networks
[Figure: HSRP on SVI with aggressive timers, convergence time in msec (0-1000 scale) for 6500-Sup2T and 4500-Sup7E, with FHRP Active and FHRP Standby distribution switches]
HSRP config:
interface Vlan2
 ip address 10.120.2.2 255.255.255.0
 standby 1 ip 10.120.2.1
 standby 1 timers msec 250 msec 750
 standby 1 priority 150
 standby 1 preempt
 standby 1 preempt delay minimum 180

Multilayer campus network design: it is a good solid design, but…
• Utilizes multiple control protocols
  • Spanning tree (802.1w), HSRP / GLBP, EIGRP, OSPF
• Convergence is dependent on multiple factors
  • FHRP: 900 msec to 9 seconds
  • Spanning tree: up to 50 seconds
• Load balancing
  • Asymmetric forwarding
  • HSRP / VRRP: per subnet
  • GLBP: per host
• Unicast flooding in looped design
• STP, if it breaks badly, has no inherent mechanism to stop the loop
[Figure: convergence time in seconds: looped PVST+ (no RPVST+) 50; non-looped, default FHRP 9.1; non-looped, sub-second FHRP 0.91]

Campus wired LAN design
Option 1: Traditional multilayer campus (BRKCRS-2031)
Logical topology: L3: core/dist. L2: dist./acc.
• Common design since the 1990s
• Complex configurations (prone to human error) related to spanning-tree, load balancing, unicast and multicast routing
• Requires heavy performance tuning resulting from reliance on FHRPs (HSRP, VRRP, GLBP)
Physical topology: 2 core 2 dist./acc.
[Table: design rated against these criteria: survives device and link failures; easy mitigation of Layer 2 looping concerns; rapid detection/recovery from failures; Layer 2 across all access blocks within distribution; device-level CLI configuration simplicity; automated network and policy provisioning included]

Transforming multilayer campus
Before: Layer 3 distribution with Layer 2 access
[Figure: IGP running at Layer 3 between core and distribution, Layer 2 from distribution to access]

Simplification with routed access design
After: Layer 3 distribution with Layer 3 access
[Figure: IGP running at Layer 3 down to the access switches, Layer 2 only at the edge]
• Move the Layer 2 / 3 demarcation to the network edge
• Leverages Layer 2 only on the access ports, but builds a Layer 2 loop-free network
• Design motivations: simplified control plane, ease of troubleshooting, highest availability

Routed access advantages
Simplified control plane
• No STP feature placement (root bridge, loopguard, …)
• No default gateway redundancy setup/tuning (HSRP, VRRP, GLBP ...)
• No matching of STP/HSRP priority
• No asymmetric flooding
• No L2/L3 multicast topology inconsistencies
• No trunking configuration required
L2 port edge features still apply:
• Spanning Tree PortFast
• Spanning Tree BPDU Guard
• Port Security, DHCP Snooping, DAI, IPSG
• Storm Control
Routed access advantages
Simplified network recovery
• Routed access network recovery is dependent on L3 re-route
• Time to restore upstream traffic flows is based on ECMP re-route
  • Time to detect link failure
  • Process the removal of the lost routes from the SW RIB
  • Update the HW FIB
• Time to restore downstream flows is based on a routing protocol re-route
  • Time to detect link failure
  • Time to determine new route
  • Process the update for the SW RIB
  • Update the HW FIB

Routed access advantages
Faster convergence times
• RPVST+ convergence times dependent on FHRP tuning
  • Proper design and tuning can achieve sub-second times
• EIGRP converges <200 msec
• OSPF converges <200 msec with LSA and SPF tuning
[Figure: upstream convergence time in seconds (0-2 scale) for RPVST+ with FHRP, OSPF, and EIGRP]

Routed access advantages
A single router per subnet: simplified multicast
• Layer 2 access has two multicast routers per access subnet, RPF checks and split roles between routers
• Routed access has a single multicast router, which simplifies multicast topology and avoids the RPF check altogether

Routed access advantages
Ease of troubleshooting
• Routing troubleshooting tools
  • Consistent troubleshooting: access, dist, core
  • show ip route / show ip cef
  • Traceroute
  • Ping and extended pings
  • Extensive protocol debugs
  • IP SLA from the access layer
• Failure differences
  • Routed topologies fail closed, i.e. neighbor loss
  • Layer 2 topologies fail open, i.e. broadcast and unknowns flooded
switch#sh ip cef 192.168.0.0
192.168.0.0/24
 nexthop 192.168.1.6 TenGigabitEthernet9/4

Why isn't routed access deployed everywhere?
Routed access design constraints
• VLANs don't span across multiple wiring closet switches/switch stacks
  • Does this impact your requirements?
• IP addressing changes: more DHCP scopes and subnets of smaller sizes increase management and operational complexity
• Deployed access platforms must be able to support routing features
[Figure: L3 links from distribution down to each access switch]

Campus wired LAN design
Option 2: Layer 3 routed access (BRKCRS-3036)
Logical topology: L3: everywhere L2: edge only
• Complexity reduced for Layer 2 (STP, trunks, etc.)
• Elimination of FHRP and associated timer tuning
• Requires more Layer 3 subnet planning; might not support Layer 2 adjacency requirements
Physical topology: 2 core 2 dist./acc.
[Table: design rated against these criteria: survives device and link failures; easy mitigation of Layer 2 looping concerns; rapid detection/recovery from failures; Layer 2 across all access blocks within distribution; device-level CLI configuration simplicity; automated network and policy provisioning included]

Traditional multilayer campus design
[Figure]

Simplified end-to-end VSS design
[Figure: VSS pairs from campus to data center]

Comparison: standalone (multilayer) versus VSS
[Figure]
Unified system architecture
[Figure]

Catalyst VSS setup
LAN distribution layer
[Figure: Switch 1 and Switch 2 joined by the VSL]

1) Prepare standalone switches for VSS
Switch 1:
Router#conf t
Router(config)# hostname VSS-Sw1
VSS-Sw1(config)#switch virtual domain 100
VSS-Sw1(config-vs-domain)# switch 1
Switch 2:
Router#conf t
Router(config)# hostname VSS-Sw2
VSS-Sw2(config)#switch virtual domain 100
VSS-Sw2(config-vs-domain)# switch 2

2) Configure Virtual Switch Link
Switch 1:
VSS-Sw1(config)#interface port-channel 63
VSS-Sw1(config-if)#switch virtual link 1
VSS-Sw1(config)#interface range tengigabit 5/4-5
VSS-Sw1(config-if)#channel-group 63 mode on
VSS-Sw1(config-if)#no shutdown
Switch 2:
VSS-Sw2(config)#interface port-channel 64
VSS-Sw2(config-if)#switch virtual link 2
VSS-Sw2(config)#interface range tengigabit 5/4-5
VSS-Sw2(config-if)#channel-group 64 mode on
VSS-Sw2(config-if)#no shutdown

3) Validate Virtual Switch Link operation
VSS-Sw1# show etherchannel 63 ports
AND
VSS-Sw2# show etherchannel 64 ports
Ports in the group:
-------------------
Port: Te5/4    Port state = Up Mstr In-Bndl
Port: Te5/5    Port state = Up Mstr In-Bndl

Catalyst VSS setup
LAN distribution layer

4) Enable virtual mode operation (Switch 1)
VSS-Sw1# switch convert mode virtual
Do you want to proceed? (yes/no) yes
• The switch now renumbers from y/z to x/y/z
• When process is complete, save configuration when prompted, switch reloads and forms VSS.

5) Verify operation and rename switch
VSS-Sw1# show switch virtual redundancy
• Check for both switches visible, Supervisors in SSO mode, second Supervisor in Standby-hot status
VSS-Sw1(config)# hostname VSS
VSS(config)#

4) Enable virtual mode operation (Switch 2)
VSS-Sw2# switch convert mode virtual
Do you want to proceed?
(yes/no) yes
• The switch now renumbers from y/z to x/y/z
• When process is complete, save configuration when prompted, switch reloads and forms VSS.

6) Configure dual-active detection
• Connect a Gigabit link between the VSS switches
VSS(config)# switch virtual domain 100
VSS(config-vs-domain)# dual-active detection fast-hello
VSS(config)# interface range gigabit1/1/24, gigabit2/1/24
VSS(config-if-range)# dual-active fast-hello
VSS(config-if-range)# no shut
*Feb 25 14:28:39.294: %VSDA-SW2_SPSTBY-5-LINK_UP: Interface Gi2/1/24 is now dual-active detection capable
*Feb 25 14:28:39.323: %VSDA-SW1_SP-5-LINK_UP: Interface Gi1/1/24 is now dual-active detection capable

7) Configure the system virtual MAC address
VSS(config)# switch virtual domain 100
VSS(config-vs-domain)# mac-address use-virtual
Configured Router mac address is different from operational value. Change will take effect after config is saved and the entire Virtual Switching System (Active and Standby) is reloaded.

BRKCRS-3035: Advanced Enterprise Campus Design: Virtual Switching System (VSS)

"Is there an easier way to enable VSS?"
Use Easy VSS to configure from a single console port
Prerequisites:
• Switches running same software with feature support (C4K:3.6E, C6K:15.2(1)SY1)
• Links to be used for VSLs up with CDP communication

1) C6K - Enable Easy VSS feature, convert, and reload
VSS-Sw1# switch virtual easy
VSS-Sw1# switch convert mode easy links ?
Local Interface    Remote Interface    Hostname
TenGigabit3/4      TenGigabit3/4       VSS-Sw2
TenGigabit4/4      TenGigabit4/4       VSS-Sw2
VSS-Sw1# switch convert mode easy links T3/4 T4/4 domain 100
VSS-Sw1(config)# switch virtual domain 100
VSS-Sw1(config-vs-domain)# mac-address use-virtual
VSS-Sw1# copy running-config startup-config
VSS-Sw1# reload

2) Verify operation and rename switch
VSS-Sw1# show switch virtual redundancy
• Check for both switches visible, Supervisors in SSO mode, second Supervisor in Standby-hot status
VSS-Sw1(config)# hostname VSS
VSS(config)#
[Figure: VSS formed from VSS-Sw1 and VSS-Sw2 joined by the VSL]

3) Configure dual-active detection
• Connect a Gigabit link between the VSS switches
VSS(config)# switch virtual domain 100
VSS(config-vs-domain)# dual-active detection fast-hello
VSS(config)# interface range gigabit1/1/24, gigabit2/1/24
VSS(config-if-range)# dual-active fast-hello
VSS(config-if-range)# no shut

VSS dual supervisor inter-chassis redundancy
• VSS dual supervisor (single sup per chassis) supports inter-chassis SSO redundancy.
• Single in-chassis supervisor: SSO Active or Standby role.
• Stateful SSO synchronization and redundancy between virtual-switches
• Single supervisor system design
  • Supervisor switchover requires chassis reset, including all linecard and service modules
  • Network capacity reduced until system returns to operational state
• Consistent redundancy design between modular Catalyst 6500E/6800/4500E and fixed Catalyst 4500X/3850 systems
[Figure: Active and Standby chassis joined by the VSL; NSF recovery with reduced capacity while the failed chassis resets]
Catalyst quad-supervisor NSF/SSO redundancy
• Dual in-chassis supervisors, each in different redundancy modes
  • In-chassis Active Supervisor (ICA): SSO Active OR Standby-Hot (switchover target)
  • In-chassis Standby Supervisor (ICS): Standby-Hot (Chassis)
• VSS Quad-Sup protects network availability and capacity with dual redundancy domains: between chassis and within chassis
• Stateful SSO synchronization between multiple redundancy domains
• Complete system configuration and parameters synchronization
• Catalyst 6x00 with Sup6T or Sup2T pairs, Catalyst 4500E with Sup8E, 7E, and 7L-E
[Figure: inter-chassis sup redundancy across the VSL. SW1 intra-chassis: ICA SSO Active, ICS STANDBY-HOT (Chassis). SW2 intra-chassis: ICA SSO Standby, ICS STANDBY-HOT (Chassis)]

6500-VS4O#show switch virtual redundancy | inc Switch|Mode|Current|Fabric
My Switch Id = 1
Peer Switch Id = 2
Configured Redundancy Mode = sso
Operating Redundancy Mode = sso
Switch 1 Slot 6 Processor Information :
Current Software state = ACTIVE
Fabric State = ACTIVE
Switch 1 Slot 5 Processor Information :
Current Software state = STANDBY HOT (CHASSIS)
Fabric State = ACTIVE
Switch 2 Slot 6 Processor Information :
Current Software state = STANDBY HOT (switchover target)
Fabric State = ACTIVE
Switch 2 Slot 5 Processor Information :
Current Software state = STANDBY HOT (CHASSIS)
Fabric State = ACTIVE
Understanding Virtual Switch Link
• Inter-chassis system link
  • No network protocol operations
  • Invisible in network topology
  • Transparent to network-level troubleshooting
• VSL control link
  • Carries all system internal control traffic
  • Single member-link; dynamic election during boot
  • Shared interface for network/data traffic
  • < 50 msec switchover to pre-determined VSL path
• Payload overhead
  • Every single packet is encapsulated with a Virtual Switch Header (VSH)
  • Non-bridgeable and non-routable
  • VSL must be directly connected between two virtual switch systems

Figure: control link at each end of the VSL; VSL frame format VSH | L2 | L3 | Payload | CRC.

4500E-VSS#show switch virtual link
Executing the command on VSS member switch role = VSS Active, id = 1
VSL Status : UP
VSL Uptime : 1 day, 1 hour, 16 minutes
VSL Control Link : Te1/3/1
Executing the command on VSS member switch role = VSS Standby, id = 2
VSL Status : UP
VSL Uptime : 1 day, 1 hour, 17 minutes
VSL Control Link : Te2/3/1

6500E/6800/4500E VSS dual sup – VSL design
Two Cisco recommended designs

Profile 1 – Two VSL links on Supervisor
• Cost-effective solution to leverage both uplinks. Continue to use non-VSL capable linecards for 10G core connections.
• Redundant fibers connect through common fabric and ASICs, which could leave system stability vulnerable.
• Optimal and preset VSL parameters – Load-Balancing, QoS, HA, Traffic-engg, Dual-Active etc.
• Restricted to bundling 2 x VSL ports, or 20G switching capacity, per virtual-switch node.

Profile 2 – Diversified VSL between Supervisor and VSL capable Linecard
• Redundant and diversified fibers between supervisor and next-gen VSL capable linecards.
• Same design as Profile 1, but increases system reliability because the VSL ports are diversified across different fabric/ASICs.
• Optimal and preset VSL parameters – Load-Balancing, QoS, HA, Traffic-engg, Dual-Active etc.
• Flexible to scale up to 8 x VSL for high-density systems to aggregate uplinks, service modules, single-homed devices etc.

6500E/6800 VSS quad-supervisor VSL design
Sup2T/6T quad-supervisor NSF/SSO VSL redundancy
RPR-WARM
• Same design as Profile 1 for dual sup
• Flexible to increase VSL capacity
• Continue to leverage existing non-VSL 10G linecards for uplink connections
• Retains all original VSL benefits
• Vulnerable design during any supervisor self-recovery fault incident

Figure: SW1 (Sup-1, Sup-3) and SW2 (Sup-2, Sup-4) joined by the VSL.

6500E/6800 VSS quad-supervisor VSL design
Sup2T/6T quad-supervisor NSF/SSO VSL redundancy
SSO advantage
Recommended: Full-Mesh VSL on Quad-Sup
• Highly redundant and cost-effective VSL design.
• Increases overall VSL capacity
• Maintains 20G VSL capacity during supervisor failure.
• Increases network reliability by minimizing the dual-active probability

4500X VSS – VSL network design
• Fixed switch hardware architecture –
  • 24 or 48 10G/1G front panel ports
  • 8 port 1G/10G Pluggable Uplink Module
• Any ports can be bundled into VSL EtherChannel.
• Recommended to use front-panel ports to build VSL connections.
  Minimizes system instability during accidental uplink module OIR/reset
• Split VSL member-link interfaces to different internal ASIC groups :

ASIC Group                4500X – 16 Port    4500X – 32 Port
                          (ASIC to port mapping)
Internal Stub ASIC – 1    1–8                1–8
Internal Stub ASIC – 2    9–16               9–16
Internal Stub ASIC – 3    N/A                17–24
Internal Stub ASIC – 4    N/A                25–32

• Consistent software design and VSL function as 4500E

Figure: 4500-X SW-1 and SW-2 joined by a VSL built from front-panel ports Ten1/1/1 to Ten2/1/1 and Ten1/1/9 to Ten2/1/9 (front / uplink ports).

Cisco Catalyst platforms and transitions
Where is VSS?

Figure: Catalyst platform transitions across access switching and backbone switching. Platforms shown: Cisco Catalyst 2960X/XR Series, 3850 copper, 4500E Series, 3850F/4500-X, 6840-X/6880-X, and Cisco Catalyst 9200, 9300, 9400, and 9500 Series.

“How can I simplify my distribution without VSS?”
StackWise Virtual
• Fixed switch hardware architecture with distributed forwarding architecture
• First available on WS-3850-48XS
• Available on Catalyst 3850-24XS, 3850-12XS, 9500-16X, 9500-40X, 9500-12Q, 9500-24Q, 9500-48Y4C, 9500-24Y4C, 9500-32QC, 9500-32C, 9404R/9407R Sup1/Sup1-XL (check software release notes for versions and additional hardware)
• StackWise Virtual Link between two nodes (10Gb or 40Gb)
• Both StackWise Virtual members must have consistent Cisco IOS-XE and license

Figure: StackWise Virtual pair of two WS-3850-48XS switches at the distribution layer, joined by the SVL and a Fast Hello link, with the access layer below.
Cisco StackWise Virtual (SWV) setup
LAN distribution layer (3850-D1 and 3850-D2, joined by the SVL)

1) Prepare standalone switches for SWV
3850-D1#conf t
3850-D1(config)# stackwise-virtual
3850-D1(config-stackwise-vir)# domain <1-255>
3850-D2#conf t
3850-D2(config)# stackwise-virtual
3850-D2(config-stackwise-vir)# domain <1-255>

2) Configure StackWise Virtual links
*Automatically creates EtherChannel (128)
3850-D1(config)# interface range FortyG x/y/z – x/y/z
3850-D1(config-if)# stackwise-virtual link 1
3850-D2(config)# interface range FortyG x/y/z – x/y/z
3850-D2(config-if)# stackwise-virtual link 1

3) Configure dual-active detection (fast hello)
3850-D1(config)# interface range TenG x/y/z – x/y/z
3850-D1(config-if)# stackwise-virtual dual-active-detection
3850-D2(config)# interface range TenG x/y/z – x/y/z
3850-D2(config-if)# stackwise-virtual dual-active-detection

4) Save and reload to convert
3850-D1# copy run start
3850-D1# reload
3850-D2# copy run start
3850-D2# reload

Note: Maximum of 8 SVL member links and 4 dual active detection links

Virtual Switch Link capacity planning
• Plan VSL capacity to reduce congestion points and to handle failures and specific configurations
• Supported VSL interface types :
  • Catalyst 6500E/6800 : 10G and 40G
  • Catalyst 4500E/4500X : 1G and 10G
  • Catalyst 3850 : 1G, 10G, and 40G
• Four major factors :
  • Total uplink bandwidth per chassis.
    Ability to handle data re-route during uplink failures without network congestion
  • Handling egress data to single-homed devices (non-recommended design)
  • Catalyst 6500E/6800 services module integration may require centralized forwarding on remote chassis
  • Remote network services such as SPAN
• Up to 8 member-links supported in a VSL EtherChannel. (Implement in powers of 2 for optimal forwarding decisions)

Figure labels: VSL, Analyzer.

VSS – single-homed connections
• Independent of system mode (VSS or Standalone), single-homed connections are not recommended
• Cannot leverage any distributed VSS architecture benefits.
• Non-congruent Layer 2 or Layer 3 network design with –
  • Centralized network control-plane processing over VSL
  • Asymmetric forwarding plane. Ingress data may traverse the VSL interface and oversubscribe the ports
• Single point of failure in various faults – Link/SFP/module failure, SSO switchover, ISSU etc.
• Cannot be a trusted switch for dual-active detection purposes

Figure: SW-1 (ACTIVE) and SW-2 (HOT-STANDBY) joined by the VSL, with single-homed access switches A1 and A2.

VSS – multi-homed physical connections
• Redundant network paths per system deliver the best architectural approach
• Parallel Layer 2 paths between bridges build a sub-optimal topology :
  • Creates STP loop. Except for the root port, all other ports are in blocking mode
  • Slow network convergence
• Parallel Layer 3 paths double the control-plane processing load :
  • ACTIVE switch needs to handle the control plane load of local and remote-chassis interfaces
  • Multiple unicast and multicast neighbor adjacencies
  • Redundant routing and forwarding topologies

Figure: SW-1 (ACTIVE) and SW-2 (HOT-STANDBY) joined by the VSL, with multi-homed access switches A1 and A2.
VSS – Multichassis EtherChannel
• MEC enables:
  • Simplified STP loop-free network topology
  • Consistent L3 control-plane and network design as traditional Standalone mode system
  • Deterministic sub-second network recovery
• MECs can be deployed in two modes – Layer 2 or Layer 3
• MEC scalability support varies on system basis –
  • Catalyst 6500E supports 512 L2/L3 MEC
  • Catalyst 4500E and 4500X support 256 L2 MEC
  • Catalyst 3850-48XS supports 127 L2/L3 MEC

Figure: SW-1 (ACTIVE) and SW-2 (HOT-STANDBY) joined by the VSL, with access switches A1 and A2 attached through MEC.

Simplified STP network topology with VSS
• VSS simplifies STP. VSS does not eliminate STP. Never disable STP.
• Multiple parallel Layer 2 network paths build an STP loop network
• VSS with MEC builds a single loop-free network that utilizes all available links.
• Distributed EtherChannel minimizes STP complexities compared to standalone distribution design
• STP toolkit should be deployed to safeguard the multilayer network

Figure labels: STP BLK Port; Loop-free L2 EtherChannel.

Traditional distribution design
Redundant design with sub-optimal topology and complex operation

Stabilize network topology with several L2 features:
• STP Primary and Backup Root Bridge
• Rootguard
• Loopguard or Bridge Assurance
• STP Edge Protection

Protocol restricted forwarding topology
• STP FWD/ALT/BLK Port
• Single Active FHRP Gateway
• Asymmetric forwarding
• Unicast Flood

Protocol dependent driven network recovery:
• PVST/RPVST+ and FHRP Tuning
Resiliency versus performance/scale tradeoff: HSRP
• Multichassis EtherChannel based forwarding topologies
• Per-Flow Load Balancing based on Layer 2 to Layer 4 + VLANs

interface Vlan2
 ip address 10.120.2.2 255.255.255.0
 standby 1 ip 10.120.2.1
 standby 1 timers msec 250 msec 750
 standby 1 priority 150
 standby 1 preempt
 standby 1 preempt delay minimum 180

Figure: FHRP Active and FHRP Standby distribution switches. Chart: SVI - Aggressive Time versus Convergence (msec), 0 to 1000, for 6500-Sup2T and 4500-Sup7E.

Resiliency versus performance/scale tradeoff: VSS
• Multichassis EtherChannel based forwarding topologies
• Per-Flow Load Balancing based on Layer 2 to Layer 4 + VLANs
• Hardware-Based Fault Detection and Recovery
• Deterministic network convergence with a simple approach
• Increases Network Scale for system reliability
• No reliability compromise to enable path and system-level Quad-Sup redundancy

Chart: Multilayer VSS Network Scale And Convergence. Series: SVI - Aggressive Time, SVI (Validated Limit), Convergence (msec); scale 0 to 1000; platforms 6500-Sup2T and 4500-Sup7E. Figure label: VSS-SW1.

PIM timers also need tuning
• Multicast recovery depends on PIM DR failure detection
• PIM routers exchange PIM expiration time in the query message
• DR Failure Detection: ~90 seconds (30 sec. hello * 3 multiplier)
• Tune the PIM query interval to sub-second, as with FHRP, for faster multicast convergence
• Sub-second protocol timers must be avoided on SSO capable networks

interface Vlan2
 ip pim sparse-mode
 ip pim query-interval 250 msec

Figure label: PIM DR in Layer 2 network.
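The failure-detection arithmetic quoted on the PIM timer slide can be sketched in a few lines. This is an illustrative calculation only: it assumes detection time is simply the query (hello) interval multiplied by the multiplier of 3 given on the slide, and the function name is hypothetical.

```python
def detection_time_ms(hello_interval_ms, multiplier=3):
    """Time before a silent neighbor is declared down (interval x multiplier)."""
    return hello_interval_ms * multiplier


# Default PIM query interval from the slide: 30 seconds.
default_pim = detection_time_ms(30_000)
# Tuned value from the slide: ip pim query-interval 250 msec.
tuned_pim = detection_time_ms(250)

print(f"default PIM DR failure detection: {default_pim / 1000:.0f} s")
print(f"tuned PIM DR failure detection:   {tuned_pim} ms")
```

Under that assumption the tuned value lines up with the 750 msec HSRP hold time in the `standby 1 timers msec 250 msec 750` example, which is the point of tuning both protocols together in a standalone design.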
Simplified and robust multicast network design using VSS
• Single PIM DR system in Layer 2 network to process IGMP from host receivers
• Doubles multicast forwarding performance across all Multichassis EtherChannel member links
• Optimize multicast network with PIM stub configuration
• Rapid, deterministic and simple multicast design
• Hardware based sub-second fault detection and recovery. Eliminates aggressive timer requirement and improves system performance and scalability

interface Vlan2
 ip pim passive

Figure label: VSS-SW1 PIM-DR.

Multichassis EtherChannel load sharing
• The MEC hash algorithm is computed independently by each virtual-switch to load share across its local physical ports.
• The 8-bit computation on each member link of an MEC is done independently on a per virtual-switch node basis.
• The recommended total number of member links bundled in a single MEC remains consistent with the single chassis EtherChannel section.
• Recommendation: deploy EtherChannel in a power-of-2 (2^n) ratio, evenly distributed to each virtual-switch, for the best load-sharing result.

Per Switch MEC Flow Distribution Matrix

Member Links   Port1   Port2   Port3   Port4   Port5   Port6   Port7   Port8
               Bit     Bit     Bit     Bit     Bit     Bit     Bit     Bit
1              8       X       X       X       X       X       X       X
2              4       4       X       X       X       X       X       X
3              3       3       2       X       X       X       X       X
4              2       2       2       2       X       X       X       X
5              2       2       2       1       1       X       X       X
6              2       2       1       1       1       1       X       X
7              2       1       1       1       1       1       1       X
8              1       1       1       1       1       1       1       1
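The flow distribution matrix above follows a simple pattern: eight hash buckets are spread across the local member links, and any remainder goes to the lowest-numbered ports. The sketch below reproduces the table rows under that assumption; the function name is hypothetical and this is not a description of the actual hardware implementation.

```python
def bucket_distribution(member_links, buckets=8):
    """Split `buckets` hash buckets across `member_links` ports.

    Each port gets buckets // member_links; the first
    buckets % member_links ports get one extra bucket.
    """
    if not 1 <= member_links <= buckets:
        raise ValueError("member_links must be between 1 and the bucket count")
    base, extra = divmod(buckets, member_links)
    return [base + 1 if port < extra else base for port in range(member_links)]


for links in range(1, 9):
    shares = bucket_distribution(links)
    even = "even" if len(set(shares)) == 1 else "uneven"
    print(f"{links} link(s): {shares} ({even})")
```

Only 1, 2, 4, and 8 member links produce equal shares, which is consistent with the recommendation to bundle links in powers of 2 per virtual-switch.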
Optimize EtherChannel load balancing
• Load share egress data traffic based on input hash
• Optimal load sharing results with :
  • Bucket-based load-sharing – Bundle member-links in power-of-2 (2/4/8)
  • Multiple variation of input for hash (L2 to L4)
• Recommended algorithm * :
  • Access – Src/Dst IP
  • 6500E/6800 Dist/Core – Src/Dst IP + Src/Dst L4 Ports
  • 4500E / 4500X Dist – Src/Dst IP
* May vary based on your network traffic pattern

Layer     Default            Recommended
Core      src-dst-ip vlan    src-dst-mixed-ip-port
Dist      src-dst-ip vlan    src-dst-mixed-ip-port vlan
Access    src-mac            src-dst-ip

Summary: Multichassis EtherChannel performs better in any network design
• Network recovery mechanism varies between distribution designs –
  • Standalone – protocol and timer dependent
  • VSS – hardware dependent
• VSS logical distribution system –
  • Single P2P STP Topology
  • Single Layer 3 gateway
  • Single PIM DR system
• Distributed and synchronized forwarding table – MAC address, ARP cache, IGMP
• All links are fully utilized based on EtherChannel load balancing

Chart: Convergence (sec), 0 to 1, upstream and downstream, for L2-FHRP, L2-MEC, and Multicast.

VSS-enabled campus core design
• Extend VSS architectural benefits to the campus core layer network
• A VSS-enabled core increases capacity, optimizes network topologies and simplifies system operations
• Key VSS-enabled core best practices :
  • Protect network availability and capacity with Catalyst 6800 Sup6T Quad-Sup NSF/SSO
  • Simplify network topology and routing database with single MEC
  • Leverage self-engineered VSS and MEC capabilities for deterministic network fault detection and recovery

Figure label: Data Center.
VSS core network design alternatives

Figure: four alternative core topologies, each built from SW1 and SW2 joined by a VSL.

Catalyst 6500/6800 VSS-enabled campus design
ECMP forwarding table construction
• ACTIVE switch responsible for:
  • Construct two software tables : Routing Information Base (RIB) and Forwarding Information Base (FIB)
  • Synchronize software FIB tables to local and remote chassis supervisor and network modules
• ECMP forwarding also favors locally attached interfaces
• Hardware FIB inserts entries for ECMP routes using locally attached links
• If all local links fail, the FIB is programmed to forward across the VSL link as last resort

Figure: SW1 (ACTIVE) and SW2 (HOT_STANDBY) with uplinks T1/2/1, T1/2/1, T2/2/1, T2/2/2 and port-channels Po1 and Po2; unicast and multicast forwarding paths.
• Unicast ECMP Software RIB (System-Wide): Four ECMP RIB Entries
• Unicast ECMP Software FIB (System-Wide): Four ECMP FIB Entries
• Unicast ECMP Switch-1 Hardware FIB: Two SW1 HW FIB Entries
• Unicast ECMP Switch-2 Hardware FIB: Two SW2 HW FIB Entries

Summary – optimizing core performance (1/2)
HW Driven Forwarding Topology & High Availability

Figure: unicast and multicast forwarding paths for VSS-Core, Standalone-Core, VSS-Dist, and Standalone-Dist combinations.

Summary – optimizing core performance (2/2)
HW Driven Forwarding Topology & High Availability

Figure: unicast and multicast forwarding paths for Standalone-Core with VSS-Dist and Standalone-Core with Standalone-Dist.
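The local-link preference described on the ECMP forwarding table construction slide (four system-wide ECMP paths, two hardware FIB entries per switch, VSL as last resort) can be modelled as a small selection function. This is a conceptual sketch only: the function, the `VSL` marker, and the interface-to-switch mapping are illustrative assumptions, not how the hardware FIB is actually programmed.

```python
VSL = "VSL"  # placeholder for the inter-chassis link (assumption)


def hardware_fib_next_hops(rib_paths, local_switch, links_up):
    """Pick the ECMP next-hop interfaces one chassis would program.

    rib_paths: dict mapping interface name -> switch id that owns it
    local_switch: id of the chassis whose hardware FIB is being built
    links_up: set of interface names currently up
    """
    usable = [intf for intf in rib_paths if intf in links_up]
    local = [intf for intf in usable if rib_paths[intf] == local_switch]
    if local:
        return local              # prefer locally attached links
    return [VSL] if usable else []  # last resort: cross the VSL


# Four system-wide ECMP paths, two per chassis (illustrative names).
RIB_PATHS = {"T1/2/1": 1, "T1/2/2": 1, "T2/2/1": 2, "T2/2/2": 2}

all_up = set(RIB_PATHS)
print(hardware_fib_next_hops(RIB_PATHS, 1, all_up))
print(hardware_fib_next_hops(RIB_PATHS, 2, all_up))

# Both SW1 uplinks fail: SW1 falls back to the VSL.
sw1_uplinks_down = {"T2/2/1", "T2/2/2"}
print(hardware_fib_next_hops(RIB_PATHS, 1, sw1_uplinks_down))
```

The fallback case is the one that matters for VSL capacity planning: traffic that arrived on SW1 now has to cross the VSL to reach SW2's uplinks.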
Simple core network design delivers deterministic network recovery
• Routing protocol independent network convergence in large scale campus core
• ECMP prefix-independent convergence (PIC) with 6x00 (VSS/standalone) from 12.2(33)SXI2
• Cisco Express Forwarding (CEF) optimization in IOS software.
• Default behavior: no additional configuration or tuning required
• Hardware-based fault detection and recovery in MEC/EC designs

Figure: SW1 (ACTIVE) and SW2 (HOT_STANDBY) with uplinks T1/2/1, T1/2/1, T2/2/1, T2/2/2 and port-channels Po1 and Po2. Chart: Convergence (sec), 0 to 3.5, at scale points from 500 to 25000, for ECMP (W/o PIC), ECMP (With PIC), and MEC.

VSS core simplifies multicast operation, improves performance and redundancy (1/2)
• Standalone core needs anycast MSDP peering for RP redundancy
• ECMP builds a single multicast forwarding path with protocol-based fault detection and recovery

Figure: AnyCast - MSDP between two PIM RPs in the core; single OIL; PIM Join from PIM routers at the distribution layer.

VSS core simplifies multicast operation, improves performance and redundancy (2/2)
• VSS based Catalyst systems enable PIM RP redundancy with resilient technologies
• MEC increases multicast forwarding capacity by utilizing all member-links and provides hardware-based fault detection and recovery

Figure: single logical PIM RP in the core; multiple multicast forwarding paths; single logical PIM interface; single logical OIL; PIM Join from a single logical PIM router at the distribution layer.

Simplified multicast network design delivers deterministic network recovery
• ECMP multicast recovery is mroute scale dependent and could range in seconds.
• MEC/EC multicast recovery is hardware-based; recovery is scale-independent and sub-second

Chart: Convergence (sec), 0 to 6, at scale points 100, 500, 1000, and 5000, for ECMP and MEC/EC.

Implementing non-stop forwarding
• VSS software design is built on NSF/SSO architecture.
• Catalyst 4500E, 4500X and 6500E/6800 deployed in VSS mode must have NSF enabled. No configuration required on NSF Helper systems
• NSF capability must be manually enabled for all Layer 3 routing protocols :
  • EIGRP, OSPF, ISIS, BGP, MPLS etc.
• In a VRF environment, NSF must be manually enabled on each per-VRF IGP instance
• Multicast NSF capability is ON by default

Chart: Inter-Chassis NSF/SSO Recovery Analysis. Convergence (sec), 0 to 16, Without NSF versus With NSF.

Sub-second protocol timers and NSF/SSO
• NSF is intended to provide availability through route convergence avoidance
• Fast IGP timers are intended to provide availability through fast route convergence
• In an NSF environment the dead timer must be greater than:
  • SSO recovery + Routing Protocol restart + time to send first hello
• Recommendation –
  • Do not configure aggressive timers for Layer 2 protocols, e.g. Fast UDLD
  • Do not configure aggressive timers for Layer 3 protocols, e.g. OSPF Fast Hello, BFD etc.
  • Keep all protocol timers at default settings

interface Port-Channel 10
 ip ospf dead-interval minimal hello-multiplier 4

Figure: Core, Dist (VSL), and Access (Catalyst 2K/3K/4K) layers. Charts: Link and Switch Failure Analysis – Default OSPF Timer, and Link Failure Analysis – Aggressive OSPF Timer; upstream and downstream convergence on a 0 to 0.3 scale.

Campus wired LAN design
Option 3: Layer 2 access with “simplified” distribution (BRKCRS-1500)
Logical topology: L3: core/dist. L2: dist./acc.
Leading campus design for easy configuration and operation when using stacking or similar technology (VSS, StackWise Virtual)
• Flexibility to support Layer 2 services within distribution blocks, without FHRPs.
• Easy to scale and manage

Physical topology: 2 core 2 dist./acc.

Design attributes listed on the slide:
• Survives device and link failures
• Easy mitigation of Layer 2 looping concerns
• Rapid detection/recovery from failures
• Layer 2 across all access blocks within distribution
• Device-level CLI configuration simplicity
• Automated network and policy provisioning included

VSS best practices summary (1/2)
• Design each VSS domain with a unique ID
• Configure “mac-address use-virtual” under virtual switch configuration mode
• Select an appropriate VSS capable system that fits network and solution requirements
• Deploy 6500/6800 Quad-sup NSF/SSO for mission-critical networks to protect network availability and capacity
• Do not compromise network foundation baselines. Deploy full-mesh physical connections for redundancy and load sharing across the network
• MEC enables network benefits with VSS. Bundle all physical connections into a single logical connection for simplified and resilient network topologies
• Layer 3 MEC is highly recommended for 4500E/X VSS enabled Campus network
• Always use link bundling protocols – Cisco PAgP or IETF LACP
• Configure “no ip routing protocol purge-interface” to optimize ECMP based network convergence time

VSS best practices summary (2/2)
• Plan and design VSL with appropriate capacity, diversification and redundancy
• Configure “nsf” under L3 routing protocols
• Keep Layer 2 and Layer 3 protocol timers at factory default.
  Do not enable protocols with aggressive timers
• Configure redundant dual active trusted ePAgP neighbors (L2/L3)
• Configure redundant dual active mechanisms, ePAgP and Fast Hello
• Exclude dual active management interface for connectivity and troubleshooting
• Remember that the “reload” command on 6500/6800 resets both virtual-switch chassis, whereas on 4500E/X it resets only the ACTIVE switch. Issue “redundancy reload shelf” on 4500E/X to reload both the ACTIVE and STANDBY systems

Agenda
• Designing High Availability Networks for the Enterprise
• System Hardware and Software Resiliency
• High Availability Architectures:
  • Enterprise Wired LAN
    • Multilayer Campus Distribution and HA Considerations
    • Simplified Distribution and HA Advantages
    • Extending HA Advantages by Simplifying Virtualization
  • Enterprise Data Center
  • Enterprise Wireless LAN
• High Availability System Recovery Analysis

Hop-by-hop network virtualization
Multi-VRF architecture overview
• Two preset network setup:
  • Hop-by-hop network segmentation with logical connection
  • Build control and data-plane over each logical connection

Hop-by-hop network virtualization
Data-plane isolation

Multi-VRF: Campus network design alternatives
Standalone devices

Multi-VRF: Campus network design alternatives
Cisco VSS
M-VRF: Per-hop VPN control plane complexity
ECMP unicast and multicast adjacencies comparison (1 of 4)
• Standalone Design, 10 VRF Sample Design. Each core : 40 Adj

M-VRF: Per-hop VPN control plane complexity
ECMP unicast and multicast adjacencies comparison (2 of 4)
• Standalone Design, 10 VRF Sample Design. Each core : 40 Adj
• VSS Design, 10 VRF Sample Design. VSS core : 0 Adj

M-VRF: Per-hop VPN control plane complexity
ECMP unicast and multicast adjacencies comparison (3 of 4)
• Standalone Design, 10 VRF Sample Design. Each core : 160 Adj
• VSS Design, 10 VRF Sample Design. VSS core : 240 Adj

M-VRF: Per-hop VPN control plane complexity
ECMP unicast and multicast adjacencies comparison (4 of 4)
• Standalone Design, 10 VRF Sample Design. Each core : 480 Adj. Edge : 160 Adj
• VSS Design, 10 VRF Sample Design. VSS core : 880 Adj. Edge : 80 Adj
• Standalone uses a distributed control-plane. VSS uses a centralized control-plane
• Increases control-plane adjacencies by 2X based on network design

Multi-VRF: MEC design simplifies complexity
EC/MEC unicast and multicast adjacencies comparison (1 of 4)
• VSS-ECMP Design, 10 VRF Sample Design. VSS core : 240 Adj

Multi-VRF: MEC design simplifies complexity
EC/MEC unicast and multicast adjacencies comparison (2 of 4)
• VSS-ECMP Design, 10 VRF Sample Design. VSS core : 240 Adj
• VSS-MEC Design, 10 VRF Sample Design. VSS core : 100 Adj
Multi-VRF: MEC design simplifies complexity
EC/MEC unicast and multicast adjacencies comparison (3 of 4)
• VSS-ECMP Design, 10 VRF Sample Design. VSS core : 880 Adj
• VSS-MEC Design, 10 VRF Sample Design

Multi-VRF: MEC design simplifies complexity
EC/MEC unicast and multicast adjacencies comparison (4 of 4)
• VSS-ECMP Design, 10 VRF Sample Design. VSS core : 880 Adj. Edge : 80 Adj
• VSS-MEC Design, 10 VRF Sample Design. VSS core : 260 Adj. Edge : 20 Adj
• Simplify virtualized network design with EC and MEC. Reduces control-plane adjacencies by up to 4X depending on network design
• Hardware driven, scale-independent and deterministic network availability

MPLS-based campus network architecture
Edge and core network design

Figure: IP/MPLS core of LSR/LER nodes connected by LSPs to LER nodes at the distribution layer, with IP below the distribution layer.

Simplified underlay = simplified overlay (before)

Figure: P/PE and PE systems. Legend: VPN PE Management, MP-iBGP PE Systems, MPLS Label Paths, MPLS LDP Adjacencies, VPN Unicast Forwarding Paths (with BGP Multipath).

Simplified underlay = simplified overlay (after)

Figure: a single P/PE system with PE systems below. Legend: VPN PE Management, MP-iBGP PE Systems, MPLS Label Paths, MPLS LDP Adjacencies, VPN Unicast Forwarding Paths (Without BGP Multipath).
MPLS before VSS
• IGP Tuning
• OSPF LSA/SPF Tuning
• BGP Tunings
• MP-iBGP Multipath
• BGP Prefix-Independent Convergence
• MPLS LDP Tuning
• MPLS LDP Session Protection
• BFD
• MPLS TE Link Protection
• MPLS TE Node Protection
• Network/System Redundancy Tradeoff
• Protocol Dependent Recovery
• Control/Management/Forwarding Complexity

Figure: mesh of P/PE systems.

MPLS VSS benefits summary

Items carried over from the “MPLS before VSS” slide:
• IGP Tuning
• OSPF LSA/SPF Tuning
• BGP Tunings
• MP-iBGP Multipath
• BGP Prefix-Independent Convergence
• MPLS LDP Tuning
• MPLS LDP Session Protection
• BFD
• MPLS TE Link Protection
• MPLS TE Node Protection
• Network/System Redundancy Tradeoff
• Protocol Dependent Recovery
• Control/Management/Forwarding Complexity

VSS benefits:
• Scale-independent Recovery
• Network/System Level Redundancy
• Hardware Driven Recovery
• Increase VPN Unicast Capacity
• Increase VPN Multicast Capacity
• Simplified Virtual Network
• Control-plane Simplicity
• Operational Simplicity
• L2-L4 Load Sharing

Figure: a single P/PE system with PE systems below.

Figure: enterprise reference topology (headquarters, regional site, remote sites, teleworker/mobile worker, data center, Internet edge, and WAN aggregation).
What’s different in your network today versus a decade ago? How does it affect availability?
• Mobility: Bring Your Own Device; Devices in the Workspace
• IoT: Auto-detect Non-User Devices; Devices everywhere
• Cyber Security: Networking and Security; Advanced threats

Key Challenges for Traditional Networks
• Difficult to Segment
  • Ever increasing number of users and endpoint types
  • Ever increasing number of VLANs and IP Subnets
• Complex to Manage
  • Multiple steps, user credentials, complex interactions
  • Multiple touch-points
• Slower Issue Resolution
  • Separate user policies for wired and wireless networks
  • Unable to find users when troubleshooting
Traditional Networks Cannot Keep Up!

What if you could do this?
Cisco Software-Defined Access
• Enables:
  • Host mobility
  • Network segmentation
  • Role-based access control
• It is an overlay network to the network underlay
  • Control plane based on LISP
  • Data plane based on VXLAN
  • Policy plane based on TrustSec

Figure: border nodes and edge nodes; logical Layer 2 overlay and logical Layer 3 overlay on top of the physical topology.

Software-Defined Access Design Guide - CVD
https://www.cisco.com/c/dam/en/us/td/docs/solutions/CVD/Campus/CVD-Software-Defined-Access-Design-Sol1dot2-2018DEC.pdf

SD-Access
Why overlays?
Simple Transport Forwarding; Flexible Virtual Services

SD-Access
Types of overlays
Campus wired LAN design
Option 4: Cisco Software-Defined Access (BRKCRS-1501, many others)
L2/L3: flexible overlays
Uses advantages of a routed access physical design, with a Layer 2 capable logical overlay design
• Provisioning and policy automation
• Integrates wireless into the same policy
• Requires automation to simplify configuration
Design attributes listed on the slide:
• Survives device and link failures
• Easy mitigation of Layer 2 looping concerns
• Rapid detection/recovery from failures
• Layer 2 across all access blocks within distribution
• Device-level CLI configuration simplicity
• Automated network and policy provisioning included
[Figure: logical topology options; physical topology: 2 core, 2 dist./acc.]

Cisco DNA Appliance: What about HA?
Outside of the control plane, and has 2+1 clustering capabilities
• DN1-HW-APL
  • Specs: based on UCS M4, 44 cores, 256 GB RAM, 12 TB SSD
  • Scale and performance: 5000 devices (1000 switches/routers/WLC + 4000 APs), 25,000 clients
  • SDA design: Small or Medium
• DN2-HW-APL
  • Specs: based on UCS M5, 44 cores, 256 GB RAM, 16 TB SSD
  • Scale and performance: 5000 devices (1000 switches/routers/WLC + 4000 APs), 25,000 clients
  • SDA design: Small or Medium
• DN2-HW-APL-L
  • Specs: based on UCS M5, 56 cores, 384 GB RAM, 16 TB SSD
  • Scale and performance: 8000 devices (2000 switches/routers/WLC + 6000 APs), 40,000 clients
  • SDA design: Medium or Large
Cisco Public 257 Access Cisco Software-Defined Cisco Live Barcelona - Session Map Tuesday (Jan 29) 08:00-11:00 11:00-13:00 13:00-15:00 Wednesday (Jan 30) 15:00-18:00 08:00-11:00 BRKCRS-2821 SD-Access Integration 11:00-13:00 13:00-15:00 Thursday (Jan 31) 15:00-18:00 08:00-11:00 11:00-13:00 LTRACI-2636 LTRCRS-2810 ACI + SD-Access Lab SD-Access Lab 11:00-13:00 13:00-15:00 15:00-18:00 BRKCRS-2812 BRKCRS-3811 SD-Access Policy BRKCRS-1501 ISE & SD-Access SD-Access Deep Dive 08:00-11:00 SD-Access Migration BRKCRS-1449 BRKCRS-3810 15:00-18:00 SD-Access Scale BRKCLD-2412 BRKCRS-2810 13:00-15:00 Friday (Feb 01) BRKCRS-2825 Cross-Domain Policy SD-Access Solution Missed One? Sessions are available online @CiscoLive.com Validated Design BRKCRS-2815 BRKCRS-2814 BRKARC-2020 Connect SD-Access Sites SD-Access Assurance Troubleshoot SD-Access BRKEWN-2021 SD-Access Demo TECCRS-2001 BRKEWN-2020 SD-Access Wireless © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 258 SD-Access resources Related Sessions Reference Cisco SD-Access - 8H Technical Seminar - TECCRS-3810 • Monday, Jan 28 8:30 AM - 6:45 PM Cisco SD-Access Integration Cisco SD-Access Fabric Cisco SD-Access - A Look Under the Hood - BRKCRS-2810 • Tuesday, Jan 29 11:00 AM - 1:00 PM Cisco SD-Access - Technology Deep Dive - BRKCRS-3810 • Tuesday, Jan 29 2:30 PM - 4:00 PM Cisco SD-Access - Connecting Multiple Sites - BRKCRS-2815 • Wednesday, Jan 30 11:00 AM - 1:00 PM Cisco SD-Access – Assurance and Analytics - BRKCRS-2814 • Wednesday, Jan 30 4:30 PM - 6:00 PM Cisco SD-Access - Troubleshooting the Fabric - BRKARC-2020 • Thursday, Jan 31 2:30 PM - 4:00 PM Cisco SD-Access Campus Cisco Validated Design - BRKCRS-1501 • Friday, Feb 01 9:00 AM - 11:00 AM Cisco SD-Access - Connecting to the DC, Firewall, WAN & More! 
- BRKCRS-2821 • • Thursday, Jan 31 8:30 AM - 10:30 AM Cisco SD-Access - Wireless Integration - BRKEWN-2020 • Friday, Feb 01 Wednesday, Jan 30 2:30 PM - 4:00 PM Cisco SD-Access – Integrating Existing Network - BRKCRS-2812 • Friday, Feb 01 11:30 AM - 1:30 PM Cisco SD-Access Policy Simplifying and Securing the Cisco Digital Network Architecture - BRKCRS-1449 • Tuesday, Jan 29 5:00 PM - 6:30 PM Group-Based Policy for On-Prem, Hybrid & Cloud with Cisco DNA - BRKCLD-2412 • Wednesday, Jan 30 2:30 PM - 4:00 PM Cisco SD-Access - Policy Driven Manageability - BRKCRS-3811 Thursday, Jan 31 2:30 PM - 4:00 PM Cisco SD-Access Labs How to Setup SD-Access Wireless from Scratch - BRKEWN-2021 • 8:30 AM - 10:30 AM Cisco SD-Access - Scaling to Hundreds of Sites - BRKCRS-2825 • Cisco SD-Access Wireless Wednesday, Jan 30 9:00 AM - 11:00 AM Cisco SD-Access & ACI Integration - Hands-on Lab - LTRACI-2636 • Tuesday, Jan 29 2:15 PM - 6:15 PM Cisco SD-Access - Hands-on Lab - LTRCRS-2810 • Wednesday, Jan 30 TECCRS-2001 9:00 AM - 1:00 PM © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 259 Campus wired LAN design options—summary Traditional Multilayer Campus BRKCRS-2031 Layer 3 Routed Access BRKCRS-3036 L2 Access / Simplified Distribution BRKCRS-1500 SD-Access / Fabric for Campus BRKCRS-1501 (and many others) Logical topology Design notes OR Protocols / Tuning L3 Planning Limited L2 Flexible, Easy, Scalable Flexible, Tools to Simplify Physical topology: 2 core 2 dist./acc. On-line library at ciscolive.com TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 260 How do I get there? Successful deployments… …start with a plan. Photos showing Basílica i Temple Expiatori de la Sagrada Família TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. 
High availability wired campus design
Key principles
• Choices when interconnecting devices can affect network availability
• Choose hardware-based detection and recovery mechanisms over software for faster convergence
• EtherChannel and Multichassis EtherChannel are powerful tools for convergence and scale
• Overall design choices (multilayer vs. routed access vs. simplified distribution) require the introduction of supporting protocols that affect network availability
• Simplifying the network and improving network availability improves other services overlaid on that network

Who connected to a wired network today?

… a typical day of a connected life…
[Figure: connectivity alternating between Wi-Fi and LTE over a day: Home, Driving, Office, Walk to lunch, Restaurant, Shopping, Hotspots]

No Wireless == No Network Access

Section Objective
What is the acceptable network downtime?
The goal of this section is to show you how to design and deploy a highly available wireless network to reduce network downtime.

Wireless High Availability concepts
• Good news: all the High Availability concepts and best practices we have seen for wired are applicable to wireless access as well
• Bad news: wireless is not wired
  • Wired: shielded, isolated access
  • Wireless: thin air, no electromagnetic protection (Ch 1, Ch 6, Ch 11)
We use the air to transmit packets, it’s a shared medium, it’s unlicensed… enough?

Agenda
• High Availability (HA), the theory of operations:
  • What to do at the Radio Frequency layer?
  • Controller HA for different Deployment Modes: Centralized (Cloud/non-Cloud), SD-Access, FlexConnect, Mobility Express
• HA Design and Deployment Practices
• Wireless Assurance: proactively monitor your network!
• Key takeaways

RF HA: how to build redundancy at the RF layer?
[Figure: Access Points, Access Switches, Aggregation Switches, Wireless Controller]
• Creating a stable, predictable RF environment (Proper Design, Site Survey)
• Dealing with RF that is continuously changing (RRM and RF Management)
• Coping with coverage holes from an AP going down (RRM and RF Management)

Radio Frequency (RF) High Availability
• Site Survey, site survey… and site survey
  • Use “Active” survey
  • Coverage vs. Capacity
  • Consider Client type (ex. Smartphone vs. Laptop)
    • Smartphone callouts: “My power is half of my brother MacBook”, “My antenna gain is 4 times smaller”
    • Client roaming callouts: “I try to connect to 5GHz and stay connected until the signal is REALLY bad”, “…and then move to another BSSID if it is REALLY better”
    • Adaptive 802.11r, FastLane, iOS Analytics
• AP positioning and antenna choice are key
  • Use common sense
  • Light source analogy
  • Internal antennas are designed to be mounted on the ceiling
  • External antennas: use the same antennas on all connectors
• Tools
  • What you use is less important than how you use it
  • Use the same tool to compare results

RF High Availability: Cisco RRM
• What are the objectives of Radio Resource Management (RRM)?
  • Provide a system-wide RF view of the network at the Controller (only Cisco!!)
  • Dynamically balance the network and mitigate changes
  • Manage spectrum efficiency so as to provide the optimal throughput under changing conditions
• What’s in RRM
  • DCA: Dynamic Channel Assignment
  • TPC: Transmit Power Control
  • CHDM: Coverage Hole Detection and Mitigation
• RRM best practices
  • Leave RRM settings on auto for most deployments (High Density is a special case)
  • Design for most radios set at a mid power level (level 3, for example)
  • Use RF Profiles to customize RRM settings per area or group of APs

RF High Availability: Cisco RRM, DCA in action
• RRM has a system view of RF. An AP view would be limited and could result in a sub-optimal RF plan
• RRM determines the optimal channel plan based on AP layout
• A rogue AP is detected on channel 11
• RRM assesses the RF and takes a decision in less than 10 min
• A channel change is triggered to improve the RF. Note how the 3 non-overlapping channels are still maintained!
• With a limited AP-based view of the RF, each AP would avoid channel 11, reducing overall network capacity
[Figure: AP channel plans on channels 1, 6, and 11]

RF High Availability: Cisco RRM, CHDM in action
CHDM = Coverage Hole Detection and Mitigation
• RRM determines the optimal power plan based on AP layout
• Each client RSSI is tracked by the AP and reported to the WLC
• If an AP fails, the CHDM algorithm kicks in and increases the power of neighboring cells within 90 sec
• Clients roam to new APs
• This happens if the CHDM conditions are met:
  • Clients are below the RSSI threshold
  • Minimum failed clients per AP (default 3)
  • Coverage exception level per AP (25% by default)
  • Failed packets (number and %)
• These checks are needed to avoid false positives
RRM details and more: Improve WLAN Spectrum Quality with Cisco’s advanced RF (BRKEWN-3010)

RF High Availability: Flexible Radio Assignment (FRA)
• FRA-auto (default value) or Manual
• Auto: 2.4 GHz -> 5 GHz or Monitor Mode
• Transition to 2.4 GHz if coverage drops
• FRA is supported on the Cisco Aironet 2800/3800/4800 Series Access Points
[Figure: dual-radio APs with radios serving 5 GHz, serving 2.4 GHz, or monitoring]
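The CHDM trigger conditions listed above can be illustrated with a minimal sketch. This is not the RRM algorithm itself: the function and field names are hypothetical, the -80 dBm RSSI threshold is an assumed value (the slide gives none), and the failed-packet condition is left out. Only the defaults of 3 failed clients and a 25% exception level come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChdmThresholds:
    rssi_dbm: float = -80.0         # assumed threshold; the slide gives no value
    min_failed_clients: int = 3     # slide default
    exception_level: float = 0.25   # slide default (25% of the AP's clients)


def coverage_hole_suspected(client_rssi_dbm, thresholds=ChdmThresholds()):
    """Return True when an AP's client RSSI readings meet all modeled conditions."""
    if not client_rssi_dbm:
        return False
    # Clients whose RSSI is below the threshold count as "failed" clients.
    failed = [rssi for rssi in client_rssi_dbm if rssi < thresholds.rssi_dbm]
    failed_ratio = len(failed) / len(client_rssi_dbm)
    # Both the absolute count and the percentage must be reached,
    # which is what keeps a single weak client from raising a false positive.
    return (len(failed) >= thresholds.min_failed_clients
            and failed_ratio >= thresholds.exception_level)
```

With these assumptions, three weak clients out of four would flag a suspected hole, while three weak clients out of twenty would not, because the 25% exception level is not reached.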
Summary
• Cisco provides well engineered Access Points, Antennas, and Radio Resource Management features in the controllers
• However, you need to understand the general concepts of radio; otherwise, it is very easy to end up implementing a network in a sub-optimal way: “RF Matters”

… adding a Wireless Controller (functionality)
[Figure: Access Points, Access Switches, Aggregation Switches, and the Wireless Controller function placed in a private or public cloud, on Mobility Express, in SD-Access, or as a Centralized/SD-Access/FlexConnect controller]

Wireless Controller modes fitting different requirements
• Centralized: ease of deployment and management for large campuses. Cloud and non-Cloud options.
• SDA-Wireless: policy segmentation and consistent wired-wireless management
• FlexConnect: eliminates the need for a Controller at every site for a distributed deployment. Cloud and non-Cloud options.
• Mobility Express: simplified Controller-less deployment for distributed deployments and small sites
• Configure / Set up: from a web browser or the Cisco wireless app, use the setup wizard to enable multiple APs simultaneously
[Figure labels: LAN, Campus, WAN, Fabric]
Cisco Wireless Controller Options
• Catalyst 9800 Controller Series (launched Nov 2018)
  • C9800 on Switch (SD-Access only): 200 APs
  • Catalyst 9800-Cloud (private and public): 1000 APs
  • Catalyst 9800-40: 2000 APs
  • Catalyst 9800-Cloud (private): 3000-6000 APs
  • Catalyst 9800-80: 6000 APs
• AireOS WLCs
  • Mobility Express: 50-100 APs
  • WLC 3504: 150 APs
  • WLC 5520: 1500 APs
  • WLC 8540: 6000 APs
[Figure: controller options plotted by AP scale; the figure also carries an ENCS label]

Cisco Catalyst 9800 Series: Wireless benefits
Powered by IOS XE. Open and programmable. Trustworthy solutions. Modular operating system.
• Deploy Anywhere
  • On-prem, private/public cloud, or embedded wireless on a 9k switch
  • AWS GovCloud ready
  • Scale as you grow
• Always-on
  • Software updates with no disruption
  • Rolling AP upgrades
  • Seamlessly add new AP models
• Secure
  • Detect encrypted threats with Encrypted Traffic Analytics (ETA)
  • Integration with StealthWatch
  • Automated macro/micro segmentation with SDA
  • WPA3 support*
*Future

High Availability
Reducing downtime for Upgrades and Unplanned Events
• Unplanned Events (device and network interruptions)
  • SSO Active-Standby
  • N+1 Primary, Secondary
  • Per AP Primary, Secondary, Tertiary
• Controller Software Update: Software Maintenance Updates (SMU^)
• Access Point Updates: new AP model and AP updates*
• Software Image Upgrades: wireless controller image upgrades
Centralized Mode High Availability: SSO and N+1
• Client SSO
  • Requirements:
    • Catalyst 9800 Series
    • 5520, 8540, 3504 WLC
    • L2 connection
    • Same HW + SW version
    • 1:1 box redundancy
  • Benefits:
    • Active client state is synched
    • AP state is synched
    • No application downtime
    • No license needed on the secondary controller
• N+1 Redundancy (Deterministic/Stateless HA, a.k.a. primary/secondary/tertiary)
  • Requirements:
    • Each controller has to be configured separately
  • Benefits:
    • Available on all controllers
    • Crosses L3 boundaries
    • Flexible: 1:1, N:1, N:N
[Figure axis: Network Uptime]

Wireless Controller HA
Centralized Mode: N+1 Redundancy

N+1 Redundancy
• Administrator statically assigns APs a primary, secondary, and/or tertiary controller
  • Assigned from the controller interface (per AP) or Prime Infrastructure (template-based)
  • You need to specify Name and IP if the WLCs are not in the same Mobility Group
• Pros:
  • Predictability: easier operational management
  • Support for an L3 network between WLCs
  • Flexible redundancy design options: 1:1, N:1, N:N:1
  • WLCs can be of different HW and SW (*)
  • “Fallback” option in the case of failover
  • Can overload APs on controllers (using AP priority)
• Cons:
  • Stateless redundancy: there is network downtime when the WLC fails
  • More upfront planning and configuration
(*) AP will need to upgrade/downgrade code upon joining
[Figure: WLAN-Controller-A, WLAN-Controller-B, and WLAN-Controller-C reached over an IP network. Access Point configuration examples:]
• Primary: WLAN-Controller-1, Secondary: WLAN-Controller-2, Tertiary: WLAN-Controller-3
• Primary: WLAN-Controller-2, Secondary: WLAN-Controller-3, Tertiary: WLAN-Controller-1
• Primary: WLAN-Controller-3, Secondary: WLAN-Controller-2, Tertiary: WLAN-Controller-1
N+1 Redundancy: Configuration
• Catalyst 9800 Controller Series: Configuration > AP Join > …
• AireOS: Wireless > High Availability
• Global backup Controllers
  • Used if there are no primary/secondary/tertiary WLCs configured on the AP
  • The backup controllers are added to the primary discovery response message to the AP

N+1 Redundancy: AP Failover mechanism
< 30-45 sec (*)
When configured with Primary and backup Controllers:
• The AP uses heartbeats to validate current WLC connectivity
• Upon losing a heartbeat to the Primary, the AP sends 5 consecutive heartbeats, one every 3 seconds (default)
  • Configurable down to a minimum of 3 keepalives, one every 2 sec
• If there is no reply, the AP declares the WLC dead and starts the join process to the first backup WLC candidate:
  • The backup is the first alive WLC in this order: primary, secondary, tertiary, global primary, global secondary
• With N+1 failover, the AP goes back to the discovery state just to make sure the backup WLC is up, and then immediately starts the JOIN process
• With N+1, the AP periodically checks for the Primary to come back online and falls back to it (AP fallback can be disabled)
(*) With Fast Heartbeat and minimum values for keepalive
[Figure: AP state machine: AP Boots Up, Reset, Discovery, DTLS Setup, Join, Image Data, Config, Run; "WLC failure detected" returns the AP to Discovery]

N+1 Redundancy: AP Fast Heartbeat
< 30-45 sec (*)
• Fast Heartbeats lower the amount of time it takes to detect Primary controller failure
• How Fast Heartbeat works
  • The AP sends these packets, by default every 1 sec
  • When the fast heartbeat timer expires, the AP sends fast echo requests to the WLC 3 times (configurable)
  • If there is no response, the primary is considered dead and the AP selects an available controller from its “backup controller” list in the order of primary, secondary, tertiary, primary backup controller, and secondary backup controller
• Fast Heartbeat is only supported for Local and Flex mode

N+1 Redundancy: AP Primary Discovery Request Timer
• The access point periodically sends primary discovery requests to the Primary WLC to know when it is back online. Default is 120 sec.
• If AP Fallback is enabled (default), the AP automatically rejoins the Primary controller

N+1 Redundancy: AP Failover Priority
• Assign priorities to APs: Critical, High, Medium, Low
• Critical priority APs get precedence over all other APs when joining a controller
• In a failover situation, a higher priority AP will be allowed to join ahead of all other APs
• If the backup controller doesn’t have enough licenses (ex. multiple Primary WLCs fail), existing lower priority APs will be dropped to accommodate higher priority APs
[Figure: a WLC fails and the backup WLC is overloaded; the Critical priority AP fails over and a Medium priority AP is dropped]

N+1 Redundancy: Typical Design
< 30-45 sec
• The most common design is N+1 with a redundant WLC (WLC-BKP) in a geographically separate location across the campus
• Downtime can be held to 30-45 sec when fast heartbeat is used to detect the failure
• Use AP priority in case of oversubscription of the redundant WLC
[Figure: primary locations, each with a WLAN-Local controller, connected over the campus IP network to a geo-separated DC hosting WLC-BKP. APs configured with Primary: WLAN-Local, Secondary: WLC-BKP]
For more info: http://www.cisco.com/en/US/docs/wireless/technology/hi_avail/N1_HA_Overview.html
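The N+1 behavior described on these slides (heartbeat-based detection, the backup selection order, and AP failover priority) can be sketched as follows. This is an illustrative model only, not controller or AP code: all function names, the dictionary layout, and the numeric priority ranks are assumptions made for the example. The detection figure covers only the heartbeat retries; the slides' "< 30-45 sec" also includes discovery and join.

```python
# Order in which an AP looks for a live backup controller (from the slide).
BACKUP_ORDER = ("primary", "secondary", "tertiary",
                "global_primary", "global_secondary")

# Hypothetical numeric ranks for the four priority labels on the slide.
PRIORITY = {"low": 1, "medium": 2, "high": 3, "critical": 4}


def detection_time_s(probes=5, interval_s=3):
    """Time spent on heartbeat retries before the WLC is declared dead."""
    return probes * interval_s


def pick_backup(configured, alive):
    """Return the first configured controller that is alive, or None."""
    for role in BACKUP_ORDER:
        name = configured.get(role)
        if name is not None and name in alive:
            return name
    return None


def admit_ap(joined, capacity, name, priority):
    """Try to join an AP to a controller holding at most `capacity` APs.

    `joined` maps AP name -> priority label and is updated in place.
    Returns (admitted, dropped_ap_name_or_None).
    """
    if len(joined) < capacity:
        joined[name] = priority
        return True, None
    # Controller is full: find the lowest-priority AP already joined.
    lowest = min(joined, key=lambda ap: PRIORITY[joined[ap]])
    if PRIORITY[joined[lowest]] < PRIORITY[priority]:
        del joined[lowest]          # drop it to make room
        joined[name] = priority
        return True, lowest
    return False, None
```

Under these assumptions, the default timers give 5 x 3 = 15 seconds of retries and the minimum values give 3 x 2 = 6 seconds. An AP configured with Primary WLAN-Local and Secondary WLC-BKP selects WLC-BKP once WLAN-Local stops answering.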
Wireless Controller HA
Centralized Mode: Stateful Switchover (SSO)

High Availability (Client SSO)
• A direct physical connection between the Active and Standby Redundancy Ports, or Layer 2 connectivity, is required to provide stateful redundancy within or across datacenters
• Sub-second failover and zero SSID outage
• The only supported SFPs on the Gigabit RP port are GLC-SX-MMD and GLC-LH-SMD
[Figure: Active and Hot-Standby Wireless Controllers (C9800-40-K9, C9800-80-K9) connected back-to-back through their Gigabit SFP RP ports, or with the RP connected via L2]
Example for AireOS Controller: https://www.cisco.com/c/en/us/td/docs/wireless/controller/technotes/7-5/High_Availability_DG.html

C9800 Private Cloud Deployment: Client SSO High Availability
[Figure: C9800-CL-K9 on ESXi. Active and Standby vWLC instances (vWLC1, vWLC2), each with control-plane and data-plane interfaces plus an HA interface, attached to vswitches and physical switches; Redundancy Port connectivity, RP via L2]

Stateful Switchover (SSO)
< 1 sec
• HA pairing is possible only between the same type of hardware and software versions
• True box-to-box High Availability, i.e. 1:1
  • One WLC in Active state and a second WLC in Hot Standby state
  • The Standby continuously monitors the health of the Active WLC via a dedicated link
• Configuration on the Active is synched to the Standby WLC
  • This happens at startup and incrementally at each configuration change on the Active
• What else is synched between Active and Standby?
  • Licenses, AP CAPWAP state, clients in “RUN” state
• Downtime during failover is greatly reduced:
  • 2 - 100 msec for a box failover (Active WLC crashes, system hangs, manual reset, or forced switchover)
  • 350-500 msec in the case of power failure on the Active WLC (no signaling to the peer is possible)
  • A few seconds in the case of network failover (gateway not reachable)

Stateful Switchover (SSO): Failover sequence
1. Redundancy role negotiation and config sync
2. APs associate with the Active controller
3. Clients associate with the Active through the AP
4. Active failure: peer notification, or missing keepalive
5. Standby WLC sends out GARP
6. Standby becomes Active:
  • AP DB and Client DB are already synced to the standby controller
  • AP CAPWAP tunnel session intact
  • Client session intact, client does not re-associate*
Effective downtime for the client is: Detection time + Switchover time
[Figure: Active and Standby controllers above the campus access layer, with the CAPWAP tunnel and client session preserved across the switchover]
Video: https://www.youtube.com/watch?v=If5F7eZkC3w

Stateful Switchover (SSO): Other important things to keep in mind
• There is no preemption in Controller SSO:
  • When the failed Active WLC comes back online, it rejoins as Hot Standby
• In Service Software Upgrade (ISSU) is currently not supported: plan for downtime when upgrading software. There are many improvements for the Catalyst 9800 Controller; see the next section.
• Recommendations:
  • The physical connection between the Redundancy Ports should be made before HA configuration
  • Keepalive and Peer Discovery timers should be left at their default values for better performance
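The relation "effective downtime = detection time + switchover time" and the failover ranges quoted above can be put into a small worked sketch. The dictionary keys and function names are hypothetical; the millisecond ranges are the ones on the slide, and the network-failover case is omitted because the slide only says "a few seconds".

```python
# (best case, worst case) client downtime in msec, as quoted on the slide.
SLIDE_DOWNTIME_MS = {
    "box_failover": (2, 100),     # crash, hang, manual reset, forced switchover
    "power_failure": (350, 500),  # no signaling to the peer is possible
}


def effective_downtime_ms(detection_ms, switchover_ms):
    """Client-visible downtime is the sum of detection and switchover time."""
    return detection_ms + switchover_ms


def fits_budget(failure, budget_ms):
    """Check the slide's worst-case figure for a failure type against a budget."""
    _, worst_ms = SLIDE_DOWNTIME_MS[failure]
    return worst_ms <= budget_ms
```

For example, against the 200 msec convergence target mentioned earlier in the deck, a box failover stays within budget even in the worst case, while a power failure on the Active WLC does not.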
Wireless Controller HA: Catalyst 9800 only

High Availability
Reducing downtime for Upgrades and Unplanned Events
Cisco Catalyst 9800 Wireless Controller differentiators
• Unplanned Events (device and network interruptions)
  • SSO Active-Standby
  • N+1 Primary, Secondary
  • Per AP Primary, Secondary, Tertiary
• Controller Software Update: Software Maintenance Updates (SMU^)
  • Hot Patch (no Wireless Controller reboot)
  • Cold Patch: HA install on SSO pair
  • Auto install on Standby
• Access Point Updates: new AP model and AP updates*
  • Rolling AP Update (no Wireless Controller reboot)
  • AP Device Pack: new AP model
  • Flexible per-site, per-model updates
• Software Image Upgrades: wireless controller image upgrades
  • N+1 Hitless Rolling AP Upgrade
^ MD Release only
[Legend: "16.10 Supported" and "Supported after 16.10"]
Controller and AP software upgrades
(Future; SMU on MD Release only)
• Controller Updates: controller update or bug fixes, delivered as an SMU
• AP update or bug fixes: PSIRTs and fixes on APs, delivered as an AP Service Pack
• New AP Model Support: hot-patchable support, delivered as an AP Device Pack
Benefits:
• Contain impact within a release
• Faster resolution to critical issues
• Fixes for defects and security issues without the need to requalify a new release
• Provide fixes to time-sensitive critical issues found in network devices

Wireless Controller SMU
Software Maintenance Update (SMU) is the ability to apply patch fixes on a software release in the customer network.
Wireless Controller SMU installation options:
• Hot Patch (no Wireless Controller reboot)
  • Inline replacement of functions without restarting the process
  • On SSO systems, the patch is applied on both Active and Standby without any reload
• Cold Patch (Wireless Controller reboot)
  • Installing the SMU requires a system reload
  • On SSO systems, SMU updates can be installed on the HA pair with zero downtime (auto install on Standby; follows the ISSU path: both Standby and Active controllers are reloaded, but there is no impact to AP and client sessions)
• The current mechanism relies on Engineering Specials
  • The entire image is rebuilt and delivered to the customer
• SMU infrastructure will be available in the 16.10 FCS release
• SMUs for C9800 will be available starting with the first MD Release

Catalyst 9800 SMU: Cold Patch + AP Service Pack
Follows the ISSU path: both Standby and Active controllers are reloaded, but there is no impact to AP and client sessions.
Steps shown on the slide:
• Install SMU on Standby
• Switchover to activate SMU
• Install SMU on the new Standby
• Rolling AP upgrade if the AP image needs an update (APs are reset in a staggered way)

Rolling AP Upgrade: Choose how aggressive…
• User selects the % of APs to upgrade in one go [5, 15, 25]
  • For 25%, neighbors marked = 6 [expected number of iterations ~ 5]
  • For 15%, neighbors marked = 12 [expected number of iterations ~ 12]
  • For 5%, neighbors marked = 24 [expected number of iterations ~ 22]
[Figure: AP neighbor maps labeled N=4, N=8, and N=24 neighbor APs]

Rolling AP Upgrade: Client Steering
• Clients are steered from candidate APs to non-candidate APs
  • 802.11v BSS Transition Request
  • Disassociation imminent
• If clients do not honor this, they will be de-authenticated before the AP reloads
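One way to picture the rolling AP upgrade is as a greedy batching over the AP neighbor map: in each iteration an AP is chosen only if none of its marked neighbors is being upgraded at the same time, so clients always have a nearby AP to be steered to. The sketch below is an assumption about how such a selection could work, not the controller's actual algorithm; the function name and the neighbor-map layout are hypothetical, and only the percentage-to-neighbors table comes from the slide.

```python
import math

# Neighbors marked per selected AP for each user-selectable percentage (from the slide).
NEIGHBORS_MARKED = {25: 6, 15: 12, 5: 24}


def plan_rolling_upgrade(neighbors, percent):
    """Split APs into upgrade iterations.

    `neighbors` maps each AP name to its neighbor APs, nearest first.
    Returns a list of iterations, each a list of AP names to reload together.
    """
    marked = NEIGHBORS_MARKED[percent]
    batch_cap = max(1, math.ceil(len(neighbors) * percent / 100))
    pending = sorted(neighbors)
    iterations = []
    while pending:
        batch, blocked = [], set()
        for ap in pending:
            if len(batch) == batch_cap:
                break
            near = neighbors[ap][:marked]
            # Skip APs that neighbor an AP already chosen for this iteration.
            if ap in blocked or any(n in batch for n in near):
                continue
            batch.append(ap)
            blocked.update(near)   # keep these neighbors up for roaming clients
        iterations.append(batch)
        pending = [ap for ap in pending if ap not in batch]
    return iterations
```

The first pending AP is always selectable, so every iteration makes progress and the loop terminates. On a toy ring of eight APs with the 25% setting, this yields four iterations of two non-adjacent APs each.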
N+1 Rolling AP Upgrade
Wireless Controller image upgrade using an N+1 staging controller
1. The device auto-selects candidate APs based on the selected % and the RRM AP neighbor map
2. The upgrade process kicks in:
  • Image download to the Primary Wireless Controller
  • Image pre-download to APs
  • Selective redirect of clients using 11v
  • APs moved to the N+1 Wireless Controller in a rolling manner
  • Primary Wireless Controller reboot
  • APs moved back to the Primary Wireless Controller (optional)
3. Monitor progress on the device
[Figure: rolling upgrade triggered within a Mobility Group; the Primary moves from version X to X+1 while APs are served by the N+1 controller already upgraded to X+1]
Cisco Public 319 Wireless Controller HA Software DefinedAccess Wireless Software Defined Access: Bringing Intent Based Networking to Life Cisco DNA Center Policy Automation B B Automated Network Fabric Analytics Single Fabric for Wired & Wireless with simple Automation C Outside Identity-Based Policy & Segmentation Decouples Security & QoS from VLAN and IP Address SDA Extension User Mobility Policy stays with User IoT Network Employee Network Insights & Telemetry Analytics and Insights into User and Application behavior © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public Catalyst 9800 SD-Access Wireless Cisco DNA Center Policy Automation Analytics SD-Access Wireless Distributed Sites SD-Access Wireless Campus Controller Appliance or Private Cloud SD-WAN (Viptela) c SD-Access IoT c MPLS | Metro 4G/5G/LTE | Internet Embedded Wireless “Cat 9k Switch” User Mobility Seamless Mobility Policy stays with user Policy stays with user TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 322 Software Defined-Access: Roles and Terminology Cisco DNA Controller – Enterprise SDN Identity Services Cisco DNA Controller ISE / AD Fabric Mode WLC Fabric Border B Fabric Mode APs Control-Plane (CP) Node – – Map System that manages Endpoint to Device relationships Fabric Border Nodes – A Fabric device B Intermediate Nodes (Underlay) Controller for Automation & Assurance. GUI management abstraction via multiple Service Apps Identity Services – NAC & ID Systems (e.g. ISE) for dynamic Endpoint to Group mapping and Policy definition C Control-Plane Nodes (e.g. Core) that connects External L3 network(s) to the SDA Fabric Fabric Edge Nodes – A Fabric device (e.g. 
Access or Distribution) that connects wired endpoints to the SDA fabric
• Fabric Wireless Controller: a fabric-enabled Wireless Controller (WLC) that participates in the LISP control plane
• Fabric Mode APs: access points that are fabric-enabled. Wireless traffic is VXLAN-encapsulated at the AP
[Diagram labels: CAPWAP (control), VXLAN (data), fabric edge nodes, fabric mode APs]

SD-Access Wireless: Redundancy Considerations
• Control Plane (CP) redundancy is supported in an Active/Active configuration
• The WLC is configured with two CP nodes, with information synchronized across both
• The WLC registers wireless clients in the Host Tracking DB
• Stateful redundancy with a WLC SSO pair: the Active WLC updates the control-plane nodes
[Diagram: Active and Standby WLC (SSO pair) sending client updates to two control-plane (C) nodes; border (B) node]

Platforms supporting SD-Access Wireless
On Switch (optimized for distributed branches) and On Private Cloud (small and medium campus):
• Cisco IOS® XE Software • Cisco IOS® XE Software • Cat 9300, Cat 9500 • C9800-CL • 200 AP, 4k Clients • SD-Access wireless with Cat9800 Software Package • Indirect AP Support • Centralize Control Plane • Always on Fabric with robust HA • 1k AP, 10k Clients • 3k AP, 32k Clients • 6k AP, 64k Clients^ • Scale on demand • Designed for IoT • Always on Fabric with robust HA
On Appliance (medium and large campus):
• Cisco IOS® XE Software: C9800-40-K9, C9800-80-K9
• Cisco AireOS Software: WLC 3504 (SW 8.8), WLC 5520 (SW 8.8), WLC 8540 (SW 8.8)
• Designed for IoT
• Always on Fabric with robust HA
Wireless Controller HA: FlexConnect Mode

FlexConnect quick recap
• CAPWAP management and data plane are split:
  - Central Switching (SSID data traffic sent to the WLC)
  - Local Switching (SSID data traffic sent to a local VLAN)
• Two modes of operation from the AP perspective:
  - Connected (when the WLC is reachable)
  - Standalone (when the WLC is not reachable)
[Diagram: controller cluster at the central site, WAN, FlexConnect branch office with central switching and local switching paths]

FlexConnect HA: Limitations and Benefits
FlexConnect Local Switching
• Limitations: L2 roaming; Flex Groups for AAA local authentication; fault tolerance: identical configuration on N+1 controllers
• Benefits: upon WLC failure the AP stays up and clients are not disconnected (equivalent to Client SSO); AAA survivability available
FlexConnect Central Switching
• Limitations: same as Centralized mode
• Benefits: same as Centralized mode

Clients on locally switched SSIDs stay connected during a controller/WAN outage
[Diagram: data center with Wireless Controller, Prime and AAA/RADIUS; WAN outage; branch office access point with locally switched SSIDs, all connected clients stay connected. CAPWAP control: UDP 5246; CAPWAP data: UDP 5247]

Impact of WAN Outage or Controller Failure
1. Controller failure:
   N+1 HA design:
   • No impact for locally switched SSIDs
   • The FlexConnect AP searches for a backup WLC and resumes client sessions on centrally switched SSIDs
   1:1 HA design with Client SSO:
   • No impact, also for centrally switched SSIDs: centrally and locally switched SSIDs stay up
2. WAN failure / controller not reachable:
   • The access point continues to transmit and receive data on locally switched SSIDs
   • Connected clients stay connected
   • Fast roaming is possible for clients with CCKM/OKC/802.11r support
   • New clients can connect if local RADIUS or local authentication is provided
   • Lost features: RRM, wIDS, location, WebAuth, NAC
[Diagram: (1) controller cluster at the central site, (2) WAN, FlexConnect branch office with local switching]

Wireless Controller HA: Mobility Express

Cisco Mobility Express: controller function embedded into the access point
• Runs the WLAN controller on an access point
• Investment protection: add a controller without changing the access point
• Mobile app/WebUI to configure up to 100 access points
• Simple UI monitors, manages and troubleshoots your network
• Best practices activated by default and built-in redundancy

Mobility Express Overview
• One AP runs as Mobility Master (think of it as a local virtual WLC)
• Controller and APs are in the same L2 broadcast domain
• Based on the FlexConnect architecture
• Mobility Express supports central client authentication and local switching of traffic
[Diagram: Master AP among AIR-AP1852I-B-K9, AIR-AP2802I-B-K9, AIR-AP3802I-B-K9, AIR-AP3702E-B-K9, AIR-AP3702I-B-K9 and AIR-AP2702I-B-K9 access points]

Mobility Express: Master AP Redundancy
• If the Master AP fails, another Mobility Express capable AP is elected automatically
• The newly elected Master AP has the same IP and config as the original Master AP
• A preferred Master can be set (AireOS 8.7)
• Election of a new controller uses VRRP
• Heartbeats are exchanged with the Master AP every 10 s
• After 3 missed heartbeats, Master election is initiated and all Master-capable APs participate
• APs fall into standalone mode during the election process (takes about 30 s)
• Standalone access points join the newly elected Master and go to connected mode
• Election priorities:
  1. Most capable access point: 3800 > 2800 > 1800
  2. AP with the least client load
  3. In case of a tie, election based on the lowest MAC address
Mobility Express Master Election Process
• Most capable access point, e.g. 2800 vs. 1850
• Least client load
• Lowest MAC address
[Diagram: Master AP and preferred (P) candidates among AIR-AP1852I-B-K9, AIR-AP2802I-B-K9, AIR-AP3802I-B-K9, AIR-AP3702E-B-K9, AIR-AP3702I-B-K9 and AIR-AP2702I-B-K9 access points]

Mobility Express WLAN Deployment Options
• Single office: Mobility Express
• Distributed office: Mobility Express
• Distributed enterprise: Mobility Express in the branch, controller-based in the campus

Agenda
• High Availability (HA), the theory of operations
• What to do at the Radio Frequency layer?
• Controller HA for different deployment modes: Centralized (Cloud/non-Cloud), SD-Access, FlexConnect, Mobility Express
• HA Design and Deployment Practices
• Wireless Assurance: proactively monitor your network!
• Key takeaways

HA Best Practices: Connecting an AP to the wired network
Recommendations:
• Create redundancy throughout the access layer by connecting APs to different switches, stack members or line cards
• If the AP is in Local mode, configure the port as an access port with STP PortFast, BPDU guard, etc.
• If the AP is in Flex mode with Local Switching, configure the port as a trunk and allow only the VLANs you need
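The Master AP election priorities from the Mobility Express slides (most capable AP, then least client load, then lowest MAC address) amount to a three-level sort key. The sketch below illustrates that ordering only; the `AccessPoint` type, the `elect_master` function and the sample values are hypothetical, and the real election runs over VRRP on the APs themselves.

```python
from dataclasses import dataclass
from typing import Iterable

# Capability ranking taken from the slide: 3800 > 2800 > 1800 (higher is better).
CAPABILITY = {"3800": 3, "2800": 2, "1800": 1}


@dataclass(frozen=True)
class AccessPoint:
    name: str
    series: str        # "3800", "2800" or "1800"
    client_count: int  # current client load
    mac: str           # e.g. "00:11:22:33:44:55"


def mac_to_int(mac: str) -> int:
    """Convert a MAC address string to an integer so addresses compare numerically."""
    return int(mac.replace(":", "").replace("-", "").replace(".", ""), 16)


def elect_master(candidates: Iterable[AccessPoint]) -> AccessPoint:
    """Pick the new Master: most capable, then least loaded, then lowest MAC."""
    pool = list(candidates)
    if not pool:
        raise ValueError("no Master-capable AP available")
    return min(
        pool,
        key=lambda ap: (-CAPABILITY[ap.series], ap.client_count, mac_to_int(ap.mac)),
    )


if __name__ == "__main__":
    aps = [
        AccessPoint("lobby", "1800", 0, "00:00:00:00:00:01"),
        AccessPoint("floor1", "3800", 10, "00:00:00:00:00:02"),
        AccessPoint("floor2", "3800", 4, "00:00:00:00:00:03"),
    ]
    print(elect_master(aps).name)
```

In this example the lightly loaded 1800 loses to both 3800 APs because capability is compared first; client load only breaks the tie between the two 3800s.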
HA Best Practices: Connecting a Single Controller to the wired network
1) To a single modular switch or stack
• Use a trunk EtherChannel (EC)/LAG
• Trunk only the required VLANs to the controller
• Use 2/4/8 ports in a bundle to optimize load sharing
• Spread ports across line cards/stack members
2) To redundant distribution switches in a VSS pair
• Same as Option 1
• Spread ports across VSS members

HA Best Practices: Connecting an HA pair to the wired network
Option 1: to a single modular switch or stack
• Same configuration on both Po1 and Po2
• The HA pair of WLCs should be considered as separate WLCs with exactly the same configuration
• Ports on both WLCs are UP, but only the ones on the Active WLC forward data traffic
• On the WLC side: connect the same physical ports to the network, e.g. ports 1-4 on WLC1 and ports 1-4 on WLC2
• On the switch side the configuration has to be the same. If using LAG, for example, two port-channels should be used with the same configuration (same mode, same VLANs, same native VLAN, etc.)
• The general recommendations for connecting a single WLC (Option 1) also apply
[Diagram: Active WLC and Standby WLC connected over L2 trunk port-channels Po1 and Po2 to a single switch or stack]

HA Best Practices: Connecting a Client SSO Controller Cluster to the wired network (VSS)
Option 2: to a VSS pair
• Use an EtherChannel from each Wireless Controller to the distribution VSS
• Spread the links in each EtherChannel across the two physical switches: this prevents a Wireless Controller switchover upon failure of one of the VSS switches
• Redundancy Port (RP) connected to the respective uplink switches
• The APs and clients are up after an SSO. It is a seamless transition and there are no drops on the client
HA Best Practices: Connecting a Client SSO Controller Cluster to the wired network (HSRP)
Option 3: to an HSRP pair
• The controllers are connected to 2 HSRP routers (Active and Standby)
• The uplink is a port-channel. RP connected to the respective uplink routers
• Failover of HSRP Active to Standby induces a switchover of the Cisco Catalyst 9800 Wireless Controller HA pair
• The APs and clients are up after an SSO. It is a seamless transition and there are no drops on the client

HA Deployment Best Practices: Focus on Campus

HA Deployment Best Practices: Campus
• What is the acceptable downtime for your business applications?
  - No downtime? Go with Stateful Switchover (Client SSO)
  - Are 30 seconds to a few minutes OK? Go with N+1 to have more deployment flexibility
• What is the downtime to upgrade an HA pair and how to minimize it?
  - Catalyst 9800 Wireless Controller: use the built-in Rolling SW Upgrade
  - AireOS controllers (details for reference only): plan for an additional backup controller and use the Prime Infrastructure Rolling SW Updates feature

HA Deployment Best Practices for Campus
N+1:
• Approx. 30 s failover time (APs and clients affected)
• No config sync (risk: config mismatch)
• AP load balancing
• L2 or L3
SSO:
• Sub-second failover (clients and APs not affected)
• Config sync
• One active, one standby (no AP load balancing)
• L2 connection needed
SSO + 1:
• Adds redundancy and simplifies operation during maintenance (e.g. SW updates)
SSO + SSO:
• Adds redundancy and simplifies operation during maintenance (e.g. SW updates)
[Diagram: primary and secondary controllers for N+1; Active and Standby WLC over L2 for SSO; SSO pairs combined with secondary controllers for SSO + 1 and SSO + SSO]
HA Deployment Best Practices: Campus
• What is the acceptable downtime for your business applications?
  - No downtime? Go with AireOS Stateful Switchover
  - Are 30 seconds to a few minutes OK? Go with N+1 to have more deployment flexibility
• What is the downtime to upgrade an HA pair and how to minimize it?
• What is the recommended HA deployment in a multi-site campus?
  1. Use a 2-tier redundancy (SSO and N+1) HA deployment:
     • Use SSO in the main site (Primary WLC)
     • Use Secondary/Tertiary in the redundancy sites
  2. For maximum resiliency use SSO in all sites

Multi-site Campus: Combine SSO with N+1
The SSO pair can act as the Primary Controller and be deployed with a single Secondary and Tertiary WLC.
Network downtime:
• No network downtime for a single controller failure in the primary DC
• On failure of both the Active and Standby WLC, APs fall back to the secondary and then to the configured tertiary controller
Recommendations:
• Make sure that AP Fallback is enabled
• Use AP Failover priority in case of oversubscription of the backup WLC
• Useful to reduce downtime for an SSO pair software upgrade
AP config: Primary WLC 9.6.61.2, Secondary WLC 9.6.62.2, Tertiary WLC 9.6.63.2
[Diagram: Main Data Centre with the Primary SSO pair (9.6.61.x/24, .2 and .3), ISE and PI; DC 1 with Secondary WLC 9.6.62.2; DC 2 with Tertiary WLC 9.6.63.2; IP network; campus access]

Multi-Site Campus: SSO everywhere!
• Each site can be its own separate SSO architecture
• Full site redundancy by assigning primary, secondary and tertiary controllers to the APs
• Maximum level of high availability: no network downtime upon controller failure within any site
[Diagram: Main Data Centre, DC 1 and DC 2 hosting the Primary (9.6.61.x), Secondary (9.6.62.x) and Tertiary (9.6.63.x) SSO pairs, each with members .2 and .3; ISE and PI; IP network; campus access]
AP config: Primary WLC 9.6.61.2, Secondary WLC 9.6.62.2, Tertiary WLC 9.6.63.2

HA Deployment Best Practices: Focus on Branch

HA Deployment Best Practices: Branch
Key design questions
Local Controller (appliance/virtual):
• Specific per-branch configuration
• Independence from WAN quality
• Reduced configuration on switches
• Full feature support
• L3 roaming supported
FlexConnect:
• Single pane of management and troubleshooting
• Reduced branch footprint
• Built-in resiliency
• Perfect fit for a centralized IT team
Mobility Express:
• Specific per-branch configuration
• Independence from WAN quality
• Low hardware footprint (controller running on an access point)
HA questions:
• Is the branch independent from the central site from an operational perspective?
• What is the traffic flow of your applications? Are the application servers centrally located?
• Is there a local Internet breakout?
• How do you authenticate new users if the WAN/controller is down? Where is the AAA server located?
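The multi-site design relies on each AP holding an ordered Primary/Secondary/Tertiary controller list and on AP Fallback being enabled. A minimal sketch of that selection logic follows, using the controller addresses from the AP config on the slides. The function name and the reading of AP Fallback (return to the highest-priority reachable controller as soon as it is back) are assumptions for illustration, not controller code.

```python
from typing import Iterable, Optional, Sequence

# Primary, secondary and tertiary WLC addresses from the slide's AP config.
AP_CONTROLLERS = ("9.6.61.2", "9.6.62.2", "9.6.63.2")


def select_controller(
    preferred: Sequence[str],
    reachable: Iterable[str],
    current: Optional[str] = None,
    ap_fallback: bool = True,
) -> Optional[str]:
    """Return the controller the AP should join, or None if none is reachable.

    With ap_fallback disabled the AP stays on its current controller for as
    long as that controller is reachable (assumed behavior for this sketch).
    """
    up = set(reachable)
    if current in up and not ap_fallback:
        return current
    for controller in preferred:  # walk the list in priority order
        if controller in up:
            return controller
    return None


if __name__ == "__main__":
    # Both members of the primary SSO pair are down: fall back to the secondary.
    print(select_controller(AP_CONTROLLERS, reachable=["9.6.62.2", "9.6.63.2"]))
```

The `ap_fallback` flag shows why the slides recommend enabling AP Fallback: without it, an AP that moved to the secondary controller would remain there after the primary site recovers.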
FlexConnect Branch Summary
"Central Controller Cluster for thousands of Sites and Access Points"
Key facts:
• "Cloud controller" (private or public)
• Ease of operations: single point of configuration for up to 6000 APs
When to use:
• Perfect for a centralized IT team
High availability:
• If the controller is not reachable, the local data path stays up and clients stay connected; you can use AAA survivability
• SSO at the central site provides control plane survivability
Keep in mind:
• Configure the switch port as a trunk if SSID/VLAN separation is needed
• WAN performance
• Some feature limitations (compared with a local controller)
[Diagram: Data Centre campus services with ISE, PI and the WLC SSO pair; WAN; remote location with FlexConnect APs]

Local Controller Branch Summary
"Do your clients need the full Enterprise feature set (even if the WAN is down)?"
Key facts:
• Position one or two controllers per branch
• Full feature set available
When to use:
• WAN bandwidth and latency are a concern
• Simple configuration on the switch port connected to the access point is desired
• Branch/local IT staff requires configuration outside of the corporate standard
High availability:
• Full features available if the WAN is down
• Use N+1 or SSO for site controller redundancy
• Local authentication, DHCP and DNS are required for full WAN independence
Keep in mind:
• Need to manage each site individually
• Prime Infrastructure should be considered for central manageability
[Diagram: Data Centre campus services with ISE and PI; WAN; remote location with WLC and local services: AAA, DHCP, DNS]

Mobility Express Branch Summary
"Quick and easy setup, no additional hardware, WAN independence"
Key facts:
• It's a Wireless Controller running on an access point!
When to use:
• WAN independence is required and a low hardware footprint is desired
• Ideal for new deployments using 18xx/28xx/38xx Series access points
High availability:
• Self-healing redundancy
• Independent from the WAN
• Local AAA, DHCP, DNS for full WAN independence
Keep in mind:
• Configure the switch port as a trunk if SSID/VLAN separation is needed
• Per-branch configuration and management
• Consider adding Prime Infrastructure or Cisco DNA Center for central management
[Diagram: Data Centre campus services with ISE and PI; WAN; remote location with Mobility Express APs and local services: AAA, DHCP, DNS]

Cisco Wireless Assurance: Proactively Monitor your Network
Cisco DNA Center (Policy, Automation, Analytics) can manage all wireless deployment modes for automation and assurance:
• SDA-Wireless: policy segmentation and consistent wired-wireless management
• Centralized: ease of deployment and management for large campuses
• FlexConnect: eliminates the need for a controller at every site in a distributed deployment
• Mobility Express: simplified controller-less deployment for distributed deployments and small sites
• Set up and configure: from a web browser or the Cisco wireless app, use the setup wizard to enable multiple APs simultaneously
• Continuous verification: configs, changes, routing, security services, compliance, audits (successful rollouts, operational continuity)
• Insights and visibility: visibility, context, historical insights, prediction (minimize downtime, user productivity)
• Corrective actions: guided remediation, automated updates, system optimization (IT productivity)

Purpose-Built for Cisco DNA Assurance: Wireless Streaming Telemetry Architecture
[Diagram: Cisco DNA Center receiving telemetry over gRPC/Protobuf, TLS/TDL, https/JWT and WSA/JWT from APs (AP 2800/3800/4800), ME and WLC 3504/5520/8540, the Catalyst 9800 Series and the AP1800S Active Sensor]
• HTTP 2.0/gRPC based • Anomaly Event, RF Stat, PCAP, Spectrum • Scheduled and Automated • Supported from AireOS 8.5 • Real-Time client event • KPI Parity with AireOS • Immediate Event Update • Embedded Wireless in Cat9300 • HTTPS for Automation and reporting • PnP-based Provisioning • Fully Managed by Cisco DNAC
*Available with 16.10.1s and Cisco DNAC 1.2.8 or later

Cisco DNA Center: Built ground-up for Assurance
Real-time and context-based telemetry:
• Client RF stats, onboarding state and location (<5 sec)
• Client onboarding health with Sankey charts for better analysis
• Near-real-time client tracking
1800s Sensor to validate user experience:
• Floor reassignment to make the 1800s sensor mobile
• Speed tests to validate cloud app connectivity
• IP SLA tests for real-time AppX assessment
Intelligent Capture for proactive troubleshooting:
• Live and in-service capture of onboarding failures with PCAPs
• Spectrum analyzer for analyzing interference sources
• On-demand AP stats for Wi-Fi troubleshooting
Wireless Assurance: Client Onboarding
1. Actionable dashboards: onboarding Sankey charts for better analysis
2. Real-time correlation: correlate onboarding events with poor RF and client location for root cause analysis (RCA)
3. Intelligent Capture: onboarding failures with in-service PCAPs
[Figure: Sankey chart]

Wireless Assurance: Sensors to monitor SLAs
Sensor-based SLA monitoring
1. Simulate the client perspective: the 1800s sensor is mobile with floor reassignment
2. Active testing: test cloud app performance and real-time AppX assessment
3. SLA dashboard: onboarding, network services, cloud app performance and IP SLA

Cisco Sensors: Intelligence of Cisco DNA Assurance to the edge
AP as a Sensor* (1800/2800/3800/4800):
• Can be configured as a dedicated sensor
• Automatically converted to sensor or AP by Cisco DNAC
Aironet 1800S Active Sensor:
• Purpose-built hardware for analytics
• 2x2 with 2 spatial streams
• Multiple powering options: PoE, USB Type-C, direct AC power plug
• Integrated BLE
• Ultra-compact form factor
Test your network anywhere, at any time, at real-world client level:
• SLA dashboard
• Onboarding and services tests
• Configure tests remotely
• Global issue creation
• Dynamic sensor test trigger
*AP2800/3800/4800 with 8.5MR4 or 8.8.111.0

Wireless Assurance: Client Health and Intelligent Capture
Client and network experience
1. Health dashboard: near-real-time client tracking (<60 sec) and Top N AP analytics
2. Client 360: historical time travel with client RF correlated with the onboarding events
3. Intelligent Capture: on-demand AP stats for Wi-Fi troubleshooting

Know your clients!
Client Insights: Apple iOS Analytics
1. Device profile. The client shares these details:
   1. Model, e.g. iPhone 7
   2. OS details, e.g. iOS 11
   Supports per-device-group policies and analytics
2. Wi-Fi analytics. The client shares these details:
   1. BSSID
   2. RSSI
   3. Channel number
   Gives insight into the client's view of the network
3. Assurance. The client shares these details:
   Error code for why it previously disconnected
   Provides clarity into the reliability of connectivity

Cisco DNA Wireless Assurance
• Be proactive: use sensor-based verification for critical services!
• Know your clients: Cisco/Apple Wi-Fi iOS Analytics
• Intelligent Capture: whose fault is it? "Always on" packet capture helps to differentiate between an RF issue and an application/client issue
• Go back in time: what happened yesterday or last week?
• Actionable insights: provide guidance on how to solve the issue

I would like to leave you with... Key Takeaways
• High availability for wireless is a multi-level approach, starting from Level 1 (RF)
• You have different solutions to choose from, based on the downtime that is acceptable for your business applications
• Cisco Controller SSO eliminates network downtime upon controller failure
• Wireless Assurance is key to assessing your network stability and testing proactively
Selected additional Wireless Sessions (For Your Reference)
• Cisco DNA Wireless Assurance: Isolate problems for faster troubleshooting - BRKEWN-2034
  Friday, Feb 01, 9:00 AM - 11:00 AM | Hall 8.0, Session Room C118
• How to setup an SD Access Wireless fabric from scratch - BRKEWN-2021
  Wednesday, Jan 30, 2:30 PM - 4:00 PM | Hall 8.0, Session Room D134
• Cisco SD-Access Wireless Integration - BRKEWN-2020
  Friday, Feb 01, 11:30 AM - 1:00 PM | Hall 8.0, Session Room C129
• Cisco DNA Center Assurance and Analytics: Reducing Time to resolution using Big Data and Machine Learning - BRKNMS-2542
  Tuesday, Jan 29, 2:30 PM - 4:00 PM | Hall 8.0, Session Room A108
• Advanced Troubleshooting of Cisco Catalyst 9800 Wireless Controller - BRKEWN-3013
  Thursday, Jan 31, 8:30 AM - 10:30 AM | Hall 8.0, Session Room C126
• Improve Enterprise WLAN Spectrum Quality with Cisco's advanced RF capacities (RRM, CleanAir, ClientLink, etc) - BRKEWN-3010
  Wednesday, Jan 30, 8:30 AM - 10:30 AM | Hall 8.0, Session Room A103
  Thursday, Jan 31, 11:00 AM - 1:00 PM | Hall 8.0, Session Room C126
Agenda
• Enterprise Data Center High Availability (DC HA)
• DC Switch NX-OS HA Architecture and HA Features
• DC Network HA Design and Operational Best Practices:
  - Legacy DC with vPC
  - Programmable Application Centric Infrastructure (ACI)
  - Programmable Fabric Network
• Key Takeaways

Data Center HA Section Objective
• Focus is on the Enterprise Data Center network:
  - High availability design options and best practices
  - High availability operational best practices
• Same principle: the Enterprise Campus network high availability concepts are applicable to the Data Center network
• Same goal: minimize network downtime

NX-OS HA Architecture
• Fully distributed modular design
• Control-plane and data-plane separation
• Service restartability
• Non-disruptive SSO* and ISSU
[Diagram: feature modules, system-infrastructure modules and platform-dependent hardware-related modules; feature API, management infrastructure API, HA infrastructure API, hardware drivers, netstack, kernel]
*SSO is only available on dual-supervisor Nexus 7x00 and 9500
NX-OS Service Restartability
• Stateful restart with Persistent Storage Service (PSS):
  - Checkpoints state to PSS
  - Recovers state from PSS upon restart
• Stateful restart with Graceful Restart:
  - Recovers state based on information from other services and/or the network
  - Mainly routing protocols
• Stateless restart:
  - Fresh start, no trace of the former instantiation

NX-OS NSF with Stateful Fault Recovery
If a fault occurs in a process:
• The HA manager determines the best recovery action (restart the process, switch over to the redundant supervisor)
• The process restarts with no impact on the data plane
• State checkpointing (PSS) allows instant, stateful process recovery
• Software utilizes Graceful Restart where appropriate
[Diagram: processes (LACP, HSRP, STP, IPv6, TCP/UDP, PIM, OSPF, BGP, etc.) above the HA manager and Linux kernel; a process is restarted with graceful restart while routing updates feed the software RIB and table updates feed the hardware FIB in the Nexus data plane]

NX-OS NSF with Stateful Supervisor Switchover
Supervisor switchover triggers:
• HA policy initiated:
  - Process restarts have failed
  - The kernel fails (or panics)
  - The supervisor experiences a hardware failure
• User initiated: system switchover
[Diagram: Active Sup, Standby Sup, LC - NSF]

NX-OS NSF Configuration
• The Nexus products are "NSF capable" by default for all the routing protocols in all NX-OS software releases
• No additional configuration is required unless you need to modify the default NSF timers

Nexus# show running-config ospf all
!Command: show running-config ospf all
!Time: Tue May 19 16:13:31 2009
version 4.2(1)
feature ospf
<snip>
router ospf 1
  graceful-restart
  graceful-restart grace-period 60
  area 0.0.0.0 authentication message-digest
NX-OS Software Maintenance Upgrade
• Non-disruptive bug-fix direction for restartable/stateful processes
• Limited number of patches supported
• Works with or without ISSU
• Not every bug will have a patch
• For operationally impacting bugs with no workaround
• May be disruptive
• Platform and process specific

NX-OS Stateful Process Restart & Patching
• NX-OS services checkpoint their runtime state to the Persistent Storage Service
When a process is patched:
• The install process applies the new patch
• The HA manager restarts the process
• The process restarts with patched code and no impact on the data plane
• State is recovered, operation resumes
• Total recovery time: tens of milliseconds
[Diagram: control-plane processes (OSPF, EIGRP, BGP, HSRP, OTV, vPC, UDLD, SSH, IGMP, STP) above the management infrastructure, HA infrastructure, hardware drivers, netstack and kernel; data plane unaffected while BGP is restarted]

Software Patching: CLI procedure
[Diagram: SMU lifecycle on an N7K. The SMU is copied from the SMU repository to the device, then "install add", "install activate" (SMU applied), "install commit" (SMU committed), "install deactivate" followed by "install commit", and "install remove" (SMU removed). The state can be checked with "show install active", "show install committed", "show install inactive" and "show install packages".]

Dual Supervisor Standard ISSU
N7K# install all kickstart bootdisk:7.2-kickstart system bootdisk:7.3-system
1. Upgrade the standby supervisor
2. Reload the standby supervisor
3. Perform SSO
4. Upgrade the new standby supervisor
5. Reload the standby supervisor
6. Upgrade LCs and FEX in series*
[Diagram: Sup 1 and Sup 2 swapping Active/Standby roles while moving from release 7.2 to release 7.3]
Fixed Switch (ToR) Standard ISSU
• The control plane is inactive during reload while the data plane keeps forwarding
• ISSU is non-disruptive for L2 services
• STP-enabled switches cannot be present downstream
• ISSU is disruptive for L3 services
[Diagram: supervisor upgraded from version 7.2 to 7.3: reload supervisor, load new version, restore control plane and configuration, reconcile with data plane]

Fixed Switch (ToR) Enhanced ISSU (N3K/N9K)
# install all nxos v2.bin
• Container A runs NX-OS (V1) on the host OS (Linux)
• Container B is spawned to boot up with NX-OS (V2)
• Container B becomes active
• Container A is destroyed
• NX-OS is upgraded to V2 with ~3-5 seconds impact to control plane traffic

In-Service Software Upgrade (For Your Reference)
ISSU type, NX-OS switch and traffic loss:
• Standard ISSU, dual-supervisor modular switch (N9500, N7700, N7000): control plane <3-5 sec; data plane 0/no service disruption
• Standard ISSU, fixed switch (N9300, N3000, N5500, N5600, N6000): control plane <120 sec; data plane 0/no service disruption
• Enhanced ISSU, fixed switch (N9300, N3000): control plane <3-5 sec; data plane 0/no service disruption

NX-OS ISSU Best Practices
• For Layer 2 and Layer 3 protocols with sensitive timers, increase the timeout value; otherwise the upgrade will be disruptive
• Best practices for vPC ToR:
  - Make sure that both vPC peers are in the same mode (traditional ISSU mode or enhanced ISSU mode)
  - Connect hosts using a port-channel to a pair of vPC ToR switches
  - If the ToR vPC is the STP root bridge: enable peer-switch to avoid an STP root change during ISSU
  - If the ToR vPC is not the STP root bridge: enable all ports as edge/edge trunk ports
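The ISSU best practice about sensitive timers can be checked with simple arithmetic: any protocol whose hold or dead timer is shorter than the control-plane outage window from the ISSU table will time out during the upgrade. The sketch below uses the upper bounds from the table; the function name, the dictionary keys and the sample timer values are illustrative assumptions, and the comparison deliberately ignores graceful restart, which may keep neighbors up for longer.

```python
from typing import Dict

# Upper bounds of the control-plane outage from the ISSU table, in seconds.
CONTROL_PLANE_OUTAGE_S = {
    "standard-modular": 5,   # dual-supervisor modular switch: <3-5 sec
    "standard-fixed": 120,   # fixed switch, standard ISSU: <120 sec
    "enhanced-fixed": 5,     # fixed switch, enhanced ISSU: <3-5 sec
}


def timers_at_risk(issu_type: str, hold_timers_s: Dict[str, int]) -> Dict[str, int]:
    """Return the protocols whose hold timer may expire during the upgrade."""
    outage = CONTROL_PLANE_OUTAGE_S[issu_type]
    return {proto: t for proto, t in hold_timers_s.items() if t <= outage}


if __name__ == "__main__":
    # Illustrative timer values only; use the timers configured in your network.
    timers = {"ospf_dead": 40, "bgp_hold": 180, "lacp_fast_timeout": 3}
    for issu in CONTROL_PLANE_OUTAGE_S:
        print(issu, timers_at_risk(issu, timers))
```

With these sample values, a standard ISSU on a fixed switch flags both the 40-second and the 3-second timer, which is the situation in which the slide advises raising the timeout before the upgrade.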
Graceful Insertion and Removal for NXOS
Isolation of Switch from network
• Change window begins.
• One command: system mode maintenance
• Pre-change System Snapshot

Graceful Insertion and Removal for NXOS
Return of Switch into network
• Change window complete.
• One command: no system mode maintenance
• Post-change System Snapshot

Configuration Profiles
• Maintenance-mode profile is applied when entering GIR mode.
• Normal-mode profile is applied when GIR mode is exited.
Automatic Profiles
• Generated by default
• Parses configuration to determine changes going into and out of GIR
• Changes based on base protocol configuration settings
• Use: maintenance windows
Custom Profiles
• User-created profile for maintenance-mode and normal-mode
• Flexible selection of protocols for isolation
• Use: maintenance windows and isolation during troubleshooting

Graceful Insertion and Removal Feature
Graceful Removal with Isolate command
• New CLI 'isolate' in all Unicast Protocols
• Make Nexus undesirable for all transit traffic
• Maintain Protocol Adjacencies
• Send route withdrawals/worse metrics
• Local route states are maintained.
• Multicast follows Unicast for RPF
• Feature available: N5K/6K: 7.3(0)N1(1); N7K: 7.3(0)D1(1); N9K/N3K: 7.0(3)I2(1)

//Sends route withdrawals
router bgp 33
  isolate
//Poisons the routes by sending highest metric
router eigrp 1
  isolate
//Advertises max-metric router-lsa
router ospf 1
  isolate
//Refreshes LSPs with overload-bit on
router isis 1
  isolate
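A custom profile is essentially the same per-protocol `isolate` stanza repeated for each routing instance, with `no isolate` as its mirror image on the way back. The sketch below generates both bodies as text. The helper name `gir_profile_body` and the instance list are hypothetical, and the output covers only the protocol stanzas shown on the slide, not the commands that wrap a profile on a real switch.

```python
def gir_profile_body(protocols, entering=True):
    """Build per-protocol stanzas for a maintenance-mode (entering=True)
    or normal-mode (entering=False) profile body."""
    action = "isolate" if entering else "no isolate"
    lines = []
    for protocol, instance in protocols:
        lines.append(f"router {protocol} {instance}")
        lines.append(f"  {action}")
    return "\n".join(lines)


# Routing instances taken from the isolate examples; adjust to the device.
protocols = [("bgp", "33"), ("eigrp", "1"), ("ospf", "1"), ("isis", "1")]

print(gir_profile_body(protocols, entering=True))
print(gir_profile_body(protocols, entering=False))
```

Generating both bodies from one list keeps the two profiles symmetric, so a protocol isolated on entry is always restored on exit.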
GIR – Platform Specifics (For Your Reference)

Nexus 5K/6K
  Shutdown: Only support shutdown mode from 7.1(0)N1(1)
    Supported features: BGP/BGPv6, EIGRP/EIGRPv6, ISIS/ISISv6, OSPF/OSPFv3, RIP, FabricPath (spine switch), vPC/vPC+, Interfaces
  Isolate: Default mode is isolate from 7.3(0)N1(1), shutdown is optional mode
    Supported features: BGP/BGPv6, EIGRP/EIGRPv6, ISIS/ISISv6, OSPF/OSPFv3, RIP, FabricPath (spine switch), vPC/vPC+ (shutdown only), Interfaces (shutdown only)

Nexus 7K
  Shutdown: Only support shutdown mode from 7.2(0)D1(1)
    Supported features: BGP/BGPv6, EIGRP/EIGRPv6, ISIS/ISISv6, OSPF/OSPFv3, RIP, FabricPath (spine switch), vPC/vPC+, Interfaces
  Isolate: Default mode is isolate from 7.3(0)D1(1), shutdown is optional mode
    Supported features: BGP/BGPv6, EIGRP/EIGRPv6, ISIS/ISISv6, OSPF/OSPFv3, RIP, FabricPath (spine switch), vPC/vPC+ (shutdown only), Interfaces (shutdown only)

Nexus 9K/3K
  Shutdown and Isolate: Default mode is isolate from 7.0(3)I2(1), shutdown is optional mode
    Supported features: BGP/BGPv6, EIGRP/EIGRPv6, ISIS/ISISv6, OSPF/OSPFv3, PIM (on vPC), RIP, vPC (shutdown only), Interfaces (shutdown only)

Putting it all Together
What to use? GIR Mode? Patching? ISSU? All of them?

                              Situation
Option                        Critical Bug Fix & PSIRT   Hardware Upgrade   New Features
ISSU                          ✓                          X                  ✓
GIR + Cold Boot               ✓                          X                  ✓
GIR + Disruptive Installer    ✓                          X                  ✓
SMU Restart                   ✓                          X                  X
GIR + SMU Reload              ✓                          X                  X
GIR                           X                          ✓                  X

Agenda
• Enterprise Data Center High Availability (DC HA)
  • DC Switch NX-OS HA Architecture and HA Features
  • DC Network HA Design and Operational Best Practices
    • Legacy DC with vPC
    • Programmable Fabric
    • Application Centric Infrastructure (ACI)
    • Programmable Network
  • Key Takeaways
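The "Putting it all Together" matrix above works as a lookup table: given a situation, it returns the options marked with a check. A minimal sketch follows. The dictionary keys such as `bug_fix` are shorthand invented here, while the true/false values are copied from the slide.

```python
# Rows and columns copied from the "Putting it all Together" matrix.
OPTIONS = {
    "ISSU":                       {"bug_fix": True,  "hardware_upgrade": False, "new_features": True},
    "GIR + Cold Boot":            {"bug_fix": True,  "hardware_upgrade": False, "new_features": True},
    "GIR + Disruptive Installer": {"bug_fix": True,  "hardware_upgrade": False, "new_features": True},
    "SMU Restart":                {"bug_fix": True,  "hardware_upgrade": False, "new_features": False},
    "GIR + SMU Reload":           {"bug_fix": True,  "hardware_upgrade": False, "new_features": False},
    "GIR":                        {"bug_fix": False, "hardware_upgrade": True,  "new_features": False},
}


def options_for(situation):
    """List the maintenance options the matrix marks as suitable."""
    return [name for name, fits in OPTIONS.items() if fits[situation]]


print(options_for("hardware_upgrade"))
print(options_for("new_features"))
```

In this matrix, GIR alone is the only row checked for a hardware upgrade, and SMU-based options never deliver new features.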
Cisco Public 411 Data Center Fabric Technology Evolution VXLAN EVPN VXLAN F&L FabricPath vPC STP 2015-2019 2014 2010 2009 2008 and before ACI TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 412 Cisco Data Center Network Solutions Classic Ethernet & VPC Programmable Fabric Application Centric Infrastructure DB Programmable Network DB Web TECCRS-2001 Web App Web App © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 413 High Availability Design Principle Structure, Modularity, and Hierarchy • Structured design • • Modular design • • Allows you to manage and understand traffic flows, and network failure behavior Allows for easier evolution and change to the network Hierarchical design • Provides for improved scalability • Separates network services into manageable building blocks TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 414 High Availability Design Principle Structure, Modularity, and Hierarchy • Optimize the interaction of the physical redundancy with the network protocols • • Provide the necessary amount of redundancy • Pick the right protocol for the requirement • Optimize the tuning of the protocol Optimize network convergence failure detection and recovery • Optimal high availability network design attempts to leverage ‘local’ switch fault detection and recovery • Design should leverage the hardware capabilities of the switches to detect and recover traffic flows based on these ‘local’ events TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 415 Agenda • Enterprise Data Center High Availability (DC HA) • DC Switch NX-OS HA Architecture and HA Features • DC Network HA Design and Operational Best Practices Legacy DC with vPC Programmable Application Centric Infrastructure (ACI) Programmable • Fabric Network Key Takeaways TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. 
Cisco Public 416 Cisco Data Center Network Solutions Classic Ethernet & VPC Programmable Fabric Application Centric Infrastructure DB Programmable Network DB Web TECCRS-2001 Web App Web App © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 417 vPC Feature Overview vPC Terminology Layer 3 Cloud vPC Peer vPC Peer Keepalive Link vPC Domain P S Peer Orphan Port S1 Link CFS S2 vPC Orphan Device S3 vPC is supported on Cisco Nexus switches (N5k, N6k, N7k, N9k, N3k) TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 418 vPC Failure Scenario vPC Peer-Keepalive Link up & vPC Peer-Link down • vPC peer-link failure (link loss): P • vPC peer-keepalive up • Status of other vPC peer known • Secondary vPC peer disables all vPC’s • Traffic forwarded by vPC primary vPC Peer-keepalive S S2 S1 vPC_PLink Suspend secondary vPC Member Ports vPC1 vPC2 SW4 SW3 Keepalive Heartbeat TECCRS-2001 P Primary vPC S Secondary vPC © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 419 Legacy DC HA Design with vPC • Core Layer • Layer 3 ECMP for multipath redundancy Core Core S1 S2 S3 Aggregation S4 S5 Access Access S6 TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 420 Legacy DC HA Design with vPC • Aggregation Layer • HSRP / VRRP/ GLBP with vPC for active/active gateway • Use default FHRP timers Core Core S1 S2 S3 Aggregation S4 S5 Access Access S6 TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 421 Legacy DC HA Design with vPC • Access Layer • Connect to a pair of Aggregation switch via Layer 2 port-channel • Redundant uplinks Core Core S1 S2 S3 Aggregation S4 S5 Access Access S6 TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. 
Cisco Public 422 Legacy DC HA Design with vPC • Core Access Layer • Double-sided vPC connecting to Aggregation layer • Higher resilience • Different vPC domain ID vPC Domain 10 vPC Domain 20 TECCRS-2001 Aggregation Access © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 423 VSS vs vPC Catalyst Non VSS Nexus VSS Non vPC vPC Merge Data Plane only!! Control Plane still separate Merge Data and Control Plane TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 424 VSS Design vs vPC Design Nexus vPC Catalyst VSS Don’t design VSS and vPC in same way for Layer 3! TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 425 VSS Design vs vPC Design Nexus vPC Catalyst VSS vPC Layer 3 routed uplink with ECMP TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 426 vPC – Layer 2 Data Center Interconnect (DCI) DC 1 DC 2 Long Distance Dark Fiber F E F - E CORE CORE vPC domain 11 vPC domain 21 - N N N N N Network port E Edge or portfast - Normal port type B BPDUguard F BPDUfilter R Rootguard 802.1AE (Optional) - R N N - R R F E R R Layer 2 vPC Portchannel - N N - vPC domain 10 vPC domain 20 R R - - E E B B Server Cluster ACCESS ACCESS E F AGGR AGGR - R Server Cluster TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. 
vPC Best Practices (For Your Reference)
vPC General Deployment Best Practices
• Unique vPC domain IDs in the contiguous Layer 2 domain
• Enable vPC peer-gateway, to act as the active gateway for packets addressed to the router MAC of the vPC peer
  • Keeps forwarding of traffic local to the vPC node and avoids use of the peer-link
• Enable vPC peer-switch, so that the vPC peer switches act as a single logical entity
  • Optimized BPDU processing
• Enable auto-recovery to address two cases of single switch behavior
  • Peer-link fails and after a while primary switch fails
  • Both vPC peers are reloaded and only one comes back up
• Enable vPC orphan-ports suspend to prevent orphan device traffic blackhole during peer-link failure

vPC Configuration Best Practices
Enable Object-tracking
• vPC object tracking tracks both the peer-link and the uplinks in a Boolean OR list
• Object tracking is triggered when the track object goes down
• Suspends the vPCs on the impaired device
• Traffic is forwarded over the remaining vPC peer

! Track the vpc peer link and uplinks
track 1 interface port-channel11 line-protocol
track 2 interface Ethernet1/1 line-protocol
track 3 interface Ethernet1/2 line-protocol
! Combine all tracked objects into one.
! “OR” means if ALL objects are down, this object will go down
track 10 list boolean OR
  object 1
  object 2
  object 3
! If object 10 goes down on the primary vPC peer,
! system will switch over to other vPC peer and disable all local vPCs
vpc domain 1
  track 10

vPC Hitless Role Change Feature
Without vPC Hitless Role Change
• Traffic interruption
• Manually flap vPC peer link
• Not graceful
With vPC Hitless Role Change
• No traffic interruption
• CLI: “vpc role preempt”
• Graceful
Note: supported from N7k: 7.3(0)D1(1), N9k/3k: 7.0(3)I7(1). Not supported on N5k/6k.

vPC Shutdown Feature
• Isolates a switch from the vPC complex
• Isolated switch can be debugged, reloaded, or even removed physically, without affecting the vPC traffic going through the non-isolated switch

switch# configure terminal
switch(config)# vpc domain 100
switch(config-vpc)# shutdown

Note: supported from N7k: 7.3(0)D1(1), N9k: 7.0(3)I2(2), N5k/6k: 6.0(2)N2(1)

Graceful Insertion and Removal Example
FHRP with vPC Switch Isolation using GIR
• Use automatic profile to go into GIR.
• L3: isolate unicast routing protocol
• L2: vPC shutdown

//Enter maintenance mode using the system mode maintenance command:
switch# configure terminal
switch(config)# system mode maintenance
Following configuration will be applied:
router ospf 100
  isolate
vpc domain 2
  shutdown
Do you want to continue (y/n)? [no] y

Legacy DC HA Design with vPC Key Takeaways
• To minimize Legacy DC down time:
  • Follow vPC design best practices
  • Follow vPC configuration best practices
  • Follow vPC operation best practices

vPC References (For Your Reference)
• vPC Design and Configuration Best Practices: http://www.cisco.com/c/dam/en/us/td/docs/switches/datacenter/sw/design/vpc_design/vpc_best_practices_design_guide.pdf
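The object-tracking configuration earlier in this section (`track 10 list boolean OR`) is easy to misread, so here is a small truth-table sketch of it. It models only the logic the slide comments describe: the list object stays up while any member is up, and the local vPCs are suspended once it goes down. The function names are hypothetical and this is not how NX-OS implements tracking internally.

```python
def track_list_or(member_states):
    """A boolean-OR track list is up while at least one member object is up."""
    return any(member_states)


def suspend_local_vpcs(peer_link_up, uplink1_up, uplink2_up):
    """Return True when track 10 is down, i.e. every tracked object is down."""
    track10_up = track_list_or([peer_link_up, uplink1_up, uplink2_up])
    return not track10_up


# One uplink lost: the device can still forward, vPCs stay up.
print(suspend_local_vpcs(peer_link_up=True, uplink1_up=False, uplink2_up=True))
# Peer-link and both uplinks lost: the device is impaired, vPCs are suspended.
print(suspend_local_vpcs(peer_link_up=False, uplink1_up=False, uplink2_up=False))
```

The OR list therefore triggers only on total isolation of the switch, not on a single link failure.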
Cisco Public 442 Agenda • Enterprise Data Center High Availability (DC HA) • DC Switch NX-OS HA Architecture and HA Features • DC Network HA Design and Operational Best Practices • • Legacy DC with vPC • Programmable Fabric • Application Centric Infrastructure (ACI) • Programmable Network Key Takeaways TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 443 Cisco Data Center Network Solutions Classic Ethernet & VPC Programmable Fabric Application Centric Infrastructure DB Programmable Network DB Web Web App Web App • Standards-based • VXLAN BGP EVPN • Forwarding & Multi-Tenancy • Disaggregated Management • Open NX-OS TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 444 Programmable Fabric Underlay HA Design • Structured design with Spine, Leaf and Border Leaf External Layer-3 Network • Allows you to manage traffic flow, network failure VTEP VTEP Border Leaf • Layer 3 IP fabric with point-to- point link: Spine Spine Spine Spine Spine • Better stability, faster convergence • Redundant links with ECMP • Scale out spine leaf design VTEP VTEP VTEP VTEP VTEP VTEP VTEP Leaf • Better scalability and availability Pod 1 TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 445 Programmable Fabric Underlay HA Design • Structured design with Border Spine and Leaf • Layer 3 IP fabric with point-to- point link: External Layer-3 Network • Better stability, faster convergence • Redundant links with ECMP Spine Spine Spine Spine Border Spine • Scale out spine leaf design • Better scalability and availability VTEP VTEP VTEP VTEP VTEP VTEP VTEP Leaf Pod 1 TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. 
Cisco Public 446 Programmable Fabric Overlay HA Design • VXLAN EVPN based overlay • Same “Anycast” SVI IP/MAC is External Layer-3 Network enabled at all VTEPs/ToRs VTEP • Better availability, no IP gateway relearning VTEP Border Leaf • Optimal traffic forwarding, no hairpinning to GW Spine Spine Spine Spine Overlay Spine • Enable host mobility VTEP VTEP VTEP VTEP VTEP VTEP VTEP Leaf SVI IP Address MAC: 0000.1111.2222 IP: 10.1.1.1 TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 447 Programmable Fabric Host HA Design • Host connects to a pair of vPC External Layer-3 Network leaf VTEP directly (recommended) VTEP VTEP Border Leaf • Host connects to a pair of vPC leaf VTEP via FEX (not recommended) • Redundant host uplinks Spine VTEP VTEP Spine Spine Spine Overlay VTEP VTEP Spine VTEP VTEP VTEP Leaf TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 448 vPC Leaf VTEP Best Practices for HA • vPC leaf VTEP best practices • Enable peer-gateway • Enable peer switch • Enable IP ARP Sync • Use separate loopback address for VTEP source address Control plane and data plane separation Loopback0 is for underlay and overlay routing Loopback1 with secondary IP is for VTEP source data plane vpc domain 100 peer-switch peer-keepalive destination 172.32.1.13 source 172.32.1.14 delay restore 150 peer-gateway ip arp synchronize ipv6 nd synchronize interface nve1 host-reachability protocol bgp source-interface loopback1 source-interface hold-down-time 180 interface loopback0 ip address 10.10.0.2/32 ip router ospf UNDERLAY area 0.0.0.0 ip pim sparse-mode interface loopback1 ip address 10.20.0.3/32 ip address 10.20.0.1/32 secondary ip router ospf UNDERLAY area 0.0.0.0 ip pim sparse-mode TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. 
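The `delay restore 150` and `source-interface hold-down-time 180` values in the configuration above are ordered deliberately: host-facing vPC legs should come up only after routing has converged, and the anycast VTEP address should be advertised only after the vPC legs are back. The sketch below is a simplified timeline model under the assumption that all three timers start from the same instant on the recovering leaf. `adjacency_ready_s` is a hypothetical measured value, not a configurable command.

```python
def blackhole_windows(adjacency_ready_s, delay_restore_s, hold_down_s):
    """Estimate two black-hole windows (in seconds) on a recovering vPC leaf.

    1. Host legs come up (delay restore) before routing adjacencies are ready.
    2. The anycast VTEP address is advertised (hold-down expiry) before the
       vPC legs are restored.
    """
    host_before_routing = max(0, adjacency_ready_s - delay_restore_s)
    vtep_before_vpc = max(0, delay_restore_s - hold_down_s)
    return host_before_routing, vtep_before_vpc


# Values from the sample configuration, with an assumed 90 s routing convergence.
print(blackhole_windows(adjacency_ready_s=90, delay_restore_s=150, hold_down_s=180))
# A delay restore shorter than convergence opens the first window.
print(blackhole_windows(adjacency_ready_s=90, delay_restore_s=30, hold_down_s=180))
```

Under this model both windows close when convergence time is at most the delay restore, which in turn is at most the hold-down time.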
vPC Delay Restore and Source Hold Down Timer
Control plane adjacencies not fully established
• Host connection toward the recovering Leaf 2 is brought up before the ToR can successfully establish routing adjacencies with the fabric and the peer vPC leaf node, so traffic is temporarily black-holed
• Tuning delay-restore is required regardless of the SW release
vPC peer-link connection or host-to-leaf connection not recovered yet
• If the advertisement of the Anycast VTEP address happens before the vPC peer-link and vPC leg connection to the host are recovered, traffic will be black-holed as well
• A “source-interface hold-down-time” is natively provided to keep the VTEP address (Loopback1) down for 180 sec (default), supported from 7.0(3)I2(2)

vPC Leaf VTEP HA Best Practices
• vPC leaf VTEP best practices
  o Enable a Layer 3 link between the two vPC VTEPs to connect them in the underlay network so that when one VTEP loses all its uplinks, it can still learn the routes through its vPC peer, and forward the traffic via its peer
  o The Layer 3 link can be a dedicated link or a point-to-point VLAN SVI over the vPC peer-link
(Diagram: vPC VTEP-1 and vPC VTEP-2 attached to an underlay network with IP ECMP load sharing; vPC port-channel toward the host.)

Programmable Fabric External Routing HA Design
The two Border Leaf VTEPs are independent of each other.
Spine RR They each individually exchange routes with the external routing devices, and advertise the external routes into the EVPN fabric RR VXLAN Overlay EVPN MP-BGP Leaf VTEP VTEP Anycast Gateway Anycast Gateway VTEP Border Leaf VTEP VTEP VTEP Anycast Gateway Anycast Gateway Routing Protocol of Choice Distributed Anycast Gateway on the internal VTEPs Leafs IP Routing BGP multi-pathing needs to be enabled on the internal VTEP to leverage both border leaf VTEPs TECCRS-2001 Global Default VRF Instance or User Space VRF Instances © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 454 Programmable Fabric External Routing HA Design IP Routing Distributed Anycast Gateway on the internal VTEPs Leafs BGP multi-pathing needs to be enabled on the internal VTEP to leverage both border leaf VTEPs The two Border Spines are both VTEPs. Routing Protocol of Choice Spine VTEP VTEP RR RR They each individually exchange routes with the external routing devices, and advertise the external routes into the EVPN fabric. VXLAN Overlay EVPN MP-BGP Leaf VTEP Anycast Gateway VTEP Anycast Gateway VTEP VTEP Anycast Gateway Anycast Gateway TECCRS-2001 VTEP Anycast Gateway VTEP Anycast Gateway © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 455 Programmable Fabric Multi-X Connectivity (DCI) VXLAN Multi-Site 2017+ Fabric #1Domain 1 EVPN Control-Plane BGP EVPN Fabric #2Domain 2 EVPN Control-Plane Overlay VTEP Bare metal VTEP Overlay VTEP VTEP VTEP VTEP Bare metal Data-Plane Domain 1 DCI Data-Plane VTEP VTEP Bare metal Bare metal Data-Plane Domain 2 Multiple Fabrics with Integrated DCI (DCI2) Hierarchical design at both overlay and underlay: Better Scale and Failure Domain Isolation between Fabrics Recommended DCI Architecture Going Forward!!! TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. 
Cisco Public 456 VXLAN Multi-Site Main Use Cases Scale-Up Model to Build a Large Intra-DC Network Network Extension across Multiple Sites Integration with Legacy Networks (Coexistence and/or Migration) TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 457 VXLAN Multi-Site Border Gateways HA Design Anycast Border Gateways Possible BGWs deployment models: • Anycast Border Gateways (supported since day 1) and recommended for interconnecting VXLAN EVPN fabrics • VPC Border Gateways (supported since 9.2(1)) Border Gateways used for Layer 2 and Layer 3 Site-to-Site communication(East-West traffic) Border Gateway are often deployed also as Border Leaf nodes for Site to External Layer 3 communication (North-South traffic) BGW BGW BGW BGW VTEP VTEP VTEP VTEP Site 1 VPC Border Gateways BGW BGW VTEP VTEP Site 1 TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 458 VXLAN Multi-Site VPC Border Gateways DCI Use Cases Migration/Coexistence of a legacy site with one (or more) new VXLAN EVPN fabrics VTEP VTEP VTEP VTEP BGW BGW BGW BGW Spine VTEP VTEP Spine VTEP Spine VTEP Spine VTEP Greenfield Site VTEP VTEP Legacy Site VTEP VTEP VTEP VTEP BGW BGW BGW BGW Replacing ‘legacy’ DCI solutions (vPC, OTV, VPLS, etc..) Legacy Site 1 Legacy Site 2 © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public Site-Internal Site-External VXLAN Multi-Site Failure Detection on Anycast BGWs – Fabric Isolation The Site-Internal interfaces on BGW nodes are constantly tracked to determine their status (‘evpn multisite fabric-tracking’ command) If all the Site-Internal interfaces are detected as down: Multi-Site VIP 10.111.111.1 BGW BGW BGW BGW VTEP VTEP VTEP VTEP PIP-BGW2 10.200.200.22 PIP-BGW3 10.200.200.23 PIP-BGW4 10.200.200.24 Spine Spine 1. 2. 
The isolated BGW stops advertising PIP/VIP addresses toward the Site-External network The remaining BGWs perform new DF elections for the L2VNIs owned by the isolated BGW As a result, the BGW becomes isolated from both the Site-Internal and Site-External networks Seamless BGW node reinsertion using a “delayrestore” timer for the VIP address Site 1 TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 460 VXLAN Multi-Site Failure Detection on Anycast BGWs – DCI Isolation DC Core Site-Internal Site-External (Layer-3 Unicast) BGW BGW BGW BGW VTEP VTEP VTEP VTEP PIP-BGW1 10.200.200.21 PIP-BGW2 10.200.200.22 PIP-BGW3 10.200.200.23 PIP-BGW4 10.200.200.24 The Site-External interfaces on BGW nodes are also tracked to determine their status (‘evpn multisite dci-tracking’ command) If all the Site-External interfaces are detected as down, the isolated BGW node: 1. 2. Multi-Site VIP 10.111.111.1 3. Stops advertising VIP VTEP address toward the Site-Internal network Withdraws BGP EVPN Type-4 advertisements (triggering a new DF election between other BGWs) Starts functioning as a regular VTEP (PIP still up) As a result, the BGW continues to operate as a Site-Internal VTEP Seamless BGW node reinsertion using a “delayrestore” timer for the VIP address Site 1 TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 461 Graceful Insertion and Removal Example VXLAN BGP EVPN Leaf vPC VTEP Isolation with GIR • Use automatic profile to go into GIR. S10 //Enter maintenance mode using the system mode maintenance command: switch# configure terminal switch(config)# system mode maintenance Following configuration will be applied: ip pim isolate router bgp 1 isolate router ospf UNDERLAY isolate vpc domain 1000 shutdown S20 VXLAN BGP EVPN Do you want to continue (yes/no)? [no] y System mode operation completed successfully switch(config)# end Host1 TECCRS-2001 Host2 © 2019 Cisco and/or its affiliates. All rights reserved. 
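The Anycast BGW failure-detection rules described earlier for fabric isolation and DCI isolation reduce to a small decision function. The sketch below encodes only what the slides state. The function name and dictionary keys are hypothetical labels, and real BGW behavior depends on the tracking commands being configured on the relevant interfaces.

```python
def bgw_behavior(site_internal_up, site_external_up):
    """Model a border gateway's reaction to tracked interface counts."""
    if site_internal_up == 0:
        # Fabric isolation: stop advertising PIP/VIP toward the Site-External
        # network; the remaining BGWs run new DF elections.
        return {"role": "isolated", "advertise_vip": False,
                "advertise_pip": False, "df_reelection": True}
    if site_external_up == 0:
        # DCI isolation: stop advertising the VIP toward the Site-Internal
        # network, withdraw Type-4 routes, keep the PIP up.
        return {"role": "regular VTEP", "advertise_vip": False,
                "advertise_pip": True, "df_reelection": True}
    return {"role": "border gateway", "advertise_vip": True,
            "advertise_pip": True, "df_reelection": False}


print(bgw_behavior(site_internal_up=0, site_external_up=2)["role"])
print(bgw_behavior(site_internal_up=2, site_external_up=0)["role"])
print(bgw_behavior(site_internal_up=2, site_external_up=1)["role"])
```

Both isolation cases trigger a new DF election among the remaining BGWs, but only DCI isolation leaves the node usable as a Site-Internal VTEP.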
Graceful Insertion and Removal Example
VXLAN BGP EVPN Leaf vPC VTEP Isolation with GIR
• Use automatic profile to come out of GIR.

//Exit maintenance mode using the no system mode maintenance command:
switch# configure terminal
switch(config)# no system mode maintenance
Following configuration will be applied:
vpc domain 1000
  no shutdown
router ospf UNDERLAY
  no isolate
router bgp 1
  no isolate
no ip pim isolate
Do you want to continue (yes/no)? [no] y
System mode operation completed successfully
switch(config)# end

Graceful Insertion and Removal Example
VXLAN BGP EVPN Spine RR Isolation with GIR
• Use automatic profile to go into GIR.

//Enter maintenance mode using the system mode maintenance command:
switch# configure terminal
switch(config)# system mode maintenance
Following configuration will be applied:
router bgp 100
  isolate
router ospf 1
  isolate
Do you want to continue (yes/no)? [no] y
System mode operation completed successfully
switch(config)# end

Programmable Fabric HA Takeaways
• Programmable fabric HA design
  • Spine leaf L3 IP fabric with ECMP
  • VXLAN EVPN fabric with Anycast GW
  • vPC for host
• Multiple DC fabric HA design
  • VXLAN Multi-Site
• Follow configuration and operational best practices to minimize down time for different failure scenarios
Cisco Public 467 Programmable Fabric Resources For Your Reference • VXLAN Network with MP-BGP EVPN Control Plane Design Guide • https://www.cisco.com/c/en/us/products/collateral/switches/nexus-9000-seriesswitches/guide-c07-734107.html • VXLAN EVPN Multi-Site Design and Deployment White Paper • https://www.cisco.com/c/en/us/products/collateral/switches/nexus-9000-seriesswitches/white-paper-c11-739942.html#_Toc498025653 • BRKDCN-3378 Building DataCenter Networks with VXLAN BGP EVPN • BRKDCN-2035 VXLAN BGP EVPN based Multi-Site TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 468 Agenda • Enterprise Data Center High Availability (DC HA) • DC Switch NX-OS HA Architecture and HA Features • DC Network HA Design and Operational Best Practices • • Legacy DC with vPC • Programmable Fabric • Application Centric Infrastructure (ACI) • Programmable Network Key Takeaways TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 469 Cisco Data Center Network Solutions Classic Ethernet & VPC Programmable Fabric Application Centric Infrastructure DB Programmable Network DB Web Web App Web App • VXLAN-based • Forwarding, Multi-Tenancy & Security • Integrated Controller with Enhanced APIs TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 470 ACI Fabric Underlay HA Design • Zero touch provision Application Policy Infrastructure Controller • Structured design • Layer 3 IP fabric with point-to- point link: ACI Fabric • Better stability, faster convergence • Redundant links with ECMP • Scale out spine leaf design • Better scalability and availability SVI IP Address MAC: 0000.1111.2222 IP: 10.1.1.1 TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. 
Cisco Public 471 ACI Fabric Overlay HA Design • eVXLAN EVPN based overlay Application Policy Infrastructure Controller • Same “Anycast” SVI IP/MAC is enabled at all VTEPs/ToRs • Better availability, no IP gateway relearning Overlay ACI Fabric • Optimal traffic forwarding, no hairpinning to GW • Enable host mobility SVI IP Address MAC: 0000.1111.2222 IP: 10.1.1.1 TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 472 ACI Fabric Host vPC HA Design ACI Spine Nodes ACI Fabric ACI Leaf Nodes Host vPC to ACI leaf nodes Host vPC to FEX Host2 TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 473 vPC in ACI Fabric • Differences between ACI vPC and standard vPC • No Peer Link is required ACI Fabric Services (ZMQ) vPC Anycast VTEP vPC Anycast VTEP VTEP VTEP • Peer communication/path recovery happens via the Fabric • CFS (Cisco Fabric Services) is replaced by IFS (ACI Fabric Services) which is based on Zero Message Queue (ZMQ) • Forwarding selection (which peer will forward a frame • Within the Fabric the vPC interfaces use an anycast VTEP which is active on both vPC peers Host or Switch TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public 474 ACI Port Tracking Policy for Uplink Failure Detection • The port tracking policy specifies • Number of uplink connections that trigger the policy • A delay timer for bringing the leaf switch access ports back up TECCRS-2001 © 2019 Cisco and/or its affiliates. All rights reserved. 
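The two port tracking parameters above, the uplink count that triggers the policy and the restore delay timer, combine into a simple rule for when a leaf's access ports should be up. The sketch below assumes the policy triggers when the number of active uplinks is at or below the configured count. That comparison, the function name and the sample values are assumptions for illustration and should be checked against the APIC documentation for the release in use.

```python
def access_ports_up(active_uplinks, trigger_count, secs_since_recovery, delay_s):
    """Decide whether leaf access ports should be up under port tracking.

    Assumption: the policy triggers when active_uplinks <= trigger_count,
    and ports return only after the delay timer has elapsed.
    """
    if active_uplinks <= trigger_count:
        return False  # leaf has lost its fabric uplinks; keep hosts off it
    return secs_since_recovery >= delay_s


# All uplinks down: access ports are brought down.
print(access_ports_up(active_uplinks=0, trigger_count=0, secs_since_recovery=0, delay_s=120))
# Uplinks back, delay timer still running: ports stay down.
print(access_ports_up(active_uplinks=2, trigger_count=0, secs_since_recovery=30, delay_s=120))
# Delay timer expired: ports come back up.
print(access_ports_up(active_uplinks=2, trigger_count=0, secs_since_recovery=120, delay_s=120))
```

The delay gives the leaf time to rebuild its fabric state before hosts start sending traffic to it again.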
ACI Fabric Convergence Improvement
• Convergence improvement from sub-second to 200 ms for ACI 3.1
  • With new Cloudscale ASIC N9Ks
• Failure scenarios with convergence improvement:
  • Fabric (between leaf and spine): link failure, Spine reload/upgrade, Spine linecard reload, Leaf reload/upgrade, power failure of Spine
  • Access link/node with vPC or port-channel
  • External (Border Leaf) connectivity (L3out): link failure, Border Leaf reload/upgrade
• Achieved by special ASIC capability and software design

ACI Fabric Convergence Improvement
• Failure scenarios not covered:
  • Double failure, L2/L3 multicast, copper links, process crashes on Leaf/Spine/Border Leaf, etc.
• Convergence for traffic from an EP to the ACI fabric depends on how fast the EP is able to divert traffic to an ACI Leaf
• Convergence for traffic from an external node to the ACI fabric depends on how fast the external node is able to divert traffic to ACI

Fabric Fast Convergence - Enable LBX
• It is a per-leaf configuration.
• Fabric ERSPAN can’t be enabled per uplink port with this feature enabled.

Access Link with vPC or PortChannel Fast Convergence - Debounce Policy Configuration
• Reduce the debounce timer from the default 100 ms to 10 ms for faster convergence, under Fabric Access Policy

ACI Fabric Fast Convergence Best Practices
• Always use vPC
• Distribute scale
  • 100 L3out per Leaf
  • 50 BD per Leaf
• Use static EPG instead of L2out
ACI Fabric Maintenance Mode
• New decommission option from ACI 3.0(1k)
  • Helps to isolate the switch in the ACI fabric while keeping management access to the switch
• Prior to ACI 3.0(1k): decommission options are Regular or Remove from controller
  • The switch reboots and is wiped of all configuration
• ACI 3.0(1k): Maintenance Mode (Debug mode) or decommission
  • With Maintenance mode, the switch is not in the active forwarding path
  • It can be accessed via the management port, and logs can be collected for debugging

Spine Node Maintenance Mode
Decommission
• IS-IS on the spine advertises routes with max metric
• OSPF, EIGRP and BGP do graceful shutdown on the IPN/GOLF link
• IPN ports are still up but the OSPF neighbor is down
• Traffic goes through different paths

Spine Node Insertion
Recommission
• Spine switch reboots and is wiped of all configuration
• After the switch comes up and is discovered by APIC, the policy is programmed on the switch
• After the switch configuration is done, the switch establishes IS-IS, OSPF and BGP peers. Then the switch will be in the active forwarding path. Max metric will be set for 10 mins during startup. Thus, internal traffic will be less preferred for 10 mins.

Leaf Node Maintenance Mode
Decommission
• IS-IS on the Leaf node advertises routes with max metric
• OSPF, EIGRP and BGP do graceful shutdown (graceful shutdown for L3out)
• vPC shuts down Keep-Alive & Peer Link
• Shuts down all front panel ports and directly connected IFC ports (cuts laser on the port)
• Traffic goes through different paths
Leaf Node Insertion
Recommission
• The switch reboots and is wiped of all configuration
• After the switch comes up and is discovered by the APIC, the policy is programmed on the switch
• After the switch configuration is done, the switch establishes IS-IS, OSPF and BGP peers. The switch is then in the active forwarding path. Max metric is set for 10 minutes during startup, so the switch is less preferred for internal traffic for those 10 minutes.
• There is a 2-minute delay before the vPC ports are brought up.

ACI Multi-Site
ACI 3.0 Release
Diagram: Site 1 (Availability Zone 'A') and Site 2 (Availability Zone 'B') connected over an Inter-Site Network with MP-BGP EVPN and VXLAN; Multi-Site Orchestrator with REST API and GUI.
• Separate ACI fabrics with independent APIC clusters
• MP-BGP EVPN control plane between sites
• No latency limitation between fabrics
• ACI Multi-Site Orchestrator pushes cross-fabric configuration to multiple APIC clusters, providing scoping of all configuration changes
• Data plane VXLAN encapsulation across sites
• End-to-end policy definition and enforcement

ACI Multi-Site Main Use Cases
• Scale-up model to build a large intra-DC network
• Data Center Interconnect (DCI)

ACI Policy Upgrade
• Ability to upgrade all switches and controllers in the fabric from one place, with a single click
• Requires uploading the new controller and switch images
• Then, create a firmware group
• Finally, create maintenance groups as needed to define which switches get upgraded at what time
• Controllers are upgraded through a different "Controller Firmware" policy
• Controller upgrades are kicked off at the same time (much like a single maintenance group) and proceed sequentially.
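One way to think about maintenance groups is as a partition of the fabric's switches into upgrade windows. The sketch below assumes a common operational practice that the slide does not state explicitly: the two members of each redundant pair (vPC leaf peers, or a pair of spines) go into different groups, so one member keeps forwarding while the other upgrades. The node names and the function are hypothetical, and the code models only the grouping decision, not the APIC firmware policy itself.

```python
def split_maintenance_groups(redundant_pairs):
    """Place the two members of each redundant pair in different groups."""
    group_a, group_b = [], []
    for first, second in redundant_pairs:
        group_a.append(first)
        group_b.append(second)
    return group_a, group_b


# Hypothetical fabric: two vPC leaf pairs and one pair of spines.
pairs = [
    ("leaf-101", "leaf-102"),
    ("leaf-103", "leaf-104"),
    ("spine-201", "spine-202"),
]

group_a, group_b = split_maintenance_groups(pairs)
print("Maintenance group A:", ", ".join(group_a))
print("Maintenance group B:", ", ".join(group_b))
```

Upgrading group A and group B in separate windows means no redundant pair loses both members at once.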
ACI Maintenance Group Logic

ACI HA Key Takeaways
• ACI is a turnkey solution for the Data Center fabric, with built-in HA and full automation
• ACI integrates all the best practices and lessons we learned from previous technologies

ACI Fabric Resources (For Your Reference)
• Cisco ACI Multi-Site Architecture White Paper
  • https://www.cisco.com/c/en/us/solutions/collateral/data-center-virtualization/application-centric-infrastructure/white-paper-c11-739609.html
• BRKACI-2125 ACI Multi-Site Architecture and Deployment
• BRKACI-3101 ACI Under the Hood - How Your Configuration is Deployed

Agenda
• Enterprise Data Center High Availability (DC HA)
• DC Switch NX-OS HA Architecture and HA Features
• DC Network HA Design and Operational Best Practices
  • Legacy DC with vPC
  • Programmable Fabric
  • Application Centric Infrastructure (ACI)
  • Programmable Network
• Key Takeaways

Cisco Data Center Network Solutions
• Classic Ethernet & vPC
• Programmable Fabric
• Application Centric Infrastructure
• Programmable Network
  • Open NX-OS
  • Enhanced APIs and Automation Ecosystem (DevOps)

Nexus Device Programmability
• Power on Auto Provisioning (PoAP)
• On-box Python Scripting
• NX-OS Software Development Kit (SDK)
• Configuration Management Tools
Cisco Nexus Power on Auto Provisioning (PoAP)
Components: Nexus Switch; DHCP Server; Script Server; License, Configuration and Software Server; Default Gateway.
1. Power-up phase: start the Power On Auto-Provisioning process
2. DHCP Discover phase: get IP address, gateway, script server, script file
3. Download the script file onto the switch and execute the script
4. Download configuration, license and software images onto the switch
5. Reboot if needed. The switch is up and running with the downloaded image and config

Deploy and Manage POAP Using DCNM

Deploy using POAP Script
• Download the POAP script from GitHub:
  • https://github.com/datacenter/nexus9000/blob/master/nx-os/poap/poap.py

Nexus 9000 Programmability
On-box Python
• A Python script can be run in interactive or non-interactive mode
• Store scripts in the bootflash:scripts directory of the switch

Interactive Mode

    switch# python
    Python 2.7.5 (default, Nov 5 2016, 04:39:52)
    >>> cli("conf ; interface loopback 1")
    ''
    >>> clip('where detail')
    mode:
    username: admin
    vdc: TSI-N9508-stand-alone
    routing-context vrf: default

Non-Interactive (script) Mode

    switch# dir bootflash:scripts
    946 Oct 30 14:50:36 2013 crc.py
    7009 Sep 19 10:38:39 2013 myScript.py
    22760 Oct 31 02:51:41 2012 poap.py
    switch# python bootflash:/scripts/crc.py
    or
    switch# source crc.py
    -----------------------------------------
    Started running CRC checker script
    finished running CRC checker script
    -----------------------------------------
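A non-interactive script stored under bootflash:scripts might look like the sketch below. It reuses the two commands from the interactive example. The `cli` import works only in the switch's on-box Python environment, so the script falls back to a stand-in that merely echoes the command when run elsewhere; the stand-in, the function names, and the file name loopback.py are illustrative assumptions.

```python
try:
    # Present only in the NX-OS on-box Python environment.
    from cli import cli as run_on_switch
except ImportError:
    def run_on_switch(command):
        # Off-box stand-in (assumption): echo the command instead of running it.
        return "DRY-RUN: " + command


def create_loopback(number, run=run_on_switch):
    """Create a loopback interface, as in the interactive example."""
    return run("conf ; interface loopback {}".format(int(number)))


def show_context(run=run_on_switch):
    """Return the output of 'where detail' (mode, username, vdc, vrf)."""
    return run("where detail")


if __name__ == "__main__":
    create_loopback(1)
    print(show_context())
```

On the switch, a file saved as bootflash:scripts/loopback.py would be started the same way as crc.py above, with `python bootflash:/scripts/loopback.py` or `source loopback.py`.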
Python Use Cases

"Off-Box" Python (Python on a Linux server, reaching the NX-OS device over SSH/NETCONF)
Scripts executed externally from the switch:
• configuration management automation
• telemetry / operational data
• controller use cases including APIC, POAP

"On-Box" Python (Python running on the NX-OS device)
Scripts executed locally on the switch:
• provisioning automation
• automating Embedded Event Manager
• application development

Auto Back-up Use Case
"On-Box" Python and EEM
• Cisco Nexus 9000 Python SDK User Guide: https://developer.cisco.com/docs/nx-os/#cisco-nexus-9000-series-python-sdk-user-guide-and-api-reference
• EEM on the Nexus 93xx triggers the on-box Python script
• The Python script creates a back-up file and sends it to a TFTP server

NX-OS SDK (Software Development Kit)
• NX-OS SDK enables on-box custom applications to access NX-OS native functionality
Diagram (Nexus 9K): Custom Applications (Python, C++, etc.) and existing 3rd-party Linux applications; Linux – Native Shell or Guest Shell; Linux networking stack; NX-OS (CLI, L2, L3, Interfaces, Platform, etc.)

Nexus Programmability
Configuration Management Tools

NX-OS Programmability Resources (For Your Reference)
• BRKACI-2025 Maximizing Network Programmability and Automation with Open NX-OS

Data Center HA Key Takeaways
High Availability Enterprise Data Center Design Key Principles
• Follow HA design and operational best practices to minimize network downtime
High Availability World Coverage
• Dana Daum, Communications Architect
• Maren Kostede, Technical Solutions Architect
• Junmei Zhang, Technical Marketing Eng.
• Samer Theodossy, Principal Engineer

Reconvergence Effect on “Mission-Critical”, Real-Time Operations
• First step on the Moon – July 20, 1969 … how it really happened …
  “OK, I’m going to step off the LEM now”
  (LEM = Lunar Excursion Module = the Lunar Lander)
  “That’s one small step for man”
  “One giant leap for mankind”
• And how it would have looked with … standard HSRP timers …
• And how it would have looked with … 3-second reconvergence …
• And how it would have looked with … 500-msec reconvergence …
Tuning Your Network Design and Reconvergence Can Be a “Giant Leap” for Your Network – and Your Application – Availability!

Published design guides
www.cisco.com/go/cvd
Complete your online session survey
• Please complete your Online Session Survey after each session
• Complete 4 Session Surveys & the Overall Conference Survey (available from Thursday) to receive your Cisco Live T-shirt
• All surveys can be completed via the Cisco Events Mobile App or the Communication Stations
Don’t forget: Cisco Live sessions will be available for viewing on demand after the event at ciscolive.cisco.com

Continue Your Education
• Demos in the Cisco Showcase
• Walk-in self-paced labs
• Meet the engineer 1:1 meetings
• Related sessions

Thank you