TECCRS-2001

Enterprise High Availability
Design and Architectures
Samer Theodossy, Principal Engineer
Dana Daum, Communications Architect
Maren Kostede, Technical Solutions Architect
Junmei Zhang, Technical Marketing Eng.
High Availability World Coverage
TECCRS-2001
© 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public
Agenda
• Designing High Availability Networks for the Enterprise
• System Hardware and Software Resiliency
• Foundations of the Structured Network Design
• High Availability Architectures:
  • Enterprise Wired LAN
  • Enterprise Wireless LAN
  • Enterprise Data Center
• High Availability System Recovery Analysis
Agenda Schedule & Logistics
08:30 - 10:30 Samer
10:30 - 10:45 Break
10:45 - 12:45 Dana
12:45 - 14:30 Lunch
14:30 - 16:30 Maren
16:30 - 16:45 Break
16:45 - 18:45 Junmei
Hurray We are done!!!
Slide icons used in this deck: "For Your Reference" and "Key Concept or Design Point"
We value your feedback: Don't forget to complete your online session evaluations
Cisco Webex Teams
Questions?
Use Cisco Webex Teams (formerly Cisco Spark)
to chat with the speaker after the session
How
1 Find this session in the Cisco Events Mobile App
2 Click “Join the Discussion”
3 Install Webex Teams or go directly to the team space
4 Enter messages/questions in the team space
cs.co/ciscolivebot#TECCRS-2001
[Figure: enterprise network overview. Headquarters: user access layers (access switches, access switch stack), distribution switches, core switches, wireless LAN controllers, and communications managers. Data center: Nexus switches, UCS rack-mount servers, UCS blade chassis, storage, firewalls, Cisco ACE, and WAAS Central Manager. Internet edge: Internet routers, firewall, RA-VPN, DMZ switch and DMZ servers, guest wireless LAN controller, web security appliance, and email security appliance. WAN aggregation: WAN routers and WAAS connected to MPLS WANs. Regional site and remote sites: WAN routers, WAAS, distribution and access switches, and a wireless LAN controller. Teleworker/mobile worker: hardware and software VPN over the Internet.]
Enterprise-Class Availability
Campus Systems Approach to High Availability
Ultimate Goal……………..100%
• System-level resiliency
• Network-level redundancy
• Enhanced management
• Human ear notices the difference in voice within 150–200 msec
  • Loss of 10 consecutive G.711 packets (about 200 msec of audio at the typical 20-msec packetization interval)
  • Video loss is even more noticeable
• 200-msec end-to-end campus convergence
Application tiers:
• Next-Generation Apps: Video Conf., Unified Messaging, Global Outsourcing, E-Business, Wireless Ubiquity
• Mission Critical Apps: Databases, Order-Entry, CRM, ERP
• Desktop Apps: E-mail, File and Print
APPLICATIONS DRIVE REQUIREMENTS FOR HIGH AVAILABILITY NETWORKING
Cisco HA Evolution
No Redundancy
• No redundant units
• Failure on Supervisor causes reload
• Line Cards reload on failure
• Outage: 10's of minutes
Redundancy with RPR
• Adding redundant units
• Failure on Active Sup causes switchover
• Standby Unit is in STANDBY_COLD state
• Line Cards reload after switchover
• Startup Configuration synchronized to peer
• Outage: several minutes
Redundancy with RPR+
• Adding redundant units
• Failure on Active Sup causes switchover
• Standby Unit is in STANDBY_WARM state
• Line Cards reload after switchover
• Startup Configuration synchronized to peer
• Running Configuration synchronized to peer and applied after switchover
• Outage: several seconds
Redundancy with SSO
• Adding redundant units
• Failure on Active Sup causes switchover
• Standby Unit is in STANDBY_HOT state
• Line Cards stay up after switchover
• Startup Configuration synchronized to peer
• Running Configuration synchronized to peer and applied
• Outage: order of milliseconds
Defining Levels of Availability*
• Continuous Availability (CA) = system is designed to operate 7 days a week, 24 hours a day, with resiliency and redundancy mechanisms to handle all unplanned or planned events
• Continuous Operations (CO) = system is designed to operate 7 days a week, 24 hours a day, with resiliency and redundancy mechanisms to handle both unplanned faults and planned maintenance events
• High Availability (HA) = system is designed to a specified service level with resiliency and redundancy mechanisms to handle unplanned faults
* References de facto industry terminology
Measure Availability End-End from the User Perspective
OSI layer | Measurement tools
Application, Presentation, Session | Custom application scripts, HTML, TCL, Python, many others
Transport, Network | ICMP Ping, IP Traceroute, Bidirectional Forwarding Detection, IP SLA
Data-Link | UDLD, STP, REP
Physical | Cable testers / power meters
* Layer 8 is not an official part of the OSI reference model
Measure and Analyze Every Event
Total Service Downtime (from the moment the actual fault starts until Up Time):
1. Failure Detection Time
2. Notification Time
3. Diagnosis Time
4. Dispatch Time
5. Arrival Time
6. Repair Time
Analyze and Automate
• Measure all previous points
  • Note each in trouble tickets
• Analyze trends
• Automation
  • Trouble ticketing
  • Technology/database
  • Electronic bonding
• Redundant network design and resiliency features
  • Required for very high availability
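The downtime breakdown above can be recorded per trouble ticket and summed. The following is a minimal sketch of that bookkeeping; the class name, field names, and the sample durations are illustrative assumptions, not values from the session.

```python
from dataclasses import dataclass, fields


@dataclass
class OutageTimeline:
    """Duration, in minutes, of each phase between fault start and restoration."""
    failure_detection: float
    notification: float
    diagnosis: float
    dispatch: float
    arrival: float
    repair: float

    def total_downtime(self) -> float:
        # Total service downtime is the sum of all six phases.
        return sum(getattr(self, f.name) for f in fields(self))

    def largest_phase(self) -> str:
        # The longest phase is the first candidate for improvement or automation.
        return max(fields(self), key=lambda f: getattr(self, f.name)).name


# Hypothetical trouble-ticket values, for illustration only.
ticket = OutageTimeline(
    failure_detection=2, notification=5, diagnosis=20,
    dispatch=10, arrival=45, repair=30,
)
print(ticket.total_downtime(), ticket.largest_phase())
```

Tracking each phase separately makes trend analysis possible: a network with fast repair but slow detection needs a different fix than one with the opposite profile.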
What to Automate?
• Device Provisioning (Day 0: Deployment Automation)
• Configuration (Day 1: Open Programmable Interfaces)
• Monitoring (Day 2: Telemetry)
Source: 2016 Cisco Study
Main Operational Challenges
• 95%: Network Changes Performed Manually
• 70%: Policy Violations Due to Human Error
• 75%: OpEx spent on Network Visibility and Troubleshooting
Source: 2016 Cisco Study
CANNOT Keep Pace with the Demands of Digital Business
High Availability Design Principles
Key Principles
• Enterprise network design architectures continue to evolve to meet business and technology needs, but the key principles of high availability network design still apply:
• Add redundancy and resiliency components as needed to meet the business requirements.
• Simplify network designs and configurations through virtualization techniques.
• Implement network-monitoring tools with automation where appropriate, and analyze all aspects of network outages for indications of where improvement is needed.
Agenda
• Designing High Availability Networks for the Enterprise
• System Hardware and Software Resiliency
  • Availability Modeling
  • Stateful Switchover, Non-Stop Forwarding, and Non-Stop Routing
  • Stackwise480 and Stackwise
  • In Service Software Upgrades
• Foundations of the Structured Network Design
• High Availability Architectures
• High Availability System Recovery Analysis
Why Use System and Network Availability Modeling?
• Planning and Engineering
• Architecture validation
• Design tradeoff analysis/decisions
• Request for Proposal (RFP)
• Service Level Agreement (SLA)
Option 1 $
Option 2 $$
Option 3 $$$
Predicted Availability Ratings Are Not Guarantees
• Predicted Availability ratings are not guarantees of network availability.
• Ratings are based on industry-standard methodologies and statistical analysis.
• Useful in making design decisions and comparing different options.
Predicted Availability Rating
Function of Mean Time Between Failure and Mean Time to Repair
Increase MTBF
Availability Equation
Decrease MTTR
Predicted Availability Equation (Basic)
Availability Equation
Availability = MTBF / (MTBF + MTTR)
MTBF = Mean Time Between Failure
MTTR = Mean Time To Repair
Predicted Availability Equations
Availability = MTBF / (MTBF + MTTR)
MTBF | MTTR | Availability | Annual downtime
74,116 hrs. | 24 hrs. (No Spare) | 0.999676 | 2 hrs 50.3 min. per year
74,116 hrs. | 4 hrs. (Spare Available) | 0.999946 | 28 min. per year
74,116 hrs. | .00833 (sub-second) (Redundancy!!!) | 0.999999 | .526 min. per year
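The first two rows of the table can be checked with a few lines of arithmetic. This is a sketch under one stated assumption: a year is taken as 365.25 days, which appears to match the downtime figures on the slide. The function names are illustrative.

```python
HOURS_PER_YEAR = 365.25 * 24  # assumption: average year length


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_minutes(avail: float) -> float:
    """Expected minutes of downtime per year for a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR * 60


for label, mttr in [("No spare", 24.0), ("Spare available", 4.0)]:
    a = availability(74_116, mttr)
    print(f"{label}: availability={a:.6f}, "
          f"downtime={annual_downtime_minutes(a):.1f} min/yr")
```

With MTBF fixed, the only lever in this equation is MTTR, which is why keeping a spare on site (or making the component redundant) moves the result so much.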
The Redundancy Effect
Single Points of Failure (Blocks in Series)
• Linecard: 99.999%, ~5 min/yr
• Supervisor: 99.999%, ~5 min/yr
• Availability = 99.998%
• Downtime = ~10 min/yr
Redundant Components (Blocks in Parallel)
• Supervisor Unit 1: 99.999%, ~5 min/yr
• Supervisor Unit 2: 99.999%, ~5 min/yr
• Availability = 99.999999%
• Downtime = ~0.0053 min/yr
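The series and parallel combinations can be sketched as follows. The formulas assume independent failures and, for the parallel case, an instantaneous and always-successful failover, so the parallel result is an idealized figure and need not equal the rounded value printed on the slide. Function names and the 365.25-day year are assumptions.

```python
from math import prod

MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumption: average year length


def series(*avails: float) -> float:
    """Blocks in series: every block must be up, so availabilities multiply."""
    return prod(avails)


def parallel(*avails: float) -> float:
    """Blocks in parallel: the system is down only if every block is down."""
    return 1.0 - prod(1.0 - a for a in avails)


def downtime_minutes(avail: float) -> float:
    return (1.0 - avail) * MINUTES_PER_YEAR


linecard = supervisor = 0.99999
in_series = series(linecard, supervisor)
redundant = parallel(supervisor, supervisor)
print(f"series:   {in_series:.6%}, {downtime_minutes(in_series):.2f} min/yr")
print(f"parallel: {redundant:.10%}, {downtime_minutes(redundant):.6f} min/yr")
```

Adding a block in series always lowers availability, while adding one in parallel always raises it, which is the core argument for removing single points of failure.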
Example of Predicted Availability Rating (No Redundancy)
Catalyst 2960XR-48TS-I
Part | MTBF (hours) | MTTR | Predicted Availability | Annual Downtime
Catalyst 2960XR-48TS-I | 438,130 hrs. | 4 hrs. | 99.99908704% | --
Power Supply | 1,000,000 hrs. | 4 hrs. | 99.99960000% | --
SFP-10GSR Uplink | 2,294,776 hrs. | 4 hrs. | 99.99982569% | --
System MTBF | 268,947 hrs. | | 99.99851274% | 7.82 min
All single points of failure combined in a series calculation
Chassis X Power Supply X Uplink = System MTBF
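The series calculation for this table can be reproduced with the sketch below. It assumes the usual constant-failure-rate model, in which the failure rates (1/MTBF) of series components add, and a 365.25-day year. Both are assumptions about how the slide's figures were derived, although the results agree with the table to the precision shown.

```python
from math import prod

MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumption: average year length
MTTR_HOURS = 4.0

# MTBF values (hours) from the Catalyst 2960XR-48TS-I table.
parts = {
    "chassis": 438_130,
    "power_supply": 1_000_000,
    "sfp_uplink": 2_294_776,
}


def series_mtbf(mtbfs) -> float:
    """Failure rates of series components add; system MTBF is the reciprocal."""
    return 1.0 / sum(1.0 / m for m in mtbfs)


def availability(mtbf: float, mttr: float) -> float:
    return mtbf / (mtbf + mttr)


system_mtbf = series_mtbf(parts.values())
system_avail = prod(availability(m, MTTR_HOURS) for m in parts.values())
system_downtime = (1.0 - system_avail) * MINUTES_PER_YEAR

print(f"System MTBF: {system_mtbf:,.0f} hrs")
print(f"Availability: {system_avail:.8%}")
print(f"Annual downtime: {system_downtime:.2f} min")
```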
Example of Predicted Availability Rating (With Redundancy)
Catalyst 2960XR-48TS-I
Part | MTBF (hours) | MTTR | Switchover time sec. | Combined MTBF Hrs. | Predicted Availability | Annual Downtime
Catalyst 2960XR-48TS-I | 438,130 | 4 hrs. | -- | 438,130 | 99.99908704% | --
Power Supply (Redundant) | 1,000,000 | 4 hrs. | 0 | 125,001,000,002 | 100.00000000% | --
SFP-10GSR Uplink (Redundant) | 2,294,776 | 4 hrs. | .500 | 658,251,906,130 | 100.00000000% | --
System MTBF | | | | 438,128 | 99.99908704% | 4.80 min.
Redundant components combined in parallel calculation
Chassis X Combined Power Supply X Combined Uplink = System MTBF
Example of Predicted Availability Rating (Catalyst 3850 No Redundancy)
Catalyst WS-C3850-48F
Part | MTBF (hours) | MTTR | Predicted Availability | Annual Downtime
Catalyst C3850-48F | 241,050 | 4 hrs. | 99.99834062% | --
Power Supply PWR-C1-1100WAC | 392,174 | 4 hrs. | 99.99898005% | --
C3850-NM-2-10G | 4,319,170 | 4 hrs. | 99.99990732% | --
SFP-10GSR Uplink | 2,294,776 | 4 hrs. | 99.99982569% | --
System MTBF | 135,761 | | 99.99705371% | 15.50 min
All single points of failure combined in a series calculation
Chassis X Power Supply X Uplink = System MTBF
Example of Predicted Availability Rating (With Component Redundancy)
Catalyst WS-C3850-48F
Part | MTBF | MTTR | Switchover time sec. | Combined MTBF | Predicted Availability | Annual Downtime
Catalyst C3850-48F | 241,050 | 4 hrs. | -- | 241,050 hrs. | 99.99834062% | --
Power Supply PWR-C1-1100WAC | 392,174 | 4 hrs. | 0 | 19,225,447,959 | 99.99999999% | --
SFP-10GSR Uplink | 2,294,776 | 4 hrs. | .500 | 658,251,906,038 | 100.00000000% | --
C3850-NM-2-10G | 4,319,170 | 4 hrs. | -- | 4,319,170 | 99.99990732% | --
System MTBF | | | | 228,297 | 99.99824793% | 9.22 min.
Redundant components combined in parallel calculation
Chassis X Combined Power Supply X Combined Uplink X Uplink Module = System MTBF
Example of Predicted Availability Rating (With Stackwise480 Redundancy, Single Attached)
Catalyst WS-C3850-48F
Part | MTBF (hours) | MTTR | Switchover time | Combined MTBF | Combined Availability | Annual Downtime
Catalyst C3850-48F | 241,050 | 4 hrs. | -- | 241,050 | 99.99834062% | --
Power Supply PWR-C1-1100WAC | 392,174 | 4 hrs. | .001 | 19,225,447,959 | 99.99999999% | --
C3850-NM-2-10G | 4,319,170 | 4 hrs. | .500 | 2,328,453,946,134 | 100.0000000% | --
SFP-10GSR Uplink | 2,294,776 | 4 hrs. | .500 | 658,251,906,038 | 100.0000000% | --
System MTBF | | | | 241,047 | 99.99834061% | 8.73 min.
Redundant components combined in parallel calculation
Combined Chassis X Combined Power Supply X Combined Uplink = System MTBF
Example of Predicted Availability Rating (Catalyst 4507R+E Non Redundant)
Catalyst WS-C4507R+E
Part | MTBF | MTTR | Combined MTBF | Combined Availability | Annual Downtime
Chassis with Fans WS-C4507R+E | 248,630 | 4 hrs. | 248,630 hrs. | 99.99839121% | --
Power Supply PWR-C45-6000ACV | 341,356 | 4 hrs. | 341,356 hrs. | 99.99882822% | --
WS-X45-SUP8-E | 451,610 | 4 hrs. | 451,610 hrs. | 99.99911429% | --
SFP-10GSR Uplink | 2,294,776 | 4 hrs. | 658,251,906,038 | 99.99999956% | --
WS-X4748-RJ45-E | 402,386 | 4 hrs. | 402,386 hrs. | 99.99900594% | --
System MTBF | | | 82,735 hrs. | 99.99516543% | 25.43 min.
Components combined in series calculation
Chassis X Power Supply X Line Card X Supervisor Module X SFP Uplink = System MTBF
Example of Predicted Availability Rating (Catalyst 4507R+E With Redundancy)
Catalyst WS-C4507R+E with Redundancy
Part | MTBF | MTTR | Switchover time | Combined MTBF | Combined Availability | Annual Downtime
Chassis with Fans WS-C4507R+E | 248,630 | 4 hrs. | -- | 248,630 hrs. | 99.99839121% | --
Power Supply PWR-C45-6000ACV | 341,356 | 0 hrs. | 0 | 14,565,831,200 hrs. | 99.99882822% | --
WS-X45-SUP8-E | 451,610 | 0 hrs. | .500 | 25,494,400,625 hrs. | 99.99911429% | --
SFP-10GSR Uplink | 2,294,776 | 0 hrs. | .500 | 658,251,906,038 | 99.99999956% | --
WS-X4748-RJ45-E | 402,386 | 4 hrs. | -- | 402,386 hrs. | 99.99900594% | --
System MTBF | | | | 153,673 hrs. | 99.99739714% | 13.69 min.
Redundant components combined in parallel calculation
Chassis X Combined Power Supply X Line Card X Combined Supervisor Module X Combined SFP Uplink = System MTBF
Example of Predicted Availability Rating (Catalyst 6800XL Non Redundant)
Catalyst 6800XL
Part | MTBF (hours) | MTTR | Combined MTBF Hrs. | Combined Availability | Annual Downtime
Chassis C6807-XL | 638,440 | 4 hrs. | 638,440 | 99.99937348% | --
C6807-XL-FAN= | 3,077,880 | 4 hrs. | 3,077,880 | 99.99987004% | --
SFP-10GSR | 2,294,776 | 4 hrs. | 2,294,776 | 99.99982569% | --
Supervisor VS-S2T-10G | 231,910 | 4 hrs. | 231,910 | 99.99827522% | --
WS-X6904-40G2T | 256,490 | 4 hrs. | 256,490 | 99.99844051% | --
C6800-XL-3KWAC* | 3,000,000 | 4 hrs. | 3,000,000 | 99.99986667% | --
System MTBF | | | 91,987 | 99.99565168% | 22.87 min.
Components combined in series calculation
Chassis X Fan Tray X Power Supply X Line Card X Supervisor Module X SFP Uplink = System MTBF
Example of Predicted Availability Rating (Catalyst 6800XL With Redundancy)
Catalyst 6800XL with Redundancy
Part | MTBF Hrs. | MTTR Hrs. | Switchover time (seconds) | Combined MTBF Hrs. | Combined Availability | Annual Downtime
Chassis C6807-XL | 638,444 | 4 Hrs. | -- | 638,440 | 99.99937348% | --
C6807-XL-FAN= | 3,077,880 | 4 Hrs. | -- | 3,077,880 | 99.99987004% | --
SFP-10GSR | 451,610 | 4 Hrs. | .500 | 2,633,000,739,868 | 100.00000000% | --
Supervisor VS-S2T-10G | 2,294,776 | 4 Hrs. | .500 | 26,891,355,961 | 99.99999997% | --
WS-X6904-40G2T | 402,386 | 4 Hrs. | .500 | 32,893,816,541 | 99.99999998% | --
C6800-XL-3KWAC* | 3,000,000 | 4 Hrs. | 0 | 4,500,003,000,001 | 100.00000000% | --
System MTBF | | | | 528,687 | 99.99924347% | 3.98 min.
Redundant components combined in parallel calculation
Chassis X Combined Power Supply X Combined Line Card X Combined Supervisor Module X Combined SFP Uplink = System MTBF
Choosing the Right Platform and Network Design
It is More Than Just Predicted Availability Ratings
• Design to business requirements
• Use Predicted Availability ratings as part of your overall design considerations
• Common factors that dictate platform selection:
  • Backplane throughput and performance
  • Interface types and port densities
  • Scalability for future growth / investment protection
  • Software upgrade procedures
  • Software feature support
  • Simplicity / Ease of Use
Control Plane and Data Plane
• Control Plane: CPU, Software, Memory
  • EIGRP, OSPF, LDP, BGP, SNMP, STP, CDP
  • FIB
• Data Plane: ASICs, High-Speed TCAMs
  • FIB
Control Plane and Data Plane
Definitions for our context
• Control Plane – Protocols or signaling traffic associated with routing. Typically this is traffic sourced from a router or destined to a router. Examples include BGP, OSPF, EIGRP, ICMP, etc.
  • Processed by a CPU
  • May also include exception traffic that needs special services applied
  • Commonly referred to as the “Slow Path”
  • May also include management protocols including SNMP, Telnet, HTTP, etc., AKA “Management Plane”
• Data Plane – Traffic forwarded through a device.
  • Processed by hardware ASICs
  • In the context of a switching device, typically this is traffic processed completely by the device’s hardware ASICs
  • Commonly referred to as the “Fast Path”
Stateful Switchover (SSO)
• Stateful Switchover (SSO) – A software facility within Cisco IOS that synchronizes specific Cisco IOS processes between an Active Supervisor Engine and a Redundant Standby Supervisor Engine for the purpose of redundancy.
  • Redundancy Facility – synchronizes application states
  • Checkpointing Facility – synchronizes data structures
Redundant Supervisors – IOS
Active – Standby Model
Active Supervisor
• Control Plane
  • Console access
  • Manages configurations
  • Manages chassis environmentals
  • L2 – L3 protocols
• Data Plane
  • Hardware-based switching
Standby Supervisor
• Not part of the active forwarding path
• Multiple redundancy modes
  • COLD Standby
  • WARM Standby
  • HOT Standby
• Synchronization
  • CF – Checkpoint Facility
  • RF – Redundancy Facility
Stateful Switchover Mode – IOS
SSO-Aware and SSO-Compliant IOS Applications
The Active Supervisor and the Standby Hot Supervisor each run Cisco IOS with the same two groups of applications:
• SSO-Compliant Applications: Routing Protocols, NetFlow, Cisco Discovery Protocol, …and more
• SSO-Aware Applications: Forwarding Information Base, IEEE 802.1x, PAgP / LACP, …and more
The Redundancy Facility and the Checkpointing Facility link the two supervisors.
SSO Compliant Redundancy Clients
IOS Partial List Example
Router# show redundancy clients
 clientID = 0     clientSeq = 0   RF_INTERNAL_MSG
 clientID = 1319  clientSeq = 1   Cat6k Platform Swove
 clientID = 5030  clientSeq = 2   Redundancy Mode RF
The slide groups the clients under four headings: Management & Services, Platform Specific, L3 Services, and L2 Services. The column assignment did not survive extraction, so the clients are listed here in their extracted order:
• EEM Server RF CLIENT
• SNMP HA RF Client
• Switch SPAN client
• MQC QoS
• Call-Home RF
• Port Security Client
• IKE RF Client
• IPSEC RF Client
• CRYPTO RSA
• LAN-Switch PAgP/LACP
• LAN-Switch Private V
• VLAN Mapping
• CTS HA
• Cat6k Inline Power
• Cat6k OIR
• Cat6k QoS Manager
• Network RF Client
• CWAN VLAN RF Client
• HSRP
• Cat6k Feature Manager
• GLBP
• Cat6k SPA TSM
• Cat6k PAgP/LACP
• BFD RF Client
• Spanning-Tree Protocol
• DHCP Snooping
• Cat6k Online Diag HA
• Cat6k MLS Multicast
• Cat6k Platform
• SLB RF Client
• Config Sync RF client
• Cat6k Startup Config
• Frame Relay
• HDLC
• IPROUTING NSF RF
• ARP
• LLDP
• PPP RF
• L3 Mobility Manager
• IP multicast RF Client
• MPLS VPN HA Client
• LDP HA
• AToM manager
Graceful Restart, Non-Stop Forwarding and Non-Stop Routing
SSO by itself Does Not Provide Redundancy for the Routing Protocols
• Non-Stop Forwarding was developed by Cisco to maintain traffic forwarding by a router experiencing a control plane switchover event. The router will essentially synchronize its Forwarding Information Base between an Active and Standby Route Processor as well as signal to its routing neighbors to continue forwarding traffic while routing topology information is exchanged.
• The IETF developed standards-based implementations similar to Cisco NSF.
• The IETF implementations use different terminology, including the term “Graceful Restart” to describe the signaling used between the routers.
• Graceful Restart (GR) and Non-Stop Forwarding (NSF) are terms often used interchangeably.
• Graceful Restart/Non-Stop Forwarding as well as Non-Stop Routing (NSR) all allow for the forwarding of data packets to continue along known routes while the routing protocol information is being restored (in the case of Graceful Restart) or refreshed (in the case of Non-Stop Routing) following a processor switchover.
• Each routing protocol has its own unique implementation and signaling mechanisms.
Routing Protocol Redundancy With NSF
[Figure sequence, reconstructed from the slide builds]
Before switchover: Active Supervisor Engine Slot 1
EIGRP RIB (Prefix | Next Hop)
10.0.0.0 | 10.1.1.1
10.1.0.0 | 10.1.1.1
10.20.0.0 | 10.1.1.1
OSPF RIB (Prefix | Next Hop)
192.168.0 | 192.168.0.1
192.168.55.0 | 192.168.55.1
192.168.32.0 | 192.168.32.1
ARP Table (IP | MAC)
10.1.1.1 | aabbcc:ddee32
10.1.1.2 | adbb32:d34e43
10.20.1.1 | aa25cc:ddeee8
FIB Table (Prefix | Next HOP)
10.1.1.1 | aabbcc:ddee32
10.1.1.2 | adbb32:d34e43
192.168.0.0 | aa25cc:ddeee8
Before switchover: Standby Supervisor Engine Slot 2
• EIGRP RIB, OSPF RIB, and ARP Table are empty
• FIB Table holds the same three entries as the active supervisor, synchronized through SSO (Redundancy Facility and Checkpoint Facility)
After switchover: Active Supervisor Engine Slot 2
1. EIGRP RIB, OSPF RIB, and ARP Table are empty; the FIB Table still holds its three entries
2. GR/NSF signaling per protocol and synchronization per protocol: the EIGRP RIB is repopulated
3. The OSPF RIB is repopulated
4. The ARP Table is repopulated
The FIB Table is unchanged through all four steps.
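The sequence shown in these NSF slides can be mimicked with a toy model: only forwarding state is checkpointed to the standby, so after a switchover the routing table starts empty while lookups keep succeeding against the preserved FIB. This is an illustrative sketch, not IOS internals; the class, function names, and the simplified prefix-to-next-hop FIB are assumptions.

```python
class Supervisor:
    """Toy supervisor with a control-plane RIB and a forwarding-plane FIB."""

    def __init__(self, slot: int):
        self.slot = slot
        self.rib = {}  # prefix -> next hop, owned by the routing protocols
        self.fib = {}  # prefix -> next hop, used to forward packets


def sso_checkpoint(active: Supervisor, standby: Supervisor) -> None:
    """Copy forwarding state only; routing protocol state is not synchronized."""
    standby.fib = dict(active.fib)


def forward(sup: Supervisor, prefix: str):
    """Return the next hop from the FIB, or None if the prefix is unknown."""
    return sup.fib.get(prefix)


def graceful_restart(sup: Supervisor, routes_from_neighbors: dict) -> None:
    """Rebuild the RIB from routes resent by NSF-aware neighbors, then refresh the FIB."""
    sup.rib.update(routes_from_neighbors)
    sup.fib = dict(sup.rib)


routes = {"10.0.0.0": "10.1.1.1", "10.1.0.0": "10.1.1.1", "10.20.0.0": "10.1.1.1"}
slot1, slot2 = Supervisor(1), Supervisor(2)
slot1.rib = dict(routes)
slot1.fib = dict(routes)
sso_checkpoint(slot1, slot2)

# Slot 1 fails; slot 2 takes over with an empty RIB but a populated FIB.
new_active = slot2
rib_empty_after_switchover = not new_active.rib
next_hop_during_restart = forward(new_active, "10.20.0.0")

# Neighbors resend their routes and the RIB converges again.
graceful_restart(new_active, routes)
print(rib_empty_after_switchover, next_hop_during_restart, new_active.rib == routes)
```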
Non Stop Forwarding Router Roles
• Non-Stop Forwarding (NSF) allows a router to continue forwarding data along routes that are already known while the routing protocol information is being restored
• NSF Aware router or NSF Helper router*
  • A router running NSF-compatible software, capable of assisting a neighbor router in performing an NSF restart
• NSF Capable router
  • A router configured to perform an NSF restart, and therefore able to rebuild routing information from a neighboring NSF-aware or NSF-capable router
[Figure: an NSF Capable device with redundant supervisors connected to NSF Aware neighbors]
* NSF Helper - This term is used in IETF terminology
NSF/SSO Switchover Operation – IOS
[Figure: numbered walkthrough (steps 1–12) of an NSF/SSO switchover. Step 1: the active supervisor fails. On the newly active supervisor, the control plane consists of the RP CPU running OSPF, EIGRP, IS-IS, and BGP, the Routing Information Base, and the ARP table, with a control path to an NSF Aware router. The Cisco IOS CEF tables (Global Epoch = 1) comprise a FIB table (prefix, next hop, interface, epoch) and an adjacency table (next hop, MAC, epoch). In the data plane, the hardware FIB table and adjacency table form the forwarding path.]
Non-Stop Forwarding
OSPF Implementation Example
[Figure: two message-sequence diagrams between an NSF Capable router and an NSF Aware router, each starting from a restart event.
Cisco NSF: OSPF discovery with Fast Hellos (2-second interval, RS bit set), followed by Fast Hellos (2-second interval, RS bit clear), then out-of-band sync: Database Description and LSA Requests/Update.
IETF NSF (GR): the restarting router announces Graceful Restart with an LS Update (Grace LSA), which is acknowledged with an LS ACK (Grace LSA); Hellos to 224.0.0.5 follow, then database exchange: Database Description and LSA Requests/Update.]
NSF Configuration - IOS
Capable Vs Helper Configuration
•
•
Configuration is required to enable “NSF Capable”
Configuration is NOT required to enable “NSF Helper” with default settings
• Helper supports both types on the device
router eigrp 1
nsf
!
router ospf 1
nsf ietf
!
router isis 1
nsf cisco
core1# show ip ospf nsf
Routing Process "ospf 1"
IETF Non-Stop Forwarding enabled
restart-interval limit: 120 sec
IETF NSF helper support enabled
IETF NSF helper strict-lsa-checking enabled
Cisco NSF helper support enabled
OSPF restart state is NO_RESTART
Handle 2162698, Router ID 1.1.1.1, checkpoint Router ID 1.1.1.1
Config wait timer interval 10, timer not running
Dbase wait timer interval 120, timer not running
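A minimal sketch of how the `show ip ospf nsf` output above could be checked by a script. The function name `parse_ospf_nsf` and the dictionary keys are illustrative assumptions; only strings that appear in the sample output are matched.

```python
import re

# Sample text copied from the show output on this slide.
SAMPLE = """\
Routing Process "ospf 1"
IETF Non-Stop Forwarding enabled
restart-interval limit: 120 sec
IETF NSF helper support enabled
IETF NSF helper strict-lsa-checking enabled
Cisco NSF helper support enabled
OSPF restart state is NO_RESTART
"""

def parse_ospf_nsf(text):
    """Return NSF-related flags found in `show ip ospf nsf` style text."""
    interval = re.search(r"restart-interval limit:\s*(\d+)\s*sec", text)
    return {
        # "NSF Capable" role: must be configured (nsf ietf)
        "ietf_capable": "IETF Non-Stop Forwarding enabled" in text,
        # "NSF Helper" role: on by default for both flavors
        "ietf_helper": "IETF NSF helper support enabled" in text,
        "cisco_helper": "Cisco NSF helper support enabled" in text,
        "restart_interval_sec": int(interval.group(1)) if interval else None,
    }

status = parse_ospf_nsf(SAMPLE)
print(status)
```

A check like this is only as good as the strings it matches, so it should be validated against the exact software release in use.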
NSF Interoperability
Interoperability between different Cisco devices
• The Graceful Restart extensions used in NX-OS are based on the IETF RFCs, except for EIGRP, which is Cisco proprietary and can interoperate with Cisco NSF.
• This implies that for OSPFv2, OSPFv3, and BGP the GR extensions are compatible with versions of IOS that use the RFC-based extensions.
router ospf 1
graceful-restart
router ospf 1
graceful-restart
✔
router ospf 1
nsf ietf
router ospf 1
nsf cisco
Non-Stop Routing (NSR)
• Cisco IOS Non-Stop Routing preserves the state information (prefixes and related data) in the Routing Information Base across Supervisor Engine (Route Processor) switchover events.
• Helpful in environments where peer routers are not managed by the same entity or are not capable of supporting NSF awareness.
• Non-Stop Routing consumes more control plane resources, such as memory and CPU cycles, than NSF.
Routing Protocol Redundancy With NSR
[Figure: the Active Supervisor Engine (Slot 1) and the Standby Supervisor Engine (Slot 2) hold identical copies of the tables below, kept in sync through SSO, the Redundancy Facility, and the Checkpoint Facility.]
EIGRP RIB (Prefix -> Next Hop): 10.0.0.0 -> 10.1.1.1; 10.1.0.0 -> 10.1.1.1; 10.20.0.0 -> 10.1.1.1
OSPF RIB (Prefix -> Next Hop): 192.168.0 -> 192.168.0.1; 192.168.55.0 -> 192.168.55.1; 192.168.32.0 -> 192.168.32.1
ARP Table (IP -> MAC): 10.1.1.1 -> aabbcc:ddee32; 10.1.1.2 -> adbb32:d34e43; 10.20.1.1 -> aa25cc:ddeee8
FIB Table (Prefix -> Next Hop): 10.1.1.1 -> aabbcc:ddee32; 10.1.1.2 -> adbb32:d34e43; 192.168.0.0 -> aa25cc:ddeee8
Routing Protocol Redundancy With NSR
[Figure: after the switchover, the Supervisor Engine in Slot 2 is Active and holds the same EIGRP RIB, OSPF RIB, ARP Table, and FIB Table entries that the Slot 1 supervisor held before the failure.]
No additional signaling required to maintain topology
NSR Deployment Scenario
Case Study: MPLS VPN Provider Edge
• Provider PE device can use NSR for peering with the CE devices
• Use NSF for peering with the internal P devices or Route Reflectors
• NSF Aware peers are not needed for the CE device
• Control plane resources can be optimized by using NSF and NSR together
[Figure: MPLS VPN with CE devices attached to a PE, which connects to P devices in the provider core]
NSR Configuration - IOS
• Configuration is required to enable NSR
router eigrp 1
nsr
!
router ospf 1
nsr
!
router isis 1
nsr
Comparing NSF and NSR
Metric: Configuration required
  Non-Stop Forwarding: Yes, per protocol instance on the NSF-capable device. No configuration required on NSF-aware devices for Interior Gateway Protocols. BGP requires GR configuration on both the NSF-Capable and the NSF Helper device.
  Non-Stop Routing: Yes, per protocol instance. BGP also requires per-peer configuration.
Metric: Which routing protocols are supported
  Non-Stop Forwarding: EIGRP, OSPFv2, OSPFv3, ISIS, BGP, LDP, etc.
  Non-Stop Routing: ISIS, BGP, OSPFv2, etc.
Metric: Synchronizes routing protocol state and RIB information across redundant control planes
  Non-Stop Forwarding: No
  Non-Stop Routing: Yes
Metric: Consumes additional CPU and memory resources
  Non-Stop Forwarding: Negligible
  Non-Stop Routing: Yes, scaling with the number of routes per protocol
Metric: Requires specific feature support on peer routers
  Non-Stop Forwarding: Yes
  Non-Stop Routing: No
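The comparison can be condensed into a small decision helper. This is only a sketch of the reasoning on these slides (NSR where the peer cannot or will not act as an NSF helper, NSF elsewhere because it costs fewer control plane resources); the function and parameter names are hypothetical.

```python
def choose_restart_mechanism(peer_is_nsf_aware, peer_managed_by_us):
    """Pick "NSF" or "NSR" for one routing adjacency.

    NSF needs helper support on the peer, so it is chosen only when the
    peer is NSF aware and under our control. Otherwise NSR is chosen,
    accepting its higher CPU and memory cost on the local device.
    """
    if peer_is_nsf_aware and peer_managed_by_us:
        return "NSF"
    return "NSR"

# PE example from the deployment scenario: internal P routers vs. customer CEs.
print(choose_restart_mechanism(True, True))    # internal P / route reflector
print(choose_restart_mechanism(False, False))  # customer-managed CE
```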
High Availability at Different Layers
Standalone Chassis Redundant Core
Redundant Supervisors Yes or No ? Catalyst 6500
• Redundant topologies with equal cost multipaths (ECMP) provide sub-second convergence
• NSF/SSO provides superior availability in environments with non-redundant paths
• RP convergence is dependent on IGP and tuning
[Figure: chart of seconds of lost voice for Link Failure, Node Failure, and NSF/SSO, with OSPF convergence]
Redundant Supervisors Yes or No? Catalyst 6500
• HSRP doesn’t flap on Supervisor SSO switchover
• Reduces the need for sub-second HSRP timers
• SSO Aware HSRP
  • 6500-E - 12.2(33)SXH
  • 4500 - 12.2(31)SG
[Figure: chart of seconds of lost voice]
Design Considerations for NSF/SSO
Where Does It Make Sense?
• Access switch is the single point of failure in best practices HA design
• Supervisor failure is the most common cause of access switch service outages
• Recommended design with NSF/SSO provides for sub-600 msec recovery of voice and data traffic
[Figure: chart of seconds of lost voice]
Agenda
• Designing High Availability Networks for the Enterprise
• System Hardware and Software Resiliency
• Availability Modeling
• Stateful Switchover, Non-Stop Forwarding, and Non-Stop Routing
• Stackwise480 and Stackwise
• In Service Software Upgrades
• Foundations of the Structured Network Design
• High Availability Architectures
• High Availability System Recovery Analysis
Catalyst 9300 Series
Cisco Stackwise-480
Stacking Cable – Close-up
Stacking
Cable
Cable Lengths
• 0.5m
• 1m
• 3m
Understanding the Stack Ring
[Figure: Stack Interface ASICs, assuming 4 x 24-port Cat9K switches]
• 6 rings in total
  • 3 rings go East
  • 3 rings go West
• Each ring is 40G
• Total Stack BW = 240G
• With Spatial Reuse = 480G
Is math really an opinion?
Packets are segmented/reassembled in HW (256-byte segments)
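The arithmetic behind the 240G and 480G figures can be written out as a short sketch. The constant and function names are illustrative; the factor of two for spatial reuse is taken directly from the slide.

```python
RINGS = 6        # 3 rings East + 3 rings West
RING_GBPS = 40   # each ring is 40G

def stack_bandwidth_gbps(rings=RINGS, ring_gbps=RING_GBPS, spatial_reuse=False):
    """Aggregate stack bandwidth in Gbps, doubled when spatial reuse applies."""
    raw = rings * ring_gbps
    return raw * 2 if spatial_reuse else raw

print(stack_bandwidth_gbps())                    # total stack bandwidth
print(stack_bandwidth_gbps(spatial_reuse=True))  # with spatial reuse
```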
Understanding Spatial Reuse
Doubling the capacity of my stack
[Figure: ring of four stack members (1-4), assuming 4 x 24-port 9300 switches]
Destination Stripping: a packet travels ½ the rings and is taken out of the stack by the destination.
Stack Ring Healing
Example shows 4 x 24-port Cat9K switches with one stack link failed (X).
• Detection is by hardware
• Software is notified immediately
• Ring Wrap initiated immediately (1-2 ms)
• All rings wrap
• 240 Gbps when wrapped
For recovery:
• Hardware detects other side
• Software validates the link and then brings up the connection gracefully
• Unwrap is slower than Wrap
IOS XE Software Internals Overview
[Figure: IOS XE software architecture, split into Infra Domain, LC Domain, and RP Domain. Components shown: Service Location, Wireless Controller, Consolidated Logging, Forwarding & Feature Mgr (FFM), Stack Manager (3K), Features PD, Platform Drivers, HA, UADP ASIC Drivers, External Transports (TCP/SCTP/UDP), Internal IPC, Licensing Services, Libraries/Utilities, Comet Services, Low Level APIs, Forwarding Engine Driver, Packet Delivery Service, Platform Manager, System Manager, Availability Framework, IOSd RP, Interface Manager, Kernel.]
UADP
UADP provides an unparalleled degree of flexibility in an access switch
Designed for Flexibility
• Excellent for encapsulations, which often need recirculation
• Parse depth of 256 bytes
• 15 programmable stages
• Up to 250 frames across stages at one time
• Ability to handle current and future protocols: extremely flexible and capable
VXLAN as a protocol had not even been invented when the UADP ASIC was designed, yet UADP forwards VXLAN in hardware, at high performance, in IOS-XE 16.3+, thanks to the FlexParser (parse depth of 256 bytes, 15 programmable stages, up to 250 frames across stages at one time).
VXLAN is a complex protocol. Encapsulated frame layout (field widths in bits):
Underlay
• Outer MAC Header (14 bytes, plus 4 bytes optional): Dest. MAC 48 (next-hop MAC address), Source MAC 48 (source VTEP MAC address), VLAN Type 0x8100 16, VLAN ID 16, Ether Type 0x0800 16
• Outer IP Header (20 bytes): IP Header Misc. Data 72, Protocol 0x11 (UDP) 8, Header Checksum 16, Source IP 32 (source RLOC IP address), Dest. IP 32 (destination RLOC IP address)
• UDP Header (8 bytes): Source Port 16 (hash of inner L2/L3/L4 headers of the original frame, which enables entropy for ECMP load balancing), VXLAN Port 16 (UDP 4789), UDP Length 16, Checksum 0x0000 16
• VXLAN Header (8 bytes): VXLAN Flags RRRRIRRR 8, Segment ID 16 (allows 64K possible SGTs), VN ID 24 (allows 16M possible VRFs), Reserved 8
Overlay
• Inner (Original) MAC Header
• Inner (Original) IP Header
• Original Payload
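The "64K SGTs" and "16M VRFs" figures follow directly from the 16-bit Segment ID and 24-bit VN ID fields. The sketch below computes them and packs an 8-byte VXLAN header. Assumptions: the field widths listed on the slide add up to 56 bits, so the remaining 8 bits after the flags byte are treated here as reserved; the function names are illustrative, not a Cisco API.

```python
import struct

VXLAN_UDP_PORT = 4789
FLAG_I = 0x08  # "RRRRIRRR": only the I bit is set

def field_capacity(bits):
    """Number of distinct values an unsigned field of `bits` bits can hold."""
    return 1 << bits

def pack_vxlan_header(segment_id, vn_id):
    """Pack flags(8) | assumed reserved(8) | Segment ID(16) | VN ID(24) | reserved(8)."""
    if not 0 <= segment_id < field_capacity(16):
        raise ValueError("Segment ID must fit in 16 bits")
    if not 0 <= vn_id < field_capacity(24):
        raise ValueError("VN ID must fit in 24 bits")
    # The VN ID occupies the upper 24 bits of the last 32-bit word.
    return struct.pack("!BBHI", FLAG_I, 0, segment_id, vn_id << 8)

header = pack_vxlan_header(0x1234, 0xABCDEF)
print(header.hex(), field_capacity(16), field_capacity(24))
```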
Stack Discovery
• Switches boot
• Stack Interfaces brought online
• Infra and LC Domains boot in parallel
• Stack Discovery Protocol discovers stack topology: broadcast, followed by neighbor-cast
• Active Election begins after Discovery exits
[Figure: four stack members, each with an Infra and an LC domain]
Stack Active Election
Rules of Election
• The stack (or switch) whose member has the higher user-configurable priority (1-15)
• The switch or stack whose member has the lowest MAC address
%IOSXE-1-PLATFORM: process stack-mgr: %STACKMGR-1-ACTIVE_ELECTED: Switch 1 has been elected ACTIVE.
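A minimal sketch of the two election rules listed on this slide: highest priority first, lowest MAC address as the tie-breaker. It models only these two rules and ignores anything else the platform may consider; the data layout and the name `elect_active` are assumptions for illustration.

```python
def mac_value(mac):
    """Convert a dotted MAC such as '2037.06cf.0e80' to an integer."""
    return int(mac.replace(".", ""), 16)

def elect_active(switches):
    """Return the member with the highest priority, then the lowest MAC."""
    return min(switches, key=lambda s: (-s["priority"], mac_value(s["mac"])))

stack = [
    {"id": 1, "priority": 10, "mac": "2037.06cf.0e80"},
    {"id": 2, "priority": 8, "mac": "2037.06cf.3380"},
    {"id": 3, "priority": 6, "mac": "2037.06cf.1400"},
    {"id": 4, "priority": 4, "mac": "2037.06cf.3000"},
]
print(elect_active(stack)["id"])
```

With default priorities (all equal), the outcome depends only on MAC addresses, which is why the slides recommend configuring priorities explicitly.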
Define Stack Roles
Minimal Downtime
• Power up the first switch that you want to make Active
• Configure the priority of the switch (1-15)
  • 1 by default; the higher the better
• Power up the second member that you want to make Standby
• Configure a priority lower than the Active
• Power up the rest of the members
Catalyst9300#switch 1 priority 15
Catalyst9300#switch 2 priority 14
Catalyst9300#switch 3 priority 13
Catalyst9300#switch 4 priority 12
*Priority command is a global command
Catalyst 9K Stack similarity to Catalyst 6500
Catalyst 6500:
• Active and Standby Supervisors
• Run IOS on Supervisors
• Synchronize information
• Active programs all DFCs
• DFCs run a subset of IOS for LCs
Catalyst 9K stack:
• Active and Standby units
• Run IOSd, WCM, etc. on Active/Standby
• Synchronize information
• Active programs data plane for members
• Member switches act as line cards, connected via the stack cable
Show switch with SSO
Stack MAC follows Active initially
Switch# show switch
Switch/Stack Mac Address : 2037.06cf.0e80
                                            H/W     Current
Switch#  Role     Mac Address     Priority  Version State
------------------------------------------------------------
*1       Active   2037.06cf.0e80  10        V01     Ready
 2       Standby  2037.06cf.3380  8         V00     Ready
 3       Member   2037.06cf.1400  6         V00     Ready
 4       Member   2037.06cf.3000  4         V00     Ready
* Indicates which member is providing the “stack Identity” (aka “stack MAC”)
Show switch detail output
Switch# show switch detail
Switch/Stack Mac Address : 2037.06cf.0e80
                                            H/W     Current
Switch#  Role     Mac Address     Priority  Version State
------------------------------------------------------------
*1       Active   2037.06cf.0e80  10        V01     Ready
 2       Standby  2037.06cf.3380  8         V00     Ready
 3       Member   2037.06cf.1400  6         V00     Ready
 4       Member   2037.06cf.3000  4         V00     Ready

         Stack Port Status        Neighbors
Switch#  Port 1     Port 2        Port 1   Port 2
--------------------------------------------------------
 1       OK         OK            2        4
 2       OK         OK            3        1
 3       OK         OK            4        2
 4       OK         OK            1        3
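The neighbor table in `show switch detail` is enough to confirm that the stack is cabled as a full ring. The sketch below walks the Port 1 neighbors and checks that every member is visited once and that Port 2 points back the other way. The dictionary layout and the name `is_full_ring` are assumptions for illustration.

```python
# switch number -> (Port 1 neighbor, Port 2 neighbor), copied from the output above
NEIGHBORS = {1: (2, 4), 2: (3, 1), 3: (4, 2), 4: (1, 3)}

def is_full_ring(neighbors):
    """True if Port 1 links form one closed loop over all members and
    each member's Port 2 neighbor is the one whose Port 1 points at it."""
    if not neighbors:
        return False
    start = next(iter(neighbors))
    seen = []
    current = start
    while current not in seen:
        if current not in neighbors:
            return False  # missing neighbor: the ring is open
        seen.append(current)
        current = neighbors[current][0]
    if current != start or len(seen) != len(neighbors):
        return False
    return all(neighbors[p1][1] == sw for sw, (p1, _) in neighbors.items())

print(is_full_ring(NEIGHBORS))
```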
Catalyst 9000 – HA State Machine
• Active starts RP Domain locally
• Programs hardware on all LC Domains
• Traffic starts once hardware is programmed
• Starts 2min Timer to elect Standby in parallel
• Active elects Standby
• Standby starts RP Domain locally
• Starts Bulk Sync with Active RP
• Standby reaches “Standby Hot”
[Figure: four stack members with Infra, LC, and RP domains; Active (A) and Standby (S) marked; 2min timer]
Show redundancy states
Switch# show redundancy states
my state = 13 -ACTIVE                  (terminal state for Active unit)
peer state = 8 -STANDBY HOT            (terminal state for Standby unit for SSO)
Mode = Duplex
Unit ID = 1                            (slot number of Active unit)
Redundancy Mode (Operational) = SSO
Redundancy Mode (Configured) = SSO
Redundancy State = SSO
Manual Swact = enabled
Communications = Up                    (communication channel status between the Active/Standby RP units)
client count = 76
client_notification_TMR = 360000 milliseconds
keep_alive TMR = 9000 milliseconds
keep_alive count = 0
keep_alive threshold = 9
RF debug mask = 0
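A sketch of a readiness check built on the `show redundancy states` fields annotated above. It assumes the `key = value` layout shown on the slide; the helper names are hypothetical and the check covers only the four fields the slide calls out.

```python
SAMPLE = """\
my state = 13 -ACTIVE
peer state = 8 -STANDBY HOT
Mode = Duplex
Unit ID = 1
Redundancy Mode (Operational) = SSO
Redundancy Mode (Configured) = SSO
Redundancy State = SSO
Manual Swact = enabled
Communications = Up
"""

def parse_key_values(text):
    """Split 'key = value' lines into a dict, ignoring other lines."""
    result = {}
    for line in text.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            result[key.strip()] = value.strip()
    return result

def is_sso_ready(states):
    """True when both units are in their terminal SSO states and can talk."""
    return (
        states.get("my state", "").endswith("ACTIVE")
        and states.get("peer state", "").endswith("STANDBY HOT")
        and states.get("Redundancy Mode (Operational)") == "SSO"
        and states.get("Communications") == "Up"
    )

print(is_sso_ready(parse_key_values(SAMPLE)))
```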
Show Redundancy Command Output…
Switch#sh redundancy
Redundant System Information :
------------------------------
Available system uptime = 29 weeks, 2 days, 11 hours, 47 minutes     (system uptime)
Switchovers system experienced = 2
Standby failures = 0
Last switchover reason = user_forced

Hardware Mode = Duplex
Configured Redundancy Mode = SSO
Operating Redundancy Mode = SSO
Maintenance Mode = Disabled
Communications = Up

Current Processor Information :
------------------------------
Active Location = slot 1
Current Software state = ACTIVE
Uptime in current state = 1 week, 4 days, 22 hours, 38 minutes
Image Version = Cisco IOS Software, IOS-XE Software, Catalyst L3 Switch Software (CAT3K_CAA-UNIVERSALK9-M), Version 03.03.03E RELEASE SOFTWARE (fc1)     (image version of current unit)

Peer Processor Information :
------------------------------
Standby Location = slot 2
Current Software state = STANDBY HOT
Uptime in current state = 1 week, 4 days, 22 hours, 34 minutes
Image Version = Cisco IOS Software, IOS-XE Software, Catalyst L3 Switch Software (CAT3K_CAA-UNIVERSALK9-M), Version 03.03.03E RELEASE SOFTWARE (fc1)
StackWise Virtual Architecture
Extending StackWise Architecture
[Figure: Dist-1 formed by SW-1 and SW-2 (Cat 9k) joined over 40G/10G links. Callout: "Does it look familiar?" (VSS)]
• Cisco StackWise Virtual extends proven back-panel technology over front-panel network ports
• Cisco StackWise Virtual simplifies the distribution layer by combining two common Cat 9K series chassis into a single logical entity
StackWise Virtual Architecture
Resilient Software Design
[Figure: Dist-1 formed by SW-1 and SW-2 (Cat 9k) joined over 40G/10G links]
• Cisco StackWise Virtual supports 1+1 Inter-Chassis SSO redundancy, providing non-stop communication
• Consistent SSO and NSF capable protocols and features on both deployment models
StackWise Virtual Architecture
Simplified. Scalable.
[Figure: Core, Distribution (Dist-1: SW-1 and SW-2, Cat 9k, joined over 40G/10G links), and Access (Cat 9k) layers]
• Cisco StackWise Virtual supports a unified control and management plane architecture
• Complex network designs are simplified with Multi-Chassis EtherChannels (MEC)
• Improved application performance with deterministic network resiliency during various planned or unplanned failures
SW Patching in IOS-XE
Patching support is only for the Cat9K product family

Adding a SMU file
9300#install add file flash:cat9k-universalk9.2017-0317_21.53_zhangyu.301.CSCuo76464.SSA.smu.bin
install_add: START Sun Mar 26 01:13:29 UTC 2017
SUCCESS: Finished copying package(s) to the selected switch(es)
SUCCESS: install_add /flash/cat9k-universalk9.2017-0317_21.53_zhangyu.301.CSCuo76464.SSA.smu.bin Sun Mar 26 01:13:31 UTC 2017

Activating SMU
9300#install activate file flash:cat9k-universalk9.2017-0317_21.53_zhangyu.301.CSCuo76464.SSA.smu.bin
install_activate: START Sun Mar 26 01:14:12 UTC 2017
2 install_activate: Activating SMU...
This operation requires a reload of the system. Do you want to proceed? [y/n]y
2 install_activate: Reloading the box to complete activation of the SMU...

Committing it
9300#install commit
install_commit: START Sun Mar 26 01:24:41 UTC 2017
SUCCESS: install_commit Sun Mar 26 01:24:43 UTC 2017

Any failures/reloads between activate and commit result in a rollback
SMU Deployment Experience with Cisco DNA Center
• Download SMU to APICEM file server
• Analyze SMU impact
• Test SMU on Pilot setup
• Schedule SMU deployment
[Figure: the Network Admin uses the Cisco DNA Center App; the SMU and its ReadMe come from Cisco.com to the APIC EM file server, then go to the Pilot Site and on to the Production Sites]
Stackable Best
Practices
Stacking Convergence
Not a recommended design
Multi-Layer Access
• Active unit with uplink failure introduces two failures
  • Active control plane
  • Uplink interface
• When the Active fails, the Standby will take over.
• Upstream, HSRP / GLBP will detect link down, and D2 will start answering to the virtual MAC 0000.0c07.ac00
• Downstream traffic is re-routed to D2 via L3 link
[Figure: Distribution switches D1 (HSRP ACTIVE) and D2 (HSRP STANDBY) advertise summary subnets and share vIP 10.0.0.10 and vMAC 0000.0c07.ac00. L2 uplinks connect to the Access stack S1, S2, S3 (Active and Standby units marked), which is a single logical switch.]
Host 1: IP 10.0.0.1, MAC aaaa.aaaa.aa01, GW 10.0.0.10, ARP 0000.0c07.ac00
Host 2: IP 10.0.0.3, MAC aaaa.aaaa.aa03, GW 10.0.0.10, ARP 0000.0c07.ac00
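The virtual MAC in this example, 0000.0c07.ac00, is the well-known HSRP version 1 address for group 0: the last byte carries the group number. A short sketch of that mapping follows; the function name is illustrative.

```python
def hsrp_v1_virtual_mac(group):
    """Return the HSRPv1 virtual MAC (0000.0c07.acXX) for a group 0-255."""
    if not 0 <= group <= 255:
        raise ValueError("HSRPv1 group must be between 0 and 255")
    return "0000.0c07.ac{:02x}".format(group)

print(hsrp_v1_virtual_mac(0))   # the vMAC hosts hold in their ARP cache here
print(hsrp_v1_virtual_mac(10))
```

Because hosts keep this address in their ARP cache regardless of which distribution switch is HSRP active, a gateway failover does not require the hosts to re-ARP.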
Stacking Convergence
Multi-Layer Access
• Active unit failure (without uplink)
• When the Active fails, the Standby will take over
• No HSRP/GLBP failover: while the new Active is being elected, the HSRP/GLBP MAC address is still used by the rest of the stack for data forwarding
• No downstream re-route convergence
[Figure: Distribution switches D1 (HSRP ACTIVE) and D2 (HSRP STANDBY) advertise summary subnets and share vIP 10.0.0.10 and vMAC 0000.0c07.ac00. L2 uplinks connect to the Access stack S1, S2, S3 (Active and Standby units marked), which is a single logical switch.]
Host 1: IP 10.0.0.1, MAC aaaa.aaaa.aa01, GW 10.0.0.10, ARP 0000.0c07.ac00
Host 2: IP 10.0.0.3, MAC aaaa.aaaa.aa03, GW 10.0.0.10, ARP 0000.0c07.ac00
Catalyst 9300 Stack Wise
Routed Access
• CLI “stack-mac persistent timer 0” enables MAC consistency
  • This is the default value for 3850/9300
  • This is a change from the existing stacking model
• New Active inherits the MAC address of the previous Active
• No MAC changes for end hosts and adjacent routers, which significantly improves upstream recovery
Caution:
• Do not re-introduce the 3x50/9300 elsewhere, in order to avoid a duplicate MAC in your network
[Figure: Distribution switches advertise summary subnets; L3 uplinks connect to the Access stack S1, S2, S3 (Active and Standby units marked), a single logical switch. Callout: NO MAC Changes.]
Host 1: IP 10.0.0.1, MAC aaaa.aaaa.aa01, GW 10.0.0.10, ARP 000c.cece.7c80
Host 2: IP 10.0.0.3, MAC aaaa.aaaa.aa03, GW 10.0.0.10, ARP 000c.cece.7c80
Changing Stack Mac on Cat9K Switches
• By default the timer value is set to indefinite (0)
  • System continues to keep the selected stack MAC after switchover
  • Avoids protocol flapping
• How to change it
  • A new command introduced
switch#stack-mac update force

Before:
Catalyst9k#show switch
Switch/Stack Mac Address : 2037.06cf.0e80
Mac persistency wait time: Indefinite
                                            H/W     Current
Switch#  Role     Mac Address     Priority  Version State
------------------------------------------------------------
*1       Active   2037.06cf.0e80  10        V01     Ready
 2       Standby  2037.06cf.3380  8         V00     Ready
 3       Member   2037.06cf.1400  6         V00     Ready
 4       Member   2037.06cf.3000  4         V00     Ready

After:
[Second `show switch` output, overlapped in the original slide: the Switch/Stack Mac Address reads 2037.06cf.3380, Mac persistency wait time is Indefinite, switch 1 is listed as a Member with MAC 0000.0000.0000 in state Removed, switch 2 (2037.06cf.3380) is Active, and switches 3 (2037.06cf.1400) and 4 (2037.06cf.3000) are Members in state Ready.]
Key Recommendations for Stacking
• Run the stack in full ring mode to get full bandwidth
• Configure the Active switch priority and Standby switch priority
  • Predetermine which switch is the Active, and which Standby will become the Active should the Active fail
  • Simplifies operations
• Configure Active and Standby units without uplinks if possible
  • If deploying a stack of 4 or more switches, keep the Active and Standby switches without uplinks; this will simplify the convergence and reduce the outage time
• Do not change the stack-mac timer value
  • By default the value is 0 (indefinite)
  • Avoids protocol flapping
  • There is a command to change the stack-mac when needed
ISSU Overview
• ISSU provides a mechanism to perform software upgrades and downgrades without taking the switch out of service
• Leverages the capabilities of NSF and SSO to allow the switch to forward traffic during Supervisor IOS upgrade (or downgrade)
• Key technology is the ISSU Infrastructure
  • Allows SSO between different versions
[Figure: Catalyst 9400 with Active Sup and Standby Sup in SSO, plus line cards]
In Service Software Upgrades
Streamlined Process for Software Upgrades/Downgrades
1. ISSU Loadversion
2. ISSU Runversion
3. ISSU Acceptversion (Optional)
4. ISSU Commitversion
Stateful Switchover Mode – IOS
ISSU Client and Versioning Infrastructure
[Figure: Active Supervisor running Cisco IOS Version 1 and Standby Hot Supervisor running Cisco IOS Version 2. Each side shows ISSU Versioning and ISSU Clients in front of HA-Compliant Applications (Routing Protocols, NetFlow, Cisco Discovery Protocol, and more) and HA-Aware Applications (Forwarding Information Base, Port Manager, PAgP / LACP, and more), connected through the Redundancy Facility and the Checkpointing Facility.]
ISSU Client and Infrastructure Interactions – IOS
[Figure: interaction between Application XYZ's ISSU Client V1 on the Active Supervisor (ISSU Endpoint V1) and ISSU Client V3 on the Hot Standby Supervisor (ISSU Endpoint V3), each working through its Versioning Infrastructure.]
• Register Client Info: client ID, message capabilities, message versions, card type, and so on; the Versioning Infrastructure stores the client info
• Capabilities Negotiation: each side proposes capabilities, and the endpoints agree on a common set of capabilities
• Message Version Negotiation: each side proposes a message version, and the endpoints agree on a common message version (here the V3 endpoint supports V1, V2, V3, and both agree on V1)
• Compatible? (Y/N): if compatible, then message exchange can proceed
• Message Transformation: messages are transformed between MSG V1 and MSG V3 during the exchange
ISSU Dual Supervisor –
Catalyst 9400
ISSU Process: Dual Supervisors
• ISSU process leverages SSO/NSF architecture
• Uplinks on both the active and standby SUP are forwarding traffic
• Convergence is less than 200 msec
[Figure: Catalyst 9400 with Active Supervisor and Standby Supervisor in SSO, uplinks on both, and a line card; "Start ISSU"]
C9K ISSU
Dual Supervisor ISSU
3 Step Process (granular control of the upgrade process, with the ability to roll back)
• Install add file <tftp/ftp/flash/disk:*.bin>
• Install activate ISSU
• Install commit
1 Step Process (single command to perform the complete ISSU)
• Install add file <tftp/ftp/flash/disk:*.bin> activate ISSU commit
C9K ISSU Workflow
Dual Supervisor ISSU
1. ISSU started; the image is expanded on Active and Standby (S1 Active on V1, S2 Standby on V1)
2. Standby reloads with the new V2 image (S1 Active on V1, S2 Standby on V2)
   If S2 fails to become Standby, it will revert back to step 1
3. Auto-switchover causes S2 to become the new Active, and S1 reloads with the new V2 image (S1 Standby on V2, S2 Active on V2)
4. The ‘commit’ keyword stops the abort timer
5. ISSU complete (S1 Standby on V2, S2 Active on V2)
Abort timer: it starts during the upgrade and is stopped by the commit. An expired abort timer will revert to step 2 and then step 1.
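The workflow and its two rollback paths can be traced with a small simulation. This is a sketch of the steps as described on the slide, not of the actual IOS XE implementation; the function name, parameters, and step labels are hypothetical.

```python
def run_issu(standby_ok=True, commit_before_timeout=True):
    """Return the visited steps as (description, S1 image, S2 image)."""
    history = [("1 image expanded", "V1", "V1")]
    if not standby_ok:
        # S2 failed to come back as Standby on the new image.
        history.append(("revert to step 1", "V1", "V1"))
        return history
    history.append(("2 standby reloaded", "V1", "V2"))
    history.append(("3 switchover, S1 reloaded", "V2", "V2"))
    if not commit_before_timeout:
        # Abort timer expired: unwind to step 2, then to step 1.
        history.append(("abort timer expired, revert to step 2", "V1", "V2"))
        history.append(("revert to step 1", "V1", "V1"))
        return history
    history.append(("4 commit stops abort timer", "V2", "V2"))
    history.append(("5 complete", "V2", "V2"))
    return history

for step in run_issu():
    print(step)
```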
Stackwise Virtual - ISSU
C9K ISSU
Stackwise Virtual ISSU and Dual Supervisor ISSU
3 Step Process (granular control of the upgrade process, with the ability to roll back)
• Install add file <tftp/ftp/flash/disk:*.bin>
• Install activate ISSU
• Install commit
1 Step Process (single command to perform the complete ISSU)
• Install add file <tftp/ftp/flash/disk:*.bin> activate ISSU commit
Stackwise Virtual ISSU
ISSU Process
[Figure: two Catalyst 9500-24Q switches joined by the Stackwise-Virtual Link and a Dual-Active Detection Link. "Install ISSU" upgrades each switch from 16.8.1 to 16.8.2, with an auto-switchover in between. Traffic sees a 1st sub-second convergence and a 2nd sub-second convergence.]
Enhanced Fast Software Upgrade – Catalyst 9000
Achieving High Availability on Catalyst 9300
Enhanced Fast Software Upgrade
• eFSU provides a mechanism to upgrade and downgrade the software image by segregating the control plane and data plane updates
• It updates the control plane by leveraging the NSF/GR architecture, with a flush and re-learn mechanism to reduce the impact on the data plane
[Figure: Control Plane RIB (Prefix -> Next Hop): 10.0.0.0 -> 10.1.1.1; 10.1.0.0 -> 10.1.1.1; 10.20.0.0 -> 10.1.1.1. Data Plane FIB Table (Prefix -> Next Hop): 10.1.1.1 -> aabbcc:ddee32; 10.1.1.2 -> adbb32:d34e43; 192.168.0.0 -> aa25cc:ddeee8.]
Enhanced Fast Software Upgrade
Regular Upgrade Vs Enhanced Fast Software Upgrade Process
Regular Upgrade
#Install add file image activate commit
Traffic is impacted throughout the upgrade cycle

Enhanced Fast Software Upgrade (16.10.1*)
#Install add file image activate reloadfast enhanced commit
< 30 seconds of traffic impact

* Limited Controlled Availability in 16.10.1
Enhanced Fast Software Upgrade
CLI commands
• FSU is supported only in install mode
• One-step command which activates the fast software upgrade and commits it
9300# install add file flash:cat9k_iosxe.BLD_V1610 activate reloadfast enhanced commit
• Fast reload without software upgrade
9300# Reload Fast Enhanced
Enhanced Fast Software
Upgrade – VSS system
VSS Software Upgrade on Catalyst 6500
Enhanced Fast Software Upgrade (EFSU)
Preparation Steps
1. Before the ISSU software upgrade, VSS Switch-1 and Switch-2 will be running the old software image.
2. Install the new image to the same location on the file systems of both Supervisors.
3. Make sure the boot register is configured for auto boot (0x2102).
Execute Upgrade
1. ISSU Loadversion
[Figure: Switch-1 (VSS Active) and Switch-2 (VSS Standby Hot), each with a WS-X6708-10G line card, joined by the VSL. On Loadversion, Switch-2 reloads (R), passes through STANDBY COLD, and returns to VSS Standby HOT. Bandwidth chart (100% / 50%) across steps 1-4 shows SW2. Legend: old version, new version, R = Reload, SO = Switchover.]
VSS Software Upgrade on Catalyst 6500
Enhanced Fast Software Upgrade (EFSU)
Execute Upgrade
1. ISSU Loadversion
2. ISSU Runversion
3. ISSU Acceptversion (Optional)
[Figure: on Runversion, a switchover (SO) makes Switch-2 the VSS Active; Switch-1 reloads (R), passes through STANDBY COLD, and becomes VSS Standby Hot. Bandwidth chart (100% / 50%) across steps 1-4 shows SW2, SO, SW1. Legend: old version, new version, R = Reload, SO = Switchover.]
VSS Software Upgrade on Catalyst 6500
Enhanced Fast Software Upgrade (EFSU)
Execute Upgrade
1. ISSU Loadversion
2. ISSU Runversion
3. ISSU Acceptversion (Optional)
4. ISSU Commitversion
[Figure: final state of the upgrade, with one switch VSS Active and the other VSS Standby Hot. Bandwidth chart (100% / 50%) across steps 1-4 shows SW2, SO, SW1, SW1. Legend: old version, new version, R = Reload, SO = Switchover.]
VSS Quad SUP SSO - Catalyst 6500
• In-Chassis Standby (ICS) SUP in each switch
• This will keep the unit up and running when the other chassis is reloaded
• We take advantage of this for EFSU
• There are 2 upgrade modes
  • Standard EFSU
  • Staggered EFSU
[Figure: Switch ID 1 with ICA (SSO Act) and ICS; Switch ID 2 with ICA (SSO Stby) and ICS]
EFSU Quad Sup
Normal Quad Sup Upgrade Vs Staggered Quad Sup Upgrade
Normal Quad Sup Upgrade (bandwidth drops from 100% to 50% while a chassis reloads):
1. ISSU Loadversion (whole Standby Sw2 chassis reload)
2. ISSU Runversion (whole Active Sw1 chassis reload)
3. ISSU Acceptversion (Optional)
4. ISSU Commitversion (whole Standby Sw1 chassis reload)

Staggered Quad Sup Upgrade:
1. ISSU Loadversion (2nd Sup on Standby Chassis - ICS)
2. ISSU Loadversion – Step 2 (switchover with the Standby Chassis, LCs reload)
3. ISSU Runversion (Chassis S/O)
4. ISSU Commitversion (ICS on new Standby Chassis)
5. ISSU Commitversion – Step 2 (reload on the new Standby Chassis LC)
Cisco IOS ISSU Summary
• ISSU is a software upgrade/downgrade procedure
• Changes the risk assessment criteria
  • Minimizes the impact of upgrades/downgrades
  • Allows for a trial period with automated rollback
  • Less downtime
• Both software versions must be ISSU compatible in order to achieve an SSO-based upgrade
• Software version compatibility includes
  • 18-month rolling window between software releases of the same train
  • Same license level required between versions
TECCRS-2001
© 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public
136
Graceful Insertion and Removal - GIR
Graceful Insertion and Removal for Catalyst 9000
Isolation of Switch from network
Change window begins.
Start Maintenance
One command!
Pre-change System Snapshot
Graceful Insertion and Removal for Catalyst 9000
Return Switch into network
Change window begins.
Stop Maintenance
One command!
Pre-change System Snapshot
Graceful Insertion and Removal
Isolation of Switch from network
• Isolate a switch from the network in order to perform debugging or an upgrade.
• Isolate: all protocols are gracefully brought down, but the switch itself is not shut down.
• Entering Maintenance Mode:
  • EGP -> IGP in parallel -> L2 (shutdown port)
• Exiting Maintenance Mode:
  • L2 -> IGP in parallel -> EGP
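The ordering above can be sketched in a few lines of Python. This is only an illustration of the sequencing, not how the switch implements it; the `plan` helper and the protocol labels in the usage example are hypothetical.

```python
# Protocol classes in the order the slide gives for entering maintenance mode.
ENTER_ORDER = ("EGP", "IGP", "L2")


def plan(protocols, entering):
    """Group configured protocols into ordered stages.

    protocols maps a (hypothetical) protocol label to its class: "EGP", "IGP" or "L2".
    Members of one stage are handled together, mirroring "IGP in parallel".
    Exiting maintenance mode walks the same classes in reverse.
    """
    order = ENTER_ORDER if entering else tuple(reversed(ENTER_ORDER))
    stages = []
    for cls in order:
        members = sorted(name for name, c in protocols.items() if c == cls)
        if members:
            stages.append((cls, members))
    return stages


# Hypothetical configuration used only for illustration.
configured = {"bgp 65000": "EGP", "ospf 1": "IGP", "isis 1": "IGP", "l2 ports": "L2"}

for cls, members in plan(configured, entering=True):
    print("enter:", cls, members)
for cls, members in plan(configured, entering=False):
    print("exit: ", cls, members)
```

Taking the EGP down first lets external peers move traffic away before the IGP and the Layer 2 ports follow, which is the reason the exit order is the exact reverse.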
Graceful Insertion and Removal
Default and Customizable Templates
• Default Template
  • System Generated Profile based on the switch configuration
9300L#show system mode maintenance template default
System Mode: Normal
default maintenance-template details:
router isis 1
shutdown l2
• Customized Template
  • User Configured Profile based on specific configuration or use case
9300L#show system mode maintenance template test
System Mode: Normal
Maintenance Template test details:
shutdown l2
Graceful Insertion and Removal
Snapshots
• Automatic Snapshots
  • Snapshots are automatically generated when entering and exiting maintenance mode
  • Captures operational data from the running system, such as VLANs and routes
• User Configured Snapshots
  • Snapshots can be collected manually for comparing and troubleshooting
Switch#show system snapshots compare before_maintenance after_maintenance
================================================================================
Feature                      Tag            .before_maintenance  .after_maintenance
================================================================================
[interface]
--------------------------------------------------------------------------------
[Name:Vlan1]
                             packetsinput   181587               **181589**
[Name:GigabitEthernet1/0/3]
                             packetsinput   101531               **101550**
                             broadcasts     80893                **80910**
                             packetsoutput  211568               **211594**
[Name:GigabitEthernet1/0/8]
                             output         00:00:00,            **00:00:04,**
                             packetsinput   6915                 **6918**
                             packetsoutput  57677                **57706**
[Name:GigabitEthernet1/0/17]
                             packetsinput   101528               **101550**
                             broadcasts     80891                **80910**
                             packetsoutput  211570               **211600**
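The comparison shown in that output amounts to a keyed diff of two snapshots. A minimal Python sketch follows, assuming each snapshot has already been parsed into a nested dictionary (interface name to tag to value). That layout and the `compare_snapshots` name are assumptions for illustration; the counter values are copied from the output above.

```python
def compare_snapshots(before, after):
    """Return (interface, tag, old, new) for every tag whose value changed."""
    changes = []
    for name in sorted(before.keys() & after.keys()):
        for tag in sorted(before[name].keys() & after[name].keys()):
            old, new = before[name][tag], after[name][tag]
            if old != new:
                changes.append((name, tag, old, new))
    return changes


# Two interfaces from the slide output, in an assumed dictionary layout.
before = {
    "Vlan1": {"packetsinput": 181587},
    "GigabitEthernet1/0/3": {"packetsinput": 101531, "broadcasts": 80893, "packetsoutput": 211568},
}
after = {
    "Vlan1": {"packetsinput": 181589},
    "GigabitEthernet1/0/3": {"packetsinput": 101550, "broadcasts": 80910, "packetsoutput": 211594},
}

for name, tag, old, new in compare_snapshots(before, after):
    print(f"{name:24} {tag:14} {old:>8} **{new}**")
```

Counters such as packet totals are expected to move between snapshots; the diff is most useful for spotting state that should not change, such as a missing interface or route.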
GIR Summary
• GIR used to isolate a switch
  • Maintenance
  • HW upgrade
  • SW upgrade
• Works well in an L3 end-to-end network
• Order of Maintenance is
  • EGP -> IGP (in parallel) -> L2 shutdown
• HSRP/VRRP can be leveraged without causing issues on switchover
• If StackWise Virtual is deployed, you don’t need to do GIR to upgrade those switches
  • Leverage ISSU with StackWise Virtual instead
Agenda
• Designing High Availability Networks for the Enterprise
• System Hardware and Software Resiliency
• Foundations of the Structured Network Design
  • Modularity, Hierarchy, and Structure
  • Leveraging Hardware-Based Path Restoration
• High Availability Architectures
• High Availability System Recovery Analysis
[Figure: enterprise reference topology. Headquarters (core, distribution, and access switches; wireless LAN controllers; communications managers), data center (Nexus, UCS rack-mount servers and blade chassis, storage, firewalls, Cisco ACE, WAAS Central Manager), Internet edge (Internet routers, firewall, RA-VPN, DMZ switch and servers, guest wireless LAN controller, web and email security appliances), WAN aggregation (WAN routers, WAAS, MPLS WANs, hardware and software VPN), regional site, remote sites, and teleworker/mobile worker.]
Hierarchical network design
High availability using modularity, hierarchy, and structure
• Each layer in hierarchy has a specific role
• Modular topology: building blocks
• Modularity makes it easy to grow, understand, and troubleshoot
• Structure creates small fault domains and predictable network behavior: clear demarcations and isolation
• Promotes load balancing and resilience
[Figure: building blocks of access and distribution layers on either side of the core]
Hierarchical network design
• Core
  • Connectivity, availability and scalability
• Distribution
  • Aggregation for wiring and traffic flows
  • Policy and network control point (FHRP, L3 summarization)
• Access
  • Physical – Ethernet wired 10/100/1000 (802.3z)/mGig (802.3bz); 802.3af (PoE), 802.3at (PoE+), and Cisco Universal PoE (UPOE)
  • Policy enforcement – security: 802.1x, port security, DAI, IPSG, DHCP snooping; identification: CDP/LLDP; QoS: policing, marking, queuing
  • Traffic control – IGMP snooping, broadcast control
Hierarchical network design
Do I need a core layer?
• It is a question of operational complexity and a question of scale
  • n x (n-1) scaling
  • Routing peers
  • Fiber, line cards and port counts ($,€,£)
• Capacity planning considerations
  • Easier to track traffic flows from a block to the common core than to ‘n’ other blocks
• Geographic factors may also influence the design
  • Multi-building interconnections may have fiber limitations
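The "n x (n-1)" scaling point can be made concrete with a little arithmetic. The sketch below is a simplification that treats each distribution block as a single node and assumes a two-switch core; the function names are illustrative and not from the deck.

```python
def full_mesh_links(n):
    """Links needed to connect n blocks directly to each other."""
    return n * (n - 1) // 2


def full_mesh_link_ends(n):
    """Link ends (ports, and routing peerings) across the mesh: n x (n-1)."""
    return n * (n - 1)


def core_links(n, core_switches=2):
    """Links needed when every block connects only to each core switch."""
    return n * core_switches


for n in (2, 4, 8, 16):
    print(f"blocks={n:2}  mesh links={full_mesh_links(n):3}  "
          f"mesh ports={full_mesh_link_ends(n):3}  core links={core_links(n):3}")
```

With a handful of blocks the mesh is cheaper, but its cost grows quadratically while the core design grows linearly, which is the scale argument the slide makes.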
Structured campus network design
• Optimize data load-sharing, redundancy design for best application performance
• Diversify uplink network paths with cross-stack and dual-sup access-layer switches
• Build distributed and full-mesh network paths between Distribution and Access-layer switches
High availability design optimization
of the elements
• Optimize the interaction of the physical redundancy with the network protocols
  • Provide the necessary amount of redundancy
  • Pick the right protocol for the requirement
  • Optimize the tuning of the protocol
• The network looks like this so that we can map the protocols onto the physical topology
What we are trying to avoid!
Optimizing network convergence
Failure detection and recovery
• Optimal high availability network design attempts to
leverage ‘local’ switch fault detection and recovery
• Design should leverage the hardware capabilities of
the switches to detect and recover traffic flows
based on these ‘local’ events
• Design principle –
Hardware failure detection and recovery is both
faster and more deterministic
• Design principle –
Software failure detection mechanisms provide a
secondary, not primary, fault detection and recovery
mechanism in the optimal design
Optimizing network convergence
Layer 1 link redundancy and failure detection
• Direct point to point fiber provides for fast failure detection
• Do not disable auto-negotiation on GigE and 10GigE interfaces
• IEEE 802.3z and 802.3ae link negotiation define the use of Remote Fault
Indicator & Link Fault Signaling mechanisms
• IOS debounce –
  • GigE and 10GigE fiber ports: 10 msec
  • Minimum for copper is 300 msec
• NX-OS debounce – currently 100 msec by default
  • All 1G and 10G SFP / SFP+ based interfaces (MM, SM, CX-1) are changing to a default of 10 msec
  • RJ45-based copper interfaces on NX-OS will remain at 100 msec
• Design principle –
Understand how hardware choices and tuning impact fault detection and response to link failures
Optimizing network convergence
Layer 2 software fault detection (e.g. UDLD)
• While 802.3z and 802.3ae link negotiation provide for L1 fault detection,
hardware ASIC failures can still occur
• UDLD provides an L2 based keep-alive mechanism that confirms bi-directional
L2 connectivity
• Each switch port configured for UDLD will send UDLD protocol packets (at L2)
containing the port’s own device / port ID, and the neighbor’s device / port IDs
seen by UDLD on that port
[Figure: two switches connected by separate Tx and Rx strands, exchanging UDLD keepalives]
• If the port does not see its own device / port ID echoed in the incoming UDLD packets, the link is considered unidirectional and is shut down
• Design principle –
Redundant fault detection mechanisms are required (SW as a backup to HW where possible)
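The echo check described above can be sketched as follows. This models only the "do I see my own ID echoed back" decision; real UDLD also involves timers and operating modes that are not shown. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortId:
    device: str
    port: str


@dataclass(frozen=True)
class UdldPacket:
    sender: PortId             # the neighbor's own device / port ID
    neighbors_seen: frozenset  # PortIds the neighbor has heard on this link


def is_bidirectional(local, received):
    """True if the neighbor's packet echoes our own device / port ID."""
    return local in received.neighbors_seen


sw1 = PortId("SW1", "Te1/1")
sw2 = PortId("SW2", "Te1/1")

# Healthy link: SW2 hears SW1 and echoes its ID back.
healthy = UdldPacket(sender=sw2, neighbors_seen=frozenset({sw1}))
# Broken SW1 -> SW2 strand: SW2 never hears SW1, so nothing is echoed.
broken = UdldPacket(sender=sw2, neighbors_seen=frozenset())

print(is_bidirectional(sw1, healthy))
print(is_bidirectional(sw1, broken))
```

The second case is the one Layer 1 signaling alone may miss: SW1 still receives frames, so only the missing echo reveals that its transmit direction is lost.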
Optimizing network convergence
Layer 2 and 3 – Why use routed interfaces?
L3 routed interfaces allow faster convergence than an L2 switchport with an associated L3 SVI
Routed interface:
21:38:37.042 UTC: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet3/1, changed state to down
21:38:37.050 UTC: %LINK-3-UPDOWN: Interface GigabitEthernet3/1, changed state to down
21:38:37.050 UTC: IP-EIGRP(Default-IP-Routing-Table:100): Callback: route_adjust GigabitEthernet3/1
L2 switchport with SVI:
21:32:47.813 UTC: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet2/1, changed state to down
21:32:47.821 UTC: %LINK-3-UPDOWN: Interface GigabitEthernet2/1, changed state to down
21:32:48.069 UTC: %LINK-3-UPDOWN: Interface Vlan301, changed state to down
21:32:48.069 UTC: IP-EIGRP(Default-IP-Routing-Table:100): Callback: route_adjust Vlan301
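The difference between the two cases is easiest to see by measuring the gap between the first line-protocol message and the EIGRP callback in each log. A small Python sketch, using only the timestamps from the logs above (the `delay_ms` helper is illustrative):

```python
from datetime import datetime, timedelta


def delay_ms(first, last):
    """Whole milliseconds between two HH:MM:SS.mmm timestamps on the same day."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(last, fmt) - datetime.strptime(first, fmt)
    return delta // timedelta(milliseconds=1)


# Routed interface: link down to EIGRP route_adjust.
routed = delay_ms("21:38:37.042", "21:38:37.050")
# L2 switchport with SVI: link down to EIGRP route_adjust on Vlan301.
svi = delay_ms("21:32:47.813", "21:32:48.069")

print(f"routed interface: {routed} msec")
print(f"switchport + SVI: {svi} msec")
```

In these two captures the routing protocol is notified after 8 msec on the routed interface and after 256 msec on the SVI, since the SVI only goes down after the switchport state change has been processed.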
Agenda
• Designing High Availability Networks for the Enterprise
• System Hardware and Software Resiliency
• High Availability Architectures:
  • Enterprise Wired LAN
    • Multilayer Campus Distribution and HA Considerations
    • Simplified Distribution and HA Advantages
    • Extending HA Advantages by Simplifying Virtualization
  • Enterprise Data Center
  • Enterprise Wireless LAN
• High Availability System Recovery Analysis
Optimizing the Layer 2 design – spanning tree
Looped design:
• At least some VLANs span multiple access switches
• Layer 2 loops
• Layer 2 and 3 running over link between distribution
• Blocked links
• More typical of a “classic” data center design
Loop-free design:
• Each access switch has unique VLANs
• No Layer 2 loops
• Layer 3 link between distribution
• No blocked links
• More typical of a campus LAN design
Optimizing the Layer 2 design
Non-STP-blocking topologies converge fastest
• When STP is not blocking uplinks, recovery of
access to distribution link failures is accomplished
based on L2 CAM updates not on the Spanning Tree
protocol recovery
• Time to restore traffic flows is based on:
• Time to detect link failure + Time to purge the HW
CAM table and begin to flood the traffic
• No dependence on external events (no need to wait
for Spanning Tree convergence)
• Behavior is deterministic
Optimizing the Layer 2 design
PVST+, Rapid PVST+, MST
• PVST+ (pre 802.1D-2004) - traditional spanning tree
• Rapid-PVST+ (802.1w) greatly improves the restoration times for any VLAN that requires a topology convergence due to link UP
• Rapid-PVST+ also greatly improves convergence time over BackboneFast for any indirect link failures
• Rapid PVST+
  • Scales to large size (up to 16,000 logical ports)
  • Easy to implement, proven, scales
• MST (802.1s)
  • Permits very large scale STP implementations (up to 75,000 logical ports)
Optimizing the Layer 2 design
Complex topologies take longer to converge
• Time to converge is dependent on the protocol
implemented – 802.1D, 802.1s, or 802.1w
• It is also dependent on –
• Size and shape of the L2 topology (how deep is the tree)
• Number of VLANs being trunked across each link
• Number of logical ports in the VLAN on each switch
• Non-congruent topologies take longer to converge.
Restricting the topology is necessary to reduce
convergence times
• Prune all unnecessary VLANs from trunk configuration
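To see why pruning matters, it helps to estimate the logical port count before and after. The sketch below assumes the common counting rule that each trunk contributes one logical port per VLAN it carries and each access port contributes one; the 16,000 figure is the Rapid PVST+ limit quoted earlier in this deck, and the trunk, VLAN, and access-port counts are hypothetical.

```python
RAPID_PVST_LIMIT = 16_000  # logical port figure quoted for Rapid PVST+ in this deck


def logical_ports(vlans_per_trunk, access_ports):
    """One logical port per VLAN on each trunk, plus one per access port."""
    return sum(vlans_per_trunk) + access_ports


# Hypothetical distribution switch: 40 trunks, 200 access ports.
unpruned = logical_ports([400] * 40, access_ports=200)  # every trunk carries all 400 VLANs
pruned = logical_ports([100] * 40, access_ports=200)    # trunks pruned to the 100 VLANs they need

for label, count in (("unpruned", unpruned), ("pruned", pruned)):
    status = "over" if count > RAPID_PVST_LIMIT else "within"
    print(f"{label}: {count} logical ports ({status} the limit)")
```

Fewer VLANs per trunk also means fewer spanning tree instances to recompute per link event, which is the convergence benefit the slide points at.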
Optimizing the Layer 2 design
STP toolkit – PortFast and BPDU guard
• PortFast is configured on edge ports so that they move quickly to forwarding, bypassing listening and learning, and so that they avoid generating TCN (Topology Change Notification) messages
• BPDU guard can prevent loops by moving PortFast
configured interfaces that receive BPDUs to errdisable state
• BPDU guard prevents ports configured with PortFast from
being incorrectly connected to another switch
• When enabled globally, BPDU guard applies to all interfaces
that are in an operational PortFast state
Switch(config-if)#spanning-tree portfast
Switch(config-if)#spanning-tree bpduguard enable
1w2d: %SPANTREE-2-BLOCK_BPDUGUARD: Received BPDU on port FastEthernet3/1 with BPDU Guard enabled. Disabling port.
1w2d: %PM-4-ERR_DISABLE: bpduguard error detected on Fa3/1, putting Fa3/1 in err-disable state
Optimizing the Layer 2 design
STP best practices for campus
• The root bridge should stay where you put it
• Define the STP primary (and backup) root
• Rootguard
• Loopguard or bridge assurance
• UDLD
• There is a reasonable limit to broadcast and
multicast traffic volumes
• Configure storm control on backup links to
aggressively rate limit broadcast and
multicast
Layer 2 access with Layer 3 distribution
First hop redundancy protocols (FHRP)
• HSRP, GLBP, and VRRP are used to provide a resilient
default gateway / first hop address to end stations
• A group of routers act as a single logical router providing
first hop router redundancy
• Protect against multiple failures
• Distribution switch failure
• Uplink failure
• Default recovery is ~10 Seconds
First Hop Redundancy
Sub-Second Timers Improve Convergence
interface Vlan4
ip address 10.120.4.2 255.255.255.0
standby 1 ip 10.120.4.1
standby 1 timers msec 250 msec 750
standby 1 priority 150
standby 1 preempt
standby 1 preempt delay minimum 180
interface Vlan4
ip address 10.120.4.2 255.255.255.0
glbp 1 ip 10.120.4.1
glbp 1 timers msec 250 msec 750
glbp 1 priority 150
glbp 1 preempt
glbp 1 preempt delay minimum 180
interface Vlan4
ip address 10.120.4.1 255.255.255.0
vrrp 1 description Master VRRP
vrrp 1 ip 10.120.4.1
vrrp 1 timers advertise msec 250
vrrp 1 preempt delay minimum 180
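The effect of the 250 msec hello and 750 msec hold timers configured above can be estimated with a simple model: the standby restarts its hold timer on each hello it receives, so after the active router fails it declares the failure somewhere between (hold minus hello) and hold later. The sketch below is that model only, not a protocol implementation; it compares the sub-second timers with the HSRP defaults of a 3-second hello and 10-second hold.

```python
def detection_window_ms(hello_ms, hold_ms):
    """Best and worst case time from active failure to hold-timer expiry.

    Best case: the failure happens just before the next hello was due.
    Worst case: the failure happens just after a hello was sent.
    """
    if hold_ms <= hello_ms:
        raise ValueError("hold time must be longer than the hello interval")
    return hold_ms - hello_ms, hold_ms


for label, hello, hold in (("sub-second", 250, 750), ("HSRP default", 3000, 10000)):
    best, worst = detection_window_ms(hello, hold)
    print(f"{label}: failure detected after {best} to {worst} msec")
```

Total recovery is longer than this window, since the new active router still has to take over the virtual address, but the hold timer dominates in both cases.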
HSRP preemption—why it is desirable
• Spanning tree root and HSRP primary aligned
• When spanning tree root is re-introduced, traffic will take a two-hop path to HSRP active
• HSRP preemption will allow HSRP to follow the spanning tree topology
FHRP design considerations
Preempt delay needs to be longer than boot time
• HSRP is not always aware of the status of
the entire switch and network
• Ensure that you provide enough time for the
entire system to be up – diagnostics (full or
partial), L1 (line cards), L2 (STP),
L3 (IGP convergence)
• Tune delay and preempt delay conservatively
as the network is already forwarding data
interface Vlan402
. . .
standby delay minimum 60 reload 600
standby 1 ip 10.147.102.1
standby 1 timers msec 250 msec 750
standby 1 priority 110
standby 1 preempt delay minimum 60 reload 600
standby 1 authentication ese
standby 1 name HSRP-Voice
hold-queue 2048 in
‘standby delay’ controls how long the interface needs to be up before HSRP starts, and ‘preempt delay’ controls how long to wait after HSRP establishes a neighbor relationship. You should configure both.
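A quick sanity check for a chosen preempt delay is to add up how long each bring-up stage takes on the platform and confirm the delay exceeds the total. The sketch below assumes you have measured those stages yourself; the stage timings are hypothetical, while 600 and 180 are preempt delay values that appear in this deck's configurations.

```python
def preempt_delay_is_safe(preempt_delay_s, stage_seconds):
    """True if the preempt delay outlasts the whole measured bring-up sequence."""
    return preempt_delay_s > sum(stage_seconds.values())


# Hypothetical measured timings for one switch after reload, in seconds.
stages = {
    "diagnostics": 180,
    "line cards (L1)": 120,
    "STP (L2)": 30,
    "IGP convergence (L3)": 30,
}

for delay in (600, 180):
    verdict = "ok" if preempt_delay_is_safe(delay, stages) else "too short"
    print(f"preempt delay {delay}s vs {sum(stages.values())}s bring-up: {verdict}")
```

A delay that is too short lets the rebooted switch reclaim the active role before it can actually forward traffic upstream.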
Sub-second timer considerations
HSRP, GLBP, OSPF, PIM
• Evaluate your network before implementing any sub-second timers
• Certain events can impact the ability of the switch to process sub-
second timers
• Application of large ACL
• OIR of line cards in Catalyst 6500/6800
• The volume of control plane traffic can also impact the ability to process
• 250 / 750 msec GLBP & HSRP timers are only valid in designs with less
than 150 VLAN instances (Catalyst 6x00 in the distribution)
• Spanning Tree size
FHRP design considerations—
asymmetric routing (unicast flooding)
• Alternating HSRP Active between distribution switches can be used for upstream load balancing
• This can cause a problem with unicast flooding
• ARP timer defaults to four hours and CAM timer defaults to five minutes
• ARP entry is valid, but no matching L2 CAM table entry exists
• In many cases when the HSRP standby needs to forward a frame, it will have to unicast flood the frame since its CAM table is empty
FHRP design considerations—
asymmetric routing (unicast flooding) solutions
• Using a ‘V’ based design with unique voice and data VLANs per access switch, this problem has no user impact
• Don’t deploy stacking switches (i.e., daisy-chained switches) that depend on spanning tree for managing interconnects in the stack
• Tune the ARP timer to 270 seconds and leave the CAM timer at its default; if there are more than 10,000 ARP entries, change the CAM timer instead
• Deploy MultiChassis EtherChannel with Virtual Switching System (VSS or vPC) in the distribution block
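The timer recommendation follows from comparing the two aging values. While an ARP entry is still valid but the matching CAM entry has aged out, frames to that host are flooded. A minimal sketch of that exposure window, using the defaults quoted in this deck (four hours and five minutes) and the tuned ARP value of 270 seconds; the function name is illustrative.

```python
def stale_window_s(arp_timeout_s, cam_aging_s):
    """Seconds per ARP cycle in which ARP is valid but the CAM entry may have aged out."""
    return max(0, arp_timeout_s - cam_aging_s)


DEFAULT_ARP_S = 4 * 60 * 60  # four hours
DEFAULT_CAM_S = 5 * 60       # five minutes
TUNED_ARP_S = 270

print("defaults:", stale_window_s(DEFAULT_ARP_S, DEFAULT_CAM_S), "seconds of possible flooding")
print("tuned ARP:", stale_window_s(TUNED_ARP_S, DEFAULT_CAM_S), "seconds of possible flooding")
```

With the ARP timer below the CAM timer, the ARP refresh makes the host answer before its CAM entry expires, so the switch relearns the MAC address in time.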
Even with faster convergence from RPVST+ we still have to wait for FHRP convergence
• FHRP protocol based forwarding topologies
  • Load balancing based on Per-Port or Per-VLAN
• Protocol-based fault detection and recovery –
  • Recommended to configure per-VLAN aggressive timers to protect user experience impact within <1 second boundary
• Limited network scale for system reliability
• Sub-second protocol timers must be avoided on SSO capable network
[Figure: FHRP Active and FHRP Standby distribution switches; bar chart "HSRP Config, SVI - Aggressive Time" showing convergence (msec, axis 0 to 1000) for 6500-Sup2T and 4500-Sup7E]
interface Vlan2
ip address 10.120.2.2 255.255.255.0
standby 1 ip 10.120.2.1
standby 1 timers msec 250 msec 750
standby 1 priority 150
standby 1 preempt
standby 1 preempt delay minimum 180
Multilayer campus network design—
It is a good solid design, but…
• Utilizes multiple control protocols
  • Spanning tree (802.1w), HSRP / GLBP, EIGRP, OSPF
• Convergence is dependent on multiple factors –
  • FHRP – 900 msec to 9 seconds
  • Spanning tree – Up to 50 seconds
• Load balancing –
  • Asymmetric forwarding
  • HSRP / VRRP – per subnet
  • GLBP – per host
• Unicast flooding in looped design
• STP, if it breaks badly, has no inherent mechanism to stop the loop
[Figure: bar chart of convergence time, axis 0 to 60]
Looped PVST+ (No RPVST+): 50
Non-looped Default FHRP: 9.1
Non-looped Sub-Second FHRP: 0.91
Campus wired LAN design
Option 1: Traditional multilayer campus (BRKCRS-2031)
Logical topology: L3 core/dist., L2 dist./acc.
Physical topology: 2 core, 2 dist./acc.
• Common design since the 1990s
• Complex configurations (prone to human error) related to spanning-tree, load balancing, unicast and multicast routing
• Requires heavy performance tuning resulting from reliance on FHRPs (HSRP, VRRP, GLBP)
Comparison criteria:
• Survives device and link failures
• Easy mitigation of Layer 2 looping concerns
• Rapid detection/recovery from failures
• Layer 2 across all access blocks within distribution
• Device-level CLI configuration simplicity
• Automated network and policy provisioning included
Transforming multilayer campus
Before: Layer 3 distribution with Layer 2 access
IGP
IGP
Layer 3
Layer 2
Simplification with routed access design
After: Layer 3 distribution with Layer 3 access
[Figure: IGP running between core, distribution, and access, with the Layer 3 / Layer 2 boundary at the access switches]
• Move the Layer 2 / 3 demarcation to the network edge
• Leverages Layer 2 only on the access ports, but builds a Layer 2 loop-free network
• Design motivations – Simplified control plane, ease of troubleshooting, highest availability
Routed access advantages
Simplified control plane
• Simplified Control Plane
  • No STP feature placement (root bridge, loopguard, …)
  • No default gateway redundancy setup/tuning (HSRP, VRRP, GLBP ...)
  • No matching of STP/HSRP priority
  • No asymmetric flooding
  • No L2/L3 multicast topology inconsistencies
  • No Trunking Configuration Required
• L2 Port Edge features still apply:
  • Spanning Tree Portfast
  • Spanning Tree BPDU Guard
  • Port Security, DHCP Snooping, DAI, IPSG
  • Storm Control
Routed access advantages
Simplified network recovery
• Routed access network recovery is
dependent on L3 re-route
• Time to restore upstream traffic flows
is based on ECMP re-route
• Time to detect link failure
• Process the removal of the lost routes
from the SW RIB
• Update the HW FIB
• Time to restore downstream flows is
based on a routing protocol re-route
• Time to detect link failure
• Time to determine new route
• Process the update for the SW RIB
• Update the HW FIB
Routed access advantages
Faster convergence times
• RPVST+ convergence times dependent on FHRP tuning
  • Proper design and tuning can achieve sub-second times
• EIGRP converges <200 msec
• OSPF converges <200 msec with LSA and SPF tuning
[Figure: bar chart of upstream convergence (axis 0 to 2) for RPVST+ FHRP, OSPF, and EIGRP]
Routed access advantages
A single router per subnet: simplified multicast
• Layer 2 access has two multicast routers per access subnet, RPF checks and split roles between routers
• Routed access has a single multicast router which simplifies multicast topology and avoids RPF check altogether
Routed access advantages
Ease of troubleshooting
• Routing troubleshooting tools
  • show ip route / show ip cef
  • Traceroute
  • Ping and extended pings
  • Extensive protocol debugs
  • IP SLA from the Access Layer
• Consistent troubleshooting: access, dist, core
• Failure differences
  • Routed topologies fail closed (i.e., neighbor loss)
  • Layer 2 topologies fail open (i.e., broadcast and unknowns flooded)
switch#sh ip cef 192.168.0.0
192.168.0.0/24
nexthop 192.168.1.6 TenGigabitEthernet9/4
Why isn’t routed access deployed everywhere?
Routed access design constraints
• VLANs don’t span across multiple wiring closet switches/switch stacks
  • Does this impact your requirements?
• IP addressing changes: more DHCP scopes and subnets of smaller sizes increase management and operational complexity
• Deployed access platforms must be able to support routing features
[Figure: access switches with L3 uplinks to the distribution layer]
Campus wired LAN design
Option 2: Layer 3 routed access (BRKCRS-3036)
Logical topology: L3 everywhere, L2 edge only
Physical topology: 2 core, 2 dist./acc.
• Complexity reduced for Layer 2 (STP, trunks, etc.)
• Elimination of FHRP and associated timer tuning
• Requires more Layer 3 subnet planning; might not support Layer 2 adjacency requirements
Comparison criteria:
• Survives device and link failures
• Easy mitigation of Layer 2 looping concerns
• Rapid detection/recovery from failures
• Layer 2 across all access blocks within distribution
• Device-level CLI configuration simplicity
• Automated network and policy provisioning included
Traditional multilayer campus design
Simplified end-to-end VSS design
Data Center
Comparison – standalone (multilayer) versus VSS
Unified system architecture
Catalyst VSS setup
LAN distribution layer
[Figure: Switch 1 and Switch 2 joined by the VSL]
1) Prepare standalone switches for VSS
Switch 1:
Router#conf t
Router(config)# hostname VSS-Sw1
VSS-Sw1(config)#switch virtual domain 100
VSS-Sw1(config-vs-domain)# switch 1
Switch 2:
Router#conf t
Router(config)# hostname VSS-Sw2
VSS-Sw2(config)#switch virtual domain 100
VSS-Sw2(config-vs-domain)# switch 2
2) Configure Virtual Switch Link
Switch 1:
VSS-Sw1(config)#interface port-channel 63
VSS-Sw1(config-if)#switch virtual link 1
VSS-Sw1(config)#interface range tengigabit 5/4-5
VSS-Sw1(config-if)#channel-group 63 mode on
VSS-Sw1(config-if)#no shutdown
Switch 2:
VSS-Sw2(config)#interface port-channel 64
VSS-Sw2(config-if)#switch virtual link 2
VSS-Sw2(config)#interface range tengigabit 5/4-5
VSS-Sw2(config-if)#channel-group 64 mode on
VSS-Sw2(config-if)#no shutdown
3) Validate Virtual Switch Link operation
VSS-Sw1# show etherchannel 63 ports
AND
VSS-Sw2# show etherchannel 64 ports
Ports in the group:
-------------------
Port: Te5/4
Port state = Up Mstr In-Bndl
Port: Te5/5
Port state = Up Mstr In-Bndl
Catalyst VSS setup
LAN distribution layer
[Figure: Switch 1 and Switch 2 joined by the VSL to form VSS]
4) Enable virtual mode operation
VSS-Sw1# switch convert mode virtual
Do you want to proceed? (yes/no) yes
VSS-Sw2# switch convert mode virtual
Do you want to proceed? (yes/no) yes
• The switch now renumbers from y/z to x/y/z
• When process is complete, save configuration when prompted, switch reloads and forms VSS.
5) Verify operation and rename switch
VSS-Sw1# show switch virtual redundancy
• Check for both switches visible, Supervisors in SSO mode, second Supervisor in Standby-hot status
VSS-Sw1(config)# hostname VSS
VSS(config)#
6) Configure dual-active detection
• Connect a Gigabit Link between the VSS switches
VSS(config)# switch virtual domain 100
VSS(config-vs-domain)# dual-active detection fast-hello
VSS(config)# interface range gigabit1/1/24, gigabit2/1/24
VSS(config-if-range)# dual-active fast-hello
VSS(config-if-range)# no shut
*Feb 25 14:28:39.294: %VSDA-SW2_SPSTBY-5-LINK_UP: Interface Gi2/1/24 is now dual-active detection capable
*Feb 25 14:28:39.323: %VSDA-SW1_SP-5-LINK_UP: Interface Gi1/1/24 is now dual-active detection capable
7) Configure the system virtual MAC address
VSS(config)# switch virtual domain 100
VSS(config-vs-domain)# mac-address use-virtual
Configured Router mac address is different from operational value. Change will take effect after config is saved and the entire Virtual Switching System (Active and Standby) is reloaded.
BRKCRS-3035: Advanced Enterprise Campus Design: Virtual Switching System (VSS)
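The renumbering from y/z to x/y/z in step 4 simply prefixes the switch number to the existing slot/port, which is why the dual-active links in step 6 appear as 1/1/24 and 2/1/24. A small Python sketch of that mapping, useful when translating pre-conversion interface lists in documentation or scripts; the `renumber` helper and its regular expression are illustrative assumptions, not a Cisco tool.

```python
import re

# Interface type followed by slot/port, e.g. "TenGigabitEthernet5/4" or "Gi1/24".
_STANDALONE = re.compile(r"^([A-Za-z-]+)(\d+)/(\d+)$")


def renumber(interface, switch_id):
    """Map a standalone slot/port name to its switch/slot/port name in the VSS."""
    match = _STANDALONE.match(interface)
    if match is None:
        raise ValueError(f"not a slot/port interface name: {interface!r}")
    name, slot, port = match.groups()
    return f"{name}{switch_id}/{slot}/{port}"


print(renumber("TenGigabitEthernet5/4", 1))
print(renumber("Gi1/24", 2))
```

Names that already carry three numbers are rejected rather than renumbered twice.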
“Is there an easier way to enable VSS?”
Use Easy VSS to configure from a single console port
Prerequisites:
• Switches running same software with feature support (C4K:3.6E, C6K:15.2(1)SY1)
• Links to be used for VSLs up with CDP communication
1) C6K - Enable Easy VSS feature, convert, and reload
VSS-Sw1# switch virtual easy
VSS-Sw1# switch convert mode easy links
?
Local Interface
Remote Interface Hostname
TenGiigabit3/4
TenGigabit3/4
VSS-Sw2
TenGigabiti4/4 TenGigabit4/4
VSS-Sw2
VSS-Sw1# switch convert mode easy links T3/4 T4/4 domain 100
VSS-Sw1(config)# switch virtual domain 100
VSS-Sw1(config-vs-domain)# mac-address use-virtual
VSS-Sw1# copy running-config startup-config
VSS-Sw1# reload
2) Verify operation and rename switch
VSS-Sw1# show switch virtual redundancy
• Check for both switches visible, Supervisors in SSO
mode, second Supervisor in Standby-hot status
VSS-Sw1(config)#
VSS(config)#
hostname VSS
VSS
VSS-Sw1
VSS-Sw2
VSL
3) Configure dual-active detection
• Connect a Gigabit Link between the VSS switches
VSS(config)# switch virtual domain 100
VSS(config-vs-domain)# dual-active detection fast-hello
VSS(config)# interface range gigabit1/1/24, gigabit2/1/24
VSS(config-if-range)# dual-active fast-hello
VSS(config-if-range)# no shut
VSS dual supervisor inter-chassis redundancy
• VSS dual supervisor (single sup per chassis) supports inter-chassis SSO redundancy.
• Single in-chassis supervisor - SSO Active or Standby role.
• Stateful SSO synchronization and redundancy between virtual-switches
• Single supervisor system design:
  • Supervisor switchover requires chassis reset, including all linecards and service modules
  • Network capacity reduced until system returns to operational state
• Consistent redundancy design between modular Catalyst 6500E/6800/4500E and fixed Catalyst 4500X/3850 systems
[Figure: two-chassis VSS joined by the VSL. The Active and Standby roles swap between chassis after a switchover, with reduced capacity during NSF recovery]
Catalyst quad-supervisor NSF/SSO redundancy
• Dual in-chassis supervisors, each in different redundancy modes
  • In-chassis Active Supervisor (ICA) – SSO Active OR Standby-Hot (switchover target)
  • In-chassis Standby Supervisor (ICS) – Standby-Hot (Chassis)
• VSS Quad-Sup protects network availability and capacity with dual redundancy domains: between chassis and within chassis
• Stateful SSO synchronization between multiple redundancy domains
• Complete system configuration and parameters synchronization
• Catalyst 6x00 with Sup6T or Sup2T pairs, Catalyst 4500E with Sup8E, 7E, and 7L-E
[Figure: SW1 and SW2 joined by the VSL. Inter-chassis sup redundancy runs between the chassis, intra-chassis sup redundancy within each chassis. SW1: ICA – SSO Active, ICS – Standby-Hot (Chassis). SW2: ICA – SSO Standby, ICS – Standby-Hot (Chassis)]
6500-VS4O#show switch virtual redundancy | inc Switch|Mode|Current|Fabric
My Switch Id = 1
Peer Switch Id = 2
Configured Redundancy Mode = sso
Operating Redundancy Mode = sso
Switch 1 Slot 6 Processor Information :
Current Software state = ACTIVE
Fabric State = ACTIVE
Switch 1 Slot 5 Processor Information :
Current Software state = STANDBY HOT (CHASSIS)
Fabric State = ACTIVE
Switch 2 Slot 6 Processor Information :
Current Software state = STANDBY HOT (switchover target)
Fabric State = ACTIVE
Switch 2 Slot 5 Processor Information :
Current Software state = STANDBY HOT (CHASSIS)
Fabric State = ACTIVE
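The redundancy states in the output above can be checked programmatically. The following is a minimal Python sketch, assuming the output has already been collected as text; the function names `parse_supervisors` and `quad_sup_healthy` are hypothetical, and the expected state strings are taken from the sample output on this slide rather than from a complete list of possible states.

```python
import re

# Sample text modelled on the slide's filtered "show switch virtual redundancy" output.
SAMPLE = """\
Switch 1 Slot 6 Processor Information :
Current Software state = ACTIVE
Fabric State = ACTIVE
Switch 1 Slot 5 Processor Information :
Current Software state = STANDBY HOT (CHASSIS)
Fabric State = ACTIVE
Switch 2 Slot 6 Processor Information :
Current Software state = STANDBY HOT (switchover target)
Fabric State = ACTIVE
Switch 2 Slot 5 Processor Information :
Current Software state = STANDBY HOT (CHASSIS)
Fabric State = ACTIVE
"""


def parse_supervisors(text):
    """Return {(switch, slot): software_state} from the CLI text."""
    sups = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.match(r"Switch (\d+) Slot (\d+) Processor Information", line)
        if header:
            current = (int(header.group(1)), int(header.group(2)))
            continue
        state = re.match(r"Current Software state = (.+)", line)
        if state and current is not None:
            sups[current] = state.group(1).strip()
    return sups


def quad_sup_healthy(sups):
    """One ACTIVE, one switchover target, two in-chassis standbys (assumed healthy pattern)."""
    states = list(sups.values())
    return (
        len(states) == 4
        and states.count("ACTIVE") == 1
        and sum("switchover target" in s for s in states) == 1
        and sum("CHASSIS" in s for s in states) == 2
    )


supervisors = parse_supervisors(SAMPLE)
healthy = quad_sup_healthy(supervisors)
```

A check like this could run after a maintenance window to confirm that all four supervisors returned to their expected roles.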
Understanding Virtual Switch Link
• Inter-chassis system link
  • No network protocol operations
  • Invisible in network topology
  • Transparent to network level troubleshooting
• VSL control link
  • Carries all system internal control traffic
  • Single member-link; dynamic election during boot
  • Shared interface for network/data traffic
  • < 50 msec switchover to pre-determined VSL path
• Payload overhead
  • Every single packet encapsulated with Virtual Switch Header (VSH)
  • Non-bridgeable and non-routable.
  • VSL must be directly connected between two virtual switch systems
[Figure: VSL with a control link on each chassis. Frame format on the VSL: VSH | L2 | L3 | Payload | CRC]
4500E-VSS#show switch virtual link
Executing the command on VSS member switch role = VSS Active, id = 1
VSL Status : UP
VSL Uptime : 1 day, 1 hour, 16 minutes
VSL Control Link : Te1/3/1
Executing the command on VSS member switch role = VSS Standby, id = 2
VSL Status : UP
VSL Uptime : 1 day, 1 hour, 17 minutes
VSL Control Link : Te2/3/1
6500E/6800/4500E VSS dual sup – VSL design
Two Cisco recommended designs
Profile 1 – Two VSL links on Supervisor
• Cost-effective solution to leverage both uplinks. Continue to use non-VSL capable linecard for 10G core connection.
• Redundant fibers connect through common fabric and ASICs, which could leave system stability vulnerable.
• Optimal and preset VSL parameters – Load-Balancing, QoS, HA, Traffic-engg, Dual-Active etc.
• Restricted to bundling 2 x VSL ports, or 20G switching capacity, per virtual-switch node.
Profile 2 – Diversified VSL between Supervisor and VSL capable Linecard
• Redundant and diversified fibers between supervisor and next-gen VSL capable linecards.
• Same design as Profile 1 but increases system reliability, as the VSL ports are diversified across different fabric/ASICs.
• Optimal and preset VSL parameters – Load-Balancing, QoS, HA, Traffic-engg, Dual-Active etc.
• Flexible to scale up to 8 x VSL for high-density systems to aggregate uplink, service modules, single-home etc.
[Figure: each profile shown as two chassis with supervisors (Sup) joined by the VSL]
6500E/6800 VSS quad-supervisor VSL design
RPR-WARM
Sup2T/6T quad-supervisor NSF/SSO VSL redundancy
• Same design profile – 1 dual sup
• Flexible to increase VSL capacity
• Continue to leverage existing non-VSL 10G linecard for uplink connection
• Retains all original VSL benefits
• Vulnerable design during any supervisor self-recovery fault incident
[Figure: SW1 and SW2, with Sup-1 through Sup-4, joined by the VSL]
6500E/6800 VSS quad-supervisor VSL design
SSO advantage
Sup2T/6T quad-supervisor NSF/SSO VSL redundancy
Profile 1 style VSL on Quad-Sup
• Same design profile – 1 dual sup
• Flexible to increase VSL capacity
• Continue to leverage existing non-VSL 10G linecard for uplink connection
• Retains all original VSL benefits
• Vulnerable design during any supervisor self-recovery fault incident
Recommended: Full-Mesh VSL on Quad-Sup
• Highly redundant and cost-effective VSL design.
• Increases overall VSL capacity
• Maintains 20G VSL capacity during supervisor failure.
• Increases network reliability by minimizing the dual-active probability
[Figure: two panels, each showing SW1 and SW2 with Sup-1 through Sup-4 joined by the VSL; the recommended panel shows the full-mesh VSL]
4500X VSS – VSL network design
• Fixed switch hardware architecture:
  • 24 or 48 10G/1G front panel ports
  • 8 port 1G/10G Pluggable Uplink Module
• Any ports can be bundled into VSL EtherChannel.
• Recommended to use front-panel ports to build VSL connections. Minimizes system instability during accidental uplink module OIR/reset
• Split VSL member-link interfaces across different internal ASIC groups:

ASIC Group               4500X – 16 Port ASIC to Port Mapping   4500X – 32 Port ASIC to Port Mapping
Internal Stub ASIC – 1   1 – 8                                  1 – 8
Internal Stub ASIC – 2   9 – 16                                 9 – 16
Internal Stub ASIC – 3   N/A                                    17 – 24
Internal Stub ASIC – 4   N/A                                    25 – 32

• Consistent software design and VSL function as 4500E
[Figure: 4500-X SW-1 and SW-2 joined by a VSL built from Ten1/1/1 and Ten1/1/9 on SW-1 and Ten2/1/1 and Ten2/1/9 on SW-2, with front panel and uplink ports labelled]
Cisco Catalyst platforms and transitions
Where is VSS?
• Cisco Catalyst 9200 Series, 9300 Series, 9400 Series, 9500 Series
• Cisco Catalyst 2960X/XR Series, 3850 copper, 4500E Series, 3850F/4500-X, 6840-X/6880-X
[Figure: Catalyst platforms arranged from access switching to backbone switching]
“How can I simplify my distribution without VSS?”
StackWise Virtual
• Fixed switch hardware architecture with distributed forwarding architecture
• First available on WS-3850-48XS
• Available on Catalyst 3850-24XS, 3850-12XS, 9500-16X, 9500-40X, 9500-12Q, 9500-24Q,
9500-48Y4C, 9500-24Y4C, 9500-32QC, 9500-32C, 9404R/9407R Sup1/Sup1-XL
(check software release notes for versions and additional hardware)
• StackWise Virtual Link between two nodes (10Gb or 40Gb)
• Both StackWise Virtual members must have consistent Cisco IOS-XE and license
[Figure: StackWise Virtual pair of two WS-3850-48XS switches at the distribution layer, joined by the SVL and a fast-hello link, with the access layer below]
Cisco StackWise Virtual (SWV) setup
LAN distribution layer
1) Prepare standalone switches for SWV
3850-D1#conf t
3850-D1(config)# stackwise-virtual
3850-D1(config-stackwise-vir)# domain <1-255>
3850-D2#conf t
3850-D2(config)# stackwise-virtual
3850-D2(config-stackwise-vir)# domain <1-255>
2) Configure StackWise Virtual links
*Automatically creates EtherChannel (128)
3850-D1(config)# interface range FortyG x/y/z – x/y/z
3850-D1(config-if)# stackwise-virtual link 1
3850-D2(config)# interface range FortyG x/y/z – x/y/z
3850-D2(config-if)# stackwise-virtual link 1
3) Configure dual-active detection (fast hello)
3850-D1(config)# interface range TenG x/y/z – x/y/z
3850-D1(config-if)# stackwise-virtual dual-active-detection
3850-D2(config)# interface range TenG x/y/z – x/y/z
3850-D2(config-if)# stackwise-virtual dual-active-detection
4) Save and reload to convert
3850-D1# copy run start
3850-D1# reload
3850-D2# copy run start
3850-D2# reload
Note: Maximum of 8 SVL member links and 4 dual active detection links
[Figure: 3850-D1 and 3850-D2 joined by the SVL]
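Because the same four steps are repeated on both switches, the per-switch commands lend themselves to templating. The sketch below is one possible way to render them in Python; it uses only the commands shown on this slide, the function name `render_swv_config` and the interface names in the usage lines are hypothetical, and the link limits come from the slide's note.

```python
def render_swv_config(domain, svl_ports, dad_ports, link_id=1):
    """Render StackWise Virtual commands for one switch as a list of lines.

    Limits follow the slide's note: at most 8 SVL member links and
    4 dual-active detection links. Interface names are supplied by the caller.
    """
    if not 1 <= domain <= 255:
        raise ValueError("domain must be in the range 1-255")
    if not 1 <= len(svl_ports) <= 8:
        raise ValueError("between 1 and 8 SVL member links are supported")
    if len(dad_ports) > 4:
        raise ValueError("at most 4 dual-active detection links are supported")

    lines = ["stackwise-virtual", f" domain {domain}"]
    for port in svl_ports:
        lines += [f"interface {port}", f" stackwise-virtual link {link_id}"]
    for port in dad_ports:
        lines += [f"interface {port}", " stackwise-virtual dual-active-detection"]
    return lines


# Hypothetical interface names, for illustration only.
config = render_swv_config(
    domain=10,
    svl_ports=["FortyGigabitEthernet1/1/1", "FortyGigabitEthernet1/1/2"],
    dad_ports=["TenGigabitEthernet1/0/1"],
)
print("\n".join(config))
```

The rendered lines would still need to be applied on each switch, followed by the save and reload in step 4.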
Virtual Switch Link capacity planning
• Plan VSL capacity to reduce congestion points, handle failures and specific configurations
• Supported VSL interface types:
  • Catalyst 6500E/6800 : 10G and 40G
  • Catalyst 4500E/4500X : 1G and 10G
  • Catalyst 3850 : 1G, 10G, and 40G
• Four major factors:
  • Total uplink bandwidth per chassis. Ability to handle data re-route during uplink failures without network congestion
  • Handling egress data to single-homed devices (non-recommended design)
  • Catalyst 6500E/6800 services module integration may require centralized forwarding on remote chassis
  • Remote network services such as SPAN
• Up to 8 member-links supported in VSL EtherChannel. (Implement in powers of 2 for optimal forwarding decision)
[Figure: VSS pair joined by the VSL, with an analyzer attached for remote SPAN]
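The sizing guidance above (cover the bandwidth that may be re-routed, bundle in powers of 2, stay within 8 member links) can be turned into a small calculation. This is a rough Python sketch under the assumption that the required VSL bandwidth is already known, for example the total uplink bandwidth of one chassis; the function name `vsl_member_links` is hypothetical.

```python
import math


def vsl_member_links(required_gbps, link_gbps, max_links=8):
    """Smallest power-of-2 number of VSL member links covering required_gbps."""
    needed = max(1, math.ceil(required_gbps / link_gbps))
    links = 1
    while links < needed:
        links *= 2  # keep the bundle size a power of 2
    if links > max_links:
        raise ValueError("requirement exceeds the 8 member-link VSL limit")
    return links


# Example: 30 Gbps of uplink traffic to re-route over 10G VSL member links.
print(vsl_member_links(30, 10))
```

With 30 Gbps to carry over 10G links, three links would be enough on raw bandwidth, but the power-of-2 guidance rounds the bundle up to four.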
VSS – single-homed connections
• Independent of system mode (VSS or Standalone), single-homed connections are not recommended
• Cannot leverage any distributed VSS architecture benefits.
• Non-congruent Layer 2 or Layer 3 network design with:
  • Centralized network control-plane processing over VSL
  • Asymmetric forwarding plane. Ingress data may traverse the VSL interface and oversubscribe the ports
• Single point of failure in various faults: Link/SFP/module failure, SSO switchover, ISSU etc.
• Cannot be a trusted switch for dual-active detection purposes
[Figure: SW-1 (ACTIVE) and SW-2 (HOT-STANDBY) joined by the VSL, with access switches A1 and A2 each single-homed]
VSS – multi-homed physical connections
• Redundant network paths per system deliver the best architectural approach
• Parallel Layer 2 paths between bridges build a sub-optimal topology:
  • Creates STP loop. Except for root port, all other ports are in blocking mode
  • Slow network convergence
• Parallel Layer 3 doubles control-plane processing load:
  • ACTIVE switch needs to handle control plane load of local and remote-chassis interfaces
  • Multiple unicast and multicast neighbor adjacencies
  • Redundant routing and forwarding topologies
[Figure: SW-1 (ACTIVE) and SW-2 (HOT-STANDBY) joined by the VSL, with access switches A1 and A2 multi-homed over parallel links that form an STP loop]
VSS – Multichassis EtherChannel
• MEC enables:
  • Simplified STP loop-free network topology
  • Consistent L3 control-plane and network design as traditional Standalone mode system
  • Deterministic sub-second network recovery
• MECs can be deployed in two modes – Layer 2 or Layer 3
• MEC scalability support varies by system:
  • Catalyst 6500E supports 512 L2/L3 MEC
  • Catalyst 4500E and 4500X support 256 L2 MEC
  • Catalyst 3850-48XS supports 127 L2/L3 MEC
[Figure: SW-1 (ACTIVE) and SW-2 (HOT-STANDBY) joined by the VSL, with access switches A1 and A2 attached through MEC]
Simplified STP network topology with VSS
• VSS simplifies STP. VSS does not eliminate STP. Never disable STP.
• Multiple parallel Layer 2 network paths build an STP loop network
• VSS with MEC builds a single loop-free network to utilize all available links.
• Distributed EtherChannel minimizes STP complexities compared to standalone distribution design
• STP toolkit should be deployed to safeguard the multilayer network
[Figure: standalone distribution with an STP BLK port compared with VSS using a loop-free L2 EtherChannel]
Traditional distribution design
Redundant design with sub-optimal topology and complex operation
Stabilize network topology with several L2 features:
• STP Primary and Backup Root Bridge
• Rootguard
• Loopguard or Bridge Assurance
• STP Edge Protection
Protocol restricted forwarding topology
• STP FWD/ALT/BLK Port
• Single Active FHRP Gateway
• Asymmetric forwarding
• Unicast Flood
Protocol dependent driven network recovery:
• PVST/RPVST+ and FHRP Tuning
Resiliency versus performance/scale tradeoff: HSRP
• Multichassis EtherChannel based forwarding topologies
  • Per-Flow Load Balancing based on Layer 2 to Layer 4 + VLANs
interface Vlan2
ip address 10.120.2.2 255.255.255.0
standby 1 ip 10.120.2.1
standby 1 timers msec 250 msec 750
standby 1 priority 150
standby 1 preempt
standby 1 preempt delay minimum 180
[Figure: FHRP Active and FHRP Standby distribution pair. Chart: convergence (msec), 0 to 1000, for SVI with aggressive timers on 6500-Sup2T and 4500-Sup7E]
Resiliency versus performance/scale tradeoff: VSS
• Multichassis EtherChannel based forwarding topologies
  • Per-Flow Load Balancing based on Layer 2 to Layer 4 + VLANs
• Hardware-Based Fault Detection and Recovery
  • Deterministic network convergence with simplistic approach
• Increases Network Scale for system reliability
• No reliability compromise to enable path and system-level Quad-Sup redundancy
[Figure: VSS-SW1 distribution. Chart "Multilayer VSS Network Scale And Convergence": convergence (msec), 0 to 1000, for SVI with aggressive timers and SVI at the validated limit on 6500-Sup2T and 4500-Sup7E]
PIM timers also need tuning
• Multicast recovery depends on PIM DR failure detection in Layer 2 network
• PIM routers exchange PIM expiration time in query message
• DR Failure Detection: ~90 seconds (30 sec. hello * 3 multiplier)
• Tune PIM query interval to sub-second, as with FHRP, for faster multicast convergence
• Sub-second protocol timers must be avoided on SSO capable network
interface Vlan2
ip pim sparse-mode
ip pim query-interval 250 msec
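The detection-time arithmetic on this slide is the query interval multiplied by the hold multiplier. A minimal Python sketch, using the slide's multiplier of 3 as the default (the function name is hypothetical):

```python
def dr_failure_detection_seconds(query_interval_s, multiplier=3):
    """DR failure detection time as query interval times hold multiplier."""
    return query_interval_s * multiplier


default_detection = dr_failure_detection_seconds(30)    # default 30 s query interval
tuned_detection = dr_failure_detection_seconds(0.25)    # 250 msec query interval
print(default_detection, tuned_detection)
```

With the 250 msec query interval from the configuration above, detection drops from roughly 90 seconds to 0.75 seconds, which is the sub-second behavior the slide warns against on SSO capable systems.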
Simplified and robust multicast network design
using VSS
• Single PIM DR system in Layer 2 network to process IGMP from host receivers
• Doubles multicast forwarding performance across all Multichassis EtherChannel member links
• Optimize multicast network with PIM stub configuration
• Rapid, deterministic and simple multicast design
  • Hardware based sub-second fault detection and recovery.
  • Eliminates aggressive timer requirement and improves system performance and scalability
interface Vlan2
ip pim passive
[Figure: VSS-SW1 acting as the single PIM-DR]
Multichassis EtherChannel load sharing
• MEC hash algorithm is computed independently by each virtual-switch to perform load share via its local physical ports.
• 8 bits computation on each member link of an MEC is independently done on per virtual-switch node basis.
• Total number of member link bundling in single MEC recommendation remains consistent as described in single chassis EtherChannel section.
• Recommendation to deploy EtherChannel in 2^n ratio evenly distributed to each virtual-switch for best load-sharing result.
[Figure: SW-1 and SW-2 with an MEC spread across both switches]

Per Switch MEC Flow Distribution Matrix (hash bits per port)
Member Links   Port1  Port2  Port3  Port4  Port5  Port6  Port7  Port8
1              8      X      X      X      X      X      X      X
2              4      4      X      X      X      X      X      X
3              3      3      2      X      X      X      X      X
4              2      2      2      2      X      X      X      X
5              2      2      2      1      1      X      X      X
6              2      2      1      1      1      1      X      X
7              2      1      1      1      1      1      1      X
8              1      1      1      1      1      1      1      1
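The matrix above follows a simple pattern: the 8 hash bits are divided as evenly as possible across the local member links, and any remainder goes to the lowest-numbered ports. The Python sketch below reproduces the table rows under that reading; it illustrates the arithmetic only and is not the hardware implementation, and the function names are hypothetical.

```python
def bucket_distribution(member_links, buckets=8):
    """Hash bits assigned to each port when `buckets` bits are split across links."""
    if not 1 <= member_links <= buckets:
        raise ValueError("member_links must be between 1 and the bucket count")
    base, extra = divmod(buckets, member_links)
    # The first `extra` ports carry one additional bit each.
    return [base + 1 if port < extra else base for port in range(member_links)]


def is_evenly_shared(member_links, buckets=8):
    """True when every port carries the same number of bits (power-of-2 bundles)."""
    shares = bucket_distribution(member_links, buckets)
    return min(shares) == max(shares)


for links in range(1, 9):
    print(links, bucket_distribution(links), is_evenly_shared(links))
```

Only bundles of 1, 2, 4, or 8 links come out even, which is the reason for the 2^n recommendation.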
Optimize EtherChannel load balancing
• Load share egress data traffic based on input hash
• Optimal load sharing results with:
  • Bucket-based load-sharing – Bundle member-links in power-of-2 (2/4/8)
  • Multiple variation of input for hash (L2 to L4)
• Recommended algorithm *:
  • Access – Src/Dst IP
  • 6500E/6800 Dist/Core – Src/Dst IP + Src/Dst L4 Ports
  • 4500E / 4500X Dist – Src/Dst IP

Layer    Default           Recommended
Core     src-dst-ip vlan   src-dst-mixed-ip-port
Dist     src-dst-ip vlan   src-dst-mixed-ip-port vlan
Access   src-mac           src-dst-ip

* May vary based on your network traffic pattern
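The effect of adding Layer 4 ports to the hash input can be illustrated with a toy model. The Python sketch below uses SHA-256 purely as a stand-in hash (the actual ASIC algorithm is not described in this deck), maps each flow to one of 8 buckets, and then maps buckets to member links; all names and addresses are hypothetical.

```python
import hashlib


def pick_member_link(hash_fields, member_links, buckets=8):
    """Map a flow's hash input fields to a member-link index (toy model)."""
    key = "|".join(str(field) for field in hash_fields).encode()
    bucket = hashlib.sha256(key).digest()[0] % buckets  # stand-in for the hardware hash
    return bucket % member_links


# 200 flows between the same two hosts, differing only in source port.
flows = [("10.120.2.10", "10.120.3.20", src_port, 443)
         for src_port in range(49152, 49352)]

# Src/Dst IP only: every flow between the host pair yields the same input.
links_ip_only = {pick_member_link((src, dst), 4) for src, dst, _, _ in flows}

# Src/Dst IP plus L4 ports: the input varies per flow.
links_ip_and_port = {pick_member_link(flow, 4) for flow in flows}

print(len(links_ip_only), len(links_ip_and_port))
```

In this model, IP-only hashing pins the whole host pair to one link, while including the ports spreads the same traffic across several links. This is why the recommended input differs by layer and traffic pattern.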
Summary: Multichassis EtherChannel performs
better in any network design
• Network recovery mechanics vary between distribution designs:
  • Standalone – protocol and timer dependent
  • VSS – hardware dependent
• VSS logical distribution system:
  • Single P2P STP Topology
  • Single Layer 3 gateway
  • Single PIM DR system
• Distributed and synchronized forwarding table – MAC address, ARP cache, IGMP
• All links are fully utilized based on EtherChannel load balancing
[Chart: convergence (sec), 0 to 1, comparing L2-FHRP and L2-MEC for upstream, downstream, and multicast traffic]
VSS-enabled campus core design
• Extend VSS architectural benefits to campus
core layer network
• VSS enabled core increases capacity,
optimizes network topologies and simplifies
system operations
• Key VSS-enabled core best practices:
  • Protect network availability and capacity with Catalyst 6800 Sup6T Quad-Sup NSF/SSO
  • Simplify network topology and routing database with single MEC
  • Leverage self-engineered VSS and MEC capabilities for deterministic network fault detection and recovery
[Figure: VSS-enabled campus core connecting to the data center]
VSS core network design alternatives
[Figure: four VSS core design alternatives, each showing SW1 and SW2 joined by a VSL]
Catalyst 6500/6800 VSS-enabled campus
design
ECMP forwarding table construction
• ACTIVE switch responsible for:
  • Construct two software tables: Routing Information Base (RIB) and Forwarding Information Base (FIB)
  • Synchronize software FIB tables to local and remote chassis supervisor and network modules
• ECMP forwarding also favors locally attached interfaces
• Hardware FIB inserts entries for ECMP routes using locally attached links
• If all local links fail the FIB is programmed to forward across the VSL link as last resort
[Figure: SW1 (ACTIVE) and SW2 (HOT_STANDBY) with Po1 and Po2 and uplinks labelled T1/2/1, T2/2/1, T2/2/2, showing unicast and multicast forwarding paths. Unicast ECMP software RIB (system-wide): four ECMP RIB entries. Unicast ECMP software FIB (system-wide): four ECMP FIB entries. Switch-1 hardware FIB: two SW1 HW FIB entries. Switch-2 hardware FIB: two SW2 HW FIB entries]
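The local-preference rule described above can be sketched as a filter over the system-wide ECMP paths. This is an illustrative Python model only, with hypothetical interface names and a hypothetical function name; it shows how four software entries reduce to two hardware entries per switch, with the VSL used only when no local path remains.

```python
def hardware_fib_next_hops(ecmp_paths, local_switch, vsl="VSL"):
    """Return the egress interfaces a switch would program for an ECMP route.

    ecmp_paths: list of (switch_id, interface) tuples from the system-wide table.
    Locally attached links are preferred; the VSL is the last resort.
    """
    local = [interface for switch, interface in ecmp_paths if switch == local_switch]
    return local if local else [vsl]


# Four system-wide ECMP paths (hypothetical interface names).
paths = [(1, "Te1/2/1"), (1, "Te1/2/2"), (2, "Te2/2/1"), (2, "Te2/2/2")]

sw1_fib = hardware_fib_next_hops(paths, local_switch=1)
sw2_fib = hardware_fib_next_hops(paths, local_switch=2)

# If both SW1 uplinks fail, SW1 forwards across the VSL.
sw1_after_failure = hardware_fib_next_hops(paths[2:], local_switch=1)
print(sw1_fib, sw2_fib, sw1_after_failure)
```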
Summary – optimizing core performance (1/2)
HW Driven Forwarding Topology & High Availability
[Figure: unicast and multicast forwarding paths for a VSS core and a standalone core, each over VSS and standalone distribution]
Summary – optimizing core performance (2/2)
HW Driven Forwarding Topology & High Availability
[Figure: unicast and multicast forwarding paths for a standalone core over VSS and standalone distribution]
Simple core network design delivers
deterministic network recovery
• Routing protocol independent network convergence in large scale campus core
• ECMP prefix-independent convergence (PIC) with 6x00 (VSS/standalone) from 12.2(33)SXI2
• Cisco Express Forwarding (CEF) optimization in IOS software.
• Default behavior: no additional configuration or tuning required
• Hardware-based fault detection and recovery in MEC/EC designs
[Figure: SW1 (ACTIVE) and SW2 (HOT_STANDBY) with Po1 and Po2 and uplinks labelled T1/2/1, T2/2/1, T2/2/2. Chart: convergence (sec), 0 to 3.5, at scale points from 500 to 25000, for ECMP (W/o PIC), ECMP (With PIC), and MEC]
VSS core simplifies multicast operation, improves
performance and redundancy (1/2)
• Standalone core needs anycast MSDP peering for RP redundancy
• ECMP builds single multicast forwarding path and protocol-based fault detection and recovery
[Figure: standalone core with two PIM RPs using Anycast-MSDP; the distribution PIM routers send a PIM Join, resulting in a single OIL]
VSS core simplifies multicast operation, improves
performance and redundancy (2/2)
• VSS-based Catalyst systems enable PIM RP redundancy with resilient technologies
• MEC increases multicast forwarding capacity by utilizing all member-links and provides hardware-based fault detection and recovery
[Figure: VSS core as a single logical PIM RP with a single logical PIM interface and multiple multicast forwarding paths; the distribution acts as a single logical PIM router, sends a PIM Join, and builds a single logical OIL]
Simplified multicast network design delivers
deterministic network recovery
• ECMP multicast recovery depends on mroute scale and can take several seconds.
• MEC/EC multicast recovery is hardware-based, scale-independent, and sub-second
[Chart: convergence (sec), 0 to 6, at scale points of 100, 500, 1000, and 5000, for ECMP and MEC/EC]
Implementing non-stop forwarding
• VSS software design is built on NSF/SSO architecture.
• Catalyst 4500E, 4500X and 6500E/6800 deployed in VSS mode must enable NSF. No configuration required on NSF Helper system
• NSF capability must be manually enabled for all Layer 3 routing protocols:
  • EIGRP, OSPF, ISIS, BGP, MPLS etc.
  • In VRF environment the NSF must be manually enabled on per-VRF IGP instance
• Multicast NSF capability is default ON
[Chart "Inter-Chassis NSF/SSO Recovery Analysis": convergence (sec), 0 to 16, Without NSF versus With NSF]
Sub-second protocol timers and NSF/SSO
• NSF is intended to provide availability through route convergence avoidance
• Fast IGP timers are intended to provide availability through fast route convergence
• In an NSF environment dead timer must be greater than:
  • SSO recovery + Routing Protocol restart + time to send first hello
Recommendation –
• Do not configure aggressive timer Layer 2 protocols, e.g. Fast UDLD
• Do not configure aggressive timer Layer 3 protocols, e.g. OSPF Fast Hello, BFD etc. Keep all protocol timers at default settings
OSPF Fast Hello configuration example:
interface Port-Channel 10
ip ospf dead-interval minimal hello-multiplier 4
[Figure: core, distribution (VSL), and access (Catalyst 2K/3K/4K) layers. Charts: "Link and Switch Failure Analysis – Default OSPF Timer" and "Link Failure Analysis – Aggressive OSPF Timer", upstream and downstream, vertical axis 0 to 0.3]
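The dead-timer rule on this slide is a simple inequality and can be checked before changing any timer. The Python sketch below is illustrative only: the function name is hypothetical and the SSO recovery, protocol restart, and first-hello times are placeholder values, not measurements from this deck. The 40 second figure is the OSPF default dead interval on broadcast networks, and 1 second is the dead interval that OSPF fast hello implies.

```python
def dead_timer_is_nsf_safe(dead_timer_s, sso_recovery_s, protocol_restart_s, first_hello_s):
    """True when the neighbor's dead timer outlasts the restarting system."""
    return dead_timer_s > sso_recovery_s + protocol_restart_s + first_hello_s


# Placeholder recovery figures, for illustration only.
recovery = dict(sso_recovery_s=2.0, protocol_restart_s=5.0, first_hello_s=1.0)

default_ok = dead_timer_is_nsf_safe(40.0, **recovery)    # OSPF default dead interval
fast_hello_ok = dead_timer_is_nsf_safe(1.0, **recovery)  # dead interval with fast hello
print(default_ok, fast_hello_ok)
```

Under these assumed values the default timer lets the neighbor ride through the switchover, while the fast-hello timer would expire first and tear down the adjacency, defeating NSF.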
Campus wired LAN design
Option 3: Layer 2 access with “simplified” distribution (BRKCRS-1500)
Leading campus design for easy configuration and operation when using stacking or similar technology (VSS, StackWise Virtual)
• Flexibility to support Layer 2 services within distribution blocks, without FHRPs.
• Easy to scale and manage
• Survives device and link failures
• Easy mitigation of Layer 2 looping concerns
• Rapid detection/recovery from failures
• Layer 2 across all access blocks within distribution
• Device-level CLI configuration simplicity
• Automated network and policy provisioning included
[Figure: logical topology with L3 between core and distribution and L2 between distribution and access; physical topology with 2 core and 2 distribution/access]
VSS best practices summary (1/2)
• Design each VSS domain with unique ID
• Configure “mac-address use-virtual” under virtual switch configuration mode
• Select appropriate VSS capable system that fits in network and solution requirements
• Deploy 6500/6800 Quad-sup NSF/SSO for mission-critical networks to protect network
availability and capacity
• Do not compromise network foundation baselines. Deploy full-mesh physical connections for
redundancy and load sharing across the network
• MEC enables network benefits with VSS. Bundle all physical connections into single logical
connection for simplified and resilient network topologies
• Layer 3 MEC is highly recommended for 4500E/X VSS enabled Campus network
• Always use link bundling protocols – Cisco PAgP or IETF LACP
• Configure “no ip routing protocol purge-interface” to optimize ECMP based network
convergence time
VSS best practices summary (2/2)
• Plan and design VSL with appropriate capacity, diversification and redundancy
• Configure “nsf” under L3 routing protocols
• Keep Layer 2 and Layer 3 protocol timers at factory default. Do not enable protocols with
aggressive timers
• Configure redundant dual active trusted ePAgP neighbors (L2/L3)
• Configure redundant dual active mechanics ePAgP and Fast Hello
• Exclude dual active management interface for connectivity and troubleshooting
• Remember “reload” command on 6500/6800 resets both virtual-switch chassis, whereas
4500E/X resets ACTIVE switch. Issue “redundancy reload shelf” on 4500E/X to reload ACTIVE
and STANDBY system
Agenda
• Designing High Availability Networks for the Enterprise
• System Hardware and Software Resiliency
• High Availability Architectures:
  • Enterprise Wired LAN
    • Multilayer Campus Distribution and HA Considerations
    • Simplified Distribution and HA Advantages
    • Extending HA Advantages by Simplifying Virtualization
  • Enterprise Data Center
  • Enterprise Wireless LAN
• High Availability System Recovery Analysis
Hop-by-hop network virtualization
Multi-VRF architecture overview
• Two preset network setup:
  • Hop-by-hop network segmentation with logical connection
  • Build control and data-plane over each logical connection
Hop-by-hop network virtualization
Data-plane isolation
Multi-VRF: Campus network design alternatives
Standalone devices
Multi-VRF: Campus network design alternatives
Cisco VSS
M-VRF: Per-hop VPN control plane complexity
ECMP unicast and multicast adjacencies comparison (1 of 4)
Standalone Design
10 VRF Sample Design
Each core : 40 Adj
M-VRF: Per-hop VPN control plane complexity
ECMP unicast and multicast adjacencies comparison (2 of 4)
Standalone Design (10 VRF Sample Design)
Each core : 40 Adj
VSS Design (10 VRF Sample Design)
VSS core : 0 Adj
M-VRF: Per-hop VPN control plane complexity
ECMP unicast and multicast adjacencies comparison (3 of 4)
Standalone Design (10 VRF Sample Design)
Each core : 160 Adj
VSS Design (10 VRF Sample Design)
VSS core : 240 Adj
M-VRF: Per-hop VPN control plane complexity
ECMP unicast and multicast adjacencies comparison (4 of 4)
Standalone Design (10 VRF Sample Design)
Each core : 480 Adj
Edge : 160 Adj
VSS Design (10 VRF Sample Design)
VSS core : 880 Adj
Edge : 80 Adj
• Standalone uses distributed control-plane. VSS uses a centralized control-plane
• Increases 2X control-plane adjacencies based on network design
Multi-VRF: MEC design simplifies complexity
EC/MEC unicast and multicast adjacencies comparison (1 of 4)
VSS-ECMP Design
10 VRF Sample Design
VSS core :
240 Adj
Multi-VRF: MEC design simplifies complexity
EC/MEC unicast and multicast adjacencies comparison (2 of 4)
VSS-ECMP Design (10 VRF Sample Design)
VSS core : 240 Adj
VSS-MEC Design (10 VRF Sample Design)
VSS core : 100 Adj
Multi-VRF: MEC design simplifies complexity
EC/MEC unicast and multicast adjacencies comparison (3 of 4)
VSS-ECMP Design (10 VRF Sample Design)
VSS core : 880 Adj
VSS-MEC Design (10 VRF Sample Design)
Multi-VRF: MEC design simplifies complexity
EC/MEC unicast and multicast adjacencies comparison (4 of 4)
VSS-ECMP Design (10 VRF Sample Design)
VSS core : 880 Adj
Edge : 80 Adj
VSS-MEC Design (10 VRF Sample Design)
VSS core : 260 Adj
Edge : 20 Adj
• Simplify virtualized network design with EC and MEC. Reduces control-plane adjacencies by up to 4X, depending on network design
• Hardware driven, scale-independent and deterministic network availability
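The "up to 4X" reduction can be checked against the adjacency counts quoted for the 10 VRF sample design (VSS-ECMP: 880 core and 80 edge; VSS-MEC: 260 core and 20 edge). A short Python sketch of that arithmetic, with hypothetical variable names:

```python
# Adjacency counts from the 10 VRF sample design on this slide.
adjacencies = {
    "VSS-ECMP": {"core": 880, "edge": 80},
    "VSS-MEC": {"core": 260, "edge": 20},
}


def reduction_factor(role):
    """How many times fewer adjacencies the MEC design needs for a given role."""
    return adjacencies["VSS-ECMP"][role] / adjacencies["VSS-MEC"][role]


core_reduction = round(reduction_factor("core"), 2)
edge_reduction = reduction_factor("edge")
print(core_reduction, edge_reduction)
```

The edge sees the full 4X reduction, while the core reduction in this sample is closer to 3.4X.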
MPLS-based campus network architecture
Edge and core network design
[Figure: IP/MPLS campus. Core LSR/LER nodes connect over LSPs to distribution LER nodes, with IP below the distribution layer]
Simplified underlay = simplified overlay (before)
• VPN PE Management
• MP-iBGP PE Systems
• MPLS Label Paths
• MPLS LDP Adjacencies
• VPN Unicast Forwarding Paths (with BGP Multipath)
[Figure: eight standalone P/PE nodes interconnected]
Simplified underlay = simplified overlay (after)
• VPN PE Management
• MP-iBGP PE Systems
• MPLS Label Paths
• MPLS LDP Adjacencies
• VPN Unicast Forwarding Paths (Without BGP Multipath)
[Figure: one logical P/PE node connected to three logical PE nodes]
MPLS before VSS
• IGP Tuning
• OSPF LSA/SPF Tuning
• BGP Tunings
• MP-iBGP Multipath
• BGP Prefix-Independent Convergence
• MPLS LDP Tuning
• MPLS LDP Session Protection
• BFD
• MPLS TE Link Protection
• MPLS TE Node Protection
• Network/System Redundancy Tradeoff
• Protocol Dependent Recovery
• Control/Management/Forwarding Complexity
[Figure: eight standalone P/PE nodes interconnected]
MPLS VSS benefits summary
IGP Tuning
OSPF LSA/SPF Tuning
P/PE
BGP Tunings
Scale-independent Recovery
MP-iBGP Multipath
Network/System Level Redundancy
BGP Prefix-Independent Convergence
Hardware Driven Recovery
MPLS LDP Tuning
Increase VPN Unicast Capacity
MPLS LDP Session Protection
Increase VPN Multicast Capacity
BFD
Simplified Virtual Network
MPLS TE Link Protection
Control-plane Simplicity
MPLS TE Node Protection
Network/System Redundancy Tradeoff
PE
PE
PE
Operational Simplicity
L2-L4 Load Sharing
Protocol Dependent Recovery
Control/Management/Forwarding Complexity
Headquarters
WAAS
Access
Switches
UCS Rack-mount
Server
UCS Rack-mount
Servers
UCS Blade
Chassis
Storage
WAAS
Central Manager
Distribution
Switches
Nexus
WAN
Routers
Access
Switches
Cisco ACE
Internet
Routers
Wireless
LAN
Controller
Regional Site
Data Center
Firewalls
Nexus
WAN
Router
Access
Switch
RA-VPN
Firewall
Guest Wireless
LAN Controller
DMZ
Switch
Remote Site
Web
Security
Appliance
Teleworker/
Mobile Worker
Access
Switch
Stack
Data
Center
Wireless LAN
Controllers
Internet
Access
Switch
Communications
Managers
Internet Edge
DMZ
Servers
Core
Switches
Email
Security
Appliance
Hardware and
Software VPN
WAN
Routers
MPLS
WANs
WAN
Routers
Distribution
Switches
User
Access
Layers
WAAS
Remote
Site
WAAS
WAN
Aggregation
Remote Site
Wireless LAN
Controller
What’s different in your network today versus a
decade ago? How does it affect availability?
Mobility
Bring Your Own Device
Devices in the Workspace
IoT
Auto-detect Non-User Devices
Devices everywhere
Cyber
Security
Networking and Security
Advanced threats
Key Challenges for Traditional Networks
Difficult to Segment
• Ever increasing number of users and endpoint types
• Ever increasing number of VLANs and IP Subnets
Complex to Manage
• Multiple steps, user credentials, complex interactions
• Multiple touch-points
Slower Issue Resolution
• Separate user policies for wired and wireless networks
• Unable to find users when troubleshooting
Traditional Networks Cannot Keep Up!
What if you could do this?
Cisco Software-Defined Access
• Enables:
  • Host mobility
  • Network segmentation
  • Role-based access control
• It is an overlay network to the network underlay
  • Control plane based on LISP
  • Data plane based on VXLAN
  • Policy plane based on TrustSec
[Figure: physical topology of border nodes and edge nodes, with a logical Layer 2 overlay and a logical Layer 3 overlay on top]
Software-Defined Access Design Guide - CVD
https://www.cisco.com/c/dam/en/us/td/docs/solutions/CVD/Campus/CVD-Software-Defined-Access-Design-Sol1dot2-2018DEC.pdf
SD-Access
Why overlays?
Simple Transport Forwarding
Flexible Virtual Services
SD-Access
Types of overlays
Campus wired LAN design
Option 4: Cisco Software-Defined Access (BRKCRS-1501, many others)
L2/L3:
flexible
overlays
Uses advantages of a routed access physical
design, with Layer 2 capable logical overlay
design
• Provisioning and policy automation
• Integrates wireless into the same policy
• Requires automation to simplify configuration
•
Logical
topology—
OR
Survives device and link failures
Easy mitigation of Layer 2 looping concerns
Rapid detection/recovery from failures
Physical
topology:
2 core
2 dist./acc.
Layer 2 across all access blocks within distribution
Device-level CLI configuration simplicity
Automated network and policy provisioning included
Cisco DNA Appliance—What about HA?
Outside of control plane, and has 2+1 clustering capabilities
SKU: DN1-HW-APL
• Specs: Based on UCS M4, 44 cores, 256 GB RAM, 12 TB SSD
• Scale and Performance: 5000 Devices (1000 Switches/Routers/WLC + 4000 APs), 25,000 Clients
• SDA Design: Small or Medium
SKU: DN2-HW-APL
• Specs: Based on UCS M5, 44 cores, 256 GB RAM, 16 TB SSD
• Scale and Performance: 5000 Devices (1000 Switches/Routers/WLC + 4000 APs), 25,000 Clients
• SDA Design: Small or Medium
SKU: DN2-HW-APL-L
• Specs: Based on UCS M5, 56 cores, 384 GB RAM, 16 TB SSD
• Scale and Performance: 8000 Devices (2000 Switches/Routers/WLC + 6000 APs), 40,000 Clients
• SDA Design: Medium or Large
Cisco Software-Defined Access
Cisco Live Barcelona - Session Map
[Figure: timetable from Tuesday (Jan 29) to Friday (Feb 01) in 08:00-11:00, 11:00-13:00, 13:00-15:00 and 15:00-18:00 slots, showing BRKCRS-2821, LTRACI-2636, LTRCRS-2810, BRKCRS-2812, BRKCRS-3811, BRKCRS-1501, BRKCRS-1449, BRKCRS-3810, BRKCLD-2412, BRKCRS-2810, BRKCRS-2825, BRKCRS-2815, BRKCRS-2814, BRKARC-2020, BRKEWN-2021 and BRKEWN-2020]
Missed One? Sessions are available online @CiscoLive.com
SD-Access resources
Related Sessions
Reference
Cisco SD-Access - 8H Technical Seminar - TECCRS-3810
•
Monday, Jan 28
8:30 AM - 6:45 PM
Cisco SD-Access Integration
Cisco SD-Access Fabric
Cisco SD-Access - A Look Under the Hood - BRKCRS-2810
•
Tuesday, Jan 29
11:00 AM - 1:00 PM
Cisco SD-Access - Technology Deep Dive - BRKCRS-3810
•
Tuesday, Jan 29
2:30 PM - 4:00 PM
Cisco SD-Access - Connecting Multiple Sites - BRKCRS-2815
•
Wednesday, Jan 30
11:00 AM - 1:00 PM
Cisco SD-Access – Assurance and Analytics - BRKCRS-2814
•
Wednesday, Jan 30
4:30 PM - 6:00 PM
Cisco SD-Access - Troubleshooting the Fabric - BRKARC-2020
•
Thursday, Jan 31
2:30 PM - 4:00 PM
Cisco SD-Access Campus Cisco Validated Design - BRKCRS-1501
•
Friday, Feb 01
9:00 AM - 11:00 AM
Cisco SD-Access - Connecting to the DC, Firewall, WAN & More! - BRKCRS-2821
•
•
Thursday, Jan 31
8:30 AM - 10:30 AM
Cisco SD-Access - Wireless Integration - BRKEWN-2020
•
Friday, Feb 01
Wednesday, Jan 30
2:30 PM - 4:00 PM
Cisco SD-Access – Integrating Existing Network - BRKCRS-2812
•
Friday, Feb 01
11:30 AM - 1:30 PM
Cisco SD-Access Policy
Simplifying and Securing the Cisco Digital Network Architecture - BRKCRS-1449
•
Tuesday, Jan 29
5:00 PM - 6:30 PM
Group-Based Policy for On-Prem, Hybrid & Cloud with Cisco DNA - BRKCLD-2412
•
Wednesday, Jan 30
2:30 PM - 4:00 PM
Cisco SD-Access - Policy Driven Manageability - BRKCRS-3811
Thursday, Jan 31
2:30 PM - 4:00 PM
Cisco SD-Access Labs
How to Setup SD-Access Wireless from Scratch - BRKEWN-2021
•
8:30 AM - 10:30 AM
Cisco SD-Access - Scaling to Hundreds of Sites - BRKCRS-2825
•
Cisco SD-Access Wireless
Wednesday, Jan 30
9:00 AM - 11:00 AM
Cisco SD-Access & ACI Integration - Hands-on Lab - LTRACI-2636
•
Tuesday, Jan 29
2:15 PM - 6:15 PM
Cisco SD-Access - Hands-on Lab - LTRCRS-2810
•
Wednesday, Jan 30
9:00 AM - 1:00 PM
Campus wired LAN design options—summary
• Traditional Multilayer Campus (BRKCRS-2031). Design notes: Protocols / Tuning
• Layer 3 Routed Access (BRKCRS-3036). Design notes: L3 Planning, Limited L2
• L2 Access / Simplified Distribution (BRKCRS-1500). Design notes: Flexible, Easy, Scalable
• SD-Access / Fabric for Campus (BRKCRS-1501, and many others). Design notes: Flexible, Tools to Simplify
Physical topology: 2 core, 2 dist./acc.
[Figure: logical topology for each design option]
On-line library at ciscolive.com
How do I get there?
Successful deployments…
…start with a plan.
Photos showing Basílica i Temple Expiatori de la Sagrada Família
High availability wired campus design
Key principles
• Choices when interconnecting devices can affect network availability
• Choose hardware-based detection and recovery mechanisms over software for faster convergence
• EtherChannel and Multichassis EtherChannel are powerful tools for convergence and scale
• Overall design choices (multilayer vs. routed access vs. simplified distribution) require the introduction of supporting protocols that affect network availability
• Simplifying the network and improving network availability improves other services overlaid on that network
Agenda
• Designing High Availability Networks for the Enterprise
• System Hardware and Software Resiliency
• Foundations of the Structured Network Design
• High Availability Architectures:
  • Enterprise Wired LAN
  • Enterprise Wireless LAN
  • Enterprise Data Center
• High Availability System Recovery
Who connected to a wired network today?
… a typical day of a connected life…
Wi-Fi
LTE
Wi-Fi
LTE
LTE
Wi-Fi
Home
Driving
Office
Walk to lunch
Restaurant
Shopping,
Hotspots
No Wireless == No Network Access
Section Objective
What is the acceptable
network downtime?
Minutes
<< 11 second
are ok
minute
admin
The goal of this section is to show you how to design and deploy a Highly
Available wireless network to reduce the network downtime
Wireless High Availability concepts
• Good news: all the High Availability concepts and best practices we have seen for wired are
applicable to wireless access as well
• Bad news: wireless is not wired
Ch 1
Ch 6
Ch 11
Thin air…..
Shielded, isolated access
No electromagnetic protection
We use the air to transmit packets, it's a shared medium, it's unlicensed... enough?
Agenda
• High Availability (HA), the theory of operations:
• What to do at the Radio Frequency layer?
• Controller HA for different Deployment Modes:
  Centralized (Cloud/non-Cloud)
  SD-Access
  FlexConnect
  Mobility Express
• HA Design and Deployment Practices
• Wireless Assurance: proactively monitor your network!
• Key takeaways
RF HA – how to build redundancy at the RF layer?
Access Points
Access Switches
Aggregation Switches
Wireless Controller
• Creating a stable, predictable RF environment (Proper Design, Site Survey)
• Dealing with RF that is continuously changing (RRM and RF Management)
• Coping with coverage holes from an AP going down (RRM and RF Management)
Radio Frequency (RF) High Availability
• Site Survey, site survey….and site survey
• Use “Active” survey
• Coverage vs. Capacity
• Consider Client type (ex. Smartphone vs. Laptop)
"My power is half of my brother MacBook"
"My antenna gain is 4 times smaller"
"I try to connect to 5GHz and stay connected until the signal is REALLY bad"
"and then move to another BSSID if it is REALLY better"
Adaptive 802.11r, FastLane, iOS Analytics
Radio Frequency (RF) High Availability
• Site Survey, site survey….and site survey
• Use “Active” survey
• Coverage vs. Capacity
• Consider Client type (ex. Smartphone vs. Laptop)
• AP positioning and antenna choice is Key
• Use common sense
• Light source analogy
• Internal antennas are designed to be mounted on ceiling
• External antennas: use same antennas on all connectors
• Tools
• What you use is less important than how you use it
• Use the same tool to compare results
RF High Availability: Cisco RRM
• What are the objectives of Radio Resource Management (RRM)?
• Provide a system wide RF view of the network at the Controller (only Cisco!!)
• Dynamically balance the network and mitigate changes
• Manage Spectrum Efficiency so as to provide the optimal throughput under changing conditions
• What’s RRM
• DCA—Dynamic Channel Assignment
• TPC—Transmit Power Control
• CHDM—Coverage Hole Detection and Mitigation
• RRM best practices
• Set RRM to auto for most deployments (High Density is a special case)
• Design for most radios set at mid power level (level 3 for example)
• Use RF Profiles to customize RRM settings per Areas/Groups of APs
RF High Availability: Cisco RRM
• RRM DCA in action
RRM has a system view of RF. An AP-only view would be limited and could result in a sub-optimal RF plan
• RRM will determine the optimal channel plan based on AP layout
• A rogue AP is detected on channel 11
• RRM will assess the RF and make a decision in less than 10 min
• Channel change is triggered to improve the RF
• Note how the 3 non-overlapping channels are still maintained!
• With a limited AP-based view of the RF, each AP will avoid channel 11, reducing overall network capacity
[Figure: AP layout with cells on channels 1, 6 and 11]
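One way to picture the benefit of a system-wide view is a channel plan computed over the whole AP neighbor map. The sketch below is only an illustration and is not Cisco's DCA algorithm: `assign_channels` and the `neighbors` input format are hypothetical, and a simple greedy rule stands in for the real RF metrics.

```python
# Hypothetical sketch: greedy channel plan over a system-wide neighbor map.
# This is NOT the RRM/DCA implementation, only an illustration of the idea.
CHANNELS = (1, 6, 11)  # the three non-overlapping 2.4 GHz channels on the slide


def assign_channels(neighbors):
    """neighbors: dict mapping an AP name to the set of APs it can hear.

    APs with the most neighbors are planned first; each AP takes the
    channel that is least used among its already planned neighbors.
    """
    plan = {}
    for ap in sorted(neighbors, key=lambda a: -len(neighbors[a])):
        used = [plan[n] for n in neighbors[ap] if n in plan]
        plan[ap] = min(CHANNELS, key=lambda ch: (used.count(ch), ch))
    return plan


if __name__ == "__main__":
    layout = {"ap1": {"ap2", "ap3"}, "ap2": {"ap1", "ap3"}, "ap3": {"ap1", "ap2"}}
    print(assign_channels(layout))
```

Because the plan is built from every AP's neighbor list at once, three mutually audible APs end up on three different channels, which a purely local decision cannot guarantee.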
RF High Availability: Cisco RRM
RRM Coverage Hole Detection and Mitigation (CHDM) in action
CHDM = Coverage Hole Detection and Mitigation
• RRM will determine the optimal power plan based on AP layout
• Each client RSSI is tracked by the AP and reported to the WLC
• If an AP fails…
RF High Availability: Cisco RRM
RRM CHDM in action
• RRM will determine the optimal power plan based on AP layout
• Each client RSSI is tracked by the AP and reported to the WLC
• If an AP fails…
• The CHDM algorithm kicks in and increases power of neighboring cells within 90 secs
• Clients roam to new APs
• This happens if the CHDM conditions are met:
  • Clients are below the RSSI threshold
  • Minimum failed clients per AP (3 by default)
  • Coverage Exception Level per AP (25% by default)
  • Failed packets (number and %)
These checks are needed to avoid false positives
RRM Details and more: Improve WLAN Spectrum Quality with Cisco's advanced RF (BRKEWN-3010)
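The CHDM conditions can be read as a simple per-AP check. The following is a minimal sketch under stated assumptions: the function and class names are hypothetical, the RSSI threshold value is an assumed placeholder (the slide gives no number), and the failed-packet condition is left out because the slide does not quantify it.

```python
from dataclasses import dataclass


@dataclass
class ChdmThresholds:
    rssi_dbm: int = -80                # ASSUMED value; not given on the slide
    min_failed_clients: int = 3        # slide: 3 by default
    exception_level_pct: float = 25.0  # slide: 25% by default


def coverage_hole_detected(client_rssi_dbm, thresholds=None):
    """Return True when the clients of one AP meet the hole conditions.

    client_rssi_dbm: list of RSSI values (dBm), one per client on the AP.
    """
    t = thresholds or ChdmThresholds()
    if not client_rssi_dbm:
        return False
    failed = [r for r in client_rssi_dbm if r < t.rssi_dbm]
    failed_pct = 100.0 * len(failed) / len(client_rssi_dbm)
    # Both checks must pass, which is what filters out false positives.
    return (len(failed) >= t.min_failed_clients
            and failed_pct >= t.exception_level_pct)


if __name__ == "__main__":
    print(coverage_hole_detected([-85, -86, -90, -60]))
```

A single weak client, or three weak clients among many healthy ones, does not trigger mitigation in this sketch.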
RF High Availability
Flexible Radio Assignment (FRA)
5GHz
2.4GHz
Serving
Serving
5GHz.
Serving
5GHz.
Serving
2.4GHz
5GHz
Serving
• FRA-auto (default value) or Manual
• Auto 2.4 -> 5GHz or Monitor Mode
• Transition to 2.4 GHz if coverage drops
5GHz.
Serving
2.4-5GHz
2.4GHz
Monitoring
Serving
FRA: Supported on the Cisco Aironet 2800/3800/4800 Series Access Points
Summary
• Cisco provides well-engineered Access Points, Antennas, and Radio Resource Management features in the controllers
• However, you need to understand the general concepts of radio. Otherwise, it is very easy to end up implementing a network in a sub-optimal way: "RF Matters"
… adding a Wireless Controller (functionality)
Private or public
Cloud
Access Points
Mobility Express
Access Switches
SD-Access
Aggregation Switches
Wireless Controller
Centralized/
SD-Access/FlexConnect
Wireless Controller modes fitting different
requirements
Centralized
Ease of deployment and management for large campuses. Cloud and non-Cloud options.
SDA-Wireless
Policy segmentation and consistent wired-wireless management
FlexConnect
Eliminate the need for a Controller at every site for a distributed deployment. Cloud and non-Cloud options.
Mobility Express
Simplified Controller-less deployment for distributed deployments and small sites
Configure / Set up: From a web browser or Cisco wireless app, use the setup wizard to enable multiple APs simultaneously
LAN
Campus
WAN
Fabric
Launched Nov 2018
ENCS
C9800 on Switch
(SD-Access only)
200 APs
WLC 3504
150 APs
Mobility Express
50-100 APs
Catalyst 9800-40
2000 APs
Catalyst 9800-Cloud
(private and public)
1000 APs
Catalyst 9800-80
6000 APs
Catalyst 9800-Cloud (private)
3000-6000 APs
2000 APs
WLC 5520
1500 APs
3000 APs
6000 APs
WLC 8540
6000 APs
AireOS WLCs
Catalyst 9800
Controller Series
Cisco Wireless Controller Options
Cisco Catalyst 9800 Series – Wireless benefits
Powered by IOS XE
Open and Programmable
Trustworthy Solutions
Modular operating system
Deploy Anywhere
• On-Prem, Private/Public cloud, Embed wireless on a 9k switch
• AWS GovCloud ready
• Scale as you grow
Always-on
• Software updates with no disruption
• Rolling AP upgrades
• Seamlessly add new AP models
Secure
• Detect encrypted threats with Encrypted Traffic Analytics (ETA)
• Integration with StealthWatch
• Automated macro/micro segmentation with SDA
• WPA3 Support*
*Future
High Availability
Reducing downtime for Upgrades and Unplanned Events
Unplanned Events
Device and network interruptions
SSO Active/Standby
N+1 Primary, Secondary
Per AP Primary, Secondary, Tertiary
Controller Software Update
Software Maintenance updates ( SMU^ )
Access Point Updates
New AP Model & AP updates*
Software Image Upgrades
Wireless controller image upgrades
Centralized Mode High Availability: SSO and N+1
Network Uptime
Client SSO
• Requirements: Catalyst 9800 Series or 5520, 8540, 3504 WLC; L2 connection; same HW+SW version; 1:1 box redundancy
• Benefits: Active client state is synched; AP state is synched; no application downtime; no license needed on secondary Controller
N+1 Redundancy (Deterministic/Stateless HA, a.k.a. primary/secondary/tertiary)
• Requirements: each Controller has to be configured separately
• Benefits: available on all controllers; crosses L3 boundaries; flexible: 1:1, N:1, N:N
Wireless Controller HA Centralized Mode
N+1 Redundancy
• Administrator statically assigns APs a primary, secondary, and/or tertiary controller
  • Assigned from controller interface (per AP) or Prime Infrastructure (template-based)
  • You need to specify Name and IP if WLCs are not in the same Mobility Group
• Pros:
  • Predictability: easier operational management
  • Support for L3 network between WLCs
  • Flexible redundancy design options: 1:1, N:1, N:N:1
  • WLCs can be of different HW and SW (*)
  • "Fallback" option in the case of failover
  • Can overload APs on controllers (using AP priority)
• Cons:
  • Stateless redundancy. There is network downtime when the WLC fails
  • More upfront planning and configuration
(*) AP will need to upgrade/downgrade code upon joining
[Figure: WLAN-Controller-A, WLAN-Controller-B and WLAN-Controller-C reached by the access points over an IP network. Each access point carries its own list:
Primary: WLAN-Controller-1, Secondary: WLAN-Controller-2, Tertiary: WLAN-Controller-3
Primary: WLAN-Controller-2, Secondary: WLAN-Controller-3, Tertiary: WLAN-Controller-1
Primary: WLAN-Controller-3, Secondary: WLAN-Controller-2, Tertiary: WLAN-Controller-1]
N+1 Redundancy
Configuration > AP Join >…
AireOS
Catalyst 9800
Controller Series
Global backup Controllers
Wireless > High Availability
• Used if there are no primary/secondary/tertiary WLCs configured on the AP
• The backup controllers are added to the primary discovery response message to the AP
N+1 Redundancy
AP Failover mechanism
< 30-45 sec (*)
When configured with Primary and backup Controllers:
• AP uses heartbeats to validate current WLC connectivity
• Upon losing a heartbeat to the Primary, the AP sends 5 consecutive heartbeats, one every 3 seconds (default)
  • Configurable to a minimum of 3 keepalives every 2 sec
• If no reply, the AP declares the WLC dead and starts the join process to the first backup WLC candidate:
  • Backup is the first alive WLC in this order: primary, secondary, tertiary, global primary, global secondary.
• With N+1 failover, the AP goes back to discovery state just to make sure the backup WLC is UP and then immediately starts the JOIN process
• With N+1, the AP periodically checks for the Primary to come back online and falls back to it (AP fallback can be disabled)
[Figure: AP state machine with AP Boots UP, Reset, Discovery, DTLS Setup, Join, Image Data, Config and Run; "WLC failure detected" sends the AP back to Discovery]
(*) With Fast Heartbeat and minimum values for keepalive
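The detection timing and the backup selection order can be sketched as follows. This is an illustration only: the function names and the dictionary layout are hypothetical, and the detection time is read as retries multiplied by interval, ignoring the regular heartbeat period and the time the AP then needs to join the backup WLC.

```python
# Order quoted on the slide for choosing the backup controller.
BACKUP_ORDER = ["primary", "secondary", "tertiary",
                "global primary", "global secondary"]


def detection_time_sec(retries=5, interval_sec=3):
    """Time spent on retry heartbeats before the WLC is declared dead."""
    return retries * interval_sec


def pick_backup(configured, alive):
    """Return the first configured controller that is still reachable.

    configured: dict mapping a role in BACKUP_ORDER to a controller name.
    alive: set of controller names that currently answer discovery.
    """
    for role in BACKUP_ORDER:
        name = configured.get(role)
        if name is not None and name in alive:
            return name
    return None


if __name__ == "__main__":
    print(detection_time_sec())      # default: 5 heartbeats, 3 s apart
    print(detection_time_sec(3, 2))  # minimum: 3 keepalives, 2 s apart
```

With the defaults the retries take 15 seconds, and with the minimum values 6 seconds, which is why the tuned timers matter for the overall failover target.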
N+1 Redundancy
AP Fast Heartbeat
< 30-45 sec (*)
• Fast Heartbeats lower the amount of time it takes to detect Primary controller failure
• How Fast Heartbeat works
  • The AP sends these packets, by default every 1 sec
  • When the fast heartbeat timer expires, the AP sends 3 fast echo requests to the WLC, 3 times (configurable)
  • If there is no response, the primary is considered dead and the AP selects an available controller from its "backup controller" list in the order of primary, secondary, tertiary, primary backup controller, and secondary backup controller.
• Fast Heartbeat is only supported for Local and Flex mode
N+1 Redundancy
AP Primary Discovery Request Timer
• The access point periodically sends primary discovery requests to the Primary WLC to
know when it is back online. Default is 120 sec.
• If AP Fallback is enabled (default), the AP automatically rejoins the Primary controller
N+1 Redundancy
AP Failover Priority
• Assign priorities to APs: Critical, High, Medium, Low
• Critical priority APs get precedence over all other APs when joining a controller
• In a failover situation, a higher priority AP will be allowed to join ahead of all other APs
• If the backup controller doesn't have enough licenses (e.g. multiple Primary WLCs fail), existing lower priority APs will be dropped to accommodate higher priority APs
[Figure: a WLC fails and the backup WLC is overloaded; the AP with priority Critical fails over and an AP with priority Medium is dropped]
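The admission behaviour described for AP failover priority can be sketched as a small function. The names `try_join` and `PRIORITY` and the tuple layout are hypothetical, and the license limit is modelled as a plain AP count; the real controller logic may differ.

```python
# Hypothetical ranking of the four priority levels named on the slide.
PRIORITY = {"low": 1, "medium": 2, "high": 3, "critical": 4}


def try_join(joined, capacity, new_ap):
    """Admit new_ap to a backup WLC that can hold `capacity` APs.

    joined: list of (ap_name, priority) tuples already on the controller;
            it is modified in place.
    Returns (accepted, dropped_ap_name_or_None).
    """
    if len(joined) < capacity:
        joined.append(new_ap)
        return True, None
    lowest = min(joined, key=lambda ap: PRIORITY[ap[1]])
    if PRIORITY[lowest[1]] < PRIORITY[new_ap[1]]:
        # Drop an existing lower priority AP to make room.
        joined.remove(lowest)
        joined.append(new_ap)
        return True, lowest[0]
    return False, None


if __name__ == "__main__":
    on_backup = [("ap-lobby", "medium"), ("ap-er", "critical")]
    print(try_join(on_backup, 2, ("ap-icu", "critical")))
```

In this sketch a full controller accepts a Critical AP by dropping a Medium one, while a Low priority AP arriving later is refused.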
N+1 Redundancy
Typical Design
< 30-45 sec
• Most common design is N+1 with a redundant WLC in a geographically separate location across the Campus
• Downtime can be kept to 30-45 sec when using Fast Heartbeat to detect failure
• Use AP priority in case of oversubscription of the redundant WLC
[Figure: WLAN-Local controllers at the primary locations and WLC-BKP in a geo-separated DC, connected over the IP network (Campus)]
APs Configured With:
Primary: WLAN-Local
Secondary: WLC-BKP
For more info:
http://www.cisco.com/en/US/docs/wireless/technology/hi_avail/N1_HA_Overview.html
Wireless Controller HA
Centralized Mode Stateful Switch Over
(SSO)
High Availability (Client SSO)
A direct physical connection between Active and Standby Redundant Ports or Layer 2 connectivity is
required to provide stateful redundancy within or across datacenters
Sub-second failover and zero SSID outage
Active Wireless
Controller
Hot-Standby Wireless
Controller
C9800-40-K9
Redundancy Port Connectivity
RP via L2
Gigabit SFP RP port
Gigabit SFP RP port
C9800-80-K9
Active Wireless
Controller
Redundancy Port Connectivity
RP Via L2
Example for AireOS Controller:
https://www.cisco.com/c/en/us/td/docs/wireless/controller/technotes/7-5/High_Availability_DG.html
Hot-Standby Wireless
Controller
The only supported SFPs on Gigabit RP port are: GLC-SX-MMD and GLC-LH-SMD
TECEWN-2005
© 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public
303
C9800 Private Cloud Deployment:
Client SSO High Availability
[Figure: C9800-CL-K9 virtual controllers on ESXi hosts. vWLC1-Active/vWLC1-Standby and vWLC2-Active/vWLC2-Standby pairs, each with CP and DP, connect through vswitches and physical switches; the HA interfaces provide Redundancy Port connectivity (RP via L2)]
Stateful Switchover (SSO)
< 1 sec
• HA pairing is possible only between the same type of hardware and software versions
• True box-to-box High Availability, i.e. 1:1
  • One WLC in Active state and second WLC in Hot Standby state
  • Secondary continuously monitors the health of Active WLC via dedicated link
• Configuration on Active is synched to Standby WLC
  • This happens at startup and incrementally at each configuration change on the Active
• What else is synched between Active and Standby?
  • Licenses, AP CAPWAP state, Clients in "RUN" state
• Downtime during failover is greatly reduced:
  • 2 - 100 msec for a box failover (Active WLC crashes, system hangs, manual reset or forced switch-over)
  • 350-500 msec in the case of power failure on the Active WLC (no signaling to the peer is possible)
  • Few seconds in the case of network failover (gateway not reachable)
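To compare these figures with an acceptable-downtime target, the worst-case numbers quoted on the slides can be put in a small lookup. This is a sketch only: the dictionary and function names are hypothetical, the N+1 entry uses the upper end of the "< 30-45 sec" figure, and the "few seconds" network failover case is left out because no number is given.

```python
# Worst-case downtime figures quoted on the slides, in seconds.
WORST_CASE_DOWNTIME_SEC = {
    "SSO, box failover": 0.100,
    "SSO, power failure on Active": 0.500,
    "N+1 with Fast Heartbeat": 45.0,
}


def options_within(budget_sec):
    """Return the failover cases whose worst-case downtime fits the budget."""
    return sorted(name for name, t in WORST_CASE_DOWNTIME_SEC.items()
                  if t <= budget_sec)


if __name__ == "__main__":
    print(options_within(1.0))
```

A sub-second budget leaves only the two SSO cases, while a one-minute budget is also met by N+1 with Fast Heartbeat.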
Stateful Switchover (SSO)
Failover sequence
1. Redundancy role negotiation and config sync
2. APs associate with the Active controller
3. Client associates with Active through AP
4. Active failure: notify peer / or missing keepalive
5. Standby WLC sends out GARP
6. Standby becomes Active:
• AP DB and Client DB are already synced to standby controller
• AP CAPWAP tunnel session intact
• Client session intact, client does not re-associate*
Effective downtime for the client is: Detection time + Switchover time
[Figure: ACTIVE and STANDBY controllers above the campus access; on failure the STANDBY sends GARP and becomes ACTIVE; the CAPWAP tunnel and client session are shown]
video: https://www.youtube.com/watch?v=If5F7eZkC3w
Stateful Switchover (SSO)
Other important things to keep in mind..
• There is no preemption in Controller SSO:
  • When the failed Active WLC comes back online, it joins as Hot Standby
• Recommendations:
  • In-Service Software Upgrade (ISSU) is currently not supported: plan for downtime when upgrading software. There are many improvements for the Catalyst 9800 Controller: see next section.
  • Physical connection between Redundant Ports should be made before HA configuration
  • Keepalive and Peer Discovery timers should be left at default values for better performance
High Availability
Reducing downtime for Upgrades and Unplanned Events
Unplanned Events
Device and network interruptions
Controller Software Update
Software Maintenance updates ( SMU^ )
Access Point Updates
SSO Active/Standby
N+1 Primary,
Secondary
Hot Patch
(No Wireless Controller
reboot)
Per AP Primary,
Secondary,
Tertiary
Cold Patch
HA install on SSO Pair
Auto Install on Standby
Rolling AP Update
New AP Model & AP updates*
(No Wireless Controller
Reboot)
Software Image Upgrades
N+1 Hitless Rolling AP
Upgrade
Wireless controller image upgrades
Cisco Catalyst 9800
Wireless Controller
Differentiators
AP Device
Pack
New AP Model
Flexible
Per-Site,
Per-Model
Updates
Wireless Controller HA –
Catalyst 9800 only
High Availability
Cisco Catalyst
9800 Wireless
Controller
Differentiators
Reducing downtime for Upgrades and Unplanned Events
Supported after 16.10
16.10 Supported
Unplanned Events
Device and network interruptions
Controller Software Update
Software Maintenance updates ( SMU^ )
Access Point Updates
SSO Active/Standby
N+1 Primary,
Secondary
Hot Patch
(No Wireless Controller
reboot)
Per AP Primary,
Secondary,
Tertiary
Cold Patch
HA install on SSO Pair
Auto Install on Standby
Rolling AP Update
New AP Model & AP updates*
(No Wireless Controller
Reboot)
Software Image Upgrades
N+1 Hitless Rolling AP
Upgrade
Wireless controller image upgrades
^ MD Release Only
AP Device
Pack
New AP Model
Flexible
Per-Site,
Per-Model
Updates
Future
SMU on MD
Release only
Controller and AP software upgrades
Controller
Updates
Controller update or bug fixes
SMU
PSIRTs, fixes
on APs
AP update or bug fixes
AP Service Pack
New AP Model
Support
Hot-patchable support for Device Pack
AP Device Pack
Contain impact within release
Faster resolution to critical issues
Fixes for defects and security issues
without need to requalify a new release
Provide fixes to critical issues found in
network devices that are time-sensitive
Wireless Controller SMU
Wireless Controller SMU installation
Options: Hot Patch (No Wireless Controller reboot), Cold Patch (Wireless Controller Reboot), Auto Install on Standby
• Software Maintenance Update (SMU) is the ability to apply patch fixes on a software release in the customer network
• Current mechanism relies on Engineering Specials
  • Entire image is rebuilt and delivered to customer
Hot-Patching
• Inline replace of functions without restarting the process
• On SSO systems, the patch will be applied on both active and standby without any reload
Cold Patching
• Install of an SMU will require a system reload
• On SSO systems, SMU updates can be installed on the HA pair with zero downtime (follows the ISSU path: both Standby and Active controllers are reloaded, but there is no impact to AP and Client sessions)
• SMU Infrastructure will be available in 16.10 FCS release
• SMUs for C9800 will be available starting the first MD Release
Catalyst 9800 SMU Cold Patch + AP Service Pack
Follows the ISSU path: both Standby and Active controllers are reloaded, but there is no impact to AP and Client sessions.
• Install SMU on Standby
• Switchover to activate SMU
• Install SMU on new Standby
• Rolling AP upgrade if the AP image needs an update (reset APs in a staggered way)
Rolling AP Upgrade: Choose how aggressive…
N=4 Neighbor APs
N=8 Neighbor APs
N=24 Neighbor APs
User selects % of APs to upgrade in one go [5, 15, 25]
For 25%, Neighbors marked = 6 [Expected number of iterations ~ 5]
For 15%, Neighbors marked = 12 [Expected number of iterations ~ 12]
For 5%, Neighbors marked = 24 [Expected number of iterations ~ 22]
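The trade-off on this slide can be illustrated with a small planning sketch. It is not the controller's actual algorithm: it assumes a simple greedy pass in which each AP chosen for an iteration "marks" its closest neighbors so they stay up during that iteration. The function name `plan_iterations` and the shape of the `neighbors` map (a stand-in for the RRM AP neighbor map) are hypothetical; only the percent-to-neighbors mapping comes from the slide.

```python
import math

# Taken from the slide: percent of APs per iteration -> neighbors marked.
NEIGHBORS_MARKED = {25: 6, 15: 12, 5: 24}


def plan_iterations(neighbors, percent):
    """Greedy sketch that splits APs into upgrade iterations.

    neighbors: dict of AP name -> list of neighbor AP names, closest first
               (hypothetical stand-in for the RRM AP neighbor map).
    percent:   5, 15 or 25, the share of APs upgraded in one iteration.
    """
    marked_count = NEIGHBORS_MARKED[percent]
    batch_cap = max(1, math.ceil(len(neighbors) * percent / 100))
    pending = sorted(neighbors)  # deterministic order for the sketch
    iterations = []
    while pending:
        batch, protected = [], set()
        for ap in pending:
            if len(batch) == batch_cap:
                break
            if ap in protected:
                continue  # a nearby AP is already going down in this iteration
            batch.append(ap)
            # Keep the closest neighbors up so clients have somewhere to roam.
            protected.update(neighbors[ap][:marked_count])
        iterations.append(batch)
        pending = [ap for ap in pending if ap not in batch]
    return iterations
```

With a symmetric neighbor map, no iteration takes down two neighboring APs at once, which is why a more conservative setting needs more iterations.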
Rolling AP Upgrade - Client Steering
• Clients steered from candidate APs to non-candidate APs
• 802.11v BSS Transition Request
• Disassociation imminent
• If clients do not honor this, they will be de-authenticated before AP reload
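The steering behaviour above can be sketched as a per-client decision. This is an illustrative model only: the `Client` fields and the action strings are hypothetical, and treating clients without 802.11v support the same as clients that ignore the request is an assumption, not something the slide states.

```python
from dataclasses import dataclass

# Hypothetical action labels used only in this sketch.
BTM_REQUEST = "bss-transition-request(disassociation-imminent)"
DEAUTH = "deauthenticate"


@dataclass
class Client:
    mac: str
    supports_11v: bool
    honors_request: bool = True


def steer_clients(clients):
    """Return (mac, action) pairs for clients on an AP that is about to reload."""
    actions = []
    for client in clients:
        if client.supports_11v:
            actions.append((client.mac, BTM_REQUEST))
            if client.honors_request:
                continue  # client moved to a non-candidate AP on its own
        # Request ignored (or, by assumption, not understood): deauth before reload.
        actions.append((client.mac, DEAUTH))
    return actions
```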
N+1 Rolling AP Upgrade
Wireless Controller image upgrade using N+1 staging Controller
[Diagram: Mobility Group with the Primary controller (Version X, moving to X+1) and the upgraded N+1 controller (Version X+1); Trigger Rolling Upgrade]
1. Device auto-selects candidate APs based on selected % and RRM AP Neighbor Map
2. Upgrade process kicks in
• Image download to Primary Wireless Controller
• Image pre-download to APs
• Selective redirect of clients using 11v
• APs moved to N+1 Wireless Controller in rolling manner
• Primary Wireless Controller reboot
• APs moved back to Primary Wireless Controller (optional)
3. Monitor progress on the Device
Wireless Controller HA
Software-Defined Access Wireless
Software Defined Access: Bringing Intent Based
Networking to Life
Cisco DNA Center: Policy, Automation, Analytics
• Automated Network Fabric: Single Fabric for Wired & Wireless with simple Automation
• Identity-Based Policy & Segmentation: Decouples Security & QoS from VLAN and IP Address
• Insights & Telemetry: Analytics and Insights into User and Application behavior
[Diagram: SDA fabric with Border (B) and Control-Plane (C) nodes, Outside network, SDA Extension, IoT Network and Employee Network; User Mobility: Policy stays with User]
Catalyst 9800 SD-Access Wireless
Cisco DNA Center
Policy
Automation
Analytics
SD-Access Wireless Distributed Sites
SD-Access Wireless Campus
Controller Appliance or
Private Cloud
SD-WAN
(Viptela)
c
SD-Access
IoT
c
MPLS | Metro
4G/5G/LTE | Internet
Embedded Wireless
“Cat 9k Switch”
User Mobility
Seamless Mobility
Policy stays with user
Policy stays with
user
Software Defined-Access: Roles and Terminology
• Cisco DNA Controller – Enterprise SDN Controller for Automation & Assurance. GUI management abstraction via multiple Service Apps
• Identity Services – NAC & ID Systems (e.g. ISE) for dynamic Endpoint to Group mapping and Policy definition
• Control-Plane (CP) Node – Map System that manages Endpoint to Device relationships
• Fabric Border Nodes – A Fabric device (e.g. Core) that connects External L3 network(s) to the SDA Fabric
• Fabric Edge Nodes – A Fabric device (e.g. Access or Distribution) that connects Wired Endpoints to the SDA Fabric
• Fabric Wireless Controller – Wireless Controller (WLC) that is fabric-enabled and participates in the LISP control plane
• Fabric Mode APs – Access Points that are fabric-enabled. Wireless traffic is VXLAN encapsulated at the AP
[Diagram: Cisco DNA Controller and Identity Services (ISE / AD) above the fabric; Fabric Mode WLC, Fabric Border (B) nodes, Control-Plane (C) nodes, Intermediate Nodes (Underlay), Fabric Edge Nodes and Fabric Mode APs; CAPWAP carries control traffic, VXLAN carries data traffic]
SD-Access Wireless: Redundancy Considerations
• WLC registers wireless clients in Host Tracking DB
• Control Plane (CP) redundancy is supported in Active / Active configuration
• WLC is configured with two CP nodes with information sync across both
• Stateful redundancy with WLC SSO pair. Active WLC updates Control nodes
[Diagram: Active and Standby WLC as an SSO pair sending client updates to two Control-Plane (C) nodes, with a Border (B) node]
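As a rough illustration of the Active / Active control-plane idea, the sketch below models a WLC that pushes every client registration to both configured CP nodes, so either node can answer lookups if the other fails. The class and method names are hypothetical and the real registration runs over the LISP control plane, which is not modelled here.

```python
class ControlPlaneNode:
    """Toy stand-in for a fabric control-plane node and its host tracking DB."""

    def __init__(self, name):
        self.name = name
        self.host_db = {}  # client MAC -> current location (e.g. fabric edge)

    def register(self, mac, location):
        self.host_db[mac] = location


class FabricWlc:
    """Toy stand-in for the active WLC of an SSO pair."""

    def __init__(self, cp_nodes):
        self.cp_nodes = list(cp_nodes)

    def client_joined(self, mac, location):
        # Active/active CP: send the same update to every configured node.
        for node in self.cp_nodes:
            node.register(mac, location)
```

Because both nodes hold the same entry, the loss of one CP node does not remove the client's location from the fabric in this model.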
Platforms supporting SD-Access Wireless
Optimized for Distributed Branches
On Switch
Small and Medium Campus
On Private Cloud
• Cisco IOS® XE Software
• Cisco IOS® XE Software
• Cat 9300, Cat 9500
• C9800-CL
• 200 AP, 4k Clients
• SD-Access wireless with Cat9800
Software Package
• Indirect AP Support
• Centralize Control Plane
• Always on Fabric with robust HA
• 1k AP, 10k Clients
• 3k AP, 32k Clients
• 6k AP, 64k Clients^
• Scale on demand
• Designed for IoT
• Always on Fabric with robust HA
Medium and Large Campus
On Appliance
• Cisco IOS® XE Software
• C9800-40-K9
• C9800-80-K9
• Cisco AireOS Software:
• WLC 3504 (SW8.8)
• WLC 5520 (SW8.8)
• WLC 8540 (SW8.8)
• Designed for IoT
• Always on Fabric with robust HA
Wireless Controller HA
FlexConnect Mode
FlexConnect quick recap…
• CAPWAP management and data plane are split:
  • Central Switching (SSID data traffic sent to WLC)
  • Local Switching (SSID data traffic sent to local VLAN)
• Two modes of operation from AP perspective:
  • Connected (when WLC is reachable)
  • Standalone (when WLC is not reachable)
[Diagram: Controller Cluster at the Central Site, WAN, FlexConnect Branch Office; Central Switching vs. Local Switching traffic paths]
FlexConnect HA
Limitations
Benefits
FlexConnect Local
Switching
L2 roaming
Flex Groups for AAA Local Auth.
Fault Tolerance: identical
configuration on N+1 controllers
Upon WLC failure AP stays up
and clients are not disconnected
Equivalent to Client SSO
AAA survivability available
FlexConnect Central
Switching
Same as Centralized mode
Same as Centralized mode
Clients at locally switched SSIDs stay connected
at Controller/WAN outage
Data Center
Local Switching SSIDs: all connected clients stay connected!
Prime
AAA/
RADIUS
WAN
Outage
WAN
Wireless Controller
Access Point
Branch Office
CAPWAP Control – UDP 5246
CAPWAP Data – UDP 5247
Impact of WAN Outage or Controller Failure
1. Controller failure:
N+1 HA Design:
• No impact for locally switched SSIDs
• FlexConnect AP will search for backup WLC and resume client sessions with centrally switched SSIDs
1:1 HA Design with Client SSO:
• No impact for centrally switched SSIDs: centrally and locally switched SSIDs stay up
2. WAN failure / Controller not reachable:
• Access Point will continue to transmit/receive data on locally switched SSIDs
• Connected clients stay connected
• Fast roaming is possible for clients with CCKM/OKC/802.11r support
• New clients can connect if local RADIUS or authentication is provided
• Lost features: RRM, wIDS, location, WebAuth, NAC
[Diagram: Controller Cluster at the Central Site (failure point 1), WAN (failure point 2), Local Switching at the FlexConnect Branch Office]
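The cases on this slide can be condensed into a small lookup, shown here as a sketch. The function name, argument values and return strings are hypothetical; the outcomes mirror the slide's bullets, and the one combination the slide does not spell out (centrally switched SSIDs during a WAN outage) is returned as such rather than guessed.

```python
def flexconnect_impact(event, ssid_switching, ha_design="n+1"):
    """Summarise client impact for a FlexConnect branch.

    event:          "controller_failure" or "wan_outage"
    ssid_switching: "local" or "central"
    ha_design:      "n+1" or "sso" (only relevant for controller failure)
    """
    if event == "controller_failure":
        if ssid_switching == "local" or ha_design == "sso":
            return "no impact"
        return "AP searches for backup WLC, then resumes client sessions"
    if event == "wan_outage":
        if ssid_switching == "local":
            return ("connected clients stay connected; "
                    "RRM, wIDS, location, WebAuth, NAC are lost")
        return "not stated on the slide"
    raise ValueError(f"unknown event: {event}")
```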
Wireless Controller HA
Mobility Express
Cisco Mobility Express: Controller Function
embedded into the access point
Runs WLAN Controller on
access point
Investment Protection - Add
controller without changing
Access Point
Mobile app/WebUI to configure
up to 100 access points
Simple UI monitors, manages and
troubleshoots your network
Best Practices activated by
default & built-in redundancy
Mobility Express Overview
• One AP runs as Mobility Master (think about it as a local Virtual WLC)
• Controller and APs are in the same L2 broadcast domain
• Based on FlexConnect architecture
• Mobility Express supports client central authentication and local switching of traffic
[Diagram: one MASTER AP among AIR-AP1852I-B-K9, AIR-AP2802I-B-K9, AIR-AP3802I-B-K9, AIR-AP3702E-B-K9, AIR-AP3702I-B-K9 and AIR-AP2702I-B-K9 access points]
Mobility Express: Master AP Redundancy
• If Master AP fails, another Mobility Express capable AP is elected automatically.
• Newly elected Master AP has same IP and config as original Master AP.
• Preferred Master can be set (AireOS 8.7)
• Election of a new controller using VRRP
  • Heartbeat exchanged every 10s with Master AP
  • After 3 missed heartbeats: Master election initiated - all Master capable APs participate
  • APs fall into standalone mode during election process (takes about 30 secs)
  • Standalone Access Points join newly elected master and go to connected mode
• Election Priorities
  1. Most capable Access Points. 3800 > 2800 > 1800.
  2. AP with least client load
  3. In case of tie, election based on lowest MAC Address
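The three election priorities translate naturally into a sort key. The sketch below is illustrative only: the `AccessPoint` fields, the numeric ranks and the helper names are hypothetical, and the actual election runs over VRRP, which is not modelled. The timing helper is simple arithmetic on the heartbeat figures from the slide.

```python
from dataclasses import dataclass

# Capability order from the slide: 3800 > 2800 > 1800 (rank values are arbitrary).
CAPABILITY_RANK = {"3800": 3, "2800": 2, "1800": 1}
HEARTBEAT_INTERVAL_S = 10
MISSED_HEARTBEATS = 3


@dataclass
class AccessPoint:
    mac: str      # same textual format assumed for all APs, e.g. "00:a1:..."
    series: str   # "1800", "2800" or "3800"
    clients: int


def elect_master(candidates):
    """Pick the new master: most capable, then least loaded, then lowest MAC."""
    return min(
        candidates,
        key=lambda ap: (-CAPABILITY_RANK[ap.series], ap.clients, ap.mac.lower()),
    )


def time_to_trigger_election_s():
    """Seconds of missed heartbeats before an election is initiated."""
    return HEARTBEAT_INTERVAL_S * MISSED_HEARTBEATS
```

For example, a lightly loaded 2800 still loses to a busy 3800, because capability is compared first and client load only breaks ties within the same series.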
Mobility Express Master Election Process
AIR-AP1852I-B-K9
AIR-AP2802I-B-K9
Most capable Access
Point – E.g. 2800 vs.
1850
P
AIR-AP1852I-B-K9
P
AIR-AP3802I-B-K9
AIR-AP1852I-B-K9
AIR-AP1852I-B-K9
MASTER
AP
P
Least Client Load
Lowest MAC address
AIR-AP3702E-B-K9
AIR-AP3702I-B-K9
AIR-AP2702I-B-K9
Mobility Express WLAN Deployment Options
Single Office
Distributed Office
Mobility Express
Mobility Express
Distributed Enterprise
Mobility Express
in Branch
Controller Based
in campus
Agenda
• High Availability (HA), the theory of operations
• What to do at the Radio Frequency layer?
• Controller HA for different Deployment Modes:
  • Centralized (Cloud/non-Cloud)
  • SD-Access
  • FlexConnect
  • Mobility Express
• HA Design and Deployment Practices
• Wireless Assurance: proactively monitor your network!
• Key takeaways
HA Best Practices: Connecting an AP to the
wired network
Recommendations:
• Create redundancy throughout the access layer by connecting APs to different switches/stack members/linecards
• If the AP is in Local mode, configure the port as access with STP PortFast, BPDU guard, etc.
• If the AP is in Flex mode and Local Switching, configure the port as trunk and allow only the VLANs you need
HA Best Practices: Connecting a Single
Controller to the wired network
1) To a single Modular Switch or Stack
• Use Trunk EtherChannel (EC)/LAG
• Trunk only the required VLANs to the Controller
• 2/4/8 ports in a bundle to optimize load sharing
• Spread ports across Line Cards/Stack members
2) To Redundant Distribution Switches in a VSS pair
• Same as Option 1
• Spread ports across VSS members
HA Best Practices: Connecting HA pair to the
wired network
Option 1: to single Modular Switch or Stack
• The HA pair of WLCs should be considered as separate WLCs with exactly the same configuration
• Ports on both WLCs are UP but only the ones on the Active WLC are forwarding data traffic
• On WLC side: use the same physical ports to connect to the network, e.g. ports 1-4 on WLC1 and ports 1-4 on WLC2
• On switch side the configuration has to be the same. If using LAG, for example, two port-channels should be used with the same configuration (same mode, same VLANs, same native VLAN, etc.)
• General recommendations for the single-controller Option 1 also apply
[Diagram: Active WLC and Standby WLC connected over L2 to a single switch or stack through trunk port-channels Po 1 and Po 2, with the same configuration on both]
HA Best Practices: Connecting a Client SSO
Controller Cluster to the wired network (VSS)
Option 2: to VSS pair
• Use EtherChannel from each Wireless Controller to Distribution VSS
• Spread the links in each EtherChannel among the two physical switches: this will prevent a Wireless Controller switchover upon a failure of one of the VSS switches
• Redundancy Port (RP) connected to the respective uplink switches
• The AP/Clients are up after an SSO. It is a seamless transition and there are no drops on the client.
[Diagram: Active WLC and Standby WLC, each dual-homed to the VSS pair]
HA Best Practices: Connecting a Client SSO
Controller Cluster to the wired network (HSRP)
Option 3: to HSRP pair
• Controller devices are connected to 2 HSRP
routers (Active and Standby).
• The uplink is a port-channel. RP connected to
the respective uplink routers.
• Failover of HSRP Active to Standby induces a
switchover of Cisco Catalyst 9800 Wireless
Controller HA pair.
• The AP/Clients are up after an SSO. It is a
seamless transition and there are no drops on
the client.
HA Deployment Best
Practices
Focus on Campus
HA Deployment Best Practices
Campus
• What is the acceptable downtime for your business applications?
  • No downtime? Go with Stateful Switchover (Client SSO).
  • Are 30 sec to a few minutes OK? Go with N+1 to have more deployment flexibility
• What is the downtime to upgrade a HA pair and how to minimize it?
  • Catalyst 9800 Wireless Controller: use built-in Rolling SW Upgrade
  • AireOS Controllers (details for reference only):
    • Plan for additional backup controller
    • Use Prime Infrastructure Rolling SW Updates Feature
HA Deployment Best Practices for Campus
N+1 (Primary Controller and Secondary Controller, L2 or L3)
• Approx. 30 sec failover time (AP + client affected)
• No config sync (risk: config mismatch)
• AP load balancing
• L2 or L3
SSO (Active WLC and Standby WLC acting as Primary Controller)
• Sub-second failover (client + AP not affected)
• Config sync
• One active, one standby (no AP load balancing)
• L2 connection needed
SSO + 1 (SSO pair plus a Secondary Controller)
• Adds redundancy and simplifies operation during maintenance (e.g. SW updates)
SSO + SSO (SSO pair as Primary Controller, SSO pair as Secondary Controller)
• Adds redundancy and simplifies operation during maintenance (e.g. SW updates)
HA Deployment Best Practices
Campus
• What is the acceptable downtime for your business applications?
  • No downtime? Go with AireOS Stateful Switchover
  • Are 30 sec to a few minutes OK? Go with N+1 to have more deployment flexibility
• What is the downtime to upgrade a HA pair and how to minimize it?
• What is the recommended HA deployment in a multi-site Campus?
  1. Use 2-Tier Redundancy (SSO and N+1) HA deployment
     • Use SSO in the main site (Primary WLC)
     • Use Secondary/Tertiary in redundancy sites
  2. For max resiliency use SSO in all sites
Multi-site Campus: Combine SSO with N+1
• SSO pair can act as the Primary Controller and be deployed with single Secondary and Tertiary WLC
• Network downtime:
  • No network downtime for single controller failure in the Primary DC
  • On failure of both Active and Standby WLC, APs will fall back to secondary and further to configured tertiary controller
• Recommendations:
  • Make sure that AP Fallback is enabled
  • Use AP Failover priority in case of oversubscription of the backup WLC
  • Useful to reduce downtime for SSO pair software upgrade
AP Config:
Primary WLC – 9.6.61.2
Secondary WLC – 9.6.62.2
Tertiary WLC – 9.6.63.2
[Diagram: Main Data Centre with the Primary SSO pair (9.6.61.x/24, members .2 and .3), ISE and PI; DC 1 with Secondary WLC 9.6.62.2; DC 2 with Tertiary WLC 9.6.63.2; all connected over the IP network to Campus Access]
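A minimal sketch of the fallback order and the failover-priority recommendation follows. It is a simplified model, not controller code: the function names are hypothetical, reachability is passed in as a plain set of addresses, and "higher number means more critical AP" is an assumption made for the example. The addresses are the ones shown in the slide's AP Config.

```python
AP_CONFIG = {
    "primary": "9.6.61.2",
    "secondary": "9.6.62.2",
    "tertiary": "9.6.63.2",
}


def pick_controller(ap_config, reachable):
    """Return the first reachable controller in primary/secondary/tertiary order."""
    for role in ("primary", "secondary", "tertiary"):
        address = ap_config.get(role)
        if address in reachable:
            return address
    return None  # no configured controller is reachable


def admit_by_priority(aps, capacity):
    """Decide which APs a backup WLC keeps when it is oversubscribed.

    aps: list of (ap_name, priority) tuples; higher priority wins (assumption).
    """
    ranked = sorted(aps, key=lambda item: -item[1])
    return [name for name, _ in ranked[:capacity]]
```

In this model a single controller failure inside the SSO pair never reaches `pick_controller`, because the primary address stays reachable; the fallback order only matters when both members of the pair are down.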
Multi-Site Campus : SSO everywhere!
• Each site can be its own separate SSO architecture
• Full site redundancy by assigning primary, secondary, tertiary to the APs
• Max level of High Availability: no network downtime upon controller failure within any site
AP Config:
Primary WLC – 9.6.61.2
Secondary WLC – 9.6.62.2
Tertiary WLC – 9.6.63.2
[Diagram: Main Data Centre, DC 1 and DC 2 joined by the IP network; Primary SSO pair 9.6.61.x, Secondary SSO 9.6.62.x and Tertiary SSO 9.6.63.x, each with members .2 and .3; ISE and PI; Campus Access]
HA Deployment Best
Practice
Focus on Branch
HA Deployment Best Practices: Branch Key
Design Questions
Local Controller: Controller (Appliance/virtual)
• Specific per branch configuration
• Independency from WAN quality
• Reduced configuration on switches
• Full feature support
• L3 roaming supported
FlexConnect
• Single pane of Mgmt. & Troubleshooting
• Reduced branch footprint
• Built-in resiliency
• Perfect fit for centralized IT Team
Mobility Express
• Specific per branch configuration
• Independency from WAN quality
• Low hardware footprint (Controller running on Access Point)
HA questions:
• Is the branch independent from the Central site from an operations perspective?
• What is the traffic flow of your application? Are the APP servers centrally located?
• Is there a local Internet breakout? How do you authenticate new users if WAN/Controller is down? Where is the AAA server located?
FlexConnect Branch Summary
“Central Controller Cluster for thousands of Sites and Access Points”
Key Facts:
• "Cloud Controller" (private or public)
• Ease of Operations: single point of configuration for up to 6000 APs
When to use:
• Perfect for centralized IT Team
High Availability:
• If controller not reachable: local data path stays UP and clients stay connected, you can use AAA survivability
• SSO at central site provides control plane survivability
Keep in Mind:
• Switchport as Trunk if SSID/VLAN separation needed
• WAN Performance
• Some feature limitations (compared with local Controller)
[Diagram: Data Centre / Campus Services with ISE, PI and WLC SSO pair; WAN; remote location with FlexConnect APs]
Local Controller Branch Summary
“Do your clients need full Enterprise feature set (even if WAN is down)?”
Key Facts:
• Position one or two controllers per branch
• Full feature set available
When to use:
• WAN bandwidth and latency is a concern
• Simple configuration on the switch port connected to the Access Point desired
• Branch/local IT staff requires configuration outside of corporate standard
High Availability:
• Full features available if WAN is down
• Use N+1 or SSO for site controller redundancy
• Local Authentication, DHCP, DNS required for full WAN independency
Keep in Mind:
• Need to manage each site individually
• Prime Infrastructure should be considered for central manageability
[Diagram: Data Centre / Campus Services with ISE and PI; WAN; remote location with WLC and Local Services: AAA, DHCP, DNS]
Mobility Express Branch Summary
“Quick and Easy setup, no additional Hardware, WAN Independency”
Key Facts:
• It’s a Wireless Controller running on an Access Point!
When to use:
• WAN independency is required and low hardware footprint is desired
• Ideal for new deployments using 18xx/28xx/38xx Series Access Points
High Availability:
• Self-Healing redundancy
• Independent from WAN
• Local AAA, DHCP, DNS for full WAN independency
Keep in Mind:
• Switchport as Trunk if SSID/VLAN separation needed
• Per branch configuration and management
• Consider adding Prime Infrastructure or Cisco DNA Center for central management
[Diagram: Data Centre / Campus Services with ISE and PI; WAN; remote location with Mobility Express APs and Local Services: AAA, DHCP, DNS]
Cisco Wireless
Assurance:
Proactively Monitor your
Network
Cisco DNA Center can manage all wireless deployment modes for Automation and Assurance
Cisco DNA Center: Policy, Automation, Analytics
• SDA-Wireless: Policy Segmentation and consistent wired-wireless management
• Centralized: Ease of Deployment and management for large campuses
• FlexConnect: Eliminate the need for a Controller at every Site for a distributed deployment
• Mobility Express: Simplified Controller-less deployment for distributed deployments and small sites
• Set up / Configure: From a web browser or Cisco wireless app, use the setup wizard to enable multiple APs simultaneously
Continuous Verification
Configs, Changes, Routing, Security
Services, Compliance, Audits
Successful Rollouts, Operational Continuity
Insights & Visibility
Visibility, Context, Historical
Insights, Prediction
Minimize Downtime, User Productivity
Corrective Actions
Guided Remediation, Automated Updates
System Optimization
IT Productivity
*Available with 16.10.1s
and Cisco DNAC 1.2.8 or later
Purpose-Built for Cisco DNA Assurance
Wireless Streaming Telemetry Architecture
Cisco DNA Center
gRPC/Protobuf
TLS/TDL
https/JWT
AP WSA/JWT
AP2/3/4800K
ME, WLC3504/5520/8540
Catalyst 9800 Series
Active Sensor AP1800S
• HTTP 2.0/gRPC based
• Anomaly Event, RF Stat,
PCAP, Spectrum
• Scheduled and Automated
• Supported from AireOS 8.5
• Real-Time client event
• KPI Parity with AireOS
• Immediate Event Update
• Embedded Wireless in
Cat9300
• HTTPS for Automation and
reporting
• PnP-based Provisioning
• Fully Managed by Cisco
DNAC
Cisco DNA Center: Built ground-up for Assurance
Real-Time and Context based Telemetry
• Client RF stats, Onboarding state and location (<5 sec)
• Client Onboarding Health with Sankey charts for better analysis
• Near-Real time Client tracking
1800s Sensor to validate user experience
• Floor reassignment to make 1800s sensor mobile
• Speed tests to validate Cloud app connectivity
• IP SLA tests for Real-time AppX assessment
Intelligent Capture for proactive troubleshooting
• Live and In-Service capture of Onboarding failures with PCAPs
• Spectrum Analyzer for analyzing Interference sources
• On-Demand AP stats for Wi-Fi troubleshooting
Wireless Assurance: Client Onboarding
Client
Onboarding
1
Actionable Dashboards:
Onboarding Sankey charts
for better analysis
2
Real-time Correlation:
Correlate Onboarding
events with poor RF and
client location for RCA
3
Intelligent Capture:
Onboarding failures with
In-service PCAPs
Sankey chart
Wireless Assurance: Sensors to monitor SLAs
Sensor based
SLA Monitoring
1
Simulate Client
perspective:
1800s Sensor is mobile
with floor re-assignment
2
Active Testing:
Test the cloud app
performance and Real-time AppX assessment
3
SLA Dashboard:
Onboarding, Network
Services, Cloud App
Performance and IP SLA
Cisco Sensors: Intelligence of Cisco DNA
Assurance to the edge
AP as a Sensor* (1800/2800/3800/4800)
• Can be configured as a dedicated sensor when set to AP as a Sensor
• Automatically converted to Sensor or AP by Cisco DNAC
Aironet 1800S Active Sensor
• Purpose-built hardware for analytics
• 2x2 with 2 spatial streams
• Multiple powering options: PoE power, USB Type-C power, direct AC power plug
• Integrated BLE
• Ultra-compact form factor
• SLA Dashboard
• Onboarding & Services Tests
• Configure Tests Remotely
• Global Issue Creation
• Dynamic Sensor Test Trigger
Test your network anywhere, at any time, at real-world client level
*AP2800/3800/4800 w/ 8.5MR4 or 8.8.111.0
Wireless Assurance: Client Health and Intelligent
Capture
Client and
Network
Experience
1
Health Dashboard:
Near-Real time Client
tracking (<60 sec) and
Top N AP analytics
2
Client 360:
Historical Time travel with
client RF correlated with
the Onboarding events
3
Intelligent Capture:
On-Demand AP stats for
Wi-Fi troubleshooting
Know your clients!
Client Insights– Apple iOS Analytics
1
Device Profile
Client shares these details
1.
Model e.g. iPhone 7
2.
OS Details e.g. iOS 11
Support per device-group
Policies and Analytics
2
Wi-Fi Analytics
Client shares these details
1.
BSSID
2.
RSSI
3.
Channel #
Insights into the clients view
of the network
3
Assurance
Client shares these details
Error code for why it
previously disconnected
Provide clarity into the
reliability of connectivity
Cisco DNA Wireless Assurance
• Be proactive: Use Sensor-based verification for critical services!
• Know your clients: Cisco/Apple WiFi iOS Analytics.
• Intelligent Capture: Whose fault is it? "Always on" packet capture helps to differentiate between RF and application/client issues.
• Go back in time: What happened yesterday/last week?
• Actionable Insights: Provide guidance on how to solve the issue.
Cisco DNA Center
Policy
Automation
Analytics
I would like to leave you
with…
Key Takeaways
• High Availability for Wireless is a multi-level approach, starting from Level 1 (RF)
• You have different solutions to choose from, based on the downtime that is acceptable for your business application
• Cisco Controller SSO eliminates network downtime upon controller failure
• Wireless Assurance is key to assess your network stability and proactively test
Selected additional Wireless Sessions…
• Cisco DNA Wireless Assurance: Isolate problems for faster troubleshooting - BRKEWN-2034
  • Friday, Feb 01, 9:00 AM - 11:00 AM | Hall 8.0, Session Room C118
• How to setup an SD Access Wireless fabric from scratch - BRKEWN-2021
  • Wednesday, Jan 30, 2:30 PM - 4:00 PM | Hall 8.0, Session Room D134
• Cisco SD-Access Wireless Integration - BRKEWN-2020
  • Friday, Feb 01, 11:30 AM - 1:00 PM | Hall 8.0, Session Room C129
• Cisco DNA Center Assurance and Analytics – Reducing Time to resolution using Big Data and Machine Learning - BRKNMS-2542
  • Tuesday, Jan 29, 2:30 PM - 4:00 PM | Hall 8.0, Session Room A108
• Advanced Troubleshooting of Cisco Catalyst 9800 Wireless Controller - BRKEWN-3013
  • Thursday, Jan 31, 8:30 AM - 10:30 AM | Hall 8.0, Session Room C126
• Improve Enterprise WLAN Spectrum Quality with Cisco's advanced RF capacities (RRM, CleanAir, ClientLink, etc) - BRKEWN-3010
  • Wednesday, Jan 30, 8:30 AM - 10:30 AM | Hall 8.0, Session Room A103
  • Thursday, Jan 31, 11:00 AM - 1:00 PM | Hall 8.0, Session Room C126
Agenda
• Enterprise Data Center High Availability (DC HA)
• DC Switch NX-OS HA Architecture and HA Features
• DC Network HA Design and Operational Best Practices
  • Legacy DC with vPC
  • Programmable Fabric
  • Application Centric Infrastructure (ACI)
  • Programmable Network
• Key Takeaways
Data Center HA Section Objective
• Focus is on Enterprise Data Center Network
• High Availability design options and best practices
• High Availability operational best practices
• Same principle: The Enterprise Campus Network High Availability concepts
are applicable to Data Center network
• Same goal: minimize network downtime
NX-OS HA Architecture
• Fully distributed modular design
• Control-plane & data-plane separation
• Service restart-ability
• Non-disruptive SSO* & ISSU
[Diagram: layered stack of Feature modules, System-infrastructure modules and Platform-dependent hardware-related modules, showing Feature API, Management Infrastructure API, HA Infrastructure API, Hardware Drivers, Netstack and Kernel]
*SSO only available on dual-sup Nexus 7x00 and 9500
NX-OS Service Restart-ability
• Stateful Restart with Persistent Storage Service (PSS)
  • Checkpoints states to PSS
  • Recover states from PSS upon restart
• Stateful Restart with Graceful Restart
  • Recover states based on information from other services and/or network
  • Mainly routing protocols
• Stateless Restart
  • Fresh start, no trace of former instantiation
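The difference between a PSS-backed stateful restart and a stateless restart can be shown with a toy service that checkpoints its runtime state to disk. This is only an analogy: the `Service` class, the JSON file format and the `demo` helper are hypothetical and say nothing about how PSS is actually implemented in NX-OS.

```python
import json
import os
import tempfile


class Service:
    """Toy service whose runtime state can be checkpointed to a store."""

    def __init__(self, name, store_dir):
        self.path = os.path.join(store_dir, name + ".json")
        self.state = {}

    def checkpoint(self):
        # Persist current runtime state (stand-in for writing to PSS).
        with open(self.path, "w") as fh:
            json.dump(self.state, fh)

    def restart(self, stateful=True):
        # A restart always loses in-memory state first.
        self.state = {}
        if stateful and os.path.exists(self.path):
            with open(self.path) as fh:
                self.state = json.load(fh)  # recover from the checkpoint
        return self.state


def demo():
    """Return the state seen after a stateful and after a stateless restart."""
    with tempfile.TemporaryDirectory() as store:
        svc = Service("hsrp", store)
        svc.state = {"group": 1, "role": "active"}
        svc.checkpoint()
        stateful = dict(svc.restart(stateful=True))
        stateless = dict(svc.restart(stateful=False))
    return stateful, stateless
```

Graceful Restart is the third option on the slide: instead of reading a local checkpoint, the restarted process rebuilds its state from peers or other services, which this sketch does not model.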
NX-OS NSF with Stateful Fault Recovery
[Diagram: control-plane processes (LACP, HSRP, STP, IPv6, TCP/UDP, PIM, OSPF, BGP, etc.) and the Software RIB above the HA Manager and Linux Kernel; a failed process is restarted while Graceful Restart keeps routing updates flowing; table updates go from the Software RIB to the Hardware FIB in the Nexus data plane]
If a fault occurs in a process…
• HA manager determines best recovery action (restart process, switchover to redundant supervisor)
• Process restarts with no impact on data plane
• State checkpointing (PSS) allows instant, stateful process recovery
• Software utilizes Graceful Restart where appropriate
NX-OS NSF with Stateful Supervisor Switchover
• Supervisor switchover triggers:
  • HA policy initiated
    • Process restarts have failed
    • When the kernel fails (or panics)
    • When the Supervisor experiences a hardware failure
  • User initiated – system switchover
[Diagram: Active Sup, Standby Sup, LC - NSF]
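The escalation from process restart to supervisor switchover can be sketched as a small decision function. Everything here is illustrative: the fault names, the `MAX_RESTARTS` threshold and the return labels are hypothetical, and the real HA policy in NX-OS is per-service and not reproduced here.

```python
MAX_RESTARTS = 3  # hypothetical threshold; actual HA policies are per service

SWITCHOVER_FAULTS = {
    "process_crash",          # only once restarts have been exhausted
    "kernel_panic",
    "supervisor_hw_failure",
    "user_request",           # user-initiated "system switchover"
}


def recovery_action(fault, restart_attempts=0, has_standby_sup=True):
    """Pick a recovery action for a fault, following the triggers on the slide."""
    if fault == "process_crash" and restart_attempts < MAX_RESTARTS:
        return "restart_process"
    if fault in SWITCHOVER_FAULTS:
        if has_standby_sup:
            return "supervisor_switchover"
        return "no_redundant_supervisor"
    raise ValueError(f"unknown fault: {fault}")
```

A crashed process is restarted first; only when restarts keep failing, or the fault is below the process level, does the decision move to the standby supervisor.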
NX-OS NSF Configuration
• The Nexus products are “NSF Capable” by default for all the routing protocols in all NX-OS software releases.
• No additional configuration is required unless you need to modify the default NSF timers.
Nexus# show running-config ospf all
!Command: show running-config ospf all
!Time: Tue May 19 16:13:31 2009
version 4.2(1)
feature ospf
<snip>
router ospf 1
  graceful-restart
  graceful-restart grace-period 60
  area 0.0.0.0 authentication message-digest
NX-OS Software Maintenance Upgrade
Direction
• Non-disruptive bug fix for restartable/stateful processes
• Limited number of patches supported
• Works with or without ISSU
• Not every bug will have a patch
• For operationally impacting bugs with no workaround
• May be disruptive
• Platform and process specific
NX-OS Stateful Process Restart & Patching
• NX-OS services checkpoint their runtime state to the Persistent Storage Service
When a process is patched…
• Install process applies new patch
• HA manager restarts process
• Process restarts with patched code and no impact on data plane
• State is recovered, operation resumes
• Total recovery time: ~10s of ms
[Figure: control-plane services (OSPF 1, EIGRP, BGP, HSRP 1, HSRP 2, OTV, vPC, UDLD, SSH, IGMP, STP) above Management Infrastructure, HA Infrastructure, Hardware Drivers, Netstack and Kernel; one BGP process is marked "Restart process!"; the data plane sits below the control plane.]
Software Patching: CLI procedure
1. Copy the SMU from the repository to the device
2. N7K> Install Add
3. N7K> Install Activate (SMU Applied)
4. N7K> Install Commit (SMU Committed)
To back a patch out:
5. N7K> Install Deactivate
6. N7K> Install Commit (SMU Committed)
7. N7K> Install Remove (SMU Removed)
Verification: Show Install Active, Show Install Committed, Show Install Inactive, Show Install Packages
[Figure: each stage shows whether the SMU is present in Memory and in the running Process]
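The SMU commands on this slide form a simple lifecycle, which can be sketched as a state machine. The command names come from the slide; the state names and the exact set of allowed transitions are assumptions for illustration and may differ from what a given platform accepts.

```python
# (current state, command) -> next state. State names are illustrative.
TRANSITIONS = {
    ("absent", "install add"): "inactive",
    ("inactive", "install activate"): "applied",
    ("applied", "install commit"): "committed",
    ("committed", "install deactivate"): "deactivated",
    ("deactivated", "install commit"): "inactive",
    ("inactive", "install remove"): "absent",
}

def run(commands, state="absent"):
    """Apply install commands in order and return the resulting SMU state."""
    for command in commands:
        key = (state, command)
        if key not in TRANSITIONS:
            raise ValueError(f"'{command}' is not valid in state '{state}'")
        state = TRANSITIONS[key]
    return state

apply_patch = ["install add", "install activate", "install commit"]
back_out = ["install deactivate", "install commit", "install remove"]
print(run(apply_patch))
print(run(apply_patch + back_out))
```

Modeling the procedure this way makes the ordering constraint explicit: a patch has to be added before it can be activated, and deactivated before it can be removed.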
Dual Supervisor Standard ISSU
N7K# install all kickstart bootdisk:7.3-kickstart system bootdisk:7.3-system
1 Upgrade standby supervisor
2 Reload standby supervisor
3 Perform SSO
4 Upgrade standby supervisor
5 Reload standby supervisor
6 Upgrade LCs & FEX in series *
[Figure: Sup 1 and Sup 2 alternate between Active and Standby as each moves from Release 7.2 to Release 7.3]
Fixed Switch (ToR) Standard ISSU
• Control plane is inactive during reload while data plane is forwarding
• ISSU is non-disruptive for L2 services
• STP enabled switches cannot be present downstream
• ISSU is disruptive for L3 services
Supervisor upgrade steps (Version 7.2 to 7.3):
1. Reload supervisor
2. Load new version
3. Restore control plane and configuration
4. Reconcile with Data Plane
[Figure: control plane (supervisor) above the data plane]
Fixed Switch (ToR) Enhanced ISSU (N3K/N9K)
# install all nxos v2.bin
1. Container A runs NX-OS (V1) on the Host OS (Linux)
2. Container B spawned to boot up with NX-OS (V2)
3. Container B becomes Active
4. Container A destroyed
NX-OS upgraded to V2 with ~3-5 seconds impact to control plane traffic
In-Service Software Upgrade
For Your Reference
ISSU | NX-OS Switch | Traffic Loss
Standard ISSU | Dual supervisor modular switch: N9500, N7700, N7000 | Control plane: <3-5 sec; Data plane: 0/no service disruption
Standard ISSU | Fixed switch: N9300, N3000, N5500, N5600, N6000 | Control plane: <120 sec; Data plane: 0/no service disruption
Enhanced ISSU | Fixed switch: N9300, N3000 | Control plane: <3-5 sec; Data plane: 0/no service disruption
NX-OS ISSU Best Practices
• For Layer 2 and Layer 3 protocols with sensitive timers, the timeout value should be increased. Otherwise, the upgrade will be disruptive.
• Best practices vPC ToR
  • Make sure that both vPC peers are in the same mode (traditional ISSU mode or enhanced ISSU mode)
  • Connect host using port-channel to a pair of vPC ToR
  • If ToR vPC is STP root bridge: enable peer-switch to avoid STP root change during ISSU
  • If ToR vPC is not STP root bridge: enable all ports as edge/edge trunk ports
Graceful Insertion and Removal for NXOS
Isolation of Switch from network
Change window begins.
vPC
vPC
system mode maintenance
One command!
Pre-change System Snapshot
Graceful Insertion and Removal for NXOS
Return of Switch into network
Change window complete.
vPC
vPC
no system mode maintenance
One command!
Post-change System Snapshot
Configuration Profiles
• Maintenance-mode profile is applied when entering GIR mode.
• Normal-mode profile is applied when GIR mode is exited.
Automatic Profiles
• Generated by default
• Parses configuration to determine changes going into and out of GIR
• Changes based on base protocol configuration settings
• Use: maintenance windows
Custom Profiles
• User-created profile for maintenance-mode and normal-mode
• Flexible selection of protocols for isolation
• Use: maintenance windows and isolation during troubleshooting
Graceful Insertion and Removal Feature
Graceful Removal with Isolate command
• New CLI 'isolate' in all Unicast Protocols
• Make Nexus undesirable for all transit traffic
• Maintain Protocol Adjacencies
• Send route withdrawals/worse metrics
• Local route states are maintained.
• Multicast follows Unicast for RPF
• Feature available: N5K/6K: 7.3(0)N1(1); N7K: 7.3(0)D1(1); N9K/N3K: 7.0(3)I2(1)
//Sends route withdrawals
router bgp 33
  isolate
//Poisons the routes by sending highest metric
router eigrp 1
  isolate
//Advertises max-metric router-lsa
router ospf 1
  isolate
//Refreshes LSPs with overload-bit on
router isis 1
  isolate
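As a rough sketch of what a maintenance-mode profile contains, the snippet below builds the `isolate` lines for a list of routing instances and the matching `no isolate` lines for the normal-mode profile. The helper names and the `(protocol, tag)` input format are hypothetical; the per-protocol effects are taken from the comments on the slide.

```python
# What "isolate" does per protocol, as described on the slide.
ISOLATE_EFFECT = {
    "bgp": "sends route withdrawals",
    "eigrp": "poisons the routes by sending highest metric",
    "ospf": "advertises max-metric router-lsa",
    "isis": "refreshes LSPs with overload-bit on",
}

def _profile(instances, command):
    lines = []
    for protocol, tag in instances:
        if protocol not in ISOLATE_EFFECT:
            raise ValueError(f"isolate behavior not listed for {protocol}")
        lines.append(f"router {protocol} {tag}")
        lines.append(f"  {command}")
    return lines

def maintenance_profile(instances):
    """Config lines applied when entering GIR mode."""
    return _profile(instances, "isolate")

def normal_profile(instances):
    """Config lines applied when leaving GIR mode."""
    return _profile(instances, "no isolate")

instances = [("bgp", "33"), ("ospf", "1")]
print("\n".join(maintenance_profile(instances)))
print("\n".join(normal_profile(instances)))
```

In practice the automatic profile is generated by the switch itself; a script like this is only useful for reasoning about what a custom profile would need to contain.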
GIR – Platform Specifics
For Your Reference
Nexus 5K/6K
  Shutdown: Only supports shutdown mode from 7.1(0)N1(1)
    Supported features: BGP/BGPv6, EIGRP/EIGRPv6, ISIS/ISISv6, OSPF/OSPFv3, RIP, FabricPath (spine switch), vPC/vPC+, Interfaces
  Isolate: Default mode is isolate from 7.3(0)N1(1), shutdown is optional mode
    Supported features: BGP/BGPv6, EIGRP/EIGRPv6, ISIS/ISISv6, OSPF/OSPFv3, RIP, FabricPath (spine switch), vPC/vPC+ (shutdown only), Interfaces (shutdown only)
Nexus 7K
  Shutdown: Only supports shutdown mode from 7.2(0)D1(1)
    Supported features: BGP/BGPv6, EIGRP/EIGRPv6, ISIS/ISISv6, OSPF/OSPFv3, RIP, FabricPath (spine switch), vPC/vPC+, Interfaces
  Isolate: Default mode is isolate from 7.3(0)D1(1), shutdown is optional mode
    Supported features: BGP/BGPv6, EIGRP/EIGRPv6, ISIS/ISISv6, OSPF/OSPFv3, RIP, FabricPath (spine switch), vPC/vPC+ (shutdown only), Interfaces (shutdown only)
Nexus 9K/3K
  Shutdown and Isolate: Default mode is isolate from 7.0(3)I2(1), shutdown is optional mode
    Supported features: BGP/BGPv6, EIGRP/EIGRPv6, ISIS/ISISv6, OSPF/OSPFv3, PIM (on vPC), RIP, vPC (shutdown only), Interfaces (shutdown only)
Putting it all Together
What to use? GIR Mode? Patching? ISSU? All of them?
Option \ Situation | Critical Bug Fix & PSIRT | Hardware Upgrade | New Features
ISSU | ✓ | X | ✓
GIR + Cold Boot | ✓ | X | ✓
GIR + Disruptive Installer | ✓ | X | ✓
SMU Restart | ✓ | X | X
GIR + SMU Reload | ✓ | X | X
GIR | X | ✓ | X
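The option matrix on this slide is easy to query once it is written down as data. The sketch below encodes the table as-is; the dictionary keys (`bug_fix`, `hardware_upgrade`, `new_features`) are shorthand chosen here for the three column headers.

```python
# Rows and check marks copied from the slide; keys are shorthand for its columns.
OPTIONS = {
    "ISSU": {"bug_fix": True, "hardware_upgrade": False, "new_features": True},
    "GIR + Cold Boot": {"bug_fix": True, "hardware_upgrade": False, "new_features": True},
    "GIR + Disruptive Installer": {"bug_fix": True, "hardware_upgrade": False, "new_features": True},
    "SMU Restart": {"bug_fix": True, "hardware_upgrade": False, "new_features": False},
    "GIR + SMU Reload": {"bug_fix": True, "hardware_upgrade": False, "new_features": False},
    "GIR": {"bug_fix": False, "hardware_upgrade": True, "new_features": False},
}

def options_for(situation):
    """Return the maintenance options the table marks as suitable."""
    return [name for name, fits in OPTIONS.items() if fits[situation]]

print(options_for("hardware_upgrade"))
print(options_for("new_features"))
```

Read this way, the table says that only GIR on its own covers a hardware upgrade, while SMU-based options are limited to bug fixes.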
Agenda
• Enterprise Data Center High Availability (DC HA)
• DC Switch NX-OS HA Architecture and HA Features
• DC Network HA Design and Operational Best Practices
  • Legacy DC with vPC
  • Programmable Fabric
  • Application Centric Infrastructure (ACI)
  • Programmable Network
• Key Takeaways
Data Center Fabric Technology Evolution
2008 and before: STP
2009: vPC
2010: FabricPath
2014: VXLAN F&L
2015-2019: VXLAN EVPN
ACI
Cisco Data Center Network Solutions
• Classic Ethernet & vPC
• Programmable Fabric
• Application Centric Infrastructure
• Programmable Network
[Figure: Web, App and DB application tiers]
High Availability Design Principle
Structure, Modularity, and Hierarchy
• Structured design
  • Allows you to manage and understand traffic flows, and network failure behavior
• Modular design
  • Allows for easier evolution and change to the network
• Hierarchical design
  • Provides for improved scalability
  • Separates network services into manageable building blocks
High Availability Design Principle
Structure, Modularity, and Hierarchy
• Optimize the interaction of the physical redundancy with the network protocols
  • Provide the necessary amount of redundancy
  • Pick the right protocol for the requirement
  • Optimize the tuning of the protocol
• Optimize network convergence failure detection and recovery
  • Optimal high availability network design attempts to leverage ‘local’ switch fault detection and recovery
  • Design should leverage the hardware capabilities of the switches to detect and recover traffic flows based on these ‘local’ events
vPC Feature Overview
vPC Terminology
[Figure: two vPC peers (P = primary, S = secondary) form a vPC domain below a Layer 3 cloud. Labels: vPC peer, vPC peer-keepalive link, peer link, CFS, vPC, orphan port, orphan device; switches S1, S2, S3.]
vPC is supported on Cisco Nexus switches (N5k, N6k, N7k, N9k, N3k)
vPC Failure Scenario
vPC Peer-Keepalive Link up & vPC Peer-Link down
• vPC peer-link failure (link loss):
  • vPC peer-keepalive up
  • Status of other vPC peer known
  • Secondary vPC peer disables all vPCs
  • Traffic forwarded by vPC primary
[Figure: vPC peers S1 and S2 joined by vPC_PLink and the vPC peer-keepalive link carrying the keepalive heartbeat; SW3 and SW4 attach through vPC1 and vPC2; the secondary suspends its vPC member ports.]
P  Primary vPC
S  Secondary vPC
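The failure scenario above reduces to a short decision rule, sketched here in Python. The function name and arguments are illustrative. The case where both the peer-link and the peer-keepalive are down is not covered by this slide, so the sketch refuses to guess.

```python
def vpc_member_ports_forwarding(role, peer_link_up, keepalive_up):
    """Return True if this peer keeps its vPC member ports up.

    role is "primary" or "secondary".
    """
    if role not in ("primary", "secondary"):
        raise ValueError(f"unknown role: {role}")
    if peer_link_up:
        # Normal operation: both peers forward on their vPC member ports.
        return True
    if keepalive_up:
        # Peer-link lost but the peer is known to be alive:
        # the secondary suspends its vPCs, the primary keeps forwarding.
        return role == "primary"
    raise NotImplementedError("peer-link and keepalive both down: not covered here")

print(vpc_member_ports_forwarding("primary", peer_link_up=False, keepalive_up=True))
print(vpc_member_ports_forwarding("secondary", peer_link_up=False, keepalive_up=True))
```

The keepalive is what makes the rule safe: because the secondary knows the primary is still alive, it can suspend its member ports without causing an outage.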
Legacy DC HA Design with vPC
• Core Layer
  • Layer 3 ECMP for multipath redundancy
[Figure: three-tier topology with Core, Aggregation and Access layers; switches S1 to S6]
Legacy DC HA Design with vPC
• Aggregation Layer
  • HSRP / VRRP / GLBP with vPC for active/active gateway
  • Use default FHRP timers
[Figure: three-tier topology with Core, Aggregation and Access layers; switches S1 to S6]
Legacy DC HA Design with vPC
• Access Layer
  • Connect to a pair of Aggregation switches via Layer 2 port-channel
  • Redundant uplinks
[Figure: three-tier topology with Core, Aggregation and Access layers; switches S1 to S6]
Legacy DC HA Design with vPC
• Access Layer
  • Double-sided vPC connecting to Aggregation layer
  • Higher resilience
  • Different vPC domain ID
[Figure: Core, Aggregation and Access layers; vPC Domain 10 and vPC Domain 20]
VSS vs vPC
Catalyst: Non VSS vs VSS
• Merge Data and Control Plane
Nexus: Non vPC vs vPC
• Merge Data Plane only!!
• Control Plane still separate
VSS Design vs vPC Design
Nexus vPC
Catalyst VSS
Don’t design VSS and vPC in same way for Layer 3!
VSS Design vs vPC Design
Nexus vPC
Catalyst VSS
vPC Layer 3 routed uplink with ECMP
vPC – Layer 2 Data Center Interconnect (DCI)
[Figure: DC 1 and DC 2 connected over long-distance dark fiber, with 802.1AE (optional), using a Layer 2 vPC port-channel. Each DC has CORE, AGGR and ACCESS layers and a server cluster; vPC domains 10, 11, 20 and 21 are shown. Each port is annotated with the port types below.]
Port type legend:
N  Network port
E  Edge or portfast
-  Normal port type
B  BPDUguard
F  BPDUfilter
R  Rootguard
vPC Best Practices
For Your Reference
vPC General Deployment Best Practices
• Unique vPC domain IDs in the contiguous Layer 2 domain
• Enable vPC peer-gateway, to act as the active gateway for packets addressed to the router MAC of the vPC peer
  • Keeps forwarding of traffic local to the vPC node and avoids use of the peer-link
• Enable vPC peer-switch, to present the vPC peer switches as a single logical entity
  • Optimized BPDU processing
• Enable auto-recovery to address two cases of single switch behavior
  • Peer-link fails and after a while primary switch fails
  • Both vPC peers are reloaded and only one comes back up
• Enable vPC orphan-ports suspend to prevent orphan device traffic blackhole during peer-link failure
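A quick way to check a saved configuration against some of these recommendations is to scan the `vpc domain` stanza for the relevant keywords. The sketch below is a hypothetical audit helper, not a Cisco tool. It assumes the keywords appear exactly as `peer-gateway`, `peer-switch` and `auto-recovery` on indented lines under `vpc domain`, and it ignores the orphan-port setting, which is configured per interface.

```python
# Keywords checked inside the "vpc domain" stanza (assumed spellings).
RECOMMENDED = ("peer-gateway", "peer-switch", "auto-recovery")

def missing_vpc_options(config_text):
    """Return the recommended vpc-domain options not found in the config."""
    seen = set()
    in_domain = False
    for raw in config_text.splitlines():
        line = raw.strip()
        if line.startswith("vpc domain"):
            in_domain = True
        elif in_domain and line and not raw[0].isspace():
            # A non-indented line ends the vpc domain stanza.
            in_domain = False
        elif in_domain and line in RECOMMENDED:
            seen.add(line)
    return [opt for opt in RECOMMENDED if opt not in seen]

sample = """vpc domain 100
  peer-switch
  peer-gateway
interface nve1
  source-interface loopback1
"""
print(missing_vpc_options(sample))
```

A text scan like this is deliberately simple. It catches omissions but does not validate that the options are supported on a given platform or release.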
vPC Configuration Best Practices
Enable Object-tracking
• vPC object tracking tracks both peer-link and uplinks in a list of Boolean OR
• Object tracking is triggered when the track object goes down
  • Suspends the vPCs on the impaired device.
  • Traffic forwarded over the remaining vPC peer
! Track the vpc peer link and uplinks
track 1 interface port-channel11 line-protocol
track 2 interface Ethernet1/1 line-protocol
track 3 interface Ethernet1/2 line-protocol
! Combine all tracked objects into one.
! “OR” means if ALL objects are down, this object will go down
track 10 list boolean OR
  object 1
  object 2
  object 3
! If object 10 goes down on the primary vPC peer,
! system will switch over to other vPC peer and disable all local vPCs
vpc domain 1
  track 10
[Figure: switches S1 to S5]
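The Boolean OR semantics of the track list can be confusing at first, so here is a minimal Python sketch of the logic. The function names are illustrative; the object numbers mirror the configuration on this slide.

```python
def list_or_state(object_states):
    """A boolean-OR track list is up while at least one member is up."""
    return "up" if any(object_states.values()) else "down"

def local_vpcs_suspended(object_states):
    """vPCs are suspended only when the combined object goes down."""
    return list_or_state(object_states) == "down"

# Objects 1-3: peer-link port-channel and the two uplinks.
one_uplink_left = {1: False, 2: False, 3: True}
all_down = {1: False, 2: False, 3: False}

print(list_or_state(one_uplink_left), local_vpcs_suspended(one_uplink_left))
print(list_or_state(all_down), local_vpcs_suspended(all_down))
```

In other words, the switch hands traffic over to its vPC peer only when it has lost the peer-link and every tracked uplink at the same time.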
vPC Hitless Role Change Feature
Without vPC Hitless Role Change
• Traffic interruption.
• Manually flap vPC peer link
• Not graceful
With vPC Hitless Role Change
• No traffic interruption.
• CLI: “vpc role preempt”
• Graceful
Note: supported from N7k: 7.3(0)D1(1), N9k/3k: 7.0(3)I7(1)
Not supported on N5k/6k
vPC Shutdown Feature
• Isolates a switch from the vPC complex
• Isolated switch can be debugged, reloaded, or even removed physically, without affecting the vPC traffic going through the non-isolated switch
switch# configure terminal
switch(config)# vpc domain 100
switch(config-vpc)# shutdown
Note: supported from N7k: 7.3(0)D1(1), N9k: 7.0(3)I2(2), N5k/6k: 6.0(2)N2(1)
[Figure: primary and secondary vPC peers S1 and S2 with downstream switch S3]
Graceful Insertion and Removal Example
FHRP with vPC Switch Isolation using GIR
• Use automatic profile to go into GIR.
//Enter maintenance mode using the system mode maintenance command:
switch# configure terminal
switch(config)# system mode maintenance
Following configuration will be applied:
router ospf 100
  isolate
vpc domain 2
  shutdown
Do you want to continue (y/n)? [no] y
[Figure: core network above the vPC pair. L3: isolate unicast routing protocol. L2: vPC shutdown.]
Legacy DC HA Design with vPC Key Takeaways
•
To minimize Legacy DC down time:
•
Follow vPC design best practices
•
Follow vPC configuration best practices
•
Follow vPC operation best practices
vPC References
For Your Reference
• vPC Design and Configuration Best Practices:
http://www.cisco.com/c/dam/en/us/td/docs/switches/datacenter/sw/design/vpc_design/vpc_best_practices_design_guide.pdf
Agenda
• Enterprise Data Center High Availability (DC HA)
• DC Switch NX-OS HA Architecture and HA Features
• DC Network HA Design and Operational Best Practices
  • Legacy DC with vPC
  • Programmable Fabric
  • Application Centric Infrastructure (ACI)
  • Programmable Network
• Key Takeaways
Cisco Data Center Network Solutions
• Classic Ethernet & vPC
• Programmable Fabric
• Application Centric Infrastructure
• Programmable Network
[Figure: Web, App and DB application tiers]
• Standards-based
• VXLAN BGP EVPN
• Forwarding & Multi-Tenancy
• Disaggregated Management
• Open NX-OS
Programmable Fabric Underlay HA Design
• Structured design with Spine, Leaf and Border Leaf
  • Allows you to manage traffic flow, network failure
• Layer 3 IP fabric with point-to-point link:
  • Better stability, faster convergence
  • Redundant links with ECMP
• Scale out spine leaf design
  • Better scalability and availability
[Figure: Pod 1 with spine switches, leaf VTEPs and border leaf VTEPs connected to an external Layer-3 network]
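To illustrate why redundant ECMP links help availability, the sketch below hashes a flow onto one of the available spines and shows that the flow simply lands on a surviving spine when one is removed. This is a toy model: real switches use hardware hash functions over packet header fields, and the spine names and flow tuple here are made up.

```python
import hashlib

def pick_spine(flow, spines):
    """Map a flow tuple onto one of the currently available spines."""
    if not spines:
        raise ValueError("no spine available")
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    ordered = sorted(spines)
    return ordered[int.from_bytes(digest[:4], "big") % len(ordered)]

spines = ["spine1", "spine2", "spine3", "spine4"]
flow = ("10.1.1.10", "10.2.2.20", 6, 49152, 443)  # src, dst, proto, sport, dport

before = pick_spine(flow, spines)
survivors = [s for s in spines if s != before]   # the chosen spine fails
after = pick_spine(flow, survivors)
print(before, "->", after)
```

Because every leaf has a path through every spine, losing one spine reduces capacity but does not isolate any leaf. The leaf only has to re-hash onto the remaining next hops.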
Programmable Fabric Underlay HA Design
• Structured design with Border Spine and Leaf
• Layer 3 IP fabric with point-to-point link:
  • Better stability, faster convergence
  • Redundant links with ECMP
• Scale out spine leaf design
  • Better scalability and availability
[Figure: Pod 1 with spines, border spines connected to an external Layer-3 network, and leaf VTEPs]
Programmable Fabric Overlay HA Design
• VXLAN EVPN based overlay
• Same “Anycast” SVI IP/MAC is enabled at all VTEPs/ToRs
  • Better availability, no IP gateway relearning
  • Optimal traffic forwarding, no hairpinning to GW
  • Enable host mobility
[Figure: overlay across spines, leaf VTEPs and border leaf VTEPs connected to an external Layer-3 network. SVI IP address: MAC 0000.1111.2222, IP 10.1.1.1]
Programmable Fabric Host HA Design
• Host connects to a pair of vPC leaf VTEPs directly (recommended)
• Host connects to a pair of vPC leaf VTEPs via FEX (not recommended)
• Redundant host uplinks
[Figure: overlay across spines, leaf VTEPs and border leaf VTEPs connected to an external Layer-3 network]
vPC Leaf VTEP Best Practices for HA
• vPC leaf VTEP best practices
  • Enable peer-gateway
  • Enable peer-switch
  • Enable IP ARP Sync
  • Use separate loopback address for VTEP source address
    • Control plane and data plane separation
    • Loopback0 is for underlay and overlay routing
    • Loopback1 with secondary IP is for VTEP source data plane
vpc domain 100
  peer-switch
  peer-keepalive destination 172.32.1.13 source 172.32.1.14
  delay restore 150
  peer-gateway
  ip arp synchronize
  ipv6 nd synchronize
interface nve1
  host-reachability protocol bgp
  source-interface loopback1
  source-interface hold-down-time 180
interface loopback0
  ip address 10.10.0.2/32
  ip router ospf UNDERLAY area 0.0.0.0
  ip pim sparse-mode
interface loopback1
  ip address 10.20.0.3/32
  ip address 10.20.0.1/32 secondary
  ip router ospf UNDERLAY area 0.0.0.0
  ip pim sparse-mode
vPC Delay Restore and Source Hold Down Timer
Delay restore
• If the host connection toward the recovering Leaf 2 is brought up before the ToR can successfully establish routing adjacencies with the fabric and the peer vPC leaf node, traffic is temporarily black-holed.
• Tuning delay-restore is required regardless of the SW release.
Source hold-down timer
• If the advertisement of the Anycast VTEP address happens before the vPC peer-link and vPC leg connection to the host are recovered, traffic will be black-holed as well.
• A “source-interface hold-down-time” is natively provided to keep the VTEP address (Loopback1) down for 180 sec (default), supported from 7.0(3)I2(2).
[Figure: recovering device Leaf 2, its vPC peer Leaf 1 and the spine. Left: control-plane adjacencies not fully established. Right: Anycast VTEP advertisement while the vPC peer-link and host-to-leaf connections are not recovered yet.]
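The interaction of the two timers can be reasoned about with a simplified timing model. The sketch below is an assumption-laden illustration: it treats the host-facing vPC legs as coming up exactly when delay restore expires and the Anycast VTEP address as being advertised exactly when the hold-down timer expires. The default values 150 and 180 are the ones used in the configuration shown earlier in this section.

```python
def blackhole_seconds(routing_ready_s, peer_link_up_s,
                      delay_restore_s=150, hold_down_s=180):
    """Estimate black-hole windows after a vPC leaf VTEP reloads.

    All arguments are seconds since the leaf started recovering.
    Returns (host_side, fabric_side) window lengths in seconds.
    """
    # Host-side: vPC legs come up before routing adjacencies are ready.
    host_side = max(0, routing_ready_s - delay_restore_s)
    # Fabric-side: the Anycast VTEP is advertised before the peer-link
    # and the vPC legs toward the host are back.
    legs_ready_s = max(peer_link_up_s, delay_restore_s)
    fabric_side = max(0, legs_ready_s - hold_down_s)
    return host_side, fabric_side

print(blackhole_seconds(routing_ready_s=120, peer_link_up_s=60))
print(blackhole_seconds(routing_ready_s=200, peer_link_up_s=60))
print(blackhole_seconds(120, 60, delay_restore_s=150, hold_down_s=100))
```

The model suggests a simple rule of thumb: delay restore should exceed the time the leaf needs to rebuild its routing adjacencies, and the hold-down time should be at least as long as delay restore.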
vPC Leaf VTEP HA Best Practices
• vPC leaf VTEP best practices
  • Enable a Layer 3 link between the two vPC VTEPs to connect them in the underlay network so that when one VTEP loses all its uplinks, it can still learn the routes through its vPC peer, and forward the traffic via its peer
  • The Layer 3 link can be a dedicated link or a point-to-point VLAN SVI over the vPC peer-link
[Figure: vPC VTEP-1 and vPC VTEP-2 with a vPC port-channel to the host, attached to the underlay network with IP ECMP load sharing]
Programmable Fabric External Routing HA Design
• The two Border Leaf VTEPs are independent of each other.
• They each individually exchange routes with the external routing devices, and advertise the external routes into the EVPN fabric.
• Distributed Anycast Gateway on the internal leaf VTEPs
• BGP multi-pathing needs to be enabled on the internal VTEP to leverage both border leaf VTEPs
[Figure: VXLAN overlay with EVPN MP-BGP; spines act as route reflectors (RR); leaf VTEPs run the anycast gateway; border leaf VTEPs connect through IP routing (routing protocol of choice) to external devices in the global default VRF instance or user space VRF instances]
Programmable Fabric External Routing HA Design
• The two Border Spines are both VTEPs.
• They each individually exchange routes with the external routing devices, and advertise the external routes into the EVPN fabric.
• Distributed Anycast Gateway on the internal leaf VTEPs
• BGP multi-pathing needs to be enabled on the internal VTEP to leverage both border leaf VTEPs
[Figure: VXLAN overlay with EVPN MP-BGP; the border spines are VTEPs and route reflectors (RR) and connect through IP routing (routing protocol of choice) to external devices; leaf VTEPs run the anycast gateway]
Programmable Fabric Multi-X Connectivity (DCI)
VXLAN Multi-Site (2017+)
• Multiple Fabrics with Integrated DCI (DCI2)
• Hierarchical design at both overlay and underlay: Better Scale and Failure Domain Isolation between Fabrics
Recommended DCI Architecture Going Forward!!!
[Figure: Fabric #1 (EVPN control-plane domain 1, data-plane domain 1) and Fabric #2 (EVPN control-plane domain 2, data-plane domain 2), each an overlay of VTEPs with bare-metal hosts, interconnected by a BGP EVPN control plane and a DCI data plane]
VXLAN Multi-Site
Main Use Cases
Scale-Up Model to Build a
Large Intra-DC Network
Network Extension across
Multiple Sites
Integration with Legacy Networks
(Coexistence and/or Migration)
VXLAN Multi-Site Border Gateways HA Design
• Possible BGW deployment models:
  • Anycast Border Gateways (supported since day 1) and recommended for interconnecting VXLAN EVPN fabrics
  • VPC Border Gateways (supported since 9.2(1))
• Border Gateways are used for Layer 2 and Layer 3 Site-to-Site communication (East-West traffic)
• Border Gateways are often also deployed as Border Leaf nodes for Site to External Layer 3 communication (North-South traffic)
[Figure: Site 1 with four Anycast Border Gateways above the VTEPs; Site 1 with two VPC Border Gateways above the VTEPs]
VXLAN Multi-Site VPC Border Gateways
DCI Use Cases
• Migration/Coexistence of a legacy site with one (or more) new VXLAN EVPN fabrics
• Replacing ‘legacy’ DCI solutions (vPC, OTV, VPLS, etc.)
[Figure: a legacy site with a BGW pair connected to a greenfield site of spines, VTEPs and BGWs; Legacy Site 1 and Legacy Site 2 interconnected through BGW pairs]
VXLAN Multi-Site
Failure Detection on Anycast BGWs – Fabric Isolation
• The Site-Internal interfaces on BGW nodes are constantly tracked to determine their status (‘evpn multisite fabric-tracking’ command)
• If all the Site-Internal interfaces are detected as down:
  1. The isolated BGW stops advertising PIP/VIP addresses toward the Site-External network
  2. The remaining BGWs perform new DF elections for the L2VNIs owned by the isolated BGW
• As a result, the BGW becomes isolated from both the Site-Internal and Site-External networks
• Seamless BGW node reinsertion using a “delay-restore” timer for the VIP address
[Figure: Site 1 with spines, VTEPs and four BGWs between the Site-Internal and Site-External networks. Multi-Site VIP 10.111.111.1; PIP-BGW2 10.200.200.22, PIP-BGW3 10.200.200.23, PIP-BGW4 10.200.200.24]
VXLAN Multi-Site
Failure Detection on Anycast BGWs – DCI Isolation
• The Site-External interfaces on BGW nodes are also tracked to determine their status (‘evpn multisite dci-tracking’ command)
• If all the Site-External interfaces are detected as down, the isolated BGW node:
  1. Stops advertising VIP VTEP address toward the Site-Internal network
  2. Withdraws BGP EVPN Type-4 advertisements (triggering a new DF election between other BGWs)
  3. Starts functioning as a regular VTEP (PIP still up)
• As a result, the BGW continues to operate as a Site-Internal VTEP
• Seamless BGW node reinsertion using a “delay-restore” timer for the VIP address
[Figure: Site 1 with four BGWs and VTEPs between the Site-Internal network and the Site-External (Layer-3 Unicast) DC Core. Multi-Site VIP 10.111.111.1; PIP-BGW1 10.200.200.21, PIP-BGW2 10.200.200.22, PIP-BGW3 10.200.200.23, PIP-BGW4 10.200.200.24]
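The two isolation cases (fabric isolation and DCI isolation) can be summarized as a small decision function. The dictionary keys and role names below are illustrative labels rather than NX-OS terms. The inputs are the counts of tracked Site-Internal and Site-External interfaces that are up.

```python
def bgw_behavior(site_internal_up, site_external_up):
    """Describe how an Anycast BGW behaves for a given interface status."""
    if site_internal_up == 0:
        # Fabric isolation: stop advertising PIP/VIP toward Site-External;
        # the remaining BGWs re-run the DF election.
        return {"role": "isolated", "advertises_pip": False,
                "advertises_vip": False, "df_candidate": False}
    if site_external_up == 0:
        # DCI isolation: withdraw the VIP and Type-4 routes,
        # keep the PIP and act as a regular VTEP inside the site.
        return {"role": "regular VTEP", "advertises_pip": True,
                "advertises_vip": False, "df_candidate": False}
    return {"role": "border gateway", "advertises_pip": True,
            "advertises_vip": True, "df_candidate": True}

print(bgw_behavior(site_internal_up=0, site_external_up=2)["role"])
print(bgw_behavior(site_internal_up=2, site_external_up=0)["role"])
print(bgw_behavior(site_internal_up=2, site_external_up=2)["role"])
```

In both failure cases the node drops out of the designated-forwarder election, which is what lets the other BGWs take over its L2VNIs.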
Graceful Insertion and Removal Example
VXLAN BGP EVPN Leaf vPC VTEP Isolation with GIR
• Use automatic profile to go into GIR.
//Enter maintenance mode using the system mode maintenance command:
switch# configure terminal
switch(config)# system mode maintenance
Following configuration will be applied:
ip pim isolate
router bgp 1
  isolate
router ospf UNDERLAY
  isolate
vpc domain 1000
  shutdown
Do you want to continue (yes/no)? [no] y
System mode operation completed successfully
switch(config)# end
[Figure: VXLAN BGP EVPN fabric with switches S10 and S20 and hosts Host1 and Host2]
Graceful Insertion and Removal Example
VXLAN BGP EVPN Leaf vPC VTEP Isolation with GIR
• Use automatic profile to come out of GIR.
//Exit maintenance mode using the no system mode maintenance command:
switch# configure terminal
switch(config)# no system mode maintenance
Following configuration will be applied:
vpc domain 1000
  no shutdown
router ospf UNDERLAY
  no isolate
router bgp 1
  no isolate
no ip pim isolate
Do you want to continue (yes/no)? [no] y
System mode operation completed successfully
switch(config)# end
[Figure: VXLAN BGP EVPN fabric with switches S10 and S20 and hosts Host1 and Host2]
Graceful Insertion and Removal Example
VXLAN BGP EVPN Spine RR Isolation with GIR
• Use automatic profile to go into GIR.
//Enter maintenance mode using the system mode maintenance command:
switch# configure terminal
switch(config)# system mode maintenance
Following configuration will be applied:
router bgp 100
  isolate
router ospf 1
  isolate
Do you want to continue (yes/no)? [no] y
System mode operation completed successfully
switch(config)# end
[Figure: VXLAN BGP EVPN fabric with switches S10 and S20 and hosts Host1 and Host2]
Programmable Fabric HA Takeaways
• Programmable fabric HA design
  • Spine leaf L3 IP fabric with ECMP
  • VXLAN EVPN fabric with Anycast GW
  • vPC for host
• Multiple DC fabric HA design
  • VXLAN Multi-Site
• Follow configuration and operational best practices to minimize down time for different failure scenarios
Programmable Fabric Resources
For Your Reference
• VXLAN Network with MP-BGP EVPN Control Plane Design Guide
  • https://www.cisco.com/c/en/us/products/collateral/switches/nexus-9000-series-switches/guide-c07-734107.html
• VXLAN EVPN Multi-Site Design and Deployment White Paper
  • https://www.cisco.com/c/en/us/products/collateral/switches/nexus-9000-series-switches/white-paper-c11-739942.html#_Toc498025653
• BRKDCN-3378 Building DataCenter Networks with VXLAN BGP EVPN
• BRKDCN-2035 VXLAN BGP EVPN based Multi-Site
Agenda
• Enterprise Data Center High Availability (DC HA)
• DC Switch NX-OS HA Architecture and HA Features
• DC Network HA Design and Operational Best Practices
  • Legacy DC with vPC
  • Programmable Fabric
  • Application Centric Infrastructure (ACI)
  • Programmable Network
• Key Takeaways
Cisco Data Center Network Solutions
• Classic Ethernet & vPC
• Programmable Fabric
• Application Centric Infrastructure
• Programmable Network
[Figure: Web, App and DB application tiers]
• VXLAN-based
• Forwarding, Multi-Tenancy & Security
• Integrated Controller with Enhanced APIs
ACI Fabric Underlay HA Design
• Zero touch provisioning
• Structured design
• Layer 3 IP fabric with point-to-point link:
  • Better stability, faster convergence
  • Redundant links with ECMP
• Scale out spine leaf design
  • Better scalability and availability
[Figure: ACI fabric managed by the Application Policy Infrastructure Controller. SVI IP address: MAC 0000.1111.2222, IP 10.1.1.1]
ACI Fabric Overlay HA Design
• eVXLAN EVPN based overlay
• Same “Anycast” SVI IP/MAC is enabled at all VTEPs/ToRs
  • Better availability, no IP gateway relearning
  • Optimal traffic forwarding, no hairpinning to GW
  • Enable host mobility
[Figure: overlay on the ACI fabric managed by the Application Policy Infrastructure Controller. SVI IP address: MAC 0000.1111.2222, IP 10.1.1.1]
ACI Fabric Host vPC HA Design
• Host vPC to ACI leaf nodes
• Host vPC to FEX
[Figure: ACI fabric with ACI spine nodes and ACI leaf nodes; Host2]
vPC in ACI Fabric
• Differences between ACI vPC and standard vPC
  • No Peer Link is required
  • Peer communication/path recovery happens via the Fabric
  • CFS (Cisco Fabric Services) is replaced by IFS (ACI Fabric Services), which is based on Zero Message Queue (ZMQ)
  • Forwarding selection (which peer will forward a frame)
  • Within the Fabric, the vPC interfaces use an anycast VTEP which is active on both vPC peers
[Figure: two vPC peer VTEPs sharing a vPC anycast VTEP and communicating through ACI Fabric Services (ZMQ), with a dual-attached host or switch]
ACI Port Tracking Policy for Uplink Failure Detection
• The port tracking policy specifies:
  • Number of uplink connections that trigger the policy
  • A delay timer for bringing the leaf switch access ports back up
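The two policy parameters can be sketched as a single decision function. This is an illustration only: the function name, the "less than or equal" comparison and the example values are assumptions, not the APIC implementation.

```python
def access_ports_enabled(active_uplinks, trigger_count,
                         seconds_since_uplinks_recovered, restore_delay_s):
    """Return True when the leaf access ports should be up."""
    if active_uplinks <= trigger_count:
        # Too few fabric uplinks remain: bring the access ports down so that
        # dual-homed hosts fail over to another leaf.
        return False
    # Uplinks are back: wait out the delay timer before restoring the ports.
    return seconds_since_uplinks_recovered >= restore_delay_s


# Hypothetical policy: trigger when 0 uplinks remain, restore after 120 s.
print(access_ports_enabled(0, 0, 0, 120))     # all uplinks lost
print(access_ports_enabled(2, 0, 30, 120))    # recovered, still waiting
print(access_ports_enabled(2, 0, 120, 120))   # delay timer expired
```

The delay timer gives the leaf time to rebuild its fabric adjacencies before hosts start sending traffic to it again.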
ACI Fabric Convergence Improvement
• Convergence improved from sub-second to 200 ms in ACI 3.1
  • With the new Cloudscale ASIC N9Ks
• Failure scenarios with convergence improvement:
  • Fabric (between leaf and spine): link failure, Spine reload/upgrade, Spine linecard reload, Leaf reload/upgrade, power failure of Spine
  • Access link/node with vPC or port channel
  • External (Border Leaf) connectivity (L3out): link failure, Border Leaf reload/upgrade
• Achieved by special ASIC capability and software design
ACI Fabric Convergence Improvement
• Failure scenarios not covered:
  • Double failure, L2/L3 multicast, copper links, process crashes on Leaf/Spine/Border Leaf, etc.
• Convergence for traffic from an endpoint (EP) to the ACI fabric depends on how fast the EP can divert traffic to an ACI Leaf
• Convergence for traffic from an external node to the ACI fabric depends on how fast the external node can divert traffic to ACI
Fabric Fast Convergence - Enable LBX
It’s a per-leaf configuration.
Fabric ERSPAN can’t be enabled per uplink port while this feature is enabled.
Access Link with vPC or PortChannel Fast Convergence - Debounce Policy Configuration
Under Fabric Access Policy, reduce the debounce timer from the default 100 ms to 10 ms for faster convergence
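A rough sketch of what a link debounce timer does, assuming a simplified model where the signal is sampled once per millisecond. The function and the sample data are hypothetical; only the 100 ms and 10 ms values come from the slide.

```python
def link_down_reported_at(samples, debounce_ms, sample_interval_ms=1):
    """Return the time (ms) at which link-down is declared, or None.

    samples is a sequence of booleans, True meaning signal present.
    Link-down is declared only after the signal has been absent for a
    continuous debounce_ms.
    """
    down_for = 0
    for index, signal_present in enumerate(samples):
        if signal_present:
            down_for = 0
            continue
        down_for += sample_interval_ms
        if down_for >= debounce_ms:
            return (index + 1) * sample_interval_ms
    return None


# Signal is lost for good after 5 ms.
hard_failure = [True] * 5 + [False] * 200
print(link_down_reported_at(hard_failure, debounce_ms=100))
print(link_down_reported_at(hard_failure, debounce_ms=10))

# A 5 ms glitch is shorter than either timer and is never reported.
glitch = [True] * 5 + [False] * 5 + [True] * 50
print(link_down_reported_at(glitch, debounce_ms=10))
```

The trade-off is visible in the model: a shorter timer starts reconvergence sooner, but also filters out fewer short signal glitches.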
ACI Fabric Fast Convergence Best Practices
• Always use vPC
• Distribute scale
  • 100 L3out per Leaf
  • 50 BD per Leaf
• Use static EPG instead of L2out
ACI Fabric Maintenance Mode
• New decommission option from ACI 3.0(1k)
  • Helps isolate the switch from the ACI fabric while keeping management access to the switch
• Prior to ACI 3.0(1k): decommission options are Regular or Remove from controller
  • The switch reboots and all of its configuration is wiped
• ACI 3.0(1k): Maintenance Mode (Debug mode) or decommission
  • In Maintenance mode, the switch is not in the active forwarding path
  • It can be accessed via the management port, and logs can be collected for debugging
Spine Node Maintenance Mode Decommission
• IS-IS on the spine advertises routes with max metric
• OSPF, EIGRP and BGP do graceful shutdown on the IPN/GOLF link
[Figure: spine connected to IPN and GOLF. IPN ports are still up but the OSPF neighbor is down. IS-IS sets max metric. Traffic goes through different paths.]
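The effect of advertising max metric can be sketched with a toy path selection. The metric values and spine names are placeholders, not real IS-IS values; the sketch only shows that a node advertising a very high metric drops out of the preferred set while remaining reachable.

```python
# Placeholder for "max metric"; not the actual IS-IS constant.
ILLUSTRATIVE_MAX_METRIC = 10**6


def preferred_spines(metrics):
    """Return the spines with the lowest advertised metric."""
    best = min(metrics.values())
    return sorted(name for name, metric in metrics.items() if metric == best)


# Hypothetical fabric with three equal-cost spines.
metrics = {"spine1": 10, "spine2": 10, "spine3": 10}
normal = preferred_spines(metrics)

# spine2 enters maintenance mode and advertises max metric.
metrics["spine2"] = ILLUSTRATIVE_MAX_METRIC
during_maintenance = preferred_spines(metrics)

print(normal)
print(during_maintenance)
```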
Spine Node Insertion Recommission
• Spine switch reboots and all of its configuration is wiped
• After the switch comes up and is discovered by APIC, the policy is programmed on the switch
• After the switch configuration is done, the switch establishes IS-IS, OSPF and BGP peers. The switch is then in the active forwarding path. Max metric is set for 10 minutes during startup, so internal traffic through the switch is less preferred for 10 minutes.
Leaf Node Maintenance Mode Decommission
• IS-IS on the Leaf node advertises routes with max metric
• OSPF, EIGRP and BGP do graceful shutdown
• vPC shuts down Keep-Alive & Peer Link
• Shut down all front panel ports and directly connected IFC ports (cuts the laser on the port)
[Figure: leaf sets max metric so traffic goes through different paths; graceful shutdown for L3out; vPC shutdown; front panel ports shut down]
Leaf Node Insertion Recommission
• The switch reboots and all of its configuration is wiped
• After the switch comes up and is discovered by APIC, the policy is programmed on the switch
• After the switch configuration is done, the switch establishes IS-IS, OSPF and BGP peers. The switch is then in the active forwarding path. Max metric is set for 10 minutes during startup, so internal traffic through the switch is less preferred for 10 minutes.
• There is a 2-minute delay before the vPC ports are brought up.
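The two timers on this slide can be put on one timeline. The sketch assumes both timers start at the same instant (when switch configuration is done), which the slide does not state; the function and field names are hypothetical.

```python
MAX_METRIC_HOLD_S = 10 * 60  # max metric is set for 10 minutes (from the slide)
VPC_PORT_DELAY_S = 2 * 60    # vPC ports come up after 2 minutes (from the slide)


def leaf_state(seconds_since_config_done):
    """Describe a recommissioned leaf at a given time after configuration.

    Assumption: both timers are counted from the same starting point.
    """
    return {
        "max_metric_advertised": seconds_since_config_done < MAX_METRIC_HOLD_S,
        "vpc_ports_up": seconds_since_config_done >= VPC_PORT_DELAY_S,
    }


for t in (60, 180, 600):
    print(t, leaf_state(t))
```

Under that assumption there is a window in which the vPC ports are already up while the leaf is still a less-preferred path inside the fabric.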
ACI Multi-Site
ACI 3.0 Release
[Figure: Site 1 (Availability Zone ‘A’) and Site 2 (Availability Zone ‘B’) connected over an Inter-Site Network with VXLAN and MP-BGP - EVPN; Multi-Site Orchestrator with REST API and GUI]
• Separate ACI Fabrics with independent APIC clusters
• MP-BGP EVPN control plane between sites
• No latency limitation between Fabrics
• ACI Multi-Site Orchestrator pushes cross-fabric configuration to multiple APIC clusters, providing scoping of all configuration changes
• Data Plane VXLAN encapsulation across sites
• End-to-end policy definition and enforcement
ACI Multi-Site
Main Use Cases
Scale-Up Model to Build a Large
Intra-DC Network
Data Center Interconnect (DCI)
ACI Policy Upgrade
• Ability to upgrade all switches and controllers in the fabric from one place, with a single click
• Requires the upload of the new controller and switch image
• Then, create a firmware group
• Finally, create maintenance groups as needed to define which switches get upgraded at what time
• Controllers are upgraded through a different “Controller Firmware” policy
• Controllers are kicked off at the same time (sort of like a single maintenance group) and upgrade sequentially
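One possible way to plan maintenance groups is sketched below. The odd/even split and the node IDs are assumptions for illustration, not something the slide prescribes; the idea is that redundant pairs numbered consecutively (for example 101/102) end up in different groups and are never upgraded together.

```python
def split_into_maintenance_groups(node_ids):
    """Split switch node IDs into an odd and an even maintenance group."""
    groups = {"odd": [], "even": []}
    for node_id in sorted(node_ids):
        groups["odd" if node_id % 2 else "even"].append(node_id)
    return groups


# Hypothetical node IDs: leaves 101-104 and spines 201-202.
plan = split_into_maintenance_groups([101, 102, 103, 104, 201, 202])
print(plan)
```

Each group would then be scheduled in its own window, so one member of every pair stays in the forwarding path during the upgrade.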
ACI Maintenance Group Logic
ACI HA Key Takeaways
• ACI is a turnkey solution for Data Center fabric with built-in HA and full automation
• ACI integrates all the best practices and lessons we learned from previous technologies
ACI Fabric Resources
For Your Reference
• Cisco ACI Multi-Site Architecture White Paper
  • https://www.cisco.com/c/en/us/solutions/collateral/data-center-virtualization/application-centric-infrastructure/white-paper-c11-739609.html
• BRKACI-2125 ACI Multi-Site Architecture and Deployment
• BRKACI-3101 ACI Under the Hood - How Your Configuration is Deployed
Agenda
• Enterprise Data Center High Availability (DC HA)
• DC Switch NX-OS HA Architecture and HA Features
• DC Network HA Design and Operational Best Practices
  • Legacy DC with vPC
  • Programmable Fabric
  • Application Centric Infrastructure (ACI)
  • Programmable Network
• Key Takeaways
Cisco Data Center Network Solutions
Classic Ethernet & VPC
Programmable Fabric
Application Centric Infrastructure
Programmable Network
[Figure: DB, Web and App application tiers]
• Open NX-OS
• Enhanced APIs and Automation Ecosystem (DevOps)
Nexus Device Programmability
• Power on Auto Provisioning (PoAP)
• On-box Python Scripting
• NX-OS Software Development Kit (SDK)
• Configuration Management Tools
Cisco Nexus Power on Auto Provisioning (PoAP)
[Figure: Nexus switch, default gateway, DHCP server, script server, and license, configuration and software server]
1 Power-up phase: the Nexus switch starts the Power On Auto-Provisioning process
2 DHCP Discover phase: get IP address, gateway, script server and script file from the DHCP server
3 Download the script file onto the switch and execute the script
4 Download configuration, license and software images onto the switch
5 Reboot if needed. The switch is up and running with the downloaded image and config
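The PoAP phases on this slide can be summarized as a small planning function. This is a reading aid, not the real `poap.py`: the field names, addresses and the script file name in the example offer are hypothetical.

```python
# Information the slide says the DHCP phase provides (field names are assumed).
REQUIRED_DHCP_FIELDS = ("ip_address", "gateway", "script_server", "script_file")


def poap_plan(dhcp_offer, needs_reboot):
    """Return the ordered PoAP steps for a switch, given its DHCP offer."""
    missing = [f for f in REQUIRED_DHCP_FIELDS if f not in dhcp_offer]
    if missing:
        raise ValueError("DHCP offer is missing: " + ", ".join(missing))
    steps = [
        "power up and start PoAP",
        "DHCP discover: got %(ip_address)s via %(gateway)s" % dhcp_offer,
        "download and execute %(script_file)s from %(script_server)s" % dhcp_offer,
        "download configuration, license and software images",
    ]
    if needs_reboot:
        steps.append("reboot")
    steps.append("switch up and running with downloaded image and config")
    return steps


# Hypothetical DHCP offer using documentation addresses.
offer = {
    "ip_address": "192.0.2.50",
    "gateway": "192.0.2.1",
    "script_server": "192.0.2.10",
    "script_file": "poap.py",
}
for step in poap_plan(offer, needs_reboot=True):
    print(step)
```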
Deploy and Manage POAP Using DCNM
Deploy using POAP Script
• Download the POAP script from GitHub:
  • https://github.com/datacenter/nexus9000/blob/master/nx-os/poap/poap.py
Nexus 9000 Programmability
On-box Python
• Python scripts can be run in interactive or non-interactive mode
• Store scripts in the bootflash:scripts directory of the switch
Interactive Mode
switch# python
Python 2.7.5 (default, Nov 5 2016, 04:39:52)
>>> cli("conf ; interface loopback 1")
''
>>> clip('where detail')
mode:
username: admin
vdc: TSI-N9508-stand-alone
routing-context vrf: default
Non-Interactive (script) Mode
switch# dir bootflash:scripts
946 Oct 30 14:50:36 2013 crc.py
7009 Sep 19 10:38:39 2013 myScript.py
22760 Oct 31 02:51:41 2012 poap.py
switch# python bootflash:/scripts/crc.py
Or: switch# source crc.py
------------------------------------------
Started running CRC checker script
finished running CRC checker script
------------------------------------------
Python Use Cases
“Off-Box” Python
[Figure: Python on a Linux server reaching NX-OS on the NX-OS device over SSH/NETCONF]
Scripts executed externally from the switch:
• configuration management automation
• telemetry / operational data
• controller use cases including APIC, POAP
“On-Box” Python
[Figure: Python running on NX-OS on the NX-OS device]
Scripts executed locally on the switch:
• provisioning automation
• automating Embedded Event Manager
• application development
Auto Back-up Use Case
“On-Box” Python and EEM
• EEM triggers the on-box Python script on the Nexus 93xx
• The Python script creates a back-up file and sends it to a TFTP server
Cisco Nexus 9000 Python SDK User Guide:
https://developer.cisco.com/docs/nx-os/#cisco-nexus-9000-series-python-sdk-user-guide-and-api-reference
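A minimal sketch of what such a back-up script might look like, not the script used in the session. The file naming, the `vrf management` default and the server address are assumptions. The command runner is passed in as a parameter so the logic can be tried off-box; on a switch it would be the `cli` function used in the interactive example earlier.

```python
import time


def backup_command(hostname, tftp_server, vrf="management", now=None):
    """Build a copy command that sends the running config to a TFTP server."""
    stamp = time.strftime("%Y%m%d-%H%M%S", time.gmtime(now))
    return "copy running-config tftp://%s/%s-%s.cfg vrf %s" % (
        tftp_server, hostname, stamp, vrf)


def run_backup(run_cli, hostname, tftp_server):
    """Execute the back-up using any callable that runs one CLI command."""
    return run_cli(backup_command(hostname, tftp_server))


# Off-box dry run: collect the command instead of executing it.
sent = []
run_backup(sent.append, "n9k-1", "192.0.2.10")
print(sent[0])
```

An EEM applet on the switch would then invoke the stored script whenever the chosen event fires, for example after a configuration change.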
NX-OS SDK (Software Development Kit)
• NX-OS SDK enables on-box custom applications to access NX-OS native functionality
[Figure: on the Nexus 9K, custom applications (Python, C++, etc.) and existing 3rd-party Linux applications run in the Linux native shell or guest shell, alongside the Linux networking stack and NX-OS (CLI, L2, L3, interfaces, platform, etc.)]
Nexus Programmability
Configuration Management Tools
NX-OS Programmability Resources
For Your Reference
• BRKACI-2025 Maximizing Network Programmability and Automation with Open NX-OS
Data Center HA
Key Takeaways
High Availability Enterprise Data Center Design
Key Principles
• Follow HA design and operational best practices to minimize network downtime
Maren Kostede, Technical Solutions Architect
Dana Daum, Communications Architect
Junmei Zhang, Technical Marketing Eng.
Samer Theodossy, Principal Engineer
High Availability World Coverage
Reconvergence
Effect on “Mission-Critical”, Real-Time Operations
• First step on the Moon – July 20, 1969 … how it really happened …
“OK, I’m going to step off the LEM now”
(LEM = Lunar Excursion Module = the Lunar Lander)
“That’s one small step for man”
“One giant leap for mankind”
Reconvergence
Effect on “Mission-Critical”, Real-Time Operations
• And how it would have looked with … standard HSRP timers …
Reconvergence
Effect on “Mission-Critical”, Real-Time Operations
• And how it would have looked with … 3-second reconvergence …
Reconvergence
Effect on “Mission-Critical”, Real-Time Operations
• And how it would have looked with … 500-msec reconvergence …
Tuning Your Network Design and Reconvergence Can Be a “Giant Leap” for Your Network – and Your Application – Availability!
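To put the three scenarios in numbers, the sketch below estimates how many real-time packets a single stream would lose during each outage. It assumes a 20 ms packet interval (a common voice packetization value, not stated on the slides) and takes "standard HSRP timers" to mean the default 10-second hold time.

```python
PACKET_INTERVAL_MS = 20  # assumption: one packet every 20 ms per stream


def packets_lost(outage_ms, interval_ms=PACKET_INTERVAL_MS):
    """Estimate packets lost by one stream during an outage."""
    return outage_ms // interval_ms


scenarios = [
    ("default HSRP hold time (10 s)", 10000),
    ("3-second reconvergence", 3000),
    ("500-msec reconvergence", 500),
]
for label, outage_ms in scenarios:
    print("%-32s %4d packets lost" % (label, packets_lost(outage_ms)))
```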
Published design guides
www.cisco.com/go/cvd
Thank you