High Availability

HFR - TAG
High Availability
Ravi Narayanan
(ravin@cisco.com)
February 2002
Cisco HFR
GOAL - High Availability
Goal: Non-Stop Availability
Five 9s (99.999%) or Greater Availability
What customers require:
Quick Recovery from defects,
High MTBF, Low MTTR/DPM,
Built in Redundancy
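As a rough illustration of what the five-nines goal and the MTBF/MTTR/DPM terms imply numerically, the sketch below uses the standard steady-state formula availability = MTBF / (MTBF + MTTR). It reads DPM as the unavailable fraction expressed per million, which is one common interpretation and an assumption here; the function names are illustrative only.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(avail):
    """Expected minutes of downtime per year at a given availability."""
    return (1.0 - avail) * MINUTES_PER_YEAR


def dpm(avail):
    """Unavailable fraction scaled to one million (assumed reading of DPM)."""
    return (1.0 - avail) * 1_000_000


five_nines = 0.99999
print(round(downtime_minutes_per_year(five_nines), 2))  # 5.26
print(round(dpm(five_nines)))                           # 10
```

Under this formula, a unit with a 100,000 hour MTBF stays above five nines only if the mean time to repair is about one hour or less, which is why quick recovery appears alongside high MTBF.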
© 2002, Cisco Systems, Inc.
www.cisco.com
Cisco HFR
A Five Nines Capable Router
• Architecture
– Hardware
– Software
• Development Process
• Test Process
• Accounting, Logging & Alarms
• Conclusion
Hardware Architecture
• Apply Prior Experience
• No Single Points of Failure
• Hardware Non Stop Forwarding (NSF)
• Automated Fault Injection
• Verify Architecture with Modeling
Apply Prior Experience
• ATM Switch Products
– Large Customer Frame Relay Network
– Many Years Measuring Availability
• GSR
– Now resets at RP/LC level (HFR provides
finer granularity at component level)
– Routing NSF Developments Started
No Single Points of Failure
• Redundancy
– Active Standby
* (D) RP, SC
– Loadsharing
* Fabric, Power, Cooling, Management Interconnect (out-of-band Ethernet, 1:1)
– Port Protection (Linecards/PLIMs)
• No outage on Upgrade of Fabric
• Graceful Degradation of Fabric
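To show why removing single points of failure matters, here is a minimal sketch of the textbook series/parallel availability model. It assumes independent failures and instantaneous switchover, neither of which the slides claim; the per-card figure of 0.999 is a hypothetical value, not an HFR measurement.

```python
def parallel_availability(unit_avail, n):
    """Availability of n redundant units when any one working unit suffices.

    Assumes failures are independent.
    """
    return 1.0 - (1.0 - unit_avail) ** n


def series_availability(avails):
    """Availability of a chain in which every element must be working."""
    result = 1.0
    for a in avails:
        result *= a
    return result


single_rp = 0.999  # hypothetical availability of one card
rp_pair = parallel_availability(single_rp, 2)
print(f"{rp_pair:.6f}")  # 0.999999
```

In this model, a 1:1 pair turns a three-nines card into a six-nines function, while any non-redundant element placed in series caps the whole chain at its own availability.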
System Control Network
[Figure: system control network. RP, LC, and SC cards in the LC chassis and the fabric chassis (S2 fabric stage) connect over FE and GE links to two Gig Ethernet switches; an optional 10G link is shown.]
Graceful Degradation
[Figure: OC192 line cards connected through a three-stage switch fabric (S1, S2, S3) built from 8 planes, labeled 1 of 8 through 8 of 8.]
Hardware Non Stop Forwarding
• Reset Strategy
– Entire Board
– Individual Components on a Board
– CAM (HW forwarding database) Not reset
unless desired
• Forwarding Strategy
– Metro - 176 PPEs forwarding using CAM
LC NSF Strategy
[Figure: line card forwarding path. A DISTRIB stage feeds PPE0 through PPE175 and a MUX collects their output; TCAM, PLU, TLU, and STATS blocks are shown alongside the PPEs.]
Automated Fault Injection
• Designed into Hardware ASICs up front
• Makes testing easier and more complete
• Off the shelf parts must have mechanism for injection
• System Test and Reliability tests use automated fault insertion
testing mechanisms
• Fault insertion testing at all stages
– Bring up, Design Verification, component test, system test
• Ability to test multiple failure scenarios - in hardware &
software
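The idea of a built-in injection mechanism can be illustrated in software with a small harness. This is only a sketch: `FaultInjector`, the `memory_read` injection point, and `read_with_retry` are hypothetical names invented for this example and do not describe the HFR ASIC or test tooling.

```python
class FaultInjector:
    """Hypothetical hook that raises a chosen fault at a named injection point."""

    def __init__(self):
        self._armed = {}

    def arm(self, point, exc):
        """Arm a one-shot fault for the given injection point."""
        self._armed[point] = exc

    def check(self, point):
        """Called by the component under test; raises if a fault is armed."""
        exc = self._armed.pop(point, None)
        if exc is not None:
            raise exc


def read_with_retry(injector, retries=2):
    """Component under test: retries a read that may fail."""
    for attempt in range(retries + 1):
        try:
            injector.check("memory_read")
            return "ok", attempt
        except IOError:
            continue
    return "failed", retries


inj = FaultInjector()
inj.arm("memory_read", IOError("injected parity error"))
print(read_with_retry(inj))  # ('ok', 1)
```

Because the fault is armed programmatically, the same recovery path can be exercised at every test stage without physically disturbing the hardware.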
Verify Architecture With Modeling
• Early modeling influenced architecture
– Memory soft error rates -> ECC
– Optical error rates -> FEC (Reed-Solomon)
– Board-level MTBF >= 100,000 hours is a Cisco requirement
• Parts count model
– Telcordia TR-332 standards, close vendor interaction
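A parts count model can be sketched as follows: per-part failure rates in FITs (failures per 10^9 device-hours) are summed in a series model and inverted to get board MTBF. The bill of materials below is entirely hypothetical, and this simplified version omits the quality, stress, and temperature factors that the Telcordia procedure applies to each part.

```python
FIT_HOURS = 1e9  # 1 FIT = one failure per 10^9 device-hours


def board_mtbf_hours(parts):
    """Series parts-count model: failure rates add, MTBF is the inverse.

    parts: iterable of (quantity, fit_per_part) tuples.
    """
    total_fit = sum(qty * fit for qty, fit in parts)
    return FIT_HOURS / total_fit


# Hypothetical bill of materials: (quantity, FIT per part)
bom = [(4, 500.0), (20, 50.0), (200, 5.0), (1, 2000.0)]

mtbf = board_mtbf_hours(bom)
print(round(mtbf))      # 166667
print(mtbf >= 100_000)  # True: meets the 100,000 hour requirement
```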
Software Architecture
• Protected Memory Microkernel
• Separation of Control and Data Plane
• Software Non Stop Forwarding
• Scalable Distributed System
• Health Monitoring
• No Outage on upgrades - Packaging and
Release Strategy
Protected Memory Microkernel
• Every Process Has a Private Address Space - contains faults
• Enables Process Restartability
• Enables Board Failover
• Enables Hitless Software Upgrade
1:1 Card Redundancy
[Figure: Card 1 (Active Logical Slot 1) runs the "Active" Processes A, B, and C; each checkpoints to its "Standby" counterpart on Card 2 (Standby Logical Slot 1).]
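The active/standby checkpointing relationship in 1:1 card redundancy can be sketched as below. The class and method names are illustrative assumptions; the real checkpoint mechanism and its API are not described in these slides.

```python
import copy


class Process:
    """One instance of a process, holding its recoverable state."""

    def __init__(self, name):
        self.name = name
        self.state = {}
        self.role = "standby"


class RedundantPair:
    """An active process on one card mirrored to a standby on another."""

    def __init__(self, name):
        self.active = Process(name)
        self.active.role = "active"
        self.standby = Process(name)

    def update(self, key, value):
        """Change state on the active, then checkpoint it to the standby."""
        self.active.state[key] = value
        self.checkpoint()

    def checkpoint(self):
        self.standby.state = copy.deepcopy(self.active.state)

    def switchover(self):
        """Promote the standby; a fresh, empty standby replaces the old active."""
        self.active, self.standby = self.standby, Process(self.active.name)
        self.active.role = "active"
        return self.active


pair = RedundantPair("Process A")
pair.update("route_count", 42)
new_active = pair.switchover()
print(new_active.role, new_active.state)  # active {'route_count': 42}
```

The promoted process resumes from the last checkpoint rather than rebuilding state from scratch, which is what keeps the switchover short.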
Active / Standby Switchover
[Figure: switchover sequence, steps 1 to 14, between Card 1 and Card 2. Components shown: LR Daemon on the active SC; RedCon and System Mgr on each card; QSM; Processes A, B (B' on the other card), and C.]
Separation of Control and Data Plane
• Redundancy in Control Plane
– All protocols support NSF over board failover
• Port Protection in Data Plane
– SONET APS
– Link Bundling
Traffic Switchover - APS
[Figure: APS switchover sequence, steps 1 to 6. The APS Manager on the DRP and the APS Processes on Line Card A and Line Card B are shown with the FIBs on the line cards and the switching fabric; traffic paths are drawn before and after the APS switch.]
Traffic Switchover - Bundled Link
[Figure: bundled-link failure sequence, steps 1 to 5. The Bundled IF Mgr on the DRP and the Link Monitors on the line cards are shown with the FIBs and the switching fabric; traffic paths are drawn before and after the link failure.]
Software Non Stop Forwarding
• Architected with HW NSF
• Process Restartability
• Separation of Control and Data Planes
• Protocol Support (BGP, ISIS, OSPF, Multicast,
MPLS), Support for HSRP, VRRP
BGP NSF
[Figure: BGP NSF components. On the RP: LPTS/TCP connections to peers, multiple BGP Speakers, BPM, bRIB, gRIB, and SysDB. GigE and the fabric connect the RP to the LC, where BCDL delivers incremental updates to the LC FIB and hardware forwarding.]
Non Stop Forwarding MPLS
• No impact on MPLS forwarding when one or
more MPLS processes fail.
• No impact on MPLS forwarding when the active card of an
active/standby pair fails.
• Hitless software upgrade.
MPLS - NSF in Action
• If the control plane fails, the forwarding plane can continue
to send traffic. Headless forwarding.
• Minimize the time forwarding remains headless.
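Headless forwarding, as described in the bullets above, can be sketched like this: the forwarding table keeps serving lookups while the control plane is down, and entries are reconciled once it returns. Marking entries stale and purging unrefreshed ones is a common graceful-restart pattern assumed here for illustration, not something these slides specify, and the exact-match lookup is a simplification of real longest-prefix matching.

```python
class ForwardingPlane:
    """Keeps forwarding from its FIB even when the control plane is gone."""

    def __init__(self):
        self.fib = {}  # prefix -> (next_hop, stale_flag)

    def install(self, prefix, next_hop):
        self.fib[prefix] = (next_hop, False)

    def control_plane_down(self):
        """Mark every entry stale but keep using it (headless forwarding)."""
        self.fib = {p: (nh, True) for p, (nh, _) in self.fib.items()}

    def control_plane_recovered(self, fresh_routes):
        """Refresh entries the control plane re-announces; purge the rest."""
        for prefix, next_hop in fresh_routes.items():
            self.install(prefix, next_hop)
        self.fib = {p: e for p, e in self.fib.items() if not e[1]}

    def forward(self, prefix):
        """Exact-match lookup (simplified); returns the next hop or None."""
        entry = self.fib.get(prefix)
        return entry[0] if entry else None


fp = ForwardingPlane()
fp.install("10.0.0.0/8", "A")
fp.install("192.168.0.0/16", "B")
fp.control_plane_down()
print(fp.forward("10.0.0.0/8"))      # A  (still forwarding while headless)
fp.control_plane_recovered({"10.0.0.0/8": "C"})
print(fp.forward("10.0.0.0/8"))      # C
print(fp.forward("192.168.0.0/16"))  # None (stale entry purged)
```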
[Figure: System Services and MPLS Control Services shown above IP Forwarding and MPLS Forwarding, which connect to the IP network.]
MPLS Architecture
DRP
• Application: MPLS-TE
– Recovery: from system services and checkpointing
• Label signaling: RSVP, LDP
– Recovery: from applications and neighbors
• Infra: Label manager
– Recovery: from signaling layer
LC
• MPLS Forwarding
– Recovery: from Label Manager
MPLS Fast Reroute
• Supports Node, Path, and Link Protections
• Controlled by the routers at ends of a failed link
– link protection is configured on a per link
basis
• Uses nested LSPs (stack of labels)
– original LSP nested within link protection LSP
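The nested-LSP idea behind link protection can be sketched with a label stack: when the protected link fails, the router at the head of the link pushes the backup LSP's label on top of the original stack, and the router where the backup rejoins pops it. The label values and function names are hypothetical, and real implementations may also rewrite the inner label; this is only a sketch of the nesting.

```python
def push(stack, label):
    """Return a new stack with label on top (index 0 is the top)."""
    return [label] + stack


def pop(stack):
    """Return a new stack without its top label."""
    return stack[1:]


def head_of_link(stack, link_up, backup_label):
    """Router upstream of the protected link: nest the LSP if the link is down."""
    return stack if link_up else push(stack, backup_label)


def rejoin_point(stack, backup_label):
    """Router where the backup LSP ends: strip the backup label if present."""
    if stack and stack[0] == backup_label:
        return pop(stack)
    return stack


original = [100]                                   # label of the protected LSP
rerouted = head_of_link(original, link_up=False, backup_label=900)
print(rerouted)                                    # [900, 100]
print(rejoin_point(rerouted, backup_label=900))    # [100]
```

Because only the two routers at the ends of the failed link act, the rest of the original LSP is unaware that a detour was taken.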
Scalable Distributed System
• Configuration and Operational Data
Distributed Across System
– Allows system to scale, Logical Routers
– Fault containment and recovery (SysDB, IM, SC, dSC, d(LRSC))
• Processing Distributed Across System
– Distributed RPs
– Enables faster convergence
Managing Configuration
[Figure: two SCs connected over GigE, with Fabric C.]
• Designated SC (dSC): an owner plane concept; verifies rack numbering among SCs
• Coordinates image management and versioning
• Coordinates LR membership information
• System elected: deterministic election through reboot
– Backup elected as well
• d(LRSC) extends a similar concept to Logical Router configuration in the LR plane.
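A deterministic election with a backup can be sketched as a pure function of the set of candidates, so every SC computes the same answer and the result is stable across reboots. The rule used here (lowest rack, then lowest slot, wins) is a hypothetical stand-in; the slides do not state the actual dSC election criteria.

```python
def elect_dsc(candidates):
    """Pick a designated SC and a backup from the SCs that are up.

    candidates: list of (rack, slot) tuples.
    Hypothetical rule: the lowest (rack, slot) wins, the next lowest is backup.
    """
    ranked = sorted(candidates)
    if not ranked:
        return None, None
    primary = ranked[0]
    backup = ranked[1] if len(ranked) > 1 else None
    return primary, backup


print(elect_dsc([(2, 0), (0, 1), (1, 0)]))  # ((0, 1), (1, 0))
print(elect_dsc([(1, 0), (2, 0), (0, 1)]))  # same result, any input order
```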
Managing Scaling/Distribution
[Figure: one Shared region and seven Local regions, one on each of the cards shown (LC, LC, DRP, RP, DRP, LC, LC).]
Process Distribution
[Figure: process placement within a Logical Router. Labels shown: LRd, LRconfig, Cisco pre-config (.startup files of placeable applications), placed, sysmgr on each RP and DRP, shared sysdb, applications A, B, and C distributed across RPs and DRPs in two racks, and standby replicated processes.]
Health Monitoring
• Online Diagnostics
– Minimizes double faults at switchover time
• Detect failures before they become critical
– Standby RP/DRP, Fabric plane
– Hot tested spare units
– Alarm cards, Logging & Alarm system (LED A/N display; minor, major, critical alarms)
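Detecting a failed standby unit before it is needed can be sketched with a heartbeat monitor: each unit reports periodically, and anything silent for longer than a timeout is flagged. The unit names and the 3 second timeout are hypothetical; the slides do not describe how the online diagnostics are implemented.

```python
class HealthMonitor:
    """Flags units whose last heartbeat is older than the timeout."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, unit, now):
        """Record that a unit reported healthy at time `now` (seconds)."""
        self.last_seen[unit] = now

    def failed_units(self, now):
        """Return the units that have been silent for longer than the timeout."""
        return sorted(
            unit for unit, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


mon = HealthMonitor(timeout=3.0)
mon.heartbeat("standby-RP", now=0.0)
mon.heartbeat("fabric-plane-2", now=0.0)
mon.heartbeat("standby-RP", now=4.0)
print(mon.failed_units(now=5.0))  # ['fabric-plane-2']
```

Passing the clock in explicitly keeps the sketch deterministic; a real monitor would read a monotonic timer instead.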
No Outage on Software Upgrades
• Packaging model
– Allows modular upgrade (sub package / package)
and software patches (SMU) to key components and
packages without affecting others.
• Software Release Strategy
– Takes into account upgrade timings and impacts on
system availability
– Progressive upgrade path defined, Compatibility
requirements taken into consideration.
Process restartability with NSF is the key enabler.
Development Process
• ISO compliant
• Mandatory design/code reviews
• API versioning controlled by tools
• Strictly enforced package boundaries (Tools)
• Continual automated measurement/improvement
• HA culture throughout program
Software Test Process
• Test Hierarchy (Waterfall model)
– Q Integrated Sanity System (QISS)
– Component and Feature Test
– Regression Test
– System Integration Test
– Early Field Trial (EFT) and Beta
• Test Operations
– Test Automation and Formal Script Review
– Central Reporting (online system - TIMS, Dashboard)
– Test Planning and Formal Review
Software Test Tools
Internal Tools
• REX – resource exhaustion
• CTF – component testing
• FIT – fault injection
• ATS – test scripting
• e-ARMS – test scheduler
• Pagent – packet generator
• RouteM – net emulation
• CFLOW – code coverage
• DDTS – defect tracking
• TIMS – test reporting
• Dashboard – test summary
3rd Party Tools
• IXIA – traffic generation & analyzer
• Agilent QA Robot – protocol conformance testing
• Agilent RouterTester – interface & protocol scalability
Test Activities
Test Validation
• Online Insertion Removal (OIR)
• Hitless Software Upgrade (HSU)
• Hot Standby Route Processor (HSRP)
• Fault Manager (FM)
• Process Deadlock Simulation
• SONET APS, DPT
• Interface Scalability
• Protocol Scalability
• Throughput
• SW/HW Fault Injection
• Process Restartability w/NSF
Test Measurements
• MTTR
• Up time/Longevity
• Boot time
• Latency
• APS Protection
• Security Audit
• Reliability & Availability
• Fault Detection Time
• Failover time
• Standard Conformance
• Process restart/resync
• Interop w. IOS/JunOS
ACCOUNTING & HA
• Netflow support
– Multiple / Distributed collectors
• Persistent storage of accounting data
– Across failovers
– Checkpointed continually
LOGGING & ALARM SYSTEM
• HA Attributes
– All bistate alarms checkpointed
– Alarms are sequenced and can be retrieved anytime
• Alarm Cards
– Alarm lights lit on failure conditions
– System wide storage of data
HFR - High Availability (Bird’s Eye View)
Result: Quick Recovery (low MTTR/DPM)
• Hitless Software/Hardware Upgrades
– Upgrade software/hardware while router is in service
• Non Stop Forwarding
– No line card reboot upon processor failover
– Forward user data during RP failover
– Process restartability/upgrade and NSF
• Logical redundancy/protection
– SONET APS, DPT, HSRP/VRRP, MPLS FRR, Layer 3 load balancing, link bundling
• Physical redundancy
– Dual processors, Power, Fabric, Cooling, OIR
Goal: Non-Stop Availability
Conclusion
• Target: 99.999% availability
• Availability modeling, availability design and
fault injection testing incorporated as part of the
development process
• Cisco uses HA analysis and modeling to identify
areas of improvement for future designs
• High availability (in some operational areas) will
need close cooperation with customers and the
required support process is being developed.
Backup Slides
Cisco’s HA Products
Cisco is certifying a variety of its products for HA
compliance.
• MSSBU: (PXM1, PXM45, AXSM)
• IP: GSR, ESR 10000 (W), DSL (Austin), Fermi, HFR
• Optical: Monterey
• Cisco’s IOS has been certified for 99.999% Availability
in many service provider environments
Cisco’s efforts for achieving High Availability are both
platform oriented and cross-platform oriented.
IOS HA Initiatives
• RPR: Partial initialization of IOS in standby RP
• RPR+: Improves standby readiness over RPR
(recognizes line cards and does not reset them on
switchover)
• Single Line Card Reload: Problems in one VIP do not
require an entire router reboot
• Fast reboot: Improves reboot time by 5 minutes
• Fast upgrade: Improves upgrade time by 5 minutes
by pre-loading software onto standby
• Stateful switchover: Instant switchover to standby RP
(includes non-stop forwarding routing protocol
changes)
• In-service upgrade: Software upgrade without user
impact
HFR System
• Fabric Shelves: contain fabric cards and system controllers
• Line Card Shelves: contain route processors, line cards, and system controllers
• EMS (full system view)
• Out-of-band GE control bus to all shelf controllers
[Figure: shelves, each with a shelf controller, joined by the control bus; a 100m distance is marked.]
Software Test Process
• Tools for HA Testing
– REX (Resource Exhaustion Tool), CTF (Component Test
Framework), measure how HFR HA features respond to different test
conditions simulated by these tools.
• Test Restartability with Fault Simulation
– memory failures, thread create failures, dependent process
failures, multiple related processes failures, recovery on check point
process failure, restartability under high CPU usage
• Test Hitless Software Upgrade
– Test under high resource/CPU utilization conditions
• Fault Manager Testing
– Check to see FM works properly under fault conditions
• MTTR Measurements
– Measure time to repair for most process/component failures
Specific Availability Requirements
Here is what I ask a BU to do (chronological, across the Arch, Design, Test, and Field phases):
• Create an availability model to gain perspective
• Reduce/remove single points of failure
• Design for over 100,000 hours MTBF
• Automate measurement of DPM
• Write online diagnostics on active and standby
• Write and execute network level availability test plan
Limit Headless Forwarding Time
• Check point data that cannot be recovered
otherwise
• Dedicate MPLS process resources to the
recovery of LSPs that are already established.
Processing of any new configured LSP tunnels
is temporarily suspended.
• Processing of new LSPs resumes when
recovery completes.
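The recovery policy in the bullets above can be sketched as a small scheduler: already-established LSPs are recovered first, requests for new LSPs are parked while recovery is in progress, and the parked requests are released once recovery completes. The class and LSP names are hypothetical and only illustrate the ordering.

```python
from collections import deque


class MplsRecovery:
    """Recovers established LSPs first; defers new LSPs until recovery is done."""

    def __init__(self, established):
        self.to_recover = deque(established)
        self.pending_new = deque()
        self.active = []

    def request_new_lsp(self, lsp):
        """New LSPs are suspended while any established LSP awaits recovery."""
        if self.to_recover:
            self.pending_new.append(lsp)
        else:
            self.active.append(lsp)

    def step(self):
        """Recover one established LSP; resume new LSPs when none remain."""
        if self.to_recover:
            self.active.append(self.to_recover.popleft())
        if not self.to_recover:
            while self.pending_new:
                self.active.append(self.pending_new.popleft())


r = MplsRecovery(["lsp1", "lsp2"])
r.request_new_lsp("new1")
r.step()
print(r.active)  # ['lsp1']
r.step()
print(r.active)  # ['lsp1', 'lsp2', 'new1']
```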
TIMING GOALS
• Boot from Flash / TFTP (~3 min)
• Total Single Rack Bring up time (~5min)
• OIR Recovery Time (~30 to 60 secs)
• Uptime = 14 days before ship
• BGP Aggregation Convergence ~ 60 sec
• BGP Backbone Convergence ~ 3 min
• OSPF Convergence ~ 25 secs
• IS-IS Convergence ~ 350 secs
Redundant Cards & Links
[Figure: the Line Card Chassis and the Fabric Chassis each hold DRP/SC0 and DRP/SC1 cards. SC0 GE links run to External GE Switch 0 and SC1 GE links to External GE Switch 1; inter-SC FE links connect the SC0 and SC1 cards.]
SC/DRP Combo Switchover
[Figure: switchover sequence, steps 1 to 11, between SC/DRP Combo 1 and SC/DRP Combo 2; each combo shows an LR Daemon and RedCon instances for SC1 and DRP1.]
SC/RP Upgrade (Initial Config)
[Figure: Card 1 (Active Logical Slot 1) runs the "Active" Processes A, B, and C and a checkpoint server; they checkpoint to the "Standby" Processes A, B, and C and the checkpoint server on Card 2 (Standby Logical Slot 1).]
HFR HA Roadmap
QFT-1 and QFT-2 (Target: GSR)
• All processes restartable
• Demonstrate limited HSU
• Restartability non-service-affecting to routing and forwarding plane apps
• NSF for ISIS, OSPF
• RP and DRP standby
• Multiple Verifier Support
• Checkpointing and mirroring
• Limited SC functionality and SC HA features
• RP and DRP standby and failover
• NSF support with upgrade of config data
• Support for checkpoint data with version differences between releases
QFT-3 (Target: HFR test hardware)
• Full functionality of SC, RP, DRP, SP and Fabric SC will be demonstrated with high availability and failover features
• Process redundancy mechanism across DRPs demonstrated
• All apps support HSU: forwarding, multicast, security and base
• Multiple LRs support and fault isolation between LRs
Beta/FOA (Target: HFR platform)
• All QFT-1 to QFT-3 goals met
• Meet product requirements in HA PRD
• Minimum .9999 standalone availability and .99999 network availability
• FCS/Post-FCS: HA support and assurance programs, HA test support framework implementation
• Software downgrade to at least 1 previous level