OSPF Monitor: Architecture, Design and Deployment Experience
Aman Shaikh, Albert Greenberg
AT&T Labs - Research
NSDI 2004

OSPF Monitor - NSDI 2004

Objectives for OSPF Monitor
• Real-time analysis of OSPF behavior
  – Trouble-shooting, alerting, validation of maintenance
  – Real-time snapshots of OSPF network topology
• Off-line analysis
  – Post-mortem analysis of recurring problems
  – Generate statistics and reports about network performance
  – Identify anomaly signatures
  – Facilitate tuning of configurable parameters
  – Improve maintenance procedures
  – Analyze OSPF behavior in commercial networks

OSPF Monitor in a Nutshell
• Collect OSPF LSAs (Link State Advertisements) passively from network
  – Every router describes its local connectivity in an LSA
  – Router originates an LSA due to...
    • Change in network topology
    • Periodic soft-state refresh
  – LSA is flooded to other routers in the domain
    • Flooding is reliable and hop-by-hop
    • Flooding leads to duplicate copies of LSAs being received
  – Every router stores LSAs (self-originated + received) in link-state database (= topology graph)
• Real-time analysis of LSA streams
• Archive LSAs for off-line analysis

Components
• Data collection: LSA Reflector (LSAR)
  – Passively collects OSPF LSAs from network
  – “Reflects” streams of LSAs to LSAG
  – Archives LSAs for analysis by OSPFScan
• Real-time analysis: LSA aGgregator (LSAG)
  – Monitors network for topology changes, LSA storms, node flaps and anomalies
• Off-line analysis: OSPFScan
  – Supports queries on LSA archives
  – Allows playback and modeling of topology changes
  – Allows emulation of OSPF routing

Example
[Figure: LSAR 1 and LSAR 2 collect LSAs from the OSPF network (Areas 0, 1 and 2). Each LSAR keeps an LSA archive and “reflects” LSAs over a TCP connection to LSAG (real-time monitoring). A replicated LSA archive feeds OSPFScan (off-line analysis).]

How LSAR attaches to Network
• Host mode
  – Join multicast group
  – Adv: completely passive
  – Disadv: not reliable, delayed initialization of LSDB
• Full adjacency mode
  – Form full adjacency (= peering session) with a router
  – Adv: reliable, immediate initialization of LSDB
  – Disadv: LSAR’s instability can impact entire network
• Partial adjacency mode
  – Keep adjacency in a state that allows LSAR to receive LSAs, but does not allow data forwarding over link
  – Adv: reliable, LSAR’s instability does not impact entire network, immediate initialization of LSDB
  – Disadv: can raise alarms on the router

Partial Adjacency for LSAR
[Figure: exchange between router R and LSAR that holds the adjacency in a partial state. Labels: “I have LSA L”, “Please send me LSA L”, “I need LSA L from LSAR”.]
• Router R does not advertise a link to LSAR
• LSAR does not originate any LSAs
• Routers (except R) not aware of LSAR’s presence
• Does not trigger routing calculations in network
• LSAR’s going up/down does not impact network
• LSAR-R link is not used for data forwarding

LSA aGgregator (LSAG)
• Analyzes “reflected” LSAs from LSARs in real-time
• Generates console messages:
  – Change in OSPF network topology
    • ADJACENY COST CHANGE: rtr 10.0.0.1 (intf 10.0.0.2) rtr 10.0.0.5 old_cost 1000 new_cost 50000 area 0.0.0.0
  – Node flaps
    • RTR FLAP: rtr 10.0.0.12 no_flaps 7 flap_window 570 sec
  – LSA storms
    • LSA STORM: lstype 3 lsid 10.1.0.0 advrt 10.0.0.3 area 0.0.0.0 no_lsas 7 storm_window 470 sec
  – Anomalous behavior
    • TYPE-3 ROUTE FROM NON-BORDER RTR: ntw 10.3.0.0/24 rtr 10.0.0.6 area 0.0.0.0
• Dumps snapshots of network topology

OSPFScan
• Tools for off-line analysis of LSA archives
  – Parse, select (based on queries), and analyze
• Functionality supported by OSPFScan
  – Classification of LSA traffic
    • Change LSAs, refresh LSAs, duplicate LSAs
  – Emulation of OSPF routing
    • How OSPF routing tables evolved in response to network changes
    • What the end-to-end path within the OSPF domain looked like at any instant
  – Modeling of topology changes
    • Vertex addition/deletion and link addition/deletion/change_cost
  – Playback of topology change events
  – Statistics and report generation

Performance Evaluation
• Performance of LSAR and LSAG through lab experiments
  – LSAR and LSAG are key to real-time monitoring
• How performance scales with LSA-rate and network size

Experimental Setup
[Figure: test bed. Labels: PC, SUT, Zebra with emulated topology, LSAR, LSAG. Zebra and LSAR are linked by an OSPF adjacency and a TCP connection, and LSAR and LSAG by a TCP connection, all carrying LSAs. LSA pass-through time is measured for LSAR and LSA processing time for LSAG.]

Methodology
• Send a burst of LSAs from Zebra to LSAR
  – Vary number of LSAs (l) in a burst of 1 sec duration
• Use of fully connected graph as the emulated topology
  – Vary number of nodes (n) in the topology
• Performance measurements
  – LSAR performance: LSA “pass-through” time
    • Zebra measures time difference between sending an LSA to LSAR and receiving it back from LSAR
  – LSAG performance: LSA processing time
    • Instrumentation of LSAG code

LSAR Performance
[Figure: mean LSA pass-through time (LSAR) vs. burst size. x-axis: number of LSAs per burst (up to 500). y-axis: time in seconds (up to 0.9). Curves: n = 100, LSAR + LSAG; n = 50, LSAR + LSAG; n = 100, LSAR only; n = 50, LSAR only.]

LSAG Performance
[Figure: mean LSA processing time (LSAG) vs. network size. x-axis: number of nodes in the topology (50 to 100). y-axis: time in seconds (up to 0.06). Curves: burst-size = 500 LSAs; burst-size = 100 LSAs.]

Deployment
• Tier-1 ISP network
  – Area 0, 100+ routers; point-to-point links
  – Deployed since January, 2003
  – LSA archive size: 8 MB/day
  – LSAR connection: partial adjacency mode
• Enterprise network
  – 15 areas, 500+ routers; Ethernet-based LANs
  – Deployed since February, 2002
  – LSA archive size: 10 MB/day
  – LSAR connection: host mode

LSAG in Day-to-day Operations
• Generation of alarms by feeding messages into higher layer network management systems
  – Grouping of messages to reduce the number of alarms
  – Prioritization of messages
• Validation of maintenance steps and monitoring the impact of these steps on network-wide OSPF behavior
  – Example:
    • Network operators use cost-out/cost-in of links to carry out maintenance
    • A “link-audit” web-page allows operators to keep track of link costs in real-time

Problems Caught by LSAG
• Equipment problem
  – Detected internal problems in a crucial router in enterprise network
    • Problem manifested as episodes of OSPF adjacency flapping
• Configuration problem
  – Identified assignment of same router-id to two routers in enterprise network
• OSPF implementation bug
  – Caught a bug in type-3 LSA generation code of a router vendor in ISP network
    • Faster refresh of LSAs than standards-mandated rate

Long Term Analysis by OSPFScan
• LSA traffic analysis
  – Identified excessive duplicate LSA traffic in some areas of enterprise network
    • Led to root-cause analysis and preventative steps
• Statistics generation
  – Inter-arrival time of change LSAs in ISP network
    • Fine-tuning configurable timers related to route calculation (= SPF calculation)
  – Mean down-time and up-time for links and routers in ISP network
    • Assessment of reliability and availability

Lessons Learned through Deployment
• New tools reveal new failure modes
• Real-time alerting and off-line analysis are complementary
  – Distributed architecture helped a lot
• OSPF exhibits significant activity in real networks
  – Maintenance and genuine problems
• Add functionality incrementally and through interaction with users
• Archive all LSAs
  – LSA volume is manageable
  – Don’t throw away refresh and duplicate LSAs

Conclusion
• Three-component architecture
  – LSAR: data collection
  – LSAG: real-time analysis
  – OSPFScan: off-line analysis
• Performance analysis
  – LSAR and LSAG scale well as LSA-rate and network size increase
• Deployment
  – Deployed in Tier-1 ISP and enterprise network
    • Has proved to be an extremely valuable tool for network management
    • “OSPF Monitor was a Lifesaver” – VP of Networking, Enterprise network

Future Work
• Real-time analysis
  – Correlation with other fault and performance data for more meaningful alerting
  – Prioritization of alerts
• Off-line analysis
  – Correlation with other data sources
    • Work already underway: BGP, fault, performance
  – Identification of problem signatures and feeding them into real-time component for problem prediction

Backup Slides

Overview of OSPF
• OSPF is a link-state protocol
  – Every router learns entire network topology
    • Topology is represented as graph
    • Routers are vertices, links are edges
  – Every link is assigned weight through configuration
  – Every router uses Dijkstra’s single source shortest path algorithm to build its forwarding table
    • Router builds Shortest Path Tree (SPT) with itself as root
    • Shortest Path Calculation (SPF)
  – Packets are forwarded along shortest paths defined by link weights

Areas in OSPF
• OSPF allows domain to be divided into areas for scalability
  – Areas are numbered 0, 1, 2 …
  – Hub-and-spoke with area 0 as hub
  – Every link is assigned to exactly one area
  – Routers with links in multiple areas are called border routers
[Figure: Area 1 and Area 2 attached to Area 0 through border routers.]

Summarization with Areas
• Each router learns
  – Entire topology of its attached areas
  – Information about subnets in remote areas and their distance from the border routers
    • Distance = sum of link costs from border router to subnet
[Figure: two panels, “OSPF domain” and “R1’s View”, showing areas 0 and 1, routers R1, R2, R3, B1, B2, C1 and C2, and subnets 10.10.4.0/24 and 10.10.5.0/24. Individual link costs cannot be matched to links from the extracted text.]

Link State Advertisements (LSAs)
• Every router describes its local connectivity in Link State Advertisements (LSAs)
• Router originates an LSA due to…
  – Change in network topology
    • Example: link goes down or comes up
  – Periodic soft-state refresh
    • Recommended value of interval is 30 minutes
• LSA is flooded to other routers in the domain
  – Flooding is reliable and hop-by-hop
  – Includes change and refresh LSAs
  – Flooding leads to duplicate copies of LSAs being received
• Every router stores LSAs (self-originated + received) in link-state database (= topology graph)

Adjacency
• Neighbor routers (i.e., routers connected by a physical link) form an adjacency
• The purpose is to make sure
  – Link is operational and routers can communicate with each other
  – Neighbor routers have consistent view of network topology
    • To avoid loops and black holes
• Link gets used for data forwarding only after adjacency is established
• Use of periodic Hellos to monitor the status of link and adjacency

Equipment Problem at Enterprise Network
• Internal errors in a router in area 0
  – Episodes where router would drop adjacencies with other routers
• Problem manifested in LSAG as “ADJ UP” and “ADJ DOWN” messages
  – Not visible in other network management systems
• Led to proactive maintenance
[Figure: two plots of total LSAs in area 0 and total LSAs due to router bug, one per day in April, 2002 and one per hour on April 16, 2002.]

LSA Traffic in Enterprise Network
[Figure: four panels (Area 0, Area 2, Area 3, Area 4) plotting refresh LSAs, change LSAs and duplicate LSAs against days. Annotations: “Genuine Anomaly” (twice) and “Artifact: 23 hr day (Apr 7)”.]

Overhead: Duplicate LSAs
[Figure: duplicate LSAs per day in area 2 and in area 3.]
• Why do some areas witness substantial duplicate LSA traffic, while other areas do not witness any?
  – OSPF flooding over LANs leads to control plane asymmetries and to imbalances in duplicate LSA traffic
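
The split of LSA traffic into change, refresh and duplicate LSAs that these slides rely on can be sketched as follows. This is a minimal illustration, not OSPFScan's code: the `Lsa` record, its `body` field and the `classify` function are hypothetical names. The rule itself is an assumption: an instance already seen for the same (lstype, lsid, advrt) is a duplicate, a newer instance with unchanged contents is a refresh, and anything else is a change. Sequence-number wraparound and LSA aging are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lsa:
    """Hypothetical, simplified LSA record (not a real OSPF packet layout)."""
    lstype: int
    lsid: str
    advrt: str   # advertising router
    seq: int     # sequence number; a higher value means a newer instance
    body: bytes  # advertised contents (links, costs), treated as opaque


def classify(lsas):
    """Label each LSA, in arrival order, as 'change', 'refresh' or 'duplicate'."""
    latest = {}  # (lstype, lsid, advrt) -> newest instance seen so far
    labels = []
    for lsa in lsas:
        key = (lsa.lstype, lsa.lsid, lsa.advrt)
        prev = latest.get(key)
        if prev is not None and lsa.seq <= prev.seq:
            # Assumption: an instance no newer than the stored one is a
            # redundant copy produced by flooding.
            labels.append("duplicate")
            continue
        if prev is not None and lsa.body == prev.body:
            labels.append("refresh")  # new instance, same contents
        else:
            labels.append("change")   # first sighting or new contents
        latest[key] = lsa
    return labels


stream = [
    Lsa(1, "10.0.0.1", "10.0.0.1", 1, b"cost=1000"),   # first sighting
    Lsa(1, "10.0.0.1", "10.0.0.1", 1, b"cost=1000"),   # same instance again
    Lsa(1, "10.0.0.1", "10.0.0.1", 2, b"cost=1000"),   # new seq, same contents
    Lsa(1, "10.0.0.1", "10.0.0.1", 3, b"cost=50000"),  # new seq, new contents
]
print(classify(stream))  # ['change', 'duplicate', 'refresh', 'change']
```

Counting these labels per area and per day would yield the kind of breakdown plotted in the LSA traffic slides, under the same assumptions.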