Analytics: Improving Network Fault Management Efficiency
Abby Knowles, Verizon Wireless
IEEE CQR Conference
San Diego, CA
May 16, 2012

Confidential and proprietary materials for authorized Verizon personnel and outside agencies only. Use, disclosure or distribution of this material is not permitted to any unauthorized persons or third parties except by written agreement.

Wireless Network Transitions
2G, 3G, 4G
It is typical within evolving networks to have multiple generations of technologies present until older ones have been phased out.

Multi-Generational Network Management Architecture
[Diagram labels]
• OSS
  – Databases
  – Links
• Vendor Specific
  – Pollers
  – Service Managers
  – SNMP Traps (RFC 1157)
  – Syslog
• OAM Network
• Network Elements: 1st Gen, 2nd Gen, 3rd Gen
• Transport: 1st Gen, 2nd Gen

Typical Network Management Model
[Diagram labels: Network; Network Impact; Service Impact; Alarms; System; Discard; Operator; Ignore; Tickets; Dispatches; Vendor Support]
An overlooked resource of Fault Management Operations is Human Capital
• BSc / MSc
• CCNA/IE/MP

Potential Impact of Increasing Alarms
• Fault Management Network
  – Server Volume, Capacity, Space
  – Fault Management Transport Latency
• Operating Center
  – Headcount Pressures
  – Redundant Tickets
  – Operator Errors
  – Reduced Remote Resolution
  – Increased Mean Time to Repair (MTTR)

Achieving Fault Management Efficiency
[Diagram labels: System; Operator; Reduce]
n: traps generated
m: traps discarded
a: informational tickets
b: dispatch tickets
c: Tier II support (Vendor) tickets
X = # total alarms viewed = n - m
Y = # tickets created = a + b + c
System Efficiency = X / n = (n - m) / n
Operator Efficiency = Y / X = (a + b + c) / (n - m)

“Smart Alarming”
• Standardized Alert Names & Use of Correlation
  – Standardize the alarming type
  – Define business rules
  – Collapse and present single alarm
  – Maintain the ability to present correlated alarms for further analysis
[Diagram labels: Network; Network Impact; Service Impact; Correlation Engine]

Example – BGP Router Alerts
• As Data Networks become more central to wireless providers’ products, there is an urgent need to optimize data alarms.
• Many data networks use BGP (Border Gateway Protocol, RFC 4271). It is a protocol for exchanging routing information between gateway hosts.
• Whenever communications between gateways fail, standard and vendor-specific BGP alerts are generated.
• The result is an “alarm storm” of BGP state alerts, which the fault management operator has to research, or delay or ignore if there are more critical alarms.

Identify types of common alerts for standardization
• ROUTER_cbgFsmStateChange
• CISCO_CiscocbgBackwardTransition
Define business rules to reduce the common alerts to single alerts
Apply these rules to a correlation engine
Test
Standardized alert: BGP_bgpStateChange

Efficiencies Attained
# Alarms viewed decreased by 40-80%
[Chart: System Efficiency, Pre-Correlation vs. Post-Correlation, for BGP, Cell Site Router, and Switch. Data labels: 87%, 85%, 50%, 38%, 38%, 27%]
[Chart: Operator Efficiency, Pre-Correlation vs. Post-Correlation, for BGP, Cell Site Router, and Switch. Data labels: 6.85%, 2.50%, 1.33%, 2.46%, 1.49%, 1.14%]

                               Pre-Correlation   Post-Correlation
Combined System Efficiency     76%               77%
Combined Operator Efficiency   1.37%             4.26%
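
Illustrative sketch (not from the original deck): the efficiency definitions and the BGP correlation example can be expressed in a few lines of Python. Only the three alert names and the two efficiency formulas come from the slides. The mapping rule, the grouping key (device plus standardized name), the device names, and all counts are assumptions made up for illustration; a production correlation engine would likely also use time windows and topology.

```python
from collections import defaultdict

# Assumed business rule: both raw BGP alerts map to one standardized name.
# The alert names are from the slides; the mapping itself is hypothetical.
STANDARD_NAME = {
    "ROUTER_cbgFsmStateChange": "BGP_bgpStateChange",
    "CISCO_CiscocbgBackwardTransition": "BGP_bgpStateChange",
}


def correlate(alerts):
    """Collapse alerts sharing (device, standardized name) into one alarm.

    Each input alert is a (device, raw_name) tuple. The raw alert names stay
    attached to the collapsed alarm so they remain available for analysis.
    """
    grouped = defaultdict(list)
    for device, raw_name in alerts:
        name = STANDARD_NAME.get(raw_name, raw_name)
        grouped[(device, name)].append(raw_name)
    return dict(grouped)


def system_efficiency(n, m):
    """(n - m) / n, with n traps generated and m traps discarded."""
    return (n - m) / n


def operator_efficiency(n, m, a, b, c):
    """(a + b + c) / (n - m): tickets created per alarm viewed."""
    return (a + b + c) / (n - m)


# Made-up alert stream: four raw alerts from two hypothetical routers.
storm = [
    ("rtr-1", "ROUTER_cbgFsmStateChange"),
    ("rtr-1", "CISCO_CiscocbgBackwardTransition"),
    ("rtr-1", "ROUTER_cbgFsmStateChange"),
    ("rtr-2", "CISCO_CiscocbgBackwardTransition"),
]
collapsed = correlate(storm)
print(len(storm), "raw alerts ->", len(collapsed), "alarms presented")

# Made-up counts: 1000 traps, 250 discarded, 10 + 15 + 5 tickets created.
print(system_efficiency(1000, 250))
print(operator_efficiency(1000, 250, 10, 15, 5))
```

With these made-up inputs, four raw alerts collapse to two presented alarms, one per router, and the raw names remain retrievable from each collapsed entry.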