Improving Network Fault Management Efficiency

advertisement
Analytics:
Improving Network Fault Management Efficiency
Abby Knowles, Verizon Wireless
IEEE CQR Conference
San Diego, CA
May 16, 2012
Confidential and proprietary materials for authorized Verizon personnel and outside agencies only. Use, disclosure or distribution of this material is not permitted to any unauthorized persons or third parties except by written agreement.
Wireless Network Transitions
2G
3G
4G
It is typical within evolving networks to have multiple generations of
technologies present until older ones have been phased out
Confidential and proprietary materials for authorized Verizon personnel and outside agencies only. Use, disclosure or distribution of this material is not permitted to any unauthorized persons or third parties except by written agreement.
2
Multi-Generational Network Management Architecture
Databases
Links
•
•
OSS
Vendor
Specific
• Pollers
• Service
Managers
• SNMP Traps
(RFC 1157)
• Syslog
OAM Network
3rd
Gen
2nd
Gen
Elements
1st
Gen
1st
Gen
2nd
Gen
1st
Gen
3rd
Gen
2nd
Gen
1st
Gen
3rd
Gen
2nd
Gen
1st
Gen
Transport
2nd
Gen
1st
Gen
Confidential and proprietary materials for authorized Verizon personnel and outside agencies only. Use, disclosure or distribution of this material is not permitted to any unauthorized persons or third parties except by written agreement.
3
Typical Network Management Model
Tickets
Dispatches
Vendor Support
System
Discard
Operator
Ignore
ALARMS
An overlooked resource
of Fault Management
Operations is Human
Capital
•BSc / MSc
•CCNA/IE/MP
Service
Impact
Network
Impact
Network
Confidential and proprietary materials for authorized Verizon personnel and outside agencies only. Use, disclosure or distribution of this material is not permitted to any unauthorized persons or third parties except by written agreement.
4
Potential Impact of Increasing Alarms
• Fault Management Network
– Server Volume, Capacity, Space
– Fault Management Transport Latency
• Operating Center
– Headcount Pressures
– Redundant Tickets
– Operator Errors
– Reduced Remote Resolution
– Increased Mean Time to Repair (MTTR)
Confidential and proprietary materials for authorized Verizon personnel and outside agencies only. Use, disclosure or distribution of this material is not permitted to any unauthorized persons or third parties except by written agreement.
5
Achieving Fault Management Efficiency
SYSTEM
OPERATOR
REDUCE
n : traps generated
m : traps discarded
a: informational tickets
b: dispatch tickets
c: Tier II support (Vendor) tickets
X = # total alarms viewed
= n- m
Y = # tickets created
=a+b+c
System Efficiency
= x = (n-m)
n
n
Operator Efficiency
= y = a+b+c
x (n-m)
Confidential and proprietary materials for authorized Verizon personnel and outside agencies only. Use, disclosure or distribution of this material is not permitted to any unauthorized persons or third parties except by written agreement.
6
“Smart Alarming”
• Standardized Alert Names & Use of Correlation
– Standardize the alarming type
– Define business rules
– Collapse and present single alarm
– Maintain the ability to present correlated alarms for further analysis
Correlation Engine
Network
Impact
Service
Impact
Network
Confidential and proprietary materials for authorized Verizon personnel and outside agencies only. Use, disclosure or distribution of this material is not permitted to any unauthorized persons or third parties except by written agreement.
7
Example – BGP Router Alerts
• As Data Networks become more central to wireless providers’ products,
there is an urgent need to optimize data alarms.
• Many data networks use BGP (Border Gateway Protocol -RFC 427). It is
a protocol for exchanging routing information between gateway hosts.
• Whenever communications between gateways fail, standard and vendor
specific BGP alerts are generated
• The result is an “alarm storm” of BGP state alert, through which the fault
management operator has to research, or delay/ignore if there are more
critical alarms.
Identify types of
common alerts for
standardization
•
•
ROUTER_cbgFsmStateChange
CISCO_CiscocbgBackwardTransition
Define business
rules to reduce
the common
alerts to single
alerts
Apply these rules
to a correlation
engine
TEST
BGP_bgpStateChange
Confidential and proprietary materials for authorized Verizon personnel and outside agencies only. Use, disclosure or distribution of this material is not permitted to any unauthorized persons or third parties except by written agreement.
8
Efficiencies Attained
System Efficiency
87%
90%
80%
70%
60%
50%
40%
30%
20%
10%
0%
# Alarms viewed
decreased
between 40-80%
Operator Efficiency
85%
6.85%
8.00%
50%
38%
38%
6.00%
4.00%
27%
2.00%
2.50%
1.33%
2.46%
1.49%
1.14%
0.00%
BGP
Cell Site
Router
Pre-Correlation
Switch
Post Correlation
BGP
Pre-Correlation
Cell Site
Router
Switch
Post Correlation
Pre-Correlation Post Correlation
Combined System Efficiency
76%
77%
Combined Operator Efficiency
1.37%
4.26%
Confidential and proprietary materials for authorized Verizon personnel and outside agencies only. Use, disclosure or distribution of this material is not permitted to any unauthorized persons or third parties except by written agreement.
9
Download