AI Approaches to Network Fault Management
Andrew Learn
29 Nov 2001

Outline
• Fault Management Process
• AI Approaches
– Expert Systems
– Neural Networks
– Case-based Reasoning
2
Network Faults
• Hardware
– Wear and tear
– Cut cables
– Improper installation
• Software
– Incorrect design
– Bugs
– Incorrect data (e.g. routing tables)
3
Fault Management Process
1. Collect alarms
2. Filter and correlate alarms
3. Diagnose faults
4. Restoration and repair
5. Evaluate effectiveness
4
1. Collect Alarms
• Types of alarms
– Physical: Failure in communication
• e.g. loss of signal, CRC failure
– Logical: Statistical values exceed threshold
• e.g. number of packets dropped
• Communication with components
– Control protocol: Simple Network Management Protocol (SNMP)
– Data format: Management Information Base (MIB-II, 1990), with ~170 manageable objects
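A logical alarm of the kind listed above can be sketched as a threshold check on how much a counter grew between two polls. This is a minimal illustration, not part of any SNMP library: the function names and the threshold value are assumptions, and the sample values are made up.

```python
COUNTER_MAX = 2**32  # SNMPv1 Counter values are 32-bit and wrap around


def counter_delta(previous, current):
    """Return the increase between two polls, allowing for one wraparound."""
    return (current - previous) % COUNTER_MAX


def logical_alarm(name, previous, current, threshold):
    """Return an alarm string if the counter grew by more than threshold, else None."""
    delta = counter_delta(previous, current)
    if delta > threshold:
        return f"ALARM {name}: {delta} in last poll interval (threshold {threshold})"
    return None


# Hypothetical poll results for a dropped-packet counter.
print(logical_alarm("ifInDiscards", 1_000, 1_750, threshold=500))
print(logical_alarm("ifInDiscards", 1_750, 1_800, threshold=500))
```

The first call reports an alarm (750 drops exceed the threshold of 500); the second returns None.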
5
• Sample MIB Entry
ipInReceives OBJECT-TYPE
SYNTAX Counter
ACCESS read-only
STATUS mandatory
DESCRIPTION
"The total number of input datagrams
received from interfaces, including
those received in error."
::= { ip 3 }
• Sample SNMP “get” call
snmpget netdev-kbox.cc.cmu.edu public system.sysUpTime.0
Name: system.sysUpTime.0
Timeticks: (2270351) 6:18:23
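The Timeticks value in the reply counts hundredths of a second, which is how 2270351 becomes 6:18:23. A small sketch of that conversion (the function name is illustrative):

```python
def format_timeticks(ticks):
    """Convert SNMP TimeTicks (hundredths of a second) to h:mm:ss."""
    total_seconds = ticks // 100
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


print(format_timeticks(2270351))  # 6:18:23
```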
6
2. Filter and Correlate Alarms
• Filter
– Eliminate redundant alarms
– Suppress noncritical alarms
– Inhibit low-priority alarms in presence of
high-priority alarms
• Correlate
– Analyze and interpret multiple alarms to
assign new meaning (derived alarm)
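The filter and correlate steps above might look roughly like the sketch below. Everything here is an assumption made for illustration: the Alarm fields, the severity scale, the device names, and the rule that a link fault is derived when every device behind a link reports loss of signal.

```python
from collections import namedtuple

# severity: 1 = low, 2 = major, 3 = critical (hypothetical scale)
Alarm = namedtuple("Alarm", "source kind severity")


def filter_alarms(alarms, min_severity=2):
    """Apply the three filtering steps in order."""
    # 1. Eliminate redundant (identical) alarms, keeping the first occurrence.
    unique = list(dict.fromkeys(alarms))
    # 2. Suppress noncritical alarms.
    kept = [a for a in unique if a.severity >= min_severity]
    # 3. Inhibit lower-priority alarms from a source that also raised a higher one.
    top = {}
    for a in kept:
        top[a.source] = max(top.get(a.source, 0), a.severity)
    return [a for a in kept if a.severity == top[a.source]]


def correlate(alarms, topology):
    """Derive one alarm per link when every device behind it reports loss of signal."""
    reporting = {a.source for a in alarms if a.kind == "loss_of_signal"}
    derived = []
    for link, devices in topology.items():
        if devices and set(devices) <= reporting:
            derived.append(Alarm(link, "link_fault", 3))
    return derived


raw = [
    Alarm("BS1", "loss_of_signal", 3),
    Alarm("BS1", "loss_of_signal", 3),   # redundant duplicate
    Alarm("BS1", "high_error_rate", 2),  # inhibited by the critical alarm on BS1
    Alarm("BS2", "loss_of_signal", 3),
    Alarm("BS2", "fan_warning", 1),      # suppressed as noncritical
]
topology = {"ML-1": ["BS1", "BS2"]}  # hypothetical: both stations sit behind link ML-1

filtered = filter_alarms(raw)
print(filtered)
print(correlate(filtered, topology))
```

Five raw alarms reduce to two after filtering, and correlation replaces them with a single derived link_fault alarm on ML-1.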
7
3. Diagnose Faults
• May require additional tests/diagnostics
on circuits or components
– Automated or manual
• Analyze all info from alarms, tests,
performance monitoring
• Identify smallest system module that
needs to be repaired or replaced
8
4. Restoration and Repair
• Restoration: Continue service in presence of fault
– Switch over to spares
– Reroute around trouble spot
– Restore software or data from backup
• Repair
– Replace parts
– Repair cables
– Debug software
• Retest to verify fault is eliminated
9
5. Evaluate Effectiveness
• Questions to answer:
– How often do faults occur?
– How many faults affect service?
– How long is service interrupted?
– How long to repair?
• Provides assessment of:
– Performance of fault management system
– Reliability of equipment
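The four questions above can be answered from a fault log. The sketch below assumes a hypothetical record layout (outage and repair durations in minutes) and made-up sample data; a real trouble-ticket system would supply its own fields.

```python
# Hypothetical fault log for a 30-day period.
# outage_min is 0 when the fault did not affect service.
faults = [
    {"outage_min": 0, "repair_min": 45},
    {"outage_min": 30, "repair_min": 120},
    {"outage_min": 10, "repair_min": 60},
    {"outage_min": 0, "repair_min": 15},
]


def evaluate(faults, period_days):
    """Summarize fault frequency, service impact, outage length, and repair time."""
    affecting = [f for f in faults if f["outage_min"] > 0]
    mean_outage = (
        sum(f["outage_min"] for f in affecting) / len(affecting) if affecting else 0.0
    )
    mean_repair = sum(f["repair_min"] for f in faults) / len(faults) if faults else 0.0
    return {
        "faults_per_day": len(faults) / period_days,  # how often do faults occur?
        "service_affecting": len(affecting),          # how many faults affect service?
        "mean_outage_min": mean_outage,               # how long is service interrupted?
        "mean_repair_min": mean_repair,               # how long to repair?
    }


report = evaluate(faults, period_days=30)
print(report)
```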
10
AI Approaches to Fault Management
• Well-developed approach:
– Expert systems
• New approaches:
– Neural networks
– Case-based reasoning
– Other
11
Why AI?
• Need for intelligence
– Data analysis
– Pattern recognition
– Clustering and categorization
– Problem solving
• Need for automation
– Manual analysis/solution takes time
– Limited manpower
– Limited expertise
12
Well-developed approach:
Expert Systems
• Expert system = rule base + working memory
• Three parts to a rule:
1. Context trigger (when the rule should be considered)
2. Condition (if X ...)
3. Conclusion (... then Y)
• Used since the 1980s by major telecom companies
– Bell: Automated Cable Expertise (ACE) system
– GTE: Central Office Maintenance Printout Analysis & Suggestion System (COMPASS)
– AT&T: Network Management Expert System (NEMESYS)
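The rule structure above (context trigger, condition, conclusion) and the working memory can be sketched as a tiny forward-chaining loop. This is an illustrative toy, not how ACE, COMPASS, or NEMESYS were built; the rules and fact names are invented.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    context: str                      # when the rule should be considered
    condition: Callable[[set], bool]  # "if X ..."
    conclusion: str                   # "... then Y" (fact added to working memory)


def run(rules, working_memory, context):
    """Fire rules for the given context until no new facts are added."""
    memory = set(working_memory)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if (
                rule.context == context
                and rule.conclusion not in memory
                and rule.condition(memory)
            ):
                memory.add(rule.conclusion)
                changed = True
    return memory


# Hypothetical rule base.
rules = [
    Rule("link", lambda m: "loss_of_signal" in m and "crc_failure" in m, "suspect_cable"),
    Rule("link", lambda m: "suspect_cable" in m, "dispatch_technician"),
    Rule("software", lambda m: "loss_of_signal" in m, "reload_config"),
]

facts = run(rules, {"loss_of_signal", "crc_failure"}, context="link")
print(sorted(facts))
```

The third rule never fires because its context trigger ("software") does not match, even though its condition holds.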
13
Need for New Approaches
• Weaknesses of expert systems
– Brittle in unforeseen situations
– Cannot learn from experience
– Hard to maintain (adding/deleting/modifying rules)
– Knowledge acquisition bottleneck
– Can’t handle incomplete or probabilistic data
• Factors driving new approach
– Rapidly changing technology
– Dynamic network topology
– Network complexity
– Competition, demand for QoS
14
Neural Nets
• Structure: input, hidden, output layers
• Training
– Supervised: Input pattern & desired output
– Unsupervised: Clustering of similar inputs
[Figure: network diagram with input, hidden, and output layers connected by weights]
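The input, hidden, and output layer structure can be illustrated with a forward pass through a very small network. This sketch covers only the forward computation, not training; the sigmoid activation, the layer sizes, and the hand-picked weights are all assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases):
    """One fully connected layer: each row of weights feeds one neuron."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)


# Hypothetical weights: 3 alarm inputs, 2 hidden neurons, 1 output neuron.
hidden_w = [[2.0, 2.0, -1.0], [-1.0, 1.0, 2.0]]
hidden_b = [-1.0, -1.0]
output_w = [[3.0, -3.0]]
output_b = [0.0]

result = forward([1, 1, 0], hidden_w, hidden_b, output_w, output_b)
print(result)
```

Supervised training would adjust the weights so that the outputs for known input patterns move toward the desired outputs.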
15
Neural Nets
• Advantages
– Pattern matching & generalization
– Fast & efficient
– Trainable
– Handles incomplete, ambiguous data
• Disadvantages
– Black box
– Lack of training data
16
Neural Net Example
• Example: Alarm correlation in cell phone networks (Univ. of Hannover, Germany)
[Figure: cell phone network with mobile units, base stations (BS1, BS2), microwave links, base station controller (BSC), switching centers, and maintenance center (MC)]
17
Neural Net Example
• Test Results:
– 94 alarms
– 99.76% correct classification with up to 25% noise
[Figure: network relating BSC, BS-1, and BS-2 alarms to the initial cause (ML-1 fault, ML-2 fault)]
18
Case-Based Reasoning
• Case-based reasoning = matching previous
examples
– Case library: Set of previous faults, diagnoses,
solutions
– Usually based on “trouble ticket” help-desk
databases
• Design considerations:
– What are key attributes of a case?
– What attributes will be used to index & access a
case?
19
Case-Based Reasoning
• Advantages
– Easier knowledge acquisition than expert
systems
– Can learn by adding new cases
– Doesn’t require extensive maintenance
• Disadvantages
– Requires time-consuming user interaction
– No help for first-time problems
20
Case-Based Reasoning Example
Case 134
Problem Type: Performance
Description: High error rate in comm between POA-SP & DF
No access: Intermittent
Retrieval: Case 103 [Similarity = 0.69]
Description: 64kb line from VendorX drops big datagrams.
Additional Info requested: Is there loss of big datagrams in ping test? (Result: Yes)
Cause: Link 34 inside Bldg 207 was defective
Solution: Vendor replaced cabling.
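Retrieval by similarity, as in the example above, could be sketched as a weighted attribute match over the case library. The attribute names, weights, and case 101 are hypothetical, and the scoring function is an assumption; it is not the measure that produced the 0.69 shown above.

```python
def similarity(query, case, weights):
    """Weighted fraction of indexed attributes on which query and case agree."""
    total = sum(weights.values())
    matched = sum(
        w
        for attr, w in weights.items()
        if query.get(attr) is not None and query.get(attr) == case.get(attr)
    )
    return matched / total


def retrieve(query, library, weights):
    """Return (best_case, score) from the case library."""
    scored = ((case, similarity(query, case, weights)) for case in library)
    return max(scored, key=lambda pair: pair[1])


# Hypothetical index attributes and weights.
weights = {"problem_type": 2.0, "symptom": 2.0, "no_access": 1.0}

library = [
    # Case 101 is invented for contrast; case 103 paraphrases the example above.
    {"id": 101, "problem_type": "Connectivity", "symptom": "no link",
     "no_access": "Permanent", "solution": "Reseat interface card"},
    {"id": 103, "problem_type": "Performance", "symptom": "drops big datagrams",
     "no_access": "Intermittent", "solution": "Vendor replaced cabling"},
]

query = {"problem_type": "Performance", "symptom": "high error rate",
         "no_access": "Intermittent"}

best, score = retrieve(query, library, weights)
print(best["id"], round(score, 2), best["solution"])
```

Once the new problem is resolved, appending it to the library is how the system learns from experience.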
21
Summary of 3 AI Methods
• Expert systems
– If / then rules
– Well-developed technology
– Brittle, hard to maintain
• Neural networks
– Output = weighted transform of inputs
– Fast pattern matching, robust to noise
– Black box, lack of training data
• Case-based systems
– Trouble-ticket retrieval
– Easy to build, maintain
– Slower diagnosis, takes time to build
22
Other Approaches
• Bayesian networks
– Model statistical probabilities and
dependence of faults
• Mobile intelligent agents
– Independent software agents cooperate to
collect info, suggest solutions
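The probabilistic reasoning behind Bayesian networks can be illustrated with its simplest building block: Bayes' rule applied to one alarm and a set of mutually exclusive causes. This is a single-step sketch, not a full Bayesian network, and every probability below is made up for illustration.

```python
def posterior(priors, likelihoods):
    """Bayes' rule over mutually exclusive causes: P(cause | alarm observed)."""
    joint = {cause: priors[cause] * likelihoods[cause] for cause in priors}
    evidence = sum(joint.values())  # P(alarm)
    return {cause: p / evidence for cause, p in joint.items()}


# Hypothetical numbers: P(cause) and P(alarm | cause).
priors = {"link_fault": 0.01, "card_fault": 0.04, "no_fault": 0.95}
likelihoods = {"link_fault": 0.9, "card_fault": 0.5, "no_fault": 0.01}

result = posterior(priors, likelihoods)
for cause, p in sorted(result.items(), key=lambda item: -item[1]):
    print(f"{cause}: {p:.3f}")
```

With these numbers the card fault is the most probable explanation even though a link fault produces the alarm more reliably, because the card fault is assumed to be more common.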
23
Future Trends
• Proactive fault detection
– Recognizing trouble signs and taking
corrective action before service degrades
• Hybrid systems
– Multiple AI methods integrated
24