Fault Management – Detection
and Diagnosis
Outline
Fault management functionality
 Event correlation concepts
 Techniques

Definitions
 A fault may cause hundreds of alarms.
 We need to be able to do the following:
o Detect the existence of faults
o Locate faults
 An alarm
o An external manifestation of a fault
— Generated by components
— Observable, e.g., via messages
o An alarm represents a symptom of a fault.
 An event
o An occurrence of interest, e.g., an alarm message
Fault Management Functionalities
 Fault detection
o Should be real-time
o Techniques can be based on active schemes
(e.g., polling) or event-based schemes (where a
system component reports that it has detected a
failure); a minimal polling sketch follows this list.
 Fault location
o Is the fault in a link, a system component, or an
application component?
 Determine corrective actions
 Carry out corrective actions and determine their
effectiveness
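
Below is a minimal sketch of active, polling-based detection, assuming a hypothetical is_alive() probe and raise_alarm() callback; both names are illustrative and not tied to any particular management platform.

    import time

    def poll_components(components, is_alive, raise_alarm, interval=30):
        # Active fault detection: periodically probe every managed component
        # and raise an alarm (the external manifestation of the fault) on failure.
        # components : iterable of component identifiers (e.g., host names)
        # is_alive   : hypothetical probe, function(component) -> bool
        # raise_alarm: callback invoked when a fault is detected
        while True:
            for c in components:
                if not is_alive(c):
                    raise_alarm(c)
            time.sleep(interval)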
Alarm (Event) Correlation
 Alarm explosion
o A single problem might trigger multiple symptoms (e.g.,
a router is down)
 There could be too many alarms for an administrator
to handle; techniques used to help (a small sketch of
compression and suppression follows this list):
o Compression: reduction of multiple occurrences of an
alarm into a single alarm
o Count: replacement of a number of occurrences of alarms
with a new alarm
o Suppression: inhibiting a low-priority alarm in the
presence of a higher-priority alarm
o Boolean: substitution of a set of alarms satisfying a
condition with a new alarm
o Root cause determination
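
Below is a minimal sketch of two of these reductions, compression and suppression, assuming each alarm is a dict with illustrative "type" and "priority" fields.

    def compress(alarms):
        # Compression: reduce multiple occurrences of the same alarm type to one.
        kept = {}
        for a in alarms:
            kept.setdefault(a["type"], a)
        return list(kept.values())

    def suppress(alarms):
        # Suppression: inhibit low-priority alarms when a higher-priority alarm exists.
        if not alarms:
            return []
        top = max(a["priority"] for a in alarms)
        return [a for a in alarms if a["priority"] == top]

    alarms = [{"type": "link-down", "priority": 5},
              {"type": "link-down", "priority": 5},
              {"type": "high-latency", "priority": 1}]
    print(suppress(compress(alarms)))   # [{'type': 'link-down', 'priority': 5}]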
Faults and Alarms
[Figure: a causal hierarchy in which fault f0 gives rise to faults f1 and f2; alarms a1–a5 are grouped into correlations c1 and c2, which are in turn combined into a higher-level correlation]
Faults and Alarms
The previous figure shows that correlation
c1 detects the fault f1 and that correlation
c2 detects the fault f2.
 Correlating c1 and c2 into the correlation
c0 allows the diagnosis of the fault f0.

Example
Let a1, a2, a3, a4, a5 be alarms generated
by client processes indicating that a client
process is not getting a response from a
server.
 Correlation techniques can be used to show
that since a1, a2, a3 were generated by
client processes trying to contact the
same server, the server may be the
problem. Similar comments apply to a4 and
a5.

Example
From the perspective of client processes,
the servers (at the second level of the
previous figure) are at fault.
 However, it may be observed that alarms
were generated by these two servers.
Both alarms indicate that each of the two
servers is not getting a response and that
both were trying to contact the same
server. This is another correlation.

Fault Diagnosis
 A major application of alarm correlation
(often called event correlation) is fault
diagnosis
 Useful in fault location

Rule-Based Reasoning
Based on expert systems
 Intended to represent heuristic knowledge
as rules.
 Components

o Knowledge Base (KB): Contains the expert rules
that describe the action to be taken when a
specific condition occurs, e.g., if-then rules
o Working Memory (WM): Stores information
such as the system/network topology and data
collected through the monitoring of application
and network components.
Rule-Based Reasoning

Components (continued)
o Inference engine: Matches the current state of
the system (as represented by the monitored
data) against the left-hand sides of the rules in
the knowledge base in order to trigger actions.
The rules are meant to encapsulate expert
knowledge.
 Why rule-based reasoning?
o Rules are interpreted, which means that rules
can be changed without recompiling.
o Since expert knowledge can be wrong and/or
incomplete, this feature is very useful.
Rule-Based Reasoning

Operation
o The WM is constantly scanned for facts (e.g., alarms) that
satisfy the left-hand side of any of the rules.
o If such a rule is found, the rule “fires”, i.e., its right-hand
side is executed.
o Execution may in turn result in new facts being inserted
into the WM.
 Example (a sketch of such a rule appears below):
o Failed-connection(Y,X) and Failed-connection(X,Z) →
faulty(Z).
 Used by commercial systems such as Tivoli (from
IBM) and HP OpenView.
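
Below is a minimal sketch of how an inference engine might fire the example rule, with the working memory represented as a set of fact tuples; the representation and names are illustrative, not taken from Tivoli or HP OpenView.

    # Working memory: a set of facts, each a tuple (predicate, arg1, arg2)
    wm = {
        ("failed-connection", "clientY", "hostX"),
        ("failed-connection", "hostX", "serverZ"),
    }

    def rule_chained_failure(facts):
        # Left-hand side: Failed-connection(Y,X) and Failed-connection(X,Z)
        # Right-hand side: assert faulty(Z)
        new_facts = set()
        for (p1, _y, x1) in facts:
            for (p2, x2, z) in facts:
                if p1 == p2 == "failed-connection" and x1 == x2:
                    new_facts.add(("faulty", z))
        return new_facts

    # One pass of the engine: fire the rule and insert the result into the WM.
    wm |= rule_chained_failure(wm)
    print(("faulty", "serverZ") in wm)   # True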
Approaches
Fault propagation
 Model traversing
 Case-based reasoning

Fault Propagation
Based on models that describe which
symptoms will be observed if a specific
fault occurs.
 Monitors typically collect management data at
network elements and detect out-of-tolerance
conditions, generating appropriate alarms.
 An event model is used by a management
application to analyze these alarms.
 The event model represents knowledge of
events and their causal relationships.

Fault Propagation (Coding Approach)
 Correlation is concerned with the analysis of causal
relations among events.
 The notation e → f denotes that the event e causes
the event f.
 Causality is a partial order between events.
 The relation → may be described by a causality
graph whose directed edges represent causality.
 Distinguish between faults (problems) and
symptoms.
 Nodes of a causality graph may be marked as
problems (P) or symptoms (S).
 Some symptoms are not directly caused by faults,
but rather by other symptoms.
Fault Propagation (Coding Approach)
 Example Causality Graph
[Figure: an example causality graph over events numbered 1–11]
Fault Propagation (Coding Approach)
 The correlation problem
o A correlation p → s means that problem p can
cause a chain of events leading to the symptom
s.
o This can be represented by a correlation graph.
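
Below is a minimal sketch of deriving such correlations from a causality graph by reachability; the adjacency dict is hypothetical, chosen so that it reproduces the example codes given later (problems 1, 2, 11 and symptoms 6, 9, 10), with node 5 as an assumed intermediate event.

    def correlations(causes, problems, symptoms):
        # For each problem p, collect every symptom reachable through a chain
        # of causal edges p -> ... -> s.
        result = {}
        for p in problems:
            seen, stack = set(), [p]
            while stack:
                for nxt in causes.get(stack.pop(), []):
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
            result[p] = sorted(seen & symptoms)
        return result

    # Hypothetical causality graph consistent with the later example codes.
    causes = {1: [5], 11: [5], 5: [9, 10], 2: [6], 6: [10]}
    print(correlations(causes, problems={1, 2, 11}, symptoms={6, 9, 10}))
    # {1: [9, 10], 2: [6, 10], 11: [9, 10]}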
Fault Propagation (Coding Approach)
 A Correlation Graph
[Figure: an example correlation graph linking problems 1, 2, and 11 to symptoms 6, 9, and 10]
Fault Propagation (Coding Approach)
 For each fault (problem) p, the correlation graph
provides a vector that summarizes which symptoms
are correlated with that problem.
 This is referred to as the code of the problem.
 Alarms may also be described using a vector
assigning a measure of 1 to observed symptoms
and 0 to unobserved symptoms.
 The alarm correlation problem is that of finding the
problems whose codes optimally match an
observed alarm vector.
Fault Propagation (Coding Approach)
 Example codes (see the correlation graph example;
vector positions correspond to symptoms 6, 9, 10)
o 1 = (0,1,1) – This indicates that problem 1 causes
symptoms 9 and 10
o 2 = (1,0,1) – This indicates that problem 2 causes
symptoms 6 and 10
o 11 = (0,1,1) – This indicates that problem 11 causes
symptoms 9 and 10.
Fault Propagation (Coding Approach)
 Example alarm vector
o Assume that alarms indicating symptoms 9 and
10 have been observed.
o a = (0,1,1)
 We can infer that either problem 1 or problem 11
matches the observation a.
 These two problems have identical codes
and hence are indistinguishable.
 The fault management application may have
to do additional tests (a matching sketch appears below).
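
Below is a minimal sketch of this decoding step: the observed alarm vector is compared against each problem code and the closest codes (by Hamming distance) are returned.

    def hamming(u, v):
        # Number of positions in which two equal-length binary vectors differ.
        return sum(a != b for a, b in zip(u, v))

    def best_matches(codes, alarm):
        # Return the problems whose codes are closest to the observed alarm vector.
        dist = {p: hamming(code, alarm) for p, code in codes.items()}
        d_min = min(dist.values())
        return [p for p, d in dist.items() if d == d_min]

    # Codes over symptoms (6, 9, 10), as in the example above.
    codes = {1: (0, 1, 1), 2: (1, 0, 1), 11: (0, 1, 1)}
    print(best_matches(codes, alarm=(0, 1, 1)))   # [1, 11] -- indistinguishable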
Fault Propagation (Coding Approach)
 A codebook is an array of the vectors just
defined.
 The number of symptoms associated with
a single problem may be very large.
o Sometimes a much smaller set of symptoms is
selected to accomplish a desired level of
distinction among problems.
Fault Propagation (Coding Approach)
 Example Codebook (rows are symptoms 1, 2, 4; columns are the problem codes)

               p1  p2  p3  p4  p5  p6
  symptom 1     1   0   0   1   0   1
  symptom 2     1   1   1   1   0   0
  symptom 4     1   0   1   0   1   0
Fault Propagation (Coding Approach)
 Example Codebook (rows are symptoms 1, 3, 4, 6, 9, 18; columns are the problem codes)

               p1  p2  p3  p4  p5  p6
  symptom 1     1   0   0   1   0   1
  symptom 3     1   1   0   1   0   0
  symptom 4     1   0   1   0   1   0
  symptom 6     1   1   1   0   0   1
  symptom 9     0   1   0   0   1   1
  symptom 18    0   1   1   1   0   0
Fault Propagation (Coding Approach)
 Distinction among problems is measured by
the Hamming distance between their
codes.
 The radius of a codebook is one half of the
minimal Hamming distance among codes.
 When the radius is 0.5, the code provides
distinction between problems.
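
Below is a minimal sketch of computing a codebook's radius; the example uses the columns of the first codebook as reconstructed above.

    from itertools import combinations

    def hamming(u, v):
        return sum(a != b for a, b in zip(u, v))

    def radius(codebook):
        # Radius = half the minimal Hamming distance between any two problem codes.
        return min(hamming(u, v) for u, v in combinations(codebook.values(), 2)) / 2

    # Columns (codes) of the first example codebook, over symptoms 1, 2, 4.
    codebook = {"p1": (1, 1, 1), "p2": (0, 1, 0), "p3": (0, 1, 1),
                "p4": (1, 1, 0), "p5": (0, 0, 1), "p6": (1, 0, 0)}
    print(radius(codebook))   # 0.5 -- just enough to distinguish the problems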
Fault Propagation (Coding Approach)
 Is this easy to apply to application
processes?
o No
 Why?
o Applications are dynamic
o The coding approach assumes the system is
fairly static.
Model Traversing
 Reconstructs fault propagation at run time using
relationships between objects
 Begins with the managed object that generated
the event
 Works best when the object relationships are
graph-like and easy to obtain, since they must be
obtained at run time
o Performance
o Potential parallelism
 Weaknesses
o Lack of flexibility
o Not as well-structured as fault propagation
Model Traversing

Characteristics
o Event-Driven: Fault management application is
passive until an event arrives. This event is the
reporting of a symptom.
o Correlation: Decides whether two events result
from the same primary fault.
o Relationship Exploration: The fault management
application correlates events by detecting
special relationships between the source
objects of those events.
Model Traversing
 Event reports should have the following
information:
o Symptom type
o Source
o Target
o etc.
 If symptom si's target is the same as sj's
source, then this is an indication that si is a
secondary symptom. This allows us to
ignore certain alarms (see the sketch below).
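
Below is a minimal sketch of this filtering rule, assuming each event report is a dict with illustrative "id", "source", and "target" fields.

    def primary_symptoms(reports):
        # A report whose target is the source of some other report is treated
        # as a secondary symptom and ignored.
        sources = {r["source"] for r in reports}
        return [r for r in reports if r["target"] not in sources]

    # s1: P1 cannot reach P4; s2: P4 in turn cannot reach a back-end service.
    reports = [{"id": "s1", "source": "P1", "target": "P4"},
               {"id": "s2", "source": "P4", "target": "backend-service"}]
    print([r["id"] for r in primary_symptoms(reports)])
    # ['s2']  (s1 is secondary and is dropped)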
Model Traversing
 For each event, construct a graph of objects
(models) related to the source object of that
event.
 When two such graphs touch each other, i.e.,
contain at least one common object, the events
which initiated their construction are regarded as
correlated. Possibly these two events are the
result of the same fault.
 If si is correlated with sj and sj is correlated with
sk, then through transitivity we can conclude that
si is a secondary symptom.
Model Traversing
 The process of eliminating symptom
reports may result in reports that have the
same target.
 Example:
o s1 and t
o s2 and t
 It might be necessary to construct
possible paths of objects between s1 and t
as well as between s2 and t.
 Nodes in common are good candidates for
the faults.
Model Traversing
 We will now discuss the building of graphs.
 The algorithm for building graphs uses
relationships between network hardware and
software components to search for the root cause
of a problem.
 Assumes that information about the relationships
between the components is available (e.g.,
through a database).
 Assumes that there are functions including these
(a sketch follows):
o getNextHop(source, target, B): Get the node
representing the next entity (the one that comes after B)
in the path between source and target. Note that this
may return more than one entity.
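
Below is a minimal sketch of building a path with such a function; get_next_hop here is a stand-in driven by a hypothetical relationship table, and the entity names anticipate the P1/P4 example that follows.

    def get_next_hop(relations, source, target, current):
        # Stand-in for getNextHop(source, target, B): entities that follow
        # `current` on the path from source to target (may be more than one).
        return relations.get((source, target, current), [])

    def build_path(relations, source, target):
        # Expand the path from source towards target one hop at a time,
        # following the first candidate for simplicity.
        path, current = [source], source
        while current != target:
            candidates = get_next_hop(relations, source, target, current)
            if not candidates:
                break            # relationship information is incomplete
            current = candidates[0]
            path.append(current)
        return path

    relations = {("P1", "P4", "P1"): ["chocolate"],
                 ("P1", "P4", "chocolate"): ["cable-chocolate-hub"],
                 ("P1", "P4", "cable-chocolate-hub"): ["hub"],
                 ("P1", "P4", "hub"): ["cable-hub-strawberry"],
                 ("P1", "P4", "cable-hub-strawberry"): ["strawberry"],
                 ("P1", "P4", "strawberry"): ["rpcd.strawberry"],
                 ("P1", "P4", "rpcd.strawberry"): ["P4"]}
    print(build_path(relations, "P1", "P4"))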
Model Traversing
Example
 Assume the following configuration of processes
and machines. All machines are connected
through the Ethernet.
o P1 is on chocolate; P2 is on peppermint
o P3 is on vanilla; P4 is on strawberry
o P5 is on doublefudge; P6 is on mintchip

 Communication is through remote procedure calls
(RPC). This basically requires that all communication
go through a daemon process on the server host's
machine. We will call this daemon rpcd.
Model Traversing
 The call structure is depicted in the following graph:
[Figure: RPC call graph among processes P1–P6; in particular, P1 and P3 both call P4]
Model Traversing
Example
 Assume that P4 terminates abnormally causing a
cascade of timeouts
 Correlation will result in focusing on these event
reports:
o (P1,P4)
o (P3,P4)

Not enough to diagnose the fault.
o It’s all at the process level.
o There are still many entities or objects to examine since
you do not want everything generating a message.
Model Traversing
Example
 Starting with P1, the next component (node)
along the path of the connection between
P1 and P4 is identified.
 Between P1 and P4 are many entities. We
will start out with a vertical search, which
basically results in the fact that P1 is
running on a host machine called chocolate.
Model Traversing
 chocolate is connected to the hub through an
Ethernet connection cable.
 The hub is connected to strawberry, where P4 is
running, through an Ethernet connection cable.
 Thus we can say that the path is the following:
o P1, chocolate, ethernet connection cable, hub,
ethernet connection cable, strawberry,
rpcd.strawberry, P4
 The path between P3 and P4 is the following:
o P3, vanilla, ethernet connection cable, hub, ethernet
connection cable, strawberry, rpcd.strawberry, P4
Model Traversing
 Example
 This suggests that we can narrow down the
problem to the entities common to both paths: hub,
ethernet connection cable, rpcd.strawberry,
strawberry, P4 (a sketch of this step follows).
 At this point, the fault management application
may want to poll for additional information. The
polling may check whether each entity is up or not.
An example is applying the ping operation to the
host machine called strawberry.
 What if every entity is up? This may indicate
that strawberry is overloaded. An indication of an
overload can be found by measuring the CPU load.
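
Below is a minimal sketch of the narrowing step: intersect the two constructed paths, keeping the order of the first; the distinct cable names are illustrative so that only the shared cable survives.

    def common_suspects(path_a, path_b):
        # Entities that appear in both paths, in the order they occur in path_a.
        in_b = set(path_b)
        return [node for node in path_a if node in in_b]

    path_p1 = ["P1", "chocolate", "cable-chocolate-hub", "hub",
               "cable-hub-strawberry", "strawberry", "rpcd.strawberry", "P4"]
    path_p3 = ["P3", "vanilla", "cable-vanilla-hub", "hub",
               "cable-hub-strawberry", "strawberry", "rpcd.strawberry", "P4"]
    print(common_suspects(path_p1, path_p3))
    # ['hub', 'cable-hub-strawberry', 'strawberry', 'rpcd.strawberry', 'P4']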
Model Traversing

Building the graphs requires structural
information and the use of rules.
Model Traversing
Implementation
 What management services are needed?
o To detect and report symptoms, one could use
application instrumentation.
o The instrumentation library should most likely
talk with a management process (or agent).
o The agent sends an event report to the event
server.
o The event server may have a set of rules for
symptom correlation.
o After correlation, a task may be invoked that
does relationship exploration and the final
diagnosis.
Model Traversing
Implementation
 Information Needed
o Information representing the relationships
between hardware components and software
components is needed.
o This needs to be stored in a database or a
directory service (e.g., X.500)
o An API needs to be defined to retrieve this
information.
o Rules can be used to help construct the graph.
Model Traversing
Implementation
 Information Needed
o How is the information collected?
o Many different techniques. Examples include:
—Processes (using instrumentation) may have to
register and have their information put into the
database.
—Network information may have to be entered
manually.
Model Traversing
 Summary
 Performs very quickly once the model is built
o The model can be constructed incrementally during
normal processing; we do not have to wait until a
failure
 Can operate in parallel
 Can accommodate multiple events;
different starting points can result in the same
problem element
 Does require a model reflective of the run-time system
o One that changes too fast is a problem
Case-based Reasoning (CBR)

Objective
o Learn from experience
o Solutions to novel problems
o Avoid extensive maintenance

Basic idea: recall, adapt and execute
episodes of former problem-solving in an
attempt to deal with a current problem
Case-based Reasoning
 Approach
[Figure: the CBR cycle — an input problem is matched against the case library; a case is retrieved, adapted, and the adapted solution is processed]
Case-Based Reasoning
 Strategy
 Useful for domains in which a body of
knowledge with a case structure exists or
is easily obtainable
 Case structure:
o Set of fields or “slots”
o Captures “essential” information
 Yield discriminators
o Set of fields highly correlated with problems or
solutions
 Need to find the “closest” match (see the retrieval
sketch below)
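
Below is a minimal sketch of retrieving the “closest” case, assuming cases are dicts of slot values and similarity is simply the number of matching slots; the slot names and cases are illustrative.

    def similarity(case, problem):
        # Number of slots of the new problem that match the stored case.
        return sum(1 for slot, value in problem.items() if case.get(slot) == value)

    def retrieve(case_library, problem):
        # Return the stored case that most closely matches the new problem.
        return max(case_library, key=lambda case: similarity(case, problem))

    case_library = [
        {"symptom": "rpc timeout", "server": "up",   "cpu_load": "high", "fix": "rebalance load"},
        {"symptom": "rpc timeout", "server": "down", "cpu_load": "n/a",  "fix": "restart server"},
    ]
    problem = {"symptom": "rpc timeout", "server": "up", "cpu_load": "high"}
    print(retrieve(case_library, problem)["fix"])   # rebalance load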
Case-Based Reasoning
 Adapt
[Figure: the CBR cycle with the Adapt step expanded — adaptation may be user-based or driven by discriminators and other adaptation techniques]
Case-based Reasoning
Summary
 Needs well-defined cases
 Likely to work well when problems are
“close” to existing solutions
 Problems arise in selecting solutions when the match
is “not so close”
o Dangerous to follow the suggested actions?
o How to adapt?
Summary
Variety of approaches
 Mostly applied in network management
scenarios

o More controlled?
o Better understanding of problems?

Limited experience in application
management