Understanding Fault Management

Chapter 17
A common problem in managing enterprise networks is that a single fault manifests itself
as multiple alarms in the network management system (NMS). This makes manual fault analysis
costly and diverts attention from major problems. Many types of problems threaten service delivery;
for example, hardware failures and software failures.
Cisco Prime Collaboration Manager (Cisco Prime CM) provides fast and accurate fault detection in
near real time. After identifying an event, Cisco Prime CM groups related events and performs fault
analysis to determine the root cause of the fault in a video collaboration network.
Figure 17-1 shows the flow diagram for the Cisco Prime CM fault management.
Figure 17-1 Fault Management in Cisco Prime CM
[Flow diagram: Cisco Prime CM detects faults through periodic polling and notifications. It receives
supported events and drops unsupported events, compares the events with its catalogs, and creates,
updates, or clears alarms. It then checks the e-mail notification configuration and sends
notifications to Cisco Prime CM users. Events and alarms are persisted and are visible in the Alarm
and Event browser. Cisco Prime CM checks the purging policy and purges cleared alarms and events.]
Cisco Prime Collaboration Manager 1.0 User Guide
OL-23630-01
Event
An event is a distinct incident that occurs at a specific point in time. Examples of events include:
• Port status change.
• Connectivity loss (for example, BGP Neighbor Loss) between routing protocol processes on peer
routers.
• Device reset.
• Device becomes unreachable by the management station.
An event is a:
• Possible symptom of a fault that is an error, failure, or exceptional condition in the network. For
example, when a device becomes unreachable, an unreachable event is triggered.
• Possible symptom of a fault clearing. For example, when a device state changes from unreachable
to reachable, a reachable event is triggered.
Alarms
The life cycle of a fault scenario is called an alarm. An alarm is characterized by a sequence of related
events, such as port-down and port-up (see Figure 17-2).
Figure 17-2 Alarm Sequence
[Diagram: event A followed by event B in a Link down alarm sequence.]
In Figure 17-2, event A is followed by event B as part of an event sequence. In a sequence of events,
the event with the highest severity determines the severity of the alarm.
Event Creation
Cisco Prime CM maintains an event catalog and decides how and when an event is created and
whether to associate an alarm with an event. Multiple events can be associated with the same alarm.
Cisco Prime CM discovers events in the following ways:
• By receiving notification events and analyzing them; for example, syslog messages and traps.
• By automatically polling devices and discovering changes; for example, device unreachable.
• By receiving events when a significant change occurs in the Cisco Prime CM server; for example,
rebooting the server.
• By receiving events when the status of an alarm changes; for example, when the user
acknowledges or clears an alarm.
Incoming event notifications (traps and syslog messages) are identified by matching the event data
against predefined patterns. Cisco Prime CM supports a trap or syslog message if it matches a
pattern and can be properly identified. If the event data does not match any predefined pattern, the
event is considered unsupported and is dropped.
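The matching step described above can be sketched as follows. This is a minimal illustration, not the Cisco Prime CM implementation: the pattern table, the event type names, and the identify function are all hypothetical, and the two syslog patterns are examples only.

```python
import re
from typing import Optional

# Hypothetical catalog: event type -> pattern that identifies it.
# Illustrative only; this is not the product's actual event catalog.
EVENT_PATTERNS = {
    "LINK_DOWN": re.compile(
        r"%LINK-3-UPDOWN: Interface (?P<source>\S+), changed state to down"
    ),
    "LINK_UP": re.compile(
        r"%LINK-3-UPDOWN: Interface (?P<source>\S+), changed state to up"
    ),
}


def identify(message: str) -> Optional[dict]:
    """Return the identified event, or None if the message is unsupported."""
    for event_type, pattern in EVENT_PATTERNS.items():
        match = pattern.search(message)
        if match:
            return {"type": event_type, "source": match.group("source")}
    return None  # no pattern matched: the caller drops the notification


supported = identify("%LINK-3-UPDOWN: Interface Gi0/1, changed state to down")
dropped = identify("%SYS-5-CONFIG_I: Configured from console")
```

Here the first message matches a pattern and yields an event on source Gi0/1, while the second matches nothing and is dropped.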
Cisco Prime CM can be notified of a fault through polling, traps, or syslog messages. Cisco Prime CM
maintains the context of all faults and ensures that duplicate events or alarms are not stored in the
Cisco Prime CM database.
The following example shows how Cisco Prime CM creates events only when the state of a device changes:

Time                      Event                           Cisco Prime CM Behavior
10:00AM PDT June 7, 2010  Device A becomes unreachable.   Creates a new unreachable event on device A.
10:30AM PDT June 7, 2010  Device A continues to be in     No change in the event status.
                          the unreachable state.
10:45AM PDT June 7, 2010  Device A becomes reachable.     Creates a new reachable event on device A.
11:00AM PDT June 7, 2010  Device A stays reachable.       No change in the event status.
12:00PM PDT June 7, 2010  Device A becomes unreachable.   Creates a new unreachable event on device A.
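The behavior in this example amounts to creating an event only on a state change. The following sketch illustrates the idea under stated assumptions: the ReachabilityTracker class is hypothetical, and it assumes that a device is treated as reachable before its first poll.

```python
from typing import Optional


class ReachabilityTracker:
    """Hypothetical helper: emits an event only when a device changes state."""

    def __init__(self) -> None:
        self._last_state = {}  # device name -> "reachable" or "unreachable"

    def observe(self, device: str, reachable: bool) -> Optional[str]:
        state = "reachable" if reachable else "unreachable"
        # Assumption: a device with no history is treated as reachable.
        if self._last_state.get(device, "reachable") == state:
            return None  # same state as before: no new event
        self._last_state[device] = state
        return f"{state} event on {device}"


tracker = ReachabilityTracker()
# Poll results for device A at 10:00, 10:30, 10:45, 11:00, and 12:00.
polls = [False, False, True, True, False]
results = [tracker.observe("device A", reachable) for reachable in polls]
```

With these five poll results, only the first, third, and fifth polls produce an event; the other two return None because the state did not change.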
Alarm Creation
An alarm represents the life cycle of a fault in a network. It identifies the root cause of a fault.
Multiple events can be associated with a single alarm.
An alarm is created in the following sequence:
1. A notification is triggered when a fault occurs in the network.
2. An event is created, based on the notification.
3. An alarm is created if no active alarm corresponds to this event.
An alarm is associated with two types of events:
• Active events: Events that have not been cleared. An alarm remains in this state until the fault is
resolved in the network.
• Historical events: Events that have been cleared. An event changes its state to a historical event
when the fault is resolved in the network.
Clearing an alarm marks the end of the alarm life cycle. A cleared alarm can be revived if
the same fault reoccurs within a preset period of time. The preset period is 5 minutes in
Cisco Prime CM.
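The revive rule can be expressed as a simple time comparison. The sketch below is illustrative only: the function name is hypothetical, and treating a fault that reoccurs exactly 5 minutes after clearing as a revival is an assumption, because the text does not say whether the boundary is inclusive.

```python
from datetime import datetime, timedelta

REVIVE_WINDOW = timedelta(minutes=5)  # preset period stated in the text


def on_fault_reoccurs(cleared_at: datetime, reoccurred_at: datetime) -> str:
    """Decide whether to revive the cleared alarm or create a new one."""
    # Assumption: the 5-minute boundary itself counts as inside the window.
    if reoccurred_at - cleared_at <= REVIVE_WINDOW:
        return "revive"
    return "create"


cleared = datetime(2010, 6, 7, 10, 45)
soon = on_fault_reoccurs(cleared, cleared + timedelta(minutes=3))
late = on_fault_reoccurs(cleared, cleared + timedelta(minutes=10))
```

A fault that returns 3 minutes after the alarm was cleared revives it; one that returns after 10 minutes starts a new alarm.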
Event and Alarm Association
Cisco Prime CM maintains a catalog of events and alarms. The catalog contains the list of events
managed by Cisco Prime CM, and the relationship among the events and alarms. Events of different
types can be attached to the same alarm type.
When a notification is received:
1. Cisco Prime CM compares the incoming notification against the event and alarm catalog.
2. Cisco Prime CM decides whether an event has to be raised.
3. If an event is raised, Cisco Prime CM decides whether the event triggers a new alarm or is
associated with an existing alarm.
A new event is associated with an existing alarm if the new event is of the same type and occurs
on the same source.
For example, when an interface error alarm is active, all interface error events that occur on the
same interface are associated with that alarm.
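The association rule can be pictured as a lookup keyed by alarm type and source. This is a minimal sketch, assuming that active alarms are held in a dictionary; the Alarm and AlarmTable names are hypothetical and do not describe the product's data model.

```python
from dataclasses import dataclass, field


@dataclass
class Alarm:
    alarm_type: str
    source: str
    events: list = field(default_factory=list)


class AlarmTable:
    """Hypothetical store of active alarms, keyed by (alarm type, source)."""

    def __init__(self) -> None:
        self._active = {}

    def add_event(self, alarm_type: str, source: str, description: str) -> Alarm:
        key = (alarm_type, source)
        alarm = self._active.get(key)
        if alarm is None:
            # No active alarm of this type on this source: create one.
            alarm = Alarm(alarm_type, source)
            self._active[key] = alarm
        alarm.events.append(description)  # otherwise attach to the existing alarm
        return alarm


table = AlarmTable()
first = table.add_event("Interface Error", "Gi0/1", "CRC errors")
second = table.add_event("Interface Error", "Gi0/1", "input errors")
other = table.add_event("Interface Error", "Gi0/2", "CRC errors")
```

The two events on Gi0/1 share one alarm, while the event on Gi0/2 raises a separate alarm.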
Alarm Status
An alarm can have the following statuses:
• New—When an event triggers a new alarm or an event is associated with an existing alarm.
When you unacknowledge an alarm, the status changes from Acknowledged to New.
• Acknowledged—When you acknowledge an alarm, the status changes from New to Acknowledged.
• Cleared—An alarm can be cleared in the following ways:
– Auto-clear from the device—The fault is resolved on the device and a corresponding event is
triggered. For example, a device-reachable event clears the device-unreachable event. This, in
turn, clears the device-unreachable alarm.
– Manual-clear from Cisco Prime CM users: You can manually clear an active alarm without
resolving the fault in the network. A clearing event is triggered and this event clears the alarm.
If the fault continues to exist in the network, a new event and alarm are subsequently created,
based on the event notification (traps or syslog messages).
– Auto-clear from the Cisco Prime CM server—Cisco Prime CM clears all session-related
alarms when the session ends.
If an active alarm is not updated for 24 hours, Cisco Prime CM automatically clears the
alarm.
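The status changes described above can be sketched as a small transition table plus a staleness check. This is an illustration under stated assumptions: the action names, the TRANSITIONS table, and both functions are hypothetical, and the sketch assumes that an alarm counts as stale once exactly 24 hours have passed since its last update.

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(hours=24)  # auto-clear period stated in the text

# Hypothetical transition table: (current status, action) -> new status.
TRANSITIONS = {
    ("New", "acknowledge"): "Acknowledged",
    ("Acknowledged", "unacknowledge"): "New",
    ("New", "clear"): "Cleared",
    ("Acknowledged", "clear"): "Cleared",
}


def apply_action(status: str, action: str) -> str:
    """Return the new status; unknown combinations leave the status as is."""
    return TRANSITIONS.get((status, action), status)


def auto_clear_if_stale(status: str, last_update: datetime, now: datetime) -> str:
    """Clear an active alarm that has had no updates for 24 hours."""
    if status != "Cleared" and now - last_update >= STALE_AFTER:
        return "Cleared"
    return status


start = datetime(2010, 6, 7, 10, 0)
acked = apply_action("New", "acknowledge")
stale = auto_clear_if_stale(acked, start, start + timedelta(hours=25))
fresh = auto_clear_if_stale(acked, start, start + timedelta(hours=2))
```

An acknowledged alarm with no updates for 25 hours is cleared automatically, while one updated 2 hours ago keeps its status.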
Event and Alarm Severity
Each event has an assigned severity. Events fall broadly into the following severity categories, each
with its associated color in Cisco Prime CM:
• Flagging—Indicates a fault: Critical (red), Major (orange), Minor (yellow), or Warning (sky blue).
• Informational—Info (blue). Some Informational events clear flagging events.
For example, a Link Down event might be assigned a Critical severity, while its corresponding Link Up
event has an Informational severity.
In a sequence of events, the event with the highest severity determines the severity of the alarm.
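The severity rule is a maximum over the events in the sequence. In the sketch below, the Severity enumeration and its numeric ranks are assumptions made for illustration; the text names the severity levels but does not assign numeric values to them.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Assumed ranking, lowest to highest; the numeric values are illustrative.
    INFO = 0
    WARNING = 1
    MINOR = 2
    MAJOR = 3
    CRITICAL = 4


def alarm_severity(event_severities):
    """The alarm takes the highest severity among its events."""
    return max(event_severities)


# A Critical Link Down event followed by an Informational Link Up event.
result = alarm_severity([Severity.CRITICAL, Severity.INFO])
```

For this Link Down and Link Up sequence, the alarm severity is Critical.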
Event and Alarm Persistence
All events and alarms, both active and cleared, are persisted in the Cisco Prime CM database.
The relationships between the events are retained. You can review the content of the database using
the Alarm and Event Browser pages.
Note: Events are stored in the form of the Cisco Prime CM event object. The original notification
structure of incoming event notifications (trap or syslog) is not maintained.
See Performing Backup and Restore, page 7-1, to understand the Cisco Prime CM purge policy.
Alarm Notification
Cisco Prime CM allows you to subscribe to notifications for critical, major, and minor alarms.
You can configure Cisco Prime CM to send notifications using the Administration > System
Configuration tab.
A notification includes the following details:
Device Details
• Device Name—Hostname of the managed device.
• Device IP Address—IP address of the managed device.
• Device Software—Software version running on the device.
Alarm Details
• Alarm Time Stamp—Time when the alarm was triggered. The Cisco Prime CM client time zone is
used.
• Alarm Type Name—Type of the alarm; for example, Call Quality - Packet Loss, Device Access
Error, Interface Error.
• Alarm Severity—Severity of the alarm; for example, MAJOR, MINOR.
• Alarm Device Category—Type of device; for example, CTS, CTMS.
• Alarm Description—Description of the alarm; for example, Call quality alarm, packet loss above
threshold.
• Event Description—Description of the event; for example, audio rx packet loss on primary Codec
stream is 1.11% (> 1.00%).
• Alarm Acknowledged—Whether the alarm was acknowledged.
Session Details
• Device in Session—Whether the device is active in a session.
• Session Status—Status of the session.
• Session Type—Whether the session is a point-to-point or multipoint session.
• Phone Number Dialed—Phone number of the dialed endpoint.
• Session Duration—Duration of the session.
The session details are not applicable if you receive an endpoint alarm notification.
Cisco Prime Collaboration Manager Server Name: Hostname of the Cisco Prime CM server.
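The subscription rule for alarm notifications (critical, major, and minor alarms only) can be sketched as a simple filter. The should_notify function and the way subscriptions are represented here are hypothetical; the actual settings are made on the Administration > System Configuration tab.

```python
# Severities for which the text says notifications can be subscribed.
NOTIFIABLE = {"CRITICAL", "MAJOR", "MINOR"}


def should_notify(alarm_severity: str, subscribed: set) -> bool:
    """Hypothetical check: notify only for subscribed, notifiable severities."""
    severity = alarm_severity.upper()
    return severity in NOTIFIABLE and severity in {s.upper() for s in subscribed}


subscriptions = {"Critical", "Major"}
notify_major = should_notify("MAJOR", subscriptions)
notify_minor = should_notify("MINOR", subscriptions)
notify_warning = should_notify("WARNING", {"Warning"})
```

A user subscribed to critical and major alarms is notified of a MAJOR alarm but not of a MINOR one, and a warning alarm never produces a notification in this sketch.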