CHAPTER 17
Understanding Fault Management
Cisco Prime Collaboration Manager 1.0 User Guide, OL-23630-01

A common problem in the management of enterprise networks is that a single fault manifests itself as multiple alarms in the network management system (NMS). This makes manual analysis of faults costly and diverts attention from major problems.

Many types of problems threaten service delivery; for example, hardware failures and software failures. Cisco Prime Collaboration Manager (Cisco Prime CM) detects faults quickly and accurately, in near real time. After identifying an event, Cisco Prime CM groups related events and performs fault analysis to determine the root cause of the fault in a video collaboration network.

Figure 17-1 shows the flow diagram for the Cisco Prime CM fault management.

Figure 17-1 Fault Management in Cisco Prime CM
[Flow diagram. Labels recovered from the diagram: Periodic polling; Notification; Cisco Prime CM detects fault events; Receives supported events; Drops unsupported events; Compares with catalogs; Creates, updates, or clears alarms; Checks for e-mail notification configuration; Notification sent to Cisco Prime CM users; Events and alarms persisted; Alarm and Event browser; Checks for purging policy; Purges cleared alarms and events.]

Event

An event is a distinct incident that occurs at a specific point in time. Examples of events include:
• Port status change.
• Connectivity loss (for example, BGP Neighbor Loss) between routing protocol processes on peer routers.
• Device reset.
• Device becomes unreachable by the management station.

An event is a:
• Possible symptom of a fault, that is, an error, failure, or exceptional condition in the network. For example, when a device becomes unreachable, an unreachable event is triggered.
• Possible symptom of a fault clearing. For example, when a device state changes from unreachable to reachable, a reachable event is triggered.
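The idea that an event marks a change of state, either a fault appearing or a fault clearing, can be illustrated with a minimal sketch. Everything here is hypothetical: the Event record, the ReachabilityMonitor class, and the device name are invented for illustration and do not describe how Cisco Prime CM is implemented. The sketch also assumes that a device counts as reachable until it is observed otherwise.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical event record; not the Cisco Prime CM event object."""
    source: str
    name: str      # "unreachable" or "reachable"
    clears: bool   # True when the event is a symptom of a fault clearing


class ReachabilityMonitor:
    """Tracks the last known state per device and emits events on change."""

    def __init__(self) -> None:
        self._reachable: dict[str, bool] = {}

    def observe(self, device: str, reachable: bool) -> Optional[Event]:
        # Assumption: a device is treated as reachable until observed otherwise.
        previous = self._reachable.get(device, True)
        self._reachable[device] = reachable
        if reachable == previous:
            return None  # no state change, so no new event
        if reachable:
            return Event(device, "reachable", clears=True)
        return Event(device, "unreachable", clears=False)


monitor = ReachabilityMonitor()
observations = [False, False, True, True, False]
events = [monitor.observe("device-A", state) for state in observations]
names = [e.name if e else None for e in events]
print(names)  # ['unreachable', None, 'reachable', None, 'unreachable']
```

Only the transitions produce events; repeated observations of the same state return None.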

Alarms

The life cycle of a fault scenario is called an alarm. An alarm is characterized by a sequence of related events, such as port-down and port-up (see Figure 17-2).

Figure 17-2 Alarm Sequence
[Diagram showing events A and B; label: Link down.]

In the above figure, event A is followed by event B as part of an event sequence. In a sequence of events, the event with the highest severity determines the severity of the alarm.

Event Creation

Cisco Prime CM maintains an event catalog and decides how and when an event has to be created and whether to associate an alarm with an event. Multiple events can be associated with the same alarm.

Cisco Prime CM discovers events in the following ways:
• By receiving notification events and analyzing them; for example, syslog messages and traps.
• By automatically polling devices and discovering changes; for example, device unreachable.
• By receiving events when a significant change occurs in the Cisco Prime CM server; for example, rebooting the server.
• By receiving events when the status of the alarm is changed; for example, when the user acknowledges or clears an alarm.

Incoming event notifications (traps and syslog messages) are identified by matching the event data to predefined patterns. A trap or syslog message is considered supported by Cisco Prime CM if it matches a predefined pattern and can be properly identified. If the event data does not match any predefined pattern, the event is considered unsupported and is dropped.

A fault may be reported to Cisco Prime CM through polling, traps, or syslog messages. Cisco Prime CM maintains the context of all faults and ensures that duplicate events or alarms are not maintained in the Cisco Prime CM database. For example:

Time                      Event                                               Cisco Prime CM Behavior
10:00AM PDT June 7, 2010  Device A becomes unreachable.                       Creates a new unreachable event on device A.
10:30AM PDT June 7, 2010  Device A continues to be in the unreachable state.  No change in the event status.
10:45AM PDT June 7, 2010  Device A becomes reachable.                         Creates a new reachable event on device A.
11:00AM PDT June 7, 2010  Device A stays reachable.                           No change in the event status.
12:00PM PDT June 7, 2010  Device A becomes unreachable.                       Creates a new unreachable event on device A.

Alarm Creation

An alarm represents the life cycle of a fault in a network. It defines the root cause of a fault. Multiple events can be associated with a single alarm.

An alarm is created in the following sequence:
1. A notification is triggered when a fault occurs in the network.
2. An event is created, based on the notification.
3. An alarm is created if no active alarm already corresponds to this event.

An alarm is associated with two types of events:
• Active events: Events that have not been cleared. An alarm remains in this state until the fault is resolved in a network.
• Historical events: Events that have been cleared. An event changes its state to a historical event when the fault is resolved in a network.

Clearing an alarm marks the end of its life cycle. A cleared alarm can be revived if the same fault reoccurs within a preset period of time. The preset period is 5 minutes in Cisco Prime CM.

Event and Alarm Association

Cisco Prime CM maintains a catalog of events and alarms. The catalog contains the list of events managed by Cisco Prime CM, and the relationship among the events and alarms. Events of different types can be attached to the same alarm type.

When a notification is received:
1. Cisco Prime CM compares the incoming notification against the event and alarm catalog.
2. Cisco Prime CM decides whether an event has to be raised.
3. If an event is raised, Cisco Prime CM decides whether the event triggers a new alarm or is associated with an existing alarm.

A new event is associated with an existing alarm if the new event is of the same type and occurs on the same source.
For example, if an interface error alarm is active, all interface error events that occur on the same interface are associated with that alarm.

Alarm Status

The following are the supported statuses for an alarm:
• New: When an event triggers a new alarm or an event is associated with an existing alarm. When you unacknowledge an alarm, the status changes from Acknowledged to New.
• Acknowledged: When you acknowledge an alarm, the status changes from New to Acknowledged.
• Cleared: An alarm can be cleared in the following ways:
  – Auto-clear from the device: The fault is resolved on the device and a corresponding event is triggered. For example, a device-reachable event clears the device-unreachable event. This, in turn, clears the device-unreachable alarm.
  – Manual-clear from Cisco Prime CM users: You can manually clear an active alarm without resolving the fault in the network. A clearing event is triggered and this event clears the alarm. If the fault continues to exist in the network, a new event and alarm are subsequently created, based on the event notification (traps or syslog messages).
  – Auto-clear from the Cisco Prime CM server: Cisco Prime CM clears all session-related alarms when the session ends. If there are no updates to an active alarm for 24 hours, Cisco Prime CM automatically clears the alarm.

Event and Alarm Severity

Each event has an assigned severity. Events fall broadly into the following severity categories, each with its associated color in Cisco Prime CM:
• Flagging: Indicates a fault: Critical (red), Major (orange), Minor (yellow), or Warning (sky blue).
• Informational: Info (blue). Some of the Informational events clear the flagging events.

For example, a Link Down event might be assigned a Critical severity, while its corresponding Link Up event has an Informational severity.
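The association rule and the severity categories described above can be sketched together. This is only an illustration under stated assumptions: the Severity ranks, the Alarm and AlarmTable classes, the raise_event method, and the interface names are all hypothetical, and the numeric ordering of severities is assumed from the order in which the guide lists them.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Severity(IntEnum):
    # Assumed ordering; the guide names the levels but gives no numeric ranks.
    INFO = 0
    WARNING = 1
    MINOR = 2
    MAJOR = 3
    CRITICAL = 4


@dataclass
class Alarm:
    alarm_type: str
    source: str
    events: list = field(default_factory=list)  # (description, Severity) pairs

    @property
    def severity(self) -> Severity:
        # The highest-severity event in the sequence sets the alarm severity.
        return max(sev for _, sev in self.events)


class AlarmTable:
    """Hypothetical in-memory store of active alarms."""

    def __init__(self) -> None:
        self._active: dict[tuple[str, str], Alarm] = {}

    def raise_event(self, alarm_type: str, source: str,
                    description: str, severity: Severity) -> Alarm:
        # Same type on the same source: attach to the existing alarm.
        key = (alarm_type, source)
        alarm = self._active.get(key)
        if alarm is None:
            alarm = Alarm(alarm_type, source)
            self._active[key] = alarm
        alarm.events.append((description, severity))
        return alarm


table = AlarmTable()
first = table.raise_event("Interface Error", "device-A:if-1",
                          "input errors", Severity.MINOR)
second = table.raise_event("Interface Error", "device-A:if-1",
                           "CRC errors", Severity.MAJOR)
other = table.raise_event("Interface Error", "device-A:if-2",
                          "input errors", Severity.WARNING)
print(first is second, first.severity.name, other.severity.name)
# True MAJOR WARNING
```

Keying active alarms on the pair of alarm type and source keeps events from a different interface in a separate alarm, which mirrors the interface error example.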
In a sequence of events, the event with the highest severity determines the severity of the alarm.

Event and Alarm Persistence

All events and alarms, including active and cleared, are persisted in the Cisco Prime CM database. The relationships between the events are retained. You can review the content of the database using the Alarm and Event Browser pages.

Note: Events are stored in the form of the Cisco Prime CM event object. The original notification structure of incoming event notifications (trap or syslog) is not maintained.

See Performing Backup and Restore, page 7-1, to understand the Cisco Prime CM purge policy.

Alarm Notification

Cisco Prime CM allows you to subscribe to receive notifications for critical, major, and minor alarms. You can configure Cisco Prime CM to send notifications using the Administration > System Configuration tab.

A notification includes the following details:

Device Details
• Device Name: Hostname of the managed device.
• Device IP Address: IP address of the managed device.
• Device Software: Software version running on the device.

Alarm Details
• Alarm Time Stamp: Time when the alarm was triggered. The Cisco Prime CM client time zone is used.
• Alarm Type Name: Type of the alarm; for example, Call Quality - Packet Loss, Device Access Error, Interface Error.
• Alarm Severity: Severity of the alarm; for example, MAJOR, MINOR.
• Alarm Device Category: Type of device; for example, CTS, CTMS.
• Alarm Description: Description of the alarm; for example, Call quality alarm, packet loss above threshold.
• Event Description: Description of the event; for example, audio rx packet loss on primary Codec stream is 1.11% (> 1.00%).
• Alarm Acknowledged: Whether the alarm was acknowledged.

Session Details
• Device in Session: Whether the device is active in a session.
• Session Status: Status of the session.
• Session Type: Whether the session is a point-to-point or multipoint session.
• Phone Number Dialed: Dialed endpoint phone number.
• Session Duration: Duration of the session.

The session details are not applicable if you receive an endpoint alarm notification.

Cisco Prime Collaboration Manager Server Name: Hostname of the Cisco Prime CM server.
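
As a rough illustration of how the notification contents described above fit together, the following sketch assembles a notification from device, alarm, and optional session details. The build_notification function, the dictionary keys, the device name, the IP address, and the server name are all hypothetical and are not part of Cisco Prime CM; only the severity names and the example alarm type come from this chapter.

```python
from typing import Optional

# Severities that the chapter says a user can subscribe to.
NOTIFIABLE = {"CRITICAL", "MAJOR", "MINOR"}


def build_notification(device: dict, alarm: dict,
                       session: Optional[dict],
                       server_name: str) -> Optional[dict]:
    """Return a notification payload, or None if the severity is not notifiable."""
    if alarm["severity"].upper() not in NOTIFIABLE:
        return None
    payload = {
        "device": device,
        "alarm": alarm,
        "server_name": server_name,
    }
    # Session details do not apply to endpoint alarms, so they are included
    # only when the caller supplies them.
    if session is not None:
        payload["session"] = session
    return payload


device = {"name": "cts-room-1", "ip": "192.0.2.10", "software": "unknown"}
alarm = {"type": "Call Quality - Packet Loss", "severity": "MAJOR",
         "acknowledged": False}

endpoint_note = build_notification(device, alarm, None, "primecm-server")
info_note = build_notification(device, {**alarm, "severity": "INFO"},
                               None, "primecm-server")
print("session" in endpoint_note, info_note)  # False None
```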