Monitoring and Alerting in Cisco MDS Fabric
White Paper
Cisco Public

© 2016 Cisco and/or its affiliates. All rights reserved.

Contents
About This Document
Supported Hardware and Software
Introduction
Cisco Prime Data Center Network Manager
Remote Monitoring
  Overview
  RMON Alarms
  RMON Events
  Configuring RMON
  Guidelines and Limitations
  Enabling RMON
  Setting RMON Alarms
  Verifying RMON
  Using RMON to Monitor MIBs
Port Monitor
  Overview
  Comparison of PMON and RMON
  Architectural Components
  Configuring PMON
  Verifying PMON
  Configuring PMON Alerting Mechanisms
  Configuring Alerting Based on Poll Interval
  Configuring Alerting Based on Check Interval
  PMON Counters
Cisco Embedded Event Manager
  Overview
  Defining Cisco Embedded Event Manager Policy
  Defining Events and Actions
  Supervisor Upgrade and Downgrade Support
For More Information
Appendix A: Remote Monitor Configuration
Appendix B: Port Monitor Configuration
  Normal Template PMON Configuration
  Aggressive Template PMON Configuration
  Most Aggressive Template PMON Configuration
Appendix C: Cisco Embedded Event Manager Configuration

About This Document
This document describes the mechanisms available for monitoring a Cisco® MDS fabric and checking SAN fabric health. It discusses important hardware and software counters present in Cisco NX-OS Software and the values that are built into 8- and 16-Gbps Cisco MDS 9000 Family switches. It describes the counters, the techniques employed to monitor them, the monitoring capabilities available, and the configurable threshold limits for the counters.
It also provides in-depth discussions of the three main tools available to monitor Cisco MDS fabric: Remote Monitor (RMON), Port Monitor (PMON), and Cisco Embedded Event Manager (EEM). The document discusses how to configure each tool and when to use each one. SAN administrators can use the policies and threshold recommendations provided in this document as guidelines for monitoring their SAN fabrics for overall health and for setting alerts so that they are informed if a particular counter or setting exceeds its preset threshold value.

The appendixes at the end of this document provide sample configurations and command-line interface (CLI) commands that can be used to deploy the various monitoring techniques on Cisco MDS 9000 Family switches.

Supported Hardware and Software
This document supports the hardware platforms and NX-OS software releases listed in Table 1. The features discussed in this document may be available on other platforms or NX-OS releases as well, but no attempt has been made to cover them in this document. This document is based on NX-OS Release 6.2(13) and Cisco Prime™ Data Center Network Manager (DCNM) Release 7.2(1).

Table 1. Supported Hardware and Cisco NX-OS Software Releases
Cisco MDS 9000 Family Platforms | Model
Cisco MDS 9000 Family 16-Gbps platforms | Cisco MDS 9700 Series Multilayer Directors with DS-X9448-768K9 line card
 | Cisco MDS 9396S 16G Multilayer Fabric Switch
 | Cisco MDS 9148S 16G Multilayer Fabric Switch
 | Cisco MDS 9250i Multilayer Fabric Switch
Cisco MDS 9000 Family advanced 8-Gbps platforms | Cisco MDS 9500 Series Multilayer Directors with DS-X92xx-256K9 line cards
Cisco MDS 9000 Family 8-Gbps platforms | Cisco MDS 9500 Series Multilayer Directors with DS-X9248-48K9 and DS-X92xx-96K9 line cards
 | MDS 9148 Multilayer Fabric Switch
Introduction
In today's SAN deployments, finding the correlation between a network event and its root cause can be difficult. Without proactively monitoring the SAN fabric and being informed about events as they occur, organizations have difficulty preventing incidents in the network. Incidents may be simple, such as the flapping of one member link of a port channel interface on a Cisco MDS 9000 Family switch, or complex, such as application performance degradation due to a slow-performing device or host bus adapter (HBA) somewhere in the network.

Monitoring fabricwide events, various port parameters, Small Form-Factor Pluggable (SFP) transceiver settings, system CPU and memory values, and other environmental parameters permits early detection and isolation of such faults. It also enables the SAN administrator to measure performance on the SAN fabric network. Benefits include:
• Improved high availability of the network through reduced downtime
• Effective root-cause analyses of a network event
• Alerts that are delivered proactively, rather than reactively, from counters and events that are monitored
• Reduced troubleshooting time through notifications from alerts and traps

Cisco MDS 9000 Family switches track a variety of SAN fabric metrics, counters, and events in real time. These counters are made available by the MDS platform for use by native NX-OS applications such as PMON, RMON, and EEM. A few of these counters are exposed in their raw Simple Network Management Protocol (SNMP) object identifier (OID) format for monitoring or are made available through various system policies. SAN administrators can use existing templates and policy scripts (for example, the default PMON policy) or write custom policy scripts that can be applied to their SAN fabric switches to help improve monitoring, alerting, and troubleshooting for their SAN infrastructure.
These system policies, scripts, counters, and tools can be extended to form a monitoring and alerting framework that SAN administrators can develop and deploy on their fabric network from day one. Continuous monitoring of fabricwide events, ports, system resources, and environmental parameters enables early fault detection and helps keep the network healthy.

Cisco Prime Data Center Network Manager
Cisco Prime DCNM is a data center network monitoring tool that monitors events occurring across the network on multiple switches. It can manage both SAN and LAN networks, and it can correlate events across multiple switches and proactively alert users of these fabricwide events. SAN administrators should deploy and use the network manager to gain fabricwide visibility, perform diagnosis and troubleshooting at the fabric level, analyze trending statistics, generate reports, and perform forecasting for the network, among other tasks. Administrators can also use the network manager to poll a specific management information base (MIB) or OID and plot the values in a graph over an extended period of time. With Cisco Prime DCNM Release 7.2(1) and later, administrators can choose to deliver only network events of interest and suppress unwanted events and alerts.

Remote Monitoring

Overview
Remote Monitoring, or RMON, is an IETF standard SNMP monitoring specification that allows various network agents and console systems to exchange network monitoring data. NX-OS supports RMON alarms, events, and logs to monitor NX-OS devices. You can use the RMON alarms and events to monitor Cisco MDS 9000 Family switches running NX-OS. All switches in the Cisco MDS 9000 Family support the RMON functions discussed in this section (and defined in RFC 2819).

RMON Alarms
Each alarm monitors a specific MIB object for a specified interval.
When the MIB object value exceeds a specified value (rising threshold), the alarm condition is set and only one event is triggered regardless of how long the condition exists. When the MIB object value falls below a certain value (falling threshold), the alarm condition is cleared. Clearing allows the alarm to be triggered again the next time the rising threshold is crossed.

When you create an alarm, you specify the following parameters:
• MIB object to monitor
• Sampling interval: The interval that NX-OS uses to collect a sample value of the MIB object
• Sample type: Absolute or delta
  -- Absolute samples take the current snapshot of the MIB object value.
  -- Delta samples take two consecutive samples and calculate the difference between them.
• Rising threshold: The value at which NX-OS triggers a rising alarm or resets a falling alarm
• Falling threshold: The value at which NX-OS triggers a falling alarm or resets a rising alarm
• Event: The action that NX-OS takes when an alarm (rising or falling) is triggered

RMON Events
You can associate a particular event with each RMON alarm. RMON supports the following event types:
• SNMP notification: Sends an SNMP risingAlarm or fallingAlarm notification when the associated alarm is triggered
• Log: Adds an entry in the RMON log table when the associated alarm is triggered
• Both: Sends an SNMP notification and adds an entry in the RMON log table when the associated alarm is triggered

You can specify different events for falling alarms and rising alarms.

Configuring RMON
RMON is disabled by default, and no events or alarms are configured in the switch. You can configure your RMON alarms and events by using the CLI or an SNMP-compatible network management station.

Guidelines and Limitations
The Cisco MDS 9000 Family MIB files can be obtained through FTP from http://www.cisco.com/public/sw-center/netmgmt/cmtk/mibs.shtml, under Cisco Storage Networking.
Cisco and IETF MIBs are updated frequently. You should download the latest MIBs from http://www.cisco.com/public/sw-center/netmgmt/cmtk/mibs.shtml whenever you upgrade your Cisco MDS 9000 NX-OS Software.

RMON has the following limitations:
• You must configure an SNMP user and a notification receiver to use the SNMP notification event type.
• You can configure an RMON alarm only on a MIB object that resolves to an integer.

When you configure an RMON alarm, the object identifier must be complete, with its index, so that it refers to only one object. For example, 1.3.6.1.4.1.9.9.109.1.1.1.1.8 corresponds to cpmCPUTotal5minRev, and .1 corresponds to index cpmCPUTotalIndex, which creates object identifier 1.3.6.1.4.1.9.9.109.1.1.1.1.8.1.

Enabling RMON
RMON is disabled by default. To enable it, use the following commands:

MDS9000# config t
MDS9000(config)# snmp-server enable traps rmon

Setting RMON Alarms
RMON alarms let you configure rising- and falling-threshold values for a specific MIB object using OIDs. If the OID that is configured is related to port counters, then the monitoring will apply to the entire port set available in the system.

The following example shows configuration of RMON alarm number 20 to monitor the SNMP MIB object 1.3.6.1.2.1.2.2.1.14.16777216, where 1.3.6.1.2.1.2.2.1.14 represents the ifInErrors value and 16777216 represents the ifIndex value of the physical port. The object is sampled once every 60 seconds until the alarm is disabled, and each sample is compared with the previous one (delta sampling). If the value shows a MIB counter increase of 15 or more, the software triggers an alarm. The alarm in turn triggers event number 1, which is configured with the RMON event command. Possible events include a log entry or an SNMP trap. If the MIB value changes by 0, the alarm is reset and can be triggered again.
switch(config)# rmon alarm 20 1.3.6.1.2.1.2.2.1.14.16777216 60 delta rising-threshold 15 1 falling-threshold 0 owner test

RMON is a software process that runs on a supervisor module. There is no limit on the number of OIDs that can be configured for monitoring. The minimum polling interval for an SNMP counter is 1 second. For example, if you want to monitor CPU utilization with a rising threshold of 80 percent, you can set the polling interval to 30 seconds.

Verifying RMON
Use the show rmon and show snmp commands to display the configured RMON and SNMP information.

Using RMON to Monitor MIBs
RMON can monitor all the MIBs that the platform supports. The MIB reference guide for Cisco MDS 9000 Family switches lists all the available MIBs that can be used for managing SAN fabric. Table 2 lists common MIBs that the network administrator can use to monitor the SAN fabric. Table 3 lists MIBs for monitoring CPU and memory utilization and their recommended thresholds.

Table 2. Common MIB OIDs for Monitoring on Cisco MDS 9000 Family Switches
Serial Number | MIB | Description | OID
1 | CISCO-CFS-MIB | Provides global-level control over the features in the system that support Cisco Fabric Services | 1.3.6.1.4.1.9.9.433.0.x
2 | CISCO-ENTITY-FRUCONTROL-MIB | Manages field-replaceable units (FRUs), such as power supplies, fans, and modules | 1.3.6.1.4.1.9.9.117.2.0.x
3 | CISCO-ENTITY-SENSOR-MIB | Provides sensor data for environmental monitors such as temperature gauges | 1.3.6.1.4.1.9.9.91.2.0.1
4 | CISCO-FSPF-MIB | Configures and monitors the Fabric Shortest Path First (FSPF) parameters on all VSANs configured on the local switch | 1.3.6.1.4.1.9.9.287.3.0.1
5 | CISCO-LICENSE-MGR-MIB | Manages license files on the switch | 1.3.6.1.4.1.9.9.369.3.0.x
6 | CISCO-SYSLOG-MIB | Collects system messages generated by the switch (These messages are typically sent to a syslog server. Use this MIB to retrieve SYSLOG-MIB system messages.) | 1.3.6.1.4.1.9.9.41.2.1
7 | CISCO-VSAN-MIB | Enables you to create and monitor VSANs | 1.3.6.1.4.1.9.9.282.1.3.0.1
8 | CISCO-PROCESS-MIB | Displays memory and process utilization | 1.3.6.1.4.1.9.9.109.x
9 | CISCO-ZS-MIB | Provides notifications related to zone changes in Cisco MDS 9000 Family switches | 1.3.6.1.4.1.9.9.294.1.x

Table 3. CPU and Memory Utilization MIBs
Category | Description | OID | Setting | Poll (seconds) | Rising | Falling
CPU | CPU utilization | 1.3.6.1.4.1.9.9.109.1.1.1.1.6.1 | (unlabeled in source) | 60 | 80 | 60
 | | | Normal | 60 | 80 | 60
 | | | Aggressive | 30 | 70 | 60
Memory | Memory utilization | 1.3.6.1.4.1.9.9.305.1.1.2 | Normal | 90 | 80 | 50
 | | | Aggressive | 60 | 80 | 50

Port Monitor

Overview
Port Monitor, or PMON, is a monitoring feature available on Cisco MDS 9000 Family switches that can monitor critical counters related to physical ports and alert external monitoring software of any anomalies. The configured entities are monitored in hardware, thereby saving CPU cycles on the supervisor. The software component of PMON relies on RMON capabilities, which in turn rely on the SNMP component to generate threshold messages whenever a monitored entity rises above or falls below the set threshold value. Administrators can optionally use Cisco Prime DCNM software to view these alerts and gain fabricwide control by identifying nodes that are consuming network resources and hampering the flow of data traffic. The PMON feature hides the complexity of SNMP OIDs and provides an easy-to-use CLI.

Cisco MDS 9000 Family switches have two predefined PMON policies: default and slowdrain. Administrators can also create other custom policies for monitoring and alerting purposes. The slowdrain policy is automatically active when you launch a Cisco MDS 9000 Family switch.
Note that at any given time, only one policy can be active in the system.

The default policy monitors the objects shown in Table 4 for each physical port interface on the Cisco MDS 9000 Family switch. This policy is not active and cannot be modified.

Table 4. Default PMON Policy
Serial Number | Hardware Counters | Interval (seconds) | Rising Threshold | Falling Threshold
1 | Link loss | 60 | 5 | 1
2 | Sync loss | 60 | 5 | 1
3 | Signal loss | 60 | 5 | 1
4 | Invalid words | 60 | 1 | 0
5 | Invalid CRCs | 60 | 5 | 1
6 | LR Rx | 60 | 5 | 1
7 | LR Tx | 60 | 5 | 1
8 | Timeout Discards | 60 | 200 | 10
9 | Credit Loss Reco | 1 | 1 | 0
10 | Tx Credit Not Available | 1 | 10% | 0%
11 | RX Datarate | 60 | 80% | 20%
12 | TX Datarate | 60 | 80% | 20%
13 | TX Discards | 60 | 200 | 10
14 | ASIC Error Pkt from port | 60 | 10 | 5
15 | ASIC Error Pkt from xbar | 50 | 10 | 5
16 | ASIC Error Pkt to xbar | 50 | 10 | 5
17 | Tx Slowport Count | 1 | 10 | 0
18 | Tx Slowport Oper Delay | 1 | 50 ms | 0 ms
19 | TxWait | 1 | 40% | 0%

The slowdrain PMON policy monitors two counters: Credit Loss Reco and Tx Credit Not Available, as shown in Table 5. This policy is active by default. The Tx Credit Not Available counter is sampled at 100-ms intervals. A 10 percent rising threshold indicates 10 percent of a second (100 milliseconds [ms]). If there are zero remaining Tx B2B credits for 100 ms, then the rising-threshold criterion will be met.

Table 5. Slowdrain PMON Policy
Serial Number | Hardware Counters | Interval (seconds) | Rising Threshold | Falling Threshold
1 | Credit Loss Reco | 1 | 1 | 0
2 | Tx Credit Not Available | 1 | 10% | 0%

PMON allows end users to define and modify new custom policies for monitoring access ports or trunk ports, or both, on a Cisco MDS 9000 Family switch. You can also optionally use PMON to error-disable or flap the port if the set threshold is met for a configured counter.
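As a worked example of the percentage thresholds listed for Tx Credit Not Available in Tables 4 and 5, the short sketch below converts a percentage threshold into the time a port spends at zero Tx B2B credits within one polling interval. The function name is illustrative and is not part of NX-OS; the sketch assumes that the percentage is taken relative to the polling interval, as the 10 percent = 100 ms example in the text implies.

```python
def credit_unavailable_ms(threshold_percent: float, poll_interval_s: float = 1.0) -> float:
    """Time in ms at zero Tx B2B credits that a percentage threshold represents.

    Assumption: the percentage is relative to the polling interval
    (1 second for Tx Credit Not Available in Tables 4 and 5).
    """
    return poll_interval_s * 1000.0 * threshold_percent / 100.0


# Rising and falling thresholds of the slowdrain policy (10% and 0%).
rising_ms = credit_unavailable_ms(10)
falling_ms = credit_unavailable_ms(0)
print(rising_ms, falling_ms)  # 100.0 0.0
```

With a 1-second interval, the 10 percent rising threshold therefore corresponds to 100 ms without credits, matching the figure quoted for the slowdrain policy.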
Note: Different counters are available depending on the types of Cisco MDS 9000 Family modules in the system. The unit for threshold values (rising and falling) differs across counters.

Note: For all platforms, if the default value for txwait, tx-slowport-count, or tx-slowport-oper-delay is modified, then in-service software downgrade (ISSD) will be restricted. To proceed with ISSD, use the no form of the command to roll back to the default value.

Comparison of PMON and RMON
PMON and RMON can be configured to work in tandem to monitor various parameters on the Cisco MDS 9000 Family switches. For example, you can configure RMON to monitor CPU and memory statistics and use PMON to monitor hardware-specific counters related to a slow-drain condition.

PMON differs from RMON in the way that it processes events and alarms on the ports. RMON is dependent on SNMP and runs only on supervisor CPUs. It listens to SNMP-based events and traps from various components and relays this information to the external management application.

PMON, in contrast, performs most computations on the port application-specific integrated circuit (ASIC) present on the individual modules, and counters exposed by PMON features can be monitored at intervals of less than 1 second. If a counter exceeds its configured threshold value, an alarm event is generated by the line-card port ASIC and sent to the supervisor module. This architectural design frees CPU time on the supervisor, relieving it from having to continuously monitor the PMON counters. The supervisor CPU can instead be used for other important functions, such as servicing control traffic and running Cisco Generic Online Diagnostics (GOLD).

Architectural Components
Figure 1 shows the interaction of PMON components with other software components.

Figure 1. PMON Interaction with Various Software Modules
[Diagram. Components shown: Port Monitor configuration (type, interval, threshold, event, and object); PMON CLI configuration (on supervisor); configuration list and database; persistent storage; RMON component (on supervisor); SNMP component (on supervisor); broadcast to all line cards; Port Monitor agents in LC1 and LC2 (line-card components); Cisco Prime DCNM (external monitoring application).]

The PMON component on the supervisor interacts with SNMP and RMON components to alert and send traps to external monitoring software. It also interacts with the persistent storage system (PSS) component, which stores the configuration list and policy details in a database. The PSS allows the PMON policy to be restored if a switch reloads or fails.

When a PMON policy is configured on the supervisor, the policy is internally broadcast to all line cards using internal NX-OS messaging. The PMON agent process on the line card is notified and then applies the policy, according to the configured settings, to all the active ports on the module. Relevant hardware-assisted timers are started to monitor the port counters according to the polling interval specified. If a counter exceeds its configured threshold, the PMON process on the line card informs the SNMP agent on the supervisor of this violation. RMON also sends traps to any external monitoring application, such as Cisco Prime DCNM, that is registered to receive such notifications from the switch. PMON-triggered notifications can also be viewed in RMON by using the show rmon logs and show rmon alarm commands.
Configuring PMON
Here is a sample configuration to specify the PMON policy, monitor the link-loss counter, and activate the policy.

MDS9000-A(config)# port-monitor name port-health-check
MDS9000-A(config-port-monitor)# monitor counter link-loss
MDS9000-A(config-port-monitor)# counter link-loss poll-interval 60 delta rising-threshold 5 event 4 falling-threshold 1 event 4 portguard errordisable
MDS9000-A(config-port-monitor)# port-monitor activate port-health-check

The preceding configuration creates a PMON policy named port-health-check that polls the link-loss counter every 60 seconds. The delta value specified in the configuration represents the difference between the value from the current poll and the value from the previous poll performed by PMON 60 seconds earlier. If this difference is greater than or equal to 5 between two successive polls, then a port-guard action to error-disable the affected port is performed, and an event of level 4 is sent to RMON.

Note: The port types all and access port include all F-ports and all F-port-channel interfaces. Also, the port-guard action to flap or error-disable the port should not be used when monitoring an F-port-channel interface connected to a network port virtualization (NPV) switch.

If another PMON policy is already active for the port type in question, then you must deactivate that policy before you can activate a new policy.

PMON can generate five levels of events. The levels reflect the importance of the event. Table 6 shows the event levels.

Table 6. PMON Events
Event | Description
1 | Fatal
2 | Critical
3 | Error
4 | Warning
5 | Informational

When configuring PMON, administrators should choose the event levels carefully so that meaningful actions can be taken by the administrator or the third-party tool that receives these alerts.
Generally, events are classified as follows:
• An event level 1 alert signifies a general network failure or a situation in which a critical part of your network is unreachable.
• An event level 2 alert signifies a critical situation in the network that needs immediate attention. For example, such an alert might indicate that one of the modules of the Cisco MDS 9000 Family switch is down, or that the flow of traffic has changed so that traffic that was earlier traversing the SAN A path has been switched to a SAN B path.
• Event level 3, 4, and 5 alerts signify less critical situations. For example, these alerts may be used to indicate that one of the member links in a port-channel interface has flapped, that the module temperature is rising, or that the count of cyclic redundancy check (CRC) errors on an interface has increased.

After the PMON policy is configured, enabled, and activated for monitoring, you can use the show port-monitor port-health-check and show port-monitor active CLI commands to view the various PMON counters and their configured thresholds.

To deactivate monitoring for the configured policy, use this command:

MDS9000(config)# no port-monitor activate port-health-check

Verifying PMON
You can use the following commands to check whether any Cisco MDS 9000 Family module has triggered an alarm and whether PMON was able to inform RMON about generating an SNMP event.

MDS9000# show port-monitor active
MDS9000# show port-monitor <policy name>
MDS9000# show port-monitor status
MDS9000# show rmon logs

Configuring PMON Alerting Mechanisms
PMON provides two ways to configure port counters for monitoring and alerting. You can configure monitoring and alerting based on the poll interval or the check interval.
Configuring Alerting Based on Poll Interval
Consider a PMON policy named pmon-test with the counter link-loss configured as shown here:

port-monitor name pmon-test
port-type all
counter link-loss poll-interval 30 delta rising-threshold 20 event 4 falling-threshold 0 event 4

This configuration sets the PMON policy named pmon-test to poll the link-loss counter every 30 seconds. Rising and falling thresholds are configured as 20 and 0, respectively. If the counter value crosses a threshold setting, the rising or falling alert is generated and logged in RMON. If an external monitoring tool such as Cisco Prime DCNM is configured to listen for SNMP traps, RMON will inform this tool of PMON alerts through an SNMP trap. Figure 2 and the following section show how poll-interval-based PMON alerting works internally and generates rising or falling PMON alerts.

Figure 2. Poll-Interval-Based Alerting
[Plot of counter value (0 to 120) against time (0 to 240 seconds), with rising and falling alarms marked. Rising threshold: 20. Falling threshold: 0. Polling interval: 30 seconds.]

When you configure a PMON policy, the system begins polling the hardware counters internally. The PMON agent process on the Cisco MDS 9000 Family module runs according to the configured poll-interval duration (such as 30 seconds), reads the PMON counters, and performs validation checks against the configured rising and falling alarm thresholds.

In the sample pmon-test policy configured in the example here, during the first polling interval (that is, at the thirtieth second), PMON reads the counter's absolute value from the module's port ASIC and stores this value internally as the base value.
During the next polling cycle (that is, at the sixtieth second), PMON polls again, reads the counter value, and stores this value internally as the current value. A validation check is then performed between the counter's current value and the base value, which determines whether an alarm needs to be raised.

If the delta between the current counter value and the previous base counter value is greater than the configured rising-threshold value, then a rising alarm is generated.

[(Current_value - Base_value) > Rising threshold] = Generate rising alarm

Similarly, if the delta between the current and previous counter values is less than or equal to the configured falling-threshold value, then a falling alarm is generated.

[(Current_value - Base_value) <= Falling threshold] = Generate falling alarm

Note: PMON overwrites the base value with the counter's current value during each polling cycle. The validation is always performed against the base value during each polling cycle. Two successive rising alarms or falling alarms will not be generated unless an alarm of the other type occurs between them.

Configuring Alerting Based on Check Interval
Beginning with Cisco MDS NX-OS Release 6.2(15), in addition to the rising and falling thresholds, the user can configure a new threshold, called the warning threshold, to generate syslog messages.

In poll-interval-based alerting, hardware counters are polled only at the configured polling intervals, so alerts are triggered only at those intervals. If you need a more fine-tuned alerting mechanism, consider check-interval-based alerting. This approach introduces a new global configuration CLI command that checks counter values between the configured polling intervals and sends rising or falling alerts if the threshold conditions are met.
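As a reference point for the poll-interval behavior that check-interval alerting builds on, the following Python sketch models the evaluation described above: the first reading sets the base value, each later poll compares the delta with the thresholds, and the base is then overwritten. It is an illustration only, not the PMON implementation. The function name and the sample readings are hypothetical, the strict ">" comparison follows the formula in the text, and the sketch assumes the port starts in the cleared state so that the first alarm can only be a rising alarm.

```python
from typing import List, Tuple


def evaluate_polls(samples: List[int], rising: int, falling: int) -> List[Tuple[int, str]]:
    """Return (poll_index, alarm_type) pairs for absolute counter readings.

    samples[0] only establishes the base value. Each later reading is
    compared with the base, and the base is then overwritten.
    """
    alarms: List[Tuple[int, str]] = []
    # Assumption: start in the cleared state, so a falling alarm cannot
    # be the first alarm raised.
    last_alarm = "falling"
    base = samples[0]
    for index, current in enumerate(samples[1:], start=1):
        delta = current - base
        if delta > rising and last_alarm != "rising":
            alarms.append((index, "rising"))
            last_alarm = "rising"
        elif delta <= falling and last_alarm != "falling":
            alarms.append((index, "falling"))
            last_alarm = "falling"
        base = current  # base value is overwritten on every polling cycle
    return alarms


# Hypothetical link-loss readings taken every 30 seconds, evaluated with the
# pmon-test thresholds (rising 20, falling 0).
readings = [0, 25, 55, 55, 60, 100, 100]
print(evaluate_polls(readings, rising=20, falling=0))
# [(1, 'rising'), (3, 'falling'), (5, 'rising'), (6, 'falling')]
```

The second poll in this example has a delta of 30 but raises no alarm, because no falling alarm has occurred since the previous rising alarm.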
port-monitor check-interval <interval>

where <interval> is the duration specified in seconds. The minimum interval is 5 seconds, and the value can be set in 1-second increments.

To illustrate the alerting mechanism using this approach, consider a similar Port Monitor (PMON) policy named pmon-test for monitoring the counter link-loss, as configured below. The check-interval is configured with a value of 15 seconds and the warning-threshold with a value of 10. The warning-threshold is an optional parameter. It can be configured with a value between the rising and falling threshold values. It can be used to inform the administrator, through a syslog message, that the monitored counter is nearing its rising or falling threshold value.

port-monitor check-interval 15
port-monitor name pmon-test
port-type all
counter link-loss poll-interval 30 delta rising-threshold 20 event 4 warning-threshold 10 falling-threshold 0 event 4

Figure 3 illustrates check-interval-based alerting.

Figure 3. Check-Interval-Based Alerting
[Plot of counter value against time (30 to 150 seconds, in 15-second steps), marking rising and falling alerts and upward and downward warning-threshold crossings. Rising threshold: 20. Falling threshold: 0. Warning threshold: 10. Polling interval: 30 seconds. Check interval: 15 seconds.]

As shown in Figure 3, in this approach PMON performs validation checks at every configured check-interval period and at every poll-interval period. Calculations to generate alarms in this method are similar to those shown for poll-interval-based alerting, as shown here.
If the delta between the current counter value and the previous base counter value is greater than the configured rising-threshold value, then a rising alarm is generated.

[(Current_value - Base_value) > Rising threshold] = Generate rising alarm

Similarly, if the delta between the current counter value and the previous counter value is less than or equal to the configured falling-threshold value, then a falling alarm is generated.

[(Current_value - Base_value) <= Falling threshold] = Generate falling alarm

In addition:

[(Current_value - Base_value) >= Warning threshold] = Generate a syslog message

Sample syslog messages generated by PMON for the rising, falling, and warning thresholds are shown below:

MDS9000(config)# 2016 Jan 6 06:51:35 sw-apex-40 %PMON-SLOT2-4-WARNING_THRESHOLD_REACHED_UPWARD: Invalid Words has reached warning threshold in the upward direction (port fc2/18 [0x1091000], value = 30).
2016 Jan 6 06:51:35 sw-apex-40 %PMON-SLOT2-3-RISING_THRESHOLD_REACHED: Invalid Words has reached the rising threshold (port=fc2/18 [0x1091000], value=30).
2016 Jan 6 06:51:36 sw-apex-40 %SNMPD-3-ERROR: PMON: Rising Alarm Req for Invalid Words counter for port fc2/18(1091000), value is 30 [event id 1 threshold 30 sample 2 object 4 fcIfInvalidTxWords]
2016 Jan 6 06:52:21 sw-apex-40 %PMON-SLOT2-5-WARNING_THRESHOLD_REACHED_DOWNWARD: Invalid Words has reached warning threshold in the downward direction (port fc2/18 [0x1091000], value = 0).
2016 Jan 6 06:52:21 sw-apex-40 %PMON-SLOT2-5-FALLING_THRESHOLD_REACHED: Invalid Words has reached the falling threshold (port=fc2/18 [0x1091000], value=0).
    2016 Jan 6 06:52:21 sw-apex-40 %SNMPD-3-ERROR: PMON: Falling Alarm Req for Invalid Words counter for port fc2/18(1091000), value is 0 [event id 2 threshold 0 sample 2 object 4 fcIfInvalidTxWords]

PMON Counters

The Cisco MDS 9000 Family switches include a set of hardware counters. Advanced 8- and 16-Gbps and higher Cisco MDS 9000 Family modules include additional counters that are specific to the line-card type being used. To view the list of available PMON counters on the software release you are running, use the following command:

    MDS9000(config)# port-monitor name <pmon_policy_name>
    MDS9000(config-port-monitor)# monitor counter ?
      credit-loss-reco         Configure credit loss recovery counter
      err-pkt-from-port        Configure err-pkt-from-port counter
      err-pkt-from-xbar        Configure err-pkt-from-xbar counter
      err-pkt-to-xbar          Configure err-pkt-to-xbar counter
      invalid-crc              Configure invalid-crc counter
      invalid-words            Configure invalid-words counter
      link-loss                Configure link-failure counter
      lr-rx                    Configure the number of link resets received by the fc-port
      lr-tx                    Configure the number of link resets transmitted by the fc-port
      rx-datarate              Configure rx performance counter
      signal-loss              Configure signal-loss counter
      sync-loss                Configure sync-loss counter
      timeout-discards         Configure timeout discards counter
      tx-credit-not-available  Configure credit not available counter
      tx-datarate              Configure tx performance counter
      tx-discards              Configure tx discards counter
      tx-slowport-count        Configure slow port sub-100ms counter
      tx-slowport-oper-delay   Configure slow port operation delay
      txwait                   Configure tx total wait counter

This section describes the PMON counters available, the CLI show commands for the counters, and the recommended threshold values configured in three templates: Normal, Aggressive, and Most Aggressive.
The CLI configuration commands for these templates are also shown in Appendix B at the end of this document.

PMON offers various counters to monitor the Cisco MDS fabric and provide notifications if events exceed the configured threshold values. These counters range from generic interface counters, such as receive (Rx) and transmit (Tx) rate counters, to more platform-specific counters for slow-drain detection, such as TxWait and Tx Slowport Count. If you are experiencing slowness in your network, applications that are not performing up to their potential, or congestion on network ports, a prudent approach is to deploy PMON on all Cisco SAN switches and monitor the counters for potential slow-drain conditions. By integrating PMON with Cisco Prime DCNM, system administrators can be notified proactively whenever a configured counter exceeds its threshold value.

PMON is easy to deploy. System administrators can quickly configure a PMON policy using any of the three templates (Normal, Aggressive, and Most Aggressive) and start monitoring their SAN fabric from day one.

A general guideline is to first deploy a Normal PMON policy on the SAN fabric and monitor for alerts or events for a few weeks. If the network is stable and produces no alerts or event notifications, yet you continue to see application degradation, you can move to the next level of monitoring and deploy a PMON policy using the Aggressive template and then, if necessary, the Most Aggressive template.

PMON also lets you adjust the individual counter threshold settings when you deploy the policy. In addition, it provides mechanisms for configuring alert and notification actions, such as error-disabling a port if a configured counter exceeds its threshold settings. In Appendix B, the port-guard action is set to the default value for all the templates: that is, neither flapping nor error-disabling the physical port.
The policy will alert RMON, which can then trigger alerts to the management application and notify the administrator if a threshold breach occurs on the Cisco MDS 9000 Family switch.

Table 7 lists the 19 available hardware counters and the recommended threshold settings for each counter. These values are divided into three templates: Normal, Aggressive, and Most Aggressive. Polling intervals are in seconds unless a unit is given alongside the value.

Table 7. Hardware Counters

                              Normal               Aggressive           Most Aggressive
Counter Name                  Rise   Fall   Poll   Rise   Fall   Poll   Rise   Fall   Poll
Signal Loss                   5      0      60     5      1      45     5      0      30
Sync Loss                     5      0      60     5      1      45     5      0      30
Link Loss                     5      0      60     5      1      30     5      0      10
Invalid CRC                   5      0      60     5      1      30     5      0      10
Invalid Words                 5      0      60     5      1      30     5      0      10
Tx Discards                   50     0      60     50     10     45     50     0      30
LR Rx                         5      0      60     4      1      60     3      0      30
LR Tx                         5      0      60     4      1      60     3      0      30
Timeout Discard               200    10     60     200    100    60     200    50     30
Credit Loss Reco              1      0      1      1      0      1      1      0      1
Tx Credit Not Available       10%    0      1      10%    0      1      10%    0      1
Rx Datarate                   80%    70%    60     90%    60%    60     90%    50%    30
Tx Datarate                   80%    70%    60     90%    60%    60     90%    50%    30
ASIC Error Pkt from Port (3)  50     10     60     50     40     60     50     30     60
ASIC Error Pkt to Xbar (3)    50     10     50     50     40     60     50     30     60
ASIC Error Pkt from Xbar (3)  50     10     50     50     40     60     50     30     60
Tx Slowport Count (1)         5      0      1      5      0      1      5      0      1
Tx Slowport Oper Delay (2)    50 ms  0      1      40 ms  0      1      30 ms  0      1
TxWait                        40%    0      1      30%    0      1      20%    0      1

Table 8 provides a support matrix for slow-drain-specific counters for PMON.

Table 8. PMON Slow-Drain Counters

Counter Name                 Supported Platforms                               Minimum Cisco NX-OS Release
credit-loss-reco             All                                               Release 5.0.0 or 6.0.0
lr-rx                        All                                               Release 5.0.0 or 6.0.0
lr-tx                        All                                               Release 5.0.0 or 6.0.0
timeout-discards             All                                               Release 5.0.0 or 6.0.0
tx-credit-not-available      All                                               Release 5.0.0 or 6.0.0
tx-discards                  All                                               Release 5.0.0 or 6.0.0
tx-slowport-count (1)        Cisco MDS 9500 Series only with DS-X9248-48K9,    Release 6.2(13)
                             DS-X9224-96K9, or DS-X9248-96K9 line card
tx-slowport-oper-delay (2)   Cisco MDS 9700 Series, 9396S, 9148S, 9250i,       Release 6.2(13)
                             and 9500 Series only with DS-X9232-256K9 or
                             DS-X9248-256K9 line card
txwait                       Cisco MDS 9700 Series, 9396S, 9148S, 9250i,       Release 6.2(13)
                             and 9500 Series only with DS-X9232-256K9 or
                             DS-X9248-256K9 line card

(1) The slow-port-monitoring feature must be enabled for this counter to work. This counter increments only if the amount of time that an interface is at zero Tx buffer-to-buffer (B2B) credits is equal to or greater than the configured (administrative) slow-port-monitoring delay. It increments at most once per 100 ms, or 10 times per second.

Use the following configuration command to enable slow-port monitoring:

    MDS9700(config)# system timeout slowport-monitor ?
      <1-500>  Configure number of milliseconds
      default  Default timeout value for HW slowport monitoring
    MDS9700(config)# system timeout slowport-monitor default ?
      mode  Enter the port mode
    MDS9700(config)# system timeout slowport-monitor default mode ?
      E  E mode

(2) The slow-port-monitoring feature must be enabled for this counter to work. The threshold is exceeded only if the slow-port-monitoring administrative delay (the configured value) is less than the configured rising threshold and the reported operation delay (oper-delay) is more than the configured rising threshold.

(3) These counters work on 4- and 8-Gbps Cisco MDS 9000 Family switches. On 16-Gbps Cisco MDS 9000 Family switches, these counters are not operable.

The remainder of this section provides detailed descriptions of the PMON counters, their associated switch CLI commands, and their threshold values.
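Before turning to the individual counters, the two slow-port footnote conditions above can be restated as a small Python sketch. This is only an illustration of the stated rules, not switch code: the function names and the SAMPLE_WINDOW_MS constant are hypothetical, and the sketch assumes delays and thresholds are all expressed in milliseconds.

```python
# Illustrative model of the slow-port footnote conditions.
# All names here are hypothetical; none of them are NX-OS identifiers.

SAMPLE_WINDOW_MS = 100  # footnote 1: at most one slow-port event per 100 ms


def max_slowport_events(poll_interval_s: int) -> int:
    """Upper bound on tx-slowport-count increments in one polling interval."""
    return (poll_interval_s * 1000) // SAMPLE_WINDOW_MS


def slowport_event_counted(zero_credit_ms: float, admin_delay_ms: int) -> bool:
    """Footnote 1: count an event only if the time spent at zero Tx B2B
    credits is equal to or greater than the administrative delay."""
    return zero_credit_ms >= admin_delay_ms


def oper_delay_threshold_exceeded(admin_delay_ms: int,
                                  oper_delay_ms: float,
                                  rising_threshold_ms: int) -> bool:
    """Footnote 2: the administrative delay must be below the rising
    threshold and the reported oper-delay must be above it."""
    return (admin_delay_ms < rising_threshold_ms
            and oper_delay_ms > rising_threshold_ms)


if __name__ == "__main__":
    print(max_slowport_events(1))                     # 10 events per second at most
    print(slowport_event_counted(50, 50))             # True: delay reached
    print(oper_delay_threshold_exceeded(10, 60, 50))  # True: 10 < 50 < 60
    print(oper_delay_threshold_exceeded(60, 80, 50))  # False: admin delay too high
```

Under these assumptions, an administrative delay set at or above the rising threshold can never produce a tx-slowport-oper-delay alert, which is why the two values should be chosen together.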
Counter: Signal Loss (signal-loss)
Description: Signal loss typically occurs at the physical transport medium. If the laser is off or the optical medium is down, the counter is incremented. The optical ASIC on Cisco MDS 9000 Family switches detects this condition and raises a firmware interrupt to flag it in the underlying Layer 1 signaling. The usual result of this event is that the link must be reset and the port must be reinitialized. The loss of signal usually indicates a hardware problem.
Show command: show interface x/y counters detail | include ignore signal
Thresholds:
                     Poll (sec)   Rising   Falling
    Normal           60           5        1
    Aggressive       45           5        1
    Most Aggressive  30           5        1

Counter: Sync Loss (sync-loss)
Description: This counter tracks the loss of the synchronization (sync) bit pattern when a frame is received from the peer Fibre Channel switch. In Fibre Channel, if the transmitted frame is encoded in 66-bit format, the first 2 bits form the sync bits and the remaining 64 bits form the data. Upon receiving a frame, the receiver locks itself to a sync-bit pattern. During data transfer, if any packet is received in which the sync pattern differs from the one to which the receiver is locked, the counter is incremented, indicating noise in the transmitted data. Generally, a sync loss is followed by a logical reset consisting of link reset (LR), link reset response (LRR), and IDLE. Sync-loss errors frequently are the result of a faulty transceiver or cable.
Show command: show interface x/y counters detail | include ignore sync
Thresholds:
                     Poll (sec)   Rising   Falling
    Normal           60           5        1
    Aggressive       45           5        1
    Most Aggressive  30           5        1

Counter: Link Loss (link-loss)
Description: Link loss occurs whenever a port flaps. For example, if a peer sends an offline sequence (OLS) or not-operational sequence (NOS) signal to the switch, the counter is incremented.
Link loss can trigger a series of other hardware counter updates as well. Both physical and hardware problems can cause link failures. Link failures frequently occur as a result of a sync loss or signal loss. A port or link flap causes other associated PMON counters to increment as well (sync loss, signal loss, link reset, etc.).
Show command: None
Thresholds:
                     Poll (sec)   Rising   Falling
    Normal           60           5        1
    Aggressive       45           5        1
    Most Aggressive  30           5        1

Counter: Invalid CRC (invalid-crc)
Description: In Fibre Channel, words are packaged inside a frame. If bit-level corruption occurs at the frame level (that is, a cyclic redundancy check [CRC] failure, checksum mismatch, etc.), the invalid-crc counter is incremented. The CRC field is 32 bits long and is inserted before the end-of-frame (EOF) delimiter in the Fibre Channel frame format.
Show command: show interface x/y | include ignore crc
Thresholds:
                     Poll (sec)   Rising   Falling
    Normal           60           5        1
    Aggressive       30           5        1
    Most Aggressive  10           5        1

Counter: Invalid Words (invalid-words)
Description: A checksum and signal integrity check occurs for every word received on a Cisco MDS 9000 Family switch. The invalid-words counter is incremented each time a Fibre Channel word is detected with checksum or bit-level errors.
Show command: show interface x/y counter detail | include ignore words
Thresholds:
                     Poll (sec)   Rising   Falling
    Normal           60           5        1
    Aggressive       30           5        1
    Most Aggressive  10           5        1

Counter: Tx Discards (tx-discards)
Description: This counter increments when one of the following conditions occurs: an egress policy map (access control list [ACL]) forcibly drops the packets, traffic is received on the port ASIC when the port is physically down, or packets are discarded as a result of a timeout (500-ms threshold).
Show command: show interface x/y counter detail | include ignore discard
Thresholds:
                     Poll (sec)   Rising   Falling
    Normal           60           50       0
    Aggressive       45           50       10
    Most Aggressive  30           50       20

Counter: LR RX (lr-rx)
Description: If a peer port is continuously running out of transmit B2B credits for a long time (1 second for F-ports and 1.5 seconds for E-ports), it can invoke the credit-loss-recovery mechanism by transmitting a Fibre Channel link reset (LR) primitive. The node (the Fibre Channel switch) connected to this port receives the link reset in the receive (Rx) direction and increments its lr-rx counter. The switch that sent the link reset has its lr-tx (transmit) counter incremented. This example is just one of the situations in which this counter is incremented. A link flap can also trigger the counter to increment. Usually, no traffic flows are received on this port when this condition occurs. An increasing number of received link resets can indicate congestion at a port, or the resets can be a cascading effect of congestion on other ports to which this port is transmitting data. It can also indicate that the receiver-ready (R_RDY) B2B credits are being lost due to corruption on the link.
Show command: show interface fc2/11 count details | include ignore "reset responses received"
Thresholds:
                     Poll (sec)   Rising   Falling
    Normal           60           5        1
    Aggressive       60           3        1
    Most Aggressive  30           3        1

Counter: LR TX (lr-tx)
Description: If a port is continuously running out of transmit B2B credits for a long time (1 second for F-ports and 1.5 seconds for E-ports), it invokes the credit-loss-recovery mechanism by transmitting a Fibre Channel link reset (LR) primitive. A link reset is only a credit reset and does not actually flap the port if it is successful. Usually, no traffic flows out of this port when this condition occurs.
This example is just one of the situations in which this counter is incremented. A link flap can also trigger the counter to increment. An incrementing counter can also indicate that the R_RDY B2B credits are being lost due to corruption on the link.
Show command: show interface x/y count details | include ignore "reset responses transmitted"
Thresholds:
                     Poll (sec)   Rising   Falling
    Normal           60           5        1
    Aggressive       60           3        1
    Most Aggressive  30           3        1

Counter: Timeout Discards (timeout-discards)
Description: All frames are time-stamped when they enter a Cisco MDS 9000 Family switch. The frame is put into the receive buffer queue when it enters the switch. The frame is dropped if it is not delivered to the egress port within the time specified by the congestion-drop timeout value (the default is 500 ms). These dropped frames are accounted for at the next-hop (egress) port and signify congestion at this egress port. This example is one of the conditions that cause the timeout-discards counter to increment.
Show command: show interface x/y count details | include ignore timeout
Thresholds:
                     Poll (sec)   Rising   Falling
    Normal           60           200      10
    Aggressive       60           200      100
    Most Aggressive  30           200      100

Counter: Credit Loss Reco (credit-loss-reco)
Description: This counter is incremented when B2B credits are unavailable for more than 1 second for F-ports or more than 1.5 seconds for E-ports. Because this is an important counter for slow-drain detection, you should keep the value the same for all three templates while monitoring.
Show command: show interface x/y count details | include ignore "credit loss"
Thresholds:
                     Poll (sec)   Rising   Falling
    Normal           1            1        0
    Aggressive       1            1        0
    Most Aggressive  1            1        0

Counter: TX Credit Not Available (tx-credit-not-available)
Description: This counter is incremented by 1 when Tx B2B credits are not available for 100 ms (default).
Both thresholds are configured as a percentage of the polling interval time. For example, assume that you set the poll interval to 1 second, the rising threshold to 10% (which translates to 100 ms: that is, 10% of a second), the falling threshold to 0% (which translates to 0 ms), and the port-guard action to error-disable the port. If there are zero remaining Tx B2B credits for 100 ms, the rising-threshold criterion is met. If a credit is then returned, the falling-threshold criterion is met. In addition, when the rising-threshold criterion is met, the port-guard action is initiated, which puts the port in the error-disable state. The port remains in that state until someone manually issues a shut or no shut command on that port. Because this is an important counter for slow-drain detection, you should keep the value the same for all three templates while monitoring.
Note: Because this counter is sampled at 100-ms intervals, you should configure both threshold percentages only in multiples of 10 (10, 20, 30, etc.).
Show command: show interface x/y count details | include ignore "Transmit B2B"
Thresholds:
                     Poll (sec)   Rising   Falling
    Normal           1            10%      0
    Aggressive       1            10%      0
    Most Aggressive  1            10%      0

Counter: Rx Datarate (rx-datarate)
Description: This counter indicates the rate at which frames are received on the interface. Both thresholds are configured as a percentage of the operational speed of the interface.
Show command: show interface x/y | include ignore "input rate"
Thresholds:
                     Poll (sec)   Rising   Falling
    Normal           60           80%      70%
    Aggressive       60           90%      60%
    Most Aggressive  30           90%      50%

Counter: TX Datarate (tx-datarate)
Description: This counter indicates the rate at which frames are transmitted out of a particular interface. Both thresholds are configured as a percentage of the operational speed of the interface.
Show command: show interface x/y | include ignore "output rate"
Thresholds:
                     Poll (sec)   Rising   Falling
    Normal           60           80%      70%
    Aggressive       60           90%      60%
    Most Aggressive  30           90%      50%

Counter: ASIC Error Pkt from Port (err-pkt-from-port)
Description: This counter is incremented when the internal path between the port ASIC and the forwarding ASIC is causing frames to be corrupted. If the forwarding ASIC detects corrupted frames from the port ASIC due to any internal problems, this counter is incremented. The counter is deprecated on Cisco MDS 9000 Family fourth-generation and later line cards due to the nature of the hardware architecture on these modules.
Show command: -
Thresholds:
                     Poll (sec)   Rising   Falling
    Normal           60           50       10
    Aggressive       60           50       40
    Most Aggressive  60           50       30

Counter: ASIC Error Pkt to XBAR (err-pkt-to-xbar)
Description: This counter indicates the rate at which frames are received on the interface. Both thresholds are configured as a percentage of the operational speed of the interface.
Show command: -
Thresholds:
                     Poll (sec)   Rising   Falling
    Normal           50           50       10
    Aggressive       60           50       40
    Most Aggressive  60           50       30

Counter: ASIC Error Pkt from XBAR (err-pkt-from-xbar)
Description: This counter is incremented when the forwarding ASIC on the ingress module (line card) is sending corrupted frames to the crossbar (fabric module or XBAR) ASIC.
Show command: -
Thresholds:
                     Poll (sec)   Rising   Falling
    Normal           50           50       10
    Aggressive       60           50       40
    Most Aggressive  60           50       30

Counter: tx-slowport-count
Description: The tx-slowport-count counter is applicable only for 8-Gbps modules in Cisco MDS 9500 Series Multilayer Directors. In the default configuration, PMON sends an alert when a slow-port condition is detected 10 times in 1 second for the configured administrative delay.
Because this is an important counter for slow-drain detection, you should keep the value the same for all three templates while monitoring.
Note: Because this counter is sampled at 100-ms intervals and there can be, at most, one event per 100-ms interval, this counter can increment, at most, 10 times per second. Therefore, configuring a rising-threshold value greater than 10 (with a polling interval of 1 second) means that this alert is never generated. In addition, system timeout slowport-monitor default mode must be configured for alerting to work.
Show commands: show process creditmon slowport-monitor-events
               show logging onboard slowport-monitor-events
Thresholds:
                     Poll (sec)   Rising   Falling
    Normal           1            5        0
    Aggressive       1            5        0
    Most Aggressive  1            5        0

Counter: tx-slowport-oper-delay
Description: The tx-slowport-oper-delay counter is applicable to advanced 8- and 16-Gbps modules for Cisco MDS 9000 Family switches. There are two defaults based on the module type. For advanced 8-Gbps modules, the default rising-threshold value is 80 ms with a 1-second polling interval. For 16-Gbps modules and switches, the default rising-threshold value is 50 ms with a 1-second polling interval.
Show commands: show process creditmon slowport-monitor-events
               show logging onboard slowport-monitor-events
Thresholds:
                     Poll (sec)   Rising   Falling
    Normal           1            50 ms    0
    Aggressive       1            40 ms    0
    Most Aggressive  1            30 ms    0

Counter: TxWait
Description: TxWait accumulates the time during which the port has zero Tx B2B credits available and frames are waiting in the egress queue. The counter increments if a port has zero remaining Tx B2B credits for 2.5 microseconds. This counter is used to report credit unavailability on a port in multiple intuitive ways. TxWait is reported only on 16-Gbps and advanced 8-Gbps platforms. The other platforms in the Cisco MDS 9000 Family report zero. Both rising and falling thresholds are configured as a percentage of the polling-interval time.
Therefore, for example, a 30% threshold with a 1-second polling interval is a setting of 300 ms.
Show commands: show process creditmon txwait-history
               show interface x/y counters | include ignore TxWait
Thresholds:
                     Poll (sec)   Rising   Falling
    Normal           1            40%      0
    Aggressive       1            30%      0
    Most Aggressive  1            20%      0

Cisco Embedded Event Manager

Overview

Cisco Embedded Event Manager, or EEM, is an NX-OS component that can be used to monitor system events that occur on the switch. It can take action after an event occurs and help you troubleshoot the situation. Event manager policies are configured on the supervisor, and they can be used to monitor parameters on modules or line cards as well.

NX-OS has preconfigured system policies that define common events and actions for the device. System policy names generally begin with two underscore characters: for example, __flogi_fcids_max_per_switch is a policy that generates a syslog warning message if the fabric login (FLOGI) requests on the switch exceed 4000. To see a list of all available system policies on Cisco MDS 9000 Family switches, enter the show event manager system-policy execution command.

Figure 4 shows the events that trigger an event log generated through the event manager.

Figure 4. Event Manager Interaction with System Events
[Figure: events (system switchover, OIR, FLOGI count, temperature event, and module failure) and user-defined policies (configured in an applet or script using the CLI) feed the event manager, which produces the event log. The event manager validates and records policy information, directs event notifications, logs events, dynamically registers event names, actions, and parameters, and filters events and matches them with policies.]

The event manager has three main components:
• Event statements: Events to be monitored from another NX-OS component that may require some action, workaround, or notification
• Action statements: Actions that the event manager can take, such as sending an email or disabling an interface, to recover from an event
• Policies: Events paired with one or more actions to troubleshoot or recover from an event

Defining Cisco Embedded Event Manager Policy

Use the commands listed here to configure a user-defined EEM policy.
1. config t
2. (Optional) show event manager policy-state system-policy
3. event manager applet applet-name override system-policy
4. (Optional) description policy-description
5. event event-statement
6. action number action-statement (Repeat Step 6 for multiple action statements.)
7. (Optional) show event manager policy-state applet-name
8. (Optional) copy running-config startup-config

Defining Events and Actions

As noted in the preceding configuration steps (step 6), you need to define an action to be taken if the matching conditions for the event are true. You can use the following options to configure event statements:
• cli: Create a CLI event specification.
• counter: Create a counter event.
• fanabsent: Create a fan-absent event specification.
• fanbad: Create a fan-bad event specification.
• fcns: Create an event related to the Fibre Channel name server.
• flogi: Configure an event related to fabric-login requests.
• gold: Create a diagnostic event specification.
• memory: Create a memory-threshold event specification.
• module: Create a module event specification.
• module-failure: Create a module-failure event specification.
• oir: Create an online insertion and removal (OIR) event specification.
• policy-default: Use the event in the system policy being overridden.
• poweroverbudget: Create a power-over-budget event specification.
• snmp: Create an SNMP event specification.
• storm-control: Create a storm-control event specification.
• syslog: Create a syslog event specification.
• sysmgr: Create a system manager event.
• temperature: Create a temperature event specification.
• test: Create a test event specification.
• zone: Specify zone configuration commands.

You can use Embedded Event Manager applets to configure the following actions:
• cli: Configure a virtual shell (VSH) CLI action.
• counter: Specify the name of the counter.
• eem: Configure an Embedded Event Manager command.
• event-default: Perform the default action for the event.
• forceshut: Force the entire switch to shut down.
• overbudgetshut: Shut down the specified line cards because the system exceeds the power budget.
• policy-default: Perform the default actions of the policy being overridden.
• reload: Reload the system or a specific module.
• snmp-trap: Send out an SNMP trap.
• syslog: Generate a syslog message.

Preconfigured System Policies Available on Cisco MDS 9000 Family Switches

The following are some of the preconfigured system policies available in Cisco MDS 9000 Family switches:
• Zoning
  - __zone_dbsize_max_per_vsan
  - __zone_members_max_per_sw
  - __zone_zones_max_per_sw
  - __zone_zonesets_max_per_sw
• Fabric login (FLOGI)
  - __flogi_fcids_max_per_switch
  - __flogi_fcids_max_per_module
  - __flogi_fcids_max_per_intf
• Fibre Channel name server (FCNS)
  - __fcns_entries_max_per_switch

Refer to Appendix C for an example showing how to configure the event manager and specify a set of actions matching a particular event to your applet.
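As a conceptual aid only, the relationship among the three event manager components (event statements, action statements, and policies) can be modeled in a few lines of Python. This is not NX-OS code and does not reflect how EEM is implemented: the Policy class, the run_policies function, and the flogi_watch applet name are hypothetical, and the 4000 limit is taken from the __flogi_fcids_max_per_switch description in the overview.

```python
# Conceptual model of an EEM policy: one event paired with one or more actions.
# Hypothetical names throughout; this is a sketch, not the EEM implementation.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Policy:
    name: str                          # policy or applet name
    event: str                         # event statement keyword, e.g. "flogi"
    condition: Callable[[int], bool]   # when the event should match
    actions: List[str] = field(default_factory=list)  # action keywords


def run_policies(policies: List[Policy], event: str, value: int) -> List[str]:
    """Return 'policy: action' strings for every policy that matches."""
    fired = []
    for policy in policies:
        if policy.event == event and policy.condition(value):
            fired.extend(f"{policy.name}: {action}" for action in policy.actions)
    return fired


# System policy from the overview: syslog warning when FLOGIs exceed 4000.
system_policy = Policy("__flogi_fcids_max_per_switch", "flogi",
                       lambda count: count > 4000, ["syslog"])

# Hypothetical user-defined applet pairing the same event with two actions.
user_applet = Policy("flogi_watch", "flogi",
                     lambda count: count > 4000, ["syslog", "snmp-trap"])

if __name__ == "__main__":
    print(run_policies([system_policy], "flogi", 4001))
    print(run_policies([user_applet], "flogi", 4001))
    print(run_policies([system_policy], "flogi", 4000))
```

The model keeps events and actions as plain keywords so that it mirrors the option lists above; real applets are configured on the switch with the steps given in this section and the example in Appendix C.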
Supervisor Upgrade and Downgrade Support

All the monitoring features support supervisor upgrade and downgrade using Cisco In-Service Software Upgrade (ISSU). When an ISSU upgrade is performed, new counters may be added on the new software.

For More Information

http://www.cisco.com/go/mds
http://www.cisco.com/c/en/us/products/storage-networking/mds-9700-series-multilayer-directors/datasheet-listing.html

Appendix A: Remote Monitor Configuration

The example in this appendix configures an RMON alarm to monitor the active supervisor's average CPU use every 60 seconds with a rising threshold of 70 and a falling threshold of 60. A rising SNMP trap is generated if the CPU utilization reaches 70 percent.

    MDS9000# config t
    MDS9000(config)# rmon alarm 20 1.3.6.1.4.1.9.9.305.1.1.1 60 delta rising-threshold 70 5 falling-threshold 60 5 owner test

To enable RMON events, use the procedure in the following example. This example creates RMON event number 2 to define critical errors, and it generates a log entry when the event is triggered by the alarm. The user Test2 owns the row that is created in the event table by this command. This example also generates an SNMP trap when the event is triggered.

    MDS9000# config t
    MDS9000(config)# rmon event 2 log trap eventtrap description CriticalErrors owner Test2

Refer to the Cisco MDS 9000 Family MIB quick reference guide for a list of all MIBs supported on Cisco MDS 9000 Family switches. OIDs corresponding to individual MIBs can be used for monitoring using RMON alarms.

Appendix B: Port Monitor Configuration

The example in this appendix configures a PMON policy named port_health that polls for the link-reset receive (lr-rx) hardware port counter every 60 seconds.

    MDS9000# config t
    Enter configuration commands, one per line. End with CNTL/Z.
    MDS9000(config)# port-monitor name port_health
    MDS9000(config-port-monitor)# monitor counter lr-rx
    MDS9000(config-port-monitor)# counter lr-rx poll-interval 60 delta rising-threshold 5 event 4 falling-threshold 1 event 4 portguard flap

Alerting occurs when the specified conditions are met. In this example, assume that when PMON polls for this counter the first time, the line card returns an lr-rx counter value of 100. On the first poll of a delta counter, PMON stores the value and takes no action. On the second poll 60 seconds later, assume that the count is now 107. Because the rising threshold is configured as 5, a rising alarm is generated (107 – 100 > 5) and sent to RMON.

For a falling alarm to be generated, the delta between the values polled in two successive polls must be less than or equal to the configured falling threshold. In the preceding configuration, if the third PMON poll for the lr-rx counter returns a value greater than 107, no rising or falling alarm is generated. As long as the delta is greater than 1, no further alarms are generated. Only when the delta is 1 or less is the falling alarm generated. In addition, another rising-threshold alert can be generated only after the falling-threshold alert has been generated.

You can use the following show commands to view the policies configured using the PMON utility and their state:

    show port-monitor status
    show port-monitor active
    show port-monitor <policy-name>

To check whether the RMON component was notified by this alerting, use the following show commands:

    show rmon logs
    show rmon alarms

Normal Template PMON Configuration

You can copy and paste the following configuration commands in the Cisco MDS 9000 Family switch console to enable normal PMON-based monitoring.
Note: If you configure PMON to monitor all port types, be sure not to use a port-guard action to flap or error-disable the port.

    config t
    no port-monitor activate slowdrain
    port-monitor name pmon-normal-policy
      port-type all
      counter link-loss poll-interval 60 delta rising-threshold 5 event 4 falling-threshold 0 event 4
      counter sync-loss poll-interval 60 delta rising-threshold 5 event 4 falling-threshold 0 event 4
      counter signal-loss poll-interval 60 delta rising-threshold 5 event 4 falling-threshold 0 event 4
      counter invalid-words poll-interval 60 delta rising-threshold 5 event 4 falling-threshold 0 event 4
      counter invalid-crc poll-interval 60 delta rising-threshold 5 event 4 falling-threshold 0 event 4
      counter tx-discards poll-interval 60 delta rising-threshold 50 event 4 falling-threshold 0 event 4
      counter lr-rx poll-interval 60 delta rising-threshold 5 event 4 falling-threshold 0 event 4
      counter lr-tx poll-interval 60 delta rising-threshold 5 event 4 falling-threshold 0 event 4
      counter timeout-discards poll-interval 60 delta rising-threshold 200 event 4 falling-threshold 10 event 4
      counter credit-loss-reco poll-interval 1 delta rising-threshold 1 event 3 falling-threshold 0 event 3
      counter tx-credit-not-available poll-interval 1 delta rising-threshold 10 event 4 falling-threshold 0 event 4
      counter rx-datarate poll-interval 60 delta rising-threshold 80 event 4 falling-threshold 70 event 4
      counter tx-datarate poll-interval 60 delta rising-threshold 80 event 4 falling-threshold 70 event 4
      counter err-pkt-from-port poll-interval 60 delta rising-threshold 50 event 4 falling-threshold 10 event 4
      counter err-pkt-to-xbar poll-interval 60 delta rising-threshold 50 event 4 falling-threshold 10 event 4
      counter err-pkt-from-xbar poll-interval 60 delta rising-threshold 50 event 4 falling-threshold 10 event 4
      counter tx-slowport-count poll-interval 1 delta rising-threshold 5 event 3 falling-threshold 0 event 3
      counter tx-slowport-oper-delay poll-interval 1 absolute rising-threshold 50 event 5 falling-threshold 0 event 5
      counter txwait poll-interval 1 delta rising-threshold 40 event 4 falling-threshold 0 event 4

Aggressive Template PMON Configuration

You can copy and paste the following configuration commands in the Cisco MDS 9000 Family switch console to enable aggressive PMON-based monitoring.

Note: If you configure PMON to monitor all port types, be sure not to use a port-guard action to flap or error-disable the port.

    config t
    no port-monitor activate slowdrain
    port-monitor name pmon-aggressive-policy
      port-type all
      counter link-loss poll-interval 30 delta rising-threshold 5 event 4 falling-threshold 1 event 4
      counter sync-loss poll-interval 45 delta rising-threshold 5 event 4 falling-threshold 1 event 4
      counter signal-loss poll-interval 45 delta rising-threshold 5 event 4 falling-threshold 1 event 4
      counter invalid-words poll-interval 30 delta rising-threshold 5 event 4 falling-threshold 1 event 4
counter invalid-crc poll-interval 30 delta rising-threshold 5 event 4 falling-threshold 1 event 4
counter tx-discards poll-interval 45 delta rising-threshold 50 event 4 falling-threshold 10 event 4
counter lr-rx poll-interval 60 delta rising-threshold 3 event 4 falling-threshold 1 event 4
counter lr-tx poll-interval 60 delta rising-threshold 3 event 4 falling-threshold 1 event 4
counter timeout-discards poll-interval 60 delta rising-threshold 200 event 4 falling-threshold 100 event 4
counter credit-loss-reco poll-interval 1 delta rising-threshold 1 event 3 falling-threshold 0 event 3
counter tx-credit-not-available poll-interval 1 delta rising-threshold 10 event 3 falling-threshold 1 event 3
counter rx-datarate poll-interval 60 delta rising-threshold 90 event 4 falling-threshold 60 event 4
counter tx-datarate poll-interval 60 delta rising-threshold 90 event 4 falling-threshold 60 event 4
counter err-pkt-from-port poll-interval 60 delta rising-threshold 50 event 4 falling-threshold 40 event 4
counter err-pkt-to-xbar poll-interval 60 delta rising-threshold 50 event 4 falling-threshold 40 event 4
counter err-pkt-from-xbar poll-interval 60 delta rising-threshold 50 event 4 falling-threshold 40 event 4
counter tx-slowport-count poll-interval 1 delta rising-threshold 5 event 3 falling-threshold 0 event 3
counter tx-slowport-oper-delay poll-interval 1 absolute rising-threshold 40 event 5 falling-threshold 0 event 5
counter txwait poll-interval 1 delta rising-threshold 30 event 3 falling-threshold 0 event 3

Most Aggressive Template PMON Configuration

You can cut and paste the following configuration commands in the Cisco MDS 9000 Family switch console to enable the most aggressive type of PMON-based monitoring.

Note: If you configure PMON to monitor all port types, be sure not to use a port-guard action to flap or error-disable the port.
config t
no port-monitor activate slowdrain
port-monitor name pmon-mostaggressive-policy
port-type all
counter link-loss poll-interval 10 delta rising-threshold 5 event 4 falling-threshold 0 event 4
counter sync-loss poll-interval 30 delta rising-threshold 5 event 4 falling-threshold 0 event 4
counter signal-loss poll-interval 30 delta rising-threshold 5 event 4 falling-threshold 0 event 4
counter invalid-words poll-interval 30 delta rising-threshold 5 event 4 falling-threshold 0 event 4
counter invalid-crc poll-interval 10 delta rising-threshold 5 event 4 falling-threshold 0 event 4
counter tx-discards poll-interval 30 delta rising-threshold 50 event 4 falling-threshold 0 event 4
counter lr-rx poll-interval 30 delta rising-threshold 3 event 4 falling-threshold 0 event 4
counter lr-tx poll-interval 30 delta rising-threshold 3 event 4 falling-threshold 0 event 4
counter timeout-discards poll-interval 30 delta rising-threshold 200 event 4 falling-threshold 50 event 4
counter credit-loss-reco poll-interval 1 delta rising-threshold 1 event 3 falling-threshold 0 event 3
counter tx-credit-not-available poll-interval 1 delta rising-threshold 10 event 3 falling-threshold 0 event 3
counter rx-datarate poll-interval 30 delta rising-threshold 90 event 4 falling-threshold 50 event 4
counter tx-datarate poll-interval 30 delta rising-threshold 90 event 4 falling-threshold 50 event 4
counter err-pkt-from-port poll-interval 60 delta rising-threshold 50 event 4 falling-threshold 30 event 4
counter err-pkt-to-xbar poll-interval 60 delta rising-threshold 50 event 4 falling-threshold 30 event 4
counter err-pkt-from-xbar poll-interval 60 delta rising-threshold 50 event 4 falling-threshold 30 event 4
counter tx-slowport-count poll-interval 1 delta rising-threshold 5 event 3 falling-threshold 0 event 3
counter tx-slowport-oper-delay poll-interval 30 absolute rising-threshold 30 event 5 falling-threshold 0 event 5
counter txwait poll-interval 1 delta rising-threshold 30 event 3 falling-threshold 0 event 3

Appendix C: Cisco Embedded Event Manager Configuration

This appendix presents several sample NX-OS event manager policy scripts that you can use to monitor different elements of your Cisco MDS fabric. After you configure an Embedded Event Manager policy, it waits for the configured event to occur. When the event occurs, the policy can perform various actions related to it: print a syslog message, reload the supervisor, send an SNMP trap, or run a VSH CLI command, among others.

• Event manager applet to monitor CPU utilization by the Cisco MDS 9000 Family switch:

MDS9706-A# config t
Enter configuration commands, one per line. End with CNTL/Z.
MDS9000(config)# event manager applet high-cpu
MDS9000(config-applet)# event snmp oid 1.3.6.1.4.1.9.9.305.1.1.1 get-type exact entry-op ge entry-val 80 exit-op le exit-val 60 poll-interval 10
MDS9000(config-applet)# action 1.0 syslog msg HIGH_CPU TRIGGERED
MDS9000(config-applet)# action 2.0 cli show processes cpu sorted
MDS9000(config-applet)# exit
MDS9000(config)#

• Event manager applet to monitor the module temperature for a major threshold violation:

MDS9000(config)# event manager applet module-temp-major
MDS9000(config-applet)# event temperature module 1 threshold major
MDS9000(config-applet)# action 1.0 syslog msg Module 1 Major Temperature alarm
MDS9000(config-applet)# action 2.0 cli show environment
MDS9000(config-applet)# exit
MDS9000(config)#

• Event manager applet to monitor whether a module goes offline:

MDS9000(config)# event manager applet module-offline
MDS9000(config-applet)# event module status offline module 1
MDS9000(config-applet)# action 1.0 syslog msg Module 1 Went Offline !
MDS9000(config-applet)# action 2.0 cli show module internal event-history module 1
MDS9000(config-applet)# exit
MDS9000(config)#

• Event manager applet to monitor for the FLOGI limit per module, switch, or interface:

MDS9000(config)# event manager applet notify-flogi
MDS9000(config-applet)# event flogi ?
  intf-max      Event to configure maximum flogi per interface
  module-max    Event to configure maximum flogi per module
  switch-max    Event to configure maximum flogi per switch

You can use show event manager policy internal <EEM policy name> to display information about the configured policy.

© 2016 Cisco and/or its affiliates. All rights reserved. Cisco and the Cisco logo are trademarks or registered trademarks of Cisco and/or its affiliates in the U.S. and other countries. To view a list of Cisco trademarks, go to this URL: www.cisco.com/go/trademarks. Third-party trademarks mentioned are the property of their respective owners. The use of the word partner does not imply a partnership relationship between Cisco and any other company. (1110R) C11-736963-00 04/16
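As an illustration of the high-cpu applet in Appendix C, the following Python sketch models how an entry condition (entry-op ge entry-val 80) and an exit condition (exit-op le exit-val 60) might interact across successive polls. It is a simplified model and not NX-OS code: the class HysteresisEvent, the helper fired_polls, and the sample CPU values are hypothetical, and the sketch assumes that once the entry condition fires, the event does not fire again until the exit condition has been met.

```python
from dataclasses import dataclass


@dataclass
class HysteresisEvent:
    """Illustrative model of an entry/exit threshold pair (not NX-OS code)."""
    entry_val: float    # fire when a sample is >= entry_val (entry-op ge)
    exit_val: float     # re-arm when a sample is <= exit_val (exit-op le)
    armed: bool = True  # assumption: the event starts out armed

    def sample(self, value: float) -> bool:
        """Feed one polled value; return True if the event fires on this poll."""
        if self.armed and value >= self.entry_val:
            self.armed = False  # suppress repeats until the exit condition is met
            return True
        if not self.armed and value <= self.exit_val:
            self.armed = True   # exit condition met: allow the next trigger
        return False


def fired_polls(event, samples):
    """Return the indexes of the polls on which the event fired."""
    return [i for i, value in enumerate(samples) if event.sample(value)]


if __name__ == "__main__":
    # Hypothetical CPU utilization readings, one per poll interval.
    cpu = [50, 85, 90, 70, 82, 55, 81]
    print(fired_polls(HysteresisEvent(entry_val=80, exit_val=60), cpu))
```

With these sample readings the model fires on the second poll, stays quiet while utilization remains above 60, and fires again only after a reading of 55 has re-armed it. The rising-threshold and falling-threshold pairs in the PMON templates in Appendix B can be read in a similar way, although the exact alerting behavior depends on the NX-OS release.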