Monitoring and Alerting in Cisco MDS Fabric White Paper

Cisco Public
© 2016 Cisco and/or its affiliates. All rights reserved.
Contents
About This Document
Supported Hardware and Software
Introduction
Cisco Prime Data Center Network Manager
Remote Monitoring
Overview
RMON Alarms
RMON Events
Configuring RMON
Guidelines and Limitations
Enabling RMON
Setting RMON Alarms
Verifying RMON
Using RMON to Monitor MIBs
Port Monitor
Overview
Comparison of PMON and RMON
Architectural Components
Configuring PMON
Verifying PMON
Configuring PMON Alerting Mechanisms
Configuring Alerting Based on Poll Interval
Configuring Alerting Based on Check Interval
PMON Counters
Cisco Embedded Event Manager
Overview
Defining Cisco Embedded Event Manager Policy
Defining Events and Actions
Supervisor Upgrade and Downgrade Support
For More Information
Appendix A: Remote Monitor Configuration
Appendix B: Port Monitor Configuration
Normal Template PMON Configuration
Aggressive Template PMON Configuration
Most Aggressive Template PMON Configuration
Appendix C: Cisco Embedded Event Manager Configuration
About This Document
This document describes the mechanisms available to monitor Cisco® MDS fabric
to effectively check SAN fabric health. It discusses important hardware and software
counters present in Cisco NX-OS Software and the values that are built into 8- and
16-Gbps Cisco MDS 9000 Family switches. It describes the counters, the techniques
employed to monitor them, the monitoring capabilities available, and the configurable
threshold limits for the counters. It also provides in-depth discussions of the three main
tools available to monitor Cisco MDS fabric: Remote Monitor (RMON), Port Monitor
(PMON), and Cisco Embedded Event Manager (EEM). The document discusses how to
configure each tool and when to use each one.
SAN administrators can use the policies and threshold recommendations provided in
this document as guidelines for monitoring their SAN fabrics for overall health and for
setting alerts so that they are informed if a particular counter or setting exceeds its
preset threshold value.
The appendixes at the end of this document provide sample configurations and
command-line interface (CLI) commands that can be used to deploy the various
monitoring techniques on Cisco MDS 9000 Family switches.
Supported Hardware and Software
This document supports the hardware platforms and NX-OS software releases listed in Table 1. The features
discussed in this document may be available on other platforms or NX-OS releases as well, but no attempt
has been made to cover them in this document. This document is based on NX-OS Release 6.2(13) and
Cisco Prime™ Data Center Network Manager (DCNM) Release 7.2(1).
Table 1. Supported Hardware and Cisco NX-OS Software Releases
Cisco MDS 9000 Family Platforms | Model
Cisco MDS 9000 Family 16-Gbps platforms | Cisco MDS 9700 Series Multilayer Directors with DS-X9448-768K9 line card
Cisco MDS 9000 Family 16-Gbps platforms | Cisco MDS 9396S 16G Multilayer Fabric Switch
Cisco MDS 9000 Family 16-Gbps platforms | Cisco MDS 9148S 16G Multilayer Fabric Switch
Cisco MDS 9000 Family 16-Gbps platforms | Cisco MDS 9250i Multilayer Fabric Switch
Cisco MDS 9000 Family advanced 8-Gbps platforms | Cisco MDS 9500 Series Multilayer Directors with DS-X92xx-256K9 line cards
Cisco MDS 9000 Family 8-Gbps platforms | Cisco MDS 9500 Series Multilayer Directors with MDS 9148 Multilayer Fabric Switch and DS-X9248-48K9 and DS-X92xx-96K9 line cards
Introduction
In today’s SAN deployments, finding the correlation
between a network event and its root cause can
be difficult. Without proactively monitoring the
SAN fabric and being informed about events as
they occur, organizations have difficulty preventing
incidents in the network. Incidents may be simple,
such as the flapping of one member link of a port
channel interface on a Cisco MDS 9000 Family
switch, or complex, such as application performance
degradation due to a slow-performing device or host
bus adapter (HBA) somewhere in the network.
Monitoring fabricwide events, various port
parameters, Small Form-Factor Pluggable (SFP)
transceiver settings, system CPU and memory values,
and other environmental parameters permits early
fault detection and isolation of such network events.
It also enables the SAN administrator to effectively
measure performance on the SAN fabric network.
Benefits include:
• Improved high availability of the network through
reduced downtime
• Effective root-cause analyses of a network event
• Alerts that are delivered proactively, rather than
reactively, from counters and events that are
monitored
• Reduced troubleshooting time through notifications
from alerts and traps
Cisco MDS 9000 Family switches track a variety
of SAN fabric metrics, counters, and events in real
time. These counters are made available by the
MDS platform for use by native NX-OS applications
such as PMON, RMON, and Cisco Embedded Event
Manager (EEM). A few of these counters are exposed
in their raw Simple Network Management Protocol
(SNMP) object identifier (OID) format for monitoring
or are made available through various system
policies.
SAN administrators can use existing templates
and policy scripts (for example, the default PMON
policy) or write custom policy scripts that can be
applied to their SAN fabric switches to help improve
monitoring, alerting, and troubleshooting for their
SAN infrastructure. These system policies, scripts,
counters, and tools can be extended to form a
monitoring and alerting framework that can be
developed and deployed by SAN administrators on
their fabric network from day one.
Continuous monitoring of fabricwide events, ports,
system resources, and environmental parameters
enables early fault detection and helps keep the
network healthy.
Cisco Prime Data Center Network Manager
Cisco Prime DCNM is a data center network
monitoring tool that monitors events occurring
across the network on multiple switches. It can
manage both SAN and LAN networks, and it can
correlate events across multiple switches and
proactively alert users of these fabricwide events.
SAN administrators should deploy and use the
network manager to gain fabricwide visibility,
perform diagnosis and troubleshooting at the fabric
level, analyze trending statistics, generate reports,
and perform forecasting for the network, among
other tasks. Administrators can also use the network
manager to poll a specific management information
base (MIB) or OID and plot the values in a graph
over an extended period of time. With Cisco Prime
DCNM Release 7.2(1) and later, administrators can
choose to deliver only network events of interest
and suppress unwanted events and alerts.
Remote Monitoring
Overview
Remote Monitoring (RMON) is an IETF standard,
SNMP-based monitoring specification that allows
various network agents and console systems to
exchange network monitoring data. NX-OS supports
RMON alarms, events, and logs, which you can use
to monitor Cisco MDS 9000 Family switches running
NX-OS.
All switches in the Cisco MDS 9000 Family support
the RMON functions discussed in this section (and
defined in RFC 2819).
RMON Alarms
Each alarm monitors a specific MIB object for
a specified interval. When the MIB object value
exceeds a specified value (rising threshold),
the alarm condition is set and only one event is
triggered regardless of how long the condition
exists. When the MIB object value falls below a
certain value (falling threshold), the alarm condition
is cleared. Clearing allows the alarm to be triggered
again the next time the rising threshold is crossed.
When you create an alarm, you specify the following parameters:
• MIB object to monitor
• Sampling interval: The interval that NX-OS uses to collect a sample value of the MIB object
• Sample type: Absolute or delta
-- Absolute samples take the current snapshot of the MIB object value.
-- Delta samples take two consecutive samples and calculate the difference between them.
• Rising threshold: The value at which NX-OS triggers a rising alarm or resets a falling alarm
• Falling threshold: The value at which NX-OS triggers a falling alarm or resets a rising alarm
• Event: The action that NX-OS takes when an alarm (rising or falling) is triggered
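The alarm behavior described above (one event per threshold crossing, re-armed only by the opposite threshold) can be sketched in Python. This is an illustrative model only, not NX-OS code: the class name RmonAlarmSketch and its fields are hypothetical, the comparison uses "greater than or equal to" for the rising threshold, and both directions are assumed to start armed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RmonAlarmSketch:
    """Hypothetical model of one RMON alarm; not an NX-OS or SNMP API."""
    rising_threshold: int
    falling_threshold: int
    sample_type: str = "delta"         # "absolute" or "delta"
    rising_armed: bool = True          # assumption: both directions start armed
    falling_armed: bool = True
    previous_raw: Optional[int] = None # last raw sample, used in delta mode

    def sample(self, raw: int) -> Optional[str]:
        """Feed one raw MIB value; return "rising", "falling", or None."""
        if self.sample_type == "delta":
            if self.previous_raw is None:
                # The first sample only establishes the baseline.
                self.previous_raw = raw
                return None
            value = raw - self.previous_raw
            self.previous_raw = raw
        else:
            value = raw
        if value >= self.rising_threshold and self.rising_armed:
            # Fire once, then wait for the falling threshold to re-arm.
            self.rising_armed, self.falling_armed = False, True
            return "rising"
        if value <= self.falling_threshold and self.falling_armed:
            self.falling_armed, self.rising_armed = False, True
            return "falling"
        return None
```

For example, with a rising threshold of 15 and a falling threshold of 0 in delta mode, a counter that stays above the rising threshold for several samples produces a single rising event until a sample with no change clears the condition.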
RMON Events
You can associate a particular event with each RMON alarm. RMON supports the following event types:
• SNMP notification: Sends an SNMP risingAlarm or fallingAlarm notification when the associated alarm is
triggered
• Log: Adds an entry in the RMON log table when the associated alarm is triggered
• Both: Sends an SNMP notification and adds an entry in the RMON log table when the associated alarm is
triggered
You can specify different events for falling alarms and rising alarms.
Configuring RMON
RMON is disabled by default, and no events or alarms are configured in the switch. You can configure your
RMON alarms and events by using the CLI or an SNMP-compatible network management station.
Guidelines and Limitations
The Cisco MDS 9000 Family MIB files can be obtained through FTP from
http://www.cisco.com/public/sw-center/netmgmt/cmtk/mibs.shtml, under Cisco Storage Networking. Cisco and
IETF MIBs are updated frequently. You should download the latest MIBs from
http://www.cisco.com/public/sw-center/netmgmt/cmtk/mibs.shtml whenever you upgrade your Cisco MDS 9000
NX-OS Software.
RMON has the following limitations:
• You must configure an SNMP user and a notification receiver to use the SNMP notification event type.
• You can configure an RMON alarm only on a MIB object that resolves to an integer.
When you configure an RMON alarm, the object identifier must be complete, with its index, so that it
refers to only one object. For example, 1.3.6.1.4.1.9.9.109.1.1.1.1.8 corresponds to cpmCPUTotal5minRev, and
.1 corresponds to index cpmCPUTotalIndex, which creates object identifier 1.3.6.1.4.1.9.9.109.1.1.1.1.8.1.
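As a small illustration of the "complete OID with index" rule, the following Python sketch joins a column OID and an instance index, and splits them again. The helper names full_oid and split_instance are hypothetical; the OIDs used in the usage note are the ones that appear elsewhere in this document.

```python
from typing import Tuple


def full_oid(column_oid: str, index: int) -> str:
    """Append an instance index to a column OID so it names exactly one object."""
    return f"{column_oid.rstrip('.')}.{index}"


def split_instance(oid: str) -> Tuple[str, int]:
    """Split a complete OID into (column OID, single-integer index)."""
    column, index = oid.rstrip(".").rsplit(".", 1)
    return column, int(index)
```

For example, full_oid("1.3.6.1.2.1.2.2.1.14", 16777216) yields the interface-level OID used in the RMON alarm example, and full_oid("1.3.6.1.4.1.9.9.109.1.1.1.1.6", 1) yields the CPU utilization OID listed in Table 3.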
Enabling RMON
RMON is disabled by default. To enable it, use the following commands:
MDS9000# config t
MDS9000(config)# snmp-server enable traps rmon
Setting RMON Alarms
RMON alarms help configure rising- and falling-threshold values for a specific MIB object using OIDs. If the
OID that is configured is related to port counters, then the monitoring will apply to the entire port set available
in the system.
The following example shows configuration of RMON alarm number 20 to monitor the SNMP MIB object
1.3.6.1.2.1.2.2.1.14.16777216, where 1.3.6.1.2.1.2.2.1.14 represents the ifInErrors value and 16777216
represents the ifIndex value of the physical port. Monitoring occurs once every 60 seconds until the alarm is
disabled and checks the change in the variable’s rise or fall. If the value shows a MIB counter increase of 15
or more, the software triggers an alarm. The alarm in turn triggers event number 1, which is configured with
the RMON event command. Possible events include a log entry or an SNMP trap. If the MIB value changes
by 0, the alarm is reset and can be triggered again.
switch(config)# rmon alarm 20 1.3.6.1.2.1.2.2.1.14.16777216 60 delta rising-threshold 15 1 falling-threshold 0 owner test
RMON is a software process that runs on a supervisor module. There is no limit on the number of OIDs that
can be configured for monitoring purposes. The shortest interval for polling an SNMP counter is 1 second.
For example, if you want to monitor the CPU use percentage rise with a threshold of 80 percent, you can set
the polling frequency to once every 30 seconds.
Verifying RMON
Use the show rmon and show snmp commands to display the configured RMON and SNMP information.
Using RMON to Monitor MIBs
RMON can monitor all the MIBs that the platform supports. The MIB reference guide for Cisco MDS 9000
Family switches lists all the available MIBs that can be used for managing SAN fabric. Table 2 lists common
MIBs that the network administrator can use to monitor the SAN fabric. Table 3 lists MIBs for monitoring CPU
and memory utilization and their recommended thresholds.
Table 2. Common MIB OIDs for Monitoring on Cisco MDS 9000 Family Switches
Serial Number | MIB | Description | OID
1 | CISCO-CFS-MIB | Provides global-level control over the features in the system that support Cisco Fabric Services | 1.3.6.1.4.1.9.9.433.0.x
2 | CISCO-ENTITY-FRUCONTROL-MIB | Manages field-replaceable units (FRUs), such as power supplies, fans, and modules | 1.3.6.1.4.1.9.9.117.2.0.x
3 | CISCO-ENTITY-SENSOR-MIB | Provides sensor data for environmental monitors such as temperature gauges | 1.3.6.1.4.1.9.9.91.2.0.1
4 | CISCO-FSPF-MIB | Configures and monitors the Fabric Shortest Path First (FSPF) parameters on all VSANs configured on the local switch | 1.3.6.1.4.1.9.9.287.3.0.1
5 | CISCO-LICENSE-MGR-MIB | Manages license files on the switch | 1.3.6.1.4.1.9.9.369.3.0.x
6 | CISCO-SYSLOG-MIB | Collects system messages generated by the switch (These messages are typically sent to a syslog server. Use this MIB to retrieve SYSLOG-MIB system messages.) | 1.3.6.1.4.1.9.9.41.2.1
7 | CISCO-VSAN-MIB | Enables you to create and monitor VSANs | 1.3.6.1.4.1.9.9.282.1.3.0.1
8 | CISCO-PROCESS-MIB | Displays memory and process utilization | 1.3.6.1.4.1.9.9.109.x
9 | CISCO-ZS-MIB | Provides notifications related to zone changes in Cisco MDS 9000 Family switches | 1.3.6.1.4.1.9.9.294.1.x
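Table 3. CPU and Memory Utilization MIBs
Category | Description | OID | Template | Poll (seconds) | Rising | Falling
CPU | CPU utilization | 1.3.6.1.4.1.9.9.109.1.1.1.1.6.1 | | 60 | 80 | 60
CPU | CPU utilization | 1.3.6.1.4.1.9.9.109.1.1.1.1.6.1 | Normal | 60 | 80 | 60
CPU | CPU utilization | 1.3.6.1.4.1.9.9.109.1.1.1.1.6.1 | Aggressive | 30 | 70 | 60
Memory | Memory utilization | 1.3.6.1.4.1.9.9.305.1.1.2 | Normal | 90 | 80 | 50
Memory | Memory utilization | 1.3.6.1.4.1.9.9.305.1.1.2 | Aggressive | 60 | 80 | 50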
Port Monitor
Overview
Port Monitor, or PMON, is a monitoring feature available on Cisco MDS 9000 Family switches that can
monitor physical port–related critical counters and alert external monitoring software of any anomalies. The
configured entities are monitored in hardware, thereby saving CPU cycles on the supervisor. The software
component of PMON relies on RMON capabilities, which in turn rely on the SNMP component to generate
threshold messages whenever a monitored entity rises or falls beyond the set threshold value. Administrators
can optionally use Cisco Prime DCNM software to view these alerts and gain fabricwide control by identifying
nodes that are consuming network resources and hampering the flow of data traffic. The PMON feature
hides the complexity of SNMP OIDs and provides an easy-to-use CLI.
Cisco MDS 9000 Family switches have two predefined PMON policies: default and slowdrain. Administrators
can also create other custom policies for monitoring and alerting purposes. The slowdrain policy is
automatically active when you launch a Cisco MDS 9000 Family switch. Note that at any given time, only one
policy can be active in the system.
The default policy monitors the objects as shown in Table 4 for each physical port interface on the Cisco
MDS 9000 Family switch. This policy is not active and cannot be modified.
Table 4. Default PMON Policy
Serial Number | Hardware Counter | Interval (seconds) | Rising Threshold | Falling Threshold
1 | Link loss | 60 | 5 | 1
2 | Sync loss | 60 | 5 | 1
3 | Signal loss | 60 | 5 | 1
4 | Invalid words | 60 | 1 | 0
5 | Invalid CRCs | 60 | 5 | 1
6 | LR Rx | 60 | 5 | 1
7 | LR Tx | 60 | 5 | 1
8 | Timeout Discards | 60 | 200 | 10
9 | Credit Loss Reco | 1 | 1 | 0
10 | Tx Credit Not Available | 1 | 10% | 0%
11 | RX Datarate | 60 | 80% | 20%
12 | TX Datarate | 60 | 80% | 20%
13 | TX Discards | 60 | 200 | 10
14 | ASIC Error Pkt from port | 60 | 10 | 5
15 | ASIC Error Pkt from xbar | 50 | 10 | 5
16 | ASIC Error Pkt to xbar | 50 | 10 | 5
17 | Tx Slowport Count | 1 | 10 | 0
18 | Tx Slowport Oper Delay | 1 | 50 ms | 0 ms
19 | TxWait | 1 | 40% | 0%
The slowdrain PMON policy monitors two counters: Credit Loss Reco and Tx Credit Not Available, as
shown in Table 5. This policy is active by default. The Tx Credit Not Available counter is sampled at 100-ms
intervals. A 10 percent rising threshold indicates 10 percent of a second (100 milliseconds [ms]). If there are
zero remaining Tx B2B credits for 100 ms, then the rising-threshold criterion will be met.
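The percentage arithmetic above can be checked with a short Python sketch. The function names are hypothetical, and the sketch simply assumes that a percentage threshold is interpreted as a fraction of the polling interval, as the paragraph describes for the 1-second interval.

```python
def percent_to_ms(threshold_percent: float, poll_interval_s: float = 1.0) -> float:
    """Convert a percentage threshold into milliseconds of the polling interval."""
    return poll_interval_s * 1000.0 * threshold_percent / 100.0


def rising_threshold_met(zero_credit_ms: float,
                         threshold_percent: float = 10.0,
                         poll_interval_s: float = 1.0) -> bool:
    """True when the time spent at zero Tx B2B credits reaches the threshold."""
    return zero_credit_ms >= percent_to_ms(threshold_percent, poll_interval_s)
```

With the slowdrain defaults (1-second interval, 10 percent rising threshold), percent_to_ms(10) gives 100 ms, so a port that spends 100 ms of the interval at zero credits meets the rising-threshold criterion.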
Table 5. Slowdrain PMON Policy
Serial Number | Hardware Counter | Interval (seconds) | Rising Threshold | Falling Threshold
1 | Credit Loss Reco | 1 | 1 | 0
2 | Tx Credit Not Available | 1 | 10% | 0%
PMON allows end users to further define and modify new custom policies for monitoring access ports or
trunk ports, or both, on a Cisco MDS 9000 Family switch. You can also optionally use PMON to
error-disable or flap the port if the set threshold is met for a configured counter.
Note: Different counters are available depending on the types of Cisco MDS 9000 Family modules in the
system.
The unit for threshold values (rising and falling) differs across counters.
For all platforms, if the default value for txwait, tx-slowport-count, or tx-slowport-oper-delay is modified,
then in-service software downgrade (ISSD) will be restricted. To proceed with ISSD, use the no form of the
command to roll back to the default value.
Comparison of PMON and RMON
PMON and RMON can be configured to work in tandem to monitor various parameters on the Cisco
MDS 9000 Family switches. For example, you can configure RMON to monitor CPU and memory
statistics and use PMON to monitor hardware-specific counters related to a slow-drain condition.
PMON differs from RMON in monitoring port counters in the way that PMON processes events
and alarms on the ports. RMON is dependent on SNMP and runs only on supervisor CPUs. It listens
to SNMP-based events and traps from various components and relays this information to the
external management application.
PMON, in contrast, performs most computations on the port application-specific integrated circuit
(ASIC) present on the individual modules, and counters exposed by PMON features can be
monitored at frequencies of less than 1 second. If a counter exceeds its configured threshold value,
an alarm event will be generated by the line-card port ASIC and sent to the supervisor module. This
architectural design of PMON frees CPU time on the supervisor, relieving it from having to continuously
monitor the PMON counters. The supervisor CPU can instead be used to service other important
functions, such as servicing control traffic and running Cisco Generic Online Diagnostics (GOLD).
Architectural Components
Figure 1 shows the interaction of PMON components with other software components.
Figure 1. PMON Interaction with Various Software Modules
[Figure: The Port Monitor configuration (type, interval, threshold, event, and object) entered through the PMON CLI on the supervisor is kept in the configuration list and database in persistent storage and is broadcast to the Port Monitor agents on all line cards (LC1, LC2). The RMON and SNMP components on the supervisor connect to Cisco Prime DCNM, the external monitoring application.]
The PMON component on the supervisor interacts with the SNMP and RMON components to alert and send
traps to external monitoring software. It also interacts with the persistent storage system (PSS) component,
which stores the configuration list and policy details in a database. The PSS allows the PMON policy to be
retrieved if a switch reloads or fails.
When a PMON policy is configured on the supervisor, the policy is internally broadcast to all line cards using
internal NX-OS messaging. The PMON agent process on the line card is notified and then applies the policy,
according to the configured settings, to all the active ports on the module. Relevant hardware-assisted
timers are started to monitor the port counters according to the polling interval specified. If a counter
crosses its configured threshold, the PMON process on the line card informs the SNMP agent on the supervisor
of this violation. RMON also sends traps to any external monitoring application, such as Cisco Prime DCNM,
that is registered to receive such notifications from the switch. PMON-triggered notifications can also be
viewed in RMON by using the show rmon logs and show rmon alarm commands.
Configuring PMON
Here is a sample configuration to specify the PMON policy, monitor the link-loss counter, and activate the
policy.
MDS9000-A(config)# port-monitor name port-health-check
MDS9000-A(config-port-monitor)# monitor counter link-loss
MDS9000-A(config-port-monitor)# counter link-loss poll-interval 60 delta rising-threshold 5 event 4 falling-threshold 1 event 4 portguard errordisable
MDS9000-A(config-port-monitor)# port-monitor activate port-health-check
The preceding example configures a PMON policy named port-health-check to monitor the
link-loss counter every 60 seconds. The delta value specified in the configuration represents the difference
between the value from the current poll and the value from the previous poll performed by PMON 60
seconds earlier. If this value is greater than or equal to 5 between two successive polls, then a port-guard
action to error-disable the affected port is performed, and an event with the value 4 is generated to RMON.
Note: The port types all and access port include all F-ports and all F-port-channel interfaces. Also, the
port-guard action to flap or error-disable the port should not be used when monitoring an F-port-channel
interface connected to a network port virtualization (NPV) switch. If another PMON policy is already active
for the port type in question, then you will first have to deactivate that policy before you can activate a new
policy.
PMON can generate five levels of events. The levels reflect the importance of this event occurrence. Table 6
shows the event levels.
Table 6. PMON Events
Event | Description
1 | Fatal
2 | Critical
3 | Error
4 | Warning
5 | Informational
When configuring PMON, administrators should be sure to choose the right event ID levels so that
meaningful actions can be taken by the administrator or the third-party tool that receives these alerts.
Generally, events are classified as follows:
• An event level 1 alert signifies a general network failure or a situation in which a critical part of your network
is unreachable.
• An event level 2 alert signifies a critical situation in the network that needs immediate attention. For example,
such an alert might indicate that one of the modules of the Cisco MDS 9000 Family switch is down, or that
the flow of traffic has changed so that traffic that was earlier traversing the SAN A path has been switched
to a SAN B path.
• Event level 3, 4, and 5 alerts signify less critical situations. For example, these alerts may be used to indicate
that one of the member links in a port-channel interface has flapped, that the module temperature is rising,
or that the count of cyclic redundancy check (CRC) errors on an interface has increased.
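A receiving tool might map the event IDs in Table 6 to a handling priority along the lines of the classification above. The following Python sketch is one possible arrangement; the dictionary and function names are hypothetical, and the two urgency labels are an illustrative simplification of the three bullets.

```python
# Event IDs and descriptions as listed in Table 6.
PMON_EVENT_LEVELS = {
    1: "Fatal",
    2: "Critical",
    3: "Error",
    4: "Warning",
    5: "Informational",
}


def describe_event(event_id: int) -> str:
    """Return a short label for a PMON event ID and its handling urgency."""
    name = PMON_EVENT_LEVELS[event_id]
    # Levels 1 and 2 call for prompt action; levels 3 to 5 are less critical.
    urgency = "immediate attention" if event_id <= 2 else "less critical"
    return f"event {event_id} ({name}): {urgency}"
```

For example, the sample policy earlier in this section generates event 4, which this sketch labels as a less critical Warning.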
After the PMON policy is configured, enabled, and activated for monitoring, you can use the show
port-monitor port-health-check and show port-monitor active CLI commands to view the various PMON
counters and their configured thresholds.
To deactivate monitoring for the configured policy, use this command:
MDS9000(config)# no port-monitor activate port-health-check
Verifying PMON
You can use the following commands to check whether any Cisco MDS 9000 Family module has triggered
an alarm and whether PMON was able to inform RMON about generating an SNMP event.
MDS9000# show port-monitor active
MDS9000# show port-monitor <policy-name>
MDS9000# show port-monitor status
MDS9000# show rmon logs
Configuring PMON Alerting Mechanisms
PMON provides two ways to configure port counters for monitoring and alerting. You can configure
monitoring and alerting based on the poll interval or check interval.
Configuring Alerting Based on Poll Interval
Consider a PMON policy named pmon-test with the counter link-loss configured as shown here:
port-monitor name pmon-test
port-type all
counter link-loss poll-interval 30 delta rising-threshold 20 event 4 falling-threshold 0 event 4
This configuration sets up the PMON policy named pmon-test to monitor the link-loss counter every
30 seconds. Rising and falling thresholds are configured as 20 and 0, respectively. If the counter value crosses
a threshold setting, the rising or falling alert is generated and logged in RMON. If an external monitoring tool
such as Cisco Prime DCNM is configured to listen for SNMP traps, RMON will inform this tool of PMON alerts
through an SNMP trap.
Figure 2 and the following section show how poll-interval-based PMON alerting works internally and
generates rising or falling PMON alerts.
Figure 2. Poll-Interval-Based Alerting
[Figure: Counter value (0 to 120) plotted against time (0 to 240 seconds) for a rising threshold of 20, a falling threshold of 0, and a polling interval of 30 seconds. Two rising alarms and one falling alarm are marked on the curve.]
When you configure a PMON policy, the system begins polling the hardware counters internally. The PMON
agent process on the Cisco MDS 9000 Family module runs according to the configured poll-interval duration
(such as 30 seconds), reads the PMON counters, and performs validation checks against the configured rising
and falling alarm thresholds.
In the sample pmon-test policy configured in the example here, during the first polling interval (that is, at the
thirtieth second), PMON reads the counter’s absolute value from the module’s port ASIC and stores this value
internally as the base value. During the next polling cycle (that is, at the sixtieth second), PMON polls
again, reads the counter value, and stores this value internally as the current value. A validation check is then
performed between the counter’s current value and the base value, which determines whether an alarm needs
to be raised.
If the delta between the current counter value and the previous base counter value is greater than the rising-threshold value configured, then a rising alarm is generated.
[(Current_value – Base_value) > Rising threshold] = Generate rising alarm
Similarly, if the delta between the current and previous counters is less than or equal to the falling-threshold
value configured, then a falling alarm is generated.
[(Current_value – Base_value) <= Falling threshold] = Generate falling alarm
Note: The base value is overwritten with the counter’s current value during each polling cycle by PMON. The
validation always is performed against the base value during each polling cycle. Two successive rising alarms
or falling alarms will not be generated unless an alarm of the other type occurs between them.
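The poll-interval logic above can be sketched in Python as follows. This is an illustration under stated assumptions, not the PMON implementation: the function name and the sample readings are hypothetical, and the sketch assumes that a falling alarm is possible only after a rising alarm has occurred.

```python
from typing import List, Optional, Sequence, Tuple


def poll_interval_alerts(readings: Sequence[int], rising: int, falling: int,
                         poll_interval: int = 30) -> List[Tuple[int, str]]:
    """Return (time in seconds, alarm type) pairs for absolute counter readings.

    readings[0] is the value read at the first poll, readings[1] at the second,
    and so on.
    """
    alerts: List[Tuple[int, str]] = []
    base: Optional[int] = None
    last_alarm = "falling"  # assumption: no falling alarm before the first rising one
    for poll_number, current in enumerate(readings, start=1):
        if base is not None:
            delta = current - base
            if delta > rising and last_alarm != "rising":
                last_alarm = "rising"
                alerts.append((poll_number * poll_interval, "rising"))
            elif delta <= falling and last_alarm != "falling":
                last_alarm = "falling"
                alerts.append((poll_number * poll_interval, "falling"))
        # The base value is overwritten with the current value at every poll.
        base = current
    return alerts
```

With the pmon-test thresholds (rising 20, falling 0) and hypothetical readings of 0, 25, 60, 60, 100, and 100, the sketch reports a rising alarm at 60 seconds, a falling alarm at 120 seconds, a rising alarm at 150 seconds, and a falling alarm at 180 seconds. The poll at 90 seconds produces nothing because no falling alarm separates it from the previous rising alarm.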
Configuring Alerting Based on Check Interval
Beginning with Cisco MDS NX-OS Release 6.2(15), in addition to the rising and falling thresholds, the user
can configure a new threshold, called the warning threshold, to generate syslog messages.
In poll-interval-based alerting, hardware counters are polled only at the configured polling intervals, so alerts
are triggered only at those intervals. If you need a more fine-grained alerting mechanism, consider
check-interval-based alerting.
This approach introduces a new global configuration command that checks the counter values between
the configured polling intervals and sends rising or falling alerts if the threshold conditions are met:
port-monitor check-interval <interval>
where <interval> is the duration specified in seconds. The minimum interval value that can be set is
5 seconds, and it can be increased in increments of 1 second.
To illustrate the alerting mechanism using this approach, consider a similar Port Monitor (PMON)
policy named pmon-test for monitoring the link-loss counter, as configured below. The check-interval is
configured with a value of 15 seconds and the warning-threshold with a value of 10. The warning-threshold
is an optional parameter. It can be configured with a value between the rising- and falling-threshold values.
It can be used to inform the administrator, through syslog message generation, that the monitored counter
is nearing its rising or falling threshold value.
port-monitor check-interval 15
port-monitor name pmon-test
port-type all
counter link-loss poll-interval 30 delta rising-threshold 20 event 4 warning-threshold 10 falling-threshold 0 event 4
Figure 3 illustrates check-interval-based alerting.
Figure 3. Check-Interval-Based Alerting
[Figure: Counter value plotted against time in seconds for a rising threshold of 20, a falling threshold of 0, a warning threshold of 10, a polling interval of 30 seconds, and a check interval of 15 seconds. Time-axis ticks: 30, 45, 60, 75, 90, 105, 120, 135, 150. Counter values shown: 25, 40, 55, 55, 60, 75, 100, 100, 115. Points on the curve are labeled as rising threshold, falling threshold, warning threshold (upward), and warning threshold (downward).]
As shown in Figure 3, in this approach PMON performs validation checks at every configured check-interval
period and at every poll-interval period. Calculations to generate alarms in this method are similar to those
shown for poll-interval-based alerting, as shown here.
If the delta between the current counter value and the previous base counter value is greater than the rising-threshold value configured, then a rising alarm is generated.
[(Current_value – Base_value) > Rising threshold] = Generate rising alarm
Similarly, if the delta between the current counter value and the previous counter value is less than or equal to
the falling-threshold value configured, then a falling alarm is generated.
[(Current_value – Base_value) <= Falling threshold] = Generate falling alarm
In addition, if the delta reaches the warning threshold, a syslog message is generated:
[(Current_value – Base_value) >= Warning threshold] = Generate a syslog message
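The three comparisons above, together with the schedule of validation checks, can be sketched in Python. The function names and outcome labels are hypothetical. The sketch evaluates a single check and deliberately leaves out how PMON tracks the base value between checks and how it suppresses repeated alarms, because this document does not specify those details for check-interval-based alerting.

```python
from typing import List, Optional


def validation_times(poll_interval: int, check_interval: int, duration: int) -> List[int]:
    """Times (seconds) at which a check runs: every check interval and every poll interval."""
    checks = set(range(check_interval, duration + 1, check_interval))
    polls = set(range(poll_interval, duration + 1, poll_interval))
    return sorted(checks | polls)


def evaluate(current: int, base: int, rising: int, falling: int,
             warning: Optional[int] = None) -> List[str]:
    """Apply the rising, falling, and optional warning comparisons to one check."""
    delta = current - base
    outcomes: List[str] = []
    if warning is not None and delta >= warning:
        outcomes.append("warning-syslog")
    if delta > rising:
        outcomes.append("rising-alarm")
    if delta <= falling:
        outcomes.append("falling-alarm")
    return outcomes
```

With the sample policy (poll interval 30, check interval 15), validation_times(30, 15, 60) returns checks at 15, 30, 45, and 60 seconds. A delta of 15 against thresholds of 20, 0, and 10 produces only the warning syslog, while a delta of 30 produces both the warning syslog and the rising alarm.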
Sample syslog messages generated by PMON for the rising, falling, and warning thresholds are shown below:
MDS9000(config)#
2016 Jan 6 06:51:35 sw-apex-40 %PMON-SLOT2-4-WARNING_THRESHOLD_REACHED_UPWARD:
Invalid Words has reached warning threshold in the upward direction (port fc2/18
[0x1091000], value = 30).
2016 Jan 6 06:51:35 sw-apex-40 %PMON-SLOT2-3-RISING_THRESHOLD_REACHED: Invalid
Words has reached the rising threshold (port=fc2/18 [0x1091000], value=30).
2016 Jan 6 06:51:36 sw-apex-40 %SNMPD-3-ERROR: PMON: Rising Alarm Req for Invalid
Words counter for port fc2/18(1091000), value is 30 [event id 1 threshold 30
sample 2 object 4 fcIfInvalidTxWords]
2016 Jan 6 06:52:21 sw-apex-40 %PMON-SLOT2-5-WARNING_THRESHOLD_REACHED_DOWNWARD:
Invalid Words has reached warning threshold in the downward direction (port
fc2/18 [0x1091000], value = 0).
2016 Jan 6 06:52:21 sw-apex-40 %PMON-SLOT2-5-FALLING_THRESHOLD_REACHED: Invalid
Words has reached the falling threshold (port=fc2/18 [0x1091000], value=0).
2016 Jan 6 06:52:21 sw-apex-40 %SNMPD-3-ERROR: PMON: Falling Alarm Req for
Invalid Words counter for port fc2/18(1091000), value is 0 [event id 2 threshold
0 sample 2 object 4 fcIfInvalidTxWords]
PMON Counters
The Cisco MDS 9000 Family switches include a set of hardware counters. Advanced 8- and 16-Gbps and
higher Cisco MDS 9000 Family modules include additional counters that are specific to the line-card type
being used.
To view the list of available PMON counters on the software release you are running, use the following
command:
MDS9000(config)# port-monitor name <pmon_policy_name>
MDS9000(config-port-monitor)# monitor counter ?
  credit-loss-reco          Configure credit loss recovery counter
  err-pkt-from-port         Configure err-pkt-from-port counter
  err-pkt-from-xbar         Configure err-pkt-from-xbar counter
  err-pkt-to-xbar           Configure err-pkt-to-xbar counter
  invalid-crc               Configure invalid-crc counter
  invalid-words             Configure invalid-words counter
  link-loss                 Configure link-failure counter
  lr-rx                     Configure the number of link resets received by the fc-port
  lr-tx                     Configure the number of link resets transmitted by the fc-port
  rx-datarate               Configure rx performance counter
  signal-loss               Configure signal-loss counter
  sync-loss                 Configure sync-loss counter
  timeout-discards          Configure timeout discards counter
  tx-credit-not-available   Configure credit not available counter
  tx-datarate               Configure tx performance counter
  tx-discards               Configure tx discards counter
  tx-slowport-count         Configure slow port sub-100ms counter
  tx-slowport-oper-delay    Configure slow port operation delay
  txwait                    Configure tx total wait counter
This section describes the PMON counters
available, the CLI show commands for the counters,
and the recommended threshold values configured
in three templates: Normal, Aggressive, and Most
Aggressive. The CLI configuration commands for
these templates are also shown in Appendix B at
the end of this document.
PMON offers various counters to monitor Cisco MDS
fabric and provide notifications if events exceed the
configured threshold values. These counters range
from generic interface counters such as receive
(Rx) and transmission (Tx) rate counters to more
platform-specific counters such as TxWait and Tx
Slowport Count, for slow-drain detection. If you are
experiencing slowness in your network, applications
that are not performing up to their potential, or
congestion on network ports, a prudent approach
is to deploy PMON on all Cisco SAN switches
and monitor the counters for potential slow-drain
conditions. By integrating PMON with Cisco Prime
DCNM, system administrators can be alerted
proactively and notified whenever a configured
counter exceeds its set threshold value.
PMON is easy to deploy. System administrators
can quickly configure a PMON policy using any of
the three templates (Normal, Aggressive, and Most
Aggressive) and start monitoring their SAN fabric
from day one.
A general guideline is to first deploy a Normal PMON
policy on the SAN fabric and monitor for alerts or
events for a few weeks. If the network is stable and
does not result in alerts or event notifications and
yet you continue to see application degradation,
then you can move to the next level of monitoring
and deploy a PMON policy using the Aggressive
template and then, if necessary, the Most Aggressive
template.
PMON also provides a facility to adjust individual
counter threshold settings when you deploy the policy.
It also provides mechanisms for configuring alert and
notification actions such as error-disabling a port if
a configured counter exceeds its threshold settings.
In Appendix B, the port-guard action is set to the
default value for all the templates: that is, neither
flapping nor error-disabling the physical port. The
policy will alert RMON, which can then trigger
alerts to the management application and notify the
administrator if a threshold breach occurs on the
Cisco MDS 9000 Family switch.
Table 7 lists the 19 available hardware counters
and the recommended threshold settings for each
counter value. These values are further divided into
three templates: Normal, Aggressive, and Most
Aggressive. Polling intervals are in seconds unless
a unit appears alongside the value.
Table 7. Hardware Counters

                                 Normal               Aggressive           Most Aggressive
Counter Name                     Rise   Fall   Poll   Rise   Fall   Poll   Rise   Fall   Poll
Signal Loss                      5      0      60     5      1      45     5      0      30
Sync Loss                        5      0      60     5      1      45     5      0      30
Link Loss                        5      0      60     5      1      30     5      0      10
Invalid CRC                      5      0      60     5      1      30     5      0      10
Invalid Words                    5      0      60     5      1      30     5      0      10
Tx Discards                      50     0      60     50     10     45     50     0      30
LR Rx                            5      0      60     4      1      60     3      0      30
LR Tx                            5      0      60     4      1      60     3      0      30
Timeout Discard                  200    10     60     200    100    60     200    50     30
Credit Loss Reco                 1      0      1      1      0      1      1      0      1
Tx Credit Not Available          10%    0      1      10%    0      1      10%    0      1
Rx Datarate                      80%    70%    60     90%    60%    60     90%    50%    30
Tx Datarate                      80%    70%    60     90%    60%    60     90%    50%    30
ASIC Error from Port [3]         50     10     60     50     40     60     50     30     60
ASIC Error Pkt to Xbar [3]       50     10     50     50     40     60     50     30     60
ASIC Error Pkt From Xbar [3]     50     10     50     50     40     60     50     30     60
Tx Slowport Count [1]            5      0      1      5      0      1      5      0      1
Tx Slowport Oper Delay [2]       50 ms  0      1      40 ms  0      1      30 ms  0      1
TxWait                           40%    0      1      30%    0      1      20%    0      1
Table 8 provides a support matrix for slow-drain-specific counters for PMON.
Table 8. PMON Slow-Drain Counters

Counter Name                  Supported Platforms                                Minimum Cisco NX-OS Release
credit-loss-reco              All                                                Release 5.0.0 or 6.0.0
lr-rx                         All                                                Release 5.0.0 or 6.0.0
lr-tx                         All                                                Release 5.0.0 or 6.0.0
timeout-discards              All                                                Release 5.0.0 or 6.0.0
tx-credit-not-available       All                                                Release 5.0.0 or 6.0.0
tx-discards                   All                                                Release 5.0.0 or 6.0.0
tx-slowport-count [1]         Cisco MDS 9500 Series only with DS-X9248-48K9,     Release 6.2(13)
                              DS-X9224-96K9, or DS-X9248-96K9 line card
tx-slowport-oper-delay [2]    Cisco MDS 9700 Series, 9396S, 9148S, 9250i,        Release 6.2(13)
                              and 9500 Series only with DS-X9232-256K9 or
                              DS-X9248-256K9 line card
txwait                        Cisco MDS 9700 Series, 9396S, 9148S, 9250i,        Release 6.2(13)
                              and 9500 Series only with DS-X9232-256K9 or
                              DS-X9248-256K9 line card
[1] The slow-port-monitoring feature must be enabled for this counter to work. This counter increments only
if the amount of time that an interface is at 0 Tx buffer-to-buffer (B2B) credits is equal to or greater than
the configured slow-port-monitoring (administrative) delay. It increments at most once per 100 ms, or
10 times per second.
Use the following configuration command to enable slow-port monitoring:
MDS9700(config)# system timeout slowport-monitor ?
  <1-500>   Configure number of milliseconds
  default   Default timeout value for HW slowport monitoring
MDS9700(config)# system timeout slowport-monitor default ?
  mode      Enter the port mode
MDS9700(config)# system timeout slowport-monitor default mode ?
  E         E mode
[2] The slow-port-monitoring feature must be enabled for this counter to work. The threshold is exceeded only
if the slow-port-monitoring administrative delay (the configured value) is less than the configured rising
threshold and the reported operation delay (oper-delay) is more than the configured rising threshold.
[3] These counters work on 4- and 8-Gbps Cisco MDS 9000 Family switches. On 16-Gbps Cisco MDS 9000
Family switches, these counters are not operable.
The remainder of this section provides detailed descriptions of the PMON counters, their associated switch
CLI commands, and their threshold values.
Counter: Signal Loss (signal-loss)
Description: Signal loss typically occurs at the physical transport medium. If the laser is off or the
optical medium is down, then the counter is incremented. The optical ASIC on Cisco MDS 9000 Family
switches detects this condition and raises a firmware interrupt to flag it in the underlying Layer 1
signaling. The usual result of this event is that the link must be reset and the port must be
reinitialized. The loss of signal usually indicates a hardware problem.
Show command: show interface x/y counters detail | include ignore signal
Threshold         Poll (Sec)   Rising   Falling
Normal            60           5        1
Aggressive        45           5        1
Most Aggressive   30           5        1
Counter: Sync Loss (sync-loss)
Description: This counter tracks the loss of the synchronization (sync) bit pattern when a frame is received
from the peer Fibre Channel switch. In Fibre Channel, if the frame that is transmitted is encoded in
66-bit format, the first 2 bits form the sync bits and the remaining 64 bits form the data. Upon receiving
a frame, the receiver locks itself to a sync-bit pattern. During data transfer, if any packet is received
in which the sync pattern is different from the one to which the receiver is locked, then the counter
goes high, indicating noise in the transmitted data. Generally a sync loss is followed by a logical reset
of link reset (LR), link reset response (LRR), and IDLE. Sync-loss errors frequently are the result of a
faulty transceiver or cable.
Show command: show interface x/y counters detail | include ignore sync
Threshold         Poll (Sec)   Rising   Falling
Normal            60           5        1
Aggressive        45           5        1
Most Aggressive   30           5        1
Counter: Link Loss (link-loss)
Description: Link loss occurs whenever a port flaps. For example, if a peer sends an OLS or NOS
signal to the switch, the counter is incremented. Link loss can trigger a series of other hardware
counter updates as well. Both physical and hardware problems can cause link failures. Link failures
frequently occur due to a sync loss or signal loss. A port or link flap causes other associated PMON
counters to increment as well (sync loss, signal loss, link reset, etc.).
Show command: None
Threshold         Poll (Sec)   Rising   Falling
Normal            60           5        1
Aggressive        45           5        1
Most Aggressive   30           5        1
Counter: Invalid CRC (invalid-crc)
Description: In Fibre Channel, words are packaged inside a frame. If bit-level corruption occurs at the
frame level (for example, a cyclic redundancy check [CRC] failure or a checksum mismatch), then the
invalid-crc counter is incremented. The CRC field is 32 bits long and is inserted before the
end-of-frame (EOF) delimiter in the Fibre Channel frame format.
Show command: show interface x/y | include ignore crc
Threshold         Poll (Sec)   Rising   Falling
Normal            60           5        1
Aggressive        30           5        1
Most Aggressive   10           5        1
Counter: Invalid Words (invalid-words)
Description: A checksum and signal integrity check occurs for every word received on a Cisco
MDS 9000 Family switch. The invalid-words counter is incremented each time a Fibre Channel word is
detected with checksum or bit-level errors.
Show command: show interface x/y counter detail | include ignore words
Threshold         Poll (Sec)   Rising   Falling
Normal            60           5        1
Aggressive        30           5        1
Most Aggressive   10           5        1
Counter: Tx Discards (tx-discards)
Description: This counter increments when one of the following conditions occurs: an egress
policy map (access control list [ACL]) forcibly drops the packets, traffic is received on the port ASIC
when the port is physically down, or packets are discarded as a result of a timeout (500-ms threshold).
Show command: show interface x/y counter detail | include ignore discard
Threshold         Poll (Sec)   Rising   Falling
Normal            60           50       0
Aggressive        45           50       10
Most Aggressive   30           50       20
Counter: LR RX (lr-rx)
Description: If a peer port is continuously running out of transmit B2B credits for a long time
(1 second for F-ports and 1.5 seconds for E-ports), it can invoke the credit-loss-recovery mechanism
by transmitting a Fibre Channel link reset (LR) primitive. The node (the Fibre Channel switch)
connected to this port will receive a link reset in the receive (Rx) direction and increment the lr-rx
counter. The switch that sent this link reset would have its lr-tx (transmit) counter incremented. This
example is just one of the situations in which this counter is incremented. A link flap can also trigger
the counter to increment.
Usually, no traffic flows are received on this port when this condition occurs. An increasing number of
received link resets can indicate congestion at a port, or the resets can be a cascading effect of
congestion on other ports to which this port is transmitting data. It can also indicate that the
receiver-ready (R_RDY) B2B credits are being lost due to corruption on the link.
Show command: show interface fc2/11 count details | include ignore "reset responses received"
Threshold         Poll (Sec)   Rising   Falling
Normal            60           5        1
Aggressive        60           3        1
Most Aggressive   30           3        1
Counter: LR TX (lr-tx)
Description: If a port is continuously running out of transmit B2B credits for a long time (1 second
for F-ports and 1.5 seconds for E-ports), it will invoke the credit-loss-recovery mechanism by
transmitting a Fibre Channel link reset (LR) primitive. A link reset is only a credit reset and does not
actually flap the port if it is successful. Usually, no traffic flows out of this port when this condition
occurs. This example is just one of the situations in which this counter is incremented. A link flap can
also trigger the counter to increment. Counter incrementation can also indicate that the R_RDY B2B
credits are being lost due to corruption on the link.
Show command: show interface x/y count details | include ignore "reset responses transmitted"
Threshold         Poll (Sec)   Rising   Falling
Normal            60           5        1
Aggressive        60           3        1
Most Aggressive   30           3        1
Counter: Timeout Discards (timeout-discards)
Description: All frames are time-stamped when they enter a Cisco MDS 9000 Family switch. The
frame is put into the receive buffer queue when it enters the switch. The frame will be dropped if it is
not delivered to the egress port within the time specified by the congestion-drop timeout value (the
default is 500 ms). These dropped frames are accounted for at the next-hop (egress) port and signify
congestion at this egress port. This example is one of the conditions that causes the timeout-discards
counter to increment.
Show command: show interface x/y count details | include ignore timeout
Threshold         Poll (Sec)   Rising   Falling
Normal            60           200      10
Aggressive        60           200      100
Most Aggressive   30           200      100
Counter: Credit Loss Reco (credit-loss-reco)
Description: This counter is incremented when B2B credits are unavailable for more than 1 second
for F-ports or more than 1.5 seconds for E-ports. Because this is an important counter for slow-drain
detection, you should keep the value the same for all three templates while monitoring.
Show command: show interface x/y count details | include ignore "credit loss"
Threshold         Poll (Sec)   Rising   Falling
Normal            1            1        0
Aggressive        1            1        0
Most Aggressive   1            1        0
Counter: TX Credit Not Available (tx-credit-not-available)
Description: This counter is incremented by 1 when Tx B2B credits are not available for 100 ms
(default). Both thresholds are configured as a percentage of the polling-interval time.
For example, assume that you set the poll interval to 1 second, the rising threshold to 10% (which
translates to 100 ms: that is, 10% of a second), the falling threshold to 0% (which translates to 0 ms),
and the port-guard action to error-disable the port. If there are zero remaining Tx B2B credits for
100 ms, then the rising-threshold criterion is met. If a credit is indeed returned, then the
falling-threshold criterion is met. In addition, when the rising-threshold criterion is met, the
port-guard action is initiated, which puts the port in the error-disable state. The port will remain in
that state until someone manually issues a shut or no shut command on that port. Because this is an
important counter for slow-drain detection, you should keep the value the same for all three templates
while monitoring.
Note: Because this counter is sampled at 100-ms intervals, you should configure both threshold
percentages only in multiples of 10 (10, 20, 30, etc.).
Show command: show interface x/y count details | include ignore "Transmit B2B"
Threshold         Poll (Sec)   Rising   Falling
Normal            1            10%      0
Aggressive        1            10%      0
Most Aggressive   1            10%      0
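The percentage arithmetic described for tx-credit-not-available can be checked with a short Python sketch. The function names are hypothetical and the code only restates the conversion given in the description (a percentage of the polling interval, compared against the 100-ms sampling granularity); it is not how the switch computes the value.

```python
SAMPLE_MS = 100  # sampling granularity stated in the note for this counter


def threshold_ms(poll_interval_s: int, percent: int) -> int:
    """Convert a percentage threshold into milliseconds of the poll interval."""
    return poll_interval_s * 1000 * percent // 100


def aligns_with_sampling(poll_interval_s: int, percent: int) -> bool:
    """True when the threshold is a whole number of 100-ms samples."""
    return threshold_ms(poll_interval_s, percent) % SAMPLE_MS == 0


print(threshold_ms(1, 10))          # 10% of a 1-second poll interval, in ms
print(aligns_with_sampling(1, 10))  # a multiple of 10
print(aligns_with_sampling(1, 25))  # not a multiple of 10
```

With a 1-second polling interval, 10% corresponds to exactly one 100-ms sample, whereas 25% would fall between samples, which is why the note recommends multiples of 10.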
Counter: Rx Datarate (rx-datarate)
Description: This counter indicates the rate at which frames are received on the interface. Both
thresholds are configured as a percentage of the operational speed of the interface.
Show command: show interface x/y | include ignore "input rate"
Threshold         Poll (Sec)   Rising   Falling
Normal            60           80%      70%
Aggressive        60           90%      60%
Most Aggressive   30           90%      50%
Counter: TX Datarate (tx-datarate)
Description: This counter indicates the rate at which frames are transmitted out of a particular
interface. Both thresholds are configured as a percentage of the operational speed of the interface.
Show command: show interface x/y | include ignore "output rate"
Threshold         Poll (Sec)   Rising   Falling
Normal            60           80%      70%
Aggressive        60           90%      60%
Most Aggressive   30           90%      50%
Counter: ASIC Error Pkt from Port (err-pkt-from-port)
Description: This counter is incremented when the internal path between the port ASIC and the
forwarding ASIC is causing frames to be corrupted. If the forwarding ASIC detects corrupted frames
from the port ASIC due to any internal problems, this counter is incremented.
The counter is deprecated on Cisco MDS 9000 Family fourth-generation and later line cards due to the
nature of the hardware architecture on these modules.
Show command: -
Threshold         Poll (Sec)   Rising   Falling
Normal            60           50       10
Aggressive        60           50       40
Most Aggressive   60           50       30
Counter: ASIC Error Pkt to XBAR (err-pkt-to-xbar)
Description: This counter is incremented when the forwarding ASIC on the ingress module (line
card) is sending corrupted frames to the crossbar (fabric module or XBAR) ASIC.
Show command: -
Threshold         Poll (Sec)   Rising   Falling
Normal            50           50       10
Aggressive        60           50       40
Most Aggressive   60           50       30
Counter: ASIC Error Pkt from XBAR (err-pkt-from-xbar)
Description: This counter is incremented when the crossbar (fabric module or XBAR) ASIC is
sending corrupted frames to the forwarding ASIC on the module (line card).
Show command: -
Threshold         Poll (Sec)   Rising   Falling
Normal            50           50       10
Aggressive        60           50       40
Most Aggressive   60           50       30
Counter: tx-slowport-count
Description: The tx-slowport-count counter is applicable only for 8-Gbps modules in Cisco MDS
9500 Series Multilayer Directors. In the default configuration, PMON sends an alert when a slow-port
condition is detected 10 times in 1 second for the configured administrative delay. Because this is an
important counter for slow-drain detection, you should keep the value the same for all three templates
while monitoring.
Note: Because this counter is sampled at 100-ms intervals and there can be, at most, one event per
100-ms interval, this counter can increment, at most, 10 times per second. Therefore, configuring a
rising-threshold value greater than 10 (with a polling interval of 1 second) will cause this alert never to
be generated. In addition, system timeout slowport-monitor default mode must be configured for
alerting to work.
Show command: show process creditmon slowport-monitor-events
              show logging onboard slowport-monitor-events
Threshold         Poll (Sec)   Rising   Falling
Normal            1            5        0
Aggressive        1            5        0
Most Aggressive   1            5        0
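The constraint in the note for tx-slowport-count can be expressed as a small sanity check to run before deploying a policy. This is a sketch only: the function names are hypothetical, and it assumes, as the note states, that at most one event is counted per 100-ms sample and that a threshold equal to the maximum is still reachable.

```python
EVENTS_PER_SECOND_MAX = 10  # at most one event per 100-ms sample


def max_increments(poll_interval_s: int) -> int:
    """Largest possible counter delta within one polling interval."""
    return poll_interval_s * EVENTS_PER_SECOND_MAX


def rising_threshold_reachable(rising_threshold: int, poll_interval_s: int = 1) -> bool:
    """False when the rising threshold can never be met in one interval."""
    return rising_threshold <= max_increments(poll_interval_s)


print(rising_threshold_reachable(5))   # template value
print(rising_threshold_reachable(11))  # exceeds 10 events per second
```

The template value of 5 passes the check; a value of 11 with a 1-second polling interval does not, matching the note.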
Counter: tx-slowport-oper-delay
Description: The tx-slowport-oper-delay counter is applicable to advanced 8- and 16-Gbps
modules for Cisco MDS 9000 Family switches.
There are two defaults based on the module type. For advanced 8-Gbps modules, the default
rising-threshold value is 80 ms with a 1-second polling interval. For 16-Gbps modules and switches,
the default rising-threshold value is 50 ms with a 1-second polling interval.
Show command: show process creditmon slowport-monitor-events
              show logging onboard slowport-monitor-events
Threshold         Poll (Sec)   Rising   Falling
Normal            1            50 ms    0
Aggressive        1            40 ms    0
Most Aggressive   1            30 ms    0
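The footnote for tx-slowport-oper-delay under Table 8 states two conditions that must both hold before the rising threshold is considered exceeded. The following Python sketch restates that rule; the function name and the sample delay values are hypothetical and chosen only for illustration.

```python
def oper_delay_threshold_exceeded(admin_delay_ms: int,
                                  oper_delay_ms: int,
                                  rising_threshold_ms: int) -> bool:
    """Both conditions from the Table 8 footnote must hold."""
    admin_below_threshold = admin_delay_ms < rising_threshold_ms
    oper_above_threshold = oper_delay_ms > rising_threshold_ms
    return admin_below_threshold and oper_above_threshold


# Normal template rising threshold of 50 ms; delay values are illustrative.
print(oper_delay_threshold_exceeded(admin_delay_ms=10, oper_delay_ms=60, rising_threshold_ms=50))
print(oper_delay_threshold_exceeded(admin_delay_ms=60, oper_delay_ms=70, rising_threshold_ms=50))
```

In the second call the administrative delay is not below the rising threshold, so the rule is not satisfied even though the reported operation delay is above it.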
Counter: TxWait
Description: TxWait counts the time whenever the port Tx B2B credits available are zero and
frames are waiting in the egress queue. The counter increments if a port has zero remaining Tx B2B
credits for 2.5 microseconds. This counter is used to report the credit unavailability on a port in
multiple intuitive ways.
TxWait is reported only on 16-Gbps and advanced 8-Gbps platforms. The other platforms in the Cisco
MDS 9000 Family will report zero. Both rising and falling thresholds are configured as a percentage of
the polling-interval time. Therefore, for example, a 30% threshold with a 1-second polling interval is a
setting of 300 ms.
Show command: show process creditmon txwait-history and
              show interface x/y counters | include ignore TxWait
Threshold         Poll (Sec)   Rising   Falling
Normal            1            40%      0
Aggressive        1            30%      0
Most Aggressive   1            20%      0
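The TxWait thresholds can be related to the underlying counter with a little arithmetic. The sketch below converts each template percentage into milliseconds of a 1-second polling interval and then into the number of 2.5-microsecond increments described above. The names are hypothetical, and the tick conversion is an inference from the stated 2.5-microsecond granularity rather than a documented formula.

```python
TICKS_PER_MS = 400  # 1 ms divided by the 2.5-microsecond TxWait granularity


def txwait_threshold_ms(poll_interval_s: int, percent: int) -> int:
    """Percentage of the polling interval, expressed in milliseconds."""
    return poll_interval_s * 1000 * percent // 100


def txwait_ticks(threshold_ms: int) -> int:
    """Number of 2.5-microsecond increments that make up the threshold."""
    return threshold_ms * TICKS_PER_MS


for template, percent in (("Normal", 40), ("Aggressive", 30), ("Most Aggressive", 20)):
    ms = txwait_threshold_ms(1, percent)
    print(f"{template}: {percent}% = {ms} ms = {txwait_ticks(ms)} increments")
```

For the Aggressive template this reproduces the 300-ms figure given in the description.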
Cisco Embedded Event Manager
Overview
Cisco Embedded Event Manager, or EEM, is an NX-OS component that can be used to monitor system
events that occur on the switch. It can take action after an event occurs and help you troubleshoot the
situation. Event manager policies are configured on the supervisor, and they can be used to monitor
parameters on modules or line cards as well.
NX-OS has preconfigured system policies that define common events and actions for the device. System
policy names generally begin with two underscore characters: for example, __flogi_fcids_max_per_switch
is a policy that generates a syslog warning message if the fabric login (FLOGI) requests on the switch
exceed 4000. To see a list of all available system policies on Cisco MDS 9000 Family switches, enter the
show event manager system-policy execution command.
Figure 4 shows the events that trigger an event log generated through the event manager.
Figure 4. Event Manager Interaction with System Events
[Figure (diagram not reproduced): system events (system switchover, OIR, FLOGI count, temperature event,
module failure) and user-defined policies (configured in an applet or script using the CLI) feed the event
manager, which produces the event log. The event manager validates and records policy information, directs
event notifications, logs events, dynamically registers event names, actions, and parameters, and filters
events and matches them with policies.]
The event manager has three main components:
• Event statements: Events to be monitored from another NX-OS component that may require some action,
workaround, or notification
• Action statements: Actions that the event manager can take, such as sending an email or disabling an
interface, to recover from an event
• Policies: Events paired with one or more actions to troubleshoot or recover from an event
Defining Cisco Embedded Event Manager Policy
Use the commands listed here to configure a user-defined EEM policy.
1. config t
2. (Optional) show event manager policy-state system-policy
3. event manager applet applet-name override system-policy
4. (Optional) description policy-description
5. event event-statement
6. action number action-statement
   (Repeat Step 6 for multiple action statements.)
7. (Optional) show event manager policy-state applet-name
8. (Optional) copy running-config startup-config
Defining Events and Actions
As noted in the preceding configuration steps (step 6), you need to define an action to be taken if the
matching conditions for the event are true.
You can use the following options to configure event statements:
• cli: Create a CLI event specification.
• counter: Create a counter event.
• fanabsent: Create a fan-absent event specification.
• fanbad: Create fan-bad event specification.
• fcns: Create an event related to the Fibre Channel name server.
• flogi: Configure an event related to fabric-login requests.
• gold: Create a diagnostic event specification.
• memory: Create a memory-threshold event specification.
• module: Create a module event specification.
• module-failure: Create a module-failure event specification.
• oir: Create an online insertion and removal (OIR) event specification.
• policy-default: Use the event in the system policy being overridden.
• poweroverbudget: Create a power-over-budget event specification.
• snmp: Create an SNMP event specification.
• storm-control: Create a storm-control event specification.
• syslog: Create a syslog event specification.
• sysmgr: Create a system manager event.
• temperature: Create a temperature event specification.
• test: Create a test event specification.
• zone: Specify zone configuration commands.
You can use Embedded Event Manager applets to configure the following actions:
• cli: Configure a virtual shell (VSH) CLI action.
• counter: Specify the name of the counter.
• eem: Configure an Embedded Event Manager command.
• event-default: Perform the default action for the event.
• forceshut: Force the entire switch to shut down.
• overbudgetshut: Shut down the specified line cards because the system exceeds the power budget.
• policy-default: Perform the default actions of the policy being overridden.
• reload: Reload the system or a specific module.
• snmp-trap: Send out an SNMP trap.
• syslog: Generate a syslog message.
Preconfigured System Policies Available on Cisco MDS 9000 Family Switches
The following are some of the preconfigured system policies available in Cisco MDS 9000 Family switches:
• Zoning
  - __zone_dbsize_max_per_vsan
  - __zone_members_max_per_sw
  - __zone_zones_max_per_sw
  - __zone_zonesets_max_per_sw
• Fabric login (FLOGI)
  - __flogi_fcids_max_per_switch
  - __flogi_fcids_max_per_module
  - __flogi_fcids_max_per_intf
• Fibre Channel name server (FCNS)
  - __fcns_entries_max_per_switch
Refer to Appendix C for an example showing how to configure the event manager and specify a set of
actions matching a particular event to your applet.
Supervisor Upgrade and Downgrade Support
All the monitoring features described in this document support supervisor upgrade and downgrade using
Cisco In-Service Software Upgrade (ISSU). When an ISSU upgrade is performed, the new software release
may add new counters.
For More Information
http://www.cisco.com/go/mds
http://www.cisco.com/c/en/us/products/storage-networking/mds-9700-series-multilayer-directors/datasheet-listing.html
Appendix A: Remote Monitor Configuration
The example in this appendix configures an RMON alarm to monitor the active supervisor’s average CPU use
every 60 seconds with a rising threshold of 70 and a falling threshold of 60. A rising SNMP trap is
generated if the CPU utilization reaches 70 percent.
MDS9000# config t
MDS9000(config)# rmon alarm 20 1.3.6.1.4.1.9.9.305.1.1.1 60 delta rising-threshold 70 5 falling-threshold 60 5 owner test
To enable RMON events, use the procedure in the following example. This example creates RMON event
number 2 to define critical errors, and it generates a log entry when the event is triggered by the alarm.
The user Test2 owns the row that is created in the event table by this command. This example also
generates an SNMP trap when the event is triggered.
MDS9000# config t
MDS9000(config)# rmon event 2 log trap eventtrap description CriticalErrors owner Test2
Refer to the Cisco MDS 9000 Family MIB quick reference guide for a list of all MIBs supported on Cisco
MDS 9000 Family switches. OIDs corresponding to individual MIBs can be used for monitoring using RMON
alarms.
Appendix B: Port Monitor Configuration
The example in this appendix configures a PMON policy named port_health that polls for the link-reset
receive (lr-rx) hardware port counter every 60 seconds.
MDS9000# config t
Enter configuration commands, one per line. End with CNTL/Z.
MDS9000(config)# port-monitor name port_health
MDS9000(config-port-monitor)# monitor counter lr-rx
MDS9000(config-port-monitor)# counter lr-rx poll-interval 60 delta rising-threshold 5 event 4 falling-threshold 1 event 4 portguard flap
Alerting occurs when the specified conditions are met. In this example, assume that when PMON polls
for this counter the first time, the line card returns an lr-rx counter value of 100. On the first poll of a delta
counter, PMON stores the value and takes no action. On the second poll 60 seconds later, assume that the
count is now 107. Because the rising threshold is configured as 5, a rising alarm will be generated (107 –
100 > 5) and sent to RMON.
For a falling alarm to be generated, the delta between the values polled in two successive polls should be
less than or equal to the falling threshold configured.
In the preceding configuration, if the third PMON poll for the lr-rx counter returns a value greater than 108
(a delta greater than 1), then no rising or falling alarm is generated. As long as the delta remains greater
than 1, no further alarms are generated. Only when the delta is 1 or less is the falling alarm generated. In
addition, another rising-threshold alert can be generated only after the falling-threshold alert has been generated.
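The alarm sequence described in this appendix can be traced with a short Python simulation. This is a sketch of the described behavior, not PMON code: the `simulate` function is hypothetical, the poll values after 107 are made up for illustration, and it assumes that a falling alarm is raised only after a rising alarm has occurred.

```python
def simulate(samples, rising, falling):
    """Return (poll_number, delta, alarm) for successive delta-counter polls."""
    previous = None
    rising_allowed = True  # a rising alarm may fire until one has fired
    results = []
    for poll, value in enumerate(samples, start=1):
        if previous is None:
            # First poll of a delta counter: store the value, take no action.
            previous = value
            results.append((poll, None, None))
            continue
        delta = value - previous
        previous = value
        alarm = None
        if rising_allowed and delta > rising:
            alarm = "rising"
            rising_allowed = False
        elif not rising_allowed and delta <= falling:
            alarm = "falling"
            rising_allowed = True
        results.append((poll, delta, alarm))
    return results


# lr-rx example: rising threshold 5, falling threshold 1.
for row in simulate([100, 107, 112, 113, 121], rising=5, falling=1):
    print(row)
```

The second poll (delta 7) raises the rising alarm, the third (delta 5) raises nothing, the fourth (delta 1) raises the falling alarm, and only then can the fifth poll (delta 8) raise a new rising alarm.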
You can use the following show commands to view the policies configured using the PMON utility and their
state:
show port-monitor status
show port-monitor active
show port-monitor <policy-name>
To check whether the RMON component was notified by this alerting, use the following show commands:
show rmon logs
show rmon alarms
Normal Template PMON Configuration
You can copy and paste the following configuration commands in the Cisco MDS 9000 Family switch
console to enable normal PMON-based monitoring.
Note: If you configure PMON to monitor all port types, be sure not to use a port-guard action to flap or error-disable the port.
config t
no port-monitor activate slowdrain
port-monitor name pmon-normal-policy
port-type all
counter link-loss poll-interval 60 delta rising-threshold 5 event 4 falling-threshold 0 event 4
counter sync-loss poll-interval 60 delta rising-threshold 5 event 4 falling-threshold 0 event 4
counter signal-loss poll-interval 60 delta rising-threshold 5 event 4 falling-threshold 0 event 4
counter invalid-words poll-interval 60 delta rising-threshold 5 event 4 falling-threshold 0 event 4
counter invalid-crc poll-interval 60 delta rising-threshold 5 event 4 falling-threshold 0 event 4
counter tx-discards poll-interval 60 delta rising-threshold 50 event 4 falling-threshold 0 event 4
counter lr-rx poll-interval 60 delta rising-threshold 5 event 4 falling-threshold 0 event 4
counter lr-tx poll-interval 60 delta rising-threshold 5 event 4 falling-threshold 0 event 4
counter timeout-discards poll-interval 60 delta rising-threshold 200 event 4 falling-threshold 10 event 4
counter credit-loss-reco poll-interval 1 delta rising-threshold 1 event 3 falling-threshold 0 event 3
counter tx-credit-not-available poll-interval 1 delta rising-threshold 10 event 4 falling-threshold 0 event 4
counter rx-datarate poll-interval 60 delta rising-threshold 80 event 4 falling-threshold 70 event 4
counter tx-datarate poll-interval 60 delta rising-threshold 80 event 4 falling-threshold 70 event 4
counter err-pkt-from-port poll-interval 60 delta rising-threshold 50 event 4 falling-threshold 10 event 4
counter err-pkt-to-xbar poll-interval 60 delta rising-threshold 50 event 4 falling-threshold 10 event 4
counter err-pkt-from-xbar poll-interval 60 delta rising-threshold 50 event 4 falling-threshold 10 event 4
counter tx-slowport-count poll-interval 1 delta rising-threshold 5 event 3 falling-threshold 0 event 3
counter tx-slowport-oper-delay poll-interval 1 absolute rising-threshold 50 event 5 falling-threshold 0 event 5
counter txwait poll-interval 1 delta rising-threshold 40 event 4 falling-threshold 0 event 4
Aggressive Template PMON Configuration
You can copy and paste the following configuration commands into the Cisco MDS 9000 Family switch console
to enable aggressive PMON-based monitoring.
Note: If you configure PMON to monitor all port types, be sure not to use a port-guard action to flap or error-disable the port.
config t
no port-monitor activate slowdrain
port-monitor name pmon-aggressive-policy
port-type all
counter link-loss poll-interval 30 delta rising-threshold 5 event 4 falling-threshold 1 event 4
counter sync-loss poll-interval 45 delta rising-threshold 5 event 4 falling-threshold 1 event 4
counter signal-loss poll-interval 45 delta rising-threshold 5 event 4 falling-threshold 1 event 4
counter invalid-words poll-interval 30 delta rising-threshold 5 event 4 falling-threshold 1 event 4
counter invalid-crc poll-interval 30 delta rising-threshold 5 event 4 falling-threshold 1 event 4
counter tx-discards poll-interval 45 delta rising-threshold 50 event 4 falling-threshold 10 event 4
counter lr-rx poll-interval 60 delta rising-threshold 3 event 4 falling-threshold 1 event 4
counter lr-tx poll-interval 60 delta rising-threshold 3 event 4 falling-threshold 1 event 4
counter timeout-discards poll-interval 60 delta rising-threshold 200 event 4 falling-threshold 100 event 4
counter credit-loss-reco poll-interval 1 delta rising-threshold 1 event 3 falling-threshold 0 event 3
counter tx-credit-not-available poll-interval 1 delta rising-threshold 10 event 3 falling-threshold 1 event 3
counter rx-datarate poll-interval 60 delta rising-threshold 90 event 4 falling-threshold 60 event 4
counter tx-datarate poll-interval 60 delta rising-threshold 90 event 4 falling-threshold 60 event 4
counter err-pkt-from-port poll-interval 60 delta rising-threshold 50 event 4 falling-threshold 40 event 4
counter err-pkt-to-xbar poll-interval 60 delta rising-threshold 50 event 4 falling-threshold 40 event 4
counter err-pkt-from-xbar poll-interval 60 delta rising-threshold 50 event 4 falling-threshold 40 event 4
counter tx-slowport-count poll-interval 1 delta rising-threshold 5 event 3 falling-threshold 0 event 3
counter tx-slowport-oper-delay poll-interval 1 absolute rising-threshold 40 event 5 falling-threshold 0 event 5
counter txwait poll-interval 1 delta rising-threshold 30 event 3 falling-threshold 0 event 3
Most Aggressive Template PMON Configuration
You can copy and paste the following configuration commands into the Cisco MDS 9000 Family switch console
to enable the most aggressive type of PMON-based monitoring.
Note: If you configure PMON to monitor all port types, be sure not to use a port-guard action to flap or error-disable the port.
config t
no port-monitor activate slowdrain
port-monitor name pmon-mostaggressive-policy
port-type all
counter link-loss poll-interval 10 delta rising-threshold 5 event 4 falling-threshold 0 event 4
counter sync-loss poll-interval 30 delta rising-threshold 5 event 4 falling-threshold 0 event 4
counter signal-loss poll-interval 30 delta rising-threshold 5 event 4 falling-threshold 0 event 4
counter invalid-words poll-interval 30 delta rising-threshold 5 event 4 falling-threshold 0 event 4
counter invalid-crc poll-interval 10 delta rising-threshold 5 event 4 falling-threshold 0 event 4
counter tx-discards poll-interval 30 delta rising-threshold 50 event 4 falling-threshold 0 event 4
counter lr-rx poll-interval 30 delta rising-threshold 3 event 4 falling-threshold 0 event 4
counter lr-tx poll-interval 30 delta rising-threshold 3 event 4 falling-threshold 0 event 4
counter timeout-discards poll-interval 30 delta rising-threshold 200 event 4 falling-threshold 50 event 4
counter credit-loss-reco poll-interval 1 delta rising-threshold 1 event 3 falling-threshold 0 event 3
counter tx-credit-not-available poll-interval 1 delta rising-threshold 10 event 3 falling-threshold 0 event 3
counter rx-datarate poll-interval 30 delta rising-threshold 90 event 4 falling-threshold 50 event 4
counter tx-datarate poll-interval 30 delta rising-threshold 90 event 4 falling-threshold 50 event 4
counter err-pkt-from-port poll-interval 60 delta rising-threshold 50 event 4 falling-threshold 30 event 4
counter err-pkt-to-xbar poll-interval 60 delta rising-threshold 50 event 4 falling-threshold 30 event 4
counter err-pkt-from-xbar poll-interval 60 delta rising-threshold 50 event 4 falling-threshold 30 event 4
counter tx-slowport-count poll-interval 1 delta rising-threshold 5 event 3 falling-threshold 0 event 3
counter tx-slowport-oper-delay poll-interval 30 absolute rising-threshold 30 event 5 falling-threshold 0 event 5
counter txwait poll-interval 1 delta rising-threshold 30 event 3 falling-threshold 0 event 3
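Because the three templates differ only in poll intervals, thresholds, and event IDs, it can help to sanity-check a policy offline before pasting it into the console. The following Python sketch is one possible way to do that and is not a Cisco tool: the function names parse_counter and check_counter are hypothetical, the parser assumes each counter command is on a single line with falling-threshold spelled out in full, and the two checks (falling threshold below rising threshold, matching rising and falling event IDs) reflect the pattern of these templates rather than a documented NX-OS rule.

```python
import re

# Pattern for one PMON counter command as written in the templates above.
COUNTER_RE = re.compile(
    r"counter (?P<name>\S+) poll-interval (?P<poll>\d+) (?P<mode>delta|absolute) "
    r"rising-threshold (?P<rising>\d+) event (?P<rising_event>\d+) "
    r"falling-threshold (?P<falling>\d+) event (?P<falling_event>\d+)"
)


def parse_counter(line):
    """Parse one counter command into a dict; raise ValueError if it does not match."""
    normalized = " ".join(line.split())  # collapse stray whitespace
    match = COUNTER_RE.fullmatch(normalized)
    if match is None:
        raise ValueError("not a recognized counter command: " + normalized)
    fields = match.groupdict()
    # Keep name and mode as strings; convert every numeric field to int.
    return {k: (v if k in ("name", "mode") else int(v)) for k, v in fields.items()}


def check_counter(counter):
    """Return a list of warnings for one parsed counter (empty if none apply)."""
    problems = []
    if counter["falling"] >= counter["rising"]:
        problems.append("falling-threshold is not below rising-threshold")
    if counter["rising_event"] != counter["falling_event"]:
        problems.append("rising and falling event IDs differ")
    return problems
```

For example, passing the txwait line from the normal template to parse_counter and then to check_counter returns an empty list, whereas a line whose falling threshold exceeds its rising threshold, or whose two event IDs do not match, is reported for manual review.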
Appendix C: Cisco Embedded Event Manager Configuration
This appendix presents several sample NX-OS event manager policy scripts that you can use to monitor
different elements of your Cisco MDS fabric. After you configure an Embedded Event Manager policy, it
waits for the configured event to occur. When the event occurs, the policy can perform various actions
related to it, such as printing a syslog message, reloading the supervisor, sending an SNMP trap, or
running a VSH CLI command.
• Event manager applet to monitor CPU utilization by the Cisco MDS 9000 Family switch:
MDS9000# config t
Enter configuration commands, one per line. End with CNTL/Z.
MDS9000(config)# event manager applet high-cpu
MDS9000(config-applet)# event snmp oid 1.3.6.1.4.1.9.9.305.1.1.1 get-type exact entry-op ge entry-val 80 exit-op le exit-val 60 poll-interval 10
MDS9000(config-applet)# action 1.0 syslog msg HIGH_CPU TRIGGERED
MDS9000(config-applet)# action 2.0 cli show processes cpu sorted
MDS9000(config-applet)# exit
MDS9000(config)#
• Event manager applet to monitor the module temperature for a major threshold violation:
MDS9000(config)# event manager applet module-temp-major
MDS9000(config-applet)# event temperature module 1 threshold major
MDS9000(config-applet)# action 1.0 syslog msg Module 1 Major Temperature alarm
MDS9000(config-applet)# action 2.0 cli show environment
MDS9000(config-applet)# exit
MDS9000(config)#
• Event manager applet to monitor whether a module goes offline:
MDS9000(config)# event manager applet module-offline
MDS9000(config-applet)# event module status offline module 1
MDS9000(config-applet)# action 1.0 syslog msg Module 1 Went Offline !
MDS9000(config-applet)# action 2.0 cli show module internal event-history module 1
MDS9000(config-applet)# exit
MDS9000(config)#
• Event manager applet to monitor for the FLOGI limit per module, switch, or interface:
MDS9000(config)# event manager applet notify-flogi
MDS9000(config-applet)# event flogi ?
intf-max Event to configure maximum flogi per interface
module-max Event to configure maximum flogi per module
switch-max Event to configure maximum flogi per switch
You can use show event manager policy internal <EEM policy name> to display information about the
configured policy.
© 2016 Cisco and/or its affiliates. All rights reserved. Cisco and the Cisco logo are trademarks or registered trademarks of Cisco and/or its affiliates in the U.S.
and other countries. To view a list of Cisco trademarks, go to this URL: www.cisco.com/go/trademarks. Third-party trademarks mentioned are the property of
their respective owners. The use of the word partner does not imply a partnership relationship between Cisco and any other company. (1110R)
C11-736963-00 04/16