
Leveraging and Understanding
Performance Data and Graphs
Troy Lea
troy@box293.com
Twitter: @Box293
http://exchange.nagios.org/directory/Owner/Box293/1
About Me
IT Consultant
Nagios Developer
Love tinkering with Nagios
Why Nagios XI?
It’s a virtual appliance - ready to go
2
About This Presentation
Understanding how performance data is stored
in the back end and how Nagios accesses it
Goal is to give you key pieces of information
A good reference for understanding concepts
This presentation is centered around Nagios XI
Valid for other Nagios implementations
3
Basic Concepts - Part 1
4
Basic Concepts - Part 2
./check_nt -H SERVER -s "" -p 12489 -v USEDDISKSPACE -l C -w 80 -c 95
C:\ - total: 39.99 Gb - used: 25.28 Gb (63%) - free 14.71 Gb (37%) | 'C:\ Used Space'=25.28Gb;32.00;38.00;0.00;39.99
5
Basic Concepts - Part 3
Service check command is executed by the monitoring engine
Monitoring engine receives the result of the check
Data received has performance data
Performance data is anything after the | (pipe)
The performance data is inserted into an RRD file
When viewing the performance graph, PNP4Nagios retrieves the
performance data from the RRD file and generates a pretty graph
Every time the service check receives performance data, it inserts
this performance data into the RRD file which allows you to look at
trends over time
6
Plugins
The power of Nagios is in the plugins!
Monitor what you want, how you want!
Resources available that clearly define the
guidelines around creating plugins
Nagios Plug-in Developer Guidelines
http://nagiosplug.sourceforge.net/developerguidelines.html
PNP Documentation
http://docs.pnp4nagios.org/pnp-0.4/doc_complete
7
Plugin Output Explained - Part 1
Plugins produce data divided into two parts
The pipe symbol “|” is used as a delimiter
Example check_icmp
OK - 127.0.0.1: rta 2.687ms, lost 0% | rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;;
Data to the left of the pipe symbol is processed
by the monitoring engine
Data to the right of the pipe symbol is used for
inserting into RRD and XML files
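A minimal sketch of that split, using only shell parameter expansion. The function names plugin_text and plugin_perfdata are made up for illustration and are not part of Nagios; the sample line is the check_icmp output from this slide.

```shell
# Split one line of plugin output at the first pipe symbol.
# plugin_text / plugin_perfdata are illustrative names, not Nagios functions.
plugin_text() {
    # Everything before the first "|", with trailing whitespace trimmed.
    printf '%s\n' "${1%%|*}" | sed 's/[[:space:]]*$//'
}

plugin_perfdata() {
    case "$1" in
        *"|"*) printf '%s\n' "${1#*|}" | sed 's/^[[:space:]]*//' ;;
        *)     printf '\n' ;;   # no pipe symbol: no performance data
    esac
}

output="OK - 127.0.0.1: rta 2.687ms, lost 0% | rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;;"
plugin_text "$output"       # the part the monitoring engine displays
plugin_perfdata "$output"   # the part that ends up in the RRD and XML files
```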
8
Plugin Output Explained - Part 2
The exit code Nagios receives from the plugin
determines the state of the service
0 = OK
1 = WARNING
2 = CRITICAL
3 = UNKNOWN
The exit code is not “visible” when running a
check from the command line or looking at the
output returned from the plugin
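On the command line the exit code can still be read from the shell variable $? straight after the plugin finishes. A small sketch: fake_plugin is a stand-in for a real plugin, state_name is an illustrative helper, and the label for codes outside 0-3 is my own choice.

```shell
# Map a plugin exit code to the state names listed above.
state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        3) echo "UNKNOWN" ;;
        *) echo "undefined" ;;   # outside the four documented codes
    esac
}

# Stand-in for a real plugin: prints one line and returns exit code 1.
fake_plugin() {
    echo "WARNING - example output"
    return 1
}

code=0
fake_plugin || code=$?    # $? holds the exit code of the last command
echo "exit code: $code ($(state_name "$code"))"
```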
9
Plugin Output Explained - Part 3
No performance data = no pretty graphs
You can create a plugin using whatever
language and tools are available
All that matters is the end result which is returned
back to Nagios when the plugin has finished
running
10
Plugin Output Explained - Part 4
Examples:
Shell script
Something you might want to check on the Nagios
host itself
Perl script
Remotely checking a device using SNMP, or using
third party APIs like the VMware vSphere SDK to
remotely access virtual environments
Visual Basic script
Using NSClient on a Windows host to perform a
check (like RDP usage)
11
Performance Data Specifics - Part 1
Asterisk (*) fields are required fields, everything
else is optional
In this instance, rta is the FIRST DS, or DS 1
12
Performance Data Specifics - Part 2
Multiple DS
Each DS is separated by a space
rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;;
The label can have spaces however the label
MUST be enclosed by single quotes
'Round Trip Average'=2.687ms;3000.000;5000.000;0;
'Packet Loss'=0%;80;100;;
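A sketch of a helper that assembles one DS in the label=value[UOM];warn;crit;min;max layout used above and adds the single quotes when the label contains a space. perf_item is a hypothetical name, not something shipped with Nagios.

```shell
# perf_item LABEL VALUE [UOM] [WARN] [CRIT] [MIN] [MAX]
# Hypothetical helper: prints one DS of performance data.
perf_item() {
    label=$1
    # Labels containing spaces must be wrapped in single quotes.
    case "$label" in
        *" "*) label="'$label'" ;;
    esac
    printf '%s=%s%s;%s;%s;%s;%s' "$label" "$2" "${3:-}" "${4:-}" "${5:-}" "${6:-}" "${7:-}"
}

# Two DS, separated by a space.
echo "$(perf_item rta 2.687 ms 3000.000 5000.000 0) $(perf_item 'Packet Loss' 0 % 80 100)"
```

Optional fields that are left out simply become empty positions between the semicolons.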
13
Basic Plugin - Part 1
Example shell script demonstrating how a plugin
outputs performance data
NUMBER1=$(( (RANDOM % 100) + 1 ))
NUMBER2=$(( (RANDOM % 1000) + 1 ))
echo "OK - Number 1: $NUMBER1 Number 2: $NUMBER2 | 'Number 1'=$NUMBER1;;;; 'Number 2'=$NUMBER2;;;;"
exit 0
14
Basic Plugin - Part 2
Here is the output each time it is run:
OK - Number 1: 4 Number 2: 74 | 'Number 1'=4;;;; 'Number 2'=74;;;;
OK - Number 1: 52 Number 2: 758 | 'Number 1'=52;;;; 'Number 2'=758;;;;
OK - Number 1: 73 Number 2: 60 | 'Number 1'=73;;;; 'Number 2'=60;;;;
OK - Number 1: 29 Number 2: 338 | 'Number 1'=29;;;; 'Number 2'=338;;;;
OK - Number 1: 87 Number 2: 612 | 'Number 1'=87;;;; 'Number 2'=612;;;;
15
Basic Plugin - Part 3
Performance data
displayed as a
pretty graph
Demonstration of
how you can
generate
performance data
in a plugin
16
Basic Plugin - Part 4
Now let's add warning and critical thresholds to
the performance data string
Number1
WARNING @ 50
CRITICAL @ 75
Number2
WARNING @ 500
CRITICAL @ 750
echo "OK - Number 1: $NUMBER1 Number 2: $NUMBER2 | 'Number 1'=$NUMBER1;50;75;; 'Number 2'=$NUMBER2;500;750;;"
17
Basic Plugin - Part 5
Here is the output each time it is run:
OK - Number 1: 4 Number 2: 74 | 'Number 1'=4;50;75;; 'Number 2'=74;500;750;;
OK - Number 1: 52 Number 2: 758 | 'Number 1'=52;50;75;; 'Number 2'=758;500;750;;
OK - Number 1: 73 Number 2: 60 | 'Number 1'=73;50;75;; 'Number 2'=60;500;750;;
OK - Number 1: 29 Number 2: 338 | 'Number 1'=29;50;75;; 'Number 2'=338;500;750;;
OK - Number 1: 87 Number 2: 612 | 'Number 1'=87;50;75;; 'Number 2'=612;500;750;;
18
Basic Plugin - Part 6
This demonstrates
how the
performance data
does not have any
effect on the state
of the service
Warning and
Critical thresholds
are inside the .xml
file
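If the thresholds should change the service state, the plugin itself has to compare the value and pick the exit code. A minimal sketch under that assumption: check_number is a hypothetical helper, the fixed value 52 stands in for the random number, and "WARNING @ 50" is read as "50 or higher".

```shell
# check_number VALUE WARN CRIT
# Returns 0 (OK), 1 (WARNING) or 2 (CRITICAL). Integers only.
check_number() {
    if [ "$1" -ge "$3" ]; then
        return 2
    elif [ "$1" -ge "$2" ]; then
        return 1
    fi
    return 0
}

NUMBER1=52    # fixed here so the result is repeatable
state=0
check_number "$NUMBER1" 50 75 || state=$?
case "$state" in
    0) text="OK" ;;
    1) text="WARNING" ;;
    2) text="CRITICAL" ;;
esac
echo "$text - Number 1: $NUMBER1 | 'Number 1'=$NUMBER1;50;75;;"
# A real plugin would finish with: exit "$state"
```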
19
.rrd and .xml files
Used for recording the results from Nagios checks
Useful for observing daily trends of your environment
Invaluable for helping resolve performance issues
RRD = Round Robin Database
XML = Information about the Nagios check
PNP4Nagios uses the RRD and XML files to
generate pretty graphs
20
Location of .rrd and .xml files
When a service check returns performance data,
Nagios dumps this into:
/usr/local/nagios/var/spool/perfdata
A background process detects the spooled data
and creates / updates the relevant .rrd and .xml
The Performance Data files live in:
/usr/local/nagios/share/perfdata/<host>
21
Extract .rrd data
You can extract data from an .rrd file
Example (from the CLI):
rrdtool fetch /usr/local/nagios/share/perfdata/localhost/_HOST_.rrd MAX -r 900 -s -1h
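The fetched rows can be post-processed with standard tools. A sketch that reads rrdtool fetch style output on standard input and prints the largest value of the first DS; rrd_first_ds_max is a made-up name, and the sample rows are invented values laid out the way rrdtool fetch prints them (a header with the DS names, then one "timestamp: values" row per step).

```shell
# Print the largest value in the first DS column of "rrdtool fetch" output.
# Rows that hold nan (no data recorded for that step) are skipped.
rrd_first_ds_max() {
    awk '
        $1 ~ /^[0-9]+:$/ && $2 !~ /nan/ {
            v = $2 + 0                  # awk converts 2.5e+00 style numbers
            if (!seen || v > max) { max = v; seen = 1 }
        }
        END { if (seen) printf "%.4f\n", max; else print "no data" }
    '
}

# Hypothetical sample (made-up values, not real measurements).
rrd_first_ds_max <<'EOF'
                          1                  2

1387497600: 2.6870000000e+00 0.0000000000e+00
1387498500: 3.1020000000e+00 0.0000000000e+00
1387499400: -nan -nan
EOF
```

On a Nagios server the same function could sit at the end of a pipe, for example: rrdtool fetch <file>.rrd MAX -r 900 -s -1h | rrd_first_ds_max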
22
.rrd and .xml Gotchya - Part 1
The .xml file can contain sensitive data
<NAGIOS_SERVICECHECKCOMMAND>check_emc_clariion!$HOSTADDRESS$!-u readonly!-p Str0ngPassw0rd!-t sp_cbt_busy!--sp A!--warn 70!--crit 90!</NAGIOS_SERVICECHECKCOMMAND>
23
.rrd and .xml Gotchya - Part 2
Perhaps use a central credential file
<NAGIOS_SERVICECHECKCOMMAND>check_vmware_host!check_vmware_config_vcenter01!cpu!90!95!!!!</NAGIOS_SERVICECHECKCOMMAND>
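A rough way to audit existing .xml files for this problem is sketched below. find_inline_passwords is a hypothetical helper, and it assumes the plugin takes its password as a "-p <value>" argument, which will not hold for every plugin.

```shell
# Print check command lines that appear to pass a password with "-p".
# Assumption: the password is given as "!-p <value>" in the check command.
# Reads the files named as arguments, or standard input when none are given.
find_inline_passwords() {
    grep -h 'NAGIOS_SERVICECHECKCOMMAND' "$@" | grep -e '!-p [^!]'
}

# Demo on the example line from the previous slide.
printf '%s\n' '<NAGIOS_SERVICECHECKCOMMAND>check_emc_clariion!$HOSTADDRESS$!-u readonly!-p Str0ngPassw0rd!-t sp_cbt_busy</NAGIOS_SERVICECHECKCOMMAND>' | find_inline_passwords
```

Against real data it could be called as: find_inline_passwords /usr/local/nagios/share/perfdata/*/*.xml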
24
.rrd and .xml Gotchya - Part 3
RRD Data is averaged out over time
Looking at performance graphs for past day / week /
month / year will show results with less spiky data
This generally only occurs with data that has lots of
peaks and troughs
Constant data like disk space used will generally not
average out that much
It all depends on your environment!
Take these factors into consideration when
reviewing RRD data; it’s all relative!
25
Graphs - How Templates Are Used - Part 1
http://docs.pnp4nagios.org/pnp-0.4/tpl
26
Graphs - How Templates Are Used - Part 2
PNP4Nagios queries the XML file for the
<TEMPLATE> tag
Each datasource has its own <TEMPLATE> tag
<TEMPLATE>check-host-alive</TEMPLATE>
Also can be a trailing string in the performance
data (good for distributed monitoring)
OK - 127.0.0.1: rta 2.687ms, lost 0% | rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;; [check_icmp]
27
Graphs - How Templates Are Used - Part 3
From the example graphs:
<TEMPLATE>check-host-alive</TEMPLATE>
<TEMPLATE>check_local_load_alt</TEMPLATE>
PNP4Nagios looks for a php file with this name
in the following folders:
/usr/local/nagios/share/pnp/templates.dist
/usr/local/nagios/share/pnp/templates
28
Graphs - How Templates Are Used - Part 4
check-host-alive
/usr/local/nagios/share/pnp/templates.dist/check-host-alive.php
This PHP file generates the performance graph
check_local_load_alt
check_local_load_alt.php does NOT exist
Default template is used:
/usr/local/nagios/share/pnp/templates.dist/default.php
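The lookup can be sketched as a shell function. find_template is a hypothetical name, and the search order (custom templates folder first, then templates.dist, then the same for default.php) is my understanding of the PNP documentation rather than something stated on these slides.

```shell
# find_template NAME [BASE_DIR]
# Prints the first template file found for NAME.
# BASE_DIR defaults to the PNP path used in this presentation.
find_template() {
    base=${2:-/usr/local/nagios/share/pnp}
    for candidate in \
        "$base/templates/$1.php" \
        "$base/templates.dist/$1.php" \
        "$base/templates/default.php" \
        "$base/templates.dist/default.php"
    do
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1    # nothing found, not even a default template
}

# Demo in a throwaway directory that mimics the two example templates.
demo=$(mktemp -d)
mkdir -p "$demo/templates" "$demo/templates.dist"
touch "$demo/templates.dist/check-host-alive.php" "$demo/templates.dist/default.php"
find_template check-host-alive "$demo"       # its own template exists
find_template check_local_load_alt "$demo"   # falls back to default.php
rm -rf "$demo"
```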
29
Graphs - Creating Your Own Template - Part 1
The check_command name is what Nagios uses
to insert into the <TEMPLATE> tag in the XML
file (how PNP determines which template to use)
So for this example I have created a copy of an
existing command
check_xi_service_nsclient_alt
30
Graphs - Creating Your Own Template - Part 2
The service definition using the new command
31
Graphs - Creating Your Own Template - Part 3
The graph currently being generated
Default Template being used
Check Command being used
.rrd and .xml files currently contain valid data
32
Graphs - Creating Your Own Template - Part 4
Copy the file:
/usr/local/nagios/share/pnp/templates.dist/default.php
To the following location with the name:
/usr/local/nagios/share/pnp/templates/check_xi_service_nsclient_alt.php
Edit check_xi_service_nsclient_alt.php
33
Graphs - Creating Your Own Template - Part 5
In the graph we are removing the bottom two lines
Default Template
Check Command command name
Which are lines 62 and 63
$def[$i] .= 'COMMENT:"Default Template\r" ';
$def[$i] .= 'COMMENT:"Check Command ' . $TEMPLATE[$i] . '\r" ';
Save check_xi_service_nsclient_alt.php
34
Graphs - Creating Your Own Template - Part 6
Updated graph
Template Name and Check Command removed
How easy was that!
35
PNP Templates In Detail - Part 1
Let's get into specifics
Template we just
modified
It’s not that
complicated! (LOL)
36
PNP Templates In Detail - Part 2
.rrd files can have multiple datasources (DS)
Round Trip Time and Packet Loss for example
37
PNP Templates In Detail - Part 3
Example of .rrd file with five DS
Two graphs generated using these DS
38
PNP Templates In Detail - Part 4
Default Template creates one graph per DS
This is a simple PHP foreach loop
The code within the loop references the relevant
DS by the $i variable
39
PNP Templates In Detail - Part 5
This section of the template uses three DS
One graph will be generated using three DS
$opt[1] and $def[1] are references to the first graph
being generated
40
PNP Templates In Detail - Part 6
Number formatting
Our modified template and the relevant code
The relevant information:
%3.4lf
41
PNP Templates In Detail - Part 7
The three DS template and the relevant code
The relevant information:
%4.0lf
42
PNP Templates In Detail - Part 8
Numbers are displayed with four decimal places
%3.4lf
Numbers are displayed as whole numbers
%4.0lf
43
PNP Templates In Detail - Part 9
PNP documentation defines the number
formatting using the printf standard defined here
http://en.wikipedia.org/wiki/Printf
The number (1) and the letter "L" look alike
%3.4lf contains a lower case "L"
The syntax is
%[parameter][flags][width][.precision][length]type
44
PNP Templates In Detail - Part 10
width
When the number is generated on the graph, it will
allocate a minimum specific width, this helps you
align numbers in a column style
precision
Determines if the number displayed is a whole
number, or a number with a specific number of digits
following the decimal point
45
PNP Templates In Detail - Part 11
%3.4lf
width = 3
precision = .4
hence the displayed number is 25.3800
%4.0lf
width = 4
precision = .0
hence the displayed number is 14
Because the precision is 0, NO decimal place is used
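The same width and precision rules can be tried with the shell's own printf. This is only an illustration of the printf syntax, not of PNP itself; the "l" length modifier from the templates is left out because shell printf does not need it. The brackets make the padding visible.

```shell
# Force "." as the decimal separator regardless of the system locale.
LC_ALL=C
export LC_ALL

printf '[%3.4f]\n' 25.38   # four digits after the decimal point
printf '[%4.0f]\n' 14      # no decimals, padded to a minimum width of 4
```

The first line prints [25.3800]. The second prints [  14], with two leading spaces making up the width of 4.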
46
MRTG - Part 1
MRTG = Multi Router Traffic Grapher
Nagios Addon that is useful for monitoring
network switch and router bandwidth using SNMP
The configuration can be complicated to understand
47
MRTG - Part 2
Nagios XI Wizard called “Network Switch /
Router” automates the configuration of MRTG
MRTG configuration file
/etc/mrtg/mrtg.cfg
MRTG runs as a cron job every five minutes
cron comes from the Greek word for time, χρόνος
[chronos]
Hence cron is a software utility on Linux which is a
time-based job scheduler
In the Windows world it's the Task Scheduler
48
MRTG - Part 3
When MRTG runs, it gathers data from the
devices defined in the mrtg.cfg file
It dumps this data into the folder
/var/lib/mrtg
For every port monitored, an .rrd file is created
(no .xml file created at this point)
Another background process will then take the
data in /var/lib/mrtg and put it into the correct
location
/usr/local/nagios/share/perfdata/<host>
49
MRTG Gotchya - Part 1
When the Wizard populates the mrtg.cfg file it will
add ALL ports on the switch to the config file
Even if you only selected to monitor 10 ports on
the switch
The Nagios XI Service Configuration will only have 10
ports defined as service definitions
Every time the MRTG cron job runs, it will collect
data from all ports on the switch (as defined in the
mrtg.cfg file)
Extra CPU cycles, extra disk space
50
MRTG Gotchya - Part 2
On a 48 port switch this might not concern you
But in a stack of two 48 port switches this
becomes 96 ports + also other internal ports like
link aggregation ports (another 32 ports perhaps)
So these additional 128 ports have now added
8700+ configuration lines to the mrtg.cfg file
128 ports consume about 24 MB of .rrd disk
space
In my past environment, the mrtg.cfg file was
59,000 lines long!
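To see how large your own file has become, count the Target[...] entries, since each monitored port has one. A sketch with a hypothetical helper name (count_targets) and a tiny made-up config; on a real server the argument would be /etc/mrtg/mrtg.cfg.

```shell
# count_targets FILE
# Prints the number of active Target[...] entries (commented-out ones are ignored).
count_targets() {
    grep -c '^Target\[' "$1"
}

# Made-up two-port config, just to show the idea.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
Target[192.168.1.1_1]: 1:public@192.168.1.1:
MaxBytes[192.168.1.1_1]: 125000000
Target[192.168.1.1_2]: 2:public@192.168.1.1:
MaxBytes[192.168.1.1_2]: 125000000
EOF
echo "targets: $(count_targets "$cfg")  lines: $(wc -l < "$cfg")"
rm -f "$cfg"
```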
51
MRTG Gotchya - Part 3
Suggestion
Clean up the mrtg.cfg file
Remove the ports you do not wish to gather data on
Can this cause Problems?
Yes!
Problem 1
Monitoring additional ports later using the wizard will
not work
The wizard will NOT re-add the ports to the mrtg.cfg file
Wizard detects switch / router is already in the mrtg.cfg file
52
MRTG Gotchya - Part 4
Problem 2 - Adding a switch (or module) to an
existing switch
Monitoring additional ports later using the wizard will
not work
The wizard will NOT add newly detected ports to the
mrtg.cfg file
Wizard detects switch / router is already in the mrtg.cfg file
Very similar behaviour to Problem 1
Only relevant when the new switch / module is managed
through the existing IP Address / FQDN
Common with stacked switches, adding another switch to
the stack
53
MRTG Gotchya - Part 5
Solutions to Problems 1 & 2
cfgmaker
This is how the Wizard configures mrtg.cfg
The wizard updates the existing mrtg.cfg using a php
function (not available from the CLI)
Run cfgmaker at the CLI to generate a config file
Add the contents of the config file to the existing mrtg.cfg
cfgmaker --noreversedns "public@192.168.1.1" --output=output.txt
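The "add the contents" step is worth doing with a backup. A minimal sketch: append_cfg is a hypothetical helper, and the .bak suffix is my own choice. Review the generated file before appending, since it may contain global settings that mrtg.cfg already has.

```shell
# append_cfg NEW_FILE EXISTING_CFG
# Backs up EXISTING_CFG as EXISTING_CFG.bak, then appends NEW_FILE to it.
append_cfg() {
    cp -p "$2" "$2.bak" && cat "$1" >> "$2"
}

# Intended use on the Nagios XI server (not run here):
#   cfgmaker --noreversedns "public@192.168.1.1" --output=output.txt
#   append_cfg output.txt /etc/mrtg/mrtg.cfg
```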
54
MRTG Gotchya - Part 6
Problem 3 - With a frequently changing
environment, keep mrtg.cfg clean
Monitoring WAN links for remote routers?
WAN link no longer exists?
Disable / Delete service definition(s) in Core Configuration
Manager (CCM)
You will NEED to remove device from mrtg.cfg
Why?
MRTG will still try to collect data from WAN links that are no
longer accessible
This causes delays and can make MRTG run past the default 5
minute schedule, which can cause graph anomalies
55
MRTG Gotchya - Part 7
Problem 4 - Firmware Upgrade causes port
numbering to change
Major firmware revision applied to switch / router
New data collected for ports is no longer the same pattern
Internal port numbering has changed
mrtg.cfg queries specific port numbers, does not use port
names or descriptions
Example
Old Firmware: WAN = Port 1, LAN = Port 2
New Firmware: WAN = Port 0, LAN = Port 1
Have seen this behaviour on SonicWALL Firewalls
56
Questions
Questions ?
57
Discount Offer
But wait, there's more ...
When visiting the Nagios XI website, use my affiliate link
http://www.nagios.com/#ref=3oHG00
58