Using OMD/Nagios to Monitor Complex Hardware/Software Systems
Joe VanAndel
NCAR/EOL
2012/3/29
Why is Monitoring Important?
•
Software systems can be very complex:
•
networked data sources
•
multiple computers
•
long-running daemons
•
Hardware (including computers) can fail
Why is Monitoring Important? (2)
•
Someone is relying on your system to produce or
process data.
•
Computers are better than people at monitoring: manual procedures are
error-prone and don’t cover 24x7.
•
Your staff may need to be notified out-of-hours if
failures occur.
Why is Monitoring Important to S-Pol?
•
S-Pol is a complex system of hardware and
software - need to detect problems so they can be
quickly corrected.
•
Notifications allow unattended operation, so staff
don’t have to stay on site 24x7.
•
Cannot afford to staff three shifts on field projects
What is OMD?
•
Open Monitoring Distribution (http://omdistro.org)
•
runs on Linux
•
Bundles Nagios with 16 useful utilities, including
•
check_mk - creates Nagios configurations for
you!
•
rrdtool/rrdcached - store and retrieve time series
data, supports graphing of performance data.
Why use OMD?
•
complete package of monitoring tools
•
avoid the effort of compiling and integrating Nagios
add-ons
•
Web based monitoring - from anywhere!
Why use check_mk?
•
Automatically generates Nagios rules for each
machine you monitor.
•
Lower overhead allows running more checks on
more hosts.
•
easy to create both hardware and software checks.
•
The S-Pol radar had 700 checks running on 14
hosts - we didn’t want to generate the Nagios
configuration manually.
check_mk architecture
[Figure: check_mk architecture diagram, including check_mk_agent; from http://mathias-kettner.de]
RRD is “Round Robin Database”, which efficiently
stores the output from check_mk.
Getting Started with OMD
•
install the RPM
•
$ omd create mysite # the monitoring instance
•
create scripts in /usr/lib/check_mk_agent/local
•
$ check_mk -I # run inventory
•
$ omd start mysite # start daemons
•
open the check_mk URL in a browser.
Writing a check is simple
•
write a C program, shell script, or Python script
•
query hardware or software status
•
output string(s) to stdout: "0 PgenTritonRaidStatus - OK"
•
run a check_mk inventory to
•
find your script
•
generate the Nagios configuration
/usr/lib/check_mk_agent/local/filecount
#!/bin/bash
# Report the number of files in each directory as a check_mk local check.
DIRS="/var/log /tmp"
for dir in $DIRS
do
    count=$(ls "$dir" | wc --lines)
    if [ "$count" -lt 50 ] ; then
        status=0
        statustxt=OK
    elif [ "$count" -lt 100 ] ; then
        status=1
        statustxt=WARNING
    else
        status=2
        statustxt=CRITICAL
    fi
    # One line per service: status, name, performance data, text
    echo "$status Filecount_$dir count=$count;50;100;0; $statustxt - $count files in $dir"
done
S-Pol monitoring
•
Radar hardware for S-Band & Ka-band:
•
antenna
•
transmitter
•
receiver
•
Klystron temperature
•
Container temperatures
Hardware Monitoring Architecture
[Figure: hardware monitoring architecture, showing the Sixnet controller]
Hardware monitoring
•
Sixnet controller communicates with measurement
modules using RS-485
•
monitors transmitter status
•
monitors antenna status
•
monitors transmitter temperature
•
Sixnet controller runs Linux, so adding a
check_mk_agent was easy!
What else?
•
Computer status:
•
CPU load
•
disk space
•
memory usage
•
radar software - tasks running, products being
produced
•
fetching data: satellite images, soundings, forecast
model output
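A check for "products being produced" can be as simple as asking whether a directory has received a new file recently. The following is a minimal sketch, not the S-Pol implementation: the function name, service name, directory, and age limits are all hypothetical.

```shell
#!/bin/bash
# Local check sketch: has a product directory received a new file recently?
# Service name, directory, and age limits below are hypothetical.

# check_fresh SERVICE DIR WARN_MIN CRIT_MIN
# Prints one check_mk local-check line: "<status> <service> - <text>"
check_fresh() {
    local service=$1 dir=$2 warn_min=$3 crit_min=$4

    if [ ! -d "$dir" ]; then
        echo "2 $service - CRITICAL - $dir does not exist"
        return
    fi

    # find -mmin -N matches files modified less than N minutes ago
    if [ -n "$(find "$dir" -type f -mmin "-$warn_min" | head -n 1)" ]; then
        echo "0 $service - OK - new file in $dir within $warn_min min"
    elif [ -n "$(find "$dir" -type f -mmin "-$crit_min" | head -n 1)" ]; then
        echo "1 $service - WARNING - no new file in $dir for $warn_min min"
    else
        echo "2 $service - CRITICAL - no new file in $dir for $crit_min min"
    fi
}

# Hypothetical product directory; warn after 10 minutes, critical after 30
check_fresh Radar_Products /data/products 10 30
```

Saved as an executable file in /usr/lib/check_mk_agent/local, a script like this would be picked up by the next inventory in the same way as the filecount example.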
Implementation
•
installed OMD on a rack-mount Linux server
•
installed check_mk_agent on all monitored
computers
•
wrote scripts, installed in
/usr/lib/check_mk_agent/local
Implementation (2)
•
Configured digital IO modules (controlled by an
embedded Sixnet computer) to monitor S-Pol
hardware
•
Wrote a program on the Sixnet that reported
hardware status to check_mk_agent
•
Sent Ka-band status over the network; wrote
software to create status files readable by
check_mk scripts
Types of S-Pol checks
•
scripts/programs directly monitor hardware or
software
•
hybrid scripts - process the output of an existing
program, output check_mk status reports.
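A hybrid script can be sketched as a small filter. The example below assumes an existing program that prints lines such as "transmitter: OK"; that input format, the state words, and the SPol_ service prefix are illustrative assumptions, not the actual S-Pol formats.

```shell
#!/bin/bash
# Hybrid check sketch: convert "name: state" lines from an existing
# status program into check_mk local-check lines.
# The input format and the state words are assumptions for illustration.

# Reads lines such as "transmitter: OK" on stdin.
status_to_checks() {
    local name state
    while IFS=': ' read -r name state; do
        [ -z "$name" ] && continue          # skip blank lines
        case "$state" in
            OK)    echo "0 SPol_$name - OK" ;;
            WARN*) echo "1 SPol_$name - WARNING - reported $state" ;;
            *)     echo "2 SPol_$name - CRITICAL - reported ${state:-nothing}" ;;
        esac
    done
}

# Stand-in for the existing program's output
printf 'transmitter: OK\nantenna: FAULT\n' | status_to_checks
```

With the sample input, this prints "0 SPol_transmitter - OK" and "2 SPol_antenna - CRITICAL - reported FAULT". Treating any unrecognized state as CRITICAL is a deliberate choice here: an unexpected report is surfaced rather than silently ignored.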
Implementation (3)
•
configured GSM cell phone to send SMS messages
•
software from gnokii.org
•
bought local SIM
•
wrote script to limit frequency of SMS messages
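One way to limit SMS frequency is to remember when the last message went out and drop anything that arrives too soon afterwards. The sketch below is an assumption about how such a script could look, not the script used at S-Pol; the state-file path, the interval, and the function names are hypothetical.

```shell
#!/bin/bash
# Rate-limit sketch: forward a notification only if enough time has passed
# since the last one. State-file path, interval, and sender are hypothetical.

STATE_FILE=${STATE_FILE:-/tmp/sms_last_sent}
MIN_INTERVAL=${MIN_INTERVAL:-600}     # minimum seconds between messages

# send_limited SENDER_COMMAND MESSAGE
# Runs "SENDER_COMMAND MESSAGE" unless a message went out too recently.
# Returns 1 if suppressed, otherwise the sender's exit status.
send_limited() {
    local sender=$1 message=$2
    local now last=0
    now=$(date +%s)
    [ -r "$STATE_FILE" ] && last=$(cat "$STATE_FILE")
    last=${last:-0}                   # treat an empty state file as "never"

    if [ $((now - last)) -lt "$MIN_INTERVAL" ]; then
        return 1                      # too soon, drop the message
    fi

    # Record the send time only if the sender succeeded
    "$sender" "$message" && echo "$now" > "$STATE_FILE"
}

# Example sender, assuming gnokii is installed and configured
# (the phone number is a placeholder):
#   sms_via_gnokii() { echo "$1" | gnokii --sendsms +15555550100; }
#   send_limited sms_via_gnokii "S-Pol: transmitter fault"
```

Keeping the sender as a parameter makes the limiter easy to try out with a harmless stand-in (for example a function that only echoes the message) before connecting it to a real phone.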
Sample Web Screens
Challenges
•
learning how to create advanced checks with
graphs
•
Avoiding false alarms (particularly after hours!)
•
limiting frequency of notifications - getting 20 text
messages on your cell phone in 5 minutes is not
helpful!
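On the first challenge: a local check can carry performance data in its third field, in the same name=value;warn;crit form that the filecount script uses, and that data is what feeds the graphs. The following sketch shows a threshold check on a fractional value; the service name, variable name, thresholds, and units are hypothetical.

```shell
#!/bin/bash
# Sketch of a local check that emits performance data for graphing.
# Service name, variable name, thresholds, and units are hypothetical.

# temp_check SERVICE VALUE WARN CRIT
# Uses awk for the comparisons because the value may be fractional.
temp_check() {
    local service=$1 value=$2 warn=$3 crit=$4
    local status text
    status=$(awk -v v="$value" -v w="$warn" -v c="$crit" \
        'BEGIN { if (v >= c) print 2; else if (v >= w) print 1; else print 0 }')

    case "$status" in
        0) text=OK ;;
        1) text=WARNING ;;
        *) text=CRITICAL ;;
    esac

    # Field 3 is the performance data: name=value;warn;crit
    echo "$status $service temp=$value;$warn;$crit $text - $value degrees C"
}

# Hypothetical reading of 41.5 with warn at 50 and critical at 60
temp_check Klystron_Temp 41.5 50 60
```

The sample call prints "0 Klystron_Temp temp=41.5;50;60 OK - 41.5 degrees C".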
How well did OMD/Nagios work?
•
The second shift only had to be on-site from
3:00PM to 8:00PM, rather than until 11:00PM
•
Daytime: OMD/Nagios warned staff of problems on
multiple occasions.
•
Off-hours: OMD/Nagios notified S-Pol staff of critical
hardware/software failures on multiple occasions
24x7 Operations: w/o working 24x7
•
Added SMS (text message) notifications to Nagios
•
Technicians and Engineers carried cell phones
•
Nagios sent SMS when hardware or software
problems occurred.
•
Technicians and Engineers would access Nagios
web pages via 3G modems on laptops
FUTURE
•
Monitoring of diesel generators
•
Add remote control:
•
generator & transfer switch
•
reset of transmitter faults
•
reset of antenna faults
Conclusion
•
Monitoring is important for any system, critical for
complex or unattended operation
•
OMD/Nagios makes it easy to deploy monitoring
•
OMD/Nagios helped EOL maintain high data quality
from S-Pol without requiring staff 24x7 on site.
•
Notifications via SMS and remote access to OMD’s
web pages are very helpful.
Acknowledgments
•
Ethan Galstad - Nagios chief developer
•
Mathias Kettner - check_mk
•
Fatima Dembele (summer intern) - prototyping
•
Paloma Gutierrez - hardware monitoring
•
Chris Burghart - Ka-band monitoring
•
Mike Dixon - Ka-band & HAWK monitoring
Questions?