Using OMD/Nagios to Monitor Complex Hardware/Software Systems
Joe VanAndel
NCAR/EOL
2012/3/29

Why is Monitoring Important?
• Software systems can be very complex:
  • networked data sources
  • multiple computers
  • long running daemons
• Hardware (including computers) can fail

Why is Monitoring Important (2)?
• Someone is relying on your system to produce or process data.
• Computers are better than people at monitoring; manual procedures are error-prone and don’t cover 24x7.
• Your staff may need to be notified out-of-hours if failures occur.

Why is Monitoring Important to S-Pol?
• S-Pol is a complex system of hardware and software - we need to detect problems so they can be quickly corrected.
• Notifications allow unattended operation, so staff don’t have to stay on site 24x7.
• We cannot afford to have 3 shifts in field projects

What is OMD?
• Open Monitoring Distribution (http://omdistro.org)
• runs on Linux
• Bundles Nagios with 16 useful utilities, including
  • check_mk - creates Nagios configurations for you!
  • rrdtool/rrdcached - store and retrieve time series data, supports graphing of performance data.

Why use OMD?
• complete package of monitoring tools
• avoid the effort of compiling and integrating Nagios add-ons
• Web based monitoring - from anywhere!

Why use check_mk?
• Automatically generates Nagios rules for each machine you monitor.
• Lower overhead allows running more checks on more hosts.
• easy to create both hardware and software checks.
• The S-Pol radar had 700 checks running on 14 hosts - we didn’t want to generate the Nagios configuration manually.

check_mk architecture
[Figure: check_mk architecture, showing check_mk_agent; figure from http://mathias-kettner.de]
RRD is “Round Robin Database”, which efficiently stores the output from check_mk.

Getting Started with OMD
• install the RPM
• $ omd create mysite # the monitoring instance
• create scripts in /usr/lib/check_mk_agent/local
• $ check_mk -I # run inventory
• $ omd start mysite # start daemons.
• open the check_mk URL in a browser.

Writing a check is simple
• write a C program, shell script, or Python script
• query hardware or software status
• output string(s) to stdout: "0 PgenTritonRaidStatus - OK"
• run a check_mk inventory to
  • find your script
  • generate the Nagios configuration

/usr/lib/check_mk_agent/local/filecount

    #!/bin/bash
    # Report the number of files in each directory as a check_mk local check.
    DIRS="/var/log /tmp"
    for dir in $DIRS
    do
        count=$(ls "$dir" | wc -l)
        # Below 50 files is OK, 50-99 is WARNING, 100 or more is CRITICAL.
        if [ "$count" -lt 50 ] ; then
            status=0
            statustxt=OK
        elif [ "$count" -lt 100 ] ; then
            status=1
            statustxt=WARNING
        else
            status=2
            statustxt=CRITICAL
        fi
        # Output: status, item name, performance data, status text
        echo "$status Filecount_$dir count=$count;50;100;0; $statustxt - $count files in $dir"
    done

S-Pol monitoring
• Radar hardware for S-Band & Ka-band:
  • antenna
  • transmitter
  • receiver
  • Klystron temperature
  • Container temperatures

Hardware Monitoring Architecture
[Figure: hardware monitoring architecture, showing the Sixnet Controller]

Hardware monitoring
• Sixnet controller communicates with measurement modules using RS-485
  • monitors transmitter status
  • monitors antenna status
  • monitors transmitter temperature
• Sixnet controller runs Linux, so adding a check_mk_agent was easy!

What else?
• Computer status:
  • CPU load
  • disk space
  • memory usage
• radar software - tasks running, products being produced
• fetching data: satellite images, soundings, forecast model output

Implementation
• installed OMD on a rack-mount Linux server
• installed check_mk_agent on all monitored computers
• wrote scripts, installed in /usr/lib/check_mk_agent/local

Implementation(2)
• Configured digital IO modules (controlled by an embedded Sixnet computer) to monitor S-Pol hardware
• Wrote a program on the Sixnet that reported hardware status to check_mk_agent
• Sent Ka-band status over the network; wrote software to create status files readable by check_mk scripts

Types of S-Pol checks
• scripts/programs that directly monitor hardware or software
• hybrid scripts - process the output of an existing program and output check_mk status reports.
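
A hybrid script of the kind described above might look like the following Python sketch. It is only an illustration under stated assumptions: the "name: state" report format, the state names, and the Raid_ item prefix are all hypothetical and do not come from the S-Pol scripts. The only part taken from the slides is the local check output convention (status code, item name, performance data or "-", then status text).

```python
#!/usr/bin/env python3
"""Sketch of a hybrid check_mk local check: reformat another program's output."""

# Hypothetical mapping from the states printed by an existing program
# to Nagios status codes (0=OK, 1=WARNING, 2=CRITICAL).
STATE_TO_STATUS = {"ok": 0, "degraded": 1, "failed": 2}
STATUS_TEXT = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}

# Hypothetical sample of what the existing program prints.
SAMPLE_OUTPUT = "RAID report\nvolume 1: OK\nvolume 2: failed\n"


def to_local_check_lines(program_output, prefix="Raid_"):
    """Convert 'name: state' lines into check_mk local check lines."""
    lines = []
    for raw in program_output.splitlines():
        if ":" not in raw:
            continue  # skip banners and blank lines
        name, state = (part.strip() for part in raw.split(":", 1))
        if not name:
            continue
        # Any state we do not recognise is reported as UNKNOWN (3).
        status = STATE_TO_STATUS.get(state.lower(), 3)
        # Item names are whitespace-delimited, so replace spaces.
        item = prefix + name.replace(" ", "_")
        # "-" means this check reports no performance data.
        lines.append(f"{status} {item} - {STATUS_TEXT[status]} - {name} is {state}")
    return lines


if __name__ == "__main__":
    # In real use the text would come from the existing program,
    # for example captured with subprocess.run(...).
    for line in to_local_check_lines(SAMPLE_OUTPUT):
        print(line)
```

Installed in /usr/lib/check_mk_agent/local and made executable, a script like this would be found by the next check_mk inventory in the same way as the filecount example.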
Implementation(3)
• configured GSM cell phone to send SMS messages
  • software from gnokii.org
  • bought local SIM
• wrote script to limit frequency of SMS messages

Sample Web Screens

Challenges
• learning how to create advanced checks with graphs
• Avoiding false alarms (particularly after hours!)
• limiting frequency of notifications - getting 20 text messages on your cell phone in 5 minutes is not helpful!

How well did OMD/Nagios work?
• The second shift only had to be on-site from 3:00PM to 8:00PM, rather than until 11:00PM
• Daytime: OMD/Nagios warned staff of problems on multiple occasions.
• Off-hours: OMD/Nagios notified S-Pol staff of critical hardware/software failures on multiple occasions

24x7 Operations: without working 24x7
• Added SMS (text message) notifications to Nagios
• Technicians and Engineers carried cell phones
• Nagios sent SMS when hardware or software problems occurred.
• Technicians and Engineers would access Nagios web pages via 3G modems on laptops

FUTURE
• Monitoring of diesel generators
• Add remote control:
  • generator & transfer switch
  • reset of transmitter faults
  • reset of antenna faults

Conclusion
• Monitoring is important for any system, and critical for complex or unattended operation
• OMD/Nagios makes it easy to deploy monitoring
• OMD/Nagios helped EOL maintain high data quality from S-Pol without requiring staff on site 24x7.
• Notifications via SMS and remote access to OMD’s web pages are very helpful.

Acknowledgments
• Ethan Galstad - Nagios chief developer
• Mathias Kettner - check_mk
• Fatima Dembele (summer intern) - prototyping
• Paloma Gutierrez - hardware monitoring
• Chris Burghart - Ka-band monitoring
• Mike Dixon - Ka-band & HAWK monitoring

Questions?
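
Backup sketch: limiting the frequency of SMS notifications. The slides say only that a script limited how often SMS messages were sent; they do not describe how. The Python below is one possible approach, a sliding-window limiter, and the class name, the default of 3 messages, and the 300-second window are all assumptions made for illustration.

```python
import time


class NotificationLimiter:
    """Allow at most max_messages notifications per window_seconds (hypothetical)."""

    def __init__(self, max_messages=3, window_seconds=300.0):
        self.max_messages = max_messages
        self.window_seconds = window_seconds
        self._sent = []  # timestamps of notifications that were allowed

    def allow(self, now=None):
        """Return True if a notification may be sent at time `now`."""
        now = time.time() if now is None else now
        # Forget notifications that have fallen out of the window.
        self._sent = [t for t in self._sent if now - t < self.window_seconds]
        if len(self._sent) >= self.max_messages:
            return False  # too many recent messages: suppress this one
        self._sent.append(now)
        return True


if __name__ == "__main__":
    limiter = NotificationLimiter(max_messages=2, window_seconds=300.0)
    for t in (0, 10, 20, 301):
        print(t, "send" if limiter.allow(now=t) else "suppress")
```

This version keeps its history in memory only. A script that Nagios starts once per notification would have to store the timestamps somewhere persistent, such as a small file, between runs.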