Extensible Monitoring with Nagios and Messaging Middleware
LISA 2012
Jonathan Reams <jreams@columbia.edu>

Symon Says Nagios Project
• Replace 12-year-old home grown monitoring system
  – Very customized
  – Very engineered
  – Very unsupported
• ~17,000 checks
• Mandate to move to Nagios

False Start
1. Installed Nagios
2. Ported checks from old system to new
3. Went out for coffee
4. Problems
   a. High check latency
   b. High load

Stock Nagios
[Diagram labels: Nagios Host, Nagios Process, Status Data File, Check Results, CGIs, Nagios Reapers, Check Processes, Sysadmin]

Nagios Problems
• Trapped on one host:
  – Check results
  – Status data
  – Configuration data
• Nagios isn't a great executor
  – Forks 2 processes per check
  – Everything is basically synchronous
  – async achieved with multiple processes
• Data format is simple but non-standard

Nagios Problems
• Implementation is all in C – hard to customize
• Can be I/O bound by reading/writing check result files
• Cannot query data from status file/configuration without reading/parsing all of it
• Input via FIFO gives no feedback and has a limited buffer size

Nagios Problems
Communication is hard!

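
To make the "input via FIFO gives no feedback" point concrete, here is a minimal sketch of how a tool typically submits an acknowledgement to stock Nagios through the external command file. The command file path is an assumption (it is whatever command_file is set to in nagios.cfg), and format_ack and submit are hypothetical helper names, not part of Nagios.

```python
import time

# Assumed path: the real location is set by command_file in nagios.cfg.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def format_ack(host, service, author, comment, now=None,
               sticky=1, notify=0, persistent=0):
    """Build one ACKNOWLEDGE_SVC_PROBLEM line in external command syntax:
    [time] COMMAND;arg1;arg2;...
    """
    ts = int(time.time() if now is None else now)
    fields = ["ACKNOWLEDGE_SVC_PROBLEM", host, service,
              str(sticky), str(notify), str(persistent), author, comment]
    return "[%d] %s\n" % (ts, ";".join(fields))


def submit(line, path=COMMAND_FILE):
    """Write the line to the FIFO.

    The write succeeding only means the bytes reached the pipe. Nothing
    comes back to say whether Nagios parsed or accepted the command.
    """
    with open(path, "w") as fifo:
        fifo.write(line)


line = format_ack("localhost", "rotate-unix", "jreams",
                  "Stop alerting me!!", now=1355074576)
print(line, end="")
```

The only way to learn whether such a command took effect is to re-read the status data afterwards, which is the kind of round trip a request/reply interface avoids.
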
My Solution
NagMQ
A ZeroMQ-based API for Nagios

Background on ZeroMQ
• Broker-less messaging kernel in a single library
• Emulates Berkeley socket API
• Supports IPC/TCP/Multicast transports
• Fanout, pub/sub, pipeline, and request/reply messaging patterns
• All I/O is asynchronous after connections are established with dedicated I/O threads
• Bindings available for a large number of operating systems and languages
• Agnostic of data being sent – no defined data format

NagMQ
Event Publisher & Commands

Host check result from publisher
host_check_processed localhost
{
  "host_name": "localhost",
  "check_type": 0,
  "check_options": 0,
  "scheduled_check": 1,
  "reschedule_check": 1,
  "current_attempt": 1,
  "max_attempts": 1,
  "state": 0,
  "last_state": 0,
  "last_hard_state": 0,
  "last_check": 1354996955,
  "last_state_change": 1337098090,
  "latency": 1.63600,
  "timeout": 60,
  "type": "host_check_processed",
  "start_time": { "tv_sec": 1354996955, "tv_usec": 636453 },
  "end_time": { "tv_sec": 1354996964, "tv_usec": 161965 },
  "early_timeout": 0,
  "execution_time": 0.07324,
  "return_code": 0,
  "output": "Host up",
  "long_output": null,
  "perf_data": null,
  "timestamp": { "tv_sec": 1354996964, "tv_usec": 161966 }
}

Command to add an acknowledgement to service problem
{'comment_data': 'Stop alerting me!!',
 'notify_contacts': False,
 'author_name': 'jreams',
 'persistent_comment': False,
 'host_name': 'localhost',
 'service_description': 'rotate-unix',
 'time_stamp': {'tv_sec': 1355074576},
 'type': 'acknowledgement'}

State Data
Request
{'keys': ['host_name', 'services', 'hosts', 'service_description', 'current_state', 'members', 'type', 'name', 'problem_has_been_acknowledged', 'plugin_output', 'checks_enabled', 'notifications_enabled', 'event_handler_enabled'],
 'include_services': True,
 'host_name': 'localhost'}
Response
[{'checks_enabled': True, 'notifications_enabled': True, 'current_state': 0, 'plugin_output': 'Host up', 'problem_has_been_acknowledged': 0, 'event_handler_enabled': True, 'host_name':
'localhost', 'services': ['rotate-unix'], 'type': 'host'},
 {'checks_enabled': False, 'notifications_enabled': True, 'current_state': 1, 'plugin_output': 'You are now on call', 'problem_has_been_acknowledged': False, 'event_handler_enabled': True, 'host_name': 'localhost', 'service_description': 'rotate-unix', 'type': 'service'}]

Some examples
• Distributed check execution (mqexec)
• Custom user interfaces (nag.py, etc)
• High availability (haagent.py, halib.py)

mqexec
• Asynchronous command executor
• Subscribes to host_check_initiate, service_check_initiate, and event_handler_start messages, and executes command line specified
• Can filter which commands to execute based on any attribute in message
• Receives messages as
  – Fair-queued worker pool (pull from MQ broker)
  – Individual worker (subscribe directly to NagMQ)
• Sends results back to command interface of NagMQ

Performance: Stock Nagios
[Chart: latency in seconds (0 to 18) against time in minutes (1 to 20); series: Max Host, Avg Host, Max Svc, Avg Svc]

Performance: NagMQ/mqexec
[Chart: latency in seconds (0 to 18) against time in minutes (1 to 20); series: Max Host, Avg Host, Max Svc, Avg Svc]

User Interfaces
• Command-line
  $ nag.py -c 'Stop alerting me!!' \
      add ack localhost
  [localhost]: No problem found
  [uptime@localhost]: Acknowledgement added
• Python/Javascript/Twitter Bootstrap web interface using NagMQ (see demo)
• Interface to Twitter

High Availability – Stock Nagios
[Diagram]

High Availability – NagMQ
[Diagram]

High Availability – NagMQ
• Use regular program_status to provide heartbeat
• Retrieve active state from state interface to bring passive node into sync with active node on startup
• Subscribe to and send check result messages, acknowledgements, downtimes, and adaptive changes to command interface
• Passive host's mqexec(s) run checks for whatever host is active
• Use VIFs owned by the message broker to direct traffic to active host

Why not use one of these?
• LiveStatus – live state query module with check execution workers
• Mod_gearman – distributed check execution based on gearman job queue
• Merlin – database/distributed backend for Nagios
• Ndoutils – database backend for Nagios
• NSCA – allows check/command submission over network
• NRPE – remote check executor

API – not a product
• NagMQ is just an interface into Nagios, not a product
• Better communication with clients comes from larger ZeroMQ project – leaving NagMQ to focus on Nagios
• Implement ad-hoc tools for Nagios without having to write any compiled code
• Doing expensive data processing of monitoring data doesn't have to create latency in monitoring system
• Re-use one interface for many tools

Future Work
• Pluggable authentication/encryption for NagMQ
• Pluggable parser/emitter for custom data formats (XML, YAML, etc.)
• NDOutils database replacement
• More user interfaces (Jabber, SMS, email gateway, REST API)
• Nagios 4

NagMQ
https://github.com/jbreams/nagmq
Jonathan Reams
jbreams@gmail.com
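
As a rough illustration of the "ad-hoc tools without compiled code" point, the sketch below consumes publisher events, filters them on message attributes (the idea behind mqexec's filtering), and builds an acknowledgement payload shaped like the one shown earlier. Several things here are assumptions rather than NagMQ's documented behavior: that each published message arrives as two frames (a space-separated header followed by a JSON body), the endpoint passed to subscribe, and the helper names parse_event, matches, build_ack and subscribe. The socket code needs pyzmq and is only imported when subscribe is actually used.

```python
import json


def parse_event(frames):
    """Split one published message into (type, routing fields, payload).

    Assumption: frame 0 is a header such as b"host_check_processed localhost"
    and frame 1 is the JSON body.
    """
    header, body = frames
    parts = header.decode("utf-8").split(" ")
    return parts[0], parts[1:], json.loads(body.decode("utf-8"))


def matches(payload, wanted):
    """True if every key in `wanted` is in the payload with an equal value."""
    return all(k in payload and payload[k] == v for k, v in wanted.items())


def build_ack(host, service, author, comment, tv_sec):
    """Acknowledgement command with the same keys as the example slide."""
    return {
        "type": "acknowledgement",
        "host_name": host,
        "service_description": service,
        "author_name": author,
        "comment_data": comment,
        "notify_contacts": False,
        "persistent_comment": False,
        "time_stamp": {"tv_sec": tv_sec},
    }


def subscribe(endpoint, topic=b"host_check_processed"):
    """Yield parsed events from a publisher endpoint (hypothetical address)."""
    import zmq  # pyzmq, imported lazily so the helpers above stay stdlib-only

    sock = zmq.Context.instance().socket(zmq.SUB)
    sock.connect(endpoint)
    sock.setsockopt(zmq.SUBSCRIBE, topic)  # prefix match on the header frame
    while True:
        yield parse_event(sock.recv_multipart())


# Offline demonstration using a trimmed copy of the example host check result.
frames = [
    b"host_check_processed localhost",
    json.dumps({"host_name": "localhost", "state": 0,
                "output": "Host up"}).encode("utf-8"),
]
kind, route, payload = parse_event(frames)
if matches(payload, {"host_name": "localhost", "state": 0}):
    print(kind, route, payload["output"])

ack = build_ack("localhost", "rotate-unix", "jreams",
                "Stop alerting me!!", 1355074576)
print(json.dumps(ack, sort_keys=True))
```

Keeping the parsing and filtering separate from the socket code means the same functions can be reused by a command-line tool, a web interface, or a worker, which is the "re-use one interface for many tools" argument in practice.
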