Nagios Implementation Case: Eastman Kodak Company
Eric Loyd, Founder & CEO, Bitnetix Incorporated
eric@bitnetix.com | www.bitnetix.com | 877.BITNETIX

About Eric Loyd and Bitnetix
• Founder and CEO of Bitnetix Incorporated
    - VOIP services and IT/network consulting
• 25 years in IT at places like
    - Eastman Kodak
    - Frontier Communications
    - Global Crossing
• Bitnetix started its seventh year in July 2012
• 2012 Digital Rochester GREAT Award Finalist in Communications Technology
• Using Nagios to monitor our client equipment and VOIP platform, and still using it at Kodak since 2004

© 2012 Bitnetix Incorporated

A History of Eastman Kodak’s kodak.com Web Server Infrastructure (non-confidential)

History of kodak.com
• Pre-2004
    - Machines located in Rochester, NY
    - Public Apache servers
    - Reverse proxy Apache servers
    - Application servers (ATG/Dynamo, Tomcat, etc.)
    - Database boxes, Production Support, etc.
• 2004 – Moved ~80 machines from ROC -> ???
    - ROC <-> ???
    - Firewalls
    - Bandwidth requirements
    - Minimal user impact
    - Flipped the switch, went live

History of kodak.com
Some of the things kodak.com did at the time
• Consumer store and product information
• B2B portal and wholesaler purchasing
• “Picture Of The Day” (www.kodak.com/go/potd)
• Warranty registration
• Photo lab calibration strips
• “Phone home” reports for printers, docks, cameras, etc.
• Software/firmware updates
• Corporate press releases, bios, and regulatory information
• Reverse proxy for internal information through secure channels
• Dozens of sitelets for products and campaigns

Why Kodak Chose Nagios to Monitor kodak.com

Why Nagios?
• No centralized corporate monitoring software
    - Nothing to compete with internally
    - Nothing to build on, either
• Cost
    - No additional cost beyond existing human resources
• Framework
    - Nagios worked with firewalls without needing agents
    - Leverage SSH, HTTP and other remote protocols
    - Custom checks and notifications (very important)

Initial Hurdles in the New Complex Server Environment

kodak.com Network

Initial hurdles
• Firewalls
    - Public load balancers on external Internet IPs
    - Public Apaches in Zone 1, Kodak network
    - Reverse proxy, app servers in Zone 2, semi-secure
    - Nagios machine in internal Zone 3, most secure
• Complex “top” and “bottom” checks for web site
    - Is the site working from the user’s perspective (top)?
    - From the application side (bottom)?
    - How to separate apparent from actual failure

Initial hurdles
• No Internal Nagios Knowledge
    - It was a contractor who set up Nagios (me)
    - Contractors typically have a finite lifespan at Kodak
    - Contractor made custom checks, event handlers, and all Nagios configurations. Uh-oh…
• Escalation and Paging
    - Screw it – let’s email everyone, every time and let Thunderbird sort it all out
    - Paging done via texting gateway email address
    - Which means email gateway failure = notification failure
    - Twitter API as backup / current primary notification

SSH to Remote Servers

SSH to the rescue
• One user, one key, infinite access
    - Software apps run as second user, with SSH auth
    - Additional robot accounts can be added at any time
• Wrap existing checks in an SSH shell
    - Provides additional control, error handling, reporting
    - Allows all checks to submit results to SQL database
• SQL Database Side Note – all custom scripts executed CLI Perl code that locked a file, logged to it, and unlocked it. A Perl cron job woke up every 5 minutes, locked the file, read it, pushed things to Oracle, unlocked, and deleted log file.
  A second cron pruned Oracle daily to 400 days of data and collapsed checks older than 30 days so that successive checks with the same status were removed.

Managing Nagios Configuration Files

Configuration Management
• SCCS
    - Solaris’s “poor man’s CVS”
    - Pre-installed, no additional cost, existing expertise
    - Current configuration is managed through SVN
• Rsync – the workhorse to move config files
    - Configuration Repository and Push (CRaP) directory
• Cfengine
    - Local versus remote execution
    - Post-install, ignore pid files, deploy/restart, etc.
• Makefile – the “CLI” to the entire process

Common Event Handler

Common Event Handler
EKrestart – That Which Does

Setup
• Arguments
• Conversions
• do_soft/hard?
• do_something?
• do_restart

do_restart
• Lock, logs, SQL
• send_nagios
• SSH to remote
• Remote EKrestart
• Process args
• do_<service>
• send_nagios
• Unlock, log, SQL
• Terminate

do_<service>
• Locks (level 2)
• Instance mapping
• Port mapping
• App restart
• Email & log
• Exit

A Closer Look at EKrestart

    #!/bin/sh
    PATH=...

    # Called with -r on the remote machine: hand all arguments to client_code
    [ "$1" = "-r" ] && client_code "$@"

    host="$1"
    service="$2"
    # Strip anything after the first colon in the service name
    baseService=`echo "$service" | awk -F: '{print $1}'`
    state="$3"
    type="$4"
    tries="$5"
    perfdata="$6"
    class="<based on machine name, e.g., x-y-CLASS-nnn.kodak.com>"
    number="<based on machine name, e.g., x-y-class-NNN.kodak.com>"

    # Dispatch on the service state. In the real script the functions
    # shown on the following slides must be defined before this point.
    case "$state" in
        OK)       do_fixit;;
        WARNING)  do_nothing;;
        UNKNOWN)  do_nothing;;
        CRITICAL) do_something;;
        *)        do_nothing;;
    esac

A Closer Look at EKrestart

    do_fixit() {
        case "$baseService" in
            Workers) do_restart;;
            *)       do_nothing;;
        esac
    }

    # $debug is expected to hold "true" or "false"
    do_nothing() {
        $debug && echo "$service is in $state state ($type) for $tries tries."
    }

    do_something() {
        case "$type" in
            SOFT) do_soft;;      # Take action before it's too late?
            HARD) do_restart;;   # Hard CRITICAL - Our last chance to take action
            *)    do_nothing;;
        esac
    }

    do_soft() {
        case "$tries" in
            3|4|5) do_restart;;  # Okay, let's restart it before it goes hard
            *)     do_nothing;;  # Don't restart yet
        esac
    }

A Closer Look at EKrestart

    do_restart() {
        # <figure some stuff out, set up lock files, send_nagios, log to SQL, etc>
        ssh $machine <EKrestart> -r do_$service <parameters>
        # <tear down, unlock, close log, send_nagios, log to SQL, etc>
        exit
    }

    # On the client side, we use the same EKrestart script, but start at client_code()
    client_code() {
        host=`hostname`
        function="$2"
        service="$3"
        # (etc)
        eval "$function"
        exit
    }

    # Example function
    do_Dynamo() {
        # lock file processing
        # turn off new sessions, wean existing ones
        # /etc/init.d/restart_dynamo_$instance
        # tear down
        return
    }

Integrating Nagios into Operational Procedures

Integration with Operations
• Homebrew API
    - nchart, send_nagios, nlog – all portable to other installations of Nagios on other machines
• Integrate with start/stop scripts
    - Lock files. Lots of lock files! TOO MANY lock files!!
• The “Rippler”
    - Leverage EKrestart, cron, and send_nagios
• Pager / Twitter and lots of private twitter feeds
• Inter-group notifications
    - Predominately with procmail

Predictive Failure Recovery and a Good Night’s Sleep

Predictive Failure Recovery
• On ATG/Dynamo (and other) services
    - do_soft triggers do_restart on third failure
    - do_hard always triggers restart
    - Notifications on fourth failure
    - Escalation to pager only on fifth notification
• Nagios has time to restart things that are bad, or are going bad, prior to sending out notifications
• Service check dependencies allow us to know whether it’s a bad application, server, or user experience
• Twitter – follow private tweets with smartphone, use apps to acknowledge problems, and get an even better night’s sleep!!

Questions
Eric Loyd, Founder & CEO, Bitnetix Incorporated
eric@bitnetix.com | www.bitnetix.com | 877.BITNETIX
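
The restart thresholds from the Predictive Failure Recovery slide can be sketched as one small shell function. This is an illustration only, not code from EKrestart: the function name decide_action and its output words (restart, wait, none) are hypothetical, and it covers only the CRITICAL path, leaving out the OK-state handling for Workers. The three arguments are assumed to be the service state, the state type, and the attempt number, which Nagios typically supplies to an event handler through its $SERVICESTATE$, $SERVICESTATETYPE$, and $SERVICEATTEMPT$ macros.

```shell
#!/bin/sh
# Hypothetical helper: print the action to take for a service check result.
#   $1 = state   (OK, WARNING, UNKNOWN, CRITICAL)
#   $2 = type    (SOFT or HARD)
#   $3 = attempt number of the current check
decide_action() {
    case "$1:$2" in
        CRITICAL:HARD)
            # Last chance to act before notifications go out
            echo restart ;;
        CRITICAL:SOFT)
            case "$3" in
                3|4|5) echo restart ;;  # restart before the state goes hard
                *)     echo wait ;;     # too early, let Nagios retry
            esac ;;
        *)
            # OK, WARNING, UNKNOWN, or anything unexpected: no action here
            echo none ;;
    esac
}

# Example calls
decide_action CRITICAL SOFT 1   # prints "wait"
decide_action CRITICAL SOFT 3   # prints "restart"
decide_action CRITICAL HARD 1   # prints "restart"
decide_action WARNING  SOFT 3   # prints "none"
```

Printing a word instead of restarting directly keeps the decision separate from the action, so the thresholds can be checked from the command line without touching a live service.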