Nagios Implementation Case:
Eastman Kodak Company
Eric Loyd
Founder & CEO
Bitnetix Incorporated
eric@bitnetix.com
www.bitnetix.com
877.BITNETIX
About Eric Loyd and Bitnetix
Founder and CEO of Bitnetix Incorporated
VOIP services and IT/network consulting
25 Years in IT at places like
Eastman Kodak
Frontier Communications
Global Crossing
Bitnetix started its seventh year in July, 2012
2012 Digital Rochester GREAT Award Finalist in
Communications Technology
Using Nagios to monitor our client equipment and VOIP
platform; in use at Kodak since 2004
© 2012 Bitnetix Incorporated
A History of Eastman Kodak’s
kodak.com Web Server
Infrastructure (non-confidential)
History of kodak.com
Pre-2004
Machines located in Rochester, NY
Public Apache servers
Reverse proxy Apache servers
Application servers (ATG/Dynamo, Tomcat, etc)
Database boxes, Production Support, etc.
2004 – Moved ~80 machines from ROC -> ???
ROC <-> ??? Firewalls
Bandwidth requirements
Minimal user impact
Flipped the switch, went live
History of kodak.com
Some of the things kodak.com did at the time
Consumer store and product information
B2B portal and wholesaler purchasing
“Picture Of The Day” (www.kodak.com/go/potd)
Warranty registration
Photo lab calibration strips
“Phone home” reports for printers, docks, cameras, etc
Software/firmware updates
Corporate press releases, bios, and regulatory information
Reverse proxy for internal information through secure channels
Dozens of sitelets for products and campaigns
Why Kodak Chose Nagios
to Monitor kodak.com
Why Nagios?
No centralized corporate monitoring software
Nothing to compete with internally
Nothing to build on, either
Cost
No additional cost beyond existing human resources
Framework
Nagios worked with firewalls without needing agents
Leverage SSH, HTTP and other remote protocols
Custom checks and notifications (very important)
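
Since custom checks were a deciding factor, here is a minimal sketch of what one can look like. It follows the standard Nagios plugin convention (one line of status text, exit code 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN). The function name `check_threshold` and its thresholds are hypothetical and not taken from the Kodak installation.

```shell
#!/bin/sh
# Hypothetical custom check: compare a numeric value against two thresholds
# and report the result using the standard Nagios plugin exit codes.
check_threshold() {
    value="$1"; warn="$2"; crit="$3"
    # Anything that is not a non-negative integer is reported as UNKNOWN.
    case "$value" in
        ''|*[!0-9]*) echo "UNKNOWN - no numeric value"; return 3;;
    esac
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - value=$value"; return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - value=$value"; return 1
    fi
    echo "OK - value=$value"; return 0
}
```

A real plugin would end with something like `check_threshold "$measured" 10 20; exit $?` so that Nagios sees the exit code.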
Initial Hurdles in the New
Complex Server Environment
kodak.com Network
Initial hurdles
Firewalls
Public load balancers on external Internet IPs
Public Apaches in Zone 1, Kodak network
Reverse proxy, app servers in Zone 2, semi-secure
Nagios machine in internal Zone 3, most secure
Complex “top” and “bottom” checks for web site
Is the site working from the user’s perspective (top)?
From the application side (bottom)?
How to separate apparent from actual failure
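
One possible reading of the top/bottom idea, sketched under assumptions: run both checks, then combine their plugin exit codes to decide where the problem sits. The function name `classify_failure` and the labels it prints are illustrative only; the slides do not say how the real checks were combined.

```shell
#!/bin/sh
# classify_failure TOP_RC BOTTOM_RC
#   TOP_RC:    exit code of the check made from the user's side (0 = OK)
#   BOTTOM_RC: exit code of the check made on the application side (0 = OK)
classify_failure() {
    top="$1"; bottom="$2"
    if [ "$top" -eq 0 ] && [ "$bottom" -eq 0 ]; then
        echo "healthy"
    elif [ "$top" -ne 0 ] && [ "$bottom" -eq 0 ]; then
        # The application answers locally but users cannot reach it:
        # suspect something in between (load balancer, firewall, proxy).
        echo "path failure"
    elif [ "$top" -eq 0 ] && [ "$bottom" -ne 0 ]; then
        # Users are still being served, so the failure is not yet visible.
        echo "hidden application failure"
    else
        echo "application failure"
    fi
}
```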
Initial hurdles
No Internal Nagios Knowledge
It was a contractor who set up Nagios (me)
Contractors typically have a finite lifespan at Kodak
Contractor made custom checks, event handlers,
and all Nagios configurations. Uh-oh…
Escalation and Paging
Screw it – let’s email everyone, every time and let
Thunderbird sort it all out
Paging done via texting gateway email address
Which means email gateway failure = notification failure
Twitter API as backup / current primary notification
SSH to Remote Servers
SSH to the rescue
One user, one key, infinite access
Software apps run as second user, with SSH auth
Additional robot accounts can be added at any time
Wrap existing checks in an SSH shell
Provides additional control, error handling, reporting
Allows all checks to submit results to SQL database
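
A sketch of the wrapping idea, assuming the wrapped command is a Nagios-style check. The wrapper passes valid plugin results through and turns anything else into UNKNOWN, which matters because `ssh` itself exits with 255 when the connection fails. The name `run_wrapped` is hypothetical; in production the wrapped command would be an `ssh` invocation of the remote check.

```shell
#!/bin/sh
# run_wrapped COMMAND [ARGS...]
# Runs the command, keeps its output, and normalises its exit status.
run_wrapped() {
    rc=0
    output=$("$@" 2>&1) || rc=$?
    case "$rc" in
        0|1|2|3) ;;   # valid plugin result: pass it through unchanged
        255) output="UNKNOWN - transport failure: $output"; rc=3;;
        *)   output="UNKNOWN - unexpected exit $rc: $output"; rc=3;;
    esac
    echo "$output"
    return "$rc"
}
```

Keeping the error handling in one wrapper means every check reports transport problems the same way, and it gives a single place to add logging.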
SQL Database Side Note: every custom script ran command-line Perl
code that locked a log file, appended to it, and unlocked it. A Perl
cron job woke up every 5 minutes, locked the file, read it, pushed
the entries to Oracle, unlocked, and deleted the log file. A second
cron job pruned Oracle daily to 400 days of data and collapsed checks
older than 30 days so that successive checks with the same
status were removed.
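
The lock, log, unlock cycle from the side note can be sketched as follows. The original was Perl writing to Oracle; this sketch swaps in POSIX shell, uses `mkdir` as the lock (it either creates the directory or fails), and prints the drained lines where the real job inserted rows. All names and paths here are hypothetical, and `date +%s` is assumed to be available.

```shell
#!/bin/sh
LOG_FILE="${LOG_FILE:-/tmp/check_results.log}"   # placeholder path
LOCK_DIR="$LOG_FILE.lock"

acquire_lock() {
    # Retry for up to about 10 seconds before giving up.
    tries=0
    until mkdir "$LOCK_DIR" 2>/dev/null; do
        tries=$((tries + 1))
        [ "$tries" -ge 10 ] && return 1
        sleep 1
    done
}

release_lock() { rmdir "$LOCK_DIR"; }

# Called by each check: append one timestamped result line.
log_result() {
    acquire_lock || return 1
    echo "$(date +%s) $*" >> "$LOG_FILE"
    release_lock
}

# Called from cron: hand the accumulated lines on, then remove the file.
drain_log() {
    acquire_lock || return 1
    if [ -f "$LOG_FILE" ]; then
        cat "$LOG_FILE"      # stand-in for the database insert
        rm -f "$LOG_FILE"
    fi
    release_lock
}
```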
Managing Nagios
Configuration Files
Configuration Management
SCCS
Solaris’s “poor man’s CVS”
Pre-installed, no additional cost, existing expertise
Current configuration is managed through SVN
Rsync – the workhorse to move config files
Configuration Repository and Push (CRaP) directory
Cfengine
Local versus remote execution
Post-install, ignore pid files, deploy/restart, etc.
Makefile – the “CLI” to the entire process
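
As a rough illustration of the push step, the sketch below verifies a configuration before copying it anywhere. `nagios -v` (configuration check) and `rsync -a --delete` are real commands, but the function name `push_config`, the directory layout, and the idea of gating the push on verification are assumptions, not a description of the CRaP directory or the Makefile.

```shell
#!/bin/sh
# Both commands can be overridden, for example for a dry run.
VERIFY_CMD="${VERIFY_CMD:-nagios -v}"
SYNC_CMD="${SYNC_CMD:-rsync -a --delete}"

# push_config SRC_DIR DEST
# Verifies SRC_DIR/nagios.cfg and, only if that succeeds, syncs SRC_DIR to DEST.
push_config() {
    src="$1"; dest="$2"
    if ! $VERIFY_CMD "$src/nagios.cfg" >/dev/null 2>&1; then
        echo "verify failed, not pushing" >&2
        return 1
    fi
    $SYNC_CMD "$src/" "$dest/"
}
```

Setting `SYNC_CMD=echo` prints the arguments that would be passed to the sync command instead of copying anything.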
Common Event Handler
Common Event Handler
EKrestart – That Which Does
Setup
• Arguments
• Conversions
• do_soft/hard?
• do_something?
• do_restart
do_restart
• Lock, logs, SQL
• send_nagios
• SSH to remote
• Remote EKrestart
• Process args
• do_<service>
• send_nagios
• Unlock, log, SQL
• Terminate
do_<service>
• Locks (level 2)
• Instance mapping
• Port mapping
• App restart
• Email & log
• Exit
A Closer Look at EKrestart
#!/bin/sh
PATH=...
[ "$1" = "-r" ] && client_code "$@"   # pass the arguments on; a shell function gets its own $1, $2, ...
host="$1"
service="$2"
baseService=`echo "$service" | awk -F: '{print $1}'`
state="$3"
type="$4"
tries="$5"
perfdata="$6"
class="<based on machine name, e.g., x-y-CLASS-nnn.kodak.com>"
number="<based on machine name, e.g., x-y-class-NNN.kodak.com>"
case "$state" in
OK) do_fixit;;
WARNING) do_nothing;;
UNKNOWN) do_nothing;;
CRITICAL) do_something;;
*) do_nothing;;
esac
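
The `class` and `number` lines above are placeholders. One way to derive them, assuming host names really follow the `x-y-CLASS-NNN.domain` pattern, is plain POSIX parameter expansion. The function name `parse_host` and the sample host name are hypothetical.

```shell
#!/bin/sh
# parse_host FQDN
# Sets $class and $number from a name shaped like x-y-CLASS-NNN.domain.
parse_host() {
    short="${1%%.*}"        # strip the domain:              x-y-class-nnn
    number="${short##*-}"   # text after the last hyphen:    nnn
    rest="${short%-*}"      # drop the trailing "-nnn":      x-y-class
    class="${rest##*-}"     # text after the last hyphen:    class
}

parse_host "a-b-dynamo-012.example.com"
echo "class=$class number=$number"   # prints: class=dynamo number=012
```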
A Closer Look at EKrestart
do_fixit() {
case "$baseService" in
Workers) do_restart;;
*) do_nothing;;
esac
}
do_nothing() {
$debug && echo "$service is in $state state ($type) for $tries tries."
}
do_something() {
case "$type" in
SOFT) do_soft;;          # Take action before it's too late?
HARD) do_restart;;       # Hard CRITICAL - Our last chance to take action
*) do_nothing;;
esac
}
do_soft() {
case "$tries" in
3|4|5) do_restart;;      # Okay, let's restart it before it goes hard
*) do_nothing;;          # Don't restart yet
esac
}
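
To try the decision logic outside Nagios, the dispatch on these slides can be restated as one standalone function that prints the action instead of performing it. This is a sketch, not the production script: the name `decide` is hypothetical, and it assumes the intent of `do_soft` is to match attempts 3, 4, or 5 (in a shell `case` pattern, alternatives are separated by `|`).

```shell
#!/bin/sh
# decide STATE TYPE TRIES BASESERVICE
# Prints "restart" or "nothing" for the given Nagios state information.
decide() {
    state="$1"; type="$2"; tries="$3"; base="$4"
    case "$state" in
        OK)
            case "$base" in
                Workers) echo restart;;
                *)       echo nothing;;
            esac;;
        CRITICAL)
            case "$type" in
                HARD) echo restart;;
                SOFT)
                    case "$tries" in
                        3|4|5) echo restart;;
                        *)     echo nothing;;
                    esac;;
                *) echo nothing;;
            esac;;
        *) echo nothing;;   # WARNING, UNKNOWN, anything else
    esac
}

for t in 1 2 3; do
    echo "CRITICAL SOFT attempt $t: $(decide CRITICAL SOFT "$t" Dynamo)"
done
```

The loop prints `nothing` for attempts 1 and 2 and `restart` for attempt 3.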
A Closer Look at EKrestart
do_restart() {
# <figure some stuff out, set up lock files, send_nagios, log to SQL, etc>
ssh $machine <EKrestart> -r do_$service <parameters>
# <tear down, unlock, close log, send_nagios, log to SQL, etc>
exit
}
# On the client side, we use the same EKrestart script, but start at client_code()
client_code() {
host=`hostname`
function="$2"
service="$3"
# (etc)
eval "$function"
exit
}
# Example function
do_Dynamo() {
# lock file processing
# turn off new sessions, wean existing ones
# /etc/init.d/restart_dynamo_$instance
# tear down
return
}
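
The client side runs whatever function name it is handed. A slightly more defensive variant, sketched here as an alternative and not as what EKrestart actually does, replaces `eval` with a direct call and only accepts names that start with `do_` and are actually defined. The helper name `dispatch` and the stub `do_Dynamo` body are hypothetical.

```shell
#!/bin/sh
# Stub handler standing in for the real restart logic.
do_Dynamo() { echo "restarting Dynamo instance ${1:-default}"; }

# dispatch FUNCTION [ARGS...]
# Calls FUNCTION with ARGS if it looks like a handler and exists.
dispatch() {
    fn="$1"; shift
    case "$fn" in
        do_*) ;;   # looks like a handler name
        *) echo "refusing to run '$fn'" >&2; return 2;;
    esac
    if command -v "$fn" >/dev/null 2>&1; then
        "$fn" "$@"   # direct call: the name is never re-parsed by the shell
    else
        echo "no handler named '$fn'" >&2
        return 3
    fi
}
```

For example, `dispatch do_Dynamo 3` runs the stub, while `dispatch echo hi` is refused with status 2.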
Integrating Nagios into
Operational Procedures
Integration with Operations
Homebrew API
nchart, send_nagios, nlog – all portable to other
installations of Nagios on other machines
Integrate with start/stop scripts
Lock files. Lots of lock files! TOO MANY lock files!!
The “Rippler”
Leverage EKrestart, cron, and send_nagios
Pager / Twitter and lots of private twitter feeds
Inter-group notifications
Predominately with procmail
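
The slides do not show how `send_nagios` works. A common way for scripts to report back to Nagios is a passive check result written to the external command file, so a helper of that kind might look roughly like the sketch below. The name `submit_passive` is hypothetical, the default command-file path is only the usual location for source installs, and `date +%s` is assumed to be available.

```shell
#!/bin/sh
CMD_FILE="${CMD_FILE:-/usr/local/nagios/var/rw/nagios.cmd}"   # assumed path

# submit_passive HOST SERVICE RETURN_CODE OUTPUT...
# Appends a PROCESS_SERVICE_CHECK_RESULT external command for Nagios.
submit_passive() {
    host="$1"; service="$2"; rc="$3"; shift 3
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$(date +%s)" "$host" "$service" "$rc" "$*" >> "$CMD_FILE"
}
```

A start/stop script could call, for example, `submit_passive web01 Dynamo 0 "OK - restarted by operator"` after a planned restart.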
Predictive Failure Recovery
and a Good Night’s Sleep
Predictive Failure Recovery
On ATG/Dynamo (and other) services
do_soft triggers do_restart on third failure
do_hard always triggers restart
Notifications on fourth failure
Escalation to pager only on fifth notification
Nagios has time to restart things that are bad, or are
going bad, prior to sending out notifications
Service check dependencies allow us to know whether
it’s a bad application, server, or user experience
Twitter – follow private tweets with smartphone, use
apps to acknowledge problems, and get an even
better night’s sleep!!
Questions
Eric Loyd
Founder & CEO
Bitnetix Incorporated
eric@bitnetix.com
www.bitnetix.com
877.BITNETIX