Nagios in the Real World
Dave Williams
Technical Architect
©Bull, 2011

Agenda
- Introduction
  - General Background
  - System Monitoring Background
- Example Implementations of Nagios
  - UK Customer Examples
- Datacentre Monitoring with Nagios
  - What is a Datacentre?
  - Software & Hardware combinations
  - Vision
- Conclusions

Background
- UK based
  - Mainframe (IBM & Honeywell)
  - Unix (HP-UX, AIX, Solaris)
  - Network (CASE, 3COM, CISCO)
- Working for Bull
  - French Computer Manufacturer
  - Mainframes, Unix, HPC, Security, Managed Services

Background
- System Monitoring
  - OpenView
  - Netview
  - Open Master
- Open Source Monitoring
  - NetSaint on AIX
  - Nagios

Example Implementations

Crown Office Procurator Fiscal Service
- Responsible for the prosecution of crime in Scotland
  - Investigation of suspicious deaths
  - Complaints against the Police
- IT Locations in Glasgow & Edinburgh
  - Windows at every Court of Justice in Scotland
  - AIX / Oracle DB at Glasgow & Edinburgh

Crown Office Procurator Fiscal Service
- Already used Solarwinds for some network monitoring
- Strategy demanded AIX based monitoring & reporting
- Nagios was selected in a competitive tender
  - Main success points were simplicity and ease of customisation
  - Fitted within AIX based distance data replication already in use

Crown Office Procurator Fiscal Service
- 60+ Windows systems monitored for CPU, Disk Space etc.
- 2 AIX servers monitored for CPU, Disk Space etc.
- Two Oracle instances monitored for performance and DBspace usage
- All alerts shown on monitor screen and, if necessary, sent as SMS Text alerts
- Installed 2005, still working
- Provides ‘backstop’ to Solarwinds for capacity monitoring on the WAN & LAN

Rother District Council
- “Working with the community to improve the overall well-being of the District”
  - Responsible for Waste Collection, Housing, Planning & Building Control
  - The District covers some 200 square miles and serves a population of around 90,000 inhabitants

Rother District Council
- Monitoring 20+ Windows Servers for CPU, Disk Utilisation etc.
- Monitoring numerous disparate Applications
- Reporting on Availability
- Monitoring Printer status
- Unexpected benefits

North Yorkshire County Council
- Internet Access system for 30,000 pupils
- Monitoring e-mail, internet access, IDS, AV, Webservers
  - Reporting on Availability
  - Monitoring Service Level Indicators
  - Mix of application providers (Scalix, Plesk)
  - Mix of appliance systems: Cisco, Panda, Radware, NetEnforcer, MyFilter

North Yorkshire County Council
- System Schematic

North Yorkshire County Council
- Uses NRPE to perform active checks on hosts
- Multi O/S support
  - Debian
  - RedHat
- Uses NSCA to accept check results from Windows
  - Via NagiosEventLog
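
To illustrate the passive-check path, here is a minimal sketch that formats one service check result as the tab-separated line the send_nsca client reads on standard input (host, service description, return code, plugin output). The host name, service name and message are hypothetical and are not taken from the NYCC configuration.

```python
# Nagios return codes used by plugins and passive check results.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def format_passive_result(host: str, service: str, code: int, output: str) -> str:
    """Build one send_nsca input line: host<TAB>service<TAB>code<TAB>output."""
    if code not in (OK, WARNING, CRITICAL, UNKNOWN):
        raise ValueError(f"invalid Nagios return code: {code}")
    # Tabs and newlines inside a field would break the delimited format.
    fields = [host, service, str(code), output]
    cleaned = [f.replace("\t", " ").replace("\n", " ") for f in fields]
    return "\t".join(cleaned) + "\n"


if __name__ == "__main__":
    # Hypothetical Windows host reporting an event log warning.
    print(format_passive_result("winhost01", "EventLog", WARNING, "3 new warnings"), end="")
```

A line built this way would typically be piped to send_nsca, which forwards it to the NSCA daemon on the Nagios server.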

North Yorkshire County Council
- E-mail
  - Scalix running on Redhat Cluster. Checking all processes, cluster state etc.
- PLESK Web server
  - Checking availability of web sites via test installation
  - Monitoring disk utilisation and processor utilisation
- AV systems
  - Monitoring availability
  - Checking on AV database
- Myfilter
  - Monitoring email filters running
  - Checking that sufficient email filters are available

North Yorkshire County Council
- E-mail
  - Nagios server runs external email loopback test every 20 minutes to confirm external reachability
- PLESK Web server
  - Straightforward implementation of check_http
- NetBackup
  - Monitoring that backups have run
  - Checking that enough backup tapes are available
- Business Availability
  - Define which services constitute a business line
  - 07:00 check: tell support before the customers come on line
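
One way to picture the "business line" idea is a worst-state roll-up: a business line is only as healthy as its least healthy member service. The sketch below is an assumption about how that could be expressed, not the configuration used at NYCC; the business line names, service names and the severity ranking are all hypothetical.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumed severity ranking for the roll-up: CRITICAL > WARNING > UNKNOWN > OK.
SEVERITY = {OK: 0, UNKNOWN: 1, WARNING: 2, CRITICAL: 3}

# Hypothetical business lines and the services each one depends on.
BUSINESS_LINES = {
    "pupil-email": ["smtp", "imap", "av-scan"],
    "web-hosting": ["http", "disk"],
}


def business_line_state(line: str, service_states: dict) -> int:
    """Return the worst state among the services that make up one business line."""
    # A service with no recorded state is treated as UNKNOWN.
    states = [service_states.get(svc, UNKNOWN) for svc in BUSINESS_LINES[line]]
    return max(states, key=lambda s: SEVERITY[s])


if __name__ == "__main__":
    current = {"smtp": OK, "imap": WARNING, "av-scan": OK, "http": OK, "disk": OK}
    for name in BUSINESS_LINES:
        print(name, business_line_state(name, current))
```

Running a roll-up like this shortly before the working day starts is one way to support the 07:00 check described above.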

NYCC - Nagiosgraph
- Nagiosgraph
  - Uses process_performance_data
  - Example of Unix load average
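
For readers unfamiliar with performance data, the sketch below parses the plugin perfdata format ('label'=value[UOM];warn;crit;min;max, space separated) into a dictionary. It only illustrates the data format and is not how nagiosgraph itself is implemented; the function name and the sample load values are hypothetical.

```python
import re

# One metric: 'label'=value[UOM];[warn];[crit];[min];[max]
METRIC_RE = re.compile(
    r"(?:'(?P<quoted>[^']+)'|(?P<plain>[^\s=']+))="
    r"(?P<value>-?\d+(?:\.\d+)?)(?P<uom>[A-Za-z%]*)"
    r"(?P<rest>(?:;[^\s;]*)*)"
)


def parse_perfdata(perfdata: str) -> dict:
    """Map each label to its numeric value, unit and optional thresholds."""
    metrics = {}
    for m in METRIC_RE.finditer(perfdata):
        label = m.group("quoted") or m.group("plain")
        # Thresholds follow the value as ;warn;crit;min;max and may be empty.
        parts = m.group("rest").split(";")[1:]
        parts += [""] * (4 - len(parts))
        warn, crit, low, high = parts[:4]
        metrics[label] = {
            "value": float(m.group("value")),
            "uom": m.group("uom"),
            "warn": warn, "crit": crit, "min": low, "max": high,
        }
    return metrics


if __name__ == "__main__":
    # Hypothetical perfdata in the style of a Unix load average check.
    sample = "load1=0.120;5.000;10.000;0; load5=0.100;4.000;6.000;0;"
    for name, data in parse_perfdata(sample).items():
        print(name, data["value"], data["warn"], data["crit"])
```

Thresholds are kept as strings here because the format also allows ranges, which a simple float conversion would reject.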

NYCC - Nagios Monitoring
- Scalix Email System

NYCC
- Alerts sent via email to customers as well as support
- Backup notifications via SMS Text
- Use Nagios Looking Glass for Customer View
- nagiosgraph used to catch all service performance data
  - Debian & Redhat performance metrics
  - Network throughput from LAN switches
  - LDAP response time

Datacentre Monitoring with Nagios

What is a DataCentre?
- A data center (or datacentre) is a facility used to house computer systems and associated components, such as telecommunications and storage systems. It generally includes redundant or backup power supplies, redundant data communications connections, environmental controls and security devices.
  (Wikipedia)

How good is your DataCentre?
- The TIA-942: Data Center Standards Overview describes the requirements for the data centre infrastructure. The simplest is a Tier 1 data centre, which is basically a server room, following basic guidelines for the installation of computer systems. The most stringent level is a Tier 4 data centre, which is designed to host mission critical computer systems, with fully redundant subsystems and compartmentalized security zones controlled by biometric access control methods.
  (Wikipedia)

What is a DataCentre?
- Tier 1 Requirements
  - Single non-redundant distribution path serving the IT equipment
  - Non-redundant capacity components
  - Basic site infrastructure guaranteeing 99.671% availability
- Tier 2 Requirements
  - Fulfills all Tier 1 requirements
  - Redundant site infrastructure capacity components guaranteeing 99.741% availability
- Tier 3 Requirements
  - Fulfills all Tier 1 and Tier 2 requirements
  - Multiple independent distribution paths serving the IT equipment
  - All IT equipment must be dual-powered and fully compatible with the topology of a site's architecture
  - Concurrently maintainable site infrastructure guaranteeing 99.982% availability
- Tier 4 Requirements
  - Fulfills all Tier 1, Tier 2 and Tier 3 requirements
  - All cooling equipment is independently dual-powered, including chillers and heating, ventilating and air-conditioning (HVAC) systems
  - Fault-tolerant site infrastructure with electrical power storage and distribution facilities guaranteeing 99.995% availability
©Uptime Institute
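
The availability percentages are easier to compare once converted into permitted downtime per year. The sketch below does that arithmetic, assuming a 365-day year; the function and constant names are illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

# Availability percentages as quoted for each tier.
TIER_AVAILABILITY = {1: 99.671, 2: 99.741, 3: 99.982, 4: 99.995}


def annual_downtime_hours(availability_percent: float) -> float:
    """Hours per year a site may be unavailable at the given availability."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


if __name__ == "__main__":
    for tier, pct in TIER_AVAILABILITY.items():
        print(f"Tier {tier}: {pct}% -> {annual_downtime_hours(pct):.1f} h/year")
```

On these figures Tier 1 allows roughly 28.8 hours of downtime a year, Tier 3 roughly 1.6 hours and Tier 4 roughly 0.4 hours.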

What is a Green DataCentre?
- The most commonly used metric to determine the energy efficiency of a data centre is power usage effectiveness, or PUE. This simple ratio is the total power entering the data centre divided by the power used by the IT equipment.
  - PUE = Total Facility Power / IT Equipment Power
- Power used by support equipment, often referred to as overhead load, mainly consists of cooling systems, power delivery, and other facility infrastructure like lighting. The average data centre in the US has a PUE of 2.0, meaning that the facility uses one Watt of overhead power for every Watt delivered to IT equipment. State-of-the-art data centre energy efficiency is estimated to be roughly 1.2.
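
As a worked example of the ratio, the sketch below computes PUE from two power readings and derives the overhead per Watt of IT load (PUE minus one). The meter readings are hypothetical figures chosen for illustration, not measurements from any real site.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power usage effectiveness: total facility power over IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw


def overhead_per_it_watt(pue_value: float) -> float:
    """Watts of overhead (cooling, power delivery, lighting) per Watt of IT load."""
    return pue_value - 1.0


if __name__ == "__main__":
    # Hypothetical meter readings in kW.
    ratio = pue(total_facility_kw=400.0, it_equipment_kw=250.0)
    print(f"PUE = {ratio:.2f}, overhead per IT Watt = {overhead_per_it_watt(ratio):.2f}")
```

With these readings the PUE is 1.6, so 0.6 Watts of overhead are spent for every Watt delivered to IT equipment, compared with one full Watt at a PUE of 2.0.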

Bull Datacentre BC1
- New datacentre build on an already existing site
- Design criteria PUE 1.6
- Easily expanded on demand
- Tier 3

Bull UK Datacentre BC1
- What do you get for £1.2m?

Bull UK Datacentre BC1
- New Mains Incomer
  - Took feed from 11 kV ring
  - Had to build own substation
- 1.2 MW Generator
  - Required 8000 litre fuel tank
  - Switchgear to automatically start generator if mains incomer fails (10-45 seconds)
- 3 x Ambient CRAC Units
  - Cooling via external temperature differential
  - N+1 configuration
  - Hot Aisle Containment
- In-Line UPS
  - UPS only required to keep IT equipment running until generator fires up
  - Uses space in Cab rows, easily scalable according to load

Bull UK Datacentre BC1 - Monitoring
- Physical Environment
  - APC Netbotz Devices
    - Translate inputs from sensors
      • Humidity, Temperature, Dew Point
  - SEAL I/O Dry Contact
    - Voltage indicators
      • For CRAC, FM200, Generator, UPS
- Electrical Efficiency
  - PowerLogic ION software reads from power meters
  - Power meter on every Distribution Board
  - Real-time calculation of PUE
- Power Distribution
  - Every PDU strip (2 per Cab) monitored for power consumption & problems
  - A number of PDU strips also have remote control down to socket level
- Management Network
  - LAN infrastructure required to support the Datacentre
  - Servers required to support the datacentre
  - External alert mechanisms

Bull UK Datacentre BC1
- What does Netbotz look like?

Bull UK Datacentre BC1
- What does SeaLevel look like?

Bull UK Datacentre BC1
- What does ION look like?

Bull UK Datacentre BC1
- What does a metered PDU look like?

Bull UK Datacentre BC1
- What does a managed PDU look like?

Bull UK Datacentre BC1
- Nagios Map

Bull UK Datacentre BC1
- Nagios Host Groups

Bull UK Datacentre BC1
- Do things go wrong - yes

Bull UK Datacentre BC1
- Do things go wrong - yes & no

Datacentre Monitoring Schematic

Nagios Products in use
- Nagios Core
  - NRPE
  - NSCA
- Nagios Looking Glass
- Nagvis
- EventDB
- SNMPTT
- Nagmap
- NDO

Other Open Source Products in use
- Nedi
- Arpwatch
- PSAD
- SMS-Client
- Bacula
- Confluence (Wiki)
- i-doit (ITIL CMDB)
- MRTG
- Routers2cgi

BC1 Datacentre Monitoring Elements
- Nagios Core
  - Normal install with direct polling of devices
  - Only looking at Datacentre
- Nagios Display System
  - Central reporting Nagios
  - Absorbs updates from other Nagios instances
- Information Display
  - Normal system with 5 heads
- Nagios Customer System
  - Running on an appliance connected to Customer network
  - Sends data via encrypted secured link to Display System
- Backup System
  - Use tape library
  - Hosts CMDB & Wiki

BC1 Datacentre Nagios Core
- Hardware Platform - Intel
  - O/S Centos 5
  - Xeon 2.8 GHz, 8 GB memory, 72 GB RAID-1 disk
- Nagios 3.2.0
  - Built from source tarball
- Nagios Plugins 1.4.15-2
  - Installed from RPM

BC1 Datacentre Nagios Display System
- Hardware Platform - Intel
  - O/S Fedora Core 9
  - P4 2.8 GHz, 2.5 GB memory, 76 GB RAID-1 disk
  - Nvidia dual monitor display card, DVI interfaces
- Nagios 3.0.6
  - Built from source tarball
- Nagios Plugins 1.4.13-9
  - Installed from RPM

BC1 Datacentre Normal Display System
- Hardware Platform - AMD
  - O/S Centos 5
  - Athlon 1.2 GHz, 1.0 GB memory, 3 GB disk
  - Matrox G200 Quad Head
- Runs console displays: http/RDP/ssh

BC1 Datacentre Customer System
- Hardware Platform - Motion Tablet
  - O/S Ubuntu 10.04 LTS
  - Pentium M 1.5 GHz, 0.5 GB memory, 30 GB disk
  - Touch Screen tablet system
- Nagios 3.2.3
  - Built from tarball
- Nagios Plugins 1.4.15
  - Built from tarball
- Nagios NSCA
  - Sends status (encrypted) to central reporting system

BC1 Datacentre Backup System
- Hardware Platform - Intel
  - O/S Centos 5
  - Xeon 3.06 GHz, 2.0 GB memory, 108 GB disk
- Uses Bacula 5.0.3
  - Controls SDLT 20 slot tape library
  - Backs up all Datacentre Infrastructure
    • Windows
    • Centos
    • Ubuntu

Conclusions

Conclusions
- Strategic Overall Design
  - Know what you need to monitor
  - Know who needs to be told
- Expect to throw the first version away
  - Only when you have fully engineered the solution will you understand all of the issues
  - Keep a record of design decisions
- You will have to make it pretty for management
  - Accept that an attractive display will be required
  - Reporting will become key
- It must be reliable
  - Make backups
  - Consider clustering & recovery options

Hints & Experience
- Separate Display systems from Monitoring systems
  - If you are tracking 10,000s of services you don’t want processor heavy graphics as well
- Escalation & Alerting take time
  - Firstly to get right with your organisation
  - Secondly to actually physically do!
- Suppliers go out of their way to make it difficult
  - Don’t give in: there is always a way to get Nagios involved
  - Screen scrape, email, telnet, RS232 are all possible
- SNMP is your friend
  - When in doubt use SNMP to help you out
  - SNMP V3 with AES cypher is suitably secure for most implementations