Keeping a Hawkeye on The Grid

Nick LeRoy
➢ Computer Sciences Department
➢ University of Wisconsin-Madison
➢ nleroy@cs.wisc.edu
➢ http://www.cs.wisc.edu/condor/hawkeye
www.cs.wisc.edu/condor/glidein
The Grid Idea
› Large-scale distributed computing
› Solve massive computational problems
The Grid Reality
› Sites go down
› Updates need to be synchronized
› Firewalls get in the way
› Human errors occur
› Separate administrative domains cause
inconsistencies
The Emperor's New Grid
Some Observations
› A lot of these problems can't be
solved by technology alone
➢ Human errors
➢ Separate administrative domains
› Detecting problems is often quite
difficult and time-consuming
More Observations
› Can't fix problems before we're aware
of them
› Impossible to prevent all classes of
problems
So...
› Often more cost-effective to detect &
work around problems
➢ Even when prevention is possible
➢ Detection is always required
› Automation is our friend
Watching the Grid
› We need a monitoring system:
➢ That can automate detection
➢ That is flexible
➢ That is easy to deploy
➢ That can alert us to problems
• In a timely manner
Hawkeye
› Hawkeye is a monitoring tool
› Designed for grid and distributed
applications
➢ Provides automated detection
➢ Very flexible
➢ Is easy to deploy
➢ Provides timely alerts of problems
Watching the Grid
Hawkeye Uses
› Can be used for:
➢ Monitoring system load, I/O, usage, etc.
➢ Watching for run-away processes
➢ Monitoring the health of your pool
➢ Watching the health of a grid site
➢ ...
Some Details about Hawkeye
› Distributed monitoring system
➢ Uses a push data model
› Built on Condor technology
➢ Uses ClassAds & match-making
➢ Every Condor has Hawkeye built-in
› Stable, production quality
Hawkeye UI
› Alert you when things go wrong:
➢ When virtually any condition is found
➢ When various problems are found:
• My checkpoint server's disk is full
• Joe has had a CVS lock for 20 minutes
› Help you visualize what's going on
➢ Plotting via RRDT
Why would I run Hawkeye?
› Make system administration easier
› Simplify pool maintenance
➢ Condor
➢ Other batch systems
› Scalable solution
Hawkeye Architecture
[Diagram labels: Hawkeye Monitoring Agent (x2), Hawkeye Module (x3), Condor Pool, Hawkeye Manager]
[Diagram labels: Grid, Hawkeye Monitoring Agent (Hawkeye Startd), Hawkeye Job Manager, Hawkeye Manager, Hawkeye Job 1, Hawkeye Job 2, Hawkeye Module 1, ClassAd]
Hawkeye ClassAds
› Hawkeye uses Condor ClassAds to
represent collected data
➢ Schema-free data representation
➢ Provides matching mechanism
➢ Represent whatever data you gather in a
way that works best for you
Hawkeye ClassAds
› Example ClassAd “snippet”:
RAM_MemFree = 841932800
RAM_MemShared = 0
RAM_MemTotal = 1055367168
RAM_SwapCached = 0
RAM_SwapFree = 2147483647
RAM_SwapTotal = 2147483647
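
A script that consumes attributes like these might read them back into a dictionary. The following is a minimal Python sketch, assuming one `Name = value` pair per line with integer values as in the snippet above. `parse_flat_classad` is a hypothetical helper, not part of Hawkeye or Condor, and real ClassAds also allow expressions that it does not handle.

```python
def parse_flat_classad(text):
    """Parse 'Name = value' lines into a dict (integers only; hypothetical helper)."""
    attrs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, _, value = line.partition("=")
        attrs[name.strip()] = int(value.strip())
    return attrs


SNIPPET = """\
RAM_MemFree = 841932800
RAM_MemTotal = 1055367168
RAM_SwapFree = 2147483647
RAM_SwapTotal = 2147483647
"""

ad = parse_flat_classad(SNIPPET)
# Fraction of physical memory currently free, derived from two attributes.
free_fraction = ad["RAM_MemFree"] / ad["RAM_MemTotal"]
print(f"{free_fraction:.1%} of RAM free")
```

Because the representation is schema-free, the parser needs no prior knowledge of which attribute names a module emits.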
Hawkeye Modules
› Current library of modules monitors:
➢ Processes, CPU usage, etc.
➢ RAM, I/O, VM Statistics, etc.
➢ Disk space
➢ CVS repository
➢ GASS Cache statistics
➢ etc.
Hawkeye and Condor
› Hawkeye has Condor-specific tools
› Developed to help us run our pool
Condor Node Module
› Runs on each node of the pool
› Watches the Condor daemons
› Monitors multiple virtual machines
› Can identify run-away or orphaned
jobs / processes
Condor Pool Module
› Runs on just one host
› Reports overall pool health
› Watches for “absent” nodes
› Lots of data on:
➢ Job Submitters
➢ Running Jobs
➢ CPUs in the pool
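
One way to picture the “absent” node check is as a comparison between the machines expected in the pool and those that reported recently. This Python sketch works under that assumption and is not the Pool Module's actual logic; the host names, `find_absent`, and the timeout value are made up for illustration.

```python
def find_absent(expected, last_seen, now, timeout=900):
    """Return expected hosts that have not reported within `timeout` seconds."""
    absent = []
    for host in sorted(expected):
        seen = last_seen.get(host)
        # A host is absent if it never reported or its last report is too old.
        if seen is None or now - seen > timeout:
            absent.append(host)
    return absent


expected_hosts = {"node1", "node2", "node3"}
reports = {"node1": 1000, "node2": 50}  # host -> time of last report (seconds)
print(find_absent(expected_hosts, reports, now=1100))
```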
Other Condor Modules
› Checkpoint server module
➢ Watch # of checkpoints, disk space,
etc.
› Job history module
➢ Number and types of jobs, etc.
Custom Hawkeye Modules
› Hawkeye allows you to run your own
custom “modules” to gather data
➢ Simple text to stdout
➢ Can be a shell “One liner”
➢ Can be a 100-line Perl program
• All current modules are in Perl
➢ Can be a 10k-line “C” program
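
To show how small a module can be, here is a sketch in Python that writes disk attributes as simple text to stdout. It assumes `Name = value` lines like those in the ClassAd snippet shown earlier. The exact output format Hawkeye expects, the `DISK_` prefix, and the function names are assumptions, so check the Hawkeye documentation before relying on it.

```python
import shutil
import sys


def disk_attributes(path="/", prefix="DISK_"):
    """Collect disk usage for `path` as a dict of attribute name -> bytes."""
    usage = shutil.disk_usage(path)
    return {
        prefix + "Total": usage.total,
        prefix + "Used": usage.used,
        prefix + "Free": usage.free,
    }


def emit(attrs, out=sys.stdout):
    """Write one 'Name = value' line per attribute (assumed format)."""
    for name, value in sorted(attrs.items()):
        out.write(f"{name} = {value}\n")


if __name__ == "__main__":
    emit(disk_attributes())
```

Keeping collection (`disk_attributes`) separate from output (`emit`) makes it easy to change the output format once the expected one is confirmed.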
Hawkeye Alerts
› Hawkeye allows you to set your own
custom “alerts”
➢ On attributes generated by
standard and/or custom modules
➢ Flexible, uses ClassAd Match-making
➢ Used to generate dynamic web pages
Hawkeye Matchmaking
› Hawkeye alerts are done using
ClassAd match-making.
[Diagram: Machine Ad + Trigger Ad → Match → Alert]
Sample Alert Trigger
[
  AlertTrigger = ( MyType == "Pool" && Absent.count > 5 );
  AlertSeverity = ( Absent.count > 5 ) ? 1 : 0;
  Name = "Absent Nodes";
  AlertText = StrCat(Absent.count,
                     " machines are missing in ",
                     Name)
]
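
To make the trigger's logic concrete, the following Python sketch mirrors the same three expressions against a plain dictionary standing in for a pool ad. It only imitates the ClassAd semantics; Hawkeye itself evaluates the ClassAd through match-making. `evaluate_trigger` and the ad layout (`AbsentCount`, `PoolName`) are hypothetical. Which `Name` the `StrCat` call resolves to is not clear from the slide, so the sketch assumes the pool's name.

```python
def evaluate_trigger(ad, threshold=5):
    """Return (fired, severity, text) for a dict standing in for a pool ad."""
    absent = ad.get("AbsentCount", 0)
    # Mirrors AlertTrigger: only pool ads with too many absent nodes fire.
    fired = ad.get("MyType") == "Pool" and absent > threshold
    # Mirrors AlertSeverity: 1 above the threshold, otherwise 0.
    severity = 1 if absent > threshold else 0
    # Mirrors AlertText, using the pool name (an assumption).
    text = f"{absent} machines are missing in {ad.get('PoolName', 'unknown')}"
    return fired, severity, text


pool_ad = {"MyType": "Pool", "PoolName": "example-pool", "AbsentCount": 7}
fired, severity, text = evaluate_trigger(pool_ad)
if fired:
    print(f"[severity {severity}] {text}")
```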
Advanced Trigger Tool
› ClassAd-based trigger system with state
➢ Example: Take some action if a
machine has been heavily loaded for a
certain amount of time
› Much more flexible:
➢ You specify the action to take
➢ Maintains state information
More on Advanced Trigger
› Both current and previous state can be
used in generating a trigger
➢ Example: Send me an email when the
system has been heavily loaded for a
specified time, but don't flood my
inbox with them...
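
The email example can be sketched as a small state machine that remembers when the load first went high and when it last sent a notification. This is an illustration in Python rather than the Advanced Trigger Tool's own syntax; the class name, threshold, and intervals are all assumptions.

```python
class LoadTrigger:
    """Fire when load stays high for `hold` seconds, at most once per `quiet`."""

    def __init__(self, threshold=4.0, hold=600, quiet=3600):
        self.threshold = threshold
        self.hold = hold
        self.quiet = quiet
        self.high_since = None   # previous state: when load first went high
        self.last_fired = None   # previous state: when we last alerted

    def update(self, load, now):
        """Feed one sample; return True if an alert should be sent now."""
        if load < self.threshold:
            self.high_since = None  # load recovered, reset the timer
            return False
        if self.high_since is None:
            self.high_since = now
        if now - self.high_since < self.hold:
            return False            # not loaded for long enough yet
        if self.last_fired is not None and now - self.last_fired < self.quiet:
            return False            # alerted recently, stay quiet
        self.last_fired = now
        return True


trigger = LoadTrigger()
samples = [(0, 5.0), (300, 5.5), (600, 6.0), (900, 6.1), (4200, 6.3)]
for now, load in samples:
    if trigger.update(load, now):
        print(f"t={now}s: load {load} has been high for a while")
```

The `quiet` interval is what keeps the inbox from flooding: the condition can stay true on every sample, but the action runs only when enough time has passed since the last alert.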
Hawkeye Extras
› Currently available:
➢ Tool to “set up” a Condor to easily
install & run Hawkeye modules
› In development:
➢ Grid Exerciser module
➢ Data plotting tool
Hawkeye at UW
› Currently at UW CS, we're using
Hawkeye extensively:
➢ To monitor our 1400-CPU Condor cluster
➢ To aid in detecting and correcting cluster
problems
➢ Hawkeye is one of our main tools for pool
administration
What is the status of
Hawkeye?
› Version 1.0 Release Candidate 5 “RC5”
➢ Version 1.0 “real soon”
› Available from
http://cs.wisc.edu/condor/hawkeye
› Get help:
➢ Condor: condor-admin@cs.wisc.edu