Keeping a Hawkeye on The Grid

Nick LeRoy
➢ Computer Sciences Department
➢ University of Wisconsin-Madison
➢ nleroy@cs.wisc.edu
➢ http://www.cs.wisc.edu/condor/hawkeye

www.cs.wisc.edu/condor/glidein

The Grid Idea
› Large scale distributed computing
› Solve massive computational problems
[Diagram: Grid]

The Grid Reality
› Sites go down
› Updates need to be synchronized
› Firewalls get in the way
› Human errors occur
› Separate administrative domains cause inconsistencies

The Emperor's New Grid
[Diagram: Grid]

Some Observations
› A lot of these problems can't be solved by technology alone
  ➢ Human errors
  ➢ Separate administrative domains
› Detecting problems is often quite difficult and time consuming

More Observations
› Can't fix problems before we're aware of them
› Impossible to prevent all classes of problems

So...
› Often more cost effective to detect & work around problems
  ➢ Even when prevention is possible
  ➢ Detection is always required
› Automation is our friend

Watching the Grid
› We need a monitoring system:
  ➢ That can automate detection
  ➢ That is flexible
  ➢ That is easy to deploy
  ➢ That can alert us of problems
    • In a timely manner

Hawkeye
› Hawkeye is a monitoring tool
› Designed for grid and distributed applications
  ➢ Provides automated detection
  ➢ Very flexible
  ➢ Is easy to deploy
  ➢ Provides timely alerts of problems

Watching the Grid
[Diagram: Grid]

Hawkeye Uses
› Can be used for:
  ➢ Monitoring system load, I/O, usage, etc.
  ➢ Watching for run-away processes
  ➢ Monitoring the health of your pool
  ➢ Watching the health of a grid site
  ➢ ...

Some Details about Hawkeye
› Distributed monitoring system
  ➢ Uses a push data model
› Built on Condor technology
  ➢ Uses ClassAds & match-making
  ➢ Every Condor has Hawkeye built-in
› Stable, production quality

Hawkeye UI
› Alert you when things go wrong:
  ➢ When virtually any condition is found
  ➢ When various problems are found:
    • My checkpoint server's disk is full
    • Joe has had a CVS lock for 20 minutes
› Help you visualize what's going on
  ➢ Plotting via RRDT

Why would I run Hawkeye?
› Make system administration easier
› Simplify pool maintenance
  ➢ Condor
  ➢ Other batch system
› Scalable solution

Hawkeye Architecture
[Diagram labels: Condor Pool; Hawkeye Monitoring Agent; Hawkeye Module; Hawkeye Manager]

[Diagram labels: Hawkeye Monitoring Agent (Hawkeye Startd); Hawkeye Job Manager; Hawkeye Job 1; Hawkeye Job 2; Hawkeye Module 1; ClassAd; Hawkeye Manager]

Hawkeye ClassAds
› Hawkeye uses Condor ClassAds to represent collected data
  ➢ Schema-free data representation
  ➢ Provides matching mechanism
  ➢ Represent whatever data you gather in a way that works best for you

Hawkeye ClassAds
› Example ClassAd “snippet”:
    RAM_MemFree = 841932800
    RAM_MemShared = 0
    RAM_MemTotal = 1055367168
    RAM_SwapCached = 0
    RAM_SwapFree = 2147483647
    RAM_SwapTotal = 2147483647

Hawkeye Modules
› Current library of modules monitors:
  ➢ Processes, CPU usage, etc.
  ➢ RAM, I/O, VM Statistics, etc.
  ➢ Disk space
  ➢ CVS repository
  ➢ GASS Cache statistics
  ➢ etc.
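
The example ClassAd “snippet” above is a flat list of attribute = value pairs, which makes module data easy to consume in a script. The following is a minimal Python sketch, assuming only the simple integer-valued `Name = value` form shown on the slide (real ClassAds also allow strings and expressions). The helper name `parse_flat_classad` is hypothetical and is not part of Hawkeye or Condor.

```python
def parse_flat_classad(text):
    """Parse 'Name = value' lines into a dict.

    Hypothetical helper: handles integer values only, as in the slide's snippet.
    """
    ad = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, _, value = line.partition("=")
        ad[name.strip()] = int(value.strip())
    return ad


# The attribute values below are copied from the example snippet.
SNIPPET = """
RAM_MemFree = 841932800
RAM_MemShared = 0
RAM_MemTotal = 1055367168
RAM_SwapCached = 0
RAM_SwapFree = 2147483647
RAM_SwapTotal = 2147483647
"""

ad = parse_flat_classad(SNIPPET)

# Derive a value that is not in the ad itself: fraction of RAM in use.
mem_used_fraction = 1 - ad["RAM_MemFree"] / ad["RAM_MemTotal"]
print(f"memory in use: {mem_used_fraction:.1%}")
```

Because the representation is schema-free, a new module can add attributes without any change to a consumer like this one; it simply sees more keys in the dictionary.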

Hawkeye and Condor
› Hawkeye has Condor-specific tools
› Developed to help us run our pool

Condor Node Module
› Runs on each node of the pool
› Watches the Condor daemons
› Monitors multiple virtual machines
› Can identify run-away or orphaned jobs / processes

Condor Pool Module
› Runs on just one host
› Reports overall pool health
› Watches for “absent” nodes
› Lots of data on:
  ➢ Job submitters
  ➢ Running jobs
  ➢ CPUs in the pool

Other Condor Modules
› Checkpoint server module
  ➢ Watch # of checkpoints, disk space, etc.
› Job history module
  ➢ Number and types of jobs, etc.

Custom Hawkeye Modules
› Hawkeye allows you to run your own custom “modules” to gather data
  ➢ Simple text to stdout
  ➢ Can be a shell “one-liner”
  ➢ Can be a 100-line Perl program
    • All current modules are in Perl
  ➢ Can be a 10k-line C program

Hawkeye Alerts
› Hawkeye allows you to set your own custom “alerts”
  ➢ On attributes generated by standard and/or custom modules
  ➢ Flexible, uses ClassAd match-making
  ➢ Used to generate dynamic web pages

Hawkeye Matchmaking
› Hawkeye alerts are done using ClassAd match-making.
[Diagram labels: Machine Ad; Match; Alert Trigger Ad]

Sample Alert Trigger
    [
      AlertTrigger = ( MyType == "Pool" && Absent.count > 5 );
      AlertSeverity = ( Absent.count > 5 ) ? 1 : 0;
      Name = "Absent Nodes";
      AlertText = StrCat(Absent.count, " machines are missing in ", Name)
    ]

Advanced Trigger Tool
› ClassAd-based trigger system with state
  ➢ Example: Take some action if a machine has been heavily loaded for a certain amount of time
› Much more flexible:
  ➢ You specify the action to take
  ➢ Maintains state information

More on Advanced Trigger
› Both current and previous state can be used in generating a trigger
  ➢ Example: Send me an email when the system has been heavily loaded for a specified time, but don't flood my inbox with them...

Hawkeye Extras
› Currently available:
  ➢ Tool to “set up” a Condor to easily install & run Hawkeye modules
› In development:
  ➢ Grid Exerciser module
  ➢ Data plotting tool

Hawkeye at UW
› Currently at UW CS, we're using Hawkeye extensively:
  ➢ To monitor our 1400-CPU Condor cluster
  ➢ To aid in detecting and correcting cluster problems
  ➢ Hawkeye is one of our main tools for pool administration

What is the status of Hawkeye?
› Version 1.0 Release Candidate 5 “RC5”
  ➢ Version 1.0 “real soon”
› Available from http://cs.wisc.edu/condor/hawkeye
› Get help:
  ➢ Condor: condor-admin@cs.wisc.edu
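
The Advanced Trigger slides describe a trigger that uses both current and previous state: alert when a machine has been heavily loaded for a specified time, without flooding the inbox. The idea can be sketched in Python as below. This is an illustration only, not the Advanced Trigger Tool itself (which is ClassAd based); the class name `LoadTrigger`, its parameters, and the sample numbers are all hypothetical.

```python
class LoadTrigger:
    """Fire when load stays above a threshold for hold_secs,
    but at most once every quiet_secs (hypothetical sketch)."""

    def __init__(self, threshold, hold_secs, quiet_secs):
        self.threshold = threshold
        self.hold_secs = hold_secs
        self.quiet_secs = quiet_secs
        self.high_since = None  # previous state: when the load first went high
        self.last_fired = None  # previous state: when we last alerted

    def update(self, now, load):
        """Feed one (time, load) sample; return True if an alert is due now."""
        if load <= self.threshold:
            self.high_since = None  # load recovered, forget the streak
            return False
        if self.high_since is None:
            self.high_since = now  # start of a new high-load streak
        if now - self.high_since < self.hold_secs:
            return False  # not loaded for long enough yet
        if self.last_fired is not None and now - self.last_fired < self.quiet_secs:
            return False  # alerted recently, don't flood the inbox
        self.last_fired = now
        return True


# Made-up samples: (seconds since start, load average).
trigger = LoadTrigger(threshold=4.0, hold_secs=600, quiet_secs=3600)
samples = [(0, 5.0), (300, 5.5), (600, 6.0), (900, 6.2), (1200, 1.0), (1500, 7.0)]
fired = [t for t, load in samples if trigger.update(t, load)]
print(fired)
```

With these samples the trigger fires once, at the 600-second sample: the next high sample falls inside the quiet period, and the final one starts a new streak that has not yet lasted long enough.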