Adapting Condor for use with Energy

Towards a greener Condor pool: adapting Condor for use with energy-efficient PCs
Ian C. Smith
Overview

Quick description of the University of Liverpool Condor Pool

Power saving at Liverpool

A home-grown approach to dealing with power-saving PCs

Power management using Condor 7.4.X

Implementing Condor power management

Results

Future directions
University of Liverpool Condor Pool

Contains around 300 machines running the University’s Managed
Windows (XP, soon Windows 7) Service.

Most have 2.33 GHz Intel Core 2 processors with 2 GB RAM, 80 GB
disk, configured with two job slots / machine.

Single combined submit host / central manager running on Sun
V445 SMP server.

Currently running Condor 7.0.2 on execute hosts (moving to 7.2.x
soon).

Policy is to run jobs during office hours only after at least 5 minutes of
inactivity and with a low load average, and at any time outside office
hours

Jobs are killed rather than suspended
Power saving at Liverpool

We have around 2 000 centrally managed PCs across campus
which were powered up overnight, at weekends and during
vacations.

The original power-saving policy was to “power off” machines after 30
minutes of inactivity; we now hibernate them after 15 minutes of
inactivity

Policy has reduced wasteful inactivity time by ~ 200 000 – 250 000
hours per week (equivalent to 20-25 MWh) leading to an estimated
saving of approx. £125 000 p.a.

Makes extensive use of PowerMAN system from Data Synergy
comprising:
  - a service which forces machines into a low-power state and reports machine activity to the Management Reporting Platform
  - the Management Reporting Platform: a central server from which usage statistics can be retrieved and viewed via a web browser
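
As a back-of-envelope check on the savings figures quoted above, the short Python sketch below works out what they imply. The per-PC power draw and the electricity price are not stated on the slide; they are simply derived here from the quoted hours, MWh and annual saving, assuming 52 weeks per year.

```python
# Figures quoted on the slide.
hours_saved_per_week = (200_000, 250_000)   # lower and upper estimate
energy_saved_mwh = (20, 25)                 # matching lower and upper estimate
annual_saving_gbp = 125_000

# Implied average draw of an idle PC in watts (MWh -> Wh, divided by hours).
implied_watts = [mwh * 1_000_000 / hours
                 for mwh, hours in zip(energy_saved_mwh, hours_saved_per_week)]

# Implied electricity price in GBP per kWh (assumption: 52 weeks per year).
implied_price_per_kwh = [annual_saving_gbp / (mwh * 1_000 * 52)
                         for mwh in energy_saved_mwh]

print(implied_watts)
print(implied_price_per_kwh)
```

Both ends of the range correspond to the same implied draw per idle PC, which suggests the MWh figure was obtained by multiplying the saved hours by a fixed per-machine wattage.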
[Figure: typical monthly Condor activity]
A home-grown approach to power management

Two main problems to deal with:
  - how to ensure Condor jobs are not evicted by hibernating PCs
  - how to wake up dormant PCs to run Condor jobs on demand

PowerMAN service prevents job eviction:
  - PowerMAN can be given a list of “protected programs”; the machine remains active while any of them is running
  - the condor_starter process is included as a protected program (it is only present while a Condor job is running)

Wake-on-LAN (“WoL”) used to bring hibernating machines back to full
power:
  - NICs must remain powered up during hibernation
  - NICs must be capable of waking machines on receipt of a “magic packet”
  - the network must be able to route “magic packets” (not a problem for us, but YMMV)
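
For readers unfamiliar with the “magic packet”, here is a minimal Python sketch of how one is built and sent. The payload layout (six 0xFF bytes followed by the target MAC address repeated 16 times) is the standard Wake-on-LAN format; the function names, the UDP port 9 and the limited-broadcast address are illustrative assumptions, not details taken from the Liverpool setup.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Return the 102-byte Wake-on-LAN payload for a MAC such as '00:11:22:33:44:55'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    # Six 0xFF bytes, then the MAC address repeated 16 times.
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the packet over UDP (address and port are assumed defaults)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))
```

Whether such a broadcast reaches a machine on another subnet depends on the routing caveat in the last bullet above.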
Adapting Condor for use with power-saving PCs

A cron job on the submit host periodically examines the state of the
queue (condor_status -schedd) and the pool (condor_status)

If there are more idle jobs in the queue than Unclaimed machines, hibernating machines need to be woken up:
  - find the number of powered-up machines in each “teaching centre” (classroom)
  - estimate the number of hibernating machines in each teaching centre from the total number of machines in each
  - sort the centres from highest number of available machines to lowest
  - wake up centres in turn until enough machines have been woken to meet the demand (or all centres have been woken)

MAC addresses of machines are stored in files sorted according to
teaching centre (needed for Wake-on-LAN)
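
The selection steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function and parameter names are hypothetical, “available machines” is read as the estimated number of hibernating machines in a centre, and job slots per machine are ignored.

```python
from typing import Dict, List


def centres_to_wake(idle_jobs: int, unclaimed: int,
                    total_per_centre: Dict[str, int],
                    powered_up_per_centre: Dict[str, int]) -> List[str]:
    """Pick teaching centres to wake until the shortfall of machines is covered."""
    shortfall = idle_jobs - unclaimed
    if shortfall <= 0:
        return []  # enough Unclaimed machines already; nothing to wake

    # Estimate hibernating machines as total minus currently powered-up.
    hibernating = {centre: max(total - powered_up_per_centre.get(centre, 0), 0)
                   for centre, total in total_per_centre.items()}

    # Assumption: centres with the most hibernating machines are woken first.
    ordered = sorted(hibernating, key=hibernating.get, reverse=True)

    chosen: List[str] = []
    woken = 0
    for centre in ordered:
        if woken >= shortfall:
            break
        if hibernating[centre] == 0:
            continue  # nothing to wake in this centre
        chosen.append(centre)
        woken += hibernating[centre]
    return chosen
```

In the approach described, each chosen centre would then be woken by sending a magic packet to every MAC address listed in that centre's file.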





Problems with the home-grown approach

Assumes that any job can run on any machine:
  - users cannot choose particular teaching centres or machines in their job Requirements
  - ideally, the pool needs to be homogeneous
  - errors in the Requirements specification can cause severe problems (machines repeatedly wake up then hibernate again)
  - the cron job includes a “sanity check” for this

Can only estimate number of hibernating machines in each centre

Same machines get woken up first
Power management in Condor 7.4.X

Condor daemons can now place an execute host in a low-power
state according to a given policy

An execute host signals to the Condor central manager that it is
about to enter a low-power state

Central manager records persistent offline ClassAds for
hibernating machines

Negotiator can perform matchmaking with offline ClassAds

Matches are passed to condor_rooster

condor_rooster pipes information to condor_power which wakes
up machines using WoL
Implementing Condor power management

Still use PowerMAN to power-down inactive PCs rather than using
Condor

Need a way of advertising available offline machines to the
condor_collector

If we know which machines are currently active (A) and which
machines make up the pool in total (P), then the offline machines
form the subset O = P – A

cron periodically advertises the offline machines and updates the
timestamps (ClockMin / ClockDay)

Finding P (the total set of machines which are out there) turns out
to be a very difficult problem
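
A minimal Python sketch of the O = P – A step and of the timestamp refresh follows. The helper names are hypothetical. The ClockMin and ClockDay values are computed as we understand the corresponding Condor machine ClassAd attributes (minutes since midnight, and day of week with 0 = Sunday); treat that reading as an assumption to check against the Condor manual.

```python
from datetime import datetime
from typing import Dict, Iterable, Set


def offline_machines(pool: Iterable[str], active: Iterable[str]) -> Set[str]:
    """Return O = P - A: machines in the pool that are not currently active."""
    return set(pool) - set(active)


def clock_attributes(now: datetime) -> Dict[str, int]:
    """Timestamp attributes to refresh in each offline ad (assumed semantics)."""
    return {
        "ClockMin": now.hour * 60 + now.minute,   # minutes since midnight
        "ClockDay": (now.weekday() + 1) % 7,      # Python Monday=0 -> Sunday=0
    }


pool = {"pc01", "pc02", "pc03", "pc04"}      # hypothetical hostnames
active = {"pc02", "pc04"}
print(sorted(offline_machines(pool, active)))
```

The hard part, as the last bullet above notes, is obtaining a trustworthy P in the first place; the set arithmetic itself is trivial.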
How do we determine which machines are available to Condor?

Try waking them up!

Wake up all machines in each teaching centre once a week using
WoL

After wakeup call, wait a few minutes and test each machine in
turn with:
condor_status -direct <hostname>

Sanity check similar to UNIX ping

Record which machines respond and publish ClassAds for them
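
The weekly availability test could be scripted roughly as below. This is a hedged sketch, not the script used at Liverpool: the function names are hypothetical, and treating “exit status 0 with non-empty output” as a successful response is an assumption about how condor_status -direct behaves for a live machine.

```python
import subprocess
from typing import Callable, Iterable, List


def responds(hostname: str, timeout: float = 30.0) -> bool:
    """Query one machine directly; True if it appears to answer."""
    try:
        result = subprocess.run(["condor_status", "-direct", hostname],
                                capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return False  # condor_status missing, or the host never answered
    # Assumption: a live startd yields exit status 0 and some output.
    return result.returncode == 0 and bool(result.stdout.strip())


def responding_machines(hostnames: Iterable[str],
                        probe: Callable[[str], bool] = responds) -> List[str]:
    """Test each machine in turn and keep those that respond."""
    return [host for host in hostnames if probe(host)]
```

The probe is passed in as a parameter so the loop can be exercised without a Condor installation, for example with a stub that returns True for a fixed set of names.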
Unforeseen problems

Not all woken-up machines begin to run jobs
  - the number of wakeups is limited by our “roll-your-own” version of condor_power

condor_rooster originally attempted to wake up all offline machines which
matched job requirements
  - included another limit in our condor_power script (number of wakeups must be < number of idle jobs)
  - Condor 7.4.3 should fix this; 7.5.3 adds the ROOSTER_MAX_UNHIBERNATE configuration option

Wanted to wake up machines in random order so that the same machines are not used
repeatedly
  - found that condor_negotiator ignored Rank values
  - used the condor_power script to implement this (“shuffles the deck”)
  - should be fixed in 7.5.3 using the ROOSTER_UNHIBERNATE_RANK config option
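
The two workarounds described above (capping the number of wakeups and “shuffling the deck”) might look something like this in Python. The function name and signature are hypothetical, and the cap is read loosely as at most one wakeup per idle job, whereas the slide states a strict “<”.

```python
import random
from typing import List, Optional, Sequence


def choose_wakeups(candidates: Sequence[str], idle_jobs: int,
                   rng: Optional[random.Random] = None) -> List[str]:
    """Pick which offline machines to wake, in random order, capped by demand."""
    rng = rng or random.Random()
    shuffled = list(candidates)      # copy so the caller's list is untouched
    rng.shuffle(shuffled)            # random order: avoid always waking the same PCs
    return shuffled[:max(idle_jobs, 0)]   # assumed cap: one wakeup per idle job
```

Passing a seeded `random.Random` makes the choice reproducible for testing; in production the default unseeded generator gives a different order on each run.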
Unforeseen problems / cont’d

Condor continued to wake up machines after jobs were removed (or completed)

Use
Unhibernate = CurrentTime - MachineLastMatchTime < 300
not
Unhibernate =!= Undefined

Difficult to distinguish Unclaimed offline machines from online ones in
condor_status:

to see all offline machines
$ condor_status -constraint Offline==True

to see all powered-up machines
$ condor_status -constraint Offline=!=True
Also difficult to distinguish in Condor View graphs
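
To illustrate why the time-windowed Unhibernate expression recommended on this slide stops the unwanted wakeups, the Python sketch below mimics its evaluation. It is a model of the logic only, not Condor code: the function name is hypothetical, and treating an undefined MachineLastMatchTime as “do not wake” is an assumption.

```python
from typing import Optional


def unhibernate(current_time: int, machine_last_match_time: Optional[int],
                window: int = 300) -> bool:
    """Model of: Unhibernate = CurrentTime - MachineLastMatchTime < 300."""
    if machine_last_match_time is None:
        return False  # never matched (assumed to mean: leave the machine asleep)
    return current_time - machine_last_match_time < window


print(unhibernate(1000, 900))    # matched 100 s ago
print(unhibernate(1000, 600))    # matched 400 s ago
print(unhibernate(1000, None))   # never matched
```

Once a job has been removed or has completed, the machine stops being matched, the last match time ages past the 300 second window, and the wakeup request lapses. An expression that only tests whether a value is defined never lapses in this way.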
Results – wakeup test
Future Directions

Condor power management will allow us to expand the pool to include
even low-spec machines

If machines are not needed or are unsuitable they need not be woken up

Rank can be used so that newer (more energy-efficient) machines are used
first

We would like a more accurate way of determining which machines are
available. One possible method:
  - record the amount of time since each machine last appeared in the pool and/or ran a job
  - confidence in waking a PC can be described by a monotonically decreasing function of this time
  - may still need to wake machines for testing occasionally

Encourage users to incorporate their own checkpointing code to reduce
“badput” and energy wastage (see Liverpool Condor website for details).
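
One way to picture the “monotonically decreasing function” proposed above is an exponential decay, sketched here in Python. The slide does not name a particular function; the exponential form, the one-week half-life and the function name are all assumptions chosen purely for illustration.

```python
def wake_confidence(hours_since_seen: float, half_life_hours: float = 168.0) -> float:
    """Confidence (0..1] that a machine is still available to be woken.

    Assumed form: confidence halves every `half_life_hours` (default one week)
    since the machine last appeared in the pool or ran a job.
    """
    if hours_since_seen < 0:
        raise ValueError("hours_since_seen must be non-negative")
    return 0.5 ** (hours_since_seen / half_life_hours)


for hours in (0, 24, 168, 336):
    print(hours, round(wake_confidence(hours), 3))
```

Machines whose confidence falls below some threshold could then be dropped from the offline ads or scheduled for one of the occasional test wakeups.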
Further Information
http://www.liv.ac.uk/e-science/condor
i.c.smith@liverpool.ac.uk