What’s New in Condor

Todd Tannenbaum
Computer Sciences Department
University of Wisconsin-Madison
condor-admin@cs.wisc.edu
http://www.cs.wisc.edu/condor
Overview
Quick ‘sound bites’ on new functionality in recent Condor releases
› Condor Development Process
› New Features in Condor version 6.6.x
› New Features in Condor version 6.7.0
www.cs.wisc.edu/condor
Condor Development Process
› We maintain two different releases
at all times
Stable Series
• Second digit is even: e.g. 6.2.2, 6.4.7, 6.6.3
Development Series
• Second digit is odd: e.g. 6.5.1, 6.7.2
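The even/odd numbering rule above can be checked mechanically. The following is a minimal sketch; the function name `release_series` is made up for this example and is not part of Condor.

```python
def release_series(version: str) -> str:
    """Classify a Condor version string such as '6.6.3' by its second digit."""
    parts = version.split(".")
    if len(parts) < 2 or not parts[1].isdigit():
        raise ValueError(f"not a dotted version string: {version!r}")
    second_digit = int(parts[1])
    # Even second digit: stable series. Odd second digit: development series.
    return "stable" if second_digit % 2 == 0 else "development"


if __name__ == "__main__":
    for v in ("6.2.2", "6.4.7", "6.6.3", "6.5.1", "6.7.2"):
        print(v, release_series(v))
```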
Stable Series
› Heavily tested
› Runs on our department production pool of nearly 1,000 CPUs (for min of 3 weeks)
› No new features, only bugfixes and ports.
› A given stable release is always compatible with other releases from the same series
 6.6.X is compatible with 6.6.Y
› Recommended for production pools
Development Series
› Less heavily tested
› Runs on our small(er) test pool.
› New features and new technology are
added frequently
› Versions from the same development
series are not guaranteed compatible
with each other (although we try
hard)
New in version 6.6.x
› Version 6.6.0 released in November 2003.
› Current release: version 6.6.7, to be released in October 2004.
The Struggle to Build Condor
› Condor is BIG
 Condor code consists of primary
source plus ‘externals’.
• Externals include Kerberos, zlib,
GSI, PVM, gSOAP…
• Patches to externals
 Current shipped source + externals: ~415MB of source, or ~9 million lines!
 Building Condor outside of UW-Madison used to be very difficult.
• “LIST OF SHAME”: Build pointed to packages on UW-Madison fileservers.
Now Condor Source
“Self-Contained”
› Source code for the externals is now bundled w/ Condor itself.
 Self-contained
 Allows version control on externals + patches
› Build w/ just “configure; make” !
 Checks for existence and proper version of all
“bootstrap” requirements, such as the compiler
 Applies our patches to the externals
 All 9 million lines built and bundled
Building Condor
[Figure: Building Condor before Version 6.6.0…]
[Figure: Building Condor post Version 6.6.0!]
Condor + NMI
› NMI = NSF Middleware Initiative
› Automated build and test infrastructure built on top of Condor
 Pool of 37 machines of many
architectures
 Scalable
 Runs every night, builds
several Condor source
branches, then runs 114 test
programs.
 All results stored in RDBMS,
reported on the web.
 Yes, Condor builds Condor!
Ports
› New ports in v6.6.x vs. v6.4.x:
 Solaris 9
 RedHat Linux 8.x, 9.x for x86 (+RPMs)
 RedHat Linux 7.x and SUSE 8.0 for IA64 (clipped)
 Tru64 5.1 (clipped)
 AIX 5.2 (clipped)
 Mac OS X (clipped)
Some new components
› Computing On Demand (COD)
› Integration of “Hawkeye” technology
› Condor-G Additions
 Matchmaking
 Grid Monitor
 GridShell
Computing On Demand (COD)
› Introduce effective timesharing to a distributed system
 Batch applications often want sustained throughput for a long period of time
 Interactive applications often want a quick burst of CPU power for a short period of time
 COD: Allow both to co-exist
HawkEye Technology
› Dynamic Resource Monitoring, now ‘built-in’
to Condor.
 Allows custom dynamic attributes to be added
into machine classads.
 These attributes can be used for
• Queries
• Scheduling
 Many plugins available.
• Disk space, memory used, network errors, open
files/descriptors, process monitoring, users, …
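A monitoring plugin of the kind listed above might look like the following minimal sketch. It assumes a plugin reports its values by printing `Name = value` lines that are then merged into the machine ClassAd. The attribute names `ExampleDiskFreeMB` and `ExampleDiskTotalMB` are hypothetical and not taken from any shipped plugin.

```python
import shutil


def disk_attributes(path: str = "/") -> list:
    """Return ClassAd-style 'Name = value' lines describing disk space.

    The attribute names are hypothetical, chosen only for this example.
    """
    usage = shutil.disk_usage(path)
    free_mb = usage.free // (1024 * 1024)
    total_mb = usage.total // (1024 * 1024)
    return [
        f"ExampleDiskFreeMB = {free_mb}",
        f"ExampleDiskTotalMB = {total_mb}",
    ]


if __name__ == "__main__":
    # Assumed reporting convention: one attribute per line on stdout.
    for line in disk_attributes():
        print(line)
```

Once such attributes are in the machine ClassAd, they could be referenced in queries or in scheduling expressions like any other attribute.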
Condor-G
› Condor-G Matchmaking
 Condor-G can determine which grid site to
utilize via ClassAd matchmaking (grid planning,
meta scheduling, …)
› Condor-G Grid Monitor
 Reduces the load on a GT2-based gatekeeper, greatly increasing the number of jobs that can be submitted
› Condor-G GridShell
 A wrapper for the job
 Reports exit status, cpu utilization, more
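To illustrate the idea behind choosing a grid site by matchmaking, here is a toy sketch in Python. It is not the ClassAd language and not Condor-G's implementation: the site records, attribute names, and the `match` function are all invented for this example. The job supplies a requirements test and a rank function, and the best-ranked acceptable site wins.

```python
from typing import Callable, List, Optional

# Hypothetical site descriptions; in Condor these would be ClassAds.
sites = [
    {"Name": "site-a", "Arch": "INTEL", "FreeCpus": 4, "QueueLength": 10},
    {"Name": "site-b", "Arch": "INTEL", "FreeCpus": 16, "QueueLength": 2},
    {"Name": "site-c", "Arch": "IA64", "FreeCpus": 32, "QueueLength": 0},
]


def match(
    requirements: Callable[[dict], bool],
    rank: Callable[[dict], float],
    candidates: List[dict],
) -> Optional[dict]:
    """Return the acceptable site with the highest rank, or None."""
    acceptable = [site for site in candidates if requirements(site)]
    if not acceptable:
        return None
    return max(acceptable, key=rank)


# The job only accepts INTEL sites with at least 2 free CPUs and
# prefers sites with many free CPUs and short queues.
best = match(
    lambda s: s["Arch"] == "INTEL" and s["FreeCpus"] >= 2,
    lambda s: s["FreeCpus"] - s["QueueLength"],
    sites,
)

if __name__ == "__main__":
    print(best["Name"] if best else "no match")
```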
Improvements in Condor for
Windows
› Ability to run SCHEDULER universe jobs
 Including DAGMan
› JAVA universe support
› More Win32 flavors, incl. international versions.
› Added support for encryption on disk of the job and data files on execute machine.
› v6.6.6: Many issues fixed w/ signaling jobs
› v6.6.7: Support for SP2
New Features in DAGMan
› DAGMan previously required that all
jobs in a DAG share one log file
› Each job can now have its own log file
› Understands XML formatted logs
› Can draw a graphical representation of your DAG
 Uses GraphViz, http://www.graphviz.org/
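As a rough illustration of what drawing a DAG with GraphViz involves, the sketch below turns the `JOB` and `PARENT ... CHILD` lines of a DAG description into GraphViz DOT text. This is not how DAGMan itself produces its drawing. The function name `dag_to_dot` and the sample job and file names are assumptions made for this example, and every other kind of line is ignored.

```python
def dag_to_dot(dag_text: str) -> str:
    """Convert JOB and PARENT/CHILD lines of a DAG description to DOT."""
    nodes, edges = [], []
    for raw in dag_text.splitlines():
        words = raw.split()
        if not words or words[0].startswith("#"):
            continue  # skip blank lines and comments
        keyword = words[0].upper()
        if keyword == "JOB" and len(words) >= 3:
            nodes.append(words[1])
        elif keyword == "PARENT":
            upper = [w.upper() for w in words]
            if "CHILD" in upper:
                split = upper.index("CHILD")
                parents, children = words[1:split], words[split + 1:]
                # Every parent gets an edge to every child.
                edges += [(p, c) for p in parents for c in children]
    lines = ["digraph dag {"]
    lines += [f'  "{n}";' for n in nodes]
    lines += [f'  "{p}" -> "{c}";' for p, c in edges]
    lines.append("}")
    return "\n".join(lines)


# Hypothetical three-job DAG: A must finish before B and C start.
example = """
JOB A a.submit
JOB B b.submit
JOB C c.submit
PARENT A CHILD B C
"""

if __name__ == "__main__":
    print(dag_to_dot(example))
```

The resulting text can be saved to a file and rendered with the GraphViz `dot` tool.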
Central Manager New
Features
› Central Manager daemons can now run on
any port
COLLECTOR_HOST = condor.cs.wisc.edu:9019
NEGOTIATOR_HOST = condor.cs.wisc.edu:9020
 Useful for firewall situations
 Allows multiple instances on one
machine
› Keeps statistics on missed updates
› Can use TCP instead of UDP, if you must
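For tools that need to read settings like the two above, the `host:port` form can be split as in this minimal sketch. The helper names `split_host_port` and `parse_settings` are hypothetical. No default port is assumed here; the caller passes one if wanted.

```python
from typing import Dict, Optional, Tuple


def split_host_port(
    value: str, default_port: Optional[int] = None
) -> Tuple[str, Optional[int]]:
    """Split 'host:port' into (host, port); the port is optional."""
    value = value.strip()
    host, sep, port = value.rpartition(":")
    if not sep:
        return value, default_port
    return host, int(port)


def parse_settings(text: str) -> Dict[str, Tuple[str, Optional[int]]]:
    """Parse 'NAME = host:port' lines into a dictionary."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip():
            settings[name.strip()] = split_host_port(value)
    return settings


config_text = """
COLLECTOR_HOST = condor.cs.wisc.edu:9019
NEGOTIATOR_HOST = condor.cs.wisc.edu:9020
"""

if __name__ == "__main__":
    for name, (host, port) in parse_settings(config_text).items():
        print(name, host, port)
```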
Command-line Tools
› ‘condor_update_stats’ tool to display information on any dropped central manager updates
› ‘condor_q -hold’ gives you a list of held jobs and the reason they were put on hold
› ‘condor_config_val -v’ tells you where (file and line number) an attribute is defined
› ‘condor_fetch_log’ will grab a log file from a remote machine:
 condor_fetch_log c2-15.cs.wisc.edu STARTD
› ‘condor_configure’ will install Condor via simple command-line switches, no questions asked
› ‘condor_vacate_job’ to release a resource by job id, and can be invoked by the job owner.
› ‘condor_wait’ blocks until a job or set of jobs completes
New 6.7.x Development
Series
› Release of v6.7.2 was in
April 04.
Big Picture
What do we want to achieve in a new Condor development series?
› Technology Transfer
 Building a bridge between the Condor production software development activity and the academic core research activity
 BAD-FS, Stork, Diskrouter, Parrot (transparent I/O), Schedd Glidein, VO Schedulers, HA, Management, Improved ClassAds…
What do we want to
achieve, cont?
› New Ports: Go to where the cycles are!
 The RedHat Dilemma
 Our porting ‘hopper’:
• AIX 5.1L on the PowerPC architecture
• RedHat AS server on x86
• Fedora Core on x86
• Fedora Core 2 on x86
• RedHat AS server on AMD64
• SuSE 8.0 on AMD64
• RedHat AS server on IA64
• HPUX 11.11 64-bit
What do we want to achieve,
cont.
› Improve existing ports
 Move “clipped wing” ports to full ports (w/ checkpoint, process migration)
• Mac OS X, Windows
 Better integration into environments
• Windows: operate better w/ DFS, use MSI
• Unix: operate w/ AFS
What do we want to achieve,
cont.
› Address changes in the computing landscape
 Firewalls, NATs
 64-bit operating systems
 Emphasis on data
 Movement towards standards such as WS, OGSA, …
V6.7 Themes
› Scalability
 Resources, jobs, matchmaking framework
› Accessibility
 APIs, more Grid middleware, network
› Availability
 Failover
High Availability in
v6.7.x
What happens if my
submit machine reboots?
Once upon a time, only one answer: job restarts.
Checkpoint?
No Checkpoint?
New: Job Progress continues
if connection is interrupted
› For Vanilla and Java universe jobs, Condor now supports reestablishment of the connection between the submitting and executing machines.
› To take advantage of this feature, put the following line into the job’s submit description file:
JobLeaseDuration = <N seconds>
For example:
JobLeaseDuration = 1200
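The idea of a lease can be modeled in a few lines. The sketch below is a simplified model, not Condor's implementation: it assumes the executing side keeps the job running as long as the submitting side renews the lease before the duration runs out. The class name `JobLease` and the timings in the example are invented.

```python
class JobLease:
    """Simplified lease: valid for `duration` seconds after each renewal."""

    def __init__(self, duration: float, now: float) -> None:
        self.duration = duration
        self.renewed_at = now

    def renew(self, now: float) -> None:
        """Called whenever the submitting machine is in contact."""
        self.renewed_at = now

    def expired(self, now: float) -> bool:
        """True once more than `duration` seconds passed without renewal."""
        return now - self.renewed_at > self.duration


if __name__ == "__main__":
    lease = JobLease(duration=1200, now=0)  # 1200 as in the example above
    # Connection lost right after t=0 and restored at t=900: still valid.
    print("reconnect at 900s, expired:", lease.expired(900))
    lease.renew(900)
    # Next outage lasts until t=2500, i.e. 1600s without renewal: too long.
    print("reconnect at 2500s, expired:", lease.expired(2500))
```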
What if the submission point
spontaneously explodes?
(don’t try this at home)
More High Availability
Solutions
› Condor can support a submit machine “hot spare”
 If your submit machine is down for longer than N minutes, a second machine can take over
› Two mechanisms available
 Job Mirroring
 High Availability Daemon Failover
• Just tell the condor_master to run ONE instance
Daemon Failover
[Diagram: Machine A and Machine B each run a Master and a SchedD. Labels: Active; Active (hot spare); Obtain Lock; Refresh Lock; Check Lock.]
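The obtain / refresh / check cycle named in the diagram can be sketched as follows. This is an illustration under stated assumptions, not Condor's implementation: the lock is an in-memory object standing in for something both machines can see, and the timeout value and the names `SharedLock` and `try_obtain` are invented.

```python
from typing import Optional


class SharedLock:
    """In-memory stand-in for a lock that both machines can see."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout          # assumed staleness limit, in seconds
        self.holder: Optional[str] = None
        self.refreshed_at = 0.0

    def is_stale(self, now: float) -> bool:
        """A lock nobody holds, or one not refreshed recently, is stale."""
        return self.holder is None or now - self.refreshed_at > self.timeout

    def try_obtain(self, who: str, now: float) -> bool:
        """Obtain the lock, or refresh it if `who` already holds it."""
        if self.holder == who or self.is_stale(now):
            self.holder = who
            self.refreshed_at = now
            return True
        return False


if __name__ == "__main__":
    lock = SharedLock(timeout=300)
    print("A obtains:", lock.try_obtain("A", now=0))
    print("B checks:", lock.try_obtain("B", now=100))    # A is active
    print("A refreshes:", lock.try_obtain("A", now=200))
    # Machine A goes down after t=200 and stops refreshing.
    print("B checks:", lock.try_obtain("B", now=400))    # not stale yet
    print("B checks:", lock.try_obtain("B", now=600))    # stale: B takes over
```

Whichever Master holds the lock would run the single SchedD instance; the other keeps checking.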
Accessibility
› Support for GCB
 Condor working w/ NATs, Firewalls
› Distributed Resource Management
Application API (DRMAA)
 GGF Working Group
 An API specification for the submission and
control of jobs to one or more Distributed
Resource Management (DRM) systems
 Condor DRMAA interface to appear in v6.7.0
SOAP/Grid Service
[Diagram labels: condor_schedd; Cedar; Web Service: SOAP, HTTPS]
New “Grid Universe”
› With the new Grid Universe, always specify a ‘gridtype’. So the old “globus” Universe is now declared as:
universe = grid
gridtype = gt2
› Other gridtypes? GT3 for OGSA-based Globus Toolkit 3
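Existing submit description files that still say `universe = globus` could be updated with a small script along these lines. This is a sketch only: the helper `to_grid_universe` and the sample file contents are hypothetical, and the keyword spelling follows the slide above.

```python
def to_grid_universe(submit_text: str, gridtype: str = "gt2") -> str:
    """Rewrite 'universe = globus' as the grid universe plus a gridtype."""
    out = []
    for line in submit_text.splitlines():
        name, sep, value = line.partition("=")
        is_universe = sep and name.strip().lower() == "universe"
        if is_universe and value.strip().lower() == "globus":
            out.append("universe = grid")
            out.append(f"gridtype = {gridtype}")
        else:
            out.append(line)  # leave every other line untouched
    return "\n".join(out)


# Hypothetical old-style submit description file.
old_submit = "universe = globus\nexecutable = my_job\nqueue"

if __name__ == "__main__":
    print(to_grid_universe(old_submit))
```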
Condor-G improvements
› Condor-G can submit to either Globus GT2 or GT3
resources, including support for GT3 with web
services.
 Condor-G includes everything required; no need for client
to have a GT3 installation.
 Good migration path to OGSA
› Condor-G to Nordugrid, Unicore, Condor, ORACLE
› Support for credential refresh via the MyProxy
Online Credential Management in NMI
http://grid.ncsa.uiuc.edu/myproxy/
Why Condor + MyProxy?
› Long-lived tasks or services need
credentials
 Task lifetime is difficult to predict
› Don’t want to delegate long-lived
credentials
 Fear of compromise
› Instead, renew credentials with MyProxy
as needed during the task’s lifetime
 Provides a single point of monitoring and
control
 Renewal policy can be modified at any time
• For example, disable renewals if compromise is
detected or suspected
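The renewal policy described above boils down to a small decision. The sketch below is an assumption-laden illustration, not MyProxy's or Condor-G's code: the function `should_renew`, the threshold, and the `renewals_enabled` switch are invented to show short-lived credentials being renewed only when close to expiry, with renewal disabled after a suspected compromise.

```python
from datetime import datetime, timedelta


def should_renew(
    expires_at: datetime,
    now: datetime,
    threshold: timedelta,
    renewals_enabled: bool = True,
) -> bool:
    """Renew only if renewals are allowed and expiry is within `threshold`."""
    if not renewals_enabled:
        return False  # e.g. compromise detected or suspected
    return expires_at - now <= threshold


if __name__ == "__main__":
    now = datetime(2004, 10, 1, 12, 0)
    soon = now + timedelta(hours=1)
    later = now + timedelta(hours=12)
    two_hours = timedelta(hours=2)
    print(should_renew(soon, now, two_hours))
    print(should_renew(later, now, two_hours))
    print(should_renew(soon, now, two_hours, renewals_enabled=False))
```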
Credential Renewal
[Diagram: credential renewal between Home and Remote. Components: Submit Jobs, Condor-G Scheduler, MyProxy, Resource Manager, Job. Arrow labels: Enable Renewal, Retrieve Credentials, Launch Job, Refresh Credentials.]
More…
› Condor can now transfer job data files larger than 2 GB in size.
 On all platforms that support 64-bit file offsets
› Real-time spooling of stdout/err/in in any universe incl. VANILLA
 Real-time monitoring of job progress
› Working on Hierarchical Negotiations
Thank you!