Condor RoadMap
Paradyn/Condor Week 2005
Todd Tannenbaum
Computer Sciences Department
University of Wisconsin-Madison
condor-admin@cs.wisc.edu
http://www.cs.wisc.edu/condor
Terms of License
Any and all dates in these slides are
relative from a date hereby unspecified in
the event of a likely situation involving a
frequent condition. Viewing, use,
reproduction, display, modification and
redistribution of these slides, with or without
modification, in source and binary forms, is
permitted only after a deposit by said user into
PayPal accounts registered to Todd Tannenbaum
….
Outline
› Version 6.7.x to Version 6.8.0
 Availability
• Failover, fault tolerance
 Scalability
• Resources, jobs, matchmaking framework, files
 Accessibility
• APIs, more Grid middleware, network firewalls
 Everything else
• New functionality, new ports, etc.
› And after that?
p.s. Still here? Thank you for your generous PayPal pledge!
3
Current Status
› Current Stable Release
Version 6.6.9
› Current Development Release
Version 6.7.5
› Next Stable Release Version 6.8.0
Once per year
Code freeze end of April
Release end of May
4
Existing Ports
• Digital UNIX 4.0 Alpha
• AIX 5.2 (clipped) PowerPC
• Tru64 5.1 (clipped) Alpha
• HP UNIX 10.20 PA RISC
• HP UNIX 11.00 (clipped using hpux10.20 32 bit) PA RISC
• Irix 6.5 (clipped) SGI
• Linux 2.4.x (glibc 2.2) - Red Hat 7.1, 7.2, 7.3 (clipped) Alpha
• Linux 2.4.x (glibc 2.2) - Red Hat 7.1, 7.2, 7.3 Intel x86
• Linux 2.4.x (glibc 2.2) - Red Hat 8 Intel x86
• Linux 2.4.x (glibc 2.3) - Red Hat 9 Intel x86
• Enterprise Server 8.1 Intel Itanium
• Solaris 8 Sparc
• Solaris 9 Sparc
• Microsoft Windows 2000 or XP (clipped) Intel x86
5
New Ports
› Introduced in v6.6.x
 MacOSX ("clipped") PowerPC
 Debian Linux 3.1 Intel x86
 Fedora Core 1 Intel x86
 Red Hat Enterprise Linux 3 Intel x86
 SuSE Linux Enterprise Server 8.1 Intel Itanium
Sigh…
› Introduced in v6.7.x
 AIX 5.1 ("clipped") PowerPC
 Fedora Core 2 on x86
 Fedora Core 3 on x86
 SuSE 8.0 ("clipped") on AMD64
 Solaris 10 ("clipped") on Sparc
 Scientific Linux (Release 303) on x86
“Psilord” – The Condor porting doctor. Talk to him in person tomorrow.
› Still to be introduced in v6.7.x (before v6.8.0)
 HPUX 11i 64-bit pa-risc
 RHEL 4 on x86
 “native” 64 bit AMD Linux
6
Job Progress continues if
connection is interrupted
› Now for Vanilla and Java universe jobs, Condor supports reestablishment of the connection between the submitting and executing machines.
 If there is a network outage between the execute and submit machines
 If the submit machine restarts
› To take advantage of this feature, put the following line into your job’s submit description file:
job_lease_duration = <N seconds>
For example:
job_lease_duration = 1200
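One way to picture the lease is the rough Python sketch below. It assumes the execute side lets the job keep running as long as the last contact from the submit side is no older than the lease duration. The class and method names are hypothetical and are not Condor's implementation.

```python
from dataclasses import dataclass


@dataclass
class JobLease:
    """Hypothetical model of a job lease; not Condor's actual data structure."""
    duration: int        # seconds, as given by job_lease_duration
    last_renewal: float  # time (seconds) of the last contact from the submit machine

    def renew(self, now: float) -> None:
        # Called whenever the submit machine successfully contacts the job.
        self.last_renewal = now

    def expired(self, now: float) -> bool:
        # While disconnected, the job may keep running until the lease runs out.
        return now - self.last_renewal > self.duration


lease = JobLease(duration=1200, last_renewal=0.0)
short_outage = lease.expired(now=600.0)    # outage shorter than the lease
long_outage = lease.expired(now=1500.0)    # outage longer than the lease
print(short_outage, long_outage)
```

With a 1200 second lease, a 10 minute outage leaves the lease valid, while 25 minutes without contact lets it expire.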
7
Job Progress continues if
submit machine fails
› Condor can now support a submit
machine “hot spare”
If your submit machine A is down for
longer than N minutes, a second machine
B can take over
Requires shared filesystem between
machines A and B
8
Central Manager Failover
› Condor Central Manager has two services
› condor_collector
 Now a list of collectors is supported
› condor_negotiator (matchmaker)
 If it fails, an election process picks another to take over
Contributed technology from Technion
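As an illustration of what a list of collectors buys a client, here is a minimal Python sketch that tries each collector in order and uses the first one that responds. The host names and the `is_alive` check are hypothetical stand-ins. How Condor actually probes collectors is not described on this slide.

```python
from typing import Callable, Iterable, Optional


def first_responding(collectors: Iterable[str],
                     is_alive: Callable[[str], bool]) -> Optional[str]:
    """Return the first collector in the list that responds, or None."""
    for host in collectors:
        if is_alive(host):
            return host
    return None


# Hypothetical host names; "cm1.example.org" is pretended to be down.
down = {"cm1.example.org"}
chosen = first_responding(["cm1.example.org", "cm2.example.org"],
                          lambda host: host not in down)
print(chosen)
```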
9
Some Condor APIs
› Command Line tools
 condor_submit, condor_q, etc
› Condor Perl Module
› Chirp
› Checkpoint Library API
› MW --- improved!
› DRMAA
› Condor Grid ASCII Helper Protocol (GAHP)
› Web Service Interface
10
DRMAA
› Distributed Resource Management
Application API (DRMAA)
 GGF Working Group
 An API specification for the submission and
control of jobs to one or more Distributed
Resource Management (DRM) systems
› An API with C and Java bindings
 not a protocol
› Scope
 Does: job submission, monitoring, control, final
status
 Does not: file staging, reservations, security, …
11
Condor GAHP
› The Condor GAHP is a relatively low-level protocol based on simple ASCII messages through stdin and stdout
› Supports a rich feature set including two-phase commits, transactions, and optional asynchronous notification of events
12
GAHP, cont
Example:
R: $GahpVersion: 1.0.0 Nov 26 2001 NCSA\ CoG\ Gahpd $
S: GRAM_PING 100 vulture.cs.wisc.edu/fork
R: E
S: RESULTS
R: E
S: COMMANDS
R: S COMMANDS GRAM_JOB_CANCEL GRAM_JOB_REQUEST
GRAM_JOB_SIGNAL GRAM_JOB_STATUS GRAM_PING INITIALIZE_FROM_FILE
QUIT RESULTS VERSION
S: VERSION
R: S $GahpVersion: 1.0.0 Nov 26 2001 NCSA\ CoG\ Gahpd $
S: INITIALIZE_FROM_FILE /tmp/grid_proxy_554523.txt
R: S
S: GRAM_PING 100 vulture.cs.wisc.edu/fork
R: S
S: RESULTS
R: S 0
S: RESULTS
R: S 1
R: 100 0
S: QUIT
R: S
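The reply lines in the transcript above can be picked apart with a few lines of Python. This is a minimal sketch that assumes only what the example shows: fields are separated by spaces, a literal space inside a field is escaped with a backslash, and a reply starts with `S` (success) or `E` (error). The function names are hypothetical, and other escape sequences are not handled.

```python
import re
from typing import List, Tuple


def split_gahp_fields(line: str) -> List[str]:
    """Split a line on spaces that are not escaped with a backslash."""
    fields = re.split(r"(?<!\\) ", line.strip())
    # Undo the escaping so "NCSA\ CoG\ Gahpd" reads "NCSA CoG Gahpd".
    return [f.replace("\\ ", " ") for f in fields if f]


def parse_reply(line: str) -> Tuple[bool, List[str]]:
    """Return (ok, args) for a reply whose first field is S or E."""
    fields = split_gahp_fields(line)
    if not fields or fields[0] not in ("S", "E"):
        raise ValueError(f"unexpected reply: {line!r}")
    return fields[0] == "S", fields[1:]


ok, args = parse_reply(r"S $GahpVersion: 1.0.0 Nov 26 2001 NCSA\ CoG\ Gahpd $")
print(ok, args)
```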
13
Web Service Interfaces
› SOAP over http or https to the Condor daemons
› Use any language or platform (where you can find a decent SOAP library)
› Functionality Exposed in current release
 Submit jobs
 Retrieve job output
 Remove/hold/release jobs
 Query machine status (fetch ads from collector)
 Query job status (fetch ads from the schedd)
14
Getting machine status via
SOAP (in Java with Axis)
locator = new CondorCollectorLocator();
collector = locator.getcondorCollector(new URL("http://machine:port"));
ads = collector.queryStartdAds("Memory>512");
Because we give you WSDL information you don’t
have to write any of these functions.
15
New “Grid Universe”
› With new Grid Universe, always specify a ‘gridtype’. So the old “globus” Universe is now declared as:
universe = grid
gridtype = gt2
› Other gridtypes?
‘Condor-G’
 GT2 (Globus Toolkit 2)
 GT3 (Globus Toolkit 3.2)
 GT4 (Globus Toolkit 3.9.5+)
 UNICORE (Unicore)
 PBS (OpenPBS, PBSPro – technology from INFN)
 LSF (Platform LSF – technology from INFN)
‘Condor-C’
 CONDOR (thanks gLite!)
16
Other Grid Universe
improvements
› Condor-G has support for credential refresh via the MyProxy Online Credential Management in NMI
 http://grid.ncsa.uiuc.edu/myproxy/
› Some functionality present in Condor-G added to Condor-C
 Forwarding of refreshed credentials (EGEE)
 GSI authentication support
17
Quill
[Diagram: Master, Startd, … and Schedd daemons; the Schedd's job queue log feeds Quill, which maintains Queue + History tables in an RDBMS]
› Job ClassAds information mirrored into an RDBMS
› Both active jobs and historical jobs
› Benefits BOTH scalability and accessibility
18
BAM! More tasty Condor goodness!
› Condor can now transfer job data files larger than 2 GB in size.
 On all platforms that support 64bit file offsets
› Real-time spooling of stdout/err/in in any universe incl VANILLA
 Real-time monitoring of job progress
› Condor Installer on Win32 uses MSI (thanks Micron!)
› condor_transfer_data (DZero)
› STARTD_VM_EXPRS (INFN)
› condor_vacate_job tool
› condor_status -negotiator
19
And More…
› New startd policy expression MaxJobRetirementTime.
 specifies the maximum amount of time (in seconds) that the
startd is willing to wait for a job to finish on its own when the
startd needs to preempt the job
› -peaceful option to condor_off, condor_restart
› noop_job = True
› Preliminary support for the Tool Daemon Protocol (TDP)
 TDP goal is to provide a generic way for scheduling systems
(daemons) to interact with monitoring tools.
 Users can specify a "tool" that should be spawned alongside their regular Condor job.
 On Linux, ability to allow a monitoring tool to attach with ptrace()
before the job's main() function is called.
20
Hey Jobs! We’re watching you!
› condor_starter enforces limits
 Starter is already monitoring many
job characteristics (image size, cpu
usage, etc)
 Threshold expressions
• Use more resources than you said you
would, and BAM!
› Local Universe
 Just like Scheduler Universe, but
there is a condor_starter
 All advantages of the starter
[Diagram: Submit side (schedd, starter, job) and Execute side (startd, starter, job); a starter warns its job: "Hey, job, behave or else!"]
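The threshold idea from the starter bullet above can be sketched in Python as a comparison of what a job declared against what the starter observes. The attribute names, units, and the function are hypothetical. The slide does not say how Condor expresses these thresholds.

```python
from typing import Dict, List


def exceeded_limits(declared: Dict[str, float],
                    observed: Dict[str, float]) -> List[str]:
    """Return the names of resources whose observed usage is above the declared value."""
    return sorted(name for name, limit in declared.items()
                  if observed.get(name, 0) > limit)


# Hypothetical attribute names and units, for illustration only.
declared = {"image_size_kb": 500_000, "cpu_seconds": 3600}
observed = {"image_size_kb": 750_000, "cpu_seconds": 1200}
violations = exceeded_limits(declared, observed)
print(violations)
```

A policy could then act on any non-empty result, for example by holding or removing the job.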
21
Condor with Firewalls and NATs: GCB in v6.8.0!
[Diagram: a client app (connect) and a server app (listen, accept) each run on a GCB layer above TCP/IP; the GCB layer translates their calls, with a relay point between the two sides]
22
Binding & Registration
B = socket();
bind(B, ANY);
getsockname(B, X)
[Diagram: the server's GCB lib binds the socket locally to B and registers (X, B) with the broker; the socket is then officially bound to X]
23
GCB: Public-Private Connection
[Diagram: Client (socket A) and Server (socket B), each with a GCB lib, and address X. The client calls connect(A, X); messages shown: CONNECT (X), CONTACT (A), PASSIVE]
24
GCB: Private-Private Connection
[Diagram: Client (socket A) and Server (socket B), each with a GCB lib, and addresses X and Y. The client calls connect(A, X); messages shown: CONNECT (X), CONTACT (Y), ACTIVE (X)]
25
From CondorWeek
2003:
› New version of ClassAds into
Condor
 Conditionals !!
• if/then/else
 Aggregates (lists, nested classads)
 Built-in functions
• String operations, pattern
matching, time operators, unit
conversions
 Clean implementations in C++ and
Java
 ClassAd collections
› This may become v6.8.0
Is this TODD ?!?!
26
ClassAd Improvements in
Condor!
› Conditionals
 IfThenElse(condition,then,else)
› String functions
 Strcat(), strcmp(), toUpper(), etc.
› StringList functions
 Example of a “string list” (CSV style)
• Mylist = "Joe, Jon, Jeff, Jim, Jake"
 StrListContains(), StrListAppend(),
StrListRemove(), etc.
› Others
 Type test, some math functions
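To make the "string list" idea concrete, here is a small Python sketch of what contains, append, and remove could mean for a comma-separated list. The Python function names are hypothetical stand-ins for the ClassAd functions named above. The semantics are assumed to be case-sensitive matching with whitespace around items ignored.

```python
from typing import List


def parse_string_list(value: str) -> List[str]:
    """Split a comma-separated string list and trim whitespace around items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def str_list_contains(value: str, item: str) -> bool:
    return item in parse_string_list(value)


def str_list_append(value: str, item: str) -> str:
    return ", ".join(parse_string_list(value) + [item])


def str_list_remove(value: str, item: str) -> str:
    return ", ".join(x for x in parse_string_list(value) if x != item)


my_list = "Joe, Jon, Jeff, Jim, Jake"
print(str_list_contains(my_list, "Jeff"))
print(str_list_append(my_list, "Todd"))
print(str_list_remove(my_list, "Jon"))
```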
27
Security
› New Service: condor_credd
 Store, refresh, forward credentials
 Right now used just by stork – role will expand
(AFS authentication?)
› Common Authentication Methods between
Condor on Unix and Win32
 Kerberos 1.4
• Additional hopeful benefit: Authentication against MS
Active Directory!?!
 GSI on Win32 ?
› Starter only runs known executables
› Shadow only reads/writes to a given
subdirectory(s)
28
Accounting Groups and
Group Quota Support
› Account Group (w/ CORE Feature Animation)
› Account Group Quota (inspiration CDF @ Fermi)
 Sample Problem: Cluster w/ 500 nodes, Chemistry Dept
purchased 100 of them, Chemistry users must always be
able to use them
 Could use Machine Rank…
• but this ties to specific machines
 Or could use new group support
• Each group can be given a quota in config file
• Job ads can specify group membership
• Group quotas are satisfied first
• Accounting by user and by group
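The "group quotas are satisfied first" rule can be sketched with the sample problem from this slide. The Python below is an illustrative two-pass allocation. It assumes quotas never add up to more than the pool and that leftovers go to unmet demand in iteration order. It is not the negotiator's actual algorithm, and the function and group names are hypothetical.

```python
from typing import Dict


def allocate(total: int, quotas: Dict[str, int],
             demand: Dict[str, int]) -> Dict[str, int]:
    """Hand out `total` machines: group quotas first, then leftovers by demand."""
    if sum(quotas.values()) > total:
        raise ValueError("quotas exceed the number of machines")
    # Pass 1: each group gets up to its quota (never more than it asks for).
    grant = {g: min(wanted, quotas.get(g, 0)) for g, wanted in demand.items()}
    remaining = total - sum(grant.values())
    # Pass 2: leftover machines go to unmet demand, in iteration order.
    for group, wanted in demand.items():
        extra = min(wanted - grant[group], remaining)
        grant[group] += extra
        remaining -= extra
    return grant


# 500 nodes, Chemistry purchased 100 of them; another group wants the whole pool.
result = allocate(500, {"chemistry": 100}, {"physics": 600, "chemistry": 150})
print(result)
```

Even though the other group asks for more than the whole pool, Chemistry still receives its 100 machines.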
29
Improved Scalability
› Much faster negotiation
 SIGNIFICANT_ATTRIBUTES determined automatically
 Schedd uses non-blocking TCP connects to the startd
 Negotiator caching
 Collector Forks for queries
 More…
30
Parallel Universe
› SSHD running alongside your job!
Also works with VANILLA, JAVA
universe!
› Support for parallel jobs
Other than just MPICH, e.g. Lam, SCore
Nice for testing environments
31
What’s brewing for after v6.8.0?
› More data, data, data
 Stork distributed w/ v6.8.0, incl DAGMan support
 NeST manage Condor spool files, ckpt servers
 Stork used for Condor job data transfers
› Virtual Machines (and the future of Standard Universe)
› Condor and Shibboleth (with Georgetown Univ)
› Least Privilege Security Access (with U of Cambridge)
› Dynamic Temporary Accounts (with EGEE, Argonne)
› Leverage Database Technology (with UW DB group)
› ‘Automatic’ Glideins (NMI Nanohub – Purdue, U of Florida)
› Easier Updates
› New ClassAds (integration with Optena)
› Hierarchical Matchmaking
Can I commit this to CVS??
32
A Tree of Matchmakers
• Fault Tolerance
• Flexibility
• MM now manage other MMs
[Diagram: a tree of matchmakers (BIG 10 MM, UW MM, CS MM, Theory Group MM, Erdos MM) with nodes labeled R and C attached; one MM says “I need more resources”, and a match is made]
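A tree of matchmakers can be pictured with a short Python sketch in which a matchmaker serves requests from its own resources and asks its parent when it runs out. The class, the escalation rule, and the parent/child relationship used in the example are assumptions made for illustration. The slide only says that matchmakers can manage other matchmakers.

```python
from typing import List, Optional


class Matchmaker:
    """Hypothetical matchmaker node that escalates to its parent when out of resources."""

    def __init__(self, name: str, resources: List[str],
                 parent: Optional["Matchmaker"] = None):
        self.name = name
        self.resources = list(resources)
        self.parent = parent

    def request(self) -> Optional[str]:
        # Serve from the local pool first.
        if self.resources:
            return self.resources.pop(0)
        # "I need more resources": ask the parent matchmaker, if any.
        if self.parent is not None:
            return self.parent.request()
        return None


# Assumed layout: the CS matchmaker sits below the UW matchmaker.
uw = Matchmaker("UW MM", ["r1"])
cs = Matchmaker("CS MM", [], parent=uw)
first = cs.request()   # served by the parent
second = cs.request()  # nothing left anywhere in the tree
print(first, second)
```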
33
Thank you!
34