Condor RoadMap
Paradyn/Condor Week 2005
Todd Tannenbaum
Computer Sciences Department
University of Wisconsin-Madison
condor-admin@cs.wisc.edu
http://www.cs.wisc.edu/condor

Terms of License
Any and all dates in these slides are relative from a date hereby unspecified in the event of a likely situation involving a frequent condition. Viewing, use, reproduction, display, modification and redistribution of these slides, with or without modification, in source and binary forms, is permitted only after a deposit by said user into PayPal accounts registered to Todd Tannenbaum ….

Outline
› Version 6.7.x to Version 6.8.0
    Availability
    • Failover, fault tolerance
    Scalability
    • Resources, jobs, matchmaking framework, files
    Accessibility
    • APIs, more Grid middleware, network firewalls
    Everything else
    • New functionality, new ports, etc.
› And after that?
p.s. Still here? Thank you for your generous PayPal pledge!

Current Status
› Current Stable Release
    Version 6.6.9
› Current Development Release
    Version 6.7.5
› Next Stable Release
    Version 6.8.0
    Once per year
    Code freeze end of April
    Release end of May

Existing Ports
• Digital UNIX 4.0 Alpha
• AIX 5.2 (clipped) PowerPC
• Tru64 5.1 (clipped) Alpha
• HP UNIX 10.20 PA RISC
• HP UNIX 11.00 (clipped using hpux10.20 32 bit) PA RISC
• Irix 6.5 (clipped) SGI
• Linux 2.4.x (glibc 2.2) - Red Hat 7.1, 7.2, 7.3 (clipped) Alpha
• Linux 2.4.x (glibc 2.2) - Red Hat 7.1, 7.2, 7.3 Intel x86
• Linux 2.4.x (glibc 2.2) - Red Hat 8 Intel x86
• Linux 2.4.x (glibc 2.3) - Red Hat 9 Intel x86
• Enterprise Server 8.1 Intel Itanium
• Solaris 8 Sparc
• Solaris 9 Sparc
• Microsoft Windows 2000 or XP (clipped) Intel x86

New Ports
› Introduced in v6.6.x
    MacOSX ("clipped") PowerPC Sigh…
    Debian Linux 3.1 Intel x86
    Fedora Core 1 Intel x86
    Red Hat Enterprise Linux 3 Intel x86
    SuSE Linux Enterprise Server 8.1 Intel Itanium
› Introduced in v6.7.x
    AIX 5.1 ("clipped") PowerPC
    Fedora Core 2 on x86
    Fedora Core 3 on x86
    SuSE 8.0 ("clipped") on AMD64
    Solaris 10 ("clipped") on Sparc
    Scientific Linux (Release 303) on x86
“Psilord” – The Condor porting doctor. Talk to him in person tomorrow.
› Still to be introduced in v6.7.x (before v6.8.0)
    HPUX 11i 64-bit pa-risc
    RHEL 4 on x86
    “native” 64 bit AMD Linux

Job Progress continues if connection is interrupted
› Now for Vanilla and Java universe jobs, Condor supports reestablishment of the connection between the submitting and executing machines.
    If there is a network outage between the execute and submit machines
    If the submit machine restarts
› To take advantage of this feature, put the following line into your job’s submit description file:
    job_lease_duration = <N seconds>
  For example:
    job_lease_duration = 1200

Job Progress continues if submit machine fails
› Condor can now support a submit machine “hot spare”
    If your submit machine A is down for longer than N minutes, a second machine B can take over
    Requires a shared filesystem between machines A and B

Central Manager Failover
› Condor Central Manager has two services
› condor_collector
    Now a list of collectors is supported
› condor_negotiator (matchmaker)
    If it fails, an election process picks another to take over
    Contributed technology from Technion

Some Condor APIs
› Command Line tools
    condor_submit, condor_q, etc
› Condor Perl Module
› Chirp
› Checkpoint Library API
› MW --- improved!
› DRMAA
› Condor Grid ASCII Helper Protocol (GAHP)
› Web Service Interface

DRMAA
› Distributed Resource Management Application API (DRMAA)
    GGF Working Group
    An API specification for the submission and control of jobs to one or more Distributed Resource Management (DRM) systems
› An API with C and Java bindings, not a protocol
› Scope
    Does: job submission, monitoring, control, final status
    Does not: file staging, reservations, security, …

Condor GAHP
› The Condor GAHP is a relatively low-level protocol based on simple ASCII messages through stdin and stdout
› Supports a rich feature set including two-phase commits, transactions, and optional asynchronous notification of events

GAHP, cont
Example:
    R: $GahpVersion: 1.0.0 Nov 26 2001 NCSA\ CoG\ Gahpd $
    S: GRAM_PING 100 vulture.cs.wisc.edu/fork
    R: E
    S: RESULTS
    R: E
    S: COMMANDS
    R: S COMMANDS GRAM_JOB_CANCEL GRAM_JOB_REQUEST GRAM_JOB_SIGNAL GRAM_JOB_STATUS GRAM_PING INITIALIZE_FROM_FILE QUIT RESULTS VERSION
    S: VERSION
    R: S $GahpVersion: 1.0.0 Nov 26 2001 NCSA\ CoG\ Gahpd $
    S: INITIALIZE_FROM_FILE /tmp/grid_proxy_554523.txt
    R: S
    S: GRAM_PING 100 vulture.cs.wisc.edu/fork
    R: S
    S: RESULTS
    R: S 0
    S: RESULTS
    R: S 1
    R: 100 0
    S: QUIT
    R: S

Web Service Interfaces
› SOAP over http or https to the Condor daemons
› Use any language or platform (where you can find a decent SOAP library)
› Functionality exposed in current release
    Submit jobs
    Retrieve job output
    Remove/hold/release jobs
    Query machine status (fetch ads from collector)
    Query job status (fetch ads from the schedd)

Getting machine status via SOAP (in Java with Axis)
    locator = new CondorCollectorLocator();
    collector = locator.getcondorCollector(new URL("http://machine:port"));
    ads = collector.queryStartdAds("Memory>512");
Because we give you WSDL information, you don’t have to write any of these functions.

New “Grid Universe”
› With new Grid Universe, always specify a ‘gridtype’.
› So the old “globus” Universe is now declared as:
    universe = grid
    gridtype = gt2
Other gridtypes?
    GT2 (Globus Toolkit 2) ‘Condor-G’
    GT3 (Globus Toolkit 3.2)
    GT4 (Globus Toolkit 3.9.5+)
    UNICORE (Unicore)
    PBS (OpenPBS, PBSPro – technology from INFN)
    LSF (Platform LSF – technology from INFN)
    ‘Condor-C’ CONDOR (thanks gLite!)

Other Grid Universe improvements
› Condor-G has support for credential refresh via the MyProxy Online Credential Management in NMI
    http://grid.ncsa.uiuc.edu/myproxy/
› Some functionality present in Condor-G added to Condor-C
    Forwarding of refreshed credentials (EGEE)
    GSI authentication support

Quill
› Job ClassAds information mirrored into an RDBMS
    Both active jobs and historical jobs
› Benefits BOTH scalability and accessibility
[Diagram: Master, Startd, …, Schedd; Job Queue log; Quill; RDBMS (Queue + History Tables)]

BAM! More tasty Condor goodness!
› Condor can now transfer job data files larger than 2 GB in size.
    On all platforms that support 64-bit file offsets
› Real-time spooling of stdout/err/in in any universe incl VANILLA
    Real-time monitoring of job progress
› Condor Installer on Win32 uses MSI (thanks Micron!)
› condor_transfer_data (DZero)
› STARTD_VM_EXPRS (INFN)
› condor_vacate_job tool
› condor_status -negotiator

And More…
› New startd policy expression MaxJobRetirementTime
    Specifies the maximum amount of time (in seconds) that the startd is willing to wait for a job to finish on its own when the startd needs to preempt the job
› -peaceful option to condor_off, condor_restart
› noop_job = True
› Preliminary support for the Tool Daemon Protocol (TDP)
    TDP goal is to provide a generic way for scheduling systems (daemons) to interact with monitoring tools.
    Users can specify a “tool” that should be spawned alongside their regular Condor job.
    On Linux, ability to allow a monitoring tool to attach with ptrace() before the job's main() function is called.

Hey Jobs! We’re watching you!
› condor_starter enforces limits
    Starter is already monitoring many job characteristics (image size, cpu usage, etc)
    Threshold expressions
    • Use more resources than you said you would, and BAM!
› Local Universe
    Just like Scheduler Universe, but there is a condor_starter
    All advantages of the starter
[Diagram: Submit (schedd, starter, job) and Execute (startd, starter, job)]
Hey, job, behave or else!

Condor with Firewalls and NATs: GCB in v6.8.0!
[Diagram: Client app and Server app, each over a GCB layer and TCP/IP, joined through a relay point; labels: connect, translate, listen, accept]

Binding & Registration
    B = socket(); bind(B, ANY);    Locally bound to B
    getsockname(B, X)              Officially bound to X
[Diagram labels: Server, GCB lib, Broker, B, X; Registered (X, B)]

GCB: Public-Private Connection
[Diagram labels: Client, Server, GCB lib (both sides), A, B, X; connect(A, X); CONNECT (X); CONTACT (A); PASSIVE]

GCB: Private-Private Connection
[Diagram labels: Client, Server, GCB lib (both sides), A, B, X, Y; connect(A, X); CONNECT (X); CONTACT (Y); ACTIVE (X)]

From CondorWeek 2003:
› New version of ClassAds into Condor
    Conditionals !!
    • if/then/else
    Aggregates (lists, nested classads)
    Built-in functions
    • String operations, pattern matching, time operators, unit conversions
    Clean implementations in C++ and Java
    ClassAd collections
› This may become v6.8.0
Is this TODD ?!?!

ClassAd Improvements in Condor!
› Conditionals
    IfThenElse(condition,then,else)
› String functions
    Strcat(), strcmp(), toUpper(), etc.
› StringList functions
    Example of a “string list” (CSV style)
    • Mylist = "Joe, Jon, Jeff, Jim, Jake"
    StrListContains(), StrListAppend(), StrListRemove(), etc.
› Others
    Type test, some math functions

Security
› New Service: condor_credd
    Store, refresh, forward credentials
    Right now used just by Stork – role will expand (AFS authentication?)
› Common Authentication Methods between Condor on Unix and Win32
    Kerberos 1.4
    • Additional hopeful benefit: Authentication against MS Active Directory!?!
    GSI on Win32 ?
› Starter only runs known executables
› Shadow only reads/writes to a given subdirectory or subdirectories

Accounting Groups and Group Quota Support
› Account Group (w/ CORE Feature Animation)
› Account Group Quota (inspiration CDF @ Fermi)
    Sample Problem: Cluster w/ 500 nodes, Chemistry Dept purchased 100 of them, Chemistry users must always be able to use them
    Could use Machine Rank…
    • but this ties to specific machines
    Or could use new group support
    • Each group can be given a quota in config file
    • Job ads can specify group membership
    • Group quotas are satisfied first
    • Accounting by user and by group

Improved Scalability
› Much faster negotiation
    SIGNIFICANT_ATTRIBUTES determined automatically
    Schedd uses non-blocking TCP connects to the startd
    Negotiator caching
    Collector forks for queries
    More…

Parallel Universe
› SSHD running alongside your job!
    Also works with VANILLA, JAVA universe!
› Support for parallel jobs
    Other than just MPICH, e.g. LAM, SCore
    Nice for testing environments

What’s brewing for after v6.8.0?
› More data, data, data
    Stork distributed w/ v6.8.0, incl DAGMan support
    NeST manages Condor spool files, ckpt servers
    Stork used for Condor job data transfers
› Virtual Machines (and the future of Standard Universe)
› Condor and Shibboleth (with Georgetown Univ)
› Least Privilege Security Access (with U of Cambridge)
› Dynamic Temporary Accounts (with EGEE, Argonne)
› Leverage Database Technology (with UW DB group)
› ‘Automatic’ Glideins (NMI Nanohub – Purdue, U of Florida)
› Easier Updates
› New ClassAds (integration with Optena)
› Hierarchical Matchmaking
Can I commit this to CVS??

A Tree of Matchmakers
• Fault Tolerance
• Flexibility
• MMs now manage other MMs
[Diagram: tree of matchmakers (BIG 10 MM, UW MM, CS MM, Theory Group MM, Erdos MM) with nodes labeled R and C; labels: “I need more resources”, “A Match”]

Thank you!
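
Backup sketch for “A Tree of Matchmakers”: the slide only says that matchmakers manage other matchmakers and that one of them can say “I need more resources”. The Python below is one possible reading of that idea, not Condor code. The Matchmaker class, its method names, the nesting order of the five matchmakers, and the sample machine ads (Name, Memory) are all assumptions made for illustration.

```python
class Matchmaker:
    """Hypothetical matchmaker node; not part of any Condor release."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []
        self.resources = []  # machine ads owned directly by this matchmaker

    def add_child(self, child):
        child.parent = self
        self.children.append(child)
        return child

    def _take_from_subtree(self, wants, skip=None):
        # Look at our own resources first, then at each child subtree,
        # skipping the child that has already been searched.
        for i, ad in enumerate(self.resources):
            if wants(ad):
                return self.name, self.resources.pop(i)
        for child in self.children:
            if child is not skip:
                found = child._take_from_subtree(wants)
                if found is not None:
                    return found
        return None

    def match(self, wants):
        # Try locally; on failure say "I need more resources" to the parent,
        # which searches the rest of its subtree, and so on up the tree.
        mm, came_from = self, None
        while mm is not None:
            found = mm._take_from_subtree(wants, skip=came_from)
            if found is not None:
                return found
            came_from, mm = mm, mm.parent
        return None


def build_demo_tree():
    # Assumed nesting: BIG 10 > UW > CS > {Theory Group, Erdos}.
    big10 = Matchmaker("BIG 10 MM")
    uw = big10.add_child(Matchmaker("UW MM"))
    cs = uw.add_child(Matchmaker("CS MM"))
    theory = cs.add_child(Matchmaker("Theory Group MM"))
    erdos = cs.add_child(Matchmaker("Erdos MM"))
    theory.resources.append({"Name": "theory-node", "Memory": 256})
    uw.resources.append({"Name": "uw-node", "Memory": 1024})
    return {"big10": big10, "uw": uw, "cs": cs, "theory": theory, "erdos": erdos}


if __name__ == "__main__":
    tree = build_demo_tree()
    # Erdos MM owns nothing, so the request climbs until some ancestor can match.
    print(tree["erdos"].match(lambda ad: ad["Memory"] > 512))
```

In this sketch a request climbs only as far as it has to, so a group with enough local resources never bothers its parent. A real design would also need the policy, accounting, and fault-tolerance pieces the slide mentions.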