What’s New in Condor
Todd Tannenbaum
Computer Sciences Department
University of Wisconsin-Madison
condor-admin@cs.wisc.edu
http://www.cs.wisc.edu/condor

Overview
Quick ‘sound bytes’ on new functionality in recent Condor releases
› Condor Development Process
› New Features in Condor version 6.6.x
› New Features in Condor version 6.7.0

Condor Development Process
› We maintain two different releases at all times
    Stable Series
      • Second digit is even: e.g. 6.2.2, 6.4.7, 6.6.3
    Development Series
      • Second digit is odd: e.g. 6.5.1, 6.7.2

Stable Series
› Heavily tested
› Runs on our department production pool of nearly 1,000 CPUs (for a minimum of 3 weeks)
› No new features, only bug fixes and ports.
› A given stable release is always compatible with other releases from the same series
    6.6.X is compatible with 6.6.Y
› Recommended for production pools

Development Series
› Less heavily tested
› Runs on our small(er) test pool.
› New features and new technology are added frequently
› Versions from the same development series are not guaranteed to be compatible with each other (although we try hard)

New in version 6.6.x
› Version 6.6.0 released in November 2003.
› Current release: version 6.6.7, to be released in October 2004.

The Struggle to Build Condor
› Condor is BIG
    Condor code consists of primary source plus ‘externals’.
      • Externals include Kerberos, zlib, GSI, PVM, gSOAP…
      • Patches to externals
    Current shipped source + externals: ~415MB of source, or ~9 million lines!
    Building Condor outside of UW-Madison used to be very difficult.
      • “LIST OF SHAME”: Build pointed to packages on UW-Madison fileservers.
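
As an aside on the release numbering described under Condor Development Process, the even/odd convention can be sketched in a few lines of Python. This is only an illustration, not part of Condor: the function names (parse_version, series_kind, guaranteed_compatible) are hypothetical, and the sketch assumes plain three-part "major.minor.patch" version strings. It encodes only what the slides state: an even second digit means stable, an odd one means development, and compatibility is guaranteed only within a single stable series.

```python
def parse_version(version):
    """Split a 'major.minor.patch' string into a tuple of three ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def series_kind(version):
    """Return 'stable' if the second digit is even, else 'development'."""
    _, minor, _ = parse_version(version)
    return "stable" if minor % 2 == 0 else "development"


def guaranteed_compatible(a, b):
    """True only when both versions are in the same stable series.

    The slides promise compatibility within one stable series
    (6.6.X with 6.6.Y) and make no promise for development series
    or across different series, so those cases return False here.
    """
    major_a, minor_a, _ = parse_version(a)
    major_b, minor_b, _ = parse_version(b)
    same_series = (major_a, minor_a) == (major_b, minor_b)
    return same_series and series_kind(a) == "stable"


if __name__ == "__main__":
    # Version numbers taken from the slides' own examples.
    for v in ("6.2.2", "6.4.7", "6.6.3", "6.5.1", "6.7.2"):
        print(v, series_kind(v))
```

A False result from guaranteed_compatible means only that the slides give no guarantee, not that the two versions are known to be incompatible.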

Now Condor Source “Self-Contained”
› Source code for the externals is now bundled w/ Condor itself.
    Self-contained
    Allows version control on externals + patches
› Build w/ just “configure; make”!
    Checks for existence and proper version of all “bootstrap” requirements, such as the compiler
    Applies our patches to the externals
    All 9 million lines built and bundled

Building Condor
[Image: Building Condor before Version 6.6.0…]
[Image: Building Condor post Version 6.6.0!]

Condor + NMI
› NMI = NSF Middleware Initiative
› Automated build and test infrastructure built on top of Condor
    Pool of 37 machines of many architectures
    Scalable
    Runs every night, builds several Condor source branches, then runs 114 test programs.
    All results stored in RDBMS, reported on the web.
    Yes, Condor builds Condor!

Ports
› New ports in v6.6.x vs. v6.4.x:
    Solaris 9
    RedHat Linux 8.x, 9.x for x86 (+RPMs)
    RedHat Linux 7.x and SUSE 8.0 for IA64 (clipped)
    Tru64 5.1 (clipped)
    AIX 5.2 (clipped)
    Mac OS X (clipped)

Some new components
› Computing On Demand (COD)
› Integration of “Hawkeye” technology
› Condor-G Additions
    Matchmaking
    Grid Monitor
    Grid Shell

Computing On Demand (COD)
› Introduce effective timesharing to a distributed system
    Batch applications often want sustained throughput for a long period of time
    Interactive applications often want a quick burst of CPU power for a short period of time
    COD: Allow both to co-exist

HawkEye Technology
› Dynamic Resource Monitoring, now ‘built-in’ to Condor.
    Allows custom dynamic attributes to be added into machine ClassAds.
    These attributes can be used for
      • Queries
      • Scheduling
    Many plugins available.
      • Disk space, memory used, network errors, open files/descriptors, process monitoring, users, …

Condor-G
› Condor-G Matchmaking
    Condor-G can determine which grid site to utilize via ClassAd matchmaking (grid planning, meta scheduling, …)
› Condor-G Grid Monitor
    Reduces the load on a GT2-based gatekeeper, greatly increasing the number of jobs that can be submitted
› Condor-G GridShell
    A wrapper for the job
    Reports exit status, CPU utilization, more

Improvements in Condor for Windows
› Ability to run SCHEDULER universe jobs
    Including DAGMan
› JAVA universe support
› More Win32 flavors, incl. international versions.
› Added support for on-disk encryption of the job and data files on the execute machine.
› v6.6.6: Many issues fixed w/ signaling jobs
› v6.6.7: Support for SP2

New Features in DAGMan
› DAGMan previously required that all jobs in a DAG share one log file
› Each job can now have its own log file
› Understands XML formatted logs
› Can draw a graphical representation of your DAG
    Uses GraphViz, http://www.graphviz.org/

Central Manager New Features
› Central Manager daemons can now run on any port
    COLLECTOR_HOST = condor.cs.wisc.edu:9019
    NEGOTIATOR_HOST = condor.cs.wisc.edu:9020
    Useful for firewall situations
    Allows multiple instances on one machine
› Keeps statistics on missed updates
› Can use TCP instead of UDP, if you must

Command-line Tools
› ‘condor_update_stats’ tool to display information on any dropped central manager updates
› ‘condor_q -hold’ gives you a list of held jobs and the reason they were put on hold
› ‘condor_config_val -v’ tells you where (file and line number) an attribute is defined
› ‘condor_fetch_log’ will grab a log file from a remote machine:
    condor_fetch_log c2-15.cs.wisc.edu STARTD
› ‘condor_configure’ will install Condor via simple command-line switches, no questions asked
› ‘condor_vacate_job’ releases a resource by job id, and can be invoked by the job owner.
› ‘condor_wait’ blocks until a job or set of jobs completes

New 6.7.x Development Series
› Release of v6.7.2 was in April 2004.

Big Picture
What do we want to achieve in a new Condor development series?
› Technology Transfer
    Building a bridge between the Condor production software development activity and the academic core research activity
    BAD-FS, Stork, Diskrouter, Parrot (transparent I/O), Schedd Glidein, VO Schedulers, HA, Management, Improved ClassAds…

What do we want to achieve, cont.
› New Ports: Go to where the cycles are!
      • The RedHat Dilemma
      • Our porting ‘hopper’:
          AIX 5.1L on the PowerPC architecture
          Redhat AS server on x86
          Fedora Core on x86
          Fedora Core 2 on x86
          Redhat AS server on AMD64
          SuSE 8.0 on AMD64
          Redhat AS server on IA64
          HPUX 11.11 64-bit

What do we want to achieve, cont.
› Improve existing ports
    Move “clipped wing” ports to full ports (w/ checkpoint, process migration)
      • Mac OS X, Windows
    Better integration into environments
      • Windows: operate better w/ DFS, use MSI
      • Unix: operate w/ AFS

What do we want to achieve, cont.
› Address changes in the computing landscape
    Firewalls, NATs
    64-bit operating systems
    Emphasis on data
    Movement towards standards such as WS, OGSA, …

V6.7 Themes
› Scalability
    Resources, jobs, matchmaking framework
› Accessibility
    APIs, more Grid middleware, network
› Availability
    Failover

High Availability in v6.7.x
What happens if my submit machine reboots?
Once upon a time, only one answer: job restarts.
Checkpoint? No Checkpoint?

New: Job Progress continues if connection is interrupted
› For Vanilla and Java universe jobs, Condor now supports reestablishment of the connection between the submitting and executing machines.
› To take advantage of this feature, put the following line into your job’s submit description file:
    JobLeaseDuration = <N seconds>
  For example:
    JobLeaseDuration = 1200

What if the submission point spontaneously explodes?
(don’t try this at home)

More High Availability Solutions
› Condor can support a submit machine “hot spare”
    If your submit machine is down for longer than N minutes, a second machine can take over
› Two mechanisms available
    Job Mirroring
    High Availability Daemon Failover
      • Just tell the condor_master to run ONE instance

Daemon Failover
[Diagram: Machine A and Machine B each run a Master and a SchedD. Labels: Obtain Lock, Refresh Lock, Check Lock, Active, hot spare]

Accessibility
› Support for GCB
    Condor working w/ NATs, Firewalls
› Distributed Resource Management Application API (DRMAA)
    GGF Working Group
    An API specification for the submission and control of jobs to one or more Distributed Resource Management (DRM) systems
    Condor DRMAA interface to appear in v6.7.0

SOAP/Grid Service
[Diagram labels: condor_schedd, Cedar, Web Service: SOAP, HTTPS]

New “Grid Universe”
› With the new Grid Universe, always specify a ‘gridtype’. So the old “globus” Universe is now declared as:
    universe = grid
    gridtype = gt2
› Other gridtypes?
    GT3 for OGSA-based Globus Toolkit 3

Condor-G improvements
› Condor-G can submit to either Globus GT2 or GT3 resources, including support for GT3 with web services.
    Condor-G includes everything required; no need for the client to have a GT3 installation.
    Good migration path to OGSA
› Condor-G to Nordugrid, Unicore, Condor, ORACLE
› Support for credential refresh via the MyProxy Online Credential Management in NMI
    http://grid.ncsa.uiuc.edu/myproxy/

Why Condor + MyProxy?
› Long-lived tasks or services need credentials
    Task lifetime is difficult to predict
› Don’t want to delegate long-lived credentials
    Fear of compromise
› Instead, renew credentials with MyProxy as needed during the task’s lifetime
    Provides a single point of monitoring and control
    Renewal policy can be modified at any time
      • For example, disable renewals if compromise is detected or suspected

Credential Renewal
[Diagram labels: Home, Remote, Submit Jobs, Launch Job, Condor-G Scheduler, Resource Manager, MyProxy, Job, Enable Renewal, Retrieve Credentials, Refresh Credentials]

More…
› Condor can now transfer job data files larger than 2 GB in size.
    On all platforms that support 64-bit file offsets
› Real-time spooling of stdout/err/in in any universe, incl. VANILLA
    Real-time monitoring of job progress
› Working on Hierarchical Negotiations

Thank you!