The Condor Story
(& why it is worth developing the plot further)

Miron Livny
Computer Sciences Department
University of Wisconsin-Madison
miron@cs.wisc.edu

Regardless of what we call IT (distributed computing, eScience, grid, cyberinfrastructure, …), IT is not easy!

www.cs.wisc.edu/condor

Therefore, if we want IT to happen, we MUST join forces and work together.

Working Together
› Each of us must be considered as both a consumer and a provider, and view others in the same way
› We have to know each other
› We have to trust each other
› We have to understand each other

The Condor Project (Established ‘85)
Distributed Computing research performed by a team of ~40 faculty, full-time staff and students who face software/middleware engineering challenges, are involved in national and international collaborations, interact with users in academia and industry, maintain and support a distributed production environment (more than 2300 CPUs at UW), and educate and train students.
Funding (~$4.5M annual budget): DoE, NASA, NIH, NSF, EU, INTEL, Micron, Microsoft and the UW Graduate School

Support

“ … Since the early days of mankind the primary motivation for the establishment of communities has been the idea that by being part of an organized group the capabilities of an individual are improved. The great progress in the area of inter-computer communication led to the development of means by which stand-alone processing subsystems can be integrated into multicomputer ‘communities’. … “
Miron Livny, “Study of Load Balancing Algorithms for Decentralized Distributed Processing Systems,” Ph.D. thesis, July 1983.
Claims for “benefits” provided by Distributed Processing Systems
› High Availability and Reliability
› High System Performance
› Ease of Modular and Incremental Growth
› Automatic Load and Resource Sharing
› Good Response to Temporary Overloads
› Easy Expansion in Capacity and/or Function
“What is a Distributed Data Processing System?”, P.H. Enslow, Computer, January 1978

Benefits to Science
› Democratization of Computing – “you do not have to be a SUPER person to do SUPER computing.” (accessibility)
› Speculative Science – “Since the resources are there, let’s run it and see what we get.” (unbounded computing power)
› Function shipping – “Find the image that has a red car in this 3 TB collection.” (computational mobility)

High Throughput Computing
For many experimental scientists, scientific progress and quality of research are strongly linked to computing throughput. In other words, they are less concerned about instantaneous computing power. Instead, what matters to them is the amount of computing they can harness over a month or a year; they measure computing power in units of scenarios per day, wind patterns per week, instruction sets per month, or crystal configurations per year.

High Throughput Computing is a 24-7-365 activity:
FLOPY = (60*60*24*7*52) * FLOPS

Every community needs a Matchmaker!*
* or a Classified section in the newspaper, or an eBay.

We use Matchmakers to build Computing Communities out of Commodity Components.

The ’94 Worldwide Condor Flock
[Map: flocked Condor pools in Madison, Delft, Amsterdam, Geneva, Warsaw, and Dubna/Berlin, with pool sizes ranging from 3 to 200 machines]

The Grid: Blueprint for a New Computing Infrastructure
Edited by Ian Foster and Carl Kesselman, July 1998, 701 pages.
The grid promises to fundamentally change the way we think about and use computing.
This infrastructure will connect multiple regional and national computational grids, creating a universal source of pervasive and dependable computing power that supports dramatically new classes of applications. The Grid provides a clear vision of what computational grids are, why we need them, who will use them, and how they will be programmed.

“ … We claim that these mechanisms, although originally developed in the context of a cluster of workstations, are also applicable to computational grids. In addition to the required flexibility of services in these grids, a very important concern is that the system be robust enough to run in “production mode” continuously even in the face of component failures. … “
Miron Livny & Rajesh Raman, "High Throughput Resource Management", in The Grid: Blueprint for a New Computing Infrastructure.

“ … Grid computing is a partnership between clients and servers. Grid clients have more responsibilities than traditional clients, and must be equipped with powerful mechanisms for dealing with and recovering from failures, whether they occur in the context of remote execution, work management, or data output. When clients are powerful, servers must accommodate them by using careful protocols. … “
Douglas Thain & Miron Livny, "Building Reliable Clients and Servers", in The Grid: Blueprint for a New Computing Infrastructure, 2nd edition.

Client – Server
Master – Worker

Being a Master
Customer “deposits” task(s) with the master, which is responsible for:
› Obtaining resources and/or workers
› Deploying and managing workers on obtained resources
› Assigning and delivering work units to obtained/deployed workers
› Receiving and processing results
› Notifying the customer
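The master’s responsibilities listed above can be sketched in miniature. The Python sketch below stands in for Condor’s machinery with a local process pool; all names are illustrative, and a real Condor master obtains remote resources rather than local processes:

```python
from multiprocessing import Pool

def work_unit(x):
    # A worker processes one delivered work unit.
    return x * x

def master(tasks):
    # Obtain workers (here: a local process pool), then assign
    # and deliver work units, then receive the results.
    with Pool(processes=4) as pool:
        results = pool.map(work_unit, tasks)
    # Notify the customer with the processed result.
    return sum(results)

if __name__ == "__main__":
    print(master(range(10)))  # prints 285
```

The point of the pattern is that the customer only deposits tasks and receives an answer; acquiring, managing and recovering workers is entirely the master’s problem.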
MW – our answer to High Throughput Computing on commodity resources

The Layers of Condor
[Diagram: on the submit (client) side, an Application talks to an Application Agent, which talks to the Customer Agent (schedD); on the execute (service) side, the Owner Agent (startD) drives a Remote Execution Agent, a Local Resource Manager, and the Resource itself; a Matchmaker connects the Customer Agent with Owner Agents]

[Diagram: a local PSE or user submits Condor jobs (C-app) and Grid jobs (G-app) through Condor and Condor-G (schedD); jobs reach remote resources via Grid tools, flocking, and glide-in onto PBS, LSF, and Condor pools]

Cycle Delivery at the Madison campus

Yearly Condor usage at UW-CS
[Chart: yearly usage, axis up to 10,000,000]

Yearly Condor CPUs at UW
[Chart]

(inter)national science

U.S. “Trillium” Grid Partnership
Trillium = PPDG + GriPhyN + iVDGL
› Particle Physics Data Grid (PPDG): $12M (DOE), 1999 – 2004+
› GriPhyN: $12M (NSF), 2000 – 2005
› iVDGL: $14M (NSF), 2001 – 2006
Basic composition (~150 people)
› PPDG: 4 universities, 6 labs
› GriPhyN: 12 universities, SDSC, 3 labs
› iVDGL: 18 universities, SDSC, 4 labs, foreign partners
› Expts: BaBar, D0, STAR, Jlab, CMS, ATLAS, LIGO, SDSS/NVO
Complementarity of projects
› GriPhyN: CS research, Virtual Data Toolkit (VDT) development
› PPDG: “End to end” Grid services, monitoring, analysis
› iVDGL: Grid laboratory deployment using VDT
› Experiments provide frontier challenges
› Unified entity when collaborating internationally

Grid2003: An Operational National Grid
› 28 sites: universities + national labs (including Korea)
› 2800 CPUs, 400–1300 jobs
› Running since October 2003
› Applications in HEP, LIGO, SDSS, Genomics
http://www.ivdgl.org/grid2003

Contributions to Grid3
› Condor-G – “your window to Grid3 resources”
› GRAM 1.5 + GASS Cache
› Directed Acyclic Graph Manager (DAGMan)
› Packaging, Distribution and Support of the Virtual Data Toolkit (VDT)
› Trouble Shooting
› Technical road-map/blueprint
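As a concrete illustration of Condor-G as a “window” to grid resources, a submit description of roughly the form used in that era might look like the sketch below. The gatekeeper host and file names are invented, and the exact keywords varied across Condor versions:

```
# Hypothetical Condor-G submit file: run one job on a remote
# Globus GRAM gatekeeper through the globus universe.
universe        = globus
globusscheduler = gatekeeper.example.edu/jobmanager-pbs
executable      = analyze
arguments       = run42.dat
output          = run42.out
error           = run42.err
log             = run42.log
queue
```

From the user’s point of view this is an ordinary Condor submission; Condor-G handles the GRAM protocol, credential delegation, and recovery from remote failures.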
Contributions to EDG/EGEE
› Condor-G …
› DAGMan …
› VDT …
› Design of gLite …
› Testbed …

VDT Growth
[Chart: number of components in the VDT, Jan 2002 – Mar 2004, rising from about 5 to more than 20; milestones: VDT 1.0 (Globus 2.0b, Condor 6.3.1); VDT 1.1.3, 1.1.4 & 1.1.5 (pre-SC 2002); VDT 1.1.7 (switch to Globus 2.2); VDT 1.1.8 (first real use by LCG); VDT 1.1.11 (Grid2003)]

The Build Process (hope to use NMI processes soon)
[Diagram: NMI sources (CVS) and contributor binaries (VDS, etc.) flow into a build & test Condor pool (~40 computers); builds are tested, patched where needed, and packaged as RPMs, GPT source bundles, and Pacman caches]

Tools in the VDT 1.2.0 (three slides highlighting, in turn, the components built by NMI, by contributors, and by the VDT itself)
› Condor Group: Condor/Condor-G, Fault Tolerant Shell, ClassAds
› Globus Alliance: Job submission (GRAM), Information service (MDS), Data transfer (GridFTP), Replica Location (RLS)
› EDG & LCG: Make Gridmap, Certificate Revocation List Updater, Glue Schema/Info provider
› ISI & UC: Chimera & Pegasus
› NCSA: MyProxy, GSI OpenSSH, UberFTP
› LBL: PyGlobus, Netlogger
› Caltech: MonaLisa
› VDT: VDT System Profiler, Configuration software
› Others: KX509 (U. Mich.), DRM 1.2, Java, FBSng job manager

Health

Condor at Noregon
At 10:14 AM 7/15/2004 -0400, xxx wrote:
>Dr. Livny:
>I wanted to update you on our progress with our grid computing
>project. We have about 300 nodes deployed presently with the ability to
>deploy up to 6,000 total nodes whenever we are ready. The project has
>been getting attention in the local press and has gained the full support
>of… the public school system and generated a lot of excitement in the
>business community.

Noregon has entered into a partnership with Targacept Inc. to develop a system to efficiently perform molecular dynamics simulations. Targacept is a privately held pharmaceutical company located in Winston-Salem’s Triad Research Park whose efforts are focused on creating drug therapies for psychiatric, neurological, and gastrointestinal diseases.

Using the Condor® grid middleware, Noregon is designing and implementing an ensemble Car-Parrinello simulation tool for Targacept that will allow a simulation to be distributed across a large grid of inexpensive Windows® PCs. Simulations can be completed in a fraction of the time without the use of high performance (expensive) hardware.

Electronics

Condor at Micron
› 8000+ processors in 11 “pools”
› Linux, Solaris, Windows
› <50th Top500 rank, 3+ TeraFLOPS
› Centralized governance, distributed management
› 16+ applications, self developed
Slides used by UWCS with permission of Micron Technology, Inc.

Micron’s Global Grid

Condor at Micron: The Chief Officer value proposition
■ Info Week 2004 IT Survey includes Grid questions!
– Makes our CIO look good by letting him answer yes
– Micron’s 2003 rank: 23rd
■ Without Condor we only get about 25% of PC value today
– Didn’t tell our CFO a $1000 PC really costs $4000!
– Doubling utilization to 50% doubles the CFO’s return on capital
– Micron’s goal: 66% monthly average utilization
■ Providing a personal supercomputer to every engineer
– The CTO appreciates the cool factor
– The CTO really “gets it” when his engineers say: “I don’t know how I would have done that without the Grid”
Slides used by UWCS with permission of Micron Technology, Inc.

Condor at Micron: Example Value
73,606 job hours / 24 / 30 ≈ 102 Solaris boxes
102.2 boxes * $10,000/box ≈ $1,022,306
And that’s just for one application, not considering decreased development time, increased uptime, etc.
Chances are, if you have Micron memory in your PC, it was processed by Condor!
Slides used by UWCS with permission of Micron Technology, Inc.

Software Engineering

Condor at Oracle
Condor is used within Oracle’s Automated Integration Management Environment (AIME) to perform automated build and regression testing of multiple components for Oracle’s flagship Database Server product. Each day, nearly 1,000 developers make contributions to the code base of Oracle Database Server. The compilation alone of these software modules would take over 11 hours on a capable workstation. But in addition to building, AIME must control repository labelling/tagging, configuration publishing, and last but certainly not least, regression testing. Oracle is very serious about the stability and correctness of its products. Therefore, the AIME daily regression test suite currently covers 90,000 testable items divided into over 700 test packages. The entire process must complete within 12 hours to keep development moving forward. About five years ago, Oracle selected Condor as the resource manager underneath AIME because they liked the maturity of Condor’s core components.
In total, ~3000 CPUs at Oracle are managed by Condor today.

GRIDS Center: Enabling Collaborative Science
Grid Research Integration Development & Support
The GRIDS Center, part of the NSF Middleware Initiative
www.grids-center.org

Procedures, Tools and Facilities
• Build – generate executable versions of a component
• Package – integrate executables into a distribution
• Test – verify the functionality of:
  – a component
  – a set of components
  – a distribution

Build
• Reproducibility – “build the version we released 2 years ago!”
  – Well managed source repository
  – Know your “externals” and keep them around
• Portability – “build the component on build17.nmi.wisc.edu!”
  – No dependencies on “local” capabilities
  – Understand your hardware requirements
• Manageability – “run the build daily and email me the outcome”

[Diagram of the build cycle: fetch component → move source files to build site → build component → retrieve executables from build site → report outcome and clean up]

Goals of the Build Facility
• Design, develop and deploy a build system (HW and software) capable of performing daily builds of a suite of middleware packages on a heterogeneous (HW, OS, libraries, …) collection of platforms:
  – Dependable
  – Traceable
  – Manageable
  – Portable
  – Extensible
  – Schedulable

Using our own technologies
• Using GRIDS technologies to automate the build, deploy, and test cycle:
  – Condor: schedule build and testing tasks
  – DAGMan: manage build and testing workflow
  – GridFTP: copy/move files
  – GSI-OpenSSH: remote login, start/stop services, etc.
• Constructed and manage a dedicated heterogeneous and distributed facility
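The fetch/build/test/report cycle that DAGMan manages could be described by an input file along these lines. The job names and submit-file names are illustrative, not taken from the actual NMI setup:

```
# Hypothetical DAGMan input file for one daily build-and-test run;
# each JOB line names a Condor submit file for that stage.
JOB fetch  fetch.sub
JOB build  build.sub
JOB test   test.sub
JOB report report.sub
# Fetch sources, build, run the test harness, then report and clean up.
PARENT fetch CHILD build
PARENT build CHILD test
PARENT test  CHILD report
```

Because DAGMan only releases a node when its parents succeed, a failed build on one platform stops that platform’s test and report stages without disturbing the rest of the nightly run.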
NMI Build facility
[Diagram: a Build Generator and Build Manager drive the build resources, with a Web interface, a database, and ordered email reports]
Build resources:
• nmi-aix.cs.wisc.edu
• nmi-hpux.cs.wisc.edu
• nmi-irix.cs.wisc.edu
• nmi-rh72-alpha.cs.wisc.edu
• nmi-redhat72-ia64.cs.wisc.edu
• nmi-sles8-ia64.cs.wisc.edu
• nmi-redhat72-build.cs.wisc.edu
• nmi-redhat72-dev.cs.wisc.edu
• nmi-redhat80-ia32.cs.wisc.edu
• nmi-redhat9-ia32.cs.wisc.edu
• (rh9 x86) nmi-test-1.cs.wisc.edu
• (production system, rh73 x86) vger.cs.wisc.edu
• nmi-dux40f.cs.wisc.edu
• nmi-tru64.cs.wisc.edu
• nmi-macosx.local
• nmi-solaris6.cs.wisc.edu
• nmi-solaris7.cs.wisc.edu
• nmi-solaris8.cs.wisc.edu
• nmi-solaris9.cs.wisc.edu
• (rh73 x86) nmi-test-3
• (rh73 x86) nmi-test-4
• (rh73 x86) nmi-test-5
• (rh73 x86) nmi-test-6
• (rh73 x86) nmi-test-7
• (rh9 x86) nmi-build1.cs.wisc.edu
• (rh8 x86) nmi-build2.cs.wisc.edu
• (rh73 x86) nmi-build3.cs.wisc.edu
• (windows build system) nmi-build4.cs.wisc.edu
• (rh72 x86) wopr.cs.wisc.edu
• (rh72 x86) bigmac.cs.wisc.edu
• (rh73 x86) monster.cs.wisc.edu
• (new vger / vmware system) grandcentral.cs.wisc.edu
• new: nmi-linuxas-ia64 (big dual processors)
• new: nmi-linuxas-opteron-tst
• new: nmi-linuxas-opteron-bld
About to be requested:
1. new nmi-linuxas-ia32 (big dual procs)
2. new nmi-solaris-2-8
3. new nmi-solaris-2-9

The VDT operation
[Diagram: NMI sources (CVS) and contributor binaries (VDS, etc.) flow into the VDT build & test Condor pool (37 computers); builds are tested, patched where needed, and packaged as RPMs, GPT source bundles, and Pacman caches]
Test
• Reproducibility – “run last year’s test harness on last week’s build!”
  – Separation between build and test processes
  – Well managed repository of test harnesses
  – Know your “externals” and keep them around
• Portability – “run the test harness of component A on test17.nmi.wisc.edu!”
  – Automatic install and de-install of the component
  – No dependencies on “local” capabilities
  – Understand your hardware requirements
• Manageability – “run the test suite daily and email me the outcome”

Testing Tools
• Current focus on component testing
• Developed scripts and procedures to verify deployment and very basic operations
• Multi-component, multi-version, multi-platform test harness and procedures
• Testing as a “bottom feeder” activity
• Short and long term testing cycles

Movies

C.O.R.E. Digital Pictures
There has been a lot of buzz in the industry about something big going on here at C.O.R.E. We’re really, really, really pleased to make the following announcement: Yes, it’s true. C.O.R.E. Digital Pictures has spawned a new division: C.O.R.E. Feature Animation. We’re in production on a CG animated feature film being directed by Steve “Spaz” Williams. The script is penned by the same writers who brought you There’s Something About Mary, Ed Decter and John Strauss.

How can we accommodate an unbounded need for computing with an unbounded amount of resources?