HTC in Research & Education
Miron Livny
Computer Sciences Department
University of Wisconsin-Madison
miron@cs.wisc.edu

Claims for “benefits” provided by Distributed Processing Systems
• High Availability and Reliability
• High System Performance
• Ease of Modular and Incremental Growth
• Automatic Load and Resource Sharing
• Good Response to Temporary Overloads
• Easy Expansion in Capacity and/or Function
“What is a Distributed Data Processing System?”, P.H. Enslow, Computer, January 1978

Democratization of Computing: You do not need to be a super-person to do super-computing.

[Workflow diagram: Searching for small RNA candidates in a kingdom – 45 CPU days. Starting from NCBI FTP genome files (.ffn, .fna, .ptt, .gbk), IGRExtract3 produces intergenic regions (IGRs); RNAMotif, FindTerm and TransTerm supply terminators, and BLAST against known sRNAs and riboswitches supplies conservation; sRNAPredict combines this evidence into candidate loci and IGR sequences of candidates; Patser (with TFBS matrices), QRNA (secondary-structure conservation), BLAST (homology, paralogy) and FFN_parse with BLAST on ORFs flanking candidates and known sRNAs (synteny) feed sRNA_Annotate, which outputs annotated candidate sRNA-encoding genes.]

Education and Training
› Computer Science – develop and implement novel HTC technologies (horizontal)
› Domain Sciences – develop and implement end-to-end HTC capabilities that are fully integrated in the scientific discovery process (vertical)
› Experimental methods – develop and implement a curriculum that harnesses HTC capabilities to teach how to use modeling and numerical data to answer scientific questions
› System Management – develop and implement a curriculum that uses HTC resources to teach how to build, deploy, maintain and operate distributed systems

“As we look to hire new graduates, both at the undergraduate and graduate levels, we find that in most cases people are coming in with a good, solid core computer science traditional education ... but not a great, broad-based education in all the kinds of computing that are near and dear to our business.”
Ron Brachman, Vice President of Worldwide Research Operations, Yahoo!

Yahoo! Inc., a leading global Internet company, today announced that it will be the first in the industry to launch an open source program aimed at advancing the research and development of systems software for distributed computing. Yahoo’s program is intended to leverage its leadership in Hadoop, an open source distributed computing sub-project of the Apache Software Foundation, to enable researchers to modify and evaluate the systems software running on a 4,000-processor supercomputer provided by Yahoo. Unlike other companies and traditional supercomputing centers, which focus on providing users with computers for running applications and for coursework, Yahoo’s program focuses on pushing the boundaries of large-scale systems software research.

1986-2006: Celebrating 20 years since we first installed Condor in our CS department

Integrating Linux Technology with Condor
Kim van der Riet, Principal Software Engineer

What will Red Hat be doing? Red Hat will be investing in the Condor project locally in Madison, WI, in addition to driving work required in upstream and related projects. This work will include:
• Engineering on Condor features & infrastructure
  – Should result in tighter integration with related technologies
  – Tighter kernel integration
  – Information transfer between the Condor team and Red Hat engineers working on things like Messaging, Virtualization, etc.
• Creating and packaging Condor components for Linux distributions
  – Support for Condor packaged in RH distributions
All work goes back to upstream communities, so this partnership will benefit all.
Shameless plug: if you want to be involved, Red Hat is hiring...

IBM Systems and Technology Group
High Throughput Computing on Blue Gene
IBM Rochester: Amanda Peters, Tom Budnik
With contributions from:
IBM Rochester: Mike Mundy, Greg Stewart, Pat McCarthy
IBM Watson Research: Alan King, Jim Sexton
UW-Madison Condor: Greg Thain, Miron Livny, Todd Tannenbaum

Condor and IBM Blue Gene Collaboration
Both the IBM and Condor teams are engaged in adapting code to bring the Condor and Blue Gene technologies together.
Initial collaboration (Blue Gene/L)
– Prototype/research Condor running HTC workloads on Blue Gene/L
  • Condor-developed dispatcher/launcher running HTC jobs
  • Prototype work for Condor being performed on the Rochester On-Demand Center Blue Gene system
Mid-term collaboration (Blue Gene/L)
– Condor supports HPC workloads along with HTC workloads on Blue Gene/L
Long-term collaboration (Next Generation Blue Gene)
– I/O Node exploitation with Condor
– Partner in the design of HTC services for Next Generation Blue Gene
  • Standardized launcher, boot/allocation services, job submission/tracking via database, etc.
– Study ways to automatically switch between HTC/HPC workloads on a partition
– Data persistence (persisting data in memory across executables)
  • Data affinity scheduling
– Petascale environment issues

The Grid: Blueprint for a New Computing Infrastructure
Edited by Ian Foster and Carl Kesselman, July 1998, 701 pages.
The grid promises to fundamentally change the way we think about and use computing. This infrastructure will connect multiple regional and national computational grids, creating a universal source of pervasive and dependable computing power that supports dramatically new classes of applications. The Grid provides a clear vision of what computational grids are, why we need them, who will use them, and how they will be programmed.

“… We claim that these mechanisms, although originally developed in the context of a cluster of workstations, are also applicable to computational grids. In addition to the required flexibility of services in these grids, a very important concern is that the system be robust enough to run in “production mode” continuously even in the face of component failures. …”
Miron Livny & Rajesh Raman, "High Throughput Resource Management", in “The Grid: Blueprint for a New Computing Infrastructure”.

The search for SUSY*
Sanjay Padhi is a UW Chancellor Fellow who is working in the group of Prof. Sau Lan Wu at CERN (Geneva). Using Condor technologies he established a “grid access point” in his office at CERN. Through this access point he managed to harness, in 3 months (12/05-2/06), more than 500 CPU years from the LHC Computing Grid (LCG), the Open Science Grid (OSG), the Grid Laboratory of Wisconsin (GLOW), and local group-owned desktop resources.
*Super-Symmetry

High Throughput Computing
We first introduced the distinction between High Performance Computing (HPC) and High Throughput Computing (HTC) in a seminar at the NASA Goddard Flight Center in July of 1996, and a month later at the European Laboratory for Particle Physics (CERN). In June of 1997 HPCWire published an interview on High Throughput Computing.

Why HTC?
For many experimental scientists, scientific progress and quality of research are strongly linked to computing throughput. In other words, they are less concerned about instantaneous computing power. Instead, what matters to them is the amount of computing they can harness over a month or a year --- they measure computing power in units of scenarios per day, wind patterns per week, instruction sets per month, or crystal configurations per year.

High Throughput Computing is a 24-7-365 activity
FLOPY ≠ (60*60*24*7*52)*FLOPS
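To unpack the slogan above (a worked example with made-up numbers, not figures from the talk): 60*60*24*7*52 is just the number of seconds in a 52-week year, so the right-hand side is what a machine would deliver only if it sustained its peak floating point rate every second of the year. A sketch of the arithmetic in LaTeX:

% Seconds in a 52-week year
60 \times 60 \times 24 \times 7 \times 52 = 31{,}449{,}600 \approx 3.1 \times 10^{7}\ \text{s/year}
% A hypothetical machine with a peak rate of 100 GFLOPS (10^{11} FLOPS), available 100% of the time:
10^{11}\ \text{FLOPS} \times 3.1 \times 10^{7}\ \text{s/year} \approx 3.1 \times 10^{18}\ \text{FLOP/year}
% The same machine delivering cycles only 60% of the time (downtime, ownership policies, jobs waiting while resources sit idle):
0.6 \times 3.1 \times 10^{18} \approx 1.9 \times 10^{18}\ \text{FLOP/year}

The gap between the last two numbers is the point of the slogan: sustained yearly throughput, not peak FLOPS, is the quantity HTC tries to maximize.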
High Throughput Computing
Miron Livny
Computer Sciences
University of Wisconsin-Madison
miron@cs.wisc.edu

Customers of HTC
Most HTC applications follow the Master-Worker paradigm, where a group of workers executes a loosely coupled heap of tasks controlled by one or more masters.
• Job Level - tens to thousands of independent jobs
• Task Level - a parallel application (PVM, MPI-2) that consists of a small group of master processes and tens to hundreds of worker processes.
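To make the Master-Worker pattern above concrete, here is a minimal, self-contained sketch in Python (illustrative only; the task function, task count and worker count are invented for the example, and this is not code from Condor or from the talk): one master owns a heap of independent tasks, and a pool of workers drains it.

# Minimal Master-Worker sketch: one master, a heap of independent tasks,
# and a pool of workers that drain the heap.
import multiprocessing as mp

def run_task(task_id):
    """Stand-in for one independent job (e.g., one simulation scenario)."""
    return task_id, sum(i * i for i in range(10_000))  # placeholder work

def master(num_tasks=1000, num_workers=8):
    """The master owns the task heap and collects results as workers finish."""
    tasks = list(range(num_tasks))          # loosely coupled heap of tasks
    results = {}
    with mp.Pool(processes=num_workers) as pool:
        # imap_unordered hands out tasks as workers become free,
        # which is the throughput-oriented behaviour HTC cares about.
        for task_id, value in pool.imap_unordered(run_task, tasks):
            results[task_id] = value
    return results

if __name__ == "__main__":
    done = master()
    print(f"completed {len(done)} tasks")

In a real HTC environment the workers would be whole machines harvested opportunistically (for example by Condor) rather than local processes, but the control structure is the same: a master hands independent tasks to whichever worker is free and collects results as they complete.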
The Challenge
Turn large collections of existing, distributively owned computing resources into effective High Throughput Computing environments.
Minimize Wait while Idle.

Obstacles to HTC
• Ownership Distribution (Sociology)
• Size and Uncertainties (Robustness)
• Technology Evolution (Portability)
• Physical Distribution (Technology)

Sociology
Make owners (& system administrators) happy.
• Give owners full control over
  – when and by whom private resources are used for HTC
  – the impact of HTC on private Quality of Service
  – membership and information on HTC-related activities
• No changes to existing software, and make it easy to install, configure, monitor, and maintain
Happy owners → more resources → higher throughput

Sociology
Owners look for a verifiable contract with the HTC environment that spells out the rules of engagement. System administrators do not like weird distributed applications that have the potential of interfering with the happiness of their interactive users.

Robustness
To be effective, an HTC environment must run as a 24-7-365 operation.
• Customers count on it
• Debugging and fault isolation may be very time-consuming processes
• In a large distributed system, everything that might go wrong will go wrong
Robust system → less down time → higher throughput

Portability
To be effective, the HTC software must run on and support the latest and greatest hardware and software.
• Owners select hardware and software according to their needs and tradeoffs
• Customers expect it to be there
• Application developers expect only a few (if any) changes to their applications
Portability → more platforms → higher throughput

Technology
An HTC environment is a large, dynamic and evolving distributed system:
• Autonomous and heterogeneous resources
• Remote file access
• Authentication
• Local and wide-area networking

Robust and Portable Mechanisms Hold the Key to High Throughput Computing
Policies play only a secondary role in HTC.

Leads to a “bottom up” approach to building and operating distributed systems.

My jobs should run …
› … on my laptop if it is not connected to the network
› … on my group resources if my certificate expired
› … on my campus resources if the meta-scheduler is down
› … on my national resources if the trans-Atlantic link was cut by a submarine

The Open Science Grid (OSG)
Miron Livny – OSG PI & Facility Coordinator, Computer Sciences Department, University of Wisconsin-Madison
Supported by the Department of Energy Office of Science SciDAC-2 program from the High Energy Physics, Nuclear Physics and Advanced Software and Computing Research programs, and the National Science Foundation Math and Physical Sciences, Office of CyberInfrastructure and Office of International Science and Engineering Directorates.

The Evolution of the OSG
[Timeline figure, 1999-2009: GriPhyN (NSF), iVDGL (NSF), PPDG (DOE), the DOE Science Grid, Trillium, and Grid3 leading to the OSG (DOE+NSF), shown alongside LIGO preparation and operation, LHC construction, preparation and operations, the European Grid and Worldwide LHC Computing Grid, and campus and regional grids.]

The Open Science Grid vision
Transform processing and data intensive science through a cross-domain, self-managed, national distributed cyber-infrastructure that brings together campus and community infrastructure and facilitates the needs of Virtual Organizations (VOs) at all scales.

D0 Data Re-Processing
[Chart: OSG CPU hours per week over weeks 1-23 of 2007; 12 sites contributed up to 1,000 jobs/day. Site legend: CIT_CMS_T2, FNAL_DZEROOSG_2, FNAL_FERMIGRID, FNAL_GPFARM, GLOW, GRASE-CCR-U2, MIT_CMS, MWT2_IU, Nebraska, NERSC-PDSF, OSG_LIGO_PSU, OU_OSCER_ATLAS, OU_OSCER_CONDOR, Purdue-RCAC, SPRACE, UCSDT2, UFlorida-IHEPA, UFlorida-PG, USCMS-FNAL-WC1-CE.]
Totals: 2M CPU hours, 286M events, 286K jobs on OSG, 48TB input data, 22TB output data.

The Three Cornerstones
National, Campus, and Community need to be harmonized into a well integrated whole.

OSG challenges
• Develop the organizational and management structure of a consortium that drives such a Cyber Infrastructure
• Develop the organizational and management structure for the project that builds, operates and evolves such a Cyber Infrastructure
• Maintain and evolve a software stack capable of offering powerful and dependable capabilities that meet the science objectives of the NSF and DOE scientific communities
• Operate and evolve a dependable and well managed distributed facility

6,400 CPUs available. The campus Condor pool backfills idle nodes in PBS clusters – it provided 5.5 million CPU hours in 2006, all from idle nodes in clusters. Use on TeraGrid: 2.4 million hours in 2006 spent building a database of hypothetical zeolite structures; in 2007, 5.5 million hours allocated to TG.
http://www.cs.wisc.edu/condor/PCW2007/presentations/cheeseman_Purdue_Condor_Week_2007.ppt

Clemson Campus Condor Pool
• Machines in 27 different locations on campus
• ~1,700 job slots
• >1.8M hours served in 6 months
• Users from Industrial and Chemical Engineering, and Economics
• Fast ramp-up of usage
• Accessible to the OSG through a gateway

Grid Laboratory of Wisconsin (GLOW)
2003 initiative funded by NSF(MRI)/UW at $1.5M. Second phase funded in 2007 by NSF(MRI)/UW at $1.5M.
Six initial GLOW sites:
• Computational Genomics, Chemistry
• Amanda, IceCube, Physics/Space Science
• High Energy Physics/CMS, Physics
• Materials by Design, Chemical Engineering
• Radiation Therapy, Medical Physics
• Computer Science
Diverse users with different deadlines and usage patterns.
GLOW Usage, 4/04-11/08
[Pie chart of GLOW usage between 2004-01-31 and 2007-11-08: Atlas 20%, LMCG 18%, ChemE 18%, CMS 17%, IceCube 5%, MedPhysics 4%, CS 2%, CMPhysics 1%, MultiScalar 1%, Plasma Physics 1%, other 13%.]
Over 35M CPU hours served!

The next 20 years
We all came to this meeting because we believe in the value of HTC and are aware of the challenges we face in offering researchers and educators dependable HTC capabilities. We all agree that HTC is not just about technologies but is also very much about people – users, developers, administrators, accountants, operators, policy makers, …