Condor at Cardiff
Dr James Osborne

Contents
• What is Condor
• Condor at Cardiff
• Condor Users at Cardiff
• Green Computing at Cardiff
• Advanced Research Computing at Cardiff
• Virtualization
• Patterns

What is Condor
• Condor is the name for two species of New World vultures, each in a monotypic genus
– They are the largest flying land birds in the Western Hemisphere

What is Condor
• A specialised workload management system for compute-intensive jobs
• Users submit their jobs to Condor
– Condor places them into a queue
– Condor chooses where and when to run them
– Condor carefully monitors their progress
– Condor informs the user upon completion
http://www.cs.wisc.edu/condor/

Condor at Cardiff - Pilot
• The Condor pool began as a pilot service in April 2004, led by Dr Hugh Beedie, CTO of Information Services, in conjunction with staff at the Welsh e-Science Centre
– First user from the School of Business
– A solution looking for problems…

Condor at Cardiff - Production
• The Condor pool transitioned to a production service in January 2006 with the appointment of Dr James Osborne as project manager
– Latest user from the School of Psychology
– Doubled the size of the pool, tripled the number of users
– Distributed using Novell Zenworks
– Common condor_config files: EA, EI, S, SEA
– Injected condor_config_local variables: IS_OWNED_BY, IS_EXECUTE_ALWAYS, RANK

[Diagram: Central Manager (master, collector, negotiator); Submit Nodes, 30 workstations (master, schedd, shadow); Execute Nodes, 1600 workstations (master, startd, starter)]

Condor Users at Cardiff
• User in a computing context refers to one who uses a computer system
– Users may need to identify themselves for the purposes of accounting, security, logging, and resource management
– Users are also widely characterised as the class of people who use a system without the complete technical expertise required to fully understand it

Growth of User Base
[Chart: number of users per quarter, Q1-06 through Q4-07]
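The condor_config_local variables mentioned above (IS_OWNED_BY, IS_EXECUTE_ALWAYS, RANK) are site-local policy knobs injected per machine. A minimal sketch of what such an injected file might contain; the values and the START expression are illustrative assumptions, not Cardiff's actual policy:

```
# Hypothetical condor_config_local fragment -- values are illustrative
IS_OWNED_BY = "school-of-business"
IS_EXECUTE_ALWAYS = False

# Run jobs only when the machine is flagged execute-always
# or the console has been idle for 15 minutes
START = $(IS_EXECUTE_ALWAYS) || (KeyboardIdle > 15 * $(MINUTE))

# Prefer jobs from the owning school
# (the AccountingGroup job attribute here is an assumption)
RANK = TARGET.AccountingGroup =?= $(IS_OWNED_BY)
```

Injecting only these few macros while distributing common condor_config files keeps per-machine policy differences small and auditable.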
Diversity of User Base
• Architecture 1
• Biosciences 9
• Business 1
• Computer Sci 6
• Engineering 3
• Epidemiology 2
• History Arch 2
• Mathematics 2
• Optometry 2
• Physics 2
• Psychology 1
• Social Sci 1
Total: 32

Diversity of Applications
• Applications: Blast, Damfilt, Dammin, Energyplus, Gasbor, Grinder, Lea, Leadmix, Matlab, Msvar, Oxcal, Perl, Pest, R, Sienna, Structure
• Problem areas: Econometric Modelling, Fluid Dynamics, Fourier Analysis, Geological Modelling, Image Processing, Radiation Transport, Travelling Salesman, WIFI Roaming

Structural Biophysics Group - Donna Lammie (OPTOM)
• X-Ray Diffraction
• Determine shape of molecules
• Time on a single workstation = 2-3 Days
• Time on the Condor pool = 2-3 Hours
• Speed-up factor of 2000%

[Figure: fibrillin-1 particle reconstructions, panels PF2-PF12 viewed at 90°]
C. Baldock et al. Nanostructure of Fibrillin-1 Reveals Compact Conformation of EGF Arrays and Mechanism for Extensibility. Proceedings of the National Academy of Sciences of the United States of America, 103(32):11922-11927, August 2006.
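The quoted speed-up factors are easy to sanity-check with a little arithmetic. A small sketch, using assumed midpoint run times from the ranges on this slide and on the radiotherapy case study that follows (and a 90-day approximation for 3 months):

```python
# Sanity-check the quoted speed-up factors using midpoint run times.
# The midpoints and the 90-day month approximation are assumptions.

def speedup_percent(workstation_hours: float, pool_hours: float) -> float:
    """Speed-up expressed as a percentage, e.g. 20x -> 2000%."""
    return workstation_hours / pool_hours * 100

# X-ray diffraction: 2-3 days on one workstation vs 2-3 hours on the pool
lammie = speedup_percent(workstation_hours=2.5 * 24, pool_hours=2.5)
print(f"X-ray diffraction: {lammie:.0f}%")   # 2400%, near the quoted 2000%

# Radiotherapy dose calculation: ~3 months vs 36 hours
downes = speedup_percent(workstation_hours=90 * 24, pool_hours=36)
print(f"Dose calculation: {downes:.0f}%")    # 6000%, matching the slide
```

The figures on the slides are of the right order: a pool of this size turns a run of days or months into one of hours.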
Research Assistant - Patrick Downes
• Velindre Cancer Centre
• Monte Carlo simulation
• Radiotherapy dose calculation
• Time on a single workstation = 3 Months
• Time on the Condor pool = 36 Hours
• Speed-up of 6000%

Green Computing at Cardiff
• Green Computing is the study and practice of using computing resources efficiently
– Typically, technological systems or computing products that incorporate green computing principles take into account the so-called triple bottom line of economic viability, social responsibility, and environmental impact

Power Consumption (based on a P4 3GHz PC with 512MB RAM)
[Chart: watts consumed by machine state - Off 0W, Hibernate 0W, Standby 5W, Idle 100W, Office 112W, Condor 150W]

Watts Up Pro
• Measures: Watts, Volts, Amps, Watt Hrs, Cost, Avg Kwh, Mo Cost, Max Wts, Max Vlt, Max Amp, Min Wts, Min Vlt, Min Amp, Pwr Fct, Dty Cyc, Pwr Cyc
• Frequency: 1 second
• Duration: 15 minutes

Economic Viability (based on a P4 3GHz PC with 512MB RAM)
• Makes sound financial sense
• Hibernate saves £60 per year
• Condor = £30 per year (max)
• Dedicated = £150 per year
• Condor is 5 times cheaper

Saving of Hibernate = cost of 100W electricity (idle state) for 16 hours out of 24
Cost of Condor = cost of 150W electricity (Condor state) - cost of 100W electricity (idle state)
Cost of Dedicated = cost of 150W electricity (Condor state) + cost of 100W electricity (air con)

Environmental Impact (based on a P4 3GHz PC with 512MB RAM)
• Makes sound environmental sense
• Hibernate saves 650Kg CO2 per year
• Condor = 325Kg CO2 per year (max)
• Dedicated = 1,625Kg CO2 per year
• Condor is 5 times greener

Across Campus (based on 10,000 P4 3GHz PCs with 512MB RAM)
• Makes
sound financial sense
– Hibernate would save £600,000 per year (hibernating 16 out of 24 hours)
• Makes sound environmental sense
– Hibernate would save 6,500T CO2 per year
– Rainforest required to offset = 52Km2, about 40% of the area of Cardiff

Cardiff's Condor Pool
• ...is the equivalent of a £500,000 supercomputer
– ...costs £50,000 in equipment, power, and staff
– ...improves return on investment
• ...is one of the largest pools in the UK
– ...and we plan to expand the pool
• ...is probably the most utilised pool in the UK
– ...by a factor of 10
• ...has more users than any other pool in the UK
– ...and we are working hard to keep it that way
Nobody corrected me at the 1st Campus Grids SIG in Oxford
Nobody corrected me at the 21st Open Grid Forum in Manchester

The ARC Spectrum
• HPC (tightly coupled): Supercomputers, NUMA Machines, Large Clusters
• HTC (loosely coupled): SMP, Small Clusters, Campus Grids
• Costs range from £Million+ for supercomputers down to £Thousands for small clusters and campus grids

The ARC Division
• ARCCA will provide, co-ordinate, support, and develop advanced research computing services for researchers at Cardiff University
• ARCCA will also work with clients and partners outside the University through a range of outreach activities
• ARCCA is staffed with experts in the field who are already available to help and support your research needs through a range of services
• ARCCA is procuring a range of dedicated high-end computing equipment which is planned to be fully operational by early 2008

The ARC Organisation
• Prof Martyn Guest – Director of ARC
• Dr Christine Kitchen – Manager of ARC
• Dr James Osborne – Applications
• Mr Huw Lynes – Infrastructure
• Ms Liz Fitzgerald – Admin Officer
• Another – Programmer

Prof Martyn Guest
• 2007 – Director of
Advanced Research Computing at Cardiff
• 1995 – Associate Director of Computational Science and Engineering at Daresbury
• 1971 – PhD Theoretical Chemistry
• 1967 – BSc Chemistry

The ARC Cluster
• 256 x Compute Nodes (Cluster)
– Dual Socket Quad Core Intel Xeon E5472 3.0GHz
– 16 GB of Memory
– ConnectX InfiniBand + Dual GigE
• 4 x Compute Nodes (SMP)
– Quad Socket Quad Core Intel Xeon X7350 2.93GHz
– 32 GB of Memory, 1 TB of Local Disk (RAID5)
– ConnectX InfiniBand + Dual GigE + Resilient PSU

The ARC Cluster
• 4 x Login Nodes
– Dual Socket Quad Core Intel Xeon E5472 3.0GHz
– 32 GB of Memory, 0.5 TB of Local Disk (RAID1)
– ConnectX InfiniBand + Dual GigE + Resilient PSU
• 2 x Storage Nodes
– Dual Socket Quad Core Intel Xeon E5472 3.0GHz
– 32 GB of Memory + Resilient PSU
– ConnectX InfiniBand + Dual GigE + Fibre Channel

Virtualization
• Virtualization is a broad term that refers to the abstraction of computer resources
– This includes making a single physical resource appear to function as multiple logical resources
– Or it can include making multiple physical resources appear as a single logical resource

Central Manager Utilisation (based on 6 months of monitoring)
• CPU: 16.60% average, 65.99% max
– Single Socket Single Core Intel Xeon 2.4GHz
• RAM: 1.09 GB average, 1.60 GB max
– 55.00% and 80.00% of current capacity (2 GB)
• Disk: 1.25 GB average, 1.50 GB max
– 1.71% and 2.05% of current capacity (73 GB)

Central Manager Utilisation (based on 6 months of monitoring)
• Net In: 29.66 Kbps average, 45.39 Kbps max
– 0.02% and 0.04% of current capacity (Gigabit)
• Net Out: 39.13 Kbps average, 86.17 Kbps max
– 0.03% and 0.07% of current capacity (Gigabit)

Central Manager Virtualization
• 1 x Condor Server
– Dual Socket Quad Core Intel Xeon E5472 3.0GHz
– 32 GB of Memory, 0.5 TB of Local Disk (RAID1)
– Dual GigE + Resilient PSU
• = 4 x Virtual Central Managers?
• = 2 x Virtual Submit Nodes?
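Whether the proposed server could host four virtual central managers can be estimated from the monitoring figures above. A rough sketch, pessimistically assuming all peaks coincide and ignoring the per-core speed difference between the old 2.4GHz Xeon and the new E5472:

```python
# Estimate whether N virtual central managers fit on the proposed
# Condor server, using the peak figures from 6 months of monitoring.
# Treating all peaks as coincident is deliberately pessimistic.

HOST = {"cores": 8, "ram_gb": 32.0, "disk_gb": 500.0, "net_mbps": 2000.0}

# Peak demand of one central manager (figures from the slides).
CM_PEAK = {
    "cpu_cores": 0.66,                      # 65.99% of one core
    "ram_gb": 1.60,
    "disk_gb": 1.50,
    "net_mbps": (45.39 + 86.17) / 1000.0,   # in + out, Kbps -> Mbps
}

def fits(n_vms: int) -> bool:
    """True if n_vms central managers fit within every host resource."""
    return (n_vms * CM_PEAK["cpu_cores"] <= HOST["cores"]
            and n_vms * CM_PEAK["ram_gb"] <= HOST["ram_gb"]
            and n_vms * CM_PEAK["disk_gb"] <= HOST["disk_gb"]
            and n_vms * CM_PEAK["net_mbps"] <= HOST["net_mbps"])

print(fits(4))  # True: 4 VMs need only ~2.6 cores, 6.4 GB RAM, 6 GB disk
```

Even under this worst-case assumption, four central managers use a small fraction of the host, which supports the idea of co-locating the two virtual submit nodes as well.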
Design Patterns
• A Design Pattern is a general repeatable solution to a commonly occurring problem in software design
– A design pattern is not a finished design that can be transformed directly into code
– It is a description or template for how to solve a problem that can be used in many different situations

Questions
condor@cardiff.ac.uk
http://www.cardiff.ac.uk/arcca/
http://www.cs.wisc.edu/condor/