N+N Meeting, October 3 2003

Exploiting Terascale Supercomputers: Experiences from HPCx
David Henty, d.henty@epcc.ed.ac.uk


Overview

HPCx
– the machine
– the consortium
Usability issues
– a brief summary
HPCx, EPCC and the Grid
– current activities
DEISA – Distributed Infrastructure for Supercomputing Applications
– an EC-funded pan-European Grid testbed proposal


What is HPCx?

Consortium of leading UK organisations committed to creating and managing the new UK HPC resource for the next 6 years
Multi-stage project to deliver a world-class academic computing resource, the largest in Europe, with an ultimate peak performance of 22 TFlop/s
£50M/$70M budget from EPSRC
Grid-enabled, a key component in the UK e-Science programme


The HPCx Consortium: Members

University of Edinburgh
Edinburgh Parallel Computing Centre
Central Laboratory of the Research Councils: Daresbury Laboratory
IBM


University of Edinburgh

Lead contractor of the HPCx Consortium
• International centre of academic excellence
• One of the largest and most successful research universities in the UK
• Partner in the National e-Science Centre (NeSC)


The HPCx Consortium: EPCC

Leading computer centre in Europe, bridging the gap between academia and industry
• Self-funding, in existence for over 10 years
• Provides both HPC and novel computing solutions to a wide range of problems and users
• Long experience of providing national HPC services, including:
– Meiko Computing Surfaces
– Thinking Machines CM200
– Cray T3D/T3E (1994 to 2001)


Daresbury Laboratory

A multi-disciplinary research lab with over 500 people
Provides large-scale research facilities for both UK academic and industrial research communities
Daresbury hosts and maintains the hardware for the HPCx system


IBM

• IBM will provide the technology for HPCx
• Long-standing involvement in HPC, including the development of a number of ASCI machines and 5 of the top 10 machines in the June 2002 TOP500 list:
– No. 2: ASCI White: Rmax = 7.2 TFlop/s
– No. 5: SP Power3 (3328 processors): Rmax = 3.0 TFlop/s
– No. 8: pSeries 690 (864 processors): Rmax = 2.3 TFlop/s
• IBM has the long-term technology road map essential to a 6-year project such as HPCx


HPCx Operational Phases

System will be commissioned in three main stages, with phase 1 covering 2002-2004
– phase 1: December 2002, performance > 3 TFlop/s Linpack
– phase 2: June 2004: 6 TFlop/s
– phase 3: June 2006: 12 TFlop/s
Focussed on capability jobs
– using at least 50% of the CPU resource
– target is for half of the jobs to be capability jobs


Usability Issues (i)

Note these are NOT specific to HPCx or IBM!
Batch systems
– not deeply integrated with the OS
– incompatibility between systems
– lack of useful information to the user
Real time limits
– seem to be completely alien to UNIX
– accounting and charging therefore done by hacks


Usability (ii)

Operating Systems
– written for multi-user, general-purpose systems
– desktop users work with the OS
– HPC users spend their whole lives fighting it
• we liked the Cray T3D because it DIDN’T HAVE an OS!
– modern OSs are far too relaxed and sloppy
• e.g. runaway processes just run and run at 100% CPU ...
• on almost all systems, a “Grim Reaper” must be run by hand! (a toy sketch appears after the final slide)
• e.g. I am running 128 processes ... is it a single MPI job? multiple MPI jobs? mixed MPI/OpenMP?
– spawn 100s of threads for tasks that aren’t needed for HPC
IO
– MPI-IO is nice but I don’t see people using it (a minimal sketch follows this slide)
– usually develop bespoke solutions which don’t port well
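To illustrate the MPI-IO point above, here is a minimal, hedged sketch (not HPCx-specific code): each rank writes its own block of a single shared file through a collective call, instead of using a bespoke one-file-per-process scheme. The file name, block size and choice of MPI_File_write_at_all are arbitrary illustrative choices.

/* Minimal MPI-IO sketch: every rank writes its own block of one shared
 * file via a collective call.  File name and block size are arbitrary. */
#include <mpi.h>

#define NLOCAL 1024                      /* doubles per process (arbitrary) */

int main(int argc, char **argv)
{
    int rank, i;
    double buf[NLOCAL];
    MPI_File fh;
    MPI_Offset offset;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < NLOCAL; i++) buf[i] = (double) rank;

    /* one shared file rather than a bespoke file per process */
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);

    /* each rank writes at its own offset; the collective form lets the
     * MPI library coordinate and optimise the underlying file accesses */
    offset = (MPI_Offset) rank * NLOCAL * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, NLOCAL, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}

Because the output is a single file with a well-defined layout, it can be read back by a different number of processes, which is the main portability gain over per-process files.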
Usability (iii)

What about accounting?
Users have to buy CPU time (at least in the UK!)
– and be charged for it
• in a common currency
– almost zero support for users or administrators to control resource allocation to projects, groups and users
• can be tape, disk, CPU, memory etc.
– HPC centres have to develop their own software
• we wrote an application from scratch for HPCx
If these things are hard on a parallel machine, just think how hard they will be on the Grid!


Grid Activities

HPCx
– available over the Grid via Globus 2
– issues due to the back-end CPUs being on a private network
EPCC
– part of the Globus Alliance along with Argonne, ISI and PDC
• planning the direction of the Globus toolkit
– many e-Science projects, collaborations with NeSC, etc.
DEISA
– a 5-year project in the pipeline, under negotiation with the EC
– 9 partners in 7 countries, requested budget around €14M


DEISA Vision

[Diagram: the DEISA super-cluster. National computing facilities at Sites A, B, C and D joined by a dedicated network interconnect, with extended Grid services layered on top]


DEISA Overview

A bottom-up approach to an EU Grid
– most of the sites have IBM hardware (a coincidence in time)
– a US TeraGrid on the cheap, with little (initial) hardware
– using the best available commodity software
Major focus is a shared file system
– initially extending GPFS
– also investigate other technologies (AFS, Avaki, ...)
EPCC’s involvement
– ensure HPCx is integrated
– develop a Cosmology application demonstrator
– develop OGSA middleware to enhance heterogeneity


[Image: simulation by the OCCAM group]
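As flagged under Usability (ii), here is a toy “Grim Reaper” sketch. It assumes a Linux-style /proc file system, the one-hour limit is arbitrary, and it only reports candidates rather than killing them; a production reaper would also check job ownership and batch-system state before acting.

/* Toy "Grim Reaper" sketch: scan /proc and report processes whose
 * accumulated CPU time exceeds a limit.  Linux proc(5) layout assumed;
 * the one-hour threshold is arbitrary and nothing is actually killed. */
#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include <dirent.h>
#include <unistd.h>

int main(void)
{
    const double limit_secs = 3600.0;        /* arbitrary CPU-time limit */
    long ticks = sysconf(_SC_CLK_TCK);       /* clock ticks per second */
    DIR *proc = opendir("/proc");
    struct dirent *entry;

    while (proc != NULL && (entry = readdir(proc)) != NULL) {
        char path[64], line[4096], *p;
        unsigned long utime, stime;
        double cpu_secs;
        FILE *fp;

        if (!isdigit((unsigned char) entry->d_name[0]))
            continue;                        /* only numeric PID directories */

        sprintf(path, "/proc/%.20s/stat", entry->d_name);
        fp = fopen(path, "r");
        if (fp == NULL)
            continue;
        if (fgets(line, sizeof line, fp) == NULL) {
            fclose(fp);
            continue;
        }
        fclose(fp);

        /* skip "pid (comm)" -- the command name may contain spaces */
        p = strrchr(line, ')');
        if (p == NULL)
            continue;

        /* fields 3-13 are skipped; fields 14 and 15 are utime and stime */
        if (sscanf(p + 1, " %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %lu %lu",
                   &utime, &stime) != 2)
            continue;

        cpu_secs = (double) (utime + stime) / ticks;
        if (cpu_secs > limit_secs)
            printf("PID %s: %.0f s of CPU used -- reap candidate\n",
                   entry->d_name, cpu_secs);
    }
    if (proc != NULL)
        closedir(proc);
    return 0;
}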