
N+N Meeting
October 3 2003
Exploiting Terascale Supercomputers
Experiences from HPCx
David Henty
N+N 03/10/3
Overview
4HPCx
– the machine
– the consortium
4Usability issues
– a brief summary
4HPCx, EPCC and the Grid
– current activities
4DEISA
– Distributed Infrastructure for Supercomputing Applications
– an EC-funded pan-European Grid testbed proposal
What is HPCx?
4Consortium of leading UK organisations
committed to creating and managing the new UK
HPC resource for the next 6 years
4Multi-stage project to deliver a world-class academic
computing resource, the largest in Europe, with
ultimate peak performance of 22 TFlop/s
4£50M/$70M budget from EPSRC
4Grid-enabled, a key component in
the UK e-Science program
The HPCx Consortium: Members
4University of Edinburgh
4Edinburgh Parallel Computing Centre
4Central Laboratory of the Research
Councils: Daresbury Laboratory
University of Edinburgh
4Lead contractor of the HPCx Consortium
• International centre of academic excellence
• One of the largest and most successful
research universities in the UK
• Partner in the National e-Science Centre
The HPCx Consortium: EPCC
4 Leading computer centre in Europe, bridging
the gap between academia and industry
• Self-funding, in existence for over 10 years
• Provides both HPC and novel computing
solutions to a wide range of problems and users
• Long experience of providing national HPC
services including:
– Meiko Computing Surfaces
– Thinking Machines CM200
– Cray T3D/T3E (1994 to 2001)
Daresbury Laboratory
4A multi-disciplinary research
lab with over 500 people
4Provides large-scale
research facilities both for
UK academic and industrial
research communities
4Daresbury hosts and
maintains the hardware
for the HPCx system
The HPCx Consortium: IBM
• IBM will provide the technology for HPCx
• Long standing involvement in HPC including the
development of a number of ASCI machines and 5 of
the top 10 machines in the 6/2002 TOP500 list:
– No 2: ASCI White: Rmax = 7.2 TFlop/s
– No 5: SP Power3 (3328 Processors): Rmax = 3.0 TFlop/s
– No 8: pSeries 690 (864 Processors): Rmax = 2.3 TFlop/s
• IBM has the long term technology
road map essential to a 6 year
project such as HPCx
HPCx Operational Phases
4System will be commissioned in three main
stages with phase 1 covering 2002-2004
– phase 1: December 2002, Linpack performance > 3 TFlop/s
– phase 2: June 2004: 6 TFlop/s
– phase 3: June 2006: 12 TFlop/s
4Focussed on Capability jobs
– using at least 50% of the CPU resource
– target is for half of the jobs to be capability jobs
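The capability rule above can be sketched as a simple check. This is an illustrative helper, not an actual HPCx scheduling tool, and the machine size used is an assumption:

```python
# Hypothetical sketch of the capability rule from the slide: a job counts as
# a "capability" job if it requests at least half of the machine's CPUs.
TOTAL_CPUS = 1280   # illustrative machine size, not the real HPCx figure

def is_capability(requested_cpus, total_cpus=TOTAL_CPUS):
    """True if the job uses at least 50% of the CPU resource."""
    return requested_cpus >= total_cpus / 2

def capability_fraction(jobs, total_cpus=TOTAL_CPUS):
    """Fraction of jobs meeting the capability threshold (target: 0.5)."""
    hits = sum(1 for j in jobs if is_capability(j, total_cpus))
    return hits / len(jobs) if jobs else 0.0
```

A centre would apply `capability_fraction` to the job mix over an accounting period to see whether the 50%-of-jobs target is being met.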
Usability Issues (i)
4Note these are NOT specific to HPCx or IBM!
4Batch systems
– not deeply integrated with the OS
– incompatibility between systems
– lack of useful information to the user
4Real time limits
– seem to be completely alien to UNIX
– accounting and charging therefore done by hacks
Usability (ii)
4Operating Systems
– written for multi-user, general-purpose systems
– desktop users work with the OS
– HPC users spend their whole lives fighting it
• we liked the Cray T3D because it DIDN’T HAVE an OS!
– modern OS’s are far too relaxed and sloppy
• eg runaway processes just run and run at 100% CPU
• ... on almost all systems, a “Grim Reaper” must be run by hand!
• eg I am running 128 processes
• ... is it a single MPI job? multiple MPI jobs? mixed MPI/OpenMP?
– spawn 100’s of threads for tasks that aren’t needed for HPC
– MPI-IO is nice but I don't see people using it
• users usually develop bespoke I/O solutions which don't port well
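The "Grim Reaper" run by hand could look something like the sketch below: flag processes pinned near 100% CPU for longer than a wall-clock limit. This is a minimal illustration, not the tool actually used on any of these systems; the thresholds are assumptions:

```python
# Minimal "Grim Reaper" sketch: flag (not kill) processes that have been
# running at near-100% CPU for longer than a wall-clock limit.
# Thresholds and the use of ps are illustrative assumptions.
import subprocess

def parse_etime(etime):
    """Convert ps ELAPSED format ([[dd-]hh:]mm:ss) to seconds."""
    days = 0
    if "-" in etime:
        d, etime = etime.split("-")
        days = int(d)
    parts = [int(p) for p in etime.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)
    h, m, s = parts
    return ((days * 24 + h) * 60 + m) * 60 + s

def find_runaways(rows, cpu_threshold=95.0, max_seconds=12 * 3600):
    """rows: (pid, %cpu, elapsed-string, command) tuples, as printed by
    'ps -eo pid,pcpu,etime,comm'. Returns the pids worth reaping."""
    return [pid for pid, pcpu, etime, _ in rows
            if pcpu >= cpu_threshold and parse_etime(etime) > max_seconds]

def snapshot():
    """Take a live process snapshot with ps (POSIX systems)."""
    out = subprocess.run(["ps", "-eo", "pid,pcpu,etime,comm"],
                         capture_output=True, text=True, check=True).stdout
    rows = []
    for line in out.splitlines()[1:]:
        pid, pcpu, etime, comm = line.split(None, 3)
        rows.append((int(pid), float(pcpu), etime, comm))
    return rows
```

In practice `find_runaways(snapshot())` would run from cron, which is exactly the kind of hand-rolled workaround the slide is complaining about.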
Usability (iii)
4What about accounting?
4Users have to buy CPU time (at least in the UK!)
– and be charged for it
• in a common currency
– almost zero support for users or administrators to control
resource allocation to projects, groups and users
• can be tape, disk, CPU, memory, etc.
– HPC centres have to develop their own software
• we wrote an application from scratch for HPCx
4If these things are hard on a parallel machine,
just think how hard they will be on the Grid!
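The "common currency" idea above can be sketched as charging every resource in allocation units (AUs) against a project budget. The rates, the AU name, and the class shape are illustrative assumptions, not the application written for HPCx:

```python
# Sketch of per-project resource accounting in "allocation units" (AUs),
# the common-currency idea from the slide. Rates are illustrative only.
RATES = {"cpu_hours": 1.0, "disk_gb_days": 0.05, "tape_gb": 0.01}

class Project:
    def __init__(self, name, budget_au):
        self.name = name
        self.budget_au = budget_au   # remaining allocation in AUs

    def charge(self, **usage):
        """Convert usage of each resource into AUs and deduct it from the
        project budget; refuse the charge if it would overdraw."""
        cost = sum(RATES[res] * amount for res, amount in usage.items())
        if cost > self.budget_au:
            raise RuntimeError(f"{self.name}: insufficient allocation")
        self.budget_au -= cost
        return cost
```

For example, `Project("e005", 1000.0).charge(cpu_hours=512, disk_gb_days=200)` costs 522 AUs; a real system would also need to split budgets across groups and users, which is the support the slide says is missing.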
Grid Activities
4HPCx
– available over the Grid via Globus 2
– issues due to back-end CPUs being on a private network
4EPCC
– part of the Globus Alliance along with Argonne, ISI and PDC
• planning the direction of the Globus toolkit
– many e-Science projects, collaborations with NeSC, etc.
4DEISA
– a 5-year project in the pipeline, under negotiation with the EC
– 9 partners in 7 countries, requested budget around €14M
DEISA Vision
[Diagram: the DEISA super-cluster, with Sites A, B, C and D linked by a dedicated network, extending national computing resources into a wider Grid]
DEISA Overview
4 A bottom-up approach to an EU Grid
– most of the sites have IBM hardware (a coincidence in time)
– like the US TeraGrid, but on the cheap with little (initial) new hardware
– using the best available commodity software
4Major focus is shared file system
– initially extending GPFS
– also investigate other technologies (AFS, Avaki, ...)
4EPCC’s involvement
– ensure HPCx is integrated
– develop a Cosmology application demonstrator
– develop OGSA middleware to support heterogeneity
Simulation by the OCCAM group