N+N Meeting
October 3 2003
Exploiting Terascale Supercomputers
Experiences from HPCx
David Henty
d.henty@epcc.ed.ac.uk
Overview
HPCx
– the machine
– the consortium
Usability issues
– a brief summary
HPCx, EPCC and the Grid
– current activities
DEISA
– Distributed European Infrastructure for Supercomputing Applications
– an EC-funded pan-European Grid testbed proposal
What is HPCx?
Consortium of leading UK organisations committed to creating and managing the new UK HPC resource for the next 6 years
Multi-stage project to deliver a world-class academic computing resource, the largest in Europe, with an ultimate peak performance of 22 TFlop/s
£50M/$70M budget from EPSRC
Grid-enabled, a key component in the UK e-Science programme
The HPCx Consortium: Members
University of Edinburgh
Edinburgh Parallel Computing Centre
Central Laboratory of the Research Councils: Daresbury Laboratory
IBM
University of Edinburgh
Lead contractor of the HPCx Consortium
• International centre of academic excellence
• One of the largest and most successful research universities in the UK
• Partner in the National e-Science Centre (NeSC)
The HPCx Consortium: EPCC
Leading computer centre in Europe, bridging the gap between academia and industry
• Self-funding, in existence for over 10 years
• Provides both HPC and novel computing solutions to a wide range of problems and users
• Long experience of providing national HPC services, including:
– Meiko Computing Surfaces
– Thinking Machines CM200
– Cray T3D/T3E (1994 to 2001)
Daresbury Laboratory
A multidisciplinary research laboratory with over 500 people
Provides large-scale research facilities for both UK academic and industrial research communities
Daresbury hosts and maintains the hardware for the HPCx system
IBM
• IBM will provide the technology for HPCx
• Long-standing involvement in HPC, including the development of a number of ASCI machines and 5 of the top 10 machines in the June 2002 TOP500 list:
– No 2: ASCI White: Rmax = 7.2 TFlop/s
– No 5: SP Power3 (3328 processors): Rmax = 3.0 TFlop/s
– No 8: pSeries 690 (864 processors): Rmax = 2.3 TFlop/s
• IBM has the long-term technology roadmap essential to a 6-year project such as HPCx
HPCx Operational Phases
The system will be commissioned in three main phases, with phase 1 covering 2002-2004
– phase 1: December 2002, performance > 3 TFlop/s Linpack
– phase 2: June 2004, 6 TFlop/s
– phase 3: June 2006, 12 TFlop/s
Focussed on capability jobs
– a capability job uses at least 50% of the CPU resource
– the target is for half of all jobs to be capability jobs
Usability Issues (i)
Note these are NOT specific to HPCx or IBM!
Batch systems
– not deeply integrated with the OS
– incompatibility between systems
– lack of useful information to the user
Real-time limits
– seem to be completely alien to UNIX
– accounting and charging are therefore done by hacks (see the sketch below)
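As an illustration of the kind of hack this means, here is a minimal Python sketch of a wrapper that imposes a wall-clock limit itself and then measures the child's CPU time for charging, because the OS and batch system do neither for us. The executable name and the limit are placeholders, not anything HPCx actually ran.

```python
import resource
import subprocess
import time

# Illustrative values: the real limit would come from the batch job class,
# and "./my_job" is a placeholder executable.
WALL_CLOCK_LIMIT = 3600          # seconds
cmd = ["./my_job"]

start = time.time()
proc = subprocess.Popen(cmd)
try:
    proc.wait(timeout=WALL_CLOCK_LIMIT)   # enforce the real-time limit ourselves
except subprocess.TimeoutExpired:
    proc.kill()                           # no per-job wall-clock limit from the OS
    proc.wait()

wall = time.time() - start
usage = resource.getrusage(resource.RUSAGE_CHILDREN)   # CPU time of waited-for children
cpu = usage.ru_utime + usage.ru_stime

# The "hack": the wrapper itself records what the user should be charged for.
print(f"wall clock {wall:.0f} s, CPU {cpu:.0f} s")
```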
Usability (ii)
Operating Systems
– written for multi-user, general-purpose systems
– desktop users work with the OS
– HPC users spend their whole lives fighting it
• we liked the Cray T3D because it DIDN’T HAVE an OS!
– modern OSs are far too relaxed and sloppy
• e.g. runaway processes just run and run at 100% CPU
• ... on almost all systems, a “Grim Reaper” must be run by hand! (see the sketch after this list)
• e.g. I am running 128 processes
• ... is it a single MPI job? multiple MPI jobs? mixed MPI/OpenMP?
– spawn 100s of threads for tasks that aren’t needed for HPC
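A minimal sketch of such a hand-run reaper, assuming the third-party psutil package; the CPU threshold and the exemption list are illustrative, and a real one would check candidates against the batch system's job list before sending any signal.

```python
import time
import psutil   # third-party package, assumed to be installed

CPU_THRESHOLD = 95.0        # per cent of one CPU above which a process looks runaway
EXEMPT_USERS = {"root"}     # illustrative exemption list

# The first cpu_percent() call only primes psutil's counters; a second call,
# taken after a short interval, returns a meaningful utilisation figure.
procs = list(psutil.process_iter(attrs=["pid", "username", "name"]))
for p in procs:
    try:
        p.cpu_percent(interval=None)
    except psutil.NoSuchProcess:
        pass

time.sleep(2.0)

for p in procs:
    try:
        if p.info["username"] in EXEMPT_USERS:
            continue
        if p.cpu_percent(interval=None) > CPU_THRESHOLD:
            # Report only; a real reaper would first verify the process does not
            # belong to a legitimate batch job.
            print(f"runaway candidate: pid={p.info['pid']} "
                  f"user={p.info['username']} cmd={p.info['name']}")
    except psutil.NoSuchProcess:
        continue
```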
IO
– MPI-IO is nice but I don’t see people using it (see the sketch below)
– users usually develop bespoke solutions which don’t port well
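MPI-IO is normally written in C or Fortran; purely to keep these sketches in one language, the example below uses the mpi4py binding (assumed installed, along with NumPy). Each rank writes its own contiguous block of a shared file with a collective call; the file name and block size are illustrative.

```python
import numpy as np
from mpi4py import MPI   # mpi4py and an MPI library are assumed to be available

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n_local = 1024                                  # elements owned by this rank
data = np.full(n_local, rank, dtype=np.float64)

# All ranks open the same file; each writes its block at a rank-dependent
# byte offset using a collective MPI-IO call.
fh = MPI.File.Open(comm, "output.dat", MPI.MODE_WRONLY | MPI.MODE_CREATE)
fh.Write_at_all(rank * n_local * data.itemsize, data)
fh.Close()
```

Run with something like "mpirun -n 8 python write_blocks.py" (the script name is arbitrary); the point is that the collective call lets the MPI library optimise the shared-file access in a way a bespoke one-file-per-process scheme cannot.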
Usability (iii)
What about accounting?
Users have to buy CPU time (at least in the UK!)
– and be charged for it
• in a common currency
– almost zero support for users or administrators to control resource allocation to projects, groups and users
• resources can be tape, disk, CPU, memory, etc.
– HPC centres have to develop their own software
• we wrote an application from scratch for HPCx (see the sketch below)
If these things are hard on a parallel machine, just think how hard they will be on the Grid!
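The HPCx accounting application itself is not described here, so the following is only an illustrative sketch of the bookkeeping such software has to do: projects hold a budget in a common currency, and usage of different resources is converted and debited against it. All names and rates below are made up.

```python
from dataclasses import dataclass, field

# Made-up conversion rates from raw usage into a single charging currency
# ("allocation units"); real rates are a matter of site policy.
RATES = {"cpu_hours": 1.0, "disk_gb_days": 0.01, "tape_gb": 0.005}

@dataclass
class Project:
    name: str
    budget: float                       # allocation units granted to the project
    spent: float = 0.0
    usage: dict = field(default_factory=dict)

    def charge(self, resource: str, amount: float) -> None:
        """Debit the project for some resource usage, refusing overspend."""
        cost = amount * RATES[resource]
        if self.spent + cost > self.budget:
            raise RuntimeError(f"{self.name}: allocation exhausted")
        self.spent += cost
        self.usage[resource] = self.usage.get(resource, 0.0) + amount

# Example: charge a project for a 128-processor, 6-hour run plus some disk use.
proj = Project(name="e001", budget=100_000.0)
proj.charge("cpu_hours", 128 * 6)
proj.charge("disk_gb_days", 500)
print(f"{proj.name}: spent {proj.spent:.1f} of {proj.budget:.1f} units")
```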
Grid Activities
HPCx
– available over the Grid via Globus 2 (see the sketch after this list)
– issues due to the back-end CPUs being on a private network
EPCC
– part of the Globus Alliance along with Argonne, ISI and PDC
• planning the direction of the Globus toolkit
– many e-Science projects, collaborations with NeSC, etc.
DEISA
– a 5-year project in the pipeline, under negotiation with the EC
– 9 partners in 7 countries, requested budget around €14M
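For context on the Globus 2 access mentioned above: command-line use of Globus Toolkit 2 typically goes through the GRAM clients, and the sketch below simply wraps the standard globus-job-run client from Python. The gatekeeper contact string and the remote command are placeholders; the details depend entirely on the site's configuration.

```python
import subprocess

# Placeholder gatekeeper contact and remote command; real values are site-specific.
contact = "gatekeeper.example.ac.uk"
remote_cmd = ["/bin/hostname"]

# globus-job-run is the basic GT2 GRAM client: it authenticates with the user's
# Grid proxy, submits the command to the remote gatekeeper and returns its output.
result = subprocess.run(["globus-job-run", contact] + remote_cmd,
                        capture_output=True, text=True, check=True)
print(result.stdout, end="")
```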
DEISA Vision
[Diagram: the DEISA super-cluster, with national computing facilities at Sites A, B, C and D joined by a dedicated network interconnect and extended Grid services]
DEISA Overview
A bottom-up approach to an EU Grid
– most of the sites have IBM hardware (a coincidence in time)
– a US TeraGrid on the cheap with little (initial) hardware
– using the best available commodity software
The major focus is a shared file system
– initially extending GPFS
– also investigating other technologies (AFS, Avaki, ...)
EPCC’s involvement
– ensure HPCx is integrated
– develop a Cosmology application demonstrator
– develop OGSA middleware to enhance heterogeneity
Simulation by the OCCAM group