Computational Science at STFC and the Exploitation of Novel HPC Architectures
Mike Ashworth
Scientific Computing Department and STFC Hartree Centre
STFC Daresbury Laboratory
mike.ashworth@stfc.ac.uk

• STFC's Scientific Computing Department
• STFC's Hartree Centre
• Exploitation of Novel HPC Architectures

Organisation
HM Government (& HM Treasury) – RCUK Executive Group

STFC's Sites
• UK Astronomy Technology Centre, Edinburgh, Scotland
• Polaris House, Swindon, Wiltshire
• Daresbury Laboratory, Daresbury Science and Innovation Campus, Warrington, Cheshire
• Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire
• Chilbolton Observatory, Stockbridge, Hampshire
• Isaac Newton Group of Telescopes, La Palma
• Joint Astronomy Centre, Hawaii

STFC's Science Programme – Understanding our Universe
Particle Physics
• Large Hadron Collider (LHC), CERN – the structure and forces of nature
Ground-based Astronomy
• European Southern Observatory (ESO), Chile
• Very Large Telescope (VLT), Atacama Large Millimeter Array (ALMA), European Extremely Large Telescope (E-ELT), Square Kilometre Array (SKA)
Space-based Astronomy
• European Space Agency (ESA): Herschel / Planck / GAIA / James Webb Space Telescope (JWST)
• Bi-laterals – NASA, JAXA, etc.
• STFC Space Science Technology Department
Nuclear Physics
• Facility for Antiproton and Ion Research (FAIR), Germany
• Nuclear skills for medicine (isotopes and radiation applications), energy (nuclear power plants) and environment (nuclear waste disposal)

STFC's Facilities
Neutron Sources
• ISIS – pulsed neutron and muon source – and the Institut Laue-Langevin (ILL), Grenoble
• Providing powerful insights into key areas of energy, biomedical research, climate, environment and security
High Power Lasers
• Central Laser Facility – providing applications in bioscience and nanotechnology
• HiPER – demonstrating laser-driven fusion as a future source of sustainable, clean energy
Light Sources
• Diamond Light Source Limited (86%) – providing new breakthroughs in medicine, environmental and materials science, engineering, electronics and cultural heritage
• European Synchrotron Radiation Facility (ESRF), Grenoble

Scientific Computing Department
Director: Adrian Wander (appointed 24th July 2012)
Major funded activities:
• 190 staff supporting over 7500 users
• Applications development and support
• Compute and data facilities and services
• Research: over 100 publications per annum
• Deliver over 3500 training days per annum
• Systems administration, data services, high-performance computing, numerical analysis & software engineering
Major science themes and capabilities
• Expertise across the length and time scales, from processes occurring inside atoms to environmental modelling

The UK National Supercomputing Facilities
The UK National Supercomputing Services are managed by EPSRC on behalf of the UK academic communities.
• HPCx ran from 2002-2009 using IBM POWER4 and POWER5
• HECToR is the current service (2007-2014), located at Edinburgh and operated jointly by STFC and EPCC; HECToR Phase 3 is a 90,112-core Cray XE6 (660 Tflop/s Linpack)
• ARCHER is the new service: early access now, service starts 16th Dec '13, operated by STFC and EPCC (1.37 Pflop/s Linpack, #19 in the TOP500)

PRACE
• PRACE launched 9th June 2010; 25 member countries; seat in Belgium
• France, Germany, Italy and Spain have each committed €100M over 5 years; EC funding of €70M for infrastructure
• Tier-0 infrastructure providing a free-of-charge service for European scientific communities, based on peer review
• Four overlapping projects to date: PP, 1IP, 2IP, 3IP; 1IP finished, 2IP extended, 3IP about half way through
• EPSRC represents the UK on the PRACE Council; STFC and EPCC carry out technical work in the PRACE projects
• STFC focus is on application optimization & benchmarking, technology evaluation, procurement procedures and training
• STFC contributes 2.5% of its BlueGene/Q to the PRACE DECI calls

Scientific Highlights
• Journal of Materials Chemistry 16 no. 20 (May 2006) – issue devoted to HPC in materials chemistry (esp. use of HPCx)
• Phys. Stat. Sol. (b) 243 no. 11 (Sept 2006) – issue featuring scientific highlights of the Psi-k Network (the European network on the electronic structure of condensed matter coordinated by our Band Theory Group)
• Molecular Simulation 32 no. 12-13 (Oct, Nov 2006) – special issue on applications of the DL_POLY MD program written & developed by Bill Smith (the 2nd special edition of Mol Sim on DL_POLY; the 1st was about 5 years ago)
• Acta Crystallographica Section D 63 part 1 (Jan 2007) – proceedings of the CCP4 Study Weekend on protein crystallography
• The Aeronautical Journal 111 no. 1117 (March 2007) – UK Applied Aerodynamics Consortium Special Edition
• Proc. Roy. Soc. A 467 no. 2131 (July 2011) – HPC in the Chemistry and Physics of Materials

Last 5 years metrics:
• 67 grants of order £13M
• 422 refereed papers and 275 presentations
• Three senior staff have joint appointments with Universities
• Seven staff have visiting professorships
• Six members of staff awarded Senior Fellowships or Fellowships by the Research Councils' individual merit scheme
• Five staff are Fellows of senior learned societies

• STFC's Scientific Computing Department
• STFC's Hartree Centre
• Exploitation of Novel HPC Architectures

Opportunities
Political Opportunity
• Demonstrate growth through economic and societal impact from investments in HPC
Business Opportunity
• Engage industry in HPC simulation for competitive advantage
• Exploit multi-core
Scientific Opportunity
• Build multi-scale, multi-physics coupled apps
• Tackle complex Grand Challenge problems
Technical Opportunity
• Exploit new Petascale and Exascale architectures
• Adapt to multi-core and hybrid architectures

Tildesley Report
BIS commissioned a report on the strategic vision for a UK e-Infrastructure for Science and Business.
• Prof Dominic Tildesley led the team, including representatives from Universities, Research Councils, industry and JANET
• The scope included compute, software, data, networks, training and security
• Mike Ashworth, Richard Blake and John Bancroft from STFC provided input
• Published in December 2011
Google the title to download it from the BIS website.

Government Investment in e-infrastructure – 2011
• 17th Aug 2011: Prime Minister David Cameron confirmed £10M investment into STFC's Daresbury Laboratory; £7.5M for computing infrastructure
• 3rd Oct 2011: Chancellor George Osborne announced £145M for e-infrastructure at the Conservative Party Conference
• 4th Oct 2011: Science Minister David Willetts indicated £30M investment in the Hartree Centre
• 30th Mar 2012: John Womersley, CEO STFC, and Simon Pendlebury, IBM, signed a major collaboration at the Hartree Centre

Intel collaboration
STFC and Intel have signed an MOU to develop and test technology that will be required to power the supercomputers of tomorrow.
Karl Solchenbach, Director of European Exascale Computing at Intel, said: "We will use STFC's leading expertise in scalable applications to address the challenges of exascale computing in a co-design approach."

Collaboration with Unilever
1st Feb 2013: Also announced was a key partnership with Unilever in the development of Computer Aided Formulation (CAF).
Months of laboratory bench work can be completed within minutes by a tool designed to run as an 'App' on a tablet or laptop which is connected remotely to the Blue Joule supercomputer at Daresbury.
The aggregation of surfactant molecules into micelles is an important process in product formulation.
This tool predicts the behaviour and structure of different concentrations of liquid compounds, both in the bottle and in-use, and helps researchers plan fewer and more focussed experiments.
Photo: John Womersley, CEO STFC, and Jim Crilly, Senior Vice President, Strategic Science Group at Unilever

Hartree Centre IBM BG/Q Blue Joule
TOP500: #13 in the Jun 2012 list, #23 in the Nov 2013 list; #8 in Europe; #2 system in the UK
6 racks:
• 98,304 cores
• 6144 nodes
• 16 cores & 16 GB per node
• 1.25 Pflop/s peak
1 rack to be configured as BGAS (Blue Gene Advanced Storage):
• 16,384 cores
• Up to 1 PB Flash memory

Hartree Centre IBM iDataPlex Blue Wonder
TOP500: #114 in the Jun 2012 list, #283 in the Nov 2013 list
8192 cores, 170 Tflop/s peak; each node has 16 cores, 2 sockets, Intel Sandy Bridge (AVX etc.)
• 252 nodes with 32 GB
• 4 nodes with 256 GB
• 12 nodes with X3090 GPUs
• 256 nodes with 128 GB
• ScaleMP virtualization software, up to 4 TB virtual shared memory

Hartree Centre Datastore
Storage: 5.76 PB usable disk storage and 15 PB tape store

Hartree Centre Visualization
Four major facilities:
• Hartree Vis-1: a large visualization "wall" supporting stereo
• Hartree Vis-2: a large surround and immersive visualization system
• Hartree ISIC: a large visualization "wall" supporting stereo at ISIC
• Hartree Atlas: a large visualization "wall" supporting stereo in the Atlas Building at RAL, part of the Harwell Imaging Partnership (HIP)
Virtalis is the hardware supplier.
Home of the 2nd most powerful supercomputer in the UK.

Douglas Rayner Hartree – Father of Computational Science
Douglas Rayner Hartree PhD, FRS (1897-1958)
• Hartree–Fock method
• Appleton–Hartree equation
• Differential Analyser
• Numerical Analysis
"It may well be that the high-speed digital computer will have as great an influence on civilization as the advent of nuclear power" (1946)
Photo: Douglas Hartree with Phyllis Nicolson at the Hartree Differential Analyser at Manchester University

Hartree Centre Official Opening
1st Feb 2013: Chancellor George Osborne and Science Minister David Willetts opened the Hartree Centre and announced a further £185M of funding for e-Infrastructure:
• £19M for the Hartree Centre for power-efficient computing technologies
• £11M for the UK's participation in the Square Kilometre Array
This investment forms part of the £600 million investment for science announced by the Chancellor at the Autumn Statement 2012.
"By putting our money into science we are supporting the economy of tomorrow."
Photo: George Osborne opens the Hartree Centre, 1st February 2013

Work with us to:
• Sharpen your innovation
• Improve your global competitiveness
• Reduce research and development costs
• Reduce costs for certification
• Speed up your time to market

Power-efficient Technologies 'Shopping List'
£19M investment in power-efficient technologies:
• System with latest NVIDIA Kepler GPUs
• System based on Intel Xeon Phi
• System based on ARM processors
• Active storage project using IBM BGAS
• Dataflow architecture based on FPGAs
• Instrumented machine room
Systems will be made available for development and evaluation projects with Hartree Centre partners from industry, government and academia.

• STFC's Scientific Computing Department
• STFC's Hartree Centre
• Exploitation of Novel HPC Architectures

Accelerators in the TOP500
Where are the accelerators?
TOP500 1-10, accelerator-based systems:
#1  Tianhe-2    33.8 Pflop/s   Intel Xeon Phi   3120k (2736k) cores
#2  Titan       17.6 Pflop/s   Nvidia K20x      560k (262k)
#6  Piz Daint   6.27 Pflop/s   Nvidia K20x      116k (74k)
#7  Stampede    5.17 Pflop/s   Intel Xeon Phi   462k (366k)
UK National resources: the STFC Hartree Centre IBM iDataPlex has 48 Nvidia Fermi GPUs
UK Regional resources: e-Infrastructure South EMERALD consists of 372 Fermi GPUs
Local resources: University and departmental clusters
Under your desk?

The Power Wall
• Transistor density is still increasing
• Clock frequency is not, due to power density constraints
• Cores per chip are increasing: multi-core CPUs (currently 8-16) and GPUs (~500)
• Little further scope for instruction-level parallelism
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

Processor Comparison
Processor             Parallelism    GB/s   Tflop/s   Gflop/s/W
Nvidia K40            15 x 64        288    1.43      6.1
AMD FirePro S10000    2 x 56 x 32*   480    1.48      3.9
Intel Xeon Phi        60 x 8         320    1.00      4.5
Intel E5-2667W        16 x 4         50     0.20      1.3
* single precision cores; double precision is 1/4 of this.
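As a rough cross-check on the energy-efficiency column (the board power figures are not given in the table; the nominal values quoted here are assumptions): the Nvidia K40 has a board power of about 235 W, so 1.43 Tflop/s / 235 W ≈ 6.1 Gflop/s/W, and the dual-GPU AMD FirePro S10000 at about 375 W gives 1.48 Tflop/s / 375 W ≈ 3.9 Gflop/s/W. The last column is simply peak double-precision rate divided by power, which is why the accelerators lead the conventional Xeon by roughly a factor of 3-5.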
Software Challenges
The fundamental challenge is extracting additional parallelism from your application:
• Fortran/C + MPI
• OpenMP for multi-core
• CUDA for Nvidia GPUs
• OpenCL
• OpenACC – directive-based, from Cray (& PGI)
• OpenMP 4 for accelerators

• Gung-Ho – a new atmospheric dynamical core
• NEMO on GPUs
• DL_POLY on GPUs
• LBM on GPUs

Current Unified Model
Met Office Unified Model – 'Unified' in the sense of using the same code for weather forecasting and for climate research
• "New Dynamics", Davies et al (2005); "ENDGame" to be operational in 2013
• Combines dynamics on a lat/long grid with physics (radiation, clouds, precipitation, convection etc.)
• Also couples to other models (ocean, sea-ice, land surface, chemistry/aerosols etc.) for improved forecasting and earth system modelling
Chart: performance of the UM (dark blue) versus a basket of models (Met Office, ECMWF, USA, France, Germany, Japan, Canada, Australia), measured by 3-day surface pressure errors, 2003-2011
From Nigel Wood, Met Office

Limits to Scalability of the UM
• The current version (New Dynamics) has limited scalability
• The problem lies with the spacing of the lat/long grid at the poles
• The latest ENDGame code improves this, but a more radical solution is required for Petascale and beyond
• At 25km resolution, grid spacing near the poles is 75m (17km); at 10km this reduces to 12m!
Chart: UM scaling with POWER7 nodes, compared with perfect scaling
From Nigel Wood, Met Office

Challenging Solutions
GUNG-HO targets a brand new dynamical core:
• Scalability – choose a globally uniform grid which has no poles (see below)
• Speed – maintain performance at high & low resolution and for high & low core counts
• Accuracy – need to maintain the standing of the model
• Space weather implies a 600km deep model
• Five year project 2011-2015; operational weather forecasts around 2020!
Candidate grids: triangles, cubesphere, Yin-Yang
GUNG-HO: Globally Uniform Next Generation Highly Optimized – "Working together harmoniously"
From Nigel Wood, Met Office

Design considerations
Ford et al, "Gung Ho: A code design for weather and climate prediction on exascale machines", EASC 2013, to appear in a special edition of Advances in Engineering Software
• Fortran 2003, MPI, OpenMP and OpenACC
• Other models, e.g. PGAS, CAF, are not excluded
• Indirect addressing in the horizontal to support a wide range of possible grids
• Direct addressing in the vertical
• Vertical index innermost is optimal for cache re-use in CPUs, and can also achieve coalesced memory access in GPUs (see the loop-ordering sketch after the Structure overview below)

Software architecture
The Gung-Ho software architecture is structured into layers communicating via a defined API:
• the driver layer (control for one or more models)
• the algorithm layer (high-level specification)
• the parallelisation system (PSy) (inter-node and intra-node parallelism, parsing, transformations, hardware-specific code generation)
• the kernel layer (toolkit of algorithm building blocks with directives)
• the infrastructure layer (generic library to support parallelisation, communications, coupling, I/O etc.)

Gung-Ho Single Model architecture
The arrows represent the APIs connecting the layers; the direction shows the flow of control. A code generator will parse the algorithm layer source code.

Structure
Diagram: the generator pipeline – the Parser reads the Alg Code and Kernel Codes, the Algorithm Generator produces transformed Alg Code, and the PSy Generator produces PSy Code.
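Returning to the design considerations above, the following is a minimal, hypothetical kernel sketch, not Gung-Ho source: the names column_sum_sketch, dofmap, field and result are invented for illustration. The horizontal loop is addressed indirectly through a dofmap, the vertical loop is addressed directly, and the vertical index k is innermost so that successive iterations walk contiguous memory (compare the column loop in the generated PSy code of the integrate_one example below).

    subroutine column_sum_sketch(nlayers, ncells, ndofs, dofmap, field, result)
      ! Hypothetical kernel: sums field values over the dofs of each cell column.
      implicit none
      integer, intent(in)  :: nlayers, ncells, ndofs
      integer, intent(in)  :: dofmap(ndofs, ncells)   ! horizontal indirection
      real(8), intent(in)  :: field(nlayers, *)       ! vertical index first
      real(8), intent(out) :: result(ncells)
      integer :: cell, dof, k

      do cell = 1, ncells                ! horizontal: indirect via dofmap
         result(cell) = 0.0d0
         do dof = 1, ndofs
            do k = 1, nlayers            ! vertical: direct, contiguous, innermost
               result(cell) = result(cell) + field(k, dofmap(dof, cell))
            end do
         end do
      end do
    end subroutine column_sum_sketch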
...
ast, invokeInfo = parse(filename, invoke_name=invokeName)
alg = algGen(ast, invokeInfo, psyName=psyName, invokeName=invokeName)
psy = psyGen(invokeInfo, psyName=psyName)
...

Invoking the generator
rupert@ubuntu:~/proj/GungHoSVN/LFRIC/src/generator$ python generator.py
usage: generator.py [-h] [-oalg OALG] [-opsy OPSY] filename
generator.py: error: too few arguments

integrate_one_generate:
    python ../generator/generator.py -oalg integrate_one_alg.F90 -opsy integrate_one_psy.F90 integrate_one.F90
make integrate_one_generated

Example (integrate_one)
program main
  ...
  use integrate_one_module, only : integrate_one_kernel
  ...
  call invoke(integrate_one_kernel(x, integral))
  ...
end program main

PROGRAM main
  ...
  USE psy, ONLY: invoke_integrate_one_kernel
  ...
  CALL invoke_integrate_one_kernel(x, integral)
  ...
END PROGRAM main

Example (integrate_one)
module integrate_one_module
  use kernel_mod
  implicit none
  private
  public integrate_one_kernel
  public integrate_one_code

  type, extends(kernel_type) :: integrate_one_kernel
     type(arg) :: meta_args(2) = (/ &
          arg(READ, (CG(1)*CG(1))**3, FE), &
          arg(SUM, R, FE) /)
     integer :: ITERATES_OVER = CELLS
   contains
     procedure, nopass :: code => integrate_one_code
  end type integrate_one_kernel

contains

  subroutine integrate_one_code(layers, p1dofm, X, R)
  ...

Example (integrate_one)
MODULE psy
  USE integrate_one_module, ONLY: integrate_one_code
  USE lfric
  IMPLICIT NONE
CONTAINS
  SUBROUTINE invoke_integrate_one_kernel(x, integral)
    ...
    SELECT TYPE ( x_space => x%function_space )
    TYPE IS ( FunctionSpace_type )
      topology => x_space%topology
      nlayers = topology%layer_count()
      p1dofmap => x_space%dof_map(cells, fe)
    END SELECT
    DO column = 1, topology%entity_counts(cells)
      CALL integrate_one_code(nlayers, p1dofmap(:,column), x%data, integral%data(1))
    END DO
  END SUBROUTINE invoke_integrate_one_kernel
END MODULE psy

Timetable
• Further development and testing of the horizontal [2013]
• Testing of proposals for code architecture [2013]
• Vertical discretization [2013]
• 3D prototype development [2014-2015]
• Operational around 2020...?

NEMO Acceleration on GPUs using OpenACC
Maxim Milakov, Peter Messmer, Thomas Bradley, NVIDIA
• Flat profile; code converted using OpenACC directives (an illustrative directive sketch appears at the end of this section, before the Summary)
• GYRE only – we are looking at more realistic test cases, with ice and land
• Tesla M2090 GPUs, Westmere CPUs
Milakov, Messmer, Bradley, GPU Technology Conference, 18th-22nd March 2013

DL_POLY Acceleration on GPUs
Christos Kartsaklis and Ruairi Nestor, ICHEC; Ilian Todorov and Bill Smith, STFC
CUDA implementation of key DL_POLY features:
• Constraints (SHAKE)
• Link cell pairs
• Two-body forces
• Ewald SPME forces
DMPC (dimyristoylphosphatidylcholine) in water, 413,896 atoms (test case 4)

DL_POLY Acceleration on GPUs
"Benchmarking and Analysis of DL_POLY 4 on GPU Clusters", Lysaght et al, PRACE report
• Significant 8x speed-up using cuFFT
• MPI code scales to 10^8 atoms on >10^5 cores
• Pure MPI vs. 2 GPUs on the ICHEC Stokes GPU cluster
• Sodium chloride, 216,000 ions (test case 2)

3D LBM on Kepler GPUs (1)
Mark Mawson & Alistair Revell, Manchester
• Lattice Boltzmann Method (LBM) for solving fluid flow
• Focus on memory transfer issues

SKA
STFC is a partner in the Science Data Processor (SDP) work package.
Stephen Pickles leads the software engineering task (SKA.TEL.SDP.ARCH.SWE), developing the software system-level prototype.
We are also contributing to the work to build, run and support the SDP software prototypes on existing production HPC facilities (SKA.TEL.SDP.PROT.ISP).
Work started on 1st November 2013.
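Before the summary, an illustrative sketch of the directive-based porting style referred to in the NEMO item above. This is a generic, invented 2-D stencil update, not NEMO source; the routine update and its arrays old and new are assumptions made purely to show loop-level OpenACC offload with explicit data clauses.

    subroutine update(nx, ny, old, new)
      ! Generic five-point average, offloaded to an accelerator with OpenACC.
      implicit none
      integer, intent(in)    :: nx, ny
      real(8), intent(in)    :: old(nx, ny)
      real(8), intent(inout) :: new(nx, ny)
      integer :: i, j

      !$acc parallel loop collapse(2) copyin(old) copy(new)
      do j = 2, ny - 1
         do i = 2, nx - 1
            ! interior points only; boundary values of new are left unchanged
            new(i, j) = 0.25d0 * (old(i-1, j) + old(i+1, j) &
                                + old(i, j-1) + old(i, j+1))
         end do
      end do
    end subroutine update

In practice the directives (or CUDA kernels) are only part of the work: keeping data resident on the device and minimising host-device transfers, the stated focus of the LBM work above, usually determines the achieved speed-up.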
Summary
• New UK Government investment supports a wide range of e-Infrastructure projects (incl. data centres, networks, ARCHER, Hartree)
• The Hartree Centre is 'open for business'
• We are driving forward application development on emerging architectures
For more information see http://www.stfc.ac.uk/scd

If you have been ... thank you for listening
Mike Ashworth
mike.ashworth@stfc.ac.uk
http://www.stfc.ac.uk/scd