Next decade in supercomputing
José M. Cela, Director of the CASE Department, BSC-CNS
josem.cela@bsc.es

Talk outline
● Supercomputing from the past…
  ● Architecture evolution
  ● Applications and algorithms
● …Supercomputing for the future
  ● Technology trends
  ● Multidisciplinary top-down approach
● BSC-CNS activities
● Conclusions

Once upon a time… ENIAC, 1946
● ENIAC, 1946, Moore School
● 18,000 vacuum tubes, 70,000 resistors and 5 million soldered connections
● Power consumption = 140 kW
● Dimensions = 8 x 3 x 100 feet
● Weight > 30 tons
● Computing capacity = 5,000 additions and 360 multiplications per second

Technological Achievements
● Transistor (Bell Labs, 1947)
● DEC PDP-1 (1957)
● IBM 7090 (1960)
● Integrated circuit (1958)
● IBM System 360 (1965)
● DEC PDP-8 (1965)
● Microprocessor (1971)
  ● Intel 4004
  ● 2,300 transistors
  ● Could access 300 bytes of memory

Technology Trends: Microprocessor Capacity
● Moore's Law: 2x transistors per chip every 1.5 years
● Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
● Microprocessors have become smaller, denser, and more powerful.
● Not just processors: bandwidth, storage, etc.

Pipeline (H. Ford)

DRAM access bottleneck
● Not everything is scaling up fast
● DRAM access speed has hardly improved

Latencies and Pipelines
● [Pipeline diagram: register file and instruction/data caches at 1-3 cycles, L2 cache at 10-20 cycles, main memory at 100-1000 cycles from the processor on a chip]

Hybrid SMP-cluster parallel systems
● Most modern high-performance computing systems are clusters of SMP nodes (performance/cost trade-off)
● [Diagram: an interconnection network linking SMP nodes, each with several processors sharing one memory]
● MPI parallel level
● Threads (OpenMP) parallel level
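The hybrid SMP-cluster slide above names the two parallel levels explicitly: MPI between nodes and OpenMP threads inside each node. A minimal sketch of that two-level model, assuming C with MPI and OpenMP; the vector size and the squared-norm computation are placeholder work, not an example from the talk.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

/* Two-level parallelism: one MPI process per SMP node (or per socket),
 * OpenMP threads sharing memory inside the node. */
int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int n = 1 << 20;                  /* local problem size (placeholder) */
    double *x = malloc(n * sizeof(double));
    double local_sum = 0.0;

    /* Thread (OpenMP) level inside the SMP node */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < n; i++) {
        x[i] = (double)(rank + i) / n;
        local_sum += x[i] * x[i];
    }

    /* Message-passing (MPI) level across nodes */
    double global_sum = 0.0;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks=%d threads=%d ||x||^2=%g\n",
               nranks, omp_get_max_threads(), global_sum);

    free(x);
    MPI_Finalize();
    return 0;
}
```

MPI_THREAD_FUNNELED matches the common hybrid pattern in which only the master thread of each process makes MPI calls.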
TOP500
● [Charts: TOP500 performance development]

Technology Outlook (Shekhar Borkar, Micro37)
● High-volume manufacturing year: 2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018
● Technology node (nm): 90, 65, 45, 32, 22, 16, 11, 8
● Integration capacity (billions of transistors): 2, 4, 8, 16, 32, 64, 128, 256
● Delay (CV/I) scaling: 0.7, then ~0.7, then >0.7; delay scaling will slow down
● Energy/logic-op scaling: >0.35, then >0.5, then >0.5; energy scaling will slow down
● Bulk planar CMOS: from high probability to low probability; alternate devices (3G, etc.): from low probability to high probability
● Variability: medium, then high, then very high
● ILD (k): ~3, then <3, reducing slowly towards 2-2.5
● RC delay: 1 across all generations
● Metal layers: 6-7, then 7-8, then 8-9 (0.5 to 1 layer per generation)

Increasing CPU performance: a delicate balancing act
● Packing an increasing number of gates into a tight knot while decreasing the cycle time of the processor
● Increase clock rate and transistor density; lower voltage
● [Diagram: from a single core with cache to multicore chips with 2, 4 and more cores]
● We have seen an increasing number of gates on a chip and increasing clock speeds.
● Heat is becoming an unmanageable problem; Intel processors dissipate more than 100 watts.
● We will not see dramatic increases in clock speeds in the future; however, the number of gates on a chip will continue to increase.

Moore's law

Multicore chips

ORNL Computing Power and Cooling 2006-2011
● Immediate need to add 8 MW to prepare for 2007 installs of new systems
● The NLCF petascale system could require an additional 10 MW by 2008
● Need a total of 40-50 MW for projected systems by 2011
● Numbers are just for the computers: add 75% for cooling
● Cooling will require 12,000-15,000 tons of chiller capacity
● [Chart: computer-center power projections 2005-2011, computers plus cooling, growing from around 10 MW towards 80-90 MW, with annual electricity cost labels of $3M, $9M, $17M, $23M and $31M; cost estimates based on $0.05 per kWh]
● Annual average electrical power rates ($/MWh), FY 2005: LBNL 43.70, ANL 44.92, ORNL 46.34, PNNL 49.82 [table continues through FY 2010, with rates rising into the 50-58 $/MWh range]
● Data taken from the Energy Management System-4 (EMS4), the DOE corporate system for collecting energy consumption and cost information from each DOE site; information is entered by the sites and reviewed at Headquarters for accuracy

View from the Computer Room

How to reduce energy but not performance?
● Reduce the amount of DRAM memory per core and redesign everything for energy saving
  ● Blue Gene solution
● Eliminate cache coherency in the multicore chip and use accelerators instead of general-purpose cores
  ● Cell/B.E. solution
  ● GPU solution
  ● FPGA solution

Blue Gene/P System
● Blue Gene/P continues Blue Gene's leadership performance in a space-saving, power-efficient package for the most demanding and scalable high-performance computing applications
● Chip: 4 processors, 13.6 GF/s, 8 MB EDRAM
● Compute card: 1 chip, 20 DRAMs, 13.6 GF/s, 2.0 (or 4.0) GB DDR, supports 4-way SMP
● Node card: 32 chips (4x4x2), 32 compute and 0-1 I/O cards, 435 GF/s, 64 GB
● Rack: 32 node cards, cabled 8x8x16, 1024 chips, 4096 processors, 14 TF/s, 2 TB
● System: 72 racks; final system 1 PF/s, 144 TB; November 2007: 0.596 PF/s
● Front-end node / service node: JS21 / POWER5, Linux SLES10
● HPC software: compilers, GPFS, ESSL, LoadLeveler

Cell Broadband Engine architecture
● 235 Mtransistors, 235 mm²

Cell Broadband Engine Architecture (CBEA) Technology Competitive Roadmap
● 2006-2007: Cell BE (1+8), 90 nm SOI
● Cost reduction: Cell BE (1+8), 65 nm SOI
● Performance enhancements / scaling: Advanced Cell BE (1 + 8 eDP SPE), 65 nm SOI
● Around 2010: next generation (2 PPE' + 32 SPE'), 45 nm SOI, ~1 TFlop (estimated)
● All future dates and specifications are estimations only and subject to change without notice; dashed outlines indicate concept designs

First PetaFlop computer (Nov 2008): Roadrunner at LANL
● ~7,000 dual-core Opterons: ~50 TeraFlop/s (total)
● ~13,000 eDP Cell chips: 1.4 PetaFlop/s (Cell)
● "Connected Unit" cluster: 192 Opteron nodes (180 with 2 dual-Cell blades connected with 4 PCIe x8 links)
● CU clusters joined by a second-stage InfiniBand 4x DDR interconnect (18 sets of 12 links to 8 switches)
● Second-generation IB 4X DDR

How are we going to program it?
● The MPI layer will continue
● Hybrid codes will be mandatory, if only for load balancing
● OpenMP on homogeneous processors
● But with heterogeneous processors: OpenCL, CUDA, …
● SIMD code should be provided by the compiler
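On the last point, that SIMD code should come from the compiler rather than from hand-written intrinsics: a loop written with unit stride and without aliasing is the kind of code an auto-vectorizer can turn into SIMD instructions. The saxpy routine below is a generic illustration, not code from the talk.

```c
#include <stddef.h>

/* y = a*x + y.  The 'restrict' qualifiers promise the compiler that x and
 * y do not overlap, and the unit-stride loop has no data-dependent control
 * flow, so an auto-vectorizing compiler (for example gcc at -O3) can emit
 * SIMD instructions for it without explicit intrinsics. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```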
● BSC-CNS activities

Barcelona Supercomputing Center - Centro Nacional de Supercomputación
● Mission
  ● Investigate, develop and manage technology to facilitate the advancement of science.
● Objectives
  ● Operate the national supercomputing facility
  ● R&D in supercomputing
  ● Collaborate in e-Science R&D
● Public consortium
  ● The Spanish Government (MEC): 51%
  ● The Catalan Government (DURSI): 37%
  ● The Technical University of Catalonia (UPC): 12%

Location

Blades, blade centers and racks
● JS21 processor blade
  ● 2x2 PPC 970MP, 2.3 GHz
  ● 8 GB memory
  ● 36 GB SAS hard disk
  ● 2x1Gb Ethernet on board
  ● Myrinet daughter card
● Blade center
  ● 14 blades per chassis (7U)
  ● 56 processors
  ● 112 GB memory
  ● Gigabit Ethernet switch
● Rack
  ● 6 chassis in a rack (42U)
  ● 336 processors
  ● 672 GB memory

Network: Myrinet
● [Diagram: 10 Clos 256x256 switches, each with 256 links (one to each node), connected to 2 Spine 1280 switches through 128 links; 250 MB/s in each direction]

MareNostrum
● [Photo: blade centers, storage servers, Gigabit switch, Myrinet racks, operations rack, 10/100 switches]
● 2560 JS21 blades
  ● 2 PPC 970MP, 2.3 GHz
  ● 8 GB memory (20 TB in total)
  ● 36 GB SAS hard disk
  ● Myrinet daughter card
  ● 2x1Gb Ethernet on board
● Myrinet
  ● 10 Clos 256+256 switches
  ● 2 Spine 1280 switches
● 20 storage nodes
  ● 2 p615, 2 POWER4+, 4 GB
  ● 28 SATA disks of 512 GB (280 TB in total)
● Performance summary
  ● 4 instructions per cycle at 2.3 GHz
  ● 10,240 processors
  ● 94.21 TFlops
  ● 20 TB memory, 300 TB disk
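The 94.21 TFlops in the performance summary is simply the peak arithmetic of the machine; as a quick check, taking the 4 instructions per cycle quoted on the slide as 4 floating-point operations per cycle per processor:

```latex
R_{\mathrm{peak}} = N_{\mathrm{proc}} \times f_{\mathrm{clock}} \times \mathrm{ops/cycle}
                  = 10240 \times 2.3\,\mathrm{GHz} \times 4
                  \approx 94.2\ \mathrm{TFlops}
```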
Additional Systems
● Tape facility
  ● 6 Petabytes
  ● LTO4 technology
  ● HSM and backup
● Shared-memory system (Altix)
  ● 128 Montecito cores
  ● 2.5 TB main memory

Spanish Supercomputing Network
● Magerit, CaesarAugusta, Altamira, LaPalma, Picasso, Tirant

RES services
● The Red Española de Supercomputación (RES) supercomputers can be accessed free of charge by any public Spanish research group; MareNostrum is the main RES node.
● The web application form and instructions can be found on the web page www.bsc.es (Support & Services / RES)
● An external committee evaluates the proposals
● Access is reviewed every 4 months
● For any question, contact the BSC operations director: Sergi Girona (sergi.girona@bsc.es)

Top500: who is who?
● [Bar chart: share of TOP500 systems by country, from Australia to the United States]

Can Europe compete?
● [Bar charts: aggregate TOP500 share of the EU, Japan and China, and the share of individual European countries compared with the USA]

ESFRI: European Infrastructure Roadmap
● The high-end (capability) resources should be implemented every 2-3 years in a "renewal spiral" process
● A tier-0 centre's total cost over a 5-year period shall be in the range of 200-400 M€
● With supporting actions in the national/regional centres to maintain the transfer of knowledge and feed projects to the top capability layer
● [Pyramid: tier-0, tier-1, tier-2 centres]

PRACE
● [Map: principal partners, general partners and associated partners in the PRACE tier-1 ecosystem; GENCI]

BSC-IBM MareIncognito project
● Our 10 Petaflop research project for BSC (2011)
● Port and develop applications to reduce time-to-production once the machine is installed
● Programming models
● Tools for application development and to support the previous evaluations
● Evaluate node architecture
● Evaluate interconnect options
● [Diagram: project areas covering application development and tuning, performance analysis and prediction tools, fine-grain programming models, load balancing, interconnect, processor and node, model and prototype]

BSC Departments
● Computational Mechanics
● Applied Computer Science
● Optimization

What are the CASE objectives?
● Identify scientific communities with supercomputing needs and help them to develop software
  ● Material science (SIESTA)
  ● Fusion (EUTERPE, EIRENE, BIT1)
  ● Spectroscopy (OCTOPUS, ALYA)
  ● Atmospheric modeling (ALYA, WRF)
  ● Geophysics (BSIT, ALYA)
● Develop our own technology in computational mechanics
  ● ALYA, BSIT, …
● Perform technology transfer with companies
  ● REPSOL, AIRBUS, …

Who needs 10 Petaflops?

Airbus 380 Design

Seismic Imaging: RTM (REPSOL)
● [Images: subsurface images obtained with SPM and with RTM]

RTM Performance on the Cell Platform
● Platform   Gflops   Power (W)   Gflops/W
  JS21         8.3      267        0.03
  QS20       108.2      315        0.34
  QS21       116.6      370        0.32
● 22.1 GB/s of memory bandwidth used
● [Chart: measured vs. ideal speed-up for 1 to 8 SPUs]

ALYA Computational Mechanics and Design
● In-house development
● Parallel
● Coupled multiphysics
  ● Fluid dynamics
  ● Structure dynamics
  ● Heat transfer
  ● Wave propagation
  ● Excitable media…

Alya: Multiphysics Code
● Services: Parall (parallelization by domain decomposition), Dodeme (domain decomposition), Solmum (MUMPS sparse direct solver), Optima (optimization)
● Kernel: mesh, coupling, solvers, input/output
● Modules
  ● Nastin: incompressible Navier-Stokes, ρ ∂u/∂t + ρ(u·∇)u - ∇·σ = ρg with ∇·u = 0
  ● Nastal: compressible Navier-Stokes in conservative form (ρ, ρu and total energy E)
  ● Turbul: turbulence models (k-ε, k-ω, Spalart-Allmaras)
  ● Temper: heat transfer, ρc_p ∂T/∂t + ρc_p u·∇T = ∇·(k∇T) plus viscous dissipation
  ● Solidz: structure dynamics
  ● Exmedi: excitable media
  ● Apelme: fracture mechanics
  ● Gotita: droplet impingement (icing)
  ● Wavequ: wave propagation

ALYA keywords
● Multi-physics modular code for High Performance Computational Mechanics
● Numerical solution of PDEs
● Variational methods are preferred (FEM)…
● Coupling between multi-physics problems (loose or strong)
● Explicit and implicit formulations
● Hybrid meshes, non-conforming meshes
● Advanced meshing issues
● Parallelization by MPI + OpenMP
● Automatic mesh partition using Metis (see the partitioning sketch below)
● Portability is a must (compiled on Windows, Linux, MacOS)
● Porting to new architectures: Cell, …
● Scalability tested on:
  ● IBM JS21 blades on MareNostrum: BSC, 10,000 CPUs
  ● IBM Blue Gene/P and /L: IBM Lab. Montpellier and Watson, 4,000 CPUs
  ● SGI Altix shared memory: BSC, Barcelona, 128 CPUs
  ● PC clusters, 10-80 CPUs
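The ALYA keywords above mention automatic mesh partitioning with Metis before the MPI level takes over. A minimal sketch of that step, assuming the METIS 5.x C API; the tiny ring graph and the choice of 4 parts are placeholders, not ALYA data structures.

```c
#include <stdio.h>
#include <metis.h>   /* METIS 5.x graph-partitioning library */

/* Partition a small graph into 4 parts.  In a code like ALYA the graph
 * would be derived from the finite-element mesh, and each resulting part
 * (subdomain) would be assigned to one MPI rank. */
int main(void)
{
    idx_t nvtxs = 6, ncon = 1, nparts = 4, objval;
    /* 6 vertices connected in a ring, stored in CSR form (xadj/adjncy) */
    idx_t xadj[]   = {0, 2, 4, 6, 8, 10, 12};
    idx_t adjncy[] = {1, 5,  0, 2,  1, 3,  2, 4,  3, 5,  4, 0};
    idx_t part[6];

    int status = METIS_PartGraphKway(&nvtxs, &ncon, xadj, adjncy,
                                     NULL, NULL, NULL,    /* no weights     */
                                     &nparts, NULL, NULL, /* balanced parts */
                                     NULL, &objval, part);
    if (status != METIS_OK)
        return 1;

    for (idx_t i = 0; i < nvtxs; i++)
        printf("vertex %d -> subdomain %d\n", (int)i, (int)part[i]);
    printf("edge cut = %d\n", (int)objval);
    return 0;
}
```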
Alya speed-up
● MareNostrum, IBM blades; boundary-layer flow, 25M hexahedral elements
● NASTAL module: explicit compressible flow, fractional step
● NASTIN module: implicit incompressible flow, fractional step
● [Charts: speed-up curves for both modules]

CASE R&D: Aero-Acoustics
● High-speed train

CASE R&D: Automotive
● Ahmed body benchmark
● Wind speed 120 km/h

CASE R&D: Building & Energy
● Benchmark cavity
● MareNostrum cooling

CASE R&D: Aerospace
● Icing simulation
● Subsonic / transonic / supersonic flows
● Adjoint methods in shape optimization

CASE R&D: Aerospace
● Subsonic cavity flow (Mach 0.82)

CASE R&D: Free-surface problems
● Level-set method

CASE R&D: Mesh generation
● Meshing the boundary layer

CASE R&D: Mesh adaptivity
● Meshing

CASE R&D: Atmospheric Flows
● San Antonio quarter (Barcelona)

CASE R&D: Meteo Mesh
● Surface from topography
● Semi-structured in the volume

CASE R&D: Biomechanics
● Cardiac simulator
● By-pass flow
● Arterial brain system

CASE R&D: Biomechanics
● Nose air flow

Scalability problems: the deflated PCG

The deflated PCG
● The mesh partitioner slices arteries, so most subdomains have two neighbours
● But there are fat meeting points of arteries, which give some subdomains many more neighbours

Parallel footprint (512 processors)
●               Efficiency   Load balance
  Overall       0.67         0.92
  GMRES         0.74         0.92
  Deflated CG   0.43         0.83
● [Trace: momentum solver (about 120 ms) and pressure solver (about 6.6 ms) phases, with send/receive and all-reduce communication]
● 170 μs: very fine grain
● 8 bytes per all-reduce: support for fast reductions would be useful

Solver continuity: Deflated CG
● Subdomains with lots of neighbours
● Per-iteration communication: send/receive exchanges plus all-reduce operations (one of 500 x 8 bytes and several of 8 bytes); see the all-reduce sketch at the end
● [Diagram: sequence of Sendrec and All_reduce steps, labelled (1), (2), (6), (9)]

● Conclusions

The accelerator era
● [Chart: the performance "wedge of opportunity" between conventional multi-core / multi-threading and accelerators such as Cell, GPUs, FPGAs and vector units]

Near Future Supercomputing Trends
● Performance will be provided by
  ● Multi-core
  ● Without cache coherency
  ● With accelerators (top-down approach)
● Programming is going to suffer a revolution
  ● OpenCL
  ● CUDA
  ● …
● Compilers should provide the SIMD parallelism level

Thank you!
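As a final illustration, the fine-grain pattern highlighted on the parallel-footprint and deflated-CG slides, one 8-byte MPI_Allreduce per dot product inside the pressure solver, looks roughly like the sketch below; the routine is illustrative and not taken from Alya.

```c
#include <mpi.h>

/* Dot product inside a distributed CG iteration: each rank reduces its
 * local part, then a single 8-byte MPI_Allreduce produces the global value.
 * At around 170 us per reduction this is the "very fine grain" communication
 * the parallel-footprint slide points at, and why hardware support for fast
 * small reductions would help. */
double parallel_dot(const double *x, const double *y, int n_local, MPI_Comm comm)
{
    double local = 0.0, global = 0.0;
    for (int i = 0; i < n_local; i++)
        local += x[i] * y[i];

    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}
```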