Computing in Space
PRACE Keynote, Linz
Oskar Mencer, April 2014

Thinking Fast and Slow
Daniel Kahneman, Nobel Prize in Economics, 2002

14 × 27 = ?

Kahneman splits thinking into:
• System 1: fast, hard to control
• System 2: slow, easier to control

….. 300
….. 378

Assembly-line computing in action
• SYSTEM 1: x86 cores
• SYSTEM 2: flexible memory plus logic
• Optimal encoding, low-latency memory system, high-throughput memory: minimize data movement

Temporal Computing (1D)
• A program is a sequence of instructions
• Performance is dominated by:
  – Memory latency
  – ALU availability
[Diagram: CPU and memory; a timeline of "Get Instruction n → Read data n → Compute → Write Result n" repeated for n = 1, 2, 3, with the actual computation time only a small slice of each step.]

Spatial Computing (2D)
• Synchronous data movement
• Throughput dominated
[Diagram: data streams in, flows through a fabric of ALUs and buffers under synchronous control, and streams out: read data [1..N] → computation → write results [1..N].]

Computing in Time vs Computing in Space

Computing in Time:
• 512 controlflow cores at 2 GHz
• 10 KB on-chip SRAM
• 8 GB on-board DRAM
• 1 result every 100* clock cycles

Computing in Space:
• 10,000* dataflow cores at 200 MHz
• 5 MB on-chip SRAM at >10 TB/s
• 96 GB of DRAM per DFE
• 1 result every clock cycle

*depending on the application!

=> ~200x faster per manycore card
=> ~10x less power
=> ~10x bigger problems per node
=> ~10x fewer nodes needed
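As a minimal sketch of this contrast (not taken from the slides, and anticipating the x² + 30 OpenSPL example that appears later in this deck), the same computation can be written both ways. The loop is ordinary Java; the kernel reuses only the OpenSPL constructs shown later (SCSVar, io.input, io.output), so treat the surrounding kernel setup as an assumption.

// Computing in Time (1D): one result per loop iteration, paced by
// instruction fetch, memory latency and ALU availability.
int N = 1024;
int[] x = new int[N];   // input data, assumed already filled
int[] y = new int[N];
for (int i = 0; i < N; i++) {
    y[i] = x[i] * x[i] + 30;
}

// Computing in Space (2D): the same arithmetic described once as a
// dataflow kernel; data items [1..N] stream through the multiplier
// and adder, producing one result per clock cycle.
SCSVar x = io.input("x", scsInt(32));
SCSVar result = x * x + 30;
io.output("y", result, scsInt(32));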
OpenSPL in Practice
The new CME Electronic Trading Gateway will be going live in March 2014!
Webinar page: http://www.cmegroup.com/education/new-ilink-architecture-webinar.html
CME Group Inc. (Chicago Mercantile Exchange) is one of the largest options and futures exchanges. It owns and operates large derivatives and futures exchanges in Chicago and New York City, as well as online trading platforms. It also owns the Dow Jones stock and financial indexes, and CME Clearing Services, which provides settlement and clearing of exchange trades. [from Wikipedia]

Maxeler Seismic Imaging Platform
• Maxeler provides hardware plus application software for seismic modeling.
• MaxSkins allow access to Ultrafast Modelling and RTM for research and development of RTM and Full Waveform Inversion (FWI) from MATLAB, Python, R, C/C++ and Fortran.
• Bonus: MaxGenFD is a MaxCompiler plugin that lets the user specify any 3D finite-difference problem (the PDE, coefficients, boundary conditions, etc.) and automatically generates a fully parallelized implementation for a whole rack of Maxeler MPC nodes.
Application areas:
• O&G
• Weather
• 3D PDE solvers
• High energy physics
• Medical imaging

Example: a dataflow graph generated by MaxCompiler, with 4866 static dataflow cores in 1 chip.

Mission Impossible?

Computing in Space - Why Now?
• Semiconductor technology is ready
  – Within ten years (2003 to 2013) the number of transistors on a chip went up from 400M (Itanium 2) to 5 billion (Xeon Phi)
• Memory performance isn't keeping up
  – Memory density has followed the trend set by Moore's law
  – But memory latency has increased from 10s to 100s of CPU clock cycles
  – As a result, on-die cache grew from 15% of die area (1 µm) to 40% (32 nm)
  – The memory latency gap could eliminate most of the benefits of CPU improvements
• Petascale challenges (10^15 FLOPS)
  – Clock frequencies have stagnated in the few-GHz range
  – Energy usage and power wastage of modern HPC systems are becoming a huge economic burden that can no longer be ignored
  – Requirements for annual performance improvements grow steadily
  – Programmers continue to rely on sequential execution (the 1D approach)
• For affordable petascale systems => a novel approach is needed

OpenSPL Example: x² + 30

SCSVar x = io.input("x", scsInt(32));
SCSVar result = x * x + 30;
io.output("y", result, scsInt(32));

[Dataflow graph: x feeds both inputs of a multiplier, the constant 30 feeds an adder, and the sum flows to output y.]

OpenSPL Example: Moving Average

Yn = (Xn-1 + Xn + Xn+1) / 3

SCSVar x = io.input("x", scsFloat(7,17));
SCSVar prev = stream.offset(x, -1);
SCSVar next = stream.offset(x, 1);
SCSVar sum = prev + x + next;
SCSVar result = sum / 3;
io.output("y", result, scsFloat(7,17));

OpenSPL Example: Choices

SCSVar x = io.input("x", scsUInt(24));
SCSVar result = (x > 10) ? x + 1 : x - 1;
io.output("y", result, scsUInt(24));

[Dataflow graph: x feeds a comparison with 10, an increment and a decrement; the comparison selects which of the two results flows to output y.]
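Putting the last two examples together, a hypothetical kernel (not from the deck; it only reuses the SCSVar, stream.offset and choice constructs shown above, so any behaviour beyond that is an assumption) could compute the absolute difference between neighbouring stream elements:

SCSVar x = io.input("x", scsFloat(7,17));
SCSVar prev = stream.offset(x, -1);           // previous element of the stream
SCSVar diff = x - prev;                       // first difference
SCSVar result = (diff > 0) ? diff : prev - x; // absolute value via the choice operator
io.output("y", result, scsFloat(7,17));

As in the moving-average example, the offset turns a neighbouring element in time into a neighbouring wire in space, so the kernel still delivers one result per clock cycle.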
OpenSPL and MaxAcademy
17 lectures/exercises on the theory and practice of Computing in Space:
LECTURE 1: Concepts for Computing in Space
LECTURE 2: Converting Temporal Code to Graphs
LECTURE 3: Computing, Storage and Networking
LECTURE 4: OpenSPL
LECTURE 5: Dataflow Engines (DFEs)
LECTURE 6: Programming DFEs (Basics)
LECTURE 7: Programming DFEs (Advanced)
LECTURE 8: Programming DFEs (Dynamic and multiple kernels)
LECTURE 9: Application Case Studies I
LECTURE 10: Making things go fast
LECTURE 11: Numerics
LECTURE 12: Application Case Studies II
LECTURE 13: System Perspective
LECTURE 14: Verifying Results
LECTURE 15: Performance Modelling
LECTURE 16: Economics of Computing in Space
LECTURE 17: Summary and Conclusions

Maxeler Dataflow Engine Platforms
• High Density DFEs: Intel Xeon CPU cores and up to 6 DFEs with 288 GB of RAM
• The Dataflow Appliance: dense compute with 8 DFEs, 384 GB of RAM and dynamic allocation of DFEs to CPU servers with zero-copy RDMA access
• The Low Latency Appliance: Intel Xeon CPUs and 1-2 DFEs with direct links to up to six 10Gbit Ethernet connections

Bringing Scalability and Efficiency to the Datacenter

3000³ Modeling
Compared to 32 3 GHz x86 cores parallelized using MPI.
[Chart: equivalent CPU cores (up to ~2,000) versus number of MAX2 cards (1, 4, 8) for 15 Hz, 30 Hz, 45 Hz and 70 Hz peak frequency. *Presented at SEG 2010.]
8 full Intel racks (~100 kW) => 2 MaxNodes (2U) Maxeler system (<1 kW)

Typical Scalability of Sparse Matrix
[Charts: relative speed versus number of cores for the Visage geomechanics FEM benchmark (2-node Nehalem, 2.93 GHz) and the Eclipse E300 2-Mcell benchmark (2-node Westmere, 3.06 GHz).]

Sparse Matrix Solving (O. Lindtjorn et al., 2010)
• Given matrix A and vector b, find vector x such that Ax = b
• Typically memory bound, not parallelisable
• 1 MaxNode achieved 20-40x the performance of an x86 node
[Chart: speedup per 1U node (up to ~60x) versus compression ratio for several benchmark matrices.]

Domain Specific Compression: Address and Data Encoding

Global Weather Simulation
• Atmospheric equations
• Equations: Shallow Water Equations (SWEs)

∂Q/∂t + (1/Λ)·∂(ΛF¹)/∂x¹ + (1/Λ)·∂(ΛF²)/∂x² + S = 0

[L. Gan, H. Fu, W. Luk, C. Yang, W. Xue, X. Huang, Y. Zhang, and G. Yang, "Accelerating solvers for global atmospheric equations through mixed-precision data flow engine", FPL 2013]

Is double precision always needed?
• Range analysis to track the absolute values of all variables, identifying parts of the pipeline that can use fixed-point or reduced-precision arithmetic

What about error vs area tradeoffs?
• Bit-accurate simulations for different bit-width configurations

Accuracy validation:
[Chao Yang, Wei Xue, Haohuan Fu, Lin Gan, et al., "A Peta-scalable CPU-GPU Algorithm for Global Atmospheric Simulations", PPoPP 2013]

And there is also a performance gain:

Platform         Performance   Speedup
6-core CPU       4.66K         1x
Tianhe-1A node   110.38K       23x
MaxWorkstation   468.1K        100x
MaxNode          1.54M         330x

Mesh size: 1024 × 1024 × 6. MaxNode speedup over a Tianhe-1A node: 14x.

And power efficiency too:

Platform         Efficiency    Speedup
6-core CPU       20.71         1x
Tianhe-1A node   306.6         14.8x
MaxWorkstation   2.52K         121.6x
MaxNode          3K            144.9x

Mesh size: 1024 × 1024 × 6. The MaxNode is 9x more power efficient than a Tianhe-1A node.

Weather and climate models on DFEs
Which one is better? A finer grid and higher precision are obviously preferred, but the computational requirements increase => power usage => $$.
What about using reduced precision? (15 bits instead of 64-bit double-precision floating point)

Weather models precision comparison
What about 15 days of simulation?
Surface pressure after 15 days of simulation for the double-precision and reduced-precision runs: the quality of the simulation is hardly reduced.

MAX-UP: Astro Chemistry (CPU vs DFE)

Does it work? Test problem:
• 2D linear advection
• 4th-order Runge-Kutta
• Regular torus mesh
• Gaussian bump
The bump is advected across the torus mesh; after 20 timesteps it should be back where it started.
[Figure: the bump at t = 20]

CFD Performance
• Max3A workstation with a Xilinx Virtex-6 475T and a 4-core i7
• For this 2D linear advection test problem we achieve ca. 450M degree-of-freedom updates per second
• For comparison: a GPU implementation (of a Navier-Stokes solver)

CFD Conclusions
• You really can do unstructured meshes on a dataflow accelerator
• You really can max out the DRAM bandwidth
• You really can get exciting performance
• You have to work pretty hard, or build on the work of others
• This was not an acceleration project: we designed a generic architecture for a family of problems

We're Hiring
Candidate profiles:
• Acceleration Architect (UK)
• Application Engineer (USA)
• System Administrator (UK)
• Senior PCB Designer (UK)
• Hardware Engineer (UK)
• Networking Engineer (UK)
• Electronics Technician (UK)