Data Movement Dominates (DMD) and CoDEx: CoDesign for Exascale
Architectural Simulation and Modeling for Exascale Platform Development

Arun Rodrigues, Scott Hemmert, Dave Resnick: Sandia National Lab (ABQ)
John Shalf, David Donofrio: Lawrence Berkeley National Laboratory
John Shalf, Paul Hargrove: Lawrence Berkeley National Laboratory
Curtis Janssen, Helgi Adalsteinsson: Sandia National Laboratories
Gilbert Hendry: Sandia National Laboratory
Keren Bergman: Columbia University
Dan Quinlan: Lawrence Livermore National Laboratory
Dan Quinlan, Chunhua Liao: Lawrence Livermore National Lab
Bruce Jacob: U. Maryland
Sudhakar Yalamanchili: Georgia Tech
http://www.nersc.gov/projects/CoDEx

Codesign Tools Recap: Architectural Simulation to Accelerate CoDesign
• ROSE: application analysis
  – ROSE Compiler: Enables deep analysis of application requirements, semi-automatic generation of skeleton applications, and code generation for ACE and SST.
• ACE: node-level emulation
  – ACE Node Emulation: Rapid design synthesis and FPGA-accelerated emulation for rapid prototyping of cycle-accurate models of manycore node designs.
• SST: system-level models
  – SST Macro System Simulation: Enables system-scale simulation through capture of application communication traces and simulation of large-scale interconnects.
  – SST Micro Software Simulators: Software simulation for node-level simulation.
• CoDEx: CoDesign for Exascale. ASCR-funded simulation infrastructure project.
• SST: Structural Simulation Toolkit. NNSA-funded simulation tools (ASC Program).
• CAL: Computer Architecture Laboratory (Sandia/LBL).

Fidelity vs.
Scope for Architectural Simulation Methods
[Figure: simulation scope/parallelism (10^0 to 10^7) plotted against simulation fidelity (crude guess, rough idea, cause and effect, very good estimates, exact hardware model). Labeled regions: coarse-grained simulation (SST/macro), constitutive models, emulation (Green Flash).]

ROSE Compiler: Full Program Understanding through Deep Source-Code Analysis
http://www.roseCompiler.org
• Input: C/C++/Fortran/OpenMP/UPC source code, or a binary executable
• Front end: EDG front-end / Open Fortran Parser, then the EDG/Fortran-to-ROSE connector
• Binaries: disassembly and representation plus instruction semantics
• Internal Representation (IR) shared by program analysis and program transformation (both supplied by the user)
• Back end: the ROSE unparser emits transformed source code for the vendor compiler
• Graphs built over the IR: control-flow, data-dependency, control-dependency, system-dependency, sliced-system-dependency
• Example input for slicing:

    int aFunction(int a, int b) {
        int c = b;
        return a;
    }

    int main() {
        int a, b, c, d, e;
        int i = 4;
        for (i = 0; i < 10; i++) {
            int j = 55;
            c = i + j;
            c = aFunction(i, c);
            a = aFunction(a + 1, b);
        }
    #pragma SliceTarget a;
        return 0;
    }

ExaSAT: Exascale Static Analysis Tool
Compiler-Automated Performance Model Extraction
• Can automatically predict performance for many input codes and software optimizations
• Predict performance under different architectural scenarios
• Much faster than hardware simulation and manual modeling
[Figure: combustion codes pass through compiler analysis to produce an XML performance model (dependency graph); the model is combined with machine parameters, user parameters, and optimizations to fill a performance prediction spreadsheet.]

SST/macro: Coarse-Grained Simulation
• An application code with minor modifications
• SST/macro implementation of interfaces (MPI), which simulate execution and communication
[Figure: the SST/macro simulator. Skeleton application run as lightweight threads; software libraries with complete library implementations (e.g. MPI, compute) and services (e.g. job scheduler); coarse-grained hardware models (processor, memory, network, NIC, disk); discrete event simulation on established simulation platforms (SST/macro, SST/core, OMNeT++, SystemC).]

SST Simulation Project Overview
SST/micro: Cycle-Accurate Framework

Goals
• … standard architectural framework for HPC
• … evaluate future systems … workloads
• … computers to design … computers

Status
• Current release (2.1) at code.google.com/p/sst-simulator/
• Includes parallel simulation core, configuration, power models, basic network and processor models, and interface to detailed memory model
• Has a general simulation framework for integrating models
• Simulation backend is parallel

Technical Approach
• … Discrete Event core with optimization over MPI
• … tech. models for power: … Panalyzer
• … simple models for …, network, and memory
• … non-viral, modular

Consortium
• "Best of Breed" simulation suite
• Plenty of people involved
• Combine lab, academic, & industry

Some Models Currently Integrated: Gem5 & MacSim
Wednesday, March 28, 2012
• Gem5
  – M5: Modular platform for computer system architecture research, encompassing system-level architecture as well as processor microarchitecture
  – Provides detailed, full-system CPU models for x86, ARM, SPARC, Alpha
  – Gem5 is a well-known architectural simulator, with models for processors, caches, busses, and network components.
• MacSim
  – Trace driven (x86 & PTX)
  – Models OoO and GPU-like cores
  – MacSim provides a model of GPU/CPU cores or heterogeneous computing nodes, which can be driven from x86 or PTX (CUDA) traces.
[Figure: Gem5 SimObjects and ports connected through queues to SST::Component objects joined by SST::Link.]

MacSim Introduction
MacSim is a heterogeneous architecture simulator, specifically supporting the x86 ISA and the NVIDIA PTX ISA. It is a trace-driven, cycle-level simulator. It can simulate homogeneous-ISA and heterogeneous-ISA multicore configurations. It uses Ocelot for PTX trace generation and Pin to generate x86 traces. Both traces are converted into internal RISC-style uops, and those uops are simulated. MacSim is a microarchitecture simulator that models a detailed pipeline (in-order and out-of-order) and a memory system including caches, NoC, and memory controllers. It supports asymmetric multicore configurations (small cores + medium cores + big cores) and SMT or MT architectures as well. Currently, an interconnection network model (based on IRIS) and a power model (based on McPAT) are connected. ARM ISA support is in progress. MacSim is also one of the components of SST, so multiple MacSim simulators can run concurrently.

Figure 1 shows the overview of the MacSim simulator.
[Figure 1. The overview of MacSim Simulator: CUDA code (*.cu) is compiled by NVCC to PTX code and fed to the GPUOcelot emulator/trace generator; x86 binaries are fed to the Pin trace generator; both trace streams drive the MacSim heterogeneous architecture timing & power simulator.]
[Figure: MacSim cores and caches instantiated as SST components alongside SST memory components, with CPU and GPU memory.]

• IRIS provides a pipelined, cycle-accurate router model capable of modeling a variety of Network-on-Chip (NoC) and inter-node interconnection architectures.
• PhoenixSim models photonic networks.

Leveraging Embedded Design Automation For Design Space Exploration
• Application-optimized processor implementation (RTL/Verilog)
  – Blocks: Base CPU, OCD, Apps Datapaths, Cache, Timer, Extended Registers, FPU
• Processor configuration
  1. Select from menu
  2. Automatic instruction discovery (XPRES Compiler)
  3. Explicit instruction description (TIE)
• Processor Generator (Tensilica)
This stuff is essential!
• Build with any process in any fab
• Tailored SW tools: compiler, debugger, simulators, Linux, other OS ports (automatically generated together with the core)

Embedded Design Automation (Using FPGA emulation to do rapid prototyping)
• Application-optimized processor implementation (RTL/Verilog)
  – Blocks: Base CPU, OCD, Apps Datapaths, Cache, Timer, Extended Registers, FPU
• Processor configuration
  1. Select from menu
  2. Automatic instruction discovery (XPRES Compiler)
  3. Explicit instruction description (TIE)
• Processor Generator (Tensilica)
• Build with any process in any fab, or "tape out" to FPGA
• Tailored SW tools: compiler, debugger, simulators, Linux, other OS ports (automatically generated together with the core)
• RAMP FPGA-accelerated emulation of ASIC

Data Movement Dominates (Sandia, Micron, Columbia, LBL)
Understand the Potential of Intelligent, Stacked DRAM Technology
• Data movement is projected to account for over 75% of the power budget for an exascale platform
• Work to reduce that via
  – Optical interconnect(s)
  – 3D stacking (logic + memory + optics)
  – New memory protocols
• Research questions
  – What is the performance potential of stacked memory (power & speed)?
  – How much intelligence to put into the logic layer? (Atomics, gather/scatter, checksums, full processor-in-memory)
  – What is the memory consistency model for intelligent DRAM?
  – How to program it if we embed more intelligence into DRAM?
[Figure: 3D stack of DRAM layers over a logic layer and a photonic layer, with laser source, waveguide, modulators, and receivers.]

The Cost of Moving Data
[Figure: energy in picojoules (log scale, 1 to 10000), "now" vs. 2018, for DP FLOP, register, 1mm on-chip, 5mm on-chip, off-chip/DRAM, local interconnect, and cross system. Regions: on-chip/CMP communication, intranode/SMP communication, intranode/MPI communication.]

Locality Management is Key
• What are the best combinations of software and hardware mechanisms to maximize data movement efficiency?
  – Vertical locality management (temporal)
  – Horizontal locality management (topological)

Why Study Chip Stacking (TSVs)?
$\text{Energy} = (V^2 \cdot C) \cdot \text{Overhead} + E_{\text{comm}}$

DRAM Cells Efficient
• DRAM cells require < 1 pJ to access
• Current DRAM architectures are not power efficient
• Long distances ➔ high power
• We pay for more than we get at every level
  – Cache: throw away 75-80%
  – DRAM row: charge 1024B for each 64B access
  – DIMM: charge 8-9 chips/access
  – ~800 pJ/byte total
• DRAM design driven by packaging constraints
  – ~50% of DRAM chip cost is packaging, mainly in pins
  – DIMMs use multiple chips with a few data pins to achieve high BW

TSVs Reduce Costs
• TSVs: orders of magnitude less energy
  – 250 fJ/bit for reading DRAM
  – 5 fJ/bit for TSV
  – 250 fJ/bit for mem. controller
  – ~0.5 pJ/bit (compared to 30 pJ for conventional DIMM)
  – Don't have to access more data than needed
• Enables
  – Lower capacitance: narrower
  – Lower overhead: smarter
  – In-memory computation
• Requires
  – Changes to how we view the machine & the memory

Why Photonics?
Photonics changes the rules for Bandwidth-per-Watt.

ELECTRONICS:
• Buffer, receive, and re-transmit at every router.
• Space parallelism: each bus lane routed independently (P ∝ N_LANES).
• Off-chip BW requires much more power than on-chip BW.

PHOTONICS:
• Modulate/receive data stream once per communication event.
• Wavelength parallelism: broadband switch routes entire multi-wavelength stream.
• Off-chip BW ≈ on-chip BW for nearly same power.
[Figure: an electronic path with a TX/RX pair at every hop, contrasted with a photonic path with a single TX/RX pair end to end.]

Why Optically-Connected Memory?
[Figure: a CPU attached to HBDRAM modules over an electronic bus (traditional memory), next to a CPU attached to HBDRAM modules over optical links (optically-connected memory).]

Traditional Memory: Electronic Bus
• Large pin-out
• Complex wiring
• Low bandwidth density
• Distance constrained by electrical limitations
• High power dissipation
Will not scale to meet power and bandwidth requirements of future high-performance computing systems.

Optically-Connected Memory: Optical Link
• All-optical link, no electronic bus to drive
• Bit-rate transparent link
• High bandwidth density, fewer pins
• Distance immunity at computer scale
• Low power dissipation
Enables scaling of high-performance computing through increased memory capacity and bandwidth.

Mixed Model Simulation
Cycle-accurate and energy-accurate models
• Workload inputs: MPI traces (DUMPI), kernels (C, C++, Fortran), skeleton app
• Workload translation
• SST/macro, with checkpoint/restart and fault injection
• SystemC
  – Processor model (SST/micro & Tensilica) (C++)
  – Address translation
  – NoC model (PhoenixSim)
  – Memory model (DRAMSim2, FLASHsim, NVRAM)
[The following slides repeat the diagram above with an inset of the "Cost of Moving Data" figure (picojoules, "now" vs. 2018).]

Simulator Infrastructure: Interconnects
• Developed by Sandia collaborators, CoDEx project

Simulator Infrastructure: Memory
• Validated against Micron DRAM
• HMC model coming this summer

Simulator Infrastructure
• Rewrote Columbia PhoenixSim summer 2011
• Orion-2 energy model
• Validated against Cornell test parts

Simulator Infrastructure
• Full gate-level RTL model of processor
• Well characterized energy model
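
As a worked example of the kind of energy accounting these models support, the per-bit figures quoted on the chip-stacking slide can be totaled in a few lines. This is only a sketch of the arithmetic: the function and constant names are hypothetical, and it assumes the slide's "30 pJ for conventional DIMM" is a per-bit figure comparable to the ~0.5 pJ/bit stacked total.

```python
# Per-bit energy figures quoted on the chip-stacking slide, in femtojoules.
# The dictionary keys are illustrative names, not part of any simulator API.
STACKED_FJ_PER_BIT = {
    "dram_read": 250.0,          # "250 fJ/bit for reading DRAM"
    "tsv": 5.0,                  # "5 fJ/bit for TSV"
    "memory_controller": 250.0,  # "250 fJ/bit for mem. controller"
}

# Assumption: the slide's "30 pJ for conventional DIMM" is per bit.
CONVENTIONAL_DIMM_PJ_PER_BIT = 30.0


def stacked_pj_per_bit(parts=STACKED_FJ_PER_BIT):
    """Sum the per-bit contributions and convert fJ to pJ."""
    return sum(parts.values()) / 1000.0


def row_overfetch(row_bytes=1024, access_bytes=64):
    """Bytes charged per byte actually used ("1024B for each 64B access")."""
    return row_bytes / access_bytes


def energy_ratio():
    """Conventional DIMM energy per bit relative to the stacked estimate."""
    return CONVENTIONAL_DIMM_PJ_PER_BIT / stacked_pj_per_bit()


if __name__ == "__main__":
    print(f"stacked access:  {stacked_pj_per_bit():.3f} pJ/bit")
    print(f"row overfetch:   {row_overfetch():.0f}x")
    print(f"DIMM vs stacked: {energy_ratio():.1f}x")
```

The three contributions sum to 505 fJ/bit, which matches the slide's "~0.5 pJ/bit", and the row overfetch factor is 16. The ratio printed last is only as good as the per-bit assumption stated above.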