PATMOS 2015, the 25th International Workshop on Power and Timing Modeling, Optimization and Simulation; Salvador, Bahia, Brazil, Sept 1-5, 2015

Reiner Hartenstein, TU Kaiserslautern
How to cope with the Power Wall

>> Outline <<
• The Power Wall
• “Dataflow” Computing
• Reconfigurable Computing
• Time to Space Mapping
• The Xputer Paradigm
• Conclusions
http://www.uni-kl.de
© 2015, reiner@hartenstein.de, http://hartenstein.de

The PATMOS conference series
• spin-off from the EU's PATMOS project
• oldest conference series on power efficiency issues: 3 years older than ISLPED
• http://hartenstein.de/PATMOS/
• Power efficiency is going to become an industry-wide issue
• Some incremental improvements are on track at all abstraction levels, not only in hardware design and software development
• However, there is still a lot to do

Power Efficiency of Programming Languages (an example)
[Figure only]

Programming Language Popularity (IEEE)
• not yet determined by power efficiency
[Figure; image source: iStockphoto]

ICT infrastructures
• Already in 2007 the internet's carbon footprint was higher than that of worldwide air traffic
• Power consumption by the internet: x30 until 2030 if trends continue. That is more than the entire world's total power consumption today!
  [G. Fettweis, E. Zimmermann: ICT Energy Consumption Trends and Challenges; WPMC'08, Lapland, Finland, 8-11 Sep 2008]
• IDC predicts: the total number of data centers around the world will peak at 8.6 million in 2017
[Photo: data center at Dallas, © New York Times]
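The x30 projection on the ICT infrastructures slide can be turned into an implied yearly growth rate. The sketch below is only a back-of-the-envelope calculation: the slide does not state a baseline year, so 2008 (the publication year of the cited paper) is an assumption, and the function name `implied_annual_growth` is made up for this illustration.

```python
def implied_annual_growth(factor: float, years: int) -> float:
    """Constant yearly growth rate that compounds to `factor` over `years` years."""
    return factor ** (1.0 / years) - 1.0


# Assumption (not stated on the slide): the x30 growth spans 2008 to 2030.
BASELINE_YEAR = 2008
TARGET_YEAR = 2030

rate = implied_annual_growth(30.0, TARGET_YEAR - BASELINE_YEAR)
print(f"implied growth: {rate:.1%} per year")
```

Under this assumption the projection corresponds to roughly one sixth more power every year; a different baseline year changes the rate, but not the compounding argument.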
Google's Electricity Bill (for example)
• Google asked the Federal Energy Regulatory Commission for the authority to sell electricity
• Google patent for water-based data centers: the ocean provides cooling and power (motion of the surface)
• The cost of a data center is determined solely by the monthly power bill, not by the cost of hardware or maintenance
• “The possibility of computer equipment power consumption spiraling out of control could have serious consequences for the overall affordability of computing” [L. A. Barroso, Google]

Massive Critique of the von Neumann Model
• Patterson's Law (Dave Patterson): the bandwidth gap grows 50% / year; it reached >1000x long ago
• instruction streams are very slow and extremely memory-cycle-hungry: the “von Neumann Syndrome” (C. V. Ramamoorthy)
• Nathan's Law (Nathan Myhrvold): Software is a gas. It expands to fill all its containers ...
• Software dimension in MLoC (Million Lines of Code):
    Mac OS X 10.4          86
    SAP NetWeaver 2007    238
    Debian 5.0 (2009)     324
• von Neumann: an extremely power-inefficient paradigm
• John Backus 1978: Can programming be liberated from the von Neumann style?

>> “Dataflow” Computing <<

Terminology Problems
• Searching for a more power-efficient machine paradigm
• Stressing the differences to “Control-Flow” computers, an area called “Dataflow” computers was started in the mid-1970s at MIT
• People in the Xputer area are forced to sidestep by using terms like “data-driven” or “data streams”, although the “Dataflow” scene is I-Structure-centered
MIT Tagged Token Dataflow Architecture*
• no PC
• no updateable global store
• “instruction is executed even if some of its operands are not yet available”
[Block diagram: PEs and I-Structure Storage attached to the Communication Network. Labels inside a PE: Token Queue (to/from the Communication Network), Wait-Match Unit & Waiting Token Store, Instruction Fetch Unit, Program Store & Constant Store, ALU & Form Tag, Form Token Unit, RU, SU]
*) source: Jurij Silc

Problems with “Dataflow” Architectures
• very complex data structures
• “each update consumes the structure and the value produces a new data structure” [source: Jurij Silc]
• “awkward or even impossible to implement”
• “instruction is executed even if some of its operands are not yet available”

Solution: I-Structure concept*
Element states shown on the slide:
• can be read but not written
• read request deferred, but write operation is allowed
• at least one read request has been deferred
*) “I-Structure Flow” instead of “Dataflow”

A Second Opinion
• D. D. Gajski, D. A. Padua, D. J. Kuck, R. H. Kuhn: A Second Opinion on Data Flow Machines and Languages; IEEE COMPUTER, February 1982
• the subtitle: “... data flow techniques attract a great deal of attention. Other alternatives, however, offer more hope for the future.”
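The three I-structure element states listed on the “Solution: I-Structure concept” slide can be illustrated with a small model. This is a minimal sketch, not the MIT implementation: the class name `IStructureCell`, the state names `absent`, `waiting` and `present`, and the use of callbacks for deferred reads are all assumptions chosen for illustration.

```python
class IStructureCell:
    """One write-once element whose reads may arrive before the write."""

    def __init__(self):
        self.state = "absent"  # hypothetical state names, see lead-in
        self.value = None
        self.deferred = []     # consumers whose read arrived before the write

    def read(self, consumer):
        if self.state == "present":
            consumer(self.value)            # can be read but not written
        else:
            self.deferred.append(consumer)  # read request is deferred
            self.state = "waiting"          # at least one read has been deferred

    def write(self, value):
        if self.state == "present":
            raise RuntimeError("I-structure element is write-once")
        self.value = value
        self.state = "present"
        pending, self.deferred = self.deferred, []
        for consumer in pending:            # release all deferred reads
            consumer(value)


cell = IStructureCell()
seen = []
cell.read(seen.append)  # arrives too early: deferred
cell.write(42)          # the write releases the deferred read
cell.read(seen.append)  # served immediately
print(seen)
```

Even this toy version needs per-element state and a queue of deferred readers, which hints at why the slides call I-structure handling complicated.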
The “Dataflow” community is still active: 5th Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM 2015), Oct 18-21, 2015, San Francisco, CA, USA. However, the power efficiency breakthrough did not happen here.

Computing Paradigms
• CPU: DPU with program counter; program counter: yes; execution triggered by: instruction fetch; paradigm: instruction-stream-based (von Neumann)
• Xputer / RPU: rDPU with data counters; program counter: no; execution triggered by: data arrival**; paradigm: data-driven or data-stream-based (Reconfigurable Computing)
• DFC*: DPUs with I-structure; program counter: no; execution triggered by: complicated I-structure handling; paradigm: “Dataflow” Computer
*) based on data dependency graph
**) “transport-triggered”

>> Reconfigurable Computing <<

First FPGA available 1984 from Xilinx
[Figure: array of LUTs (look-up tables)]

Reconfigurable Computing (RC): the intensive Impact
Speed-ups by von Neumann to RC migrations: SGI Altix 4700 with RC 100 RASC compared to Beowulf cluster
[Tarek El-Ghazawi et al.: IEEE COMPUTER, Febr. 2008]

    Application                   Speed-up factor   Savings: Power   Cost   Size
    DNA and protein sequencing               8723              779     22    253
    DES breaking                            28514             3439     96   1116

• massively saving energy
• much less equipment needed
• much less memory and bandwidth needed

Speed-up Factors + Pre-FPGA Solutions
[Chart: speed-up factors (logarithmic axis, 10^0 to 10^6) of von Neumann software to FPGA migrations over time (1985, 1990, ...). Application areas: image processing, pattern matching, multimedia, DSP, wireless, crypto, bioinformatics. Plotted applications include DES breaking (28514), DNA & protein sequencing (8723), Reed-Solomon decoding, video-rate stereo vision, MAC, pattern recognition, BLAST, CT imaging, Viterbi decoding, Smith-Waterman pattern matching, SPIHT wavelet-based image compression, real-time face detection, protein identification, FFT, molecular dynamics simulation and GRAPE.]
• Pre-FPGA solution: grid-based DRC* (“fair comparison”), no FPGA: DPLA on MoM by TU-KL*; values shown next to this entry: 15000 and 6000
• 1984: 1 DPLA replaces 256 FPGAs; the DPLA was fabricated by the E.I.S. Multi University Project: http://www.fpl.uni-kl.de/staff/hartenstein/eishistory_en.html
*) TU Kaiserslautern
http://www.fpl.uni-kl.de/staff/hartenstein/Hartenstein-Speedup-Factors.pdf

Design Rule Check Accelerator*
• Processing 4-by-4 reference patterns
• DPLA: fabricated by the E.I.S. Multi University Project: http://www.fpl.uni-kl.de/staff/hartenstein/eishistory_en.html
• 1984: 1 DPLA replaces 256 FPGAs
• The DPLA was more area-efficient than an FPGA by several orders of magnitude
*) PISA DRC accelerator [ICCAD 1984]

The Reconfigurable Computing Paradox
Migrations to FPGAs yield massive speed-ups and energy savings, although the effective integration density of FPGAs is 4 orders of magnitude behind the Gordon Moore curve, because of:
• wiring overhead
• reconfigurability overhead
• routing congestion
• etc.

Reinvent Computing
Enabling software developers to apply their skills to FPGAs has been a long-standing and, as yet, unreached research objective in reconfigurable computing.
Obstacles to widespread FPGA adoption go well beyond the required skill set
• Workshop at FPL 2015: http://reconfigurablecomputing4themasses.net/

>> Time to Space Mapping <<

Dual paradigm mind set: an old hat, but it was ignored
• time to space mapping: procedural to structural, loop to pipe mapping
• PDP-16 RTMs: 1971
• why did it take 25 years to find out?
• 1967: W. A. Clark: Macromodular Computer Systems; 1967 SJCC, AFIPS Conf. Proc.
• C. G. Bell et al.: The Description and Use of Register-Transfer Modules (RTM's); IEEE Trans-C21/5, May 1972
[Figure: chain of flip-flops (FF) passing a token bit; labels: token bit, evoke]

Transformations since the 1970s (time to time/space mapping)
• Loop transformations: rich methodology published [survey: Diss. Karin Schmidt, 1994, Shaker Verlag]
• time domain (procedure domain): program loop; time algorithm: n x k time steps, 1 CPU
• Strip Mining Transformation
• space domain (structure domain): pipeline; space/time algorithm: k time steps, n DPUs

The Systolic Arrays (1)
• 1980: M. J. Foster, H. T. Kung: The Design of Special-Purpose VLSI Chips ...; IEEE 7th ISCA, La Baule, France, May 6-8, 1980
• DPA: DataPath Array (array of DPUs), a pipe network
• no instruction streams needed
• execution is transport-triggered
• “data streams” define which data item arrives at which time at which port
[Figure: input data streams entering and output data streams leaving the DPA, drawn as staggered data items over the axes time and port #; photo: H. T. Kung]

Systolic Arrays (2)
• M. J. Foster and H. T. Kung: “The Design of Special-Purpose VLSI Chips ...”
• why not general purpose?
• Karl Steinbuch (http://karl-steinbuch.org): “It is not sufficient to invent something. You need to recognize that you have invented something.”

What Synthesis Method?
• H. T. Kung: “of course algebraic!” (linear projection)
  - only linear pipes
  - supports only special applications with strictly regular data dependencies
• My student Rainer Kress replaced it by simulated annealing*: also supports any irregular and wildly shaped pipe networks
• http://kressarray.de/
*) KressArray [ASP-DAC-1995]

H. T. Kung: “It's not our job”
• another tunnel vision symptom
• without a sequencer: missed the chance to invent a new machine paradigm (the Xputer)
*) or receives

The Xputer machine Paradigm
• The Xputer machine paradigm is TU-KL's symbiosis of Time to Space Mapping and Reconfigurable Computing
• obtained by adding auto-sequencing memory with multiple data counters instead of a program counter
• ASM: Auto-Sequencing Memory
• GAG: Generic Address Generator
[Figure: ASMs, each drawn as a GAG with data counter plus RAM; link: Xputer literature (TU-KL)]

>> The Xputer Paradigm <<

What does Configware mean?
• Software (instruction-procedural, time domain): Software Source → Software Compiler → Software Code
• Configware (structural, space domain): Configware Source → mapper (Placement & Routing) → Configware Code
• Software to Configware Migration: from the time domain to the space domain
[Link: Xputer pages]

Compilation: Software vs. Configware and
Flowware
• Software Engineering (procedural: time domain): source program → software compiler → software code
• Configware Engineering: source “program” (C, FORTRAN, MATLAB) → configware compiler, consisting of:
  - mapper (placement & routing), space domain → configware code
  - data scheduler, time domain → flowware code
[Link: Xputer literature]

Heterogeneous: Co-Compilation
• source: C, FORTRAN, MATLAB
• Software / Configware Co-Compiler with automatic SW / CW partitioner, feeding:
  - software compiler → software code
  - configware compiler: mapper → configware code; data scheduler → flowware code
[Link: Xputer pages]

Paradigm Shift Consequences
• von Neumann (CPU): Software Engineering
  - resources: fixed; algorithm: variable
  - 1 programming source needed: software, sequenced by the Program Counter (PC)
• Xputer: Configware Engineering
  - resources: variable; algorithm: variable
  - 2 programming sources needed: configware, i.e. Configuration Code (CC), and flowware, i.e. sequencing code for the Data Counters (DCs) (e. g. see MoPL language)
[Link: Xputer literature]

Xputer machine paradigm
• use data counters, no program counter
• data stream generators: ASMs (ASM: Auto-Sequencing Memory), each drawn as a GAG with data counter plus RAM
• ASMs are implemented by distributed on-chip memory; 100 and more on-chip ASMs are feasible
• 32 ports, or n x 32 ports (other example)
• rDPA: reconfigurable pipe network
[Figure: ASMs on all sides of the rDPA, feeding and receiving staggered data streams; link: Xputer pages]

Duality of procedural Languages

                                Software Languages                   Flowware Languages (MoPL)
    state register(s)           program counter                      data counter(s)
    read next                   read next instruction                read next data item
    goto                        goto (instruction address)           goto (data address)
    jump                        jump to (instruction address)        jump to (data address)
    loop                        instruction loop                     data loop
    loop nesting                instruction loop nesting             data loop nesting
    loop escape                 instruction loop escape              data loop escape
    branching                   instruction stream branching         data stream branching
    internally parallel loops   no                                   yes

• But there is an asymmetry
• more simple: no ALU tasks
[Links: Xputer literature, Xputer pages]

CoDe-X Dual Paradigm Application Development
• Juergen Becker's PhD thesis, 1996
• high level language: C language source
• software/configware co-compiler: Partitioner, SW compiler, CW compiler
• SW compiler → software code → CPU (instruction-stream-based)
• CW compiler → configware code → array of rDPUs (data-stream-based)
[Figure labels: CPU, accelerator, hardwired, reconfigurable accelerator]

>> Conclusions <<

Illustrating the Paradigm Trap
• the watering can model [Hartenstein]
• The von Neumann single core Approach: the Memory Wall ( crippled watering can )
• The von Neumann Manycore Approach: many von Neumann bottlenecks ( many crippled watering cans )

Illustrating the Paradigm Trap (2)
• The “Dataflow” Computer: extremely complicated, no watering can model!
• (a power efficiency breakthrough has not yet happened here)

Xputer: the only massively power-efficient Paradigm
• the watering can model [Hartenstein]
• The Xputer Paradigm has no von Neumann bottleneck

thank you !

END

Backup for discussion:

Pipelining through DPU Arrays: the TU-KL Xputer principle
• DPA = DPU array; DPU = Data Path Unit
• DPU operation is transport-triggered
• no instruction streams
• no message passing, nor through common memory
• no memory wall: massively avoiding memory cycles
[Figure: input data streams entering and output data streams leaving the DPA, drawn as staggered data items]

I-Structures (I = incremental) - part 1
• Jurij Silc: Dataflow Architectures, http://csd.ijs.si/courses/dataflow/

I-Structures - part 2
• Jurij Silc: Dataflow Architectures (PE), http://csd.ijs.si/courses/dataflow/
• MIT Tagged-Token Dataflow Architecture
• I-structure select
• I-structure assign

Taxonomy
• Flynn's taxonomy: von Neumann only
• Diana's taxonomy: Reconfigurable Computing
• Reiner's taxonomy: heterogeneous systems
• Reiner's 2nd taxonomy: Xputers only

Von Neumann Syndrome
• Lambert M. Surhone, Mariam T. Tennoe, Susan F. Henssonow (ed.): Von Neumann Syndrome; Betascript Publishing 2011
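The Xputer slides describe sequencing by data counters instead of a program counter: nested data loops generate the addresses, and the data path fires whenever a data item arrives. The following is a minimal software sketch of that idea. The names `address_stream` and `run_dpu`, the parameter set (base, strides, limits) and the sample operation are assumptions for illustration; this is not the GAG design or the MoPL language from the Xputer literature.

```python
from typing import Iterator, List


def address_stream(base: int, rows: int, cols: int,
                   row_stride: int, col_stride: int = 1) -> Iterator[int]:
    """Hypothetical address generator: two nested data loops yield one
    data address per step, with no instruction fetch per data item."""
    for r in range(rows):        # outer data loop
        for c in range(cols):    # inner data loop (data loop nesting)
            yield base + r * row_stride + c * col_stride


def run_dpu(memory: List[int], addresses: Iterator[int]) -> List[int]:
    """Stand-in for a DPU configured once to compute x*x + 1; it processes
    every data item that the address stream delivers."""
    return [memory[a] * memory[a] + 1 for a in addresses]


memory = list(range(16))  # a 4 x 4 array stored row by row
# Scan the 2 x 2 block whose upper left element sits at address 5.
block = list(address_stream(base=5, rows=2, cols=2, row_stride=4))
result = run_dpu(memory, iter(block))
print(block, result)
```

The operation inside `run_dpu` stays fixed while only the address stream changes, which mirrors the split into configware (what the data path does) and flowware (which data item arrives when).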