PATMOS 2015, the 25th International Workshop on Power and Timing Modeling, Optimization and Simulation; Salvador, Bahia, Brazil, Sept 1-5, 2015 Reiner Hartenstein TU Kaiserslautern IEEE fellow FPL fellow SDPS fellow http://hartenstein.de downloadable from http://xputer.de How to cope with the Power Wall >> Outline << TU Kaiserslautern • The Power Wall • “Dataflow” Computing • Reconfigurable Computing • Time to Space Mapping • The Xputer Paradigm • Conclusions http://www.uni-kl.de © 2015, reiner@hartenstein.de 2 http://xputer.de The Workshop Series TU Kaiserslautern spin-off from the PATMOS project Project .leader: Reiner Hartenstein partner . leader: Antonio Núñez partner . leader: Francis Jutand . . Oldest conference series on power efficiency http://hartenstein.de/PATMOS/ Power efficiency is going to become an industry-wide issue Some incremental improvements are on track, at all abstraction levels however, there is still a lot to be done © 2015, reiner@hartenstein.de 3 http://xputer.de TU Kaiserslautern Power-Efficient Computing Power-efficient Microchip Design tutorial by Jan Power-efficient Computer Architectures Rabaey Power-efficient Languages and Compilers Power-efficient Software Implementation (Power-efficient Memory) Power-efficient Machine Paradigm © 2015, reiner@hartenstein.de 4 my presentation http://xputer.de Three tectonic shifts: TU Kaiserslautern ICT infrastructures the energy-constrained world from internet of people to internet of (every)thing the end of scaling as we know it © Hewlett Packard Power consumption by internet: x30 til 2030 if trends continue G. Fettweis, E. Zimmermann: ICT Energy Consumption Trends and Challenges; WPMC'08, Lapland, Finland, 8 – 11 Sep 2008 © Hewlett Packard It‘s more than the entire world‘s total power consumption to-day !!! © 2015, © New York reiner@hartenstein.de Times 5 5 http://xputer.de Data Center at Dallas >> Outline << TU Kaiserslautern • The Power Wall • “Dataflow” Computing • Reconfigurable Computing • Time to Space Mapping • The Xputer Paradigm • Conclusions http://www.uni-kl.de © 2015, reiner@hartenstein.de 6 http://xputer.de Terminology Problems TU Kaiserslautern Stressing differences to „Control-Flow“ Computers an area called „Dataflow“ Computers was started mid‘ 70ies at MIT Xputer area people are forced to sidestep by using terms like „data-driven“ or „data streams“… or „ “ Tagged Token Flow … although the„Dataflow“ scene is „I-Structure“-centered © 2015, reiner@hartenstein.de 7 http://xputer.de A Second Opinion TU Kaiserslautern D. D. Gajski, D. A. Padua, D. J. Kuck, R. H. Kuhn: A Second Opinion on Data Flow Machines and Languages; IEEE COMPUTER, February 1982 the subtitle: "... data flow techniques attract a great deal of attention. Other alternatives, however, offer more hope for the future." ( Still active workshops …, e. g.: 5th Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM 2015), Oct 18-21, 2015, San Francisco, CA, USA ) However, the power efficiency break-thru did not happen here © 2015, reiner@hartenstein.de 8 http://xputer.de >> Outline << TU Kaiserslautern • The Power Wall • “Dataflow” Computing • Reconfigurable Computing • Time to Space Mapping • The Xputer Paradigm • Conclusions http://www.uni-kl.de © 2015, reiner@hartenstein.de 9 http://xputer.de Speedup- 106 Speedup-Factor Factor + Pre-FPGA solutions TU Kaiserslautern Image processing, Pattern matching, 28514 DES breaking Multimedia 105 Design Rule Check accelerator PISA 15000 („fair comparizon“) no FPGA: DPLA on 1984 MoM by TU-KL* 6000 >15 years earlier 103 1984: 1 DPLA replaces 256 FPGAs fabricatedbyE.I.S. Multi University Project Chip 103 pattern recognition Speed-ups by vN Software to FPGA Migrations 1985 1990 © 2015, reiner@hartenstein.de 730 900 8723 52 40 3000 CT imaging Crypto 1000 288 Viterbi Decoding Smith-Waterman pattern matching 457 100 1000 400 SPIHT wavelet-based image compression BLAST DNA & protein sequencing Reed-Solomon Decoding 2400 video-rate stereo vision MAC http://www.fpl.uni-kl.de/staff/hartenstein/eishistory_en.html 100 DSP and wireless real-time face detection 88 protein identification FFT molecular dynamics simulation Bioinformatics 20 GRAPE 100 *) TU Kaiserslautern 10 for references see here: http://www.fpl.uni-kl.de/staff/hartenstein/Hartenstein-Speedup-Factors.pdf http://xputer.de The Reconfigurable Computing Paradox although the effective integration density of FPGAs is by 4 orders of magnitude behind the Gordon Moore curve, because of: •wiring overhead • reconfigurability overhead •routing congestion von Neumann: an extremely power-inefficient paradigm “von Neumann Syndrome” Reinvent Computing © 2015, reiner@hartenstein.de C.V. Ramamoorthy Von Neumann Syndrome 11 http://xputer.de Obstacles to widespread FPGA adoption go well beyond the required skill set - Workshop at FPL_2015 http://reconfigurablecomputing4themasses.net/ © 2015, reiner@hartenstein.de 12 http://xputer.de TU Kaiserslautern What about Acceleration by Graphics Processors? von •Drastically smaller Speed-ups if at all Neumann •Power saving mostly not documented •R. Vaduc et al.*: „ … adding a GPU is equivalent to adding one more multicore CPU socket …” *) R. Vuduc, J. Choi, M. Guney, A. Shringarpure: On the Limits of GPU Acceleration; Proc. HotPar'10, 2nd USENIX workshop on Hot Topics in Parallelism, June 14 15, 2010, Berkeley, CA, USA, USENIX Assoc. Berkeley, CA, USA http://newport.eecs.uci.edu/~amowli/resources/papers/vuduc2010-hotpar.pdf © 2015, reiner@hartenstein.de http://xputer.de >> Outline << TU Kaiserslautern • The Power Wall • “Dataflow” Computing • Reconfigurable Computing • Time to Space Mapping • The Xputer Paradigm • Conclusions http://www.uni-kl.de © 2015, reiner@hartenstein.de 14 http://xputer.de TU Kaiserslautern Dual paradigm mind set: an old hat– but was ignored time to space mapping: procedural to structural: loop to pipe mapping PDP-16 RTMs: why did it take 25 years to find out? demultiplexer 1971 token bit evoke FF FF FF 1967: W. A. Clark: Macromodular Computer Systems; 1967 SJCC, AFIPS Conf. Proc. C. G. Bell et al: The Description and Use of RegisterTransfer Modules (RTM's); IEEE Trans-C21/5, May 1972 © 2006, reiner@hartenstein.de 15 http://hartenstein.de The Systolic Arrays (1) 1980 no instruction streams needed time (pipe network) DPA x x x DataPath Array (array of DPUs) x x x | x x x | | time x x x define: x x x ... which data item at which time x x x - at which port „data streams“ © 2015, reiner@hartenstein.de port # - - - x x x execution transporttriggered time Kung‘s - - - - x x x example - - - - - x x x (algebra) port # M. J. Foster, H. T. Kung: The Design of Special-Purpose VLSI Chips ... IEEE 7th ISCA, La Baule, France, May 6-8, 1980 input data stream | | | | | | | | | | | x x x time 16 x x x | x x x port # output data streams H. T. Kung http://xputer.de TU Kaiserslautern M. J. Foster and H. T. Kung: “The Design of SpecialPurpose VLSI Chips ... “ It is not sufficient to invent something. You need to recognize, that you have invented something. © 2015, reiner@hartenstein.de 17 http://karl-steinbuch.org Systolic Arrays (2) Karl Steinbuch 17 http://xputer.de What Synthesis Method? H.T.Kung*: „of course algebraic!“ (linear projection) *) a mathematician only linear pipes supports only very special applications with strictly regular data dependencies http://kressarray.de/ My student Rainer Kress replaced it by simulated annealing: this supports also any irregular & wild shape pipe networks © 2015, reiner@hartenstein.de 18 *) KressArray [ASP-DAC-1995] http://xputer.de H. T. Kung: “It’s not our job” TU Kaiserslautern another Tunnel Vision Symptom without a sequencer: missed to invent a new machine paradigm (the Xputer) © 2015, reiner@hartenstein.de *) or receives 19 http://xputer.de >> Outline << TU Kaiserslautern • The Power Wall • “Dataflow” Computing • Reconfigurable Computing • Time to Space Mapping • The Xputer Paradigm • Conclusions http://www.uni-kl.de © 2015, reiner@hartenstein.de 20 http://xputer.de The Xputer machine Paradigm TU Kaiserslautern (TU-KL) Xputer literature is the TU-KL‘s Symbiosis of Time to Space Mapping and Reconfigurable Computing! ASM obtained by adding auto-sequencing memory (ASM) With data counters instead of a program counter GAG: Generic Address Generstor © 2015, reiner@hartenstein.de 21 ASM ASM GAG ASM RAM data counter http://xputer.de state register(s): MoPL program counter: data counter(s): Software Languages read next instruction goto (instruction address) jump to (instruction address) instruction loop instruction loop nesting instruction loop escape instruction stream branching no: no internally parallel loops Flowware Languages read next data item goto (data address) jump to (data address) data loop data loop nesting data loop escape data stream branching yes: internally parallel loops But there is an Asymmetry © 2006, reiner@hartenstein.de 22 more simple: no ALU tasks Xputer literature TU Kaiserslautern Xputer pages Duality of procedural Languages http://hartenstein.de Compilation: Software vs. Configware u. Flowware TU Kaiserslautern Xputer literature Software Engineering source program time to space mapping Configware Engineering C, etc. source „program“ mapper software compiler procedural: time domain) configware compiler space domain data scheduler software code configware code © 2015, reiner@hartenstein.de 23 time domain flowware code http://xputer.de TU Kaiserslautern Heterogeneous: Co-Compilation Xputer pages important: why? C, or other high level language CoDe-X automatic SW / CW partitioner Software / Configware software Co-Compiler compiler (Jürgen Becker‘s Ph.D. thesis) mapper configware compiler data scheduler software code © 2015, reiner@hartenstein.de configware code 24 flowware code http://xputer.de >> Outline << TU Kaiserslautern • The Power Wall • “Dataflow” Computing • Reconfigurable Computing • Time to Space Mapping • The Xputer Paradigm • Conclusions http://www.uni-kl.de © 2015, reiner@hartenstein.de 25 http://xputer.de Illustrating the Paradigm Trap TU Kaiserslautern the watering can model [Hartenstein] ( crippled watering can ) © 2015, reiner@hartenstein.de The von Neumann Manycore Approach ( many crippled watering cans ) The von Neumann single core Approach The Memory Wall (1) many von Neumann bottlenecks 26 http://xputer.de Illustrating the Paradigm Trap (2) TU Kaiserslautern The “Dataflow“ Computer extremely complicated: no watering can model ! (a power efficiency break-thru did not happen here) © 2015, reiner@hartenstein.de 27 http://xputer.de Xputer: the only massively power-efficient Paradigm TU Kaiserslautern the watering can model [Hartenstein] The Xputer Paradigm has no von Neumann bottleneck © 2015, reiner@hartenstein.de fully supporting Reconfigurable Computing 28 http://xputer.de TU Kaiserslautern We need a Seismic Shift … It’s an extremely massive challenge ! … to avoid future unaffordable ICT power consumption cost that’s why heterogeneous is important The software from more than half a century sits squarely on top For many more years we must work under a heterogeneous triple-paradigm mind set: Configware, Flowware, and still even Software © 2015, reiner@hartenstein.de 29 http://xputer.de thank you ! © 2015, reiner@hartenstein.de 30 http://xputer.de END © 2015, reiner@hartenstein.de 31 http://xputer.de Backup for discussion: © 2015, reiner@hartenstein.de 32 http://xputer.de Reconfigurable Computing (RC): the intensive Impact TU Kaiserslautern Speed-ups by von Neumann to RC Migrations Tarek El-Ghazawi [Tarek El-Ghazawi et al.: IEEE COMPUTER, Febr. 2008] SGI Altix 4700 with RC 100 RASC compared to Beowulf cluster Application DNA and Protein sequencing . DES breaking Speed-up factor 8723 779 22 253 28514 3439 96 1116 massively much less equipment saving needed energy much less memory and bandwidth needed © 2015, reiner@hartenstein.de Savings divisor Power Cost Size 33 http://xputer.de Taxonomy TU Kaiserslautern Flynn‘s taxonomy: von Neumann only Diana‘s taxonomy: Reconfigurable Computing Reiner‘s taxonomy: heterogeneous systems Reiner‘s 2nd taxonomy: Xputers only noI © 2015, reiner@hartenstein.de 34 http://xputer.de TU Kaiserslautern First FPGA available 1984 from Xilinx LUT LUT LUT LUT LUT LUT Table Table LUT © 2015, reiner@hartenstein.de http://xputer.de Transformations since the 70ies (time to time/space mapping) Loop Transformations: rich methodology published: [survey: Diss. Karin Schmidt, 1994, Shaker Verlag] time domain: procedure domain program loop space domain: structure domain Strip Mining Transformation Pipeline k time steps, n DPUs n x k time steps, 1 CPU time algorithm © 2015, reiner@hartenstein.de space/time algorithmus 36 http://xputer.de The Reconfigurable Computing Paradox although the effective integration density of FPGAs is by 4 orders of magnitude behind the Gordon Moore curve, because of: •wiring overhead • reconfigurability overhead •routing congestion •etc. Reinvent Computing Enabling software developers to apply their skills over FPGAs has been a long and, as of yet, unreached research objective in reconfigurable computing. © 2015, reiner@hartenstein.de 37 http://xputer.de ASM ASM x x x x x x use data counters, no program counter x x x | | | x x x x x x x x x - | | | | | | | | | x x x x x x 38 | x x x ASM ASM: AutoXputer pages Sequencing Memory © 2015, reiner@hartenstein.de | ASM Xputer machine paradigm | ASM implemented ASM by distributed ASM on-chipmemory ASM ASM general purpose reconfigurable (pipe network) rDPA ASM Data stream generator usage - - - x x x ASM - - - - x x x ASM - - - - - x x x ASM GAG RAM data counter GAG: Generic Address Generators (reconfigurable: to avoid memory-cycle-hungry address computation) http://xputer.de Paradigm Shift Consequences Xputer literature TU Kaiserslautern von Neumann: Software Engineering CPU resources: fixed 1 programming software algorithm: variable source needed Program Counter (PC) Configuration Xputer: Code (CC) configware flowware Configware Engineering resources: variable algorithm: variable 2 programming sources needed Data Counters (DCs) sequencing code (e. g. see MoPL language) Xputer and vN: Heterogenous Engineering all 3 programming sources needed © 2015, reiner@hartenstein.de 39 http://xputer.de Pipelining through DPU Arrays: the TU-KL Xputer principle no memory wall DPA massively avoiding memory cycles DPU operation is transport-triggered | | - - - x x x - - - - x x x x x x - - - - - - - x x x | | | | | | | | | | | x x x no message passing nor thru common memory 40 input data streams | x x x x x x - no instruction streams © 2015, reiner@hartenstein.de x x x x x x x x x x x x output data streams | x x x DPA = DPU array DPU = Data Path Unit http://xputer.de Von Neumann Syndrome TU Kaiserslautern Lambert M. Surhone, Mariam T. Tennoe, Susan F. Hennessow (ed.): Von Neumann Syndrome; ßetascript publishing 2011 © 2015, reiner@hartenstein.de 41 http://xputer.de Computing Paradigms CPU program counter term term CPU Xputer RPU DFC* DPU program counter rDPU data counters yes no execution triggered by paradigm instruction instruction-streambased (von Neumann) fetch data arrival** data-driven or datastream-based Reconfigurable Computing no complicated I-structure “Dataflow” Computer handling *) based on tagged token “I-Structure” **) “transport-triggered” DFC DPUs I- structure © 2015, reiner@hartenstein.de 42 http://xputer.de MIT Tagged Token Dataflow Architecture* I would call it Tagged Token Flow Architecture no Program Counter no updateable global store PE: ? I-Structure Storage I-Structure Storage to/from the Communication Network „instruction is executed even if some of its operands are not yet available“ *) source: Jurij Silc © 2015, reiner@hartenstein.de Communication Network RU Token Queue SU PE PE Wait-Match Unit & Waiting Token Store Instruction Fetch Unit Form Token Unit ALU & Form Tag 43 Program Store & Constant Store http://xputer.de Solution: I-Structure concept* TU Kaiserslautern Problems with such “Dataflow” Architectures very complex data structures ! “read request deferrend but write operation is allowed” “at least one read request has been deferred” ? “can be read but not written” ? “each update consumes the structure and the value produces a new data structure” I would “awkward or call it „Tagged Token Flow“ even impossible = Tagged Token Structure to implement” [source: Jurij Silc] *) „I-Structure Flow“ insteadhttp://xputer.de of „Dataflow“ © 2015, reiner@hartenstein.de 44 I-Structures (I = incremental) - part 1 45 Jurij Silc: Dataflow Architectures http://csd.ijs.si/courses/dataflow/ 26 28 28 © 2015, reiner@hartenstein.de http://xputer.de 27 I-Structures - part 2 Jurij Silc: Dataflow Architectures http://csd.ijs.si/courses/dataflow/ MIT Tagged-Token Dataflow Architecture 46 (PE) 31 29 I-structure select © 2015, reiner@hartenstein.de Istructure assign 30 http://xputer.de 20 Power Efficiency of Programming Languages (an example) © 2015, reiner@hartenstein.de 47 http://xputer.de How big is big ? © Hewlett Packard © 2015, reiner@hartenstein.de 48 http://xputer.de