Enabling Technologies for Reconfigurable Computing
November 21, 2001, Tampere, Finland
Reiner Hartenstein, Xputer Lab, University of Kaiserslautern
© 2001, reiner@hartenstein.de, http://www.fpl.uni-kl.de

Part 1: Reconfigurable Computing (RC)
Wednesday, November 21, 8.30 – 10.00 hrs.

Schedule
08.30 – 10.00 Reconfigurable Computing (RC)
10.00 – 10.30 coffee break
10.30 – 12.00 Compilation Techniques for RC
12.00 – 14.00 lunch break
14.00 – 15.30 Resources for Stream-based RC
15.30 – 16.00 coffee break
16.00 – 17.30 FPGAs: recent developments

Reconfigurable: why?
• Exploding design cost and shrinking product life cycles of ASICs create a demand for RA usage for product longevity.
• Performance is only one part of the story. The time has come to fully exploit their flexibility to support turn-around times of minutes instead of months for real-time in-system debugging, profiling, verification, tuning, field maintenance, and field upgrades.
• A new “soft machine” paradigm and language framework is available for novel compilation techniques to cope with the new market structures transferring synthesis from vendor to customer.

SoC Alternatives … not including C/C++ CAD Tools [Gordon Bell]
• The blank sheet of paper: FPGA
• Auto design of a basic system: Tensilica
• Standardized, committee-designed components*, cells, and custom IP
• Standard components including more application-specific processors*, IP add-ons, and custom
• One chip does it all: SMOP**
*) Processors, Memory, Communication & Memory Links
**) SMOP ??

SoC Alternatives [Gordon Bell]
product | strategy | vendor
FPGA | “sea of uncommitted gate arrays” | Xilinx, Altera
compile a system | unique processor for every application | Tensilica
systolic array | many pipelined or parallel processors + custom |
DSP, VLIW | special purpose processor cores + custom | TI
processor + RAM + ASICs | general purpose cores, specialized by I/O, etc. | IBM, Intel
universal micro | multiprocessor array, programmable I/O | Cradle

A Decade of Research in Reconfigurable Computing
• Thanks to the achievements of numerous research projects throughout the 1990s, the breakthrough in commercialization has started, and a quite comprehensive methodology is already available.
• Dear colleague, the RC scene welcomes your contributions to improve it and to push for its inclusion in contemporary CS&E curricula.
• One goal of this talk is to stimulate you with highlights and by introducing some key issues.

No longer a strange niche area
• was “hardware” design for a strange platform
  – CAD, but no compilation
• Emerging awareness:
  – new mind set
  – new curricular embedding
• coming dichotomy of CS
  – SW <-> CW
  – HW <-> FW
  – computing in time <-> computing in space

Flexibility/universality trade-off
[Figure: trade-off between flexibility and efficiency; labels: FPGA, KressArray Xplorer, hardwired]

RAs are heading for Mainstream
... become indispensable for SoC products?
ASPP, application-specific programmable product, is:
• an application-specific standard product, and
• embedded programmable logic
CSoC, configurable SoC, is:
• an industry-standard µprocessor,
• an embedded reconfigurable array,
• memory, a dedicated system bus, ...
Soap Chip: System on a programmable Chip
[Figure: block diagrams; labels: Flash / RAM, DRAM/Flash/SRAM memory banks, Programmable Logic, Logic, Microprocessor, Reconfigurable Accelerator Array, Analog Logic]

Reconfigurable Logic going Mainstream
• Fine grain: FPGAs killing the ASIC market
• Fastest growing segment of the semiconductor market
• Substantially improved design flow and libraries
• Coarse grain: several startups
• Comprehensive methodology
• Please lobby for new curricula.
• One of the goals of this talk: to motivate you with key issues and visionary highlights.

Designer-oriented Innovation stalled?
• EDA industry: about $7 billion
• leverages a > $200 billion semiconductor industry
• FPGAs ($7 billion): fastest growing segment
• EDA industry constantly redefining itself
• “except logic synthesis, no really significant innovation in the past decade”
• CAD developers can’t deliver their ideas effectively
• CAD developers personally don’t appreciate the real problems facing designers

EDA the main bottleneck
[Figure not recoverable]

Guess it! Biggest Mistake of EDA
[Figure not recoverable]

>> History
• History
• Paradigm Shift
• Coarse Grain: why?
• Coarse Grain Architectures
• Reconfiguration Architecture
http://www.uni-kl.de

Logic Gate Price Trend
[Figure: price per logic element, normalized to Q1/1993, plotted from Q1 '93 to Q1 '00; 40% lower per year; data labels: 0.261, 0.086, 0.042, 0.029. Source: Altera]

The History of Paradigm Shifts
“Mainstream Silicon Application is switching every 10 Years”
“The Programmable System-on-a-Chip is the next wave”
[Figure: wave alternating between standard and custom in 10-year steps from 1957 to 2007; labels: TTL; LSI, MSI; µproc., memory; ASICs, accel’s; ?; annotations: 1st Design Crisis, 2nd Design Crisis]

Makimoto’s 3rd Wave
• Fine Grain Subsystems (FPGAs):
  – 1st half of 3rd wave
  – universal (but less efficient)
• Coarse Grain Subsystems:
  – 2nd half of 3rd wave
  – domain-specific
  – much more flexible than 2nd half of 2nd wave

What’s the next Wave?
Tredennick’s Paradigm Shifts:
• hardwired: algorithm fixed, resources fixed
• procedural programming: algorithm variable, resources fixed
• structural programming: algorithm variable, resources variable
Hartenstein’s Curve: no further wave!
[Figure: wave alternating between standard and custom, 1957 to 2007; labels: FPGAs, coarse grain RAs?, 4th wave?]

The Impact of Makimoto’s Paradigm Shifts
• Software industry’s secret of success
• Procedural personalization via RAM-based machine paradigm
• Personalization (CAD) before fabrication
• Structural personalization: RAM-based, before run time
• Repeat the success story by a new machine paradigm!
[Figure: Makimoto’s wave, 1957 to 2007; labels: TTL; LSI, MSI; µproc., memory; ASICs, accel’s; standard, custom. Dr. Makimoto: FPL 2000 keynote]

>> Paradigm Shift
• History
• Paradigm Shift
• Coarse Grain: why?
• Coarse Grain Architectures
• Reconfiguration Architecture

Sequential vs. structural RAM
[Figure: block diagram; labels: Logic Synthesis, Place and Route, Software, (procedural) downloading, I/O, sequential RAM, structural RAM, download, FPGA, reconf. accelerator(s), data path, instruction sequencer, “von Neumann”]

Changing Models of Computing
[Figure: three block diagrams; “von Neumann”: I/O, data path, instruction sequencer, RAM, (procedural) Software downloading; contemporary: host, RAM, hardwired accelerator(s), CAD, Hardware; reconfigurable computing: host, RAM, reconf. accelerator(s), (structural) Configware downloading, Flexware; captions: “the tail wagging the dog”, “Software occupies most silicon”]

The Microprocessor is a Methuselah
9 technology generations ... the steam engine of the silicon age
• 1st: 4004
• 2nd: 8008
• 3rd: 8086
• 4th: 80286
• 5th: 80386
• 6th: 80486
• 7th: P5 (Pentium)
• 8th: P6 (Pentium Pro / Pentium II)
• 9th: Pentium III

… Decline of Wintel Business Model
[Figure: chart for 1997 to 2002; axes: billion US-$, US market [Forrester]; billion subscribers worldwide; million devices delivered in the U.S. [IDC]; data labels: 201 billion, 0.5 billion, 1500 $, 1000 $]

Basics of Binding Time
[Figure: binding time scale (time of “Instruction Fetch”); labels: run time, loading time, compile time; microprocessor, parallel computer, Reconfigurable Computing]

Binding Time vs. Computing Domain
Binding time (set-up of communication channels): at run time, at loading time, at compile time, at a later fabrication step, before fabrication
Programming domain: time domain (procedural), time & space (hybrid), space domain (structural)
The KressArray is a generalization of the systolic array
[Figure: chart placing microprocessor, parallel computer, array processor, Reconfigurable Computing, systolic arrays, ASICs, and full custom ICs by binding time and programming domain]

Dataquest Predicts Programmability to be Predominant in SoC
• Application-specific programmable products (ASPPs) will be the next best thing in semiconductor technology
• With programmability as a standard feature, ASPPs will be the predominant system-on-a-chip products in five years
Jordan Selburn, principal analyst, ASICs and system-level integration, Dataquest Inc.’s Semiconductors Group (EETimes 10/21/98, Dataquest Semiconductors ’98 conference)

Applications
• next generations’ wireless*
• network processors*
• many other areas*
*) keynotes and papers at FPL 2000, the 10th International Conference on Field-programmable Logic and Applications (“The Roadmap to Reconfigurable Systems”), Villach, Austria, August 27 - 30, 2000, http://www.fpl.uni-kl.de/FPL/

Applications (2)
• Image Processing:
  – for smart car (collision avoidance, others ...),
  – smart traffic
    pilots, robotics, fast material inspection,
  – smart stub finders, motion detection (MPEG-4, ...)
• Signal Processing, Speech Processing, Software Radio,
• Correlation, Encryption, Comm. Switching / Protocols,
• Innovative consumer electronics:
  – super smart cards, smart mobile phones, wearable,
  – portable, set-top, laptop, desktop, embedded, ...
• many others, ...

Applications
• new cellular standard: up to 2 Mbit/sec; new CDMA standard: > 500 MIPS needed just for the RF receiver part
• wide variety of end-user devices: smart mobile phones, palm pilots, laptops, games, camcorder-likes, ... the internet car, many new types of devices to come ...
• increasingly wide variety of services available from the network provider: download just what a particular customer is subscribed to
• expert group [Vissers]: > 20% of it will be accelerator code*

Why coarse grain?
[Figure: log-scale trends from 1960 to 2010; curves: algorithmic complexity (Shannon’s Law) for wireless generations 1G to 4G, transistors/chip (memory, microprocessor / DSP), normalized processor speed, computational efficiency (mA/MIP; SH7752, StrongARM), battery performance. Sources: Proc. ISSCC, ICSPAT, DAC, DSPWorld]

Shannon’s Law
• In a number of application areas, throughput requirements are growing faster than Moore’s law
• Fundamental flaws in software processor solutions
• 32 soft ARM cores fit onto a contemporary FPGA
• Stream-based distributed processing is the way to go

It’s a Paradigm Shift!
• Using FPGAs (fine grain reconfigurable) is mainly just classical logic synthesis on a “strange hardware” platform
• Coarse Grain Reconfigurable Arrays (Reconfigurable Computing), however, mean a really fundamental paradigm shift
• This is still ignored by CS and EE curricula and almost all R&D scenes

>> Coarse Grain: why?
• History
• Paradigm Shift
• Coarse Grain: why?
• Coarse Grain Architectures
• Reconfiguration Architecture

It’s a General Paradigm Shift!
• Using FPGAs (fine grain reconfigurable): just logic synthesis on a strange platform
• Coarse Grain Reconfigurable Arrays (Reconfigurable Computing): a fundamental paradigm shift
• Replacing concurrent processes by much more efficient parallelism: stream-based computing arrays
• ignored by curricula & most R&D scenes

Fine-grained vs. coarse-grained
• Fine-grained reconfiguration versus coarse-grained reconfiguration.
• fine grain is general purpose
• slow and area-inefficient, but high parallelism
• coarse grain is application domain-specific
• coarse grain is highly area-efficient
• extremely high performance

Reconfigurability Overhead
[Figure: array of cells labeled L and S; labels: area used by application; resources needed for reconfigurability, partly for configuration code storage; “hidden RAM” not shown]

Principle of a Typical FPGA
[Figure: CLBs with connection points and taps; labels: CLB, Connection Point, Tap, FF of hidden RAM]

Routing Overhead in FPGAs
• > 1000 transistors at each cross bar
• > 40 transistors at each switching point
• > 15 transistors at each tap
• Routing congestion [DeHon]: often 50% or less of CLBs used
• most FPGA vendors’ gate count: 1 flip-flop of configuration RAM = 4 gates
[Figure: FFs shown as part of the hidden RAM]

Why Coarse Grain instead of FPGA?
• reduced reconfigurability overhead by up to ~ 1000
• drastically smaller configuration memory
• much faster loading
• a lot more benefits
[Figure: transistors / chip on a log scale, 1980 to 2010; curves: FPGA physical, FPGA routed, FPGA logical; gap labels: ~ 10, ~ 10 000. Sources: Proc. ISSCC, ICSPAT, DAC, DSPWorld]

>>> extremely high efficiency
1. avoiding address computation overhead
2. avoiding instruction fetch and interpretation overhead
3. high parallelism, massively multiple deep pipelines
4. much less configuration memory
5. no routing areas to configure functions from CLBs

Configurable Computing Systems
• combine a programmable sequential processor with Flexware (structurally programmable “hard”ware):
• capitalize on the strength of both, flexware and software.
• early 1960s: Estrin (UCLA): enabling technology not available
• 1990s: significant increase of research activities (DARPA ...)
• FPGAs: not the enabling technology: hardware skills needed
• Verilog or VHDL based systems often result in poor performance

Platforms available
• Soft Data Path Arrays
  – KressArray
  – Xtreme (PACT)
  – ACM (Quicksilver Tech)
  – CHESS Array (Elixent)
  – others
• Compilation techniques feasibility studies:
  – Partitioning Co-Compiler
  – Design Space Explorer
  – others

Also as an autonomous Machine
• New Machine Paradigm (Xputer)
• is the counterpart of the so-called von Neumann paradigm
• easy to teach: simple machine principles
  – CONS: confuses customers (paradigm switch: the brain hurts)
  – PROS: strong guidance of EDA tool development
  – more effective hardware/software APIs
  – compilation techniques similar to traditional compilation
  – better Application Development Tools accepting C or Java
  – scan patterns (data counter) similar to control flow (program counter)
  – general model of hardware / software co-design
  – fascination for freak effect: opening up a new R&D discipline

>> Coarse Grain Architectures
• History
• Paradigm Shift
• Coarse Grain: why?
• Coarse Grain Architectures
• Reconfiguration Architecture

Some Players in Silicon Valley and ….
Company | Architecture | Business Model | Markets
Adaptive Silicon | Not disclosed | Sell Cores | Embedded DSP
Chameleon Systems | 32 bit datapath array | Sell Chips | Networking
Malleable | Not disclosed | Sell Chips | Voice over IP
MorphICs | Not disclosed | Sell Cores | Wireless Commun.
Silicon Spice | Not disclosed | Sell Solutions | Networking
Systolix | Bit Serial Systolic Array | Sell Cores | Signal Conditioning
Triscend | System on Chip | Sell Chips | Embedded Systems
Network Processors: > 20 Players

Commercial rDPAs
• XPU family (IP cores): PACT Corp., Munich
• CALISTO: Silicon Spice **
• CS2000 family: Chameleon Systems
• MECA family: Malleable **
• flexible array: MorphICs
• ACM: Quicksilver Tech *
• CHESS array: Elixent
• MorphoSys: Morpho Tech *
• FIPSOC: SIDSA
*) here at SoC
**) bought
[Figure: XPU128]

PACT Corp
• Xtreme Processor Platform (XPP) family of IP cores: high-speed, data-stream-capable, scalable, reconfigurable clusters of arrays of 32-bit DPUs with embedded memories and high-speed I/O ports
• Application development support software featuring a flow-graph-style algorithm mapping language, to minimize training requirements.
• XPP’s fabrics feature automatic DataFlow synchronization and a flagged Event Network to dynamically configure the execution flow.
• Supports dynamic RTR: hierarchical configuration managers free the designer from chip-level details and ensure that configurations are independently loaded in exactly the intended order.
• Automatic event-based task swapping along with data streams: released resources are automatically reconfigured immediately

rDPA (Reconfigurable Datapath Array)
[Figure: two arrays of 16 rDPUs each; one with the Reconfigurable Interconnect Fabric (RIF) in a separate routing area, one with the RIF laid out over the rDPUs (rDPA wired by abutment)]

Generically defined Fabrics: KressArray Family
[Figure: panels a) to i) showing fabric variants; legend: rDPU: routing only; rDPU: routing and function]
Some application areas, e.g. wireless communication, need extraordinarily powerful communication resources

Universal RAs are not always feasible
The general-purpose (coarse grain) reconfigurable array may appear to be an illusion ...
... often functional resources are not the throughput bottleneck
Some application areas, e.g. wireless communication, need extremely rich communication resources
Use domain-specific platform generators!
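
The idea of generating a platform from a few fabric parameters might be sketched roughly as follows. This is only an illustration under stated assumptions, not the KressArray generator or any vendor tool: the names FabricSpec, generate_fabric and link_bits, and the choice of parameters (array size, nearest-neighbour ports per abutting edge, path width), are hypothetical and introduced here only to show how richer communication resources follow from changing parameters rather than redesigning the array.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class FabricSpec:
    """Hypothetical parameters of one member of an rDPA family."""
    rows: int
    cols: int
    nn_ports: int    # nearest-neighbour ports per abutting edge (assumed parameter)
    path_width: int  # bits per port (assumed parameter)


def generate_fabric(spec):
    """Return the nearest-neighbour links of a rows x cols mesh of rDPUs.

    Each link is ((r, c), (r2, c2), port_index). Cells are wired by
    abutment, so only horizontally or vertically adjacent cells connect.
    """
    links = []
    for r, c in product(range(spec.rows), range(spec.cols)):
        for dr, dc in ((0, 1), (1, 0)):  # east and south neighbour
            r2, c2 = r + dr, c + dc
            if r2 < spec.rows and c2 < spec.cols:
                for port in range(spec.nn_ports):
                    links.append(((r, c), (r2, c2), port))
    return links


def link_bits(spec):
    """Total width in bits of all nearest-neighbour links."""
    return len(generate_fabric(spec)) * spec.path_width


if __name__ == "__main__":
    narrow = FabricSpec(rows=4, cols=4, nn_ports=1, path_width=32)
    rich = FabricSpec(rows=4, cols=4, nn_ports=3, path_width=32)
    for spec in (narrow, rich):
        print(spec, len(generate_fabric(spec)), link_bits(spec))
```

In this toy model the functional resources (16 rDPUs) stay the same in both cases, while the second specification triples the nearest-neighbour communication resources, which is the kind of domain-specific tuning the slide argues for.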

KressArray Family Example
[Figure: tailored KressArray rDPU example, external view; only the NNport abutment architecture is shown; numeric labels: 2, 4, 8, 16, 24, 32; http://kressarray.de]

KressArray Family generic Fabrics: a few examples
• Select mode, number, and width of NNports
• Select Nearest Neighbour (NN) interconnect: an example
• rDPU modes: route-through only; route-through and function
• more NNports: rich routing resources
• Examples of 2nd Level Interconnect: laid out over the rDPU cell, no separate routing areas!
[Figure: rDPU with Function Repertory (+) and numeric labels 2, 4, 8, 16, 24, 32; http://kressarray.de]

CMOS interconnect resources
• Foundries offer up to 8 metal layers and up to 3 poly layers
• reconfigurable interconnect fabric laid out over the rDPU cell

Super Pipe Networks
property | systolic array | supersystolic rDPA *
applications | regular data dependencies only | no restrictions
pipeline properties: shape | linear only | no restrictions
pipeline properties: resources | uniform only | no restrictions
mapping | linear projection or algebraic synthesis | simulated annealing or P&R algorithm
scheduling (data stream formation) | (e.g. force-directed) scheduling algorithm
*) KressArray [ASP-DAC-1995]

Communication Resource Requirements
... often functional resources are not the throughput bottleneck
In some application areas, e.g. wireless communication, reconfigurable computing arrays need extraordinarily rich and powerful communication resources
The solution: generators for domain-specific RA platforms

SNN filter KressArray Mapping Example
[Figure: SNN filter mapped onto a KressArray; array size: 10 x 16 = 160 rDPUs; legend: rDPU not used, used for routing only, operator and routing; port location marker; backbus connect, backbus connect not used; http://kressarray.de]

Xplorer Plot: SNN Filter Example [13]
[Figure: 2 hor. NNports, 32 bit; 3 vert. NNports, 32 bit; legend: route-thru-only rDPU, operator (+), operand, result, route thru, backbus connect; http://kressarray.de]

KressArray: try it out yourself!
• You may experiment yourself
• You may use it over the internet
• Map an application onto a KressArray
• Start with a simple example
• Visit http://kressarray.de
• Click the link to Xplorer (try Netscape 4.7x)
• ... does not run on Internet Explorer ...
• ... since Bill Gates does not like Java

Michael Herz
Dissertation Michael Herz (Agilent, Sindelfingen):
• ... on mapping parallel memory architectures for stream-based arrays onto KressArrays
• ... also transformation of storage schemes to optimize memory bandwidth
• (MoM scan pattern transformations)

Ulrich Nageldinger
Dissertation Ulrich Nageldinger (Infineon Technologies, Munich):
• ... on mapping applications onto KressArrays
• ... simultaneous routing and placement by simulated annealing
• Supporting a huge family of KressArrays
• fuzzy logic improvement proposal generator
• profiling
• design space exploration

Rainer Kress
Dissertation Rainer Kress (Infineon Technologies, Munich):
• ... on mapping applications onto his* KressArray
• DPSS datapath synthesis system
• Including a data scheduler
• (data stream scheduler)
• Generalization of the Systolic Array
• (KressArray is a super systolic array)
• 32 bit design via Eurochip support

Jürgen Becker
Dissertation Jürgen Becker (Professor at Univ. Karlsruhe):
• ... automatically partitioning co-compiler
• (configware / software co-compilation)
• Resource-parameter-driven, retargetable
• Profiler-driven optimization
• Accepts HLL “ALE-X” (extended C subset)
• (subset: pointers not supported)

Karin Schmidt
Dissertation Karin Schmidt (DaimlerChrysler Research):
• Compilation Techniques for Xputers
• modified loop transformations
• Modified parts of the implementation used for Jürgen Becker’s Ph.D. thesis

CHESS Array w. embedded RAM (Elixent)
multi-granular, e.g.
16 * 4 bits = 64 bits
[Figure: CHESS array of ALUs with embedded RAM blocks; labels: Sequencer, ALU, RAM, Memory Interface, User Registers, Clock Control]

Chameleon Systems
• RISC processor and an array of 108 arithmetic processing units. Each of those 32-bit processing cores runs at 125 MHz.
• The CS2112 is the industry's first Reconfigurable Communications Processor (RCP), a streaming data processor.
• The vendor claims a performance of 20 billion 16-bit operations per second, 2.4 billion 16-bit multiply-accumulates per second, and 1.6 GBytes / sec for its programmable I/O (PIO) banks.
• It also has a PCI interface.
• Tool suite C~SIDE for developing, verifying and optimizing.

Coarse Grain Architectures
style | project | first publ. | source | architecture | granularity | fabrics | mapping | intended target application
mesh | DP-FPGA | 1994 | [4] | 2-D array | 1 & 4 bit, multi-granular | inhomog. routing channels | switchbox routing | regular datapaths
mesh | KressArray | 1995 | [5,11] | 2-D mesh | family: sel. pathwidth | multiple NN & bus segments | (co-)compilation | (adaptable)
mesh | Colt | 1996 | [12] | 2-D array | 1 & 16 bit | inhomogeneous | run time reconfiguration | highly dynamic reconfig.
mesh | Matrix | 1996 | [15] | 2-D mesh | 8 bit, multi-granular | 8NN, length 4 & global lines | multi-length | general purpose
mesh | RAW | 1997 | [17] | 2-D mesh | 8 bit, multi-granular | 8NN switched connections | switchbox rout | experimental
mesh | Garp | 1997 | [16] | 2-D mesh | 2 bit | global & semi-global lines | heuristic routing | loop acceleration
mesh | REMARC | 1998 | [18] | 2-D mesh | 16 bit | NN & full length buses | (info not available) | multimedia
mesh | MorphoSys | 1999 | [19] | 2-D mesh | 16 bit | NN, length 2 & 3 global lines | manual P&R | (not disclosed)
mesh | CHESS | 1999 | [20] | hexagon | 4 bit, multi-granular | 8NN and buses | JHDL compilation | multimedia
mesh | DReAM | 2000 | [21] | 2-D array | 8 & 16 bit | NN, segmented buses | co-compilation | next generation wireless communication
mesh | CS2000 family | 2000 | [23] | 2-D array | 16 & 32 bit | inhomogeneous array | (not disclosed) | tele- & datacommunication
mesh | MECA family | 2000 | [24] | 2-D array | multi-granular | (not disclosed) | (not disclosed) | tele- & datacommunication
mesh | CALISTO | 2000 | [25] | 2-D array | 16 bit, multi-granular | (not disclosed) | (not disclosed) | tele- & datacommunication
mesh | FIPSOC | 2000 | [26] | 2-D array | 4 bit, multi-granular | (not disclosed) | (not disclosed) |
linear | RaPID | 1996 | [27] | 1-D array | 16 bit | segmented buses | channel routing | pipelining
linear | PipeRench | 1998 | [29] | 1-D array | 128 bit | (sophisticated) | scheduling | pipelining
crossbar | PADDI | 1990 | [30] | crossbar | 16 bit | central crossbar | routing | DSP
crossbar | PADDI-2 | 1993 | [32] | crossbar | 16 bit | multiple crossbar | routing | DSP and others
crossbar | Pleiades | 1997 | [33] | mesh+crossbar | multi-granular | multiple segmented crossbar | switchbox routing | multimedia

Primarily Mesh-based ….
market | project | granularity (bits) | source
research | KressArray | variable | U. Kaiserslautern
research | Garp | 2 | UC Berkeley
research | CHESS | 4 | Hewlett Packard
research | Matrix | 8 | M.I.T.
research | RAW | 8 | M.I.T.
research | Colt | 1 & 16 | Virginia Tech
research | DReAM | 8 & 16 | TU Darmstadt
research | REMARC | 16 | Stanford
research | MorphoSys | 16 | UC Irvine
commercial | CALISTO | 16 | Silicon Spice
commercial | MECA family | 16 | Malleable
commercial | CS2000 family | 16 & 32 | Chameleon Systems
commercial | FIPSOC | 16 & analog | SIDSA
commercial | XPP XPU128 | 32 | PACT Corp.

UC Berkeley (Jan Rabaey)
market | project | granularity (bits) | source
research | PADDI, PADDI-2, Pleiades | 16 | UC Berkeley

Crossbar-based Architectures
• 1990: PADDI, UC Berkeley (Jan Rabaey)
• 1993: PADDI-II (Jan Rabaey)
• 1997: Pleiades (mesh & crossbar)
[Figure: eight EXUs, each with a CTL block, around a crossbar switch with I/O; labels: 16 bit, 32 bit]

PADDI-II Architecture
[Figure: processing elements P1 to P48 in 4-PE clusters; labels: Level-1 Network, Level-2 Network, 16 x 16b, 6 x 16b, 16 x 6 switch matrix, break-switch, I/O]

MorphoSys
[Figure not recoverable]

PipeRench Architecture (CMU 1998)
• alternating data/instruction stream
• highly dynamic reconfiguration

MATRIX (M.I.T. 1996) and RAW (M.I.T. 1997)
• MATRIX: Multiple Alu archiTecture with Reconfigurable Interconnect eXperiment; 0.5 µm CMOS, 8 bit, 10 x 10, 1.8 mm2, 100 MHz
• RAW: Reconfigurable Architecture Workbench; MIPS-like processor core
[Figure: MATRIX BFU; labels: Network Port A, Network Port B, ALU Func Port, Mem Func Port, 256x8 bit Mem, 8 bit ALU, compare / reduce 1, compare / reduce 2, C / R Network, Level-1 Network, global lines, cross bar, mode, WE; opcode table (opc 0 to 15) with operations ×, ×+, ×++, × const, insh, nsh, dsh, csh, +, +0, +1, :=, nand, nor, xor]

MATRIX Interconnect Fabrics
Communication resources are often the bottleneck
[Figure: BFUs; a BFU and its neighbours]

More Research Projects
• Garp (UC Berkeley)
• RaPiD (U. Washington)
• REMARC (Stanford)
• DReAM (U. Karlsruhe)
• .... and others
published between 1996 - 2000
Asia / Pacific: also see embedded tutorials by Prof. Amano (ASP-DAC’99, FPL-2000)

RaPiD Architecture
[Figure: linear datapath; labels: MULT, ALU, RAM, Datapath Registers, Bus Connectors, Input Multiplexers, Output Drivers]

REMARC
[Figure not recoverable]

Future Coarse Grain RA Development
• It is indispensable to operate within the convergence area of compilers, co-compilers, architecture and full-custom-style VLSI design (array cells).
• It is a must that products come with a development platform which encourages users, especially those with a limited hardware background.

>> Reconfiguration Architecture
• History
• Paradigm Shift
• Coarse Grain: why?
• Coarse Grain Architectures
• Reconfiguration Architecture

Dimensions of Reconfigurability
ASIPs* vs. Network Processors
*) Application-Specific Instruction set Processors
Configuration time: design time, fabrication time, compile time, run time
Extremes:
class of product | processor | vendor
ASIP | Tensilica | Tensilica
Network Processor | MECA family | Malleable
Network Processor | CALISTO | Silicon Spice
Network Processor | many others | many others
[Figure: configuration time axis; labels: ASIP, Network Processor, statically reconfigurable, dynamically reconfigurable]

Configuration Architectures (dynamic vs. static)
• straightforward: host (compiler, mapper, RTOS etc.), RAM, Soft Data Path
• configuration caching*: host (compiler, mapper, RTOS etc.), RAM, Config. Cache
• multi-context: host (compiler, mapper, RTOS etc.), several RAMs, Soft Data Path
*) not a cache in the usual sense!
Configuration Loading Resources:
• separate configuration fabrics (e.g. FPGA)
• wormhole routing (KressArray, Colt, PipeRench)
• RA part computes code for other RA part (dynamic self-reconfiguration)

Colt Architecture (P. Athanas 1996)
Studying highly dynamic reconfiguration
[Figure: labels: Multiplier, Smart Crossbar, DP, IFU (16 units), I/O Pins, wormhole routing]

- END -