The CMU Reconfigurable Computing Project
April 9, 1999
Mihai Budiu
mihaib@cs.cmu.edu
SSS 4/9/99 CMU Reconfigurable Computing

Current Project Members
CS Department: Seth Copen Goldstein, Mihai Budiu
ECE Department: Herman Schmit, Srihari Cadambi, Matt Moe, Robert Taylor, Ronald Laufer

Why Study Reconfigurable Hardware?
It is a nice computation paradigm (wire your own computer)

Why Study Reconfigurable Hardware

Algorithm            Year  System         Versus              Speedup (x)
DNA matching         1992  SPLASH 2       SPARC 10            4300
FIR filter           1998  PipeRench      UltraSparc 300MHz   90
IDEA encryption      1998  PipeRench      UltraSparc 300MHz   61
SAT solver           1997  Pamette        SPARC 5 110MHz      17--1100
Ray casting          1995  RIPP-10        Pentium 75MHz       33.8
Hidden Markov Model  1996  1 Xilinx FPGA  SPARC 10            24.4
DES encryption       1996  GARP           UltraSparc 170MHz   24
SPEC92               1994  MIPS+RC        MIPS                1.22

Commercial Players
[Figure: chart of commercial players; image not available]
Source: In-Stat, April 1998
*Does not include software, hardwire or support EPROMs

What Is “Reconfigurable Hardware?”
• Interconnection network
• Universal gates and/or storage elements
• Switches

Basic Ingredient: RAM cell
[Figure: a RAM cell addressed by inputs a0, a1 and holding data 0, 0, 0, 1, shown next to the equivalent AND gate (labelled "a1 & a2" in the figure)]
Universal gate = RAM

Basic Ingredients (ctd)
[Figure: switches with configuration bits 1, 0, 1, 1]
A switch is controlled by a 1-bit RAM cell

Outline
• What is reconfigurable hardware
• RH vs other computation paradigms
• Challenges in RH research
• PipeRench: the CMU project:
  – the hardware
  – the software
• Conclusions

RH vs ASICs
• Generally Application-Specific Integrated Circuits will be faster than RH:
  – RH wires are slow & big
  – RH bit-slices are costly to interconnect
  – RH devices must store configuration on the chip
but
• RH can be reprogrammed
  – new algorithms
  – to fix bugs
• RH cheaper in small production
• RH tolerates faults better
• RH sometimes faster with staged computation

RH vs Microprocessors
• RH less flexible (like a VLIW with fixed instructions)
but
• RH provides more (customized) computation elements
• RH can decrease memory traffic
• RH can be tailored for specific algorithms and data types
RH will not replace microprocessors (µP), but complement them

Types of RH
• FPGAs: bit-level logic functionality (the basic processing elements compute on 1 bit)
• word-based architectures: PipeRench (CMU) (basic PE operates on 8 bits) (basic PE is a small ALU)
• coarse architectures: RAW (MIT) (basic PE is a MIPS 2000 core)

RH In A System
[Figure: coupling of RH in a system; image not available]
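
The "Universal gate = RAM" idea from the Basic Ingredient slides can be sketched in software. This is only an illustration, not the project's hardware description: the helper names `make_lut`, `lut_read` and `switch`, and the choice of the first input as the most significant address bit, are assumptions made for this example.

```python
from itertools import product


def make_lut(func, n_inputs):
    """Fill a 2**n_inputs-entry RAM with the truth table of func."""
    return [func(*bits) & 1 for bits in product((0, 1), repeat=n_inputs)]


def lut_read(ram, *inputs):
    """Evaluate the gate: the input bits form the RAM address.

    Assumption: the first input is the most significant address bit.
    """
    address = 0
    for bit in inputs:
        address = (address << 1) | bit
    return ram[address]


def switch(config_bit, signal):
    """A routing switch: passes the signal only if its 1-bit RAM cell holds 1."""
    return signal if config_bit else None


# The same "hardware" becomes a different gate when the RAM contents change.
and_ram = make_lut(lambda a0, a1: a0 & a1, 2)  # stored data: 0, 0, 0, 1
xor_ram = make_lut(lambda a0, a1: a0 ^ a1, 2)  # stored data: 0, 1, 1, 0
```

Reconfiguring the device then amounts to rewriting the RAM contents and the switch bits; the lookup logic itself never changes.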

Challenges In RC
• Software tools:
  – Programming RC like software development
  – Automatic compilation from HLL
  – Automatic program partitioning
• Mapping algorithms efficiently (no ISA)
• System issues
  – interfaces
  – find “ideal” RC fabric

The CMU Reconfigurable Computing Project

Hardware Goals
• To build a complete reconfigurable hardware device
• To build the system integration hardware
• To host the device in a PC

Our Device:
• Word processing elements
• Pipelined architecture
• Virtualized hardware
• Local interconnection network
• Wide pipelined bus
[Figure labels: configuration memory; data & config controller; stripes; processing elements]

Hardware Virtualization
[Figure labels: actual available hardware; instructions currently in hardware; instructions paged out]

Hardware Virtualization (2)
[Figure labels: page out; compute, compute, compute, configure; page in; hardware; program in configuration memory]
Overlap configuration with computation.

Processing Elements
[Figure labels: inputs a, b, Cin; PE0, PE1, PE2; out]
• Look-up table
• Any 3-to-1 function

The Interconnection Network
[Figure labels: word-level cross-bar; PE 0, PE 1, PE N; pass registers; widths B bits, P*B bits, P*B*N bits]

The PCI Board
[Figure: chip.eps; image not available]
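
The Hardware Virtualization slides say that configuration is overlapped with computation: while one stripe is being configured, the others keep computing. A minimal sketch of such a schedule follows. The round-robin policy (one stripe reconfigured per cycle, virtual stripes paged in in order) and the name `virtualization_schedule` are assumptions for illustration; the slides do not spell out the actual policy.

```python
def virtualization_schedule(n_virtual, n_physical, n_cycles):
    """Return, for each cycle, what every physical stripe is doing.

    Assumed policy: each cycle exactly one physical stripe is reconfigured
    (round-robin) with the next virtual stripe; all other configured
    stripes compute in that cycle.
    """
    state = [None] * n_physical  # virtual stripe held by each physical stripe
    trace = []
    for cycle in range(n_cycles):
        target = cycle % n_physical
        # Paging in a new virtual stripe implicitly pages out the old one.
        state[target] = cycle % n_virtual
        row = []
        for p, v in enumerate(state):
            if v is None:
                row.append("idle")
            elif p == target:
                row.append(f"configure {v}")
            else:
                row.append(f"compute {v}")
        trace.append(row)
    return trace


# A 5-stripe application running on 3 physical stripes.
for cycle, row in enumerate(virtualization_schedule(5, 3, 6)):
    print(cycle, row)
```

Once the pipeline is full, every cycle in this model has one stripe configuring and the remaining stripes computing, which is the overlap the slide refers to.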

Software Goal
To program reconfigurable devices using the standard software development processes:
  – Compile C or Java
  – Do it quickly
[Figure labels: Java; Partitioner; Data-flow Intermediate Language (DIL); Built; Configuration; Reconfigurable HW; CPU]

Building Circuits From DIL
    a = b + c * d;
    e = c - d;
• variables: wires
• operators: gates
[Figure: data-flow graph with inputs b, c, d, operators *, +, -, and outputs a, e]

Mapping Circuits To
[Figure: adder and subtractor circuits with inputs a, b, c; image not available]

The DIL Compiler Front-End
[Figure labels: DIL input file; Parser; Evaluator; Loader; component library; component circuits; circuit; Backend]

The DIL Compiler Backend
[Figure labels: Front-end; Circuit (expanded); Circuit Optimizer; Placer/Router; Circuit (placed); Code generator; outputs: xfig, C++, Asm]
The whole compilation process is very fast: about two orders of magnitude faster than classical CAD tools.

Processing Element Size Tradeoffs

Small                   Big
Efficient usage         Wasteful
Slower                  Faster bit-slice
Flexible interconnect   Coarse routing
Bigger configuration    Fewer configuration bits
Place and route easier  Constrains the compiler

Stripe Width Tradeoffs

Wider            Narrower
Fewer stripes    More will fit
Virtualize more  Fewer page-ins
Bandwidth waste  Less bandwidth available
Placer freedom   Placement constrained

Bus Width Tradeoffs

Wider           Narrower
More area       Less area
High bandwidth  Time-mux bus

Clock Speed Tradeoffs (run-time)

Faster                  Slower
Short critical path     Big chains
Long pipeline built     Compact circuits
Decomposition overhead  Little decomposition
Virtualized more        Less virtualized

[Figure: a 24-bit adder compared with a chain of 8-bit adders]

Configuration Bits per Stripe
[Figure: configuration bits (0 to 1600) versus stripe width (64 to 144), one curve per PE bit width: 2, 4, 8, 16, 32]

[Figure: FIR throughput (fir-throughput.eps); image not available]

Project Status
• Operational:
  – Behavioral and structural models of PipeRench in Verilog
  – Assembler, simulator
  – Tools for visualization and debugging
  – One tile fabricated and tested
  – Very fast compiler from intermediate language
• In progress:
  – Prototype PipeRench to be taped out this summer
  – PCI board to host PipeRench in a PC

Simulated Speed-up vs. UltraSparc @ 300MHz
[Figure: bar chart on a logarithmic scale (1.0 to 1000.0); benchmarks: ATR, Cordic, DCT, FIR, IDEA, Nqueens, Over; bar values shown: 328.8, 90.9, 76.1, 61.8, 29.0, 26.0, 20.6]

Future Work
• Build the PCI board
• Build the OS device drivers
• Start investigating HLL issues:
  – automatic partitioning
  – translation to DIL
  – special code transformations

Conclusions
• A set of important applications can benefit from RC devices
• RC offers potential for substantial performance improvement at a low cost
• RC devices will soon be mainstream in the embedded computing world; perhaps in the future they will also permeate the desktop
[Figure labelled "Pentium V"]
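
As a supplement to the "Building Circuits From DIL" slide, the rule that variables become wires and operators become gates can be sketched as follows. The actual DIL grammar and compiler are not shown in the slides, so this example borrows Python's `ast` module as a stand-in parser; `build_circuit`, the gate tuple layout and the temporary wire names `t1`, `t2` are hypothetical.

```python
import ast

# Operators from the slide's example, mapped to gate names.
OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*"}


def build_circuit(source):
    """Turn simple assignments into gates of the form (op, in1, in2, out)."""
    gates = []
    counter = 0

    def wire_for(node, out=None):
        """Return the wire carrying the value of an expression node."""
        nonlocal counter
        if isinstance(node, ast.Name):
            return node.id  # a variable is just a wire
        if isinstance(node, ast.BinOp):
            left = wire_for(node.left)
            right = wire_for(node.right)
            if out is None:  # intermediate result: invent a wire name
                counter += 1
                out = f"t{counter}"
            gates.append((OPS[type(node.op)], left, right, out))
            return out
        raise ValueError("unsupported expression")

    for stmt in ast.parse(source).body:
        # Each statement is assumed to be `name = expression`.
        wire_for(stmt.value, out=stmt.targets[0].id)
    return gates


gates = build_circuit("a = b + c * d; e = c - d;")
for gate in gates:
    print(gate)
```

For the slide's two statements this yields three gates: a multiplier feeding an adder that drives `a`, and a subtractor that drives `e`, with wires `c` and `d` shared between the multiplier and the subtractor.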