Fast Compilation for Reconfigurable Hardware

Mihai Budiu and Seth Copen Goldstein
Carnegie Mellon University, Computer Science Department
Joint work with Srihari Cadambi, Herman Schmit, Matt Moe, Robert Taylor, Ronald Laufer
FPGA, Feb 23 1999 (c) 1998 by Mihai Budiu

Goal
To program reconfigurable devices using the standard software development processes:
– Compile C or Java
– Do it quickly
[Figure: compilation flow with boxes labeled Java, Partitioner, Data-flow Intermediate Language (DIL), Configuration, Reconfigurable HW, and CPU; the part covered here is marked "This talk".]

Compiler Performance on 1D DCT (8 inputs, 8 bits each)

                        DIL             Classical tools
Total compile time      2.4 s           ~75 min (Synopsys + Design Manager)
Place and route         1 s             14 min 22 s (Design Manager)
Target clock speed      75 MHz          33 MHz
Circuit size            7816 bit-ops    899 CLBs
Application speed-up    20              ~20
Target                  PipeRench       Xilinx 4085XL

Compilation: ~700x faster

The Place and Route Problem
[Figure: a data-flow graph with the operators ~, &, <<, >>, .[1,2] and +, shown next to the hardware; labels: "Interconnection operators", "Interconnection network", "Processing elements".]

Our Target:
• Medium grain processing elements (4 bits)
• Pipelined architecture
• Virtualized hardware
• Local interconnection network
• Wide pipelined bus

The Place and Route Problem
[Figure: the same graph and hardware, with an added "Stripe" label.]

Why Place and Route Is Hard
• Hard constraints:
  – Stripe width
  – Pipelined bus width
• Word-based circuit
  – interconnection network switches words
  – fixed PE size
• Scarce input ports for the interconnection network

How We Simplify Place and Route
• Computation-oriented programs (restricted language, with unidirectional data flow)
• Hardware resources virtualized
• Relatively rich interconnection network
• High granularity placement (i.e.
one 32-bit adder instead of 100 gates)
• There is a wide pipelined bus available
• Timing is very predictable

The Key Idea
• Global analysis and transformations guarantee placeability using lazy noops (conservatively)
• Deterministic, greedy place & route (no backtracking)
• All passes linear time in the size of the circuit

Guaranteeing Placement
[Figure: the example graph before and after noop insertion; labels: "Complex permutation", "Simple permutation", "noop".]
The inserted noops are sufficient but not necessary

Placement of a Non-lazy Noop
[Figure: a graph with ~, & and + nodes in which the noops are placed.]

Lazy Noops Are Not Placed
[Figure: the same graph with ~, & and + nodes, showing noops that are not placed.]

Place and Route Overview
• Analysis:
  – Noops have been inserted to guarantee that the graph is routable.
• Place & Route:
  – will determine which lazy noops are instantiated
Next: actual Place and Route

Step 1: Analyze Routability
[Figure: a partially placed graph; ~, & and a noop are marked "Already placed", and several candidate positions for + are shown.]
Q: can we place the + given the placement of its ancestors?

Step 2: If a Node Is Unroutable
[Figure: the graph with ~, &, noops and the + node that cannot be routed.]
Solution: promote a lazy noop

Step 3: Choosing a Noop
[Figure: the graph with ~, &, several noops and the + node.]
Closest noop which is routable.
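The three steps above can be illustrated with a small Python sketch. It is a toy model and not the DIL compiler's algorithm: the function name `place`, the stripe numbering, and above all the routability rule (a value can only travel from one stripe to the next) are assumptions made for the example. Under that rule, an edge between adjacent stripes leaves its lazy noop unplaced, while an edge that skips stripes gets one noop promoted per skipped stripe.

```python
from collections import defaultdict


def place(nodes, edges):
    """Greedy single-pass placement with lazy noops (toy model).

    nodes: node names in topological order (producers before consumers).
    edges: list of (producer, consumer) pairs.
    Assumed rule: a consumer can only read from the stripe directly above it.
    """
    preds = defaultdict(list)
    for src, dst in edges:
        preds[dst].append(src)

    stripe = {}    # node -> stripe index
    promoted = []  # (producer, consumer, stripe) for each instantiated noop

    for node in nodes:  # one pass over the graph, no backtracking
        # Place the node in the first stripe below all of its producers.
        s = 1 + max((stripe[p] for p in preds[node]), default=-1)
        stripe[node] = s
        for p in preds[node]:
            # Step 1: the edge is routable as-is only if the stripes are adjacent.
            # Steps 2 and 3: otherwise promote lazy noops on this edge,
            # one in each stripe the value has to cross.
            for hop in range(stripe[p] + 1, s):
                promoted.append((p, node, hop))
    return stripe, promoted


# Example graph: "add" reads both "not" and "and", and "and" reads "not".
stripe, promoted = place(
    ["not", "and", "add"],
    [("not", "and"), ("not", "add"), ("and", "add")],
)
print(stripe)    # {'not': 0, 'and': 1, 'add': 2}
print(promoted)  # [('not', 'add', 1)]
```

Only the edge from "not" to "add" crosses a stripe, so only its noop is instantiated; the lazy noops on the other two edges cost nothing. The work is proportional to the number of nodes and edges plus the number of noops actually promoted.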

Other Details
• Operators are decomposed in pieces for:
  – timing constraints
  – size constraints
• When placing optimize for
  – register pressure when accessing the bus
  – constraints placed on future nodes
• Long critical paths are sliced with pipeline registers

Compilation Times (Seconds on PII/400)
[Figure: bar chart of compilation time in seconds per benchmark kernel, vertical axis 0 to 9. Bar labels as extracted, not matched to kernels: 8.07, 2.43, 2.27, 1.36, 1.25, 0.95, 0.84, 0.13, 0.07, 0.47, 0.86.]

Compilation Speed (PII/400)
[Figure: chart per kernel with two series, "bitops" (axis "Bit Operations/Kernel") and "bitops/sec" (axis "Bit Operations Compiled/Sec"); kernels: ATR, Cordic, CSD, DCT, Encoder, FIR, IDEA, Nqueens, Over, Square, Varpoly, GMean.]

Compilation Times Breakdown
[Figure: stacked bar chart per kernel, 0% to 100%, with components evaluation, simplification, library, analysis, place, other; annotation "Place and route"; kernels: Lfir, csdmult, cordic, dct, encoder, nqueens, idea, over, popcnt, square, varpoly.]

Placed Circuit Utilization
[Figure: bar chart per kernel, 0% to 100%, with two series, "utilization" and "effective utilization"; kernels: ATR, Cordic, CSD, DCT, Encoder, FIR, IDEA, Nqueens, Over, Square, Varpoly, GMean.]

Simulated Speed-up vs. UltraSparc @ 300MHz
[Figure: bar chart on a logarithmic axis from 1.0 to 1000.0; kernels: ATR, Cordic, DCT, FIR, IDEA, Nqueens, Over. Bar labels as extracted, not matched to kernels: 328.8, 90.9, 76.1, 61.8, 29.0, 26.0, 20.6.]

Conclusions
• Fast compilation from HLL achievable (seconds, not tens of minutes)
• High-quality output achievable (60% density)
• Linear-time Place and Route feasible using the technique of lazy noops

Future Work
• Time-multiplexing the bus
• Porting to commercial FPGAs
• Front-end from C/Java to DIL

Timing and Size Guarantees
[Figure: a + operator on 24-bit wires shown decomposed into + operators on 8-bit wires.]

Optimize for Register Pressure
[Figure: candidate positions for a + node in a partially placed graph, each annotated with a cost (1, 2, 1, --, --, 0), and one marked "Best position".]

Kernels

Benchmark   Description
ATR         Automatic Target Recognition (image pattern scan).
Cordic      Honeywell timing benchmark for vector rotation.
CSD         Canonical signed multiplier with the constant 123.
DCT         One-dimensional 8-point discrete cosine transform.
Encoder     Huffman encoder for fixed frequencies.
FIR         Finite Impulse Response filter with 20 taps.
IDEA        PGP encryption algorithm.
Nqueens     8x8 queens solution tester.
Over        Porter-Duff "over" operator.
Square      Squaring a 16-bit number.
Varpoly     Evaluating a degree-3 polynomial with variable coefficients at a given point.