The Optimization of Interconnection Networks in FPGAs
Dr. Yajun Ha
Assistant Professor, Department of Electrical & Computer Engineering
National University of Singapore
© NUS 2010, Dagstuhl Seminar

Outline
• Background and motivation
• Time-multiplexed interconnects in FPGAs
• sFPGA2 architecture
• Conclusion

FPGA Research Challenges
Research challenges for FPGA architectures and tools are closely linked, and they stem from the underlying semiconductor technologies. Scaling semiconductor technologies brings the following new challenges:
• Leakage power: dual-Vt, dual-Vdd, or subthreshold architectures
• Process variations: reconfigurability for variability, fault tolerance
• Substantially more transistors: scalable, multi-core, secure architectures and system-level design (SLD)

Motivation
Logic and interconnect are unbalanced in FPGAs.
Qualitatively:
• "PLDs are 90% routing and 10% logic." (Prof. Jonathan Rose, Design of Interconnection Networks for Programmable Logic, Kluwer Academic Publishers, 2004, p. xix)
• "... (in FPGAs) programmable interconnect comes at a substantial cost in area, performance and power." (Prof. Jan Rabaey, Digital Integrated Circuits, 2nd Edition, Prentice-Hall, 2003, p. 413)
Quantitatively:
• Area: logic area vs. routing area
• Delay: logic delay vs. net delay
• Power: dynamic power consumed by logic vs. by interconnect

Unbalance: Area
[Chart: routing area / logic area ratio] Relative weight of routing area and logic area for the 20 largest MCNC benchmark circuits (alu4 through tseng), assuming the PTM 90 nm CMOS process. Data produced by VPR v5.0.2.
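The area unbalance can be made concrete with a back-of-the-envelope calculation. The per-tile area figures below are hypothetical placeholders chosen to match the "90% routing" rule of thumb; they are not the VPR/PTM 90 nm measurements from the chart.

```python
# Back-of-the-envelope routing/logic area split for one island-style
# FPGA tile. All figures are hypothetical, normalized to the logic
# cluster's area; they are NOT the VPR/PTM 90 nm data.
TILE_AREA = {
    "logic_cluster":     1.0,  # CLB with ten 4-input LUTs
    "connection_blocks": 3.5,  # crossbars at CLB input/output pins
    "switch_block":      4.0,  # programmable wire-to-wire switches
    "wire_buffers":      1.5,  # drivers/buffers for the routing tracks
}

logic_area = TILE_AREA["logic_cluster"]
routing_area = sum(a for part, a in TILE_AREA.items()
                   if part != "logic_cluster")

print(f"routing/logic area ratio: {routing_area / logic_area:.1f}x")
# -> routing/logic area ratio: 9.0x  (i.e., ~90% routing, ~10% logic)
```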
Unbalance: Delay
[Chart: logic delay vs. net delay] Delay breakdown along the critical path for the 20 largest MCNC benchmarks (alu4 through tseng), assuming the PTM 90 nm CMOS process. Data produced by VPR v5.0.2.

Unbalance: Power
[Chart] Dynamic power breakdown for a real circuit [1], assuming Xilinx Virtex-II FPGAs.
Note:
• Double: the length-2 wires
• Hex: the length-6 wires
• Long: the long wires spanning the whole chip
• IXbar & OXbar: the crossbars at the input and output pins of the logic blocks
[1] L. Shang, A. Kaviani and K. Bathala, "Dynamic power consumption in Virtex-II FPGA family," ACM FPGA, 2002.

Intra-Clock Cycle Idleness
The clock cycle is constrained by the critical path delay, so many wires are idle for a significant fraction of each clock cycle.
An example, clma, the largest circuit (~8400 4-input LUTs) in the MCNC benchmark suite:
• Implemented with VPR v5.0.2 on an island-style FPGA (ten 4-input LUTs per CLB, 100% length-4 wires), assuming the PTM 90 nm CMOS process
• Timing results after place and route: critical path delay = 9.50 ns, while most nets (~96.5%) have delays of less than 1 ns
• In short, expensive wires are often underutilized.

Time-Multiplexing
[Figure] Nets N1 and N2 routed between CLBs: with conventional switches, the two nets use two routing wires; with multi-context switches, the two nets share one wire.
• Use switches with multiple contexts to time-multiplex wires and keep them busy.
• Can potentially save wire area and achieve better timing performance.
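The wire-sharing idea can be sketched with a toy model (my simplification for illustration, not the actual time-multiplexed routing algorithm): the clock cycle is split into a few microcycles, a net occupies a shared wire only during the microcycle in which its signal propagates, and nets active in different microcycles can share one physical wire through a multi-context switch.

```python
# Toy model of time-multiplexed interconnect. A simplification for
# illustration, not an actual TM router.

CRIT_PATH_NS = 9.50    # critical path delay of clma (from the slides)
TM_RATE = 2            # number of switch contexts; a rate of 2-4 is typical
MICRO_NS = CRIT_PATH_NS / TM_RATE

def microcycle(ready_ns: float) -> int:
    """Microcycle in which a net's signal propagates, assuming it is
    launched as soon as its source value is ready."""
    return min(int(ready_ns // MICRO_NS), TM_RATE - 1)

def wires_needed(ready_times_ns):
    """Lower bound on physical wires when nets in different microcycles
    share a wire: the count is set by the most crowded microcycle."""
    per_slot = [0] * TM_RATE
    for t in ready_times_ns:
        per_slot[microcycle(t)] += 1
    return max(per_slot)

# Three nets ready early (all contend in microcycle 0), one ready late:
print(wires_needed([0.2, 0.5, 1.0, 5.0]))   # -> 3 wires instead of 4
```

Note that when most nets are ready in the first microcycle, that crowded slot sets the wire count, so little sharing is possible; this is why only a limited fraction of nets can profitably share wires.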
Preliminary Results
We brought time-multiplexing enhancements to existing CAD tools; preliminary studies show positive results:
• For 16 MCNC benchmark circuits, ~11.5% saving in the minimum required number of wires, at a ~1.5% timing overhead
• For the same 16 circuits, ~8.2% reduction in critical path delay using the same number of wires
See [1] and [2] for details.
[1] H. Liu et al., "An Area-Efficient Timing-Driven Routing Algorithm for Scalable FPGAs with Time-Multiplexed Interconnects," FCCM 2008.
[2] H. Liu et al., "An Architecture and Timing-Driven Routing Algorithm for Area-Efficient FPGAs with Time-Multiplexed Interconnects," FPL 2008.

TM FPGA Challenges and Ongoing Work
• The TM rate cannot be too high if the TM clock rate is to remain reasonable; we are currently targeting a rate of 2 to 4.
• The nets that qualify for TM are limited, since most nets finish within the first microcycle. Dual-Vt architectures are proposed to adjust net delays so as to achieve lower power and more TM opportunities.

Motivation
In current FPGAs, the switching requirement grows superlinearly with the number of logic resources; in other words, the current architecture scales poorly. To address this, FPGA interconnect wires need to be organized hierarchically to achieve scalability [3].
[3] Rizwan Syed et al., "sFPGA2 - A Scalable GALS FPGA Architecture and Design Methodology," FPL 2009.

How Are Multiple FPGAs Connected?
[Figure] MGT-based serial switched interconnect; PCI Express.
Serial, switch-based interconnects are the future of peripheral interconnect!
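The superlinear growth in switching demand can be illustrated with a Rent's-rule-style estimate, a standard VLSI wiring model; the parameters t and p below are illustrative choices, not figures from the sFPGA2 work.

```python
# Rent's-rule-style illustration of why flat FPGA interconnect scales
# poorly: for a flat sqrt(N) x sqrt(N) island-style array, channel
# width grows roughly like N^(p - 1/2), so the total number of routing
# switches grows like N^(p + 1/2), i.e., faster than the logic itself.
# RENT_T and RENT_P are illustrative, not measured, parameters.

RENT_T, RENT_P = 4.0, 0.7

def channel_width(n_luts: int) -> float:
    """Estimated routing-channel width for an array of n_luts LUTs."""
    return RENT_T * n_luts ** (RENT_P - 0.5)

def switches_per_lut(n_luts: int, switches_per_track: int = 6) -> float:
    """Routing switches per LUT; it grows with N, so the total switch
    count grows superlinearly with logic capacity."""
    return switches_per_track * channel_width(n_luts)

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} LUTs: {switches_per_lut(n):6.1f} switches per LUT")
```

Hierarchical interconnect attacks exactly this term: long-distance traffic is lifted onto a higher level of aggregated links instead of widening every channel in the flat array.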
sFPGA2 Is an On-Chip Version
sFPGA2 is a scalable FPGA architecture that uses a hierarchical routing network of high-speed serial links and switches to route multiple nets simultaneously [3]. It consists of two levels:
• Base level (e.g., tiles A0...A7 and switch S0)
• Higher levels (e.g., switch X0)
[Figure] Architecture block diagram.

sFPGA2 Architecture (Cont'd)
• A0...A7 are FPGA tiles, similar to current FPGAs (image courtesy of Xilinx, Virtex-II Pro).
• S0 contains very-high-speed transceivers capable of aggregating multiple high-speed serial links into one very-high-bandwidth link.

sFPGA2 (Cont'd)
Routing is done using either of the two methodologies shown in the figure:
• Intra-cluster routing uses only the switch blocks and channels within that level.
• Inter-cluster routing uses the very-high-speed links and switches.

Design Methodology
[Figure] A dataflow graph (nodes v0...v11 with *, +, < and NOP operations) containing an inter-tile net; dealing with inter-tile nets is the new step in the design flow.

Preliminary Results
• We successfully implemented a JPEG engine and demonstrated transporting groups of nets on an emulation platform built from three Xilinx Virtex-II Pro FPGA boards; serial communication was emulated by MGTs.
• Preliminary studies show that the transport latency is very high, mainly due to high-latency transceivers, which for now limits the application domain to GALS designs. With advances in transceivers, the approach can be extended to purely synchronous designs as well.

Conclusion
• The logic/interconnect unbalance in FPGAs makes optimization of the interconnection network important.
• Significant intra-clock-cycle idleness exists in FPGA routing wires. Time-multiplexing increases resource utilization, and can potentially save area and achieve better timing.
• The current FPGA interconnection network is not scalable. An on-chip network consisting of switches and serial links can improve scalability.
• Promising preliminary results justify our approaches. Future work needs to thoroughly investigate the impact of the architecture changes.

Multi-FPGA or Multi-Core?
[Figure] Two NoC-based chips: one built from FPGA tiles, one from uP (processor) tiles.
1. Building multi-FPGA or multi-core chips will not be difficult with the development of semiconductor technology.
2. We (hardware engineers) know more about programming multi-FPGA systems than about programming multi-core processors.
3. Should we use VHDL/Verilog as the (intermediate) programming language for both multi-FPGA and multi-core?

See also
• VPR v5.0.2, the Versatile Placement & Routing tool for heterogeneous FPGAs: http://www.eecg.utoronto.ca/vpr/
• Predictive Technology Model (PTM): http://www.eas.asu.edu/~ptm