Towards Low-Power yet High-Performance Networks-on-Chip

by

Sunghyun Park

B.S. in Korea Advanced Institute of Science and Technology (2009)
S.M. in Massachusetts Institute of Technology (2011)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

September 2014

© Massachusetts Institute of Technology 2014. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, September 2, 2014

Certified by: Li-Shiuan Peh, Professor of Electrical Engineering and Computer Science, Thesis Supervisor

Certified by: Anantha P. Chandrakasan, Joseph F. and Nancy P. Keithley Professor of Electrical Engineering, Thesis Supervisor

Accepted by: Leslie A. Kolodziejski, Chair, Department Committee on Graduate Students

Towards Low-Power yet High-Performance Networks-on-Chip
by Sunghyun Park

Submitted to the Department of Electrical Engineering and Computer Science on September 2, 2014, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering and Computer Science

Abstract

A network-on-chip (NoC), the de-facto communication backbone in manycore processors, consumes a significant portion of total chip power, competing against the computation cores for the limited power and thermal budget.
On the other hand, overall system performance of manycore chips increasingly relies on on-chip latency and bandwidth as core counts scale. This thesis aims to design low-power yet high-performance NoCs through circuit and microarchitecture co-design, contrary to the traditional approaches where NoCs sacrifice latency and/or bandwidth for low-power operation, and then to demonstrate such design concepts through test chip prototyping, enabling detailed measurements for rigorous analysis of the pros and cons of the proposed NoCs.

The thesis starts with a 4×4 mesh NoC chip prototype that tries to simultaneously optimize energy, latency and throughput for all kinds of traffic (unicasts, multicasts and broadcasts). Its extensive experimental results make it possible to accurately analyze the energy/performance benefits and timing/area overheads of the virtually bypassed, multicast-optimized router design; the energy savings, area overheads and reduced reliability of the clocked low-swing datapath circuits; and the power gap between simulated estimates and measurement results.

Next demonstrated is a link test chip of two clockless low-swing repeater designs: a self-resetting logic repeater (SRLR) optimized for transmission energy and a voltage-locked repeater (VLR) optimized for transmission delay. This second chip prototype shows that the clockless, single-ended low-swing signaling of SRLRs armed with variation-robust circuit techniques has lower energy and smaller area than clocked, differential low-swing signaling. Featuring lower delay than full-swing repeaters, VLRs provide the fundamental building block of the single-cycle reconfigurable NoC, which enables potential power saving at the architecture level through single-cycle multi-hop asynchronous link traversal on dynamically configurable routes.

The last one-third of this thesis explores a 3D-IC chip prototype of a through-silicon via (TSV) interconnect that can support simultaneously bi-directional (SBD) signaling.
While TSVs, as 3D-IC NoC links, offer an appealing solution to manycore architectures that require huge off-die bandwidth, existing TSV technologies impose considerable power and area overheads (using spare TSVs) to improve reliability. The proposed SBD TSV circuit shows better energy efficiency and smaller area than unidirectional TSVs, thus providing reliable 3D signaling within a tight power/silicon budget. Such SBD signaling also enables configurable off-die bandwidth, and hence can be the basis of a bandwidth-adaptive 3D NoC that efficiently supports highly dynamic traffic on manycore chips.

Thesis Supervisor: Li-Shiuan Peh
Title: Professor of Electrical Engineering and Computer Science

Thesis Supervisor: Anantha P. Chandrakasan
Title: Joseph F. and Nancy P. Keithley Professor of Electrical Engineering

To my parents, Doosoo Park and Soonsil Shin

Acknowledgments

The LORD is my shepherd, I shall not be in want. He makes me lie down in green pastures, he leads me beside quiet waters, he restores my soul. He guides me in paths of righteousness for his name’s sake. Even though I walk through the valley of the shadow of death, I will fear no evil, for you are with me; your rod and your staff, they comfort me.
Psalm 23:1-4

First and foremost, I give thanks to God for allowing me to have the best advisors, Professor Li-Shiuan Peh and Professor Anantha Chandrakasan. Their complementary research interests and advising styles have made this possible; while Li-Shiuan’s accurate yet wide-ranging comprehension of Networks-on-Chip (NoCs) has enabled me to freely play in the playground of NoCs without worrying about my wrong assumptions and technical mistakes, Anantha’s sharp insight and extensive experience in low-power digital circuit design has allowed my rough ideas to be well-positioned and shaped in detail.
Indeed, being co-advised by Li-Shiuan and Anantha was the best opportunity that I have ever been given at MIT in that I was able to explore unique research questions between circuit and architecture under their excellent guidance. Even from the viewpoint of humanity, they are truly great mentors. I sincerely thank Anantha and Li-Shiuan for being my advisors.

It is my honor and pleasure to have Professor Srinivas Devadas on my thesis committee. I would like to thank him for his contributions to my PhD work. Indeed, he helped me shape the thesis direction even before joining my thesis committee, through the DARPA Angstrom Project and the Research Qualifying Examination (RQE). His comprehensive system-level view and objective standpoint on my work have motivated me to view my research from other angles. I deeply appreciate his spending time and energy despite his busy schedule.

I would also like to express my appreciation to Professor Vladimir Stojanovic for his feedback and suggestions as a member of my RQE committee. His standpoint on on-chip interconnects (which differs from the angle of my advisors) widened my understanding of scalable Networks-on-Chip. In addition, his distinguished work on the physical model of on-chip wires motivated me to investigate circuit-wire co-design.

I want to extend my appreciation to my MIT colleagues, who are always willing to help me out. While all members of both of my research groups deserve my gratitude, I have to leave a special thanks to the following eight people: Tushar Krishna (NoC architecture discussion), Masood Qazi (variation-robust circuit design and analysis), Owen Chen (NoC architecture discussion), Gilad Yahalom (3D-IC test chip implementation), Arun Paidimarri (chip measurement), Sunghyuk Lee (PCB design), Bhavya Daya (mesh NoC chip comparison) and SungWon Jung (high-frequency clocking circuit design).
I should leave a thanks to our friendly administrative staff for all their help through my PhD years at MIT: Maria Rebelo (CSAIL administration), Margaret Flaherty (MTL administration), Janet Fischer and Alicia Duarte (EECS Department administration).

I am proud to acknowledge the support of the following companies for my research projects: MediaTek (3D-IC test chip fabrication), Samsung (financial support during my entire PhD years) and Freescale (flip-chip packaging). I would like to thank Dr. Alice Wang for her excellent management at MediaTek to enable successful completion of our 3D-IC project. I also thank Mr. Stacy Ho for mercifully taking care of my MediaTek internship at the Woburn site. In particular, I want to express my deepest gratitude to Samsung Scholarship not only for financially supporting my PhD program but also for giving me an opportunity to become a part of their superior community. It was my great honor to serve as Jar-Chi-Wii-Won-Chang at the 2013 Samsung Scholarship Academic Camp in Yosemite National Park.

No words can do justice to express how deeply grateful I am to my family members. I truly appreciate my lovely penguin, Seonghee Nam, for always being with me as my wife and as my best friend. Without her devoted support, I would not have completed my PhD journey. I also thank my adorable little girl, Seohee Park, and my brave little boy, Seungwoo Park, for giving me the strongest motivation to finish my school life. Indeed, their existence itself is a blessing to me every day. I should not forget to thank my sister, Haejin Park, for her trust and encouragement. I am always proud of her career as a professor in a medical school. I should also thank my parents-in-law, I-hyun Nam and Soonae Song, for treating me like their son. Finally, reserving the best for last, I would like to express the most heartfelt gratitude to my parents, Doosoo Park and Soonsil Shin, for their unconditional love and trust.
I thank you, I respect you, I love you, my dad and my mom.

List of Acronyms

3D-IC    3 Dimensional Integrated Circuit
BER    Bit Error Rate
BW    Buffer Writing (in Router Pipeline)
CAD    Computer Aided Design
CMOS    Complementary Metal Oxide Semiconductor
CMP    Chip MultiProcessor
DM    (Pulse) DeModulator
DOR    Dimension Ordered Routing
DRC    Design Rule Check
ECC    Error Correction Code
F2B    Face to Back (Through Silicon Via)
F2F    Face to Face (Through Silicon Via)
FIFO    First In First Out (Buffers)
I/O    Input/Output
IP    Intellectual Property
LA    LookAhead generation (in Router Pipeline)
LT    Link Traversal (in Router Pipeline)
MC    Message Class (of Virtual Channels)
MMS    Multiscale Modeling and Simulation (SoC Application)
MOSFET    Metal Oxide Semiconductor Field Effect Transistor
MPSoC    MultiProcessor System on Chip
mSA    multiple (Crossbar) Switch Allocation
NIC    Network Interface Circuit
NMOS    N-channel MOSFET
NoC    Network on Chip
NRC    Next Route Computation
PDK    Processor Design Kit
PE    Processor Element
PIP    Personal Interest Project (SoC application)
PM    Pulse Modulator
PMOS    P-channel MOSFET
PRBS    Pseudo Random Binary Sequence
PVT    Process, Voltage and Temperature
QoS    Quality of Service
RC    Resistance-Capacitance
RSD    Reduced Swing Driver
Rx    Receiver
SA    (Crossbar) Switch Allocation (in Router Pipeline)
SBD    Simultaneously BiDirectional
Si    Silicon
SMART    Single cycle Multi hop Asynchronous Repeated Traversal
SoC    System on Chip
SOI    Silicon On Insulator
SRLR    Self Resetting Logic Repeater
TSV    Through Silicon Via
Tx    Transmitter
ST    (Crossbar) Switch Traversal (in Router Pipeline)
VA    Virtual channel Allocation (in Router Pipeline)
VC    Virtual Channel
VLR    Voltage Locked Repeater
VOPD    Video Object Plane Decoder
WLAN    Wireless Local Area Network

Contents

1 Introduction
  1.1 Research Motivation
  1.2 Mesh Network-on-Chip (NoC)
  1.3 Rethinking Router Microarchitecture
  1.4 Thesis Contributions and Overview

2 Towards the Theoretical Limits of a Mesh NoC
  2.1 Theoretical Mesh NoC Limits
  2.2 Related Work: Existing Mesh NoC Chips
  2.3 Chip Design and Fabrication
    2.3.1 Towards Theoretical Latency Limits
    2.3.2 Towards Theoretical Throughput Limits
    2.3.3 Towards Theoretical Energy Limits
  2.4 Evaluation
    2.4.1 Latency, Throughput and Energy
    2.4.2 Virtual bypassing
    2.4.3 Low-Swing Signaling
    2.4.4 Power Modeling and Estimation
  2.5 Chapter Summary

3 Low-Swing Datapath for Reconfigurable NoCs
  3.1 Background: Reconfigurable NoCs
  3.2 Introduction: Clockless Low-Swing Repeaters
  3.3 Related Work: Existing Low-Swing Links
  3.4 Self-Resetting Logic Repeater (SRLR)
    3.4.1 SRLR Circuit Design
    3.4.2 Test Chip Fabrication and Measurement
  3.5 Voltage-Locked Repeater (VLR)
    3.5.1 VLR Circuit Design
    3.5.2 Test Chip Fabrication and Measurement
  3.6 Chapter Summary

4 Energy and Area Efficient TSV Signaling for 3D-IC NoCs
  4.1 Chapter Introduction
  4.2 Design Considerations of SBD TSV Links
  4.3 SBD Transmitter Design
    4.3.1 Case 1: EN=0 (no data to be transmitted)
    4.3.2 Case 2: EN=1 and CLK=1 (first half clock cycle)
    4.3.3 Case 3: EN=1 and CLK=0 (next half clock cycle)
  4.4 Rx Design: Switched Dual-Tree Sense Amplifier
    4.4.1 Switched Scheme for Low Sensing Delay
    4.4.2 Dual-Tree Sense Amplifier for Reliable SBD Signaling
  4.5 Prototyping and Testing of TSV Interconnects
    4.5.1 Maximum Data Rate
    4.5.2 Energy Efficiency
    4.5.3 Area Comparison
    4.5.4 Comparison with Other Low-Power TSV Circuits
  4.6 Chapter Summary

5 Conclusions and Future Work
  5.1 Thesis Summary
    5.1.1 Regular Mesh Network in CMPs
    5.1.2 Low-Swing Datapath of Configurable Meshes in SoCs
    5.1.3 Towards Low-Cost 3D Meshes in 3D-ICs
  5.2 Low-Swing Signaling Reliability
  5.3 Future Projects
    5.3.1 Broadcast-Intensive Cache Coherent Protocols
    5.3.2 Error-Tolerant NoCs with Low-Swing Links
    5.3.3 Bandwidth-Adaptive 3D NoCs

List of Figures

1-1 Simplified router microarchitecture for 2D mesh NoCs.
1-2 Detailed router microarchitecture and pipeline of a packet-switched, input-buffered VC NoC.
1-3 Ideal point-to-point interconnect only through a metal wire.
1-4 Repeated interconnect for lower wire delay (starting point of wire sharing).
1-5 Input wire sharing through a demultiplexer.
1-6 Output wire sharing through a multiplexer.
1-7 Input and output wire sharing through a demultiplexer and a multiplexer.
1-8 Input and output wire sharing through a crossbar switch.
1-9 Efficient wire sharing with a SA logic and buffers.
1-10 Packet-switched, input-buffered VC router microarchitecture.
2-1 Latency calculation example for broadcast traffic on a k×k mesh network.
2-2 Broadcast example and overview of the fabricated 4×4 mesh NoC.
2-3 Die photo and design layout of the 4×4 mesh NoC and stand-alone low-swing crossbar switch connected to longer links (1mm and 2mm wires).
2-4 64bits 5×5 tri-state RSD-based matrix crossbar switch and link circuitry.
2-5 Proposed router microarchitecture and pipeline.
2-6 Network performance evaluation with mixed traffic at 1GHz.
2-7 Network performance evaluation with broadcast-only traffic at 1GHz.
2-8 Measured network power reduction at 653Gb/s at 1GHz (A: full-swing unicast network, B: low-swing unicast network, C: low-swing broadcast network without virtual buffer bypassing, D: low-swing broadcast network with virtual buffer bypassing).
2-9 1mm link energy efficiency of full-swing and RSD-based signaling.
2-10 2mm link energy efficiency of full-swing and RSD-based signaling.
2-11 Low-swing signaling trade-off between reliability and energy efficiency.
2-12 Comparison of power estimates with measurements (A: ORION 2.0 simulations, B: Post-layout simulations, C: Measured results).
3-1 Single-cycle reconfigurable NoC [1] with SMART links (red bold lines) where its backbone mesh network is reconfigured at run time.
3-2 10mm SRLR-based link for the mesh-based reconfigurable NoC where the local router-to-router distance is 1mm.
3-3 Proposed SRLR circuit and its simulated waveforms.
3-4 1000-run Monte-Carlo simulation results that show the impact of each variation-robust design technique.
3-5 Process variation robust SRLR circuit with (1) an alternating delay cell design, (2) NMOS-based drivers and (3) an adaptive swing voltage scheme.
3-6 Die photograph of the SRLR test chip in 45nm SOI CMOS that includes an on-chip test circuit and an on-chip clocking circuit.
3-7 1cm link traversal (LT) energy versus bandwidth density.
3-8 Proposed clockless low-swing voltage-locked repeater (VLR) for single-cycle multi-hop link traversal.
3-9 Simulated waveforms at 6.8Gb/s: (a) original input data and (b) VLR’s low-swing signaling at node X.
3-10 1bit 10mm VLR-based on-chip link and its equivalent full-swing link fabricated on the same die as SRLRs in 45nm SOI CMOS.
3-11 SMART NoC performance across SoC applications. Reference: [1].
3-12 SMART NoC power breakdown across SoC applications. Reference: [1].
4-1 Example of hop count reduction through greater spatial locality in 3D-ICs. The reduced hop counts translate into lower interconnect delay and energy.
4-2 Uni-directional TSV signaling versus proposed SBD TSV signaling at the same clock frequency, e.g. 5GHz in this example.
4-3 4 voltage-level SBD signaling with weaker driving strength required (pros) and smaller noise margin between SBD signaling symbols (cons).
4-4 3 voltage-level SBD signaling with bigger noise margin between SBD signaling symbols (pros) and stronger driving strength required (cons).
4-5 Upward die-to-die static current path through a low resistance TSV: bottom die PMOS → micro bump → landing pad → top die NMOS.
4-6 Downward die-to-die static current path through a low resistance TSV: top die PMOS → landing pad → micro bump → bottom die NMOS.
4-7 Proposed SBD TSV Tx circuits: a simple NAND-enabled inverter on a bottom die and a half-clocked driver on a top die.
4-8 Tx circuit connectivity of Case 1 (EN=0). No die-to-die current path is formed when there is no data to be transmitted through a TSV.
4-9 Tx circuit connectivity of Case 2 (EN=1 and CLK=1, first half clock cycle) where a TSV is driven by a bottom driver only, consuming dynamic energy only.
4-10 Tx circuit connectivity of Case 3 (EN=1 and CLK=0, next half clock cycle) where a TSV is driven by a bottom driver and a top driver together, forming a static current path through a TSV for the three voltage-level SBD signaling. The coupling capacitor, which acts as a high-pass filter, compensates the bandwidth loss without adding to inter-die static current.
4-11 TSV voltage transitions of uni-directional TSVs versus our SBD TSV.
4-12 While a floating TSV during the first half clock period also enables 50% lower static die-to-die current, such a floating state incurs bandwidth loss at 00 → 11 bi-directional data transition.
4-13 The coupling capacitor on a top die driver enables shorter switching time when a TSV is driven to VDD/2 by a pull-up PMOS and a pull-down NMOS together during the second half clock period.
4-14 Reduced symbol noise margin of SBD signaling due to process variation. When designing 3D-IC circuits, we should consider die-to-die variation mismatch as well as on-die variation mismatch described in (c).
4-15 Switched dual-tree sense amplifiers for variation-robust SBD signaling and low sensing delay.
4-16 Overall circuit implementation of the proposed TSV SBD signaling. Two types of sense amplifiers, a PMOS-input and an NMOS-input sense amplifier, are switched on and off according to the transmitted data (txIN) for low sensing delay.
4-17 Top die photograph of a 2-tier 3D-IC test chip fabricated with a 28nm Low-Power (LP) CMOS process.
4-18 Bottom die photograph of a 2-tier 3D-IC test chip fabricated with the same process as a top die, 28nm LP CMOS.
4-19 Four types of TSV interconnects implemented in a 3D-IC test chip: two uni-directional TSVs (baseline #1); an inverter-based SBD TSV (baseline #2); a proposed SBD TSV without a coupling capacitor (baseline #3); and a completed design (proposed SBD TSV).
4-20 Measured maximum die-to-die bandwidth comparison at 1.05V between uni-directional TSVs (baseline #1) and proposed SBD TSVs.
4-21 Maximum bi-directional bandwidth of our fabricated F2B TSV interconnects. The proposed SBD signaling can deliver up to 9.1Gb/s/TSV bi-directional data (i.e. 4.55GHz maximum clock frequency) at 1.05V.
4-22 Four bi-directional input data sets for energy comparison.
4-23 Measured TSV interconnect energy efficiency over various input data sets at 9.1Gb/s bi-directional data rate (i.e. 4.55GHz clock frequency) at 1.05V. The proposed SBD signaling circuits consume 10.3-31.1% less energy than uni-directional TSVs.
4-24 Normalized area comparison of the fabricated TSV signaling circuits. While baseline #1 includes two TSV landing pads, other three SBD TSV interconnects have only one TSV landing pad.
5-1 Lower voltage swing enables higher energy efficiency, but results in higher signaling error probability (hence bigger system overheads). This figure is identical to Figure 2-11.

List of Tables

2.1 Theoretical limits of a k×k mesh NoC for unicast and broadcast traffic.
2.2 Comparison of mesh NoC chip prototypes.
2.3 Critical path analysis results.
2.4 Area comparison with full-swing signaling.
3.1 Comparison of silicon-proven low-swing on-chip interconnects.
3.2 Maximum hop counts in a single cycle at high data rate.
3.3 Maximum hop counts in a single cycle at low data rate.
4.1 Comparison of Energy-efficient Face-to-Back TSV Interconnects (CMOS-on-CMOS).

1 Introduction

This thesis challenges the conventional wisdom that NoC design involves trading off latency, bandwidth and energy, leading either to poor performance in low-power NoCs or to high performance with unacceptable network power.

1.1 Research Motivation

Moore’s law scaling and diminishing performance returns of complex uniprocessor chips have led to the advent of manycore systems such as chip multiprocessors (CMPs) and multiprocessor systems-on-chip (MPSoCs). The scalability of these manycore chips relies heavily on the on-chip communication fabric connecting the cores/IPs. An ideal communication fabric would incur only metal-wire delay and energy between the source and destination nodes. However, there is insufficient wiring for dedicated global point-to-point wires between all nodes [2], and hence a network-on-chip (NoC) with routers that multiplex wires across traffic flows is becoming the de-facto communication fabric in manycore chips [3].

These on-chip routers, however, impose substantial power overhead. For instance, 36% and 39% of the entire chip power are consumed by such NoC routers at the peak network throughput in MIT’s Raw [4] and Intel’s TeraFLOPS [5], respectively. Since each chip cannot cross its power wall, this power-hungry network competes against the cores/IPs, leading to a lower power and thermal budget for actual computation work. On the other hand, overall manycore chip performance increasingly depends on NoC performance, such as bandwidth and latency, as the number of on-chip components grows [3, 6, 7]. Therefore, a low-power yet high-performance NoC is sorely needed to allow more cores/IPs to be integrated on one die.

Designing a low-power NoC without loss of network performance is almost always a challenging task. To take a couple of easy examples (other challenges will be discussed in Section 1.3), link drivers with a lower power supply voltage, i.e. simple low-swing links, enable both dynamic and leakage power saving but at the cost of longer wire propagation delay, resulting in longer latency and lower bandwidth in the network.
While a smaller buffer size at a router also makes NoCs energy-efficient, it leads to poor link utilization (hence lower bandwidth). Due to these design challenges, prior NoC chips [4, 5, 8, 9, 10, 11] sacrificed network performance for acceptable NoC power consumption, or endured substantial network power overheads to meet aggressive performance requirements. This thesis seeks to break such conventional trade-offs to pave the way to the low-power yet high-performance NoC.

1.2 Mesh Network-on-Chip (NoC)

A mesh network, which is formed by laying out a regular grid in each dimension and adding routers at the grid intersections, maps readily to the planar layout that current CMOS technology requires (e.g. 2D meshes on a single Si wafer or 3D meshes in vertically-stacked 3D-ICs). Thanks to such planar regularity and scalability, a mesh is the most widely-used NoC topology for high-performance manycore chips [4, 5, 8, 9, 10, 11, 12]. In addition, unlike indirect, multi-stage NoC topologies such as Clos or Butterflies [13], a mesh supports the locality present in many applications, allowing nearby traffic to be transported at lower delay and energy.

Figure 1-1: Simplified router microarchitecture for 2D mesh NoCs.

Figure 1-1 shows a simplified 5-port 2D mesh router microarchitecture composed of four main components: input buffers, control logic, a crossbar switch and links. The input buffers store incoming data until they are sent to the next router. The control logic determines when data proceed through the router pipeline and sets up the crossbar switch. The crossbar switch physically moves data from input ports to output ports, followed by links that forward output port data to the next router. These actions can be pipelined to improve throughput, depending on operating clock frequency, process technology and specific logic implementation.

Let us now look at how a packet-switched NoC works in a manycore processor.
Each core communicates with other cores by sending and receiving messages through a network interface controller (NIC) that connects the core to a router (hence the network). Before a message is injected into the network, it is first segmented into packets that are then divided into fixed-length flits, short for flow-control units. A packet consists of a head flit that contains the destination address, body flits, and a tail flit that indicates the end of a packet. If the amount of information the packet carries is small, single-flit packets are also possible, i.e. where a flit is both the head and the tail flit. Because only the head flit carries the destination information, all flits of a packet must follow the same route through the network.

Virtual channels (VCs), logically-separate input buffers, allow multiple data streams to share a physical channel (link wires) by interleaving flits from different packets. Such decoupled input buffers can be utilized to improve throughput by eliminating head-of-line blocking; to prevent deadlocks in the network without deadlock-free routing algorithms; or to offer quality of service (QoS) for system level optimization [13, 14].

Figure 1-2 shows an example of the packet-switched, input-buffered VC router microarchitecture and pipeline. One distinguishing feature of this design is the router-level multicast capability through multiport switch allocation (mSA), by which multicasts do not require multiple unicast packets to be injected. At the first pipeline stage, flits get buffered (BW) and each input port chooses 1 output port request (mSA-I) with a round-robin logic that guarantees fair and starvation-free arbitration. Since multicast flits can require multiple output ports, the request is a 5b vector. The next router VC is selected (VA) from a free VC queue in this stage, too. These 3 independent operations are executed in parallel without decreasing operating frequency.
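As an illustration of the first-stage input arbitration (mSA-I) described above, the following is a minimal behavioral sketch of a round-robin arbiter that picks one VC request per input port. It is not the circuit used in the thesis: the class name `RoundRobinArbiter`, the helper `requested_ports`, the bit ordering of the 5b request vector, and the rule that the search starts just after the previous winner are all assumptions made for illustration. The six VCs per input port follow the drawing in Figure 1-2.

```python
from dataclasses import dataclass

# Assumed bit order of the 5b output-port request vector (bit 0 = N ... bit 4 = NIC).
PORTS = ("N", "E", "S", "W", "NIC")


def requested_ports(vector: int) -> list:
    """Decode a 5b request vector into port names; multicasts set several bits."""
    return [name for bit, name in enumerate(PORTS) if (vector >> bit) & 1]


@dataclass
class RoundRobinArbiter:
    """Rotating-priority arbiter: the previous winner gets the lowest priority."""
    num_requesters: int
    last_winner: int = -1

    def arbitrate(self, requests):
        """requests[i] is the 5b request vector of VC i (0 means no request).

        Returns the index of the winning VC, or None if nobody requests.
        """
        for offset in range(1, self.num_requesters + 1):
            candidate = (self.last_winner + offset) % self.num_requesters
            if requests[candidate] != 0:
                self.last_winner = candidate
                return candidate
        return None


if __name__ == "__main__":
    arbiter = RoundRobinArbiter(num_requesters=6)  # six VCs per input port
    # VC0 holds a unicast request, VC2 holds a multicast request.
    vc_requests = [0b00001, 0, 0b10110, 0, 0, 0]
    for cycle in range(3):
        winner = arbiter.arbitrate(vc_requests)
        print(cycle, winner, requested_ports(vc_requests[winner]))
```

Because the search always starts one position past the last winner, every requesting VC is reached within one full rotation, which is the fairness and starvation-freedom property the text attributes to the round-robin logic.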
At the second stage, output port requests for the next routers are computed (NRC) for the winners of mSA-I, and concurrently, a matrix arbiter at each output port grants the crossbar ports to the input port requests (mSA-II). Multicast requests get granted multiple output ports. In the third and fourth stages, flits physically traverse a crossbar switch (ST) and a link (LT). It is notable that, out of all these actions, only the last two (ST and LT) actually move the flits toward the destination. Throughout the thesis, we will refer to this design (Figure 1-2) as a state-of-the-art router microarchitecture or a comparison baseline after slight modifications required for fair comparison.

Figure 1-2: Detailed router microarchitecture and pipeline of a packet-switched, input-buffered VC NoC.

1.3 Rethinking Router Microarchitecture

The previous section (as well as most of the existing NoC literature) tried to explain a NoC router microarchitecture by diving directly into the state-of-the-art design followed by a description of individual components. This top-down approach provides a quick way to understand the packet-switched VC router microarchitecture, but often makes it difficult to accurately analyze router overheads due to a significant gap between an ideal interconnect and the packet-switched NoC.
Indeed, design optimization starts with finding the overheads and recognizing whether such overheads are avoidable or not. To explicitly reveal the router overheads, this section takes a bottom-up approach through a step-by-step router building process, rethinking the state-of-the-art NoC router microarchitecture.

As mentioned earlier, the ideal interconnect would be a point-to-point link provided by full network connectivity that delivers the highest possible throughput at the lowest possible latency and energy. Figure 1-3 shows such a point-to-point wire between a source (SRC) and a destination (DST). It should be noted that even the ideal interconnect incurs physical constraints such as metal-wire delay and energy, and accordingly, the first step towards designing low-power, high-performance NoCs should be the optimization of the metal-wire delay and energy. Trade-off wise, a higher supply voltage in link drivers (greater driving strength) enables shorter wire delay, but requires more propagation energy. While a wider wire pitch reduces coupling capacitance between wires (hence lower propagation delay and energy), it leads to lower wire density, resulting in poor bandwidth in the network. Metal-wire links comprise 17-39% of total network power in mesh NoC chips [4, 5, 8], and form the unavoidable portion of network power as link power is a physical constraint. Furthermore, as the wire performance benefit from CMOS scaling does not keep up with the gate performance benefit, link power will increase in percentage relative to control and storage circuitry power as process technology scales down [15, 16].

Figure 1-3: Ideal point-to-point interconnect only through a metal wire.
Figure 1-4: Repeated interconnect for lower wire delay (starting point of wire sharing).
Figure 1-5: Input wire sharing through a demultiplexer.
Figure 1-6: Output wire sharing through a multiplexer.
Figure 1-7: Input and output wire sharing through a demultiplexer and a multiplexer.
Figure 1-8: Input and output wire sharing through a crossbar switch.
Figure 1-9: Efficient wire sharing with a SA logic and buffers.
Figure 1-10: Packet-switched, input-buffered VC router microarchitecture.

When DST is too far from SRC, i.e. an interconnect wire is too long, intermediate drivers in a long wire (known as repeaters) can significantly reduce the propagation delay by converting the quadratic RC delay growth with the wire length into a linear RC delay growth [17, 18]. If a wire is long enough, the extra delay and energy of repeaters are easily offset by the savings on the long wire. Besides, repeaters offer a wire sharing opportunity for free by decoupling a long wire into multiple wire segments. In other words, the repeated interconnect shown in Figure 1-4 has three source and destination pairs (SRC to a repeater, a repeater to DST, and SRC to DST) while the point-to-point interconnect in Figure 1-3 has only one SRC and DST pair.

While a point-to-point connection is always preferable between one SRC and DST pair, full connectivity of all possible SRC and DST pairs is too expensive in terms of global wiring area to be incorporated in manycore chips; building a fully-connected NoC with higher node counts (e.g. 32 nodes) is practically impossible in existing CMOS technology due to insufficient wiring [2]. Thus, wire sharing is inevitable for a scalable on-chip communication fabric. Figure 1-5 shows input wire sharing through a demultiplexer, by which multiple destination nodes (DST1-5) share the wire segment from SRC to the demultiplexer. Similarly, as shown in Figure 1-6, a multiplexer enables output wire sharing where multiple source nodes (SRC1-5) share the tail wire from the multiplexer to DST.
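The repeater argument made earlier in this section (quadratic delay growth becoming linear) can be made concrete with a first-order sketch. It assumes a distributed RC wire with resistance $r$ and capacitance $c$ per unit length, an Elmore-style delay estimate, and a lumped per-repeater delay $t_{\mathrm{rep}}$; these symbols are introduced here for illustration only and do not appear in the thesis.

```latex
% Unrepeated wire of length L (Elmore estimate, driver ignored):
T_{\mathrm{wire}}(L) \approx \tfrac{1}{2}\, r c L^{2}

% Same wire split into N equal segments, each followed by a repeater:
T_{\mathrm{rep}}(L, N) \approx N \left[ \tfrac{1}{2}\, r c \left(\tfrac{L}{N}\right)^{2} + t_{\mathrm{rep}} \right]
                      = \frac{r c L^{2}}{2N} + N\, t_{\mathrm{rep}}

% Minimizing over N:
\frac{\partial T_{\mathrm{rep}}}{\partial N} = 0
\;\Rightarrow\;
N^{*} = L \sqrt{\frac{r c}{2\, t_{\mathrm{rep}}}},
\qquad
T_{\mathrm{rep}}(L, N^{*}) = L \sqrt{2\, r c\, t_{\mathrm{rep}}}
```

Under these assumptions the optimally repeated delay grows linearly in $L$, whereas the unrepeated delay grows as $L^{2}$; for short wires the $N\,t_{\mathrm{rep}}$ term dominates, which is why repeaters only pay off once the wire is long enough.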
The demultiplexer/multiplexer along with drivers optimized for a single segmented wire can be viewed as the most primitive form of NoC, in that wire sharing is the essence of NoCs.

We can also share both input and output wires by combining a demultiplexer and a multiplexer (Figure 1-7). This naive implementation, however, imposes severe bandwidth loss since only one SRC and DST pair can communicate at a time. A crossbar switch, which allows multiple SRCs to be connected to multiple DSTs, can prevent such a bandwidth loss (Figure 1-8). The straightforward implementation of the crossbar switch uses redundant multiplexer arrays to provide all possible connections between multiple SRCs and DSTs [13]. In fact, this is the design that most commercial synthesis CAD tools generate for crossbar switch functionality. While its static nature ensures stable operation, higher counts of SRCs and DSTs cause substantial propagation delay and energy as compared to matrix crossbar switches, whose transistor counts are much lower than those of the mux-based crossbar switch [6]. On the other hand, the matrix crossbar switch requires careful circuit design of the matrix crosspoint switch, which is typically implemented with pass gates, transmission gates, or dynamic tri-state gates. This is because simple pass gates and transmission gates do not work properly in an advanced CMOS process such as silicon-on-insulator (SOI) technology, and dynamic tri-state gates make their output noise-sensitive. The crossbar switch consumes 15-33% of the entire network power in mesh NoC chips [4, 5, 8], and if wire sharing is inevitable for scalable NoCs, then this crossbar power consumption is also unavoidable.

For efficient wire sharing, a crossbar switch requires its allocation logic, as shown in Figure 1-9. In addition to the bandwidth improvement, such allocation logic can support QoS or packet/message fairness, depending on system requirements.
While the switch allocation logic contributes a negligible portion of overall NoC power, its computation delay can add significant packet latency. Trade-off wise, bandwidth-optimized allocation algorithms generally need more computation, thus resulting in longer flit latency.

To further enhance wire sharing efficiency, a crossbar switch can incorporate buffers to house flits when they cannot go forward right away to their destinations due to contention. These buffers have such a substantial impact on network bandwidth [19] that all existing NoC chips include buffers dedicated to their routers [4, 5, 8, 9, 10, 11, 12]. While flits can be buffered on the input ports or output ports, only an input-buffered microarchitecture permits single-ported memories, which are more power- and area-efficient than multi-ported memories [13, 19]. For this reason, most NoC router designs have buffers at the input ports, and as described in Figure 1-2, we also selected the input-buffered router microarchitecture as the baseline in this thesis. However, if the allocation rate of a crossbar switch is faster than the rate of the output wires (links in NoCs), output buffering allows more efficient wire sharing and higher bandwidth, and hence, the output-buffered microarchitecture can be a better choice for some systems. Router buffers consume 22-35% of total NoC power [4, 5, 8], and add buffering delay to flit latency (1 clock cycle in general). Unlike links and a crossbar switch, the buffering power and delay are not unavoidable overheads. In fact, minimizing buffer size and the actual number of buffering events at a given target performance stands at the center of NoC optimization.

Buffer allocation also has a huge impact on the bandwidth of the packet-switched NoC. If each input port allows only one buffer queue, head-of-line blocking can occur, leading to poor link utilization.
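A toy model makes the head-of-line effect visible. The sketch below is an assumption-laden illustration rather than a router model: flits are labelled only by the output port they request, one output ("E") is taken to be congested for the whole window, and the function name `flits_sent` is hypothetical.

```python
from collections import deque


def flits_sent(queues, blocked_output, cycles):
    """Each cycle, forward at most one head-of-queue flit whose output is free."""
    queues = [deque(q) for q in queues]
    sent = []
    for _ in range(cycles):
        for q in queues:
            if q and q[0] != blocked_output:
                sent.append(q.popleft())
                break  # only one flit per cycle leaves the input port
    return sent


# One physical queue: the flit heading East blocks the two flits behind it.
single_queue = flits_sent([["E", "N", "S"]], blocked_output="E", cycles=3)

# Same flits spread over two separate queues: N and S are no longer stuck.
two_queues = flits_sent([["E"], ["N", "S"]], blocked_output="E", cycles=3)
```

With a single queue nothing leaves the port during the three cycles, whereas with two queues the North- and South-bound flits are forwarded while the East-bound flit waits.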
In other words, when a packet at the entrance of such a single queue is blocked, it can stall other packets that are lined up behind the blocked packet even if free buffers are available. Multiple buffer queues can resolve this head-of-line blocking, but assigning several physical queues at each input port is expensive in terms of area and energy [13]. Alternatively, we can split one physical queue into multiple logically separate queues. These logically separate buffer queues can share one physical channel (that’s why they are called virtual channels!), and hence, flits from different packets can be interleaved. Similar to a crossbar switch, these virtual channels (VCs) need their own allocation logic for efficient VC arbitration. Figure 1-10 shows the additional allocation logic for logically separate buffers. In fact, the design in this figure (a 5×5 crossbar switch along with its allocation logic, link drivers, logically separate input buffers and VC allocation logic) is architecturally identical to the 2D mesh router design described in the previous section.

Insights developed through this step-by-step router building process, from the ideal interconnect to the packet-switched VC router, will be the basis of our approaches towards low-power yet high-performance NoC designs.

1.4 Thesis Contributions and Overview

This thesis presents novel low-power NoC designs that depart from the traditional trade-offs between network power and latency/bandwidth performance through circuit and microarchitecture co-design, then proves such design concepts on silicon with a thorough analysis of the chip measurement results. To be specific, the thesis demonstrates three test chip designs: a 4×4 mesh NoC in Chapter 2, clockless low-swing repeaters in Chapter 3 and a 3D through-silicon via (TSV) interconnect in Chapter 4. The mesh NoC chip first optimizes the crossbar switch (Figure 1-8), then co-designs the logic to minimize buffering (Figure 1-10).
The second test chip of clockless low-swing repeaters targets the repeated link (Figure 1-4), while the third 3D-IC chip seeks to develop the TSV point-to-point interconnect (Figure 1-3), whose design constraints totally differ from those of conventional 2D metal wires. An overview of each chip prototype and corresponding chapter is as follows:

• 4×4 Mesh NoC Chip. Chapter 2 explores our first test chip, a mesh network design for chip multiprocessors (CMPs) that aims to simultaneously optimize energy, latency and throughput for unicasts, multicasts and broadcasts. We first define and analyze the theoretical limits of a mesh NoC in latency, throughput and energy, then describe how we approach these limits through a combination of microarchitecture and circuit techniques. Fabricated in 45nm SOI CMOS, the 1.1V 1GHz NoC chip achieves 1-cycle router-and-link latency at each hop and energy-efficient router-level multicast support, delivering 892Gb/s (87.1% of the theoretical bandwidth limit) at 531.4mW for a mixed traffic of unicasts and broadcasts. Armed with detailed measurement results, this chapter compares and analyzes in depth the pros and cons of the proposed mesh NoC design: (1) energy/performance improvement and timing/area penalties of the virtually bypassed, multicast-optimized router design; (2) energy benefits, area overheads and reduced reliability of the clocked low-swing datapath circuits; and (3) a gap between simulated power estimation (ORION 2.0 [20]) and actual power consumption. Here, I would like to acknowledge that the architectural design of this test chip [6] was done by Tushar Krishna, a former PhD student at MIT.

• Clockless Low-Swing Repeaters Chip. Traffic on multiprocessor systems-on-chip (MPSoCs) is highly dynamic, i.e. the traffic varies considerably depending on SoC applications.
To efficiently support such dynamic traffic, reconfigurable NoCs on a flexible network topology like a mesh have been developed [21, 22, 23, 24, 25]. These networks pre-reserve (parts of) the route to match application traffic by making unnecessary routers contention-free. If existing clocked low-swing circuits are applied to the pre-reserved routes, flits will pay needless clocking energy and latency even at the contention-free nodes. To prevent such wastage (and hence maximize low-swing signaling benefits in the reconfigurable NoCs), Chapter 3 proposes two types of clockless low-swing repeaters, self-resetting logic repeaters (SRLRs) and voltage-locked repeaters (VLRs), and analyzes experimental results of the test chip fabricated in 45nm SOI CMOS. Featuring variation-robust circuit techniques, the 0.8V 4.1Gb/s SRLRs enable single-ended low-swing pulses to be asynchronously repeated, and therefore consume less energy than differential, clocked low-swing signaling. On the other hand, the 1.0V 6.8Gb/s VLRs outperform energy-equivalent full-swing repeaters in terms of delay (35% reduction) and bandwidth (23% improvement), enabling single-cycle multi-hop asynchronous link traversal for a single-cycle reconfigurable NoC [1].

• 3D TSV Interconnect Chip. Many multi-threaded applications of CMPs and MPSoCs require heavy off-die bandwidth that cannot be handled by existing off-chip I/Os. While three-dimensional integrated circuits (3D-ICs) offer an appealing solution to such bandwidth-hungry manycore chips, current 3D-IC fabrication technologies inevitably require redundant through-silicon vias (TSVs) for reliable 3D vertical signaling, leading to significant power and area overheads.
To alleviate these 3D signaling overheads (and hence incorporate TSVs as 3D-IC NoC links within a tight power and area budget), Chapter 4 proposes and demonstrates the concept of simultaneously bi-directional (SBD) TSV signaling that can send and receive data at the same time through a single TSV. The proposed SBD interconnect enables area- and power-efficient, variation-robust 3D signaling with a relatively small bandwidth loss (less than 13%). Implemented with a 28nm Low-Power CMOS process and MediaTek TSV technology, our SBD TSV interconnect achieves 10.3-31.1% lower energy and 34.4% less area than two equivalent uni-directional TSVs at a 9.1Gb/s/TSV bi-directional data rate (i.e. 4.55GHz clock frequency) at 1.05V.

2 Towards the Theoretical Limits of a Mesh NoC

This chapter first derives the theoretical mesh NoC bounds, followed by an analysis of the power and performance gap with existing mesh NoC chips. It then presents a chip prototype of the proposed mesh NoC, which tries to eliminate the gap and thus approach the theoretical limits.

2.1 Theoretical Mesh NoC Limits

A mesh topology by itself imposes theoretical limits on latency, throughput and energy (i.e. minimum latency and energy, and maximum throughput). Chapter 2 starts with the derivation of these theoretical bounds of a k × k mesh NoC for two traffic types, unicast and broadcast traffic. In our analysis, each network interface circuit (NIC) injects flits into the network according to a Bernoulli process of rate R, to a random, uniformly distributed destination for unicasts; and from a random, uniformly distributed source to all nodes for broadcasts. All derived bounds are for a complete action: from initiation at the source NIC until the flit is received at all destination NICs. We also make three NoC-level assumptions for our derivation:

1. Perfect Routing.
A router would route all packets with minimal hop-counts, balancing injected packets (termed channel load in our analysis) across multiple routes perfectly, thereby keeping the load on all links optimally balanced.

2. Perfect Flow Control. A router maintains maximum utilization of the links, i.e. a link is never left idle when there is traffic routed across it.

3. Perfect Router Microarchitecture. All flits only incur the delay and energy of the datapath (ST and LT). In other words, a router arbitrates between competing flits and performs crossbar and link traversal all in a single cycle, and does not expend extraneous energy for buffering and control.

Figure 2-1: Latency calculation example for broadcast traffic on a k×k mesh network.

Based on these assumptions, we derive the theoretical limits for unicast and broadcast traffic. For unicasts, we analyze the theoretical limits for latency and throughput using the same technique as in [13]. We then derive the energy limit by multiplying hop count with crossbar and link energy costs. For broadcast traffic, to the best of our knowledge, no prior theoretical analysis exists. Here, we define the time until a flit is received by all destination NICs as the time at which this flit is received by the furthest NIC relative to the source NIC (Figure 2-1). Hence, we derive the theoretical latency limit for received packets by averaging the hop delay from each source NIC to its furthest destination NIC. Throughput wise, we obtain the theoretical limit by analyzing the channel load across the ejection links and bisection links [13], and observe that the maximum throughput for broadcast traffic is limited by the ejection links. This differs from unicast traffic where throughput is always limited by the bisection links.
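The closed-form latency and throughput expressions that Table 2.1 lists can be evaluated directly. The sketch below simply transcribes those expressions; the function names `hop_limits` and `channel_loads` are hypothetical, and the code does not re-derive the bounds.

```python
from fractions import Fraction


def hop_limits(k):
    """Average hop counts (unicast, broadcast) for a k x k mesh, per Table 2.1."""
    unicast = Fraction(2 * (k + 1), 3)
    if k % 2 == 0:
        broadcast = Fraction(3 * k - 1, 2)
    else:
        broadcast = Fraction((k - 1) * (3 * k + 1), 2 * k)
    return unicast, broadcast


def channel_loads(k, R, broadcast=False):
    """Channel load on each (bisection, ejection) link for injection rate R."""
    if broadcast:
        return Fraction(k * k, 4) * R, k * k * R
    return Fraction(k, 4) * R, R


# The throughput limit is set by whichever link type carries the larger load.
uni_bisection, uni_ejection = channel_loads(8, 1)
bc_bisection, bc_ejection = channel_loads(8, 1, broadcast=True)
```

For k = 8 the unicast bisection load (2R) exceeds the ejection load (R), while for broadcasts the ejection load (64R) exceeds the bisection load (16R). At k = 4 the two unicast loads are equal, which is where Table 2.1 switches between its two unicast throughput cases. `hop_limits(8)` and `hop_limits(4)` reproduce the zero-load hop counts quoted for our design in Table 2.2 (6 and 11.5; 3.3 and 5.5).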
As for the theoretical energy limit, intuitively, due to the nature of broadcasting, a broadcast flit needs to visit all k² routers in the network and traverse k² crossbars and the links connecting them. Therefore, the energy limit grows linearly with the number of routers in the network, i.e. quadratically with the mesh dimension k. Table 2.1 summarizes our derivation results.

2.2 Related Work: Existing Mesh NoC Chips

There have been chip prototypes that incorporate mesh NoCs [4, 5, 8, 9, 10, 11, 12] or other heterogeneous NoCs [26, 27, 28] as their on-chip communication fabric. The prototypes range from full manycore processors to stand-alone NoCs. As heterogeneous NoC chips [26, 27, 28] have irregular topologies which make it difficult to characterize them against the theoretical mesh limits, we focus here on the manycore chips with mesh networks in our related work investigation. In particular, three chip prototypes were selected for comparison, each differing significantly with respect to targeted design goals and optimizations: Intel TeraFLOPS, which is the precursor of the Intel IA-32 NoC; Tilera TILE64, which is the successor of the MIT Raw; and SWIFT, a NoC with low-swing signaling. These three chips and their corresponding NoC architectures are described in detail as follows:

• Tilera TILE64 [9] is a multiprocessor consisting of 64 tiles interconnected by five 2D mesh networks, where each tile contains a CPU, cache and a router, fabricated on the TSMC 90nm process and running at a speed of 700 to 866 MHz. Four of the five networks are dynamically routed, each servicing a different type of traffic: user dynamic network (UDN) for user-level messages, I/O dynamic network (IDN) for I/O traffic, memory dynamic network (MDN) for traffic to/from the memory controllers, and tile dynamic network (TDN) for cache-to-cache transfers.
The dynamic networks are packetized and wormhole routed, with a one-cycle pipeline for straight-through traffic and two cycles for turning traffic. The static network is software scheduled, and has a single-cycle pipeline.

• Intel TeraFLOPS [5] has a more complex NoC architecture, but the cores are much simpler than a standard RISC processor. Since simpler cores are more area- and energy-efficient than larger ones, more functional units can be integrated within a single chip’s area and power budget. TeraFLOPS demonstrates the feasibility of an on-chip interconnect operating at 5 GHz and achieving performance in excess of one teraFLOPS while maintaining a power usage of less than 100W. The TeraFLOPS NoC has a five-port, two-lane, five-pipeline-stage router with a double-pumped crossbar used to interconnect the tiles in a 2D mesh network. Each input port is connected to two 16-entry-deep FIFO buffers, one for each lane. A single crossbar for both lanes is double pumped in the fourth pipeline stage using dual-edge triggered flip-flops, allowing the switch to transfer data at both edges of the clock signal.

• SWIFT [29] is a 2×2 standalone NoC research chip demonstrating the practicality of implementing token flow control [30] and low-swing crossbar switches and links. The buffer-bypassed traversal of flits through a reduced-swing datapath is demonstrated to perform at 400 MHz and obtain latency and power reductions of approximately 40 percent each. The token flow control microarchitecture pre-allocates buffers and links in the network by using tokens. Many flits are then able to bypass buffering, improving link utilization and reducing the buffer turnaround time. Dual-voltage-supply differential reduced-swing drivers and sense-amplifier receivers sustain the low-swing signaling necessary to reduce the dynamic power consumption.
We calculated zero-load latency and channel load of these networks for both unicast-only and broadcast-only traffic. Zero-load latency can be obtained by multiplying the average hop-count by the number of pipeline stages to traverse a hop, with serialization latency added on to model pipelining of all flits. In terms of throughput, we computed channel load based on a flit injection rate per core of R, following the methodology of [13]. The results are shown in Table 2.2. It is noted in this table that our proposed NoC, which will be described in the following section, optimizes for broadcast traffic and incurs much lower zero-load latency and channel load compared to all other networks.

TILE64 attempts to optimize for all three metrics by utilizing independent simple networks for different message types. The simple router design, with no virtual channels, improves unicast zero-load latency, but broadcast traffic latency is poor as its lack of multicast support forces the source NIC to create k² − 1 copies of a broadcast flit and send a copy to every destination NIC. This increases channel load by k² − 1 times, causing contention at all routers along the shared route and making it impossible to meet the single-cycle-per-hop target. TILE64’s static partitioning of traffic across 5 networks may also lead to poor throughput when exercised with realistic uniform traffic. A similar effect on broadcast latency and channel load is observed for the TeraFLOPS and SWIFT NoCs, as none of these chip prototypes have multicast support. The SWIFT NoC with a single-cycle pipeline for unicasts performs better on zero-load latency, albeit at a lower operating frequency. The TeraFLOPS NoC has poor zero-load latency in terms of cycles due to a 5-stage pipeline, which is aggravated with broadcasts.

Table 2.1: Theoretical limits of a k×k mesh NoC for unicast and broadcast traffic.
Metric | Unicasts (one-to-one multicasts) | Broadcasts (one-to-all multicasts)
Average hop count (Haverage) | 2(k + 1)/3 | (3k − 1)/2, for k even; (k − 1)(3k + 1)/(2k), for k odd
Channel load on each bisection link (Lbisection) | k×R/4 | k²×R/4
Channel load on each ejection link (Lejection) | R | k²×R
Theoretical latency limit, given by Haverage | 2(k + 1)/3 | (3k − 1)/2, for k even; (k − 1)(3k + 1)/(2k), for k odd
Theoretical throughput limit, given by max{Lbisection, Lejection} | R, for k <= 4; k×R/4, for k > 4 | k²×R
Theoretical energy limit | 2(k + 1)/3×Exbar + Exbar + 2(k + 1)/3×Elink | k²×Exbar + (k² − 1)×Elink

Exbar: energy of crossbar traversal. Elink: energy of link traversal.

Table 2.2: Comparison of mesh NoC chip prototypes.

Metric | Intel TeraFLOPS [5] | Tilera TILE64 [31] | SWIFT [29] | Our work (modeled as 8×8) | Our work (4×4 network)
Network size, process | 8×10, 65nm | 5 networks of 8×8, 90nm | 2×2, 90nm | 4×4, 45nm SOI | 4×4, 45nm SOI
Clock frequency | 5GHz | 750MHz | 225MHz | 1GHz | 1GHz
Power supply | 1.1-1.2V | 1.0V | 1.2V | 1.1V | 1.1V
Power consumption | 97W | 15-22W | 116.5mW | 427.3mW | 427.3mW
Latency metrics (TeraFLOPS, TILE64 and SWIFT modeled as 8×8 networks):
Delay per hop | 1ns | 1.3ns | 8.9-17.8ns | 1-3ns | 1-3ns
Zero-load latency, unicast (cycles) | 30 | 9 | 12 | 6 | 3.3
Zero-load latency, broadcast (cycles) | 120.5 | 77.5 | 86 | 11.5 | 5.5
Throughput metrics (TeraFLOPS, TILE64 and SWIFT modeled as 8×8 networks):
Channel width | 39b | 5×32b | 64b | 64b | 64b
Bisection bandwidth | 1560Gb/s | 937.5Gb/s | 112.5Gb/s | 512Gb/s | 256Gb/s
Channel load, unicast (R: injection rate/core) | 64R | 64R | 64R | 64R | 16R
Channel load, broadcast (R: injection rate/core) | 4096R | 4096R | 4096R | 64R | 16R

2.3 Chip Design and Fabrication

Our mesh NoC chip design starts with the state-of-the-art router microarchitecture (Figure 1-2).
We then add features pushing latency towards the theoretical limit of a single cycle per hop, throughput towards the theoretical limit of maximum channel load and energy towards the theoretical limit of just datapath traversal.

Figure 2-2: Broadcast example and overview of the fabricated 4×4 mesh NoC. (Overview listed in the figure: flit size 64 bits; request packet size 1 flit; response packet size 5 flits; microarchitecture 6 VCs over 2 MCs; no-load router-and-link latency 1 cycle; operating frequency 1GHz; power supply voltage 1.1V; technology 45nm SOI CMOS.)

Figure 2-3: Die photo and design layout of the 4×4 mesh NoC and stand-alone low-swing crossbar switch connected to longer links (1mm and 2mm wires).

In the fabricated network, all routers are connected to network interface circuits (NICs) to generate and receive packets. For realistic traffic, we separately model request and response messages to reflect that most manycore chips today use shared memory architecture and rely on the request and response messages between nodes to maintain data coherence. To avoid message-level deadlocks in such cache-coherent manycore processors, each input port has two message classes (MCs), request and response. The request message class contains 4 VCs, each of which is 1-flit deep, while the response message class contains 2 VCs, each of which is 3-flit deep. All flits follow an XY dimension ordered routing (DOR) and a broadcast flit is replicated only when the XY DOR requires different output ports to minimize the network traffic (XY-tree DOR).
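The XY-tree idea can be sketched as follows. This is a behavioural illustration under assumptions, not the chip's NRC logic: a flit forks along X only in the source row and otherwise just continues along Y, the coordinates and the function name `xy_tree_broadcast` are hypothetical, and the final NIC ejection hop is not counted.

```python
from collections import deque

# Unit steps for the four mesh directions (assumed coordinate convention).
STEP = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def xy_tree_broadcast(k, src):
    """Return ({node: hops}, links_used) for an XY-tree broadcast on a k x k mesh."""
    hops = {src: 0}
    links_used = 0
    frontier = deque([(src, None)])  # (router, direction the flit is travelling)
    while frontier:
        (x, y), heading = frontier.popleft()
        if heading is None:            # source router: fork in all directions
            outputs = ["E", "W", "N", "S"]
        elif heading in ("E", "W"):    # still in the source row: go on and fork to Y
            outputs = [heading, "N", "S"]
        else:                          # already turned into Y: never return to X
            outputs = [heading]
        for d in outputs:
            nxt = (x + STEP[d][0], y + STEP[d][1])
            if 0 <= nxt[0] < k and 0 <= nxt[1] < k:
                hops[nxt] = hops[(x, y)] + 1
                links_used += 1
                frontier.append((nxt, d))
    return hops, links_used


hops, links_used = xy_tree_broadcast(4, (1, 1))
```

In this model every router is reached exactly once, over k² − 1 links in total, which matches the link term of the broadcast energy limit in Table 2.1, and each router sees the flit after a number of hops equal to its Manhattan distance from the source.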
Figure 2-2 describes such a broadcast example together with an overview of our fabricated 4×4 mesh NoC. As shown in the die photo overlaid with its design layout (Figure 2-3), an additional crossbar switch is separately laid out with longer links, 1mm and 2mm wires, to explore the higher data rate performance of our low-swing crossbar switch (the clock frequency of the overall network is limited by the synthesized router logic). The following subsections describe the proposed mesh NoC chip design in detail.

2.3.1 Towards Theoretical Latency Limits

We first push the state-of-the-art design towards the latency bounds by adding two key features: a virtual bypassing microarchitecture to hide delays due to buffering and arbitration [6, 30, 32], and low-swing datapath circuits based on linear-mode drive transistors to achieve single-cycle ST+LT without lowering clock frequency.

• Single-stage pipeline with lookaheads. In pipeline stage 2 of the state-of-the-art design (Figure 1-2), we generate 15b lookahead signals from the results of NRC and mSA-II, and send them to the next router. The lookaheads try to pre-allocate the crossbar switch ahead of the actual flit, thus hiding mSA-II from the router delay. The lookahead takes priority over requests from buffered flits at the next router, and directly enters mSA-II. If the lookahead wins an output port, this pre-allocation allows the following flit to bypass the first two pipeline stages and go into the third stage directly, reducing the router pipeline depth from 4 to 2. It is notable that our active pre-allocation by lookaheads enables incoming flits to bypass routers at all loads, in contrast to a plain approach of bypassing only at low loads when the input queues are empty [33, 34, 35].

• Single-cycle ST+LT with low-swing circuits.
We apply the low-swing signaling technique based on linear-mode drive transistors, which can reduce the charging/discharging delay and dynamic energy when driving capacitive parasitics [36], to the NoC datapath. As will be described later in Section 2.3.3, the proposed low-swing circuits obtain higher current driving ability (i.e. lower linear drive resistance) even at small Vds than the reduced-swing signaling generated by simply lowering the supply voltage, and hence, our low-swing datapath enables single-cycle ST+LT at higher clock frequency. Our chip prototype demonstrates that the proposed low-swing circuits enable up to 5.4GHz single-cycle ST+LT (more details in Section 2.4).

These two optimizations achieve a single-cycle-per-hop delay for unicasts and multicasts, exactly matching the theoretical latency limits. The caveat is that in case of contention for the same output port from multiple lookaheads, one of them will have to be buffered and then forced to go through the 3-stage pipeline. In addition, critical path delay is stretched, which will be analyzed in Section 2.4.2.

2.3.2 Towards Theoretical Throughput Limits

Next, we take two steps towards the throughput bounds for both unicasts and broadcasts: router-level multicast support for bandwidth sharing and single-cycle-per-hop latency for fast buffer reuse.

• Multicast support inside routers. We extend the multicast capability of the state-of-the-art design into our lookahead-based microarchitecture by letting lookaheads perform the multicast switch allocation. This scheme enables one multicast/broadcast flit to be sent from the source NIC, and get routed to all other routers in the network via a tree. The multicast capability allows a broadcast flit to share bandwidth until it requires an explicit fork into different directions.
This dramatically reduces contention compared to the textbook router design [13], where multiple flits would have to be sent as unicasts, which are guaranteed to create contention along the shared routes. We use a dimension ordered XY-tree routing in our design as it is deadlock free, and simplifies the routing algorithm.

• Single-cycle-per-hop latency. The number of buffers/VCs required at every input port to sustain a particular throughput depends upon the buffer/VC turnaround time, i.e. the number of cycles for which the buffer/VC is occupied. This is where our optimizations for latency in Section 2.3.1 come in handy, since they reduce the pipeline depth, thus reducing buffer turnaround time and thereby increasing throughput given the same number of buffers. For our single-cycle pipeline, the turnaround time for buffers/VCs is 3: one cycle for ST+LT to the downstream router, one cycle for the free VC/buffer signal to return from the downstream router (if the flit successfully bypassed), and one cycle for it to be processed and ready to be used for a new flit. We thus choose 4 VCs in the request message class, each 1-flit deep (since request packets in our design are 1 flit long), to satisfy VC turnaround time and sustain high throughput for broadcasts. We chose 2 VCs in our response message class, each 3-flit deep, for the 5-flit response packets. This number was chosen to be less than the turnaround time to shorten the critical path, and reduce the total number of buffers (which increase power consumption). We thus chose a total of 6 VCs per port, with a total of 10 buffers.

2.3.3 Towards Theoretical Energy Limits

Section 2.1 reveals a significant energy gap between the state-of-the-art router energy and the theoretical energy limit (which is just clocking and datapath energy, Exbar and Elink). Such a gap is due to buffering energy (Ebuff), arbitration logic energy (Earb) and silicon leakage energy (Elkg).
Conventionally, these energy overheads are traded off against latency and throughput as follows: fewer buffers reduce Ebuff and Elkg, but stretch latency due to contention and lower throughput. Alternatively, simple routers like wormhole routers [13] reduce Earb and Elkg and increase operating frequency f, but these come at the expense of poorer latency and throughput.

Our proposed NoC first includes multicast support so that even broadcasts and multicasts can approach the theoretical energy limit. Then, it incorporates two new features that permit different trade-offs of latency, throughput and energy. First, our multicast virtual bypassing reduces Ebuff, while improving both latency and throughput. The hidden cost lies in increased Earb and decreased f. As will be shown in Section 2.4.1, the savings in Ebuff outweigh the Earb overheads, and operating frequency can still be in GHz. Second, our chip employs low-swing signaling to reduce dynamic energy in the datapath (Exbar and Elink), which is unavoidable and part of the theoretical energy limit. Our low-swing circuits based on linear-mode drive transistors provide an opportunity to break the conventional trade-offs that achieve dynamic energy savings at the cost of latency and throughput penalties. Indeed, our low-swing datapath optimizes both energy and latency. Its downsides lie in its area overheads and reduced process variation immunity.

Figure 2-4: 64-bit 5×5 tri-state RSD-based matrix crossbar switch and link circuitry.

Figure 2-4 shows the circuit implementation of the low-swing crossbar switch directly connected to links with tri-state reduced-swing drivers (RSDs). This circuit design enables low-swing signaling in crossbar vertical wires and link wires.
The tri-state RSD disconnects horizontal and vertical wires and only drives the corresponding vertical wire and link, thereby providing energy-efficient multicasting capability. With an additional supply voltage (LVDD), the 4-PMOS stacked RSD design generates more reliable low-swing signaling in the presence of wire capacitance and resistance variation than equalized interconnects [37, 38, 39, 40], where low-swing signaling is obtained by wire channel attenuation. A delay cell aligns an input signal (which drives only a 1-bit crossbar) to an enable signal (which drives all 64 of the 1-bit crossbars). It reduces mismatch between charging and discharging time, thus decreasing intersymbol interference (ISI). The 64-bit links are designed with 0.15um-width, 0.30um-space fully shielded differential wires, to eliminate noise coupling from crosstalk and supply voltage variation.

Figure 2-5 shows the detailed router microarchitecture and pipeline of our proposed mesh NoC that incorporates virtual bypassing, a low-swing signaling datapath and router-level multicast support. The following section will closely explore not only its performance and energy benefits but also the concomitant costs such as area overhead, stretched critical path and reduced noise margin.

Figure 2-5: Proposed router microarchitecture and pipeline.

2.4 Evaluation

We first evaluate the measured energy-latency-throughput of our fabricated NoC against that of the baseline design and the theoretical mesh limits defined in Section 2.1. Armed with our chip measurements, we then delve into three specific case studies on virtual bypassing, low-swing signaling, and power modeling and estimation to dissect our design choices.
2.4.1 Latency, Throughput and Energy

We measured average packet latency of our NoC as a function of packet injection rate, with two different traffic patterns: mixed traffic (50% broadcast request, 25% unicast request and 25% unicast response messages) and broadcast-only traffic (100% broadcast request messages), at 1GHz operating frequency. Figure 2-6 and Figure 2-7 show the results along with the baseline performance and theoretical mesh bounds. Here, we chose a more aggressive baseline that has single-cycle ST+LT instead of the separate ST and LT stages described in Section 1.2. Since even the full-swing baseline can support single-cycle ST+LT at 1GHz, this baseline is a fairer model of an equivalent unicast full-swing NoC. Except for the single-cycle ST+LT, the baseline used in this section is identical to Figure 1-2. The theoretical latency limits (cycles/packet) include two extra cycles for NIC-to-router and router-to-NIC traversals, which are indispensable since traffic injects and ejects through the NICs. Theoretical throughput limits are calculated based on received flits, then converted into Gb/s to factor in the 1GHz clock frequency and 64-bit flit size (16×64b×1GHz=1024Gb/s). Simulation results were obtained from pre-layout synthesis with sufficient simulation cycles (10^4 cycles) to make scan-chain warmup (128 cycles) negligible.

Figure 2-6: Network performance evaluation with mixed traffic at 1GHz.

For latency, our design enables 48.7% (mixed traffic) and 55.1% (broadcast-only) reductions before the network saturates as compared to the baseline. To enable precise comparisons, we define the saturation point as the injection rate at which NoC latency reaches 3 times the average no-load latency; most multi-threaded applications run within this range. The low-load latency gap from the theoretical latency limit is 5.7 (6.3) cycles for mixed (broadcast) traffic, i.e.
only 1.03 (1.14) cycles of contention latency per hop for mixed (broadcast) traffic. This can be further improved to 0.04 (0.05) cycles of contention latency per hop (obtained through RTL simulations) by removing the artifact in our chip whereby all NICs had identical pseudo-random generators, which caused contention that lowers the amount of bypassing even at low injection rates.

Figure 2-7: Network performance evaluation with broadcast-only traffic at 1GHz.

Throughput-wise, the fabricated NoC approaches the theoretical limits: 87% (mixed traffic) and 91% (broadcast-only) of the theoretical throughput limits. In addition, our NoC design has 2.1x (mixed traffic) and 2.2x (broadcast-only) higher saturation throughput than the baseline. In other words, the proposed NoC can obtain the same throughput as the baseline with fewer buffers or VCs. The throughput gap between the theoretical mesh and the fabricated chip is due to imperfect arbitration (like all prior chips discussed in Section 2.2, we use separable allocators, mSA-I and mSA-II, to lower complexity) and routing (the dimension-ordered XY routing can lead to imbalance in load).

Figure 2-8 shows the measured power reduction at 653Gb/s broadcast delivery at 1GHz at room temperature. The low-swing signaling enables 48.3% power reduction in the datapath. In addition, the single-cycle multicast capability and virtual bypassing result in 13.9% and 32.2% power reduction in router logic and buffers, respectively. Overall, our chip prototype achieves 38.2% power reduction compared to the baseline.

Figure 2-8: Measured network power reduction at 653Gb/s at 1GHz (A: full-swing unicast network, B: low-swing unicast network, C: low-swing broadcast network without virtual buffer bypassing, D: low-swing broadcast network with virtual buffer bypassing). Each bar is broken down into clocking circuitry, router logic and buffer, and datapath (crossbar + link).

To compare against the theoretical power limit, we performed a post-layout power simulation of a router in the middle of the mesh to further break down data-dependent power from non-data-dependent components like clocking. We then calculate the theoretical power limit to comprise just clocking and a full-swing datapath: 5.6mW/router, at close to zero-load injection rate (3/255). Compared to our NoC power consumption at the same low injection rate (13.2mW/router), our overhead comes largely from VC bookkeeping state (1.9mW/router) and buffers (2.0mW/router), whereas the allocators (0.7mW/router) and additional lookahead signals (0.2mW/router) contribute little additional power. The data-dependent power (e.g. buffers, allocators) is due to our identical PRBS generators at NICs that limited bypassing at low loads and can be removed by virtual bypassing, but the non-data-dependent power (e.g. VC state) will remain. Also, since our chip consumes nontrivial leakage power (76.7mW measured, 18% of overall chip power consumption at 653Gb/s), power gating will help to further close the gap, at the expense of a decrease in operating frequency.

2.4.2 Virtual bypassing

Virtual bypassing of buffering to achieve single-cycle routers has been reported in various forms [6, 30, 32].
The aggressive folding of multiple pipeline stages into a single cycle naturally raises the question of whether that comes at the expense of router frequency f. Our chip is the first prototype to demonstrate a single-cycle virtual bypassed router at GHz frequency, so it remains to quantify how much f is affected. To quantify the timing overhead, we performed critical path analysis on pre- and post-layout netlists of the baseline and our design. Table 2.3 shows such estimates along with the actual measured timing.

Table 2.3: Critical path analysis results.
Pre-layout simulations
  Baseline router design                 549ps
  Our virtual bypassed router design     593ps (1.08x overhead)
Post-layout simulations
  Baseline router design                 658ps
  Our virtual bypassed router design     793ps (1.21x overhead)
Measured critical path
  Our virtual bypassed router design     961ps (1/1.04GHz)

The critical paths of both the baseline and the proposed router occur in the second pipeline stage, where mSA-II is performed. The overhead of lookaheads lengthens the critical path by 8% in pre-layout simulations and 21% in post-layout simulations. It should be pointed out, though, that if the operating frequency is limited by the core rather than the NoC router, which is typically the case, this 21% critical path overhead can be hidden. In the Intel 48-core chip, nominal operation is at 1GHz core and 2GHz router frequencies, allowing any network overhead to be masked [41]. Also notable is the fact that while the critical path of the post-layout simulation is 793ps, the maximum frequency of our chip prototype is 1.04GHz (i.e. the actual critical path is 961ps). This is mainly due to nonideal factors (e.g. a contaminated clock, supply voltage fluctuation, unexpected temperature variations, etc.) whose effects cannot be exactly predicted in the design phase.
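The overhead factors and the measured critical path quoted above follow from simple ratios. As a minimal sketch (not part of the chip design flow), the Python snippet below recomputes them, reading the Table 2.3 entries as picoseconds, the unit consistent with a 1.04GHz maximum frequency. The dictionary and function names are illustrative assumptions.

```python
# Critical-path values from Table 2.3, read as picoseconds (assumption:
# this is the unit consistent with the measured 1.04GHz maximum frequency).
PRE_LAYOUT_PS = {"baseline": 549.0, "bypassed": 593.0}
POST_LAYOUT_PS = {"baseline": 658.0, "bypassed": 793.0}
MEASURED_FMAX_GHZ = 1.04


def overhead(paths_ps):
    """Critical-path ratio of the virtual bypassed router to the baseline."""
    return paths_ps["bypassed"] / paths_ps["baseline"]


def fmax_ghz(critical_path_ps):
    """Maximum clock frequency (GHz) allowed by a critical path in ps."""
    return 1000.0 / critical_path_ps


def period_ps(freq_ghz):
    """Clock period (ps) corresponding to a frequency in GHz."""
    return 1000.0 / freq_ghz


if __name__ == "__main__":
    print(f"pre-layout overhead:  {overhead(PRE_LAYOUT_PS):.2f}x")
    print(f"post-layout overhead: {overhead(POST_LAYOUT_PS):.2f}x")
    print(f"post-layout fmax:     {fmax_ghz(POST_LAYOUT_PS['bypassed']):.2f} GHz")
    print(f"measured period:      {period_ps(MEASURED_FMAX_GHZ):.0f} ps")
```

Under this reading, the post-layout netlist alone would suggest a maximum frequency of roughly 1.26GHz, so the difference from the measured 1.04GHz is the portion attributed to the nonideal factors listed above.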
2.4.3 Low-Swing Signaling

Low-swing signaling has demonstrated substantial energy gains in domains such as off-chip interconnects and SRAMs. However, in NoCs, there are few chip prototypes employing low-swing signaling [29, 26], so a deep understanding of its trade-offs and its applicability to NoCs is hard to obtain. To investigate such effects with longer links (necessary in a manycore processor as cores are much larger than routers) and at higher data rates than the network clock frequency, as mentioned earlier, an identical low-swing crossbar switch with longer link wires (1mm and 2mm) is separately implemented and measured.

Energy savings and 1-cycle ST+LT. The measured energy efficiency shows that the 1mm 300mV-swing tri-state RSD enables 57-61% energy reduction (Figure 2-9) while the 2mm 300mV-swing link shows 65-69% energy reduction (Figure 2-10) when compared with their equivalent 1.1V full-swing link. Since energy benefits of low-swing signaling come from reduction in dynamic power of link metal wires, the longer low-swing link (2mm wires) has higher energy efficiency than the shorter low-swing link (1mm wires). Here, the energy consumption is measured assuming that the lower power supply supports charge-recycling [42]. Experimental results also demonstrate that the tri-state RSD-based crossbar supports single-cycle ST+LT at up to 5.4GHz and 2.6GHz clock frequency with 1mm and 2mm links, respectively.

Figure 2-9: 1mm link energy efficiency of full-swing and RSD-based signaling.

Figure 2-10: 2mm link energy efficiency of full-swing and RSD-based signaling.

Table 2.4: Area comparison with full-swing signaling.
Synthesized full-swing crossbar          26,840um²
Proposed low-swing crossbar              83,200um² (3.1x overhead)
Router with the full-swing crossbar      227,230um²
Router with the low-swing crossbar       318,600um² (1.4x overhead)
The tri-state RSDs enable a reduction in the total amount of charge and delay required for data transitions, thereby resulting in these energy and latency benefits.

Area overheads. Table 2.4 shows the area overhead of our 64-bit 5×5 low-swing crossbar switch against an equivalent full-swing crossbar. The low-swing crossbar has a high area overhead (3.1x) compared to a synthesized full-swing crossbar, as the proposed RSDs employ differential signaling while the full-swing crossbar uses single-ended signaling. Besides, since our low-swing crossbar was carefully laid out due to noise coupling issues, such restricted placement and wiring of tri-state RSDs exacerbate the area overhead. However, at the router level, the relative area overhead goes down to 1.4x, and naturally, it will significantly diminish when compared against an entire tile with a core, cache and router.

Process variation effects. The critical drawback of low-swing signaling is reduced noise margin. In our circuitry, the primary noise source is a sense amplifier offset caused by process variation. While low-swing signaling enables more dynamic energy savings as voltage swing decreases, the process variation effect worsens. Based on 1000-run Monte-Carlo SPICE simulations, we chose 300mV swing for above 3-σ reliability, but the voltage swing can be further decreased by offset compensation circuit techniques [43, 44, 45] at the cost of design complexity. Figure 2-11 shows energy efficiency and link failure probability of the 1mm 5Gb/s tri-state RSD as a function of voltage swing level. These results explicitly reveal the low-swing signaling energy gain trade-off against process variation vulnerability.

2.4.4 Power Modeling and Estimation

Architectural power models such as ORION [20, 46, 47] have been extensively adopted by researchers for early-stage evaluation of research ideas, while RTL-based energy estimates have also been widely used.
With our chip, we can now study the gap between silicon-proven energy and different levels of energy modeling. We compare our chip power measurements with two power estimates obtained from ORION 2.0 [20] and post-layout netlists. The experiments (or simulations) are conducted with 1.1V supply voltage, 1GHz clock frequency and 653Gb/s throughput at room temperature. Figure 2-12 summarizes the results.

Figure 2-11: Low-swing signaling trade-off between reliability and energy efficiency.

Figure 2-12: Comparison of power estimates with measurements (A: ORION 2.0 simulations, B: Post-layout simulations, C: Measured results). With x denoting the measured power of our design and y the measured power of the baseline, the bars read A: 4.8y and 5.3x, B: 1.06y and 1.13x, C: 1y and 1x. Each bar is broken down into clocking circuits, router logic and buffer, and data path (crossbar + link).

ORION 2.0 substantially over-estimates power (4.8-5.3x of measured chip power), but its estimate of relative power reduction between the baseline and our design (32% reduction) is not far from the measurements (38% reduction). The over-estimation arises because the transistor sizes assumed in ORION 2.0 are much larger than the actual sizes in the chip. Thus, while ORION 2.0 can be used for comparison of various system-level optimizations or early-stage design space exploration, its estimates should not be the basis of absolute power budgets.

On the other hand, the post-layout simulation gives us fairly accurate power estimates (6-13% deviation from measurements). Specifically, it slightly under-estimates the power of buffers and arbitration logic but over-estimates clocking and datapath power. Relative power reduction (34%) also matches well with measurements (38%). However, such accurate estimates come at the cost of tremendous simulation time overheads (several days for an entire NoC simulation) because the post-layout simulation calculates its estimates at the transistor level along with parasitic effects.
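The relative reductions quoted for each estimation method can be cross-checked against the scale factors annotated in Figure 2-12. The sketch below assumes that the bars read 4.8y and 5.3x for ORION 2.0, 1.06y and 1.13x for post-layout simulation, and 1y and 1x for measurement (x: measured power of our design, y: measured power of the baseline), and that the measured reduction is the 38.2% reported in Section 2.4.1. The function and variable names are hypothetical.

```python
# Measured power reduction of our design relative to the baseline
# (38.2%, as reported with Figure 2-8), so x / y = 1 - 0.382.
MEASURED_REDUCTION = 0.382

# Assumed reading of Figure 2-12: (baseline estimate in units of y,
# our-design estimate in units of x) for each estimation method.
ESTIMATES = {
    "ORION 2.0": (4.8, 5.3),
    "post-layout": (1.06, 1.13),
    "measured": (1.0, 1.0),
}


def estimated_reduction(baseline_scale, design_scale,
                        measured_reduction=MEASURED_REDUCTION):
    """Relative power reduction that a given method would report."""
    x_over_y = 1.0 - measured_reduction
    return 1.0 - (design_scale * x_over_y) / baseline_scale


if __name__ == "__main__":
    for method, (b_scale, d_scale) in ESTIMATES.items():
        reduction = estimated_reduction(b_scale, d_scale)
        print(f"{method:12s} relative reduction: {100 * reduction:.0f}%")
```

Under these assumptions the computed reductions come out close to the 32%, 34% and 38% quoted in the text, which suggests the absolute scale factors and the relative reductions are mutually consistent.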
Moreover, since the post-layout estimation requires complete extracted netlists, it is difficult to apply to early-stage NoC evaluation.

To sum up, cycle-accurate NoC simulations give us hardly any information about power, area and critical path. ORION 2.0 calculates router power and area estimates on the basis of gate-level technology parameters, but it explores the power and area budget only with fixed router microarchitectures and transistor sizes. RTL-based pre-layout NoC research enables critical path analysis at the standard cell level. It also provides fairly reliable estimates of energy-latency-throughput performance. These pre-layout results, however, do not include any parasitic effects such as supply voltage drop caused by the resistance of the power metal lines, signal attenuation in the RC-dominant link wires, or noise coupling through unexpected capacitance and inductance. Also, RTL code-based simulators cannot explore some research ideas that require custom design, like low-swing signaling techniques. HSPICE/Spectre-based post-layout simulations give NoC designers the most accurate network performance estimates available at the CAD-based research level, at the cost of tremendous simulation time (on the order of days), but there is still a gap between their estimates and silicon-proven results due to nonideal factors such as a contaminated clock, supply voltage fluctuation, unexpected temperature variations and imperfect parasitic extraction on contacts, vias, on-chip copper wires and off-chip bonding wires.

2.5 Chapter Summary

In this chapter, we described our design of a NoC mesh chip that aims to simultaneously approach the theoretical latency, bandwidth and energy limits of a mesh, for all kinds of traffic (unicasts, multicasts and broadcasts). We first derived such
theoretical limits of a mesh NoC for unicasts and broadcasts. This analysis closely guided us in our design, which leverages virtual bypassing to approach the theoretical latency limit of a single cycle per hop for unicasts, multicasts and broadcasts. This, coupled with the speed benefits of low-swing signaling, enabled us to swiftly reuse buffers and approach theoretical throughput without trading off energy or latency. Finally, low-swing signaling applied to the datapath helped us towards the theoretical energy limit. To be more specific, this chapter made the following contributions:

1. It presented a mesh NoC chip prototype that showed 48-55% latency benefits, 2.1-2.2x throughput improvements and 31-38% energy savings as compared with an equivalent baseline NoC described in Section 1.2. To the best of our knowledge, this is the first mesh NoC chip with multicast support.

2. It defined the theoretical mesh limits for unicasts and broadcasts, in terms of latency, throughput and energy. We also characterized several prior chip prototypes' performance relative to these limits.

3. It presented lessons learnt from our prototyping experience:

• Virtual bypassing can enable 1GHz single-cycle router pipelines and 32% buffering energy savings with negligible area overhead (only 5%). It comes at the expense of a 21% increased critical path, though this timing overhead can be masked in multicore processors where cores limit the clock frequency rather than routers. More critically, virtual bypassing does not address non-data-dependent power.

• Low-swing signaling can substantially reduce datapath energy (3.2x less energy in 1mm links compared to a full-swing datapath) as well as realize high-frequency single-cycle traversal per hop (5.4GHz with a 64-bit 5×5 crossbar and 1mm links), but comes with increased process variation vulnerability and area overhead.

• System-level NoC power modeling tools like ORION 2.0 can be far off in absolute accuracy (∼5x of measured chip power) but maintain relative accuracy. Post-layout power simulations are much closer to measured power numbers, but post-layout timing simulations are still off.

3 Low-Swing Datapath for Reconfigurable NoCs

This chapter explores two circuit optimization opportunities that reconfigurable NoC architectures create: self-resetting logic repeaters (SRLRs) improve the energy efficiency of reconfigurable NoCs without affecting the network performance, while voltage-locked repeaters (VLRs) enhance the reconfigurable NoC's performance while sustaining datapath energy efficiency.

3.1 Background: Reconfigurable NoCs

Multiprocessor systems-on-chip (MPSoCs) have integrated more and more general-purpose and application-specific processor elements (PEs) to meet the requirements of progressively compute-intensive applications [48, 49], while such applications have increased and diversified with the proliferation of smart phones [50]. Traffic on manycore MPSoCs that support diverse applications varies substantially depending on the application being executed, and this dynamic on-chip traffic should be delivered with low latency and high bandwidth at low power consumption to satisfy aggressive MPSoC design targets.

Figure 3-1: Single-cycle reconfigurable NoC [1] with SMART links (red bold lines) where its backbone mesh network is reconfigured at run time.

To tackle this design challenge, one approach has been to tailor the NoC topology to match application communication patterns at design time. Star-Ring [26], Octagon [49], Fat Tree [51] and the high-radix crossbar [52] serve as examples of network topologies customized at design time. These NoCs, coupled with equalized on-chip interconnects [37, 38, 39, 40], can achieve a single-cycle transmission between distant PEs.
However, this approach requires knowledge of all applications and their communication graphs at design time to be able to pin these dedicated express links to specific pairs of dedicated cores, and assumes sufficient wiring density to support dedicated links between all communicating cores.

An alternative has been to employ a scalable network topology at design time, such as a mesh connecting a collection of generic PEs (like ARM processors), then reconfigure the network at run time to match application traffic. Since router delays can vary depending on congestion, prior NoC literature [21, 22, 23, 24, 25] has proposed pre-reservation of (parts of) the route to provide predictable and bounded delays. These NoC architectures perform an offline computation of contention-free routes, allowing flits to bypass queues and arbiters at routers where there is no conflict between the routes of different flows. We will refer to this flexible communication fabric as a reconfigurable NoC in this thesis.

3.2 Introduction: Clockless Low-Swing Repeaters

The reconfigurable NoC architectures offer two link optimization opportunities. First, we can further reduce the dynamic energy of the pre-paved routes through clockless low-swing repeaters. The low-swing circuit proposed in Chapter 2, as well as most existing low-swing interconnects [36, 53, 54, 55, 56, 57], requires clocked sense amplifiers at every node in a mesh, so it will have to pay unnecessary clocking energy and delay at the contention-free nodes when embedded into the reconfigurable NoC datapath. In addition, the area overhead and system costs (such as an additional power supply voltage and differential wires) make its adoption in a NoC datapath infeasible. Motivated by these challenges, this chapter presents a self-resetting logic repeater (SRLR) that enables clockless, single-ended low-swing signaling without the extra supply voltage.
Second, we can maximize latency benefits of the pre-reserved routes through Single-cycle Multi-hop Asynchronous Repeated Traversal (SMART) links, enabling flits to potentially incur a single-cycle delay all the way from the source to the destination. Figure 3-1 shows the single-cycle reconfigurable NoC, where a network reconfigures into 3 different topologies for 3 different applications. The single-cycle delay benefits of SMART can be obtained even at high clock frequency (e.g. 2GHz on a 4×4 mesh in 45nm SOI CMOS) by voltage-locked repeaters (VLRs), in which a low-swing technique is optimized for lower transmission delay. In other words, VLRs stretch the maximum distance that full-swing repeated links (or other existing low-swing links) can span in a cycle. Such VLR-based SMART links can lead to energy savings as well as latency reduction when actual MPSoC application traffic is delivered through the network [1].

We will first investigate in Section 3.3 why embedding existing low-swing links into the mesh, which is the backbone of our reconfigurable NoCs, does not sufficiently tackle the challenges of NoC design, then present design details and silicon-proven performance of SRLRs and VLRs in Section 3.4 and Section 3.5, respectively. Finally, Section 3.6 will summarize and conclude our second test chip prototyping of the clockless low-swing repeaters.

3.3 Related Work: Existing Low-Swing Links

As demonstrated through SWIFT [29] and our first test chip, low-swing drivers can be embedded within mesh NoC routers and shown to substantially reduce NoC energy, but such low-swing links and other existing low-swing circuits face key NoC design challenges. First, the area overhead imposed by low-swing drivers is of prime concern, since a NoC shares precious on-die real estate with processor cores, caches, memory controllers, etc.
Second, low-swing signaling comes at the cost of reduced noise margin, which is crucial as packet losses are not tolerated in NoCs. Third, existing low-swing circuits impose a considerable system overhead, such as an additional dedicated power supply voltage or clocking circuitry in an entire NoC datapath, or provide energy-optimal design of only one-to-one signaling, making their adoption in a mesh fabric infeasible. We will next explain in detail why prior circuits may come up short in area, robustness and energy-efficient application to a mesh.

Apart from traditional low-swing circuits, which use a lower supply voltage or inherent threshold voltage drop [15, 36], there have been a number of more sophisticated low-swing circuits proposed, based on linear-mode transistors [29, 58], charge sharing [53, 54, 55], cut-off drivers [15, 56, 57] and channel attenuation [37, 38, 39]. The low-swing drivers exploiting linear-mode transistors [29, 58] are composed of PMOS pullups and pulldowns only (or NMOS pullups and pulldowns only) to obtain lower linear drive resistance even at small Vds. While such designs enable better energy efficiency and higher bandwidth than the traditional low-swing signaling generated by simply lowering power supply voltage, they require differential wiring, clocked sense amplifiers and an additional power supply voltage. In particular, the additional power supply dedicated only to the NoC low-swing datapath can be a system overhead in manycore processor design. The charge sharing-based low-swing drivers [53, 54, 55] limit voltage swing without a second power supply voltage, but they require fixed data patterns for reliable operation, which limits NoC design.
The voltage swing of the cut-off drivers [15, 56, 57] is directly affected by threshold voltage variation of drive transistors, thus requiring complicated receivers to sense and calibrate the threshold voltage variation, resulting in an area overhead.

Equalized on-chip interconnects [37, 38, 39, 40] can generate low-swing signaling by leveraging the inherent channel attenuation of RC-dominant wires and have successfully provided high-bandwidth low-power global links that transmit data through long wires (5-10mm). These long equalized links can be used as point-to-point wires between pairs of cores, but as there is insufficient on-die wiring to support dedicated links between all pairs of cores, equalized links map more readily to indirect, multistage NoC topologies with long global links, and are thus advantageous for MPSoC NoCs tailored at design time, where specific flows can be optimized with equalized interconnects. Besides, such topologies do not leverage application locality, turning all traffic into cross-die global traversals, which leads to high NoC latency and energy overheads. Meshes, on the other hand, are dominated by short local core-to-core links. Adopting equalizers as parallel links in a mesh NoC will lead to considerable area overhead (e.g. the 10mm 1-bit driver of [39] occupies 1760µm²). Yet another way of incorporating long equalized links in meshes is to use them as express links between far-away cores [59, 60]. That increases router port count, though, leading to high NoC area overhead. On top of that, direct transmission on a long global wire makes equalized interconnects vulnerable to wire capacitance/resistance variation and crosstalk coupling noise.

3.4 Self-Resetting Logic Repeater (SRLR)

In this section, we seek to tackle the above-mentioned design challenges of incorporating low-swing signaling in a reconfigurable NoC.
To be specific, this section presents a novel low-swing signaling circuit named a self-resetting logic repeater (SRLR). The proposed repeater has the following features:

• The SRLR enables low-swing signaling to be repeated without a reference clock, and hence, eliminates clocking energy and delay at contention-free nodes of the pre-paved routes in a reconfigurable NoC.

• The SRLR enables single-ended low-swing signaling, consuming less energy than differential low-swing signaling at the same wire density (i.e., the SRLR can have higher wire density at the same energy budget).

• The SRLR achieves low-swing signaling mainly through the inherent wire channel attenuation, so it does not require additional power supplies and works across all data patterns.

• The SRLR enables low-swing signaling to be regenerated with a single repeater length, the wire length of local core-to-core links in a mesh. A single optimized SRLR design can thus be used for energy-efficient signaling between any pair of nodes in a mesh. As a side benefit, the SRLR enables 1-to-N multicasts for free since inherent full-swing signals are available at every intermediate repeater node. This multicast capability is a significant benefit as multicast traffic forms a sizable portion of NoC traffic [6].

• The SRLR incorporates circuit techniques to mitigate global process variation and ensure robustness of single-ended low-swing signaling.

3.4.1 SRLR Circuit Design

Figure 3-2 shows the overall 10mm link with SRLRs located at the end of each 1mm wire segment connecting adjacent routers in a mesh NoC. Typically, embedding repeaters within the crosspoints of a crossbar can lead to increased layout complexity due to the active silicon region in the midst of wires. The SRLR-based datapath, however, averts that by ensuring that the SRLR insertion length is equal to the router-to-router distance in a mesh NoC.
We assume that the local router-to-router distance is 1mm, and accordingly, the SRLR transistors are optimally sized to directly drive the 1mm wire in order to offer low-swing repeated signaling without adding to layout complexity. The only implementation overheads of the proposed low-swing signaling are thus a pulse modulator (PM) and a demodulator (DM) required for pulse-based data communication. With the PMs and DMs at every router, our proposed circuit can send low-swing pulses to a far-away node in a mesh without energy overheads since each SRLR drives only a 1mm wire segment and the low-swing pulses are repeated without clocking. In addition, our SRLR-based datapath provides low-swing 1-to-N multicast capability for free while equalized links [37, 38, 39, 40] offer only 1-to-1 unicasts. For instance, in Figure 3-2, the data sent from the 1st SRLR to the 10th SRLR can be directly sampled at all the intermediate SRLRs. This inherent multicast capability can result in substantial benefits in NoCs that see significant multicast traffic [6].

Figure 3-2: 10mm SRLR-based link for the mesh-based reconfigurable NoC where the local router-to-router distance is 1mm.

Figure 3-3: Proposed SRLR circuit and its simulated waveforms.

Figure 3-3 shows the proposed SRLR circuitry along with its simulation waveforms. When a pulse (whose low swing is obtained by wire channel attenuation) arrives at an input NMOS (M1), the node X is discharged and the output voltage of the SRLR (OUT) becomes high. The node X is charged again when a reset signal comes back through a delay cell, generating another pulse at the output. As a last step, a keeper NMOS (M2) lowers the node X voltage down to VDD-Vth after the pulse is repeated.
The reduced standby voltage at the node X increases the amplification gain of the current-starved inverter (INV) but this standby voltage should stay above the threshold voltage of INV across process variation. Also, the size ratio of M1/M2 should be designed to allow enough SRLR input sensitivity at a given low-swing voltage level. The current-starved inverter (INV) amplifier becomes activated when the enable signal (EN) is high, and this 3-port (IN, OUT and EN) circuit design allows SRLRs to be directly integrated into a matrix crossbar switch.

While single-ended low-swing signaling has higher energy efficiency than differential low-swing signaling, this comes at the expense of global (die-to-die) process variation immunity. To mitigate such variation effects on the proposed on-chip signaling, the SRLR-based link employs three circuit techniques: an alternating delay cell design, an NMOS-based driver and an adaptive swing voltage scheme.

Alternating Delay Cell Design: First, we propose an alternating delay cell design where odd SRLRs and even SRLRs incorporate different delay cells. As shown in Figure 3-5, the SRLR output pulse width ($PW_{out}$) is a function of node X's pulse width ($PW_{x}$), which is mainly given by the delay of the delay cell, and the difference between the rising time ($t_{rising}$) and falling time ($t_{falling}$) of the INV amplifier. At the n-th SRLR, the output pulse width ($PW_{out,n}$) can be expressed as:

\[ PW_{out,n} = PW_{x,n} - D_{rising,n} + D_{falling,n} = PW_{x,n} - (t_{rising,n} - t_{falling,n}). \]

The rising time becomes longer (or shorter) as the input pulse swing gets smaller (or bigger), whereas the falling time experiences little change with the input pulse swing. With a single delay cell design (e.g.
6-buffer whose delay enables the single delay cell design to offer the most reliable repeated signaling in a no-variation simulation environment), this influence of the input pulse swing on the rising time of INV accumulates over several SRLR stages, and hence, the rising time gradually becomes longer (or shorter) at the smaller (or bigger) initial pulse swing caused by the process variation. The increasing (or decreasing) rising time causes a shrinking (or widening) output pulse width, resulting in a transmission failure at the end of the 10mm link. In other words, the output pulse widths obtained from process corner simulations of the single delay cell design are

\[ PW_{out,0} > PW_{out,1} > PW_{out,2} > \dots > PW_{out,10} \quad \text{(bit 1 transmission failure)} \tag{3.1} \]

or

\[ PW_{out,0} < PW_{out,1} < PW_{out,2} < \dots < PW_{out,10} \quad \text{(bit 0 transmission failure)}. \tag{3.2} \]

The proposed alternating delay cell design, on the other hand, enables output pulse widths to increase (or decrease) even with the longer (or shorter) rising time of the INV amplifier through the intentionally increased (or intentionally decreased) delay of the delay cell. The alternating design can still saturate, but because of the non-linearity of the feedback (where larger input pulse width causes even larger change in output pulse width) the alternating design takes more stages to saturate. Therefore, the alternating design improves the probability of correct operation for a fixed link length.

NMOS-based Driver: Global process variation influences the output stage of the SRLR as well. Under a straightforward implementation, an inverter driver at the output exhibits two distinct failure modes. In one mode, a weak PMOS will generate insufficient voltage swing at the input of the following stage. In the other mode, a strong PMOS generates too much voltage swing for a weak NMOS to fully discharge the node at the end of a wire channel prior to the arrival of the next bit.
Accordingly, the worst-case sequence of ‘11110’ will eventually saturate the voltage and prevent transmission of several 1s followed by a 0. The NMOS-based driver in this design supplies both pull-up and pull-down currents through NMOS devices, so the strong PMOS condition no longer applies. The resulting circuit is more robust since it is optimized for only one failure mode at a weak NMOS corner, instead of two distinct failure modes across a weak PMOS or a strong PMOS with weak NMOS.

Adaptive Swing Voltage Scheme: Having a robust NMOS-based circuit also allows the optimization of transmission energy. At a strong NMOS corner, the output pulse swing tends to be excessively high, especially for the lower Vth of the input NMOS (M1) of the next stage. Therefore, the adaptive voltage swing scheme with an on-chip bias current generator (Figure 3-5) tracks the M1 threshold voltage to reduce swing voltage, avoiding the needless waste of energy. In other words, when M1 is fabricated with higher (or lower) threshold voltage than the nominal value, the lower (or higher) Vref is applied to the NMOS-based drivers to increase (or decrease) voltage swing. The bias current, which does not contain any threshold voltage-related terms for the first order analysis [61], is tolerant of process and temperature variations so that Vref is mainly given by the threshold voltage and technology parameters of M1, a primary determinant transistor of the SRLR input sensitivity.

Figure 3-4: 1000-run Monte-Carlo simulation results that show the impact of each variation-robust design technique.

Figure 3-4 shows the error probability obtained from 1000-run Monte-Carlo simulations on different SRLR designs with various swing voltages.
At the voltage swing selected for test chip fabrication, the proposed process variation robust SRLR design achieves about 3.7 times higher process variation immunity than the straightforward SRLR design that incorporates inverter drivers (instead of NMOS-based drivers) and 6-buffer delay cells only (instead of an alternating delay cell design) without the adaptive swing voltage scheme.

Figure 3-5: Process variation robust SRLR circuit with (1) an alternating delay cell design, (2) NMOS-based drivers and (3) an adaptive swing voltage scheme.

3.4.2 Test Chip Fabrication and Measurement

In order to demonstrate the energy efficiency and performance of the proposed low-swing on-chip signaling, a proof-of-concept chip of a 1bit 10mm SRLR-based link (described in Figure 3-2) is implemented using a 45nm SOI CMOS process. Figure 3-6 shows its die photograph overlaid with a design layout where each SRLR occupies 47.9µm² active silicon area. The fabricated link is fed by pseudo-random binary sequence (PRBS) data generated on-chip and a test circuit performs data comparison and error counting.

Figure 3-6: Die photograph of the SRLR test chip in 45nm SOI CMOS that includes an on-chip test circuit and an on-chip clocking circuit.

This on-chip measurement circuit shows that the 1bit 10mm SRLR-based on-chip interconnect can deliver up to 4.1Gb/s data with a bit error rate (BER) of less than $10^{-9}$. Measurement results show that the SRLR-based on-chip signaling achieves 6.83Gb/s/µm bandwidth density at its maximum data rate of 4.1Gb/s, consuming 1.66mW (i.e., 404fJ/bit/cm or 40.4fJ/b/mm) at a power supply voltage of 0.8V. Figure 3-7 shows 10mm link traversal (LT) energy versus bandwidth density characteristics of the SRLR-based link and other silicon-proven on-chip interconnects [58, 37, 38, 39].
Details of the fabricated test link are summarized in Table 3.1 together with the previous works. When analyzing the comparison results in Table 3.1, we should be aware of the following considerations. First, higher bandwidth density (i.e., smaller wire spacing) incurs larger wire coupling capacitance, resulting in higher energy consumption. Thus, the energy consumption of on-chip interconnects should be considered along with their bandwidth density as shown in Figure 3-7. Second, CMOS process scaling does not provide much energy benefit for on-chip signaling circuits since the load capacitance of on-chip interconnects is mostly given by their long wire capacitance (not by the gate capacitance) [16]. Lastly, the low-swing circuits proposed in Chapter 2 require an additional power supply and their energy is evaluated assuming optimistically that the additional power supply has no charge-recycling circuits.

The on-chip bias circuit for the adaptive swing voltage scheme consumes 587µW and it can be shared by all parallel links at a NoC router. When considering a 64bit 10mm link implementation, the bias circuit dissipates just 0.6% of total link power. In order to compare the power consumption and area of our SRLR-based datapath with those of an entire router, we synthesized a typical mesh router (64bits, 5ports, 4VCs, and 16 buffers) in the same process, 45nm SOI CMOS. Extracted simulation results showed that input buffers and control logic consume 38.8mW and 5.2mW respectively, while our low-swing datapath consumes 12.9mW. Area-wise, the SRLR low-swing datapath occupies 18% of the overall router footprint.
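The energy figures quoted in this section follow from dividing measured power by data rate. The short Python sketch below reruns that arithmetic using only the numbers given in the text. The helper names are illustrative, and the bias-share calculation assumes that every lane of the 64bit link draws the measured 1.66mW and that the shared bias circuit is counted once in the total. Because the inputs are rounded, the result lands near (not exactly on) the quoted 404fJ/bit/cm.

```python
# Cross-check of the SRLR link figures quoted in Section 3.4.2.
# Inputs are the measured values reported in the text; helper names are
# illustrative and not part of the thesis.

def energy_per_bit_fj(power_mw: float, data_rate_gbps: float) -> float:
    """Energy per bit in fJ: mW / (Gb/s) gives pJ/bit, times 1000 gives fJ/bit."""
    return power_mw / data_rate_gbps * 1000.0


def bias_power_share(bias_uw: float, lane_mw: float, n_lanes: int) -> float:
    """Fraction of total link power drawn by one bias circuit shared by n_lanes.

    Assumption: each lane consumes lane_mw and the bias power is added once.
    """
    bias_mw = bias_uw / 1000.0
    return bias_mw / (n_lanes * lane_mw + bias_mw)


link_energy = energy_per_bit_fj(1.66, 4.1)   # 10mm (1cm) link at 4.1Gb/s
per_mm = link_energy / 10.0                  # ten 1mm repeated segments
share = bias_power_share(587.0, 1.66, 64)    # 64bit link, one bias circuit

print(f"{link_energy:.1f} fJ/bit/cm, {per_mm:.2f} fJ/bit/mm, bias share {share:.2%}")
```

Under these assumptions the bias share comes out a little under 0.6%, consistent with the rounded figure in the text.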
Table 3.1: Comparison of silicon-proven low-swing on-chip interconnects.

| Design | Signaling Type | Data Rate | Bandwidth Density | Energy for 10mm Link Traversal (LT) | Process Technology |
|---|---|---|---|---|---|
| JSSC'10 [62] | fully differential | 2Gb/s | 1.163Gb/s/µm | 340fJ/bit/cm (repeaterless) | 90nm bulk CMOS |
| JSSC'10 [63] | fully differential | (4Gb/s), 6Gb/s | (2Gb/s/µm), 3Gb/s/µm | (370fJ/bit/cm), 630fJ/bit/cm (repeaterless) | 90nm bulk CMOS |
| ISSCC'10 [64] | fully differential | 4.9Gb/s | 4.375Gb/s/µm | 340 × 2 = 680fJ/bit/cm (2 repeaters) | 90nm bulk CMOS |
| Low-swing circuit in Chapter 2 | fully differential | 5.4Gb/s | 6.0Gb/s/µm | 56.1 × 10 = 561fJ/bit/cm (10 repeaters) | 45nm SOI CMOS |
| SRLRs in Chapter 3 | single-ended | 4.1Gb/s | 6.83Gb/s/µm | 404fJ/bit/cm (10 repeaters) | 45nm SOI CMOS |

Figure 3-7: 1cm link traversal (LT) energy versus bandwidth density.

3.5 Voltage-Locked Repeater (VLR)

A low-swing signaling technique can be utilized for lower transmission delay (instead of lower energy) by reducing the charging and discharging time of interconnects' parasitic capacitance. For such a delay-optimized low-swing circuit, wire driving currents should be maintained even with the reduced voltage swing. It is again noted that regular repeaters with a lower power supply voltage (simple low-swing links) have weaker driving currents, thus leading to longer delay than conventional full-swing repeaters.

This section presents another novel low-swing circuit optimized for lower latency and higher bandwidth in the network, a voltage-locked repeater (VLR). The proposed circuit enables Single-cycle Multi-hop Asynchronous Repeated Traversal (SMART), which forms the basis of the single-cycle reconfigurable NoC [1] explained as our background in Section 3.1. The VLR-based link stretches the maximum distance that a full-swing repeated link can span in a cycle.

3.5.1 VLR Circuit Design

Figure 3-8 shows the proposed VLR circuit design.
Once a logic High signal arrives at node X through a highly resistive link wire and exceeds the threshold voltage of the first inverter amplifier (INV1x), the logic High signal starts to traverse the feedback path. When the feedback signal turns RxN on, the node X voltage is locked at some voltage level, Vhigh, resulting in low-swing signaling along with wire channel attenuation as shown in Figure 3-9. The inverse results when a logic Low signal is applied at node X. The proposed low-swing generation maintains the node X voltage near the threshold voltage of INV1x without a decrease in driving current, and hence, enables lower propagation delay for the next symbol. Since the low-swing voltage level is determined by transistor sizes and link wire impedance (Vhigh is given by link wire resistance, TxP's on-state resistance and RxN's on-state resistance while Vlow is determined by link wire resistance, TxN's on-state resistance and RxP's on-state resistance), careful gate sizing and extracted simulations are required to prevent oscillation and static current through the RxP-to-RxN path in all possible process corners. We thus custom-designed the proposed low-swing repeater circuit and used this block in our NoC generator as a standard cell in the synthesis flow.

In this low-swing circuit design, a delay cell in the feedback path plays a key role in making our single-ended low-swing signaling variation-robust. The delay cell generates transient overshoots at node X, leading to lower repeater propagation delay, and more importantly, larger noise margin. An advantage of this delay cell-included low-swing repeater design is that such delay and reliability benefits can be obtained without a significant energy overhead since the high frequency overshoots are filtered out through long and narrow link wires (i.e. highly resistive and capacitive channels) which act as a low-pass filter.
1000-run Monte Carlo simulations done on the extracted netlist of the fabricated design at 6.8Gb/s data rate and 290mV voltage swing (i.e. Vhigh − Vlow = 290mV) at the end of link wires in a 45nm SOI CMOS process design kit show that the delay cell enables 3.4x lower process variation failure probability at just 11% energy overhead.

While the proposed low-swing repeater does not require clocking power and differential signaling, it has static current paths between two consecutive repeaters, TxP-wire-RxN for logic High and TxN-wire-RxP for logic Low. It should be noted, however, that the static energy is much less than that of a conventional continuous-time comparator since the static current paths include a highly resistive link wire. Also, switching off the enable signal (EN) when the link is not used helps eliminate unnecessary static power dissipation. Thus, as long as TxP and RxN are optimally sized for the target data rate and wire impedance (which is given by the specific CMOS technology, wire pitch and width), the VLR-based links do not pay higher link energy for the lower latency and higher bandwidth.

Figure 3-8: Proposed clockless low-swing voltage-locked repeater (VLR) for single-cycle multi-hop link traversal.

Figure 3-9: Simulated waveforms at 6.8Gb/s: (a) original input data and (b) VLR's low-swing signaling at node X.

3.5.2 Test Chip Fabrication and Measurement

To prove the concept of VLRs that outperform full-swing repeaters in terms of both latency and bandwidth without energy overhead, both a 1b 10mm VLR-based on-chip link and an equivalent full-swing link were fabricated in 45nm SOI CMOS along with SRLRs.
Both VLRs and full-swing repeaters have the same repeater length (1mm), link wire pitch (minimum DRC pitch) and width (minimum DRC width). Figure 3-10 shows the test chip die photo with a design layout.

Figure 3-10: 1bit 10mm VLR-based on-chip link and its equivalent full-swing link fabricated on the same die as SRLRs in 45nm SOI CMOS.

The experimental environment is identical to that of the SRLRs. The pseudo-random binary sequence (PRBS) feeds both VLR-based links and full-swing repeaters, and on-chip test circuits perform input and output data comparison and error counting. Due to the limitation of such an on-chip test environment, we cannot get accurate BER performance. Instead, we can only see whether or not the 10mm link BER is less than $10^{-9}$.

Test chip measurement results verify the VLR performance expected in its design phase. First, VLRs exceed the equivalent full-swing repeaters in terms of bandwidth. The fabricated VLR-based 10mm link achieves the maximum data rate of 6.8Gb/s with 4.14mW power consumption (i.e. 608fJ/b energy efficiency) at the supply voltage of 1.0V, maintaining BER below $10^{-9}$. On the other hand, the equivalent full-swing repeaters can send 5.5Gb/s data at most, with BER below $10^{-9}$, consuming 4.21mW (i.e. 765fJ/b) at the same supply voltage. At the data rate of 5.5Gb/s (the maximum data rate of the full-swing repeaters), VLRs dissipate 3.78mW (i.e. 687fJ/b) at 1.0V supply voltage. Second, latency-wise, experiment results demonstrate that the VLR-based link also surpasses its equivalent full-swing repeaters; the delay of a link with VLRs is around 64ps/mm, whereas the delay of a link with full-swing repeaters is around 100ps/mm.
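The measured VLR and full-swing figures can be cross-checked with a few lines of arithmetic. The Python sketch below uses only the rounded power, data rate and delay values quoted in the text; the function and variable names are illustrative. Percentages derived from these rounded inputs are expected to sit close to, but not necessarily match to the last digit, any summary figures computed from the unrounded measurements.

```python
# Cross-check of the VLR versus full-swing measurements (10mm link, 1.0V supply).
# Inputs are the rounded values quoted in the text; names are illustrative.

def energy_per_bit_fj(power_mw: float, data_rate_gbps: float) -> float:
    """Energy per bit in fJ: mW / (Gb/s) gives pJ/bit, times 1000 gives fJ/bit."""
    return power_mw / data_rate_gbps * 1000.0


def percent_change(new: float, ref: float) -> float:
    """Relative change of new with respect to ref, in percent."""
    return (new - ref) / ref * 100.0


vlr_at_max_rate = energy_per_bit_fj(4.14, 6.8)   # VLR at 6.8Gb/s
fs_at_max_rate = energy_per_bit_fj(4.21, 5.5)    # full-swing at 5.5Gb/s
vlr_at_fs_rate = energy_per_bit_fj(3.78, 5.5)    # VLR slowed to 5.5Gb/s

bandwidth_gain = percent_change(6.8, 5.5)        # maximum data rate
latency_change = percent_change(64.0, 100.0)     # ps/mm, approximate values
energy_change = percent_change(vlr_at_fs_rate, fs_at_max_rate)

print(f"VLR: {vlr_at_max_rate:.1f} fJ/b, full-swing: {fs_at_max_rate:.1f} fJ/b")
print(f"bandwidth {bandwidth_gain:+.1f}%, latency {latency_change:+.1f}%, "
      f"energy at 5.5Gb/s {energy_change:+.1f}%")
```

Comparing energy at the same 5.5Gb/s data rate, rather than at each link's own maximum rate, keeps the comparison like for like.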
In short, the 10mm 10-hop VLR link transmits its PRBS input data with 23.6% higher bandwidth, 35.8% lower latency and 10.4% less energy than its counterpart full-swing interconnect. While the energy benefit varies depending on the input data pattern (a higher transition rate leads to bigger energy savings, hence our PRBS is favorable to VLRs), wire transmission delay is independent of the input data. Therefore, measurements of our chip prototype provide evidence for the latency and bandwidth advantages of VLRs.

As compared to the simulated results, the energy efficiency and maximum data rate of the fabricated VLRs deviate in absolute terms (by up to 18% and 31% in energy efficiency and maximum data rate, respectively) while cycle-wise latency performance exactly matches the simulation results across all data rates. This cycle-wise latency certainty of our simulation environment provides a solid foundation for the next discussion, which is based on such latency simulation results.

Table 3.2: Maximum hop counts in a single cycle at high data rate.

| Data Rate | Full-swing Repeaters | Fabricated VLRs |
|---|---|---|
| 4 Gb/s | 4 (98 fJ/b/mm) | 7 (132 fJ/b/mm) |
| 5 Gb/s | 3 (89 fJ/b/mm) | 6 (107 fJ/b/mm) |
| 5.5 Gb/s | 3 (85 fJ/b/mm) | 5 (96 fJ/b/mm) |

Table 3.3: Maximum hop counts in a single cycle at low data rate.

| Data Rate | Full-swing Repeaters | Re-optimized VLRs |
|---|---|---|
| 1 Gb/s | 13 (103 fJ/b/mm) | 16 (128 fJ/b/mm) |
| 2 Gb/s | 6 (95 fJ/b/mm) | 8 (104 fJ/b/mm) |
| 3 Gb/s | 4 (84 fJ/b/mm) | 6 (87 fJ/b/mm) |

The system-level design goal of VLRs, as Single-cycle Multi-hop Asynchronous Repeated Traversal (SMART) links in the single-cycle reconfigurable NoC [1], is to stretch the maximum hop counts that conventional full-swing repeaters traverse within a single cycle. Unfortunately, our on-chip test environment does not give this information since only the last repeater is connected to the comparison circuit.
In other words, our test chip can only tell us how many cycles are required for 10mm link traversal (LT). Hence, we obtained such information from Spectre circuit simulations. Table 3.2 summarizes these maximum hop counts and the corresponding energy efficiency of the VLR-based 10mm link. In MPSoCs, the maximum clock frequency is usually limited by the computation core rather than the link. We thus re-optimize the transistor sizes and wire spacing of our circuits for a lower clock frequency of 2GHz. The maximum hop counts and energy efficiency of such re-optimized VLRs are shown in Table 3.3. At 2GHz, VLRs can traverse 8 hops in a cycle at 104fJ/b/mm, thus enabling maximum bypassing in SMART across a 4×4 mesh network.

Figure 3-11: SMART NoC performance across SoC applications. Reference: [1].

Figure 3-12: SMART NoC power breakdown across SoC applications. Reference: [1].

Finally, we briefly discuss the network-level latency reduction and power savings of the VLR-based single-cycle reconfigurable NoC (SMART NoC). The architecture design and simulation methodology of the SMART NoC are not the contributions of this thesis, but the network-level evaluation results explicitly show that the VLRs, which hardly lead to link-level energy benefits but enable such a SMART NoC on a 4×4 mesh at 2GHz, can facilitate power savings at the network level as well as network performance enhancement. Chia-Hsin Owen Chen, a PhD student at MIT, evaluated the SMART NoC against two baselines, a typical 3-cycle-per-hop mesh NoC (similar to our thesis baseline design) and the dedicated NoC with 1-cycle dedicated links between all communicating cores tailored to each application. Figure 3-11 and Figure 3-12 show the average network latency and the post-layout power estimates across diverse SoC applications, respectively.
The SMART NoC reduces network power by 2.2x on average as a result of buffer bypassing through the pre-paved routes and clock gating at the no-traffic routers. Details of the SMART NoC design and simulation methodology are presented in [1].

3.6 Chapter Summary

In this chapter, we first introduced a novel NoC architecture, a single-cycle reconfigurable NoC, as the background of our second chip prototyping research. Then, we proposed two clockless, single-ended low-swing repeaters: a self-resetting logic repeater (SRLR) for a reduction in the pre-paved link energy, and a voltage-locked repeater (VLR) for lower propagation delay.

• The SRLR optimized for the router-to-router distance in a mesh NoC (e.g., 1mm in this work) provides scalable on-chip signaling without the increased layout complexity. Since the SRLR enables single-ended low-swing pulses to be repeated without a reference clock, the SRLR-based on-chip signaling achieves higher energy efficiency than differential, clocked low-swing signaling circuits. We also presented circuit techniques to improve process variation immunity of the SRLR-based on-chip signaling.
• The VLR maintains its standby voltage level near the threshold voltage of the first amplifier for higher amplification gain. When the VLR repeats the signal transition, it locks the swing voltage by its feedback path signal. A delay cell on the feedback path generates voltage overshoot only at the receiver end, thus leading to an increase in noise margin without significant energy overheads. This standby voltage and reduced voltage swing enable lower propagation delay at the expense of static power dissipation. The VLRs make single-cycle cross-chip transmission feasible over a 4×4 mesh network at 2GHz, realizing the network-level power savings in the SMART NoC.
4 Energy and Area Efficient TSV Signaling for 3D-IC NoCs

In this chapter, we present a simultaneously bi-directional (SBD) TSV interconnect for energy and area efficient 3D-IC vertical signaling. The proposed SBD signaling enables 10.3-31.1% lower energy and 34.4% less area than equivalent two uni-directional TSVs. Albeit with 12.5% lower maximum data rate, the SBD TSV interconnect functions error free at bi-directional data rates up to 9.1Gb/s/TSV (i.e. 4.55GHz maximum clock frequency).

4.1 Chapter Introduction

Three-dimensional integrated circuits (3D-ICs), in which multiple layers of active electronic components are vertically integrated into a single chip, have emerged as an appealing alternative to planar 2D counterparts. This is mainly for three reasons: First, 3D-IC integration enables fewer hop counts in a NoC by exploiting greater spatial locality as shown in Figure 4-1 (where 10 hops in a 2D NoC can be reduced to 3 hops in a 3D-IC NoC), leading to lower interconnect delay as well as less interconnect energy. Second, 3D-ICs allow different fabrication technologies to be integrated into a single MPSoC design, and this heterogeneous 3D integration makes it possible to integrate the best technology for a specific functionality (e.g. RF analog CMOS, low-power digital CMOS or opto-electronic devices) in a single chip cube. Third, but most importantly, 3D-ICs can offer much higher bandwidth for die-to-die communication, resulting in substantial performance benefits in manycore chips that require heavy processor-to-memory bandwidth.

Figure 4-1: Example of hop count reduction through greater spatial locality in 3D-ICs. The reduced hop counts translate into lower interconnect delay and energy.
Actually, the memory bandwidth is looming as a major performance bottleneck as core counts scale; even with an efficient memory controller design such as Intel's QuickPath Interconnect or AMD's HyperTransport, the off-die bandwidth is still restricted by limited pin counts of a chip package [65].

The heart of 3D-ICs is a through silicon via (TSV)¹ that cuts across thinned silicon substrates to offer inter-die connectivity, e.g. Tezzaron Semiconductor Corporation [66], IBM [67, 68], IMEC [69] and MediaTek TSVs (this work). In general, TSVs can provide two types of 3D vertical signaling depending on die stacking topologies: face-to-face (F2F) and face-to-back (F2B). The F2F TSVs have lower 3D signaling energy and delay due to smaller parasitic lumped capacitance than the F2B TSVs, but only support two-tier 3D-IC integration. On the other hand, the F2B TSVs allow multi-tier 3D-IC integration, and hence, enable scalable 3D-IC chip architectures.

¹ The semiconductor industry has pursued 3D-ICs in many different technologies. Accordingly, the 3D-IC definition is diverse: wire-bonded 3D-ICs, microbump-only 3D-ICs, contactless (capacitive or inductive) buried bump-based 3D-ICs and TSV-based 3D-ICs. In this thesis, we refer to the TSV-based chip cube as 3D-ICs.

Figure 4-2: Uni-directional TSV signaling versus proposed SBD TSV signaling at the same clock frequency, e.g. 5GHz in this example.

The scalable F2B TSVs, however, face other design challenges. First, the F2B TSVs consume higher signaling energy than the F2F TSVs due to higher parasitic lumped capacitance, e.g. about 200fF/TSV in IBM technology [68] or 80-120fF/TSV in IMEC technology [69]. While these F2B TSVs still hold much better energy merits over off-chip interconnects (typically in the order of tens of pF), their parasitic capacitance is 10-20× higher than the F2F TSVs [66, 11]. Second, TSV landing pads² occupy bigger silicon area than F2F TSVs, e.g.
7µm×7µm in Tezzaron TSV technology [66]. Lastly, current TSV fabrication technologies cause non-negligible fault rates, leading to lower yield than standard 2D chip fabrication [70]. For reliable signaling, such low-yield 3D-IC chips require redundant TSVs such as spare TSV arrays [11], twin TSVs [71] or shared spare TSVs [72], thereby further exacerbating the energy and area overheads of the F2B TSV interconnect.

² In face-to-back (F2B) TSV technology, the upper end of TSVs is connected to a poly layer landing pad on a top die while the lower end of TSVs is connected with a top metal micro bump on a bottom die.

Figure 4-3: 4 voltage-level SBD signaling with weaker driving strength required (pros) and smaller noise margin between SBD signaling symbols (cons).

Figure 4-4: 3 voltage-level SBD signaling with bigger noise margin between SBD signaling symbols (pros) and stronger driving strength required (cons).

Figure 4-5: Upward die-to-die static current path through a low resistance TSV: bottom die PMOS → micro bump → landing pad → top die NMOS.

Figure 4-6: Downward die-to-die static current path through a low resistance TSV: top die PMOS → landing pad → micro bump → bottom die NMOS.

In order to tackle these challenges, this chapter proposes a simultaneously bi-directional (SBD) TSV signaling circuit that can transmit and receive data at the same time through a single TSV.
Figure 4-2 shows the concept of our SBD TSV signaling versus uni-directional TSV signaling. The proposed SBD signaling can halve the number of TSVs, and at the same time, lower TSV transmission energy through such reduced TSV counts and reduced swing (ideally half swing, as will be explained in the following design sections). In other words, the SBD TSV signaling can offer 2× higher bandwidth than uni-directional TSVs with the same TSV counts and clock frequency.

4.2 Design Considerations of SBD TSV Links

While there have been reported studies on off-chip (chip-to-chip) SBD signaling circuits [73, 74, 75, 76], this work presents the SBD link design optimized for TSV channel characteristics. Since different channel characteristics lead to different interconnect design requirements, a natural next step is to compare and analyze design considerations of well-known off-chip interconnects and 3D-IC TSV links. Here, we highlight the key differences:

1. TSV signaling does not require accurate impedance matching because TSV transmission delay is generally negligible compared to the clock period. For instance, the IBM TSV interconnect can transmit 6Gb/s/TSV data without any impedance matching circuits [68].
2. TSV interconnects do not need complicated, power-hungry equalization circuits since TSVs' RC constant is much smaller than that of off-chip interconnects due to TSVs' negligible parasitic resistance.
3. SBD TSV signaling circuits require a higher noise margin ratio than off-chip counterparts because the off-chip power supply rail is typically 2.5V or 1.8V while the on-chip rail is around 1.0V.
4. Most importantly, SBD TSV signaling circuits should minimize inter-die static current flowing through a low resistance TSV when TSV voltage is driven to middle-level voltages (e.g. bi-directional data "11" and "01" in Figure 4-3, or "01" and "10" in Figure 4-4) for energy-efficient 3D-IC signaling.
While the static current required for SBD signaling in off-chip interconnects does not dissipate significant power due to their highly resistive channel, SBD TSV signaling can lead to non-negligible static current through a low resistance TSV. This will be explained later in detail with Figure 4-5 and Figure 4-6.

A straightforward implementation of 2-bit SBD signaling is to use four voltage levels between a power supply voltage (the highest voltage in a chip) and a ground voltage (the lowest voltage in a chip), mapping each of them to four possible SBD symbols as shown in Figure 4-3. An alternative is to share one voltage level between two SBD symbols [74], e.g. "01" and "10" in Figure 4-4. There are obvious trade-offs between these two schemes: three voltage-level SBD signaling (Figure 4-4) has higher voltage swing between adjacent SBD symbols than four voltage-level SBD signaling (Figure 4-3), requiring bigger driving strength but offering higher noise margin. Considering that SBD TSV signaling needs a higher noise margin ratio but does not require bigger driving strength through Tx equalization, three voltage-level SBD signaling is well suited to TSV channel characteristics.

Let us investigate the issue of die-to-die static current in the three voltage-level SBD TSV shown in Figure 4-5 and Figure 4-6, where a TSV driver is implemented using a simple repeater (composed of two inverters). When a top die driver transmits logic 0 with a pulled-down NMOS and a bottom die driver transmits logic 1 with a pulled-up PMOS, a static current path is formed from the bottom die PMOS through a TSV to the top die NMOS. Similarly, when a top die driver transmits logic 1 and a bottom die driver transmits logic 0, another static current path forms from the top die to the bottom die. Since the parasitic resistance of TSVs is much smaller than that of off-chip interconnects, die-to-die static current flowing through a TSV
Energy and Area Efficient TSV Signaling for 3D-IC NoCs can incur significant energy overheads. Higher threshold voltage or smaller transistor size of a pull-up PMOS and a pull-down NMOS can reduce the amount of die-to-die static current, but it will lead to bandwidth loss due to reduction in dynamic driving strength. Thus, the SBD TSV interconnect circuit should minimize these energy overheads by eliminating unnecessary die-to-die static current, while maintaining the advantages of TSV signaling such as high bandwidth (e.g. 10Gb/s/TSV bi-directional data rate) and low delay (e.g. single cycle die-to-die transmission at 5GHz clock frequency). The next section will present our SBD TSV transmitter designed for such low-power yet high-performance 3D-IC vertical signaling. 4.3 SBD Transmitter Design Figure 4-7 shows the proposed SBD transmitter (Tx) circuit where a driver design on a top die differs from a driver design on a bottom die. While the bottom die driver is implemented using a simple NAND-enabled inverter, the top die driver includes clocking circuitry (highlighted in blue) and a coupling capacitor (highlighted in red). Since our proposed design does not depend on the top and bottom die circuit symmetry, it can easily apply to heterogeneous 3D-IC integration (e.g. a top die in 45nm CMOS and a bottom die in 28nm CMOS). The following sub-sections will explain how this SBD Tx circuit works and enables energy-efficient TSV signaling without significant reduction in maximum data bandwidth. 4.3.1 Case 1: EN=0 (no data to be transmitted) Figure 4-8 shows circuit connectivity of the proposed SBD Tx when an enable signal is not asserted (EN=0). Clearly, both sides of an idle TSV are connected to ground (GND), i.e. no die-to-die current path is formed through a low resistance TSV, and hence, our SBD Tx circuits do not consume any static power when there is no data to be transmitted through a TSV. It is notable that the enable signal can 104 4.3. 
SBD Transmitter Design Proposed TSV SBD Tx coupling capacitor for bandwidth compensation SBD transmitter on a bottom die SBD transmitter on a top die Figure 4-7: Proposed SBD TSV Tx circuits: a simple NAND-enabled inverter on a bottom die and a half-clocked driver on a top die. be obtained at every clock cycle for free from NoCs since it is an output of a switch allocator, the vital logic of packet-switched NoC routers described in Chapter 1. 4.3.2 Case 2: EN=1 and CLK=1 (first half clock cycle) During the first half clock period (Figure 4-9), the bottom die driver acts as a repeater composed of two inverters. On the other hand, on a top die, there is no connection between an input (IN1) and a TSV because the high state of clock (CLK=1) makes all transistors between IN1 and a TSV turned off. Hence, during the first half clock cycle, a TSV is driven by a bottom die driver only, allowing no static current to flow through a low resistance TSV. The bottom die driver does not need to be over-sized to drive rail-to-rail TSV voltage swing during the first half clock period (half-swing transitions suffice), because another half-swing transition can be 105 Proposed TSV SBD Tx – operation (1/2) Chapter 4. Energy and Area Efficient TSV Signaling for 3D-IC NoCs X1 X1 CLK=0 0 1 X 0 X0 GND at TSV top side (landing pad) GND at TSV bottom side (micro bump) Figure 4-8: Tx circuit connectivity of Case 1 (EN=0). No die-to-die current path is formed when there is no data to be transmitted through a TSV. driven during the next half clock cycle by a top die driver and a bottom die driver together. In other words, during the first half clock cycle, a TSV is driven by a pull-up PMOS (P1) or a pull-down NMOS (N1) on a bottom die regardless of a top die input (IN1), consuming dynamic energy only (i.e. no static energy consumed). Depending on data transition patterns on a bottom die, the dynamic energy is zero (no TSV voltage swing) or the energy required for TSV half voltage swing. 
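The enable and clock conditions of the half-clocked transmitter can be summarized in a small behavioral model. The following is a minimal sketch rather than the fabricated circuit: it treats every driver as ideal, includes the CLK=0 half cycle that Section 4.3.3 describes, and compares against the repeater-only driver of Figures 4-5 and 4-6. The names `TsvDrive`, `half_clocked_tx`, `inverter_tx` and `static_half_cycles` are hypothetical, and `VDD = 1.05` is taken from the measurement conditions reported later in this chapter.

```python
from typing import NamedTuple

VDD = 1.05  # volts; assumed from the measurement conditions in this chapter


class TsvDrive(NamedTuple):
    target_v: float    # level the active driver(s) pull the TSV toward
    static_path: bool  # True when a die-to-die static current path exists


def half_clocked_tx(en: int, clk: int, in1: int, in2: int) -> TsvDrive:
    """Idealized half-clocked SBD Tx (in1: top-die input, in2: bottom-die input)."""
    if en == 0:
        # Case 1: both ends of the idle TSV are tied to GND.
        return TsvDrive(0.0, False)
    if clk == 1:
        # Case 2: top-die pass gates are off, so only the bottom die drives.
        # The rail is the pull direction; a half swing suffices in this phase.
        return TsvDrive(VDD * in2, False)
    # Case 3: both dies drive; unequal inputs settle near VDD/2 via a static path.
    return TsvDrive(VDD * (in1 + in2) / 2, in1 != in2)


def inverter_tx(in1: int, in2: int) -> TsvDrive:
    """Repeater-only SBD Tx (Figures 4-5 and 4-6): both dies always drive."""
    return TsvDrive(VDD * (in1 + in2) / 2, in1 != in2)


def static_half_cycles(in1: int, in2: int) -> tuple[int, int]:
    """Half cycles per clock period with a static path: (inverter, half-clocked)."""
    inverter = 2 * int(inverter_tx(in1, in2).static_path)
    half_clocked = sum(
        int(half_clocked_tx(1, clk, in1, in2).static_path) for clk in (1, 0)
    )
    return inverter, half_clocked
```

For unequal inputs, `static_half_cycles(0, 1)` evaluates to `(2, 1)`: the repeater-only driver holds a static path for the whole clock period, while the half-clocked driver holds it for one half period only. For equal inputs neither design forms a static path.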
4.3.3 Case 3: EN=1 and CLK=0 (next half clock cycle)

When the clock goes low (Figure 4-10), two pass-gate transistors, N2 and P2, are turned on, creating a signaling path between the top die input (IN1) and the TSV landing pad. Now the top die driver also acts as a repeater composed of two inverters (a NAND gate with EN=1 is the first inverter while the second inverter consists of P3 and N3). Since there are no changes on the bottom die, the TSV is driven by the top die driver and the bottom die driver together during the second half clock cycle. When the bi-directional data is (IN1, IN2) = (0, 1) or (1, 0), a die-to-die static current path is formed during the second half clock period to drive the TSV to the middle voltage level (ideally VDD/2). While the straightforward SBD Tx design implemented using just repeaters (Figure 4-5 and Figure 4-6) allows static current to flow during the entire clock period, our half-clocked Tx design has a die-to-die static current path only during half of the clock period (hence, 50% lower static energy consumption).

Figure 4-9: Tx circuit connectivity of Case 2 (EN=1 and CLK=1, first half clock cycle) where a TSV is driven by a bottom driver only, consuming dynamic energy only.

These energy savings come at the cost of maximum bandwidth, because the TSV half voltage swing has to be driven by a pull-up PMOS and a pull-down NMOS together (P1 and N3 together or P3 and N1 together) during the second half clock period. Switching to the middle voltage level by a pull-up PMOS and a pull-down NMOS together is slower than switching by a pull-up PMOS only or a pull-down NMOS only, because some portion of the switching current always flows through the static current path formed by the pull-up PMOS and the pull-down NMOS, making no contribution to the TSV voltage swing.

Figure 4-10: Tx circuit connectivity of Case 3 (EN=1 and CLK=0, next half clock cycle) where a TSV is driven by a bottom driver and a top driver together, forming a static current path through a TSV for the three voltage-level SBD signaling. The coupling capacitor, which acts as a high-pass filter, compensates the bandwidth loss without adding to inter-die static current.

To compensate for this bandwidth loss, our top die driver has one more signaling path between the input (IN1) and the TSV through a coupling capacitor (highlighted in red in Figure 4-10). This additional signaling path adds switching current during the second half clock period and hence reduces the switching time required for the half voltage swing. In other words, it increases the maximum clock frequency by adding switching current during high-to-low clock transitions. An advantage of this design is that switching current through the coupling capacitor flows only during such clock transitions (i.e. the coupling capacitor acts as a high-pass filter), and hence it can increase bandwidth without adding to inter-die static current. This coupling capacitor also generates dynamic current flow during the first half clock period when the bottom die driver makes a TSV voltage transition.
However, due to the large lumped capacitance of the TSV, the TSV voltage transition during the first half clock period (at the clock transition from low to high) is less sharp than the voltage transition at the other end of the coupling capacitor during the second half clock period (at the clock transition from high to low). Since sharper transitions drive higher dynamic current through a coupling capacitor at a given time, the switching current flowing through the coupling capacitor during the first half clock period is less than 10% of the switching current during the second half clock period, resulting in little impact on switching time during the first half clock period. However, since the bandwidth bottleneck of our proposed Tx circuit is the second half clock period, when the TSV voltage is driven to VDD/2 by a pull-up PMOS and a pull-down NMOS together, we do not need a large coupling capacitor to reduce the switching time of the first half clock period.

To sum up, the underlying design philosophy of our proposed SBD Tx is that while die-to-die static current through a TSV is inevitable in multiple voltage-level SBD signaling for keeping the TSV voltage at middle voltage levels (e.g. VDD/2), such static current does not need to flow over an entire symbol period (one clock cycle in our design) if the TSV voltage transition time is shorter than an entire symbol period. In our transmitter circuit design, the half-clocked driver on a top die allows static current to flow through a TSV during the second half clock period only (half of the symbol period), and the coupling capacitor enables a shorter TSV voltage transition time by creating a dynamic current path at negative clock edges.

Figure 4-11: TSV voltage transitions of uni-directional TSVs versus our SBD TSV. [Three panels, each plotting TSV voltage between GND and VDD over the first (CLK=1) and second (CLK=0) half periods for two uni-directional TSVs and for the proposed SBD TSV: 00 to 11 bi-directional data transition (0 to 1 on a bottom die and 0 to 1 on a top die); 00 to 10 (0 to 1 on a bottom die and 0 to 0 on a top die); 00 to 01 (0 to 0 on a bottom die and 0 to 1 on a top die).]

Figure 4-12: While a floating TSV during the first half clock period also enables 50% lower static die-to-die current, such a floating state incurs bandwidth loss at 00 → 11 bi-directional data transition.

Figure 4-13: The coupling capacitor on a top die driver enables shorter switching time when a TSV is driven to VDD/2 by a pull-up PMOS and a pull-down NMOS together during the second half clock period.

4.4 Rx Design: Switched Dual-Tree Sense Amplifier

For our SBD TSV links to be incorporated into a typical NoC router microarchitecture³, their receiver (Rx) design has two key requirements: low sensing delay and error-free SBD signaling under die-to-die (top die to bottom die) process variations.

³ The 2D mesh 5-port NoC router microarchitecture (Figure 1-1) can naturally be extended to any 3D-IC NoC router microarchitecture by adding more input ports, e.g. a 3D mesh 6-port router microarchitecture for 2-tier 3D-ICs or a 3D mesh 7-port router microarchitecture for many-tier 3D-ICs.

4.4.1 Switched Scheme for Low Sensing Delay

While a pipelined NoC router microarchitecture generally requires single-cycle link traversal (LT), our SBD signaling needs extra time to sense the TSV voltage and convert it into a full-swing logic level. To minimize this extra time and hence make our SBD TSV interconnects usable as links of a pipelined NoC router, an SBD Rx circuit should be designed for low sensing delay. In terms of small-signal sensing delay, an NMOS-input sense amplifier is optimized for higher input common mode voltage while a PMOS-input sense amplifier is ideally suited to lower common mode voltage. Thus, a possible Rx circuit implementation is to switch between these two sense amplifiers according to the TSV common mode voltage, in a similar way to the reconfigurable sensing network proposed in [77]. In our three voltage-level SBD signaling, the TSV common mode voltage is determined by the transmitted data as described in Figure 4-4: when the transmitted data is 0, the SBD TSV voltage should be GND or VDD/2 (i.e. lower common mode voltage). On the other hand, the SBD TSV voltage should be VDD or VDD/2 (i.e. higher common mode voltage) if the transmitted data is 1. Therefore, we can switch an NMOS-input sense amplifier and a PMOS-input sense amplifier on and off depending on the transmitted data.

Figure 4-16 shows such a switched Rx scheme. When the transmitted data (txIN) is logic 1, the NMOS-input sense amplifier is activated to speed up the sensing operation at the higher common mode voltage while the PMOS-input sense amplifier is turned off to save energy. Similarly, when the transmitted data is logic 0 (i.e. our SBD Rx senses the voltage difference between VDD/2 and GND), the PMOS-input amplifier is switched on for low sensing delay at the lower common mode voltage whereas the NMOS-input sense amplifier is turned off. Measurement results showed that this switched Rx scheme, which was fabricated in a 28nm Low-Power (LP) CMOS process, enabled single-cycle SBD TSV signaling at high clock frequencies, up to 4.55GHz, at 1.05V.

4.4.2 Dual-Tree Sense Amplifier for Reliable SBD Signaling

The widely used sense amplifier designs [77, 78] can incur signaling reliability issues when applied to our SBD Rx circuit; using a reference voltage between VDD and VDD/2 (or GND and VDD/2) as the other input of the sense amplifier reduces the sensing noise margin to half of the SBD symbol noise margin (i.e. VDD/4). Across all possible process variations, the sensing noise margin of VDD/4 will be further reduced. As shown in Figure 4-14, a strong PMOS and a weak NMOS increase the voltage level for SBD symbols 01 and 10 ($V_M$), making the noise margin between VDD and $V_M$ less than VDD/2. Similarly, a weak PMOS and a strong NMOS decrease $V_M$, making the noise margin between $V_M$ and GND less than VDD/2. In particular, since 3D-ICs are in general vertically integrated at the wafer level, we have to consider different die-to-die variation corners between different wafers.
Accordingly, our three voltage-level SBD signaling can have different voltage levels for symbol 01 ($V_{M01}$) and symbol 10 ($V_{M10}$) as shown in Figure 4-14 (c), if both the PMOS and the NMOS on a top die were fabricated stronger than typical transistors while both the PMOS and the NMOS on a bottom die were fabricated weaker than typical transistors.

Figure 4-14: Reduced symbol noise margin of SBD signaling due to process variation. When designing 3D-IC circuits, we should consider die-to-die variation mismatch as well as on-die variation mismatch described in (c).

To increase the sensing noise margin of SBD Rx circuits, we present dual-tree sense amplifiers. Figure 4-15 shows an NMOS-input dual-tree sense amplifier (a) and a PMOS-input dual-tree sense amplifier (b). In the dual-tree sense amplifier design, the tail current difference of the cross-coupled inverters is the result of directly comparing the TSV voltage (rxIN) with VDD/2 and VDD (or GND). Accordingly, its sensing noise margin is equal to the SBD symbol noise margin (i.e. VDD/2), whereas the traditional sense amplifier designs [77, 78] have half of the SBD symbol noise margin (i.e. VDD/4) as their sensing noise margin. Simulated with a 28nm LP CMOS process design kit (PDK), our dual-tree sense amplifiers achieved error-free SBD TSV signaling across all possible process corners, whereas traditional sense amplifier designs [77, 78] (whose transistors were sized equivalently to those of the dual-tree sense amplifiers) incurred Rx errors at the three corner cases discussed in Figure 4-14. This is mainly because the input offset of the sense amplifiers, which is caused by on-die variations, is bigger than the reduced sensing noise margin at such corner cases.

Figure 4-15: Switched dual-tree sense amplifiers for variation-robust SBD signaling and low sensing delay.

Figure 4-16 describes the overall circuit implementation of our proposed TSV signaling. Depending on the transmitted data (txIN), two types of dual-tree sense amplifiers (a PMOS-input and an NMOS-input dual-tree sense amplifier) are switched on and off, and a final output (rxOUT) is then selected by a simple 2:1 multiplexer. The SBD Tx design on a bottom die is a NAND-enabled inverter while a top die has a half-clocked driver as its SBD Tx.

Figure 4-16: Overall circuit implementation of the proposed TSV SBD signaling. Two types of sense amplifiers, a PMOS-input and an NMOS-input sense amplifier, are switched on and off according to the transmitted data (txIN) for low sensing delay.

4.5 Prototyping and Testing of TSV Interconnects

Our 3D-IC chip prototype was implemented using MediaTek face-to-back (F2B) TSV technology and a 28nm Low-Power (LP) CMOS process; we first fabricated two different CMOS wafers (one for the top die design and another for the bottom die design), then integrated the two wafers into a single 3D-IC chip using the F2B TSV technology. While the test chip in this work is a 2-tier 3D-IC, the proposed F2B TSV signaling circuits can be utilized as repeaters in any multiple-layered 3D-ICs. Figure 4-17 and Figure 4-18 show a top die photo and a bottom die photo of our 2-tier 3D-IC chip prototype, respectively. To fairly compare the energy consumption, maximum data rate and occupied area of the proposed SBD TSV signaling versus uni-directional TSV signaling, both interconnect circuits (proposed SBD TSV and baseline #1 in Figure 4-19) were included in the test chip.
In addition, to see the impact of a half-clocked SBD Tx and a coupling capacitor, two other TSV interconnects (baseline #2 and baseline #3 in Figure 4-19) were also implemented in our chip prototype. While baseline #2 employs a straightforward SBD Tx implementation using inverters, baseline #3 incorporates the same Tx circuits as our proposed design except for the coupling capacitor.

Figure 4-17: Top die photograph of a 2-tier 3D-IC test chip fabricated with a 28nm Low-Power (LP) CMOS process. [Labeled: poly layer landing pads (TSV top side).]

Figure 4-18: Bottom die photograph of a 2-tier 3D-IC test chip fabricated with the same process as a top die, 28nm LP CMOS. [Labeled: top metal micro bumps (TSV bottom side).]

Figure 4-19: Four types of TSV interconnects implemented in a 3D-IC test chip: two uni-directional TSVs (baseline #1); an inverter-based SBD TSV (baseline #2); a proposed SBD TSV without a coupling capacitor (baseline #3); and a completed design (proposed SBD TSV).

4.5.1 Maximum Data Rate

We first measured the maximum data rate of the fabricated F2B TSV signaling circuits. Each TSV end was fed by two pseudorandom binary sequences generated on-chip using linear feedback shift registers, PRBS7 ($x^7 + x^6 + 1$) and PRBS31 ($x^{31} + x^{28} + 1$), which are industry-standard patterns for off-chip link tests. Accordingly, for 2-bit bi-directional signaling (one bit from a top die to a bottom die and another bit from a bottom die to a top die), each interconnect circuit was tested with four possible input vectors: PRBS7⇌PRBS7, PRBS7⇌PRBS31, PRBS31⇌PRBS7 and PRBS31⇌PRBS31. An on-chip test circuit then compared input and output data and counted errors. All experiments were performed at 1.05V, the typical power supply voltage of our 28nm LP CMOS process.

Figure 4-20: Measured maximum die-to-die bandwidth comparison at 1.05V between uni-directional TSVs (baseline #1) and proposed SBD TSVs. [(a) Half TSV counts: two uni-directional TSVs at 5.2Gb/s each versus one proposed SBD TSV at 9.1Gb/s, a 12.5% decrease. (b) Same TSV counts: two uni-directional TSVs at 5.2Gb/s each versus two proposed SBD TSVs at 9.1Gb/s each, a 75% increase.]

Measurement results showed that two uni-directional TSVs (baseline #1) achieved a maximum data rate of 10.4Gb/s whereas our proposed SBD signaling circuits attained a maximum data rate of 9.1Gb/s through a single TSV (i.e. 12.5% lower maximum bandwidth than two uni-directional TSVs). In other words, the proposed SBD TSVs can send (100−12.5)×2−100 = 75% more data than uni-directional TSVs with the same TSV counts. Figure 4-20 describes such trade-offs between die-to-die bandwidth and TSV counts.

Figure 4-21: Maximum bi-directional bandwidth of our fabricated F2B TSV interconnects. The proposed SBD signaling can deliver up to 9.1Gb/s/TSV bi-directional data (i.e. 4.55GHz maximum clock frequency) at 1.05V. [Annotations: 12.5% lower bandwidth than two uni-directional TSVs; 33.8% bandwidth increase by a coupling capacitor.]
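The bandwidth trade-off above is simple arithmetic on the two measured data rates, and the following sketch reproduces it. The function name `compare_bandwidth` and its dictionary keys are illustrative choices, not part of the test-chip infrastructure; the only inputs are the 10.4Gb/s and 9.1Gb/s figures reported above.

```python
def compare_bandwidth(uni_pair_gbps: float, sbd_gbps: float) -> dict[str, float]:
    """Compare two uni-directional TSVs against SBD TSVs.

    uni_pair_gbps: aggregate data rate of two uni-directional TSVs.
    sbd_gbps: bi-directional data rate of a single SBD TSV.
    """
    return {
        # Each uni-directional TSV carries half of the aggregate rate.
        "uni_per_tsv_gbps": uni_pair_gbps / 2,
        # An SBD TSV moves two bits per cycle, so clock = data rate / 2.
        "sbd_clock_ghz": sbd_gbps / 2,
        # One SBD TSV replacing two uni-directional TSVs (half the TSV count).
        "loss_with_half_tsvs_pct": (1 - sbd_gbps / uni_pair_gbps) * 100,
        # Two SBD TSVs replacing two uni-directional TSVs (same TSV count).
        "gain_with_same_tsvs_pct": (2 * sbd_gbps / uni_pair_gbps - 1) * 100,
    }


result = compare_bandwidth(uni_pair_gbps=10.4, sbd_gbps=9.1)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

With the measured values, the loss at half the TSV count comes out at 12.5% and the gain at the same TSV count at 75%, matching the figures quoted in the text.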
While the straightforward SBD implementation using inverters (baseline #2) showed a maximum data rate of 9.9Gb/s, which was slightly less than that of two uni-directional TSVs (baseline #1), the same SBD signaling circuits as our proposed design except for the coupling capacitor (baseline #3) showed a much smaller maximum bandwidth, 6.8Gb/s/TSV. As discussed in Section 4.3, our half-clocked driver at a top die (designed for die-to-die static current reduction) has only half a clock cycle for switching to the middle voltage level (ideally VDD/2=0.525V), and such middle voltage switching by a pull-up PMOS and a pull-down NMOS together takes longer than pull-up only or pull-down only switching, thereby leading to a significant decrease in maximum data rate. This maximum bandwidth loss can be compensated by the coupling capacitor (highlighted in red in Figure 4-7); experimental results confirmed that this circuit design enabled a 33.8% increase (6.8Gb/s/TSV to 9.1Gb/s/TSV) in maximum SBD TSV bandwidth. As discussed earlier in Section 4.3, this is because the coupling capacitor shortens the time required for switching to the middle voltage level by a pull-up PMOS and a pull-down NMOS together (which is the critical path of our SBD TSV signaling circuits) by adding extra charging/discharging current during the signal transition through capacitive coupling. Figure 4-21 summarizes the results of our maximum die-to-die bandwidth experiments.

4.5.2 Energy Efficiency

Next, we measured the signaling energy efficiency of the fabricated TSV interconnects. In addition to the two industry-standard patterns for link tests, PRBS7 and PRBS31, two more input data sequences were included in the energy measurement experiment: CLK/2 for a higher data transition input sequence and CLK/32 for a lower data transition input sequence.
CLK/2 is the clock-shaped waveform whose frequency is half of the interconnect clock, and hence it has 100% data transition density. On the other hand, CLK/32 is the clock-shaped signal whose frequency is 1/32 of the interconnect clock, so it generates 6.25% data transition density. Figure 4-22 describes the four bi-directional data sets used in our energy measurement.

Figure 4-22: Four bi-directional input data sets for energy comparison. [Top die input ⇌ bottom die input: data set #1, low activity (CLK/32) ⇌ moderate activity (PRBS31); data set #2, moderate activity (PRBS7) ⇌ moderate activity (PRBS31); data set #3, moderate activity (PRBS31) ⇌ moderate activity (PRBS7); data set #4, moderate activity (PRBS7) ⇌ high activity (CLK/2).]

Figure 4-23 shows the TSV signaling energy efficiency⁴ measured at the maximum data rate of our proposed SBD interconnect circuits, 9.1Gb/s bi-directional bandwidth (i.e. 4.55GHz clock frequency), at the power supply voltage of 1.05V. When data set #1 (CLK/32⇌PRBS31) was transmitted and received across two dies, two uni-directional TSVs and a single inverter-based SBD TSV consumed 98.46fJ/2b and 112.87fJ/2b, respectively. The proposed SBD TSV, on the other hand, showed 88.31fJ/2b energy efficiency, 10.3% less energy than two uni-directional TSVs and 21.8% less energy than the inverter-based SBD TSV. When data transition density is low, the energy benefit of our SBD signaling over uni-directional signaling is not significant (only 10.3% reduction) because die-to-die static current is comparable to the low-activity switching current. As compared to the inverter-based SBD TSV, however, our SBD TSV shows higher energy savings (21.8% reduction) due to the half-clocked driver on a top die that enables a 2× reduction in die-to-die static current.

⁴ To follow the convention of interconnect energy efficiency calculation, we obtained the TSV energy efficiency by dividing the measured power consumption (µW) by the operating frequency (GHz). Since our TSV interconnects simultaneously deliver two bits, one bit from a top die to a bottom die and another bit from a bottom die to a top die, the energy efficiency unit of our bi-directional signaling is not fJ/b but fJ/2b.

Figure 4-23: Measured TSV interconnect energy efficiency over various input data sets at 9.1Gb/s bi-directional data rate (i.e. 4.55GHz clock frequency) at 1.05V. The proposed SBD signaling circuits consume 10.3-31.1% less energy than uni-directional TSVs. [Measured energy efficiency in fJ/2b for baseline #1 (two uni-directional TSVs), baseline #2 (inverter-based SBD) and the proposed design, respectively: CLK/32⇌PRBS31: 98.46, 112.87, 88.31; PRBS7⇌PRBS31: 137.08, 131.58, 109.94; PRBS31⇌PRBS7: 136.84, 132.89, 113.67; PRBS7⇌CLK/2: 186.41, 146.58, 128.44.]

When moderate data transition inputs, data set #2 (PRBS7⇌PRBS31) and data set #3 (PRBS31⇌PRBS7), were applied to the fabricated TSV interconnects, our proposed SBD TSV showed 16.9-19.8% energy savings over two uni-directional TSVs.
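The energy-efficiency convention from the footnote (measured power divided by operating frequency) and the percentage reductions for data set #1 can be checked with a few lines. This is a sketch for illustration only; the function names `energy_fj_per_2b` and `reduction_pct` are hypothetical, and the three energy values are the data set #1 results quoted above.

```python
def energy_fj_per_2b(power_uw: float, clock_ghz: float) -> float:
    """Energy per clock cycle, i.e. per two bits on an SBD link.

    1 uW / 1 GHz = 1e-6 W / 1e9 Hz = 1e-15 J, so the quotient is in fJ.
    """
    return power_uw / clock_ghz


def reduction_pct(baseline_fj: float, proposed_fj: float) -> float:
    """Energy saving of the proposed design relative to a baseline, in percent."""
    return (baseline_fj - proposed_fj) / baseline_fj * 100


# Data set #1 (CLK/32 <-> PRBS31), values quoted in the text, in fJ/2b.
UNI_DIRECTIONAL = 98.46   # baseline #1: two uni-directional TSVs
INVERTER_SBD = 112.87     # baseline #2: inverter-based SBD TSV
PROPOSED_SBD = 88.31      # proposed half-clocked SBD TSV

print(f"vs uni-directional: {reduction_pct(UNI_DIRECTIONAL, PROPOSED_SBD):.1f}%")
print(f"vs inverter-based:  {reduction_pct(INVERTER_SBD, PROPOSED_SBD):.1f}%")
```

The two printed reductions round to 10.3% and 21.8%, consistent with the values stated for data set #1.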
Compared with the inverter-based SBD signaling circuits, the proposed design showed 14.5-16.4% lower energy consumption. One notable result is that the uni-directional TSVs and the inverter-based SBD TSV dissipated almost the same energy with data set #2 and data set #3, which are the same data sequences but in opposite directions, while the proposed SBD TSV showed over 5% energy deviation between the two data sets. This is because our SBD TSV signaling circuits incorporate different transmitter designs at a top die (the half-clocked coupling driver) and a bottom die (the NAND-enabled inverter) as shown in Figure 4-7. This asymmetry can offer an energy optimization opportunity at design time if the data patterns between two adjacent dies in a 3D-IC are known before chip fabrication; we can simply swap the transmitter designs when the swapped arrangement shows better energy efficiency.

The proposed SBD signaling circuits achieved the best energy saving over uni-directional TSVs when delivering data set #4 (PRBS7⇌CLK/2), which has the highest transition density of our four data sets; while two uni-directional TSVs dissipated 186.41fJ/2b, the proposed SBD TSV consumed 128.44fJ/2b (31.1% energy reduction). Each data transition on a uni-directional TSV requires one full-swing switching, and hence two simultaneous transitions such as 00→11, 11→00, 01→10 or 10→01 incur two full-swing switches. In our proposed SBD signaling, on the other hand, such simultaneous transitions require only one full-swing switching, leading to dynamic energy reduction. Thus, a higher data transition rate on SBD TSVs results in bigger energy benefits over uni-directional TSVs, as confirmed by our experiments.

4.5.3 Area Comparison

Since SBD signaling can transmit and receive data at the same time through a single TSV, it occupies smaller silicon area than uni-directional TSV signaling at a given die-to-die bandwidth requirement. Such an area benefit is ideally 50% (2 TSVs versus 1 TSV), but our proposed SBD interconnect incorporates a half-clocked driver (which is bigger than uni-directional repeaters) and a switched dual-tree sense amplifier (which is also bigger than conventional flip-flops), so its actual area benefit is 34.4%. Considering that current face-to-back (F2B) TSV technology imposes substantial area overheads due to its poly layer landing pads, this 34.4% area benefit is quite valuable in designing 3D-ICs; through SBD signaling, we can use more redundant TSVs for reliable die-to-die signaling or more power TSVs for stable power delivery at a given area budget. It is also notable that the proposed SBD signaling circuits have a 2.7% bigger footprint than baseline #3, which is identical to the proposed design except for the coupling capacitor. In other words, the coupling capacitor at a half-clocked driver incurs only a 2.7% area overhead at the overall signaling circuit level, which includes transmitters (Tx), receivers (Rx) and TSV landing pads. Figure 4-24 shows the normalized area of the four types of TSV interconnects fabricated in our 3D-IC test chip.

Figure 4-24: Normalized area comparison of the fabricated TSV signaling circuits. While baseline #1 includes two TSV landing pads, other three SBD TSV interconnects have only one TSV landing pad. [Vertical axis: normalized TSV interconnect footprint. Annotation: only 2.7% area overhead of a coupling capacitor.]
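One way to read the area and bandwidth results together is as bandwidth per unit of interconnect area. The sketch below derives that ratio from the reported 34.4% area benefit and the measured 9.1Gb/s and 10.4Gb/s data rates. The resulting ratios are derived here for illustration and are not measurements reported in this chapter; the function names are hypothetical, and the footprint of two uni-directional TSVs is assumed to be normalized to 1.0.

```python
def links_per_area_ratio(area_saving: float) -> float:
    """How many SBD links fit in the area of one pair of uni-directional TSVs."""
    sbd_area = 1.0 - area_saving  # footprint normalized to baseline #1 = 1.0
    return 1.0 / sbd_area


def bandwidth_density_ratio(
    sbd_gbps: float, uni_pair_gbps: float, area_saving: float
) -> float:
    """Bandwidth per unit area of an SBD link relative to two uni-directional TSVs."""
    sbd_area = 1.0 - area_saving
    return (sbd_gbps / sbd_area) / (uni_pair_gbps / 1.0)


AREA_SAVING = 0.344  # reported area benefit of the proposed SBD interconnect

print(f"SBD links per baseline footprint: {links_per_area_ratio(AREA_SAVING):.2f}")
print(
    "relative bandwidth density: "
    f"{bandwidth_density_ratio(9.1, 10.4, AREA_SAVING):.2f}"
)
```

Under these assumptions, about 1.52 SBD links fit in the footprint of one uni-directional pair, and the bandwidth per unit area is roughly 1.33 times that of baseline #1, even though a single SBD TSV has 12.5% lower peak bandwidth than the pair.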
4.5.4 Comparison with Other Low-Power TSV Circuits

There are only a few TSV interconnect studies at the circuit design level with actual chip prototypes [79, 68]. Futoshi Furuta et al. demonstrated low-swing TSV circuits featuring adaptive timing control to deal with variations in TSV parasitic lumped capacitance [79]. In their design, low-swing signaling was generated by an inverter with a lower power supply voltage, 0.4V, consuming 27% less energy than an equivalent full-swing TSV. While the 0.4V voltage swing provided enough noise margin, the inverter-based low-swing signaling had weaker driving strength, resulting in significant bandwidth loss; their uni-directional TSVs were able to deliver 2Gb/s/TSV at most. In addition, the extra lower power supply voltage dedicated only to low-swing links is sometimes an unacceptable system cost.

Yong Liu et al. presented a 6-tier 3D-IC chip prototype of low-swing TSV circuits that achieved 27-53% energy savings over full-swing TSVs with 0.19 to 0.3V voltage swing [68]. While the gated-diode sense amplifier enables single-ended low-swing signaling, the small noise margin (0.19 to 0.3V voltage swing) is vulnerable to PVT variations. Their single-ended tri-state low-swing transmitter is very similar to our Chapter 2 low-swing circuits (Figure 2-4); our PMOS transistors were replaced by NMOS transistors with the lower second power supply voltage. As discussed in Section 2.3, this transmitter design incurred no bandwidth or latency penalty, and hence their low-swing TSVs delivered uni-directional data at up to 6Gb/s/TSV.

Our TSV link is arguably the first Simultaneously Bi-Directional (SBD) TSV. In other words, the proposed interconnect design is the first SBD link design optimized for TSV channel characteristics.
While the previous two low-swing TSV circuits [79, 68] provided no area reduction (in fact, they incurred a small area overhead), our SBD TSV enabled 34.4% lower area consumption than uni-directional TSVs. Since the TSV micro bump size (50µm×50µm in [79]) is almost the same as the size of flip-chip micro bumps, this 34.4% area saving is quite valuable. Also, the 10.3 to 31.1% energy reduction with a half-swing noise margin provides energy-efficient yet reliable 3D-IC vertical signaling. Details of the proposed SBD TSV link are summarized in Table 4.1 together with the previous works.

Table 4.1: Comparison of Energy-efficient Face-to-Back TSV Interconnects (CMOS-on-CMOS).

                                 | 3DIC 2012 [79]         | ISSCC 2012 [68]        | This work
TSV Interconnect Feature         | Low-Swing Signaling    | Low-Swing Signaling    | SBD Signaling
Energy Reduction over Full-Swing | 27%                    | 27 to 53%              | 10.3 to 31.1%
Voltage Swing (Noise Margin)     | 0.4V                   | 0.19 to 0.3V           | 0.505V (half swing)
Process Corner Simulations       | Error Free             | N/A                    | Error Free
2nd Power Supply Required        | Yes                    | Yes                    | No
Area Reduction                   | No (a little overhead) | No (a little overhead) | Yes (34.4% reduction)
Signaling Type                   | Single-ended           | Single-ended           | Single-ended
TSV Lumped Capacitance           | ∼200fF/TSV             | ∼200fF/TSV             | N/A
TSV Landing Pad Size             | 50µm×50µm              | N/A                    | N/A
Max Clock Frequency              | 2GHz                   | 6GHz                   | 4.55GHz
3D-IC Stacks                     | 2-tier 3D-ICs          | 6-tier 3D-ICs          | 2-tier 3D-ICs
CMOS Technology                  | 65nm CMOS              | 45nm CMOS              | 28nm LP CMOS
F2B TSV Technology               | N/A                    | IBM Support            | MediaTek Support

4.6 Chapter Summary

In this chapter, we presented an energy and area efficient TSV signaling circuit for 3D-IC NoCs that require intensive die-to-die bandwidth. To be specific, we proposed a simultaneously bi-directional (SBD) TSV interconnect that can transmit and receive data at the same time through a single TSV.
While a half-clocked driver with a coupling capacitor reduced inter-die static current (hence enabling low-power SBD data transmission) without a substantial loss in maximum data rate, a switched dual-tree sense amplifier allowed error-free single-cycle TSV signaling even at high clock frequency. Fabricated with MediaTek F2B TSV technology and a 28nm LP CMOS process, our 3D-IC test chip showed that the proposed SBD signaling circuit consumed 10.3-31.1% lower energy and 34.4% less silicon area than traditional uni-directional TSVs. Even though one SBD TSV showed 12.5% lower maximum bandwidth than two uni-directional TSVs (i.e. two SBD TSVs can deliver 75% more data than two uni-directional TSVs), it functioned error free at bi-directional data rates up to 9.1Gb/s/TSV at a 4.55GHz clock frequency. Considering that our test chip was fabricated with a Low-Power (LP) CMOS process typically used for mobile chips that require moderate clock frequency, we believe our 4.55GHz maximum clock frequency at 1.05V is high enough.

To sum up, the proposed TSV interconnect enables higher die-to-die bandwidth than uni-directional TSVs at a given energy and area budget in 3D-ICs. In other words, it can deliver the same amount of inter-die data with lower energy and less area than uni-directional TSVs. Besides, our SBD signaling circuit enables cycle-wise bandwidth adaptivity, so it can be used as the basis of fine-grained bandwidth-adaptive 3D NoCs that efficiently handle dynamic traffic in 3D-IC manycore chips.

5 Conclusions and Future Work

Can we imagine the future NoC? Can the circuits proposed in this thesis feature in the future NoC? What kind of NoC architectures can be enabled? Let us conclude the thesis by shaping future research directions based on the insights and lessons learnt from our study.
In this thesis, we presented design techniques for low-power yet high-performance NoCs at the link circuit level, the router microarchitecture level and the link-and-router co-design level. The proposed design techniques reduced NoC power without compromising network performance, i.e. they simultaneously improved network energy efficiency and performance. However, our NoC designs incurred overheads such as a longer critical path, larger silicon area, reduced signaling noise margin, or additional system overheads. This thesis sought to accurately analyze both the benefits and penalties of our proposed NoC design techniques through chip prototyping.

5.1 Thesis Summary

We briefly summarize the contributions of our three test chips. The first test chip explored a regular mesh NoC for general-purpose computing chips (CMPs), while the second chip prototype aimed at designing the low-swing datapath of irregular mesh NoCs for high-performance application-specific ICs (MPSoCs). The third test chip demonstrated energy and area efficient TSV signaling for 3D mesh NoCs in 3D-ICs.

5.1.1 Regular Mesh Network in CMPs

Our first chip prototype demonstrated that a 1GHz 16-node mesh NoC with a 64b link width can be designed in 45nm SOI CMOS within a power budget of less than 1W, delivering 87-91% of the theoretical throughput limits of a mesh and a per-hop latency of 1.04-1.05 cycles at low-load traffic. Compared to an equivalent baseline, the fabricated mesh NoC showed 31-38% power reductions as well as 48-55% latency benefits and over 2× bandwidth improvements through the combination of virtual bypassing, router-level multicast support and a clocked low-swing datapath. On the other hand, the proposed design resulted in a 21% longer critical path, 39% larger area and a reduced signaling noise margin.
While the longer critical path can be easily masked in actual manycore chips, where computation cores rather than NoC routers limit the clock frequency, the 39% area overhead at the router level is quite painful. A more compact layout (e.g. a router layout through flattened placement and automatic routing of low-swing circuits and full-swing logic gates together) can substantially reduce the area overhead, but such a compact layout will incur more noise coupling between noise-sensitive low-swing circuits and noisy full-swing digital circuits.

5.1.2 Low-Swing Datapath of Configurable Meshes in SoCs

The second test chip explored two types of clockless, single-ended low-swing repeaters to enable configuration of fast contention-free paths through any irregular mesh (i.e. any subset of a regular backbone mesh), efficiently supporting dynamic traffic in MPSoCs.

Self-resetting logic repeaters (SRLRs) enabled single-ended low-swing pulses to be repeated without clocking, thereby leading to lower power dissipation than differential, clocked low-swing signaling. To mitigate global process variations while delivering high energy efficiency, three circuit techniques were incorporated. Fabricated in 45nm SOI CMOS, the 10mm SRLR-based datapath achieved 6.83Gb/s/µm bandwidth density and 40.4fJ/b/mm energy efficiency at a 4.1Gb/s data rate. Thanks to its single-ended signaling, the SRLR-based low-swing datapath occupied only 18% of the entire router footprint.

Voltage-locked repeaters (VLRs), on the other hand, facilitated lower transmission delay, albeit with negligible low-swing energy benefits at the link level. Compared to equivalent full-swing repeaters, the fabricated 10mm 10-hop VLRs showed a 35.8% latency reduction and a 23.6% higher maximum data rate.
The VLRs enabled 2GHz single-cycle, cross-chip communication between any node pair on a 4×4 mesh, and such single-cycle multi-hop asynchronous repeated traversal (SMART) contributed to network-level power savings as well as latency reduction in the SMART NoC [1].

5.1.3 Towards Low-Cost 3D Meshes in 3D-ICs

Just as a 2D mesh maps easily to the planar layout of traditional CMOS wafers, a 3D mesh maps readily to the physical structure of 3D-ICs. A 3D mesh fabric provides more routing paths between a source and a destination node (i.e. it is more flexible) than a 2D mesh, and such path diversity enables fewer hop counts and higher throughput. In other words, 3D mesh NoCs can integrate more nodes than 2D mesh NoCs at a given budget of network latency and bandwidth. While F2B TSVs are the most promising technology for vertical signaling in multi-layer 3D-ICs, current F2B TSV fabrication technologies suffer from low yield, sizable silicon landing pads and significant parasitic capacitance.

Our third test chip demonstrated that the proposed SBD TSV signaling circuits can alleviate such power and area overheads; the SBD TSVs showed 10.3-31.1% lower power and 34.4% less silicon area than two uni-directional TSVs that deliver the same amount of data. While the half-clocked driver with a coupling capacitor eliminated unnecessary inter-tier static current with a slight bandwidth loss (less than 13%), the switched dual-tree sense amplifier enabled variation-tolerant TSV signaling as well as lower sensing delay. These low-power, high-density SBD TSVs can be used to improve the reliability of vertical signaling in 3D-ICs (using spare TSVs) or to build bandwidth-adaptive 3D NoCs that can efficiently deliver highly dynamic inter-tier traffic in 3D-ICs.
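The hop-count advantage of a 3D mesh mentioned in this subsection can be illustrated with a brute-force sketch. It is not taken from the thesis: it simply averages the Manhattan distance over all ordered source/destination pairs (self-pairs included) for two hypothetical 64-node meshes, assuming minimal routing and uniform traffic.

```python
from itertools import product


def average_hops(dims):
    """Mean Manhattan distance over all ordered node pairs of a mesh.

    Self-pairs (source == destination) are included, so this shows the trend
    rather than a traffic-weighted latency figure.
    """
    nodes = list(product(*(range(k) for k in dims)))
    total = sum(
        sum(abs(a - b) for a, b in zip(src, dst))
        for src in nodes
        for dst in nodes
    )
    return total / (len(nodes) ** 2)


print(average_hops((8, 8)))     # 64-node 2D mesh
print(average_hops((4, 4, 4)))  # 64-node 3D mesh
```

For these two example meshes the averages come out to 5.25 hops in 2D versus 3.75 hops in 3D, which is the kind of gap the text alludes to.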
5.2 Low-Swing Signaling Reliability

This thesis demonstrated that NoC datapath energy can be substantially reduced by various low-swing signaling techniques such as a clocked, differential low-swing circuit (Chapter 2), a clockless, single-ended low-swing circuit (Chapter 3) and a half-swing SBD TSV interconnect (Chapter 4). These energy savings, as discussed in detail in each chapter, come with a reduced noise margin. Basically, a lower voltage swing enables more dynamic energy saving but leaves a smaller voltage noise margin, thereby resulting in higher error probability. Figure 5-1, which is the same as Figure 34, explicitly shows this trade-off between low-swing energy efficiency and reliability. Link-level errors should be detected and corrected at the system level (e.g. NoC router layer, network interface layer or protocol layer), and hence a higher error probability incurs bigger system-level overheads. The optimal voltage swing level to minimize overall system power is debatable because the optimal point highly depends on the implementation of system-level error detection and correction, whose overheads have not yet been accurately analyzed.

Figure 5-1: Lower voltage swing enables higher energy efficiency, but results in higher signaling error probability (hence bigger system overheads). This figure is identical to Figure 2-11.

To overcome such reliability issues, our Chapter 2 low-swing circuit (Figure 2-4) can control its voltage swing level through an off-chip voltage regulator. While this off-chip solution guarantees stable low-swing signaling with a little voltage margin, the additional power supply voltage is sometimes considered an unacceptable system cost, and moreover, this scheme requires extra post-fabrication testing effort to find the proper voltage swing level for each die.
Self-resetting logic repeaters (SRLRs), which generate single-ended low-swing signaling without clocking circuits, achieve substantial energy reduction without compromising bandwidth density (see the measurement results in Figure 3-7). However, since their robustness against process variations is 3∼4σ, which is far from the industrial standard (typically over 5σ), SRLRs cannot be directly incorporated into commercial chips because of their poor signaling reliability. The easiest remedy, needless to say, is to increase the voltage swing, but covering all 5σ variations only with a higher voltage swing in an advanced node (e.g. 28nm CMOS) will lead to a considerable loss in energy benefits. The reduced signaling noise margin can be addressed through error correction codes (ECCs) at the system level, but such an ECC scheme will incur power and latency overheads. NoC-layer solutions are introduced in Subsection 5.3.2 as a future project.

While our SBD TSV link shows lower energy reduction than SRLRs, its half-swing nature (∼0.5V noise margin) enables much more reliable operation. As discussed in Section 4.4 and Section 4.5, both simulation and measurement results demonstrate that the proposed SBD interconnect design functions error free at data rates up to 9.1Gb/s/TSV at a 1.05V power supply voltage. Thus, our SBD TSV can be used as an error-free NoC datapath without system-level error correction schemes.

As CMOS processes scale down, NoC datapath energy will increase in percentage relative to control and storage circuitry energy [15, 16], and hence it will become more critical to reduce the interconnect power. On the other hand, since CMOS scaling makes process variation worse, low-swing signaling circuits with smaller feature sizes incur a higher BER at the same noise margin. Therefore, future NoCs will require variation-aware circuit design even at the link level to achieve low-power yet reliable interconnects.
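The swing-versus-reliability trade-off discussed in this section can be sketched numerically under simplifying assumptions that are not from the thesis: switching energy is modeled to first order as C·Vdd·Vswing (a driver supplied from Vdd), noise is modeled as zero-mean Gaussian, and the noise margin is taken as half of the swing. The capacitance, supply and noise values below are hypothetical.

```python
import math


def dynamic_energy_fj(cap_ff, vdd, vswing):
    """First-order switching energy in fJ: C [fF] * Vdd [V] * Vswing [V]."""
    return cap_ff * vdd * vswing


def error_probability(noise_margin_v, noise_sigma_v):
    """Probability that zero-mean Gaussian noise exceeds the noise margin."""
    return 0.5 * math.erfc(noise_margin_v / (noise_sigma_v * math.sqrt(2.0)))


# Hypothetical operating points: 100 fF wire, 1.0 V supply, 50 mV noise sigma,
# and a noise margin equal to half of the swing.
for vswing in (0.2, 0.3, 0.5, 1.0):
    energy = dynamic_energy_fj(100.0, 1.0, vswing)
    p_err = error_probability(vswing / 2.0, 0.05)
    print(f"swing={vswing:.2f} V  energy={energy:6.1f} fJ  P(error)={p_err:.2e}")
```

Under the same Gaussian assumption, `error_probability(k, 1.0)` gives the one-sided tail beyond kσ, roughly 1.3e-3 at 3σ and 2.9e-7 at 5σ, which hints at why 3∼4σ robustness falls well short of a 5σ target.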
5.3 Future Projects

Can we imagine the future NoC? With advances in 3D-IC fabrication technology, coupled with the physical limits of conventional CMOS scaling, the future NoC will likely have a 3D mesh as its backbone network topology. This 3D mesh, which is much more flexible and scalable than a 2D mesh, will be reconfigured at runtime depending on the applications being executed. In addition to network topology, the future NoC will tailor each router microarchitecture (e.g. pipeline depth) through background calibration against PVT variations. Ultimately, the NoC will satisfy the communication requirements of both bandwidth-intensive applications through high-density TSVs (hopefully our SBD TSVs!) and latency-sensitive applications through contention-free low-latency links (optical interconnects or low-swing links such as our VLRs). Needless to say, such a NoC should also be low-power. Is it possible?

5.3.1 Broadcast-Intensive Cache Coherent Protocols

It is widely believed that broadcasts over a mesh network are too expensive on-die due to bandwidth and power constraints, so broadcast-intensive on-chip communication has typically been limited to bus interconnects or ring NoCs that suffer from poor scalability. Our 4×4 mesh NoC chip prototype challenged this belief. As discussed in Section 2.4.1, our mesh NoC chip achieved 2.2× higher saturation throughput than the baseline, which means our mesh design can endure over 2× more broadcasts before the network saturates. Latency-wise, the proposed design showed a 55.1% broadcast latency reduction before saturation, leading to 1.05 cycles per hop on average before the network saturates. From the energy point of view, a broadcast flit is nothing but a unicast flit with 5× ST/LT energy and 5× buffering energy.
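The energy view of a broadcast flit stated above can be written as a toy model. Everything in it other than the 5× multipliers is an assumption for illustration: the per-component energies are in arbitrary units, `stlt_scale` stands in for a lower-energy ST/LT datapath, and `bypass_fraction` stands in for the share of buffer accesses that are skipped.

```python
def flit_energy(e_buffer, e_stlt, e_other, fanout=1,
                stlt_scale=1.0, bypass_fraction=0.0):
    """Toy per-flit energy model in arbitrary units.

    fanout multiplies the buffering and ST/LT components (5 for a broadcast,
    per the text); stlt_scale and bypass_fraction are hypothetical knobs that
    shrink those two components.
    """
    if not 0.0 <= bypass_fraction <= 1.0:
        raise ValueError("bypass_fraction must be within [0, 1]")
    buffering = fanout * e_buffer * (1.0 - bypass_fraction)
    switch_and_link = fanout * e_stlt * stlt_scale
    return buffering + switch_and_link + e_other


unicast = flit_energy(1.0, 1.0, 1.0)
broadcast = flit_energy(1.0, 1.0, 1.0, fanout=5)
optimized = flit_energy(1.0, 1.0, 1.0, fanout=5,
                        stlt_scale=0.5, bypass_fraction=0.5)
print(unicast, broadcast, optimized)
```

With these placeholder values the script prints 3.0, 11.0 and 6.0, showing how shrinking the two multiplied components pulls broadcast energy back toward the unicast level.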
As demonstrated through our chip prototyping, the higher ST/LT energy can be alleviated by low-swing signaling circuits while the higher buffering energy can be effectively reduced through virtual bypassing, thus resulting in reasonable broadcast energy even in a mesh.

We see potential architecture-level research opportunities enabled by our low-power broadcast mesh NoC. One well-known application of on-chip broadcasts is a cache coherence protocol for scalable on-chip cache subsystems [80, 81, 82, 83]. Broadcast-based cache coherence protocols have the virtue of requiring no (or smaller) directory storage, whose power and area substantially increase as core counts scale. In other words, in cache coherence protocols that use broadcasts for requests and invalidates, more broadcasts enable lower directory overheads, thus mitigating the scalability issue caused by the directory. SCORPIO [84], a 36-core processor chip prototype featuring in-network ordering over a mesh NoC, leveraged our broadcast router design to incorporate snoopy coherence for scalable on-chip cache subsystems.

5.3.2 Error-Tolerant NoCs with Low-Swing Links

As discussed in Section 3.4.2 and Section 3.5.2, SRLRs and VLRs require system-level error-tolerant schemes, because of their poor signaling reliability, before they can be used as a stable NoC datapath. One interesting approach is to detect and correct, at the NoC architecture level, the errors that our circuit-level solutions do not cover. Such NoC-level solutions will likely include topology reconfiguration, resilient routing or a flexible router pipeline. It is notable that even if NoCs do not incorporate low-swing links, these error-tolerant NoCs will be required at future technology nodes, where there will be many transistor failures during the lifetime of manycore chips [85, 86].
Thus, if such error-tolerant schemes can be extended to low-swing links with acceptable overheads, our low-swing circuits can be embedded within a NoC without compromising their excellent energy efficiency. Diverse previous works on error-tolerant NoCs are well summarized in [87], which can be the starting point of this future project.

5.3.3 Bandwidth-Adaptive 3D NoCs

Our Simultaneously Bi-Directional (SBD) TSV circuit provides a foundation for bandwidth-adaptive 3D NoCs, which efficiently handle dynamic traffic in 3D-ICs. The half-clocked driver enables or disables SBD signaling at every cycle, and hence provides cycle-wise inter-tier bandwidth adaptivity. In other words, our SBD TSVs support three effective signaling modes (UP→DOWN unidirectional signaling, DOWN→UP unidirectional signaling, or UP↔DOWN bidirectional signaling), and the signaling mode can be switched at every cycle. Thanks to the switched dual-tree sense amplifier, coupled with the half-swing nature of the 3-step SBD signaling, our proposed SBD TSVs hardly incur any reliability issues. Armed with these flexible yet reliable SBD TSVs, we can instantaneously double the die-to-die bandwidth at given TSV area and thermal budgets. While this bandwidth adaptivity can efficiently handle bursty off-die traffic, which is often the performance bottleneck in manycore chips, it does not require any other link-level overheads such as serialization/deserialization, data encoding/decoding, or error correction.

However, incorporating this SBD-based bandwidth adaptivity into the 3D NoC router will be a challenging task, i.e. it will require some router-level overheads. For instance, designing a 7×7 crossbar switch with two vertical SBD links is far from trivial. Since one end of an SBD TSV can serve as an input port and/or an output port of the crossbar switch, careful datapath decoupling is required.
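The three signaling modes and their per-cycle switching can be pictured with a minimal behavioral sketch. The mode names follow the text; the selection policy (pick the mode from whichever dies have data pending) and all identifiers are hypothetical, since the thesis leaves the router-level policy as future work. The default clock of 4.55GHz is the measured maximum quoted earlier.

```python
from enum import Enum


class SbdMode(Enum):
    UP_TO_DOWN = "UP->DOWN"
    DOWN_TO_UP = "DOWN->UP"
    BIDIRECTIONAL = "UP<->DOWN"


def select_mode(up_has_data, down_has_data):
    """Hypothetical per-cycle policy: choose a mode from pending traffic.

    Returns None when neither die has data to send in this cycle.
    """
    if up_has_data and down_has_data:
        return SbdMode.BIDIRECTIONAL
    if up_has_data:
        return SbdMode.UP_TO_DOWN
    if down_has_data:
        return SbdMode.DOWN_TO_UP
    return None


def bits_per_cycle(mode):
    """Bits carried by one TSV in one cycle for the given mode."""
    if mode is None:
        return 0
    return 2 if mode is SbdMode.BIDIRECTIONAL else 1


def data_rate_gbps(mode, clock_ghz=4.55):
    """Per-TSV data rate in Gb/s at the given clock frequency."""
    return bits_per_cycle(mode) * clock_ghz


# Example trace of (upper die has data, lower die has data) per cycle.
trace = [(True, False), (True, True), (False, True), (False, False)]
for up, down in trace:
    mode = select_mode(up, down)
    print(mode.value if mode else "idle", bits_per_cycle(mode))
```

In bidirectional mode at 4.55GHz this model reproduces the 9.1Gb/s/TSV figure, twice the rate of either unidirectional mode.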
Besides, the physical constraints of SBD TSVs differ from those of the 2D links (North, South, East, West and NIC), leading to more complicated timing requirements. Detecting bursty traffic at the NoC level through flow control is also challenging. Finally, the energy efficiency and performance of the bandwidth-adaptive 3D NoC will vary significantly depending on how a spare TSV scheme (which is critical in existing 3D-ICs) is merged into the SBD TSV-based router design.

Bibliography

[1] Chia-Hsin Owen Chen, Sunghyun Park, Tushar Krishna, Suvinay Subramanian, Anantha Chandrakasan, and Li-Shiuan Peh. SMART: A Single-Cycle Reconfigurable NoC for SoC Applications. In Proceedings of the IEEE/ACM Design, Automation and Test in Europe (DATE), 2013.
[2] Seongmoo Heo and Krste Asanovic. Replacing Global Wires with an On-Chip Network: a Power Analysis. In Proceedings of the IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), August 2005.
[3] W. J. Dally and B. Towles. Route Packets Not Wires: On-Chip Interconnection Networks. In Proceedings of the IEEE/ACM Design Automation Conference (DAC), June 2001.
[4] M. B. Taylor et al. The Raw Microprocessor: a Computational Fabric for Software Circuits and General-Purpose Programs. IEEE Micro, 22(2):25–35, 2002.
[5] Y. Hoskote et al. A 5-GHz Mesh Interconnect for a Teraflops Processor. IEEE Micro, 27(5):51–61, 2007.
[6] T. Krishna et al. Towards the Ideal On-Chip Fabric for 1-to-Many and Many-to-1 Communication. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO), December 2011.
[7] T. Krishna et al. Breaking the On-Chip Latency Barrier Using SMART. In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA), February 2013.
[8] P. Gratz et al. On-Chip Interconnection Networks of the TRIPS Chip. IEEE Micro, 27(5):41–50, 2007.
[9] Shane Bell et al.
TILE 64-Processor: A 64-Core SoC with Mesh Interconnect. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), February 2008.
[10] Jason Howard et al. A 48-core IA-32 Message-Passing Processor with DVFS in 45nm CMOS. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), February 2010.
[11] Dae Hyun Kim et al. 3D-MAPS: 3D Massively Parallel Processor with Stacked Memory. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), February 2012.
[12] Bhavya K. Daya et al. SCORPIO: A 36-Core Research Chip Demonstrating Snoopy Coherence on a Scalable Mesh NoC with In-Network Ordering. In Proceedings of the IEEE/ACM International Symposium on Computer Architecture (ISCA), June 2014.
[13] William J. Dally and Brian Towles. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers, 2004.
[14] William J. Dally. Virtual-Channel Flow Control. In Proceedings of the IEEE/ACM International Symposium on Computer Architecture (ISCA), June 1990.
[15] Hui Zhang, Varghese George, and Jan M. Rabaey. Low-Swing On-Chip Signaling Techniques: Effectiveness and Robustness. IEEE Transactions on Very Large Scale Integration Systems (T-VLSI), pages 264–272, 2000.
[16] C. Sun, Chia-Hsin O. Chen, G. Kurian, L. Wei, J. Miller, A. Agarwal, L.-S. Peh, and V. Stojanovic. DSENT - a Tool Connecting Emerging Photonics with Electronics for Opto-Electronic Networks-on-Chip Modeling. In Proceedings of the IEEE/ACM International Symposium on Networks-on-Chip (NOCS), May 2010.
[17] J. Cong and D. Z. Pan. Interconnect Estimation and Planning for Deep Submicron Designs. In Proceedings of the IEEE/ACM Design Automation Conference (DAC), June 1999.
[18] P. Gratz et al. Uniform Repeater Insertion in RC Trees. IEEE Transactions on Circuits and Systems-I: Fundamental Theory and Applications (TCAS-I), 47(10):41–50, 2000.
[19] Natalie Enright Jerger and Li-Shiuan Peh.
Synthesis Lectures on Computer Architecture - On-Chip Networks. Morgan and Claypool Publishers.
[20] Andrew Kahng et al. ORION 2.0: A Fast and Accurate NoC Power and Area Model for Early-Stage Design Space Exploration. In Proceedings of the IEEE/ACM Design, Automation and Test in Europe (DATE), pages 423–428, 2009.
[21] M. Modarressi et al. Application-Aware Topology Reconfiguration for On-Chip Networks. IEEE Transactions on Very Large Scale Integration Systems (T-VLSI), 2011.
[22] M. Modarressi et al. Virtual Point-to-Point Connections for NoCs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (T-CAD), 2010.
[23] M. B. Stensgaard and J. Sparso. ReNoC: A Network-on-Chip Architecture with Reconfigurable Topology. In Proceedings of the IEEE/ACM International Symposium on Networks-on-Chip (NOCS), 2008.
[24] M. B. Stuart et al. Synthesis of Topology Configurations and Deadlock Free Routing Algorithms for ReNoC-based Systems-on-Chip. In Proceedings of the IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2009.
[25] C. Jackson and S. J. Hollis. Skip-links: A Dynamically Reconfiguring Topology for Energy-efficient NoCs. In International Symposium on System on Chip (SoC), 2010.
[26] Joo-Young Kim et al. A 118.4 GB/s Multi-Casting Network-on-Chip With Hierarchical Star-Ring Combined Topology for Real-Time Object Recognition. IEEE Journal of Solid-State Circuits (JSSC), 45:1399–1409, 2010.
[27] M. Coppola et al. Spidergon: a Novel On-Chip Communication Network. In International Symposium on System on Chip (SoC), page 15, 2004.
[28] H. Zhang et al. A 1V Heterogeneous Reconfigurable Processor IC for Baseband Wireless Applications. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), pages 68–69, 2000.
[29] T. Krishna, J. Postman, C. Edmonds, L.-S. Peh, and P. Chiang. SWIFT: A SWing-reduced Interconnect For a Token-based Network-on-Chip in 90nm CMOS.
In Proceedings of the IEEE International Conference on Computer Design (ICCD), pages 439–446, 2010.
[30] Amit Kumar et al. Token Flow Control. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO), November 2008.
[31] David Wentzlaff et al. On-Chip Interconnection Architecture of the Tile Processor. IEEE Micro, 27(5):15–31, 2007.
[32] A. Kumar et al. Express Virtual Channels: Towards the Ideal Interconnection Fabric. In Proceedings of the IEEE/ACM International Symposium on Computer Architecture (ISCA), June 2007.
[33] S. S. Mukherjee et al. The Alpha 21364 Network Architecture. IEEE Micro, 22(5):26–35, 2002.
[34] P. Gratz et al. Implementation and Evaluation of On-Chip Network Architectures. In Proceedings of the IEEE International Conference on Computer Design (ICCD), October 2006.
[35] A. Kumar et al. A 4.6Tbits/s 3.6GHz Single-Cycle NoC Router with a Novel Switch Allocator in 65nm CMOS. In Proceedings of the IEEE International Conference on Computer Design (ICCD), October 2007.
[36] Jan M. Rabaey, Anantha P. Chandrakasan, and Borivoje Nikolic. Digital Integrated Circuits: A design perspective. Prentice Hall, 2nd Edition, 1998.
[37] R. Ho et al. High-Speed and Low-Energy Capacitive-Driven On-Chip Wires. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), pages 412–413, February 2007.
[38] E. Mensink et al. A 0.28pJ/b 2Gb/s/ch Transceiver in 90nm CMOS for 10mm On-chip interconnects. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), pages 314–315, February 2007.
[39] B. Kim and V. Stojanovic. A 4Gb/s/ch 356fJ/b 10mm equalized on-chip interconnect with nonlinear charge-injecting transmit filter and transimpedance receiver in 90nm CMOS. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), pages 66–67, February 2009.
[40] J. Seo et al. High-Bandwidth and Low-Energy On-Chip Signaling with Adaptive Pre-Emphasis in 90nm CMOS.
In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), pages 182–183, February 2010.
[41] Jason Howard et al. A 48-core IA-32 Message-Passing Processor with DVFS in 45nm CMOS. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), pages 108–109, 2010.
[42] Sunghyun Park. Low-Swing Signaling for Energy Efficient On-Chip Networks. SM Thesis, Massachusetts Institute of Technology (MIT), June 2011.
[43] N. Verma and A. P. Chandrakasan. A High-Density 45nm SRAM Using Small-Signal Non-Strobed Regenerative Sensing. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), pages 380–381, February 2008.
[44] I. Arsovski and R. Wistort. Self-referenced Sense Amplifier for Across-chip-variation Immune Sensing in High-performance Content-Addressable Memories. In Proceedings of the IEEE/ACM Custom Integrated Circuits Conference (CICC), pages 453–456, 2006.
[45] M. Qazi et al. A 512kb 8T SRAM Macro Operating Down to 0.57V with An AC-Coupled Sense Amplifier and Embedded Data-Retention-Voltage Sensor in 45nm SOI CMOS. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), pages 350–351, February 2010.
[46] Hang-Sheng Wang, Xinping Zhu, Li-Shiuan Peh, and Sharad Malik. ORION: A Power-Performance Simulator for Interconnection Networks. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO).
[47] Andrew Kahng et al. Explicit Modeling of Control and Data for Improved NoC Router Estimation. In Proceedings of the IEEE/ACM Design Automation Conference (DAC), pages 392–397, 2012.
[48] K. Goossens et al. AEthereal Network on Chip: Concepts, Architectures, and Implementations. IEEE Design and Test of Computers, 22(5):414–421, 2005.
[49] F. Karim et al. An Interconnect Architecture for Networking Systems on Chips. IEEE Micro, 22(5):36–45, September 2002.
[50] Nam-Sung Woo. High Performance SOC for mobile applications.
In Proceedings of the IEEE Asian Solid-State Circuits Conference (A-SSCC), 2010.
[51] A. Adriahantenaina et al. SPIN: A Scalable, Packet Switched, On-Chip Micro-Network. In Proceedings of the IEEE/ACM Design, Automation and Test in Europe (DATE), March 2003.
[52] G. Passas et al. A 128 x 128 x 24Gb/s Crossbar Interconnecting 128 Tiles in a Single Hop and Occupying 6 percent of Their Area. In Proceedings of the IEEE/ACM International Symposium on Networks-on-Chip (NOCS), May 2010.
[53] E.D. Kyriakis-Bitzaros et al. Design of Low Power CMOS Drivers Based on Charge Recycling. IEEE International Symposium on Circuits and Systems (ISCAS), pages 1924–1927, June 1997.
[54] M. Hiraki et al. Data-Dependent Logic Swing Internal Bus Architecture for Ultralow-Power LSIs. IEEE Journal of Solid-State Circuits (JSSC), pages 397–402, April 1995.
[55] H. Yamauchi et al. An Asymptotically Zero Power Charge-Recycling Bus Architecture for Battery-Operated Ultrahigh Data Rate ULSIs. IEEE Journal of Solid-State Circuits (JSSC), pages 423–431, April 1995.
[56] R. Golshan et al. A novel reduced swing CMOS BUS interface circuit for high speed low power VLSI systems. IEEE International Symposium on Circuits and Systems (ISCAS), pages 351–354, May 1994.
[57] B.-D. Yang et al. High-Speed and Low-Swing On-Chip Bus Interface Using Threshold Voltage Swing Driver and Dual Sense Amplifier Receiver. IEEE European Solid-State Circuit Conference (ESSCIRC), pages 144–147, September 2000.
[58] Sunghyun Park, Tushar Krishna, Chia-Hsin O. Chen, Bhavya Daya, Anantha P. Chandrakasan, and Li-Shiuan Peh. Approaching the theoretical limits of a mesh NoC with a 16-node chip prototype in 45nm SOI. Proceedings of the IEEE/ACM Design Automation Conference (DAC), June 2012.
[59] Y.-H Kao, M. Yang, N. S. Artan, and H. J. Chao. CNoC: High-Radix Clos Network-on-Chip. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (T-CAD), 30:1897–1910, 2011.
[60] W. J. Dally.
Express Cubes: Improving the Performance of k-ary n-cube Interconnection Networks. IEEE Transactions on Computers, 40:1016–1023, 1991. [61] Henri J. Oguey and Daniel Aebischer. CMOS Current Reference Without Resistance. IEEE Journal of Solid-State Circuits (JSSC), 32:1132–1135, July 1997. [62] Eisse Mensink, Daniel Schinkel, Eric A. M. Klumperink, Ed van Tuijl, and Bram Nauta. Power efficient gigabit communication over capacitively driven RC-limited on-chip interconnects. IEEE Journal of Solid-State Circuits (JSSC), 45:447–457, Apr. 2010. [63] Byungsub Kim and Vladimir Stojanovic. An energy-efficient equalized transceiver for RC-dominant channels. IEEE Journal of Solid-State Circuits (JSSC), 45:1186–1197, June 2010. [64] Jae sun Seo, Ron Ho, Jon Lexau, Michael Dayringer, Dennis Sylvester, and David Blaauw. High-Bandwidth and Low-Energy On-Chip Signaling with Adaptive Pre-Emphasis in 90nm CMOS. In Proceedings of the IEEE International SolidState Circuits Conference (ISSCC), pages 182–183, February 2010. 150 BIBLIOGRAPHY [65] D. Woo et al. An Optimized 3D-Stacked Memory Architecture by Exploiting Excessive, High-Density TSV Bandwidth. In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 1–12, January 2010. [66] R. S. Patti. Three-Dimensional Integrated Circuits and the Future of Systemon-Chip Designs. In Proceedings of the IEEE, 2006. [67] A. W. Topol, J. D. C. La Tulipe, D. J. Frank L. Shi, K. Bernstein, S. E. Steen, A. Kumar, G. U. Singco, A. M. Young, K. W. Guarini, and M. Ieong. ThreeDimensional Integrated Circuits. IBM Journal of Research and Development, 50(4/5):491–506, 2006. [68] Yong Liu, Wing Luk, and Daniel Friedman. A Compact Low-Power 3D I/O in 45nm CMOS. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), February 2012. [69] B. Swinnen, W. Ruythooren, P. D. M. L. Bogaerts, L. Carbonell, K. D. Munck, B. Eyckens, S. Stoukatch, Tezcan, D. Sabuncuoglu, Z. Tokei, J. Vaes, J. 
V. Aelst, and E. Beyne. 3D Integration by Cu-Cu Thermo-Compression Bonding of Extremely Thinned Bulk-Si Die Containing 10um Pitch Through-Si Vias. In International Electron Devices Meeting (IEDM), December 2006. [70] Igor Loi, Subhasish Mitra, Thomas H. Lee, Shinobu Fujita, and Luca Benini. A Low-Overhead Fault Tolerance Scheme for TSV-Based 3D Network on Chip Links. In Proceedings of the IEEE/ACM Design, Automation and Test in Europe (DATE), November 2008. [71] M. Laisne et al. System and Methods Utilizing Redundancy in Semiconductor Chip Interconnects. In US patent 20100060310A1, March 2010. 151 BIBLIOGRAPHY [72] A. Hsieh et al. TSV Redundancy: Architecture and Design Issues in 3D IC. In Proceedings of the IEEE/ACM Design, Automation and Test in Europe (DATE), March 2010. [73] Kevin Lam, Larry R. Dennison, and William J. Dally. Simultaneous Bidirectional Signalling for IC Systems. In Proceedings of the IEEE International Conference on Computer Design (ICCD), September 1990. [74] Jae-Toon Sim et al. A 1-Gb/s Bidirectional I/O Buffer Using the Current-Mode Scheme. IEEE Journal of Solid-State Circuits (JSSC), 34(4):529–535, 1999. [75] H. Wilson et al. A Six-Port 30-GB/s Nonblocking Router Component Using Point-to-Point Simultaneous Bidirectional Signaling for High-Bandwidth Interconnects. IEEE Journal of Solid-State Circuits (JSSC), 36(12):1954–1963, 2001. [76] Wilson et al. A 4-Gb/s/pin Low-Power Memory I/O Interface Using 4-Level Simultaneous Bi-Directional Signaling. IEEE Journal of Solid-State Circuits (JSSC), 40(1):89–101, 2005. [77] Mahmut E. Sinangil et al. A Reconfigurable 8T Ultra-Dynamic Voltage Scalable (U-DVS) SRAM in 65nm CMOS. IEEE Journal of Solid-State Circuits (JSSC), 11:3163–3173, 2009. [78] T. Kobayashi et al. A Current-Controlled Latch Sense Amplifier and a Static Power-Saving Input Buffer for Low-Power Architecture. IEEE Journal of SolidState Circuits (JSSC), 28:523–527, 1993. [79] Futoshi Furuta and Kenichi Osada. 
6Tbps/W, 1Tbps/mm, 3D Interconnect using Adaptive Timing Control and Low Capacitance TSV. In IEEE International 3D Systems Itegration Conference(3DIC), January 2012. 152 BIBLIOGRAPHY [80] M. M. K. Martin, M. D. Hill, and D. A. Wood. Token Coherence: Decoupling Performance and Correctness. In Proceedings of the IEEE/ACM International Symposium on Computer Architecture (ISCA), June 2003. [81] Niket Agarwal, Li-Shiuan Peh, and N. K. Jha. In-Network Snoop Ordering (INSO): Snoopy Coherence on Unordered Interconnects. In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA), February 2009. [82] Pat Conway et al. Cache Hierarchy and Memory Subsystem of the AMD Opteron Processor. IEEE Micro, 30:16–29, 2010. [83] Arun Raghavan et al. Token Tenure: PATCHing Token Counting Using Directory-Based Cache Coherence. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO), November 2008. [84] Bhavya K. Daya, Chia-Hsin Owen Chen, Suvinay Subramanian, Woo-Cheol Kwon, Sunghyun Park, Tushar Krishna, Jim Holt, Anantha P. Chandrakasan, and Li-Shiuan Peh. SCORPIO: A 36-Core Research Chip Demonstrating Snoopy Coherence on a Scalable Mesh NoC with In-Network Ordering. In Proceedings of the IEEE/ACM International Symposium on Computer Architecture (ISCA), June 2014. [85] Jayanth Srinivasan, Sarita V. Adve, Pradip Bose, and Jude A. Rivers. The impact of Technology Scaling and Lifetime Reliability. In International Conference on Dependable Systems and Networks (DSN), June-July 2004. [86] Konstantinos Aisopos, Chia-Hsin Owen Chen, and Li-Shiuan Peh. Enabling System-Level Modeling of Variation-Induced Faults in Networks-on-Chip. In Proceedings of the IEEE/ACM Design Automation Conference (DAC), October 2011. 153 BIBLIOGRAPHY [87] K. Aisopos, A. DeOrio, L.-S. Peh, and V. Bertacco. ARIADNE: Agnostic Reconfiguration In A Disconnected Network Environment. 
In Proceedings of the IEEE/ACM International Conference on Parallel Architectures and Compilation Techniques (PACT), October 2011. 154 BIBLIOGRAPHY 155 BIBLIOGRAPHY 156