Towards Low-Power yet High-Performance
Networks-on-Chip
by
Sunghyun Park
B.S. in Korea Advanced Institute of Science and Technology (2009)
S.M. in Massachusetts Institute of Technology (2011)
Submitted to the
Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
September 2014
© Massachusetts Institute of Technology 2014. All rights reserved.
Author
Department of Electrical Engineering and Computer Science
September 2, 2014
Certified by
Li-Shiuan Peh
Professor of Electrical Engineering and Computer Science
Thesis Supervisor
Certified by
Anantha P. Chandrakasan
Joseph F. and Nancy P. Keithley Professor of Electrical Engineering
Thesis Supervisor
Accepted by
Leslie A. Kolodziejski
Chair, Department Committee on Graduate Students
Towards Low-Power yet High-Performance
Networks-on-Chip
by
Sunghyun Park
Submitted to the Department of Electrical Engineering and Computer Science
on September 2, 2014, in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy in Electrical Engineering and Computer Science
Abstract
A network-on-chip (NoC), the de-facto communication backbone in manycore processors, consumes a significant portion of total chip power, competing against the
computation cores for the limited power and thermal budget. On the other hand,
overall system performance of manycore chips increasingly relies on on-chip latency
and bandwidth as core counts scale. This thesis aims to design low-power yet high-performance NoCs through circuit and microarchitecture co-design contrary to the
traditional approaches where NoCs sacrifice latency and/or bandwidth for low-power
operation; then demonstrate such design concepts through test chip prototyping,
enabling detailed measurements for rigorous analysis of the pros and cons of the
proposed NoCs.
The thesis starts with a 4×4 mesh NoC chip prototype that tries to simultaneously optimize energy, latency and throughput for all kinds of traffic (unicasts,
multicasts and broadcasts). Its extensive experimental results make it possible to accurately analyze energy/performance benefits and timing/area overheads of the virtually
bypassed, multicast-optimized router design; energy savings, area overheads and reduced reliability of the clocked low-swing datapath circuits; and a power gap between
simulated estimations and measurement results.
Next demonstrated is a link test chip of two clockless low-swing repeater designs, a
self-resetting logic repeater (SRLR) optimized for transmission energy and a voltage-locked repeater (VLR) for transmission delay. This second chip prototype shows that
the clockless, single-ended low-swing signaling of SRLRs armed with variation-robust
circuit techniques has lower energy and smaller area than clocked, differential low-swing signaling. Featuring lower delay than full-swing repeaters, VLRs provide
the fundamental building block to the single-cycle reconfigurable NoC that enables potential power saving at architecture level through single-cycle multi-hop asynchronous
link traversal on dynamically configurable routes.
The last one-third of this thesis explores a 3D-IC chip prototype of a through-silicon via (TSV) interconnect that can support simultaneously bi-directional (SBD)
signaling. While TSVs, as 3D-IC NoC links, offer an appealing solution to manycore
architectures that require huge off-die bandwidth, existing TSV technologies impose
considerable power and area overheads (using spare TSVs) to improve reliability.
The proposed SBD TSV circuit shows better energy efficiency and smaller area than
unidirectional TSVs, thus providing reliable 3D signaling within tight power/silicon
budget. Such SBD signaling also enables configurable off-die bandwidth, and hence,
can be the basis of a bandwidth-adaptive 3D NoC that efficiently supports highly
dynamic traffic on manycore chips.
Thesis Supervisor: Li-Shiuan Peh
Title: Professor of Electrical Engineering and Computer Science
Thesis Supervisor: Anantha P. Chandrakasan
Title: Joseph F. and Nancy P. Keithley Professor of Electrical Engineering
To my parents,
Doosoo Park and Soonsil Shin
Acknowledgments
The LORD is my shepherd, I shall not be in want. He makes me lie down in green
pastures, he leads me beside quiet waters, he restores my soul. He guides me in paths
of righteousness for his name’s sake. Even though I walk through the valley of the
shadow of death, I will fear no evil, for you are with me; your rod and your staff, they
comfort me.
Psalm 23:1-4
First and foremost, I give thanks to God for allowing me to have the best advisors,
Professor Li-Shiuan Peh and Professor Anantha Chandrakasan. Their complementary research interests and advising styles have made this possible; while Li-Shiuan’s
accurate yet wide-ranging comprehension of Networks-on-Chip (NoCs) has enabled
me to freely play in the playground of NoCs without worrying about my wrong assumptions and technical mistakes, Anantha’s sharp insight and extensive experience
in low-power digital circuit design has allowed my rough ideas to be well-positioned
and shaped in detail. Indeed, being co-advised by Li-Shiuan and Anantha was the
best opportunity that I have ever been given at MIT in that I was able to explore
unique research questions between circuit and architecture under their excellent guidance. Even from the viewpoint of humanity, they are truly great mentors. I sincerely
thank Anantha and Li-Shiuan for being my advisors.
It is my honor and pleasure to have Professor Srinivas Devadas on my thesis
committee. I would like to thank him for his contributions to my PhD work. In fact, he helped me shape the thesis direction even before joining my thesis committee,
through the DARPA Angstrom Project and Research Qualifying Examination (RQE).
His comprehensive system-level view and objective standpoint on my work have motivated me to view my research from other angles. I deeply appreciate his spending
time and energy despite his busy schedule.
I would also like to express my appreciation to Professor Vladimir Stojanovic
for his feedback and suggestions as a member of my RQE committee. His standpoint on on-chip
interconnects (that differs from the angle of my advisors) widened my understanding
of scalable Networks-on-Chip. In addition, his distinguished work on physical modeling
of on-chip wires motivated me to investigate circuit-wire codesign.
I want to extend my appreciation to my MIT colleagues, who are always willing
to help me out. While all members of both of my research groups deserve my gratitude,
I owe special thanks to the following eight people: Tushar Krishna (NoC
architecture discussion), Masood Qazi (variation-robust circuit design and analysis),
Owen Chen (NoC architecture discussion), Gilad Yahalom (3D-IC test chip implementation), Arun Paidimarri (chip measurement), Sunghyuk Lee (PCB design), Bhavya
Daya (mesh NoC chip comparison) and SungWon Jung (high-frequency clocking circuit design).
I also owe thanks to our friendly administrative staff for all the help through
my PhD years at MIT: Maria Rebelo (CSAIL administration), Margaret Flaherty
(MTL administration), Janet Fischer and Alicia Duarte (EECS Department administration).
I am proud to acknowledge the support of the following companies for my research projects: MediaTek (3D-IC test chip fabrication), Samsung (financial support
during my entire PhD years) and Freescale (flip chip packaging). I would like to
thank Dr. Alice Wang for her excellent management at MediaTek to enable successful completion of our 3D-IC project. I also thank Mr. Stacy Ho for kindly taking
care of my MediaTek internship at the Woburn site. In particular, I want to express
my deepest gratitude to Samsung Scholarship not only for financially supporting my
PhD program but also for giving me an opportunity to become a part of their superior community. It was my great honor to serve as Jar-Chi-Wii-Won-Chang at 2013
Samsung Scholarship Academic Camp in Yosemite National Park.
No words can do justice to express how deeply grateful I am to my family members.
I truly appreciate my lovely penguin, Seonghee Nam, for always being with me as my
wife and as my best friend. Without her devoted support, I would not have completed
my PhD journey. I also thank my adorable little girl, Seohee Park, and my brave
little boy, Seungwoo Park, for giving me the strongest motivation to finish my school
life. Indeed, their existence itself is a blessing to me every day. I should not forget to
thank my sister, Haejin Park, for her trust and encouragement. I am always proud
of her career as a professor in a medical school. I should also thank my parents-in-law, I-hyun Nam and Soonae Song, for treating me like their son. Finally, reserving
the best for last, I would like to express my most heartfelt gratitude to my parents,
Doosoo Park and Soonsil Shin, for their unconditional love and trust. I thank you, I
respect you, I love you, my dad and my mom.
List of Acronyms
3D-IC   3-Dimensional Integrated Circuit
BER     Bit Error Rate
BW      Buffer Writing (in Router Pipeline)
CAD     Computer Aided Design
CMOS    Complementary Metal Oxide Semiconductor
CMP     Chip MultiProcessor
DM      (Pulse) DeModulator
DOR     Dimension Ordered Routing
DRC     Design Rule Check
ECC     Error Correction Code
F2B     Face to Back (Through Silicon Via)
F2F     Face to Face (Through Silicon Via)
FIFO    First In First Out (Buffers)
I/O     Input/Output
IP      Intellectual Property
LA      LookAhead generation (in Router Pipeline)
LT      Link Traversal (in Router Pipeline)
MC      Message Class (of Virtual Channels)
MMS     Multiscale Modeling and Simulation (SoC Application)
MOSFET  Metal Oxide Semiconductor Field Effect Transistor
MPSoC   MultiProcessor System on Chip
mSA     multiple (Crossbar) Switch Allocation
NIC     Network Interface Circuit
NMOS    N-channel MOSFET
NoC     Network on Chip
NRC     Next Route Computation
PDK     Processor Design Kit
PE      Processor Element
PIP     Personal Interest Project (SoC application)
PM      Pulse Modulator
PMOS    P-channel MOSFET
PRBS    Pseudo Random Binary Sequence
PVT     Process, Voltage and Temperature
QoS     Quality of Service
RC      Resistance-Capacitance
RSD     Reduced Swing Driver
Rx      Receiver
SA      (Crossbar) Switch Allocation (in Router Pipeline)
SBD     Simultaneously BiDirectional
Si      Silicon
SMART   Single cycle Multi hop Asynchronous Repeated Traversal
SoC     System on Chip
SOI     Silicon On Insulator
SRLR    Self Resetting Logic Repeater
TSV     Through Silicon Via
Tx      Transmitter
ST      (Crossbar) Switch Traversal (in Router Pipeline)
VA      Virtual channel Allocation (in Router Pipeline)
VC      Virtual Channel
VLR     Voltage Locked Repeater
VOPD    Video Object Plane Decoder
WLAN    Wireless Local Area Network
Contents
1 Introduction
1.1 Research Motivation
1.2 Mesh Network-on-Chip (NoC)
1.3 Rethinking Router Microarchitecture
1.4 Thesis Contributions and Overview
2 Towards the Theoretical Limits of a Mesh NoC
2.1 Theoretical Mesh NoC Limits
2.2 Related Work: Existing Mesh NoC Chips
2.3 Chip Design and Fabrication
2.3.1 Towards Theoretical Latency Limits
2.3.2 Towards Theoretical Throughput Limits
2.3.3 Towards Theoretical Energy Limits
2.4 Evaluation
2.4.1 Latency, Throughput and Energy
2.4.2 Virtual bypassing
2.4.3 Low-Swing Signaling
2.4.4 Power Modeling and Estimation
2.5 Chapter Summary
3 Low-Swing Datapath for Reconfigurable NoCs
3.1 Background: Reconfigurable NoCs
3.2 Introduction: Clockless Low-Swing Repeaters
3.3 Related Work: Existing Low-Swing Links
3.4 Self-Resetting Logic Repeater (SRLR)
3.4.1 SRLR Circuit Design
3.4.2 Test Chip Fabrication and Measurement
3.5 Voltage-Locked Repeater (VLR)
3.5.1 VLR Circuit Design
3.5.2 Test Chip Fabrication and Measurement
3.6 Chapter Summary
4 Energy and Area Efficient TSV Signaling for 3D-IC NoCs
4.1 Chapter Introduction
4.2 Design Considerations of SBD TSV Links
4.3 SBD Transmitter Design
4.3.1 Case 1: EN=0 (no data to be transmitted)
4.3.2 Case 2: EN=1 and CLK=1 (first half clock cycle)
4.3.3 Case 3: EN=1 and CLK=0 (next half clock cycle)
4.4 Rx Design: Switched Dual-Tree Sense Amplifier
4.4.1 Switched Scheme for Low Sensing Delay
4.4.2 Dual-Tree Sense Amplifier for Reliable SBD Signaling
4.5 Prototyping and Testing of TSV Interconnects
4.5.1 Maximum Data Rate
4.5.2 Energy Efficiency
4.5.3 Area Comparison
4.5.4 Comparison with Other Low-Power TSV Circuits
4.6 Chapter Summary
5 Conclusions and Future Work
5.1 Thesis Summary
5.1.1 Regular Mesh Network in CMPs
5.1.2 Low-Swing Datapath of Configurable Meshes in SoCs
5.1.3 Towards Low-Cost 3D Meshes in 3D-ICs
5.2 Low-Swing Signaling Reliability
5.3 Future Projects
5.3.1 Broadcast-Intensive Cache Coherent Protocols
5.3.2 Error-Tolerant NoCs with Low-Swing Links
5.3.3 Bandwidth-Adaptive 3D NoCs
List of Figures
1-1 Simplified router microarchitecture for 2D mesh NoCs.
1-2 Detailed router microarchitecture and pipeline of a packet-switched, input-buffered VC NoC.
1-3 Ideal point-to-point interconnect only through a metal wire.
1-4 Repeated interconnect for lower wire delay (starting point of wire sharing).
1-5 Input wire sharing through a demultiplexer.
1-6 Output wire sharing through a multiplexer.
1-7 Input and output wire sharing through a demultiplexer and a multiplexer.
1-8 Input and output wire sharing through a crossbar switch.
1-9 Efficient wire sharing with a SA logic and buffers.
1-10 Packet-switched, input-buffered VC router microarchitecture.
2-1 Latency calculation example for broadcast traffic on a k×k mesh network.
2-2 Broadcast example and overview of the fabricated 4×4 mesh NoC.
2-3 Die photo and design layout of the 4×4 mesh NoC and stand-alone low-swing crossbar switch connected to longer links (1mm and 2mm wires).
2-4 64bits 5×5 tri-state RSD-based matrix crossbar switch and link circuitry.
2-5 Proposed router microarchitecture and pipeline.
2-6 Network performance evaluation with mixed traffic at 1GHz.
2-7 Network performance evaluation with broadcast-only traffic at 1GHz.
2-8 Measured network power reduction at 653Gb/s at 1GHz (A: full-swing unicast network, B: low-swing unicast network, C: low-swing broadcast network without virtual buffer bypassing, D: low-swing broadcast network with virtual buffer bypassing).
2-9 1mm link energy efficiency of full-swing and RSD-based signaling.
2-10 2mm link energy efficiency of full-swing and RSD-based signaling.
2-11 Low-swing signaling trade-off between reliability and energy efficiency.
2-12 Comparison of power estimates with measurements (A: ORION 2.0 simulations, B: Post-layout simulations, C: Measured results).
3-1 Single-cycle reconfigurable NoC [1] with SMART links (red bold lines) where its backbone mesh network is reconfigured at run time.
3-2 10mm SRLR-based link for the mesh-based reconfigurable NoC where the local router-to-router distance is 1mm.
3-3 Proposed SRLR circuit and its simulated waveforms.
3-4 1000-run Monte-Carlo simulation results that show the impact of each variation-robust design technique.
3-5 Process variation robust SRLR circuit with (1) an alternating delay cell design, (2) NMOS-based drivers and (3) an adaptive swing voltage scheme.
3-6 Die photograph of the SRLR test chip in 45nm SOI CMOS that includes an on-chip test circuit and an on-chip clocking circuit.
3-7 1cm link traversal (LT) energy versus bandwidth density.
3-8 Proposed clockless low-swing voltage-locked repeater (VLR) for single-cycle multi-hop link traversal.
3-9 Simulated waveforms at 6.8Gb/s: (a) original input data and (b) VLR’s low-swing signaling at node X.
3-10 1bit 10mm VLR-based on-chip link and its equivalent full-swing link fabricated on the same die as SRLRs in 45nm SOI CMOS.
3-11 SMART NoC performance across SoC applications. Reference: [1].
3-12 SMART NoC power breakdown across SoC applications. Reference: [1].
4-1 Example of hop count reduction through greater spatial locality in 3D-ICs. The reduced hop counts translate into lower interconnect delay and energy.
4-2 Uni-directional TSV signaling versus proposed SBD TSV signaling at the same clock frequency, e.g. 5GHz in this example.
4-3 4 voltage-level SBD signaling with weaker driving strength required (pros) and smaller noise margin between SBD signaling symbols (cons).
4-4 3 voltage-level SBD signaling with bigger noise margin between SBD signaling symbols (pros) and stronger driving strength required (cons).
4-5 Upward die-to-die static current path through a low resistance TSV: bottom die PMOS → micro bump → landing pad → top die NMOS.
4-6 Downward die-to-die static current path through a low resistance TSV: top die PMOS → landing pad → micro bump → bottom die NMOS.
4-7 Proposed SBD TSV Tx circuits: a simple NAND-enabled inverter on a bottom die and a half-clocked driver on a top die.
4-8 Tx circuit connectivity of Case 1 (EN=0). No die-to-die current path is formed when there is no data to be transmitted through a TSV.
4-9 Tx circuit connectivity of Case 2 (EN=1 and CLK=1, first half clock cycle) where a TSV is driven by a bottom driver only, consuming dynamic energy only.
4-10 Tx circuit connectivity of Case 3 (EN=1 and CLK=0, next half clock cycle) where a TSV is driven by a bottom driver and a top driver together, forming a static current path through a TSV for the three voltage-level SBD signaling. The coupling capacitor, which acts as a high-pass filter, compensates the bandwidth loss without adding to inter-die static current.
4-11 TSV voltage transitions of uni-directional TSVs versus our SBD TSV.
4-12 While a floating TSV during the first half clock period also enables 50% lower static die-to-die current, such a floating state incurs bandwidth loss at 00 → 11 bi-directional data transition.
4-13 The coupling capacitor on a top die driver enables shorter switching time when a TSV is driven to VDD/2 by a pull-up PMOS and a pull-down NMOS together during the second half clock period.
4-14 Reduced symbol noise margin of SBD signaling due to process variation. When designing 3D-IC circuits, we should consider die-to-die variation mismatch as well as on-die variation mismatch described in (c).
4-15 Switched dual-tree sense amplifiers for variation-robust SBD signaling and low sensing delay.
4-16 Overall circuit implementation of the proposed TSV SBD signaling. Two types of sense amplifiers, a PMOS-input and an NMOS-input sense amplifier, are switched on and off according to the transmitted data (txIN) for low sensing delay.
4-17 Top die photograph of a 2-tier 3D-IC test chip fabricated with a 28nm Low-Power (LP) CMOS process.
4-18 Bottom die photograph of a 2-tier 3D-IC test chip fabricated with the same process as a top die, 28nm LP CMOS.
4-19 Four types of TSV interconnects implemented in a 3D-IC test chip: two uni-directional TSVs (baseline #1); an inverter-based SBD TSV (baseline #2); a proposed SBD TSV without a coupling capacitor (baseline #3); and a completed design (proposed SBD TSV).
4-20 Measured maximum die-to-die bandwidth comparison at 1.05V between uni-directional TSVs (baseline #1) and proposed SBD TSVs.
4-21 Maximum bi-directional bandwidth of our fabricated F2B TSV interconnects. The proposed SBD signaling can deliver up to 9.1Gb/s/TSV bi-directional data (i.e. 4.55GHz maximum clock frequency) at 1.05V.
4-22 Four bi-directional input data sets for energy comparison.
4-23 Measured TSV interconnect energy efficiency over various input data sets at 9.1Gb/s bi-directional data rate (i.e. 4.55GHz clock frequency) at 1.05V. The proposed SBD signaling circuits consume 10.3-31.1% less energy than uni-directional TSVs.
4-24 Normalized area comparison of the fabricated TSV signaling circuits. While baseline #1 includes two TSV landing pads, other three SBD TSV interconnects have only one TSV landing pad.
5-1 Lower voltage swing enables higher energy efficiency, but results in higher signaling error probability (hence bigger system overheads). This figure is identical to Figure 2-11.
List of Tables
2.1 Theoretical limits of a k×k mesh NoC for unicast and broadcast traffic.
2.2 Comparison of mesh NoC chip prototypes.
2.3 Critical path analysis results.
2.4 Area comparison with full-swing signaling.
3.1 Comparison of silicon-proven low-swing on-chip interconnects.
3.2 Maximum hop counts in a single cycle at high data rate.
3.3 Maximum hop counts in a single cycle at low data rate.
4.1 Comparison of Energy-efficient Face-to-Back TSV Interconnects (CMOS-on-CMOS).
1 Introduction
This thesis challenges the conventional wisdom that NoC design must trade
off latency, bandwidth and energy, which leads either to poor performance in low-power NoCs or to
high performance at unacceptable network power.
1.1 Research Motivation
Moore’s law scaling and diminishing performance returns of complex uniprocessor
chips have led to the advent of manycore systems such as chip multiprocessors (CMPs)
and multiprocessor systems-on-chip (MPSoCs). The scalability of these manycore
chips relies highly on the on-chip communication fabric connecting the cores/IPs. An
ideal communication fabric would incur only metal-wire delay and energy between
the source and destination node. However, there is insufficient wiring for dedicated
global point-to-point wires between all nodes [2], and hence, a network-on-chip (NoC)
with routers that multiplex wires across traffic flows is becoming the de-facto communication fabric in manycore chips [3].
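The wiring argument above can be made concrete with a rough link count. The sketch below is only an illustration under simplifying assumptions that are not from the thesis: links are counted as unidirectional, wire length and width are ignored, and the function names are hypothetical.

```python
def point_to_point_links(n_nodes: int) -> int:
    """Dedicated unidirectional links needed to connect every ordered node pair."""
    return n_nodes * (n_nodes - 1)


def mesh_links(k: int) -> int:
    """Unidirectional router-to-router links in a k x k 2D mesh (NIC links excluded)."""
    # k rows and k columns each contain (k - 1) neighbouring router pairs,
    # and every pair is connected in both directions.
    return 2 * 2 * k * (k - 1)


for k in (4, 8, 16):
    n = k * k
    print(f"{n:3d} nodes: point-to-point = {point_to_point_links(n):5d}, "
          f"mesh = {mesh_links(k):3d}")
```

The point-to-point count grows quadratically with the number of nodes, while the mesh count grows roughly linearly, which is one way to see why shared, router-multiplexed wires become attractive as core counts scale.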
These on-chip routers, however, impose substantial power overhead. For instance,
36% and 39% of entire chip power are consumed by such NoC routers at the peak
network throughput in MIT’s Raw [4] and Intel’s TeraFLOPS [5], respectively. Since
each chip cannot cross its power wall, this power-hungry network competes against
the cores/IPs, leading to lower power and thermal budget for actual computation
work. On the other hand, overall manycore chip performance increasingly depends
on NoC performance such as bandwidth and latency with a growing number of on-chip components [3, 6, 7]. Therefore, a low-power yet high-performance NoC is sorely
needed to allow more cores/IPs to be integrated on one die.
Designing a low-power NoC without the loss of network performance is almost
always a challenging task. To take a couple of easy examples (other challenges will be
discussed in Section 1.3), link drivers with a lower power supply voltage, i.e. simple
low-swing links, enable both dynamic and leakage power saving but at the cost of
longer wire propagation delay, resulting in longer latency and lower bandwidth in the
network. While smaller buffer size at a router also makes NoCs energy-efficient, it
leads to poor link utilization (hence lower bandwidth). Due to these design challenges,
prior NoC chips [4, 5, 8, 9, 10, 11] sacrificed network performance for acceptable NoC
power consumption, or endured substantial network power overheads to meet the
aggressive performance requirements. This thesis seeks to break such conventional
trade-offs to pave the way to the low-power yet high-performance NoC.
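The supply-voltage example above can be illustrated with a first-order textbook model. This is a minimal sketch and not the analysis used in this thesis: it assumes a full-rail driver whose switching energy is C·V², and a delay that follows the alpha-power law, proportional to V / (V − Vt)^α. The threshold voltage (0.3 V) and α (1.3) are placeholder values chosen only for illustration.

```python
def dynamic_energy(capacitance: float, supply_v: float) -> float:
    """First-order switching energy of a full-rail driver: E = C * V^2."""
    return capacitance * supply_v ** 2


def relative_delay(supply_v: float, threshold_v: float = 0.3, alpha: float = 1.3) -> float:
    """Alpha-power-law delay estimate, proportional to V / (V - Vt)^alpha.

    threshold_v and alpha are illustrative placeholders, not measured values.
    """
    if supply_v <= threshold_v:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return supply_v / (supply_v - threshold_v) ** alpha


# Normalise both metrics to a 1.0 V supply to expose the trade-off.
for v in (1.0, 0.8, 0.6):
    energy = dynamic_energy(1.0, v) / dynamic_energy(1.0, 1.0)
    delay = relative_delay(v) / relative_delay(1.0)
    print(f"V = {v:.1f} V: energy x{energy:.2f}, delay x{delay:.2f}")
```

Under these assumptions, lowering the driver supply reduces energy quadratically while delay grows, which is the trade-off that simple low-swing links face.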
1.2 Mesh Network-on-Chip (NoC)
A mesh network, which is formed by laying out a regular grid in each dimension
and adding routers at the grid intersections, maps readily to the planar layout that
current CMOS technology requires (e.g. 2D meshes on a single Si wafer or 3D meshes
in vertically-stacked 3D-ICs). Thanks to such planar regularity and scalability, a
mesh is the most widely-used NoC topology for high-performance manycore chips [4,
5, 8, 9, 10, 11, 12]. In addition, unlike indirect, multi-stage NoC topologies such as
Clos or Butterflies [13], a mesh supports the locality present in many applications,
allowing nearby traffic to be transported at lower delay and energy.
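The locality property can be sketched with a hop count on a 2D mesh. The example assumes minimal routing (for instance dimension-ordered routing), so the number of router-to-router hops equals the Manhattan distance between the source and destination coordinates; the function names and the coordinate convention are hypothetical.

```python
from itertools import product
from typing import Tuple

Coord = Tuple[int, int]


def mesh_hops(src: Coord, dst: Coord) -> int:
    """Router-to-router hops between two (x, y) nodes under minimal routing."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def average_hops(k: int) -> float:
    """Average hop count over all ordered pairs of distinct nodes in a k x k mesh."""
    nodes = list(product(range(k), repeat=2))
    pairs = [(s, d) for s in nodes for d in nodes if s != d]
    return sum(mesh_hops(s, d) for s, d in pairs) / len(pairs)


print("neighbour to neighbour:", mesh_hops((0, 0), (1, 0)))
print("corner to corner (4x4):", mesh_hops((0, 0), (3, 3)))
print("uniform-random average (4x4):", round(average_hops(4), 3))
```

Traffic between neighbouring nodes crosses a single link, whereas corner-to-corner traffic on a 4×4 mesh crosses six, so applications with local communication see lower delay and energy.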
Figure 1-1: Simplified router microarchitecture for 2D mesh NoCs.
Figure 1-1 shows a simplified 5-port 2D mesh router microarchitecture composed
of four main components: input buffers, control logic, a crossbar switch and links.
The input buffers store incoming data till they are sent to the next router. The
control logic determines when data proceed through the router pipeline and sets up
the crossbar switch. The crossbar switch physically moves data from input ports to
output ports, followed by links that forward output port data to the next router.
These actions can be pipelined to improve throughput, depending on operating clock
frequency, process technology and specific logic implementation.
Let us now look at how a packet-switched NoC works in a manycore processor.
Each core communicates with other cores by sending and receiving messages through
a network interface controller (NIC) that connects the core to a router (hence the
network). Before a message is injected into the network, it is first segmented into
packets that are then divided into fixed-length flits, short for flow-control units. A
packet consists of a head flit that contains the destination address, body flits, and a
tail flit that indicates the end of a packet. If the amount of information the packet
carries is little, single-flit packets are also possible, i.e. where a flit is both the head
and tail flit. Because only the head flit carries the destination information, all flits of
a packet must follow the same route through the network. Virtual channels (VCs),
logically-separate input buffers, allow multiple data streams to share a physical channel (link wires) by interleaving flits from different packets. Such decoupled input
buffers can be utilized to improve throughput by eliminating head-of-line blocking; to prevent deadlocks in the network without deadlock-free routing algorithms; or to
offer quality of service (QoS) for system level optimization [13, 14].
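The message, packet and flit hierarchy described above can be sketched as follows. This is a hedged illustration rather than the NIC implementation of the chips in this thesis: the `Flit` class, the `packetize` function, the zero padding and the byte sizes are all assumptions, and the destination is kept as a separate field instead of being encoded in the head flit bits.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Flit:
    kind: str            # "head", "body", "tail" or "head_tail" (single-flit packet)
    packet_id: int
    dest: Optional[int]  # only the head flit carries the destination
    payload: bytes


def packetize(message: bytes, dest: int, packet_bytes: int, flit_bytes: int) -> List[List[Flit]]:
    """Segment a message into packets, then into fixed-length flits."""
    packets = []
    for pid, start in enumerate(range(0, len(message), packet_bytes)):
        chunk = message[start:start + packet_bytes]
        pieces = [chunk[i:i + flit_bytes] for i in range(0, len(chunk), flit_bytes)]
        flits = []
        for i, piece in enumerate(pieces):
            if len(pieces) == 1:
                kind = "head_tail"
            elif i == 0:
                kind = "head"
            elif i == len(pieces) - 1:
                kind = "tail"
            else:
                kind = "body"
            # Pad the last piece so that every flit has the same length.
            payload = piece.ljust(flit_bytes, b"\x00")
            flits.append(Flit(kind, pid, dest if i == 0 else None, payload))
        packets.append(flits)
    return packets


for packet in packetize(bytes(40), dest=5, packet_bytes=24, flit_bytes=8):
    print([flit.kind for flit in packet])
```

Because body and tail flits carry no destination here, a router can only forward them by following the route already set up by their head flit, which mirrors the same-route constraint stated above.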
Figure 1-2 shows an example of the packet-switched, input-buffered VC router microarchitecture and pipeline. One distinguishing feature of this design is the router-level multicast capability through multiport switch allocation (mSA) by which multicasts do not require multiple unicast packets to be injected. At the first pipeline stage,
flits get buffered (BW) and each input port chooses 1 output port request (mSA-I)
with a round-robin logic that guarantees fair and starvation-free arbitration. Since
multicast flits can require multiple output ports, the request is a 5b vector. The next
router VC is selected (VA) from a free VC queue in this stage, too. These 3 independent operations are executed in parallel without decreasing operating frequency.
At the second stage, output port requests for the next routers are computed (NRC)
for the winners of mSA-I, and concurrently, a matrix arbiter at each output port
grants the crossbar ports to the input port requests (mSA-II). Multicast requests get
granted multiple output ports. In the third and fourth stages, flits physically traverse
a crossbar switch (ST) and a link (LT). It is notable that out of all these actions,
only the last two actions (ST and LT) actually move the flits toward the destination.
Throughout the thesis, we will refer to this design (Figure 1-2) as a state-of-the-art
router microarchitecture or a comparison baseline after slight modifications required
for fair comparison.
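The two-step switch allocation described above can be sketched in software. This is a behavioural approximation under stated assumptions, not the circuit of this thesis: requests are modelled as sets of output port indices rather than 5b vectors, the class and function names are hypothetical, and the handling of a multicast that wins only some of its requested ports is left open.

```python
from typing import Dict, List, Optional, Set, Tuple

NUM_PORTS = 5  # N, E, S, W and the local NIC port


class RoundRobinArbiter:
    """Grants one requester per call, rotating priority past the last winner."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.next_index = 0

    def grant(self, requests: List[bool]) -> Optional[int]:
        for offset in range(self.size):
            i = (self.next_index + offset) % self.size
            if requests[i]:
                self.next_index = (i + 1) % self.size
                return i
        return None


class MatrixArbiter:
    """Least-recently-served arbiter built on a pairwise priority matrix."""

    def __init__(self, size: int) -> None:
        self.size = size
        # beats[i][j] is True while requester i has priority over requester j.
        self.beats = [[i < j for j in range(size)] for i in range(size)]

    def grant(self, requests: List[bool]) -> Optional[int]:
        for i in range(self.size):
            if requests[i] and all(
                self.beats[i][j] for j in range(self.size) if j != i and requests[j]
            ):
                for j in range(self.size):
                    if j != i:  # the winner drops to the lowest priority
                        self.beats[i][j] = False
                        self.beats[j][i] = True
                return i
        return None


def switch_allocate(
    vc_requests: List[List[Set[int]]],
    vc_arbiters: List[RoundRobinArbiter],
    port_arbiters: List[MatrixArbiter],
) -> Dict[int, Tuple[int, Set[int]]]:
    """Return {input port: (winning VC, granted output ports)} for one cycle."""
    # mSA-I: each input port picks one VC request with its round-robin arbiter.
    stage1 = {}
    for port in range(NUM_PORTS):
        vc = vc_arbiters[port].grant([bool(r) for r in vc_requests[port]])
        if vc is not None:
            stage1[port] = (vc, vc_requests[port][vc])
    # mSA-II: each output port grants itself to one requesting input port.
    granted: Dict[int, Set[int]] = {port: set() for port in stage1}
    for out in range(NUM_PORTS):
        wants = [p in stage1 and out in stage1[p][1] for p in range(NUM_PORTS)]
        winner = port_arbiters[out].grant(wants)
        if winner is not None:
            granted[winner].add(out)
    return {port: (stage1[port][0], granted[port]) for port in stage1}
```

A multicast request simply names several output ports, so it can be granted more than one crossbar port in the same cycle, as the text describes for mSA-II.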
Figure 1-2: Detailed router microarchitecture and pipeline of a packet-switched, input-buffered VC NoC. (Pipeline stage 1: BW, mSA-I and VA; stage 2: NRC and mSA-II; stage 3: ST; stage 4: LT.)
1.3 Rethinking Router Microarchitecture
The previous section (as well as most existing NoC literature) tried to explain a
NoC router microarchitecture by diving directly into the state-of-the-art design followed by a description of individual components. This top-down approach is able to
provide a quick way to understand the packet-switched VC router microarchitecture,
but often makes it difficult to accurately analyze router overheads due to a significant gap between an ideal interconnect and the packet-switched NoC. Indeed, design
optimization starts with finding its overheads and recognizing if such overheads are
avoidable or not. To explicitly reveal the router overheads, this section explores the
bottom-up approach through the step-by-step router building process, rethinking the
state-of-the-art NoC router microarchitecture.
As mentioned earlier, the ideal interconnect would be a point-to-point link provided by full network connectivity that delivers the highest possible throughput at
the lowest possible latency and energy. Figure 1-3 shows such a point-to-point wire
between a source (SRC) and a destination (DST). Note that even the
ideal interconnect is subject to physical constraints such as metal-wire delay and energy; accordingly,
the first step towards designing low-power, high-performance NoCs should be the optimization of the metal-wire delay and energy. In terms of trade-offs, a higher supply voltage
in link drivers (greater driving strength) enables shorter wire delay, but requires more
propagation energy. While a wider wire pitch reduces coupling capacitance between
wires (hence lower propagation delay and energy), it leads to lower wire density, resulting in poor bandwidth in the network. Metal-wire links comprise 17-39% of total
network power in mesh NoC chips [4, 5, 8], and form the unavoidable portion of network power as link power is a physical constraint. Furthermore, because wire performance
improves more slowly than gate performance under CMOS scaling, link
power will increase as a percentage relative to control and storage circuitry power as
process technology scales down [15, 16].
Figure 1-3: Ideal point-to-point interconnect only through a metal wire.
Figure 1-4: Repeated interconnect for lower wire delay (starting point of wire sharing).
Figure 1-5: Input wire sharing through a demultiplexer.
Figure 1-6: Output wire sharing through a multiplexer.
Figure 1-7: Input and output wire sharing through a demultiplexer and a multiplexer.
Figure 1-8: Input and output wire sharing through a crossbar switch.
Figure 1-9: Efficient wire sharing with a SA logic and buffers.
Figure 1-10: Packet-switched, input-buffered VC router microarchitecture.
When DST is too far from SRC, i.e. an interconnect wire is too long, intermediate
drivers in a long wire (known as repeaters) can significantly reduce the propagation
delay by converting the quadratic RC delay growth with the wire length into a linear
RC delay growth [17, 18]. If a wire is long enough, the extra delay and energy of
repeaters are easily offset by the long wire. Besides, repeaters offer a wire sharing
opportunity for free by decoupling a long wire into multiple wire segments. In other
words, the repeated interconnect shown in Figure 1-4 has 3 source and destination
pairs (SRC to a repeater, a repeater to DST and SRC to DST) while the point-to-point interconnect in Figure 1-3 has only one SRC and DST pair.
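The quadratic-to-linear effect of repeaters can be sketched with a first-order model. The sketch assumes the textbook distributed-RC (Elmore) delay of 0.5·r·c·L² for a bare wire and a fixed delay per repeater; it ignores repeater drive resistance and input capacitance, and all numeric values are hypothetical rather than taken from the test chips.

```python
def unrepeated_delay(r: float, c: float, length: float) -> float:
    """Distributed-RC (Elmore) delay of a bare wire: grows with length squared."""
    return 0.5 * r * c * length ** 2


def repeated_delay(r: float, c: float, length: float,
                   segments: int, t_repeater: float) -> float:
    """Delay when the wire is cut into equal segments, each driven by a repeater."""
    seg_len = length / segments
    return segments * (unrepeated_delay(r, c, seg_len) + t_repeater)


# Hypothetical per-mm wire parameters and repeater delay, for illustration only.
R_PER_MM = 1000.0      # ohm/mm
C_PER_MM = 0.2e-12     # F/mm
T_REPEATER = 20e-12    # s

for length_mm in (1, 2, 4, 8):
    bare = unrepeated_delay(R_PER_MM, C_PER_MM, length_mm)
    # One repeater per millimetre keeps every segment the same length.
    rep = repeated_delay(R_PER_MM, C_PER_MM, length_mm, length_mm, T_REPEATER)
    print(f"{length_mm} mm: bare {bare * 1e12:.0f} ps, repeated {rep * 1e12:.0f} ps")
```

Under these assumed numbers the repeated wire is slower at 1 mm and faster for longer wires, which mirrors the remark that repeater overhead is only offset once the wire is long enough.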
While point-to-point connection is always preferable between one SRC and DST
pair, full connectivity of all possible SRC and DST pairs is too expensive in terms of
global wiring area to be incorporated in manycore chips; building a fully-connected
NoC with higher node counts (e.g. 32 nodes) is practically impossible in existing
CMOS technology due to insufficient wiring [2]. Thus, wire sharing is inevitable for a
scalable on-chip communication fabric. Figure 1-5 shows input wire sharing through a
demultiplexer by which multiple destination nodes (DST1-5) share the wire segment
from SRC to the demultiplexer. Similarly, as shown in Figure 1-6, a multiplexer
enables output wire sharing where multiple source nodes (SRC1-5) share the tail
wire from the multiplexer to DST. The demultiplexer/multiplexer along with drivers
optimized for a single segmented wire can be viewed as the most primitive form of
NoCs in that wire sharing is the essence of NoCs.
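To give a feel for why full connectivity does not scale, the following sketch counts unidirectional links only. It is a simplification: it treats every link as equal and ignores wire length and channel width, both of which make the fully-connected case look even worse in practice. The function names are hypothetical.

```python
def fully_connected_links(nodes: int) -> int:
    """Unidirectional point-to-point links needed to connect every ordered pair."""
    return nodes * (nodes - 1)


def mesh_links(k: int) -> int:
    """Unidirectional router-to-router links in a k x k 2D mesh."""
    # k rows with (k - 1) horizontal hops each, the same vertically, two directions.
    return 2 * 2 * k * (k - 1)


for k in (4, 8):
    n = k * k
    print(f"{n} nodes: fully connected {fully_connected_links(n)}, mesh {mesh_links(k)}")
print(f"32 nodes: fully connected {fully_connected_links(32)}")
```

The fully-connected count grows quadratically with the node count while the mesh count grows linearly, which is the scaling argument behind wire sharing.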
We can also share both input and output wires by the combination of a demultiplexer and a multiplexer (Figure 1-7). This naive implementation, however, imposes
severe bandwidth loss since only one SRC and DST pair can communicate at a time.
A crossbar switch, which allows multiple SRCs to be connected to multiple DSTs,
can prevent such a bandwidth loss (Figure 1-8). The straightforward implementation
of the crossbar switch is to use redundant multiplexer arrays to provide all possible connections between multiple SRCs and DSTs [13]. Actually, this is the design
that most commercial synthesis CAD tools generate for crossbar switch functionality. While its static feature ensures stable operation, higher counts of SRCs and
DSTs cause substantial propagation delay and energy as compared to matrix crossbar switches whose transistor counts are much lower than the mux-based crossbar
switch [6]. On the other hand, the matrix crossbar switch requires careful circuit design on the matrix crosspoint switch which is typically implemented with pass gates,
transmission gates, or dynamic tri-state gates. This is because simple pass gates
and transmission gates do not work properly in an advanced CMOS process such as
silicon-on-insulator (SOI) technology and dynamic tri-state gates make their output
noise-sensitive. The crossbar switch consumes 15-33% of the entire network power
in mesh NoC chips [4, 5, 8], and if wire sharing is inevitable for scalable NoCs then
this crossbar power consumption is also unavoidable.
For efficient wire sharing, a crossbar switch requires its allocation logic as shown
in Figure 1-9. In addition to the bandwidth improvement, such an allocation logic can
support QoS or packet/message fairness, depending on system requirements. While
the switch allocation logic contributes a negligible portion of overall NoC power, its
computation delay can add significant packet latency. In terms of trade-offs, bandwidth-optimized allocation algorithms generally need more computation, thus resulting in
longer flit latency.
To further enhance wire sharing efficiency, a crossbar switch can incorporate
buffers to house flits when they cannot go forward right away to their destinations due
to contention. These buffers have a substantial impact on network bandwidth [19]
so that all existing NoC chips include buffers dedicated to their routers [4, 5, 8, 9,
10, 11, 12]. While flits can be buffered on the input ports or output ports, only an
input-buffered microarchitecture permits the single-ported memories that are more
power and area-efficient than multiple-port memories [13, 19]. For this reason, most
NoC router designs have buffers at the input ports, and as described in Figure 1-2, we
also selected the input-buffered router microarchitecture as a baseline in the thesis.
However, if the allocation rate of a crossbar switch is faster than the rate of output
wires (links in NoCs), output buffering allows more efficient wire sharing and higher
bandwidth, and hence, the output-buffered microarchitecture can be a better choice
for some systems. Router buffers consume 22-35% of total NoC power [4, 5, 8], and
add buffering delay to flit latency (1 clock cycle in general). Unlike links and a crossbar switch, the buffering power and delay are not unavoidable overheads. In fact,
minimizing buffer size and actual buffering counts at given target performance stands
at the center of NoC optimization.
Buffer allocation also has a huge impact on bandwidth of the packet-switched
NoC. If each input port allows only one buffer queue, head-of-line blocking can occur,
leading to poor link utilization. In other words, when a packet at the entrance of such
a single queue is blocked, it can stall other packets that are lined up behind the blocked
packet even if free buffers are available. Multiple buffer queues can resolve this head-of-line blocking, but assigning several physical queues at each input port is expensive
in terms of area and energy [13]. Alternatively, we can split one physical queue into
multiple logically-separate queues. These logically-separate buffer queues can share
one physical channel (hence the name virtual channels), so flits
from different packets can be interleaved. Similar to a crossbar switch, these virtual
channels (VCs) need their own allocation logic for efficient VC arbitration. Figure 1-10 shows the additional allocation logic for logically-separate buffers. In fact, the
design in this figure (a 5×5 crossbar switch along with its allocation logic, link drivers,
logically-separate input buffers and a VC allocation logic) is architecturally identical
to the 2D mesh router design described in the previous section. Insights developed
through this step-by-step router building process from the ideal interconnect to the
packet-switched VC router will be the basis of our approaches towards the low-power
yet high-performance NoC designs.
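Head-of-line blocking, and how logically-separate queues avoid it, can be seen in a toy example. The sketch below only counts how many queue heads are eligible to leave in one cycle; it does not model switch allocation or credits, and the port names and queue contents are made up for illustration.

```python
from collections import deque
from typing import Deque, List, Set


def eligible_heads(queues: List[Deque[str]], blocked_outputs: Set[str]) -> int:
    """Count queues whose head flit targets an output that is not blocked.

    Only the flit at the head of each queue is eligible in a given cycle.
    """
    return sum(1 for q in queues if q and q[0] not in blocked_outputs)


# Each entry is the output port a flit wants; the East output is congested.
blocked = {"E"}

# One physical FIFO: the East-bound flit at the head stalls everything behind it.
single_queue = [deque(["E", "N", "S"])]

# Same flits spread over three virtual channels (one packet per VC).
virtual_channels = [deque(["E"]), deque(["N"]), deque(["S"])]

print(eligible_heads(single_queue, blocked))
print(eligible_heads(virtual_channels, blocked))
```

With a single queue nothing can move while East is blocked, whereas with separate queues the North-bound and South-bound flits remain eligible.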
1.4 Thesis Contributions and Overview
This thesis presents novel low-power NoC designs that depart from the traditional
trade-offs between network power and latency/bandwidth performance through circuit and microarchitecture co-design, then proves such design concepts on silicon
with a thorough analysis of the chip measurement results. To be specific, the thesis demonstrates three test chip designs: a 4×4 mesh NoC in Chapter 2, clockless
low-swing repeaters in Chapter 3 and a 3D through-silicon via (TSV) interconnect
in Chapter 4. The mesh NoC chip first optimizes the crossbar switch (Figure 1-8),
then co-designs the logic to minimize buffering (Figure 1-10). The second test chip
of clockless low-swing repeaters targets the repeated link (Figure 1-4) while the third
3D-IC chip seeks to develop the TSV point-to-point interconnect (Figure 1-3) whose
design constraints totally differ from the conventional 2D metal wires. An overview
of each chip prototype and corresponding chapter is as follows:
• 4×4 Mesh NoC Chip. Chapter 2 explores our first test chip of a mesh network
design for chip multiprocessors (CMPs) that aims to simultaneously optimize
energy-latency-throughput for unicasts, multicasts and broadcasts. We first
define and analyze the theoretical limits of a mesh NoC in latency, throughput
and energy, then describe how we approach these limits through a combination
of microarchitecture and circuit techniques. Fabricated in 45nm SOI CMOS,
the 1.1V 1GHz NoC chip achieves 1-cycle router-and-link latency at each hop
and energy-efficient router-level multicast support, delivering 892Gb/s (87.1%
of the theoretical bandwidth limit) at 531.4mW for a mixed traffic of unicasts
and broadcasts. Armed with detailed measurement results, this chapter deeply
compares and analyzes the pros and cons of the proposed mesh NoC design:
(1) energy/performance improvement and timing/area penalties of the virtual
bypassed, multicast-optimized router design; (2) energy benefits, area overheads
and reduced reliability of the clocked low-swing datapath circuits; and (3) a
gap between simulated power estimation (ORION 2.0 [20]) and actual power
consumption. Here, I would like to acknowledge that the architectural design
of this test chip [6] was done by Tushar Krishna, a former PhD student at MIT.
• Clockless Low-Swing Repeaters Chip. Traffic on multiprocessor systems-on-chip (MPSoCs) is highly dynamic, i.e. the traffic considerably varies depending on SoC applications. To efficiently support such dynamic traffic, reconfigurable NoCs on a flexible network topology like a mesh have been developed [21, 22, 23, 24, 25]. These networks pre-reserve (parts of) the route to
match application traffic by making unnecessary routers contention-free. If existing clocked low-swing circuits are applied to the pre-reserved routes, flits will
pay needless clocking energy and latency even at the contention-free nodes. To
prevent such wastage (hence maximize low-swing signaling benefits in the reconfigurable NoCs), Chapter 3 proposes two types of clockless low-swing repeaters,
self-resetting logic repeaters (SRLRs) and voltage-locked repeaters (VLRs), and
analyzes experimental results of the test chip fabricated in 45nm SOI CMOS.
Featuring variation-robust circuit techniques, the 0.8V 4.1Gb/s SRLRs enable single-ended low-swing pulses to be asynchronously repeated, and therefore
consume less energy than differential, clocked low-swing signaling. On the
other hand, the 1.0V 6.8Gb/s VLRs outperform energy-equivalent full-swing repeaters in terms of delay (35% reduction) and bandwidth (23% improvement),
enabling single-cycle multi-hop asynchronous link traversal for a single-cycle
reconfigurable NoC [1].
• 3D TSV Interconnect Chip. Many multi-threaded applications of CMPs
and MPSoCs require heavy off-die bandwidth that cannot be handled by existing off-chip I/Os. While three-dimensional integrated circuits (3D-ICs) offer
an appealing solution to such bandwidth-hungry manycore chips, current 3D-IC fabrication technologies inevitably require redundant through-silicon vias
(TSVs) for reliable 3D vertical signaling, leading to significant power and area
overheads. To alleviate these 3D signaling overheads (hence incorporate TSVs
as 3D-IC NoC links within tight power and area budget), Chapter 4 proposes
and demonstrates the concept of simultaneously bi-directional (SBD) TSV signaling that can send and receive data at the same time through a single TSV.
The proposed SBD interconnect enables area and power-efficient, variation-robust 3D signaling with a relatively small bandwidth loss (less than 13%).
Implemented with 28nm Low-Power CMOS process and MediaTek TSV technology, our SBD TSV interconnect achieves 10.3-31.1% lower energy and 34.4% less
area than equivalent two uni-directional TSVs at 9.1Gb/s/TSV bi-directional
data rate (i.e. 4.55GHz clock frequency) at 1.05V.
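Two of the headline figures in the overview above can be cross-checked with simple arithmetic. The sketch assumes that the theoretical bandwidth limit quoted for the mesh chip corresponds to one 64-bit flit per NIC per cycle on all 16 ejection links at 1GHz; this interpretation is ours and is not stated explicitly in the overview, although it reproduces the quoted percentage.

```python
# Figures quoted in the overview above.
NODES = 16                 # 4x4 mesh
FLIT_WIDTH_BITS = 64
CLOCK_GHZ = 1.0
MEASURED_GBPS = 892.0

# Assumption: the limit is one flit per NIC per cycle on every ejection link.
limit_gbps = NODES * FLIT_WIDTH_BITS * CLOCK_GHZ
fraction_of_limit = MEASURED_GBPS / limit_gbps

# SBD TSV: data moves in both directions on one TSV every clock cycle.
TSV_CLOCK_GHZ = 4.55
tsv_bidirectional_gbps = 2 * TSV_CLOCK_GHZ

print(f"limit = {limit_gbps:.0f} Gb/s, achieved = {fraction_of_limit:.1%}")
print(f"SBD TSV rate = {tsv_bidirectional_gbps:.1f} Gb/s/TSV")
```

Under that assumption the measured 892Gb/s is about 87.1% of 1024Gb/s, and a 4.55GHz clock with one bit per direction per cycle gives the 9.1Gb/s/TSV bi-directional rate.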
2 Towards the Theoretical Limits of a Mesh NoC
This chapter first derives the theoretical mesh NoC bounds, followed by an analysis of
a power and performance gap with existing mesh NoC chips. It then presents a chip
prototype of the proposed mesh NoC which tries to eliminate the gap, thus approaching
the theoretical limits.
2.1 Theoretical Mesh NoC Limits
A mesh topology by itself imposes theoretical limits on latency, throughput and
energy (i.e. minimum latency and energy, and maximum throughput). Chapter 2
starts with the derivation of these theoretical bounds of a k × k mesh NoC for two
traffic types, unicast and broadcast traffic. In our analysis, each network interface
circuit (NIC) injects flits into the network according to a Bernoulli process of rate
R, to a random, uniformly distributed destination for unicasts; and from a random,
uniformly distributed source to all nodes for broadcasts. All derived bounds are for
a complete action: from initiation at the source NIC, till the flit is received at all
destination NICs. We also make three NoC-level assumptions for our derivation:
1. Perfect Routing. A router would route all packets with minimal hop-counts,
balancing injected packets (termed channel load in our analysis) across multiple
routes perfectly, thereby keeping the load on all links optimally balanced.
Figure 2-1: Latency calculation example for broadcast traffic on a k×k mesh network. (The figure marks a source S at (i, j), its furthest destination, and the distances k-i and k-j on a mesh of side k.)
2. Perfect Flow Control. A router maintains maximum utilization of the links,
i.e. a link is never left idle when there is traffic routed across it.
3. Perfect Router Microarchitecture. All flits only incur the delay and energy of the datapath (ST and LT). In other words, a router arbitrates between
competing flits; performs crossbar and link traversal all in a single cycle; and
does not expend extraneous energy for buffering and control.
Based on these assumptions, we derive the theoretical limits for unicast and broadcast traffic. For unicasts, we analyze the theoretical limits for latency and throughput
using the same technique as in [13]. We then derive the energy limit by multiplying
hop count with crossbar and link energy costs. For broadcast traffic, to the best of
our knowledge, no prior theoretical analysis exists. Here, we define the time till a flit
is received by all destination NICs as equivalent to when this flit is received by the
furthest NIC relative to the source NIC (Figure 2-1). Hence, we derive the theoretical latency limit for received packets by averaging the hop delay from each source
NIC to its furthest destination NIC. Throughput wise, we obtain the theoretical limit
by analyzing the channel load across the ejection links and bisection links [13], and
observe that the maximum throughput for broadcast traffic is limited by the ejection
links. This differs from unicast traffic where throughput is always limited by the
bisection links. As for the theoretical energy limit, intuitively, due to the nature of
broadcasting, a broadcast flit needs to visit all k² routers in the network and traverse
k² crossbars and links connecting them. Therefore, the energy limit grows quadratically with the number of routers in the network. Table 2.1 summarizes our derivation
results.
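The bounds in Table 2.1 can be evaluated numerically with a short sketch. The formulas below are transcribed from the table as printed, reading "(k − 1)(3k + 1)/2k" as division by 2k; the function and field names are hypothetical, and the energy unit is whatever unit is used for Exbar and Elink.

```python
from dataclasses import dataclass


@dataclass
class MeshLimits:
    latency_hops: float      # average hop count H_average
    channel_load: float      # max{L_bisection, L_ejection}, in flits/cycle
    energy: float            # in the same unit as e_xbar and e_link


def unicast_limits(k: int, rate: float, e_xbar: float, e_link: float) -> MeshLimits:
    hops = 2 * (k + 1) / 3
    load = max(k * rate / 4, rate)           # bisection vs. ejection links
    energy = hops * e_xbar + e_xbar + hops * e_link
    return MeshLimits(hops, load, energy)


def broadcast_limits(k: int, rate: float, e_xbar: float, e_link: float) -> MeshLimits:
    if k % 2 == 0:
        hops = (3 * k - 1) / 2
    else:
        hops = (k - 1) * (3 * k + 1) / (2 * k)
    load = max(k * k * rate / 4, k * k * rate)   # ejection links dominate
    energy = k * k * e_xbar + (k * k - 1) * e_link
    return MeshLimits(hops, load, energy)


# Example: 4x4 and 8x8 meshes with unit crossbar and link energies.
for k in (4, 8):
    print(k, unicast_limits(k, 0.1, 1.0, 1.0), broadcast_limits(k, 0.1, 1.0, 1.0))
```

For k = 4 this gives average hop counts of about 3.3 (unicast) and 5.5 (broadcast), and for k = 8 it gives 6 and 11.5.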
2.2 Related Work: Existing Mesh NoC Chips
There have been chip prototypes that incorporate mesh NoCs [4, 5, 8, 9, 10,
11, 12] or other heterogeneous NoCs [26, 27, 28] as their on-chip communication
fabric. The prototypes range from full manycore processors to stand-alone NoCs.
As heterogeneous NoC chips [26, 27, 28] have irregular topologies which make it
difficult to characterize them against the theoretical mesh limits, we focus here on the
manycore chips with mesh networks in our related work investigation. In particular,
three chip prototypes were selected for comparison, each differing significantly with
respect to targeted design goals and optimizations: Intel TeraFLOPS which is the
precursor of the Intel IA-32 NoC, Tilera TILE64 which is the successor of the MIT
Raw, and SWIFT, a NoC with low-swing signaling. These three chips and their
corresponding NoC architectures are described in detail as follows:
• Tilera TILE64 [9] is a multiprocessor consisting of 64 tiles interconnected by
five 2D mesh networks, where each tile contains a CPU, cache and a router,
fabricated on the TSMC 90nm process and running at a speed of 700 to 866
MHz. Four of the five networks are dynamically routed, each servicing a different type of traffic: user dynamic network (UDN) for user-level messages,
I/O dynamic network (IDN) for I/O traffic, memory dynamic network (MDN)
for traffic to/from the memory controllers, and tile dynamic network (TDN)
for cache-to-cache transfers. The dynamic networks are packetized, wormhole
routed, with a one cycle pipeline for straight-through traffic and two cycles for
turning traffic. The static network is software scheduled, and has a single-cycle
pipeline.
• Intel TeraFLOPS [5] has a more complex NoC architecture, but the cores are
much simpler than a standard RISC processor. Since simpler cores are more area
and energy-efficient than larger ones, more functional units can be integrated
within a single chip’s area and power budget. TeraFLOPS is a demonstration
of the possibility of including an on-chip interconnect operating at 5 GHz and
achieving performance in excess of one teraFLOPS while maintaining a power usage
of less than 100W. The TeraFLOPS NoC has a five-port, two-lane, five-pipeline-stage router with a double-pumped crossbar used to interconnect the tiles in
a 2D mesh network. Each input port is connected to two 16 entry deep FIFO
buffers, one for each lane. A single crossbar for both lanes is double pumped in
the fourth pipeline stage using dual-edge triggered flip-flops, allowing the switch
to transfer data at both edges of the clock signal.
• SWIFT [29] is a 2×2 standalone NoC research chip demonstrating the practicality of implementing token flow control [30] and low-swing crossbar switches
and links. The buffer-bypassed traversal of flits through a reduced-swing datapath is demonstrated to perform at 400 MHz and obtain latency and power
reductions of approximately 40 percent each. The token flow control microarchitecture pre-allocates buffers and links in the network by using tokens. Many
flits are then able to bypass buffering, improving link utilization and reducing the buffer turnaround time. Dual voltage supply differential reduced-swing
drivers and sense-amplifier receivers sustain the low-swing signaling necessary
to reduce the dynamic power consumption.
We calculated zero-load latency and channel load of these networks for both
unicast-only and broadcast-only traffic. Zero-load latency can be obtained by multiplying the average hop-count by the number of pipeline stages to traverse a hop, with
serialization latency added on to model pipelining of all flits. In terms of throughput,
we computed channel load based on a flit injection rate per core of R, following the
methodology of [13]. The results are shown in Table 2.2.
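The zero-load latency calculation described above can be written out as a small sketch. It assumes the unicast average hop count 2(k + 1)/3 from Table 2.1 and leaves serialization latency as an optional term that defaults to zero; the function names are hypothetical.

```python
def average_unicast_hops(k: int) -> float:
    """Average unicast hop count of a k x k mesh, as listed in Table 2.1."""
    return 2 * (k + 1) / 3


def zero_load_latency(k: int, cycles_per_hop: float,
                      serialization_cycles: float = 0.0) -> float:
    """Average hop count times per-hop pipeline depth, plus serialization."""
    return average_unicast_hops(k) * cycles_per_hop + serialization_cycles


# A 5-stage router pipeline modeled on an 8x8 mesh.
five_stage = zero_load_latency(8, 5)
# A single-cycle-per-hop router on the same 8x8 mesh.
single_cycle = zero_load_latency(8, 1)
print(five_stage, single_cycle)
```

For an 8×8 mesh this yields 30 cycles for a five-stage pipeline and 6 cycles for a single-cycle hop, in line with the unicast zero-load entries listed for the TeraFLOPS NoC and the proposed NoC in Table 2.2.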
It is noted in this table that our proposed NoC, which will be described in the
following section, optimizes for broadcast traffic and incurs much lower zero-load latency and channel load compared to all other networks. TILE64 attempts to optimize
for all three metrics by utilizing independent simple networks for different message
types. The simple router design, with no virtual channels, improves unicast zero-load
latency but broadcast traffic latency is poor as its lack of multicast support forces
the source NIC to create k² − 1 copies of a broadcast flit and send one copy to every
destination NIC. This increases channel load by a factor of k² − 1, causing contention at
all routers along the shared route and making it impossible to achieve a single cycle per
hop. TILE64’s static partitioning of traffic across 5 networks may also lead to poor
throughput when exercised with realistic uniform traffic. A similar effect on broadcast
latency and channel load is observed for the TeraFLOPS and SWIFT NoCs as none
of these chip prototypes have multicast support. The SWIFT NoC with a single-cycle
pipeline for unicasts performs better on zero-load latency, albeit at a lower operating
frequency. The TeraFLOPS NoC has poor zero-load latency in terms of cycles due
to a 5-stage pipeline, which is aggravated with broadcasts.
Table 2.1: Theoretical limits of a k×k mesh NoC for unicast and broadcast traffic.
Metric | Unicasts (one-to-one multicasts) | Broadcasts (one-to-all multicasts)
Average Hop Count (Haverage) | 2(k + 1)/3 | (3k − 1)/2, for k even; (k − 1)(3k + 1)/2k, for k odd
Channel Load on each bisection link (Lbisection) | k×R/4 | k²×R/4
Channel Load on each ejection link (Lejection) | R | k²×R
Theoretical Latency Limit, given by Haverage | 2(k + 1)/3 | (3k − 1)/2, for k even; (k − 1)(3k + 1)/2k, for k odd
Theoretical Throughput Limit, given by max{Lbisection, Lejection} | R, for k <= 4; k×R/4, for k > 4 | k²×R
Theoretical Energy Limit | 2(k + 1)/3×Exbar + Exbar + 2(k + 1)/3×Elink | k²×Exbar + (k² − 1)×Elink
Exbar: energy of crossbar traversal. Elink: energy of link traversal.

Table 2.2: Comparison of mesh NoC chip prototypes.
Metric | Intel TeraFLOPS [5] | Tilera TILE64 [31] | SWIFT [29] | Our work
Network, technology | 8×10, 65nm | 5 8×8, 90nm | 2×2, 90nm | 4×4, 45nm SOI
Clock frequency | 5GHz | 750MHz | 225MHz | 1GHz
Power supply | 1.1-1.2V | 1.0V | 1.2V | 1.1V
Power consumption | 97W | 15-22W | 116.5mW | 427.3mW
Latency Metrics (modeled as 8×8 networks; our work is also listed as the 4×4 network)
Delay per hop | 1ns | 1.3ns | 8.9-17.8ns | 1-3ns
Zero-load latency (cycles) | 30 (unicast), 120.5 (broadcast) | 9 (unicast), 77.5 (broadcast) | 12 (unicast), 86 (broadcast) | 8×8: 6 (unicast), 11.5 (broadcast); 4×4: 3.3 (unicast), 5.5 (broadcast)
Throughput Metrics (modeled as 8×8 networks; our work is also listed as the 4×4 network)
Channel width | 39b | 5×32b | 64b | 8×8: 64b; 4×4: 64b
Bisection bandwidth | 1560Gb/s | 937.5Gb/s | 112.5Gb/s | 8×8: 512Gb/s; 4×4: 256Gb/s
Channel load (R: injection rate/core) | 64R (unicast), 4096R (broadcast) | 64R (unicast), 4096R (broadcast) | 64R (unicast), 4096R (broadcast) | 8×8: 64R (unicast), 64R (broadcast); 4×4: 16R (unicast), 16R (broadcast)
2.3 Chip Design and Fabrication
Our mesh NoC chip design starts with the state-of-the-art router microarchitecture
(Figure 1-2). We then add features pushing latency towards the theoretical limit of
a single cycle per hop, throughput towards the theoretical limit of maximum channel
load and energy towards the theoretical limit of just datapath traversal. In the
fabricated network, all routers are connected to network interface circuits (NICs)
to generate and receive packets. For realistic traffic, we separately model request
and response messages to reflect that most manycore chips today use shared memory
architecture and rely on the request and response messages between nodes to maintain
data coherence. To avoid message-level deadlocks in such cache-coherent manycore
processors, each input port has two message classes (MCs), request and response.
The request message class contains 4 VCs, each of which is 1-flit deep, while the
response message class contains 2 VCs, each of which is 3-flit deep. All flits follow
an XY dimension ordered routing (DOR) and a broadcast flit is replicated only when
the XY DOR requires different output ports to minimize the network traffic (XY-tree
DOR). Figure 2-2 describes such a broadcast example with an overview of our fabricated
4×4 mesh NoC. As shown in the die photo overlaid with its design layout (Figure 2-3), an additional crossbar switch is separately laid out with longer links, 1mm and
2mm wires, to explore higher data rate performance of our low-swing crossbar switch
(clock frequency of the overall network is limited by the synthesized router logic).
The following subsections describe the proposed mesh NoC chip design in detail.
Figure 2-2: Broadcast example and overview of the fabricated 4×4 mesh NoC. (Overview table: flit size 64 bits; request packet size 1 flit; response packet size 5 flits; microarchitecture 6 VCs over 2 MCs; no-load router-and-link latency 1 cycle; operating frequency 1GHz; power supply voltage 1.1V; technology 45nm SOI CMOS.)
Figure 2-3: Die photo and design layout of the 4×4 mesh NoC and stand-alone low-swing crossbar switch connected to longer links (1mm and 2mm wires).
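The XY-tree broadcast described in this section can be sketched behaviorally: a flit travels along the source row in X and forks into Y at every router on that row. The code is an illustration under that reading of XY-tree DOR, not the chip's routing logic, and the function names and coordinate convention are assumptions.

```python
from collections import deque
from typing import Dict, List, Tuple

Node = Tuple[int, int]  # (x, y) router coordinates


def xy_tree_children(node: Node, source: Node, k: int) -> List[Node]:
    """Next routers a broadcast flit is forwarded to under XY-tree routing."""
    x, y = node
    sx, sy = source
    children: List[Node] = []
    if y == sy:
        # Still on the source row: keep moving away from the source along X.
        if x >= sx and x + 1 < k:
            children.append((x + 1, y))
        if x <= sx and x - 1 >= 0:
            children.append((x - 1, y))
    # Fork along Y from the source row, then continue straight in Y.
    if y >= sy and y + 1 < k:
        children.append((x, y + 1))
    if y <= sy and y - 1 >= 0:
        children.append((x, y - 1))
    return children


def broadcast(source: Node, k: int) -> Dict[Node, int]:
    """Return the hop count at which each router receives the flit."""
    hops = {source: 0}
    frontier = deque([source])
    while frontier:
        node = frontier.popleft()
        for child in xy_tree_children(node, source, k):
            assert child not in hops  # a tree never delivers the same flit twice
            hops[child] = hops[node] + 1
            frontier.append(child)
    return hops


reached = broadcast((1, 2), 4)
print(len(reached), "routers reached over", len(reached) - 1, "link traversals")
```

Every router is reached exactly once, so a single broadcast flit crosses k² − 1 router-to-router links, which matches the link term of the broadcast energy limit in Table 2.1.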
2.3.1 Towards Theoretical Latency Limits
We first push the state-of-the-art design towards the latency bounds by adding two
key features: a virtual bypassing microarchitecture to hide delays due to buffering
and arbitration [6, 30, 32], and low-swing datapath circuits based on linear-mode
drive transistors to achieve single cycle ST+LT without lowering clock frequency.
• Single-stage pipeline with lookaheads. In pipeline stage 2 of the state-of-the-art design (Figure 1-2), we add and generate 15b lookahead signals from the
results of NRC and mSA-II, and send them to the next router. The lookaheads
try to pre-allocate the crossbar switch ahead of the actual flit, thus hiding
mSA-II from the router delay. The lookahead takes priority over requests from
buffered flits at the next router, and directly enters mSA-II. If the lookahead
wins an output port, this pre-allocation allows the following flit to bypass the
first two pipeline stages and go into the third stage directly, reducing the router
pipeline depth from 4 to 2. It is notable that our active pre-allocation by
lookaheads enables incoming flits to bypass routers at all loads, in contrast to
a plain approach of bypassing only at low-loads when the input queues are
empty [33, 34, 35].
• Single-cycle ST+LT with low-swing circuits. We apply the low-swing signaling technique based on linear-mode drive transistors, which can reduce the
charging/discharging delay and dynamic energy when driving capacitive parasitics [36], to the NoC datapath. As will be described later in Section 2.3.3,
the proposed low-swing circuits obtain higher current driving ability (i.e. lower
linear drive resistance) even at small Vds than the reduced-swing signaling generated by simply lowering supply voltage, and hence, our low-swing datapath enables single-cycle ST+LT at higher clock frequency. Our chip prototype
demonstrates that the proposed low-swing circuits enable up to 5.4GHz single-cycle ST+LT (more details in Section 2.4).
These two optimizations achieve a single-cycle-per-hop delay for unicasts and multicasts, exactly matching the theoretical latency limits. The caveat is that in case
of contention for the same output port from multiple lookaheads, one of them will
have to be buffered and then forced to go through the 3-stage pipeline. In addition,
critical path delay is stretched, which will be analyzed in Section 2.4.2.
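The effect of lookahead bypassing on end-to-end latency can be approximated with a per-hop model. The sketch assumes 1 cycle for a hop where the lookahead wins the output port and 3 cycles for a hop where the flit is buffered, as stated above; it ignores any additional queueing delay while a buffered flit waits, and the names are hypothetical.

```python
BYPASS_CYCLES = 1    # lookahead won: flit goes straight to ST+LT
BUFFERED_CYCLES = 3  # lookahead lost: flit is buffered and takes the 3-stage path


def path_latency(lookahead_won: list) -> int:
    """Cycles to cross a sequence of hops, one boolean per hop."""
    return sum(BYPASS_CYCLES if won else BUFFERED_CYCLES for won in lookahead_won)


# Six hops with no contention, then the same path with one lost arbitration.
all_bypassed = path_latency([True] * 6)
one_contended = path_latency([True, True, False, True, True, True])
print(all_bypassed, one_contended)
```

Under this model each lost lookahead arbitration adds two cycles to the otherwise single-cycle-per-hop path.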
2.3.2 Towards Theoretical Throughput Limits
Next, we take two steps towards the throughput bounds for both unicasts and
broadcasts: router-level multicast support for bandwidth sharing and single-cycle-per-hop latency for fast buffer reuse.
• Multicast support inside routers. We extend the multicast capability of
the state-of-the-art design into our lookahead-based microarchitecture by letting
lookaheads perform the multicast switch allocation. This scheme enables one
multicast/broadcast flit to be sent from the source NIC, and get routed to
all other routers in the network via a tree. The multicast capability allows
a broadcast flit to share bandwidth until it requires an explicit fork
into different directions. This dramatically reduces contention compared to the
textbook router design [13], where multiple flits would have to be sent as unicasts,
which are guaranteed to create contention along the shared routes. We use
a dimension ordered XY-tree routing in our design as it is deadlock free, and
simplifies the routing algorithm.
• Single-cycle-per-hop latency. The number of buffers/VCs required at every
input port to sustain a particular throughput depends upon the buffer/VC
turnaround time, i.e. the number of cycles for which the buffer/VC is occupied.
This is where our optimizations for latency in Section 2.3.1 come in handy,
since they reduce the pipeline depth, thus reducing buffer turnaround time,
thereby increasing throughput given the same number of buffers. For our single-cycle pipeline, the turnaround time for buffers/VCs is 3: one cycle for ST+LT
to the downstream router, one cycle for the free VC/buffer signal to return
from the downstream router (if the flit successfully bypassed), and one cycle
for it to be processed and ready to be used for a new flit. We thus choose
4 VCs in the request message class, each 1-flit deep (since request packets
in our design are 1 flit long) to satisfy VC turnaround time and sustain high
throughput for broadcasts. We chose 2 VCs in our response message class, each
3-flit deep, for the 5-flit response packets. This number was chosen to be less
than the turnaround time to shorten the critical path, and reduce the total
buffers (which increase power consumption). We thus chose a total of 6 VCs
per port, with a total of 10 buffers.
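The buffer budget described above can be checked with a few lines of arithmetic. The following is a minimal sketch rather than anything taken from the chip's design flow: the `MessageClass` type, its field names and the "at least as many VCs as turnaround cycles" criterion are illustrative assumptions, while the VC counts, buffer depths and the 3-cycle turnaround come from the text.

```python
from dataclasses import dataclass

# From the text: ST+LT, free-VC signal return, and processing of that signal.
TURNAROUND_CYCLES = 3


@dataclass(frozen=True)
class MessageClass:
    """Hypothetical container for one message class of an input port."""
    name: str
    num_vcs: int
    flits_per_vc: int

    @property
    def buffers(self) -> int:
        # Total flit buffers contributed by this message class.
        return self.num_vcs * self.flits_per_vc

    def covers_turnaround(self, turnaround: int = TURNAROUND_CYCLES) -> bool:
        # Assumed (simplified) criterion: a class can accept a new packet
        # every cycle only if it has at least `turnaround` VCs.
        return self.num_vcs >= turnaround


request = MessageClass("request", num_vcs=4, flits_per_vc=1)
response = MessageClass("response", num_vcs=2, flits_per_vc=3)

total_vcs = request.num_vcs + response.num_vcs
total_buffers = request.buffers + response.buffers

print(f"VCs per port: {total_vcs}, buffers per port: {total_buffers}")
print(f"request class covers turnaround: {request.covers_turnaround()}")
print(f"response class covers turnaround: {response.covers_turnaround()}")
```

Under this simplified criterion the request class covers the turnaround time and the response class deliberately does not, which mirrors the trade-off stated in the text.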
2.3.3
Towards Theoretical Energy Limits
Section 2.1 reveals a significant energy gap between the state-of-the-art router
energy and the theoretical energy limit (which is just clocking and datapath energy,
E_xbar and E_link). This gap is due to buffering energy (E_buff), arbitration logic energy
(E_arb) and silicon leakage energy (E_lkg). Conventionally, these energy overheads are
traded off against latency and throughput as follows: fewer buffers reduce E_buff and
E_lkg, but stretch latency due to contention and lower throughput. Alternatively, simple routers
like wormhole routers [13] reduce E_arb and E_lkg and increase operating frequency f,
but these come at the expense of poorer latency and throughput.
Our proposed NoC first includes multicast support so even broadcasts and multicasts can approach the theoretical energy limit. Then, it incorporates two new
features that permit different tradeoffs of latency, throughput and energy. First, our
multicast virtual bypassing reduces E_buff, while improving both latency and throughput. The hidden cost lies in increased E_arb and decreased f. As will be shown in
Section 2.4.1, the savings in E_buff outweigh the E_arb overheads, and the operating frequency can still be in the GHz range. Second, our chip employs low-swing signaling to reduce
dynamic energy in the datapath (E_xbar and E_link), which is unavoidable and part
of the theoretical energy limit. Our low-swing circuits, based on linear-mode drive
transistors, provide an opportunity to break the conventional trade-offs that achieve
dynamic energy savings at the cost of latency and throughput penalties. Indeed, our
low-swing datapath optimizes both energy and latency. Its downsides lie in its area
overheads and reduced process variation immunity.
Figure 2-4: 64-bit 5×5 tri-state RSD-based matrix crossbar switch and link circuitry.
Figure 2-4 shows the circuit implementation of the low-swing crossbar switch
directly connected to links with tri-state reduced-swing drivers (RSDs). This circuit
design enables low-swing signaling in crossbar vertical wires and link wires. The tri-state RSD disconnects horizontal and vertical wires and only drives the corresponding
vertical wire and link, thereby providing energy-efficient multicasting capability. With
an additional supply voltage (LVDD), the 4-PMOS stacked RSD design generates
more reliable low-swing signaling in the presence of wire capacitance and resistance
variation than equalized interconnects [37, 38, 39, 40], where low-swing signaling is
obtained by wire channel attenuation. A delay cell aligns an input signal (which
drives only a 1-bit crossbar) to an enable signal (which drives all 64 1-bit crossbars).
It reduces the mismatch between charging and discharging time, thus decreasing inter-symbol interference (ISI). The 64-bit links are designed with 0.15um-width, 0.30um-space fully shielded differential wires to eliminate noise coupling from crosstalk
and supply voltage variation.
Figure 2-5 shows the detailed router microarchitecture and pipeline of our proposed mesh NoC that incorporates virtual bypassing, low-swing signaling datapath
and router-level multicast support. The following section will closely explore not
only its performance and energy benefits but also the concomitant costs such as area
overhead, stretched critical path and reduced noise margin.
Figure 2-5: Proposed router microarchitecture and pipeline.
2.4
Evaluation
We first evaluate the measured energy-latency-throughput of our fabricated NoC
against that of the baseline design and the theoretical mesh limits defined in Section 2.1. Armed with our chip measurements, we then delve into three specific case
studies on virtual bypassing; low-swing signaling; and power modeling and estimation
to dissect our design choices.
2.4.1
Latency, Throughput and Energy
We measured average packet latency of our NoC as a function of packet injection
rate, with two different traffic patterns: mixed traffic (50% broadcast request, 25%
unicast request and 25% unicast response messages) and broadcast-only traffic (100%
broadcast request messages), at 1GHz operating frequency. Figure 2-6 and Figure 2-7
show the results along with the baseline performance and theoretical mesh bounds.
Here, we chose a more aggressive baseline that has single-cycle ST+LT instead of the separate ST and LT stages described in Section 1.2. Since even the full-swing baseline
can support single-cycle ST+LT at 1GHz, this baseline is a fairer model of an equivalent unicast full-swing NoC. Except for the single-cycle ST+LT, the baseline used
in this section is identical to Figure 1-2. The theoretical latency limits (cycles/packet)
include two extra cycles for NIC-to-router and router-to-NIC traversals, which are indispensable since traffic is injected and ejected through the NICs. Theoretical throughput
limits are calculated based on received flits, then converted into Gb/s to factor in the
1GHz clock frequency and 64-bit flit size (16 × 64b × 1GHz = 1024Gb/s). Simulation
results were obtained from pre-layout synthesis with sufficient simulation cycles (10^4
cycles) to make scan-chain warmup (128 cycles) negligible.
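The unit conversion behind the 1024Gb/s figure can be written out explicitly. This is a small sketch of that arithmetic only; the function name is hypothetical, and it assumes (as the formula in the text implies) that each of the 16 nodes receives at most one 64-bit flit per cycle.

```python
def throughput_limit_gbps(num_nodes: int, flit_bits: int, clock_ghz: float) -> float:
    """Aggregate received bandwidth if every node ejects one flit per cycle.

    bits per cycle times cycles per nanosecond gives Gb/s.
    """
    return num_nodes * flit_bits * clock_ghz


limit = throughput_limit_gbps(num_nodes=16, flit_bits=64, clock_ghz=1.0)
print(f"theoretical throughput limit: {limit:.0f} Gb/s")

# The power measurements in this chapter are taken at 653 Gb/s; expressing
# that operating point as a fraction of the limit is a simple division.
operating_point_gbps = 653.0
print(f"653 Gb/s is {operating_point_gbps / limit:.1%} of the limit")
```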
For latency, our design enables 48.7% (mixed traffic) and 55.1% (broadcast-only)
reductions before the network saturates as compared to the baseline. To enable
precise comparisons, we define the saturation point as the injection rate at which NoC
latency reaches 3 times the average no-load latency; most multi-threaded applications
run within this range. The low-load latency gap from the theoretical latency limit is
5.7 (6.3) cycles for mixed (broadcast) traffic, i.e. only 1.03 (1.14) cycles of contention
latency per hop for mixed (broadcast) traffic. This can be further improved to 0.04
(0.05) cycles of contention latency per hop (obtained through RTL simulations) by
removing an artifact in our chip whereby all NICs had identical pseudo-random
generators, which caused contention that lowered the amount of bypassing even at low
injection rates.
Figure 2-6: Network performance evaluation with mixed traffic at 1GHz.
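The saturation-point definition used here (latency reaching 3 times the no-load latency) is easy to apply to a measured latency curve. The sketch below is illustrative only: the function name, the use of linear interpolation between samples, and the sample curve are all assumptions, not the chip's measurement data.

```python
from typing import Optional, Sequence, Tuple


def saturation_point(
    curve: Sequence[Tuple[float, float]], factor: float = 3.0
) -> Optional[float]:
    """Return the injection rate at which latency reaches factor * no-load latency.

    `curve` holds (injection_rate, average_latency) samples sorted by injection
    rate; the first sample is taken as the no-load latency (an assumption).
    Returns None if the curve never reaches the threshold.
    """
    threshold = factor * curve[0][1]
    prev_rate, prev_latency = curve[0]
    for rate, latency in curve[1:]:
        if latency >= threshold:
            # Linearly interpolate between the two bracketing samples.
            frac = (threshold - prev_latency) / (latency - prev_latency)
            return prev_rate + frac * (rate - prev_rate)
        prev_rate, prev_latency = rate, latency
    return None


# Hypothetical latency curve (flits/node/cycle, cycles), for illustration only.
example_curve = [(0.01, 10.0), (0.1, 11.0), (0.2, 14.0), (0.3, 20.0), (0.4, 40.0), (0.5, 120.0)]
sat = saturation_point(example_curve)
print(f"saturation injection rate: {sat:.3f}")
```

Interpolating between samples keeps the comparison between two designs from depending on how coarsely the injection rate was swept.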
Throughput-wise, the fabricated NoC approaches the theoretical limits: 87%
(mixed traffic) and 91% (broadcast-only) of the theoretical throughput limits. In
addition, our NoC design has 2.1x (mixed traffic) and 2.2x (broadcast-only) higher
saturation throughput than the baseline. In other words, the proposed NoC can obtain the same throughput as the baseline with fewer buffers or VCs. The throughput
gap between the theoretical mesh and the fabricated chip is due to imperfect arbitration (like all prior chips discussed in Section 2.2, we use separable allocators, mSA-I
and mSA-II, to lower complexity) and routing (dimension-ordered XY routing
can lead to load imbalance).
Figure 2-7: Network performance evaluation with broadcast-only traffic at 1GHz.
Figure 2-8 shows the measured power reduction at 653Gb/s broadcast delivery
at 1GHz at room temperature. The low-swing signaling enables 48.3% power reduction in the datapath. In addition, the single-cycle multicast capability and virtual
bypassing result in 13.9% and 32.2% power reduction in router logic and buffers,
respectively. Overall, our chip prototype achieves 38.2% power reduction compared
to the baseline. To compare against the theoretical power limit, we performed a
post-layout power simulation of a router in the middle of the mesh to further break down data-dependent power from non-data-dependent components like clocking. We
then calculate the theoretical power limit to comprise just clocking and a full-swing
datapath: 5.6mW/router, at close to zero-load injection rate (3/255). Compared
to our NoC power consumption at the same low injection rate (13.2mW/router),
our overhead comes largely from VC bookkeeping state (1.9mW/router) and buffers
(2.0mW/router), whereas the allocators (0.7mW/router) and additional lookahead
signals (0.2mW/router) contribute little additional power. The data-dependent power
(e.g. buffers, allocators) is due to our identical PRBS generators at NICs that limited
bypassing at low loads and can be removed by virtual bypassing, but the non-data-dependent power (e.g. VC state) will remain. Also, since our chip consumes non-trivial leakage power (76.7mW measured, 18% of overall chip power consumption at
653Gb/s), power gating will help to further close the gap, at the expense of a decrease
in operating frequency.
Figure 2-8: Measured network power reduction at 653Gb/s at 1GHz (A: full-swing unicast network, B: low-swing unicast network, C: low-swing broadcast network
without virtual buffer bypassing, D: low-swing broadcast network with virtual buffer bypassing). Power is broken down into clocking circuitry, router logic and buffers, and datapath (crossbar + link). Annotations: 48.3% power reduction in the datapath by tri-state RSD-based crossbars; 13.9% power reduction in router logic by router-level broadcast support; 32.2% power reduction in buffers by multicast buffer bypass; 38.2% power reduction in total.
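The per-router numbers quoted above can be tallied to see how much of the gap to the theoretical limit is itemized. This is a bookkeeping sketch only; the variable names are hypothetical, the values are the ones quoted in the text, and the text itself does not say which components account for the remainder. The implied total chip power is approximate because the 18% figure is rounded.

```python
# Per-router power at near-zero load (mW), as quoted in the text.
THEORETICAL_LIMIT_MW = 5.6   # clocking + datapath only
MEASURED_MW = 13.2           # our NoC at the same injection rate

ITEMIZED_OVERHEAD_MW = {
    "vc_bookkeeping_state": 1.9,
    "buffers": 2.0,
    "allocators": 0.7,
    "lookahead_signals": 0.2,
}

gap_mw = MEASURED_MW - THEORETICAL_LIMIT_MW
itemized_mw = sum(ITEMIZED_OVERHEAD_MW.values())
remainder_mw = gap_mw - itemized_mw  # not broken down in the text

print(f"gap to theoretical limit: {gap_mw:.1f} mW/router")
print(f"itemized overheads:       {itemized_mw:.1f} mW/router")
print(f"not itemized in the text: {remainder_mw:.1f} mW/router")

# Chip-level leakage: 76.7 mW is stated to be 18% of chip power at 653 Gb/s.
LEAKAGE_MW = 76.7
LEAKAGE_SHARE = 0.18
implied_chip_power_mw = LEAKAGE_MW / LEAKAGE_SHARE
print(f"implied chip power at 653 Gb/s: about {implied_chip_power_mw:.0f} mW")
```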
2.4.2
Virtual bypassing
Virtual bypassing of buffering to achieve single-cycle routers has been reported
in various forms [6, 30, 32]. The aggressive folding of multiple pipeline stages into
a single cycle naturally raises the question of whether that comes at the expense of
router frequency f. While our chip is the first prototype to demonstrate a single-cycle virtual bypassed router at GHz frequency, it raises the question of how much f
is affected. To quantify the timing overhead, we performed critical path analysis on
pre- and post-layout netlists of the baseline and our design. Table 2.3 shows such
estimates along with the actual measured timing.

                          Baseline router design   Our virtual bypassed router design
Pre-layout simulations    549ps                    593ps (1.08x overhead)
Post-layout simulations   658ps                    793ps (1.21x overhead)
Measured critical path    -                        961ps (1/1.04GHz)
Table 2.3: Critical path analysis results.

The critical paths of both the baseline and the proposed router occur in the second
pipeline stage where mSA-II is performed. The overhead of lookaheads lengthens the
critical path by 8% in pre-layout simulations and 20% in post-layout simulations.
It should be pointed out though that if the operating frequency is limited by the
core rather than the NoC router, which is typically the case, this 20% critical path
overhead can be hidden. In the Intel 48-core chip, nominal operation is 1GHz core
and 2GHz router frequencies, allowing any network overhead to be masked [41].
Also notable is the fact that while the critical path of the post-layout simulation
is 793ps, the maximum frequency of our chip prototype is 1.04GHz (i.e. the actual
critical path is 961ps). This is mainly due to nonideal factors (e.g. a contaminated
clock, supply voltage fluctuation and unexpected temperature variations) whose
effects cannot be exactly predicted in the design phase.
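The overhead ratios and the measured maximum frequency follow directly from the tabulated critical paths. The sketch below reproduces that arithmetic, reading the critical paths as picoseconds, since 961ps is what 1/1.04GHz works out to; the dictionary layout and function names are illustrative assumptions.

```python
# Critical paths in picoseconds, keyed by (analysis stage, design).
CRITICAL_PATH_PS = {
    ("pre-layout", "baseline"): 549.0,
    ("pre-layout", "bypassed"): 593.0,
    ("post-layout", "baseline"): 658.0,
    ("post-layout", "bypassed"): 793.0,
    ("measured", "bypassed"): 961.0,
}


def overhead(stage: str) -> float:
    """Ratio of the virtual bypassed router's critical path to the baseline's."""
    return CRITICAL_PATH_PS[(stage, "bypassed")] / CRITICAL_PATH_PS[(stage, "baseline")]


def max_frequency_ghz(path_ps: float) -> float:
    """Maximum clock frequency implied by a critical path (1000 ps = 1 ns)."""
    return 1000.0 / path_ps


for stage in ("pre-layout", "post-layout"):
    print(f"{stage}: {overhead(stage):.2f}x critical path overhead")

measured_ps = CRITICAL_PATH_PS[("measured", "bypassed")]
post_layout_ps = CRITICAL_PATH_PS[("post-layout", "bypassed")]
print(f"measured:    {max_frequency_ghz(measured_ps):.2f} GHz")
print(f"post-layout: {max_frequency_ghz(post_layout_ps):.2f} GHz predicted")
```

The difference between the last two lines is the gap between post-layout prediction and silicon that the text attributes to nonideal factors.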
2.4.3
Low-Swing Signaling
Low-swing signaling has demonstrated substantial energy gains in domains such as
off-chip interconnects and SRAMs. However, there are few NoC chip prototypes
employing low-swing signaling [29, 26], so its trade-offs and its applicability to NoCs are not yet well understood. To investigate such effects
with longer links (necessary in a manycore processor as cores are much larger than
routers) and at higher data rates than the network clock frequency, as mentioned
earlier, an identical low-swing crossbar switch with longer link wires (1mm and 2mm)
was separately implemented and measured.
Energy savings and 1-cycle ST+LT. The measured energy efficiency shows
that the 1mm 300mV-swing tri-state RSD enables 57-61% energy reduction (Figure 2-9) while the 2mm 300mV-swing link shows 65-69% energy reduction (Figure 2-10)
when compared with their equivalent 1.1V full-swing links. Since the energy benefits of
low-swing signaling come from a reduction in the dynamic power of link metal wires, the
longer low-swing link (2mm wires) has higher energy efficiency than the shorter low-swing link (1mm wires). Here, the energy consumption is measured assuming that
the lower power supply supports charge-recycling [42]. Experimental results also
demonstrate that the tri-state RSD-based crossbar supports single-cycle ST+LT at
up to 5.4GHz and 2.6GHz clock frequency with 1mm and 2mm links, respectively.
The tri-state RSDs enable a reduction in the total amount of charge and delay
required for data transitions, thereby resulting in these energy and latency benefits.
Figure 2-9: 1mm link energy efficiency of full-swing and RSD-based signaling.
Figure 2-10: 2mm link energy efficiency of full-swing and RSD-based signaling.

Synthesized full-swing crossbar        26,840um2
Proposed low-swing crossbar            83,200um2 (3.1x overhead)
Router with the full-swing crossbar    227,230um2
Router with the low-swing crossbar     318,600um2 (1.4x overhead)
Table 2.4: Area comparison with full-swing signaling.

Area overheads. Table 2.4 shows the area overhead of our 64-bit 5×5 low-swing
crossbar switch against an equivalent full-swing crossbar. The low-swing crossbar
has a high area overhead (3.1x) compared to a synthesized full-swing crossbar, as
the proposed RSDs employ differential signaling while the full-swing crossbar uses
single-ended signaling. Besides, since our low-swing crossbar was carefully laid out
due to noise coupling issues, such restricted placement and wiring of tri-state RSDs
exacerbate the area overhead. However, at the router level, the relative area overhead
goes down to 1.4x, and naturally, it will significantly diminish when compared against
an entire tile with a core, cache and router.
Process variation effects. The critical drawback of low-swing signaling is reduced noise margin. In our circuitry, the primary noise source is the sense amplifier
offset caused by process variation. While low-swing signaling enables more dynamic
energy savings as the voltage swing decreases, the process variation effect worsens. Based
on 1000-run Monte Carlo SPICE simulations, we chose a 300mV swing for above 3-σ
reliability, but the voltage swing can be further decreased by offset compensation circuit techniques [43, 44, 45] at the cost of design complexity. Figure 2-11 shows the energy
efficiency and link failure probability of the 1mm 5Gb/s tri-state RSD as a function
of voltage swing level. These results explicitly reveal the trade-off between the energy
gain of low-swing signaling and its vulnerability to process variation.
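The shape of this reliability trade-off can be illustrated with a toy Monte Carlo experiment. The sketch below is a behavioral stand-in, not the SPICE setup used for the chip: it assumes the sense amplifier offset is Gaussian, that a sample fails when the offset magnitude exceeds the signal swing, and a hypothetical 100mV offset sigma chosen only so that 300mV corresponds to 3σ. It models the failure side of the trade-off only and says nothing about energy.

```python
import random


def failure_probability(
    swing_mv: float, offset_sigma_mv: float, runs: int = 1000, seed: int = 0
) -> float:
    """Estimate the fraction of Monte Carlo samples whose offset exceeds the swing.

    Assumed model: offset ~ N(0, offset_sigma_mv); a sample fails when
    |offset| > swing_mv. Real circuits need transistor-level simulation.
    """
    rng = random.Random(seed)  # fixed seed so every swing sees the same samples
    failures = sum(
        1 for _ in range(runs) if abs(rng.gauss(0.0, offset_sigma_mv)) > swing_mv
    )
    return failures / runs


HYPOTHETICAL_SIGMA_MV = 100.0  # illustrative value, not a measured parameter

for swing in (100.0, 200.0, 300.0):
    p_fail = failure_probability(swing, HYPOTHETICAL_SIGMA_MV)
    print(f"swing {swing:.0f} mV: estimated failure probability {p_fail:.3f}")
```

With only 1000 runs, failure rates near the 3σ level rest on a handful of samples, which is one reason a swing with margin above 3σ is a conservative choice.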
2.4.4
Power Modeling and Estimation
Architectural power models such as ORION [20, 46, 47] have been extensively
adopted by researchers for early-stage evaluation of research ideas, while RTL-based
energy estimates have also been widely used. With our chip, we can now study the gap
between silicon-proven energy and different levels of energy modeling. We compare
our chip power measurements with two power estimates obtained from ORION 2.0 [20]
and post-layout netlists. The experiments (or simulations) are conducted with a 1.1V
supply voltage, 1GHz clock frequency and 653Gb/s throughput at room temperature.
Figure 2-12 summarizes the results.
Figure 2-11: Low-swing signaling trade-off between reliability and energy efficiency.
Figure 2-12: Comparison of power estimates with measurements (A: ORION 2.0 simulations, B: Post-layout simulations, C: Measured results). Bars are labeled relative to measured power, where x is our design's power consumption and y is the baseline's: A = 4.8y and 5.3x, B = 1.06y and 1.13x, C = 1y and 1x. Each bar is broken down into clocking circuits, router logic and buffers, and datapath (crossbar + link).
ORION 2.0 substantially over-estimates power (4.8-5.3x of measured chip power),
but its estimate of relative power reduction between the baseline and our design (32%
reduction) is not far from the measurements (38% reduction). The over-estimation arises because the
transistor sizes assumed in ORION 2.0 are much larger than the actual sizes in the
chip. Thus, while ORION 2.0 can be used for comparison of various system-level optimizations or early-stage design space exploration, its estimates should not be the basis
of absolute power budgets. On the other hand, the post-layout simulation gives us
fairly accurate power estimates (6-13% deviation from measurements). Specifically, it
slightly under-estimates the power of buffers and arbitration logic but over-estimates
clocking and datapath power. Relative power reduction (34%) also matches well with
measurements (38%). However, such accurate estimates come at the cost of tremendous simulation time overheads (several days for an entire NoC simulation) because
the post-layout simulation calculates its estimates at the transistor level along with
parasitic effects. Moreover, since the post-layout estimation requires complete extracted netlists, it is difficult to apply to early-stage NoC evaluation.
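The relative reductions quoted for each model can be recovered from the scaling factors in Figure 2-12 together with the measured 38.2% reduction. The sketch below assumes a particular reading of the figure labels (4.8y and 5.3x for ORION 2.0, 1.06y and 1.13x for post-layout, where y and x are the measured baseline and design power); the names are hypothetical.

```python
MEASURED_REDUCTION = 0.382  # measured power reduction of our design vs. baseline

# Ratio of measured design power (x) to measured baseline power (y).
X_OVER_Y = 1.0 - MEASURED_REDUCTION

# (baseline estimate as a multiple of y, design estimate as a multiple of x),
# as read from the figure labels (an interpretation, see lead-in).
ESTIMATE_SCALES = {
    "ORION 2.0": (4.8, 5.3),
    "post-layout": (1.06, 1.13),
    "measured": (1.0, 1.0),
}


def estimated_reduction(baseline_scale: float, design_scale: float) -> float:
    """Relative power reduction a model predicts between baseline and design."""
    return 1.0 - (design_scale * X_OVER_Y) / baseline_scale


for model, (baseline_scale, design_scale) in ESTIMATE_SCALES.items():
    reduction = estimated_reduction(baseline_scale, design_scale)
    print(f"{model}: {reduction:.0%} predicted reduction")
```

Because both bars of a given model are inflated by similar factors, the ratio between them stays close to the measured one even when the absolute numbers are several times too high.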
To sum up, cycle-accurate NoC simulations give us hardly any information about
power, area, and critical path. ORION 2.0 calculates routers' power and area estimates on the basis of gate-level technology parameters, but it explores power and
area budgets only with fixed router microarchitectures and transistor sizes. RTL-based
pre-layout NoC research enables critical path analysis at the standard cell level. It
also provides fairly reliable estimates of energy-latency-throughput performance.
These pre-layout results, however, do not include any parasitic effects such as supply
voltage drop due to the resistance of the power metal lines, signal attenuation in the
RC-dominant link wires, or noise coupling through unexpected capacitance and
inductance. Also, RTL-based simulators cannot explore some research ideas
that require custom design, like low-swing signaling techniques. HSPICE/Spectre-based post-layout simulations give NoC designers the most accurate network
performance estimates available at the CAD level, at the cost of tremendous simulation
time (on the order of days), but there is still a gap between their estimates and silicon-proven
results due to nonideal factors such as a contaminated clock, supply voltage fluctuation, unexpected temperature variations and imperfect parasitic extraction of
contacts, vias, on-chip copper wires and off-chip bonding wires.
2.5
Chapter Summary
In this chapter, we described our design of a NoC mesh chip that aims to simultaneously approach the theoretical latency, bandwidth and energy limits of a mesh,
for all kinds of traffic (unicasts, multicasts and broadcasts). We first derived such
theoretical limits of a mesh NoC for unicasts and broadcasts. This analysis closely
guided us in our design which leverages virtual bypassing to approach the theoretical
latency limit of a single cycle per hop for unicasts, multicasts and broadcasts. This,
coupled with the speed benefits of low-swing signaling, enabled us to swiftly reuse
buffers and approach theoretical throughput without trading off energy or latency.
Finally, low-swing signaling applied to the datapath helped us towards the theoretical
energy limit. To be more specific, this chapter made the following contributions:
1. It presented a mesh NoC chip prototype that showed 48-55% latency benefits,
2.1-2.2x throughput improvements and 31-38% energy savings as compared with
an equivalent baseline NoC described in Section 1.2. To the best of our knowledge, this is the first mesh NoC chip with multicast support.
2. It defined the theoretical mesh limits for unicasts and broadcasts, in terms
of latency, throughput and energy. We also characterized several prior chip
prototypes’ performance relative to these limits.
3. It presented lessons learnt from our prototyping experience:
• Virtual bypassing can enable 1GHz single-cycle router pipelines and 32%
buffering energy savings with negligible area overhead (5% only). It comes
at the expense of a 21% increased critical path, though this timing overhead
can be masked in multicore processors where cores limit the clock frequency
rather than routers. More critically, virtual bypassing does not address
non-data-dependent power.
• Low-swing signaling can substantially reduce datapath energy (3.2x less
energy in 1mm links compared to a full-swing datapath) as well as realize high-frequency single-cycle traversal per hop (5.4GHz with a 64-bit
5×5 crossbar and 1mm links), but comes with increased process variation
vulnerability and area overhead.
• System-level NoC power modeling tools like ORION 2.0 can be way off in
absolute accuracy (∼5x of measured chip power) but maintain relative accuracy. Post-layout power simulations are much
closer to measured power numbers, but post-layout timing simulations are
still off.
3
Low-Swing Datapath for Reconfigurable NoCs
This chapter explores two circuit optimization opportunities that reconfigurable NoC
architectures create: self-resetting logic repeaters (SRLRs) improve the energy efficiency of reconfigurable NoCs without affecting the network performance, while voltage-locked repeaters (VLRs) enhance the reconfigurable NoC's performance while sustaining
datapath energy efficiency.
3.1
Background: Reconfigurable NoCs
Multiprocessor systems-on-chip (MPSoCs) have integrated more and more general-purpose and application-specific processor elements (PEs) to meet the requirements
of progressively more compute-intensive applications [48, 49], while such applications have
increased and diversified with the proliferation of smart phones [50]. Traffic on manycore MPSoCs that support diverse applications varies substantially depending on the
application being executed, and this dynamic on-chip traffic should be delivered with low
latency and high bandwidth at low power consumption to satisfy aggressive MPSoC
design targets.
To tackle this design challenge, one approach has been to tailor the NoC topology to match application communication patterns at design time. Star-Ring [26],
Octagon [49], Fat Tree [51] and the high-radix crossbar [52] serve as examples of
network topologies customized at design time. These NoCs, coupled with equalized
on-chip interconnects [37, 38, 39, 40], can achieve single-cycle transmission between
distant PEs. However, this approach requires knowledge of all applications and their
communication graphs at design time to be able to pin these dedicated express links
to specific pairs of dedicated cores, and assumes sufficient wiring density to support
dedicated links between all communicating cores.
Figure 3-1: Single-cycle reconfigurable NoC [1] with SMART links (red bold lines)
where its backbone mesh network is reconfigured at run time.
An alternative has been to employ a scalable network topology at design time, such as
a mesh connecting a collection of generic PEs (like ARM processors), then reconfigure
the network at run time to match application traffic. Since router delays can vary
depending on congestion, prior NoC literature [21, 22, 23, 24, 25] has proposed pre-reservation of (parts of) the route to provide predictable and bounded delays. These
NoC architectures perform an offline computation of contention-free routes, allowing
flits to bypass queues and arbiters at routers where there is no conflict between the
routes of different flows. We will refer to this flexible communication fabric as a
reconfigurable NoC in this thesis.
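The idea of computing offline where flows can bypass routers can be sketched in a few lines. The following is a simplified illustration under stated assumptions, not the algorithm of any of the cited architectures: it assumes XY routing on a mesh, treats two flows as conflicting at a router when they leave it on the same output link, and uses hypothetical flow names and coordinates.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Node = Tuple[int, int]
Link = Tuple[Node, Node]


def xy_route(src: Node, dst: Node) -> List[Node]:
    """Dimension-ordered route: travel along X first, then along Y."""
    x, y = src
    path = [src]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def bypassable_routers(flows: Dict[str, Tuple[Node, Node]]) -> Dict[str, List[Node]]:
    """For each flow, list the routers whose output link no other flow uses."""
    routes = {name: xy_route(src, dst) for name, (src, dst) in flows.items()}
    link_users: Dict[Link, Set[str]] = defaultdict(set)
    for name, path in routes.items():
        for link in zip(path, path[1:]):
            link_users[link].add(name)
    return {
        name: [a for a, b in zip(path, path[1:]) if link_users[(a, b)] == {name}]
        for name, path in routes.items()
    }


# Two hypothetical flows on a 4x4 mesh that share part of the bottom row.
flows = {"A": ((0, 0), (3, 0)), "B": ((1, 0), (3, 1))}
result = bypassable_routers(flows)
print(result)
```

In this example the two flows contend for the links leaving routers (1, 0) and (2, 0), so only the remaining routers on each route are candidates for bypassing.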
3.2
Introduction: Clockless Low-Swing Repeaters
The reconfigurable NoC architectures offer two link optimization opportunities.
First, we can further reduce the dynamic energy of the pre-paved routes through
clockless low-swing repeaters. The low-swing circuit proposed in Chapter 2, as well as
most existing low-swing interconnects [36, 53, 54, 55, 56, 57], requires clocked sense
amplifiers at every node in a mesh, so it will have to pay unnecessary clocking energy
and delay at the contention-free nodes when embedded into the reconfigurable NoC
datapath. In addition, the area overhead and system costs (such as an additional
power supply voltage and differential wires) make its adoption in a NoC datapath
infeasible. Motivated by these challenges, this chapter presents a self-resetting logic
repeater (SRLR) that enables clockless, single-ended low-swing signaling without the
extra supply voltage.
Second, we can maximize the latency benefits of the pre-reserved routes through
Single-cycle Multi-hop Asynchronous Repeated Traversal (SMART) links, enabling
flits to potentially incur a single-cycle delay all the way from the source to the destination. Figure 3-1 shows the single-cycle reconfigurable NoC, where a network reconfigures into 3 different topologies for 3 different applications. The single-cycle delay
benefits of SMART can be obtained even at high clock frequency (e.g. 2GHz on a 4×4
mesh in 45nm SOI CMOS) by voltage-locked repeaters (VLRs), in which a low-swing
technique is optimized for lower transmission delay. In other words, VLRs stretch the
maximum distance that can be spanned in a cycle beyond that of full-swing repeated links
(or other existing low-swing links). Such VLR-based SMART links can lead to energy savings as
well as latency reduction when actual MPSoC application traffic is delivered through
the network [1].
We will first investigate in Section 3.3 why embedding existing low-swing links into the mesh,
which is the backbone of our reconfigurable NoCs, does not sufficiently tackle the challenges of NoC design, then present design details and silicon-proven
performance of SRLRs and VLRs in Section 3.4 and Section 3.5, respectively. Finally, Section 3.6 will summarize and conclude our second test chip prototyping of
the clockless low-swing repeaters.
3.3
Related Work: Existing Low-Swing Links
As demonstrated through SWIFT [29] and our first test chip, low-swing drivers
can be embedded within mesh NoC routers and shown to substantially reduce NoC
energy, but such low-swing links and other existing low-swing circuits face key NoC
design challenges. First, the area overhead imposed by low-swing drivers is of prime
concern, since a NoC shares precious on-die real estate with processor cores, caches,
memory controllers, etc. Second, low-swing signaling comes at the cost of reduced
noise margin, which is crucial as packet losses are not tolerated in NoCs. Third, existing low-swing circuits impose a considerable system overhead such as an additional
dedicated power supply voltage or clocking circuitry in an entire NoC datapath, or
provide energy-optimal design of only one-to-one signaling, making their adoption in
a mesh fabric infeasible. We will next explain in detail why prior circuits may come
up short in area, robustness and energy-efficient application to a mesh.
Apart from traditional low-swing circuits which use a lower supply voltage or inherent threshold voltage drop [15, 36], there have been a number of more sophisticated
low-swing circuits proposed, based on linear-mode transistors [29, 58], charge sharing [53, 54, 55], cut-off drivers [15, 56, 57] and channel attenuation [37, 38, 39]. The
low-swing drivers exploiting linear-mode transistors [29, 58] are composed of PMOS
pullups and pulldowns only (or NMOS pullups and pulldowns only) to obtain lower
linear drive resistance even at small Vds. While such designs enable better energy efficiency and higher bandwidth than the traditional low-swing signaling generated by
simply lowering power supply voltage, they require differential wiring, clocked sense
amplifiers and an additional power supply voltage. In particular, the additional power
supply dedicated only to the NoC low-swing datapath can be a system overhead in
manycore processor design. The charge sharing-based low-swing drivers [53, 54, 55]
limit voltage swing without a second power supply voltage, but they require fixed
data patterns for reliable operation, which limits NoC design. The voltage swing
of the cut-off drivers [15, 56, 57] is directly affected by threshold voltage variation
of drive transistors, thus requiring complicated receivers to sense and calibrate the
threshold voltage variation, resulting in an area overhead.
Equalized on-chip interconnects [37, 38, 39, 40] can generate low-swing signaling
by leveraging the inherent channel attenuation of RC-dominant wires and have successfully provided high-bandwidth low-power global links that transmit data through
long wires (5-10mm). These long equalized links can be used as point-to-point wires
between pairs of cores, but as there is insufficient on-die wiring to support dedicated
links between all pairs of cores, equalized links map more readily to indirect, multistage NoC topologies with long global links, thus being advantageous to the MPSoC
NoC tailored at design time where specific flows can be optimized with equalized
interconnects. Besides, such topologies do not leverage application locality, turning
all traffic into cross-die global traversals, which leads to high NoC latency and energy
overheads. Meshes, on the other hand, are dominated by short local core-to-core
links. Adopting equalizers as parallel links in a mesh NoC will lead to considerable
area overhead (e.g. the 10mm 1-bit driver of [39] occupies 1760µm2 ). Yet another
way of incorporating long equalized links in meshes is to use them as express links
between far-away cores [59, 60]. That increases router port count though, leading to
high NoC area overhead. On top of that, direct transmission on a long global wire
makes equalized interconnects vulnerable to wire capacitance/resistance variation and
crosstalk coupling noise.
3.4
Self-Resetting Logic Repeater (SRLR)
In this section, we seek to tackle the above-mentioned design challenges of incorporating low-swing signaling in a reconfigurable NoC. To be specific, this section presents
a novel low-swing signaling circuit named a self-resetting logic repeater (SRLR). The
proposed repeater has the following features:
• The SRLR enables low-swing signaling to be repeated without a reference clock,
and hence, eliminates clocking energy and delay at contention-free nodes of the
pre-paved routes in a reconfigurable NoC.
• The SRLR enables single-ended low-swing signaling, consuming less energy than
differential low-swing signaling at the same wire density (i.e., the SRLR can
have higher wire density at the same energy budget).
• The SRLR achieves low-swing signaling mainly through the inherent wire channel attenuation so it does not require additional power supplies and works across
all data patterns.
• The SRLR enables low-swing signaling to be regenerated with a single repeater
length, the wire length of local core-to-core links in a mesh. A single optimized
SRLR design can thus be used for energy-efficient signaling between any pair
of nodes in a mesh. As a side benefit, the SRLR enables 1-to-N multicasts for
free since inherent full-swing signals are available at every intermediate repeater
node. This multicast capability is a significant benefit as multicast traffic forms
a sizable portion of NoC traffic [6].
• The SRLR incorporates circuit techniques to mitigate global process variation
and ensure robustness of single-ended low-swing signaling.
3.4.1 SRLR Circuit Design
Figure 3-2 shows the overall 10mm link with SRLRs located at the end of each
1mm wire segment connecting adjacent routers in a mesh NoC. Typically, embedding
repeaters within the crosspoints of a crossbar can lead to increased layout complexity
due to the active silicon region in the midst of wires. The SRLR-based datapath,
however, averts that by ensuring that the SRLR insertion length is equal to the
router-to-router distance in a mesh NoC. We assume that the local router-to-router distance
is 1mm, and accordingly, the SRLR transistors are optimally sized to directly drive
the 1mm wire in order to offer low-swing repeated signaling without adding to layout
complexity. The only implementation overheads of the proposed low-swing signaling
are thus a pulse modulator (PM) and a demodulator (DM) required for pulse-based
data communication. With the PMs and DMs at every router, our proposed circuit
can send low-swing pulses to a far-away node in a mesh without energy overheads since
each SRLR drives only a 1mm wire segment and the low-swing pulses are repeated
without clocking. In addition, our SRLR-based datapath provides low-swing 1-to-N
multicast capability for free while equalized links [37, 38, 39, 40] offer only 1-to-1
unicasts. For instance, in Figure 3-2, the data sent from the 1st SRLR to the
10th SRLR can be directly sampled at all the intermediate SRLRs. This inherent
multicast capability can result in substantial benefits in NoCs that see significant
multicast traffic [6].
Figure 3-2: 10mm SRLR-based link for the mesh-based reconfigurable NoC where the
local router-to-router distance is 1mm.
Figure 3-3: Proposed SRLR circuit and its simulated waveforms.
Figure 3-3 shows the proposed SRLR circuitry along with its simulation waveforms. When a pulse (whose low-swing is obtained by wire channel attenuation)
arrives at an input NMOS (M1), the node X is discharged and output voltage of
the SRLR (OUT) becomes high. The node X is again charged when a reset signal
comes back through a delay cell, generating another pulse at the output. As a last
step, a keeper NMOS (M2) lowers the node X voltage down to VDD-Vth after the
pulse is repeated. The reduced standby voltage at the node X increases amplification gain of the current-starved inverter (INV) but this standby voltage should stay
above the threshold voltage of INV across process variation. Also, the size ratio of
M1/M2 should be designed to allow enough SRLR input sensitivity at a given low-swing voltage level. The current-starved inverter (INV) amplifier becomes activated
when enable signal (EN) is high, and this 3-port (IN, OUT and EN) circuit design
allows SRLRs to be directly integrated into a matrix crossbar switch.
While single-ended low-swing signaling has higher energy efficiency than differential low-swing signaling, this comes at the expense of global (die-to-die) process
variation immunity. To mitigate such variation effects on the proposed on-chip signaling, the SRLR-based link employs three circuit techniques: an alternating delay
cell design, an NMOS-based driver and an adaptive swing voltage scheme.
Alternating Delay Cell Design:
First, we propose an alternating delay cell design where odd SRLRs and even SRLRs
incorporate different delay cells. As shown in Figure 3-5, the SRLR output pulse
width ($PW_{out}$) is a function of node X’s pulse width ($PW_{x}$), which is mainly given by
the delay of the delay cell, and the difference between the rising time ($t_{rising}$) and the falling
time ($t_{falling}$) of the INV amplifier. At the n-th SRLR, the output pulse width
($PW_{out,n}$) can be expressed as:
\[
PW_{out,n} = PW_{x,n} - D_{rising,n} + D_{falling,n} = PW_{x,n} - (t_{rising,n} - t_{falling,n}).
\]
The rising time becomes longer (or shorter) as the input pulse swing gets smaller (or
bigger), whereas the falling time experiences little change with the input pulse swing
change.
With a single delay cell design (e.g. a 6-buffer delay cell, whose delay gives the single delay
cell design its most reliable repeated signaling in a no-variation simulation
environment), this influence of the input pulse swing on the rising time of INV accumulates
over several SRLR stages, and hence, the rising time gradually becomes
longer (or shorter) at the smaller (or bigger) initial pulse swing caused by the process
variation. The increasing (or decreasing) rising time causes a shrinking (or widening)
output pulse width, resulting in a transmission failure at the end of the 10mm link.
In other words, the output pulse widths obtained from process corner simulations of
the single delay cell design are
\[
PW_{out,0} > PW_{out,1} > PW_{out,2} > \dots > PW_{out,10} \quad \text{(bit 1 transmission failure)} \tag{3.1}
\]
or
\[
PW_{out,0} < PW_{out,1} < PW_{out,2} < \dots < PW_{out,10} \quad \text{(bit 0 transmission failure)}. \tag{3.2}
\]
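The accumulation described above can be illustrated with a toy model. The sketch below simply applies the pulse-width expression stage by stage and assumes, purely for illustration, that the INV rising time drifts by a fixed amount per stage while the node-X pulse width and the falling time stay constant. The function names, the linear drift and all numeric values are hypothetical and are not taken from the fabricated design.

```python
def output_pulse_width(pw_x, t_rising, t_falling):
    """PW_out,n = PW_x,n - (t_rising,n - t_falling,n); all times in ps."""
    return pw_x - (t_rising - t_falling)


def single_delay_cell_chain(pw_x, t_rising0, t_falling, drift_per_stage, stages=10):
    """Output pulse widths PW_out,0 .. PW_out,stages for one delay cell type.

    Assumption (illustrative only): the rising time changes linearly by
    drift_per_stage at every SRLR, and the falling time is constant.
    """
    widths = []
    for n in range(stages + 1):
        t_rising = t_rising0 + n * drift_per_stage
        widths.append(output_pulse_width(pw_x, t_rising, t_falling))
    return widths


# Hypothetical numbers: a smaller initial swing lengthens the rising time
# stage by stage (shrinking pulses); a bigger swing shortens it (widening).
shrinking = single_delay_cell_chain(100.0, 20.0, 20.0, drift_per_stage=3.0)
widening = single_delay_cell_chain(100.0, 20.0, 20.0, drift_per_stage=-3.0)

print("shrinking:", shrinking)
print("widening: ", widening)
```

Under these assumptions the first list decreases monotonically, as in (3.1), and the second increases monotonically, as in (3.2). The real circuit is nonlinear, so this only shows the direction of the trend, not its magnitude.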
The proposed alternating delay cell design, on the other hand, enables output pulse
widths to increase (or decrease) even with the longer (or shorter) rising time of the
INV amplifier through the intentionally-increased (or intentionally-decreased) delay
of the delay cell. The alternating design can still saturate, but because of the
non-linearity of the feedback (where larger input pulse width causes even larger change in
output pulse width) the alternating design takes more stages to saturate. Therefore,
the alternating design improves the probability of correct operation for a fixed link
length.
NMOS-based Driver:
Global process variation influences the output stage of the SRLR as well. Under a
straightforward implementation, an inverter driver at the output exhibits two distinct
failure modes. In one mode, a weak PMOS will generate insufficient voltage swing
at the input of the following stage. In the other mode, a strong PMOS generates too
much voltage swing for a weak NMOS to fully discharge the node at the end of a
wire channel prior to the arrival of the next bit. Accordingly, the worst-case sequence
of ‘11110’ will eventually saturate the voltage and prevent transmission of several 1s
followed by a 0. The NMOS-based driver in this design supplies both pull-up and
pull-down currents through NMOS devices, so the strong PMOS condition no longer
applies. The resulting circuit is more robust since it is optimized for only one failure
mode at a weak NMOS corner, instead of two distinct failure modes across a weak
PMOS or a strong PMOS with weak NMOS.
Adaptive Swing Voltage Scheme:
Having a robust NMOS-based circuit also allows the optimization of transmission
energy. At a strong NMOS corner, the output pulse swing tends to be excessively
high, especially for the lower Vth of the input NMOS (M1) of the next stage. Therefore,
the adaptive voltage swing scheme with an on-chip bias current generator (Figure 3-5)
tracks the M1 threshold voltage to reduce swing voltage, avoiding the needless waste
of energy. In other words, when M1 is fabricated with higher (or lower) threshold
voltage than the nominal value, the lower (or higher) Vref is applied to the NMOS-based
drivers to increase (or decrease) voltage swing. The bias current, which does not
contain any threshold voltage-related terms for the first order analysis [61], is tolerant
of process and temperature variations so that Vref is mainly given by the threshold
voltage and technology parameters of M1, a primary determinant transistor of the
SRLR input sensitivity.
Figure 3-4: 1000-run Monte-Carlo simulation results that show the impact of each
variation-robust design technique.
Figure 3-4 shows the error probability obtained from 1000-run Monte-Carlo simulations on different SRLR designs with various swing voltages. At the voltage swing
selected for test chip fabrication, the proposed process variation robust SRLR design
achieves about 3.7 times higher process variation immunity than the straightforward
SRLR design that incorporates inverter drivers (instead of NMOS-based drivers) and
6-buffer delay cells only (instead of an alternating delay cell design) without the
adaptive swing voltage scheme.
Figure 3-5: Process variation robust SRLR circuit with (1) an alternating delay cell design, (2) NMOS-based drivers and (3) an
adaptive swing voltage scheme.
3.4.2 Test Chip Fabrication and Measurement
In order to demonstrate the energy efficiency and performance of the proposed
low-swing on-chip signaling, a proof-of-concept chip of a 1bit 10mm SRLR-based link
(described in Figure 3-2) is implemented using a 45nm SOI CMOS process. Figure 3-6
shows its die photograph overlaid with a design layout where each SRLR occupies
47.9µm² active silicon area.
Figure 3-6: Die photograph of the SRLR test chip in 45nm SOI CMOS that includes
an on-chip test circuit and an on-chip clocking circuit.
The fabricated link is fed by pseudo-random binary sequence (PRBS) data generated
on-chip, and a test circuit performs data comparison and error counting. This
on-chip measurement circuit shows that the 1bit 10mm SRLR-based on-chip
interconnect can deliver up to 4.1Gb/s data with a bit error rate (BER) that is
less than 10⁻⁹. Measurement results show that the SRLR-based on-chip signaling
achieves 6.83Gb/s/µm bandwidth density at its maximum data rate of 4.1Gb/s,
consuming 1.66mW (i.e., 404fJ/bit/cm or 40.4fJ/b/mm) at a power supply voltage of
0.8V. Figure 3-7 shows 10mm link traversal (LT) energy versus bandwidth density
characteristics of the SRLR-based link and other silicon-proven on-chip
interconnects [58, 37, 38, 39]. Details of the fabricated test link are summarized in Table
3.1 together with the previous works.
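The per-bit energy quoted above follows directly from the measured power and data rate. The short sketch below reproduces that arithmetic; the function name is illustrative, and the wire-pitch estimate assumes that bandwidth density is defined as data rate divided by wire pitch, which the text does not state explicitly.

```python
def energy_per_bit_fj(power_w, data_rate_bps):
    """Energy per transmitted bit in femtojoules (power divided by data rate)."""
    return power_w / data_rate_bps * 1e15


# Measured values quoted in the text for the 1bit 10mm SRLR-based link.
POWER_W = 1.66e-3          # 1.66mW at 0.8V
DATA_RATE_BPS = 4.1e9      # 4.1Gb/s
LINK_LENGTH_MM = 10.0

link_fj_per_bit = energy_per_bit_fj(POWER_W, DATA_RATE_BPS)   # per 10mm (1cm)
fj_per_bit_per_mm = link_fj_per_bit / LINK_LENGTH_MM

# Assumption: bandwidth density = data rate / wire pitch.
implied_pitch_um = 4.1 / 6.83

print(f"{link_fj_per_bit:.1f} fJ/bit/cm, {fj_per_bit_per_mm:.1f} fJ/b/mm")
print(f"implied wire pitch: {implied_pitch_um:.2f} um")
```

With the rounded inputs this gives roughly 405fJ/bit over the 10mm link, in line with the 404fJ/bit/cm and 40.4fJ/b/mm figures reported above.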
When analyzing the comparison results in Table 3.1, we should be aware of the
following considerations. First, higher bandwidth density (i.e., smaller wire spacing)
incurs larger wire coupling capacitance, resulting in higher energy consumption. Thus,
the energy consumption of on-chip interconnects should be considered along with their
bandwidth density as shown in Figure 3-7. Second, CMOS process scaling does not
provide much energy benefit for on-chip signaling circuits since the load capacitance
of on-chip interconnects is mostly given by their long wire capacitance (not by the
gate capacitance) [16]. Lastly, the low-swing circuit proposed in Chapter 2 requires
an additional power supply, and its energy is evaluated assuming optimistically that
the additional power supply has no charge-recycling circuits.
The on-chip bias circuit for an adaptive swing voltage scheme consumes 587µW
and it can be shared by all parallel links at a NoC router. When considering a 64bit
10mm link implementation, the bias circuit dissipates just 0.6% of total link power.
In order to compare the power consumption and area of our SRLR-based datapath
with those of an entire router, we synthesized a typical mesh router (64 bits, 5 ports,
4 VCs, and 16 buffers) in the same process, 45nm SOI CMOS. Extracted simulation
results showed that input buffers and control logic consume 38.8mW and 5.2mW
respectively, while our low-swing datapath consumes 12.9mW. Area wise, the SRLR
low-swing datapath occupies 18% of the overall router footprint.
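The bias-circuit overhead quoted above can be checked with a one-line calculation. The sketch below assumes that a 64bit link dissipates 64 times the measured 1bit link power (1.66mW) and that the single shared bias circuit is compared against that total; both are simplifying assumptions for illustration.

```python
BIAS_POWER_W = 587e-6        # shared on-chip bias circuit
LINK_POWER_1BIT_W = 1.66e-3  # measured 1bit 10mm SRLR link at 4.1Gb/s


def bias_overhead_percent(link_width_bits):
    """Bias power as a percentage of the parallel link power.

    Assumption: link power scales linearly with the number of parallel bits.
    """
    return 100.0 * BIAS_POWER_W / (link_width_bits * LINK_POWER_1BIT_W)


overhead_64b = bias_overhead_percent(64)
print(f"bias overhead for a 64bit link: {overhead_64b:.2f}%")
```

Under these assumptions the overhead comes out at roughly 0.55%, consistent with the approximately 0.6% stated in the text.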
Table 3.1: Comparison of silicon-proven low-swing on-chip interconnects.
Design | Signaling Type | Data Rate | Bandwidth Density | Energy for 10mm Link Traversal (LT) | Process Technology
JSSC’10 [62] | fully differential | 2Gb/s | 1.163Gb/s/µm | 340fJ/bit/cm (repeaterless) | 90nm bulk CMOS
JSSC’10 [63] | fully differential | (4Gb/s), 6Gb/s | (2Gb/s/µm), 3Gb/s/µm | (370fJ/bit/cm), 630fJ/bit/cm (repeaterless) | 90nm bulk CMOS
ISSCC’10 [64] | fully differential | 4.9Gb/s | 4.375Gb/s/µm | 340 × 2 = 680fJ/bit/cm (2 repeaters) | 90nm bulk CMOS
Low-swing circuit in Chapter 2 | fully differential | 5.4Gb/s | 6.0Gb/s/µm | 56.1 × 10 = 561fJ/bit/cm (10 repeaters) | 45nm SOI CMOS
SRLRs in Chapter 3 | single-ended | 4.1Gb/s | 6.83Gb/s/µm | 404fJ/bit/cm (10 repeaters) | 45nm SOI CMOS
Figure 3-7: 1cm link traversal (LT) energy versus bandwidth density.
3.5 Voltage-Locked Repeater (VLR)
A low-swing signaling technique can be utilized for lower transmission delay (instead of lower energy) by reducing the charging and discharging time of interconnects’
parasitic capacitance. For such a delay-optimized low-swing circuit, wire driving currents should be maintained even with the reduced voltage swing. It is again noted
that regular repeaters with a lower power supply voltage (simple low-swing links) have
weaker driving currents, thus leading to longer delay than conventional full-swing repeaters.
This section presents another novel low-swing circuit optimized for lower latency
and higher bandwidth in the network, a voltage-locked repeater (VLR). The proposed
circuit enables Single-cycle Multi-hop Asynchronous Repeated Traversal (SMART)
which forms the basis of the single-cycle reconfigurable NoC [1] explained as our
background in Section 3.1. The VLR-based link stretches the maximum distance
that a full-swing repeated link can span in a cycle.
3.5.1 VLR Circuit Design
Figure 3-8 shows the proposed VLR circuit design. Once a logic High signal arrives
at node X through a highly-resistive link wire and exceeds the threshold voltage of the
first inverter amplifier (INV1x), the logic High signal starts to traverse the feedback
path. When the feedback signal turns RxN on, the node X voltage is locked at
some voltage level, Vhigh, resulting in low-swing signaling along with wire channel
attenuation as shown in Figure 3-9. The inverse occurs when a logic Low signal is
applied at node X. The proposed low-swing generation maintains the node X voltage
near the threshold voltage of INV1x without a decrease in driving current, and hence,
enables lower propagation delay for the next symbol. Since the low-swing voltage
level is determined by transistor sizes and link wire impedance (Vhigh is given by
link wire resistance, TxP’s on-state resistance and RxN’s on-state resistance while
Vlow is determined by link wire resistance, TxN’s on-state resistance and RxP’s
on-state resistance), careful gate sizing and extracted simulations are required to prevent
oscillation and static current through the RxP-to-RxN path in all possible process
corners. We thus custom-designed the proposed low-swing repeater circuit and used
this block in our NoC generator as a standard cell in the synthesis flow.
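As a rough first-order illustration of how the locked levels depend on the three resistances named above, the sketch below treats each on-state transistor as a fixed linear resistor and computes the node X voltage as a static resistive divider. This DC model, the function name and every resistance value are assumptions made for illustration; the actual levels were set through gate sizing and extracted simulations, not through this formula.

```python
def locked_levels(vdd, r_wire, r_txp, r_txn, r_rxp, r_rxn):
    """Return (v_high, v_low) at node X from a static resistive-divider model.

    Assumption: on-state transistors behave as fixed linear resistors.
    """
    # Logic High path: VDD -> TxP -> link wire -> node X -> RxN -> GND
    v_high = vdd * r_rxn / (r_txp + r_wire + r_rxn)
    # Logic Low path: VDD -> RxP -> node X -> link wire -> TxN -> GND
    v_low = vdd * (r_wire + r_txn) / (r_rxp + r_wire + r_txn)
    return v_high, v_low


# Hypothetical resistances in ohms, chosen only so that the swing lands near
# the 290mV figure quoted later in this section. They are not design values.
v_high, v_low = locked_levels(
    vdd=1.0, r_wire=200.0, r_txp=155.0, r_txn=155.0, r_rxp=645.0, r_rxn=645.0
)
swing = v_high - v_low
print(f"Vhigh = {v_high:.3f} V, Vlow = {v_low:.3f} V, swing = {swing * 1e3:.0f} mV")
```

The model makes the sizing trade-off visible: a weaker receiver-side device (larger RxN or RxP resistance) widens the swing, while a more resistive wire or transmitter narrows it.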
In this low-swing circuit design, a delay cell in the feedback path plays a key role
in making our single-ended low-swing signaling variation-robust. The delay cell generates transient overshoots at node X, leading to lower repeater propagation delay,
and more importantly, larger noise margin. An advantage of this delay cell-included
low-swing repeater design is that such delay and reliability benefits can be obtained
without a significant energy overhead since the high frequency overshoots are filtered
out through long and narrow link wires (i.e. highly resistive and capacitive channels) which act as a low-pass filter. 1000-run Monte Carlo simulations done on the
extracted netlist of the fabricated design at 6.8Gb/s data rate and 290mV voltage
swing (i.e. Vhigh -Vlow =290mV) at the end of link wires in a 45nm SOI CMOS process design kit show that the delay cell enables 3.4x lower process variation failure
probability at just 11% energy overhead.
While the proposed low-swing repeater does not require clocking power and differential signaling, it has static current paths between two consecutive repeaters,
TxP-wire-RxN for logic High and TxN-wire-RxP for logic Low. It should be noted,
however, that the static energy is much less than a conventional continuous-time
comparator since the static current paths include a highly-resistive link wire. Also,
switching off the enable signal (EN) when the link is not used helps eliminate unnecessary static power dissipation. Thus, as long as TxP and RxN are optimally sized
for the target data rate and wire impedance (which is given by specific CMOS technology, wire pitch and width), the VLR-based links do not pay higher link energy for
the lower latency and higher bandwidth.
Figure 3-8: Proposed clockless low-swing voltage-locked repeater (VLR) for single-cycle multi-hop link traversal.
Figure 3-9: Simulated waveforms at 6.8Gb/s: (a) original input data and (b) VLR’s low-swing signaling at node X.
3.5.2 Test Chip Fabrication and Measurement
To prove the concept of VLRs that outperform full-swing repeaters in terms of
both latency and bandwidth without energy overhead, both a 1b 10mm VLR-based
on-chip link and an equivalent full-swing link were fabricated in 45nm SOI CMOS
along with SRLRs. Both VLRs and full-swing repeaters have the same repeater length
(1mm), link wire pitch (minimum DRC pitch) and width (minimum DRC width).
Figure 3-10 shows the test chip die photo with a design layout.
Figure 3-10: 1bit 10mm VLR-based on-chip link and its equivalent full-swing link fabricated on the same die as SRLRs in 45nm SOI CMOS.
Labeled blocks: voltage-locked repeater (VLR), on-chip test circuit, 1mm (repeater length) metal wire (M7), VLR clocking circuits, SRLRs and equivalent full-swing repeaters.
The experimental environment is identical to that of the SRLRs. The pseudo-random
binary sequence (PRBS) feeds both VLR-based links and full-swing repeaters, and
on-chip test circuits perform input and output data comparison and error counting.
Due to the limitation of such an on-chip test environment, we cannot measure the
BER accurately. Instead, we can only determine whether the 10mm link BER is less
than 10⁻⁹ or not.
Test chip measurement results verify the VLR performance expected in its design
phase. First, VLRs exceed the equivalent full-swing repeaters in terms of bandwidth.
The fabricated VLR-based 10mm link achieves the maximum data rate of
6.8Gb/s with 4.14mW power consumption (i.e. 608fJ/b energy efficiency) at the supply
voltage of 1.0V, maintaining BER below 10⁻⁹. On the other hand, the equivalent
full-swing repeaters can send 5.5Gb/s data at most, with a BER of less than
10⁻⁹, consuming 4.21mW (i.e. 765fJ/b) at the same supply voltage. At the data
rate of 5.5Gb/s (the maximum data rate of the full-swing repeaters), VLRs dissipate
3.78mW (i.e. 687fJ/b) at 1.0V supply voltage. Second, latency wise, experiment
results demonstrate that the VLR-based link also surpasses its equivalent full-swing
repeaters; the delay of a link with VLRs is around 64ps/mm, whereas the delay of
a link with full-swing repeaters is around 100ps/mm. In short, the 10mm 10-hop
VLR-based link transmits its PRBS input data with 23.6% higher bandwidth, 35.8% lower
latency and 10.4% less energy than its counterpart full-swing interconnect. While the
energy benefit varies depending on the input data pattern (a higher transition rate leads
to bigger energy savings, hence our PRBS is favorable to VLRs), wire transmission
delay is independent of the input data. Therefore, measurements of our chip prototype
provide evidence for the latency and bandwidth advantages of VLRs.
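The per-bit energies and relative improvements above can be recomputed from the quoted powers, data rates and delays. The sketch below does so with illustrative helper names. Because the inputs are rounded (for example, delays of "around" 64ps/mm and 100ps/mm), the recomputed latency and energy percentages differ slightly from the 35.8% and 10.4% stated in the text.

```python
def energy_per_bit_fj(power_w, data_rate_bps):
    """Energy per bit in femtojoules over the whole 10mm link."""
    return power_w / data_rate_bps * 1e15


def percent_change(new, reference):
    """Relative change of new versus reference, in percent."""
    return 100.0 * (new - reference) / reference


# Measured values quoted in the text (1.0V supply, 10mm links).
vlr_at_max = energy_per_bit_fj(4.14e-3, 6.8e9)      # VLR at 6.8Gb/s
full_swing = energy_per_bit_fj(4.21e-3, 5.5e9)      # full-swing at 5.5Gb/s
vlr_at_5p5 = energy_per_bit_fj(3.78e-3, 5.5e9)      # VLR at 5.5Gb/s

bandwidth_gain = percent_change(6.8, 5.5)            # maximum data rates, Gb/s
latency_change = percent_change(64.0, 100.0)         # ps/mm
energy_change = percent_change(vlr_at_5p5, full_swing)

print(f"VLR @6.8Gb/s: {vlr_at_max:.0f} fJ/b, full-swing: {full_swing:.0f} fJ/b, "
      f"VLR @5.5Gb/s: {vlr_at_5p5:.0f} fJ/b")
print(f"bandwidth {bandwidth_gain:+.1f}%, latency {latency_change:+.1f}%, "
      f"energy {energy_change:+.1f}%")
```

The comparison at 5.5Gb/s is the like-for-like one, since both links then carry the same data at the same rate and supply voltage.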
As compared to the simulated results, the energy efficiency and maximum data
rate of the fabricated VLRs deviate considerably in absolute terms (by up to 18% and 31%
in energy efficiency and maximum data rate, respectively), while the cycle-wise latency
performance exactly matches the simulation results across all data rates. This
cycle-wise latency certainty of our simulation environment provides a solid foundation
for the following discussion, which is based on such latency simulation results.

Table 3.2: Maximum hop counts in a single cycle at high data rate.
Data Rate | Full-swing Repeaters | Fabricated VLRs
4 Gb/s | 4 (98 fJ/b/mm) | 7 (132 fJ/b/mm)
5 Gb/s | 3 (89 fJ/b/mm) | 6 (107 fJ/b/mm)
5.5 Gb/s | 3 (85 fJ/b/mm) | 5 (96 fJ/b/mm)

Table 3.3: Maximum hop counts in a single cycle at low data rate.
Data Rate | Full-swing Repeaters | Re-optimized VLRs
1 Gb/s | 13 (103 fJ/b/mm) | 16 (128 fJ/b/mm)
2 Gb/s | 6 (95 fJ/b/mm) | 8 (104 fJ/b/mm)
3 Gb/s | 4 (84 fJ/b/mm) | 6 (87 fJ/b/mm)
The system-level design goal of VLRs, as Single-cycle Multi-hop Asynchronous
Repeated Traversal (SMART) links in the single-cycle reconfigurable NoC [1], is
to stretch the maximum hop counts that conventional full-swing repeaters traverse
within a single cycle. Unfortunately, our on-chip test environment does not give this
information since only the last repeater is connected to the comparison circuit. In
other words, our test chip can only tell us how many cycles are required for 10mm
link traversal (LT). Hence, we obtained such information from Spectre circuit
simulations. Table 3.2 summarizes these maximum hop counts and the corresponding energy
efficiency of the VLR-based 10mm link.
In MPSoCs, the maximum clock frequency is usually limited by the computation
core rather than the link. We thus re-optimize the transistor sizes and wire spacing
of our circuits for a lower clock frequency of 2GHz. The maximum hop counts and
energy efficiency of such re-optimized VLRs are shown in Table 3.3. At 2GHz, VLRs
can traverse 8 hops in a cycle at 104fJ/b/mm, thus enabling maximum bypassing in
SMART across a 4×4 mesh network.
Figure 3-11: SMART NoC performance across SoC applications. Reference: [1].
Figure 3-12: SMART NoC power breakdown across SoC applications. Reference: [1].
Finally, we briefly discuss the network-level latency reduction and power savings
of the VLR-based single-cycle reconfigurable NoC (SMART NoC). The architecture
design and simulation methodology of the SMART NoC are not the contributions
of this thesis, but the network-level evaluation results explicitly show that the VLRs,
which hardly lead to link-level energy benefits but enable such a SMART NoC on a
4×4 mesh at 2GHz, can facilitate power savings at the network level as well as
network performance enhancement. Chia-Hsin Owen Chen, a PhD student at MIT,
evaluated the SMART NoC against two baselines, a typical 3-cycle-per-hop mesh NoC
(similar to our thesis baseline design) and the dedicated NoC with 1-cycle dedicated
links between all communicating cores tailored to each application. Figure 3-11 and
Figure 3-12 show the average network latency and the post-layout power estimates
across diverse SoC applications, respectively. The SMART NoC reduces network
power by 2.2x on average as a result of buffer bypassing through the pre-paved routes
and clock gating at the no-traffic routers. Details of the SMART NoC design and
simulation methodology are presented in [1].
3.6 Chapter Summary
In this chapter, we first introduced a novel NoC architecture, a single-cycle
reconfigurable NoC, as the background of our second chip prototyping research. Then,
we proposed two clockless, single-ended low-swing repeaters: a self-resetting logic
repeater (SRLR) for a reduction in the pre-paved link energy, and a voltage-locked
repeater (VLR) for lower propagation delay.
• The SRLR optimized for the router-to-router distance in a mesh NoC (e.g., 1mm
in this work) provides scalable on-chip signaling without the increased layout
complexity. Since the SRLR enables single-ended low-swing pulses to be repeated
without a reference clock, the SRLR-based on-chip signaling achieves
higher energy efficiency than differential, clocked low-swing signaling circuits. We
also presented circuit techniques to improve process variation immunity of the
SRLR-based on-chip signaling.
• The VLR maintains its standby voltage level near the threshold voltage of the
first amplifier for higher amplification gain. When the VLR repeats the signal
transition, it locks the swing voltage by its feedback path signal. A delay cell
on the feedback path generates voltage overshoot only at the receiver end, thus
leading to an increase in noise margin without significant energy overheads.
This standby voltage and reduced voltage swing enable lower propagation delay
at the expense of static power dissipation. The VLRs make single-cycle
cross-chip transmission feasible over a 4×4 mesh network at 2GHz, realizing the
network-level power savings in the SMART NoC.
4 Energy and Area Efficient TSV Signaling for 3D-IC NoCs
In this chapter, we present a simultaneously bi-directional (SBD) TSV interconnect
for energy and area efficient 3D-IC vertical signaling. The proposed SBD signaling enables 10.3-31.1% lower energy and 34.4% less area than equivalent two uni-directional
TSVs. Albeit with 12.5% lower maximum data rate, the SBD TSV interconnect functions error free at bi-directional data rates up to 9.1Gb/s/TSV (i.e. 4.55GHz maximum clock frequency).
4.1 Chapter Introduction
Three-dimensional integrated circuits (3D-ICs), in which multiple layers of active
electronic components are vertically integrated into a single chip, have emerged as
an appealing alternative to planar 2D counterparts. This is mainly for three reasons.
First, 3D-IC integration enables fewer hop counts in a NoC by exploiting greater
spatial locality as shown in Figure 4-1 (where 10 hops in a 2D NoC can be reduced to 3
hops in a 3D-IC NoC), leading to lower interconnect delay as well as less interconnect
energy. Second, 3D-ICs allow different fabrication technologies to be integrated into
a single MPSoC design, and this heterogeneous 3D integration makes it possible to
integrate the best technology for a specific functionality (e.g. RF analog CMOS,
low-power digital CMOS or opto-electronic devices) in a single chip cube. Third, and
most importantly, 3D-ICs can offer much higher bandwidth for die-to-die communication,
resulting in substantial performance benefits in manycore chips that require
heavy processor-to-memory bandwidth. Indeed, the memory bandwidth is looming
as a major performance bottleneck as core counts scale; even with an efficient memory
controller design such as Intel’s QuickPath Interconnect or AMD’s HyperTransport,
the off-die bandwidth is still restricted by limited pin counts of a chip package [65].
Figure 4-1: Example of hop count reduction through greater spatial locality in 3D-ICs.
The reduced hop counts translate into lower interconnect delay and energy.
Panels: 2D-IC NoC and 3D-IC NoC.
The heart of 3D-ICs is a through silicon via (TSV)¹ that cuts across thinned silicon
substrates to offer inter-die connectivity, e.g. Tezzaron Semiconductor Corporation [66],
IBM [67, 68], IMEC [69] and MediaTek TSVs (this work). In general, TSVs
can provide two types of 3D vertical signaling depending on die stacking topologies:
face-to-face (F2F) and face-to-back (F2B). The F2F TSVs have lower 3D signaling
energy and delay due to smaller parasitic lumped capacitance than the F2B TSVs,
but only support two-tier 3D-IC integration. On the other hand, the F2B TSVs
allow 3D-IC integration with any number of tiers, and hence, enable scalable 3D-IC chip
architectures.
¹ The semiconductor industry has pursued 3D-ICs in many different technologies. Accordingly, the
3D-IC definition is diverse: wire-bonded 3D-ICs, microbump-only 3D-ICs, contactless (capacitive
or inductive) buried bump-based 3D-ICs and TSV-based 3D-ICs. In this thesis, we refer to the
TSV-based chip cube as 3D-ICs.
Figure 4-2: Uni-directional TSV signaling versus proposed SBD TSV signaling at the
same clock frequency, e.g. 5GHz in this example.
The scalable F2B TSVs, however, face other design challenges. First, the F2B
TSVs consume higher signaling energy than the F2F TSVs due to higher parasitic
lumped capacitance, e.g. about 200fF/TSV in IBM technology [68] or 80-120fF/TSV
in IMEC technology [69]. While these F2B TSVs still hold much better energy merits
over off-chip interconnects (typically in the order of tens of pF), their parasitic
capacitance is 10-20× higher than the F2F TSVs [66, 11]. Second, TSV landing pads²
occupy bigger silicon area than F2F TSVs, e.g. 7µm×7µm in Tezzaron TSV
technology [66]. Lastly, current TSV fabrication technologies cause non-negligible fault rates,
leading to lower yield than standard 2D chip fabrication [70]. For reliable signaling,
such low-yield 3D-IC chips require redundant TSVs such as spare TSV arrays [11],
twin TSVs [71] or shared spare TSVs [72], thereby further exacerbating the energy
and area overheads of the F2B TSV interconnect.
² In face-to-back (F2B) TSV technology, the upper end of TSVs is connected to a poly layer
landing pad on a top die while the lower end of TSVs is connected with a top metal micro bump on
a bottom die.
Figure 4-3: 4 voltage-level SBD signaling with weaker driving strength required (pros)
and smaller noise margin between SBD signaling symbols (cons).
Figure 4-4: 3 voltage-level SBD signaling with bigger noise margin between SBD signaling symbols (pros) and stronger driving strength required (cons).
Figure 4-5: Upward die-to-die static current path through a low resistance TSV:
bottom die PMOS → micro bump → landing pad → top die NMOS.
Figure 4-6: Downward die-to-die static current path through a low resistance TSV:
top die PMOS → landing pad → micro bump → bottom die NMOS.
In order to tackle these challenges, this chapter proposes a simultaneously bi-directional (SBD) TSV signaling circuit that can transmit and receive data at the
same time through a single TSV. Figure 4-2 shows the concept of our SBD TSV
signaling versus uni-directional TSV signaling. The proposed SBD signaling can
halve the number of TSVs, and at the same time, lower TSV transmission energy
through such reduced TSV counts and reduced swing (ideally half swing, as will be
explained in the following design sections). In other words, the SBD TSV signaling
can offer 2× higher bandwidth than uni-directional TSVs with the same TSV counts
and clock frequency.
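The bandwidth and TSV-count argument above reduces to simple arithmetic, sketched below. The sketch assumes one bit per direction per clock cycle on each TSV, which is consistent with the 9.1Gb/s/TSV at 4.55GHz figure quoted at the start of this chapter; the function names and the 64-bit example width are illustrative.

```python
def per_tsv_bandwidth_gbps(clock_ghz, simultaneous_bidirectional):
    """Aggregate data rate carried by one TSV.

    Assumption: one bit per direction per clock cycle.
    """
    directions = 2 if simultaneous_bidirectional else 1
    return directions * clock_ghz


def tsv_count(bits_per_direction, simultaneous_bidirectional):
    """TSVs needed for a link that is bits_per_direction wide both ways."""
    if simultaneous_bidirectional:
        return bits_per_direction          # each TSV carries both directions
    return 2 * bits_per_direction          # separate up and down TSVs


sbd_rate = per_tsv_bandwidth_gbps(4.55, simultaneous_bidirectional=True)
uni_rate = per_tsv_bandwidth_gbps(4.55, simultaneous_bidirectional=False)
print(f"SBD: {sbd_rate:.2f} Gb/s/TSV, uni-directional: {uni_rate:.2f} Gb/s/TSV")
print("TSVs for a 64-bit link each way:",
      tsv_count(64, True), "(SBD) vs", tsv_count(64, False), "(uni-directional)")
```

Read either way, the same factor of two appears: half the TSVs for a fixed link width, or twice the bandwidth for a fixed TSV count.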
4.2 Design Considerations of SBD TSV Links
While off-chip (chip-to-chip) SBD signaling circuits have been studied before [73, 74, 75, 76], this work presents an SBD link design optimized for TSV channel characteristics. Since different channel characteristics lead to different interconnect design requirements, a natural next step is to compare the design considerations of well-known off-chip interconnects with those of 3D-IC TSV links. Here, we highlight the key differences:
1. TSV signaling does not require accurate impedance matching because TSV
transmission delay is generally negligible compared to the clock period. For
instance, the IBM TSV interconnect can transmit 6Gb/s/TSV data without
any impedance matching circuits [68].
2. TSV interconnects do not need complicated, power-hungry equalization circuits
since TSVs’ RC constant is much smaller than that of off-chip interconnects due
to TSVs’ negligible parasitic resistance.
3. An SBD TSV signaling circuit requires a higher noise margin ratio than its off-chip counterparts because the off-chip power supply rail is typically 2.5V or 1.8V while
the on-chip rail is around 1.0V.
4. Most importantly, SBD TSV signaling circuits should minimize inter-die static
current flowing through a low resistance TSV when the TSV voltage is driven to
middle-level voltages (e.g. bi-directional data “11” and “01” in Figure 4-3, or
“01” and “10” in Figure 4-4) for energy-efficient 3D-IC signaling. While the
static current required for SBD signaling in off-chip interconnects does not dissipate significant power due to their highly-resistive channel, SBD TSV signaling
can lead to non-negligible static current through a low resistance TSV. This will
be explained later in detail with Figure 4-5 and Figure 4-6.
A straightforward implementation of 2-bit SBD signaling is to use four voltage levels between the power supply voltage (the highest voltage in a chip) and the ground voltage (the lowest voltage in a chip), mapping each level to one of the four possible SBD symbols as shown in Figure 4-3. An alternative is to share one voltage level between two SBD symbols [74], e.g. “01” and “10” in Figure 4-4. There is an obvious trade-off between these two schemes: three voltage-level SBD signaling (Figure 4-4) has a larger voltage swing between adjacent SBD symbols than four voltage-level SBD signaling (Figure 4-3), which requires more driving strength but provides a higher noise margin. Considering that SBD TSV signaling needs a higher noise margin ratio but does not require larger driving strength through Tx equalization, three voltage-level SBD signaling is the better fit for TSV channel characteristics.
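The trade-off between the two schemes can be made concrete with a small calculation. The sketch below is illustrative only: it assumes the voltage levels are equally spaced between GND and VDD and uses VDD = 1.0 V as a stand-in for the on-chip rail; the function name `adjacent_symbol_spacing` is ours and does not come from the design itself.

```python
def adjacent_symbol_spacing(vdd: float, levels: int) -> float:
    """Voltage gap between adjacent signaling levels, assuming the levels
    are equally spaced from GND to VDD (an idealization)."""
    if levels < 2:
        raise ValueError("need at least two voltage levels")
    return vdd / (levels - 1)


VDD = 1.0  # illustrative on-chip rail in volts (assumption)

four_level = adjacent_symbol_spacing(VDD, 4)   # scheme of Figure 4-3
three_level = adjacent_symbol_spacing(VDD, 3)  # scheme of Figure 4-4

print(f"4-level spacing: {four_level:.3f} V")
print(f"3-level spacing: {three_level:.3f} V")
print(f"3-level / 4-level spacing ratio: {three_level / four_level:.2f}")
```

Under this idealization the three-level scheme separates adjacent symbols by VDD/2 rather than VDD/3, which is the noise-margin advantage the text refers to.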
Let us investigate the issue of die-to-die static current in the three voltage-level SBD TSV shown in Figure 4-5 and Figure 4-6, where a TSV driver is implemented using a simple repeater (composed of two inverters). When the top die driver transmits logic 0 with its pull-down NMOS and the bottom die driver transmits logic 1 with its pull-up PMOS, a static current path is formed from the bottom die PMOS through the TSV to the top die NMOS. Similarly, when the top die driver transmits logic 1 and the bottom die driver transmits logic 0, another static current path is formed from the top die to the bottom die. Since the parasitic resistance of TSVs is much smaller than that of off-chip interconnects, die-to-die static current flowing through a TSV can incur significant energy overheads. A higher threshold voltage or a smaller transistor size for the pull-up PMOS and the pull-down NMOS can reduce the die-to-die static current, but it also reduces dynamic driving strength and therefore bandwidth. Thus, the SBD TSV interconnect circuit should minimize these energy overheads by eliminating unnecessary die-to-die static current, while maintaining the advantages of TSV signaling such as high bandwidth (e.g. 10Gb/s/TSV bi-directional data rate) and low delay (e.g. single cycle die-to-die transmission at 5GHz clock frequency). The next section presents our SBD TSV transmitter designed for such low-power yet high-performance 3D-IC vertical signaling.
4.3 SBD Transmitter Design
Figure 4-7 shows the proposed SBD transmitter (Tx) circuit where a driver design
on a top die differs from a driver design on a bottom die. While the bottom die
driver is implemented using a simple NAND-enabled inverter, the top die driver
includes clocking circuitry (highlighted in blue) and a coupling capacitor (highlighted
in red). Since our proposed design does not depend on symmetry between the top and bottom die circuits,
it can be readily applied to heterogeneous 3D-IC integration (e.g. a top die in
45nm CMOS and a bottom die in 28nm CMOS). The following sub-sections will
explain how this SBD Tx circuit works and enables energy-efficient TSV signaling
without significant reduction in maximum data bandwidth.
4.3.1 Case 1: EN=0 (no data to be transmitted)
Figure 4-8 shows the circuit connectivity of the proposed SBD Tx when the enable signal is not asserted (EN=0). Both sides of an idle TSV are connected to ground (GND), i.e. no die-to-die current path is formed through the low resistance TSV, and hence our SBD Tx circuits do not consume any static power when there is no data to be transmitted through the TSV. Notably, the enable signal can be obtained at every clock cycle for free from NoCs since it is an output of a switch allocator, the vital logic of packet-switched NoC routers described in Chapter 1.
Figure 4-7: Proposed SBD TSV Tx circuits: a simple NAND-enabled inverter on a bottom die and a half-clocked driver on a top die.
4.3.2 Case 2: EN=1 and CLK=1 (first half clock cycle)
During the first half clock period (Figure 4-9), the bottom die driver acts as a repeater composed of two inverters. On the top die, on the other hand, there is no connection between the input (IN1) and the TSV because the high state of the clock (CLK=1) turns off all transistors between IN1 and the TSV. Hence, during the first half clock cycle, the TSV is driven by the bottom die driver only, allowing no static current to flow through the low resistance TSV. The bottom die driver does not need to be over-sized to drive rail-to-rail TSV voltage swing during the first half clock period (half-swing transitions suffice), because another half-swing transition can be driven during the next half clock cycle by the top die driver and the bottom die driver together. In other words, during the first half clock cycle, the TSV is driven by a pull-up PMOS (P1) or a pull-down NMOS (N1) on the bottom die regardless of the top die input (IN1), consuming dynamic energy only (i.e. no static energy is consumed). Depending on the data transition pattern on the bottom die, the dynamic energy is either zero (no TSV voltage swing) or the energy required for a half-swing TSV transition.
Figure 4-8: Tx circuit connectivity of Case 1 (EN=0). No die-to-die current path is formed when there is no data to be transmitted through a TSV.
4.3.3 Case 3: EN=1 and CLK=0 (next half clock cycle)
When the clock goes low (Figure 4-10), two pass-gate transistors, N2 and P2, are turned on, creating a signaling path between the top die input (IN1) and the TSV landing pad. Now, the top die driver also acts as a repeater composed of two inverters (a NAND gate with EN=1 is the first inverter while the second inverter consists of P3 and N3). Since there are no changes on the bottom die, the TSV is driven by the top die driver and the bottom die driver together during the second half clock cycle. When the bi-directional data is (IN1, IN2) = (0, 1) or (1, 0), a die-to-die static current path is formed during the second half clock period to hold the TSV voltage at the middle voltage level (ideally VDD/2).
Figure 4-9: Tx circuit connectivity of Case 2 (EN=1 and CLK=1, first half clock cycle) where a TSV is driven by a bottom driver only, consuming dynamic energy only.
While the straightforward SBD Tx design implemented using just repeaters (Figure 4-5 and Figure 4-6) allows static current to flow during the entire clock period, our half-clocked Tx design has a die-to-die static current path only during half of the clock period (hence, 50% lower static energy consumption). The cost of these energy savings is a loss in maximum bandwidth, because the half voltage swing to the middle level is driven by a pull-up PMOS and a pull-down NMOS together (P1 and N3 together or P3 and N1 together) during the second half clock period. Switching to the middle voltage level with a pull-up PMOS and a pull-down NMOS together is slower than switching with a pull-up PMOS only or a pull-down NMOS only, because some portion of the switching current always flows through the static current path formed by the pull-up PMOS and the pull-down NMOS, making no contribution to the TSV voltage swing.
Figure 4-10: Tx circuit connectivity of Case 3 (EN=1 and CLK=0, next half clock cycle) where a TSV is driven by a bottom driver and a top driver together, forming a static current path through a TSV for the three voltage-level SBD signaling. The coupling capacitor, which acts as a high-pass filter, compensates for the bandwidth loss without adding to inter-die static current.
In order to compensate for this bandwidth loss, our top die driver has one more signaling path between the input (IN1) and the TSV through a coupling capacitor (highlighted in red in Figure 4-10). This additional signaling path adds switching current during the second half clock period, and hence reduces the switching time required for the half voltage swing. In other words, it increases the maximum clock frequency by adding switching current during high-to-low clock transitions. An advantage of this design is that switching current through the coupling capacitor flows only during such clock transitions (i.e. the coupling capacitor acts as a high-pass filter), and hence it can increase bandwidth without adding to inter-die static current. This coupling capacitor also generates dynamic current flow during the first half clock period when the bottom die driver makes a TSV voltage transition. However, due to the TSV’s big lumped capacitance, the TSV voltage transition during the first half clock period (at the clock transition from low to high) is less sharp than the voltage transition at the other end of the coupling capacitor during the second half clock period (at the clock transition from high to low). Since sharper transitions drive higher dynamic current through a coupling capacitor at a given time, the switching current flowing through the coupling capacitor during the first half clock period is less than 10% of the switching current during the second half clock period, thus resulting in little impact on switching time during the first half clock period. However, since the bandwidth bottleneck of our proposed Tx circuit is the second half clock period, when the TSV voltage is driven to VDD/2 by a pull-up PMOS and a pull-down NMOS together, we do not need a large coupling capacitor to reduce the first half clock period’s switching time.
To sum up, the underlying design philosophy of our proposed SBD Tx is that
while die-to-die static current through a TSV is inevitable in multiple voltage-level
SBD signaling for keeping TSV voltage at middle voltage levels (e.g. VDD/2), such
static current does not need to flow over an entire symbol period (one clock cycle in
our design) if a TSV voltage transition time is shorter than an entire symbol period.
In our transmitter circuit design, the half-clocked driver on a top die allows static
current to flow through a TSV during the second half clock period only (half of symbol
period), and the coupling capacitor enables shorter TSV voltage transition time by
creating a dynamic current path at negative clock edges.
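The three operating cases can be summarized as a small behavioral model. The sketch below is a logic-level abstraction only, not a circuit simulation: the function names and the dictionary layout are ours, and the averaging step assumes that the four bi-directional symbols are equally likely.

```python
from itertools import product


def tx_state(en: int, clk: int, in_top: int, in_bottom: int) -> dict:
    """Logic-level view of the half-clocked SBD Tx for one half clock period."""
    if en == 0:
        # Case 1: both TSV ends are tied to GND, so nothing flows.
        return {"top_drives": False, "bottom_drives": False, "static_path": False}
    if clk == 1:
        # Case 2: only the bottom die driver is connected to the TSV.
        return {"top_drives": False, "bottom_drives": True, "static_path": False}
    # Case 3: both drivers are connected; they contend only if the bits differ.
    return {"top_drives": True, "bottom_drives": True,
            "static_path": in_top != in_bottom}


def static_fraction_half_clocked() -> float:
    """Average fraction of an enabled clock period with a static current path,
    assuming the four symbols 00, 01, 10, 11 are equally likely."""
    total = 0.0
    for in_top, in_bottom in product((0, 1), repeat=2):
        for clk in (1, 0):  # two half periods of equal length
            if tx_state(1, clk, in_top, in_bottom)["static_path"]:
                total += 0.5
    return total / 4


def static_fraction_repeater() -> float:
    """Same metric for plain repeaters on both dies (Figures 4-5 and 4-6):
    a static path exists for the whole period whenever the bits differ."""
    differing = sum(1 for a, b in product((0, 1), repeat=2) if a != b)
    return differing / 4


print(static_fraction_half_clocked(), static_fraction_repeater())
```

Under these assumptions the half-clocked driver conducts static current for half as long as the repeater-based design, which matches the 50% static energy reduction stated above.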
Figure 4-11: TSV voltage transitions of uni-directional TSVs versus our SBD TSV.
Panels (TSV voltage versus time over the first half period, CLK=1, and the second half period, CLK=0): 00 to 11 bi-directional data transition (0 to 1 on a bottom die and 0 to 1 on a top die); 00 to 10 bi-directional data transition (0 to 1 on a bottom die and 0 to 0 on a top die); 00 to 01 bi-directional data transition (0 to 0 on a bottom die and 0 to 1 on a top die).
Figure 4-12: While a floating TSV during the first half clock period also enables 50% lower static die-to-die current, such a floating state incurs bandwidth loss at the 00 → 11 bi-directional data transition (0 to 1 on a bottom die and 0 to 1 on a top die).
Figure 4-13: The coupling capacitor on a top die driver enables shorter switching time when a TSV is driven to VDD/2 by a pull-up PMOS and a pull-down NMOS together during the second half clock period (shown for the 00 to 01 bi-directional data transition: 0 to 0 on a bottom die and 0 to 1 on a top die).
4.4 Rx Design: Switched Dual-Tree Sense Amplifier
In order for our SBD TSV links to be incorporated into a typical NoC router
microarchitecture³, there are two key requirements in their receiver (Rx) design: low
sensing delay and error-free SBD signaling against die-to-die (a top die to a bottom
die) process variations.
4.4.1 Switched Scheme for Low Sensing Delay
While a pipelined NoC router microarchitecture generally requires single-cycle link traversal (LT), our SBD signaling needs extra time to sense the TSV voltage and convert it into a full-swing logic level. To minimize this extra time, and hence allow our SBD TSV interconnects to serve as links of the pipelined NoC router, an SBD Rx circuit should be designed for low sensing delay. In terms of small-signal sensing delay, an NMOS-input sense amplifier is optimized for a higher input common mode voltage while a PMOS-input sense amplifier is best suited to a lower common mode voltage. Thus, a possible Rx circuit implementation is to switch between these two sense amplifiers according to the TSV common mode voltage, in a similar way to the reconfigurable sensing network proposed in [77].
In our three voltage-level SBD signaling, the TSV common mode voltage is given
by the transmitted data as described in Figure 4-4; when the transmitted data is 0,
the SBD TSV voltage should be GND or VDD/2 (i.e. lower common mode voltage).
On the other hand, the SBD TSV voltage should be VDD or VDD/2 (i.e. higher
common mode voltage) if the transmitted data is 1. Therefore, we can switch on and
off an NMOS-input sense amplifier and a PMOS-input sense amplifier depending on
the transmitted data.
³ The 2D mesh 5-port NoC router microarchitecture (Figure 1-1) can naturally be extended to any 3D-IC NoC router microarchitecture by adding more input ports, e.g. a 3D mesh 6-port router microarchitecture for 2-tier 3D-ICs or a 3D mesh 7-port router microarchitecture for many-tier 3D-ICs.
Figure 4-16 shows such a switched Rx scheme. When the transmitted data (txIN) is logic 1, an NMOS-input sense amplifier is activated to speed up the sensing operation at the higher common mode voltage while a PMOS-input sense amplifier is turned off to save energy. Similarly, when the transmitted data is logic 0 (i.e. our SBD Rx senses the voltage difference between VDD/2 and GND), a PMOS-input amplifier is switched on for low sensing delay at the lower common mode voltage whereas an NMOS-input sense amplifier is turned off. Measurement results showed that this switched Rx scheme, fabricated in a 28nm Low-Power (LP) CMOS process, enabled single-cycle SBD TSV signaling at clock frequencies up to 4.55GHz at 1.05V.
4.4.2 Dual-Tree Sense Amplifier for Reliable SBD Signaling
The widely-used sense amplifier designs [77, 78] can incur signaling reliability issues when applied to our SBD Rx circuit; using a reference voltage between VDD and VDD/2 (or GND and VDD/2) as the other input of the sense amplifier reduces the sensing noise margin to half of the SBD symbol noise margin (i.e. VDD/4). Across all possible process variations, the sensing noise margin of VDD/4 will be further reduced. As shown in Figure 4-14, a strong PMOS and a weak NMOS increase the voltage level for SBD symbols 01 and 10 (V_M), making the noise margin between VDD and V_M less than VDD/2. Similarly, a weak PMOS and a strong NMOS decrease V_M, making the noise margin between V_M and GND less than VDD/2. In particular, since 3D-ICs are generally integrated vertically at the wafer level, we have to consider different die-to-die variation corners between different wafers. Accordingly, our three voltage-level SBD signaling can have different voltage levels for symbol 01 (V_M01) and symbol 10 (V_M10) as shown in Figure 4-14 (c), if both PMOS and NMOS on a top die were fabricated stronger than typical transistors while both PMOS and NMOS on a bottom die were fabricated weaker than typical transistors.
To increase the sensing noise margin of SBD Rx circuits, we present dual-tree sense amplifiers. Figure 4-15 shows an NMOS-input dual-tree sense amplifier (a) and a PMOS-input dual-tree sense amplifier (b). In the dual-tree sense amplifier design, the tail current difference of cross-coupled inverters is the result of directly comparing the TSV voltage (rxIN) with VDD/2 and VDD (or GND). Accordingly, its sensing noise margin is equal to the SBD symbol noise margin (i.e. VDD/2), whereas the traditional sense amplifier designs [77, 78] have half of the SBD symbol noise margin (i.e. VDD/4) as their sensing noise margin. Simulated with a 28nm LP CMOS process design kit (PDK), our dual-tree sense amplifiers achieved error-free SBD TSV signaling across all possible process corners, whereas traditional sense amplifier designs [77, 78] (whose transistors were sized equivalently to those of the dual-tree sense amplifiers) incurred Rx errors at the three corner cases discussed in Figure 4-14. This is mainly because the input offset of sense amplifiers, which is caused by on-die variations, is bigger than the reduced sensing noise margin at such corner cases.
Figure 4-14: Reduced symbol noise margin of SBD signaling due to process variation. When designing 3D-IC circuits, we should consider die-to-die variation mismatch as well as on-die variation mismatch described in (c).
Figure 4-15: Switched dual-tree sense amplifiers for variation-robust SBD signaling and low sensing delay.
Figure 4-16: Overall circuit implementation of the proposed TSV SBD signaling. Two types of sense amplifiers, a PMOS-input and an NMOS-input sense amplifier, are switched on and off according to the transmitted data (txIN) for low sensing delay.
Figure 4-16 describes an overall circuit implementation of our proposed TSV signaling. Depending on the transmitted data (txIN), two types of dual-tree sense amplifiers (a PMOS-input and an NMOS-input dual-tree sense amplifier) are switched
on and off, and then, a final output (rxOUT) is selected by a simple 2:1 multiplexer.
The SBD Tx design on a bottom die is a NAND-enabled inverter while a top die has
a half-clocked driver as its SBD Tx.
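The receive path described above can be sketched behaviorally as follows. This is an idealized model under stated assumptions, not the transistor-level circuit: it uses the ideal three-level TSV voltages, and the midpoint decision thresholds (3·VDD/4 and VDD/4) are our simplification standing in for the dual-tree comparison against VDD/2 and VDD (or GND). All function names are hypothetical.

```python
from typing import Tuple

VDD = 1.05  # supply voltage used in the measurements, in volts


def ideal_tsv_voltage(tx_in: int, remote_in: int, vdd: float = VDD) -> float:
    """Ideal three-level TSV voltage: 00 -> GND, 01 or 10 -> VDD/2, 11 -> VDD."""
    return (tx_in + remote_in) * vdd / 2


def receive(tx_in: int, v_tsv: float, vdd: float = VDD) -> Tuple[str, int]:
    """Select the sense amplifier from the locally transmitted bit (txIN)
    and decode the bit sent by the other die (rxOUT)."""
    if tx_in == 1:
        # TSV sits at VDD or VDD/2: high common mode, NMOS-input amplifier.
        amplifier = "NMOS-input"
        remote_bit = 1 if v_tsv > 0.75 * vdd else 0  # assumed midpoint threshold
    else:
        # TSV sits at VDD/2 or GND: low common mode, PMOS-input amplifier.
        amplifier = "PMOS-input"
        remote_bit = 1 if v_tsv > 0.25 * vdd else 0  # assumed midpoint threshold
    return amplifier, remote_bit


for tx_in in (0, 1):
    for remote_in in (0, 1):
        amp, decoded = receive(tx_in, ideal_tsv_voltage(tx_in, remote_in))
        print(f"txIN={tx_in} remote={remote_in} -> {amp}, rxOUT={decoded}")
```

Because the locally transmitted bit is known, each receiver only has to distinguish two levels that are VDD/2 apart, which is why the amplifier choice can follow txIN directly.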
4.5 Prototyping and Testing of TSV Interconnects
Our 3D-IC chip prototype was implemented using MediaTek face-to-back (F2B) TSV technology and a 28nm Low-Power (LP) CMOS process; we first fabricated two different CMOS wafers (one for the top die design and another for the bottom die design), then integrated the two wafers into a single 3D-IC chip using the F2B TSV technology. While the test chip in this work is a 2-tier 3D-IC, the proposed F2B TSV signaling circuits can be utilized as repeaters in any multiple-layered 3D-IC. Figure 4-17 and Figure 4-18 show a top die and a bottom die photo of our 2-tier 3D-IC chip prototype, respectively.
To fairly compare the energy consumption, maximum data rate and occupied area of the proposed SBD TSV signaling versus uni-directional TSV signaling, both interconnect circuits (proposed SBD TSV and baseline #1 in Figure 4-19) were included in the test chip. In addition, to see the impact of a half-clocked SBD Tx and a coupling capacitor, two other TSV interconnects (baseline #2 and baseline #3 in Figure 4-19) were also implemented in our chip prototype. While baseline #2 employs a straightforward SBD Tx implementation using inverters, baseline #3 incorporates the same Tx circuits as our proposed design except for the coupling capacitor.
Figure 4-17: Top die photograph of a 2-tier 3D-IC test chip fabricated with a 28nm Low-Power (LP) CMOS process (labeled feature: poly layer landing pads, TSV top side).
Figure 4-18: Bottom die photograph of a 2-tier 3D-IC test chip fabricated with the same process as a top die, 28nm LP CMOS (labeled feature: top metal micro bumps, TSV bottom side).
Figure 4-19: Four types of TSV interconnects implemented in a 3D-IC test chip: two
uni-directional TSVs (baseline #1); an inverter-based SBD TSV (baseline
#2); a proposed SBD TSV without a coupling capacitor (baseline #3);
and a completed design (proposed SBD TSV).
4.5.1 Maximum Data Rate
We first measured the maximum data rate of the fabricated F2B TSV signaling circuits. Each TSV end was fed by two pseudorandom binary sequences generated on-chip using linear feedback shift registers, PRBS7 (x^7 + x^6 + 1) and PRBS31 (x^31 + x^28 + 1), which are industry-standard patterns for off-chip link tests. Accordingly, for 2-bit bi-directional signaling (one bit from a top die to a bottom die and another bit from a bottom die to a top die), each interconnect circuit was tested with four possible input vectors: PRBS7 ⇌ PRBS7, PRBS7 ⇌ PRBS31, PRBS31 ⇌ PRBS7 and PRBS31 ⇌ PRBS31. Then, an on-chip test circuit compared the input and output data and counted errors. All experiments were performed at 1.05V, the typical power supply voltage of our 28nm LP CMOS process.
Figure 4-20: Measured maximum die-to-die bandwidth comparison at 1.05V between uni-directional TSVs (baseline #1) and proposed SBD TSVs. (a) Half TSV counts: two uni-directional TSVs (5.2Gb/s each) versus one proposed SBD TSV (9.1Gb/s), a 12.5% decrease. (b) Same TSV counts: two uni-directional TSVs (5.2Gb/s each) versus two proposed SBD TSVs (9.1Gb/s each), a 75% increase.
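For readers unfamiliar with these test patterns, the sketch below shows one common software model of the two sequences as Fibonacci linear feedback shift registers. It only illustrates the polynomials named above and is not a description of the on-chip generator; the function names, the seed value and the bit ordering are our assumptions.

```python
from itertools import islice


def prbs(order: int, tap: int, seed: int = 1):
    """Yield bits of the PRBS for x^order + x^tap + 1 (Fibonacci LFSR).

    The register holds `order` bits; each step XORs the bits at positions
    `order` and `tap` (1-indexed from the least significant bit) and shifts
    the result in.
    """
    mask = (1 << order) - 1
    state = seed & mask
    if state == 0:
        raise ValueError("seed must be non-zero, otherwise the LFSR locks up")
    while True:
        new_bit = ((state >> (order - 1)) ^ (state >> (tap - 1))) & 1
        state = ((state << 1) | new_bit) & mask
        yield new_bit


def prbs7(seed: int = 1):
    """PRBS7, polynomial x^7 + x^6 + 1, sequence length 2^7 - 1 = 127."""
    return prbs(7, 6, seed)


def prbs31(seed: int = 1):
    """PRBS31, polynomial x^31 + x^28 + 1, sequence length 2^31 - 1."""
    return prbs(31, 28, seed)


two_periods = list(islice(prbs7(), 2 * 127))
print("PRBS7 repeats after 127 bits:", two_periods[:127] == two_periods[127:])
print("ones per PRBS7 period:", sum(two_periods[:127]))
```

A bi-directional test vector such as PRBS7 ⇌ PRBS31 then corresponds to feeding one generator into each end of the TSV.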
Measurement results showed that two uni-directional TSVs (baseline #1) achieved a maximum data rate of 10.4Gb/s whereas our proposed SBD signaling circuits attained a maximum data rate of 9.1Gb/s through a single TSV (i.e. 12.5% lower maximum bandwidth than two uni-directional TSVs). In other words, the proposed SBD TSVs can send (100−12.5)×2−100 = 75% more data than uni-directional TSVs with the same TSV counts. Figure 4-20 describes such trade-offs between die-to-die bandwidth and TSV counts.
Figure 4-21: Maximum bi-directional bandwidth of our fabricated F2B TSV interconnects. The proposed SBD signaling can deliver up to 9.1Gb/s/TSV bi-directional data (i.e. 4.55GHz maximum clock frequency) at 1.05V. Chart annotations: 12.5% lower bandwidth than two uni-directional TSVs; 33.8% bandwidth increase by a coupling capacitor.
While the straightforward SBD implementation using inverters (baseline #2) showed a maximum data rate of 9.9Gb/s, which was slightly less than that of two uni-directional TSVs (baseline #1), the same SBD signaling circuits as our proposed design except for the coupling capacitor (baseline #3) showed a much smaller maximum bandwidth, 6.8Gb/s/TSV. As discussed in Section 4.3, our half-clocked driver at the top die (designed for die-to-die static current reduction) has only half a clock cycle for switching to the middle voltage level (ideally VDD/2 = 0.525V), and such middle voltage switching by a pull-up PMOS and a pull-down NMOS together takes longer than pull-up only or pull-down only switching, thereby leading to a significant decrease in maximum data rate. This maximum bandwidth loss can be compensated by the coupling capacitor (highlighted in red in Figure 4-7); experimental results showed that this circuit design enabled a 33.8% increase (6.8Gb/s/TSV to 9.1Gb/s/TSV) in maximum SBD TSV bandwidth. As discussed earlier in Section 4.3, this is because the coupling capacitor shortens the time required for switching to the middle voltage level by a pull-up PMOS and a pull-down NMOS together (which is the critical path of our SBD TSV signaling circuits) by adding extra charging/discharging current during the signal transition through capacitive coupling. Figure 4-21 summarizes the results of our maximum die-to-die bandwidth experiments.
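The percentages quoted in this subsection follow directly from the measured data rates. The short check below reproduces them; the constant and function names are ours, and only the rates reported above are used as inputs.

```python
UNI_PAIR_GBPS = 10.4    # two uni-directional TSVs together (baseline #1)
SBD_GBPS = 9.1          # one proposed SBD TSV, bi-directional
SBD_NO_CAP_GBPS = 6.8   # same design without the coupling capacitor (baseline #3)


def percent_change(new: float, old: float) -> float:
    """Relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100.0


# One SBD TSV versus two uni-directional TSVs (half the TSV count).
half_count = percent_change(SBD_GBPS, UNI_PAIR_GBPS)
# Two SBD TSVs versus two uni-directional TSVs (same TSV count).
same_count = percent_change(2 * SBD_GBPS, UNI_PAIR_GBPS)
# Effect of the coupling capacitor on the proposed design.
cap_gain = percent_change(SBD_GBPS, SBD_NO_CAP_GBPS)
# Two bits cross the TSV per clock cycle, one in each direction.
max_clock_ghz = SBD_GBPS / 2

print(f"half TSV count: {half_count:.1f}%")
print(f"same TSV count: {same_count:.1f}%")
print(f"coupling capacitor: {cap_gain:.1f}%")
print(f"maximum clock: {max_clock_ghz:.2f} GHz")
```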
4.5.2 Energy Efficiency
Next, we measured signaling energy efficiency of the fabricated TSV interconnects.
In addition to two industrial standard patterns for a link test, PRBS7 and PRBS31,
two more input data sequences were included in the energy measurement experiment:
CLK/2 for a higher data transition input sequence and CLK/32 for a lower data
transition input sequence. CLK/2 is the clock-shaped waveform whose frequency is
half of the interconnect clock, and hence, it has 100% data transition density. On
the other hand, CLK/32 is the clock-shaped signal whose frequency is 1/32 of the
interconnect clock so that it generates 6.25% data transition density. Figure 4-22
describes four bi-directional data sets used in our energy measurement.
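The two transition densities can be verified by sampling a divided clock once per interconnect clock cycle and counting bit changes. The sketch below does this in software; the helper names and the sample count are our choices for illustration.

```python
from typing import List


def divided_clock(divisor: int, n_bits: int) -> List[int]:
    """Sample CLK/divisor once per interconnect clock cycle.

    The divided clock stays at each level for divisor/2 cycles.
    """
    half_period = divisor // 2
    return [(i // half_period) % 2 for i in range(n_bits)]


def transition_density(bits: List[int]) -> float:
    """Fraction of bit boundaries at which the data value changes."""
    transitions = sum(1 for a, b in zip(bits, bits[1:]) if a != b)
    return transitions / (len(bits) - 1)


N_BITS = 3201  # 3201 samples give 3200 bit boundaries, a multiple of 16

density_clk2 = transition_density(divided_clock(2, N_BITS))
density_clk32 = transition_density(divided_clock(32, N_BITS))

print(f"CLK/2 transition density:  {density_clk2:.2%}")
print(f"CLK/32 transition density: {density_clk32:.2%}")
```

CLK/2 toggles at every bit boundary, while CLK/32 toggles once every 16 bits, which gives the 100% and 6.25% figures quoted above.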
Figure 4-22: Four bi-directional input data sets for energy comparison. Data set #1: low activity (CLK/32) and moderate activity (PRBS31). Data set #2: moderate activity (PRBS7) and moderate activity (PRBS31). Data set #3: moderate activity (PRBS31) and moderate activity (PRBS7). Data set #4: moderate activity (PRBS7) and high activity (CLK/2).
Figure 4-23 shows the TSV signaling energy efficiency⁴ measured at the maximum data rate of our proposed SBD interconnect circuits, 9.1Gb/s bi-directional bandwidth (i.e. 4.55GHz clock frequency), at a power supply voltage of 1.05V. When data set #1 (CLK/32 ⇌ PRBS31) was transmitted and received across two dies, two uni-directional TSVs and a single inverter-based SBD TSV consumed 98.46fJ/2b and 112.87fJ/2b, respectively. The proposed SBD TSV, on the other hand, showed 88.31fJ/2b energy efficiency, 10.3% less energy than two uni-directional TSVs and 21.8% less energy than the inverter-based SBD TSV. When data transition density is low, the energy benefit of our SBD signaling over uni-directional signaling is not significant (only 10.3% reduction) because die-to-die static current is comparable to the low-activity switching current. As compared to the inverter-based SBD TSV, however, our SBD TSV shows higher energy savings (21.8% reduction) due to the half-clocked driver on a top die that enables 2× reduction in die-to-die static current.
⁴ To follow the convention of interconnect energy efficiency calculation, we obtained the TSV energy efficiency by dividing the measured power consumption (µW) by operating frequency (GHz). Since our TSV interconnects simultaneously deliver two bits, one bit from a top die to a bottom die and another bit from a bottom die to a top die, the energy efficiency unit of our bi-directional signaling is not fJ/b but fJ/2b.
Figure 4-23: Measured TSV interconnect energy efficiency over various input data sets at 9.1Gb/s bi-directional data rate (i.e. 4.55GHz clock frequency) at 1.05V. The proposed SBD signaling circuits consume 10.3-31.1% less energy than uni-directional TSVs.
Measured energy efficiency (fJ/2b), listed as baseline #1 (2 uni-directional TSVs), baseline #2 (inverter-based SBD), proposed design, followed by the reduction of the proposed design relative to baseline #1 and baseline #2:
Data set #1 (CLK/32 ⇌ PRBS31): 98.46, 112.87, 88.31 (10.3% and 21.8% reduction)
Data set #2 (PRBS7 ⇌ PRBS31): 137.08, 131.58, 109.94 (19.8% and 16.4% reduction)
Data set #3 (PRBS31 ⇌ PRBS7): 136.84, 132.89, 113.67 (16.9% and 14.5% reduction)
Data set #4 (PRBS7 ⇌ CLK/2): 186.41, 146.58, 128.44 (31.1% and 12.4% reduction)
When moderate data transition inputs, data set #2 (PRBS7 ⇌ PRBS31) and data set #3 (PRBS31 ⇌ PRBS7), were applied to the fabricated TSV interconnects, our proposed SBD TSV showed 16.9-19.8% energy savings over two uni-directional TSVs. Compared with the inverter-based SBD signaling circuits, the proposed design showed 14.5-16.4% lower energy consumption. One notable result is that uni-directional TSVs and an inverter-based SBD TSV dissipated almost the same energy with data set #2 and data set #3, which are the same data sequences but in opposite directions, while the proposed SBD TSV showed over 5% energy deviation between these two data sets. This is because our SBD TSV signaling circuits incorporate different transmitter designs at the top die (the half-clocked coupling driver) and the bottom die (the NAND-enabled inverter) as shown in Figure 4-7. This asymmetry offers an energy optimization opportunity at the design phase if data patterns between two adjacent dies in a 3D-IC are known before chip fabrication; we can simply swap the transmitter designs when the swapped arrangement shows better energy efficiency.
The proposed SBD signaling circuits achieved the best energy saving over uni-directional TSVs when delivering data set #4 (PRBS7⇌CLK/2), which has the highest transition density of our four data sets; while two uni-directional TSVs dissipated
186.41fJ/2b, the proposed SBD TSV consumed 128.44fJ/2b (31.1% energy reduction).
Each data transition on a uni-directional TSV requires one full-swing switching, and
hence, two simultaneous transitions such as 00→11, 11→00, 01→10 or 10→01 incur
two full-swing switches. On the other hand, in our proposed SBD signaling, such
simultaneous transitions require only one full-swing switching, leading to dynamic
energy reduction. Thus, the higher data transition rate on SBD TSVs results in the
bigger energy benefits over uni-directional TSVs as proven in our experiments.
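The switching argument above can be illustrated with a small counting sketch. It is a simplified model rather than a description of the measured circuit: it follows the rule stated in the text that a simultaneous transition costs two full-swing switches on two uni-directional TSVs but only one on an SBD TSV, and it additionally assumes (this is not stated in the text) that a cycle in which only one direction toggles costs one switch in both schemes. The function name and the bit patterns are hypothetical.

```python
def count_full_swing_switches(up_bits, down_bits):
    """Count full-swing-equivalent switches for two per-cycle bit streams.

    up_bits / down_bits: equal-length sequences of 0/1, one bit per cycle.
    Returns (uni_directional_switches, sbd_switches).
    """
    if len(up_bits) != len(down_bits):
        raise ValueError("both directions must cover the same number of cycles")
    uni = 0
    sbd = 0
    for i in range(1, len(up_bits)):
        up_toggled = up_bits[i] != up_bits[i - 1]
        down_toggled = down_bits[i] != down_bits[i - 1]
        # Two uni-directional TSVs: every toggling wire switches full swing.
        uni += int(up_toggled) + int(down_toggled)
        # SBD TSV (rule from the text): a simultaneous transition costs one switch.
        # Assumption: a single-direction transition also costs one switch.
        sbd += 1 if (up_toggled or down_toggled) else 0
    return uni, sbd


# High transition density: both directions toggle every cycle.
clock_like = [0, 1, 0, 1, 0, 1]
print(count_full_swing_switches(clock_like, clock_like))        # (10, 5)
# Low transition density: only one direction toggles, once.
print(count_full_swing_switches([0, 0, 1, 1], [0, 0, 0, 0]))    # (1, 1)
```

Under these assumptions the gap between the two counts grows with the number of cycles in which both directions toggle, which matches the trend reported for data set #4.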
4.5.3 Area Comparison
Since SBD signaling can transmit and receive data at the same time through
a single TSV, it occupies a smaller silicon area than uni-directional TSV signaling
at a given die-to-die bandwidth requirement. This area benefit is ideally 50%
(2 TSVs versus 1 TSV), but our proposed SBD interconnect incorporates a half-clocked driver (which is bigger than uni-directional repeaters) and a switched dual-tree
sense amplifier (which is also bigger than conventional flip-flops), so its actual
area benefit is 34.4%. Considering that current face-to-back (F2B) TSV technology
imposes substantial area overheads due to its poly layer landing pads, this 34.4%
area benefit is valuable in designing 3D-ICs; through SBD signaling we can use more redundant TSVs
for reliable die-to-die signaling or more power TSVs for stable power delivery at
a given area budget. It is also notable that the proposed
SBD signaling circuits have a 2.7% bigger footprint than baseline #3, which is
identical to the proposed design except for the coupling capacitor. In other words,
the coupling capacitor at the half-clocked driver incurs only a 2.7% area overhead at
the overall signaling circuit level, which includes transmitters (Tx), receivers (Rx) and
TSV landing pads. Figure 4-24 shows the normalized area of the four types of TSV
interconnects fabricated in our 3D-IC test chip.
[Figure 4-24 (bar chart): normalized TSV interconnect footprint of two uni-directional TSVs, the inverter-based SBD TSV, the SBD TSV with no coupling capacitor, and the proposed design; annotation: only 2.7% area overhead of a coupling capacitor.]
Figure 4-24: Normalized area comparison of the fabricated TSV signaling circuits. While
baseline #1 includes two TSV landing pads, the other three SBD TSV interconnects have only one TSV landing pad.
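A short worked calculation may help show what the quoted percentages imply at a fixed area budget. It normalizes the footprint of two uni-directional TSVs (baseline #1) to 1.0 and uses only the 34.4% and 2.7% figures from the text; the budget of 100 units and all variable names are illustrative assumptions.

```python
AREA_SAVING = 0.344     # proposed SBD link vs. two uni-directional TSVs (from the text)
CAP_OVERHEAD = 0.027    # proposed design vs. baseline #3 (from the text)

uni_pair = 1.0                                # normalized footprint of baseline #1
proposed = uni_pair * (1.0 - AREA_SAVING)     # proposed SBD link
baseline3 = proposed / (1.0 + CAP_OVERHEAD)   # same design without the coupling capacitor

budget = 100.0                                # hypothetical area budget
uni_links = int(budget // uni_pair)           # bi-directional links built from TSV pairs
sbd_links = int(budget // proposed)           # bi-directional links built from SBD TSVs

print(round(proposed, 3), round(baseline3, 3))   # 0.656 0.639
print(uni_links, sbd_links)                      # 100 152
```

In this sketch the same budget holds roughly 1.5× as many bi-directional links, or alternatively leaves about a third of the area free for the redundant or power TSVs mentioned above.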
4.5.4 Comparison with Other Low-Power TSV Circuits
There are only a few TSV interconnect studies at the circuit design level with actual
chip prototypes [79, 68]. Futoshi Furuta et al. demonstrated low-swing TSV circuits
featuring adaptive timing control to deal with variations of TSV parasitic lumped
capacitance [79]. In their design, low-swing signaling was generated by an inverter
with a lower power supply voltage, 0.4V, consuming 27% lower energy than an equivalent full-swing TSV. While the 0.4V voltage swing provided enough noise margin, the
inverter-based low-swing signaling had weaker driving strength, resulting in significant bandwidth loss; their uni-directional TSVs were able to deliver 2Gb/s/TSV
at most. Moreover, the extra lower power supply voltage dedicated only to
low-swing links is sometimes an unacceptable system cost.
Yong Liu et al. presented a 6-tier 3D-IC chip prototype of low-swing TSV circuits that achieved 27-53% energy savings over full-swing TSVs with 0.19 to 0.3V
voltage swing [68]. While the gated-diode sense amplifier enables single-ended low-swing
signaling, the small noise margin (0.19 to 0.3V voltage swing) is vulnerable to PVT
variations. Their single-ended tri-state low-swing transmitter is very similar to our
Chapter 2 low-swing circuits (Figure 2-4), except that our PMOS transistors were replaced by
NMOS transistors with the lower second power supply voltage. As discussed in Section 2.3, this transmitter design incurred no bandwidth or latency penalty, and
hence, their low-swing TSVs delivered uni-directional data at up to 6Gb/s/TSV.
Our TSV link is arguably the first Simultaneously Bi-Directional (SBD) TSV. In
other words, the proposed interconnect design is the first SBD link design optimized
for TSV channel characteristics. While the previous two low-swing TSV circuits [79,
68] provided no area reduction (in fact, they incurred a small area overhead), our
SBD TSV enabled 34.4% lower area consumption than uni-directional TSVs. Since
the TSV micro bump size (50µm×50µm in [79]) is almost the same as the size of flip-chip
micro bumps, this 34.4% area saving is valuable. Also, the 10.3 to 31.1% energy
reduction with a half-swing noise margin provides energy-efficient yet reliable 3D-IC
vertical signaling. Details of the proposed SBD TSV link are summarized in Table
4.1 together with the previous works.
Feature | 3DIC 2012 [79] | ISSCC 2012 [68] | This work
TSV Interconnect Feature | Low-Swing Signaling | Low-Swing Signaling | SBD Signaling
Energy Reduction over Full-Swing | 27% | 27 to 53% | 10.3 to 31.1%
Voltage Swing (Noise Margin) | 0.4V | 0.19 to 0.3V | 0.505V (half swing)
Process Corner Simulations | Error Free | N/A | Error Free
2nd Power Supply Required | Yes | Yes | No
Area Reduction | No (a little overhead) | No (a little overhead) | Yes (34.4% reduction)
Signaling Type | Single-ended | Single-ended | Single-ended
TSV Lumped Capacitance | ∼200fF/TSV | ∼200fF/TSV | N/A
TSV Landing Pad Size | 50µm×50µm | N/A | N/A
Max Clock Frequency | 2GHz | 6GHz | 4.55GHz
3D-IC Stacks | 2-tier 3D-ICs | 6-tier 3D-ICs | 2-tier 3D-ICs
CMOS Technology | 65nm CMOS | 45nm CMOS | 28nm LP CMOS
F2B TSV Technology | N/A | IBM Support | MediaTek Support
Table 4.1: Comparison of Energy-efficient Face-to-Back TSV Interconnects (CMOS-on-CMOS).
4.6 Chapter Summary
In this chapter, we presented an energy and area efficient TSV signaling circuit
for 3D-IC NoCs that require an intensive die-to-die bandwidth. To be specific, we
proposed a simultaneously bi-directional (SBD) TSV interconnect that can transmit
and receive data at the same time through a single TSV. While a half-clocked driver
with a coupling capacitor enabled inter-die static current reduction (hence, low-power
SBD data transmission) without a substantial loss in maximum data rate, a switched
dual-tree sense amplifier allowed error-free single-cycle TSV signaling even at high
clock frequency.
Fabricated with MediaTek F2B TSV technology and a 28nm LP CMOS process,
our 3D-IC test chip proved that the proposed SBD signaling circuit consumed 10.3-31.1% lower energy and 34.4% less silicon area than traditional uni-directional TSVs.
Even though one SBD TSV showed 12.5% lower maximum bandwidth than two uni-directional TSVs (i.e. two SBD TSVs can deliver 75% more data than two uni-directional TSVs), it successfully functioned error free at bi-directional data rates
up to 9.1Gb/s/TSV at 4.55GHz clock frequency. Considering that our test chip
was fabricated with a Low-Power (LP) CMOS process typically used for mobile
chips that require moderate clock frequency, we believe our 4.55GHz maximum clock
frequency at 1.05V is high enough.
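The bandwidth figures quoted in this summary are mutually consistent, as the following back-of-the-envelope check suggests. It assumes one bit per direction per clock cycle on an SBD TSV and derives the implied uni-directional rate from the quoted 12.5% gap; the resulting 5.2Gb/s per uni-directional TSV is an inference from these numbers, not a measurement reported here. Variable names are illustrative.

```python
clock_ghz = 4.55                       # maximum SBD clock frequency (from the text)
sbd_rate = 2 * clock_ghz               # Gb/s per SBD TSV: one bit each way per cycle

# One SBD TSV has 12.5% lower maximum bandwidth than two uni-directional TSVs.
two_uni_rate = sbd_rate / (1 - 0.125)  # implied Gb/s for a pair of uni-directional TSVs
uni_rate = two_uni_rate / 2            # implied Gb/s per uni-directional TSV

# Two SBD TSVs versus two uni-directional TSVs (same TSV count).
gain = (2 * sbd_rate) / two_uni_rate - 1

print(round(sbd_rate, 2))              # 9.1
print(round(uni_rate, 2))              # 5.2
print(round(gain * 100, 1))            # 75.0
```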
To sum up, the proposed TSV interconnect enables higher die-to-die bandwidth
than uni-directional TSVs at a given energy and area budget in 3D-ICs. In other
words, it can deliver the same amount of inter-die data with lower energy and less
area than uni-directional TSVs. Besides, our SBD signaling circuit enables cycle-wise bandwidth adaptivity so that it can be utilized as the basis of fine-grained
bandwidth-adaptive 3D NoCs that efficiently handle dynamic traffic in 3D-IC manycore chips.
5
Conclusions and Future Work
Can we imagine the future NoC? Can the circuits proposed in this thesis feature in
the future NoC? What kind of NoC architectures can be enabled? Let us conclude the
thesis by shaping future research direction based on insights and lessons learnt from
our study.
In this thesis, we presented design techniques for low-power yet high-performance
NoCs at link circuit level, router microarchitecture level and link-and-router co-design
level. The proposed design techniques reduced NoC power without compromising network performance, i.e. simultaneously improved network energy efficiency and performance. However, our NoC designs incurred overheads such as longer critical path,
larger silicon area, reduced signaling noise margin, or additional system overheads.
This thesis seeks to accurately analyze both the benefits and penalties of our proposed
NoC design techniques through chip prototyping.
5.1 Thesis Summary
We will briefly summarize the contributions of our three test chips: The first test
chip explored a regular mesh NoC for general-purpose computing chips (CMPs), while
the second chip prototype aimed at designing the low-swing datapath of irregular
mesh NoCs for high-performance application-specific ICs (MPSoCs). The third test
chip demonstrated energy and area efficient TSV signaling for 3D mesh NoCs in
3D-ICs.
5.1.1 Regular Mesh Network in CMPs
Our first chip prototype demonstrated that a 1GHz 16-node mesh NoC with a 64b
link width can be designed in 45nm SOI CMOS within a power budget of less than
1W, delivering 87-91% of the theoretical throughput limits of a mesh and a per-hop latency of 1.04-1.05
cycles at low-load traffic.
As compared to an equivalent baseline, the fabricated mesh NoC showed 31-38%
power reductions as well as 48-55% latency benefits and over 2× bandwidth improvements through the combination of virtual bypassing, router-level multicast support
and a clocked low-swing datapath. On the other hand, the proposed design resulted in
a 21% longer critical path, 39% larger area and a reduced signaling noise margin.
While the longer critical path can be easily masked in actual manycore chips where
computation cores limit the clock frequency rather than NoC routers, the 39% area
overhead at the router level is quite painful. A more compact layout (e.g. a router
layout through flattened placement and automatic route of low-swing circuits and
full-swing logic gates together) can substantially reduce the area overhead, but such
a compact layout will incur more noise coupling between noise-sensitive low-swing
circuits and noisy full-swing digital circuits.
5.1.2 Low-Swing Datapath of Configurable Meshes in SoCs
The second test chip explored two types of clockless, single-ended low-swing repeaters to enable configuration of fast contention-free paths through any irregular
meshes (i.e. any subsets of a regular backbone mesh) that efficiently support dynamic traffic in MPSoCs.
Self-resetting logic repeaters (SRLRs) enabled single-ended low-swing pulses to
be repeated without clocking, thereby leading to lower power dissipation than differential, clocked low-swing signaling. To mitigate global process variations while
delivering high energy efficiency, three circuit techniques were incorporated. Fabricated in 45nm SOI CMOS, the 10mm SRLR-based datapath achieved 6.83Gb/s/µm
bandwidth density and 40.4fJ/b/mm energy efficiency at 4.1Gb/s data rate. Thanks to its
single-ended signaling, the SRLR-based low-swing datapath occupied only 18%
of the entire router footprint.
Voltage-locked repeaters (VLRs), on the other hand, facilitated lower transmission delay albeit with negligible low-swing energy benefits at the link level. When
compared to equivalent full-swing repeaters, the fabricated 10mm 10-hop VLRs
showed 35.8% latency reduction and 23.6% higher maximum data rate. The VLRs
enabled 2GHz single-cycle, cross-chip communication between any node pairs on a
4×4 mesh, and such single-cycle multi-hop asynchronous repeated traversal (SMART)
contributed to network-level power savings as well as latency reduction in the SMART
NoC [1].
5.1.3 Towards Low-Cost 3D Meshes in 3D-ICs
Just as a 2D mesh maps easily to the planar layout of traditional CMOS wafers, a
3D mesh maps readily to the physical structure of 3D-ICs. A 3D mesh fabric provides
more routing paths between a source and destination node (i.e. more flexible) than a
2D mesh, and such path diversity enables fewer hop counts and higher throughput. In
other words, 3D mesh NoCs can integrate more nodes than 2D mesh NoCs at a given
budget of network latency and bandwidth. While F2B TSVs are the most promising
technology for vertical signaling of multiple-layered 3D-ICs, the current F2B TSV
fabrication technologies incur low yield, sizable silicon landing pads and significant
parasitic capacitance.
Our third test chip demonstrated that the proposed SBD TSV signaling circuits
can alleviate such power and area overheads; the SBD TSVs showed 10.3-31.1% lower
power and 34.4% less silicon area than two uni-directional TSVs that deliver the same
amount of data. While the half-clocked driver with a coupling capacitor eliminated
unnecessary inter-tier static current with a slight bandwidth loss (less than 13%),
the switched dual-tree sense amplifier enabled variation-tolerant TSV signaling as
well as lower sensing delay. These low-power, high-density SBD TSVs can be utilized to improve reliability of 3D-ICs’ vertical signaling (using spare TSVs) or build
bandwidth-adaptive 3D NoCs that can efficiently deliver highly dynamic inter-tier
traffic in 3D-ICs.
5.2 Low-Swing Signaling Reliability
This thesis demonstrated that NoC datapath energy can be substantially reduced
by various low-swing signaling techniques such as a clocked, differential low-swing
circuit (Chapter 2), a clockless, single-ended low-swing circuit (Chapter 3) and a
half-swing SBD TSV interconnect (Chapter 4). These energy savings, as discussed
in detail in each chapter, involve a reduced noise margin. Basically, lower voltage
swing enables more dynamic energy saving but incurs a smaller voltage noise margin,
thereby resulting in higher error probability. Figure 5-1, which is the same as Figure 3-4, explicitly shows this trade-off between low-swing energy efficiency and reliability.
Link-level errors should be detected and corrected at the system level (e.g. NoC
router layer, network interface layer or protocol layer), and hence, the higher error
probability incurs bigger system level overheads. The optimal voltage swing level to
minimize overall system power is debatable because the optimal point highly depends
on the implementation of system-level error detection and correction whose overheads
are not yet accurately analyzed.
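The trade-off described above can be sketched with a first-order model. This is an illustration under stated assumptions, not the analysis behind Figure 5-1: dynamic energy per transition is taken as C·Vdd·Vswing (charge C·Vswing drawn from a supply Vdd), the noise margin is taken as half the swing, and noise is assumed Gaussian with standard deviation sigma. The capacitance, supply and sigma values are hypothetical placeholders.

```python
import math


def dynamic_energy_fj(c_ff, vdd, v_swing):
    """Energy per transition in fJ, assuming charge C*Vswing is drawn from Vdd."""
    return c_ff * vdd * v_swing


def error_probability(v_swing, sigma_v):
    """Gaussian-noise error probability, assuming a noise margin of Vswing/2."""
    margin = v_swing / 2.0
    return 0.5 * math.erfc(margin / (sigma_v * math.sqrt(2.0)))


C_FF, VDD, SIGMA_V = 200.0, 1.0, 0.03   # hypothetical values, for illustration only
for swing in (1.0, 0.5, 0.3, 0.2):
    energy = dynamic_energy_fj(C_FF, VDD, swing)
    p_err = error_probability(swing, SIGMA_V)
    print(f"swing={swing:.1f}V  energy={energy:.0f}fJ  p_error={p_err:.2e}")
```

In this model the energy falls linearly with the swing while the error probability rises much faster than linearly, which is why the system-level cost of error handling matters when choosing the swing.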
To overcome such reliability issues, our Chapter 2 low-swing circuit (Figure 2-4)
can control its voltage swing level by an off-chip voltage regulator. While this off-chip
solution guarantees stable low-swing signaling with a little voltage margin, the
additional power supply voltage is sometimes considered an unacceptable system
cost, and moreover, this scheme requires extra post-fabrication testing effort to find
the proper voltage swing level for each die.
Figure 5-1: Lower voltage swing enables higher energy efficiency, but results in higher
signaling error probability (hence bigger system overheads). This figure is
identical to Figure 2-11.
Self-resetting logic repeaters (SRLRs), which generate single-ended low-swing signaling without clocking circuits, achieve substantial energy reduction without compromising bandwidth density (see measurement results in Figure 3-7). However, since
their robustness against process variations is 3∼4σ, which is far from the industrial
standard (typically over 5σ), SRLRs cannot be directly incorporated into commercial
chips due to their poor signaling reliability. The easiest remedy is to
increase the voltage swing, but covering all 5σ variations only with a higher voltage
swing in an advanced node (e.g. 28nm CMOS) will lead to a considerable loss in
energy benefits. The reduced signaling noise margin can be addressed through error
correction codes (ECCs) at the system level, but such an ECC scheme will incur power
and latency overheads. NoC-layer solutions will be introduced in Subsection 5.3.2 as
our future project.
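To give a feel for the gap between 3∼4σ and 5σ robustness, the sketch below computes the fraction of circuit instances that fall outside ±kσ. It assumes the varying parameter is normally distributed, which is a common simplification and not something measured in this work; the function name is hypothetical.

```python
import math


def fraction_outside(k_sigma):
    """Two-sided Gaussian tail: fraction of instances beyond +/- k sigma."""
    return math.erfc(k_sigma / math.sqrt(2.0))


for k in (3, 4, 5):
    print(f"{k} sigma: {fraction_outside(k):.2e} of instances uncovered")
```

Under this assumption, each additional sigma of coverage reduces the uncovered fraction by one to two orders of magnitude, so a design that tolerates only 3∼4σ leaves far more failing links on a large chip than a 5σ design.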
While our SBD TSV link shows lower energy reduction than SRLRs, its half swing
nature (∼0.5V noise margin) enables much more reliable functionality. As discussed
in Section 4.4 and Section 4.5, both simulation and measurement results demonstrate
that the proposed SBD interconnect design functions error free at data rates up to
9.1Gb/s/TSV at 1.05V power supply voltage. Thus, our SBD TSV can be utilized
as an error-free NoC datapath without system-level error correction schemes.
As CMOS process scales down, NoC datapath energy will increase in percentage
relative to control and storage circuitry energy [15, 16], and hence, it will become more
critical to reduce the interconnect power. On the other hand, since CMOS scaling
makes process variation worse, low-swing signaling circuits with smaller feature sizes
incur higher BER at the same noise margin. Therefore, future NoCs will require
variation-aware circuit design even at the link level to achieve low-power yet reliable
interconnects.
5.3 Future Projects
Can we imagine the future NoC? With an advance in 3D-IC fabrication technology,
coupled with the physical limits of conventional CMOS scaling, the future NoC will
likely have a 3D mesh as its backbone network topology. This 3D mesh, which
is much more flexible and scalable than a 2D mesh, will be reconfigured at runtime depending on the executed applications. In addition to network topology, the
future NoC will tailor each router microarchitecture (e.g. pipeline depth) through
background calibration against PVT variations. Ultimately, the NoC will satisfy the
communication requirements of both bandwidth-intensive applications through high-density TSVs (hopefully our SBD TSVs!) and latency-sensitive applications through
contention-free low-latency links (optical interconnects or low-swing links such as our
VLRs). Needless to say, such a NoC should also be low-power. Is it possible?
5.3.1 Broadcast-Intensive Cache Coherent Protocols
It is widely believed that broadcasts over a mesh network are too expensive on-die
due to bandwidth and power constraints, so broadcast-intensive on-chip communication has typically been limited to bus interconnects or ring NoCs that suffer from poor
scalability. Our 4×4 mesh NoC chip prototype challenged this. As discussed in Section 2.4.1, our mesh NoC chip achieved 2.2× higher saturation throughput than the
baseline, which means our mesh design can endure over 2× more broadcasts before
the network saturates. Latency-wise, the proposed design showed 55.1% broadcast
latency reduction before saturation, leading to 1.05 cycles per hop on average before
the network saturates. From the energy point of view, a broadcast flit is nothing
but a unicast flit with 5× ST/LT energy and 5× buffering energy. As demonstrated
through our chip prototyping, the higher ST/LT energy can be alleviated by low-swing signaling circuits while the higher buffering energy can be effectively reduced
through virtual bypassing, thus resulting in reasonable broadcast energy even in a
mesh.
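The energy argument above can be written as a toy per-hop model. It is only a sketch: energies are in arbitrary normalized units, the 5× factor from the text is modeled as five copies of the flit, low-swing signaling is modeled as a scale factor on ST/LT energy, and virtual bypassing as the fraction of copies that skip buffering. The scale factor of 0.5 and the bypass rate of 0.75 are hypothetical values, not measured results.

```python
def per_hop_energy(e_stlt, e_buf, copies=1, stlt_scale=1.0, bypass_rate=0.0):
    """Normalized per-hop energy of a flit replicated into `copies` flits.

    e_stlt:      ST/LT (switch and link traversal) energy of one copy
    e_buf:       buffering energy of one copy
    stlt_scale:  ST/LT energy multiplier, e.g. below 1.0 with low-swing links
    bypass_rate: fraction of copies that bypass the buffers
    """
    stlt = copies * e_stlt * stlt_scale
    buf = copies * e_buf * (1.0 - bypass_rate)
    return stlt + buf


unicast = per_hop_energy(1.0, 1.0)
broadcast = per_hop_energy(1.0, 1.0, copies=5)
optimized = per_hop_energy(1.0, 1.0, copies=5, stlt_scale=0.5, bypass_rate=0.75)
print(unicast, broadcast, optimized)   # 2.0 10.0 3.75
```

With these placeholder numbers the optimized broadcast costs less than twice a baseline unicast per hop, which illustrates how the two techniques attack the two 5× terms separately.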
We see potential architecture-level research opportunities enabled by our low-power broadcast mesh NoC. One well-known application of on-chip broadcasts is
a cache coherence protocol for scalable on-chip cache subsystems [80, 81, 82, 83].
The broadcast-based cache coherence protocols have a virtue of requiring no (or
smaller) directory storage whose power and area substantially increase as core counts
scale. In other words, in the cache coherence protocols that incorporate broadcasts as
the requests and invalidates, more broadcasts enable lower directory overheads, thus
mitigating the scalability issue caused by the directory. SCORPIO [84], which is a
36-core processor chip prototype featuring in-network ordering over a mesh NoC,
leveraged our broadcast router design to incorporate snoopy coherence for scalable
on-chip cache subsystems.
5.3.2 Error-Tolerant NoCs with Low-Swing Links
As discussed in Section 3.4.2 and Section 3.5.2, SRLRs and VLRs require system-level error-tolerant schemes, due to their poor signaling reliability, in order to be
utilized for a stable NoC datapath.
One interesting approach is that the errors that our circuit-level solutions do not
cover will be detected and corrected at the NoC architecture level. Such NoC-level
solutions will likely include topology reconfiguration, resilient routing or flexible router
pipeline. It is notable that even if NoCs do not incorporate low-swing links,
these error-tolerant NoCs will be required at future technology nodes where there
will be many transistor failures during the lifetime of manycore chips [85, 86]. Thus,
if such error-tolerant schemes can be extended for low-swing links with acceptable
overheads, our low-swing circuits will be able to be embedded within a NoC without
compromising their excellent energy efficiency. Previous work on error-tolerant NoCs is well summarized in [87], which can serve as the starting point of this
future project.
5.3.3 Bandwidth-Adaptive 3D NoCs
Our Simultaneously Bi-Directional (SBD) TSV circuit provides a foundation for
bandwidth-adaptive 3D NoCs which efficiently handle dynamic traffic in 3D-ICs.
The half-clocked driver enables/disables SBD signaling at every cycle, and hence
provides cycle-wise inter-tier bandwidth adaptivity. In other words, our SBD
TSVs support three effective signaling modes (UP→DOWN unidirectional signaling,
DOWN→UP unidirectional signaling, or UP↔DOWN bidirectional signaling), and
the signaling mode can be switched at every cycle. Thanks to the switched dual-tree
sense amplifier, coupled with the half-swing nature of the 3-step SBD signaling,
our proposed SBD TSVs hardly incur any reliability issues.
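The three signaling modes and their per-cycle selection can be sketched as a small behavioral model. This is a hypothetical illustration of how a router might choose a mode from pending traffic; the names `SbdMode`, `select_mode` and `bits_delivered` are not taken from the design, and the policy (use whichever directions have data) is an assumption.

```python
from enum import Enum
from typing import Iterable, Optional, Tuple


class SbdMode(Enum):
    UP_TO_DOWN = "UP->DOWN"        # unidirectional, top die transmits
    DOWN_TO_UP = "DOWN->UP"        # unidirectional, bottom die transmits
    BIDIRECTIONAL = "UP<->DOWN"    # both dies transmit in the same cycle


def select_mode(up_has_data: bool, down_has_data: bool) -> Optional[SbdMode]:
    """Pick the signaling mode for one cycle; None means the TSV is idle."""
    if up_has_data and down_has_data:
        return SbdMode.BIDIRECTIONAL
    if up_has_data:
        return SbdMode.UP_TO_DOWN
    if down_has_data:
        return SbdMode.DOWN_TO_UP
    return None


def bits_delivered(schedule: Iterable[Tuple[bool, bool]]) -> int:
    """Total bits moved over one SBD TSV for a per-cycle traffic schedule."""
    total = 0
    for up_has_data, down_has_data in schedule:
        mode = select_mode(up_has_data, down_has_data)
        if mode is SbdMode.BIDIRECTIONAL:
            total += 2
        elif mode is not None:
            total += 1
    return total


# Four cycles: up only, both, idle, down only.
print(bits_delivered([(True, False), (True, True), (False, False), (False, True)]))  # 4
```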
Armed with these flexible yet reliable SBD TSVs, we can instantaneously double the die-to-die bandwidth at given TSV area and thermal budgets. While this
bandwidth adaptivity can efficiently handle bursty off-die traffic which is often the
performance bottleneck in manycore chips, it does not require any other link-level
overheads such as serialization/deserialization, data encoding/decoding, or error correction. However, incorporating this SBD-based bandwidth adaptivity into the 3D
NoC router will be a challenging task, i.e. it will require some router-level overheads.
For instance, designing a 7×7 crossbar switch with two vertical SBD links is far from
trivial. Since one end of an SBD TSV can be an input port and/or an output port of the
crossbar switch, careful datapath decoupling is required. Besides, the physical constraints of SBD TSVs differ from those of other 2D links (North, South, East, West
and NIC), leading to more complicated timing requirements. Detecting the bursty
traffic in the NoC level through flow control is also challenging. Finally, energy efficiency and performance of the bandwidth-adaptive 3D NoC will significantly vary
depending on how to merge a spare TSV scheme (which is critical in existing 3D-ICs)
into the SBD TSV-based router design.
Bibliography
[1] Chia-Hsin Owen Chen, Sunghyun Park, Tushar Krishna, Suvinay Subramanian,
Anantha Chandrakasan, and Li-Shiuan Peh. SMART: A Single-Cycle Reconfigurable NoC for SoC Applications. In Proceedings of the IEEE/ACM Design,
Automation and Test in Europe (DATE), 2013.
[2] Seongmoo Heo and Krste Asanovic. Replacing Global Wires with an On-Chip
Network: a Power Analysis. In Proceedings of the IEEE/ACM International
Symposium on Low Power Electronics and Design (ISLPED), August 2005.
[3] W. J. Dally and B. Towles. Route Packets Not Wires: On-Chip Interconnection
Networks. In Proceedings of the IEEE/ACM Design Automation Conference
(DAC), June 2001.
[4] M. B. Taylor et al. The Raw Microprocessor: a Computational Fabric for Software Circuits and General-Purpose Programs. IEEE Micro, 22(2):25–35, 2002.
[5] Y. Hoskote et al. A 5-GHz Mesh Interconnect for a Teraflops Processor. IEEE
Micro, 27(5):51–61, 2007.
[6] T. Krishna et al. Towards the Ideal On-Chip Fabric for 1-to-Many and Many-to-1 Communication. In Proceedings of the IEEE/ACM International Symposium
on Microarchitecture (MICRO), December 2011.
[7] T. Krishna et al. Breaking the On-Chip Latency Barrier Using SMART. In Proceedings of the IEEE International Symposium on High-Performance Computer
Architecture (HPCA), February 2013.
[8] P. Gratz et al. On-Chip Interconnection Networks of the TRIPS Chip. IEEE
Micro, 27(5):41–50, 2007.
[9] Shane Bell et al. TILE 64-Processor: A 64-Core SoC with Mesh Interconnect. In
Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC),
February 2008.
[10] Jason Howard et al. A 48-core IA-32 Message-Passing Processor with DVFS
in 45nm CMOS. In Proceedings of the IEEE International Solid-State Circuits
Conference (ISSCC), February 2010.
[11] Dae Hyun Kim et al. 3D-MAPS: 3D Massively Parallel Processor with Stacked
Memory. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), February 2012.
[12] Bhavya K. Daya et al. SCORPIO: A 36-Core Research Chip Demonstrating
Snoopy Coherence on a Scalable Mesh NoC with In-Network Ordering. In Proceedings of the IEEE/ACM International Symposium on Computer Architecture
(ISCA), June 2014.
[13] William J. Dally and Brian Towles. Principles and Practices of Interconnection
Networks. Morgan Kaufmann Publishers, 2004.
[14] William J. Dally. Virtual-Channel Flow Control. In Proceedings of the
IEEE/ACM International Symposium on Computer Architecture (ISCA), June
1990.
[15] Hui Zhang, Varghese George, and Jan M. Rabaey. Low-Swing On-Chip Signaling
Techniques: Effectiveness and Robustness. IEEE Transactions on Very Large
Scale Integration Systems (T-VLSI), pages 264–272, 2000.
[16] C. Sun, Chia-Hsin O. Chen, G. Kurian, L. Wei, J. Miller, A. Agarwal, L.-S.
Peh, and V. Stojanovic. DSENT - a Tool Connecting Emerging Photonics with
Electronics for Opto-Electronic Networks-on-Chip Modeling. In Proceedings of
the IEEE/ACM International Symposium on Networks-on-Chip (NOCS), May
2010.
[17] J. Cong and D. Z. Pan. Interconnect Estimation and Planning for Deep Submicron Designs. In Proceedings of the IEEE/ACM Design Automation Conference
(DAC), June 1999.
[18] P. Gratz et al. Uniform Repeater Insertion in RC Trees. IEEE Transactions
on Circuits and Systems-I: Fundamental Theory and Applications (TCAS-I),
47(10):41–50, 2000.
[19] Natalie Enright Jerger and Li-Shiuan Peh. Synthesis Lectures on Computer
Architecture - On-Chip Networks. Morgan and Claypool Publishers.
[20] Andrew Kahng et al. ORION 2.0: A Fast and Accurate NoC Power and
Area Model for Early-Stage Design Space Exploration. In Proceedings of the
IEEE/ACM Design, Automation and Test in Europe (DATE), pages 423–428,
2009.
[21] M. Modarressi et al. Application-Aware Topology Reconfiguration for On-Chip
Networks. IEEE Transactions on Very Large Scale Integration Systems (T-VLSI), 2011.
[22] M. Modarressi et al. Virtual Point-to-Point Connections for NoCs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (T-CAD),
2010.
[23] M. B. Stensgaard and J. Sparso. ReNoC: A Network-on-Chip Architecture with
Reconfigurable Topology. In Proceedings of the IEEE/ACM International Symposium on Networks-on-Chip (NOCS), 2008.
[24] M. B. Stuart et al. Synthesis of Topology Configurations and Deadlock Free
Routing Algorithms for ReNoC-based Systems-on-Chip. In Proceedings of the
IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2009.
[25] C. Jackson and S. J. Hollis. Skip-links: A Dynamically Reconfiguring Topology
for Energy-efficient NoCs. In International Symposium on System on Chip (SoC),
2010.
[26] Joo-Young Kim et al. A 118.4 GB/s Multi-Casting Network-on-Chip With Hierarchical Star-Ring Combined Topology for Real-Time Object Recognition. IEEE
Journal of Solid-State Circuits (JSSC), 45:1399–1409, 2010.
[27] M. Coppola et al. Spidergon: a Novel On-Chip Communication Network. In
International Symposium on System on Chip (SoC), page 15, 2004.
[28] H. Zhang et al. A 1V Heterogeneous Reconfigurable Processor IC for Baseband
Wireless Applications. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), pages 68–69, 2000.
[29] T. Krishna, J. Postman, C. Edmonds, L.-S. Peh, and P. Chiang. SWIFT:
A SWing-reduced Interconnect For a Token-based Network-on-Chip in 90nm
CMOS. In Proceedings of the IEEE International Conference on Computer Design (ICCD), pages 439–446, 2010.
[30] Amit Kumar et al. Token Flow Control. In Proceedings of the IEEE/ACM
International Symposium on Microarchitecture (MICRO), November 2008.
[31] David Wentzlaff et al. On-Chip Interconnection Architecture of the Tile Processor.
IEEE Micro, 27(5):15–31, 2007.
[32] A. Kumar et al. Express Virtual Channels: Towards the Ideal Interconnection
Fabric. In Proceedings of the IEEE/ACM International Symposium on Computer
Architecture (ISCA), June 2007.
[33] S. S. Mukherjee et al. The Alpha 21364 Network Architecture. IEEE Micro,
22(5):26–35, 2002.
[34] P. Gratz et al. Implementation and Evaluation of On-Chip Network Architectures. In Proceedings of the IEEE International Conference on Computer Design
(ICCD), October 2006.
[35] A. Kumar et al. A 4.6Tbits/s 3.6GHz Single-Cycle NoC Router with a Novel
Switch Allocator in 65nm CMOS. In Proceedings of the IEEE International
Conference on Computer Design (ICCD), October 2007.
[36] Jan M. Rabaey, Anantha P. Chandrakasan, and Borivoje Nikolic. Digital Integrated Circuits: A design perspective. Prentice Hall, 2nd Edition, 1998.
[37] R. Ho et al. High-Speed and Low-Energy Capacitive-Driven On-Chip Wires. In
Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC),
pages 412–413, February 2007.
[38] E. Mensink et al. A 0.28pJ/b 2Gb/s/ch Transceiver in 90nm CMOS for 10mm
On-chip interconnects. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), pages 314–315, February 2007.
[39] B. Kim and V. Stojanovic. A 4Gb/s/ch 356fJ/b 10mm equalized on-chip interconnect with nonlinear charge-injecting transmit filter and transimpedance
receiver in 90nm CMOS. In Proceedings of the IEEE International Solid-State
Circuits Conference (ISSCC), pages 66–67, February 2009.
[40] J. Seo et al. High-Bandwidth and Low-Energy On-Chip Signaling with Adaptive
Pre-Emphasis in 90nm CMOS. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), pages 182–183, February 2010.
[41] Jason Howard et al. A 48-core IA-32 Message-Passing Processor with DVFS in 45nm
CMOS. In Proceedings of the IEEE International Solid-State Circuits Conference
(ISSCC), pages 108–109, 2010.
[42] Sunghyun Park. Low-Swing Signaling for Energy Efficient On-Chip Networks.
SM Thesis, Massachusetts Institute of Technology (MIT), June 2011.
[43] N. Verma and A. P. Chandrakasan. A High-Density 45nm SRAM Using Small-Signal Non-Strobed Regenerative Sensing. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), pages 380–381, February 2008.
[44] I. Arsovski and R. Wistort. Self-referenced Sense Amplifier for Across-chip-variation Immune Sensing in High-performance Content-Addressable Memories. In Proceedings of the IEEE/ACM Custom Integrated Circuits Conference
(CICC), pages 453–456, 2006.
[45] M. Qazi et al. A 512kb 8T SRAM Macro Operating Down to 0.57V with An
AC-Coupled Sense Amplifier and Embedded Data-Retention-Voltage Sensor in
45nm SOI CMOS. In Proceedings of the IEEE International Solid-State Circuits
Conference (ISSCC), pages 350–351, February 2010.
[46] Hang-Sheng Wang, Xinping Zhu, Li-Shiuan Peh, and Sharad Malik. ORION: A
Power-Performance Simulator for Interconnection Networks. In Proceedings of
the IEEE/ACM International Symposium on Microarchitecture (MICRO), November 2002.
[47] Andrew Kahng et al. Explicit Modeling of Control and Data for Improved NoC
Router Estimation. In Proceedings of the IEEE/ACM Design Automation Conference (DAC), pages 392–397, 2012.
[48] K. Goossens et al. AEthereal Network on Chip: Concepts, Architectures, and
Implementations. IEEE Design and Test of Computers, 22(5):414–421, 2005.
[49] F. Karim et al. An Interconnect Architecture for Networking Systems on Chips.
IEEE Micro, 22(5):36–45, September 2002.
[50] Nam-Sung Woo. High Performance SOC for mobile applications. In Proceedings
of the IEEE Asian Solid-State Circuits Conference (A-SSCC), 2010.
[51] A. Adriahantenaina et al. SPIN: A Scalable, Packet Switched, On-Chip Micro-Network. In Proceedings of the IEEE/ACM Design, Automation and Test in
Europe (DATE), March 2003.
[52] G. Passas et al. A 128 x 128 x 24Gb/s Crossbar Interconnecting 128 Tiles
in a Single Hop and Occupying 6 percent of Their Area. In Proceedings of the
IEEE/ACM International Symposium on Networks-on-Chip (NOCS), May 2010.
[53] E.D. Kyriakis-Bitzaros et al. Design of Low Power CMOS Drivers Based on
Charge Recycling. IEEE International Symposium on Circuits and Systems (ISCAS), pages 1924–1927, June 1997.
[54] M. Hiraki et al. Data-Dependent Logic Swing Internal Bus Architecture for
Ultralow-Power LSIs. IEEE Journal of Solid-State Circuits (JSSC), pages 397–402, April 1995.
[55] H. Yamauchi et al. An Asymptotically Zero Power Charge-Recycling Bus Architecture for Battery-Operated Ultrahigh Data Rate ULSIs. IEEE Journal of
Solid-State Circuits (JSSC), pages 423–431, April 1995.
[56] R. Golshan et al. A novel reduced swing CMOS BUS interface circuit for high
speed low power VLSI systems. IEEE International Symposium on Circuits and
Systems (ISCAS), pages 351–354, May 1994.
[57] B.-D. Yang et al. High-Speed and Low-Swing On-Chip Bus Interface Using
Threshold Voltage Swing Driver and Dual Sense Amplifier Receiver. IEEE European Solid-State Circuits Conference (ESSCIRC), pages 144–147, September 2000.
[58] Sunghyun Park, Tushar Krishna, Chia-Hsin O. Chen, Bhavya Daya, Anantha P.
Chandrakasan, and Li-Shiuan Peh. Approaching the theoretical limits of a mesh
NoC with a 16-node chip prototype in 45nm SOI. Proceedings of the IEEE/ACM
Design Automation Conference (DAC), June 2012.
[59] Y.-H. Kao, M. Yang, N. S. Artan, and H. J. Chao. CNoC: High-Radix Clos
Network-on-Chip. IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems (T-CAD), 30:1897–1910, 2011.
[60] W. J. Dally. Express Cubes: Improving the Performance of k-ary n-cube Interconnection Networks. IEEE Transactions on Computers, 40:1016–1023, 1991.
[61] Henri J. Oguey and Daniel Aebischer. CMOS Current Reference Without Resistance. IEEE Journal of Solid-State Circuits (JSSC), 32:1132–1135, July 1997.
[62] Eisse Mensink, Daniel Schinkel, Eric A. M. Klumperink, Ed van Tuijl, and
Bram Nauta. Power efficient gigabit communication over capacitively driven
RC-limited on-chip interconnects. IEEE Journal of Solid-State Circuits (JSSC),
45:447–457, Apr. 2010.
[63] Byungsub Kim and Vladimir Stojanovic. An energy-efficient equalized
transceiver for RC-dominant channels. IEEE Journal of Solid-State Circuits
(JSSC), 45:1186–1197, June 2010.
[64] Jae sun Seo, Ron Ho, Jon Lexau, Michael Dayringer, Dennis Sylvester, and David
Blaauw. High-Bandwidth and Low-Energy On-Chip Signaling with Adaptive
Pre-Emphasis in 90nm CMOS. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), pages 182–183, February 2010.
[65] D. Woo et al. An Optimized 3D-Stacked Memory Architecture by Exploiting
Excessive, High-Density TSV Bandwidth. In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages
1–12, January 2010.
[66] R. S. Patti. Three-Dimensional Integrated Circuits and the Future of System-on-Chip Designs. In Proceedings of the IEEE, 2006.
[67] A. W. Topol, J. D. C. La Tulipe, D. J. Frank, L. Shi, K. Bernstein, S. E. Steen,
A. Kumar, G. U. Singco, A. M. Young, K. W. Guarini, and M. Ieong. Three-Dimensional Integrated Circuits. IBM Journal of Research and Development,
50(4/5):491–506, 2006.
[68] Yong Liu, Wing Luk, and Daniel Friedman. A Compact Low-Power 3D I/O
in 45nm CMOS. In Proceedings of the IEEE International Solid-State Circuits
Conference (ISSCC), February 2012.
[69] B. Swinnen, W. Ruythooren, P. D. M. L. Bogaerts, L. Carbonell, K. D. Munck,
B. Eyckens, S. Stoukatch, Tezcan, D. Sabuncuoglu, Z. Tokei, J. Vaes, J. V.
Aelst, and E. Beyne. 3D Integration by Cu-Cu Thermo-Compression Bonding
of Extremely Thinned Bulk-Si Die Containing 10um Pitch Through-Si Vias. In
International Electron Devices Meeting (IEDM), December 2006.
[70] Igor Loi, Subhasish Mitra, Thomas H. Lee, Shinobu Fujita, and Luca Benini.
A Low-Overhead Fault Tolerance Scheme for TSV-Based 3D Network on Chip
Links. In Proceedings of the IEEE/ACM Design, Automation and Test in Europe
(DATE), November 2008.
[71] M. Laisne et al. System and Methods Utilizing Redundancy in Semiconductor
Chip Interconnects. In US patent 20100060310A1, March 2010.
[72] A. Hsieh et al. TSV Redundancy: Architecture and Design Issues in 3D IC. In
Proceedings of the IEEE/ACM Design, Automation and Test in Europe (DATE),
March 2010.
[73] Kevin Lam, Larry R. Dennison, and William J. Dally. Simultaneous Bidirectional
Signalling for IC Systems. In Proceedings of the IEEE International Conference
on Computer Design (ICCD), September 1990.
[74] Jae-Yoon Sim et al. A 1-Gb/s Bidirectional I/O Buffer Using the Current-Mode
Scheme. IEEE Journal of Solid-State Circuits (JSSC), 34(4):529–535, 1999.
[75] H. Wilson et al. A Six-Port 30-GB/s Nonblocking Router Component Using
Point-to-Point Simultaneous Bidirectional Signaling for High-Bandwidth Interconnects. IEEE Journal of Solid-State Circuits (JSSC), 36(12):1954–1963, 2001.
[76] Wilson et al. A 4-Gb/s/pin Low-Power Memory I/O Interface Using 4-Level
Simultaneous Bi-Directional Signaling. IEEE Journal of Solid-State Circuits
(JSSC), 40(1):89–101, 2005.
[77] Mahmut E. Sinangil et al. A Reconfigurable 8T Ultra-Dynamic Voltage Scalable
(U-DVS) SRAM in 65nm CMOS. IEEE Journal of Solid-State Circuits (JSSC),
11:3163–3173, 2009.
[78] T. Kobayashi et al. A Current-Controlled Latch Sense Amplifier and a Static
Power-Saving Input Buffer for Low-Power Architecture. IEEE Journal of Solid-State Circuits (JSSC), 28:523–527, 1993.
[79] Futoshi Furuta and Kenichi Osada. 6Tbps/W, 1Tbps/mm, 3D Interconnect using Adaptive Timing Control and Low Capacitance TSV. In IEEE International
3D Systems Integration Conference (3DIC), January 2012.
[80] M. M. K. Martin, M. D. Hill, and D. A. Wood. Token Coherence: Decoupling
Performance and Correctness. In Proceedings of the IEEE/ACM International
Symposium on Computer Architecture (ISCA), June 2003.
[81] Niket Agarwal, Li-Shiuan Peh, and N. K. Jha. In-Network Snoop Ordering
(INSO): Snoopy Coherence on Unordered Interconnects. In Proceedings of the
IEEE International Symposium on High-Performance Computer Architecture
(HPCA), February 2009.
[82] Pat Conway et al. Cache Hierarchy and Memory Subsystem of the AMD Opteron
Processor. IEEE Micro, 30:16–29, 2010.
[83] Arun Raghavan et al. Token Tenure: PATCHing Token Counting Using
Directory-Based Cache Coherence. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO), November 2008.
[84] Bhavya K. Daya, Chia-Hsin Owen Chen, Suvinay Subramanian, Woo-Cheol
Kwon, Sunghyun Park, Tushar Krishna, Jim Holt, Anantha P. Chandrakasan,
and Li-Shiuan Peh. SCORPIO: A 36-Core Research Chip Demonstrating Snoopy
Coherence on a Scalable Mesh NoC with In-Network Ordering. In Proceedings
of the IEEE/ACM International Symposium on Computer Architecture (ISCA),
June 2014.
[85] Jayanth Srinivasan, Sarita V. Adve, Pradip Bose, and Jude A. Rivers. The Impact of Technology Scaling on Lifetime Reliability. In International Conference
on Dependable Systems and Networks (DSN), June-July 2004.
[86] Konstantinos Aisopos, Chia-Hsin Owen Chen, and Li-Shiuan Peh. Enabling
System-Level Modeling of Variation-Induced Faults in Networks-on-Chip. In
Proceedings of the IEEE/ACM Design Automation Conference (DAC), October
2011.
[87] K. Aisopos, A. DeOrio, L.-S. Peh, and V. Bertacco. ARIADNE: Agnostic Reconfiguration In A Disconnected Network Environment. In Proceedings of the
IEEE/ACM International Conference on Parallel Architectures and Compilation
Techniques (PACT), October 2011.