Energy-Efficient SRAM Design in 28nm FDSOI Technology
by
Avishek Biswas
B. Tech. (Hons.), Indian Institute of Technology, Kharagpur (2012)
Submitted to the Department of Electrical Engineering and Computer
Science
in partial fulfillment of the requirements for the degree of
Master of Science in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2014
© Massachusetts Institute of Technology 2014. All rights reserved.
Author ..................
Signature redacted
Department of Electrical Engineering and Computer Science
May 21, 2014
Certified by ..........................
Signature redacted
Anantha P. Chandrakasan
Joseph F. and Nancy P. Keithley Professor of Electrical Engineering
Thesis Supervisor
Accepted by .........................
Signature redacted
Leslie A. Kolodziejski
Chairman, Department Committee on Graduate Theses
Energy-Efficient SRAM Design in 28nm FDSOI Technology
by
Avishek Biswas
Submitted to the Department of Electrical Engineering and Computer Science
on May 21, 2014, in partial fulfillment of the
requirements for the degree of
Master of Science in Electrical Engineering and Computer Science
Abstract
As CMOS scaling continues into the sub-32nm regime, the effects of device variations
become more prominent. This is critical in SRAMs, which use very small transistor
dimensions to achieve high memory density. The conventional 6T SRAM bit-cell,
which provides the smallest cell area, fails to operate at lower supply voltages (Vdd).
This is due to the significant degradation of functional margins as the supply voltage
is scaled down. However, Vdd scaling is crucial in reducing the energy consumption of
SRAMs, which is a significant portion of the overall energy consumption in modern
microprocessors. Energy savings in SRAM are particularly important for
battery-operated applications, which run from a very constrained power budget.
This thesis focuses on energy-efficient 6T SRAM design in a 28nm FDSOI technology.
Significant savings in energy/access of the SRAM are achieved using two techniques:
Vdd scaling and data prediction. A 200mV improvement in the minimum
SRAM operating voltage (Vdd,min) is achieved by using dynamic forward body-biasing
(FBB) on the NMOS devices of the bit-cell. The overhead of dynamic FBB is reduced
by implementing it row-wise. Layout modifications are proposed to share the
body terminals (n-wells) horizontally, along a row. Further savings in energy/access
are achieved by incorporating data prediction in the 6T read path, which reduces
bit-line switching. The proposed techniques are implemented for a 128Kb 6T SRAM,
designed in a 28nm FDSOI technology. This thesis also presents a reconfigurable,
fully-integrated, switched-capacitor based step-up DC-DC converter, which can be
used to generate the body-bias voltage for an SRAM. Three reconfigurable conversion
ratios of 5/2, 2/1 and 3/2 are implemented in the converter. It provides a wide range
of output voltage, 1.2V-2.4V, from a fixed input of 1V. The converter achieves a peak
efficiency of 88%, using only on-chip MOS and MOM capacitors, for a high density
implementation.
Thesis Supervisor: Anantha P. Chandrakasan
Title: Joseph F. and Nancy P. Keithley Professor of Electrical Engineering
Acknowledgments
I would first like to thank my advisor, Prof. Anantha Chandrakasan, for giving me
the opportunity to be part of his wonderful research group at MIT. He has been
an incredible mentor, motivating me to think and analyze more critically. He was
kind enough to provide me the flexibility to work in the different areas I am interested
in. Thank you Anantha for providing me with various opportunities and guiding me
throughout. I am looking forward to working on various exciting projects during
my PhD and I consider myself very fortunate to have you as my advisor. Next, I
would like to thank all the members of the Ananthagroup for being such friendly and
welcoming people. Thank you Yildiz for all your help with my first test chip tape-out
and testing. You have been really encouraging and patient in answering my doubts
regarding SRAM design. Thanks Chu, for all the help and discussions on the SRAM
project. Thanks to Arun and Nachiket for always being patient and kind enough to
answer my doubts regarding Cadence simulations and other stuff. Thanks Saurav
and Dina for sharing your expertise on switched-capacitor DC-DC converters with
me. Thanks to our administrative assistant, Margaret for always greeting us with a
smile and being helpful with various logistics. I am also lucky to have so many good
friends and colleagues at MIT, who have made my life here enjoyable. I would also
like to thank ST Microelectronics and Andreia Cathelin for their generous support
with chip fabrication and DARPA for funding the projects.
Last, but certainly not the least, I would like to thank my parents, my younger
brother and my family for their consistent support and belief in me. Words are not
good enough to describe all the sacrifice and hard work my parents have put in, to
get me to where I am today. I am waiting to see that proud smile on their faces when
they see my SM degree from MIT.
Contents
1 Introduction
  1.1 Motivation for Low-voltage SRAM
  1.2 Recent 6T SRAM designs in sub-45nm CMOS
  1.3 Advantages of FDSOI technology
  1.4 Thesis Contributions
2 Background of 6T SRAM design
  2.1 6T SRAM Bit-cell Operation
    2.1.1 Data Retention
    2.1.2 Read Operation
    2.1.3 Write Operation
  2.2 6T SRAM Functional Margin Issues and the effect of Vdd Scaling
    2.2.1 Static Noise Margin
    2.2.2 Write Margin
    2.2.3 Dynamic Read Margin
    2.2.4 Effect of Vdd scaling on noise margins
  2.3 Conventional Assist Techniques
    2.3.1 Read Assists in Previous Works
    2.3.2 Write Assists in Previous Works
3 6T SRAM design in 28nm FDSOI
  3.1 Forward Body-Biasing (FBB)
    3.1.1 FBB as Write Assist
    3.1.2 Read-Stability Issues and Dynamic FBB
  3.2 Energy-efficient Implementation of D-FBB
  3.3 Hierarchical BL structure and Data Prediction
    3.3.1 Dynamic Read Margin
    3.3.2 Hierarchical Read Path
    3.3.3 Using Data Prediction in 6T SRAM
  3.4 Overall Array Architecture
  3.5 Energy Savings
    3.5.1 Due to Vdd Scaling
    3.5.2 Using Data Prediction
4 Reconfigurable Body-Bias Generator in 28nm FDSOI
  4.1 Brief overview of SC converters
  4.2 Reconfigurable Step-up SC Module
  4.3 MOS Implementation of the sub-module
  4.4 Overall System Architecture
  4.5 Results
5 Conclusions
  5.1 Summary of contributions
  5.2 Energy-efficient 6T SRAM design
  5.3 Reconfigurable Step-up SC DC-DC Converter
  5.4 Future Work
A Energy Model of the 28nm FDSOI 128Kb SRAM Macro
  A.1 Dynamic Energy Consumption
  A.2 Leakage Energy Consumption
List of Figures
1-1 SRAM in embedded memory hierarchy [1].
1-2 General trend of cache size. [source: ISSCC 2013 Trends]
1-3 Scaling trends for SRAM bit-cell size and operating Vdd. [source: ISSCC 2013 Trends]
1-4 Conventional 6T SRAM bit-cell
1-5 UTBB FDSOI vs. Bulk body-biasing structure (shown for NMOS) [2]
2-1 Conventional SRAM array architecture and 6T bit-cell.
2-2 (a) 6T bit-cell in data retention mode, (b) Bit-cell flips when Vdd goes below data retention voltage
2-3 (a) 6T bit-cell during a read operation, (b) Waveforms showing a "read disturb" for a minimum sized bit-cell. Bit-cell flips since the disturbance at N1 is large enough to trip the inverter (PU2, PD2).
2-4 (a) 6T bit-cell during a write operation, (b) Waveforms during a write operation for two different γ-ratios (WPG/WPU = 1.25 and WPG/WPU = 1). Write failure occurs when the γ-ratio is not high enough to lower the potential of N2 below the VTRIP of the PU1-PD1 inverter.
2-5 (a) Schematic to evaluate SNM (b) Graphical representation of SNM. The noise voltage Vn shifts VTC1 vertically and VTC2 horizontally, until they intersect at only one stable point when Vn = VSNM.
2-6 Schematic setup to evaluate static WRM. Static WRM is defined as the difference between Vdd and VWL, at which internal nodes (N1 and N2) flip to write the new data.
2-7 Schematic setup to evaluate Dynamic Read Margin.
2-8 SNM and WRM dependence on Vdd
2-9 Conventional read assist techniques.
2-10 Conventional write assist techniques.
3-1 Cross-sectional view and circuit symbols of the LVT transistors used in the 6T SRAM design [3].
3-2 6T bit-cell with forward body-biasing applied during a write operation.
3-3 WRM improvement as a function of the applied forward bias voltage (VFBB), at Vdd=0.4V
3-4 Improvement in Write Margin by 1V FBB in the Vdd range of 0.4V-1V. The Vdd,min at 5.5σ is improved from 600 mV to 400 mV (worst process corner and temperature).
3-5 (a) 6T bit-cell during a read operation under DC forward body-bias (b) Delayed FBB to reduce read-stability issues.
3-6 Proposed layout of a single row, showing row-wise sharing of n-wells, BL sharing between adjacent columns and multiple WLs per row. (not to scale)
3-7 Circuit implementation of the proposed row-wise forward body-biasing technique (hFBB).
3-8 Dynamic read margin of the 6T bit-cell as a function of Vdd, for different values of NOBC (number of bit-cells per local BL).
3-9 Hierarchical bit-line structure used in this work to improve read-stability at low Vdd levels. The read path from the local BL to the global BL is also shown.
3-10 Prediction architecture used in this design.
3-11 Array architecture for the 28nm FDSOI 128Kb 6T-SRAM, which incorporates row-wise body-biasing and data prediction to reduce energy consumption.
3-12 Energy savings due to Vdd scaling using 1V dynamic FBB. The energy reduction with the proposed row-wise FBB implementation (hFBB) is compared to a conventional implementation, for two different read-to-write ratios.
3-13 Energy savings by using data prediction during a read operation at Vdd=400mV, as compared to a conventional 6T read.
4-1 Basic 2:1 step-up SC converter, along with its idealized 2-port model.
4-2 Reconfigurable step-up switched capacitor module.
4-3 Operation of the proposed and conventional topologies in 5/2 mode.
4-4 Simulated performance comparison of the ideal converter for the proposed and conventional topologies in 5/2 mode.
4-5 Operation of the converter in 2/1 and 3/2 modes.
4-6 MOS implementation of the switches.
4-7 Reconfigurable gate drive circuits for a sub-module.
4-8 Overall system architecture with die photo.
4-9 Measured performance of the converter.
List of Tables
4.1 Performance comparison with previous works
A.1 Array organization for both the implementations
Chapter 1
Introduction
With the tremendous increase in the usage of battery-operated portable electronics
and the advent of new and promising applications, such as biomedical monitoring and
wireless sensor nodes, the demand for low-power and energy-efficient circuits has been
increasing in modern Systems-on-a-Chip (SoCs). The energy efficiency of circuits directly
translates into a longer battery life, which is crucial for these applications.
1.1 Motivation for Low-voltage SRAM
Static Random Access Memories (SRAMs) are the most popular type of embedded
memory and one of the most critical building blocks in modern SoCs. SRAM has
been predominantly used for register files and L1-L3 cache memories in the embedded
memory hierarchy (Figure 1-1). This is primarily because SRAMs offer the
best access speed among embedded memory technologies [1]. Furthermore, SRAMs
are fully compatible with modern CMOS processes and operating
voltages, and hence can be easily integrated with logic circuits. With the continuous
scaling of CMOS technology and the integration of multiple processing cores on-chip,
the demand for memory capacity and bandwidth has considerably increased over the
years. Figure 1-2 shows the general trend of increasing cache size in modern
microprocessors, which can be as high as 54MB on a single die [4]. Therefore, SRAMs
account for a large fraction of the total power consumption of the chip. Hence,
low-power SRAM design is a very active research area. Power savings in SRAM are
particularly critical in battery-operated mobile and hand-held applications, where the
power budget is very constrained.
Figure 1-1: SRAM in embedded memory hierarchy [1].
Figure 1-2: General trend of cache size. [source: ISSCC 2013 Trends]
Dynamic Voltage Scaling (DVS) has been proven to be an effective way to reduce
the energy consumption of circuits [5, 6]. Decreasing the supply voltage (Vdd) provides
savings in the dynamic energy consumption (∝ Vdd²), as well as a reduction in the
leakage power consumption, at the expense of slower performance. Meanwhile, with the
scaling of device dimensions to the sub-65nm regime, the variation in transistor threshold
voltage (Vt) has become more severe. Since the SRAM bit-cell size shrinks aggressively
with every technology node, the effect of random Vt variation makes it extremely
challenging to reduce the Vdd of SRAMs while maintaining a sufficient stability margin for
the bit-cell. Figure 1-3 shows the recent trend in SRAM bit-cell size and the operating
Vdd. As seen from the figure, Vdd scaling has been essentially stagnant in sub-65nm
CMOS processes. Hence, new and improved read and write assist techniques
are being actively researched as design solutions, to reliably reduce the minimum
operating voltage (Vdd,min) of SRAMs. Additionally, newer transistor structures, such
as FDSOI [2, 7] and FinFET [8, 9], are also emerging as replacements for planar bulk
devices, to reduce device variations and further improve SRAM Vdd,min.
Figure 1-3: Scaling trends for SRAM bit-cell size and operating Vdd. [source: ISSCC
2013 Trends]
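The quadratic dependence of dynamic energy on Vdd mentioned above can be made concrete with a short calculation. The sketch below assumes the textbook switching-energy model E = C·Vdd² with the same switched capacitance at both voltages, and it ignores leakage. The function name and the 0.6V and 0.4V operating points are illustrative choices, not values taken from the energy model developed later in the thesis.

```python
def relative_dynamic_energy(vdd_new, vdd_ref):
    """Dynamic energy at vdd_new relative to vdd_ref, assuming E = C * Vdd^2
    with the same switched capacitance C at both supply voltages."""
    if vdd_new <= 0 or vdd_ref <= 0:
        raise ValueError("supply voltages must be positive")
    return (vdd_new / vdd_ref) ** 2


# Illustrative operating points (assumed): scaling Vdd from 0.6 V down to 0.4 V.
ratio = relative_dynamic_energy(0.4, 0.6)
print(f"relative dynamic energy: {ratio:.3f}")
print(f"dynamic energy saving:   {100 * (1 - ratio):.1f}%")
```

Under this simple model a 200mV reduction from 0.6V leaves about 44% of the original dynamic energy, which is why even a modest improvement in Vdd,min is attractive.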
1.2 Recent 6T SRAM designs in sub-45nm CMOS
The six-transistor (6T) bit-cell (Figure 1-4) has been the industry-wide preferred
choice for high-density SRAMs [10], since it provides the smallest cell area and has a
very compact and lithography-friendly "thin-cell" layout [11]. Figure 1-4 shows the
conventional 6T bit-cell, with the cross-coupled inverter (PU-PD) pair and the two
access (PG) devices. The 6T bit-cell is a ratioed circuit and there are conflicting
sizing requirements for the transistors when trying to improve both read and write
operations simultaneously. In addition, it uses transistors with close to minimum device
features, to provide high memory density. Hence, there is limited scope for improving
the bit-cell's noise margin by optimizing the transistor sizing alone. Therefore,
peripheral assist techniques are required [9] to reduce the failure probability of the
6T bit-cell and overcome the challenges to high yield.
Figure 1-4: Conventional 6T SRAM bit-cell
This section summarizes recent 6T SRAM works designed in sub-45nm CMOS processes,
in which the effect of device variation is more pronounced.
[10] demonstrates a 64Mb SRAM in a 32nm SOI process that works down to 0.7V,
using a bit-line (BL) regulation scheme during a read operation and a negative bit-line
(NBL) technique to improve the write operation. A 0.6V SRAM in a 28nm bulk
process was shown in [12], which uses a delayed word-line (WL) boosting scheme to
improve write-ability and a hierarchical BL architecture to improve read-stability. A
multi-step WL control coupled with a hierarchical BL structure is proposed in [13] to
improve the read-stability of a 40nm 2Mb 6T SRAM. [14] uses a dynamic forward
body-bias on the PU PMOS transistors of a 6T bit-cell to improve read margin.
A 75mV improvement in Vdd,min was achieved for a 153Mb SRAM designed in a
45nm bulk process. [15] uses a WL suppression scheme using replica cell transistors
and passive resistances to improve read-stability, for a 45nm 4.5Mb SRAM working
down to 0.7V. [16] uses a lower cell-VDD technique along with NBL to improve write
margin, for a 0.6V 256Kb 45nm SRAM. [17] implements a partially suppressed WL
scheme as read-assist and a BL-length-tracked NBL scheme as write-assist for a 112Mb
SRAM in a 20nm bulk process, to achieve a 200mV improvement in the SRAM Vdd,min.
[18] presents a 20nm 128Kb SRAM which achieves 0.6V operation and 20µW/MHz
active power consumption, with an interleaved WL and a hierarchical BL scheme.
[19] implements a charge-sharing scheme to reduce excessive BL discharge during a
read operation at low Vdd levels, which happens due to increasing random Vt variations.
Designed in a 40nm low standby-power technology, it achieves an energy consumption
of 13.8pJ/access/Mbit for a 1Mb SRAM, which is considerably lower than the
36pJ of [16], 46.7pJ of [15] and 50pJ of [13] (all normalized to 1Mb).
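To put "considerably lower" into numbers, the following sketch divides each normalized energy figure quoted above by the 13.8pJ of [19]. The dictionary layout and the helper name are illustrative; the values are only those given in the text.

```python
# Energy per access normalized to 1 Mb, in pJ, as quoted in the text.
reported_pj = {"[19]": 13.8, "[16]": 36.0, "[15]": 46.7, "[13]": 50.0}


def energy_factor(other_pj, reference_pj):
    """How many times more energy per access 'other' uses than 'reference'."""
    return other_pj / reference_pj


for ref in ("[16]", "[15]", "[13]"):
    factor = energy_factor(reported_pj[ref], reported_pj["[19]"])
    print(f"{ref} uses {factor:.2f}x the energy per access of [19]")
```

The quoted designs therefore consume roughly 2.6 to 3.6 times the energy per access of [19] when normalized to the same capacity.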
For 28nm and beyond, fully-depleted device structures, namely FinFETs and FDSOI,
have recently gained significant momentum as a way to continue CMOS scaling following
Moore's law [20], which has become very difficult to maintain using planar bulk
processes. A 128Mb 6T SRAM, designed in a 14nm FinFET technology, is presented in
[9], which operates down to 0.48V with a high-performance (HP) bit-cell. It implements
a technique to partially discharge the pre-charged BLs before WL is asserted,
to reduce read-noise injected into the bit-cell. It avoids the requirement to generate
a separate lower pre-charge voltage, which needs precise regulation ([10]) so as not to
disturb the write operation. Additionally, an NBL scheme is used to improve the write
margin for a high density (HD) bit-cell. [21] also implements an NBL scheme and a
lower cell-VDD (LCV) technique to reduce the Vdd,min of an HD bit-cell, designed in a
16nm FinFET technology.
[22] analyzes 6T SRAM design for a 28nm FDSOI process. It uses a single p-well
architecture, particular to FDSOI, with a high density bit-cell. Simulation results
suggest 0.65V operation with 128 bit-cells per column and no assist techniques.
1.3 Advantages of FDSOI technology
Fully Depleted Silicon On Insulator (FDSOI) offers excellent short-channel electrostatic control, reduced source/drain capacitances, lower leakage currents and significantly reduced random dopant fluctuations (RDF) as compared to a bulk process,
for 28nm and beyond [2, 7, 23]. The ultra-thin dielectric buried-oxide (BOX) layer
provides electrical isolation of the source and drain of the planar transistor, from its
well and substrate. FDSOI also features a back-plane (BP) doping (either 'n' or 'p'
type) underneath the BOX layer. This is independent of the transistor type (PMOS
or NMOS) and results in two distinct Vt flavors, Regular-VT and Low-VT [3].
Figure 1-5: UTBB FDSOI vs. Bulk body-biasing structure (shown for NMOS) [2]
The ultra-thin BOX layer enables a wide body-biasing range, in addition to improving
the body-biasing efficiency. As shown in Figure 1-5, in a bulk process, a
forward body bias (FBB) of only +300mV can be applied, so that the n+-diffusion
to p-substrate diode is not turned on. However, in FDSOI, an FBB of up to +3V can be
applied [2, 24], due to the BOX isolation of source/drain from the substrate. Furthermore,
due to its superior electrostatics, Ultra Thin Body and Buried-oxide (UTBB)
FDSOI exhibits a higher body-factor than its bulk counterpart, 85mV/V
vs. 25mV/V [2]. This provides the flexibility to efficiently apply body-bias to target
transistors to improve their performance (by FBB) or reduce their leakage (by reverse
body bias or RBB) [24, 25]. We extensively use these improved features of UTBB FDSOI
in this work, to achieve better performance and higher energy-efficiency.
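The practical effect of the body-factor and bias-range figures quoted above can be estimated with a first-order calculation. The sketch below assumes the threshold-voltage shift is simply the body-factor multiplied by the applied forward bias, with a constant body-factor over the whole range, which is a simplification. The names are illustrative and the numbers are only those quoted from [2].

```python
# Body-factor (mV of Vt shift per V of body bias) and usable FBB range (V),
# both taken from the figures quoted in the text.
BODY_FACTOR_MV_PER_V = {"UTBB FDSOI": 85.0, "bulk": 25.0}
MAX_FBB_V = {"UTBB FDSOI": 3.0, "bulk": 0.3}


def vt_shift_mv(technology, v_fbb):
    """First-order threshold-voltage reduction (mV) for a forward body bias v_fbb (V).

    Assumes a constant body-factor across the bias range (a simplification)."""
    if not 0.0 <= v_fbb <= MAX_FBB_V[technology]:
        raise ValueError(f"{v_fbb} V is outside the FBB range for {technology}")
    return BODY_FACTOR_MV_PER_V[technology] * v_fbb


print("UTBB FDSOI, 1 V FBB:", vt_shift_mv("UTBB FDSOI", 1.0), "mV")
print("bulk, 0.3 V FBB:    ", vt_shift_mv("bulk", 0.3), "mV")
```

Under these assumptions a 1V forward bias in UTBB FDSOI lowers Vt by about 85mV, whereas the largest usable bias in bulk yields less than 10mV.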
1.4 Thesis Contributions
This thesis primarily focuses on energy-efficient 6T SRAM design in an FDSOI process.
Forward body-biasing (FBB) is investigated as a write-assist technique to reduce the
operating voltage (Vdd) of the SRAM. Vdd scaling provides significant energy savings
by decreasing the dynamic energy consumption. Furthermore, data prediction is used
during a read operation, to obtain additional energy savings.
Chapter 3 presents a 128Kb 6T SRAM designed in a 28nm FDSOI process. The
SRAM uses FBB to improve write-ability at low Vdd levels. FBB is applied dynamically
(i.e. only during a write operation), so that the read-stability of the bit-cell is
not degraded. The dynamic FBB is implemented in a row-by-row manner, to reduce
the energy overhead associated with it. To enable row-wise dynamic FBB, a layout
technique is proposed to share the n-wells horizontally, across all the bit-cells in a
row. Next, a hierarchical bit-line architecture, which incorporates data prediction in the 6T
SRAM, is presented. A correct data prediction avoids the discharge of the long global
bit-lines, providing a significant reduction in dynamic energy consumption. Finally, the
energy savings obtained by Vdd scaling and by using data prediction are quantified
using an energy model for the SRAM, which is developed in Appendix A.
The second part of the thesis (Chapter 4) presents a switched-capacitor (SC)
based step-up DC-DC converter, which can be used to generate the body-bias voltage
for SRAMs. The reconfigurable step-up converter implements three conversion ratios of
5/2, 2/1 and 3/2. It provides a wide range of output voltage, from 1.2V to 2.4V,
from a 1V input. The converter has been designed to obviate the need for
high-voltage I/O transistors, which would otherwise have degraded the efficiency owing to
their higher on-resistance and parasitic capacitance. Furthermore, a new topology
is proposed for the 5/2 mode, which improves efficiency by reducing the bottom-plate
parasitic loss as compared to a conventional series-parallel topology [26]. The
converter has been implemented in a 28nm FDSOI process, using only on-chip MOS
and MOM capacitors, which do not require any extra fabrication steps, unlike MIM [27]
and trench [28] capacitors. Measurement results show that the converter achieves a
peak efficiency of 88% in the 2/1 mode.
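The benefit of having several conversion ratios can be illustrated with the standard idealized view of a switched-capacitor converter: the no-load output is the ratio times the input, and the efficiency cannot exceed Vout divided by that no-load voltage. The sketch below uses this bound only and ignores bottom-plate, switching and gate-drive losses. The policy of picking the smallest sufficient ratio is a hypothetical choice for illustration, not necessarily the control scheme used on the chip.

```python
from fractions import Fraction

# Conversion ratios named in the text.
RATIOS = [Fraction(3, 2), Fraction(2, 1), Fraction(5, 2)]


def pick_ratio(v_out, v_in=1.0):
    """Return the smallest ratio whose ideal no-load output (ratio * v_in) reaches v_out.

    Hypothetical selection policy used only for illustration."""
    for ratio in sorted(RATIOS):
        if float(ratio) * v_in >= v_out:
            return ratio
    raise ValueError("target output is above the highest no-load voltage")


def ideal_efficiency(v_out, ratio, v_in=1.0):
    """Upper bound on efficiency: v_out divided by the no-load output voltage."""
    return v_out / (float(ratio) * v_in)


for target in (1.2, 1.8, 2.4):
    r = pick_ratio(target)
    bound = ideal_efficiency(target, r)
    print(f"Vout = {target} V -> ratio {r}, efficiency bound {bound:.0%}")
```

With a single fixed 5/2 ratio, a 1.2V output would be bounded at 48% efficiency under this model, so reconfiguring to 3/2 for low outputs is what keeps the bound high across the 1.2V-2.4V range.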
Chapter 2
Background of 6T SRAM design
This chapter presents a brief overview of the basic 6T bit-cell operation, the associated
functional margin issues and the concept of assist techniques.
The six-transistor (6T) bit-cell has been the workhorse for SRAM design, owing
to its small cell area and compact layout, which result in high density memory arrays.
As shown in Figure 2-1, each row of bit-cells shares a common word-line (WL), while
a pair of bit-lines (BL and BLB) is shared by multiple bit-cells in a column. The
number of bit-line pairs (n) and the bit-width (m) of a single word determine the
column select ratio of n-to-m.
2.1 6T SRAM Bit-cell Operation
A conventional 6T SRAM bit-cell (shown in Figure 2-1) consists of two back-to-back
inverters (comprised of PU1, PD1 and PU2, PD2) and two access transistors (PG1
and PG2). The inverter pair is cross-coupled such that the output of one goes to
the input of the other, and vice-versa. The resulting positive feedback of the inverter
pair can hold the desired data (state "1" or "0") indefinitely at the internal nodes
(N1 and N2), as long as the SRAM is powered up and the access transistors are
turned off. Access transistors are only turned on during read and write operations,
to connect the internal data nodes to the bit-lines (BL and BLB).
Figure 2-1: Conventional SRAM array architecture and 6T bit-cell.
2.1.1 Data Retention
Figure 2-2: (a) 6T bit-cell in data retention mode, (b) Bit-cell flips when Vdd goes
below data retention voltage.
The 6T bit-cell is in data retention mode when the word-line (WL) is turned off
(Figure 2-2(a)). The cross-coupled inverter pair creates a positive feedback loop that
preserves the internal data nodes, without any disturbance from the bit-lines through
the pass-gate (PG) transistors. However, below a certain level of supply voltage, the
inverters can no longer hold the state and the internal data nodes might flip. This
minimum required level, known as the data retention voltage (VDRV), is typically
below the threshold voltage (Vt) of the transistors. As seen from the simulation
waveforms in Figure 2-2(b), the bit-cell can no longer retain its original state when
the cell supply voltage (Vdd) goes down to 150mV, which is less than the VDRV of
the bit-cell.
2.1.2 Read Operation
The read operation starts with pre-charging the bit-line pair (BL and BLB) to a
known voltage (typically Vdd). The bit-lines are then kept floating and the word-line
(WL) is asserted. Depending on the data stored, one of the bit-lines (BL or BLB)
starts discharging through the pass-gate (PG) and pull-down (PD) NMOS transistors,
connected in series (Figure 2-3(a)). The bit-line differential voltage is sensed by a
sense-amplifier to output the data.
Figure 2-3: (a) 6T bit-cell during a read operation, (b) Waveforms showing a "read
disturb" for a minimum sized bit-cell. Bit-cell flips since the disturbance at N1 is
large enough to trip the inverter (PU2, PD2).
During the read operation, the discharging current flows from the bit-line to the
cell ground, on the side of the bit-cell storing a "0". This leads to an increase in the
potential of the corresponding internal node (N1 in Figure 2-3(a)), and the amount
of disturbance depends on the drive strengths of the PG and PD NMOS transistors.
If this increased voltage goes above the trip-point (VTRIP) of the connected inverter,
the stored data is flipped. This event is known as a 'read disturb' and it is shown in
Figure 2-3(b). In order to prevent this, the PD NMOS needs to be stronger than
the PG NMOS. The ratio of their drive strengths is known as the β-ratio, which is
an important SRAM design parameter. Careful sizing of the NMOS transistors is
required to achieve the desired β-ratio, which ensures successful read operations (i.e.
without any 'read disturbs').
2.1.3 Write Operation
The write operation starts with driving the bit-lines to the data value to be written.
The WL is then asserted, turning-on the PG transistors.
Figure 2-4: (a) 6T bit-cell during a write operation, (b) Waveforms during a write
operation for two different γ-ratios (WPG/WPU = 1.25 and WPG/WPU = 1). Write
failure occurs when the γ-ratio is not high enough to lower the potential of N2 below
the VTRIP of the PU1-PD1 inverter.
If the data to be written is opposite to the previously stored state (as shown in
Figure 2-4(a)), the potential of the high internal node is lowered by an amount that
depends on the drive strengths of the pull-up (PU) PMOS and the PG NMOS transistors.
The ratio of the drive strengths of the PG and PU transistors is known as the γ-ratio and it
is another important parameter in SRAM design. The transistors need to be sized
carefully so that the γ-ratio is high enough to lower the potential of the high internal
node below the VTRIP of the connected inverter. As shown in Figure 2-4(b), a low
γ-ratio can lead to a write failure.
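The opposing demands that the read and write operations place on the pass-gate can be seen with a very crude resistive-divider picture. The sketch below assumes each conducting transistor behaves as a linear resistor inversely proportional to its drive strength, and it takes VTRIP as Vdd/2. Real devices are strongly nonlinear, so this is only a qualitative illustration, and all function names and strength values are hypothetical.

```python
def divider_node_voltage(vdd, ratio):
    """Voltage of the contested internal node, with both conducting transistors
    modeled as linear resistors (R ~ 1 / drive strength).

    'ratio' is the strength of the device pulling the node toward ground
    divided by the strength of the device pulling it toward Vdd."""
    return vdd / (1.0 + ratio)


def read_is_stable(vdd, beta, v_trip):
    # Read: PG pulls the "0" node up toward the pre-charged BL, PD holds it low.
    return divider_node_voltage(vdd, beta) < v_trip


def write_succeeds(vdd, gamma, v_trip):
    # Write: PU holds the "1" node high, PG pulls it down toward the low bit-line.
    return divider_node_voltage(vdd, gamma) < v_trip


# Assumed relative drive strengths (arbitrary units) for illustration only.
VDD, V_TRIP = 1.0, 0.5
S_PD, S_PU = 1.5, 1.0
for s_pg in (0.8, 1.25, 1.8):
    beta, gamma = S_PD / s_pg, s_pg / S_PU
    print(f"PG strength {s_pg}: read stable = {read_is_stable(VDD, beta, V_TRIP)}, "
          f"write ok = {write_succeeds(VDD, gamma, V_TRIP)}")
```

In this toy model a weak PG fails the write, a strong PG causes a read disturb, and only the intermediate strength satisfies both, which mirrors the conflicting sizing requirement of the ratioed 6T cell.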
2.2 6T SRAM Functional Margin Issues and the effect of Vdd Scaling
The three main SRAM operations discussed above, viz. data retention (or hold),
read and write, are characterized by their respective functional margins. In sub-65nm
CMOS technologies, the increasing threshold voltage (Vt) variation (due
to random dopant fluctuations) severely degrades the functional margins and limits the
minimum SRAM operating Vdd. In this section, the concepts of Static Noise Margin
(SNM), Write Margin (WRM) and Dynamic Read Margin (DRM) are discussed,
along with the effect of Vdd scaling on them.
2.2.1 Static Noise Margin
Static Noise Margin (SNM) is a widely used metric in SRAM design to characterize the
bit-cell's stability. SNM is the maximum amount of noise voltage (Vn) which can be
tolerated at both the inputs of the cross-coupled inverters (with opposite polarity),
while retaining the cell data (Figure 2-5(a)). In other words, SNM quantifies the
amount of noise voltage (Vn) required at the internal storage nodes of the bit-cell
in order to flip the cell's data. The noise source (Vn) models any static disturbance
arising out of Vt mismatches, variations in device geometries, noise injected from the
BL through the PG transistor, etc.
Figure 2-5(b) shows the graphical representation of the SNM in read mode (i.e.
when WL is turned-on). The two curves (VTC1 and VTC2) represent the voltage
transfer characteristics of the two inverters (PU1, PD1 and PU2, PD2). The two
curves intersect at three points: two stable states, representing the two possible
data stored (logic "1" or logic "0"), and one metastable state. The resulting two-wing
curve is known as the "butterfly curve", which is used to graphically determine the
SNM. The SNM is defined as the length of the side of the largest square (VSNM)
which can be fit inside the smaller wing of the butterfly curve.
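The series-noise definition of SNM can be checked numerically. The following is a minimal sketch, not the simulation flow used in this work: it assumes an idealized, symmetric tanh-shaped inverter VTC, and the names `inverter_vtc`, `retains_state`, `static_noise_margin` and the `gain` parameter are all illustrative. It bisects on the noise voltage Vn until the cross-coupled pair no longer retains its state.

```python
import math


def inverter_vtc(v_in, vdd=1.0, gain=10.0):
    """Idealized inverter VTC (assumed tanh shape, trip point at vdd/2).

    `gain` is the magnitude of the VTC slope at the trip point.
    """
    return 0.5 * vdd * (1.0 - math.tanh(2.0 * gain * (v_in - 0.5 * vdd) / vdd))


def retains_state(v_noise, vdd=1.0, gain=10.0, iterations=2000):
    """Return True if the cell still holds Q1 = "0", Q2 = "1" under noise.

    The two noise sources have opposite polarity: +v_noise in series with
    the input of the inverter driving Q2, -v_noise at the input of the
    inverter driving Q1.
    """
    q1, q2 = 0.0, vdd
    for _ in range(iterations):
        q2 = inverter_vtc(q1 + v_noise, vdd, gain)
        q1 = inverter_vtc(q2 - v_noise, vdd, gain)
    return q2 > q1


def static_noise_margin(vdd=1.0, gain=10.0, tol=1e-4):
    """Largest v_noise (within tol) for which the stored data is retained."""
    lo, hi = 0.0, 0.5 * vdd  # retained at lo, flipped at hi for this model
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if retains_state(mid, vdd, gain):
            lo = mid
        else:
            hi = mid
    return lo


snm_high_gain = static_noise_margin(gain=10.0)
snm_low_gain = static_noise_margin(gain=3.0)
print(f"SNM (gain 10): {snm_high_gain:.3f} V, SNM (gain 3): {snm_low_gain:.3f} V")
```

In this toy model a lower inverter gain gives a smaller SNM, and the result always stays below Vdd/2, the limit for an ideal infinite-gain inverter. A real bit-cell would replace `inverter_vtc` with simulated VTCs that include the PG transistor in read mode.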
The SNM is worse for a read operation as compared to a data retention mode. As
explained in section 2.1.2, during a read operation the voltage of the node storing a
"0" rises above the ground (GND) level, depending on the β-ratio. With the scaling
of device geometries to the nanometer regime, the variation in the device strengths and
hence in the β-ratio becomes more severe. This causes significant degradation of the
read-SNM. In addition, the local and global Vt variation of the PU and PD transistors
shifts the inverters' VTCs, causing further SNM degradation.
During data retention, the WL is turned off and hence there is no noise injected
from the BLs. Therefore, the SNM in this case (i.e. hold-SNM) is much better
than the read-SNM, and it is the read-SNM that typically limits the minimum
operating voltage of the 6T SRAM.
Figure 2-5: (a) Schematic to evaluate SNM (b) Graphical representation of SNM. The
noise voltage Vn shifts VTC1 vertically and VTC2 horizontally, until they intersect
at only one stable point when Vn = VSNM.
2.2.2
Write Margin
During a write operation, the bit-line pair, BL and BLB, are driven to the differential
levels of "0" and "1" and the word-line is turned-on (WL= "1"). For a successful
write, the PG NMOS has to pull-down the high internal node (storing a "1") below
the trip-point of the connected inverter. This depends on the relative strengths of
the PG and PU transistors, as explained in section 2.1.3. It also depends on the
WL pulse width. If the WL is not kept high for a sufficient amount of time, then
the write operation might fail, even though the bit-cell satisfies the required γ-ratio.
Hence, ideally, write margin (WRM) should be simulated as a dynamic condition.
[29] discusses the dynamic nature of write margin and compares it to various static
methods of evaluating write margin. It was concluded that the WL sweeping method
[30] provides the best estimate for static WRM, since it exhibits the best correlation
with dynamic write margin, especially at lower supply voltages.
Figure 2-6: Schematic setup to evaluate static WRM. Static WRM is defined as the
difference between Vdd and VWL, at which internal nodes (N1 and N2) flip to write
the new data.
Figure 2-6 shows the simulation setup to evaluate static WRM. The WLs are
swept together from 0 to Vdd to replicate a real write operation, in which the WL
drives both the PG transistors simultaneously. WRM is defined as the difference
between Vdd and the WL voltage at which the cell flips its original state.
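The WL sweeping procedure can be sketched as a simple post-processing loop. This is only an illustration under stated assumptions: `cell_flips_at` is a hypothetical callback standing in for a DC circuit simulation of the bit-cell, and the step size is arbitrary.

```python
def static_write_margin(cell_flips_at, vdd, step=0.005):
    """WL-sweep estimate of the static write margin (WRM).

    cell_flips_at(v_wl) is a hypothetical callback standing in for a DC
    simulation of the bit-cell with both PG gates at v_wl and the bit-lines
    held at the write data; it returns True once the cell has flipped.
    Returns vdd - v_wl at the first flipping step, or None if the cell never
    flips within 0..vdd (a static write failure).
    """
    n_steps = int(round(vdd / step))
    for i in range(n_steps + 1):
        v_wl = i * step
        if cell_flips_at(v_wl):
            return vdd - v_wl
    return None


# Toy stand-in for the simulator: the cell flips once the WL reaches 0.6 V.
wrm = static_write_margin(lambda v_wl: v_wl >= 0.6 - 1e-9, vdd=1.0)
never = static_write_margin(lambda v_wl: False, vdd=1.0)
print(wrm, never)
```

A larger returned value means the cell flips at a lower WL voltage, i.e. it is easier to write.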
2.2.3
Dynamic Read Margin
The Seevinck method [31] is the traditional way of characterizing the read margin. This
static method, explained in section 2.2.1, does not consider the dynamic effect of bit-line
discharge during a read operation and hence provides a pessimistic estimate of
the read margin. Recent works [12, 32] showed that reducing the number of bit-cells
(BCs) in a column can significantly improve the read margin. This is because,
with fewer BCs, the capacitance of the bit-lines (both device
and parasitic components) is reduced. Therefore, when the WL is turned on, the BL
discharges faster, which reduces the amount of charge injected into the internal node
storing a "0". Hence, this node is less likely to exceed the trip-point of the connected
inverter, preventing accidental flipping of the cell data.
Figure 2-7: Schematic setup to evaluate Dynamic Read Margin.
Figure 2-7 shows the simulation setup used to evaluate dynamic read margin
(DRM). The BL capacitances, extracted from the layout, are initialized to Vdd at
the beginning of the simulation. The DC noise voltage Vn is swept in consecutive
transient simulations, until the cell data is flipped. The maximum value of Vn which
does not cause a "read disturb" is defined as the DRM.
2.2.4
Effect of Vdd scaling on noise margins
In sub-65nm CMOS technologies, random Vt variation is exacerbated as the transistor
feature size is scaled down to the nanometer regime. As the channel length reduces to
tens of nm, it becomes extremely difficult to precisely control the doping in the channel
region. The effect of random Vt variation is more pronounced as Vdd scales down, since
the overdrive voltage (Vdd - Vt) of the transistors is reduced. Thus, even though the
required β and γ ratios are satisfied at higher Vdd's, they might not be sufficient for
every bit-cell when Vdd approaches Vt. This causes SRAM functional failure,
limiting the minimum achievable Vdd.
Figure 2-8: SNM and WRM dependence on Vdd.
Figure 2-8 shows the effect of reducing Vdd on SNM and WRM of a 6T bit-cell
that has been designed in a 28nm FDSOI process using regular-Vt transistors. As seen
from the figure, the μ/σ ratio (the mean of the margin distribution divided by its
standard deviation) for both SNM and WRM decreases with Vdd. However,
WRM exhibits a much stronger dependence on Vdd than SNM, especially at higher
voltages. A μ/σ of more than 5 is typically required for high yield in large
SRAMs.
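To see why a μ/σ of about 5 or more is needed, the following sketch relates μ/σ to array yield. It is an illustration under simplifying assumptions that are not taken from this work: the margin is Gaussian, a cell fails when its margin drops below zero, cells fail independently, and there is no redundancy or error correction. The function names and the 128Kb example size are chosen only for illustration.

```python
import math


def cell_fail_probability(n_sigma):
    """One-sided Gaussian tail: P(margin < 0) when mean/std = n_sigma."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))


def array_yield(n_sigma, n_cells):
    """Probability that all n_cells independent cells have a positive margin."""
    return (1.0 - cell_fail_probability(n_sigma)) ** n_cells


n_cells = 128 * 1024  # example array size (128Kb)
for n_sigma in (4.0, 5.0, 5.5):
    print(f"mu/sigma = {n_sigma}: yield = {array_yield(n_sigma, n_cells):.4f}")
```

Under these assumptions, a 4σ margin leaves a 128Kb array with a yield of only a few percent, whereas 5σ gives roughly 96%.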
2.3
Conventional Assist Techniques
The issues of functional margin degradation with Vdd scaling have been addressed
by using peripheral assist circuits to aid the read and write operations [33, 34].
[33] defines three modes of SRAM functional failure: read-ability, write-ability and
read-stability. Assuming a differential sense-amplifier (SA) based read operation for
6T bit-cells, a read-ability failure occurs if the BL differential voltage (when the SA
is triggered) is less than the offset voltage of the SA. A write-ability failure occurs
when the desired data cannot be written by the end of the WL pulse. Read-stability
failures can happen if the selected bit-cell (BC) data or the half-selected BC data
is accidentally flipped during a read or write operation, respectively. This section
summarizes common assist techniques used to improve read-ability, write-ability and
read-stability of 6T SRAMs.
2.3.1
Read Assists in Previous Works
Figure 2-9 shows the waveforms for the different read assist techniques used in previous
works. Most of these techniques aim at improving the β ratio by making the PG
NMOS weaker or the PD NMOS stronger.
Figure 2-9: Conventional read assist techniques.
For the word-line underdrive (WLUD) technique [17, 35, 15], the gate-bias of the
PG transistor is reduced, making it weaker than the PD transistor and improving
read-stability. However, a reduced gate-drive decreases the BL discharge current,
making read-ability worse.
A reduced BL pre-charge (PCH) level can help in improving read-stability without
degrading the drive strength of the PG device (assuming negligible effect of VDS).
The work in [10] demonstrated a yield increase from 5 to 5.7 sigma by pre-charging
BLs to approximately 70% of Vdd. This technique has the overhead of generating
and regulating the reduced PCH voltage. Precise regulation of the PCH voltage is
necessary, since a low PCH level can create a pseudo-write scenario, which
can overwrite the existing cell data.
Boosting the cell VDD makes the PD NMOS stronger than the PG device, improving
read-stability. Driving the cell Vss to a negative voltage level simultaneously
improves the strengths of the PG and PD devices. Hence the BL discharge
current is increased, improving read-ability. These techniques can be implemented in
a row-by-row [36] or column-by-column [18, 37] fashion.
2.3.2
Write Assists in Previous Works
Figure 2-10 shows the waveforms for the conventional write assist techniques, which
attempt to improve the γ ratio or weaken the bistability of the cross-coupled inverters.
Figure 2-10: Conventional write assist techniques.
The word-line boosting technique improves write-ability by increasing the gate-drive
of the PG device, making it stronger than the PU PMOS. However, this has a
detrimental effect on the read-stability of the half-selected bit-cells (i.e. bit-cells in the
selected row and unselected columns) in a column-interleaved array. [12] addresses
this issue by delaying the boosting phase with respect to WL turn-on. Hence, the
half-selected bit-cells have already started reading and the BL voltage has sufficiently
reduced when the WL boosting is applied. This technique incurs the area overhead
of generating the boosted WL voltage.
The negative BL technique [10, 15, 9, 16] increases the gate-drive (VGS) of the PG
transistor by reducing its source voltage and hence improves write-ability. However,
when the potential of one of the BLs is driven below GND, there is a non-zero VGS
across the PG devices in unaccessed rows. If the internal node of an unaccessed bit-cell
on this side is "1", then there is a chance of unintentionally over-writing that
cell's data. The non-zero VGS also results in increased leakage from the PG devices and
causes partial loss of the boost signal. This technique is also susceptible to voltage
overstress in the write path at higher Vdd values [10, 17].
'Cell-VDD collapse' [35, 16] and 'Cell-Vss boost' [38] techniques decrease the
strength of the cross-coupled inverter pair holding the data and hence improve
write-ability. However, the effect is much weaker [33] than that of the 'WL boost' and
'Negative BL' techniques, since the PG transistor's strength is not improved. Furthermore,
these techniques, when implemented column-wise, run the risk of violating the data
retention voltage for unaccessed bit-cells, which can cause accidental loss of cell data.
If they are instead implemented row-wise, the read-stability of the half-selected
bit-cells is degraded.
Chapter 3
6T SRAM design in 28nm FDSOI
This chapter focuses on a low-voltage, energy-efficient 6T SRAM design in a 28nm
FDSOI process. Forward body-biasing (FBB) is investigated as a write-assist technique,
to reduce the SRAM operating voltage and provide energy savings. The proposed
implementation of FBB significantly reduces its energy overhead as compared
to a conventional implementation. Furthermore, data prediction is incorporated in
the read path, to obtain additional energy savings.
3.1
Forward Body-Biasing (FBB)
Forward body-biasing of an NMOS device refers to applying a positive body-to-source
voltage (VBS > 0V) across it. This reduces the threshold voltage of the NMOS, since
the body terminal acts like a second gate [2]. FDSOI offers the unique feature of
applying FBB on NMOS devices without the need for a triple-well structure, which
would be required in a bulk CMOS process. This is possible because the source/drain
of the transistor is electrically isolated from the well/substrate by a
buried-oxide (BOX) layer. The ultra-thin BOX layer also improves the body-biasing
efficiency [2, 7] as compared to PDSOI, which features a thicker BOX layer. In this
work, we use the LVT flavor of the FDSOI transistors [3], which is characterized
by NMOS devices on n-well and PMOS devices on p-well, as shown in Figure 3-1.
Since the n-well bias (GNDS) is controlled independently, the NMOS devices can be
selectively forward body-biased, reducing their threshold voltage. The PMOS devices
are already in FBB mode, since their body terminal is at 0V (same as the p-substrate).
Figure 3-1: Cross-sectional view and circuit symbols of the LVT transistors used in
the 6T SRAM design [3].
3.1.1
FBB as Write Assist
Figure 3-2 shows the 6T bit-cell with FBB applied on NMOS devices, during a
write operation. FBB decreases the threshold voltage (Vtn) of the NMOS transistors.
Hence, the NMOS access transistor (PG2) becomes stronger than the PMOS
pull-up transistor (PU2). This helps in lowering the potential of the high internal
node (N2), and therefore improves write-ability. An alternate way of improving
write-ability, reverse body-biasing (RBB) the PMOS, was not chosen. This is
because the PU PMOS devices are already sized to be weaker than the PG NMOS
devices and hence weakening them further has a much smaller effect on
write-ability. Furthermore, a stronger PG NMOS helps to improve the write-speed
as well.
It may be noted that applying FBB to the PG NMOS writing a "0" is sufficient.
However, as explained later, it is preferable to share the n-wells (GNDS1 and GNDS2)
row-wise. Since a "0" can be written from either side, FBB has to be applied to both
PG1 and PG2 to ensure write-ability improvement for all the bit-cells in a selected row.
Applying a bias to both the n-wells in a bit-cell does not degrade
write-ability. FBB affects both the PG and PD NMOS devices on the side of the
bit-cell storing a "0". This is because the NMOS devices share a common n-well and
have a source voltage (Vs) close to zero. Hence, the VBS applied is the same for both,
which results in approximately equal Vt modulation.
Figure 3-2: 6T bit-cell with forward body-biasing applied during a write operation.
Figure 3-3 shows the improvement in write margin (WRM) at Vdd = 0.4V, as a
function of the forward bias voltage (VFBB) applied. As seen from the figure, the
μ/σ of the write margin improves linearly with VFBB (μ and σ are the mean and
standard deviation, respectively, of the write margin distribution). This is primarily
because of the linear dependence of the NMOS threshold voltage (Vtn) on the applied
body-bias. It can be verified from the figure that applying FBB to all the NMOS
devices in the bit-cell does not degrade WRM. In fact, there is a slight improvement
in WRM as compared to the case when FBB is only applied to the NMOS devices
on the side of the bit-cell writing a "0". Figure 3-3 also suggests that, depending on
the desired μ/σ of the WRM, the appropriate body-bias voltage can be chosen. In
this work, a VFBB of 1V is chosen, which provides a worst-case μ/σ of 5.5 even at a
Vdd of 0.4V. The requirement of a separate circuit to generate the body-bias voltage
is eliminated, since 1V is the nominal supply voltage of the process [39] and readily
available on-chip.
Figure 3-3: WRM improvement as a function of the applied forward bias voltage
(VFBB), at Vdd=0.4V (SF corner, -40°C), for no FBB, 1-sided FBB and 2-sided FBB.
Figure 3-4: Improvement in Write Margin by 1V FBB in the Vdd range of 0.4V-1V.
The Vdd,min at 5.5σ is improved from 600 mV to 400 mV (worst process corner and
temperature).
Figure 3-4 shows the improvement in the write margin as a function of the supply
voltage (Vdd), with the body-bias voltage (VFBB) kept at a constant value of 1V. It
can be seen from the figure that both the mean value (μ) and the 5.5σ value of the
write margin are consistently improved in the entire Vdd range of 0.4V to 1V. The
conventional 6T bit-cell works down to 600mV with a μ/σ = 5.5, without any write
assists. As seen from the figure, a 1V forward body-bias can reduce the Vdd,min for
the write operation to 400mV, while maintaining a μ/σ of 5.5. The 200mV reduction
in Vdd,min provides significant benefits in terms of the SRAM energy consumption.
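As a rough illustration of this benefit, assume that the dynamic energy per access scales as C·Vdd² with the switched capacitance C unchanged, and ignore leakage and the n-well switching overhead discussed in section 3.2. These are simplifying assumptions, not results of this work. Lowering Vdd from 0.6V to 0.4V then gives:

```latex
\frac{E_{dyn}(0.4\,\mathrm{V})}{E_{dyn}(0.6\,\mathrm{V})}
  \approx \left(\frac{0.4}{0.6}\right)^{2} = \frac{4}{9} \approx 0.44
```

That is, roughly a 56% reduction in dynamic energy under these assumptions.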
3.1.2
Read-Stability Issues and Dynamic FBB
If the FBB technique described above is implemented in a static manner, the
read-stability of the bit-cell is degraded. As shown in Figure 3-5(a), the primary reason
for degraded read-stability is the lowering of the trip-point (VTRIP) of the inverter
storing a "1". This is because the threshold voltage of the pull-down NMOS (PD2)
is reduced by FBB. Hence, it is easier for the disturbance at the low internal
node (N1) to flip the inverter. The degradation of read-stability with FBB is more
prominent for higher body-bias voltages, which are required for a successful write
operation at lower Vdd levels.
Figure 3-5: (a) 6T bit-cell during a read operation under DC forward body-bias (b)
Delayed FBB to reduce read-stability issues.
This problem can be mitigated by delaying the forward body-biasing with respect
to the WL rising edge. As seen from Figure 3-5(b), if the body-biasing is applied after
a delay ΔT, the VTRIP of the inverter is not lowered during that time. By the time the
bias is applied, the BL has already started discharging and the noise injected into the
"0" internal node is smaller.
This motivates a dynamic implementation of the FBB technique (D-FBB), which
can take advantage of the full body-biasing voltage range to improve write-ability,
without compromising read-stability. Furthermore, as the leakage of the bit-cell
increases when FBB is applied, a huge leakage power penalty would be incurred if a
DC FBB were applied to the whole memory array.
3.2
Energy-efficient Implementation of D-FBB
A dynamic implementation of the FBB technique (D-FBB), however, has its own
share of power overhead due to n-well switching. For the conventional "thin-cell"
6T layout [11, 40], the n-wells are shared vertically with other bit-cells in a column. Therefore, to apply body-biasing to a selected bit-cell, the entire n-well of the
corresponding column needs to be charged up [40]. This translates into a significant
capacitive-switching power overhead, since it scales with both the number of rows and
columns. The n-well switching power can limit the benefit of Vdd scaling achieved
using dynamic body-biasing. To address this issue, an alternate layout technique is
proposed in this work, which shares the n-wells horizontally across all the bit-cells
in a row. The benefit of this technique stems from the fact that only one row in
the memory array is accessed at a time. Hence, body-biasing can be applied to only
the two n-wells in a selected row. This significantly reduces the amount of n-well
capacitance switched per write cycle, since it is only dependent on the number of
columns. Hence, the proposed technique of sharing the n-wells horizontally provides
a more energy-efficient way of implementing dynamic body-biasing for 6T SRAMs.
Figure 3-6 shows the proposed layout technique, which shares the n-wells horizontally
across all the bit-cells in a row. The conventional "thin-cell" [11] layout is
used for the 6T bit-cell. The "thin-cell" layout has an aspect ratio of approximately
3:1. This implies that a bit-line, which is now routed along the longer bit-cell
dimension, has approximately 3 times more parasitic metal-routing capacitance (CM,par)
as compared to a conventional implementation. This can lead to a 3X increase in BL
switching power. In this work, this issue has been addressed in two ways: (1) sharing
the bit-lines between adjacent columns and (2) routing multiple word-lines for each row.
Figure 3-6: Proposed layout of a single row, showing row-wise sharing of n-wells, BL
sharing between adjacent columns and multiple WLs per row. (not to scale)
(1) Traditionally, the bit-line (BL) diffusion contacts are shared with neighboring
bit-cells in a column, reducing the effective cell area. However, this is not possible
in the present scenario, since the diffusion regions run horizontally. Hence, the BL
diffusion contacts are shared between bit-cells in adjacent columns, so that the effective bit-cell area is not increased and the lithography-friendly layout structure [11]
is maintained. Sharing BLs between two adjacent columns also reduces the parasitic
metal routing capacitance per bit-cell by a factor of 2. Hence, the effective increase
in CM,par/bit-cell is 1.5X compared to a conventional implementation (as opposed to
3X).
(2) Since two adjacent columns now share a BL, it is necessary to have at least
two word-lines (WL) for each row. This is to ensure that two adjacent bit-cells in a
selected row do not simultaneously drive a single BL. In this implementation, 4 WLs
are routed for each row, taking advantage of the longer cell-height. The 2 extra WLs
help in reducing the unnecessary BL discharge in half-selected columns. Therefore,
the effective number of BLs switching per cycle is reduced by a factor of 2.
Due to these layout optimizations, the BL switching power is actually reduced to
0.75X (= 1.5X/2) of that of a conventional implementation, providing
further energy savings.
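The factors quoted above can be combined in a few lines. This is only a bookkeeping sketch of the numbers stated in the text, relative to a conventional implementation; it considers the parasitic metal-routing component alone, and the variable names are illustrative.

```python
from fractions import Fraction

# All factors are relative to a conventional (vertically routed BL) layout.
routing_cap_factor = Fraction(3)     # BL routed along the ~3x longer cell dimension
bl_sharing_factor = Fraction(1, 2)   # one BL shared by two adjacent columns
half_select_factor = Fraction(1, 2)  # extra WLs halve the BLs discharged per cycle

cap_per_cell = routing_cap_factor * bl_sharing_factor
bl_switching = cap_per_cell * half_select_factor
print(cap_per_cell, bl_switching)  # 3/2 3/4
```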
4 metal layers are used for routing the different signals. Metal-2 (M2) is used
for the cell Vdd and GND, metal-3 (M3) is used for bit-lines and finally, WLs are
routed in metal-4 (M4). The proposed implementation (h6T) incurs a 2.5% increase
in the effective cell-area, due to non-overlapping WL contacts between adjacent rows.
Normal logic design rules are used for layout.
Figure 3-7: Circuit implementation of the proposed row-wise forward body-biasing
technique (hFBB).
The proposed technique of row-wise forward body-biasing (hFBB) is implemented
by the circuit shown in Figure 3-7. During a write operation (WrEn = "1"), the
word-line (WL) of the selected row triggers the level shifter to pull the node Nx to
ground. Hence, the output node Nbb, which is connected to the n-well of the selected
row, is charged up towards VFBB. The bias voltage Vb can be chosen to be 0V for
full-swing body-biasing at every Vdd. Alternatively, it can be connected to Vdd, so
that the n-well node is charged up slowly at higher Vdd levels, when body-biasing is
not required. As explained before, the delay after the WL is necessary to eliminate
read-stability issues in half-selected bit-cells.
3.3
Hierarchical BL structure and Data Prediction
Recent works [12, 18, 32] have shown the advantages of the hierarchical bit-line (HBL)
scheme in improving the read-stability and read-ability of 6T SRAMs. In HBL, a
small number of bit-cells are connected to a local bit-line (BL) pair. The signal
developed on the local BLs is transferred to the long global BLs, which are
used to finally read the data. The local BLs have a significantly lower capacitance,
due to the reduced number of access transistors connected to them and the reduced
parasitic metal routing capacitance. Therefore, they can discharge much faster during
a read operation and hence inject less noise through the access transistor into the
"0" internal node. This significantly improves the read-stability of the bit-cell, as
compared to the conventional non-hierarchical architecture, with a higher number of
bit-cells per BL.
3.3.1
Dynamic Read Margin
Figure 3-8: Dynamic read margin of the 6T bit-cell as a function of Vdd, for different
values of NOBC (number of bit-cells per local BL).
In this work, a hierarchical BL structure is used, with 64 cells per local bit-line.
This translates to 32 physical rows, since a local BL is shared between adjacent
columns in this implementation. Figure 3-8 shows the effect of the number of bit-cells
per local BL (NOBC) on the dynamic read margin (DRM). 64 bit-cells per LBL were
chosen, so that more than 5-sigma DRM is achieved, even at a Vdd of 0.4V and the worst
process-corner (FS). It can also be seen from Figure 3-8 that the static method of
read-margin simulation (SNM) provides a considerably pessimistic estimate. This
is because the SNM method does not account for the effect of BL discharge on the
read-noise injected into the "0" internal node of a bit-cell.
3.3.2
Hierarchical Read Path
Figure 3-9 shows the hierarchical BL structure used in this implementation. In a
sub-array, the local bit-lines (LBL and LBLB) are connected across 32 rows and
shared with the adjacent column (not shown in the figure).
Figure 3-9: Hierarchical bit-line structure used in this work to improve read-stability
at low Vdd levels. The read path from the local BL to the global BL is also shown.
During a read operation, one of the local bit-lines (LBL and LBLB) starts discharging,
depending on the data stored in the selected bit-cell. The signal development
on the local BLs is sensed with a pair of local inverters and pull-down NMOS
devices, connected to the global bit-lines (GBL and GBLB). To improve read-access
time, the local inverters are designed to favor a "0" to "1" output transition. Large-signal
sensing is used for the local BLs to reduce the area-overhead of the local sensing
circuit. On the other hand, small-signal differential sensing is used for the global BLs.
This is because the global BLs are connected across multiple local arrays and have
significantly higher capacitance (due to long metal routing). Hence, they discharge
slower than local BLs, and therefore small-signal sensing is more suitable. The
GWL signal is used to turn off the GBL discharge when the global sense-amplifier is
enabled.
PMOS devices (not shown in the figure) are used to pre-charge all the local BLs,
global BLs and other nodes in the local sensing circuitry to Vdd at the beginning of a
read operation.
3.3.3
Using Data Prediction in 6T SRAM
Application specific features can provide interesting data properties, which can be
exploited to design a more tailored SRAM. [41] proposed a 10T SRAM bit-cell which
uses prediction of data to reduce bit-line switching power, during a read operation. It
was targeted specifically towards motion-estimation in video processing applications.
In motion estimation, the pixels from a small block of a video frame (reference buffer)
are stored in the SRAM array and used in consecutive read cycles, before they are
overwritten. The correlation of the pixel data, stored in the reference buffer, can be
exploited to predict the data during a read operation, using previously read values.
If the prediction matches the actual data, the bit-line (BL) pair is not discharged.
Thus, depending on the prediction accuracy, the BL switching power can be reduced.
This can provide significant energy savings, since BL switching constitutes a major
portion of the overall SRAM power consumption. This work extends the concept of
data prediction to 6T SRAM arrays. Instead of incorporating data prediction in the
bit-cell, as done in [41], this work uses data prediction at the local array level. Thus,
we get the area advantage of using a smaller 6T bit-cell as compared to a 10T design,
while saving BL switching power.
Figure 3-10 shows the architecture implementing the prediction scheme for a 6T
SRAM array. Two extra transistors (Np1, Np2) are added at the local sub-array level.
They control the signal development at the 'int1' and 'int2' nodes, driving the local
sensing inverters. All the internal nodes are pre-charged to Vdd before a read operation
(using PMOS transistors, not shown in the figure). Let us assume the data to be read
is "0". Hence, the LBL discharges to ground during a read operation, while LBLB
stays at Vdd. If the prediction is correct, i.e. Pred = "0" and PredB = "1", Np1 is
turned off and the discharge of the LBL is not transferred to the 'int1' node. Thus,
both 'int1' and 'int2' nodes stay at the pre-charged level of Vdd. Hence, neither of
the global bit-lines (GBL, GBLB) is discharged. The global sense-amplifier (SA)
outputs a "1" and the correct prediction value, Pred = "0", is chosen as the read
output data. On the other hand, if the prediction is incorrect, i.e. Pred = "1" and
PredB = "0", Np1 is turned on and the discharge of the LBL is transferred to the
'int1' node. Hence, GBL starts discharging and the global SA senses this, to output a
"0". Therefore, PredB (= "0") is chosen as the read output (since the prediction was
incorrect). Thus, in either case, the correct value of the data (= "0" in this example)
is obtained at the output. However, if the prediction was correct, the discharge of
the global bit-line was avoided, which translates into dynamic power savings.
Figure 3-10: Prediction architecture used in this design.
Although the local BL switching is not affected in this technique, the dominant
component of the switching power, which is due to the global BLs, can be reduced
by using data prediction.
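The read-with-prediction behavior described in this section can be summarized by a small behavioral model. This is a sketch, not the circuit: the text only walks through the case of reading a "0", so the handling of Np2 (gated by PredB, passing the LBLB discharge to 'int2') is assumed here by symmetry, and the function name is illustrative.

```python
def predicted_read(data, pred):
    """Behavioral model of one read with data prediction (bits are 0 or 1).

    Returns (dout, gbl_discharged).
    """
    pred_b = 1 - pred
    # Reading a 0 discharges LBL; reading a 1 discharges LBLB.
    lbl_low = (data == 0)
    lblb_low = (data == 1)
    # Np1 (gate = Pred) passes the LBL discharge to int1.
    # Np2 (gate = PredB, assumed by symmetry) passes the LBLB discharge to int2.
    int1_low = lbl_low and pred == 1
    int2_low = lblb_low and pred_b == 1
    gbl_discharged = int1_low or int2_low
    # The global SA outputs 1 when neither global bit-line discharges.
    sa_out = 0 if gbl_discharged else 1
    # A SA output of 1 selects Pred, otherwise PredB is selected.
    dout = pred if sa_out == 1 else pred_b
    return dout, gbl_discharged


for data in (0, 1):
    for pred in (0, 1):
        print(data, pred, predicted_read(data, pred))
```

In this model the output always equals the stored data, and a global bit-line discharges only when the prediction is wrong, which is where the switching energy is saved.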
3.4
Overall Array Architecture
Figure 3-11 shows the overall array architecture for the 128Kb SRAM, designed in
the 28nm FDSOI technology. Each 64Kb block, of 256 rows by 256 columns, consists
of 8 local arrays. A 4:1 column interleaving ratio is implemented to obtain a 64-bit
output data.
Figure 3-11: Array architecture for the 28nm FDSOI 128Kb 6T-SRAM, which incorporates
row-wise body-biasing and data prediction to reduce energy consumption.
The local array consists of 6T bit-cells (h6T) arranged in 32 rows by 256 columns.
As shown in Figure 3-11, for a group of 4 columns, there are 3 local BLs and 2 local
BLBs. This is because, in this implementation, the local bit-lines are shared between
adjacent columns. Therefore, 2 bits (b[1:0]) are required for column multiplexing,
to get a local BL pair (mLBL[0], mLBLB[0]). As explained in section 3.2, each row
has 4 word-lines (WLA, WLB, WLC and WLD), only one of which is asserted based
on the row-decoder's outputs and the column interleaving bits. The selection of a 4:1
column interleaving ratio and the ability to route 4 WLs per row eliminate the
half-select issue in this architecture. Hence, dynamic forward body-biasing can be applied
for the selected row, as a write assist technique, without any read-stability issues.
The hFBB circuit implements the row-wise dynamic body-biasing. The n-wells are
shared horizontally, across all the 256 bit-cells in a row.
The local R/W circuitry, shared between two local arrays, consists of inverter-based
large-signal sensing for read and the prediction logic, as explained in section 3.3. In
addition, the local write is implemented by the pull-down NMOS devices, controlled
by data[0] and dataB[0]. Although not shown in the figure, data[0] and dataB[0]
are locally generated, during a write operation, from the data on the global BLs,
whereas they are driven to "0" during a read operation. The pair of cross-coupled
PMOS devices, connected to a local BL pair, maintains a differential signal level on
the local BLs during a write operation.
All the global signals (including the global BLs and Pred, PredB lines) are driven
by the global R/W circuitry, which also incorporates small-signal sensing for the global
BLs, during a read operation. The prediction generator circuitry is similar to the one
described in [41].
3.5 Energy Savings
In this section, we estimate the energy consumption of the proposed hFBB 6T SRAM,
which implements row-wise dynamic body-biasing. This is compared to the conventional implementation [40], with column-wise n-well sharing. We evaluate the energy
savings achieved due to Vdd scaling and by using data prediction. The energy consumption model of the SRAM, used in this section, is described in Appendix A.
Energy estimations are done at the typical (TT) process corner and 25°C temperature,
with various capacitances extracted from a local array's layout.
3.5.1 Due to Vdd Scaling
The bit-cell used in this design can work down to a Vdd of 600mV, which is limited by
the write operation. The dynamic forward body-biasing technique, used as a write-assist, improves the Vdd,min by 200mV. This translates into a significant reduction in
the dynamic energy consumption of the SRAM macro (since it is roughly proportional
to Vdd²). However, the leakage energy consumption is increased, due to a reduced
frequency of operation (fsw) at Vdd = 400mV.
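As a rough illustration of the quadratic dependence mentioned above, the following sketch computes the ratio of dynamic energies at the two supply voltages. It assumes a pure Vdd² scaling and ignores leakage and body-biasing overhead, so it is not expected to reproduce the macro-level numbers.

```python
def dynamic_energy_ratio(vdd_new: float, vdd_old: float) -> float:
    """Ratio of dynamic energies, assuming E_dyn is proportional to Vdd^2."""
    return (vdd_new / vdd_old) ** 2


ratio = dynamic_energy_ratio(0.4, 0.6)
reduction_percent = (1.0 - ratio) * 100.0
print(f"ratio = {ratio:.3f}, reduction = {reduction_percent:.1f}%")
```

Under this idealized assumption the dynamic component alone would drop by a little over half. The overall savings per access are smaller because the leakage energy increases at the lower operating frequency and the body-biasing itself costs energy.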
The normalized average energy per access (Eavg/acc.) for the 128Kb SRAM macro
is shown in Figure 3-12, at Vdd = 600mV and 400mV, for two different read-to-write
ratios (R/W). For a R/W of 1:1, a 38% reduction in Eavg/acc. is achieved, due to
a 200mV reduction in the SRAM Vdd,min. The improvement is maintained even for a
higher R/W ratio of 5:1, where we get 36% energy savings due to Vdd scaling. At a
high R/W ratio of 5:1, the contribution from GBL switching is smaller. This is because
the number of write operations (which involve a full swing of the GBL) is reduced.
Hence, the energy savings from Vdd scaling are slightly smaller.
Next, we compare the energy savings with the body-biasing technique implemented in two different ways. The proposed technique (hFBB) of row-wise n-well
sharing incurs less energy overhead than the conventional implementation [40], in which n-wells are shared column-wise. This is because, during every
write cycle, the n-well of only one selected row needs to be charged up, instead of
switching the n-wells for multiple selected columns (as required in the conventional
implementation). As seen from Figure 3-12, for a R/W ratio of 1:1, the proposed
hFBB implementation results in 38% energy savings, while the conventional implementation, in fact, increases Eavg/acc. by 20%. The proposed technique outperforms
the conventional implementation in terms of Eavg/acc., even at a higher R/W ratio
of 5:1.

[Figure 3-12 plot: normalized Eavg/acc., broken down into E_leak, E_SA, Edyn_WL, Edyn_BB, Edyn_GBL and Edyn_LBL, for w/o assist (Vdd = 0.6V), hFBB (Vdd = 0.4V) and conv. BB (Vdd = 0.4V). Annotations: -38% (hFBB) and +20% (conv. BB) at R/W = 1:1; -36% (hFBB) and -7% (conv. BB) at R/W = 5:1.]
Figure 3-12: Energy savings due to Vdd scaling using 1V dynamic FBB. The energy
reduction with the proposed row-wise FBB implementation (hFBB) is compared to
a conventional implementation, for two different read-to-write ratios.
3.5.2 Using Data Prediction
Data prediction is used during a read operation, to reduce the global bit-line (GBL)
switching, which constitutes a significant portion of the overall SRAM energy consumption. Figure 3-13 shows the normalized read energy per access (Eread/acc.) at
Vdd = 400mV, as a function of the percentage of correct prediction.

[Figure 3-13 plot: normalized Eread/acc., broken down into E_leak, E_pred, E_SA, Edyn_WL, Edyn_GBL and Edyn_LBL, without prediction and for 0%, 25%, 50%, 75% and 100% correct prediction; x-axis: Percentage of Correct Prediction. Annotation: -35%.]
Figure 3-13: Energy savings by using data prediction during a read operation at
Vdd = 400mV, as compared to a conventional 6T read.

The conventional 6T SRAM (without prediction) uses differential sensing for the
global bit-lines, during a read operation. In contrast, a single-ended sensing
scheme is required when the read path involves data prediction. Assuming the same
sense-amplifier (SA) is used in both schemes, the global bit-lines need to discharge
approximately twice as much when a single-ended sensing scheme is used. Therefore, when the
prediction is correct less than 50% of the time, Eread/acc. is actually higher
than for the conventional 6T read. However, energy savings are obtained when the
prediction is correct more than 50% of the time. At Vdd = 400mV, a 35% (or 1.54X) reduction in
Eread/acc. is achieved with 100% correct prediction. It must be noted that this 35%
reduction is in addition to the energy savings achieved by Vdd scaling to 400mV.
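The 50% break-even point follows from the doubled GBL swing. The sketch below models only the GBL component of the read energy, normalized to the conventional differential read; the function name is illustrative, and the other components (local BLs, SA, WL, leakage, prediction overhead) are deliberately left out.

```python
def gbl_read_energy_ratio(p_correct: float) -> float:
    """GBL read energy with prediction, normalized to a conventional read.

    Assumes the GBL swing is twice the conventional one (single-ended
    sensing) and that the GBL discharges only on a misprediction.
    """
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must lie in [0, 1]")
    single_ended_swing_factor = 2.0
    return (1.0 - p_correct) * single_ended_swing_factor


for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"{p:.2f} -> {gbl_read_energy_ratio(p):.2f}")
```

The ratio crosses 1.0 at exactly 50% correct prediction. The 35% figure quoted above refers to the total read energy per access, which also contains components that prediction does not remove.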
Chapter 4
Reconfigurable Body-Bias
Generator in 28nm FDSOI
This chapter presents a switched-capacitor (SC) based step-up DC-DC converter,
which can be used for SRAM body-biasing. The reconfigurable converter implements
3 step-up conversion ratios of 5/2, 2/1 and 3/2, to provide a wide range of output
voltage. The step-up converter has been designed to obviate the need for high
voltage I/O transistors (as charge-transfer switches), which otherwise would have
degraded the efficiency owing to their higher Ron and capacitance. Additionally, a
new topology is proposed for the 5/2 mode, which improves efficiency by reducing
the bottom-plate parasitic loss as compared to a conventional series-parallel topology
[26]. A brief overview of SC converters is presented in section 4.1, followed by the
detailed description of the designed reconfigurable step-up converter.
4.1 Brief overview of SC converters
Switched-capacitor (SC) power converters are a type of DC-DC converter that
uses only switches and capacitors to efficiently convert one voltage to another. Since
they do not require bulky inductors, SC converters are ideally suited for on-chip
implementations. Step-up SC converters (i.e. Vout > Vin) have been traditionally used
in integrated circuits to provide the programming voltage for FLASH [42] and other
non-volatile memories [43]. They also find use in energy harvesting applications [44].
Figure 4-1 shows a simple 2:1 SC step-up converter, which uses 4 switches (S1 - S4)
and 1 charge-transfer capacitor (Cf). SC converters generally operate in 2 phases.
As shown in the figure, during phase Φ1, Cf gets charged from the input voltage
(Vin), while in the other phase (Φ2), it transfers the stored charge to the output. An
idealized 2-port model of the SC converter is also shown in the figure. It consists
of an ideal transformer, which represents the no-load conversion ratio, and an output
resistance (ROUT), which represents the load-current-dependent voltage drop (due to
charging and discharging of Cf every cycle). ROUT depends on the topology of the
converter and also on the switching frequency (fsw) of the converter. A more detailed
analysis of SC converters can be found in [45].
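The idealized 2-port model described above can be written as a one-line relation: the loaded output voltage is the no-load voltage minus the drop across ROUT. The sketch below is illustrative only; the function name and the numeric value of ROUT are assumptions, not values from this design.

```python
def sc_output_voltage(v_in: float, ratio: float, i_load: float, r_out: float) -> float:
    """Output voltage of the idealized 2-port SC converter model.

    ratio  : no-load conversion ratio of the ideal transformer (e.g. 2.0)
    r_out  : output resistance in ohms (hypothetical value in the example)
    """
    return ratio * v_in - i_load * r_out


# Hypothetical example: 2/1 mode, 1V input, 100 uA load, ROUT = 1 kOhm.
v_out = sc_output_voltage(v_in=1.0, ratio=2.0, i_load=100e-6, r_out=1e3)
print(f"Vout = {v_out:.2f} V")
```

With zero load current the model returns the no-load voltage, which is why the conversion ratio alone sets the upper end of the output range in each mode.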
Figure 4-1: Basic 2:1 step-up SC converter, along with its idealized 2-port model.
4.2 Reconfigurable Step-up SC Module
Figure 4-2 shows the switch-level schematic of a single module of the reconfigurable
step-up converter. A module comprises two identical sub-modules ('a' and 'b'),
which are connected by switch S8. Each sub-module consists of 7 switches (S1 - S7) and
2 charge-transfer capacitors (C1, C2), and is driven by two non-overlapping, complementary clocks (CLK1, CLK2). Additionally, sub-module 'a' operates out-of-phase
with sub-module 'b'. This design strategy allows us to reuse simple 2/1 sub-modules
to design a more complex 5/2 conversion module.
Figure 4-2: Reconfigurable step-up switched capacitor module.
Figure 4-3(a) shows the operation of the converter in the 5/2 mode for the proposed topology. As shown in the figure, during phase Φ1, capacitors C1a and C2b are
charged from the input node, Vin, while capacitors C2a and C1b transfer charge to the
output node, Vout. On the other hand, during phase Φ2, C1b gets charged from the input node and C2a gets charged by C1a, while C2b transfers charge to the output node.
Also shown in the figure are the voltages across the different charge-transfer capacitors for the no-load case. Using charge balance, it is easily seen that the voltage across
each capacitor is identical during the two phases Φ1 and Φ2, which proves that, in the
steady state, this mode will generate a no-load output voltage Vout,NL = 5/2 × Vin.
Figure 4-3(b) shows the operation of a conventional series-parallel topology [26] implementing a 5/2 mode.

Figure 4-3: Operation of the proposed and conventional topologies in 5/2 mode. (a) Proposed topology. (b) Conventional topology.

Figure 4-4: Simulated performance comparison of the ideal converter for the proposed
and conventional topologies in 5/2 mode. (a) ROUT and ripple (ideal converter) versus switching frequency (MHz). (b) Efficiency with BP (bottom-plate parasitic) versus switching frequency (MHz).
Although both the topologies require the same number of capacitors and switches
to implement a 5/2 mode, the proposed topology offers two significant benefits compared to the conventional topology. Firstly, for the proposed implementation, charge
is delivered to the output in both the clock phases, Φ1 and Φ2. However, in the conventional topology, charge is delivered to the output in only one phase. This results in
a lower output voltage ripple for the proposed implementation. Figure 4-4(a) shows
the simulated result for the ideal converter in 5/2 mode. As can be seen from the
figure, the proposed topology reduces output voltage ripple by 2X compared to the
conventional case.
Second, and more importantly, the proposed topology offers much better performance in terms of reducing the bottom-plate parasitic loss, which is a significant component of the overall loss for on-chip implementation of the charge-transfer capacitors.
On-chip capacitors offer a much higher energy density compared to their off-chip
counterparts, but they suffer from having significantly more parasitic capacitance
(associated with their bottom or top plate and the substrate). This parasitic can
be as high as 5-10% [46] of the actual capacitance for the MOS capacitors used in
this design. In SC converters, this parasitic capacitor gets charged in one phase
and loses that energy by discharging in the other phase. The associated bottom
(or top)-plate parasitic loss can severely degrade the efficiency of the converter, especially at low output power levels. The proposed implementation for the 5/2 mode
significantly decreases this loss component by reducing the swing of the bottom (or
top) plate of the charge-transfer capacitors. It can be observed from Figure 4-3 that

$$P_{bot}(proposed) = \alpha C_f \left( V_{in}^2 + (V_{in}/2)^2 + V_{in}^2 + (V_{in}/2)^2 \right) f_{sw} = 2.5\,\alpha C_f V_{in}^2 f_{sw},$$

where α denotes the ratio of the bottom-plate parasitic capacitor to the corresponding
charge-transfer capacitor (Cf) and fsw denotes the switching frequency of the converter. For the conventional series-parallel implementation, this loss can be calculated as

$$P_{bot}(conventional) = \alpha C_f \left( (3V_{in}/2)^2 + (V_{in}/2)^2 + (3V_{in}/2)^2 + V_{in}^2 \right) f_{sw} = 5.75\,\alpha C_f V_{in}^2 f_{sw}.$$

Hence we get a 2.3X reduction in bottom-plate parasitic loss, which
significantly improves the efficiency. As seen from Figure 4-4(b), for the ideal converter with 2% bottom-plate parasitic (α = 0.02), we can get an efficiency improvement as high as 15%. This comparison assumes that the total amount of charge-transfer capacitance, the load capacitance and the load resistance are the same for
both topologies. As can be seen from Figure 4-4(a), this results in similar output
impedance (ROUT) and hence, both the implementations deliver the same amount of
output power at a given load current.
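The 2.3X figure can be reproduced by summing the squared bottom-plate swings of the four charge-transfer capacitors. The sketch below is a numerical check only: swings are expressed in units of Vin, the common factor α·Cf·Vin²·fsw is dropped, and the function name is illustrative.

```python
def bottom_plate_loss_factor(swings_in_vin):
    """Sum of squared bottom-plate swings (in units of Vin).

    Multiplying the result by alpha * Cf * Vin**2 * fsw gives P_bot.
    """
    return sum(s ** 2 for s in swings_in_vin)


proposed = bottom_plate_loss_factor([1.0, 0.5, 1.0, 0.5])
conventional = bottom_plate_loss_factor([1.5, 0.5, 1.5, 1.0])
reduction = conventional / proposed
print(proposed, conventional, round(reduction, 2))
```

Because the loss grows with the square of the swing, the two 3Vin/2 swings of the series-parallel arrangement dominate its total, which is where most of the reduction comes from.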
The operation in the other two modes (2/1 and 3/2) is illustrated in Figure 4-5.
It may be noted that in mode 2/1 the capacitors C1 and C2, in each sub-module,
work in exactly the same manner. Hence, for clarity, only one (C2) is shown in Figure
4-5.

[Figure 4-5 annotations: Vout = 2 × Vin (2/1 mode); Vout = 3/2 × Vin (3/2 mode).]
Figure 4-5: Operation of the converter in 2/1 and 3/2 modes.
4.3 MOS Implementation of the sub-module
In this design, all the charge-transfer switches in the main converter module have been
implemented with core (1V) transistors. It is important to ensure that in the steady
state none of the transistors are overstressed due to application of a gate-to-source
(VGS) or drain-to-source (VDS) voltage higher than the nominal supply voltage, Vdd.
Figure 4-6 shows the MOS implementation of the switches for a sub-module. The
bottom-plates of both the capacitors C1 and C2 remain within the voltage range of 0
to Vdd. Hence the switches (S1, S2, S4, S5), which are connected to the bottom-plates
of C1 and C2, have been implemented with regular PMOS (S1P, S4P) and NMOS
(S2N, S5N) transistors. Although not shown in the figure, these transistors are driven
by buffers in the voltage range 0 to Vdd.

Switch S3 connects the top-plate of capacitor C1 to Vdd and is turned OFF when
the top-plate of C1 goes to 2Vdd. Hence, it is implemented with an NMOS transistor
(S3N) with a gate drive between Vdd and 2Vdd, to avoid VGS overstress. Switch S6
needs to connect the top-plate of the capacitor C2 to the output voltage node Vout
and is OFF otherwise. Hence, it is implemented with a regular PMOS transistor
(S6P) but with a gate drive between (Vout - Vdd) and Vout, so that the maximum VGS
applied is Vdd and the transistor is not overstressed.

Figure 4-6: MOS implementation of the switches.

Switch S7 operates in a wide range of voltage levels, which depend on the conversion mode. It needs to block a voltage of (Vout - Vdd) across it, which can be as high
as 2.5 - 1 = 1.5V in 5/2 mode. Hence, to avoid a VDS overstress, it is implemented
with a cascode of two 1V regular PMOS transistors (S7PL and S7PH). Conventionally,
a suitable DC voltage needs to be generated to bias this cascode switch structure.
In this design, the need for a separate DC voltage is obviated by dynamically biasing the gates of both the transistors, S7PL and S7PH, to turn them ON and OFF
simultaneously. It may be noted that the dynamic biasing is also dependent on the
conversion mode. Hence it needs to be reconfigurable to work across all the three
conversion modes (5/2, 2/1, 3/2). Figure 4-7 shows the reconfigurable gate drive
structures for switches S7PL and S7PH, along with the necessary level shifter circuits.
The 'V - V Shifter', shown in the figure, is also used to drive switch S6P.

Switch S8 was implemented with regular 1V PMOS and NMOS transistors in a
transmission gate structure, since it needs to pass a voltage of Vdd/2. The charge-transfer capacitors were implemented on-chip with high density MOS capacitors along
with MOM capacitors stacked on top (to improve density). For the MOS capacitors, soft
connection of the N-well [46] was adopted. This technique greatly reduces the parasitic
capacitance from the bottom (or top)-plate to ground and improves the efficiency of the
converter in all the 3 modes.
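The need for a two-transistor cascode in switch S7 can be illustrated by computing the blocking voltage (Vout - Vdd) in each mode. This is a simplified sketch: it assumes the no-load output voltage, Vin = Vdd = 1V, and an even split of the blocked voltage across stacked 1V devices. The helper names are hypothetical.

```python
import math

V_DD = 1.0               # nominal core supply and converter input (V)
V_MAX_PER_DEVICE = 1.0   # rating of a core transistor (V)
MODES = {"5/2": 2.5, "2/1": 2.0, "3/2": 1.5}  # no-load conversion ratios


def s7_blocking_voltage(ratio: float, v_dd: float = V_DD) -> float:
    """Voltage (Vout - Vdd) that S7 must block, using the no-load Vout."""
    return ratio * v_dd - v_dd


def stacked_devices_needed(v_block: float, v_max: float = V_MAX_PER_DEVICE) -> int:
    """Minimum number of series devices so that none sees more than v_max."""
    return max(1, math.ceil(v_block / v_max - 1e-9))


for mode, ratio in MODES.items():
    v_block = s7_blocking_voltage(ratio)
    print(mode, v_block, stacked_devices_needed(v_block))
```

Only the 5/2 mode exceeds the rating of a single core device, which is consistent with S7 being built from two stacked PMOS transistors whose gate biasing depends on the selected mode.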
Figure 4-7: Reconfigurable gate drive circuits for a sub-module. (a) Level shifter with enable (LS-EN). (b) Gate drivers for S7PH and S6P.
4.4 Overall System Architecture
Figure 4-8 shows the overall architecture of the converter. This work implements
4-phase interleaving in order to reduce output voltage ripple. The 4-phase clock
generator divides the input clock into 4 phases (frequency fsw/4), each shifted by 45°. Each phase generates two
complementary non-overlapping clocks, which drive a single converter module. A
tunable circuit has been implemented to control the non-overlapping delay, which is
crucial to reduce shoot-through current loss [47]. Reconfigurable switch drivers, as
explained in the previous section, provide the gate drives for all the switches in each
module. An on-chip load capacitor provides further necessary ripple reduction at the
output.
4.5 Results
The fully integrated step-up converter was implemented in a 28nm FDSOI process,
occupying a core area of 0.054mm². An additional 0.06mm² area was used to implement an on-chip load capacitor. Figure 4-8 shows the die photo of the converter. The
measured efficiency of the converter with varying load current and Vin = 1V is plotted
in Figure 4-9(a). The output voltage was kept constant at ~2.2V (mode 5/2), 1.9V
(mode 2/1) and 1.3V (mode 3/2), by changing the switching frequency (fsw) of the
converter. As seen from the figure, the converter can supply a load current in the
range 10 - 500 µA while maintaining an efficiency of more than 70%. It achieves
a peak efficiency of 88% for the 2/1 mode at Pout = 0.56mW and 82% for the 5/2
mode at Pout = 0.66mW. Figure 4-9(b) shows the measured performance with a fixed
load current of 100 µA and varying output voltage for the 3 modes. The converter
provides an output voltage ranging from ~1.2V to 2.4V with more than 70% efficiency (Vin = 1V). It can be observed that increasing the switching frequency of the
converter increases the output voltage by decreasing the output impedance (ROUT) of
the converter. However, this effect saturates at higher frequencies, since the converter
enters the fast-switching-limit (FSL) mode, in which the non-zero resistance of the
MOS switches limits ROUT.

The performance of the designed converter is compared with recent works on step-up SC DC-DC converters in Table 4.1. The proposed reconfigurable converter provides a wider range of output voltage, with a better peak efficiency value, using only
MOS and MOM capacitors.

Table 4.1: Performance comparison with previous works
| Design | [27] | [48] | This work |
|---|---|---|---|
| Technology | 130nm Bulk | 32nm Bulk | 28nm FDSOI |
| Topology | Step-up 2/1 | Step-up 2/1 | Reconfigurable Step-up: 5/2, 2/1, 3/2 |
| Capacitor | MIM | Metal finger | MOS and MOM |
| Area (with Cload) | 2.25 mm² | 6678 µm² | 0.114 mm² |
| Vin | 1-1.2V | 1-1.2V | 1V |
| Vout (@Vin=1V) | 1.8V | 1.3 - 2V | 1.2 - 2.4V |
| ηpeak (@Vin=1V) | 82% @Pout = 1.5 mW | 64% @Pout = 2.9 mW | 88% (2/1 mode) @Pout = 0.56 mW |
| Iload(max.) (@Vin=1V) | 1.5 mA @Vout = 1.8V, η = 81% | 6.8 mA @Vout = 1.4V, η = 56% | 1 mA (2/1 mode) @Vout = 1.73V, η = 83% |
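The saturation of ROUT at high switching frequency, noted in the results above, can be pictured with a simple two-limit model. The sketch below is not taken from this work: it uses the quadrature-sum approximation that is common in the SC-converter literature, and the capacitance, the FSL resistance and the topology constant k_ssl are hypothetical placeholder values.

```python
import math


def r_out_estimate(f_sw_hz: float, c_total_f: float, r_fsl_ohm: float,
                   k_ssl: float = 1.0) -> float:
    """Approximate ROUT from slow- and fast-switching-limit asymptotes.

    r_ssl falls as 1/(C * fsw); r_fsl is set by the switch resistances and
    does not depend on fsw. Both parameters are illustrative assumptions.
    """
    r_ssl = k_ssl / (c_total_f * f_sw_hz)
    return math.sqrt(r_ssl ** 2 + r_fsl_ohm ** 2)


for f in (1e6, 10e6, 100e6, 1e9):
    print(f"{f / 1e6:7.0f} MHz -> {r_out_estimate(f, 100e-12, 200.0):9.1f} ohm")
```

At low frequency the estimate falls roughly as 1/fsw, while at high frequency it flattens towards the switch-limited value, so further increases in fsw no longer raise the output voltage.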
[Figure 4-9 plots: efficiency and switching frequency for the 5/2, 2/1 and 3/2 modes, versus load current (µA) in (a) and versus Vout (V) in (b).]
Figure 4-9: Measured performance of the converter. (a) Varying load current and fixed output voltage. (b) Varying output voltage and a fixed 100 µA load current.
Chapter 5
Conclusions
This thesis primarily focuses on energy-efficient 6T SRAM design in a 28nm FDSOI
technology. Additionally, the design of a reconfigurable step-up DC-DC converter is
presented, which can be used for body-biasing in SRAMs. This chapter summarizes
the important conclusions of this research and discusses opportunities for future work.
5.1 Summary of contributions
The three main contributions of this thesis are summarized below:
1. Energy-efficient implementation of dynamic forward body-biasing, which is
used as a write-assist technique in the designed 6T SRAM.
2. Incorporating data prediction in the conventional read path of 6T SRAMs, to
save global bit-line switching power.
3. A reconfigurable integrated switched-capacitor (SC) based step-up DC-DC
converter, as a body-bias generator for SRAMs.
5.2 Energy-efficient 6T SRAM design
An energy-efficient 6T SRAM design is presented in Chapter 3. The 6T SRAM
achieves operation down to a Vdd of 400mV, with a (μ/σ)WRM of 5.5, at the worst
process corner and temperature (SF corner, -40°C). The 200mV reduction in the
Vdd,min of the SRAM is achieved by using dynamic forward body-biasing (FBB) as a
write-assist technique. Dynamic body-biasing can incur significant energy overhead
if it is implemented column-wise [40]. This is because the n-wells of all the selected
columns need to be charged up during every write operation. To reduce this energy
overhead, a modified layout with horizontal n-well sharing is proposed. Hence, only
the two n-wells of the selected row need to be charged up, which significantly reduces
the energy overhead due to body-biasing.
The energy consumption of the 128Kb SRAM is estimated using the energy model
described in Appendix A. The 200mV improvement in Vdd,min of the SRAM translates
to a 38% reduction of average energy/access (Eavg/acc.), for the proposed implementation with row-wise body-biasing (hFBB). The conventional implementation, however,
increases Eavg/acc. by 20%. This assumes an equal number of read and write operations,
i.e. read to write ratio R/W = 1:1. The improvement is also seen at a higher R/W
ratio of 5:1, where the hFBB technique saves 36%, while the conventional implementation saves only 7%.
To achieve further energy savings, a data-prediction scheme is inserted in the
conventional 6T read path. As demonstrated in [41], in certain applications, such
as motion estimation in video processing, the correlation of the stored data can be
exploited to predict the data during a read operation, using previously read values.
The concept of data-prediction is extended to conventional 6T SRAM arrays, without
any significant area overhead. Using the proposed technique, the discharge of the
global bit-lines (GBL) can be avoided if the prediction is correct. Although the
switching of the LBLs is not affected by this technique, the dominant component
of switching power, which is due to GBLs, can be significantly reduced with correct
prediction. Up to 35% (i.e. 1.54X) improvement in the read energy/access, at Vdd
= 400mV, was estimated using the energy models of the SRAM (Appendix A). This
improvement is in addition to the energy savings achieved by Vdd scaling to 400mV.
5.3 Reconfigurable Step-up SC DC-DC Converter
An integrated reconfigurable switched-capacitor (SC) step-up DC-DC converter is
presented in Chapter 4. The converter can be used to generate a wide range of
body-bias voltages for SRAMs. The reconfigurable converter implements 3 step-up
conversion ratios of 5/2, 2/1 and 3/2. It provides a wide range of output voltage,
from 1.2V to 2.4V, using a 1V input. The step-up converter has been designed to
obviate the need for high voltage I/O transistors, which otherwise would have
degraded the efficiency owing to their higher Ron and capacitance. Additionally, a
new topology is proposed for the 5/2 mode, which improves efficiency by reducing the
bottom-plate parasitic loss as compared to a conventional series-parallel topology.
The converter was implemented in a 28nm FDSOI process using only on-chip MOS
and MOM capacitors that do not require any extra fabrication steps, unlike MIM and
trench capacitors. The converter can deliver load current in the range of 10 µA to
500 µA, achieving a peak efficiency of 88% (measured).
5.4 Future Work
As CMOS scaling continues into the sub-20nm regime, increased device variations will limit
the minimum operating voltage (Vdd,min) of 6T SRAMs. While FDSOI and FinFETs
offer improved device performance as compared to bulk processes, they also present
new challenges for SRAM design. Hence, improved read and write-assist techniques
are required to scale the Vdd,min of 6T SRAMs, while maintaining high yield ratios.
Vdd scaling is crucial to reduce the SRAM energy consumption.
Data dependency can be further exploited to take advantage of interesting signal
statistics in SRAM design. Alternate prediction schemes for 6T SRAMs can be
explored, which can provide further energy savings.
Appendix A
Energy Model of the 28nm FDSOI
128Kb SRAM Macro
The energy consumption in SRAM has two major components: dynamic energy (Edyn)
and leakage energy (Eleak). The dominant sources of these energy consumptions
are bit-line (BL) switching and bit-cell leakage. BL switching energy (Edyn,BL) is
proportional to the number of columns in a given SRAM macro. Since the capacitance
of each BL is proportional to the number of rows in the macro, Edyn,BL scales with
both the number of rows and columns. However, it does not depend on the total size
of the memory, since only one block is accessed at a time and hence, the BL switching
occurs only in the selected block. On the other hand, the leakage energy (Eleak) scales
with the total size of the memory, since every bit-cell, whether it is accessed or not,
draws a non-zero leakage current. Furthermore, these energy consumptions depend
on the SRAM supply voltage (Vdd) and the frequency of operation (fsw).
The average energy consumption per access is given by:

$$E_{avg/acc.} = E_{dyn} + E_{leak} \tag{A.1}$$
Table A.1: Array organization for both the implementations
| Implementation | h6T | r6T |
|---|---|---|
| Total Memory Size | 128Kb | 128Kb |
| Number of Blocks | 2 | 2 |
| Number of local arrays in a block | 8 | 8 |
| Local Array Organization | 32X256 | 64X128 |
| Number of bit-cells per local BL | 64 | 64 |
| Number of local BL pairs | 128 | 128 |
| Word Length | 64 | 64 |
| Number of Word Lines per row | 4 | 1 |
For the proposed layout implementation with horizontal n-well sharing (h6T),
a local array consists of 32 rows and 256 columns. Since a local bit-line (LBL)
is shared between adjacent columns, 64 bit-cells share an LBL. To maintain the same
number of bit-cells per LBL, 64 rows are chosen in the local array for the conventional
implementation (with column-wise n-well sharing). Hence, to have the same local
array size, the conventional implementation (r6T) has 128 columns.
Table A.1 shows the detailed array organization for the conventional (r6T) and
proposed (h6T) implementations. These values are used for the formulation described
in this chapter.
A.1 Dynamic Energy Consumption
The primary sources of dynamic energy consumption are bit-line switching (both local
and global BLs), word-line (WL) switching and sense-amplifier. In addition, dynamic
body-biasing, during a write operation, incurs extra energy overhead due to n-well
switching. Furthermore, when data prediction is used during a read operation, there
is some switching power overhead while updating the predictor's outputs.
Since a conventional 6T SRAM has a differential BL architecture, in every
access cycle one of the BLs always discharges to "0". Therefore, one of the
BLs needs to be pre-charged to Vdd at the end of the access cycle. For the
conventional architecture with 1 WL per row, the dynamic energy consumption due
to local BL switching can be calculated as:

$$E_{dyn,LBL}(conv.) = 128\, C_{LBL}^{c} V_{dd}^{2} \tag{A.2}$$

where $C_{LBL}^{c}$ represents the capacitance associated with a local BL, for the conventional implementation.
However, for the proposed implementation, with shared local BLs between adjacent columns and 4 WLs per row, this energy can be calculated as:

$$E_{dyn,LBL}(prop.) = 64\, C_{LBL}^{p} V_{dd}^{2} \tag{A.3}$$

where $C_{LBL}^{p}$ represents the capacitance associated with a local BL, for the proposed
implementation.
The global BL discharge is dependent on the type of memory access. For a write
operation, one of the GBLs swings fully from Vdd to 0V, whereas during a read
operation, the swing (ΔVrd) is smaller and is determined by the sense-amplifier (SA)
offset voltage (Voff). ΔVrd should be estimated assuming that the slowest GBL discharge
can droop to Vdd - Voff when the SA is enabled. The average energy due to global
BL switching can be calculated as:

$$E_{dyn,GBL} = 64\, C_{GBL}^{c/p} V_{dd} \left( \frac{k}{k+1} \Delta V_{rd} + \frac{1}{k+1} V_{dd} \right) \tag{A.4}$$

where 'k' denotes the read to write ratio (i.e. the ratio of the number of read
operations to the number of write operations) and $C_{GBL}^{c/p}$ denotes the respective capacitance of the global BL for the conventional and proposed implementations.
The energy consumption of the SA is given by:

$$E_{dyn,SA} = 64\, C_{SA} V_{dd}^{2}\, \frac{k}{k+1} \tag{A.5}$$

where $C_{SA}$ denotes the effective capacitance of the SA, which is switched in every
read-access cycle.
The energy consumption due to word-line (WL) switching is given by:

$$E_{dyn,WL} = C_{WL}^{c/p} V_{dd}^{2} \tag{A.6}$$

where $C_{WL}^{c/p}$ denotes the respective capacitance of one WL for the conventional and
proposed implementations.
Without write assists, the total dynamic energy consumption per access is given
by:

$$E_{dyn,tot}(w/o\ assist) = E_{dyn,LBL} + E_{dyn,GBL} + E_{dyn,SA} + E_{dyn,WL} \tag{A.7}$$
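The component equations up to (A.7) translate directly into a small calculator. The following is a sketch under stated assumptions: the function and argument names are chosen here, the capacitance values in the example are arbitrary placeholders (not extracted values), and the caller is expected to pass the capacitances of the implementation being evaluated.

```python
WORD_LENGTH = 64  # bits accessed per cycle (Table A.1)


def e_dyn_lbl(c_lbl: float, vdd: float, n_switching_lbl: int) -> float:
    """(A.2)/(A.3): 128 local BLs switch for conv., 64 for the proposed layout."""
    return n_switching_lbl * c_lbl * vdd ** 2


def e_dyn_gbl(c_gbl: float, vdd: float, dv_rd: float, k: float) -> float:
    """(A.4): average global BL energy for a read-to-write ratio k."""
    read_fraction = k / (k + 1.0)
    write_fraction = 1.0 / (k + 1.0)
    return WORD_LENGTH * c_gbl * vdd * (read_fraction * dv_rd + write_fraction * vdd)


def e_dyn_sa(c_sa: float, vdd: float, k: float) -> float:
    """(A.5): sense-amplifier energy, spent only in read cycles."""
    return WORD_LENGTH * c_sa * vdd ** 2 * k / (k + 1.0)


def e_dyn_wl(c_wl: float, vdd: float) -> float:
    """(A.6): energy of switching one word-line."""
    return c_wl * vdd ** 2


def e_dyn_total_no_assist(c_lbl, n_switching_lbl, c_gbl, c_sa, c_wl,
                          vdd, dv_rd, k) -> float:
    """(A.7): total dynamic energy per access without write assist."""
    return (e_dyn_lbl(c_lbl, vdd, n_switching_lbl)
            + e_dyn_gbl(c_gbl, vdd, dv_rd, k)
            + e_dyn_sa(c_sa, vdd, k)
            + e_dyn_wl(c_wl, vdd))


# Placeholder capacitances (farads), chosen only to exercise the functions.
total = e_dyn_total_no_assist(c_lbl=5e-15, n_switching_lbl=64, c_gbl=50e-15,
                              c_sa=2e-15, c_wl=40e-15, vdd=0.6, dv_rd=0.1, k=1.0)
print(f"{total:.3e} J")
```

Keeping each term as its own function makes it easy to see how the read-to-write ratio k shifts the balance between the read swing and the full-swing write term of the global BLs.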
The energy overhead due to dynamic body-biasing (BB), as a write assist, is given
by:
Edyn,BB(conv.)
= 64 x 64 x 2CBBV4
Edyn,BB(prop.)= 2 x
1
BB
1
1
256 x 2CBBVFkBB k+1
1
(.)
8 192CBBVBB
1k
1
kBBVFBB
(A.9)
where, CBB denotes the capacitance associated with one of the n-well body terminals, for a single bit-cell.
2
CBB is used in the equations because the n-wells are
shared with the adjacent row or column (depending on the implementation).
VFBB
denotes the amount of body-baising required at a given Vdd to achieve a certain p/a
of write margin. In this work, a VFBB of IV is required at Vdd= 400mV, for a p/= 5.5 (©SF corner and -40'C).
Since the conventional implementation has column-wise n-well sharing, a single
sided body-biasing can be implemented. Hence, for each of the 64 selected columns,
only 1 n-well is assumed to be switching. Each column-wise n-well is shared by 64 bitcells. On other hand, with the proposed layout implementation, the body-biasing is
done row-wise. Therefore, both the n-wells, corresponding to a selected row, need to
switch (hence, the extra factor of 2). Each row-wise n-well is shared by 256 bit-cells.
72
As seen from the equations, the proposed implementation reduces the body-biasing
energy overhead by 8X.
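The 8X figure can be checked numerically with a minimal sketch of the two body-biasing expressions. The function names and the values of `c_bb` and `k` below are hypothetical and chosen only for illustration; the 1/(k+1) write-fraction weighting is assumed to apply equally to both implementations, so it cancels in the ratio.

```python
import math


def bb_energy_conventional(c_bb, v_fbb, k):
    """Column-wise n-wells: 64 selected columns, 1 n-well each, 64 bit-cells per n-well."""
    return 64 * 64 * 2 * c_bb * v_fbb ** 2 / (k + 1)


def bb_energy_proposed(c_bb, v_fbb, k):
    """Row-wise n-wells: 2 n-wells per selected row, 256 bit-cells per n-well."""
    return 2 * 256 * 2 * c_bb * v_fbb ** 2 / (k + 1)


# Hypothetical illustration values (not taken from the thesis), except
# v_fbb = 1 V, which the text quotes for Vdd = 400 mV.
c_bb = 0.1e-15   # farads, assumed per-cell body capacitance
v_fbb = 1.0      # volts
k = 4            # assumed read-to-write ratio

e_conv = bb_energy_conventional(c_bb, v_fbb, k)
e_prop = bb_energy_proposed(c_bb, v_fbb, k)
ratio = e_conv / e_prop
print(f"conventional / proposed = {ratio:.1f}")
```

The ratio depends only on the cell counts (64 x 64 versus 2 x 256), not on the assumed capacitance, bias voltage, or read-to-write ratio.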
With body-biasing used as a write-assist technique, the total dynamic energy
consumption per access is given by:
$$E_{dyn,tot}(\text{w/ assist}) = E_{dyn,LBL} + E_{dyn,GBL} + E_{dyn,SA} + E_{dyn,WL} + E_{dyn,BB} \tag{A.10}$$
When data-prediction is used, only the dynamic energy consumption of the global
BL during a read operation is affected. Let us denote the fraction of correct
predictions as $p_c$. For the target application of motion estimation, the number of read
accesses is considerably larger than the number of write accesses. Hence, we only consider
the energy per access for the read operation (i.e. 'k' can be assumed to be a very large
number in the above equations).
The dynamic energy consumption of the global BLs for a read operation, using
data prediction, is given by:
$$E_{dyn,GBL}(\text{w/ pred}) = (1 - p_c) \times 64\, C_{GBL} V_{dd} \times 2\Delta V_{rd} \tag{A.11}$$
However, there is an energy overhead of updating the global prediction lines.
Assuming that prediction outputs are updated every 16 clock cycles, which was found
to be optimum in [41] for most video sequences tested, this overhead can be calculated
as:
$$E_{dyn,ov}(\text{w/ pred}) = \alpha_{pred} \times 64\, C_{GBL} V_{dd}^{2} \times \frac{1}{16} \tag{A.12}$$
where $\alpha_{pred}$ is the average activity factor for the predictor's outputs. It is chosen to
be 1/2 for simplicity.
Hence, the overall energy per access for read, with data prediction, is given by:
$$E_{dyn,tot}(\text{w/ pred}) = E_{dyn,LBL} + E_{dyn,GBL}(\text{w/ pred}) + E_{dyn,SA} + E_{dyn,WL} + E_{dyn,ov}(\text{w/ pred}) \tag{A.13}$$
It can be observed that, with more than 50% correct prediction, we get energy
savings as compared to a conventional 6T read.
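The 50% break-even point can be illustrated with a short sketch that covers only the global-BL read term. It assumes that a conventional read swings one global BL per column by ΔVrd, i.e. a baseline of 64 x C_GBL x Vdd x ΔVrd; that baseline, the function names, and all numeric values are assumptions for illustration. The prediction-line update overhead is deliberately left out.

```python
import math


def gbl_read_energy_baseline(c_gbl, vdd, dv_rd, n_cols=64):
    """Assumed conventional read: one global BL per column swings by dv_rd."""
    return n_cols * c_gbl * vdd * dv_rd


def gbl_read_energy_with_prediction(p_c, c_gbl, vdd, dv_rd, n_cols=64):
    """Read with data prediction: only mispredicted bits (1 - p_c) cost 2 * dv_rd."""
    return (1.0 - p_c) * n_cols * c_gbl * vdd * 2.0 * dv_rd


# Hypothetical illustration values (not taken from the thesis).
c_gbl = 50e-15   # farads
vdd = 0.4        # volts
dv_rd = 0.1      # volts

base = gbl_read_energy_baseline(c_gbl, vdd, dv_rd)
for p_c in (0.25, 0.5, 0.75, 0.9):
    pred = gbl_read_energy_with_prediction(p_c, c_gbl, vdd, dv_rd)
    print(f"p_c = {p_c:.2f}: E_pred / E_base = {pred / base:.2f}")
```

Under these assumptions the ratio is 2(1 - p_c), so it drops below 1 once p_c exceeds 0.5. Including the update overhead would shift the break-even point slightly above 50%.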
A.2 Leakage Energy Consumption
The dominant source of the total leakage energy consumption of the SRAM is the
6T bit-cell leakage. This is because every bit-cell in the memory array draws a
leakage current, and hence this energy scales with the total size of the memory. It is
given by:
$$E_{leak,tot} = 2^{17} \times V_{dd}\, I_{leak}(@V_{dd}) \times \frac{1}{f_{sw}} \tag{A.14}$$
where $I_{leak}(@V_{dd})$ is the leakage current per bit-cell and $f_{sw}$ is the frequency of
operation, at a particular value of $V_{dd}$. $f_{sw}$, at a particular $V_{dd}$ value, is estimated from the worst-case bit-cell read and
write times.
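As a rough sketch of how the leakage expression is evaluated, the snippet below multiplies the array leakage power by one cycle time. The bit-cell count of 2^17 is read from the expression above; the function name and the leakage current and frequency values are hypothetical placeholders, not measured data.

```python
import math

N_BITCELLS = 2 ** 17  # number of bit-cells in the array, as in the expression above


def leakage_energy_per_access(vdd, i_leak_per_cell, f_sw, n_cells=N_BITCELLS):
    """Array leakage power (n_cells * vdd * i_leak) integrated over one cycle (1 / f_sw)."""
    return n_cells * vdd * i_leak_per_cell / f_sw


# Hypothetical illustration values (not taken from the thesis).
vdd = 0.4        # volts
i_leak = 10e-12  # amperes per bit-cell
f_sw = 1e6       # hertz

e_leak = leakage_energy_per_access(vdd, i_leak, f_sw)
print(f"leakage energy per access = {e_leak:.3e} J")
```

Because the cycle time grows as Vdd is lowered, the leakage energy per access can increase at low voltage even though the leakage power itself falls.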
Bibliography
[1] H. Yamauchi, "Embedded SRAM Design in Nanometer-Scale Technologies," in
Embedded Memories for Nano-Scale VLSIs, K. Zhang, Ed. New York: Springer,
2009.
[2] P. Magarshack, P. Flatresse, and G. Cesana, "UTBB FD-SOI: A process/design
symbiosis for breakthrough energy-efficiency," in Design, Automation Test in
Europe Conference Exhibition (DATE), 2013, March 2013, pp. 952-957.
[3] J.-P. Noel et al., "Multi-VT UTBB FDSOI Device Architectures for Low-Power
CMOS Circuit," in IEEE Transactions on Electron Devices, vol. 58, no. 8, August
2011, pp. 2473-2482.
[4] R. Riedlinger, R. Bhatia, L. Biro, B. Bowhill, E. Fetzer, P. Gronowski, and
T. Grutkowski, "A 32nm 3.1 billion transistor 12-wide-issue itanium processor for
mission-critical servers," in Solid-State Circuits Conference Digest of Technical
Papers (ISSCC), 2011 IEEE International, Feb 2011, pp. 84-86.
[5] T. Burd and R. Brodersen, "Design issues for Dynamic Voltage Scaling," in
Low Power Electronics and Design, 2000. ISLPED '00. Proceedings of the 2000
International Symposium on, 2000, pp. 9-14.
[6] V. Gutnik and A. Chandrakasan, "Embedded power supply for low-power DSP,"
Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 5,
no. 4, pp. 425-435, Dec 1997.
[7] N. Planes et al., "28nm FDSOI technology platform for high-speed low-voltage
digital applications," in VLSI Technology (VLSIT), 2012 Symposium on, June
2012, pp. 133-134.
[8] A. Carlson, Z. Guo, S. Balasubramanian, R. Zlatanovici, T.-J. K. Liu, and
B. Nikolic, "SRAM Read/Write Margin Enhancements Using FinFETs," Very
Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 18, no. 6,
pp. 887-900, June 2010.
[9] T. Song et al., "A 14nm FinFET 128Mb 6T SRAM with VMIN-enhancement
techniques for low-power applications," in Solid-State Circuits Conference Digest
of Technical Papers (ISSCC), 2014 IEEE International, Feb 2014, pp. 232-233.
[10] H. Pilo, I. Arsovski, K. Batson, G. Braceras, J. Gabric, R. Houle, S. Lamphier,
C. Radens, and A. Seferagic, "A 64 Mb SRAM in 32 nm High-k Metal-Gate SOI
Technology With 0.7 V Operation Enabled by Stability, Write-Ability and Read-Ability Enhancements," Solid-State Circuits, IEEE Journal of, vol. 47, no. 1, pp.
97-106, Jan 2012.
[11] M. Khare et al., "A high performance 90nm SOI technology with 0.992 μm² 6T-SRAM cell," in Electron Devices Meeting, 2002. IEDM '02. International, Dec
2002, pp. 407-410.
[12] M. Sinangil, H. Mair, and A. Chandrakasan, "A 28nm high-density 6T SRAM
with optimized peripheral-assist circuits for operation down to 0.6V," in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2011 IEEE
International, Feb 2011, pp. 260-262.
[13] K. Takeda, T. Saito, S. Asayama, Y. Aimoto, H. Kobatake, S. Ito, T. Takahashi,
K. Takeuchi, M. Nomura, and Y. Hayashi, "Multi-step word-line control technology in hierarchical cell architecture for scaled-down high-density SRAMs," in
VLSI Circuits (VLSIC), 2010 IEEE Symposium on, June 2010, pp. 101-102.
[14] F. Hamzaoglu, K. Zhang, Y. Wang, H. J. Ahn, U. Bhattacharya, Z. Chen,
Y.-G. Ng, A. Pavlov, K. Smits, and M. Bohr, "A 153Mb-SRAM Design with
Dynamic Stability Enhancement and Leakage Reduction in 45nm High-K Metal-Gate CMOS Technology," in Solid-State Circuits Conference, 2008. ISSCC 2008.
Digest of Technical Papers. IEEE International, Feb 2008, pp. 376-621.
[15] K. Nii, M. Yabuuchi, Y. Tsukamoto, S. Ohbayashi, Y. Oda, K. Usui, T. Kawamura, N. Tsuboi, T. Iwasaki, K. Hashimoto, H. Makino, and H. Shinohara, "A
45-nm single-port and dual-port sram family with robust read/write stabilizing
circuitry under dvfs environment," in VLSI Circuits, 2008 IEEE Symposium on,
June 2008, pp. 212-213.
[16] S. Barasinski, L. Camus, and S. Clerc, "A 45nm single power supply SRAM supporting low voltage operation down to 0.6V," in Solid-State Circuits Conference,
2008. ESSCIRC 2008. 34th European, Sept 2008, pp. 502-505.
[17] J. Chang, Y.-H. Chen, H. Cheng, W.-M. Chan, H.-J. Liao, Q. Li, S. Chang,
S. Natarajan, R. Lee, P.-W. Wang, S.-S. Lin, C.-C. Wu, K.-L. Cheng, M. Cao,
and G. Chang, "A 20nm 112Mb SRAM in High-K metal-gate with assist circuitry
for low-leakage and low-VMIN applications," in Solid-State Circuits Conference
Digest of Technical Papers (ISSCC), 2013 IEEE International, Feb 2013, pp.
316-317.
[18] H. Fujiwara, M. Yabuuchi, M. Morimoto, K. Tanaka, M. Tanaka, N. Maeda,
Y. Tsukamoto, and K. Nii, "A 20nm 0.6V 2.1μW/MHz 128kb SRAM with no
half select issue by interleave wordline and hierarchical bitline scheme," in VLSI
Circuits (VLSIC), 2013 Symposium on, June 2013, pp. C118-C119.
[19] S. Moriwaki, Y. Yamamoto, A. Kawasumi, T. Suzuki, S. Miyano, T. Sakurai,
and H. Shinohara, "A 13.8pJ/Access/Mbit SRAM with charge collector circuits
for effective use of non-selected bit line charges," in VLSI Circuits (VLSIC),
2012 Symposium on, June 2012, pp. 60-61.
[20] G. Moore, "Cramming More Components Onto Integrated Circuits," Proceedings
of the IEEE, vol. 86, no. 1, pp. 82-85, Jan 1998.
[21] Y.-H. Chen, W.-M. Chan, W.-C. Wu, H.-J. Liao, K.-H. Pan, J.-J. Liaw, T.-H.
Chung, Q. Li, G. Chang, C.-Y. Lin, M.-C. Chiang, S.-Y. Wu, S. Natarajan, and
J. Chang, "A 16nm 128Mb SRAM in high-κ metal-gate FinFET technology
with write-assist circuitry for low-VMIN applications," in Solid-State Circuits
Conference Digest of Technical Papers (ISSCC), 2014 IEEE International, Feb
2014, pp. 238-239.
[22] O. Thomas, B. Zimmer, B. Pelloux-Prayer, N. Planes, K.-C. Akyel,
L. Ciampolini, P. Flatresse, and B. Nikolic, "6T SRAM design for wide voltage range in 28nm FDSOI," in SOI Conference (SOI), 2012 IEEE International,
Oct 2012, pp. 1-2.
[23] D. Jacquet et al., "A 3 GHz Dual Core Processor ARM Cortex™-A9 in 28 nm
UTBB FD-SOI CMOS With Ultra-Wide Voltage Range and Energy Efficiency
Optimization," Solid-State Circuits, IEEE Journal of, vol. 49, no. 4, pp. 812-826,
April 2014.
[24] P. Flatresse, B. Giraud, J. Noel, B. Pelloux-Prayer, F. Giner, D. Arora, F. Arnaud, N. Planes, J. Le Coz, O. Thomas, S. Engels, G. Cesana, R. Wilson,
and P. Urard, "Ultra-wide body-bias range LDPC decoder in 28nm UTBB FDSOI technology," in Solid-State Circuits Conference Digest of Technical Papers
(ISSCC), 2013 IEEE International, Feb 2013, pp. 424-425.
[25] F. Arnaud, N. Planes, O. Weber, V. Barral, S. Haendler, P. Flatresse, and
F. Nyer, "Switching energy efficiency optimization for advanced CPU thanks
to UTBB technology," in Electron Devices Meeting (IEDM), 2012 IEEE
International, Dec 2012, pp. 3.2.1-3.2.4.
[26] H.-P. Le et al., "A sub-ns response fully integrated battery-connected switched-capacitor voltage regulator delivering 0.19W/mm2 at 73% efficiency," in Proc.
IEEE ISSCC, Feb. 2013, pp. 372-373.
[27] T. V. Breussegem and M. Steyaert, "A 82% efficiency 0.5% ripple 16-phase fully
integrated capacitive voltage doubler," in Proc. IEEE Symp. VLSI Circuits, Jun.
2009, pp. 198-199.
[28] L. Chang et al., "A fully-integrated switched-capacitor 2:1 voltage converter with
regulation capability and 90% efficiency at 2.3A/mm2," in Proc. IEEE Symp.
VLSI Circuits, Jun. 2010, pp. 55-56.
[29] J. Wang, S. Nalam, and B. Calhoun, "Analyzing static and dynamic write margin
for nanometer SRAMs," in Low Power Electronics and Design (ISLPED), 2008
ACM/IEEE International Symposium on, Aug 2008, pp. 129-134.
[30] N. Gierczynski, B. Borot, N. Planes, and H. Brut, "A New Combined Methodology for Write-Margin Extraction of Advanced SRAM," in Microelectronic Test
Structures, 2007. ICMTS '07. IEEE International Conference on, March 2007,
pp. 97-100.
[31] E. Seevinck, F. List, and J. Lohstroh, "Static-noise margin analysis of MOS
SRAM cells," Solid-State Circuits, IEEE Journal of, vol. 22, no. 5, pp. 748-754,
Oct 1987.
[32] S. Moriwaki, A. Kawasumi, T. Suzuki, T. Sakurai, and S. Miyano, "0.4v sram
with bit line swing suppression charge share hierarchical bit line scheme," in
Custom Integrated Circuits Conference (CICC), 2011 IEEE, Sept 2011, pp. 1-4.
[33] B. Zimmer, S. O. Toh, H. Vo, Y. Lee, O. Thomas, K. Asanovic, and B. Nikolic,
"SRAM Assist Techniques for Operation in a Wide Voltage Range in 28-nm
CMOS," Circuits and Systems II: Express Briefs, IEEE Transactions on, vol. 59,
no. 12, pp. 853-857, Dec 2012.
[34] V. Chandra, C. Pietrzyk, and R. Aitken, "On the efficacy of write-assist techniques in low voltage nanoscale SRAMs," in Design, Automation Test in Europe
Conference Exhibition (DATE), 2010, March 2010, pp. 345-350.
[35] E. Karl, Y. Wang, Y.-G. Ng, Z. Guo, F. Hamzaoglu, M. Meterelliyoz, J. Keane,
U. Bhattacharya, K. Zhang, K. Mistry, and M. Bohr, "A 4.6 GHz 162 Mb SRAM
Design in 22 nm Tri-Gate CMOS Technology With Integrated Read and Write
Assist Circuitry," Solid-State Circuits, IEEE Journal of, vol. 48, no. 1, pp. 150-158,
Jan 2013.
[36] M. Yamaoka, K. Osada, and K. Ishibashi, "0.4-V logic library friendly SRAM
array using rectangular-diffusion cell and delta-boosted-array-voltage scheme,"
in VLSI Circuits Digest of Technical Papers, 2002. Symposium on, June 2002,
pp. 170-173.
[37] K. Zhang, U. Bhattacharya, Z. Chen, F. Hamzaoglu, D. Murray, N. Vallepalli,
Y. Wang, B. Zheng, and M. Bohr, "A 3-GHz 70MB SRAM in 65nm CMOS
technology with integrated column-based dynamic power supply," in Solid-State
Circuits Conference, 2005. Digest of Technical Papers. ISSCC. 2005 IEEE
International, Feb 2005, pp. 474-611 Vol. 1.
[38] A. Bhavnagarwala, S. Kosonocky, C. Radens, Y. Chan, K. Stawiasz, U. Srinivasan, S. P. Kowalczyk, and M. Ziegler, "A Sub-600-mV, Fluctuation Tolerant
65-nm CMOS SRAM Array With Dynamic Cell Biasing," Solid-State Circuits,
IEEE Journal of, vol. 43, no. 4, pp. 946-955, April 2008.
[39] N. Planes et al., "28nm FDSOI technology platform for high-speed low-voltage
digital applications," in VLSI Technology (VLSIT), 2012 Symposium on, June
2012, pp. 133-134.
[40] M. Yamaoka, R. Tsuchiya, and T. Kawahara, "SRAM Circuit With Expanded
Operating Margin and Reduced Stand-By Leakage Current Using Thin-BOX
FD-SOI Transistors," Solid-State Circuits, IEEE Journal of, vol. 41, no. 11, pp.
2366-2372, Nov 2006.
[41] M. Sinangil and A. Chandrakasan, "Application-Specific SRAM Design Using
Output Prediction to Reduce Bit-Line Switching Activity and Statistically Gated
Sense Amplifiers for Up to 1.9× Lower Energy/Access," Solid-State Circuits,
IEEE Journal of, vol. 49, no. 1, pp. 107-117, Jan 2014.
[42] J.-T. Wu and K.-L. Chang, "MOS charge pumps for low-voltage operation,"
Solid-State Circuits, IEEE Journal of, vol. 33, no. 4, pp. 592-597, Apr 1998.
[43] P. Feng, Y.-L. Li, and N.-J. Wu, "An improved charge pump circuit for
non-volatile memories in RFID tags," in Proc. IEEE 10th ICSICT, Nov. 2010, pp.
363-365.
[44] W. Jung et al., "A 3nW fully integrated energy harvester based on self-oscillating
switched-capacitor DC-DC converter," in Proc. IEEE ISSCC, Feb. 2014, pp.
398-399.
[45] M. D. Seeman, "A Design Methodology for Switched-Capacitor DC-DC
Converters," Ph.D. dissertation, EECS Department, University of California,
Berkeley, May 2009. [Online]. Available:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-78.html
[46] A. Biswas, M. Kar, and P. Mandal, "Techniques for reducing parasitic loss in
switched-capacitor based DC-DC converter," in Proc. IEEE 28th APEC, Mar.
2013, pp. 2023-2028.
[47] P. R. Kumar, K. Bhattacharyya, T. Das, and P. Mandal, "Improvement of power
efficiency in switched capacitor dc-dc converter by shoot-through current elimination," in Proceedings of the 14th ACM/IEEE international symposium on Low
power electronics and design, ser. ISLPED '09, 2009, pp. 81-86.
[48] D. Somasekhar et al., "Multi-phase 1GHz voltage doubler charge-pump in 32nm
logic process," in Proc. IEEE Symp. VLSI Circuits, Jun. 2009, pp. 196-197.