Energy-Efficient SRAM Design in 28nm FDSOI Technology

by

Avishek Biswas

B.Tech. (Hons.), Indian Institute of Technology, Kharagpur (2012)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2014

© Massachusetts Institute of Technology 2014. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 21, 2014

Certified by: Anantha P. Chandrakasan, Joseph F. and Nancy P. Keithley Professor of Electrical Engineering, Thesis Supervisor

Accepted by: Leslie A. Kolodziejski, Chairman, Department Committee on Graduate Theses

Abstract

As CMOS scaling continues into the sub-32nm regime, the effects of device variations become more prominent. This is especially critical in SRAMs, which use very small transistor dimensions to achieve high memory density. The conventional 6T SRAM bit-cell, which provides the smallest cell area, fails to operate at lower supply voltages (Vdd). This is due to the significant degradation of functional margins as the supply voltage is scaled down. However, Vdd scaling is crucial in reducing the energy consumption of SRAMs, which is a significant portion of the overall energy consumption in modern microprocessors. Energy savings in SRAM are particularly important for battery-operated applications, which run from a very constrained power budget.
This thesis focuses on energy-efficient 6T SRAM design in a 28nm FDSOI technology. Significant savings in energy/access of the SRAM are achieved using two techniques: Vdd scaling and data prediction. A 200mV improvement in the minimum SRAM operating voltage (Vdd,min) is achieved by using dynamic forward body-biasing (FBB) on the NMOS devices of the bit-cell. The overhead of dynamic FBB is reduced by implementing it row-wise. Layout modifications are proposed to share the body terminals (n-wells) horizontally, along a row. Further savings in energy/access are achieved by incorporating data prediction in the 6T read path, which reduces bit-line switching. The proposed techniques are implemented for a 128Kb 6T SRAM, designed in a 28nm FDSOI technology.

This thesis also presents a reconfigurable, fully-integrated, switched-capacitor based step-up DC-DC converter, which can be used to generate the body-bias voltage for an SRAM. Three reconfigurable conversion ratios of 5/2, 2/1 and 3/2 are implemented in the converter. It provides a wide range of output voltage, 1.2V-2.4V, from a fixed input of 1V. The converter achieves a peak efficiency of 88%, using only on-chip MOS and MOM capacitors, for a high density implementation.

Thesis Supervisor: Anantha P. Chandrakasan
Title: Joseph F. and Nancy P. Keithley Professor of Electrical Engineering

Acknowledgments

I would first like to thank my advisor, Prof. Anantha Chandrakasan, for giving me the opportunity to be part of his wonderful research group at MIT. He has been an incredible mentor and has motivated me to think and analyze more critically. He was kind enough to give me the flexibility to work in the different areas I am interested in. Thank you Anantha for providing me with various opportunities and guiding me throughout. I am looking forward to working on various exciting projects during my PhD and I consider myself very fortunate to have you as my advisor.
Next, I would like to thank all the members of the Ananthagroup for being such friendly and welcoming people. Thank you Yildiz for all your help with my first test chip tapeout and testing. You have been really encouraging and patient to answer my doubts regarding SRAM design. Thanks Chu, for all the help and discussions on the SRAM project. Thanks to Arun and Nachiket for always being patient and kind enough to answer my doubts regarding Cadence simulations and other stuff. Thanks Saurav and Dina for sharing your expertise on switched-capacitor DC-DC converters with me. Thanks to our administrative assistant, Margaret for always greeting us with a smile and being helpful with various logistics. I am also lucky to have so many good friends and colleagues at MIT, who have made my life here enjoyable. I would also like to thank ST Microelectronics and Andreia Cathelin for their generous support with chip fabrication and DARPA for funding the projects.

Last, but certainly not the least, I would like to thank my parents, my younger brother and my family for their consistent support and belief in me. Words are not good enough to describe all the sacrifice and hard work my parents have put in, to get me to where I am today. I am waiting to see that proud smile on their faces when they would see my SM degree from MIT.

Contents

1 Introduction
  1.1 Motivation for Low-voltage SRAM
  1.2 Recent 6T SRAM designs in sub-45nm CMOS
  1.3 Advantages of FDSOI technology
  1.4 Thesis Contributions
2 Background of 6T SRAM design
  2.1 6T SRAM Bit-cell Operation
    2.1.1 Data Retention
    2.1.2 Read Operation
    2.1.3 Write Operation
  2.2 6T SRAM Functional Margin Issues and the effect of Vdd Scaling
    2.2.1 Static Noise Margin
    2.2.2 Write Margin
    2.2.3 Dynamic Read Margin
    2.2.4 Effect of Vdd scaling on noise margins
  2.3 Conventional Assist Techniques
    2.3.1 Read Assists in Previous Works
    2.3.2 Write Assists in Previous Works
3 6T SRAM design in 28nm FDSOI
  3.1 Forward Body-Biasing (FBB)
    3.1.1 FBB as Write Assist
    3.1.2 Read-Stability Issues and Dynamic FBB
  3.2 Energy-efficient Implementation of D-FBB
  3.3 Hierarchical BL structure and Data Prediction
    3.3.1 Dynamic Read Margin
    3.3.2 Hierarchical Read Path
    3.3.3 Using Data Prediction in 6T SRAM
  3.4 Overall Array Architecture
  3.5 Energy Savings
    3.5.1 Due to Vdd Scaling
    3.5.2 Using Data Prediction
4 Reconfigurable Body-Bias Generator in 28nm FDSOI
  4.1 Brief overview of SC converters
  4.2 Reconfigurable Step-up SC Module
  4.3 MOS Implementation of the sub-module
  4.4 Overall System Architecture
  4.5 Results
5 Conclusions
  5.1 Summary of contributions
  5.2 Energy-efficient 6T SRAM design
  5.3 Reconfigurable Step-up SC DC-DC Converter
  5.4 Future Work
A Energy Model of the 28nm FDSOI 128Kb SRAM Macro
  A.1 Dynamic Energy Consumption
  A.2 Leakage Energy Consumption

List of Figures

1-1 SRAM in embedded memory hierarchy [1].
1-2 General trend of cache size. [source: ISSCC 2013 Trends]
1-3 Scaling trends for SRAM bit-cell size and operating Vdd. [source: ISSCC 2013 Trends]
1-4 Conventional 6T SRAM bit-cell
1-5 UTBB FDSOI vs. Bulk body-biasing structure (shown for NMOS) [2]
2-1 Conventional SRAM array architecture and 6T bit-cell.
2-2 (a) 6T bit-cell in data retention mode, (b) Bit-cell flips when Vdd goes below data retention voltage.
2-3 (a) 6T bit-cell during a read operation, (b) Waveforms showing a "read disturb" for a minimum sized bit-cell. Bit-cell flips since the disturbance at N1 is large enough to trip the inverter (PU2, PD2).
2-4 (a) 6T bit-cell during a write operation, (b) Waveforms during a write operation for two different γ-ratios: WPG/WPU = 1.25 and WPG/WPU = 1. Write failure occurs when the γ-ratio is not high enough to lower the potential of N2 below the VTRIP of the PU1-PD1 inverter.
2-5 (a) Schematic to evaluate SNM (b) Graphical representation of SNM. The noise voltage Vn shifts VTC1 vertically and VTC2 horizontally, until they intersect at only one stable point when Vn = VSNM.
2-6 Schematic setup to evaluate static WRM. Static WRM is defined as the difference between Vdd and VWL, at which internal nodes (N1 and N2) flip to write the new data.
2-7 Schematic setup to evaluate Dynamic Read Margin.
2-8 SNM and WRM dependence on Vdd
2-9 Conventional read assist techniques.
2-10 Conventional write assist techniques.
3-1 Cross-sectional view and circuit symbols of the LVT transistors used in the 6T SRAM design [3].
3-2 6T bit-cell with forward body-biasing applied during a write operation.
3-3 WRM improvement as a function of the applied forward bias voltage (VFBB), at Vdd=0.4V
3-4 Improvement in Write Margin by 1V FBB in the Vdd range of 0.4V-1V. The Vdd,min at 5.5σ is improved from 600 mV to 400 mV (worst process corner and temperature).
3-5 (a) 6T bit-cell during a read operation under DC forward body-bias (b) Delayed FBB to reduce read-stability issues.
3-6 Proposed layout of a single row, showing row-wise sharing of n-wells, BL sharing between adjacent columns and multiple WLs per row. (not to scale)
3-7 Circuit implementation of the proposed row-wise forward body-biasing technique (hFBB).
3-8 Dynamic read margin of the 6T bit-cell as a function of Vdd, for different values of NOBC (number of bit-cells per local BL).
3-9 Hierarchical bit-line structure used in this work to improve read-stability at low Vdd levels. The read path from the local BL to the global BL is also shown.
3-10 Prediction architecture used in this design.
3-11 Array architecture for the 28nm FDSOI 128Kb 6T-SRAM, which incorporates row-wise body-biasing and data prediction to reduce energy consumption.
3-12 Energy savings due to Vdd scaling using 1V dynamic FBB. The energy reduction with the proposed row-wise FBB implementation (hFBB) is compared to a conventional implementation, for two different read-to-write ratios.
3-13 Energy savings by using data prediction during a read operation at Vdd=400mV, as compared to a conventional 6T read.
4-1 Basic 2:1 step-up SC converter, along with its idealized 2-port model.
4-2 Reconfigurable step-up switched capacitor module.
4-3 Operation of the proposed and conventional topologies in 5/2 mode.
4-4 Simulated performance comparison of the ideal converter for the proposed and conventional topologies in 5/2 mode.
4-5 Operation of the converter in 2/1 and 3/2 modes.
4-6 MOS implementation of the switches.
4-7 Reconfigurable gate drive circuits for a sub-module.
4-8 Overall system architecture with die photo.
4-9 Measured performance of the converter.

List of Tables

4.1 Performance comparison with previous works
A.1 Array organization for both the implementations

Chapter 1

Introduction

With the tremendous increase in the usage of battery-operated portable electronics and the advent of new and promising applications, such as biomedical monitoring and wireless sensor nodes, the demand for low-power and energy-efficient circuits has been increasing in modern Systems-on-a-Chip (SoCs). Energy efficiency of circuits directly translates into a longer battery life, which is crucial for these applications.

1.1 Motivation for Low-voltage SRAM

Static Random Access Memories (SRAMs) are the most popular type of embedded memory and one of the most critical building blocks in modern SoCs. SRAM has been predominantly used for register files and L1-L3 cache memories in the embedded memory hierarchy (Figure 1-1). This is primarily because SRAMs offer the best access-speed performance among embedded memory technologies [1]. Furthermore, SRAMs are fully compatible with modern CMOS processes and operating voltages, and hence can be easily integrated with logic circuits.

With the continuous scaling of CMOS technology and the integration of multiple processing cores on-chip, the demand for memory capacity and bandwidth has increased considerably over the years. Figure 1-2 shows the general trend of increasing cache size in modern microprocessors, which can be as high as 54MB on a single die [4]. Therefore, SRAMs account for a large fraction of the total power consumption of the chip. Hence, low-power SRAM design is a very active research area. Power savings in SRAM are particularly critical in battery-operated mobile and hand-held applications, where the power budget is very constrained.

Figure 1-1: SRAM in embedded memory hierarchy [1].

Figure 1-2: General trend of cache size. [source: ISSCC 2013 Trends]

Dynamic Voltage Scaling (DVS) has been proven to be an effective way to reduce energy consumption of circuits [5, 6].
Decreasing the supply voltage (Vdd) provides savings in the dynamic energy consumption (∝ Vdd²), as well as a reduction in the leakage power consumption, at the expense of slower performance. Meanwhile, with the scaling of device dimensions into the sub-65nm regime, the variation in transistor threshold voltage (Vt) has become more severe. Since the SRAM bit-cell size shrinks aggressively with every technology node, the effect of random Vt variation makes it extremely challenging to reduce the Vdd of SRAMs while maintaining a sufficient stability margin for the bit-cell. Figure 1-3 shows the recent trend in SRAM bit-cell size and operating Vdd. As seen from the figure, Vdd scaling has been essentially stagnant in sub-65nm CMOS processes. Hence, new and improved read and write assist techniques are being actively researched as design solutions to reliably reduce the minimum operating voltage (Vdd,min) of SRAMs. Additionally, newer transistor structures, such as FDSOI [2, 7] and FinFET [8, 9], are emerging as replacements for planar bulk devices, to reduce device variations and further improve SRAM Vdd,min.

Figure 1-3: Scaling trends for SRAM bit-cell size and operating Vdd. [source: ISSCC 2013 Trends]

1.2 Recent 6T SRAM designs in sub-45nm CMOS

The six-transistor (6T) bit-cell (Figure 1-4) has been the industry-wide preferred choice for high density SRAMs [10], since it provides the smallest cell area and has a very compact and lithography-friendly "thin-cell" layout [11]. Figure 1-4 shows the conventional 6T bit-cell, with the cross-coupled inverter (PU-PD) pair and the two access (PG) devices. The 6T bit-cell is a ratioed circuit, and the transistor sizing requirements for improving the read and write operations conflict with each other. In addition, it uses transistors with close to minimum device features, to provide high memory density.
Hence, there is limited scope for improving the bit-cell's noise margin by only optimizing the transistor sizing. Therefore, peripheral assist techniques are required [9] to reduce the failure probability of the 6T bit-cell and overcome the challenges to high yield.

Figure 1-4: Conventional 6T SRAM bit-cell

This section summarizes recent works in 6T SRAM that are designed in sub-45nm CMOS processes, in which the effect of device variation is more pronounced. [10] demonstrates a 64Mb SRAM in a 32nm SOI process that works down to 0.7V, using a bit-line (BL) regulation scheme during a read operation and a negative bit-line (NBL) technique to improve the write operation. A 0.6V SRAM in a 28nm bulk process was shown in [12], which uses a delayed word-line (WL) boosting scheme to improve write-ability and a hierarchical BL architecture to improve read-stability. A multi-step WL control coupled with a hierarchical BL structure is proposed in [13] to improve the read-stability of a 40nm 2Mb 6T SRAM. [14] uses a dynamic forward body-bias on the PU PMOS transistors of a 6T bit-cell to improve the read margin. A 75mV improvement in Vdd,min was achieved for a 153Mb SRAM designed in a 45nm bulk process. [15] uses a WL suppression scheme with replica cell transistors and passive resistances to improve read-stability, for a 45nm 4.5Mb SRAM working down to 0.7V. [16] uses a lower cell-VDD technique along with NBL to improve the write margin, for a 0.6V 256Kb 45nm SRAM. [17] implements a partially suppressed WL scheme as read-assist and a BL-length-tracked NBL scheme as write-assist for a 112Mb SRAM in a 20nm bulk process, to achieve a 200mV improvement in the SRAM Vdd,min. [18] presents a 20nm 128Kb SRAM which achieves 0.6V operation and 20µW/MHz active power consumption, with an interleaved WL and a hierarchical BL scheme.
[19] implements a charge-sharing scheme to reduce excessive BL discharge during a read operation at low Vdd, which happens due to increasing random Vt variations. Designed in a 40nm low standby-power technology, it achieves an energy consumption of 13.8pJ/access/Mbit for a 1Mb SRAM, which is considerably lower than 36pJ [16], 46.7pJ [15] and 50pJ [13] (all normalized to 1Mb).

For 28nm and beyond, fully-depleted device structures, namely FinFETs and FDSOI, have recently gained significant momentum as a way to continue CMOS scaling following Moore's law [20], which has become very difficult to maintain using planar bulk processes. A 128Mb 6T SRAM, designed in a 14nm FinFET technology, is presented in [9], which operates down to 0.48V with a high-performance (HP) bit-cell. It implements a technique to partially discharge the pre-charged BLs before the WL is asserted, to reduce the read noise injected into the bit-cell. It avoids the requirement to generate a separate lower pre-charge voltage, which needs precise regulation ([10]) so as not to disturb the write operation. Additionally, an NBL scheme is used to improve the write margin for a high density (HD) bit-cell. [21] also implements an NBL scheme and a lower cell-VDD (LCV) technique to reduce the Vdd,min of an HD bit-cell, designed in a 16nm FinFET technology. [22] analyzes 6T SRAM design for a 28nm FDSOI process. It uses a single p-well architecture, particular to FDSOI, with a high density bit-cell. Simulation results suggest 0.65V operation with 128 bit-cells per column and no assist techniques.

1.3 Advantages of FDSOI technology

Fully Depleted Silicon On Insulator (FDSOI) offers excellent short-channel electrostatic control, reduced source/drain capacitances, lower leakage currents and significantly reduced random dopant fluctuations (RDF) as compared to a bulk process, for 28nm and beyond [2, 7, 23].
The ultra-thin dielectric buried-oxide (BOX) layer electrically isolates the source and drain of the planar transistor from its well and substrate. FDSOI also features a back-plane (BP) doping (either 'n' or 'p' type) underneath the BOX layer. This is independent of the transistor type (PMOS or NMOS) and results in two distinct Vt flavors, Regular-VT and Low-VT [3].

Figure 1-5: UTBB FDSOI vs. Bulk body-biasing structure (shown for NMOS) [2]. Panel annotations: Bulk, VFBB between 0V and +0.3V; UTBB-FDSOI, VFBB between 0V and +3V.

The ultra-thin BOX layer enables a wide body-biasing range, in addition to improving the body-biasing efficiency. As shown in Figure 1-5, in a bulk process a forward body bias (FBB) of only +300mV can be applied, so that the n+-diffusion to p-substrate diode is not turned on. However, in FDSOI, an FBB of up to +3V can be applied [2, 24], due to the BOX isolation of the source/drain from the substrate. Furthermore, due to its superior electrostatics, Ultra Thin Body and Buried-oxide (UTBB) FDSOI exhibits a higher body-factor than its bulk counterpart, 85mV/V vs. 25mV/V [2]. This provides the flexibility to efficiently apply body-bias to target transistors to improve their performance (by FBB) or reduce their leakage (by reverse body bias or RBB) [24, 25]. We extensively use these improved features of UTBB FDSOI in this work, to achieve better performance and higher energy-efficiency.

1.4 Thesis Contributions

This thesis primarily focuses on energy-efficient 6T SRAM design in an FDSOI process. Forward body-biasing (FBB) is investigated as a write-assist technique to reduce the operating voltage (Vdd) of the SRAM. Vdd scaling provides significant energy savings by decreasing the dynamic energy consumption. Furthermore, data prediction is used during a read operation to obtain additional energy savings.

Chapter 3 presents a 128Kb 6T SRAM designed in a 28nm FDSOI process.
The SRAM uses FBB to improve write-ability at low Vdd levels. FBB is applied dynamically (i.e. only during a write operation), so that the read-stability of the bit-cell is not degraded. The dynamic FBB is implemented in a row-by-row manner, to reduce the energy overhead associated with it. To enable row-wise dynamic FBB, a layout technique is proposed to share the n-wells horizontally, across all the bit-cells in a row. Next, a hierarchical bit-line architecture that incorporates data prediction in 6T SRAM is presented. A correct data prediction avoids the discharge of the long global bit-lines, providing a significant reduction in dynamic energy consumption. Finally, the energy savings obtained by Vdd scaling and by using data prediction are quantified using an energy model for the SRAM, which is developed in Appendix A.

The second part of the thesis (Chapter 4) presents a switched-capacitor (SC) based step-up DC-DC converter, which can be used to generate the body-bias voltage for SRAMs. The reconfigurable step-up converter implements three conversion ratios of 5/2, 2/1 and 3/2. It provides a wide range of output voltage, from 1.2V to 2.4V, from a 1V input. The converter has been designed to obviate the need for high voltage I/O transistors, which would otherwise have degraded the efficiency owing to their higher on-resistance and parasitic capacitance. Furthermore, a new topology is proposed for the 5/2 mode, which improves efficiency by reducing the bottom-plate parasitic loss as compared to a conventional series-parallel topology [26]. The converter has been implemented in a 28nm FDSOI process, using only on-chip MOS and MOM capacitors, which do not require any extra fabrication steps, unlike MIM [27] and trench [28] capacitors. Measurement results show that the converter achieves a peak efficiency of 88% in the 2/1 mode.
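
To put rough numbers on the quantities quoted in this chapter, the following is a minimal first-order sketch, not the energy model of Appendix A. It assumes switching energy follows E = C·Vdd² with a fixed switched capacitance, a constant body factor, and an ideal (lossless, unloaded) switched-capacitor converter. The 0.6V to 0.4V operating points and the 1V FBB level are taken from the abstract and the list of figures; all function names are illustrative only.

```python
from fractions import Fraction

# Conversion ratios named in the text (5/2, 2/1 and 3/2).
CONVERSION_RATIOS = (Fraction(5, 2), Fraction(2, 1), Fraction(3, 2))


def dynamic_energy_ratio(vdd_old, vdd_new):
    """Switching energy after Vdd scaling, relative to before.

    Assumes E = C * Vdd**2 with the same switched capacitance C.
    Leakage and assist-circuit overheads are ignored (idealization).
    """
    return (vdd_new / vdd_old) ** 2


def vt_shift(body_factor, v_body_bias):
    """First-order threshold-voltage shift in volts.

    Assumes a constant body factor (in V/V) over the bias range.
    """
    return body_factor * v_body_bias


def ideal_sc_output(v_in, ratio):
    """No-load output voltage of a lossless step-up SC converter."""
    return v_in * float(ratio)


if __name__ == "__main__":
    ratio = dynamic_energy_ratio(0.6, 0.4)
    print(f"switching energy at 0.4V relative to 0.6V: {ratio:.3f}")
    print(f"Vt shift for 1V FBB at 85mV/V: {vt_shift(0.085, 1.0) * 1e3:.0f} mV")
    for r in CONVERSION_RATIOS:
        v_out = ideal_sc_output(1.0, r)
        print(f"mode {r.numerator}/{r.denominator}: ideal output {v_out:.1f} V")
```

Under these assumptions, scaling Vdd from 0.6V to 0.4V cuts the switching energy to roughly 44% of its original value, and the three modes give ideal outputs of 2.5V, 2.0V and 1.5V from a 1V input. The 1.2V-2.4V range quoted in the text belongs to the real converter, which is not lossless, so it need not coincide with these ideal values.
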

Chapter 2

Background of 6T SRAM design

This chapter presents a brief overview of the basic 6T bit-cell operation, the associated functional margin issues and the concept of assist techniques. The six-transistor (6T) bit-cell has been the workhorse of SRAM design, owing to its small cell area and compact layout, which result in high density memory arrays. As shown in Figure 2-1, each row of bit-cells shares a common word-line (WL), while a pair of bit-lines (BL and BLB) is shared by multiple bit-cells in a column. The number of bit-line pairs (n) and the bit-width (m) of a single word determine the column select ratio of n-to-m.

2.1 6T SRAM Bit-cell Operation

A conventional 6T SRAM bit-cell (shown in Figure 2-1) consists of two back-to-back inverters (comprised of PU1, PD1 and PU2, PD2) and two access transistors (PG1 and PG2). The inverter pair is cross-coupled such that the output of one goes to the input of the other, and vice-versa. The resulting positive feedback of the inverter pair can hold the desired data (state "1" or "0") indefinitely at the internal nodes (N1 and N2), as long as the SRAM is powered up and the access transistors are turned off. Access transistors are only turned on during read and write operations, to connect the internal data nodes to the bit-lines (BL and BLB).

Figure 2-1: Conventional SRAM array architecture and 6T bit-cell.

2.1.1 Data Retention

Figure 2-2: (a) 6T bit-cell in data retention mode, (b) Bit-cell flips when Vdd goes below data retention voltage.

The 6T bit-cell is in data retention mode when the word-line (WL) is turned off (Figure 2-2(a)).
The cross-coupled inverter pair creates a positive feedback loop that preserves the internal data nodes, without any disturbance from the bit-lines through the pass-gate (PG) transistors. However, below a certain level of supply voltage, the inverters can no longer hold the state and the internal data nodes might flip. This minimum required level, known as the data retention voltage (VDRV), is typically below the threshold voltage (Vt) of the transistors. As seen from the simulation waveforms in Figure 2-2(b), the bit-cell can no longer retain its original state when the cell supply voltage (Vdd) goes down to 150mV, which is lower than the VDRV of the bit-cell.

2.1.2 Read Operation

The read operation starts with pre-charging the bit-line pair (BL and BLB) to a known voltage (typically Vdd). The bit-lines are then kept floating and the word-line (WL) is asserted. Depending on the data stored, one of the bit-lines (BL or BLB) starts discharging through the pass-gate (PG) and pull-down (PD) NMOS transistors, connected in series (Figure 2-3(a)). The bit-line differential voltage is sensed by a sense-amplifier to output the data.

Figure 2-3: (a) 6T bit-cell during a read operation, (b) Waveforms showing a "read disturb" for a minimum sized bit-cell. Bit-cell flips since the disturbance at N1 is large enough to trip the inverter (PU2, PD2).

During the read operation, the discharging current flows from the bit-line to the cell ground, on the side of the bit-cell storing a "0". This leads to an increase in the potential of the corresponding internal node (N1 in Figure 2-3(a)), and the amount of disturbance depends on the drive strengths of the PG and PD NMOS transistors. If this increased voltage goes above the trip-point (VTRIP) of the connected inverter, the stored data is flipped.
This event is known as 'read disturb' and it is shown in Figure 2-3(b). In order to prevent this, the PD NMOS needs to be stronger than the PG NMOS. The ratio of their drive strengths is known as the β-ratio, which is an important SRAM design parameter. Careful sizing of the NMOS transistors is required to achieve the desired β-ratio, which ensures successful read operations (i.e. without any 'read disturbs').

2.1.3 Write Operation

The write operation starts with driving the bit-lines to the data value to be written. The WL is then asserted, turning on the PG transistors.

Figure 2-4: (a) 6T bit-cell during a write operation, (b) Waveforms during a write operation for two different γ-ratios: WPG/WPU = 1.25 and WPG/WPU = 1. Write failure occurs when the γ-ratio is not high enough to lower the potential of N2 below the VTRIP of the PU1-PD1 inverter.

If the data to be written is opposite to the previously stored state (as shown in Figure 2-4(a)), the potential of the high internal node is lowered by an amount that depends on the drive strengths of the pull-up (PU) PMOS and the PG NMOS transistors. The ratio of the drive strengths of the PG and PU transistors is known as the γ-ratio and it is another important parameter in SRAM design. The transistors need to be sized carefully so that the γ-ratio is high enough to lower the potential of the high internal node below the VTRIP of the connected inverter. As shown in Figure 2-4(b), a low γ-ratio can lead to a write failure.

2.2 6T SRAM Functional Margin Issues and the effect of Vdd Scaling

The three main SRAM operations discussed above, viz. data retention (or hold), read and write, are characterized by their respective functional margins.
In sub-65nm CMOS technologies, the increasing amount of threshold voltage (Vt) variation (due to random dopant fluctuations) severely degrades the functional margins and limits the minimum SRAM operating Vdd. In this section the concepts of Static Noise Margin (SNM), Write Margin (WRM) and Dynamic Read Margin (DRM) are discussed, along with the effect of Vdd scaling on them.

2.2.1 Static Noise Margin

Static Noise Margin (SNM) is a widely used metric in SRAM design to characterize the bit-cell's stability. SNM is the maximum amount of noise voltage (Vn) which can be tolerated at both the inputs of the cross-coupled inverters (with opposite polarity), while retaining the cell data (Figure 2-5(a)). In other words, SNM quantifies the amount of noise voltage (Vn) required at the internal storage nodes of the bit-cell in order to flip the cell's data. The noise source (Vn) models any static disturbance arising out of Vt mismatches, variations in device geometries, noise injected from the BL through the PG transistor, etc.

Figure 2-5(b) shows the graphical representation of the SNM in read mode (i.e. when the WL is turned on). The two curves (VTC1 and VTC2) represent the voltage transfer characteristics of the two inverters (PU1, PD1 and PU2, PD2). The two curves intersect at three points: two stable states, representing the two possible data values stored (logic "1" or logic "0"), and one metastable state. The resulting two-wing curve is known as the "butterfly curve", which is used to graphically determine the SNM. The SNM is defined as the length of the side of the largest square (VSNM) which can be fit inside the smaller wing of the butterfly curve.

The SNM is worse for a read operation than for data retention mode. As explained in section 2.1.2, during a read operation the voltage of the node storing a "0" rises above the ground (GND) level, depending on the β-ratio.
With the scaling of device geometries to the nanometer regime, the variation in the device strengths, and hence in the β-ratio, becomes more severe. This causes significant degradation of the read-SNM. In addition, the local and global Vt variation of the PU and PD transistors shifts the inverters' VTCs, causing further SNM degradation. During data retention, the WL is turned off and hence there is no noise injected from the BLs. Therefore, the SNM in this case (i.e. hold-SNM) is much better than the read-SNM, which typically limits the minimum operating voltage of the 6T SRAM.

Figure 2-5: (a) Schematic to evaluate SNM (b) Graphical representation of SNM. The noise voltage Vn shifts VTC1 vertically and VTC2 horizontally, until they intersect at only one stable point when Vn = VSNM.

2.2.2 Write Margin

During a write operation, the bit-line pair, BL and BLB, is driven to the differential levels of "0" and "1" and the word-line is turned on (WL = "1"). For a successful write, the PG NMOS has to pull down the high internal node (storing a "1") below the trip-point of the connected inverter. This depends on the relative strengths of the PG and PU transistors, as explained in section 2.1.3. It also depends on the WL pulse width. If the WL is not kept high for a sufficient amount of time, then the write operation might fail, even though the bit-cell satisfies the required γ-ratio. Hence, ideally, write margin (WRM) should be simulated as a dynamic condition. [29] discusses the dynamic nature of write margin and compares it to various static methods of evaluating write margin. It was concluded that the WL sweeping method [30] provides the best estimate for static WRM, since it exhibits the best correlation with dynamic write margin, especially at lower supply voltages.

Figure 2-6: Schematic setup to evaluate static WRM.
Static WRM is defined as the difference between Vdd and VWL, at which internal nodes (N1 and N2) flip to write the new data.

Figure 2-6 shows the simulation setup to evaluate static WRM. The WLs are swept together from 0 to Vdd to replicate a real write operation, in which the WL drives both the PG transistors simultaneously. WRM is defined as the difference between Vdd and the WL voltage at which the cell flips its original state.

2.2.3 Dynamic Read Margin

The Seevinck method [31] is the traditional way of characterizing the read margin. This static method, explained in section 2.2.1, does not consider the dynamic effect of bit-line discharge during a read operation and hence provides a pessimistic estimate of the read margin. Recent works [12, 32] showed that reducing the number of bit-cells (BCs) in a column can significantly improve the read margin. This is because, with fewer BCs, the capacitance of the bit-lines (both device and parasitic components) is reduced. Therefore, when the WL is turned on, the BL discharges faster, which reduces the amount of charge injected into the internal node storing a "0". Hence, that node is less likely to exceed the trip-point of the connected inverter, preventing accidental flipping of the cell data.

Figure 2-7: Schematic setup to evaluate Dynamic Read Margin.

Figure 2-7 shows the simulation setup used to evaluate dynamic read margin (DRM). The BL capacitances, extracted from the layout, are initialized to Vdd at the beginning of the simulation. The DC noise voltage Vn is swept in consecutive transient simulations, until the cell data is flipped. The maximum value of Vn which does not cause a "read disturb" is defined as the DRM.

2.2.4 Effect of Vdd scaling on noise margins

In sub-65nm CMOS technologies, random Vt variation is exacerbated as the transistor feature size is scaled down to the nanometer regime.
As the channel length reduces to tens of nm, it becomes extremely difficult to precisely control the doping in the channel region. The effect of random Vt variation is more pronounced as Vdd scales down, since the overdrive voltage (Vdd - Vt) of the transistors is reduced. Thus, even though the required β and γ ratios are satisfied at higher Vdd values, they might not be sufficient for every bit-cell when Vdd approaches Vt. This causes SRAM functional failure, limiting the minimum achievable Vdd.

Figure 2-8: SNM and WRM dependence on Vdd.

Figure 2-8 shows the effect of reducing Vdd on the SNM and WRM of a 6T bit-cell that has been designed in a 28nm FDSOI process using regular-Vt transistors. As seen from the figure, the μ/σ ratio of both SNM and WRM decreases as Vdd is reduced. However, WRM exhibits a much stronger dependence on Vdd than SNM, especially at higher voltages. A μ/σ of more than 5 is typically required for high yield in large-sized SRAMs.

2.3 Conventional Assist Techniques

The issues of functional margin degradation with Vdd scaling have been addressed by using peripheral assist circuits, to aid the read and write operations [33, 34]. [33] defines three modes of SRAM functional failure: read-ability, write-ability and read-stability. Assuming a differential sense-amplifier (SA) based read operation for 6T bit-cells, a read-ability failure occurs if the BL differential voltage (when the SA is triggered) is less than the offset voltage of the SA. A write-ability failure occurs when the desired data cannot be written by the end of the WL pulse. Read-stability failures can happen if the selected bit-cell (BC) data or the half-selected BC data is accidentally flipped during a read or write operation, respectively.
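To make the μ/σ requirement quoted at the end of section 2.2.4 more concrete, the sketch below converts a μ/σ value into a per-cell failure probability and an array-level yield. It is only an illustration under stated assumptions: the margin is taken to be Gaussian, a cell fails when its margin drops below zero, cells fail independently, and no redundancy or repair is modeled. The function names are hypothetical, and the 128Kb array size is the one used later in this thesis.

```python
import math

def cell_fail_probability(mu_over_sigma):
    """One-sided Gaussian tail: P(margin < 0) for margin ~ N(mu, sigma^2)."""
    return 0.5 * math.erfc(mu_over_sigma / math.sqrt(2.0))

def array_yield(mu_over_sigma, n_cells):
    """Probability that none of n_cells independent cells fails (assumed model)."""
    p_fail = cell_fail_probability(mu_over_sigma)
    # log1p keeps precision when p_fail is very small.
    return math.exp(n_cells * math.log1p(-p_fail))

if __name__ == "__main__":
    n_cells = 128 * 1024  # 128Kb array
    for k in (4.0, 5.0, 5.5):
        print(f"mu/sigma = {k}: p_fail = {cell_fail_probability(k):.3e}, "
              f"yield = {array_yield(k, n_cells):.4f}")
```

Under these assumptions the yield of a large array drops quickly once μ/σ falls below about 5, which is consistent with the rule of thumb quoted above.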
This section summarizes common assist techniques used to improve the read-ability, write-ability and read-stability of 6T SRAMs.

2.3.1 Read Assists in Previous Works

Figure 2-9 shows the waveforms for the different read assist techniques used in previous works. Most of these techniques aim at improving the β ratio by making the PG NMOS weaker or the PD NMOS stronger.

Figure 2-9: Conventional read assist techniques.

For the word-line underdrive (WLUD) technique [17, 35, 15], the gate-bias of the PG transistor is reduced, making it weaker than the PD transistor and improving read-stability. However, a reduced gate-drive decreases the BL discharge current, making read-ability worse.

A reduced BL pre-charge (PCH) level can help in improving read-stability, without degrading the drive strength of the PG device (assuming a negligible effect of VDS). The work in [10] demonstrated a yield increase from 5 to 5.7 sigma, by pre-charging BLs to approximately 70% of Vdd. This technique has the overhead of generating and regulating the reduced PCH voltage. Precise regulation of the PCH voltage is necessary, since a low PCH level can create a pseudo-write operation scenario, which can overwrite the existing cell data.

Boosting the cell VDD makes the PD NMOS stronger than the PG device, improving read-stability. Driving the cell Vss to a negative voltage level simultaneously improves the strengths of the PG and PD devices. Hence the BL discharge current is increased, improving read-ability. These techniques can be implemented in a row-by-row [36] or column-by-column [18, 37] fashion.

2.3.2 Write Assists in Previous Works

Figure 2-10 shows the waveforms for the conventional write assist techniques, which attempt to improve the γ ratio or weaken the bistability of the cross-coupled inverters.
Figure 2-10: Conventional write assist techniques.

The word-line boosting technique improves write-ability by increasing the gate-drive of the PG device, making it stronger than the PU PMOS. However, this has a detrimental effect on the read-stability of the half-selected bit-cells (i.e. bit-cells in the selected row and unselected columns) in a column-interleaved array. [12] addresses this issue by delaying the boosting phase with respect to WL turn-on. Hence, the half-selected bit-cells have already started reading and the BL voltage has sufficiently reduced when the WL boosting is applied. This technique incurs the area overhead of generating the boosted WL voltage.

The negative BL technique [10, 15, 9, 16] increases the gate-drive (VGS) of the PG transistor by reducing its source voltage and hence improves write-ability. However, by decreasing the potential of one of the BLs below GND, there is a non-zero VGS across the PG devices in unaccessed rows. If the internal node of an unaccessed bit-cell on this side is "1", then there is a chance of unintentional over-writing of that cell data. The non-zero VGS also results in increased leakage from the PG devices and causes partial loss of the boost signal. This technique is also susceptible to voltage overstress in the write path at higher Vdd values [10, 17].

The 'Cell-VDD collapse' [35, 16] and 'Cell-Vss boost' [38] techniques decrease the strength of the cross-coupled inverter pair holding the data and hence improve write-ability. However, the effect is much weaker [33] than that of the 'WL boost' and 'Negative BL' techniques, since the PG transistor's strength is not improved. Furthermore, these techniques, when implemented column-wise, run the risk of violating the data retention voltage for unaccessed bit-cells, which can cause accidental loss of cell data.
If they are instead implemented row-wise, the read-stability of the half-selected bit-cells is degraded.

Chapter 3
6T SRAM design in 28nm FDSOI

This chapter focuses on a low-voltage, energy-efficient 6T SRAM design in a 28nm FDSOI process. Forward body-biasing (FBB) is investigated as a write-assist technique, to reduce the SRAM operating voltage and provide energy savings. The proposed implementation of FBB significantly reduces its energy overhead as compared to a conventional implementation. Furthermore, data prediction is incorporated in the read path, to obtain additional energy savings.

3.1 Forward Body-Biasing (FBB)

Forward body-biasing of an NMOS device refers to applying a positive body-to-source voltage (VBS > 0V) across it. This reduces the threshold voltage of the NMOS, since the body terminal acts like a second gate [2]. FDSOI offers the unique feature of applying FBB on NMOS devices without the need for a triple-well structure, which would be required in a bulk CMOS process. This is possible because of the electrical isolation of the source/drain of the transistor from the well/substrate by a buried-oxide (BOX) layer. The ultra-thin BOX layer also improves the body-biasing efficiency [2, 7] as compared to PDSOI, which features a thicker BOX layer.

In this work, we use the LVT flavor of the FDSOI transistors [3], which is characterized by NMOS devices on n-well and PMOS devices on p-well, as shown in Figure 3-1. Since the n-well bias (GNDS) is controlled independently, the NMOS devices can be selectively forward body-biased, reducing their threshold voltage. The PMOS devices are already in FBB mode, since their body terminal is at 0V (same as the p-substrate).

Figure 3-1: Cross-sectional view and circuit symbols of the LVT transistors used in the 6T SRAM design [3]. (In the figure, the NMOS n-well bias is GNDS > 0V and the PMOS p-well bias is VDDS = 0V.)
3.1.1 FBB as Write Assist

Figure 3-2 shows the 6T bit-cell with FBB applied on the NMOS devices, during a write operation. FBB decreases the threshold voltage (Vtn) of the NMOS transistors. Hence, the NMOS access transistor (PG2) becomes stronger than the PMOS pull-up transistor (PU2). This helps in lowering the potential of the high internal node (N2), and therefore improves write-ability. An alternate way of improving write-ability, by reverse body-biasing (RBB) the PMOS, was not chosen. This is because the PU PMOS devices are already sized to be weaker than the PG NMOS devices and hence weakening them further has a much smaller effect on write-ability. Furthermore, a stronger PG NMOS helps to improve the write-speed as well.

It may be noted that applying FBB only to the PG NMOS writing a "0" is sufficient. However, as explained later, it is preferable to share the n-wells (GNDS1 and GNDS2) row-wise. Since a "0" can be written from either side, FBB has to be applied to both PG1 and PG2 to ensure write-ability improvement for all the bit-cells in a selected row. Applying a bias to both the n-wells in a bit-cell does not degrade write-ability. FBB affects both the PG and PD NMOS devices on the side of the bit-cell storing a "0". This is because the NMOS devices share a common n-well and have a source voltage (Vs) close to zero. Hence, the VBS applied is the same for both, which results in approximately equal Vtn modulation.

Figure 3-2: 6T bit-cell with forward body-biasing applied during a write operation. (Annotations in the figure: PG2 becomes stronger due to FBB (VBS = VFBB), so VN2 is lowered more easily; the VBS of both PG1 and PD1 equals VFBB, so FBB does not affect VN1 significantly.)

Figure 3-3 shows the improvement in write margin (WRM) at Vdd = 0.4V, as a function of the forward bias voltage (VFBB) applied.
As seen from the figure, the μ/σ of the write margin improves linearly with VFBB ('μ' and 'σ' are the mean and standard deviation, respectively, of the write margin distribution). This is primarily because of the linear dependence of the NMOS threshold voltage (Vtn) on the applied body-bias. It can be verified from the figure that applying FBB to all the NMOS devices in the bit-cell does not degrade WRM. In fact, there is a slight improvement in WRM as compared to the case when FBB is only applied to the NMOS devices on the side of the bit-cell writing a "0". Figure 3-3 also suggests that, depending on the desired μ/σ of the WRM, the appropriate body-bias voltage can be chosen. In this work, a VFBB of 1V is chosen, which provides a worst-case μ/σ of 5.5 even at a Vdd of 0.4V. The requirement of a separate circuit to generate the body-bias voltage is eliminated, since 1V is the nominal supply voltage of the process [39] and is readily available on-chip.

Figure 3-3: WRM improvement as a function of the applied forward bias voltage (VFBB), at Vdd = 0.4V. (Curves: no FBB, 1-sided FBB and 2-sided FBB; SF corner, -40°C.)
Figure 3-4: Improvement in Write Margin by 1V FBB in the Vdd range of 0.4V-1V. The Vdd,min at 5.5σ is improved from 600 mV to 400 mV (worst process corner and temperature).

Figure 3-4 shows the improvement in the write margin as a function of the supply voltage (Vdd), with the body-bias voltage (VFBB) kept at a constant value of 1V. It can be seen from the figure that both the mean value (μ) and the 5.5σ value of the write margin are consistently improved over the entire Vdd range of 0.4V to 1V. The conventional 6T bit-cell works down to 600mV with a μ/σ = 5.5, without any write assists. As seen from the figure, a 1V forward body-bias can reduce the Vdd,min for the write operation to 400mV, while maintaining a μ/σ of 5.5. The 200mV reduction in Vdd,min provides significant benefits in terms of the SRAM energy consumption.

3.1.2 Read-Stability Issues and Dynamic FBB

If the FBB technique described above is implemented in a static manner, the read-stability of the bit-cell is degraded. As shown in Figure 3-5(a), the primary reason for degraded read-stability is the lowering of the trip-point (VTRIP) of the inverter storing a "1". This is because the threshold voltage of the pull-down NMOS (PD2) is reduced by FBB. Hence, it is easier for the disturbance at the low internal node (N1) to flip the inverter. The degradation of read-stability with FBB is more prominent for higher body-bias voltages, which are required for a successful write operation at lower Vdd levels.
Figure 3-5: (a) 6T bit-cell during a read operation under DC forward body-bias (b) Delayed FBB to reduce read-stability issues. (Annotations in the figure: (a) FBB of PD2 lowers the VTRIP of the inverter, so it flips more easily; (b) the VTRIP of the PU2-PD2 inverter is not lowered during ΔT.)

This problem can be mitigated by delaying the forward body-biasing with respect to the WL rising edge. As seen from Figure 3-5(b), if the body-biasing is applied after a delay ΔT, the VTRIP of the inverter is not lowered during that time. Meanwhile, the BL has already started discharging and the noise injected into the "0" internal node is relatively smaller. This motivates a dynamic implementation of the FBB technique (D-FBB), which can take advantage of the full body-biasing voltage range to improve write-ability, without compromising read-stability. Furthermore, as the leakage of the bit-cell increases when FBB is applied, a huge leakage power penalty would be incurred if a DC FBB were applied to the whole memory array.

3.2 Energy-efficient Implementation of D-FBB

A dynamic implementation of the FBB technique (D-FBB), however, has its own share of power overhead due to n-well switching. For the conventional "thin-cell" 6T layout [11, 40], the n-wells are shared vertically with other bit-cells in a column. Therefore, to apply body-biasing to a selected bit-cell, the entire n-well of the corresponding column needs to be charged up [40]. This translates into a significant capacitive-switching power overhead, since it scales with both the number of rows and columns. The n-well switching power can limit the benefit of Vdd scaling achieved using dynamic body-biasing.

To address this issue, an alternate layout technique is proposed in this work, which shares the n-wells horizontally across all the bit-cells in a row. The benefit of this technique stems from the fact that only one row in the memory array is accessed at a time. Hence, body-biasing can be applied to only the two n-wells in a selected row.
This significantly reduces the amount of n-well capacitance switched per write cycle, since it is only dependent on the number of columns. Hence, the proposed technique of sharing the n-wells horizontally provides a more energy-efficient way of implementing dynamic body-biasing for 6T SRAMs.

Figure 3-6 shows the proposed layout technique, which shares the n-wells horizontally across all the bit-cells in a row. The conventional "thin-cell" [11] layout is used for the 6T bit-cell. The "thin-cell" layout has an aspect ratio of approximately 3:1. This implies that a bit-line, which is now routed along the longer bit-cell dimension, has approximately 3 times more parasitic metal-routing capacitance (CM,par), as compared to a conventional implementation. This can lead to a 3X increase in BL switching power. In this work, this issue has been addressed in two ways: (1) sharing the bit-lines between adjacent columns, and (2) routing multiple word-lines for each row.

Figure 3-6: Proposed layout of a single row, showing row-wise sharing of n-wells, BL sharing between adjacent columns and multiple WLs per row (not to scale).

(1) Traditionally, the bit-line (BL) diffusion contacts are shared with neighboring bit-cells in a column, reducing the effective cell area. However, this is not possible in the present scenario, since the diffusion regions run horizontally. Hence, the BL diffusion contacts are shared between bit-cells in adjacent columns, so that the effective bit-cell area is not increased and the lithography-friendly layout structure [11] is maintained. Sharing BLs between two adjacent columns also reduces the parasitic metal routing capacitance per bit-cell by a factor of 2. Hence, the effective increase in CM,par per bit-cell is 1.5X compared to a conventional implementation (as opposed to 3X).
(2) Since two adjacent columns now share a BL, it is necessary to have at least two word-lines (WL) for each row. This is to ensure that two adjacent bit-cells in a selected row do not simultaneously drive a single BL. In this implementation, 4 WLs are routed for each row, taking advantage of the longer cell-height. The 2 extra WLs help in reducing the unnecessary BL discharge in half-selected columns. Therefore, the effective number of BLs switching per cycle is reduced by a factor of 2.

Due to these layout optimizations, the BL switching power is actually reduced to 0.75X that of a conventional implementation (the 3X routing increase is halved by BL sharing and halved again by the reduced BL switching), providing further energy savings. 4 metal layers are used for routing the different signals. Metal-2 (M2) is used for the cell Vdd and GND, metal-3 (M3) is used for bit-lines and finally, WLs are routed in metal-4 (M4). The proposed implementation (h6T) incurs a 2.5% increase in the effective cell-area, due to non-overlapping WL contacts between adjacent rows. Normal logic design rules are used for layout.

Figure 3-7: Circuit implementation of the proposed row-wise forward body-biasing technique (hFBB).

The proposed technique of row-wise forward body-biasing (hFBB) is implemented by the circuit shown in Figure 3-7. During a write operation (WrEn = "1"), the word-line (WL) of the selected row triggers the level shifter, to pull the node Nx to ground. Hence, the output node Nbb, which is connected to the n-well of the selected row, is charged up towards VFBB. The bias voltage Vb can be chosen to be 0V for full-swing body-biasing at every Vdd. Alternatively, it can be connected to Vdd, so that the n-well node is charged up slowly at higher Vdd levels, when body-biasing is not required. As explained before, the delay after the WL is necessary to eliminate read-stability issues in half-selected bit-cells.
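The bit-line bookkeeping of section 3.2 can be verified with a few lines of arithmetic. The sketch below simply multiplies the factors quoted in the text; the constant and function names are illustrative, and it assumes, as the text implicitly does, that BL switching power scales with the parasitic metal-routing capacitance per bit-cell and with the number of BLs switched per cycle.

```python
from fractions import Fraction

# Factors quoted in section 3.2 (names are illustrative only).
ROUTING_INCREASE = Fraction(3)        # BL routed along the ~3:1 longer cell dimension
BL_SHARING_FACTOR = Fraction(1, 2)    # one BL shared between two adjacent columns
BL_ACTIVITY_FACTOR = Fraction(1, 2)   # 4 WLs per row halve the number of BLs switched

def routing_cap_per_cell():
    """Relative parasitic routing capacitance per bit-cell vs. a conventional layout."""
    return ROUTING_INCREASE * BL_SHARING_FACTOR

def bl_switching_power():
    """Relative BL switching power vs. a conventional layout (simplified model)."""
    return routing_cap_per_cell() * BL_ACTIVITY_FACTOR

if __name__ == "__main__":
    print("C_M,par per bit-cell:", float(routing_cap_per_cell()), "X")
    print("BL switching power:  ", float(bl_switching_power()), "X")
```

Using exact fractions keeps the intermediate 1.5X and final 0.75X factors free of rounding, which makes the chain of reductions easy to audit.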
3.3 Hierarchical BL structure and Data Prediction

Recent works [12, 18, 32] have shown the advantages of the hierarchical bit-line (HBL) scheme in improving the read-stability and read-ability of 6T SRAMs. In HBL, a small number of bit-cells are connected to a local bit-line (BL) pair. The signal development on the local BLs is transferred to the long global BLs, which are used to finally read the data. The local BLs have a significantly smaller capacitance, due to the reduced number of access transistors connected to them and the reduced parasitic metal routing capacitance. Therefore, they can discharge much faster during a read operation and hence inject less noise through the access transistor into the "0" internal node. This significantly improves the read-stability of the bit-cell, as compared to the conventional non-hierarchical architecture with a higher number of bit-cells per BL.

3.3.1 Dynamic Read Margin

Figure 3-8: Dynamic read margin of the 6T bit-cell as a function of Vdd, for different values of NOBC (number of bit-cells per local BL).

In this work, a hierarchical BL structure is used, with 64 cells per local bit-line. This translates to 32 physical rows, since a local BL is shared between adjacent columns in this implementation. Figure 3-8 shows the effect of the number of bit-cells per local BL (NOBC) on the dynamic read margin (DRM). 64 bit-cells per LBL was chosen, so that more than 5-sigma DRM is achieved, even at a Vdd of 0.4V and the worst process corner (FS).
It can also be seen from Figure 3-8 that the static method of read-margin simulation (SNM) provides a considerably pessimistic estimate. This is because the SNM method does not account for the effect of BL discharge on the read-noise injected into the "0" internal node of a bit-cell.

3.3.2 Hierarchical Read Path

Figure 3-9 shows the hierarchical BL structure used in this implementation. In a sub-array, the local bit-lines (LBL and LBLB) are connected across 32 rows and shared with the adjacent column (not shown in the figure).

Figure 3-9: Hierarchical bit-line structure used in this work to improve read-stability at low Vdd levels. The read path from the local BL to the global BL is also shown.

During a read operation, one of the local bit-lines (LBL and LBLB) starts discharging, depending on the data stored in the selected bit-cell. The signal development on the local BLs is sensed with a pair of local inverters and pull-down NMOS devices, connected to the global bit-lines (GBL and GBLB). To improve read-access time, the local inverters are designed to favor a "0" to "1" output transition. Large-signal sensing is used for the local BLs to reduce the area overhead of the local sensing circuit. On the other hand, small-signal differential sensing is used for the global BLs. This is because the global BLs are connected across multiple local arrays and have a significantly higher capacitance (due to long metal routing). Hence, they discharge slower than the local BLs, and therefore small-signal sensing is more suitable. The GWL signal is used to turn off the GBL discharge when the global sense-amplifier is enabled.
PMOS devices (not shown in the figure) are used to pre-charge all the local BLs, global BLs and other nodes in the local sensing circuitry to Vdd, at the beginning of a read operation.

3.3.3 Using Data Prediction in 6T SRAM

Application-specific features can provide interesting data properties, which can be exploited to design a more tailored SRAM. [41] proposed a 10T SRAM bit-cell which uses prediction of data to reduce bit-line switching power during a read operation. It was targeted specifically towards motion estimation in video processing applications. In motion estimation, the pixels from a small block of a video frame (reference buffer) are stored in the SRAM array and used in consecutive read cycles, before they are overwritten. The correlation of the pixel data stored in the reference buffer can be exploited to predict the data during a read operation, using previously read values. If the prediction matches the actual data, the bit-line (BL) pair is not discharged. Thus, depending on the prediction accuracy, the BL switching power can be reduced. This can provide significant energy savings, since BL switching constitutes a major portion of the overall SRAM power consumption.

This work extends the concept of data prediction to 6T SRAM arrays. Instead of incorporating data prediction in the bit-cell, as done in [41], this work uses data prediction at the local array level. Thus, we get the area advantage of using a smaller 6T bit-cell as compared to a 10T design, while saving BL switching power.

Figure 3-10: Prediction architecture used in this design.

Figure 3-10 shows the architecture implementing the prediction scheme for a 6T SRAM array. Two extra transistors (Np1, Np2) are added at the local sub-array level. They control the signal development at the 'int1' and 'int2' nodes, driving the local sensing inverters.
All the internal nodes are pre-charged to Vdd before a read operation (using PMOS transistors, not shown in the figure). Let us assume the data to be read is "0". Hence, the LBL discharges to ground during a read operation, while LBLB stays at Vdd. If the prediction is correct, i.e. Pred = "0" and PredB = "1", Np1 is turned off and the discharge of the LBL is not transferred to the 'int1' node. Thus, both the 'int1' and 'int2' nodes stay at the pre-charged level of Vdd. Hence, neither of the global bit-lines (GBL, GBLB) is discharged. The global sense-amplifier (SA) outputs a "1" and the correct prediction value, Pred = "0", is chosen as the read output data.

On the other hand, if the prediction is incorrect, i.e. Pred = "1" and PredB = "0", Np1 is turned on and the discharge of the LBL is transferred to the 'int1' node. Hence, GBL starts discharging and the global SA senses this, to output a "0". Therefore, PredB (= "0") is chosen as the read output (since the prediction was incorrect). Thus, in either case, the correct value of the data (= "0" in this example) is obtained at the output. However, if the prediction was correct, the discharge of the global bit-line was avoided, which translates into dynamic power savings. Although the local BL switching is not affected by this technique, the dominant component of the switching power, which is due to the global BLs, can be reduced by using data prediction.

3.4 Overall Array Architecture

Figure 3-11 shows the overall array architecture for the 128Kb SRAM, designed in the 28nm FDSOI technology. Each 64Kb block, of 256 rows by 256 columns, consists of 8 local arrays. A 4:1 column interleaving ratio is implemented to obtain a 64-bit output data.
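The prediction-based read described in section 3.3.3 can be captured in a small behavioral model. This is a logic-level sketch only, not the circuit: the function name is hypothetical, and the behavior of Np2 for a stored "1" is assumed to mirror the Np1 case that the text walks through (Np2 gated by PredB, passing an LBLB discharge to 'int2').

```python
def predicted_read(stored_bit, pred):
    """Behavioral model of one read with data prediction.

    Returns (dout, gbl_switched). Np1 is modeled as on when Pred = 1; Np2 is
    ASSUMED to be on when PredB = 1 (symmetric to the Np1 case in the text).
    """
    pred_b = 1 - pred
    # One local bit-line discharges, depending on the stored data.
    lbl_discharges = (stored_bit == 0)
    lblb_discharges = (stored_bit == 1)
    # The discharge reaches int1/int2 only through an enabled prediction transistor.
    int1_discharges = lbl_discharges and pred == 1
    int2_discharges = lblb_discharges and pred_b == 1
    gbl_switched = int1_discharges or int2_discharges
    # The global SA outputs 1 when no global bit-line was discharged.
    sa_out = 0 if gbl_switched else 1
    # Output mux: keep the prediction if it was confirmed, else take its complement.
    dout = pred if sa_out == 1 else pred_b
    return dout, gbl_switched

if __name__ == "__main__":
    for stored in (0, 1):
        for pred in (0, 1):
            dout, switched = predicted_read(stored, pred)
            print(f"stored={stored} pred={pred} -> dout={dout}, GBL switched={switched}")
```

In this model the output always equals the stored bit, and a global bit-line switches only on a misprediction, which is why the global BL switching energy falls as the prediction accuracy rises.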
Figure 3-11: Array architecture for the 28nm FDSOI 128Kb 6T-SRAM, which incorporates row-wise body-biasing and data prediction to reduce energy consumption.

The local array consists of 6T bit-cells (h6T) arranged in 32 rows by 256 columns. As shown in Figure 3-11, for a group of 4 columns, there are 3 local BLs and 2 local BLBs. This is because, in this implementation, the local bit-lines are shared between adjacent columns. Therefore, 2 bits (b[1:0]) are required for column multiplexing, to get a local BL pair (mLBL[0], mLBLB[0]). As explained in section 3.2, each row has 4 word-lines (WLA, WLB, WLC and WLD), only one of which is asserted based on the row-decoder's outputs and the column interleaving bits. The selection of a 4:1 column interleaving ratio and the ability to route 4 WLs per row eliminate the half-select issue in this architecture. Hence, dynamic forward body-biasing can be applied to the selected row, as a write-assist technique, without any read-stability issues.

The hFBB circuit implements the row-wise dynamic body-biasing. The n-wells are shared horizontally, across all the 256 bit-cells in a row. The local R/W circuitry, shared between two local arrays, consists of inverter-based large-signal sensing for read and the prediction logic, as explained in section 3.3. In addition, the local write is implemented by the pull-down NMOS devices, controlled by data[0] and dataB[0]. Although not shown in the figure, data[0] and dataB[0] are locally generated during a write operation, from the data on the global BLs. They are driven to "0" during a read operation. The pair of cross-coupled PMOS devices, connected to a local BL pair, maintains a differential signal level on the local BLs during a write operation. All the global signals (including the global BLs and the Pred, PredB lines) are driven by the global R/W circuitry, which also incorporates small-signal sensing for the global BLs during a read operation.
The prediction generator circuitry is similar to the one described in [41].

3.5 Energy Savings

In this section, we estimate the energy consumption of the proposed hFBB 6T SRAM, which implements row-wise dynamic body-biasing. This is compared to the conventional implementation [40], with column-wise n-well sharing. We evaluate the energy savings achieved due to Vdd scaling and by using data prediction. The energy consumption model of the SRAM, used in this section, is described in Appendix A. Energy estimations are done at the typical (TT) process corner and 25°C temperature, with the various capacitances extracted from a local array's layout.

3.5.1 Due to Vdd Scaling

The bit-cell used in this design can work down to a Vdd of 600mV, which is limited by the write operation. The dynamic forward body-biasing technique, used as a write-assist, improves the Vdd,min by 200mV. This translates into a significant reduction in the dynamic energy consumption of the SRAM macro (since it is roughly proportional to $V_{dd}^2$). However, the leakage energy consumption is increased, due to a reduced frequency of operation (fsw) at Vdd = 400mV. The normalized average energy per access (Eavg/acc.) for the 128Kb SRAM macro is shown in Figure 3-12, at Vdd = 600mV and 400mV, for two different read-to-write ratios (R/W). For a R/W of 1:1, a 38% reduction in Eavg/acc. is achieved, due to a 200mV reduction in the SRAM Vdd,min. The improvement is maintained even for a higher R/W ratio of 5:1, where we get 36% energy savings due to Vdd scaling. At a high R/W ratio of 5:1, the contribution from GBL switching is smaller, because the number of write operations (which involve a full swing of the GBL) is reduced. Hence, the energy savings from Vdd scaling are slightly lower.

Next, we compare the energy savings with the body-biasing technique implemented in two different ways.
The proposed technique (hFBB) of row-wise n-well sharing incurs less energy overhead than the conventional implementation [40], in which n-wells are shared column-wise. This is because, during every write cycle, only the n-wells of the one selected row need to be charged up, instead of switching the n-wells for multiple selected columns (as required in the conventional implementation). As seen from Figure 3-12, for a R/W ratio of 1:1, the proposed hFBB implementation results in 38% energy savings, while the conventional implementation, in fact, increases Eavg/acc. by 20%. The proposed technique out-performs the conventional implementation in terms of Eavg/acc., even at a higher R/W ratio of 5:1.

[Figure 3-12 plot: normalized Eavg/acc. for w/o assist (Vdd = 0.6V), hFBB and conv. BB (Vdd = 0.4V), at R/W = 1:1 (hFBB: -38%, conv. BB: +20%) and R/W = 5:1 (hFBB: -36%, conv. BB: -7%); stacked components: E_leak, E_SA, Edyn,WL, Edyn,BB, Edyn,GBL, Edyn,LBL.]

Figure 3-12: Energy savings due to Vdd scaling using 1V dynamic FBB. The energy reduction with the proposed row-wise FBB implementation (hFBB) is compared to a conventional implementation, for two different read-to-write ratios.

3.5.2 Using Data Prediction

Data prediction is used during a read operation to reduce the global bit-line (GBL) switching, which constitutes a significant portion of the overall SRAM energy consumption. Figure 3-13 shows the normalized read energy per access (Eread/acc.) at Vdd = 400mV, as a function of the percentage of correct prediction. The conventional 6T SRAM (without prediction) uses differential sensing for the
global bit-lines, during a read operation. On the contrary, a single-ended sensing scheme is required when the read path involves data prediction. Assuming the same sense-amplifier (SA) is used in both the schemes, the global bit-lines need to discharge by approximately twice as much when a single-ended sensing scheme is used. Therefore, when the prediction is correct less than 50% of the time, Eread/acc. is actually higher than for the conventional 6T read. However, energy savings are obtained when the prediction is correct more than 50% of the time. At Vdd = 400mV, a 35% (or 1.54X) reduction in Eread/acc. is achieved with 100% correct prediction. It must be noted that this 35% reduction is in addition to the energy savings achieved by Vdd scaling to 400mV.

[Figure 3-13 plot: normalized Eread/acc. versus percentage of correct prediction (0%, 25%, 50%, 75%, 100%), compared to a read w/o pred. (-35% at 100%); stacked components: E_leak, E_pred, E_SA, Edyn,WL, Edyn,GBL, Edyn,LBL.]

Figure 3-13: Energy savings by using data prediction during a read operation at Vdd = 400mV, as compared to a conventional 6T read.

Chapter 4

Reconfigurable Body-Bias Generator in 28nm FDSOI

This chapter presents a switched-capacitor (SC) based step-up DC-DC converter, which can be used for SRAM body-biasing. The reconfigurable converter implements 3 step-up conversion ratios of 5/2, 2/1 and 3/2, to provide a wide range of output voltage. The step-up converter has been designed to obviate the need for high-voltage I/O transistors (as charge-transfer switches), which would otherwise have degraded the efficiency owing to their higher Ron and capacitance. Additionally, a new topology is proposed for the 5/2 mode, which improves efficiency by reducing the bottom-plate parasitic loss as compared to a conventional series-parallel topology [26]. A brief overview of SC converters is presented in section 4.1, followed by the detailed description of the designed reconfigurable step-up converter.
4.1 Brief overview of SC converters

Switched-capacitor (SC) power converters are a type of DC-DC converter that uses only switches and capacitors to efficiently convert one voltage to another. Since they do not require bulky inductors, SC converters are ideally suited for on-chip implementations. Step-up SC converters (i.e. Vout > Vin) have traditionally been used in integrated circuits to provide the programming voltage for FLASH [42] and other non-volatile memories [43]. They also find use in energy harvesting applications [44].

Figure 4-1 shows a simple 2:1 SC step-up converter, which uses 4 switches (S1 - S4) and 1 charge-transfer capacitor (Cf). SC converters generally operate in 2 phases. As shown in the figure, during phase Φ1, Cf gets charged from the input voltage (Vin), while in the other phase (Φ2), it transfers the stored charge to the output. An idealized 2-port model of the SC converter is also shown in the figure. It consists of an ideal transformer, which represents the no-load conversion ratio, and an output resistance (ROUT), which represents the load-current-dependent voltage drop (due to charging and discharging of Cf every cycle). ROUT depends on the topology of the converter and also on the switching frequency (fsw) of the converter. A more detailed analysis of SC converters can be found in [45].

Figure 4-1: Basic 2:1 step-up SC converter, along with its idealized 2-port model.

4.2 Reconfigurable Step-up SC Module

Figure 4-2 shows the switch-level schematic of a single module of the reconfigurable step-up converter. A module comprises two identical sub-modules ('a' and 'b'), which are connected by switch S8. Each sub-module consists of 7 switches (S1 - S7) and 2 charge-transfer capacitors (C1, C2), and is driven by two non-overlapping, complementary clocks (CLK1, CLK2). Additionally, sub-module 'a' operates out-of-phase with sub-module 'b'.
This design strategy allows us to reuse simple 2/1 sub-modules to design a more complex 5/2 conversion module.

Figure 4-2: Reconfigurable step-up switched capacitor module.

Figure 4-3: Operation of the proposed and conventional topologies in 5/2 mode (Vout = Vin × 5/2): (a) proposed topology, (b) conventional topology.

Figure 4-4: Simulated performance comparison of the ideal converter for the proposed and conventional topologies in 5/2 mode: (a) ROUT and ripple (ideal converter) versus switching frequency (MHz), (b) efficiency with and without BP (bottom-plate parasitic) versus switching frequency (MHz).

Figure 4-3(a) shows the operation of the converter in the 5/2 mode for the proposed topology. As shown in the figure, during phase Φ1, capacitors C1a and C2b are charged from the input node, Vin, while capacitors C2a and C1b transfer charge to the output node, Vout. On the other hand, during phase Φ2, C1b gets charged from the input node and C2a gets charged by C1a, while C2b transfers charge to the output node. Also shown in the figure are the voltages across the different charge-transfer capacitors for the no-load case. Using charge balance, it is easily seen that the voltage across each capacitor is identical during the two phases Φ1 and Φ2, which proves that, in the steady state, this mode will generate a no-load output voltage Vout,NL = 5/2 × Vin.
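To put numbers on the no-load voltages of the three modes, and on how the output resistance ROUT of the idealized 2-port model of section 4.1 pulls the loaded output below them, a minimal Python sketch follows. The function names are illustrative, and the 3 kΩ value used for ROUT in the example is a hypothetical placeholder, not a measured or simulated value from this work.

```python
from fractions import Fraction

# No-load conversion ratios of the three modes named in the text.
CONVERSION_RATIOS = {
    "5/2": Fraction(5, 2),
    "2/1": Fraction(2, 1),
    "3/2": Fraction(3, 2),
}


def no_load_vout(mode: str, v_in: float = 1.0) -> float:
    """No-load output voltage: ideal transformer ratio times Vin."""
    return float(CONVERSION_RATIOS[mode]) * v_in


def loaded_vout(mode: str, i_load: float, r_out: float,
                v_in: float = 1.0) -> float:
    """Idealized 2-port model: no-load voltage minus the drop across ROUT."""
    return no_load_vout(mode, v_in) - i_load * r_out


if __name__ == "__main__":
    for mode in CONVERSION_RATIOS:
        print(mode, "no-load Vout =", no_load_vout(mode), "V")
    # Hypothetical ROUT of 3 kOhm at a 100 uA load, for illustration only.
    print("5/2 loaded Vout =", loaded_vout("5/2", 100e-6, 3000.0), "V")
```

With a 1V input, the no-load outputs are 2.5V, 2V and 1.5V, which bound the output range each mode can cover once the load-dependent drop is subtracted.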
Figure 4-3(b) shows the operation of a conventional series-parallel topology [26] implementing a 5/2 mode. Although both the topologies require the same number of capacitors and switches to implement a 5/2 mode, the proposed topology offers two significant benefits compared to the conventional topology. Firstly, for the proposed implementation, charge is delivered to the output in both the clock phases, Φ1 and Φ2. However, in the conventional topology, charge is delivered to the output in only one phase. This results in a lower output voltage ripple for the proposed implementation. Figure 4-4(a) shows the simulated result for the ideal converter in 5/2 mode. As can be seen from the figure, the proposed topology reduces output voltage ripple by 2X compared to the conventional case.

Second and more importantly, the proposed topology offers much better performance in terms of reducing bottom-plate parasitic loss, which is a significant component of the overall loss for on-chip implementation of the charge-transfer capacitors. On-chip capacitors offer a much higher energy density compared to their off-chip counterparts, but they suffer from having significantly more parasitic capacitance (associated with their bottom or top plate and the substrate). This parasitic can be as high as 5-10% [46] of the actual capacitance for the MOS capacitors used in this design. In SC converters, this parasitic capacitor gets charged in one phase and loses that energy by discharging in the other phase. The associated bottom (or top)-plate parasitic loss can severely degrade the efficiency of the converter, especially at low output power levels. The proposed implementation for the 5/2 mode significantly decreases this loss component by reducing the swing of the bottom (or top) plate of the charge-transfer capacitors.
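The dependence of this loss on plate swing can be checked numerically. The Python sketch below sums the squared plate swings that appear in the loss expressions for the two topologies of Figure 4-3; the function name and the values chosen for α, Cf and fsw in the example are illustrative assumptions, and only the ratio between the two results is meaningful.

```python
from typing import Sequence

# Plate swings of the four charge-transfer capacitors, in units of Vin,
# as they appear in the loss expressions for Figure 4-3.
PROPOSED_SWINGS = (1.0, 0.5, 1.0, 0.5)
CONVENTIONAL_SWINGS = (1.5, 0.5, 1.5, 1.0)


def bottom_plate_loss(swings: Sequence[float], alpha: float, c_f: float,
                      v_in: float, f_sw: float) -> float:
    """P_bot = alpha * Cf * sum(swing_i^2) * fsw, with swings in volts."""
    return alpha * c_f * sum((s * v_in) ** 2 for s in swings) * f_sw


if __name__ == "__main__":
    # Illustrative values only (assumptions, not taken from the design).
    params = dict(alpha=0.05, c_f=100e-12, v_in=1.0, f_sw=50e6)
    p_prop = bottom_plate_loss(PROPOSED_SWINGS, **params)
    p_conv = bottom_plate_loss(CONVENTIONAL_SWINGS, **params)
    print("proposed     :", p_prop, "W")
    print("conventional :", p_conv, "W")
    print("reduction    :", p_conv / p_prop, "X")
```

The sums of squared swings are 2.5 and 5.75 (in units of Vin squared), so the reduction factor is independent of the assumed α, Cf and fsw.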
It can be observed from Figure 4-3 that

$$P_{bot}(\text{proposed}) = \alpha C_f \left( (V_{in})^2 + (V_{in}/2)^2 + (V_{in})^2 + (V_{in}/2)^2 \right) f_{sw} = 2.5\,\alpha C_f V_{in}^2 f_{sw},$$

where α denotes the ratio of the bottom-plate parasitic capacitor to the corresponding charge-transfer capacitor (Cf) and fsw denotes the switching frequency of the converter. For the conventional series-parallel implementation, this loss can be calculated as

$$P_{bot}(\text{conventional}) = \alpha C_f \left( (3V_{in}/2)^2 + (V_{in}/2)^2 + (3V_{in}/2)^2 + (V_{in})^2 \right) f_{sw} = 5.75\,\alpha C_f V_{in}^2 f_{sw}.$$

Hence we get a 2.3X reduction in bottom-plate parasitic loss, which significantly improves the efficiency. As seen from Figure 4-4(b), for the ideal converter with 2% bottom-plate parasitic (α = 0.002), we can get an efficiency improvement as high as 15%. This comparison assumes that the total amount of charge-transfer capacitance, the load capacitance and the load resistance are the same for both topologies. As can be seen from Figure 4-4(a), this results in similar output impedance (ROUT) and hence, both the implementations deliver the same amount of output power at a given load current.

The operation in the other two modes (2/1 and 3/2) is illustrated in Figure 4-5. It may be noted that in mode 2/1 the capacitors C1 and C2, in each sub-module, work in exactly the same manner. Hence, for clarity, only one (C2) is shown in Figure 4-5.

Figure 4-5: Operation of the converter in 2/1 (Vout = 2 × Vin) and 3/2 (Vout = Vin × 3/2) modes.

4.3 MOS Implementation of the sub-module

In this design, all the charge-transfer switches in the main converter module have been implemented with core (1V) transistors. It is important to ensure that in the steady state none of the transistors are overstressed due to application of a gate-to-source (VGS) or drain-to-source (VDS) voltage higher than the nominal supply voltage, Vdd. Figure 4-6 shows the MOS implementation of the switches for a sub-module.
The bottom-plates of both capacitors C1 and C2 remain within the voltage range of 0 to Vdd. Hence the switches (S1, S2, S4, S5) which are connected to the bottom-plates of C1 and C2 have been implemented with regular PMOS (S1P, S4P) and NMOS (S2N, S5N) transistors. Although not shown in the figure, these transistors are driven by buffers in the voltage range 0 to Vdd. Switch S3 connects the top-plate of capacitor C1 to Vdd and is turned OFF when the top-plate of C1 goes to 2Vdd. Hence, it is implemented with an NMOS transistor (S3N) with a gate drive between Vdd and 2Vdd, to avoid VGS overstress. Switch S6 needs to connect the top-plate of the capacitor C2 to the output voltage node Vout and is OFF otherwise. Hence, it is implemented with a regular PMOS transistor (S6P) but with a gate drive between (Vout - Vdd) and Vout, so that the maximum VGS applied is Vdd and the transistor is not overstressed.

Figure 4-6: MOS implementation of the switches.

Switch S7 operates over a wide range of voltage levels, which depend on the conversion mode. It needs to block a voltage of (Vout - Vdd) across it, which can be as high as 2.5 - 1 = 1.5V in 5/2 mode. Hence, to avoid a VDS overstress, it is implemented with a cascode of two 1V regular PMOS transistors (S7PL and S7PH). Conventionally, a suitable DC voltage needs to be generated to bias this cascode switch structure. In this design, the need for a separate DC voltage is obviated by dynamically biasing the gates of both the transistors, S7PL and S7PH, to turn them ON and OFF simultaneously. It may be noted that the dynamic biasing is also dependent on the conversion mode. Hence it needs to be reconfigurable to work across all the three conversion modes (5/2, 2/1, 3/2). Figure 4-7 shows the reconfigurable gate drive structures for switches S7PL and S7PH, along with the necessary level shifter circuits.
The 'V - V Shifter', shown in the figure, is also used to drive switch S6P. Switch S8 was implemented with regular 1V PMOS and NMOS transistors in a transmission-gate structure, since it needs to pass a voltage of Vdd/2. The charge-transfer capacitors were implemented on-chip with high-density MOS capacitors along with MOM capacitors stacked on top (to improve density). For the MOS capacitors, soft connection of the N-well [46] was adopted. This technique greatly reduces the parasitic capacitance from the bottom (or top)-plate to ground and improves the efficiency of the converter in all the 3 modes.

Figure 4-7: Reconfigurable gate drive circuits for a sub-module: (a) Level Shifter with Enable (LS-EN), (b) gate drivers for S7PH and S6P.

4.4 Overall System Architecture

Figure 4-8 shows the overall architecture of the converter. This work implements 4-phase interleaving in order to reduce output voltage ripple. The 4-phase clock generator splits the clock into 4 phases (frequency fsw/4), each shifted by 45°. Each phase generates two complementary non-overlapping clocks, which drive a single converter module. A tunable circuit has been implemented to control the non-overlapping delay, which is crucial to reduce shoot-through current loss [47]. Reconfigurable switch drivers, as explained in the previous section, provide the gate drives for all the switches in each module. An on-chip load capacitor provides the further ripple reduction needed at the output.
4.5 Results

The fully integrated step-up converter was implemented in a 28nm FDSOI process, occupying a core area of 0.054mm². An additional 0.06mm² area was used to implement an on-chip load capacitor. Figure 4-8 shows the die photo of the converter. The measured efficiency of the converter with varying load current and Vin = 1V is plotted in Figure 4-9(a). The output voltage was kept constant at ~2.2V (mode 5/2), 1.9V (mode 2/1) and 1.3V (mode 3/2), by changing the switching frequency (fsw) of the converter. As seen from the figure, the converter can supply a load current in the range 10 - 500 µA while maintaining an efficiency of more than 70%. It achieves a peak efficiency of 88% for the 2/1 mode at Pout = 0.56mW and 82% for the 5/2 mode at Pout = 0.66mW.

Table 4.1: Performance comparison with previous works

Design | [27] | [48] | This work
--- | --- | --- | ---
Technology | 130nm Bulk | 32nm Bulk | 28nm FDSOI
Topology | Step-up 2/1 | Step-up 2/1 | Reconfigurable step-up: 5/2, 2/1, 3/2
Capacitor | MIM | Metal finger | MOS and MOM
Area (with Cload) | 2.25 mm² | 6678 µm² | 0.114 mm²
Vin | 1-1.2V | 1-1.2V | 1V
Vout (@Vin = 1V) | 1.8V | 1.3 - 2V | 1.2 - 2.4V
ηpeak | 82% @Pout = 1.5 mW | 64% @Pout = 2.9 mW | 88% (2/1 mode) @Pout = 0.56 mW
Iload(max.) (@Vin = 1V) | 1.5 mA @Vout = 1.8V, η = 81% | 6.8 mA @Vout = 1.4V, η = 56% | 1 mA (2/1 mode) @Vout = 1.73V, η = 83%

Figure 4-9(b) shows the measured performance with a fixed load current of 100 µA and varying output voltage for the 3 modes. The converter provides an output voltage ranging from ~1.2V to 2.4V with more than 70% efficiency (Vin = 1V). It can be observed that increasing the switching frequency of the converter increases the output voltage by decreasing the output impedance (ROUT) of the converter. However, this effect saturates at higher frequencies, since the converter enters the fast-switching-limit (FSL) mode, in which the non-zero resistance of the MOS switches limits ROUT.
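The saturation of the output voltage with switching frequency can be illustrated with a generic output-resistance model. The Python sketch below is not the model used in this work: it assumes the commonly used approximation that ROUT is the root-sum-square of a slow-switching-limit term, which falls as 1/fsw, and a frequency-independent fast-switching-limit term set by the switch resistance. The parameters `k_ssl` and `r_fsl`, and all numerical values, are hypothetical.

```python
import math


def r_out(f_sw: float, k_ssl: float, r_fsl: float) -> float:
    """Approximate ROUT as sqrt(R_SSL^2 + R_FSL^2).

    k_ssl (ohm*Hz) lumps the topology factor and charge-transfer
    capacitance, so R_SSL = k_ssl / f_sw. r_fsl (ohm) stands for the
    switch-resistance-limited term. Both are assumed parameters.
    """
    r_ssl = k_ssl / f_sw
    return math.hypot(r_ssl, r_fsl)


def v_out(f_sw: float, ratio: float, v_in: float, i_load: float,
          k_ssl: float, r_fsl: float) -> float:
    """Loaded output voltage from the idealized 2-port model."""
    return ratio * v_in - i_load * r_out(f_sw, k_ssl, r_fsl)


if __name__ == "__main__":
    # Hypothetical numbers for a 5/2 mode at a 100 uA load.
    cfg = dict(ratio=2.5, v_in=1.0, i_load=100e-6, k_ssl=2e10, r_fsl=500.0)
    for f in (5e6, 10e6, 20e6, 50e6, 100e6, 1e9):
        print(f"fsw = {f / 1e6:7.1f} MHz -> Vout = {v_out(f, **cfg):.4f} V")
```

In this model the output voltage rises with frequency while the 1/fsw term dominates and then flattens near ratio × Vin minus Iload × R_FSL, which mirrors the saturation described above.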
The performance of the designed converter is compared with recent works on step-up SC DC-DC converters in Table 4.1. The proposed reconfigurable converter provides a wider range of output voltage, with a better peak efficiency value, using only MOS and MOM capacitors.

[Figure 4-9 plots: (a) efficiency (%) and switching frequency versus load current (µA) for the 5/2, 2/1 and 3/2 modes; (b) efficiency (%) and switching frequency versus Vout (V), from 1.2V to 2.4V.]

Figure 4-9: Measured performance of the converter: (a) varying load current and fixed output voltage, (b) varying output voltage and a fixed 100 µA load current.

Chapter 5

Conclusions

This thesis primarily focuses on energy-efficient 6T SRAM design in a 28nm FDSOI technology. Additionally, the design of a reconfigurable step-up DC-DC converter is presented, which can be used for body-biasing in SRAMs. This chapter summarizes the important conclusions of this research and discusses opportunities for future work.

5.1 Summary of contributions

The three main contributions of this thesis are summarized below:

1. Energy-efficient implementation of dynamic forward body-biasing, which is used as a write-assist technique in the designed 6T SRAM.
2. Incorporating data prediction in the conventional read path of 6T SRAMs, to save global bit-line switching power.
3. A reconfigurable integrated switched-capacitor (SC) based step-up DC-DC converter, as a body-bias generator for SRAMs.

5.2 Energy-efficient 6T SRAM design

An energy-efficient 6T SRAM design is presented in Chapter 3. The 6T SRAM achieves operation down to a Vdd of 400mV, with a (µ/σ)WRM of 5.5, at the worst-case process corner and temperature (SF corner, -40°C). The 200mV reduction in the Vdd,min of the SRAM is achieved by using dynamic forward body-biasing (FBB) as a write-assist technique. Dynamic body-biasing can incur significant energy overhead if it is implemented column-wise [40]. This is because the n-wells of all the selected columns need to be charged up during every write operation. To reduce this energy overhead, a modified layout with horizontal n-well sharing is proposed. Hence, only the two n-wells of the selected row need to be charged up, which significantly reduces the energy overhead due to body-biasing.

The energy consumption of the 128Kb SRAM is estimated using the energy model described in Appendix A. The 200mV improvement in the Vdd,min of the SRAM translates to a 38% reduction of average energy/access (Eavg/acc.) for the proposed implementation with row-wise body-biasing (hFBB). The conventional implementation, however, increases Eavg/acc. by 20%. This assumes an equal number of read and write operations, i.e. read-to-write ratio R/W = 1:1. The improvement is also seen at a higher R/W ratio of 5:1, where the hFBB technique saves 36%, while the conventional implementation saves only 7%.

To achieve further energy savings, a data-prediction scheme is inserted in the conventional 6T read path. As demonstrated in [41], in certain applications, such as motion estimation in video processing, the correlation of the stored data can be exploited to predict the data during a read operation, using previously read values. The concept of data-prediction is extended to conventional 6T SRAM arrays, without any significant area overhead.
Using the proposed technique, the discharge of the global bit-lines (GBL) can be avoided if the prediction is correct. Although the switching of the LBLs is not affected by this technique, the dominant component of switching power, which is due to GBLs, can be significantly reduced with correct prediction. Up to 35% (i.e. 1.54X) improvement in the read energy/access, at Vdd = 400mV, was estimated using the energy models of the SRAM (Appendix A). This improvement is in addition to the energy savings achieved by the Vdd scaling to 400mV.

5.3 Reconfigurable Step-up SC DC-DC Converter

An integrated reconfigurable switched-capacitor (SC) step-up DC-DC converter is presented in Chapter 4. The converter can be used to generate a wide range of body-bias voltage for SRAMs. The reconfigurable converter implements 3 step-up conversion ratios of 5/2, 2/1 and 3/2. It provides a wide range of output voltage, from 1.2V to 2.4V, using a 1V input. The step-up converter has been designed to obviate the need for high-voltage I/O transistors, which would otherwise have degraded the efficiency owing to their higher Ron and capacitance. Additionally, a new topology is proposed for the 5/2 mode, which improves efficiency by reducing the bottom-plate parasitic loss as compared to a conventional series-parallel topology. The converter was implemented in a 28nm FDSOI process using only on-chip MOS and MOM capacitors that do not require any extra fabrication steps, unlike MIM and trench capacitors. The converter can deliver load current in the range of 10 µA to 500 µA, achieving a peak efficiency of 88% (measured).

5.4 Future Work

As CMOS scaling continues into the sub-20nm regime, increased device variations will limit the minimum operating voltage (Vdd,min) of 6T SRAMs. While FDSOI and FinFETs offer improved device performance as compared to bulk processes, they also present new challenges for SRAM design.
Hence, improved read and write-assist techniques are required to scale the Vdd,min of 6T SRAMs, while maintaining high yield ratios. Vdd scaling is very crucial to reduce the SRAM energy consumption. Data dependency can be further exploited to take advantage of interesting signal statistics in SRAM design. Alternate prediction schemes for 6T SRAMs can be explored, which can provide further energy savings.

Appendix A

Energy Model of the 28nm FDSOI 128Kb SRAM Macro

The energy consumption in SRAM has two major components: dynamic energy ($E_{dyn}$) and leakage energy ($E_{leak}$). The dominant sources of these energy consumptions are bit-line (BL) switching and bit-cell leakage. BL switching energy ($E_{dyn,BL}$) is proportional to the number of columns in a given SRAM macro. Since the capacitance of each BL is proportional to the number of rows in the macro, $E_{dyn,BL}$ scales with both the number of rows and columns. However, it does not depend on the total size of the memory, since only one block is accessed at a time and hence, the BL switching occurs only in the selected block. On the other hand, the leakage energy ($E_{leak}$) scales with the total size of the memory, since every bit-cell, whether it is accessed or not, draws a non-zero leakage current. Furthermore, these energy consumptions depend on the SRAM supply voltage (Vdd) and the frequency of operation (fsw). The average energy consumption per access is given by:

$$E_{avg/acc.} = E_{dyn} + E_{leak} \qquad (A.1)$$

Table A.1: Array organization for both the implementations

Implementation | h6T | r6T
--- | --- | ---
Total Memory Size | 128Kb | 128Kb
Number of Blocks | 2 | 2
Number of local arrays in a block | 8 | 8
Local Array Organization | 32X256 | 64X128
Number of bit-cells per local BL | 64 | 64
Number of local BL pairs | 128 | 128
Word Length | 64 | 64
Number of Word Lines per row | 4 | 1

For the proposed layout implementation with horizontal n-well sharing (h6T), a local array consists of 32 rows and 256 columns.
Since a local bit-line (LBL) is shared between adjacent columns, 64 bit-cells share a LBL. To maintain the same number of bit-cells per LBL, 64 rows are chosen in the local array for the conventional implementation (with column-wise n-well sharing). Hence, to have the same local array size, the conventional implementation (r6T) has 128 columns. Table A.1 shows the detailed array organization for the conventional (r6T) and proposed (h6T) implementations. These values are used for the formulation described in this chapter.

A.1 Dynamic Energy Consumption

The primary sources of dynamic energy consumption are bit-line switching (both local and global BLs), word-line (WL) switching and the sense-amplifier. In addition, dynamic body-biasing, during a write operation, incurs extra energy overhead due to n-well switching. Furthermore, when data prediction is used during a read operation, there is some switching power overhead while updating the predictor's outputs.

Since a conventional 6T SRAM has a differential BL architecture, in every access cycle one of the BLs would always discharge to "0". Therefore, one of the BLs would need to be pre-charged to Vdd at the end of the access cycle. For the conventional architecture with 1 WL per row, the dynamic energy consumption due to local BL switching can be calculated as:

$$E_{dyn,LBL}(\text{conv.}) = 128\, C_{LBL} V_{dd}^2 \qquad (A.2)$$

where $C_{LBL}$ represents the capacitance associated with a local BL, for the conventional implementation. However, for the proposed implementation, with shared local BLs between adjacent columns and 4 WLs per row, this energy can be calculated as:

$$E_{dyn,LBL}(\text{prop.}) = 64\, C'_{LBL} V_{dd}^2 \qquad (A.3)$$

where $C'_{LBL}$ represents the capacitance associated with a local BL, for the proposed implementation.

The global BL discharge is dependent on the type of memory access. For a write operation, one of the GBLs swings fully from Vdd to 0V.
During a read operation, on the other hand, the swing ($\Delta V_{rd}$) is smaller and is determined by the sense-amplifier (SA) offset voltage ($V_{off}$). $\Delta V_{rd}$ should be estimated assuming that the slowest GBL discharge can droop to $V_{dd} - V_{off}$ when the SA is enabled. The average energy due to global BL switching can be calculated as:

$$E_{dyn,GBL} = 64\, C_{GBL} V_{dd} \left( \frac{k}{k+1} \Delta V_{rd} + \frac{1}{k+1} V_{dd} \right) \qquad (A.4)$$

where $k$ denotes the read-to-write ratio (i.e. the ratio of the number of read operations to the number of write operations) and $C_{GBL}$ denotes the respective capacitance of the global BL for the conventional and proposed implementations. The energy consumption of the SA is given by:

$$E_{dyn,SA} = 64\, C_{SA} V_{dd}^2 \frac{k}{k+1} \qquad (A.5)$$

where $C_{SA}$ denotes the effective capacitance of the SA, which is switched in every read-access cycle. The energy consumption due to word-line (WL) switching is given by:

$$E_{dyn,WL} = C_{WL} V_{dd}^2 \qquad (A.6)$$

where $C_{WL}$ denotes the respective capacitance of one WL for the conventional and proposed implementations. Without write assists, the total dynamic energy consumption per access is given by:

$$E_{dyn,tot}(\text{w/o assist}) = E_{dyn,LBL} + E_{dyn,GBL} + E_{dyn,SA} + E_{dyn,WL} \qquad (A.7)$$

The energy overhead due to dynamic body-biasing (BB), as a write assist, is given by:

$$E_{dyn,BB}(\text{conv.}) = 64 \times 64 \times 2C_{BB} V_{FBB}^2 \frac{1}{k+1} = 8192\, C_{BB} V_{FBB}^2 \frac{1}{k+1} \qquad (A.8)$$

$$E_{dyn,BB}(\text{prop.}) = 2 \times 256 \times 2C_{BB} V_{FBB}^2 \frac{1}{k+1} = 1024\, C_{BB} V_{FBB}^2 \frac{1}{k+1} \qquad (A.9)$$

where $C_{BB}$ denotes the capacitance associated with one of the n-well body terminals, for a single bit-cell. $2C_{BB}$ is used in the equations because the n-wells are shared with the adjacent row or column (depending on the implementation). $V_{FBB}$ denotes the amount of body-biasing required at a given Vdd to achieve a certain µ/σ of write margin. In this work, a $V_{FBB}$ of 1V is required at Vdd = 400mV, for a µ/σ = 5.5 (@SF corner and -40°C). Since the conventional implementation has column-wise n-well sharing, a single-sided body-biasing can be implemented. Hence, for each of the 64 selected columns, only 1 n-well is assumed to be switching.
Each column-wise n-well is shared by 64 bit-cells. On the other hand, with the proposed layout implementation, the body-biasing is done row-wise. Therefore, both the n-wells corresponding to a selected row need to switch (hence, the extra factor of 2). Each row-wise n-well is shared by 256 bit-cells. As seen from the equations, the proposed implementation reduces the body-biasing energy overhead by 8X. With body-biasing used as a write-assist technique, the total dynamic energy consumption per access is given by:

$$E_{dyn,tot}(\text{w/ assist}) = E_{dyn,LBL} + E_{dyn,GBL} + E_{dyn,SA} + E_{dyn,WL} + E_{dyn,BB} \qquad (A.10)$$

When data prediction is used, only the dynamic energy consumption of the global BLs during a read operation is affected. Let us denote the fraction of correct prediction as $p_c$. For the target application of motion estimation, the number of read accesses is considerably higher than the number of write accesses. Hence, we only consider the energy per access for the read operation (i.e. $k$ can be assumed to be a very high number in the above equations). The dynamic energy consumption of the global BLs for a read operation, using data prediction, is given by:

$$E_{dyn,GBL}(\text{w/ pred}) = (1 - p_c) \times 64\, C_{GBL} V_{dd} \times 2\Delta V_{rd} \qquad (A.11)$$

However, there is an energy overhead of updating the global prediction lines. Assuming that prediction outputs are updated every 16 clock cycles, which was found to be optimum in [41] for most video sequences tested, this overhead can be calculated as:

$$E_{dyn,ov}(\text{w/ pred}) = \alpha_{pred} \times 64\, C_{GBL} V_{dd}^2 \times \frac{1}{16} \qquad (A.12)$$

where $\alpha_{pred}$ is the average activity factor for the predictor's outputs. It is chosen to be 1/2 for simplicity. Hence, the overall energy per access for read, with data prediction, is given by:

$$E_{dyn,tot}(\text{w/ pred}) = E_{dyn,LBL} + E_{dyn,GBL}(\text{w/ pred}) + E_{dyn,SA} + E_{dyn,WL} + E_{dyn,ov}(\text{w/ pred}) \qquad (A.13)$$

It can be observed that, with more than 50% correct prediction, we get energy savings as compared to a conventional 6T read.
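Two of the claims above, the 8X reduction in body-biasing overhead and the 50% break-even point for prediction, follow directly from (A.8), (A.9), (A.4) and (A.11) and can be checked with a short Python sketch. The function names are illustrative, the capacitance and voltage values in the example are hypothetical placeholders rather than extracted values, and the prediction-line update overhead of (A.12) is deliberately left out.

```python
def e_bb_conventional(c_bb: float, v_fbb: float, k: float) -> float:
    """(A.8): 64 columns x 64 cells per n-well x 2*C_BB, writes only."""
    return 64 * 64 * 2 * c_bb * v_fbb ** 2 / (k + 1)


def e_bb_proposed(c_bb: float, v_fbb: float, k: float) -> float:
    """(A.9): 2 n-wells x 256 cells per n-well x 2*C_BB, writes only."""
    return 2 * 256 * 2 * c_bb * v_fbb ** 2 / (k + 1)


def e_gbl_read_conventional(c_gbl: float, v_dd: float, dv_rd: float) -> float:
    """(A.4) in the read-only limit (k very large)."""
    return 64 * c_gbl * v_dd * dv_rd


def e_gbl_read_predicted(c_gbl: float, v_dd: float, dv_rd: float,
                         p_c: float) -> float:
    """(A.11): twice the swing, paid only on mispredictions."""
    return (1 - p_c) * 64 * c_gbl * v_dd * 2 * dv_rd


if __name__ == "__main__":
    # Placeholder values (assumptions), chosen only to exercise the model.
    c_bb, v_fbb, k = 1e-15, 1.0, 1.0
    ratio = e_bb_conventional(c_bb, v_fbb, k) / e_bb_proposed(c_bb, v_fbb, k)
    print("BB overhead reduction:", ratio, "X")

    c_gbl, v_dd, dv_rd = 50e-15, 0.4, 0.1
    base = e_gbl_read_conventional(c_gbl, v_dd, dv_rd)
    for p_c in (0.0, 0.25, 0.5, 0.75, 1.0):
        rel = e_gbl_read_predicted(c_gbl, v_dd, dv_rd, p_c) / base
        print(f"p_c = {p_c:4.2f} -> GBL read energy = {rel:.2f} x conventional")
```

The relative GBL read energy in this sketch is 2(1 - p_c), so it crosses 1 at p_c = 0.5 regardless of the placeholder values.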
A.2 Leakage Energy Consumption

The dominant source of the total leakage energy consumption of the SRAM is the 6T bit-cell leakage. This is because all the bit-cells in the memory array consume leakage current, and hence this energy scales with the total size of the memory. It is given by:

\[ E_{leak,tot} = 2^{17} \times V_{dd}\, I_{leak}(@V_{dd}) \times \frac{1}{f_{sw}} \tag{A.14} \]

where $2^{17}$ is the number of bit-cells in the 128Kb array, $I_{leak}(@V_{dd})$ is the leakage current per bit-cell and $f_{sw}$ is the frequency of operation at a particular value of $V_{dd}$. $f_{sw}$, at a particular $V_{dd}$ value, is estimated from the worst-case bit-cell read and write times.

Bibliography

[1] H. Yamauchi, "Embedded SRAM Design in Nanometer-Scale Technologies," in Embedded Memories for Nano-Scale VLSIs, K. Zhang, Ed. New York: Springer, 2009.
[2] P. Magarshack, P. Flatresse, and G. Cesana, "UTBB FD-SOI: A process/design symbiosis for breakthrough energy-efficiency," in Design, Automation Test in Europe Conference Exhibition (DATE), 2013, March 2013, pp. 952-957.
[3] J.-P. Noel et al., "Multi-VT UTBB FDSOI Device Architectures for Low-Power CMOS Circuit," in IEEE Transactions on Electron Devices, vol. 58, no. 8, August 2011, pp. 2473-2482.
[4] R. Riedlinger, R. Bhatia, L. Biro, B. Bowhill, E. Fetzer, P. Gronowski, and T. Grutkowski, "A 32nm 3.1 billion transistor 12-wide-issue Itanium processor for mission-critical servers," in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2011 IEEE International, Feb 2011, pp. 84-86.
[5] T. Burd and R. Brodersen, "Design issues for Dynamic Voltage Scaling," in Low Power Electronics and Design, 2000. ISLPED '00. Proceedings of the 2000 International Symposium on, 2000, pp. 9-14.
[6] V. Gutnik and A. Chandrakasan, "Embedded power supply for low-power DSP," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 5, no. 4, pp. 425-435, Dec 1997.
[7] N. Planes et al., "28nm FDSOI technology platform for high-speed low-voltage digital applications," in VLSI Technology (VLSIT), 2012 Symposium on, June 2012, pp. 133-134.
[8] A. Carlson, Z. Guo, S. Balasubramanian, R. Zlatanovici, T.-J. K. Liu, and B. Nikolic, "SRAM Read/Write Margin Enhancements Using FinFETs," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 18, no. 6, pp. 887-900, June 2010.
[9] T. Song et al., "A 14nm FinFET 128Mb 6T SRAM with VMIN-enhancement techniques for low-power applications," in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014 IEEE International, Feb 2014, pp. 232-233.
[10] H. Pilo, I. Arsovski, K. Batson, G. Braceras, J. Gabric, R. Houle, S. Lamphier, C. Radens, and A. Seferagic, "A 64 Mb SRAM in 32 nm High-k Metal-Gate SOI Technology With 0.7 V Operation Enabled by Stability, Write-Ability and Read-Ability Enhancements," Solid-State Circuits, IEEE Journal of, vol. 47, no. 1, pp. 97-106, Jan 2012.
[11] M. Khare et al., "A high performance 90nm SOI technology with 0.992 μm² 6T-SRAM cell," in Electron Devices Meeting, 2002. IEDM '02. International, Dec 2002, pp. 407-410.
[12] M. Sinangil, H. Mair, and A. Chandrakasan, "A 28nm high-density 6T SRAM with optimized peripheral-assist circuits for operation down to 0.6V," in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2011 IEEE International, Feb 2011, pp. 260-262.
[13] K. Takeda, T. Saito, S. Asayama, Y. Aimoto, H. Kobatake, S. Ito, T. Takahashi, K. Takeuchi, M. Nomura, and Y. Hayashi, "Multi-step word-line control technology in hierarchical cell architecture for scaled-down high-density SRAMs," in VLSI Circuits (VLSIC), 2010 IEEE Symposium on, June 2010, pp. 101-102.
[14] F. Hamzaoglu, K. Zhang, Y. Wang, H. J. Ahn, U. Bhattacharya, Z. Chen, Y.-G. Ng, A. Pavlov, K. Smits, and M. Bohr, "A 153Mb-SRAM Design with Dynamic Stability Enhancement and Leakage Reduction in 45nm High-K Metal-Gate CMOS Technology," in Solid-State Circuits Conference, 2008. ISSCC 2008. Digest of Technical Papers. IEEE International, Feb 2008, pp. 376-621.
[15] K. Nii, M. Yabuuchi, Y. Tsukamoto, S. Ohbayashi, Y. Oda, K.
Usui, T. Kawamura, N. Tsuboi, T. Iwasaki, K. Hashimoto, H. Makino, and H. Shinohara, "A 45-nm single-port and dual-port SRAM family with robust read/write stabilizing circuitry under DVFS environment," in VLSI Circuits, 2008 IEEE Symposium on, June 2008, pp. 212-213.
[16] S. Barasinski, L. Camus, and S. Clerc, "A 45nm single power supply SRAM supporting low voltage operation down to 0.6V," in Solid-State Circuits Conference, 2008. ESSCIRC 2008. 34th European, Sept 2008, pp. 502-505.
[17] J. Chang, Y.-H. Chen, H. Cheng, W.-M. Chan, H.-J. Liao, Q. Li, S. Chang, S. Natarajan, R. Lee, P.-W. Wang, S.-S. Lin, C.-C. Wu, K.-L. Cheng, M. Cao, and G. Chang, "A 20nm 112Mb SRAM in High-K metal-gate with assist circuitry for low-leakage and low-VMIN applications," in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2013 IEEE International, Feb 2013, pp. 316-317.
[18] H. Fujiwara, M. Yabuuchi, M. Morimoto, K. Tanaka, M. Tanaka, N. Maeda, Y. Tsukamoto, and K. Nii, "A 20nm 0.6V 2.1 μW/MHz 128kb SRAM with no half select issue by interleave wordline and hierarchical bitline scheme," in VLSI Circuits (VLSIC), 2013 Symposium on, June 2013, pp. C118-C119.
[19] S. Moriwaki, Y. Yamamoto, A. Kawasumi, T. Suzuki, S. Miyano, T. Sakurai, and H. Shinohara, "A 13.8pJ/Access/Mbit SRAM with charge collector circuits for effective use of non-selected bit line charges," in VLSI Circuits (VLSIC), 2012 Symposium on, June 2012, pp. 60-61.
[20] G. Moore, "Cramming More Components Onto Integrated Circuits," Proceedings of the IEEE, vol. 86, no. 1, pp. 82-85, Jan 1998.
[21] Y.-H. Chen, W.-M. Chan, W.-C. Wu, H.-J. Liao, K.-H. Pan, J.-J. Liaw, T.-H. Chung, Q. Li, G. Chang, C.-Y. Lin, M.-C. Chiang, S.-Y. Wu, S. Natarajan, and J. Chang, "A 16nm 128Mb SRAM in high-κ metal-gate FinFET technology with write-assist circuitry for low-VMIN applications," in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014 IEEE International, Feb 2014, pp. 238-239.
[22] O. Thomas, B.
Zimmer, B. Pelloux-Prayer, N. Planes, K.-C. Akyel, L. Ciampolini, P. Flatresse, and B. Nikolic, "6T SRAM design for wide voltage range in 28nm FDSOI," in SOI Conference (SOI), 2012 IEEE International, Oct 2012, pp. 1-2.
[23] D. Jacquet et al., "A 3 GHz Dual Core Processor ARM Cortex™-A9 in 28 nm UTBB FD-SOI CMOS With Ultra-Wide Voltage Range and Energy Efficiency Optimization," Solid-State Circuits, IEEE Journal of, vol. 49, no. 4, pp. 812-826, April 2014.
[24] P. Flatresse, B. Giraud, J. Noel, B. Pelloux-Prayer, F. Giner, D. Arora, F. Arnaud, N. Planes, J. Le Coz, O. Thomas, S. Engels, G. Cesana, R. Wilson, and P. Urard, "Ultra-wide body-bias range LDPC decoder in 28nm UTBB FDSOI technology," in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2013 IEEE International, Feb 2013, pp. 424-425.
[25] F. Arnaud, N. Planes, O. Weber, V. Barral, S. Haendler, P. Flatresse, and F. Nyer, "Switching energy efficiency optimization for advanced CPU thanks to UTBB technology," in Electron Devices Meeting (IEDM), 2012 IEEE International, Dec 2012, pp. 3.2.1-3.2.4.
[26] H.-P. Le et al., "A sub-ns response fully integrated battery-connected switched-capacitor voltage regulator delivering 0.19W/mm² at 73% efficiency," in Proc. IEEE ISSCC, Feb. 2013, pp. 372-373.
[27] T. V. Breussegem and M. Steyaert, "A 82% efficiency 0.5% ripple 16-phase fully integrated capacitive voltage doubler," in Proc. IEEE Symp. VLSI Circuits, Jun. 2009, pp. 198-199.
[28] L. Chang et al., "A fully-integrated switched-capacitor 2:1 voltage converter with regulation capability and 90% efficiency at 2.3A/mm²," in Proc. IEEE Symp. VLSI Circuits, Jun. 2010, pp. 55-56.
[29] J. Wang, S. Nalam, and B. Calhoun, "Analyzing static and dynamic write margin for nanometer SRAMs," in Low Power Electronics and Design (ISLPED), 2008 ACM/IEEE International Symposium on, Aug 2008, pp. 129-134.
[30] N. Gierczynski, B. Borot, N. Planes, and H.
Brut, "A New Combined Methodology for Write-Margin Extraction of Advanced SRAM," in Microelectronic Test Structures, 2007. ICMTS '07. IEEE International Conference on, March 2007, pp. 97-100.
[31] E. Seevinck, F. List, and J. Lohstroh, "Static-noise margin analysis of MOS SRAM cells," Solid-State Circuits, IEEE Journal of, vol. 22, no. 5, pp. 748-754, Oct 1987.
[32] S. Moriwaki, A. Kawasumi, T. Suzuki, T. Sakurai, and S. Miyano, "0.4V SRAM with bit line swing suppression charge share hierarchical bit line scheme," in Custom Integrated Circuits Conference (CICC), 2011 IEEE, Sept 2011, pp. 1-4.
[33] B. Zimmer, S. O. Toh, H. Vo, Y. Lee, O. Thomas, K. Asanovic, and B. Nikolic, "SRAM Assist Techniques for Operation in a Wide Voltage Range in 28-nm CMOS," Circuits and Systems II: Express Briefs, IEEE Transactions on, vol. 59, no. 12, pp. 853-857, Dec 2012.
[34] V. Chandra, C. Pietrzyk, and R. Aitken, "On the efficacy of write-assist techniques in low voltage nanoscale SRAMs," in Design, Automation Test in Europe Conference Exhibition (DATE), 2010, March 2010, pp. 345-350.
[35] E. Karl, Y. Wang, Y.-G. Ng, Z. Guo, F. Hamzaoglu, M. Meterelliyoz, J. Keane, U. Bhattacharya, K. Zhang, K. Mistry, and M. Bohr, "A 4.6 GHz 162 Mb SRAM Design in 22 nm Tri-Gate CMOS Technology With Integrated Read and Write Assist Circuitry," Solid-State Circuits, IEEE Journal of, vol. 48, no. 1, pp. 150-158, Jan 2013.
[36] M. Yamaoka, K. Osada, and K. Ishibashi, "0.4-V logic library friendly SRAM array using rectangular-diffusion cell and delta-boosted-array-voltage scheme," in VLSI Circuits Digest of Technical Papers, 2002. Symposium on, June 2002, pp. 170-173.
[37] K. Zhang, U. Bhattacharya, Z. Chen, F. Hamzaoglu, D. Murray, N. Vallepalli, Y. Wang, B. Zheng, and M. Bohr, "A 3-GHz 70MB SRAM in 65nm CMOS technology with integrated column-based dynamic power supply," in Solid-State Circuits Conference, 2005. Digest of Technical Papers. ISSCC. 2005 IEEE International, Feb 2005, pp. 474-611 Vol.
1.
[38] A. Bhavnagarwala, S. Kosonocky, C. Radens, Y. Chan, K. Stawiasz, U. Srinivasan, S. P. Kowalczyk, and M. Ziegler, "A Sub-600-mV, Fluctuation Tolerant 65-nm CMOS SRAM Array With Dynamic Cell Biasing," Solid-State Circuits, IEEE Journal of, vol. 43, no. 4, pp. 946-955, April 2008.
[39] N. Planes et al., "28nm FDSOI technology platform for high-speed low-voltage digital applications," in VLSI Technology (VLSIT), 2012 Symposium on, June 2012, pp. 133-134.
[40] M. Yamaoka, R. Tsuchiya, and T. Kawahara, "SRAM Circuit With Expanded Operating Margin and Reduced Stand-By Leakage Current Using Thin-BOX FD-SOI Transistors," Solid-State Circuits, IEEE Journal of, vol. 41, no. 11, pp. 2366-2372, Nov 2006.
[41] M. Sinangil and A. Chandrakasan, "Application-Specific SRAM Design Using Output Prediction to Reduce Bit-Line Switching Activity and Statistically Gated Sense Amplifiers for Up to 1.9× Lower Energy/Access," Solid-State Circuits, IEEE Journal of, vol. 49, no. 1, pp. 107-117, Jan 2014.
[42] J.-T. Wu and K.-L. Chang, "MOS charge pumps for low-voltage operation," Solid-State Circuits, IEEE Journal of, vol. 33, no. 4, pp. 592-597, Apr 1998.
[43] P. Feng, Y.-L. Li, and N.-J. Wu, "An improved charge pump circuit for non-volatile memories in RFID tags," in Proc. IEEE 10th ICSICT, Nov. 2010, pp. 363-365.
[44] W. Jung et al., "A 3nW fully integrated energy harvester based on self-oscillating switched-capacitor DC-DC converter," in Proc. IEEE ISSCC, Feb. 2014, pp. 398-399.
[45] M. D. Seeman, "A Design Methodology for Switched-Capacitor DC-DC Converters," Ph.D. dissertation, EECS Department, University of California, Berkeley, May 2009. [Online]. Available: http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-78.html
[46] A. Biswas, M. Kar, and P. Mandal, "Techniques for reducing parasitic loss in switched-capacitor based DC-DC converter," in Proc. IEEE 28th APEC, Mar. 2013, pp. 2023-2028.
[47] P. R. Kumar, K. Bhattacharyya, T. Das, and P.
Mandal, "Improvement of power efficiency in switched capacitor dc-dc converter by shoot-through current elimination," in Proceedings of the 14th ACM/IEEE international symposium on Low power electronics and design, ser. ISLPED '09, 2009, pp. 81-86.
[48] D. Somasekhar et al., "Multi-phase 1GHz voltage doubler charge-pump in 32nm logic process," in Proc. IEEE Symp. VLSI Circuits, Jun. 2009, pp. 196-197.