Team 2
Implementation of a Low Power SRAM design using 45 nm FreePDK Technology

Yanran Chen, Cary Converse, Chenqian Gan, David Moore
ECE 4332 – Fall 2010
University of Virginia
<yc5t, crc4m, cg5pe, dmm4ce>@virginia.edu

© 2010 Chen, Converse, Gan, & Moore

ABSTRACT
The members of Team 2, Yanran Chen, Cary Converse, Chenqian Gan, and David Moore, were presented with the challenge of developing a low-power SRAM, using the 45 nm FreePDK technology, for Portable Instruments Company (PICo). The design was required to serve in a microsensor node, with a strong focus on the low-energy constraint. This SRAM required a capacity of 1 Mb, and its merit would be based on the metric: (Active Energy per Access)² × Delay × Area × Idle Power. Several special features, including low VDD and single-bit error correction, were added to increase the appeal of the SRAM to PICo. This paper highlights the unique aspects of the design, displays functionality, explains design decisions along with trade-offs, and demonstrates the SRAM's worthiness over competing designs.

1. INTRODUCTION
Portable Instruments Company requested a low-power SRAM, using the 45 nm FreePDK technology, with a capacity of 1 Mb, consisting of 32-bit words. Team 2 has fulfilled this request with the SRAM design described in this paper, creating a high-quality product that operates with great energy efficiency, without sacrificing significant area or delay. All circuits included in the SRAM design were designed and tested using Cadence and Ocean.

2. DESIGN OVERVIEW
The design for the PICo low-power SRAM consists of bitcells and the various periphery circuits that are required to read and write the bitcells. Each sub-circuit was designed in Cadence as transistor-level logic, and layouts were created for most of the circuit components.

The memory consists of 16 memory blocks, each of which is made up of a 304-by-256 bitcell array and the circuits required for reading and writing the bitcells in the block, as well as the error correction. There are eight 32-bit words in each row of a memory block, and each word has an additional six parity bits used for error-correcting purposes, resulting in 304 bits for each 256 bits of data. A block select decoder is used to select which block will be manipulated, and the specific memory address is fed into each block, along with the clock, the read and write signals, and the data input bus. An output bus comes out of each block into a 16-to-1 multiplexer, which selects which block's read data is sent out.

Figure 1 shows the overall topology of the low-power SRAM design, in block diagram form. It indicates the high-level connections from the block decoder, which selects which of the 16 memory blocks will be accessed, and shows how the bus from each block is selected by an output 16-to-1 multiplexer.

Figure 1: Simplified Full SRAM block diagram

As shown in Figure 1, each memory block is connected to inputs such as the clock, an address bus, read/write lines, and a data input bus.

3. IMPORTANT SUB-CIRCUITS
The entire SRAM design comprises numerous decoders, multiplexers, and other standard logic gates; however, there are several sub-circuits of note that were implemented in our low-power design. These sub-circuits are explored in depth in the following sections.

3.1 Bitcell
The bitcell is a thin 6-transistor bitcell, designed to maximize area savings (see Figure 2). The layout is such that it is easily tiled into large arrays of bitcells, closely arranged with each other, with minimal unused space. Each of the MOSFETs used in the bitcell is of the high threshold voltage variety, which helps minimize leakage current (which is key for minimizing power), as well as minimizing the necessary area. The transistors are ratioed with the goal of allowing easy writes, while avoiding read upsets and minimizing area as much as possible.
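To give a sense of how many bitcells are tiled, the organization from Section 2 can be tallied with a short script. This is a minimal sketch, not part of the original design flow: the constant and function names are illustrative, and the split of a 15-bit word address into block, row, and word fields is an assumption, since the paper does not state the address mapping.

```python
# Organization figures taken from Section 2 of the paper.
NUM_BLOCKS = 16
ROWS_PER_BLOCK = 256
WORDS_PER_ROW = 8
DATA_BITS_PER_WORD = 32
PARITY_BITS_PER_WORD = 6


def columns_per_block():
    """Bitcells in one row of a block: eight words of 32 data + 6 parity bits."""
    return WORDS_PER_ROW * (DATA_BITS_PER_WORD + PARITY_BITS_PER_WORD)


def data_capacity_bits():
    """Usable data capacity of the whole SRAM, excluding parity bits."""
    return NUM_BLOCKS * ROWS_PER_BLOCK * WORDS_PER_ROW * DATA_BITS_PER_WORD


def total_bitcells():
    """All bitcells in the design, including the parity columns."""
    return NUM_BLOCKS * ROWS_PER_BLOCK * columns_per_block()


def split_address(word_address):
    """Split a 15-bit word address into (block, row, word).

    ASSUMPTION: the field order [block:4][row:8][word:3] is hypothetical;
    the paper only says a block decoder selects one of 16 blocks.
    """
    num_words = NUM_BLOCKS * ROWS_PER_BLOCK * WORDS_PER_ROW
    if not 0 <= word_address < num_words:
        raise ValueError("word address out of range")
    word = word_address & 0x7           # 3 bits: one of 8 words in the row
    row = (word_address >> 3) & 0xFF    # 8 bits: one of 256 rows
    block = (word_address >> 11) & 0xF  # 4 bits: one of 16 blocks
    return block, row, word


if __name__ == "__main__":
    print("columns per block:", columns_per_block())
    print("data capacity (bits):", data_capacity_bits())
    print("total bitcells:", total_bitcells())
```

Under these figures each block has 304 columns and the data capacity is 2^20 bits, which matches the 304-by-256 array and the 1 Mb capacity stated in Section 2.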
To further minimize the leakage current from, and the power consumed by, the bitcell, the supply voltage was reduced to 600 mV for active operation. To gain additional power savings, a "sleep mode" was implemented, in which the voltage rail is dropped further, to 300 mV [2]. Simulations were performed to verify that the data in each bitcell is held at this sleep voltage, and to assess whether the voltage could be dropped lower to realize even greater power savings, while maintaining sufficient static noise margin. Six-transistor bitcells function reliably with voltage rails above the threshold, but using a lower voltage requires bitcells composed of eight to ten transistors, resulting in asymmetrical layouts that are difficult or impossible to tile efficiently into blocks [3].

Figure 2: Bitcell Layout

Figure 3, below, shows the simulated butterfly plots of the bitcell at 300 mV (top) and 250 mV (bottom), which were used to determine the sleep voltage required to retain the stored data. It can be seen that the static noise margin is considerably smaller at 250 mV, so 300 mV was chosen as the sleep voltage to keep the SRAM more robust.

Figure 3: Bitcell Butterfly Plots of Hold at Lower Voltages

3.2 Latching Voltage-Mode Sense Amplifier
The latching voltage-mode sense amplifier is implemented in Team 2's low-power SRAM to allow bitcells to be read, while reducing active and leakage energy compared with other potential sense amplifier designs. The design of the sense amplifier is based on a design intended for sub-threshold operation [5], but it still works well at the 600 mV active rail applied in our SRAM design. By using a latching voltage-mode sense amplifier to read the bitcell values, the bitlines do not need to drop as far for the voltage differential between them to be recognized, so current is minimized, and the read process is sped up.

Figure 4: Layout of Voltage-Mode Latching Sense Amplifier

Ideally, we would reduce the active voltage even further, in order to operate in the sub-threshold region; however, such low voltages could not be implemented without significant redesign of our bitcell. Active voltages cannot reliably be reduced below 600 mV for 6T bitcells, so this limited how far the voltage could be reduced in pursuit of power savings [6]. Furthermore, delays increase very quickly as the rail voltage drops below 600 mV [3].

3.3 Single-Bit Error-Correcting Code
Single-bit error correction was implemented as one of the SRAM's special features, in order to make the design more robust. As a low-power SRAM, low voltages are used, which results in smaller static noise margins. This means that cosmic rays or other noise may more easily flip the value of any given bitcell. With error correction implemented, the SRAM is strengthened by minimizing the chance that an output bit may be incorrect after having been inadvertently flipped in the memory.

The 6-parity-bit Hamming Code was chosen for the error correction scheme because of its relative simplicity and ease of implementation. This Hamming Code allows the correction of one error per 32-bit word, or the detection of two errors per 32-bit word. Each block of memory contains its own ECC circuitry, which can be divided into two components: parity generation and error fixing. Parity generation occurs as a word is being written, so that the six parity bits can be stored in the memory block along with the data bits, and error fixing occurs when a word is read out from the memory block.

Unfortunately, the six parity bits and the extra logic circuitry add significant overhead to the complexity, area, and power of the design, which will harm our overall metric value. Error correction is, however, an important and valuable special feature to improve the robustness of the design, so it was assessed to be a worthy addition to the low-power SRAM design.

3.3.1 Parity Generation
The Hamming Code in this SRAM is implemented by using even parity to generate bits that can be used to correct single errors in the future. This is done by using exclusive-OR gates to determine the parity bits. The XOR inputs are chosen according to the Hamming Code convention, which essentially allows the parity bits to encode the location of any error, so that it can be corrected in the future. The XOR scheme is shown in Figure 5, below.

Figure 5: Parity Generation Logic

What makes Hamming Code so appealing is its ability to use the parity bits to determine where the error occurs, by "encoding" the error location in a manipulation of the parity bits themselves.

3.3.2 Error Fixing
Once a read is called, the 38 bits in a word, comprising 32 data bits and six parity bits, are read out of their bitcells to be checked before being sent through the output multiplexer. Parity check bits are determined by using the same logic as the original parity generation method, but using the bitcell values as the inputs instead. These new parity check bits are then exclusive-ORed with the original parity bit values, to compare them. The results of these XORs form a value that indicates the location of an error, if any. This value is then sent to a decoder that controls which memory out bit must be flipped if an error is detected in one specific location.

The circuit that actually corrects the error is composed of a 2-to-1 multiplexer that takes the memory out value and its inverted value. The mux's select line is driven by a corresponding line from the output of the error location decoder, to determine whether the original memory out value or its inverted value will pass out of the memory block. This entire process can be seen in Figure 6, below.

Figure 6: Error-Fixing Logic

3.4 Memory Block
The memory block design consists of the 304-by-256 bitcell array, as well as other periphery used to read and write the bits, while controlling which words will be accessed. The vast majority of the memory block area is composed of the bitcells themselves. Bitcells within the memory block are interleaved in order to minimize the effects of the error bursts that cosmic rays would cause in real-world operation. Interleaving also allows for the sharing of peripheral circuitry that appears inside the memory block itself, such as the latching voltage-mode sense amplifier, which is shared across four columns of bitcells.

The decision was made to divide the memory into smaller blocks in order to minimize the capacitance associated with the wordlines, and to help reduce the access energy. Simulations were performed to determine the average power per operation per block for different block sizes (see Figure 7 below) in order to assess which size would be most beneficial for the energy component of the metric. Additional guidance was provided by the decisions and reasoning of past VLSI groups [1].

Figure 7: Average Power Access Results

Though the 256-by-256 block does not necessarily achieve the minimum power possible, it allows for simpler design of the periphery multiplexers that are required to control the blocks, while still realizing lower power. Due to the later addition of ECC circuitry, the dimension did change slightly to 304-by-256, which results in higher power per access, but the relative power and complexity compared to other dimensions remain similar.

4. RESULTS
The metric provided by PICo to measure optimality was set as:

Metric = (Active Energy per Access)² × Delay × Area × Idle Power

The following sections discuss the Active Energy, Delay, Area, and Idle Power of Team 2's SRAM design.

4.1 Overall Findings
Using the calculated active energy, delay, area, and idle power, the low-power focused metric outlined by PICo can be calculated.

4.1.1 Active Energy per Access
Active energy is measured as the average energy of the design over five read accesses and one write access, using the fastest clock that the SRAM can accept. With the VDD set at 600 mV, the read energy was measured at 9.678 fJ, and the write energy was measured at 9.834 fJ. As a result, the active energy per access was computed to be 9.704 fJ.

4.1.2 Delay
The delay is measured as the worst-case delay for writing to and reading from a bitcell. The read delay was measured as the delay from the start of the precharge to when the value has passed through the output error correction circuitry, so the precharge pulse was shortened just to the point at which the circuit would no longer read. Thus the read delay was measured to be approximately 7.186 ns, and the write delay was measured at 6.491 ns, so the accepted delay of the circuit is the greater of the two, which is 7.186 ns.

4.1.3 Area
The area of the device was calculated by measuring the dimensions of the effective rectangle that surrounds the entire design. Since several of the peripheral circuits were not fully laid out, their exact effects could not be considered for the calculation of the area; however, their areas were estimated and sufficient extra space was added to the rectangle to take them into account. The overall area was thus calculated to be roughly 1.2 mm². Since the area of a single bitcell is approximately 0.649 μm², it can be found that bitcells cover over two-thirds of the area of the entire SRAM design. This is a testament to our design's ability to easily and efficiently tile the bitcells; however, it is clear that different dimensions of memory blocks would have allowed the area of peripheral circuits to be minimized, which would further increase the area efficiency.

4.1.4 Idle Power
The idle power is the power that is consumed by the SRAM while the clock is off and no memory accesses are occurring. Essentially, this represents the quiescent power of the design, so it should be minimized in order to make the design truly low-power, so that any battery source used to power the SRAM will be long-lasting. Using the active VDD rail, set at 600 mV, the idle power is calculated to be 57.78 μW. With a sleep VDD rail used (set at 300 mV, as discussed previously), the idle power is reduced significantly, to only 0.483 μW. This is a significant achievement that will make the SRAM very useful in a microsensor node, since the power source can be very small and last a long time, by activating the sleep mode when necessary. For purposes of calculating the metric, however, sleep mode will be ignored.

4.1.5 Metric Calculation
The final metric was calculated to be:

Metric = (9.704 fJ)² × (7.186 ns) × (1.2 mm²) × (57.78 μW) = 4.692 × 10⁻⁴¹ J²·s·mm²·W

5. CONCLUSIONS
In conclusion, Team 2's low-power SRAM design provides a product that produces a superb metric value, through significant gains in energy efficiency. Some tradeoffs were made with speed, especially by decreasing the voltage rail, but since energy is the most emphasized specification in the metric, speed was willingly sacrificed. The area of the SRAM may be somewhat large, but this was necessary to maintain some simplicity of design, and keep the delay from increasing unreasonably. Furthermore, much of the area was taken up by the bitcells, which were well designed to tile easily and maximize our area efficiency. Other methods of saving space were forgone (specifically dynamic logic design) because they would have resulted in extra energy consumption [4].

Our special features should positively differentiate our design from our competitors. Single-bit error correction is a necessary feature in low-power SRAM, in order to maintain data despite noise and some variations in the MOSFETs.
The low voltage rail and sleep voltage permit extreme power savings, thus improving our metric greatly. The final metric value for the SRAM was 4.692 × 10⁻⁴¹ J²·s·mm²·W, making Team 2's design the best choice for the PICo contract.

6. ACKNOWLEDGMENTS
We would like to thank Benton Highsmith Calhoun for his essential design knowledge and guidance.

7. REFERENCES
[1] ECE 4332 2009 Group Projects Pages. UVa ECE Wiki.
[2] Kaxiras, S., Hu, Z., & Martonosi, M. Cache Decay: Exploiting Generational Behavior to Reduce Cache Leakage Power. 28th International Symposium on Computer Architecture, 2001.
[3] Ling, S., Kim, Y. B., & Lombardi, F. A Low-Leakage 9T SRAM Cell for Ultra-Low Power Operation, 2008.
[4] Rabaey, J. Digital Integrated Circuits: A Design Perspective. Prentice Hall, 2003.
[5] Ryan, J. F., & Calhoun, B. H. Minimizing Offset for Latching Voltage-Mode Sense Amplifiers for Sub-Threshold Operation. 9th International Symposium on Quality Electronic Design, 2008.
[6] Wang, A., Calhoun, B. H., & Chandrakasan, A. P. Sub-Threshold Design for Ultra Low-Power Systems. Springer, 2006.
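As a supplementary sketch, the parity generation and error fixing steps of Section 3.3 can be modeled in software. The code below assumes the textbook positional Hamming layout, with parity bits at codeword positions 1, 2, 4, 8, 16, and 32 and data bits filling the remaining positions up to 38; the exact XOR assignment of the paper's Figure 5 is not recoverable from the text, so this mapping and all function names are assumptions for illustration.

```python
# ASSUMPTION: standard positional Hamming(38, 32) layout, not necessarily
# the exact XOR wiring used in the paper's parity generation logic.
PARITY_POSITIONS = (1, 2, 4, 8, 16, 32)
CODEWORD_BITS = 38
DATA_POSITIONS = [p for p in range(1, CODEWORD_BITS + 1)
                  if p not in PARITY_POSITIONS]


def generate_parity(data_bits):
    """Return six even-parity bits for a list of 32 data bits (0 or 1).

    Each parity bit is the XOR of the data bits whose codeword position
    has the corresponding binary digit set.
    """
    if len(data_bits) != len(DATA_POSITIONS):
        raise ValueError("expected 32 data bits")
    parity = []
    for p in PARITY_POSITIONS:
        bit = 0
        for pos, d in zip(DATA_POSITIONS, data_bits):
            if pos & p:
                bit ^= d
        parity.append(bit)
    return parity


def fix_errors(data_bits, stored_parity):
    """Recompute parity on read and correct a single flipped data bit.

    Returns (corrected_data, syndrome). The syndrome is the XOR of the
    recomputed and stored parity bits, read as a codeword position;
    0 means no error was detected.
    """
    check = generate_parity(data_bits)
    syndrome = 0
    for p, c, s in zip(PARITY_POSITIONS, check, stored_parity):
        if c ^ s:
            syndrome |= p
    corrected = list(data_bits)
    if syndrome in DATA_POSITIONS:
        # Models the per-bit 2-to-1 mux selecting the inverted value.
        corrected[DATA_POSITIONS.index(syndrome)] ^= 1
    # A syndrome that names a parity position means a stored parity bit
    # flipped, so the data bits are passed through unchanged.
    return corrected, syndrome


def word_to_bits(value):
    """Helper: 32-bit integer to a list of bits, least significant first."""
    return [(value >> i) & 1 for i in range(32)]


if __name__ == "__main__":
    word = word_to_bits(0xDEADBEEF)
    parity = generate_parity(word)
    upset = list(word)
    upset[5] ^= 1  # simulate a single-bit upset in the stored word
    fixed, syndrome = fix_errors(upset, parity)
    print("corrected:", fixed == word, "syndrome:", syndrome)
```

The syndrome plays the role of the error location value described in Section 3.3.2, and the conditional flip stands in for the decoder-driven multiplexer.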
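The arithmetic behind Sections 4.1.1 through 4.1.5 can be reproduced with a few lines of code. This is a sketch for checking the reported figures, not the team's original Ocean measurement flow; the function and constant names are illustrative, while the numeric inputs are the values reported in the paper.

```python
# Measured values as reported in Section 4.1 (SI units unless noted).
READ_ENERGY_J = 9.678e-15
WRITE_ENERGY_J = 9.834e-15
READ_DELAY_S = 7.186e-9
WRITE_DELAY_S = 6.491e-9
AREA_MM2 = 1.2
IDLE_POWER_ACTIVE_RAIL_W = 57.78e-6  # VDD = 600 mV
IDLE_POWER_SLEEP_RAIL_W = 0.483e-6   # VDD = 300 mV, not used in the metric


def active_energy_per_access(read_energy, write_energy, reads=5, writes=1):
    """Average energy over five read accesses and one write access."""
    return (reads * read_energy + writes * write_energy) / (reads + writes)


def worst_case_delay(read_delay, write_delay):
    """The accepted delay is the greater of the read and write delays."""
    return max(read_delay, write_delay)


def metric(energy_j, delay_s, area_mm2, idle_power_w):
    """(Active Energy per Access)^2 * Delay * Area * Idle Power."""
    return energy_j ** 2 * delay_s * area_mm2 * idle_power_w


if __name__ == "__main__":
    energy = active_energy_per_access(READ_ENERGY_J, WRITE_ENERGY_J)
    delay = worst_case_delay(READ_DELAY_S, WRITE_DELAY_S)
    value = metric(energy, delay, AREA_MM2, IDLE_POWER_ACTIVE_RAIL_W)
    print(f"active energy per access: {energy * 1e15:.3f} fJ")
    print(f"worst-case delay: {delay * 1e9:.3f} ns")
    print(f"metric: {value:.3e} J^2*s*mm^2*W")
    ratio = IDLE_POWER_ACTIVE_RAIL_W / IDLE_POWER_SLEEP_RAIL_W
    print(f"idle power reduction in sleep mode: about {ratio:.0f}x")
```

With the reported inputs, the weighted average energy and the metric agree with the paper's 9.704 fJ and 4.692 × 10⁻⁴¹ J²·s·mm²·W to the stated precision.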
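The claim in Section 4.1.3 that bitcells cover over two-thirds of the total area can be checked the same way. The sketch below simply multiplies the bitcell count implied by Section 2 by the reported bitcell area; the names are illustrative, and the 1.2 mm² total is the paper's rounded estimate, so the resulting fraction is approximate.

```python
# Figures reported in Sections 2 and 4.1.3.
NUM_BLOCKS = 16
ROWS_PER_BLOCK = 256
COLUMNS_PER_BLOCK = 304
BITCELL_AREA_UM2 = 0.649
TOTAL_AREA_MM2 = 1.2  # rounded estimate that includes unlaid-out periphery

UM2_PER_MM2 = 1_000_000  # 1 mm^2 = 10^6 um^2


def bitcell_area_mm2():
    """Total area occupied by bitcells, in mm^2."""
    cells = NUM_BLOCKS * ROWS_PER_BLOCK * COLUMNS_PER_BLOCK
    return cells * BITCELL_AREA_UM2 / UM2_PER_MM2


def bitcell_area_fraction():
    """Fraction of the estimated total area covered by bitcells."""
    return bitcell_area_mm2() / TOTAL_AREA_MM2


if __name__ == "__main__":
    print(f"bitcell area: {bitcell_area_mm2():.3f} mm^2")
    print(f"fraction of total area: {bitcell_area_fraction():.1%}")
```

The bitcells alone come to roughly 0.81 mm², which is a little over two-thirds of the 1.2 mm² estimate.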