Lower Power SRAM in the AMI06 0.5 μm Technology

Caroline Andrews, Robert Hunter, Yousef Shakhsheer
ECE 432 – Fall 2007
University of Virginia
cja3g@virginia.edu, rdh8t@virginia.edu, yas5b@virginia.edu

ABSTRACT
A lower power, one-megabit SRAM was designed in the 0.5 μm AMI06 technology. The memory uses a highly integrated design of sixteen 64 kb blocks. All access periphery is equipped with enable signals to minimize power consumption while it is not being used. Simulations of read and write were performed at all process, temperature, and supply voltage variations. Total area was reduced by utilizing a flat bit cell layout which was tiled efficiently by overlapping the cells. Hierarchical decoders serve to minimize access delay to the memory arrays.

1. INTRODUCTION
As transistor sizes decrease and the IC industry is able to place more and more transistors on a single chip, designers are able to create larger memories every year. Another direction in which technology has advanced is the ultra-portable and therefore low power domain. Wireless sensing applications often demand very low power systems but do not require high speed. The design of static random access memory depends greatly on its application. Investigation into lowering the power dissipation of each component of the memory allows for an extremely efficient design.

2. MEMORY DESIGN
The 1 Mb memory array was divided into sixteen 64 kb blocks. Each block contains a 256 by 256 bit cell array (65,536 bits). Each word line contains eight 32-bit words. Each word is addressable by 15 bits (a four-bit block address, an eight-bit row address, and a three-bit column address). The nominal operation of the memory is at 5 V and room temperature. However, its performance was tested from 4.5 V to 5.5 V and between 0 °C and 50 °C at all four process corners of variation in the transistor threshold voltages (FF, SS, SF, FS).

Each memory block contains the aforementioned 256 x 256 bit array, 256 sense amplifiers, 32 eight-input multiplexers, 512 transmission gates, a three-to-eight decoder, and drivers to effectively read and write. The overall memory shares one 8:256 decoder, one three-to-eight decoder, one sixteen-input clock, 32 input registers, and 32 output registers. A common clock is shared by both sets of registers. A common voltage source is shared by all memory components.

To read from memory, the 8:256 decoder asserts a single word line using the eight-bit row address. All bits on the word line are read by the sense amplifiers. The correct word is chosen using the three-bit column address and then selected from the appropriate memory block by the 16-input multiplexer. The word is latched by the output registers.

To write to memory, the 8:256 decoder also asserts a single word line using the eight-bit row address. Data is latched by the input registers. The decoded four-bit block address and the decoded three-bit column address are combined to choose the correct word and to produce the signals which turn on the transmission gates to access the 32 bit lines (BL). The data is driven onto the proper BLs and bit line bars (BLB) of the memory array. The new bit cell values are available to be read or written over on the next clock cycle.

3. MEMORY CELL
The memory uses a standard six-transistor SRAM bit cell. The transistors were sized as small as possible while ensuring proper operation. In order to prevent a read upset, the NMOS pass transistor was weakened by sizing it larger than the pull-down transistor. The pull-down transistor is sized slightly larger than the PMOS transistors in order to ensure that the NMOS transistors will not turn on during a write operation.

In laying out the cell, we chose to use a flat layout similar to that presented by Kim et al. [1]. This layout allows for an extremely compact memory, as it is mirrored and tiled in four-cell arrays, which are then tiled to produce each memory block.

Figure 1. Layout of bit cell

4. PERIPHERY

4.1 Sense Amplifier
This memory employs a differential voltage sensing amplifier in order to prevent the BLs and BLBs from completely discharging for each read. The amplifier senses a 0.5 V or greater difference between BL and BLB and swings to full VDD when BL is higher and down to the ground rail when BLB is higher. The topology uses a dummy inverter to present a similar load on both sides in order to balance the charging to VDD.

4.2 Word Line and Write Drivers
To ensure the lowest delay, the memory utilizes drivers on its word lines and bit lines. The drivers consist of four inverters in series, each sized to the optimal fan-out of 4.

4.3 Decoders
To speed up the memory while minimizing power dissipation, its decoding blocks feature hierarchical decoders. The memory utilizes three-to-eight, four-to-16, and eight-to-256 decoders. Each decoder begins with NOR pre-decoders.

4.4 Multiplexers
To select the correct word, the memory utilizes multiplexers. While reading, the unused bit cells on the word line perform dummy reads. Using the column address, an eight-input multiplexer outputs the correct word from the memory block. The 32-bit words are sent from each memory block to a 16-input multiplexer. The correct word is selected with the block address. The eight-input and 16-input multiplexers were designed by cascading a single two-input multiplexer. The multiplexer topology includes transmission gates and inverters as well as an AND gate for the enable signal.

4.5 Registers
Rising-edge-triggered registers are used to hold the 32-bit word to be written and the 32-bit word to be read. The registers use a common clock and its complement, and are based on a master-slave configuration of two multiplexers.

Figure 2. Master Slave Edge Triggered Register

5. SPECIAL FEATURES
Enable signals distributed throughout the memory periphery allow for power reduction.
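One way to picture this gating is as a one-hot block enable derived from the address. The sketch below is a behavioral model only, not the circuit itself. It assumes the 15-bit word address from Section 2 is packed as block (4 bits), row (8 bits), column (3 bits) from most to least significant; that bit ordering and the function names are illustrative assumptions, not taken from the design.

```python
def split_address(addr):
    """Split a 15-bit word address into (block, row, column) fields.

    Assumed packing (hypothetical): [block:4 | row:8 | column:3].
    """
    if not 0 <= addr < (1 << 15):
        raise ValueError("address must fit in 15 bits")
    column = addr & 0x7          # 3 bits: one of eight 32-bit words on a word line
    row = (addr >> 3) & 0xFF     # 8 bits: one of 256 word lines
    block = (addr >> 11) & 0xF   # 4 bits: one of sixteen blocks
    return block, row, column


def block_enables(addr, read, write):
    """Return sixteen enable bits: one-hot during an access, all zero when idle."""
    enables = [0] * 16
    if read or write:
        block, _, _ = split_address(addr)
        enables[block] = 1
    return enables


if __name__ == "__main__":
    addr = (5 << 11) | (200 << 3) | 6
    print(split_address(addr))
    print(block_enables(addr, read=True, write=False))
```

In this model, fifteen of the sixteen blocks see a zero enable on every access, and all sixteen do when neither read nor write is asserted.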
During each read/write, access to only one block out of sixteen is enabled. The decoder circuits are disabled unless a read or write occurs. The sense amplifiers below each block are only enabled when the read bit is asserted. When neither a read nor a write is occurring, all periphery is disabled while the bit arrays remain unchanged, holding the state of the bit cells.

6. RESULTS AND SIMULATIONS
Due to the large size of the memory block, a testing block was required in order to perform simulations. The testing block contains four bit cells at the corners of the block, along with cells emulating multiple bit cells, used to simulate rows and columns containing 256 bit cells. The testing block also contains capacitors which represent the capacitance of the bit cells, along with resistors to model the interconnect resistance. This capacitance value was obtained by extraction on the bit cell layout in Cadence. The block also contains a sense amplifier and several logic gates to model the decoders.

Simulations were performed on the memory at all PVT corners: process = [TT, FF, SS, SF, FS], V = [5.5 V, 5 V, 4.5 V], T = [50 °C, 27 °C, 0 °C]. The memory performed correctly at all combinations of these parameters.

6.1 Power Consumption
Power consumption was measured over a simulation containing five reads and one write operation. The power measured is the average power over a period of 1 μs. Power consumption during the write operation was found to be larger because larger voltages were discharged on the bit lines. The average power was obtained through Spectre.

Table 1 displays the on and off power for the main periphery of the memory. The enable signals on the access circuitry provide significant savings to the overall memory, as only a small fraction of the logic is turned on for each read or write. The decoders, which use the most power, are only enabled by the read or write signals and so would all be in the off mode while the memory is simply holding its state.
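To get a feel for how the enabled and disabled instances contribute, the per-component figures reported in Table 1 can be combined as a simple quantity-weighted sum. This is a minimal sketch: the summation rule and all names are assumptions for illustration, not the procedure used to obtain the simulated power figures.

```python
# Figures copied from Table 1: (on power [W], qty on, off power [W], qty off).
TABLE_1 = {
    "Sense Amp":   (0.174e-3, 256, 9.4e-9,   3840),
    "256 Decoder": (13e-3,    1,   206e-9,   15),
    "16 Decoder":  (5.906e-3, 0.8, 11.54e-9, 0.2),
    "8 Decoder":   (4.062e-3, 2,   5.78e-9,  15),
    "Tx Gates":    (10.2e-6,  64,  1e-12,    4288),
    "16-Mux":      (220e-6,   32,  15.6e-6,  480),
    "8-Mux":       (199e-6,   32,  10e-6,    480),
}


def periphery_power(table):
    """Return (on_watts, off_watts) as quantity-weighted sums over all rows."""
    on_w = sum(p_on * n_on for p_on, n_on, _, _ in table.values())
    off_w = sum(p_off * n_off for _, _, p_off, n_off in table.values())
    return on_w, off_w


if __name__ == "__main__":
    on_w, off_w = periphery_power(TABLE_1)
    print(f"enabled instances:  {on_w * 1e3:.2f} mW")
    print(f"disabled instances: {off_w * 1e3:.2f} mW")
```

Because this is a plain sum over table rows, its result should not be expected to match the read, write, or total power values reported elsewhere in the paper.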
The 16 decoder (four-to-16) is used to select one of sixteen memory blocks; we estimated the activity of the memory to be around 80% in determining what its on and off power would be.

Table 1. Active versus Inactive Power Consumption

Component     On Power    Qty On   Off Power   Qty Off
Sense Amp     0.174 mW    256      9.4 nW      3840
256 Decoder   13 mW       1        206 nW      15
16 Decoder    5.906 mW    0.8      11.54 nW    0.2
8 Decoder     4.062 mW    2        5.78 nW     15
Tx Gates      10.2 μW     64       1 pW        4288
16-Mux        220 μW      32       15.6 μW     480
8-Mux         199 μW      32       10 μW       480

6.2 Delay
Read delay was measured as the time between the read signal being asserted and the output being latched. Write delay was measured as the time between the write command and the voltages of the internal nodes of the bit cell changing to the desired values. The read delay was longer than the write delay (2.3 ns versus 1.4 ns, Table 2) because of the time needed to latch the output as well as to discharge BL and BLB.

6.3 Layout
The flat layout used for the bit cell and the high level of integration of our memory blocks allow the design to save area used for periphery. The cell was designed to overlap on all four sides with surrounding cells by being flipped in the x and y directions to create a block of four cells. These were in turn tiled with SKILL code to create the 64 kb block. While the area of a single cell is 239.04 μm², its effective area within one of the sixteen 64 kb blocks is only 182.03 μm², a reduction of about 24%. Since the memory blocks represent 95% of the whole memory, this savings in area is significant.

The sense amplifiers were laid out to snap directly under the bit cells without any further connections. The word line drivers were laid out using fingers to reduce transistor areas and were designed to fit right next to the memory blocks by having the same effective height as a bit cell. The input and output registers were laid out as single cells and then tiled with overlap to reduce their overall area. The multiplexers were laid out in an efficient way which

The total area of the memory was estimated by tiling the 16 bit cell arrays with sense amps, drivers, and multiplexers, and by leaving enough space for the remaining periphery and routing, which was not laid out in the time allowed. The estimate provided in Table 2 is a conservative one and likely to be higher than the true value by about 10%.

6.4 Metric Evaluation
The metric for the lower power SRAM was provided by PiCO. The table below displays the results of this memory design. The reported metric value is consistent with the product of total power squared, total delay, and total area: (0.1328 W)² × 2.3 ns × 256099200 μm² ≈ 10388006 W² · ns · μm².

Table 2. Metric Breakdown

Metric          10388006 W² · ns · μm²
Bit Cell Area   239.04 μm² (Effective: 182.03 μm²)
Total Area      256099200 μm²
Read Power      29.4 mW
Write Power     30.56 mW
Total Power     132.8 mW
Read Delay      2.3 ns
Write Delay     1.4 ns
Total Delay     2.3 ns

7. CONCLUSION
The designed lower power SRAM meets the specifications set forth by PiCO. By disabling unused memory periphery, the design saves power, which other designs do not attempt to do. By overlapping bit cells, the design minimizes the area required for an IC and provides a product suited to remote sensing applications. The group feels it has addressed all 6 goals as defined in the requirement sheet. The remaining work needed to complete the memory layout would consist of routing the various blocks to each other.

Through more development, the design can be refined to further reduce power consumption through implementation of a sleep mode for inactive memory blocks. Lower power and faster decoders could be designed by increasing the levels of hierarchy and utilizing more NAND and NOR gates instead of the primarily AND-based logic currently used.

8. ACKNOWLEDGMENTS
We would like to thank Ben Calhoun and Jiajing Wang for their support throughout the semester.

9. REFERENCES
[1] Kim, K.J., et al. A Novel 6.4μm² Full-CMOS SRAM Cell with Aspect Ratio of 0.63 in a High-Performance 0.25μm-Generation CMOS Technology. Symposium on VLSI Technology Digest of Technical Papers, (Jun. 1998), 68-69.