Designing a Low Power SRAM for PICo

Sharif Morad, Max Martin, Jaleesa Boykin, Marc Hall
ECE4332 – Fall 2009
som5r@virginia.edu, mbm9p@virginia.edu, jlb2aa@virginia.edu, mah9ks@virginia.edu

ABSTRACT

A low power 1Mb embedded SRAM with a 32-bit word size was designed for PICo based on a metric that includes total power, delay, and area. Several design decisions were made to lower power, delay, and area. Simulation results from Cadence using Spectre were used to verify functionality and design decisions. Low power was obtained by using a lower VDD, a shorter clock period, shorted bit lines, and hierarchical pre-charging and word lines.

1. INTRODUCTION

This SRAM design uses AMI 0.6 um technology. Simulations in Cadence and past attempts at designing a low power SRAM [1] were utilized to achieve a suitable design for PICo. Schematics, simulations of the modeled SRAM array, and layouts were created and delivered to PICo for review. To verify the design, simulation was done in all process corners to test for variation, temperature change, and different supply voltages.

2. METRICS AND REQUIREMENTS

The designed SRAM must meet certain design metrics, which serve as the measure of success in meeting PICo's requirements. The primary concern outlined was low power for the SRAM. To measure success, the following equation was applied:

    metric = (total power)^2 * delay * area

The units of measure for each of the terms are as follows:

    (total power)^2 = mW^2
    delay = ns
    area = mm^2

The design must also have a size of 1 Mb, have a word size and output of thirty-two bits, and maintain one read or write access per cycle.

3. ARCHITECTURE

3.1 Array Architecture

In designing a 1Mb SRAM, the architecture of the array was crucial to achieving low power. Long bit lines and word lines contribute significant capacitance, so a decision was made to break up the 1Mb array into 16 separate blocks, with each block containing 256 x 256 bit cells. Each row of a block contains 8 words, and each word contains 32 bits. The blocks were arranged in a 4 x 4 array such that there are 1024 bit cell rows and 1024 bit cell columns. These blocks allow the use of a hierarchical word-line scheme, which results in faster access to the bit cells because no word line has to drive a full row of 1024 bit cells (i.e. capacitance is reduced) (see Figure 1).

In addition, another design decision was made to short the bit lines together for each column of blocks. This was done to reduce the amount of area required to read and write to each block (see Sections 3.3 and 3.4). A significant area reduction was achieved by doing this, which helped to lower the design metric (see attached calculations).

Figure 1: Array Architecture

3.1.1 Bit Cell

A standard six-transistor bit cell was used for the SRAM design. The bit cell was sized according to calculations in the attached report and also through robust testing to find the proper ratio to allow for proper reading and writing (see Figure 2).

Figure 2: Bitcell

3.2 Decoders and Block Select

There are four 8:256 decoders, one for each row of blocks, and each of these decoders drives the 256 word lines of its row. Each decoder is controlled by the same 8-bit address. Each word line output of a decoder is ANDed with a block select bit from a 4:16 decoder to determine which of the 16 blocks to access at a time (see Figure 3). This significantly reduces the capacitance that a word line drives when reading or writing to a specific bit cell. Instead of driving 4 blocks in a row, the word line drives only 1 block and therefore drives 1/4 of the capacitance.

Figure 3: Block Select

In addition, each of the four decoders has an enable signal, with the logic configured such that only one decoder is enabled at a time to save power (see Figure 4). A decoder activates only when the pre-charge phase is over, either a read or a write is being performed, and a block in that decoder's row is being accessed.

Figure 4: Block Decoder Logic

3.3 Data Access

To read a 32-bit word, an 8:1 multiplexer was placed at the bottom of each column of blocks (see Figure 5). Only one multiplexer was needed for each of the four columns because the bit lines were shorted for each block column. This reduced the number of transistors needed, and hence the area, by a factor of 4, at the cost of increasing delay by a factor of less than 2 relative to the read delay of one block (see attached Calculation 6). The multiplexer chooses which of the 8 words to access and outputs this word to a sense amplifier for each block column. The output of each of these sense amplifiers (one for each block column) is sent to a 4:1 multiplexer, which ultimately chooses which word to output to a register. Using this architecture removed the need for a multiplexer and sense amplifier for each block and drastically reduced area.

Figure 5: Block Output Logic

3.3.1 Pre-charge

Because the bit lines are shorted together, a pre-charge block was made for each block column. This saves area, because one pre-charge block serves a whole column of blocks instead of each of the 16 blocks having its own (see Figure 6).

Figure 6: Precharge Logic

Allowing only one pre-charge block to be active at a time also reduces power. The desired pre-charge block is selected by a 2:4 decoder that chooses which block column to access (see Figure 7).

Figure 7: Periphery for write transistors

3.3.2 Sense Amplifier

To save power, a differential sense amplifier was created to prevent the bit lines from discharging completely to 0 during a read operation (see Figure 8). Since the sense amplifier is used during a read operation, the sense amplifier enable signal is the Read input signal buffered with four inverters. When Read is asserted high, the sense amplifier enable goes high as well, with a delay of only 1.64 ns.

Figure 8: Sense Amplifier

3.4 Data Writing

To write a 32-bit word, NMOS transistors were placed at the bottom of each bit line and bit line bar for each block column (see Figure 9). Data0 through data31 and databar0 through databar31 are buffered (see Figure 10) and sent through the NMOS transistors for each word in a block.

Figure 9: Block Data Inputs

Figure 10: Data Driver

The gates of the transistors that allow data to pass through are controlled by a 3:8 decoder, which allows only one word to be written at a time. There are four of these 3:8 decoders (one for each block column), and, to save power, logic was implemented such that only one of these 3:8 decoders is enabled at a time (see Figure 7). The 3:8 decoder activates when the pre-charge phase is over, a write is being performed, and its block column is selected. The selected block column is determined by a 2:4 decoder.

4. FURTHER OPTIMIZATIONS

The supply voltage is significant in terms of power consumption. To that end, VDD was lowered from 5V to 2.5V and the SRAM was tested to verify functionality. Based on average power simulations for read and write, this significantly reduced the power. Also to reduce power, the clock period was decreased to a point where the SRAM was still functional (see attached Calculation 6 for an in-depth analysis). Finally, the width ratio of the pre-charge transistors to the data writing transistors was reduced to the minimum that still allows functionality, in order to save area.

5. CHALLENGES

5.1 Cadence/Spectre Troubleshooting

Cadence with Spectre was a useful simulation tool for verifying the functionality of the 1Mb SRAM. However, the simulations produced many errors that took significant amounts of time to troubleshoot. Getting familiar with the Cadence interface took some time as well, and, given the time constraints, this limited the time available to dedicate to designing the SRAM.

5.2 Tradeoffs

In designing this SRAM, the primary metrics of concern were power, area, and delay. Tuning each one of these "knobs," however, affects the location on the Pareto optimal curve. For example, adding buffers might decrease delay, but it will increase the energy and thus the power in the SRAM. If the delay saved is much more significant than the energy added, this is a good design decision. Deciding how sensitive each metric was required some calculations and simulations to make sure we were optimizing the SRAM for low power.

5.3 Layout/Schematic Concerns

Due to time constraints and the magnitude of the 1Mb SRAM, laying out the SRAM and constructing the schematic was a challenge in itself. Since SKILL code was not used, it took a significant amount of time to wire particular components together and to lay them out. This made altering the design very time consuming whenever something needed to be changed. In the future, SKILL code may be more helpful.

In terms of layout, arranging components so that they took up the least amount of space was challenging. There were several iterations of the standard bit cell layout, and the final decision was to use a design that utilized mirroring and minimized gaps between the metal. A picture of the bit cell layout can be found in the attached Simulations and Calculations, in Figure H.

5.4 Logic Control

To save power, design decisions were made to activate only the periphery needed for a particular section of the SRAM at a time. For example, the 8:256 decoders have an enable bit that enables only one of the four decoders at a time, depending on whether a block in that decoder's row is being read or written. To do this correctly, the proper logic had to be implemented. This increased area slightly, but ultimately saved more power because it prevents more hardware than necessary from being active.

5.5 Future Optimizations

Due to time constraints, further optimizations to decrease energy and delay could not be implemented. One way to save energy and decrease delay is to buffer the pre-charge signal. Instead of a minimum-sized inverter, 3 buffer inverters with a fan-out of 4, placed within the AND gate driving the clock signal, should be used to drive the 512 PMOS transistors. Decreasing the pre-charge delay would help to drive BL and BLB to their values faster and would save delay and energy. The model file used to simulate the values for the pre-charge delay assumed an ideal signal; ideally, this signal should be driven from the AND gate that drives the column pre-charge blocks.

Along with this, if the pre-charge signal were more pulsed, with a shorter duty cycle, the sense amp enable would be able to drive BL and BLB back to VDD during a read without their having to discharge completely. This would yield large energy savings for the SRAM.

6. SIMULATION AND RESULTS

Simulation was performed to verify the functionality of the 1Mb SRAM. Because of the magnitude of the SRAM, however, the simulation was done with a modeled array; models for extracted capacitances and resistances for the bit lines and word lines were used (see Calculation 1 in attached report). Simulating the actual 1Mb SRAM proved impractical and would not run effectively on the servers that hosted Cadence. Due to difficulty in simulations and time constraints, the decision was made to simulate the modeled array with a read and write to one bit cell, instead of a 32-bit word. The SRAM proved to be functional and performed as expected.

The simulations were run at a combination of all process corners (TT, SS, SF, FS, and FF), different temperatures (zero, twenty-seven, and fifty degrees Centigrade), and different voltages (nominal 2.5V, 5V, 5.5V, and 4.5V) to verify functionality across different variables. Graphs of these simulations can be found in the attached report.

Based on the simulations, values for the metrics determining low power were obtained. The results are shown in Table 1, and the calculations for each of these metrics are shown in the attached report (see Calculation 2).

    Delivery Item                                       Value
    Metric (mW^2 * ns * mm^2)                           179633.89
    Bitcell area (um^2)                                 0.000498
    Total area (mm^2)                                   0.524
    Read power (mW)                                     73.22
    Write power (mW)                                    58.21
    Total power (mW), average of 5 reads
      for each write operation                          68.916
    Read delay (ns)                                     72.18
    Write delay (ns)                                    38.39
    Total delay (ns)                                    72.18

Table 1: Metric Breakdown

7. REFERENCES

[1] Ashworth, A., Chen, J., and Williams, M. Low Power SRAM Design for PICo. ECE4332 Introduction to VLSI Wiki, Fall 2008. University of Virginia, 2008.
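
As a cross-check of the design metric defined in Section 2, the short sketch below recomputes (total power)^2 * delay * area from the total power, total delay, and total area values reported in Table 1. The function and constant names are illustrative only and do not come from the attached calculations. Under these assumptions, the result should agree with the Table 1 metric of 179633.89 to within rounding.

```python
# Values copied from Table 1 of this report.
TOTAL_POWER_MW = 68.916   # total power, mW
TOTAL_DELAY_NS = 72.18    # total delay, ns
TOTAL_AREA_MM2 = 0.524    # total area, mm^2


def design_metric(power_mw, delay_ns, area_mm2):
    """Return (total power)^2 * delay * area in mW^2 * ns * mm^2.

    Hypothetical helper name; it only restates the Section 2 equation.
    """
    return power_mw ** 2 * delay_ns * area_mm2


metric = design_metric(TOTAL_POWER_MW, TOTAL_DELAY_NS, TOTAL_AREA_MM2)
print(f"Design metric: {metric:.2f} mW^2*ns*mm^2")
```

Because power enters the metric squared, a given percentage reduction in power lowers the metric more than the same percentage reduction in delay or area, which is consistent with the emphasis on power in Sections 3 and 4.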