Project_Report - University of Virginia

Designing a Low Power SRAM for PICo
Sharif Morad
Max Martin
Jaleesa Boykin
Marc Hall
ECE4332 – Fall 2009
som5r@virginia.edu, mbm9p@virginia.edu, jlb2aa@virginia.edu, mah9ks@virginia.edu
ABSTRACT
A low power 1Mb embedded SRAM with a 32-bit word size was
designed for PICo based on a metric that combines total power,
delay, and area. Several design decisions were made to lower
each of these. Simulation results from Cadence using Spectre
were used to verify functionality and the design decisions. Low
power was obtained by using a lower VDD, a smaller clock period,
shorted bit lines, and hierarchical pre-charging and word lines.
1. INTRODUCTION
This SRAM design uses AMI 0.6 um technology. Simulations in
Cadence and past attempts at designing a low power SRAM were
used to arrive at a suitable design for PICo. Schematics,
simulations of the modeled SRAM array, and layouts were created
and delivered to PICo for review. To verify the design,
simulations were run at all process corners and across
temperatures and supply voltages to test for variation.
2. METRICS AND REQUIREMENTS
The designed SRAM is judged against a design metric that reflects
PICo's requirements. The primary concern outlined was low power
for the SRAM. To measure success, the following metric was
applied, where a lower value is better:
(total power)^2 * delay * area
The units of measure for each term are as follows:
(total power)^2 = mW^2
delay = ns
area = mm^2
The design must also have a capacity of 1 Mb, a word size and
output width of 32 bits, and sustain one read or write access
per cycle.
3. ARCHITECTURE
3.1 Array Architecture
In designing a 1Mb SRAM, the architecture of the array was
crucial to achieving low power. Long bit lines and word lines
contribute significant capacitance, so the 1Mb array was broken
up into 16 separate blocks, with each block containing 256 x 256
bit cells. Each row of a block holds 8 words, and each word
contains 32 bits. The blocks were arranged in a 4 x 4 array such
that there are 1024 bit cell rows and 1024 bit cell columns.
These blocks allow the use of a hierarchical word-line scheme,
which results in faster access to the bit cells because a word
line does not have to drive an entire row of 1024 bit cells
(i.e. capacitance is reduced) (see Figure 1). In addition, the
bit lines were shorted together for each column of blocks. This
was done to reduce the area of the circuitry required to read
and write each block (see Sections 3.3 and 3.4). Doing so
achieved a significant area reduction and helped to lower the
design metric (see attached calculations).
Figure 1: Array Architecture
3.1.1 Bit Cell
A standard six-transistor bit cell was used for the SRAM design.
The bit cell was sized according to calculations in the attached
report and through extensive testing to find a transistor ratio
that allows reliable reading and writing (see Figure 2).
Figure 2: Bitcell
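As a quick sanity check of the organization described in Section 3.1, the sketch below recomputes the array dimensions from the block size. The constant and variable names are illustrative and are not taken from the design files.

```python
# Array organization from Section 3.1. All names here are illustrative.
BLOCK_ROWS = 256       # bit-cell rows per block
BLOCK_COLS = 256       # bit-cell columns per block
BLOCKS_PER_SIDE = 4    # blocks are arranged in a 4 x 4 grid
WORD_BITS = 32         # word size

num_blocks = BLOCKS_PER_SIDE ** 2                   # 16 blocks
total_bits = num_blocks * BLOCK_ROWS * BLOCK_COLS   # 2**20 bits = 1 Mb
array_rows = BLOCKS_PER_SIDE * BLOCK_ROWS           # 1024 bit-cell rows
array_cols = BLOCKS_PER_SIDE * BLOCK_COLS           # 1024 bit-cell columns
words_per_block_row = BLOCK_COLS // WORD_BITS       # 8 words per block row
total_words = total_bits // WORD_BITS               # 32768 words
address_bits = total_words.bit_length() - 1         # 15 (total_words is a power of two)

# Share of a full 1024-cell row that one local word line has to drive
wordline_fraction = BLOCK_COLS / array_cols         # 0.25

print(total_bits, array_rows, array_cols, words_per_block_row, address_bits)
```

The 15 address bits implied by 32768 words match the decoder widths used later in the report: 8 bits for the row decoders, 4 bits for block select, and 3 bits for word select.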
3.2 Decoders and Block Select
There are four 8:256 decoders, one for each row of blocks, and
each decoder drives the 256 word lines of its row. All four
decoders are controlled by the same 8-bit address. Each word
line output of a decoder is ANDed with a block select bit from a
4:16 decoder to determine which of the 16 blocks is accessed
(see Figure 3). This significantly reduces the capacitance that
a word line drives when reading or writing a specific bit cell.
Instead of driving all 4 blocks in a row, the word line drives
only 1 block and effectively drives ¼ of the capacitance.
Figure 3: Block Select
In addition, each of the four decoders has an enable signal,
with the logic configured such that only one decoder is enabled
at a time to save power (see Figure 4). A decoder activates only
when the pre-charge phase is over, a read or write is being
performed, and a block in that decoder's row is being accessed.
Figure 4: Block Decoder Logic
3.3 Data Access
To read a 32-bit word, an 8:1 multiplexer was placed at the
bottom of each column of blocks (see Figure 5). Only one
multiplexer was needed for each of the four columns because the
bit lines are shorted within each block column. This
significantly reduced area, cutting the number of transistors
needed by a factor of 4, at the cost of increasing delay to less
than 2 times the read delay of a single block (see attached
Calculation 6). The multiplexer chooses which of the 8 words to
access and outputs it to a sense amplifier for each block
column. The outputs of these sense amplifiers (one for each
block column) are sent to a 4:1 multiplexer, which selects the
word to output to a register. This architecture removed the need
for a multiplexer and sense amplifier in each block and
drastically reduced area.
Figure 5: Block Output Logic
3.3.1 Pre-charge
Because the bit lines are shorted together, a pre-charge block
was made for each block column. This saves area because a
pre-charge block is needed only for each of the four block
columns, as opposed to each of the 16 blocks (see Figure 6).
Figure 6: Precharge Logic
Allowing only one pre-charge block to be active at a time also
reduces power. The desired pre-charge block is selected by a
2:4 decoder that chooses which block column to access (see
Figure 7).
Figure 7: Periphery for write transistors
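The decoding hierarchy described in Sections 3.2 and 3.3 can be sketched behaviorally as follows. This is a minimal model, not the authors' logic: the ordering of the fields within the 15-bit word address, the assumption that the 2:4 block-column decoder shares two of the four block-select bits, and all function and field names are assumptions made for illustration.

```python
from typing import List, NamedTuple, Tuple


class DecodedAddress(NamedTuple):
    block_row: int  # which of the four 8:256 row decoders is enabled (0..3)
    block_col: int  # block column chosen by the 2:4 decoder (0..3)
    block: int      # block chosen by the 4:16 block select decoder (0..15)
    row: int        # word line within the block (0..255)
    word: int       # word within the row, chosen by the 8:1 mux (0..7)


def decode(addr: int) -> DecodedAddress:
    """Split a 15-bit word address into fields (field order is assumed)."""
    if not 0 <= addr < (1 << 15):
        raise ValueError("address must fit in 15 bits")
    word = addr & 0x7            # 3 bits: word select
    row = (addr >> 3) & 0xFF     # 8 bits: shared by all four row decoders
    block = (addr >> 11) & 0xF   # 4 bits: block select
    block_row, block_col = divmod(block, 4)
    return DecodedAddress(block_row, block_col, block, row, word)


def active_local_word_lines(d: DecodedAddress) -> List[Tuple[int, int]]:
    """Return the (block, row) pairs whose local word line is asserted."""
    active = []
    for block in range(16):
        block_select = block == d.block                # 4:16 decoder output
        decoder_enabled = (block // 4) == d.block_row  # only one row decoder on
        for row in range(256):
            global_word_line = decoder_enabled and row == d.row
            if global_word_line and block_select:      # AND gate per block
                active.append((block, row))
    return active


d = decode(12345)
print(d)
print(active_local_word_lines(d))
```

In this model exactly one local word line is asserted per access, which is the property that lets a word line drive one block instead of four.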
3.3.2 Sense Amplifier
To save power, a differential sense amplifier was created to
prevent the bit lines from discharging completely to 0 during a
read operation (see Figure 8). Since the sense amplifier is used
during a read operation, the sense amplifier enable signal is
the Read input signal buffered with four inverters. When Read is
asserted high, the sense amplifier enable goes high as well,
with a delay of only 1.64 ns.
Figure 8: Sense Amplifier
3.4 Data Writing
To write a 32-bit word, NMOS transistors were placed at the
bottom of each bit-line and bit-line bar for each block column
(see Figure 9). Data0 through data31 and databar0 through
databar31 are buffered (see Figure 10) and sent through the NMOS
transistors for each word in a block.
Figure 9: Block Data Inputs
The gates of the transistors that allow data to pass through are
controlled by a 3:8 decoder, which allows only one word to be
written at a time. There are four of these 3:8 decoders (one for
each block column), and, to save power, logic was implemented
such that only one of them is enabled at a time (see Figure 7).
A 3:8 decoder activates only when the pre-charge phase is over,
a write is being performed, and its block column is selected.
The selected block column is determined by a 2:4 decoder.
Figure 10: Data Driver
4. FURTHER OPTIMIZATIONS
The supply voltage has a significant effect on power
consumption. To that end, VDD was lowered from 5V to 2.5V and
the SRAM was tested to verify functionality. Based on average
power simulations for read and write, this significantly reduced
the power. Also to reduce power, the clock period was decreased
to the point where the SRAM was still functional (see attached
Calculation 6 for an in-depth analysis). Finally, the width
ratio of the pre-charge transistors to the data writing
transistors was reduced to the minimum that still allows correct
operation, which saves area.
5. CHALLENGES
5.1 Cadence/Spectre Troubleshooting
Cadence and Spectre were useful for verifying the functionality
of the 1Mb SRAM. However, the simulations produced many errors
that took significant amounts of time to troubleshoot. Becoming
familiar with the Cadence interface took some time as well,
which, given the project's time constraints, limited the time
available for designing the SRAM.
5.2 Tradeoffs
In designing this SRAM, the primary metrics of concern were
power, area, and delay. Tuning each one of these “knobs,”
however, changes the design's location on the Pareto optimal
curve. For example, adding buffers might decrease delay, but it
will increase the energy and thus the power of the SRAM. If the
delay saved is much more significant than the energy added, this
is a good design decision. Determining how sensitive the design
metric was to each change required calculations and simulations
to make sure the SRAM was being optimized for low power.
5.3 Layout/Schematic Concerns
Due to time constraints and the size of the 1Mb SRAM, laying out
the SRAM and constructing the schematic was a challenge in
itself. Since SKILL code was not used, wiring components
together and laying them out took a significant amount of time,
which made any later change to the design very time consuming.
In the future, using SKILL code would likely help.
In terms of layout, arranging components to take up the least
amount of space was challenging. There were several iterations
of the standard bit cell layout, and the final design uses
mirroring and minimizes gaps between the metal. A picture of the
bit cell layout can be found in the attached Simulations and
Calculations, in Figure H.
5.4 Logic Control
To save power, design decisions were made to activate only the
periphery needed for a particular section of the SRAM at a time.
For example, the 8:256 decoders have an enable bit that enables
only one of the four decoders at a time, depending on whether a
block in that decoder's row is being read or written. Doing this
correctly required additional control logic. This increased area
slightly, but ultimately saved more power because it prevents
more hardware from being active than necessary.
5.5 Future Optimizations
Due to time constraints, further optimizations to decrease
energy and delay could not be implemented. One way to save
energy and decrease delay is to buffer the pre-charge signal.
Instead of a minimum-sized inverter, 3 buffer inverters with a
fan-out of 4, within the AND gate driving the clock signal,
should be used to drive the 512 PMOS transistors. Decreasing the
pre-charge delay would drive BL and BLB to their values faster
and would save delay and energy. The model file used to simulate
the pre-charge delay assumed an ideal signal; ideally, this
signal should be driven from the AND gate that drives the column
pre-charge blocks.
Along with this, if the pre-charge signal were more pulsed, with
a shorter duty cycle, the sense amp enable would be able to
drive BL and BLB back to VDD during a read without their having
to completely discharge. This would yield large energy savings
for the SRAM.
6. SIMULATION AND RESULTS
Simulation was performed to verify the functionality of the 1Mb
SRAM. Because of the size of the SRAM, however, the simulation
was done with a modeled array; models for extracted capacitances
and resistances of the bit lines and word lines were used (see
Calculation 1 in attached report). Simulating the actual 1Mb
SRAM proved impractical and would not run effectively on the
servers that hosted Cadence.
Due to difficulty in simulations and time constraints, the
decision was made to simulate the modeled array with a read and
write to one bit cell, instead of a 32-bit word. The SRAM proved
to be functional and performed as expected.
The simulations were run at all process corners (TT, SS, SF, FS,
and FF), different temperatures (zero, twenty-seven, and fifty
degrees Centigrade), and different voltages (nominal 2.5V, 5V,
5.5V, and 4.5V) to verify functionality across these variables.
Graphs of these simulations can be found in the attached report.
Based on the simulations, the values for the low power metric
were calculated. The results are shown in Table 1, and the
calculation of each of these values is shown in the attached
report (see Calculation 2).
Delivery Item                                   Value
Metric (mW^2*ns*mm^2)                           179633.89
Bitcell area (um^2)                             .000498
Total area (mm^2)                               .524
Read power (mW)                                 73.22
Write power (mW)                                58.21
Total power (mW), average of 5 reads
for each write operation                        68.916
Read delay (ns)                                 72.18
Write delay (ns)                                38.39
Total delay (ns)                                72.18
Table 1: Metric Breakdown
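The metric value in Table 1 can be reproduced from the reported total power, total delay, and total area using the metric from Section 2. The sketch below does that, and also shows one way to judge a tradeoff of the kind discussed in Section 5.2. The function names are illustrative, and the 2% power increase and 10% delay reduction in the example are hypothetical numbers, not measurements from this design.

```python
def design_metric(total_power_mw: float, delay_ns: float, area_mm2: float) -> float:
    """(total power)^2 * delay * area, in mW^2 * ns * mm^2."""
    return total_power_mw ** 2 * delay_ns * area_mm2


def relative_metric_change(d_power: float, d_delay: float, d_area: float = 0.0) -> float:
    """Fractional change in the metric for fractional changes in each term.

    A negative result means the change lowers (improves) the metric.
    """
    return (1 + d_power) ** 2 * (1 + d_delay) * (1 + d_area) - 1


# Values taken from Table 1
metric = design_metric(total_power_mw=68.916, delay_ns=72.18, area_mm2=0.524)
print(f"metric = {metric:.2f} mW^2*ns*mm^2")

# Hypothetical buffer insertion: +2% power, -10% delay, no area change
change = relative_metric_change(d_power=0.02, d_delay=-0.10)
print(f"metric change = {change:+.1%}")
```

Because power enters the metric squared, a given percentage increase in power costs roughly twice as much as the same percentage increase in delay or area, which is why power-saving changes were prioritized.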
7. REFERENCES
[1] Ashworth, A., Chen, J., and Williams, M. Low Power SRAM
Design for PICo. ECE4332 Introduction to VLSI Wiki, Fall
2008. University of Virginia, 2008.