Lower Power SRAM in the AMI06 0.5 μm Technology
Caroline Andrews
Robert Hunter
Yousef Shakhsheer
ECE 432 – Fall 2007
University of Virginia
cja3g@virginia.edu, rdh8t@virginia.edu, yas5b@virginia.edu
ABSTRACT
A lower power, one megabit (1 Mb) SRAM was designed in the 0.5 μm
AMI06 technology. The memory uses a highly integrated design
of sixteen 64 kb blocks. All access periphery is equipped with
enable signals to minimize power consumption while it is not
being used. Simulations of read and write were performed at all
process, temperature, and supply voltage variations. Total area
was reduced by utilizing a flat bit cell layout which was tiled
efficiently by overlapping the cells. Hierarchical decoders serve to
minimize access delay to the memory arrays.
1. INTRODUCTION
As transistor sizes decrease and the IC industry is able to place
more and more transistors on a single chip, designers are able to
create larger memories every year. Another direction in which
technology has advanced is the ultra-portable and therefore low
power domain. Wireless sensing applications often demand very
low power systems but do not require high speed. The design of
static random access memory depends greatly on its application.
Investigation into lowering the power dissipation of each
component of the memory allows for an extremely efficient
design.
2. MEMORY DESIGN
The 1 Mb memory array was divided into sixteen 64 kb blocks.
Each block contains a 256 by 256 bit cell array (65,536 bits). Each
word line contains eight 32-bit words. Each word is addressable by
15 bits: a four-bit block address, an eight-bit row address, and a
three-bit column address.
The nominal operation of the memory is at 5 V and room
temperature. However, its performance was tested from 4.5 V to
5.5 V and between 0 °C and 50 °C at all four process corners of
variation in the transistor threshold voltages (FF, SS, SF, FS).
Each memory block contains the aforementioned 256 x 256 bit
array, 256 sense amplifiers, 32 eight-input multiplexers, 512
transmission gates, a three-to-eight decoder, and drivers to
effectively read and write.
The overall memory shares one 8:256 decoder, one three-to-eight
decoder, one sixteen input clock, 32 input registers, and 32 output
registers. A common clock is shared by both registers. A common
voltage source is shared by all memory components.
To read from memory, the 8:256 decoder asserts a single word
line using the eight-bit row address. All bits on the word
line are read by the sense amplifiers. The correct word is chosen
by the eight-input multiplexers using the three-bit column address,
and the 16-input multiplexer then selects the word from the
appropriate memory block. The word is latched by the output
registers.
To write to memory, the 8:256 decoder also asserts a single word
line using the eight-bit row address. Data is latched by
the input registers. The decoded four-bit block address and the
decoded three-bit column address are combined to choose the
correct word and to produce the signals which turn on the
transmission gates to access the 32 bit lines (BL). The data is
driven onto the proper BLs and bit line bars (BLB) of the memory
array. The new bit cell values are available to be read or
overwritten on the next clock cycle.
3. MEMORY CELL
The memory uses a standard six transistor SRAM bit cell. The
transistors were sized as minimally as possible while ensuring
proper operation. In order to prevent a read upset, the NMOS pass
transistor was sized to be weaker than the pull down
transistor. The pull down transistor is sized slightly larger than the
PMOS transistors in order to ensure that the NMOS transistors
will not turn on during a write operation.
In laying out the cell, we chose to use a flat layout similar to that
presented by Kim et al. [1]. This layout allows for an extremely
compact memory as it is mirrored and tiled in four-cell arrays, which
are then tiled to produce each memory block.
Figure 1. Layout of bit cell
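The 15-bit word address described in Section 2 can be illustrated with a short sketch. The field widths (four-bit block, eight-bit row, three-bit column) come from the text; the ordering of the fields within the address and the names `WordAddress` and `split_address` are assumptions made only for this example.

```python
from typing import NamedTuple


class WordAddress(NamedTuple):
    block: int   # 4 bits: one of the sixteen blocks
    row: int     # 8 bits: one of the 256 word lines
    column: int  # 3 bits: one of the eight 32-bit words on a word line


def split_address(addr: int) -> WordAddress:
    """Split a 15-bit word address into block, row, and column fields.

    Assumption: block occupies the top bits and column the bottom bits.
    The paper does not state the actual bit ordering.
    """
    if not 0 <= addr < (1 << 15):
        raise ValueError("address must fit in 15 bits")
    column = addr & 0x7
    row = (addr >> 3) & 0xFF
    block = (addr >> 11) & 0xF
    return WordAddress(block, row, column)


# Capacity check: 16 blocks x 256 rows x 8 words x 32 bits per word.
TOTAL_BITS = 16 * 256 * 8 * 32

if __name__ == "__main__":
    print(split_address(0x7FFF))
    print(TOTAL_BITS == 1 << 20)
```

The capacity check shows that 2^15 words of 32 bits account for every cell in sixteen 256 by 256 arrays.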
4. PERIPHERY
4.1 Sense Amplifier
This memory employs a differential voltage sensing amplifier in
order to prevent the BLs and BLBs from completely discharging
for each read. The amplifier senses a 0.5V or greater difference in
BL and BLB and swings to full VDD when BL is greater and down
to the ground rail when BLB is higher. The topology uses a
dummy inverter to simulate a similar load on both sides in order
to balance the charging to VDD.
4.2 Word Line and Write Drivers
To ensure the lowest delay the memory utilizes drivers on its word
lines and bit lines. The drivers consist of four inverters in series,
each sized to the optimal fan out of 4.
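As a rough illustration of the driver sizing, the sketch below lists the relative sizes of a four-inverter chain with a fan out of 4 per stage. It assumes a unit-sized first inverter, which the text does not specify; `inverter_chain_sizes` and `drivable_load` are hypothetical helper names.

```python
from typing import List


def inverter_chain_sizes(stages: int = 4, fanout: float = 4.0,
                         first_size: float = 1.0) -> List[float]:
    """Relative size of each inverter when every stage drives `fanout` times its own size."""
    return [first_size * fanout ** i for i in range(stages)]


def drivable_load(sizes: List[float], fanout: float = 4.0) -> float:
    """Load (in unit-inverter input capacitances) the last stage drives at the same fan out."""
    return sizes[-1] * fanout


if __name__ == "__main__":
    sizes = inverter_chain_sizes()
    print(sizes)
    print(drivable_load(sizes))
```

Under these assumptions the chain grows by a factor of 4 per stage, so the final inverter can drive a load 256 times the input capacitance of the first one.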
4.3 Decoders
To speed up the memory while still minimizing power
dissipation, its decoding blocks feature hierarchical decoders. The
memory utilizes three-to-eight, four-to-16, and eight-to-256
decoders. Each decoder begins with NOR pre-decoders.
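A behavioral sketch of a hierarchical decoder may help make the idea concrete. It assumes the eight-bit row address is split into two four-bit groups, each pre-decoded with one NOR gate per output, and that a final stage combines one line from each group with the enable signal. The 4 + 4 partition and all function names are assumptions; the paper does not give the actual partition.

```python
from typing import List


def nor(*inputs: int) -> int:
    """NOR gate: 1 only when every input is 0."""
    return 0 if any(inputs) else 1


def predecode(bits: List[int]) -> List[int]:
    """One-hot decode of a small bit group using one NOR gate per output."""
    n = len(bits)
    outputs = []
    for code in range(1 << n):
        # Use the true bit where the code expects 0 and the inverted bit
        # where it expects 1, so all NOR inputs are 0 only for a match.
        nor_inputs = [bits[k] if ((code >> k) & 1) == 0 else 1 - bits[k]
                      for k in range(n)]
        outputs.append(nor(*nor_inputs))
    return outputs


def decode_8_to_256(addr: int, enable: int = 1) -> List[int]:
    """Two-level 8:256 decoder: two 4:16 pre-decoders plus a combining stage."""
    bits = [(addr >> k) & 1 for k in range(8)]
    low = predecode(bits[:4])    # address bits 0..3
    high = predecode(bits[4:])   # address bits 4..7
    return [enable & high[i >> 4] & low[i & 0xF] for i in range(256)]


if __name__ == "__main__":
    word_lines = decode_8_to_256(0xA5)
    print(word_lines.index(1), sum(word_lines))
```

Splitting the address this way needs 32 small pre-decode gates and 256 combining gates, rather than 256 gates with eight inputs each, which is the usual motivation for hierarchy.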
4.4 Multiplexers
To select the correct word, the memory utilizes multiplexers.
While reading, the unutilized bit cells on the word line perform
dummy-reads. Utilizing the column address, an eight-input
multiplexer outputs the correct word from the memory block. The
32-bit words are sent from each memory block to a 16-input
multiplexer. The correct word is selected with the block address.
Eight-input and 16-input multiplexers were designed by cascading
copies of a single two-input multiplexer. The multiplexer topology
includes transmission gates and inverters as well as an AND gate for
the enable signal.
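The cascading of two-input multiplexers can be sketched behaviorally as a binary tree. This is an illustration only: `mux2` and `mux_tree` are hypothetical names, words are modeled as integers, and gating the output to 0 when disabled is an assumption about how the enable AND gate is used.

```python
from typing import Sequence


def mux2(a: int, b: int, sel: int) -> int:
    """Two-input multiplexer: returns a when sel is 0, b when sel is 1."""
    return b if sel else a


def mux_tree(inputs: Sequence[int], select: int, enable: int = 1) -> int:
    """N-input multiplexer (N a power of two) built from cascaded mux2 stages."""
    n = len(inputs)
    if n == 0 or n & (n - 1):
        raise ValueError("number of inputs must be a power of two")
    level = list(inputs)
    bit = 0
    while len(level) > 1:
        s = (select >> bit) & 1  # each tree level consumes one select bit
        level = [mux2(level[i], level[i + 1], s)
                 for i in range(0, len(level), 2)]
        bit += 1
    # Assumed enable behavior: output forced to 0 when the mux is disabled.
    return level[0] if enable else 0


if __name__ == "__main__":
    words_on_line = [0x100 + i for i in range(8)]
    print(hex(mux_tree(words_on_line, 5)))
```

An eight-input multiplexer uses three levels (seven two-input cells) and a 16-input multiplexer uses four levels (fifteen cells) per bit.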
4.5 Registers
Rising-edge registers are used to hold the 32-bit word to be
written and the 32-bit word to be read. The registers utilize a
common clock and its complement. Each register is positive
edge-triggered and built from a master-slave configuration of two
multiplexers.
Figure 2. Master Slave Edge Triggered Register
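The master-slave arrangement can be modeled behaviorally with two multiplexer-based latches. The sketch assumes the master latch is transparent while the clock is low and the slave latch is transparent while the clock is high, which yields positive edge-triggered behavior; the class and method names are hypothetical.

```python
def mux2(a: int, b: int, sel: int) -> int:
    """Two-input multiplexer: returns a when sel is 0, b when sel is 1."""
    return b if sel else a


class MasterSlaveRegister:
    """Positive edge-triggered register made of two mux-based latches."""

    def __init__(self) -> None:
        self.master = 0
        self.slave = 0

    def step(self, d: int, clk: int) -> int:
        # Master latch: passes d while clk is low, holds while clk is high.
        self.master = mux2(d, self.master, clk)
        # Slave latch: holds while clk is low, passes the master while clk is high.
        self.slave = mux2(self.slave, self.master, clk)
        return self.slave


if __name__ == "__main__":
    reg = MasterSlaveRegister()
    for d, clk in [(1, 0), (1, 1), (0, 1), (0, 0), (0, 1)]:
        print(d, clk, reg.step(d, clk))
```

The output changes only when the clock goes from low to high, and changes on the data input while the clock is high are ignored.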
5. SPECIAL FEATURES
Enable signals distributed throughout the memory periphery allow
for power reduction. During each read/write, access to only one
block out of sixteen is enabled. The decoder circuits will be
disabled unless a read or write occurs. The sense amplifiers below
each block are only enabled when the read bit is asserted. When
neither a read nor a write is occurring, all periphery is disabled
while the bit arrays remain unchanged, holding the state of the bit
cells.
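The enable scheme can be summarized as a small truth-function sketch. The signal names in the returned dictionary are hypothetical, and the exact gating inside the real design is not given in the text; the sketch only encodes the three rules stated above.

```python
from typing import Dict, List, Union


def periphery_enables(read: bool, write: bool,
                      block: int) -> Dict[str, Union[bool, List[bool]]]:
    """Which periphery is enabled for a given read/write request.

    Rules taken from the text: decoders are on only during a read or write,
    only the addressed block (one of sixteen) is accessed, and sense
    amplifiers are on only for a read.
    """
    if not 0 <= block < 16:
        raise ValueError("block must be in 0..15")
    active = bool(read or write)
    return {
        "decoders": active,
        "block_access": [active and i == block for i in range(16)],
        "sense_amps": [bool(read) and i == block for i in range(16)],
    }


if __name__ == "__main__":
    print(periphery_enables(read=True, write=False, block=3)["sense_amps"])
```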
6. RESULTS AND SIMULATIONS
Due to the large size of the memory block, a testing block was
required in order to perform simulations. The testing block
contains four bit cells at the corners of the block along with cells
emulating multiple bit cells used to simulate rows and columns
containing 256 bit cells. The testing block also contains capacitors
which represent the capacitance of the bit cells along with
resistors to model the interconnect resistance. This capacitance
value was obtained by extraction on the bit cell layout in Cadence.
The block also contains a sense amplifier and several logic gates
to model the decoders. Simulations were performed on the
memory at all PVT corners: process = [TT, FF, SS, SF, FS], V = [5.5 V,
5 V, 4.5 V], T = [50 °C, 27 °C, 0 °C]. The memory performed
correctly at all combinations of these parameters.
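The full set of simulation conditions can be enumerated as a cross product of the three lists. This sketch only lists the combinations; the list names are illustrative, and it does not reproduce the Spectre setup used for the actual simulations.

```python
from itertools import product

PROCESS = ["TT", "FF", "SS", "SF", "FS"]
SUPPLY_V = [4.5, 5.0, 5.5]
TEMP_C = [0, 27, 50]


def pvt_corners():
    """All (process, supply voltage, temperature) combinations."""
    return list(product(PROCESS, SUPPLY_V, TEMP_C))


if __name__ == "__main__":
    corners = pvt_corners()
    print(len(corners))
    print(corners[0])
```

Five process corners, three supply voltages, and three temperatures give 45 combinations in total.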
6.1 Power Consumption
Power consumption was measured over a simulation containing
five reads and one write operation. The power measured is the
average power over a period of 1μs. Power consumption during
the write operation was found to be larger because larger voltages
were discharged on the bit-lines. The average power was
obtained through Spectre.
Table 1 displays the on and off power for the main periphery of
the memory. The enable signals on the access circuitry provide
significant savings to the overall memory as only a small fraction
of the logic is turned on for each read or write. The decoders,
which use the most power, are only enabled by the read or write
signals and so would all be in the off mode while the memory is
simply holding its state. The 16 decoder is used to select one of
sixteen memory blocks; we estimated the activity of the memory
to be around 80% in determining its on and off power, which is
why its quantities in Table 1 are fractional (0.8 on, 0.2 off).
Table 1. Active versus Inactive Power Consumption
Component      On Power    Qty On    Off Power    Qty Off
Sense Amp      0.174 mW    256       9.4 nW       3840
256 Decoder    13 mW       1         206 nW       15
16 Decoder     5.906 mW    0.8       11.54 nW     0.2
8 Decoder      4.062 mW    2         5.78 nW      15
Tx Gates       10.2 μW     64        1 pW         4288
16-Mux         220 μW      32        15.6 μW      480
8-Mux          199 μW      32        10 μW        480
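To see how the entries of Table 1 combine, the sketch below multiplies each power figure by its quantity and sums the results. The values are copied from Table 1; the subtotal covers only the periphery listed there (not the bit cells, drivers, or registers), so it should not be read as the total power of the memory.

```python
# component: (on power in W, quantity on, off power in W, quantity off)
TABLE_1 = {
    "Sense Amp":   (0.174e-3, 256, 9.4e-9,   3840),
    "256 Decoder": (13e-3,    1,   206e-9,   15),
    "16 Decoder":  (5.906e-3, 0.8, 11.54e-9, 0.2),
    "8 Decoder":   (4.062e-3, 2,   5.78e-9,  15),
    "Tx Gates":    (10.2e-6,  64,  1e-12,    4288),
    "16-Mux":      (220e-6,   32,  15.6e-6,  480),
    "8-Mux":       (199e-6,   32,  10e-6,    480),
}


def periphery_power(table):
    """Return (active subtotal, inactive subtotal) in watts."""
    on = sum(p_on * n_on for p_on, n_on, _, _ in table.values())
    off = sum(p_off * n_off for _, _, p_off, n_off in table.values())
    return on, off


if __name__ == "__main__":
    on_w, off_w = periphery_power(TABLE_1)
    print(f"active periphery:   {on_w * 1e3:.2f} mW")
    print(f"inactive periphery: {off_w * 1e3:.2f} mW")
```

With these numbers the enabled periphery dominates, and most of the inactive contribution comes from the disabled multiplexers rather than the decoders or sense amplifiers.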
6.2 Delay
Delay was calculated by observing the time between when the
read signal was executed and the output was latched. Write delay
was obtained by observing the time between the write command
and the voltages of internal nodes of the bit cell changing to the
desired result. The read delay was longer than the write delay
because it includes the time for the output to be latched as well as
the time needed to discharge the bit lines (BL and BLB).
6.3 Layout
The flat layout used for the bit cell and the high level of
integration of our memory blocks allows the design to save area
used for periphery. The cell was designed to overlap on all four
sides with surrounding cells by being flipped in the x and y
directions to create a block of four cells. These were in turn tiled
with SKILL code to create the 64 kb block. While the area of a
single cell is 239.04 μm², its effective area within one of the
sixteen 64 kb blocks is only 182.03 μm². Since the memory blocks
represent 95% of the whole memory, this saving in area is
significant.
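The saving from overlapping cells can be quantified directly from the two cell areas quoted above. The sketch below is simple arithmetic on those figures; the array total it prints covers bit cells only and is not the full memory area.

```python
CELL_AREA_UM2 = 239.04            # stand-alone bit cell
EFFECTIVE_CELL_AREA_UM2 = 182.03  # per-cell area once tiled with overlap
CELLS_PER_BLOCK = 256 * 256
BLOCKS = 16


def area_saving_fraction(standalone: float = CELL_AREA_UM2,
                         effective: float = EFFECTIVE_CELL_AREA_UM2) -> float:
    """Fraction of cell area saved by overlapping neighboring cells."""
    return 1.0 - effective / standalone


def array_area_um2(effective: float = EFFECTIVE_CELL_AREA_UM2) -> float:
    """Area of all bit cells at the effective cell size."""
    return effective * CELLS_PER_BLOCK * BLOCKS


if __name__ == "__main__":
    print(f"saving per cell: {area_saving_fraction() * 100:.1f} %")
    print(f"bit cell arrays: {array_area_um2() / 1e6:.1f} mm^2")
```

The overlap saves a little under a quarter of the stand-alone cell area for each cell.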
The sense amplifiers were laid out to snap directly under the bit
cells without any further connections. The word line drivers were
laid out using fingers to reduce transistor areas and were designed
to fit right next to the memory blocks by having the same effective
height as a bit cell. The input and output registers were laid out as
single cells and then tiled with overlap to reduce their overall
area. The multiplexers were laid out in an efficient way which
The total area of the memory was estimated by tiling the 16 bit
cell arrays with sense amps, drivers and multiplexers, and by
leaving enough space for the remaining periphery and routing,
which were not laid out in the time allowed. The estimate provided
in Table 2 is a conservative one and likely to be higher than the
true value by about 10%.
6.4 Metric Evaluation
The metric for the lower power SRAM was provided by PiCO.
Table 2 displays the results of this memory design. The reported
metric value is consistent with the product of the square of the
total power (in W), the total delay (in ns), and the total area
(in μm²) listed in the table.
Table 2. Metric Breakdown
Metric           10388006 W² * ns * μm²
Bit Cell Area    239.04 μm² (Effective: 182.03 μm²)
Total Area       256099200 μm²
Read Power       29.4 mW
Write Power      30.56 mW
Total Power      132.8 mW
Read Delay       2.3 ns
Write Delay      1.4 ns
Total Delay      2.3 ns
7. CONCLUSION
The designed lower power SRAM meets the specifications set
forth by PiCO. By disabling unused memory periphery, the design
saves power, which other designs do not attempt to do. By
overlapping bit cells, the design minimizes the amount of area
required for an IC and provides a product suited to remote sensing
applications.
The group feels it has addressed all 6 goals as defined in the
requirement sheet. The remaining work needed to complete the
memory layout would consist of routing the various blocks to
each other.
With more development, the design can be refined to further
reduce power consumption through implementation of a sleep
mode for inactive memory blocks. Lower power and faster
decoders could be designed by increasing the levels of hierarchy
and utilizing more NAND and NOR gates instead of the primarily
AND-based logic currently used.
8. ACKNOWLEDGMENTS
We would like to thank Ben Calhoun and Jiajing Wang for their
support throughout the semester.
9. REFERENCES
[1] Kim, K.J., et al. A Novel 6.4 μm² Full-CMOS SRAM Cell
with Aspect Ratio of 0.63 in a High-Performance 0.25 μm-Generation
CMOS Technology. Symposium on VLSI
Technology Digest of Technical Papers, (Jun. 1998), 68-69.