Ece4332finalreport - UVA ECE Wiki & Forum

advertisement
Team 2 Implementation of a Low Power SRAM design
using 45 nm FreePDK Technology
Yanran Chen, Cary Converse,
Chenqian Gan, David Moore
ECE 4332 – Fall 2010
University of Virginia
< yc5t, crc4m, cg5pe,
dmm4ce>@virginia.edu
ABSTRACT
The members of Team 2, Yanran Chen, Cary Converse, Chenqian
Gan, and David Moore, were presented with the challenge to
implement to develop a low-power SRAM, using the 45 nm
FreePDK technology, for Portable Instruments Company (PICo).
The design was required to act as a microsensor node by strongly
focusing on optimizing the low-energy constraint.
connections from the block decoder to select which of the 16
memory blocks will be accessed, and shows how the bus from
each block is selected by an output 16-to-1 multiplexer.
This SRAM required a capacity of 1 Mb, and its merit would be
based on the metric: (Active Energy per Access)2 * Delay * Area *
Idle Power. Several special features, including low VDD and
single bit error correction were added to increase the appeal of the
SRAM to PICo. This paper will highlight the unique aspects of
the design, display functionality, explain design decisions along
with trade-offs, and prove the processor's worthiness over
competing designs.
1. INTRODUCTION
Portable Instruments Company requested a low-power SRAM,
using the 45 nm FreePDK technology, with capacity of 1Mb,
consisting of 32-bit words. Team 2 has fulfilled this request with
the SRAM design described in this paper, creating a high-quality
product that operates with great energy efficiency, without
sacrificing significant area or delay. All circuits included in the
SRAM design were designed and tested using Cadence and
Ocean.
2. DESIGN OVERVIEW
The Design for the PICo low-power SRAM consists of bitcells
and the various periphery circuits that are required to read and
write to the bitcells. Each sub circuit was designed in Cadence as
transistor level logic, and layouts were created for most of the
circuit components.
The memory consists of 16 memory blocks, each of which is made
up of a 304-by-256 bitcell array and the circuits required for
reading and writing the bitcells in the block, as well as the error
correction. There are eight 32-bit words in each row of a memory
block, and each word has an additional six parity bits used for
error-correcting purposes, resulting in 304 bits for each 256 bits
of data.
A block select decoder is used to select which block will be
manipulated, and the specific memory address is fed into each
block, along with the clock and signals for read, write (and the
data input bus). An output bus comes out of each block into a 16to-1 multiplexer, which selects which block’s read data is sent
out.
Figure 1 shows the overall topology of the low-power SRAM
design, in block diagram form. It indicates the high-level
© 2010 Chen, Converse, Gan, & Moore
Figure 1: Simplified Full SRAM block diagram
As shown in the above Figure 1, each memory block is connected
to inputs such as the clock, an address bus, read/write lines, and a
data input bus.
3. IMPORTANT SUB-CIRCUITS
The entire SRAM design is comprised of numerous decoders,
multiplexers, and other standard logic gates, however, there are
several sub-circuits of note that were implemented in our lowpower design. These sub-circuits are explored in depth in the
following sections.
3.1 Bitcell
The bitcell is a thin 6-transistor bitcell, designed to maximize area
savings (see Figure 2). The layout is such that it is easily tiled
into large arrays of bitcells, closely arranged with each other, with
minimal unused space. Each of the MOSFETs used in the bitcell
is of the high threshold voltage variety, which helps minimize
leakage current (which is key for minimizing power), as well as
minimizing the necessary area. The transistors are ratioed with the
goal of allowing easy writes, while avoiding read upsets and
minimizing area as much as possible.
To further minimize the leakage current from, and power
consumed by the bitcell, the voltage was reduced to 600 mV for
active operation. To gain additional power savings, a “sleep
mode” was implemented, in which the voltage rail is dropped
further, to 300 mV [2]. Simulations were performed to verify that
the data in each bitcell is held at this sleep voltage, and to assess if
the voltage could be dropped lower to realize even greater power
savings, while maintaining sufficient static noise margin.
Six-transistor bitcells function reliably with voltage rails above
the threshold, but a to use a lower voltage requires bitcells
composed of eight to ten transistors, resulting in asymmetrical
layouts that are difficult or impossible to tile efficiently into
blocks [3].
3.2 Latching Voltage-Mode Sense Amplifier
Figure 2: Bitcell Layout
Figure 3, below, shows the simulated butterfly plots of the bitcell
at 300 mV (top) and 250 mV (bottom), which were used to
determine the sleep voltage required to retain the stored data. It
can be seen that the static noise margin is considerably smaller at
250 mV, so 300 mV was chosen as the sleep voltage to keep the
SRAM more robust.
The latching voltage-mode sense amplifier is implemented in
Team 2’s low power SRAM to allow bitcells to be read, while
reducing active and leakage energy over other potential sense
amplifier designs. The design of the sense amplifier is based off
of a design intended for sub-threshold operation [5], but it still
works well at the 600 mV active rail applied in our SRAM design.
By using a latching voltage-mode sense amplifier to read the
bitcell values, the bitlines do not need to drop as far in order to
recognize the voltage differential between them, so current is
minimized, and the read process is sped up.
Figure 4: Layout of Voltage-Mode Latching Sense Amplifier
Figure 3: Bitcell Butterfly Plots of Hold at Lower Voltages
Ideally, we would reduce the active voltage even further, in order
to operate in the sub-threshold region, however, such low voltages
could not be implemented without significant redesign of our
bitcell. Active voltages cannot reliably be reduced past 600 mV
for 6T bitcells, so this limited how far the voltage could be
reduced in pursuit of power savings [6]. Furthermore, delays
increase very quickly as the rail voltage drops below 600 mV [3].
3.3 Single-Bit Error-Correcting Code
Single bit error correction was implemented as one of the
SRAM’s special features, in order to make the design more
robust. As a low-power SRAM, low voltages are used, which
results in smaller static noise margins. This means that cosmic
rays or other noise may more easily flip the value of any given
bitcell. Implementing error correction, the SRAM is strengthened
by minimizing the chance that an output bit may be incorrect after
having been inadvertently flipped in the memory.
The 6-parity bit Hamming Code was chosen for the error
correction scheme because of its relative simplicity and ease of
implementation. This Hamming Code allows the correction of
one error per 32 bit word, or the detection of two errors per 32 bit
word. Each block of memory contains its own ECC circuitry,
which can be divided into two components: parity generation and
error fixing. Parity generation occurs as a word is being written,
so that the six parity bits can be stored in the memory block along
with the data bits, and error fixing occurs when a word is read out
from the memory block.
and its inverted value. The mux’s select line is driven by a
corresponding line from the output of the error location decoder,
to determine whether the original memory out value, or its
inverted value will pass out of the memory block. This entire
process can be seen in Figure 6, below.
Unfortunately, the six parity bits and the extra logic circuitry add
significant overhead to the complexity, area, and power of the
design, which will harm our overall metric value. Error correction
is, however, an important and valuable special feature to improve
the robustness of the design, so it was assessed to be a worthy
addition to the low power SRAM design.
3.3.1 Parity Generation
The Hamming Code in this SRAM is implemented by using even
parity to generate bits that can be used to correct single errors in
the future. This is done by using exclusive-OR gates to determine
the parity bits. The XOR inputs are chosen according the
Hamming Code convention, which essentially allows the parity
bits to encode the location of any error, so that it can be corrected
in the future. The XOR scheme is shown below in Figure 5,
below.
Figure 6: Error-Fixing Logic
3.4 Memory Block
Figure 5: Parity Generation Logic
What makes Hamming Code so appealing is its ability to use the
parity bits to determine where the error occurs, by “encoding” the
error location in a manipulation of the parity bits themselves.
3.3.2 Error Fixing
Once a read is called, the 38 bits in a word, comprised of 32 data
bits and six parity bits, are pulled out from their bitcells to be
checked before being sent through the output multiplexer. Parity
check bits are determined by using the same logic as the original
parity generation method, but using the bitcell values as the inputs
instead.
These new parity check bits are then exclusive-ORed with the
original parity bit values, to compare them. The results of these
XORs will form a value that indicates the location of an error, if
any.
This value is then sent to a decoder that will control which
memory out bits must be flipped if an error is detected in one
specific location. The circuit that actually corrects the error is
composed of a 2-to-1 multiplexer that takes the memory out value,
The memory block design consists of the 304-by-256 bitcell array,
as well as other periphery used to read and write the bits, while
controlling which words will be accessed. The vast majority of
the memory block area is composed of the bitcells themselves.
Bitcells within the memory block are interleaved in order to
minimize the burst effects of cosmic rays which would occur in
real life. Interleaving also allows for the sharing of peripheral
circuitry that appears inside the memory block itself, such as the
latching voltage-mode sense amplifier, which is shared across four
columns of bitcells.
The decision was made to divide the memory into smaller blocks
in order to minimize the capacitance associated with the
wordlines, and to help reduce the access energy. Simulations
were performed to determine the average power per operation per
block for different size blocks (see Figure 7 below) in order to
assess which size would be most beneficial for the energy
component of the metric. Additional guidance was provided by
the decisions and reasoning of past VLSI groups [1].
The overall area was thus calculated to be roughly 1.2 mm2.
Since the area of a single bitcell is approximately 0.649μm2, it can
be found that bitcells cover over two-thirds of the area of the
entire SRAM design. This is a testament to our design’s ability to
easily and efficiently tile the bitcells, however, it is clear that
different dimensions of memory blocks would have allowed the
area of peripheral circuits to be minimized, which would further
increase the area efficiency.
4.1.4 Idle Power
Figure 7: Average Power Access Results
Though the 256-by-256 dimension block does not necessary
achieve the minimum power possible, it allows for simpler design
of periphery multiplexers that are required to control the blocks,
while still realizing lower power. Due to the later addition of
ECC circuitry, the dimension did change slightly to 304-by-256,
which results in higher power per access, but the relative power
and complexity compared to other dimensions remains similar.
4. RESULTS
The metric provided by PICo to measure optimality was set as:
Metric = (Active Energy per Access)2*Delay*Area*Idle Power
The following sections discuss the Active Energy, Delay, and
Area, and Idle Power of Team 2’s SRAM design.
The idle power is the power that is consumed by the SRAM while
the clock is off and no memory accesses are occurring.
Essentially, this represents the quiescent power of the design, so it
should be minimized in order to make the design truly low-power,
so that any battery source used to power the SRAM will be longlasting.
Using the active VDD rail, set at 600 mV, the idle power is
calculated to be 57.78 μW. With a sleep VDD rail used (set at
300 mV, as discussed previously), the idle power is reduced
significantly, to only 0.483 μW. This is a significant achievement
that will allow the SRAM to be a very useful microsensor node,
since the power source can be very small and last a long time, by
activating the sleep mode when necessary. For purposes of
calculating the metric, however, sleep mode will be ignored.
4.1.5 Metric Calculation
The final calculated metric was calculated to be:
Metric = (9.704 fJ)2*(7.186 ns)*(1.2 mm2)*(57.78 μW)
= 4.692 x10-41J2*s*mm2*W
4.1 Overall Findings
Using the calculated active energy, delay, and area, the low-power
focused metric outlined by PICo can be calculated.
4.1.1 Active Energy per Access
Active energy is measured as the average energy of the design
over five read accesses and one write access, using the fastest
clock that the SRAM can accept.
With the VDD set at 600 mV, the read energy was measured at
9.678 fJ, and the write energy was measured at 9.834 fJ. As a
result, the active energy per access was computed to be 9.704 fJ.
4.1.2 Delay
The delay is measured as the worst case delay for writing and
reading to a bitcell. The read delay was measured as the delay
from the start of the precharge to when the value has passed
through the output error correction circuitry, so the precharge
pulse was decreased just to the point at which the circuit would no
longer read.
Thus the read delay was measured to be
approximately 7.186 ns, and the write delay was measured at
6.491 ns, so the accepted delay of the circuit is the greater of the
two, which is 7.186 ns.
4.1.3 Area
The area of the device was calculated by measuring the
dimensions of the effective rectangle the surrounds the entire
design. Since several of the peripheral circuits were not fully laid
out, their exact effects could not be considered for the calculation
of the area, however, their areas were estimated and sufficient
extra space was added to the rectangle to take them into account.
5. CONCLUSIONS
In conclusion, Team 2’s low-power SRAM design provides a
product that produces a superb metric value, through significant
gains in energy efficiency. Some tradeoffs were made with speed,
especially by decreasing the voltage rail, but since energy is the
most emphasized specification in our chosen metric, speed was
willingly sacrificed.
The area of the SRAM may be somewhat large, but this was
necessary to maintain some simplicity of design, and keep the
delay from increasing unreasonably. Furthermore, much of the
area was taken up by the bitcells, which were well designed to tile
easily and maximize our area efficiency. Other methods of saving
space were forgone (specifically dynamic logic design) because
they would have resulted in extra energy consumption [4].
Our special features should positively differentiate our design
from our competitors. Single bit error correction is a necessary
feature in low-power SRAM, in order to maintain data despite
noise and some variations in the MOSFETs. The low voltage rail
and sleep voltage permit extreme power savings, thus improving
our metric greatly. The final metric value for the SRAM was
4.692 x10-41J2*s*mm2*W, making Team 2’s product the superior
choice for PICo’s representatives. For this reason, Team 2’s
design is the best choice for the PICo contract.
6. ACKNOWLEDGMENTS
We would like to thank Benton Highsmith Calhoun for his
essential design knowledge and guidance.
[1] ECE 4332 2009 Group Projects Pages. UVa ECE Wiki.
[4] Rabaey, J. Digital Integrated Circuits: A Design Perspective.
Prentice Hall, 2003.
[2] Kaxiras, S., Zhigang, H., & Martonosi, M. Cache Decay:
Exploiting Generational Behavior to Reduce Cache Leakage
Power. 9th International Symposium on Computer
Architecture, 2001.
[5] Ryan, J. F., & Calhoun, B. H. Minimizing Offset for Latching
Voltage-Mode Sense Amplifiers for Sub-Threshold
Operation. 9th International Symposium on Quality
Electronic Design, 2008.
[3] Ling, S., Kim, Y. B., & Lombardi, F. A Low-Leakage 9T
SRAM Cell for Ultra-Low Power Operation, 2008.
[6] Wang, A., Calhoun, B. H., & Chandrakasan, A. P. SubThreshold Design for Ultra Low-Power Systems. Springer,
2006.
7. REFERENCES
Download