A Configurable TCAM / BCAM / SRAM using 28nm push

advertisement
A Configurable TCAM / BCAM / SRAM using 28nm push-rule 6T bit cell
Supreet Jeloka1, Naveen Akesh2, Dennis Sylvester1, and David Blaauw1
1
University of Michigan, Ann Arbor, MI, 2Oracle, Santa Clara, CA
Abstract
Conventional Content Addressable Memory (BCAM and TCAM)
uses specialized 10T / 16T bit cells that are significantly larger than
6T SRAM cells. We propose a new BCAM/TCAM that can operate
with standard push-rule 6T SRAM cells, reducing array area by 2-5×
and allowing reconfiguration of the CAM as an SRAM. Using a 6T
28nm FDSOI SRAM bit cell, the 64x64 (4kb) BCAM achieves 370
MHz at 1V and consumes 0.6fJ/search/bit.
Introduction & Conventional CAM
A Content Addressable Memory (CAM) compares its search input
data with every word stored in the memory, and returns the address
location of matching words. This parallel multi-data search makes a
CAM an indispensable component for high-associativity caches,
translation look-aside buffers [1], and register-renaming [5].
Furthermore, in certain applications such as IP-routers [6] the CAM
dominates total chip area, raising the need for area-efficient CAM
structures. As shown in Fig. 1, a conventional 10-transistor (10T)
binary CAM (BCAM) bit cell is composed of a 6T SRAM-like
storage component and a 4T XOR component to determine the
bit-wise match. A Ternary CAM (TCAM) can store 0, 1, or X, where
‘X’ implies that it matches with both a 0 and a 1 of the search key. As
such it requires double the storage, resulting in a 16T cell. The high
transistor count of B/T-CAM cells, coupled with the fact that
foundries do not typically support “push-rule” CAM cells, results in a
CAM array with 2–5× larger area than a corresponding SRAM; this
significantly impacts chip area as well as power and performance.
We propose a reconfigurable CAM circuit based on a conventional,
push-rule 6T SRAM bit cell that improves array density by as much as
4×. The approach hinges on storing the words column-wise and using
the standard bit-lines to perform a matching operation. A
configurability feature allows on-the-fly mode switching among
BCAM, TCAM, and SRAM operation. In this way, an SRAM
memory can be re-configured to a CAM upon demand to accelerate
parallel search-like applications. Implemented in 28nm FDSOI, the
BCAM attains 370MHz operation and consumes only 0.6fJ/search/bit
while the TCAM achieves the same performance at 0.56fJ/search/bit.
Proposed Configurable CAM
In order to maintain the density of a push-rule 6T SRAM cell, the
proposed CAM separates the wordline into WLR (Word-Line-Right)
and WLL (Word-Line-Left) (Fig. 2). This creates two independent
access transistors but incurs no area penalty since the push-rule layers
are kept intact (i.e., only DRC-compliant metallization changes are
made). The key to performing a parallel search with this bit-cell is to
store words column-wise (vertically) while placing the search data on
the word-lines rather than the bitlines as in a conventional BCAM.
The search then proceeds as follows (Fig. 2):
BCAM Search: WLR carries ‘Search Data’ while WLL carries
‘Search Data bar’. As a result, the “1” bits in the search data are
matched on BL while BLB matches the “0” bits. In the match case
both BL and BLB stay at the pre-charged high value. If there is a
mismatch, BL, BLB, or both discharge. To detect this, BL and BLB
are sensed separately using two single-ended sense-amplifiers that are
logically ANDed to indicate a match in the column. For example, in
Fig. 2 the search string has D_m = 0 while the word in column ‘n’
stores a ‘1’ at the mth bit position. This leads BLB_n to discharge, and
word ‘n’ is not a match as indicated by the ‘0’ at the AND output.
TCAM Search: In TCAM mode the array is reconfigured to use
two adjacent columns to form one word (Fig. 3). This reduces the
number of words in TCAM mode by half, as the bit cell is effectively
12T. In this mode two of the four sense amplifier outputs that span the
two columns constituting a word are ANDed together. A mask bit ‘X’
(e.g., Fig. 3, top right bit) will not discharge either sensed bitline as it
stores a ‘1’ in both positions. Hence, it matches with both 0 and 1 of
the search data.
To write data column-wise into the CAM, as required for parallel
search, we propose a two-cycle write scheme for BCAM mode. A
column-decoder is added to select the column to be written. In the
first cycle, the Data is applied to both WLL and WLR while BL=1 and
BLB=0 for the addressed column, resulting in the ‘1’ bits in data
being written. In cycle-2, Data_bar is applied to both WLL and WLR
and BL and BLB are flipped to write all ‘0’ bits. TCAM write is
similar to BCAM but takes three cycles (Table 1). Since write is less
common in many CAM applications than search, the additional cycles
pose less overhead. In addition, if data is written in ‘bulk’, the extra
write cycles can be avoided by first writing zero into the entire array in
one cycle and then only writing the “1” bits in the data to each of the
columns.
It is important to note that in a CAM search operation many
wordlines can be simultaneously asserted, depending on the search
data. To avoid bit-flips during this search, we use an assist technique
where the wordline voltage is lowered to Vdd_Lo (500mV nominally,
Fig. 4). Similarly, during column-wise write the column being
written has its bitcells supplied by Vdd_Lo for write assist given the low
wordline voltage. Other columns have their bit cells supplied at
nominal Vdd to prevent data corruption. By using Vdd_Lo for both
write and search assist, only one additional supply voltage is needed.
The SRAM mode works conventionally with both WLR and WLL
driven from the address-decoder output. In SRAM mode, reads and
writes proceed row-wise using conventional differential signaling and
the performance impact from re-configurability was found to be
negligible. By reconfiguring the two single-ended sense-amplifiers in
CAM mode into a single differential sense amplifier in SRAM mode,
total reconfiguration area overhead is limited to only 7% for the added
column decoder.
Measured Results
The configurable memory is validated in a 28nm FDSOI CMOS
test chip. An on-chip BIST is used to test the three modes.
Checkerboard patterns are run on the 4kb array with 64-bit words to
check for search/write disturb faults in CAM modes. Single-mismatch
for each word is checked with walking-1 and walking-0 patterns.
Arbitrary data can be searched at-speed using an on-chip FIFO buffer.
Fig. 5 shows the frequency and energy in BCAM mode across supply
voltage with Vdd_Lo = Vdd/2. Fig. 6 shows a shmoo plot that sweeps
both Vdd_Lo and Vdd independently. At Vdd = 1V, a maximum
frequency of 400MHz is achieved. The minimum energy point is
0.41fJ, with a frequency of 70MHz at Vdd=0.7V and Vdd_Lo=0.375V.
If Vdd_Lo is too high relative to Vdd, data is corrupted (read disturb). On
the other hand, as Vdd_Lo is decreased a longer wordline pulse (slower
frequency) is needed to resolve between the no-mismatch and
1-mismatch cases. Below 0.325V, the design cannot reliably resolve
the 1-mismatch case for every column. Fig 7 shows the Vdd_Lo
operational voltage margin distribution across multiple chips.
In TCAM mode, the maximum frequency is 417MHz and optimal
energy is 0.46fJ/search/bit. In comparison, the SRAM mode functions
at > 900MHz at 0.9V. Table 2 compares the proposed work with
other designs. The normalized bit cell area is improved by > 4×
compared to other BCAMs.
Acknowledgements
STMicroelectronics is gratefully acknowledged for IC fabrication.
Funding provided in part by NSF and DARPA.
References
[1] A.Agarwal et al., ESSCIRC, 2011. [2] A.Do et al., ESSCIRC, 2013.
[3] C.Wang et al., TCAS-II, 2010.
[4] B.Yang et al., TCAS-I, 2011.
[5] G.Burda et al., ISSCC, 2010. [6] Nii et al., ISSCC, 2014.
1 1
1
SA
preb
diffb
Sense
Amp
Sense
Amp
vref
diff
wlr
bli
Di
diff
bli+1
SA
Di+1
SA
Output
out
outb
1.6
250
1.4
200
1.2
150
1.0
100
0.8
243
234
184
139
76
13
Frequency [MHz]
370
n.a.
500
250
Array Size
64x64
(64x64)
x4Arrays
128x128
128x32
Memory Modes
BL_N
BLB_N
BL_N-1
BLB_N-1
Word_N/2
--
--
SA_bli&
SA_blbi
--
SA_bl i+1
&SA_blbi
--
294
267
250
194
139
78
48
364
346
304
256
192
122
80
400
383
357
310
248
193
128
76
4
5
4
3
BCAM/TCAM/
BCAM
BCAM
SRAM
Table 2. Comparison with previous works
1
0
Max Frequency
Vdd_Lo_Margin
(MHz)
(mV)
Fig. 7. Measured VDD_LO margin
and max frequency across 10 chips
[4]
1100µ m
Scan &
Test
128x
34x4Block
TCAM/
BCAM
* From die-photo
BIST
45.6µ m
Control
&
Timing
CLK Gen
Pre-charge & Header Switches
210
Differential NAND-NOR
BCAM
2
1
1.07 (1V) 0.77 (1.2V) 1.87 (1V) 2.82 (1.8V)
0.3 (0.5V)
NOR
3
2
0.13µm
0.18µm
9T+Read 13T,14T
20 (1200) 30 (926)*
0.6 (1V)
0.41 (0.75V)
2-Single Ended SA Wide AND
0
425
3.3 (780)*
[3]
Energy/Search
/bit [fJ]
Match-line
Tecnique
1
1
375
65nm
10T
0
0
400
32nm
11T
n.a.
1
350
Area/Cell[µm2 ](F 2 )
28nm FDSOI
6T
0.152 (194)
SA
300
Technology
Hi-z Hi-z
2-Single
Ended
VDD
Fig. 6. Measured shmoo plot of VDD_LO vs VDD
for BCAM. Numbers in box are frequency in MHz
[2]
1
--
0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00
[1]
1
Table 1. Memory mode configurations.
223
162 181
122 116
70 75 73
12 12 14
This Work
0
325
Fig. 5. Measured frequency and energy
in BCAM mode against Vdd
0
Pre/
220
0.500
1.000
0
1
180
0.475
0.950
1
0
200
0.450
0.900
0
2-Single
Ended
160
0.4
D
2
1
Hi-z Hi-z
140
0.6
0
0.575
0.555
0.525
0.500
0.475
0.450
0.425
0.400
0.375
0.350
0.325
3
M
a
D Dbar s
k
Dbar
1
--
120
50
VDD_LO
1.8
Energy / Search / Bit (fJ)
Frequency (MHz)
2.0
Frequency
Energy / Search / Bit
WriteCycle
Search
2
D Dbar
Dba ri+1
SA
o/p
1
D
Pre/
Sense Differ
Amp ential
TCAM
WrCy cle
Dbar
Pre/ Dbar i
blbi+1
diffb
sa_en
2.2
Transistor/Cell
0=Mismatch
BCAM
blbi
Fig. 4. Proposed configurable CAM architecture.
400
0.425
0.850
SA
MEM
6T
0.400
0.800
1
#Chips
6T
Output Mux & Lat ches
0.375
0.750
1
1=Mat ch
#Chips
sampl eb
bl_n
blb_n
Array
wll_m
Vdd_Lo (V) 0.350
Vdd (V) 0.700
0
SA
Fig. 3. Proposed TCAM search.
Row
Decoder
Output
wll
bl
sampl eb
6T
bl_1
blb_1
Row Decoder
& Muxing Logic
CAM_Data
blb
sampl eb
SRAM_Addr
Differential SA in SRAM
2-Single-Ended SA in CAM
6T
300
1
SA
Read Write Search
Reconfigurable Sense Amp
wll_1
350
1 1
SRAM
Wr Drv
Pre
CAM_b
wlr_1
diff
01
CAM_WR_Addr
Wr Drv
Pre
wlr_m
1
SA
Column Decoder & Muxing Logic
VDD
01
10
0=Mismatch
1=Mat ch
VDD_LO
01
1
1 1
SA
SA
Fig. 2. Proposed BCAM search using 6T cells.
Fig. 1. Conventional CAM.
SRAM _D
10
1
1 0
1 1
SA
10
TCAM
BCAM
Word_1
0 1
WLL_M
01
01
0
0
1
1 0
1
Dbar_m=1
1
1
WL Drivers
slb_n
sl_n
sl_1
slb_1
WLR_M
1
ML
Sense
Amps
Word_m
Mat ch o/p
ml_m
WLL_1
D_m=0
Word_1
Mat ch o/p
ml_1
0 1
BCAM
Word_n
sl2 sl1
Dbar_1=0
Ternary CAM 16T bit cell.
If both bit stores 0, then bit masked
Search
String
Storage+compare wi th search line (SL) in each bitcell.
sl
slb
Binary CAM 10T bit cell
1
WLR_1
1
Masked
1
0 1
ml
BL_2
D_1=1
ml
BL_n
BLB_n
BLB_2
BL_1
Word_1
BLB_1
wl
BL_1
blb2
TCAM
bl2
BLB_1
bl1
Vdd_Lo_Margin
blb1
wl
Words St ored
Column W ise
bl
blb
42.6µ m
Array
Sense Amps
Fig. 8. Die photo and memory layout
Download