A Configurable TCAM / BCAM / SRAM using 28nm push-rule 6T bit cell Supreet Jeloka1, Naveen Akesh2, Dennis Sylvester1, and David Blaauw1 1 University of Michigan, Ann Arbor, MI, 2Oracle, Santa Clara, CA Abstract Conventional Content Addressable Memory (BCAM and TCAM) uses specialized 10T / 16T bit cells that are significantly larger than 6T SRAM cells. We propose a new BCAM/TCAM that can operate with standard push-rule 6T SRAM cells, reducing array area by 2-5× and allowing reconfiguration of the CAM as an SRAM. Using a 6T 28nm FDSOI SRAM bit cell, the 64x64 (4kb) BCAM achieves 370 MHz at 1V and consumes 0.6fJ/search/bit. Introduction & Conventional CAM A Content Addressable Memory (CAM) compares its search input data with every word stored in the memory, and returns the address location of matching words. This parallel multi-data search makes a CAM an indispensable component for high-associativity caches, translation look-aside buffers [1], and register-renaming [5]. Furthermore, in certain applications such as IP-routers [6] the CAM dominates total chip area, raising the need for area-efficient CAM structures. As shown in Fig. 1, a conventional 10-transistor (10T) binary CAM (BCAM) bit cell is composed of a 6T SRAM-like storage component and a 4T XOR component to determine the bit-wise match. A Ternary CAM (TCAM) can store 0, 1, or X, where ‘X’ implies that it matches with both a 0 and a 1 of the search key. As such it requires double the storage, resulting in a 16T cell. The high transistor count of B/T-CAM cells, coupled with the fact that foundries do not typically support “push-rule” CAM cells, results in a CAM array with 2–5× larger area than a corresponding SRAM; this significantly impacts chip area as well as power and performance. We propose a reconfigurable CAM circuit based on a conventional, push-rule 6T SRAM bit cell that improves array density by as much as 4×. The approach hinges on storing the words column-wise and using the standard bit-lines to perform a matching operation. A configurability feature allows on-the-fly mode switching among BCAM, TCAM, and SRAM operation. In this way, an SRAM memory can be re-configured to a CAM upon demand to accelerate parallel search-like applications. Implemented in 28nm FDSOI, the BCAM attains 370MHz operation and consumes only 0.6fJ/search/bit while the TCAM achieves the same performance at 0.56fJ/search/bit. Proposed Configurable CAM In order to maintain the density of a push-rule 6T SRAM cell, the proposed CAM separates the wordline into WLR (Word-Line-Right) and WLL (Word-Line-Left) (Fig. 2). This creates two independent access transistors but incurs no area penalty since the push-rule layers are kept intact (i.e., only DRC-compliant metallization changes are made). The key to performing a parallel search with this bit-cell is to store words column-wise (vertically) while placing the search data on the word-lines rather than the bitlines as in a conventional BCAM. The search then proceeds as follows (Fig. 2): BCAM Search: WLR carries ‘Search Data’ while WLL carries ‘Search Data bar’. As a result, the “1” bits in the search data are matched on BL while BLB matches the “0” bits. In the match case both BL and BLB stay at the pre-charged high value. If there is a mismatch, BL, BLB, or both discharge. To detect this, BL and BLB are sensed separately using two single-ended sense-amplifiers that are logically ANDed to indicate a match in the column. For example, in Fig. 2 the search string has D_m = 0 while the word in column ‘n’ stores a ‘1’ at the mth bit position. This leads BLB_n to discharge, and word ‘n’ is not a match as indicated by the ‘0’ at the AND output. TCAM Search: In TCAM mode the array is reconfigured to use two adjacent columns to form one word (Fig. 3). This reduces the number of words in TCAM mode by half, as the bit cell is effectively 12T. In this mode two of the four sense amplifier outputs that span the two columns constituting a word are ANDed together. A mask bit ‘X’ (e.g., Fig. 3, top right bit) will not discharge either sensed bitline as it stores a ‘1’ in both positions. Hence, it matches with both 0 and 1 of the search data. To write data column-wise into the CAM, as required for parallel search, we propose a two-cycle write scheme for BCAM mode. A column-decoder is added to select the column to be written. In the first cycle, the Data is applied to both WLL and WLR while BL=1 and BLB=0 for the addressed column, resulting in the ‘1’ bits in data being written. In cycle-2, Data_bar is applied to both WLL and WLR and BL and BLB are flipped to write all ‘0’ bits. TCAM write is similar to BCAM but takes three cycles (Table 1). Since write is less common in many CAM applications than search, the additional cycles pose less overhead. In addition, if data is written in ‘bulk’, the extra write cycles can be avoided by first writing zero into the entire array in one cycle and then only writing the “1” bits in the data to each of the columns. It is important to note that in a CAM search operation many wordlines can be simultaneously asserted, depending on the search data. To avoid bit-flips during this search, we use an assist technique where the wordline voltage is lowered to Vdd_Lo (500mV nominally, Fig. 4). Similarly, during column-wise write the column being written has its bitcells supplied by Vdd_Lo for write assist given the low wordline voltage. Other columns have their bit cells supplied at nominal Vdd to prevent data corruption. By using Vdd_Lo for both write and search assist, only one additional supply voltage is needed. The SRAM mode works conventionally with both WLR and WLL driven from the address-decoder output. In SRAM mode, reads and writes proceed row-wise using conventional differential signaling and the performance impact from re-configurability was found to be negligible. By reconfiguring the two single-ended sense-amplifiers in CAM mode into a single differential sense amplifier in SRAM mode, total reconfiguration area overhead is limited to only 7% for the added column decoder. Measured Results The configurable memory is validated in a 28nm FDSOI CMOS test chip. An on-chip BIST is used to test the three modes. Checkerboard patterns are run on the 4kb array with 64-bit words to check for search/write disturb faults in CAM modes. Single-mismatch for each word is checked with walking-1 and walking-0 patterns. Arbitrary data can be searched at-speed using an on-chip FIFO buffer. Fig. 5 shows the frequency and energy in BCAM mode across supply voltage with Vdd_Lo = Vdd/2. Fig. 6 shows a shmoo plot that sweeps both Vdd_Lo and Vdd independently. At Vdd = 1V, a maximum frequency of 400MHz is achieved. The minimum energy point is 0.41fJ, with a frequency of 70MHz at Vdd=0.7V and Vdd_Lo=0.375V. If Vdd_Lo is too high relative to Vdd, data is corrupted (read disturb). On the other hand, as Vdd_Lo is decreased a longer wordline pulse (slower frequency) is needed to resolve between the no-mismatch and 1-mismatch cases. Below 0.325V, the design cannot reliably resolve the 1-mismatch case for every column. Fig 7 shows the Vdd_Lo operational voltage margin distribution across multiple chips. In TCAM mode, the maximum frequency is 417MHz and optimal energy is 0.46fJ/search/bit. In comparison, the SRAM mode functions at > 900MHz at 0.9V. Table 2 compares the proposed work with other designs. The normalized bit cell area is improved by > 4× compared to other BCAMs. Acknowledgements STMicroelectronics is gratefully acknowledged for IC fabrication. Funding provided in part by NSF and DARPA. References [1] A.Agarwal et al., ESSCIRC, 2011. [2] A.Do et al., ESSCIRC, 2013. [3] C.Wang et al., TCAS-II, 2010. [4] B.Yang et al., TCAS-I, 2011. [5] G.Burda et al., ISSCC, 2010. [6] Nii et al., ISSCC, 2014. 1 1 1 SA preb diffb Sense Amp Sense Amp vref diff wlr bli Di diff bli+1 SA Di+1 SA Output out outb 1.6 250 1.4 200 1.2 150 1.0 100 0.8 243 234 184 139 76 13 Frequency [MHz] 370 n.a. 500 250 Array Size 64x64 (64x64) x4Arrays 128x128 128x32 Memory Modes BL_N BLB_N BL_N-1 BLB_N-1 Word_N/2 -- -- SA_bli& SA_blbi -- SA_bl i+1 &SA_blbi -- 294 267 250 194 139 78 48 364 346 304 256 192 122 80 400 383 357 310 248 193 128 76 4 5 4 3 BCAM/TCAM/ BCAM BCAM SRAM Table 2. Comparison with previous works 1 0 Max Frequency Vdd_Lo_Margin (MHz) (mV) Fig. 7. Measured VDD_LO margin and max frequency across 10 chips [4] 1100µ m Scan & Test 128x 34x4Block TCAM/ BCAM * From die-photo BIST 45.6µ m Control & Timing CLK Gen Pre-charge & Header Switches 210 Differential NAND-NOR BCAM 2 1 1.07 (1V) 0.77 (1.2V) 1.87 (1V) 2.82 (1.8V) 0.3 (0.5V) NOR 3 2 0.13µm 0.18µm 9T+Read 13T,14T 20 (1200) 30 (926)* 0.6 (1V) 0.41 (0.75V) 2-Single Ended SA Wide AND 0 425 3.3 (780)* [3] Energy/Search /bit [fJ] Match-line Tecnique 1 1 375 65nm 10T 0 0 400 32nm 11T n.a. 1 350 Area/Cell[µm2 ](F 2 ) 28nm FDSOI 6T 0.152 (194) SA 300 Technology Hi-z Hi-z 2-Single Ended VDD Fig. 6. Measured shmoo plot of VDD_LO vs VDD for BCAM. Numbers in box are frequency in MHz [2] 1 -- 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00 [1] 1 Table 1. Memory mode configurations. 223 162 181 122 116 70 75 73 12 12 14 This Work 0 325 Fig. 5. Measured frequency and energy in BCAM mode against Vdd 0 Pre/ 220 0.500 1.000 0 1 180 0.475 0.950 1 0 200 0.450 0.900 0 2-Single Ended 160 0.4 D 2 1 Hi-z Hi-z 140 0.6 0 0.575 0.555 0.525 0.500 0.475 0.450 0.425 0.400 0.375 0.350 0.325 3 M a D Dbar s k Dbar 1 -- 120 50 VDD_LO 1.8 Energy / Search / Bit (fJ) Frequency (MHz) 2.0 Frequency Energy / Search / Bit WriteCycle Search 2 D Dbar Dba ri+1 SA o/p 1 D Pre/ Sense Differ Amp ential TCAM WrCy cle Dbar Pre/ Dbar i blbi+1 diffb sa_en 2.2 Transistor/Cell 0=Mismatch BCAM blbi Fig. 4. Proposed configurable CAM architecture. 400 0.425 0.850 SA MEM 6T 0.400 0.800 1 #Chips 6T Output Mux & Lat ches 0.375 0.750 1 1=Mat ch #Chips sampl eb bl_n blb_n Array wll_m Vdd_Lo (V) 0.350 Vdd (V) 0.700 0 SA Fig. 3. Proposed TCAM search. Row Decoder Output wll bl sampl eb 6T bl_1 blb_1 Row Decoder & Muxing Logic CAM_Data blb sampl eb SRAM_Addr Differential SA in SRAM 2-Single-Ended SA in CAM 6T 300 1 SA Read Write Search Reconfigurable Sense Amp wll_1 350 1 1 SRAM Wr Drv Pre CAM_b wlr_1 diff 01 CAM_WR_Addr Wr Drv Pre wlr_m 1 SA Column Decoder & Muxing Logic VDD 01 10 0=Mismatch 1=Mat ch VDD_LO 01 1 1 1 SA SA Fig. 2. Proposed BCAM search using 6T cells. Fig. 1. Conventional CAM. SRAM _D 10 1 1 0 1 1 SA 10 TCAM BCAM Word_1 0 1 WLL_M 01 01 0 0 1 1 0 1 Dbar_m=1 1 1 WL Drivers slb_n sl_n sl_1 slb_1 WLR_M 1 ML Sense Amps Word_m Mat ch o/p ml_m WLL_1 D_m=0 Word_1 Mat ch o/p ml_1 0 1 BCAM Word_n sl2 sl1 Dbar_1=0 Ternary CAM 16T bit cell. If both bit stores 0, then bit masked Search String Storage+compare wi th search line (SL) in each bitcell. sl slb Binary CAM 10T bit cell 1 WLR_1 1 Masked 1 0 1 ml BL_2 D_1=1 ml BL_n BLB_n BLB_2 BL_1 Word_1 BLB_1 wl BL_1 blb2 TCAM bl2 BLB_1 bl1 Vdd_Lo_Margin blb1 wl Words St ored Column W ise bl blb 42.6µ m Array Sense Amps Fig. 8. Die photo and memory layout