Lecture 9
Content Addressable Memories
Cell Design and Peripheral Circuits
Semiconductor Memory Classification
• RWM (read-write memory)
  – Random access: SRAM, DRAM
  – Non-random access: FIFO, LIFO, Shift Register, CAM
• NVRWM (non-volatile read-write memory): EPROM, E2PROM, FLASH
• ROM: Mask-programmed, Programmable (PROM)

FIFO: First-in-first-out
LIFO: Last-in-first-out (stack)
CAM: Content addressable memory
Memory Architecture: Decoders

[Figure: address decoder beside the memory array; annotations note that the decoder must be pitch-matched to the cells and that, without decoding, the select line becomes too long.]
2D Memory Architecture

[Figure: 2-D memory array. The higher-order address bits (Aj … Ak-1) drive a row decoder that activates one of 2^(k-j) word lines; each row holds m·2^j storage (RAM) cells on the bit lines. The lower-order bits (A0 … Aj-1) drive a column decoder that selects the appropriate word from the memory row. Sense amplifiers amplify the bit-line swing, followed by read/write circuits and an m-bit input/output.]
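The address split in this organization can be sketched in a couple of lines (field widths here are illustrative): the low-order j bits select the word within a row, and the remaining k − j bits select the row.

```python
def split_address(addr, k, j):
    """Split a k-bit address: A_j..A_{k-1} drive the row decoder,
    A_0..A_{j-1} drive the column decoder."""
    row = addr >> j               # selects one of 2^(k-j) word lines
    col = addr & ((1 << j) - 1)   # selects one of 2^j words per row
    return row, col

print(split_address(0b110101, 6, 2))  # (13, 1): row 0b1101, column 0b01
```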
3D Memory Architecture

[Figure: array of memory blocks sharing an m-bit input/output.]

Advantages:
1. Shorter word and/or bit lines
2. A block address activates only one block, saving power
Hierarchical Memory Architecture

[Figure: memory blocks addressed by row, column, and block addresses; a block selector and control circuitry connect the selected block to a global data bus, a global amplifier/driver, and the I/O.]

Advantages:
• Shorter wires within blocks
• A block address activates only one block: power management
Read-Write Memories (RAM)

Static (SRAM)
• Data stored as long as supply is applied
• Large (6 transistors per cell)
• Fast
• Differential signal (more reliable)

Dynamic (DRAM)
• Periodic refresh required
• Small (1-3 transistors per cell) but slower
• Single-ended (unless a dummy cell is used to generate differential signals)
Associative Memory
What is CAM?
• Content Addressable Memory is a special kind of memory.
• Read operation in traditional memory:
  – Input is the address of the content we are interested in.
  – Output is the content stored at that address.
• In a CAM it is the reverse:
  – Input is content associated with something stored in the memory.
  – Output is the location where the associated content is stored.
Traditional Memory
Address | Content
00      | 1 0 1 X X
01      | 0 1 1 0 X
10      | 0 1 1 X X
11      | 1 0 0 1 1
Input address 01 → output content 0 1 1 0 X

Content Addressable Memory
Address | Content
00      | 1 0 1 X X
01      | 0 1 1 0 X
10      | 0 1 1 X X
11      | 1 0 0 1 1
Input content 0 1 1 0 1 → output address 01 (X bits are don't cares)
Type of CAMs
• Binary CAM (BCAM) stores only 0s and 1s.
  – Applications: MAC table consultation, layer 2 security-related VPN segregation.
• Ternary CAM (TCAM) stores 0s, 1s, and don't cares.
  – Applications: wherever wildcards are needed, such as layer 3 and layer 4 classification for QoS and CoS purposes, and IP routing (longest prefix matching).
• Available sizes: 1 Mb, 2 Mb, 4.7 Mb, 9.4 Mb, and 18.8 Mb.
• CAM entries are structured as multiples of 36 bits rather than 32 bits.
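A sketch of the longest-prefix-matching application (illustrative Python, not vendor code): each TCAM entry stores a prefix padded with don't-care bits, and among all matching entries the one with the fewest wildcards, i.e. the longest prefix, wins.

```python
def lpm(table, address_bits):
    """table: list of (prefix, next_hop); 'X' marks don't-care (wildcard) bits."""
    best = None
    for prefix, next_hop in table:
        if all(p in ('X', a) for p, a in zip(prefix, address_bits)):
            length = sum(1 for p in prefix if p != 'X')  # prefix length
            if best is None or length > best[0]:
                best = (length, next_hop)
    return best[1] if best else None

routes = [("10XXXXXX", "A"), ("1011XXXX", "B"), ("XXXXXXXX", "default")]
print(lpm(routes, "10110000"))  # B: the /4 prefix beats the /2 and /0 entries
print(lpm(routes, "01000000"))  # default
```

A real TCAM resolves this in hardware by storing entries sorted by prefix length and letting the priority encoder pick the first match.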
CAM: Introduction
• CAM vs. RAM

[Figure: in a RAM, an address goes in and the data stored at that address comes out; in a CAM, a data word goes in and the address of the matching entry comes out. The example shows a small table of 8-bit words.]
Memory Hierarchy
The overall goal of using a memory hierarchy is to obtain the highest possible average access speed while minimizing the total cost of the entire memory system.
Multiprogramming refers to the existence of many programs in different parts of main memory at the same time.

[Figure: main memory built from RAM and ROM chips.]
Memory Address Map
The designer of a computer system must calculate the
amount of memory required for the particular application
and assign it to either RAM or ROM.
The interconnection between memory and processor is then
established from knowledge of the size of memory needed
and the type of RAM and ROM chips available.
The addressing of memory can be established by means
of a table that specifies the memory address assigned to
each chip.
The table, called a memory address map, is a pictorial
representation of assigned address space for each chip in
the system.
Memory Configuration (case study):
Required: 512 bytes of ROM + 512 bytes of RAM
Available: one 512-byte ROM chip and 128-byte RAM chips (so four RAM chips are needed)

[Table: memory address map assigning an address range to each chip.]
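One possible address assignment for this case study (the mapping below is an assumption, not the only valid map): the four 128-byte RAM chips fill addresses 0x000-0x1FF and the 512-byte ROM fills 0x200-0x3FF of a 10-bit address space.

```python
RAM_CHIP_SIZE = 128
ROM_BASE = 0x200  # RAM chips 1-4 occupy 0x000-0x1FF, ROM occupies 0x200-0x3FF

def decode(address):
    """Return (chip, offset) for a 10-bit address, per the map above."""
    if address < ROM_BASE:
        return ("RAM%d" % (address // RAM_CHIP_SIZE + 1), address % RAM_CHIP_SIZE)
    return ("ROM", address - ROM_BASE)

print(decode(0x000))  # ('RAM1', 0)
print(decode(0x0FF))  # ('RAM2', 127)
print(decode(0x200))  # ('ROM', 0)
```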
Associative Memory
The time required to find an item stored in memory can be
reduced considerably if stored data can be identified for access
by the content of the data itself rather than by an address.
A memory unit accessed by content is called an associative memory or Content Addressable Memory (CAM). This type of memory is accessed simultaneously and in parallel on the basis of data content rather than a specific address or location.
When a word is written in an associative memory, no address is
given. The memory is capable of finding an empty unused
location to store the word. When a word is to be read from an
associative memory, the content of the word or part of the
word is specified.
The associative memory is uniquely suited to do parallel
searches by data association. Moreover, searches can be done
on an entire word or on a specific field within a word.
Associative memories are used in applications where the
search time is very critical and must be very short.
Hardware Organization

[Figure: block diagram of an associative memory of m words, n bits per word. The argument register (A) holds the input word, the key register (K) masks which bits participate in the comparison, the associative memory array and logic performs the parallel compare, and the match register (M) flags matching words. Read and write controls and an output port complete the block.]
[Figure: associative memory array. Argument bits A1 … An and key bits K1 … Kn run over bit columns 1 … n; word i consists of cells Ci1 … Cin and drives match line Mi, for words 1 … m.]
One Cell of Associative Memory

[Figure: cell Fij is an RS flip-flop written from the input line under the Write control and read onto the output line under the Read control; match logic compares the stored bit with argument bit Aj under key bit Kj and contributes to match line Mi.]
Match Logic Circuit

[Figure: match logic for word i. For each bit j, Aj and Kj are compared against the stored pair Fij / F'ij, and the per-bit results are ANDed to produce Mi.]

Word i matches when every unmasked bit agrees with the argument: Mi = ∏j (Aj Fij + A'j F'ij + K'j).
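Behaviorally, the match logic reduces to: word i matches when every bit position enabled by the key register agrees with the argument register. A short Python sketch (names follow the figure):

```python
def match(stored_word, A, K):
    """Mi = 1 iff, for every j with Kj = 1, Aj equals the stored bit Fij."""
    return all(k == 0 or a == f for f, a, k in zip(stored_word, A, K))

A = [1, 0, 1, 1]   # argument register
K = [1, 1, 0, 0]   # key register: compare only the two leftmost bits
words = [[1, 0, 0, 0], [0, 0, 1, 1], [1, 1, 1, 1]]

M = [match(w, A, K) for w in words]  # match register
print(M)  # [True, False, False]
```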
CAM: Introduction
• Binary CAM Cell

[Figure: binary CAM cell. A cross-coupled inverter pair (P1, P2, N1, N2) stores the bit; access transistors N3 and N4 connect it to bit lines BL1/BL1c under word line WL; comparison transistors N5-N8 discharge the match line ML when the stored bit disagrees with search lines SL1/SL1c.]
CAM: Introduction
• Ternary CAM (TCAM)

[Figure: two TCAM searches over a six-entry table. Stored entries may contain X (don't care) bits; an input keyword such as 1 0 1 1 0 0 0 1 can match several entries at once (here entries 1 and 4), and a keyword containing X bits matches any stored value at those positions.]
CAM: Introduction
• TCAM Cell
  – Global masking → search lines (SLs)
  – Local masking → bit lines (BLs)

[Figure: TCAM cell built from two RAM cells plus comparison logic on match line ML, written through BL1/BL1c and BL2/BL2c under word line WL and searched through SL1/SL2. A truth table maps the two stored bits to the logic values 0, 1, and X, with the remaining combination unused (N.A.).]
CAM: Introduction
• DRAM-based TCAM Cell
Pros: higher bit density
Cons: slower table update, expensive process, refreshing circuitry, scaling issues (leakage)

[Figure: DRAM-based TCAM cell — storage devices N3/N4 with comparison transistors N5-N8 on match line ML, accessed through BL1/BL2 under WL and searched through SL1/SL2.]
CAM: Introduction
• SRAM-based TCAM Cell
Pros: standard CMOS process, fast table update
Cons: large area (16T)

[Figure: SRAM-based 16-transistor TCAM cell, written through BL1/BL1c and BL2/BL2c under WL and searched through SL1/SL2 onto match line ML.]
CAM: Introduction
• Block diagram of a 256 x 144 TCAM

[Figure: 256 rows of 144 CAM cells. SL drivers broadcast the search word on search lines SL1/SL2(0) … SL1/SL2(143); the cells of each row share a match line (ML0 … ML255) terminated in a match-line sense amplifier (MLSA) that produces the match-line sense output MLSO(0) … MLSO(255). Bit lines BL1c/BL2c(0) … BL1c/BL2c(N) run vertically for writes.]
CAM: Introduction
• Why low-power TCAMs?
  – Parallel search → very high power
  – Larger word size and larger number of entries → higher power
  – Embedded applications (SoC)
CAM: Design Techniques
• Cell Design: 12T Static TCAM cell*
  – '0' is retained by leakage (VWL ~ 200 mV)
Pros: high density
Cons: leakage (3 orders of magnitude), reduced noise margin, soft errors (node S), unsuitable for READ
CAM: Design Techniques
• Cell Design: NAND vs. NOR Type CAM
NAND type — Pros: low power. Cons: charge sharing, slow.

[Figure: NAND-type CAM — cells (0) to (N) in series between the sense amplifier and ground, so ML_NAND conducts only when every cell matches. NOR-type CAM — cells in parallel on ML_NOR, precharged toward VDD, so any mismatching cell discharges the line. Both are written through BL1/BL1c under WL and searched through SL1/SL1c.]
CAM: Design Techniques
• MLSA Design: Conventional
  – Pre-charge ML to VDD
  – Match → VML = VDD
  – Mismatch → VML = 0

[Figure: conventional match-line sense amplifier — a precharge device (PRE) charges ML to VDD and mismatch transistors (MM) discharge it; the buffered result appears on MLSO.]
CAM: Design Techniques
• Low Power: Dual-ML TCAM
– Same speed, 50% less energy (Ideally!)
– Parasitic interconnects degrade both speed and
energy
– Additional ML increases coupling capacitance
CAM: Design Techniques
• Static Power Reduction
  – 16T TCAM: Leakage Paths*

[Figure: leakage paths in a 16T SRAM-based TCAM cell — paths through the storage inverters (P1-P4, N1-N8) and the comparison stack (N9-N12), from bit lines BL1/BL1c and BL2/BL2c and search lines SL1/SL2 to match line ML, for the marked '0'/'1' stored states.]

* N. Mohan, M. Sachdev, Proc. IEEE CCECE, pp. 711-714, May 2-5, 2004
CAM: Design Techniques
• Static Power Reduction
  – Side effects of VDD reduction in TCAM cells
    • Speed: no change
    • Dynamic power: no change
    • Robustness degraded: lowering VDD reduces the voltage margin (current-race sensing)

[Figure: waveforms of ML[0], ML[1], and MLSO[0] illustrating the voltage margin.]
CAM for Routing Table Implementation
• CAM can be used as a search engine.
• We want to find matching contents in a database or table.
• Example: routing table
Source: http://pagiamtzis.com/cam/camintro.html
Simplified CAM Block Diagram
• The input to the system is the search word.
• The search word is broadcast on the search lines.
• A match line indicates whether there is a match between the search word and the stored word.
• An encoder specifies the match location.
• If there are multiple matches, a priority encoder selects the first match.
• The hit signal indicates whether any match was found.
• Search words are long, ranging from 36 to 144 bits.
• Table sizes range from a few hundred entries to 32K.
• Address space: 7 to 15 bits.
Source: K. Pagiamtzis, A. Sheikholeslami, "Content-Addressable Memory (CAM) Circuits and Architectures: A Tutorial and Survey," IEEE J. of Solid-State Circuits, March 2006
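The encoder stage of the diagram can be modelled in a few lines: the priority encoder returns the lowest matching address, and the hit signal is simply "was there any match".

```python
def priority_encode(match_lines):
    """Return (hit, address): the first asserted match line wins."""
    for addr, ml in enumerate(match_lines):
        if ml:
            return (True, addr)
    return (False, None)

print(priority_encode([0, 0, 1, 0, 1]))  # (True, 2): lowest match selected
print(priority_encode([0, 0, 0, 0, 0]))  # (False, None): no match, no hit
```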
CAM Memory Size
• Largest available: around 18 Mbit (single chip).
• Rule of thumb: the largest CAM chip is about half the size of the largest available SRAM chip, since a typical CAM cell consists of two SRAM cells.
• The available size has grown at an exponential rate.
Source: K. Pagiamtzis, A. Sheikholeslami, "Content-Addressable Memory (CAM) Circuits and Architectures: A Tutorial and Survey," IEEE J. of Solid-State Circuits, March 2006
CAM Basics
• The search-data word is loaded into the search-data register.
• All match lines are pre-charged high (temporary match state).
• Search-line drivers broadcast the search word onto the differential search lines.
• Each CAM core cell compares its stored bit against the bit on the corresponding search lines.
• Match lines of words with at least one mismatched bit discharge to ground.
Source: K. Pagiamtzis, A. Sheikholeslami, "Content-Addressable Memory (CAM) Circuits and Architectures: A Tutorial and Survey," IEEE J. of Solid-State Circuits, March 2006
CAM Advantages
• They associate the input (comparand) with their memory contents in one clock cycle.
• They are configurable in multiple formats of search-data width and depth, which allows searches to be conducted in parallel.
• CAMs can be cascaded to increase the size of the lookup tables they can store.
• New entries can be added to the table, so the device can learn entries it has not seen before.
• They are one of the appropriate solutions for higher speeds.
CAM Disadvantages
• They cost several hundred dollars per CAM even in large quantities.
• They occupy a relatively large footprint on a card.
• They consume excessive power.
• Generic system engineering problems:
  – Interfacing with the network processor.
  – Handling simultaneous table updates and lookup requests.
CAM structure

[Figure: CAM organized as 72 bits x 16K x 8 structures (72 bits x 131072 total), mixable as 72 bits x 16384, 144 bits x 8192, 288 bits x 4096, or 576 bits x 2048. Surrounding blocks: I/O port control, output port control, flag control, priority encoder, decoder, empty bit, global mask registers, control & status registers, CAM control, and pipeline execution control (command bus).]

• The comparand bus is 72 bytes wide and bidirectional.
• The result bus is an output.
• The command bus enables instructions to be loaded into the CAM.
• It has 8 configurable banks of memory.
• The NPU issues a command to the CAM.
• The CAM then performs an exact match or uses wildcard characters to extract the relevant information.
• There are two sets of mask registers inside the CAM.
CAM structure
• There are global mask registers, which can mask out specific bits, as well as a mask register present in each location of memory.
• The search result can be:
  – one output (highest priority), or
  – a burst of successive results.
• The output port is 24 bytes wide.
• Flag and control signals specify the status of the banks of the memory.
• They also enable us to cascade multiple chips.
CAM Features
• CAM cascading:
  – We can cascade up to 8 devices without incurring a search-time performance penalty (72 bits x 512K).
  – We can cascade up to 32 devices with performance degradation (72 bits x 2M).
• Terminology:
  – Initializing the CAM: writing the table into the memory.
  – Learning: updating specific table entries.
  – Writing a search key to the CAM: the search operation.
• Handling wider keys:
  – Most CAMs support 72-bit keys.
  – They can support wider keys in native hardware.
• Shorter keys can be handled more efficiently at the system level.
CAM Latency
• The clock rate is between 66 and 133 MHz.
• The clock speed determines the maximum search capacity.
• Factors affecting search performance:
  – Key size
  – Table size
• For the system designer, the total latency to retrieve data from the SRAM connected to the CAM is what matters.
• By using pipelining and multi-threading techniques for resource allocation, we can ease the CAM speed requirements.
Source: IDT
Management of Tables Inside a CAM
• It is important to squeeze as much information as we can into a CAM.
• Example from Netlogic application notes:
  – We want to store 4 tables of 32-bit-wide IP destination addresses.
  – The CAM is 128 bits wide.
  – If we store each table directly, 96 bits of every slot are wasted.
• We can instead arrange the 32-bit-wide tables next to each other:
  – Every 128-bit slot is partitioned into four 32-bit slots.
  – These hold the 3rd, 2nd, 1st, and 0th tables, going from left to right.
  – We use the global mask registers to access only one of the tables:

MASK 3: 00000000 FFFFFFFF FFFFFFFF FFFFFFFF
MASK 2: FFFFFFFF 00000000 FFFFFFFF FFFFFFFF
MASK 1: FFFFFFFF FFFFFFFF 00000000 FFFFFFFF
MASK 0: FFFFFFFF FFFFFFFF FFFFFFFF 00000000
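A sketch of the four-table search, assuming the convention that F nibbles in a global mask mark don't-care bits and 0 nibbles mark the bits to compare (the application note's exact mask polarity may differ; the slot values below are made up for illustration):

```python
FULL = (1 << 128) - 1
MASKS = {3: 0x00000000FFFFFFFFFFFFFFFFFFFFFFFF,  # table 3 = leftmost field
         2: 0xFFFFFFFF00000000FFFFFFFFFFFFFFFF,
         1: 0xFFFFFFFFFFFFFFFF00000000FFFFFFFF,
         0: 0xFFFFFFFFFFFFFFFFFFFFFFFF00000000}  # table 0 = rightmost field

def search(slots, table, key32):
    """Find the slots whose selected 32-bit field equals key32."""
    care = ~MASKS[table] & FULL          # bits actually compared
    comparand = key32 << (32 * table)    # align the key with its field
    return [i for i, slot in enumerate(slots) if (slot ^ comparand) & care == 0]

slots = [(0xC0A80001 << 96) | 0x0A000001,  # table 3: 192.168.0.1, table 0: 10.0.0.1
         (0x08080808 << 96) | 0x0A000002]
print(search(slots, 3, 0xC0A80001))  # [0]
print(search(slots, 0, 0x0A000002))  # [1]
```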
Example Continued
• We can still use the per-entry mask register (not the global mask register) to do maximum-prefix-length matching.

[Figure: a 128-bit comparand register and global mask register searched against entries 94-97; each entry's local mask marks its prefix length, and the entry with the longest matching prefix raises MATCH FOUND.]