Memory Hierarchy and
Cache Design
The following sources were used in preparing these slides:
• Lecture 14 from the course Computer Architecture ECE 201 by Professor Mike
Schulte.
• Lecture 4 from William Stallings, Computer Organization and Architecture,
Prentice Hall; 6th edition, July 15, 2002.
• Lecture 6 from the course Systems Architectures II by Professors Jeremy R.
Johnson and Anatole D. Ruslanov.
• Some of the figures are from Computer Organization and Design: The
Hardware/Software Interface, Third Edition, by David Patterson and John
Hennessy, and are copyrighted material (COPYRIGHT 2004 MORGAN KAUFMANN
PUBLISHERS, INC. ALL RIGHTS RESERVED).
Memory Hierarchy
Memory technology    Typical access time         $ per GB in 2004
SRAM                 0.5–5 ns                    $4000–$10,000
DRAM                 50–70 ns                    $100–$200
Magnetic disk        5,000,000–20,000,000 ns     $0.50–$2
[Figure: levels in the memory hierarchy, from the CPU (processor) through
Level 1, Level 2, ..., Level n, with data transferred between levels. Labels:
increasing distance from the CPU in access time; size of the memory at each
level.]
SRAM v DRAM
• Both volatile
– Power needed to preserve data
• Dynamic cell
– Simpler to build, smaller
– More dense
– Less expensive
– Needs refresh
– Larger memory units
• Static
– Faster
– Cache
General Principles of Memory
• Locality
– Temporal Locality: referenced memory is likely to be referenced
again soon (e.g. code within a loop)
– Spatial Locality: memory close to referenced memory is likely to be
referenced soon (e.g., data in a sequentially accessed array)
• Definitions
– Upper: memory closer to processor
– Block: minimum unit that is present or not present
– Block address: location of block in memory
– Hit: Data is found in the desired location
– Hit time: time to access upper level
– Miss rate: percentage of time item not found in upper level
• Locality + smaller HW is faster = memory hierarchy
– Levels: each smaller, faster, more expensive/byte than level below
– Inclusive: data found in upper level also found in the lower level
Cache
• Small amount of fast memory
• Sits between normal main memory and CPU
• May be located on CPU chip or module
Cache operation - overview
• CPU requests contents of memory location
• Check cache for this data
• If present, get from cache (fast)
• If not present, read required block from main
memory to cache
• Then deliver from cache to CPU
• Cache includes tags to identify which block of
main memory is in each cache slot
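The read sequence above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a description of real hardware: the slot is chosen as (address mod number of slots), blocks are one word, and the names NUM_SLOTS, main_memory and read are hypothetical.

```python
# Toy model (assumptions: one-word blocks, slot = address mod NUM_SLOTS).
NUM_SLOTS = 8

# A dictionary stands in for main memory; the contents are made up.
main_memory = {addr: addr * 10 for addr in range(32)}

# Each cache slot holds (tag, data), or None when nothing is cached there.
cache = [None] * NUM_SLOTS


def read(addr):
    """Return (data, hit) for a CPU read of block address addr."""
    slot = addr % NUM_SLOTS
    tag = addr // NUM_SLOTS
    entry = cache[slot]
    if entry is not None and entry[0] == tag:
        return entry[1], True        # present: get from cache (fast)
    data = main_memory[addr]         # not present: read block from main memory
    cache[slot] = (tag, data)        # the tag records which block the slot holds
    return data, False               # then deliver to the CPU


for a in (5, 5, 13, 5):
    value, hit = read(a)
    print(a, value, "hit" if hit else "miss")
```

Addresses 5 and 13 share a slot in this model, so the last read of 5 misses again after 13 has replaced it.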
Cache/memory structure
Four Questions for Memory
Hierarchy Designers
• Q1: Where can a block be placed in the upper level?
(Block placement)
• Q2: How is a block found if it is in the upper level?
(Block identification)
• Q3: Which block should be replaced on a miss?
(Block replacement)
• Q4: What happens on a write?
(Write strategy)
Q1: Where can a block be placed?
• Direct Mapped: Each block has only one
place that it can appear in the cache.
• Fully associative: Each block can be placed
anywhere in the cache.
• Set associative: Each block can be placed in
a restricted set of places in the cache.
– If there are n blocks in a set, the cache placement is
called n-way set associative
• What is the associativity of a direct mapped
cache?
Associativity Examples
Cache size is 8 blocks
Where does block 12 from memory go?
Fully associative:
Block 12 can go anywhere
Direct mapped:
Block no. = (Block address) mod
(No. of blocks in cache)
Block 12 can go only into block 4
(12 mod 8 = 4)
=> Access block using lower 3 bits
2-way set associative:
Set no. = (Block address) mod
(No. of sets in cache)
Block 12 can go anywhere in set 0
(12 mod 4 = 0)
=> Access set using lower 2 bits
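The three cases above follow from one formula, since direct mapped is 1-way and fully associative is (number of blocks)-way. A minimal sketch, assuming cache block frames are numbered so that set s occupies frames s x ways through s x ways + ways - 1 (the function name and this numbering are illustrative choices):

```python
def placement(block_address, num_blocks, ways):
    """Return the cache frames where block_address may be placed."""
    num_sets = num_blocks // ways
    set_no = block_address % num_sets          # (Block address) mod (No. of sets)
    return [set_no * ways + w for w in range(ways)]


print(placement(12, 8, 1))  # direct mapped: [4]
print(placement(12, 8, 2))  # 2-way set associative, set 0: [0, 1]
print(placement(12, 8, 8))  # fully associative: [0, 1, 2, 3, 4, 5, 6, 7]
```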
Direct Mapped Cache
• Mapping: each memory block maps to exactly one location in
the cache:
(Block address) mod (Number of blocks in
cache)
• Number of blocks is typically a power of two, so the
cache location is obtained from the low-order bits of the
block address.
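[Figure: an 8-entry direct mapped cache (entries 000–111) and memory block
addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, 11101; each address
maps to the cache entry given by its low-order three bits (001 or 101).]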
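When the number of blocks is a power of two, the mod operation reduces to keeping the low-order bits. A short sketch for the 8-entry cache, using the memory addresses from the figure (the function name cache_index is hypothetical):

```python
NUM_BLOCKS = 8  # must be a power of two for the mask trick to work


def cache_index(block_address):
    """Keep the low-order log2(NUM_BLOCKS) bits of the block address."""
    return block_address & (NUM_BLOCKS - 1)


addresses = (0b00001, 0b00101, 0b01001, 0b01101,
             0b10001, 0b10101, 0b11001, 0b11101)
for addr in addresses:
    # The bit mask and the mod formula agree for power-of-two sizes.
    assert cache_index(addr) == addr % NUM_BLOCKS
    print(f"{addr:05b} -> {cache_index(addr):03b}")
```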
Locating data in the Cache
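Address (showing bit positions): bits 31–12 are the tag (20 bits), bits 11–2
are the index (10 bits), and bits 1–0 are the byte offset.
• Index is 10 bits, while tag is 20 bits
– We need to address 1024 (2^10) words
– Any of 2^20 memory words could occupy a given cache location
• Valid bit indicates whether an entry contains a valid address or not
• The number of tag bits is address size – (log2(cache size in words) + 2),
where the 2 accounts for the byte offset
– E.g. 32 – (10 + 2) = 20
[Figure: direct mapped cache with entries 0–1023, each holding a valid bit, a
20-bit tag, and 32 bits of data; the 10-bit index selects an entry, and its tag
is compared with the address tag to produce Hit alongside the 32-bit Data
output.]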
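Splitting a 32-bit address for this 1024-word cache can be sketched with shifts and masks. The function name and the sample address 0x12345678 are illustrative, not taken from the slides.

```python
BYTE_OFFSET_BITS = 2   # 4 bytes per word
INDEX_BITS = 10        # 1024 cache entries


def split_address(addr):
    """Split a 32-bit byte address into (tag, index, byte_offset)."""
    byte_offset = addr & ((1 << BYTE_OFFSET_BITS) - 1)
    index = (addr >> BYTE_OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (BYTE_OFFSET_BITS + INDEX_BITS)   # remaining 20 bits
    return tag, index, byte_offset


tag, index, byte_offset = split_address(0x12345678)
print(hex(tag), index, byte_offset)  # 0x12345 414 0
```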
Q2: How Is a Block Found?
• The address can be divided into two main parts
– Block offset: selects the data from the block
offset size = log2(block size)
– Block address: tag + index
» index: selects set in cache
index size = log2(#blocks/associativity)
» tag: compared to tag in cache to determine hit
tag size = address size - index size - offset size
• Each block has a valid bit that tells if the block is
valid - the block is in the cache if the tags match
and the valid bit is set.
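The three size formulas can be checked with a small helper. This is a sketch assuming all sizes are powers of two; field_sizes and its parameter names are hypothetical.

```python
def field_sizes(address_bits, num_blocks, block_size_bytes, associativity):
    """Return (tag_bits, index_bits, offset_bits) for a cache configuration."""
    # For a power of two n, n.bit_length() - 1 equals log2(n).
    offset_bits = block_size_bytes.bit_length() - 1
    index_bits = (num_blocks // associativity).bit_length() - 1
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits


# Direct mapped, 1024 one-word (4-byte) blocks, 32-bit addresses.
print(field_sizes(32, 1024, 4, 1))   # (20, 10, 2)
# Four-way set associative with the same number of blocks: 256 sets.
print(field_sizes(32, 1024, 4, 4))   # (22, 8, 2)
```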
[Figure: address layout showing the Tag and Index fields]
Q4: What Happens on a Write?
• Write through: The information is written to both the
block in the cache and to the block in the lower-level
memory.
• Write back: The information is written only to the block
in the cache. The modified cache block is written to
main memory only when it is replaced.
– is block clean or dirty? (add a dirty bit to each block)
• Pros and Cons of each:
– Write through
» Read misses cannot result in writes to memory
» Easier to implement
» Always combined with write buffers so the processor does not wait for
memory latency
– Write back
» Less memory traffic
» Perform writes at the speed of the cache
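The difference in memory traffic can be illustrated by counting writes that reach main memory for a stream of stores. This sketch assumes a direct mapped cache that allocates a block on a write miss (write-allocate is an assumption, not stated on the slide); the function name and the sample stream are hypothetical.

```python
def count_memory_writes(write_blocks, num_blocks, policy):
    """Count writes to main memory for a stream of stores to block addresses."""
    if policy not in ("write-through", "write-back"):
        raise ValueError("unknown policy: " + policy)
    cache = {}          # cache index -> [block_address, dirty_bit]
    memory_writes = 0
    for block in write_blocks:
        index = block % num_blocks
        entry = cache.get(index)
        if entry is None or entry[0] != block:
            # Write miss. A dirty victim must be written back before reuse.
            if policy == "write-back" and entry is not None and entry[1]:
                memory_writes += 1
            entry = [block, False]
            cache[index] = entry
        if policy == "write-through":
            memory_writes += 1      # every store also goes to memory
        else:
            entry[1] = True         # only mark the cached block dirty
    return memory_writes


stores = [0, 0, 0, 4, 4, 0]  # blocks 0 and 4 conflict in a 4-block cache
print(count_memory_writes(stores, 4, "write-through"))  # 6
print(count_memory_writes(stores, 4, "write-back"))     # 2
```

In the write-back run the final dirty block is still in the cache and has not yet been written to memory.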
Reducing Cache Misses with a More
Flexible Placement Strategy
• In a direct mapped cache a block can go in
exactly one place in cache
• In a fully associative cache a block can go
anywhere in cache
• A compromise is to use a set associative cache
where a block can go into a fixed number of
locations in cache, determined by:
(Block number) mod (Number of sets in cache)
[Figure: the tag and data entries searched in a direct mapped cache
(block # 0–7), a set associative cache (set # 0–3), and a fully associative
cache.]
Example
• Three small caches, each holding four one-word blocks:
direct mapped, two-way set associative, fully
associative
• How many misses in the sequence of block
addresses:
0, 8, 0, 6, 8?
• How does this change with 8 words, 16
words?
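The example can be worked by simulation. The sketch below assumes one-word blocks, an initially empty cache, and LRU replacement within a set; count_misses is a hypothetical helper, and other replacement policies could give different counts.

```python
def count_misses(block_addresses, num_blocks, ways):
    """Count misses for a sequence of block addresses with LRU replacement."""
    num_sets = num_blocks // ways
    # Each set is a list ordered from least to most recently used.
    sets = [[] for _ in range(num_sets)]
    misses = 0
    for block in block_addresses:
        lines = sets[block % num_sets]
        if block in lines:
            lines.remove(block)       # hit: re-appended below as most recent
        else:
            misses += 1
            if len(lines) == ways:
                lines.pop(0)          # set full: evict the least recently used
        lines.append(block)
    return misses


sequence = [0, 8, 0, 6, 8]
for size in (4, 8, 16):
    print(size,
          count_misses(sequence, size, 1),      # direct mapped
          count_misses(sequence, size, 2),      # two-way set associative
          count_misses(sequence, size, size))   # fully associative
```

With four blocks this gives 5, 4 and 3 misses for direct mapped, two-way and fully associative respectively.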
Locating a Block in Cache
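• Check the tag of every cache block in the appropriate set
• Address consists of 3 parts: tag, index, block offset
• Replacement strategy: e.g. Least Recently Used (LRU)
[Figure: four-way set associative cache. Address bits 31–10 are the tag
(22 bits), bits 9–2 are the index (8 bits), and bits 1–0 are the byte offset.
The index selects one of sets 0–255; each set holds four (V, Tag, Data)
entries whose tags are compared with the address tag, and a 4-to-1 multiplexor
selects the 32-bit Data output alongside Hit.]
Program   Assoc.   I miss rate   D miss rate   Combined rate
gcc       1        2.0%          1.7%          1.9%
gcc       2        1.6%          1.4%          1.5%
gcc       4        1.6%          1.4%          1.5%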
Size of Tags vs. Associativity
• Increasing associativity requires more
comparators, as well as more tag bits per
cache block.
• Assume a cache with 4K 4-word blocks and
32 bit addresses
• Find the total number of sets and the total
number of tag bits for a
– direct mapped cache
– two-way set associative cache
– four-way set associative cache
– fully associative cache
Size of Tags vs. Associativity
• Total cache size: 4K blocks x 4 words/block x 4 bytes/word = 64 KB
• Direct mapped cache:
– 16 bytes/block => 28 bits for tag and index
– # sets = # blocks = 4K
– log2(4K) = 12 bits for index => 16 bits for tag
– Total # of tag bits = 16 bits x 4K locations = 64 Kbits
• Two-way set-associative cache:
– 32 bytes/set
– 16 bytes/block => 28 bits for tag and index
– # sets = # blocks / 2 => 2K sets
– log2(2K) = 11 bits for index => 17 bits for tag
– Total # of tag bits = 17 bits x 2 locations/set x 2K sets = 68 Kbits
Size of Tags vs. Associativity
• Four-way set-associative cache:
– 64 bytes/set
– 16 bytes/block => 28 bits for tag and index
– # sets = # blocks / 4 => 1K sets
– log2(1K) = 10 bits for index => 18 bits for tag
– Total # of tag bits = 18 bits x 4 locations/set x 1K sets = 72 Kbits
• Fully associative cache:
– 1 set of 4K blocks => 28 bits for tag and index
– Index = 0 bits => tag will have 28 bits
– Total # of tag bits = 28 bits x 4K locations/set x 1 set = 112 Kbits
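The four results can be reproduced with one helper. This is a sketch assuming power-of-two sizes and counting tag bits only (valid and dirty bits are ignored); tag_storage is a hypothetical name.

```python
def tag_storage(num_blocks, block_bytes, ways, address_bits=32):
    """Return (num_sets, tag_bits_per_block, total_tag_bits)."""
    offset_bits = block_bytes.bit_length() - 1    # log2 for powers of two
    num_sets = num_blocks // ways
    index_bits = num_sets.bit_length() - 1
    tag_bits = address_bits - offset_bits - index_bits
    return num_sets, tag_bits, tag_bits * num_blocks


# 4K blocks of 16 bytes; 4096 ways means fully associative.
for ways in (1, 2, 4, 4096):
    num_sets, tag_bits, total = tag_storage(4096, 16, ways)
    print(ways, num_sets, tag_bits, total // 1024, "Kbits")
```

The totals printed are 64, 68, 72 and 112 Kbits, matching the figures above.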
Measuring Cache Performance
• CPU time = (CPU execution clock cycles +
Memory stall clock cycles) x Clock-cycle time
• Memory stall clock cycles =
Read-stall cycles + Write-stall cycles
• Read-stall cycles = Reads/program x Read miss rate x
Read miss penalty
• Write-stall cycles = (Writes/program x Write miss rate x
Write miss penalty) + Write buffer stalls
(assumes write-through cache)
• Assume write buffer stalls are negligible and that write and read
miss penalties are equal (cost to fetch block from memory); then:
• Memory stall clock cycles = Mem accesses/program x miss
rate x miss penalty
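The simplified formulas translate directly into code. The sketch below uses made-up workload numbers purely for illustration; the function names are hypothetical.

```python
def memory_stall_cycles(mem_accesses, miss_rate, miss_penalty):
    """Mem accesses/program x miss rate x miss penalty."""
    return mem_accesses * miss_rate * miss_penalty


def cpu_time(execution_cycles, stall_cycles, clock_cycle_time):
    """(CPU execution clock cycles + memory stall clock cycles) x cycle time."""
    return (execution_cycles + stall_cycles) * clock_cycle_time


# Made-up numbers: 1,000,000 memory accesses, 5% miss rate, 40-cycle penalty,
# 3,000,000 execution cycles and a 1 ns clock cycle.
stalls = memory_stall_cycles(1_000_000, 0.05, 40)
seconds = cpu_time(3_000_000, stalls, 1e-9)
print(f"stall cycles: {stalls:.0f}, CPU time: {seconds * 1e3:.1f} ms")
```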
Example I
• Assume I-miss rate of 2% and D-miss rate of
4% (gcc)
• Assume CPI = 2 (without stalls) and miss
penalty of 40 cycles
• Assume 36% loads/stores
• What is the CPI with memory stalls?
• How much faster would a machine with
perfect cache run?
• What happens if the processor is made faster,
but the memory system stays the same (e.g.
reduce CPI to 1)?
Calculation I
• Instruction miss cycles = I x 100% x 2% x 40 = .80 x I
• Data miss cycles = I x 36% x 4% x 40 = .58 x I
• Total miss cycles = .80 x I + .58 x I = 1.38 x I
• CPI = 2 + 1.38 = 3.38
• Perf_perfect / Perf_stall = 3.38/2 = 1.69
• For a processor with base CPI = 1:
– CPI = 1 + 1.38 = 2.38 => Perf_perfect / Perf_stall = 2.38/1 = 2.38
• Time spent on stalls for slower processor: 1.38/3.38 = 41%
• Time spent on stalls for faster processor: 1.38/2.38 = 58%
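The calculation can be repeated for any base CPI with a short script. It is a sketch using the example's numbers; the function name is hypothetical, and the unrounded data stall term is 0.576, so results agree with the slide after rounding to two places.

```python
def cpi_with_stalls(base_cpi, i_miss_rate, d_miss_rate, mem_fraction, penalty):
    """Base CPI plus instruction and data miss cycles per instruction."""
    instruction_stalls = 1.0 * i_miss_rate * penalty       # every instruction is fetched
    data_stalls = mem_fraction * d_miss_rate * penalty     # only loads/stores access data
    return base_cpi + instruction_stalls + data_stalls


for base_cpi in (2, 1):
    cpi = cpi_with_stalls(base_cpi, 0.02, 0.04, 0.36, 40)
    speedup_if_perfect = cpi / base_cpi
    stall_fraction = (cpi - base_cpi) / cpi
    print(f"base CPI {base_cpi}: CPI {cpi:.2f}, "
          f"perfect cache is {speedup_if_perfect:.2f}x faster, "
          f"{stall_fraction:.0%} of time in stalls")
```

Lowering the base CPI leaves the stall cycles unchanged, so the share of time lost to memory stalls rises.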