Cache Memory

55:035
Computer Architecture and Organization
Lecture 7
Outline
 Cache Memory Introduction
 Memory Hierarchy
 Direct-Mapped Cache
 Set-Associative Cache
 Cache Sizes
 Cache Performance

Introduction
 Memory access time is important to performance!
 Users want large memories with fast access times, ideally unlimited fast memory.
 To use an analogy, think of a bookshelf containing many books:
Suppose you are writing a paper on birds. You go to the bookshelf, pull out some of the books on birds, and place them on the desk. As you start to look through them, you realize that you need more references, so you go back to the bookshelf, get more books on birds, and put them on the desk. Now, as you begin to write your paper, you have many of the references you need on the desk in front of you.
This is an example of the principle of locality: programs access a relatively small portion of their address space at any instant of time.
Levels of the Memory Hierarchy
(Figure: farther away from the CPU, each level has lower cost per bit, higher capacity, longer access time/latency, and lower throughput/bandwidth.)
 Registers, part of the on-chip CPU datapath: 16-128 ISA-visible registers
 Cache, one or more levels of static RAM: level 1 on-chip, 16-64 KB; level 2 on-chip, 256 KB-2 MB; level 3 on- or off-chip, 1 MB-16 MB
 Main memory, dynamic RAM (DRAM): 256 MB-16 GB
 Magnetic disk, with SCSI, RAID, IDE, or 1394 interface: 80 GB-300 GB
 Optical disk or magnetic tape
Memory Hierarchy Comparisons
Level           Capacity     Access Time             Cost/Bit                  Transfer (Staging) Unit    Managed By
CPU registers   100s bytes   <10 ns                  -                         instr. operands (1-8 B)    program/compiler
Cache           KBs          10-100 ns               1-0.1 cents/bit           blocks (8-128 B)           cache controller
Main memory     MBs          200-500 ns              10^-4 - 10^-5 cents/bit   pages (4-16 KB)            operating system
Disk            GBs          10 ms (10,000,000 ns)   10^-5 - 10^-6 cents/bit   files (MB)                 user/operator
Tape            "infinite"   sec-min                 10^-8 cents/bit           -                          -

Moving up the hierarchy, the transfer unit is faster; moving down, capacity and transfer units get larger.
Memory Hierarchy
 We can exploit the natural locality in programs by implementing the memory of a computer as a memory hierarchy:
 multiple levels of memory with different speeds and sizes;
 the fastest memories are more expensive and usually much smaller (see figure);
 the user has the illusion of a memory that is both large and fast.
 This is accomplished by using efficient methods of memory structure and organization.
Inventor of Cache
M. V. Wilkes, “Slave Memories and Dynamic Storage Allocation,”
IEEE Transactions on Electronic Computers, vol. EC-14, no. 2,
pp. 270-271, April 1965.
Cache
(Figure: the processor exchanges words with the cache, a small, fast memory; the cache exchanges blocks with main memory, which is large, inexpensive, and slow.)
 The processor does all memory operations through the cache.
 Hit: if the requested word is in the cache, the read or write operation is performed directly in the cache, without accessing main memory.
 Miss: if the requested word is not in the cache, a block of words containing the requested word is brought into the cache, and then the processor request is completed.
 Block: the minimum amount of data transferred between the cache and main memory.
The Locality Principle
 A program tends to access data that form a physical cluster in memory; multiple accesses may be made within the same block.
 Physical localities are temporal and may shift over longer periods of time: data not used for some time is less likely to be used in the future. Upon a miss, the least recently used (LRU) block can be overwritten by a new block.
P. J. Denning, "The Locality Principle," Communications of the ACM, vol. 48, no. 7, pp. 19-24, July 2005.
Temporal & Spatial Locality
 There are two types of locality:
 Temporal locality (locality in time): if an item is referenced, it will likely be referenced again soon. Data is reused.
 Spatial locality (locality in space): if an item is referenced, items at neighboring addresses will likely be referenced soon.
 Most programs contain natural locality in their structure. For example, most programs contain loops, in which instructions and data are accessed repeatedly: an example of temporal locality.
 Instructions are usually accessed sequentially, so they exhibit a high amount of spatial locality.
 Access to the elements of an array is another example of spatial locality.
Data Locality, Cache, Blocks
(Figure: the data needed by a program forms clusters, e.g., Block 1 and Block 2, in memory. Increase the block size to match the size of a locality, and increase the cache size to include most blocks.)
Basic Caching Concepts
 The memory system is organized as a hierarchy, with the level closest to the processor being a subset of any level further away, and all of the data stored at the lowest level (see figure).
 Data is copied between only two adjacent levels at any given time. The minimum unit of information contained in a two-level hierarchy is called a block or line (the highlighted square in the figure).
 If data requested by the user appears in some block in the upper level, it is known as a hit. If the data is not found in the upper level, it is known as a miss.
Basic Cache Organization
(Figure: the full byte address splits into tag, index, and offset fields. The index is decoded to select a row of the tag array and data array; the stored tag is compared with the address tag to produce the hit signal, and a multiplexer selects the addressed data word from the block.)
Direct-Mapped Cache
(Figure: each block of data needed by a program can be swapped in to exactly one cache location; the block it overwrites there is not necessarily the least recently used (LRU) one.)
Set-Associative Cache
(Figure: a needed block may be swapped in to any way of its set, so the least recently used (LRU) block can be chosen for replacement.)
Three Major Placement Schemes
(Figure: the same memory block placed under the direct-mapped, set-associative, and fully-associative schemes described on the following slides.)
Direct-Mapped Placement
 A block can go into only one place in the cache, determined by the block's address (in memory space).
 The index number for block placement is usually given by some low-order bits of the block's address. This can also be expressed as:
(Index) = (Block address) mod (Number of blocks in cache)
 Note that in a direct-mapped cache, block placement and replacement choices are both completely determined by the address of the new block that is to be accessed, as the sketch below illustrates.
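The mapping can be made concrete in a few lines of Python; a minimal sketch, assuming a word-addressed memory and the 8-block, 1-word-block cache used in the following examples:

    def direct_mapped_fields(block_address, num_blocks=8):
        """Split a block address into (tag, index) for a direct-mapped cache."""
        index = block_address % num_blocks      # low-order bits: the cache location
        tag = block_address // num_blocks       # remaining high-order bits
        return tag, index

    # Word address 29 = 0b11101 -> tag 0b11, index 0b101, as in the figures below.
    print(direct_mapped_fields(29))             # (3, 5)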
Direct-Mapped Cache
(Figure: a 32-word word-addressable memory mapped onto a cache of 8 blocks, block size = 1 word. The 5-bit memory address splits into a 2-bit tag and a 3-bit index, the cache-local address; e.g., memory address 11 101 has tag 11 and index 101.)
Direct-Mapped Cache
(Figure: the same 32-word word-addressable memory mapped onto a cache of 4 blocks, block size = 2 words. The 5-bit address splits into a 2-bit tag, a 2-bit index, and a 1-bit block offset; e.g., memory address 11 10 1 has tag 11, index 10, and block offset 1.)
Direct-Mapped Cache (Byte Address)
(Figure: a 32-word byte-addressable memory with 7-bit addresses mapped onto a cache of 8 blocks, block size = 1 word. The address splits into a 2-bit tag, a 3-bit index, and a 2-bit byte offset; e.g., memory address 11 101 00 has tag 11, index 101, and byte offset 00.)
Finding a Word in Cache
(Figure: for the 32-word byte-addressable memory, the address b6 b5 b4 b3 b2 b1 b0 splits into a 2-bit tag, a 3-bit index, and a 2-bit byte offset. The index selects one of the 8 cache entries (000-111), each holding a valid bit, a 2-bit tag, and one word of data; cache size is 8 words, block size = 1 word. The stored tag is compared with the address tag: 1 = hit, 0 = miss, and on a hit the data word is read out.)
Miss Rate of Direct-Mapped Cache
(Figure: in the 8-block direct-mapped cache, the needed block at address 11 101 00 can go only into entry 101; the block it overwrites there is not necessarily the least recently used (LRU) block in the cache.)
Miss Rate of Direct-Mapped Cache
(Example: memory references to addresses 0, 8, 0, 6, 8, 16. Addresses 0, 8, and 16 all map to index 000, so its tag cycles 00 → 01 → 00 → 10 and every reference is a miss: 6 misses in 6 accesses, even though most cache entries remain unused.)
Fully-Associative Cache (8-Way Set Associative)
(Figure: the same 32-word byte-addressable memory with a fully-associative cache of 8 blocks, block size = 1 word. There is no index: the cache address is the full 5-bit tag, e.g., 11101 for memory address 11101 00, and the needed block may be placed in any entry, with the LRU block replaced when the cache is full.)
Miss Rate: Fully-Associative Cache
(Example: the same references 0, 8, 0, 6, 8, 16. Blocks 00000, 01000, 00110, and 10000 can all reside in the cache at once, so the sequence gives miss, miss, hit, miss, hit, miss: 4 misses in 6 accesses.)
Finding a Word in Associative Cache
(Figure: for the 32-word byte-addressable memory, the address b6 b5 b4 b3 b2 b1 b0 now provides a 5-bit tag and a 2-bit byte offset; there is no index. Each of the 8 entries holds a valid bit, a 5-bit tag, and one word of data (cache size 8 words, block size = 1 word), and the address tag must be compared with all tags in the cache in parallel: 1 = hit, 0 = miss.)
Eight-Way Set-Associative Cache
(Figure: the 8-word cache as a single set of 8 ways, each way holding a valid bit, a 5-bit tag, and one data word (cache size 8 words, block size = 1 word). Eight comparators check all ways in parallel against the address tag, and an 8-to-1 multiplexer selects the data word of the matching way: 1 = hit, 0 = miss.)
Two-Way Set-Associative Cache
(Figure: the 8-block cache organized as 4 sets of 2 ways. The byte address 111 01 00 splits into a 3-bit tag, a 2-bit set index, and a 2-bit byte offset; the needed block maps to one set but may occupy either of its two ways, so the LRU way can be replaced.)
Miss Rate: Two-Way Set-Associative Cache
(Example: the same references 0, 8, 0, 6, 8, 16. Addresses 0 and 8 map to the same set but fit in its two ways, giving miss, miss, hit, miss, hit, miss: 4 misses in 6 accesses; the final reference to 16 evicts the LRU way of set 00. A simulator for all three organizations follows.)
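The three miss-rate examples above can be reproduced with a small LRU cache simulator; a minimal sketch in Python, assuming word addresses and the 8-block cache from the slides:

    from collections import OrderedDict

    def count_misses(refs, num_blocks=8, ways=1):
        """Simulate an LRU cache with num_blocks blocks grouped into ways-wide sets."""
        num_sets = num_blocks // ways
        sets = [OrderedDict() for _ in range(num_sets)]   # per set: tag -> None, in LRU order
        misses = 0
        for addr in refs:
            index, tag = addr % num_sets, addr // num_sets
            if tag in sets[index]:
                sets[index].move_to_end(tag)              # hit: mark most recently used
            else:
                misses += 1
                if len(sets[index]) == ways:              # set full: evict the LRU way
                    sets[index].popitem(last=False)
                sets[index][tag] = None
        return misses

    refs = [0, 8, 0, 6, 8, 16]                # the reference stream from the slides
    print(count_misses(refs, ways=1))         # direct-mapped: 6 misses
    print(count_misses(refs, ways=2))         # two-way set-associative: 4 misses
    print(count_misses(refs, ways=8))         # fully-associative: 4 misses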
Two-Way Set-Associative Cache
(Figure: the byte address b6 b5 b4 b3 b2 b1 b0 splits into a 3-bit tag (b6 b5 b4), a 2-bit index (b3 b2), and a 2-bit byte offset (b1 b0). The index selects one of 4 sets (00-11); each set's two ways hold a valid bit, a tag, and a data word (cache size 8 words, block size = 1 word). Two comparators check both ways in parallel, and a 2-to-1 multiplexer selects the matching way's data: 1 = hit, 0 = miss.)
Using Larger Cache Block (4 Words)
(Figure: for a 4 GB (1 G-word) byte-addressable memory and a 16K-word cache with 4-word blocks, the 32-bit address b31 ... b0 splits into a 16-bit tag, a 12-bit index (4K indexes, 0000 0000 0000 through 1111 1111 1111), a 2-bit block offset, and a 2-bit byte offset. Each entry holds a valid bit, a 16-bit tag, and 4 words (128 bits) of data; on a tag match, a multiplexer selects the addressed word from the block: 1 = hit, 0 = miss.)
Number of Tag and Index Bits
Consider a cache of w words and a main memory of W words.
 Each word in the cache has a unique index (local address), so the number of index bits = log2(w). Index bits are shared with the block offset when a block contains more than one word.
 Think of the main memory as W/w partitions of w words each, each partition identified by a tag: the number of tag bits = log2(W/w).
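A sketch of this bit-width bookkeeping in Python (the function name and the word-addressed model are illustrative assumptions):

    import math

    def cache_field_bits(cache_words, memory_words, block_words=1):
        """Tag, index, and offset widths for a direct-mapped, word-addressed cache."""
        index_bits = int(math.log2(cache_words // block_words))   # part of log2(w), rest is offset
        offset_bits = int(math.log2(block_words))                 # word-in-block offset
        tag_bits = int(math.log2(memory_words // cache_words))    # log2(W/w)
        return tag_bits, index_bits, offset_bits

    print(cache_field_bits(8, 32))      # 8-word cache, 32-word memory -> (2, 3, 0)
    print(cache_field_bits(8, 32, 2))   # same cache with 2-word blocks -> (2, 2, 1)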
How Many Bits Does Cache Have?
 Consider a main memory of 32 words; the byte address is 7 bits wide: b6 b5 b4 b3 b2 b1 b0. Each word is 32 bits wide.
 Assume that the cache block size is 1 word (32 bits of data) and the cache contains 8 blocks.
 The cache then requires, for each word, a 2-bit tag and one valid bit:
Total storage needed in cache
= #blocks in cache × (data bits/block + tag bits + valid bit)
= 8 × (32 + 2 + 1) = 280 bits
Physical storage / data storage = 280/256 = 1.094
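The same arithmetic as a sketch that also covers the two cache configurations on the next slides:

    def cache_storage_bits(num_blocks, block_words, tag_bits, word_bits=32):
        """Total cache storage: per block, data bits + tag bits + one valid bit."""
        return num_blocks * (block_words * word_bits + tag_bits + 1)

    print(cache_storage_bits(8, 1, 2))        # 280 bits (this slide)
    print(cache_storage_bits(2**14, 1, 16))   # 802816 bits = 784 Kb = 98 KB
    print(cache_storage_bits(2**12, 4, 16))   # 593920 bits = 580 Kb = 72.5 KB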
A More Realistic Cache
 Consider a 4 GB, byte-addressable main memory: 1 G words; the byte address is 32 bits wide: b31 ... b2 b1 b0. Each word is 32 bits wide.
 Assume that the cache block size is 1 word (32 bits of data) and the cache holds 64 KB of data, or 16K words, i.e., 16K blocks.
 Number of cache index bits = 14, because 16K = 2^14.
 Tag size = 32 - byte offset - #index bits = 32 - 2 - 14 = 16 bits.
 The cache requires, for each word, a 16-bit tag and one valid bit:
Total storage needed in cache
= #blocks in cache × (data bits/block + tag size + valid bit)
= 2^14 × (32 + 16 + 1) = 16 × 2^10 × 49 = 784 × 2^10 bits = 784 Kb = 98 KB
Physical storage / data storage = 98/64 = 1.53
 But the block size needs to be increased to match the size of a locality.
Cache Bits for 4-Word Block
 Consider the same 4 GB, byte-addressable main memory: 1 G words; the byte address is 32 bits wide. Each word is 32 bits wide.
 Assume that the cache block size is 4 words (128 bits of data) and the cache holds 64 KB of data, or 16K words, i.e., 4K blocks.
 Number of cache index bits = 12, because 4K = 2^12.
 Tag size = 32 - byte offset - #block offset bits - #index bits = 32 - 2 - 2 - 12 = 16 bits.
 The cache requires, for each block, a 16-bit tag and one valid bit:
Total storage needed in cache
= #blocks in cache × (data bits/block + tag size + valid bit)
= 2^12 × (4 × 32 + 16 + 1) = 4 × 2^10 × 145 = 580 × 2^10 bits = 580 Kb = 72.5 KB
Physical storage / data storage = 72.5/64 = 1.13
Cache Size Equation
 A simple equation for the size of a cache:
(Cache size) = (Block size) × (Number of sets) × (Set associativity)
 These terms relate to the sizes of the memory-address fields:
(Block size) = 2^(# of offset bits)
(Number of sets) = 2^(# of index bits)
(# of tag bits) = (# of memory address bits) - (# of index bits) - (# of offset bits)
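As a sketch (the example values below are illustrative, not from the slides):

    def cache_size_bytes(block_size_bytes, num_sets, associativity):
        """(Cache size) = (Block size) × (Number of sets) × (Set associativity)."""
        return block_size_bytes * num_sets * associativity

    print(cache_size_bytes(64, 1024, 2))   # 64 B blocks, 1024 sets, 2-way -> 131072 B = 128 KB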
Interleaved Memory
(Figure: the processor exchanges words with the cache; the cache exchanges blocks with a main memory organized as four banks, memory bank 0 through memory bank 3, which are read simultaneously.)
 Interleaving reduces the miss penalty: the memory is designed to read all words of a block simultaneously, one per bank, in one read operation.
 Example:
 Cache block size = 4 words; interleaved memory with 4 banks; a memory access takes ~15 cycles.
 Miss penalty = 1 cycle to send the address + 15 cycles to read a block + 4 cycles to send the data to cache = 20 cycles.
 Without interleaving, miss penalty = 1 + 4 × 15 + 4 = 65 cycles.
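The timing model from this example, sketched in Python (the one-cycle address and per-word transfer costs are as assumed on the slide):

    def miss_penalty(block_words, banks, access_cycles=15, transfer_cycles=1):
        """Cycles to fetch a block: send address + non-overlapped bank reads + word transfers."""
        rounds = -(-block_words // banks)          # ceiling division: reads that cannot overlap
        return 1 + rounds * access_cycles + block_words * transfer_cycles

    print(miss_penalty(4, banks=4))   # interleaved: 1 + 15 + 4 = 20 cycles
    print(miss_penalty(4, banks=1))   # no interleaving: 1 + 60 + 4 = 65 cycles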
Cache Design
 A level's design is described by four behaviors:
 Block placement: where can a new block be placed in the given level?
 Block identification: how is an existing block found, if it is in the level?
 Block replacement: which existing block should be replaced, if necessary?
 Write strategy: how are writes to the block handled?
Handling a Miss
 A miss occurs when data at the required memory address is not found in the cache.
 Controller actions:
 Stall the pipeline and freeze the contents of all registers.
 Activate a separate cache controller.
 If the cache is full, select the least recently used (LRU) block in the cache for overwriting; if the selected block has inconsistent data, take proper action (e.g., write it back).
 Copy the block containing the requested address from memory.
 Restart the instruction.
Miss During Instruction Fetch
 Send the original PC value (PC - 4) to the memory.
 Instruct main memory to perform a read and wait for the memory to complete the access.
 Write the cache entry.
 Restart the instruction whose fetch failed.
Writing to Memory
 Cache and memory become inconsistent when data is written into the cache but not to memory: the cache coherence problem.
 Strategies for handling inconsistent data:
 Write-through: always write to memory and cache simultaneously. A write to memory is ~100 times slower than a write to the (L1) cache.
 Write-back: write to the cache and mark the block "dirty"; the write to memory occurs later, when the dirty block is cast out of the cache to make room for another block.
Writing to Memory: Write-Back
 Write-back (or copy-back) writes only to the cache, but sets a "dirty bit" in the block where the write is performed.
 When a block with its dirty bit on is to be overwritten in the cache, it is first written back to memory.
 "Unnecessary" writes may occur with both write-through and write-back:
 write-through issues extra writes because each store instruction causes a transaction to memory (e.g., eight 32-bit transactions versus one 32-byte burst transaction for a cache line);
 write-back issues extra writes because unmodified words in a cache line are written back even though they haven't been changed.
 The penalty for write-through is much greater, so write-back is far more popular. The dirty-bit bookkeeping is sketched below.
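A minimal sketch of the dirty-bit mechanism, assuming a single hypothetical block (not any particular cache's implementation):

    class WriteBackBlock:
        """One cache block under a write-back policy."""
        def __init__(self):
            self.tag, self.data, self.dirty = None, None, False

        def write(self, tag, data):
            self.tag, self.data = tag, data
            self.dirty = True                    # the write goes only to the cache

        def evict(self, memory):
            if self.dirty:                       # cast-out: write back only if modified
                memory[self.tag] = self.data
            self.tag, self.data, self.dirty = None, None, False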
Cache Hierarchy
(Figure: processor → L1 cache (SRAM, access time T1) → L2 cache (DRAM, access time T2) → main memory (large, inexpensive, slow; access time Tm).)
Average access time = T1 + (1 - h1)[T2 + (1 - h2)Tm]
 Where
 T1 = L1 cache access time (smallest)
 T2 = L2 cache access time (small)
 Tm = main memory access time (large)
 h1, h2 = hit rates (0 ≤ h1, h2 ≤ 1)
 Adding a cache reduces the average access time.
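A sketch of this formula in Python, using the access times and hit rates assumed on the performance slides later in this lecture:

    def avg_access_time(t1, t2, tm, h1, h2):
        """Average access time = T1 + (1 - h1) * (T2 + (1 - h2) * Tm)."""
        return t1 + (1 - h1) * (t2 + (1 - h2) * tm)

    # T1 = 0.2 ns, T2 = 5 ns, Tm = 100 ns, h1 = 0.95, h2 = 0.90
    print(avg_access_time(0.2, 5.0, 100.0, 0.95, 0.90))   # 0.95 ns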
Average Access Time
(Figure: average access time T1 + (1 - h1)[T2 + (1 - h2)Tm], with T1 < T2 < Tm, plotted against the miss rate 1 - h1. The curve rises from T1 at h1 = 1 (miss rate 0) toward T1 + T2 + Tm at h1 = 0 (miss rate 1).)
Processor Performance Without Cache
 5 GHz processor, cycle time = 0.2 ns
 Memory access time = 100 ns = 500 cycles
 Ignoring memory access, clocks per instruction (CPI) = 1
 Assuming no memory data access:
CPI = 1 + # stall cycles = 1 + 500 = 501
Performance with Level 1 Cache
 Assume hit rate h1 = 0.95
 L1 access time = 0.2 ns = 1 cycle
 CPI = 1 + # stall cycles = 1 + 0.05 × 500 = 26
 Processor speed increase due to cache = 501/26 = 19.3
Performance with L1 and L2 Caches

Assume:



L1 hit rate, h1 = 0.95
L2 hit rate, h2 = 0.90 (this is very optimistic!)
L2 access time = 5ns = 25 cycles
CPI = 1 + # stall cycles
= 1 + 0.05 (25 + 0.10 x 500)
= 1 + 3.75 = 4.75
 Processor speed increase due to both caches
= 501/4.75
= 105.5
 Speed increase due to L2 cache
= 26/4.75
= 5.47

55:035 Computer Architecture and Organization
47
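The three CPI calculations above, as a Python sketch (instruction fetches only, as on the slides):

    def cpi(h1, l2_cycles=None, h2=0.0, mem_cycles=500):
        """CPI = 1 + stall cycles per instruction, for zero, one, or two cache levels."""
        if l2_cycles is None:
            return 1 + (1 - h1) * mem_cycles                       # miss goes straight to memory
        return 1 + (1 - h1) * (l2_cycles + (1 - h2) * mem_cycles)  # miss tries L2 first

    print(cpi(h1=0.0))                          # no cache: 501
    print(cpi(h1=0.95))                         # L1 only: 26.0
    print(cpi(h1=0.95, l2_cycles=25, h2=0.90))  # L1 + L2: 4.75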
Cache Miss Behavior
 If the tag bits do not match, then a miss occurs.
 Upon a cache miss:
 the CPU is stalled,
 the desired block of data is fetched from memory and placed in the cache,
 execution is restarted at the cycle that caused the cache miss.
 Recall that there are two different types of memory accesses: reads (loads) and writes (stores).
 Thus, overall we can have 4 kinds of cache events: read hits, read misses, write hits, and write misses.
Fully-Associative Placement
 One alternative to direct-mapped placement: allow a block to fill any empty place in the cache.
 How do we then locate the block later? Associate each stored block with a tag that identifies the block's home address in main memory.
 When the block is needed, the cache can be used as an associative memory: the tag is matched against all locations in parallel to pull out the appropriate block.
Set-Associative Placement
 The block address determines not a single location but a set: several locations grouped together.
(Set #) = (Block address) mod (# of sets)
 The block can be placed associatively anywhere within that set. (Exactly where is part of the placement strategy.)
 If there are n locations in each set, the scheme is called "n-way set-associative".
 Direct-mapped = 1-way set-associative.
 Fully associative = there is only one set.
Replacement Strategies
 Which existing block do we replace when a new block comes in?
 With a direct-mapped cache, there is only one choice (the same as placement).
 With a (fully- or set-) associative cache:
 if any "way" in the set is empty, pick one of those;
 otherwise, there are many possible strategies:
 (pseudo-)random: simple, fast, and fairly effective;
 (pseudo-)least-recently-used (LRU).
 The choice makes little difference in L2 (and higher) caches.
Write Strategies
 Most accesses are reads, not writes, especially if instruction reads are included.
 Optimize for reads! A direct-mapped cache can return the value before the valid check completes.
 Writes are more difficult because:
 we can't write to the cache until we know the right block;
 the object written may have various sizes (1-8 bytes).
 When should the cache be synchronized with memory?
 Write-through: write to cache and to memory. Prone to stalls due to high memory bandwidth requirements.
 Write-back: write to memory upon replacement. Memory may be left out of date for a long time.
Action on Cache Hits vs. Misses
 Read hits: desirable; no special action is needed.
 Read misses: stall the CPU, fetch the block from memory, deliver it to the cache, restart.
 Write hits:
 write-through: replace the data in cache and memory at the same time;
 write-back: write the data only into the cache; it is written to main memory only when it is replaced.
 Write misses:
 no-write-allocate: write the data to memory only;
 write-allocate: read the entire block into the cache, then write the word.
Cache Hits vs. Cache Misses
 Consider the write-through strategy: every block written to cache is automatically written to memory.
 Pro: simple; memory is always up to date with the cache, so no write-back is required on block replacement.
 Con: creates lots of extra traffic on the memory bus, and write hit time may be increased if the CPU must wait for the bus.
 One solution to the write-time problem is a write buffer that stores the data while it is waiting to be written to memory. After storing the data in the cache and write buffer, the processor can continue execution.
 Alternatively, a write-back strategy writes data to main memory only when a block is replaced.
 Pro: reduces the memory bandwidth used by writes.
 Con: complicates multiprocessor systems.
Hit/Miss Rate, Hit Time, Miss Penalty
 The hit rate (or hit ratio) is the fraction of memory accesses found in the upper level.
 The miss rate (= 1 - hit rate) is the fraction of memory accesses not found in the upper level.
 The hit time is the time to access the upper level of the memory hierarchy, which includes the time needed to determine whether the access is a hit or a miss.
 The miss penalty is the time needed to replace a block in the upper level with the corresponding block from the lower level; it may include the time to write back an evicted block.
Cache Performance Analysis
 Performance is always a key issue for caches.
 We consider improving cache performance by (1) reducing the miss rate and (2) reducing the miss penalty.
 For (1), we can reduce the probability that different memory blocks will contend for the same cache location.
 For (2), we can add additional levels to the hierarchy, which is called multilevel caching.
 We can determine the CPU time as
CPU time = (CPU-execution clock cycles + Memory-stall clock cycles) × Clock cycle time
Cache Performance
 The memory-stall clock cycles come from cache misses. They can be defined as the sum of the stall cycles coming from reads and those coming from writes:
Memory-stall clock cycles = Read-stall cycles + Write-stall cycles
Read-stall cycles = (Reads / Program) × Read miss rate × Read miss penalty
Write-stall cycles = (Writes / Program) × Write miss rate × Write miss penalty + Write-buffer stalls
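A sketch of these formulas in Python, per program run (the write-buffer-stall term is taken as zero here for simplicity):

    def memory_stall_cycles(reads, read_miss_rate, read_miss_penalty,
                            writes, write_miss_rate, write_miss_penalty,
                            write_buffer_stalls=0):
        """Memory-stall clock cycles = read stalls + write stalls."""
        read_stalls = reads * read_miss_rate * read_miss_penalty
        write_stalls = writes * write_miss_rate * write_miss_penalty + write_buffer_stalls
        return read_stalls + write_stalls

    # Illustrative numbers only: 1M reads and 0.5M writes, 5% miss rates, 100-cycle penalties.
    print(memory_stall_cycles(1_000_000, 0.05, 100, 500_000, 0.05, 100))  # 7500000.0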
Cache Performance Formulas
 Useful formulas for analyzing ISA/cache interactions:
(CPU time) = [(CPU cycles) + (Memory stall cycles)] × (Clock cycle time)
(Memory stall cycles) = (Instruction count) × (Accesses per instruction) × (Miss rate) × (Miss penalty)
 But these are not the best measures for cache design by themselves:
 They focus on time per program, not per access; accesses per program aren't up to the cache design, so we can limit our attention to individual accesses.
 They neglect the hit penalty: cache design may affect the number of cycles taken even by a cache hit.
 They neglect cycle length, which may be impacted by a poor cache design.
More Cache Performance Metrics
 Access time can be split between instructions and data:
Avg. memory access time = (% instruction accesses) × (instruction memory access time) + (% data accesses) × (data memory access time)
 Another simple formula, useful for exploring ISA changes:
CPU time = (CPU execution clock cycles + Memory stall clock cycles) × cycle time
 Stalls can be broken into reads and writes:
Memory stall cycles = (Reads × read miss rate × read miss penalty) + (Writes × write miss rate × write miss penalty)
Factoring out Instruction Count
 Lumping reads and writes together gives:
CPU time = IC × Clock cycle time × [ CPI_exec + (Accesses/Instruction) × (Miss rate) × (Miss penalty) ]
 We may replace:
(Accesses/Instruction) × (Miss rate) = (Misses/Instruction)
so that miss rates aren't affected by redundant accesses to the same location within an instruction.
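The factored formula as a sketch (the inputs below are illustrative, not measurements from the lecture):

    def cpu_time(ic, clock_cycle_time, cpi_exec, misses_per_instr, miss_penalty):
        """CPU time = IC × t_cc × (CPI_exec + misses/instruction × miss penalty)."""
        return ic * clock_cycle_time * (cpi_exec + misses_per_instr * miss_penalty)

    # 1e9 instructions, 0.2 ns cycle, CPI_exec = 1, 0.02 misses/instruction, 500-cycle penalty
    print(cpu_time(1e9, 0.2e-9, 1.0, 0.02, 500))   # ≈ 2.2 seconds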
Improving Cache Performance
 Consider the cache performance equation:
(Average memory access time) = (Hit time) + (Miss rate) × (Miss penalty)
where the product (Miss rate) × (Miss penalty) is the "amortized miss penalty".
 It follows that there are three basic ways to improve cache performance:
 A. reducing the miss rate (reduces the amortized miss penalty),
 B. reducing the miss penalty (reduces the amortized miss penalty),
 C. reducing the hit time.
 Note that by Amdahl's Law, there will be diminishing returns from reducing only the hit time or only the amortized miss penalty by itself, instead of both together.
AMD Opteron Microprocessor
(Figure: the AMD Opteron microprocessor. L1: split instruction and data caches, 64 KB each, 64-byte blocks, write-back. L2: 1 MB, 64-byte blocks, write-back.)