Memory Hierarchy - Memory & Storage Architecture Lab

Technology Trends
Year    Capacity   $/GB
1980    64 Kbit    $1,500,000
1983    256 Kbit   $500,000
1985    1 Mbit     $200,000
1989    4 Mbit     $50,000
1992    16 Mbit    $15,000
1996    64 Mbit    $10,000
1998    128 Mbit   $4,000
2000    256 Mbit   $1,000
2004    512 Mbit   $250
2007    1 Gbit     $50

[Chart: DRAM access times Trac and Tcac (row and column access), 0-300 ns, over the years '80-'07]
Memory Hierarchy
"Ideally one would desire an indefinitely large memory capacity such that any particular … word would be immediately available … We are … forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible."
(Burks, Goldstine, and von Neumann, 1946)
[Figure: levels in the memory hierarchy, from the CPU and Level 1 down to Level n; speed and bandwidth increase toward the CPU, cost decreases toward the lower levels, and the size of the memory grows at each lower level]
Memory Technology (Big Picture)
[Figure: a processor (control + datapath) backed by successive levels of memory. Speed: fastest nearest the processor, slowest farthest away; Size: smallest to biggest; Cost: highest to lowest]
Memory Technology (Real-world Realization)
[Figure: processor with on-chip registers and caches, backed by off-chip SRAM caches, DRAM main memory, and disk]

Level                              Speed      Size    Management
Register (on-chip)                 <1 ns      100 B   Compiler
Cache (on-/off-chip SRAM)          <5 ns      KB→MB   Hardware
Main memory (DRAM)                 50~70 ns   MB→GB   OS
Secondary storage (disk)           5~20 ms    GB→TB   OS
Memory Hierarchy

• An optimization resulting from a perfect match between memory technology and two types of program locality
  − Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
  − Spatial locality (locality in space): if an item is referenced, items whose addresses are close by will tend to be referenced soon.
• Goal: to provide a "virtual" memory technology (an illusion) that has the access time of the highest-level memory with the size and cost of the lowest-level memory
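As a concrete illustration (a minimal Python sketch of my own, not from the slides), the classic array-traversal loop exhibits both localities:

    # Spatial locality: consecutive iterations touch adjacent elements,
    # so one cache block fetched on a miss serves several later accesses.
    # Temporal locality: `total` (and the loop bookkeeping) is reused on
    # every iteration, so it stays in the fastest levels of the hierarchy.
    data = list(range(1024))
    total = 0
    for x in data:
        total += x
    print(total)  # 523776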
Temporal and Spatial Localities
[Figure: memory reference trace over time, illustrating temporal and spatial localities. Source: Glass & Cao (1997 ACM SIGMETRICS)]
Memory Hierarchy Terminology

• Hit: accessed data is found in the upper level
  − Hit rate = fraction of accesses found in the upper level
  − Hit time = time to access the upper level
• Miss: accessed data is found only in the lower level
  − Miss rate = 1 − (hit rate)
  − Miss penalty = time to get the block from the lower level + time to replace it in the upper level
  − The processor waits until the data is fetched from the next level, then restarts/continues the access
• Data are transferred between levels in units of blocks
• Hit time << miss penalty, so average memory access time << worst-case access time
  − Average memory access time = hit time + miss rate × miss penalty
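A worked example of the formula (the cycle counts below are illustrative assumptions, not values from the slides):

    # Average memory access time = hit time + miss rate * miss penalty
    hit_time = 1          # cycles to access the upper level
    miss_rate = 0.05      # fraction of accesses that miss
    miss_penalty = 100    # cycles to fetch and replace a block

    amat = hit_time + miss_rate * miss_penalty
    print(amat)           # 6.0 cycles, far below the 101-cycle worst case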
(CPU) Cache



• Upper level: SRAM (small, fast, expensive); lower level: DRAM (large, slow, cheap)
• Goal: to provide a "virtual" memory technology that has the access time of SRAM with the size and cost of DRAM
• Additional benefits
  − Reduced memory bandwidth consumed by the processor → more memory bandwidth for I/O
  − No need to change the ISA
Direct-mapped Cache


• Each memory block is mapped to a single cache block
• The mapped cache block is determined by (memory block address) mod (number of cache blocks)
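A one-function Python sketch of this mapping rule (the parameters match the 4 KB, 4-byte-block cache used in the following example; the function name is mine):

    BLOCK_SIZE = 4        # bytes per block
    NUM_BLOCKS = 1024     # 4 KB / 4 B

    def cache_index(addr):
        # memory block address mod number of cache blocks
        return (addr // BLOCK_SIZE) % NUM_BLOCKS

    print(cache_index(0), cache_index(8188), cache_index(16384))
    # 0 1023 0  -- addresses 0 and 16384 conflict at the same index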
Direct-Mapped Cache Example

• Consider a direct-mapped cache with a block size of 4 bytes and a total capacity of 4 KB (assume one word per block)
  − The 2 lowest address bits specify the byte within a block
  − The next 10 address bits specify the block's index within the cache
  − The 20 highest address bits are the unique tag for this memory block
  − The valid bit specifies whether the block is an accurate copy of memory
• Exploits temporal locality
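The field widths above can be checked with a small sketch (my own helper, assuming 32-bit addresses):

    def split_address(addr):
        offset = addr & 0x3           # 2 lowest bits: byte within block
        index = (addr >> 2) & 0x3FF   # next 10 bits: cache index
        tag = addr >> 12              # 20 highest bits: tag
        return tag, index, offset

    for a in (0, 4, 8188, 16384):
        print(a, split_address(a))
    # 0 (0, 0, 0); 4 (0, 1, 0); 8188 (1, 1023, 0); 16384 (4, 0, 0)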
On cache read


• On a cache hit, the CPU proceeds normally
• On a cache miss (handled completely by hardware)
  − Stall the CPU pipeline
  − Fetch the missed block from the next level of the hierarchy
  − Instruction cache miss: restart the instruction fetch
  − Data cache miss: complete the data access
On cache write

• Write-through
  − Always write the data into both the cache and main memory
  − Simple, but slow, and it increases memory traffic (requires a write buffer)
• Write-back
  − Write the data into the cache only, and update main memory when a dirty block is replaced (requires a dirty bit and possibly a write buffer)
  − Fast, but complex to implement, and it causes a consistency problem
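The difference can be sketched in a few lines of Python (a toy model of my own; a real cache tracks dirty bits per block in hardware):

    memory = {}      # backing store: address -> value
    cache = {}       # cached copies: address -> value
    dirty = set()    # write-back bookkeeping

    def write_through(addr, value):
        cache[addr] = value
        memory[addr] = value        # every store also updates memory

    def write_back(addr, value):
        cache[addr] = value         # memory is left stale...
        dirty.add(addr)

    def evict(addr):
        if addr in dirty:           # ...until the dirty block is replaced
            memory[addr] = cache[addr]
            dirty.discard(addr)
        cache.pop(addr, None)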
Write allocation


• What should happen on a write miss?
• Alternatives for write-through
  − Allocate on miss: fetch the block
  − Write around: don't fetch the block, since programs often write a whole block before reading it (e.g., initialization)
• For write-back
  − Usually fetch the block
Memory Reference Sequence

• Look at the following sequence of memory references for the previous direct-mapped cache:
  0, 4, 8188, 0, 16384, 0

Index  Valid  Tag   Data
0      0      XXXX  XXXX
1      0      XXXX  XXXX
2      0      XXXX  XXXX
3      0      XXXX  XXXX
…      …      …     …
1021   0      XXXX  XXXX
1022   0      XXXX  XXXX
1023   0      XXXX  XXXX

Cache initially empty.
After Reference 1

• Sequence: 0, 4, 8188, 0, 16384, 0; reference 1 is to address 0
• Address = 00000000000000000000 0000000000 00 (tag | index | byte offset)

Index  Valid  Tag                   Data
0      1      00000000000000000000  Memory bytes 0…3 (copy)
1      0      XXXX                  XXXX
2      0      XXXX                  XXXX
3      0      XXXX                  XXXX
…      …      …                     …
1021   0      XXXX                  XXXX
1022   0      XXXX                  XXXX
1023   0      XXXX                  XXXX

Cache miss; place the block at index 0.
After Reference 2

• Sequence: 0, 4, 8188, 0, 16384, 0; reference 2 is to address 4
• Address = 00000000000000000000 0000000001 00 (tag | index | byte offset)

Index  Valid  Tag                   Data
0      1      00000000000000000000  Memory bytes 0…3 (copy)
1      1      00000000000000000000  Memory bytes 4…7 (copy)
2      0      XXXX                  XXXX
3      0      XXXX                  XXXX
…      …      …                     …
1021   0      XXXX                  XXXX
1022   0      XXXX                  XXXX
1023   0      XXXX                  XXXX

Cache miss; place the block at index 1.
After Reference 3

• Sequence: 0, 4, 8188, 0, 16384, 0; reference 3 is to address 8188
• Address = 00000000000000000001 1111111111 00 (tag | index | byte offset)

Index  Valid  Tag                   Data
0      1      00000000000000000000  Memory bytes 0…3 (copy)
1      1      00000000000000000000  Memory bytes 4…7 (copy)
2      0      XXXX                  XXXX
3      0      XXXX                  XXXX
…      …      …                     …
1021   0      XXXX                  XXXX
1022   0      XXXX                  XXXX
1023   1      00000000000000000001  Memory bytes 8188…8191 (copy)

Cache miss; place the block at index 1023.
After Reference 4

• Sequence: 0, 4, 8188, 0, 16384, 0; reference 4 is to address 0
• Address = 00000000000000000000 0000000000 00 (tag | index | byte offset)

Index  Valid  Tag                   Data
0      1      00000000000000000000  Memory bytes 0…3 (copy)
1      1      00000000000000000000  Memory bytes 4…7 (copy)
2      0      XXXX                  XXXX
3      0      XXXX                  XXXX
…      …      …                     …
1021   0      XXXX                  XXXX
1022   0      XXXX                  XXXX
1023   1      00000000000000000001  Memory bytes 8188…8191 (copy)

Cache hit to the block at index 0.
After Reference 5

• Sequence: 0, 4, 8188, 0, 16384, 0; reference 5 is to address 16384
• Address = 00000000000000000100 0000000000 00 (same index as address 0!)

Index  Valid  Tag                   Data
0      1      00000000000000000100  Memory bytes 16384…16387 (copy)
1      1      00000000000000000000  Memory bytes 4…7 (copy)
2      0      XXXX                  XXXX
3      0      XXXX                  XXXX
…      …      …                     …
1021   0      XXXX                  XXXX
1022   0      XXXX                  XXXX
1023   1      00000000000000000001  Memory bytes 8188…8191 (copy)

Cache miss; replace the block at index 0.
After Reference 6

• Sequence: 0, 4, 8188, 0, 16384, 0; reference 6 is to address 0
• Address = 00000000000000000000 0000000000 00 (same index again!)

Index  Valid  Tag                   Data
0      1      00000000000000000000  Memory bytes 0…3 (copy)
1      1      00000000000000000000  Memory bytes 4…7 (copy)
2      0      XXXX                  XXXX
3      0      XXXX                  XXXX
…      …      …                     …
1021   0      XXXX                  XXXX
1022   0      XXXX                  XXXX
1023   1      00000000000000000001  Memory bytes 8188…8191 (copy)

Cache miss; replace the block at index 0.
Total of 1 hit and 5 misses.
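The whole trace can be replayed with a short simulator (a Python sketch under the slides' parameters: 4 KB capacity, 4-byte blocks, 1024 blocks, direct-mapped); it reproduces the 1 hit and 5 misses:

    BLOCK_SIZE, NUM_BLOCKS = 4, 1024
    tags = [None] * NUM_BLOCKS      # None doubles as a cleared valid bit

    def access(addr):
        index = (addr // BLOCK_SIZE) % NUM_BLOCKS
        tag = addr // (BLOCK_SIZE * NUM_BLOCKS)
        if tags[index] == tag:
            return "hit"
        tags[index] = tag           # fetch the block, replacing the old one
        return "miss"

    print([access(a) for a in [0, 4, 8188, 0, 16384, 0]])
    # ['miss', 'miss', 'miss', 'hit', 'miss', 'miss']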
Exploiting Spatial Locality: Block Size Larger Than One Word
[Figure: a 16 KB direct-mapped cache with 256 blocks of 64 bytes (16 words) each]
Miss Rate vs. Block Size
[Figure: miss rate as a function of block size]
Set-Associative Caches

• Allow multiple entries per index to improve hit rates
  − An n-way set-associative cache allows up to n conflicting references to be cached
    · n is the number of cache blocks in each set
    · n comparisons are needed to search all blocks in the set in parallel
    · When there is a conflict, which block is replaced? (This was easy for direct-mapped caches: there's only one entry!)
  − Fully associative caches: a single (very large!) set allows a memory location to be placed in any cache block
  − Direct-mapped caches are essentially 1-way set-associative caches
• For a fixed cache capacity, higher associativity leads to higher hit rates
  − Because more combinations of memory blocks can be present in the cache
• Set associativity optimizes cache contents, but at what cost?
Cache Organization Spectrum
[Figure: the cache organization spectrum, from direct-mapped through n-way set-associative to fully associative]
Implementation of Set Associative Cache
[Figure: set-associative cache implementation, with parallel tag comparison across the ways and a multiplexor selecting the matching block's data]
Cache Organization Example
[Figure: an eight-block cache organized four ways:
  One-way set associative (direct mapped): 8 blocks (0-7), one tag/data entry each
  Two-way set associative: 4 sets (0-3), two tag/data entries per set
  Four-way set associative: 2 sets (0-1), four tag/data entries per set
  Eight-way set associative (fully associative): 1 set with eight tag/data entries]
Cache Block Replacement Policy

• Direct-mapped caches
  − No replacement policy is needed, since each memory block can be placed in only one cache block
• n-way set-associative caches
  − Each memory block can be placed in any of the n cache blocks in the mapped set
  − Least Recently Used (LRU) replacement is typically used to select the block to be replaced within the mapped set
  − LRU replaces the block that has not been used for the longest time
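A minimal sketch of LRU bookkeeping for a single set (Python's OrderedDict keeps the recency order; the class and method names are mine):

    from collections import OrderedDict

    class LRUSet:
        def __init__(self, ways):
            self.ways = ways
            self.blocks = OrderedDict()          # tag -> block data

        def access(self, tag):
            if tag in self.blocks:               # hit: mark as most recent
                self.blocks.move_to_end(tag)
                return True
            if len(self.blocks) == self.ways:    # set full: evict LRU block
                self.blocks.popitem(last=False)
            self.blocks[tag] = None              # place the fetched block
            return False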
Miss Rate vs. Set Associativity
[Figure: miss rate as a function of set associativity]
Memory Reference Sequence

• Look again at the same sequence of memory references for a 2-way set-associative cache with a block size of two words (8 bytes):
  0, 4, 8188, 0, 16384, 0
• This sequence had 5 misses and 1 hit for the direct-mapped cache with the same capacity

Set   Way  Valid  Tag   Data
0     0    0      XXXX  XXXX
0     1    0      XXXX  XXXX
1     0    0      XXXX  XXXX
1     1    0      XXXX  XXXX
…     …    …      …     …
255   0    0      XXXX  XXXX
255   1    0      XXXX  XXXX

Cache initially empty.
After Reference 1

• Sequence: 0, 4, 8188, 0, 16384, 0; reference 1 is to address 0
• Address = 000000000000000000000 00000000 000 (tag | set index | byte offset)

Set   Way  Valid  Tag                    Data
0     0    1      000000000000000000000  Memory bytes 0…7 (copy)
0     1    0      XXXX                   XXXX
1     0    0      XXXX                   XXXX
1     1    0      XXXX                   XXXX
…     …    …      …                      …
255   0    0      XXXX                   XXXX
255   1    0      XXXX                   XXXX

Cache miss; place in the first block of set 0.
After Reference 2

• Sequence: 0, 4, 8188, 0, 16384, 0; reference 2 is to address 4
• Address = 000000000000000000000 00000000 100 (tag | set index | byte offset)

Set   Way  Valid  Tag                    Data
0     0    1      000000000000000000000  Memory bytes 0…7 (copy)
0     1    0      XXXX                   XXXX
1     0    0      XXXX                   XXXX
1     1    0      XXXX                   XXXX
…     …    …      …                      …
255   0    0      XXXX                   XXXX
255   1    0      XXXX                   XXXX

Cache hit to the first block of set 0.
After Reference 3

• Sequence: 0, 4, 8188, 0, 16384, 0; reference 3 is to address 8188
• Address = 000000000000000000011 11111111 100 (tag | set index | byte offset)

Set   Way  Valid  Tag                    Data
0     0    1      000000000000000000000  Memory bytes 0…7 (copy)
0     1    0      XXXX                   XXXX
1     0    0      XXXX                   XXXX
1     1    0      XXXX                   XXXX
…     …    …      …                      …
255   0    1      000000000000000000011  Memory bytes 8184…8191 (copy)
255   1    0      XXXX                   XXXX

Cache miss; place in the first block of set 255.
After Reference 4

• Sequence: 0, 4, 8188, 0, 16384, 0; reference 4 is to address 0
• Address = 000000000000000000000 00000000 000 (tag | set index | byte offset)

Set   Way  Valid  Tag                    Data
0     0    1      000000000000000000000  Memory bytes 0…7 (copy)
0     1    0      XXXX                   XXXX
1     0    0      XXXX                   XXXX
1     1    0      XXXX                   XXXX
…     …    …      …                      …
255   0    1      000000000000000000011  Memory bytes 8184…8191 (copy)
255   1    0      XXXX                   XXXX

Cache hit to the first block of set 0.
After Reference 5

• Sequence: 0, 4, 8188, 0, 16384, 0; reference 5 is to address 16384
• Address = 000000000000000001000 00000000 000 (same set index as address 0!)

Set   Way  Valid  Tag                    Data
0     0    1      000000000000000000000  Memory bytes 0…7 (copy)
0     1    1      000000000000000001000  Memory bytes 16384…16391 (copy)
1     0    0      XXXX                   XXXX
1     1    0      XXXX                   XXXX
…     …    …      …                      …
255   0    1      000000000000000000011  Memory bytes 8184…8191 (copy)
255   1    0      XXXX                   XXXX

Cache miss; place in the second block of set 0.
After Reference 6

• Sequence: 0, 4, 8188, 0, 16384, 0; reference 6 is to address 0
• Address = 000000000000000000000 00000000 000 (tag | set index | byte offset)

Set   Way  Valid  Tag                    Data
0     0    1      000000000000000000000  Memory bytes 0…7 (copy)
0     1    1      000000000000000001000  Memory bytes 16384…16391 (copy)
1     0    0      XXXX                   XXXX
1     1    0      XXXX                   XXXX
…     …    …      …                      …
255   0    1      000000000000000000011  Memory bytes 8184…8191 (copy)
255   1    0      XXXX                   XXXX

Cache hit to the first block of set 0.
Total of 3 hits and 3 misses.
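Replaying the trace with a small simulator (a Python sketch of a 2-way LRU cache with the slides' parameters: 256 sets, 8-byte blocks, same 4 KB capacity) confirms the count:

    BLOCK_SIZE, NUM_SETS, WAYS = 8, 256, 2
    sets = [[] for _ in range(NUM_SETS)]   # per set: list of tags, LRU first

    def access(addr):
        index = (addr // BLOCK_SIZE) % NUM_SETS
        tag = addr // (BLOCK_SIZE * NUM_SETS)
        ways = sets[index]
        if tag in ways:
            ways.remove(tag)
            ways.append(tag)               # move to most-recently-used end
            return "hit"
        if len(ways) == WAYS:
            ways.pop(0)                    # evict the LRU block
        ways.append(tag)
        return "miss"

    print([access(a) for a in [0, 4, 8188, 0, 16384, 0]])
    # ['miss', 'hit', 'miss', 'hit', 'miss', 'hit']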
Improving Cache Performance

• Cache performance is determined by
  Average memory access time = hit time + (miss rate × miss penalty)
• Decrease hit time
  − Make the cache smaller, but the miss rate increases
  − Use direct mapping, but the miss rate increases
• Decrease miss rate
  − Make the cache larger, but the hit time can increase
  − Add associativity, but the hit time can increase
  − Increase the block size, but the miss penalty increases
• Decrease miss penalty
  − Reduce the transfer-time component of the miss penalty
  − Add another level of cache
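For the last point, extending the formula shows why a second level helps (the cycle counts and miss rates below are illustrative assumptions):

    l1_hit, l1_miss_rate = 1, 0.05      # L1 hit time; L1 miss rate
    l2_hit, l2_miss_rate = 10, 0.25     # L2 hit time; L2 local miss rate
    mem_penalty = 100                   # cycles to reach main memory

    # One level: every L1 miss pays the full memory penalty.
    print(l1_hit + l1_miss_rate * mem_penalty)                            # 6.0
    # Two levels: an L1 miss usually stops at L2; only L2 misses go to memory.
    print(l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_penalty))  # 2.75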
Current Cache Organizations
Intel Nehalem
  − L1 caches (per core): I-cache 32 KB, 64-byte blocks, 4-way, approx. LRU replacement, hit time n/a; D-cache 32 KB, 64-byte blocks, 8-way, approx. LRU replacement, write-back/allocate, hit time n/a
  − L2 unified cache (per core): 256 KB, 64-byte blocks, 8-way, approx. LRU replacement, write-back/allocate, hit time n/a
  − L3 unified cache (shared): 8 MB, 64-byte blocks, 16-way, replacement n/a, write-back/allocate, hit time n/a

AMD Opteron X4
  − L1 caches (per core): I-cache 32 KB, 64-byte blocks, 2-way, LRU replacement, hit time 3 cycles; D-cache 32 KB, 64-byte blocks, 2-way, LRU replacement, write-back/allocate, hit time 9 cycles
  − L2 unified cache (per core): 512 KB, 64-byte blocks, 16-way, approx. LRU replacement, write-back/allocate, hit time n/a
  − L3 unified cache (shared): 2 MB, 64-byte blocks, 32-way, replace the block shared by the fewest cores, write-back/allocate, hit time 32 cycles

(n/a: data not available)
Cache Coherence Problem

• Suppose two CPU cores share a physical address space
  − Write-through caches

Time step  Event                CPU A's cache  CPU B's cache  Memory
0                                                             0
1          CPU A reads X        0                             0
2          CPU B reads X        0              0              0
3          CPU A writes 1 to X  1              0              1
Snoopy Protocols

• Write-invalidate protocol
  − Write to shared data: an invalidate is sent to all caches, which snoop and invalidate any copies
• Write-broadcast protocol
  − Write to shared data: the write is broadcast on the bus; processors snoop and update their copies
• Write serialization: the bus serializes requests
  − The bus is a single point of arbitration
Write-invalidate Protocol
• A cache gets exclusive access to a block when the block is to be written
  − It broadcasts an invalidate message on the bus
  − A subsequent read in another cache misses
    · The owning cache supplies the updated value

CPU activity         Bus activity      CPU A's cache  CPU B's cache  Memory
                                                                     0
CPU A reads X        Cache miss for X  0                             0
CPU B reads X        Cache miss for X  0              0              0
CPU A writes 1 to X  Invalidate for X  1                             0
CPU B reads X        Cache miss for X  1              1              1
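The table can be replayed with a toy model (a Python sketch of my own; a real protocol operates on cache blocks with per-block state machines, not single variables):

    memory = {"X": 0}
    caches = {"A": {}, "B": {}}            # per-core private caches

    def read(cpu, var):
        if var not in caches[cpu]:         # cache miss for var
            owner = next((c for c in caches if var in caches[c]), None)
            value = caches[owner][var] if owner is not None else memory[var]
            memory[var] = value            # the owning cache supplies the value
            caches[cpu][var] = value
        return caches[cpu][var]

    def write(cpu, var, value):
        for other in caches:               # broadcast invalidate on the bus
            if other != cpu:
                caches[other].pop(var, None)
        caches[cpu][var] = value           # now the exclusive, updated copy

    read("A", "X"); read("B", "X")
    write("A", "X", 1)
    print(read("B", "X"), memory["X"])     # 1 1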
Summary

• Memory hierarchies are an optimization resulting from a perfect match between memory technology and two types of program locality
  − Temporal locality
  − Spatial locality
• The goal is to provide a "virtual" memory technology (an illusion) that has the access time of the highest-level memory with the size and cost of the lowest-level memory
• Cache memory is an instance of a memory hierarchy
  − It exploits both temporal and spatial localities
  − Direct-mapped caches are simple and fast but have higher miss rates
  − Set-associative caches have lower miss rates but are more complex and slower
  − Multilevel caches are becoming increasingly popular
  − Cache coherence protocols ensure consistency among multiple caches