CDA 5155 Caches

New Topic: Memory Systems
24. Cache 101 – review of undergraduate material
25. Associativity and other organization issues
26. Advanced designs and interactions with
pipelines
27. Tomorrow’s cache design
(power/performance)
28. Advances in memory design
29. Virtual memory (and how to do it fast)
Cache Design 101
Memory pyramid (smallest and fastest at the top):
• Registers: 100s of bytes, 1-cycle access (early in the pipeline)
• L1 cache: several KB, 1-3 cycle access
• L2 cache: ½-32 MB, 6-15 cycle access
• Main memory: 128 MB to a few GB, 50-300 cycle access
• Disk: many GB, millions of cycles per access!
Terminology
• Temporal locality:
– If memory location X is accessed, then it is more likely to be re-accessed in the near future than some random location Y.
– Caches exploit temporal locality by placing a memory element that has been referenced into the cache.
• Spatial locality:
– If memory location X is accessed, then locations near X are more likely to be accessed in the near future than some random location Y.
– Caches exploit spatial locality by allocating a whole cache line of data (including data near the referenced location).
A simple cache
Cache organization: 2 cache lines, each with a valid bit (V), a 3-bit tag field, and a 2-byte data block.
Access sequence (processor):
Ld R1 ← M[1]
Ld R2 ← M[5]
Ld R3 ← M[1]
Ld R3 ← M[4]
Ld R2 ← M[0]
Memory contents (address: value):
0:100  1:110  2:120  3:130  4:140  5:150  6:160  7:170
8:180  9:190  10:200  11:210  12:220  13:230  14:240  15:250
Both cache lines start invalid (V = 0).
1st access: Ld R1 ← M[1]
Address 0001: tag = 000, block offset = 1. Miss (the line is invalid), so read block {100, 110} from memory into line 0 (V=1, tag 0); R1 = 110. Line 1 becomes LRU.
Misses: 1  Hits: 0
2nd access: Ld R2 ← M[5]
Address 0101: tag = 010, block offset = 1. Miss, so read block {140, 150} from memory into line 1 (V=1, tag 2); R2 = 150. Line 0 becomes LRU.
Misses: 2  Hits: 0
3rd access: Ld R3 ← M[1]
Address 0001: tag = 000, block offset = 1. Hit in line 0; R3 = 110. Line 1 becomes LRU.
Misses: 2  Hits: 1
4th access: Ld R3 ← M[4]
Address 0100: tag = 010, block offset = 0. Hit in line 1; R3 = 140. Line 0 becomes LRU.
Misses: 2  Hits: 2
5th access: Ld R2 ← M[0]
Address 0000: tag = 000, block offset = 0. Hit in line 0; R2 = 100. Line 1 becomes LRU.
Misses: 2  Hits: 3
Final register values: R1 = 110, R2 = 100, R3 = 140.
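The five accesses above can be replayed with a small simulator. This is an illustrative sketch, not code from the lecture: it models the same fully associative 2-line cache with 2-byte blocks and LRU replacement, and the `load` helper name and data layout are my own.

```python
# Sketch of the slide's cache: 2 lines, fully associative,
# 2-byte blocks, LRU replacement, 16-byte memory.
memory = [100, 110, 120, 130, 140, 150, 160, 170,
          180, 190, 200, 210, 220, 230, 240, 250]

lines = []            # each entry: {'tag': block number, 'data': [b0, b1]}
hits = misses = 0

def load(addr):
    global hits, misses
    tag, offset = addr >> 1, addr & 1   # low bit selects the byte in the block
    for i, line in enumerate(lines):
        if line['tag'] == tag:          # hit: move the line to the MRU position
            hits += 1
            lines.append(lines.pop(i))
            return line['data'][offset]
    misses += 1                         # miss: allocate, evicting the LRU line
    if len(lines) == 2:
        lines.pop(0)                    # lines[0] is the least recently used
    lines.append({'tag': tag, 'data': memory[tag * 2 : tag * 2 + 2]})
    return lines[-1]['data'][offset]

R1 = load(1)   # miss
R2 = load(5)   # miss
R3 = load(1)   # hit
R3 = load(4)   # hit: address 4 shares a block with address 5
R2 = load(0)   # hit: address 0 shares a block with address 1
print(misses, hits)   # 2 3, matching the walkthrough
```

Keeping the list ordered from LRU (front) to MRU (back) makes the replacement decision a simple `pop(0)`.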
Basic Cache Organization
• Decide on the block size
– How? Simulate lots of different block sizes and see which one gives the best performance
– Most systems use a block size between 32 bytes and 128 bytes
– Longer blocks reduce the tag overhead by:
• Reducing the number of tags (one tag covers more bytes)
• Reducing the size of each tag (more offset bits leave fewer tag bits)
Address layout: [ Tag | Block offset ]
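The tag/offset split is just one shift and one mask. A minimal sketch, assuming a power-of-two block size of 64 bytes (within the 32-128 byte range above); `split_address` is a hypothetical helper name, not from the lecture.

```python
BLOCK_SIZE = 64                              # assumed: a typical 32-128 byte block
OFFSET_BITS = BLOCK_SIZE.bit_length() - 1    # log2(64) = 6 offset bits

def split_address(addr):
    """Return (tag, block offset) for a byte address."""
    tag = addr >> OFFSET_BITS                # tag doubles as the block number here
    offset = addr & (BLOCK_SIZE - 1)         # byte position within the block
    return tag, offset

tag, offset = split_address(0x12345)
print(tag, offset)   # 1165 5
```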
What About Stores?
• Where should you write the result of a store?
– If that memory location is in the cache?
• Send it to the cache.
• Should we also send it to memory right away?
(write-through policy)
• Or wait until we kick the block out? (write-back policy)
– If it is not in the cache?
• Allocate the line (put it in the cache)?
(write allocate policy)
• Write it directly to memory without allocation?
(no write allocate policy)
Handling Stores (write-through)
Assume a write-allocate policy. Same cache organization as before: 2 lines, 2-byte blocks, LRU replacement.
Access sequence:
Ld R1 ← M[1]
Ld R2 ← M[7]
St R2 → M[0]
St R1 → M[5]
Ld R2 ← M[10]
Memory contents (address: value):
0:78  1:29  2:120  3:123  4:71  5:150  6:162  7:173
8:18  9:21  10:33  11:28  12:19  13:200  14:210  15:225
Both cache lines start invalid (V = 0). Misses: 0  Hits: 0
write-through (REF 1): Ld R1 ← M[1]
Miss. Fill line 0: V=1, tag 0, data {78, 29}; R1 = 29. Line 1 becomes LRU.
Misses: 1  Hits: 0
write-through (REF 2): Ld R2 ← M[7]
Miss. Fill line 1: V=1, tag 3, data {162, 173}; R2 = 173. Line 0 becomes LRU.
Misses: 2  Hits: 0
write-through (REF 3): St R2 → M[0]
Hit in line 0. Write R2 (173) into the cache line, data now {173, 29}, AND write it through to memory: M[0] = 173. Line 1 becomes LRU.
Misses: 2  Hits: 1
write-through (REF 4): St R1 → M[5]
Miss. Write-allocate: evict line 1 (no write-back needed; memory is already current), fill it with block {71, 150}, then store R1 (29) into the line, data now {71, 29}, and write through to memory: M[5] = 29. Line 0 becomes LRU.
Misses: 3  Hits: 1
write-through (REF 5): Ld R2 ← M[10]
Miss. Evict line 0 and fill it with block {33, 28} (tag 5); R2 = 33. Line 1 becomes LRU.
Misses: 4  Hits: 1
How many memory references?
• Each miss reads a block (2 bytes in this cache)
• Each store writes a byte to memory
• Total reads: 8 bytes
• Total writes: 2 bytes
(But caches generally miss less than 20% of the time, so with write-through most memory write traffic comes from stores, not misses.)
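The write-through, write-allocate walkthrough can be checked with another sketch. As before, the structure and helper names (`find`, `allocate`) are my own, not from the lecture:

```python
# Sketch: write-through, write-allocate, 2-line fully associative LRU cache.
memory = [78, 29, 120, 123, 71, 150, 162, 173,
          18, 21, 33, 28, 19, 200, 210, 225]

lines = []                       # ordered LRU (front) to MRU (back)
hits = misses = 0
bytes_read = bytes_written = 0

def find(tag):
    for i, line in enumerate(lines):
        if line['tag'] == tag:
            lines.append(lines.pop(i))       # move to MRU position
            return lines[-1]
    return None

def allocate(tag):
    global bytes_read
    if len(lines) == 2:
        lines.pop(0)                         # evict LRU; memory is already current
    lines.append({'tag': tag, 'data': memory[tag * 2 : tag * 2 + 2]})
    bytes_read += 2                          # each miss reads a whole block
    return lines[-1]

def load(addr):
    global hits, misses
    tag, off = addr >> 1, addr & 1
    line = find(tag)
    if line is None:
        misses += 1
        line = allocate(tag)
    else:
        hits += 1
    return line['data'][off]

def store(addr, value):
    global hits, misses, bytes_written
    tag, off = addr >> 1, addr & 1
    line = find(tag)
    if line is None:
        misses += 1
        line = allocate(tag)                 # write-allocate policy
    else:
        hits += 1
    line['data'][off] = value
    memory[addr] = value                     # write-through: every store hits memory
    bytes_written += 1

R1 = load(1); R2 = load(7)
store(0, R2); store(5, R1)
R2 = load(10)
print(misses, hits, bytes_read, bytes_written)   # 4 1 8 2
```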
Write-through vs. write-back
• We can also design the cache to NOT write all stores to memory immediately.
– Keep the most current copy in the cache and update memory only when that data is evicted from the cache (a write-back policy).
• Do we need to write back all evicted lines?
– No, only blocks that have been stored into.
» Keep a “dirty bit”: reset it when the line is allocated, set it when the block is stored into. If a block is “dirty” when evicted, write its data back into memory.
Handling Stores (write-back)
Same access sequence and memory contents as the write-through example; each cache line now also carries a dirty bit (d) alongside the valid bit (V).
Ld R1 ← M[1]
Ld R2 ← M[7]
St R2 → M[0]
St R1 → M[5]
Ld R2 ← M[10]
Memory contents (address: value):
0:78  1:29  2:120  3:123  4:71  5:150  6:162  7:173
8:18  9:21  10:33  11:28  12:19  13:200  14:210  15:225
Both cache lines start invalid. Misses: 0  Hits: 0
write-back (REF 1): Ld R1 ← M[1]
Miss. Fill line 0: V=1, d=0, tag 0, data {78, 29}; R1 = 29. Line 1 becomes LRU.
Misses: 1  Hits: 0
write-back (REF 2): Ld R2 ← M[7]
Miss. Fill line 1: V=1, d=0, tag 3, data {162, 173}; R2 = 173. Line 0 becomes LRU.
Misses: 2  Hits: 0
write-back (REF 3): St R2 → M[0]
Hit in line 0. Write R2 (173) into the cache line, data now {173, 29}, and set the dirty bit (d=1). Memory is NOT updated. Line 1 becomes LRU.
Misses: 2  Hits: 1
write-back (REF 4): St R1 → M[5]
Miss. Evict line 1 (clean, so nothing to write back), fill it with block {71, 150} (tag 2), store R1 (29) into the line, data now {71, 29}, and set its dirty bit. Memory is NOT updated. Line 0 becomes LRU.
Misses: 3  Hits: 1
write-back (REF 5): Ld R2 ← M[10]
Miss. Line 0 is dirty, so first write its block back to memory: M[0] = 173, M[1] = 29. Then fill line 0: d=0, tag 5, data {33, 28}; R2 = 33. Line 1 becomes LRU.
Misses: 4  Hits: 1
How many memory references?
• Each miss reads a block (2 bytes in this cache)
• Each evicted dirty cache line writes a block back
• Total reads: 8 bytes
• Total writes: 4 bytes (2 during the run, plus 2 more when the remaining dirty line is finally evicted)
So should you choose write-back or write-through?
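The write-back walkthrough can be checked the same way. Again an illustrative sketch with my own helper names (`write_back`, `access`); a single `access` function handles both loads and stores here:

```python
# Sketch: write-back with a dirty bit, 2-line fully associative LRU cache.
memory = [78, 29, 120, 123, 71, 150, 162, 173,
          18, 21, 33, 28, 19, 200, 210, 225]

lines = []                       # ordered LRU (front) to MRU (back)
hits = misses = 0
bytes_read = bytes_written = 0

def write_back(line):
    global bytes_written
    if line['dirty']:                        # only dirty blocks go back to memory
        memory[line['tag'] * 2 : line['tag'] * 2 + 2] = line['data']
        bytes_written += 2

def access(addr, value=None):                # load if value is None, else store
    global hits, misses, bytes_read
    tag, off = addr >> 1, addr & 1
    for i, line in enumerate(lines):
        if line['tag'] == tag:
            hits += 1
            lines.append(lines.pop(i))       # move to MRU position
            break
    else:
        misses += 1
        if len(lines) == 2:
            write_back(lines.pop(0))         # evicting the LRU line may write a block
        lines.append({'tag': tag, 'dirty': False,
                      'data': memory[tag * 2 : tag * 2 + 2]})
        bytes_read += 2
    line = lines[-1]
    if value is None:
        return line['data'][off]
    line['data'][off] = value                # the store stays in the cache...
    line['dirty'] = True                     # ...until this dirty line is evicted

R1 = access(1); R2 = access(7)
access(0, R2); access(5, R1)
R2 = access(10)
for line in lines:                           # final evictions flush remaining dirty data
    write_back(line)
print(misses, hits, bytes_read, bytes_written)   # 4 1 8 4
```

Same miss/hit counts as write-through, but the stores cost nothing until eviction, and clean evictions cost nothing at all.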