Caches

Embedded Computer Architecture
Memory Hierarchy:
Cache Recap
Course 5KK73
Henk Corporaal
November 2014
h.corporaal@tue.nl
Computation I pg 1
Memory Hierarchy, why?
• Users want large and fast memories!
– SRAM access times are 1 to 10 ns
– DRAM access times are 20 to 120 ns
– Disk access times are 5 to 10 million ns, but its bits are very cheap
• Get best of both worlds: fast and large memories:
– build a memory hierarchy
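[Figure: memory hierarchy with the CPU at the top, followed by Level 1, Level 2, ..., Level n; speed decreases and size increases with distance from the CPU]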
Memory recap
• We can build a memory: a logical k × m array of stored bits. Usually m = 8 bits / location
• Address space: number of locations (usually a power of 2)
– an n-bit address gives k = 2^n locations
• Addressability: m, the number of bits per location (e.g., byte-addressable)
[Figure: memory drawn as k = 2^n entries, each holding m bits of data]
Memory element: SRAM vs DRAM
• SRAM:
– value is stored with a pair of inverting gates
– very fast but takes up more space than DRAM (4 to 6 transistors per cell)
• DRAM:
– value is stored as a charge on a capacitor (must be refreshed)
– very small but slower than SRAM (factor of 5 to 10)
– charge leaks => refresh needed
[Figure: DRAM cell with word line, pass transistor, capacitor and bit line]
Latest Intel: i7 Ivy Bridge, 22 nm
• Shrink of Sandy Bridge: 32 nm -> 22 nm
• Includes graphics, USB3, etc.; 3 levels of cache
Exploiting Locality
• Locality = principle that makes having a memory hierarchy a good idea
• If an item is referenced,
– temporal locality: it will tend to be referenced again soon
– spatial locality: nearby items will tend to be referenced soon
• Why does code have locality?
• Our initial focus: two levels (upper, lower)
– block: minimum unit of data
– hit: data requested is in the upper level
– miss: data requested is not in the upper level
[Figure: upper level (cache, $) exchanging a block with the lower level]
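One way to see why ordinary code has locality is to count how many distinct blocks an access pattern touches. The sketch below is illustrative only: the helper name blocks_touched, the 4-byte element size and the 16-byte block size are assumptions, not part of the course material.

```python
def blocks_touched(byte_addresses, block_size):
    """Return the number of distinct memory blocks an access pattern touches."""
    return len({addr // block_size for addr in byte_addresses})


ELEMENT_SIZE = 4   # assumed: 4-byte array elements
BLOCK_SIZE = 16    # assumed: 16-byte blocks (4 words)
N = 64             # number of accesses in each pattern

# Sequential walk over an array: neighbouring elements share a block,
# so each fetched block serves 4 accesses (spatial locality).
sequential = [i * ELEMENT_SIZE for i in range(N)]

# Strided walk: every access lands in a different block, so a fetched
# block is used only once.
strided = [i * BLOCK_SIZE for i in range(N)]

# Repeated access to one variable (e.g. a loop counter): a single block
# is reused for every access (temporal locality).
repeated = [0] * N

print(blocks_touched(sequential, BLOCK_SIZE))  # 16
print(blocks_touched(strided, BLOCK_SIZE))     # 64
print(blocks_touched(repeated, BLOCK_SIZE))    # 1
```

Loops over arrays behave like the first and third patterns, which is why a small upper level can serve most requests.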
Cache operation
[Figure: cache (higher level) storing tags and data per block / line, in front of memory (lower level)]
Direct Mapped Cache
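• Mapping: cache index = (memory block address) modulo (number of blocks in the cache)
[Figure: 8-block cache (indices 000 to 111) and memory addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, 11101; addresses ending in 001 map to cache block 001, those ending in 101 map to cache block 101]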
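The modulo mapping can be sketched in a few lines. This is a minimal illustration using the 8-block cache and the 5-bit memory addresses of this slide; the function name direct_mapped_index is an assumption made for the example.

```python
def direct_mapped_index(block_address, num_blocks):
    """Cache index of a memory block in a direct-mapped cache."""
    return block_address % num_blocks


NUM_BLOCKS = 8  # 8-entry cache, indices 000 to 111

for address in (0b00001, 0b00101, 0b01001, 0b01101,
                0b10001, 0b10101, 0b11001, 0b11101):
    index = direct_mapped_index(address, NUM_BLOCKS)
    print(f"{address:05b} -> {index:03b}")
```

Because the number of blocks is a power of two, the modulo simply keeps the low-order 3 bits of the address.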
Direct Mapped Cache
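Q: What kind of locality are we taking advantage of in this example?
[Figure: 32-bit address (bit positions 31 to 0) split into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2) and a 2-bit byte offset (bits 1-0). The index selects one of 1024 entries (0 to 1023), each with a valid bit, a 20-bit tag and 32 bits of data; the stored tag is compared with the address tag to produce Hit, and the entry delivers the 32-bit Data]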
Direct Mapped Cache
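• This example (also) exploits spatial locality (by having larger blocks):
[Figure: 32-bit address split into a 16-bit tag (bits 31-16), a 12-bit index (bits 15-4), a 2-bit block offset (bits 3-2) and a 2-bit byte offset (bits 1-0). The index selects one of 4K entries, each with a valid bit (V), a 16-bit tag and 128 bits of data (four 32-bit words); the block offset drives a Mux that selects one 32-bit word, and the tag comparison produces Hit]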
Hits vs. Misses
• Read hits
– this is what we want!
• Read misses
– stall the CPU, fetch block from memory, deliver to cache, restart the
load instruction
• Write hits:
– can replace data in cache and memory (write-through)
– write the data only into the cache (write-back the cache later)
• Write misses:
– read the entire block into the cache, then write the word (allocate on
write miss)
– do not read the cache line; just write to memory (no allocate on write
miss)
Splitting first level cache
• Use split Instruction and Data caches
– Caches can be tuned differently
– Avoids dual ported cache
[Figure: CPU with split L1 caches (I$ and D$), a unified L2 cache (I&D $) and main memory]

Program  Block size   Instruction  Data        Effective combined
         in words     miss rate    miss rate   miss rate
gcc      1            6.1%         2.1%        5.4%
gcc      4            2.0%         1.7%        1.9%
spice    1            1.2%         1.3%        1.2%
spice    4            0.3%         0.6%        0.4%
Let’s look at cache&memory performance
T_exec = N_cycles • T_cycle = N_inst • CPI • T_cycle
with
CPI = CPI_ideal + CPI_stall
CPI_stall = %reads • missrate_read • misspenalty_read +
            %writes • missrate_write • misspenalty_write
or:
T_exec = (N_normal-cycles + N_stall-cycles) • T_cycle
with
N_stall-cycles = N_reads • missrate_read • misspenalty_read +
                 N_writes • missrate_write • misspenalty_write
                 (+ write-buffer stalls)
Performance example (1)
• Assume application with:
– Icache missrate 2%
– Dcache missrate 4%
– Fraction of ld-st instructions = 36%
– CPI ideal (i.e. without cache misses) is 2.0
– Misspenalty 40 cycles
• Calculate CPI taking misses into account
CPI = 2.0 + CPI_stall
CPI_stall = (Instruction-miss cycles + Data-miss cycles) / N_instr
Instruction-miss cycles = N_instr x 0.02 x 40 = 0.80 N_instr
Data-miss cycles = N_instr x 0.36 x 0.04 x 40 = 0.576 N_instr
CPI = 2.0 + 0.80 + 0.576 = 3.376
Slowdown: 3.376 / 2.0 = 1.69 !!
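The calculation above can be checked with a few lines of Python. This is a minimal sketch; the function name cpi_with_misses and its parameter names are illustrative assumptions. It evaluates the formula without intermediate rounding, so the data-miss term is 0.576 cycles per instruction.

```python
def cpi_with_misses(cpi_ideal, icache_missrate, dcache_missrate,
                    ldst_fraction, miss_penalty):
    """CPI including stall cycles caused by instruction and data cache misses."""
    # Every instruction is fetched, so every instruction can miss in the Icache.
    instruction_stalls = icache_missrate * miss_penalty
    # Only loads and stores access the Dcache.
    data_stalls = ldst_fraction * dcache_missrate * miss_penalty
    return cpi_ideal + instruction_stalls + data_stalls


cpi = cpi_with_misses(cpi_ideal=2.0, icache_missrate=0.02,
                      dcache_missrate=0.04, ldst_fraction=0.36,
                      miss_penalty=40)
print(f"CPI = {cpi:.3f}, slowdown = {cpi / 2.0:.3f}")
```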
Performance example (2)
1. What if ideal processor had CPI = 1.0 (instead of 2.0)
• CPI = 1.0 + 0.80 + 0.576 = 2.376, so slowdown would be 2.38 !
2. What if processor is clocked twice as fast
• => penalty becomes 80 cycles
• CPI = 2.0 + 0.02 x 80 + 0.36 x 0.04 x 80 = 4.752
• Speedup = N • CPI_a • T_clock / (N • CPI_b • T_clock/2) = 3.376 / (4.752/2)
• Speedup is not 2, but only 1.42 !!
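Both what-if questions can be explored with the same stall formula. The sketch below is a hedged illustration: the function name and the default miss rates (taken from example 1) are assumptions for this example, and the values are computed without intermediate rounding.

```python
def cpi_with_misses(cpi_ideal, miss_penalty, icache_missrate=0.02,
                    dcache_missrate=0.04, ldst_fraction=0.36):
    """CPI including instruction and data miss stalls (rates from example 1)."""
    stalls = (icache_missrate + ldst_fraction * dcache_missrate) * miss_penalty
    return cpi_ideal + stalls


# What-if 1: a better core (ideal CPI 1.0) with the same 40-cycle penalty.
slowdown_fast_core = cpi_with_misses(1.0, 40) / 1.0

# What-if 2: clock twice as fast, so the same memory latency costs 80 cycles.
cpi_a = cpi_with_misses(2.0, 40)   # original clock
cpi_b = cpi_with_misses(2.0, 80)   # doubled clock
speedup = cpi_a / (cpi_b / 2)      # cycle time of b is half that of a

print(f"slowdown with CPI_ideal 1.0: {slowdown_fast_core:.2f}")
print(f"CPI at doubled clock: {cpi_b:.2f}, speedup: {speedup:.2f}")
```

The lower the ideal CPI or the faster the clock, the larger the share of time spent waiting for memory.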
Improving cache / memory performance
• Ways of improving performance:
– decreasing the miss ratio (avoiding conflicts): associativity
– decreasing the miss penalty: multilevel caches
– Adapting block size: see earlier slides
– Note: there are many more ways to improve memory
performance
(see e.g. master course 5MD00)
How to reduce CPI_stall?
CPI_stall = %reads • missrate_read • misspenalty_read +
            %writes • missrate_write • misspenalty_write
Reduce missrate:
• Larger cache
– Avoids capacity misses
– However: a large cache may increase Tcycle
• Larger block (line) size
– Exploits spatial locality: see previous lecture
• Associative cache
– Avoids conflict misses
Reduce misspenalty:
• Add 2nd level of cache
Decreasing miss ratio with associativity
[Figure: the same 8-block cache organized in four ways, each entry holding a tag and data:
– one-way set associative (direct mapped): 8 sets (blocks 0 to 7), 1 block / set
– two-way set associative: 4 sets (0 to 3), 2 blocks / set
– four-way set associative: 2 sets (0 to 1), 4 blocks / set
– eight-way set associative (fully associative): 1 set, 8 blocks / set]
An implementation: 4 way associative
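[Figure: 32-bit address split into a 22-bit tag (bits 31-10), an 8-bit index (bits 9-2) and a 2-bit byte offset (bits 1-0). The index selects one of 256 sets (0 to 255); each set has four entries with a valid bit (V), a 22-bit tag and 32 bits of data. The four tags are compared in parallel; the results produce Hit and drive a 4-to-1 multiplexor that selects the 32-bit Data]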
Performance of Associative Caches
[Figure: miss rate (0% to 15%) versus associativity (one-way, two-way, four-way, eight-way), one curve per cache size: 1 KB, 2 KB, 4 KB, 8 KB, 16 KB, 32 KB, 64 KB, 128 KB]
Further Cache Basics
• cache_size = Nsets x Associativity x Block_size
• block_address = Byte_address DIV Block_size (in bytes)
• index = Block_address MOD Nsets
• Because the block size and the number of sets are (usually) powers of two, DIV and MOD can be performed efficiently
[Figure: address layout from bit 31 down to bit 0: tag | index | block offset; tag and index together form the block address]
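With power-of-two sizes, DIV becomes a right shift and MOD becomes a bit mask. The following sketch shows one way to split an address; the function name split_address and the example address 0x12345678 are assumptions chosen for illustration.

```python
def split_address(byte_address, block_size, num_sets):
    """Split a byte address into (tag, index, block offset).

    block_size is in bytes; block_size and num_sets must be powers of two.
    """
    offset_bits = block_size.bit_length() - 1   # log2(block_size)
    index_bits = num_sets.bit_length() - 1      # log2(num_sets)

    block_offset = byte_address & (block_size - 1)   # Byte_address MOD Block_size
    block_address = byte_address >> offset_bits      # Byte_address DIV Block_size
    index = block_address & (num_sets - 1)           # Block_address MOD Nsets
    tag = block_address >> index_bits                # Block_address DIV Nsets
    return tag, index, block_offset


# Assumed example: 16-byte blocks, 4096 sets, arbitrary 32-bit address.
tag, index, offset = split_address(0x12345678, block_size=16, num_sets=4096)
print(hex(tag), hex(index), hex(offset))  # 0x1234 0x567 0x8
```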
Comparing different (1-level) caches (1)
• Assume
– Cache of 4K blocks
– 4 word block size
– 32 bit address
• Direct mapped (associativity=1):
– 16 bytes per block = 2^4: 4 bits for block offset
– 32 bit address: 32-4 = 28 bits for index and tag
– #sets = #blocks / associativity = 4K sets: log2(4K) = 12 bits for index
– Total number of tag bits: (28-12) * 4K = 64 Kbits
• 2-way associative
– #sets = #blocks / associativity: 2K sets
– 1 bit less for indexing, 1 bit more for tag
– Tag bits: (28-11) * 2 * 2K = 68 Kbits
• 4-way associative
– #sets = #blocks / associativity: 1K sets
– again 1 bit less for indexing, 1 bit more for tag
– Tag bits: (28-10) * 4 * 1K = 72 Kbits
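The tag storage for the three organizations can be recomputed as follows. This is a small sketch under the slide's assumptions (4K blocks, 16-byte blocks, 32-bit addresses); the function name total_tag_bits is hypothetical, and valid bits are not counted.

```python
def total_tag_bits(num_blocks, block_size_bytes, associativity, address_bits=32):
    """Total number of tag bits stored in the cache (valid bits not counted)."""
    num_sets = num_blocks // associativity
    offset_bits = block_size_bytes.bit_length() - 1   # log2, power of two assumed
    index_bits = num_sets.bit_length() - 1            # log2, power of two assumed
    tag_bits = address_bits - offset_bits - index_bits
    return tag_bits * num_blocks                      # one tag per block


for ways in (1, 2, 4):
    kbits = total_tag_bits(4096, 16, ways) // 1024
    print(f"{ways}-way: {kbits} Kbits of tag storage")
```

Each doubling of the associativity moves one bit from the index to the tag, which adds 4 Kbits of tag storage for this cache.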
Comparing different (1-level) caches (2)
3 caches consisting of 4 one-word blocks:
• Cache 1 : fully associative
• Cache 2 : two-way set associative
• Cache 3 : direct mapped
Suppose the following sequence of block addresses: 0, 8, 0, 6, 8
Direct Mapped
Block address   Cache block
0               0 mod 4 = 0
6               6 mod 4 = 2
8               8 mod 4 = 0

Address of     Hit or   Location  Location  Location  Location
memory block   miss     0         1         2         3
0              miss     Mem[0]*
8              miss     Mem[8]*
0              miss     Mem[0]*
6              miss     Mem[0]              Mem[6]*
8              miss     Mem[8]*             Mem[6]
* = new entry = miss
2-way Set Associative: 2 sets
Block address   Cache set
0               0 mod 2 = 0
6               6 mod 2 = 0
8               8 mod 2 = 0
(so all in set/location 0)

Address of     Hit or   SET 0     SET 0     SET 1     SET 1
memory block   miss     entry 0   entry 1   entry 0   entry 1
0              Miss     Mem[0]*
8              Miss     Mem[0]    Mem[8]*
0              Hit      Mem[0]    Mem[8]
6              Miss     Mem[0]    Mem[6]*
8              Miss     Mem[8]*   Mem[6]
* = new entry; on a miss in a full set the LEAST RECENTLY USED BLOCK is replaced
Fully associative (4 way assoc., 1 set)
Address of     Hit or   Block 0   Block 1   Block 2   Block 3
memory block   miss
0              Miss     Mem[0]
8              Miss     Mem[0]    Mem[8]
0              Hit      Mem[0]    Mem[8]
6              Miss     Mem[0]    Mem[8]    Mem[6]
8              Hit      Mem[0]    Mem[8]    Mem[6]
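The three traces can be reproduced with a small simulator. This is a minimal sketch, assuming LRU replacement in every set (as in the 2-way example); the function name count_misses is made up for this illustration.

```python
from collections import OrderedDict


def count_misses(block_addresses, num_blocks, associativity):
    """Count misses of a set-associative cache with LRU replacement."""
    num_sets = num_blocks // associativity
    # One OrderedDict per set; keys are block addresses, oldest entry first.
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for block in block_addresses:
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)         # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) == associativity:  # set full: evict the LRU block
                lines.popitem(last=False)
            lines[block] = True
    return misses


trace = [0, 8, 0, 6, 8]
for name, ways in (("direct mapped", 1), ("2-way", 2), ("fully associative", 4)):
    print(name, count_misses(trace, num_blocks=4, associativity=ways))
```

For this sequence the simulator reports 5, 4 and 3 misses, matching the direct mapped, 2-way and fully associative tables.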
Review: Four Questions for Memory Hierarchy Designers
• Q1: Where can a block be placed in the upper level? (Block placement)
– Fully Associative, Set Associative, Direct Mapped
• Q2: How is a block found if it is in the upper level? (Block identification)
– Tag/Block
• Q3: Which block should be replaced on a miss? (Block replacement)
– Random, FIFO, LRU
• Q4: What happens on a write? (Write strategy)
– Write Back or Write Through (with Write Buffer)
Classifying Misses: the 3 Cs
• The 3 Cs:
– Compulsory: first access to a block is always a miss. Also called cold start misses
• misses in infinite cache
– Capacity: misses resulting from the finite capacity of the cache
• misses in fully associative cache with optimal replacement strategy
– Conflict: misses occurring because several blocks map to the same set. Also called collision misses
• remaining misses
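The classification can be approximated by simulating the same trace on three caches. The sketch below is only an approximation: it uses LRU in place of the optimal replacement strategy (an assumption, since optimal replacement needs knowledge of future accesses), and the function names are hypothetical.

```python
from collections import OrderedDict


def lru_miss_count(trace, num_blocks, associativity):
    """Number of misses of a set-associative LRU cache on a block trace."""
    num_sets = num_blocks // associativity
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for block in trace:
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)
        else:
            misses += 1
            if len(lines) == associativity:
                lines.popitem(last=False)
            lines[block] = True
    return misses


def classify_misses(trace, num_blocks, associativity):
    """Return (compulsory, capacity, conflict) miss counts for one cache."""
    total = lru_miss_count(trace, num_blocks, associativity)
    # Compulsory: first reference to each block (misses in an infinite cache).
    compulsory = len(set(trace))
    # Capacity: extra misses of a fully associative cache of the same size.
    # Assumption: LRU stands in for the optimal replacement strategy.
    fully_associative = lru_miss_count(trace, num_blocks, num_blocks)
    capacity = fully_associative - compulsory
    # Conflict: the remaining misses.
    conflict = total - fully_associative
    return compulsory, capacity, conflict


print(classify_misses([0, 8, 0, 6, 8], num_blocks=4, associativity=1))
```

For a 4-block direct mapped cache and the trace 0, 8, 0, 6, 8 this gives 3 compulsory, 0 capacity and 2 conflict misses.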
3 Cs: Compulsory, Capacity, Conflict
In cases 1 and 3, assume the total cache size is not changed
What happens if we:
1) Change Block Size:
Which of 3Cs is obviously affected? compulsory misses
2) Change Cache Size:
Which of 3Cs is obviously affected? capacity misses
3) Introduce higher associativity:
Which of 3Cs is obviously affected? conflict misses
3Cs Absolute Miss Rate (SPEC92)
[Figure: miss rate per type (0 to 0.14) versus cache size (1 KB to 128 KB), with curves for 1-way, 2-way, 4-way and 8-way associativity; the areas show the conflict, capacity and compulsory components]
Second Level Cache (L2)
• Most CPUs
– have an L1 cache small enough to match the cycle time
(reduce the time to hit the cache)
– have an L2 cache large enough and with sufficient
associativity to capture most memory accesses (reduce
miss rate)
• L2 Equations, Average Memory Access Time (AMAT):
AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
• Definitions:
– Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
– Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2)
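The AMAT equations translate directly into code. The sketch below uses made-up example numbers (1-cycle L1 hit, 5% L1 miss rate, 10-cycle L2 hit, 40% local L2 miss rate, 100-cycle memory access); these values and the function names are assumptions for illustration only.

```python
def amat_two_level(hit_time_l1, miss_rate_l1, hit_time_l2,
                   local_miss_rate_l2, miss_penalty_l2):
    """Average memory access time for an L1 + L2 hierarchy (any time unit)."""
    miss_penalty_l1 = hit_time_l2 + local_miss_rate_l2 * miss_penalty_l2
    return hit_time_l1 + miss_rate_l1 * miss_penalty_l1


def global_miss_rate_l2(miss_rate_l1, local_miss_rate_l2):
    """Fraction of all CPU accesses that miss in both L1 and L2."""
    return miss_rate_l1 * local_miss_rate_l2


# Assumed numbers, in clock cycles (not taken from the slides).
amat = amat_two_level(hit_time_l1=1, miss_rate_l1=0.05, hit_time_l2=10,
                      local_miss_rate_l2=0.40, miss_penalty_l2=100)
print(f"AMAT = {amat:.2f} cycles")
print(f"global L2 miss rate = {global_miss_rate_l2(0.05, 0.40):.3f}")
```

With these numbers a 40% local L2 miss rate corresponds to a global miss rate of only 2%, which illustrates why the two definitions must be kept apart.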
Second Level Cache (L2)
• Suppose processor with base CPI of 1.0
• Clock rate of 500 MHz
• Main memory access time: 200 ns
• Miss rate per instruction primary cache: 5%
What improvement with second cache having 20 ns access time, reducing miss rate to memory to 2%?
• Miss penalty: 200 ns / 2 ns per cycle = 100 clock cycles
• Effective CPI = base CPI + memory stall per instruction = ?
– 1 level cache: total CPI = 1 + 5% * 100 = 6
– 2 level cache: a miss in first level cache is satisfied by second cache or memory
• Access second level cache: 20 ns / 2 ns per cycle = 10 clock cycles
• If miss in second cache, then access memory: in 2% of the cases
• Total CPI = 1 + primary stalls per instruction + secondary stalls per instruction
• Total CPI = 1 + 5% * 10 + 2% * 100 = 3.5
Machine with L2 cache: 6 / 3.5 = 1.7 times faster
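The same example can be written as a short calculation. This is a sketch using the numbers of this slide; the helper name effective_cpi and the variable names are illustrative assumptions.

```python
def effective_cpi(base_cpi, stalls):
    """Base CPI plus the sum of (misses per instruction x penalty in cycles)."""
    return base_cpi + sum(rate * penalty for rate, penalty in stalls)


cycle_ns = 1000 / 500                 # 500 MHz clock -> 2 ns per cycle
memory_penalty = 200 / cycle_ns       # 100 cycles to main memory
l2_penalty = 20 / cycle_ns            # 10 cycles to the second level cache

# One level: every primary miss (5%) goes to main memory.
cpi_one_level = effective_cpi(1.0, [(0.05, memory_penalty)])
# Two levels: primary misses (5%) pay the L2 access; 2% also go to memory.
cpi_two_level = effective_cpi(1.0, [(0.05, l2_penalty), (0.02, memory_penalty)])

print(f"{cpi_one_level:.1f} {cpi_two_level:.1f} "
      f"{cpi_one_level / cpi_two_level:.2f}")
```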
Second Level Cache
• The global miss rate is similar to the miss rate of a single cache of the second level cache's size, provided the L2 cache is much bigger than L1.
• The local miss rate is NOT a good measure of secondary caches, as it is a function of the L1 cache.
• The global miss rate should be used.
How to connect the cache to next level?
• Make reading multiple words easier by using banks of memory
[Figure: three ways to connect CPU, cache, bus and memory:
a. One-word-wide memory organization
b. Wide memory organization (with a multiplexor between CPU and cache)
c. Interleaved memory organization (memory banks 0 to 3 on one bus)]
• It can get a lot more complicated...