Cache I

CS5100 Advanced Computer Architecture
Memory Hierarchy Design
Prof. Chung-Ta King
Department of Computer Science
National Tsing Hua University, Taiwan
(Slides are from the textbook, Prof. Hsien-Hsin Lee, and Prof. Yasun Hsu)
About This Lecture
• Goal:
 To review the basics of memory hierarchy design and basic
cache optimization techniques
• Outline:
 Memory hierarchy design (Sec. 2.1, Appendix B.1)
• Motivation, basic concepts
 Basic cache organization and performance (Appendix B.2)
 Basic cache optimizations (Appendix B.3)
Memory Performance Gap
Motivation for Memory Hierarchy
• Programmers want unlimited amounts of memory
with low latency
 But fast memory is more expensive than slow memory
• Solution: small fast memory + big slow memory
  = looks like one big, fast memory
(Figure: a small & fast memory placed in front of a big & slow memory appears to the processor as a single big, fast memory)
Basic Idea of Memory Hierarchy
• Entire addressable memory space available in
largest, slowest memory
• Incrementally smaller and faster memories, each
containing a subset of the memory below it, proceed
in steps up toward the processor
Model of Memory Hierarchy
(Figure: two example hierarchies. Left: register file, L1 instruction/data caches, and L2 cache in SRAM, backed by main memory (DRAM) and disk. Right: the same, with additional L3 and L4 caches built from eDRAM or emerging memory between the SRAM caches and DRAM.)
Model of Memory Hierarchy
Level            Cycles       Words/cycle   Management
Registers        1            3-10          CPU chip
Level 1 Cache    1-3          1-2           CPU chip
Level 2 Cache    5-10         1             CPU chip
DRAM             30-100       0.5           Chips
Disk             10^6-10^7    0.01          Mechanical
Memory Hierarchy Design
• Memory hierarchy design becomes more crucial
with recent multi-core processors:
• Aggregate peak bandwidth grows with # cores:
 Intel Core i7 generates two references per core per clock
 Four cores and a 3.2 GHz clock:
25.6 billion 64-bit data references/second + 12.8 billion 128-bit
instruction references/second = 204.8 GB/s + 204.8 GB/s = 409.6 GB/s!
 DRAM bandwidth is only 6% of this (25 GB/s)
 Requires:
• Multi-port, pipelined caches
• Two levels of cache per core
• Shared third-level cache on chip
Performance and Power
• High-end microprocessors have >10 MB of on-chip cache
 Consumes a large share of the area and power budget
(Die photo: Intel Core i7, showing per-core L1 I$, L1 D$, and L2 caches and a large shared L3)
Underlying Principles
• Locality of reference:
 Programs access a relatively small portion of the address
space at any instant of time => a program property
 Temporal locality
 Spatial locality
(Figure: memory locations referenced over time cluster into distinct regions for code, stack, and array data)
But, Program Behavior Matters
• Locality depends on type of program
• Some programs ‘behave’ well
 Small loop operating on data on the stack (e.g., Towers of Hanoi)
• Some programs don’t
Memory Hierarchy Basics
• Hit: data appears in some block (page) in upper level
 Hit rate: % of memory access found in upper level
 Hit time: time to access upper level
• Miss: data not in upper level
 Miss rate = 1 - (hit rate)
 Miss penalty: time to replace a block in upper level + time
to deliver the block to the processor
• Hit Time << Miss Penalty
(Figure: the processor reads from and writes to the upper-level memory holding Blk X; on a miss, Blk Y is brought in from the lower-level memory)
Memory Hierarchy Performance
• Average memory access time between levels
= Hit time + Miss rate x Miss penalty
• Miss penalty: time to fetch a block from lower
memory level
 Access time: time to lower level; function of latency
 Transfer time: function of bandwidth between levels
• Transfer one “cache line/block” at a time
• Transfer at the size of the memory-bus width
4 Questions for Memory Hierarchy
• Block placement: where in upper level?
 fully associative, direct mapped, set-associative
• Block identification: find block in upper level?
 search and match address tag:
block address (= tag + index) + block offset
 valid bit: tag-index boundary and associativity
• Block replacement: which to replace on miss?
 easy for direct map.; random/LRU for associative
• Write strategy: what happens on a write?
 Write through and write back
 Write allocate and not allocate
Outline
• Memory hierarchy design (Sec. 2.1, Appendix B.1)
 Motivation, basic concepts
• Basic cache organization and performance (Appendix
B.2)
• Basic cache optimizations (Appendix B.3)
Cache on CPU Performance
• CPU time = (CPU execution cycles + memory stall cycles) x clock
cycle time
• Memory stall cycles = memory accesses x miss rate x miss penalty
 Assumptions: memory stalls are due to cache misses, hit cycles are
included in CPU execution cycles, and reads and writes have the same
miss rate and penalty
• CPU time = instruction count x (CPI_ideal + memory
accesses/instruction x miss rate x miss penalty) x clock cycle time
• Cache design becomes more important for CPUs
with lower CPI and higher clock rates
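To make these formulas concrete, here is a hedged C sketch with assumed numbers that are not from the slides (CPI_ideal = 1, 1.5 memory accesses per instruction, 2% miss rate, 100-cycle miss penalty, 2 GHz clock). It shows how memory stalls can dominate CPU time on a fast, low-CPI core.

    #include <stdio.h>

    int main(void) {
        /* Assumed example parameters (illustration only) */
        double cpi_ideal    = 1.0;    /* CPI with a perfect cache        */
        double acc_per_inst = 1.5;    /* memory accesses per instruction */
        double miss_rate    = 0.02;   /* 2% miss rate                    */
        double miss_penalty = 100.0;  /* cycles per miss                 */
        double insts        = 1e9;    /* instruction count               */
        double cycle_time_s = 0.5e-9; /* 2 GHz clock                     */

        /* Stall CPI = memory accesses/instruction x miss rate x miss penalty */
        double stall_cpi = acc_per_inst * miss_rate * miss_penalty;   /* 3.0 */

        /* CPU time = IC x (CPI_ideal + stall CPI) x clock cycle time */
        double t_ideal = insts * cpi_ideal * cycle_time_s;
        double t_real  = insts * (cpi_ideal + stall_cpi) * cycle_time_s;

        printf("stall CPI = %.2f\n", stall_cpi);
        printf("CPU time, perfect cache: %.2f s\n", t_ideal);  /* 0.50 s */
        printf("CPU time, real cache:    %.2f s\n", t_real);   /* 2.00 s */
        return 0;
    }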
Cache Basics
• When a word is not found in the cache, a miss
occurs:
 Fetch word from lower level in hierarchy, requiring a
higher latency reference
 Lower level may be another cache or the main memory
 Also fetch the other words contained within the block
 Place block into cache in any location within its set,
determined by address
• Block address MOD number of sets
Cache Organization
(Figure: direct-mapped cache with sets 000 ... 111; memory blocks at addresses 00100, 01100, 10100, and 11100 all map to set 100)
Example: 1KB DM Cache, 32-byte Blocks
• The lowest M bits are the block offset (block size = 2^M = 32)
• Index = log2(# of sets); the remaining upper address bits are the tag
 For this cache: bits 4-0 = offset, bits 9-5 = index, bits 31-10 = tag
• Question: will set-associativity increase or reduce the index bits?
(Figure: 32-set direct-mapped cache; each set holds a valid bit, a cache
tag (e.g., 0x00, 0x01), and a 32-byte data block, Byte 0 ... Byte 1023
across sets 0-31)
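A hedged C sketch of exactly this decomposition (1 KB direct-mapped cache, 32-byte blocks, hence 32 sets, 5 offset bits, 5 index bits); the example address is arbitrary.

    #include <stdio.h>
    #include <stdint.h>

    #define BLOCK_SIZE 32u                        /* bytes/block -> 5 offset bits */
    #define CACHE_SIZE 1024u                      /* direct mapped                */
    #define NUM_SETS   (CACHE_SIZE / BLOCK_SIZE)  /* 32 sets     -> 5 index bits  */

    int main(void) {
        uint32_t addr = 0x00000464;                        /* arbitrary address */
        uint32_t offset = addr & (BLOCK_SIZE - 1);         /* bits 4..0   */
        uint32_t index  = (addr / BLOCK_SIZE) % NUM_SETS;  /* bits 9..5   */
        uint32_t tag    = addr / (BLOCK_SIZE * NUM_SETS);  /* bits 31..10 */
        printf("addr=0x%08x -> tag=0x%x index=%u offset=%u\n",
               addr, tag, index, offset);
        return 0;
    }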
Example: Set-Associative Cache
(Figure: S sets with N elements (ways) per set; the set field of the
address selects one set, the tag is compared against all N ways, and a
MUX selects the data of the hitting way)
• Share costly comparators across sets, at the price of added MUX delay
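In hardware the N tag comparisons happen in parallel; the short C sketch below models the same lookup sequentially, purely as an illustration (the structure and names are made up, not a real cache model).

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_WAYS 4

    struct line { bool valid; uint32_t tag; };

    /* Returns the hitting way in this set, or -1 on a miss.
     * Hardware uses NUM_WAYS comparators in parallel; this loop is the
     * software equivalent, and the MUX corresponds to picking 'way'. */
    int lookup(const struct line set[NUM_WAYS], uint32_t tag) {
        for (int way = 0; way < NUM_WAYS; way++)
            if (set[way].valid && set[way].tag == tag)
                return way;
        return -1;
    }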
Comparison of Cache Organization
• Cache miss as a metric (causes of cache misses):
 Compulsory:
  • First access to a block
 Capacity:
  • Block discarded due to limited cache size and later retrieved
 Conflict:
  • Block discarded due to conflict in set and later retrieved
 Coherence:
  • In multicore systems (next semester)
(Figure: processor/cache diagrams illustrating each miss type with
example block addresses such as 0x1234)
Comparison of Cache Organization
• Which cache organization can reduce compulsory
misses?
• Which cache organization can reduce capacity
misses?
• Which cache organization can reduce conflict
misses?
• Reduce cache misses = improve cache performance?
Comparison of Cache Organizations
Comparisons in terms of space and time:
• Can access to tag and data arrays be done in
parallel?
• How many comparators and multiplexers are
needed?
 How about wiring in IC layout?
• How many bits of storage are needed in tag and data
arrays?
 The index bits need not be stored!
Speed of Direct-Mapped Cache
• Tag and data access happen in parallel
 Faster cache access!
(Figure: the index field selects entries in the tag array and the data array simultaneously; the tag compare then validates the selected data)
Overhead of Fully Associative Cache
(Figure: the tag is searched associatively against every entry; on a match, a multiplexor selects that entry's data and rotate-and-mask logic extracts the requested word using the offset)
Overhead of Fully Associative Cache
(Figure: the address tag is broadcast to all entries; every entry has
its own tag comparator, and the matching entry drives the read/write
data)
• Additional circuitry as compared to DM caches, and more extensive
than in SA caches
• Makes FA caches slower to access than either DM or SA caches of
comparable size
Other Design Issues: Write Hit Policy
• Write through
 Update the next level on every write => cache is always clean
 Lots of traffic to the next level (mostly writes)
• Write back
 Write to cache and mark the block dirty => update main memory only
on eviction
 More complex eviction and coherence
• Reservation problem:
 Reads use tag and data array at same time
 Writes use tag first, then data array
 How to pipeline to allow one read/write per cycle?
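A hedged C sketch contrasting the two write-hit policies above; the next-level interface is a made-up stub, not a real memory API.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical next-level interface, stubbed for illustration. */
    static void memory_write(uint32_t addr, const uint8_t *data, int n) {
        (void)data;
        printf("next level: write %d byte(s) at 0x%08x\n", n, addr);
    }

    struct line { bool valid, dirty; uint32_t tag; uint8_t data[32]; };

    /* Write hit: write-through updates the next level immediately,
     * write-back only marks the block dirty. */
    void write_hit(struct line *ln, uint32_t addr, uint8_t byte, bool write_back) {
        ln->data[addr & 31] = byte;
        if (write_back)
            ln->dirty = true;
        else
            memory_write(addr, &byte, 1);
    }

    /* Eviction: a write-back cache must flush a dirty block. */
    void evict(struct line *ln, uint32_t block_addr) {
        if (ln->dirty)
            memory_write(block_addr, ln->data, 32);
        ln->valid = ln->dirty = false;
    }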
Write Buffer for Write Through
• A write buffer between cache and memory
 Processor: writes data into cache and write buffer
 Memory controller: write contents of buffer to memory
• Write buffer is just a FIFO:
 Typical number of entries: 4
 Works fine if: store freq. (w.r.t. time) << 1 / DRAM write
cycle
(Figure: processor writes go into the cache and the write buffer; the write buffer drains to DRAM)
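A minimal sketch of such a buffer as a 4-entry FIFO ring (the entry count comes from the slide; everything else is illustrative).

    #include <stdint.h>
    #include <stdbool.h>

    #define WB_ENTRIES 4

    struct wb_entry { uint32_t addr, data; };
    static struct wb_entry wb[WB_ENTRIES];
    static int head, tail, count;

    /* Processor side: returns false (stall) if the buffer is full. */
    bool wb_push(uint32_t addr, uint32_t data) {
        if (count == WB_ENTRIES) return false;
        wb[tail] = (struct wb_entry){ addr, data };
        tail = (tail + 1) % WB_ENTRIES;
        count++;
        return true;
    }

    /* Memory-controller side: drains one entry per DRAM write cycle. */
    bool wb_pop(struct wb_entry *out) {
        if (count == 0) return false;
        *out = wb[head];
        head = (head + 1) % WB_ENTRIES;
        count--;
        return true;
    }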
Write Miss Policy
• Write allocate:
 Allocate a new block on each write
 Fetch on write: fetch entire block, then write word into
block
 No-fetch: allocate block but don’t fetch
• Requires valid bits per word, complex eviction
• Write no-allocate:
 Write around cache; typically used by write through
• Write invalidate (instead of update)
• Sometimes it is also desirable to have read no-allocate
 e.g., irregular accesses on a machine with a large block size
Contents of the Cache
Cache may hold certain types of information
• Instruction cache:
 Read-only (mostly), very good spatial locality
• sequential instruction stream and loops
• but sometimes lots of subroutine calls
• Data cache:
 read/write with different spatial locality pattern
• Stack cache:
 captures a window of memory addresses including the
top-of-stack
• Unified cache:
 holds everything - data + instructions
Other Design Issues: Replacement Policy
• On a cache miss we need to evict a block to make
room for the new line
• In an n-way set associative cache, we have n choices
of which block to evict
• Which block gets evicted?
• Optimal cache replacement policy [Belady, IBM
Systems Journal, 1966]:
 Evict block with longest reuse distance
• i.e. next reference to block is farthest in future
• Requires knowledge of the future!
 Can’t build it, but can approximate it
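Over a reference trace that is known in advance, OPT can of course be simulated; the hedged sketch below does exactly that for a tiny 2-way example (the trace and set size are invented), which also makes clear why real hardware can only approximate it.

    #include <stdio.h>

    #define WAYS 2   /* assumed: blocks that fit in one set */

    /* Belady's OPT: on a miss with a full set, evict the resident block
     * whose next reference lies farthest in the future (or never recurs). */
    int main(void) {
        int trace[] = {1, 2, 3, 1, 2, 4, 1, 2};
        int n = (int)(sizeof trace / sizeof trace[0]);
        int cache[WAYS], filled = 0, misses = 0;

        for (int t = 0; t < n; t++) {
            int blk = trace[t], hit = 0;
            for (int i = 0; i < filled; i++)
                if (cache[i] == blk) hit = 1;
            if (hit) continue;

            misses++;
            if (filled < WAYS) { cache[filled++] = blk; continue; }

            /* Victim = resident block with the longest reuse distance. */
            int victim = 0, farthest = -1;
            for (int i = 0; i < WAYS; i++) {
                int next = n;                    /* n means "never used again" */
                for (int u = t + 1; u < n; u++)
                    if (trace[u] == cache[i]) { next = u; break; }
                if (next > farthest) { farthest = next; victim = i; }
            }
            cache[victim] = blk;
        }
        printf("OPT misses: %d / %d references\n", misses, n);
        return 0;
    }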
Least Recently Used
• For 2-way set-associative cache
 Single bit per set indicates LRU/MRU
 Set/clear on each access
• For associativity > 2, LRU is difficult/expensive in HW
 Need to maintain history of access order
 Timestamps? How many bits?
• Must find min timestamp on each eviction
• Expensive for HW to compare them on every access
 Sorted list?
• Linked list to keep track of access order?
• Re-sorting the list on every access is expensive in HW
Pseudo LRU
• Tree-based: 3 bits for 4-way (N-1 bits for N ways, i.e., O(N))
 Value 1: MRU block is left subtree
 Value 0: MRU block is right subtree
• Advantages
 Less hardware than LRU
 Faster than LRU
(Figure: binary tree over ways A-D (Way0-Way3); the root AB/CD bit (L0) chooses between the A/B pair and the C/D pair, and the A/B bit (L1) and C/D bit (L2) choose within each pair)
Pseudo LRU
(Figure: the same 4-way tree; L0 = AB/CD bit, L1 = A/B bit, L2 = C/D bit)
• Exercise 1: L2-L1-L0 = 000 and there is a hit in Way B; what is the
new updated L2-L1-L0?
• Exercise 2: L2-L1-L0 = 001 and a way needs to be replaced; which way
would be chosen?

LRU update algorithm (on a hit; --- means unchanged):
Way hit   L2 (CD)   L1 (AB)   L0 (AB/CD)
Way A     ---       1         1
Way B     ---       0         1
Way C     1         ---       0
Way D     0         ---       0

Replacement decision (X means don't care):
L2 (CD)   L1 (AB)   L0 (AB/CD)   Way to replace
X         0         0            Way A
X         1         0            Way B
0         X         1            Way C
1         X         1            Way D
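These tables translate directly into a few bit operations. Below is a hedged C sketch of the update and victim-selection rules for one 4-way set, following the slide's convention that a bit value of 1 means the MRU block is in the left subtree; the packing and names are illustrative.

    /* PLRU state for one 4-way set, packed into one word:
     * bit 0 = L0 (AB/CD), bit 1 = L1 (A/B), bit 2 = L2 (C/D).
     * Ways: 0 = A, 1 = B, 2 = C, 3 = D. */

    void plru_update(unsigned *bits, int way_hit) {
        switch (way_hit) {
        case 0: *bits |=  (1u << 0); *bits |=  (1u << 1); break; /* A: L0=1, L1=1 */
        case 1: *bits |=  (1u << 0); *bits &= ~(1u << 1); break; /* B: L0=1, L1=0 */
        case 2: *bits &= ~(1u << 0); *bits |=  (1u << 2); break; /* C: L0=0, L2=1 */
        case 3: *bits &= ~(1u << 0); *bits &= ~(1u << 2); break; /* D: L0=0, L2=0 */
        }
    }

    int plru_victim(unsigned bits) {
        if ((bits & 1u) == 0)                /* L0=0: MRU on C/D side -> evict from A/B */
            return (bits >> 1) & 1u ? 1 : 0; /* L1=1 -> Way B, L1=0 -> Way A */
        else                                 /* L0=1: MRU on A/B side -> evict from C/D */
            return (bits >> 2) & 1u ? 3 : 2; /* L2=1 -> Way D, L2=0 -> Way C */
    }

After the victim is refilled, the new block is normally treated as an access, so plru_update() would be called for it as well.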
Not Recently Used (NRU)
• A “used bit” is associated with every block
 Initially, the used bits of all blocks are reset to 0
 Whenever a block is accessed, either on a hit or miss, its
used bit is set to 1
 If on an access, all other used bits in a set are 1, they are
reset to 0 except the bit of the block that is accessed
 A replacement pointer is used to look for a victim block
 A block can be replaced if its used bit is 0
 Rotate pointer until a block with used bit = 0 is found
• Simple
• Used by Intel Itanium, Sparc T2
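A hedged sketch of this scheme for one set (the way count and data structure are assumed; real implementations pack the bits differently).

    #include <stdbool.h>

    #define WAYS 8

    struct nru_set { bool used[WAYS]; int ptr; };  /* used bits + rotating pointer */

    /* On any access (hit or miss fill), set the block's used bit; if all
     * other used bits are already 1, clear them (keeping this block's). */
    void nru_access(struct nru_set *s, int way) {
        s->used[way] = true;
        bool all_others = true;
        for (int i = 0; i < WAYS; i++)
            if (i != way && !s->used[i]) all_others = false;
        if (all_others)
            for (int i = 0; i < WAYS; i++)
                if (i != way) s->used[i] = false;
    }

    /* Victim: rotate the pointer until a block with used bit = 0 is found
     * (the access rule above guarantees at least one such block exists). */
    int nru_victim(struct nru_set *s) {
        while (s->used[s->ptr])
            s->ptr = (s->ptr + 1) % WAYS;
        int victim = s->ptr;
        s->ptr = (victim + 1) % WAYS;
        return victim;
    }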
Yet Another NRU
• Use R(eferenced) and M(odified) bits
 0 (not referenced or not modified)
 1 (referenced or modified)
• Classify blocks into
 C0: R=0, M=0
 C1: R=0, M=1
 C2: R=1, M=0
 C3: R=1, M=1
• Choose the victim from the lowest class
 (C3 > C2 > C1 > C0)
• Periodically clear R and M bits
Traditional Cache Replacement Policies
• LRU and Pseudo-LRU focus on victim selection
 They may work fine for blocks with high temporal locality
or for workloads smaller than the cache size
• However, thrashing may occur when the working set is
greater than the cache size
• Divide cache replacement into two parts
 Victim selection policy: which block to evict
 Insertion policy: where to insert incoming block
LRU Insertion Policy (LIP)
(Figure: recency stack from MRU to LRU; LIP inserts the incoming block at the LRU position) [Qureshi et al., ISCA 2007]
LRU Insertion Policy (LIP)
(Figure: a block inserted at the LRU position that is never reused ("useless") is evicted at the next eviction; a useful block is moved to the MRU position when it is reused)
Bimodal Insertion Policy (BIP)
• LIP may not age older blocks
• BIP is similar to LIP, except that BIP places some % of
incoming blocks into MRU position
• Let e = bimodal throttle parameter
    if ( rand() < e )
        Insert at MRU position;   // as in the normal LRU policy
    else
        Insert at LRU position;   // as in LIP
• Promote to MRU if reused
[Qureshi et al. ISCA 07]
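A hedged C sketch of the BIP fill decision, modeling the set as a recency stack (position 0 = MRU); the array-based stack and the use of rand() are illustrative only, not the paper's hardware mechanism.

    #include <stdlib.h>

    #define WAYS 8

    /* stack[0] is the MRU position, stack[WAYS-1] the LRU position. */
    static int stack[WAYS];

    static void insert_at_mru(int blk) {
        /* Shift everything toward LRU, dropping the old LRU victim. */
        for (int i = WAYS - 1; i > 0; i--)
            stack[i] = stack[i - 1];
        stack[0] = blk;
    }

    /* BIP fill: with small probability e insert at MRU (as plain LRU
     * would); otherwise insert at the LRU position (as LIP does). */
    void bip_fill(int blk, double e) {
        if ((double)rand() / RAND_MAX < e)
            insert_at_mru(blk);
        else
            stack[WAYS - 1] = blk;   /* overwrite the LRU slot */
    }

On a later hit, the block is promoted to the MRU position; the occasional MRU insertions are what allow BIP to adapt when the working set changes.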
Outline
• Memory hierarchy design (Sec. 2.1, Appendix B.1)
 Motivation, basic concepts
• Basic cache organization and performance (Appendix
B.2)
• Basic cache optimizations (Appendix B.3)
Improving Cache Performance
• Recall
average memory access time
= hit time + miss rate x miss penalty
• Improving cache performance by
 Reducing the miss rate
 Reducing the miss penalty
 Reducing the time to hit in the cache
Six Basic Cache Optimizations
• Larger block size
 Exploit spatial locality
 Reduce compulsory misses
 Increase capacity and conflict misses, increase miss
penalty and false sharing
• Larger total cache capacity
 Reduce capacity misses
 Increase hit time, increase power consumption
• Higher associativity
 Reduce conflict misses
 Increase hit time, increase power consumption
Six Basic Cache Optimizations
• More cache levels
• Giving priority to read misses over writes
• Avoiding address translation in cache indexing
Reducing Miss Penalty
• Average Memory Access Time (AMAT)
= Hit Time + Miss rate * Miss Penalty
= Thit(L1) + Miss%(L1) * T(memory)
• Example:
 Cache hit = 1 cycle
 Miss rate = 10% = 0.1
 Miss penalty = 300 cycles
 AMAT = 1 + 0.1 * 300 = 31 cycles
(Figure: L1$ with a 1-cycle hit time backed by main memory (DRAM) with
a 300-cycle miss penalty)
Multi-level Cache
• AMAT = Thit(L1) + Miss%(L1)* (Thit(L2) + Miss%(L2)*
(Thit(L3) + Miss%(L3)*T(memory) ) )
• Example: 14.7x speed-up over the single-level cache!
 Miss rate L1 = 10%, Thit(L1) = 1 cycle
 Miss rate L2 = 5%, Thit(L2) = 10 cycles
 Miss rate L3 = 1%, Thit(L3) = 20 cycles
 T(memory) = 300 cycles
 AMAT = 1 + 0.1 x (10 + 0.05 x (20 + 0.01 x 300)) = 2.115 cycles
(Figure: on-die L1 (1 cyc), L2 (10 cyc), and L3 (20 cyc) caches backed
by main memory (DRAM, 300 cyc))
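A small C sketch reproducing the two numbers above (the values are exactly those of the example; nothing else is assumed).

    #include <stdio.h>

    int main(void) {
        double t_l1 = 1,  m_l1 = 0.10;
        double t_l2 = 10, m_l2 = 0.05;
        double t_l3 = 20, m_l3 = 0.01;
        double t_mem = 300;

        /* AMAT = Thit(L1) + Miss%(L1)*(Thit(L2) + Miss%(L2)*(Thit(L3) + Miss%(L3)*T(memory))) */
        double amat_3 = t_l1 + m_l1 * (t_l2 + m_l2 * (t_l3 + m_l3 * t_mem));
        double amat_1 = t_l1 + m_l1 * t_mem;   /* single-level case from the previous slide */

        printf("single-level AMAT = %.3f cycles\n", amat_1);     /* 31.000 */
        printf("three-level AMAT  = %.3f cycles\n", amat_3);     /* 2.115  */
        printf("speed-up          = %.1fx\n", amat_1 / amat_3);  /* 14.7x  */
        return 0;
    }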
Priority to Read Miss over Write
• For write through caches with write buffer:
 WB may hold updated value needed on a read miss
SW 512(R0),R3 ; M[512] <- R3
LW R1,1024(R0) ; R1 <- M[1024]
LW R2,512(R0) ; R2 <- M[512]
(direct-mapped, write-thru.; 512/1024 to same block)
 Read miss waits until the write buffer is empty => increases miss penalty
 Better solution: check WB contents before read;
if no conflicts, let the memory access continue
• For write-back caches:
 Read miss writes the dirty block to memory first, then does the read
 Better: copy the dirty block to the WB, do the read, then do the write
 CPU stalls less since it restarts as soon as the read is done
Avoid Address Translation
• Virtually addressed cache:
(Figure: a virtually addressed cache is accessed with the VA directly, with TLB translation to the PA needed only on a miss; a physical cache first translates the VA to a PA through the TLB and then accesses the cache)
• Every time a process is switched, logically the cache must be
flushed; otherwise we get false hits
 Cost: time to flush + "compulsory" misses
 Alternative: add a process-identifier (PID) tag
Avoid Address Translation
• Aliases (synonyms): two different virtual addresses
mapped to same physical address
 Two different cache entries holding data for the same
physical address
 For update, must update all cache entries with same
physical address or memory becomes inconsistent
• Solutions to aliases:
 HW guarantees every cache block has a unique physical
address
 SW guarantee: aliases must agree in their lower n bits; as long as
these bits cover the index field and the cache is direct mapped,
aliases map to the same block (called page coloring)
Avoid Address Translation
• Virtually indexed, physically tagged:
 Index with physical part of address => low bits are
untranslated
 Can start tag access in parallel with translation so that
physical tag can be compared
 What if the number of sets is greater than the number of blocks
per page?
  • higher associativity
  • page coloring
(Figure: the VA's set field indexes the cache while the TLB translates
the upper bits in parallel; the resulting physical tag is then compared
with the cache tag)
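The condition above can be checked directly: indexing with untranslated bits works only when the index and offset fit inside the page offset, i.e. cache size / associativity <= page size (equivalently, number of sets <= blocks per page). A hedged C sketch with an assumed 4 KB page size:

    #include <stdio.h>
    #include <stdbool.h>

    /* True if the cache can be indexed entirely with untranslated (page-offset) bits. */
    bool index_fits_in_page(unsigned cache_size, unsigned assoc, unsigned page_size) {
        return cache_size / assoc <= page_size;
    }

    int main(void) {
        unsigned page = 4096;   /* assumed 4 KB pages */
        printf("32 KB, 8-way: %s\n",
               index_fits_in_page(32 * 1024, 8, page) ? "ok" : "needs page coloring or more ways");
        printf("32 KB, 4-way: %s\n",
               index_fits_in_page(32 * 1024, 4, page) ? "ok" : "needs page coloring or more ways");
        return 0;
    }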
Recap
• Memory hierarchy exploits program locality to
reduce AMAT
• Types of caches
 Direct mapped, set-associative, fully associative
• Cache policies
 Cache replacement: victim selection, insertion policy
 Write through vs. Write back
 Write allocate vs. No write allocate
• Basic cache optimization techniques
 For reducing miss rate, reducing miss penalty, improving
hit time