Chapter 2

CACHE BASICS
1
1977: DRAM faster than microprocessors
2
Since 1980, CPU has outpaced DRAM ...
3
How do architects address this gap?
• Programmers want unlimited amounts of memory
with low latency
• Fast memory technology is more expensive per bit
than slower memory
• Solution: organize memory system into a hierarchy
– Entire addressable memory space available in largest,
slowest memory
– Incrementally smaller and faster memories, each
containing a subset of the memory below it, proceed in
steps up toward the processor
• Temporal and spatial locality ensures that nearly all
references can be found in smaller memories
– Gives the illusion of a large, fast memory being presented
to the processor
4
Memory Hierarchy
5
Memory Hierarchy Design
• Memory hierarchy design becomes more
crucial with recent multi-core processors:
– Aggregate peak bandwidth grows with # cores:
• Intel Core i7 can generate two references per core per clock
• Four cores and 3.2 GHz clock
– 25.6 billion 64-bit data references/second +
– 12.8 billion 128-bit instruction references
– = 409.6 GB/s!
• DRAM bandwidth is only 6% of this (25 GB/s)
• Requires:
– Multi-port, pipelined caches
– Two levels of cache per core
– Shared third-level cache on chip
6
Memory Hierarchy Basics
• When a word is not found in the cache, a
miss occurs:
– Fetch word from lower level in hierarchy,
requiring a higher latency reference
– Lower level may be another cache or the main
memory
– Also fetch the other words contained within the
block
• Takes advantage of spatial locality
– Place block into cache in any location within its
set, determined by address
• block address MOD number of sets
7
Locality
• A principle that makes memory hierarchy a good idea
• If an item is referenced
– Temporal locality: it will tend to be referenced again soon
– Spatial locality: nearby items will tend to be referenced soon
• Our initial focus: two levels (upper, lower)
– Block: minimum unit of data
– Hit: data requested is in the upper level
– Miss: data requested is not in the upper level
8
Memory Hierarchy Basics
• Note that speculative and multithreaded
processors may execute other instructions
during a miss
– Reduces performance impact of misses
9
Cache
• Two issues
– How do we know if a data item is in the cache?
– If it is, how do we find it?
• Our first example
– Block size is one word of data
– "Direct mapped"
• For each item of data at the lower level, there is exactly one
location in the cache where it might be
• i.e., many items at the lower level share locations in the upper level
10
Direct mapped cache
• Mapping
– Cache address is memory address modulo the number of blocks
in the cache
– (Block address) modulo (#Blocks in cache)
[Figure: an 8-block direct-mapped cache (indices 000-111) and main memory;
memory addresses 00001, 01001, 10001, 11001 map to cache block 001, and
00101, 01101, 10101, 11101 map to cache block 101]
(a small C sketch of the mapping follows)
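A minimal C sketch of this placement rule, assuming the 8-block cache pictured above (the addresses are the ones shown in the figure):

/* direct_mapped.c -- (block address) modulo (number of blocks) */
#include <stdio.h>

#define NUM_BLOCKS 8   /* cache blocks, as in the figure above */

unsigned cache_index(unsigned block_address) {
    return block_address % NUM_BLOCKS;
}

int main(void) {
    /* Memory addresses from the figure: binary 00001, 00101, ..., 11101. */
    unsigned addrs[] = {1, 5, 9, 13, 17, 21, 25, 29};
    for (int i = 0; i < 8; i++)
        printf("memory block %2u -> cache index %u\n",
               addrs[i], cache_index(addrs[i]));
    /* Addresses ending in 001 land in index 1; those ending in 101 in index 5. */
    return 0;
}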
11
Direct mapped cache
• What kind of locality are we taking advantage of?
12
Direct mapped cache
• Taking advantage of spatial locality
• (16 KB cache, 256 blocks, 16 words/block)
13
Block Size vs. Performance
14
Block Size vs. Cache Measures
• Increasing block size generally increases miss penalty and
decreases miss rate
[Figure: three sketches vs. block size — miss penalty rises with block size,
miss rate falls, and combining the two gives average memory access time,
which reaches a minimum at an intermediate block size]
15
Four Questions for Memory Hierarchy
Designers
• Q1: Where can a block be placed in the upper
level? (Block placement)
• Q2: How is a block found if it is in the upper level?
(Block identification)
• Q3: Which block should be replaced on a miss?
(Block replacement)
• Q4: What happens on a write?
(Write strategy)
16
Q1: Where can a block be placed in the
upper level?
• Direct Mapped: Each block has only one place that it can appear
in the cache.
• Fully associative: Each block can be placed anywhere in the
cache.
• Set associative: Each block can be placed in a restricted set of
places in the cache.
– If there are n blocks in a set, the cache placement is called
n-way set associative
17
Associativity Examples
Fully associative:
Block 12 can go anywhere
Direct mapped:
Block no. = (Block address) mod
(No. of blocks in cache)
Block 12 can go only into block 4
(12 mod 8)
Set associative:
Set no. = (Block address) mod
(No. of sets in cache)
Block 12 can go anywhere in set 0
(12 mod 4)
18
Direct Mapped Cache
19
2 Way Set Associative Cache
20
Fully Associative Cache
21
An implementation of a
four-way set associative cache
22
Performance
23
Q2: How Is a Block Found If It Is in the
Upper Level?
• The address can be divided into two main parts
– Block offset: selects the data from the block
offset size = log2(block size)
– Block address: tag + index
• index: selects the set in the cache
index size = log2(#blocks / associativity)
• tag: compared to the tag in the cache to determine a hit
tag size = address size - index size - offset size
[Address layout: Tag | Index | Block offset]
(see the C sketch below)
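A rough illustration of extracting the three fields in C; the block size (64 bytes) and number of sets (128) are assumed example values, not taken from the slides:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical parameters: 32-bit byte addresses, 64-byte blocks, 128 sets. */
#define OFFSET_BITS 6u    /* log2(64)  */
#define INDEX_BITS  7u    /* log2(128) */

/* Split an address into tag, index (set number), and block offset. */
void split_address(uint32_t addr, uint32_t *tag, uint32_t *index, uint32_t *offset) {
    *offset = addr & ((1u << OFFSET_BITS) - 1);
    *index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    *tag    = addr >> (OFFSET_BITS + INDEX_BITS);
}

int main(void) {
    uint32_t tag, index, offset;
    split_address(0x12345678u, &tag, &index, &offset);
    printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
    return 0;
}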
24
Q3: Which Block Should be Replaced on a
Miss?
• Easy for Direct Mapped
• Set Associative or Fully Associative:
– Random: easier to implement
– Least Recently Used (LRU): harder to implement, may be
approximated
• Miss rates for caches with different sizes, associativities, and
replacement algorithms:

            2-way             4-way             8-way
  Size      LRU     Random    LRU     Random    LRU     Random
  16 KB     5.18%   5.69%     4.67%   5.29%     4.39%   4.96%
  64 KB     1.88%   2.01%     1.54%   1.66%     1.39%   1.53%
  256 KB    1.15%   1.17%     1.13%   1.13%     1.12%   1.12%

For caches with low miss rates, random is almost as good as LRU.
25
Q4: What Happens on a Write?
26
Q4: What Happens on a Write?
• Since data does not have to be brought into the cache on a write
miss, there are two options:
– Write allocate
• The block is brought into the cache on a write miss
• Used with write-back caches
• Hope subsequent writes to the block hit in the cache
– No-write allocate
• The block is modified in memory, but not brought into
the cache
• Used with write-through caches
• Writes have to go to memory anyway, so why bring
the block into the cache?
27
Hits vs. misses
• Read hits
– This is what we want!
• Read misses
– Stall the CPU, fetch the block from memory, deliver it to the cache, restart
• Write hits
– Can replace data in cache and memory (write-through)
– Write the data only into the cache (write it back to memory later)
• Write misses
– Read the entire block into the cache, then write the word
28
Cache Misses
• On cache hit, CPU proceeds normally
• On cache miss
– Stall the CPU pipeline
– Fetch block from next level of hierarchy
– Instruction cache miss
• Restart instruction fetch
– Data cache miss
• Complete data access
29
Cache Measures
• Hit rate: fraction found in the cache
– Miss rate = 1 - Hit Rate
• Hit time: time to access the cache
• Miss penalty: time to replace a block from lower level,
– access time: time to access lower level
– transfer time: time to transfer block
CPU time = (CPU execution cycles + Memory stall cycles) x Cycle time

Memory stall cycles = (Memory accesses / Program) x Miss rate x Miss penalty
                    = (Instructions / Program) x (Misses / Instruction) x Miss penalty

(a small C sketch of this calculation follows)
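A minimal sketch of the CPU-time formula in C, using made-up example numbers (the values below are assumptions, not from the slides):

#include <stdio.h>

int main(void) {
    /* Illustrative values only (assumptions). */
    double instructions         = 1e9;   /* instruction count              */
    double base_cpi             = 1.0;   /* CPI without memory stalls      */
    double mem_accesses_per_ins = 1.25;  /* instruction fetch + data refs  */
    double miss_rate            = 0.02;
    double miss_penalty         = 100.0; /* cycles                         */
    double cycle_time_ns        = 0.5;

    /* Memory stall cycles = accesses x miss rate x miss penalty */
    double stall_cycles = instructions * mem_accesses_per_ins
                        * miss_rate * miss_penalty;
    double exec_cycles  = instructions * base_cpi;

    /* CPU time = (execution cycles + memory stall cycles) x cycle time */
    double cpu_time_s = (exec_cycles + stall_cycles) * cycle_time_ns * 1e-9;

    printf("CPU time = %.3f s\n", cpu_time_s);
    return 0;
}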
30
Improving Cache Performance
• Average memory-access time
= Hit time + Miss rate x Miss penalty
(see the sketch below)
• Improve performance by:
1. Reducing the miss rate,
2. Reducing the miss penalty, or
3. Reducing the time to hit in the cache.
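The formula as a tiny C helper, with assumed example inputs (1-cycle hit, 5% miss rate, 50-cycle penalty):

#include <stdio.h>

/* Average memory access time = hit time + miss rate x miss penalty. */
double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    printf("AMAT = %.2f cycles\n", amat(1.0, 0.05, 50.0));   /* 3.50 cycles */
    return 0;
}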
31
Types of misses
• Compulsory
– Very first access to a block (cold-start miss)
• Capacity
– Cache cannot contain all blocks needed
• Conflict
– Too many blocks mapped onto the same
set
32
How do you solve
• Compulsory misses?
– Larger blocks (with side effects!)
• Capacity misses?
– Not many options: enlarge the cache,
otherwise face "thrashing" and the computer runs
at the speed of the lower-level memory, or slower!
• Conflict misses?
– Fully associative cache, at a cost in
hardware, and it may slow the processor!
33
Basic cache optimizations:
– Larger block size
• Reduces compulsory misses
• Increases capacity and conflict misses, increases miss
penalty
– Larger total cache capacity to reduce miss rate
• Increases hit time, increases power consumption
– Higher associativity
• Reduces conflict misses
• Increases hit time, increases power consumption
– Higher number of cache levels
• Reduces overall memory access time
– Giving priority to read misses over writes
• Reduces miss penalty
34
Other Optimizations: Victim Cache
• Add a small fully associative victim cache to place data discarded from
regular cache
• When data not found in cache, check victim cache
• 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct
mapped data cache
• Get access time of direct mapped with reduced miss rate
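A rough sketch of the lookup order with a victim cache; the structures and sizes below are hypothetical, not a real design:

#include <stdbool.h>
#include <stdint.h>

#define PRIMARY_LINES  1024u  /* direct-mapped primary cache (assumed size) */
#define VICTIM_ENTRIES 4u     /* small, fully associative victim cache      */

struct pline { bool valid; uint32_t tag; };         /* primary: tag per index   */
struct vline { bool valid; uint32_t block_addr; };  /* victim: full block addr  */

static struct pline primary[PRIMARY_LINES];
static struct vline victim[VICTIM_ENTRIES];

/* On a primary-cache miss, search the victim cache before the next level. */
bool lookup(uint32_t block_addr) {
    uint32_t index = block_addr % PRIMARY_LINES;
    uint32_t tag   = block_addr / PRIMARY_LINES;

    if (primary[index].valid && primary[index].tag == tag)
        return true;                      /* primary hit                          */

    for (uint32_t i = 0; i < VICTIM_ENTRIES; i++)
        if (victim[i].valid && victim[i].block_addr == block_addr)
            return true;                  /* victim hit: hardware swaps the lines */

    return false;                         /* true miss: fetch from next level     */
}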
35
Other Optimizations: Reducing Misses by HW
Prefetching of Instruction & Data
• E.g., instruction prefetching
– Alpha 21064 fetches 2 blocks on a miss
– Extra block placed in stream buffer
– On miss, check stream buffer
– Jouppi [1990]: 1 data stream buffer caught 25% of misses from a 4 KB
cache; 4 streams caught 43%
• Works with data blocks too:
– Palacharla & Kessler [1994]: for scientific programs, 8
streams caught 50% to 70% of the misses from two 64 KB, 4-way set
associative caches
• Prefetching relies on extra memory bandwidth that can be used
without penalty
36
Other Optimizations: Reducing Misses by
Compiler Optimizations
• Instructions
– Reorder procedures in memory so as to reduce misses
– Profiling to look at conflicts
– McFarling [1989] reduced cache misses by 75% on an 8 KB direct-
mapped cache with 4-byte blocks
• Data
– Merging Arrays: improve spatial locality by using a single array of
compound elements vs. 2 arrays
– Loop Interchange: change nesting of loops to access data in the
order stored in memory
– Loop Fusion: combine 2 independent loops that have the same
looping and some variables overlap
– Blocking: improve temporal locality by accessing "blocks" of
data repeatedly vs. going down whole columns or rows
37
Merging Arrays Example
• Problem: referencing multiple arrays in the same
dimension, with the same index, at the same time can
lead to conflict misses.
• Solution: merge the independent arrays into a compound
array.
/* Before */
int val[SIZE];
int key[SIZE];
/* After */
struct merge {
int val;
int key;
};
struct merge merged_array[SIZE];
38
Miss Rate Reduction Techniques: Compiler
Optimizations
– Loop Interchange
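The original slide shows a code figure here; as a stand-in, a minimal C sketch of loop interchange (the array name and sizes are assumptions):

#define ROWS 5000
#define COLS 100
static int x[ROWS][COLS];

void scale_before(void) {
    /* Before: the inner loop strides through memory by COLS elements,
       touching a different cache block on nearly every access. */
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            x[i][j] = 2 * x[i][j];
}

void scale_after(void) {
    /* After interchange: the inner loop walks consecutive elements of a
       row (stride 1), so each fetched block is fully used. */
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}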
39
Miss Rate Reduction Techniques: Compiler
Optimizations
– Loop Fusion
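Again the slide's code figure is missing; a minimal C sketch of loop fusion (array names and size are assumptions):

#define N 1000
static double a[N], b[N], c[N], d[N];

void before(void) {
    /* Two separate passes: the second loop may re-miss on a[] and c[]
       after the first loop has evicted them from the cache. */
    for (int i = 0; i < N; i++)
        a[i] = b[i] * c[i];
    for (int i = 0; i < N; i++)
        d[i] = a[i] + c[i];
}

void after(void) {
    /* Fused: a[i] and c[i] are reused while still resident in the cache. */
    for (int i = 0; i < N; i++) {
        a[i] = b[i] * c[i];
        d[i] = a[i] + c[i];
    }
}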
40
Blocking
• Problem: when accessing multiple multi-dimensional arrays
(e.g., for matrix multiplication), capacity misses occur if not
all of the data can fit into the cache.
• Solution: divide the matrix into smaller submatrices (or
blocks) that can fit within the cache
• The size of the block chosen depends on the size of the cache
• Blocking can only be used for certain types of algorithms
(see the sketch below)
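A minimal C sketch of blocked matrix multiplication; the matrix dimension N and block size B are assumed example values chosen so a B x B submatrix fits in the cache:

#define N 512     /* matrix dimension (assumption)      */
#define B 32      /* block size, chosen to fit in cache */

static double x[N][N], y[N][N], z[N][N];   /* x starts zero-initialized */

/* Blocked matrix multiply x += y * z: each B x B submatrix of y and z is
   reused many times while it is still resident in the cache. */
void matmul_blocked(void) {
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}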
41
Summary of Compiler Optimizations to
Reduce Cache Misses
[Figure: bar chart of performance improvement (roughly 1x to 3x) from
merged arrays, loop interchange, loop fusion, and blocking on the
benchmarks vpenta (nasa7), gmty (nasa7), tomcatv, btrix (nasa7),
mxm (nasa7), spice, cholesky (nasa7), and compress]
42
Decreasing miss penalty
with multi-level caches
• Add a second-level cache:
– Often the primary cache is on the same chip as the processor
– Use SRAMs to add another cache above primary memory
(DRAM)
– Miss penalty goes down if the data is in the 2nd-level cache
• Using multilevel caches:
– Try to optimize the hit time on the 1st-level cache
– Try to optimize the miss rate on the 2nd-level cache
43
Multilevel Caches
• Primary cache attached to CPU
– Small, but fast
• Level-2 cache services misses from the primary cache
– Larger, slower, but still faster than main memory
• Main memory services L-2 cache misses
• Some high-end systems include an L-3 cache
44
Virtual Memory
• Use main memory as a “cache” for secondary
(disk) storage
– Managed jointly by CPU hardware and the
operating system (OS)
• Programs share main memory
– Each gets a private virtual address space
holding its frequently used code and data
– Protected from other programs
• CPU and OS translate virtual addresses to physical
addresses
– VM “block” is called a page
– VM translation “miss” is called a page fault
45
Virtual Memory
• Main memory can act as a cache for the secondary storage (disk)
[Figure: virtual addresses mapped by address translation to physical
addresses in main memory or to disk addresses]
• Advantages:
– illusion of having more physical memory
– program relocation
– protection
46
Pages: virtual memory blocks
• Page faults: the data is not in memory, so retrieve it from disk
– huge miss penalty, thus pages should be fairly large (e.g., 4 KB)
• What placement scheme (direct mapped, set associative, or fully associative)?
– reducing page faults is important (LRU is worth the price)
– the faults can be handled in software instead of hardware
– using write-through is too expensive, so we use write-back
[Figure: a 32-bit virtual address split into a virtual page number (bits 31-12)
and a page offset (bits 11-0); translation maps the virtual page number to a
physical page number, which is concatenated with the unchanged page offset
to form the physical address. Analogy: book title -> library shelf location]
(a small C sketch of the translation follows)
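A minimal sketch of page-table translation in C, assuming 4 KB pages and a flat single-level page table (the table size and structure are assumptions for illustration):

#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE  4096u          /* 4 KB pages, as in the slide            */
#define NUM_VPAGES (1u << 20)     /* assumed 32-bit VA -> 20-bit page number */

struct pte { bool valid; uint32_t ppn; };     /* one entry per virtual page */
static struct pte page_table[NUM_VPAGES];

/* Translate a virtual address; returns false on a page fault (the OS would
   then bring the page in from disk and update the page table). */
bool translate(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn    = vaddr / PAGE_SIZE;      /* virtual page number        */
    uint32_t offset = vaddr % PAGE_SIZE;      /* page offset, kept as-is    */

    if (!page_table[vpn].valid)
        return false;                         /* page fault                 */

    *paddr = page_table[vpn].ppn * PAGE_SIZE + offset;
    return true;
}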
47
Page Tables (Fully Associative Search Time)
48
A Program’s State
• Page Table
• PC
• Registers
49
Page Tables
50
Page Faults
• Replacement policy
• Handle with hardware or software?
– Disk access time is large relative to the overhead of a
software-based solution
• LRU
– Costly to keep track of every page exactly
– Mechanism?
• Approximate with a use (reference) bit per page that is periodically cleared
51
Making Address Translation Fast
• Page tables are in memory
• Memory access by a program takes twice as long:
– Obtain the physical address
– Get the data
• Make use of locality of reference
– Temporal & spatial (words in a page)
• Solution
– Special cache
• Keeps track of recently used translations
• Translation Lookaside Buffer (TLB)
– Translation cache
– Like the piece of paper where you record the location of books
you need from the library
52
Making Address Translation Fast
• A cache for address translations: the translation lookaside buffer
• Typical values:
16-512 entries,
miss rate: 0.01% - 1%
miss penalty: 10 - 100 cycles
(a small C sketch of a TLB lookup follows)
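A rough sketch of a fully associative TLB lookup in C; the entry count (64) and page size are assumptions within the typical ranges above, and a real TLB does the search in parallel hardware rather than a loop:

#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64u        /* within the 16-512 range above */
#define PAGE_SIZE   4096u

struct tlb_entry { bool valid; uint32_t vpn; uint32_t ppn; };
static struct tlb_entry tlb[TLB_ENTRIES];   /* fully associative for simplicity */

/* Try to translate without touching the page table in memory. */
bool tlb_translate(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn = vaddr / PAGE_SIZE;
    for (uint32_t i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn) {          /* TLB hit */
            *paddr = tlb[i].ppn * PAGE_SIZE + vaddr % PAGE_SIZE;
            return true;
        }
    return false;   /* TLB miss: walk the page table, then refill an entry */
}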
53
TLBs and caches
[Flowchart: handling a memory access with a TLB and cache. The virtual
address is looked up in the TLB; a TLB miss raises an exception, while a
TLB hit yields the physical address. For a read, the cache is probed (a
miss stalls while the block is read) and the data is delivered to the CPU.
For a write, the write-access bit is checked (raising a write-protection
exception if it is not set); on a cache hit the data is written into the
cache, the dirty bit is updated, and the data and address are placed in
the write buffer]
54
TLBs and Caches
55
Some Issues
• Processor speeds continue to increase very fast
— much faster than either DRAM or disk access times
[Figure: processor vs. memory performance over time (log scale, 1 to
100,000): the CPU curve grows far faster than the memory curve]
• Design challenge: dealing with this growing disparity
– Prefetching? 3rd-level caches and more? Memory design?
56
Memory Technology
• Performance metrics
– Latency is the main concern of caches
– Bandwidth is the main concern of multiprocessors
and I/O
– Access time
• Time between a read request and when the desired word arrives
• DRAM is used for main memory, SRAM is
used for caches
57
Latches and Flip-flops
[Figure: a D latch with data input D, clock input C, and outputs Q and Q-bar]
58
Latches and Flip-flops
[Figure: a D flip-flop built from two D latches in series (master and slave),
clocked on opposite phases of C, with output Q]
59
Latches and Flip-flops
Latch: the output changes whenever the inputs change and the clock is asserted
Flip-flop: the state changes only on a clock edge
(edge-triggered methodology)
60
SRAM
61
SRAM vs. DRAM
Which one has better memory density?
Static RAM (SRAM): the value stored in a cell is kept on a pair
of inverting gates
Dynamic RAM (DRAM): the value kept in a cell is stored as a
charge on a capacitor
DRAMs use only a single transistor per bit of storage;
by comparison, SRAMs require four to six transistors per bit
Which one is faster?
In DRAMs, the charge is stored on a capacitor, so it cannot be
kept indefinitely and must periodically be refreshed (hence "dynamic")
Every ~8 ms
Each row can be refreshed simultaneously
Must be re-written after being read
62
Memory Technology
• Amdahl:
– Memory capacity should grow linearly with processor
speed (followed this trend for about 20 years)
– Unfortunately, memory capacity and speed have not
kept pace with processors
– Fourfold improvement every 3 years (originally)
– Doubled capacity from 2006-2010
63
Memory Optimizations
64
Memory Technology
• Some optimizations:
– Synchronous DRAM
• Added clock to DRAM interface
• Burst mode with critical word first
– Wider interfaces
• 4-bit transfer mode originally
• In 2010, up to 16-bit buses
– Double data rate (DDR)
• Transfer data on both rising and falling edge
65
Memory Optimizations
66
Memory Optimizations
• DDR:
– DDR2
• Lower power (2.5 V -> 1.8 V)
• Higher clock rates (266 MHz, 333 MHz, 400 MHz)
– DDR3
• 1.5 V
• 800 MHz
– DDR4 (scheduled for production in 2014)
• 1-1.2 V
• 1600 MHz
• GDDR5 is graphics memory based on
DDR3
67
Memory Optimizations
• Graphics memory:
– Achieve 2-5 X bandwidth per DRAM vs.
DDR3
• Wider interfaces (32 vs. 16 bit)
• Higher clock rate
– Possible because they are attached via soldering instead of socketed
Dual Inline Memory Modules (DIMMs)
• Reducing power in SDRAMs:
– Lower voltage
– Low power mode (ignores clock, continues
to refresh)
68
Virtual Machines
• First developed in 1960s
• Regained popularity recently
– Need for isolation and security in modern
systems
– Failures in security and reliability of standard
operating systems
– Sharing of single computer among many
unrelated users (datacenter, cloud)
– Dramatic increase in raw speed of processors
• Overhead of VMs now more acceptable
69
Virtual Machines
• Emulation methods that provide a standard
software interface
– IBM VM/370, VMware, ESX Server, Xen
• Create the illusion of having an entire computer to
yourself including a copy of the OS
• Allows different ISAs and operating systems to be
presented to user programs
– “System Virtual Machines”
– SVM software is called “virtual machine
monitor” or “hypervisor”
– Individual virtual machines running under the
monitor are called "guest VMs"
70
Impact of VMs on Virtual Memory
• Each guest OS maintains its own set of
page tables
– VMM adds a level of memory between
physical and virtual memory called “real
memory”
– VMM maintains shadow page table that
maps guest virtual addresses to physical
addresses
• Requires VMM to detect guest’s changes to its own page
table
• Occurs naturally if accessing the page table pointer is a
privileged operation
71
Assume 75% instruction, 25% data access
72
73
Cost of Misses, CPU time
74
75
76
Example
• CPI of 1.0 on a 5 GHz machine with a 2% miss rate and 100 ns
main memory access
• Adding a 2nd-level cache with a 5 ns access time decreases the miss
rate to 0.5%
• How much faster is the new configuration?

Main memory miss penalty = 100 ns / (0.2 ns/clock cycle) = 500 clock cycles

Total CPI = Base CPI + Memory stall cycles per instruction
Total CPI = 1.0 + 2% x 500 = 11.0

L2 miss penalty = 5 ns / (0.2 ns/clock cycle) = 25 clock cycles

Total CPI = 1 + Primary stalls per instruction + Secondary stalls per instruction
Total CPI = 1 + 2% x 25 + 0.5% x 500 = 1 + 0.5 + 2.5 = 4.0

Speedup = 11 / 4 = 2.75 ≈ 2.8
(a small C version of this calculation follows)
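The same worked example expressed as a short C program, using only the numbers given above:

#include <stdio.h>

int main(void) {
    /* Numbers from the example above. */
    double cycle_ns     = 0.2;     /* 5 GHz clock                    */
    double base_cpi     = 1.0;
    double l1_miss_rate = 0.02;    /* 2% of instructions miss in L1  */
    double mem_ns       = 100.0;   /* main-memory access time        */
    double l2_ns        = 5.0;     /* L2 access time                 */
    double l2_miss_rate = 0.005;   /* 0.5% still go to main memory   */

    double mem_penalty = mem_ns / cycle_ns;   /* 500 cycles */
    double l2_penalty  = l2_ns  / cycle_ns;   /*  25 cycles */

    double cpi_no_l2   = base_cpi + l1_miss_rate * mem_penalty;      /* 11.0 */
    double cpi_with_l2 = base_cpi + l1_miss_rate * l2_penalty
                       + l2_miss_rate * mem_penalty;                 /*  4.0 */

    printf("speedup = %.2f\n", cpi_no_l2 / cpi_with_l2);             /* 2.75 */
    return 0;
}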
77