CS 61C:
Great Ideas in Computer Architecture
Caches
Instructor:
David A. Patterson
http://inst.eecs.Berkeley.edu/~cs61c/sp12
New-School Machine Structures
(It’s a bit more complicated!)
Software – harness parallelism & achieve high performance:
• Parallel Requests: assigned to computer, e.g., Search “Katz”
• Parallel Threads: assigned to core, e.g., Lookup, Ads
• Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data: >1 data item @ one time, e.g., Add of 4 pairs of words
• Hardware descriptions: all gates @ one time
• Programming Languages
Hardware levels: Warehouse Scale Computer → Smart Phone / Computer → Core → Instruction Unit(s) and Functional Unit(s) (A0+B0, A1+B1, A2+B2, A3+B3) → Memory, Input/Output → Cache Memory (today’s lecture) → Logic Gates
Review
• Time (seconds/program) is the measure of performance
      Seconds/Program = Instructions/Program × Clock cycles/Instruction × Seconds/Clock cycle
• Benchmarks stand in for real workloads as a standardized measure of relative performance
• Power of increasing concern, and being added to
benchmarks
• Time measurement via clock cycles, machine specific
• Profiling tools as a way to see where your program spends its time
• Don’t optimize prematurely!
Agenda
• Memory Hierarchy Analogy
• Memory Hierarchy Overview
• Administrivia
• Caches
• Fully Associative, N-Way Set Associative, Direct Mapped Caches
• Cache Performance
• Multilevel Caches
Conventional Wisdom (CW)
in Computer Architecture
• Old CW: Power cheap, Transistors expensive
• New CW: “Power wall” – Power expensive, transistors cheap
  – Can put more on chip than can turn on
• Old: Multiplies slow, Memory access fast
• New: “Memory wall” – Memory slow, multiplies fast
  – 200 clocks to memory, 4 clocks for FP multiply
Big Idea: Memory Hierarchy
[Diagram: pyramid with the Processor at the top above Level 1, Level 2, Level 3, ..., Level n; higher levels in the memory hierarchy are closer to the processor; increasing distance from the processor means decreasing speed and increasing size of memory at each level.]
As we move to deeper levels the latency goes up and price per bit goes down. Why?
Student Roulette
Library Analogy
• Writing a report based on books on reserve
– E.g., works of J.D. Salinger
• Go to library to get reserved book and place on
desk in library
• If need more, check them out and keep on desk
– But don’t return earlier books since might need them
• You hope this collection of ~10 books on your desk is enough to write the report, despite those 10 being only 0.00001% of the books in the UC Berkeley libraries
Principle of Locality
• Principle of Locality: Programs access small
portion of address space at any instant of time
• What program structures lead to locality in
code?
Student Roulette
How does hardware exploit principle
of locality?
• Offer a hierarchy of memories where
– closest to processor is fastest
(and most expensive per bit so smallest)
– furthest from processor is largest
(and least expensive per bit so slowest)
• Goal is to create illusion of memory almost as
fast as fastest memory and almost as large as
biggest memory of the hierarchy
A Cache
[Diagram: the on-chip CPU and Cache exchange 32-bit data and 32-bit addresses each cycle; the Cache connects over a bus to DRAM main memory.]
• Processor requests 32-bit words
• Cache controller checks address from CPU to see if requested word is in the cache
• If not, go to memory and load the word into the cache, kicking out some other word
  – “Bus” is the name for the wires connecting processor to memory
• Speedup: a cache access typically takes 1 or 2 clock cycles, vs. 100-200 to DRAM
Anatomy of a Cache
[Diagram: the Processor sends a 32-bit Address to the Cache and exchanges 32-bit Data with it; the Cache holds Tag and Data entries and, on a miss, sends a 32-bit Address to Memory and receives 32-bit Data back.]
• Operations:
  1. Cache Hit
  2. Cache Miss
  3. Refill cache from memory
• Cache needs Address Tag to decide if Processor Address is a Cache Hit or Cache Miss
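A software sketch of the hit/miss/refill flow above, for the design described so far (one 32-bit tag per one-word entry, every tag checked). All names and sizes here are assumptions for illustration; real hardware compares tags in parallel and also keeps a valid bit, which is introduced later in the lecture.

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_ENTRIES 8                    /* tiny illustrative cache            */

    static uint32_t tags[NUM_ENTRIES];       /* one 32-bit tag per one-word entry  */
    static uint32_t data[NUM_ENTRIES];

    /* Stand-in for the slow path to DRAM memory. */
    static uint32_t mem_read(uint32_t addr) { return addr * 2u; }

    /* Return the word at addr, refilling the cache on a miss. */
    static uint32_t cache_read(uint32_t addr) {
        uint32_t tag = addr;                      /* one-word blocks: address serves as tag */
        for (int i = 0; i < NUM_ENTRIES; i++)
            if (tags[i] == tag)
                return data[i];                   /* 1. Cache Hit                        */

        uint32_t word = mem_read(addr);           /* 2. Cache Miss: fetch from memory    */
        int victim = (int)(tag % NUM_ENTRIES);    /* 3. Refill: kick out some other word */
        tags[victim] = tag;
        data[victim] = word;
        return word;
    }

    int main(void) {
        uint32_t first  = cache_read(100);        /* miss: loads the word into the cache */
        uint32_t second = cache_read(100);        /* hit: served from the cache          */
        printf("%u %u\n", (unsigned)first, (unsigned)second);
        return 0;
    }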
Hardware Cost of Cache
[Diagram: Processor, Cache (Tag + Data), and Memory connected by 32-bit address paths; with 4-word blocks the cache data path shown is 128 bits wide, with 32-bit data to/from Memory.]
• Need 32-bit Tag for every 32 bits of data
• Optimization: 1 Tag for 4 (or more) words
  – Group of words called a “cache block”
  – ¼ number tags
• Also can make address tag 2 bits narrower since block 4X larger
Big Idea: Locality
• Temporal Locality (locality in time)
– Go back to same book on desktop multiple times
– If a memory location is referenced then it will tend to
be referenced again soon
• Spatial Locality (locality in space)
– When go to book shelf, pick up multiple books on J.D.
Salinger since library stores related books together
– If a memory location is referenced, the locations with
nearby addresses will tend to be referenced soon
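A tiny C example of both kinds of locality (the array and its size are arbitrary, chosen only for illustration): the loop instructions and the variables sum and i are reused on every iteration (temporal locality), while a[i] touches consecutive addresses (spatial locality).

    #include <stdio.h>

    int main(void) {
        static int a[1024];
        long sum = 0;

        for (int i = 0; i < 1024; i++)
            a[i] = i;              /* sequential stores: spatial locality                */

        for (int i = 0; i < 1024; i++)
            sum += a[i];           /* sum, i, and the loop code reused every iteration:  */
                                   /* temporal locality; a[i] sequential: spatial        */

        printf("%ld\n", sum);
        return 0;
    }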
Principle of Locality
• Principle of Locality: Programs access small
portion of address space at any instant of time
• What program structures lead to temporal
and spatial locality in code?
• In data?
Student Roulette
Administrivia
• Lab #6 posted
• Hundreds of students using GitHub successfully
– Will lose 2 points if GSIs need to regrade due to Git mistakes
• Project #2, Part 2 Due Sunday @ 11:59:59
• No Homework this week!
• Midterm in 2 weeks:
– TA Review: Su, Mar 4, starting 2 PM, 2050 VLSB
– Exam: Tu, Mar 6, 6:40-9:40 PM, 2050 VLSB (room change)
– Covers everything through lecture Tue Feb 28
– Closed book, can bring one sheet of notes, both sides
– Copy of Green Card will be supplied
– No phones, calculators, …; just bring pencils & eraser
Project 2, Part 1 Scores
Avg: 10.7 pts with 244 submissions
61C in the News
Australian and American physicists have built a working transistor from a single phosphorus atom embedded in a silicon crystal. “It shows that Moore’s Law can be scaled toward atomic scales in silicon.” … Currently, the smallest dimension in state-of-the-art computers made by Intel is 22 nm, less than 100 atoms in diameter.
Moore’s Law refers to technology improvements by the semiconductor industry that have doubled the number of transistors on a silicon chip roughly every 18 months for the past half-century. That has led to accelerating increases in performance and declining prices.
“Physicists Create a Working Transistor From a Single Atom,” by John Markoff, New York Times, February 20, 2012
Agenda
• Memory Hierarchy Analogy
• Memory Hierarchy Overview
• Administrivia
• Caches
• Fully Associative, N-Way Set Associative, Direct Mapped Caches
• Cache Performance
• Multilevel Caches
Hardware Cost of Cache
[Diagram: the Processor sends a 32-bit Address to a Cache organized as two sets, each with its own Tag and Data; 32-bit address and 32-bit/128-bit data paths connect Processor, Cache, and Memory.]
• Need to compare every tag to the Processor address
• Comparators are expensive
• Optimization: 2 sets => ½ comparators
• 1 Address bit selects which set
Processor Address Fields used by
Cache Controller
• Block Offset: Byte address within block
• Index: Selects which set
• Tag: Remaining portion of processor address
| Tag | Index | Block offset |
• Size of Index = log2(number of blocks)
• Size of Tag = Address size – Size of Index – log2(number of bytes/block)
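A concrete sketch in C of how these fields are carved out of an address; the cache geometry (16-byte blocks, 64 blocks, direct mapped), the macro names, and the example address are assumptions for illustration only.

    #include <stdint.h>
    #include <stdio.h>

    #define OFFSET_BITS 4u              /* log2(16 bytes per block)          */
    #define INDEX_BITS  6u              /* log2(64 blocks, direct mapped)    */

    int main(void) {
        uint32_t addr   = 0x12345678u;                                  /* example address */
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);             /* low 4 bits      */
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); /* next 6 bits */
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);           /* remaining 22    */
        printf("tag=0x%x index=%u offset=%u\n",
               (unsigned)tag, (unsigned)index, (unsigned)offset);
        return 0;
    }

With these assumed parameters, Size of Index = log2(64) = 6 bits and Size of Tag = 32 - 6 - log2(16) = 22 bits, matching the formula above.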
What is limit to number of sets?
• Can save more comparators if have more than
2 sets
• Limit: As Many Sets as Cache Blocks
• Called “Direct Mapped” Design
| Tag | Index | Block offset |
One More Detail: Valid Bit
• When start a new program, cache does not
have valid information for this program
• Need an indicator whether this tag entry is
valid for this program
• Add a “valid bit” to the cache entry
– 0 => cache miss, even if by chance address = tag
– 1 => cache hit if processor address = tag
Direct Mapped Cache Example
• One word blocks, cache size = 1K words (or 4KB)
[Diagram: the 32-bit address is split into Tag (bits 31-12, 20 bits), Index (bits 11-2, 10 bits), and byte offset (bits 1-0). The 10-bit Index selects one of 1024 cache entries (0 ... 1023), each holding a Valid bit, a 20-bit Tag, and 32 bits of Data. The Valid bit ensures something useful is in the cache for this index; a comparator compares the stored Tag with the upper part of the Address to see if it is a Hit; if it is a Hit, read the data from the cache instead of memory.]
What kind of locality are we taking advantage of?
Student Roulette
Cache Terms
• Hit rate: fraction of accesses that hit in the cache
• Miss rate: 1 – Hit rate
• Miss penalty: time to replace a block from
lower level in memory hierarchy to cache
• Hit time: time to access cache memory
(including tag comparison)
Mapping a 6-bit Memory Address
[Address fields: bits 5-4 = Tag (which memory block is within the $ block), bits 3-2 = Index (block within the $), bits 1-0 = byte offset within the block (e.g., word). Note: $ = Cache]
• In example, block size is 4 bytes/1 word (it could be multi-word)
• Memory and cache blocks are the same size, the unit of transfer between memory and cache
• # Memory blocks >> # Cache blocks
  – 16 Memory blocks / 16 words / 64 bytes / 6 bits to address all bytes
  – 4 Cache blocks, 4 bytes (1 word) per block
  – 4 Memory blocks map to each cache block
• Byte within block: low order two bits, ignore! (nothing smaller than a block)
• Memory block to cache block, aka index: middle two bits
• Which memory block is in a given cache block, aka tag: top two bits
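A quick worked instance of this mapping (the address value is an arbitrary illustration, not from the slide):
    6-bit address 100110 (decimal 38)
    Byte offset = 10  (ignored)
    Index       = 01  → the block maps to cache block 1
    Tag         = 10  → hit only if cache block 1 holds a valid entry with tag 10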
Caching: A Simple First Example
[Diagram: Main Memory holds 16 one-word blocks at addresses 0000xx through 1111xx; the Cache has 4 entries (Index 00, 01, 10, 11), each with a Valid bit, a Tag, and Data. One word blocks; the two low order bits define the byte in the block (32b words).]
Q: Where in the cache is the mem block?
   Use the next 2 low order memory address bits – the index – to determine which cache block (i.e., modulo the number of blocks in the cache)
Q: Is the mem block in the cache?
   Compare the cache tag to the high order 2 memory address bits to tell if the memory block is in the cache (provided the Valid bit is 1)
(block address) modulo (# of blocks in the cache)
Multiword Block Direct Mapped Cache
• Four words/block, cache size = 1K words
[Diagram: the 32-bit address is split into Tag (bits 31-12, 20 bits), Index (bits 11-4, 8 bits), Block offset (bits 3-2), and Byte offset (bits 1-0). The 8-bit Index selects one of 256 cache entries (0 ... 255), each holding a Valid bit, a 20-bit Tag, and a 4-word Data block; the Block offset selects the word within the block, and a Hit requires the stored Tag to match the upper address bits.]
What kind of locality are we taking advantage of?
Student Roulette
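Where those field widths come from (a short derivation added for clarity, using the slide’s parameters):
    1K words / 4 words per block = 256 blocks  → Index = log2(256) = 8 bits
    4 words per block                          → Block offset = log2(4) = 2 bits (word within block)
    4 bytes per word                           → Byte offset = 2 bits
    Tag = 32 - 8 - 2 - 2 = 20 bits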
Cache Names for Each
Organization
• “Fully Associative”: Block can go anywhere
– First design in lecture
– Note: No Index field, but 1 comparator/block
• “Direct Mapped”: Block goes one place
– Note: Only 1 comparator
– Number of sets = number blocks
• “N-way Set Associative”: N places for a block
– Number of sets = number of blocks / N
– Fully Associative: N = number of blocks
– Direct Mapped: N = 1
Range of Set-Associative Caches
• For a fixed-size cache, each increase by a factor of 2 in associativity doubles the number of blocks per set (i.e., the number of “ways”) and halves the number of sets –
  decreases the size of the index by 1 bit and increases the size of the tag by 1 bit
[Field diagram: | Tag | Index | Block offset | – more associativity (more ways) shrinks the Index and grows the Tag]
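A small illustrative instance (cache parameters assumed for the example): a cache with 8 one-word (4-byte) blocks and 32-bit addresses:
    Direct mapped (1-way):      8 sets → 3 index bits, 27 tag bits
    2-way set associative:      4 sets → 2 index bits, 28 tag bits
    4-way set associative:      2 sets → 1 index bit,  29 tag bits
    Fully associative (8-way):  1 set  → 0 index bits, 30 tag bits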
For S sets, N ways, B blocks, which statements hold?
A) The cache has B tags
B) Size of Index = Log2(B)
C) B = N x S
D) The cache needs N comparators
☐ A only
☐ A and B only
☐ A, B, and C only
☐
Measuring Cache Performance
• Assuming cache hit costs are included as part of the normal CPU execution cycle, then
      CPU time = IC × CPI × CC
               = IC × (CPIideal + Memory-stall cycles) × CC
      where CPIstall = CPIideal + Memory-stall cycles
• A simple model for Memory-stall cycles:
      Memory-stall cycles = accesses/program × miss rate × miss penalty
• Will talk about writes and write misses next lecture, where it’s a little more complicated
Average Memory Access Time (AMAT)
• Average Memory Access Time (AMAT) is the average time to access memory, considering both hits and misses in the cache
      AMAT = Time for a hit + Miss rate × Miss penalty
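For instance (numbers assumed for illustration, not from the slides): with a hit time of 1 clock cycle, a miss rate of 5%, and a miss penalty of 100 clock cycles,
    AMAT = 1 + 0.05 × 100 = 6 clock cycles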
Average Memory Access Time (AMAT) is the average time to access memory considering both hits and misses
AMAT = Time for a hit + Miss rate x Miss penalty
Given a 200 psec clock, a miss penalty of 50 clock
cycles, a miss rate of 0.02 misses per instruction and
a cache hit time of 1 clock cycle, what is AMAT?
☐ ≤200 psec
☐ 400 psec
☐ 600 psec
☐
Impacts of Cache Performance
• Relative $ penalty increases as processor performance
improves (faster clock rate and/or lower CPI)
– Memory speed unlikely to improve as fast as processor
cycle time. When calculating CPIstall, cache miss penalty is
measured in processor clock cycles needed to handle a
miss
– Lower the CPIideal, more pronounced impact of stalls
• Processor with a CPIideal of 2, a 100 cycle miss penalty,
36% load/store instr’s, and 2% I$ and 4% D$ miss rates
– Memory-stall cycles = 2% × 100 + 36% × 4% × 100 = 3.44
– So CPIstalls = 2 + 3.44 = 5.44
– More than twice the CPIideal!
• What if the CPIideal is reduced to 1?
• What if the D$ miss rate went up by 1%?
Student Roulette
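A small C sketch of the stall-cycle model above; the function and variable names are made up for illustration, and the inputs are the slide’s numbers so the printed result (5.44) can be checked. Changing cpi_ideal or the D$ miss rate explores the two questions above.

    #include <stdio.h>

    /* CPI including memory stalls: every instruction can miss in the I$,
       and only loads/stores (ldst_frac of instructions) access the D$.   */
    static double cpi_stall(double cpi_ideal, double miss_penalty,
                            double ldst_frac, double i_miss, double d_miss) {
        double stalls = i_miss * miss_penalty + ldst_frac * d_miss * miss_penalty;
        return cpi_ideal + stalls;
    }

    int main(void) {
        /* CPIideal = 2, 100-cycle miss penalty, 36% loads/stores,
           2% I$ miss rate, 4% D$ miss rate                          */
        printf("CPIstall = %.2f\n", cpi_stall(2.0, 100.0, 0.36, 0.02, 0.04));
        return 0;
    }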
How Reduce Miss Penalty?
• Could there be locality in the misses from a cache?
• Use multiple cache levels!
• With Moore’s Law, have more room on the die for bigger L1 caches and for a second-level (L2) cache
• And in some cases even an L3 cache!
Typical Memory Hierarchy
[Table: the memory hierarchy, from closest to the processor (left) to farthest (right). On-chip components: Control, Datapath with RegFile, Instr Cache, and Data Cache.]
                  RegFile   Instr/Data   Second Level   Main Memory   Secondary Memory
                            Caches       Cache (SRAM)   (DRAM)        (Disk or Flash)
Speed (cycles):   ½’s       1’s          10’s           100’s         1,000,000’s
Size (bytes):     100’s     10K’s        M’s            G’s           T’s
Cost/bit:         highest                                             lowest
• Principle of locality + memory hierarchy presents programmer with
≈ as much memory as is available in the cheapest technology at the
≈ speed offered by the fastest technology
Memory Hierarchy Technologies
• Caches use SRAM (Static RAM) for speed and
technology compatibility
– Fast (typical access times of 0.5 to 2.5 ns)
– Low density (6 transistor cells), higher power, expensive
($2000 to $4000 per GB today)
– Static: content will last as long as power is on
• Main memory uses DRAM (Dynamic RAM) for size
(density)
– Slower (typical access times of 50 to 70 ns)
– High density (1 transistor cells), lower power, cheaper
($20 to $40 per GB today)
– Dynamic: needs to be “refreshed” regularly (~ every 8 ms)
  • Refresh consumes 1% to 2% of the active cycles of the DRAM
For L1 cache
AMAT = Time for a hit + Miss rate x Miss penalty
What is AMAT for L2 cache?
☐ Time for L2 hit + L2 Miss rate × L2 Miss penalty
☐ Time for L1 hit + L1 Miss rate × L2 Miss rate × Miss penalty
☐ Time for L1 hit + L1 Miss rate × (Time for L2 hit + L2 Miss rate × Miss Penalty)
☐
Local vs. Global Miss Rates
• Local miss rate – the fraction of references to one level of a cache that miss
  – Local Miss rate L2$ = L2$ Misses / L1$ Misses
• Global miss rate – the fraction of references that miss in all levels of a multilevel cache
  – Global Miss rate = L2$ Misses / Total Accesses
                     = (L2$ Misses / L1$ Misses) × (L1$ Misses / Total Accesses)
                     = Local Miss rate L2$ × Local Miss rate L1$
  – L2$ local miss rate >> global miss rate
• AMAT = Time for a hit + Miss rate × Miss penalty
• AMAT = Time for an L1$ hit + (local) Miss rate L1$ ×
         (Time for an L2$ hit + (local) Miss rate L2$ × L2$ Miss penalty)
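A short worked instance of the two-level AMAT formula (all numbers assumed for illustration): with a 1-cycle L1$ hit time, a 5% L1$ local miss rate, a 20-cycle L2$ hit time, a 25% L2$ local miss rate, and a 100-cycle L2$ miss penalty,
    AMAT = 1 + 0.05 × (20 + 0.25 × 100) = 1 + 0.05 × 45 = 3.25 cycles
and the global miss rate is 0.05 × 0.25 = 1.25% of all accesses.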
Reducing Cache Miss Rates
• E.g., CPIideal of 2, 100 cycle miss penalty (to main memory), 25 cycle miss penalty (to L2$), 36% load/stores, a 2% (4%) L1 I$ (D$) miss rate, add a 0.5% L2$ miss rate
  – CPIstalls = 2 + 0.02×25 + 0.36×0.04×25 + 0.005×100 + 0.36×0.005×100
              = 3.54 (vs. 5.44 with no L2$)
Multilevel Cache Design
Considerations
• Different design considerations for L1$ and L2$
– L1$ focuses on minimizing hit time for shorter clock
cycle: Smaller $ with smaller block sizes
– L2$(s) focus on reducing miss rate to reduce penalty of
long main memory access times: Larger $ with larger
block sizes
• Miss penalty of L1$ is significantly reduced by
presence of L2$, so can be smaller/faster but with
higher miss rate
• For the L2$, hit time is less important than miss
rate
– L2$ hit time determines L1$’s miss penalty
Review so far
• Principle of Locality for Libraries /Computer Memory
• Hierarchy of Memories (speed/size/cost per bit) to
Exploit Locality
• Cache – copy of data from lower level in memory hierarchy
• Direct Mapped to find block in cache using Tag field
and Valid bit for Hit
• Larger caches reduce Miss rate via Temporal and
Spatial Locality, but can increase Hit time
• Multilevel caches help reduce Miss penalty
• AMAT helps balance Hit time, Miss rate, Miss penalty