Here are the notes on Caches from Chapter 7

Ch7 MEMORY HIERARCHY
OVERVIEW
Accessing memory has always been one of the primary bottlenecks in
computing. Fast memory is expensive, and only very high-performance,
expensive computers are built with very large, fast memories. The rest of us
live with slow (50 ns access, these days) DRAM memory.
Consider: Today’s machines may have a clock cycle of 1 ns, but memory
access requires 50 nsec! Obviously, a disconnect in performance. If we had
to live with a simple memory of this type, we’d have a performance
differential (assuming that ¼ of the instructions are loads and stores and all
instructions have to do a fetch):
Time(50 ns memory) / Time(1.0 ns memory) = (0.75*50 + 0.25*100) / (0.75*1.0 + 0.25*2.0) = 62.5 / 1.25 = 50
(Each instruction does a 50 ns fetch; the quarter that are loads/stores make a second 50 ns data access, versus 1 ns and 2 ns with the fast memory.)
Obviously, this is a problem; hence the need for better memory performance. A
number of techniques are used to greatly improve it. Most important is, of
course, cache memory.
What is cache memory? A special fast memory that stores RECENTLY
ACCESSED locations, and locations adjacent to those.
1. Level 1 cache memory is integrated with the CPU, runs full speed
2. NOT random access, instead, it’s structured to store recent program and
data locations.
3. As long as required instructions and data are in cache, a hit occurs, and
memory access does not slow down CPU (possibly 2-cycle Load/Store)
4. If info is NOT in cache, a miss occurs, and the CPU has to stop to await
the slow process of memory being uploaded.
5. Because the amount of space available on the CPU die for cache memory
is small (32 Kbytes-128 Kbytes), often a LEVEL 2 cache is integrated
onto the computer’s motherboard. This is a cache memory that is larger
than the Level 1 cache (256 Kbytes-4 Mbytes), but necessarily slower because it
is off-chip. Usually 10-25 nsec access.
Why does cache memory work? If locations were accessed totally randomly,
one would not expect many hits. However, the execution of a program is
anything but random:
Principle of locality:
• Temporal locality: if an address is referenced, it is likely to be accessed
again soon. Programs tend to consist of loops, so the first time through a
loop may be slow because all the instructions need to be uploaded from
memory. Data is often accessed within the loop too, so load/store addresses
tend to be the same time after time.
• Spatial locality: if an address is referenced, it is likely that close-by
locations will also be referenced. Loops consist of adjacent instructions.
Data: related variables tend to be spaced closely together (arrays, structs,
variable definitions). (A small sketch follows this list.)
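To make this concrete, here is a tiny Python sketch (mine, not from the text): the loop body is fetched over and over (temporal locality), and the array elements it touches sit next to each other in memory (spatial locality).

    def sum_array(data):
        total = 0
        for i in range(len(data)):   # same loop instructions every iteration -> temporal locality
            total += data[i]         # data[0], data[1], ... are adjacent in memory -> spatial locality
        return total

    print(sum_array(list(range(100))))   # 4950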
These principles suggest that it is indeed worthwhile to use a cache, and that
rather than uploading one instruction or data location at a time, it may be
useful to upload several. WHY?
Memory (DRAM) requires substantial overhead to get the first location out.
Subsequent locations can be accessed much faster.
[Figure: DRAM read timing - the address is applied in two halves, latched by RAS (row) and then CAS (column); DATA OUT is 1, 2, or 4 bits per access; DATA IN is the write path.]
This is true because two cycles are needed to latch the address (1/2 the
address at a time), but today’s DRAMS are pipelined so that only the first
address needs to be given; all subsequent locations just require a little
control handshaking.
Consequently: cache memory operates on BLOCKS of adjacent bytes. A
whole block is loaded from DRAM (or Level 2 cache) at a time. Block sizes
tend to be 16-64 bytes (4-16 words).
Thus, the primary purpose of cache memory is to improve the average
access time of instruction and data fetches.
Today, faster memory is also available (SDRAM), at a premium price.
Access times are down to 8 ns, allowing full-speed 133 MHz memory bus
operation.
Virtual memory and paging
But, there are other aspects to the memory hierarchy system. Consider a user
of, say, a UNIX system on which a number of other users are operating. That
user wants to be certain that:
• No other user interferes with his/her program execution
• Multiple programs can run concurrently on the machine, even if they all
start at Address 0
• The machine will not run out of memory, even if a program requires 10
times as many locations as are available in DRAM!
The memory hierarchy of today’s computers ensures that these are all true.
Memory protection hardware compares accessed locations to the range of
addresses allowed to the currently-running (in CPU) program, and causes an
exception if the access is not legal (memory access violation).
Active programs (called processes) share CPU resources: periodically an
external timer generates an interrupt, and the O.S. decides whether to give a
different process CPU time. This happens about every 10 ms.
Virtual memory allows the second need to be met. Addresses accessed by
the CPU are translated so that the target address may be in one of many
different locations in DRAM. This allows all programs to start at location 0,
as long as the mapping is different for all the programs. Locations in
memory are organized into blocks of adjacent locations called PAGES;
each page can be arbitrarily mapped.
Demand paging allows writing memory blocks that have not been recently
used to the hard disk drive, temporarily caching them at a lower level! This
frees up DRAM memory for other uses. When a program needs a location
that has been paged out to disk, the hardware detects the problem and
causes an exception, and the operating system takes over and retrieves the
page.
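To make the page mapping concrete, here is a minimal sketch (my own illustration, not from the notes), assuming a 4 KB page size and using a Python dictionary as the page table; a missing entry plays the role of a page that has been written out to disk.

    PAGE_SIZE = 4096   # assumed page size for this sketch

    # Toy page table: virtual page number -> physical page number.
    page_table = {0: 7, 1: 3, 2: 12}

    def translate(virtual_addr):
        vpn = virtual_addr // PAGE_SIZE      # virtual page number
        offset = virtual_addr % PAGE_SIZE    # offset within the page
        if vpn not in page_table:
            # Hardware raises an exception (page fault); the OS would bring the
            # page back from disk and update the table before retrying.
            raise RuntimeError("page fault on virtual page %d" % vpn)
        return page_table[vpn] * PAGE_SIZE + offset

    print(hex(translate(0x1234)))   # virtual page 1, offset 0x234 -> 0x3234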
CACHE MEMORY (7.2)
Consider a RAM memory read operation. The system responds to an applied
address by outputting the bits stored at that location.
A cache memory must work differently. It stores recently used locations
somewhere in the cache, and must retrieve the information by address. A
cache is organized into LINES. Each line consists of:
1) One block of data (or instructions)
2) Address information that allows determining whether the CPU
address requested is located in the line
3) A few housekeeping control bits.
Most systems today have a UNIFIED CACHE that contains both
instructions and data, but a few have separate caches for data and
instructions to allow simultaneous IF and Load/Store operations.
Questions that must be addressed here in considering caches are:
1) Where can a block be placed in the cache?
2) How is a block found if it is in the cache?
3) Which block should be replaced on a cache miss?
4) What happens on a write?
We’ll consider three different cache architectures. Let’s assume that each
line contains 32 bytes.
ADDR: A31-A5=block number, A4-A2=word in block, A1-A0=byte #
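A quick sketch (mine) of how those fields fall out of a 32-bit byte address, given the assumed 32-byte lines:

    def split_address(addr):
        """Split a byte address into block number, word-in-block, and byte number."""
        byte_in_word = addr & 0x3            # A1-A0
        word_in_block = (addr >> 2) & 0x7    # A4-A2
        block_number = addr >> 5             # A31-A5
        return block_number, word_in_block, byte_in_word

    print(split_address(0x1234))   # (145, 5, 0): block 0x91, word 5, byte 0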
Architecture 1: FULLY-ASSOCIATIVE CACHE.
This is the hardest system to implement, but the most general, and so it is
pretty much a theoretical construct.
A block can be placed ANYWHERE in the FA cache. This means that the
full address of the block must be stored in the cache. Hardest is that a LRU
(least-recently used) algorithm must be implemented to determine which
block should be replaced if there is a miss. This takes a lot of slow circuitry!
Block diagram:
[Figure: fully-associative cache - each line holds a tag, status bits, and a data block; every line's tag is compared (=?) in parallel against CPU address bits A31-A5; a match raises "hit", and address bits A4-A2 select the word from the matching line through a mux to Data out.]
What happens: each line has an independent address comparator that tries to
match the CPU address with the stored line address. If one matches, we have a
hit - unless the line isn’t in use, in which case the match is spurious. One of the
status bits must therefore be “line valid”.
On a hit, the appropriate word in the line is passed via mux to the CPU (on a
read). We’ll talk about writes later.
On a miss (read), the LRU hardware decides which line to replace, and the
CPU is halted until the requested block is uploaded from DRAM or L2
cache. The block’s address is copied along with the data, and the line valid
flag is set.
Architecture 2: DIRECT-MAPPED CACHE
This is the opposite case from the above: each block can only go to one line
in the cache, resulting in the possibility that lines can be overwritten
frequently. But it is very simple structurally.
To avoid the possibility that nearby data or instructions overwrite each
other’s line in the cache, the lower CPU address bits are used to index into
the cache. The location of the data or instructions in the cache is found by
taking the address in memory modulo the number of locations in the cache.
If that number is a power of 2, you just take the low-order address bits as
the index into the cache, and the tag is the remaining upper bits.
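Here is a minimal sketch of that indexing as a direct-mapped lookup (my own illustration: word addresses, one-word blocks, and 16 lines, matching problem 2 at the end of these notes). Running it on the reference string from problem 2 reproduces the hit/miss pattern shown there.

    NUM_LINES = 16                     # assumed: 16 one-word blocks

    cache = [None] * NUM_LINES         # each entry holds the tag of the block stored in that line

    def access(word_addr):
        index = word_addr % NUM_LINES  # low-order bits select the line
        tag = word_addr // NUM_LINES   # remaining upper bits are the tag
        if cache[index] == tag:
            return "hit"
        cache[index] = tag             # on a miss, the line is simply overwritten
        return "miss"

    for a in [1, 4, 8, 5, 20, 17, 19, 56, 9, 11, 4, 43, 5, 6, 9, 17]:
        print(a, access(a))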
Note that in a unified cache, if data and instructions map to the same index
bits, they will interfere with each other. Likewise, a long loop containing
twice as many instructions as the cache can hold will render the cache
useless, because instructions will constantly overwrite each other in the
cache.
[Figure: direct-mapped cache - index bits A16-A5 select one of the lines (000 through FFF); each line stores a tag (A31-A17), status bits, and a 32-byte data block; a single comparator (=?) checks the stored tag against CPU A31-A17, and A4-A2 select the word for DATA OUT.]
In this case, there is only ONE comparator, and no LRU hardware. So
simple! Yet, the failings of this architecture have led to an intermediate
approach.
Architecture 3: SET-ASSOCIATIVE CACHE
This is an intermediate approach. The cache is divided into 2-8 direct-mapped
banks. A block may reside in any one of the banks, at its direct-mapped
index; the group of slots at one index (one per bank) is called a set. A read
results in one tag comparison in each bank: if one matches, it is a hit and
data is taken from that bank. On a miss, a crude LRU algorithm is used to
determine which bank's line to replace. A 4-way set-associative cache means
there are 4 slots in each set, and the total cache contains many sets.
[Figure: SET 1 (the first bank) of the set-associative cache, organized like the direct-mapped cache above - index A16-A5, tag A31-A17, 32-byte data block, status bits, one comparator (=?), and A4-A2 word select for DATA OUT.]
SET 2, 3, 4 are the same as SET 1 (some additional control circuitry is needed,
plus an LRU implementation - which is sometimes just random replacement).
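A matching sketch (again mine) of a two-way set-associative lookup with true LRU replacement; word addresses, one-word blocks, and 8 sets are assumed, as in problem 6 below, and it reproduces that problem's hits and misses.

    NUM_SETS = 8                          # 16 one-word slots, 2 ways -> 8 sets

    sets = [[] for _ in range(NUM_SETS)]  # each set: up to 2 tags, most-recently-used first

    def access(word_addr):
        index = word_addr % NUM_SETS
        tag = word_addr // NUM_SETS
        ways = sets[index]
        if tag in ways:
            ways.remove(tag)
            ways.insert(0, tag)           # move to the most-recently-used position
            return "hit"
        if len(ways) == 2:
            ways.pop()                    # evict the least-recently-used way
        ways.insert(0, tag)
        return "miss"

    for a in [1, 4, 8, 5, 20, 17, 19, 56, 9, 11, 4, 43, 5, 6, 9, 17]:
        print(a, access(a))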
Let’s reconsider our questions:
1) Where can a block be placed in the cache?
FA-anywhere, DM-one place, SA-one location in each set.
2) How is a block found if it is in the cache?
FA-one of N comparators, DM-one comparator, SA-one of a few
3) Which block should be replaced on a cache miss?
FA-LRU algorithm, DM-constrained, SA-LRU or random.
A common LRU algorithm involves a status bit for each line that is
periodically reset. When a hit occurs, the bit is set again. After a
while, only the non-accessed lines have their bits reset, and those can be replaced.
4) What happens on a write? In other words, what happens when the program
updates data that must eventually be stored back to memory?
This is a bit tricky. In some cases (esp. with a DM cache), we may want to
avoid writing data to the cache, in case it overwrites instructions or data. A
write does not guarantee a subsequent read! But, DRAM writes DO have a
time penalty. Thus, there are a number of policies:
On a HIT:
Write-through (or store-through) – Information written BOTH to block in
cache AND the lower-level memory.
Write-back (copy back) – Information written ONLY to block in cache. A
“dirty bit” in status is set to indicate that a write to memory needs to be
done.
On a MISS:
Write allocate (fetch on write) – Block is loaded on a write miss, and one of
the above actions is taken.
No-write allocate (write around) – the block is modified at the lower level
only; the cache is not updated.
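As a compact summary of the four combinations (not from the text), here is a toy Python model; dictionaries stand in for the cache and main memory, and a set of addresses stands in for the dirty bits.

    cache, memory, dirty = {}, {}, set()

    def handle_write(addr, data, write_through=True, write_allocate=False):
        if addr in cache:                       # write HIT
            cache[addr] = data
            if write_through:
                memory[addr] = data             # write-through: update both levels
            else:
                dirty.add(addr)                 # write-back: just set the dirty bit
        elif write_allocate:                    # write MISS, write allocate
            cache[addr] = memory.get(addr)      # fetch the block on the write miss...
            handle_write(addr, data, write_through, write_allocate)   # ...then treat it as a hit
        else:                                   # write MISS, no-write allocate
            memory[addr] = data                 # write around: the cache is not updated

    handle_write(0x100, 42, write_through=False, write_allocate=True)
    print(cache, dirty)                         # {256: 42} {256}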
How can we quantify cache performance?
1) CPUtime = (CPUClockCycles + MemoryStallCycles) * ClockCycleTime
2) MemoryStallCycles = ReadStallCycles + WriteStallCycles
   ≈ Instructions/Program * Misses/Instruction * MissPenalty
3) ReadStallCycles = Reads/Program * ReadMissRate * ReadMissPenalty
Write stall cycles are more complex because they depend on the write
protocol. With write-allocate, a write miss incurs a read-miss penalty.
Otherwise, the penalty may be minimal (or nonexistent): the data is simply
placed in the write buffer. If the buffer's write to memory does not complete
before the next read miss, however, there will be additional stalls on the read.
4) WriteStallCycles = (Writes/Program * WriteMissRate * WriteMissPenalty)
+ WriteBufferStalls
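A small numerical sketch (my own) showing how expressions 1-4 combine; every value below is made up purely for illustration.

    instructions        = 1_000_000
    cpi_base            = 1.5        # cycles per instruction with no memory stalls
    reads_per_instr     = 1.25       # every instruction fetch plus 0.25 data reads
    writes_per_instr    = 0.10
    read_miss_rate      = 0.02
    write_miss_rate     = 0.02
    miss_penalty        = 50         # clock cycles
    write_buffer_stalls = 0          # assume the write buffer never fills
    cycle_time_ns       = 1.0

    read_stalls  = instructions * reads_per_instr * read_miss_rate * miss_penalty
    write_stalls = (instructions * writes_per_instr * write_miss_rate * miss_penalty
                    + write_buffer_stalls)
    memory_stalls = read_stalls + write_stalls                                # expression 2
    cpu_time_ns = (instructions * cpi_base + memory_stalls) * cycle_time_ns   # expression 1

    print(memory_stalls, cpu_time_ns)   # 1350000.0 2850000.0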
IMPROVING PERFORMANCE
To improve performance, what can we do? Look at the components of
Expression 3:
A) Reduce Reads/Program – these can be reduced by making the compiler
smarter. The compiler must choose whether a variable remains in memory or
is manipulated in a register, within a window of the program. Good strategies
lead to fewer loads and stores.
B) Reduce ReadMissRate – by:
• A larger cache overall
• Larger associativity in the cache (4-way vs. 2-way set associative)
• Optimizing block size (larger is better, up to a point)
• Compiler optimizations (e.g. ordering of variables in blocks)
C) Reduce ReadMissPenalty – by:
• Faster main memory (Fig below)
• A wider memory bus (more bits uploaded at once) (Fig below)
• Additional cache level(s) – example follows
• PREFETCH instructions and/or data – a special instruction indicates to the
memory system that a certain address will be read soon, giving the memory
system time to load it (while the CPU is active with preceding code!).
D) Write optimizations – these can also be used to improve performance:
• Defer main memory writes in a WRITE BUFFER. A memory write operation
is delayed until there is no read pending. Note that read addresses must be
compared with the addresses in the write buffer!
• Use write-back strategies.
The miss penalty can be improved by increasing either the bandwidth or the
data width of the path between DRAM memory and the cache (example on page
561):
[Figure: CPU, L1 cache, L2 cache, and DRAM connected by 32-bit and 128-bit paths, showing three organizations: normal bandwidth (32-bit path), interleaved memory access (about 4x faster), and high bandwidth with a 128-bit path (about 4x faster).]
SAMPLE PROBLEM USING LEVEL 2 CACHE
A system has a CPU with an average CPI of 1.5 and a clock speed of 1.0 GHz. Main
memory access (block transfer) time is 150 nsec. The miss rate for a (data-intensive) program is 10%.
How much faster will the machine be if a secondary cache is used that
transfers a block in 20 nsec, with a 2% miss rate?
Consider the system without Level 2 cache:
MainMemoryMissPenalty = 150 ns / 1ns clock cycle = 150 clock cycles
EffectiveCPI = BaseCPI + memoryStalls/instruction
= 1.5 + 0.1 * 150 = 16.5
**** According to Amdahl’s law, improving CPU won’t help much! ****
With second-level cache:
Level2MissPenalty = 20ns/1ns clock = 20 clocks
TotalCPI = base + Lev2 stalls +mainMemStalls
= 1.5 + 0.1*20 + 0.02*150 = 1.5 + 2 + 3 = 6.5.
Still somewhat far from optimal!
Performance = 16.5 / 6.5 = 2.54
Definitely want Level 2 cache in there. But also want a reduced miss rate!
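The same arithmetic in a few lines of Python, just re-checking the numbers above:

    base_cpi, l1_miss_rate, l2_miss_rate = 1.5, 0.10, 0.02
    main_mem_penalty, l2_penalty = 150, 20          # clock cycles (1 ns clock)

    cpi_no_l2   = base_cpi + l1_miss_rate * main_mem_penalty
    cpi_with_l2 = base_cpi + l1_miss_rate * l2_penalty + l2_miss_rate * main_mem_penalty

    print(cpi_no_l2, cpi_with_l2, round(cpi_no_l2 / cpi_with_l2, 2))   # 16.5 6.5 2.54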
Problems:
1) (7.1) Describe the general characteristics of a program that would exhibit
very little temporal and spatial locality with regard to data accesses.
Solution: This is a program with no arrays or structures. Its memory
accesses jump around, and each location is accessed only once.
2) (7.7) Here is a series of address references given as word addresses: 1, 4,
8, 5, 20, 17, 19, 56, 9, 11, 4, 43, 5, 6, 9, 17. Assuming a direct mapped
cache with 16 one-word blocks that are initially empty, label each reference
in the list as a hit or a miss and show the final contents of the cache.
Solution:
1 – Miss
4 – Miss
8 – Miss
5 – Miss
20 – Miss
17 – Miss
19 – Miss
56 – Miss
9 – Miss
11 – Miss
4 – Miss
43 – Miss
5 – Hit
6 – Miss
9 – Hit
17 – Hit
Final contents (slot number → stored word address; "-" means empty):
Slot:   0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
Value:  -   17  -   19  4   5   6   -   56  9   -   43  -   -   -   -
3. (7.8) Using the same references as problem 2, show the hits and misses
and final cache contents for a direct mapped cache with four word blocks
and a total size of 16 words.
Solution: What this means is that we still have the same total cache space, but
since we have 4-word blocks and only 16 total words, we can only have 4
slots. The hits and misses:
1 – Miss
4 – Miss
8 – Miss
5 – Hit
20 – Miss
17 – Miss
19 – Hit
56 – Miss
9 – Miss
11 – Hit
4 – Miss
43 – Miss
5 – Hit
6 – Hit
9 – Miss
17 – Hit
Final contents:
Block 0: words 16, 17, 18, 19
Block 1: words 4, 5, 6, 7
Block 2: words 8, 9, 10, 11
Block 3: (empty)
4) (7.11) Consider a memory hierarchy using one of the three organizations
for main memory shown in Figure 7.13 (a: one-word-wide memory, b: four-word-wide
memory, c: four-word interleaved memory). Assume that the cache block size
is 16 words. If the main memory latency for a new access is 10 cycles and
the transfer time is 1 cycle, what are the miss penalties for each of these
organizations?
Solution:
For a: one-word wide memory: This requires 16 main memory accesses to
retrieve a cache block and words are transferred 1 at a time so:
1 + 16*10 + 16*1 = 177 clock cycles.
For b: four-word-wide: This requires 4 main memory accesses to retrieve a
cache block and they are transferred 4 at a time:
1 + 4*10 + 4*1 = 45 clock cycles
For c: four word interleaved: This requires 4 main memory accesses to
retrieve a cache block and they are transferred 1 at a time:
1 + 4*10 + 16*1 = 57
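All three cases follow the same pattern: 1 cycle to send the address, plus one latency per main-memory access, plus the transfer cycles. As a quick sanity check (mine):

    latency, transfer, block_words = 10, 1, 16

    one_word_wide         = 1 + block_words * latency + block_words * transfer              # 16 accesses, 16 transfers
    four_word_wide        = 1 + (block_words // 4) * latency + (block_words // 4) * transfer  # 4 accesses, 4 transfers
    four_word_interleaved = 1 + (block_words // 4) * latency + block_words * transfer         # 4 accesses, 16 word transfers

    print(one_word_wide, four_word_wide, four_word_interleaved)   # 177 45 57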
5) (7.12) Suppose a processor with a 16 word block size has an effective
miss rate per instruction of 0.5%. Assume that the CPI without cache misses
is 1.2. Using the memories from problem 4, how much faster is this
processor when using the wide memory than narrow or interleaved?
Solution: First we have to calculate the effective CPI for each memory:
a: Effective CPI = 1.2 + .005*177 = 2.085
b: Effective CPI = 1.2 + .005*45 = 1.425
c: Effective CPI = 1.2 + .005*57 = 1.485
Since it is the same software and CPU, we can compare CPI directly.
Wide over narrow = 2.085 / 1.425 = 1.46
Wide over interleaved = 1.485/1.425 = 1.04
6) (7.20) Using the following series of references, 1, 4, 8, 5, 20, 17, 19, 56,
9, 11, 4, 43, 5, 6, 9, 17, show the hits and misses and final cache contents
for a two-way set associative cache with one word blocks and a total size of
16 words.
Solution: OK, two-way set associative means we have two slots in each set.
Each slot holds only one word, so with 16 words total we have 8 sets.
1 – Miss
4 – Miss
8 – Miss
5 – Miss
20 – Miss
17 – Miss
19 – Miss
56 – Miss
9 – Miss
11 – Miss
4 – Hit
43 – Miss
5 – Hit
6 – Miss
9 – Hit
17 – Hit
Final state:
(set number → blocks stored in that set's two elements)
Set 0: 56, 8
Set 1: 17, 9
Set 2: (empty)
Set 3: 11, 43
Set 4: 20, 4
Set 5: 5
Set 6: 6
Set 7: (empty)
7) (7.38) If all misses are classified into one of three categories: compulsory
(blocks not there since we are just starting), capacity (blocks not there since
they couldn’t all fit, some had to be replaced), or conflict (misses in direct
mapped or set assoc. that wouldn’t occur in a full assoc.), which misses are
likely to be reduced when a program is rewritten so as to require less
memory? How about if the clock rate of the machine that the program is
running on is increased? How about if the associativity of the existing cache
is increased?
Solution:
Less memory: will likely reduce compulsory and capacity misses. Maybe
conflict…tough to tell
Increase clock rate: nothing. Clock rate doesn’t affect the hit rate.
Increase assoc.: should reduce conflict.
Homework:
1) (7.5) Describe the general characteristics of a program that would exhibit
very high amounts of temporal locality, but very little spatial locality with
regard to instruction fetches.
2) (7.9) Compute the total number of bits required to implement the cache
in Figure 7.10. This number is different from the size of the cache, which
usually refers to the number of bytes of data stored in the cache. The
number of bits needed to implement the cache represents the total amount of
memory needed for storing all of the data, tags and valid bits.
3) (7.13) Cache C1 is direct-mapped with 16 one-word blocks. Cache C2
is direct-mapped with 4 four-word blocks. Assume that the miss penalty for
C1 is 8 clock cycles and C2 is 11 clock cycles. Assuming that the caches are
initially empty, find a reference string for which C2 has a lower miss rate but
spends more cycles on cache misses than C1. Use word addresses.
4) (7.21) Using the series of references: 1, 4, 8, 5, 20, 17, 19, 56, 9, 11, 4, 43,
5, 6, 9, 17, show the hits and misses and final cache contents for a fully
associative cache with one-word blocks and a total size of 16 words.
Assume least recently used replacement.