CS 1104
Help Session I
Caches
Colin Tan,
ctank@comp.nus.edu.sg
S15-04-15
Topics for Today
• Session I - Cache, Session II - Virtual
Memory
• We will pick out important topics and
elaborate on them. We will also have
simple problems to solve
– Please have your calculator, paper and pen
ready!
• Help session notes are available at:
http://www.comp.nus.edu.sg/~ctank
Memory Hierarchy
• Motivation
– Not all memory is created equal
• Cheap Memory => Slow
• Fast Memory => Expensive
– DRAM, 70 ns access time, $5/MByte
– SRAM, 8 ns access time, $100/MByte
• The purpose of the memory hierarchy is to give us
huge amounts of cheap memory that operates
close to the speed of fast but expensive memory.
How do we do that?
• We can achieve this by using locality.
– Locality is the behavior of programs such that:
• Data or instructions that were recently accessed are likely to be accessed
again in the near future (Temporal Locality)
– The variable x in this code fragment will have temporal locality
for(int i=0; i<100; i++)
    y = y + x;
• The neighbors of data or instructions that were recently accessed are
likely to also be accessed (Spatial Locality); see the fragment after this list
– Program execution tends to be sequential, so if an instruction
was just executed, it is likely that the instruction next to it
will also be executed.
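– A fragment illustrating spatial locality (my sketch; the array a is hypothetical):
int a[100], sum = 0;
for(int i=0; i<100; i++)
    sum = sum + a[i];   /* a[i+1] lies right next to a[i] in memory */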
Caches
• Locality means that only small portions of main
memory will be used in the near future.
• We can create a small memory system with fast
but expensive devices to store these portions of
main memory.
• We can then access these portions of memory very
quickly.
• This is the concept behind caches.
How Do Caches Help?
• The average time to access memory (AMAT) is
given by:
AMAT = Tcache + miss_rate x Tmemory
Tcache = Time to read the cache (8ns for SRAM cache)
Tmemory = Time to read main memory (70ns for DRAM)
miss_rate = Probability of not finding what we want in the cache.
• Because of locality, miss_rate is very small
– Typically about 3% to 5%.
• On average, the time to access memory is very
close to that of the cache (8ns) rather than that of
the main memory (70ns).
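• As a check on the formula (my arithmetic, using the figures above with a 5% miss rate):
AMAT = 8 + 0.05 x 70 = 11.5 ns
– Much closer to the 8ns of the cache than to the 70ns of main memory.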
How do Caches Help?
• Yet at the same time, we have the benefit of
being able to have large amounts of
memory
– This is because most of our memory is cheap
DRAM!
• The net effect is that caches allow us to
have huge amounts of cheap, slow memory,
yet have access times of fast, expensive
memory.
Basics of Addressing
• Memory Addresses
– Memory is a huge array of data.
– Just like arrays, we must specify the index of
the piece of data that we want, so that we can
get it out of the array.
– In memory systems, this index is called an
“Address”.
Where do Addresses Come
From?
• For instruction fetches, the address of the
instruction (i.e. the location in memory
where the instruction is) comes from the
Program Counter.
• For data fetches, the address comes from
the ALU stage of the pipeline whenever we
do a lw or sw operation.
• In the MIPS architecture, addresses are 32-bit numbers.
Cache Architecture
• Caches consist of blocks (or lines). Each
block stores data from memory.
• Block allocation problem:
– Given data from an address A, how do we
decide which block of cache its data should go
to?
The Block Allocation Problem
• 3 possible solutions:
– Data from each address A will go to a fixed
block.
• Direct Mapped Cache
– Data from each address A will go to any block.
• Fully associative cache
– Data from address A will go to a fixed set of
blocks.
• Data may be put into any block within a set.
• Set associative cache.
Direct Mapped Caches
• The value of a portion of memory address is
used to decide which block to send the data
to:
Address A: | Tag | Block Index | Block Offset | Byte Offset |
• The Block Index portion is used to decide
which block data from this address should
go to.
Example
• The number of bits in the block index is
log2N, where N is the total number of
blocks.
• For a 4-block cache, the block index portion
of the address will be 2 bits, and these 2 bits
can take on the value of 00, 01, 10 or 11.
• The exact value of these 2 bits will
determine which block the data for that
address will go to.
Example
• The value of the two block index bits will
determine which block the data will go to,
following the scheme shown below:
[Diagram: a 4-block cache with blocks indexed 00, 01, 10 and 11]
Solving Direct-Mapped Cache
Problems
• Question 7.7
Basic formula:
Blk_Addr = floor(word_address/words_per_block) mod N
– N here is the total number of blocks in the cache
– This is the mathematical version of taking the value of the Block
Index bits from the address.
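A minimal sketch of this formula in C (function and parameter names are mine, not from the textbook):
/* Direct-mapped: which block does a word address map to? */
unsigned blk_addr(unsigned word_address,
                  unsigned words_per_block,
                  unsigned num_blocks)
{
    return (word_address / words_per_block) % num_blocks;
}
For example, blk_addr(22, 1, 4) returns 2, i.e. block 10.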
A Complication:
Multiple Word Blocks
• Single word blocks do not support spatial
locality
– Spatial locality: Likelihood of accessing
neighbor of a piece of data that was just
accessed is high.
– But with single word blocks, none of the
neighbors are in cache!
• All accesses to neighbors that were not accessed
before will miss!
An Example
Question 7.8
Accessing Individual Words
• In our example, each block has 4 words.
• But we always access memory 1 word at a time!
(e.g. lw)
• Use the Block Offset to specify which of the 4
words in a block we want to read:
Address A: | Tag | Block Index | Block Offset | Byte Offset |
The Block Offset
• Number of block offset bits = log2M, where M is
the number of words per block.
• For our example, M=4. So number of block offset
bits is 2.
• These two bits can take on the values of 00, 01, 10
and 11.
• Note that for single word blocks, the number of
block offset bits is log2 1, which is 0. I.e. There are
no block offset bits for single-word blocks.
• These values determine exactly which word within
a block the address A is referring to:
The Block Offset
[Diagram: 4-block cache, 4 words per block; each block (00 to 11) holds Word 00, Word 01, Word 10 and Word 11]
• The value of the 2 block offset bits (see
previous slide) determines whether our address A is
referring to Word 00, Word 01, Word 10 or
Word 11.
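• A worked example (word address made up for illustration): for word address 22 in this cache, the block offset is 22 mod 4 = 2 (binary 10) and the block index is floor(22/4) mod 4 = 5 mod 4 = 1 (binary 01), so address 22 refers to Word 10 of block 01.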
Who am I?
Purpose of the Tag
• Many different addresses may map to the
same block: e.g. (the Block Index portion is the
second field of each address below)
01000 00010010 00000000 00
01010 00010010 00000000 00
11011 00010010 00000000 00
• All 3 addresses are different, but all have
Block Index 00010010, so all map to the same block
Disambiguation
• We need a way to disambiguate the situation
– Otherwise how do we know that the data in block x
actually comes from address A and not from another
address A’ that has the same block index bit value?
• The portion of the address A to the left of
the Block Index can be used for
disambiguation.
• This portion is called the tag, and the tag for
address A is stored in the cache together
with address A data.
The Tag
[Diagram: each of the 4 cache blocks (00 to 11) stores a Tag alongside Word 00, Word 01, Word 10 and Word 11]
• When we access the cache, the Tag portion and
Block Index portions of address A are extracted.
• The Block Index portion will tell the cache
controller which block of cache to look at.
• The Tag portion is compared against the tag stored
in the block. If the tags match, we have a cache
hit. The data is read from the cache.
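A sketch of this hit check in C (the struct layout and names are my own, assuming 4-word blocks and the valid bit described later):
struct block {
    int valid;          /* block holds real data            */
    unsigned tag;       /* tag stored alongside the data    */
    unsigned word[4];   /* the 4 data words                 */
};
/* Returns 1 on a hit and places the word in *data, else 0. */
int cache_read(struct block cache[], unsigned index,
               unsigned tag, unsigned offset, unsigned *data)
{
    struct block *b = &cache[index];  /* Block Index picks the block  */
    if (b->valid && b->tag == tag) {  /* tags match: cache hit        */
        *data = b->word[offset];      /* Block Offset picks the word  */
        return 1;
    }
    return 0;                         /* miss: fetch from main memory */
}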
Accessing Individual Bytes
Address A: | Tag | Block Index | Block Offset | Byte Offset |
• MIPS addresses are byte addresses, and actually
index individual bytes rather than words.
• Each MIPS word consists of 4 bytes.
• The byte offset tells us exactly which byte within a
word we are referring to.
Advantages & Disadvantages of
Direct Mapped Caches
• Advantages:
– Simple to implement
– Fast performance
• Less time to detect a cache hit => less time to get
data from the cache => faster performance
• Disadvantages
– Poor temporal locality.
• Many addresses may map to the same block.
• The next time address A is accessed, it may have
been replaced by the contents of address A’.
Improving Temporal Locality
The Fully Associative Cache
• In the fully associative cache, data from an
address A can go to any block in cache.
– In practice, data will go into the first available
cache block.
– When the cache is full, a replacement policy is
invoked to choose which block of cache to
throw out.
Example
• Question 7.22
Searching the Cache
• In the fully associative cache, an address A is split
into the following parts:
Address A: | Tag | Block Offset | Byte Offset |
• The cache controller will search the entire cache to
see if it can find a block with the same tag value as
the tag portion of A.
• If it can find such a block, we have a cache hit,
and the controller reads the data from the cache.
Advantages and Disadvantages
Fully Associative Cache
• Good temporal locality properties
– Flexible block placement allows smart
replacement policies such that blocks that are
likely to be referenced again will not be
replaced. E.g. LRU, LFU.
• Disadvantages
– Complex and too expensive for large caches
• Each block needs a comparator to check the tag.
• With 8192 blocks, we need 8192 comparators!
A Compromise
Set Associative Caches
• Represents a compromise between direct-mapped and fully associative caches.
• Cache is divided into sets of blocks.
• An address A is mapped directly to a set
using a similar scheme as for direct mapped
caches.
• Once the set has been determined, the data
from A may be stored in any block within a
set - Fully associative within a set!
Set Associative Cache
• An n-way set associative cache will have n
blocks per set.
• For example, for a 16-block cache that is
implemented as a 2-way set associative
cache, each set has 2 blocks, and we have a
total of 8 sets.
Set Associative Cache
[Diagram: 8 sets (Set 000 to Set 111), each containing two blocks (Block 0 and Block 1)]
• An address A will be divided into:
Address A: | Tag | Set Index | Block Offset | Byte Offset |
Accessing a Set Associative
Cache
Address A: | Tag | Set Index | Block Offset | Byte Offset |
• The Set Index portion of address A is extracted.
This is used to index the sets (i.e. If the Set Index
portion is 010, then this address is mapped to Set
010).
• The tag portion of A is extracted and compared
against the tags stored in Block 0 and Block 1 of
Set 010.
Accessing a Set Associative
Cache
• If a match is made either in Block 0 or Block 1 of
Set 010, then we have a cache hit, and the data for
A is read from the cache block.
• If we have a miss, then the data for A is fetched
from main memory, and placed in the first
available block in Set 010.
• If no blocks are available, a replacement policy is
invoked to choose a block to replace.
Example
• Question 7.20
Basic formula:
Blk_Addr = floor(word_address/words_per_block) mod N
– Here N is the number of sets, NOT NUMBER OF BLOCKS!
– This is the mathematical version of taking the value of the Block
Index bits from the address.
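– A made-up illustration: in a 2-way set associative cache with 8 sets and 1 word per block, word address 37 maps to set 37 mod 8 = 5 (set 101), and its data may go into either block of set 101.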
Multi-block Set
vs.
Multi-word blocks?
• Confusion often arises over multi-block sets
(or n-way set associative) vs. multi-word
blocks.
• Each block in a set can itself have multiple
words, like the blocks in question 7.8. Each
block will also have its own tag.
Advantages and Disadvantages
Set Associative Cache
• Advantages
– Almost as simple to build as a direct-mapped
cache.
– Only n comparators are needed for an n-way set
associative cache. For 2-way set-associative,
only 2 comparators are needed to compare tags.
– Supports temporal locality by having full
associativity within a set.
Advantages and Disadvantages
Set Associative Cache
• Disadvantages
– Not as good as fully-associative cache in supporting
temporal locality.
– For LRU schemes, because of the small associativity,
it is actually possible to have a 0% hit rate for temporally
local data.
– E.g. If our accesses are A1 A2 A3 A1 A2 A3, and if A1, A2
and A3 map to the same 2-way set, then hit rate is 0% as
subsequent accesses replace previous accesses in the
LRU scheme.
Writing to Cache
• Remember that data in cache is merely a
copy of data in main memory.
• When data that is stored in a cache block is
modified (e.g. when doing a sw to address
A), then the copy in cache will become
inconsistent with the copy in memory.
• Need a way to maintain consistency.
Memory/Cache Consistency
2 solutions
• Write-through cache
– In the write-through cache, consistency
between cache data and memory data is
maintained by updating both main memory and
cache.
– This is very slow
• Must wait for both cache and memory writes to
complete before CPU can proceed.
• Memory writes are very slow!
Memory/Cache Consistency
2 solutions
• Write-back Cache
– Only the cache copy of data is updated.
– When the data in a block is updated, a special flag
called the dirty bit will be set to indicate that the cache
copy is now inconsistent with the memory copy.
– If the block is chosen for replacement (either by
replacement policy or because another address A’ maps
to the same block), then the memory copy is updated if
the dirty bit is set.
– If the dirty bit is not set, the block is simply replaced.
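A sketch of this write-back logic in C (names and structure are mine; the memory write is a stub):
#include <stdio.h>
struct block {
    int valid, dirty;
    unsigned tag;
    unsigned word[4];
};
/* Stand-in for the real DRAM write. */
static void write_block_to_memory(struct block *b)
{
    printf("writing back block with tag %u\n", b->tag);
}
/* Reuse block b for a new address, flushing it first if dirty. */
static void replace_block(struct block *b, unsigned new_tag)
{
    if (b->valid && b->dirty)
        write_block_to_memory(b);  /* memory copy updated only now */
    b->tag   = new_tag;
    b->dirty = 0;                  /* fresh copy starts consistent */
    b->valid = 1;
}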
Nitty-Gritty
Use of Addresses by Cache
• Addresses are used to access cache.
• For Direct Mapped Cache:
Address A: | Tag | Block Index | Block Offset | Byte Offset |
Nitty-Gritty
Use of Addresses by Cache
• Direct Mapped Cache
Address A: | Tag | Block Index | Block Offset | Byte Offset |
Number of Byte Offset Bits B= log2(number of bytes per word)
•On MIPS this is usually 2 bits
Number of Block Offset Bits W= log2(Number of words per block)
•0 bits for 1-word blocks
Number of Block Index Bits I = log2(Number of blocks)
Number of tag bits = address_length - B - W- I
•address_length is 32 bits on MIPS
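These formulas sketched in C (my names; assumes power-of-two sizes; the 64 KB figure is just an example):
#include <stdio.h>
/* log2 for exact powers of two */
static unsigned lg2(unsigned n)
{
    unsigned bits = 0;
    while (n > 1) { n >>= 1; bits++; }
    return bits;
}
int main(void)
{
    unsigned bytes_per_word = 4, words_per_block = 1;
    unsigned cache_bytes = 64 * 1024;   /* e.g. a 64 KB direct-mapped cache */
    unsigned num_blocks = cache_bytes / (words_per_block * bytes_per_word);
    unsigned B = lg2(bytes_per_word);   /* byte offset bits      */
    unsigned W = lg2(words_per_block);  /* block offset bits     */
    unsigned I = lg2(num_blocks);       /* block index bits      */
    unsigned tag = 32 - B - W - I;      /* 32-bit MIPS addresses */
    printf("B=%u W=%u I=%u tag=%u\n", B, W, I, tag);
    return 0;
}
For this 64 KB, 1-word-per-block cache it prints B=2 W=0 I=14 tag=16.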
Nitty-Gritty
Use of Addresses by Cache
• Fully Associative Cache
Address A: | Tag | Block Offset | Byte Offset |
Number of Byte Offset Bits B= log2(number of bytes per word)
•On MIPS this is usually 2 bits
Number of Block Offset Bits W= log2(Number of words per block)
•0 bits for 1-word blocks
Number of tag bits = address_length - B - W
•address_length is 32 bits on MIPS
Note that there are no index bits for fully associative caches.
Nitty-Gritty
Use of Addresses by Cache
• Set-Associative Cache
Address A: | Tag | Set Index | Block Offset | Byte Offset |
Number of Byte Offset Bits B= log2(number of bytes per word)
•On MIPS this is usually 2 bits
Number of Block Offset Bits W= log2(Number of words per block)
•0 bits for 1-word blocks
Number of Set Index Bits S = log2(Number of sets)
Number of tag bits = address_length - B - W- S
•address_length is 32 bits on MIPS
Example
• A cache built for the MIPS architecture has a total
size of 128 KB. Find the total number of tag, set
index, block index, block offset, and byte offset
bits for a given address A for each of the following
cache architectures:
– Direct Mapped, 1 word per block.
– Direct Mapped, 8 words per block.
– Fully associative, 2 words per block.
– 2-way set associative, 4 words per block.
Example
• Basic things you first need to work out:
– What types of information do I need to determine for
each cache architecture?
• E.g. for set-associative, need to determine byte-offset, block
offset, set index and tag bits.
– What is the cache size in terms of words?
– What is the total number of blocks that we would have,
or the total number of sets?
• This will give us the number of index bits.
– Any other important information?
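• My working, as a check (32-bit addresses, 4-byte words; 128 KB = 32K words):
– Direct mapped, 1 word per block: 32K blocks, so B=2, W=0, I=15, tag=15.
– Direct mapped, 8 words per block: 4K blocks, so B=2, W=3, I=12, tag=15.
– Fully associative, 2 words per block: B=2, W=1, no index bits, tag=29.
– 2-way set associative, 4 words per block: 8K blocks in 4K sets, so B=2, W=2, S=12, tag=16.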
Nitty-Gritty
Cache Housekeeping Flags
• Other than the data and tag bits, cache blocks need
to store housekeeping flags.
• The dirty bit (D) we saw earlier is an example.
• Other bits include:
– Valid bit (V)
• When a cache first starts up, the tag and data bits are random.
It is possible to have a cache hit because the tag from an
address may match a random number in the tag field of a
block.
• But the data is random and invalid!
• The Valid bit is normally off, and will be set when valid data is
written to a block.
Nitty-Gritty
Cache Housekeeping Flags
– Use Bit (U)
• This is used by the LRU replacement algorithm to determine
which block is LRU.
• Present only in fully-associative and set-associative caches
using LRU replacement policies.
• A cache may thus look as complicated as this:
| V | D | U | Tag | Word 00 | Word 01 | Word 10 | Word 11 |
Total Number of Bits in Cache?
• When we speak of cache size, we normally refer
to how many bytes of main memory data the
cache can hold
– E.g. a 64KB cache can hold up to 64KB of main
memory data
• But as we can see, data is not the only thing stored
in a cache block.
– We also have the tag and housekeeping flags!
• Thus the total number of bits needed to implement
a cache can be much bigger than the specified
cache size!
Example
• We want to implement a 256KB write-back
cache on the MIPS architecture. The cache
will be 4-way set associative, with 4-word
blocks. The LRU replacement policy will be
used. Find the total number of bits of
SRAM required to implement this cache.
Example
• Analysis
– What housekeeping flags will be needed?
– What is the size of the data portion of each block?
– What is the number of blocks?
– What is the number of sets?
– What is the number of tag bits?
• What is the number of byte offset, block offset and set index
bits required?
• Based on this analysis, you should be able to get
the answer.
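• My working, as a check (assuming one V, one D and one U bit per block, as in these slides; true 4-way LRU would need more than one use bit per block):
– 256 KB = 64K words; 4 words per block gives 16384 blocks; 4-way gives 4096 sets.
– B=2, W=2, S=12, so tag = 32 - 2 - 2 - 12 = 16 bits.
– Each block: 4 x 32 data bits + 16 tag bits + 3 flag bits = 147 bits.
– Total: 16384 x 147 = 2,408,448 bits, about 294 KB of SRAM for a "256 KB" cache.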
Summary
• Caches
– Make use of locality to make it possible to have small
amounts of fast expensive memory hold a copy of main
memory data that is likely to be accessed soon.
– Allow fast access to huge amounts of memory.
• Cache types
– Direct Mapped
• Simple, fast
• Poor temporal locality
Summary
– Fully Associative
• Flexibility of block placement allows smart
replacement algorithms that promote temporal
locality.
• Expensive, slow.
– Set Associative
• Simpler to build than fully associative, yet gives
good temporal locality through flexible placement of
blocks (just like fully-associative).
• Limited associativity can sometimes give poor
performance
Summary
• Writing policies
– Write-through
• Simple to implement
• Slow
– Write-back
• Fast
• Difficult to implement
• Housekeeping flags
– Need extra info for the running of the cache
• Total Cache Sizes vs. Cache Sizes
– Not the same thing!