CS1104 Help Session I
Memory
Semester II 2001/02
Colin Tan,
S15-04-05,
Ctank@comp.nus.edu.sg
Memory
• Memory can be visualized as a stack of pigeon
holes. Current computers have about 128,000,000
pigeon holes.
• Each pigeon hole is given a number, starting from
0. This number is called an “address”.
• Each pigeon hole will contain either data (e.g.
numbers you want to add together) or an instruction
(e.g. add two numbers).
Memory
• Memory locations 0 to 3 contain instructions, locations 4
to 6 contain data.
• Note: In reality, instructions are also encoded into
numbers!
Addresses
• As mentioned, each pigeon hole has a number
identifying it called an “address”.
• When the CPU requires an instruction, it will send
the instruction’s “address” to memory, and the
memory will return the instruction at that address.
– E.g. At IF CPU will send “0” to memory, and the
memory returns li t1, 5
– At MEM CPU will send “6” to memory, and memory
returns “10”.
– At WB, CPU writes “10” back to t1.
Addressing Bits
• Computers work only in binary
– Hence addresses generated in the previous example are
also in binary!
• In general, to address a maximum of n memory
locations, you will need m = log2 n bits in your
address.
• Conversely, if you had m bits in your address, you
can access a maximum of 2^m memory locations.
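• For instance, the 128,000,000 pigeon holes mentioned earlier need
ceil(log2 128,000,000) = 27 address bits, since 2^27 = 134,217,728. A
minimal C sketch of this relationship (the variable names are only for
illustration):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        unsigned long n = 128000000UL;          /* number of memory locations */
        int m = (int)ceil(log2((double)n));     /* address bits needed        */
        printf("%lu locations need %d address bits (2^%d = %.0f)\n",
               n, m, m, pow(2.0, m));           /* prints 27 and 134217728    */
        return 0;
    }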
Memory Hierarchy
• Motivation
– Not all memory is created equal
• Cheap Memory => Slow
• Fast Memory => Expensive
– DRAM, 70 ns access time, $1/MByte
– SRAM, 8 ns access time, $50/MByte
– So, you can choose either:
• Have fast but very small memory, OR
• Large but very slow memory.
Memory Hierarchy
• Memory hierarchy gives you a third option:
– Large, but very fast memory
• Though slower than the expensive memory mentioned earlier.
Locality
• “Locality” is a particular type of behavior exhibited by
running programs:
– Spatial locality: If a memory location has been accessed, it is very
likely its neighbor will also be accessed.
– Temporal locality: If a memory location has been accessed, it is
very likely that it will be accessed again sometime soon.
Locality - Example
• Consider the following program:
for(i=0; i<10; i++)
a[i] = b[i] + c[i];
Locality - Example
• In memory it will look like this:
Locality - Example
• Tracing the execution of the program:
Locality - Example
• Focusing only on the addresses of the fetched instructions,
we see that the addresses the instructions are fetched from
are:
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2, 3, 4, 5, …
• Here we see both:
– Spatial locality (e.g. after location 0 is accessed, location 1 is
accessed, then 2, etc.)
– Temporal locality (e.g. location 2 is accessed 10 times!)
Effect of Locality
• Locality means that in the short run out of all the memory
you have (perhaps up to 128,000,000 pigeon holes!), only
a very small number of locations are actually being
accessed!
– In our example for ten iterations only memory locations 2 to 10 are
being accessed out of 128,000,000 possible locations!
– What if we had a tiny amount of very fast (but expensive!)
memory and kept these locations in that fast memory?
• We can speed up access times dramatically!!
– This is the idea behind caches.
How Do Caches Help?
• The average time to access memory (AMAT) is
given by:
AMAT = hit_rate * Tcache + miss_rate * (Tmemory + Tcache)
Tcache = Time to read the cache (8 ns for our SRAM cache)
Tmemory = Time to read main memory (70 ns for DRAM)
miss_rate = Probability of not finding what we want in the cache (hit_rate = 1 - miss_rate).
• Because of locality, miss_rate is very small
– Typically about 3% to 5%.
With a 5% miss rate, our AMAT = 0.95 * 8 ns + 0.05 * (70 + 8) ns = 11.5 ns
Our AMAT is about 44% slower than pure SRAM
memory (11.5 ns vs. 8 ns)
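• The same calculation as a small C sketch (the numbers are taken from
above; only the variable names are mine):

    #include <stdio.h>

    int main(void) {
        double t_cache   = 8.0;    /* ns, SRAM cache access time */
        double t_memory  = 70.0;   /* ns, DRAM access time       */
        double miss_rate = 0.05;   /* 5% miss rate               */
        double hit_rate  = 1.0 - miss_rate;

        double amat = hit_rate * t_cache + miss_rate * (t_memory + t_cache);
        printf("AMAT = %.1f ns\n", amat);   /* prints 11.5 ns */
        return 0;
    }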
How Do Caches Help?
• What about cost?
• Let’s consider:
• A system with 32 MB of DRAM memory, 512KB of
SRAM cache.
• Cost is $1/MB for DRAM, and $50/MB for SRAM.
• If we had 32MB of SRAM, access time is 8 ns, but
cost will be $1,600
• With 32MB of DRAM, cost is $32, but access time
is 70 ns!
• But with 32MB of DRAM and 512 KB (1/2 MB) of
SRAM cache, the cost will be: $32 + (512/1024) * 50 = $57!
How do Caches Help?
• So with pure SRAM, we can have an 8 ns average access time at $1,600.
• With pure DRAM, our memory will cost only $32, but all accesses will take
70 ns!
• With DRAM memory and an SRAM cache, we can have an 11.5 ns average access
time at $57.
• So for a performance drop of about 44%, the pure-SRAM solution would have
cost roughly 28 times more ($1,600 vs. $57)!
• Hence caches give us a large memory (32 MB) at close to the cost
of the DRAM technology ($57 vs. $32), but at close to the speed of the
expensive SRAM technology (11.5 ns vs. 8 ns).
Cache Architecture
• Caches consist of blocks (or lines). Each
block stores data from memory.
• Block allocation problem:
– Given data from an address A, how do we
decide which block of cache its data should go
to?
The Block Allocation Problem
• 3 possible solutions:
– Data from each address A will go to a fixed
block.
• Direct Mapped Cache
– Data from each address A will go to any block.
• Fully associative cache
– Data from address A will go to a fixed set of
blocks.
• Data may be put into any block within a set.
• Set associative cache.
Direct Mapped Caches
• The value of a portion of memory address is
used to decide which block to send the data
to:
Address A = | Tag | Block Index | Block Offset | Byte Offset |
• The Block Index portion is used to decide
which block data from this address should
go to.
Example
• The number of bits in the block index is
log2N, where N is the total number of
blocks.
• For a 4-block cache, the block index portion
of the address will be 2 bits, and these 2 bits
can take on the value of 00, 01, 10 or 11.
• The exact value of these 2 bits will
determine which block the data for that
address will go to.
Direct Mapped Addressing E.g.
• Show how an address generated by the MIPS CPU will
be divided into byte offset, block offset, block index and
tag portions for the following cases:
i) Block size: 1 word, 128 blocks
ii) Block size: 4 words, 64 blocks
• All MIPS addresses are 32 bit byte addresses (i.e. they
address individual bytes in a word).
Case I
• Block size 1 word, 128 blocks: byte offset = 2 bits, block offset =
log2 1 = 0 bits, block index = log2 128 = 7 bits, tag = 32 - 2 - 0 - 7 = 23 bits.
Case II
• Block size 4 words, 64 blocks: byte offset = 2 bits, block offset =
log2 4 = 2 bits, block index = log2 64 = 6 bits, tag = 32 - 2 - 2 - 6 = 22 bits.
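• A minimal sketch in C of how the Case II fields could be extracted from a
32-bit address with shifts and masks (the variable names and the example
address are illustrative, not from the slides):

    #include <stdio.h>

    int main(void) {
        unsigned int addr = 0x0040A3C8;                   /* example byte address (made up)   */
        unsigned int byte_offset  =  addr        & 0x3;   /* bits 1..0: byte within the word  */
        unsigned int block_offset = (addr >> 2)  & 0x3;   /* bits 3..2: word within the block */
        unsigned int block_index  = (addr >> 4)  & 0x3F;  /* bits 9..4: which of the 64 blocks*/
        unsigned int tag          =  addr >> 10;          /* bits 31..10: the 22-bit tag      */
        printf("tag=%u index=%u block_offset=%u byte_offset=%u\n",
               tag, block_index, block_offset, byte_offset);
        return 0;
    }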
Example
• The value of the two block index bits will
determine which block the data will go to,
following the scheme shown below:
(Figure: the 4-block cache, with its blocks labelled 00, 01, 10 and 11.)
Solving Direct-Mapped Cache
Problems
• Question 7.7
Basic formula:
Blk_Addr = floor(word_address/words_per_block) mod N
– N here is the total number of blocks in the cache
– This is the mathematical version of taking the value of the Block
Index bits from the address.
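• As a quick check, the formula can be written as a small C function (a
sketch; the function name and the example numbers below are illustrative,
not from Question 7.7):

    /* floor(word_address / words_per_block) mod N */
    unsigned int block_index(unsigned int word_address,
                             unsigned int words_per_block,
                             unsigned int num_blocks) {
        return (word_address / words_per_block) % num_blocks;
    }

• E.g. with 4 words per block and a 64-block cache, word address 300 maps to
block (300 / 4) mod 64 = 75 mod 64 = 11.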
A Complication:
Multiple Word Blocks
• Single word blocks do not support spatial
locality
– Spatial locality: Likelihood of accessing
neighbor of a piece of data that was just
accessed is high.
– But with single word blocks, none of the
neighbors are in cache!
• All accesses to neighbors that were not accessed
before will miss!
An Example
Question 7.8
Accessing Individual Words
• In our example, each block has 4 words.
• But we always access memory 1 word at a time!
(e.g. lw)
• Use the Block Offset to specify which of the 4
words in a block we want to read:
Address A = | Tag | Block Index | Block Offset | Byte Offset |
The Block Offset
• Number of block offset bits = log2M, where M is
the number of words per block.
• For our example, M=4. So number of block offset
bits is 2.
• These two bits can take on the values of 00, 01, 10
and 11.
• Note that for single word blocks, the number of
block offset bits is log2 1, which is 0. I.e. There are
no block offset bits for single-word blocks.
• These values determine exactly which word within
the block address A is referring to.
Purpose of the Tag
• Many different addresses may map to the
same block: e.g. (the Block Index is the second
field in each address below)
01000 00010010 00000000 00
01010 00010010 00000000 00
11011 00010010 00000000 00
• All 3 addresses are different, but all map
to block 00010010
Disambiguation
• We need a way to disambiguate the situation
– Otherwise how do we know that the data in block x
actually comes from address A and not from another
address A’ that has the same block index bit value?
• The portion of the address A to the left of
the Block Index can be used for
disambiguation.
• This portion is called the tag, and the tag for
address A is stored in the cache together
with address A data.
The Tag
(Figure: a 4-block cache; each block holds a Tag plus Word 00, Word 01,
Word 10 and Word 11, and the blocks are indexed 00, 01, 10 and 11.)
• When we access the cache, the Tag portion and
Block Index portions of address A are extracted.
• The Block Index portion will tell the cache
controller which block of cache to look at.
• The Tag portion is compared against the tag stored
in the block. If the tags match, we have a cache
hit. The data is read from the cache.
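• Putting the block index, block offset and tag together, here is a minimal
C sketch of a direct-mapped lookup for the 4-block, 4-words-per-block
example (the struct, the valid bit and the function name are illustrative
additions, not from the slides):

    struct block {
        int valid;                 /* has this block been filled yet?       */
        unsigned int tag;          /* tag stored alongside the block's data */
        unsigned int word[4];      /* the 4 words held in the block         */
    };

    static struct block cache[4]; /* 4 blocks in total */

    /* Returns 1 on a hit (word copied to *data), 0 on a miss. */
    int lookup(unsigned int addr, unsigned int *data) {
        /* bits 1..0 are the byte offset, ignored for a whole-word access */
        unsigned int block_offset = (addr >> 2) & 0x3;   /* bits 3..2: word in block */
        unsigned int block_index  = (addr >> 4) & 0x3;   /* bits 5..4: which block   */
        unsigned int tag          =  addr >> 6;          /* bits 31..6: the tag      */

        if (cache[block_index].valid && cache[block_index].tag == tag) {
            *data = cache[block_index].word[block_offset];
            return 1;                                    /* tags match: cache hit */
        }
        return 0;                                        /* cache miss            */
    }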
Accessing Individual Bytes
Address A = | Tag | Block Index | Block Offset | Byte Offset |
• MIPS addresses are byte addresses, and actually
index individual bytes rather than words.
• Each MIPS word consists of 4 bytes.
• The byte offset tells us exactly which byte within a
word we are referring to.
Advantages & Disadvantages of
Direct Mapped Caches
• Advantages:
– Simple to implement
– Fast performance
• Less time to detect a cache hit => less time to get
data from the cache => faster performance
• Disadvantages
– Poor temporal locality.
• Many addresses may map to the same block.
• The next time address A is accessed, it may have
been replaced by the contents of address A’.
Improving Temporal Locality
The Fully Associative Cache
• In the fully associative cache, data from an
address A can go to any block in cache.
– In practice, data will go into the first available
cache block.
– When the cache is full, a replacement policy is
invoked to choose which block of cache to
throw out.
Advantages and Disadvantages
Fully Associative Cache
• Good temporal locality properties
– Flexible block placement allows smart
replacement policies such that blocks that are
likely to be referenced again will not be
replaced. E.g. LRU, LFU.
• Disadvantages
– Complex and too expensive for large caches
• Each block needs a comparator to check the tag.
• With 8192 blocks, we need 8192 comparators!
A Compromise
Set Associative Caches
• Represents a compromise between direct-mapped and fully associative caches.
• Cache is divided into sets of blocks.
• An address A is mapped directly to a set
using a similar scheme as for direct mapped
caches.
• Once the set has been determined, the data
from A may be stored in any block within a
set - Fully associative within a set!
Set Associative Cache
• An n-way set associative cache will have n
blocks per set.
• For example, for a 16-block cache that is
implemented as a 2-way set associative
cache, each set has 2 blocks, and we have a
total of 8 sets.
Advantages and Disadvantages
Set Associative Cache
• Advantages
– Almost as simple to build as a direct-mapped
cache.
– Only n comparators are needed for an n-way set
associative cache. For 2-way set-associative,
only 2 comparators are needed to compare tags.
– Supports temporal locality by having full
associativity within a set.
Advantages and Disadvantages
Set Associative Cache
• Disadvantages
– Not as good as fully-associative cache in supporting
temporal locality.
– For LRU schemes, because of the small associativity,
it is actually possible to have a 0% hit rate for temporally
local data.
– E.g. If our accesses are A1 A2 A3 A1 A2 A3, and if A1, A2
and A3 map to the same 2-way set, then hit rate is 0% as
subsequent accesses replace previous accesses in the
LRU scheme.
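• A tiny C simulation of that worst case (a sketch: the tags 1, 2 and 3
stand in for A1, A2 and A3, and the LRU bookkeeping is an illustrative
choice, not from the slides):

    #include <stdio.h>

    int main(void) {
        unsigned int set[2] = {0, 0};   /* the two blocks of one set (0 = empty)  */
        int lru = 0;                    /* index of the least recently used block */
        unsigned int trace[6] = {1, 2, 3, 1, 2, 3};   /* A1 A2 A3 A1 A2 A3        */
        int hits = 0;

        for (int i = 0; i < 6; i++) {
            if (set[0] == trace[i])      { hits++; lru = 1; }   /* hit in block 0    */
            else if (set[1] == trace[i]) { hits++; lru = 0; }   /* hit in block 1    */
            else { set[lru] = trace[i]; lru = 1 - lru; }        /* miss: replace LRU */
        }
        printf("hits = %d out of 6 accesses\n", hits);          /* prints 0 */
        return 0;
    }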
Multi-level Cache
• Let the first level of cache (closest to CPU) be called “L1”,
and the next level “L2”.
• Let Phit_l1 be the hit rate of L1, Tcache_L1 be the cache access
time of L1, Tmiss_L1 be the miss penalty of L1.
• AMAT of L1 = Phit_l1 * Tcache_L1 + (1-Phit_l1) * Tmiss_L1
• What is Tmiss_L1?
– If L1 misses, then we will attempt to get data from L2. Hence
Tmiss_l1 is actually just the AMAT of L2!
• Let Phit_l2 be the hit rate of L2, Tcache_l2 be the cache access
time of L2, Tmiss_l2 be the miss penalty of L2.
Multilevel Cache
• Tmiss_l1 = AMAT of L2 = Phit_l2 * Tcache_L2 + (1-Phit_l2) * Tmiss_L2
• Substitute this back and we get:
AMAT of L1 = Phit_l1 * Tcache_L1 + (1-Phit_l1) * (Phit_l2 * Tcache_L2 + (1-Phit_l2) * Tmiss_L2)
• Tmiss_l2 is of course the time taken to access the slow
DRAM memory.
• What if we had an L3 cache?
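• The two-level formula as a small C sketch (the hit rates and access times
below are illustrative assumptions, not figures from the slides):

    #include <stdio.h>

    int main(void) {
        double p_hit_l1 = 0.95, t_cache_l1 = 2.0;   /* ns, assumed L1 figures */
        double p_hit_l2 = 0.90, t_cache_l2 = 8.0;   /* ns, assumed L2 figures */
        double t_miss_l2 = 70.0;                     /* ns, DRAM access time   */

        double amat_l2 = p_hit_l2 * t_cache_l2 + (1 - p_hit_l2) * t_miss_l2;
        double amat_l1 = p_hit_l1 * t_cache_l1 + (1 - p_hit_l1) * amat_l2;
        printf("AMAT of L2 = %.2f ns, AMAT of L1 = %.2f ns\n",
               amat_l2, amat_l1);   /* 14.20 ns and 2.61 ns with these numbers */
        return 0;
    }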
Other Problems
• Question 7.9
Virtual Memory
Motivation
• Drive space is very very cheap
– Typically about 2 cents per megabyte.
– It would be ideal if we could set aside a portion of drive
space to be used as memory.
– Unfortunately disk drives are very slow
• Fastest access time is about 10 ms, which is over a million times slower
than SRAM and over a hundred thousand times slower than DRAM.
• Idea: Use drive space as memory, and main
memory to cache the drive space!
– This is the idea behind virtual memory.
Main Idea
(Diagram: Virtual Memory is cached by Main Memory, which in turn is cached
by the System Cache.)
• Virtual memory (residing on disk) is cached by
main memory
• Main memory is cached by system cache
• All memory transfers are only between
consecutive levels (e.g. VM to main memory,
main memory to cache).
Cache vs. VM
• Concept behind VM is almost identical to concept
behind cache.
• But different terminology!
– Cache: Block ↔ VM: Page
– Cache: Cache Miss ↔ VM: Page Fault
• Caches implemented completely in hardware. VM
implemented in software, with hardware support
from CPU.
• Cache speeds up main memory access, while main
memory speeds up VM access.
Technical Issues of VM
• Relatively cheap to remedy cache misses
– Miss penalty is essentially the time taken to access the
main memory (around 60-80ns).
– Pipeline freezes for about 60-80 cycles.
• Page Faults are EXPENSIVE!
– Page fault penalty is the time taken to access the disk.
– May take up to 50 or more ms, depending on the speed
of the disk and I/O bus.
– Wastes millions of processor cycles!
Virtual Memory Design
• Because page-miss penalties are so heavy, it is not
practical to implement direct-mapped or set-associative
architectures
– These have poorer hit rates.
• Main memory caching of VM is always fully
associative.
– Full associativity gives a 1% to 2% improvement in hit rate over
direct-mapped or set-associative designs.
– But with heavy page-miss penalties, a 1% improvement
is A LOT!
• Also relatively cheap to implement full
associativity in software
Summary
• Memory can be thought of as pigeon holes where CPU
stores instructions and data.
• Each pigeon hole (memory location) is given a number
called its address.
• Memory technology can be cheap and slow (DRAM) or
fast and expensive (SRAM)
• Locality allows us to use a small amount of fast expensive
memory to store parts of the cheap and slow memory to
improve performance.
• Caches are organized into blocks.
Summary
• Mapping between memory addresses and blocks can be
accomplished by:
– Directly mapping a memory location to a cache block (direct map)
– Slotting a memory location to any block (fully associative)
– Mapping a memory location to a set of blocks, then slotting it into
any block within the set (set associative)
• Virtual memory attempts to use disk space as “main
memory”, DRAM main memory as cache to the disk
memory, and SRAM as cache to the DRAM.