Cache Memory

Processor - Memory Interface

Memory must be random access memory - memory in which individual memory locations can be accessed in any order at the same high speed.

The memory that connects to the processor should operate at a very high speed, preferably at a speed that matches the processor, so as not to slow the system down.

[Figure: processor connected to memory, transferring instructions and data]

Large dynamic semiconductor RAM used for main memory cannot operate at that speed (much slower).

Relatively small static semiconductor memory can be designed to operate faster.
ITCS 3181 Logic and Computer Systems 2014 B. Wilkinson Slides13.ppt
Modification date: Nov 18, 2014
1
Solution: Cache Memory
The processor operates much faster than the main memory can. To ameliorate the situation, a high speed memory called a cache memory is placed between the processor and main memory. Information must be in the cache memory for the processor to access it.

[Figure: processor - high speed cache memory - main memory, with data transfers between each level]

The first paper on cache memories:
M. Wilkes, “Slave Memories and Dynamic Storage Allocation,” IEEE Trans. on Electronic Computers, 1965.
(What else did he invent/publish first?)
2
Time to access contents of memory
If the same instructions were never re-executed, a cache would cause additional overhead, as information would first have to be transferred from the main memory to the cache and then to the processor, and vice versa, i.e. the access time, ta, would be:

ta = tm + tc

where
tc = cache access time
tm = main memory access time.
Fortunately, virtually all programs repeat sections of code and
repeatedly access the same or nearby data. This characteristic is
embodied in the Principle of Locality.
3
Principle of Locality
Found empirically to be obeyed by most programs. Applies to both instruction and data references, though it holds more strongly for instruction references.
Two main aspects:
1. Temporal locality (locality in time) – individual locations, once
referenced, are likely to be referenced again in the near future.
Seen in instruction loops, stacks, variable accesses…
Temporal locality is essential for an effective cache.
2. Spatial locality (locality in space) – references are likely to be near
last reference. Seen in data accesses as data often stored in
consecutive locations. References to next location sometimes
separated into a third aspect, known as sequential locality.
Spatial locality helpful in the design of a cache but not essential.
4
Taking Advantage of Temporal Locality
Suppose a reference is repeated n times in all during a program loop and, after the first reference, the location is always found in the cache. Then the average access time would be:

Average access time = (n tc + tm)/n = tc + tm/n

where n = number of references.
Example
If tc = 5 ns, tm = 60 ns and n = 10, average access time would be 11 ns,
as opposed to 60 ns without cache.
THROUGHOUT: tc is the time to access the cache and read (or write) the data on a hit, or to recognize a miss. In practice, these times could be different. tm is the extra time needed to access the main memory. Sometimes machine cycles are used in the equations rather than absolute times.
5
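As a quick check of the formula and example above, a minimal C sketch (the values are those from this slide):

#include <stdio.h>

/* Average access time when a location is referenced n times and only the
   first reference misses: (n*tc + tm)/n = tc + tm/n. */
static double avg_access_time_loop(double tc, double tm, int n) {
    return tc + tm / n;
}

int main(void) {
    /* tc = 5 ns, tm = 60 ns, n = 10 -> 11 ns, as in the example. */
    printf("average access time = %.1f ns\n", avg_access_time_loop(5.0, 60.0, 10));
    return 0;
}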
Hit Ratio
– the probability that the required word is already in the cache.

A hit occurs when a location in the cache is found immediately, otherwise a miss occurs and a reference to the main memory is necessary.

[Figure: on a cache miss, the address goes to main memory and the data at location X is brought into the high speed cache memory]

The cache hit ratio, h, (or hit rate) is defined as:

h = (number of times required word found in cache) / (total number of references)
The miss ratio (or miss rate) is given by 1 - h.
6
Average access time using Hit Ratio
The average access time, ta is given by:
ta = tc + (1 - h)tm
assuming again that the access must be to the cache on a hit or miss
before an access is made to the main memory on a miss.*
Example
If the hit ratio is 0.85 (a typical value), main memory access time is 50 ns
and cache access time is 5 ns, the average access time is 5 + 0.15 × 50 = 12.5 ns.
Machine cycles
In a practical system, each access time is given as an integer number of
machine cycles. Typically the hit time will be 1–2 cycles. The cache miss penalty
(extra time to access main memory) is on the order of 5–20 cycles.
*Only read requests are considered here. Write requests are considered later.
7
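The same relationship as a minimal C sketch, using the hit-ratio form of the formula and the example values from this slide:

#include <stdio.h>

/* Average access time with hit ratio h: ta = tc + (1 - h) * tm,
   where tm is the extra time needed on a miss. */
static double avg_access_time(double tc, double tm, double h) {
    return tc + (1.0 - h) * tm;
}

int main(void) {
    /* h = 0.85, tm = 50 ns, tc = 5 ns -> 12.5 ns, as in the example. */
    printf("average access time = %.1f ns\n", avg_access_time(5.0, 50.0, 0.85));
    return 0;
}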
Taking advantage of Spatial Locality
To take advantage of spatial locality, transfer not just one byte or word to/from main memory to cache but a series of sequential locations called a line or a block.

For best performance, the line should be transferred simultaneously across a wide data bus to the cache. This also enables the access time of main memory to be matched to the cache.

[Figure: cache memory with multiple memory modules (wide word length memory); byte locations 0-15 are spread across the modules so that a whole line can be transferred over the bus in one operation, with the memory address divided into line and byte fields]
8
Cache Memory Organizations
Need a way to select the location within the cache. The memory address of its location in main memory is used.

Three ways of selecting the cache location:
1. Fully associative
2. Direct mapped
3. Set associative

[Figure: processor issues a memory address to the cache; data is returned from the cache, which is backed by main memory]
9
1. Fully Associative Mapping
Both memory address and data stored together in the cache.
Incoming memory address is simultaneously compared with all
stored addresses using the internal logic of the cache memory.
Requires one address comparator with each stored address (content-addressable memory).

[Figure: memory address from processor is compared with all stored addresses simultaneously; if the address is found, the corresponding location is accessed and the data returned; if the address is not found in the cache, main memory is accessed]
10
Example
Suppose each line has 16 bytes. With 32-bit processors, a word consists of 4 bytes.

“Word” field specifies the word within the line. In this example, with 4 words in a line, it needs 2 bits.

“Byte” field specifies the byte within the word. In this example, with 4 bytes in a word, it needs 2 bits.

[Figure: memory address from processor with word-within-line and byte-within-word fields (2 bits each); the address part is compared with all stored addresses simultaneously and, if found, the word in the stored line (words 0-3) is accessed, selecting the byte within the word if necessary]
11
Selection/Replacement Algorithms
Fully associative cache needs an algorithm to select where to store
information in cache, generally over some existing line (which would
have to be copied back to the main memory if altered).*
Must be implemented in hardware. (No software)
Ideally, algorithm should choose a line which is not likely to be
needed again in the near future, from all lines that could be selected.
Common Algorithms
1. Random selection
2. The least recently used algorithm (or an approximation to it).
* Note in caches the selection and replacement location usually refers to the same
location whereas in virtual memory (OS course) they usually refer to different locations.
12
Least Recently Used (LRU) Algorithm
The line which has not been referenced for the longest time is removed from the cache.

The word “recently” comes about because the chosen line is not the least used line overall, as that line is most likely back in main memory. It is the least used of those lines currently in the cache, and all of these are likely to have been used recently, otherwise they would not be in the cache.

Can only be implemented fully in hardware when the number of lines that need to be considered is small.
13
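A minimal C sketch of LRU over a small group of lines, using per-line age counters; this is an illustrative approximation, and real hardware often uses cheaper schemes such as pseudo-LRU:

#include <stdio.h>
#include <stdint.h>

#define WAYS 4   /* number of lines the replacement logic has to consider */

/* Per-line age counters: 0 = most recently used, WAYS-1 = least recently used. */
typedef struct {
    uint8_t age[WAYS];
} lru_t;

/* Start with a fixed ordering: way 0 most recent, way WAYS-1 least recent. */
static void lru_init(lru_t *s) {
    for (int i = 0; i < WAYS; i++)
        s->age[i] = (uint8_t)i;
}

/* Mark 'way' as most recently used; every line that was younger ages by one. */
static void lru_touch(lru_t *s, int way) {
    for (int i = 0; i < WAYS; i++)
        if (s->age[i] < s->age[way])
            s->age[i]++;
    s->age[way] = 0;
}

/* The replacement candidate is the line that has gone longest without a reference. */
static int lru_victim(const lru_t *s) {
    int oldest = 0;
    for (int i = 1; i < WAYS; i++)
        if (s->age[i] > s->age[oldest])
            oldest = i;
    return oldest;
}

int main(void) {
    lru_t s;
    lru_init(&s);
    lru_touch(&s, 3);   /* reference line 3 */
    lru_touch(&s, 1);   /* reference line 1 */
    printf("replace line %d\n", lru_victim(&s));   /* line 2: least recently used */
    return 0;
}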
Direct Mapping
The line is held in the cache at a location given by the “index” bits of the main memory address, i.e. the line is selected by the index bits. The most significant bits of the address, stored in the cache as a tag, are compared with the most significant bits of the main memory address.

Requires only one external address comparator.

[Figure: memory address from processor divided into tag, index, word and byte fields; the index selects a line in the high speed RAM cache, the stored tag and words 0 to n-1 are read, and the stored tag is compared with the incoming tag; if the same, the word/byte in the line is accessed, if different, main memory is accessed]
14
Sample Direct-Mapped Cache Design
8192-byte direct mapped cache with 32-byte line organized as eight
4-byte words. 32-bit memory address.
[Figure: 32-bit memory address from processor divided into tag (19 bits), index (8 bits), word (3 bits) and byte (2 bits) fields; the index selects one of the 8192/32 = 256 (2^8) cache lines, each holding a 19-bit tag and words 0-7; the stored tag is read and compared with the incoming tag and, if the same, the word/byte in the line is accessed]
With 4 bytes in word, need 2 bits in byte field.
With 8 words in line, need 3 bits in word field.
With 8192 bytes in total and 32 bytes in each line, 8192/32 entries in cache (= 256 = 2^8). So index = 8 bits.
15
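To make the field widths concrete, a minimal C sketch that decomposes a 32-bit address for this 8192-byte, 32-byte-line direct-mapped cache (the function name and example addresses are illustrative only):

#include <stdio.h>
#include <stdint.h>

/* Field widths for the 8192-byte direct-mapped cache with 32-byte lines:
   byte = 2 bits, word = 3 bits, index = 8 bits, tag = remaining 19 bits. */
static void decompose(uint32_t addr) {
    uint32_t byte  =  addr        & 0x3;    /* byte within word  (2 bits) */
    uint32_t word  = (addr >> 2)  & 0x7;    /* word within line  (3 bits) */
    uint32_t index = (addr >> 5)  & 0xFF;   /* cache line index  (8 bits) */
    uint32_t tag   =  addr >> 13;           /* tag              (19 bits) */
    printf("addr 0x%08X -> tag 0x%05X  index %3u  word %u  byte %u\n",
           (unsigned)addr, (unsigned)tag, (unsigned)index,
           (unsigned)word, (unsigned)byte);
}

int main(void) {
    decompose(0x00001234);
    decompose(0x00003234);   /* same index, different tag: would conflict */
    return 0;
}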
Advantages of Direct Mapped Caches
1. No replacement algorithm necessary - because there is no
choice in the selection of the location for the incoming line. It
is given by the index of the address of incoming line.
2. Simple hardware and low cost.
3. High speed of operation.
16
Major Disadvantage of Direct Mapped
Caches
Performance drops significantly if accesses are made to
different locations with the same index.
However, as the size of cache increases, the difference in the
hit ratios of the direct and associative caches reduces and
becomes insignificant.
17
Elements of an Array Stored in Memory
Every nth location in memory maps into the same location in the cache, where there are n locations in the cache.

A two-dimensional array, a[ ][ ], with rows of n elements would map all the elements in a given column (a[0][0], a[1][0], a[2][0], ...) into one cache location (if stored in row-major order, as in C).

[Figure: array elements a[0][0], a[0][1], ..., a[0][n-1], a[1][0], ..., a[1][n-1], a[2][0], a[2][1], ... laid out in memory, n locations per row; elements n locations apart map to the same location in the cache]
18
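A minimal C sketch of the effect, assuming for simplicity a direct-mapped cache with CACHE_LINES locations and one array element per location (both constants are illustrative):

#include <stdio.h>

#define CACHE_LINES 8          /* illustrative: n locations in the cache */
#define ROWS 4

int main(void) {
    int a[ROWS][CACHE_LINES];  /* rows of CACHE_LINES elements, row-major as in C */

    /* Walking down one column touches elements CACHE_LINES apart, so (with one
       element per cache location) they all map to the same cache index. */
    for (int i = 0; i < ROWS; i++) {
        long elem = &a[i][0] - &a[0][0];
        printf("a[%d][0] is element %2ld -> cache index %ld\n",
               i, elem, elem % CACHE_LINES);
    }
    return 0;
}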
Set-Associative Mapping
Allows a limited number of lines, with the same index and
different tags, in the cache. A compromise between a fully
associative cache and a direct mapped cache.
Cache divided into “sets” of lines. A four-way set associative
cache would have four lines in each set.
The number of lines in a set is known as the associativity or set
size. Each line in each set has a stored tag which, together with
the index (set number), completes the identification of the line.
19
4-way Set-Associative Cache
[Figure: memory address from processor divided into tag, index, word and byte fields; the index selects a set of four lines (each with a tag and data); all four stored tags are compared with the incoming tag; if one is the same, the word/byte is accessed, otherwise main memory is accessed]
First, index of address from processor used to access set. Then, all tags of
selected set compared with incoming tag. If match found, corresponding
location accessed, otherwise access main memory.
20
Sample 4-way Set-Associative Cache Design
4096-byte 4-way set-associative cache with 8-byte line organized as
two 4-byte words. 32-bit memory address.
[Figure: 32-bit memory address from processor divided into tag (22 bits), index (7 bits), word (1 bit) and byte (2 bits) fields; the index selects one of the 4096/(4 x 8) = 128 sets, each holding four tagged lines; all four stored tags are compared with the incoming tag and, if one matches, the word/byte is accessed, otherwise main memory is accessed]
With 4 bytes in word, need 2 bits in byte field.
With 2 words in line, need 1 bit in word field.
With 4096 bytes in total and 8 bytes in each line and 4 lines in set (4-way set assoc.) 4096/(4 x 8) entries in
cache (= 128 = 2^7). So index = 7 bits.
21
Set-Associative Cache Replacement Algorithm
Need only consider the lines in one set, as the choice of set is
predetermined by the index (set number) in the address.
Hence, with two lines in each set, for example, only one additional bit
is necessary in each set to identify the line to replace.
Set size
• Typically, set size is 2, 4, 8, or 16.
• A set size of one line reduces organization to that of direct
mapping.
• An organization with one set becomes fully associative mapping.
Set-associative cache popular for internal caches of microprocessors.
22
Valid Bits
In all caches, one valid bit provided with each line.*
Will assume one valid bit per line.
Valid bits are set to 0 initially, then set to 1 when the contents of the line are valid. Checked before accessing a line.
Needed to handle the start-up situation, when the cache holds random patterns of bits, and also before the cache is full.
* Or with parts of a line, if only parts are transferred in separate transactions.
23
Sample Cache Design showing valid bits
(assuming a line can be transferred in one transaction)
4096-byte 2-way set-associative cache with 16-byte lines organized
as four 4-byte words. 32-bit memory address.
[Figure: 32-bit memory address from processor divided into tag (21 bits), index (7 bits), word (2 bits) and byte (2 bits) fields; the index selects one of the 128 (2^7) sets, each holding two lines; each line has a valid bit, a tag and words 0-3, with the valid bit set when the line is transferred into the cache; both stored tags are compared with the incoming tag and, on a match, the word/byte in the line is accessed]
With 4 bytes in word, need 2 bits in byte field.
With 4 words in line, need 2 bits in word field.
With 4096 bytes in total, 16 bytes in each line and 2 lines in set (2-way set assoc.), 4096/(16 x 2) entries in cache
(= 128 = 2^7). So index = 7 bits.
24
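A minimal C sketch of the lookup path for this 2-way design, checking valid bits and comparing tags; it models only hits and misses, not data storage or the replacement choice (all names are illustrative):

#include <stdint.h>
#include <stdbool.h>

#define WAYS 2
#define SETS 128   /* 4096 bytes / (16-byte line x 2 ways) = 128 sets */

typedef struct {
    bool     valid[WAYS];
    uint32_t tag[WAYS];
} cache_set_t;

static cache_set_t cache[SETS];   /* valid bits start as 0 (false) */

/* Returns true on a hit. On a miss the line would be fetched from main
   memory, a way chosen for replacement, and its valid bit and tag set. */
static bool lookup(uint32_t addr) {
    uint32_t index = (addr >> 4) & 0x7F;   /* bits 4-10: set number  */
    uint32_t tag   =  addr >> 11;          /* bits 11-31: 21-bit tag */

    for (int way = 0; way < WAYS; way++)
        if (cache[index].valid[way] && cache[index].tag[way] == tag)
            return true;                   /* hit: access word/byte in line */
    return false;                          /* miss: access main memory */
}

int main(void) {
    uint32_t a = 0x00012345;
    /* First access misses; install the line in way 0, then the same address hits. */
    if (!lookup(a)) {
        uint32_t index = (a >> 4) & 0x7F, tag = a >> 11;
        cache[index].valid[0] = true;
        cache[index].tag[0]   = tag;
    }
    return lookup(a) ? 0 : 1;
}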
Fetch policy
Three strategies for fetching lines from main memory to cache:
Demand fetch - fetching a line when it is needed on a miss.
Prefetch - fetching lines before they are requested.
Simple prefetch strategy - prefetch the (i + 1)th line when the ith line is initially referenced
(assuming that the (i + 1)th line is not already in the cache), on the expectation that it
is likely to be needed if the ith line is needed (a sketch follows this slide).
Selective fetch - a policy of not always fetching lines, dependent upon
some defined criterion. The main memory is then used rather than the cache to
hold the information. Individual locations could be tagged as non-cacheable.
It may be advantageous to lock certain cache lines so that these are not
replaced. Hardware could be provided within the cache to implement
such locking.
25
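A minimal C sketch of the simple prefetch strategy above; the residence flags and fetch_line function are a toy model, not a real cache interface:

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define LINE_SIZE 32      /* illustrative line size in bytes */
#define NUM_LINES 1024    /* illustrative number of memory lines tracked */

/* Toy model of cache residence: one flag per memory line. */
static bool resident[NUM_LINES];

static void fetch_line(uint32_t line) {
    printf("fetch line %u from main memory\n", (unsigned)line);
    resident[line] = true;
}

/* Demand fetch for the referenced line, plus prefetch of line i+1 on the
   expectation that it will be needed if line i is needed. */
static void access_with_prefetch(uint32_t addr) {
    uint32_t i = addr / LINE_SIZE;

    if (!resident[i])           /* demand fetch on a miss */
        fetch_line(i);
    if (!resident[i + 1])       /* simple one-line-ahead prefetch */
        fetch_line(i + 1);
}

int main(void) {
    access_with_prefetch(0x40);   /* touches line 2, prefetches line 3 */
    access_with_prefetch(0x60);   /* line 3 already prefetched; only line 4 is fetched */
    return 0;
}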
Write Policies
Reading a word in the cache does not alter it, so there is no discrepancy
between the cache word and the copy held in main memory.
Writing can occur to cache words, after which the copy held in main
memory is different. It is important to keep the copies the same if other
devices such as disks access the main memory directly.
Two principal alternative mechanisms to update the main
memory:
1. Write through
2. Write back
26
1. Write-Through
In the write-through mechanism, every write operation to the cache
is repeated to the main memory, normally at the same time. The main
memory is then always the same as the cache.

[Figure: on every write reference (but see later), the word X is written both into the cache and into main memory]
27
Cache with write buffer
Write-through scheme can be enhanced by incorporating buffers:
[Figure: processor connected to the cache and, through write buffers on the data and address paths, to main memory; reads are satisfied by the cache while buffered writes drain to memory]
Allows the cache to be accessed while multiple previous
memory write operations proceed. “Non-blocking” store.
28
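A minimal C sketch of a write buffer as a small FIFO of pending (address, data) stores; the processor side queues writes and the memory side drains them later (all names and the buffer depth are illustrative):

#include <stdint.h>
#include <stdbool.h>

#define WB_ENTRIES 8   /* illustrative buffer depth */

typedef struct { uint32_t addr, data; } wb_entry_t;

static wb_entry_t buf[WB_ENTRIES];
static int head, tail, count;

/* Processor side: queue the store and continue; returns false (stall) only
   when the buffer is full. */
static bool wb_push(uint32_t addr, uint32_t data) {
    if (count == WB_ENTRIES)
        return false;                       /* buffer full: processor must wait */
    buf[tail] = (wb_entry_t){ addr, data };
    tail = (tail + 1) % WB_ENTRIES;
    count++;
    return true;
}

/* Memory side: drain one pending write per call while memory is free. */
static bool wb_drain(wb_entry_t *out) {
    if (count == 0)
        return false;                       /* nothing pending */
    *out = buf[head];
    head = (head + 1) % WB_ENTRIES;
    count--;
    return true;
}

int main(void) {
    wb_entry_t e;
    wb_push(0x1000, 42);                    /* processor continues immediately */
    while (wb_drain(&e)) { /* memory controller would write e.data to e.addr */ }
    return 0;
}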
Two ways to handle write misses
1. Fetch-on-write (miss)
Describes a policy of bringing a line from the main memory into the
cache for a write operation on a write miss (when the line is not
already in the cache).
Also called allocate on write because a line is allocated for an
incoming line on cache miss.
2. No-Fetch-on-write (miss)
Describes a policy of not bringing a line from the main memory into
the cache for a write operation.
Also called Non-allocate on write.
No-fetch-on-write is often practiced with a write-through cache. Why?
29
2. Write-Back (or copy back)
Write operation to main memory only done at line replacement
time. At this time, line displaced by incoming line written back to
main memory.
[Figure: processor references Y, a miss (step 1); the line X currently at that cache location is written back to main memory when the location is needed by the incoming line Y (step 2), which is only necessary if X has been altered in the cache and requires an altered ("dirty") bit with the line; then Y is brought into the cache (step 3). Here X and Y have the same index if direct mapped/set associative.]
30
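A minimal C sketch of the write-back decision at replacement time, using a dirty bit per line; the memory transfer functions are toy stubs that just print the transfers that would occur (all names are illustrative):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define LINE_BYTES 32

typedef struct {
    bool     valid;
    bool     dirty;               /* altered ("dirty") bit */
    uint32_t tag;
    uint8_t  data[LINE_BYTES];
} line_t;

/* Toy stubs reporting the line transfers that would take place. */
static void write_line_to_memory(uint32_t tag, uint32_t index) {
    printf("step 2: write back dirty line (tag 0x%X, index %u)\n",
           (unsigned)tag, (unsigned)index);
}
static void read_line_from_memory(uint32_t tag, uint32_t index) {
    printf("step 3: bring in new line (tag 0x%X, index %u)\n",
           (unsigned)tag, (unsigned)index);
}

/* Replace the line at 'index' with the line identified by new_tag. */
static void replace_line(line_t *line, uint32_t index, uint32_t new_tag) {
    if (line->valid && line->dirty)          /* write back only if altered */
        write_line_to_memory(line->tag, index);
    read_line_from_memory(new_tag, index);
    line->tag   = new_tag;
    line->valid = true;
    line->dirty = false;                     /* clean until the next write hit */
}

int main(void) {
    line_t line = { .valid = true, .dirty = true, .tag = 0x1 };
    replace_line(&line, 5, 0x2);   /* dirty X (tag 0x1) written back, Y (tag 0x2) fetched */
    return 0;
}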
Instruction and Data Caches
Several advantages if separate cache into two parts, one holding the
data (a data cache) and one holding program instructions (an
instruction cache or code cache):
• Separate paths could be provided from the processor to each
cache, allowing simultaneous transfers to both the instruction
cache and the data cache.
• Write policy would only have to be applied to the data cache
assuming instructions are not modified.
• Designer may choose to have different sizes for the instruction
cache and data cache, and have different internal organizations
and line sizes for each cache.
31
Particularly convenient in a pipeline processor, as different stages
of the pipeline access each cache (instruction fetch unit accesses
instruction cache and memory access unit accesses data cache):
[Figure: instruction pipeline (IF, OF, EX, MEM stages), with the instruction fetch unit accessing the instruction cache and the memory access unit accessing the data cache; both caches, commonly inside the processor, have data paths to main memory]
32
General Cache Performance Characteristics
Miss Ratio against Cache Size
[Figure: miss ratio (0.01 to 1.0, log scale) plotted against cache size (2K to 32K) for three programs A, B and C]
33
Miss Ratio against Line Size
[Figure: miss ratio (0.01 to 1.0, log scale) plotted against line size (4 to 128 bytes) at fixed cache sizes from 32 to 32768 bytes, for an instruction cache and a combined instruction/data cache; each curve has a minimum (Why?)]
34
Second Level Caches
Most present-day systems use two levels of cache (or three levels).
[Figure: processor, first-level cache(s) (usually separate data and instruction caches), second-level cache (a unified cache holding code and data), main memory]
First-level cache access time matches processor. Second-level
cache access time between main memory access time and first
level cache access time.
35
Strictly inclusive caches – all the data in the L1 cache is also in the L2 cache.
Exclusive caches – data is guaranteed to be in at most one cache (L1 or L2), never in both.
Alternative: data could be in only L1, only L2, or both.
36
Caches Example
Intel i3-2120 (Sandy Bridge), 3.3 GHz, 32 nm (Launched 2011)
• L1 Data cache = 32 Kbyte, 8-way (write-allocate?), line = 64 bytes
• L1 Instruction cache = 32 Kbyte, 8-way, line = 64 bytes
• L2 Cache = 256 KB, 8-way, line = 64 bytes
• L3 Cache = 3 MB (direct?), line = 64 bytes
L1 Data Cache Latency = 4 cycles or 5 cycles
L2 Cache Latency = 12 cycles
L3 Cache Latency = 27.85 cycles
RAM Latency = 28 cycles + 49 ns or 56 ns.
http://www.7-cpu.com/cpu/SandyBridge.html
37
Questions
38