CS 201
Computer Systems Programming
Chapter 10
Data Cache Architecture
Herbert G. Mayer, PSU
Status 6/28/2015
1
Syllabus
 Introduction
 Definitions
 Effective Times teff
 Cache Subsystem and Design Parameters
 Single-Line Degenerate Cache
 Multi-Line, Single-Set Cache
 Single-Line, Multi-Set Cache, Blocked Mapping
 Single-Line, Multi-Set, Cyclic Mapping
 Multi-Line per Set (Associative), Multi-Set Cache
 Replacement Policies
 LRU Sample
 Compute Cache Size
 Trace Cache
 Characteristic Cache Curve
 Bibliography
2
Introduction Cache Architecture
Cache-related definitions below are common,
though not all manufacturers apply the same
nomenclature. Initially we discuss cache designs
for single-processor architectures. In another
lecture note we progress to more complex and
complete MP architectures, covering the MESI
protocol for a two-processor system with external
L2 cache. The focus will be on data caches
3
Introduction
 The speed with which the processor executes an
instruction and references data in its registers is
generally vastly superior to the speed with which
memory can be accessed
 For example, an integer type instruction on a
Pentium® Pro costs on the order of 1 cycle or less;
less is possible, since multiple operations may be
executed in one step on a superscalar processor
 Getting an operand out of memory on typical
Pentium Pro or newer systems costs several dozen
cycles
 The gap between the slowness of memory and the
speed of processors is increasing over time, despite
memories getting faster!
4
Introduction
 To bridge this long recognized gap (von Neumann
bottleneck), computer architects invented (at
Manchester University in the 1960s; see [5]) a
special purpose memory, now called the cache
 Like regular memory, a cache holds bits of
information, data or instructions
 Unlike regular memory, a cache is very fast and
more expensive per bit. If it were not so costly, we’d
simply build all of memory out of cache memory and
the speed gap between processor and memory
would be solved; but alas! 
 Even to date, with some caches being several
megabytes large, caches are small vs. a memory's
logical addressing space of 2^64 bytes
5
Introduction
 While regular memory is arranged as a linear array
of equal cells (bytes, words), caches usually are
arranged by lines, also called blocks
 Since block has already several other meanings, we
shall use line
 Only the address of the first byte of a line need be
remembered
 Individual bytes within lines are addressable by their
offset. Note that only line-size-aligned portions of
memory (AKA paragraphs) are moved into cache
lines!
 Each line represents a small linearly contiguous
subsection of memory, which we’ll call paragraph
6
Introduction
 Caches evolved into multiple levels and purposes
 Often the first level cache (L1) is physically on-chip,
allowing the processor to retrieve information
sometimes in a single cycle
 The next level cache (L2) is often a separate
physical device, larger in size than the L1, and
slower to access, due to “having to go off-chip”
 With multi-core architectures, L2 caches also tend to
move on-chip
 On some multi-core chips the L2 is shared between
the cores, yet on others there are individual L2
caches per core
7
Introduction
 L3 caches are common in servers that process very
large amounts of data
 Caches also have become specialized. Instructions
are stored separately in so-called I-Caches, while
data reside in data caches (D-Cache)
 In the early 2000s, the trend was to replace I-Caches
with trace caches (TC), which store already pre-decoded
micro-instructions
 Since about 2007 trace caches have fallen out of favor
and I-Caches have re-emerged
8
Definitions
Aging
 A cache line's age is tracked; only in associative
caches
 Aging tracks when a cache line was accessed,
relative to the other lines in its set
 This implies that ages are compared
 Generally, the relative ages are of interest, such
as: am I older than you? Rather than the absolute
age, e.g.: I was accessed at cycle such and such
 Think about the minimum number of bits needed
to store the relative ages of, say, 8 cache lines!
 A memory access addresses only one line, hence all
lines in a set have distinct (relative) ages
9
Definitions
Alignment
 Alignment is a spacing requirement, i.e. the
restriction that an address adhere to a specific
placement condition
 For example, even-alignment means that an
address is even, i.e. that it is divisible by 2
 E.g. address 3 is not even-aligned, but address
1000 is; thus the rightmost address bit will be 0
 In VMM, page addresses are aligned on page
boundaries. If a page frame has size 4 KB, then page
addresses that adhere to page alignment are
evenly divisible by 4 KB
 As a result, the low-order (rightmost) 12 bits are 0.
Knowledge of alignment can be exploited to save
storing address bits in VMM, caching, etc.
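As a small illustration (not part of the original slides), the C fragment below tests alignment by masking the low-order address bits; the specific addresses and alignments are made up for the example.

#include <stdint.h>
#include <stdio.h>

/* An address is aligned to 'align' (a power of 2) exactly when its
   low-order log2(align) bits are all zero. */
static int is_aligned(uint64_t addr, uint64_t align)
{
    return (addr & (align - 1)) == 0;
}

int main(void)
{
    printf("%d\n", is_aligned(3, 2));         /* 0: address 3 is not even-aligned */
    printf("%d\n", is_aligned(1000, 2));      /* 1: address 1000 is even-aligned  */
    printf("%d\n", is_aligned(0x3000, 4096)); /* 1: page-aligned for 4 KB pages   */
    return 0;
}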
10
Definitions
Associativity
 If a cache has multiple lines per set, we call it
associative; K stands for number of lines in a set
 Having a cache with multiple lines K > 1 requires
searching, or address comparing, whether a
referenced object is in fact present in cache; the key
term is: to hit the cache
 Another way of saying this is: an object at some
memory address has more than one line in which it
might live in an associative cache
 Synonym: full associativity
 Antonym: direct mapped; if only a single line (per
set) exists, the search is reduced to a simple, single
tag comparison
11
Definitions
Blocked Cache
 If a cache cannot be accessed by the HW while
some line is currently being streamed in, the
cache is said to be blocked
 This can be a performance limiter, if the current
memory access wishes to refer to a line different
from the one being streamed in
 Not to be confused with cache blocks, AKA cache
lines!
12
Definitions
Critical Chunk First
 The number of bytes in a line is generally larger
than the number of bytes that can be brought into
the cache across the bus in 1 step, requiring
multiple bus transfers to fill a line completely
 It would be efficient if the byte actually needed
resided in the first chunk brought across the
bus
 The deliberate policy that accomplishes just that
is the Critical Chunk First policy
 This allows the cache to be unblocked after the
first transfer, even though the line is not yet
completely loaded
 Other parts of the line may be used later, but the
critical byte can thus be accessed right away
13
Definitions
Direct Mapped
 If each memory address has just one possible
location (i.e. a single line, K = 1) in the cache
where it could possibly reside, then that cache is
called direct mapped
 Antonym: associative, or fully associative
 Synonym: non-associative
Directory
 The collection of all tags is referred to as the
cache directory; as opposed to the actual data bits in a
D-Cache or the actual instruction bits in an I-Cache
14
Definitions
Dirty Bit
 If a line in a cache with write-back policy is never
modified (written), then that line doesn’t need to
be copied back into memory upon retirement; it is
already there
 However, if at least one write (AKA store, AKA
modification) into that cache line has occurred,
the line must be copied back into memory
eventually, lest memory become stale
 To discern whether or not to copy back, the dirty
bit must be set upon a write; initially this bit is clear.
(See also: modified state)
 Synonym: write-bit
15
Definitions
Effective Cycle Time teff
 Let the cache hit rate h be the number of hits
divided by the number of all memory accesses,
with an ideal hit rate being 1; thus:
 teff = tcache + (1-h) * tmem
 Alternatively, the effective cycle time might be
 teff = max( tcache, (1-h)*tmem )
 The latter holds, if a memory access is initiated
parallel to the cache access
 Here tcache is the time to access a datum in the
cache, while tmem is the time to access a data item
in memory
 The hit rate h varies from 0.0 to 1.0
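The two formulas above can be captured in a few lines of C. This is only a sketch to make the arithmetic concrete; the parameter values in main() are hypothetical.

#include <stdio.h>

/* Serial case: memory is accessed only after the cache misses. */
static double teff_serial(double t_cache, double t_mem, double h)
{
    return t_cache + (1.0 - h) * t_mem;
}

/* Overlapped case: memory access is initiated in parallel with the cache lookup. */
static double teff_overlapped(double t_cache, double t_mem, double h)
{
    double mem_part = (1.0 - h) * t_mem;
    return t_cache > mem_part ? t_cache : mem_part;
}

int main(void)
{
    /* hypothetical values: 1-cycle cache, 10-cycle memory, 90% hit rate */
    printf("serial:     %.2f cycles\n", teff_serial(1.0, 10.0, 0.9));     /* 2.00 */
    printf("overlapped: %.2f cycles\n", teff_overlapped(1.0, 10.0, 0.9)); /* 1.00 */
    return 0;
}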
16
Definitions
Hit Rate h
 The hit rate h is the number of memory accesses
(read/writes, or load/stores) that hit the cache,
over the total number of memory accesses
 By contrast H is the total number of hits
 A hit rate h = 1 means: all accesses are from
cache, while h = 0 means, all are from memory, i.e.
none hit the cache
 Conventional notations are: hr or hw for the read and
write hit rates
 See also miss rate
17
Definitions
LLC
 Acronym for Last Level Cache. This is the largest
cache in the memory hierarchy, the one closest to
physical memory, or furthest from the processor
 Typical on multi-core architectures
 Typical cache sizes: 4 MB to 32 MB. See [3]
 Common to have one LLC shared between all
cores of an MCP (Multi-Core Processor), but with the
option of separating it (by fusing) into dedicated
LLC caches, with identical total size
18
Definitions
LRU
 Acronym for Least Recently Used. A cache
replacement policy (also page replacement policy
discussed under VMM) that requires aging
information for the lines in a set
 Each time a cache line is accessed, that line
becomes by definition the youngest one touched
 Other lines of the same set age by one unit, i.e.
get older by 1 event
 Relative ages are sufficient for LRU tracking; no
need to track exact ages!
 Antonym: most recently used (MRU)
19
Definitions
Line
 Storage area in cache able to hold a copy of a
contiguous block of memory cells, i.e. a paragraph
 The portion of memory stored in that line is
aligned on an address modulo the line size
 For example, if a line holds 64 bytes on a byte-addressable
architecture, the address of the first
byte stored in such a line will have 6 trailing zeros,
as it is evenly divisible by 64, i.e. it is 64-byte aligned
 Such known zeros don't need to be stored in the
tag, the address bits stored in the cache; they are
implied
 This shortens the tag, which makes the cache
cheaper to build: fewer bits!
20
Definitions
Locality of Data
 A surprising and very beneficial attribute of
memory access patterns: when an address is
referenced, there is a good chance that in the near
future another access will happen at or near that
same address
 I.e. memory accesses tend to cluster, also
observable in hashing functions and memory page
accesses
 Antonym: Randomly distributed, or normally
distributed
21
Definitions
Miss Rate
 Miss rate is the number of memory (read/write)
accesses that miss the cache over total number of
accesses, denoted m
 Clearly the miss rate, like the hit rate, varies
between 0.0 .. 1.0
 The miss rate m = 1 - h
22
Definitions
Paragraph
 A paragraph is a contiguous portion of memory of
exactly line-size bytes
 The starting address of a paragraph is evenly
divisible by the line-size
 For example, if a cache line is 32 bytes long,
memory can be thought of as logically partitioned
into contiguous byte streams, AKA paragraphs
 These start at 32-byte boundaries, each 32 bytes
long; hence the rightmost 5 bits = log2(32) are 0
 Paragraphs or correspondingly line sizes may be
of any size, not just 32 bytes, but a power of 2
seems handy on a binary system 
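To make the paragraph notion concrete, here is a small C sketch (not from the slides) that splits a byte address into its paragraph base and line offset, assuming a hypothetical 32-byte line:

#include <stdio.h>

#define LINE_SIZE 32u   /* hypothetical line (paragraph) size, a power of 2 */

int main(void)
{
    unsigned addr   = 0x12345u;
    unsigned offset = addr & (LINE_SIZE - 1u);   /* low log2(32) = 5 bits        */
    unsigned base   = addr & ~(LINE_SIZE - 1u);  /* paragraph start, 5 zero bits */

    printf("addr 0x%x -> paragraph base 0x%x, offset %u\n", addr, base, offset);
    return 0;
}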
23
Definitions
Replacement Policy
 A replacement policy is a convention that
defines which line is to be retired when a new
line must be loaded but none is free in the set, so one
has to be evicted
 Ideally, the line that will remain unused for the
longest time in the future should be replaced and
its contents overwritten with new data
 Generally we do not know which line will stay
unreferenced for the longest time in the future
 In a direct-mapped cache, the replacement policy
is trivial, it is moot, as there will be just 1 line
24
Definitions
Set
 A logically connected region of memory, to be mapped onto a
specific area of the cache (a line), is a set; there are N sets in memory
 Elements of a set don't need to be physically contiguous in
memory; if contiguous (blocked), the leftmost log2(N) address bits
identify the set; if cyclically distributed, the log2(N) bits just above
the line-offset bits identify the set (see the mapping sketch after
this list)
 The number of sets is conventionally labeled N
 A degenerate case is to map all memory onto the whole cache,
in which case only a single set exists: N = 1; i.e. one set
 The notion of a set is meaningful only if there are multiple sets. A
memory region belonging to one set can be physically
contiguous or distributed cyclically
 In the former case the distribution is called blocked, in the latter
cyclic. The cache area into which a portion of memory is mapped
is also called a set
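A small C sketch of the two mappings (assumptions: a hypothetical 1 MB memory, 32-byte lines, and 4 sets), contrasting how a blocked and a cyclic distribution map an address to its set:

#include <stdio.h>

#define MEM_SIZE  (1u << 20)  /* hypothetical 1 MB physical memory  */
#define LINE_SIZE 32u         /* hypothetical line (paragraph) size */
#define N_SETS    4u          /* hypothetical number of sets        */

/* Blocked mapping: memory is cut into N contiguous regions, one per set. */
static unsigned set_blocked(unsigned addr)
{
    return addr / (MEM_SIZE / N_SETS);
}

/* Cyclic mapping: consecutive paragraphs rotate through the sets. */
static unsigned set_cyclic(unsigned addr)
{
    return (addr / LINE_SIZE) % N_SETS;
}

int main(void)
{
    unsigned a = 0x00123u;
    printf("blocked set: %u, cyclic set: %u\n", set_blocked(a), set_cyclic(a));
    return 0;
}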
25
Definitions
Set-Associative
 A cached system in which each set has multiple
lines is called set-associative
 For example, 4-way set associative means that
there are multiple sets (could be 4 sets, 256 sets,
1024 sets, or any other number of sets) and each
of those sets has 4 lines
 That’s what the 4 refers to in 4-way
 Opposite: non-associative
26
Definitions
Stale Memory
 A valid line may be overwritten in a cache with
new data
 The write-back policy records such an overwriting
 At the moment of a cache write with write-back,
cache and memory are out of synch; we say
memory is stale
 This poses no danger, since the dirty bit (or modified
bit) ensures that memory is eventually updated
 But until this happens, memory is stale
 Note that if two processors’ caches share memory
and one cache renders memory stale, the other
processor should no longer have access to that
portion of shared memory
27
Definitions
Stream-Out
 Streaming out a line refers to the movement of one
line of modified data, out of the cache and back
into a memory paragraph
Stream-In
 The movement of one paragraph of data from
memory into a cache line. Since line length
generally exceeds the bus width (i.e. exceeds the
number of bytes that can be moved in a single bus
transaction), a stream-in process requires multiple
bus transactions in a row
 Possible that the byte actually needed will arrive
last in a cache line during a sequence of bus
transactions; can be avoided with the critical
chunk first policy
28
Definitions
Tag
 A tag is the relevant portion of the address bits. If a
memory object (paragraph) is present in the
cache, its address must be stored, so the cache
control unit can determine whether the referenced
bits are present
 That portion of a memory address that must be
stored in the directory is the tag. If there is only
one set in the whole cache and any line can hold
only a single addressable unit, then the tag would
hold the complete address
 If there are N sets in the cache, log2(N) = m bits of
the virtual address are implied. If there are L
aligned bytes per line, log2(L) = n bits can be
implied in the tag. Hence, for an address of M bits,
only M-m-n need be represented in a line’s tag
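As a sketch of this bookkeeping (not from the slides), the following C fragment computes the tag width M - m - n for hypothetical values of N and L:

#include <stdio.h>

/* Integer log2 for powers of two. */
static unsigned ilog2(unsigned x)
{
    unsigned r = 0;
    while (x > 1) { x >>= 1; r++; }
    return r;
}

int main(void)
{
    unsigned M = 32;    /* address bits (assumed 32-bit addresses) */
    unsigned N = 256;   /* hypothetical number of sets             */
    unsigned L = 32;    /* hypothetical bytes per line             */

    unsigned tag_bits = M - ilog2(N) - ilog2(L);   /* M - m - n */
    printf("tag bits = %u\n", tag_bits);           /* prints 19 */
    return 0;
}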
29
Definitions
Trace Cache
 Special-purpose cache that holds pre-decoded
instructions, AKA micro-ops
 Advantage: Repeated decoding for instructions is
not needed
 See [1]. Trace caches have fallen out of favor in
the 2000s
30
Definitions
Valid Bit
 Single-bit data structure per cache line, indicating
whether or not the line is free; free means invalid
 If a line is not valid (i.e. if valid bit is 0), it can be
filled with a new paragraph upon a cache miss
 Else, (valid bit 1), the line holds valid information
 After a system reset, all valid bits of the whole
cache are set to 0
 The I bit in the MESI protocol takes on that role in
an MP cache subsystem; to be discussed in a
higher-level class
31
Definitions
Write Back
 Write back is a cache write policy that keeps
changed bits in cache after modification (after a
write), until the line is evicted
 Thus, whenever a line is written (AKA modified),
that fact must be remembered by the dirty bit
 Upon retirement, any dirty line must be written
back (streamed-out) to memory
 Advantage: Multiple writes to the same line put
traffic onto the bus only once: Upon retirement
32
Definitions
Write Once
 Cache write policy that starts out with write
through
 After the first write hit, which causes a write through,
the policy then changes to write back
 This is called write once
 Applies to multi-level caches
33
Definitions
Write Through
 Cache write policy that copies modified data from
cache back to memory immediately, i.e. when the
write hit occurs
 Thus cache and main memory are always in synch
 Disadvantage: Each cache write consumes
memory bus bandwidth
 Antonym: Write back
34
Effective Time teff
 Starting with teff = tcache + ( 1-h ) * tmem we observe:
 No matter how many hits (H) we experience during
repeated memory access, the effective cycle time is
never less than tcache
 No matter how many misses (M) we experience, the
effective cycle time to access a datum is never
more than tcache + tmem
 It is desirable to have teff = tmem in case of a cache
miss
 Another way to compute the effective access time
is to add all memory-access times, and divide them
by the total number of accesses, and thus compute
the average
35
Effective Time teff
Average time per access:
teff = ( hits * tcache + misses * ( tcache + tmem ) ) / total_accesses
teff = h * tcache + m * ( tcache + tmem )
or, if memory is accessed immediately (in parallel with the cache lookup):
teff = h * tcache + m * tmem
• Assume an access time of 1 (one) cycle to
reference data in the cache
• Assume an access time of 10 (ten) cycles for data in
memory
• Assume that a memory access is initiated after a
cache miss; then:
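Under these assumptions, teff can be tabulated for a few hit rates; the following C sketch (not part of the slides) evaluates the serial formula with the assumed 1-cycle cache and 10-cycle memory:

#include <stdio.h>

int main(void)
{
    const double t_cache = 1.0, t_mem = 10.0;  /* cycles, per the assumptions above */
    double h;

    /* Memory is accessed only after a cache miss, so teff = tcache + (1-h)*tmem. */
    for (h = 0.0; h <= 1.0; h += 0.25)
        printf("h = %.2f  ->  teff = %5.2f cycles\n", h, t_cache + (1.0 - h) * t_mem);
    return 0;
}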
36
Effective Time teff
37
Effective Time teff
Symb.     Name            Explanation
H         Hits            Number of successful cache accesses
M         Misses          Number of failed cache accesses
A         All             All accesses: A = H + M
T         Total time      Time for A memory accesses
tcache    Cache time      Time to successfully access memory via cache
tmem      Mem time        Time to access memory
teff      Effective time  Average time over all memory accesses
h         Hit rate        h = H/A = 1 - m
m         Miss rate       m = M/A = 1 - h
h+m       Total rate = 1  Total rate, either hit or miss, probability is 1
38
Effective Time teff
39
Highlights of Different Kinds of Caches,
Not all Useful 
40
Purpose of Cache
 Cache is logically part of Memory Subsystem, but
physically often part of processor (e.g. on the
same silicon die)
 Purpose: render slow memory into a fast one
 With minimal cost, since the cache is just a few %
of total physical main store
 Works well, if locality is good, but only if locality is
good; else performance is same as memory
access, or worse, depending on architecture
 With poor locality, i.e. randomly distributed
memory accesses, the cache can even slow execution down if:
teff = tcache + (1-h)*tmem and not: teff = max( tcache, (1-h)*tmem )
41
Purpose of Cache
 With good locality, cache delivers available data in
close to unit cycle time
 Cache must cooperate with other processors’
caches and with memory in MP system
 Cache must cooperate with VMM of memory
subsystem to jointly render a physically small,
slow memory into a virtually large, fast memory at
small cost in additional hardware (or silicon), and
system SW
 L1 cache access time should be within order of
magnitude of machine cycle time. For example, a
successful L1 data cache access costing 1 cycle
is desirable
42
Cache Design Parameters
 Number of lines in set: K
Quick test students: K is how large in a direct-mapped cache?
 Number of bytes in a line, AKA Length of line: L
 Number of sets in memory, and hence in the cache: N
 Policy upon memory write (cache write policy)
 Policy upon access miss (cache read policy)
 What to do when an empty line is needed for the
next paragraph to be streamed in, but none is
available (replacement policy)
43
Cache Design Parameters
 Size = K * ( 8 * L data bits + tag and control bits ) * N;
see the sketch after this list
 Ratio of cache size to physical memory generally
is very small
 Cache access time, typically close to 1 cycle for
L1 cache
 Number of processors with cache, 1 in UP, M in
MP architecture
 Levels of caches, L1, L2, L3 … Last one referred to
as LLC, for last level cache
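The sketch referred to above bundles these design parameters into a hypothetical C struct and evaluates the size formula; the struct name, field names, and example values are made up for illustration:

#include <stdio.h>

/* Hypothetical bundle of the cache design parameters named above. */
struct cache_params {
    unsigned K;            /* lines per set                   */
    unsigned L;            /* bytes per line                  */
    unsigned N;            /* number of sets                  */
    unsigned tag_bits;     /* tag bits per line               */
    unsigned control_bits; /* valid, dirty, LRU bits per line */
};

/* Size in bits = K * (8*L data bits + tag + control bits) * N */
static unsigned long cache_size_bits(const struct cache_params *c)
{
    return (unsigned long)c->K * (8ul * c->L + c->tag_bits + c->control_bits) * c->N;
}

int main(void)
{
    struct cache_params c = { 4, 64, 16, 22, 4 };       /* example values only      */
    printf("total size: %lu bits\n", cache_size_bits(&c)); /* 34,432 bits for these */
    return 0;
}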
44
Single-Line Degenerate Cache
45
Single-Line Degenerate Cache
 Quick test students: what is the minimum size (in
number of bits) for tag for this degenerate cache?
 The single-line cache, shown here, stores multiple
words
 Can improve memory access if extremely good
locality exists within narrow address range
 Upon miss cache initiates a stream-in operation
 Is direct mapped cache: all memory locations
know a priori where they’ll reside in cache; there
is but one line
 Is single-set cache, since all memory locations are
mapped onto collection of lines, and there is just
one line
46
Single-Line Degenerate Cache
 As data cache: exploits only locality of near-by
addresses in the same paragraph
 As instruction cache: Exploits locality of tight
loops that completely fit inside the address range
of a single line
 However, there will be a cache-miss as soon as an
address makes reference outside of line’s range
 For example, tight loop with a function call will
cause cache miss
 Stream-in time is time to load a line worth of data
from memory
 Total overhead: tag (address) bits + valid bit +
dirty bit (if write-back)
 Not advisable to build this cache subsystem 
47
Multi-Line, Single-Set Cache
48
Multi-Line, Single-Set Cache
 Next cache has one set, multiple lines; here 2 lines
as shown
Quick test students: minimum size of the tag on 32-bit architecture with 2 lines, 1 set?
 Each line holds multiple, contiguous addressing
units, 4 words AKA 16 bytes shown
 Thus 2 disparate areas of memory can be cached
at the same time
 Is associative cache; all lines in the single set
must be searched to determine, whether a
memory element is present in cache
 Is single-set associative cache, since all of
memory is mapped onto the same cache lines
49
Multi-Line, Single-Set Cache
 Some tight loops with a function call can be
completely cached in an I-cache, assuming the loop
body fits into one line and the callee fits into the other line
 Also would allow one larger loop to be cached,
whose total body does not fit into a single line, but
would fit into two (or more if available) lines
 With multiple lines in a set locality constraints are
less stringent
 Applies to more realistic programs
 But if number of lines K >> 1, the time to search all
tags (in set) can grow beyond unit cycle time
 Sometimes there is a trade-off between cycle time and
cache access time, so that multiple cycles are needed even
in case of a cache hit
50
Single-Line, Multi-Set Cache
51
Single-Line, Multi-Set Cache

 The next cache architecture has multiple sets, 2 shown: 2 distinct
areas of memory, each being mapped onto separate cache
lines: N = 2, K = 1
Quick test students: minimum size of the tag on 32-bit arch.?
 Each set has a single line, in this case 4 memory units (e.g.
words, AKA 16 bytes) long; AKA a paragraph
 Thus 2 disparate areas of memory can be cached at the same
time
 But these areas must reside in separate memory sets, each
contiguous, each having only 1 option
 Is direct mapped; all memory locations know a priori where
they'll reside in cache
 Is a multi-set cache, since different blocks of memory are
mapped onto different sets. Different parts of memory have
their own portion of the cache
52
Single-Line, Multi-Set Cache
 Allows one larger loop to be cached, whose total
body does not fit into a single line of an I-cache,
but would fit into two lines
 But only if by some great coincidence both parts
of that loop reside in different memory sets
 If used as an instruction cache, all programs
consuming half of memory or less never use the
line in the second set. Hence this is again a
bad idea!
 If used as data cache, all data areas that fit into
first block will never utilize second set of cache
 Problem specific to blocked mapping; try cyclic
instead
53
Multi-Set, Single-Line, Cyclic
54
Multi-Set, Single-Line, Cyclic

 This cache architecture below also has 2 sets, N = 2
 Each set has a single line, each holding 4 contiguous memory
units, 4 words, 16 bytes, K = 1
 Thus 2 disparate areas of memory can be cached at the same
time
Quick test: tag size on 32-bit, 4-byte architecture?
 Disparate areas (of line size, equal to paragraph size) are
scattered cyclically throughout memory
 Cyclically distributed memory areas are associated with each
respective set
 Is direct mapped; all memory locations know a priori where
they'll reside in cache, as each set has a single line
 Is a multi-set cache: different locations of memory are mapped
onto different cache lines, the sets
55
Multi-Set, Single-Line, Cyclic
 Also allows one larger loop to be cached, whose
total body does not fit into a single line, but would
fit into two lines
 Even if parts of loop belong to different sets
 If used as instruction cache, small code section
can use the total cache
 If used as data cache, small data areas can utilize
complete cache
 Cyclic mapping of memory areas to sets is
generally superior to blocked mapping
56
Multi-Line, Multi-Set, Cyclic
57
Multi-Line, Multi-Set, Cyclic
 Quick test: minimum size (in bits) of the tag?
 Here is a more realistic cache architecture
 Two sets, memory will be mapped cyclically, AKA
in a round-robin fashion
 Each set has two lines, each line holds 4
addressable words (a paragraph)
58
Multi-Line, Multi-Set, Cyclic
 Associative cache: once set is known, search all
tags for the memory address in all lines of that set
 In the example, line 2 of set 2 is unused
 By now you know: sets, lines, associative, non-associative,
direct mapped, etc.!!
59
Replacement Policy
 The replacement policy is the rule that determines:
 When all lines are valid, and a new line must be
streamed in
 Which of the valid lines is to be removed?
 Removal can be low cost, if the modified bit (AKA
“dirty” bit) is 0
 Or removal may be costly, if “dirty” bit is set
60
Replacement Policy
1. LRU: Replaces the Least Recently Used cache line; requires keeping track of
relative "ages" of lines. Retire the line that has remained unused for the
longest time of all candidate lines. Speculate that that line will
remain unused for the longest time in the future.
2. LFU: Replaces the Least Frequently Used cache line; requires keeping track of
the number m of times this line was used over the last n >= m uses.
Depending on how long we track the usage, this may require many
bits.
3. FIFO: First In First Out. The first of the lines in the set that was streamed in
is the first to be retired, when it comes time to find a candidate. Has
the advantage that no further update is needed, while all lines are in
use.
4. Random: Pick a random line from the candidate set for retirement; is not as bad as
this irrational algorithm might suggest. Reason: the other methods
are not too good either 
5. Optimal: If a cache were omniscient, it could predict which line will remain
unused for the longest time in the future. Of course, that is not
computable. However, for creating a perfect reference point, we
can do this with past memory access patterns, and use the optimal
access pattern for comparison: how well does our chosen policy rate vs.
the optimal strategy?
61
LRU Sample
Assume the following cache architecture:
• N = 16 sets
• K = 4 lines per set
• 32-bit architecture
• write back (dirty bit)
• valid line indicator (valid bit)
• L = 64 bytes per line
• This results in a tag size of 22 bits
• 2 LRU bits (4 lines per set), to store the relative ages
of the 4 lines in each set (checked in the sketch below)
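As a quick check of the tag and LRU figures above, here is a hedged C sketch; the helper ilog2 is hypothetical, not from the slides:

#include <assert.h>
#include <stdio.h>

static unsigned ilog2(unsigned x)   /* log2 for powers of two */
{
    unsigned r = 0;
    while (x > 1) { x >>= 1; r++; }
    return r;
}

int main(void)
{
    unsigned addr_bits = 32, N = 16, K = 4, L = 64;   /* parameters from above */

    unsigned tag = addr_bits - ilog2(N) - ilog2(L);   /* 32 - 4 - 6            */
    unsigned lru = ilog2(K);                          /* relative-age bits     */

    assert(tag == 22 && lru == 2);
    printf("tag = %u bits, LRU = %u bits per line\n", tag, lru);
    return 0;
}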
62
LRU Sample
 Let lines be numbered 0..3
 And accessed in the order x=0 miss, x=1 miss, 0
hit, x=2 miss, 0 hit, x=3 miss, 0 hit, and another
miss
 Assume initially a cold cache, all lines in the cache
are free
 Problem: Once all lines are filled (Valid bit is 1 for
all 4 lines) some line must be retired (i.e. kicked
out) to make room for the new paragraph x that
caused a miss, but which?
 The answer is based on the LRU line (Least
Recently Used line), which is line 1
63
LRU Sample
 The access order, assuming all memory accesses
are just reads (loads), no writes (no stores), i.e. the
dirty bit is always clear:
 Read miss, all lines invalid, stream paragraph into line 0
 Read miss (implies a new address), stream another
paragraph into line 1
 Read hit on line 0
 Read miss to a new address, store paragraph in line 2
 Read hit, access line 0
 Read miss, store paragraph in line 3
 Read hit, access line 0
 Now another read miss, all lines valid, find a line to retire
 Note that LRU age 00 (binary) is the youngest, held by cache
line 0, and 11 (binary) marks the oldest (the least recently
used) line, cache line 1, of the 4 relative ages of the 4 lines
64
LRU Sample
65
LRU Sample
 Whenever an empty line is filled, its relative age is
set to 00. It will be the youngest line. All others
must be checked, some may be updated. This
automatically avoids any of 4 lines ever growing
as “old” as 4 or “older”. Detail:
1. Initially, in a partly cold cache, if we experience a
miss and there is an empty line, the paragraph is
streamed into the empty line, its relative age is set
to 0, and all other valid lines' ages are incremented
by 1
2. In a warm cache (all lines are used) when a line of
age X experiences a hit, its new age becomes 0.
But the ages of all other lines whose age is
younger than that of X, all and only those ages are
incremented by 1
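A minimal C sketch of this aging rule, assuming one set of K = 4 lines and the access pattern of the sample; the helper names lru_touch and lru_victim are hypothetical, not from the slides:

#include <stdio.h>

#define K 4   /* lines per set, as in the sample */

struct line { int valid; unsigned age; };   /* age 0 = youngest */

/* Apply the aging rule to one set after line 'hit' was filled or hit:
   lines younger than it grow older by 1, the touched line becomes age 0. */
static void lru_touch(struct line set[K], int hit)
{
    unsigned old_age = set[hit].valid ? set[hit].age : K;  /* a fill ages all valid lines */
    for (int i = 0; i < K; i++)
        if (set[i].valid && set[i].age < old_age)
            set[i].age++;
    set[hit].valid = 1;
    set[hit].age = 0;
}

/* Victim for replacement: the line with the largest relative age. */
static int lru_victim(const struct line set[K])
{
    int v = 0;
    for (int i = 1; i < K; i++)
        if (set[i].age > set[v].age)
            v = i;
    return v;
}

int main(void)
{
    struct line set[K] = { 0 };
    int order[] = { 0, 1, 0, 2, 0, 3, 0 };  /* access pattern from the sample */
    for (unsigned i = 0; i < sizeof order / sizeof order[0]; i++)
        lru_touch(set, order[i]);
    printf("LRU victim is line %d\n", lru_victim(set));  /* prints 1 */
    return 0;
}

Running it reproduces the sample's result: line 1 is the LRU victim.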
66
Compute Cache Size
Typical Cache Design Parameters:
1. Number of lines in set: K
2. Number of bytes in a line, Length of line: L
3. Number of sets in memory, and hence in cache: N
4. Policy upon memory write (cache write policy)
5. Policy upon read miss (cache read policy)
6. Replacement policy (e.g. LRU, random, FIFO, etc.)
7. Size (bits) = K * ( 8 * L + tag + control bits ) * N
67
Compute Cache Size
Compute minimum number of bits of an 8-way, setassociative cache with 64 sets, using cyclic allocation
of memory sets, cache line length of 32 bytes, using
LRU replacement. Use write-back. Memory is byte
addressable, 32-bit addresses:
Tag
=
LRU 8-ways
=
3
Dirty bit
=
Valid bit
=
Overhead per line
=
# of lines
= K * N
Data bits per cache line
Total cache size
=
Byte size
=
68
32-5-6 = 21 bits
bits
1 bit
1 bit
21+3+1+1
= 26 bits
= 64*8 =
29 lines
= 32*8
=
28
29*(26+28) = 144,384
~141 kB
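The arithmetic above can be double-checked with a short C program (a sketch, not part of the original slides):

#include <stdio.h>

int main(void)
{
    /* Parameters from the exercise above (assumed 32-bit byte addresses). */
    unsigned K = 8, N = 64, L = 32;

    unsigned tag      = 32 - 5 - 6;        /* 5 offset bits, 6 set-index bits: 21 */
    unsigned overhead = tag + 3 + 1 + 1;   /* + LRU + dirty + valid = 26 bits     */
    unsigned lines    = K * N;             /* 512 lines                           */
    unsigned data     = 8 * L;             /* 256 data bits per line              */

    unsigned long total = (unsigned long)lines * (overhead + data);
    printf("total = %lu bits (= %.1f Kbit, %.1f kB)\n",
           total, total / 1024.0, total / 8192.0);   /* 144384 bits, 141.0 Kbit, 17.6 kB */
    return 0;
}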
Trace Cache
 Trace Cache is a special-purpose cache that does
not hold (raw) instructions, but instead stores pre-decoded
operations (micro-ops)
 The old AMD K5 uses a Trace Cache (TC); see [1]
 Intel’s Pentium® P4 uses a 12 k micro-op TC
 Advantages: faster access to executable bits at
every cached instruction
 Disadvantage: less dense cache storage
exploitation, i.e. wasted cache bits compared to a
regular I-cache
 Note that cache bits are more costly than memory
bits!
 Trace caches are falling out of favor in the 2010s
69
Trace Cache
70
Characteristic Cache Curve
 In the graph below we use the relative number of cache
misses [RM] to avoid an unbounded axis
 RM = 0 is ideal case: No misses at all
 RM = 1 is worst case: All memory accesses are
cache misses
 If a program exhibits good locality, relative cache
size of 1 results in good performance; we use this
as the reference point:
 Very coarsely, in some ranges, doubling the
cache's size results in 30% fewer cache misses
 In others, doubling the cache results in only a few %
fewer misses: beyond the sweet spot!
71
Characteristic Cache Curve
72
Summary
 A cache is a special HW storage device that allows
fast access
 It is costly, hence the size of a cache relative to
the size of memory is small; the cache holds a subset
 Frequently used data (or instructions in an I-cache)
are copied into the cache, with the hope that the data
present in the cache are accessed relatively
frequently
 Miraculously, that is generally true, so caches in
general do speed up execution despite slow
memories
 Caches are organized into sets, with each set
having 1 or more lines
 Defined portions of memory get mapped into any
one of these sets
73
Bibliography
1. http://forums.amd.com/forum/messageview.cfm?catid=11&threadid=29382&enterthread=y
2. Lam, M., E. E. Rothberg, and M. E. Wolf [1991]. "The Cache
Performance and Optimizations of Blocked Algorithms," ACM
0-89791-380-9/91, pp. 63-74.
3. http://www.ece.umd.edu/~blj/papers/hpca2006.pdf
4. On MESI: http://en.wikipedia.org/wiki/Cache_coherence
5. Kilburn, T., et al.: "One-level Storage System," IRE Transactions,
EC-11, 2, 1962, pp. 223-235.
74