Memory Systems

CMPE 421 Parallel Computer Architecture
PART 4
Caching with Associativity
Fully Associative Cache
Reducing cache misses through more flexible block placement

 Instead of direct mapping, we allow any memory block to be placed in any cache slot.
 There are many different addresses that map to each index.
 Any available entry can be used to store a memory element.
 Remember: direct-mapped caches are more rigid; data goes exactly where the index says, even if the rest of the cache is empty.
 In a fully associative cache, nothing gets "thrown out" until the cache is completely full.
 It is harder to check for a hit (hit time will increase).
 Requires much more hardware (a comparator for each cache slot).
 Each tag is a complete block address (no index bits are used).
Fully Associative Cache

 Must compare the tags of all entries in parallel to find the desired one (if there is a hit).
 A direct-mapped cache only needs to look in one place.
 No conflict misses, only capacity misses.
 Practical only for caches with a small number of blocks, since searching all entries increases the hardware cost.
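To make the parallel tag check concrete, here is a minimal software sketch of a fully associative lookup (names and structure are illustrative, not a specific processor's design): every valid entry's tag is compared against the block address, and a hit occurs if any comparison succeeds. Hardware performs these comparisons simultaneously with one comparator per slot; software can only model them with a loop.

```python
# Minimal sketch of a fully associative lookup (illustrative structure only).

class FullyAssociativeCache:
    def __init__(self, num_blocks):
        # Each entry holds (valid, tag, data); the tag is the full block address.
        self.entries = [{"valid": False, "tag": None, "data": None}
                        for _ in range(num_blocks)]

    def lookup(self, block_address):
        # Hardware compares all tags at once (one comparator per slot);
        # this loop is the software stand-in for that parallel compare.
        for entry in self.entries:
            if entry["valid"] and entry["tag"] == block_address:
                return True, entry["data"]   # hit
        return False, None                   # miss
```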
Fully Associative Cache
[Figure: fully associative cache organization]
Direct Mapped vs Fully Associative
[Figure: a 16-entry direct-mapped cache (indexes 0-15) beside a fully associative cache; each entry holds V, Tag, Data]
Direct mapped: each address has only one possible location. Address = Tag | Index | Block offset
Fully associative: no index; any entry can hold any block. Address = Tag | Block offset
Trade-off
 Fully associative is much more flexible, so the miss rate will be lower.
 Direct mapped requires less hardware (cheaper) and will also be faster!
 This is a trade-off of miss rate vs. hit time.
 Therefore we might compromise, looking for the best solution between a direct-mapped cache and a fully associative cache.
 We can also provide more flexibility without going to a fully associative placement policy.
 For each memory location, provide a small number of cache slots that can hold the memory element.
 This is much more flexible than direct mapped, but requires less hardware than fully associative.
SOLUTION: Set Associative
Set Associative Cache
 A fixed number of locations where each block can be placed.
 N-way set associative means there are N places (slots) where each block can be placed.
 Divide the cache into a number of sets, each of size N "ways" (N-way set associative).
 Therefore, a memory block maps to a unique set (specified by the index field) and can be placed in any "way" of that set.
 So there are N choices.
 In a set-associative cache, a memory block is mapped to set
- (Block address) modulo (Number of sets in the cache)
 Remember that in a direct-mapped cache the position of a memory block is given by
- (Block address) modulo (Number of cache blocks)
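A quick illustration of the two mapping rules above (hypothetical block addresses and an assumed 8-block cache, organized 2-way for the set-associative case):

```python
# Mapping rules from the slide, with illustrative values only.
NUM_CACHE_BLOCKS = 8     # direct mapped: 8 blocks
NUM_SETS = 4             # 2-way set associative: 8 blocks / 2 ways = 4 sets

def direct_mapped_index(block_address):
    # (Block address) modulo (Number of cache blocks)
    return block_address % NUM_CACHE_BLOCKS

def set_index(block_address):
    # (Block address) modulo (Number of sets in the cache)
    return block_address % NUM_SETS

# Block addresses 3 and 11 collide on index 3 in the direct-mapped cache,
# and share set 3 in the 2-way cache (where both can co-exist).
print(direct_mapped_index(3), direct_mapped_index(11))   # 3 3
print(set_index(3), set_index(11))                       # 3 3
```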
A Compromise
[Figure: a 2-way set associative cache (8 sets, indexes 0-7) and a 4-way set associative cache (4 sets, indexes 0-3); each entry holds V, Tag, Data]
2-way set associative: each address has two possible locations with the same index; one fewer index bit (1/2 the indexes). Address = Tag | Index | Block offset
4-way set associative: each address has four possible locations with the same index; two fewer index bits (1/4 the indexes). Address = Tag | Index | Block offset
Range of Set Associative Caches
[Figure: address divided into Tag | Index | Block offset | Byte offset]
Tag: used for the tag compare. Index (the set number): selects the set the block can be placed in. Block offset: selects the word in the block.
Decreasing associativity → direct mapped (only one way): smaller tags.
Increasing associativity → fully associative (only one set): the tag is all the bits except the block and byte offsets.
Range of Set Associative Caches
 For a fixed-size cache, each increase in associativity by a factor of two:
- doubles the number of blocks per set (i.e. the number of ways),
- halves the number of sets,
- decreases the size of the index by 1 bit, and
- increases the size of the tag by 1 bit.
Address = Tag | Index | Block offset | Byte offset
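This arithmetic can be sketched for an assumed configuration (a 32-bit address, a 16 KB cache, and 16-byte blocks; all values are illustrative, not from this lecture):

```python
import math

# Assumed parameters, for illustration only.
ADDRESS_BITS = 32
CACHE_BYTES  = 16 * 1024      # 16 KB cache
BLOCK_BYTES  = 16             # 4 words per block

def field_widths(associativity):
    num_blocks  = CACHE_BYTES // BLOCK_BYTES
    num_sets    = num_blocks // associativity
    offset_bits = int(math.log2(BLOCK_BYTES))   # block offset + byte offset
    index_bits  = int(math.log2(num_sets))
    tag_bits    = ADDRESS_BITS - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits

# Doubling the associativity halves the sets: the index loses 1 bit
# and the tag gains 1 bit at every step.
for ways in (1, 2, 4, 8):
    print(ways, "way:", field_widths(ways))
```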
Set Associative Cache
[Figure: a 2-way set associative cache with 2 sets (each entry holding V, Tag, Data) mapped against a 16-word main memory, addresses 0000xx through 1111xx; one-word blocks, 32-bit words, so the two low-order address bits select the byte within the word]
Q1: How do we find it? Use the next low-order memory address bit (above the byte offset) to determine which cache set to look in: (block address) modulo (# of sets in the cache).
Q2: Is it there? Compare all the cache tags in the set against the high-order 3 memory address bits to tell whether the memory block is in the cache.
The valid bit indicates whether an entry contains valid information; if the bit is not set, there cannot be a match for this block.
Set Associative Cache Organization
FIGURE 7.17 The implementation of a four-way set-associative cache requires four comparators and a 4-to-1
multiplexor. The comparators determine which element of the selected set (if any) matches the tag. The output of the
comparators is used to select the data from one of the four blocks of the indexed set, using a multiplexor with a decoded
select signal. In some implementations, the Output enable signals on the data portions of the cache RAMs can be used
to select the entry in the set that drives the output. The Output enable signal comes from the comparators, causing the
element that matches to drive the data outputs.
Set Associative Cache Organization

 This is called a 4-way set associative cache because there are four cache entries for each cache index. Essentially, you have four direct-mapped caches working in parallel.
 This is how it works: the cache index selects a set from the cache. The four tags in the set are compared in parallel with the upper bits of the memory address.
 If no tag matches the incoming address tag, we have a cache miss.
 Otherwise, we have a cache hit and we select the data from the way where the tag match occurs.
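A software model of the lookup just described (illustrative names, not the hardware datapath of Figure 7.17): the index selects one set, the four tags are compared against the address tag, and the matching way, if any, supplies the data, which corresponds to the comparator-plus-multiplexor structure.

```python
# Software model of a 4-way set associative lookup (illustrative only).
NUM_SETS = 128
WAYS = 4

# sets[index][way] holds a cache entry: valid bit, tag, data.
sets = [[{"valid": False, "tag": 0, "data": None} for _ in range(WAYS)]
        for _ in range(NUM_SETS)]

def lookup(index, tag):
    selected_set = sets[index]             # the index selects the set
    for way, entry in enumerate(selected_set):
        # Hardware does these four comparisons in parallel.
        if entry["valid"] and entry["tag"] == tag:
            return True, entry["data"]     # hit: the mux selects this way's data
    return False, None                     # no tag matched: cache miss
```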

 This is simple enough. What are its disadvantages?
N-way Set Associative Cache versus Direct Mapped Cache:
 An N-way set associative cache will also be slower than a direct-mapped cache because:
- N comparators vs. 1
- Extra MUX delay for the data
- Data comes AFTER the hit/miss decision and set selection
 In a direct-mapped cache, the cache block is available BEFORE the hit/miss decision:
- Possible to assume a hit and continue; recover later on a miss.
Remember the Example for Direct Mapping (ping pong effect)

 Consider the main memory word reference string 0 4 0 4 0 4 0 4, starting with an empty cache (all blocks initially marked as not valid).
 Words 0 and 4 map to the same cache block in the direct-mapped cache, so each access evicts the other:
0 miss, 4 miss, 0 miss, 4 miss, 0 miss, 4 miss, 0 miss, 4 miss
 8 requests, 8 misses
 Ping pong effect due to conflict misses: two memory locations map into the same cache block and keep evicting each other.
Solution: Use set associative cache

 Consider the same main memory word reference string 0 4 0 4 0 4 0 4, starting with an empty cache (all blocks initially marked as not valid).
 In a 2-way set associative cache, words 0 and 4 map to the same set but can occupy different ways:
0 miss, 4 miss, 0 hit, 4 hit, 0 hit, 4 hit, 0 hit, 4 hit
 8 requests, 2 misses
 This solves the ping pong effect caused by conflict misses in a direct-mapped cache, since two memory locations that map into the same cache set can now co-exist!
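These two traces can be reproduced with a tiny simulation (a sketch, assuming a 4-block direct-mapped cache versus a 2-set, 2-way cache of the same total size, with LRU replacement within a set):

```python
# Sketch: replay the block-address reference string 0 4 0 4 0 4 0 4 through
# a 4-block direct-mapped cache and a 2-way set associative cache (2 sets).

refs = [0, 4, 0, 4, 0, 4, 0, 4]

def count_misses(num_sets, ways):
    sets = [[] for _ in range(num_sets)]   # each set: resident blocks, LRU order
    misses = 0
    for block in refs:
        resident = sets[block % num_sets]  # (block address) modulo (number of sets)
        if block in resident:
            resident.remove(block)         # hit: refresh its LRU position
        else:
            misses += 1
            if len(resident) == ways:
                resident.pop(0)            # evict the least recently used block
        resident.append(block)
    return misses

print("direct mapped (4 blocks):", count_misses(num_sets=4, ways=1), "misses")  # 8 misses
print("2-way set assoc (2 sets):", count_misses(num_sets=2, ways=2), "misses")  # 2 misses
```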
Set Associative Example
Address fields: Tag (3-5 bits) | Index (1-3 bits) | Block offset (2 bits) | Byte offset (2 bits)
Reference stream (10-bit addresses): 0100111000, 1100110100, 0100111100, 0110110000, 1100111000

Direct-Mapped (8 blocks: 3-bit index, 3-bit tag)
0100111000 → index 011, tag 010: Miss
1100110100 → index 011, tag 110: Miss
0100111100 → index 011, tag 010: Miss
0110110000 → index 011, tag 011: Miss
1100111000 → index 011, tag 110: Miss

2-Way Set Assoc. (4 sets: 2-bit index, 4-bit tag)
0100111000 → set 11, tag 0100: Miss
1100110100 → set 11, tag 1100: Miss
0100111100 → set 11, tag 0100: Hit
0110110000 → set 11, tag 0110: Miss
1100111000 → set 11, tag 1100: Miss

4-Way Set Assoc. (2 sets: 1-bit index, 5-bit tag)
0100111000 → set 1, tag 01001: Miss
1100110100 → set 1, tag 11001: Miss
0100111100 → set 1, tag 01001: Hit
0110110000 → set 1, tag 01101: Miss
1100111000 → set 1, tag 11001: Hit
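The field splits used in this example can be checked with a short helper (a sketch; it simply slices the 10-bit address strings by the bit widths listed above, shown here for the 2-way case):

```python
# Split the 10-bit example addresses into tag / index / block offset / byte offset.
# The index width depends on the organization: 3 bits (direct mapped, 8 blocks),
# 2 bits (2-way, 4 sets), or 1 bit (4-way, 2 sets).

ADDRESSES = ["0100111000", "1100110100", "0100111100",
             "0110110000", "1100111000"]

def split(addr, index_bits):
    byte_off  = addr[-2:]                   # low 2 bits
    block_off = addr[-4:-2]                 # next 2 bits
    index     = addr[-4 - index_bits:-4]    # set index
    tag       = addr[:-4 - index_bits]      # everything else
    return tag, index, block_off, byte_off

for addr in ADDRESSES:
    print(addr, "->", split(addr, index_bits=2))   # 2-way: 4-bit tag, 2-bit index
```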
New Performance Numbers
Miss rates for the DEC 3100 (a MIPS machine) with separate 64KB instruction/data caches:

Benchmark | Associativity | Instruction miss rate | Data miss rate | Combined miss rate
gcc       | Direct        | 2.0%                  | 1.7%           | 1.9%
gcc       | 2-way         | 1.6%                  | 1.4%           | 1.5%
gcc       | 4-way         | 1.6%                  | 1.4%           | 1.5%
spice     | Direct        | 0.3%                  | 0.6%           | 0.4%
spice     | 2-way         | 0.3%                  | 0.6%           | 0.4%
spice     | 4-way         | 0.3%                  | 0.6%           | 0.4%
Benefits of Set Associative Caches

 The choice between direct mapped and set associative depends on the cost of a miss versus the cost of implementation.
[Figure: miss rate (0% to 12%) vs. associativity (1-way, 2-way, 4-way, 8-way) for cache sizes from 4KB to 512KB; data from Hennessy & Patterson, Computer Architecture, 2003]
 The largest gains come from going from direct mapped to 2-way (a 20%+ reduction in miss rate).
Benefits of Set Associative Caches

 As the cache size grows, the relative improvement from associativity increases only slightly.
 Since the overall miss rate of a larger cache is lower, the opportunity for improving the miss rate decreases, and the absolute improvement in miss rate from associativity shrinks significantly.
Cache Block Replacement Policy
For deciding which block to replace when a new entry arrives:
 Random Replacement:
- Hardware randomly selects a cache entry and throws it out.
- Equally fair / equally unfair to all frames.
 First In First Out (FIFO)
 Least Recently Used (LRU):
- Uses temporal locality to select the entry that has not been accessed recently.
- Additional bit(s) are required in each cache entry to track access order; they must be updated on every access, and all must be scanned on a replacement.
- A two-way set associative cache needs only one bit per set for LRU replacement.
- A common approach is to use a pseudo-LRU strategy.
 Example of a simple "pseudo" least recently used implementation:
- Assume 64 fully associative entries.
- A hardware replacement pointer points to one cache entry.
- Whenever an access is made to the entry the pointer points to, move the pointer to the next entry; otherwise do not move the pointer.
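A minimal sketch of the replacement-pointer scheme just described (assuming 64 fully associative entries; names are illustrative): the pointer only advances when the entry it points to is accessed, so on a replacement it tends to point at an entry that has not been used recently.

```python
# Pseudo-LRU via a single replacement pointer (sketch of the scheme above).
NUM_ENTRIES = 64
replacement_pointer = 0          # hardware pointer to one cache entry

def on_access(entry_index):
    # If the access hits the entry the pointer points to, advance the pointer;
    # otherwise leave it alone.
    global replacement_pointer
    if entry_index == replacement_pointer:
        replacement_pointer = (replacement_pointer + 1) % NUM_ENTRIES

def choose_victim():
    # On a miss, replace the entry the pointer currently points to.
    return replacement_pointer
```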
Source of Cache Misses

                | Direct Mapped | N-way Set Associative | Fully Associative
Cache size      | Big           | Medium                | Small
Compulsory miss | Same          | Same                  | Same
Conflict miss   | High          | Medium                | Zero
Capacity miss   | Low(er)       | Medium                | High
Designing a cache

Design change          | Effect on miss rate            | Negative performance effect
Increase size          | Decreases capacity misses      | May increase access time
Increase associativity | Decreases conflict misses      | May increase access time
Increase block size    | May decrease compulsory misses | May increase miss penalty; may increase capacity misses

Note: if you are running "billions" of instructions, compulsory misses are insignificant.
Key Cache Design Parameters

                           | L1 typical  | L2 typical
Total size (blocks)        | 250 to 2000 | 4000 to 250,000
Total size (KB)            | 16 to 64    | 500 to 8000
Block size (B)             | 32 to 64    | 32 to 128
Miss penalty (clocks)      | 10 to 25    | 100 to 1000
Miss rates (global for L2) | 2% to 5%    | 0.1% to 2%
Two Machines' Cache Parameters

                 | Intel P4                               | AMD Opteron
L1 organization  | Split I$ and D$                        | Split I$ and D$
L1 cache size    | 8KB for D$, 96KB for trace cache (~I$) | 64KB for each of I$ and D$
L1 block size    | 64 bytes                               | 64 bytes
L1 associativity | 4-way set assoc.                       | 2-way set assoc.
L1 replacement   | ~LRU                                   | LRU
L1 write policy  | write-through                          | write-back
L2 organization  | Unified                                | Unified
L2 cache size    | 512KB                                  | 1024KB (1MB)
L2 block size    | 128 bytes                              | 64 bytes
L2 associativity | 8-way set assoc.                       | 16-way set assoc.
L2 replacement   | ~LRU                                   | ~LRU
L2 write policy  | write-back                             | write-back
Where can a block be placed/found?

                  | # of sets                              | Blocks per set
Direct mapped     | # of blocks in cache                   | 1
Set associative   | (# of blocks in cache) / associativity | Associativity (typically 2 to 16)
Fully associative | 1                                      | # of blocks in cache

                  | Location method                       | # of comparisons
Direct mapped     | Index                                 | 1
Set associative   | Index the set; compare the set's tags | Degree of associativity
Fully associative | Compare all blocks' tags              | # of blocks
Multilevel caches

 A two-level cache structure allows the primary cache (L1) to focus on reducing hit time, which yields a shorter clock cycle.
 The second-level cache (L2) focuses on reducing the penalty of long memory access times.
 Compared to the cache of a single-cache machine, L1 on a multilevel-cache machine is usually smaller, has a smaller block size, and has a higher miss rate.
 Compared to the cache of a single-cache machine, L2 on a multilevel-cache machine is often larger and has a larger block size.
 The access time of L2 is less critical than that of the cache of a single-cache machine.
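One way to see this division of labor is through average memory access time (AMAT). The sketch below uses the standard two-level AMAT formula with purely illustrative numbers (not values from this lecture):

```python
# Average memory access time with two cache levels.
# All numbers below are illustrative assumptions, not measured values.

l1_hit_time  = 1      # cycles: L1 is kept small and fast to fit the clock cycle
l1_miss_rate = 0.05   # fraction of accesses that miss in L1
l2_hit_time  = 10     # cycles
l2_miss_rate = 0.20   # local rate: fraction of L1 misses that also miss in L2
mem_penalty  = 200    # cycles to main memory

# AMAT = L1 hit time + L1 miss rate * (L2 hit time + L2 miss rate * memory penalty)
amat = l1_hit_time + l1_miss_rate * (l2_hit_time + l2_miss_rate * mem_penalty)
print(amat)   # 1 + 0.05 * (10 + 0.20 * 200) = 3.5 cycles per access on average
```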