Data Layout for Caches

Embedded Computer Architecture
Data Memory Management
Part d:
Data Layout for Caches
5KK73 TU/e
Henk Corporaal
Bart Mesman
Data layout for caches
• Caches are hardware controlled
• Therefore: no explicit reuse copy code needed in
your code!
• What can we still do to improve
performance?
• Topics:
– Cache principles
– The 3 C's: Compulsory, Capacity and Conflict misses
– Data layout examples reducing misses
@H.C. Embedded Computer Architecture
Cache operation (direct mapped cache)
[Figure: memory (lower level) and cache (higher level); data moves between them in units of a block or line, and the cache stores a tag next to the data of each block.]
Why does a cache work?
• Principle of Locality
– Temporal locality
• an accessed item has a high probability of being accessed again in the
near future
– Spatial locality
• items close in space to a recently accessed item have a high
probability of being accessed next
• Check yourself why there is temporal and spatial
locality for instruction accesses and for data
accesses
– Regular programs have high instruction and data locality
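Direct mapped cache
[Figure: a 32-bit address is split into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2) and a 2-bit byte offset (bits 1-0). The index selects one of 1024 entries (0 to 1023); each entry holds a valid bit, a 20-bit tag and 32 bits of data. A valid entry whose tag equals the address tag gives a hit.]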
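The lookup in the direct mapped cache figure can be sketched in C as follows. This is only a minimal sketch that assumes the geometry of the figure (1024 one-word entries, 20-bit tag, 10-bit index, 2-bit byte offset); the names `dm_cache`, `dm_lookup` and `dm_fill` are hypothetical and not taken from the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Geometry assumed from the figure: 1024 entries of one 32-bit word. */
#define OFFSET_BITS 2u
#define INDEX_BITS  10u
#define NUM_ENTRIES (1u << INDEX_BITS)

typedef struct {
    bool     valid;
    uint32_t tag;    /* upper 20 address bits */
    uint32_t data;   /* one 32-bit word       */
} cache_entry;

typedef struct {
    cache_entry entry[NUM_ENTRIES];
} dm_cache;          /* hypothetical name */

static uint32_t addr_index(uint32_t addr) {
    return (addr >> OFFSET_BITS) & (NUM_ENTRIES - 1u);   /* bits 11..2  */
}

static uint32_t addr_tag(uint32_t addr) {
    return addr >> (OFFSET_BITS + INDEX_BITS);           /* bits 31..12 */
}

/* Returns true on a hit and copies the cached word to *word. */
static bool dm_lookup(const dm_cache *c, uint32_t addr, uint32_t *word) {
    assert(c != NULL && word != NULL);
    const cache_entry *e = &c->entry[addr_index(addr)];
    if (e->valid && e->tag == addr_tag(addr)) {
        *word = e->data;
        return true;
    }
    return false;
}

/* Hypothetical refill helper: overwrites the single entry addr maps to. */
static void dm_fill(dm_cache *c, uint32_t addr, uint32_t word) {
    assert(c != NULL);
    cache_entry *e = &c->entry[addr_index(addr)];
    e->valid = true;
    e->tag   = addr_tag(addr);
    e->data  = word;
}
```

Two addresses with the same index but different tags (for example 0x1004 and 0x2004) compete for the same entry, which is the root of the conflict misses discussed later.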
Direct mapped cache: larger blocks
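• Taking advantage of spatial locality:
[Figure: a 32-bit address is split into a 16-bit tag (bits 31-16), a 12-bit index (bits 15-4), a 2-bit block offset (bits 3-2) and a 2-bit byte offset (bits 1-0). The cache has 4K entries, each with a valid bit, a 16-bit tag and 128 bits of data (four 32-bit words). A multiplexor uses the block offset to select one 32-bit word; a tag match on a valid entry gives a hit.]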
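The benefit of larger blocks for spatial locality can be illustrated with a small miss counter. This is a sketch under stated assumptions: an initially empty direct mapped cache, a sequential scan of 32-bit words starting at byte address 0, and a hypothetical helper name `scan_misses`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_LINES 4096u

/* Counts misses for a sequential scan of n_words 32-bit words in an
   initially empty direct mapped cache with n_lines lines of block_bytes
   bytes each. */
static unsigned scan_misses(unsigned n_words, unsigned n_lines,
                            unsigned block_bytes) {
    static bool     valid[MAX_LINES];
    static uint32_t tag[MAX_LINES];
    assert(n_lines > 0 && n_lines <= MAX_LINES && block_bytes >= 4);

    for (unsigned l = 0; l < n_lines; l++) valid[l] = false;

    unsigned misses = 0;
    for (unsigned w = 0; w < n_words; w++) {
        uint32_t addr  = 4u * w;               /* byte address of word w */
        uint32_t block = addr / block_bytes;   /* block address          */
        uint32_t line  = block % n_lines;      /* cache index            */
        uint32_t t     = block / n_lines;      /* tag                    */
        if (!valid[line] || tag[line] != t) {
            misses++;                          /* fetch the whole block  */
            valid[line] = true;
            tag[line]   = t;
        }
    }
    return misses;
}
```

With 4K lines, scanning 1024 words gives 1024 misses for one-word (4-byte) blocks and 256 misses for four-word (16-byte) blocks, because each miss also brings in the three neighbouring words.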
Performance: effect of block size
[Figure: miss rate (0% to 40%) versus block size (4, 16, 64, 256 bytes), one curve per cache size: 1 KB, 8 KB, 16 KB, 64 KB, 256 KB.]
• Increasing the block (or line) size tends to decrease miss rate
Cache principles
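[Figure: the CPU issues a virtual or physical address of p bits, split into a tag of p-k-m bits, an index address of k bits and a byte address of m bits. The cache has 2^k lines; each cache line or block holds 2^m bytes of data plus a tag of p-k-m bits, and is filled from main memory.]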
4 Cache Architecture Fundamentals
1. Block placement
– Where in the cache will a new block be placed?
2. Block identification
– How is a block found in the cache?
3. Block replacement policy
– Which block is evicted from the cache?
4. Updating policy
– When is a block written from cache to memory?
– Write-Through vs. Write-Back caches
Block placement policies
[Figure: memory blocks 0 to 15 (and further) mapped onto a cache of 8 blocks (0 to 7).]
• Direct mapped (one-to-one): a memory block can be placed in one cache block only ("Here only!")
• Fully associative (one-to-many): a memory block can be placed anywhere in the cache
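The placement rule can be written down in a few lines of C. This is a sketch, assuming the 8-block cache of the figure; `first_candidate` is a hypothetical helper, and the set associative case it also covers is the intermediate policy between the two shown here.

```c
#include <assert.h>

/* Returns the first cache block of the set that memory block `block`
   maps to. The block may be placed in any of the `assoc` consecutive
   cache blocks starting at the returned position.
   assoc == 1        : direct mapped (one candidate)
   assoc == n_blocks : fully associative (every block is a candidate) */
static unsigned first_candidate(unsigned block, unsigned n_blocks,
                                unsigned assoc) {
    assert(assoc > 0 && n_blocks % assoc == 0);
    unsigned n_sets = n_blocks / assoc;
    unsigned set    = block % n_sets;
    return set * assoc;
}
```

For memory block 12 and an 8-block cache this gives cache block 4 when direct mapped, and a candidate range starting at 0 that spans the whole cache when fully associative.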
4-way associative cache
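[Figure: a 32-bit address is split into a 22-bit tag (bits 31-10), an 8-bit index (bits 9-2) and a 2-bit byte offset (bits 1-0). The index selects one of 256 sets (0 to 255); each set holds four entries, each with a valid bit (V), a 22-bit tag and 32 bits of data. Four tag comparisons produce the Hit signal and drive a 4-to-1 multiplexor that selects the 32-bit data.]
• 4 ways
• 256 sets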
Performance: effect of associativity
[Figure: miss rate (0% to 15%) versus associativity (one-way, two-way, four-way, eight-way), one curve per cache size: 1 KB, 2 KB, 4 KB, 8 KB, 16 KB, 32 KB, 64 KB, 128 KB.]
Cache Basics
• Cache_size = Nsets x Associativity x Block_size
• Block_address = Byte_address DIV Block_size in
bytes
• Index = Block_address MOD Nsets
• Because the block size and the number of sets are
(usually) powers of two, DIV and MOD can be
performed efficiently
[Figure: address bits 31 … 2 1 0; the upper part is the block address, consisting of tag and index, and the lowest bits are the block offset.]
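The formulas above translate directly into C. The sketch below assumes a hypothetical `cache_geom` struct and helper names; `tag_of` is not a formula from the slide but follows from the tag / index / block offset split in the figure. It also shows the shift-and-mask form that is possible when sizes are powers of two.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t block_size;   /* bytes per block, power of two */
    uint32_t n_sets;       /* number of sets, power of two  */
    uint32_t assoc;        /* blocks per set                */
} cache_geom;              /* hypothetical name */

static uint32_t log2_u32(uint32_t x) {
    assert(x != 0 && (x & (x - 1u)) == 0);   /* power of two only */
    uint32_t n = 0;
    while (x > 1u) { x >>= 1; n++; }
    return n;
}

/* Cache_size = Nsets x Associativity x Block_size */
static uint32_t cache_size(const cache_geom *g) {
    return g->n_sets * g->assoc * g->block_size;
}

/* Block_address = Byte_address DIV Block_size */
static uint32_t block_address(const cache_geom *g, uint32_t byte_addr) {
    return byte_addr / g->block_size;
}

/* Index = Block_address MOD Nsets */
static uint32_t set_index(const cache_geom *g, uint32_t byte_addr) {
    return block_address(g, byte_addr) % g->n_sets;
}

/* Tag = remaining upper bits of the block address. */
static uint32_t tag_of(const cache_geom *g, uint32_t byte_addr) {
    return block_address(g, byte_addr) / g->n_sets;
}

/* Same index with a shift and a mask instead of DIV and MOD. */
static uint32_t set_index_fast(const cache_geom *g, uint32_t byte_addr) {
    return (byte_addr >> log2_u32(g->block_size)) & (g->n_sets - 1u);
}
```

For 16-byte blocks and 4096 sets, address 0x12345678 has block address 0x1234567, index 0x567 and tag 0x1234; both index functions agree.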
Example 1
• Assume
– Cache of 4K blocks, with 4 word block size
– 32 bit addresses
• Direct mapped (associativity=1) :
– 16 bytes per block = 2^4 → 4 (2+2) bits for byte and word offsets
– 32 bit address : 32-4=28 bits for index and tag
– #sets=#blocks/associativity : log2 of 4K=12 : 12 bits for index
– Total number of tag bits : (28-12)*4K=64 Kbits
• 2-way associative
– #sets=#blocks/associativity : 2K sets
– 1 bit less for indexing, 1 bit more for tag (compared to direct mapped)
– Tag bits : (28-11) * 2 * 2K=68 Kbits
• 4-way associative
– #sets=#blocks/associativity : 1K sets
– 2 bits less for indexing, 2 bits more for tag (compared to direct mapped)
– Tag bits : (28-10) * 4 * 1K=72 Kbits
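The tag-bit arithmetic of Example 1 can be checked with a short C function. This is a sketch; `total_tag_bits` and `log2_u` are hypothetical helper names, and the function counts tag bits only (no valid or dirty bits), as the example does.

```c
#include <assert.h>

static unsigned log2_u(unsigned x) {
    assert(x != 0 && (x & (x - 1u)) == 0);   /* power of two only */
    unsigned n = 0;
    while (x > 1u) { x >>= 1; n++; }
    return n;
}

/* Total tag storage in bits for a cache of n_blocks blocks of
   block_bytes bytes each, with addr_bits-bit addresses. */
static unsigned total_tag_bits(unsigned addr_bits, unsigned n_blocks,
                               unsigned block_bytes, unsigned assoc) {
    unsigned n_sets      = n_blocks / assoc;
    unsigned offset_bits = log2_u(block_bytes);   /* 4 for 16 bytes */
    unsigned index_bits  = log2_u(n_sets);
    unsigned tag_bits    = addr_bits - offset_bits - index_bits;
    return tag_bits * n_blocks;                   /* one tag per block */
}
```

Called with 32-bit addresses, 4K blocks and 16-byte blocks, it reproduces the 64, 68 and 72 Kbits of the example for associativity 1, 2 and 4.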
Example 2
3 caches consisting of 4 one-word blocks:
• Cache 1 : fully associative
• Cache 2 : two-way set associative
• Cache 3 : direct mapped
Suppose the following sequence of block addresses:
0, 8, 0, 6, 8
Example 2: Direct Mapped
Block address   Cache block
0               0 mod 4 = 0
6               6 mod 4 = 2
8               8 mod 4 = 0

Address of     Hit or   Location 0   Location 1   Location 2   Location 3
memory block   miss
0              miss     Mem[0]*      -            -            -
8              miss     Mem[8]*      -            -            -
0              miss     Mem[0]*      -            -            -
6              miss     Mem[0]       -            Mem[6]*      -
8              miss     Mem[8]*      -            Mem[6]       -
* = new entry = miss (coloured on the original slide)
Example 2: 2-way Set Associative (4/2 = 2 sets)
Block address   Cache block
0               0 mod 2 = 0
6               6 mod 2 = 0
8               8 mod 2 = 0
(so all in set/location 0)

Address of     Hit or   SET 0     SET 0     SET 1     SET 1
memory block   miss     entry 0   entry 1   entry 0   entry 1
0              Miss     Mem[0]    -         -         -
8              Miss     Mem[0]    Mem[8]    -         -
0              Hit      Mem[0]    Mem[8]    -         -
6              Miss     Mem[0]    Mem[6]    -         -
8              Miss     Mem[8]    Mem[6]    -         -
On a miss in a full set, the least recently used block is replaced.
Example 2: Fully associative
(4 way assoc., 4/4 = 1 set)
Address of     Hit or   Block 0   Block 1   Block 2   Block 3
memory block   miss
0              Miss     Mem[0]    -         -         -
8              Miss     Mem[0]    Mem[8]    -         -
0              Hit      Mem[0]    Mem[8]    -         -
6              Miss     Mem[0]    Mem[8]    Mem[6]    -
8              Hit      Mem[0]    Mem[8]    Mem[6]    -
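The three variants of Example 2 can be replayed with one small simulator. This is a sketch, assuming an initially empty cache, LRU replacement within a set, and one-word blocks so that the stored tag is simply the block address; `count_misses` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BLOCKS 64

/* Counts misses for a trace of block addresses in an initially empty
   set associative cache with LRU replacement.
   n_blocks: total number of cache blocks, assoc: blocks per set. */
static unsigned count_misses(const unsigned *trace, size_t n,
                             unsigned n_blocks, unsigned assoc) {
    unsigned tag[MAX_BLOCKS];
    unsigned last_use[MAX_BLOCKS];            /* 0 means empty entry */
    assert(assoc > 0 && n_blocks <= MAX_BLOCKS && n_blocks % assoc == 0);
    unsigned n_sets = n_blocks / assoc;
    for (unsigned b = 0; b < n_blocks; b++) last_use[b] = 0;

    unsigned misses = 0;
    for (size_t t = 0; t < n; t++) {
        unsigned base   = (trace[t] % n_sets) * assoc;  /* first entry of set */
        unsigned victim = base;
        int hit = 0;
        for (unsigned e = base; e < base + assoc; e++) {
            if (last_use[e] != 0 && tag[e] == trace[t]) {
                victim = e;                    /* found: reuse this entry */
                hit = 1;
                break;
            }
            if (last_use[e] < last_use[victim])
                victim = e;                    /* empty or least recently used */
        }
        if (!hit) {
            misses++;
            tag[victim] = trace[t];
        }
        last_use[victim] = (unsigned)t + 1u;   /* time stamps start at 1 */
    }
    return misses;
}
```

For the sequence 0, 8, 0, 6, 8 and four blocks, this yields 5 misses direct mapped (assoc 1), 4 misses two-way, and 3 misses fully associative (assoc 4), matching the tables.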
Cache Fundamentals
The “Three C's”
• Compulsory Misses
– 1st access to a block: never in the cache
• Capacity Misses
– Cache cannot contain all the blocks
– Blocks are discarded and retrieved later
– Avoided by increasing cache size
• Conflict Misses
– Too many blocks mapped to same set
– Avoided by increasing associativity
• Some add 4th C: Coherence Misses
Compulsory miss example
for(i=0; i<10; i++)
  A[i] = f(B[i]);
Cache (@ i=2): B[0], A[0], B[1], A[1], B[2], A[2]
Cache (@ i=3):
• B[3], A[3] required
• B[3] never loaded before → loaded into cache
• A[3] never loaded before → allocates new line
Capacity miss example
Cache size: 8 blocks of 1 word
Fully associative
for(i=0; i<N; i++)
A[i] = B[i+3]+B[i];
Cache contents after each iteration (8 blocks):
i=0: B[3] B[0] A[0]
i=1: B[3] B[0] A[0] B[4] B[1] A[1]
i=2: A[2] B[0] A[0] B[4] B[1] A[1] B[5] B[2]
i=3: A[2] B[6] B[3] A[3] B[1] A[1] B[5] B[2]
i=4: A[2] B[6] B[3] A[3] B[7] B[4] A[4] B[2]
i=5: B[5] A[5] B[3] A[3] B[7] B[4] A[4] B[8]
i=6: B[5] A[5] B[9] B[6] A[6] B[4] A[4] B[8]
i=7: B[5] A[5] B[9] B[6] A[6] B[10] B[7] A[7]
• 11 compulsory misses (+8 write misses)
• 5 capacity misses
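The miss counts of this example can be reproduced with a fully associative LRU model. This is a sketch under assumptions not stated on the slide: accesses in each iteration happen in the order read B[i+3], read B[i], write A[i]; writes allocate a block; and the block ids chosen by the hypothetical macros `A_ID` and `B_ID` are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>

#define CACHE_BLOCKS 8
#define MAX_ID       64
#define A_ID(i)      (i)           /* hypothetical block id of A[i] */
#define B_ID(k)      (32 + (k))    /* hypothetical block id of B[k] */

typedef struct {
    int  block[CACHE_BLOCKS];      /* block[0] is the most recently used */
    int  used;                     /* number of occupied blocks          */
    bool seen[MAX_ID];             /* block was referenced at least once */
    int  compulsory, capacity, write_misses, hits;
} fa_cache;

/* Accesses one block, updates the LRU order, returns true on a hit. */
static bool fa_access(fa_cache *c, int id) {
    int pos = -1;
    for (int k = 0; k < c->used; k++)
        if (c->block[k] == id) { pos = k; break; }
    bool hit = (pos >= 0);
    if (!hit) {
        if (c->used < CACHE_BLOCKS) c->used++;
        pos = c->used - 1;         /* free slot, or the LRU slot when full */
    }
    for (int k = pos; k > 0; k--)  /* shift down, then insert at the front */
        c->block[k] = c->block[k - 1];
    c->block[0] = id;
    return hit;
}

/* Runs A[i] = B[i+3] + B[i] for i < n and classifies the read misses. */
static void run_loop(fa_cache *c, int n) {
    assert(n >= 0 && B_ID(n + 3) < MAX_ID);
    for (int i = 0; i < n; i++) {
        int reads[2] = { B_ID(i + 3), B_ID(i) };
        for (int r = 0; r < 2; r++) {
            int id = reads[r];
            if (fa_access(c, id))      c->hits++;
            else if (!c->seen[id])     c->compulsory++;  /* first reference   */
            else                       c->capacity++;    /* evicted meanwhile */
            c->seen[id] = true;
        }
        if (fa_access(c, A_ID(i))) c->hits++;
        else                       c->write_misses++;
        c->seen[A_ID(i)] = true;
    }
}
```

Starting from a zero-initialised `fa_cache` and running 8 iterations gives 11 compulsory read misses, 5 capacity misses and 8 write misses, with no hits at all.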
Conflict miss example
Memory layout (word addresses):
  0-3:    A[0] A[1] A[2] A[3]
  4-7:    B[0][0] B[1][0] B[2][0] B[3][0]
  8-11:   B[0][1] B[1][1] B[2][1] B[3][1]
  12-15:  B[0][2] B[1][2] B[2][2] B[3][2]
  16-...: B[0][3] ... B[3][9]
Cache with 8 one-word locations (0 to 7): cache address = memory address mod 8
for(j=0; j<10; j++)
  for(i=0; i<4; i++)
    A[i] = A[i]+B[i][j];
• A[i] is read 10 times
• Cache (@ i=0): cache address 0 holds A[0] or B[0][j]; cache address 4 holds B[0][j]
• j=odd -> B[0][j] maps to cache address 0; A[0] flushed in favor of B[0][j] -> miss
• j=even -> B[0][j] maps to cache address 4
• A[0] is loaded multiple times
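One common way to separate the three C's is to compare the real cache against a fully associative LRU cache of the same size: a first reference is compulsory, a miss that the fully associative cache also suffers is a capacity miss, and the rest are conflict misses. The sketch below applies this to the loop above. It assumes the layout A[i] at address i and B[i][j] at address 4 + 4*j + i, write-allocate, and the access order read A[i], read B[i][j], write A[i]; the type `sim` and its functions are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

#define LINES     8
#define MEM_WORDS 64

typedef struct {
    int  dm[LINES];        /* direct mapped: address per line, -1 = empty */
    int  fa[LINES];        /* fully associative LRU, fa[0] = most recent  */
    int  fa_used;
    bool seen[MEM_WORDS];  /* address referenced at least once            */
    int  compulsory, capacity, conflict;
} sim;

static void sim_init(sim *s) {
    for (int l = 0; l < LINES; l++) s->dm[l] = -1;
    for (int a = 0; a < MEM_WORDS; a++) s->seen[a] = false;
    s->fa_used = 0;
    s->compulsory = s->capacity = s->conflict = 0;
}

/* Reference model: fully associative LRU cache of the same size. */
static bool fa_access(sim *s, int addr) {
    int pos = -1;
    for (int k = 0; k < s->fa_used; k++)
        if (s->fa[k] == addr) { pos = k; break; }
    bool hit = (pos >= 0);
    if (!hit) {
        if (s->fa_used < LINES) s->fa_used++;
        pos = s->fa_used - 1;
    }
    for (int k = pos; k > 0; k--) s->fa[k] = s->fa[k - 1];
    s->fa[0] = addr;
    return hit;
}

static void sim_access(sim *s, int addr) {
    assert(addr >= 0 && addr < MEM_WORDS);
    bool fa_hit = fa_access(s, addr);
    bool dm_hit = (s->dm[addr % LINES] == addr);
    if (!dm_hit) {
        if (!s->seen[addr]) s->compulsory++;   /* first reference           */
        else if (!fa_hit)   s->capacity++;     /* would miss in any layout  */
        else                s->conflict++;     /* caused by the mapping     */
        s->dm[addr % LINES] = addr;
    }
    s->seen[addr] = true;
}

static void run_conflict_example(sim *s) {
    for (int j = 0; j < 10; j++)
        for (int i = 0; i < 4; i++) {
            int a = i;                  /* A[i]    */
            int b = 4 + 4 * j + i;      /* B[i][j] */
            sim_access(s, a);           /* read A[i]    */
            sim_access(s, b);           /* read B[i][j] */
            sim_access(s, a);           /* write A[i]   */
        }
}
```

Under these assumptions the 120 accesses produce 44 compulsory misses, no capacity misses and 20 conflict misses, the latter all on A[i] for odd j.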
“Three C's” vs Cache size [Gee93]
[Figure: relative absolute misses (0.00 to 0.15) versus cache size (1, 2, 4, 8, 16, 32, 64 KB), with curves for total misses, compulsory misses, capacity misses and conflict misses.]
Data layout may reduce cache
misses
Example 1: Capacity & Compulsory
miss reduction
for(i=0; i<N; i++)
A[i] = B[i+3]+B[i];
(Same cache of 8 one-word blocks and same cache contents for i=0 to i=7 as in the capacity miss example.)
• 11 compulsory misses (+8 write misses)
• 5 capacity misses
Fit data in cache with
in-place mapping
for(i=0; i<12; i++)
A[i] = B[i+3]+B[i];
• Traditional analysis: max=27 words
• Detailed analysis: max=15 words
[Figure: number of words (#Words) occupied by B[] and A[] versus i (0 to 12), with a maximum of 15; main memory (16 words) and cache memory (16 words).]
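The two numbers can be derived with a small lifetime count. This is a sketch, assuming that every A[i] stays live until the end of the loop, that B[k] is dead right after its last read, and that the read of B[i] precedes the write of A[i] within an iteration; all function names are hypothetical.

```c
#include <assert.h>

#define N      12            /* loop trip count from the slide */
#define SHIFT  3             /* offset in B[i+SHIFT]           */
#define B_LEN  (N + SHIFT)   /* B[0..14] is read               */

/* Last iteration in which B[k] is read. */
static int last_read_of_b(int k) {
    assert(k >= 0 && k < B_LEN);
    /* B[k] is read as B[i] at i = k when k < N; the last SHIFT
       elements are only read as B[i+SHIFT] at i = k - SHIFT. */
    return (k < N) ? k : k - SHIFT;
}

/* Words live just after iteration i; i = -1 means before the loop. */
static int live_words_after(int i) {
    int live = i + 1;                        /* A[0..i] already produced */
    for (int k = 0; k < B_LEN; k++)
        if (last_read_of_b(k) > i) live++;   /* B[k] still needed later  */
    return live;
}

/* Detailed analysis: maximum number of simultaneously live words. */
static int max_live_words(void) {
    int max = 0;
    for (int i = -1; i < N; i++) {
        int live = live_words_after(i);
        if (live > max) max = live;
    }
    return max;
}

/* Traditional analysis: every array gets its own full storage. */
static int traditional_words(void) {
    return N + B_LEN;
}
```

The traditional count is 12 + 15 = 27 words, while at most 15 words are live at any time, so the data fits in 16 words if A[] reuses the locations of dead B[] elements.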
Remove capacity / compulsory
misses with in-place mapping
for(i=0; i<N; i++)
AB[i] = AB[i+3]+AB[i];
Cache contents after each iteration (8 blocks):
i=0: AB[3] AB[0]
i=1: AB[3] AB[0] AB[4] AB[1]
i=2: AB[3] AB[0] AB[4] AB[1] AB[5] AB[2]
i=3: AB[3] AB[0] AB[4] AB[1] AB[5] AB[2] AB[6]
i=4: AB[3] AB[0] AB[4] AB[1] AB[5] AB[2] AB[6] AB[7]
i=5: AB[3] AB[8] AB[4] AB[1] AB[5] AB[2] AB[6] AB[7]
i=6: AB[3] AB[8] AB[4] AB[9] AB[5] AB[2] AB[6] AB[7]
i=7: AB[3] AB[8] AB[4] AB[9] AB[5] AB[10] AB[6] AB[7]
• 11 compulsory misses
• 5 cache hits (+8 write hits)
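In-place mapping is only valid if overwriting AB[i] never destroys a value that is still needed. The sketch below checks this for the loop above by comparing it with the two-array version; the function names `separate` and `in_place` are hypothetical.

```c
#include <assert.h>

/* Reference version: separate input array B and output array A. */
static void separate(int *A, const int *B, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        A[i] = B[i + 3] + B[i];
}

/* In-place version: AB initially holds B, and AB[i] is overwritten
   with A[i]. This is safe because iteration i only reads AB[i] and
   AB[i+3], and no later iteration reads an index below its own i. */
static void in_place(int *AB, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        AB[i] = AB[i + 3] + AB[i];
}
```

Both versions produce the same values for A[0..n-1]; the in-place version touches n + 3 words instead of 2n + 3, which is what turns the capacity misses and write misses into hits.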
Example 2: Conflict miss reduction
(Same memory layout, loop and cache as in the conflict miss example: for odd j, B[0][j] maps to the same cache address as A[0], so A[0] is flushed and loaded multiple times.)
Avoid conflict miss with
main memory data layout
Main memory layout (word address -> cache address):
  0-3:    A[0] A[1] A[2] A[3]              -> cache 0-3
  4-7:    B[0][0] B[1][0] B[2][0] B[3][0]  -> cache 4-7
  8-11:   leave gap ...
  12-15:  B[0][1] B[1][1] B[2][1] B[3][1]  -> cache 4-7
  ...     leave gap ...
  ...     B[0][2] ... B[3][9]              -> cache 4-7
for(j=0; j<10; j++)
  for(i=0; i<4; i++)
    A[i] = A[i]+B[i][j];
• A[i] is read multiple times
• Cache (@ i=0): A[0] at cache address 0, B[0][j] at cache address 4
• j=any -> no conflict
© imec 2001
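The effect of the gaps can be checked by counting misses for both layouts. This is a sketch, assuming a direct mapped cache with 8 one-word lines, write-allocate, and the access order read A[i], read B[i][j], write A[i]; `loop_misses` and its parameters `b_base` and `col_stride` are hypothetical names.

```c
#include <assert.h>

#define LINES 8

/* Counts the misses of the loop nest when A[i] is stored at address i
   and B[i][j] at address b_base + col_stride * j + i. */
static int loop_misses(int b_base, int col_stride) {
    assert(b_base >= 4 && col_stride >= 4);
    int line[LINES];
    for (int l = 0; l < LINES; l++) line[l] = -1;   /* empty cache */

    int misses = 0;
    for (int j = 0; j < 10; j++)
        for (int i = 0; i < 4; i++) {
            /* read A[i], read B[i][j], write A[i] */
            int addr[3] = { i, b_base + col_stride * j + i, i };
            for (int k = 0; k < 3; k++) {
                int l = addr[k] % LINES;
                if (line[l] != addr[k]) {
                    misses++;
                    line[l] = addr[k];
                }
            }
        }
    return misses;
}
```

With the original dense layout (column stride 4) the loop takes 64 misses; with a gap after every column (column stride 8) it takes 44, which are exactly the first references to the 4 elements of A and the 40 elements of B. The price is the unused memory in the gaps.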
Data Layout Organization for
Direct Mapped Caches
[Figure: miss rate (%, 0 to 16) versus cache size (512 bytes, 1 KB, 2 KB) for three cases: Initial - Direct Mapped, Data Layout Org - Direct Mapped, Initial - Fully Assoc.]
Conclusions on Data Management
• In multi-media applications, data transfer and storage issues should be
explored at source code level
• DMM method:
– Reducing number of external memory accesses
– Reducing external memory size
– Trade-offs between internal memory complexity and speed
– Platform independent high-level transformations
– Platform dependent transformations exploit platform characteristics
(efficient use of memory, cache, …)
– Substantial energy reduction
• Although caches are hardware controlled, data layout can
strongly influence the miss rate