Embedded Computer Architecture
Data Memory Management, Part d: Data Layout for Caches
5KK73 TU/e
Henk Corporaal, Bart Mesman
@H.C. Embedded Computer Architecture

Data layout for caches
• Caches are hardware controlled
• Therefore: no explicit reuse-copy code is needed in your program!
• What can we still do to improve performance?
• Topics:
– Cache principles
– The 3 C's: Compulsory, Capacity and Conflict misses
– Data layout examples reducing misses

Cache operation (direct mapped cache)
[Figure: the cache (higher level) holds blocks or lines, each with a tag and data, backed by main memory (lower level)]

Why does a cache work?
• Principle of Locality
– Temporal locality
• an accessed item has a high probability of being accessed again in the near future
– Spatial locality
• items close in space to a recently accessed item have a high probability of being accessed next
• Check for yourself why there is temporal and spatial locality for instruction accesses and for data accesses
– Regular programs have high instruction and data locality

Direct mapped cache
[Figure: 1024-entry direct mapped cache; the 32-bit address splits into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2) and a 2-bit byte offset; hit = valid bit AND tag match]

Direct mapped cache: larger blocks
• Taking advantage of spatial locality:
[Figure: 4K-entry direct mapped cache with 16-byte (128-bit) blocks; the 32-bit address splits into a 16-bit tag (bits 31-16), a 12-bit index (bits 15-4), a 2-bit block offset selecting one of 4 words through a multiplexer, and a 2-bit byte offset]

Performance: effect of block size
[Figure: miss rate (0-40%) versus block size (4-256 bytes) for cache sizes 1 KB to 256 KB]
• Increasing the block (or line) size tends to decrease the miss rate

Cache principles
[Figure: generic cache with 2^k lines (blocks) of 2^m bytes each; the p-bit virtual or physical address from the CPU splits into a (p-k-m)-bit tag, a k-bit index and an m-bit byte address]
Cache Architecture Fundamentals
1. Block placement
– Where in the cache will a new block be placed?
2. Block identification
– How is a block found in the cache?
3. Block replacement policy
– Which block is evicted from the cache?
4. Updating policy
– When is a block written from cache to memory?
– Write-Through vs. Write-Back caches

Block placement policies
[Figure: mapping memory blocks 0, 1, 2, ... onto an 8-block cache]
• Direct mapped (one-to-one): each memory block can be placed in exactly one cache location ("here only!")
• Fully associative (one-to-many): a memory block can be placed anywhere in the cache

4-way associative cache
[Figure: 4-way set-associative cache with 256 sets; the 32-bit address splits into a 22-bit tag (bits 31-10) and an 8-bit index (bits 9-2); the tags of the 4 ways are compared in parallel and a 4-to-1 multiplexer selects the data of the hitting way]

Performance: effect of associativity
[Figure: miss rate (0-15%) versus associativity (one-way to eight-way) for cache sizes 1 KB to 128 KB]

Cache Basics
• Cache_size = Nsets x Associativity x Block_size
• Block_address = Byte_address DIV Block_size_in_bytes
• Index = Block_address MOD Nsets
• Because the block size and the number of sets are (usually) powers of two, DIV and MOD can be performed efficiently
[Address layout: tag | index | block offset]
Example 1
• Assume
– Cache of 4K blocks, with 4-word block size
– 32-bit addresses
• Direct mapped (associativity = 1):
– 16 bytes per block = 2^4 : 4 (2+2) bits for byte and word offsets
– 32-bit address: 32 - 4 = 28 bits left for index and tag
– #sets = #blocks / associativity : log2 of 4K = 12 bits for index
– Total number of tag bits: (28 - 12) * 4K = 64 Kbits
• 2-way associative:
– #sets = #blocks / associativity : 2K sets
– 1 bit less for indexing, 1 bit more for tag (compared to direct mapped)
– Tag bits: (28 - 11) * 2 * 2K = 68 Kbits
• 4-way associative:
– #sets = #blocks / associativity : 1K sets
– 2 bits less for indexing, 2 bits more for tag (compared to direct mapped)
– Tag bits: (28 - 10) * 4 * 1K = 72 Kbits

Example 2
3 caches consisting of 4 one-word blocks:
• Cache 1: fully associative
• Cache 2: two-way set associative
• Cache 3: direct mapped
Suppose the following sequence of block addresses: 0, 8, 0, 6, 8

Example 2: Direct mapped
Block address to cache location: 0 mod 4 = 0; 6 mod 4 = 2; 8 mod 4 = 0

Access  Result  Location 0  Location 1  Location 2  Location 3
0       miss    Mem[0]
8       miss    Mem[8]
0       miss    Mem[0]
6       miss    Mem[0]                  Mem[6]
8       miss    Mem[8]                  Mem[6]

Example 2: 2-way set associative (4/2 = 2 sets)
Block address to set: 0 mod 2 = 0; 6 mod 2 = 0; 8 mod 2 = 0 (so all in set 0)
The least recently used block is replaced.

Access  Result  Set 0, entry 0  Set 0, entry 1  Set 1
0       miss    Mem[0]
8       miss    Mem[0]          Mem[8]
0       hit     Mem[0]          Mem[8]
6       miss    Mem[0]          Mem[6]
8       miss    Mem[8]          Mem[6]

Example 2: Fully associative (4-way associative, 4/4 = 1 set)

Access  Result  Block 0  Block 1  Block 2  Block 3
0       miss    Mem[0]
8       miss    Mem[0]   Mem[8]
0       hit     Mem[0]   Mem[8]
6       miss    Mem[0]   Mem[8]   Mem[6]
8       hit     Mem[0]   Mem[8]   Mem[6]
Cache Fundamentals: the "Three C's"
• Compulsory Misses
– 1st access to a block: it has never been in the cache
• Capacity Misses
– Cache cannot contain all the blocks
– Blocks are discarded and retrieved later
– Avoided by increasing cache size
• Conflict Misses
– Too many blocks mapped to the same set
– Avoided by increasing associativity
• Some add a 4th C: Coherence Misses

Compulsory miss example

for(i=0; i<10; i++)
  A[i] = f(B[i]);

At i=3, B[3] and A[3] are required:
• B[3] was never loaded before: it is loaded into the cache (compulsory read miss)
• A[3] was never loaded before: a new cache line is allocated for it (compulsory write miss)

Capacity miss example
Cache size: 8 blocks of 1 word, fully associative

for(i=0; i<N; i++)
  A[i] = B[i+3]+B[i];

[Figure: cache contents after each iteration i = 0..7; from i = 3 onwards B[i] has already been evicted and must be fetched again]
• 11 compulsory misses (+8 write misses)
• 5 capacity misses

Conflict miss example
Memory layout (word addresses): A[0..3] at addresses 0-3, followed by B[][] stored column by column: B[0..3][0] at 4-7, B[0..3][1] at 8-11, B[0..3][2] at 12-15, and so on up to B[3][9]. The cache is direct mapped with 8 one-word blocks, so cache address = memory address mod 8.

for(j=0; j<10; j++)
  for(i=0; i<4; i++)
    A[i] = A[i]+B[i][j];

A[i] is read 10 times, but for odd j, B[i][j] maps to the same cache line as A[i]:
• j even: B[i][j] maps to cache lines 4-7, no conflict
• j odd: A[i] is flushed in favor of B[i][j], causing a miss; A[i] is loaded multiple times

"Three C's" versus cache size [Gee93]
[Figure: relative and absolute miss rates (total, compulsory, capacity and conflict misses) versus cache size, 1 KB to 64 KB]
Data layout may reduce cache misses

Example 1: Capacity and compulsory miss reduction

for(i=0; i<N; i++)
  A[i] = B[i+3]+B[i];

[Figure: cache contents after each iteration i = 0..7, as in the capacity miss example]
• 11 compulsory misses (+8 write misses)
• 5 capacity misses

Fit data in cache with in-place mapping

for(i=0; i<12; i++)
  A[i] = B[i+3]+B[i];

• Traditional analysis: A[] and B[] together need at most 27 words
• Detailed analysis: at most 15 words are alive at the same time, so A[] can reuse B[]'s locations in main memory and all data fits a cache of 16 words
[Figure: number of live words versus i; main memory and cache memory of 16 words each, with A[] mapped in place onto B[]]

Remove capacity / compulsory misses with in-place mapping

for(i=0; i<N; i++)
  AB[i] = AB[i+3]+AB[i];

[Figure: cache contents after each iteration i = 0..7; every AB[i] is still in the cache when it is read again]
• 11 compulsory misses
• 5 cache hits (+8 write hits)

Example 2: Conflict miss reduction
Memory layout (word addresses): A[0..3] at addresses 0-3, followed by B[][] stored column by column: B[0..3][0] at 4-7, B[0..3][1] at 8-11, and so on; the cache is direct mapped with 8 one-word blocks.

for(j=0; j<10; j++)
  for(i=0; i<4; i++)
    A[i] = A[i]+B[i][j];

A[i] is read 10 times, but for odd j, B[i][j] maps to the same cache line as A[i] and flushes it, causing repeated misses (as in the conflict miss example above).
Avoid conflict misses with main memory data layout
Modified memory layout: A[0..3] at word addresses 0-3, B[0..3][0] at 4-7, then leave a 4-word gap, B[0..3][1] at 12-15, another gap, and so on. Every column of B now starts at an address equal to 4 mod 8, so B[i][j] always maps to cache lines 4-7.

for(j=0; j<10; j++)
  for(i=0; i<4; i++)
    A[i] = A[i]+B[i][j];

• A[i] stays in cache lines 0-3 and B[i][j] always maps to lines 4-7: no conflict for any j (© imec 2001)

Data layout organization for direct mapped caches
[Figure: miss rate versus cache size (512 bytes, 1 KB, 2 KB) for the initial layout on a direct mapped cache, the data-layout-organized direct mapped cache, and the initial layout on a fully associative cache; the reorganized layout brings the direct mapped miss rate close to the fully associative one]

Conclusions on Data Management
• In multimedia applications, data transfer and storage issues should be explored at the source code level
• DMM method:
– Reduces the number of external memory accesses
– Reduces the external memory size
– Trades off internal memory complexity against speed
– Platform-independent high-level transformations
– Platform-dependent transformations exploit platform characteristics (efficient use of memory, cache, ...)
– Substantial energy reduction
• Although caches are hardware controlled, the data layout can largely influence the miss rate