Embedded Computer Architecture
Data Memory Management, Part d: Data Layout for Caches
5KK73 TU/e
Henk Corporaal, Bart Mesman
@H.C. Embedded Computer Architecture

Data layout for caches
• Caches are hardware controlled
• Therefore: no explicit reuse-copy code is needed in your program!
• What can we still do to improve performance?
• Topics:
– Cache principles
– The 3 C's: Compulsory, Capacity and Conflict misses
– Data layout examples reducing misses

Cache operation (direct mapped cache)
[Figure: the cache (higher level) holds blocks or lines, each with a tag and data, backed by main memory (lower level)]

Why does a cache work?
• Principle of Locality
– Temporal locality
• an accessed item has a high probability of being accessed again in the near future
– Spatial locality
• items close in space to a recently accessed item have a high probability of being accessed next
• Check for yourself why there is temporal and spatial locality for instruction accesses and for data accesses
– Regular programs have high instruction and data locality

Direct mapped cache
[Figure: 1024-entry direct mapped cache; the 32-bit address splits into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2) and a 2-bit byte offset; hit = valid bit AND tag match]

Direct mapped cache: larger blocks
• Taking advantage of spatial locality:
[Figure: 4K-entry direct mapped cache with 16-byte (128-bit) blocks; the 32-bit address splits into a 16-bit tag (bits 31-16), a 12-bit index (bits 15-4), a 2-bit block offset selecting one of 4 words through a multiplexer, and a 2-bit byte offset]

Performance: effect of block size
[Figure: miss rate (0-40%) versus block size (4-256 bytes) for cache sizes 1 KB to 256 KB]
• Increasing the block (or line) size tends to decrease the miss rate

Cache principles
[Figure: generic cache with 2^k lines (blocks) of 2^m bytes each; the p-bit virtual or physical address from the CPU splits into a (p-k-m)-bit tag, a k-bit index and an m-bit byte address]
Cache Architecture Fundamentals
1. Block placement
– Where in the cache will a new block be placed?
2. Block identification
– How is a block found in the cache?
3. Block replacement policy
– Which block is evicted from the cache?
4. Updating policy
– When is a block written from cache to memory?
– Write-Through vs. Write-Back caches

Block placement policies
[Figure: mapping memory blocks 0, 1, 2, ... onto an 8-block cache]
• Direct mapped (one-to-one): each memory block can be placed in exactly one cache location ("here only!")
• Fully associative (one-to-many): a memory block can be placed anywhere in the cache

4-way associative cache
[Figure: 4-way set-associative cache with 256 sets; the 32-bit address splits into a 22-bit tag (bits 31-10) and an 8-bit index (bits 9-2); the tags of the 4 ways are compared in parallel and a 4-to-1 multiplexer selects the data of the hitting way]

Performance: effect of associativity
[Figure: miss rate (0-15%) versus associativity (one-way to eight-way) for cache sizes 1 KB to 128 KB]

Cache Basics
• Cache_size = Nsets x Associativity x Block_size
• Block_address = Byte_address DIV Block_size_in_bytes
• Index = Block_address MOD Nsets
• Because the block size and the number of sets are (usually) powers of two, DIV and MOD can be performed efficiently
[Address layout: tag | index | block offset]
Example 1
• Assume
– Cache of 4K blocks, with 4-word block size
– 32-bit addresses
• Direct mapped (associativity = 1):
– 16 bytes per block = 2^4 : 4 (2+2) bits for byte and word offsets
– 32-bit address: 32 - 4 = 28 bits left for index and tag
– #sets = #blocks / associativity : log2 of 4K = 12 bits for index
– Total number of tag bits: (28 - 12) * 4K = 64 Kbits
• 2-way associative:
– #sets = #blocks / associativity : 2K sets
– 1 bit less for indexing, 1 bit more for tag (compared to direct mapped)
– Tag bits: (28 - 11) * 2 * 2K = 68 Kbits
• 4-way associative:
– #sets = #blocks / associativity : 1K sets
– 2 bits less for indexing, 2 bits more for tag (compared to direct mapped)
– Tag bits: (28 - 10) * 4 * 1K = 72 Kbits

Example 2
3 caches consisting of 4 one-word blocks:
• Cache 1: fully associative
• Cache 2: two-way set associative
• Cache 3: direct mapped
Suppose the following sequence of block addresses: 0, 8, 0, 6, 8

Example 2: Direct mapped
Block address to cache location: 0 mod 4 = 0; 6 mod 4 = 2; 8 mod 4 = 0

Access  Result  Location 0  Location 1  Location 2  Location 3
0       miss    Mem[0]
8       miss    Mem[8]
0       miss    Mem[0]
6       miss    Mem[0]                  Mem[6]
8       miss    Mem[8]                  Mem[6]

Example 2: 2-way set associative (4/2 = 2 sets)
Block address to set: 0 mod 2 = 0; 6 mod 2 = 0; 8 mod 2 = 0 (so all in set 0)
The least recently used block is replaced.

Access  Result  Set 0, entry 0  Set 0, entry 1  Set 1
0       miss    Mem[0]
8       miss    Mem[0]          Mem[8]
0       hit     Mem[0]          Mem[8]
6       miss    Mem[0]          Mem[6]
8       miss    Mem[8]          Mem[6]

Example 2: Fully associative (4-way associative, 4/4 = 1 set)

Access  Result  Block 0  Block 1  Block 2  Block 3
0       miss    Mem[0]
8       miss    Mem[0]   Mem[8]
0       hit     Mem[0]   Mem[8]
6       miss    Mem[0]   Mem[8]   Mem[6]
8       hit     Mem[0]   Mem[8]   Mem[6]
Cache Fundamentals: the "Three C's"
• Compulsory Misses
– 1st access to a block: it has never been in the cache
• Capacity Misses
– Cache cannot contain all the blocks
– Blocks are discarded and retrieved later
– Avoided by increasing cache size
• Conflict Misses
– Too many blocks mapped to the same set
– Avoided by increasing associativity
• Some add a 4th C: Coherence Misses

Compulsory miss example

for(i=0; i<10; i++)
  A[i] = f(B[i]);

At i=3, B[3] and A[3] are required:
• B[3] was never loaded before: it is loaded into the cache (compulsory read miss)
• A[3] was never loaded before: a new cache line is allocated for it (compulsory write miss)

Capacity miss example
Cache size: 8 blocks of 1 word, fully associative

for(i=0; i<N; i++)
  A[i] = B[i+3]+B[i];

[Figure: cache contents after each iteration i = 0..7; from i = 3 onwards B[i] has already been evicted and must be fetched again]
• 11 compulsory misses (+8 write misses)
• 5 capacity misses

Conflict miss example
Memory layout (word addresses): A[0..3] at addresses 0-3, followed by B[][] stored column by column: B[0..3][0] at 4-7, B[0..3][1] at 8-11, B[0..3][2] at 12-15, and so on up to B[3][9]. The cache is direct mapped with 8 one-word blocks, so cache address = memory address mod 8.

for(j=0; j<10; j++)
  for(i=0; i<4; i++)
    A[i] = A[i]+B[i][j];

A[i] is read 10 times, but for odd j, B[i][j] maps to the same cache line as A[i]:
• j even: B[i][j] maps to cache lines 4-7, no conflict
• j odd: A[i] is flushed in favor of B[i][j], causing a miss; A[i] is loaded multiple times

"Three C's" versus cache size [Gee93]
[Figure: relative and absolute miss rates (total, compulsory, capacity and conflict misses) versus cache size, 1 KB to 64 KB]
Data layout may reduce cache misses

Example 1: Capacity and compulsory miss reduction

for(i=0; i<N; i++)
  A[i] = B[i+3]+B[i];

[Figure: cache contents after each iteration i = 0..7, as in the capacity miss example]
• 11 compulsory misses (+8 write misses)
• 5 capacity misses

Fit data in cache with in-place mapping

for(i=0; i<12; i++)
  A[i] = B[i+3]+B[i];

• Traditional analysis: A[] and B[] together need at most 27 words
• Detailed analysis: at most 15 words are alive at the same time, so A[] can reuse B[]'s locations in main memory and all data fits a cache of 16 words
[Figure: number of live words versus i; main memory and cache memory of 16 words each, with A[] mapped in place onto B[]]

Remove capacity / compulsory misses with in-place mapping

for(i=0; i<N; i++)
  AB[i] = AB[i+3]+AB[i];

[Figure: cache contents after each iteration i = 0..7; every AB[i] is still in the cache when it is read again]
• 11 compulsory misses
• 5 cache hits (+8 write hits)

Example 2: Conflict miss reduction
Memory layout (word addresses): A[0..3] at addresses 0-3, followed by B[][] stored column by column: B[0..3][0] at 4-7, B[0..3][1] at 8-11, and so on; the cache is direct mapped with 8 one-word blocks.

for(j=0; j<10; j++)
  for(i=0; i<4; i++)
    A[i] = A[i]+B[i][j];

A[i] is read 10 times, but for odd j, B[i][j] maps to the same cache line as A[i] and flushes it, causing repeated misses (as in the conflict miss example above).
Avoid conflict misses with main memory data layout
Modified memory layout: A[0..3] at word addresses 0-3, B[0..3][0] at 4-7, then leave a 4-word gap, B[0..3][1] at 12-15, another gap, and so on. Every column of B now starts at an address equal to 4 mod 8, so B[i][j] always maps to cache lines 4-7.

for(j=0; j<10; j++)
  for(i=0; i<4; i++)
    A[i] = A[i]+B[i][j];

• A[i] stays in cache lines 0-3 and B[i][j] always maps to lines 4-7: no conflict for any j (© imec 2001)

Data layout organization for direct mapped caches
[Figure: miss rate versus cache size (512 bytes, 1 KB, 2 KB) for the initial layout on a direct mapped cache, the data-layout-organized direct mapped cache, and the initial layout on a fully associative cache; the reorganized layout brings the direct mapped miss rate close to the fully associative one]

Conclusions on Data Management
• In multimedia applications, data transfer and storage issues should be explored at the source code level
• DMM method:
– Reduces the number of external memory accesses
– Reduces the external memory size
– Trades off internal memory complexity against speed
– Platform-independent high-level transformations
– Platform-dependent transformations exploit platform characteristics (efficient use of memory, cache, ...)
– Substantial energy reduction
• Although caches are hardware controlled, the data layout can largely influence the miss rate