CMPE 421 Advanced Computer Architecture
Caching with Associativity (Part 2)

Other Cache Organizations

Direct mapped: a cache of 16 entries (indexes 0-15), each holding a valid bit, a tag, and data. Each address has only one possible location:
  Address = Tag | Index | Block offset
Fully associative: each entry holds a valid bit, a tag, and data, but there is no index field — a block may reside in any entry, so every tag must be searched:
  Address = Tag | Block offset

A Compromise

2-way set associative: each address has two possible locations with the same index. One fewer index bit gives half the indexes.
4-way set associative: each address has four possible locations with the same index. Two fewer index bits give a quarter of the indexes.
In both cases: Address = Tag | Index | Block offset

Range of Set Associative Caches

Address fields: Tag (used for the tag compare) | Index (selects the set) | Block offset (selects the word in the block) | Byte offset.
Decreasing associativity moves bits from tag to index; the extreme is direct mapped (only one way), which has the smallest tags.
Increasing associativity moves bits from index to tag; the extreme is fully associative (only one set), where the tag is all the bits except the block and byte offsets.

Set Associative Cache

Example: a 2-way set-associative cache with 2 sets (4 one-word blocks) caching a 16-word main memory (addresses 0000xx through 1111xx; the two low-order bits select the byte in a 32-bit word; one-word blocks).
Q1: How do we find it? Use the next low-order memory address bit to determine the cache set, i.e. (block address) modulo (# sets in the cache).
Q2: Is it there? Compare all the cache tags in the set to the high-order 3 memory address bits to tell if the memory block is in the cache.

Set Associative Cache Organization

FIGURE 7.17: The implementation of a four-way set-associative cache requires four comparators and a 4-to-1 multiplexor. The comparators determine which element of the selected set (if any) matches the tag.
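The field split described above can be sketched in a few lines of Python. This is a hypothetical helper (the function name is mine, not from the slides); it shows how doubling the associativity — halving the number of sets — moves one bit from the index into the tag:

```python
def split_address(addr, num_sets, block_size):
    """Return (tag, index, offset) for a cache with the given geometry."""
    offset_bits = block_size.bit_length() - 1   # log2(block size in bytes)
    index_bits = num_sets.bit_length() - 1      # log2(number of sets)
    offset = addr & (block_size - 1)
    index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# Same 10-bit address, same 16-block capacity, three organizations:
addr = 0b0100111000
print(split_address(addr, 16, 4))  # direct mapped: 16 sets, 4-bit index
print(split_address(addr, 8, 4))   # 2-way: 8 sets, one fewer index bit
print(split_address(addr, 4, 4))   # 4-way: 4 sets, two fewer index bits
```

Each call returns a wider tag and a narrower index as associativity grows, while the block offset never changes.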
The output of the comparators is used to select the data from one of the four blocks of the indexed set, using a multiplexor with a decoded select signal. In some implementations, the Output enable signals on the data portions of the cache RAMs can be used to select the entry in the set that drives the output; the Output enable signal comes from the comparators, causing the matching element to drive the data outputs.

Remember the Example for Direct Mapping (Ping-Pong Effect)

Consider the main memory word reference string 0 4 0 4 0 4 0 4. Start with an empty 4-block direct-mapped cache, all blocks initially marked as not valid. Words 0 and 4 both map to cache block 0 (tags 00 and 01), so each reference evicts the other:

  0 miss (load Mem(0), tag 00)    4 miss (evict 0, load Mem(4), tag 01)
  0 miss (evict 4, load Mem(0))   4 miss (evict 0, load Mem(4))
  0 miss    4 miss    0 miss    4 miss

8 requests, 8 misses. This ping-pong effect is due to conflict misses: two memory locations map into the same cache block.

Solution: Use a Set-Associative Cache

Same reference string 0 4 0 4 0 4 0 4, now with an (initially empty) 2-way set-associative cache of the same size. Word 0 (tag 000) and word 4 (tag 010) both map to set 0, but can occupy its two ways:

  0 miss (load Mem(0))    4 miss (load Mem(4))    0 hit    4 hit    0 hit    4 hit    0 hit    4 hit

8 requests, 2 misses. This solves the ping-pong effect of the direct-mapped cache, since two memory locations that map into the same cache set can now co-exist!

Set Associative Example

10-bit addresses with a 2-bit byte offset, a 2-bit block offset, an index of 1-3 bits, and a tag of 3-5 bits depending on associativity. Reference string: 0100111000, 1100110100, 0100111100, 0110110000, 1100111000.

Direct-mapped (3-bit index, 3-bit tag): all five references map to index 011, with tags 010, 110, 010, 011, 110 — Miss, Miss, Miss, Miss, Miss.
2-way set associative (2-bit index, 4-bit tag): all five map to set 11, with tags 0100, 1100, 0100, 0110, 1100 — Miss, Miss, Hit, Miss, Miss.
4-way set associative (1-bit index, 5-bit tag): all five map to set 1, with tags 01001, 11001, 01001, 01101, 11001 — Miss, Miss, Hit, Miss, Hit.
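The ping-pong example above can be checked with a minimal set-associative cache simulator. This is a sketch of my own (the class and its LRU policy are not from the slides): a direct-mapped cache is modeled as 1-way associative, and the same word reference string is replayed against both organizations of equal capacity.

```python
class Cache:
    """A tiny set-associative cache model with LRU replacement."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        # each set holds up to `ways` tags, ordered most- to least-recently used
        self.sets = [[] for _ in range(num_sets)]

    def access(self, block_addr):
        """Return True on a hit, False on a miss; updates LRU order."""
        index = block_addr % self.num_sets
        tag = block_addr // self.num_sets
        s = self.sets[index]
        if tag in s:
            s.remove(tag)
            s.insert(0, tag)          # move to most-recently-used position
            return True
        if len(s) == self.ways:
            s.pop()                   # evict the least-recently-used tag
        s.insert(0, tag)
        return False

refs = [0, 4, 0, 4, 0, 4, 0, 4]
direct = Cache(num_sets=4, ways=1)    # 4 one-word blocks, direct mapped
two_way = Cache(num_sets=2, ways=2)   # same capacity, 2-way set associative
print(sum(not direct.access(a) for a in refs))   # 8 misses: the ping pong
print(sum(not two_way.access(a) for a in refs))  # 2 misses: 0 and 4 co-exist
```

Blocks 0 and 4 collide in set 0 of the direct-mapped cache but share the two ways of set 0 in the 2-way cache, reproducing the 8-miss versus 2-miss counts from the slides.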
New Performance Numbers

Miss rates for the DEC 3100 (a MIPS machine) with separate 64 KB instruction and data caches:

  Benchmark  Associativity  Instruction miss rate  Data miss rate  Combined miss rate
  gcc        Direct         2.0%                   1.7%            1.9%
  gcc        2-way          1.6%                   1.4%            1.5%
  gcc        4-way          1.6%                   1.4%            1.5%
  spice      Direct         0.3%                   0.6%            0.4%
  spice      2-way          0.3%                   0.6%            0.4%
  spice      4-way          0.3%                   0.6%            0.4%

Benefits of Set Associative Caches

The choice of direct mapped or set associative depends on the cost of a miss versus the cost of implementation.

[Figure: miss rate versus associativity (1-way, 2-way, 4-way, 8-way) for cache sizes from 4 KB to 512 KB; data from Hennessy & Patterson, Computer Architecture, 2003.]

The largest gains come from going from direct mapped to 2-way (a 20%+ reduction in miss rate).

Virtual Memory (32-bit System)

With an 8 KB page size and 16 MB of main memory:
  Virtual address (32 bits): virtual page number in bits 31-13 (2^19 = 512K pages), page offset in bits 12-0.
  Page table: 4 GB / 8 KB = 512K entries, each holding a valid bit and either a physical page number or a disk address.
  Physical address (24 bits): 11-bit physical page number in bits 23-13, page offset in bits 12-0.

Virtual Memory Example

System with 20-bit virtual addresses, 16 KB pages, and 256 KB of physical memory. The page offset takes 14 bits, leaving 6 bits for the virtual page number (VPN) and 4 bits for the physical page number (PPN).

Page table (VPN as index):

  VPN     Valid  PPN / disk address
  000000  1      1001
  000001  0      sector 5000...
  000010  1      0010
  000011  0      sector 4323...
  000100  1      1011
  000101  1      1010
  000110  0      sector 1239...
  000111  0      sector xxxx...

Access to 0000 1000 1100 1010 1010: VPN = 000010, so PPN = 0010. Physical address: 00 1000 1100 1010 1010.

Access to 0001 1001 0011 1100 0000: VPN = 000110, whose valid bit is 0 — page fault to sector 1239... Pick a page to "kick out" of memory (use LRU); assume the LRU page is VPN 000101 for this example, which occupies PPN 1010. Read the data from sector 1239 into PPN 1010.
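The translation walked through in the virtual memory example can be sketched as a short Python snippet. This is a hedged illustration (the function and table structure are mine); it hard-codes only the two page-table entries the example exercises, under the slides' geometry of 16 KB pages, a 6-bit VPN, and a 4-bit PPN:

```python
PAGE_OFFSET_BITS = 14      # 16 KB pages => 14-bit page offset

# VPN -> (valid bit, PPN if valid, else disk location), as in the example
page_table = {
    0b000010: (1, 0b0010),
    0b000110: (0, "sector 1239"),
}

def translate(vaddr):
    """Translate a 20-bit virtual address, raising on a page fault."""
    vpn = vaddr >> PAGE_OFFSET_BITS
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    valid, entry = page_table[vpn]
    if not valid:
        raise RuntimeError(f"page fault: VPN {vpn:06b} is on {entry}")
    return (entry << PAGE_OFFSET_BITS) | offset   # PPN ++ page offset

paddr = translate(0b00001000110010101010)
print(f"{paddr:018b}")     # 001000110010101010, as in the slides

try:
    translate(0b00011001001111000000)
except RuntimeError as e:
    print(e)               # page fault: the OS would evict an LRU page
                           # and read sector 1239 into the freed frame
```

The hit case concatenates PPN 0010 with the 14-bit offset; the miss case models only the fault detection — the LRU eviction and disk read from the slides would be handled by the operating system's fault handler.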