CMPEN 431 Practice Exam 2A

1. (11) Caching

a) (6) Consider a cache for a system that does not support virtual memory (i.e. no paging and no translation). The byte-addressable address space consists of Z bytes, the cache has S sets, A ways, and a block size of K bytes.

   i) If we want to address compulsory misses, and we do not change the capacity of the cache, what parameters will we change and how? Will the bit length of the tag become larger, smaller, or stay the same? Why?

   ii) If we want to address conflict misses, and we do not change the capacity of the cache, what parameters will we change and how? Will the bit length of the tag become larger, smaller, or stay the same? Why?

   iii) If block size decreases by 2x, associativity triples, and cache capacity goes up by 12x, how many bits are in each of the tag, index and block offset fields, in terms of the given parameters of the original cache?

b) (5) There are three programs, FOO, BAR, and BAZ, a reference microbenchmark “ALWAYS-HITS” and an architecture SHODAN-N where each processor in the SHODAN-N series varies only by the total size of its (highly associative) single-level cache. The size of the cache doubles with each increment of N (i.e. SHODAN-3 has 8 times as much cache as SHODAN-0). Assume that the access patterns of all three programs are not pathological with respect to cache parameters (i.e. assume as a simplifying assumption that changing among reasonable replacement policies would not have significant effects on performance, nor would the relative behaviors of the programs change significantly if the associativity or block size were modestly increased or decreased, etc.)

As FOO, BAR, and BAZ are run across the SHODAN series of processors in order from a SHODAN-0 to a SHODAN-7, the following behaviors are observed:

   The performance of FOO is very good (similar to ALWAYS-HITS) on a SHODAN-0, but gets progressively worse by SHODAN-7, although it still performs reasonably well.
   The performance of BAR is initially poor, then increases greatly between SHODAN-2 and SHODAN-3, and then declines.
   The performance of BAZ is initially poor, and gets progressively better from SHODAN-0 to SHODAN-7, but the rate of improvement decreases with every step.

   i) (3) Describe the properties of FOO, BAR, and BAZ that would produce these respective behaviors (you may assume that each program has uniform behavior if it simplifies your answer).

   ii) (2) Assume that you are designing a follow-on to the SHODAN series, the POLITO processors. Describe a cache memory system that will effectively serve all three programs and justify your answers. Note any compromises in the design where one program suffers at the expense of the others, and how.

This study source was downloaded by 100000826153891 from CourseHero.com on 11-28-2023 22:05:56 GMT -06:00 https://www.coursehero.com/file/80945504/CMPEN431-Exam2A-Practice-1pdf/

2. (10) Caching in Virtual Memory Systems

Assume that you have a byte-addressable MIPS system with the following properties and configuration:

   Physical address space: 26 bits; Virtual address space: 32 bits; Page size: 16KB; word = 32 bits
   8 entry, 2-way associative ITLB and DTLB
   VIPT L1 D-cache = 16 entry, 2-way associative, with 8-byte blocks; write-allocate/write-back
   VIPT L1 I-cache = 32 entry, 2-way associative, with 4-byte blocks
   Entry for each TLB or D$ entry consists of {valid, tag, data}. Tag and data in hexadecimal.
   I$ entries shown as {valid, tag, decoded instruction} for ease of exposition.
   Cache-block Representation Endianness: If the data block containing address 0x0006 was 0x0123456789ABCDEF, the byte loaded from 0x0006 would have integer value = 0xCD.
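The field widths implied by the configuration above follow mechanically from the stated sizes. The Python sketch below is an illustration only and is not part of the exam: the helper names `log2_exact`, `field_widths`, `split_address`, and `byte_at` are hypothetical, and it assumes that every size is a power of two and that an "N entry" structure holds N blocks (or translations) in total across its ways. For the VIPT caches the tag is taken from the 26-bit physical address; for the TLBs the value being split is the 18-bit virtual page number (32 address bits minus the 14-bit offset of a 16KB page).

```python
def log2_exact(n):
    """Return log2(n) for a positive power of two, else raise ValueError."""
    if n <= 0 or n & (n - 1):
        raise ValueError(f"{n} is not a power of two")
    return n.bit_length() - 1


def field_widths(key_bits, entries, ways, block_bytes=1):
    """Return (tag_bits, index_bits, offset_bits) for a set-associative lookup.

    key_bits is the width of the value being split: the physical address for
    a cache, or the virtual page number for a TLB (block_bytes=1, no offset).
    """
    index_bits = log2_exact(entries // ways)
    offset_bits = log2_exact(block_bytes)
    return key_bits - index_bits - offset_bits, index_bits, offset_bits


def split_address(value, index_bits, offset_bits):
    """Split value into (tag, index, offset) using the given field widths."""
    offset = value & ((1 << offset_bits) - 1)
    index = (value >> offset_bits) & ((1 << index_bits) - 1)
    tag = value >> (offset_bits + index_bits)
    return tag, index, offset


def byte_at(block, block_bytes, offset):
    """Select one byte from a block written as in the exam's endianness note:
    the leftmost hex byte is the one at block offset 0."""
    shift = 8 * (block_bytes - 1 - offset)
    return (block >> shift) & 0xFF


PAGE_OFFSET_BITS = log2_exact(16 * 1024)  # 16KB pages
dcache = field_widths(26, entries=16, ways=2, block_bytes=8)
icache = field_widths(26, entries=32, ways=2, block_bytes=4)
tlb = field_widths(32 - PAGE_OFFSET_BITS, entries=8, ways=2)

# The exam's own endianness example: address 0x0006 in block 0x0123456789ABCDEF.
_, _, offset = split_address(0x0006, dcache[1], dcache[2])
example_byte = byte_at(0x0123456789ABCDEF, 8, offset)
print(dcache, icache, tlb, hex(example_byte))
```

The same `split_address` call works in either direction of the problem: given a set number and a byte position inside a block, the low address bits can be rebuilt by shifting and OR-ing the fields back together.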
DTLB:
SET 0    1,0x1234,0x000    1,0x0000,0x0A2
SET 1    1,0x0000,0x003    1,0xC234,0x002
SET 2    1,0x7234,0x023    1,0x0000,0x100
SET 3    1,0x0000,0x001    1,0x2234,0x009

L1 D$:
SET 0    1, 0x00000, 0x1234567887654321    1, 0x00308, 0xC00CEF0990FEDCBA
SET 1    1, 0x00100, 0xFEEDEEC5C0DEF00D    1, 0x00201, 0x1FEEDEE15DEADC0D
SET 2    1, 0x00200, 0xBEE5BEE5BEE5BEE5    1, 0x00101, 0xDEAFBEEFDEADBEE5
SET 3    1, 0x00387, 0xB0B0D0D01A2B3C4D    1, 0x00001, 0x1FEED515E1FF00D5
SET 4    1, 0x00000, 0x0102030405060708    1, 0x00389, 0x0910111213141516
SET 5    1, 0x00100, 0xFE99EE88C077F066    1, 0x00201, 0x54ED43E132EA210D
SET 6    1, 0x00200, 0x1E1D1E15101E101D    1, 0x00108, 0x11E2D3E455E6D708
SET 7    1, 0x00386, 0xFEEDEEC5C0DEF00D    1, 0x00001, 0x1FEEDEE15DEADC0D

ITLB:
SET 0    1,0x0004,0x007    1,0x1234,0x040
SET 1    1,0x0040,0x640    1,0x0004,0x068
SET 2    1,0x012C,0x022    1,0x0040,0x00F
SET 3    1,0x1234,0xD1E    1,0x0CA7,0x486

L1 I$:
SET 0    1, 0x64000, LBU $13, 0x1238($0)    1, 0x04000, LBU $0, 0x9236($0)
SET 1    1, 0x48600, LBU $5, 0x2236($0)     1, 0x06880, LBU $13, 0xA234($0)
SET 2    1, 0x48604, LBU $3, 0x4238($0)     0, 0x21264, LBU $1, 0xB232($0)
SET 3    1, 0xD1E10, LBU $2, 0x8232($0)     1, 0x64004, LBU $13, 0xC230($0)
SET 4    1, 0x02211, LBU $13, 0x5230($0)    1, 0x640AD, LBU $2, 0xD22E($0)
SET 5    1, 0x04040, LBU $13, 0x623E($0)    1, 0x48633, LBU $13, 0xE23C($0)
SET 6    1, 0x00737, LBU $13, 0x723C($0)    1, 0x04000, LBU $3, 0xF23A($0)
SET 7    1, 0x00FAD, LBU $13, 0x823A($0)    1, 0x64086, LBU $13, 0x0238($0)
SET 8    1, 0x64000, LBU $13, 0x1238($0)    1, 0x04000, LBU $4, 0x9238($0)
SET 9    1, 0x48600, LBU $13, 0x61DF($0)    1, 0x06880, LBU $13, 0xA236($0)
SET 10   1, 0x48604, LBU $13, 0x6264($0)    0, 0x21264, LBU $5, 0xB234($0)
SET 11   1, 0xD1E10, LBU $9, 0x8242($0)     1, 0x64004, LBU $13, 0xC232($0)
SET 12   1, 0x02211, LBU $7, 0x5234($0)     1, 0x640AD, LBU $6, 0xD230($0)
SET 13   1, 0x04040, LBU $5, 0x623C($0)     1, 0x48633, LBU $9, 0xE23E($0)
SET 14   1, 0x00737, LBU $3, 0x7230($0)     1, 0x04000, LBU $7, 0xF23C($0)
SET 15   1, 0x00FAD, LBU $2, 0x823A($0)     1, 0x64086, LBU $8, 0x023A($0)

If the result of executing the current instruction is 1) an I-cache & ITLB hit, and 2) a D-cache & DTLB hit that puts 0x4D (circled) in register 13, what is the address (PC) of the current instruction? Show work.

3. (6) Cache Performance

   The base CPI of a system, excluding memory stalls, is QUUX.
   Loads and stores collectively constitute A% of all instructions.
   Accessing the L1 data cache takes P cycles (accounted for in base CPI).
   Accessing the L1 instruction cache takes D cycles (accounted for in base CPI).
   Misses to main memory take an average of K cycles.
   The L1 D-cache miss rate/access is MD.
   The L1 I-cache miss rate/access is MI.
   The L1 D-cache miss rate/access in a double-capacity cache is MDX.
   The L1 I-cache miss rate/access in a double-capacity cache is MIX.

Your team is considering either i) adding a unified L2 cache for both data and instruction accesses or ii) doubling the size of the existing L1 caches at a penalty of 1 extra cycle per access. SOLELY from an AMAT optimization perspective, and assuming that the L2 miss rate for instructions is half that of the miss rate for data, what relationship would have to hold true for the L2 cache to be a better option?

4. (8) Multi-threading

Consider a 2-wide in-order 5-stage pipeline in which all functional units are replicated 2x, except data memory ports, that initially supports only 1 thread, and has the below schedule. Assume that the branch to FOO is taken.
                                CYCLE
Instruction                     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
FOO:        lw $2, 0($4)        F  D  E  M  W
            lw $3, 0($5)        F  d  D  E  M  W
            addu $2, $2, $3        f  F  d  D  E  M  W
            sw $2, 0($4)           f  F  d  D  E  M  W
            lw $4, 4($4)                 f  F  D  E  M  W
            addi $5, $5, 4               f  F  D  E  M  W
            bne $4, $0, FOO                    F  d  d  D  E  M  W
<MISFETCH>  sll $0, $0, $0                     F  d  d  D  -  -  -

a) (5) Assume that two copies of the same code were running on a two-wide, 2-thread FGMT (Fine-Grained-Multi-Threaded) processor with simple round-robin scheduling. In what cycle would the first thread to execute FOO fetch the second dynamic instance of that instruction?

b) (3) Consider two possible implementations of designs that support two threads:

   i) A single core, 4-issue, 2-way simultaneous multi-threading dynamically scheduled (OoO) processor
   ii) A multiprocessor with two 2-issue single-threaded OoO cores

Assume that both designs have a two-level cache hierarchy, and that the L1 cache size per core is identical. Describe a two-threaded workload (either multi-threaded or multi-process) where each thread has a fixed amount of work to perform that would be expected to reach a point where both threads have completed execution significantly faster on i) than ii) and explain why.
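Returning to Problem 3, both design options can be compared with the standard recurrence AMAT = hit time + miss rate x miss penalty, applied once per cache level. The Python sketch below is a hedged illustration rather than a solution: the function names are hypothetical, and the L2 hit time and L2 local miss rate are assumed parameters that the problem statement does not name. All numbers in the usage lines are made up for demonstration.

```python
def amat(hit_cycles, miss_rate, miss_penalty_cycles):
    """Average memory access time for one cache level, in cycles."""
    return hit_cycles + miss_rate * miss_penalty_cycles


def amat_with_l2(l1_hit, l1_miss_rate, l2_hit, l2_local_miss_rate, mem_cycles):
    """Option i): an L1 miss pays the AMAT of the L2, which may miss to memory.

    l2_hit and l2_local_miss_rate are assumed inputs, not exam symbols.
    """
    l1_miss_penalty = amat(l2_hit, l2_local_miss_rate, mem_cycles)
    return amat(l1_hit, l1_miss_rate, l1_miss_penalty)


def amat_bigger_l1(l1_hit, bigger_l1_miss_rate, mem_cycles, extra_cycles=1):
    """Option ii): a double-capacity L1 that costs extra_cycles more per access."""
    return amat(l1_hit + extra_cycles, bigger_l1_miss_rate, mem_cycles)


# Made-up data-side numbers: P = 1, MD = 0.10, MDX = 0.06, K = 100,
# with an assumed 10-cycle L2 that misses on half of its accesses.
with_l2 = amat_with_l2(1, 0.10, 10, 0.5, 100)
bigger = amat_bigger_l1(1, 0.06, 100)
print("L2 option is better" if with_l2 < bigger else "bigger L1 is better")
```

The instruction side follows the same pattern with D, MI, and MIX in place of P, MD, and MDX, and with the L2 instruction miss rate set to half of the data one as the problem states.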