1. Answer the following questions:

   a. (multiple choice) Many new CMPs have private L1s and a shared L2 as the last-level cache. What is the advantage of having a shared L2 instead of private L2s?
      1. Improve the overall cache capacity
      2. Allow fast L2 access time
      3. Simplify cache coherence
      4. Flexible cache space sharing
      5. Enable another L3 cache level
      6. Reduce cache miss penalty
      7. None of the above

   b. (multiple choice) What are the essential factors for today's microprocessors?
      1. Super-pipelined and aggressively speculative designs, since the transistor count per chip has continued to follow Moore's Law
      2. Branch prediction is the most important performance factor
      3. With emerging 3D VLSI stacking technology, on-chip caches become less critical due to high off-chip bandwidth
      4. Power has become a limiting factor in future high-performance processors
      5. CMPs have arrived, and simple pipelines are the trend

   c. Do you agree that a 3.8 GHz Pentium 4 has better performance than a 1.5 GHz Itanium 2? Why?

   d. The AMD Opteron requires 9 index bits to access the first-level (L1) data cache before address translation. However, only 7 bits are available from the page offset that do not need any address translation. How does the AMD Opteron handle the extra 2 index bits? What issue is raised by this solution?

2. Assume a cache consists of 4 cache blocks. Construct a sequence of 12 requests to show that a direct-mapped cache performs better than a 2-way set-associative cache, and that the 2-way cache performs better than a fully associative cache. Each block address is expressed as a 4-bit binary number (with its decimal value) without the block offset. The cache contents are expressed by set (sets separated by ";") and by blocks within a set from MRU to LRU (separated by ","). All the caches are empty initially and use LRU replacement. For each access, you must indicate a hit or a miss with the miss type: CM (compulsory), CF (conflict), or CP (capacity).

   Block Address   Direct-Mapped        2-way             Fully-Assoc.
   0000 (0)        0; -; -; - (CM)      0,-; -,- (CM)     0,-,-,- (CM)
   0001 (1)
   0010 (2)
   0011 (3)
   1000 (8)
   0000 (0)
   0001 (1)
   0010 (2)
   0011 (3)
   1000 (8)

3. A write-invalidate cache coherence protocol is implemented in a snooping-bus SMP system. Fill in the blanks in the following table to simulate the cache coherence activities. Note that each processor executes instructions independently; however, they can access the bus only one at a time. Assume block X is located in both P1's and P2's caches in the shared state initially. (See Figure 4.23.)

   Step  Processor Activity                           Coherence Action in P1            Coherence Action in P2            Bus Activity
   1     P1 read X=0; P2 read X=0                     Cache hit; X in shared state      Cache hit; X in shared state      None
   2     P1 write X=1; P2 write X=2 (simultaneously)  Cache miss to X in shared state   ______                            ______
   3     P1 won the bus arbitration                   ______                            ______                            ______
   4     P2 re-issues write X=2; P1 read X            ______                            ______                            ______
   5     ______                                       ______                            ______                            ______

4. Compare a 3-state ESI protocol with a 4-state MESI protocol.

5. Simulate snooping-bus or directory-based coherence actions with a sequence of instructions from multiple processors.
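For checking answers to Question 2, the hit/miss bookkeeping can be automated. Below is a minimal sketch (not part of the assignment; the function name `simulate` and the simplified miss classification are my own) of an LRU simulator for the three organizations of the same 4-block cache: 4 sets x 1 way, 2 sets x 2 ways, and 1 set x 4 ways. It uses the usual textbook convention: a first-ever reference is compulsory (CM), any later miss in the fully associative cache is capacity (CP), and other later misses are conflict (CF).

```python
# Sketch of an LRU cache-content checker for Question 2 (assumed helper).
# A set is a list of block addresses ordered from MRU (front) to LRU (back).

def simulate(addresses, num_sets, ways):
    sets = [[] for _ in range(num_sets)]
    seen = set()                           # blocks referenced at least once
    trace = []
    for addr in addresses:
        s = sets[addr % num_sets]          # index by low-order address bits
        if addr in s:
            s.remove(addr)
            s.insert(0, addr)              # move to MRU position
            trace.append((addr, "hit"))
        else:
            if addr not in seen:
                miss = "CM"                # compulsory: first touch
            elif num_sets == 1:
                miss = "CP"                # capacity: fully assoc. still misses
            else:
                miss = "CF"                # conflict: limited associativity
            seen.add(addr)
            s.insert(0, addr)
            if len(s) > ways:
                s.pop()                    # evict the LRU block
            trace.append((addr, miss))
    return trace

# The ten addresses given in the table (two more may be appended to reach 12):
reqs = [0b0000, 0b0001, 0b0010, 0b0011, 0b1000,
        0b0000, 0b0001, 0b0010, 0b0011, 0b1000]
print(simulate(reqs, 4, 1))  # direct-mapped
print(simulate(reqs, 2, 2))  # 2-way set-associative
print(simulate(reqs, 1, 4))  # fully associative
```

On this particular sequence the direct-mapped cache gets the most hits and the fully associative cache the fewest, because blocks 0000 and 1000 collide only in set 0 under direct mapping, while LRU in the fully associative cache evicts each block just before it is re-referenced.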
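The bus serialization in Questions 3 and 5 can also be sketched in code. The following is a minimal write-invalidate (MSI-style) model under my own simplifying assumptions: a single block X, one cache per processor with states 'I', 'S', and 'M', and a bus modelled as serialized method calls (so "winning arbitration" is simply being called first). The class and method names (`Cache`, `Bus`, `read`, `write`) are illustrative, not from the assignment.

```python
# Minimal write-invalidate snooping sketch for Questions 3 and 5 (assumed helper).

class Cache:
    def __init__(self, name):
        self.name, self.state, self.value = name, 'I', None

class Bus:
    def __init__(self, caches, memory=0):
        self.caches, self.memory = caches, memory

    def read(self, c):
        if c.state == 'I':                  # read miss -> bus read
            for o in self.caches:
                if o is not c and o.state == 'M':
                    self.memory = o.value   # owner supplies data / writes back
                    o.state = 'S'
            c.value, c.state = self.memory, 'S'
            return (c.value, "miss; bus read")
        return (c.value, "hit")

    def write(self, c, v):
        if c.state == 'M':                  # already exclusive owner
            c.value = v
            return "hit; already modified"
        # From S or I the writer must invalidate all other copies first.
        for o in self.caches:
            if o is not c and o.state != 'I':
                if o.state == 'M':
                    self.memory = o.value   # write back the dirty copy
                o.state = 'I'
        action = ("upgrade (invalidate others)" if c.state == 'S'
                  else "miss; bus read-exclusive")
        c.state, c.value = 'M', v
        return action

p1, p2 = Cache("P1"), Cache("P2")
bus = Bus([p1, p2], memory=0)
bus.read(p1); bus.read(p2)    # initial condition: X shared in both caches
print(bus.write(p1, 1))       # P1 wins arbitration: invalidates P2's copy
print(bus.write(p2, 2))       # P2 re-issues its write: invalidates P1's copy
print(bus.read(p1))           # P1 read miss: P2 writes back, both end shared
```

The three calls at the end mirror steps 2-5 of the table: the two simultaneous writes are serialized by the bus, the loser must re-issue its write after being invalidated, and a subsequent read forces the modified copy back to the shared state.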