Computer Architecture Area
Fall 2010 PhD Qualifying Exam
October 22nd 2010

This exam has six (6) equally-weighted problems. You should submit your answers to all six problems. Write your answers/solutions clearly and legibly. Illegible answers will be treated as wrong answers. You should clearly mark at the top of each page of your answers which problem the page corresponds to. You should not use the same page for two problems, but it is fine to use multiple pages per problem if you need to. Although there is no restriction on how long your answers may be, try to keep your answers and solutions short and to the point. Good luck!

Written Qualifying Exam for Computer Architecture Page 1 of 7

1. A four-threaded application executes on a quad-core CMP that has private (per-core) L1 caches connected by an on-chip bus, and a large shared L2 cache. The coherence protocol between L1 caches is MESI. The machine is running only this application and (after a brief initial warm-up period) all accesses are L2 cache hits. The only memory accesses in this application are:
• Each thread accesses its own stack-based data. For each location on the stack, the access pattern is to write to it, read it a short while afterwards, and then not use that particular location for a while. By the time the same stack location is accessed (written and then read) again, it has been replaced (evicted) from the L1 cache.
• All threads use a shared array, where each element is written by only one thread, then read by all four threads, then written by its writer thread again soon afterwards. The shared array is small and all its elements are accessed very often.

A) For this particular application, explain what the benefit is from having the E state (as opposed to using the MSI protocol) in terms of on-chip bus utilization, reads from the L2 cache, and writes to the L2 cache.
B) For this particular application, explain what the benefit is from having a write-back L1 cache and using a write-invalidate MESI coherence protocol (as opposed to using a write-through L1 cache and a write-update protocol with only S and I states).

2. A single-core processor chip has only one on-chip data cache that is 4-way set-associative, with 1024 sets and 128-byte blocks. Metadata (tags, dirty bit, valid bit) for all four lines in the set is checked in one cycle, which consumes 20 pJ (picojoules). However, no data is accessed in this first cycle. Instead, if the first cycle identifies a cache hit, in the second cycle only one of the data blocks (the one that has a hit) in the set is accessed (read or written), consuming another 20 pJ.

No parity or ECC is used for the data in this cache, so the engineer responsible for designing this cache decides to add a “reliable mode” for this cache. When in this “reliable mode”, the cache uses one of the data blocks in each set as the bit-wise parity for the other three (similar to a RAID-4 configuration for disks). To minimize the cost and energy consumption increase due to adding this reliable mode, in any given cycle the cache can either access the metadata for a whole set or the data for one block in the set: it cannot access both metadata and data in one cycle, nor can it access more than one data block in a cycle.

The processor runs at 1 GHz and is very simple: it stalls until the cache returns the data (e.g. the stall is two cycles per load/store in the non-reliable mode). The machine runs a program that executes a total of 1 billion loads and 300 million stores. All these accesses are cache hits (in either mode), and the entire program executes in 10 seconds when the cache operates in the non-reliable mode.

A) What is the total energy consumed by the cache when running this entire program, if the cache is operating in the non-reliable mode?
B) What is the total energy consumed by the cache to run this entire program, if the cache is operating in the reliable mode?
C) How should the cache’s reliable mode be implemented to minimize execution time in the reliable mode? Describe what should happen for a load access and for a store access to the cache. Clearly specify how many cycles a load hit takes and how many cycles a store hit takes with your implementation.

3. A CMP has 16 cores in a chip. The total power budget for the chip is 120 W. When all cores are active, the frequency of all cores is 1.2 GHz, their Vdd is 2.7 V, and 60% of the power budget is allocated to dynamic power and the remaining 40% to static power. Recall that dynamic power is

P = (C * Vdd^2 * f) / 2

To improve power efficiency, the processor uses a power gating mechanism. When a program uses only two cores, the system increases the clock frequency of the two active cores to improve performance and the remaining 14 cores are power gated. Note that the system changes Vdd to control frequency, and frequency increases linearly as we increase Vdd. The total area of the chip is 2.5 cm by 2.5 cm, and the cooling capacity is 50 W/cm^2. Assume that two cores use only 1/8th of the total area, and for this area the total power must still remain at or below the cooling capacity. Also assume that all active cores share the same frequency and Vdd.

A) What will be the maximum frequency of the two active cores?
B) Suppose that, instead of power gating, the system uses a clock gating mechanism, so that the remaining 14 cores are clock gated. What will be the maximum frequency of the two active cores?

4. This problem is about the HPS architecture from the Patt et al. paper in the reading list.

A) Tomasulo’s algorithm enables out-of-order scheduling. What, then, is the impact of the HPS architecture?
B) Consider those features or benefits that HPS provides but the original Tomasulo’s algorithm does not. Are there any other mechanisms that provide these features or benefits?
C) The HPS architecture provides a branch misprediction recovery mechanism that is faster than the classical ROB-based mechanism used in out-of-order processors. Explain why. Modern out-of-order processors typically employ some hardware structures to overcome this problem. Explain how.

5. This problem is about the interaction between the compiler’s register allocator and the processor’s microarchitecture.

A) When registers are allocated in a compiler, sometimes there are not enough physical registers to hold all values generated by the program. Therefore, registers must be re-used. What are the consequences of this for instruction-level parallelism? Propose a mechanism in a superscalar processor to get around this limitation. Describe the hardware in sufficient detail so that another architect would know how to build the hardware.
B) When there are more live variables than there are registers available to hold these values, the register allocator in a compiler “spills” registers to the stack. What are the consequences of this for the instruction-level parallelism of the program? Propose a mechanism in a superscalar processor to get around this limitation. Note that this is a different phenomenon from the one discussed in part (A): that was about each value not getting its own register, whereas this is about each variable not getting its own register. Describe the hardware in sufficient detail so that another architect would know how to build the hardware.

Your score on this problem will depend on the completeness of your answers.

6.
“Wakeup and Select” is the name for the portion of a superscalar processor that causes dependent instructions to execute after the instructions they depend on have completed. Consider an ISA with two source operands and one destination operand per instruction. Describe a wakeup and select buffer that holds K waiting instructions. Each of the K entries contains the two source operands and an “instruction id” for its instruction. The buffer receives the register number of a completing instruction and outputs the instruction id of the next instruction to execute.

A) Design the buffer. Show a detailed diagram with sufficient information such that another architect would know how to build the hardware.
B) What happens when K is increased? Are there any adverse effects on your design? Explain.

Your score on this problem will depend on the completeness of your answers.
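
The buffer interface described in Problem 6 can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not a reference solution or a hardware design: the names `Entry`, `WakeupSelectBuffer`, `insert`, and `complete` are hypothetical, registers are assumed to be renamed so that a register number identifies exactly one pending value, and the select policy (oldest ready instruction first) is an assumption the problem statement does not specify.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    """One waiting instruction: its id, two source registers, and ready bits."""
    instr_id: int
    src1: int
    src2: int
    ready1: bool = False  # assumption: sources are not ready unless flagged at insert
    ready2: bool = False


class WakeupSelectBuffer:
    """Behavioral model of a K-entry wakeup-and-select buffer (hypothetical API)."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.entries: List[Entry] = []  # kept in insertion (age) order

    def insert(self, instr_id: int, src1: int, src2: int,
               ready1: bool = False, ready2: bool = False) -> bool:
        """Add a waiting instruction; returns False if all K entries are in use."""
        if len(self.entries) >= self.k:
            return False
        self.entries.append(Entry(instr_id, src1, src2, ready1, ready2))
        return True

    def complete(self, dest_reg: int) -> Optional[int]:
        """Broadcast a completing register number, then pick one instruction."""
        # Wakeup: compare the broadcast register against both sources of every entry.
        for e in self.entries:
            if e.src1 == dest_reg:
                e.ready1 = True
            if e.src2 == dest_reg:
                e.ready2 = True
        # Select: assumed policy is "oldest entry with both sources ready".
        for i, e in enumerate(self.entries):
            if e.ready1 and e.ready2:
                del self.entries[i]
                return e.instr_id
        return None  # nothing is ready to execute


buf = WakeupSelectBuffer(4)
buf.insert(10, 1, 2)               # waits on registers 1 and 2
buf.insert(11, 1, 1)               # waits on register 1 only
buf.insert(12, 3, 1, ready1=True)  # register 3 already available
first = buf.complete(2)   # None: instruction 10 still waits on register 1
second = buf.complete(1)  # 10: oldest entry with both sources ready
third = buf.complete(7)   # 11: no new wakeup, but 11 was already ready
```

In this model the wakeup loop performs two comparisons per entry for each broadcast and the select loop scans up to K ready bits, which is one way to start thinking about how cost grows with K in part B. Real hardware would perform these comparisons in parallel rather than in a sequential loop.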