Computer Architecture Area
Fall 2009 PhD Qualifier Exam
October 20th 2008

This exam has nine (9) problems. You should submit answers to six (6) of these nine problems and should not submit answers for the remaining three. Since all problems are equally weighted (carry the same number of points), choose carefully which six problems you will answer. Write your answers/solutions clearly and legibly; illegible answers will be treated as wrong answers. Clearly mark at the top of each page which problem the page corresponds to, and do not use the same page for two problems (it is fine to use multiple pages per problem if you need to). Although there is no restriction on how long your answers should be, try to keep your answers and solutions short and to the point. Good luck!

1. Modern processors typically support a memory page size of 4KB per page, and sometimes 1MB or 2MB per page. Recently, AMD's 64-bit "Barcelona" processor introduced support for a new page size of 1GB.

a. Would supporting 1GB pages have made sense for older 32-bit processors? Why or why not? [One paragraph is sufficient.]

b. What changes in the hardware (if any) are required to support 1GB pages as opposed to conventional 4KB pages? Beyond the main processor pipeline, also consider any other hardware on the chip that helps with virtual-to-physical address translation. [Write as much as needed.]

c. Consider a hypothetical single-core processor that uses SMT to provide two logical cores. Further assume that virtualization is used to run two separate Virtual Machines (VMs), one on each logical core, and that one guest OS uses 4KB pages while the other uses 1GB pages. What changes (if any) are required in the TLB and/or caches to simultaneously support two different page sizes? Since the two SMT threads must share the same physical TLB, discuss any potential resource contention issues in this scenario and whether they are likely to result in significant performance and/or fairness problems. [Write as much as needed.]

2. SIMD instruction set extensions are now very popular in many instruction set architectures. For example, instead of performing two separate 64-bit computations using two separate instructions, a single SIMD instruction can perform both computations in a "vector" style by operating directly on the two halves of a 128-bit SIMD register. Additional load and store instructions for reading and writing 128 bits to/from the SIMD registers are also typical. While SIMD instruction set extensions have been introduced primarily for performance reasons, they may also have an impact on a processor's power consumption.

a. How can the hardware support for SIMD instructions cause the processor to consume less energy? Consider all possible effects, from the high-level memory system down to low-level circuit issues.

b. Same as above, but how can SIMD cause the processor to consume more energy?

c. Beyond simply executing more than one equivalent operation at the same time (i.e., SIMD or vector execution), how else can using SIMD instructions improve performance?

d. Since SIMD instructions may impact both execution time and energy consumption, how might heavy usage of SIMD instructions affect thermal hotspots in the processor? Are there any functional units that would likely get hotter (and why)? Are there any that would likely get cooler (and why)?
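As a concrete point of reference for the kind of SIMD extension described in Problem 2, the short C sketch below uses the x86 SSE2 intrinsics (_mm_loadu_si128, _mm_add_epi64, _mm_storeu_si128) to perform two 64-bit additions with a single 128-bit operation. SSE2 is just one example ISA chosen for illustration; the question is not specific to it.

    #include <emmintrin.h>   /* SSE2 intrinsics: __m128i, _mm_add_epi64, ... */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        int64_t a[2] = {1, 2}, b[2] = {10, 20}, c[2];

        /* Scalar equivalent: two independent 64-bit additions,
           c[0] = a[0] + b[0];  c[1] = a[1] + b[1];                        */

        /* SIMD version: each 128-bit load/store moves both elements at
           once, and one packed add replaces the two scalar adds.          */
        __m128i va = _mm_loadu_si128((const __m128i *)a);
        __m128i vb = _mm_loadu_si128((const __m128i *)b);
        __m128i vc = _mm_add_epi64(va, vb);          /* {a[0]+b[0], a[1]+b[1]} */
        _mm_storeu_si128((__m128i *)c, vc);

        printf("%lld %lld\n", (long long)c[0], (long long)c[1]);
        return 0;
    }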
3. This question is about power and thermal issues.

a. Multiple power metrics are used in computer architecture research, and many people use the word "power" with several different meanings even though it has only one scientific definition. Discuss the following terms and why we need to consider each of these distinctions:
(i) power vs. energy
(ii) dynamic power vs. static power
(iii) peak power vs. average power

b. Discuss architectural mechanisms for dynamic thermal management. Cover at least three mechanisms and address the complexity, cost, and performance impact of each.

4. This question is about branch prediction and recovery from branch mispredictions.

a. Branch prediction is critical to improving ILP, but is a branch predictor useful in every architecture design? Briefly discuss whether a branch predictor is important in each of the following:
(i) 1-wide non-pipelined processor
(ii) 4-wide non-pipelined processor
(iii) 4-wide 10-stage in-order processor
(iv) 4-wide 10-stage out-of-order processor
(v) 4-wide 10-stage multithreaded (MT) processor (3 threads can run together)
(vi) 4-wide 10-stage MT processor (100 threads can run together)

b. A reorder buffer (ROB) is widely used to support out-of-order execution. One of its limitations is a long branch recovery cost (the time from when a branch misprediction is detected to when execution is redirected to the correct path). The processor can start fetching from the correct path almost immediately (once it receives the correct PC address), but it cannot process the correctly fetched instructions until some part of the pipeline is correctly updated. Which part could it be? How do modern high-performance processors solve this problem?

5. This question is about implicitly parallel programming models.

a. What are implicitly parallel programming models?

b. What are the pros and cons of implicitly parallel programming models, in comparison with sequential and explicitly parallel programming models?

c. List one thing that modern architectures could do to better support implicitly parallel programming.

6. Speculative lock elision has been proposed as a way of speeding up lock synchronization in multi-core processors.

a. What is the additional hardware needed to support speculative lock elision?

b. How would you determine the AVF (architectural vulnerability factor) for this hardware?

c. When a many-core processor executes a many-threaded application, explain how speculative lock elision may result in:
(i) a major performance improvement
(ii) a major performance degradation
(iii) no significant change in performance
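As background for Problem 6, the sketch below shows the general shape of an elided critical section. It uses Intel's TSX/RTM intrinsics (_xbegin, _xend, _xabort), which implement a closely related hardware mechanism; the original speculative lock elision proposal is transparent to software and needs no such instructions, so this is only an illustration of the elide-then-fall-back idea, not the hardware the question asks about.

    #include <immintrin.h>   /* _xbegin, _xend, _xabort, _XBEGIN_STARTED (compile with -mrtm) */
    #include <stdatomic.h>

    static atomic_int lock = 0;        /* 0 = free, 1 = held */
    static long shared_counter = 0;

    static void acquire(void) { while (atomic_exchange(&lock, 1)) { /* spin */ } }
    static void release(void) { atomic_store(&lock, 0); }

    /* Critical section with the lock elided: try to execute it as a hardware
       transaction without acquiring the lock, and fall back to the real lock
       if the transaction aborts (e.g., because of a data conflict).          */
    void increment(void)
    {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            if (atomic_load(&lock) != 0)   /* lock is only read, not written;  */
                _xabort(0xff);             /* abort if someone really holds it */
            shared_counter++;
            _xend();                       /* commit: the lock was never taken */
            return;
        }
        /* Fallback path: ordinary lock acquisition. */
        acquire();
        shared_counter++;
        release();
    }

When no transaction aborts, the lock variable is never written on the elided path, so independent critical sections can proceed concurrently without bouncing the lock's cache line between cores.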
7. Many modern programming languages (e.g., Java) use garbage collection to free the programmer from the burden of having to free memory. One implementation approach for garbage collection maintains a "reference count" for each allocated memory chunk (e.g., a chunk of memory returned by malloc). The reference count for a chunk is the number of pointers that point to that chunk; the chunk can be garbage-collected when its reference count becomes zero.

To improve the performance of garbage collection, we are planning to add hardware support for reference counting. The planned support works as follows. The hardware maintains a reference count for each memory location using a scheme similar to Mondrian Memory Protection (we can use the same trie structure, but keep reference counts instead of Mondrian's protection bits), and each 64-bit value in registers or memory is treated as a potential pointer. When a new value is written to a register or to memory, we decrement the reference count for the location the old value "points" to and increment the reference count for the location the new value "points" to. (A small software sketch of this write-barrier rule is given at the end of the exam.) This scheme does not work very well:

a. Many values in registers and memory are not really pointers to allocated chunks of memory. How would you change the hardware support to take advantage of that?

b. Instead of tracking reference counts for each chunk, this scheme tracks a reference count for each location. The actual reference count for a chunk is the sum of the reference counts of its locations, so garbage collection is still slow because this sum must be computed for each chunk. How would you change the hardware support to maintain chunk reference counts efficiently?

c. Some of the memory may be swapped out to disk. How does that interfere with our hardware support for reference counting?

8. A particular multiprocessor has four processing nodes, P0-P3, and a shared bus.

a. The system implements the MSI snooping protocol. Consider what happens when a particular cache line is first accessed as a write miss by P0, then the same line is accessed as a write miss by P1, then P2, and then P3. For each cache in P0 through P3, show the status of this cache line as it evolves over time, including the appropriate M, S, or I state.

b. The system now (for this part of the problem only) implements memory-based directory coherence. If the same state information (MSI) is maintained per cache line, and the memory is X MB in size, what is the memory overhead (i.e., additional storage) required to maintain the directory?

c. The system now (for this part of the problem only) implements cache-based directory coherence, in a linked-list fashion. If the same state information (MSI) is maintained per cache line, and the memory is X MB in size, what is the memory overhead (i.e., additional storage) required to maintain the directory, and what is the overhead per cache line in each cache?

9. Consider the following code sequence, where this ISA is of the format "op dest, src, src", except where noted:

    LW  R1,R2,0    ; R1 <= Memory[R2+0]
    ADD R2,R2,R1
    MUL R3,R2,R5
    ADD R2,R2,#4   ; R2 <= R2+4
    SW  R3,R2,0    ; Memory[R2+0] <= R3

This instruction sequence is executed 100 times on a VLIW processor that has the following latencies:

    LW:  2 cycles, pipelined
    ADD: 1 cycle
    MUL: 3 cycles, pipelined
    SW:  1 cycle

The VLIW has:
• one memory unit that can initiate either a LW or a SW (but not both) every cycle, and
• one arithmetic unit that can initiate either an ADD or a MUL (but not both) every cycle.

In answering the following, show your work; how you get the answer is more important than the numerical answer itself.

a. What is the number of cycles it takes to execute this instruction sequence as shown above?

b. What is the minimum number of cycles it takes to execute this instruction sequence? Show at least one schedule that achieves this minimum number of cycles.
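For reference, the instruction sequence in Problem 9 corresponds roughly to the following C loop. The 32-bit word size and the function signature are assumptions made only so the fragment compiles; they are not specified in the question.

    #include <stdint.h>

    /* A C rendering of the Problem 9 loop body, executed 100 times.
       R2 is used both as an address and as an integer, exactly as in the
       assembly; the 32-bit word size is an assumption.                   */
    void kernel(intptr_t r2, int32_t r5)
    {
        for (int i = 0; i < 100; i++) {
            int32_t r1 = *(int32_t *)r2;        /* LW  R1,R2,0  */
            r2 = r2 + r1;                       /* ADD R2,R2,R1 */
            int32_t r3 = (int32_t)r2 * r5;      /* MUL R3,R2,R5 */
            r2 = r2 + 4;                        /* ADD R2,R2,#4 */
            *(int32_t *)r2 = r3;                /* SW  R3,R2,0  */
        }
    }

Note the dependence chain LW -> ADD -> MUL -> SW within an iteration and the loop-carried dependence through R2; these chains, together with the one-memory-slot and one-arithmetic-slot restriction, are what constrain the schedules asked for in parts (a) and (b).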
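Returning to Problem 7, the sketch below is a software model of the write-barrier rule described there (decrement the count of the location the old value points to, increment the count of the location the new value points to). The flat refcount[] array and the refcount_of() lookup stand in for the Mondrian-style trie; they, and the heap-region parameters, are assumptions made only for illustration.

    #include <stdint.h>
    #include <stddef.h>

    #define HEAP_WORDS (1u << 20)              /* size of the tracked region (assumption) */

    static uint32_t refcount[HEAP_WORDS];      /* one counter per 64-bit location */
    static uint64_t heap_base;                 /* start address of the tracked region */

    /* Map a 64-bit value, treated as a potential pointer, to the counter of
       the location it "points" to; NULL if it points outside the heap.      */
    static uint32_t *refcount_of(uint64_t value)
    {
        uint64_t idx = (value - heap_base) / 8;
        return (idx < HEAP_WORDS) ? &refcount[idx] : NULL;
    }

    /* Conceptually invoked by the hardware whenever a register or a memory
       location is overwritten.                                              */
    void on_write(uint64_t old_value, uint64_t new_value)
    {
        uint32_t *rc;
        if ((rc = refcount_of(old_value)) != NULL && *rc > 0)
            --*rc;                             /* the old value no longer references it */
        if ((rc = refcount_of(new_value)) != NULL)
            ++*rc;                             /* the new value now references it */
    }

The weaknesses that parts (a)-(c) of Problem 7 ask about (non-pointer values being counted, per-location rather than per-chunk counts, and interaction with swapping) are all visible in this model.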