Computer Architecture Area Fall 2006 PhD Qualifier Exam
October 26th, 2006

This exam has nine (9) problems. You should submit answers to six (6) of these nine problems, and you should not submit answers for the remaining three. Since all problems are equally weighted (carry the same number of points), choose carefully which six problems you will answer. Write your answers/solutions clearly and legibly; illegible answers will be treated as wrong answers. You may attach additional sheets of paper if you wish, in which case you should clearly mark at the top of each page which problem the sheet corresponds to, and you should not use the same sheet for two problems (it is fine to use multiple sheets per problem). Although you can use additional paper, try to keep your answers and solutions short and to the point. Good luck!

Problem 1: Parallel Speedups

This question relates to the speedup observed for a parallel application as you scale up the number of processors. Ideally, one would like to see a speedup curve that increases linearly with the number of processors. However, this is seldom the case.

a) Explain the sources of slowdown you might see in application performance as you scale up the number of processors. For each source of deviation from an ideal speedup curve, suggest what one might do to mitigate the cause of the slowdown.

b) Explain why superlinear speedup is sometimes observed for some parallel applications.

c) Discuss the pros and cons of different metrics used to quantify application performance on parallel machines.

Problem 2: SIMD

A number of parallel machines have been built in the past to cater to the SIMD model of parallel computing.

a) Why was this a useful model to build parallel architectures for?

b) Why did such machines die out?

c) Are the principles behind such a model still relevant? If not, why not? If yes, explain how this relevance is making a resurgence in modern computer architecture.
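As a point of reference for Problem 2, the essence of the SIMD model can be sketched in a few lines: a single instruction stream drives many data lanes in lockstep, amortizing one fetch/decode over all of them. The sketch below is a minimal illustration only; the lane values and the choice of a multiply-accumulate operation are arbitrary assumptions, not any particular machine's ISA.

```python
# Minimal illustration of the SIMD execution model: one logical
# "instruction" (here, a fused multiply-add) is applied in lockstep
# across every lane of the data in a single step.

def simd_fma(a, b, c):
    """Apply one operation to all lanes at once (SIMD-style)."""
    assert len(a) == len(b) == len(c)
    return [ai * bi + ci for ai, bi, ci in zip(a, b, c)]

# A scalar (SISD) machine would fetch and decode one multiply-add
# per element; the SIMD form issues the operation once for all lanes.
lanes = simd_fma([1, 2, 3, 4], [10, 10, 10, 10], [5, 5, 5, 5])
```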
Problem 3: Memory Consistency

The early 90's saw a spate of research in memory consistency models for shared-memory multiprocessors. Many research papers of that genre would have you believe that relaxed consistency models offer a significant performance advantage over a sequentially consistent memory model. Indeed, research projects such as Stanford DASH implemented such a model in hardware. The intent of this question is to critique relaxed consistency models for building memory systems (be they hardware or software shared-memory systems). Your answer should address both the software and hardware issues in building such memory systems.

a) What is hard about implementing relaxed consistency models?

b) What is hard about writing system software on top of a memory system that uses a relaxed consistency model?

c) What is hard about writing application software on top of a memory system that uses a relaxed consistency model?

d) What are the sources of performance advantage with relaxed consistency models? Are such sources prevalent in a significant number of applications? If the research papers of the early 90's showed relaxed consistency models in a good light, how were they able to do that?

e) For any SMP you are familiar with that uses a relaxed consistency model, describe the model it implements. Give a state diagram that shows the protocol transitions. Qualitatively benchmark this model against release consistency and sequential consistency in terms of its "relaxedness".

Problem 4: False dependencies

False dependencies between scheduled instructions pose one of the biggest problems leading to a loss of ILP. There are multiple reasons for false dependencies; first enumerate the different cases, showing examples and how the loss of ILP might occur. Then develop at least two different architectural solutions that would resolve false dependencies using run-time information, and discuss the overheads involved and the scope of each solution.
Note that a detailed and concrete answer is expected, with examples that give a clear description of the exact solution involved. Verbose text that describes general and vague solutions is not expected.

Problem 5: Speculation

Speculation offers a powerful mechanism to increase ILP. Distinguish clearly between data and control speculation, showing the underlying mechanisms involved. What are the disadvantages of aggressive speculation for performance and power consumption? Show solutions that might be able to scale back such disadvantages while preserving the advantages. Note that a detailed and concrete answer is expected, with examples that give a clear description of the exact solution involved. Verbose text that describes general and vague solutions is not expected.

Problem 6: Write-through caches

There are two main write policies for caches: write-through and write-back. This question pertains to the advantages and disadvantages of write-through caches relative to write-back caches.

a) In recent years, most caches have used a write-back policy. Describe the advantages and disadvantages of write-through caches and explain why they have not been popular in recent years.

b) Recent dual-core chip multiprocessors provide L1 caches for each processor core, but a shared on-chip L2 cache. Some such designs have used write-through L1 caches and a write-back L2 cache. Explain why this particular combination of write policies is attractive, compared to an implementation where all on-chip caches use the same write policy (discuss both write-through and write-back).

Problem 7: Memory latency

A well-known trend in computer architecture is that processor speed (in terms of cycle time) is improving faster than DRAM memory performance (in terms of the time needed to fetch a requested cache block).

a) Explain why memory latency, in terms of processor cycles, is increasing and why this is a problem.

b) Overall round-trip memory latency consists of several separate latencies.
What are these component latencies, and what are the architectural solutions (if any) that reduce each of these components?

c) Assume that future memory latencies will be 2,000 processor cycles, that one in five executed instructions is a load, and that even large on-chip caches provide global hit rates of only up to 95%. What kind of processor core do we need to still provide good performance? Specifically, discuss which of the following would have a major impact: in-order vs. out-of-order execution, reorder buffer size, branch prediction accuracy, issue width, a split-transaction memory bus interface, and the maximum number of memory requests that can concurrently be in progress.

d) Assuming that plentiful ILP opportunities exist in the program code, with the latency and cache hit rates from part c), can an IPC (instructions per cycle) of 4 be achieved? If not, why not? If so, describe the processor that can achieve such an IPC. Your explanation or processor description should be quantitative, e.g. "Even with infinite resources we only achieve an IPC of X because ..." or "We need a 16-wide processor with a 100-entry ROB, because ...".

Problem 8: Scheduling Logic

The dynamic instruction scheduling logic is one of the central components of modern out-of-order processors. There are a variety of different implementations; in particular, there are tag-broadcast/CAM-based designs as well as dependency-matrix-based designs.

a) Briefly describe each of these two techniques for implementing schedulers, and then compare and contrast their merits/weaknesses with respect to performance (both clock speed and IPC rates), power, area, and on-chip hot spots.
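As a concrete reference for the two scheduler organizations in part a), the toy model below sketches matrix-based wakeup: bit (i, j) is set when RS entry i still waits on the instruction in entry j, a completing instruction clears its column, and an entry is ready when its row is all zeros. This is an illustrative sketch only (class and entry sizes are invented, not any real design); a CAM-based scheduler would instead broadcast the completing destination tag and compare it against every entry's source tags.

```python
# Toy dependency-matrix scheduler wakeup (illustrative only).
# matrix[i][j] == 1 means RS entry i still waits on RS entry j.

class MatrixScheduler:
    def __init__(self, n_entries):
        self.n = n_entries
        self.matrix = [[0] * n_entries for _ in range(n_entries)]

    def dispatch(self, entry, producers):
        # Set one matrix bit per unresolved producer of this entry.
        for p in producers:
            self.matrix[entry][p] = 1

    def complete(self, entry):
        # Clearing a column is the matrix analogue of a CAM tag broadcast.
        for row in self.matrix:
            row[entry] = 0

    def ready(self):
        # An entry is ready once its entire row has been cleared.
        return [i for i in range(self.n) if not any(self.matrix[i])]

sched = MatrixScheduler(4)
sched.dispatch(2, producers=[0, 1])  # entry 2 waits on entries 0 and 1
sched.complete(0)
sched.complete(1)                    # entry 2's row is now clear
```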
b) The newer Pentium processors (Pentium-M, Core, Core 2) support "uop fusion", where certain pairs of uops can be combined into a single "super-uop" or fused uop. A fused uop occupies only a single RS entry in the instruction scheduler, but the two uops together may have more than two input register dependencies. To support this, each RS entry can support up to three input dependencies. Explain what modifications to both the CAM-based and matrix-based schedulers would be needed to support three input dependencies. Discuss the ramifications of these changes on the schedulers' latency, power, and area.

Problem 9: Clock frequency vs. ILP

For several years, Intel microprocessor designs continued to increase performance by aggressively increasing clock frequencies. However, in the recent past, Intel has changed its strategy. One design decision was to pursue higher-ILP microarchitectures rather than higher-frequency approaches. However, several academic studies have argued that scaling traditional out-of-order structures to larger sizes (more physical registers, larger schedulers, bigger LSQs, wider machines, etc.) is not practical due to super-linear increases in circuit latencies and area.

a) Given that program runtime is equal to (# Insts) * (CPI) * (cycle time), increases in clock frequency should result in a directly proportional increase in performance (e.g., a 10% frequency boost should give 10% more performance). However, in practice the realizable performance increase is less than proportional (e.g., 10% frequency -> <10% performance). Why?

b) Explain why you think an aggressive ILP-oriented microarchitecture (such as the Intel Core processors) can still result in lower power and/or better performance/Watt. In particular, contrast the performance and power attributes of the high-frequency and high-ILP approaches.
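The gap described in part a) can be made concrete with an Amdahl-style numeric sketch: if memory stall time is fixed in nanoseconds, only the core-bound fraction of runtime shrinks when the clock speeds up. The 70/30 split below is an arbitrary assumption chosen purely for illustration, not a measured workload characteristic.

```python
# Why a 10% frequency boost yields <10% speedup (illustrative numbers).
# Core-bound time scales with cycle time; memory stall time (in ns) does not.

def speedup(core_frac, freq_gain):
    """Amdahl-style speedup when only the core-bound fraction of
    runtime benefits from a clock-frequency increase of freq_gain."""
    mem_frac = 1.0 - core_frac
    new_time = core_frac / freq_gain + mem_frac  # normalized to old runtime
    return 1.0 / new_time

# Assumed 70% core-bound program, 10% higher clock: speedup is only ~6.8%.
s = speedup(core_frac=0.7, freq_gain=1.10)
```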
c) Did Intel make the right decision to pursue higher-ILP designs? Why or why not? (Your reasons may include design complexity, design time, testing/verification/validation, market considerations, etc.)