Computer Architecture Area Fall 2006 PhD Qualifier Exam
October 26th 2006
This exam has nine (9) problems. You should submit your answers to six (6) of these nine
problems. You should not submit answers for the remaining three problems. Since all problems
are equally weighted (carry the same number of points), carefully choose which six problems
you will answer.
Write your answers/solutions clearly and legibly. Illegible answers will be treated as wrong
answers. You can attach additional sheets of paper if you wish, in which case you should clearly
mark on top of each page which problem the sheet corresponds to, and you should not use the
same sheet for two problems (it is fine to use multiple sheets per problem). Although you can use
additional paper, try to keep your answers and solutions short and to the point.
Good luck!
1 Problem 1: Parallel Speedups
This question relates to the speedup observed for a parallel application as you scale up the
number of processors. Ideally, one would like to see a speedup curve that increases linearly with
the number of processors. However, this is seldom the case.
a) Explain the sources of slowdown you might see in application performance as you scale up the
number of processors. For each source of slowdown from an ideal speedup curve, suggest what
one might do to mitigate the cause of the slowdown.
b) Explain why sometimes superlinear speedup is observed for some parallel applications.
c) Discuss the pros and cons of different metrics used to quantify application performance on
parallel machines.
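For reference, the shortfall from linear speedup that this question alludes to is often framed with Amdahl's law, which bounds speedup by the serial fraction of the work. A minimal sketch (the 10% serial fraction is an arbitrary illustrative value):

```python
def amdahl_speedup(serial_frac, n_procs):
    """Amdahl's law: only the parallel fraction of the work scales with processor count."""
    return 1.0 / (serial_frac + (1.0 - serial_frac) / n_procs)

# With a 10% serial fraction, speedup saturates well below linear (capped at 10x):
for p in (2, 8, 64, 1024):
    s = amdahl_speedup(0.10, p)
    eff = s / p  # parallel efficiency, another common metric
    print(f"{p:5d} processors: speedup {s:6.2f}, efficiency {eff:6.1%}")
```

Note the efficiency column: speedup and efficiency are two of the competing metrics part c) asks about.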
2 Problem 2: SIMD
There have been a number of parallel machines built in the past to cater to the SIMD model of
parallel computing.
a) Why was this a useful model to build parallel architectures for?
b) Why did such machines die out?
c) Are the principles behind such a model still relevant? If not, why not? If yes, explain how
these principles are making a resurgence in modern computer architecture.
3 Problem 3: Memory Consistency
The early '90s saw a spate of research in memory consistency models for shared-memory
multiprocessors. Many research papers of that genre would have you believe that relaxed
consistency models offer a significant performance advantage over a sequentially consistent
memory model. Indeed research projects such as Stanford DASH implemented such a model in
hardware.
The intent of this question is to critique relaxed consistency models for building memory systems
(be they hardware or software shared memory systems). Your answer should address both the
software and hardware issues in building such memory systems.
a) What is hard about implementing relaxed consistency models?
b) What is hard about writing system software on top of a memory system that uses a relaxed
consistency model?
c) What is hard about writing application software on top of a memory system that uses a relaxed
consistency model?
d) What are the sources of performance advantage with relaxed consistency models? Are such
sources prevalent in a significant number of applications? If the research papers of the early 90's
showed relaxed consistency models in a good light, how were they able to do that?
e) For any SMP you are familiar with that uses a relaxed consistency model, describe the model
implemented by it. Give a state diagram that shows the protocol transitions. Qualitatively
benchmark this model against release consistency and sequential consistency in terms of its
"relaxedness".
4 Problem 4: False dependencies
False dependencies between scheduled instructions pose one of the biggest problems leading to
a loss of ILP. There are multiple causes of false dependencies; first enumerate the different
cases, showing examples and how the loss of ILP can occur. Then develop at least two different
architectural solutions that resolve false dependencies using run-time information, and
discuss the overheads involved and the scope of each solution.
Note that a detailed and concrete answer is expected with examples which give clear description
of the exact solution involved. Verbose text that describes general and vague solutions is not
expected.
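As a reference for the kind of false (name) dependency in question: in the three-instruction sequence below, the second write to r1 creates a WAW hazard, and the intervening read of r1 a WAR hazard, even though no value flows between the conflicting instructions. A minimal register-renaming sketch, assuming an unbounded physical register file (register names and the instruction sequence are illustrative):

```python
def rename(instructions, n_arch_regs=8):
    """Rewrite (dst, src1, src2) triples of architectural registers so that every
    write gets a fresh physical register, eliminating WAR/WAW (name) hazards."""
    rat = {f"r{i}": f"p{i}" for i in range(n_arch_regs)}  # register alias table
    next_phys = n_arch_regs
    out = []
    for dst, *srcs in instructions:
        srcs = [rat[s] for s in srcs]       # reads use the current mapping
        rat[dst] = f"p{next_phys}"          # each write is given a new name
        next_phys += 1
        out.append((rat[dst], *srcs))
    return out

# r1 = r2 + r3 ; r4 = r1 + r5 ; r1 = r6 + r7   (WAW on r1, WAR with the read of r1)
prog = [("r1", "r2", "r3"), ("r4", "r1", "r5"), ("r1", "r6", "r7")]
renamed = rename(prog)
```

After renaming, the two writes to r1 target distinct physical registers, so the third instruction can issue in parallel with the first two.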
5 Problem 5: Speculation
Speculation offers a powerful mechanism to increase ILP. Distinguish clearly between data and
control speculation showing the underlying mechanisms involved. What are the disadvantages of
aggressive speculation on performance and power consumption? Show solutions that might be
able to scale back on such disadvantages but that also preserve the advantages.
Note that a detailed and concrete answer is expected with examples which give clear description
of the exact solution involved. Verbose text that describes general and vague solutions is not
expected.
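As a reference point for the control-speculation mechanisms this question mentions, a minimal 2-bit saturating-counter branch predictor (the outcome stream is an arbitrary illustration):

```python
def predict_stream(outcomes):
    """2-bit saturating counter: states 0-1 predict not-taken, 2-3 predict taken."""
    state, mispredicts = 0, 0            # start in strongly not-taken
    for taken in outcomes:
        prediction = state >= 2
        if prediction != taken:
            mispredicts += 1             # on real hardware: squash wrong-path work
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return mispredicts

# A mostly-taken branch: two mispredicts while warming up, one on the single blip.
stream = [True, True, True, True, False, True, True]
```

Every misprediction discards speculatively executed wrong-path instructions, which is exactly the wasted work (and wasted energy) that aggressive speculation trades for ILP.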
6 Problem 6: Write-through caches
There are two main write policies for caches: write-through and write-back. This question
pertains to the advantages and disadvantages of write-through caches relative to write-back
caches.
a) In recent years, most caches have used a write-back policy. Describe advantages and
disadvantages of write-through caches and explain why they have not been popular in recent
years.
b) Recent dual-core chip-multiprocessors provide L1 caches for each processor core, but a shared
on-chip L2 cache. Some such designs have used write-through L1 caches and a write-back L2
cache. Explain why this particular combination of write policies is attractive, compared to an
implementation where all on-chip caches use the same write policy (discuss both write-through
and write-back).
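A back-of-envelope way to see the bandwidth trade-off: count the writes that reach the next level of the hierarchy for a stream of stores under each policy. A minimal single-block-cache sketch (the block labels and store stream are arbitrary):

```python
def memory_writes(stores, policy):
    """Count writes reaching memory for a 1-block cache under each write policy."""
    cached, dirty, writes = None, False, 0
    for block in stores:
        if block != cached:              # miss: the current block is evicted
            if policy == "write-back" and dirty:
                writes += 1              # a dirty eviction writes the block back
            cached, dirty = block, False
        if policy == "write-through":
            writes += 1                  # every store is propagated to memory
        else:
            dirty = True                 # write-back: just mark the line dirty
    return writes

stores = ["A", "A", "A", "A", "B", "B", "A"]  # repeated stores to a few blocks
```

With store locality, write-back coalesces many stores into one eventual writeback, while write-through pays for every store; the flip side, relevant to part b), is that write-through keeps the next level continuously up to date.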
7 Problem 7: Memory latency
A well-known trend in computer architecture is that processor speed (in terms of cycle time) is
increasing faster than DRAM memory performance (in terms of time needed to fetch a requested
cache block).
a) Explain why memory latency, in terms of processor cycles, is increasing and why this is a
problem.
b) Overall round-trip memory latency consists of several separate latencies. What are these
component latencies, and what are the architectural solutions (if any) that reduce each of these
components.
c) Assuming that future memory latencies will be 2,000 processor cycles, that one in five
executed instructions is a load, and that even large on-chip caches provide global hit rates of
only up to 95%, what kind of processor core do we need to still provide good performance? Specifically,
discuss which of the following would have a major impact: in- or out-of-order execution, reorder
buffer size, branch prediction accuracy, issue width, split-transaction memory bus interface and
the maximum number of memory requests that can concurrently be in progress.
d) Assuming that plentiful ILP opportunities exist in the program code, and latency and cache hit
rates as in part c), can an IPC (instructions per cycle) of 4 be achieved? If not, why not? If yes,
describe the processor that can achieve such an IPC. Your explanation or processor description
should be quantitative, e.g., "Even with infinite resources we only achieve an IPC of X because
...", or "We need a 16-wide processor with a 100-entry ROB, because ...".
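To gauge the scale implied by the numbers in parts c) and d), a back-of-envelope calculation, assuming a base CPI of 1 and, for the first figure, fully blocking (non-overlapped) misses:

```python
MISS_LATENCY = 2000      # processor cycles, from part c)
LOAD_FRAC = 1 / 5        # one in five executed instructions is a load
HIT_RATE = 0.95          # global cache hit rate

# With fully blocking misses, stall cycles simply add to the base CPI of 1.
stall_per_inst = LOAD_FRAC * (1 - HIT_RATE) * MISS_LATENCY   # cycles/instruction
blocking_cpi = 1.0 + stall_per_inst
blocking_ipc = 1.0 / blocking_cpi

# To instead sustain IPC = 4, those stall cycles must be hidden by overlapping
# misses; by Little's law, the average number of misses in flight would be:
misses_in_flight = 4 * LOAD_FRAC * (1 - HIT_RATE) * MISS_LATENCY
```

The contrast between the blocking IPC and the concurrency needed for IPC = 4 is the crux of which core features in part c) matter most.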
8 Problem 8: Scheduling Logic
The dynamic instruction scheduling logic is one of the central components of modern out-of-order
processors. There are a variety of different implementations; in particular, there are
tag-broadcast/CAM-based designs as well as dependency-matrix-based designs.
a) Briefly describe each of these two techniques for implementing schedulers, and then compare
and contrast their merits/weaknesses with respect to performance (both clock speed and IPC
rates), power, area, and on-chip hot-spots.
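As a reference for the matrix-based design named above, a minimal dependency-matrix wakeup sketch: entry i's row has a bit set for each entry it waits on, and a completing group of instructions clears its columns rather than broadcasting tags into a CAM (entry indices and dependencies are illustrative):

```python
def schedule(deps):
    """deps[i] = set of entries that entry i depends on (the matrix rows).
    Each cycle, issue every entry whose row is empty, then clear those columns."""
    waiting = {i: set(d) for i, d in enumerate(deps)}
    issue_order = []
    while waiting:
        ready = sorted(i for i, row in waiting.items() if not row)
        if not ready:
            raise ValueError("cyclic dependencies")
        issue_order.append(ready)                # all ready entries issue together
        for i in ready:
            del waiting[i]
        for row in waiting.values():
            row.difference_update(ready)         # column clear instead of tag match
    return issue_order

# Entries 0 and 3 are independent; 1 needs 0; 2 needs both 0 and 1.
deps = [set(), {0}, {0, 1}, set()]
```

In hardware the "row is empty" check is a wide NOR per entry, which is part of why the two designs differ in latency, power, and area.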
b) The newer Pentium processors (Pentium-M, Core, Core 2) support "uop fusion", where
certain pairs of uops can be combined into a single "super-uop" or fused uop. A
fused uop occupies only a single RS entry in the instruction scheduler, but the two uops may
together have more than two input register dependencies. To support this, each RS entry can
support up to three input dependencies. Explain what modifications to both the CAM-based
and matrix-based schedulers would be needed to support three input dependencies. Discuss the
ramifications of these changes on the schedulers' latency, power, and area.
9 Problem 9: Clock frequency vs. ILP
For several years, Intel microprocessor designs continued to increase performance by
aggressively increasing clock frequencies. However, in the recent past, Intel has changed its
strategy. One design decision was to pursue higher-ILP microarchitectures rather than
higher-frequency approaches. However, several academic studies have argued that scaling traditional
out-of-order structures to larger sizes (more physical registers, larger schedulers, bigger LSQs,
wider machines, etc.) is not practical due to super-linear increases in circuit latencies and area.
a) Given that program runtime is equal to (# insts) * (CPI) * (cycle time), increases in the clock
frequency should result in a directly proportional increase in performance (e.g., a 10% frequency
boost should give 10% more performance). However, in practice, the realizable performance
increase is less than proportional (e.g., 10% frequency -> <10% performance). Why?
b) Explain why you think an aggressive ILP-oriented microarchitecture (such as the Intel Core
processors) can still result in lower power and/or better performance/Watt. In particular,
contrast the performance and power attributes of the high-frequency and high-ILP approaches.
c) Did Intel make the right decision to pursue higher-ILP designs? Why or why not (your answer
may include reasons relating to design complexity, design time, testing/verification/validation,
market considerations, etc.)?
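The sub-proportional scaling in part a) can be seen numerically: DRAM latency is fixed in nanoseconds, so raising the clock inflates the miss penalty in cycles and CPI grows. A sketch with arbitrary illustrative parameters (1% misses per instruction, 100 ns memory):

```python
def runtime_ns(freq_ghz, core_cpi=1.0, miss_per_inst=0.01, mem_ns=100.0, insts=1e9):
    """Runtime = insts * (core CPI + memory stall cycles) * cycle time."""
    cycle_ns = 1.0 / freq_ghz
    miss_penalty_cycles = mem_ns / cycle_ns   # fixed ns => more cycles at high freq
    cpi = core_cpi + miss_per_inst * miss_penalty_cycles
    return insts * cpi * cycle_ns

base, boosted = runtime_ns(3.0), runtime_ns(3.3)   # a 10% frequency boost
speedup = base / boosted
```

With these parameters the 10% frequency boost yields only about a 2% runtime improvement, because the memory-bound component of runtime does not shrink at all.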