Fall 2010 PhD Qualifying Exam Computer Architecture Area October 22 2010

This exam has six (6) equally-weighted problems. You should submit your answers to all six
problems. Write your answers/solutions clearly and legibly. Illegible answers will be treated as
wrong answers. You should clearly mark on top of each page of your answers which problem the
page corresponds to. You should not use the same page for two problems, but it is fine to use
multiple pages per problem if you need to. Although there is no restriction on how long your
answer should be, try to keep your answers and solutions short and to the point.
Good luck!
Written Qualifying Exam for Computer Architecture
1. An application executes on a quad-core CMP that has private (per-core) L1 caches connected by an
on-chip bus, and a large shared L2 cache. The coherence protocol between L1 caches is MESI. The
machine is running only a four-threaded application and (after a brief initial warm-up period) all
accesses are L2 cache hits. The only memory accesses in this application are:
• Each thread accesses its own stack-based data. For each location on the stack, the access
pattern is to write to it, read it a short while afterwards, and then not use that particular
location for a while. By the time the same stack location is accessed (written and then read)
again, it is replaced from the L1 cache.
• All threads use a shared array, where each element is written by only one thread, then read by
all four threads, then written by its writer thread again soon afterwards. The shared array is
small and all its elements are accessed very often.
A) For this particular application, explain what the benefit is from having the E state (as opposed to
using the MSI protocol) in terms of on-chip bus utilization, reads from the L2 cache, and writes to
the L2 cache.
B) For this particular application, explain what the benefit is from having a write-back L1 cache and
using a write-invalidate MESI coherence protocol (as opposed to using a write-through L1 cache
and a write-update protocol with only S and I states).
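The difference between MSI and MESI can be explored with a small state-transition model. The sketch below is a simplification, not a full protocol: it tracks one cache line in one L1 cache and assumes no other cache holds the line. The bus transaction names (BusRd, BusRdX, BusUpgr) follow common textbook usage and are assumptions, not terms from this exam.

```python
# Simplified single-line, single-requester model (assumption: no other sharers).
# Each table maps (state, operation) -> (next_state, bus_transaction or None).
MSI = {
    ("I", "read"): ("S", "BusRd"),
    ("I", "write"): ("M", "BusRdX"),
    ("S", "read"): ("S", None),
    ("S", "write"): ("M", "BusUpgr"),  # upgrade must be announced on the bus
    ("M", "read"): ("M", None),
    ("M", "write"): ("M", None),
}

MESI = dict(MSI)
MESI.update({
    ("I", "read"): ("E", "BusRd"),   # no other sharer, so the line becomes Exclusive
    ("E", "read"): ("E", None),
    ("E", "write"): ("M", None),     # silent upgrade, no bus transaction
})


def bus_transactions(protocol, ops, state="I"):
    """Return the list of bus transactions generated by a sequence of accesses."""
    txns = []
    for op in ops:
        state, txn = protocol[(state, op)]
        if txn is not None:
            txns.append(txn)
    return txns


# Compare an access pattern that starts with a write against one that starts with a read.
write_first = ["write", "read"]
read_first = ["read", "write"]
for name, proto in (("MSI", MSI), ("MESI", MESI)):
    print(name, bus_transactions(proto, write_first), bus_transactions(proto, read_first))
```

Feeding the access patterns described in the problem into `bus_transactions` is one way to check where the E state does and does not change bus traffic.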
2. A single-core processor chip has only one on-chip data cache that is 4-way set-associative, with 1024
sets and 128-byte blocks. Metadata (tags, dirty bit, valid bit) for all four lines in the set is checked in
one cycle, which consumes 20pJ (pico-joules). However, no data is accessed in this first cycle.
Instead, if the first cycle identifies a cache hit, in the second cycle only one of the data blocks (the one
that has a hit) in the set is accessed (read or written), consuming another 20pJ.
No parity or ECC is used for the data in this cache, so the engineer responsible for designing this
cache decides to add a “reliable mode” for this cache. When in this “reliable mode”, the cache uses
one of the data blocks in each set as the bit-wise parity for the other three (similar to a RAID-4
configuration for disks). To minimize the cost and energy consumption increase due to adding this
reliable mode, in any given cycle the cache can either access the metadata for a whole set or the data
for one block in the set – it cannot access both metadata and data in one cycle, nor can it access more
than one data block in a cycle.
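The RAID-4-style parity described above can be illustrated in software. The following is a minimal sketch, assuming three data blocks and one parity block per set; the function names are hypothetical and the code models only the parity arithmetic, not the cache timing or energy.

```python
from functools import reduce

BLOCK_BYTES = 128  # block size from the problem statement


def parity_block(blocks):
    """Bit-wise XOR of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def updated_parity(old_parity, old_data, new_data):
    """Parity after one data block changes: P' = P xor old_data xor new_data."""
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))


# Three data blocks in one set, plus the parity block computed over them.
data = [bytes([i + 1]) * BLOCK_BYTES for i in range(3)]
parity = parity_block(data)

# A lost block is the XOR of the two surviving data blocks and the parity block.
recovered = parity_block([data[0], data[2], parity])

# Overwriting data[0] needs only the old data and the old parity, not the other blocks.
new_block = bytes([0x5A]) * BLOCK_BYTES
new_parity = updated_parity(parity, data[0], new_block)
```

Note that `updated_parity` reads the old data block and the old parity block before writing; how many one-block-per-cycle accesses that implies in hardware is left to the problem.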
The processor runs at 1GHz and is very simple – it stalls until the cache returns the data (e.g. the stall
is two cycles per load/store in the non-reliable mode). The machine runs a program that executes a
total of 1 billion loads and 300 million stores. All these accesses are cache hits (in either mode), and
the entire program executes in 10 seconds when the cache operates in the non-reliable mode.
A) What is the total energy consumed by the cache when running this entire program, if the cache is
operating in the non-reliable mode?
B) What is the total energy consumed by the cache to run this entire program, if the cache is
operating in the reliable mode?
C) How should the cache’s reliable mode be implemented to minimize execution time in the reliable
mode? Describe what should happen for a load access and for a store access to the cache. Clearly
specify how many cycles a load hit takes and how many cycles a store hit takes with your
implementation.
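As a starting point for the arithmetic, the non-reliable baseline can be tallied as follows. This is a sketch under stated assumptions: the cache is assumed to consume energy only while servicing accesses (no idle or leakage energy), and every hit is assumed to perform exactly one metadata access followed by one data access. The constant names are hypothetical.

```python
# Values taken from the problem statement.
LOADS = 1_000_000_000
STORES = 300_000_000
METADATA_PJ = 20      # energy of the metadata check cycle, in picojoules
DATA_PJ = 20          # energy of the data access cycle, in picojoules
CYCLES_PER_HIT = 2    # non-reliable mode: metadata cycle + data cycle
CLOCK_HZ = 1_000_000_000

accesses = LOADS + STORES

# Assumption: energy is spent only on accesses, one metadata + one data access per hit.
energy_pj = accesses * (METADATA_PJ + DATA_PJ)
energy_joules = energy_pj / 1e12

# Cycles and time the processor spends stalled on the cache in non-reliable mode.
stall_cycles = accesses * CYCLES_PER_HIT
stall_seconds = stall_cycles / CLOCK_HZ

print(f"cache energy: {energy_joules} J, cache stall time: {stall_seconds} s")
```

The same tally can be repeated for the reliable mode once the number of metadata and data accesses per load and per store has been decided.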
3. A CMP has 16 cores in a chip. The total power budget for the chip is 120W. When all cores are
active, the frequency of all cores is 1.2GHz, their Vdd is 2.7V, and 60% of the power budget is
allocated to dynamic power and the remaining 40% to static power. Recall that dynamic power is
P = (C * Vdd^2 * f) / 2
To improve power efficiency, the processor uses a power gating mechanism. When a program uses
only two cores, the system increases the clock frequency of the two active cores to improve
performance and the remaining 14 cores are power gated. Note that the system changes Vdd to control
frequency, and frequency increases linearly as we increase Vdd. The total area of the chip is 2.5cm by
2.5cm, and the cooling capacity is 50W/cm^2. Assume that two cores use only 1/8th of the total area,
and for this area the total power must still remain at or below the cooling capacity. Also assume that
all active cores share the same frequency and Vdd.
A) What will be the maximum frequency of two active cores?
B) Instead of power gating, the system uses a clock gating mechanism. Now, the remaining 14 cores
are clock gated. What will be the maximum frequency of two active cores?
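The given quantities can be organized before attempting either part. The sketch below only derives the budget figures and the scaling rule implied by the problem statement; it assumes that power is split evenly across the 16 cores, which the problem does not state explicitly, and it makes no assumption about how static power changes with Vdd. All names are hypothetical.

```python
import math

# Values taken from the problem statement.
TOTAL_POWER_W = 120.0
DYNAMIC_FRACTION = 0.60
NUM_CORES = 16
CHIP_SIDE_CM = 2.5
COOLING_W_PER_CM2 = 50.0
ACTIVE_AREA_FRACTION = 1 / 8   # area occupied by the two active cores
BASE_FREQ_GHZ = 1.2
BASE_VDD = 2.7

# Budget split when all 16 cores are active.
dynamic_all_w = TOTAL_POWER_W * DYNAMIC_FRACTION
static_all_w = TOTAL_POWER_W - dynamic_all_w

# Assumption: power is divided evenly among the cores.
dynamic_per_core_w = dynamic_all_w / NUM_CORES
static_per_core_w = static_all_w / NUM_CORES

# Cooling limit for the area occupied by the two active cores.
chip_area_cm2 = CHIP_SIDE_CM ** 2
cooling_limit_w = COOLING_W_PER_CM2 * chip_area_cm2 * ACTIVE_AREA_FRACTION


def dynamic_power_scale(freq_ratio):
    """Dynamic power multiplier for a given frequency multiplier.

    With P = C * Vdd^2 * f / 2 and f proportional to Vdd, P scales as ratio^3.
    """
    return freq_ratio ** 3


print(dynamic_per_core_w, static_per_core_w, cooling_limit_w)
```

Both the 120W chip budget and the cooling limit for the active area constrain the answer; which one binds depends on how the gated cores and the static power are treated in each part.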
4. This problem is about the HPS architecture from the Patt et al. paper in the reading list.
A) Tomasulo’s algorithm enables out-of-order scheduling. Then, what are the impacts of the HPS
architecture?
B) Consider those features or benefits that HPS provides but the original Tomasulo’s algorithm does
not. Are there any other mechanisms that provide these features or benefits?
C) The HPS architecture provides a branch misprediction recovery mechanism that is faster than the
classical ROB-based mechanism used in out-of-order processors. Explain why. Modern out-of-order
processors typically employ some hardware structures to overcome this problem. Explain how.
5. This problem is about the interaction between the compiler’s register allocator and the processor’s
microarchitecture.
A) When registers are allocated in a compiler, sometimes there are not enough physical registers to
hold all values generated by the program. Therefore, registers must be re-used. What are the
consequences of this on the instruction-level parallelism? Propose a mechanism in a superscalar
processor to get around this limitation. Describe the hardware in sufficient detail so that another
architect would know how to build the hardware.
B) When there are more live variables than there are registers available to hold these values, the
register allocator in a compiler “spills” registers to the stack. What are the consequences of this
on the instruction-level parallelism of the program? Propose a mechanism in a superscalar
processor to get around this limitation. Note that this is a different phenomenon than the one
discussed in part (A) – that was about each value not getting its own register, this is about each
variable not getting its own register. Describe the hardware in sufficient detail so that another
architect would know how to build the hardware.
Your score on this problem will depend on the completeness of your answers.
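One well-known mechanism relevant to part (A) is register renaming with a map table and a free list. The sketch below is a behavioral model only, not a hardware design and not necessarily the answer the examiners expect; the class name, method name, and register counts are hypothetical.

```python
from collections import deque


class Renamer:
    """Behavioral model of a rename stage: map table plus free list."""

    def __init__(self, num_arch, num_phys):
        # Initially architectural register i lives in physical register i.
        self.map_table = list(range(num_arch))
        self.free_list = deque(range(num_arch, num_phys))

    def rename(self, dst, src1, src2):
        """Rename one instruction; returns (phys_dst, phys_src1, phys_src2, old_phys_dst)."""
        # Sources are read through the current mapping before the destination is remapped.
        p1, p2 = self.map_table[src1], self.map_table[src2]
        if not self.free_list:
            raise RuntimeError("no free physical register; rename must stall")
        pd = self.free_list.popleft()
        old = self.map_table[dst]  # can be freed when this instruction commits
        self.map_table[dst] = pd
        return pd, p1, p2, old


# r1 = r2 + r3 ; r4 = r1 + r1 ; r1 = r5 + r6  (the last write reuses r1)
r = Renamer(num_arch=8, num_phys=16)
first = r.rename(1, 2, 3)
second = r.rename(4, 1, 1)
third = r.rename(1, 5, 6)
print(first, second, third)
```

In this example the two writes to r1 receive different physical registers, so the third instruction no longer has to wait for the second to read the old value of r1.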
6. “Wakeup and Select” is the name for the portion of a superscalar processor that causes dependent
instructions to execute after the instructions they depend on have completed. Consider an ISA with
two source operands and one destination operand per instruction. Describe a wakeup and select
buffer that holds K waiting instructions. Each entry contains the two source operands and an
“instruction id” for each of the K instructions. The buffer receives the register number of a
completing instruction and outputs the instruction id of the next instruction to execute.
A) Design the buffer. Show a detailed diagram with sufficient information such that another
architect would know how to build the hardware.
B) What happens when K is increased? Are there any adverse effects on your design? Explain.
Your score on this problem will depend on the completeness of your answers.
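The interface described in the problem can be modeled in software to make the expected behavior concrete. The following is a behavioral sketch only: it assumes oldest-first selection and one selected instruction per call, neither of which is specified by the problem, and the class and method names are hypothetical. In hardware, the loop in `wakeup` would correspond to parallel tag comparators on every entry.

```python
class WakeupSelectBuffer:
    """Behavioral model of a K-entry wakeup and select buffer."""

    def __init__(self, k):
        self.k = k
        self.entries = []  # kept in insertion order, oldest first

    def insert(self, instr_id, src1, src2, ready1=False, ready2=False):
        """Add a waiting instruction with its two source register numbers."""
        if len(self.entries) >= self.k:
            raise RuntimeError("buffer full; dispatch must stall")
        self.entries.append(
            {"id": instr_id, "src": [src1, src2], "ready": [ready1, ready2]}
        )

    def wakeup(self, dest_reg):
        """Broadcast a completing destination register to every entry."""
        for entry in self.entries:
            for i in (0, 1):
                if entry["src"][i] == dest_reg:
                    entry["ready"][i] = True

    def select(self):
        """Return the id of the oldest fully ready instruction, or None."""
        for i, entry in enumerate(self.entries):
            if all(entry["ready"]):
                return self.entries.pop(i)["id"]
        return None


buf = WakeupSelectBuffer(k=4)
buf.insert(10, src1=1, src2=2)
buf.insert(11, src1=3, src2=1)
buf.wakeup(1)
buf.wakeup(2)
print(buf.select())  # instruction 10 has both sources ready
```

The model makes the scaling question in part (B) visible: every broadcast touches all K entries and two comparators per entry, and selection examines up to K ready signals.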