
EECC 722 - Fall 2001
Homework Assignment #1, Due October 17
Question 1:
Describe very concisely the advantages of the Tomasulo method for dynamic scheduling over
the Scoreboard. Are there any disadvantages relative to the scoreboard?
Question 2:
A certain system with a 350 MHz clock uses a separate data and instruction cache, and a unified
second-level cache. The first-level data cache is a direct-mapped, write-through, write-allocate
cache with 8kBytes of data total and 8-Byte blocks, and has a perfect write buffer (never causes
any stalls). The first-level instruction cache is a direct-mapped cache with 4kBytes of data total
and 8-Byte blocks. The second-level cache is a two-way set-associative, write-back, write-allocate
cache with 2MBytes of data total and 32-Byte blocks. The first-level instruction cache
has a miss rate of 2%. The first-level data cache has a miss rate of 15%. The unified second-level
cache has a local miss rate of 10%. Assume that 40% of all instructions are data memory
accesses; 60% of those are loads, and 40% are stores. Assume that 50% of the blocks in the
second-level cache are dirty at any time. Assume that there is no optimization for fast reads on an
L1 or L2 cache miss. All first-level cache hits cause no stalls. The second-level hit time is 10
cycles. (That means that the L1 miss penalty, assuming a hit in the L2 cache, is 10 cycles.) Main
memory access time is 100 cycles to the first bus width of data; after that, the memory system
can deliver consecutive bus widths of data on each following cycle. Outstanding non-consecutive
memory requests cannot overlap; an access to one memory location must complete before an
access to another memory location can begin. There is a 128-bit bus from memory to the L2
cache, and a 64-bit bus from both L1 caches to the L2 cache. Assume a perfect TLB for this
problem (never causes any stalls).
a) What percent of all data memory references cause a main memory access (main
memory is accessed before the memory request is satisfied)? First show the equation, then the
numeric result.
b) How many bits are used to index each of the caches? Assume the caches are presented
physical addresses.
c) How many cycles can the longest possible data memory access take? Describe (briefly)
the events that occur during this access.
d) What is the average memory access time in cycles (including instruction and data
memory references)? First show the equation, then the numeric result.
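For part (b), the index width follows directly from the number of sets in each cache. A minimal Python sketch, assuming physical indexing and the power-of-two geometries given in the problem statement:

```python
import math

def index_bits(total_bytes, block_bytes, ways):
    """Index bits for a set-associative cache: log2 of the number of sets."""
    sets = total_bytes // (block_bytes * ways)
    return int(math.log2(sets))

# Cache geometries from the problem statement
print(index_bits(8 * 1024, 8, 1))          # L1 data: 1024 sets -> 10 bits
print(index_bits(4 * 1024, 8, 1))          # L1 instruction: 512 sets -> 9 bits
print(index_bits(2 * 1024 * 1024, 32, 2))  # L2: 32768 sets -> 15 bits
```

The remaining physical-address bits split into the block offset (3 bits for 8-Byte blocks, 5 bits for 32-Byte blocks) and the tag.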
Question 3:
a) What performance aspects does SMT improve over RISC superscalar architectures?
b) What are the differences between Simultaneous Multithreading (SMT) and Traditional
Multithreaded Architectures? Which architecture offers higher potential performance?
Why?
c) What hardware modifications are required to add SMT capabilities to a RISC superscalar
processor micro-architecture?
d) What are the required operating system modifications to support SMT?
e) Both SMT and Chip Multiprocessors (CMP) exploit parallelism at the thread level.
Contrast the two approaches based on hardware complexity and required modifications of
existing conventional single processor programs.
Question 4:
Explain why cyclic loop and cyclic tiled data distributions exhibit better performance than
blocked loop and blocked tiled data distributions for SMT. Contrast this with symmetric
shared-memory multiprocessors, where the opposite occurs.
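As a reminder of the terminology (not part of the question itself), the difference between blocked and cyclic assignment of n loop iterations to p threads can be sketched as follows; the function names and the even-divisibility assumption are illustrative only:

```python
def blocked_indices(n, p, tid):
    """Blocked distribution: thread tid gets one contiguous chunk."""
    chunk = n // p  # assumes p divides n evenly, for simplicity
    return list(range(tid * chunk, (tid + 1) * chunk))

def cyclic_indices(n, p, tid):
    """Cyclic distribution: thread tid gets every p-th iteration."""
    return list(range(tid, n, p))

# 8 iterations over 2 threads, thread 0's share:
print(blocked_indices(8, 2, 0))  # [0, 1, 2, 3]
print(cyclic_indices(8, 2, 0))   # [0, 2, 4, 6]
```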
Question 5:
For each of the papers:
Software-Directed Register Deallocation for Simultaneous Multithreaded Processors
Jack Lo, Sujay Parekh, Susan Eggers, Henry Levy, and Dean Tullsen
IEEE Transactions on Parallel and Distributed Systems, September 1999, pages 922-933.
Efficient FFTs On VIRAM,
Randi Thomas and Katherine Yelick,
Proceedings of the 1st Workshop on Media Processors and DSPs, in Conjunction with the 32nd
Annual International Symposium on Microarchitecture, Haifa, Israel, November 15, 1999.
Write a paper summary no more than 2 pages long that contains the following information: (1) a
summary of the paper, (2) the most important result, (3) one strength of the paper, (4) one
weakness of the paper, and (5) one thing you didn't understand.
Question 6:
Consider a vector computer which can operate in one of two execution modes at a time: one is
the vector mode with an execution rate of Rv = 10 Mflops, and the other is the scalar mode
with an execution rate of Rs = 1 Mflops. Let x be the percentage of code that is
vectorizable in a typical program mix for this computer.
a) Derive an expression for the average execution rate Ra for this program.
b) Plot Ra as a function of x in the range (0, 1).
c) Determine the vectorization ratio x needed in order to achieve an average execution rate of
Ra = 7.5 Mflops.
d) Suppose Rs = 1 Mflops and x = 0.7. What value of Rv is needed to achieve an average
execution rate of Ra = 2 Mflops?
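The quantity in parts (a)–(d) follows a weighted-harmonic-mean model (a form of Amdahl's Law): if a fraction x of the operations run at rate Rv and the rest at rate Rs, the average rate is the reciprocal of the weighted per-operation times. A minimal sketch under that assumption:

```python
def avg_rate(rv, rs, x):
    """Average execution rate (Mflops) when a fraction x of the work
    runs at rv Mflops and the remaining 1 - x runs at rs Mflops."""
    return 1.0 / (x / rv + (1.0 - x) / rs)

print(avg_rate(10.0, 1.0, 0.0))  # all scalar -> 1.0 Mflops
print(avg_rate(10.0, 1.0, 1.0))  # all vector -> 10.0 Mflops
print(avg_rate(10.0, 1.0, 0.5))  # an even mix, dominated by the scalar term
```

Note that at x = 0.5 the average rate is still well below the midpoint of Rs and Rv; the slow mode dominates, which is the point of part (b)'s plot.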
Question 7:
The following sequence is to be executed on a register-register vector processor:
A(I) = B(I) + s x C(I)
D(I) = s x B(I) x C(I)
E(I) = C(I) x ( C(I) - B(I) )
Where B(I) and C(I) are each 64-element vectors originally stored in memory. The resulting
vectors A(I), D(I), and E(I) must be stored back into memory after the computation.
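A plain scalar reference for the sequence may help when checking the vector code in the parts below; the scalar value s and the element values here are arbitrary placeholders, not given in the problem:

```python
n = 64  # vector length given in the problem
s = 3.0  # arbitrary scalar value (assumption)
B = [float(i) for i in range(n)]      # placeholder input vectors
C = [float(n - i) for i in range(n)]

# The three vector statements, element by element
A = [B[i] + s * C[i] for i in range(n)]
D = [s * B[i] * C[i] for i in range(n)]
E = [C[i] * (C[i] - B[i]) for i in range(n)]

print(A[0], D[0], E[0])  # 192.0 0.0 4096.0 for these placeholder inputs
```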
a) Write 11 vector instructions (use DLX vector instructions; see the reference book, Appendix B)
in proper order to execute the above sequence.
b) The resulting vector instructions are executed on a Cray 1 (80 MHz) with one add pipeline
(latency 6 cycles), one multiply pipeline (latency 7 cycles), and one vector memory load/store
pipeline (latency 12 cycles). Show a space-time diagram of the execution with and without
vector chaining. How many convoys of vector instructions exist, and how many cycles are
needed in both cases?
c) Repeat part (b) above for a Cray X-MP (120 MHz) with similar add/multiply units but
with two vector-load pipelines and one vector store pipeline which can be used
simultaneously with the remaining functional units. What is the speedup over the Cray 1
for both cases (with and without chaining)?
Question 8:
Discuss the potential advantages of Vector IRAM over conventional CPU architectures
(superscalar/VLIW) including the advantages for the targeted applications.
Question 9:
a) Discuss the data and thread speculation mechanisms in the HYDRA CMP. How do they
differ from hardware-based speculation in current superscalar RISC CPUs?
b) Discuss the modifications to level 1 data cache and secondary cache required to implement
speculation and ensure correct execution in the HYDRA CMP.
c) Describe how correct execution is guaranteed for speculative loads and stores in the
HYDRA CMP.
Question 10:
a) Identify 5 significant differences between general-purpose and DSP/embedded processors.
For each difference, give one example of how it impacts the architectural design process.
b) List 6 characteristics of DSP instruction set architectures that differ from general-purpose
microprocessors.
c) Which of those characteristics are supported in vector architectures?
d) List the architectural improvements between different DSP processor generations, including
examples from each generation.
Question 11:
a) What is the main motivation for Re-Configurable Computing?
b) What types of applications are suitable for Re-Configurable Computing?
c) Contrast Hybrid-Architecture Re-Configurable Computing compute models and give
examples for each model.
d) Identify the differences between the Hybrid-Architecture Re-Configurable architectures
studied that utilize instruction augmentation, including computation grain size, access to
memory, and concurrent host/RC hardware operation.
e) Describe the instruction augmentation process in PRISC and GARP.
Question 12:
A number of approaches to utilize billion transistor chips were studied in this course. Identify
which of these approaches may be combined to further enhance performance.