
EECC 722 - Fall 2001
Homework Assignment #1, Due October 17
Question 1:
Describe very concisely the advantages of the Tomasulo method for dynamic scheduling over
the Scoreboard. Are there any disadvantages relative to the scoreboard?
Question 2:
A certain system with a 350 MHz clock uses a separate data and instruction cache, and a unified
second-level cache. The first-level data cache is a direct-mapped, write-through, write-allocate
cache with 8kBytes of data total and 8-Byte blocks, and has a perfect write buffer (never causes
any stalls). The first-level instruction cache is a direct-mapped cache with 4kBytes of data total
and 8-Byte blocks. The second-level cache is a two-way set-associative, write-back, write-allocate
cache with 2MBytes of data total and 32-Byte blocks. The first-level instruction cache
has a miss rate of 2%. The first-level data cache has a miss rate of 15%. The unified second-level
cache has a local miss rate of 10%. Assume that 40% of all instructions are data memory
accesses; 60% of those are loads, and 40% are stores. Assume that 50% of the blocks in the
second-level cache are dirty at any time. Assume that there is no optimization for fast reads on an
L1 or L2 cache miss. All first-level cache hits cause no stalls. The second-level hit time is 10
cycles. (That means that the L1 miss penalty, assuming a hit in the L2 cache, is 10 cycles.) Main
memory access time is 100 cycles to the first bus width of data; after that, the memory system
can deliver consecutive bus widths of data on each following cycle. Outstanding non-consecutive
memory requests cannot overlap; an access to one memory location must complete before an
access to another memory location can begin. There is a 128-bit bus from memory to the L2
cache, and a 64-bit bus from both L1 caches to the L2 cache. Assume a perfect TLB for this
problem (never causes any stalls).
a) What percent of all data memory references cause a main memory access (main
memory is accessed before the memory request is satisfied)? First show the equation, then the
numeric result.
b) How many bits are used to index each of the caches? Assume the caches are presented
physical addresses.
c) How many cycles can the longest possible data memory access take? Describe (briefly)
the events that occur during this access.
d) What is the average memory access time in cycles (including instruction and data
memory references)? First show the equation, then the numeric result.
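For part (b), the index width follows directly from the number of sets in each cache. A minimal Python sketch, assuming physical indexing and the power-of-two geometries given in the problem statement:

```python
import math

def index_bits(total_bytes, block_bytes, ways):
    """Index bits for a set-associative cache: log2 of the number of sets."""
    sets = total_bytes // (block_bytes * ways)
    return int(math.log2(sets))

# Cache geometries from the problem statement
print(index_bits(8 * 1024, 8, 1))          # L1 data: 1024 sets -> 10 bits
print(index_bits(4 * 1024, 8, 1))          # L1 instruction: 512 sets -> 9 bits
print(index_bits(2 * 1024 * 1024, 32, 2))  # L2: 32768 sets -> 15 bits
```

The remaining physical-address bits split into the block offset (3 bits for 8-Byte blocks, 5 bits for 32-Byte blocks) and the tag.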
Question 3:
a) What performance aspects does SMT improve over RISC superscalar architectures?
b) What are the differences between Simultaneous Multithreading (SMT) and Traditional
Multithreaded Architectures? Which architecture offers higher potential performance?
Why?
c) What hardware modifications are required to add SMT capabilities to a RISC superscalar
processor micro-architecture?
d) What are the required operating system modifications to support SMT?
e) Both SMT and Chip Multiprocessors (CMP) exploit parallelism at the thread level.
Contrast the two approaches based on hardware complexity and required modifications of
existing conventional single processor programs.
Question 4:
Explain why cyclic loop and cyclic tiled data distributions exhibit better performance than
blocked loop and blocked tiled data distributions for SMT. Contrast this with symmetric
shared-memory multiprocessors, where the opposite occurs.
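As a reminder of the terminology (not part of the question itself), the difference between blocked and cyclic assignment of n loop iterations to p threads can be sketched as follows; the function names and the even-divisibility assumption are illustrative only:

```python
def blocked_indices(n, p, tid):
    """Blocked distribution: thread tid gets one contiguous chunk."""
    chunk = n // p  # assumes p divides n evenly, for simplicity
    return list(range(tid * chunk, (tid + 1) * chunk))

def cyclic_indices(n, p, tid):
    """Cyclic distribution: thread tid gets every p-th iteration."""
    return list(range(tid, n, p))

# 8 iterations over 2 threads, thread 0's share:
print(blocked_indices(8, 2, 0))  # [0, 1, 2, 3]
print(cyclic_indices(8, 2, 0))   # [0, 2, 4, 6]
```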
Question 5:
For each of the papers:
Software-Directed Register Deallocation for Simultaneous Multithreaded Processors
Jack Lo, Sujay Parekh, Susan Eggers, Henry Levy, and Dean Tullsen
IEEE Transactions on Parallel and Distributed Systems, September 1999, pages 922-933.
Efficient FFTs On VIRAM,
Randi Thomas and Katherine Yelick,
Proceedings of the 1st Workshop on Media Processors and DSPs, in Conjunction with the 32nd
Annual International Symposium on Microarchitecture, Haifa, Israel, November 15, 1999.
Write a paper summary no more than 2 pages long that contains the following information: (1) a
summary of the paper, (2) the most important result, (3) one strength of the paper, (4) one
weakness of the paper, and (5) one thing you didn't understand.
Question 6:
Consider a vector computer which can operate in one of two execution modes at a time: one is
the vector mode with an execution rate of Rv = 10 Mflops, and the other is the scalar mode
with an execution rate of Rs = 1 Mflops. Let x be the percentage of code that is
vectorizable in a typical program mix for this computer.
a) Derive an expression for the average execution rate Ra for this program.
b) Plot Ra as a function of x in the range (0, 1).
c) Determine the vectorization ratio x needed in order to achieve an average execution rate of
Ra = 7.5 Mflops.
d) Suppose Rs = 1 Mflops and x = 0.7. What value of Rv is needed to achieve an average
execution rate of Ra = 2 Mflops?
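The quantity in parts (a)–(d) follows a weighted-harmonic-mean model (a form of Amdahl's Law): if a fraction x of the operations run at rate Rv and the rest at rate Rs, the average rate is the reciprocal of the weighted per-operation times. A minimal sketch under that assumption:

```python
def avg_rate(rv, rs, x):
    """Average execution rate (Mflops) when a fraction x of the work
    runs at rv Mflops and the remaining 1 - x runs at rs Mflops."""
    return 1.0 / (x / rv + (1.0 - x) / rs)

print(avg_rate(10.0, 1.0, 0.0))  # all scalar -> 1.0 Mflops
print(avg_rate(10.0, 1.0, 1.0))  # all vector -> 10.0 Mflops
print(avg_rate(10.0, 1.0, 0.5))  # an even mix, dominated by the scalar term
```

Note that at x = 0.5 the average rate is still well below the midpoint of Rs and Rv; the slow mode dominates, which is the point of part (b)'s plot.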
Question 7:
The following sequence is to be executed on a register-register vector processor:
A(I) = B(I) + s x C(I)
D(I) = s x B(I) x C(I)
E(I) = C(I) x ( C(I) - B(I) )
Where B(I) and C(I) are each 64-element vectors originally stored in memory. The resulting
vectors A(I), D(I), and E(I) must be stored back into memory after the computation.
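A plain scalar reference for the sequence may help when checking the vector code in the parts below; the scalar value s and the element values here are arbitrary placeholders, not given in the problem:

```python
n = 64  # vector length given in the problem
s = 3.0  # arbitrary scalar value (assumption)
B = [float(i) for i in range(n)]      # placeholder input vectors
C = [float(n - i) for i in range(n)]

# The three vector statements, element by element
A = [B[i] + s * C[i] for i in range(n)]
D = [s * B[i] * C[i] for i in range(n)]
E = [C[i] * (C[i] - B[i]) for i in range(n)]

print(A[0], D[0], E[0])  # 192.0 0.0 4096.0 for these placeholder inputs
```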
a) Write 11 vector instructions (use DLX vector instructions; see the reference book, Appendix B)
in proper order to execute the above sequence.
b) The resulting vector instructions are executed on a Cray 1 (80 MHz) with one add pipeline
(latency 6 cycles), one multiply pipeline (latency 7 cycles), and one vector memory load/store
pipeline (latency 12 cycles). Show a space-time diagram of the execution with and without
vector chaining. How many convoys of vector instructions exist, and how many cycles are
needed in both cases?
c) Repeat part (b) above for a Cray X-MP (120 MHz) with similar add/multiply units but
with two vector-load pipelines and one vector store pipeline which can be used
simultaneously with the remaining functional units. What is the speedup over the Cray 1
for both cases (with and without chaining)?
Question 8:
Discuss the potential advantages of Vector IRAM over conventional CPU architectures
(superscalar/VLIW) including the advantages for the targeted applications.
Question 9:
a) Discuss the data and thread speculation mechanisms in the HYDRA CMP. How do they
differ from hardware-based speculation in current superscalar RISC CPUs?
b) Discuss the modifications to level 1 data cache and secondary cache required to implement
speculation and ensure correct execution in the HYDRA CMP.
c) Describe how correct execution is guaranteed for speculative loads and stores in the
HYDRA CMP.
Question 10:
a) Identify 5 significant differences between general-purpose and DSP/embedded processors.
For each difference, give one example of how it impacts the architectural design process.
b) List 6 characteristics of DSP instruction set architectures that differ from general-purpose
microprocessors.
c) Which of those characteristics are supported in vector architectures?
d) List the architectural improvements between different DSP processor generations, including
examples from each generation.
Question 11:
a) What is the main motivation for Re-Configurable Computing?
b) What types of applications are suitable for Re-Configurable Computing?
c) Contrast Hybrid-Architecture Re-Configurable Computing compute models and give
examples for each model.
d) Identify the differences between the Hybrid-Architecture Re-Configurable architectures
studied that utilize instruction augmentation, including computation grain size, access to
memory, and concurrent host/RC hardware operation.
e) Describe the instruction augmentation process in PRISC and GARP.
Question 12:
A number of approaches to utilize billion transistor chips were studied in this course. Identify
which of these approaches may be combined to further enhance performance.