

Name ONLY: ____________________________
CS3375: Computer Architecture
Fall 2023
Final Exam Practice Test Solution
Show all your work.
Closed book and notes.
No calculators/smartphones.
1. The performance of a machine is determined by instruction count (IC), clock cycles per instruction (CPI), and clock cycle time. Here, IC is determined by the compiler and the [ ? ]. CPI and the clock cycle time are determined by the implementation of the hardware.
• Instruction Set Architecture
2. In a single-cycle datapath, every instruction begins on one clock edge and completes execution on the next clock edge. So the CPI is [ ? ].
• CPI = 1
3. The single-cycle datapath must have separate instruction and data memories, because [ c ]
a. the formats of data and instructions are different in MIPS, and hence different memories are needed.
b. having separate memories is less expensive.
c. the processor operates in one cycle and cannot use a single-ported memory for two different accesses within that cycle.
d. all three of the above are correct
4. Based on the following instruction, what are the values (in decimal) marked as “?” in the single cycle datapath?
Suppose $29 initially contains the number 129. If a value cannot be determined, mark it as “x” (don’t care).
addi $29, $29, 16
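The original datapath figure is not reproduced here, but the values that can be determined follow directly from the semantics of addi. A minimal Python sketch (the variable names are illustrative, not labels from the original figure):

```python
# Sketch of the key datapath values for: addi $29, $29, 16
# Assumes $29 initially holds 129, per the problem statement.
regs = {29: 129}

rs = 29                     # source register field (instruction bits)
rt = 29                     # destination register field (I-format)
imm = 16                    # 16-bit immediate, sign-extended to 32 bits

read_data_1 = regs[rs]                    # register file output: 129
sign_extended = imm                       # sign-extend output: 16
alu_result = read_data_1 + sign_extended  # ALU output: 129 + 16 = 145

regs[rt] = alu_result                     # written back on RegWrite
print(read_data_1, sign_extended, alu_result)  # 129 16 145
```

Values on paths not exercised by addi (e.g., the data memory read port) remain “x” (don’t care).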
5. Assume that there are no pipeline stalls and that the breakdown of executed instructions is as follows:

add    addi   not    beq    lw     sw
20%    20%    0%     25%    25%    10%
a. In what fraction of all cycles is the data memory used?
• The data memory is used by LW and SW instructions, so the answer is:
25% + 10% = 35%
b. In what fraction of all cycles is the input of the sign-extend circuit needed?
• The sign-extend circuit actually computes a result in every cycle, but its output is ignored for ADD and NOT instructions. The input of the sign-extend circuit is needed for ADDI (to provide the immediate ALU operand), BEQ (to provide the PC-relative offset), and LW and SW (to provide the offset used in addressing memory), so the answer is:
20% + 25% + 25% + 10% = 80%
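Both fractions can be checked directly from the instruction mix (a quick Python sketch, not part of the original solution):

```python
# Instruction mix from the problem, in whole percent.
mix = {"add": 20, "addi": 20, "not": 0, "beq": 25, "lw": 25, "sw": 10}

# (a) The data memory is used only by loads and stores.
data_mem = mix["lw"] + mix["sw"]                              # 35 (%)

# (b) The sign-extend input is needed by every I-format instruction here.
sign_ext = mix["addi"] + mix["beq"] + mix["lw"] + mix["sw"]   # 80 (%)

print(data_mem, sign_ext)  # 35 80
```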
6. The operation times for the major functional units are 200ps for memory access, 200ps for ALU operation, and 100ps for register file read or write. For example, in the single-cycle design, the time required for every instruction is 800ps due to the lw instruction (instruction fetch, register read, ALU operation, data access, and register write). Here, we only consider the lw instruction for the speedup comparison.
a. [True/False]: If the time for an ALU operation can be shortened by 25%, it will affect the speedup obtained from pipelining.
• False. Shortening the ALU operation to 150ps will not affect the speedup obtained from pipelining, because the 200ps memory access still determines the clock cycle.
b. If the ALU operation now takes 20% more time, the clock cycle needs to be [ ? ] ps.
• If the ALU operation takes 20% more time (200ps × 1.2 = 240ps), it becomes the bottleneck in the pipeline. The clock cycle needs to be 240ps.
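The reasoning for both parts can be sketched numerically (a Python check under the stated unit times; the stage breakdown follows the lw steps listed in the question):

```python
# Functional-unit times for the five lw steps:
# instruction fetch (mem), register read, ALU, data access (mem), register write.
stages = {"IF": 200, "RegRead": 100, "ALU": 200, "MEM": 200, "RegWrite": 100}

single_cycle = sum(stages.values())       # 800ps, as stated in the question
pipelined_clock = max(stages.values())    # 200ps: memory access dominates

# (a) ALU shortened by 25%: 200 -> 150ps; memory still sets the 200ps clock.
clock_after_shorter_alu = max(200, 150, 100)   # still 200ps

# (b) ALU takes 20% more time: 200 -> 240ps; it now sets the clock.
clock_after_longer_alu = max(200, 240, 100)    # 240ps

print(single_cycle, pipelined_clock, clock_after_shorter_alu, clock_after_longer_alu)
```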
7. Based on the following instruction, fill out the control signal table. If a control signal value cannot be determined, mark it as “x” (don’t care).

add $t1, $t2, $t3

Inst  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
add   1       0       0         1         0        0         0       1       0
8. Assume that individual stages of the datapath have the following latencies:

IF       ID       EX       MEM      WB
250ps    350ps    150ps    300ps    200ps

a. What is the clock cycle time in a pipelined and non-pipelined (single-cycle) processor, respectively? Show all your work.
• Pipelined: 350ps (the longest stage); Non-pipelined: 1250ps (= 250 + 350 + 150 + 300 + 200)
b. What is the total latency of a lw instruction in a pipelined and non-pipelined (single-cycle) processor, respectively? Show all your work.
• Pipelined: 1750ps (each stage takes 350ps, so 350ps × 5 stages); Non-pipelined: 1250ps
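The same arithmetic as a compact Python check (not part of the original solution):

```python
# Per-stage latencies from the problem.
stages = {"IF": 250, "ID": 350, "EX": 150, "MEM": 300, "WB": 200}

pipelined_clock = max(stages.values())     # slowest stage sets the clock: 350ps
single_cycle_clock = sum(stages.values())  # all stages in one cycle: 1250ps

# A lw traverses all 5 stages; in the pipeline each stage takes one full clock.
lw_latency_pipelined = pipelined_clock * len(stages)  # 350 x 5 = 1750ps
lw_latency_single = single_cycle_clock                # 1250ps

print(pipelined_clock, single_cycle_clock, lw_latency_pipelined)  # 350 1250 1750
```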
9. Consider executing the following code on the pipelined datapath:

lw  $4, 100($2)
sub $6, $4, $3
add $2, $3, $5

(i) How many clock cycles will it take to execute this code? (ii) Draw a multiple clock cycle pipeline diagram that illustrates the dependencies that need to be resolved.
• It will take 8 clock cycles to execute this code, including a bubble of 1 cycle due to the load-use dependency between the lw and sub instructions (assuming forwarding):

Cycle:           1   2   3   4   5   6   7   8
lw  $4, 100($2)  IF  ID  EX  MEM WB
sub $6, $4, $3       IF  ID  **  EX  MEM WB
add $2, $3, $5           IF  **  ID  EX  MEM WB
10. Consider three branch prediction schemes: predict not taken, predict taken, and dynamic prediction. Assume that they all have zero penalty when they predict correctly and a two-cycle penalty when they are wrong. Assume that the average prediction accuracy of the dynamic predictor is 90%. If a branch is taken with 5% frequency, which predictor is the best choice? [ c ]
a. Predict taken
b. Dynamic prediction
c. Predict not taken
d. All three predictors are the same
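The choice follows from the average misprediction penalty of each scheme (a Python sketch of the expected cycles lost per branch):

```python
# A branch is taken 5% of the time; mispredictions cost 2 cycles.
taken = 0.05
penalty = 2

predict_not_taken = taken * penalty        # wrong only when taken:  0.05 * 2 = 0.1
predict_taken = (1 - taken) * penalty      # wrong when not taken:   0.95 * 2 = 1.9
dynamic = (1 - 0.90) * penalty             # wrong 10% of the time:  0.10 * 2 = 0.2

schemes = {"not taken": predict_not_taken, "taken": predict_taken, "dynamic": dynamic}
best = min(schemes, key=schemes.get)
print(best)  # not taken
```

Predict not taken wins here only because the branch is so rarely taken; with a higher taken frequency the dynamic predictor would come out ahead.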
11. Refer to the following sequence of instructions, and assume that it is executed on a 5-stage pipelined datapath. [2 pts]

add r5, r2, r1
lw  r3, 4(r5)
lw  r2, 0(r2)
or  r3, r5, r3
sw  r3, 0(r5)

If there is no forwarding or hazard detection, insert nops to ensure correct execution.
• Assuming the register file is written in the first half of a cycle and read in the second half, each consumer must be at least two instructions after its producer:

add r5, r2, r1
nop
nop
lw  r3, 4(r5)
lw  r2, 0(r2)
nop
or  r3, r5, r3
nop
nop
sw  r3, 0(r5)
12. Which of the following statements are generally true? Multiple answers. [ ]
a. On a read, the value returned depends on which blocks are in the cache.
b. Memory hierarchies take advantage of temporal locality.
c. Most of the capacity of the memory hierarchy is at the lowest level.
d. Most of the cost of the memory hierarchy is at the highest level.
• b, c
13. The following code is written in C, where elements within the same row are stored contiguously. Assume each word is a 32-bit integer.

for (i = 0; i < 8; i++)
    for (j = 0; j < 8000; j++)
        A[i][j] = B[i][0] + A[j][i];

a. References to which variables exhibit spatial locality?
• A[i][j] (consecutive elements of a row are accessed in order)
b. References to which variables exhibit temporal locality?
• i, j, and B[i][0] (each is reused repeatedly across iterations)
14. For a direct-mapped cache design with a 32-bit address, the following bits of the address are used to access the cache.

Tag      Index    Offset
31-13    12-6     5-0

a. What is the cache block size in words? Show all your work.
• The 6 offset bits give a block size of 2^6 = 64 bytes; since 1 word = 4 bytes, the block size is 64 / 4 = 16 words.
b. How many entries does the cache have? Show all your work.
• With 7 index bits, the cache has 2^7 = 128 entries.
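The field widths can be turned into a small address-splitting sketch (Python; the helper name is illustrative):

```python
# Field widths from the table above: offset bits 5-0, index bits 12-6, tag bits 31-13.
OFFSET_BITS = 6
INDEX_BITS = 7

def split_address(addr):
    """Split a 32-bit byte address into (tag, index, offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

block_bytes = 1 << OFFSET_BITS   # 64 bytes per block
block_words = block_bytes // 4   # 16 words per block
entries = 1 << INDEX_BITS        # 128 cache entries

print(block_words, entries)           # 16 128
print(split_address(1000))            # (0, 15, 40)
```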
15. Although a larger block size [increases? or decreases?] the miss rate, it can also [increase? or decrease?] the miss penalty. If the miss penalty increases linearly with the block size, larger blocks could easily lead to [higher? or lower?] performance.
• decreases; increase; lower
16. The speed of the memory system affects the designer’s decision on the size of the cache block. Which of the following cache designer guidelines are generally valid? Multiple answers. [ ]
a. The shorter the memory latency, the larger the cache block.
b. The shorter the memory latency, the smaller the cache block.
c. The higher the memory bandwidth, the larger the cache block.
d. The higher the memory bandwidth, the smaller the cache block.
• b, c
17. Consider a cache with 128 blocks and a block size of 8 bytes. What block number does byte address 1000 map
to? Show all your work.
• block number = (block address) mod (number of cache blocks)
  o ⌊1000 / 8⌋ = 125 (convert the byte address into a block address)
  o 125 mod 128 = 125, the block number
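The two-step mapping can be checked in a couple of lines of Python:

```python
block_size = 8     # bytes per block
num_blocks = 128

byte_address = 1000
block_address = byte_address // block_size   # floor(1000 / 8) = 125
block_number = block_address % num_blocks    # 125 mod 128 = 125

print(block_address, block_number)  # 125 125
```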
18. How many tag bits are required for a direct-mapped cache with 64 KByte of data and 32-word blocks? Assume a 32-bit address. Here, 1K = 1024. Show all your work.
• 64 KByte of data is 16K words.
• With a block size of 32 words, there are (64 × 1024 bytes) / (32 × 4 bytes) = 512 (= 2^9) blocks.
• A 32-word block is 32 × 4 bytes = 2^7 bytes (a 7-bit byte offset).
• Tag bits per block = 32 − (9 + 7) = 16.
• Thus, the total tag storage is 512 × 16 = 8192 bits (= 8K bits).
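The same derivation, as a Python sketch of the bit budgeting:

```python
ADDR_BITS = 32
cache_bytes = 64 * 1024          # 64 KByte of data
block_words = 32
block_bytes = block_words * 4    # 128 bytes = 2^7 -> 7 offset bits

num_blocks = cache_bytes // block_bytes        # 512 = 2^9 -> 9 index bits
offset_bits = block_bytes.bit_length() - 1     # 7 (block size is a power of two)
index_bits = num_blocks.bit_length() - 1       # 9

tag_bits = ADDR_BITS - index_bits - offset_bits   # 32 - (9 + 7) = 16
total_tag_bits = num_blocks * tag_bits            # 512 x 16 = 8192 bits

print(num_blocks, tag_bits, total_tag_bits)  # 512 16 8192
```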
19. Suppose a cache of 8K blocks, 2-word block size, and a 32-bit address. For a 2-way set-associative cache, find (i) the total number of sets and (ii) the total number of tag bits. Here, 1K = 1024. Show all your work.
• A 2-word cache block is 8 bytes = 2^3 bytes (3 bits for the byte offset).
• (8 × 1024 blocks) / 2 ways = 4096 sets = 4 × 1024 = 2^12 sets (12 bits for the set index).
  o Total number of sets = 4096 (or 4K)
• Tag bits per block = 32 − (12 + 3) = 17 bits.
  o Total number of tag bits = 17 × 2 × 4096 = 139,264 bits (or 136 Kbits)
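The set-associative variant of the same calculation, in Python:

```python
ADDR_BITS = 32
num_blocks = 8 * 1024    # 8K blocks
block_words = 2
ways = 2                 # 2-way set associative

block_bytes = block_words * 4                 # 8 bytes = 2^3 -> 3 offset bits
num_sets = num_blocks // ways                 # 4096 sets = 2^12 -> 12 set-index bits
offset_bits = block_bytes.bit_length() - 1    # 3
set_bits = num_sets.bit_length() - 1          # 12

tag_bits = ADDR_BITS - set_bits - offset_bits    # 32 - (12 + 3) = 17
total_tag_bits = tag_bits * ways * num_sets      # 17 x 2 x 4096 = 139264 bits

print(num_sets, tag_bits, total_tag_bits)  # 4096 17 139264
```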
20. There is a small 2-way set-associative cache, consisting of four one-word blocks. (i) How many hits occur and (ii) show the final cache contents for the block address sequence 0, 8, 0, 6, 8, 3, 1, 8. Here, least recently used (LRU) is the cache replacement policy. Show all your work – fill out the cache and the hit/miss column. (With four one-word blocks and two ways there are two sets, and a block address maps to set = address mod 2.)

Addr  Hit/Miss  Set 0                  Set 1
0     miss      Memory[0]
8     miss      Memory[0], Memory[8]
0     hit       Memory[0], Memory[8]
6     miss      Memory[0], Memory[6]
8     miss      Memory[8], Memory[6]
3     miss      Memory[8], Memory[6]   Memory[3]
1     miss      Memory[8], Memory[6]   Memory[3], Memory[1]
8     hit       Memory[8], Memory[6]   Memory[3], Memory[1]

• 2 hits; final contents: Memory[8] and Memory[6] in set 0, Memory[3] and Memory[1] in set 1.
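The hit/miss trace above can be reproduced with a short LRU simulation (a Python sketch; an OrderedDict's insertion order is used to track recency):

```python
from collections import OrderedDict

# 2-way set-associative cache with four one-word blocks -> 2 sets.
NUM_SETS, WAYS = 2, 2
sets = [OrderedDict() for _ in range(NUM_SETS)]  # first key = least recently used
hits = 0

for addr in [0, 8, 0, 6, 8, 3, 1, 8]:
    s = sets[addr % NUM_SETS]        # block address mod number of sets
    if addr in s:
        hits += 1
        s.move_to_end(addr)          # mark as most recently used
    else:
        if len(s) == WAYS:
            s.popitem(last=False)    # evict the least recently used block
        s[addr] = f"Memory[{addr}]"

print(hits)                          # 2
print([sorted(s) for s in sets])     # [[6, 8], [1, 3]]
```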