CSCE 614 (Spring 2015)                                        Eun Jung Kim
Computer Architecture

Homework #3
COVER SHEET
(Due: Beginning of class on 03/23/15)

Name:
ID Number:

Directions: Write your answers on the sheets provided. Submit with the COVER SHEET. If you need additional sheets for any of the problems, add as many blank sheets as you require. Print your name clearly. No late homework will be accepted. You are expected to write up your solutions on your own, without referring to other students' work or to solutions you may find on the web.

The total score is 160 points. This homework is due at the beginning of class on Monday, Mar 23, 2015.


Dynamic Hardware Branch Prediction

1. Suppose the following branch instructions have been executed.

    Label   Address    Branch   Taken/Not Taken
    1       …101101    b1       T
    2       …101101    b1       T
    3       …101101    b1       NT
    4       …110011    b2       NT
    5       …110011    b2       T

   Draw a (1, 2) predictor and indicate the state of the buffer (with 4 prediction entries) after executing the above branch instructions. Also show the prediction for each branch instruction. Assume that the predictor uses the 2-bit saturating counter implemented in SimpleScalar by default. (20)

    Instruction   Prediction
    1
    2
    3
    4
    5

2. Suppose we have a deeply pipelined processor, for which we implement a branch-target buffer for the conditional branches only. Assume that the misprediction penalty is always 4 cycles and the buffer miss penalty is always 3 cycles. Assume a 90% hit rate, 95% accuracy, and 15% conditional branch frequency. How much faster is the processor with the branch-target buffer than a processor that has a fixed 2-cycle branch penalty? Assume a base CPI without branch stalls of 1. (10)

3. Suppose the following branch instructions have been executed.

    Label   Address    Branch   Taken/Not Taken
    1       …101101    b1       T
    2       …101101    b1       NT
    3       …101101    b1       T
    4       …110011    b2       NT
    5       …110011    b2       NT

   a) Draw a (2, 2) predictor and indicate the state of the buffer (with 4 prediction entries per table) after executing the above branch instructions. Also show the prediction for each branch instruction. Assume that the predictor uses the 2-bit saturating counter implemented in SimpleScalar by default. (20)

    Instruction   Prediction
    1
    2
    3
    4
    5

   b) Show the prediction for each branch instruction using a tournament predictor with 2 entries. Also show the final contents of the Predictor 1 buffer and the Predictor 2 buffer. Predictor 1 and Predictor 2 are 2-bit saturating counters with 2 prediction entries. Note that Predictor 1 is a local predictor while Predictor 2 is global. Assume all table and buffer contents are initialized to 'zero'. (20)

    Instruction   1   2   3   4   5
    Prediction

   c) Explain why a branch target buffer (BTB) reduces the CPI compared to a branch prediction buffer. Explain why the BTB must include tags for the buffer entries while the branch prediction buffer does not. (10)

4. With a MIPS pipeline architecture, we can have four different branch alternatives, as follows. Assume we have a 2 GHz machine for which the following measurements have been made, and that the CPI of every instruction other than a branch is 1. What is the MIPS rate for each scheme? Assume 5% unconditional branches, 10% untaken conditional branches, and 7% taken conditional branches. Use the following performance penalty table. (10)

    Scheduling scheme    Penalty
    Stall pipeline       4
    Predict taken        1
    Predict not taken    1
    Delayed branch       0.5


Dynamic Pipelining and Hardware Speculation

5. Assume there is a floating-point unit with 2 add, 2 multiply/divide, and 2 load/store units, with execution latencies of 2 clock cycles for add, 10 for multiply, 40 for divide, and 3 for load/store (1 for address calculation, 2 for memory access); an integer unit for ALU operations; another unit for address calculation; and a third unit for branch condition evaluation. For the code sequence below, answer the following questions. Note that the number of reservation stations is the same as the number of functional units and that the processor is single issue.

    1. LD     F6,34(R2)
    2. LD     F2,45(R3)
    3. MULTD  F0,F2,F4
    4. SUBD   F8,F6,F2
    5. DIVD   F10,F0,F6
    6. ADDD   F6,F8,F2

   a. Identify all the data hazards in the above code fragment, along with the type of each hazard identified. You can mark them appropriately on the code fragment and use acronyms to specify the hazard type. (10)

   b. For the above code sequence, show the status tables when all instructions have completed with single-issue Tomasulo's algorithm. For the instruction status table, list the clock cycle when each event happens. (20)

    Instruction Status

    Instruction         Issue   Memory Access   Execute   Write Result
    LD    F6,34(R2)
    LD    F2,45(R3)
    MULTD F0,F2,F4
    SUBD  F8,F6,F2
    DIVD  F10,F0,F6
    ADDD  F6,F8,F2

    Reservation Stations

    Name       Busy   Op   Vj   Vk   Qj   Qk   A
    ADD1
    ADD2
    MUL/DIV1
    MUL/DIV2
    LD/STR1
    LD/STR2

    Register Status

    Field   F0   F2   F4   F6   F8   F10   F12   ...   F30
    Qi

6. Consider the execution of a loop on a two-issue processor. Assume there are two functional units: one for effective address calculation and integer ALU operations, and the other for branch condition evaluation. Also assume that there is 1 CDB and that up to two instructions of any type can commit per clock cycle. Assume that branches are single issue but that branch prediction is perfect. (Latency: integer ALU operation 1, load and store (memory access only) 2, FP ALU operation 3.)

   a. Fill out the time table of a pipeline with Dynamic Scheduling. (20)

    Instruction            Issue   Execute   Memory Access   Write CDB
    L.D    F0, 0(R1)
    ADD.D  F4, F0, F2
    S.D    F4, 0(R1)
    DADDIU R1, R1, #-8
    BNE    R1, R2, LOOP
    L.D    F0, 0(R1)
    ADD.D  F4, F0, F2
    S.D    F4, 0(R1)
    DADDIU R1, R1, #-8
    BNE    R1, R2, LOOP

   b. Fill out the time table of a pipeline with Hardware Speculation. (20)

    Instruction            Issue   Execute   Memory Access   Write CDB   Commit
    L.D    F0, 0(R1)
    ADD.D  F4, F0, F2
    S.D    F4, 0(R1)
    DADDIU R1, R1, #-8
    BNE    R1, R2, LOOP
    L.D    F0, 0(R1)
    ADD.D  F4, F0, F2
    S.D    F4, 0(R1)
    DADDIU R1, R1, #-8
    BNE    R1, R2, LOOP
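
Reference note for Problems 1 and 3: the update rule of a 2-bit saturating counter can be sketched as below. This is a generic textbook counter, not SimpleScalar's code. The names TwoBitCounter and run_trace, the taken threshold (states 2 and 3), and the initial state of 0 are assumptions made for illustration, and the example trace is hypothetical rather than one of the traces in the problems.

```python
class TwoBitCounter:
    """Generic 2-bit saturating counter with states 0..3 (illustrative only)."""

    def __init__(self, state=0):
        # Assumption: the initial state is a parameter; a given simulator's
        # default initialization may differ from 0.
        self.state = state

    def predict(self):
        # Assumption: states 2 and 3 predict taken, states 0 and 1 not taken.
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the actual outcome, saturating at 0 and 3.
        if taken:
            self.state = min(self.state + 1, 3)
        else:
            self.state = max(self.state - 1, 0)


def run_trace(outcomes, state=0):
    """Return (predictions, final_state) for a list of True/False outcomes."""
    counter = TwoBitCounter(state)
    predictions = []
    for taken in outcomes:
        predictions.append(counter.predict())  # predict before seeing the outcome
        counter.update(taken)                  # then train on the outcome
    return predictions, counter.state


if __name__ == "__main__":
    # Hypothetical trace: taken, taken, taken, not taken.
    print(run_trace([True, True, True, False]))
```

A correlating (m, n) predictor or a tournament predictor would keep several such counters and select among them with an index built from address bits and history bits; that selection logic is deliberately left out of this sketch.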