ECE369 Pipelining

addm (rs), rt    # Memory[R[rs]] = R[rt] + Memory[R[rs]]

Assume that memory can be read and written in the same cycle (like the register file, although this is unlikely to be efficient in a real machine). All instructions use the same format (shown below), but not all instructions use all of the fields. Assume that each unused field is set to 0.

Field   Bits
op      31-26
rs      25-21
rt      20-16
rd      15-11
imm     10-0

Instr   RegDst  RegWrite  MemRead  MemWrite  ALUsrc  MemToALU  DataSrc  ALUOp  PCSrc
addm    x       0         1        1         0       0         1        Add    0

Pipelining

One CPU manufacturer has proposed the 10-stage pipeline shown below. Here are the correspondences between this and the MIPS pipeline:
• Instructions are fetched in the FET stage.
• Register reading is performed in the REG stage.
• ALU operations and memory accesses are both done in the EXE stage.
• Branches are resolved in the DET stage.
• WRB is the writeback stage.
• A write and a read on memory or the register file can occur in the same cycle.

Without forwarding, how many stall cycles are needed for the following code? Show your work to get credit.

    lw  $t0, 0($a0)
    add $v1, $t0, $t0

Solution

Assume that the initial value of R3 is R2+396. How many cycles does this loop take to execute?

Loop: LW   R1, 0(R2)
      ADDI R1, R1, #1
      SW   R1, 0(R2)
      ADDI R2, R2, #4
      SUB  R4, R3, R2
      BNEZ R4, Loop

- No forwarding or bypassing hardware.
- All memory and register writes occur during the first half of the clock cycle and reads occur during the second half (a register read and a register write in the same cycle forward through the register file).
- Branching is handled by flushing the pipeline, and branches are resolved in the Memory stage.

Branches are resolved in MEM. The second iteration starts 17 clock cycles after the first. The last iteration takes 18 cycles. R2 advances by 4 each iteration until it equals R3, so the loop executes 396/4 = 99 times. Total: 98*17+18 = 1684 cycles.

Assume that the initial value of R3 is R2+396. How many cycles does this loop take to execute?

Loop: LW   R1, 0(R2)
      ADDI R1, R1, #1
      SW   R1, 0(R2)
      ADDI R2, R2, #4
      SUB  R4, R3, R2
      BNEZ R4, Loop

- With forwarding and bypassing hardware.
- All memory and register writes occur during the first half of the clock cycle and reads occur during the second half (a register read and a register write in the same cycle forward through the register file).
- Assume that the branch is resolved in the Memory stage and handled by predicting it as not taken. {Use (m) for branch misprediction in the table}

Branches are resolved in MEM. The second iteration starts 10 clock cycles after the first. The last iteration takes 11 cycles. The loop executes 99 times. Total: 98*10+11 = 991 cycles.

Assume that the initial value of R3 is R2+396. How many cycles does this loop take to execute?

Loop: LW   R1, 0(R2)
      ADDI R1, R1, #1
      SW   R1, 0(R2)
      ADDI R2, R2, #4
      SUB  R4, R3, R2
      BNEZ R4, Loop

Assuming the MIPS pipeline with a single-cycle delayed branch and normal forwarding and bypassing hardware:
• Schedule the instructions in the loop, including the branch delay slot.
• You may reorder the instructions and modify the individual instruction operands, but do not undertake other loop transformations that change the number or opcode of the instructions in the loop.
• Show a pipeline timing diagram and compute the number of cycles needed to execute the entire loop.

Loop: LW   R1, 0(R2)
      ADDI R1, R1, #1
      SW   R1, 0(R2)
      ADDI R2, R2, #4
      SUB  R4, R3, R2
      BNEZ R4, Loop

Total: 98*6+10 = 598 clocks
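
The three totals above share one pattern: every iteration except the last costs the steady-state spacing between iteration starts, and the last iteration costs its full latency. A minimal Python sketch of that arithmetic follows as a cross-check. The function name `loop_cycles` and its parameter names are hypothetical, chosen only for illustration; the per-iteration cycle counts are taken from the solutions above, not derived by the code.

```python
def loop_cycles(byte_span, stride, steady_cycles, last_cycles):
    """Total cycles for a loop whose pointer advances by `stride` bytes
    until it has covered `byte_span` bytes.

    steady_cycles: cycles between the starts of consecutive iterations
    last_cycles:   cycles taken by the final iteration
    (Both values are inputs read off the pipeline diagrams.)
    """
    iterations = byte_span // stride          # 396 / 4 = 99 iterations
    return (iterations - 1) * steady_cycles + last_cycles


# R3 starts at R2 + 396 and R2 grows by 4 per iteration.
no_forwarding = loop_cycles(396, 4, steady_cycles=17, last_cycles=18)
with_forwarding = loop_cycles(396, 4, steady_cycles=10, last_cycles=11)
scheduled = loop_cycles(396, 4, steady_cycles=6, last_cycles=10)

print(no_forwarding, with_forwarding, scheduled)
```

Keeping the steady-state and last-iteration costs as separate inputs mirrors how the solutions are stated; the sketch does not model hazards or the pipeline itself.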