Multiple Cycle Implementation of MIPS-Lite CPU CS365 Lecture 8 Review on Single Cycle Datapath Subset of the core MIPS ISA Arithmetic/Logic instructions: AND, OR, ADD, SUB, SLT Data flow instructions: LW, SW Branch instructions: BEQ, J Five steps in processor design Analyze the instruction Determine the datapath components Assemble the components Determine the control Design the control unit Multi-cycle CPU CS465 2 D. Barbara Complete Single Cycle Datapath How lw, sw, R-Type, beq, j instructions work? Multi-cycle CPU CS465 3 D. Barbara Delays in Single Cycle Datapath 2ns 2ns 2ns 2ns 2ns 1ns What are the delays for lw, sw, R-Type, beq, j instructions? Multi-cycle CPU CS465 4 D. Barbara Single Cycle Implementation Calculate cycle time assuming negligible delays except: memory (2ns), ALU and adders (2ns), register file access (1ns) Instruction Instruction Register ALU Register/ class Fetch Access Memory Access R-Type Load Store X X X X X X X X X Branch Jump X X X X Multi-cycle CPU CS465 R M M Register Access X 6 8 7 5 2 5 D. Barbara Remarks on Single Cycle Datapath Single cycle datapath ensures the execution of any instruction within one clock cycle Functional units must be duplicated if used multiple times by one instruction, e.g. ALU Why? Functional units can be shared if used by different instructions Single cycle datapath is not efficient in time Clock cycle time is determined by the instruction taking the longest time, eg. lw in MIPS Variable clock cycle time is too complicated Alternative design/implementation approaches Multiple clock cycles per instruction – this lecture Pipelining – Chapter 6 Multi-cycle CPU CS465 6 D. Barbara Multi-cycle Approach Single cycle problems: What if we had a more complicated instruction like FP? Wasteful of area One solution: Use a “smaller” cycle time Have different instructions take different numbers of cycles A “multicycle” datapath Instruction register PC Data Address A Memory Data Multi-cycle CPU Register # Instruction or data ALU Registers ALUOut Register # Memory data register B Register # CS465 7 D. Barbara Multistage Execution of Instructions We have informally described instructions as executing in several steps Instruction fetch and PC update Instruction decode and reading from registers Performing an ALU computation Reading/writing data memory Storing data back to register file What if we made these stages explicit in the hardware design? Multi-cycle CPU CS465 8 D. Barbara Benefits Performance benefits Each instruction can execute only the stages that are necessary E.g. branch instruction beq Instructions can complete as soon as possible, instead of being limited by the slowest instruction Cost benefits As an added bonus, we can eliminate some of the extra hardware from the single-cycle datapath We will restrict to using each functional unit once per cycle We could reuse some units in a different cycle during the execution of a single instruction E.g. beq can compare operand and compute target in different cycles Multi-cycle CPU CS465 9 D. Barbara Multicycle Approach We will be reusing functional units ALU used to compute address and to increment PC Memory used for instruction and data Unlike the single cycle implementation, our control signals will not be determined solely by instruction E.g., what should the ALU do for a “subtract” instruction? Need to specify not only instruction but also which cycle in instruction’s execution We will use a finite state machine for control Multi-cycle CPU CS465 10 D. Barbara Review: Finite State Machines Finite state machines: A set of states and Next state function (determined by current state and the input) Output function (determined by current state and possibly input) Current state Next-state function Next state Clock Inputs Output function Outputs There are two types of FSM according to the output function Multi-cycle CPU CS465 11 D. Barbara Moore Machine vs. Mealy Machine Moore machine The output function depends on the current state Mealy machine The output function depends on both the current state and the current input Current state Next-state function Next state Clock Inputs Output function Outputs We will use a Moore machine Refer to Appendix B.10 (page B-67) Multi-cycle CPU CS465 12 D. Barbara Multicycle Approach Break up the instructions into steps, each step takes a cycle Balance the amount of work to be done Restrict each cycle to use only one major functional unit At the end of a cycle PC Store values for use in later cycles Introduce additional “internal” registers 0 M u x 1 Address Memory Instruction [25– 21] Read register 1 Instruction [20– 16] Read Read register 2 data 1 Registers Write Read register data 2 MemData Instruction [15– 0] Write data Instruction register Instruction [15– 0] Memory data register Multi-cycle CPU 0 M Instruction u x [15– 11] 1 A B 0 M u x 1 Sign extend CS465 32 Zero ALU ALU result ALUOut 0 4 Write data 16 0 M u x 1 1 M u 2 x 3 Shift left 2 13 D. Barbara Changes to Datapath A single memory unit is used to store both instructions and data Memory address specified by PC or ALUout Need a new multiplexor (IorD) to control whether the memory is being accessed to read an instruction or read/write data (for lw/sw) Instruction register PC Data Address A Register # Instruction Memory or data Data Multi-cycle CPU ALU Registers ALUOut Register # Memory data register B Register # CS465 14 D. Barbara Changes to Datapath New internal registers added to support multi-cycle operation Store data output by functional elements for use in subsequent cycles Memory access: Instruction register (IR), memory data register (MDR); register access:A, B; ALU op: ALUOut All except IR hold data only between adjacent cycles Only IR needs to hold the instruction until the end of execution of that instruction, hence need a write control Instruction register PC Data Address A Memory Data Multi-cycle CPU Register # Instruction or data ALU Registers ALUOut Register # Memory data register B Register # CS465 15 D. Barbara Changes to Datapath ALU used for all arithmetic (including computing branch target address and PC+4) Need a new multiplexor for top input to ALU (ALUSrcA): choose between PC and register A Bottom input multiplexor (ALUSrcB) to ALU extended to have four inputs (instead of two in single cycle datapath): register B, signextended immediate, address offset (sign-extended and shifted), constant 4 Instruction register PC Data Address A Register # Instruction Memory or data Data Multi-cycle CPU ALU Registers ALUOut Register # Memory data register B Register # CS465 16 D. Barbara Multicycle Datapath with Control Figure 5.27 Multi-cycle CPU CS465 17 D. Barbara Multicycle Datapath & Control PCWriteCond PCSource PCWrite ALUOp IorD Outputs ALUSrcB MemRead ALUSrcA Control MemWrite RegWrite MemtoReg Op RegDst IRWrite [5– 0] 0 M 26 Instruction [25– 0] PC 0 M u x 1 Shift left 2 Instruction [31-26] Address Memory MemData Write data Instruction [25– 21] Read register 1 Instruction [20– 16] Read Read register 2 data 1 Registers Write Read register data 2 Write data Instruction [15– 0] Instruction register Instruction [15– 0] Memory data register 0 M Instruction u x [15– 11] 1 B 4 0 M u x 1 16 Sign extend 32 Shift left 2 1 u x 2 PC [31-28] 0 M u x 1 A 28 Jump address [31-0] Zero ALU ALU result ALUOut 0 1 M u 2 x 3 ALU control Instruction [5– 0] With support to jump and branch Multi-cycle CPU CS465 Figure 5.28 18 D. Barbara Five Execution Steps Instruction fetch Instruction decode and register fetch Execution, memory address computation, or branch completion Memory access or R-type instruction completion Write-back step INSTRUCTIONS TAKE FROM 3 - 5 CYCLES! Multi-cycle CPU CS465 19 D. Barbara High Level View of FSM Control Figure 5.31 Multi-cycle CPU CS465 20 D. Barbara Step 1: Instruction Fetch Operations Use PC to get instruction and put it in the Instruction Register Increment the PC by 4 and put the result back in PC Described in RTL (Register-Transfer Language) IR = Memory[PC]; PC = PC + 4; Control signals What is the advantage of updating the PC now? Assert MemRead, IRwrite Set IorD to 0 select PC as the source of the address Increment PC by 4: setting ALUSrcA to 0 (sending PC to ALU), ALUSrcB to 01 (sending 4), ALUOp 00 (for add), store PC+4 to PC PCSource 00, and PCWrite 1 21 Multi-cycle CPU CS465 D. Barbara Step 2: Instr. Decode & Reg. Fetch Operations Read registers rs and rt in case we need them Compute the branch address in case the instruction is a branch RTL: A = Reg[IR[25-21]]; B = Reg[IR[20-16]]; ALUOut = PC + (sign-extend(IR[15-0]) << 2); Control signals We are not setting any control lines based on the instruction type yet (we are busy "decoding" it in our control logic) Set ALUSrcA to 0 so that PC is sent to ALU, ALUSrcB to 11 so that sign-extended and shifted offset to ALU Multi-cycle CPU CS465 22 D. Barbara Instruction Fetch & Decode Instruction decode/ Register fetch Instruction fetch 0 1 ALUSrcA = 0 ALUSrcB = 11 ALUOp = 00 (Op = 'JMP') Start MemRead ALUSrcA = 0 IorD = 0 IRWrite ALUSrcB = 01 ALUOp = 00 PCWrite PCSource = 00 Memory reference FSM (Figure 5.33) Multi-cycle CPU R-type FSM (Figure 5.34) Branch FSM (Figure 5.35) CS465 Jump FSM (Figure 5.36) 23 D. Barbara Step 3: Execution ALU is performing one of three functions, based on instruction type Need to set ALUSrcA, ALUSrcB, ALUOp Memory reference (LW/SW) ALUOut = A + sign-extend(IR[15-0]); R-type: ALUOut = A op B; Branch: if (A==B) PC = ALUOut; Need to set PCWriteCond, PCSource, use Zero Jump PC = PC[31:28]|(IR[25:0], 2’b00) Need to set PCWrite, PCSource Multi-cycle CPU CS465 24 D. Barbara Step 4: R-type or Memory-access Loads and stores access memory MDR = Memory[ALUOut]; or Memory[ALUOut] = B; Need to set MemRead/MemWrite, IorD R-type instructions finish Reg[IR[15-11]] = ALUOut; Need to set RegDst, RegWrite, MemtoReg The write actually takes place at the end of the cycle on the edge Multi-cycle CPU CS465 25 D. Barbara Step 5: Write-back For LW instruction: Reg[IR[20-16]]= MDR; Need to set RegWrite, MemtoReg, RegDst What about all the other instructions? Multi-cycle CPU CS465 26 D. Barbara FSM for Memory Access Instr. From state 1 (Op = 'LW') or (Op = 'SW') Memory address computation 2 (Op = 'LW') ALUSrcA = 1 ALUSrcB = 10 ALUOp = 00 Memory access 3 Memory access 5 MemRead IorD = 1 MemWrite IorD = 1 Write-back step 4 To state 0 (Figure 5.32) RegWrite MemtoReg = 1 RegDst = 0 Multi-cycle CPU CS465 27 D. Barbara FSM for R-format Instructions From state 1 (Op = R-type) Execution 6 ALUSrcA = 1 ALUSrcB = 00 ALUOp = 10 R-type completion 7 RegDst = 1 RegWrite MemtoReg = 0 To state 0 (Figure 5.32) Multi-cycle CPU CS465 28 D. Barbara FSM for Branch Instruction From state 1 (Op = 'BEQ') Branch completion 8 ALUSrcA = 1 ALUSrcB = 00 ALUOp = 01 PCWriteCond PCSource = 01 To state 0 (Figure 5.32) Multi-cycle CPU CS465 29 D. Barbara FSM for Jump From state 1 (Op = 'J') Jump completion 9 PCWrite PCSource = 10 To state 0 (Figure 5.32) Multi-cycle CPU CS465 30 D. Barbara Summary of Steps Step name Instruction fetch Action for R-type instructions Instruction decode/register fetch Action for memory-reference Action for instructions branches IR = Memory[PC] PC = PC + 4 A = Reg [IR[25-21]] B = Reg [IR[20-16]] ALUOut = PC + (sign-extend (IR[15-0]) << 2) Execution, address computation, branch/ jump completion ALUOut = A op B ALUOut = A + sign-extend (IR[15-0]) Memory access or R-type completion Reg [IR[15-11]] = ALUOut Load: MDR = Memory[ALUOut] or Store: Memory [ALUOut] = B Memory read completion Multi-cycle CPU if (A ==B) then PC = ALUOut Action for jumps PC = PC [31-28] II (IR[25-0]<<2) Load: Reg[IR[20-16]] = MDR CS465 31 D. Barbara Complete Finite State Machine Instruction decode/ register fetch Instruction fetch Start Memory address computation MemRead ALUSrcA = 0 IorD = 0 IRWrite ALUSrcB = 01 ALUOp = 00 PCWrite PC Source = 00 6 ALUSrcA = 1 ALUSrcB = 10 ALUOp = 00 (Op = 'LW') ALUSrcA = 0 ALUSrcB = 11 ALUOp = 00 Branch completion Execution 2 8 ALUSrcA = 1 ALUSrcB = 00 ALUOp = 10 Memory access 3 1 Memory access 5 MemRead IorD = 1 (Op = 'J') 0 Jump completion 9 ALUSrcA = 1 ALUSrcB = 00 ALUOp = 01 PC WriteCond PCSource = 01 PCWrite PC Source = 10 R-type completion 7 MemWrite IorD = 1 RegDst = 1 RegWrite MemtoReg = 0 Write-back step 4 R egDst = 0 RegWrite MemtoReg = 1 Multi-cycle CPU CS465 32 D. Barbara Implementing the Control Value of control signals is dependent upon: Use the information we’ve accumulated to specify a finite state machine What instruction is being executed Which step is being performed Specify the finite state machine graphically, or Use microprogramming Implementation can be derived from specification Multi-cycle CPU CS465 33 D. Barbara Finite State Machine for Control Implementation: PCW rite PCW riteCond IorD M em Read M em W rite IRW rite Control logic M em toReg PCS ource ALUO p Outputs ALUSrcB ALUSrcA RegW rite RegD st NS3 NS2 NS1 NS0 Instruction register opcode field Multi-cycle CPU S0 S1 S2 S3 Op0 Op1 Op2 Op3 Op4 Inputs Op5 State register CS465 34 D. Barbara Simple Questions How many cycles will it take to execute this code? Label: lw $t2, 0($t3) lw $t3, 4($t3) beq $t2, $t3, Label #assume not taken add $t5, $t2, $t3 sw $t5, 8($t3) ... What is going on during the 8th cycle of execution? In what cycle does the actual addition of $t2 and $t3 takes place? Multi-cycle CPU CS465 35 D. Barbara Exceptions Hardest part of control is to implement exceptions & interrupts Type of event From where? I/O device request Invoke the operating system from user program Arithmetic overflow Using an undefined instruction Hardware malfunctions External Internal MIPS terminology Interrupt Exception Internal Internal Exception Exception Multi-cycle CPU Internal or External CS465 Exception or interrupt 36 D. Barbara How Are Exceptions Handled? In our design, we will consider two types of exceptions Arithmetic overflow Execution of an undefined instruction Actions on exception Save address of offending instruction in the Exception Program Counter (EPC) Transfer control to the operating system at a pre-specified address (exception handler) OS then takes appropriate action Multi-cycle CPU CS465 37 D. Barbara Exception Handling For the OS to take appropriate action, it must know the reason for the exception Two ways to communicate reason to OS Have a Status register which holds a field that indicates the reason for the exception Vectored interrupts Address to which control is transferred depends upon the cause of the exception MIPS uses first method above; it has a register called Cause (in addition to the EPC register) Undefined instruction =0 Arithmetic overflow =1 Multi-cycle CPU CS465 38 D. Barbara CPU w/ Exception Support CauseWrite IntCause EPCWrite PCSource ALUOp PCWriteCond PCWrite IorD Outputs MemRead MemWrite ALUSrcB Control ALUSrcA MemtoReg IRWrite RegWrite Op [5– 0] RegDst 0 26 Instruction [25– 0] PC 0 M u x 1 Shift left 2 Instruction [31-26] Address Memory MemData Read register 1 Instruction [20– 16] Read Read register 2 data 1 Registers Write Read register data 2 Write data Instruction register Instruction [15– 0] Memory data register 0 M Instruction u x [15– 11] 1 B 4 Write data 0 M u x 1 Zero ALU ALU result Sign extend 32 Shift left 2 3 0 1M u 2 x 3 ALUOut 0 1 16 u x PC [31-28] 0 M u x 1 A 1M 2 CO 00 00 00 Instruction [25– 21] Instruction [15– 0] 28 Jump address [31-0] EPC 0 M u x 1 Cause ALU control Instruction [5– 0] Multi-cycle CPU CS465 39 D. Barbara Exception Handling Datapath additions Control signals EPC, Cause (for undefined instruction, Cause = 0, arithmetic overflow Cause = 1) EPCWrite, CauseWrite IntCause (sets the Cause register) PCSrc has to be modified so that one of its sources is the OS entry point Three steps 1. 2. 3. Write Cause EPC = PC – 4 (Have to use ALU, so need to expand multiplexors for ALUSrcA and ALUSrcB Write PC Multi-cycle CPU CS465 40 D. Barbara CPU w/ Exception Support CauseWrite IntCause EPCWrite PCSource ALUOp PCWriteCond PCWrite IorD Outputs MemRead MemWrite ALUSrcB Control ALUSrcA MemtoReg IRWrite RegWrite Op [5– 0] RegDst 0 26 Instruction [25– 0] PC 0 M u x 1 Shift left 2 Instruction [31-26] Address Memory MemData Read register 1 Instruction [20– 16] Read Read register 2 data 1 Registers Write Read register data 2 Write data Instruction register Instruction [15– 0] Memory data register 0 M Instruction u x [15– 11] 1 B 4 Write data 0 M u x 1 Zero ALU ALU result Sign extend 32 Shift left 2 3 0 1M u 2 x 3 ALUOut 0 1 16 u x PC [31-28] 0 M u x 1 A 1M 2 CO 00 00 00 Instruction [25– 21] Instruction [15– 0] 28 Jump address [31-0] EPC 0 M u x 1 Cause ALU control Instruction [5– 0] Multi-cycle CPU CS465 41 D. Barbara States for Handling Exceptions 11 IntCause = 1 CauseWrite ALUSrcA = 0 ALUSrcB = 01 ALUOp = 01 EPCWrite PCWrite PCSource = 11 10 IntCause = 0 CauseWrite ALUSrcA = 0 ALUSrcB = 01 ALUOp = 01 EPCWrite PCWrite PCSource = 11 Multi-cycle CPU To state 0 to begin next instruction CS465 42 D. Barbara FSM including Exception Support Instruction decode/ Register fetch Instruction fetch 0 Memory address computation Branch completion Execution 2 6 (Op = 'LW') ALUSrcA = 1 ALUSrcB = 00 ALUOp = 00 8 ALUSrcA = 1 ALUSrcB = 00 ALUOp = 10 Memory access Memory access 5 MemRead IorD = 1 ALUSrcA = 0 ALUSrcB = 11 ALUOp = 00 ALUSrcA = 1 ALUSrcB = 00 ALUOp = 01 PCWriteCond PCSource = 01 R-type completion 7 MemWrite IorD = 1 (Op = 'J') Start 3 1 MemRead ALUSrcA = 0 IorD = 0 IRWrite ALUSrcB = 01 ALUOp = 00 PCWrite PCSource = 00 11 RegDst = 1 RegWrite MemtoReg = 0 Overflow Jump completion 9 PCWrite PCSource = 10 IntCause = 1 CauseWrite ALUSrcA = 0 ALUSrcB = 01 ALUOp = 01 EPCWrite PCWrite PCSource = 11 10 IntCause = 0 CauseWrite ALUSrcA = 0 ALUSrcB = 01 ALUOp = 01 EPCWrite PCWrite PCSource = 11 Write-back step 4 Overflow RegWrite MemtoReg = 1 RegDst = 0 Multi-cycle CPU CS465 43 D. Barbara Exercise Given the following instruction mix, what is the CPI for the multi-cycle CPU, assuming each step requires 1 clock cycle? Loads: 20%, stores: 10%, branches: 13%, jumps: 2%, ALU: 55% Idea Get the cycles needed for every instruction type: these are their individual CPIs Calculate the average CPI considering the instruction mix (instruction frequency) Multi-cycle CPU CS465 44 D. Barbara Cycles for Individual Instructions Step name Instruction fetch Action for R-type instructions Instruction decode/register fetch Action for memory-reference Action for instructions branches IR = Memory[PC] PC = PC + 4 A = Reg [IR[25-21]] B = Reg [IR[20-16]] ALUOut = PC + (sign-extend (IR[15-0]) << 2) Execution, address computation, branch/ jump completion ALUOut = A op B ALUOut = A + sign-extend (IR[15-0]) Memory access or R-type completion Reg [IR[15-11]] = ALUOut Load: MDR = Memory[ALUOut] or Store: Memory [ALUOut] = B Memory read completion if (A ==B) then PC = ALUOut Action for jumps PC = PC [31-28] II (IR[25-0]<<2) Load: Reg[IR[20-16]] = MDR Loads: 5, stores: 4 ALU instructions: 4 Branches: 3 Jumps: 3 Multi-cycle CPU CS465 45 D. Barbara Exercise Given the following instruction mix, what is the CPI for the multi-cycle CPU, assuming each step requires 1 clock cycle? Loads: 20%, stores: 10%, branches: 13%, jumps: 2%, ALU: 55% CPI for each of them: loads: 5, stores: 4, branches: 3, jumps:3, ALU: 4 Average CPI = 0.205 + 0.104 + 0.133 + 0.023 + 0.554 = 4.05 Compare with the worst-case CPI of 5.0 if all instructions take the same number of cycles Multi-cycle CPU CS465 46 D. Barbara Summary Multiple-cycle datapath and control Break up the instructions into steps, each step takes a cycle Reduced clock cycle time; different instructions take varied cycles to implement Functional units can be reused Typical steps Instruction fetch Instruction decode and register fetch Execution: R-type, memory address computation, condition check Memory access or R-type write Write back from memory Multi-cycle CPU CS465 47 D. Barbara Next Lecture Topic Pipelining Reading Ch6 Patterson and Hennessy Multi-cycle CPU CS465 48 D. Barbara