CprE 381 Computer Organization and Assembly Level Programming, Fall 2013
Chapter 4 The Processor
Zhao Zhang, Iowa State University
Revised from original slides provided by MKP

Week 9 Overview
  Mini Project B
  CPU Pipelining: Pipelined Data Path and Control
  ALU Data Hazards and Forwarding
Chapter 1 — Computer Abstractions and Technology — 2

Mini-Project B Overview
  Implement a single-cycle processor (SCP). There will be three parts:
    1. Part 1, SCPv1: Implement the nine-instruction ISA
    2. Part 2, SCPv2a: Support all the instructions needed to run bubble sort, with coarse-level modeling of datapath elements
    3. Part 3, SCPv2b: Detailed modeling of datapath elements
  There is a bonus project

Project A Late Submission
  Start working on Project B ASAP
  You may submit Mini-Project A late for three weeks (with a 20% late penalty)
    Demo those parts that are working
    The late penalty applies only to those parts that are actually late
  If you demo Project B successfully, you don't have to demo any late part of Project A

Part 1: SCPv1
  Implementing the nine-instruction MIPS ISA
    Memory reference: LW and SW
    Arithmetic/logic: ADD, SUB, AND, OR, SLT
    Branch: BEQ, J
  The textbook provides almost all implementation details
    Datapath and control
    The main control unit (9-bit signals w/o Jump)
    The ALU control unit

Part 1: SCPv1
  Use this diagram as the blueprint for Part 1

SCPv1: Control Signals
  Control signal setting for SCPv1
    It is a truth table

  Inst  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0  Jump
  R-    1       0       0         1         0        0         0       1       0       0
  lw    0       1       1         1         1        0         0       0       0       0
  sw    X       1       X         0         0        1         0       0       0       0
  beq   X       0       X         0         0        0         1       0       1       0
  j     X       X       X         0         0        0         0       X       X       1

  Note: “R-” means R-format

SCPv1: ALU Control
  Truth table for ALU Control

  opcode  ALUOp  Operation         funct   ALU function      ALU control
  lw      00     load word         XXXXXX  add               0010
  sw      00     store word        XXXXXX  add               0010
  beq     01     branch equal      XXXXXX  subtract          0110
  R-type  10     add               100000  add               0010
                 subtract          100010  subtract          0110
                 AND               100100  AND               0000
                 OR                100101  OR                0001
                 set-on-less-than  101010  set-on-less-than  0111
Chapter 4 — The Processor — 8

SCPv1 Fast Prototyping
  You are provided with the following files
    mips32.vhd: A VHDL package
    regfile.vhd: For the register file
    register.vhd: For the PC
    alu.vhd: For the ALU
    adder.vhd: For the PC-related adders
    mem.vhd: The memory, for both instruction memory and data memory

SCPv1 Fast Prototyping
  Rationale behind Part 1: Focus on the structure/organization of the CPU
    The provided components are modeled at a coarse level
    We know that efficient circuit designs exist for those components: memory, register file, ALU, adder, mux and so on
    Work out the details at a later time

Strongly Structural Modeling
  Your CPU composition must be strongly structural
    No behavior modeling can be used. No process statement.
    Limited dataflow modeling (see next)
  Additional requirement: Declare all components in the architecture body of CPU
    Only component instantiation, no entity instantiation

Strongly Structural Modeling
  Acceptable forms of dataflow modeling
    Signal copying/splitting
      opcode <= inst(31 downto 26);
    Signal merging
      j_target <= PC(31 downto 28) & j_offset & "00";
    One level of basic logic gates
      taken_branch <= branch AND zero;

Cpu.vhd
  This is a partial sample
    -- Control Unit
    CONTROL1: control
      port map (opcode, reg_dst, alu_src, mem_to_reg, …);
    -- ALU Control unit
    ALU_CTRL1: alu_ctrl
      port map (alu_op, funct, alu_code);
    -- The mux connected to the dst port of regfile
    DST_MUX : mux2to1
      generic map (M => 5)
      port map (rt, rd, reg_dst, dst);
    …

Datapath and Control Modeling
  For datapath elements and control units, you may use any modeling style (in Part 1)
  The provided components all use behavior modeling for simplicity

mips32.vhd
  package MIPS32 is
    -- Half Cycle Time of the clock signal
    constant HCT : time := 50 ns;
    -- Clock Cycle Time of the clock signal
    constant CCT : time := 2 * HCT;
    -- MIPS32 logic type
    subtype m32_logic is std_logic;
    -- MIPS32 logic vector type
    subtype m32_vector is std_logic_vector;
  Pre-defined constants and types to make coding simpler and consistent

mips32.vhd
    -- Word type, for …
    subtype m32_word is m32_vector(31 downto 0);
    -- Halfword, byte, and bit fields of varying size
    subtype m32_halfword is m32_vector(15 downto 0);
    subtype m32_byte is m32_vector(7 downto 0);
    subtype m32_1bit is m32_logic;
    subtype m32_2bits is m32_vector(1 downto 0);
    subtype m32_3bits is m32_vector(2 downto 0);
    …
  end MIPS32;
  Pre-defined types shorten the names

Alu.vhd
  Why provide the ALU and the other VHDL programs?
    Your implementation might have bugs
    We don't want to fight the bugs on two fronts
  You should test those modules
    Always test any modules that you will use
    The provided modules have been tested
    Some test-bench programs are provided
    Write your own test bench or extend the provided test bench

Alu.vhd
  entity ALU is
    port (rdata1   : in  m32_word;
          rdata2   : in  m32_word;
          alu_code : in  m32_4bits;
          result   : out m32_word;
          zero     : out m32_1bit);
  end entity;

Alu.vhd
  architecture behavior of ALU is
    signal r : m32_word;
  begin
    P_ALU : process (alu_code, rdata1, rdata2)
      variable code, a, b, sum, diff, slt: integer;
    begin
      -- Pre-calculate arithmetic results
      a := to_integer(signed(rdata1));
      b := to_integer(signed(rdata2));
      sum := a + b;
      diff := a - b;
      if (a < b) then
        slt := 1;
      else
        slt := 0;
      end if;

Alu.vhd
      -- Select the result, convert to signal if necessary
      case (alu_code) is
        when "0000" =>  -- AND
          r <= rdata1 AND rdata2;
        when "0010" =>  -- add
          r <= std_logic_vector(to_signed(sum, 32));
        …
      end case;
    end process;
    -- Drive the alu result output
    result <= r;
    -- Drive the zero output
    with r select
      zero <= '1' when x"00000000",
              '0' when others;
  end behavior;
  Coarse-level modeling is easy and reliable, but may not be synthesized efficiently

Regfile.vhd
  entity regfile is
    port(src1   : in  m32_5bits;
         src2   : in  m32_5bits;
         dst    : in  m32_5bits;
         wdata  : in  m32_word;
         rdata1 : out m32_word;
         rdata2 : out m32_word;
         WE     : in  m32_1bit;
         reset  : in  m32_1bit;
         clock  : in  m32_1bit);
  end regfile;
  Caveat: The clock signal is needed in the single-cycle implementation

Regfile.vhd
  architecture behavior of regfile is
    signal reg_array : m32_regval_array;
  begin
    -- Register reset logic
    P_WRITE : process (clock)
      variable r : integer;
    begin
      -- Write/reset logic
      if (rising_edge(clock)) then
        if (reset = '1') then
          for i in 0 to 31 loop
            reg_array(i) <= X"00000000";
          end loop;

Regfile.vhd
        elsif (WE = '1') then
          r := to_integer(unsigned(dst));
          -- Register 0 is never written
          if not (r = 0) then
            reg_array(r) <= wdata;
          end if;
        end if;
      end if;
    end process;

Regfile.vhd
    P_READ : process (clock, src1, src2)
      variable r1, r2 : integer;
    begin
      -- Read logic
      r1 := to_integer(unsigned(src1));
      r2 := to_integer(unsigned(src2));
      rdata1 <= reg_array(r1);
      rdata2 <= reg_array(r2);
    end process;
  end behavior;

Demonstration
  For each of multiple test cases
    Trace the program execution
    Inspect the register and memory contents at the end of execution
  A test case consists of
    MIPS binary code, e.g. in imem.txt
    Data memory content, e.g. in dmem.txt

Test Bench
  Inside the test bench:
    CPU1 : cpu
      port map (imem_addr, inst, dmem_addr, dmem_read, dmem_write, dmem_wmask, dmem_rdata, dmem_wdata, reset, clock);
    INST_MEM : mem
      generic map (mif_filename => "imem.txt")
      port map (imem_addr(9 downto 2), "0000", clock, x"00000000", '0', inst);
    DATA_MEM : mem
      generic map (mif_filename => "dmem.txt")
      port map (dmem_addr(9 downto 2), dmem_wmask, clock, dmem_wdata, dmem_write, dmem_rdata);
  Note: Treat memories as external datapath elements

Instruction Memory
  imem.txt contents (MIF)
    DEPTH=1024;
    WIDTH = 32;
    -- lw  $t0, 0($zero)
    -- lw  $t1, 4($zero)
    -- beq $t0, $t1, +2
    -- add $t0, $t0, $t1
    -- sw  $t0, 8($zero)
    -- noop
    CONTENT BEGIN
    -- Instruction formats
    --R ======-----=====-----=====------
    --I ======-----=====----------------
    --J ======--------------------------
    0 : 10001100000010000000000000000000;
    1 : 10001100000010010000000000000100;
    2 : 00010001000010010000000000000010;
    3 : 00000001000010010100000000100000;
    4 : 10101100000010000000000000001000;
    [5..63] : 00000000;
    END;

Part 2. SCPv2 Prototyping (SCPv2a)
  Support all MIPS instructions used by the bubble sort example
    We have studied how to extend the nine-instruction design to support ADDI, SLL, BNE, and JAL
  For each new instruction, think about
    Datapath: Any new/revised data elements, any new signal connections
    The main control: Any new control signals, any extension to the truth table
    The ALU control: Any extension to the truth table

Part 3. SCPv2b
  SCPv2 Detailed Implementation
  Provide detailed modeling for
    Register file
    ALU
    Adder
  Use your code from Labs 1-4 and Mini-Project A
    You may revise your code
  Your final code should be strongly structural
    Consult your lab TA if you are not sure

Bonus Project Part 1
  Green MIPS SCP (SCP-G)
    Extend SCPv2 to support all integer instructions listed on the green sheet
    Partial credit will be given
  Bonus Project Part 2 is to do a pipelined implementation
  The lab bonus can overflow into your overall grade
    As said, the quiz bonus does not overflow
  The grading details will be finalized

Pipelined CPU
  A natural idea to improve performance
  The devil is in the details
    Pipelined data path and control
    Data hazard from ALU instructions
    Data hazard from load instructions
    Control hazard from branches
    Exception handling in a pipelined processor

SCP With Jumps Added

Performance Issues
  Longest delay determines clock period
    Critical path: load instruction
    Instruction memory → register file → ALU → data memory → register file
  Now we will improve performance by pipelining

§4.5 An Overview of Pipelining
Pipelining Analogy
  Pipelined laundry: overlapping execution
    Parallelism improves performance
  Four loads: Speedup = 8/3.5 = 2.3
  Non-stop: Speedup = 2n/(0.5n + 1.5) ≈ 4 = number of stages

Pipeline Performance
  Look at this example
    In the single-cycle implementation, the critical path is 800ps (one cycle @ 1.25 GHz)
    The longest component latency is 200ps (one cycle @ 5 GHz)
    Note: Latency of mux, extender and so on is ignored

  Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
  lw        200ps        100ps          200ps   200ps          100ps           800ps
  sw        200ps        100ps          200ps   200ps                          700ps
  R-format  200ps        100ps          200ps                  100ps           600ps
  beq       200ps        100ps          200ps                                  500ps

MIPS Pipeline Idea
  If we divide the execution into stages, the clock frequency can be much higher
  Five stages, one step per stage
    1. IF: Instruction fetch from memory
    2. ID: Instruction decode & register read
    3. EX: Execute operation or calculate address
    4. MEM: Access memory operand
    5. WB: Write result back to register

MIPS Pipeline Idea
  General idea: Split the datapath into stages, with the critical path delay of each stage <= 1 clock cycle

Pipeline Performance
  First look at performance gain
    Single-cycle (Tc = 800ps)
    Pipelined (Tc = 200ps)

Pipeline Speedup
  If all stages are balanced, i.e., all take the same time
    $\text{Time between instructions}_{\text{pipelined}} = \dfrac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of stages}}$
    Ideal speedup = N for N-stage pipeline
  If not balanced, speedup is less
    In the example, speedup is up to 4.0
  Speedup due to increased throughput
    Latency (time for each instruction) does not decrease, or even increases

Pipelining and ISA Design
  MIPS ISA designed for pipelining
    All instructions are 32 bits
      Easier to fetch and decode in one cycle
      cf. x86: 1- to 17-byte instructions
    Few and regular instruction formats
      Can decode and read registers in one step

Pipelining and ISA Design
  How would you design a pipeline for this instruction format?
    Prefixes (1-4 bytes)
    Opcode (1-3 bytes), required
    ModR/M (1 byte)
    SIB (1 byte)
    Addr. Displacement (0, 1, 2, or 4 bytes)
    Immediate (0, 1, 2, or 4 bytes)
  ModR/M: addressing-form specifier, mixing of register numbers, addressing modes, additional opcode bits
  SIB: Second addressing byte for base-plus-index and scale-plus-index addressing modes
  Source: Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2 (2A, 2B & 2C): Instruction Set Reference, A-Z

Pipelining and ISA Design
  MIPS ISA designed for pipelining
    Load/store addressing
      Can calculate address in 3rd stage, access memory in 4th stage
    Alignment of memory operands
      Memory access takes only one cycle

Pipelining and ISA Design
  How would you design a pipeline that works well for the following instructions?
    ADD eax, ebx                ; add with two registers
    SUB ebx, 100                ; sub with reg and const
    ADD eax, [0x1000]           ; add reg and memory
    ADD BYTE PTR [0x1000], 100  ; add with mem and const
    SUB [esi+4*ebx], eax        ; sub with reg and mem (array)

§4.6 Pipelined Datapath and Control
MIPS Pipelined Datapath
  Right-to-left flow leads to hazards (MEM and WB stages)

Pipeline registers
  Need registers between stages
    To hold information produced in previous cycle

Hazards
  Situations that prevent starting the next instruction in the next cycle
  Structure hazards
    A required resource is busy
  Data hazard
    Need to wait for previous instruction to complete its data read/write
  Control hazard
    Deciding on control action depends on previous instruction

Hazards
  There are ways to handle those hazards.
  Let’s ignore them for now
  Assume, for now, no data dependence and no control dependence in the program
    lw  $10, 20($1)
    sub $11, $2, $3
    add $12, $3, $4
    lw  $13, 24($1)
    sub $14, $5, $6
  Can you design a pipeline to run the above instructions correctly?

Hazards
  Program with data dependence
    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)
  Program with control dependence
    beq  $1, $3, +4
    addi $2, $2, 1
    addi $4, $4, 1

Pipeline Operation
  Cycle-by-cycle flow of instructions through the pipelined datapath
    “Single-clock-cycle” pipeline diagram
      Shows pipeline usage in a single cycle
      Highlight resources used
    cf. “multi-clock-cycle” diagram
      Graph of operation over time
  We’ll look at “single-clock-cycle” diagrams for load & store

IF for Load, Store, …

ID for Load, Store, …

EX for Load

MEM for Load

WB for Load
  Wrong register number

Corrected Datapath for Load

EX for Store

MEM for Store

WB for Store

Multi-Cycle Pipeline Diagram
  Form showing resource usage

Multi-Cycle Pipeline Diagram
  Traditional form

Single-Cycle Pipeline Diagram
  State of pipeline in a given cycle

Pipelined Control (Simplified)

Pipelined Control
  Control signals derived from instruction
    As in single-cycle implementation

Pipelined Control

Simple Pipeline Summary
  The BIG Picture
  Pipelining improves performance by increasing instruction throughput
    Executes multiple instructions in parallel
    Each instruction has the same latency
  Subject to hazards
    Structure, data, control (will be studied)
  Instruction set design affects complexity of pipeline implementation

Hazards
  Situations that prevent starting the next instruction safely in the next cycle
    The simple pipeline won’t work correctly
  Structure hazards
    A required resource is busy
  Data hazard
    Need to wait for previous instruction to complete its data read/write
  Control hazard
    Deciding on control action depends on previous instruction

Structure Hazards
  Conflict for use of a resource
  In MIPS pipeline with a single memory
    Load/store requires data access
    Instruction fetch would have to stall for that cycle
      Would cause a pipeline “bubble”
  Hence, pipelined datapaths require separate instruction/data memories
    Or separate instruction/data caches

Data Hazards in ALU Instructions
  An instruction depends on completion of data access by a previous instruction
    add $s0, $t0, $t1
    sub $t2, $s0, $t3
  Consider this sequence:
    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)

Data Hazards in ALU Instructions
  A naïve approach is to insert NOOPs to wait out the dependence
    add $s0, $t0, $t1
    sub $t2, $s0, $t3
  Change to
    add $s0, $t0, $t1
    noop
    noop
    sub $t2, $s0, $t3

Data Hazards in ALU Instructions
  Another naïve approach is to stall the 2nd instruction in the dependence
    add $s0, $t0, $t1
    sub $t2, $s0, $t3

Forwarding (aka Bypassing)
  Use result when it is computed
    Don’t wait for it to be stored in a register
    Requires extra connections in the datapath

Data Hazards in ALU Instructions

Dependencies & Forwarding

Detecting the Need to Forward
  Pass register numbers along pipeline
    e.g., ID/EX.RegisterRs = register number for Rs sitting in ID/EX pipeline register
  ALU operand register numbers in EX stage are given by
    ID/EX.RegisterRs, ID/EX.RegisterRt
  Data hazards when
    1a. EX/MEM.RegisterRd = ID/EX.RegisterRs   (Fwd from EX/MEM pipeline reg)
    1b. EX/MEM.RegisterRd = ID/EX.RegisterRt   (Fwd from EX/MEM pipeline reg)
    2a. MEM/WB.RegisterRd = ID/EX.RegisterRs   (Fwd from MEM/WB pipeline reg)
    2b. MEM/WB.RegisterRd = ID/EX.RegisterRt   (Fwd from MEM/WB pipeline reg)

Detecting the Need to Forward
  But only if forwarding instruction will write to a register!
    EX/MEM.RegWrite, MEM/WB.RegWrite
  And only if Rd for that instruction is not $zero
    EX/MEM.RegisterRd ≠ 0, MEM/WB.RegisterRd ≠ 0

Forwarding Paths

Forwarding Conditions
  EX hazard
    if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
        and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 10
    if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
        and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 10
  MEM hazard
    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 01
    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 01
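
The forwarding conditions can be sketched as a small VHDL control unit. This is only a sketch under stated assumptions: the entity name fwd_unit and all port names are hypothetical (they are not among the provided project files), and plain std_logic types are used so the block does not depend on the mips32 package. Two points go beyond the conditions as listed: the value 00 is assumed to mean "no forwarding, use the register file value", and when both the EX hazard and the MEM hazard match the same operand, the EX/MEM value is assumed to win because it is the more recent result.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;

-- Hypothetical forwarding unit; entity and port names are illustrative only.
entity fwd_unit is
  port (ex_mem_reg_write : in  std_logic;                     -- EX/MEM.RegWrite
        ex_mem_rd        : in  std_logic_vector(4 downto 0);  -- EX/MEM.RegisterRd
        mem_wb_reg_write : in  std_logic;                     -- MEM/WB.RegWrite
        mem_wb_rd        : in  std_logic_vector(4 downto 0);  -- MEM/WB.RegisterRd
        id_ex_rs         : in  std_logic_vector(4 downto 0);  -- ID/EX.RegisterRs
        id_ex_rt         : in  std_logic_vector(4 downto 0);  -- ID/EX.RegisterRt
        forward_a        : out std_logic_vector(1 downto 0);  -- ForwardA
        forward_b        : out std_logic_vector(1 downto 0)); -- ForwardB
end entity fwd_unit;

architecture dataflow of fwd_unit is
  constant REG_ZERO : std_logic_vector(4 downto 0) := "00000";
begin
  -- First ALU operand (Rs). The EX hazard is tested first, so it takes
  -- priority over the MEM hazard (assumption, see the lead-in).
  forward_a <= "10" when (ex_mem_reg_write = '1' and ex_mem_rd /= REG_ZERO
                          and ex_mem_rd = id_ex_rs) else
               "01" when (mem_wb_reg_write = '1' and mem_wb_rd /= REG_ZERO
                          and mem_wb_rd = id_ex_rs) else
               "00";  -- assumed encoding for "no forwarding"

  -- Second ALU operand (Rt), same structure.
  forward_b <= "10" when (ex_mem_reg_write = '1' and ex_mem_rd /= REG_ZERO
                          and ex_mem_rd = id_ex_rt) else
               "01" when (mem_wb_reg_write = '1' and mem_wb_rd /= REG_ZERO
                          and mem_wb_rd = id_ex_rt) else
               "00";
end architecture dataflow;
```

For the sequence sub $2,$1,$3 followed by and $12,$2,$5, when the and is in EX the sub is in MEM, so EX/MEM.RegisterRd = 2 = ID/EX.RegisterRs and this sketch drives forward_a to 10. The unit is written in dataflow style, which the slides allow for control units in Part 1; a strongly structural version would need the comparators and gates instantiated as components.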