Computer Systems Organization: Lecture 6 Ankur Srivastava University of Maryland, College Park Based on Slides from Mary Jane Irwin ( www.cse.psu.edu/~mji ) www.cse.psu.edu/~cg431 [Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, UCB] ENEE350 Single Cycle Datapath with Control Unit 0 Add Add Shift left 2 4 ALUOp 1 PCSrc Branch MemRead MemtoReg MemWrite Instr[31-26] Control Unit ALUSrc RegWrite RegDst Instruction Memory PC Read Address Instr[31-0] ovf Instr[25-21] Read Addr 1 Register Read Instr[20-16] Read Addr 2 Data 1 File 0 Write Addr Read 1 Instr[15 -11] Instr[15-0] Write Data zero ALU 1 32 Instr[5-0] ENEE350 Data Memory Read Data 1 Write Data 0 0 Data 2 Sign 16 Extend Address ALU control What is Pipelining Instruc Fetch Reg Decode ALU Instr Mem Exec Data Mem Mem Reg WB 5 Basic Steps in MIPS 1. Instruction Fetch from Memory 2. Instruction Decode by the control and Read the Rs and Rt Registers, Perform Sign Extension 3. Execute the Computation in the ALU, Calculate the target Branch Address 4. Memory Access: Read of Write data from/into memory 5. Write Back the result from the ALU or MemeorY Read into the register file ENEE350 What is Pipelining Instruc Fetch Reg Decode ALU Instr Mem Exec Data Mem Mem Reg WB Pipelining: Start the next instruction before the current one has completed. Hence when the first instruction is being decoded, start fetching the next instructions and so on. ENEE350 A Pipelined MIPS Processor Start the next instruction before the current one has completed improves throughput - total amount of work done in a given time instruction latency (execution time, delay time, response time time from the start of an instruction to its completion) is not reduced Cycle 1 Cycle 2 lw sw R-type IFetch Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Dec Exec Mem WB IFetch Dec Exec Mem WB IFetch Dec Exec Mem WB - clock cycle (pipeline stage time) is limited by the slowest stage - for some instructions, some stages are wasted cycles ENEE350 Single Cycle, Multiple Cycle, vs. Pipeline Single Cycle Implementation: Cycle 1 Cycle 2 Clk lw sw Pipeline Implementation: lw IFetch sw Dec Exec Mem WB IFetch Dec Exec Mem WB Dec Exec Mem R-type IFetch ENEE350 WB Waste What is Pipelining Instruc Fetch Reg Decode ALU Instr Mem Exec Data Mem Mem Reg WB In Order to Implement Pipelining, we will need to redesign the datapath so that we can start executing the next instruction when the first one is still being processed. The first thing to do is to design the data-path such that each of the 5 steps could work independently ENEE350 What is Pipelining Reg ALU Instr Mem Data Mem Reg This is achieved by having what are called as pipeline registers in between the 5 stages. The clock delay is long enough to finish any of these steps. ENEE350 What is Pipelining Reg ALU Instr Mem Data Mem Reg Let us suppose Instruction Fetch : 200ps, Instruction Decode: 100ps, ALU Operation: 200ps, Mem Access: 200ps, Register Write Back: 100ps. Single Cycle Architecture: Clock Delay: 800ps Pipeline Clock Delay: 200ps ENEE350 What is Pipelining lw IFetch sw R-type Dec Exec Mem WB IFetch Dec Exec Mem WB IFetch Dec Exec Mem WB Execution time on a single cycle processor: 800*3 = 2400 Execution time on a pipelined processor: 200*5 + 200 + 200 = 1400 ENEE350 Single Cycle Datapath with Control Unit 0 Add Add Shift left 2 4 ALUOp 1 PCSrc Branch MemRead MemtoReg MemWrite Instr[31-26] Control Unit ALUSrc RegWrite RegDst Instruction Memory PC Read Address Instr[31-0] ovf Instr[25-21] Read Addr 1 Register Read Instr[20-16] Read Addr 2 Data 1 File 0 Write Addr Read 1 Instr[15 -11] Instr[15-0] Write Data zero ALU 1 32 Instr[5-0] ENEE350 Data Memory Read Data 1 Write Data 0 0 Data 2 Sign 16 Extend Address ALU control MIPS Pipeline Datapath Modifications What do we need to add/modify in our MIPS datapath? State registers between each pipeline stage to isolate them For simplicity the entire datapath is not shown IF:IFetch ID:Dec EX:Execute MEM: MemAccess WB: WriteBack Add Read Addr 2Data 1 File Write Addr Write Data 16 System Clock ENEE350 Sign Extend Read Data 2 32 ALU Exec/Mem Register Read Dec/Exec PC Read Address Read Addr 1 IFetch/Dec Instruction Memory Add Data Memory Address Write Data Read Data Mem/WB Shift left 2 4 Corrected Datapath to Save RegWrite Addr Need to preserve the destination register address in the pipeline state registers IF/ID ID/EX EX/MEM Add Shift left 2 4 PC Instruction Memory Read Address Read Addr 1 Read Addr 2Data 1 File Write Addr 16 Sign Extend Read Data 2 32 MEM/WB Data Memory Register Read Write Data ENEE350 Add ALU Address Write Data Read Data MIPS Pipeline Control Path Modifications All control signals can be determined during Decode and held in the state registers between pipeline stages ID/EX EX/MEM IF/ID Control Add MEM/WB Shift left 2 4 PC Instruction Memory Read Address Read Addr 1 Data Memory Register Read Read Addr 2Data 1 File Write Addr Write Data 16 ENEE350 Add Sign Extend Read Data 2 32 ALU Address Write Data Read Data Pipelining the MIPS ISA What makes it easy all instructions are the same length (32 bits) - can fetch in the 1st stage and decode in the 2nd stage few instruction formats (three) with symmetry across formats - can begin reading register file in 2nd stage memory operations can occur only in loads and stores - can use the execute stage to calculate memory addresses each MIPS instruction writes at most one result (i.e., changes the machine state) and does so near the end of the pipeline (MEM and WB) What makes it hard ENEE350 structural hazards: what if we had only one memory? control hazards: what about branches? data hazards: what if an instruction’s input operands depend on the output of a previous instruction? Why Pipeline? For Performance! Time (clock cycles) Inst 3 IM Reg IM Reg IM Reg ALU Inst 2 Reg ALU Inst 1 IM ALU O r d e r Inst 0 ALU I n s t r. IM Reg DM Time to fill the pipeline ENEE350 Reg DM Reg DM Reg DM ALU Inst 4 Once the pipeline is full, one instruction is completed every cycle, so CPI = 1 Reg DM Reg Can Pipelining Get Us Into Trouble? Yes: Pipeline Hazards structural hazards: attempt to use the same resource by two different instructions at the same time data hazards: attempt to use data before it is ready - An instruction’s source operand(s) are produced by a prior instruction still in the pipeline control hazards: attempt to make a decision about program control flow before the condition has been evaluated and the new PC target address calculated - branch instructions ENEE350 Can always resolve hazards by waiting pipeline control must detect the hazard and take action to resolve hazards A Single Memory Would Be a Structural Hazard Time (clock cycles) ENEE350 Mem Mem Reg Mem Reg Mem Reg ALU Inst 4 Reg ALU Inst 3 Mem Reg ALU Inst 2 Reg Reading data from memory Mem ALU O r d e r Inst 1 Mem ALU I n s t r. lw Mem Reg Mem Reading instruction from memory Reg Mem Reg Reg Fix with separate instr and data memories (I$ and D$) How About Register File Access? Time (clock cycles) Reg Inst 1 IM Reg ALU IM Reg ALU IM Reg Inst 2 add $2,$1, clock edge that controls register writing ENEE350 DM Reg DM Reg DM ALU O r d e r add $1 ,IM ALU I n s t r. Fix register file access hazard by doing reads in the second half of the cycle and writes in the first half Reg DM Reg clock edge that controls loading of pipeline state registers Register Usage Can Cause Data Hazards Dependencies backward in time cause hazards $8,$1,$9 Reg IM Reg IM Reg ALU or IM ALU and $6,$1,$7 Reg ALU sub $4,$1,$5 IM ALU add $1,$x,$y IM Reg DM ENEE350 Read before write data hazard DM Reg DM Reg DM ALU xor $4,$1,$5 Reg Reg DM Reg Loads Can Cause Data Hazards Reg IM Reg IM Reg IM Reg ALU sub $4,$1,$5 IM ALU $1,4($2) ALU O r d e r lw ALU I n s t r. Dependencies backward in time cause hazards IM Reg and $6,$1,$7 or $8,$1,$9 ENEE350 Load-use data hazard Reg DM Reg DM Reg DM ALU xor $4,$1,$5 DM Reg DM Reg One Way to “Fix” a Data Hazard IM Reg DM Reg IM Reg ALU O r d e r add $1, ALU I n s t r. Can fix data hazard by waiting – stall – but impacts CPI IM Reg stall stall sub $4,$1,$5 ENEE350 ALU and $6,$1,$7 DM Reg DM Reg Another Way to “Fix” a Data Hazard and $6,$1,$7 or $8,$1,$9 ENEE350 Reg IM Reg IM Reg IM Reg DM Reg DM Reg DM Reg DM ALU xor $4,$1,$5 IM ALU sub $4,$1,$5 Reg ALU IM ALU O r d e r add $1, ALU I n s t r. Fix data hazards by forwarding results as soon as they are available to where they are needed Reg DM Reg Forwarding with Load-use Data Hazards and $6,$1,$7 or $8,$1,$9 ENEE350 Reg IM Reg IM Reg IM Reg DM Reg DM Reg DM Reg DM ALU xor $4,$1,$5 IM ALU sub $4,$1,$5 Reg ALU $1,4($2) IM ALU O r d e r lw ALU I n s t r. Reg DM Will still need one stall cycle even with forwarding Reg Branch Instructions Cause Control Hazards Inst 3 ENEE350 IM Reg IM Reg IM Reg DM Reg DM Reg DM ALU Inst 4 Reg ALU lw IM ALU O r d e r beq ALU I n s t r. Dependencies backward in time cause hazards Reg DM Reg One Way to “Fix” a Control Hazard O r d e r beq IM Reg ALU I n s t r. DM Fix branch hazard by waiting – stall – but affects CPI Reg stall stall stall ENEE350 Reg IM Reg DM ALU Inst 3 IM ALU lw Reg DM Summary All modern day processors use pipelining Pipelining doesn’t help latency of single task, it helps throughput of entire workload Potential speedup: a CPI of 1 and fast a CC Pipeline rate limited by slowest pipeline stage Unbalanced pipe stages makes for inefficiencies The time to “fill” pipeline and time to “drain” it can impact speedup for deep pipelines and short code runs Must detect and resolve hazards ENEE350 Stalling negatively affects CPI (makes CPI less than the ideal of 1)