A Pipelined Processor Taken from Digital Design and Computer Architecture by Harris and Harris 7-<1> Introduction • Microarchitecture: how to implement an architecture in hardware • Processor: – Datapath: functional blocks – Control: control signals Application Software programs Operating Systems device drivers Architecture instructions registers Microarchitecture datapaths controllers Logic adders memories Digital Circuits AND gates NOT gates Analog Circuits amplifiers filters Devices transistors diodes Physics electrons 7-<2> Microarchitecture • Multiple implementations for a single architecture: – Single-cycle • Each instruction executes in a single cycle – Multicycle • Each instruction is broken up into a series of shorter steps – Pipelined • Each instruction is broken up into a series of steps • Multiple instructions execute at once. – Microcode 7-<3> Architectural State • Determines everything about a processor: – PC – 32 registers – Memory 7-<4> Instruction Formats add sub or and … lw sw … beq bne … R-Type op 6 bits rs 5 bits rt rd shamt funct 5 bits 5 bits 5 bits 6 bits I-Type op 6 bits rs 5 bits rt imm 5 bits 16 bits J-Type op addr 6 bits 26 bits 6-<5> State Elements: PC, 32 registers, and Memory CLK CLK PC' PC 32 32 32 A RD Instruction Memory 5 32 5 5 32 A1 A2 A3 WD3 CLK WE3 RD1 RD2 Register File WE 32 32 32 32 A RD Data Memory WD 7-<6> 32 Single-Cycle Datapath: lw fetch • executing lw: lw (index reg) (destination reg) (immediate offset) example: lw $s3, 1($0) # read memory word 1 into $s3 I-Type rt <- DataMemory[rs+imm] op 6 bits rs 5 bits rt imm 5 bits 16 bits • STEP 1: Fetch instruction CLK CLK PC' PC Instr A RD Instruction Memory A1 CLK WE3 WE RD1 A A2 A3 WD3 RD2 Register File RD Data Memory WD 7-<7> Single-Cycle Datapath: lw register read • STEP 2: Read source operands from register file CLK CLK 25:21 PC' PC A RD Instruction Memory Instr A1 CLK WE3 WE RD1 A A2 A3 WD3 RD2 Register File RD Data Memory WD 7-<8> Single-Cycle Datapath: lw immediate • STEP 3: Sign-extend the immediate CLK CLK PC' PC A RD Instr 25:21 A1 CLK WE3 WE RD1 A Instruction Memory A2 A3 WD3 RD Data Memory WD RD2 Register File SignImm 15:0 Sign Extend 7-<9> Single-Cycle Datapath: lw address • STEP 4: Compute the memory address ALUControl2:0 PC A RD Instr 25:21 Instruction Memory A1 A2 A3 WD3 WE3 RD2 SrcB Register File CLK Zero SrcA RD1 ALU PC' 010 CLK CLK ALUResult WE A RD Data Memory WD SignImm 15:0 Sign Extend 7-<10> Single-Cycle Datapath: lw memory read • STEP 5: Read data from memory and write it back to register file RegWrite ALUControl2:0 1 PC A RD Instruction Memory Instr 25:21 20:16 A1 A2 A3 WD3 CLK WE3 Zero SrcA RD1 RD2 SrcB Register File ALU CLK PC' 010 CLK ALUResult WE A RD Data Memory WD ReadData SignImm 15:0 Sign Extend 7-<11> Single-Cycle Datapath: lw PC increment • STEP 6: Determine the address of the next instruction ALUControl2:0 RegWrite 010 1 PC RD A Instr Instruction Memory 25:21 A1 A2 20:16 A3 + WD3 WE3 RD2 SrcB Register File WE Zero SrcA RD1 ALU CLK PC' CLK CLK ALUResult RD Data Memory WD A ReadData PCPlus4 SignImm 4 15:0 Sign Extend Result 7-<12> Single-Cycle Datapath: sw • sw: sw (index reg) (source reg) (immediate offset) I-Type rs op 6 bits 5 bits rt imm 5 bits 16 bits • Write data in rt to memory RegWrite ALUControl2:0 0 1 A RD Instr Instruction Memory 25:21 A1 CLK WE3 20:16 A2 20:16 A3 WD3 Zero SrcA RD1 RD2 SrcB ALU PC + PC' 010 CLK CLK MemWrite ALUResult WriteData Register File WE A RD Data Memory WD ReadData PCPlus4 SignImm 4 15:0 Sign Extend Result 7-<13> Single-Cycle Datapath: R-type instructions example: rd <- rt + rs ; add $s0, $s1, $s2 R-Type Read from rs and rt op rs rt rd Write ALUResult to register file 6 bits 5 bits 5 bits 5 bits Write to rd (instead of rt) RegWrite RegDst 1 1 0 PC A RD Instr Instruction Memory 25:21 20:16 A1 A2 A3 WD3 CLK WE3 Zero SrcA RD1 0 SrcB RD2 1 Register File 20:16 ALUResult WriteData shamt funct 5 bits 6 bits MemtoReg 0 0 WE A RD Data Memory WD ReadData 0 1 0 15:11 WriteReg4:0 PCPlus4 MemWrite varies ALU CLK PC' ALUSrc ALUControl2:0 CLK + • • • • 1 SignImm 4 15:0 Sign Extend Result 7-<14> Single-Cycle Datapath: beq • Determine whether values in rs and rt are equal beq $s0, $s1, target … target: # label I-Type rs op 6 bits 5 bits rt imm 5 bits 16 bits • Calculate branch target address: BTA = (sign-extended immediate << 2) + (PC+4) PCSrc RegWrite RegDst 0 110 1 A RD Instr Instruction Memory 25:21 20:16 A1 A2 A3 WD3 CLK WE3 RD2 0 SrcB 1 Register File 20:16 + WriteReg4:0 15:0 WriteData x 0 WE A RD Data Memory WD ReadData 0 1 1 SignImm 4 ALUResult MemtoReg 0 15:11 PCPlus4 Zero SrcA RD1 ALU PC MemWrite 1 Sign Extend <<2 + PC' 0 CLK CLK 0 x ALUSrc ALUControl2:0 Branch PCBranch 7-<15> Result Complete Single-Cycle Processor 31:26 5:0 MemtoReg Control MemWrite Unit Branch ALUControl2:0 Op ALUSrc Funct RegDst PCSrc RegWrite CLK PC' PC 1 A RD Instr Instruction Memory 25:21 20:16 A1 A2 A3 WD3 WE3 RD2 0 SrcB 1 Register File 20:16 + WriteReg4:0 15:0 WriteData A RD Data Memory WD ReadData 0 1 1 SignImm 4 ALUResult WE 0 15:11 PCPlus4 Zero SrcA RD1 ALU 0 CLK Sign Extend <<2 + CLK PCBranch Result 7-<16> Review: Processor Performance Program Execution Time = (# instructions)(cycles/instruction)(seconds/cycle) = # instructions x CPI x TC 7-<17> Single-Cycle Performance • TC is limited by the critical path (lw) 31:26 5:0 MemtoReg Control MemWrite Unit Branch ALUControl 2:0 0 0 PCSrc Op ALUSrc Funct RegDst RegWrite PC' PC 1 A RD Instr Instruction Memory 25:21 A1 WE3 010 SrcA RD1 1 20:16 A2 A3 WD3 RD2 0 SrcB 1 Register File + WriteReg4:0 A RD Data Memory WD ReadData 0 1 1 SignImm 15:0 WriteData 1 WE 0 15:11 4 ALUResult 0 0 20:16 PCPlus4 Zero Sign Extend <<2 + 0 CLK 1 ALU CLK CLK PCBranch Result 7-<18> Single-Cycle Performance • Single-cycle critical path: Tc = tpcq_PC + tmem + max(tRFread, tsext + tmux) + tALU + tmem + tmux + tRFsetup • In most implementations, limiting paths are: – memory, ALU, register file. – Tc = tpcq_PC + 2tmem + tRFread + tmux + tALU + tRFsetup 7-<19> Single-Cycle Performance Example Element Parameter Delay (ps) Register clock-to-Q tpcq_PC 30 Register setup tsetup 20 Multiplexer tmux 25 ALU tALU 200 Memory read tmem 250 Register file read tRFread 150 Register file setup tRFsetup 20 Tc = 7-<20> Single-Cycle Performance Example Element Parameter Delay (ps) Register clock-to-Q tpcq_PC 30 Register setup tsetup 20 Multiplexer tmux 25 ALU tALU 200 Memory read tmem 250 Register file read tRFread 150 Register file setup tRFsetup 20 Tc = tpcq_PC + 2tmem + tRFread + tmux + tALU + tRFsetup = [30 + 2(250) + 150 + 25 + 200 + 20] ps = 925 ps 7-<21> Single-Cycle Performance Example • For a program with 100 billion instructions executing on a singlecycle MIPS processor, Execution Time = 7-<22> Single-Cycle Performance Example • For a program with 100 billion instructions executing on a singlecycle MIPS processor, Execution Time = # instructions x CPI x TC = (100 × 109)(1)(925 × 10-12 s) = 92.5 seconds 7-<23> Pipelined Processor • Temporal parallelism • Divide single-cycle processor into 5 ROUGHLY EQUIVALENT stages: – Fetch – Decode – Execute – Memory – Writeback • Each stage includes one “slow step” • Add pipeline registers between stages • 5 stages => ~ 5 times faster! • All modern high-performance processors are pipelined. 7-<24> Single-Cycle vs. Pipelined Performance Single-Cycle 0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 Instr 1 Fetch Instruction Decode Read Reg Execute ALU Memory Read / Write Time (ps) Write Reg Fetch Instruction 2 Decode Read Reg Execute ALU Memory Read / Write Write Reg Pipelined Instr 1 2 3 Fetch Instruction Decode Read Reg Fetch Instruction Execute ALU Decode Read Reg Fetch Instruction Memory Read/Write Execute ALU Decode Read Reg Write Reg Memory Read/Write Execute ALU Write Reg Memory Read/Write Write Reg The length of all pipeline stages is set by the slowest stage The instruction latency is 5 * 250 ps = 1250 ps 7-<25> Pipelining Abstraction 1 2 3 4 5 6 7 8 9 10 Time (cycles) lw $s2, 40($0) add $s3, $t1, $t2 sub $s4, $s1, $s5 and $s5, $t5, $t6 sw $s6, 20($s1) or $s7, $t3, $t4 IM lw $0 RF 40 IM add DM + $t1 RF $t2 IM sub RF DM + $s1 RF $s5 IM $s2 and RF DM - $t5 RF $t6 IM $s3 sw RF DM & $s1 RF 20 IM $s4 or + $t3 RF $t4 $s5 RF DM $s6 RF DM | 7-<26> $s7 RF Single-Cycle and Pipelined Datapath CLK 0 PC' PC 1 A Instr RD Instruction Memory 25:21 20:16 A1 WE3 A2 A3 WD3 + CLK Zero SrcA RD1 RD2 0 SrcB 1 Register File 20:16 0 15:11 1 ALU CLK WE ALUResult WriteData ReadData A RD Data Memory WD 0 1 WriteReg4:0 PCPlus4 SignImm 4 Sign Extend <<2 + 15:0 PCBranch Result CLK CLK CLK 0 PC' PCF 1 A RD Instruction Memory CLK InstrD 25:21 20:16 A1 WE3 A2 A3 WD3 ALUOutW CLK CLK RD2 0 SrcBE 1 Register File RdE 0 A RD Data Memory WD ReadDataW 0 1 WriteRegE4:0 1 + 15:11 ALUOutM WriteDataM WriteDataE RtE 20:16 WE ZeroM SrcAE RD1 ALU CLK SignImmE 15:0 <<2 Sign Extend PCBranchM + 4 PCPlus4F PCPlus4D PCPlus4E ResultW Fetch Decode Execute Memory Writeback 7-<27> Corrected Pipelined Datapath • WriteReg address must arrive at the same time as Result CLK CLK CLK PC' PCF A RD Instruction Memory InstrD 25:21 20:16 A1 WE3 PCPlus4F 0 SrcBE 1 RtE RdE 0 ALUOutM WriteDataE WriteDataM WriteRegE4:0 WriteRegM4:0 A RD Data Memory WD ReadDataW 0 1 WriteRegW 4:0 1 SignImmE <<2 + Sign Extend PCPlus4D WE ZeroM SrcAE A2 RD2 A3 Register WD3 File 15:11 4 CLK RD1 20:16 15:0 ALUOutW CLK + 0 1 CLK ALU CLK PCBranchM PCPlus4E ResultW Fetch Decode Execute Memory Writeback 7-<28> Pipelined Control CLK 5:0 RegWriteM RegWriteW MemtoRegE MemtoRegM MemtoRegW MemWriteD MemWriteE MemWriteM BranchD BranchE BranchM Op ALUControlD ALUControlE2:0 Funct ALUSrcD ALUSrcE RegDstD RegDstE CLK PC' PCF 1 A InstrD RD Instruction Memory ALUOutW CLK 25:21 20:16 A1 A2 A3 WD3 CLK WE3 RD2 0 SrcBE 1 Register File RtE 20:16 RdE 0 15:0 4 PCPlus4F Sign Extend PCPlus4D ALUOutM WriteDataE WriteDataM WriteRegE4:0 WriteRegM4:0 A RD Data Memory WD ReadDataW 0 1 WriteRegW 4:0 1 + 15:11 WE ZeroM SrcAE RD1 SignImmE <<2 + 0 PCSrcM ALU CLK CLK RegWriteE RegWriteD Control MemtoRegD Unit 31:26 CLK PCBranchM PCPlus4E ResultW Same control unit as single-cycle processor Control delayed to proper pipeline stage 7-<29> Pipeline Hazard • Occurs when an instruction depends on results from previous instruction that hasn’t completed. • Types of hazards: – Data hazard: register value not written back to register file yet – Control hazard: next instruction not decided yet (caused by branches) 7-<30> Data Hazard 1 2 3 4 5 6 7 8 Time (cycles) add $s0, $s2, $s3 and $t0, $s0, $s1 or $t1, $s4, $s0 sub $t2, $s0, $s5 IM add $s2 RF $s3 IM and DM + $s0 RF $s0 RF $s1 IM or DM & $t0 RF $s4 RF $s0 IM sub DM | $t1 RF $s0 RF $s5 - DM 7-<31> $t2 RF Handling Data Hazards • Insert nops in code at compile time • Rearrange code at compile time • Forward data at run time • Stall the processor at run time 7-<32> Control Hazards • beq: – branch is not determined until the fourth stage of the pipeline – Instructions after the branch are fetched before branch occurs – These instructions must be flushed if the branch happens • Branch misprediction penalty – number of instruction flushed when branch is taken – May be reduced by determining branch earlier 7-<33> Branch Prediction • Guess whether branch will be taken – Backward branches are usually taken (loops) – Perhaps consider history of whether branch was previously taken to improve the guess • Good prediction reduces the fraction of branches requiring a flush 7-<34>