Part IV Data Path and Control Nov. 2014 Computer Architecture, Data Path and Control Slide 1 About This Presentation This presentation is intended to support the use of the textbook Computer Architecture: From Microprocessors to Supercomputers, Oxford University Press, 2005, ISBN 0-19-515455-X. It is updated regularly by the author as part of his teaching of the upper-division course ECE 154, Introduction to Computer Architecture, at the University of California, Santa Barbara. Instructors can use these slides freely in classroom teaching and for other educational purposes. Any other use is strictly prohibited. © Behrooz Parhami Edition Released Revised Revised Revised Revised First July 2003 July 2004 July 2005 Mar. 2006 Feb. 2007 Feb. 2008 Feb. 2009 Feb. 2011 Nov. 2014 Nov. 2014 Computer Architecture, Data Path and Control Slide 2 A Few Words About Where We Are Headed Performance = 1 / Execution time simplified to 1 / CPU execution time CPU execution time = Instructions CPI / (Clock rate) Performance = Clock rate / ( Instructions CPI ) Try to achieve CPI = 1 with clock that is as high as that for CPI > 1 designs; is CPI < 1 feasible? (Chap 15-16) Design memory & I/O structures to support ultrahigh-speed CPUs (chap 17-24) Nov. 2014 Define an instruction set; make it simple enough to require a small number of cycles and allow high clock rate, but not so simple that we need many instructions, even for very simple tasks (Chap 5-8) Computer Architecture, Data Path and Control Design hardware for CPI = 1; seek improvements with CPI > 1 (Chap 13-14) Design ALU for arithmetic & logic ops (Chap 9-12) Slide 3 IV Data Path and Control Design a simple computer (MicroMIPS) to learn about: • Data path – part of the CPU where data signals flow • Control unit – guides data signals through data path • Pipelining – a way of achieving greater performance Topics in This Part Chapter 13 Instruction Execution Steps Chapter 14 Control Unit Synthesis Chapter 15 Pipelined Data Paths Chapter 16 Pipeline Performance Limits Nov. 2014 Computer Architecture, Data Path and Control Slide 4 13 Instruction Execution Steps A simple computer executes instructions one at a time • Fetches an instruction from the loc pointed to by PC • Interprets and executes the instruction, then repeats Topics in This Chapter 13.1 A Small Set of Instructions 13.2 The Instruction Execution Unit 13.3 A Single-Cycle Data Path 13.4 Branching and Jumping 13.5 Deriving the Control Signals 13.6 Performance of the Single-Cycle Design Nov. 2014 Computer Architecture, Data Path and Control Slide 5 13.1 A Small Set of Instructions 31 R I op 25 rs 20 rt 15 rd 10 sh fn 5 6 bits 5 bits 5 bits 5 bits 5 bits 6 bits Opcode Source 1 or base Source 2 or dest’n Destination Unused Opcode ext imm Operand / Offset, 16 bits jta J Jump target address, 26 bits inst Instruction, 32 bits Fig. 13.1 MicroMIPS instruction formats and naming of the various fields. We will refer to this diagram later Seven R-format ALU instructions (add, sub, slt, and, or, xor, nor) Six I-format ALU instructions (lui, addi, slti, andi, ori, xori) Two I-format memory access instructions (lw, sw) Three I-format conditional branch instructions (bltz, beq, bne) Four unconditional jump instructions (j, jr, jal, syscall) Nov. 2014 Computer Architecture, Data Path and Control Slide 6 0 The MicroMIPS Instruction Set Copy Arithmetic Including la rt,imm(rs) (load address) makes it easier to write useful programs Logic Memory access Control transfer Table 13.1 Nov. 2014 Instruction Usage Load upper immediate Add Subtract Set less than Add immediate Set less than immediate AND OR XOR NOR AND immediate OR immediate XOR immediate Load word Store word Jump Jump register Branch less than 0 Branch equal Branch not equal Jump and link System call lui rt,imm add rd,rs,rt sub rd,rs,rt slt rd,rs,rt addi rt,rs,imm slti rd,rs,imm and rd,rs,rt or rd,rs,rt xor rd,rs,rt nor rd,rs,rt andi rt,rs,imm ori rt,rs,imm xori rt,rs,imm lw rt,imm(rs) sw rt,imm(rs) j L jr rs bltz rs,L beq rs,rt,L bne rs,rt,L jal L syscall Computer Architecture, Data Path and Control op fn 15 0 0 0 8 10 0 0 0 0 12 13 14 35 43 2 0 1 4 5 3 0 Slide 7 32 34 42 36 37 38 39 8 12 13.2 The Instruction Execution Unit beq,bne syscall 31 R I Next addr bltz,jr jta op 25 rs 20 10 sh fn 5 5 bits 5 bits 5 bits 5 bits 6 bits Opcode Source 1 or base Source 2 or dest’n Destination Unused Opcode ext 0 imm Operand / Offset, 16 bits jta J Jump target address, 26 bits inst 22 instructions (rs) Reg file inst rd Instruction, 32 bits rs,rt,rd Instr cache 15 6 bits j,jal PC rt 12 A/L, lui, lw,sw ALU Address Data Data cache (rt) imm op fn Control Harvard architecture Fig. 13.2 Abstract view of the instruction execution unit for MicroMIPS. For naming of instruction fields, see Fig. 13.1. Nov. 2014 Computer Architecture, Data Path and Control Slide 8 13.3 A Single-Cycle Data Path Incr PC Next addr jta Next PC ALUOvfl (PC) PC (rs) rs rt Instr cache inst rd 31 imm op Instruction fetch Nov. 2014 Reg file ALU (rt) / 16 ALU out Data addr Data cache Data out Data in Func 0 32 SE / 1 0 1 2 Register input fn RegDst RegWrite Br&Jump Fig. 13.3 0 1 2 Ovfl Reg access / decode ALUSrc ALUFunc ALU operation DataRead RegInSrc DataWrite Data access Reg writeback Key elements of the single-cycle MicroMIPS data path. Computer Architecture, Data Path and Control Slide 9 ConstVar Shift function Constant 5 amount 0 Amount 5 1 5 Variable amount 2 00 01 10 11 No shift Logical left Logical right Arith right Shifter Function class 32 5 LSBs imm Adder y 32 k / c 31 xy Shift Set less Arithmetic Logic 0 0 or 1 c0 32 00 01 10 11 An ALU for MicroMIPS 2 Shifted y x lui s 32 2 32 Shorthand symbol for ALU 1 MSB We use only 5 control signals (no shifts) Control c 32 3 5 x Func AddSub s ALU Logic unit AND OR XOR NOR 00 01 10 11 Ovfl y 32input NOR Zero 2 Logic function Zero Ovfl Fig. 10.19 A multifunction ALU with 8 control signals (2 for function class, 1 arithmetic, 3 shift, 2 logic) specifying the operation. Nov. 2014 Computer Architecture, Data Path and Control Slide 10 13.4 Branching and Jumping Update options for PC (PC)31:2 + 1 (PC)31:2 + 1 + imm (PC)31:28 | jta (rs)31:2 SysCallAddr Default option When instruction is branch and condition is met When instruction is j or jal When the instruction is jr Start address of an operating system routine Lowest 2 bits of PC always 00 BrTrue / 30 IncrPC / 30 Adder c in 0 1 2 3 NextPC / 30 PCSrc Fig. 13.4 Nov. 2014 / 30 / 30 / 30 / 30 4 MSBs 1 / 30 Branch condition checker (rt) / 32 (rs) / 30 (PC)31:2 / 26 jta 30 MSBs SE / 30 / 32 4 16 imm MSBs SysCallAddr BrType Next-address logic for MicroMIPS (see top part of Fig. 13.3). Computer Architecture, Data Path and Control Slide 11 13.5 Deriving the Control Signals Table 13.2 Control signals for the single-cycle MicroMIPS implementation. Control signal Reg file ALU Data cache Next addr Nov. 2014 0 1 2 3 RegWrite Don’t write Write RegDst1, RegDst0 rt rd $31 RegInSrc1, RegInSrc0 Data out ALU out IncrPC ALUSrc (rt ) imm AddSub Add Subtract LogicFn1, LogicFn0 AND OR XOR NOR FnClass1, FnClass0 lui Set less Arithmetic Logic DataRead Don’t read Read DataWrite Don’t write Write BrType1, BrType0 No branch beq bne bltz PCSrc1, PCSrc0 IncrPC jta (rs) SysCallAddr Computer Architecture, Data Path and Control Slide 12 Single-Cycle Data Path, Repeated for Reference Incr PC Outcome of an executed instruction: A new value loaded into PC Possible new value in a reg or memory loc Next addr jta Next PC ALUOvfl (PC) PC (rs) rs rt Instr cache inst rd 31 imm op Instruction fetch Nov. 2014 Reg file ALU (rt) / 16 ALU out Data addr Data cache Data out Data in Func 0 32 SE / 1 0 1 2 Register input fn RegDst RegWrite Br&Jump Fig. 13.3 0 1 2 Ovfl Reg access / decode ALUSrc ALUFunc ALU operation DataRead RegInSrc DataWrite Data access Reg writeback Key elements of the single-cycle MicroMIPS data path. Computer Architecture, Data Path and Control Slide 13 Nov. 2014 Computer Architecture, Data Path and Control 00 01 10 11 00 01 10 0 0 FnClass LogicFn 0 1 1 0 1 00 10 10 01 10 01 11 11 11 11 11 11 11 10 10 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 PCSrc 10 10 1 0 0 0 1 1 0 0 0 0 1 1 1 1 1 Add’Sub 01 01 01 01 01 01 01 01 01 01 01 01 01 00 ALUSrc 00 01 01 01 00 00 01 01 01 01 00 00 00 00 BrType 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 0 DataW rite 001111 000000 100000 000000 100010 000000 101010 001000 001010 000000 100100 000000 100101 000000 100110 000000 100111 001100 001101 001110 100011 101011 000010 000000 001000 000001 000100 000101 000011 000000 001100 DataRead Load upper immediate Add Subtract Set less than Add immediate Set less than immediate AND OR XOR NOR AND immediate OR immediate XOR immediate Load word Store word Jump Jump register Branch on less than 0 Branch on equal Branch on not equal Jump and link System call fn RegInS rc op RegDst Table 13.3 Instruction RegWrite Control Signal Settings 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 10 00 00 00 01 11 11 01 10 00 Slide 14 Control Signals in the Single-Cycle Data Path Incr PC Next addr jta Next PC ALUOvfl (PC) PC Instr cache inst 001111 000000 Br&Jump BrType Fig. 13.3 Nov. 2014 Ovfl Reg file ALU (rt) / 16 imm op 00 00 00 00 0 1 2 rd 31 lui slt PCSrc (rs) rs rt ALU out Register input fn 00 01 1 1 RegDst RegWrite 010101 1 0 ALUSrc Data cache Data out Data in Func 0 32 SE / 1 Data addr x xx 00 1 xx 01 ALUFunc 0 0 0 0 01 01 DataRead RegInSrc DataWrite AddSub LogicFn FnClass Key elements of the single-cycle MicroMIPS data path. Computer Architecture, Data Path and Control 0 1 2 Slide 15 0 3 4 5 bltzInst jInst jalInst beqInst bneInst 8 addiInst 1 2 10 sltiInst 12 13 14 15 andiInst oriInst xoriInst luiInst 35 lwInst 43 /6 RtypeInst 0 8 fn Decoder 1 fn /6 op Decoder Instruction Decoding op Nov. 2014 12 syscallInst 32 addInst 34 subInst 36 37 38 39 andInst orInst xorInst norInst 42 sltInst swInst 63 Fig. 13.5 jrInst 63 Instruction decoder for MicroMIPS built of two 6-to-64 decoders. Computer Architecture, Data Path and Control Slide 16 Nov. 2014 Computer Architecture, Data Path and Control 00 01 10 11 00 01 10 0 0 FnClass LogicFn 0 1 1 0 1 00 10 10 01 10 01 11 11 11 11 11 11 11 10 10 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 PCSrc 10 10 1 0 0 0 1 1 0 0 0 0 1 1 1 1 1 Add’Sub 01 01 01 01 01 01 01 01 01 01 01 01 01 00 ALUSrc 00 01 01 01 00 00 01 01 01 01 00 00 00 00 BrType 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 0 DataW rite 001111 000000 100000 000000 100010 000000 101010 001000 001010 000000 100100 000000 100101 000000 100110 000000 100111 001100 001101 001110 100011 101011 000010 000000 001000 000001 000100 000101 000011 000000 001100 DataRead Load upper immediate Add Subtract Set less than Add immediate Set less than immediate AND OR XOR NOR AND immediate OR immediate XOR immediate Load word Store word Jump Jump register Branch on less than 0 Branch on equal Branch on not equal Jump and link System call fn RegInS rc Table 13.3 op RegDst Repeated for Reference Instruction RegWrite Control Signal Settings: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 10 00 00 00 01 11 11 01 10 00 Slide 17 Control Signal Generation Auxiliary signals identifying instruction classes arithInst = addInst subInst sltInst addiInst sltiInst logicInst = andInst orInst xorInst norInst andiInst oriInst xoriInst immInst = luiInst addiInst sltiInst andiInst oriInst xoriInst Example logic expressions for control signals RegWrite = luiInst arithInst logicInst lwInst jalInst ALUSrc = immInst lwInst swInst addInst subInst jInst AddSub = subInst sltInst sltiInst DataRead = lwInst PCSrc0 = jInst jalInst syscallInst Nov. 2014 Computer Architecture, Data Path and Control . . . Control . . . sltInst Slide 18 Putting It All Together Fig. 10.19 ConstVar Fig. 13.4 / 30 / 30 IncrPC / 30 Adder 0 1 2 3 / 30 / 30 / 30 / 30 / 30 / 30 4 MSBs1 / 32 / 32 (rs) 4 16 imm MSBs Shift function Constant 5 amount 0 Amount 5 1 5 Variable amount 2 / 30 (PC)31:2 / 26 jta Function class imm 5 LSBs lui Shift Set less Arithmetic Logic 0 y 0 or 1 c0 32 BrType 00 01 10 11 2 Shifted y x SysCallAddr No shift Logical left Logical right Arith right Shifter Adder PCSrc 00 01 10 11 32 30 MSBs SE c in NextPC Branch condition checker BrTrue (rt) 32 k / c c 32 31 xy s 32 2 32 Shortha symb for AL 1 MSB Cont 3 x Fun AddSub A Incr PC Next addr jta Next PC rd 31 imm op 0 1 2 00 01 10 11 Ovfl Reg file ALU (rt) / 16 0 32 SE / 1 Func ALU out Data addr Data cache Data out Data in Zero Ovfl addInst subInst jInst 0 1 2 Register input fn Zero . . . . . . Control sltInst Br&Jump Nov. 2014 RegDst RegWrite ALUSrc ALUFunc DataRead RegInSrc DataWrite Computer Architecture, Data Path and Control O y 32input NOR 2 Logic function (rs) rs rt inst AND OR XOR NOR ALUOvfl (PC) PC Instr cache Logic unit Fig. 13.3 Slide 19 13.6 Performance of the Single-Cycle Design An example combinational-logic data path to compute z := (u + v)(w – x) / y Add/Sub latency 2 ns Multiply latency 6 ns Divide latency 15 ns Total latency 23 ns u + v Note that the divider gets its correct inputs after 9 ns, but this won’t cause a problem if we allow enough total time w - / z x y Nov. 2014 Beginning with inputs u, v, w, x, and y stored in registers, the entire computation can be completed in 25 ns, allowing 1 ns each for register readout and write Computer Architecture, Data Path and Control Slide 20 Performance Estimation for Single-Cycle MicroMIPS Instruction access 2 ns Register read 1 ns ALU operation 2 ns Data cache access 2 ns Register write 1 ns Total 8 ns Single-cycle clock = 125 MHz R-type 44% 6 ns Load 24% 8 ns Store 12% 7 ns Branch 18% 5 ns Jump 2% 3 ns Weighted mean 6.36 ns ALU-type P C Load P C Store P C Branch P C (and jr) Jump (except jr & jal) P C Not used Not used Not used Not used Not used Not used Not used Not used Not used Fig. 13.6 The MicroMIPS data path unfolded (by depicting the register write step as a separate block) so as to better visualize the critical-path latencies. Nov. 2014 Computer Architecture, Data Path and Control Slide 21 How Good is Our Single-Cycle Design? Clock rate of 125 MHz not impressive How does this compare with current processors on the market? Not bad, where latency is concerned Instruction access 2 ns Register read 1 ns ALU operation 2 ns Data cache access 2 ns Register write 1 ns Total 8 ns Single-cycle clock = 125 MHz A 2.5 GHz processor with 20 or so pipeline stages has a latency of about 0.4 ns/cycle 20 cycles = 8 ns Throughput, however, is much better for the pipelined processor: Up to 20 times better with single issue Perhaps up to 100 times better with multiple issue Nov. 2014 Computer Architecture, Data Path and Control Slide 22 14 Control Unit Synthesis The control unit for the single-cycle design is memoryless • Problematic when instructions vary greatly in complexity • Multiple cycles needed when resources must be reused Topics in This Chapter 14.1 A Multicycle Implementation 14.2 Choosing the Clock Cycle 14.3 The Control State Machine 14.4 Performance of the Multicycle Design 14.5 Microprogramming 14.6 Exception Handling Nov. 2014 Computer Architecture, Data Path and Control Slide 23 14.1 A Multicycle Implementation Appointment book for a dentist Single-cycle Multicycle Assume longest treatment takes one hour Nov. 2014 Computer Architecture, Data Path and Control Slide 24 Single-Cycle vs. Multicycle MicroMIPS Clock Time needed Time allotted Instr 1 Instr 2 Instr 3 Instr 4 Clock Time needed Time allotted Time saved 3 cycles 5 cycles 3 cycles 4 cycles Instr 1 Instr 2 Instr 3 Instr 4 Fig. 14.1 Nov. 2014 Single-cycle versus multicycle instruction execution. Computer Architecture, Data Path and Control Slide 25 A Multicycle Data Path Inst Reg x Reg jta Address rs,rt,rd (rs) PC imm Cache z Reg Reg file ALU (rt) Data Data Reg op y Reg fn Control von Neumann (Princeton) architecture Fig. 14.2 Abstract view of a multicycle instruction execution unit for MicroMIPS. For naming of instruction fields, see Fig. 13.1. Nov. 2014 Computer Architecture, Data Path and Control Slide 26 Multicycle Data Path with Control Signals Shown Three major changes relative to the single-cycle data path: 26 / 1. Instruction & data caches combined Corrections are shown in red Inst Reg 3. Registers added for jta intercycle data x Reg rt 0 rd 1 31 2 Cache Data Reg PCWrite MemWrite MemRead Fig. 14.3 Nov. 2014 op Reg file IRWrite (rt) imm 16 / fn 32 y Reg SE / RegInSrc RegDst ALUZero x Mux ALUOvfl 0 Zero z Reg 1 Ovfl (rs) 0 12 Data 0 1 SysCallAddr rs PC InstData 30 / 4 MSBs Address 0 1 2. ALU performs double duty for address calculation RegWrite y Mux 4 0 1 2 4 3 ALUSrcX ALUSrcY 30 4 ALU Func ALU out ALUFunc PCSrc JumpAddr Key elements of the multicycle MicroMIPS data path. Computer Architecture, Data Path and Control 0 1 2 3 Slide 27 14.2 Clock Cycle and Control Signals Table 14.1 Program counter Cache Register file ALU Nov. 2014 Control signal 0 1 2 JumpAddr jta SysCallAddr PCSrc1, PCSrc0 Jump addr x reg PCWrite Don’t write Write InstData PC z reg MemRead Don’t read Read MemWrite Don’t write Write IRWrite Don’t write Write RegWrite Don’t write Write RegDst1, RegDst0 rt rd $31 RegInSrc1, RegInSrc0 Data reg z reg PC ALUSrcX PC x reg ALUSrcY1, ALUSrcY0 4 y reg AddSub Add Subtract LogicFn1, LogicFn0 AND FnClass1, FnClass0 lui z reg 3 ALU out imm 4 imm OR XOR NOR Set less Arithmetic Logic Computer Architecture, Data Path and Control Slide 28 Multicycle Data Path, Repeated for Reference 26 / 30 / 4 MSBs Corrections are shown in red Inst Reg rt 0 rd 1 31 2 Cache Data Reg InstData PCWrite MemWrite MemRead Fig. 14.3 Nov. 2014 (rs) Reg file 0 12 Data op IRWrite (rt) imm 16 / fn ALUZero x Mux ALUOvfl 0 Zero z Reg 1 Ovfl x Reg rs PC 0 1 SysCallAddr jta Address 32 y Reg SE / RegInSrc RegDst 0 1 RegWrite y Mux 4 0 1 2 4 3 ALUSrcX ALUSrcY 30 4 ALU Func ALU out ALUFunc PCSrc JumpAddr Key elements of the multicycle MicroMIPS data path. Computer Architecture, Data Path and Control 0 1 2 3 Slide 29 Execution Cycles Fetch & PC incr Decode & reg read Table 14.2 Instruction Operations Signal settings Any Read out the instruction and write it into instruction register, increment PC Any Read out rs & rt into x & y registers, compute branch address and save in z register Perform ALU operation and save the result in z register Add base and offset values, save in z register If (x reg) = < (y reg), set PC to branch target address InstData = 0, MemRead = 1 IRWrite = 1, ALUSrcX = 0 ALUSrcY = 0, ALUFunc = ‘+’ PCSrc = 3, PCWrite = 1 ALUSrcX = 0, ALUSrcY = 3 ALUFunc = ‘+’ 1 2 ALU type Set PC to the target address jta, SysCallAddr, or (rs) Write back z reg into rd Load Read memory into data reg ALUSrcX = 1, ALUSrcY = 1 or 2 ALUFunc: Varies ALUSrcX = 1, ALUSrcY = 2 ALUFunc = ‘+’ ALUSrcX = 1, ALUSrcY = 1 ALUFunc= ‘-’, PCSrc = 2 PCWrite = ALUZero or ALUZero or ALUOut31 JumpAddr = 0 or 1, PCSrc = 0 or 1, PCWrite = 1 RegDst = 1, RegInSrc = 1 RegWrite = 1 InstData = 1, MemRead = 1 Store Copy y reg into memory InstData = 1, MemWrite = 1 Load Copy data register into rt RegDst = 0, RegInSrc = 0 RegWrite = 1 ALU type Load/Store ALU oper & PC update 3 Branch Jump Reg write or mem access Reg write for lw Nov. 2014 4 5 Execution cycles for multicycle MicroMIPS Computer Architecture, Data Path and Control Slide 30 14.3 The Control State Machine Cycle 1 Cycle 2 Notes for State 5: % 0 for j or jal, 1 for syscall, don’t-care for other instr’s @ 0 for j, jal, and syscall, 1 for jr, 2 for branches # 1 for j, jr, jal, and syscall, ALUZero () for beq (bne), bit 31 of ALUout for bltz For jal, RegDst = 2, RegInSrc = 1, RegWrite = 1 State 0 InstData = 0 MemRead = 1 IRWrite = 1 ALUSrcX = 0 ALUSrcY = 0 ALUFunc = ‘+’ PCSrc = 3 PCWrite = 1 Jump/ Branch State 1 ALUSrcX = 0 ALUSrcY = 3 ALUFunc = ‘+’ Note for State 7: ALUFunc is determined based on the op and fn fields Fig. 14.4 Nov. 2014 Cycle 4 State 5 ALUSrcX = 1 ALUSrcY = 1 ALUFunc = ‘-’ JumpAddr = % PCSrc = @ PCWrite = # State 6 lw/ sw ALUtype State 2 ALUSrcX = 1 ALUSrcY = 2 ALUFunc = ‘+’ Cycle 5 InstData = 1 MemWrite = 1 Branches based on instruction sw Speculative calculation of branch address Start Cycle 3 lw State 3 State 4 InstData = 1 MemRead = 1 RegDst = 0 RegInSrc = 0 RegWrite = 1 State 7 State 8 ALUSrcX = 1 ALUSrcY = 1 or 2 ALUFunc = Varies RegDst = 0 or 1 RegInSrc = 1 RegWrite = 1 The control state machine for multicycle MicroMIPS. Computer Architecture, Data Path and Control Slide 31 State and Instruction Decoding op /4 5 6 7 8 9 10 11 12 13 14 15 ControlSt0 ControlSt1 ControlSt2 ControlSt3 ControlSt4 ControlSt5 ControlSt6 ControlSt7 ControlSt8 0 1 2 3 4 1 op Decoder st Decoder 0 1 2 3 4 1 fn /6 5 bltzInst jInst jalInst beqInst bneInst 8 addiInst andiInst 10 sltiInst 12 13 14 15 andiInst oriInst xoriInst luiInst 35 lwInst 43 Nov. 2014 0 jrInst 8 12 syscallInst 32 addInst 34 subInst 36 37 38 39 andInst orInst xorInst norInst 42 sltInst swInst 63 Fig. 14.5 /6 RtypeInst fn Decoder st 63 State and instruction decoders for multicycle MicroMIPS. Computer Architecture, Data Path and Control Slide 32 Control Signal Generation Certain control signals depend only on the control state ALUSrcX = ControlSt2 ControlSt5 ControlSt7 RegWrite = ControlSt4 ControlSt8 Auxiliary signals identifying instruction classes addsubInst = addInst subInst addiInst logicInst = andInst orInst xorInst norInst andiInst oriInst xoriInst Logic expressions for ALU control signals AddSub = ControlSt5 (ControlSt7 subInst) FnClass1 = ControlSt7 addsubInst logicInst FnClass0 = ControlSt7 (logicInst sltInst sltiInst) LogicFn1 = ControlSt7 (xorInst xoriInst norInst) LogicFn0 = ControlSt7 (orInst oriInst norInst) Nov. 2014 Computer Architecture, Data Path and Control Slide 33 14.4 Performance of the Multicycle Design R-type Load Store Branch Jump 44% 24% 12% 18% 2% 4 cycles 5 cycles 4 cycles 3 cycles 3 cycles Contribution to CPI R-type 0.444 = 1.76 Load 0.245 = 1.20 Store 0.124 = 0.48 Branch 0.183 = 0.54 Jump 0.023 = 0.06 ALU-type P C Load P C Store P C Branch P C (and jr) Not used Not used Not used Not used Not used Not used Not used Not used _____________________________ Average CPI 4.04 Jump (except jr & jal) P C Not used Fig. 13.6 The MicroMIPS data path unfolded (by depicting the register write step as a separate block) so as to better visualize the critical-path latencies. Nov. 2014 Computer Architecture, Data Path and Control Slide 34 How Good is Our Multicycle Design? Clock rate of 500 MHz better than 125 MHz of single-cycle design, but still unimpressive Cycle time = 2 ns Clock rate = 500 MHz How does the performance compare with current processors on the market? Not bad, where latency is concerned R-type Load Store Branch Jump A 2.5 GHz processor with 20 or so pipeline stages has a latency of about 0.4 20 = 8 ns Contribution to CPI R-type 0.444 = 1.76 Throughput, however, is much better for the pipelined processor: Up to 20 times better with single issue Load Store Branch Jump 44% 24% 12% 18% 2% 0.245 0.124 0.183 0.023 4 cycles 5 cycles 4 cycles 3 cycles 3 cycles = = = = 1.20 0.48 0.54 0.06 _____________________________ Perhaps up to 100 with multiple issue Nov. 2014 Computer Architecture, Data Path and Control Average CPI 4.04 Slide 35 14.5 Microprogramming PC control Cache control Register control ALU inputs ALU Sequence function control 2 bits JumpAddr PCSrc PCWrite FnType LogicFn AddSub ALUSrcY ALUSrcX RegInSrc Microinstruction RegDst RegWrite InstData MemRead MemWrite IRWrite Cycle 1 Cycle 2 Notes for State 5: % 0 for j or jal, 1 for syscall, don’t-care for other instr’s @ 0 for j, jal, and syscall, 1 for jr, 2 for branches # 1 for j, jr, jal, and syscall, ALUZero () for beq (bne), bit 31 of ALUout for bltz For jal, RegDst = 2, RegInSrc = 1, RegWrite = 1 23 Fig. 14.6 Possible 22-bit microinstruction format for MicroMIPS. State 0 InstData = 0 MemRead = 1 IRWrite = 1 ALUSrcX = 0 ALUSrcY = 0 ALUFunc = ‘+’ PCSrc = 3 PCWrite = 1 Jump/ Branch State 1 ALUSrcX = 0 ALUSrcY = 3 ALUFunc = ‘+’ Nov. 2014 Note for State 7: ALUFunc is determined based on the op and fn fields Computer Architecture, Data Path and Control Cycle 4 State 5 ALUSrcX = 1 ALUSrcY = 1 ALUFunc = ‘-’ JumpAddr = % PCSrc = @ PCWrite = # State 6 Cycle 5 InstData = 1 MemWrite = 1 sw lw/ sw Start The control state machine resembles a program (microprogram) Cycle 3 ALUtype State 2 ALUSrcX = 1 ALUSrcY = 2 ALUFunc = ‘+’ lw State 3 State 4 InstData = 1 MemRead = 1 RegDst = 0 RegInSrc = 0 RegWrite = 1 State 7 State 8 ALUSrcX = 1 ALUSrcY = 1 or 2 ALUFunc = Varies RegDst = 0 or 1 RegInSrc = 1 RegWrite = 1 Slide 36 The Control State Machine as a Microprogram Cycle 1 Cycle 2 Notes for State 5: % 0 for j or jal, 1 for syscall, don’t-care for other instr’s @ 0 for j, jal, and syscall, 1 for jr, 2 for branches # 1 for j, jr, jal, and syscall, ALUZero () for beq (bne), bit 31 of ALUout for bltz For jal, RegDst = 2, RegInSrc = 1, RegWrite = 1 State 0 InstData = 0 MemRead = 1 IRWrite = 1 ALUSrcX = 0 ALUSrcY = 0 ALUFunc = ‘+’ PCSrc = 3 PCWrite = 1 Cycle 4 State 5 ALUSrcX = 1 ALUSrcY = 1 ALUFunc = ‘-’ JumpAddr = % PCSrc = @ PCWrite = # State 6 Jump/ Branch ALUSrcX = 0 ALUSrcY = 3 ALUFunc = ‘+’ lw/ sw Start InstData = 1 MemWrite = 1 ALUtype State 2 ALUSrcX = 1 ALUSrcY = 2 ALUFunc = ‘+’ lw State 3 State 4 InstData = 1 MemRead = 1 RegDst = 0 RegInSrc = 0 RegWrite = 1 State 7 State 8 ALUSrcX = 1 ALUSrcY = 1 or 2 ALUFunc = Varies RegDst = 0 or 1 RegInSrc = 1 RegWrite = 1 Multiple substates Fig. 14.4 Nov. 2014 Cycle 5 sw State 1 Note for State 7: ALUFunc is determined based on the op and fn fields Cycle 3 Multiple substates Decompose into 2 substates The control state machine for multicycle MicroMIPS. Computer Architecture, Data Path and Control Slide 37 Symbolic Names for Microinstruction Field Values Table 14.3 Microinstruction field values and their symbolic names. The default value for each unspecified field is the all 0s bit pattern. Field name PC control Cache control Register control ALU inputs* ALU function* Seq. control Possible field values and their symbolic names 0001 1001 x011 x101 x111 PCjump PCsyscall PCjreg PCbranch PCnext 0101 1010 1100 CacheFetch CacheStore CacheLoad 1000 10000 rt Data 1001 10001 rt z 1011 10101 rd z 1101 11010 $31 PC 000 011 101 110 x10 PC 4 PC 4imm x y x imm (imm) 0xx10 1xx01 1xx10 x0011 x0111 + < - x1011 x1111 xxx00 lui 01 10 11 PCdisp1 PCdisp2 PCfetch * The operator symbol stands for any of the ALU functions defined above (except for “lui”). Nov. 2014 Computer Architecture, Data Path and Control Slide 38 fetch: ------------- Control Unit for Microprogramming Multiway branch andi: --------- 64 entries in each table Dispatch table 1 Dispatch table 2 0 0 1 2 3 MicroPC 1 Address Microprogram memory or PLA Incr Data Microinstruction register op (from instruction register) Fig. 14.7 Nov. 2014 Control signals to data path Sequence control Microprogrammed control unit for MicroMIPS . Computer Architecture, Data Path and Control Slide 39 fetch: Microprogram for MicroMIPS 37 microinstructions Fig. 14.8 The complete MicroMIPS microprogram. Nov. 2014 PCnext, CacheFetch PC + 4imm, PCdisp1 lui1: lui(imm) rt z, PCfetch add1: x + y rd z, PCfetch sub1: x - y rd z, PCfetch slt1: x - y rd z, PCfetch addi1: x + imm rt z, PCfetch slti1: x - imm rt z, PCfetch and1: x y rd z, PCfetch or1: x y rd z, PCfetch xor1: x y rd z, PCfetch nor1: x y rd z, PCfetch andi1: x imm rt z, PCfetch ori1: x imm rt z, PCfetch xori: x imm rt z, PCfetch lwsw1: x + imm, mPCdisp2 lw2: CacheLoad rt Data, PCfetch sw2: CacheStore, PCfetch j1: PCjump, PCfetch jr1: PCjreg, PCfetch branch1: PCbranch, PCfetch jal1: PCjump, $31PC, PCfetch syscall1:PCsyscall, PCfetch Computer Architecture, Data Path and Control # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # State State State State State State State State State State State State State State State State State State State State State State State State State State State State State State State State State State State State State 0 (start) 1 7lui 8lui 7add 8add 7sub 8sub 7slt 8slt 7addi 8addi 7slti 8slti 7and 8and 7or 8or 7xor 8xor 7nor 8nor 7andi 8andi 7ori 8ori 7xori 8xori 2 3 4 6 5j 5jr 5branch 5jal 5syscall Slide 40 14.6 Exception Handling Exceptions and interrupts alter the normal program flow Examples of exceptions (things that can go wrong): ALU operation leads to overflow (incorrect result is obtained) Opcode field holds a pattern not representing a legal operation Cache error-code checker deems an accessed word invalid Sensor signals a hazardous condition (e.g., overheating) Exception handler is an OS program that takes care of the problem Derives correct result of overflowing computation, if possible Invalid operation may be a software-implemented instruction Interrupts are similar, but usually have external causes (e.g., I/O) Nov. 2014 Computer Architecture, Data Path and Control Slide 41 Exception Control States Cycle 1 Cycle 2 Jump/ Branch Cycle 3 Cycle 4 State 5 ALUSrcX = 1 ALUSrcY = 1 ALUFunc = ‘-’ JumpAddr = % PCSrc = @ PCWrite = # State 6 Cycle 5 InstData = 1 MemWrite = 1 sw State 0 InstData = 0 MemRead = 1 IRWrite = 1 ALUSrcX = 0 ALUSrcY = 0 ALUFunc = ‘+’ PCSrc = 3 PCWrite = 1 State 1 ALUSrcX = 0 ALUSrcY = 3 ALUFunc = ‘+’ lw/ sw Start ALUtype Illegal operation Fig. 14.10 Nov. 2014 State 2 ALUSrcX = 1 ALUSrcY = 2 ALUFunc = ‘+’ lw State 3 State 4 InstData = 1 MemRead = 1 RegDst = 0 RegInSrc = 0 RegWrite = 1 State 7 State 8 ALUSrcX = 1 ALUSrcY = 1 or 2 ALUFunc = Varies RegDst = 0 or 1 RegInSrc = 1 RegWrite = 1 State 10 IntCause = 0 CauseWrite = 1 ALUSrcX = 0 ALUSrcY = 0 ALUFunc = ‘-’ EPCWrite = 1 JumpAddr = 1 PCSrc = 0 PCWrite = 1 Overflow State 9 IntCause = 1 CauseWrite = 1 ALUSrcX = 0 ALUSrcY = 0 ALUFunc = ‘-’ EPCWrite = 1 JumpAddr = 1 PCSrc = 0 PCWrite = 1 Exception states 9 and 10 added to the control state machine. Computer Architecture, Data Path and Control Slide 42 15 Pipelined Data Paths Pipelining is now used in even the simplest of processors • Same principles as assembly lines in manufacturing • Unlike in assembly lines, instructions not independent Topics in This Chapter 15.1 Pipelining Concepts 15.2 Pipeline Stalls or Bubbles 15.3 Pipeline Timing and Performance 15.4 Pipelined Data Path Design 15.5 Pipelined Control 15.6 Optimal Pipelining Nov. 2014 Computer Architecture, Data Path and Control Slide 43 Nov. 2014 Computer Architecture, Data Path and Control Slide 44 Single-Cycle Data Path of Chapter 13 Incr PC Next addr jta Next PC ALUOvfl (PC) PC (rs) rs rt Instr cache inst rd 31 imm op Br&Jump Fig. 13.3 Nov. 2014 0 1 2 Clock rate = 125 MHz CPI = 1 (125 MIPS) Ovfl Reg file ALU (rt) / 16 0 32 SE / 1 ALU out Data addr Data cache Data out Data in Func 0 1 2 Register input fn RegDst RegWrite ALUSrc ALUFunc DataRead RegInSrc DataWrite Key elements of the single-cycle MicroMIPS data path. Computer Architecture, Data Path and Control Slide 45 Multicycle Data Path of Chapter 14 Clock rate = 500 MHz CPI 4 ( 125 MIPS) Inst Reg 26 / 4 MSBs rt 0 rd 1 31 2 Cache Data Reg InstData PCWrite MemWrite MemRead Fig. 14.3 Nov. 2014 (rs) Reg file 0 12 Data op IRWrite (rt) imm 16 / fn ALUZero x Mux ALUOvfl 0 Zero z Reg 1 Ovfl x Reg rs PC 32 y Reg SE / RegInSrc RegDst 0 1 SysCallAddr jta Address 0 1 30 / RegWrite y Mux 4 0 1 2 4 3 ALUSrcX ALUSrcY 30 4 ALU Func ALU out ALUFunc PCSrc JumpAddr Key elements of the multicycle MicroMIPS data path. Computer Architecture, Data Path and Control 0 1 2 3 Slide 46 Getting the Best of Both Worlds Pipelined: Clock rate = 500 MHz CPI 1 Single-cycle: Multicycle: Clock rate = 125 MHz CPI = 1 Clock rate = 500 MHz CPI 4 Single-cycle analogy: Doctor appointments scheduled for 60 min per patient Nov. 2014 Multicycle analogy: Doctor appointments scheduled in 15-min increments Computer Architecture, Data Path and Control Slide 47 15.1 Pipelining Concepts Strategies for improving performance 1 – Use multiple independent data paths accepting several instructions that are read out at once: multiple-instruction-issue or superscalar 2 – Overlap execution of several instructions, starting the next instruction before the previous one has run to completion: (super)pipelined Approval 1 Cashier 2 2 Registrar ID photo 3 4 Pickup 5 Start here Exit Fig. 15.1 Nov. 2014 Pipelining in the student registration process. Computer Architecture, Data Path and Control Slide 48 Pipelined Instruction Execution Instr cache Instr 3 Instr 4 Instr 5 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Time dimension Instr 2 Instr 1 Cycle 1 Task dimension Fig. 15.2 Nov. 2014 Reg file ALU Data cache Reg file Instr cache Reg file ALU Data cache Reg file Instr cache Reg file ALU Data cache Reg file Instr cache Reg file ALU Data cache Reg file Instr cache Reg file ALU Data cache Reg file Pipelining in the MicroMIPS instruction execution process. Computer Architecture, Data Path and Control Slide 49 Alternate Representations of a Pipeline Except for start-up and drainage overheads, a pipeline can execute one instruction per clock tick; IPS is dictated by the clock frequency 1 2 1 2 3 4 5 f r a d w f r a d w f r a d w f f = Fetch r = Reg read a = ALU op d = Data access w = Writeback r a d w f r a d w f r a d w f r a d 3 4 5 6 7 6 7 8 9 10 11 Cycle 1 2 1 2 3 4 5 6 7 f f f f f f f r r r r r r r a a a a a a a d d d d d d d w w w w w w 3 4 5 w Start-up region 8 9 10 Cycle Drainage region Pipeline stage Instruction (a) Task-time diagram (b) Space-time diagram Fig. 15.3 Two abstract graphical representations of a 5-stage pipeline executing 7 tasks (instructions). Nov. 2014 Computer Architecture, Data Path and Control 11 Slide 50 w Pipelining Example in a Photocopier Example 15.1 A photocopier with an x-sheet document feeder copies the first sheet in 4 s and each subsequent sheet in 1 s. The copier’s paper path is a 4-stage pipeline with each stage having a latency of 1s. The first sheet goes through all 4 pipeline stages and emerges after 4 s. Each subsequent sheet emerges 1s after the previous sheet. How does the throughput of this photocopier vary with x, assuming that loading the document feeder and removing the copies takes 15 s. Solution Each batch of x sheets is copied in 15 + 4 + (x – 1) = 18 + x seconds. A nonpipelined copier would require 4x seconds to copy x sheets. For x > 6, the pipelined version has a performance edge. When x = 50, the pipelining speedup is (4 50) / (18 + 50) = 2.94. Nov. 2014 Computer Architecture, Data Path and Control Slide 51 15.2 Pipeline Stalls or Bubbles First type of data dependency $5 = $6 + $7 $8 = $8 + $6 $9 = $8 + $2 sw $9, 0($3) Cycle 1 Cycle 2 Instr cache Reg file Instr cache Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 ALU Data cache Reg file Reg file ALU Data cache Reg file Instr cache Reg file ALU Data cache Reg file Instr cache Reg file ALU Data cache Cycle 8 Data forwarding Reg file Fig. 15.4 Read-after-write data dependency and its possible resolution through data forwarding . Nov. 2014 Computer Architecture, Data Path and Control Slide 52 Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Instr cache Reg file ALU Data cache Reg file Instr cache Reg file Instr 3 Instr 5 Instr 4 Instr cache Reg file Cycle 7 Cycle 8 Cycle 9 Cycle 2 Cycle 3 Writes into $8 Data cache ALU Bubble Reg file Task dimension Cycle 4 Reg file Data cache Bubble ALU Instr cache Reg file Cycle 5 Cycle 6 Reg file Data cache Reg file ALU Data cache Reg file Cycle 8 Cycle 9 Cycle 7 Without data forwarding, three bubbles are needed to resolve a read-after-write data dependency Reads from $8 Time dimension Instr cache Reg file ALU Instr cache Reg file Instr 3 Instr 2 Instr 1 ALU Bubble Instr cache Cycle 1 Instr cache Instr 4 Instr 5 Cycle 6 Time dimension Instr 2 Instr 1 Inserting Bubbles in a Pipeline Data cache ALU Bubble Reg file Instr cache Task dimension Nov. 2014 Reg file Data cache Writes into $8 Reg file Data cache Reg file Reg file ALU Data cache Reg file Instr cache Reg file ALU Data cache ALU Bubble Two bubbles, if we assume that a register can be updated and read from in one cycle Reads from $8 Reg file Computer Architecture, Data Path and Control Slide 53 Second Type of Data Dependency C ycle 1 C ycle 2 Instr mem Reg file sw $6, . . . C ycle 3 ALU C ycle 4 C ycle 5 Data mem Reg file C ycle 6 C ycle 7 C ycle 8 Reorder? lw $8, . . . Insert bubble? $9 = $8 + $2 Instr mem Reg file ALU Data mem Reg file Instr mem Reg file ALU Data mem Reg file Instr mem Reg file ALU Data mem Reg file Without data forwarding, three (two) bubbles are needed to resolve a read-after-load data dependency Fig. 15.5 Read-after-load data dependency and its possible resolution through bubble insertion and data forwarding. Nov. 2014 Computer Architecture, Data Path and Control Slide 54 Control Dependency in a Pipeline C ycle 1 C ycle 2 Instr mem Reg file Instr mem $6 = $3 + $5 beq $1, $2, . . . Insert bubble? C ycle 3 C ycle 4 C ycle 5 ALU Data mem Reg file Reg file ALU Data mem Reg file Instr mem Reg file ALU Data mem Reg file Instr mem Reg file ALU Data mem $9 = $8 + $2 Assume branch resolved here Fig. 15.6 Nov. 2014 C ycle 6 C ycle 7 C ycle 8 Reorder? (delayed branch) Reg file Here would need 1-2 more bubbles Control dependency due to conditional branch. Computer Architecture, Data Path and Control Slide 55 15.3 Pipeline Timing and Performance Latching of results t Function unit Stage 1 t/q Fig. 15.7 Nov. 2014 Stage 2 Stage 3 .. . Stage q-1 Stage q Pipelined form of a function unit with latching overhead. Computer Architecture, Data Path and Control Slide 56 Throughput Increase in a q-Stage Pipeline Throughput improvement factor 8 7 t t/q + Ideal: /t = 0 6 /t = 0.05 5 or 4 /t = 0.1 q 1 + q / t 3 2 1 1 2 3 4 5 6 Number q of pipeline stages 7 8 Fig. 15.8 Throughput improvement due to pipelining as a function of the number of pipeline stages for different pipelining overheads. Nov. 2014 Computer Architecture, Data Path and Control Slide 57 Pipeline Throughput with Dependencies Assume that one bubble must be inserted due to read-after-load dependency and after a branch when its delay slot cannot be filled. Let be the fraction of all instructions that are followed by a bubble. q Pipeline speedup = (1 + q / t)(1 + ) Effective CPI R-type Load Store Branch Jump 44% 24% 12% 18% 2% Example 15.3 Calculate the effective CPI for MicroMIPS, assuming that a quarter of branch and load instructions are followed by bubbles. Solution Fraction of bubbles = 0.25(0.24 + 0.18) = 0.105 CPI = 1 + = 1.105 (which is very close to the ideal value of 1) Nov. 2014 Computer Architecture, Data Path and Control Slide 58 15.4 Pipelined Data Path Design Stage 1 Stage 2 ALUOvfl 1 PC inst Instr cache rs rt (rs) 1 Reg file ALU imm SE Incr IncrPC SeqInst op Nov. 2014 Stage 5 Data Data addr Address Ovfl (rt) Fig. 15.9 Stage 4 Next addr NextPC 0 Stage 3 Data cache 0 1 Func 0 1 0 1 2 rt rd 0 1 31 2 Br&Jump RegDst fn RegWrite ALUSrc ALUFunc DataRead RetAddr DataWrite RegInSrc Key elements of the pipelined MicroMIPS data path. Computer Architecture, Data Path and Control Slide 59 15.5 Pipelined Control Stage 1 Stage 2 Stage 4 Data Data addr Address ALUOvfl 1 PC Stage 5 Next addr NextPC 0 Stage 3 inst Instr cache rs rt (rs) Ovfl Reg file (rt) 1 imm SE Incr Data cache ALU Func 0 1 0 1 2 rt rd 0 1 31 2 IncrPC 0 1 2 3 5 SeqInst op Br&Jump RegDst fn RegWrite ALUSrc Fig. 15.10 Nov. 2014 ALUFunc DataRead RetAddr DataWrite RegInSrc Pipelined control signals. Computer Architecture, Data Path and Control Slide 60 15.6 Optimal Pipelining MicroMIPS pipeline with more than four-fold improvement PC Instruction fetch Register readout Instr cache Reg file Instr cache ALU operation ALU Reg file Instr cache Data read/store Register writeback Data cache Reg file ALU Reg file Data cache ALU Reg file Reg file Data cache Fig. 15.11 Higher-throughput pipelined data path for MicroMIPS and the execution of consecutive instructions in it . Nov. 2014 Computer Architecture, Data Path and Control Slide 61 Optimal Number of Pipeline Stages Assumptions: Pipeline sliced into q stages Stage overhead is q/2 bubbles per branch (decision made midway) Fraction b of all instructions are taken branches Derivation of q opt Latching of results t Function unit Stage 1 t/q Fig. 15.7 Stage 2 Stage 3 .. . Stage q-1 Stage q Pipelined form of a function unit with latching overhead. Average CPI = 1 + b q / 2 Throughput = Clock rate / CPI = 1 (t / q + )(1 + b q / 2) Differentiate throughput expression with respect to q and equate with 0 q opt = Nov. 2014 2t / b Varies directly with t / and inversely with b Computer Architecture, Data Path and Control Slide 62 Pipelining Example An example combinational-logic data path to compute z := (u + v)(w – x) / y Add/Sub latency 2 ns Multiply latency 6 ns Divide latency 15 ns Throughput, original = 1/(25 10–9) = 40 M computations / s u + v Pipeline register placement, Option 2 w - / Throughput, option 1 = 1/(17 10–9) = 58.8 M computations / s Write, 1 ns z x y Readout, 1 ns Nov. 2014 Pipeline register placement, Option 1 Throughput, Option 2 = 1/(10 10–9) = 100 M computations / s Computer Architecture, Data Path and Control Slide 63 16 Pipeline Performance Limits Pipeline performance limited by data & control dependencies • Hardware provisions: data forwarding, branch prediction • Software remedies: delayed branch, instruction reordering Topics in This Chapter 16.1 Data Dependencies and Hazards 16.2 Data Forwarding 16.3 Pipeline Branch Hazards 16.4 Delayed Branch and Branch Prediction 16.5 Dealing with Exceptions 16.6 Advanced Pipelining Nov. 2014 Computer Architecture, Data Path and Control Slide 64 16.1 Data Dependencies and Hazards Cycle 1 Cycle 2 Instr cache Reg file Instr cache Cycle 3 Cycle 4 Cycle 5 ALU Data cache Reg file Reg file ALU Data cache Reg file Instr cache Reg file ALU Data cache Reg file Instr cache Reg file ALU Data cache Reg file Instr cache Reg file ALU Data cache Fig. 16.1 Nov. 2014 Cycle 6 Cycle 7 Cycle 8 Cycle 9 $2 = $1 - $3 Instructions that read register $2 Reg file Data dependency in a pipeline. Computer Architecture, Data Path and Control Slide 65 Resolving Data Dependencies via Forwarding Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Instr cache Reg file ALU Instr cache Cycle 6 Cycle 7 Data cache Reg file Reg file ALU Data cache Reg file Instr cache Reg file ALU Data cache Reg file Instr cache Reg file ALU Data cache Cycle 8 Cycle 9 $2 = $1 - $3 Instructions that read register $2 Reg file Fig. 16.2 When a previous instruction writes back a value computed by the ALU into a register, the data dependency can always be resolved through forwarding. Nov. 2014 Computer Architecture, Data Path and Control Slide 66 Pipelined MicroMIPS – Repeated for Reference Stage 1 Stage 2 Stage 4 Data Data addr Address ALUOvfl 1 PC Stage 5 Next addr NextPC 0 Stage 3 inst Instr cache rs rt (rs) Ovfl Reg file (rt) 1 imm SE Incr Data cache ALU Func 0 1 0 1 2 rt rd 0 1 31 2 IncrPC 0 1 2 3 5 SeqInst op Br&Jump RegDst fn RegWrite ALUSrc Fig. 15.10 Nov. 2014 ALUFunc DataRead RetAddr DataWrite RegInSrc Pipelined control signals. Computer Architecture, Data Path and Control Slide 67 Certain Data Dependencies Lead to Bubbles Cycle 1 Cycle 2 Instr cache Reg file Instr cache Cycle 3 ALU Cycle 4 Cycle 5 Cycle 6 Data cache Reg file lw Cycle 7 Cycle 8 Cycle 9 $2,4($12) Instructions that read register $2 Reg file ALU Data cache Reg file Instr cache Reg file ALU Data cache Reg file Instr cache Reg file ALU Data cache Reg file Fig. 16.3 When the immediately preceding instruction writes a value read out from the data memory into a register, the data dependency cannot be resolved through forwarding (i.e., we cannot go back in time) and a bubble must be inserted in the pipeline. Nov. 2014 Computer Architecture, Data Path and Control Slide 68 16.2 Data Forwarding Stage 2 Stage 3 s2 rt x2 Reg file SE RetAddr3, RegWrite3, RegWrite4 RegInSrc3, RegInSrc4 Ovfl ALU (rt) x3 0 1 2 y2 x3 y3 x4 y4 t2 Forw arding unit, lower ALUSrc2 Fig. 16.4 Nov. 2014 Data cache x4 0 1 Func y3 d3 ALUSrc1 Stage 5 d3 d4 Forw arding unit, upper x3 y3 x4 y4 (rs) rs Stage 4 0 1 RegInSrc3 d3 RegWrite3 d4 RetAddr3 RetAddr3, RegWrite3, RegWrite4 RegInSrc3, RegInSrc4 y4 d4 RegInSrc4 RegWrite4 Forwarding unit for the pipelined MicroMIPS data path. Computer Architecture, Data Path and Control Slide 69 Design of the Data Forwarding Units Stage 2 Stage 3 s2 Let’s focus on designing the upper data forwarding unit rt Ovfl x2 Reg file ALU (rt) Table 16.1 Partial truth table for the upper forwarding unit in the pipelined MicroMIPS data path. Data cache x3 x3 y3 x4 y4 y2 0 1 0 1 y3 d3 Forw arding unit, lower t2 ALUSrc1 ALUSrc2 Fig. 16.4 x4 Func 0 1 2 SE Stage 5 RetAddr3, RegWrite3, RegWrite4 RegInSrc3, RegInSrc4 x3 y3 x4 y4 (rs) rs Stage 4 d3 d4 Forw arding unit, upper y4 RegInSrc3 RegWrite3 RetAddr3 RetAddr3, RegWrite3, RegWrite4 RegInSrc3, RegInSrc4 d3 d4 d4 RegInSrc4 RegWrite4 Forwarding unit for the pipelined MicroMIPS data path. RegWrite3 RegWrite4 s2matchesd3 s2matchesd4 RetAddr3 RegInSrc3 RegInSrc4 Choose 0 0 x x x x x x2 0 1 x 0 x x x x2 0 1 x 1 x x 0 x4 0 1 x 1 x x 1 y4 1 0 1 x 0 1 x x3 1 0 1 x 1 1 x y3 1 1 1 1 0 1 x x3 Incorrect in textbook Nov. 2014 Computer Architecture, Data Path and Control Slide 70 Hardware for Inserting Bubbles Stage 1 Stage 2 Stage 3 Data hazard detector LoadPC LoadInst Inst Instr cache PC (rs) rs rt x2 Reg file (rt) LoadIncrPC Inst reg Corrections to textbook figure shown in red 0 1 2 IncrPC y2 Bubble Control signals from decoder All-0s t2 0 1 Controls or all-0s DataRead2 Fig. 16.5 Data hazard detector for the pipelined MicroMIPS data path. Nov. 2014 Computer Architecture, Data Path and Control Slide 71 Augmentations to Pipelined Data Path and Control Branch predictor Next addr forwarders Hazard detector Stage 1 Data cache forwarder Stage 3 Stage 4 Stage 2 ALUOvfl 1 PC Stage 5 Next addr NextPC 0 ALU forwarders inst Instr cache rs rt (rs) Ovfl Reg file imm SE Incr IncrPC Data cache ALU (rt) 1 Data Data addr Address Func 0 1 0 1 0 1 2 rt rd 0 1 31 2 2 3 5 SeqInst Fig. 15.10 Nov. 2014 op Br&Jump RegDst fn RegWrite ALUSrc ALUFunc Computer Architecture, Data Path and Control DataRead RetAddr DataWrite RegInSrc Slide 72 16.3 Pipeline Branch Hazards Software-based solutions Compiler inserts a “no-op” after every branch (simple, but wasteful) Branch is redefined to take effect after the instruction that follows it Branch delay slot(s) are filled with useful instructions via reordering Hardware-based solutions Mechanism similar to data hazard detector to flush the pipeline Constitutes a rudimentary form of branch prediction: Always predict that the branch is not taken, flush if mistaken More elaborate branch prediction strategies possible Nov. 2014 Computer Architecture, Data Path and Control Slide 73 16.4 Branch Prediction Predicting whether a branch will be taken Always predict that the branch will not be taken Use program context to decide (backward branch is likely taken, forward branch is likely not taken) Allow programmer or compiler to supply clues Decide based on past history (maintain a small history table); to be discussed later Apply a combination of factors: modern processors use elaborate techniques due to deep pipelines Nov. 2014 Computer Architecture, Data Path and Control Slide 74 Forward and Backward Branches Example 5.5 List A is stored in memory beginning at the address given in $s1. List length is given in $s2. Find the largest integer in the list and copy it into $t0. Solution Scan the list, holding the largest element identified thus far in $t0. lw addi loop: add beq add add add lw slt beq addi j done: ... Nov. 2014 $t0,0($s1) $t1,$zero,0 $t1,$t1,1 $t1,$s2,done $t2,$t1,$t1 $t2,$t2,$t2 $t2,$t2,$s1 $t3,0($t2) $t4,$t0,$t3 $t4,$zero,loop $t0,$t3,0 loop # # # # # # # # # # # # # initialize maximum to A[0] initialize index i to 0 increment index i by 1 if all elements examined, quit compute 2i in $t2 compute 4i in $t2 form address of A[i] in $t2 load value of A[i] into $t3 maximum < A[i]? if not, repeat with no change if so, A[i] is the new maximum change completed; now repeat continuation of the program Computer Architecture, Data Path and Control Slide 75 Simple Branch Prediction: 1-Bit History Taken Not taken Not taken Predict taken Predict not taken Taken Two-state branch prediction scheme. Problem with this approach: Each branch in a loop entails two mispredictions: Once in first iteration (loop is repeated, but the history indicates exit from loop) Once in last iteration (when loop is terminated, but history indicates repetition) Nov. 2014 Computer Architecture, Data Path and Control Slide 76 Simple Branch Prediction: 2-Bit History Taken Not taken Not taken Predict taken Taken Not taken Predict taken again Taken Predict not taken Not taken Predict not taken again Taken Fig. 16.6 Four-state branch prediction scheme. Example 16.1 L1: ---10 iter’s ---20 iter’s L2: ------br <c2> L2 ---br <c1> L1 Nov. 2014 Impact of different branch prediction schemes Solution Always taken: 11 mispredictions, 94.8% accurate 1-bit history: 20 mispredictions, 90.5% accurate 2-bit history: Same as always taken Computer Architecture, Data Path and Control Slide 77 Other Branch Prediction Algorithms Problem 16.3 Taken Not taken Not taken Part a Predict taken Taken Not taken Predict taken again Taken Part b Predict taken Taken Taken Not taken Predict not taken Not taken Taken Predict taken again Taken Predict not taken again Not taken Taken Predict not taken Not taken Predict not taken again Not taken Taken Not taken Not taken Fig. 16.6 Predict taken Taken Not taken Predict taken again Taken Predict not taken Not taken Predict not taken again Taken Nov. 2014 Computer Architecture, Data Path and Control Slide 78 Hardware Implementation of Branch Prediction Low-order bits used as index Addresses of recent branch instructions Target addresses History bit(s) Incremented PC Next PC 0 Read-out table entry From PC Fig. 16.7 Compare = 1 Logic Hardware elements for a branch prediction scheme. The mapping scheme used to go from PC contents to a table entry is the same as that used in direct-mapped caches (Chapter 18) Nov. 2014 Computer Architecture, Data Path and Control Slide 79 Pipeline Augmentations – Repeated for Reference Branch predictor Next addr forwarders Hazard detector Stage 1 Data cache forwarder Stage 3 Stage 4 Stage 2 ALUOvfl 1 PC Stage 5 Next addr NextPC 0 ALU forwarders inst Instr cache rs rt (rs) Ovfl Reg file imm SE Incr IncrPC Data cache ALU (rt) 1 Data Data addr Address Func 0 1 0 1 0 1 2 rt rd 0 1 31 2 2 3 5 SeqInst Fig. 15.10 Nov. 2014 op Br&Jump RegDst fn RegWrite ALUSrc ALUFunc Computer Architecture, Data Path and Control DataRead RetAddr DataWrite RegInSrc Slide 80 16.5 Advanced Pipelining Deep pipeline = superpipeline; also, superpipelined, superpipelining Parallel instruction issue = superscalar, j-way issue (2-4 is typical) Stage 1 Stage 2 Stage 3 Stage 4 Stage 5 Variable # of stages Stage q-2 Stage q-1 Stage q Function unit 1 Instr cache Instr decode Operand prep Instr issue Function unit 2 Function unit 3 Instruction fetch Retirement & commit stages Fig. 16.8 Dynamic instruction pipeline with in-order issue, possible out-of-order completion, and in-order retirement. Nov. 2014 Computer Architecture, Data Path and Control Slide 81 Design Space for Advanced Superscalar Pipelines Front end: Instr. issue: Writeback: Commit: In-order or out-of-order In-order or out-of-order In-order or out-of-order In-order or out-of-order The more OoO stages, the higher the complexity Example of complexity due to out-of-order processing: MIPS R10000 Source: Ahi, A. et al., “MIPS R10000 Superscalar Microprocessor,” Proc. Hot Chips Conf., 1995. Nov. 2014 Computer Architecture, Data Path and Control Slide 82 Performance Improvement for Deep Pipelines Hardware-based methods Lookahead past an instruction that will/may stall in the pipeline (out-of-order execution; requires in-order retirement) Issue multiple instructions (requires more ports on register file) Eliminate false data dependencies via register renaming Predict branch outcomes more accurately, or speculate Software-based method Pipeline-aware compilation Loop unrolling to reduce the number of branches Loop: Compute with index i Increment i by 1 Go to Loop if not done Nov. 2014 Loop: Compute with index i Compute with index i + 1 Increment i by 2 Go to Loop if not done Computer Architecture, Data Path and Control Slide 83 CPI Variations with Architectural Features Table 16.2 Effect of processor architecture, branch prediction methods, and speculative execution on CPI. Architecture Methods used in practice CPI Nonpipelined, multicycle Strict in-order instruction issue and exec 5-10 Nonpipelined, overlapped In-order issue, with multiple function units 3-5 Pipelined, static In-order exec, simple branch prediction 2-3 Superpipelined, dynamic Out-of-order exec, adv branch prediction 1-2 Superscalar 2- to 4-way issue, interlock & speculation 0.5-1 Advanced superscalar 4- to 8-way issue, aggressive speculation 0.2-0.5 Need 100 for TIPS performance Need 100,000 for 1 PIPS Nov. 2014 3.3 inst / cycle 3 Gigacycles / s 10 GIPS Computer Architecture, Data Path and Control Slide 84 Development of Intel’s Desktop/Laptop Micros In the beginning, there was the 8080; led to the 80x86 = IA32 ISA Half a dozen or so pipeline stages 80286 80386 80486 Pentium (80586) More advanced technology A dozen or so pipeline stages, with out-of-order instruction execution Pentium Pro Pentium II Pentium III Celeron More advanced technology Instructions are broken into micro-ops which are executed out-of-order but retired in-order Two dozens or so pipeline stages Pentium 4 Nov. 2014 Computer Architecture, Data Path and Control Slide 85 Current State of Computer Performance Multi-GIPS/GFLOPS desktops and laptops Very few users need even greater computing power Users unwilling to upgrade just to get a faster processor Current emphasis on power reduction and ease of use Multi-TIPS/TFLOPS in large computer centers World’s top 500 supercomputers, http://www.top500.org Next list due in June 2009; as of Nov. 2008: All 500 >> 10 TFLOPS, 30 > 100 TFLOPS, 1 > PFLOPS Multi-PIPS/PFLOPS supercomputers on the drawing board IBM “smarter planet” TV commercial proclaims (in early 2009): “We just broke the petaflop [sic] barrier.” The technical term “petaflops” is now in the public sphere Nov. 2014 Computer Architecture, Data Path and Control Slide 86 The Shrinking Supercomputer Nov. 2014 Computer Architecture, Data Path and Control Slide 87 16.6 Dealing with Exceptions Exceptions present the same problems as branches How to handle instructions that are ahead in the pipeline? (let them run to completion and retirement of their results) What to do with instructions after the exception point? (flush them out so that they do not affect the state) Precise versus imprecise exceptions Precise exceptions hide the effects of pipelining and parallelism by forcing the same state as that of strict sequential execution (desirable, because exception handling is not complicated) Imprecise exceptions are messy, but lead to faster hardware (interrupt handler can clean up to offer precise exception) Nov. 2014 Computer Architecture, Data Path and Control Slide 88 The Three Hardware Designs for MicroMIPS Incr PC Single-cycle Next addr jta Next PC ALUOvfl (PC) PC Instr cache inst rd 31 imm op 0 1 2 Ovfl Inst Reg Reg file ALU (rt) / 16 ALU out Data cache 0 1 0 1 2 Cache Data Reg InstData RegDst RegWrite ALUSrc Stage 1 y Mux 0 1 2 4 3 (rt) imm 16 / 4 32 y Reg SE / Stage 2 IRWrite RegInSrc RegDst ALUSrcX RegWrite Stage 4 ALUOvfl PC fn Stage 3 1 inst Instr cache rs rt (rs) 4 ALU 0 1 2 3 Func ALU out ALUSrcY Stage 5 Reg file 1 Incr IncrPC rt rd 31 PCSrc JumpAddr 500 MHz CPI 4 Address Data cache ALU imm SE ALUFunc Data Data addr Ovfl (rt) 500 MHz CPI 1.1 op MemWrite MemRead Next addr NextPC 0 PCWrite DataRead RegInSrc DataWrite ALUFunc 125 MHz CPI = 1 Func 0 1 0 1 0 1 2 0 1 2 2 3 5 SeqInst op Nov. 2014 Reg file 30 Register input fn Br&Jump (rs) 0 1 Data ALUZero x Mux ALUOvfl 0 Zero z Reg 1 Ovfl x Reg rs rt 0 1 rd 31 2 0 1 SysCallAddr jta PC Data out Data in Func 0 32 SE / 1 Data addr 30 / 4 MSBs Address (rs) rs rt 26 / Multicycle Br&Jump RegDst fn RegWrite ALUSrc ALUFunc DataRead RetAddr DataWrite RegInSrc Computer Architecture, Data Path and Control Slide 89 Where Do We Go from Here? Memory Design: How to build a memory unit that responds in 1 clock Input and Output: Peripheral devices, I/O programming, interfacing, interrupts Higher Performance: Vector/array processing Parallel processing Nov. 2014 Computer Architecture, Data Path and Control Slide 90