Chapter 4 Processor Architecture Pipelined Implementation Instructor: Dr. Hyunyoung Lee Texas A&M University –1– Based on slides provided by Randal E. Bryant, CMU Overview General Principles of Pipelining Goal: Enhancing Performance Difficulties: Complex implementation, Non-deterministic behavior etc. Creating a Pipelined Y86 Processor Rearranging SEQ Inserting pipeline registers Problems with data and control hazards Data Hazards » Instruction having register R as source follows shortly after instruction having register R as destination » Common condition, don’t want to slow down pipeline Control Hazards » Mispredict conditional branch • Our design predicts all branches as being taken • Naïve pipeline executes two extra instructions » Getting return address for ret instruction • Naïve pipeline executes three extra instructions –2– Real-World Pipelines: Car Washes Sequential Parallel Pipelined Idea –3– Divide process into independent stages Move objects through stages in sequence At any given times, multiple objects being processed Computational Example 300 ps 20 ps Combinational logic R e g Delay = 320 ps Throughput = 3.12 GIPS Clock System –4– Computation requires total of 300 picoseconds Additional 20 picoseconds to save result in register Can must have clock cycle of at least 320 ps Throughput = 1/(320 ps) = 0.003125 instructions per ps = 3125000000 instructions per sec = 3.12 GIPS 3-Way Pipelined Version 100 ps 20 ps 100 ps 20 ps 100 ps Comb. logic A R e g Comb. logic B R e g Comb. logic C 20 ps R Delay = 360 ps e Throughput = 8.33 GIPS g Clock System Divide combinational logic into 3 blocks of 100 ps each Can begin new operation as soon as previous one passes through stage A. Begin new operation every 120 ps Throughput increases by a factor of 8.33/3.12=2.67 at the expense of increased latency* (360/320=1.12) * Latency: total time required to perform a single instruction from beginning to end –5– Pipeline Diagrams Unpipelined OP1 OP2 OP3 Time Cannot start new operation until previous one completes 3-Way Pipelined OP1 OP2 A B C A B C A B OP3 C Time –6– Up to 3 operations in process simultaneously Operating a Pipeline 239 241 300 359 Clock OP1 A OP2 B C A B C A B OP3 0 120 240 360 C 480 640 Time 100 ps 20 ps 100 ps 20 ps 100 ps 20 ps Comb. logic A R e g Comb. logic B R e g Comb. logic C R e g Clock –7– Limitations: Nonuniform Delays 50 ps 20 ps 150 ps 20 ps 100 ps Comb. logic R e g Comb. logic B R e g Comb. logic C A OP1 OP2 A B OP3 B A R Delay = 510 ps e Throughput = 5.88 GOPS g Clock C A 20 ps C B C Time –8– Throughput limited by slowest stage Other stages sit idle for much of the time Challenging to partition system into balanced stages Limitations: Register Overhead 50 ps 20 ps 50 ps 20 ps 50 ps 20 ps 50 ps 20 ps 50 ps 20 ps 50 ps 20 ps Comb. logic R e g Comb. logic R e g Comb. logic R e g Clock 3-stage pipeline: 6-stage pipeline: –9– R e g Comb. logic R e g Comb. logic R e g Delay = 420 ps, Throughput = 14.29 GOPS As try to deepen pipeline, overhead of loading registers becomes more significant Percentage of clock cycle spent loading register: 1-stage pipeline: Comb. logic 6.25% 16.67% 28.57% High speeds of modern processor designs obtained through very deep pipelining Data Hazards and Control Hazards Make the pipelined processor work! Data Hazards Instruction having register R as source follows shortly after instruction having register R as destination Common condition, don’t want to slow down pipeline Control Hazards Mispredict conditional branch Our design predicts all branches as being taken Naïve pipeline executes two extra instructions Getting return address for ret instruction Naïve pipeline executes three extra instructions – 10 – Data Dependencies Combinational logic R e g Clock OP1 OP2 OP3 Time System – 11 – Each operation depends on result from preceding one Data Hazards Comb. logic A OP1 OP2 R e g A Comb. logic B R e g Comb. logic C Clock B C A B C A B C A B OP3 OP4 R e g C Time – 12 – Result does not feed back around in time for next operation Pipelining has changed behavior of system Data Dependencies in Processors 1 irmovl $50, %eax 2 addl %eax , 3 mrmovl 100( %ebx ), %ebx %edx Result from one instruction used as operand for another Read-after-write (RAW) dependency Very common in actual programs Must make sure our pipeline handles these properly Get correct results Minimize performance impact – 13 – newPC SEQ Hardware Stages occur in sequence One operation in process at a time New PC PC valM data out read write Addr Execute Bch valE CC CC ALU ALU Key – 14 – Blue boxes: predesigned hardware blocks E.g., memories, ALU Gray boxes: control logic Describe in HCL White ovals: labels for signals Thick lines: 32-bit word values Thin lines: 4-8 bit values Dotted lines: 1-bit values Data Data memory memory Mem. control Memory ALU A Data ALU fun. ALU B valA Decode A valB dstE dstM srcA srcB dstE dstM srcA srcB B Register Register M file file E Write back icode Fetch ifun rA rB Instruction Instruction memory memory PC valC valP PC PC increment increment valM SEQ+ Hardware data out read Data Data memory memory Mem. control Memory write Addr Still sequential implementation Reorder PC stage to put at beginning Execute Bch valE CC CC ALU ALU ALU A Data ALU fun. ALU B PC Stage Task is to select PC for current instruction Based on results computed by previous instruction Processor State – 15 – PC is no longer stored in register But, can determine PC based on other stored information valA valB dstE dstM srcA srcB dstE dstM srcA srcB Decode A B Register Register M file file E Write back icode Fetch ifun rA rB valC Instruction Instruction memory memory valP PC PC increment increment PC PC PC pIcode pBch pValM pValC pValP Adding Pipeline Registers valE, valM W_icode, W_valM Write back valM W_valE, W_valM, W_dstE, W_dstM valM W valM Data Data memory memory Memory Memory Data Data memory memory M_icode, M_Bch, M_valA Addr, Data Addr, Data M valE Bch Bch CC CC Execute ALU ALU valE CC CC Execute aluA, aluB ALU ALU aluA, aluB E valA, valB Decode valA, valB srcA, srcB dstA, dstB icode, valC valP A B Register Register M file file d_srcA, d_srcB Decode A B Register Register M file file E E Write back valP icode, ifun rA, rB valC Fetch Instruction Instruction memory memory D valP icode, ifun, rA, rB, valC PC PC increment increment Instruction Instruction memory memory Fetch valP PC PC increment increment PC PC predPC pState PC f_PC F – 16 – W_icode, W_valM Pipeline Stages W_valE, W_valM, W_dstE, W_dstM W valM Fetch Memory Data Data memory memory M_icode, M_Bch, M_valA Addr, Data Select current PC Read instruction Compute incremented PC M Bch valE CC CC Execute ALU ALU aluA, aluB Decode E Read program registers valA, valB Execute d_srcA, d_srcB Decode A B Register Register M file file E Write back Operate ALU Memory D icode, ifun, rA, rB, valC Instruction Instruction memory memory Fetch Write Back – 17 – valP PC PC increment increment predPC Read or write data memory PC valP Update register file f_PC F Write back PIPE- Hardware W icode valE Mem. control write Pipeline registers hold intermediate values from instruction execution Forward (Upward) Paths Values passed from one stage to next Cannot jump past stages dstE dstM data out read valM Data Data memory memory Memory data in Addr M_valA M_Bch M icode Bch valE valA dstE dstM e_Bch Execute E ALU fun. ALU ALU CC CC icode ifun ALU A ALU B valC valA valB d_srcA d_srcB Select A Decode d_rvalA A dstE dstM srcA srcB W_valM B Register Register M file file W_valE E e.g., valC passes through decode dstE dstM srcA srcB D Fetch icode ifun rA rB Instruction Instruction memory memory valC valP PC PC increment increment Predict PC f_PC M_valA Select PC F – 18 – predPC W_valM Write back Feedback Paths W icode valE dstE dstM data out read Mem. control write Predicted PC valM Data Data memory memory Memory data in Addr M_valA M_Bch Guess value of next PC M icode Bch valE valA dstE dstM e_Bch Branch information Jump taken/not-taken Fall-through or target address Execute E ALU fun. ALU ALU CC CC icode ifun ALU A ALU B valC valA valB dstE dstM srcA srcB d_srcA d_srcB Return point Decode To register file write ports A dstE dstM srcA srcB W_valM B W_valE E D Fetch icode ifun rA rB Instruction Instruction memory memory valC valP PC PC increment increment Predict PC f_PC M_valA Select PC F – 19 – d_rvalA Register Register M file file Read from memory Register updates Select A predPC W_valM Predicting the PC D M_icode M_Bch M_valA W_icode W_valM icode ifun rA rB valC valP Predict PC Need valC Instr valid Need regids Split Split PC PC increment increment Align Align Byte 0 Bytes 1-5 Instruction Instruction memory memory Select PC F predPC Start fetch of new instruction after current one has completed fetch stage Not enough time to reliably determine next instruction Guess which instruction will follow Recover if prediction was incorrect – 20 – Our Prediction Strategy Instructions that don’t transfer control Predict next PC to be valP Always reliable Call and unconditional Jumps Predict next PC to be valC (destination) Always reliable Conditional jumps Predict next PC to be valC (destination) Only correct if branch is taken Typically right 60% of time Return instruction – 21 – Don’t try to predict Recovering from PC Misprediction M_icode M_Bch M_valA W_icode W_valM D icode ifun rA rB valC valP Predict PC Need valC Instr valid Need regids Split Split PC PC increment increment Align Align Byte 0 Bytes 1-5 Instruction Instruction memory memory Select PC F predPC Mispredicted Jump Will see branch flag once instruction reaches memory stage Can get fall-through PC from valA Return Instruction Will get return PC when ret reaches write-back stage – 22 – Pipeline Demonstration irmovl $1,%eax #I1 irmovl $2,%ecx #I2 irmovl $3,%edx #I3 irmovl $4,%ebx #I4 halt #I5 1 2 3 4 5 F D E M W F D E M W F D E M W F D E M W F D E M Cycle 5 File: demo-basic.ys W I1 M I2 E I3 D I4 F I5 – 23 – 6 7 8 9 W Data Dependencies: 3 Nop’s # demo-h3.ys 1 2 3 4 5 0x000: irmovl $10,%edx F D E M W F D E M W F D E M W F D E M W F D E M W F D E M W F D E M 0x006: irmovl 0x00c: nop 0x00d: nop 0x00e: nop $3,%eax 0x00f: addl %edx,%eax 0x011: halt 6 7 8 9 10 Cycle 6 W R[ %eax] f 3 Cycle 7 D – 24 – valA f R[ %edx] = 10 valB f R[ %eax] = 3 11 W Data Dependencies: 2 Nop’s # demo-h2.ys 1 2 3 4 5 0x000: irmovl $10,%edx F D E M W F D E M W F D E M W F D E M W F D E M W F D E M 0x006: irmovl $3,%eax 0x00c: nop 0x00d: nop 0x00e: addl %edx,%eax 0x010: halt 6 7 8 9 Cycle 6 W R[ %eax] f 3 • • • D valA f R[ %edx] = 10 valB f R[ %eax] = 0 – 25 – Error 10 W Data Dependencies: 1 Nop # demo-h1.ys 1 2 3 4 5 0x000: irmovl $10,%edx F D E M W F D E M W F D E M W F D E M W F D E M 0x006: irmovl 0x00c: $3,%eax nop 0x00d: addl %edx,%eax 0x00f: halt 6 7 8 Cycle 5 W R[ %edx] f 10 M M_valE = 3 M_dstE = %eax • • • D – 26 – valA f R[ %edx] = 0 valB f R[ %eax] = 0 Error 9 W Data Dependencies: No Nop # demo-h0.ys 1 2 3 4 5 0x000: irmovl $10,%edx F D E M W F D E M W F D E M W F D E M 0x006: irmovl $3,%eax 0x00c: addl %edx,%eax 0x00e: halt 6 7 Cycle 4 M M_valE = 10 M_dstE = %edx E e_valE f 0 + 3 = 3 E_dstE = %eax D valA f R[ %edx] = 0 valB f R[ %eax] = 0 – 27 – Error 8 W Stalling for Data Dependencies # demo-h2.ys 1 2 3 4 5 0x000: irmovl $10,%edx F D E M W F D E M W F D E M W F D E M W E M W D D E M W F F D E M 0x006: irmovl $3,%eax 0x00c: nop 0x00d: nop 6 bubble 0x00e: addl %edx,%eax 0x010: halt – 28 – F 7 8 9 If instruction follows too closely after one that writes register, slow it down Hold instruction in decode Dynamically inject nop into execute stage 10 11 W Stall Condition Source Registers srcA and srcB of current instruction in decode stage Destination Registers dstE and dstM fields Instructions in execute, memory, and write-back stages Special Case Don’t stall for register ID 15 (0xF) Indicates absence of register operand – 29 – Don’t stall for failed conditional move Detecting Stall Condition # demo-h2.ys 1 2 3 4 5 0x000: irmovl $10,%edx F D E M W F D E M W F D E M W F D E M W E M W D D E M W F F D E M 0x006: irmovl $3,%eax 0x00c: nop 0x00d: nop 6 bubble 0x00e: addl %edx,%eax 0x010: halt F 7 Cycle 6 W W_dstE = %eax W_valE = 3 • • • D – 30 – srcA = %edx srcB = %eax 8 9 10 11 W Stalling X3 # demo-h0.ys 1 2 3 4 5 F D E M W F D E M W E M W E M W E M W 0x000: irmovl $10,%edx 0x006: irmovl $3,%eax bubble 6 bubble bubble F 0x00c: addl %edx,%eax 7 8 D D D E M W F F F F D E M Cycle 6 W Cycle 5 W_dstE = %eax M Cycle 4 M_dstE = %eax E • • • D D – 31 – srcA = %edx srcB = %eax 10 D 0x00e: halt E_dstE = %eax 9 srcA = %edx srcB = %eax • • • D srcA = %edx srcB = %eax 11 W What Happens When Stalling? # demo-h0.ys 0x000: irmovl $10,%edx 0x006: irmovl $3,%eax 0x00c: addl %edx,%eax 0x00e: halt Cycle 8 4 5 6 7 Write Back Memory Execute Decode Fetch 0x000: bubble 0x006: irmovl $10,%edx $3,%eax 0x006: bubble 0x00c: irmovl addl %edx,%eax $3,%eax 0x00c: halt 0x00e: addl %edx,%eax 0x00e: halt Stalling instruction held back in decode stage Following instruction stays in fetch stage Bubbles injected into execute stage Like dynamically generated nop’s Move through later stages – 32 – 0x000: bubble 0x006: irmovl $10,%edx $3,%eax Implementing Stalling W_stat W_stall W stat icode valE valM dstE dstM valE valA dstE dstM M_icode M_bubble M stat icode Cnd m_stat e_Cnd stat set_cc Pipe control logic CC E_dstM E_icode E_bubble E stat icode ifun valC valA valB dstE dstM srcA srcB d_srcB d_srcA srcB D_icode srcA D_bubble D_stall D F_stall F stat icode ifun rA rB valC valP predPC Pipeline Control – 33 – Combinational logic detects stall condition Sets mode signals for how pipeline registers should update Pipeline Register Modes Input = y Output = x x Normal stall =0 Output = x x stall =1 – 34 – Output = x x stall =0 y _ Rising clock _ Output = x x bubble =0 Input = y Bubble _ Output = y bubble =0 Input = y Stall _ Rising clock bubble =1 _ Rising clock _ n o p Output = nop Data Forwarding Naïve Pipeline Register isn’t written until completion of write-back stage Source operands read from register file in decode stage Needs to be in register file at start of stage Observation Value generated in execute or memory stage Trick – 35 – Pass value directly from generating instruction to decode stage Needs to be available at end of decode stage Data Forwarding Example # demo-h2.ys 1 2 3 4 5 0x000: irmovl $10,%edx F D F E D F M E D F W M E D F 0x006: irmovl $3,%eax 0x00c: nop 0x00d: nop 0x00e: addl %edx,%eax 0x010: halt – 36 – irmovl in writeback stage Destination value in W pipeline register Forward as valB for decode stage 6 7 8 9 10 W M E D F W M E D W M E W M W Cycle 6 W R[ %eax] f 3 W_dstE = %eax W_valE = 3 • • • D srcA = %edx srcB = %eax valA f R[ %edx] = 10 valB f W_valE = 3 Bypass Paths Decode Stage Forwarding logic selects valA and valB Normally from register file Forwarding: get valA or valB from later pipeline stage Forwarding Sources – 37 – Execute: valE Memory: valE, valM Write back: valE, valM Data Forwarding Example #2 # demo-h0.ys 1 2 3 4 5 6 0x000: irmovl $10,%edx F D F E D M E W M W F D E M W F D E M 0x006: irmovl $3,%eax 0x00c: addl %edx,%eax 0x00e: halt Register %edx Generated by ALU during previous cycle Forward from memory as valA Register %eax – 38 – Value just generated by ALU Forward from execute as valB 7 Cycle 4 M M_dstE = %edx M_valE = 10 E E_dstE = %eax e_valE f 0 + 3 = 3 D srcA = %edx srcB = %eax valA f M_valE = 10 valB f e_valE = 3 8 W Forwarding Priority # demo-priority.ys 1 2 3 4 5 0x000: irmovl $1, %eax F D E M W F D E M W F D E M W F D E M W F D E M 0x006: irmovl $2, %eax 0x00c: irmovl $3, %eax 0x012: rrmovl %eax, %edx 0x014: halt Multiple Forwarding Choices – 39 – Which one should have priority Match serial semantics Use matching value from earliest pipeline stage 6 Cycle 5 W R[ %eax] f 3 1 M W R[ %eax] f 3 2 E W R[ %eax] f 3 D valA f R[ %eax 10 %edx] = ? valB f 0 R[ %eax] = 0 7 8 9 W 10 Limitation of Forwarding # demo-luh.ys 0x000: 0x006: 0x00c: 0x012: 0x018: 0x01e: 0x020: 1 2 4 irmovl $128,%edx F D E M irmovl $3,%ecx F D E rmmovl %ecx, 0(%edx) F D irmovl $10,%ebx F mrmovl 0(%edx),%eax # Load %eax addl %ebx,%eax # Use %eax halt Load-use dependency 3 Value needed by end of decode stage in cycle 7 Value read from memory in memory stage of cycle 8 5 6 W M E D F W M E D F 7 8 9 10 11 W M E D F W M E D W M E W M W Cycle 7 Cycle 8 M M M_dstE = %ebx M_valE = 10 M_dstM = %eax m_valM f M[128] = 3 • • • D valA f M_valE = 10 valB f R[%eax] = 0 – 40 – Error Avoiding Load/Use Hazard # demo-luh.ys 1 2 3 4 5 irmovl $128,%edx F irmovl $3,%ecx rmmovl %ecx, 0(%edx) irmovl $10,%ebx D E M W F D F E D F 0x018: mrmovl 0(%edx),%eax # Load %eax bubble 0x01e: addl %ebx,%eax # Use %eax 0x020: halt M E D F 0x000: 0x006: 0x00c: 0x012: Stall using instruction for one cycle Can then pick up loaded value by forwarding from memory stage 6 7 8 9 W M E D W M E W M W D F E D F M E D F Cycle 8 W W_dstE = %ebx W_valE = 10 M M_dstM = %eax m_valM f M[128] = 3 • • • D – 41 – valA f W_valE = 10 valB f m_valM = 3 10 11 W M E W M 12 W Detecting Load/Use Hazard Condition Trigger Load/Use Hazard E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB } – 42 – Control for Load/Use Hazard # demo-luh.ys 1 2 3 0x000: 0x006: 0x00c: 0x012: 0x018: 4 irmovl $128,%edx F D E M irmovl $3,%ecx F D E rmmovl %ecx, 0(%edx) F D irmovl $10,%ebx F mrmovl 0(%edx),%eax # Load %eax bubble 0x01e: addl %ebx,%eax # Use %eax 0x020: halt 6 7 W M E D F W M E D W M E F D F 8 9 10 11 W M E D F W M E D W M E W M 12 Stall instructions in fetch and decode stages Inject bubble into execute stage Condition Load/Use Hazard – 43 – 5 F D E M W stall stall bubble normal normal W Branch Misprediction Example demo-j.ys 0x000: xorl %eax,%eax 0x002: jne t 0x007: irmovl $1, %eax 0x00d: nop 0x00e: nop 0x00f: nop 0x010: halt 0x011: t: irmovl $3, %edx 0x017: irmovl $4, %ecx 0x01d: irmovl $5, %edx – 44 – # Not taken # Fall through # Target (Should not execute) # Should not execute # Should not execute Should only execute first 7 instructions Branch Misprediction Trace # demo-j 0x000: xorl %eax,%eax 0x002: jne t # Not taken 1 2 3 4 5 6 F D F E D M E W M W F D F E D F M E D 0x011: t: irmovl $3, %edx # Target 0x017: irmovl $4, %ecx # Target+1 0x007: irmovl $1, %eax # Fall Through Cycle 5 M Incorrectly execute two instructions at branch target M_Bch = 0 M_valA = 0x007 E valE f 3 dstE = %edx D valC = 4 dstE = %ecx F – 45 – valC f 1 rB f %eax 7 W M E 8 9 cancelled W M W Return Example 0x000: 0x006: 0x007: 0x008: 0x009: 0x00e: 0x014: 0x020: 0x020: 0x021: 0x022: 0x023: 0x024: 0x02a: 0x030: 0x036: 0x100: 0x100: – 46 – demo-ret.ys irmovl Stack,%esp # Intialize stack pointer nop # Avoid hazard on %esp nop nop call p # Procedure call irmovl $5,%esi # Return point halt .pos 0x20 p: nop # procedure nop nop ret irmovl $1,%eax # Should not be executed irmovl $2,%ecx # Should not be executed irmovl $3,%edx # Should not be executed irmovl $4,%ebx # Should not be executed .pos 0x100 Stack: # Stack: Stack pointer Require lots of nops to avoid data hazards Incorrect Return Example # demo-ret 0x023: ret 0x024: D irmovl $1,%eax # Oops! F 0x02a: irmovl $2,%ecx # Oops! 0x030: irmovl $3,%edx # Oops! 0x00e: irmovl $5,%esi # Return Incorrectly execute 3 instructions following ret F E D F M E D F W M E D F W valM = 0x0e M valE = 1 dstE = %eax E valE f 2 dstE = %ecx D valC = 3 dstE = %edx F – 47 – valC f 5 rB f %esi W M E D W M E W M W Pipeline Summary (1) Concept Break instruction execution into 5 stages Run instructions through in pipelined mode Limitations Can’t handle dependencies between instructions when instructions follow too closely Data dependencies One instruction writes register, later one reads it Control dependency Instruction sets PC in way that pipeline did not predict correctly Mispredicted branch and return – 48 – Pipeline Summary (2) Fixing the Pipeline Data Hazards Most handled by forwarding » No performance penalty Load/use hazard requires one cycle stall Control Hazards Cancel instructions when detect mispredicted branch » Two clock cycles wasted Stall fetch stage while ret passes through pipeline » Three clock cycles wasted – 49 –