document

Chapter 4 Processor Architecture Pipelined Implementation Instructor: Dr. Hyunyoung Lee Texas A&M University –1– Based on slides provided by Randal E. Bryant, CMU Overview General Principles of Pipelining   Goal: Enhancing Performance Difficulties: Complex implementation, Non-deterministic behavior etc. Creating a Pipelined Y86 Processor    Rearranging SEQ Inserting pipeline registers Problems with data and control hazards  Data Hazards » Instruction having register R as source follows shortly after instruction having register R as destination » Common condition, don’t want to slow down pipeline  Control Hazards » Mispredict conditional branch • Our design predicts all branches as being taken • Naïve pipeline executes two extra instructions » Getting return address for ret instruction • Naïve pipeline executes three extra instructions –2– Real-World Pipelines: Car Washes Sequential Parallel Pipelined Idea    –3– Divide process into independent stages Move objects through stages in sequence At any given times, multiple objects being processed Computational Example 300 ps 20 ps Combinational logic R e g Delay = 320 ps Throughput = 3.12 GIPS Clock System     –4– Computation requires total of 300 picoseconds Additional 20 picoseconds to save result in register Can must have clock cycle of at least 320 ps Throughput = 1/(320 ps) = 0.003125 instructions per ps = 3125000000 instructions per sec = 3.12 GIPS 3-Way Pipelined Version 100 ps 20 ps 100 ps 20 ps 100 ps Comb. logic A R e g Comb. logic B R e g Comb. logic C 20 ps R Delay = 360 ps e Throughput = 8.33 GIPS g Clock System   Divide combinational logic into 3 blocks of 100 ps each Can begin new operation as soon as previous one passes through stage A.  Begin new operation every 120 ps  Throughput increases by a factor of 8.33/3.12=2.67 at the expense of increased latency* (360/320=1.12) * Latency: total time required to perform a single instruction from beginning to end –5– Pipeline Diagrams Unpipelined OP1 OP2 OP3  Time Cannot start new operation until previous one completes 3-Way Pipelined OP1 OP2 A B C A B C A B OP3 C Time  –6– Up to 3 operations in process simultaneously Operating a Pipeline 239 241 300 359 Clock OP1 A OP2 B C A B C A B OP3 0 120 240 360 C 480 640 Time 100 ps 20 ps 100 ps 20 ps 100 ps 20 ps Comb. logic A R e g Comb. logic B R e g Comb. logic C R e g Clock –7– Limitations: Nonuniform Delays 50 ps 20 ps 150 ps 20 ps 100 ps Comb. logic R e g Comb. logic B R e g Comb. logic C A OP1 OP2 A B OP3 B A R Delay = 510 ps e Throughput = 5.88 GOPS g Clock C A 20 ps C B C Time    –8– Throughput limited by slowest stage Other stages sit idle for much of the time Challenging to partition system into balanced stages Limitations: Register Overhead 50 ps 20 ps 50 ps 20 ps 50 ps 20 ps 50 ps 20 ps 50 ps 20 ps 50 ps 20 ps Comb. logic R e g Comb. logic R e g Comb. logic R e g Clock    3-stage pipeline:  6-stage pipeline: –9– R e g Comb. logic R e g Comb. logic R e g Delay = 420 ps, Throughput = 14.29 GOPS As try to deepen pipeline, overhead of loading registers becomes more significant Percentage of clock cycle spent loading register:  1-stage pipeline:  Comb. logic 6.25% 16.67% 28.57% High speeds of modern processor designs obtained through very deep pipelining Data Hazards and Control Hazards Make the pipelined processor work! Data Hazards   Instruction having register R as source follows shortly after instruction having register R as destination Common condition, don’t want to slow down pipeline Control Hazards  Mispredict conditional branch  Our design predicts all branches as being taken  Naïve pipeline executes two extra instructions  Getting return address for ret instruction  Naïve pipeline executes three extra instructions – 10 – Data Dependencies Combinational logic R e g Clock OP1 OP2 OP3 Time System  – 11 – Each operation depends on result from preceding one Data Hazards Comb. logic A OP1 OP2 R e g A Comb. logic B R e g Comb. logic C Clock B C A B C A B C A B OP3 OP4 R e g C Time   – 12 – Result does not feed back around in time for next operation Pipelining has changed behavior of system Data Dependencies in Processors  1 irmovl $50, %eax 2 addl %eax , 3 mrmovl 100( %ebx ), %ebx %edx Result from one instruction used as operand for another  Read-after-write (RAW) dependency   Very common in actual programs Must make sure our pipeline handles these properly  Get correct results  Minimize performance impact – 13 – newPC SEQ Hardware   Stages occur in sequence One operation in process at a time New PC PC valM data out read write Addr Execute Bch valE CC CC ALU ALU Key       – 14 – Blue boxes: predesigned hardware blocks  E.g., memories, ALU Gray boxes: control logic  Describe in HCL White ovals: labels for signals Thick lines: 32-bit word values Thin lines: 4-8 bit values Dotted lines: 1-bit values Data Data memory memory Mem. control Memory ALU A Data ALU fun. ALU B valA Decode A valB dstE dstM srcA srcB dstE dstM srcA srcB B Register Register M file file E Write back icode Fetch ifun rA rB Instruction Instruction memory memory PC valC valP PC PC increment increment valM SEQ+ Hardware data out read Data Data memory memory Mem. control Memory write Addr   Still sequential implementation Reorder PC stage to put at beginning Execute Bch valE CC CC ALU ALU ALU A Data ALU fun. ALU B PC Stage   Task is to select PC for current instruction Based on results computed by previous instruction Processor State   – 15 – PC is no longer stored in register But, can determine PC based on other stored information valA valB dstE dstM srcA srcB dstE dstM srcA srcB Decode A B Register Register M file file E Write back icode Fetch ifun rA rB valC Instruction Instruction memory memory valP PC PC increment increment PC PC PC pIcode pBch pValM pValC pValP Adding Pipeline Registers valE, valM W_icode, W_valM Write back valM W_valE, W_valM, W_dstE, W_dstM valM W valM Data Data memory memory Memory Memory Data Data memory memory M_icode, M_Bch, M_valA Addr, Data Addr, Data M valE Bch Bch CC CC Execute ALU ALU valE CC CC Execute aluA, aluB ALU ALU aluA, aluB E valA, valB Decode valA, valB srcA, srcB dstA, dstB icode, valC valP A B Register Register M file file d_srcA, d_srcB Decode A B Register Register M file file E E Write back valP icode, ifun rA, rB valC Fetch Instruction Instruction memory memory D valP icode, ifun, rA, rB, valC PC PC increment increment Instruction Instruction memory memory Fetch valP PC PC increment increment PC PC predPC pState PC f_PC F – 16 – W_icode, W_valM Pipeline Stages W_valE, W_valM, W_dstE, W_dstM W valM Fetch Memory Data Data memory memory M_icode, M_Bch, M_valA Addr, Data    Select current PC Read instruction Compute incremented PC M Bch valE CC CC Execute ALU ALU aluA, aluB Decode  E Read program registers valA, valB Execute d_srcA, d_srcB Decode A B Register Register M file file E Write back  Operate ALU Memory  D icode, ifun, rA, rB, valC Instruction Instruction memory memory Fetch Write Back – 17 – valP PC PC increment increment predPC Read or write data memory PC  valP Update register file f_PC F Write back PIPE- Hardware W icode valE Mem. control write Pipeline registers hold intermediate values from instruction execution Forward (Upward) Paths   Values passed from one stage to next Cannot jump past stages dstE dstM data out read  valM Data Data memory memory Memory data in Addr M_valA M_Bch M icode Bch valE valA dstE dstM e_Bch Execute E ALU fun. ALU ALU CC CC icode ifun ALU A ALU B valC valA valB d_srcA d_srcB Select A Decode d_rvalA A dstE dstM srcA srcB W_valM B Register Register M file file W_valE E  e.g., valC passes through decode dstE dstM srcA srcB D Fetch icode ifun rA rB Instruction Instruction memory memory valC valP PC PC increment increment Predict PC f_PC M_valA Select PC F – 18 – predPC W_valM Write back Feedback Paths W icode valE dstE dstM data out read Mem. control write Predicted PC valM Data Data memory memory Memory data in Addr M_valA M_Bch  Guess value of next PC M icode Bch valE valA dstE dstM e_Bch Branch information   Jump taken/not-taken Fall-through or target address Execute E ALU fun. ALU ALU CC CC icode ifun ALU A ALU B valC valA valB dstE dstM srcA srcB d_srcA d_srcB Return point  Decode To register file write ports A dstE dstM srcA srcB W_valM B W_valE E D Fetch icode ifun rA rB Instruction Instruction memory memory valC valP PC PC increment increment Predict PC f_PC M_valA Select PC F – 19 – d_rvalA Register Register M file file Read from memory Register updates  Select A predPC W_valM Predicting the PC D M_icode M_Bch M_valA W_icode W_valM icode ifun rA rB valC valP Predict PC Need valC Instr valid Need regids Split Split PC PC increment increment Align Align Byte 0 Bytes 1-5 Instruction Instruction memory memory Select PC F  predPC Start fetch of new instruction after current one has completed fetch stage  Not enough time to reliably determine next instruction  Guess which instruction will follow  Recover if prediction was incorrect – 20 – Our Prediction Strategy Instructions that don’t transfer control   Predict next PC to be valP Always reliable Call and unconditional Jumps   Predict next PC to be valC (destination) Always reliable Conditional jumps   Predict next PC to be valC (destination) Only correct if branch is taken  Typically right 60% of time Return instruction  – 21 – Don’t try to predict Recovering from PC Misprediction M_icode M_Bch M_valA W_icode W_valM D icode ifun rA rB valC valP Predict PC Need valC Instr valid Need regids Split Split PC PC increment increment Align Align Byte 0 Bytes 1-5 Instruction Instruction memory memory Select PC F  predPC Mispredicted Jump  Will see branch flag once instruction reaches memory stage  Can get fall-through PC from valA  Return Instruction  Will get return PC when ret reaches write-back stage – 22 – Pipeline Demonstration irmovl $1,%eax #I1 irmovl $2,%ecx #I2 irmovl $3,%edx #I3 irmovl $4,%ebx #I4 halt #I5 1 2 3 4 5 F D E M W F D E M W F D E M W F D E M W F D E M Cycle 5 File: demo-basic.ys W I1 M I2 E I3 D I4 F I5 – 23 – 6 7 8 9 W Data Dependencies: 3 Nop’s # demo-h3.ys 1 2 3 4 5 0x000: irmovl $10,%edx F D E M W F D E M W F D E M W F D E M W F D E M W F D E M W F D E M 0x006: irmovl 0x00c: nop 0x00d: nop 0x00e: nop $3,%eax 0x00f: addl %edx,%eax 0x011: halt 6 7 8 9 10 Cycle 6 W R[ %eax] f 3 Cycle 7 D – 24 – valA f R[ %edx] = 10 valB f R[ %eax] = 3 11 W Data Dependencies: 2 Nop’s # demo-h2.ys 1 2 3 4 5 0x000: irmovl $10,%edx F D E M W F D E M W F D E M W F D E M W F D E M W F D E M 0x006: irmovl $3,%eax 0x00c: nop 0x00d: nop 0x00e: addl %edx,%eax 0x010: halt 6 7 8 9 Cycle 6 W R[ %eax] f 3 • • • D valA f R[ %edx] = 10 valB f R[ %eax] = 0 – 25 – Error 10 W Data Dependencies: 1 Nop # demo-h1.ys 1 2 3 4 5 0x000: irmovl $10,%edx F D E M W F D E M W F D E M W F D E M W F D E M 0x006: irmovl 0x00c: $3,%eax nop 0x00d: addl %edx,%eax 0x00f: halt 6 7 8 Cycle 5 W R[ %edx] f 10 M M_valE = 3 M_dstE = %eax • • • D – 26 – valA f R[ %edx] = 0 valB f R[ %eax] = 0 Error 9 W Data Dependencies: No Nop # demo-h0.ys 1 2 3 4 5 0x000: irmovl $10,%edx F D E M W F D E M W F D E M W F D E M 0x006: irmovl $3,%eax 0x00c: addl %edx,%eax 0x00e: halt 6 7 Cycle 4 M M_valE = 10 M_dstE = %edx E e_valE f 0 + 3 = 3 E_dstE = %eax D valA f R[ %edx] = 0 valB f R[ %eax] = 0 – 27 – Error 8 W Stalling for Data Dependencies # demo-h2.ys 1 2 3 4 5 0x000: irmovl $10,%edx F D E M W F D E M W F D E M W F D E M W E M W D D E M W F F D E M 0x006: irmovl $3,%eax 0x00c: nop 0x00d: nop 6 bubble 0x00e: addl %edx,%eax 0x010: halt    – 28 – F 7 8 9 If instruction follows too closely after one that writes register, slow it down Hold instruction in decode Dynamically inject nop into execute stage 10 11 W Stall Condition Source Registers  srcA and srcB of current instruction in decode stage Destination Registers   dstE and dstM fields Instructions in execute, memory, and write-back stages Special Case  Don’t stall for register ID 15 (0xF)  Indicates absence of register operand – 29 –  Don’t stall for failed conditional move Detecting Stall Condition # demo-h2.ys 1 2 3 4 5 0x000: irmovl $10,%edx F D E M W F D E M W F D E M W F D E M W E M W D D E M W F F D E M 0x006: irmovl $3,%eax 0x00c: nop 0x00d: nop 6 bubble 0x00e: addl %edx,%eax 0x010: halt F 7 Cycle 6 W W_dstE = %eax W_valE = 3 • • • D – 30 – srcA = %edx srcB = %eax 8 9 10 11 W Stalling X3 # demo-h0.ys 1 2 3 4 5 F D E M W F D E M W E M W E M W E M W 0x000: irmovl $10,%edx 0x006: irmovl $3,%eax bubble 6 bubble bubble F 0x00c: addl %edx,%eax 7 8 D D D E M W F F F F D E M Cycle 6 W Cycle 5 W_dstE = %eax M Cycle 4 M_dstE = %eax E • • • D D – 31 – srcA = %edx srcB = %eax 10 D 0x00e: halt E_dstE = %eax 9 srcA = %edx srcB = %eax • • • D srcA = %edx srcB = %eax 11 W What Happens When Stalling? # demo-h0.ys 0x000: irmovl $10,%edx 0x006: irmovl $3,%eax 0x00c: addl %edx,%eax 0x00e: halt    Cycle 8 4 5 6 7 Write Back Memory Execute Decode Fetch 0x000: bubble 0x006: irmovl $10,%edx $3,%eax 0x006: bubble 0x00c: irmovl addl %edx,%eax $3,%eax 0x00c: halt 0x00e: addl %edx,%eax 0x00e: halt Stalling instruction held back in decode stage Following instruction stays in fetch stage Bubbles injected into execute stage  Like dynamically generated nop’s  Move through later stages – 32 – 0x000: bubble 0x006: irmovl $10,%edx $3,%eax Implementing Stalling W_stat W_stall W stat icode valE valM dstE dstM valE valA dstE dstM M_icode M_bubble M stat icode Cnd m_stat e_Cnd stat set_cc Pipe control logic CC E_dstM E_icode E_bubble E stat icode ifun valC valA valB dstE dstM srcA srcB d_srcB d_srcA srcB D_icode srcA D_bubble D_stall D F_stall F stat icode ifun rA rB valC valP predPC Pipeline Control   – 33 – Combinational logic detects stall condition Sets mode signals for how pipeline registers should update Pipeline Register Modes Input = y Output = x x Normal stall =0 Output = x x stall =1 – 34 – Output = x x stall =0 y _ Rising clock _ Output = x x bubble =0 Input = y Bubble _ Output = y bubble =0 Input = y Stall _ Rising clock bubble =1 _ Rising clock _ n o p Output = nop Data Forwarding Naïve Pipeline   Register isn’t written until completion of write-back stage Source operands read from register file in decode stage  Needs to be in register file at start of stage Observation  Value generated in execute or memory stage Trick   – 35 – Pass value directly from generating instruction to decode stage Needs to be available at end of decode stage Data Forwarding Example # demo-h2.ys 1 2 3 4 5 0x000: irmovl $10,%edx F D F E D F M E D F W M E D F 0x006: irmovl $3,%eax 0x00c: nop 0x00d: nop 0x00e: addl %edx,%eax 0x010: halt    – 36 – irmovl in writeback stage Destination value in W pipeline register Forward as valB for decode stage 6 7 8 9 10 W M E D F W M E D W M E W M W Cycle 6 W R[ %eax] f 3 W_dstE = %eax W_valE = 3 • • • D srcA = %edx srcB = %eax valA f R[ %edx] = 10 valB f W_valE = 3 Bypass Paths Decode Stage    Forwarding logic selects valA and valB Normally from register file Forwarding: get valA or valB from later pipeline stage Forwarding Sources    – 37 – Execute: valE Memory: valE, valM Write back: valE, valM Data Forwarding Example #2 # demo-h0.ys 1 2 3 4 5 6 0x000: irmovl $10,%edx F D F E D M E W M W F D E M W F D E M 0x006: irmovl $3,%eax 0x00c: addl %edx,%eax 0x00e: halt Register %edx   Generated by ALU during previous cycle Forward from memory as valA Register %eax – 38 –  Value just generated by ALU  Forward from execute as valB 7 Cycle 4 M M_dstE = %edx M_valE = 10 E E_dstE = %eax e_valE f 0 + 3 = 3 D srcA = %edx srcB = %eax valA f M_valE = 10 valB f e_valE = 3 8 W Forwarding Priority # demo-priority.ys 1 2 3 4 5 0x000: irmovl $1, %eax F D E M W F D E M W F D E M W F D E M W F D E M 0x006: irmovl $2, %eax 0x00c: irmovl $3, %eax 0x012: rrmovl %eax, %edx 0x014: halt Multiple Forwarding Choices    – 39 – Which one should have priority Match serial semantics Use matching value from earliest pipeline stage 6 Cycle 5 W R[ %eax] f 3 1 M W R[ %eax] f 3 2 E W R[ %eax] f 3 D valA f R[ %eax 10 %edx] = ? valB f 0 R[ %eax] = 0 7 8 9 W 10 Limitation of Forwarding # demo-luh.ys 0x000: 0x006: 0x00c: 0x012: 0x018: 0x01e: 0x020: 1 2  4 irmovl $128,%edx F D E M irmovl $3,%ecx F D E rmmovl %ecx, 0(%edx) F D irmovl $10,%ebx F mrmovl 0(%edx),%eax # Load %eax addl %ebx,%eax # Use %eax halt Load-use dependency  3 Value needed by end of decode stage in cycle 7 Value read from memory in memory stage of cycle 8 5 6 W M E D F W M E D F 7 8 9 10 11 W M E D F W M E D W M E W M W Cycle 7 Cycle 8 M M M_dstE = %ebx M_valE = 10 M_dstM = %eax m_valM f M[128] = 3 • • • D valA f M_valE = 10 valB f R[%eax] = 0 – 40 – Error Avoiding Load/Use Hazard # demo-luh.ys 1 2 3 4 5 irmovl $128,%edx F irmovl $3,%ecx rmmovl %ecx, 0(%edx) irmovl $10,%ebx D E M W F D F E D F 0x018: mrmovl 0(%edx),%eax # Load %eax bubble 0x01e: addl %ebx,%eax # Use %eax 0x020: halt M E D F 0x000: 0x006: 0x00c: 0x012:   Stall using instruction for one cycle Can then pick up loaded value by forwarding from memory stage 6 7 8 9 W M E D W M E W M W D F E D F M E D F Cycle 8 W W_dstE = %ebx W_valE = 10 M M_dstM = %eax m_valM f M[128] = 3 • • • D – 41 – valA f W_valE = 10 valB f m_valM = 3 10 11 W M E W M 12 W Detecting Load/Use Hazard Condition Trigger Load/Use Hazard E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB } – 42 – Control for Load/Use Hazard # demo-luh.ys 1 2 3 0x000: 0x006: 0x00c: 0x012: 0x018: 4 irmovl $128,%edx F D E M irmovl $3,%ecx F D E rmmovl %ecx, 0(%edx) F D irmovl $10,%ebx F mrmovl 0(%edx),%eax # Load %eax bubble 0x01e: addl %ebx,%eax # Use %eax 0x020: halt   6 7 W M E D F W M E D W M E F D F 8 9 10 11 W M E D F W M E D W M E W M 12 Stall instructions in fetch and decode stages Inject bubble into execute stage Condition Load/Use Hazard – 43 – 5 F D E M W stall stall bubble normal normal W Branch Misprediction Example demo-j.ys 0x000: xorl %eax,%eax 0x002: jne t 0x007: irmovl $1, %eax 0x00d: nop 0x00e: nop 0x00f: nop 0x010: halt 0x011: t: irmovl $3, %edx 0x017: irmovl $4, %ecx 0x01d: irmovl $5, %edx  – 44 – # Not taken # Fall through # Target (Should not execute) # Should not execute # Should not execute Should only execute first 7 instructions Branch Misprediction Trace # demo-j 0x000: xorl %eax,%eax 0x002: jne t # Not taken 1 2 3 4 5 6 F D F E D M E W M W F D F E D F M E D 0x011: t: irmovl $3, %edx # Target 0x017: irmovl $4, %ecx # Target+1 0x007: irmovl $1, %eax # Fall Through Cycle 5 M  Incorrectly execute two instructions at branch target M_Bch = 0 M_valA = 0x007 E valE f 3 dstE = %edx D valC = 4 dstE = %ecx F – 45 – valC f 1 rB f %eax 7 W M E 8 9 cancelled W M W Return Example 0x000: 0x006: 0x007: 0x008: 0x009: 0x00e: 0x014: 0x020: 0x020: 0x021: 0x022: 0x023: 0x024: 0x02a: 0x030: 0x036: 0x100: 0x100:  – 46 – demo-ret.ys irmovl Stack,%esp # Intialize stack pointer nop # Avoid hazard on %esp nop nop call p # Procedure call irmovl $5,%esi # Return point halt .pos 0x20 p: nop # procedure nop nop ret irmovl $1,%eax # Should not be executed irmovl $2,%ecx # Should not be executed irmovl $3,%edx # Should not be executed irmovl $4,%ebx # Should not be executed .pos 0x100 Stack: # Stack: Stack pointer Require lots of nops to avoid data hazards Incorrect Return Example # demo-ret 0x023:  ret 0x024: D irmovl $1,%eax # Oops! F 0x02a: irmovl $2,%ecx # Oops! 0x030: irmovl $3,%edx # Oops! 0x00e: irmovl $5,%esi # Return Incorrectly execute 3 instructions following ret F E D F M E D F W M E D F W valM = 0x0e M valE = 1 dstE = %eax E valE f 2 dstE = %ecx D valC = 3 dstE = %edx F – 47 – valC f 5 rB f %esi W M E D W M E W M W Pipeline Summary (1) Concept   Break instruction execution into 5 stages Run instructions through in pipelined mode Limitations   Can’t handle dependencies between instructions when instructions follow too closely Data dependencies  One instruction writes register, later one reads it  Control dependency  Instruction sets PC in way that pipeline did not predict correctly  Mispredicted branch and return – 48 – Pipeline Summary (2) Fixing the Pipeline  Data Hazards  Most handled by forwarding » No performance penalty  Load/use hazard requires one cycle stall  Control Hazards  Cancel instructions when detect mispredicted branch » Two clock cycles wasted  Stall fetch stage while ret passes through pipeline » Three clock cycles wasted – 49 –

document

Related documents

Products

Support

document

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib