CS:APP Chapter 4 Computer Architecture Pipelined Implementation Part I Randal E. Bryant Carnegie Mellon University http://csapp.cs.cmu.edu CS:APP Overview General Principles of Pipelining Goal Difficulties Creating a Pipelined Y86 Processor –2– Rearranging SEQ Inserting pipeline registers Problems with data and control hazards CS:APP Real-World Pipelines: Car Washes Sequential Parallel Pipelined Idea –3– Divide process into independent stages Move objects through stages in sequence At any given times, multiple objects being processed CS:APP Computational Example 300 ps 20 ps Combinational logic R e g Delay = 320 ps Throughput = 3.12 GOPS Clock System –4– Computation requires total of 300 picoseconds Additional 20 picoseconds to save result in register Can must have clock cycle of at least 320 ps CS:APP 3-Way Pipelined Version 100 ps 20 ps 100 ps 20 ps 100 ps Comb. logic A R e g Comb. logic B R e g Comb. logic C 20 ps R Delay = 360 ps e Throughput = 8.33 GOPS g Clock System Divide combinational logic into 3 blocks of 100 ps each Can begin new operation as soon as previous one passes through stage A. Begin new operation every 120 ps Overall latency increases 360 ps from start to finish –5– CS:APP Pipeline Diagrams Unpipelined OP1 OP2 OP3 Time Cannot start new operation until previous one completes 3-Way Pipelined OP1 OP2 A B C A B C A B OP3 C Time –6– Up to 3 operations in process simultaneously CS:APP Operating a Pipeline 239 241 300 359 Clock OP1 A OP2 B C A B C A B OP3 0 120 240 360 C 480 640 Time 100 ps 20 ps 100 ps 20 ps 100 ps 20 ps Comb. logic A R e g Comb. logic B R e g Comb. logic C R e g Clock –7– CS:APP Limitations: Nonuniform Delays 50 ps 20 ps 150 ps 20 ps 100 ps Comb. logic R e g Comb. logic B R e g Comb. logic C A OP1 OP2 A B OP3 B A R Delay = 510 ps e Throughput = 5.88 GOPS g Clock C A 20 ps C B C Time –8– Throughput limited by slowest stage Other stages sit idle for much of the time Challenging to partition system into balanced stages CS:APP Limitations: Register Overhead 50 ps 20 ps 50 ps 20 ps 50 ps 20 ps 50 ps 20 ps 50 ps 20 ps 50 ps 20 ps Comb. logic R e g Comb. logic R e g Comb. logic R e g Clock 3-stage pipeline: 6-stage pipeline: –9– R e g Comb. logic R e g Comb. logic R e g Delay = 420 ps, Throughput = 14.29 GOPS As try to deepen pipeline, overhead of loading registers becomes more significant Percentage of clock cycle spent loading register: 1-stage pipeline: Comb. logic 6.25% 16.67% 28.57% High speeds of modern processor designs obtained through very deep pipelining CS:APP Data Dependencies Combinational logic R e g Clock OP1 OP2 OP3 Time System – 10 – Each operation depends on result from preceding one CS:APP Data Hazards Comb. logic A OP1 OP2 R e g A Comb. logic B R e g Comb. logic C Clock B C A B C A B C A B OP3 OP4 R e g C Time – 11 – Result does not feed back around in time for next operation Pipelining has changed behavior of system CS:APP Data Dependencies in Processors 1 irmovl $50, %eax 2 addl %eax , 3 mrmovl 100( %ebx ), %ebx %edx Result from one instruction used as operand for another Read-after-write (RAW) dependency Very common in actual programs Must make sure our pipeline handles these properly Get correct results Minimize performance impact – 12 – CS:APP newPC SEQ Hardware Stages occur in sequence One operation in process at a time New PC PC valM data out read Data Data memory memory Mem. control Memory write Addr Execute Bch valE CC CC ALU ALU ALU A Data ALU fun. ALU B valA Decode A valB dstE dstM srcA srcB dstE dstM srcA srcB B Register Register M file file E Write back icode Fetch ifun rA rB Instruction Instruction memory memory valC valP PC PC increment increment PC – 13 – CS:APP valM SEQ+ Hardware data out read Data Data memory memory Mem. control Memory write Addr Still sequential implementation Reorder PC stage to put at beginning Execute Bch valE CC CC ALU ALU ALU A Data ALU fun. ALU B PC Stage Task is to select PC for current instruction Based on results computed by previous instruction Processor State – 14 – PC is no longer stored in register But, can determine PC based on other stored information valA valB dstE dstM srcA srcB dstE dstM srcA srcB Decode A B Register Register M file file E Write back icode Fetch ifun rA rB valC Instruction Instruction memory memory valP PC PC increment increment PC PC PC pIcode pBch pValM pValC pValP CS:APP Adding Pipeline Registers valE, valM W_icode, W_valM Write back valM W_valE, W_valM, W_dstE, W_dstM valM W valM Data Data memory memory Memory Memory Data Data memory memory M_icode, M_Bch, M_valA Addr, Data Addr, Data M valE Bch Bch CC CC Execute ALU ALU valE CC CC Execute aluA, aluB ALU ALU aluA, aluB E valA, valB Decode valA, valB srcA, srcB dstA, dstB icode, valC valP A B Register Register M file file d_srcA, d_srcB Decode A B Register Register M file file E E Write back valP icode, ifun rA, rB valC Fetch Instruction Instruction memory memory D valP icode, ifun, rA, rB, valC PC PC increment increment Instruction Instruction memory memory Fetch valP PC PC increment increment PC PC predPC pState PC f_PC F – 15 – CS:APP W_icode, W_valM Pipeline Stages W_valE, W_valM, W_dstE, W_dstM W valM Fetch Memory Data Data memory memory M_icode, M_Bch, M_valA Addr, Data Select current PC Read instruction Compute incremented PC M Bch valE CC CC Execute ALU ALU aluA, aluB Decode E Read program registers valA, valB Execute d_srcA, d_srcB Decode A B Register Register M file file E Write back Operate ALU Memory D icode, ifun, rA, rB, valC Instruction Instruction memory memory Fetch Write Back – 16 – valP PC PC increment increment predPC Read or write data memory PC valP f_PC F Update register file CS:APP Write back PIPE- Hardware W icode valE Mem. control write Pipeline registers hold intermediate values from instruction execution Forward (Upward) Paths Values passed from one stage to next Cannot jump past stages dstE dstM data out read valM Data Data memory memory Memory data in Addr M_valA M_Bch M icode Bch valE valA dstE dstM e_Bch Execute E ALU fun. ALU ALU CC CC icode ifun ALU A ALU B valC valA valB d_srcA d_srcB Select A Decode d_rvalA A dstE dstM srcA srcB W_valM B Register Register M file file W_valE E e.g., valC passes through decode dstE dstM srcA srcB D Fetch icode ifun rA rB Instruction Instruction memory memory valC valP PC PC increment increment Predict PC f_PC M_valA Select PC F – 17 – W_valM predPC CS:APP Write back Feedback Paths W icode valE dstE dstM data out read Mem. control write Predicted PC valM Data Data memory memory Memory data in Addr M_valA M_Bch Guess value of next PC M icode Bch valE valA dstE dstM e_Bch Branch information Jump taken/not-taken Fall-through or target address Execute E ALU fun. ALU ALU CC CC icode ifun ALU A ALU B valC valA valB dstE dstM srcA srcB d_srcA d_srcB Return point Decode To register file write ports A dstE dstM srcA srcB W_valM B W_valE E D Fetch icode ifun rA rB Instruction Instruction memory memory valC valP PC PC increment increment Predict PC f_PC M_valA Select PC F – 18 – d_rvalA Register Register M file file Read from memory Register updates Select A W_valM predPC CS:APP Predicting the PC D M_icode M_Bch M_valA W_icode W_valM icode ifun rA rB valC valP Predict PC Need valC Instr valid Need regids Split Split PC PC increment increment Align Align Byte 0 Bytes 1-5 Instruction Instruction memory memory Select PC F predPC Start fetch of new instruction after current one has completed fetch stage Not enough time to reliably determine next instruction Guess which instruction will follow Recover if prediction was incorrect – 19 – CS:APP Our Prediction Strategy Instructions that Don’t Transfer Control Predict next PC to be valP Always reliable Call and Unconditional Jumps Predict next PC to be valC (destination) Always reliable Conditional Jumps Predict next PC to be valC (destination) Only correct if branch is taken Typically right 60% of time Return Instruction – 20 – Don’t try to predict CS:APP Recovering from PC Misprediction M_icode M_Bch M_valA W_icode W_valM D icode ifun rA rB valC valP Predict PC Need valC Instr valid Need regids Split Split PC PC increment increment Align Align Byte 0 Bytes 1-5 Instruction Instruction memory memory Select PC F predPC Mispredicted Jump Will see branch flag once instruction reaches memory stage Can get fall-through PC from valA Return Instruction Will get return PC when ret reaches write-back stage – 21 – CS:APP Pipeline Demonstration irmovl $1,%eax #I1 irmovl $2,%ecx #I2 irmovl $3,%edx #I3 irmovl $4,%ebx #I4 halt #I5 1 2 3 4 5 6 7 8 F D E M W F D E M W F D E M W F D E M W F D E M 9 W Cycle 5 File: demo-basic.ys W I1 M I2 E I3 D I4 F I5 – 22 – CS:APP Data Dependencies: 3 Nop’s # demo-h3.ys 1 2 3 4 5 0x000: irmovl $10,%edx F D E M W F D E M W F D E M W F D E M W F D E M W F D E M W F D E M 0x006: irmovl $3,%eax 0x00c: nop 0x00d: nop 0x00e: nop 0x00f: addl %edx,%eax 0x011: halt 6 7 8 9 10 Cycle 6 W R[ %eax] f 3 Cycle 7 D – 23 – valA f R[ %edx] = 10 valB f R[ %eax ]=3 CS:APP 11 W Data Dependencies: 2 Nop’s # demo-h2.ys 1 2 3 4 5 0x000: irmovl $10,%edx F D E M W F D E M W F D E M W F D E M W F D E M W F D E M 0x006: irmovl $3,%eax 0x00c: nop 0x00d: nop 0x00e: addl %edx,%eax 0x010: halt 6 7 8 9 10 W Cycle 6 W R[ %eax] f 3 • • • D valA f R[ %edx] = 10 valB f R[ %eax] = 0 – 24 – Error CS:APP Data Dependencies: 1 Nop # demo-h1.ys 1 2 3 4 5 0x000: irmovl $10,%edx F D E M W F D E M W F D E M W F D E M W F D E M 0x006: irmovl $3,%eax 0x00c: nop 0x00d: addl %edx,%eax 0x00f: halt 6 7 8 9 W Cycle 5 W R[ %edx] f 10 M M_valE = 3 M_dstE = %eax • • • D – 25 – valA f R[ %edx] = 0 valB f R[ %eax] = 0 Error CS:APP Data Dependencies: No Nop # demo-h0.ys 1 2 3 4 5 0x000: irmovl $10,%edx F D E M W F D E M W F D E M W F D E M 0x006: irmovl $3,%eax 0x00c: addl %edx,%eax 0x00e: halt 6 7 8 W Cycle 4 M M_valE = 10 M_dstE = %edx E e_valE f 0 + 3 = 3 E_dstE = %eax D valA f R[ %edx] = 0 valB f R[ %eax] = 0 – 26 – Error CS:APP Branch Misprediction Example demo-j.ys 0x000: xorl %eax,%eax 0x002: jne t 0x007: irmovl $1, %eax 0x00d: nop 0x00e: nop 0x00f: nop 0x010: halt 0x011: t: irmovl $3, %edx 0x017: irmovl $4, %ecx 0x01d: irmovl $5, %edx – 27 – # Not taken # Fall through # Target (Should not execute) # Should not execute # Should not execute Should only execute first 8 instructions CS:APP Branch Misprediction Trace # demo-j 0x000: xorl %eax,%eax 0x002: jne t # Not taken 1 2 3 4 5 6 F D F E D M E W M W F D F E D F M E D 0x011: t: irmovl $3, %edx # Target 0x017: irmovl $4, %ecx # Target+1 0x007: irmovl $1, %eax # Fall Through 7 8 9 W M E W M W Cycle 5 M Incorrectly execute two instructions at branch target M_Bch = 0 M_valA = 0x007 E valE f 3 dstE = %edx D valC = 4 dstE = %ecx F – 28 – valC f 1 rB f %eax CS:APP Return Example 0x000: 0x006: 0x007: 0x008: 0x009: 0x00e: 0x014: 0x020: 0x020: 0x021: 0x022: 0x023: 0x024: 0x02a: 0x030: 0x036: 0x100: 0x100: – 29 – demo-ret.ys irmovl Stack,%esp # Intialize stack pointer nop # Avoid hazard on %esp nop nop call p # Procedure call irmovl $5,%esi # Return point halt .pos 0x20 p: nop # procedure nop nop ret irmovl $1,%eax # Should not be executed irmovl $2,%ecx # Should not be executed irmovl $3,%edx # Should not be executed irmovl $4,%ebx # Should not be executed .pos 0x100 Stack: # Stack: Stack pointer Require lots of nops to avoid data hazards CS:APP Incorrect Return Example # demo-ret 0x023: ret 0x024: D irmovl $1,%eax # Oops! F 0x02a: irmovl $2,%ecx # Oops! 0x030: irmovl $3,%edx # Oops! 0x00e: irmovl $5,%esi # Return Incorrectly execute 3 instructions following ret F E D F M E D F W M E D F W M E D W M E W M W W valM = 0x0e M valE = 1 dstE = %eax E valE f 2 dstE = %ecx D valC = 3 dstE = %edx F – 30 – valC f 5 rB f %esi CS:APP Pipeline Summary Concept Break instruction execution into 5 stages Run instructions through in pipelined mode Limitations Can’t handle dependencies between instructions when instructions follow too closely Data dependencies One instruction writes register, later one reads it Control dependency Instruction sets PC in way that pipeline did not predict correctly Mispredicted branch and return Fixing the Pipeline – 31 – We’ll do that next time CS:APP