class4-pipeline-a.ppt

CS:APP Chapter 4
Computer Architecture: Pipelined Implementation, Part I
Randal E. Bryant, Carnegie Mellon University
http://csapp.cs.cmu.edu

Overview

General Principles of Pipelining
  - Goal
  - Difficulties
Creating a Pipelined Y86 Processor
  - Rearranging SEQ
  - Inserting pipeline registers
  - Problems with data and control hazards

Real-World Pipelines: Car Washes

[Figure: sequential, parallel, and pipelined car washes]

Idea
  - Divide process into independent stages
  - Move objects through stages in sequence
  - At any given time, multiple objects are being processed

Computational Example

[Figure: combinational logic (300 ps) followed by a register (20 ps), driven by a clock. Delay = 320 ps, Throughput = 3.12 GOPS]

System
  - Computation requires a total of 300 picoseconds
  - Additional 20 picoseconds to save the result in the register
  - Must have a clock cycle of at least 320 ps

3-Way Pipelined Version

[Figure: comb. logic A, B, and C (100 ps each), each followed by a register (20 ps), driven by a clock. Delay = 360 ps, Throughput = 8.33 GOPS]

System
  - Divide combinational logic into 3 blocks of 100 ps each
  - Can begin a new operation as soon as the previous one passes through stage A
      - Begin a new operation every 120 ps
  - Overall latency increases
      - 360 ps from start to finish

Pipeline Diagrams

Unpipelined
[Figure: OP1, OP2, OP3 placed one after another along the time axis]
  - Cannot start a new operation until the previous one completes

3-Way Pipelined
[Figure: OP1, OP2, OP3 each pass through stages A, B, C, staggered by one stage along the time axis]
  - Up to 3 operations in process simultaneously

Operating a Pipeline

[Figure: clock waveform and timing diagram of OP1, OP2, OP3 moving through stages A, B, C of the 3-way pipeline (100 ps logic, 20 ps register per stage), with snapshots at times 239, 241, 300, and 359]

Limitations: Nonuniform Delays

[Figure: comb. logic A (50 ps), B (150 ps), and C (100 ps), each followed by a register (20 ps). Delay = 510 ps, Throughput = 5.88 GOPS]

  - Throughput limited by slowest stage
  - Other stages sit idle for much of the time
  - Challenging to partition system into balanced stages

Limitations: Register Overhead

[Figure: six blocks of comb. logic (50 ps each), each followed by a register (20 ps). Delay = 420 ps, Throughput = 14.29 GOPS]

  - As we try to deepen the pipeline, the overhead of loading registers becomes more significant
  - Percentage of clock cycle spent loading register:
      1-stage pipeline:  6.25%
      3-stage pipeline: 16.67%
      6-stage pipeline: 28.57%
  - High speeds of modern processor designs obtained through very deep pipelining

Data Dependencies

[Figure: combinational logic and register with clock; OP1, OP2, OP3 along the time axis]

System
  - Each operation depends on result from preceding one

Data Hazards

[Figure: comb. logic A, B, C with registers and clock; OP1 through OP4 overlapped in stages A, B, C along the time axis]

  - Result does not feed back around in time for next operation
  - Pipelining has changed behavior of system

Data Dependencies in Processors

    irmovl $50, %eax
    addl %eax, %ebx
    mrmovl 100(%ebx), %edx

  - Result from one instruction used as operand for another
      - Read-after-write (RAW) dependency
  - Very common in actual programs
  - Must make sure our pipeline handles these properly
      - Get correct results
      - Minimize performance impact

SEQ Hardware

  - Stages occur in sequence
  - One operation in process at a time

[Figure: SEQ hardware block diagram with Fetch, Decode, Execute, Memory, Write back, and New PC stages]
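
The delay and throughput figures quoted on the timing slides above follow from two rules: the clock period is set by the slowest stage plus the register load time, and every operation spends one full clock period in each stage. The following is a minimal sketch of that arithmetic, not part of the original slides; the function names `pipeline_timing` and `register_overhead` are made up for illustration, and the 20 ps register delay is the value used on the slides.

```python
def pipeline_timing(stage_delays_ps, reg_delay_ps=20):
    """Return (cycle_ps, latency_ps, throughput_gops) for a pipeline.

    stage_delays_ps: combinational delay of each stage, in picoseconds.
    reg_delay_ps: time to load a pipeline register (20 ps on the slides).
    """
    # The clock must allow for the slowest stage plus its register load.
    cycle_ps = max(stage_delays_ps) + reg_delay_ps
    # Each operation spends one full clock cycle in every stage.
    latency_ps = cycle_ps * len(stage_delays_ps)
    # One operation completes per cycle; 1 operation per ps is 1000 GOPS.
    throughput_gops = 1000.0 / cycle_ps
    return cycle_ps, latency_ps, throughput_gops


def register_overhead(stage_delays_ps, reg_delay_ps=20):
    """Fraction of the clock cycle spent loading the register."""
    cycle_ps, _, _ = pipeline_timing(stage_delays_ps, reg_delay_ps)
    return reg_delay_ps / cycle_ps


# The four designs discussed on the slides (labels are illustrative).
designs = {
    "unpipelined": [300],
    "3-way": [100, 100, 100],
    "nonuniform": [50, 150, 100],
    "6-way": [50] * 6,
}
for name, stages in designs.items():
    cycle, latency, gops = pipeline_timing(stages)
    overhead = 100 * register_overhead(stages)
    print(f"{name}: cycle={cycle} ps, delay={latency} ps, "
          f"throughput={gops:.2f} GOPS, register overhead={overhead:.2f}%")
```

Note that the nonuniform design has the same total logic delay as the balanced 3-way design (300 ps), yet its clock period is stretched to 170 ps by the 150 ps stage, which is why its throughput is lower.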

SEQ+ Hardware

  - Still sequential implementation
  - Reorder PC stage to put at beginning

PC Stage
  - Task is to select PC for current instruction
  - Based on results computed by previous instruction

Processor State
  - PC is no longer stored in register
  - But, can determine PC based on other stored information

[Figure: SEQ+ hardware block diagram; the PC stage at the bottom computes the PC from pIcode, pBch, pValM, pValC, pValP]

Adding Pipeline Registers

[Figure: SEQ+ hardware next to the pipelined version, with pipeline registers F, D, E, M, and W inserted between the Fetch, Decode, Execute, Memory, and Write back stages]

Pipeline Stages

Fetch
  - Select current PC
  - Read instruction
  - Compute incremented PC
Decode
  - Read program registers
Execute
  - Operate ALU
Memory
  - Read or write data memory
Write Back
  - Update register file

[Figure: pipelined hardware with registers F, D, E, M, W and signals f_PC, predPC, d_srcA, d_srcB, M_icode, M_Bch, M_valA, W_icode, W_valE, W_valM, W_dstE, W_dstM]

PIPE- Hardware

  - Pipeline registers hold intermediate values from instruction execution

Forward (Upward) Paths
  - Values passed from one stage to next
  - Cannot jump past stages
      - e.g., valC passes through decode

[Figure: PIPE- hardware block diagram showing the fields of pipeline registers F, D, E, M, W and the Select PC, Predict PC, and Select A blocks]

Feedback Paths

Predicted PC
  - Guess value of next PC
Branch information
  - Jump taken/not-taken
  - Fall-through or target address
Return point
  - Read from memory
Register updates
  - To register file write ports

[Figure: PIPE- hardware block diagram with the feedback paths highlighted]

Predicting the PC

[Figure: fetch logic with Select PC, instruction memory, Split, Align, PC increment, and Predict PC blocks, fed by M_icode, M_Bch, M_valA, W_icode, W_valM]

  - Start fetch of new instruction after current one has completed fetch stage
      - Not enough time to reliably determine next instruction
  - Guess which instruction will follow
      - Recover if prediction was incorrect

Our Prediction Strategy

Instructions that Don't Transfer Control
  - Predict next PC to be valP
  - Always reliable
Call and Unconditional Jumps
  - Predict next PC to be valC (destination)
  - Always reliable
Conditional Jumps
  - Predict next PC to be valC (destination)
  - Only correct if branch is taken
      - Typically right 60% of time
Return Instruction
  - Don't try to predict

Recovering from PC Misprediction

[Figure: fetch logic, as on the Predicting the PC slide]

Mispredicted Jump
  - Will see branch flag once instruction reaches memory stage
  - Can get fall-through PC from valA
Return Instruction
  - Will get return PC when ret reaches write-back stage

Pipeline Demonstration

File: demo-basic.ys

                            1  2  3  4  5  6  7  8  9
    irmovl $1,%eax  #I1     F  D  E  M  W
    irmovl $2,%ecx  #I2        F  D  E  M  W
    irmovl $3,%edx  #I3           F  D  E  M  W
    irmovl $4,%ebx  #I4              F  D  E  M  W
    halt            #I5                 F  D  E  M  W

Cycle 5
    W: I1
    M: I2
    E: I3
    D: I4
    F: I5
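
The demo-basic.ys diagram above follows a simple rule: with no stalls, each instruction enters fetch one cycle after its predecessor and then advances one stage per cycle. As a minimal sketch of that rule (the helper names `stage_at`, `snapshot`, and `diagram` are hypothetical, and the one-instruction-per-cycle, no-stall behavior is an assumption that holds only for hazard-free programs like this one):

```python
STAGES = ["F", "D", "E", "M", "W"]


def stage_at(instr_index, cycle):
    """Stage occupied by an instruction in a given cycle, or None.

    instr_index is 0-based (program order); cycle is 1-based.
    Assumes one instruction enters fetch per cycle and nothing stalls.
    """
    pos = cycle - 1 - instr_index
    return STAGES[pos] if 0 <= pos < len(STAGES) else None


def snapshot(num_instrs, cycle):
    """Map each occupied stage to the label (I1, I2, ...) of its instruction."""
    result = {}
    for i in range(num_instrs):
        stage = stage_at(i, cycle)
        if stage is not None:
            result[stage] = f"I{i + 1}"
    return result


def diagram(num_instrs):
    """One text row per instruction; '.' marks cycles where it is not in flight."""
    total_cycles = num_instrs + len(STAGES) - 1
    rows = []
    for i in range(num_instrs):
        cells = [stage_at(i, c) or "." for c in range(1, total_cycles + 1)]
        rows.append(f"I{i + 1}: " + " ".join(cells))
    return rows


for row in diagram(5):
    print(row)
print(snapshot(5, 5))
```

For five instructions, `snapshot(5, 5)` places I1 in W, I2 in M, I3 in E, I4 in D, and I5 in F, matching the cycle 5 view on the slide.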

Data Dependencies: 3 Nop's

# demo-h3.ys

                            1  2  3  4  5  6  7  8  9  10 11
    0x000: irmovl $10,%edx  F  D  E  M  W
    0x006: irmovl $3,%eax      F  D  E  M  W
    0x00c: nop                    F  D  E  M  W
    0x00d: nop                       F  D  E  M  W
    0x00e: nop                          F  D  E  M  W
    0x00f: addl %edx,%eax                  F  D  E  M  W
    0x011: halt                               F  D  E  M  W

Cycle 6
    W: R[%eax] ← 3
Cycle 7
    D: valA ← R[%edx] = 10
       valB ← R[%eax] = 3

Data Dependencies: 2 Nop's

# demo-h2.ys

                            1  2  3  4  5  6  7  8  9  10
    0x000: irmovl $10,%edx  F  D  E  M  W
    0x006: irmovl $3,%eax      F  D  E  M  W
    0x00c: nop                    F  D  E  M  W
    0x00d: nop                       F  D  E  M  W
    0x00e: addl %edx,%eax               F  D  E  M  W
    0x010: halt                            F  D  E  M  W

Cycle 6
    W: R[%eax] ← 3
    D: valA ← R[%edx] = 10
       valB ← R[%eax] = 0    (Error)

Data Dependencies: 1 Nop

# demo-h1.ys

                            1  2  3  4  5  6  7  8  9
    0x000: irmovl $10,%edx  F  D  E  M  W
    0x006: irmovl $3,%eax      F  D  E  M  W
    0x00c: nop                    F  D  E  M  W
    0x00d: addl %edx,%eax            F  D  E  M  W
    0x00f: halt                         F  D  E  M  W

Cycle 5
    W: R[%edx] ← 10
    M: M_valE = 3, M_dstE = %eax
    D: valA ← R[%edx] = 0    (Error)
       valB ← R[%eax] = 0    (Error)

Data Dependencies: No Nop

# demo-h0.ys

                            1  2  3  4  5  6  7  8
    0x000: irmovl $10,%edx  F  D  E  M  W
    0x006: irmovl $3,%eax      F  D  E  M  W
    0x00c: addl %edx,%eax         F  D  E  M  W
    0x00e: halt                      F  D  E  M  W

Cycle 4
    M: M_valE = 10, M_dstE = %edx
    E: e_valE ← 0 + 3 = 3, E_dstE = %eax
    D: valA ← R[%edx] = 0    (Error)
       valB ← R[%eax] = 0    (Error)

Branch Misprediction Example

demo-j.ys

    0x000:    xorl %eax,%eax
    0x002:    jne t              # Not taken
    0x007:    irmovl $1, %eax    # Fall through
    0x00d:    nop
    0x00e:    nop
    0x00f:    nop
    0x010:    halt
    0x011: t: irmovl $3, %edx    # Target (Should not execute)
    0x017:    irmovl $4, %ecx    # Should not execute
    0x01d:    irmovl $5, %edx    # Should not execute

  - Should only execute the first 7 instructions (through halt)

Branch Misprediction Trace

# demo-j

                                               1  2  3  4  5  6  7  8  9
    0x000: xorl %eax,%eax                      F  D  E  M  W
    0x002: jne t               # Not taken        F  D  E  M  W
    0x011: t: irmovl $3, %edx  # Target              F  D  E  M  W
    0x017: irmovl $4, %ecx     # Target+1               F  D  E  M  W
    0x007: irmovl $1, %eax     # Fall Through              F  D  E  M  W

Cycle 5
    M: M_Bch = 0, M_valA = 0x007
    E: valE ← 3, dstE = %edx
    D: valC = 4, dstE = %ecx
    F: valC ← 1, rB ← %eax

  - Incorrectly execute two instructions at branch target

Return Example

demo-ret.ys

    0x000:    irmovl Stack,%esp  # Initialize stack pointer
    0x006:    nop                # Avoid hazard on %esp
    0x007:    nop
    0x008:    nop
    0x009:    call p             # Procedure call
    0x00e:    irmovl $5,%esi     # Return point
    0x014:    halt
    0x020: .pos 0x20
    0x020: p: nop                # procedure
    0x021:    nop
    0x022:    nop
    0x023:    ret
    0x024:    irmovl $1,%eax     # Should not be executed
    0x02a:    irmovl $2,%ecx     # Should not be executed
    0x030:    irmovl $3,%edx     # Should not be executed
    0x036:    irmovl $4,%ebx     # Should not be executed
    0x100: .pos 0x100
    0x100: Stack:                # Stack: Stack pointer

  - Require lots of nops to avoid data hazards

Incorrect Return Example

# demo-ret

    0x023: ret                       F  D  E  M  W
    0x024: irmovl $1,%eax  # Oops!      F  D  E  M  W
    0x02a: irmovl $2,%ecx  # Oops!         F  D  E  M  W
    0x030: irmovl $3,%edx  # Oops!            F  D  E  M  W
    0x00e: irmovl $5,%esi  # Return              F  D  E  M  W

Pipeline state when ret reaches write-back:
    W: valM = 0x0e
    M: valE = 1, dstE = %eax
    E: valE ← 2, dstE = %ecx
    D: valC = 3, dstE = %edx
    F: valC ← 5, rB ← %esi

  - Incorrectly execute 3 instructions following ret

Pipeline Summary

Concept
  - Break instruction execution into 5 stages
  - Run instructions through in pipelined mode

Limitations
  - Can't handle dependencies between instructions when instructions follow too closely
  - Data dependencies
      - One instruction writes register, later one reads it
  - Control dependency
      - Instruction sets PC in way that pipeline did not predict correctly
      - Mispredicted branch and return

Fixing the Pipeline
  - We'll do that next time
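
The data-dependency demos (demo-h3.ys through demo-h0.ys) can be summarized by one timing rule: an instruction fetched in cycle k decodes in cycle k+1 and writes back in cycle k+4, and a register write only becomes visible to decode in a later cycle than the write-back. Below is a minimal sketch of that rule, not a model of the actual PIPE- hardware; the function name `addl_operands` is hypothetical, and it assumes both registers start at zero, as the error values on the slides suggest.

```python
def addl_operands(num_nops):
    """Values (valA, valB) that `addl %edx,%eax` reads in its decode stage.

    Models the demo-hN.ys programs: two irmovl instructions, then num_nops
    nops, then the addl. Assumes no forwarding and no stalling.
    """
    # Assumption: registers hold zero before the program runs.
    regs = {"%edx": 0, "%eax": 0}
    # (destination, value) of the two irmovl instructions, in program order.
    writers = [("%edx", 10), ("%eax", 3)]

    # Instruction i (0-based) is fetched in cycle i + 1, decoded in cycle
    # i + 2, and reaches write-back in cycle i + 5.
    addl_index = len(writers) + num_nops
    decode_cycle = addl_index + 2

    for i, (reg, value) in enumerate(writers):
        writeback_cycle = i + 5
        # The register file is updated at the end of the write-back cycle,
        # so the new value is visible only to decodes in later cycles.
        if writeback_cycle < decode_cycle:
            regs[reg] = value

    return regs["%edx"], regs["%eax"]


for n in (3, 2, 1, 0):
    val_a, val_b = addl_operands(n)
    ok = (val_a, val_b) == (10, 3)
    print(f"{n} nops: valA={val_a}, valB={val_b}, {'ok' if ok else 'error'}")
```

Under these assumptions only the three-nop program reads both operands correctly, which agrees with the slides: with two nops %eax is stale, and with one or zero nops both registers are stale.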
