Pipelined Processor Design
A basic technique to improve performance, always applied in high-performance systems and adopted in all processors.
ITCS 3181 Logic and Computer Systems 2014 B. Wilkinson Slides9.ppt Modification date: Nov 3, 2014

Pipelined Processor Design
The operation of the processor is divided into a number of sequential actions, e.g.:
1. Fetch instruction.
2. Fetch operands.
3. Execute operation.
4. Store results.
or more steps. Each action is performed by a separate logic unit (stage); the stages are linked together in a "pipeline."

Processor Pipeline Space-Time Diagram
[Figure: (a) a seven-stage pipeline, Units 1-7, with instructions entering from memory; (b) the space-time diagram. Notation: subscript = instruction, superscript = stage, so I_2^3 is instruction 2 in stage 3. Each instruction advances one stage per cycle; once the pipeline is full, all seven stages work on different instructions simultaneously and one instruction completes every cycle.]

Pipeline Staging Latches
Usually, pipelines are designed with latches (registers) between units (stages) to hold the information being transferred from one stage to the next. Transfer occurs in synchronism with a clock signal:
[Figure: Latch - Unit - Latch - Unit - Latch - Unit - Latch, with data flowing left to right and all latches driven by a common clock.]

Processing Time
Time to process s instructions using a pipeline with p stages = p + s - 1 cycles.
[Figure: space-time diagram: the first instruction needs p cycles to pass through all p stages; the remaining s - 1 instructions then complete one per cycle.]

Speedup
How much faster is a pipeline than a single homogeneous unit?
The speed-up available in a pipeline is given by:

    Speedup, S = T1 / T2 = sp / (p + s - 1)

Note: this does not take into account the extra time due to the latches in the pipelined version. The potential maximum speed-up is p, though it can only be achieved for an infinite stream of tasks (s -> infinity) and no hold-ups in the pipeline.

An alternative to pipelining is using multiple units, each doing the complete task. The units could be designed to operate faster than the pipelined version, but the system would cost much more.

Dividing Processor Actions
The operation of the processor can be divided into:
• Fetch cycle
• Execute cycle

Two-Stage Fetch/Execute Pipeline
[Figure: (a) a fetch unit (IF) feeding an execute unit (EX), with instructions entering from memory; (b) space-time diagram with ideal overlap: while the execute unit works on the 1st instruction, the fetch unit fetches the 2nd, and so on.]

A Two-Stage Pipeline Design
[Figure: the fetch unit holds the PC (with a +4 incrementer), MAR, MDR, and IR, and reads instructions from memory into an instruction latch; the execute unit holds the registers, control logic, and ALU. A branch/jump in the execute unit can affect the PC, and the execute unit accesses memory for data (LD and ST instructions).]

Fetch/Decode/Execute Pipeline
Relevant for complex instruction formats. The decode unit recognizes the instruction, separating the operation from the operand addresses.
[Figure: (a) fetch, decode, and execute stages in series; (b) ideal overlap: fetch the 4th instruction while decoding the 3rd and executing the 2nd.]

Try to have each stage require the same time; otherwise the pipeline has to operate at the rate of its slowest stage. Usually more stages are used to equalize the stage times.
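The p + s - 1 cycle count and the speedup formula can be checked numerically; a minimal sketch in Python (the stage and instruction counts chosen are only illustrative):

```python
def pipeline_cycles(p: int, s: int) -> int:
    """Cycles to process s instructions in a p-stage pipeline:
    p cycles for the first instruction to pass through all stages,
    then one additional cycle per remaining instruction."""
    return p + s - 1

def speedup(p: int, s: int) -> float:
    """Speedup over a single homogeneous unit taking p cycle-times
    per instruction: T1 / T2 = s*p / (p + s - 1)."""
    return (s * p) / pipeline_cycles(p, s)

print(pipeline_cycles(4, 10))            # 13 cycles
print(round(speedup(4, 10), 2))          # 3.08 -- well short of the maximum of 4
print(round(speedup(4, 10_000), 2))      # approaches p = 4 as s grows
```

As the last line shows, the speedup only tends to p for a long uninterrupted stream of instructions, matching the slide's note.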
Let's start with four stages:

Four-Stage Pipeline
[Figure: instruction fetch unit (IF) -> operand fetch unit (OF) -> execute unit (EX) -> operand store unit (OS), with instructions entering from memory.]

Space-Time Diagram
[Figure: stages OS, EX, OF, IF plotted against time; IF handles instructions 1-5 in successive cycles, with OF, EX, and OS each lagging one cycle behind the stage before it.]

Four-Stage Pipeline "Instruction-Time Diagram"
An alternative diagram, with one row per instruction:

    Instruction
    1st   IF  OF  EX  OS
    2nd       IF  OF  EX  OS
    3rd           IF  OF  EX  OS
    4th               IF  OF  EX  OS
                                       Time ->

IF = instruction fetch unit, OF = operand fetch unit, EX = execute unit, OS = operand store unit.
This form of diagram is used later to show pipeline dependencies.

Information Transfer in a Four-Stage Pipeline
[Figure: the IF stage takes the instruction address from the PC and reads memory; latches between IF/OF, OF/EX, and EX/OS carry the instruction's information forward; OF passes register numbers to the register file and reads their contents; EX contains the ALU; OS writes results back to the register file. All latches share the clock.]

Register-Register Instructions: ADD R3, R2, R1
After the instruction is fetched: the IF/OF latch holds the opcode (Add) and the register numbers R3, R2, and R1; the PC has been updated to PC + 4.
Note: where R3, R2, and R1 are mentioned in the latch, it actually holds just the register numbers.

After the operands are fetched: the OF/EX latch holds Add, R3, V2, and V1, where V1 is the contents of R1 and V2 is the contents of R2.

After execution (the addition): the EX/OS latch holds R3 and Result = V1 + V2, computed by the ALU.
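The instruction-time diagram is easy to generate for any instruction count; a small sketch, using the four stage names from the slides:

```python
STAGES = ["IF", "OF", "EX", "OS"]  # instruction fetch, operand fetch, execute, operand store

def instruction_time_diagram(n_instructions: int) -> str:
    """One row per instruction; instruction i enters the pipeline one
    cycle after instruction i-1, so each row is shifted right by one slot."""
    rows = []
    for i in range(n_instructions):
        cells = ["  "] * i + STAGES   # blank slots for cycles before entry
        rows.append(f"I{i + 1}: " + " ".join(cells))
    return "\n".join(rows)

print(instruction_time_diagram(4))
```

The staggered rows make pipeline dependencies easy to spot later: two instructions in the same column are in the pipeline during the same cycle.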
After the result is stored: the OS stage writes Result into R3 in the register file; the latches are now free for the following instructions.

Overall: the instruction's information moves through the latches as
(Add, R3, R2, R1) -> (Add, R3, V2, V1) -> (R3, Result),
with the result finally written to R3. Note: where R3, R2, and R1 are mentioned in the latch, it actually holds just the register numbers.

Register-Constant Instructions: ADD R3, R2, 123
After the instruction is fetched: the IF/OF latch holds Add, R3, R2, and the constant 123; the PC has been updated to PC + 4.
Note: where R3 and R2 are mentioned in the latch, it actually holds just the register numbers.

After the operands are fetched: the OF/EX latch holds Add, R3, V2, and 123, where V2 is the contents of R2; the constant needs no register read.

After execution (the addition): the EX/OS latch holds R3 and Result = V2 + 123.

After the result is stored: the OS stage writes Result into R3 in the register file.
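The latch-by-latch progress of ADD R3, R2, R1, and of the immediate form ADD R3, R2, 123, can be mimicked in a toy simulation; the Python representation of latches and registers is an illustrative assumption, not part of the design:

```python
def run_add(regs: dict, dest: str, src1, src2) -> dict:
    """Toy pass of one ADD through the IF/OF/EX/OS latches."""
    # IF: the latch holds only the opcode, the destination register
    # number, and the two source operands (register numbers or a constant).
    latch = ("Add", dest, src1, src2)
    # OF: register numbers are replaced by register contents;
    # an immediate constant passes through unchanged.
    v1 = regs[src1] if isinstance(src1, str) else src1
    v2 = regs[src2] if isinstance(src2, str) else src2
    latch = ("Add", dest, v1, v2)
    # EX: the ALU adds; the latch now holds only dest and the result.
    latch = (dest, v1 + v2)
    # OS: the result is written back to the register file.
    regs[dest] = latch[1]
    return regs

regs = {"R1": 5, "R2": 7, "R3": 0}
run_add(regs, "R3", "R2", "R1")   # register-register: R3 = R2 + R1 = 12
run_add(regs, "R3", "R2", 123)    # register-constant: R3 = R2 + 123 = 130
print(regs["R3"])                 # 130
```

Note how the only difference between the two instruction forms is in the OF step, which matches the slides: the constant 123 simply takes the place of a fetched register value.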
Register-Constant Instructions (Immediate Addressing): ADD R3, R2, 123
Overall: (Add, R3, R2, 123) -> (Add, R3, V2, 123) -> (R3, Result), with the result written to R3. V2 is the contents of R2.

Branch Instructions
A couple of issues to deal with here:
1. The number of steps needed.
2. Dealing with the program counter incrementing after each instruction fetch.

(Complex) Branch Instructions: Bcond R1, R2, L1
The offset to L1 is held in the instruction. The third stage becomes EX/BR, containing the ALU and a test unit.

After the instruction is fetched: the IF/OF latch holds Bcond, R1, R2, and the offset.

After the operands are fetched: the OF/EX latch holds Bcond, V1, V2, and the offset, where V1 is the contents of R1 and V2 is the contents of R2.

After execution: the test unit in EX/BR compares V1 and V2, producing a TRUE/FALSE result; the EX/OS latch holds this result and the offset.

After the result is stored: if the result is TRUE, the offset is added to the PC; else do nothing.

Overall: (Bcond, R1, R2, offset) -> (Bcond, V1, V2, offset) -> (Result, offset) -> PC updated only if the branch is taken.

Simpler Branch Instructions: Bcond R1, L1
Tests R1 against zero.
Overall: (Bcond, R1, offset) -> (Bcond, V1, offset) -> (Result, offset) -> if TRUE, add the offset to the PC, else do nothing. V1 is the contents of R1.
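The EX/BR test and the conditional PC update for Bcond can be sketched as follows; the equality condition is an illustrative choice (the slides leave the condition generic), and the PC-increment skew discussed later in the deck is ignored here:

```python
def bcond(regs: dict, pc: int, r1: str, r2: str, offset: int):
    """Toy EX/BR + OS behaviour of Bcond R1, R2, L1: test the two
    register values; if the test is TRUE, add the instruction's
    offset to the PC, else do nothing."""
    taken = regs[r1] == regs[r2]            # EX/BR: test unit -> TRUE/FALSE
    new_pc = pc + offset if taken else pc   # OS: update PC only when TRUE
    return new_pc, taken

regs = {"R1": 3, "R2": 3}
print(bcond(regs, 100, "R1", "R2", 40))  # (140, True)  -- branch taken
regs["R2"] = 9
print(bcond(regs, 100, "R1", "R2", 40))  # (100, False) -- fall through
```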
Dealing with the Program Counter Incrementing After Each Instruction Fetch
The previous design needs to take into account that, by the time the branch instruction is in the execute unit, the program counter will have been incremented three times. Solutions:
1. Modify the offset value in the instruction (subtract 12).
2. Modify the arithmetic operation to be PC + offset - 12.
3. Feed the program counter value through the pipeline. (This is the best way, as it takes into account any pipeline length. Done in the P-H book.)

Feeding the PC Value Through the Pipeline: Bcond R1, L1
Tests R1 against zero.
[Figure: the PC value captured at fetch travels through the OF and EX/BR latches alongside Bcond, V1, and the offset; an adder in EX/BR computes the new PC value, the test unit checks V1, and in OS the PC is updated with the new value if the result is TRUE, else nothing is done. V1 is the contents of R1.]

Load and Store Instructions
These need at least one extra stage to handle memory accesses. The early RISC processor arrangement was to place a memory stage (MEM) between EX and OS, as below. This gives a five-stage pipeline.

LD R1, 100[R2]
[Figure: IF fetches from the instruction memory; OF produces (LD, R1, V2, 100); EX computes the effective address V2 + 100 in the ALU; MEM reads the data memory at that address; OS stores the value read into R1.]

ST 100[R2], R1
[Figure: OF produces (ST, V1, V2, 100); EX computes the effective address V2 + 100; MEM writes V1 to the data memory at that address; the OS stage is not used.]

Note: it is convenient to have separate instruction and data memories connecting to the processor pipeline - usually separate cache memories, see later.
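The three solutions for the PC skew can be compared in a toy model; the 4-byte instruction size and the three increments come from the slides, while the function names are illustrative:

```python
WORD = 4  # instruction size in bytes, from the slides' PC = PC + 4

def target_naive(pc_at_execute: int, offset: int) -> int:
    """Wrong: by execute time the PC has been incremented three times."""
    return pc_at_execute + offset

def target_adjusted(pc_at_execute: int, offset: int) -> int:
    """Solution 2: compensate in the arithmetic: PC + offset - 12."""
    return pc_at_execute + offset - 3 * WORD

def target_piped(pc_at_fetch: int, offset: int) -> int:
    """Solution 3 (best): the PC value captured at fetch is carried
    through the pipeline latches, so no length-dependent fix-up is needed."""
    return pc_at_fetch + offset

pc_fetch = 100
pc_exec = pc_fetch + 3 * WORD        # 112: three increments later
print(target_adjusted(pc_exec, 40))  # 140
print(target_piped(pc_fetch, 40))    # 140 -- same answer, for any pipeline length
print(target_naive(pc_exec, 40))     # 152 -- off by 12
```

Solution 1 (patching the offset in the instruction itself) would bake the "- 12" into the encoded offset at assembly time, which ties the program to one particular pipeline depth; carrying the fetch-time PC avoids that.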
Usage of Stages
[Figure: (a) the five units against time: fetch, fetch operands from registers, execute instruction (compute), access memory, store results in register; (b) instruction usage: load, store, arithmetic (the instruction passes through the MEM stage but no actions take place there), and branch.]

Number of Pipeline Stages
As the number of stages is increased, one would expect the time for each stage to decrease, i.e. the clock period to decrease and the speed to increase. However, one must take into account the pipeline latch delay. The 5-stage pipeline represents an early RISC design, now considered "underpipelined"; most recent processors have more stages.

Optimum Number of Pipeline Stages*
Suppose one homogeneous unit doing everything takes Ts time units. With p pipeline stages and the work equally distributed, each stage takes Ts/p. Let tL = the time for a latch to operate. Then:

    Execution time Tex = (p + s - 1)(Ts/p + tL)

[Plot: Tex against the number of pipeline stages p, for p = 2^1 to 2^7; typical results for Ts = 128, tL = 2 show Tex falling from about 800, reaching a minimum at an optimum of about 16 stages, then rising again.]
In practice, there are a lot more factors involved; see later for some.
* Adapted from "Computer Architecture and Implementation" by H. G. Cragon, Cambridge University Press, 2000.

Questions
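The Tex model from the optimum-stage-count slide can be evaluated directly. Taking s = 5 tasks (an assumption; the slide's plot does not state the task count used) reproduces an optimum of about 16 stages for Ts = 128, tL = 2:

```python
def t_exec(p: int, s: int, ts: float = 128, tl: float = 2) -> float:
    """Tex = (p + s - 1) * (Ts/p + tL): fill/drain cycle count times
    the per-stage time, which includes the latch delay tL."""
    return (p + s - 1) * (ts / p + tl)

s = 5  # assumed task count, chosen to match the slide's optimum
best_p = min(range(1, 129), key=lambda p: t_exec(p, s))
print(best_p, t_exec(best_p, s))  # 16 200.0
```

The trade-off is visible in the formula: increasing p shrinks the Ts/p term but inflates both the fixed latch cost tL per stage and the p + s - 1 fill time, so Tex has an interior minimum rather than improving without bound.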