2015-11-06
TDTS 08 – Lecture 3: Instruction Pipelining
Zebo Peng, IDA, LiTH

Outline:
- Basic concepts
- Pipeline hazards
- Branch handling
- Branch prediction

What is a Pipeline?
Divide a task into a sequence of simpler sub-tasks, and employ one worker for each sub-task (the principle behind H. Ford's assembly line).

Basic Concepts
Sequential execution of an N-stage task:
- Production time: N time units per product.
- Resource needed: one general-purpose machine.
- Productivity: one product per N time units.
Pipelined execution of an N-stage task:
- Production time: still N time units per product.
- Resource needed: N special-purpose machines.
- Productivity: approximately one product per time unit, once the pipeline is full.

Instruction Execution Stages
A typical instruction execution sequence:
1. Fetch Instruction (FI): fetch the instruction.
2. Decode Instruction (DI): determine the op-code and the operand specifiers.
3. Calculate Operands (CO): calculate the effective addresses.
4. Fetch Operands (FO): fetch the operands.
5. Execute Instruction (EI): perform the operation.
6. Write Operand (WO): store the result in memory.

Instruction Pipelining
[Timing diagram: instructions I1-I9 each pass through FI, DI, CO, FO, EI and WO, starting one cycle apart.]
This is the ideal case: the speed-up approaches 6, the number of stages.

Typical Instruction Pipelining
[Timing diagram: different instructions have different execution patterns, so not every instruction uses every stage.]
In practice, there are many holes in the pipeline, which reduces the speed-up factor.
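The ideal-case arithmetic above can be checked with a small sketch (my own illustration, not from the lecture): with N stages and k instructions, sequential execution takes N * k cycles, while a pipeline takes N cycles to produce the first result and one cycle for each result after that.

```python
def sequential_time(n_stages, n_tasks):
    # Each task occupies the machine for all N stages before the next starts.
    return n_stages * n_tasks

def pipelined_time(n_stages, n_tasks):
    # The first task takes N cycles to fill the pipeline; afterwards one
    # task completes per cycle.
    return n_stages + (n_tasks - 1)

N, k = 6, 9   # the 6-stage pipeline and 9 instructions of the diagram
print(sequential_time(N, k))   # 54
print(pipelined_time(N, k))    # 14
# Speed-up for 9 instructions is 54/14 = ~3.86; it approaches 6 as k grows.
print(sequential_time(N, k) / pipelined_time(N, k))
```

Note that the full speed-up of 6 is only reached asymptotically, for long instruction streams.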
Number of Pipeline Stages
In general, a larger number of stages gives better performance. However:
- A larger number of stages increases the overhead of moving information between stages and of synchronizing the stages.
- The complexity of the CPU grows with the number of stages.
- It is difficult to keep a large pipeline at its maximum rate because of pipeline hazards.
Examples:
- Intel 80486 and Pentium: five-stage pipeline for integer instructions, eight-stage pipeline for floating-point (FP) instructions.
- Pentium 4: 20 stages (!).
- IBM PowerPC: different numbers of stages (3-9) for different machines; the PowerPC 440 has 7 stages.

Pipeline Hazards (Conflicts)
Hazards are situations that prevent the next instruction in the instruction stream from executing during its designated clock cycle. The instruction is said to be stalled. When an instruction is stalled:
- all instructions later in the pipeline than the stalled one are also stalled;
- no new instructions are fetched during the stall;
- instructions earlier than the stalled one continue as usual.
Types of hazards: structural hazards, data hazards, and control hazards.

Structural (Resource) Hazards
Hardware conflicts caused by the use of the same hardware resource at the same time (e.g., memory conflicts).
[Timing diagram: an instruction's FI needs the memory in the same cycle as an earlier instruction's FO, so its fetch is stalled one cycle.]
Penalty: 1 cycle. (Note: the performance loss is multiplied by the number of stages.) A Harvard architecture solves this particular problem.

Structural Hazard Solutions
In general, the hardware resources in conflict are duplicated in order to avoid structural hazards.
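The memory-port conflict above can be sketched with a toy scheduler (my own model, assuming FI uses the memory port in an instruction's start cycle and, for instructions with a memory operand, FO uses it three cycles later):

```python
def fetch_cycles(uses_mem_operand):
    """Greedy schedule: each instruction starts FI one cycle after the
    previous one, unless the single memory port is already reserved."""
    port = set()            # cycles in which the memory port is reserved
    starts, prev = [], -1
    for needs_operand in uses_mem_operand:
        s = prev + 1
        # Stall while FI (cycle s) or a needed FO (cycle s + 3) clashes.
        while s in port or (needs_operand and s + 3 in port):
            s += 1
        port.add(s)                 # FI reads instruction memory
        if needs_operand:
            port.add(s + 3)         # FO reads operand memory
        starts.append(s)
        prev = s
    return starts

# Only the first instruction fetches an operand from memory, as in the
# diagram above: the fourth instruction is stalled one cycle.
print(fetch_cycles([True, False, False, False, False]))  # [0, 1, 2, 4, 5]
```

With separate instruction and operand memories (Harvard architecture), FI and FO would never compete for the same port, and no stall would occur.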
Functional units (ALU, FP unit) can also themselves be pipelined to support several instructions at the same time. Memory conflicts can be solved by:
- having two separate caches, one for instructions and the other for operands (Harvard architecture);
- using multiple banks of the main memory; or
- keeping as many intermediate results as possible in the registers (!).

Data Hazards
Caused by reversing the order of data-dependent operations due to the pipeline (e.g., WRITE/READ conflicts). Example:
ADD A, R1;  -- Mem(A) ← Mem(A) + R1
SUB A, R2;  -- Mem(A) ← Mem(A) - R2
With A = 200, Mem(200) = 100, R1 = 30, R2 = 50, sequential execution gives Mem(A) = 130 after ADD and the correct result 80 after SUB. In the pipeline, however, SUB fetches Mem(A) before ADD has written the sum back, and computes 100 - 50 = 50 instead of 80.

Data Hazard Penalty
[Timing diagram: SUB's FO must wait until ADD's WO has completed, costing two stall cycles.]
Penalty: 2 cycles. Data hazards are an important issue: they occur very often, since programs contain many data dependencies.

Data Hazard Solutions
The penalty due to data hazards can be reduced by a technique called forwarding (bypassing).
[Figure: a bypass path with a MUX feeds the ALU output back to the ALU input, in parallel with the memory system (registers, cache and main memory).]
The ALU result is fed back to the ALU input.
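How much forwarding helps can be sketched with a toy timing model (assumed stage numbering FI = 0 through WO = 5, one instruction issued per cycle; this is my illustration, not the lecture's):

```python
def raw_stalls(forwarding):
    """Stall cycles for a consumer issued right after its producer."""
    if forwarding:
        need = 1 + 4       # consumer's EI: the ALU can take the bypassed value
        ready = 0 + 4 + 1  # cycle after the producer's EI produces the result
    else:
        need = 1 + 3       # consumer's FO must read the stored value
        ready = 0 + 5 + 1  # only after the producer's WO has written it back
    return max(0, ready - need)

print(raw_stalls(forwarding=False))  # 2 (the penalty from the previous slide)
print(raw_stalls(forwarding=True))   # 0
```

In this model the bypass path removes the whole 2-cycle penalty, because the consumer no longer waits for the value to travel through the memory system.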
If the hardware detects that the value needed for an operation is the one produced by the previous instruction, and that it has not yet been written back, the ALU selects the forwarded result instead of the value from the memory system.

Control Hazards
Caused by branch instructions, which change the instruction execution order.
[Timing diagram: instruction 3 is "BRA 25 IF Zero"; instructions 4-6 have already entered the pipeline when the branch turns out to be taken, so execution must continue at instruction 25.]

Branch Handling (1)
Stop the pipeline until the branch instruction reaches the last stage.
[Timing diagram: after fetching "BRA 25 IF Zero", fetching is stalled four cycles until the branch outcome is known; then instructions 25, 26, ... enter the pipeline.]
This leads to a very large loss of performance, in particular since 20%-35% of the instructions executed are branches.

Branch Handling (2)
Multiple streams — implement hardware resources to deal with the different branch alternatives.
[Timing diagram: both the fall-through path (instructions 4, 5, 6, ...) and the target path (instructions 25, 26, 27, 28, ...) are fetched until the branch condition is known.]
This is an expensive solution, and special memory techniques are needed to fully utilize it.

Branch Handling (3)
Pre-fetch branch target — when a conditional branch is recognized, the following instruction is fetched, and the branch target is pre-fetched as well.
Loop buffer — use a small, high-speed memory to keep the n most recently fetched instructions.
If a branch is to be taken, the buffer is first checked to see whether the branch target is already in it; it acts as a special cache for branch-target instructions.
Delayed branch — re-arrange the instructions so that the branch takes effect later than originally specified (a software solution).

Delayed Branch Example
Original instruction sequence (no data dependence between ADD and BRA):
ADD X;
BRA L;
...
Delayed branch:
BRA L;
ADD X;
...
With the delayed branch, ADD executes in the branch delay slot while the branch condition is being evaluated. The compiler or the programmer has to find an instruction that can be moved from its original place into the branch delay slot (it will be executed regardless of the branch outcome); the success rate is 60% to 85%. This leads, however, to less readable code.

Branch Prediction
When a branch is encountered, a prediction is made and the predicted path is followed:
- The instructions on the predicted path are fetched.
- The fetched instructions can also be executed; this is called speculative execution. Results produced by such executions are marked as tentative.
- When the branch outcome is decided: if the prediction is correct, the special tags on the tentative results are removed; if not, the tentative results are discarded and execution continues on the other path.
Branch prediction can be based on static or dynamic information.

Static Branch Prediction
Predict always taken: assume that the jump will happen, and always fetch the target instruction.
[Timing diagram: as soon as "BRA 25 IF Zero" is recognized, instructions 25, 26, ... are fetched.]

Static Branch Prediction (Cont'd)
Example (Java code):
sum = 0;
for (i = 0; i < 1000; i++)
    sum += a[i];
The array a[0..999] starts at memory location 100; R0 holds sum and R1 the index.
    MOVE R0, #0      -- sum
    MOVE R1, #100    -- index
L1: ADD  R0, (R1)
    ADD  R1, #1
    COMP R1, #1100
    BNZ  L1
    MOVE sum, R0     -- store the result
With "predict always taken", the loop-closing branch BNZ is taken 999 times out of 1000, so the prediction will be correct 99.9% of the time in this example!

Static Branch Prediction (Cont'd)
Predict never taken: assume that the jump will not happen, and always fetch the next instruction.
Predict by operation codes: some instructions are more likely to result in a jump than others, e.g.:
- BNZ (branch if the result is not zero);
- BEZ (branch if the result equals zero).
This can achieve up to 75% success.
Predict by relative position: backward-pointing branches are predicted taken (usually a loop back); forward-pointing branches are predicted not taken (often a loop exit).

Dynamic Branch Prediction
Based on branch history: store information about branches in a branch-history table so as to predict the branch outcome more accurately, e.g., by assuming that the branch will do what it did last time.
One-bit predictor:
[State diagram: two states, Not Taken (0) and Taken (1); a taken branch (T) moves the predictor to Taken, a not-taken branch (N) moves it to Not Taken.]
This gives two errors per execution of a loop: in the history …NNTNNNNNNTNN…, each isolated T is mispredicted and so is the N that follows it.

Bimodal Prediction
Use 2-bit saturating counters to predict the most common direction; the first (most significant) bit of the counter gives the prediction. Branches evaluated as not taken (N) decrement the counter towards Strongly Not Taken, and branches evaluated as taken (T) increment it towards Strongly Taken. This tolerates a branch going in an unusual direction once: a loop-closing branch is mispredicted once rather than twice.
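The two dynamic predictors can be compared on the example history with a short sketch (the outcome string and the initial predictor states are my assumptions):

```python
def one_bit_errors(history, last="N"):
    """One-bit predictor: predict whatever the branch did last time."""
    errors = 0
    for outcome in history:        # "T" = taken, "N" = not taken
        if outcome != last:
            errors += 1            # the prediction (previous outcome) was wrong
        last = outcome
    return errors

def bimodal_errors(history, counter=0):
    """2-bit saturating counter, states 0..3; predict taken when >= 2."""
    errors = 0
    for outcome in history:
        taken = outcome == "T"
        if (counter >= 2) != taken:
            errors += 1
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return errors

history = "NNTNNNNNNTNN"           # the history from the slide
print(one_bit_errors(history))     # 4: each isolated T costs two mispredictions
print(bimodal_errors(history))     # 2: each isolated T costs only one
```

The counter saturates at Strongly Not Taken, so a single unusual T nudges it only to Not Taken and the prediction stays correct for the next N.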
[State diagram: Strongly Not Taken (00) - Not Taken (01) - Taken (10) - Strongly Taken (11); T increments the counter (+1), N decrements it (-1), saturating at both ends.]
For the history …NNTNNNNNNTNN…, each isolated T is thus mispredicted only once.

Summary
- Instruction execution can be substantially accelerated by instruction pipelining. A pipeline is organized as a succession of N stages; ideally, N instructions can be active inside the pipeline at the same time.
- Keeping a pipeline at its maximal rate is prevented by pipeline hazards: structural hazards are due to resource conflicts; data hazards are caused by data dependencies between instructions; control hazards are a consequence of branch instructions.
- Branches can dramatically affect pipeline performance, so it is very important to reduce the penalties they produce. (Dynamic) branch prediction is an efficient way to address this problem.