ENEE350 Lecture Notes - Weeks 14 and 15
Pipelining & Amdahl's Law

Pipelining is a method of processing in which a problem is divided into a number of sub-problems, and the solutions of the sub-problems for different instances of the problem are then overlapped.

Example: a[i] = b[i] + c[i] + d[i] + e[i] + f[i], i = 1, 2, 3, ..., n

[Figure: a linear pipeline of four adders, each with delay D; the streams f[i], e[i], d[i], c[i], b[i] enter through latches, and the results a[1], a[2], ... emerge from the last adder.]

Each adder has delay D.
Computation time = 4D + (n-1)D = nD + 3D
Speed-up = 4nD/(nD + 3D) -> 4 for large n.

We can describe the computation process in a linear pipeline algorithmically. There are three distinct phases to this computation: (a) filling the pipeline, (b) running the pipeline in the filled state until the last input arrives, and (c) emptying the pipeline.

(linear pipeline, n segments; m is the clock at which the last input arrives)

    while (1) {
        resetLatches();
        clock = 0;
        // fill the pipeline
        for (j = 0; j <= n-1; j++) {
            for (k = 0; k <= j; k++)
                segment(k);
            clock++;
        }
        // execute all segments until the last input arrives
        while (clock <= m) {
            for (j = 0; j <= n-1; j++)
                segment(j);
            clock++;
        }
        // empty the pipeline
        for (j = 0; j <= n-1; j++) {
            for (k = j; k <= n-1; k++)
                segment(k);
            clock++;
        }
    }

Instruction pipelines
Goal: (i) to increase the throughput (number of instructions/sec) in executing programs, and (ii) to reduce the execution time (clock cycles/instruction, etc.).

A three-stage instruction pipeline:

    clock   fetch   decode   execute
    0       I1
    1       I2      I1
    2       I3      I2       I1
    3       I4      I3       I2
    4               I4       I3

A five-stage instruction pipeline:

    clock   fetch   decode   execute   memory   write back
    0       I1
    1       I2      I1
    2       I3      I2       I1
    3       I4      I3       I2        I1
    4       I5      I4       I3        I2       I1

Speed-up of pipelined execution of instructions over a sequential execution:

    S(5) = (CPI_u x N_u x tau_u) / (CPI_p x N_p x tau_p),

where CPI is the average number of clock cycles per instruction, N is the instruction count, tau is the clock period, and the subscripts u and p refer to the unpipelined and pipelined machines. Assuming that the two systems operate at the same clock rate and execute the same number of instructions:

    S(5) = CPI_u / CPI_p.

Example: Suppose that the instruction mix of programs executed on the serial and pipelined machines is 40% ALU, 20% branching, and 40% memory, with 4, 2, and 4 cycles per instruction in the three classes, respectively. Then, under ideal conditions (no stalls due to hazards, so CPI_p = 1),

    CPI_u = 4 x 0.4 + 2 x 0.2 + 4 x 0.4 = 3.6, and S(5) = CPI_u / CPI_p = 3.6/1 = 3.6.

If the clock cycle time needs to be increased for the pipelined implementation, then the speed-up will have to be scaled down accordingly.

MIPS Pipeline

    Register operations:         IF  ID  EX  WB
    Register/memory operations:  IF  ID  EX  ME  WB

[Figure: instruction pipelines (Hennessy & Patterson).]

Hazards
1- Structural hazards
2- Data hazards
3- Control hazards

Structural Hazards: They arise when limited resources are scheduled to operate concurrently on different streams during the same clock period. Example: a memory conflict (data fetch + instruction fetch) or a datapath conflict (arithmetic operation + PC update). Without the conflict, the pipeline flows at one instruction per clock:

    Clock   IF   ID   EX   ME   WB
    0       I1
    1       I2   I1
    2       I3   I2   I1
    3       I4   I3   I2   I1
    4       I5   I4   I3   I2   I1
    5       I6   I5   I4   I3   I2
    6       I7   I6   I5   I4   I3

Fix: duplicate the hardware (too expensive), or stall the pipeline to serialize the conflicting operations (too slow). With stalls ("-" marks an idle stage):

    Clock   IF   ID   EX   ME   WB
    0       I1
    1       I2   I1
    2       -    I2   I1
    3       -    -    I2   I1
    4       I3   -    -    I2   I1
    5       I4   I3   -    -    I2
    6       -    I4   I3   -    -
    7       -    -    I4   I3   -
    8       I5   -    -    I4   I3
    9       I6   I5   -    -    I4

Speed-up = T_serial / T_pipeline
         = 5n t_s / (2n t_s + 2 t_s) for odd n,
         = 5n t_s / (2n t_s + 3 t_s) for even n,
which tends to 5/2 as the number of instructions, n, tends to infinity. Thus, we lose half the throughput to stalls.

Note: The pipeline time of execution can be computed using the recurrences

    T_1 = 4,
    T_i = T_{i-1} + 1 for even i,
    T_i = T_{i-1} + 3 for odd i,

where T_i is the clock (counting from 0) at which instruction i leaves the WB stage, so n instructions finish in T_n + 1 cycles. The sketch below iterates this recurrence.
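To see the 5/2 limit concretely, here is a minimal C sketch (not part of the original notes) that iterates the recurrence and prints the speed-up 5n/(T_n + 1) for growing n:

    #include <stdio.h>

    /* Iterate T(1) = 4, T(i) = T(i-1) + 1 for even i, T(i-1) + 3 for odd i.
       n instructions finish in T(n) + 1 clock cycles, so the speed-up over a
       5-cycle-per-instruction serial machine is 5n/(T(n) + 1). */
    int main(void) {
        long t = 4;                                   /* T(1) */
        for (long i = 2; i <= 1000; i++) {
            t += (i % 2 == 0) ? 1 : 3;
            if (i % 250 == 0)
                printf("n = %4ld   speed-up = %.3f\n", i, 5.0 * i / (t + 1));
        }
        return 0;                                     /* tends to 5/2 = 2.5 */
    }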
Data Hazards
They occur when the executions of two instructions may result in the incorrect reading of operands and/or writing of a result. There are three kinds:

    Read After Write (RAW) hazard  (data dependency)
    Write After Read (WAR) hazard  (data anti-dependency)
    Write After Write (WAW) hazard (data output dependency)

RAW Hazards
They occur when reads are early and writes are late.

    I1: R1 = R1 + R2
    I2: R3 = R1 + R2

    Clock   IF   ID   EX          ME   WB
    0       I1
    1       I2   I1
    2       I3   I2   I1
    3       I4   I3   I2 (read)   I1
    4       I5   I4   I3          I2   I1 (write)
    5       I6   I5   I4          I3   I2
    6       I7   I6   I5          I4   I3

I2 reads R1 (clock 3) before I1 writes it (clock 4), so I2 computes with the stale value of R1.

RAW Hazards (Cont'd)
They can be avoided by stalling the reads, but this increases the execution time. A better approach is to use data forwarding: the result of I1 is passed from its pipeline latch directly to I2's EX stage, without waiting for I1's write-back.

    I1: R1 = R1 + R2
    I2: R3 = R1 + R2

    Clock   IF   ID   EX                ME   WB
    0       I1
    1       I2   I1
    2       I3   I2   I1 (compute R1)
    3       I4   I3   I2 (forward R1)   I1
    4       I5   I4   I3                I2   I1 (write)
    5       I6   I5   I4                I3   I2
    6       I7   I6   I5                I4   I3

WAR Hazards
They occur when writes are early and reads are late, for example on a machine that issues the two operations of each instruction into two execution pipelines, the second running behind the first:

    I1: R2 = R2 + R3 ; R9 = R3 + R4
    I2: R3 = R7 + R5 ; R6 = R2 + R8

    Clock   IF   ID   EX   ME   WB           EX2         ME2   WB2
    0       I1
    1       I2   I1
    2       I3   I2   I1
    3       I4   I3   I2   I1
    4       I5   I4   I3   I2   I1
    5       I6   I5   I4   I3   I2 (write)   I1 (read)
    6       I7   I6   I5   I4   I3           I2          I1
    7                                                    I2    I1

At clock 5, I2 writes R3 while I1 is only then reading R3 for its second operation (R9 = R3 + R4), so I1 picks up the new value of R3 instead of the old one.

Branch Prediction in Pipelined Instruction Sequencing
One of the major issues in pipelined instruction processing is scheduling conditional branch instructions. When a pipeline controller encounters a conditional branch instruction, it has a choice to decode it into one of two instruction streams. If the branch condition is met, then execution continues from the target of the conditional branch instruction; otherwise, it continues with the instruction that follows the conditional branch instruction.

Example: Suppose that we execute the following assembly code on a 5-stage pipeline (IF, ID, EX, ME, WB):

          JCD R0 < 10, add;
          SUB R0,R1;
          JMP D, halt;
    add:  ADD R0,R1;
    halt: HLT;

If we assume that R0 < 10, then the SUB instruction would have been incorrectly fetched during the second clock cycle, and we will need another fetch cycle to fetch the ADD instruction.

Classification of branch prediction algorithms
Static branch prediction: The branch decision does not change over time -- we use a fixed branching policy.
Dynamic branch prediction: The branch decision does change over time -- we use a branching policy that varies over time.

Static Branch Prediction Algorithms
1- Don't predict (stall the pipeline)
2- Never take the branch
3- Always take the branch
4- Delayed branch

1- Stall the pipeline by 1 clock cycle: This allows us to determine the target of the branch instruction.

    JCD:         IF  ID  EX  ME  WB
    SUB or ADD:      --  IF  ID  EX  ME  WB    (stall one cycle and decide the branch)

Pipeline execution speed (stall case): Assuming only branch hazards, we can compute the average number of clock cycles per instruction (CPI) as

    CPI of the pipeline = CPI of ideal pipeline + idle cycles/instruction
                        = 1 + branch penalty x branch frequency
                        = 1 + branch frequency    (with a 1-cycle penalty).

In general, the CPI of the pipeline exceeds 1 + branch frequency because of data and possibly structural hazards.

Pros: Straightforward to implement.
Cons: The time overhead is high when the instruction mix includes a high percentage of branch instructions.

2- Never take the branch. The fall-through instruction in the pipeline is flushed if it is determined, after the ID stage is carried out, that the branch should have been taken.

    JCD:  IF  ID  EX  ME  WB
    SUB:      IF  ID  EX  ME  WB
    IOR:          IF  ID  EX  ME  WB    (branch not taken)
    XOR:          IF  ID  EX  ME  WB    (branch taken: SUB is flushed)

The SUB instruction is always fetched; then either the IOR instruction is executed next, or SUB is flushed and XOR is executed.

Pipeline execution speed (never-take case): Assuming only branch hazards, we can compute the average number of clock cycles per instruction (CPI) as

    CPI of the pipeline = CPI of ideal pipeline + idle cycles/instruction
                        = 1 + branch penalty x branch frequency x misprediction rate
                        = 1 + branch frequency x misprediction rate.

Pros: If the prediction is highly accurate, then the pipeline can operate close to its full throughput.
Cons: Implementation is not as straightforward, and it requires flushing if decoding the branch address takes more than 1 clock cycle.

The sketch below evaluates the stall and never-take CPI expressions side by side.
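A few lines of C make the two CPI expressions concrete; the 20% branch frequency, 10% misprediction rate, and 1-cycle penalty are illustrative values, not numbers from the notes:

    #include <stdio.h>

    int main(void) {
        double branch_freq = 0.20;    /* assumed fraction of branch instructions */
        double miss_rate   = 0.10;    /* assumed misprediction rate              */
        double penalty     = 1.0;     /* idle cycles charged per stalled branch  */

        /* 1- stall: every branch pays the penalty */
        double cpi_stall = 1.0 + penalty * branch_freq;
        /* 2- never take the branch: only mispredicted branches pay it */
        double cpi_never = 1.0 + penalty * branch_freq * miss_rate;

        printf("stall:             CPI = %.3f\n", cpi_stall);   /* 1.200 */
        printf("predict not taken: CPI = %.3f\n", cpi_never);   /* 1.020 */
        return 0;
    }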
3- Always take the branch. The instructions in the pipeline are flushed if it is determined, after the ID stage is carried out, that the branch should not have been taken.

    Clock:  0    1    2    3    4    5    6
    JCD:    IF   ID   EX   ME   WB              (address computation in progress)
    SUB:         IF                             (fall-through; flushed under "always take")
    XOR:              IF   ID   EX   ME   WB    (fetched from the target)

If the branch turns out not to be taken, the fall-through path (SUB, IOR, ...) must be refetched instead, at an additional cost of one cycle.

Pipeline execution speed (always-take case): Assuming only branch hazards, we can compute the average number of clock cycles per instruction (CPI) as

    CPI of the pipeline = CPI of ideal pipeline + idle cycles/instruction
                        = 1 + branch penalty x branch frequency x correct-prediction rate
                            + branch penalty x branch frequency x misprediction rate
                        = 1 + branch frequency x correct-prediction rate
                            + 2 x branch frequency x misprediction rate,

since even a correctly predicted branch costs 1 idle cycle while the target is fetched, and a mispredicted one costs 2.

Pros: Better suited for the execution of loops without the compiler's intervention (but this can generally be overcome; see the example below).
Cons: Implementation is not as straightforward, and it has a higher misprediction penalty. It is not as advantageous as never taking the branch, since the branch target address computation is not completed until after the EX segment is carried out.

Example: for (i = 0; i < 10; i++) a[i] = a[i] + 1;

"Branch always" will not work well without the compiler's help:

          CLR R0;
    loop: JCD R0 >= 10, exit;
          LDD R1,R0;
          ADD R1,1;
          ST+ R1,R0;
          JMP D, loop;
    exit: ...

"Branch always" will work well without the compiler's help when the loop test is at the bottom:

          CLR R0;
    loop: LDD R1,R0;
          ADD R1,1;
          ST+ R1,R0;
          JCD R0 < 10, loop;

4- Delayed branch: Insert an instruction after a branch instruction, and always execute it whether or not the branch condition applies. Of course, this must be an instruction that can be executed without any side effects on the correctness of the program.

Pros: The pipeline is never stalled or flushed, and with the correct choice of delay-slot instruction, performance can approach that of an ideal pipeline.
Cons: It is not always possible to find a delay-slot instruction, in which case a NOP instruction may have to be inserted into the delay slot to make sure that the program's integrity is not violated. It makes compilers work harder.

Which instruction should be placed into the delayed branch slot?

4.1- Choose an instruction from before the branch, but make sure that the branch does not depend on the moved instruction. If such an instruction can be found, this always pays off.

Example:

    ADD R1,R2;
    JCD R2 > 10, exit;

can be rescheduled as

    JCD R2 > 10, exit;
    ADD R1,R2;          (delay slot)

4.2- Choose an instruction from the target of the branch, but make sure that the moved instruction is executable (that is, harmless) when the branch is not taken.

Example:

         ADD R1,R2;
         JCD R2 > 10, sub;
         JMP D, add;
         ...
    sub: SUB R4,R5;
    add: ADI R3,5;

can be rescheduled as

         ADD R1,R2;
         JCD R2 > 10, sub;
         SUB R4,R5;          (delay slot, moved from the target)
         JMP D, add;
         ...
    sub: ...                 (the code that followed the original SUB)
    add: ADI R3,5;
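The test behind strategy 4.1 is a simple register-overlap check: the branch must not read the register written by the instruction being moved into the slot. A toy C sketch of this check (the Insn type and all names here are invented for illustration):

    #include <stdio.h>
    #include <string.h>

    /* A toy three-address instruction: dest = src1 op src2; "" = unused. */
    typedef struct { const char *dest, *src1, *src2; } Insn;

    /* Strategy 4.1: the instruction just before the branch may be moved into
       the delay slot only if the branch condition does not read its result. */
    int can_fill_delay_slot(const Insn *moved, const Insn *branch) {
        return strcmp(moved->dest, branch->src1) != 0 &&
               strcmp(moved->dest, branch->src2) != 0;
    }

    int main(void) {
        Insn add = { "R1", "R1", "R2" };   /* ADD R1,R2                      */
        Insn jcd = { "",   "R2", ""   };   /* JCD R2 > 10,...  reads only R2 */
        printf("ADD R1,R2 may fill the delay slot: %s\n",
               can_fill_delay_slot(&add, &jcd) ? "yes" : "no");   /* yes */
        return 0;
    }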
4.3- Choose an instruction from the anti-target (fall-through path) of the branch, but make sure that the moved instruction is executable when the branch is taken.

Example:

          ADD R1,R2;
          JCD R2 > 10, exit;
          NOP;               (empty delay slot)
          ADD R3,R2;         (anti-target instruction)
          ...
    exit: SUB R4,R5;

can be rescheduled as

          ADD R1,R2;
          JCD R2 > 10, exit;
          ADD R3,R2;         (delay slot; schedule it for execution only if it does not alter the program flow or output when the branch is taken)
          ...
    exit: SUB R4,R5;

Dynamic Branch Prediction
-- Dynamic branch prediction relies on the history of how branch conditions were resolved in the past.
-- The history of branches is kept in a buffer. To keep this buffer reasonably small and easy to access, it is indexed by some fixed number of lower-order bits of the address of the branch instruction.
-- The assumption is that the lower-order address bits are unique enough to prevent frequent collisions or overrides. Thus, if we are trying to predict branches in a program that remains within a block of 256 locations, 8 bits should suffice.

[Figure: branch instructions (JCD) at addresses x through x+256 index the prediction buffer through their lower-order address bits.]

Branch instructions in the instruction cache include a branch prediction field that is used to predict whether the branch should be taken:

    Memory location   Program              Branch prediction field
    x                 branch instruction   0 (branch was not taken)
    x+4               ...
    x+8               branch instruction   0 (branch was not taken)
    x+12              ...
    x+16              branch instruction   1 (branch was taken)
    x+20              ...

Branch prediction: In the simplest case, the field is a 1-bit tag:

    0 <=> the branch was not taken last time (state A)
    1 <=> the branch was taken last time (state B)

State diagram: a not-taken branch keeps the predictor in (or returns it to) state A, and a taken branch keeps it in (or moves it to) state B:

    A --taken--> B,   B --not taken--> A,   A --not taken--> A,   B --taken--> B.

While in state A, predict the branch as "not to be taken"; while in state B, predict it as "to be taken".

This works relatively well: it accurately predicts the branches of a loop in all but two of the iterations.

          CLR R0;
    loop: LDD R1,R0;
          ADD R1,1;
          ST+ R1,R0;
          JCD R0 < 10, loop;

Assuming that we begin in state A, prediction fails when R0 = 1 (the branch is not taken when it should be) and when R0 = 10 (the branch is taken when it should not be).
Assuming that we begin in state B, prediction fails only when R0 = 10 (the branch is taken when it should not be).
We can modify the loop to make the branch prediction algorithm fail twice when we begin in state B as well, as shown after the sketch below.
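A short C simulation (a sketch, not part of the notes) replays the 1-bit predictor on the ten outcomes of the loop branch JCD R0 < 10, loop (taken for R0 = 1..9, not taken for R0 = 10), starting from either state:

    #include <stdio.h>

    int main(void) {
        for (int start = 0; start <= 1; start++) {   /* 0 = state A, 1 = state B */
            int state = start, misses = 0;
            for (int r0 = 1; r0 <= 10; r0++) {
                int taken = (r0 < 10);               /* JCD R0 < 10, loop */
                if (state != taken) misses++;        /* predict taken iff in state B */
                state = taken;                       /* 1 bit: remember last outcome */
            }
            printf("start in state %c: %d mispredictions\n",
                   start ? 'B' : 'A', misses);       /* A: 2, B: 1 */
        }
        return 0;
    }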
          CLR R0;
    loop: LDD R1,R0;
          ADD R1,1;
          ST+ R1,R0;
          JCD R0 >= 10, exit;
          JMP D, loop;
    exit: ...

Assuming that we begin in state B, prediction fails when R0 = 1 (the branch is taken when it should not be) and when R0 = 10 (the branch is not taken when it should be).

What is worse is that we can make this branch prediction algorithm fail every time it makes a prediction:

          LDI R0,1;
    loop: JCD R0 > 0, neg;
          LDI R0,1;
          JMP D, loop;
    neg:  LDI R0,-1;
          JMP D, loop;

Assuming that we begin in state A, prediction fails when

    R0 = 1  (the branch is not taken when it should be),
    R0 = -1 (the branch is taken when it should not be),
    R0 = 1  (the branch is not taken when it should be),
    R0 = -1 (the branch is taken when it should not be),

and so on.

2-bit prediction (a more reluctant flip in decision)
The predictor now has four states: while in states A1 and A2, predict the branch as "not to be taken"; while in states B1 and B2, predict it as "to be taken". Each taken branch moves the predictor one state along A1 -> A2 -> B2 -> B1 (a taken branch in B1 stays in B1), and each not-taken branch moves it one state along B1 -> B2 -> A2 -> A1 (a not-taken branch in A1 stays in A1). The predicted decision is therefore reversed only after two consecutive misses.

          CLR R0;
    loop: LDD R1,R0;
          ADD R1,1;
          ST+ R1,R0;
          JCD R0 < 10, loop;

Assuming that we begin in state A1, prediction fails when R0 = 1 and 2 (the branch is not taken when it should be) and when R0 = 10 (the branch is taken when it should not be).
Assuming that we begin in state B1, prediction fails only when R0 = 10 (the branch is taken when it should not be).

2-bit predictors are more resilient to branch inversions (predictions are reversed only when they are missed twice):

          LDI R0,1;
    loop: JCD R0 > 0, neg;
          LDI R0,1;
          JMP D, loop;
    neg:  LDI R0,-1;
          JMP D, loop;

Assuming that we begin in state B1, prediction

    succeeds when R0 = 1  (the branch is taken when it should be),
    fails    when R0 = -1 (the branch is taken when it should not be),
    succeeds when R0 = 1  (the branch is taken when it should be),
    fails    when R0 = -1 (the branch is taken when it should not be),

and so on.

Amdahl's Law (Fixed-Load Speed-up)
Let q be the fraction of a load L that cannot be sped up by introducing more processors, and let T(p) be the amount of time it takes to execute L on p processors with a linear work function, p > 1. Then

    S(p) = T(1)/T(p) = T(1) / (q T(1) + (1-q) T(1)/p) = 1 / (q + (1-q)/p) -> 1/q as p -> infinity.

All this means is that the maximum speed-up of a system is limited by the fraction of the work that must be completed sequentially: the execution time using p processors can be reduced to q T(1) under the best of circumstances, and the speed-up cannot exceed 1/q.

Example: A 4-processor computer executes instructions that are fetched from a random-access memory over a shared bus. The task to be performed is divided into two parts:

    1. Fetch instruction (serial part) - it takes 30 microseconds.
    2. Execute instruction (parallel part) - it takes 10 microseconds.

Hence q = 30/(30+10) = 0.75, and

    S(4) = T(1)/T(4) = 1/(0.75 + 0.25/4) = 4/3.25 = 1.23.

Now, suppose that the number of processors is doubled. Then

    S(8) = T(1)/T(8) = 1/(0.75 + 0.25/8) = 8/6.25 = 1.28.

Suppose that the number of processors is doubled again. Then

    S(16) = T(1)/T(16) = 1/(0.75 + 0.25/16) = 16/12.25 = 1.31.

What is the limit?

    S(p) = T(1)/T(p) = 1/(0.75 + 0.25/p) -> 1/0.75 = 1.333 as p -> infinity.

Alternate Forms of Amdahl's Law
If the fraction 1-q of the computation can be enhanced by a factor of s, then

    S = T(1) / (T_unenhanced + T_enhanced) = T(1) / (T(1) (q + (1-q)/s)) = 1 / (q + (1-q)/s) -> 1/q as s -> infinity,

where s is the speed-up of the part of the computation that can be enhanced.
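The fixed-load form is easy to tabulate; this C fragment re-derives the shared-bus numbers above with q = 0.75:

    #include <stdio.h>

    /* Amdahl's Law: S(p) = 1/(q + (1-q)/p). */
    double speedup(double q, double p) { return 1.0 / (q + (1.0 - q) / p); }

    int main(void) {
        double q = 0.75;                          /* serial fraction: 30/(30+10) */
        for (int p = 4; p <= 16; p *= 2)
            printf("S(%2d) = %.3f\n", p, speedup(q, p));  /* 1.231, 1.280, 1.306 */
        printf("limit = %.3f\n", 1.0 / q);                /* 1.333 */
        return 0;
    }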
Example: Suppose that you've upgraded your computer from a 2 GHz processor to a 4 GHz processor. What is the maximum speed-up you can expect in executing a typical program, assuming that (1) the speed of fetching each instruction is directly proportional to the speed of reading an instruction from the primary memory of your computer, and reading an instruction takes four times longer than executing it, and (2) the speed of executing each instruction is directly proportional to the clock speed of the processor of your computer?

Since fetching takes four times as long as executing, the fraction q = 4/5 = 0.8 of the time is unaffected by the upgrade, and the remaining fraction is enhanced by s = 2. Using Amdahl's Law with q = 0.8 and s = 2:

    S = 1 / (0.8 + 0.2/2) = 1/0.9 = 1.11.

Very disappointing, as you are likely to have paid quite a bit of money for the upgrade!

Generalized Amdahl's Law
In general, a task may be partitioned into a set of subtasks, with each subtask requiring a designated number of processors to execute. In this case, the speed-up of the parallel execution of the task over its sequential execution can be characterized by the following, more general formula:

    S(p_1, p_2, ..., p_k) = T(1) / T(p_1, p_2, ..., p_k)
                          = T(1) / (q_1 T(1)/p_1 + q_2 T(1)/p_2 + ... + q_k T(1)/p_k)
                          = 1 / (q_1/p_1 + q_2/p_2 + ... + q_k/p_k),

where q_1 + q_2 + ... + q_k = 1. When k = 2, q_1 = q, q_2 = 1-q, p_1 = 1, and p_2 = p, this formula reduces to Amdahl's Law.

Remark: The generalized Amdahl's Law can also be rewritten to express the speed-up due to different amounts of speed enhancement s_i applied to different parts of a system:

    S_e(s_1, s_2, ..., s_k) = T(1) / T(s_1, s_2, ..., s_k)
                            = T(1) / (q_1 T(1)/s_1 + q_2 T(1)/s_2 + ... + q_k T(1)/s_k)
                            = 1 / (q_1/s_1 + q_2/s_2 + ... + q_k/s_k),

where q_1 + q_2 + ... + q_k = 1.

Example: Suppose that your computer executes a program that has the following profile of execution: (a) 30% integer operations, (b) 20% floating-point operations, (c) 50% memory-reference instructions. How much speed-up will you expect if you double the speed of the floating-point unit of your computer? Using the formula above:

    S_e = 1 / (0.3 + 0.2/2 + 0.5) = 1/0.9 = 1.11.

Example: Suppose that you have a fixed budget of $500 to upgrade each of the computers in your laboratory, and you find out that the computations you perform on your computers require (a) 40% integer operations and (b) 60% floating-point operations. If every dollar spent on the integer unit after the first $50 decreases its execution time by 2%, and every dollar spent on the floating-point unit after the first $100 decreases its execution time by 1%, how would you spend the $500?

Example (continued): After the fixed costs, 500 - 50 - 100 = 350 dollars remain to be allocated, so

    S = T(1) / (T_i(x_1) + T_f(x_2)),   where x_1 + x_2 = 350,
    T_i(x_1) = (1 - 0.02) T_i(x_1 - 1), so T_i(x_1) = 0.98^x_1 T_i(0),
    T_f(x_2) = (1 - 0.01) T_f(x_2 - 1), so T_f(x_2) = 0.99^x_2 T_f(0),
    T_i(0) = 0.4 T(1),  T_f(0) = 0.6 T(1).

Substituting these into the generalized Amdahl speed-up expression gives

    S = T(1) / (0.4 x 0.98^x_1 T(1) + 0.6 x 0.99^x_2 T(1)) = 1 / (0.4 x 0.98^x_1 + 0.6 x 0.99^x_2).

Example (continued): So we maximize

    1 / (0.4 x 0.98^x_1 + 0.6 x 0.99^x_2) subject to x_1 + x_2 = 350,

or, equivalently, maximize

    1 / (0.4 x 0.98^x_1 + 0.6 x 0.99^(350 - x_1)) subject to 0 <= x_1 <= 350.

Example (continued): Computing the values in the neighborhood of 120 reveals that the speed-up is maximized when x_1 = 126, i.e., spend $50 + $126 = $176 on the integer unit and $100 + $224 = $324 on the floating-point unit. From Mathematica:

    Table[1/(0.4*0.98^x + 0.6*0.99^(350 - x)), {x, 120, 128, 1}]
    {10.5398, 10.5518, 10.5616, 10.5691, 10.5744, 10.5776, 10.5785, 10.5773, 10.574}

Note: It is possible to have a higher speed-up with all of the money invested in one of the units if the fixed cost of one of the units becomes sufficiently large.

Addendum: If the changes in performance due to upgrades are specified in terms of speed rather than time, we can use the following formulation. For a load L processed at speed s, the execution time is t = L/s. If one upgrade dollar changes the speed by the fraction delta_s (from s to (1 + delta_s) s), then the new execution time is

    t' = L / ((1 + delta_s) s) = t / (1 + delta_s),

so the execution time after x upgrade dollars satisfies

    T(x) = T(x-1) / (1 + delta_s),

where delta_s denotes the fractional (percentage) change in speed per dollar spent.
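As a cross-check on the budget example above, a brute-force C search over every split of the remaining $350 reproduces the optimum found with Mathematica (a sketch; compile with -lm):

    #include <math.h>
    #include <stdio.h>

    /* Maximize S(x1) = 1/(0.4*0.98^x1 + 0.6*0.99^(350-x1)) for 0 <= x1 <= 350. */
    int main(void) {
        int best_x1 = 0;
        double best_s = 0.0;
        for (int x1 = 0; x1 <= 350; x1++) {
            double s = 1.0 / (0.4 * pow(0.98, x1) + 0.6 * pow(0.99, 350 - x1));
            if (s > best_s) { best_s = s; best_x1 = x1; }
        }
        printf("x1 = %d, x2 = %d, S = %.4f\n",
               best_x1, 350 - best_x1, best_s);     /* x1 = 126, S = 10.5785 */
        return 0;
    }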