Chapter 4: Exploiting Instruction-Level Parallelism with Software Approaches
Rung-Bin Lin

Basic Compiler Techniques for Exposing ILP
– Basic pipeline scheduling and loop unrolling
  • To keep a pipeline full, parallelism among instructions must be exploited by finding sequences of unrelated instructions that can be overlapped in the pipeline.
  • A compiler's ability to perform this kind of scheduling depends on both the amount of ILP available in the program and the latencies of the functional units in the pipeline.
  • To avoid a pipeline stall, a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction.

Scheduling and Loop Unrolling
– Basic assumptions:
  • The latencies of the FP unit

      Inst. producing result    Inst. using result    Latency (clock cycles)
      FP ALU op                 Another FP ALU op     3
      FP ALU op                 Store double          2
      Load double               FP ALU op             1
      Load double               Store double          0

  • The branch delay of the pipeline implementation is 1 delay slot.
  • The functional units are fully pipelined or replicated such that no structural hazards can occur.

Loop Unrolling by Compilers
– Example:
      for (j = 1; j <= 1000; j++)
          x[j] = x[j] + s;
  • Assume R1 initially holds the address of the array element with the highest address and 8(R2) is the address of the last element to operate on.
      Loop: L.D     F0, 0(R1)
            ADD.D   F4, F0, F2
            S.D     F4, 0(R1)
            DADDUI  R1, R1, #-8
            BNE     R1, R2, Loop
– Performance of scheduled code with loop unrolling.
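The unrolling transformation applied to this loop can be sketched at the source level. The following is a minimal C sketch, not the compiler's actual output: the function names are hypothetical, the unroll factor of four is an assumption chosen for illustration, and the trip count is assumed to be a multiple of four (as 1000 is).

```c
#include <assert.h>
#include <stddef.h>

/* Rolled loop: one element per iteration, walking down from the
   highest index as the MIPS loop does. (Hypothetical name.) */
void add_scalar_rolled(double *x, size_t n, double s) {
    for (size_t j = n; j > 0; j--) {
        x[j - 1] += s;
    }
}

/* Unrolled by four: one counter update and one branch per four
   elements. Assumes n is a multiple of 4, so no clean-up loop is
   needed. (Hypothetical name; unroll factor is an assumption.) */
void add_scalar_unrolled4(double *x, size_t n, double s) {
    assert(n % 4 == 0);
    for (size_t j = n; j > 0; j -= 4) {
        x[j - 1] += s;
        x[j - 2] += s;
        x[j - 3] += s;
        x[j - 4] += s;
    }
}
```

Both functions compute the same result; the unrolled version executes a quarter of the loop-overhead instructions and gives the scheduler four independent load/add/store chains to interleave.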

Performance of Unscheduled Code without Loop Unrolling
                                      Clock cycle issued
  Loop: L.D     F0, 0(R1)              1
        stall                          2
        ADD.D   F4, F0, F2             3
        stall                          4
        stall                          5
        S.D     F4, 0(R1)              6
        DADDUI  R1, R1, #-8            7
        stall                          8
        BNE     R1, R2, Loop           9
        stall                         10
– Need 10 cycles per result

Performance of Scheduled Code without Loop Unrolling
  Loop: L.D     F0, 0(R1)
        DADDUI  R1, R1, #-8
        ADD.D   F4, F0, F2
        stall
        BNE     R1, R2, Loop      ; delayed branch
        S.D     F4, 8(R1)
– Need 6 cycles per result

Performance of Unscheduled Code with Loop Unrolling
• Unroll the loop 4 times
  Loop: L.D     F0, 0(R1)
        ADD.D   F4, F0, F2
        S.D     F4, 0(R1)
        L.D     F6, -8(R1)
        ADD.D   F8, F6, F2
        S.D     F8, -8(R1)
        L.D     F10, -16(R1)
        ADD.D   F12, F10, F2
        S.D     F12, -16(R1)
        L.D     F14, -24(R1)
        ADD.D   F16, F14, F2
        S.D     F16, -24(R1)
        DADDUI  R1, R1, #-32
        BNE     R1, R2, Loop
– Needs 7 cycles per result (28 cycles for 4 results)

Performance of Scheduled Code with Loop Unrolling
  Loop: L.D     F0, 0(R1)
        L.D     F6, -8(R1)
        L.D     F10, -16(R1)
        L.D     F14, -24(R1)
        ADD.D   F4, F0, F2
        ADD.D   F8, F6, F2
        ADD.D   F12, F10, F2
        ADD.D   F16, F14, F2
        S.D     F4, 0(R1)
        S.D     F8, -8(R1)
        DADDUI  R1, R1, #-32
        S.D     F12, 16(R1)       ; 16 - 32 = -16
        BNE     R1, R2, Loop
        S.D     F16, 8(R1)        ; 8 - 32 = -24
• Need 3.5 cycles per result (14 cycles for 4 results)

Using Loop Unrolling and Pipeline Scheduling with Static Multiple Issue
• Fig. 4.2 on page 313

Static Branch Prediction
– To schedule code effectively, for example to fill a branch delay slot, the compiler must statically predict the behavior of branches.
– Static branch prediction used in a compiler
          LD     R1, 0(R2)
          DSUBU  R1, R1, R3
          BEQZ   R1, L
          OR     R4, R5, R6
          DADDU  R10, R4, R3
      L:  DADDU  R7, R8, R9
– If the BEQZ is almost always taken and the value of R7 is not needed on the fall-through path, DADDU R7, R8, R9 can be moved to the position after the LD.
– If the BEQZ is rarely taken and the value of R4 is not needed on the taken path, the OR can be moved to the position after the LD.

Branch Behavior in Programs
– Program behavior
  • Average frequency of taken branches: 67%
    – 60% of the forward branches are taken.
    – 85% of the backward branches are taken.
– Methods for static branch prediction
  • By examination of the program behavior
    – Predict-taken (misprediction rate: 9% to 59%).
    – Predict forward branches as untaken and backward branches as taken.
    – Taken together, these two approaches have a misprediction rate of 30% to 40%.
  • By the use of profile information collected from earlier runs of the program.

Misprediction Rate for a Profile-Based Predictor

Comparison between Profile-Based and Predict-Taken

The Basic VLIW Approach
• VLIW uses multiple, independent functional units.
• Multiple, independent instructions are issued by processing one large instruction package that consists of multiple operations.
• A VLIW instruction might include one integer/branch instruction, two memory references, and two floating-point operations.
  – If each operation requires a 16- to 24-bit field, an instruction with seven such fields is 112 to 168 bits long (7 × 16 = 112, 7 × 24 = 168); the five operations listed above would need 80 to 120 bits.
• Performance of VLIW

Scheduling of VLIW Instructions
• Fig. 4.5 on page 318

Limitations to VLIW Implementation
• Limitations
  – Technical problem
    • Generating enough straight-line code requires ambitious loop unrolling, which increases code size.
  – Poor code density
    • Whenever an instruction is not full, the unused functional units translate into wasted bits in the instruction encoding (instructions are only about 60% full).
  – Logistical problem
    • Binary code compatibility; it depends on
      – the instruction set definition, and
      – the detailed pipeline structure, including both the functional units and their latencies.
• Advantages of a superscalar processor over a VLIW processor
  – Little impact on code density.
  – Even unscheduled programs, or those compiled for older implementations, can be run.

Advanced Compiler Support for Exposing and Exploiting ILP
– Exploiting loop-level parallelism
  • Converting loop-level parallelism into ILP
– Software pipelining (symbolic loop unrolling)
– Global code scheduling

Loop-Level Parallelism
– Concepts and techniques
  • Loop-level parallelism is normally analyzed at the source level, while most ILP analysis is done once the instructions have been generated by the compiler.
  • The analysis of loop-level parallelism focuses on determining whether data accesses in later iterations are data dependent on data values produced in earlier iterations.
  • Example:
        for (i = 1; i <= 1000; i++)
            x[i] = x[i] + s;
  • Loop-carried data dependence: a dependence that exists between different iterations of the loop.
  • A loop is parallel unless there is a cycle in the dependences. Therefore, a loop-carried data dependence that is not part of a cycle can be eliminated by code transformation.
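One way to see what "no loop-carried dependence" means is that the iterations may run in any order. The following is a minimal C sketch under that reading; the function names are hypothetical and the recurrence is included only as a contrasting case. Reversing the iteration order leaves the result of the x[i] = x[i] + s loop unchanged, while reversing a recurrence changes its result.

```c
#include <assert.h>
#include <stddef.h>

/* No loop-carried dependence: iteration i reads and writes only x[i],
   so forward and backward iteration orders give the same result. */
void add_scalar_forward(double *x, size_t n, double s) {
    assert(x != NULL);
    for (size_t i = 0; i < n; i++) x[i] = x[i] + s;
}

void add_scalar_backward(double *x, size_t n, double s) {
    assert(x != NULL);
    for (size_t i = n; i > 0; i--) x[i - 1] = x[i - 1] + s;
}

/* Recurrence: iteration i reads y[i-1], which iteration i-1 wrote.
   This loop-carried dependence forces the forward order. */
void recurrence_forward(double *y, size_t n) {
    assert(y != NULL);
    for (size_t i = 1; i < n; i++) y[i] = y[i - 1] + y[i];
}

/* Same statement executed in reverse order: each iteration now sees
   the OLD value of its left neighbour, so the result differs. */
void recurrence_backward(double *y, size_t n) {
    assert(y != NULL);
    for (size_t i = n; i > 1; i--) y[i - 1] = y[i - 2] + y[i - 1];
}
```

Order independence is only an informal check, not a dependence analysis; it illustrates why the first loop can be unrolled or run in parallel freely while the second cannot.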

Loop-Carried Data Dependence (1)
• Example
      for (I = 1; I <= 100; I = I + 1) {
          A[I+1] = A[I] + C[I];      /* S1 */
          B[I+1] = B[I] + A[I+1];    /* S2 */
      }
  – Dependence graph: S1 -> S1 and S2 -> S2 are loop-carried (each statement uses the value it computed in the previous iteration); S1 -> S2 is within an iteration (S2 uses A[I+1] computed by S1).

Loop-Carried Data Dependence (2)
• Example
      for (I = 1; I <= 100; I = I + 1) {
          A[I] = A[I] + B[I];        /* S1 */
          B[I+1] = C[I] + D[I];      /* S2 */
      }
  – Dependence graph: S2 -> S1 is loop-carried (S1 uses B[I], computed by S2 in the previous iteration); there is no cycle.
  – Code transformation
      A[1] = A[1] + B[1];
      for (I = 1; I <= 99; I = I + 1) {
          B[I+1] = C[I] + D[I];          /* S2 */
          A[I+1] = A[I+1] + B[I+1];      /* S1 */
      }
      B[101] = C[100] + D[100];
  – The transformation converts the loop-carried data dependence into a data dependence within a single iteration.

Loop-Carried Data Dependence (3)
• True loop-carried data dependences are usually in the form of a recurrence.
      for (I = 2; I <= 100; I++) {
          Y[I] = Y[I-1] + Y[I];
      }
• Even a loop with a true loop-carried data dependence can contain parallelism.
      for (I = 6; I <= 100; I++) {
          Y[I] = Y[I-5] + Y[I];
      }
  – The dependence distance is 5, so any five consecutive iterations (the first through the fifth, for example) are independent of one another and can run in parallel.

Detecting and Eliminating Dependencies
• Finding the dependences in a program is an important part of three tasks:
  – Good scheduling of code,
  – Determining which loops might contain parallelism, and
  – Eliminating name dependences.
• Example
      for (i = 1; i <= 100; i++) {
          A[i] = B[i] + C[i];
          D[i] = A[i] + E[i];
      }
• There is no loop-carried dependence, which implies a large amount of parallelism.

Dependence Detection Problem
• Determining whether a dependence exists is NP-complete in general.
• GCD test heuristic
  – Suppose the loop stores to an array element with index a*j+b and loads from the same array with index c*k+d, where j and k are values of the for-loop index variable, which runs from m to n.
    A dependence exists if two conditions hold:
  – There are two iteration indices, j and k, both within the limits of the for loop.
  – The loop stores into an array element indexed by a*j+b and later fetches from that same array element when it is indexed by c*k+d. That is, a*j+b = c*k+d.
    » Note: a, b, c, and d may not be known at compile time, which makes it impossible to tell whether a dependence exists.
  – A simple test that is sufficient to show the absence of a dependence: if a loop-carried dependence exists, then GCD(c,a) must divide (d-b). That is, if GCD(c,a) does not divide (d-b), no dependence is possible (Example on page 324).

Situations where Dependence Analysis Fails
– When objects are referenced via pointers rather than array indices.
– When array indexing is indirect through another array.
– When a dependence may exist for some values of the inputs but does not exist in actuality.
– Others.

Eliminating Dependent Computations
• Copy propagation
      DADDUI  R1, R2, #4
      DADDUI  R1, R1, #4
  to
      DADDUI  R1, R2, #8
• Tree height reduction
      ADD  R1, R2, R3
      ADD  R4, R1, R6
      ADD  R8, R4, R7
  to
      ADD  R1, R2, R3
      ADD  R4, R6, R7
      ADD  R8, R1, R4

Software Pipelining: Symbolic Loop Unrolling
– Software pipelining is a technique for reorganizing loops such that each iteration in the software-pipelined code is made from instructions chosen from different iterations of the original loop.
– A software-pipelined loop interleaves instructions from different loop iterations without unrolling the loop.
– A software-pipelined loop consists of a loop body, start-up code, and clean-up code.

Example
Original loop:
  Loop: L.D     F0, 0(R1)
        ADD.D   F4, F0, F2
        S.D     F4, 0(R1)
        DADDUI  R1, R1, #-8
        BNE     R1, R2, Loop
Iterations of the original loop:
  Iteration i:    L.D F0, 0(R1)    ADD.D F4, F0, F2    S.D F4, 0(R1)
  Iteration i+1:  L.D F0, 0(R1)    ADD.D F4, F0, F2    S.D F4, 0(R1)
  Iteration i+2:  L.D F0, 0(R1)    ADD.D F4, F0, F2    S.D F4, 0(R1)
Reorganized loop:
  Loop: S.D     F4, 16(R1)        ; stores into M[i]
        ADD.D   F4, F0, F2        ; adds to M[i-1]
        L.D     F0, 0(R1)         ; loads M[i-2]
        DADDUI  R1, R1, #-8
        BNE     R1, R2, Loop

Comparison between Software Pipelining and Loop Unrolling
– Software pipelining consumes less code space.
– Loop unrolling reduces the overhead of the loop, that is, the branch and counter-update code.
– Software pipelining reduces the time when the loop is not running at peak speed to once per loop, at the beginning and at the end.

Global Code Scheduling

Trace Scheduling: Focusing on Critical Path
• Trace selection
• Trace compaction
• Bookkeeping code

Hardware Support for Exposing More Parallelism at Compile Time
– The difficulty of uncovering more ILP at compile time (due to unknown branch behavior) can be overcome by employing the following techniques:
  • Conditional or predicated instructions
  • Speculation
    – Static speculation performed by the compiler with hardware support.
    – Dynamic speculation performed by hardware, using branch prediction to guide the speculation process.

Conditional or Predicated Instructions
– Basic concept
  • An instruction refers to a condition, which is evaluated as part of the instruction execution. If the condition is true, the instruction is executed normally; otherwise, execution continues as if the instruction were a no-op.
  • A conditional instruction allows us to convert the control dependence present in a branch-based code sequence into a data dependence.
– A conditional instruction can be used to speculatively move an instruction that is time critical.
– To use a conditional instruction successfully, as in the examples, we must ensure that the speculated instruction does not introduce an exception.

Conditional Move
• Example on page 341

On Time Critical Path
• Example on pages 342 and 343

Example (Cont.)

Limiting Factors
• The usefulness of conditional instructions is limited by several factors:
  – Conditional instructions that are annulled still take execution time.
  – Conditional instructions are most useful when the condition can be evaluated early.
  – The use of conditional instructions is limited when the control flow involves more than a simple alternative sequence.
  – Conditional instructions may have some speed penalty compared with unconditional instructions.
• Machines that use conditional instructions
  – Alpha: conditional move;
  – HP PA: any register-register instruction;
  – SPARC: conditional move;
  – ARM: all instructions.
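The conversion of a control dependence into a data dependence described in this section can be sketched in C. This is a minimal model of conditional-move semantics for the statement "if (a == 0) s = t;", not the code any particular compiler or machine produces; the function names are hypothetical, and the mask-based select is only one assumed way to express "always compute, conditionally write" in portable C.

```c
#include <assert.h>
#include <stdint.h>

/* Branch version: the assignment is control dependent on the test. */
int64_t select_with_branch(int64_t a, int64_t s, int64_t t) {
    if (a == 0) {
        s = t;
    }
    return s;
}

/* Conditional-move model: no branch. The result depends on a only
   through data (the mask), so the control dependence has become a
   data dependence. */
int64_t select_with_cmov(int64_t a, int64_t s, int64_t t) {
    int64_t mask = -(int64_t)(a == 0);   /* all ones if a == 0, else 0 */
    return (t & mask) | (s & ~mask);
}

/* Hypothetical helper: checks that both versions agree for one input. */
void check_select(int64_t a, int64_t s, int64_t t) {
    assert(select_with_branch(a, s, t) == select_with_cmov(a, s, t));
}
```

Note that the branch-free version always reads t, which mirrors the first limiting factor above: the work is done even when the condition turns out to be false.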

Compiler Speculation with Hardware Support
• In moving instructions across a branch, the compiler must ensure that the exception behavior is not changed and that the dynamic data dependences remain the same.
  – In the simplest case, the compiler is conservative about which instructions it speculatively moves, and the exception behavior is unaffected.
• Four methods
  – The hardware and OS cooperatively ignore exceptions for speculative instructions.
  – Speculative instructions that never raise exceptions are used, and checks are introduced to determine when an exception should occur.
  – Poison bits are attached to the result registers written by speculated instructions when those instructions cause exceptions.
  – The instruction results are buffered until it is certain that the instruction is no longer speculative.

Types of Exceptions
• Two types of exceptions need to be distinguished:
  – Exceptions that indicate a program error, which means the program must be terminated. Example: a memory protection error.
  – Exceptions that can be handled and normally resumed. Example: a page fault.
• Basic principles employed by the above methods:
  – Exceptions that can be resumed can be accepted and processed for speculative instructions just as if they were normal instructions.
  – Exceptions that indicate a program error should not occur in correct programs.

Hardware-Software Cooperation for Speculation
• The hardware and OS simply
  – handle all resumable exceptions when the exception occurs, and
  – return an undefined value for any exception that would cause termination.
• If a normal instruction generates a
  – terminating exception --> an undefined value is returned and the program proceeds normally --> an incorrect result is generated, or
  – resumable exception --> it is accepted and handled accordingly --> the program completes normally.
• If a speculative instruction generates a
  – terminating exception --> an undefined value is returned --> a correct program will not use it --> the result is still correct, or
  – resumable exception --> it is accepted and handled accordingly --> the program completes normally.

Example
• On pages 346 and 347

Speculative Instructions Never … (Method 2)
• Example on page 347

Answer

Speculation with Poison Bits
– A poison bit is added to every register, and another bit is added to every instruction to indicate whether the instruction is speculative.
– Three steps:
  • The poison bit of the destination register is set whenever a speculative instruction results in a terminating exception; all other exceptions are handled immediately.
  • If a speculative instruction uses a register with its poison bit turned on, the destination register of the instruction simply has its poison bit turned on.
  • If a normal instruction attempts to use a source register with its poison bit turned on, the instruction causes a fault.

Example
• On page 348

Hardware Support for Memory Reference Speculation
• Moving loads across stores is usually done only when the compiler is certain that the addresses do not conflict.
• To support a speculative load
  – A special check instruction that checks for an address conflict is placed at the original location of the load instruction.
  – When a speculated load is executed, the hardware saves the address of the accessed memory location.
  – If a store changes the value in that location before the check instruction executes, the speculation fails. If not, it succeeds.

Hardware- versus Software-Based Speculation
• Speculating extensively requires disambiguating memory addresses. Dynamic run-time disambiguation in hardware allows loads to be moved past stores at run time.
• Hardware-based speculation works better when hardware-based branch prediction is superior to software-based branch prediction done at compile time.
• Hardware-based speculation maintains a completely precise exception model.
• Hardware-based speculation does not require bookkeeping code.
• Hardware-based speculation with dynamic scheduling does not require different code sequences for different implementations of an architecture to achieve good performance.
• Compiler-based approaches can see further ahead in the code sequence.

Concluding Remarks
• Hardware and software approaches to increasing ILP tend to fuse together.