INSTRUCTION-LEVEL PARALLEL PROCESSORS
Chapter 4: Evolution of ILP Processors

What is instruction-level parallelism?
Instruction-level parallelism (ILP) is the potential for executing certain instructions in parallel because they are independent, together with any technique for identifying and exploiting such opportunities.

Basic blocks
Sequences of instructions that appear between branches; usually no more than 5 or 6 instructions.

Loops
for (i = 0; i < N; i++)
    x[i] = x[i] + s;

We can only realize ILP by finding sequences of independent instructions; dependent instructions must be separated by a sufficient amount of time.

Principal Operation of ILP Processors

Pipelined Operation
A number of functional units are employed in sequence to perform a single computation, each functional unit representing a certain stage of the computation. Pipelining allows overlapped execution of instructions and increases the processor's overall throughput.

Superscalar Processors
Superscalar processors increase the processor's ability to exploit instruction-level parallelism: multiple instructions are issued every cycle, with multiple pipelines operating in parallel. The processor receives a sequential stream of instructions; the decode unit then issues multiple instructions to the multiple execution units in each cycle.

VLIW Architecture
VLIW stands for Very Long Instruction Word. A VLIW processor receives multi-operation instructions, i.e. instructions with multiple fields for the control of the execution units (EUs). The basic structure of superscalar and VLIW processors is the same: multiple EUs, each capable of parallel execution on data fetched from a register file.
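A small Python sketch (hypothetical, not from the original text; the helper names are invented) of why the loop above exposes ILP: every iteration touches a different x[i], so the iterations are mutually independent and can be executed in any order, exactly as a parallel machine might interleave them, with the same final result.

```python
import random

def add_scalar_inorder(x, s):
    """Execute the loop iterations in program order."""
    y = list(x)
    for i in range(len(y)):
        y[i] = y[i] + s
    return y

def add_scalar_shuffled(x, s):
    """Execute the same iterations in an arbitrary order, as an ILP
    machine might overlap them: the result is identical because no
    iteration reads a value that another iteration writes."""
    y = list(x)
    order = list(range(len(y)))
    random.shuffle(order)
    for i in order:
        y[i] = y[i] + s
    return y

x = [3, 1, 4, 1, 5]
assert add_scalar_inorder(x, 10) == add_scalar_shuffled(x, 10) == [13, 11, 14, 11, 15]
```

By contrast, a loop such as X(I) = A*X(I-1) + B (discussed later) would not survive this shuffling, because each iteration reads the previous one's result.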
Dependencies Between Instructions
Three kinds of dependences (constraints) exist between instructions:
Data dependences arise from precedence requirements concerning referenced data.
Control dependences arise from conditional statements.
Resource dependences arise from limited hardware resources.

Data Dependency
An instruction j is data dependent on a previous instruction i if j uses data produced by i; j cannot execute before i, and j and i cannot execute simultaneously. Data dependences are properties of the program; whether a dependence leads to a hazard or a stall in a pipeline depends on the pipeline organization.

Data dependences can be classified according to the data involved and according to their type: the data involved may reside in a register or in memory, and the dependence may occur either in straight-line code or in a loop.

Data Dependency in Straight-Line Code
Straight-line code may contain three different types of dependences:
RAW (read after write)
WAR (write after read)
WAW (write after write)

RAW (Read After Write)
i1: load r1, a;
i2: add  r2, r1, r1;
Here i2 reads r1, which i1 writes. Assume a pipeline of Fetch/Decode/Execute/Mem/Writeback: i2 must wait until i1 has produced r1.

Name Dependences
A "name dependence" occurs when two instructions use the same register or memory location (a name) but there is no flow of data between the two instructions. There are two types:
Antidependences occur when an instruction j writes a register or memory location that instruction i reads, and i is executed first. This corresponds to a WAR hazard.
Output dependences occur when instruction i and instruction j write the same register or memory location. These are protected against by checking for WAW hazards.

WAR (Write After Read)
i1: mul r1, r2, r3;   r1 <= r2 * r3
i2: add r2, r4, r5;   r2 <= r4 + r5
If instruction i2 (add) is executed before instruction i1 (mul) for some reason, then i1 (mul) could read the wrong value for r2.
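The three straight-line dependence types above can be detected mechanically from the registers each instruction reads and writes. A hypothetical Python sketch (the representation of an instruction as a (destination, sources) pair is an assumption for illustration):

```python
def dependences(i, j):
    """Classify the data dependences of a later instruction j on an
    earlier instruction i. Each instruction is modelled as
    (dest_register, source_registers)."""
    di, si = i
    dj, sj = j
    deps = set()
    if di in sj:
        deps.add("RAW")   # j reads what i writes (true dependence)
    if dj in si:
        deps.add("WAR")   # j writes what i reads (antidependence)
    if dj == di:
        deps.add("WAW")   # both write the same register (output dependence)
    return deps

# i1: load r1, a ; i2: add r2, r1, r1  -> RAW on r1
assert dependences(("r1", []), ("r2", ["r1", "r1"])) == {"RAW"}
# i1: mul r1, r2, r3 ; i2: add r2, r4, r5  -> WAR on r2
assert dependences(("r1", ["r2", "r3"]), ("r2", ["r4", "r5"])) == {"WAR"}
# i1: mul r1, r2, r3 ; i2: add r1, r4, r5  -> WAW on r1
assert dependences(("r1", ["r2", "r3"]), ("r1", ["r4", "r5"])) == {"WAW"}
```

Note that only RAW involves an actual flow of data; WAR and WAW are name dependences and, as shown in the renaming discussion, can be removed.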
One reason for delaying i1 would be a stall while waiting for the r3 value produced by a previous instruction. Instruction i2 could proceed because it already has all its operands, thus causing the WAR hazard. Register renaming eliminates the WAR dependency: replace r2 in i2 with some other register that has not been used yet.

WAW (Write After Write)
i1: mul r1, r2, r3;   r1 <= r2 * r3
i2: add r1, r4, r5;   r1 <= r4 + r5
If instruction i1 (mul) finishes after instruction i2 (add), then register r1 would get the wrong value. i1 could finish after i2 if separate execution units were used for the two instructions. One way to resolve this hazard is simply to let instruction i1 proceed normally but disable its write stage.

Data Dependency in Loops
Instructions belonging to a particular loop iteration may depend on instructions belonging to previous loop iterations. This type of dependency is referred to as a recurrence, or inter-iteration data dependency:
do I
    X(I) = A*X(I-1) + B
end do

Loops are a "common case" in pretty much any program, so this is worth examining. Consider:
for (j = 0; j <= 100; j++) {
    A[j+1] = A[j] + C[j];    /* S1 */
    B[j+1] = B[j] + A[j+1];  /* S2 */
}
S1 depends on an earlier iteration of itself. This is a loop-carried dependence; in other words, the dependence exists between different iterations of the loop, so successive iterations of S1 must execute in order. S2 depends on S1 within an iteration and is not loop carried, so multiple iterations of just this statement could execute in parallel.

Data Dependency Graphs
Data dependences can also be represented by graphs, with edges labelled dt, da or do to denote RAW (true), WAR (anti) or WAW (output) dependences respectively.
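The register-renaming idea mentioned above can be sketched in a few lines of Python (a hypothetical illustration; function and register names are invented). Every write is given a fresh physical register and every read is mapped through the current name table, so no register is ever written twice and WAR and WAW dependences disappear:

```python
def rename(instrs, first_free=100):
    """Rename architectural registers to fresh physical registers.
    instrs: list of (dest_register, source_registers).
    Returns the renamed instruction list."""
    table = {}          # architectural name -> current physical register
    fresh = first_free
    out = []
    for dest, srcs in instrs:
        # Reads go through the name table (unmapped names pass through).
        srcs = [table.get(r, r) for r in srcs]
        # Each write gets a brand-new physical register.
        table[dest] = "p%d" % fresh
        fresh += 1
        out.append((table[dest], srcs))
    return out

# i1: mul r1, r2, r3 ; i2: add r2, r4, r5   (WAR on r2)
renamed = rename([("r1", ["r2", "r3"]), ("r2", ["r4", "r5"])])
# After renaming, i2 writes p101 instead of r2, so i1's read of r2 is safe
# and the two instructions may execute in either order.
assert renamed == [("p100", ["r2", "r3"]), ("p101", ["r4", "r5"])]
```

True (RAW) dependences survive renaming, as they must: if i2 read r1, the rename table would correctly redirect that read to p100.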
[Figure: data dependency graph for the sequence below. Edges i1 -> i3 and i2 -> i3 are labelled dt (true dependence); the edge i4 -> i5 is labelled do (output dependence on r1).]
i1: load r1, a;
i2: load r2, b;
i3: add r3, r1, r2;
i4: mul r1, r2, r4;
i5: div r1, r2, r4;
Instruction interpretation: r3 <= (r1) + (r2), etc.

Control Dependence
Control dependence determines the ordering of an instruction with respect to a branch instruction: if the branch is taken, the instruction is executed; if the branch is not taken, it is not. An instruction that is control dependent on a branch cannot be moved before the branch (instructions from the then-part of an if statement cannot be executed before the branch). Likewise, an instruction that is not control dependent on a branch cannot be moved after the branch (other instructions cannot be moved into the then-part of an if statement).

Example:
    sub r1, r2, r3;
    jz  zproc;
    mul r4, r1, r1;
    ...
zproc:
    load r1, x;

Published branch statistics (reconstructed; the middle columns give the conditional share of all branches):

                                 Branch ratio (%)    Cond./total branch    Cond. branch
                                                     ratio (%)             ratio (%)
                                 gen.purp.  sci.     gen.purp.  sci.       gen.purp.  sci.
Yeh, Patt (1992) M88100             24       5          78       82           20        4
Stephens et al. (1991) RS/6000      22      11          83       85           18        9
Lee, Smith (1984) IBM/370           29      10          73       73           21        8
Lee, Smith (1984) PDP-11/70         39       -          46        -           18        -
Lee, Smith (1984) CDC 6400           8       -          53        -            4        -

Frequent conditional branches impose a heavy performance constraint on ILP processors: the higher the instruction issue rate per cycle, the higher the probability of encountering a conditional control dependency in each cycle.
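The dependency graph sketched above can be built programmatically. A hypothetical Python sketch (instruction encoding and function name are invented for illustration), using the dt/da/do edge labels from the text:

```python
def dep_graph(instrs):
    """Build labelled dependence edges between instructions.
    instrs: list of (name, dest_register, source_registers).
    Edge labels: dt = RAW (true), da = WAR (anti), do = WAW (output).
    Note: for simplicity this emits an edge for every dependent pair,
    without pruning transitively implied edges."""
    edges = []
    for a in range(len(instrs)):
        na, wa, ra = instrs[a]
        for b in range(a + 1, len(instrs)):
            nb, wb, rb = instrs[b]
            if wa in rb:
                edges.append((na, nb, "dt"))
            if wb in ra:
                edges.append((na, nb, "da"))
            if wb == wa:
                edges.append((na, nb, "do"))
    return edges

code = [("i1", "r1", []),             # load r1, a
        ("i2", "r2", []),             # load r2, b
        ("i3", "r3", ["r1", "r2"]),   # add r3, r1, r2
        ("i4", "r1", ["r2", "r4"]),   # mul r1, r2, r4
        ("i5", "r1", ["r2", "r4"])]   # div r1, r2, r4
edges = dep_graph(code)
assert ("i1", "i3", "dt") in edges and ("i2", "i3", "dt") in edges
assert ("i3", "i4", "da") in edges    # i4 rewrites r1, which i3 reads
assert ("i4", "i5", "do") in edges    # i4 and i5 both write r1
```

A scheduler may execute the instructions in any topological order of this graph (after renaming away the da/do edges, only the dt edges remain binding).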
[Figure: conditional branches (JC) in the issued instruction stream. With scalar issue, a conditional branch appears only every few cycles; with superscalar or VLIW issue of 2, 3 or 6 instructions per cycle, a conditional branch must be handled in almost every cycle.]

Control Dependency Graph
i0: r1 = op1;
i1: r2 = op2;
i2: r3 = op3;
i3: if (r2 > r1) {
i4:     if (r3 > r1)
i5:         r4 = r3;
        else
i6:         r4 = r1;
    } else
i7:     r4 = r2;
i8: r5 = r4 * r4;
Figure 5.2.11: An example for a CDG. Nodes i0-i2 and i8 are not control dependent on any branch; i4 and i7 hang off the true and false edges of i3, and i5 and i6 off the true and false edges of i4.

Resource Dependency
An instruction is resource dependent on a previously issued instruction if it requires a hardware resource that is still being used by the previously issued instruction. If, for instance, there is only a single division unit available, then in the code sequence
i1: div r1, r2, r3;
i2: div r4, r2, r5;
i2 is resource dependent on i1.

Instruction Scheduling
Why is instruction scheduling needed? Instruction scheduling involves:
Detection: detecting where dependencies occur in the code.
Resolution: removing those dependencies from the code.
There are two approaches to instruction scheduling: static and dynamic.

Static Scheduling
Detection and resolution are accomplished by the compiler, which avoids dependencies by reordering the code. VLIW processors expect dependency-free code generated by an ILP compiler.

Dynamic Scheduling
Performed by the processor, using two windows. The issue window contains all fetched instructions intended for issue in the next cycle; its width equals the issue rate, and every instruction in it is checked for dependencies. The execution window contains all instructions that are still in execution and whose results have not yet been produced.
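Resource dependences and RAW dependences together determine when an instruction can issue. A hypothetical cycle-accurate sketch in Python (the instruction encoding, latencies and function names are invented assumptions, not from the text) of in-order issue with a single, unpipelined division unit:

```python
def schedule(instrs, latencies, single_units):
    """In-order issue honouring RAW data dependences and limited
    functional units. instrs: list of (name, op, dest, sources).
    single_units: set of ops served by one unpipelined unit.
    Returns {name: issue cycle}."""
    ready = {}          # register -> cycle its value becomes available
    busy = {}           # op -> cycle its single unit becomes free
    issue_cycle = {}
    cycle = 0
    for name, op, dest, srcs in instrs:
        start = max([cycle] + [ready.get(r, 0) for r in srcs])  # data dep.
        start = max(start, busy.get(op, 0))                     # resource dep.
        issue_cycle[name] = start
        ready[dest] = start + latencies[op]
        if op in single_units:
            busy[op] = start + latencies[op]
        cycle = start + 1   # in-order: next instruction issues no earlier
    return issue_cycle

lat = {"div": 8, "add": 1}
# i1: div r1, r2, r3 ; i2: div r4, r2, r5  -- only one division unit
sched = schedule([("i1", "div", "r1", ["r2", "r3"]),
                  ("i2", "div", "r4", ["r2", "r5"])],
                 lat, single_units={"div"})
# i2 is resource dependent on i1: although its operands are ready,
# it must wait until the divider is free.
assert sched == {"i1": 0, "i2": 8}
```

With a second division unit (drop "div" from single_units), i2 would issue at cycle 1, illustrating that resource dependences, unlike true data dependences, can be removed by adding hardware.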
Instruction Scheduling in ILP Processors
ILP instruction scheduling comprises:
detection and resolution of dependencies
parallel optimization

Parallel Optimization
Parallel optimization reorders the sequence of instructions, by means of appropriate code transformations, so that they can execute in parallel. It is also known as code restructuring or code reorganization.

Preserving Sequential Consistency
Any reordering must preserve sequential consistency, i.e. maintain the logical integrity of the program. For example:
div r1, r2, r3;
add r5, r6, r7;
jz  anywhere;
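One way to state the reordering constraint: a permutation of the code preserves sequential consistency with respect to data dependences if every RAW/WAR/WAW-related pair keeps its original relative order. A hypothetical Python sketch (names and encoding invented; it checks only data dependences, whereas a real optimizer must also respect control dependences such as the jz above and the possibility that div raises an exception):

```python
def conflicts(i, j):
    """True if instructions i and j must keep program order: any RAW,
    WAR or WAW relation between them. Each instruction is
    (dest_register_or_None, source_registers)."""
    di, si = i
    dj, sj = j
    return (di is not None and di in sj) or \
           (dj is not None and dj in si) or \
           (di is not None and di == dj)

def reorder_is_safe(code, perm):
    """code: list of (name, (dest, sources)); perm: list of names.
    Safe iff every conflicting pair keeps its original relative order."""
    pos = {name: k for k, name in enumerate(perm)}
    for a in range(len(code)):
        for b in range(a + 1, len(code)):
            if conflicts(code[a][1], code[b][1]) and pos[code[a][0]] > pos[code[b][0]]:
                return False
    return True

# div r1,r2,r3 and add r5,r6,r7 are independent: swapping them is safe.
code = [("div", ("r1", ["r2", "r3"])), ("add", ("r5", ["r6", "r7"]))]
assert reorder_is_safe(code, ["add", "div"])
# A WAR-related pair must not be swapped.
code2 = [("i1", ("r1", ["r2"])), ("i2", ("r2", ["r4"]))]
assert not reorder_is_safe(code2, ["i2", "i1"])
```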