TDTS 08 – Lecture 6 (2015-11-17)
Zebo Peng, IDA, LiTH

Lecture 6: VLIW Processors

- Introduction and motivation
- Loop unrolling
- IA-64 architecture
- Features of Itanium

What about Superscalar?

It is a good architecture from two perspectives:
- The hardware solves everything:
  - It detects potential parallelism between instructions.
  - It tries to issue as many instructions as possible in parallel.
  - It uses register renaming to increase parallelism.
- Binary compatibility:
  - If functional units are added or other improvements are made (without changing the instruction set), programs need not be changed.
  - Old programs benefit from the additional machine parallelism, since the new hardware simply issues instructions in a more efficient way.

Problems with Superscalar

- The architecture is very complex:
  - A lot of hardware is needed for run-time detection of parallelism.
  - It consumes a great deal of power (advanced cooling techniques may be needed).
  - There is therefore a limit to how far we can go with this technique.
- The instruction window for execution is limited in size, which limits the capacity to detect a large number of parallel instructions.

Very Long Instruction Word Processors

In a VLIW (also called Very Large Instruction Word) processor, several operations that can be executed in parallel are placed in a single instruction word (Ø marks an empty operation slot):

                    Slot 1   Slot 2   Slot 3   Slot 4
    Instruction 1:  op1      op2      op3      op4
    Instruction 2:  op1      Ø        op3      op4
    Instruction 3:  Ø        op2      op3      Ø

VLIW architectures rely on compile-time detection of parallelism:
- The compiler analyzes the program and detects operations to be executed in parallel.
- After one instruction has been fetched, all the corresponding operations are issued in parallel.
- The instruction-window limitation disappears: the compiler can potentially analyze the whole program to detect parallel operations.
- No hardware is needed for run-time detection of parallelism.

Typical Organization

[Figure: the instruction fetch unit and instruction decode unit feed an execution unit made up of n parallel functional units; the functional units share the register files and connect to the memory system. Pipeline stages: FI, DI, EI, WO.]

Explicit Parallelism

- Instruction parallelism is scheduled at compile time and included explicitly within the machine instructions.
- An EPIC (Explicitly Parallel Instruction Computing) processor uses this information to perform parallel execution.
- The hardware is very much simplified:
  - The controller is similar to that of a simple scalar computer.
  - The number of FUs can be increased without additional sophisticated hardware to detect parallelism, as in superscalar architectures (SSA).
- The compiler has much more time to determine parallel operations:
  - The analysis is done only once, off-line, while run-time detection is carried out by SSA hardware for each execution of the code.
  - Good compilers can detect parallelism based on global analysis of the whole program.

Main Issues

- A large number of registers is needed in order to keep all FUs active (to store operands and results).
- Large data-transport capacity is needed between the FUs and the register files, and between the register files and memory.
- High bandwidth is needed between instruction cache and fetch unit, due to the long instructions. Ex.: with 8 operations per instruction, each 24 bits, an instruction is 192 bits long (six 32-bit words).
- Large code size, partly because unused operation slots waste bits in the instruction words (see the sketch after the table):

                    Slot 1   Slot 2   Slot 3   Slot 4
    Instruction 1:  op1      op2      op3      op4
    Instruction 2:  op1      Ø        op3      op4
    Instruction 3:  Ø        op2      op3      Ø
    Instruction 4:  op1      Ø        op3      op4
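To make the wasted-bits point concrete, here is a minimal C sketch. It models the example above (8 operations of 24 bits each, 192 bits per instruction word); the struct layout, the slot values, and the NOP encoding are illustrative assumptions, not any real VLIW format:

    #include <stdint.h>
    #include <stdio.h>

    #define SLOTS   8           /* operations per VLIW instruction word   */
    #define OP_BITS 24          /* bits per operation slot                */
    #define NOP     0u          /* assumed encoding of an empty (Ø) slot  */

    /* One instruction word: 8 x 24 = 192 bits (six 32-bit words).
     * For clarity each slot is kept in its own uint32_t rather than
     * bit-packed; only the low 24 bits of each entry are meaningful. */
    typedef struct {
        uint32_t slot[SLOTS];
    } vliw_word;

    /* Count the bits spent on empty slots in one instruction word. */
    static int wasted_bits(const vliw_word *w)
    {
        int waste = 0;
        for (int i = 0; i < SLOTS; i++)
            if (w->slot[i] == NOP)
                waste += OP_BITS;
        return waste;
    }

    int main(void)
    {
        /* A word where only 3 of the 8 slots carry real operations. */
        vliw_word w = { { 0x101, NOP, 0x303, 0x404, NOP, NOP, NOP, NOP } };
        printf("%d of %d bits wasted\n", wasted_bits(&w), SLOTS * OP_BITS);
        return 0;
    }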
Software Issues

- Incompatibility of binary code: if a new version of the processor introduces additional FUs, the number of operations to be executed in parallel increases. The instruction word therefore changes, and old binary code cannot run on the new processor.
- We might not have sufficient parallelism in the program to utilize the large degree of machine parallelism. In this case hardware resources are wasted, and a lot of memory space is wasted as well.
- A technique to address this problem is loop unrolling, which can be performed by a compiler.

An Example

    for (i = 959; i >= 0; i--)
        x[i] = x[i] + s;

Assumptions:
- x is an array of floating-point values; s is a floating-point constant.
- Memory allocation: R1 initially contains the address of the last element of x; the other elements are at lower addresses; x[0] is at address 0.
- Floating-point register F2 contains the value s.
- A floating-point value is 8 bytes long.

For an ordinary processor, this C code will be compiled to:

    Loop: LDD  F0,(R1)      ; F0 := x[i]    (load double)
          ADF  F4,F0,F2     ; F4 := F0+F2   (add floating point)
          STD  (R1),F4      ; x[i] := F4    (store double)
          SBI  R1,R1,#8     ; R1 := R1-8
          BGEZ R1,Loop      ; branch if R1 >= 0

A VLIW Processor Example

Two memory references, two FP operations, and one integer operation or branch can be packed into an instruction, in the following way:

    Instruction:  | Mem R | Mem R | FP 1 | FP 2 | I/BRA |

Assume also that:
- Integer operations take one cycle to complete.
- The delay for a double-word load is one additional clock cycle.
- The delay for a floating-point operation is two additional clock cycles.

Loop Unrolling

Loop unrolling is a technique used in compilers to increase the potential parallelism in a program. It supports more efficient code generation for processors with instruction-level parallelism. Let us rewrite the example:

    for (i = 959; i >= 0; i -= 2) {
        x[i]   = x[i]   + s;
        x[i-1] = x[i-1] + s;
    }

For an ordinary processor, this new code will be compiled to:

    Loop: LDD  F0,(R1)      ; F0 := x[i]     (load double)
          ADF  F4,F0,F2     ; F4 := F0+F2    (add floating pnt)
          STD  (R1),F4      ; x[i] := F4     (store double)
          LDD  F6,(R1-8)    ; F6 := x[i-1]   (load double)
          ADF  F8,F6,F2     ; F8 := F6+F2    (add floating pnt)
          STD  (R1-8),F8    ; x[i-1] := F8   (store double)
          SBI  R1,R1,#16    ; R1 := R1-16
          BGEZ R1,Loop      ; branch if R1 >= 0

Loop Unrolling (2 iterations)

    Cycle | Mem R           | Mem R           | FP 1           | FP 2           | I/BRA
      1   | LDD F0,(R1)     | LDD F6,(R1-8)   |                |                |
      2   |                 |                 |                |                |
      3   |                 |                 | ADF F4,F0,F2   | ADF F8,F6,F2   |
      4   |                 |                 |                |                | SBI R1,R1,#16
      5   |                 |                 |                |                |
      6   | STD (R1+16),F4  | STD (R1+8),F8   |                |                | BGEZ R1,Loop

- There is an increased degree of parallelism in this case.
- Still, we have two completely empty cycles and many empty operation slots.
- Nevertheless, we have basically doubled the performance:
  - Two iterations take 6 cycles.
  - The whole loop takes 480 * 6 = 2880 cycles.

Loop Unrolling (4 iterations)

    Cycle | Mem R           | Mem R           | FP 1            | FP 2            | I/BRA
      1   | LDD F0,(R1)     | LDD F6,(R1-8)   |                 |                 |
      2   | LDD F10,(R1-16) | LDD F15,(R1-24) |                 |                 |
      3   |                 |                 | ADF F4,F0,F2    | ADF F8,F6,F2    |
      4   |                 |                 | ADF F12,F10,F2  | ADF F16,F15,F2  |
      5   |                 |                 |                 |                 |
      6   | STD (R1),F4     | STD (R1-8),F8   |                 |                 | SBI R1,R1,#32
      7   | STD (R1+16),F12 | STD (R1+8),F16  |                 |                 | BGEZ R1,Loop

The degree of parallelism is further improved (the corresponding source is sketched below).
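For reference, a sketch of the source-level transformation behind this schedule: the earlier C loop unrolled four iterations per pass (960 is divisible by 4, so no clean-up iterations are needed):

    for (i = 959; i >= 0; i -= 4) {
        x[i]   = x[i]   + s;
        x[i-1] = x[i-1] + s;
        x[i-2] = x[i-2] + s;
        x[i-3] = x[i-3] + s;
    }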
There is still one empty cycle and many empty operation slots. Four iterations take 7 cycles, and the whole loop now takes about 240 * 7 = 1680 cycles. With 8 unrolled iterations, we need 120 * 9 = 1080 cycles (5.3 times better performance than the non-unrolled version).

Loop Unrolling in General

- Given a certain set of resources (processor architecture) and a given loop, there is a limit on how many iterations should be unrolled. Beyond that limit there is no further gain in performance.
- Loop unrolling increases the memory space needed to store the machine code:

    Original code:              Code with 2 iterations:
    Loop: LDD  F0,(R1)          Loop: LDD  F0,(R1)
          ADF  F4,F0,F2               ADF  F4,F0,F2
          STD  (R1),F4                STD  (R1),F4
          SBI  R1,R1,#8               LDD  F6,(R1-8)
          BGEZ R1,Loop                ADF  F8,F6,F2
                                      STD  (R1-8),F8
                                      SBI  R1,R1,#16
                                      BGEZ R1,Loop

Loop Unrolling in General (cont'd)

- A compiler has to find the optimal level of unrolling for each loop.
- We also need hardware support to keep a VLIW processor busy:
  - a large number of registers (to store data for operations that are active in parallel);
  - large traffic has to be supported in parallel:
    - register files ↔ memory
    - register files ↔ functional units

Commercial VLIW Processors

Examples of successful VLIW processors:
- TriMedia of Philips.
- TMS320C6x of Texas Instruments.
- Both target the multimedia market (where there was not much compiled legacy code to support).
- The IA-64 architecture from Intel and Hewlett-Packard:
  - This family uses many of the VLIW ideas.
  - It is not "just" a multimedia processor, but intended to become "the new generation" processor for servers and workstations.
  - The first product, in 2001: Itanium.

The IA-64 Architecture

IA-64 is not a pure VLIW architecture, but many of its features are typical of VLIW processors.

Typical VLIW features of IA-64:
- Instruction-level parallelism fixed at compile time.
- (Very) long instruction word (128 bits).

Specific to the IA-64 architecture:
- Flexibility in the number of concurrent operations.
- Branch predication (NOT prediction!).
- Speculative loading.

IA-64 Basic Architecture

[Figure: the instruction fetch unit and the instruction decode & control unit drive a set of parallel functional units; the functional units share the register files and connect to the memory system.]

- 128 registers for integers, 128 registers for floating point, and 64 predicate registers.
- Registers (both integer and floating point) are 64 bits; predicate registers are 1 bit.
- 8 or more functional units.

Instruction Format

    128 bits:  | Operation 1 | Operation 2 | Operation 3 | Template (5 bits) |

    Each operation (41 bits):  | Op code | Reg | Reg | Reg | Pred reg |

- Three operations are specified in an instruction word. However, the three operations are not necessarily to be executed in parallel!
- Instead, the template indicates what can be executed in parallel: its encoding shows which of the operations in the instruction can be executed in parallel.
- The template also connects neighboring instructions, so operations from different instructions can be executed in parallel as well.
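A hedged C sketch of this bundle layout: a 5-bit template in the lowest bits, followed by three 41-bit operation slots. The field sizes follow the slide; the exact bit positions, and placing the predicate field in the low 6 bits of a slot (6 bits address 64 predicate registers), are assumptions here:

    #include <stdint.h>

    /* A 128-bit instruction bundle, kept as two 64-bit halves. */
    typedef struct { uint64_t lo, hi; } bundle;

    /* Read `len` bits (1..63) starting at bit `pos` of the bundle. */
    static uint64_t bits(bundle b, int pos, int len)
    {
        uint64_t v;
        if (pos < 64) {
            v = b.lo >> pos;
            if (pos + len > 64)             /* field spans the two halves */
                v |= b.hi << (64 - pos);
        } else {
            v = b.hi >> (pos - 64);
        }
        return v & ((1ULL << len) - 1);
    }

    /* Template: bits 0..4.  Operation slots: 41 bits each, starting
     * at bit positions 5, 46, and 87 (5 + 3*41 = 128). */
    static unsigned template_of(bundle b)    { return (unsigned)bits(b, 0, 5); }
    static uint64_t slot_of(bundle b, int i) { return bits(b, 5 + 41 * i, 41); }

    /* Assumed: the qualifying predicate register number sits in the
     * low 6 bits of a slot. */
    static unsigned pred_of(uint64_t slot)   { return (unsigned)(slot & 0x3F); }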
Instruction Format (cont'd)

The template provides high flexibility and avoids several problems of classical VLIW processors:
- Operations in one instruction don't have to be parallel.
  - No slots are left empty when there is no operation to be executed in parallel.
- The number of operations to be executed in parallel is not restricted by the instruction size.
  - New processor generations can have a different number of functional units without changing the instruction format.
  - Improved binary compatibility.
- If, according to the template, more operations can be executed in parallel than there are functional units available, the processor simply takes them sequentially.

Predicated Execution

Any operation can refer to a predicate register:

    <Pi> operation

where i is the number of a predicate register (between 0 and 63). This means that the operation is committed (its results made permanent) only when the respective predicate is true (i.e., the predicate register has value 1).
- If the predicate value is known when the operation is issued, the operation is executed only if this value is true.
- If the predicate is not known at that moment, the operation is started anyway. If the predicate turns out to be false, the operation is discarded.
- If no predicate register is mentioned, the operation is executed and committed unconditionally.

Branch Predication (!)

- Branch predication is an aggressive compilation technique to generate code with a higher degree of instruction-level parallelism.
- It lets operations from both sides of a conditional branch be executed in parallel, to increase the number of parallel operations. In this way, branches are eliminated and replaced by conditional execution.
- Hardware support is needed, as implemented in the IA-64 architecture.
- The idea: let instructions from both branches go on in parallel, before the branch condition is known. The hardware takes care that only the instructions of the right branch are finally committed.

Branch Predication Example

For a branch instruction, the compiler assigns a predicate to each of the two following instruction paths:

    Instruction 1
    Instruction 2
    Instruction 3 (branch)
    <P1> Instruction 4        <P2> Instruction 7
    <P1> Instruction 5        <P2> Instruction 8
    <P1> Instruction 6        <P2> Instruction 9

The CPU can execute instructions from different paths concurrently, but only the correct path will finally be committed. For a VLIW machine, the instructions may be arranged as follows:

    Instruction 1       | Instruction 2      | Instruction 3
    <P1> Instruction 4  | <P2> Instruction 7 |
    <P1> Instruction 5  | <P2> Instruction 8 |
    <P1> Instruction 6  | <P2> Instruction 9 |

Branch Predication (cont'd)

Branch predication is not branch prediction!
- Branch prediction: guess for one branch and then go along that one. If the guess turns out wrong, undo all the work and go to the right one.
- Branch predication: both branches are started, and when the condition is known, the right instructions are committed and the others are discarded (a C-level sketch follows below).
  + There is no lost time due to failed predictions.
  - You pay by introducing duplicated hardware.
  - Computation on some of the hardware is wasted, instead of doing useful work.
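In C terms, predication amounts to if-conversion: both arms of a conditional are computed, and only the result whose predicate is true is committed. A minimal sketch (illustrative only; real predication happens at the operation level, with the <P1>/<P2> annotations from the example above):

    /* Branch version:            Predicated version:
     *   if (c) x = a + b;          compare sets P1; P2 = !P1
     *   else   x = a - b;          <P1> x = a + b;  <P2> x = a - b;  */
    int predicated(int c, int a, int b)
    {
        int p1 = (c != 0);   /* predicate set by the compare; P2 = !P1 */
        int t_then = a + b;  /* <P1> arm: computed regardless          */
        int t_else = a - b;  /* <P2> arm: computed in parallel         */
        /* Only the result with a true predicate is committed;
         * the other is discarded. */
        return p1 ? t_then : t_else;
    }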
Placement of Loading

A load-from-memory operation should be placed so that memory latency is hidden, i.e., the value is already there when it is needed:

    Before scheduling:                    After scheduling:
    ADD  R4,R3  ; R4 := R4 + R3           LOAD R1,X   ; load from address X into R1
    LOAD R1,X   ; load from X into R1     ADD  R4,R3  ; R4 := R4 + R3
    ADD  R2,R1  ; R2 := R2 + R1           ADD  R2,R1  ; R2 := R2 + R1

Speculative Loading

What if a load is moved across a branch, e.g., from within an ELSE branch to before the conditional branch?

    <P2> Instruction 8 (load data)        <-- moved up before the branch
    Instruction 1
    Instruction 2
    Instruction 3 (branch)
    <P1> Instruction 4   <P2> Instruction 7
    <P1> Instruction 5   <P2> Instruction 8 (load data)   [original position]
    <P1> Instruction 6   <P2> Instruction 9

- The load will then be executed for both branches. This shouldn't be a problem if hardware resources are available.
- However, if, for example, a page fault is generated, handling such an exception takes an extremely long time and many resources: a whole page has to be brought in from secondary memory to main memory.
- If the load turns out not to be needed, it wastes a huge amount of time, and may replace a useful page.
- We would therefore like to handle the page fault (exception) only when we really need the loaded value. This is solved by speculative loading.

Speculative Loading (cont'd)

With speculative loading, a load instruction in the original program is replaced by two instructions:
- A speculative load (LD.s), which performs the memory fetch and detects whether an exception (such as a page fault) is generated. The exception, however, is not signaled to the operating system.
  - The speculative load is the instruction that is moved up for latency reduction.
- A checking instruction (CHK.s), which signals the exception if one has been detected by the speculative load.
  - If no exception has been detected, nothing happens.
  - The checking instruction remains in the original place.

Speculative Loading Example

    Speculative Load                     <-- the load is only done if there is no exception
    Instruction 1
    Instruction 2
    Instruction 3 (branch)
    <P1> Instruction 4   <P2> Instruction 7
    <P1> Instruction 5   <P2> Speculative Check   <-- exception handling occurs here, if needed
    <P1> Instruction 6   <P2> Instruction 9
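A C sketch of the LD.s/CHK.s split just described. The nat flag models a deferred-exception marker (IA-64 calls this the NaT, "Not a Thing", bit); would_fault() and recover_load() are hypothetical stand-ins for the MMU check and the compiler-generated recovery code, which real hardware and compilers provide:

    #include <stdint.h>

    typedef struct {
        int64_t value;
        int     nat;   /* deferred-exception token set by ld.s */
    } spec_reg;

    /* Hypothetical placeholders, NOT real APIs: stand-ins for the MMU
     * fault check and for re-issuing the load non-speculatively.      */
    static int     would_fault(const int64_t *addr) { return addr == 0; }
    static int64_t recover_load(const int64_t *addr) { return *addr; }

    /* LD.s: fetch early; on a fault (e.g. a page fault), record the
     * deferred-exception token instead of trapping, so the expensive
     * exception is not handled yet. */
    static spec_reg ld_s(const int64_t *addr)
    {
        spec_reg r = { 0, 0 };
        if (would_fault(addr))
            r.nat = 1;              /* defer the exception            */
        else
            r.value = *addr;
        return r;
    }

    /* CHK.s: stays at the original load site; only if the value is
     * really needed and the speculative load had faulted is the
     * exception finally handled. */
    static int64_t chk_s(spec_reg r, const int64_t *addr)
    {
        if (r.nat)
            return recover_load(addr);   /* handle the page fault now */
        return r.value;
    }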
Superscalar vs IA-64

    Superscalar                                 | IA-64 (improved VLIW)
    --------------------------------------------+---------------------------------------------
    Multiple parallel execution units           | Multiple parallel execution units
    RISC-like instructions, one per word        | RISC-like instructions, bundled into
                                                | groups of three
    Reorders and optimizes the instruction      | Reorders and optimizes the instruction
    stream at run time                          | stream at compile time
    Branch prediction with speculative          | Speculative execution along both
    execution of one path                       | paths of a branch
    Loads data from memory (caches)             | Speculatively loads data before it is
    only when needed                            | needed (caches or memory)

VLIW Summary

- In order to keep up with increasing demands on performance, superscalar architectures become excessively complex.
- VLIW architectures avoid this hardware complexity by relying on the compiler for parallelism detection.
- Operation-level parallelism is explicit in the instructions of a VLIW processor.
- Not having to deal with parallelism detection, VLIW processors can have a larger number of functional units. This generates the need for a large number of registers and high communication bandwidth.
- In order to keep the large number of functional units busy, compilers for VLIW processors have to be aggressive in parallelism detection.
- Loop unrolling is used to increase the degree of parallelism in loops: several iterations of a loop are unrolled and handled in parallel.

IA-64 Architecture Summary

- The IA-64 family uses VLIW techniques for implementing high-end processors; Itanium is the first member of this family.
- The instruction format avoids empty operation slots and allows more flexibility in specifying parallelism, by means of the template field.
- Besides the typical VLIW features, the Itanium architecture uses branch predication and speculative loading.
- Branch predication is based on predicated execution: the execution of an instruction is tied to a predicate register, and the instruction is committed only if the predicate becomes true. This allows instructions from both branches to be started in parallel.
- Speculative loading tries to reduce the latencies generated by load instructions. It allows load instructions to be moved across branch boundaries; page-fault exceptions are handled only if they are really needed.