Instruction Level Parallelism - a historical perspective
Roland Ibbett

Instruction Level Parallelism
- Types of Instruction Level Parallelism
  - Parallel Function Units
  - Vector Processors
  - Superscalar Processors
  - VLIW Processors
  - SIMD Systems
- Where Next?

Parallel Function Units

The Floating-point Arithmetic Problem - 1
- Instruction pipelines can operate at the speed of fixed-point arithmetic
- Floating-point arithmetic is more complex:
  - exponent subtraction
  - mantissa shifting
  - mantissa addition
  - normalisation
- To match the floating-point execution rate to the instruction pipeline rate, parallel or pipelined arithmetic units (or both) are needed

Parallel Function Units
- Parallel function units were originally introduced as a means of matching function execution rate to instruction issue rate
- They require a two-address or three-address instruction format
- c.f. one-address systems, where each arithmetic operation requires the result of the previous operation before it can start

Typical Microprocessor Architecture
[Figure: pipeline stages (Instruction Fetch, Instruction Decode, Memory Access, Write Back) feeding parallel function units (Integer Unit, Fl-pt Unit, Multiply Unit, Divide Unit), with Registers and Memory]
- Hardware must ensure that instructions still appear to be executed in sequential order
- Two main techniques:
  - Common Data Bus + Tomasulo's Algorithm (IBM)
  - Scoreboards (CDC)

Tomasulo's Algorithm
www.icsa.inf.ed.ac.uk/research/groups/hase/models/tomasulo

Systems With Scoreboards
- CDC 6600 - introduced in the early 1960s
  - designed to solve problems well beyond contemporary capability
  - major innovation: parallel function units
- CDC 7600 - introduced in the late 1960s
  - upwardly compatible with the 6600
- Cray-1 - introduced in 1976
  - logically an extension of the 7600 by the addition of vector instructions
- RISC Architectures - "typical" system: MIPS

Data Hazards - RAW, WAW, WAR
- RAW (Read-After-Write): an instruction requires the result of a previously issued but uncompleted instruction - a flow dependency
- WAW (Write-After-Write): an instruction tries to write its result to the same register as a previously issued but uncompleted instruction - an output dependency
- WAR (Write-After-Read): an instruction tries to write to a register which has not yet been read by a previously issued but uncompleted instruction - an anti dependency

DLX Simulation Model
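The three hazard classes can be illustrated with a small dependency check: given two instructions in program order, compare the registers each one reads and writes. This is an illustrative sketch, not the DLX or scoreboard logic itself; the register names and the (writes, reads) encoding are assumptions for the example.

```python
def hazards(first, second):
    """Classify data hazards between two instructions in program order.

    Each instruction is a pair (writes, reads): the set of registers
    it writes and the set it reads.
    """
    w1, r1 = first
    w2, r2 = second
    found = set()
    if w1 & r2:
        found.add("RAW")  # second reads a result the first has not yet produced
    if w1 & w2:
        found.add("WAW")  # both write the same register
    if r1 & w2:
        found.add("WAR")  # second overwrites a register the first still reads
    return found

# MULTF F3,F1,F2 followed by ADDF F0,F0,F3: ADDF needs F3 -> flow dependency
print(hazards(({"F3"}, {"F1", "F2"}), ({"F0"}, {"F0", "F3"})))  # {'RAW'}
```

A scoreboard makes exactly these comparisons at issue time, stalling the second instruction (RAW, WAW) or the write-back of its result (WAR) until the conflict clears.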
Code Optimisation

    Naive Code          Improved Code
    LF    F1 4(R1)      LF    F1 4(R1)
    LF    F2 36(R1)     LF    F2 36(R1)
    MULTF F3 F1 F2      ADDI  R1 R1 4
    ADDF  F0 F0 F3      MULTF F3 F1 F2
    ADDI  R1 R1 4       SEQI  R3 R1 32
    SEQI  R3 R1 32      ADDF  F0 F0 F3
    BEQZ  R3 loop       BEQZ  R3 loop

Parallel Function Units

The Floating-point Arithmetic Problem - 2
- CDC 6600 Performance
  - +: 1 result per 4 clocks = 0.25 flops/clock
  - ×: 2 results per 10 clocks = 0.2 flops/clock
  - Total = 0.45 flops/clock
  - Max issue rate = 1 per clock (100 ns)
  - Max MIPS = 10, Max MFLOPS = 4.5
- CDC 7600 Performance
  - +: 1 result per clock = 1.0 flops/clock
  - ×: 1 result per 2 clocks = 0.5 flops/clock
  - Total = 1.5 flops/clock
  - Max issue rate = 1 per clock (27.5 ns)
  - Max MIPS = 36.4, Max MFLOPS = 36.4
- Floating-point performance is limited by
  - the instruction issue rate
  - the entry rate of results into registers

The Floating-point Arithmetic Problem - 3
- How can architectural performance (flops per clock) be increased?
- Stop issuing so many instructions:
  - many flops per instruction
  - VECTOR instructions
- Multiple register entry paths
- CRAY-1: floating-point instructions are of the form:
    Vi = Vj ± Vk
    Vi = Vj × Sk
    Si = Sj ± Sk

Vector Processors

Cray-1 Processor Organisation

The Cray-1 Scoreboard
- An instruction is issued only when it is guaranteed to complete:
  - A, S registers: only 1 result per clock may be entered into the registers, e.g. a 3-clock instruction with an S result register cannot be issued in the clock following the issue of a 4-clock S instruction
  - V registers: separate input multiplexers for each V register allow each to receive 1 result per clock

Reservations
- Each register has a Reservation Bit
- A and S registers:
  - set when an instruction is issued which will deliver a result to the register
  - reset when the result is entered
- V registers:
  - set when an instruction is issued that will deliver results to the register, OR when an instruction is issued that reads the register (V register technology allowed only one read or write access per clock cycle)
  - reset when the last element is read/written

Chaining
- Exception: as a data element arrives, it can also be routed back as an input to a function unit - chaining
- Example:
    V0 ← Memory
    V1 ← V0 × S1
    V3 ← V1 + V2

Cray-1 Performance
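The benefit of chaining can be sketched in simple timing terms: with chaining, each vector element flows from memory through the multiply unit and on into the add unit as soon as it arrives, so the three operations overlap instead of running back-to-back over the whole vector. A minimal sketch, assuming pipelined units that each accept one element per clock; the unit latencies are illustrative, not actual Cray-1 figures.

```python
def total_clocks(n, latencies, chained):
    """Clocks to produce n result elements through a chain of pipelined units.

    Each unit accepts one new element per clock once it has started.
    Unchained: each operation waits for the whole previous vector to finish.
    Chained: each unit starts as soon as the first element reaches it.
    """
    if chained:
        # Unit start-up latencies are paid once, then one element per clock.
        return sum(latencies) + (n - 1)
    # Each operation runs to completion over the full vector before the next.
    return sum(lat + (n - 1) for lat in latencies)

# Three chained operations: memory load, multiply, add (illustrative latencies).
lats = [6, 7, 6]
print(total_clocks(64, lats, chained=False))  # 208
print(total_clocks(64, lats, chained=True))   # 82
```

For a 64-element vector the chained version finishes in well under half the time, which is why the scoreboard makes chaining the one exception to its strict one-access-per-V-register rule.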
Superscalar Processors

In a superscalar processor, multiple instructions are issued in one cycle. An example system is the Alpha processor:

- Instruction Unit (Ibox)
  - fetches instructions
  - checks resources
  - maintains state for all pipeline stages to track outstanding writes
  - controls stalls/aborts/restarts
  - issues dual instructions
- Instruction Issue: an instruction pair can contain one instruction from each column (but only one load/store or branch per pair):

    Integer Operate      Floating Operate
    Fl-pt Load/Store     Integer Load/Store
    Fl-pt Branch         Integer Branch
                         BR/BSR/JSR

Superscalar Processor Features
- Multiple issue is an implementation feature:
  - the Alpha 21064 is dual issue
  - the Alpha 21264 issues 6 instructions per cycle
  - i.e. the instruction format is unchanged (c.f. VLIW)
- BUT instructions must have certain features:
  - fixed length
  - branches on register values, not condition codes

VLIW Processors
(VLIW = Very Long Instruction Word)
- multiple instructions, fixed format
- requires complex compiler optimisation
- e.g. Intel-HP IA-64*
  - 128-bit word
  - 3 x 41-bit instructions + 5-bit template
  - the template indicates instruction dependencies
  - features include:
    - Predication
    - Control Speculation

* "The IA-64 Architecture at Work", Carole Dulong, IEEE Computer, Vol 31, No 7, July 1998

Predication
- Predicate: a binary tag that permits conditional execution of an instruction; its value depends on the outcome of a conditional statement
  - P = true: the instruction executes normally
  - P = false: the instruction is issued but its results are not written to registers or memory
- IA-64 uses a full predication model: the compiler can append a predicate to any instruction

if-then-else

Simulation Model

SIMD Array Systems
- An alternative to the temporal parallelism of pipelined vector processing is the use of an array of processing elements (PEs) to provide spatial parallelism
- In an SIMD array system, multiple PEs each execute the same instruction on their own data
- PEs operate in lockstep, controlled by an array control unit (ACU), but can be individually disabled
- The ACU executes scalar and program control instructions and issues SIMD instructions to the array
- PEs are typically interconnected by a NEWS (North-East-West-South) network
- An early example was the ICL DAP
- Later SIMD machines include the TMC Connection Machines

ICL DAP Architecture

ICL DAP Processing Element

ICL DAP Store Organisation

DAP Arithmetic Modes
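The lockstep-with-disabling behaviour of an SIMD array can be sketched as follows: every PE receives the same broadcast instruction, but a per-PE activity bit decides whether it actually updates its state. The structure and names here are illustrative, not the DAP's actual mechanism.

```python
def simd_step(op, accumulators, operands, active):
    """Apply one broadcast instruction across all PEs in lockstep.

    PEs whose activity bit is False still receive the instruction
    but leave their state unchanged (they are 'disabled').
    """
    return [op(acc, x) if enabled else acc
            for acc, x, enabled in zip(accumulators, operands, active)]

# Four PEs; the activity mask disables PEs 1 and 3.
acc = [10, 10, 10, 10]
data = [1, 2, 3, 4]
mask = [True, False, True, False]
print(simd_step(lambda a, x: a + x, acc, data, mask))  # [11, 10, 13, 10]
```

The activity mask plays the same role as a predicate in IA-64: conditional code is executed by disabling the PEs for which the condition is false, rather than by branching.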
The TMC Connection Machine

SIMD Simulation Model
www.icsa.inf.ed.ac.uk/research/groups/hase/models/simd

SIMD Instructions in Microprocessors
- An example is the Pentium III SSE instruction set
  - SSE = Streaming SIMD Extensions
  - the instruction set contains a new data type: a 128-bit word of four 32-bit floating-point operands
  - the processor contains eight 128-bit SSE registers
  - packed instructions perform four operations simultaneously
  - scalar instructions operate on the least significant operand
  - designed for use in 3D graphics applications
- Most systems offering high-performance graphics use Graphics Processing Units as co-processors
  - GPUs are essentially SIMD systems
  - HPC systems are starting to use GPUs

Amdahl's Law
- Assume a proportion p of a program is executed in parallel
- The proportion executed serially is then (1 - p)
- If Ts = the time to execute the code with no parallelism, the time to execute the same code with degree of parallelism N is

    Tp = p.Ts/N + (1 - p).Ts

- The maximum speedup S (= Ts/Tp) is thus N, and the relative speed-up S/N is

    S/N = Ts/(N.Tp) = Ts/(p.Ts + N.(1 - p).Ts) = 1/(p + N.(1 - p))

[Figure: relative speed-up S/N against degree of parallelisation p, for N = 4, 16, 64 and 128]

Where Next?
- Moore's Law
  - The capacity of memory chips quadruples every 3 years
  - Since much of this increase has been due to smaller silicon feature sizes, processor speeds have increased and processor architectures have also become more complex
  - When will Moore's Law stop working?
- Use of Silicon Real Estate
  - High Performance Computing
  - High Throughput Computing
- Has ILP come to the end of the road?
  - the problem of branches will not go away
  - the memory wall problem
- GPUs and chip multiprocessors
- Multithreaded architectures - e.g. Sun Niagara
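The relative speed-up expression from the Amdahl's Law slides, S/N = 1/(p + N(1 - p)), can be evaluated directly; a minimal sketch reproducing a few points of the plotted curves:

```python
def speedup(p, n):
    """Absolute speedup S = Ts/Tp = 1 / (p/N + (1 - p))."""
    return 1.0 / (p / n + (1.0 - p))

def relative_speedup(p, n):
    """Relative speed-up S/N = 1 / (p + N*(1 - p))."""
    return 1.0 / (p + n * (1.0 - p))

# Even with 90% of the program parallelised, 64 processors give under 9x:
print(round(speedup(0.9, 64), 2))           # 8.77
# A fully parallel program (p = 1) reaches the maximum, S = N:
print(speedup(1.0, 16))                     # 16.0
print(relative_speedup(1.0, 16))            # 1.0
```

This is why the curves stay flat until p is very close to 1: the serial fraction, however small, dominates Tp as N grows.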