2015-11-24
TDTS 08, Lecture 8
Zebo Peng, IDA, LiTH

Lecture 8: SIMD Architectures
- Vector processors
- Array processors
- Cray supercomputers
- Multimedia extensions

Introduction

Manipulation of arrays or vectors is a common operation in scientific and engineering applications. Typical operations on array-oriented data include:
- Processing one or more vectors to produce a scalar result.
- Combining two vectors to produce a third one.
- Combining a scalar and a vector to generate a vector.
- A combination of the above three operations.

Two architectures suitable for vector processing are:
- Pipelined vector processors (implemented in many supercomputers).
- Parallel array processors.

Both are architectures for data parallelism: the user or the compiler does the difficult work of finding the parallelism, so the hardware does not have to.

Exploiting Parallelism

There are three major categories of parallelism:
- Instruction-level parallelism (ILP): multiple instructions from one instruction stream are executed simultaneously.
- Thread-level parallelism (TLP): multiple instruction streams are executed simultaneously.
- Data parallelism (DP): the same operation is performed simultaneously on arrays of data elements.

Vector Processor Architecture

[Figure: an instruction fetch and decode unit dispatches scalar instructions to the scalar unit (scalar registers and scalar functional units) and vector instructions to the vector unit (vector registers and vector functional units); both units are connected to memory.]

Vector Unit Operation

[Figure: vector registers feeding a pipelined ALU, both connected to the memory system.]

Vector Processors

A vector processor operates on an entire vector with one single instruction. Strictly speaking, vector processors are not parallel processors:
- There are not several CPUs running in parallel inside a vector processor.
- They only behave like SIMD computers.
- They are SISD processors with vector instructions executed in a pipelined manner.

A vector processor has vector registers, each of which can typically store 64 to 128 values.

Examples of vector instructions:
- Load a vector from memory into a vector register;
- Store a vector into memory;
- Arithmetic and logic operations between vectors; and
- Operations between vectors and scalars.

Programmers can use vector operations directly, and the compiler translates them into vector instructions at the machine level.

Vector Unit

A vector unit consists of a pipelined functional unit, which performs ALU operations on vectors in a pipeline. It also contains several registers:
- A set of general-purpose vector registers, each of length s (e.g., 128);
- A vector length register VL (a scalar value), which stores the length l (0 <= l <= s) of the currently processed vector(s);
- A mask register M, which stores a set of l bits, one for each element in a vector, interpreted as Boolean values: vector instructions can be executed in masked mode, so that vector elements corresponding to a false value in M are ignored.

[Figure: a vector register VR1 of length s = 8, together with the registers VL and M; the mask bits in M select which elements of VR1 are processed.]

Vector Program Example

Consider an element-by-element addition of two N-element vectors A and B to create the sum vector C. On an SISD machine, this computation is implemented as:

for i = 0 to N-1 do
    C[i] := A[i] + B[i];

Loop: LOAD R1, A[i]
      LOAD R2, B[i]
      ADD  R1, R2
      STO  R1, C[i]
      ADD  i, #1
      BRA  Loop, i <= 128

For N = 128, this execution has 128*6 = 768 instruction fetches, 128 additions, and 128 conditional branches. In general, there will be N*K instruction fetches (assuming that K instructions are needed for each iteration) and N additions. There will also be N conditional branches (if loop unrolling is not used).
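Returning for a moment to the vector unit's mask register: its masked-mode behaviour can be sketched in software. The following is a minimal Python simulation; the function name and the way the registers are modelled are illustrative, not taken from any real ISA.

```python
def masked_vector_add(vr1, vr2, vl, mask):
    """Simulate a masked vector add: only the first vl lanes are processed,
    and lanes whose mask bit is False keep the destination's old value."""
    result = list(vr1)            # start from the destination's prior contents
    for i in range(vl):           # only the first VL elements are processed
        if mask[i]:
            result[i] = vr1[i] + vr2[i]
    return result

# Vector length 4 out of register size 6; lanes 1 and 3 are masked off.
vr1 = [10, 20, 30, 40, 50, 60]
vr2 = [1, 1, 1, 1, 1, 1]
print(masked_vector_add(vr1, vr2, vl=4, mask=[True, False, True, False]))
# -> [11, 20, 31, 40, 50, 60]
```

Note how elements beyond VL (50 and 60) and the masked-off lanes pass through unchanged, exactly the behaviour the mask register provides in hardware.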
Vector Program Example (Cont'd)

On a vector computer, we need only one statement:

C[0:N-1] := A[0:N-1] + B[0:N-1];

Vector code:

LOAD_V V1, A
LOAD_V V2, B
ADD_V  V3, V1, V2
STO_V  C, V3

For N = 128, this execution has 4 instruction fetches [SISD: 768], 128 additions, and 0 branches [SISD: 128]. The N additions are still performed, but now in a pipelined fashion. There are only K' instruction fetches (e.g., Load A, Load B, Add_vector, Write C, so K' = 4), and no conditional branch is needed.

Features of Vector Processors

Advantages:
- Quick fetch and decode of a single instruction for multiple operations.
- The instruction provides a regular source of data, which arrives at each cycle and can be processed in a pipelined fashion.
- The compiler generates code to fully utilize both the vector unit and the scalar unit.

Memory-to-memory operation mode:
- No vector registers are needed.
- It can process very long vectors, but the setup time is large.
- It appeared in the 70's and died in the 80's (memory bottleneck).

Register-to-register operation mode is more popular now:
- Operations are performed on values stored in the vector registers.

Vector processors are usually part of a supercomputer or a mainframe.

IBM 3090 with Vector Facility

Similar to a superscalar computer, except that the parallelism is mainly due to vector computation: the vector processors execute vector instructions. There is little impact on software.

Array Processors

An array processor is built with N identical processing elements (PEs) and a number of memory modules:
- All PEs are under the control of a single control unit.
- They execute instructions in lock-step mode.
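The fetch-count argument above can be made concrete with a toy register-to-register vector machine executing the four-instruction program from the slide. This is a sketch; the register names V1..V3 and the memory layout are illustrative, not a real ISA.

```python
# A toy vector machine: each vector instruction costs one fetch, regardless
# of how many element operations it performs.
N = 128
memory = {"A": list(range(N)), "B": [2 * i for i in range(N)], "C": [0] * N}
vregs = {}
fetches = 0

def load_v(vr, name):
    global fetches
    fetches += 1                  # one instruction fetch per vector op
    vregs[vr] = list(memory[name])

def add_v(dst, src1, src2):
    global fetches
    fetches += 1
    # N additions still happen, but under a single fetched instruction
    # (in hardware they would stream through the pipelined ALU).
    vregs[dst] = [x + y for x, y in zip(vregs[src1], vregs[src2])]

def sto_v(name, vr):
    global fetches
    fetches += 1
    memory[name] = list(vregs[vr])

load_v("V1", "A")
load_v("V2", "B")
add_v("V3", "V1", "V2")
sto_v("C", "V3")

print(fetches)                    # 4 fetches, vs. 128*6 = 768 on the SISD machine
print(memory["C"][:4])            # [0, 3, 6, 9]
```

No loop counter and no branch appear anywhere in the instruction stream; the element loop has moved into the semantics of each vector instruction.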
The processing units and memory elements communicate with each other through an interconnection network. Different topologies can be used, e.g., a crossbar.

The complexity of the control unit is at the same level as that of a uniprocessor system. The control unit is usually itself a computer, with its own high-speed registers, local memory and ALU. The main memory is the collection of the memory modules.

Global Memory Organization

[Figure: the control unit broadcasts an instruction stream (IS) to PE1..PEn, which are connected through an interconnection network to shared memory modules M1..Mk and to the I/O system.]

Array Processor Classification

Processing element complexity:
- Single-bit processors, e.g., the Connection Machine (CM-2): 65,536 PEs connected by a hypercube network (Thinking Machines Corp.).
- Multi-bit processors, e.g., ILLIAC IV (64-bit) and MasPar MP-1 (32-bit).

Processor-memory interconnection:
- Dedicated memory organization: ILLIAC IV, CM-2, MP-1.
- Global memory organization: the Bulk Synchronous Parallel (BSP) computer.

Dedicated Memory Organization

[Figure: the control unit, with its own memory Mcont, broadcasts an instruction stream (IS) to PE1..PEn; each PE has a dedicated local memory M1..Mn, and the PEs are connected to each other and to the I/O system through an interconnection network.]

Features of Array Processors

Control and scalar instructions are executed in the control unit, while vector instructions are performed in the processing elements, with each vector element mapped to a PE.

Data organization and detection of parallelism in a program are major issues when using such an architecture. Operations like C(i) = A(i) x B(i), 1 <= i <= n, can be executed in parallel if the elements of the arrays A and B are distributed properly among the processors/memory modules, e.g., if PEi is assigned the task of computing C(i). In the ideal case, the number of PEs equals the dimension of the arrays.
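The parallel execution of C(i) = A(i) x B(i) across the PEs, followed by a pair-wise combination of partial results, can be sketched as follows. This is a simulation, not real hardware: it assumes N is a power of two, and the function names are illustrative.

```python
# Lock-step sketch: N PEs each compute one product in a single parallel step,
# then the partial sums are combined pair-wise, halving the count each step.

def parallel_dot_product(A, B):
    values = [a * b for a, b in zip(A, B)]  # step 1: all N products at once
    steps = 1
    while len(values) > 1:                  # pair-wise reduction: log2(N) steps
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
        steps += 1
    return values[0], steps

A = list(range(1, 33))                      # N = 32
B = [1] * 32
total, steps = parallel_dot_product(A, B)
print(total, steps)                         # 528 in 6 parallel steps (1 + log2 32)

# Speed-up vs. a serial machine doing 2N-1 operations (N multiplies, N-1 adds):
N = 32
print(round((2 * N - 1) / steps, 1))        # 10.5
```

The step count 1 + log2 N and the resulting speed-up of 10.5 for N = 32 match the analysis on the next slide.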
An Example

To compute

    Y = sum over i = 1..N of A(i) x B(i)

assume a dedicated memory organization, and that the elements of A and B are properly and perfectly distributed among the processors (the compiler can help here). We then have:
- The N product terms are computed in parallel, in one step.
- The additions can be done in log2 N iterations, in a pair-wise manner.

Speed-up factor (assuming that an addition and a multiplication take the same time):

    S = (2N - 1) / (1 + log2 N)

    N:  32    64   128   256   512   1024
    S:  10.5  18   32    57    102   186

ILLIAC IV

ILLIAC IV is a classical example of an array processor, a typical SIMD computer for array processing:
- 64 Processing Elements (PEs), each with its local memory.
- One single Control Unit (CU).
- The CU can access all memory; the PEs can access their local memory and communicate with their neighbors.
- The CU reads the program and broadcasts the instructions to the PEs.

ILLIAC IV Architecture

[Figure: ILLIAC IV architecture diagram.]

Cray X1: Parallel Vector Machine

Cray combines several technologies in the X1 machine:
- 12.8 GFLOPS high-performance vector processors.
- Shared caches: 4-processor nodes sharing a 2 MB cache and up to 64 GB of memory.
- Multi-streaming vector processing.
- Multiple-node architecture.

Cray X1: Building Block

MSP: Multi-Streaming vector Processor
- Formed by 4 SSPs (each a 2-pipe vector processor).
- Computations are balanced across the SSPs.
- The compiler will try to vectorize/parallelize across the MSP, achieving "streaming."

[Figure: an MSP built from custom blocks: 4 SSPs, each with a scalar unit (S) and two vector pipes (V), delivering 12.8 GFLOPS (64-bit) or 25.6 GFLOPS (32-bit), backed by a 2 MB shared cache (4 x 0.5 MB), with 51 GB/s load and 25-41 GB/s store bandwidth to local memory and the network. Figure source: J.
Levesque, Cray.]

Cray X1: Node

[Figure: a node with 16 processors (P), each with its own cache ($), connected to 16 memory modules (M/mem) and to I/O.]

- Shared memory.
- 32 network links and four I/O links per node.
- A Cray X1 machine consists of 32 such nodes.

Cray X1: Parallelism

There are many levels of parallelism:
- Within a processor: vectorization.
- Within an MSP: streaming.
- Within a node: shared memory.
- Across nodes: message passing.

Some of these are automated by the compiler; others require work by the programmer. This is a common trend: the more complex the architecture, the more difficult it is for the programmer to exploit it.

It is hard to fit this machine into a simple taxonomy:
- Locally: SIMD (vector processing).
- Globally: MIMD.

Most Powerful Supercomputer

Tianhe-2, located at the National Supercomputing Center in Guangzhou, China:
- Performance: 33.86 petaFLOPS (10^15); peak rate: 54.9 petaFLOPS.
- Total memory size: 1,375 terabytes (10^12).
- Power consumption: 17.6 MW.
- A huge number of microprocessors: 16,000 compute nodes with a total of 3,120,000 cores (Intel Xeon Phi).
- Cost is estimated at $390 million.

For comparison, a typical high-end PC: performance 20 gigaFLOPS, 8 cores, 3.5 GHz clock rate, 16 GB of memory. [Triolith, at NSC in Linköping: 407 teraFLOPS, ranked 122nd in the world.]

Growth of Supercomputer Performance

[Figure: TOP500 performance over time. The y-axis shows performance in GFLOPS; the red line denotes the fastest supercomputer, the yellow line no. 500, and the dark blue line the total combined performance of all supercomputers on the TOP500 list.]
Multimedia Extensions

How do we extend general-purpose microprocessors so that they can handle multimedia applications efficiently?

Analysis of the need:
- Video and audio applications very often deal with large arrays of small data types (8 or 16 bits).
- Such applications exhibit a large potential for SIMD (vector) parallelism, i.e., data parallelism.

Solution:
- General-purpose microprocessors are equipped with special instructions to exploit this parallelism.
- The specialized multimedia instructions perform vector computations on bytes, half-words, or words.

Special Instructions

Conventional instruction sets have been extended to improve performance for multimedia applications:
- MMX for the Intel x86 family;
- VIS for UltraSparc;
- MDMX for MIPS; and
- MAX-2 for Hewlett-Packard PA-RISC.

The Pentium line provides 57 MMX instructions, which treat data in a SIMD fashion to improve the performance of:
- Computer-aided design;
- Internet applications;
- Computer visualization;
- Video games; and
- Speech recognition.

Implementation

The basic idea is sub-word execution: use the entire width of the processor data path (e.g., 64 bits) even when processing small data items (8, 12, or 16 bits). With a word size of 64 bits, one adder can be used to implement eight 8-bit additions in parallel.

[Figure: a 64-bit register R1 holding eight 8-bit lanes a7..a0, with the same increment applied to every lane in parallel.]

MMX technology allows a single instruction to work on multiple pieces of data. Consequently, we get a kind of SIMD parallelism, at a reduced scale and with very low cost.

Packed Data Types

Three packed data types are defined for parallel operations: packed byte, packed word, and packed double word.
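The sub-word execution idea, one wide adder performing eight 8-bit additions, can be sketched in Python by packing eight lanes into a 64-bit integer and blocking the carries between lanes with masks. This is only an illustration of the principle; real MMX hardware blocks inter-lane carries with dedicated adder logic, and all helper names below are made up.

```python
# Eight unsigned 8-bit lanes packed into one 64-bit integer, added with a
# single wide addition plus masking so carries cannot cross lane boundaries.

LANE_MASK = 0x7F7F7F7F7F7F7F7F   # low 7 bits of every byte lane
HIGH_BITS = 0x8080808080808080   # bit 7 of every byte lane

def pack(bytes8):
    """Pack eight 8-bit values (lane 0 = least significant) into a 64-bit word."""
    word = 0
    for i, b in enumerate(bytes8):
        word |= (b & 0xFF) << (8 * i)
    return word

def unpack(word):
    return [(word >> (8 * i)) & 0xFF for i in range(8)]

def packed_add(x, y):
    """Eight modulo-256 byte additions done with one wide add."""
    low = (x & LANE_MASK) + (y & LANE_MASK)   # add low 7 bits of each lane
    return low ^ ((x ^ y) & HIGH_BITS)        # fold in bit 7 without carry-out

a = pack([10, 20, 30, 40, 50, 60, 70, 250])
b = pack([1, 2, 3, 4, 5, 6, 7, 10])
print(unpack(packed_add(a, b)))   # [11, 22, 33, 44, 55, 66, 77, 4]
```

The last lane shows wrap-around (250 + 10 = 260 mod 256 = 4), which is exactly the sub-word overflow case the slide warns about: hardware needs extra support to detect or saturate it.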
Packed byte:        q7 q6 q5 q4 q3 q2 q1 q0
Packed word:        q3 q2 q1 q0
Packed double word: q1 q0
Quad word:          q0
                    (64 bits in total)

SIMD Arithmetic Examples

ADD R3, R1, R2: the eight byte lanes are added in parallel:

    R1: a7    a6    a5    a4    a3    a2    a1    a0
    R2: b7    b6    b5    b4    b3    b2    b1    b0
    R3: a7+b7 a6+b6 a5+b5 a4+b4 a3+b3 a2+b2 a1+b1 a0+b0

Hardware support is needed to check for sub-word execution overflow!

MULADD R3, R1, R2: adjacent lanes are multiplied and the products are summed pair-wise:

    R3: (a6xb6)+(a7xb7)  (a4xb4)+(a5xb5)  (a2xb2)+(a3xb3)  (a0xb0)+(a1xb1)

Performance Comparison

The following shows the performance of Pentium processors (32-bit machines) with and without MMX technology:

    Application       Without MMX   With MMX   Speedup
    Video             155.52        268.70     1.72
    Image processing  159.03        743.90     4.67
    3D geometry       161.52        166.44     1.03
    Audio             149.80        318.90     2.13
    OVERALL           156.00        255.43     1.64

Summary

Vector processors are SISD processors whose instruction sets include instructions operating on vectors. They are implemented using pipelined functional units, and they behave like SIMD machines.

Array processors, being typical SIMD machines, execute the same operation on a set of interconnected processing units.

Both vector and array processors are specialized for numerical problems expressed in matrix or vector formats. They are usually integrated inside a large computer.

Many modern architectures, such as the Cray X1, deploy several parallel architecture concepts at the same time.

Multimedia applications exhibit a large potential for SIMD parallelism, which can be accelerated by extending the traditional SISD instruction set and architecture.