A Survey of the Current State of the Art in SIMD
Or, How much wood could a woodchuck chuck if a woodchuck could chuck n pieces of wood in parallel?
Wojtek Rajski, Nels Oscar, David Burri, Alex Diede

Introduction
• We have seen how to improve performance through the exploitation of:
  o Instruction-level parallelism
  o Thread-level parallelism
• One form of parallelism we have not yet discussed is data-level parallelism.

Introduction
• Flynn's Taxonomy
  o An organization of computer architectures based on their instruction and data streams
  o Divides all architectures into 4 categories:
    1. SISD
    2. SIMD
    3. MISD
    4. MIMD

Introduction
• Implementations of SIMD
  o Prevalent in GPUs
  o SIMD extensions in CPUs
  o Embedded systems and mobile platforms

Introduction
• Software for SIMD
  o Many libraries utilize and encapsulate SIMD
  o Adopted in these areas:
    o Graphics
    o Signal processing
    o Video encoding/decoding
    o Some scientific applications

Introduction
• SIMD implementations fall into three high-level categories:
  1. Vector processors
  2. Multimedia extensions
  3. Graphics processors

Introduction
• Going forward:
  o Streaming SIMD extensions (MMX/SSE/AVX)
  o Similar technology in GPUs
  o Compiler techniques for DLP
  o Problems in the world of SIMD

[Figure 4.1: Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for x86 computers. This figure assumes that two cores per chip for MIMD will be added every two years and the number of operations for SIMD will double every four years. Copyright © 2011, Elsevier Inc.]

SIMD in Hardware
• Register size/hardware changes
• Intel Core i7 example
• The 'Roofline' model
• Limitations of streaming extensions in a CPU

SIMD in Hardware
• Streaming SIMD requires some basic components:
  o Wide registers - rather than 32 bits, registers are 64, 128, or 256 bits wide
  o Additional control lines
  o Additional ALUs to handle simultaneous operation on operands of up to 16 bytes

[Figure 4.4: Using multiple functional units to improve the performance of a single vector add instruction, C = A + B. The vector processor (a) on the left has a single add pipeline and can complete one addition per cycle. The vector processor (b) on the right has four add pipelines and can complete four additions per cycle. The elements within a single vector add instruction are interleaved across the four pipelines. The set of elements that move through the pipelines together is termed an element group.]

Intel i7
• The Intel Core i7
  o Superscalar processor
  o Contains several SIMD extensions
  o 16 x 256-bit-wide architectural registers, backed by physical registers in the pipeline
  o Support for 2- and 3-operand instructions

The Roofline Model of Performance
• The Roofline model aggregates floating-point performance, operational intensity, and memory bandwidth into a single bound on attainable performance:
  o Attainable GFLOP/s = min(peak floating-point performance, peak memory bandwidth x operational intensity)

[Roofline plots for the AMD Opteron X2 omitted.]

Limitations
• Memory latency
• Memory bandwidth
• The actual amount of vectorizable code

SIMD at the software level
• SIMD is not a new field, but more focus has been brought to it by the GPGPU movement.

SIMD at the software level
• CUDA
  o Developed by Nvidia
  o Compute Unified Device Architecture
  o Closed to GPUs with chips from Nvidia
  o Graphics cards G8x and newer
  o Provides both high- and low-level APIs

SIMD at the software level
• OpenCL
  o Developed by Apple
  o Open to any vendor that decides to support it
  o Designed to execute across GPUs and CPUs
  o Supported on Nvidia graphics cards G8x and newer
  o Provides both high- and low-level APIs

SIMD at the software level
• DirectCompute
  o Developed by Microsoft
  o Open to any vendor that supports DirectX 11
  o Windows only
  o Nvidia GTX 400 and AMD HD 5000 series graphics cards
  o Intel's Ivy Bridge will also be supported

Compiler Optimization
• Not everyone programs in SIMD-based languages.
• Languages such as C and Java were never designed with SIMD in mind.
• Compiler technology had to improve to catch code with vectorizable instructions.

Compiler Optimization
• Before optimization can begin:
  o Data dependencies have to be understood
  o But only dependencies within the vector window size matter
  o Vector window size - the amount of data executed in parallel by one SIMD instruction

Compiler Optimization
• Before optimization can begin
• Example: a dependence at distance 1 falls inside a 4-wide vector window and makes the packed statements unsafe (marked Wrong), while a dependence at distance 16 falls outside the window and is safe:

  for (int i = 0; i < 16; i++) {
      C[i] = C[i+1];
      C[i] = C[i+16];
  }

  for (int i = 0; i < 16; i += 4) {
      C[i]   = C[i+1];
      C[i+1] = C[i+2];    (Wrong)
      C[i+2] = C[i+3];    (Wrong)
      C[i+3] = C[i+4];    (Wrong)
      C[i]   = C[i+16];
      C[i+1] = C[i+17];
      C[i+2] = C[i+18];
      C[i+3] = C[i+19];
  }

Compiler Optimization
• Framework for vectorization:
  o Prelude
  o Loop
  o Postlude
  o Cleanup

Compiler Optimization
• Framework for vectorization
• Prelude
  o Loop-independent variables are prepared for use.
  o Run-time checks confirm that vectorization is possible.
• Loop
  o Vectorizable instructions are performed in order with the original code.
  o The loop could be split into multiple loops, since vectorizable sections could be separated by more complex code in the original loop.

Compiler Optimization
• Framework for vectorization
• Postlude
  o All loop-independent variables are returned.
• Cleanup
  o Non-vectorizable iterations of the loop are run, including the remainder of vectorizable iterations that do not fit evenly into the vector size.

Compiler Optimization
• Compiler techniques:
  o Loop-level automatic vectorization
  o Basic-block-level automatic vectorization
  o Vectorization in the presence of control flow

Compiler Optimization
• Loop-level automatic vectorization
  1. Find the innermost loop that can be vectorized.
  2. Transform the loop and create vector instructions.

  Original Code:
  for (i = 0; i < 1024; i += 1)
      C[i] = A[i]*B[i];

  Vectorized Code:
  for (i = 0; i < 1024; i += 4) {
      vA = vec_ld(A[i]);
      vB = vec_ld(B[i]);
      vC = vec_mul(vA, vB);
      vec_st(vC, C[i]);
  }
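
The vec_ld/vec_mul/vec_st calls above are generic pseudo-intrinsics. As a rough illustration of what this maps to on a real instruction set, here is the same loop written with x86 SSE compiler intrinsics - a minimal sketch, assuming 16-byte-aligned float arrays (the function name mul1024 and the alignment assumption are ours, not from the slides):

  #include <xmmintrin.h>   /* SSE intrinsics */

  /* C[i] = A[i]*B[i] for 1024 floats, four elements per iteration.
     Assumes A, B, and C are aligned to 16 bytes. */
  void mul1024(const float *A, const float *B, float *C)
  {
      for (int i = 0; i < 1024; i += 4) {   /* vector window of 4 floats */
          __m128 vA = _mm_load_ps(&A[i]);   /* vec_ld  */
          __m128 vB = _mm_load_ps(&B[i]);   /* vec_ld  */
          __m128 vC = _mm_mul_ps(vA, vB);   /* vec_mul */
          _mm_store_ps(&C[i], vC);          /* vec_st  */
      }
  }
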
Compiler Optimization
• Basic-block-level automatic vectorization
  1. The innermost loop is unrolled by the size of the vector window.
  2. Isomorphic scalar instructions are packed into vector instructions.

  Original Code:
  for (i = 0; i < 1024; i += 1)
      C[i] = A[i]*B[i];

  Unrolled Code (before packing):
  for (i = 0; i < 1024; i += 4) {
      C[i]   = A[i]*B[i];
      C[i+1] = A[i+1]*B[i+1];
      C[i+2] = A[i+2]*B[i+2];
      C[i+3] = A[i+3]*B[i+3];
  }

Compiler Optimization
• Vectorization in the presence of control flow
  1. Apply predication.
  2. Apply the method from above.
  3. Remove vector predication.
  4. Remove scalar predication.

  Original Code:
  for (i = 0; i < 1024; i += 1) {
      if (A[i] > 0)
          C[i] = B[i];
      else
          D[i] = D[i-1];
  }

  After Predication:
  for (i = 0; i < 1024; i += 1) {
      P  = A[i] > 0;
      NP = !P;
      C[i] = B[i];      (P)
      D[i] = D[i-1];    (NP)
  }

Compiler Optimization
• Vectorization in the presence of control flow

  After Vectorization:
  for (i = 0; i < 1024; i += 4) {
      vP  = A[i:i+3] > (0,0,0,0);
      vNP = vec_not(vP);
      C[i:i+3] = B[i:i+3];    (vP)
      (NP1,NP2,NP3,NP4) = vNP;
      D[i+3] = D[i+2];    (NP4)
      D[i+2] = D[i+1];    (NP3)
      D[i+1] = D[i];      (NP2)
      D[i]   = D[i-1];    (NP1)
  }

  After Removing Predicates:
  for (i = 0; i < 1024; i += 4) {
      vP  = A[i:i+3] > (0,0,0,0);
      vNP = vec_not(vP);
      C[i:i+3] = vec_sel(C[i:i+3], B[i:i+3], vP);
      (NP1,NP2,NP3,NP4) = vNP;
      if (NP4) D[i+3] = D[i+2];
      if (NP3) D[i+2] = D[i+1];
      if (NP2) D[i+1] = D[i];
      if (NP1) D[i]   = D[i-1];
  }

CPU vs GPU
• The founding of the GPU as we know it today was by Nvidia in 1999.
• Popularity has increased in recent years.
[Images: VisionTek GeForce 256 [Wikipedia]; Nvidia GeForce GTX 590 [Nvidia]]

CPU vs GPU
• Theoretical GFLOP/s and bandwidth
[Figure: Theoretical GFLOP/s and memory bandwidth, CPU vs. GPU. Source: Nvidia, NVIDIA CUDA C Programming Guide]

CPU vs GPU
[Figure: Intel Core i7 Nehalem die shot. Source: NVIDIA's Fermi: The First Complete GPU Computing Architecture]

CPU vs GPU
[Image: the game Little Big Planet. Source: http://trendygamers.com]

CPU vs GPU
• OpenGL graphics pipeline
[Figure: OpenGL graphics pipeline. Source: Wojtek Palubicki; http://pages.cpsc.ucalgary.ca/~wppalubi/]

CPU vs GPU
• CPU SIMD vs. GPU SIMD
  o Intel's Sandy Bridge architecture: 256-bit AVX registers operate on 8 single-precision values in parallel.
  o A CUDA GPU such as Fermi performs up to 512 raw mathematical operations in parallel.

CPU vs GPU
• Nvidia's Fermi
[Image source: http://www.legitreviews.com/article/1193/2/]

CPU vs GPU
• Nvidia's Fermi
[Figure source: Nvidia, NVIDIA's Next Generation CUDA Compute Architecture: Fermi]

Standardization Problems and Industry Challenges
[Image: AMD vs. Intel. Source: Widescreen Wallpapers; http://widescreen.dpiq.org/30__AMD_vs_Intel_Challenge.htm]

Standardization Problems and Industry Challenges
• 1998
  o AMD - 3DNow!
  o Intel - released the SSE instruction set a year later, without supporting 3DNow!
  o Intel won this battle, since SSE was better.

Standardization Problems and Industry Challenges
• 2001
  o Intel - Itanium processor (64-bit, parallel computing instruction set)
  o AMD - its own 64-bit instruction set (backward compatible)
  o AMD won this time because of its backward compatibility.
• 2007
  o AMD - SSE5
  o Intel - AVX

Standardization Problems and Industry Challenges
• Example: fused multiply-add (FMA)
  o d = a + b*c
• AMD
  o Has supported FMA4 since 2011
  o FMA4 - 4-operand form: the destination is a register separate from the three sources
• Intel
  o Will support FMA3 in 2013 with Haswell
  o FMA3 - 3-operand form: the result overwrites one of the three source operands
• A sketch of what this split looks like to the programmer follows below.
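
To make the FMA3/FMA4 split concrete, here is a hedged sketch in C of the same d = a + b*c computation written once per vendor path. This is a minimal sketch, assuming a GCC-style compiler invoked with -mfma (Intel path) or -mfma4 (AMD path); the function names are ours:

  #include <immintrin.h>            /* AVX types; FMA3 intrinsics with -mfma  */
  #ifdef __FMA4__
  #include <x86intrin.h>            /* FMA4 intrinsics with -mfma4 (AMD only) */
  #endif

  /* d = a + b*c on 8 packed floats - the same math, two vendor paths. */

  #ifdef __FMA__                    /* Intel path: FMA3 (Haswell, 2013) */
  __m256 fma3_add(__m256 a, __m256 b, __m256 c)
  {
      /* Compiles to a 3-operand instruction (e.g., vfmadd231ps):
         the result overwrites one of the source registers. */
      return _mm256_fmadd_ps(b, c, a);    /* b*c + a */
  }
  #endif

  #ifdef __FMA4__                   /* AMD path: FMA4 (Bulldozer, 2011) */
  __m256 fma4_add(__m256 a, __m256 b, __m256 c)
  {
      /* Compiles to a 4-operand instruction (vfmaddps dst, b, c, a):
         the result goes to an independent destination register. */
      return _mm256_macc_ps(b, c, a);     /* b*c + a */
  }
  #endif

The same operation thus needs two code paths and two sets of build flags, which is exactly the maintenance burden described on the next slide.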
Standardization Problems and Industry Challenges
• This causes:
  o More work for the programmer
  o Code that is nearly impossible to maintain
• Standardization is required!

Conclusion
• SIMD processors exploit data-level parallelism, increasing performance.
• The hardware requirements are easily met as transistor sizes decrease.
• HPC languages have been created to give programmers access to high- and low-level SIMD operations.

Conclusion
• Compiler technology has improved to recognize some potential SIMD operations in serial code.
• The utility of SIMD instructions in modern microprocessors is diminishing, except in special-purpose applications, due to standardization problems and industry in-fighting.
• The increasing adoption of GPGPU computing has the potential to supplant SIMD-type instructions in the CPU.
• On-chip GPUs appear to be on the horizon, so wider really is better.