VECTOR PROCESSING Τσόλκας Χρήστος & Αντωνίου Χρυσόστομος Contents Introduction Vector Processor Definition Components & Properties of Vector Processors Advantages/Disadvantages of Vector Processors Vector Machines & Architectures Virtual Processors Model Vectorization Inhibitors Improving Performance Vector Metrics Applications 2 Architecture Classification SISD SIMD Single Instruction Multiple Data MIMD Single Instruction Single Data Multiple Instruction Multiple Data MISD Multiple Instruction Single Data 3 Alternative Forms of Machine Parallelism Instruction Level Parallelism (ILP) Thread Level Parallelism (TLP) vector Data Parallelism (DP) 4 Alternative Forms of Machine Parallelism 5 Drawbacks of ILP and TLP Coherency Synchronization Large Overhead instruction fetch and decode: at some point, its hard to fetch and decode more instructions per clock cycle cache hit rate: some long-running (scientific) programs have very large data sets accessed with poor locality; others have continuous data streams (multimedia) and hence poor locality 6 Alternative: Vector Processors 7 What is a Vector Processor? Provides high-level operations that work on vectors Vector is a linear array of numbers Type of number can vary, but usually 64 bit floating point (IEEE 754, 2’s complement) Length of the array also varies depending on hardware Example vectors would be 64 or 128 elements in length Small vectors (e.g. MMX/SSE) are about 4 elements in length 8 Components of Vector Processor Vector Registers Fixed length bank holding a single vector Vector Functional Units Fully pipelined, start new operation every clock Typically 4-8 FUs: FP add, FP mult, FP reciprocal, integer add, logical, shift Scalar Registers Has at least 2 read and 1 write ports Typically 8-32 vector registers, each holding 64-128 64bit elements Single element for FP scalar or address Load Store Units 9 Components of Vector Processor 10 Vector Processor Properties Computation of each result must be independent of previous results Single vector instruction specifies a great deal of work Equivalent to executing an entire loop Vector instructions must access memory in a known access pattern Many control hazards can be avoided since the entire loop is replaced by a vector instruction 11 Advantages of Vector Processors Increase in code density Decrease in total number of instructions Data is organized in patterns which is easier for the hardware to compute Simple loops are replaced with vector instructions, hence decrease in overhead Scalable 12 Disadvantages of Vector Processors Expansion of the Instruction Set Architecture (ISA) is needed Additional vector functional units and registers Modification of the memory system 13 Example Vector Machines Machine Year Clock Regs Elements Cray 1 1976 80 MHz 8 64 Cray XMP 1983120 MHz 8 64 Cray YMP 1988166 MHz 8 64 Cray C-90 1991240 MHz 8 128 Cray T-90 1996455 MHz 8 128 Conv. C-1 1984 10 MHz 8 128 Conv. C-4 1994133 MHz 16 128 Fuj. VP2001982133 MHz 8-256 32-1024 NEC SX/2 1984160 MHz 8+8K 256+var NEC SX/3 1995400 MHz 8+8K 256+var FUs LSUs 6 1 8 2 L, 1 S 8 2 L, 1 S 8 4 8 4 4 1 3 1 3 2 16 8 16 8 14 Vector Instruction Execution Static scheduling Prefetching Dynamic scheduling 15 Styles of Vector Architectures Memory-memory vector processors All vector operations are memory to memory CDC Star 100 Vector-register processors All vector operations between vector registers Vector equivalent of load-store architecture Includes all vector machines since late 1980s Cray, Convex, Fujitsu, Hitachi, NEC 16 Vector-Register Architecture 17 Memory operations Load/store operations move groups of data between registers and memory Three types of addressing Unit stride access Fastest Non-unit (constant) stride access Indexed (gather-scatter) Vector equivalent of register indirect Increases number of programs that vectorize 18 Vector Stride Position of the elements we want in memory may not be sequential Consider following code: Do 10 I=1, 100 Do 10 j =1, 100 A(I,j) = 0.0 Do 10 k =1,100 A(I,j) = A(I,j) + B(I,k)*C(k,j) 10 Continue 19 Virtual Processor Model Vector operations are SIMD (single instruction multiple data)operations Each element is computed by a virtual processor (VP) Number of VPs given by vector length 20 Virtual Processor Model 21 Vectorization Example DO 100 I = 1, N A(I) = B(I) + C(I) 100 CONTINUE Scalar process: 1. B(1) will be fetched from memory 2. C(1) will be fetched from memory 3. A scalar add instruction will operate on B(1) and C(1) 4. A(1) will be stored back to memory 5. Step (1) to (4) will be repeated N times. 22 Vectorization Example DO 100 I = 1, N A(I) = B(I) + C(I) 100 CONTINUE Vector process: 1. A vector of values in B(I) will be fetched from memory 2. A vector of values in C(I) will be fetched from memory. 3. A vector add instruction will operate on pairs of B(I) and C(I) values. 4. After a short start-up time, stream of A(I) values will be stored back to memory, one value every clock cycle. 23 Example (2): Y=aX+Y Scalar Code: LD F0, A ADDI R4,Rx, #512 ; Last addr Loop: LD F2, 0(Rx) MULTD F2, F0, F2 ; A * X[I] LD F4, 0(Ry) ADDD F4, F2, F4 ; + Y[I] SD 0(Ry), F4 ADDI Rx, Rx, #8 ; Inc index ADDI Ry, Ry, #8 SUB R20, R4, Rx BNEZ R20, Loop Vector Code: LD F0, A LV V1, Rx ; Load vecX MULTSV V2, F0, V1 ; Vec Mult LV V3, Ry ; Load vecY ADDV V4, V2, V3 ; Vec Add SV Ry, V4 ; Store result 64 is element size .So we need no loop now 1+5*64=321 operations Loop goes 64 times. 2+9*64=578 operations Vector/Scalar=1.8x 24 Vector Length We would like loops to iterate the same number of times that we have elements in a vector But unlikely in a real program Also the number of iterations might be unknown at compile time Problem: n, number of iterations, greater than MVL (Maximum Vector Length) Solution: Strip Mining Create one loop that iterates a multiple of MVL times Create a final loop that handles any remaining iterations, which must be less than MVL 25 Strip Mining Example low=1 VL = (n mod MVL) ; Find odd-sized piece Do 1 j=0,(n/MVL) ; Outer Loop Do 10 I = low, low+VL-1 ; runs for length VL Y(I) = a*X(I)+Y(I) ; Main operation 10 continue low = low + VL VL = MVL 1 Continue Executes loop in blocks of MVL Inner loop can be vectorized 26 Strip Mining Example low=1 ; low=1 VL = (n mod MVL) ; VL=2 Do 1 j=0,(n/MVL) ; j=1 Do 10 I = low, low+VL-1 ; I=1 .. 2 Y(I) = a*X(I)+Y(I) 10 continue low = low + VL VL = MVL 1 Continue ; Υ(1) and Υ(2) ; ; low=3 ; VL=32 ; 27 Strip Mining Example low=1 ; low=3 VL = (n mod MVL) ; VL=32 Do 1 j=0,(n/MVL) ; j=2 Do 10 I = low, low+VL-1 ; I=3 .. 34 Y(I) = a*X(I)+Y(I) 10 continue low = low + VL VL = MVL 1 Continue ; Υ(3) .. Υ(34) ; ; low=35 ; VL=32 ; 28 Strip Mining Example low=1 ; low=99 VL = (n mod MVL) ; VL=32 Do 1 j=0,(n/MVL) ; j=4 Do 10 I = low, low+VL-1 ; I=99 .. 130 Y(I) = a*X(I)+Y(I) 10 continue low = low + VL VL = MVL 1 Continue ; Υ(99) .. Υ(130) ; ; low=130 ; VL=32 ; 29 Vectorization Inhibitors Subroutine calls I/O Statements Character data Unstructured branches Data dependencies Complicated programming 30 Vectorization Inhibitors Subroutine calls I/O Statements Character data Unstructured branches Data dependencies Complicated programming 31 Subroutine calls Solution: Inline inline double radius( double x, double y, double z ) { return sqrt( x*x + y*y + z*z ); } .. int main() { .. for( int i=1; i<=n; ++i ){ r[i] = radius( x[i], y[i], z[i] ); } .. } Vectorization Inhibitors Subroutine calls I/O Statements Character data Unstructured branches Data dependencies Complicated programming 33 Vectorization Inhibitors Subroutine calls I/O Statements Character data Unstructured branches Data dependencies Complicated programming 34 Vectorization Inhibitors Subroutine calls I/O Statements Character data Unstructured branches Data dependencies Complicated programming 35 Vectorization Inhibitors Subroutine calls I/O Statements Character data Unstructured branches Data dependencies Complicated programming 36 Dependence Example do i=2,n a(i) = a(i-1) enddo do i=2,n temp(i) = a(i-1) ! temporary vector enddo temp(1) = a(1) do i=1,n a(i) = temp(i) enddo 37 Vectorization Inhibitors Subroutine calls I/O Statements Character data Unstructured branches Data dependencies Complicated programming 38 Improving Vector Performance Better compiler techniques Techniques for accessing sparse matrices As with all other techniques, we may be able to rearrange code to increase the amount of vectorization Hardware support to move between dense (no zeros), and normal (include zeros) representations Chaining Same idea as forwarding in pipelining Consider: MULTV V1, V2, V3 ADDV V4, V1, V5 ADDV must wait for MULTV to finish But we could implement forwarding; as each element from the MULTV finishes, send it off to the ADDV to start work 39 Chaining Example 7 64 6 64 Total = 141 Unchained MULTV 7 ADDV 64 Chained MULTV 6 Total = 77 64 ADDV 6 and 7 cycles are start-up-times of the adder and multiplier Every vector processor today performs chaining 40 Improving Performance Conditionally Executed Statements Consider the following loop Do 100 i=1, 64 If (a(i) .ne. 0) then a(i)=a(i)-b(i) Endif 100 continue Not vectorizable due to the conditional statement But we could vectorize this if we could somehow only include in the vector operation those elements where a(i) != 0 41 Conditional Execution Solution: Create a vector mask of bits that corresponds to each vector element 1=apply operation 0=leave alone As long as we properly set the mask first, we can now vectorize the previous loop with the conditional Implemented on most vector processors today 42 Conditional Execution lv v1 ra lv v2 rb id f0 #0 vsnes f0 v1 vsub v1 v1 v2 cvm sv ra v1 ;load vector into v1 ;load vector into v2 ;f0=0 ;set VM to 1 if v1(i)!=0 ;sub. under vector mask ;set vector mask all to 1 ; store the reslult to a Common Vector Metrics Rn: MFLOPS rate on an infinite-length vector N1/2: The vector length needed to reach one-half of Rn Real problems do not have unlimitend vector lengths, and the start-up penalties encountered in real problems will be larger (Rn is the MFLOPS rate for a vector of length n) a good measure of the impact of start-up NV: The vector length needed to make vector mode faster than scalar mode measures both start-up and speed of scalars relative to vectors, quality of connection of scalar unit to vector unit 44 Applications Linear Algebra Image processing (Convolution, Composition, Compressing, etc.) Audio synthesis Compression Cryptography Speech recognition 45 Applications in multimedia Kernel Matrix transpose/multiply DCT (video, communication) FFT (audio) Motion estimation (video) Gamma correction (video) Median filter (image processing) Separable convolution (img. proc.) Vector length # vertices at once image width 256-1024 image width,iw/16 image width image width image width 46 Vector Summary Alternate model,doesn’t rely on caches as does Out-Of-Order and superscalar implementations Handles memory in a more organized way Powerful instructions that replace loops Cope with multimedia applications Ideal architecture for scientific simulation 47