Vector processors [Hennessy & Patterson, §B.1]

Pipelines can be used to process vectors as well as scalar quantities.

• In the vector case, a single instruction operates on more than one set of operands.
• Multiple pipelines may be provided to allow different operations to take place on different streams of data.

The cost/performance ratio of vector processors is impressive, but only for problems that fit their mold.

Vector machines address a problem of large scientific programs: their active data sets are large, so data locality is poor and cache performance suffers. In a vector machine, the compiler can often determine the access pattern and pipeline the memory accesses efficiently.

Several characteristics of vector machines make them capable of high performance:

• The computation of each result is independent of the computation of previous results, so there are no pipeline stalls due to hazards. How is this independence achieved?
• A single vector instruction can do the work of an entire loop.
• Vector instructions process operands that have a regular structure, and can thus be loaded efficiently into registers from interleaved memory.

Basic vector architecture [H&P, §B.2]

A vector machine usually has

• an ordinary pipelined scalar unit, and
• a vector unit.

Most machines have integer, boolean, and floating-point instructions that operate on the vector data.

All vector machines built today are vector-register machines: all instructions except load and store take their operands from vector registers. What would be the alternative to a vector-register machine?

Here is a diagram of a sample vector architecture. It is Hennessy and Patterson's DLXV, and is loosely based on the Cray-1. Its scalar portion is Hennessy and Patterson's DLX (see ECE 521).

[Diagram: main memory feeding a vector load/store unit, vector registers, and scalar registers, with pipelined functional units for FP add/subtract, FP multiply, FP divide, integer, and boolean operations.]

The architecture has these components:

• Vector registers: 8 registers, each of which holds 64 doublewords.
• Vector functional units: each unit is fully pipelined and can start a new operation on each clock cycle. A control unit must detect conflicts for functional units and for data accesses.
• Vector load/store unit: a vector memory unit that loads or stores a vector to or from memory. It moves one word per cycle, after a startup latency.
• Scalar registers: 32 general-purpose and 32 floating-point.

All vector operations in DLXV are double-precision floating point.
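To make the vector-register idea concrete, here is a minimal C sketch (not DLXV code; the names vreg and addv are invented for illustration) of a 64-element double-precision vector register and the semantics of an element-wise vector add. A real vector unit performs this loop in a pipelined functional unit rather than in software.

    #include <stddef.h>

    #define VLEN 64                            /* elements per vector register */

    typedef struct { double e[VLEN]; } vreg;   /* one register: 64 doublewords */

    /* Semantics of ADDV V1,V2,V3: element-wise add of two vector registers.
       One vector instruction does the work of an entire 64-iteration loop,
       and each result is independent of the others, so there are no hazards. */
    static void addv(vreg *v1, const vreg *v2, const vreg *v3)
    {
        for (size_t i = 0; i < VLEN; i++)
            v1->e[i] = v2->e[i] + v3->e[i];
    }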
Here is a table, taken from Hennessy and Patterson (Fig. B.2), giving the characteristics of several vector-register machines. (Element counts are 64-bit elements per register.)

• Cray-1 (1976): 80 MHz; 8 registers of 64 elements; 6 functional units (add, multiply, reciprocal, integer add, logical, shift); 1 load/store unit.
• Cray X-MP (1983) / Cray Y-MP (1988): 120 / 166 MHz; 8 registers of 64 elements; 8 functional units (FP add, FP multiply, FP reciprocal, integer add, 2 logical, shift, population count/parity); 2 load units and 1 store unit.
• Cray-2 (1985): 166 MHz; 8 registers of 64 elements; 5 functional units (FP add, FP multiply, FP reciprocal/sqrt, integer (add, shift, population count), logical); 1 load/store unit.
• Fujitsu VP100/200 (1982): 133 MHz; 8–256 registers of 32–1024 elements; 3 functional units (FP or integer add/logical, multiply, divide); 2 load/store units.
• Hitachi S810/820 (1985): 71 MHz; 32 registers of 256 elements; 4 functional units (2 integer add/logical, 1 multiply-add, 1 multiply/divide-add); 4 load/store units.
• Convex C-1 (1985): 10 MHz; 8 registers of 128 elements; 4 functional units (multiply, add, divide, integer/logical); 1 load/store unit.
• NEC SX/2 (1984): 160 MHz; 8 + 8192 registers of 256 elements (variable); 16 functional units (4 integer add/logical, 4 FP multiply/divide, 4 FP add, 4 shift); 8 load/store units.
• DLXV (1990): 200 MHz; 8 registers of 64 elements; 5 functional units (multiply, divide, add, integer add, logical); 1 load/store unit.
• Cray C-90 (1991): 240 MHz; 8 registers of 128 elements; 8 functional units (FP add, FP multiply, FP reciprocal, integer add, 2 logical, shift, population count/parity); 4 load/store units.
• Convex C-4 (1994): 135 MHz; 16 registers of 128 elements; 3 functional units (each is full integer, logical, and FP, including multiply-add).
• NEC SX/4 (1995): 400 MHz; 8 + 8192 registers of 256 elements (variable); 16 functional units (4 integer add/logical, 4 FP multiply/divide, 4 FP add, 4 shift); 8 load/store units.
• Cray J-90 (1995): 100 MHz; 8 registers of 64 elements; 4 functional units (FP add, FP multiply, FP reciprocal, integer/logical).
• Cray T-90 (1996): ≈500 MHz; 8 registers of 128 elements; 8 functional units (FP add, FP multiply, FP reciprocal, integer add, 2 logical, shift, population count/parity); 4 load/store units.

Instruction set

There are five kinds of vector instructions, depending on whether the operands and results are vectors or scalars:

Type   Operands → result   Example
f1     V → V               LV
f2     V → S               POP
f3     V × V → V           ADDV
f4     S × V → V           ADDSV
f5     V × V → S           S_ _V

Two special-purpose registers are also needed:

• VLR, the vector-length register, and
• VM, the vector-mask register.

Here are the vector instructions in DLXV:

Opcode   Operands       Function
ADDV     V1, V2, V3     Add elements of V2 and V3; put each result in V1.
ADDSV    V1, V2, F0     Add F0 to each element of V2; put each result in V1.
SUBV     V1, V2, V3     Subtract elements of V3 from V2; put each result in V1.
SUBVS    V1, V2, F0     Subtract F0 from each element of V2; put each result in V1.
SUBSV    V1, F0, V2     Subtract each element of V2 from F0; put each result in V1.
MULTV    V1, V2, V3     Multiply elements of V2 and V3; put each result in V1.
MULTSV   V1, F0, V2     Multiply F0 by each element of V2; put each result in V1.
DIVV     V1, V2, V3     Divide elements of V2 by elements of V3; put each result in V1.
DIVVS    V1, V2, F0     Divide each element of V2 by F0; put each result in V1.
DIVSV    V1, F0, V2     Divide F0 by each element of V2; put each result in V1.
LV       V1, R1         Load vector register V1 from memory starting at address R1.
SV       R1, V1         Store vector register V1 into memory starting at address R1.
LVWS     V1, (R1,R2)    Load V1 from address R1 with stride R2; element i comes from R1 + i*R2.
SVWS     (R1,R2), V1    Store V1 at address R1 with stride R2; element i goes to R1 + i*R2.
LVI      V1, (R1+V2)    Load V1 with the vector whose elements are at R1 + V2(i); V2 is an index vector.
SVI      (R1+V2), V1    Store V1 to the addresses R1 + V2(i).
CVI      V1, R1         Create an index vector by storing the values 0, 1*R1, 2*R1, …, 63*R1 into V1.
S_ _V    V1, V2         Compare (=, ≠, >, <, ≥, ≤) the elements of V1 and V2. If the condition is true, put a 1 in the corresponding bit position, otherwise a 0; put the resulting bit vector into the vector-mask register (VM).
S_ _SV   F0, V1         Perform the same compare, but using the scalar F0 as one operand.
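The stride, index, and compare instructions are the easiest to misread, so here is a hedged C sketch of their semantics (the function names load_stride, load_gather, and set_mask_gt are invented; indices and strides are expressed in elements rather than bytes to keep the sketch simple):

    #define VLEN 64

    /* LVWS V1,(R1,R2): element i of V1 comes from address R1 + i*R2. */
    static void load_stride(double v1[VLEN], const double *r1, long stride)
    {
        for (int i = 0; i < VLEN; i++)
            v1[i] = r1[i * stride];
    }

    /* LVI V1,(R1+V2): element i of V1 comes from address R1 + V2[i];
       V2 is an index vector (a gather), e.g. one produced by CVI.     */
    static void load_gather(double v1[VLEN], const double *r1, const long v2[VLEN])
    {
        for (int i = 0; i < VLEN; i++)
            v1[i] = r1[v2[i]];
    }

    /* SGTV V1,V2 (one of the S_ _V compares): set bit i of the
       vector-mask register if V1[i] > V2[i].                          */
    static void set_mask_gt(unsigned char vm[VLEN],
                            const double v1[VLEN], const double v2[VLEN])
    {
        for (int i = 0; i < VLEN; i++)
            vm[i] = (v1[i] > v2[i]) ? 1 : 0;
    }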
Opcode   Operands   Function
POP      R1, VM     Count the 1s in VM and store the count in R1.
CVM                 Set the vector-mask register to all 1s.
MOVI2S   VLR, R1    Move the contents of R1 to the vector-length register.
MOVS2I   R1, VLR    Move the contents of the vector-length register to R1.
MOVF2S   VM, F0     Move the contents of F0 to the vector-mask register.
MOVS2F   F0, VM     Move the contents of the vector-mask register to F0.

Let us take a look at a sample program. We want to calculate

    Y := a × X + Y

X and Y are vectors, initially in memory, and a is a scalar. This is the DAXPY (double-precision a times X plus Y) operation that forms the inner loop of the Linpack benchmark's Gaussian-elimination routine.

For now, we will assume that the length of the vectors is 64, the same as the length of a vector register. We assume that the starting addresses of X and Y are in RX and RY, respectively.

Here is the code for DLX (the non-vector architecture):

     1.        LD     F0, a          Load scalar a
     2.        ADDI   R4, RX, #512   Last address to load
        loop:
     3.        LD     F2, 0(RX)      Load X[i]
     4.        MULTD  F2, F0, F2     a × X[i]
     5.        LD     F4, 0(RY)      Load Y[i]
     6.        ADDD   F4, F2, F4     a × X[i] + Y[i]
     7.        SD     F4, 0(RY)      Store into Y[i]
     8.        ADDI   RX, RX, #8     Increment index for X
     9.        ADDI   RY, RY, #8     Increment index for Y
    10.        SUB    R20, R4, RX    Compute bound
    11.        BNZ    R20, loop      Check if done

Here is the DLXV code.

     1.   LD      F0, a           Load scalar a
     2.   LV      V1, RX          Load vector X
     3.   MULTSV  V2, F0, V1      Vector-scalar multiply
     4.   LV      V3, RY          Load vector Y
     5.   ADDV    V4, V2, V3      Add
     6.   SV      RY, V4          Store the result

In going from the scalar to the vector code, the number of instructions executed is reduced from almost 600 to 6. There are two reasons for this:

• Each vector instruction operates on all 64 elements, so the loop that executes 64 times in the scalar code is replaced by a handful of vector instructions.
• The loop-overhead instructions (incrementing the indices, computing the bound, and branching), which make up nearly half of the scalar loop, have no counterpart in the vector code.

What can you say about pipeline interlocks? How does the change from scalar to vector code affect them?

Vector execution time

The execution time of a sequence of vector operations depends on

• the length of the vectors being operated on,
• structural hazards (conflicts for resources) between operations, and
• data dependencies.

If two instructions contain no data dependencies or structural hazards, they could potentially begin execution on the same clock cycle. (Most architectures can initiate only one operation per clock cycle, but for reasonably long vectors, the execution time of a sequence depends much more on the length of the vectors than on whether more than one operation can be initiated per cycle.)

Hennessy and Patterson introduce the term convoy to refer to the set of vector instructions that could potentially begin execution in a single clock period. It provides an approximation to the time required for a sequence of vector operations.

A chime is the time it takes to execute one convoy (the time is called 1 chime, regardless of the vector length). Thus,

• a vector sequence that consists of m convoys executes in m chimes;
• if the vector length is n, this is approximately m × n clock cycles.

Measuring in terms of chimes explicitly ignores processor-specific overheads. Because most architectures can initiate only one vector operation per cycle, the chime count underestimates the actual execution time.

Example: Consider this code:

    LV      V1, RX          Load vector X
    MULTSV  V2, F0, V1      Vector-scalar multiply
    LV      V3, RY          Load vector Y
    ADDV    V4, V2, V3      Add
    SV      RY, V4          Store the result

How many chimes will this sequence take? How many chimes per FLOP are there?

The chime approximation is reasonably accurate for long vectors. For 64-element vectors, four chimes ≈ 256 clock cycles.
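As a rough way to apply the chime model in code, here is a small C sketch (the function name chime_cycles is invented; it simply encodes the m × n approximation under the assumptions above):

    /* Chime model: m convoys operating on n-element vectors take roughly
       m * n clock cycles, ignoring startup and instruction-issue overhead. */
    static long chime_cycles(int m_convoys, int n_elements)
    {
        return (long)m_convoys * (long)n_elements;
    }

    /* The sequence above forms four convoys:
         1. LV     2. MULTSV LV     3. ADDV     4. SV
       so chime_cycles(4, 64) = 256 cycles, or 4 cycles per result.         */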
The overhead of issuing Convoy 2 in two separate clocks (one for the MULTSV, one for the LV) is small.

Startup time

The startup time is the chief overhead ignored by the chime model. For load and store operations, it is usually longer than for other operations.

For loads, the startup time is

• up to 50 clock cycles on some machines,
• 9 to 17 clock cycles on the Cray-1 and Cray X-MP, and
• 12 clock cycles on DLXV.

For stores, startup time makes less difference:

• Stores do not produce results, so it is not necessary to wait for them to finish,
• unless, e.g., a load has to wait for a store to finish because there is only one memory pipeline.

Here are the startup penalties for DLXV vector operations:

Operation         Startup penalty (cycles)
Vector add         6
Vector multiply    7
Vector divide     20
Vector load       12

Assuming these times, how long will it really take to execute our four-convoy example, above?

Convoy          Start time     First-result time   Last-result time
1. LV           0              12                  11 + n
2. MULTSV LV    12 + n         12 + n + 12         24 + 2n
3. ADDV         25 + 2n        25 + 2n + 6         30 + 3n
4. SV           31 + 3n        31 + 3n + 12        42 + 4n

Thus, assuming that no other instruction can overlap the final SV, we have an execution time of 42 + 4n. For a vector of length 64, this evaluates to 42 + 4 × 64 = 298 clock cycles.

The number of clock cycles per result is therefore 298/64 ≈ 4.7. The chime approximation is 4 clock cycles per result. Thus the execution time with startup is about 1.16 times higher.

Effect of the memory system

To maintain a rate of one word per cycle, the memory system must be able to produce or accept data this fast. This is done by creating separate memory banks that can be accessed in parallel, much like interleaving. Memory banks can be implemented in two ways:

• Parallel access: access all the banks in parallel, latch the results, then transfer them one by one. This is essentially S-access interleaving.
• Independent bank phasing: on the first access, access all the banks in parallel, then transfer the words one at a time; each bank begins a new access as soon as its previous access is finished. This is essentially C-access interleaving.

If an initiation rate of one word per clock cycle is to be maintained, it must be true that

    number of memory banks ≥ memory access time in clock cycles.

Example: We want to fetch

• a vector of 64 elements,
• starting at byte address 136,
• with each memory access taking 6 "clocks."

Then,

• How many memory banks do we need?
• What addresses are used to access the banks?
• When will each element arrive at the CPU?

Since an access takes 6 clocks, we need at least 6 banks; we use 8, a power of 2, so that references can proceed at one per clock cycle. We will employ independent bank phasing.

Here is a table showing the memory addresses (in bytes) by bank number and by the time at which the access begins.

                                  Bank
Beginning
at cycle #     0     1     2     3     4     5     6     7
     0        192   136   144   152   160   168   176   184
     6        256   200   208   216   224   232   240   248
    14        320   264   272   280   288   296   304   312
    22        384   328   336   344   352   360   368   376
     …         …     …     …     …     …     …     …     …

Note that:

• Accesses begin 8 bytes apart, since each element is an 8-byte doubleword, so consecutive elements fall in consecutive banks.
• The exact time at which a bank transmits its data is given by (address − starting address)/8 + the memory latency. The memory latency here is 6 cycles. (The sketch after these bullets works this arithmetic out.)
• The first access is to bank 1. Why?
• Like C-access interleaving, this scheme requires an address latch for each bank.
• Why does the second go-round of accesses start at time 6 rather than time 8?
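Here is the promised C sketch of the bank arithmetic (nothing DLXV-specific; the constants simply restate the assumptions of the example: 8 banks, 8-byte elements, a 6-cycle access time, and a starting byte address of 136):

    #include <stdio.h>

    #define NBANKS      8
    #define ELEM_BYTES  8
    #define LATENCY     6     /* memory access time in clock cycles */

    int main(void)
    {
        long start = 136;                    /* starting byte address */
        for (int i = 0; i < 64; i++) {
            long addr = start + (long)i * ELEM_BYTES;
            int  bank = (int)((addr / ELEM_BYTES) % NBANKS);    /* 136 -> bank 1  */
            long when = (addr - start) / ELEM_BYTES + LATENCY;  /* delivery cycle */
            printf("element %2d: address %3ld, bank %d, delivered at cycle %2ld\n",
                   i, addr, bank, when);
        }
        return 0;
    }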
Here is a time line showing when each access occurs:

Time  0:  memory access for the first 8 words begins.
Time  6:  the next memory access begins, and the previous 8 words are delivered.
Time 14:  the next memory access begins, and the previous 8 words are delivered.
Time 22:  the next memory access begins, and the previous 8 words are delivered (and so on, every 8 cycles).
Time 62:  the last 8 words are delivered; the 64-element load finishes by cycle 70.
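To check that this timeline follows from the assumptions above, here is a short C sketch (names are invented) that prints when each round of 8 accesses starts and when its words reach the CPU:

    #include <stdio.h>

    int main(void)
    {
        const int latency = 6, words_per_round = 8, total_words = 64;

        for (int r = 0; r * words_per_round < total_words; r++) {
            /* Round 0 starts at cycle 0; each later round starts as soon as a
               bank is free, i.e., when the previous round's delivery begins.  */
            int access_start  = (r == 0) ? 0 : latency + (r - 1) * words_per_round;
            int deliver_start = latency + r * words_per_round;
            printf("round %d: access starts at cycle %2d, words delivered in cycles %2d-%2d\n",
                   r, access_start, deliver_start, deliver_start + words_per_round - 1);
        }
        return 0;
    }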