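§ Algorithms for Computer arithmetic pipelining:

Pipelined vector computers

Multi-stage pipelines are like assembly lines in automobile factories. The Cray-1, Cray X-MP, Cyber 205 and CONVEX are pipelined vector computers. Vector computers provide vector operations, so programs should be "vectorized" to exploit them.

The addition of 2 vectors:

[Figure: a k-stage pipeline D1, D2, ..., Dk adding the vectors x and y. At t=1 the pair (x1, y1) enters D1; at t=2 the pair (x2, y2) enters D1 while x1+y1 moves on to D2; at t=k stage Dk holds x1+y1; from then on one sum leaves the pipeline per time step, ending with xn+yn.]

We do not use many processors to perform vector operations concurrently. Because of pipelining, a vector operation can still be completed in a rather short time.

Vector and matrix operations:
- Scalar-vector operation
- Vector-vector operation
- Scalar-matrix operation
- Matrix-matrix operation

A matrix can be viewed as a very long vector.

When performing matrix-matrix multiplication we need the following 2 operations:

(1) vector-vector multiplication (element by element):

$$\begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix} * \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix} = \begin{pmatrix} a_1 b_1 \\ a_2 b_2 \\ \vdots \\ a_n b_n \end{pmatrix}$$

(2) summation of a vector, where $a_{12} = a_1 + a_2$, $a_{1234} = a_1 + a_2 + a_3 + a_4$, and so on:

$$\begin{pmatrix} a_1 \\ a_3 \\ \vdots \\ a_{n-1} \end{pmatrix} + \begin{pmatrix} a_2 \\ a_4 \\ \vdots \\ a_n \end{pmatrix} = \begin{pmatrix} a_{12} \\ a_{34} \\ \vdots \\ a_{n-1,n} \end{pmatrix}$$

$$\begin{pmatrix} a_{12} \\ a_{56} \\ \vdots \\ a_{n-3,n-2} \end{pmatrix} + \begin{pmatrix} a_{34} \\ a_{78} \\ \vdots \\ a_{n-1,n} \end{pmatrix} = \begin{pmatrix} a_{1234} \\ a_{5678} \\ \vdots \\ a_{n-3,n-2,n-1,n} \end{pmatrix}$$

The halving continues until $a_{1234 \ldots n-1,n}$ is obtained.

The prefix sum problem

$$s_1 = a_1, \quad s_2 = a_1 + a_2, \quad s_3 = a_1 + a_2 + a_3, \quad \ldots, \quad s_n = a_1 + a_2 + \cdots + a_n$$

For $n = 8$, three vector additions suffice. Here $a_{ij}$ abbreviates $a_i + a_{i+1} + \cdots + a_j$, and $s_1 = a_1$ needs no addition.

1st addition (yields $s_2 = a_{12}$):

$$\begin{pmatrix} a_1 \\ a_3 \\ a_5 \\ a_7 \end{pmatrix} + \begin{pmatrix} a_2 \\ a_4 \\ a_6 \\ a_8 \end{pmatrix} = \begin{pmatrix} a_{12} \\ a_{34} \\ a_{56} \\ a_{78} \end{pmatrix}$$

2nd addition (yields $s_3 = a_{13}$ and $s_4 = a_{14}$):

$$\begin{pmatrix} a_{12} \\ a_{12} \\ a_{56} \\ a_{56} \end{pmatrix} + \begin{pmatrix} a_3 \\ a_{34} \\ a_7 \\ a_{78} \end{pmatrix} = \begin{pmatrix} a_{13} \\ a_{14} \\ a_{57} \\ a_{58} \end{pmatrix}$$

3rd addition (yields $s_5, s_6, s_7, s_8$):

$$\begin{pmatrix} a_{14} \\ a_{14} \\ a_{14} \\ a_{14} \end{pmatrix} + \begin{pmatrix} a_5 \\ a_{56} \\ a_{57} \\ a_{58} \end{pmatrix} = \begin{pmatrix} a_{15} \\ a_{16} \\ a_{17} \\ a_{18} \end{pmatrix} = \begin{pmatrix} s_5 \\ s_6 \\ s_7 \\ s_8 \end{pmatrix}$$

The first order linear recurrence problem

Given $x_0$, $a_1, a_2, \ldots, a_n$ and $b_1, b_2, \ldots, b_n$, find

$$x_1 = a_1 x_0 + b_1, \quad x_2 = a_2 x_1 + b_2, \quad x_3 = a_3 x_2 + b_3, \quad \ldots, \quad x_n = a_n x_{n-1} + b_n$$

Consider $n = 8$.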
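As a rough illustration of this recurrence, the sketch below compares a direct sequential evaluation with a log-step evaluation that composes the maps x -> a_i x + b_i in blocks of doubling size, the same pattern as the prefix-sum additions above. The function names are hypothetical, and plain Python lists stand in for vector registers; each list comprehension in the loop plays the role of one or two vector operations. It is a sketch under these assumptions, not a definitive implementation.

```python
def recurrence_sequential(x0, a, b):
    """Reference: n dependent steps, x_i = a_i * x_{i-1} + b_i."""
    xs, prev = [], x0
    for ai, bi in zip(a, b):
        prev = ai * prev + bi
        xs.append(prev)
    return xs


def recurrence_log_steps(x0, a, b):
    """Log-step evaluation (hypothetical helper, not from the slides).

    Position i (0-based) holds coefficients (a[i], b[i]) such that
    x_{i+1} = a[i] * x_base + b[i]. Initially x_base is the previous
    element; after the loop x_base is x0 for every position.
    """
    a, b = list(a), list(b)
    n = len(a)
    h = 1
    while h < n:
        # Upper half of every block of size 2h ...
        upper = [i for i in range(n) if (i // h) % 2 == 1]
        # ... is rewritten through the last element of the lower half.
        pivot = [(i // h) * h - 1 for i in upper]
        # "Vector" multiply-add for b, then "vector" multiply for a.
        new_b = [a[i] * b[p] + b[i] for i, p in zip(upper, pivot)]
        new_a = [a[i] * a[p] for i, p in zip(upper, pivot)]
        for i, na, nb in zip(upper, new_a, new_b):
            a[i], b[i] = na, nb
        h *= 2
    # Every x_i is now a function of x0 only.
    return [ai * x0 + bi for ai, bi in zip(a, b)]
```

For n = 8 the loop body runs three times (h = 1, 2, 4), so both functions can be called on the same inputs to check that they agree.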
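The vectorized algorithm:

$$\begin{aligned}
x_1 &= a_1 x_0 + b_1 & x_5 &= a_5 x_4 + b_5 \\
x_2 &= a_2 x_1 + b_2 & x_6 &= a_6 x_5 + b_6 \\
x_3 &= a_3 x_2 + b_3 & x_7 &= a_7 x_6 + b_7 \\
x_4 &= a_4 x_3 + b_4 & x_8 &= a_8 x_7 + b_8
\end{aligned}$$

Step 1:

$$\begin{aligned}
x_2 &= a_2 x_1 + b_2 = a_2 a_1 x_0 + (a_2 b_1 + b_2) \\
x_4 &= a_4 x_3 + b_4 = a_4 a_3 x_2 + (a_4 b_3 + b_4) \\
x_6 &= a_6 x_5 + b_6 = a_6 a_5 x_4 + (a_6 b_5 + b_6) \\
x_8 &= a_8 x_7 + b_8 = a_8 a_7 x_6 + (a_8 b_7 + b_8)
\end{aligned}$$

We need the following operations:

$$\begin{pmatrix} a_2 \\ a_4 \\ a_6 \\ a_8 \end{pmatrix} * \begin{pmatrix} a_1 \\ a_3 \\ a_5 \\ a_7 \end{pmatrix}, \qquad
\begin{pmatrix} a_2 \\ a_4 \\ a_6 \\ a_8 \end{pmatrix} * \begin{pmatrix} b_1 \\ b_3 \\ b_5 \\ b_7 \end{pmatrix}, \qquad
\begin{pmatrix} a_2 b_1 \\ a_4 b_3 \\ a_6 b_5 \\ a_8 b_7 \end{pmatrix} + \begin{pmatrix} b_2 \\ b_4 \\ b_6 \\ b_8 \end{pmatrix}$$

Re-express:

$$\begin{aligned}
x_1 &= a_1' x_0 + b_1' & x_5 &= a_5' x_4 + b_5' \\
x_2 &= a_2' x_0 + b_2' & x_6 &= a_6' x_4 + b_6' \\
x_3 &= a_3' x_2 + b_3' & x_7 &= a_7' x_6 + b_7' \\
x_4 &= a_4' x_2 + b_4' & x_8 &= a_8' x_6 + b_8'
\end{aligned}$$

Step 2:

$$\begin{aligned}
x_3 &= a_3' a_2' x_0 + (a_3' b_2' + b_3') \\
x_4 &= a_4' a_2' x_0 + (a_4' b_2' + b_4') \\
x_7 &= a_7' a_6' x_4 + (a_7' b_6' + b_7') \\
x_8 &= a_8' a_6' x_4 + (a_8' b_6' + b_8')
\end{aligned}$$

We need the following operations:

$$\begin{pmatrix} a_3' \\ a_4' \\ a_7' \\ a_8' \end{pmatrix} * \begin{pmatrix} a_2' \\ a_2' \\ a_6' \\ a_6' \end{pmatrix}, \qquad
\begin{pmatrix} a_3' \\ a_4' \\ a_7' \\ a_8' \end{pmatrix} * \begin{pmatrix} b_2' \\ b_2' \\ b_6' \\ b_6' \end{pmatrix}, \qquad
\begin{pmatrix} a_3' b_2' \\ a_4' b_2' \\ a_7' b_6' \\ a_8' b_6' \end{pmatrix} + \begin{pmatrix} b_3' \\ b_4' \\ b_7' \\ b_8' \end{pmatrix}$$

Re-express:

$$\begin{aligned}
x_1 &= a_1'' x_0 + b_1'' & x_5 &= a_5'' x_4 + b_5'' \\
x_2 &= a_2'' x_0 + b_2'' & x_6 &= a_6'' x_4 + b_6'' \\
x_3 &= a_3'' x_0 + b_3'' & x_7 &= a_7'' x_4 + b_7'' \\
x_4 &= a_4'' x_0 + b_4'' & x_8 &= a_8'' x_4 + b_8''
\end{aligned}$$

Step 3:

$$\begin{aligned}
x_5 &= a_5'' a_4'' x_0 + (a_5'' b_4'' + b_5'') \\
x_6 &= a_6'' a_4'' x_0 + (a_6'' b_4'' + b_6'') \\
x_7 &= a_7'' a_4'' x_0 + (a_7'' b_4'' + b_7'') \\
x_8 &= a_8'' a_4'' x_0 + (a_8'' b_4'' + b_8'')
\end{aligned}$$

We need the following operations:

$$\begin{pmatrix} a_5'' \\ a_6'' \\ a_7'' \\ a_8'' \end{pmatrix} * \begin{pmatrix} a_4'' \\ a_4'' \\ a_4'' \\ a_4'' \end{pmatrix}, \qquad
\begin{pmatrix} a_5'' \\ a_6'' \\ a_7'' \\ a_8'' \end{pmatrix} * \begin{pmatrix} b_4'' \\ b_4'' \\ b_4'' \\ b_4'' \end{pmatrix}, \qquad
\begin{pmatrix} a_5'' b_4'' \\ a_6'' b_4'' \\ a_7'' b_4'' \\ a_8'' b_4'' \end{pmatrix} + \begin{pmatrix} b_5'' \\ b_6'' \\ b_7'' \\ b_8'' \end{pmatrix}$$

Re-express:

$$x_i = a_i''' x_0 + b_i''', \quad 1 \le i \le 8$$

After each step, every $x_i$ is a function of the following value:

        Original   Step 1   Step 2   Step 3
x1      x0         x0       x0       x0
x2      x1         x0       x0       x0
x3      x2         x2       x0       x0
x4      x3         x2       x0       x0
x5      x4         x4       x4       x0
x6      x5         x4       x4       x0
x7      x6         x6       x4       x0
x8      x7         x6       x4       x0

In total: O(log n) vector operations.

Banded matrix operations

A banded matrix:

$$A = \begin{pmatrix}
a_{11} & a_{12} & & & \\
a_{21} & a_{22} & a_{23} & & \\
 & a_{32} & a_{33} & a_{34} & \\
 & & a_{43} & a_{44} & a_{45} \\
 & & & a_{54} & a_{55}
\end{pmatrix}$$

Multiplication of a banded matrix and a vector:

$$\begin{pmatrix}
a_{11} & a_{12} & & & \\
a_{21} & a_{22} & a_{23} & & \\
 & a_{32} & a_{33} & a_{34} & \\
 & & a_{43} & a_{44} & a_{45} \\
 & & & a_{54} & a_{55}
\end{pmatrix}
\begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \end{pmatrix}$$

Computed row by row, there may be many vector multiplications, one per row, such as:

$$\begin{pmatrix} a_{11} & a_{12} & 0 & 0 & 0 \end{pmatrix} \begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \end{pmatrix}$$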
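The vector extracted from matrix A contains many zeros. The full product is

$$\begin{pmatrix}
a_{11} & a_{12} & & & \\
a_{21} & a_{22} & a_{23} & & \\
 & a_{32} & a_{33} & a_{34} & \\
 & & a_{43} & a_{44} & a_{45} \\
 & & & a_{54} & a_{55}
\end{pmatrix}
\begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \end{pmatrix}
=
\begin{pmatrix}
a_{11} b_1 + a_{12} b_2 \\
a_{21} b_1 + a_{22} b_2 + a_{23} b_3 \\
a_{32} b_2 + a_{33} b_3 + a_{34} b_4 \\
a_{43} b_3 + a_{44} b_4 + a_{45} b_5 \\
a_{54} b_4 + a_{55} b_5
\end{pmatrix}$$

The result can be obtained diagonal by diagonal, with no wasted multiplications inside the band:

$$\begin{pmatrix} a_{11} \\ a_{22} \\ a_{33} \\ a_{44} \\ a_{55} \end{pmatrix} * \begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \end{pmatrix}
+ \begin{pmatrix} 0 \\ a_{21} \\ a_{32} \\ a_{43} \\ a_{54} \end{pmatrix} * \begin{pmatrix} 0 \\ b_1 \\ b_2 \\ b_3 \\ b_4 \end{pmatrix}
+ \begin{pmatrix} a_{12} \\ a_{23} \\ a_{34} \\ a_{45} \\ 0 \end{pmatrix} * \begin{pmatrix} b_2 \\ b_3 \\ b_4 \\ b_5 \\ 0 \end{pmatrix}$$

Multiplication of two banded matrices

$$\begin{pmatrix}
a_{11} & a_{12} & & & \\
a_{21} & a_{22} & a_{23} & & \\
 & a_{32} & a_{33} & a_{34} & \\
 & & a_{43} & a_{44} & a_{45} \\
 & & & a_{54} & a_{55}
\end{pmatrix}
\begin{pmatrix}
b_{11} & b_{12} & & & \\
b_{21} & b_{22} & b_{23} & & \\
 & b_{32} & b_{33} & b_{34} & \\
 & & b_{43} & b_{44} & b_{45} \\
 & & & b_{54} & b_{55}
\end{pmatrix}$$

The product is computed one diagonal at a time. The main diagonal $C_0$ is

$$C_0 = \begin{pmatrix} 0 \\ a_{21} b_{12} \\ a_{32} b_{23} \\ a_{43} b_{34} \\ a_{54} b_{45} \end{pmatrix}
+ \begin{pmatrix} a_{11} b_{11} \\ a_{22} b_{22} \\ a_{33} b_{33} \\ a_{44} b_{44} \\ a_{55} b_{55} \end{pmatrix}
+ \begin{pmatrix} a_{12} b_{21} \\ a_{23} b_{32} \\ a_{34} b_{43} \\ a_{45} b_{54} \\ 0 \end{pmatrix}$$

that is,

$$C_0 = \begin{pmatrix} 0 \\ a_{21} \\ a_{32} \\ a_{43} \\ a_{54} \end{pmatrix} * \begin{pmatrix} 0 \\ b_{12} \\ b_{23} \\ b_{34} \\ b_{45} \end{pmatrix}
+ \begin{pmatrix} a_{11} \\ a_{22} \\ a_{33} \\ a_{44} \\ a_{55} \end{pmatrix} * \begin{pmatrix} b_{11} \\ b_{22} \\ b_{33} \\ b_{44} \\ b_{55} \end{pmatrix}
+ \begin{pmatrix} a_{12} \\ a_{23} \\ a_{34} \\ a_{45} \\ 0 \end{pmatrix} * \begin{pmatrix} b_{21} \\ b_{32} \\ b_{43} \\ b_{54} \\ 0 \end{pmatrix}$$

The first and second subdiagonals are

$$C_1 = \begin{pmatrix} a_{22} \\ a_{33} \\ a_{44} \\ a_{55} \end{pmatrix} * \begin{pmatrix} b_{21} \\ b_{32} \\ b_{43} \\ b_{54} \end{pmatrix}
+ \begin{pmatrix} a_{21} \\ a_{32} \\ a_{43} \\ a_{54} \end{pmatrix} * \begin{pmatrix} b_{11} \\ b_{22} \\ b_{33} \\ b_{44} \end{pmatrix},
\qquad
C_2 = \begin{pmatrix} a_{32} \\ a_{43} \\ a_{54} \end{pmatrix} * \begin{pmatrix} b_{21} \\ b_{32} \\ b_{43} \end{pmatrix}$$

$C_{-1}$ and $C_{-2}$ (the superdiagonals) can be calculated similarly.

Linear systems

$Ax = y$, where
A: an $n \times n$ matrix
x: the unknown column vector
y: a column vector

If A is dense, we can apply the Gaussian elimination method to design a vectorized algorithm ($O(n^2)$ vector operations).

Banded triangular matrix

e.g. a $6 \times 6$ banded triangular matrix with bandwidth 3:

$$\begin{pmatrix}
a_{11} & & & & & \\
a_{21} & a_{22} & & & & \\
a_{31} & a_{32} & a_{33} & & & \\
 & a_{42} & a_{43} & a_{44} & & \\
 & & a_{53} & a_{54} & a_{55} & \\
 & & & a_{64} & a_{65} & a_{66}
\end{pmatrix}$$

This corresponds to a second order linear recurrence problem:

$$x_1 = \frac{y_1}{a_{11}}, \qquad x_2 = \frac{y_2 - a_{21} x_1}{a_{22}},$$

and

$$x_i = \frac{1}{a_{ii}} \left( y_i - a_{i(i-1)} x_{i-1} - a_{i(i-2)} x_{i-2} \right) \quad \text{for } 3 \le i \le 6.$$

A linear system whose coefficient matrix is banded triangular with bandwidth w can be transformed into a (w-1)th order linear recurrence problem, which takes O(log n) vector operations (similar to the first order linear recurrence problem).

Tridiagonal linear systems: method 1

e.g. a $4 \times 4$ tridiagonal matrix:

$$A = \begin{pmatrix}
a_1 & b_1 & & \\
c_2 & a_2 & b_2 & \\
 & c_3 & a_3 & b_3 \\
 & & c_4 & a_4
\end{pmatrix}$$

All of the non-zero elements are clustered on the central three diagonals.

Algorithm (O(log n) vector operations):

Step 1: Apply LU decomposition, i.e. find L and U with $A = LU$, where

$$L = \begin{pmatrix}
1 & & & \\
L_2 & 1 & & \\
 & L_3 & 1 & \\
 & & L_4 & 1
\end{pmatrix}, \qquad
U = \begin{pmatrix}
u_1 & b_1 & & \\
 & u_2 & b_2 & \\
 & & u_3 & b_3 \\
 & & & u_4
\end{pmatrix}$$

Step 2: Let $Ux = z$.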
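Solve a linear system with a banded triangular matrix: $Lz = y$.

Step 3: Solve another linear system with a banded triangular matrix: $Ux = z$.

How to find L and U?

e.g. from $A = LU$:

$$\begin{pmatrix}
a_1 & b_1 & & \\
c_2 & a_2 & b_2 & \\
 & c_3 & a_3 & b_3 \\
 & & c_4 & a_4
\end{pmatrix}
=
\begin{pmatrix}
u_1 & b_1 & & \\
L_2 u_1 & L_2 b_1 + u_2 & b_2 & \\
 & L_3 u_2 & L_3 b_2 + u_3 & b_3 \\
 & & L_4 u_3 & L_4 b_3 + u_4
\end{pmatrix}$$

Comparing entries:

$$\begin{aligned}
u_1 &= a_1 & & \\
u_2 &= a_2 - b_1 L_2 & L_2 &= c_2 / u_1 \\
u_3 &= a_3 - b_2 L_3 & L_3 &= c_3 / u_2 \\
u_4 &= a_4 - b_3 L_4 & L_4 &= c_4 / u_3
\end{aligned}$$

Substitute $L_i$ into $u_i$:

$$u_1 = a_1, \qquad u_2 = a_2 - \frac{c_2 b_1}{u_1}, \qquad u_3 = a_3 - \frac{c_3 b_2}{u_2}, \qquad u_4 = a_4 - \frac{c_4 b_3}{u_3}$$

The $u_i$ sequence is a first order recurrence, but not a linear one. Introduce $q_i$ with $u_i = q_i / q_{i-1}$:

$$\begin{aligned}
u_1 &= a_1 = \frac{q_1}{q_0} \\
u_2 &= a_2 - \frac{c_2 b_1}{u_1} = a_2 - \frac{c_2 b_1 q_0}{q_1} = \frac{a_2 q_1 - c_2 b_1 q_0}{q_1} = \frac{q_2}{q_1} \\
u_3 &= a_3 - \frac{c_3 b_2}{u_2} = a_3 - \frac{c_3 b_2 q_1}{q_2} = \frac{a_3 q_2 - c_3 b_2 q_1}{q_2} = \frac{q_3}{q_2} \\
u_4 &= a_4 - \frac{c_4 b_3}{u_3} = a_4 - \frac{c_4 b_3 q_2}{q_3} = \frac{a_4 q_3 - c_4 b_3 q_2}{q_3} = \frac{q_4}{q_3}
\end{aligned}$$

In general,

$$u_i = \frac{a_i q_{i-1} - c_i b_{i-1} q_{i-2}}{q_{i-1}} = \frac{q_i}{q_{i-1}}$$

The $q_i$ sequence is a second order linear recurrence: given $q_0 = 1$ and $q_1 = a_1$, find

$$q_i = a_i q_{i-1} - c_i b_{i-1} q_{i-2}, \quad i \ge 2$$

Summary:

Step 1: Solve the $q_i$ sequence.
Step 2: $u_i = q_i / q_{i-1}$, $1 \le i \le n$.
Step 3: $L_i = c_i / u_{i-1}$, $2 \le i \le n$.

In total: O(log n) vector operations.

Tridiagonal linear systems: method 2 (divide-and-conquer)

e.g. the $8 \times 8$ system $Ax = y$, where $p_i$ denotes the ith row:

$$\begin{matrix} p_1 \\ p_2 \\ p_3 \\ p_4 \\ p_5 \\ p_6 \\ p_7 \\ p_8 \end{matrix}
\begin{pmatrix}
a_1 & b_1 & & & & & & \\
c_2 & a_2 & b_2 & & & & & \\
 & c_3 & a_3 & b_3 & & & & \\
 & & c_4 & a_4 & b_4 & & & \\
 & & & c_5 & a_5 & b_5 & & \\
 & & & & c_6 & a_6 & b_6 & \\
 & & & & & c_7 & a_7 & b_7 \\
 & & & & & & c_8 & a_8
\end{pmatrix}, \qquad
y = \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \\ y_8 \end{pmatrix}$$

Perform the following row operations:

$$\begin{aligned}
Q_1 &= p_1 - \frac{b_1}{a_2} p_2 \\
Q_2 &= p_2 - \frac{c_2}{a_1} p_1 - \frac{b_2}{a_3} p_3 \\
&\;\;\vdots \\
Q_i &= p_i - \frac{c_i}{a_{i-1}} p_{i-1} - \frac{b_i}{a_{i+1}} p_{i+1} \\
&\;\;\vdots \\
Q_8 &= p_8 - \frac{c_8}{a_7} p_7
\end{aligned}$$

Applying the same operations to y gives z, and the new system is

$$\begin{matrix} Q_1 \\ Q_2 \\ Q_3 \\ Q_4 \\ Q_5 \\ Q_6 \\ Q_7 \\ Q_8 \end{matrix}
\begin{pmatrix}
d_1 & 0 & e_1 & & & & & \\
0 & d_2 & 0 & e_2 & & & & \\
f_3 & 0 & d_3 & 0 & e_3 & & & \\
 & f_4 & 0 & d_4 & 0 & e_4 & & \\
 & & f_5 & 0 & d_5 & 0 & e_5 & \\
 & & & f_6 & 0 & d_6 & 0 & e_6 \\
 & & & & f_7 & 0 & d_7 & 0 \\
 & & & & & f_8 & 0 & d_8
\end{pmatrix}, \qquad
z = \begin{pmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \\ z_5 \\ z_6 \\ z_7 \\ z_8 \end{pmatrix}$$

Rearrange into the odd-numbered and even-numbered unknowns:

$$\begin{pmatrix}
d_1 & e_1 & & \\
f_3 & d_3 & e_3 & \\
 & f_5 & d_5 & e_5 \\
 & & f_7 & d_7
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_3 \\ x_5 \\ x_7 \end{pmatrix}
= \begin{pmatrix} z_1 \\ z_3 \\ z_5 \\ z_7 \end{pmatrix},
\qquad
\begin{pmatrix}
d_2 & e_2 & & \\
f_4 & d_4 & e_4 & \\
 & f_6 & d_6 & e_6 \\
 & & f_8 & d_8
\end{pmatrix}
\begin{pmatrix} x_2 \\ x_4 \\ x_6 \\ x_8 \end{pmatrix}
= \begin{pmatrix} z_2 \\ z_4 \\ z_6 \\ z_8 \end{pmatrix}$$

An $8 \times 8$ tridiagonal linear system can thus be split into two independent $4 \times 4$ tridiagonal linear systems.

How to obtain the matrix Q?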
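Compute the row multipliers $k_i = b_i / a_{i+1}$ and $l_i = c_i / a_{i-1}$ as vectors:

$$\begin{pmatrix} k_1 \\ k_2 \\ \vdots \\ k_7 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_7 \end{pmatrix} \Big/ \begin{pmatrix} a_2 \\ a_3 \\ \vdots \\ a_8 \end{pmatrix} \quad \text{(1 division)}$$

$$\begin{pmatrix} l_2 \\ l_3 \\ \vdots \\ l_8 \end{pmatrix} = \begin{pmatrix} c_2 \\ c_3 \\ \vdots \\ c_8 \end{pmatrix} \Big/ \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_7 \end{pmatrix} \quad \text{(1 division)}$$

$$\begin{pmatrix} d_1 \\ d_2 \\ \vdots \\ d_7 \\ d_8 \end{pmatrix}
= \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_7 \\ a_8 \end{pmatrix}
- \begin{pmatrix} k_1 \\ k_2 \\ \vdots \\ k_7 \\ 0 \end{pmatrix} * \begin{pmatrix} c_2 \\ c_3 \\ \vdots \\ c_8 \\ 0 \end{pmatrix}
- \begin{pmatrix} 0 \\ l_2 \\ \vdots \\ l_7 \\ l_8 \end{pmatrix} * \begin{pmatrix} 0 \\ b_1 \\ \vdots \\ b_6 \\ b_7 \end{pmatrix}
\quad \text{(2 multiplications and 2 additions)}$$

$$\begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_6 \end{pmatrix} = - \begin{pmatrix} k_1 \\ k_2 \\ \vdots \\ k_6 \end{pmatrix} * \begin{pmatrix} b_2 \\ b_3 \\ \vdots \\ b_7 \end{pmatrix} \quad \text{(1 multiplication)}$$

$$\begin{pmatrix} f_3 \\ f_4 \\ \vdots \\ f_8 \end{pmatrix} = - \begin{pmatrix} l_3 \\ l_4 \\ \vdots \\ l_8 \end{pmatrix} * \begin{pmatrix} c_2 \\ c_3 \\ \vdots \\ c_7 \end{pmatrix} \quad \text{(1 multiplication)}$$

$$\begin{pmatrix} z_1 \\ z_2 \\ \vdots \\ z_7 \\ z_8 \end{pmatrix}
= \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_7 \\ y_8 \end{pmatrix}
- \begin{pmatrix} k_1 \\ k_2 \\ \vdots \\ k_7 \\ 0 \end{pmatrix} * \begin{pmatrix} y_2 \\ y_3 \\ \vdots \\ y_8 \\ 0 \end{pmatrix}
- \begin{pmatrix} 0 \\ l_2 \\ \vdots \\ l_7 \\ l_8 \end{pmatrix} * \begin{pmatrix} 0 \\ y_1 \\ \vdots \\ y_6 \\ y_7 \end{pmatrix}
\quad \text{(2 multiplications and 2 additions)}$$

The minus signs follow from $Q_i = p_i - l_i p_{i-1} - k_i p_{i+1}$; the additions counted above are vector subtractions.

After 12 vector operations, we obtain the matrix Q. The system sizes shrink as $8 \times 8 \to 4 \times 4 \to 2 \times 2 \to 1 \times 1$, so the total is O(log n) vector operations.

General banded linear systems

e.g. a $10 \times 10$ system with five diagonals, where row $p_i$ holds $e_i, c_i, a_i, b_i, d_i$ in columns $i-2, \ldots, i+2$ and the right-hand side is $y_1, \ldots, y_{10}$:

$$\begin{matrix} p_1 \\ p_2 \\ p_3 \\ \vdots \\ p_8 \\ p_9 \\ p_{10} \end{matrix}
\begin{pmatrix}
a_1 & b_1 & d_1 & & & & \\
c_2 & a_2 & b_2 & d_2 & & & \\
e_3 & c_3 & a_3 & b_3 & d_3 & & \\
 & \ddots & \ddots & \ddots & \ddots & \ddots & \\
 & & e_8 & c_8 & a_8 & b_8 & d_8 \\
 & & & e_9 & c_9 & a_9 & b_9 \\
 & & & & e_{10} & c_{10} & a_{10}
\end{pmatrix}$$

The transformed rows $Q_1, \ldots, Q_{10}$, with right-hand side $z_1, \ldots, z_{10}$, hold $j_i, h_i, f_i, g_i, i_i$ in columns $i-4, i-2, i, i+2, i+4$ and zeros in columns $i \pm 1$ and $i \pm 3$. Rearranged by odd and even unknowns:

$$\begin{pmatrix}
f_1 & g_1 & i_1 & & \\
h_3 & f_3 & g_3 & i_3 & \\
j_5 & h_5 & f_5 & g_5 & i_5 \\
 & j_7 & h_7 & f_7 & g_7 \\
 & & j_9 & h_9 & f_9
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_3 \\ x_5 \\ x_7 \\ x_9 \end{pmatrix}
= \begin{pmatrix} z_1 \\ z_3 \\ z_5 \\ z_7 \\ z_9 \end{pmatrix},
\qquad
\begin{pmatrix}
f_2 & g_2 & i_2 & & \\
h_4 & f_4 & g_4 & i_4 & \\
j_6 & h_6 & f_6 & g_6 & i_6 \\
 & j_8 & h_8 & f_8 & g_8 \\
 & & j_{10} & h_{10} & f_{10}
\end{pmatrix}
\begin{pmatrix} x_2 \\ x_4 \\ x_6 \\ x_8 \\ x_{10} \end{pmatrix}
= \begin{pmatrix} z_2 \\ z_4 \\ z_6 \\ z_8 \\ z_{10} \end{pmatrix}$$

How to obtain the matrix Q?

Take $Q_5$ as an example, where the unsubscripted $p, q, r, s$ are scalar coefficients:

$$Q_5 = p \, p_3 + q \, p_4 + p_5 + r \, p_6 + s \, p_7$$

Requiring the entries of $Q_5$ in columns 2, 4, 6 and 8 to vanish gives

$$\begin{pmatrix}
c_3 & e_4 & & \\
b_3 & a_4 & e_6 & \\
 & d_4 & a_6 & c_7 \\
 & & d_6 & b_7
\end{pmatrix}
\begin{pmatrix} p \\ q \\ r \\ s \end{pmatrix}
= \begin{pmatrix} 0 \\ -c_5 \\ -b_5 \\ 0 \end{pmatrix}
\quad \text{(a tridiagonal linear system)}$$

Summary:

Step 1: Solve n tridiagonal linear systems.
Step 2: Generate Q.

At this point, the original problem is split into two smaller problems of half the size.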
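The splitting step of tridiagonal method 2 can be checked numerically. The sketch below is only an illustration under stated assumptions: the names `split_step`, `thomas`, `shift_up`, `shift_down`, `vmul` and `vsub` are hypothetical, Python lists stand in for vector registers, exact `Fraction` arithmetic is used so results can be compared with `==`, and a standard sequential tridiagonal elimination (not part of the original material) is swapped in only to solve the systems for the comparison. Arrays are 0-based with `b[-1] = 0` and `c[0] = 0` as padding.

```python
from fractions import Fraction


def shift_up(v):
    """Element i becomes v[i+1]; zero-padded at the end."""
    return v[1:] + [0]


def shift_down(v):
    """Element i becomes v[i-1]; zero-padded at the start."""
    return [0] + v[:-1]


def vmul(u, v):
    return [x * y for x, y in zip(u, v)]


def vsub(u, v):
    return [x - y for x, y in zip(u, v)]


def split_step(a, b, c, y):
    """One splitting step: a = diagonal, b = superdiagonal (b[-1] = 0),
    c = subdiagonal (c[0] = 0), y = right-hand side, all of length n."""
    n = len(a)
    k = [b[i] / a[i + 1] for i in range(n - 1)] + [0]   # k_i = b_i / a_{i+1}
    l = [0] + [c[i] / a[i - 1] for i in range(1, n)]    # l_i = c_i / a_{i-1}
    d = vsub(vsub(a, vmul(k, shift_up(c))), vmul(l, shift_down(b)))
    e = [-t for t in vmul(k, shift_up(b))]     # coefficient of x_{i+2}
    f = [-t for t in vmul(l, shift_down(c))]   # coefficient of x_{i-2}
    z = vsub(vsub(y, vmul(k, shift_up(y))), vmul(l, shift_down(y)))
    return d, e, f, z


def thomas(sub, diag, sup, rhs):
    """Sequential tridiagonal solve, used here only as a reference.
    sub[0] and sup[-1] are ignored."""
    n = len(diag)
    diag, rhs = list(diag), list(rhs)
    for i in range(1, n):
        m = sub[i] / diag[i - 1]
        diag[i] -= m * sup[i - 1]
        rhs[i] -= m * rhs[i - 1]
    x = [0] * n
    x[-1] = rhs[-1] / diag[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (rhs[i] - sup[i] * x[i + 1]) / diag[i]
    return x


# Example 8x8 system (values chosen arbitrarily for illustration).
n = 8
a = [Fraction(4)] * n
b = [1] * (n - 1) + [0]
c = [0] + [1] * (n - 1)
y = list(range(1, n + 1))

x_full = thomas(c, a, b, y)
d, e, f, z = split_step(a, b, c, y)
x_odd = thomas(f[0::2], d[0::2], e[0::2], z[0::2])    # x1, x3, x5, x7
x_even = thomas(f[1::2], d[1::2], e[1::2], z[1::2])   # x2, x4, x6, x8
```

Here `x_odd` and `x_even` come from the two independent half-size systems and can be compared against `x_full[0::2]` and `x_full[1::2]`. In a full method 2 solver the two half systems would be split again by `split_step` rather than handed to a sequential solver.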