Tagged systolic arrays S. Sarkar A.K. Majumdar Indexing terms Fael-Fourier fransfiirm, Sysrolic arraj, Tagqed .svefolic array, V L S l Abstract: Design of systolic arrays from a set of non-linear and nonuniform recurrence equations is discussed. A systematic method for deriving a systolic design in such cases is presented. A novel architectural idea, termed a tagged systolic array (TSA), is introduced. The design methodology described broadens the class of algorithms amenable for tagged systolic array implementation. The methodology is illustrated by deriving a systolic design for the fast Fourier transform. 1 Introduction Systolic arrays exploit the advantages offered by VLSI technology to develop special-purpose devices for parallel computation. For this reason, hardware implementation of several specialised parallel algorithms has become feasible. Kung [ l ] characterises a systolic array as a specialpurpose device consisting of a number of interconnected processing elements each capable of performing some simple operations. In a systolic array, data flows from cell to cell in a very regular and pipelined fashion. Local and regular interconnection between the processing elements is one of the most important characteristics of a systolic array. Considerable effort has been directed towards the development of a systematic method for the design of systolic arrays [1-8]. Most of these methods depend upon the fact that the dependence graph (DG) [ 6 ] of the algorithm is local and regular, or it can be transformed intuitively or by some valid transformation into a local and regular one. The locality property of the D G implies that it can be embedded in a multidimensional index space such that the computation at an index point depends only on data from neighbouring index points, and the regularity property of the D G implies that the dependencies are the same at every index point. More recently, based on the pioneering work of Karp, Miller & Winograd [9]. Quinton [ l l ] described a systematic design methodology using a uniform recurrence equation (URE). A transformation technique has also been discussed by Dongen and Quinton [12] which enables a set of nonuniform linear recurrence equations to be converted to a set of URE. Designing a systolic array from a set of nonuniform linear recurrence equations is a threestep process: conversion of the set of recurrence equations into a set of URE followed by derivation of a permissible timing function and a permissible allocation function [I 11. When an algorithm cannot be expressed as a set of URE, much less is known about the method of deriving a systolic design. Such algorithms can, in general, be expressed as a set of nonlinear and nonuniform recurrence equations (NLNURE). Many important algorithms such as FFT, bitonic sorting, etc. belong to this class. Thus the need arises to explore the possibility of a systolic design for such algorithms. In this paper, a methodology for deriving systolic designs for algorithms expressed as a set of NLNURE is presented. In the process, a new architectural idea called the tagged systolic array (TSA) is introduced. In a TSA, tags are attached to the results of a particular computation for sending them to other processing elements (PEs) where the result of that particular computation is required. A TSA uses only nearest-neighbour local communication links for sending data. We illustrate the proposed design methodology by deriving a systolic design for the fast Fourier transform. 2 Design methodology The constraint imposed by a systolic array on the physical layout of the array is the local and regular interconnections between the processing elements. Thus algorithms whose DGs are local and regular can easily be mapped onto a systolic array. The two steps that are always followed for a systolic array design are finding a timing function T and an allocation function a. The timing function shows the time ordering of computations at different index points and it must be compatible with the ordering of computations represented by the DG i.e. if the computation at an index point z1 depends on the computation at another index point z 2 , then T(z,) > T ( z 2 ) .A permissible allocation function a will be such that if for two index points z1 and z 2 , a(zJ = a(z2), then f TP2). According to the uniformisation technique for linear recurrence equations described by Dongen & Quinton [12], we need to find a set of integral vectors B , , ..., B, such that any dependence vector u, can be expressed as a non-negative integral combination of these vectors, and a linear timing function T which is compatible with all vectors B,, . . . , B, i.e. VB,: T . B j > 0. If D, is defined as the smallest domain that contains all the dependence vectors for all values of size parameters of a particular problem, then c, can be defined as the smallest cone pointed at the origin that contains D,.It has been shown 1121 that the existence of c, is necessary for the parallelisation of the final uniform recurrences. Such a cone does not always exist. In that case, we need to consider a reindexing transformation. To design a systolic array from a set of NLNURE, we follow a procedure similar to that outlined above. If we cannot find a cone c,. for the problem concerned, we try mi) the reindexing transformation so that c, exists for the reindexed DG. Next we find the set of vectors B , , . . . , B, and T ( T need not necessarily be a linear function). Because an integral combination of vectors B , , . . , , B, is used to replace any dependence vector u,, we route the data along the directions of these vectors. The vectors B,, . . . , B, are chosen such that when the dependence vectors of the DG are replaced with an integral combination of B , , . . . , B , , the resulting D G has only local dependencies. Although the locality property required for systolic design has been satisfied, the regularity property cannot be satisfied (We have assumed that the D G of the algorithm cannot be transformed into a regular one by any known method.). However, this implies that the relative position(s) of the index point(s) where data from a particular index point is to be routed, is known. It may also be noted that these relative positions may vary with the index points. T o overcome this problem, we enhance the existing systolic array architecture by including tags for data routing. A tag is attached to the result of a computation to identify the relative location of the index point where the result is to be used. This type of systolic array using tags for data routing is called a tagged systolic array (TSA). A TSA employs only local and regular (because data is routed only along a fixed number of directions given by B,, . . . , B,, the interconnection among the PEs may be made regular) interconnection among the PEs. Thus for problems in which the D G cannot be transformed into a local and regular one by any valid transformation known, we can still explore the possibility of mapping the problem on a TSA. 3 Fast Fourier transform on tagged systolic array The fast Fourier transform (FFT) [13] is one of the most powerful tools used in many signal and image processing applications. Given a sequence {x(O),x(l), . . ., x ( N - 1)) of time-dependent inputs, the Fourier transform computes the output sequence {X(O), X ( l), . . . , X ( N - 1)) as follows: N- 1 X(k)= x(n)e-jZnnk" "=O c x(n)w"k N-1 = "=O where w = e-jZffiN. The D G for an N-point F F T (we assume N = 2'" where m is an integer) is shown in Fig. 1 (The constants involved in the computation have been ignored.). The computation represented by the D G can be expressed by a set of recurrence equations. If k = 1, 1 4 i 4 N then A(k, i) = x, If (1) l<k<(logN+l), then l<i<N, l<(imod2'-') 4 2'-' A(k, i) = F[A(k - 1, i + 2' -'), A(k - 1, i)] (2) If I < k < ( l o g N + l ) , l < i < N , ( i m 0 d 2 ~ - ' ) = 0 or (imod 2k-') > 2"' then A(k, i) = F[A(k Ifk > (logN X(N - - 1, i). A(k - 1, i - 2"')] (3) + l), 1 < i 4 N then i) = A(k, i) In the above recurrence equations, the input data (for N = 8) are x1 = x(7) x, x(3) x-,= x(5) x4 = x(1) X, = x(4) x g = x(0) -~ k Fig. 1 Dependence graph ( D G )for N = X FFT The function F describes the computation involved and is given by F(a, b) = a + w* . b where wx is a constant The constants involved in the computation are assumed to be stored at the nodes. Recurrence eqns. 2 and 3 can be written in fully indexed form [9, 123. If 1 < k 4 (log N + I), 1 4 i 4 N , 1 4 (i mod 2k-1) 4 2'-' then ~ ( ki), = F [ A { ( ~i), - (I, -2'-')}, A { ( k , i) - ( I , O)}] ( 5 ) If 1 < k < (log N + I), 1 4 i 4 N , (i mod 2k-') = 0 or (i mod 2k-1)> 2'-' then A(k, i) = F [ A { k , i) - (1, O)}, A { ( k , i) - (1, 2'-')}] (6) According to the definition of URE given in Reference 12, the above set of recurrence equations can be identified as NLNURE. The set of dependence vectors for eqns. 5 and 6 (that represent computation at an index point) is H = { u , , 0,. u-,} = {(I, 0), (1, -2'-'), (1, 2'-')}. The domain D, containing all the dependence vectors for all values of the parameter k is given by D, = {(k, i)l k > 0, - m < i < m}. Clearly, the cone c, does not exist in this case. However, if the D G is reindexed such that an index point ( k , i) in the original DG is reindexed to an index point ( k , i 2' - I), the cone c, exists. Fig. 2 shows the DG after reindexing. Fig. 3 shows the cone c, and vectors El and B , . The set of recurrence eqns. 1 to 4 are modified accordingly. Ifk = 1,2" < i < N +(2'-' 1) then + - A(k, i) xi If 1 < k 4 (log N + I), 2'..-' 4 i 4 N (i + 1 - 2") mod 2'-' < 2'-' then A(k, i) = = F[A(k - 1, i), A(k - + (2" - (7) l), 1 4 I , i - 2'-')] (8) 2" 4 i < N + (2" - l), 1 < k <(log N + l), ( i + 1 - 2") mod 2'-' = 0 or (i + 1 - 2'-') mod 2'-' > 2k-2 then If A(k, i) = F[A(k - 1, i If k > (log N (4) = x 5 = ~ ( 6 )x6 = x(2) X(N - + l), 2*-' i) = A(k, i) A(k - 1, i - 2'-' )I < N + (2k-1 - 1 ) then - 2'-'), 4i (9) (10) The new set of dependence vectors is 8' = {u;, u; , u ; } = {(l, 0), (1, 2 k - 2 ) ,(1, 2*-')). The set of vectors B , , ..., B, is now chosen as {El, B 2 } =. {(l, O), (0, 1)) and T' = (1, 1). If a linear allocation function a is chosen, then any a that satisfies the relation T' . a > 0 is permissible [ 111. Thus a = (1, O)T and a = (0, 1)' are both permissible allocation data. Each PE stores only two pieces of data (intermediate results) and one constant for computation. Two tag values are also stored in each PE and these values are also constants. A tag is attached to the result of a computation when it is sent to a neighbouring PE. A PE immediately checks the tag of a piece of data after 9 X(0) L Fig. 2 Reindexed DG Fig. 4 (2N - I ) P E systolic array Fig. 5 (log N ) P E systolic array B, Fig. 3 Cone C , and vectors B , and B , functions. a = (0, 1)' will result in the design shown in Fig. 4. The array uses ( 2 N - 1) PEs and tags for routing data. However, this design has the drawback that each PE has to dynamically compute the values of tags to send results to other PEs. This increases the processor complexity. If a = (1, 0)' the systolic array shown in Fig. 5 using log N PEs results. This design does not use tags for routing data. However, it has the drawback that it requires exponentially-increasing local memory within each PE from the leftmost PE to the rightmost PE for storing the partial results of computation and the constants involved. For a better design, a is taken to be the following permissible nonlinear function: For 2 < k < (log N + I), an index point (k, i) is allocated to PE {(imod 2'-') + "2 - 2'-'}. The resulting design, shown in Fig. 6, uses only (N - 1) PEs. The array is linear and uses tags for routing ill d3) d5) d7) - - - Fig. 6 . . Tagged systolic arrayfor N = 8 F F T receiving it and copies the data if it is meant for it. In the next time-step, the piece of data passes into the next PE. A PE of the TSA derived for the F F T is shown in Fig. 7. Because most high-speed A/D converters are lower in When the computation is started, the value of count is set to zero. Before starting computation, the values of the tag constants and the constant involved in the computation are set. : Multiplier : 0 A,B,C,D w (= ,+j.w Line no. 1 2 3 4 S 1s compute 17 output data 1 = A 19 20 21 22 23 24 4 Performance analysis The ( N - 1 ) PE TSA takes ( N + log N ) time-steps to complete the computation of an N-point FFT. The speed-up S is given by ( N log N ) / ( N log N ) . The average processor utilisation decreases with the increase in size oi the FFT. The block pipelining period is N . Fig. 8 plots S against PE number. The performance of an ( N - 1) PE TSA can be compared with that of an ( N log N/2) PE network using butterfly interconnection. A PE of the latter type is shown in Fig. 9. Fig. 7 shows the PE of a TSA. T o compare the two designs based on the chip area required for computing an N-point FFT, the approximate number of gates (estimated from the data path synthesis) required for a PE is taken as the basis. The complexity of the controller is measured in terms of the number of basic operations it performs. The additional controller complexity of the TSA for F F T results from lines 4 to 15 and 19 to 23 of the program that each PE executes. The approximate gate count (from the data path synthesis) of the TSA is 5400 and that for a butterfly PE is 4500. Thus an optimistic assumption would be that a PE of TSA takes at most twice the area required for a processor employing butterfly interconnection for computing an FFT. However, because of the nonlocal and nonregular interconnection employed in the latter case, the total chip area required is approximately double that required for processors only. If we assume that a butterfly processor occupies unit chip area, then the area required to + heain time-steu send output'data 1 and output data 2 receive input data 1 and input data 2 ,fiv all tags of input data decrement tag by 1 iftag = 0 and count = 0 copy data to register B count = 1 else iftag = 0 and count = 1 copy data to register A count = 2 end if end if end for tfcount = 2 16 18 : Twiddle factor : TagRegisterj P E /or computation of FFT using tagged systolic array precision, the input data are assumed to be 16-bit fixedpoint words. Each PE executes the following program in one time-step. 6 7 8 9 10 11 12 13 14 : Subtmctor : Registerpairs 2 TI,...,T6 Fig. 7 Adder output data 2 count = 0 else output data 1 output data 2 end if end time-step =A = = + w"B - w^B input data 1 input data 2 compute an N-point FFT is 2N log N/2 = N log N. If we use a TSA to compute an N-point FFT, the chip area required is 2(N - 1). Table 1 shows the chip area comparison between TSA and butterfly for the computation of an N-point FFT. -1 6 5-1 1 2 - , 1 1 0 ~ X ) 4 0 5 0 6 .-~ 0 7 0 No. of PEs Fig. 8 Butterfly 24 64 160 384 32 64 128 256 896 2048 TSA f o r o t h e r orthogonal transforms The Hartley transform [lo] of a data sequence { x ( n ) ; n = 0, 1, 2, . ..} is given by 1 Ak) = TSA 14 30 62 126 254 510 I 1 I32 7- : 16bitregister : Multiplier fl : Adder : SUbaaaOI A,B,C,D : Registerpain w (= 0 ,+j.y): Twiddle factor Fig. 9 P E f o r Computation of F F T usinq butterfly interconnection between PEs + w'BandD = A - w*B wherew' Outputs of all registers are In-state C= A ~ J(N) N-1 1 x(n)[cos (2nknlN) + sin (Znkn/N] "=o The fast Hartley transform (FHT) is very similar to the FFT and the DG for an N = 8 FHT is the same as for an N = 8 FFT except for the constants involved in the computation and that the FHT involves only real arithmetic computations. Thus a similar TSA can be derived for the computation of an FHT where the PEs perform only real arithmetic computations. The Hadamard transform [6] of a data sequence { x ( n ) ; n = 0, 1, 2, ...} is given by y = Hx, where H is an Area (in units) 8 16 5 Speed-up S us u Junction ofnumber o f P E s Table 1 : Chip area comparison N Thus the chip area utilisation of a TSA is much better than that of a butterfly interconnection network for the computation of an FFT, especially when N is large. Table 2 shows the comparison (based on some other factors) of TSA and butterfly interconnection networks for the computation of an FFT. From the above analysis, we observe that both designs have certain strong aspects and certain weak aspects. It is difficult to relate all the factors by a common formula which can be used as a performance metric. The alternative approach is to assign credit points for each of the factors depending on their relative merit and the total points can be used as a performance index for comparing two designs. If we assign equal credit points for all factors with equal relative merit, we conclude that the TSA implementation of an FFT is superior. = wI + w2 T a b l e 2: C o m p a r i s o n o f ( N - 1) PE TSA a n d ( N log N / 2 ) PE n e t w o r k using b u t t e r f l y i n t e r c o n n e c t i o n for c o m p u t a t i o n of N - p o i n t FFT Factor TSA Butterfly computational area Area efficiency = total chip area 1 0.5 1 0.5 Less because of local and regular interconnection among PEs r 100%fault tolerant more because of nonlocal and nonregular interconnection among PEs less than loo’% Modular 4 data/time-step Less than 100% N 2 ( N - 1 ) units Non-modular 2N data/time-step 100% 1 (N log N ) units N+looN log N Power efficiency = computational power total power Design cost Fault tolerance (interconnection failure) Modularity I/O bandwidth Processor utilisation Block pipelining period Chip area required Time N x N matrix. For N H = 1/[2 . ,/(2)] ’ = 8, the matrix H is -- 1 11 11 11 11 11 11 11 1 1-1 1-1 1-1 1-1 1 1 - 1 -1 1 1 - 1 -1 1 - 1 - 1 1 1 - 1 -1 1 1 1 1 1 - 1 -1 -I - 1 1-1 1 - 1 -1 1-1 1 1 1 - I - 1 -1 -1 1 1 -1-1 - 1 - 1 1 - 1 1 1-1 The fast Hadamard transform also involves only real arithmetic computations. The DG for a fast Hadamard transform is the same as for the N = 8 FFT except for the constants involved in the computation. Thus a TSA similar to an FFT can be derived in this case. The discrete cosine transform (DCT) [lo] of a data sequence { x ; n = 0, 1, ..., N - 1) is given by the output sequence { z t ;k = 0, 1, . . . , N - 1) where N-1 zk = 2e(k)/N 1 x , cos [n(2n + l)k/2N] n=O and ~ ( k= ) 1/4(2) for k = I otherwise = 0 Ordinarily, a DCT is calculated from an FFT by using broader class of algorithms amenable for implementation on a TSA. The starting point of the design is a set of recurrence equations describing the algorithm. When no apparent transformation can be applied to the set of recurrence equations describing the algorithm to convert them into a set of uniform recurrence equations or alternatively the DG describing the algorithm cannot be transformed into a regular one, we may try to implement the problem on a TSA. A TSA design for the FFT is derived and shown to be better in certain aspects than the design employing butterfly interconnection among PEs. Finally it has been shown that similar designs can be derived for some other important orthogonal transforms. 7 References 1 KUNG, H.T. ‘Why systolic architectures?’, IEEE Computer. Jan 1982 2 MOLDOVAN, D.I.. ‘On the design of algorithms for VLSI systolic arrays’, Proc. IEEE, 1983,71 3 LI, G.H., and WAH, B.W.: ‘The design of optimal systolic arrays’, IEEE Truns. Cornput., 1985, 34, (1) 4 ULLMAN, J.D.: ‘The computational aspects of VLSI’ (Computer Science Press, 1984) 5 DELOSME, J.M.. and IPSEN, I.C.F.: ‘Efiicient systolic arrays for the solution of Toeplitz system: an illustration of the methodology for the construction of systolic architectures for VLSI’, in MOORE, W.. McCABE, A,, and URQUHART, R. (Eds.): ‘Construction of systolic architectures’ (Adam Hilger, 1986) 6 KUNG, S.Y.: ‘VLSI array processor’ (Prentice-Hall. New Jersey, 19x8) for k = 0, 1, 2, ..., N/2 1 where R and I are the real and imaginary parts of the FFT, respectively, and zNiz(x)= ,/(2/N) . RN12(x’)and xk = x2”; x h - n - i = x 2 ” + , and 0, = nk/2N. Hence the DCT sequence can be obtained from the FFT sequence by including two additional processors that implement the above computation. Because the outputs from an FFT tagged systolic array are available sequentially, the DCT outputs are also available sequentially. - 6 Conclusion The design methodology and the enhanced systolic archi- tecture discussed in the paper allows us to consider a 7 RAO, S.K.: ‘Regular iterative algorithms and their implementatlons on processor arrays’. PhD thesis, Stanford University, USA, 1985 8 YAACOBY, Y., and CAPPELLO, P.R.: ‘Scheduling a system of afiine recurrence equations onto a systolic array‘. Proceedings of the International Conference on Systolic Arrays, San Diego, California, May 1988 9 KARP, R.M.. MILLER, R.E., and WINOGARD, S.:‘The organisation of computations for uniform recurrence equations’, J A C M , 14, (3). July 1967 10 HOU, S.H.: ‘The fast Hartley transform alrorithm’, IEEE Trans. Cornput., 1987, C-36,( 2 ) 1 I QUINTON, P.: ‘The systematic design of systolic arrays’. IRlSA Research Report No. 193, April 1983 12 DONGEN, VV., and QUINTON, P.. ’Uniformization of linear recurrence equations: a step towards automatic synthesis of systolic arrays’. Proceedings of the International Conference on Systolic Arrays, San Diego, California, May 1988 13 NUSSBAUMER, H.J.: ‘Fast Fourier transform and convolutlon algorithms’ (Springer-Verlag, 1982). 2nd edn. *