Tagged systolic arrays

S. Sarkar
A.K. Majumdar
Indexing terms: Fast Fourier transform, Systolic array, Tagged systolic array, VLSI
Abstract: Design of systolic arrays from a set of
non-linear and nonuniform recurrence equations
is discussed. A systematic method for deriving a
systolic design in such cases is presented. A novel
architectural idea, termed a tagged systolic array
(TSA), is introduced. The design methodology
described broadens the class of algorithms amenable to tagged systolic array implementation. The
methodology is illustrated by deriving a systolic
design for the fast Fourier transform.
1 Introduction
Systolic arrays exploit the advantages offered by VLSI
technology to develop special-purpose devices for parallel
computation. For this reason, hardware implementation
of several specialised parallel algorithms has become feasible. Kung [1] characterises a systolic array as a special-purpose device consisting of a number of interconnected
processing elements each capable of performing some
simple operations. In a systolic array, data flows from cell
to cell in a very regular and pipelined fashion. Local and
regular interconnection between the processing elements
is one of the most important characteristics of a systolic
array.
Considerable effort has been directed towards the development of systematic methods for the design of systolic arrays [1-8]. Most of these methods depend upon the fact that the dependence graph (DG) [6] of the algorithm is local and regular, or that it can be transformed, intuitively or by some valid transformation, into a local and regular one. The locality property of the DG implies that it can be embedded in a multidimensional index space such that the computation at an index point depends only on data from neighbouring index points, and the regularity property of the DG implies that the dependencies are the same at every index point. More recently, based on the pioneering work of Karp, Miller and Winograd [9], Quinton [11] described a systematic design methodology using uniform recurrence equations (URE). A transformation technique has also been discussed by Dongen and Quinton [12] which enables a set of nonuniform linear recurrence equations to be converted into a set of URE. Designing a systolic array from a set of nonuniform linear recurrence equations is then a three-step process: conversion of the set of recurrence equations into a set of URE, followed by derivation of a permissible timing function and a permissible allocation function [11].
When an algorithm cannot be expressed as a set of
URE, much less is known about the method of deriving a
systolic design. Such algorithms can, in general, be
expressed as a set of nonlinear and nonuniform recurrence equations (NLNURE). Many important algorithms
such as FFT, bitonic sorting, etc. belong to this class.
Thus the need arises to explore the possibility of a systolic design for such algorithms.
In this paper, a methodology for deriving systolic
designs for algorithms expressed as a set of NLNURE is
presented. In the process, a new architectural idea called
the tagged systolic array (TSA) is introduced. In a TSA,
tags are attached to the results of a particular computation for sending them to other processing elements (PEs)
where the result of that particular computation is
required. A TSA uses only nearest-neighbour local communication links for sending data. We illustrate the proposed design methodology by deriving a systolic design
for the fast Fourier transform.
2 Design methodology
The constraint imposed by a systolic array on the physical layout of the array is the local and regular interconnection between the processing elements. Thus algorithms whose DGs are local and regular can easily be mapped onto a systolic array. The two steps that are always followed in a systolic array design are finding a timing function T and an allocation function a. The timing function gives the time ordering of computations at different index points and must be compatible with the ordering of computations represented by the DG, i.e. if the computation at an index point z_1 depends on the computation at another index point z_2, then T(z_1) > T(z_2). A permissible allocation function a must be such that if a(z_1) = a(z_2) for two index points z_1 and z_2, then T(z_1) ≠ T(z_2).
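As a concrete illustration of these two constraints, the following Python sketch checks a candidate timing function and allocation function against an explicit list of dependence edges. The particular linear forms and the toy edge list are illustrative assumptions, not taken from the paper.

```python
# Sketch: validate a candidate timing function T and allocation function a
# against a dependence graph given as a list of edges (z1 depends on z2).
# T must satisfy T(z1) > T(z2) on every edge, and a must never place two
# index points with the same time step on the same PE.

def T(z):                         # assumed example: a linear timing function
    k, i = z
    return k + i

def a(z):                         # assumed example: project onto the i-axis
    k, i = z
    return i

edges = [((2, 2), (1, 1)), ((2, 2), (1, 2)), ((3, 3), (2, 2))]  # toy DG

def timing_ok(edges):
    return all(T(z1) > T(z2) for z1, z2 in edges)

def allocation_ok(points):
    slots = set()
    for z in points:
        slot = (a(z), T(z))       # (PE, time) pair must be unique
        if slot in slots:
            return False
        slots.add(slot)
    return True

points = {z for edge in edges for z in edge}
print(timing_ok(edges), allocation_ok(points))    # True True
```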
According to the uniformisation technique for linear recurrence equations described by Dongen and Quinton [12], we need to find a set of integral vectors B_1, ..., B_s such that any dependence vector u_i can be expressed as a non-negative integral combination of these vectors, together with a linear timing function T which is compatible with all the vectors B_1, ..., B_s, i.e. for every B_j, T · B_j > 0. If D is defined as the smallest domain that contains all the dependence vectors for all values of the size parameters of a particular problem, then C can be defined as the smallest cone pointed at the origin that contains D. It has been shown [12] that the existence of C is necessary for the parallelisation of the final uniform recurrences. Such a cone does not always exist; in that case, we need to consider a reindexing transformation.
To design a systolic array from a set of NLNURE, we follow a procedure similar to that outlined above. If we cannot find a cone C for the problem concerned, we try a reindexing transformation so that C exists for the reindexed DG. Next we find the set of vectors B_1, ..., B_s and T (T need not necessarily be a linear function). Because an integral combination of the vectors B_1, ..., B_s is used to replace any dependence vector u_i, we route the data along the directions of these vectors. The vectors B_1, ..., B_s are chosen such that when the dependence vectors of the DG are replaced with an integral combination of B_1, ..., B_s, the resulting DG has only local dependencies.
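The decomposition of a data-dependent dependence vector into routing directions can be illustrated with the pair B_1 = (1, 0), B_2 = (0, 1) that is chosen later for the FFT; the sketch below is ours and only handles this axis-aligned special case.

```python
# Sketch: express a dependence vector u as a non-negative integral
# combination of the routing directions B1 = (1, 0) and B2 = (0, 1).
# For this axis-aligned pair the coefficients are simply the components
# of u, provided both are non-negative (i.e. u lies inside the cone).

B1, B2 = (1, 0), (0, 1)
T = (1, 1)                                   # timing vector; T.B1, T.B2 > 0

def decompose(u):
    c1, c2 = u
    if c1 < 0 or c2 < 0:
        return None                          # outside the cone of B1, B2
    return c1, c2                            # u = c1*B1 + c2*B2

assert all(sum(t * b for t, b in zip(T, B)) > 0 for B in (B1, B2))

# dependence vectors of the reindexed FFT DG at stages k = 2, 3, 4
for k in range(2, 5):
    for u in [(1, 0), (1, 2 ** (k - 2)), (1, 2 ** (k - 1))]:
        print(k, u, "->", decompose(u))
```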
Although the locality property required for a systolic design has been satisfied, the regularity property cannot be satisfied (we have assumed that the DG of the algorithm cannot be transformed into a regular one by any known method). However, this implies that the relative position(s) of the index point(s) to which data from a particular index point is to be routed is known. It may also be noted that these relative positions may vary with the index points. To overcome this problem, we enhance the existing systolic array architecture by including tags for data routing. A tag is attached to the result of a computation to identify the relative location of the index point where the result is to be used. This type of systolic array using tags for data routing is called a tagged systolic array (TSA). A TSA employs only local and regular interconnection among the PEs (because data is routed only along a fixed number of directions given by B_1, ..., B_s, the interconnection among the PEs can be made regular). Thus for problems in which the DG cannot be transformed into a local and regular one by any known valid transformation, we can still explore the possibility of mapping the problem onto a TSA.
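A minimal sketch of the tag idea, as we read it from the description above: a result carries a tag equal to the number of hops, along one of the fixed routing directions, to the PE that needs it; every PE on the way decrements the tag and captures the value only when the tag reaches zero. The array size and injection schedule below are arbitrary illustrative values.

```python
# Minimal sketch of tag-based routing on a linear array of PEs: a datum
# injected at PE `src` carries a tag equal to the hop count to its consumer;
# each time-step it moves one PE to the right, the tag is decremented, and
# the PE that sees tag == 0 copies the datum into its local store.

def simulate(num_pes, injections):
    """injections: list of (time, src_pe, tag, value)."""
    captured = {}                           # pe -> list of (time, value)
    in_flight = []                          # (pe, tag, value)
    horizon = num_pes + max(t for t, _, _, _ in injections) + 1
    for t in range(horizon):
        in_flight += [(s, tag, v) for (ti, s, tag, v) in injections if ti == t]
        still_moving = []
        for pe, tag, v in in_flight:
            pe, tag = pe + 1, tag - 1       # one nearest-neighbour hop
            if tag == 0:
                captured.setdefault(pe, []).append((t, v))
            elif pe < num_pes:
                still_moving.append((pe, tag, v))
        in_flight = still_moving
    return captured

print(simulate(8, [(0, 1, 3, 'a'), (1, 2, 1, 'b')]))
# {3: [(1, 'b')], 4: [(2, 'a')]}
```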
3 Fast Fourier transform on tagged systolic array
The fast Fourier transform (FFT) [13] is one of the most powerful tools used in many signal and image processing applications. Given a sequence {x(0), x(1), ..., x(N-1)} of time-dependent inputs, the Fourier transform computes the output sequence {X(0), X(1), ..., X(N-1)} as follows:

X(k) = \sum_{n=0}^{N-1} x(n) e^{-j 2\pi nk/N} = \sum_{n=0}^{N-1} x(n) w^{nk}

where w = e^{-j 2\pi/N}.
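As a quick numerical sanity check of this definition (not part of the original paper), the following snippet evaluates the sum directly and compares it with a library FFT.

```python
# Check that the direct evaluation of X(k) = sum_n x(n) w^{nk}, with
# w = exp(-j*2*pi/N), matches a library FFT.
import numpy as np

N = 8
x = np.random.default_rng(0).standard_normal(N)
w = np.exp(-2j * np.pi / N)
X = np.array([sum(x[n] * w ** (n * k) for n in range(N)) for k in range(N)])
print(np.allclose(X, np.fft.fft(x)))      # True
```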
The DG for an N-point FFT (we assume N = 2^m, where m is an integer) is shown in Fig. 1; the constants involved in the computation have been ignored. The computation represented by the DG can be expressed by the following set of recurrence equations.
If k = 1, 1 ≤ i ≤ N then

A(k, i) = x_i    (1)

If 1 < k ≤ (log N + 1), 1 ≤ i ≤ N, 1 ≤ (i mod 2^{k-1}) ≤ 2^{k-2} then

A(k, i) = F[A(k - 1, i + 2^{k-2}), A(k - 1, i)]    (2)

If 1 < k ≤ (log N + 1), 1 ≤ i ≤ N, (i mod 2^{k-1}) = 0 or (i mod 2^{k-1}) > 2^{k-2} then

A(k, i) = F[A(k - 1, i), A(k - 1, i - 2^{k-2})]    (3)

If k > (log N + 1), 1 ≤ i ≤ N then

X(N - i) = A(k, i)    (4)

In the above recurrence equations, the input data (for N = 8) are

x_1 = x(7)  x_2 = x(3)  x_3 = x(5)  x_4 = x(1)  x_5 = x(6)  x_6 = x(2)  x_7 = x(4)  x_8 = x(0)
Fig. 1  Dependence graph (DG) for the N = 8 FFT
The function F describes the computation involved and is given by

F(a, b) = a + w^x · b

where w^x is a constant. The constants involved in the computation are assumed to be stored at the nodes. Recurrence eqns. 2 and 3 can be written in fully indexed form [9, 12].

If 1 < k ≤ (log N + 1), 1 ≤ i ≤ N, 1 ≤ (i mod 2^{k-1}) ≤ 2^{k-2} then

A(k, i) = F[A{(k, i) - (1, -2^{k-2})}, A{(k, i) - (1, 0)}]    (5)

If 1 < k ≤ (log N + 1), 1 ≤ i ≤ N, (i mod 2^{k-1}) = 0 or (i mod 2^{k-1}) > 2^{k-2} then

A(k, i) = F[A{(k, i) - (1, 0)}, A{(k, i) - (1, 2^{k-2})}]    (6)
According to the definition of URE given in Reference 12, the above set of recurrence equations can be identified as NLNURE. The set of dependence vectors for eqns. 5 and 6 (which represent the computation at an index point) is

H = {u_1, u_2, u_3} = {(1, 0), (1, -2^{k-2}), (1, 2^{k-2})}

The domain D containing all the dependence vectors for all values of the parameter k is given by D = {(k, i) | k > 0, -∞ < i < ∞}. Clearly, the cone C does not exist in this case. However, if the DG is reindexed such that an index point (k, i) in the original DG is mapped to the index point (k, i + 2^{k-1} - 1), the cone C exists. Fig. 2 shows the DG after reindexing, and Fig. 3 shows the cone C and the vectors B_1 and B_2. The set of recurrence eqns. 1 to 4 is modified accordingly.
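The effect of the reindexing can be checked mechanically. The sketch below (ours, with the index ranges taken from eqns. 5 and 6) applies the map (k, i) → (k, i + 2^{k-1} - 1) to every dependence edge of the N = 8 DG and confirms that all resulting dependence vectors have non-negative components, so a cone pointed at the origin that contains them exists.

```python
# Verify that reindexing (k, i) -> (k, i + 2**(k-1) - 1) leaves every
# dependence vector of the N = 8 FFT DG with non-negative components.
import math

N = 8
m = int(math.log2(N))

def deps(k, i):
    """Dependence vectors at (k, i) according to eqns. 5 and 6."""
    span = 2 ** (k - 2)
    if 1 <= i % (2 * span) <= span:          # condition of eqn. 5
        return [(1, -span), (1, 0)]
    return [(1, 0), (1, span)]               # eqn. 6

def reindex(k, i):
    return (k, i + 2 ** (k - 1) - 1)

ok = True
for k in range(2, m + 2):                    # k = 2, ..., log N + 1
    for i in range(1, N + 1):
        for dk, di in deps(k, i):
            nk, ni = reindex(k, i)
            sk, si = reindex(k - dk, i - di)  # the point depended upon
            ok = ok and nk - sk >= 0 and ni - si >= 0
print("all reindexed dependence vectors are non-negative:", ok)   # True
```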
If k = 1, 2^{k-1} ≤ i ≤ N + (2^{k-1} - 1) then

A(k, i) = x_i    (7)

If 1 < k ≤ (log N + 1), 2^{k-1} ≤ i ≤ N + (2^{k-1} - 1), 1 ≤ (i + 1 - 2^{k-1}) mod 2^{k-1} ≤ 2^{k-2} then

A(k, i) = F[A(k - 1, i), A(k - 1, i - 2^{k-2})]    (8)

If 1 < k ≤ (log N + 1), 2^{k-1} ≤ i ≤ N + (2^{k-1} - 1), (i + 1 - 2^{k-1}) mod 2^{k-1} = 0 or (i + 1 - 2^{k-1}) mod 2^{k-1} > 2^{k-2} then

A(k, i) = F[A(k - 1, i - 2^{k-2}), A(k - 1, i - 2^{k-1})]    (9)

If k > (log N + 1), 2^{k-1} ≤ i ≤ N + (2^{k-1} - 1) then

X(N - i) = A(k, i)    (10)
The new set of dependence vectors is

H' = {u'_1, u'_2, u'_3} = {(1, 0), (1, 2^{k-2}), (1, 2^{k-1})}

The set of vectors B_1, ..., B_s is now chosen as {B_1, B_2} = {(1, 0), (0, 1)}, and T' = (1, 1). If a linear allocation function a is chosen, then any a that satisfies the relation T' · a > 0 is permissible [11]. Thus a = (1, 0)^T and a = (0, 1)^T are both permissible allocation functions.

Fig. 2  Reindexed DG
Fig. 3  Cone C and vectors B_1 and B_2
Fig. 4  (2N - 1) PE systolic array
Fig. 5  (log N) PE systolic array

The choice a = (0, 1)^T will result in the design shown in Fig. 4. The array uses (2N - 1) PEs and tags for routing data. However, this design has the drawback that each PE has to compute the values of the tags dynamically in order to send results to other PEs, which increases the processor complexity.

If a = (1, 0)^T, the systolic array shown in Fig. 5, using log N PEs, results. This design does not use tags for routing data. However, it has the drawback that it requires exponentially increasing local memory within each PE, from the leftmost PE to the rightmost PE, for storing the partial results of the computation and the constants involved.

For a better design, a is taken to be the following permissible nonlinear function: for 2 ≤ k ≤ (log N + 1), an index point (k, i) is allocated to PE {(i mod 2^{k-2}) + 2^{k-1} - 2^{k-2}}. The resulting design, shown in Fig. 6, uses only (N - 1) PEs. The array is linear and uses tags for routing data.

Fig. 6  Tagged systolic array for the N = 8 FFT

Each PE stores only two pieces of data (intermediate results) and one constant for the computation. Two tag values are also stored in each PE, and these values are also constants. A tag is attached to the result of a computation when it is sent to a neighbouring PE. A PE checks the tag of a piece of data immediately after receiving it and copies the data if it is meant for it. In the next time-step, the piece of data passes on to the next PE.

A PE of the TSA derived for the FFT is shown in Fig. 7. Because most high-speed A/D converters are of relatively low precision, the input data are assumed to be 16-bit fixed-point words. Each PE executes the following program in one time-step.
Fig. 7  PE for the computation of the FFT using a tagged systolic array (legend: adder, subtractor, multiplier; A, B, C, D: register pairs; T1, ..., T6: tag registers; w: twiddle factor)

Line no.
 1  begin time-step
 2    send output data 1 and output data 2
 3    receive input data 1 and input data 2
 4    for all tags of input data
 5      decrement tag by 1
 6      if tag = 0 and count = 0
 7        copy data to register B
 8        count = 1
 9      else if tag = 0 and count = 1
10        copy data to register A
11        count = 2
12      end if
13      end if
14    end for
15    if count = 2
16      compute
17        output data 1 = A + w^x · B
18        output data 2 = A - w^x · B
19      count = 0
20    else
21      output data 1 = input data 1
22      output data 2 = input data 2
23    end if
24  end time-step

When the computation is started, the value of count is set to zero. Before starting the computation, the values of the tag constants and the constant involved in the computation are set.
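As a software cross-check of the per-PE behaviour (ours, not the authors' hardware description), the sketch below mimics one PE executing the program above. Data items are modelled as (tag, value) pairs, the twiddle factor and outgoing tag constants are fixed per PE, and, for simplicity, the results are returned immediately rather than being latched until the send of the next time-step.

```python
# Simplified software model of one TSA processing element executing the
# per-time-step program above (line numbers refer to that program).

class PE:
    def __init__(self, w, tag1, tag2):
        self.w = w                        # twiddle constant stored in the PE
        self.tags = (tag1, tag2)          # tag constants for the two results
        self.A = self.B = None            # operand registers
        self.count = 0                    # operands captured so far

    def step(self, in1, in2):
        """One time-step; returns the two outputs of the PE."""
        passed = []
        for item in (in1, in2):           # lines 4-14: inspect incoming tags
            if item is None:
                continue
            tag, value = item[0] - 1, item[1]      # line 5
            if tag == 0 and self.count == 0:       # lines 6-8
                self.B, self.count = value, 1
            elif tag == 0 and self.count == 1:     # lines 9-11
                self.A, self.count = value, 2
            else:
                passed.append((tag, value))        # not meant for this PE
        if self.count == 2:               # lines 15-19: butterfly computation
            self.count = 0
            return ((self.tags[0], self.A + self.w * self.B),
                    (self.tags[1], self.A - self.w * self.B))
        passed += [None, None]            # lines 20-22: forward incoming data
        return passed[0], passed[1]

pe = PE(w=1.0, tag1=2, tag2=1)            # arbitrary illustrative constants
print(pe.step((1, 3.0), (1, 5.0)))        # ((2, 8.0), (1, 2.0))
```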
4 Performance analysis
The (N - 1) PE TSA takes (N + log N) time-steps to complete the computation of an N-point FFT. The speed-up S is therefore given by S = (N log N)/(N + log N). The average processor utilisation decreases with increasing FFT size. The block pipelining period is N. Fig. 8 plots S against the number of PEs.

The performance of an (N - 1) PE TSA can be compared with that of an (N log N/2) PE network using butterfly interconnection. A PE of the latter type is shown in Fig. 9; Fig. 7 shows the PE of a TSA. To compare the two designs on the basis of the chip area required for computing an N-point FFT, the approximate number of gates (estimated from the data path synthesis) required for a PE is taken as the basis. The complexity of the controller is measured in terms of the number of basic operations it performs. The additional controller complexity of the TSA for the FFT results from lines 4 to 15 and 19 to 23 of the program that each PE executes. The approximate gate count (from the data path synthesis) of a TSA PE is 5400 and that of a butterfly PE is 4500. Thus an optimistic assumption would be that a PE of a TSA takes at most twice the area required for a processor employing butterfly interconnection for computing an FFT. However, because of the nonlocal and nonregular interconnection employed in the latter case, the total chip area required there is approximately double that required for the processors alone. If we assume that a butterfly processor occupies unit chip area, then the area required to compute an N-point FFT is 2(N log N/2) = N log N units. If we use a TSA to compute an N-point FFT, the chip area required is 2(N - 1) units. Table 1 shows the chip area comparison between the TSA and the butterfly network for the computation of an N-point FFT.
Fig. 8  Speed-up S as a function of the number of PEs

Fig. 9  PE for the computation of the FFT using butterfly interconnection between PEs (legend: 16-bit registers, multiplier, adder, subtractor; A, B, C, D: register pairs; w: twiddle factor; C = A + w^x · B and D = A - w^x · B; the outputs of all registers are tri-state)

Table 1: Chip area comparison (area in units)

  N     TSA    Butterfly
  8      14       24
 16      30       64
 32      62      160
 64     126      384
128     254      896
256     510     2048
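The entries of Table 1 and the speed-up quoted above follow directly from the formulas in the text; the short script below (ours) recomputes them, taking the processor utilisation to be the speed-up divided by the number of PEs.

```python
# Recompute time, speed-up, utilisation and the Table 1 areas from the
# formulas in the text: time = N + log N, S = N log N / (N + log N),
# TSA area = 2(N - 1) units, butterfly-network area = N log N units.
import math

for N in (8, 16, 32, 64, 128, 256):
    logN = int(math.log2(N))
    time = N + logN
    speedup = N * logN / time
    utilisation = speedup / (N - 1)
    print(f"N={N:3d}  time={time:3d}  S={speedup:6.2f}  "
          f"util={utilisation:4.2f}  TSA={2 * (N - 1):4d}  "
          f"butterfly={N * logN:5d}")
```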
Thus the chip area utilisation of a TSA is much better than that of a butterfly interconnection network for the computation of an FFT, especially when N is large. Table 2 shows a comparison, based on some other factors, of the TSA and the butterfly interconnection network for the computation of an FFT.
Table 2: Comparison of the (N - 1) PE TSA and the (N log N/2) PE network using butterfly interconnection for the computation of an N-point FFT

Factor                                        TSA                                    Butterfly
Area efficiency (computational area /         1                                      0.5
  total chip area)
Power efficiency (computational power /       1                                      0.5
  total power)
Design cost                                   less, because of the local and         more, because of the nonlocal and
                                              regular interconnection among PEs      nonregular interconnection among PEs
Fault tolerance (interconnection failure)     ~100% fault tolerant                   less than 100%
Modularity                                    modular                                non-modular
I/O bandwidth                                 4 data/time-step                       2N data/time-step
Processor utilisation                         less than 100%                         100%
Block pipelining period                       N                                      1
Chip area required                            2(N - 1) units                         (N log N) units
Time                                          N + log N                              log N

From the above analysis, we observe that both designs have certain strong aspects and certain weak aspects. It is difficult to relate all the factors by a common formula which could be used as a performance metric. An alternative approach is to assign credit points to each of the factors depending on their relative merit; the total points can then be used as a performance index for comparing the two designs. If we assign equal credit points for all factors with equal relative merit, we conclude that the TSA implementation of an FFT is superior.
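The credit-point comparison can be made concrete in a few lines. The scores below are an illustrative encoding of Table 2 (one point per factor to the design that fares better, equal weights assumed); they are our reading of the table, not figures from the paper.

```python
# Illustrative credit-point tally over the factors of Table 2: one point per
# factor to the design that fares better, with equal weights for all factors.
scores = {
    # factor:                   (TSA, butterfly)
    "area efficiency":           (1, 0),
    "power efficiency":          (1, 0),
    "design cost":               (1, 0),
    "fault tolerance":           (1, 0),
    "modularity":                (1, 0),
    "I/O bandwidth":             (1, 0),
    "processor utilisation":     (0, 1),
    "block pipelining period":   (0, 1),
    "chip area required":        (1, 0),
    "time":                      (0, 1),
}
tsa = sum(t for t, _ in scores.values())
butterfly = sum(b for _, b in scores.values())
print(f"TSA: {tsa} points, butterfly: {butterfly} points")   # TSA: 7, butterfly: 3
```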
5 TSA for other orthogonal transforms

The Hartley transform [10] of a data sequence {x(n); n = 0, 1, 2, ...} is given by

H(k) = (1/\sqrt{N}) \sum_{n=0}^{N-1} x(n)[\cos(2\pi kn/N) + \sin(2\pi kn/N)]

The fast Hartley transform (FHT) is very similar to the FFT, and the DG for an N = 8 FHT is the same as that for the N = 8 FFT except for the constants involved in the computation and the fact that the FHT involves only real arithmetic. Thus a similar TSA can be derived for the computation of an FHT, in which the PEs perform only real arithmetic computations.

The Hadamard transform [6] of a data sequence {x(n); n = 0, 1, 2, ...} is given by y = Hx, where H is an N × N matrix. For N = 8, the matrix H is

H = (1/(2\sqrt{2}))
    [  1  1  1  1  1  1  1  1
       1 -1  1 -1  1 -1  1 -1
       1  1 -1 -1  1  1 -1 -1
       1 -1 -1  1  1 -1 -1  1
       1  1  1  1 -1 -1 -1 -1
       1 -1  1 -1 -1  1 -1  1
       1  1 -1 -1 -1 -1  1  1
       1 -1 -1  1 -1  1  1 -1 ]

The fast Hadamard transform also involves only real arithmetic computations. The DG for a fast Hadamard transform is the same as that for the N = 8 FFT except for the constants involved in the computation. Thus a TSA similar to that for the FFT can be derived in this case.

The discrete cosine transform (DCT) [10] of a data sequence {x_n; n = 0, 1, ..., N - 1} is given by the output sequence {z_k; k = 0, 1, ..., N - 1}, where

z_k = (2 e(k)/N) \sum_{n=0}^{N-1} x_n \cos[\pi(2n + 1)k/(2N)]

and e(k) = 1/\sqrt{2} for k = 0, and e(k) = 1 otherwise.
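To illustrate how close the Hartley transform is to the FFT, the following check (ours) evaluates the definition above directly and compares it with the real and imaginary parts of a library FFT; the 1/\sqrt{N} normalisation follows the formula as given.

```python
# The Hartley transform above equals (Re - Im) of the DFT, up to the
# 1/sqrt(N) normalisation used in the definition.
import numpy as np

N = 8
x = np.random.default_rng(1).standard_normal(N)
n = np.arange(N)
H = np.array([(x * (np.cos(2 * np.pi * k * n / N) +
                    np.sin(2 * np.pi * k * n / N))).sum()
              for k in range(N)]) / np.sqrt(N)
X = np.fft.fft(x)
print(np.allclose(H, (X.real - X.imag) / np.sqrt(N)))   # True
```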
Ordinarily, a DCT is calculated from an FFT by combining, for k = 0, 1, 2, ..., N/2 - 1, the terms cos θ_k · R_k(x') and sin θ_k · I_k(x'), where R and I are the real and imaginary parts of the FFT respectively, z_{N/2}(x) = \sqrt{2/N} · R_{N/2}(x'), x'_k = x_{2k}, x'_{N-k-1} = x_{2k+1} and θ_k = πk/2N. Hence the DCT sequence can be obtained from the FFT sequence by including two additional processors that implement the above computation. Because the outputs from an FFT tagged systolic array are available sequentially, the DCT outputs are also available sequentially.

6 Conclusion

The design methodology and the enhanced systolic architecture discussed in this paper allow us to consider a broader class of algorithms amenable to implementation on a TSA. The starting point of the design is a set of recurrence equations describing the algorithm. When no apparent transformation can be applied to the set of recurrence equations to convert them into a set of uniform recurrence equations, or, alternatively, when the DG describing the algorithm cannot be transformed into a regular one, we may try to implement the problem on a TSA. A TSA design for the FFT has been derived and shown to be better in certain respects than the design employing butterfly interconnection among PEs. Finally, it has been shown that similar designs can be derived for some other important orthogonal transforms.
7 References

1 KUNG, H.T.: 'Why systolic architectures?', IEEE Computer, January 1982
2 MOLDOVAN, D.I.: 'On the design of algorithms for VLSI systolic arrays', Proc. IEEE, 1983, 71
3 LI, G.H., and WAH, B.W.: 'The design of optimal systolic arrays', IEEE Trans. Comput., 1985, 34, (1)
4 ULLMAN, J.D.: 'The computational aspects of VLSI' (Computer Science Press, 1984)
5 DELOSME, J.M., and IPSEN, I.C.F.: 'Efficient systolic arrays for the solution of Toeplitz systems: an illustration of the methodology for the construction of systolic architectures for VLSI', in MOORE, W., McCABE, A., and URQUHART, R. (Eds.): 'Construction of systolic architectures' (Adam Hilger, 1986)
6 KUNG, S.Y.: 'VLSI array processor' (Prentice-Hall, New Jersey, 1988)
7 RAO, S.K.: 'Regular iterative algorithms and their implementations on processor arrays'. PhD thesis, Stanford University, USA, 1985
8 YAACOBY, Y., and CAPPELLO, P.R.: 'Scheduling a system of affine recurrence equations onto a systolic array'. Proceedings of the International Conference on Systolic Arrays, San Diego, California, May 1988
9 KARP, R.M., MILLER, R.E., and WINOGRAD, S.: 'The organisation of computations for uniform recurrence equations', J. ACM, 1967, 14, (3)
10 HOU, S.H.: 'The fast Hartley transform algorithm', IEEE Trans. Comput., 1987, C-36, (2)
11 QUINTON, P.: 'The systematic design of systolic arrays'. IRISA Research Report No. 193, April 1983
12 DONGEN, V., and QUINTON, P.: 'Uniformization of linear recurrence equations: a step towards automatic synthesis of systolic arrays'. Proceedings of the International Conference on Systolic Arrays, San Diego, California, May 1988
13 NUSSBAUMER, H.J.: 'Fast Fourier transform and convolution algorithms' (Springer-Verlag, 1982, 2nd edn.)