IEEE TRANSACTIONS ON COMPUTERS, VOL. C-27, NO. 7, JULY 1978

Parallel Permutations of Data: A Benes Network Control Algorithm for Frequently Used Permutations

JACQUES LENFANT

Abstract-The Benes binary network can realize any one-to-one mapping of its $2^n$ inlets onto its $2^n$ outlets. Several authors have proposed algorithms which compute control patterns for this network from any bijection assignment. However, these algorithms are both time-consuming and space-consuming. In order to meet the time constraints arising from the use of a Benes network as the alignment network of a parallel computer, another approach must be chosen. In this paper, we consider typical functions and show that the set of needed permutations of data is very small, as compared to the whole symmetric group. We gather frequently used bijections into five families. For each family we present an algorithm that can control the two-state switches on the fly, as the vector of data passes through the network. Finally, we describe one possible scheme to implement an instruction "Trigger a Frequently Used Bijection."

Index Terms-Alignment network, Benes network, Clos network, divide-and-conquer technique, memory-processor connection, parallel computer, permutation network, switching network.

Manuscript received February 2, 1977; revised October 6, 1977. This work was supported by the Service d'Orientation et de Synthese de la Recherche en Informatique (France) under Contract SESORI-IRIA76121. The author is with the Institut de Recherche en Informatique et Systemes Aleatoires, University of Rennes, Rennes, France.

I.
INTRODUCTION

FOR SEVERAL YEARS, there has been considerable interest in array computer architecture [28] which, according to Flynn's classification [8], is designated by the acronym SIMD (Single-Instruction stream, Multiple-Data stream). These computer systems are characterized by a large number of identical processing elements (PE's) to which a single instruction is broadcast from a central control during each time unit. Only those PE's which are not inhibited as a result of previous operations execute the common instruction. In order to avoid access conflicts, the main storage is divided into memory blocks which are at least as numerous as the processing elements.

Two types of connection between processing elements and memory blocks are depicted in Fig. 1. The first structure is appropriate only when the number N of processing elements is equal to the number M of memory blocks. Then each processing element is permanently linked to one, and only one, memory block. Communication between processing elements is established by means of the interconnection network, which can realize some permutations on the contents of registers. When several permutations are successively performed on the same vector of data, this structure is very efficient, since the main storage, whose technology is usually slower than processing element technology, is not involved in the transfers. This design was chosen for most array computers which are currently in use or under development [1], [3], [11].

A slightly different approach is also illustrated by Fig. 1. If the routing network is used to connect memory blocks to processing elements, then it can cope with a number M of memory blocks greater than the number N of processing elements.
This is an interesting possibility since, as shown by Lawrie [16], access to rows, columns, diagonals, reverse diagonals, and blocks of a matrix cannot all be implemented if M = N, but they can if M = 2N.

The performance of the interconnection network is critical with respect to the overall performance of an array computer. Its design must be aimed at achieving low cost of hardware, high operating speed, large combinatorial power (i.e., a high percentage of realizable bijections of inputs onto outputs), and simplicity of control. In this paper, we describe a network that can realize any one-to-one connection between $N = 2^n$ inputs and N outputs. For this network we shall propose algorithms that yield control patterns for the bijections frequently used in parallel processing. These selected bijections, referred to in the sequel as Frequently Used Bijections (FUB's), are grouped into five families. Within its family, a bijection is referenced by a small number of parameters. More precisely, the name of a FUB can be easily coded with a number of bits that is close to 2n. Consequently, the length of such a name is, in practice, smaller than the memory word length.

The next section is devoted to a description of FUB's and to a survey of those interconnection networks which have been previously designed for SIMD computers. In Section III, we describe the Benes binary rearrangeable network [5] for which we present control algorithms and a possible implementation at hardware level.

Fig. 1. Memory and processing elements in SIMD computers-two cases. (Memory blocks MB, processing elements PE, interconnection network.)

II. PARALLEL ALGORITHMS AND INTERCONNECTION NETWORKS

A. Notation

Let n be a strictly positive integer and denote by $E^{(n)}$ the set $\{0, 1, 2, \ldots, 2^n - 1\}$. The elements of $E^{(n)}$ will be considered as integers, as residue classes modulo $2^n$, or as n-dimensional vectors whose elements are either 0 or 1: the vector $(x_n,$
xn1, , x2, x1) is identified with the integer 1l. xi2--, xn being the most significant digit. The x= symbol D denotes the bit-per-bit EXCLUSIVE OR operation. A segment of size 2' in E(") is a subset of E(") consisting of 2' consecutive integers, of which the smallest is a multiple of 2' (O < i < n). We denote by ,9(n) the set of permutations on E("). The permutations that keep all the subsets {2j, 2j + 1} of E(") globally invariant (O < j < 2n- 1) are of special interest in this paper. They form a subgroup of the symmetric group that is denoted by _on). Let h,j, 1, m, and k be five nonnegative integers, k being less than 2". The permutations on E(") that will be considered in the sequel are defined in Table I and commented on in Sections II-B and C. In Table I, 'a and a2 are the subvectors (xn j, x"- - 1, , x 1) and (x",xn _ 1, , xn-j+ 1), respectively. In the same way, b4, b3, b2, b1 and c5, C4, C3, C, I are bit substrings of x = (x", x"-, , x2, xl) whose lengths are j, 1, m, n - (j + 1 + m) and j, 1, m, h, n - (j + I + m + h), respectively. For instance, since 97 = (1, 1, 0, 0, 0, 0, 1), permutation I473)7,1,97 maps the element (X7 , X, X3, X2 , X1) 0 1 7-4-0-1 X5, X4, 4 bits of E(7) onto the element (X7, X6, X5, X4 X2 X1, X3) where x, = xi ( 1 is the complement of xi. B. Bijections Used in Parallel Processing Five families of bijections are defined in Table I. Each of them was obtained by selecting a set of permutations with respect to their usefulness and by extending this set, if necessary, in order to meet a stability criterion. This stability criterion, which is explained in Section III, allows recursive 639 LENFANT: PARALLEL PERMUTATIONS OF DATA PERMUTATIONS :X=(xnaxn- x2.xl) x=xnxn-i *'x2,0l) ,(n) p(n) 1 . x - - n k , (xn_1.-- x2 xI xn) o- (x ,lx2 *. aaxn-l0xn) n T(n) TABLE I SET E(1) = {0, 1, 2, ON THE a x n k x + k 2" - 1} PERFECT SHUFFLE BIT REVERSAL mod. 
2n CYCLIC SHIFT OF AMPLITUDE k THE FIVE FAMILIES x(n) a2,1) <(,k (a2' 1) mn) (E5,E4,EJ,E2,E1) PERMUTATIONS IN SUBGROUP (n) (or (n) (or + a k mod. 2 (a2 0 k (a1)) ^(b4,b1,,b3) - (j odd) CYCLIC SHIFT WITHIN SEGMENTS OF SIZE 2nBIT REVERSAL WITHIN SEGMENTS (Al)) (j n) OF SIZE * (V5,A2,EJ,E0,V1( k (ji+t+m < (j+a+m+ll 2n-j n) nn) SEGMENT SHUFFLE B(n) x CO (n)) IDENTITY x + 1 sX2') - in)(ar x=(xC.Xn_ls. (n x (ao> (b4, b3, b2'b1) B,(,m,k C j x ,k J -xn'n- n"l .. = if x nn x is even EXCHANGE @ xj = (xn,xn_l, .......... (1 algorithms to be developed for the control of a Benes binary rearrangeable network. Because of this extension, a family may include some permutations of little interest to parallel programming. Nevertheless, any permutation belonging to one of these five families -will be referred to as a FUB. Most permutations which are necessary to implement previously published algorithms are FUB's. Typical applications are presented below. 1) The Family {),j')}: If j is an odd integer, A(J) is the bijection that maps any x in E(nl ontoj x + k mod 2n. Note that A(n) is the cyclic shift of amplitude k and is also denoted by r( . These permutations are used to implement parallel LOAD and STORE operations on matrices. In order to take advantage of the SIMD architecture, data and computations must be organized so that as many instructions as possible can fetch N words of data and process them simultaneously. If two of the data words to be fetched are in the same memory block, all the processors sit idle until the second memory cycle is completed. This can seriously degrade the performance of these computer systems. As far as numerical analysis applications are concerned [12], it is usual to handle matrices by rows, columns, diagonals, and blocks. Therefore, much attention has been paid to storage schemes which allow conflict-free access to these subarrays [6], [16], [22]. 
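As a small illustration of the first family (added here as a sketch, not taken from the paper), the Python fragment below checks that the map x -> j x + k mod 2^n is a permutation of E(n) exactly when j is odd, and that its inverse is again a member of the same family. The function names `lam`, `is_bijection`, and `lam_inverse_params` are hypothetical helpers chosen for this example.

```python
def lam(j, k, n):
    """Return the map x -> (j*x + k) mod 2**n as a list indexed by x."""
    size = 1 << n
    return [(j * x + k) % size for x in range(size)]


def is_bijection(mapping):
    """True when the list is a permutation of 0 .. len(mapping)-1."""
    return sorted(mapping) == list(range(len(mapping)))


def lam_inverse_params(j, k, n):
    """Parameters (j', k') such that lam(j', k', n) undoes lam(j, k, n).

    Requires j odd, so that j is invertible modulo 2**n.
    """
    size = 1 << n
    j_inv = pow(j, -1, size)          # modular inverse, Python 3.8+
    return j_inv, (-j_inv * k) % size


if __name__ == "__main__":
    n = 4
    # Odd multipliers give permutations of E(n); even ones do not.
    assert all(is_bijection(lam(j, 3, n)) for j in range(1, 16, 2))
    assert not any(is_bijection(lam(j, 3, n)) for j in range(0, 16, 2))

    forward = lam(5, 3, n)
    j_inv, k_inv = lam_inverse_params(5, 3, n)
    backward = lam(j_inv, k_inv, n)
    assert all(backward[forward[x]] == x for x in range(1 << n))
    print("inverse parameters:", j_inv, k_inv)
```

The closure under inversion is what makes the family convenient for both LOAD and STORE: scrambling and unscrambling are named by the same kind of parameter pair.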
As in Illiac IV software [13], linear skewing schemes may be used: in an N-block memory, the element $a_{s,t}$ of an array A is assigned to memory block $u s + v t \pmod N$. Then row access is conflict-free iff v and N are relatively prime (similar conditions hold for parallel access to other types of subarray). Fetching row $s_0$ consists of loading $a_{s_0,i}$ into the accumulator of processing element i for all i, $0 \le i \le N - 1$. As element $a_{s_0,i}$ is stored in memory block $u s_0 + v i$, the N-word vector delivered by the memory must be unscrambled [26]: that is, the word delivered by memory block x ($0 \le x \le N - 1$) must be broadcast to accumulator $v^{-1} x - v^{-1} u s_0$, where $v^{-1}$ denotes the inverse of v in the ring of integers modulo N. This permutation, which is performed by the interconnection network, belongs to the family $\{\lambda^{(n)}_{j,k}\}$: here $n = \log_2 N$, $j = v^{-1}$, $k = -v^{-1} u s_0$. The same family is involved in accesses to other subarrays of interest: columns, diagonals, and blocks.

2) The Family $\{\delta^{(n)}_{j,k}\}$: The permutation $\delta^{(n)}_{0,k}$ is the cyclic shift of amplitude k. By $\delta^{(n)}_{1,k}$, a shift of k places occurs within both the first and second half of $E^{(n)}$ (see Fig. 2). In the same way, $\delta^{(n)}_{2,k}$ shifts each quarter, and so on. Permutations within segments, of which the $\delta^{(n)}_{j,k}$ are examples, are very useful in parallel processing. They arise when the "divide and conquer" technique is used in algorithm design; that is, when a computation on N items is replaced by two computations on N/2 items, which are themselves replaced by four computations on N/4 items, and so forth. It is worth noting that the permutations available to the programmers of the Staran system [3] are the cyclic shifts within segments, $\delta^{(n)}_{j,k}$, and the Mirror permutations within segments, $\tau^{(n)}_k$ ($k = 2^j - 1$, $2 \le j \le n$).

3) The Family $\{\alpha^{(n)}_{j,k}\}$: The most interesting permutations in the family $\{\alpha^{(n)}_{j,k}\}$ are the bit reversal on the whole set $E^{(n)}$, $\alpha^{(n)}_{0,0} = \rho^{(n)}$, and the bit reversals within segments, $\alpha^{(n)}_{j,0}$.
In conjunction with the perfect shuffle, they can be used to transfer data during the computation of a fast Fourier transform [25]. Another important subset of this family consists of $\alpha^{(n)}_{n,k} = \tau^{(n)}_k$ ($0 \le k < 2^n$), which maps x onto $x \oplus k$. As an example of its usefulness, let us consider the skewing scheme built into the Staran hardware [3]. The memory of this computer may be viewed as a $2^n \times 2^n$ array of 1-bit words, whose element $a_{x,y}$ ($0 \le x, y < 2^n$) is assigned to block $x \oplus y$. Since the EXCLUSIVE OR operation is a group operation, any row or column may be accessed in one memory cycle. Permutation $\tau^{(n)}_i$ unscrambles line i. The other permutations, i.e., $\alpha^{(n)}_{j,k}$ with $k \ne 0$ or $j \ne n$, are necessary to meet the stability requirement (see Theorem 4). With a few exceptions, they have not been used in previously published algorithms.

Fig. 2. The permutation $\delta$ (indices illegible in the source; the figure shows 16 elements, numbered 0 to 15).

4) The Families $\{\beta^{(n)}_{j,l,m,k}\}$ and $\{\gamma^{(n)}_{j,l,m,h}\}$: Much emphasis has been laid on the prominent part played by the perfect shuffle in parallel programming (see [25], for instance). This permutation maps $x = (x_n, x_{n-1}, \ldots, x_2, x_1)$ onto $\sigma^{(n)}(x) = (x_{n-1}, \ldots, x_2, x_1, x_n)$. The families $\{\beta^{(n)}_{j,l,m,k}\}$ and $\{\gamma^{(n)}_{j,l,m,h}\}$ mainly include variations around this bijection. The permutation $\gamma^{(n)}_{j,l,0,h}$ ($j + l + h \le n$) has the following interpretation: within each segment of size $2^{n-j}$, subsegments of size $2^{n-j-l-h}$ are reordered by the lth power of the perfect shuffle. This relationship between the subfamily $\{\gamma^{(n)}_{j,l,0,h}\}$ and the perfect shuffle is shown in Fig. 3. The perfect shuffle itself is $\sigma^{(n)} = \gamma^{(n)}_{0,1,0,n-1}$. The permutation $\beta^{(n)}_{j,l,m,0}$ is just $\gamma^{(n)}_{j,l,m,n-j-l-m}$. The introduction of permutations $\beta^{(n)}_{j,l,m,k}$ with $k \ne 0$ into the family $\beta$ is essentially aimed at stability in the sense of Section III.
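The perfect shuffle and its restriction to segments can be tried out with a few lines of Python. This is only a sketch under stated assumptions: the bit-level reading of the segment shuffle in `segment_shuffle` (keep the j leading bits and the n-j-l-h trailing bits, rotate the field between them) is our interpretation of the prose description above, and both function names are hypothetical.

```python
def perfect_shuffle(x, n):
    """(x_n, ..., x_1) -> (x_{n-1}, ..., x_1, x_n): rotate the n-bit index left by one."""
    top = (x >> (n - 1)) & 1
    return ((x << 1) & ((1 << n) - 1)) | top


def segment_shuffle(x, n, j, l, h):
    """Assumed reading of the segment shuffle with parameters (j, l, 0, h).

    The j leading bits select the segment and stay in place, the
    r = n-j-l-h trailing bits address a word inside a subsegment and stay
    in place, and the (l+h)-bit field between them is moved by the l-th
    power of the perfect shuffle on l+h bits.
    """
    r = n - j - l - h                 # log2 of the subsegment size
    w = l + h                         # width of the shuffled field
    low = x & ((1 << r) - 1)
    mid = (x >> r) & ((1 << w) - 1)
    high = x >> (r + w)
    for _ in range(l):                # apply the w-bit perfect shuffle l times
        mid = perfect_shuffle(mid, w)
    return (high << (r + w)) | (mid << r) | low


if __name__ == "__main__":
    n = 4
    print([perfect_shuffle(x, n) for x in range(1 << n)])
    # With j = 0, l = 1, h = n-1 the segment shuffle is the perfect shuffle itself.
    assert all(segment_shuffle(x, n, 0, 1, n - 1) == perfect_shuffle(x, n)
               for x in range(1 << n))
    # With j = 1 each half of the vector is shuffled on its own.
    assert all(segment_shuffle(x, n, 1, 1, n - 2)
               == (x & 8) | perfect_shuffle(x & 7, 3)
               for x in range(1 << n))
```

Here a word sitting at position x is understood to move to position `perfect_shuffle(x, n)`; the opposite convention would give the inverse permutation.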
The use of the perfect shuffle in programming the fast Fourier transform has been noted in the previous section. As concerns parallel sorting, the permutations involved in Batcher's network [2] are perfect shuffles within segments of size $2^i$ ($2 \le i \le n$) and their inverses; i.e., $\gamma^{(n)}_{n-i,1,0,i-1}$ and $\gamma^{(n)}_{n-i,i-1,0,1}$.

Let us consider a final example of the use of these bijections. If a $2^a \times 2^{n-a}$ array A is stored as a vector of data in the Algol fashion (i.e., row by row), its transposition is the permutation $\gamma^{(n)}_{0,a,0,n-a}$, as may easily be verified. If an array B, declared as B[1: $2^a$, 1: $2^b$, 1: $2^{n-a-b}$], is stored in this way, its "transpose" C, such that B[i, j, k] is equal to C[i, k, j] (to C[j, k, i], C[j, i, k], C[k, i, j], C[k, j, i], respectively), is obtained by permutation $\gamma^{(n)}_{a,b,0,n-a-b}$ ($\gamma^{(n)}_{0,a,0,n-a}$, $\gamma^{(n)}_{0,a,0,b}$, $\gamma^{(n)}_{0,a+b,0,n-a-b}$, $\gamma^{(n)}_{0,a,b,n-a-b}$, respectively).

5) Remarks: A given FUB may appear several times in the same family. For instance, $\gamma^{(n)}_{j,l,0,h}$ and $\gamma^{(n)}_{j,0,l,h}$ are the same bijection. The identity may be denoted as $\gamma^{(n)}_{j,l,0,0}$ or as a second form whose indices are illegible in the source ($0 \le j + l \le n$). Moreover, a FUB may belong to several families. For instance, $\tau^{(n)}_k$ equals $\alpha^{(n)}_{n,k}$, or $\beta^{(n)}_{0,n,0,k}$. If $k = 2^j$ ($0 \le j < n$), then $\tau^{(n)}_k$ is the cyclic shift of amplitude $2^j$ within segments of size $2^{j+1}$: $\delta^{(n)}_{n-j-1,2^j}$. Note also that $\tau^{(n)}_1$ is the Exchange permutation $\xi^{(n)}$.

In a parallel computer, the permutations of data are performed by the interconnection network. Several possible schemes have been considered for this essential device. They are presented below.

Fig. 3. The structure of the segment shuffles. (Panel labels: SEGMENT SHUFFLE, PERFECT SHUFFLE.)

C. A Survey of Interconnection Networks

The 64 processing elements of Illiac IV are connected according to the Nearest Neighbor scheme. Interpreting the set of processing elements as a square, this scheme means that processing element i ($0 \le i \le 63$) is connected to processing elements i + 1, i - 1, i + 8, i - 8 (mod 64).
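To get a feeling for how restrictive the Nearest Neighbor scheme is, the following Python sketch (an illustration added here; the names `neighbours` and `transfers_needed` are hypothetical) lists the four links of a processing element and counts, by breadth-first search, how many successive one-step transfers are needed to move a word from PE i to PE i + k (mod 64).

```python
from collections import deque

N = 64
STEPS = (+1, -1, +8, -8)   # the four Nearest Neighbor links of processing element i


def neighbours(i):
    """Processing elements directly linked to PE i."""
    return [(i + s) % N for s in STEPS]


def transfers_needed():
    """Fewest one-step transfers from PE 0 to PE k, for every k.

    By symmetry the same count applies to a uniform shift of amplitude k
    applied to all processing elements at once.
    """
    dist = {0: 0}
    queue = deque([0])
    while queue:
        i = queue.popleft()
        for j in neighbours(i):
            if j not in dist:
                dist[j] = dist[i] + 1
                queue.append(j)
    return [dist[k] for k in range(N)]


if __name__ == "__main__":
    print(neighbours(0))
    d = transfers_needed()
    print("one step reaches", d.count(1), "amplitudes; worst case:", max(d), "steps")
```

Only four shift amplitudes are available in a single step; every other shift has to be composed from several transfers, which is the cost the networks surveyed next try to avoid.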
In , T(6)) and T(6) can be other words, only permutations T(6), T(6) performed in one step. Orcutt [20] considered the problem of generating any permutation by repeated transfers of data through such an interconnection network. A more powerful network to realize bijections between N = 2" inlets and 2" outlets is the Omega network proposed by Lawrie [16]. This is composed of n stages of 2 x 2 switches linked by perfect shuffles (see Fig. 4). Control patterns may be easily derived from permutation assignment [15], [16]. Unfortunately, this network is not capable of realizing all possible permutations. In one pass, it can perform such bijections as A(n) , T(z), or 60), but not the perfect shuffle nor the bit reversal. As an extension of Lawrie's Omega network, Lang and Stone [14] have proposed a network that can realize any permutation in 0(8N/) shuffleexchange steps. A Clos network is a three-stage network defined by three parameters p, q, r, and consequently will be denoted hereafter by C(p, q, r). It is constructed from crossbar switches whose dimensions depend on the stages p x q, r x r, and q x p, respectively. Each output of a stage is linked to an input of the next stage (see Fig. 5). The Slepian-Duguid LENFANT: PARALLEL PERMUTATIONS OF DATA 641 0 1 0 2 3 2 4 4 Recall that Y(n) is the symmetric group on E(, whereas #'n) is the subgroup of those permutations which keep all the subsets {2j, 2j + 1} (O < j < 2" 1) globally invariant. A. From the Clos Network C(2, 2, 2"- 1) to the Benes Network R ") As a result of the Slepian-Duguid theorem, the Clos network C(2, 2,2n- 1) is capable of realizing all permutations between inputs and outputs (see Fig. 6). In other words, for any permutation 0 in o(n), there are permutations E1 and v2 in _on) 4) and i in °('- ) such that 0 = C2 ° C(n) (4, f) 0 (,(n))-1 0 , (3) Hereafter, we contract equality (3) into the following notation: 1 5 6 7 6 7 Fig. 4. The -Omegai-etwork with 8 inputs- and 8 outputs. 
pxq switches r-r switches qxp switches 0 [SI; (0, 0); S21(4) Note that all permutations 0 may be generated in at least two different ways. By another casting between the upper switch and the lower switch in the middle stage, we obtain 0 = It 1; (0, uf); £g2] [4(n) ° £1; (l*, 0); X{") ° £2]. (5) (Remember that ~4n) iS the Exchange permutation.) Moreover, the inverse of permutation 0 is 0-l = [C2; ()-'1, 0 1); C1]. (6) If we replace the median switches of C(2, 2, 2"-') by networks C(2, 2, 2" -2) and repeat this operation until the = rr r - Fig. 5. The three-stage Clos network C(p, q, r). theorem [5] states that this network can realize all bijections between its p r inputs and outputs if (and obviously, only if) q p: with telephone exchange terminology, this condition means that the network is rearrangeable. As concerns the cost of hardware, a C(16, 16, 16) network is competitive with the Omega network -to interconnect 256 processing elements [11]. However, no control procedure is currently known for a Clos network, with the exception of some algorithms that may meet time constraints in a switchboard environment, but are completely inadequate for parallel computers [5]. III. CONTROL ALGORITHMS In this section, we FOR present a BENES BINARY NETWORK Benes binary rearrangeable network for interconnection of2" inputs and 2" outputs. This network may be derived from the Clos network C(2, 2,2"-') by repeated use of the Clos structure until all switches are 2 x 2 switches. Th-anks to this recursivity in the design of the network, we can derive recursive algorithms which yield control patterns for the FUB's (Sections III-B and C). Finally, we discuss some implementation ideas. First, let us introduce a new notation. If a permutation v median switches are 2 x 2 switches, we then obtain the network R("), of which some properties have been studied by Benes [5]. As an example, R(3) is shown in Fig.-7. 
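The three-stage factorization discussed above (a first stage acting only inside the pairs {2j, 2j+1}, two half-size permutations for the upper and lower median switches, and a last stage again acting inside pairs) can be made concrete with the classical two-colouring ("looping") argument. The Python sketch below is a general-purpose illustration added here, not one of the control algorithms of this paper, and the names `decompose` and `realize` are hypothetical. It assumes that a line y entering the inverse shuffle goes to line y div 2 of median switch y mod 2.

```python
import random


def decompose(theta):
    """Split a permutation theta of 0..N-1 (N even) into (eps1, phi, psi, eps2).

    eps1 and eps2 act inside the pairs {2j, 2j+1}; phi and psi are the
    permutations on N/2 points realized by the upper and lower median switch.
    """
    size = len(theta)
    inverse = [0] * size
    for x, y in enumerate(theta):
        inverse[y] = x
    side = [None] * size              # 0: upper median switch, 1: lower median switch
    for start in range(size):
        x = start
        while side[x] is None:
            side[x] = 0
            mate = x ^ 1              # the other input of the same first-stage switch
            side[mate] = 1
            # next: the input whose image shares a last-stage switch with theta[mate]
            x = inverse[theta[mate] ^ 1]
    half = size // 2
    phi, psi = [None] * half, [None] * half
    eps1, eps2 = [None] * size, [None] * size
    for x in range(size):
        (phi, psi)[side[x]][x >> 1] = theta[x] >> 1
        eps1[x] = (x & ~1) | side[x]
        eps2[(theta[x] & ~1) | side[x]] = theta[x]
    return eps1, phi, psi, eps2


def realize(theta):
    """Rebuild theta by routing every line through the recursive structure."""
    if len(theta) == 2:
        return list(theta)            # a single two-state switch
    eps1, phi, psi, eps2 = decompose(theta)
    median = (realize(phi), realize(psi))
    out = []
    for x in range(len(theta)):
        y = eps1[x]
        side, line = y & 1, y >> 1    # inverse shuffle: low bit selects the median switch
        out.append(eps2[(median[side][line] << 1) | side])   # shuffle, then last stage
    return out


if __name__ == "__main__":
    random.seed(0)
    for n in (1, 2, 3, 4, 5):
        for _ in range(100):
            theta = list(range(1 << n))
            random.shuffle(theta)
            assert realize(theta) == theta
    print("every sampled permutation was factorized and rebuilt")
```

Choosing side 1 instead of side 0 for the first input of each loop yields another valid factorization, which reflects the remark that a permutation can be generated in more than one way.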
Network R(") is made from (2n - 1) stages of 2" ' two-state switches that perform the direct connection 1) or the crossed connection 4<1) between their two inputs and their two outputs. The links between the stages of R(") realize the bijections (from left to right) !0n- 1,0,09 ,n_-2,0,09 2,1,0,0 (j,n- j -1,0,0 for the first half of the network and, for the second half, .. ... 9 fln 2, 1,0,0 9 9 I ,10,0 9 .. 9 ,1,0,01) g0 1,0, The Benes network R(") can realize all the bijections between its inputs and its outputs. As a matter of fact, it results from equality (5) that each bijection can be obtained by at least 2(2N 1_ l) different control patterns. Another consequence of equality (5) is that, without loss of combinatorial power, one switch in the first (or in the third) stage of C(2, 2, 2n -l) can be permanently forced to one of its possible states-or be replaced by a fixed link. This modification of a Clos network can be repeated in all steps ofthe construction of R(", saving (2" - 1) basic switches. Waksman [29] has considered such a modified Benes network. E(") is such that for all x = (x", xn- 1, *, x ,) in E"), the most significant digits of x and v(x) are equal, then we shall denote it by v = (+, *) (2) B. Controlling the Benes Network R ") where and are the bijections on E("- 1) defined by The structure ofcontrol algorithms will be traced from the X2, X1) = V(0, xn-1 * - i X2 X1) (Xn -15 recurrent structure of the network R We consider families of permutations that meet the following stability property: a X2, XI) f(Xn 1 V(1, Xn_1, * *. X2, X1). on . ..., - , = 642 IEEE TRANSACTIONS ON COMPUTERS, VOL. c-27, NO. 
7, JULY 1978 0 If n = 1 and j = 0, then b(f =) = If j = n, then 0 1 2 3 2 3 4 5 4 5 6 7 6 7 ~~~~~~~~~~~~~0 2 2 3 3 4 4 .5 5 6 6 () Theorem 5: If n R(n) > (n) _ (n) (n) _ (n) > 1 j+ +m g, =-1,k= Tf, fOn,k =Tk * = n _ r (n) tn -j I+ I VIj,,,m I- 1,m + 1,k'9 ,I lm,k- I(f-1) where k* = 6: k' If n 2m R(n) _ > (n) < n, then =); 1(,) t11 m + h, then 2 and (n) MjO,m,k- Ijm O,k Fig. 7. The Benes binary network R(31. = 2 and 0 <j < n - 2, then [i )1= 1+1n ) j; (n-+1) n 1) 4(n)* k "(n) j] and k** = k where k* = k' @ 2mn- if2 k,0 j. If n 2 1, then Theorem 4: If n Fig. 6. The Clos network C(2, 2, 4). 0 b(l) 1 ) tj,0,0,k If n > 2e, i 1+O andj + h+ m = n, then (n) - f (n) Tk Mj,l,m,k _ (n) Ik = family JV is stable iff, for all 0 in J, there are two bijections and in and two bijections E 1 and in Vn) (for some n) If n = 1, then such that 0 = [E1; (4, 0i/); e2]. The search for control patterns #(i,l,m,k- kT, yielding 0 is replaced by the search for control patterns Theorem 6: If n > 2 and j + +n m + h < n, then yielding and The domain ofthe latter two permutations aI d tr l nompt1)ion-p1) (n)y b T (n) is half the size of the domain of the former. Permutations e1 Yj,lmh S\Yj,l,m,h dJ,I,m,hJ and describe the control patterns for the first and the last Else (i.e., ifnj + + m + h = n or if n a 1) stage of R(n), respectively. As and belong to the stable ()n)I i 212(n) ,m, family, this procedure can be used recursively. The final j, ^j,lmh permutations to generate are 0tt) or 4(1), which are realized Corollaro (of Theorem 2 or 3): If n > 2, then by the two-state switches. (n) = (n); (,,(n -1), 1(n 1)); C n ] This "divide-and-conquer" technique may be applied to the five families of Frequently Used Bijections by means of the theorems below. These theorems are proved in the If n = 1 , then Appendix and an example of their use is given in the next 7(1) = 1(0) U(1) = (). section. 
Let k be a positive integer and denote, respectively, by k' In the previous theorems, the number of distinct bijecand k1 its quotient and its remainder in the division by 2, i.e., tions describing control patterns for the extreme stages of < k = 2k' + k1. R(n) is as small as 2n. These bijections are m(n) () (2<wjot) aple This Theorem 1: If n > 2, then and their composition products by (see Fig. 8) is since the cardinality of the subgroup scarcity noteworthy (kn) = (n); (,Z:(n -1), (n -1));Sn) j,(n) iS 2(2n 1 ) If n = 1, then C. An Example TMol = 1(1) TM1 = 4(M) Suppose that a Bit Reversal p (3) = a(3,) is to be performed (Remember that C(n) is either the Identity, (n), or the by R (3) on an 8-word vector of data. The three steps of the Exchange, 4(n), according to whether k, is 0 or 1.) of a control pattern are as follows. Theorem 2: Assume that j is odd: j = 2' + 1. If n 2, computation First Step: From the first case of Theorem 4, we obtain then (3) (3)h old(3). (2) (2) =( ) [(n). (,(n - )1) n ±-+) . (n)]. 0,0 -L 3 6 vl,0, OC1,2), q3 J MI V+J j'+ kJ c j,k' ki j,k-L In the second step a similar reduction will be applied to oa()' If n = 1, then m ) .mn, and oa(,2)2 111,0- I Second Step: From the second case of Theorem 4 and Theorem 3: If n >22 and O < j < n, then. from the first case of Theorem 1, the following equalities 1) hold: 6.n) j,k' 1), b(n i,k'+ k J; C(n)]. ki J,k [ (n); (6(n e2 /. e2 643 LENFANT: PARALLEL PERMUTATIONS OF DATA o 1 * p 2 " 3* 4 5 * 6 * 7. - 1 2 3 a* 4 *- 5 0 6 37 X * * 0 O1 * 2 * 3 43 1* 2 * 3 * 4 5 * 6 * 7 * 5 6 *7 O* 1 *I.. >O 1 *-- 2 * 2 3* 3 4" 4 5 * ~~ 5 0 6 6 * 0 7 7 * Fig. 8. From left to right: the permutations t(3), (33), and 4(3)* t(l2,) =T(o2) = 1(2); (T(1), T(0)); 1(2)] 2122= (2) [I(2); (t1) T (1)); 1(2)]. Third Step: As a result of Theorem 1, the bijections zT and zTi are recognized as the basic controls (1) and 4(1), respectively. 
Consequently, the five stages of R(3) must realize the following bijections (with a "vertical" notation to match the vertical drawing of stages in the figures) I(M1 0 0 1 2 2 3 3 4 : 5 4 6 6 7 5 1 X-. 7 Fig. 9. The network R(31 under a control yielding p(3). Our approach is different. By selecting a set of Frequently Used Bijections, we allow an efficient representation ofthese bijections. As concerns the families of Table I, the paparameters of the bijections, say h,j, k, 1, m, are either in the range {0, 1, n} or in the range {0, 1, 2" 1}. In the latter case, they may be recorded in an n-bit word with the usual binary expansion (denoted by R2 hereafter). In the former case, it would be more convenient to represent an , , integer m by an (n + 1)-bit word (mn m", * mi, ., ml such that mi = 1 iff m = i - 1. In this way (denoted by RI), + 1, i(2) t(2) (1) q13(3) (3) '13 , , the increments required by the theorems of Section Ill-B are performed as shifts. Table II specifies this coding scheme which results from a tradeoff between space saving and time saving. 1(2) 1(2) -. 4(~~~~~1) The selection of a set of interconnections deserving special treatment because of their frequent use implies that two different instructions must be included in the machine repertoire: Trigger a Frequently Used Bijection (TFUB) and Trigger an Ordinary Bijection (TOB). Due to our choice of FUB's, an instruction word TFUB contains six fields corresponding to the following: 1) the opcode; the name of the FUB family (a, f3, y, 6, or A): 3 bits; 2) 0 00 0 3) the four parameters: 4(n + 1) bits. The length of the FUB name, i.e., 4n + 7 bits, is consistent 1 0 1 0 1 with the format of an instruction: in a computer with 256 1 0 1 0 1 processing elements, this name would be 39 bits long. When This setting of the network is illustrated in Fig. 9. executed, the instruction TFUB forces the basic 2 x 2 switches to the correct state and broadcasts a 2'-word vector D. 
Implementation of the Control Algorithms of data through the network. The other instruction, TOB, Several authors [19], [29] have proposed algorithms that has one parameter that is the address in main storage of a can compute control patterns for the network R , whatever vector of (2n 1)21 I bits which is used to set the two-state the bijection under consideration. These powerful algor- switches. The vector may be computed by one ofthe general ithms suffer from several shortcomings. First, they are procedures mentioned at the beginning of the present space-consuming. As input, they accept the representation section. of a permutation 0 by its value assignment; i.e., the sequence The network control mechanism must be devised to cope 0(0), 0(1), 0(2n - 1), which requires n2n bits to be with the implementation of the instruction TFUB. One memorized. A considerably larger amount of memory is possible structure, which readily lends itself to pipelining, needed for the computation. Second, these algorithms are consists of a binary tree of registers. A node of this tree is a set time-consuming. As far as applications to parallel proces- of four registers whose lengths are (n - d + 2) bits at depth d sors are concerned, they could be used to fill in tables of in the tree (1 < d < n) (see Fig. 10). From each node at depth control patterns before run time. Such tables would be huge: d (1 < d < n 1), it is possible to compute 2n-d bits of the (2n 1)2n-1 bits to store the control pattern of each control patterns for stages d and 2n - d. If several transfers considered mapping. This figure is 1920 bits per permuta- of data vectors through the network are allowed to be in tion in the case of a network interconnecting 256 processing progress at the same time (as a pipeline system where the elements. 
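The storage figures quoted in this section follow from three simple formulas. The short Python sketch below (added for illustration; the function names are hypothetical) evaluates them for a few network sizes: the n 2^n bits of a permutation given by its value assignment, the (2n - 1) 2^(n-1) bits of a stored control pattern, and the 4n + 7 bits of a FUB name.

```python
def value_assignment_bits(n):
    """Bits to store theta(0), ..., theta(2**n - 1), n bits each."""
    return n * 2 ** n


def control_pattern_bits(n):
    """One bit per two-state switch: (2n - 1) stages of 2**(n-1) switches."""
    return (2 * n - 1) * 2 ** (n - 1)


def fub_name_bits(n):
    """3-bit family field plus four parameter fields of n + 1 bits each."""
    return 3 + 4 * (n + 1)


if __name__ == "__main__":
    print(" n  inputs  assignment  pattern  FUB name")
    for n in (3, 6, 8, 10):
        print(f"{n:2d}  {2 ** n:6d}  {value_assignment_bits(n):10d}"
              f"  {control_pattern_bits(n):7d}  {fub_name_bits(n):8d}")
    # The case of 256 processing elements discussed in the text.
    assert control_pattern_bits(8) == 1920
    assert fub_name_bits(8) == 39
```

The table makes the tradeoff visible: the name of a FUB grows linearly with n, whereas a precomputed control pattern grows like n 2^n.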
units are the stages), the control patterns for stage $2n - d$ of $R^{(n)}$ must be stored in a queue $Q_{2n-d}$ ($1 \le d \le n - 1$), whereas the control patterns for the first $n$ stages can be used as soon as they are computed. Moreover, if the network is built as a pipeline system, it is necessary to provide level $d$ of the tree ($1 \le d \le n$) with a 3-bit register $F_d$ which records the (common) family name of the FUB's whose parameters are memorized in the nodes of depth $d$. Finally, a table $T_d$ is associated with depth $d$ ($1 \le d \le n$). Each of its $n + 1 - d$ entries contains the $2^{n-d}$-bit pattern that controls either $I^{(n+1-d)}$ (for entry 1) or $\psi_j^{(n+1-d)}$ (for entry $j$, $2 \le j \le n + 1 - d$).

If we assume that signal 0 (signal 1, respectively) on its control line forces a $2 \times 2$ switch to the identity state $I^{(1)}$ (to the exchange state, respectively), we can represent the control pattern by a $4 \times 5$ matrix.

Fig. 10. A possible structure for the control mechanism.

TABLE II. CODING THE BIJECTION NAMES. Each parameter of a bijection name is coded either in representation R1 (the "unary" representation, for a parameter in the range $\{0, 1, \ldots, n\}$) or in representation R2 (the binary representation, for a parameter in the range $\{0, 1, \ldots, 2^n - 1\}$).

We shall make this structure clear by describing the register-to-register operations involved in the computation of control signals for one family of FUB's (the algorithm below relies on Theorems 1 and 4). This description is expressed in a language that deserves a few words of explanation.

1) Three types of data are used: bits, arrays of bits, and (for array indices only) integers.
2) If A is declared as an array of bits A[1 : h], the assignment A := (0, B[b : b + h - 2]) stands for A[1] := 0 and A[i] := B[b + i - 2] ($2 \le i \le h$).
3) int(A) is the integer whose "unary" representation (representation R1) is the bit array A.
4) The procedures mirror(A) and complement(A) deliver bit arrays that are the mirror image of A and the ones' complement of A, respectively.

We focus our attention on node $c$ at depth $d$ in the tree ($1 \le c \le 2^{d-1}$), assuming that the hardware of the interconnection network performs the same operations on all nodes at the same depth. For a node at depth $d$ with $1 \le d \le n - 1$, two registers of this node, denoted by J and K, and their respective sons are involved in these operations. The control patterns for stage $2n - d$ are recorded in the free entry of $Q_{2n-d}$, whereas the control signals sent to the switches of stage $d$ are represented as bits of an array. The proposed algorithm for node $c$ at depth $d$ ($1 \le d \le n - 1$, $1 \le c \le 2^{d-1}$) is as follows:

    bit array J, K[1 : n+2-d];
    bit array upper-son-of-J, upper-son-of-K, lower-son-of-J, lower-son-of-K[1 : n+1-d];
    bit array control-of-stage-d, free-Q_{2n-d}[1 : 2^(n-1)];
    bit array table-d[1 : n+1-d, 1 : 2^(n-d)];
    if J[n+1-d : n+2-d] = 00 then
      parbegin
        comment we use Theorem 4 end of comment
        control-of-stage-d[(c-1)2^(n-d)+1 : c2^(n-d)] := table-d[int(mirror(J)), 1 : 2^(n-d)],
        free-Q_{2n-d}[(c-1)2^(n-d)+1 : c2^(n-d)] :=
          if K[int(mirror(J))] = K[1]
            then table-d[int(mirror(J)), 1 : 2^(n-d)]
            else complement(table-d[int(mirror(J)), 1 : 2^(n-d)]),
        upper-son-of-J := (0, J[1 : n-d]),
        upper-son-of-K := K[2 : n+2-d],
        lower-son-of-J := (0, J[1 : n-d]),
        lower-son-of-K := K[2 : n+2-d] ⊕ (mirror(J[1 : n-d]), 0)
      parend
    else
      parbegin
        comment we use Theorem 1 end of comment
        control-of-stage-d[(c-1)2^(n-d)+1 : c2^(n-d)] := table-d[1, 1 : 2^(n-d)],
        free-Q_{2n-d}[(c-1)2^(n-d)+1 : c2^(n-d)] :=
          if K[1] = 0
            then table-d[1, 1 : 2^(n-d)]
            else complement(table-d[1, 1 : 2^(n-d)]),
        upper-son-of-J := J[2 : n+2-d],
        upper-son-of-K := K[2 : n+2-d],
        lower-son-of-J := J[2 : n+2-d],
        lower-son-of-K := K[2 : n+2-d]
      parend

For a node $c$ at depth $n$, the only operation to perform is control-of-stage-n[c] := K[2].

Fig. 11 shows this algorithm operating on the root ($d = 1$, $c = 1$) of the control mechanism of the network $R^{(3)}$. The permutation to be performed is the bit reversal $\rho^{(3)}$, which has been considered in the example of Section III-C. Arrays of bits are represented by registers whose rightmost bit is the first element of the array.

IV. CONCLUSION

Several authors have proposed algorithms which compute control patterns for the Benes binary network from any bijection assignment. If such a network is used to interconnect processing elements in a SIMD computer, the time constraints make it impractical to execute these algorithms every time a transfer of data is needed. An alternative would be to use these algorithms to fill a table of control patterns. This is not very practical, because the table would need a huge amount of main storage.

In this paper, we have chosen another approach. We have restricted our attention to a set of frequently used bijections for which we have proposed tailored algorithms. The number of selected bijections for a network with $2^n$ inputs is of the order of magnitude $2^{2n}$, as compared to $2^n!$ for the whole symmetric group. It is small enough that the name of a bijection can be coded within the parameter field of an instruction "Trigger a Frequently Used Bijection." Another point which makes the implementation of such an instruction feasible is that control patterns can be computed "on the fly" by our algorithms, as shown in Section III-D. If a required permutation is not one of the selected Frequently Used Bijections, it is still possible to use one of the general algorithms which have been previously published.
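To give a feel for the comparison made above between the selected bijections and the whole symmetric group, the two quantities can be evaluated for a few network sizes. The values of $n$ below are chosen only for illustration, and $2^{2n}$ is used at face value although the text states it only as an order of magnitude.

$$
\begin{array}{c|c|c}
n & 2^{2n} & (2^n)! \\ \hline
3 & 64 & 8! = 40\,320 \\
4 & 256 & 16! \approx 2.09 \times 10^{13} \\
6 & 4096 & 64! \approx 1.27 \times 10^{89}
\end{array}
$$

Under these assumptions, for $n = 6$ a bijection name needs on the order of $2n = 12$ bits of a parameter field, whereas numbering an arbitrary permutation of 64 lines would take $\log_2 64! \approx 296$ bits.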
Due to the recurrent structure of both the Benes network and our algorithms, a data vector of $2N$ words can be easily processed by a computer with $N = 2^n$ processing elements. For this purpose the interconnection network is used successively as the upper median switch and as the lower median switch of a Clos network $C(2, 2, 2^n)$. This recurrent structure is also very interesting in the case of several configurations of the same computer which differ from each other by the number of processing elements.

Further research could usefully be conducted into the selection of families of FUB's. It is noteworthy that a stable family may contain stable subfamilies: e.g., in the family of Theorem 5, the subfamily obtained for a given $j$ is stable, since its first parameter is not affected by the derivation of that theorem. New bijections could be selected in relation to the progress of algorithm design. Our approach is well-suited for algorithms obtained by a concurrent or "divide and conquer" technique [21], [23].

Fig. 11. The first step of the computation of a control pattern for the permutation $\rho^{(3)}$.

APPENDIX
PROOFS OF THE THEOREMS OF SECTION III-B

The bijections achieved by the first, second, and third stages of a Clos network $C(2, 2, 2^{n-1})$ are denoted by $F$, $S$, and $T$ in the sequel. The fixed links between the stages are $\sigma^{-1}$ and $\sigma$; i.e., the inverse of the perfect shuffle and the perfect shuffle itself. The sign $\circ$ will be omitted in the expression of bijection products. Moreover, the notation $|y|_n$ will be used instead of $y \bmod 2^n$. Finally, the notations of Table I are valid for this appendix. We shall restrict our attention to the first statement of each theorem; the others result immediately from the definitions of Table I.

Proof of Theorem 1: Let $x = (x_n, x_{n-1}, \ldots, x_2, x_1)$ be an element of $E^{(n)}$. Denote by $x'$ the $(n-1)$-tuple $(x_n, x_{n-1}, \ldots, x_2)$, so that $x = (x', x_1)$. If $n$ is greater than 1, then $x$ is mapped
- by $\sigma^{-1}F$ onto $(x_1, x')$;
- by $S\sigma^{-1}F$ onto $(x_1, x' \oplus k')$;
- by $\sigma S\sigma^{-1}F$ onto $(x' \oplus k', x_1) = (x', x_1) \oplus (k', 0)$;
- by $T\sigma S\sigma^{-1}F$ onto $((x', x_1) \oplus (k', 0)) \oplus k_1 = (x', x_1) \oplus (k', k_1) = x \oplus k$.
Therefore the theorem holds for $n \ge 2$. Q.E.D.

Proof of Theorem 2: A word of data is broadcast to the upper median switch or to the lower median switch according to the parity of the number of the line by which it enters the network. As in the previous proof, let us consider an element $x = (x_n, \ldots, x_1) = 2x' + x_1$ of $E^{(n)}$, omitting the obvious case $n = 1$. This integer is mapped
- by $\sigma^{-1}F$ onto $x' + 2^{n-1}x_1$;
- by $S\sigma^{-1}F$ onto
$$
\begin{cases}
|jx' + k'|_{n-1}, & \text{if } x_1 = 0 \\
2^{n-1} + |jx' + j' + k' + k_1|_{n-1}, & \text{if } x_1 = 1.
\end{cases}
$$
Thus, if $x$ is even, it is mapped by $\sigma S\sigma^{-1}F$ onto
$$ y = |2jx' + 2k'|_n = |jx + 2k'|_n . $$
As $y$ is even, $y \oplus k_1$ ($k_1 \le 1$) equals $y + k_1$, so that $x$ is mapped by $T\sigma S\sigma^{-1}F$ onto
$$ |jx + 2k' + k_1|_n = |jx + k|_n . $$
This proves the case for $x$ even. For odd $x$, $x$ is mapped by $\sigma S\sigma^{-1}F$ onto
$$ z = |j(2x') + (2j' + 1) + 2k' + 2k_1|_n = |jx + 2k' + 2k_1|_n . $$
Since $z$ is odd, $z \oplus k_1$ equals $z - k_1$. Consequently, the image of an odd $x$ by $T\sigma S\sigma^{-1}F$ is $|jx + 2k' + k_1|_n = |jx + k|_n$. Q.E.D.

Proof of Theorem 3: Assume that $n \ge 2$ and $0 \le j < n$. Let $x = (x_n, x_{n-1}, \ldots, x_2, x_1) = (a_2, a_1)$ be an element of $E^{(n)}$ and denote by $a_1'$ the integer $(x_{n-j}, x_{n-j-1}, \ldots, x_2)$, so that $a_1 = 2a_1' + x_1$. The integer $x = (a_2, a_1', x_1)$ is mapped
- by $\sigma^{-1}F$ onto $(x_1, a_2, a_1')$;
- by $S\sigma^{-1}F$ onto $(x_1, a_2, |a_1' + k'|_{n-j-1})$ if $x_1 = 0$, and onto $(x_1, a_2, |a_1' + k' + k_1|_{n-j-1})$ if $x_1 = 1$; i.e., onto $(x_1, a_2, |a_1' + k' + k_1 x_1|_{n-j-1})$;
- by $\sigma S\sigma^{-1}F$ onto $(a_2, |a_1' + k' + k_1 x_1|_{n-j-1}, x_1)$;
- by $T\sigma S\sigma^{-1}F$ onto $(a_2, y)$, where $y$ is the integer
$$
\begin{aligned}
y &= (|a_1' + k' + k_1 x_1|_{n-j-1},\; x_1 \oplus k_1) \\
  &= |2a_1' + 2k' + 2k_1 x_1 + (x_1 + k_1 - 2x_1 k_1)|_{n-j} \\
  &= |(2a_1' + x_1) + (2k' + k_1)|_{n-j} \\
  &= |a_1 + k|_{n-j} .
\end{aligned}
$$
Q.E.D.

Proof of Theorem 4: Assume that $n \ge 2$ and $0 \le j \le n - 2$. Let $x = (x_n, x_{n-1}, \ldots, x_2, x_1) = (a_2, a_1)$ be an element of $E^{(n)}$ and denote by $a_0$ the integer $(x_{n-j-1}, x_{n-j-2}, \ldots, x_2)$, so that $a_1 = (x_{n-j}, a_0, x_1)$. The integer $x = (a_2, x_{n-j}, a_0, x_1)$ is mapped by $F$ onto $(a_2, x_{n-j}, a_0, x_1 \oplus x_{n-j})$ and by $\sigma^{-1}F$ onto $(x_1 \oplus x_{n-j}, a_2, x_{n-j}, a_0)$. Note that the effect of $\sigma^{-1}F$ is to broadcast on the upper median switch (on the lower median switch, respectively) the data entering the network by an inlet $x$ such that $x_1 = x_{n-j}$ ($x_1 \ne x_{n-j}$, respectively). The image of $x$ by $S\sigma^{-1}F$ is
$$ (x_1 \oplus x_{n-j},\; (a_2, y, \rho^{(n-j-2)}(a_0)) \oplus k') $$
where bit $y$ is equal to $x_{n-j}$ if $x_1 \oplus x_{n-j} = 0$, and equal to $x_{n-j} \oplus 1$ if $x_1 \oplus x_{n-j} = 1$. We can summarize both cases in one expression, $y = x_{n-j} \oplus (x_1 \oplus x_{n-j})$, which shows that $y = x_1$. By $\sigma S\sigma^{-1}F$, $x$ is mapped onto
$$ z = (a_2, x_1, \rho^{(n-j-2)}(a_0), x_1 \oplus x_{n-j}) \oplus (2k') . $$
Applying the permutation $\psi_j^{(n)}$, we obtain
$$ \psi_j^{(n)}(z) = (a_2, x_1, \rho^{(n-j-2)}(a_0), x_{n-j}) \oplus (2k') $$
since $(x_1 \oplus x_{n-j}) \oplus x_1 = x_{n-j}$. Finally, the image of $x$ by $T\sigma S\sigma^{-1}F$, i.e., $C^{(n)}(\psi_j^{(n)}(z))$, is
$$ (a_2, x_1, \rho^{(n-j-2)}(a_0), x_{n-j}) \oplus k = (a_2, \rho^{(n-j)}(a_1)) \oplus k . $$
Q.E.D.

Proof of Theorem 5: Assume that $n \ge 2$, $l > 0$, and $j + l + m < n$. Let $x = (b_4, b_3, b_2, b_1)$ be an element of $E^{(n)}$. We denote by $b_1'$ and $b_3'$ the integers (or vectors of bits) defined by the equalities
$$ b_1 = (b_1', x_1), \qquad b_3 = (b_3', x_{n-j-l+1}) . $$
Notice that the inequalities $l > 0$ and $j + l + m < n$ imply that $b_1'$ and $b_3'$ exist. The element $x = (b_4, b_3', x_{n-j-l+1}, b_2, b_1', x_1)$ is mapped
- by $\sigma^{-1}F$ onto $(x_1 \oplus x_{n-j-l+1}, b_4, b_3', x_{n-j-l+1}, b_2, b_1')$;
- by $S\sigma^{-1}F$ onto $(x_1 \oplus x_{n-j-l+1},\; (b_4, b_1', y, b_2, b_3') \oplus k')$,
where bit $y$ is equal to $x_{n-j-l+1}$ if $x_1 \oplus x_{n-j-l+1} = 0$; otherwise, to $x_{n-j-l+1} \oplus 1$. Obviously, $y$ is $x_{n-j-l+1} \oplus (x_1 \oplus x_{n-j-l+1}) = x_1$. By $\sigma S\sigma^{-1}F$, $x$ is mapped onto
$$ z = (b_4, b_1', x_1, b_2, b_3', x_1 \oplus x_{n-j-l+1}) \oplus (2k') . $$
Applying the bijection $\psi_{m+1}^{(n)}$ to this value, we obtain
$$ \psi_{m+1}^{(n)}(z) = (b_4, b_1', x_1, b_2, b_3', x_{n-j-l+1}) \oplus (2k') = (b_4, b_1, b_2, b_3) \oplus (2k') . $$
Finally, the image of $x$ by $T\sigma S\sigma^{-1}F$, i.e., $C^{(n)}(\psi_{m+1}^{(n)}(z))$, is
$$ (b_4, b_1, b_2, b_3) \oplus (2k' \oplus k_1) = (b_4, b_1, b_2, b_3) \oplus k . $$
Q.E.D.

Proof of Theorem 6: Assume that $n \ge 2$ and $j + l + m + h < n$.
Let $x = (c_5, c_4, c_3, c_2, c_1)$ be an element of $E^{(n)}$, and denote by $c_1'$ the integer such that $c_1 = (c_1', x_1)$. The element $x$ of $E^{(n)}$ is mapped
- by $\sigma^{-1}F$ onto $(x_1, c_5, c_4, c_3, c_2, c_1')$;
- by $S\sigma^{-1}F$ onto $(x_1, c_5, c_2, c_3, c_4, c_1')$;
- by $T\sigma S\sigma^{-1}F$ onto $(c_5, c_2, c_3, c_4, c_1', x_1) = (c_5, c_2, c_3, c_4, c_1)$.
Q.E.D.

REFERENCES

[1] G. H. Barnes et al., "The ILLIAC IV computer," IEEE Trans. Comput., vol. C-17, pp. 746-757, Aug. 1968.
[2] K. E. Batcher, "Sorting networks and their applications," in Proc. Spring Joint Computer Conf., AFIPS Conf. (Montvale, NJ: AFIPS Press, 1968), vol. 32, pp. 307-314.
[3] -, "STARAN parallel processor system hardware," in Proc. Fall Joint Computer Conf., AFIPS Conf. (Montvale, NJ: AFIPS Press, 1974), vol. 43, pp. 405-410.
[4] -, "The multi-dimensional access memory in STARAN," in Proc. 5th Sagamore Conf. Parallel Processing, Lecture Notes in Computer Science (New York: Springer, 1976), vol. 24.
[5] V. E. Benes, Mathematical Theory of Connecting Networks and Telephone Traffic. New York: Academic, 1968.
[6] P. Budnick and D. Kuck, "The organization and use of parallel memories," IEEE Trans. Comput., vol. C-20, pp. 1566-1569, Dec. 1971.
[7] C. Clos, "A study of non-blocking switching networks," Bell Syst. Tech. J., vol. 32, pp. 406-424, 1953.
[8] M. J. Flynn, "Very high speed computing systems," Proc. IEEE, vol. 54, pp. 1901-1909, 1966.
[9] D. Fraser, "Array permutation by index-digit permutation," J. Ass. Comput. Mach., vol. 23, pp. 298-309, Apr. 1976.
[10] S. W. Golomb, "Permutations by cutting and shuffling," SIAM Rev., vol. 3, pp. 293-297, Oct. 1961.
[11] M. L. Graham and D. L. Slotnick, "An array computer for the class of problems typified by the general circulation model of the atmosphere," Dep. Comput. Sci., Univ. Illinois, Urbana, IL, Rep. UIUCDS-R-75-761, Dec. 1975 (IEEE Repository no. 76-83).
[12] D. Heller, "A survey of parallel algorithms in numerical linear algebra," Carnegie-Mellon Univ., Res. Rep., 1976.
[13] D. Kuck, "ILLIAC IV software and application programming," IEEE Trans. Comput., vol. C-17, pp. 758-770, Aug. 1968.
[14] T. Lang, "Interconnections between processors and memory modules using the shuffle-exchange network," IEEE Trans. Comput., vol. C-25, pp. 496-503, May 1976.
[15] T. Lang and H. S. Stone, "A shuffle-exchange network with simplified control," IEEE Trans. Comput., vol. C-25, pp. 55-65, Jan. 1976.
[16] D. H. Lawrie, "Access and alignment of data in an array computer," IEEE Trans. Comput., vol. C-24, pp. 1145-1155, Dec. 1975.
[17] J. Lenfant, "Fast random and sequential access to dynamic memories of any size," IEEE Trans. Comput., vol. C-26, pp. 847-855, Sept. 1977.
[18] S. B. Morris and R. E. Hartwig, "The generalized faro shuffle," Discrete Math., vol. 15, pp. 333-346, 1976.
[19] D. C. Opferman and N. T. Tsao-Wu, "On a class of rearrangeable switching networks," Bell Syst. Tech. J., vol. 50, pp. 1579-1618, May/June 1971.
[20] S. E. Orcutt, "Implementation of permutation functions in ILLIAC IV-type computers," IEEE Trans. Comput., vol. C-25, pp. 929-936, Sept. 1976.
[21] M. C. Pease, "An adaptation of the fast Fourier transform for parallel processing," J. Ass. Comput. Mach., vol. 15, pp. 252-264, Apr. 1968.
[22] H. D. Shapiro, "Theoretical limitations on the use of parallel memories," Ph.D. dissertation, Dep. Comput. Sci., Univ. Illinois, Urbana, IL, Rep. UIUCDCS-R-75-776, Dec. 1975 (IEEE Repository no. 76-82).
[23] W. J. Stewart, "A note on cyclic odd-even reduction," IRISA, Univ. Rennes, Rennes, France, Res. Rep., 1977.
[24] H. S. Stone, "Dynamic memories with fast random and sequential access," IEEE Trans. Comput., vol. C-24, pp. 1167-1174, Dec. 1975.
[25] -, "Parallel processing with the perfect shuffle," IEEE Trans. Comput., vol. C-20, pp. 153-161, Feb. 1971.
[26] R. C. Swanson, "Interconnections for parallel memories to unscramble p-ordered vectors," IEEE Trans. Comput., vol. C-23, pp. 1105-1116, Nov. 1974.
[27] K. J. Thurber, "Programmable indexing networks," in Proc. 1970 Spring Joint Computer Conf., AFIPS Conf. (Montvale, NJ: AFIPS Press, 1970), vol. 36, pp. 51-58.
[28] K. J. Thurber and L. D. Wald, "Associative and parallel processors," Comput. Surveys, vol. 8, pp. 215-255, 1976.
[29] A. Waksman, "A permutation network," J. Ass. Comput. Mach., vol. 15, pp. 159-163, Jan. 1968.

Jacques Lenfant was born in Boulogne-sur-Mer, France, on June 21, 1947. He received the B.S. degree in mathematics and the M.S. degree in algebraic topology from the University of Paris, Paris, France, and the Ecole Normale Supérieure de Saint-Cloud, and the Doctorat-es-Sciences degree from the University of Rennes, Rennes, France.
Since 1970, he has occupied various academic positions in Rennes, with an interruption in 1975 when he was a Visiting Assistant Professor with the Department of Electrical and Computer Engineering, University of Michigan, Ann Arbor. He is currently the Vice-Director of the Institut de Recherche en Informatique et Systemes Aleatoires (IRISA) and a Professor of Computer Science. His research interests are in computer system evaluation, program behavior modeling, scheduling, and parallel processing.
Dr. Lenfant is a member of the IEEE Computer Society and the Association for Computing Machinery. He serves as an Associate Editor of the RAIRO, the Journal of the French Computer Society AFCET.