
IEEE TRANSACTIONS ON COMPUTERS, VOL. C-30, NO. 10, OCTOBER 1981

Performance of Processor-Memory Interconnections for Multiprocessors

JANAK H. PATEL, MEMBER, IEEE

Abstract: A class of interconnection networks based on some existing permutation networks is described with applications to processor to memory communication in multiprocessing systems. These networks, termed delta networks, allow a direct link from any processor to any memory module. The delta networks and full crossbars are analyzed with respect to their effective bandwidth and cost. The analysis shows that delta networks have a far better performance per cost than crossbars in large multiprocessing systems.

Index Terms: Crossbar, interconnection networks, memory bandwidth, multiprocessor memories.

I. INTRODUCTION

WITH the advent of low cost microprocessors, architectures involving multiple processors are becoming very attractive. Several organizations have been implemented or proposed. Principal among these are parallel (SIMD) type processors [1], computer networking, and multiprocessor organizations. In this paper we focus our attention on multiprocessors.
The principal characteristic of a multiprocessor system is the ability of each processor to share a single main memory. This sharing capability is provided through an interconnection network between the processors and the memory modules, which logically looks like Fig. 1. The function of the switch is to provide a logical link between any processor and any memory module.

[Fig. 1. Logical organization of a multiprocessor.]

There are several different physical forms available for the processor-memory switch, the least expensive of which is the time-shared bus. However, a time-shared bus has a very limited transfer rate, which is inadequate for even a small number of processors. At the other end of the bandwidth spectrum is the full crossbar switch, which is also the most expensive switch. In fact, considering the current low costs of microprocessors and memories, a crossbar would probably cost more than the rest of the system components combined. Therefore, it is very difficult to justify the use of a crossbar for large multiprocessing systems. It is the absence of a switch with reasonable cost and performance which has prevented the growth of large multimicroprocessor systems. To circumvent the high cost of the switch, some "loosely coupled" systems can be defined. In these systems sharing of main memory is somewhat restricted; for example, some memory accesses may be fast and direct while many other references may be slow, indirect, and may even involve operating system intervention.

There is considerable research on the permutation networks for parallel (SIMD) processors but very little research on processor-memory interconnections requiring random access capabilities. In this paper we study a specific class of interconnection networks, termed delta networks, which are far less expensive than full crossbars. Moreover, delta networks are modular and easy to control. The delta networks form a subset of a very broad class of networks called banyan networks, initially defined in graph theoretic terms by Goke and Lipovski [2]. Informally, in terms of Fig. 1 the switch is a banyan network if and only if there exists a unique path from every processor to every memory module. The control of delta networks resembles the control of permutation networks of Lawrie [3] and Pease [4]. However, these networks were not applied or analyzed for random access of memories, which invariably involves path conflicts in the network as well as memory conflicts.

In Section II we present the basic principle involved in the design and control of delta networks. Then we describe the generalized delta networks with some examples. Following this we give the detailed logic of a 2 × 2 delta module, which is the basic building block of $2^n \times 2^n$ delta networks. In Section V we analyze full crossbar networks for effective bandwidth and probability of conflicts, given the rate of memory requests from processors. The results for crossbars are used in turn to analyze arbitrary delta networks. Finally, we present a quantitative comparison of crossbars and delta networks.

Manuscript received March 24, 1980; revised February 17, 1981. This work was supported in part by the Joint Services Electronics Program under Contract N00014-79-C-0424.
The author is with the Coordinated Science Laboratory and the Department of Electrical Engineering, University of Illinois, Urbana, IL 61801.
0018-9340/81/1000-0771$00.75 © 1981 IEEE

II. PRINCIPLE OF OPERATION

Before we define delta networks, let us study the basic principle involved in the construction and control of delta networks. Consider a 2 × 2 crossbar switch (Fig. 2). This 2 × 2 switch has the capability of connecting the input A to either the output labeled 0 or the output labeled 1, depending on the value of some control bit of the input A.
If the control bit is 0, then the input is connected to the upper output and if 1, then it is connected to the lower output. The same description applies to terminal B, but for the time being ignore the existence of B. It is straightforward to construct a $1 \times 2^n$ demultiplexer using the above described 2 × 2 module. This is done by making a binary tree of this module. For example, Fig. 3 shows a 1 × 8 demultiplexer tree. The destinations are marked in binary. If the source A requires to connect to destination $(d_2 d_1 d_0)_2$, then the root node is controlled by bit $d_2$, the second stage modules are controlled by bit $d_1$, and the last stage modules are controlled by bit $d_0$. It is clear that A can be connected to any one of the eight output terminals. It is also obvious that the lower input terminal of the root node also can be switched to any one of the 8 outputs.

At this point we add another capability to the basic 2 × 2 module, the capability to arbitrate between conflicting requests. If both inputs require the same output terminal, then only one of them will be connected and the other will be blocked or rejected. The probability of blocking and the logic for arbitration are treated later on.

Now consider constructing an 8 × 8 network using 2 × 2 switches; the principle used is the same as that of Fig. 3. Every additional input must also have its own demultiplexer tree to connect to any one of the eight outputs. Basically, the construction works as follows. Start with a demultiplexer tree, then for each additional input superimpose a demultiplexer tree on the partially constructed network. One may use the already existing links as part of the new tree or add extra links and modules if needed. We have redrawn the tree of Fig. 3 as Fig. 4(a). The addition of the next tree is shown with heavy lines in Fig. 4(b). This procedure is continued until the final 8 × 8 network of Fig. 4(d) results.
The only restriction which must be strictly followed during this construction is that if a 2 × 2 module has its inputs coming from other modules, then both inputs must come from upper terminals of other modules or both must come from lower terminals of other modules. (All upper output terminals are understood to have label 0 and the lower terminals, label 1.) Other than this there is considerable freedom in establishing links between the stages of the network.

[Fig. 2. A 2 × 2 crossbar.]

[Fig. 3. 1 × 8 demultiplexer.]

In the construction of the above 8 × 8 network we had the benefit of some hindsight that 12 modules are necessary and sufficient to build this network. If more modules are used then some inputs of some modules will remain unconnected. We could have stopped in the middle of the construction to obtain a 4 × 8 or a 6 × 8 network, such as Fig. 4(b) and (c); however, in each case some inputs of the 2 × 2 modules will remain unutilized.

We term the networks constructed in the above manner digit controlled or simply delta networks, since each module is controlled by a single digit from the destination address. Furthermore, no external or global control is required. Digit controlled networks are not new; Lawrie's omega networks [3] and Pease's indirect binary n-cube [4] are subsets of delta networks. Under the formal definition of delta networks presented in the next section, Fig. 4(d) is a delta network, but not Fig. 4(a), (b), or (c).

Note that the network of Fig. 4(d) does not allow an identity permutation, that is, the connection 0 to 0, 1 to 1, ..., 7 to 7 at the same time. An identity permutation is useful if, say, memory module 0 is a "favorite" module of processor 0, and module 1 that of processor 1, and so on. Thus, identity permutation allows most of the memory references to be made without conflict. A simple renaming of the inputs of Fig. 4(d) will allow an identity permutation. This is shown in Fig. 5; here if all 2 × 2 switches were in the straight (=) position, then an identity permutation is generated. As a matter of fact, since one and only one path is available from any source to any destination, every different setting of the 2 × 2 switches generates a different permutation. Thus, the network of Fig. 5 generates $2^{12}$ distinct permutations.

[Fig. 4. Construction of an 8 × 8 delta network.]

[Fig. 5. An 8 × 8 delta network to allow identity permutation.]

This brings us to another somewhat unrelated topic of permutation networks. The procedure to construct a delta network can be used to generate different permutation networks. For example, we could have started with a tree in which the first stage was controlled by bit $d_1$, the second by $d_0$, and the third by $d_2$. This would of course require relabeling the outputs. But does this really produce a "different" network? Siegel [5] has shown that by a simple address transformation the networks of Lawrie [3] and that of Pease [4] can be made equivalent, i.e., they produce the same set of permutations. As far as we know, the network of Fig. 5 cannot be made equivalent to either Lawrie's or Pease's network by a simple address transformation. It is quite possible that there are only two nonequivalent 8 × 8 delta networks, namely Lawrie's omega network and the network of Fig. 5. We shall not pursue this subject any further, as our primary interest lies in the random access capabilities of these networks and not permutations.

III. DESIGN AND DESCRIPTION OF DELTA NETWORKS

So far we have not defined the delta networks in a formal and rigorous manner. For the purpose of this paper we define them as follows.

Let an a × b crossbar module have the capability to connect any one of its a inputs to any one of the b outputs. Let the outputs be labeled 0, 1, ..., b − 1. An input terminal is connected to the output labeled d if the control digit supplied by the input is d, where d is a base-b digit. Moreover, an a × b module also arbitrates between conflicting requests by accepting some and rejecting others.

A delta network is an $a^n \times b^n$ switching network with n stages, consisting of a × b crossbar modules. The link pattern between stages is such that there exists a unique path of constant length from any source to any destination. Moreover, the path is digit controlled such that a crossbar module connects an input to one of its b outputs depending on a single base-b digit taken from the destination address. Also, in a delta network no input or output terminal of any crossbar module is left unconnected.

One can determine the number of crossbar modules in each stage from the above definition. Specifically, the conditions of constant length paths and no unconnected terminals require that all the output terminals of one stage be connected to all the input terminals of the next stage in one-to-one fashion. The inputs of the first stage are connected to the sources and the outputs of the final stage are connected to the destinations. Thus, an $a^n \times b^n$ delta network has $a^n$ sources and $b^n$ destinations. Numbering the stages of the network as 1, 2, ..., n starting at the source side of the network requires that there be $a^{n-1}$ crossbar modules in the first stage. The first stage then has $a^{n-1}b$ output terminals. This implies that stage two must have $a^{n-1}b$ input terminals, which requires $a^{n-2}b$ crossbar modules in the second stage. In general, the ith stage has $a^{n-i}b^{i-1}$ crossbar modules of size a × b. Thus, the total number of a × b crossbar modules required in an $a^n \times b^n$ delta network is

$$\sum_{1 \le i \le n} a^{n-i} b^{i-1} = \begin{cases} \dfrac{a^n - b^n}{a - b}, & a \ne b \\[2mm] n\,b^{n-1}, & a = b. \end{cases}$$

The construction of an $a^n \times b^n$ delta network follows the principle presented in the previous section. Informally, the procedure can be stated as follows.

Construct a b-ary demultiplexer tree using a × b crossbar switches. A b-ary tree has b branches for every node. For a $1 \times b^n$ demultiplexer, the tree has n levels. Each level is controlled by a distinct base-b digit taken from the destination address. For every additional input source, superimpose a new tree on the partially completed network. Each superimposition must satisfy the condition that an a × b module which receives inputs from other a × b modules must have all its inputs connected to identically labeled outputs, where the outputs of each a × b module are labeled 0, 1, ..., b − 1, as was described earlier.

As one can see from the above construction procedure, there is a large number of link patterns available for an $a^n \times b^n$ delta network. It is natural to wonder if one topology is better than the others. We shall see later that, as far as probability of acceptance or blocking for random access is concerned, all delta networks are identical. However, different topologies may have different permutation capabilities in $b^n \times b^n$ delta networks. Since the link pattern between stages of a delta network is of no particular concern to us, we may ask if there is some regular link pattern which can be used between all stages and thus avoid the cumbersome construction procedure for every different delta network. There is indeed such a pattern, which we describe below.

Let a q-shuffle of qr objects, denoted $S_{q*r}$, where q and r are some positive integers, be a permutation of qr indices (0, 1, 2, ..., (qr − 1)), defined as

$$S_{q*r}(i) = \left(qi + \left\lfloor \frac{i}{r} \right\rfloor\right) \bmod qr, \qquad 0 \le i \le qr - 1.$$

Alternately, the same function can be expressed as

$$S_{q*r}(i) = \begin{cases} qi \bmod (qr - 1), & 0 \le i < qr - 1 \\ i, & i = qr - 1. \end{cases}$$

A q-shuffle of qr playing cards can be viewed as follows. Divide the deck of qr cards into q piles of r cards each: the top r cards in the first pile, the next r cards in the second pile, and so on. Now pick the cards one at a time from the top of each pile: the first card from the top of pile one, the second card from the top of pile two, and so on in a circular fashion until all cards are picked up. This new order of cards represents an $S_{q*r}$ permutation of the previous order. Fig. 6 shows an example of a 4-shuffle of 12 indices, namely the function $S_{4*3}$.

[Fig. 6. 4-shuffle of 12 objects.]

From the above description it is clear that a 2-shuffle is the well-known perfect shuffle [6]. One can also show that the q-shuffle is an inverse permutation of the r-shuffle of qr objects. That is, $S_{q*r}$ is an inverse of $S_{r*q}$. Thus,

$$S_{q*r}(S_{r*q}(i)) = i, \qquad 0 \le i \le qr - 1.$$

We show below that an $a^n \times b^n$ delta network can be constructed by using the a-shuffle as the link pattern between every two stages. If the destination D is expressed in the base-b system as $(d_{n-1} d_{n-2} \cdots d_1 d_0)_b$, where $D = \sum_{0 \le i < n} d_i b^i$ and $0 \le d_i < b$, then the base-b digit $d_i$ controls the crossbar modules of stage (n − i). The a-shuffle function is used to connect the outputs of a stage to the inputs of the next stage, where the inputs and outputs are numbered 0, 1, 2, ... starting at the top. Fig. 7 shows a general $a^n \times b^n$ delta network. Note that the shuffle networks between stages are passive, i.e., simply wires, and not active like the stages themselves.

[Fig. 7. An $a^n \times b^n$ delta network.]

Two delta networks, one $4^2 \times 3^2$ and the other $2^3 \times 2^3$, derived from Fig. 7 are shown in Figs. 8 and 9; the interstage patterns are, respectively, 4-shuffle and 2-shuffle. The destinations are labeled in base 3 in Fig. 8 and in base 2 in Fig. 9.
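As a small numerical check of the definitions in this section, the sketch below evaluates the q-shuffle in both of its stated forms, tests the inverse relation between $S_{q*r}$ and $S_{r*q}$, and compares the stage-by-stage module count with the closed form. The function names `q_shuffle`, `q_shuffle_alt`, and `module_count` are illustrative choices, not names from the paper.

```python
def q_shuffle(i, q, r):
    """S_{q*r}(i) = (q*i + floor(i/r)) mod q*r, for 0 <= i <= q*r - 1."""
    return (q * i + i // r) % (q * r)


def q_shuffle_alt(i, q, r):
    """Alternate form: q*i mod (q*r - 1), with the last index fixed."""
    last = q * r - 1
    return i if i == last else (q * i) % last


def module_count(a, b, n):
    """Total a x b modules: stage i holds a^(n-i) * b^(i-1) modules."""
    return sum(a ** (n - i) * b ** (i - 1) for i in range(1, n + 1))


if __name__ == "__main__":
    # The 4-shuffle of 12 indices, i.e. the function S_{4*3}.
    print([q_shuffle(i, 4, 3) for i in range(12)])
    # 2^3 x 2^3 and 4^2 x 3^2 networks.
    print(module_count(2, 2, 3), module_count(4, 3, 2))
```

For the 8 × 8 network built from 2 × 2 modules, `module_count(2, 2, 3)` agrees with the 12 modules mentioned in Section II.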
Now we prove that the a-shuffle link pattern indeed allows a source to connect to any destination using the destination-digit control of each a × b crossbar module.

[Fig. 8. A $4^2 \times 3^2$ delta network.]

[Fig. 9. A $2^3 \times 2^3$ delta network.]

Theorem: An $a^n \times b^n$ delta network which uses a-shuffles as the interstage link pattern can connect any source to any destination $D = (d_{n-1} d_{n-2} \cdots d_1 d_0)_b$ by switching the source to the output terminal $d_{n-i}$ at each stage i, where the outputs of each a × b crossbar are assumed to have labels 0, 1, ..., (b − 1) from top to bottom.

Proof: Let the source S need a connection to destination D, where

$$S = (s_{n-1} s_{n-2} \cdots s_1 s_0)_a \text{ in base-}a\text{ system}$$

and

$$D = (d_{n-1} d_{n-2} \cdots d_1 d_0)_b \text{ in base-}b\text{ system.}$$

In other words

$$S = s_{n-1}a^{n-1} + s_{n-2}a^{n-2} + \cdots + s_1 a + s_0 \quad \text{and} \quad 0 \le s_i \le a - 1$$
$$D = d_{n-1}b^{n-1} + d_{n-2}b^{n-2} + \cdots + d_1 b + d_0 \quad \text{and} \quad 0 \le d_i \le b - 1.$$

If the a × b modules are numbered 0, 1, 2, ... from top to bottom (see Fig. 7), then S is connected to module number $\lfloor S/a \rfloor = s_{n-1}a^{n-2} + \cdots + s_2 a + s_1$. By the destination-digit control algorithm, the source S is switched to the output terminal $d_{n-1}$ of the module number $\lfloor S/a \rfloor$. Assuming all output lines of stage 1 are numbered 0, 1, 2, ... from the top, the source S is switched by stage 1 to its output line number $L_1 = \lfloor S/a \rfloor b + d_{n-1}$, that is

$$L_1 = (s_{n-1}a^{n-2} + \cdots + s_2 a + s_1)b + d_{n-1}.$$

This line, when the a-shuffle of the $a^{n-1}b$ output lines of stage 1 is done, becomes

$$L_1' = \left(L_1 a + \left\lfloor \frac{L_1}{a^{n-2}b} \right\rfloor\right) \bmod a^{n-1}b.$$

Let us first evaluate the floor function.

$$\left\lfloor \frac{L_1}{a^{n-2}b} \right\rfloor = \left\lfloor s_{n-1} + \frac{s_{n-2}a^{n-3} + \cdots + s_1 + d_{n-1}/b}{a^{n-2}} \right\rfloor.$$

Since $d_{n-1}/b < 1$ the numerator of the second term is less than $a^{n-2}$. Therefore

$$\left\lfloor \frac{L_1}{a^{n-2}b} \right\rfloor = \lfloor s_{n-1} + \text{fraction} \rfloor = s_{n-1}.$$

Therefore

$$\begin{aligned} L_1' &= (L_1 a + s_{n-1}) \bmod a^{n-1}b \\ &= \left[ s_{n-1}a^{n-1}b + (s_{n-2}a^{n-2} + \cdots + s_1 a)b + d_{n-1}a + s_{n-1} \right] \bmod a^{n-1}b \\ &= \left[ \left( s_{n-2}a^{n-2} + \cdots + s_1 a + \frac{d_{n-1}a + s_{n-1}}{b} \right) b \right] \bmod a^{n-1}b. \end{aligned}$$

Since $d_{n-1} < b$ and $s_{n-1} < a$ the quantity $(d_{n-1}a + s_{n-1})/b$ is less than a. Thus, the expression in the parentheses is less than $a^{n-1}$. Therefore

$$L_1' = (s_{n-2}a^{n-2} + \cdots + s_1 a)b + d_{n-1}a + s_{n-1}$$

which is the input to stage 2 of the network. This line is connected to module $\lfloor L_1'/a \rfloor$ of stage 2 and is switched according to digit $d_{n-2}$ and becomes the output line number $L_2$ of stage 2, where

$$\begin{aligned} L_2 &= \left\lfloor \frac{L_1'}{a} \right\rfloor b + d_{n-2} \\ &= (s_{n-2}a^{n-3} + \cdots + s_2 a + s_1)b^2 + d_{n-1}b + d_{n-2}. \end{aligned}$$

After the a-shuffle of $a^{n-2}b^2$ lines, the line number $L_2$ becomes

$$L_2' = \left(L_2 a + \left\lfloor \frac{L_2}{a^{n-3}b^2} \right\rfloor\right) \bmod a^{n-2}b^2.$$

After simplifying the right-hand side as before, the input to stage 3 is line number

$$L_2' = (s_{n-3}a^{n-3} + \cdots + s_1 a)b^2 + (d_{n-1}b + d_{n-2})a + s_{n-2}.$$

This line when switched by stage 3 becomes the output of stage 3 at line number

$$\begin{aligned} L_3 &= \left\lfloor \frac{L_2'}{a} \right\rfloor b + d_{n-3} \\ &= (s_{n-3}a^{n-4} + \cdots + s_2 a + s_1)b^3 + (d_{n-1}b^2 + d_{n-2}b + d_{n-3}). \end{aligned}$$

By finite induction it can be shown that the source is connected to an output terminal of stage i at line number

$$L_i = (s_{n-i}a^{n-i-1} + \cdots + s_2 a + s_1)b^i + (d_{n-1}b^{i-1} + d_{n-2}b^{i-2} + \cdots + d_{n-i})$$

which after the a-shuffle of $a^{n-i}b^i$ lines becomes the input of stage i + 1 at line number

$$L_i' = (s_{n-i-1}a^{n-i-1} + \cdots + s_2 a^2 + s_1 a)b^i + (d_{n-1}b^{i-1} + \cdots + d_{n-i})a + s_{n-i}.$$

And after the final stage n, the source is connected to the output line number

$$L_n = d_{n-1}b^{n-1} + \cdots + d_1 b + d_0$$

which is the desired destination D.

When a = b, the above proof can be simplified because the b-shuffle of $b^n$ lines has a simple form. Take, for example, any integer i, $0 \le i < b^n$, in its base-b representation. Let $i = (c_{n-1} c_{n-2} \cdots c_1 c_0)_b$; then the b-shuffle changes it to $(c_{n-2} c_{n-3} \cdots c_1 c_0 c_{n-1})_b$, which is simply a left rotation of the digits of i.

[Fig. 10. Implementation details of a 2 × 2 module.]
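The theorem can also be checked by brute force for small networks. The following sketch follows a single request stage by stage, using $L_i = \lfloor L_{i-1}'/a \rfloor b + d_{n-i}$ for the switching step and the a-shuffle between stages; conflicts between requests are not modeled. The function names `q_shuffle` and `route` are hypothetical helpers introduced only for this illustration.

```python
def q_shuffle(i, q, r):
    """S_{q*r}(i) = (q*i + floor(i/r)) mod q*r."""
    return (q * i + i // r) % (q * r)


def route(source, dest, a, b, n):
    """Trace one request through an a^n x b^n delta network with
    a-shuffle interstage links; returns the final output line."""
    line = source                 # input line of stage 1
    lines = a ** n                # number of lines entering the stage
    for stage in range(1, n + 1):
        digit = (dest // b ** (n - stage)) % b   # control digit d_{n-stage}
        module = line // a                       # module the line enters
        line = module * b + digit                # output line of the stage
        lines = lines // a * b                   # a^(n-stage) * b^stage lines
        if stage < n:                            # passive a-shuffle wiring
            line = q_shuffle(line, a, lines // a)
    return line


if __name__ == "__main__":
    # Every source reaches every destination in the 4^2 x 3^2 case.
    print(all(route(s, d, 4, 3, 2) == d
              for s in range(16) for d in range(9)))
```

The same exhaustive check can be run for the $2^3 \times 2^3$ network by calling `route(s, d, 2, 2, 3)` over all 64 source-destination pairs.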
The left-rotation property of the b-shuffle was used by Lawrie [3] in proving a statement similar to the theorem above, when a = b = 2.

IV. IMPLEMENTATION OF DELTA NETWORKS

Within the current technological limitations it is uneconomical to encode base b digits, where b is not a power of 2. Thus, in practice an a × b crossbar module for a delta network is more cost-effective if b is a power of 2, since our modules require base b digits for control. Again, due to the cost and technological limitations, modules of size 8 × 8 or greater are not very practical at this time. This leaves a × 2 and a × 4 modules, where a is 1-4, as the most likely candidates for implementation of delta networks. Here we give the functional and logical description of 2 × 2 modules. This in turn will be used to estimate the cost and delay factors of delta networks.

The functional block diagram of a 2 × 2 crossbar module of a delta network appears in Fig. 10. All single lines in the figure are one bit lines. The double lines on the INFO box represent address lines, incoming and outgoing data lines, and a Read/Write control line. The data lines may or may not be bidirectional. The function of the INFO box is that of a simple 2 × 2 crossbar; if the input X is 1, then a cross connection exists and if X is 0, then a straight connection exists.

The function of the CONTROL box is to generate the signal X and provide arbitration. A request exists at an input port if the corresponding request line is 1. The destination digit provides the nature of the request: a 0 for the connection to the upper output port and a 1 for the lower port. In case of conflict the request $r_0$ is given the priority and a busy signal $b_1 = 1$ is supplied to the lower input port. A busy signal is eventually transmitted to the source which originated the blocked request.

The priority among processors can be randomized by connecting the two outputs of each 2 × 2 module to two different priority input ports at the next stage. For example, output port 0 can be connected to a high priority input port and the port 1 to a low priority input of the next stage. The logic equations for all the labeled signals are given with the block diagram. For the INFO box the equations are given for the left to right direction. The parallel generation of X and $\bar{X}$ reduces one gate level. Signals X and $\bar{X}$ are valid after 3 gate delays. Assuming that one level of buffer gates for X and $\bar{X}$ in the INFO box exists due to fan-out limitations of X and $\bar{X}$, the total delay to establish the connections of the INFO box is 6 gate delays, of which 4 gate delays are due to X and $\bar{X}$. Thus, after the initial setup time, the data transfer requires only 2 gate delays per stage of the network.

The operation of a $2^n \times 2^n$ delta network using the above described 2 × 2 modules is as follows; recall that there are n stages in this network. All processors requiring memory access must submit their requests at the same time by placing a 1 on the respective request lines. After 8n gate delays the busy signals are valid. If the busy line is 1, then the processor must resubmit its request. This can be accomplished simply by doing nothing, i.e., continuing to hold the request line high. Read data is valid after 8n gate delays plus the memory access time if the busy signal is 0. Thus, the operation of the implementation described here is synchronous, that is, the requests are issued at fixed intervals at the same time. An asynchronous implementation is preferable if the network has many stages. However, such an implementation would require storage buffers for addresses, data, and control in every module and also a complex control module. Thus, the cost of such an implementation might well be excessive. We have analyzed only the synchronous networks in this paper.

V. ANALYSIS OF CROSSBARS AND DELTA NETWORKS

In this section first we analyze M × N crossbar networks and then delta networks. Both networks are analyzed under identical assumptions for the purpose of comparison. We analyze the networks for finding the expected bandwidth given the rate of memory requests. Bandwidth is expressed in number of memory requests accepted per cycle. A cycle is defined to be the time for a request to propagate through the logic of the network plus the time to access a memory word plus the time to return through the network to the source. We shall not distinguish the read or write cycles in this analysis. The analysis is based on the following assumptions.

1) Each processor generates random and independent requests; the requests are uniformly distributed over all memory modules.

2) At the beginning of every cycle each processor generates a new request with a probability m. Thus, m is also the average number of requests generated per cycle by each processor.

3) The requests which are blocked (not accepted) are ignored; that is, the requests issued at the next cycle are independent of the requests blocked.

The last assumption is there to simplify the analysis. In practice, of course, the rejected requests must be resubmitted during the next cycle; thus the independent request assumption will not hold. However, to assume otherwise would make the analysis, if not impossible, certainly very difficult. Moreover, simulation studies done by us and by others and more complex analyses reported [7]-[10] for similar problems have shown that the probability of acceptance is only slightly different if the third assumption above is omitted. Thus, the results of the analysis are fairly reliable and they provide a good measure for comparing different networks.

Analysis of Crossbars: Assume a crossbar of size M × N, that is, M processors (sources) and N memory modules (destinations). In a full crossbar two requests are in conflict if and only if the requests are to the same memory module.
Therefore, in essence we are analyzing memory conflicts rather than network conflicts. Recall that m is the probability that a processor generates a request during a cycle. Let q(i) be the probability that i requests arrive during one cycle. Then

$$q(i) = \binom{M}{i} m^i (1 - m)^{M-i} \tag{1}$$

where $\binom{M}{i}$ is the binomial coefficient. Let E(i) be the expected number of requests accepted by the M × N crossbar during a cycle, given that i requests arrived in the cycle. To evaluate E(i), consider the number of ways that i random requests can map to N distinct memory modules, which is $N^i$. Suppose now that a particular memory module is not requested. Then the number of ways to map i requests to the remaining (N − 1) modules is $(N - 1)^i$. Thus, $N^i - (N - 1)^i$ is the number of maps in which a particular module is always requested. Thus, the probability that a particular module is requested is $[N^i - (N - 1)^i]/N^i$. For every memory module, if it is requested, it means one request is accepted by the network for that module. Therefore, the expected number of acceptances, given i requests, is

$$E(i) = \frac{N^i - (N - 1)^i}{N^i}\, N = \left[ 1 - \left( \frac{N - 1}{N} \right)^i \right] N. \tag{2}$$

Thus, the expected bandwidth BW, that is, requests accepted per cycle, is

$$BW = \sum_{0 \le i \le M} E(i)\, q(i)$$

which simplifies to

$$BW = N - N \left( 1 - \frac{m}{N} \right)^M. \tag{3}$$

Let us define the ratio of expected bandwidth to the expected number of requests generated per cycle as the probability of acceptance $P_A$, that is, the probability that an arbitrary request will be accepted. Thus, $P_A$ is a measure of the wait time of blocked requests. A higher $P_A$ indicates a lower wait time and a lower $P_A$ indicates a higher wait time. The expected wait time of a request is $(1/P_A - 1)$.

$$P_A = \frac{BW}{mM} = \frac{N}{mM} - \frac{N}{mM} \left( 1 - \frac{m}{N} \right)^M. \tag{4}$$

It is interesting to note the limiting values of BW and $P_A$ as M and N grow very large. Let k = M/N, then

$$\lim_{N \to \infty} \left( 1 - \frac{m}{N} \right)^{kN} = e^{-mk}.$$
Thus, for very large values of M and N

$$BW \approx N \left( 1 - e^{-mM/N} \right) \tag{5}$$

$$P_A \approx \frac{N \left( 1 - e^{-mM/N} \right)}{mM}. \tag{6}$$

The above approximations are good within 1 percent of actual value when M and N are greater than 30 and within 5 percent when M, N ≥ 8. Note that for a fixed ratio M/N the bandwidth of (5) increases linearly with N.

Analysis of Delta Networks: Assume a delta network of size $a^n \times b^n$ constructed from a × b crossbar modules. Thus, there are $a^n$ processors connected to $b^n$ memory modules. We apply the result of (3) for an M × N crossbar to an a × b crossbar and then extend the analysis to the complete delta network. However, to apply (3) to any a × b crossbar module we must first satisfy the assumptions of the analysis. We show below that the independent request assumption holds for every a × b module in a delta network.

Each stage of the delta network is controlled by a distinct destination digit (in base b) for setting of individual a × b switches. Since the destinations are independent and uniformly distributed, so are the destination digits. Thus, for example, in some arbitrary stage i an a × b crossbar uses digit $d_{n-i}$ of each request; this digit is not used by any other stage in the network. Moreover, no digit other than $d_{n-i}$ is used by stage i. Thus, the requests at any a × b module are independent and uniformly distributed over b different destinations. Thus, we can apply the result of (3) to any a × b module in the delta network.
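The crossbar results (1) through (6) are easy to evaluate numerically. The sketch below computes the bandwidth both from the closed form (3) and directly from the sum of E(i) q(i), together with the probability of acceptance (4) and the large-network approximation (5). The function names are illustrative, and `math.comb` assumes Python 3.8 or later.

```python
import math


def crossbar_bw(M, N, m):
    """Closed form (3): BW = N - N * (1 - m/N)^M."""
    return N - N * (1 - m / N) ** M


def crossbar_bw_sum(M, N, m):
    """BW as the sum over i of E(i) * q(i), using (1) and (2)."""
    total = 0.0
    for i in range(M + 1):
        q_i = math.comb(M, i) * m ** i * (1 - m) ** (M - i)   # (1)
        e_i = (1 - ((N - 1) / N) ** i) * N                    # (2)
        total += e_i * q_i
    return total


def crossbar_pa(M, N, m):
    """Probability of acceptance (4): BW / (m * M)."""
    return crossbar_bw(M, N, m) / (m * M)


def crossbar_bw_limit(M, N, m):
    """Approximation (5): BW ~ N * (1 - exp(-m*M/N))."""
    return N * (1 - math.exp(-m * M / N))


if __name__ == "__main__":
    for size in (8, 32, 256):
        exact = crossbar_bw(size, size, 1.0)
        approx = crossbar_bw_limit(size, size, 1.0)
        print(size, exact, approx, crossbar_pa(size, size, 1.0))
```

Under these assumptions the relative gap between (3) and (5) shrinks as M and N grow, consistent with the accuracy statement above.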
Given the request rate $m$ at each of the $a$ inputs of an $a \times b$ crossbar module, the expected number of requests that it passes per time unit is obtained by setting $M = a$ and $N = b$ in (3), which is

$$b - b\left(1 - \frac{m}{b}\right)^a.$$

Dividing the above expression by the number of output lines of the $a \times b$ module gives us the rate of requests on any one of $b$ output lines:

$$1 - \left(1 - \frac{m}{b}\right)^a.$$

Thus, for any stage of a delta network the output rate of requests $m_{out}$ is a function of its input rate $m_{in}$ and is given by

$$m_{out} = 1 - \left(1 - \frac{m_{in}}{b}\right)^a.$$

Since the output rate of a stage is the input rate of the next stage, one can recursively evaluate the output rate of any stage starting at stage 1. In particular, the output rate of the final stage $n$ determines the bandwidth of a delta network, that is, the number of requests accepted per cycle. Let us define $m_i$ to be the rate of requests on an output line of stage $i$. Then the following equations determine the bandwidth $BW$ of an $a^n \times b^n$ delta network, given $m$, the rate of requests generated by each processor.

$$BW = b^n m_n \tag{7}$$

where

$$m_i = 1 - \left(1 - \frac{m_{i-1}}{b}\right)^a \quad \text{and} \quad m_0 = m,$$

and the probability that a request will be accepted is

$$P_A = \frac{b^n m_n}{a^n m}. \tag{8}$$

Fig. 11. Probability of acceptance of $N \times N$ networks ($m = 1.0$; curves for crossbar, delta-4, and delta-2).

Fig. 12. Expected bandwidth of $N \times N$ networks.

VI. EFFECTIVENESS OF DELTA AND CROSSBAR NETWORKS

Since we do not have a closed form solution for the bandwidth of delta networks (7), we cannot directly compare the bandwidths of crossbar (3) and delta networks. We have computed the values of the expected bandwidth and the probability of acceptance for several networks. These results are plotted in Figs. 11, 12, and 13. Fig. 11 shows the probability of acceptance $P_A$ for $2^n \times 2^n$ and $4^n \times 4^n$ delta networks and $N \times N$ crossbar, when the request generation rate of each processor is $m = 1$.
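Because the delta-network bandwidth is defined only through the stage-by-stage recurrence, it is natural to evaluate it with a short loop. The Python sketch below is an illustration under the assumptions of the analysis (independent, uniformly distributed requests); it is not the program used to produce the figures, and all function names are hypothetical. It iterates the per-stage output rate n times, starting from the processor request rate m, and compares the resulting probability of acceptance with that of a crossbar of the same size.

```python
def delta_output_rate(m, a, b, n):
    """Request rate on an output line of stage n, starting from rate m at stage 0."""
    rate = m
    for _ in range(n):
        # One stage of a x b crossbar modules: m_i = 1 - (1 - m_{i-1}/b)^a
        rate = 1 - (1 - rate / b) ** a
    return rate


def delta_bw(m, a, b, n):
    """Expected bandwidth of an a^n x b^n delta network: b^n * m_n."""
    return b**n * delta_output_rate(m, a, b, n)


def delta_pa(m, a, b, n):
    """Probability of acceptance: bandwidth over requests generated (a^n * m)."""
    return delta_bw(m, a, b, n) / (a**n * m)


def crossbar_pa(m, M, N):
    """Probability of acceptance of an M x N crossbar, from the closed form."""
    return (N - N * (1 - m / N) ** M) / (m * M)


if __name__ == "__main__":
    m = 1.0
    for n in range(1, 7):
        size = 2**n
        print(size, delta_pa(m, 2, 2, n), crossbar_pa(m, size, size))
```

For a single stage (n = 1) the delta network is one crossbar module, so the two values coincide; for larger n the delta value falls below the crossbar value.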
The curve marked delta-2 is for delta networks using $2 \times 2$ switches and delta-4 for delta networks using $4 \times 4$ switches. The graphs are drawn as smooth curves in this and other figures only for visual convenience; in actuality the values are valid only at specific discrete points. In particular, an $N \times N$ crossbar is defined for all integers $N \ge 1$, a delta-2 is defined only for $N = 2^n$, $n \ge 1$, and delta-4 is defined only for $N = 4^n$, $n \ge 1$. Notice in Fig. 11 that $P_A$ for crossbar approaches a constant value as was predicted by (6) of the previous section. $P_A$ for delta networks continues to fall as $N$ grows.

Fig. 12 shows the expected bandwidth $BW$ as a function of $N$. The bandwidth is measured in number of requests accepted per cycle. In all fairness we must point out that a cycle for a crossbar could be smaller than a cycle for a large delta network. Taking into account fan-in and fan-out constraints, the decoder and arbiter for an $N \times N$ crossbar has a delay of $O(\log_2 N)$ gate delays. A $2^n \times 2^n$ delta network also has $O(\log_2 N)$ gate delays, from the analysis of Section IV. However, the delay of a delta network is approximately twice the delay of a crossbar. If the delay is small compared to the memory access time, then the cycle time (the sum of network delay and memory access time) of a crossbar is not too different from that of a delta network. Thus, the curves for bandwidth provide a good comparison between networks.

Fig. 13. Probability of acceptance versus mean request rate.

Fig. 13 shows $P_A$ as a function of the request generation rate $m$. The curve for the crossbar is the limiting value of $P_A$ as $N$ grows to infinity. Curves for $N \ge 32$ are not distinguishable with the scale used in that graph.

Finally, the graph of Fig. 14 is an indication of cost-effectiveness of delta networks. The cost of an $N \times N$ crossbar or delta network is assumed to be proportional to the number of gates required. The constant of proportionality should be the same in both cases because the degree of integration, modularity, and wiring complexity in both cases is more or less the same. For the $N \times N$ crossbar the minimum number of gates required is one per crosspoint per data line. Depending on the assumptions used on fan-in, fan-out, the complexity of the decoder and the arbiter, one can estimate the gate complexity of a crossbar anywhere from one gate to six gates per crosspoint. Let us assume the lowest cost figure of one gate per crosspoint. The cost of a $2^n \times 2^n$ delta network is estimated from the Boolean equations of the $2 \times 2$ module of Fig. 10. The number of gates in a $2 \times 2$ module is 23 gates for the control plus 6 gates per information line. Assuming the number of information lines to be large, the gates for control can be ignored. Thus, the gate count of a $2^n \times 2^n$ delta network is $6n2^{n-1}$ per information line because the network has $n2^{n-1}$ modules. Thus, the costs of $N \times N$ networks are $kN^2$ for crossbar and $3kN\log_2 N$ for delta ($N = 2^n$), where $k$ is the constant of proportionality. We have used these cost expressions in the computation of performance-cost ratio for Fig. 14; the ratio is that of expected bandwidth over cost. Taking this ratio for a $1 \times 1$ crossbar as unity, the Y-axis of Fig. 14 represents the performance over cost relative to a $1 \times 1$ crossbar. The Y-axis may also be interpreted as bandwidth per gate per information line.

Fig. 14. Cost-effectiveness of $N \times N$ networks ($m = 1.0$; performance/cost versus $N$; curves for delta-2 and crossbar).

Notice that the delta network is more cost-effective for network size $N$ greater than 16. If the cost of the crossbar was assumed as 2 or more gates per crosspoint, then the curve for the crossbar would shift downward and the effectiveness of delta becomes even more pronounced. However, if one assumed the cycle time of a crossbar half as much as that of a delta network, then the curve for crossbar would shift upwards relative to the curve for delta, thus shifting the crossover point of the two curves towards the right. Thus, depending on the assumptions, the crossover point may move slightly left or right; but in any case the curves clearly show the effectiveness of delta networks for medium and large scale multiprocessors.

VIII. CONCLUDING REMARKS

We have presented in this paper a class of processor-memory interconnection networks, called delta networks, which are easy to control and design, and are very cost-effective. We also presented the combinatorial analysis of delta networks and full crossbars. It is seen that delta networks bridge the gap between a single time-shared bus and a full crossbar. The cost of an $N \times N$ delta network varies as $N\log_2 N$, while that of crossbar varies as $N^2$. Thus, delta networks are very suitable for relatively low cost multimicroprocessor systems.

REFERENCES

[1] M. J. Flynn, "Very high speed computing systems," Proc. IEEE, vol. 54, pp. 1901-1909, Dec. 1966.
[2] L. R. Goke and G. J. Lipovski, "Banyan networks for partitioning multiprocessor systems," in Proc. 1st Annu. Symp. Comput. Arch., Dec. 1973, pp. 21-28.
[3] D. H. Lawrie, "Access and alignment of data in an array processor," IEEE Trans. Comput., vol. C-24, pp. 1145-1155, Dec. 1975.
[4] M. C. Pease, "The indirect binary n-cube microprocessor array," IEEE Trans. Comput., vol. C-26, pp. 458-473, May 1977.
[5] H. J. Siegel, "Study of multistage SIMD interconnection networks," in Proc. 5th Annu. Symp. Comput. Arch., Apr. 1978, pp. 223-229.
[6] H. S. Stone, "Parallel processing with the perfect shuffle," IEEE Trans. Comput., vol. C-20, pp. 153-161, Feb. 1971.
[7] D. Y. Chang, D. J. Kuck, and D. H. Lawrie, "On the effective bandwidth of parallel memories," IEEE Trans. Comput., vol. C-26, pp. 480-490, May 1977.
[8] W. D. Strecker, "Analysis of the instruction execution rate in certain computer structures," Ph.D. dissertation, Carnegie-Mellon Univ., Pittsburgh, PA, 1970.
[9] F. Baskett and A. J. Smith, "Interference in multiprocessor computer systems with interleaved memory," Commun. Ass. Comput. Mach., vol. 19, pp. 327-334, June 1976.
[10] D. P. Bhandarkar, "Analysis of memory interference in multiprocessors," IEEE Trans. Comput., vol. C-24, pp. 897-908, Sept. 1975.

Janak H. Patel (S'73-M'76) was born in Bhavnagar, India. He received the B.Sc. degree in physics from Gujarat University, India, the B.Tech. degree from the Indian Institute of Technology, Madras, India, and the M.S. and Ph.D. degrees from Stanford University, Stanford, CA, all three in electrical engineering. From 1976 to 1979 he was an Assistant Professor of Electrical Engineering at Purdue University, West Lafayette, IN. Since 1980 he has been with the University of Illinois at Urbana-Champaign, where he is currently an Assistant Professor of Electrical Engineering and a Research Assistant Professor with the Coordinated Science Laboratory. He is currently engaged in research and teaching in the areas of computer architecture, VLSI, and fault-tolerant systems. Dr. Patel is a member of the Association for Computing Machinery.
