Performance of Processor-Memory Interconnections for

advertisement
771
IEEE TRANSACTIONS ON COMPUTERS, VOL. c-30, NO. 10, OCTOBER 1981
REFERENCES
[1] V. B. Aggarwal and J. W. Burgmeier, "A round-off error model with
applications to arithmetic expressions," SIAM J. Comput., vol. 8, pp.
60--72, Feb. 1979.
[2] D. Bini, M. Capovani, F. Romani, and G. Lotti, "O(n27799) complexity
for n X n approximate matrix multiplication," Inform. Process. Lett.,
vol. 8, pp. 234-235, 1979.
[3] R. P. Brent, "Algorithms for matrix multiplication," Stanford Univ.,
Stanford, CA, STAN-CS-70-157, Mar. 1970.
[4]
, "Error analysis of algorithms for matrix multiplication and triangular decomposition using Winograd's identity," Numer. Math., vol.
16, pp. 145-156, 1970.
[5] J. R. Bunch and J. E. Hopcroft, "Triangular factorization and inversion
by fast matrix multiplication," Math Comput., vol. 28, pp. 231-236,
Jan. 1974.
[6] F. Y Chin, "An 0(n) algorithm for determining a near-optimal computation order of matrix chain products," Commun. Ass. Comput.
Mach., vol. 21, pp. 544-549, July 1978.
[7] J. Cohen and M. Roth, "On the implementation of Strassen's fast
multiplication algorithm," Acta Inform., vol. 6, pp. 341-355, 1976.
[8] S. S. Godbole, "On efficient computation of matrix chain products,"
IEEE Trans. Comput., vol. C-22, pp. 864-866, Sept. 1973.
[9] W. Miller, "Computational complexity and numerical stability," in Proc.
6th ACM Symp. Theory Comput., 1974, pp. 317-322.
[10] V. Muraoka and D. J. Kuck, "On the time required for a sequence of
matrix products," Commun. Ass. Comput. Mach., vol. 16, pp. 22-26,
Jan. 1973.
[11] V. Y. Pan, "New methods for the acceleration of matrix multiplications,"
in Proc. 20th FOCSSymp., 1979, pp. 28-38.
[12] V. Strassen, "Gaussian elimination is not optimal," Numer. Math., vol.
13, pp. 354-356, 1969.
[13] N. K. Tsao, "On accurate evaluation of extended sums," J. Franklin
Inst., vol. 308, pp. 39-55, July 1979.
[14]
, "The numerical instability of Bini's algorithm," Inform. Process.
Lett.,vol. 12. pp. 17-19, 1981.
[15] J. H. Wilkinson, Rounding Errors in Algebraic Processes. Englewood
Cliffs, NJ: Prentice-Hall, 1963.
[16] S. Winograd, "A new algorithm for inner product," IEEE Trans.
Comput., vol. C-17, pp. 693-694, July 1968.
Nai-Kuan Tsao was born in Shanghai, China, on
June 25, 1939. He received the B.S. degree from
the National Taiwan University, Taipei, Taiwan,
the M.S. degree from the National Chiao Tung
University, Hsinchu, Taiwan, in 1961 and 1964,
respectively, all in electrical engineering. He received the M.S. and Ph.D. degrees in electrical engineering from the University of Hawaii, Honolulu, in 1966 aned 1970, respectively.
Since 1974 he has been with the Department of
Computer Science, Wayne State University,
Detroit, MI, where he is currently an Associate Professor. His research interests include the analysis and implementation of numerical and nonnumerical algorithms.
Performance of Processor-Memory
Interconnections for Multiprocessors
JANAK H. PATEL, MEMBER, IEEE
Abstract-A class of interconnection networks based on some
existing permutation networks is described with applications to processor to memory communication in multiprocessing systems. These
networks, termed delta networks, allow a direct link between any
processor to any memory module. The delta networks and full crossbars are analyzed with respect to their effective bandwidth and cost.
The analysis shows that delta networks have a far better performance
per cost than crossbars in large multiprocessing systems.
I. INTRODUCTION
W ITH the advent of low cost microprocessors, the arW chitectures involving multiple processors are becoming
very attractive. Several organizations have been implemented
or proposed. Principally among these are parallel (SIMD) type
processors [I], computer networking, and multiprocessor organizations. In this paper we focus our attention on multiIndex Terms-Crossbar, interconnection networks, memory processors.
bandwidth, multiprocessor memories.
The principal characteristics of a multiprocessor system is
the ability of each processor to share a single main memory.
Manuscript received March 24, 1980; revised February 17, 1981. This work This sharing capability is provided through an interconnection
was supported in part by the Joint Services Electronics Program under Con- network between the processor and the memory modules,
tract NOOOI 6-79-C-0424.
The author is with the Coordinated Science Laboratory and the Department which logically looks like Fig. 1. The function of the switch is
of Electrical Engineering, University of Illinois, Urbana, IL 61801.
to provide a logical linkl between any processor and any
0018-9340/81/1000-0771$00.75
©) 1981 IEEE
772
IEEE TRANSACTIONS ON COMPUTERS, VOL.
C-30, NO. 10, OCTOBER 1981
II. PRINCIPLE OF OPERATION
SWITCH
Fig. 1. Logical organization of a multiprocessor.
memory module. There are several different physical forms
available for the processor-memory switch; the least expensive
of which is the time-shared bus. However, a time-shared bus
has a very limited transfer rate, which is inadequate for even
a small number of processors. At the other end of the bandwidth spectrum is the full crossbar switch, which is also the
most expensive switch. In fact, considering the current low
costs of microprocessors and memories, a crossbar would
probably cost more than the rest of the system components
combined. Therefore, it is very difficult to justify the use of a
crossbar for large multiprocessing systems. It is the absence
of a switch with reasonable cost and performance which has
prevented the growth of large multimicroprocessor systems.
To circumvent the high cost of switch, some "loosely coupled"
systems can be defined. In these systems sharing of main
memory is somewhat restricted, for example, some memory
accesses may be fast and direct while many other references
may be slow, indirect, and may even involve operating system
intervention. There is considerable research on the permutation
networks for parallel (SIMD) processors but very little research on processor-memory interconnections requiring random access capabilities.
In this paper we study a specific class of interconnection
networks, termed delta networks, which are far less expensive
than full crossbars. Moreover, delta networks are modular and
easy to control. The delta networks form a subset of a very
broad class of networks called banyan networks, initially defined in graph theoretic terms by Goke and Lipovski [2]. Informally, in terms of Fig. 1 the switch is a banyan network if
and only if there exists a unique path from every processor to
every memory module. The control of delta networks resembles
the control of permutation networks of Lawrie [3] and Pease
[4]. However, these networks were not applied or analyzed for
random access of memories which invariably involves path
conflicts in the network, as well as memory conflicts.
In Section II we present the basic principle involved in the
design and control of delta networks. Then we describe the
generalized delta networks with some examples. Following this
we give the detailed logic of a 2 X 2 delta module which is the
basic building block of 2n X 2" delta networks. In Section V
we analyze full crossbar networks for effective bandwidth and
probability of conflicts, given the rate of memory requests from
processors. The results of crossbar are used in turn to analyze
arbitrary delta networks. Finally, we present quantitative
comparison of crossbars and delta networks.
Before we define delta networks, let us study the basic
principle involved in the construction and control of delta
networks. Consider a 2 X 2 crossbar switch (Fig. 2). This 2 X
2 switch has the capability of connecting the input A to either
the output labeled 0 or output labeled 1, depending on the value
of some control bit of the input A. If the control bit is 0, then
the input is connected to the upper output and if 1, then it is
connected to the lower output. The same description applies
to terminal B, but for the time being ignore the existence of B.
It is straightforward to construct a 1 X 2n demultiplexer using
the above described 2 X 2 module. This is done by making a
binary tree of this module. For example, Fig. 3 shows a 1 X 8
demultiplexer tree. The destinations are marked in binary. If
the source A requires to connect to destination (d2d1d0)2, then
the root node. is controlled by bit d2, the second stage modules
are controlled by bit dl, and the last stage modules are controlled by bit do. It is clear that A can be connected to any one
of the eight output terminals. It is also obvious that the lower
input terminal of the root-node also can be switched to any one
of the 8 outputs.
At this point we add another capability to the basic 2 X 2
module, the capability to arbitrate between conflicting requests. If both inputs require the same output terminal, then
only one of them will be connected and the other will be
blocked or rejected. The probability of blocking and the logic
for arbitration is treated later on.
Now consider constructing an 8 X 8 network using 2 X 2
switches; the principle used is the same as that of Fig. 3. Every
additional input must also have its own demultiplexer tree to
connect to any one of the eight outputs. Basically, the construction works as follows. Start with a demultiplexer tree, then
for each additional input superimpose a demultiplexer tree on
the partially constructed network. One may use the already
existing links as part of the new tree or add extra links and
modules if needed. We have redrawn the tree of Fig. 3 as Fig.
4(a). The addition of next tree is shown with heavy lines in Fig.
4(b). This procedure is continued until the final 8 X 8 network
of Fig. 4(d) results. The only restriction which must be strictly
followed during this construction is that if a 2 X 2 module has
its inputs coming from other modules, then both inputs must
come from upper terminals of other modules or both must be
lower terminals of other modules. (All upper output terminals
are understood to have label 0 and the lower terminals, label
1.) Other than this there is considerable freedom in establishing links between the stages of the network. In the construction of the above 8 X 8 network we had the benefit of some
hindsight that 12 modules are necessary and sufficient to build
this network. If more modules are used then some inputs of
some modules will remain unconnected. We could have
stopped in the middle of the construction to obtain a 4 X 8 or
a 6 X 8 network, such as Fig. 4(b) and (c), however in each
case some inputs of the 2 X 2 modules will remain unutilized.
We term the networks constructed in the above manner digit
controlled or simply delta networks, since each module is
PATEL: PROCESSOR-MEMORY INTERCONNECTIONS
Af-l
0
A
-B _ J
I
B _. __ I
Control
bit of A
L\
773
0
Control
-bit of A
I
-0
Fig. 2. A 2 X 2 crossbar.
)000
'001
,010
1011
'100
101
110
,
ll
Fig. 3. 1 X 8 demultiplexer.
same set of permutations. As far as we know, the network of
Fig. 5 cannot be made equivalent to either Lawrie's or Pease's
network by a simple address transformation. It is quite possible
that there are only two nonequivalent 8 X 8 delta networks,
namely Lawrie's omega network and the network of Fig. 5. We
shall not pursue this subject any further, as our primary interest lies in the random access capabilities of these networks
and not permutations.
III. DESIGN AND DESCRIPTION OF DELTA NETWORKS
So far we have not defined the delta networks in a formal
and rigorous manner. For the purpose of this paper we define
them as follows.
Let an a X b crossbar module have the capability to connect
any one of its a inputs to any one of the b outputs. Let the
outputs be labeled 0, 1,- , b - 1. An input terminal is connected to the output labeled d if the control digit supplied by
the input is d, where d is a base-b digit. Moreover, an a X b
module also arbitrates between conflicting requests by accepting some and rejecting others.
A delta network is an an X bn switching network with n
stages, consisting of a X b crossbar modules. The link pattern
between stages is such that there exists a unique path of constant length from any source to any destination. Moreover, the
path is digit controlled such that a crossbar module connects
an input to one of its b outputs depending on a single base-b
digit taken from the destination address. Also, in a delta network no input or output terminal of any crossbar module is left
unconnected.
One can determine the number of crossbar modules in each
stage from the above definition. Specifically, the conditions
of constant length paths and no unconnected terminals require
that all the output terminals of one stage be connected to all
the input terminals of the next stage in one-to-one fashion. The
inputs of the first stage are connected to the source and the
outputs of the final stage are connected to the destination.
Thus, an X bn delta network has an sources and bn destinations. Numbering the stages of the network as 1, 2, starting
at the source side of the network requires that there be
crossbar modules in the first stage. The first stage then has
a n- lb output terminals. This implies that the stage two must
have an-lb input terminals, which requires an-2b crossbar
modules in the second stage. In general, ith stage has a-ibi-1
crossbar modules of size a X b. Thus, the total number of a X
b crossbar modules required in an an X bn delta network is
controlled by a single digit from the destination address
Digit
Furthermore, no external or global control is requiroed.
controlled networks are not new; Lawrie's omega netseorks [3]
and Pease's indirect binary n-cube [4] are subsets of delta
networks.
Under the formal definition of delta networks pre,sented in
the next section, Fig. 4(d) is a delta network, but not IFig. 4(a),
(b), or (c). Note that the network of Fig. 4(d) does riot allow
an identity permutation, that is, the connection 0 to 0, 1 to 1
*.* , 7 to 7 at the same time. An identity permutation is usefu
if say memory module 0 is a "favorite" module of p
0, and module 1 that of processor 1, and so on. Thus, identity
permutation allows most of the memory references to be made
without conflict. A simple renaming of the inputs of I
will allow an identity permutation. This is shown in IFig. in
here if all 2 X 2 switches were in the straight (=) posit ion, then
then
an identity permutation is generated. As a matter of fiact, since
one and only one path is available from any sourc
destination, every different setting of the form X ani :
erates a different permutation. Thus, the network (,.Uo r1Ig.
generates 212 distinct permutations. This brings us tc another
a. b
Z an-ibi-I = ansomewhat unrelated topic of permutation netwoirks. The
a-b
1<i<n
procedure to construct a delta network can be used to generate
= nbn-I
aa=b.
different permutation networks. For example, we co'uld have
started with a tree in which the first stage was contirolled by
The construction of an an X bn delta network follows the
bit d1, the second by do, and third by d2. This would (of course principle presented in the previous section. Informally, the
require relabeling the outputs. But does this really piroduce a procedure can be stated as follows.
"different" network? Siegel [5] has shown that by a simple
Construct a b-ary demultiplexer tree using a X b crossbar
address transformation the networks of Lawrie [3] and that switches. A b-ary tree has b branches for every node. For a 1
of Pease [4] can be made equivalent, i.e., they pro' duce the X bn demultiplexer, the tree has n levels. Each level is con-
drocessor
Fig;
=tgen
an-m
774
IEEE TRANSACTIONS ON COMPUTERS, VOL. C-30, NO. 10, OCTOBER
1981
000
001
010
011
100
101
110
111
(c)
(d)
Fig. 4. Construction of an 8 X 8 delta network.
Fig. 5. An 8 X 8 delta network to allow identity permutation.
trolled by a distinct base-b digit taken from the destination
address. For every additional input source, superimpose a new
tree on the partially completed network. Each superimposition
must satisfy the condition that an a X b module which receives
inputs from other a X b modules must have all its inputs connected to identically labeled outputs, where the outputs of each
a X b module are labeled 0, 1,, b - 1, as was described
earlier.
As one can see from the above construction procedure, there
is a large number of link patterns available for an an X b n delta
network. It is natural to wonder if one topology is better than
the others. We shall see later that, as far as probability of acceptance or blocking for random access is concerned, all delta
networks are identical. However, different topologies may have
different permutation capabilities in bn X bn delta networks.
Since the link pattern between stages of a delta network is
of no particular concern to us, we may ask if there is some
regular link pattern, which can be used between all stages and
thus avoid the cumbersome construction procedure for every
different delta network. There is indeed such a pattern which
we describe below.
Let a q-shuffle of qr objects, denoted Sq*r, where q and r
are some positive integers, be a permutation of qr indices (0,
1, 2, *, (qr - 1)), defined as
775
PATEL: PROCESSOR-MEMORY INTERCONNECTIONS
S4J3 (i)
-'Go
2
D 1
JO 2
3
3
4
4
5
5
6
6
7
8
9
10
10
0
iD
11
11
Fig. 6. 4-shuffle of 12 objects.
0
1
axb
1
a-i
b-l
a
a+l
2a-1
__
n
a -a
-
ax
x b
n1
.
-
-
l
bb1-
an_Stage 1
Stage 2
a-shuffle
a-shuffle
S tage n
Fig. 7. An an X bn delta network.
Sq*r(i) = (qi +
)mod qr
0<i<
qr-
1.
Alternately, the same function can be expressed as
Sq*r(i)
=
qi mod (qr
-
1)
0
i <qr- 1
i=qr-1.
A q-shuffle of qr playing cards can be viewed as follows.
Divide the deck of qr cards into q piles of r cards each; top r
cards in the first pile, next r cards in the second pile, and so on.
Now pick the cards, one at a time from the top of each pile; the
first card from top of pile one, second card from the top of pile
two, and so on in a circular fashion until all cards are picked
up. This new order of cards represents a Sq*r permutation of
the previous order. Fig. 6 shows an example of 4-shuffle of 12
indices; namely the function S4*3. From the above description
it is clear that a 2-shuffle is the well-known perfect shuffle [6].
One can also show that q-shuffle is an inverse permutation of
r-shuffle of qr objects. That is, Sq*r is an inverse of Sr*q.
Thus,
Sq*r (Sr*q(i)) = i
0<iS<qr- 1.
We show below that an an X bn delta network can be constructed by using the a-shuffle as the link pattern between
every two stages. If the destination D is expressed in base-b
system as (dn-dn-2 ... dldo)b, where D = E dibi and 0
<
O<i<n
di < b, then the base-b digit di controls the crossbar modules
of stage (n - i). The a-shuffle function is used to connect the
outputs of a stage to the inputs of the next stage where the
inputs and outputs are numbered 0, 1, 2, starting at the top.
Fig. 7 shows a general an X bn delta network. Note that the
shuffle networks between stages are passive, i.e., simply wires,
and not active like the stages themselves. Two delta networks,
one 42 X 32 and 23 X 23, derived from Fig. 7 are shown in Figs.
8 and 9; the interstage patterns are, respectively, 4-shuffle and
2-shuffle. The destinations are labeled in base 3 in Fig. 8 and
in base 2 in Fig. 9. Now we prove that the a-shuffle link pattern
indeed allows a source to connect to any destination using the
destination-digit control of each a X b crossbar module.
776
IEEE TRANSACTIONS ON COMPUTERS, VOL. C-30, NO. 10, OCTOBER 1981
0
1
2
3
(d1d0)3
-
00
-
.01
02
4
5
6
7
If the a X b modules are numbered 0, 1, 2,- from top to
bottom (see Fig. 7), then S is connected to module number
LS/aj = s,-.an-2 + . + s2a + s1. By the destination-digit
control algorithm, the source S is switched to the output terminal d"_1 of the module number LS/aJ. Assuming all output
lines of stage 1 are numbered 0, 1, 2, * * from top, the source
S is switched by stage 1 to its output line number L1 = LS/ajb
+ dn_.I, that is
-
10
11
12
8
9
10
11
LI = (sn lan-2 + + s2a + sI)b + dn-1.
This line, when the a-shuffle of an-lb output lines of stage 1
is done, becomes
20
21
22
12
13
14
15
LI = (Lla +
2b]) mod a
lb.
Let us first evaluate the floor function.
Stage
4-shuffle
I
Stage 2
Fig. 8. A 42 X 32 delta network.
(d2d1d0)2
0-
LI
Sn+S2an-3 + + S1 + dn-l/b
an-2b n1an-2
Since dn_ l/b < 1 the numerator of the second term is less
than an-2. Therefore
000
1
001
b = [sn-I + fractionj =Sn-1.
Therefore
010
011
L' = (LIa + sn-O) mod an-lb
= sn-lan-lb + (s,-2an-2 ++ dn- 1a + sn-i mod an-lb
101
110
111
2-shuffle Stage 2-shuffle Stage
2
3
1
Fig. 9. A 23 X 23 delta network.
Theorem: An an X bn delta network which uses a-shuffles
interstage link pattern, can connect any source to any destination D = (d, Idn-2 dldo)b by switching the source to
the output terminal d,-i at each stage i, where the outputs of
each a X b crossbar are assumed to have labels 0, 1,* , (b 1) from top to bottom.
Proof: Let the source S need a connection to destination
D, where
as
L2 =
...
S
=
(sn_-1sn2
+ sIa + dn-ja
+ Sn-lI
lb mod at1lb
/
b
since dn_1 < b and Sn-1 < a the quantity (dn lIa + sn-.)/b is
less than a. Thus, the expression in the parentheses is less than
anl. Therefore
L' = (sn-2an-2 + * + sla)b + dn-ja + sn-I
which is the input to stage 2 of the network. This line is connected to module LL7aJ of stage 2 and is switched according
to digit dn2 and becomes the output line number L2 of stage
2, where
(sn2an-2 +
=
100
Stage
+ sla)b
-
L'
|I1
a
b + dn-2
(sn-2an-3 + *
+ s2a + sl)b2 +
dn-lb +.dn-2.
After the a-shuffle of an-2b2 lines, the line number L2 be-
comes
L= (L2a + tan32I) mod an-2b2.
SISO)a in base-a system
After simplifying the right-hand side as before, the input
to stage 3 is line number --
and
D
=
(dn1dn2
dido)b in base-b system.
L2
=
(Sn-3an-3 + *
+
s1a)b2
+
In other words
S
=
D
=
Sn-lan-1 + Sn-2an-2 +
and O < si a- 1
+
sla
+
So
dn-lb"-1 + d,_2bn-2 + * + d1b + do
and O < di < b 1.
-
(dn-lb + dn-2)a + Sn-2
This line when switched by stage 3 becomes the output of stage
3 at line number
=3 L2 b + dn3
777
PATEL: PROCESSOR-MEMORY INTERCONNECTIONS
= (Sn-3an-4 +
+ (dn-lb2 +
+ s2a + sl)b3
dn-2b + dn-3).
By finite induction it can be shown that the source is connected
to an output terminal of stage i at line number
Li
=
(sn-jan-i-I +
-
-
+
*
sl)bi
(dn-lbi-I + dn-2bi-2 +
request
busy
request
destination
busy
+ s2a +
+
dn-i)
which after the a-shuffle of an-ibi lines becomes the input of
stage i + 1 at line number
L' = (sn-i-lan-i-I + + s2a2 + sia)bi
+
(dn-lbi-I
+
+
dn_j)a + sn-,.
And after the final stage n, the source is connected to the
output line number
Ln
=
dn-1bn-l +
+
dib + do
iI
X-=r 0d0 +TFd
10 1 <
X -s
r0d0
+ rOd I
0
which is the desired destination D.
R0 =r0d0 +IdI R = rd + r IdI
When a = b, the above proof can be simplified because the
b
XB +XB
bo XB0 + XBI + r0d0d +rd0 1
b-shuffle of bn lines has a simple form. Take, for example, any
i
integer i, 0 < i < bn in its base-b representation. Let i =
iiX
11 = i X +iX
it
to
changes
then
the
b-shuffle
(Cn.lCn-2 CICO)b
(cn-2cn-3
Fig. 10. Implementation details of a 2 X 2 module.
* ** CLCOCn-1)b, which is simply a left rotation of the digits of
i. This property was used by Lawrie [3] in proving a statement 1 to a low priority input of the next stage. The logic equations
similar to the theorem above, when a = b = 2.
for all the labeled signals are given with the block diagram. For
INFO box the equations are given for left to right direction. The
IV. IMPLEMENTATION OF DELTA NETWORKS
parallel generation of X and X reduces one gate level. Signal
and X are valid after 3 gate delays. Assuming that one level
X
Within the current technological limitations it is unecoof
buffer gates for X and k in the INFO box exists due to fannomical to encode base b digits, where b is not a power of 2.
out
limitations of X and X, the total delay to establish the
Thus, in practice an a X b crossbar module for a delta network
of INFO box is 6 gate delays, of which 4 gate delays
connections
is more cost-effective if b is a power of 2, since our modules
due
to
X
and X. Thus, after the initial setup time, the data
are
require base b digits for control. Again, due to the cost and
transfer
requires
only 2 gate delays per stage of the nettechnological limitations modules of size 8 X 8 or greater are
work.
not very practical at this time. This leaves a X 2 and a X 4
The operation of a 2n X 2n delta network using the above
modules, where a is 1-4, as the most likely candidates for
2 X 2 modules is as follows; recall that there are n
described
implementation of delta networks. Here we give the functional
in
network.
this
stages
and logical description of 2 X 2 modules. This in turn will be
All
requiring memory access must submit their
processors
used to estimate the cost and delay factors of delta nettime by placing a 1 on the respective rethe
same
at
requests
works.
After
lines.
8n
delays the busy signals are valid. If
gate
quest
The functional block diagram of a 2 X 2 crossbar module
must resubmit its request.
is
then
the
processor
the
line
1,
busy
of a delta network appears in Fig. 10. All single lines in the
doing nothing, i.e., concan
be
simply
by
accomplished
This
figure are one bit lines. The double lines on INFO box represent
Read data is valid after
The
line
tinue
to
hold
the
high.
request
address lines, incoming and outgoing data lines, and a Read/
time if the busy signal
access
8n
the
memory
delays
plus
gate
Write control line. The data lines may or may not be bidirecdescribed here
of
the
is
0.
the
implementation
operation
Thus,
tional. The function of the INFO box is that of a simple 2 X 2
issued
at
fixed intervals
is
are
that
the
requests
is,
synchronous,
crossbar; if the input X is 1, then a cross connection exists and
is prefat
the
same
time.
An
implementation
asynchronous
if X is 0, then a straight connection exists.
such
an imif
has
However,
erable
the
network
many
stages.
The function of the CONTROL box is to generate the signal
for
buffers
addresses,
data,
would
storage
require
plementation
X and provide arbitration. A request exists at an input port if
control
modin
a
and
module
and
also
control
complex
every
the corresponding request line is 1. The destination digit provides the nature of the request; a 0 for the connection to upper ule. Thus, the cost of such an implementation might well be
output port and a 1 for the lower port. In case of conflict the excessive. We have analyzed only the synchronous networks
request ro is given the priority and a busy signal b1 = 1 is in this paper.
supplied to the lower input port. A busy signal is eventually V. ANALYSIS OF CROSSBARS AND DELTA NETWORKS
transmitted to the source which originated the blocked reIn this section first we analyze M X N crossbar networks
quest.
The priority among processors can be randomized by con- and then delta networks. Both networks are analyzed under
necting the two outputs of each 2 X 2 module to two different identical assumptions for the purpose of comparison. We anpriority input ports at the next stage. For example, output port alyze the networks for finding the expected bandwidth given
0 can be connected to a high priority input port and the port the rate of memory requests. Bandwidth is expressed in
...
IEEE TRANSACTIONS ON
778
number of memory requests accepted per cycle. A cycle is
defined to be the time for a request to propagate through the
logic of the network plus the time to access a memory word plus
the time to return through the network to the source. We shall
not distinguish the read or write cycles in this analysis. The
analysis is based on the following assumptions.
1) Each processor generates random and independent requests; the requests are uniformly distributed over all memory
modules.
2) At the beginning of every cycle each processor generates
a new request with a probability m. Thus, m is also the average
number of requests generated per cycle by each processor.
3) The requests which are blocked (not accepted) are ignored; that is, the requests issued at the next cycle are independent of the requests blocked.
The last assumption is there to simplify the analysis. In
practice, of course, the rejected requests must be resubmitted
during the next cycle; thus the independent request assumption
will not hold. However, to assume otherwise would make the
analysis if not impossible, certainly very difficult. Moreover,
simulation studies done by us and by others and more complex
analyses reported [7] -[10] for similar problems have shown
that the probability of acceptance is only slightly different if
the third assumption above is omitted. Thus, the results of the
analysis are fairly reliable and they provide a good measure
for comparing different networks.
Analysis of Crossbars: Assume a crossbar of size M X N,
that is, M processors (sources) and N memory modules (destinations). In a full crossbar two requests are in conflict if and
only if the requests are to the same memory module. Therefore,
in essence we are analyzing memory conflicts rather than
network conflicts. Recall that m is the probability that a processor generates a request during a cycle. Let q(i) be the
probability that i requests arrive during one cycle. Then
q(i) = ( )mi(l m)Mi
-
where
(1)
is the binomial coefficient.
Let E(i) be the expected number of requests accepted by
the M X N crossbar during a cycle; given that i requests arrived in the cycle. To evaluate E(i), consider the number of
ways that i random requests can map to N distinct memory
modules, which is Ni. Suppose now that a particular memory
module is not requested. Then the number of ways to map i
requests to the remaining (N - 1) modules is (N - 1) '. Thus,
Ni- (N - 1)i is the number of maps in which a particular
module is always requested. Thus, the probability that a particular module is requested is [N' - (N -1 )i]/N'. For every
memory module, if it is requested, it means one request is accepted by the network for that module. Therefore, the expected
number of acceptances, given i requests, is
Ni'- (N- l)i
E(i) =NN
=
(N)-1j N.
(2)
COMPUTERS, VOL. C-30, NO. 10, OCTOBER 1981
Thus, the expected bandwidth BW, that is, requests accepted per cycle is
BW
E E(i) wmq(i)
=
0<iSM
which simplifies to
BW
=
N- N(1-MM)
(3)
Let us define the ratio of expected bandwidth to the expected
number of requests generated per cycle as the probability of
acceptance PA, that is, the probability that an arbitrary request
will be accepted. Thus, PA is a measure of the wait time of
blocked requests. A higher PA indicates a lower wait time and
a lower PA indicates higher wait time. The expected wait time
of a request is (I /PA - 1).
BWA_ N
N
(
mlM
il--I.
PA= -Im
N
mm mm mm
(4)
It is interesting to note the limiting values of BW and PA as
M and N grow very large. Let k = M/N, then
n kN
lim lu-e-IeMmk.
Thus, for very large values of M and N
BW
PA
-
N(
-
e-mM/N)
N(I e-mM/N)
-
mM
(5)
(6)
The above approximations are good within 1 percent of actual
value when M and N are greater than 30 and within 5 percent
when M, N > 8. Note that for a fixed ratio M/N the bandwidth of (5) increases linearly with N.
Analysis of Delta Networks: Assume a delta network of size
an X bn constructed from a X b crossbar modules. Thus, there
are an processors connected to bn memory modules. We apply
the result of (3) for an M X N crossbar to an a X b crossbar
and then extend the analysis for the complete delta network.
However, to apply (3) to any a X b crossbar module we must
first satisfy the assumptions of the analysis. We show below
that the independent request assumption holds for every a X
b module in a delta network.
Each stage of the delta network is controlled by a distinct
destination digit (in base b) for setting of individual a X b
switches. Since the destinations are independent and uniformly
distributed, so are the destination digits. Thus, for example,
in some arbitrary stage i an a X b crossbar uses digit dn-i of
each request; this digit is not used by any other stage in the
network. Moreover, no digit other than dn-i is used by stage
i. Thus, the requests at any a X b module are independent and
uniformly distributed over b different destinations. Thus, we
can apply the result of (3) to any a X b module in the delta
network.
Given the request rate m at each of the a inputs of an a X
779
PATEL: PROCESSOR-MEMORY INTERCONNECTIONS
b crossbar module, the expected number of requests that it
passes per time unit is obtained by setting M = a and N = b
in (3), which is
1.0m =
1.0
0.8crossbar
0.6-
b-b(1 m)a
Dividing the above expression by the number of output lines
of the a X b module gives us the rate of requests on any one of
b output lines
A
I"
\
-1,>
0.4 -
deLlt a-4
__
delta-2
0.2-
-_
0.0
b
Thus, for any stage of a delta network the output rate of
requests mout is a function of its input rate and is given by
1
N
1024
256
BW
=
1.0
delta-2
-
1641-
16
4
64
N-.
256
1024
4096
Fig. 12. Expected bandwidth of N X N networks.>
1.0-
and the probability that a request will be accepted is
rbnMn
m
64'
andmo=m
=n
aA
4096
4096
Since the output rate of a stage is the input rate of the next
stage, one can recursively evaluate the output rate of any stage
starting at stage 1. In particular, the output rate of the final
stage n determines the bandwidth of a delta network, that is,
the number of requests accepted per cycle.
Let us define mi to be the rate of requests on an output line
of stage i. Then the following equations determine the bandwidth, BW of an an X bn delta network, given m, the rate of
requests generated by each processor.
BW = bnm,
(7)
where
l)
1024
-
Fig. 11. Probability of acceptance of N X N networks.
- bin)
mout= 1 b-
mi=1-(1-mI
256
64
16
4
0.8-
(8)
VI. EFFECTIVENESS OF DELTA AND CROSSBAR
NETWORKS
Since we do not have a closed form solution for the bandwidth of delta networks (7), we cannot directly compare the
bandwidths of crossbar (3) and delta networks. We have
computed the values of the expected bandwidth and the
probability of acceptance for several networks. These results
are plotted in Figs. 1 1, 12, and 13.
Fig. 11 shows the probability of acceptance PA, for 2 n X 2 n
and 4n X 4n delta networks and N X N crossbar, when the
request generation rate of each processor is m = 1. The curve
marked delta-2 is for delta networks using 2 X 2 switches and
delta-4 for delta networks using 4 X 4 switches. The graphs
are drawn as smooth curves in this and other figures only for
visual convenience, in actuality the values are valid only at
specific discrete points. In particular, an N X N crossbar is
defined for all integers N > 1, a delta-2 is defined only for N
= 2X n > 1, and delta-4 is defined only for N = 40 n > 1.
Notice in Fig. 11 that PA for crossbar approaches a constant
1
pA
0.60.4-
0.2 -
0. 0
F
.1
.3
.5
.7
.91.0
a-m
Fig. 13. Probability of acceptance versus mean request rate.
value as was predicted by (6) of the previous section. PA for
delta networks continues to fall as N grows. Fig. 12 shows the
expected bandwidth BW as a- function of N. The bandwidth
is measured in number of requests accepted per cycle. In all
fairness we must point out that a cycle for a crossbar could be
smaller than a cycle for a large delta network. Taking into
account fan-in and fan-out constraints, the decoder and arbiter
for a N X N crossbar has a delay of O(log2N) gate delays. A
2n X 2n delta network also has O(log2N) gate delays, from the
analysis of Section IV. However, the delay of a delta network
is approximately twice the delay of a crossbar. If the delay is
small compared to the memory access time, then the cycle time
(the sum of network delay and memory access time) of a
780
IEEE TRANSACTIONS ON COMPUTERS, VOL.
1/4
1/16detta-2
1/2561/1024
crossbar
.
1
4
16
64
NO.
10, OCTOBER 1981
curve for delta, thus shifting the crossover point of the two
curves towards right. Thus, depending on the assumptions, the
crossover point may move slightly left or right; but in any case
the curves clearly show the effectiveness of delta networks for
medium and large scale multiprocessors.
M = 1.0
Performance
cost
c-30,
256
N-
1024
Fig. 14. Cost-effectiveness of N X N networks.
crossbar is not too different from that of a delta network. Thus,
the curves for bandwidth provide a good comparison between
networks.
Fig. 13 shows PA as a function of the request generation rate
m. The curve for the crossbar is the limiting value of PA as N
grows to infinity. Curves for N 2 32 are not distinguishable
with the scale used in that graph.
Finally, the graph of Fig. 14 is an indication of cost-effectiveness of delta networks. The cost of a N X N crossbar or
delta network is assumed to be proportional to the number of
gates required. The constant of proportionality should be the
same in both cases because the degree of integration, modularity, and wiring complexity in both cases is more or less the
same. For the N X N crossbar the ,iminimum number of gates
required is one per crosspoint per data line'. Depending on the
assumptions used on fan-in, fan-out, the complexity of the
decoder and the arbiter, one can estimate the gate complexity
of a crossbar anywhere from one gate to six gates per crosspoint. Let us assume the lowest cost figure of one gate per
crosspoint.
The cost of 2" X 2" delta network is estimated from the
Boolean equations of the 2 X 2 module of Fig. 10. The number
of gates in a 2 X 2 module is 23 gates for the control plus 6
gates per information line. Assuming the number of information lines to be large, the gates for control can be ignored.
Thus, the gate count of a 2" X 2" delta network is 6n2"n1 per
information line because the network has n2 n1 modules.
Thus, the cost of N X N networks are kN2 for crossbar and
3kNlog2N for delta (N = 2"), where k is the constant of proportionality. We have used these cost expressions in the computation of performance-cost ratio for Fig. 14; the ratio is that
of expected bandwidth over cost. Taking this ratio for a I X
1 crossbar as unity, the Y-axis of Fig. 14 represents the performance over cost relative to a 1 X I crossbar. The Y-axis may
also be interpreted as bandwidth per gate per information line.
Notice that delta network is more cost-effective for network
size N greater than 16. If the cost of the crossbar was assumed
as 2 or more gates per crosspoint, then the curve for the
crossbar would shift downward and the effectiveness of delta
becomes even more pronounced. However, if one assumed the
cycle time of a crossbar half as much as that of a delta network,
then the curve for crossbar would shift upwards relative to the
VIII. CONCLUDING REMARKS
We have presented in this paper a class of processor-memory
interconnection networks, called delta networks, which are
easy to control and design, and are very cost-effective. We also
presented the combinatorial analysis of delta networks and full
crossbars. It is seen that delta networks bridge the gap between
a single time-shared bus and a full crossbar. The cost of an N
X N delta network varies as Nlog2N, while that of crossbar
varies as N2. Thus, delta networks are very suitable for relatively low cost multimicroprocessor systems.
REFERENCES,
[1] M. J. Flynn, "Very high speed computing systems," Proc. IEEE, vol.
54, pp. 1901-1909, Dec. 1966.
[2] L. R. Goke and G. J. Lipovski, "Banyan networks for partitioning
multiprocessor systems," in Proc. Ist Annu. Symp. Comput. Arch., Dec.
1973, pp. 21-28.
[3] D. H. Lawrie, "Access and alignment of data in an array processor,"
IEEE Trans. Comput., vol. C-24, pp. 1145-1155, Dec. 1975.
[4] M. C. Pease, "The indirect binary n-cube microprocessor array," IEEE
Trans. Comput., vol. C-26, pp. 458-473, May 1977.
[5] H. J. Siegel, "Study of multistage SIMD interconnection networks,"
in Proc. 5th Annu. Symp. Comput. Arch., Apr. 1978, pp. 223-229.
[6] H. S. Stone, "Parallel processing with the perfect shuffle," IEEE Trans.
Comput., vol. C-20, pp. 153-161, Feb. 1971.
[7] D. Y. Chang, D. J. Kuck, and D. H. Lawrie, "On the effective bandwidth
of parallel memories," IEEE Trans. Comput., vol. C-26, pp. 480-490,
May 1977.
[8] W. D. Strecker, "Analysis of the instruction execution rate in certain
computer structures," Ph.D. dissertation, Carnegie-Mellon Univ.,
Pittsburgh, PA, 1970.
[9] F. Baskett and A. J. Smith, "Interference in multiprocessor computer
systems with interleaved memory," Commun. Ass. Comput. Mach., vol.
19, pp. 327-334, June 1976.
[10] D. P. Bhandarkar, "Analysis of memory interference in multiprocessors,"
IEEE Trans. Comput., vol. C-24, pp. 897-908, Sept. 1975.
Janak H. Patel (S'73-M'76) was born in Bhavnagar, India. He received the B.Sc. degree in physics
from Gujarat University, India, the B. Tech. degree from the Indian Institute of Technology, Madras, India, and the M.S. and Ph.D. degrees from
Stanford University, Stanford, CA, all three in
electrical engineering.
From 1976 to 1979 he was an Assistant Professor of Electrical Engineering at Purdue University, West Lafayette, IN. Since 1980 he has been
with the University of Illinois at Urbana-Champaign, where he is currently an Assistant Professor of Electrical Engineering and a Research Assistant Professor with the Coordinated Science Laboratory. He is currently engaged in research and teaching in the areas of
computer architecture, VLSI, and fault-tolerant systems.
Dr. Patel is a member of the Association for Computing Machinery.
Download