The Turn Model CHRISTOPHER for Adaptive J. GLASS J!flchgatz State Uni{er.~i[y, Abstract. paper presents model is th~t This feature of networks the (although [In analyzlng can form. on the k-ary the directions two most prohibited fidaptwe, Adaptive k-my network routing ~t-cubes, rithms show which on the pattern perform better algorlthm with The traffic. wormhole three for latcncies nonuniform of the turns traffic, adaptive the that meshes nonadaptive and must routing be to be n-dimensional routing throughput routing turns focuses turns allow sustainable the paper meshes, and and highest This of two-dimensional of adaptive that direct is based algorithms ~z-dimensional a quarter quarters Simulations routing adaptive. unique to the model and the cycles routing, just A channels Instead, and highly networks, are described For nonadaptive for such remaining hypercubes. channels). m a network or nonminimal, In algorithms, or virtual all of the cycles produces topologies channels. routing physical extra can turn has the lowest of message than NI wormhole cm adding to brezk algorithms and designing packets mmimal deadlock. for based to networks free, extra to prevent meshes is not turns common without M. Mtchigan enough livelock LIONEL a model lt in which just free, /z-cubes East Lansuzg, it can be applied Prohibiting are deadlock AND Routing algorithms algo- depends generally ones. Ctitegorms and SubJect Descriptors: C 1.2 [Processor Architectures]: Multiple Data Stream Arch itectures—uiter c[,nnectim arc)zitectures: parallel processors: C,2. 1 [Computer-~ ommunication izet~dx: ~zetwork comnunicat~ons; Networks]: Network Architecture and I)csign-&fnlwtcd tz~,ttvork topolom: Ciencral packet Terms: Additional Kcy computers, medl rzehvorks Algorithms, Design, Words Phrases. md topology, Performance wormhole Adaptive routing, de~dlock, hypcrcubc, massively parallel routing 1. Introdzlction Direct networks have become a popular interconnection architecture for consuch as multicomputers and scalable structing large-scale multiprocessors, shared-memoly multiprocessors. Direct networks offer massive parallelism [NCUBE Company 1990; Intel Corporation 1991; Borkar et al. 1990; Agatwal et al. 1990] and are more This work in part &f II’ 93-04066. was szrpportcd Symposium A shorter on Computer Authors’ address: Michigm State Permission of the University. or distributed publication Association Computer East fee M its date for Computing d (h. ,A..ocI.lt)m I for or part appear, Machinery. Compul~ng MI M.wh~nay, Laboratory, 48 S24-1027. (NSF) at the other grants 19th Department c-Mail: of this material commercial specific permission. G 19!)4 ACM 00(J4-541 1/94/0900-0874 J<wrn, tl Foundfltlon was presented Systems Lonsing, for dmect and Science of Lhls paper et al, 1988; Lenoski than approaches ECS 88-14027 Annual and International Architecture. Advanced to copy without not made by National vclwon scalable and advantage, notjce To copy {glass, IS granted provided the ACM is gwen otherswse, that of Computer copyright copying 41, N“ 5, S.ptcmlxr 1YY4, pp that notice the copies 874–[ 102 requires are and the title is by permission or to republish, $03.50 V<>l Science, ni}<~cps.msu.edu. of the a fce and/or The Turn Model to for Adaptile multiprocessor organized local directly defined each with interconnection. as an ensemble memory, Routing and other 875 A of nodes, supportive system based where each devices. The on node a direct has its own network connects neighbors through for forwarding. the direct local which is processor, each node to only a few other nodes, its neighbors. Which nodes are neighbors is by the topolog of the network. Neighboring nodes communicate with other by sending messages across the channels. A node a node that is not its neighbor by sending a message both network network, channels, connect To which handle the each node connect it to neighboring complexities often of has a router. it to local devices, routing The and communicates to one of its messages router network controls channels, routers. Wownhole routing [Dally and Seitz 1986] is becoming a popular switching technique in direct networks. Wormhole routing switches a message along a network by first dividing the message into packets and the packets into flow control digits or f?its. It then routes the flits in each packet through the network in a pipeline fashion. Header jlits contain all of the routing information for a packet and lead each packet through a router that has no suitable output the network. When channel available, the header flits reach all of the flits in the packet wait where they are for a suitable channel to become available. the attractions of wormhole routing is that each router requires only One of enough buffer space to store a few flits for each channel. Another attraction is that communication latencies are low. In the absence of contention, latencies are proportional to the sum of the packet length and the distance to travel [Dally 1990]. The majority of network topologies meshes and k-my n-cubes. particularly A two-dimensional mesh is used Corporation the Caltech 1991], the Intel MOSAIC; for wormhole routing are ~z-dirnensional low-dimensional meshes in the Intel Touchstone Paragon, a two-dimensional the Symult torus 2010 [Seitz and hypercubes. DELTA [Intel et al. 1988], and is used in the Fujitsu AP1OOO [Ishihata et al. 1991], a three-dimensional mesh is used in the MIT J-machine [Dally 1989], and a hypercube is used in the nCUBE-2 [NCUBE Company 1990]. Formally, an n-dimensional mesh has k. X kl X “”” X k,, - 1 nodes, k, nodes along each dimension i, where is identified by n coordinates, each dimension all i except i. Two one, j, nodes where k, > 2. Each node X in an n-dimensional mesh (.xn, 1-1. . . . . 1-,1_ 1) , where O<xi<k, –l for X and Y are neighbors if and only if x, = y, for yj = x, + 1. Thus, neighbors, depending on their location 1990], all nodes have the same number nodes in the mesh. of neighbors. have from n to 2n In a k-ary n-cube [Dally The definition of a k-ary n-cube differs from that of an n-dimensional mesh in that the k,’s are all equal to k and in that two nodes X and Y are neighbors if and only if xl = J’, for all i except one, j, where y, = ( XJ & 1)mod k. The change to modular arithmetic in the definition adds wraparound channels to the k-ary mcube, giving it symmetry. As a result, each node in a k-ary n-cube has n neighbors if k = 2 and 2n neighbors if k > 2. Another topology with symmetry is a hypercube, which is a n-cubes. A hypercube is special case of both }z-dimensional meshes and k-a-y an n-dimensional mesh in which k, = 2 for all i, or a 2-ary n-cube. Meshes and k-ary n-cubes are popular, in part, because their regular topologies simplify the routing of messages. In general, low-dimensional meshes are preferred because they have low, fixed node degrees, which makes them 876 C. J. GLASS more scalable dimensional per bisection result than in lower 1990]. In high-dimensional meshes density communication addition, topological meshes also have fewer and have lower their dimensions latencies function being sions. High-dimensional lower diameters, which channels contention k-ary L. M. n-cubes. NI Low- and higher channel bandwidth and blocking ]atencies, which and higher better easier and AND hot-spot matches their to implement throughputs form, two in the three meshes and k-ary n-cubes, on the shortens path lengths. High-dimensional [Dally and physical three dimen- other hand, topologies have also have more paths between pairs of nodes, which permits more fault tolerance. Finally, the symmetry of k-ary n-cubes can make it easier to spread message traffic more evenly. The sequence of channels a message packet traverses on its way from its source the node to its destination routers of the have three features: direct node network low communication ease of implementation in VLSI. provide packet transmission that Factors contributing buffers, switches, high throughput from Iivelock, routing packets some by different by the routing A good latency, high to ease of implementation and control along short definition circuits. authors). Factors are few fault tolerance. they have been occurs when a packet Deadlock and that should throughput, and algorithm should and practicality. and contributing because algorithm algorithm network small channels. to low latency deadlock, freedom from traffic evenly, routing paths, (partly routing In other words, a routing is high in quality, quantity, are freedom from spreading packet require is determined execute. and starvation, freedom packets adaptively, Many of these terms used in different waits for ways an event that cannot happen. For example, a packet may wait for a network resource to be released by a packet that is, in turn, waiting for the first packet to release some resource. In wormhole routing, such a circular waif condition causes deadlock, because a packet holds resources from acquiring the held resources. when a packet a packet packets may waits wait are always for an event forever to competing while waiting and Starvation is similar that can happen acquire but never a network successfully. excludes other packets to deadlock, but occurs Starvation does. For example, resource for which is only possible other when the allocation of network resources is unfair. Both deadlock and starvation can stop a packet from being injected into a network or from moving once it is in the network. In contrast, Iivelock does not stop the movement of a packet, but rather stops its progress toward the destination. routing of a packet never leads it to its destination. when routing is adaptile (not along a predetermined Lil’e/ock Livelock path) occurs when the is possible only and nonrninimal (away from, or equidistant to, the destination at times). Minimal routing, on the other hand, restricts packets to shortest paths. Although minimal routing may initially sound more promising, nonminimal routing provides better fault toler- ance. Of the factors contributing to good routing algorithms, the most important one is freedom from deadlock. Deadlock can keep many or all packets from reaching their destinations and occurs readily unless a routing algorithm contains preventive measures. Starvation and livelock are less likely to occur and are generally easier to prevent. Adaptiveness is also an important factor, because it contributes to several of the other factors [Dally and Aoki 1990]. It provides alternative paths for packets that encounter unfair channel allocation, Paulty hardware, or hot spots in traffic patterns. The adaptiveness of a routing The Tum Model algorithm falls the into algorithm nonadaptive McKinley choose a path every a partially node a packet from from on the number packets, toward algorithm its source A partially multiple leading adaptive depending forwarding is predetermined. 1993] can choose from Thus, when or the of nodes number can route packets along. A nonadaptive node for each packet to be forwarded. routes that 877 categories, from the algorithm of only one algorithm along Routing one of three can choose shortest paths has a choice node for AduptiL’e nodes to its destination adaptil’e algorithm for some packets, the destination cannot node route node packets of algorithm Thus, a [Ni and but cannot for every packet. along every shortest path in a network topology. A fidly adaptil]e node leading toward the destination node algorithm can choose from every for every packet. Thus, a fully adaptive topology. every algorithm can route packets along shortest path in a network The drawback to previous methods of designing routing algorithms is that they either sacrifice adaptiveness for deadlock freedom or achieve adaptiveness and deadlock freedom at the expense of adding physical or virtual channels. The popular method of dimension ordering is based on ordering the dimensions in a network Applied and routing packets to two-dimensional Corporation along meshes, the dimensions it produces 1991; Seitz et al. 1988]; route strictly the .x-yrouting a packet first along in that order. algorithm [Intel the x dimension (dimension hypercubes, 0) and 1977]; route and higher a packet first along the lowest dimension and then along higher dimensions. Ordering the dimensions ensures that the xy and e-cube then it produces algorithms avoid along the the e-cube deadlock, y dimension (dimension routi?zg algoiith?n [Sullivan but it also ensures 1). Applied to and Bashkow nonadaptiveness. Another popular method of designing routing algorithms that are deadlock free, and possibly adaptive, is to add Lirtual channels to a network. Dally and Seitz [1987] introduced the method for nonadaptive routing, and several researchers [Yantchev have and Jesshope extended 1989; Dally it to adaptive and Aoki routing. 1990: Linder Adding a virtual and Harden channel 1991] to a physical channel is less expensive It involves adding buffer than adding a new physical channel, but it is not free. space and control logic to the two routers at the ends of the physical channel channel and routers. It so that the virtual channels also reduces the bandwidths already sharing the physical channel. An can share the physical of the virtual channels advantage of adding virtual or physical channels, however, is that it can produce routing algorithms that have a high degree of adaptiveness. For example, Dally [1988], Linder and Harden [1991], and Glass and Ni [1992] describe how two-dimensional mesh in order to produce This paper presents a model for designing are deadlock free, for networks. A unique physical or virtual livelock free, feature channels minimal add or nonminimal, of the model to networks to virtual a fully adaptive wormhole-routing is that (although channels to a routing algorithm. algorithms that and highly it is not based it can be applied adaptive on adding to networks with extra channels). Instead, the model is based on analyzing the directions in which packets can turn in a network and the cycles that the turns can form. Section 2 presents the model in general terms. Sections 3,4, 5, and 6 apply the model to two-dimensional meshes, n-dimensional meshes, k-ary n-cubes, and hypercubes without virtual channels, producing a variety of partially adaptive routing algorithms. (Without adding channels to meshes and k-ary n-cubes, it is impossible to produce routing algorithms that are deadlock free and fully C. J. GLASS AND 878 adaptive [Ni rithms. and as well Sections 2. The McKinley 1993]. as previously 7 and 8. Section of the nonadaptive 9 concludes partially algorithms, M. adaptive NI algo- are described in the paper. Model TLU71 Deadlock Simulations known, L. in wormhole routing is caused by message packets waiting on each other in a cycle. One way that deadlock can occur in a two-dimensional mesh is shown in Figure 1. Four packets traveling in different directions try to turn left and wind up in a circular wait. If only one of the packets had not turned, deadlock algorithm turns. the would might have been avoided. This possibility suggests that a routing prevent deadlock in a direct network by prohibiting certain The routing many algorithm possible would cycles in algorithm would have to leave algorithm should not prohibit wormhole adaptive for a direct directions in which routing network, a path more tiveness would be reduced. The preceding observations designing have to prohibit the between turns suggest At every pair than that the time, the Finally, the otherwise, solution are in each of same of nodes. necessary: a new algorithms at least one turn however, to deadlock the its adapproblem free and of highly network. the turn model. The model involves analyzing packets can turn in the network and the cycles that the the turns can form. It prohibits just enough of the turns to break all of the cycles. Routing algorithms that employ the remaining turns are deadlock free, livelock free, minimal or nonminimal, and highly adaptive for the network. The steps of the turn term, model are given channel, indicates (1) Partition directions topological in more either detail below. a physical Unless or virtual specified otherwise, the channels in the direct network into sets according to the in which they route packets. If each node has u channels in a direction, l’irtua[ treat these channels as being in L’ distinct L’ distinct sets accordingly. directions and divide them into wraparound channels in a separate set to be incorporated during (2) Identify the possible turns from one virtual direction to another, O-degree and 180-degree turns. A (1-degree turn are multiple channels in a topological from one set of channels to another, topological (3) Identify simplest the channel. direction, but different is only possible Put any Step (5). omitting when there direction. It represents a transition where the two sets are in the same virtual directions. the cycles that the turns can form. Generally, cycles in each plane of the topology is adequate. identifying the (4) Prohibit the minimum number of turns so that at least one turn is prohibited in each cycle. The turns must be chosen carefully in order to break every possible cycle, including very complex cycles. A useful approach is first to break the cycles in each plane and then check whether doing so allows more complex cycles. (5) Incorporate as many turns as possible channels, without reintroducing cycles. wraparound (6) Incorporate reintroducing Routing channel can always as many cyc~es. algorithms that O-degree route in Step (1) and use only the turns involving At least the set of wraparound one turn involving each be incorporated. and 180-degree packets from along turns as possible, the sets of channels one set to another allowed without identified by Steps (4), The Tum Model for Adaptile Routing 879 K packet3 n — It ..’’’”” II •1 ., “O n ~ packet 4 * ❑☞ “’’...,.❑ I ~~::i:election packet 0.,m J- buffer t packet 2 . ~ fl, u progression ................ + packet awaiting resource H /“” ❑ i >,, II packet 1 -0 T FIG.1. Awormhole deadlock invoicing four routers and four packets inatwo-dimensional (5), and (6) are deadlock free, livelock free, minimal or nonminimal, adaptive for the network. They are deadlock free because breaking cycles involving channel turns dependency algorithms are check the turn that channel the free model are means each (or and Seitz in later applied that algorithm increasing) to another prevents 1982]. Proofs presented has been graph so that decreasing one set (direction) [Dally deadlock dependency network strictly from graph correctly. that it is possible to number packet order. The cycles in the possibility routing but Preventing routes every and highly all of the particular sections, mesh. only as a cycles in the the channels along of such channels in in a numbering, together with the fact that a network contains a finite number of channels, means that a packet reaches its destination after a limited number of hops. Thus, the routing algorithms are livelock free. They are also inherently nonminimal, only but when can easily they lead be made toward minimal the by modifying destination. them Nonminimal to use channels routing, more adaptive and fault tolerant. Finally, the routing algorithms adaptive for the network because the model prohibits the minimum turns from dependency The turn one direction to another graph. model is applicable necessary to direct to prevent networks that however, is are highly number of cycles in the channel have directions associ- ated with their channels, which is most existing and feasible networks. The model produces partially adaptive routing algorithms for such common topologies as mesh-connected, k-ary n-cube, hexagonal, octagonal, and cubeconnected cycle networks. Adding extra virtual or physical channels to the topologies allows the model to produce fully adaptive routing algorithms. The following sections apply the model to a variety of meshes and k-ary n-cubes without extra channels and should make the model much clearer. C. J. GLASS 880 3. Two-Dimensional For The directions meshes, meshes lengths of –.x, L. M. NI of n- Meshes two-dimensional dimensional AND the +.x, the terminology can be simplified. dimensions, –y, and k. +? used Dimensions and become k,, in become west, the O and m definition 1 become and east, south, x and n. Finally, y. the and north. The four directions permit eight 90-degree turns: left and right turns when traveling west, east, south, and north. The eight turns form two simple cycles, as shown in Figure 2. The xy routing algorithm shown in Figure 3. Prohibiting these four turns four remaining turns also turns do not cannot permit vented by prohibiting need to be prohibited, routing. Figure a cycle. four of the turns, as deadlock because the Unfortunately, any adaptiveness in routing. the four Deadlock any The two turns three left does turns not prevent allowed deadlock, in Figure impossible to route Of the cycles, only into most 16 different 12 prevent packets to the node ways to prohibit deadlock in the northwest one turn however, as 4(a) are equivalent to the prohibited right turn in Figure 4(b), and the three right Figure 4(b) are equivalent to the prohibited left turn in Figure cycles still exist and deadlock is possible, as shown in Figure two turns as in Figure 4 also has another serious drawback. turns that are a combination of the directions north and west, mesh. remaining can be pre- fewer than four turns, however. In fact, only two turns one from each cycle, allowing for some adaptiveness in Prohibiting 4 illustrates, form prohibits prevents turns allowed in 4(a). Thus, both 4(c). Prohibiting It prohibits all which makes it corner of the in each of the two simple and only 3 are unique if symmetries are taken 5(a) shows one way to account. 3.1. THE WEST-FIRST ROUTING Figure ALGORITHM. prohibit two 90-degree turns in a two-dimensional Because the mesh has no wraparound channels step of the turn model is to incorporate as many without reintroducing can be so incorporated: mesh and break every cycle. and O-degree turns, the last 180-degree turns as possible cycles. Of the four possible 180-degree turns. only two the turn from west to east and either the turn from south to north or the turn from north to south. Because all turns to the west are prohibited, a packet must start out in that direction in order to travel west. This requirement suggests the wext-jirst routing algorithm: route a packet first west, if necessary, and then adaptively south, east, and north. Examples of paths allowed by the west-first algorithm are shown in Figure 5(a). In this figure, black squares represent nodes, and gray bars indicate blocked channels requiring that packets wait (dashed lines) or take alternative paths (solid lines ). Note that both minimal and nonminimal paths are shown. Proof that the west-first routing algorithm is deadlock free is based on the work of Dally and Seitz [1987], who show that a routing algorithm is deadlock free if the channels in the direct network can be numbered so that the algorithm routes every packet along channels with strictly decreasing (or increasing) numbers. Because the west-first algorithm routes packets first west and then in the other directions, the proof assigns lower numbers to westward channels the farther west they are, and assigns still lower numbers to eastward, northward, and southward channels the farther east they are. THEOREM 1. The west-first routing algorithm is deadlock free. The Tum Model F , , --l Ill ri FIG. 4. Six turns that complete cycles and allow deadlock. Ill the ~y 7 t.(a) (b) r, where particular illustrates shows in a m X n mesh greater of 3m – 2 and every packet four possible used to route number a. b in 6 shows the along channels with strictly decreasing numbers, it is 3.2. THE cases. Examination a packet out shows of a node NORTI way to prohibit I-LAST ROUTING two 90-degree turns cast or the turn from east to west. Because all turns when traveling north travel north when that is the last direction in each case, the numbers than the Tile tlorth-last The proof is similar routing to that 5(b) shows another mesh and break every only two 180-degree turns to be and either the turn from west to are prohibited, a packet should only it needs to travel. This requirement suggests the north-lint routing algorithm: route south, and east, and then north. Examples north-last algorithm are shown in Figure 5(b). 2. Figure ALGORITHM. in a two-dimensional cycle. The remaining 90-degree turns allow incorporated: the turn from south to north THEOREM that, have lower ❑ channel. PROOF. a two-digit n – 1. Figure to show that, for each channel into an arbitrary node, the algorithm route the packet out along a channel with a lower number. Figure 8 the channels ~ is the numbers to assign to the channels leaving node (x, y). Figure 7 algothe numbering of a 4 x 4 mesh. To show that the west-first routes sufficient can only (c) Assim-–– each channel PROOF. input The four turns allowed by the w routing algorithm. rl ~ ....-1 1= I_J rithm FIG. 3. +--’ +--m base 881 Fm. 2. The possible turns and simple cycles in a two-dimensional mesh. L-l ,.. 1 t--- 4--, , ,..~ Routing r~ 1--1 L-J ~ for Adaptile algorithm a packet first of the paths is deadlock for Theorem adaptively west, allowed by the ji-ee. 1. Rotate Figures 6 and 7 counterclockwise 90 degrees, and reverse the directions of the channels. The figures now show that the north-last algorithm routes eve~ packet along ❑ channels with strictly increasing numbers. Figure 5(c) shows a third 3.3. THE NEGATIVE-FIRST ROUTING ALGORITHM. way to prohibit two 90-degree turns in a two-dimensional mesh and break every 882 C. J. GLASS ,.. m 7 L-l ■ ■ ■ ■ mmmm’um ❑ Eq!4m ■ ■ AND L. M. NI ;.m 1 ❑ . (b) r 4 --, L.-j r ..-. Li (c) Fici. 5. The SIX turns allowed and ewimplcs of paths In an 8 X 8 mesh of’ the three addptlve routing algorlthrns. (a) The west-first routing algorithm. (b) The north-least routing algorlthm. (c) The nqytwe-first routing algorithm. cycle. The remaining 90-degree turns allow only two 180-degree turns to be incorporated: the turn from west to east and the turn from south to north. Because all turns from a positive direction to a negative direction are prohibited, a packet must start out in a negative direction in order to travel in a negative direction. This requirement suggests the negatiL’e-first routing algorithm: Route a packet first adaptively west and south, and then adaptively east The Turn Model for AdaptiLe 883 Routing 2m-2-2x,n-2-y t 2m-2+x,0 %Y FIG. 6. Numbermg of the channels leaving node (.Y, y) for the west-first routing algorithm. 2m-3-2x,0 each v 2m-2-2x,y-l 6,2 8,0 9,0 5,0 3,0 1,0 4,2 6,0 6,1 4,0 2,2 2,0 7,0 8,0 9,0 5,0 3,0 1,0 6,1 6,0 FIG. 7. 7,0 4,1 4,1 2,1 2,1 7,0 8,0 9,0 5,0 3,0 1,0 4,0 6,2 4,2 2,0 2,2 7,0 8,0 9,0 5,0 3,0 1,0 Numbering of a 4 x 4 mesh for the west-first 0,2 0,0 0,1 0,1 0,0 0,2 routing algorithm and north. Yantchev and Jesshope [1989] propose a similar algorithm, but only for minimal routing. The negative-first algorithm can be either minimal or nonminimal; the nonminimal version, which allows 180-degree turns, being more adaptive negative-first THEOREM and fault algorithm 3. tolerant. Examples are shown The negatiue-jii-st in Figure routing The proof is a special in Theorem 6. ❑ algorithm case of PROOF. meshes of the paths allowed by the 5(c). the is deadlock one given free. for n-dimensional How adaptive are these partially adaptive 3.4. DEGREES OF ADAPTIVENESS. routing algorithms? Let S.lgO, ~th~ be the number of shortest paths an algorithm allows from source node (s,, Sy) to destination node ( dx, dY ). Also, let f be a fully algorithms, adaptive Ax algorithm, p = Idx – SJ and Ay (Ax+ S,= f one of the three Ay)! Ax!Ay! (Ax s west-first be = Id} – S}l. Then, + Ay)! if = [ b, Ax!Ay! dz > Sx ‘ otherwise partially adaptive 884 C. J. GLASS 2m-2-2x,n-2-y AND L. M. NI 2m-2-2x,y 2m-’-20+03-2x2 &-3-2x’0 + t FIG, 8. For e~ch input channel, the outrxtt channels Wowed bv the westf]rs; routing algorithm have lower numbers. (a) Input from west (b) Input from north. (c) Input from ext. (d) Input from south. 2m-2-2x,y-l 2m-2-2x,y-l (b) (a) 2m-2-2x,n-2-y 2m-2-2x,n-2-y 2m-l+x,O 2m-2+x,0 2m-3-2x,0 X,y %Y 2m-3-2x,0 + 2m-2-2x,n-l-y + 2m-2-2x,y-l (d) (c) 1 (A.r S,,<,,,,,.,,,,, = For the minimal algorithm the degree to otherwise \l, otherwise algorithms, the which d,, < s, ‘ \l, three ever, Sr = 1 for at least half of if Ax!Ay! routing is. For + Ay)! the larger partially S.l~O,.{,l,,H is, the adaptive routing of the source-destination a minimal routing pairs. algorithm more adaptive algorithms, Another is howmeasure adaptive is the SP/Sj ranges from O to 1, but averaged across all sourcel-atlo su/g(l?[t/t,}t /Sf. destination pairs, S,,/Sf > 1/2, which indicates a significant degree of adaptiveness. Finally, note that SP = 1 for a source-destination pair does not imply that algorithm p is nonadaptive for that pair. The partially adaptive algorithms all permit nonminimal nonminimal routing. In the case of the negative-first routing means that routing can be adaptive even when ~Y > SY or when d,, > s, and dY < SY. For an illustration tweness, see the bottom path in Figure 5(c). 4. n-Dimensional of this algorithm, d, s s, and added adap- Meshes In an n-dimensional mesh, each node has up to 2n input channels and Zn output channels. For each of the 2n directions, there are 2n – 2 possible 90-degree 90-degree turns to a different direction, making 4n(n – 1) total turns. These turns form two simple cycles in each of the n(lz – 1)/2 planes, making n( n – 1) total cycles of four turns. THEOREM 4. The minimlun number of 90-degree turns that must be prohibited to prel)ent deadlock in an n-dimensional mesil is n( n – 1), or a quarter of the possible tur?ls. The Turn Model for Adaptitle The PROOF. 4n(n Routing – 1) turns 885 in an n-dimensional mesh form n( n – 1) cycles of four turns each. At least one of the four turns in each cycle must prohibited in order to prevent these cycles. Therefore, prohibiting a quarter the turns is necessary to prevent deadlock. ❑ be of THEOREM 5. The minimum and maximum number of 180-degree turns that must be prohibited to prevent deadlock in an n-dimensional mesh is n, or one half of the possible turns. The PROOF. each 180-degree 180-degree turn, turns in an n-dimensional its reverse is always mesh come a different 180-degree in pairs: turn. for Both of the turns in a pair cannot be allowed, because they would create a cycle by themselves. Therefore, at least half of the 180-degree turns must be prohibited. If a 180-degree turn creates a cycle when allowed, some combination of the other allowed turns must form the equivalent of the reverse 180-degree turn. If both of the turns in a pair create cycles when allowed, the other allowed turns must form the equivalents of both of the turns, which implies that the other allowed turns can also form a cycle. This contradiction implies that at least one of the turns in 180-degree turns each must pair can be be prohibited. allowed. ❑ Therefore, at most Of the many routing algorithms formed by prohibiting per cycle and one 180-degree turn per dimension, three their simplicity. They are analogs of the west-first, routing algorithms for two-dimensional meshes. routing algorithm packet first is the adaptively in all-but-one-negatil~e-jirst the negative of one 90-degree are noteworthy the turn for north-last, and negative-first The analog of the west-first routing directions half of all algorithm: but one route a dimension (n – 1) and then adaptively in the other directions. The analog of the north-last routing algorithm is the all-but-one-positil!e-last routing algorithm: route a packet first adaptively in the negative directions and the positive direction of one dimension (0) and then negative-first routing adaptively algorithm in the other is also called directions. the negative-first The analog routing of the algorithm: route a packet first adaptively in the negative the positive directions. All of these algorithms directions and then adaptively in are deadlock free, but the only proof algorithm. presented THEOREM deadlock the channel leaving is for the negative-first The negatil)e-first routing algorithm for n-dimensional meshes is $-ee. PROOF. be 6. here sum Let K be the sum of the k, for an n-dimensional mesh, and let X each of the x, for any node (xo, X1, ..., x,, _z, Xn _ !). Number leaving a node in a positive a node in a negative direction direction K – n + X and each channel K – n – X. Then, if a packet enters a node when traveling in a negative direction, it enters along a channel numthe bered K – H – X – 1,which is less than both K – n – X and K – n +X, numbers of the channels leaving the node in the negative and positive directions. If a packet enters a node when traveling in a positive direction, it enters along a channel numbered K – n + X – 1, which is less than K – n + X, the number of the channels leaving the node in the positive directions. Therefore, the negative-first algorithm routes every packet along channels with strictly increasing numbers and is deadlock free. ❑ ~. J. GLASS AND 886 The negative-first one turn algorithm per cycle, Therefore, the turn prohibiting is deadlock from some free a positive quarter of the deadlock in an n-dimensional mesh. Combining 4, gives the following theorem, which supports produces highly THEOREM dimensional 7. adaptive routing Prohibiting 4.1. DEGREES turns M. of prohibiting to a negative is sufficient this observation the claim that NI just direction. to prevent with Theorem the turn model algorithms. sonte mesh is necessary aid as a result direction L. quarter sufficient OF ADAPTIVENESS. of the 4n( n – 1) to pre[ent As the number turns in an n- deadlock. of dimensions increases, the minimal, partially adaptive algorithms are more likely to be able to route message packets adaptively. SP = 1 less often, especially when the k, are large. But that averaged across all source-destination pairs, S,,/Sf the degree of adaptiveness relative to fully adaptive as iZ increases. As in the special case of two-dimensional adaptiveness can be increased by nonminimal routing. 5. k-sty > 1/2”1. indicating algorithms decreases meshes, the degree of n-cubes The adaptive routing wraparound channels algorithms for meshes can be extended of k-ary ~z-cubes. One way to extend them to use the is to allow a packet to be routed along a wraparound channel only on its first hop. To prove that the routing algorithm is still deadlock free, number the mesh channels as before and assign the wraparound channels a number that is either greater than or less than those of the mesh are routed along channels in strictly The negative-first routing algorithm another way: classify each wraparound channels, depending on whether decreasing or increasing order. can also be extended to k-ary channel as negative packets n-cubes or positive in accord- ing to how it routes packets relative to the mesh channels and then apply the negative-first algorithm. For example, a node at the east edge of the mesh channels has two channels to the west: a mesh channel to the node immediately to its west and a wraparound channel to a node at the west edge of the mesh channels. Either channel to the west can be used during the negative phase of routing a packet. The negative and positive phases in a 4-ary 2-cube are illustrated in Figure 9. The proof of deadlock freedom is, again, a simple modification of the proof for meshes, All of the preceding routing algorithms are strictly nonminimal. For k-ary /l-cubes with k > 4, it is impossible to design deadlock-free routing algorithms that are minimal, without adding extra channels [Ni and McKinley 1993]. To design a routing algorithm that is deadlock free, minimal, and fully adaptive, Linder and Harden [1991] add virtual channels to a k-ary n-cube. They add, however, enough virtual channels to partition the k-ary n-cube into 2“ -1 virtual networks with rz + 1 levels per virtual network and (n + 1)k” channels per level. That is, an average of ((~z + l)z/n)2’z z virtual channels per physical channel. Dividing the physical channels in a k-ary /z-cube into fewer virtual channels, however, is sufficient for designing routing algorithms that are deadlock free, minimal, and partially adaptive. A suitable design method is inspired by the one Dally and Seitz [1987] use for unidirectional k-ary divide each physical channel into two virtual channels n-cubes. Dally and Seitz and connect the virtual The Tum Model for Adaptive Routing 887 ~[”f~: ]~~] FIG. 9. The channels used during the two phases of the negative-first routing algorithm for 4-ary 2-cubes. (a) Negative phase. (b)witiv.phs LLA. .* (a) negative (b) positive phase channels 4 in each k-ary l-cube phase embedded in the k-ary n-cube into a spiral that circles the k-ary n-cube nearly twice. Unfolding the spirals converts the unidirectional k-ary n-cube into a unidirectional, n-dimensional mesh with 2 k – 1 nodes along each dimension. Each node of the k-ary n-cube appears in the mesh up to 2” times. Dimension-order routing in the deadlock. Applying the same unfolding scheme to a bidirectional yields a bidirectional, dimension. channel. The Applying are deadlock n-dimensional unfolding again the turn free, model minimal, mesh requires 2k – 1 nodes with just two virtual to the mesh produces and partially mesh prevents k-ary n-cube adaptive. routing The use in the k-ary n-cube are the ones that correspond nearest copy of its destination node in the mesh. along channels each per physical algorithms routing that algorithms to routing a packet to to the 6. Hypercubes Hypercubes n-cubes. special Proofs are a special Consequently, case the cases of the algorithms for that the routing algorithms proofs for the more The special expression, general case of the the p-cube of both adaptive n-dimensional routing algorithms n-dimensional are deadlock meshes for and k-ary hypercubes are meshes and k-ary free are corollaries n-cubes. of the has a particularly compact cases. negative-first routing algorithm algorithm. Let S be the binary address of the source node for a packet, C be the binary address of the node the header currently occupy, and D be the bina~ address of the destination node. flits The p-cube routing algorithm has two phases. In the case of minimal routing, the first phase routes the packet along any dimension, i, for which c1 = 1 and d, = O. When along there any dimension, is no such dimension, i, for which the second c, = O and phase d, = 1. These routes the packet steps are easily computed using bitwise logic operations, as shown in Figure 10. If nonminimal routing is desired, because of its increased adaptiveness and fault tolerance, the first phase can also route the packet along any dimension, c, = 1 and d, = 1. Then, the steps can be computed both of these algorithms, the only input transmitted is a unique constant for each router, and p depends header similar flits occupy in the router. Konstantinidou to p-cube, but only for minimal routing. i, for which as shown in Figure 11. In in the header flits is D. C on which input buffer the [1990] proposes an algorithm Let IX I represent the number of 1‘s in the 6.1. DEGREES OF ADAPTIVENESS. binary number X. The nu_mber of shorte~t paths from S to D, SP.<,,h,,, equals routing h, !}z, )!, where h, = I(S A D)l and hO = I(S A D)/. For a fully adaptive 888 C. J. GLASS AND L. M. NI Algorithm: Minimal p-cube routing algorithm. Input: Current address, C, and destination address, D. Procedure: I 1. H C = D, I 2. R= I 3. If R= 1 the packet route to the processor local and exit. CAD. 4. Route O,then R= the packet along any available FIG. 10. Algorithm: CAD The mnnmal Nonminimal Input: Current positil,e direction. p-cube routing p-cube address. channel, routing C; destination 1, for which algorithm r, = 1, for hypercubes, algorithm. address, D; and whether the last hop was in a p, Procedure: 1. If C = D, Ioute tile packet 2. l[p=l, R= theu 3. fZJseif CA D= to the lncal processor and exit. CAD, O,then R=CV (CAD). .1. Llse R = C, 5. Route the packet FIG, 11. along any availabl[> rhannel The nonminimal p-cube routing t, for which algor]thm r, = 1. for hypercubes. f, S~= h!, where algorithm, between S and D. s ~_cuA,,/Sf, which Overall, equals the p-cube h = h] + ho = I(S @ D)l, the other measure of the degree The l/[fl, routing Hamming distance of adaptiveness is ]. algorithm offers a choice of many especially when compared to the nonadaptive e-cube routing I illustrates this adaptiveness, for a binary 10-cube in 1011010100 sends a packet to the node shortest paths, algorithm. Table which the node 0010111001. In this example, h = 6, shows only one of the 36 possible ho = 3, /11 = 3, and SP.C,,~,, = 36. The table shortest shows paths. the For number each of choices node transmitting of output channel the packet, allowed however, by the routing algorithm. The number of choices in parentheses tional choices available with nonminimal routing. minimal indicates the table p-cube the addi- 7. The Simulator The most versatile way to study and compare routing algorithms is simulation. A routing algorithm cannot be simulated in isolation, however. It must be simulated for a particular direct network, input selection policy, output selection policy, and pattern of message traffic. Each of these factors affects the performance of the routing algorithm. This section describes the kinds of direct The Turn Model TABLE for AdaptiL’e I. Routing EXAMPLE OF ,4 PAW ALLOWED BY ‘ItI~ p-CuBE Rout 0011010000 1(+2) 3 6 5 phase 1 0010010000 0010110000 0010110001 2 1 0 3 phase ‘2 I 001011[001 I networks. as well channel selection experiments and their pattern is measured in terms latency and average sustainable that elapses for transmission The of two between of a network main network quantities: is considered source nodes the source node node Each tional physical a pair pair sustainable of neighboring Each of unidirectional width, 20 flits\ps. single flit. The ously transmit input operate all of the flits 7.3. CHANNEL routing, multiple SELECTION input the same available channel for USC. per unit of messages by one pair to its local channels in a router but the portion of queued at have is per of unidirecprocessor the same by band- has a buffer the synchronize to simultane- of a packet size of a in the network. During both nonadaptive and adaptive in a router may contain header flits waiting for channel. Which channel to use is decided Selection All asynchronously, next is decided by the input flit may have a choice of Input is connected channels. channel header 7.3.1. delivered available POLICIES. channels output available 100 in the simulations, is also connected in a worm, of u message theoretical throughput, transmitted flits. The the number limit, communication latency network studied in the simulations has one processor and one router routers router of routing algopolicy, and traffic the message some small direct that physical Each routers the actual the message of messages when does not exceed channels. in this paper. describes The making having is the number NETWORKS. Each 7.2. DIRECT treated as part of a multicomputer node. studied average throughput. It is measured as a percent of the maximum would be achieved if every channel continuously throughput their patterns The next section results. and the destination throughput time. which and traffic I ion The performance of each combination input selection policy, output selection 7.1. PERFORMANCE. rithm, direct network, is the time Dhase 2 I dmtiJlat is measured. AI.GORJTWM ING phase 2 I policies, as how performance simulation 889 by the Policies. input selection multiple output Input channel gets to use the output policy. During adaptive routing, a output channels. Which output selection selection policy. policies can be based on a variety of packet and router characteristics. Some of these are the length of the packet, the location of the source node, the location of the destination node, the location of the router, the distance travelled already. the distance to travel still, the time the message was sent, the time the packet arrived in the router, the number of output channels available to the packet, and the input channels chosen previously in the router. In this paper, seven policies are studied: random, no-turn, local first-come-first-served, global first-come-first-served, 890 C. distance -travelled, least-adaptive, one channels of the channel input directly of the input waiting across channels input randomly. the router waiting. channels The local and distance-least. randomly. from The policy GLASS policy channel, the no-turn policy policies three the input L. polic} chooses next chooses AND The random no-turn the output Otherwise, randomly. FCFS The J. input it is one chooses channel NI chooses the provided also M. one of the decide containing ties the packet that arrived in the router first. The global FCFS policy chooses the input channel containing the packet that was sent first. The distaizce-trtll’elled policy chooses the input channel containing farthest. The least-adaptiL1e policy chooses the packet that has travelled the input channel containing the the packet that has the fewest choices of output channels. It decides ties by using the distance-travelled policy. Using a policy that is more complex than the random policy of output is necessary channels upon because, which in low-dimensional packets often be waiting for the same number The distance-least policy is like the decides Each ties by using of these can wait of output channels, distance-travelled the least-adaptive input the random, no-turn, routing information. selection topologies, is small the number and many packets will making ties common. policy, except that it policy. policies has its advantages. An advantage of and local FCFS policies is that they require no additional The global FCFS requires that a global time stamp be passed along in the header flits of packets. Generating and passing the global time stamps necessitates establishing a global clock and adding many bits to the header flits. The distance-travelled, least-adaptive, and distance-least policies require of packets. attempt that the distance travelled so far be passed along in the header flits The distance takes up a few extra bits. All but the random policy to reduce local and global their latencies high latencies average FCFS from communication policies increasing included is that much in the average. latencies. giving more, The reasoning priority to older thereby decreasing Because the global behind packets FCFS the number policy the prevents of is based on more global information, it should perform better than local FCFS. The FCFS policies have an added benefit: they are fair and, therefore, free from starvation. The distance-travelled, least-adaptive, distance-least, and no-turn policies, on the other hand, are not reasoning behind the distance-travelled packets that have reserved the most possible, network, fair and are subject to starvation. The policy is that giving priority to the channels releases as many channels as as soon as possible, and limits the number of packets entering the thereby reducing contention and latencies. The least-adaptive policy tries to reduce contention by giving making it less likely that the remaining priority packets to the least adaptive will have to wait, The packet, no-turn policy tries to reduce contention by giving priority to the packet that does not have to turn. Why such a priority reduces contention will be discussed in connection with output selection policies. 7.3.2. Output Selection Policies. Output selection policies can be based on many of the same packet and router characteristics as input selection policies. Some of the more interesting characteristics are: which channels are available, which lead toward the destination and which away, the distance the packet must still this paper, the output travel in each direction, and the direction of the input channel. In three policies are studied: zigzag, xy, and no-turn. When none of channels allowed by the routing algorithm is available, the policies The TurnModelfo all choose channels rAdaptile the first one is available, Routing 891 to become the policies available. choose When the only available one of the channel. output When more than one of the output channels are available, the zigzag policy chooses the output channel in the direction the packet must still travel the farthest. More precisely, if the current node is C and the destination node is D, the zigzag policy chooses a channel in the dimension, i, with the largest value Id, – c, 1. The The ay policy chooses the output channel for the lowest dimension. no-turn policy chooses the output channel in the same direction as the input channel. Each of these output selection in a different reasoning follow way. The a diagonal, staircase policies tries to improve behind path to network the zigzag policy the destination performance is that node preferring is most to likely to preserve adaptiveness along the way, thereby reducing latency. The disadvantage of this policy is that it tends to concentrate traffic near the center of a mesh, although adaptiveness mitigates the concentration attempts to overcome this disadvantage the xy routing algorithm, which spreads channels of a mesh. produce channel plus signs (“ +”) Simulations utilizations represent some. The by preferring the channels uniform traffic very evenly of the .V routing algorithm in tenths. The main thing channels across for uniform the traffic like those shown in Figure 12. In the figure, the the nodes of a 10 X 10 mesh, and the numbers surrounding each plus sign indicate the utilizations of the output the node. Each number is the fraction of the time that the channel the eastward xy policy chosen by to note in Figure in a column 12 is that are very nearly channels of is reserved, the utilizations equal. The for all of same is true of all of the westward channels in a column, all of the northward channels in a row, and all of the southward channels in a row. The packets that cross a particular column or row of channels could not across the channels than the xy routing algorithm be spread much more evenly spreads them. The reasoning behind 13. Each onto the no-turn policy a row or column, along that row is illustrated it blocks or column. in Figure other In Figure packets traveling 13, the zigzag path time a packet turns in the same direction (solid arrows on the upper left) blocks many northbound and eastbound packets (dashed arrows), while the xy path (solid arrows on the lower right) blocks fewer. The impact is greatest heaviest. when turning onto the middle of a row or column, where traffic is Turns are not all bad, however. Their benefit is that they increase adaptiveness. Therefore, what should be avoided is what the no-turn policy tries to do. are unnecessary turns, which PATTERNS. In the simulated multicomputers, the processors 7.4. TRAFFIC generate messages at time intervals chosen from a negative exponential distribution. Each message has an equal probability of being one packet of 10 or 200 flits. Messages queued at the are immediately What probably that are blocked source processor. from immediately Messages arriving entering the network are at a destination processor consumed. affects the performance of the routing algorithms more than any of the preceding factors is the pattern of message traffic, that is. the destinations to which messages are sent. In this paper, six traffic patterns are studied: uniform, matrix-transpose, matrix-rotate, merge-south, reverse-flip, and address-rotate. Some apply only to two-dimensional meshes, and some apply only to hypercubes. The uniform pattern sends each message to any of C. GIJWS AND 1+ L. +2 1 1+3 2+4 3+1 4+4 4+4 4+3 1+2 3+1 1 1 1 1 1 1 1 1 1 1 0 1+3 ~ 1 2+3 1 3+4 1 1 1 0 1 1 4+3 ~ 3+2 2 1+1 ~ 4+3 2 2 2 3+1 ,2 1 1 1 ‘j ‘J 1 1 1 1+3 2+4 3+1 -1+4 :3+1 :) 2 3 2 4+2 ~ 2 +2 ~ 2 +2 3 2 +’2 1+3 2 2 +-t 4+1 3 3 4+ 4+4 32 ~ 3+4 3 3 ~ 2 4+-L :3 2 4+3 -f+z 2 .3 4+2 3 3+1 3 3 3 3 3 3 3 ~+~ 2 3+1 :3 :3 3 3 :+ 3 3 3 3 32 3+4 1+3 2+4 3+4 -1+4 4+4 4+:3 3 3 2 2 2 3 3 3 ;3 3 :3 3 33 1+3 2 :3+J 2 3+1 2 2 :3 -~+-t 2 ~ 4+3 2 3 2+4 2 ~ 1+3 2+4 3+4 4+4 4+4 2 ,2 1 1 1 1 1+3 ~ 2 2 22 :3+4 3+2 22 2 2 ,2 :3+1 1 1 2 1+2 2 3+1 1 1 1 1 1 -!+4 0 4+4 1 1 4+3 1 1 1 1 1 I 1 1 1 1 1+3 2+3 3+4 4+4 1+3 4+3 4+2 :3+1 FIG. 13. 1+ :3+’2 1 in a 10 ~ 10meshfor ~+ ,2 1+3 utilization 3 3 3 1 Channel 2+ :~+1 2 2+4 +2 3 y 3 3 +1 ~+ 4+3 1+3 3 2+1 3 2 :3+1 3 3 2 +2 ~ :3 :3 2 4+4 :3 3 :3 ~ 3 3 +’2 +2 2 1 2+ 4+4 3 N[ g 3 3 M. 1+ 3 ~ :3 +2 FIG. 12. J. 2+ 1 ~ 2+ 1+ uniform traffic and the .V routingalgorlthm. ❑ mmm ❑ ❑ --f-p Turns can increase contention. :J;P : :~F~ the other processors with equal probability. In a k x k mesh, the nzatrixtranspose pattern sends each message from the processor at (x, y) to the one at (k – 1 – y, k – 1 – x). In a binary rived by mesh are dictated binary (x,,, k x1>x2, X 8-cube, a matrix-transpose pattern is demapping a 16 X 16 mesh to the hypercube so that neighbors in the neighbors in the hypercube. Messages are then sent to the processors by the matrix transpose in the mesh. The resulting pattern in the 8-cube sends each message from the processor at -~3>x4, k mesh, the x5>x6!x7 ) matrix-rotate to the pattern one at (IA, X5, XK, X7, sends each message ZO, XI,.YZ, from X3). ln the processor a The Turn Model for Adaptive Routing 893 at (x, y) to the one at (k – 1 – y, x). In a k x k mesh, the merge-south reduction pattern sends each message from the processor or tree at (x, y) to the one at (x/2 + (y mod 2) * (k\2), y/2). The merge-south pattern might occur if all the children in a binary tree had to send messages southward to their parents. In a binary 8-cube, the reL1erse-jZip pattern sends each message from the processor at (.to, xl, .Y2, x3, ~4, x5, ~6>~7 ) to the one at (2T, .i~, ~j, Zl, ~j, 2Z, il, .iO). In a binary 8-cube, message from the the address-rotate processor at pattern [Konstantinidou 1990] (xO, xl, xj, x3, X4, X5, xfj, X7 ) to sends the each one at (x,> x,, x,, x,, x“, -x,, x,, X3). 8. Sinmlation Experiments Numerous simulations algorithms produced have been by the turn conducted model to study the and the nonadaptive adaptive routing produced by dimension ordering for different direct networks, and output selection policies, and different traffic patterns. 8.1. INPUT no-turn, SELECTION local FCFS, The first POLICIES. global FCFS, cies. Each policy is simulated in north-last, and negative-first routing tion policy is zigzag, the routing simulations input is minimal, and the message throughputs, distance-travelled policy global FCFS policy produces the second no-turn, and local FCFS policies produce produces suggests requires the and local information. poli- the is uniform. (Figure 14). At same. At high lowest latencies, the lowest latencies, and the random, the highest latencies. Compared to and local FCFS policies, the distance-travelled sustainable throughput by 15%. The most surprising of these results is that policies do not perform much better than the no-turn, routing the random, selection traffic each routing algorithm are fairly consistent the input selection policies perform the the random, no-turn, increases the maximum input a 10 x 10 mesh for the W, west-first, algorithms. In all cases, the output selec- The results for low throughputs, the different compare and distance-travelled routing algorithms policy the local FCFS and no-turn random policy. The random, FCFS policies are the three that do not require additional That they perform the worst may not be a coincidence and that improving performance by changing input selection policies changing to a policy that is based on additional information carried in header flits. outperforms reducing The the FCFS contention other main policies is more result is that the at high throughputs. important than distance-travelled This result maintaining policy suggests fairness. In that the simulations of the distance-travelled policy, starvation, though theoretically possible, is not a problem. The next simulations compare the distance-travelled, least-adaptive, and distance-least input selection policies [Glass 1992]. These three policies differ only in whether they emphasize distance or adaptiveness and in how they decide ties. The results for each routing algorithm are very consistent. At all throughputs, These the three input suggest that results distance-travelled marginal value, selection policies augmenting policy, to give priority at least for two-dimensional have ~early the same latencies. a successful policy, to the least adaptive packet is of meshes. For higher-dimensional such as the meshes, the adaptiveness of packets has a wider range of values, and the added discriminating power of the least-adaptive and distance-least policies may make them more useful. 894 C. 40 r’~- J. GLASS AND L. M. NI ~o ~ o~ ;~ o .5 Pcr~ent 10 of max o 1.5 25 30 20 throretlcal throughput .5 Percent 10 of max (b) (a) 10 I 35 .Jo ~~ I I 40 I I l~ndom * 35 - no-turr + 30 - Fms + ’20 – rlIst -travelled ~ local latency 10 - 5~ 5~ P[r( ellt 20 15 illwl’etl(.al 25 + Q- -travelled & o~ o (J~ 1() of Inclx [ 1!5 - 10 - 5 no-turn local FCFS 20 -rflst (pie,) I I random * Average p~ 1.5 - () ~o 2.5 :30 1.5 theore~lcal throughput :30 10 ,5 Percent throughput of max (c) 15 20 theoretical 2.5 30 throughput (d) FIG. 14, Comparison of input selection policies for four different routing algorithms in a 10 ~ 10 mesh. (a) q routing algorithm. (b) West-first routing algor]thm. (c) North-last routing algorhhm. (d) Ncgatwe-first routing algorithm. s~ OuTpuT SELECT-ON poLIclEs. The next simulations compare the zigzag, W, and no-turn output selection policies. Each policy is simulated in a 10 X 10 mesh for the west-first, north-last, and negative-first routing algorithms. In all cases, the input selection policy is distance-travelled, the routing is minimal, and the message traffic is uniform. Figure 15 shows the results. For each of the routing algorithms, the no-turn policy has the lowest latencies and the zigzag policy has the highest latencies. Depending on the routing algorithm. the .xy policy is comparable to either the no-turn or increases zigzag policy. the maximum Relative sustainable to the zigzag throughput policy, the by an average The relative performance of the three output due to two factors: spreading message traffic no-turn policy of 5%. selection policies appears to be evenly and avoiding turns. The zigzag policy tends to concentrate uniform traffic in the center of the mesh, increasing contention, and latencies. Compensating for this tendency somewhat is the adaptiveness of the routing algorithms. The xy policy attempts to spread the uniform traffic more evenly. When contention is high, however, the xy policy, like the other policies, chooses the first available channel, ignoring which channel is preferred. This behavior tends to route more packets the center of the mesh. The better performance of the .xy policy relative zigzag policy, in general, suggests that spreading traffic evenly is more tant than maintaining maximum adaptiveness. The superior performance toward to the imporof the The Turn Model for Adaptiue Routing 895 o~ o .5 Percent 10 of max 15 jo theoretical 30 25 tl]roughput (a) 40 I Average latency (/lsec) I I I t zigzag * 35 - ,zg * 30 25 20 - no-turn @r FIG. 15. Comparison of output selection policies for three different routing algorithms in a 10 x 10 mesh. (a) West-first routing algorithm. (b) North-last routing algorithm. (c) Negative-first routing algorithm. 1!5 - 10 .5~ o~ o Percent 5 10 of Inzm 1<5 20 25 30 theorctlca] throughp(lt (b) ()~ o 5 10 Pcr( ~nt of nlax 1.5 2() 25 30 thmr~tlcd throughput (c) no-turn policy suggests that turns sary turns is even more important are indeed costly and that avoiding than spreading traffic evenly. unneces- PATTERNS. The next simulations compare the adaptive routing 8.3. TRAFFIC algorithms produced by the turn model with the nonadaptive routing algorithms produced by dimension ordering for different direct networks and different patterns of message traffic. mesh and a binary 8-cube (Simulation for a larger 2D 16 x 16 mesh have The two networks studied are a 10 x 10 results for a 3D 10 x 10 x 10 mesh and a similar behavior and can be found in Glass [1992]). The six traffic patterns studied are uniform, matrix-transpose, matrix-rotate, merge-south, reverse-flip, and address-rotate. For the 10 X 10 mesh, the input selection policy is distance-travelled and the output selection policy is no-turn. For the binary and the output selection policy 8-cube, the input selection is V. In all cases, routing policy is local FCFS is minimal. 896 C. J. GLASS AND L. M. NI 40 35 30 Average 2.5 latency 20 (~sec) north-last negative-first 10 .5~ I n n 1 { 5 10 1!5 F’wrcent of maxlmurn I % & ea t-west 1,5 10 5 o~ * o Xl 25 30 35 40 thecretlca] throughput Percent .5 10 1.5 of mammum (b) (a) 40 I 35 I I I I I 40 I I I I I I I I 35 - – 30 - 30 - Average 2.5 - 25 20 - Zy 1,5 - west-first north-last negative-first 10 5~ I o o I 1 .5 Pel[.erlt I + % & east-west 10 1.5 20 2.5 nf max theoretical * latency 20 - (psf’c) 1.5 10 5~ * 0 30 35 40 throughput I I O 5 Percent + * % 4 - 30 35 40 throughput (d) for different traffic patterns in a 10 X traffic pattern. (c) Matrk-rotate traffic FIG, 16. Comparison of routing algorithms Uniform traffic pattern. (b) Matrix-transpose Merge-south traffic pattern. each traffic I ~11 west-tirsl north-last negative-first ea+west I 10 1.5 20 2.5 of mm theoretical (c) For 20 2.5 30 3.5 40 theoretical throughput pattern, a different cies at the high throughputs. 16(a)); for matrix-transpose routing algorithm 10 mesh. (a) pattern. has the lowest (d) laten- For uniform traffic, it is the ~y algorithm (Figure traffic, the negative-first algorithm (Figures 16(b) and 17(a)); for matrhcrotate, merge-south, and address-rotate traffic, the north-last or all-but-one-negative-first (ABONF) algorithm (Figures 16(c), 16(d), and 17(b)); and for reverse-flip traffic, the west-first or all-but-one-positive-last (ABOPL) algorithm (Figure 17(c)). (The 16 will be discussed later. ) In general, perform better better for hypercube, for uniform traffic, east–west routing the nonadaptive and the adaptive routing algorithm routing algorithms in Figure algorithms perform nonuniform traffic. For matrix-transpose traffic in the meshes and the maximum sustainable throughput of the adaptive algorithms is twice that of the nonadaptive third higher than the second the xy algorithm and uniform algorithms. In the meshes, this throughput is a highest sustainable throughput, which occurs for traffic. The latter improvement in throughput is not due to shorter path lengths for matrix-transpose traffic than for uniform traffic: the average path length for matrix-transpose traffic is 11.34 hops and for uniform traffic is 10.61 hops. For reverse-flip traffic in the hypercube, the maximum sustainable throughput of the adaptive algorithms is four times that of the nonadaptive e-cube algorithm and one half higher than the next highest sustainable throughput, which occurs for the e-cube algorithm and uniform traffic. Again, average path length is longer for reverse-flip traffic (4.27 hops) than for uniform traffic (4.01 hops). The Turtl Model 40 for AdaptiLv 1 I 1 I I (psec) I I I 35 - e-cube ++ 30 - ABONF ABOPL &+- Average ~~ latency Routing p-cube 897 I + ~tJ IS _ 51 (J~ 0246 Percent S101’2 1416182022 of’ max. theoretical throughput (a) 40 1 I I 35 - I I f 1 e-cube [1 1 I + ABC)NF -+ A130 F’[J -B-- :30 Average 2.5 latency 20 - ~~-cube & FIG. (/lscc) 15 10 - Comparmm for binary R-cube. pattern. .5~ 17. rithms different 0246 Percent routing traffic alg- patterns Matrix-transport (b) Address-rotate (c) Reverse-flip o~ (a) of traffic trtiffic in a traffic pattern. pattern. S101214161S20’22 of max. theoretical throughput (b) 40 I I 35 - e-cube + 30 - ABONF ABOPL + + I I Average 2.5 latency 20 - I I [ I I I (psec) 1,5 - 10 .5~ o~ 02468101214161820’ Percent of max theoretical 2’2 throughput (c) The differences in the performances of the algorithms, for individual patterns and across traffic patterns, are largely due to differences of information about traffic used in the algorithms. The xy algorithms tion about are nonadaptive, but happen the uniform traffic pattern. traffic in the types and e-cube to embody global, long-term informaBy routing packets first along one dimension and then the other, they spread uniform traffic as evenly as possible across the channels of a mesh or hypercube in the long term. The adaptive algorithms, on the other hand, select channels based on local, short-term information, These selections tend to benefit just the routed pacliet, and only for the immediate future, and tend to interfere with other packets. The selections often produce zigzag paths, which disturb the global, long-term evenness of uniform traffic. The result is increased contention and decreased performance. C. J. GLASS AND L. M. NI 898 Despite the superior uniform traffic, the in real systems. studies, but performance adaptive Uniform is not traffic generated mined by the application direct network. For of the nonadaptive algorithms probably has been by many used most applications, in many applications. and how its processes routing provide are mapped for performance previous A traffic each process algorithms better simulation pattern is deter- to the nodes of the communicates with some processes much more than others. Nonuniform traffic presents a problem for the xy and e-cube algorithms because they are nonadaptive. Just as they maintain the evenness of uniform traffic, they blindly maintain the unevenness of nonuniform traffic. The result, as the figures illustrate, is often poor performance. 8.4. SPREADING simulations TRAFFIC in the algorithm is largely channels of a direct MORE preceding determined network. The EVENLY. subsection by The is that how overall well present conclusion the performance it spreads subsection from traffic discusses the of a routing across the how to modify partially adaptive routing algorithms so that they spread traffic more evenly. Konstantinidou [1990] describes two separate modifications of the p-cube routing algorithm that make it spread traffic more evenly across a hypercube. The first modification is to route some packets first in positive directions and then in negative directions and to route the other directions and then in positive directions. To prevent channels in the positive directions is doubled. The negative-first packets the channels in the negative is heavier for channels use separate channels directions. nearer node packets first in negative deadlock, the number of positive-first packets and in the positive The traffic directions of the positive-first /Z – 1, and the traffic but share packets of the negative-first packets is heavier for channels nearer node O. In total, the traffic of the two types of packets is fairly even. The second modification is to route all packets first in positive directions, then in negative directions, and then again in positive directions. Deadlock is again prevented by doubling the number of channels in the positive directions. The total traffic of the three phases is fairly even across the channels of the hypercube. The modifications by Konstantinidou [1990] can be generalized partially adaptive routing algorithm and any direct network. technique for modifying a routing algorithm to spread traffic direct network is to create multiple virtual networks to any The first general more evenly in a and use a different version of the routing algorithm (or an entirely different routing algorithm) in each virtual network. For example, a north-last routing algorithm might be used in one virtual network, and a south-last routing algorithm might be used in a ~ecorid virtual network. Which routing algorithm and yirtual network arG used to route a particular message can be determined randomly (as Konstantinidou suggests) or according to which routing algorithm is best suited for the particular source and destination nodes. When the virtual networks are overlaid onto the physical network, it may be possible for virtual networks to share some virtual channels and still prevent deadlock (as Konstantinidou demonstrates for the p-cube routing algorithm). The overall objective is that the channel utilizations in the virtual networks complement each other such that the channel utilizations in the physical network are as even as possible. The second general technique for modifying a partially adaptive routing algorithm to spread traffic more evenly is again to create multiple virtual The Turn Model networks with multiple in sequence. advance for Adaptille All Routing routing messages to the next algorithms, start virtual 899 in the network but to connect same virtual the virtual ~etwork in the sequence. networks and When repeatedly the last phase of the routing algorithm in one virtual network is similar to the first phase of the routing algorithm in the next virtual network, the two virtual networks may share virtual tions in the utilizations The channels. Again, virtual networks in the physical drawback to both the overall complement network objective is that the channel utilizaeach other such that the channel are as even as possible. of these general techniques k that they virtual channels be added to a direct network. Adding virtual creases the cost of a direct network. It also creates the possibility algorithm designed from the start for a direct network with require that channels inthat a routing that many virtual channels might make better use of the channels and perform better. A third general technique for modifying a partially adaptive routing algorithm to spread traffic more evenly does not have this drawback of adding virtual channels to a direct network. The technique is to use different versions of the routing parts spreads this algorithm of the direct (or entirely network. The traffic the most evenly technique can be applied routing algorithm. The result the mesh uses a west-first uses an east-first destined for the different objective routing in each part of the direct to two-dimensional is the east–west routing algorithms) is to use the routing algorithm network. meshes rowing to route and algo~itlun: packets, the west-first the east half of and the west half or south). Packets in the same half as their primarily if they are heading away from the center of the mesh, as shown in Figure 18(b). This combination routing toward the center of the mesh and adaptive routing helps that For example, routing algorithm. Thus, as shown in Figure 18(a), packets other half of the mesh are routed there nonadaptively (i.e., east or west without going north destination are routed adaptively center in different algorithm keep adaptiveness from inadvertently congesting of nonadaptive away from the the center of the mesh. To find out how the east–west routing a 10 X 10 mesh for a variety of traffic policy the is distance-travelled, algorithm patterns. output performs, it is simulated In all cases, the input selection policy is no-turn, in selection and the routing is minimal. The results are shown in Figure 16. Overall, the east–west routing algorithm performs better than the other adaptive routing algorithms. For uniform traffic, the nonadaptive xy routing algorithm still performs the best, but the east–west 9. Conclusions This paper previous model network. highly algorithm and Future presents models, is based This adaptive on turn Work a new which is a close second. model are largely analyzing model the for designing based directions systematically for a network. For networks routing on adding virtual in which designs without routing algorithms. Unlike channels, the new packets can algorithms turn that in a are virtual channels, the turn model produces routing algorithms that are partially adaptive. (For networks with virtual channels, the turn model can produce routing algorithms that are fully adaptive [Glass and Ni 1992; Glass 1992].) Adaptiveness is important in a routing algorithm because it reduces the effects of hot spots, unfair channel allocation, and fwlts. The turn model also designs routing algorithms that are 900 C. J. GLASS AND L. M. NI FIG. 18. halves The directions in which of a two-dimensional cast–west routing packets mesh algorithm in the two are routed during by the its two phases. EHm (b) (a) ideal for fault addition tolerance, to highly being adaptive minimal [Glass or nonminimal and livelock free, in and Ni 1993]. This paper also presents the results of numerous simulations. Simulations of different input selection policies show the distance-travelled policy to perform the best, than implying to provide the no-turn policy to that fairness. policy it is more to perform reduce traffic patterns sion-order adaptive routing routing than to and routing algorithm algorithms for a policy of different output the best, implying contention different important Simulations that increase it is more to make partially 9.1. IMPLEMENTATION formance expected on two things: the routing factor from whether algorithms is the topic adaptive routing show the nonadaptive of dimen- algorithms of a network is Three ways are spread traffic Whether the improvements are realizable in real systems model real traffic for a the best for uniform traffic and more better for nonuniform traffic. The CONSIDERATIONS. the turn show Simulations implication is that spreading traffic evenly across the channels very important for good performance by a routing algorithm. suggested evenly. contention policies important adaptiveness. algorithms to perform to perform to reduce selection patterns are highly can be implemented nonuniform in hardware more in perdepends and whether efficiently. The latter of this subsection. Adaptive routing requires that more or larger header flits be present in a router at one time, increasing communication latencies. Although nonadaptive routing decisions can be made based on the distance remaining in a single dimension, adaptive routing in more than one dimension, address more decisions must be based on the distance remaining if not all dimensions. For dimension ordering, the of the destination node can be stored for each dimension. Only the first it is no longer needed, decision, When header in a series of header flit is required it can be discarded flits, one or for each routing and the next flit used. For adaptive routing, the best approach may be to place the least significant bits of each dimension in the first header flit, the next least significmt bits in the second header flit, and so forth. For example, the first flit might contain bits O to 3, the second store bits 3 to 6, the third store bits 6 to 9, and so forth. Then, when the distances in the first flit are reduced to zero, the flit is discarded, the distances in the second flit are decremented by one, and a new first flit is generated by the router. Adaptive routing also requires more complex control circuits. More informaand the decision rules are more tion is involved in the routing decision, complex. decision the direct The extra control hardware means more expense and possibly longer times. How much longer decision times arc depends on the specifics of network and routing algorithm. Basing the decision on more infor- The Turn Model mation does not complexity Finally, necessarily of the decision adaptive routing of a message. 901 for AdaptiL1e Routing require more time. It is likely, however, that the rule increases decision times by a few clock cycles. may increase the time for reassembling the packets Nonadaptive routing algorithms usually prevent the packets in a message from passing each other on their way through a network, because all of the packets follow the same path. Adaptive routing, on the other hand, makes no guarantee that packets will arrive at the destination in the order they are sent. Thus, be more about reassembling packets for its proper 9.2. FUTURE the most future a message a message, may require too, since sorting. each packet There must may have to carry information order. There are many is identifying realistic possibilities for traffic patterns, WORK. important simulations can be more meaningful. Another future work. One so that the results is exploring of of the efficiency with which the routing algorithms produced by the turn model can be implemented in hardware. There are many more input and output selection policies to be explored, too, especially tion ones based policy that on additional assigns packets information. priorities For example, an input based on the distance selec- travelled, minus the distance remaining and the packet length, might release more channels more quickly. Also, an output selection policy that waits a short period before selecting a nonpreferred from the destination, The turn model output channel, especially a channel may improve performance. has been applied to n-dimensional that leads away and lc-ary meshes n-cubes in this paper, but it can also be applied to other topologies, such as hexagonal networks, octagonal networks, cube-connected cycles, and express cubes [Dally 1991]. In such topologies, the turns are not necessarily 90-degrees and the simple cycles are not necessarily formed by four turns. The turn model should again produce interesting So far, the research message help has only one destination. design which and useful has involved routing each applicable algorithms message The multiple star model algorithms. communication, question for multicast can have to the multicast routing only unicast is whether and broadcast destinations. of multicast in which the turn each model can communication, The turn model communication in may be [Lin and Ni 1991]. The authors the simulator ACKNOWLEDGMENTS. helping us develop suggestions wish to thank Dr. Philip and anonymous referees K. McKinley for for their helpful and comments. REFERENCES AG~RWAL, A., architecture siwn on Contpt~ter BORKAR, B.-H., KRANZ, Arcizitectztre S. B., COHN, PETERSON, 1988. LIM, for multiprocessing R., Cox, C., PIEP~R, integrated 3upercu)rz~)~tttr~g’[98 (Orkzrzdu, W. J. 1988. Fine-grain Conference on Hyperczlbe New pp. 2–12. DALLY, York, W. Concurrent J. 1989. Cornpz[talg, The AND KUBIATOWICZ, (Seattle, In Wash., May L., solution Fla., to 28–31 ). IEEE, high-speed parallel Nu~, 14--18). IEEE, message passing concurrent Corrzpztters, J-machine: System and Agha, APRIL: 1990. A processor of the 17th Zntcr-nat~o)za[ S,wrzpcJNew York, H. T., LAM, pp. 104– 114. M., MOORE, B., P. S., SUmON, J., URBANSZU. J., .AND WEBB, TSENG, C’o)zczm’ent Hewitt J. Proceedings G., GLEASON, S., GROSS, T., KUNG. J., RANKIN, An DALLY, iWarp: D., multiprocessor. vol. support eds. MIT New computing. Yofk, for Actors. Rrx-eedm,q J. oj’ pp. 330–33Q computers. 1 (Pasadena. Press. In In Proccuii?zgs Cal if., In Cambridge, Jan. Actors: Mass. of tbe 3rcl 19-20). ACM. Kno)tledge-B~zsed 902 C. J. GLASS DALLY, W. J. Trans. 1990. Compat. DALLY. W. J. t]on W. (June) 1991. networks. DALLY, Performance C-39, cubes. Trans. J., AND Aota, Massachusetts Institute mcube interconnectmn L. M. networks. NI lILEE 775-785. Express IEEE of k-ary analysis AND Improving Comput. H. the 40, (Sept.), 1990. Adaptive of Technology, performance of k-my n-cube lntercrmnec- 1016– 1023. rout]ng Laboratory using for virtual Computer channels. Science, Tech. Rep., Cambridge, Mass., Sept. DALLI, W. J., AND SEITZ, C. L. 1986. The torus routing DALLY, W. J., AND SEITZ, C. L. 1987 Deadlock-free interconnection GLASS, C. turn networks. J. 1992. model. Lansing, Ph.D. Mich.. IEEE Designing thesis, Twrrzs Comput. maximally C-36, State 1992. Maximally routing m for Department wormhole of Computer ings of the 1992 International Cottference GLASS, C. J., AND NI, L. M. the 23rd Annurd 1993. International fully adaptive on Parallel Fault-tolerant Sympo~ium routing Proce.swzg, message-passing Japan. Nov.), S. KONSTANTINIDOU, MIT Conference: wormhole routing on Faalt-Tolerant 1. (Aug.), pp in meshes. Compz{tzng Adapt]ve. Researclz LAUDON, cache 17th International AP 1000. In mm]mal In (Toronto, Ont., protocol L. for L. M. 1991. Proceedozgs Canada, May k-ary 27-30). COMPANY. 1990. AND MCKINLEY, P. K. IEEE 1988. DASH 1991. multicmt New In York, An A of the May 28-31) of the 3rd Calif., Proceedings pp. adaptive The of the IEEE, New in multicomputer on Compz[ter Arclzuec tare 116-125. and fault tolerant wormhole rout]ng 40, (Jan. ) 2– 1‘2. Manual, survey of wormhole routing techniques m direct A, J., SEIZOVIC, J., STEELE, C, S., AND Su, programming Conference Jan. routing Symposil[m on 19-20). ACM, 1977. of the Hypercube New A large of the %h ,%nual Ametek Series Concametlt York, multlcomputer. and Apphcatzorzs, pp. 33-36. scale. homogeneous, Sym,ooszarn 2010 Computers orz Curnputer fully distributed ,4rchltecture (Mar, parallel ), IEEE, pp. 105-124. for networks of processors. In 1989. IEEE Adaptive. Proceeding, low latency, Pt. E, vol. RECEIVED FEBRUARY 1993; REVISED IU1.Y 1993; ACCEPTED JULY 1993 Journal 1990. Proceedings 62–76. Y.ANTCHEV, J. T., AND JESSHOPE, C. R. routing Third- P~-oceedmgs of the 6t/7 In Wash., wormhole York, 6400 Processor and SULLIVAN, H., AND BASHKOW, T. R. New In HENNESSY, J AND tect~Lre (Seattle, Compw, 1993. 26, (Feb.) architecture I, (Pasadena, machine. 1991, OFZSupercoozpzmng multiprocessor. International ACM, Trans. NCUBE Computer The Procecdmgs Volume in hypercubes. GUPTA, A., the C. L., ATHAS, W. C., FLAIG, C. M., MARTIN, W.-K. of 139-153. Deadlock-free IEEE M., SEITZ, In for of the 18th Annual n-cubes. networks. pp on Compz~terArc)z~ LINDER, D. H., AND HARDEN, J. C. NCUBE routing ztz jrLS[. J., GHARACHORLOO, K., coherence Jymposwn NI. networks. NI, Proceeduzgs pp. 240–249. M, Sympos[unz Proceed- 148-159. LIN, X., AND strategy Inter?zarional In (June), In 101 – 104 pp. 46-55. AdLsunced dutctory-based pp. computer 1990. LENOSIU, D,, York, The East In 2D meshes. vol. INTEL CORPORATION. 1991. .4 Touchstone DEL X4 System Description lSHIHATA, H., HORIE, T., INANO, S,, SHIMIZU, T., KATO, S., AND IKEXAKA, (Kyusyu, muting: Science, (Aug.), GLASS. C. J., AND Nt, L. M. generation 3, 187-196. multiprocessor 547–553, algorithms University. Comput.1, J. Iluf. message (May) adaptive Mich]gan chip. A~mcmt,on for Cornputrng Ma~h~ncry. Vd dl, No 5, September 1994 136(3). deadlock-free (May), packet pp. 178– 186.