1YQ UNSW School of Electrical Engineering and Telecommunications Space-division fabrics Keshav Chapter 8 Copyright © Copyright © 10/03/2017 Tim Moors 1 184 UNSW School of Electrical Engineering and Telecommunications Textbook references • Keshav: Ch. 8.4 • Varghese: Ch. 13 • switch cost: 13.9.1 of Varghese • Appendix A.3 and A.4: space division switch fabrics • other space division switch fabrics: Varghese pp. 443-4 Copyright © Copyright © 10/03/2017 Tim Moors 2 14X UNSW School of Electrical Engineering and Telecommunications Original References Y. Yeh et al: “The Knockout Switch: A simple, modular architecture for high-performance packet switching”, IEEE J. Sel. Areas in Comm. 5(8):1274-83 C. Clos: “A study of non-blocking switching networks”, Bell Sys. Tech. J., 32(3):406-24, Mar. 1953 L. Goke and G. Lipovski: “Banyan networks for partitioning multiprocessor systems”, Proc. 1st Annual Int. Symp. Comput. Architecture, Dec. 1973, pp. 21-28 K. Batcher: “Sorting networks and their applications”, Proc. AFIPS Spring Joint Computer Conference, pp. 307-14, 1968 A. Huang and S. Knauer: “Starlite: A wideband digital switch”, Proc. Globecom, Dec. 1984, pp. 121-5 V. Benes, Mathematical Theory of Connecting Networks and Telephone Traffic, Academic Press, New York, 1965 Copyright © Copyright © 10/03/2017 Tim Moors 3 1DV UNSW School of Electrical Engineering and Telecommunications Tutorial/survey References W. Kabacinski et al: “Guest editorial - 50th anniversary of Clos networks”, IEEE Comm. Mag. 41(10):26-7 G. Broomell and J. Heath: “Classification categories and historical development of circuit switching topologies”, ACM Computing Surveys 15(2):95-133 Online texts (see the TELE9751 reading list): A. Pattavina: Switching Theory, Wiley H. Chao et al: Broadband Packet Switching Technologies Ch. 13 of N. Mir: Computer and Communication Networks Copyright © Copyright © 10/03/2017 Tim Moors 4 51Q6* UNSW 5 School of Electrical Engineering and Telecommunications Cisco Nexus context “Each row in the crossbar is associated with an input port, and each group of four columns is associated with an egress port; thus, there are four cross-points per egress port. In addition, there are four fabric buffers with 10,240 bytes of memory buffer per egress port.” “The crossbar fabric on the Cisco Nexus 5500 Series is implemented as a single-stage fabric, thus eliminating any bottleneck [blocking] within the switches” “The [fabric] is a single-stage highperformance 100-by-100 crossbar with an integrated scheduler. The scheduler coordinates the use of the crossbar between inputs and outputs, allowing a contention-free match between I/O pairs. The scheduling algorithm is based on an enhanced iSLIP algorithm.” Copyright © Copyright © 10/03/2017 Tim Moors 1HX UNSW School of Electrical Engineering and Telecommunications Outline Single-stage: Crossbar switches Handling crossbar output port contention Knockout Arbitration Staged switches Networks of basic elements Clos Banyan Structure Blocking & Motivation for sorting Batcher sorting networks Copyright © Copyright © 10/03/2017 Tim Moors 6 1JC UNSW School of Electrical Engineering and Telecommunications Crossbar switches At the intersection of each input and each output, there is a “cross point”, which can be selectively enabled, allowing communication from input to output. e.g. Intel 470 switches, Cisco Catalyst 6500, 12000 series routers Feasible to implement in VLSI (e.g. PMC-Sierra PM9312), limited by high-speed I/O to chip - pincount [Sketch from znet.net/~cdk14568/mpet/contacts/fig23-5.gif. Photos from Lucent] Copyright © Copyright © 10/03/2017 Tim Moors 7 UNSW School of Electrical Engineering and Telecommunications 8 Source of switch element photo unknow Bell System’s #4A toll crossbar switching system, circa 1957 From AT&T Archives and History Center IEEE Spectrum Feb. 2013 Copyright © 10/03/2017 Tim Moors 1V8 UNSW School of Electrical Engineering and Telecommunications Advantages of crossbar switches √ Simple structure √ Suitable for circuits or packets √ Low latency – minimal number of connecting points between arbitrary input and output. √ No internal blocking (may have output port blocking) √ Multicast is easy with electrical crossbars × Difficult with MEMS optical crossbars, since mirror will (hopefully) reflect all signal power to one intended output. × Easy to reach output, but scheduling multicast s.t. all outputs are simultaneously available (other inputs aren’t transmitting to some) can be tricky. Copyright © Copyright © 10/03/2017 Tim Moors 9 177 UNSW School of Electrical Engineering and Telecommunications Disadvantages of crossbar switches × Scalability: × # of crosspoints (Nx) grows with (P=number of ports)2 × Difficult to incrementally expand × Output buffer speed ∝ P, not line speed (=> knockout process). × Fanout: Number of crosspoints on each line increases linearly with number of ports, increasing capacitive loading, slowing transmission (just as bad as time-division bus, but inferior to space-division fabrics such as Banyan) × No fault tolerance: each crosspoint is needed for one connection or another. Copyright © Copyright © 10/03/2017 Tim Moors 10 103 UNSW School of Electrical Engineering and Telecommunications Handling crossbar output port contention Crossbar switches (like all switches) can suffer output port blocking. Responses: • Internal buffering at crosspoints (next slide) [GJ> • Buffers at outputs [5D>. Issue: How to reduce speed of buffer, or reduce number of low (input-port) speed buffers? → Knockout tournaments [4K> • Block at input. Issue: How to be fair to inputs s.t. none is blocked forever? → arbitration systems. [ZP> Copyright © Copyright © 10/03/2017 Tim Moors 11 1GJ UNSW School of Electrical Engineering and Telecommunications Internal crossbar buffering Internal buffering, e.g. to handle output contention Fig. 8.14 from Keshav Copyright © Copyright © 10/03/2017 Tim Moors 12 15D UNSW School of Electrical Engineering and Telecommunications Buffering at crossbar outputs Multiple inputs might • Time-division access to one output port buffer. • Small, simple, but need fast buffer. • Speedup problem: as number of inputs increases, speed of buffer at output must increase (like shared medium switch) • Lead to multiple distinct buffers on each output port. • Low speed, but no sharing => more buffer space needed. How about something in-between: some L<P buffers, and mechanism (Knockout tournament) to spread inputs evenly over buffers? Copyright © Copyright © 10/03/2017 Tim Moors 13 UNSW School of Electrical Engineering and Telecommunications 14 Knockout switch Can we exploit the directional nature of traffic (few inputs will be directed to any one output at a time) to reduce the output port e.g. left-most occupied buffer buffer capacity (size+speed)? “wins” a round; losers repeat So use a “knockout tournament” to decide which Example 1 Example 2 inputs get buffered: • Competitors (inputs) play and those who win progress to the next round. At the end, one competitor is selected (transferred to buffer); others have lost one match. • Losers then compete again (clean slate) to determine next competitor to be selected. • Repeat to capacity of output port. For details, see Peterson and Davie, 1st edition, pp. 193-6; Keshav p. 199 14K Copyright © Copyright © 10/03/2017 Tim Moors Scenarios UNSW P = packet from L/R input Inputs x 15 School of Electrical Engineering and Telecommunications PL PR PL - - P R - - L R Knockout structure Outputs PL PR PL - PR - - - or @ random PR PL (to be fair to RHS ports) B 8 in C A A ... A Shifter T0 Buffers 8 out Delay 8 A Packet ... filters P:L knockout B concentrator C Shifter & shared buffers D D D D Shifter D T1 D D D D D D D D 1 2 3 Buffers Shifter 4 T2 Buffers P:L knockout concentrator selects up to L pkts from P inputs (here P=8, L=4). Delay for pipeline Shifter spreads load 1L0 Drawings based on L. Peterson Copyright © Copyright © 10/03/2017 Tim Moors 1ZP UNSW 16 School of Electrical Engineering and Telecommunications Crossbar arbitration† Problem: When multiple inputs have info for one output, crossbar controller needs to determine when each should be switched through. Aims: Schedule aims for • Performance: Maximise throughput by minimising transfer rounds; minimise delay • Fairness: each input should receive equal access. • Cost: Shouldn’t require much state info or many iterations. Generic method: May achieve this through multiple iterations (per round) of negotiation between inputs & outputs FIFO 6 rounds iSLIP Optimal? 1: 2: 3: 4: A B C B A C - C B A - † Often (e.g. by Varghese) called “scheduling”, but called arbitration by others (e.g. Cisco) & “arbitration” avoids confusion with later lecture on “packet scheduling”: Scheduling affects the timing of packets; here timing for crossing the fabric; later packet scheduling timing when sentTim onMoors a line. Copyright © Copyright © 10/03/2017 19U UNSW School of Electrical Engineering and Telecommunications Arbitration basis • Each input maintains a separate “Virtual Output Queue” of cells destined for each output. (Fixes Head of Line blocking <DD]) • Each of P inputs maintains P VOQs => P2 bits of activity info control arbitration. Feasible to optimise with tens of ports (e.g. P=32 => 1024 bits) • Need to • convey activity info between inputs and outputs • Inputs send requests to outputs • Each output offers to serve one request (must choose fairly) • Each input accepts one offer • coordinate inputs and outputs, e.g. multiple iterations: Reiterate to use outputs whose offers weren’t accepted. • Trade-off between performance and number of iterations Copyright © Copyright © 10/03/2017 Tim Moors 17 1ZK UNSW School of Electrical Engineering and Telecommunications Parallel Iterative Matching (PIM) • Offers are made to random inputs • Used in DEC AN-2 30 port switch Copyright © Copyright © 10/03/2017 Tim Moors 18 UNSW School of Electrical Engineering and Telecommunications 19 PIM example Figure from Varghese they are similar to other requests. “a=1” on input c is an artifact of iSLIP example 12CCorrection: Requests from A to 1 should be gray; Copyright © Copyright © 10/03/2017 Tim Moors UNSW School of Electrical Engineering and Telecommunications 20 PIM example (repeated for non-animated display) Figure from Varghese they are similar to other requests. “a=1” on input c is an artifact of iSLIP example 1CF Correction: Requests from A to 1 should be gray; Copyright © Copyright © 10/03/2017 Tim Moors 1QT UNSW School of Electrical Engineering and Telecommunications PIM vs iSLIP • “PIM … requires a logarithmic number of iterations to attain maximal matches” vs “iSLIP … gets close to maximal matches after just one or two iterations.” (though “maximal” ≠ “close to maximal” & PIM example uses 1 iteration and iSLIP example uses 2 iterations ) • in terms of fairness mechanism PIM : Ethernet as iSLIP : token ring randomness taking turns Copyright © Copyright © 10/03/2017 Tim Moors 21 UNSW School of Electrical Engineering and Telecommunications iSLIP • Each port maintains a pointer to next port desired to use. • Choose lowest port ≥ pointer; update pointer • “≥” in circular sense: P+1 = 1st port • Pointers initially all point to 1st port, but desynchronise as randomly directed traffic passes through. • Randomness creates fairness (as in Knockout & Ethernet); PIM adds it directly; iSLIP derives it from random traffic. • Used in Cisco GSR router N. McKeown et al: “Scheduling Cells in an Input-Queued Switch”, IEE Electronics Letters, pp.2174-5, 1993 1D8 Copyright © Copyright © 10/03/2017 Tim Moors 22 UNSW 23 School of Electrical Engineering and Telecommunications iSLIP round 1 a=3?! Pointers only get updated after iteration 1 => B still points to 1 & 2 still points to A B overhears A accepting 1 => knows that 1 is busy => doesn’t request 1 16E Figure from Varghese Copyright © Copyright © 10/03/2017 Tim Moors UNSW 24 School of Electrical Engineering and Telecommunications iSLIP round 2 • a 3 3 Figure from Varghese Round2 iteration 1 should show C with packet for output 3 1K3 Copyright © Copyright © 10/03/2017 Tim Moors 3 UNSW School of Electrical Engineering and Telecommunications iSLIP round 3 • a Figure from Varghese 11G Copyright © Copyright © 10/03/2017 Tim Moors 25 1QQ UNSW School of Electrical Engineering and Telecommunications Outline Copyright © Copyright © 10/03/2017 Tim Moors 26 1RH UNSW School of Electrical Engineering and Telecommunications “Centralized” vs distributed switches Switches can be: • centralized (e.g. implemented in a single box) or • distributed (consisting of multiple interconnected boxes) Potential advantages of centralized switches: √ Regular internal structure, c.f. heterogeneous distributed systems √ Feasible to have central control, c.f. propagation delays in distributed systems We’ll consider centralized switches first, and later consider distributed switching (Ethernet switches and inter-switch protocols) Regular “centralized” space-division designs can conceivably be applied to distributed switching, provided component consistency is possible. (e.g. Clos switches in data centres) Copyright © 27 UNSW School of Electrical Engineering and Telecommunications Composition of switches Small switching components can be connected to form larger switching systems. Examples from earlier lectures: • stacking of switches using high-speed port <MV] • a network forms a distributed switch • backplane fabric of chassis-based switch <MV] interconnects small switches within port processors <0F] Application to space-division switches: • Multistage switches are composed of “networks”† of smaller switches (e.g. crossbars or shared-media) A crosspoint is like a gate: It has the potential to connect one input to one output, but it can be controlled so that the connection is enabled or disabled at will. 1AC Copyright © 28 161 UNSW School of Electrical Engineering and Telecommunications Staged switches Potential benefits: √ Fewer crosspoints than in a crossbar switch √ Diversity of paths (e.g. see figure on next slide [JX>) How?: Selection of switch in first and last stage is determined by which input & output are being connected => use three or more stages to get benefits Why?: Intermediate stages can offer choice of switch => √ Lower blocking probability √ Increased reliability: can still connect input and output even if a component switch has failed † Here we are interested in networks within the switch. Of course switches can also be interconnected to form external networks (the more common use of the word). Copyright © Copyright © 10/03/2017 Tim Moors 29 30 School of Electrical Engineering and Telecommunications Clos switches P/ n k arrays P/ n Data center applications: Clos Networks: What's Old Is New Again Copyright © Copyright © 10/03/2017 Tim Moors … … • Each switch (“array”) of a stage arrays arrays connects to each switch of neighboring nk kn stages P/ P/ n n (# inputs may # outputs for internal nk kn switches) P/ P/ P n n P • Simplest Clos switches have 3 stages, inputs outputs but more are possible by recursively nk kn replacing middle stages by 3-stage Clos P/ P/ n n structures. nk kn • Advantages: Ignoring the ellipsis, this √ can switch circuits (c.f. Banyan) example has P=8+, n=2, k=3+ √ path diversity N = (n×k)(P/n)2 + ((P/n) × (P/n))k √ fewer crosspoints for large P x = 2Pk + kP2/n2 What are the optimal values for n and k? … 1JX UNSW 31 School of Electrical Engineering and Telecommunications k=Number of arrays in middle kstage … ... ... Copyright © Copyright © 10/03/2017 Tim Moors … ... ... To be non-blocking, must be able to arrays P/ P/ create a connection in the presence of n n P P /n /n existing connections on all other ports arrays arrays of input and output arrays. nk kn P/ P/ Each array has only one connection to n n switches in adjacent stages. nk kn P/ P/ P n n How many middle-stage arrays may be in P outputs use? inputs • n-1 may be tied up with other inputs nk kn P/ P/ from the same input array. (n=3 here) n n • n-1 may be tied up with other outputs nk kn P/ P/ to the same output array. n n Worst case: may have different mid(Connections that don’t share both stage arrays tied up by inputs and same input and same output array outputs. can reuse existing arrays.) => need at least (1+1)(n-1)+1 middlestage arrays, i.e. k≥2n-1 => Nx = 2P(2n-1) + (2n-1)P2/n2 … 17M UNSW UNSW 32 School of Electrical Engineering and Telecommunications N=>Number of arrays in outer stages Differentiate Nx wrt n n 2 2 P 2 2n 1P 2 2n 0 4P n4 0 4 Pn 4 2 P 2 n 2 4 P 2 n 2 2 P 2 n 0 2n 3 Pn P For n large†, neglect +P ≡ +Pn0 factor n Ports 8 128 2048 P 2 N x (min) 4 P 2 P 1 Nx(3-stage Clos) 96 7,680 516,096 Nx(Cross) 64 16,384 4,194,304 † Bonus mark on offer if you can justify why n (not P) should be large. Is it because large n leads to large k for non-blocking, and with large k, adding another array in the middle to provide fault tolerance with little overhead? 15G Copyright © Copyright © 10/03/2017 Tim Moors 129 UNSW School of Electrical Engineering and Telecommunications (Rearrangeably) non-blocking Clos • 2n-1≤k ensures Clos switch won’t block • n ≤ k < 2n-1 ensures Clos switch is rearrangeably nonblocking, i.e. can connect input to output, but might need to rearrange existing connections e.g. n=k=2 • to connect each input to corresponding output requires pattern on left • straight connections may end; but cross connections block connection between bottom input to top output • need to rearrange Copyright © Copyright © 10/03/2017 Tim Moors 33 1V6 UNSW School of Electrical Engineering and Telecommunications Clos switches Used in some commercial products: • Juniper T-series routers • NEC ATOM ATM switch • Myrinet switch Clos architecture used in many data centre designs Photo from http://www.cs.utk.edu/~dongarra/lyon2000/Talk02-Chuck-Seitz.ppt Copyright © Copyright © 10/03/2017 Tim Moors 34 14G UNSW School of Electrical Engineering and Telecommunications Outline Copyright © Copyright © 10/03/2017 Tim Moors 35 UNSW School of Electrical Engineering and Telecommunications Pipelining Pipeline: “A sequence of functional units ("stages") which performs a task in several steps, like an assembly line in a factory. Each functional unit takes inputs and produces outputs which are stored in its output buffer. One stage's output buffer is the next stage's input buffer. This arrangement allows all the stages to work in parallel thus giving greater throughput than if each input had to pass through the whole pipeline before the next input could enter. The costs are greater latency and complexity due to the need to synchronise the stages in some way so that different inputs do not interfere. The pipeline will only work at full efficiency if it can be filled and emptied at the same rate that it can process.” [http://foldoc.org/] • Multistage switches can operate as pipelines, with each stage operating (at any instant) on a different set of packets. • Fixed-length packets (aka “cells”) facilitate synchronisation: all switches in a stage operate for the same period. Port processors segment variable length frames into cells & reassemble from cells. 1K7 Copyright © Copyright © 10/03/2017 Tim Moors 36 1LU UNSW 37 School of Electrical Engineering and Telecommunications Multistage networks:1.Physically separate stages Physically separated stages form a pipeline Figure from H. Peyravi Copyright © Copyright © 10/03/2017 Tim Moors 1JU UNSW 38 School of Electrical Engineering and Telecommunications Multistage networks:2.Recirculation Multiple logically separate stages (emulating physically separate stages) can be created by recirculating, over time, data through one physical stage. ≡S95 SA≡S ≡S 1 ≡S106 SB≡S ≡S 2 ≡S117 SC≡S ≡S 3 ≡S128 SD ≡S ≡S 4 External Inputs of switches A-D: Recirculated External Outputs of switches A-D: Recirculated Copyright © Copyright © 10/03/2017 Tim Moors time 1EJ UNSW School of Electrical Engineering and Telecommunications Pipelined multistage switches • With multiple stages, each transfer (from input to output) passes through successive stages • At any instant: • circuit engages the complete path through the fabric • packets need only engage one stage at a time => can pipeline packets through: at any instant, each stage processes a separate wave of packets • Packet switching, stages could act autonomously, each responding to different packet header bits. • If we identify output ports with binary addresses, successive stages could handle progressively less significant bits of the address. (Note: “address” here is for internal use within switch. Packet classifier will take address from packet (e.g. IP) header to determine output port, and hence internal address) Copyright © Copyright © 10/03/2017 Tim Moors 39 1UX UNSW 40 School of Electrical Engineering and Telecommunications Banyan switches from input port 001 to dest port 110 n Self-routing using binary representation of output 0 1 port (extra header for use inside switch) Direction for each stage 0 1 specified by bit corresponding to that stage (MSb 1st) 0 1 => can’t multicast P-port switch has log2P stages each with P/2 2×2 000 001 010 011 100 101 110 111 switches 1 => switch to right output port of switch in this stage => Nx= 2Plog2P 0 => switch to left output port “There is a very large, famous Banyan tree that occupies more than 3 acres of land in a park in Lahaina, on the island of Maui in the Hawaiian islands.” – Seifert p. 208 Copyright © Copyright © 10/03/2017 Tim Moors 1KM UNSW 41 School of Electrical Engineering and Telecommunications Switch complexity compared Number of crosspoints for a switch with P ports: Crossbar: Nx = P2 Clos: N x 4P 2P 1 Banyan: Nx= 2Plog2P P=2048 4,194,304 516,096 45,056 Banyan switches √ require the fewest crosspoints (of those covered) × but suffer from the potential for blocking × typically limited to packet switching Copyright © Copyright © 10/03/2017 Tim Moors 1VN UNSW School of Electrical Engineering and Telecommunications 42 Project: Software Banyan • Rudimentary (e.g. fixed size), but demonstrates the idea. void fabricElements() { • Full source code in ... elements[7]->zero = elements[10]; 3FabricBANYAN.c elements[7]->one = elements[11]; struct banyanElement { int elementNumber; // flags indicate if // outputs busy top=0,bot=1 int topFlag; int bottomFlag; // pointers to next element banyanElement *zero; banyanElement *one; }; banyanElement *elements[12]; /* Will print out graphical representa printf(" ___ ___ ___ printf("000 ---| 0 |---| 4 |---| 8 |-printf("001 ---|___| |___|\\ |___|printf(" \\ / \\/ printf(" ___ \\ /___ /\\ ___ printf("010 ---| 1 |---| 5 |/ | 9 |-printf("011 ---|___| /\\|___|---|___|printf(" \\/ \\/ printf(" ___/\\ /\\__ ___ printf("100 ---| 2 | \\/| 6 |---| 10|printf("101 ---|___|---|___|\\ |___|printf(" / \\ \\/ printf(" ___/ _\\__ /\\ ___ printf("110 ---| 3 | | 7 |/ | 11|-printf("111 ---|___|---|___|---|___|-*/ Copyright © Copyright © Tim Moors } 10/03/2017 UNSW School of Electrical Engineering and Telecommunications 43 Project: Software Banyan (ctd) if(out_addr[stage] == '0') { // send from top node of element // check whether output zero has been used, if not allow this packet to go through & record that it has been used; else discard this packet if(currentElement -> topFlag == 0) { nextElement = currentElement->zero; currentElement -> topFlag = 1; } else { // element node busy, packed dropped. blocked = 1; } } else { // send from bottom node of element if(currentElement -> bottomFlag == 0) { nextElement = currentElement->one; } else { // element node busy, packed dropped. blocked = 1; } } 1CW Copyright © Copyright © 10/03/2017 Tim Moors 1M2 UNSW 44 School of Electrical Engineering and Telecommunications Types of Banyan blocking 110 110 Internal blocking (†): Contention for an output port of a component switch in one of the early stages. Output port blocking (‡): Contention for an output port of a switch in the last stage. § 110 111 0 1 † 0 1 ‡ 0 1 Output port blocking may even occur internally (§) Copyright © Copyright © 10/03/2017 Tim Moors 000 001 010 011 100 101 110 111 1WY UNSW 45 School of Electrical Engineering and Telecommunications Probability of internal blocking Destinations: Blocking probability is high, e.g. if priority is given to leftmost input. 000 010 111 100 001 011 110 101 (Exercise: Find a formula expressing how high, assuming random input→output mappings, e.g. see wiki) 000 001 010 011 100 101 110 111 Output port numbers Copyright © Copyright © 10/03/2017 Tim Moors 17J UNSW School of Electrical Engineering and Telecommunications Dealing with Banyan internal blocking Buffering within the switching network Dilation: • Multiple planes of latter stages • Move traffic blocked in one plane to higher plane (R→G→B,Y) • Expensive: number of switches in each stage increases exponentially with stage number for the worst case. Preface Banyan with network to • reduce blocking (distribution network → Benes) • avoid blocking (Sorting networks ...) Copyright © Copyright © 10/03/2017 Tim Moors 46 UNSW 47 School of Electrical Engineering and Telecommunications Motivation for sorting networks The blocking could have been avoided if only the information destined to each output was present on the correct input. Sort => Preface Banyan network with a 000 001 010 011 100 101 110 111 sorting network, Shuffle and a Shuffle Exchange Destinations: 000 010 111 100 001 011 110 101 000 100 001 101 010 110 011 111 000 001 010 011 100 101 110 111 000 001 010 011 100 101 110 111 1UP Copyright © Copyright © 10/03/2017 Tim Moors UNSW 48 School of Electrical Engineering and Telecommunications Examples† of non-blocking with sorting Destinations are multiples of 2 Destinations are multiples of 3 Destinations: 1, 3, 4, 5, 7 Sort 000 010 100 110 000 011 110 Shuffle 000 010 100 Sort Sort 001 011 100 101 111 Shuffle 110 000 001 010 011 100 101 110 111 000 011 110 000 001 010 011 100 101 110 111 Shuffle 001 111 011 Copyright © Copyright © 10/03/2017 Tim Moors 101 000 001 010 011 100 101 110 111 † Not a proof that sorting+shuffling always prevents blocking 1XL 100 1X9 UNSW School of Electrical Engineering and Telecommunications Sorting outline • Isn’t sorting complicated? As complicated as switching? • Sorting networks • Unlike serial sorting algorithms, sequence of comparisons must be independent of data values to allow fixed parallel implementation. • Issues (covered in ensuing slides): • Bitonic sequences • Batcher’s sorting of Bitonic sequences • How to create a bitonic sequence • Batcher-Banyan switches • Trapping sorted duplicates http://en.wikipedia.org/wiki/Sorting_network Copyright © Copyright © 10/03/2017 Tim Moors 49 1NF UNSW 50 School of Electrical Engineering and Telecommunications Sorting is simpler than switching Both sorting and switching reorder items, but 1. Switch must spread after sorting to account for idle ports Input port: a Destination: 101a Sorted: 001d Switched: ---x b c d e f g h ---b 111c 001d 010e 110f ---g 011h 010e 011h 101a 110f 111c ---x ---x 001d 010e 011h ---x 101a 110f 111c 2. Switch must deal with duplicates e.g. inputs a & d contend for port 101 Destination: 101a ---b 111c 101d 010e 110f ---g 011h Sorted: 010e 011h 101a 101d 110f 111c ---x ---x Switched: ---x --- 010e 011h ---x 101a 110f 111c 101d i.e. adding a sorter ≠ (<cost) repeating the switching process Copyright © Copyright © 10/03/2017 Tim Moors 18H UNSW 51 School of Electrical Engineering and Telecommunications Aside: Bitonic sequences A bitonic sequence is either: A. a concatenation of an increasing sequence and a decreasing sequence, i.e. {a1, a2, ... a2k} and k : a1 a2 a3 ...ak ak 1 ak 2 ... a2 k or B. a sequence that can be shifted cyclically to become A. e.g. 12387654 87123456 45687123 Theorem: Any bitonic sequence that has been arbitrarily split in two and had the two parts reversed is also bitonic. e.g. 8 7 1 2 3 | 4 5 6 → 3 2 1 7 8 6 5 4cyclic →17865432 Copyright © Copyright © 10/03/2017 Tim Moors shift UNSW School of Electrical Engineering and Telecommunications Batcher’s sorting of bitonic sequences Batcher showed† that given a bitonic sequence {a1 ... a2k} then the sequences {min(a1,ak+1), min(a2,ak+2), .. min(ak-1,a2k)} and {max(a1,ak+1), max(a2,ak+2), .. max(ak-1,a2k)} • are also both bitonic, and • no number in the sequence of minima exceeds any number in the sequence of maxima e.g. {8 7 1 2 3 4 5 6} → {3=min(8,3), 4=min(7,4), 1=min(1,5), 2=min(2,6)} and {8=max(8,3), 7=max(7,4), 5=max(1,5), 6=max(2,6)} i.e. can sort a bitonic sequence by dividing into two sub-sequences (one of minima and one of maxima), and repeating on sub-sequences. {3,4,1,2}{8,7,5,6}→{1,2}{3,4}{5,6}{8,7}→1,2,3,4,5,6,7,8 † K. Batcher: “Sorting networks and their applications”, Proc. AFIPS Spring Joint Computer Conference, pp. 307-14, 1968 140 Copyright © Copyright © 10/03/2017 Tim Moors 52 1GU UNSW School of Electrical Engineering and Telecommunications How to create a bitonic sequence Use a “odd-even mergesort”† to make sequence bitonic before feeding into Batcher sorting network. Aim: out.i < out.j ∀ i<j • Sort groups of 2 if(a1<a2) {out.1=a1; out.2=a2;} else {out.1=a2; out.2=a1;} • Merge pairs of groups of 2 to form groups of 4. Sort group of 4. if(a1<a3) {out.1=a1; tmp.lo=a3;} else {out.1=a3; tmp.lo=a1;} if(a2>a4) {out.4=a2; tmp.hi=a4;} else {out.4=a4; tmp.hi=a2;} if(tmp.lo<tmp.hi) {out.2=tmp.lo; out.3=tmp.hi;} else {out.2=tmp.hi; out.3=tmp.lo;} note that prior sorting of sub-groups helps sorting of merged group • repeat for larger groups of 8, 16, ... † Like a conventional merge sort, but comparisons are independent of data Copyright © Copyright © 10/03/2017 Tim Moors 53 UNSW 54 School of Electrical Engineering and Telecommunications Odd-even Merge sort Batcher sort max = = Legend Ascending sort to make bitonic Decending sort = min 1 6 2 7 8 4 5 3 6 1 2 7 8 4 3 5 bitonic list 666 7 7 227 6 3 172 2 6 711 1 4 833 3 2 384 4 5 448 5 1 555 8 8 7 3 6 4 5 2 8 1 7 6 5 8 3 4 2 1 Numerical example from 15U C. Partridge: Gigabit Networks, p. 115 7 6 8 5 4 3 2 1 7 8 6 5 4 3 2 1 8 7 6 5 4 3 2 1 Copyright © Copyright © 10/03/2017 Tim Moors 1MZ UNSW 55 School of Electrical Engineering and Telecommunications Trap modules Sorting alone doesn’t resolve output port contention => trap excessive (>1) packets destined to one port Starlite switch: Recirculate duplicates Resequencing is needed. Often give priority to recirculated packets to avoid prolonged resequencing. Conc. Sorter Trap Expand Copyright © Copyright © 10/03/2017 Tim Moors Shuffle Banyan UNSW School of Electrical Engineering and Telecommunications Benes networks • Essentially two Banyan networks, one the mirror image of the other • 2nd half = standard Banyan network • 1st half = “distributor”: used to shift inputs so as to reduce blocking. Switch controller assigns path through distributor. • A PP Benes network has s=2log2P-1 stages, with each stage having P/2 22 switching elements (4 crosspoints) Fig. from A. Pattavina: Switching Theory, i.e. Nx = 4Plog2P-2P Wiley, p. 100 • The Cisco CRS-1 uses “a unique 3-Stage Benes topology switching fabric” [http://www.cisco.com/en/US/products/ps5763/index.html] 13Y Copyright © Copyright © 10/03/2017 Tim Moors 56 1!!!! UNSW School of Electrical Engineering and Telecommunications Things to think about • Critical thinking: • • Engineering methods: • • Assembly lines are another example of the pipelining technique of multi-stage processing. Links to other areas: • • • • Most sorting algorithms are designed for sequential processing. How might other algorithms be adapted for parallel processing? FPGAs are constructed as arrays of programmable (disjoint or wilton) “switch boxes”, like the smaller switches used to compose larger switching systems in this lecture This lecture drew on sorting algorithms & randomisation (like Ethernet) A key change from HTTP/1.1 to /2 is reducing “Head of line blocking” Independent learning: • Read about data center applications: Clos Networks: What's Old Is New Again Copyright Copyright©©Tim Moors 2016