Communication Performance in Network-on-Chips
Axel Jantsch, Royal Institute of Technology (KTH), Stockholm
Network on Chip Seminar, Linköping, November 25, 2004

Overview
• Introduction
• Communication Performance
• Organizational Structure
• Interconnection Topologies
• Trade-offs in Network Topology
• Routing

Introduction: Interconnection Network
• Topology: how switches and nodes are connected
• Routing algorithm: determines the route from source to destination
• Switching strategy: how a message traverses the route
• Flow control: schedules the traversal of the message over time
[Figure: two nodes, each consisting of a processor (P), memory (Mem), communication assist, and network interface, connected through the network.]

Basic Definitions
Message is the basic communication entity.
Flit is the basic flow control unit. A message consists of one or many flits.
Phit is the basic unit of the physical layer.
Direct network is a network where each switch connects to a node.
Indirect network is a network with switches not connected to any node.
Hop is the basic communication action from node to switch or from switch to switch.
Diameter is the length of the maximum shortest path between any two nodes, measured in hops.
Routing distance between two nodes is the number of hops on a route.
Average distance is the average of the routing distance over all pairs of nodes.

Basic Switching Techniques
Circuit Switching: A real or virtual circuit establishes a direct connection between source and destination.
Packet Switching: Each packet of a message is routed independently. The destination address has to be provided with each packet.
Store and Forward Packet Switching: The entire packet is stored and then forwarded at each switch.
Cut Through Packet Switching: The flits of a packet are pipelined through the network. The packet is not completely buffered in each switch.
Virtual Cut Through Packet Switching: The entire packet is stored in a switch only when the header flit is blocked due to congestion.
Wormhole Switching: Cut through switching where all flits are blocked on the spot when the header flit is blocked.

Latency
[Figure: four nodes A-D communicating through switches 1-4.]
Time(n) = Admission + ChannelOccupancy + RoutingDelay + ContentionDelay
Admission is the time it takes to emit the message into the network.
ChannelOccupancy is the time a channel is occupied.
RoutingDelay is the delay for the route.
ContentionDelay is the delay of a message due to contention.

Channel Occupancy
ChannelOccupancy = (n + n_E) / b
n ... message size in bits
n_E ... envelope size in bits
b ... raw bandwidth of the channel

Routing Delay
Store and Forward:                          T_sf(n, h) = h(n/b + Δ)
Circuit Switching:                          T_cs(n, h) = n/b + hΔ
Cut Through:                                T_ct(n, h) = n/b + hΔ
Store and Forward with fragmented packets:  T_sf(n, h, n_p) = (n − n_p)/b + h(n_p/b + Δ)
n ... message size in bits
n_p ... size of message fragments in bits
h ... number of hops
b ... raw bandwidth of the channel
Δ ... switching delay per hop
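These routing-delay formulas are easy to evaluate directly; the following minimal Python sketch (function names are illustrative, not from the slides) contrasts store-and-forward with cut-through and previews the plots on the next slide:

```python
def t_store_and_forward(n, h, b, delta):
    """T_sf(n, h) = h * (n/b + delta): the whole packet is retransmitted at every hop."""
    return h * (n / b + delta)

def t_cut_through(n, h, b, delta):
    """T_ct(n, h) = n/b + h*delta: flits are pipelined, only the header pays the per-hop delay."""
    return n / b + h * delta

# Example: 256-byte packet, 5 hops, 32-bit/cycle channels, 1-cycle switch delay.
n, h, b, delta = 256 * 8, 5, 32, 1
print(t_store_and_forward(n, h, b, delta))  # 325.0 cycles
print(t_cut_through(n, h, b, delta))        # 69.0 cycles
```

The gap grows with both packet size and hop count, which is exactly the behavior the plots below show.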
Routing Delay: Store and Forward vs Cut Through
[Figure: four plots of average latency, store and forward vs cut through. Top: latency vs packet size in bytes for d=2, k=10 with b=1 and b=32. Bottom: latency vs number of nodes for k=2, m=8 and for d=2, m=8.]

Local and Global Bandwidth
Local bandwidth = n / (n + n_E + wΔ) · b
Total bandwidth = C·b [bits/second] = C·w [bits/cycle] = C [phits/cycle]
Bisection bandwidth ... the minimum aggregate bandwidth over all cuts that divide the network into two equal parts.
b ... raw bandwidth of a link
n ... message size
n_E ... size of message envelope
w ... link bandwidth per cycle
Δ ... switching time for each switch in cycles
wΔ ... bandwidth lost during switching
C ... total number of channels
For a k×k mesh with bidirectional channels:
Total bandwidth = (4k² − 4k)b
Bisection bandwidth = 2kb

Link and Network Utilization
Total load on the network:  L = N·h·l / M  [phits/cycle]
Load per channel:           ρ = N·h·l / (M·C)  [phits/cycle]  ≤ 1
M ... each host issues a packet every M cycles
C ... number of channels
N ... number of nodes
h ... average routing distance
l = n/w ... number of cycles a message occupies a channel
n ... average message size
w ... bitwidth per channel

Network Saturation
[Figure: latency vs delivered bandwidth, and delivered bandwidth vs offered bandwidth; both curves break at the network saturation point.]
Typical saturation points are between 40% and 70%. The saturation point depends on
• Traffic pattern
• Stochastic variations in traffic
• Routing algorithm

Organizational Structure
• Link
• Switch
• Network Interface

Link
Short link: At any time there is only one data word on the link.
Long link: Several data words can travel on the link simultaneously.
Narrow link: Data and control information are multiplexed on the same wires.
Wide link: Data and control information are transmitted in parallel and simultaneously.
Synchronous clocking: Both source and destination operate on the same clock.
Asynchronous clocking: The clock is encoded in the transmitted data to allow the receiver to sample at the right time instant.

Switch
[Figure: switch structure with receivers and input buffers on the input ports, a crossbar, output buffers and transmitters on the output ports, and a control unit for routing and scheduling.]
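As a concrete illustration of this structure, here is a minimal input-buffered switch model in Python (an illustrative sketch, not code from the lecture; the class, the flit format, and the fixed-priority arbitration are assumptions):

```python
from collections import deque

class Switch:
    """Minimal input-buffered switch sketch: one FIFO per input port,
    a routing function mapping a flit to an output port, and a crossbar
    that forwards at most one flit per output port per cycle."""

    def __init__(self, n_ports, route):
        self.inputs = [deque() for _ in range(n_ports)]
        self.route = route                    # route(flit) -> output port

    def accept(self, port, flit):
        self.inputs[port].append(flit)        # receiver fills the input buffer

    def cycle(self):
        """One scheduling round with fixed-priority arbitration: each output
        port is granted to the lowest-numbered input whose head flit requests
        it; all other inputs stall (head-of-line blocking)."""
        grants = {}
        for i, q in enumerate(self.inputs):
            if q:
                grants.setdefault(self.route(q[0]), i)
        return {out: self.inputs[i].popleft() for out, i in grants.items()}

# Example: 2-port switch where a flit is a (destination port, payload) pair.
sw = Switch(2, route=lambda flit: flit[0])
sw.accept(0, (1, "a")); sw.accept(1, (1, "b"))
print(sw.cycle())   # {1: (1, 'a')} -- input 1 stalls this cycle
print(sw.cycle())   # {1: (1, 'b')}
```

A real NoC switch replaces the fixed-priority loop with a proper output scheduler and adds flow control; these issues are listed on the next slide.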
Switch Design Issues
Degree: number of inputs and outputs
Buffering
• Input buffers
• Output buffers
• Shared buffers
Routing
• Source routing
• Deterministic routing
• Adaptive routing
Output scheduling
Deadlock handling
Flow control

Network Interface
• Admission protocol
• Reception obligations
• Buffering
• Assembling and disassembling of messages
• Routing
• Higher level services and protocols

Interconnection Topologies
• Fully connected networks
• Linear arrays and rings
• Multidimensional meshes and tori
• Trees
• Butterflies

Fully Connected Networks
Bus:
  switch degree        = N
  diameter             = 1
  distance             = 1
  network cost         = O(N)
  total bandwidth      = b
  bisection bandwidth  = b
Crossbar:
  switch degree        = N
  diameter             = 1
  distance             = 1
  network cost         = O(N²)
  total bandwidth      = N·b
  bisection bandwidth  = N·b

Linear Arrays and Rings
[Figure: linear array, torus (ring), and folded torus.]
Linear array:
  switch degree        = 2
  diameter             = N − 1
  distance             ∼ (2/3)N
  network cost         = O(N)
  total bandwidth      = 2(N − 1)b
  bisection bandwidth  = 2b
Torus (ring):
  switch degree        = 2
  diameter             = N/2
  distance             ∼ (1/3)N
  network cost         = O(N)
  total bandwidth      = 2N·b
  bisection bandwidth  = 4b

Multidimensional Meshes and Tori
[Figure: 2-d mesh, 2-d torus, 3-d cube.]
k-ary d-cubes are d-dimensional tori with unidirectional links and k nodes in each dimension:
  number of nodes      N = k^d
  switch degree        = d
  diameter             = d(k − 1)
  distance             ∼ d(k − 1)/2
  network cost         = O(N)
  total bandwidth      = 2N·b
  bisection bandwidth  = 2k^(d−1)·b

Routing Distance in k-ary n-Cubes
[Figure: network scalability with respect to distance; average distance vs number of nodes for k=2 and for d=4, 3, 2.]

Projecting High Dimensional Cubes
[Figure: 2-d layouts of 2-ary 2-, 3-, 4-, and 5-cubes.]

Binary Trees
[Figure: binary tree with levels 0 to 4.]
  number of nodes      N = 2^d
  number of switches   = 2^d − 1
  switch degree        = 3
  diameter             = 2d
  distance             ∼ d + 2
  network cost         = O(N)
  total bandwidth      = 2 · 2(N − 1)b
  bisection bandwidth  = 2b

k-ary Trees
[Figure: k-ary tree with levels 0 to 4.]
  number of nodes      N = k^d
  number of switches   ∼ k^d
  switch degree        = k + 1
  diameter             = 2d
  distance             ∼ d + 2
  network cost         = O(N)
  total bandwidth      = 2 · 2(N − 1)b
  bisection bandwidth  = k·b

Binary Tree Projection
• Efficient and regular 2-d layout;
• Longest wires, measured in resource widths: l_W = 2^(⌊(d−1)/2⌋ − 1)

  d    2  3  4   5   6   7    8    9    10
  N    4  8  16  32  64  128  256  512  1024
  l_W  0  1  1   2   2   4    4    8    8
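The l_W row is easy to regenerate; a small sketch under the assumption that the formula above is 2^(⌊(d−1)/2⌋ − 1) with fractional values (d = 2) truncated to 0:

```python
def longest_wire(d):
    """Longest wire of the 2-d binary tree layout, in resource widths:
    l_W = 2^(floor((d-1)/2) - 1); for d = 2 the value 2^-1 truncates to 0."""
    return int(2 ** ((d - 1) // 2 - 1))

for d in range(2, 11):
    print(f"d={d:2}  N={2**d:5}  l_W={longest_wire(d)}")
```

The longest wire doubles only every second tree level, i.e. it grows as Θ(√N) in resource widths, which is what makes the regular 2-d layout efficient.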
k-ary n-Cubes versus k-ary Trees
k-ary n-cubes:
  number of nodes      N = k^d
  switch degree        = d
  diameter             = d(k − 1)
  distance             ∼ d(k − 1)/2
  network cost         = O(N)
  total bandwidth      = 2N·b
  bisection bandwidth  = 2k^(d−1)·b
k-ary trees:
  number of nodes      N = k^d
  number of switches   ∼ k^d
  switch degree        = k + 1
  diameter             = 2d
  distance             ∼ d + 2
  network cost         = O(N)
  total bandwidth      = 2 · 2(N − 1)b
  bisection bandwidth  = k·b

Butterflies
[Figure: 2×2 butterfly building block and a 16-node butterfly with levels 0 to 4.]

Butterfly Characteristics
  number of nodes      N = 2^d
  number of switches   = 2^(d−1)·d
  switch degree        = 2
  diameter             = d + 1
  distance             = d + 1
  network cost         = O(N·d)
  total bandwidth      = 2^d·d·b
  bisection bandwidth  = (N/2)·b

k-ary n-Cubes versus k-ary Trees versus Butterflies
                                      k-ary n-cubes    binary tree   butterfly
  cost                                O(N)             O(N)          O(N log N)
  distance                            (d/2) · ᵈ√N      2 log N       log N
  links per node                      2                1             2 log N
  bisection                           2N^((d−1)/d)     1             N/2
  frequency limit of random traffic   1/(ᵈ√N / 2)      1/N           1/2

Problems with Butterflies
• Cost of the network:
  – O(N log N)
  – 2-d layout is more difficult than for binary trees
  – the number of long wires grows faster than for trees
• For each source-destination pair there is only one route.
• Each route blocks many other routes.

Benes Networks
• Many routes;
• Costly to compute non-blocking routes;
• High probability of a non-blocking route by randomly selecting an intermediate node [Leighton, 1992].

Fat Trees
[Figure: 16-node 2-ary fat tree with fat nodes toward the root.]

k-ary n-dimensional Fat Tree Characteristics
  number of nodes      N = k^d
  number of switches   = k^(d−1)·d
  switch degree        = 2k
  diameter             = 2d
  distance             ∼ d
  network cost         = O(N·d)
  total bandwidth      = 2k^d·d·b
  bisection bandwidth  = 2k^(d−1)·b

k-ary n-Cubes versus k-ary d-dimensional Fat Trees
k-ary n-cubes:
  number of nodes      N = k^d
  switch degree        = d
  diameter             = d(k − 1)
  distance             ∼ d(k − 1)/2
  network cost         = O(N)
  total bandwidth      = 2N·b
  bisection bandwidth  = 2k^(d−1)·b
k-ary n-dimensional fat trees:
  number of nodes      N = k^d
  number of switches   = k^(d−1)·d
  switch degree        = 2k
  diameter             = 2d
  distance             ∼ d
  network cost         = O(N·d)
  total bandwidth      = 2k^d·d·b
  bisection bandwidth  = 2k^(d−1)·b

Relation between Fat Tree and Hypercube
[Figure: a binary 2-dimensional fat tree contains a binary 1-cube; a binary 3-dimensional fat tree contains two binary 2-cubes; a binary 4-dimensional fat tree contains two binary 3-cubes.]
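These side-by-side characteristics are purely formulaic, so they can be tabulated for any (k, d); a minimal sketch (function and key names are illustrative assumptions, the formulas are the slide values above):

```python
def cube_metrics(k, d, b=1):
    """k-ary d-cube characteristics as listed above."""
    return {"nodes": k ** d,
            "switch degree": d,
            "diameter": d * (k - 1),
            "avg distance": d * (k - 1) / 2,
            "bisection bw": 2 * k ** (d - 1) * b}

def fat_tree_metrics(k, d, b=1):
    """k-ary d-dimensional fat tree characteristics as listed above."""
    return {"nodes": k ** d,
            "switches": d * k ** (d - 1),
            "switch degree": 2 * k,
            "diameter": 2 * d,
            "bisection bw": 2 * k ** (d - 1) * b}

# 1024 nodes either way: a 32-ary 2-cube vs a 2-ary 10-dimensional fat tree.
print(cube_metrics(32, 2))      # diameter 62, bisection 64·b
print(fat_tree_metrics(2, 10))  # diameter 20, bisection 1024·b
```

For equal node counts the fat tree buys its much shorter diameter and larger bisection with a higher switch count and O(N·d) cost.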
Topologies of Parallel Computers
  Machine        Topology    Cycle time [ns]  Channel width [bits]  Routing delay [cycles]  Flit size [bits]
  nCUBE/2        Hypercube   25               1                     40                      32
  TMC CM-5       Fat tree    25               4                     10                      4
  IBM SP-2       Banyan      25               8                     5                       16
  Intel Paragon  2D mesh     11.5             16                    2                       16
  Meiko CS-2     Fat tree    20               8                     7                       8
  Cray T3D       3D torus    6.67             16                    2                       16
  DASH           Torus       30               16                    2                       16
  J-Machine      3D mesh     31               8                     2                       8
  Monsoon        Butterfly   20               16                    2                       16
  SGI Origin     Hypercube   2.5              20                    16                      160
  Myricom        Arbitrary   6.25             16                    50                      16

Trade-offs in Topology Design for the k-ary n-Cube
• Unloaded latency
• Latency under load

Network Scaling for Unloaded Latency
Latency(n) = Admission + ChannelOccupancy + RoutingDelay + ContentionDelay
RoutingDelay:    T_ct(n, h) = n/b + hΔ
RoutingDistance: h = (d/2)(k − 1) = ((k − 1)/2) log_k N = (d/2)(ᵈ√N − 1)
[Figure: network scalability with respect to latency; average latency vs number of nodes (up to 10000) for k=2 and d=5, 4, 3, 2, with message sizes m=32 and m=128.]

Unloaded Latency for Small Networks and Local Traffic
[Figure: average latency vs number of nodes for small networks (up to 100 nodes, m=128) and for local traffic (m=128, h=dk/5); the curves for k=2 and d=5, 4, 3, 2 lie close together.]

Unloaded Latency under a Free-Wire Cost Model
Free-wire cost model: wires are free and can be added without penalty.
[Figure: average latency vs dimension (2-7) for N=64, 128, 256, 1K, 16K; m=32 and m=128; b=32.]

Unloaded Latency under a Fixed-Wire Cost Model
Fixed-wire cost model: the number of wires per node is constant. With 128 wires per node: w(d) = ⌊64/d⌋.
  d     2   3   4   5   6   7  8  9  10
  w(d)  32  21  16  12  10  9  8  7  6
[Figure: average latency vs dimension (2-10) for N=64, 128, 256, 1K, 16K; m=32 and m=128; b=64/d.]
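The channel widths implied by this fixed-wire budget, and by the fixed-bisection budget introduced on the next slide, can be tabulated with a small sketch (function names and the rounding conventions are assumptions chosen to match the tables):

```python
def w_fixed_wire(d, wires_per_node=128):
    """Fixed-wire model: 2*d channels per node share the wire budget,
    so w(d) = floor(64/d) for 128 wires per node."""
    return wires_per_node // (2 * d)

def w_fixed_bisection(d, n=1024, bisection_wires=1024):
    """Fixed-bisection model for a k-ary d-cube: 2*k^(d-1) channels cross
    the bisection, so w(d) = 1024 / (2*k^(d-1)) = k/2 for N = 1024."""
    k = n ** (1 / d)                    # nodes per dimension (may be non-integral)
    return max(1, round(bisection_wires / (2 * k ** (d - 1))))

for d in range(2, 11):
    print(f"d={d:2}  fixed-wire w={w_fixed_wire(d):2}  fixed-bisection w={w_fixed_bisection(d)}")
```

Both budgets punish high dimensions with narrow channels, but the fixed-bisection constraint does so far more aggressively, as the next slide's table shows.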
Unloaded Latency under a Fixed-Bisection Cost Model
Fixed-bisection cost model: the number of wires across the bisection is constant. With a bisection of 1024 wires: w(d) = k/2 = ᵈ√N / 2.
Example (N = 1024):
  d     1    2   3  4  5  6  7  8  9  10
  w(d)  512  16  5  3  2  2  1  1  1  1
[Figure: average latency vs dimension (2-10) for N=64, 128, 256, 1K, 16K; m=32B and m=128B; b=k/2.]

Unloaded Latency under a Logarithmic Wire Delay Cost Model
Fixed-bisection logarithmic wire delay cost model: the number of wires across the bisection is constant and the delay on wires increases logarithmically with their length [Dally, 1990]:
Length of the long wires: l = k^(d/2 − 1)
T_c ∝ 1 + log l = 1 + (d/2 − 1) log k
[Figure: average latency vs dimension (2-10) for N=64, 128, 256, 1K, 16K; m=32B and m=128B; b=k/2.]

Unloaded Latency under a Linear Wire Delay Cost Model
Fixed-bisection linear wire delay cost model: the number of wires across the bisection is constant and the delay on wires increases linearly with their length [Dally, 1990]:
Length of the long wires: l = k^(d/2 − 1)
T_c ∝ l = k^(d/2 − 1)
[Figure: average latency vs dimension (2-10) for N=64, 128, 256, 1K, 16K; m=32B and m=128B; b=k/2.]

Latency under Load
Assumptions [Agarwal, 1991]:
• k-ary n-cubes
• random traffic
• dimension-order cut-through routing
• unbounded internal buffers (to ignore flow control and deadlock issues)

Latency under Load - cont'd
Latency(n) = Admission + ChannelOccupancy + RoutingDelay + ContentionDelay
T(m, k, d, w, ρ) = RoutingDelay + ContentionDelay
T(m, k, d, w, ρ) = m/w + d·h_k·(Δ + W(m, k, d, w, ρ))
W(m, k, d, w, ρ) = (m/w) · (ρ/(1 − ρ)) · ((h_k − 1)/h_k²) · (1 + 1/d)
h_k = (k − 1)/2
m ... message size
w ... bitwidth of link
ρ ... aggregate channel utilization
h_k ... average distance in each dimension
Δ ... switching time in cycles

Latency vs Channel Load
[Figure: average latency vs channel utilization (w=8, Δ=1) for message sizes m=8, 32, 128 and networks d=3/k=10 and d=2/k=32; latency grows sharply as utilization approaches 1.]
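In code the model reads as follows (a direct transcription of the two formulas above; parameter names follow the slide, the function name is made up):

```python
def latency_under_load(m, k, d, w, rho, delta=1):
    """T = m/w + d*h_k*(delta + W), with Agarwal's contention term
    W = (m/w) * rho/(1-rho) * (h_k-1)/h_k**2 * (1 + 1/d),
    for random traffic and dimension-order cut-through routing."""
    hk = (k - 1) / 2                 # average distance per dimension
    W = (m / w) * (rho / (1 - rho)) * (hk - 1) / hk**2 * (1 + 1 / d)
    return m / w + d * hk * (delta + W)

# One curve of the plot above: m=128, d=3, k=10, w=8.
for rho in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"rho={rho:.1f}  T={latency_under_load(128, 10, 3, 8, rho):7.1f}")
```

The (1 − ρ) term in the denominator is what makes the latency blow up near saturation, consistent with the 40-70% saturation points quoted earlier.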
Routing
Deterministic routing: The route is determined solely by source and destination locations.
Arithmetic routing: The destination address of the incoming packet is compared with the address of the switch and the packet is routed accordingly (relative or absolute addresses).
Source based routing: The source determines the route and builds a header with one directive for each switch. The switches strip off the top directive.
Table-driven routing: Switches have routing tables, which can be configured.
Adaptive routing: The route can be adapted by the switches to balance the load.
Minimal routing allows only shortest paths, while non-minimal routing also allows longer paths.

Deadlock
Deadlock: Two or several packets mutually block each other, waiting for resources that can never become free.
Livelock: A packet keeps moving through the network but never reaches its destination.
Starvation: A packet never gets a resource because it always loses the competition for that resource (a fairness problem).

Deadlock Situations
• Head-on deadlock;
• Nodes stop receiving packets;
• Contention for switch buffers can occur with store-and-forward, virtual cut-through and wormhole routing. Wormhole routing is particularly susceptible.
• Deadlock cannot occur in butterflies;
• Deadlock cannot occur in trees or fat trees if upward and downward channels are independent;
• Dimension-order routing is deadlock free on k-ary n-dimensional arrays (meshes) but not on tori for any n ≥ 1.

Deadlock in a 1-dimensional Torus
[Figure: nodes A-D on a ring; message 1 (10 flits) from C to B and message 2 (10 flits) from A to D deadlock each other.]

Channel Dependence Graph for Dimension Order Routing
[Figure: a 4-ary 2-array with numbered channels and the corresponding channel dependence graph.]
Routing is deadlock free if the channel dependence graph has no cycles.

Deadlock-free Routing
• Two main approaches:
  – restrict the legal routes;
  – restrict how resources are allocated;
• Number the channels cleverly.
• Construct the channel dependence graph.
• Prove that all legal routes follow a strictly increasing path in the channel dependence graph.

Virtual Channels
[Figure: switch with several virtual channel buffers per input port, multiplexed through the crossbar to the output ports.]
• Virtual channels can be used to break cycles in the dependence graph.
• E.g. all n-dimensional tori can be made deadlock free under dimension-order routing by assigning all wrap-around paths to a different virtual channel than other links.

Virtual Channels and Deadlocks
[Figure: virtual channels resolving a deadlock situation.]

Turn-Model Routing
What are the minimal routing restrictions that make routing deadlock free?
[Figure: the allowed turns of the north-last, west-first, and negative-first schemes.]
• Three minimal routing restriction schemes:
  – North-last
  – West-first
  – Negative-first
• They allow complex, non-minimal adaptive routes (see the sketch below).
• Unidirectional k-ary n-cubes still need virtual channels.
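As a sketch of one of these schemes, the following function emits one legal west-first route on a 2-d mesh (the coordinate convention, move names, and the particular interleaving of the non-west hops are illustrative assumptions):

```python
def west_first_route(src, dst):
    """West-first turn model on a 2-d mesh: all westward (-x) hops are
    taken first; afterwards east, north and south hops may be mixed in
    any order without creating a deadlock cycle. A fixed order is used
    here; an adaptive router would choose among them based on load."""
    (sx, sy), (dx, dy) = src, dst
    moves  = ["W"] * max(0, sx - dx)   # mandatory west-first phase
    moves += ["E"] * max(0, dx - sx)   # remaining hops, freely orderable
    moves += ["N"] * max(0, dy - sy)   # (north = +y by assumption)
    moves += ["S"] * max(0, sy - dy)
    return moves

print(west_first_route((3, 1), (1, 4)))  # ['W', 'W', 'N', 'N', 'N']
print(west_first_route((0, 2), (2, 0)))  # ['E', 'E', 'S', 'S']
```

Once the westward phase is over, no turn back to the west is ever needed, which is exactly the restriction that breaks all dependence cycles.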
Adaptive Routing
• The switch makes routing decisions based on the load.
• Fully adaptive routing allows all shortest paths.
• Partially adaptive routing allows only a subset of the shortest paths.
• Non-minimal adaptive routing also allows non-minimal paths.
• Hot-potato routing is non-minimal adaptive routing without packet buffering.

Summary
• Communication performance: bandwidth, unloaded latency, loaded latency.
• Organizational structure: network interface, switch, link.
• Topologies: wire space and delay domination favors low-dimension topologies.
• Routing: deterministic vs source based vs adaptive routing; deadlock.

Issues beyond the Scope of this Lecture
• Switch: buffering; output scheduling; flow control
• Flow control: link level and end-to-end control
• Power
• Clocking
• Faults and reliability
• Memory architecture and I/O
• Application specific communication patterns
• Services offered to applications; quality of service

NoC Research Projects
• Nostrum at KTH
• Æthereal at Philips Research
• Proteo at Tampere University of Technology
• SPIN at UPMC/LIP6 in Paris
• XPipes at Bologna U
• Octagon at ST and UC San Diego

To Probe Further - Books and Classic Papers
[Agarwal, 1991] Agarwal, A. (1991). Limits on interconnection network performance. IEEE Transactions on Parallel and Distributed Systems, 4(6):613-624.
[Culler et al., 1999] Culler, D. E., Singh, J. P., and Gupta, A. (1999). Parallel Computer Architecture - A Hardware/Software Approach. Morgan Kaufmann Publishers.
[Dally, 1990] Dally, W. J. (1990). Performance analysis of k-ary n-cube interconnection networks. IEEE Transactions on Computers, 39(6):775-785.
[Duato et al., 1998] Duato, J., Yalamanchili, S., and Ni, L. (1998). Interconnection Networks - An Engineering Approach. IEEE Computer Society Press, Los Alamitos, California.
[Leighton, 1992] Leighton, F. T. (1992). Introduction to Parallel Algorithms and Architectures. Morgan Kaufmann, San Francisco.