Communication Performance in Network-on-Chips

Axel Jantsch
Royal Institute of Technology, Stockholm
November 24, 2004

Network on Chip Seminar, Linköping, November 25, 2004
Overview
Introduction
Communication Performance
Organizational Structure
Interconnection Topologies
Trade-offs in Network Topology
Routing
Introduction

Interconnection network:
• Topology: how switches and nodes are connected
• Routing algorithm: determines the route from source to destination
• Switching strategy: how a message traverses the route
• Flow control: schedules the traversal of the message over time

[Figure: two processing nodes, each with processor (P), memory (Mem),
communication assist, and network interface, attached to the network]
Basic Definitions

Message is the basic communication entity.
Flit is the basic flow control unit. A message consists of one or more flits.
Phit is the basic unit of the physical layer.
Direct network is a network where each switch connects to a node.
Indirect network is a network with switches not connected to any node.
Hop is the basic communication action from node to switch or from switch to switch.
Diameter is the length of the maximum shortest path between any two nodes, measured in hops.
Routing distance between two nodes is the number of hops on a route.
Average distance is the average of the routing distance over all pairs of nodes.
Basic Switching Techniques

Circuit Switching: A real or virtual circuit establishes a direct connection between source and destination.
Packet Switching: Each packet of a message is routed independently. The destination address has to be provided with each packet.
Store and Forward Packet Switching: The entire packet is stored and then forwarded at each switch.
Cut Through Packet Switching: The flits of a packet are pipelined through the network. The packet is not completely buffered in each switch.
Virtual Cut Through Packet Switching: The entire packet is stored in a switch only when the header flit is blocked due to congestion.
Wormhole Switching: Cut-through switching in which all flits are blocked in place when the header flit is blocked.
Latency
[Figure: nodes A–D attached along a route through switches 1–4]
Time(n) = Admission + ChannelOccupancy + RoutingDelay + ContentionDelay
Admission is the time it takes to emit the message into the network.
ChannelOccupancy is the time a channel is occupied.
RoutingDelay is the delay for the route.
ContentionDelay is the delay of a message due to contention.
Channel Occupancy

ChannelOccupancy = (n + nE) / b

n ... message size in bits
nE ... envelope size in bits
b ... raw bandwidth of the channel
Routing Delay

Store and Forward:                          Tsf(n, h) = h(n/b + ∆)

Circuit Switching:                          Tcs(n, h) = n/b + h∆

Cut Through:                                Tct(n, h) = n/b + h∆

Store and Forward with fragmented packets:  Tsf(n, h, np) = (n − np)/b + h(np/b + ∆)

n ... message size in bits
np ... size of message fragments in bits
h ... number of hops
b ... raw bandwidth of the channel
∆ ... switching delay per hop
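
A minimal sketch of these four delay models in Python (symbols as defined
above; the example values are illustrative, not from the slides):

    # Routing delay models; sizes in bits, b in bits/cycle, delays in cycles.

    def t_store_and_forward(n, h, b, delta):
        # The whole packet is buffered at each of the h hops.
        return h * (n / b + delta)

    def t_circuit_switching(n, h, b, delta):
        # The path is set up once (h * delta), then data streams at rate b.
        return n / b + h * delta

    def t_cut_through(n, h, b, delta):
        # Flits are pipelined; only the header pays the per-hop delay.
        return n / b + h * delta

    def t_sf_fragmented(n, n_p, h, b, delta):
        # Store-and-forward on fragments of n_p bits each.
        return (n - n_p) / b + h * (n_p / b + delta)

    # Example: 1024-bit message, 8 hops, 32 bits/cycle links, 2-cycle switches.
    print(t_store_and_forward(1024, 8, 32, 2))  # 272.0 cycles
    print(t_cut_through(1024, 8, 32, 2))        # 48.0 cycles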
Routing Delay: Store and Forward vs Cut Through
[Four plots comparing store-and-forward and cut-through switching: average
latency vs. packet size in bytes for (d=2, k=10, b=1) and (d=2, k=10, b=32),
and average latency vs. number of nodes for (k=2, m=8) and (d=2, m=8)]
Local and Global Bandwidth
Local bandwidth = b · n / (n + nE + w∆)

Total bandwidth = Cb [bits/second] = Cw [bits/cycle] = C [phits/cycle]

Bisection bandwidth ... minimum bandwidth across any cut that divides the
network into two equal parts.

b ... raw bandwidth of a link
n ... message size
nE ... size of message envelope
w ... link bandwidth per cycle
∆ ... switching time for each switch in cycles
w∆ ... bandwidth lost during switching
C ... total number of channels

For a k × k mesh with bidirectional channels:
Total bandwidth = (4k² − 4k)b
Bisection bandwidth = 2kb
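
The mesh example can be checked with a small sketch (illustrative, with b
in arbitrary bandwidth units):

    def mesh_bandwidth(k, b):
        # k x k mesh with bidirectional channels: 2k(k-1) links, each
        # counted in both directions; the bisection cuts k bidirectional links.
        total = (4 * k**2 - 4 * k) * b
        bisection = 2 * k * b
        return total, bisection

    print(mesh_bandwidth(4, 1))  # (48, 8)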
Link and Network Utilization
Total load on the network: L = N·h·l/M [phits/cycle]

Load per channel: ρ = N·h·l/(M·C) [phits/cycle] ≤ 1

M ... each host issues a packet every M cycles
C ... number of channels
N ... number of nodes
h ... average routing distance
l = n/w ... number of cycles a message occupies a channel
n ... average message size
w ... bitwidth per channel
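
A sketch of the load formulas (the parameter values in the example are
illustrative assumptions, e.g. an 8 × 8 mesh with 224 directed channels):

    def channel_load(N, C, M, h, n, w):
        # rho = N*h*l / (M*C) with l = n/w; rho must stay below 1, and in
        # practice saturation sets in well before that (40-70%).
        l = n / w
        total_load = N * h * l / M     # phits/cycle offered to the network
        rho = total_load / C
        return total_load, rho

    print(channel_load(N=64, C=224, M=100, h=5.33, n=256, w=32))
    # roughly (27.3, 0.12): about 12% channel load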
Network Saturation
[Two plots: latency vs. delivered bandwidth, with latency rising steeply at
the network saturation point; and delivered vs. offered bandwidth, levelling
off at the saturation point]
Typical saturation points are between 40% and 70%.
The saturation point depends on
• Traffic pattern
• Stochastic variations in traffic
• Routing algorithm
Organizational Structure
• Link
• Switch
• Network Interface
Link

Short link: At any time there is only one data word on the link.
Long link: Several data words can travel on the link simultaneously.
Narrow link: Data and control information are multiplexed on the same wires.
Wide link: Data and control information are transmitted in parallel and simultaneously.
Synchronous clocking: Both source and destination operate on the same clock.
Asynchronous clocking: The clock is encoded in the transmitted data to allow the receiver to sample at the right time instant.
Switch
[Figure: switch structure: input ports with receivers and input buffers,
a crossbar, output ports with output buffers and transmitters, and a
control unit for routing and scheduling]
Switch Design Issues
Degree: number of inputs and outputs;
Buffering
• Input buffers
• Output buffers
• Shared buffers
Routing
• Source routing
• Deterministic routing
• Adaptive routing
Output scheduling
Deadlock handling
Flow control
Network Interface
• Admission protocol
• Reception obligations
• Buffering
• Assembling and disassembling of messages
• Routing
• Higher level services and protocols
Interconnection Topologies
• Fully connected networks
• Linear arrays and rings
• Multidimensional meshes and tori
• Trees
• Butterflies
Fully Connected Networks
[Figure: nodes attached to a shared bus; nodes connected through a crossbar]

Bus:       switch degree       = N
           diameter            = 1
           distance            = 1
           network cost        = O(N)
           total bandwidth     = b
           bisection bandwidth = b

Crossbar:  switch degree       = N
           diameter            = 1
           distance            = 1
           network cost        = O(N²)
           total bandwidth     = Nb
           bisection bandwidth = Nb
Linear Arrays and Rings
[Figure: linear array, torus, and folded torus]

Linear array:  switch degree       = 2
               diameter            = N − 1
               distance            ∼ (2/3)N
               network cost        = O(N)
               total bandwidth     = 2(N − 1)b
               bisection bandwidth = 2b

Torus:         switch degree       = 2
               diameter            = N/2
               distance            ∼ (1/3)N
               network cost        = O(N)
               total bandwidth     = 2Nb
               bisection bandwidth = 4b
Multidimensional Meshes and Tori
[Figure: 2-d mesh, 2-d torus, and 3-d cube]

k-ary d-cubes are d-dimensional tori with unidirectional links and k nodes
in each dimension:

number of nodes N   = k^d
switch degree       = d
diameter            = d(k − 1)
distance            ∼ d(k − 1)/2
network cost        = O(N)
total bandwidth     = 2Nb
bisection bandwidth = 2k^(d−1) b
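
These characteristics are easy to tabulate; a small sketch (b defaults to
one bandwidth unit):

    def cube_metrics(k, d, b=1):
        # Characteristics of a k-ary d-cube, per the table above.
        N = k**d
        return {
            "nodes": N,
            "switch degree": d,
            "diameter": d * (k - 1),
            "avg distance": d * (k - 1) / 2,
            "total bandwidth": 2 * N * b,
            "bisection bandwidth": 2 * k**(d - 1) * b,
        }

    print(cube_metrics(k=4, d=3))
    # 64 nodes, diameter 9, average distance 4.5, bisection bandwidth 32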
Routing Distance in k-ary n-Cubes
[Plot: network scalability with respect to distance: average distance vs.
number of nodes (up to 1200) for k=2 and for d=4, d=3, d=2]
Projecting High Dimensional Cubes
[Figure: 2-d layouts of a 2-ary 2-cube, 2-ary 3-cube, 2-ary 4-cube, and
2-ary 5-cube]
Binary Trees
[Figure: binary tree with levels 4 down to 0]

number of nodes N   = 2^d
number of switches  = 2^d − 1
switch degree       = 3
diameter            = 2d
distance            ∼ d + 2
network cost        = O(N)
total bandwidth     = 2 · 2(N − 1)b
bisection bandwidth = 2b
k-ary Trees
[Figure: k-ary tree with levels 4 down to 0]

number of nodes N   = k^d
number of switches  ∼ k^d
switch degree       = k + 1
diameter            = 2d
distance            ∼ d + 2
network cost        = O(N)
total bandwidth     = 2 · 2(N − 1)b
bisection bandwidth = kb
Binary Tree Projection
• Efficient and regular 2-d layout;
• Longest wires, in resource widths: lW = 2^(⌊(d−1)/2⌋ − 1)

d    2  3  4   5   6   7    8    9    10
N    4  8  16  32  64  128  256  512  1024
lW   0  1  1   2   2   4    4    8    8
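
The table can be reproduced with an integer form of the formula (a sketch;
the integer division yields lW = 0 for d = 2, as in the table):

    # Longest wire of the regular 2-d binary-tree layout, in resource widths.
    for d in range(2, 11):
        l_w = (1 << ((d - 1) // 2)) // 2   # 2^(floor((d-1)/2) - 1)
        print(d, 2**d, l_w)                # e.g. 3 8 1, ..., 10 1024 8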
k-ary n-Cubes versus k-ary Trees
k-ary n-cubes:

number of nodes N   = k^d
switch degree       = d
diameter            = d(k − 1)
distance            ∼ d(k − 1)/2
network cost        = O(N)
total bandwidth     = 2Nb
bisection bandwidth = 2k^(d−1) b

k-ary trees:

number of nodes N   = k^d
number of switches  ∼ k^d
switch degree       = k + 1
diameter            = 2d
distance            ∼ d + 2
network cost        = O(N)
total bandwidth     = 2 · 2(N − 1)b
bisection bandwidth = kb
Butterflies
[Figure: 2×2 butterfly building block with inputs and outputs labelled 0
and 1, and a 16-node butterfly with levels 4 down to 0]
Butterfly Characteristics
[Figure: 16-node butterfly with levels 4 down to 0]

number of nodes N   = 2^d
number of switches  = 2^(d−1) d
switch degree       = 2
diameter            = d + 1
distance            = d + 1
network cost        = O(Nd)
total bandwidth     = 2^d d b
bisection bandwidth = (N/2) b
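
A sketch of these butterfly characteristics (b defaults to one bandwidth
unit):

    def butterfly_metrics(d, b=1):
        # Characteristics of a 2^d-node butterfly, per the table above.
        N = 2**d
        return {
            "nodes": N,
            "switches": 2**(d - 1) * d,
            "diameter": d + 1,
            "distance": d + 1,              # every route crosses all stages
            "total bandwidth": 2**d * d * b,
            "bisection bandwidth": (N // 2) * b,
        }

    print(butterfly_metrics(4))
    # 16 nodes, 32 switches, diameter 5, bisection bandwidth 8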
k-ary n-Cubes versus k-ary Trees versus Butterflies

                     k-ary n-cubes     binary tree   butterfly
cost per node        O(N)              O(N)          O(N log N)
distance             (d/2) N^(1/d)     2 log N       log N
links per node       2                 2             log N
bisection            2 N^((d−1)/d)     1             N/2
frequency limit of
random traffic       1/(N/2)^(1/d)     1/N           1/2
Problems with Butterflies
• Cost of the network
  – O(N log N)
  – 2-d layout is more difficult than for binary trees
  – Number of long wires grows faster than for trees
• For each source-destination pair there is only one route.
• Each route blocks many other routes.
Benes Networks
• Many routes;
• Costly to compute non-blocking routes;
• High probability of finding a non-blocking route by randomly selecting
  an intermediate node [Leighton, 1992];
Fat Trees

[Figure: 16-node 2-ary fat tree; the upper levels form "fat nodes" of
parallel switches]
k-ary n-dimensional Fat Tree Characteristics

[Figure: 16-node 2-ary fat tree with fat nodes]

number of nodes N   = k^d
number of switches  = k^(d−1) d
switch degree       = 2k
diameter            = 2d
distance            ∼ d
network cost        = O(Nd)
total bandwidth     = 2k^d d b
bisection bandwidth = 2k^(d−1) b
k-ary n-Cubes versus k-ary d-dimensional Fat Trees
k-ary n-cubes:

number of nodes N   = k^d
switch degree       = d
diameter            = d(k − 1)
distance            ∼ d(k − 1)/2
network cost        = O(N)
total bandwidth     = 2Nb
bisection bandwidth = 2k^(d−1) b

k-ary n-dimensional fat trees:

number of nodes N   = k^d
number of switches  = k^(d−1) d
switch degree       = 2k
diameter            = 2d
distance            ∼ d
network cost        = O(Nd)
total bandwidth     = 2k^d d b
bisection bandwidth = 2k^(d−1) b
Relation between Fat Tree and Hypercube
[Figure: a binary 2-dimensional fat tree and the corresponding binary 1-cube]
Relation between Fat Tree and Hypercube - cont’d
[Figure: a binary 3-dimensional fat tree and two binary 2-cubes]
Relation between Fat Tree and Hypercube - cont’d
[Figure: a binary 4-dimensional fat tree and two binary 3-cubes]
Topologies of Parallel Computers
Machine         Topology    Cycle time  Channel width  Routing delay  Flit size
                            [ns]        [bits]         [cycles]       [bits]
nCUBE/2         Hypercube   25          1              40             32
TMC CM-5        Fat tree    25          4              10             4
IBM SP-2        Banyan      25          8              5              16
Intel Paragon   2D Mesh     11.5        16             2              16
Meiko CS-2      Fat tree    20          8              7              8
Cray T3D        3D Torus    6.67        16             2              16
DASH            Torus       30          16             2              16
J-Machine       3D Mesh     31          8              2              8
Monsoon         Butterfly   20          16             2              16
SGI Origin      Hypercube   2.5         20             16             160
Myricom         Arbitrary   6.25        16             50             16
Trade-offs in Topology Design for the k-ary n-Cube
• Unloaded Latency
• Latency under Load
Network Scaling for Unloaded Latency
Latency(n) = Admission + ChannelOccupancy + RoutingDelay + ContentionDelay

RoutingDelay:    Tct(n, h) = n/b + h∆

RoutingDistance: h = d(k − 1)/2 = (k − 1) log_k N / 2 = d(N^(1/d) − 1)/2
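
A sketch of how the unloaded cut-through latency scales with dimension (the
values b = 32 and ∆ = 2 are illustrative assumptions, not from the slides):

    # Unloaded cut-through latency of an N-node k-ary d-cube under
    # random traffic: T = n/b + h*Delta with h = d*(k - 1)/2.
    def unloaded_latency(N, d, n, b=32, delta=2):
        k = N ** (1 / d)
        h = d * (k - 1) / 2
        return n / b + h * delta

    for d in (2, 3, 4, 5):
        print(d, round(unloaded_latency(N=1024, d=d, n=128 * 8), 1))
    # d=2: 94.0 ... d=5: 47.0 -- higher dimensions need fewer hops, but
    # the cost models below show what the extra wires cost.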
[Two plots: network scalability with respect to latency for m=32 and m=128;
average latency vs. number of nodes (up to 10000) for k=2 and for
d=5, 4, 3, 2]
Unloaded Latency for Small Networks and Local Traffic
[Three plots of average latency vs. number of nodes for m=128: small
networks (10–100 nodes), small networks with local traffic (h = dk/5), and
large networks (up to 10000 nodes) with local traffic; each for k=2 and
d=5, 4, 3, 2]
Unloaded Latency under a Free-Wire Cost Model
Free-wire cost model: wires are free and can be added without penalty.

[Two plots: average latency vs. dimension (2–7) under the free-wire cost
model, for m=32 and m=128 (b=32), with N = 64, 128, 256, 1K, and 16K]
Unloaded Latency under a Fixed-Wire Cost Model

Fixed-wire cost model: the number of wires per node is constant.
With 128 wires per node: w(d) = ⌊64/d⌋

d      2   3   4   5   6   7  8  9  10
w(d)   32  21  16  12  10  9  8  7  6

[Two plots: average latency vs. dimension (2–10) under the fixed-wire cost
model, for m=32 and m=128 (b=64/d), with N = 64 to 16K]
Unloaded Latency under a Fixed-Bisection Cost Model

Fixed-bisection cost model: the number of wires across the bisection is
constant. With a bisection of 1024 wires: w(d) = k/2 = N^(1/d)/2

Example, N = 1024:

d      2   3  4  5  6  7  8  9  10
w(d)   16  5  3  2  2  1  1  1  1

[Two plots: average latency vs. dimension (2–10) under the fixed-bisection
cost model, for m=32 B and m=128 B (b=k/2), with N = 64 to 16K]
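
Both wire-budget models reduce to a channel width w(d); a sketch that
reproduces the two tables above (the budgets of 128 wires per node and a
1024-wire bisection are those of the slides; the rounding is an assumption):

    def w_fixed_wire(d, wires_per_node=128):
        # Half the per-node wires in each direction: w(d) = floor(64/d).
        return (wires_per_node // 2) // d

    def w_fixed_bisection(d, N=1024):
        # Constant bisection: w(d) = k/2 with k = N^(1/d), at least 1.
        k = N ** (1 / d)
        return max(1, round(k / 2))

    for d in range(2, 11):
        print(d, w_fixed_wire(d), w_fixed_bisection(d))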
Unloaded Latency under a Logarithmic Wire Delay Cost Model

Fixed-bisection logarithmic wire delay cost model: the number of wires
across the bisection is constant and the delay on wires increases
logarithmically with their length [Dally, 1990]:

Length of long wires: l = k^(d/2 − 1)

Tc ∝ 1 + log l = 1 + (d/2 − 1) log k

[Two plots: average latency vs. dimension (2–10) under this cost model,
for m=32 B and m=128 B (b=k/2), with N = 64 to 16K]
Unloaded Latency under a Linear Wire Delay Cost Model

Fixed-bisection linear wire delay cost model: the number of wires across
the bisection is constant and the delay on wires increases linearly with
their length [Dally, 1990]:

Length of long wires: l = k^(d/2 − 1)

Tc ∝ l = k^(d/2 − 1)

[Two plots: average latency vs. dimension (2–10) under this cost model,
for m=32 B and m=128 B (b=k/2), with N = 64 to 16K]
Latency under Load
Assumptions [Agarwal, 1991]:
• k-ary n-cubes
• random traffic
• dimension-order cut-through routing
• unbounded internal buffers (to ignore flow control and deadlock issues)
Latency under Load - cont’d
Latency(n) = Admission + ChannelOccupancy + RoutingDelay + ContentionDelay

T(m, k, d, w, ρ) = RoutingDelay + ContentionDelay

T(m, k, d, w, ρ) = m/w + d hk (∆ + W(m, k, d, w, ρ))

W(m, k, d, w, ρ) = (m/w) · (ρ/(1 − ρ)) · ((hk − 1)/hk²) · (1 + 1/d)

hk = (k − 1)/2, so the total routing distance is d hk = d(k − 1)/2

m  ... message size
w  ... bitwidth of link
ρ  ... aggregate channel utilization
hk ... average distance in each dimension
∆  ... switching time in cycles
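
A sketch of this contention model (the parameter values in the example are
illustrative assumptions):

    # Loaded latency model for k-ary d-cubes [Agarwal, 1991].
    def contention_latency(m, k, d, w, rho, delta=1):
        h_k = (k - 1) / 2                  # average distance per dimension
        W = (m / w) * (rho / (1 - rho)) * ((h_k - 1) / h_k**2) * (1 + 1 / d)
        return m / w + d * h_k * (delta + W)

    # The rho/(1 - rho) factor makes latency explode near saturation:
    for rho in (0.1, 0.4, 0.7, 0.9):
        print(rho, round(contention_latency(m=256, k=10, d=3, w=8, rho=rho), 1))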
Latency vs Channel Load
[Plot: average latency vs. channel utilization (0 to 1) for w=8 and ∆=1,
with curves for m = 8, 32, 128 under (d=3, k=10) and (d=2, k=32)]
Routing
Deterministic routing: The route is determined solely by source and
destination locations.

Arithmetic routing: The destination address of the incoming packet is
compared with the address of the switch and the packet is routed
accordingly (relative or absolute addresses).

Source based routing: The source determines the route and builds a header
with one directive for each switch. The switches strip off the top
directive.

Table-driven routing: Switches have routing tables, which can be
configured.

Adaptive routing: The route can be adapted by the switches to balance the
load.

Minimal routing allows only shortest paths, while non-minimal routing also
allows longer paths.
Deadlock
Deadlock: Two or more packets block each other mutually, each waiting for
resources that can never become free.

Livelock: A packet keeps moving through the network but never reaches its
destination.

Starvation: A packet never gets a resource because it always loses the
competition for that resource (a fairness problem).
Deadlock Situations
• Head-on deadlock;
• Nodes stop receiving packets;
• Contention for switch buffers can occur with store-and-forward,
  virtual-cut-through, and wormhole routing. Wormhole routing is
  particularly susceptible.
• Deadlock cannot occur in butterflies;
• Deadlock cannot occur in trees or fat trees if upward and downward
  channels are independent;
• Dimension-order routing is deadlock free on k-ary n-dimensional arrays
  (meshes) but not on tori for any n ≥ 1.
Deadlock in a 1-dimensional Torus
[Figure: four nodes A–D on a ring; message 1 (10 flits) from C to B and
message 2 (10 flits) from A to D]
Channel Dependence Graph for Dimension Order Routing
[Figure: a 4-ary 2-dimensional array with numbered channels, and its
channel dependence graph under dimension-order routing]

Routing is deadlock free if the channel dependence graph has no cycles.
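
For reference, a minimal sketch of dimension-order routing on a k-ary
d-dimensional mesh (an illustration of the scheme named above, not code
from the slides): the offset is corrected completely in dimension 0, then
in dimension 1, and so on, which is what keeps the channel dependence
graph acyclic on meshes.

    def dimension_order_route(src, dst):
        # src, dst: coordinate tuples; returns the sequence of hops.
        pos, hops = list(src), []
        for dim in range(len(src)):
            step = 1 if dst[dim] > pos[dim] else -1
            while pos[dim] != dst[dim]:
                pos[dim] += step
                hops.append(tuple(pos))
        return hops

    print(dimension_order_route((0, 0), (2, 2)))
    # [(1, 0), (2, 0), (2, 1), (2, 2)] -- X first, then Y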
Deadlock-free Routing
• Two main approaches:
  – Restrict the legal routes;
  – Restrict how resources are allocated;
• Number the channels cleverly;
• Construct the channel dependence graph;
• Prove that all legal routes follow a strictly increasing path in the
  channel dependence graph.
Virtual Channels
[Figure: switch with input ports fanning out into several virtual channels,
a crossbar, and output ports]

• Virtual channels can be used to break cycles in the dependence graph.
• E.g. all n-dimensional tori can be made deadlock free under
  dimension-order routing by assigning all wrap-around paths to a
  different virtual channel than other links.
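
A sketch of this idea on a unidirectional ring (a 1-dimensional torus):
traffic that uses the wrap-around link continues on a second virtual
channel, so the cycle in the channel dependence graph is broken. The
function and numbering are illustrative, not from the slides.

    def ring_channels(src, dst, N):
        # Returns the (link, vc) pairs used from src to dst; link i runs
        # from node i to node (i+1) mod N. The wrap-around link N-1 and
        # every link after it are taken on virtual channel 1.
        channels, node, vc = [], src, 0
        while node != dst:
            if node == N - 1:
                vc = 1
            channels.append((node, vc))
            node = (node + 1) % N
        return channels

    print(ring_channels(2, 1, 4))  # [(2, 0), (3, 1), (0, 1)]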
Virtual Channels and Deadlocks
Turn-Model Routing
What are the minimal routing restrictions that make routing deadlock free?

[Figure: allowed turns under the North-last, West-first, and Negative-first
schemes]

• Three minimal routing restriction schemes:
  – North-last
  – West-first
  – Negative-first
• They allow complex, non-minimal adaptive routes.
• Unidirectional k-ary n-cubes still need virtual channels.
Adaptive Routing
• The switch makes routing decisions based on the load.
• Fully adaptive routing allows all shortest paths.
• Partially adaptive routing allows only a subset of the shortest paths.
• Non-minimal adaptive routing also allows non-minimal paths.
• Hot-potato routing is non-minimal adaptive routing without packet
  buffering.
Summary
• Communication performance: bandwidth, unloaded latency, loaded latency
• Organizational structure: network interface, switch, link
• Topologies: wire space and delay domination favors low-dimension
  topologies
• Routing: deterministic vs. source-based vs. adaptive routing; deadlock
Issues beyond the Scope of this Lecture
• Switch: Buffering; output scheduling; flow control;
• Flow control: Link level and end-to-end control;
• Power
• Clocking
• Faults and reliability
• Memory architecture and I/O
• Application specific communication patterns
• Services offered to applications; Quality of service
NoC Research Projects
• Nostrum at KTH
• Æthereal at Philips Research
• Proteo at Tampere University of Technology
• SPIN at UPMC/LIP6 in Paris
• XPipes at Bologna U
• Octagon at ST and UC San Diego
To Probe Further - Books and Classic Papers

[Agarwal, 1991] Agarwal, A. (1991). Limits on interconnection network
performance. IEEE Transactions on Parallel and Distributed Systems,
4(6):613–624.

[Culler et al., 1999] Culler, D. E., Singh, J. P., and Gupta, A. (1999).
Parallel Computer Architecture - A Hardware/Software Approach. Morgan
Kaufmann Publishers.

[Dally, 1990] Dally, W. J. (1990). Performance analysis of k-ary n-cube
interconnection networks. IEEE Transactions on Computers, 39(6):775–785.

[Duato et al., 1998] Duato, J., Yalamanchili, S., and Ni, L. (1998).
Interconnection Networks - An Engineering Approach. IEEE Computer Society
Press, Los Alamitos, California.

[Leighton, 1992] Leighton, F. T. (1992). Introduction to Parallel
Algorithms and Architectures. Morgan Kaufmann, San Francisco.