Multiprocessor Interconnection Networks
Todd C. Mowry
CS 495
October 30, 2002
Topics
• Network design issues
• Network Topology
• Performance
Networks
• How do we move data between processors?
• Design Options:
• Topology
• Routing
• Physical implementation
Evaluation Criteria
• Latency
• Bisection Bandwidth
• Contention and hot-spot behavior
• Partitionability
• Cost and scalability
• Fault tolerance
Communication Perf: Latency
Time(n)_(s-d) = overhead + routing delay + channel occupancy + contention delay

channel occupancy = (n + n_e) / b
(n: data size, n_e: envelope size, b: channel bandwidth)
Routing delay?
Contention?
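As a sketch, the latency model as a function (parameter names mirror the terms above; the contention term is left as an input, since it depends on traffic):

    def message_latency(n, n_e, b, overhead, routing_delay, contention=0.0):
        """End-to-end latency model Time(n)_(s-d) from above.

        n: data size, n_e: envelope (header) size, b: channel bandwidth.
        Channel occupancy is the time the message holds a channel.
        """
        occupancy = (n + n_e) / b
        return overhead + routing_delay + occupancy + contention

    # e.g. 64B payload, 8B envelope, 4B/cycle channel, 10-cycle overhead, 6-cycle routing
    print(message_latency(64, 8, 4, 10, 6))   # 34.0 cycles, ignoring contention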
Link Design/Engineering Space
Cable of one or more wires/fibers with connectors at the ends, attached to switches or interfaces.

• Narrow (control, data, and timing multiplexed on the wire) vs. Wide (control, data, and timing on separate wires)
• Short (a single logical value at a time) vs. Long (a stream of logical values at a time)
• Asynchronous (source encodes clock in the signal) vs. Synchronous (source & dest on the same clock)
Buses
[Figure: several processors P attached to a shared bus]
• Simple and cost-effective for small-scale multiprocessors
• Not scalable (limited bandwidth; electrical complications)
Crossbars
• Each port has link to every other port
+ Low latency and high throughput
- Cost grows as O(N^2) so not very scalable.
- Difficult to arbitrate and to get all data lines into and out
of a centralized crossbar.
• Used in small-scale MPs (e.g., C.mmp) and as building
block for other networks (e.g., Omega).
Rings
• Cheap: Cost is O(N).
• Point-to-point wires and pipelining can be used to
make them very fast.
+ High overall bandwidth
- High latency O(N)
• Examples: KSR machine, Hector
(Multidimensional) Meshes and Tori
[Figures: 2D grid and 3D cube]

• O(N) switches (but switches are more complicated)
• Latency: O(k*n) (where N = k^n)
• High bisection bandwidth
• Good fault tolerance
• Physical layout hard for multiple dimensions
Real World 2D mesh
1824-node Paragon: 16 x 114 array
Hypercubes
• Also called binary n-cubes. # of nodes = N = 2^n.
• Latency is O(logN); Out degree of PE is O(logN)
• Minimizes hops; good bisection BW; but tough to lay out in 3-space
• Popular in early message-passing computers (e.g., Intel iPSC, NCUBE)
• Used as direct network ==> emphasizes locality
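To see why latency is O(logN): node IDs are n-bit numbers, neighbors differ in exactly one bit, and the hop count between two nodes is the Hamming distance of their IDs. A minimal Python sketch (helper names are illustrative):

    def neighbors(node, n):
        """All nodes one hop away: flip each of the n address bits."""
        return [node ^ (1 << bit) for bit in range(n)]

    def hops(src, dst):
        """Minimal hop count = Hamming distance between node IDs."""
        return bin(src ^ dst).count("1")

    # 3-cube (N = 8): node 000 has neighbors 001, 010, 100
    print([format(x, "03b") for x in neighbors(0b000, 3)])
    print(hops(0b000, 0b101))   # 2 hops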
k-ary d-cubes
• Generalization of hypercubes (k nodes in every
dimension)
• Total # of nodes = N = k^d.
• k > 2 reduces # of channels at bisection, thus
allowing for wider channels but more hops.
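A small sketch making that trade-off concrete (formulas from the "Scaling k-ary d-cubes" slide later in the deck; a sketch under those assumptions):

    def kary_dcube_stats(k, d):
        """Basic properties of a k-ary d-cube."""
        N = k ** d                   # total nodes
        diameter = d * (k - 1)       # worst-case hops
        total_links = N * d          # one link per node per dimension
        bisection = k ** (d - 1)     # channels crossing the bisection, each direction
        return N, diameter, total_links, bisection

    # Same node count, two shapes: binary 10-cube vs. 32-ary 2-cube
    print(kary_dcube_stats(2, 10))   # N=1024: few hops, many narrow bisection channels
    print(kary_dcube_stats(32, 2))   # N=1024: more hops, fewer (hence wider) channels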
Embeddings in two dimensions
[Figures: 6x3x2 and 3x3x3 logical topologies embedded in two dimensions]

Embed multiple logical dimensions in one physical dimension using long wires.
Trees
• Cheap: Cost is O(N).
• Latency is O(logN).
• Easy to lay out as planar graphs (e.g., H-Trees).
• For random permutations, the root can become a bottleneck.
• To avoid the root becoming a bottleneck: Fat-Trees (used in the CM-5).
Multistage Logarithmic Networks
Key Idea: have multiple layers of switches between sources and
destinations.
• Cost is O(NlogN); latency is O(logN); throughput is O(N).
• Generally indirect networks.
• Many variations exist (Omega, Butterfly, Benes, ...).
• Used in many machines: BBN Butterfly, IBM RP3, ...
Omega Network
[Figure: 8-node Omega network, inputs and outputs 000-111, three stages of 2x2 switches]
• All stages are the same, so a recirculating network can be used.
• Single path from source to destination.
• Can add extra stages and pathways to minimize
collisions and increase fault tolerance.
• Can support combining. Used in IBM RP3.
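The single path can be traced with destination-tag routing, the standard scheme for Omega networks (not spelled out on the slide, so take this as a sketch): each stage shuffles, then the 2x2 switch consumes one destination bit, MSB first.

    def omega_route(src, dst, n_bits):
        """Trace a packet through an Omega network of n_bits stages.

        Each stage applies a perfect shuffle (rotate the line address left),
        then the 2x2 switch forces the low bit to the next destination bit.
        Returns the line occupied after each stage; the last entry is dst.
        """
        mask = (1 << n_bits) - 1
        addr, path = src, [src]
        for stage in range(n_bits):
            addr = ((addr << 1) | (addr >> (n_bits - 1))) & mask   # perfect shuffle
            dst_bit = (dst >> (n_bits - 1 - stage)) & 1            # tag bit, MSB first
            addr = (addr & ~1) | dst_bit                           # upper/lower output
            path.append(addr)
        return path

    # 8-way network: route 001 -> 110
    print([format(a, "03b") for a in omega_route(0b001, 0b110, 3)])  # 001 011 111 110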
Butterfly Network
[Figure: 8-node Butterfly network, inputs and outputs 000-111; the first stage splits on the MSB, the last on the LSB]
• Equivalent to Omega network. Easy to see routing of
messages.
• Also very similar to hypercubes (direct vs. indirect though).
• The bisection of the network is clearly (N / 2) channels.
• Can use higher-degree switches to reduce depth.
Properties of Some Topologies

Topology       Degree      Diameter        Ave Dist      Bisection   D (D_ave) @ P=1024
1D Array       2           N-1             N/3           1           huge
1D Ring        2           N/2             N/4           2           huge
2D Mesh        4           2(N^1/2 - 1)    2/3 N^1/2     N^1/2       63 (21)
2D Torus       4           N^1/2           1/2 N^1/2     2 N^1/2     32 (16)
k-ary d-cube   2d          dk/2            dk/4          dk/4        15 (7.5) @ d=3
Hypercube      n = log N   n               n/2           N/2         10 (5)
Real Machines
In general, wide links => smaller routing delay
Tremendous variation
How many dimensions?
d = 2 or d = 3
• Short wires, easy to build
• Many hops, low bisection bandwidth
• Requires traffic locality
d >= 4
• Harder to build, more wires, longer average length
• Fewer hops, better bisection bandwidth
• Can handle non-local traffic
k-ary d-cubes provide a consistent framework for
comparison
• N = k^d
• scale dimension (d) or nodes per dimension (k)
• assume cut-through
Scaling k-ary d-cubes
What is equal cost?
• Equal number of nodes?
• Equal number of pins/wires?
• Equal bisection bandwidth?
• Equal area?
• Equal wire length?
Each assumption leads to a different optimum
Recall:
switch degree: d
diameter = d(k-1)
total links = N*d
pins per node = 2wd
bisection = k^(d-1) = N/k links in each direction
2wN/k wires cross the middle
Scaling: Latency
[Plots: average latency T(n=40) and T(n=140) vs. machine size N (up to 10,000 nodes), for d=2, d=3, d=4, k=2, and n/w]
Assumes equal channel width
Average Distance with Equal Width
Avg. distance = d(k-1)/2

[Plot: average distance vs. dimension for N = 256, 1024, 16384, 1048576]
Assumes equal channel width
but, equal channel width is not equal cost!
Higher dimension => more channels
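A quick sketch reproducing the shape of these curves using the formula above (the helper name is my own; only dimensions where k = N^(1/d) is an integer are evaluated):

    def avg_distance(N, d):
        """Average hop count in a k-ary d-cube: d*(k-1)/2, with k = N^(1/d)."""
        k = round(N ** (1.0 / d))
        if k ** d != N:              # skip shapes that don't tile N exactly
            return None
        return d * (k - 1) / 2

    for d in (2, 4, 5, 10, 20):
        print(d, avg_distance(1048576, d))   # N = 2^20: distance falls as d grows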
Latency with Equal Width
[Plot: average latency (n = 40, D = 2) vs. dimension for N = 256, 1024, 16384, 1048576]
total links(N) = Nd
Latency with Equal Pin Count
[Plots: average latency T(n=40B) and T(n=140B) vs. dimension d, for 256, 1024, 16k, and 1M nodes]
Baseline d=2, has w = 32 (128 wires per node)
fix 2dw pins => w(d) = 64/d
distance up with d, but channel time down
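A sketch of the model these curves appear to follow (my reading of the slides: cut-through latency = channel occupancy n/w plus routing delay D per hop over the average distance, with the w(d) = 64/d channel width from the fixed pin budget):

    def latency_equal_pins(N, d, n=40 * 8, D=2):
        """Average latency with 2*d*w = 128 pins per node, so w(d) = 64/d.

        n: message length in bits, D: routing delay per hop (cycles).
        """
        k = N ** (1.0 / d)              # nodes per dimension (may be fractional)
        w = 64 // d                     # wires per channel under the pin budget
        hops = d * (k - 1) / 2          # average distance
        return n / w + hops * D         # occupancy + routing delay (cut-through)

    for d in (2, 3, 5, 10):
        print(d, latency_equal_pins(1024, d))   # latency is best at a middling d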
Latency with Equal Bisection Width
[Plot: average latency T(n=40) vs. dimension d, for 256, 1024, 16k, and 1M nodes]

• An N-node hypercube has N bisection links; a 2D torus has 2N^(1/2)
• Fixed bisection => w(d) = N^(1/d)/2 = k/2
• At 1M nodes, d=2 gives w = 512!
Larger Routing Delay (w/ equal pin)
[Plot: average latency T(n=140B) vs. dimension d, for 256, 1024, 16k, and 1M nodes]
Conclusions are strongly influenced by the assumed routing delay
Latency under Contention
[Plot: latency vs. channel utilization for n = 40, 16, 8, 4 under (d=2, k=32) and (d=3, k=10)]
Optimal packet size?
Channel utilization?
Saturation
[Plot: latency vs. average channel utilization for n/w = 40, 16, 8, 4]
Fatter links shorten queuing delays
Data transferred per cycle
[Plot: latency vs. flits per cycle per processor for (n=8, d=3, k=10) and (n=8, d=2, k=32)]
Higher-degree networks have larger available bandwidth
• at what cost?
Advantages of Low-Dimensional Nets
What can be built in VLSI is often wire-limited
LDNs are easier to lay out:
• more uniform wiring density (easier to embed in 2-D or 3-D
space)
• mostly local connections (e.g., grids)
Compared with HDNs (e.g., hypercubes), LDNs
have:
• shorter wires (reduces hop latency)
• fewer wires (increases bandwidth given constant bisection width)
– increased channel width is the major reason why LDNs win!
LDNs have better hot-spot throughput
• more pins per node than HDNs
Routing
Recall: routing algorithm determines
• which of the possible paths are used as routes
• how the route is determined
• R: N x N -> C, which at each switch maps the destination node n_d to the next channel on the route
Issues:
• Routing mechanism
– arithmetic
– source-based port select
– table driven
– general computation
• Properties of the routes
• Deadlock free
Store&Forward vs Cut-Through Routing
[Figure: a 4-flit packet (flits 3 2 1 0) traversing three switches from Source to Dest, under store & forward vs. cut-through routing]
Time: store & forward = h(n/b + D)  vs.  cut-through = n/b + hD
(h: hops, n: packet length, b: bandwidth, D: routing delay at each switch)
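A tiny sketch evaluating the two expressions (the numbers are illustrative, not from the slides):

    def store_and_forward(h, n, b, D):
        """Each switch receives the whole packet before forwarding it."""
        return h * (n / b + D)

    def cut_through(h, n, b, D):
        """The head flit streams ahead; only the routing delay repeats per hop."""
        return n / b + h * D

    # e.g. 5 hops, 160-bit packet, 16 bits/cycle channels, 2-cycle routing delay
    print(store_and_forward(5, 160, 16, 2))   # 60.0 cycles
    print(cut_through(5, 160, 16, 2))         # 20.0 cycles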
Routing Mechanism
Need to select an output port for each input packet
• in a few cycles
Reduce the relative address of each dimension in order
• Dimension-order routing in k-ary d-cubes
• e-cube routing in n-cubes
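A minimal sketch of dimension-order routing on a k-ary d-cube torus (the function and coordinate encoding are my own illustration):

    def dimension_order_route(src, dst, k):
        """Return (dimension, step) moves: fully resolve dimension 0, then 1, ...

        src, dst: coordinate tuples of equal length d; each dimension takes
        the shorter way around its ring (torus wraparound).
        """
        route = []
        for dim, (s, t) in enumerate(zip(src, dst)):
            delta = (t - s) % k
            step = +1 if delta <= k // 2 else -1    # shorter direction on the ring
            for _ in range(min(delta, k - delta)):
                route.append((dim, step))
        return route

    # 4-ary 2-cube: route from (0, 0) to (3, 2)
    print(dimension_order_route((0, 0), (3, 2), 4))
    # [(0, -1), (1, 1), (1, 1)] -- wrap once in x, then two hops in y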
Routing Mechanism (cont)
[Figure: message header carrying a series of port selects P3 P2 P1 P0]

Source-based
• message header carries a series of port selects
• used and stripped en route
• CRC? Packet format?
• CS-2, Myrinet, MIT Arctic
Table-driven
• message header carries an index for the next port at the next switch
– o = R[i]
• table also gives the index for the following hop
– o, i' = R[i]
• ATM, HIPPI
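A sketch of the table-driven scheme as described: each switch's table R maps the header index i to an output port o and a rewritten index i' for the next switch (the table contents below are made up for illustration):

    # Per-switch routing table: index -> (output_port, next_index)
    R_switch_a = {7: (2, 4)}   # arrive with index 7: leave on port 2, rewrite index to 4
    R_switch_b = {4: (0, 1)}   # the next switch continues the chain

    def forward(table, header_index):
        """One table-driven routing step: look up the port, rewrite the index."""
        out_port, next_index = table[header_index]
        return out_port, next_index

    port, idx = forward(R_switch_a, 7)    # -> (2, 4)
    port, idx = forward(R_switch_b, idx)  # -> (0, 1)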
Properties of Routing Algorithms
Deterministic
• route determined by (source, dest), not intermediate state
(i.e. traffic)
Adaptive
• route influenced by traffic along the way
Minimal
• only selects shortest paths
Deadlock free
• no traffic pattern can lead to a situation where no packets move forward
Deadlock Freedom
How can it arise?
• necessary conditions:
– shared resource
– incrementally allocated
– non-preemptible
• think of a channel as a shared
resource that is acquired incrementally
– source buffer then dest. buffer
– channels along a route
How do you avoid it?
• constrain how channel resources are allocated
• ex: dimension order
How do you prove that a routing algorithm is deadlock free?
Proof Technique
Resources are logically associated with channels
Messages introduce dependences between resources as they
move forward
Need to articulate possible dependences between channels
Show that there are no cycles in Channel Dependence Graph
• find a numbering of channel resources such that every legal route follows
a monotonic sequence
=> no traffic pattern can lead to deadlock
The network need not be acyclic, only the channel dependence graph
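A sketch of the check itself: build the channel dependence graph and test for a cycle (no cycle means a monotonic channel numbering, i.e. a topological order, exists). The graph encoding is my own; the example is unrestricted wormhole routing on a 4-node ring, whose dependence graph is a cycle:

    def has_cycle(deps):
        """Detect a cycle in a channel dependence graph {channel: [channels...]}."""
        WHITE, GRAY, BLACK = 0, 1, 2
        color = {}
        def visit(c):
            color[c] = GRAY
            for nxt in deps.get(c, []):
                if color.get(nxt, WHITE) == GRAY:
                    return True               # back edge: cycle found
                if color.get(nxt, WHITE) == WHITE and visit(nxt):
                    return True
            color[c] = BLACK
            return False
        return any(visit(c) for c in deps if color.get(c, WHITE) == WHITE)

    # Wormhole routing on a 4-node ring: channel i can wait on channel i+1 mod 4
    print(has_cycle({0: [1], 1: [2], 2: [3], 3: [0]}))   # True -> deadlock possible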
Examples
Why is the obvious routing on X deadlock free?
• butterfly?
• tree?
• fat tree?
Any assumptions about routing mechanism? amount of
buffering?
What about wormhole routing on a ring?
[Figure: 8-node ring, nodes numbered 0-7]
Flow Control
What do you do when push comes to shove?
• Ethernet: collision detection and retry after delay
• FDDI, token ring: arbitration token
• TCP/WAN: buffer, drop, adjust rate
• any solution must adjust to the output rate
Link-level flow control
[Figure: two switches joined by Data lines, with a Ready signal flowing back against the data]
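A minimal sketch of the handshake (buffer size and names are illustrative): the sender may drive Data only while the receiver asserts Ready, so a full downstream buffer back-pressures the link.

    from collections import deque

    class Link:
        """Ready/Data handshake: sender may push only while Ready is asserted."""
        def __init__(self, buffer_slots=4):
            self.fifo = deque()
            self.slots = buffer_slots

        def ready(self):                 # flow-control signal back to the sender
            return len(self.fifo) < self.slots

        def send(self, flit):
            if self.ready():
                self.fifo.append(flit)
                return True
            return False                 # back-pressure: sender stalls and retries

    link = Link()
    for flit in range(6):
        print(flit, link.send(flit))     # flits 4 and 5 are refused until drained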
Examples
Short links
[Figure: Source and Destination with full/empty (F/E) flags and a Ready/Ack signal]

Long links
• several messages on the wire
[Figure: Source and Destination connected by Req/Data lines with multiple messages in flight]
Link vs global flow control
• Hot spots
• Global communication operations
• Natural parallel program dependences
Case study: Cray T3E
• 3-dimensional torus, with 1024 switches
each connected to 2 processors
• Short, wide, synchronous links
• Dimension-order, cut-through, packet-switched routing
• Variable-sized packets, in multiples of 16 bits
Case Study: SGI Origin
• Hypercube-like topologies with up to 256
switches
• Each switch supports 4 processors and
connects to 4 other switches
• Long, wide links
• Table-driven routing: programmable, allowing
for flexible topologies and fault-avoidance