Network Design

Dynamic Networks
CS 213, LECTURE 15
L.N. Bhuyan
What is a Dynamic Network?
• A dynamic network is a network that can
connect any input to any output by enabling or
disabling switches in the network
• Examples:
- Shared Bus: The bus arbiter connects a
processor to a memory
- Crossbar: Consists of a lot of switching
elements, which can be enabled to connect many
inputs to many outputs simultaneously
- Multistage Network: Consists of several stages
of switches that are enabled to get connections
- The nodes in static networks (like Mesh) also
consist of dynamic crossbars
3/14/2016
CS258 S99
2
Crossbar Switch Design
[Figure: crossbar switch datapath. Input ports -> receiver -> input buffer -> crossbar -> output buffer -> transmitter -> output ports, with a control block for routing and scheduling.]
• Complexity O(N**2) for an NXN Crossbar
How do you build a crossbar
[Figure: 4x4 crossbar with inputs I0 to I3 and outputs O0 to O3; the select signals are labelled "From Control".]
N**2 switches => Cost O(N**2)
Time taken by the arbiter = O(N**2)
Multiplexors are controlled from
the arbiter/controller/scheduler
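The multiplexer view of a crossbar can be sketched in a few lines of Python. This is an illustrative model only, not a hardware description: the function name `crossbar` and the `select` argument are hypothetical, with `select[j]` standing for the control word the arbiter drives into the multiplexer of output j.

```python
from typing import List, Optional, Sequence

def crossbar(inputs: Sequence[str],
             select: Sequence[Optional[int]]) -> List[Optional[str]]:
    """Model an N x N crossbar as N multiplexers, one per output.

    select[j] is the index of the input routed to output j, or None if
    output j is idle. Every output multiplexer sees all N inputs, which
    is where the N * N crosspoint cost comes from.
    """
    n = len(inputs)
    if len(select) != n:
        raise ValueError("need exactly one select value per output")
    outputs: List[Optional[str]] = []
    for sel in select:
        outputs.append(None if sel is None else inputs[sel])
    return outputs

# Permutation 0->2, 1->0, 2->3, 3->1, written as one select value per output.
routed = crossbar(["a", "b", "c", "d"], [1, 3, 0, 2])
```

Because each output has its own select value, two outputs never contend for a multiplexer; a conflict can only arise when two inputs ask for the same output, which is the arbiter's job to resolve.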
Crossbar Contd.
• An NXN Crossbar allows all N inputs to be connected
simultaneously, each to a different output
• It allows all one-to-one mappings, called permutations.
No. of permutations = N!
• When two or more inputs request the same output, it is
called a CONFLICT. Only one of them is connected and the
others are either dropped or buffered
• When processors access memories through a crossbar, this
situation is called a memory access conflict
• Given p as the probability of a request by a processor per
cycle, and assuming that each processor's request is equally
likely to be directed to any of the N memories, the average number
of connections allowed per cycle, called Bandwidth (BW), is
BW = N{1 - (1 - p/N)**N} (Derive this!)
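One way to see where the formula comes from: a given processor requests a given memory with probability p/N, so that memory receives no request in a cycle with probability (1 - p/N)**N, is busy with probability 1 - (1 - p/N)**N, and there are N such memories. The sketch below evaluates the closed form and compares it with a small Monte Carlo simulation. The function names, the cycle count and the seed are illustrative assumptions.

```python
import random

def crossbar_bandwidth(n: int, p: float) -> float:
    """Closed form BW = N * (1 - (1 - p/N)**N)."""
    return n * (1.0 - (1.0 - p / n) ** n)

def simulate_bandwidth(n: int, p: float, cycles: int = 20000, seed: int = 1) -> float:
    """Average number of distinct memories requested per cycle."""
    rng = random.Random(seed)
    busy_total = 0
    for _ in range(cycles):
        requested = set()
        for _proc in range(n):
            if rng.random() < p:                  # this processor issues a request
                requested.add(rng.randrange(n))   # uniformly chosen memory
        busy_total += len(requested)              # one connection per busy memory
    return busy_total / cycles

analytic = crossbar_bandwidth(8, 0.5)
measured = simulate_bandwidth(8, 0.5)
```

The simulation counts each requested memory once per cycle, which matches the rule that only one of several conflicting requests is connected.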
Input buffered switch
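[Figure: input-buffered switch. Input ports with routing logic R0 to R3 feed a crossbar connected to the output ports; a scheduling block controls the crossbar.]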
• Independent routing logic per input
• Scheduler logic arbitrates each output: priority, FIFO, random
• Head-of-line blocking problem: the head packet in a buffer cannot depart because
the output is busy with another packet. The second packet may be destined to an output
that is free, but cannot depart because it is blocked by the first packet. One solution is to
create multiple input queues, one per output, called Virtual Output Queuing, adopted in
most routers.
• Scheduler design: how to ensure the maximum number of simultaneous connections is a
challenging research area.
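The head-of-line effect can be shown with a one-cycle toy model. The two functions below are hypothetical helpers, not part of any router software: each input holds a queue of destination output numbers, and `busy_outputs` is the set of outputs already occupied. As a simplification, the virtual output queues of one input are modelled as a single list scanned in arrival order, and inputs are served greedily in index order.

```python
from collections import deque

def fifo_departures(queues, busy_outputs):
    """One cycle with a single FIFO per input: only the head packet may leave."""
    taken = set(busy_outputs)
    sent = []
    for inp, q in enumerate(queues):
        if q and q[0] not in taken:
            taken.add(q[0])
            sent.append((inp, q.popleft()))
    return sent

def voq_departures(queues, busy_outputs):
    """Same traffic with per-output queues: any packet to a free output may leave."""
    taken = set(busy_outputs)
    sent = []
    for inp, q in enumerate(queues):
        for dest in list(q):          # scan in arrival order
            if dest not in taken:
                taken.add(dest)
                q.remove(dest)
                sent.append((inp, dest))
                break                 # at most one packet per input per cycle
    return sent

# Input 0 holds packets for outputs 1 and 2; output 1 is busy.
blocked = fifo_departures([deque([1, 2])], {1})
unblocked = voq_departures([deque([1, 2])], {1})
```

With the FIFO, the packet for output 2 waits behind the packet for busy output 1; with per-output queues it departs in the same cycle.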
Problems with Input-Buffered Switch
• FIFO input buffers give rise to the head-of-line
(HOL) blocking problem
• Current routers employ a separate input queue
for each output, called a virtual output queue
(VOQ)
• Then how should the packets from the different
VOQs be scheduled for transmission?
VOQ-based Input Buffered Switch
Scheduling in Input Buffered Switch
[Figure: input buffers with routing logic R0 to R3 feed a crossbar whose output ports are O0, O1 and O2.]
• n independent arbitration problems?
– static priority, random, round-robin
• simplifications due to routing algorithm?
• general case is maximum bipartite matching, approximated in
practice by iterative algorithms such as iSLIP (used in Cisco routers)
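The following is a simplified sketch of one request/grant/accept round in the style of iSLIP, assuming virtual output queues at the inputs. It is not the production algorithm: iSLIP normally repeats the round several times per cell time over the still-unmatched ports, and the function name and data layout here are illustrative assumptions.

```python
def islip_iteration(requests, grant_ptr, accept_ptr):
    """One request/grant/accept round with round-robin pointers.

    requests[i] is the set of outputs for which input i has a queued cell.
    grant_ptr[j] (per output) and accept_ptr[i] (per input) are updated in
    place, and only when a grant is accepted. Returns {input: output}.
    """
    n = len(requests)
    # Grant phase: each output grants the first requesting input found
    # at or after its round-robin pointer.
    grants = {}
    for out in range(n):
        for k in range(n):
            inp = (grant_ptr[out] + k) % n
            if out in requests[inp]:
                grants.setdefault(inp, []).append(out)
                break
    # Accept phase: each input accepts the first granting output found
    # at or after its own pointer.
    match = {}
    for inp, outs in grants.items():
        out = min(outs, key=lambda o: (o - accept_ptr[inp]) % n)
        match[inp] = out
        grant_ptr[out] = (inp + 1) % n    # move one past the accepted input
        accept_ptr[inp] = (out + 1) % n   # move one past the accepted output
    return match

reqs = [{0, 1}, {0}]          # input 0 wants outputs 0 and 1, input 1 wants 0
gp, ap = [0, 0], [0, 0]
first = islip_iteration(reqs, gp, ap)
second = islip_iteration(reqs, gp, ap)
```

In this example the first round matches only input 0, because both outputs grant to it; the pointer update then desynchronises the outputs, and the second round connects both inputs.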
Output Buffered Switch
[Figure: output-buffered switch. Input ports with routing logic R0 to R3 feed buffered output ports under a common control block.]
• How would you build a shared pool?
Output scheduling
[Figure: input buffers with routing logic R0 to R3 feed a crossbar whose output ports are O0, O1 and O2.]
• n independent arbitration problems?
– static priority, random, round-robin
• simplifications due to routing algorithm?
• general case is max bipartite matching
Multistage Interconnection Network (MIN)
• Crossbar switch is not scalable. How about a
network consisting of multiple stages of small
crossbar switches? It has the following properties:
• NxN network for N = 2**n
• Consists of log2(N) = n stages of 2x2 switches
• Has N/2 2x2 switches per stage
• Cost O(N log N) instead of O(N**2) for Crossbar
• For N = a**n, a MIN can be similarly designed with
axa switches
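The cost comparison can be made concrete by counting crosspoints. The helper names below are illustrative; the counts follow directly from the properties listed above (log_a(N) stages, N/a switches per stage, a*a crosspoints per axa switch).

```python
def stages(n: int, a: int = 2) -> int:
    """Number of stages, log_a(N); N must be an exact power of a."""
    count, size = 0, 1
    while size < n:
        size *= a
        count += 1
    if size != n:
        raise ValueError("N must be a power of the switch size a")
    return count

def min_crosspoints(n: int, a: int = 2) -> int:
    """stages * (N/a switches per stage) * (a*a crosspoints per switch)."""
    return stages(n, a) * (n // a) * a * a

def crossbar_crosspoints(n: int) -> int:
    """A full N x N crossbar needs N * N crosspoints."""
    return n * n
```

For N = 1024 and 2x2 switches this gives 20480 crosspoints for the MIN against 1048576 for the crossbar.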
Multistage interconnection networks
[Figure: 8x8 Omega network built from 2x2 switches, with inputs and outputs numbered 0 to 7 (000 to 111).]
Complexity: Omega network complexity is O(N log2 N)
Self Routing: The source node generates a tag, which is the binary representation
of the destination. At each switch, the corresponding tag bit is checked.
If the bit is 0, the input is connected to the upper output. If it is 1, the
input is connected to the lower output. If both inputs of a switch carry the same
tag bit (both 0 or both 1), it is a switch conflict. One of them is connected. The other one is rejected or
buffered at the switch (if it has a buffer => buffered crossbar)
What is Shuffle?
[Figure: (a) perfect shuffle and (b) inverse perfect shuffle interconnections between eight lines, 000 (=0) to 111 (=7).]
Shuffle interconnection:
S(a_{n-1} a_{n-2} … a_1 a_0) = (a_{n-2} a_{n-3} … a_0 a_{n-1})
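In terms of bit operations, the perfect shuffle is a one-bit left rotation of the n-bit line address, and the inverse shuffle is a one-bit right rotation. A minimal sketch, with illustrative function names:

```python
def shuffle(x: int, n: int) -> int:
    """Perfect shuffle: rotate the n-bit address x left by one bit."""
    msb = (x >> (n - 1)) & 1
    return ((x << 1) & ((1 << n) - 1)) | msb

def inverse_shuffle(x: int, n: int) -> int:
    """Inverse perfect shuffle: rotate the n-bit address x right by one bit."""
    lsb = x & 1
    return (x >> 1) | (lsb << (n - 1))

# Destinations of lines 0..7 under the 8-line perfect shuffle.
table = [shuffle(x, 3) for x in range(8)]   # [0, 2, 4, 6, 1, 3, 5, 7]
```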
Omega Network
• Every stage of switches is preceded by a perfect
shuffle interconnection
S(a_{n-1} a_{n-2} … a_1 a_0) = (a_{n-2} a_{n-3} … a_0 a_{n-1})
• An input can be connected to a straight or
exchange output in a 2x2 switch. The exchange function
complements the least significant bit:
E(a_{n-1} a_{n-2} … a_1 a_0) = (a_{n-1} a_{n-2} … a_1 ā_0)
• To route a message/packet in an Omega
network, the destination tag (d_{n-1} d_{n-2} … d_1 d_0), which is the
binary representation of the destination, is used.
The ith bit d_i controls the routing
at the ith stage counted from the right, with 0 <= i
<= n-1. If d_i = 0, the input is connected to the
upper output. If d_i = 1, it is connected to the
lower output.
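Destination-tag routing through an Omega network can be traced with a short function. This is a sketch under the definitions above: each stage applies the perfect shuffle and then sets the least significant bit of the line address to the tag bit (0 = upper output, 1 = lower output). The name `omega_route` is an assumption, and conflicts with other messages are ignored.

```python
def omega_route(src: int, dst: int, n: int) -> list:
    """Return the line address of a message after each of the n stages."""
    mask = (1 << n) - 1
    pos = src
    path = []
    for stage in range(n):
        # Perfect shuffle in front of the stage: rotate left by one bit.
        pos = ((pos << 1) & mask) | (pos >> (n - 1))
        # The loop visits stages left to right, so loop index 0 is stage
        # n-1 in the slide's numbering and uses the most significant bit.
        bit = (dst >> (n - 1 - stage)) & 1
        # The switch output replaces the least significant bit.
        pos = (pos & ~1) | bit
        path.append(pos)
    return path

trace = omega_route(0b101, 0b010, 3)   # [2, 5, 2]
```

The last entry of the returned list equals the destination for every source, which is the self-routing property.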
Self Routing
• A processor generates a tag that is the binary representation of the
destination
• The MSB controls the leftmost stage and the LSB controls the
rightmost stage of the Omega network. A small controller
inside the 2 x 2 switch senses this bit and enables the
connection
• If tag bit d_i = 0, the request is to the upper output; if it is 1, the
request is to the lower output.
• Routing is based on a radix-a digit instead of a bit if the switch size a is greater than 2
• Network conflict: select the winner by round robin
• Less bandwidth than a crossbar, but more cost effective
• What about QoS? Future research
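Network conflicts can be located by routing a whole set of requests at once and checking which messages need the same switch output. The sketch below is a simplification with a hypothetical function name: it resolves each conflict by letting the lowest-numbered source win (the slide's round-robin selection is not modelled) and drops the losers, as an unbuffered switch would.

```python
from collections import defaultdict

def omega_conflicts(mapping: dict, n: int) -> list:
    """Route {source: destination} through an n-stage Omega network.

    Returns a list of (stage, line, winner, losers) tuples, where stage 0
    is the leftmost stage and line is the contested switch output.
    """
    mask = (1 << n) - 1
    pos = {s: s for s in mapping}          # messages still in flight
    conflicts = []
    for stage in range(n):
        lines = defaultdict(list)
        for s in sorted(pos):
            p = ((pos[s] << 1) & mask) | (pos[s] >> (n - 1))       # shuffle
            p = (p & ~1) | ((mapping[s] >> (n - 1 - stage)) & 1)    # tag bit
            pos[s] = p
            lines[p].append(s)
        for line in sorted(lines):
            winner, *losers = lines[line]
            if losers:
                conflicts.append((stage, line, winner, losers))
                for s in losers:
                    del pos[s]             # rejected at this switch
    return conflicts

clash = omega_conflicts({0: 0, 4: 1}, 3)   # both need line 0 after stage 0
```

Sources 0 and 4 enter the same first-stage switch and both carry a leading tag bit of 0, so they collide even though their destinations differ; this is why the Omega network has less bandwidth than a crossbar.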
Theorem: The Omega network is self routing
Let the source be (s_{n-1} s_{n-2} … s_1 s_0) and the
destination be (d_{n-1} d_{n-2} … d_1 d_0). Before Stage
1, the source is switched to the position (s_{n-2} s_{n-3}
… s_0 s_{n-1}) by the perfect shuffle connection.
After Stage 1 it is switched to (s_{n-2} s_{n-3} …
s_0 d_{n-1}), as per the (n-1)th bit of the destination.
Before the 2nd stage of switches, the message is
connected to (s_{n-3} … s_0 d_{n-1} s_{n-2}), and after the 2nd stage
it becomes (s_{n-3} … s_0 d_{n-1} d_{n-2}).
If we continue like this for n stages, the position
becomes (d_{n-1} d_{n-2} … d_1 d_0), which is the
destination.
Switch Size axa
Let N = a**n
• The MIN will consist of n stages of axa crossbar
switches with N/a switches per stage.
• The routing will be based on the radix-a digits d_i of the
destination, with 0 <= d_i <= a-1
• Interconnection based on a-shuffle
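The radix-a case can be sketched in the same way as the binary one, assuming the a-shuffle is a one-digit left rotation of the n-digit base-a line address and that each axa switch replaces the least significant digit with the current destination digit. The function name is illustrative, and the sketch is an experiment that supports the self-routing claim, not a proof of it.

```python
def min_route(src: int, dst: int, a: int, n: int) -> list:
    """Trace a message through an n-stage MIN of a x a switches (N = a**n)."""
    size = a ** n
    pos = src
    path = []
    for stage in range(n):
        # a-shuffle: rotate the base-a digits left by one position.
        pos = (pos * a) % size + pos // (size // a)
        # Destination digit for this stage, most significant digit first.
        digit = (dst // a ** (n - 1 - stage)) % a
        # The switch output replaces the least significant digit.
        pos = pos - (pos % a) + digit
        path.append(pos)
    return path

trace16 = min_route(7, 9, 4, 2)   # 16x16 network of 4x4 switches: [14, 9]
```

With a = 2 the function reduces to bit-by-bit destination-tag routing in the Omega network.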
Homework:
Prove self routing based on radix a. Draw a 16x16 MIN based
on 4x4 switches and explain its operation.
Derive the BW of an Omega network with N=a**n with the same
input parameters as the Crossbar (Slide 5, "Crossbar Contd.")
Example: SP
[Figure: a 16-node rack and a multi-rack configuration. The switch board has intra-rack host ports P0 to P15 and inter-rack external switch ports E0 to E15.]
• 8-port switch, 40 MB/s per link, 8-bit phit, 16-bit flit, single
40 MHz clock
• Packet switched, cut-through, no virtual channels, source-based
routing
• Variable packet size <= 255 bytes, 31-byte FIFO per input, 7 bytes
per output, 16 phit links
Example: IBM SP vulcan switch
[Figure: switch chip block diagram with input ports (deserializer, flow control, FIFO, CRC check, route control), an 8x8 crossbar, a central queue (64x128 RAM with In Arb and Out Arb), and output ports (XBar Arb, FIFO, flow control, CRC Gen, serializer); datapaths are 8 and 64 bits wide.]
• Many gigabit ethernet switches use similar
design without the cut-through
SGI SPIDER Chip
SPIDER OPERATION
• The physical transmission layer for each port is based on a pair
of Source Synchronous Drivers and Receivers (SSD and SSR),
which transmit and receive 20 data bits and a data framing
signal at 400 MBaud.
• The data link level guarantees reliable transmission using a
CCITT-CRC code with a go-back-n sliding window protocol [1]
retry mechanism, and is referred to as the Link Level Protocol
(LLP).
• The message layer defines 4 virtual channels and a credit
based flow control scheme to support arbitrary message
lengths, as well as a header format to specify message
destination, priority, and congestion control options.
• The receive buffers of a port maintain a separate linked list of
messages for each of the 5 possible output ports for each
virtual channel to avoid the ‘block at head of queue’ bottleneck.
SPIDER Crossbar Arbitration
• To maximize bandwidth through the crossbar without using
unreasonable buffering, each virtual channel buffer is
organized as a set of linked lists. There is one linked list for
each possible output port for each virtual channel. This solution
avoids the block at head of queue problem. To maximize
crossbar efficiency, each virtual channel from each port can
request arbitration for every possible destination. Each
arbitration cycle, the arbiter chooses up to 6 winners from as
many as 120 arbitration candidates to maximize crossbar
utilization.
• Messages accumulate a network age as they are routed,
increasing their priority. To avoid starvation and promote network
fairness, the arbiter is rotated each arbitration cycle to favor the
highest priority requestor. Priority is based on the age field of a
message header.
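Age-based arbitration can be illustrated with a greedy toy model. This is not the SPIDER arbiter: the function name, the tuple layout (age, input port, virtual channel, output port) and the greedy oldest-first policy are assumptions made for illustration, and the per-cycle rotation of the arbiter is not modelled. The sketch only shows how at most one winner per input port and per output port is chosen, giving up to 6 winners on a 6-port chip.

```python
def arbitrate(candidates, ports=6):
    """Pick winners oldest-first, one per input port and one per output port.

    Each candidate is a tuple (age, in_port, virtual_channel, out_port).
    """
    free_in = set(range(ports))
    free_out = set(range(ports))
    winners = []
    # sorted() is stable, so candidates of equal age keep their given order.
    for cand in sorted(candidates, key=lambda c: -c[0]):
        age, inp, vc, out = cand
        if inp in free_in and out in free_out:
            winners.append(cand)
            free_in.discard(inp)
            free_out.discard(out)
    return winners

picked = arbitrate([(3, 0, 0, 1), (5, 2, 1, 1), (4, 0, 2, 3)])
```

Here the oldest message (age 5) takes output 1, the age-4 message from port 0 takes output 3, and the age-3 message loses because its input port has already been granted.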
Arbitration Contd.
• After data is received by the SSR and synchronized, it enters the chip core and
begins several operations in parallel. Table lookup and crossbar arbitration are
normally serialized, as the exit port must be known before arbitration begins.
• To parallelize these operations, table lookup is pipelined across SPIDER chips.
While arbitration progresses, the table lookup is performed for the next
SPIDER chip, which depends on the destination ID and the direction field. This
does increase table size, as a full table is required for each neighboring
SPIDER chip, but it reduces latency by a full clock. Pipelined tables also add
flexibility to possible routes, as different exit ports can be given depending on
where a message came from as well as where it is going.
Summary
• Routing Algorithms restrict the set of routes
within the topology
– simple mechanism selects turn at each hop
– arithmetic, selection, lookup
• Deadlock-free if channel dependence graph is
acyclic
– limit turns to eliminate dependences
– add separate channel resources to break dependences
– combination of topology, algorithm, and switch design
• Deterministic vs. adaptive routing
• Switch design issues
– input/output/pooled buffering, routing logic, selection logic
• Flow control
• Real networks are a ‘package’
of design choices