Flattened Butterfly: A Cost-Efficient Topology for High

advertisement
Flattened Butterfly: A Cost-Efficient
Topology for High-Radix Networks
______________________________
John Kim, William J. Dally &Dennis Abts
Presented by:
Evan Su
•
•
•
•
•
Basic metrics
Basic topologies
Why high-radix
Router microarchitecture
High-radix topologies
• Interconnection networks used to connect
processors and memories in multiprocessors,
as switching fabrics for high-end routers and
switches, and for connecting I/O devices.
• Definition: determines arrangement of
channels and nodes in the network (road
map)
• Often first step in network design
Performance Metrics
• Average Hop Count
• Average Latency
• throughput
• Bisection Bandwidth
Hop Count
• The number of links traversed between source
and destination
Latency
• Defined as the time it takes for a packet to
traverse the network
• Latency= Header latency + serialization latency
– Header latency: head arrives at input port
– Serialization: time for rest of the packet to catch
up
Throughput
• Data rate (bits/sec) that the network accepts
per input port
• Offered load - % of capacity network accepts
Bisection Bandwidth
• Split N nodes into two groups of N/2 nodes
such that the bandwidth between these two
groups is minimum
• Why is it relevant: if traffic is completely
random, the probability of a message going
across the two halves is ½- tells how much
traffic a network can support ( ½ of total
traffic bandwidth)
Topology Examples
Hypercube
Grid
Criteria
64 nodes
Performance
Bisection
bandwidth
Torus
Bus
Ring
2Dtorus
6-cube
Fully
connected
1
2
16
32
1024
9
Why High Radix?
• Definition: number of inputs/outputs for each
router
• For past 20 years, used low-radix k-ary n-cubes
(torus)
– Routers didn’t have enough bandwidth to support
high radix
• Network routers have growth curve that obeys
Moore’s law
– Bandwidth increased
– Packet length stayed the same
– Latency gone down
Why High Radix?
• Approximately an order
of magnitude increase in
bandwidth every 5 years
• Bandwidth growth result
of:
– Increase in signaling rate
– Increase in number of
signals
High-Radix Routers
High-Radix vs. Low-Radix
• Cost
• Power dissipation
• latency
Cost
• Increasing radix of routers monotonically
reduces overall cost
• Network cost proportional to total router
bandwidth
– Router pins
– Connectors
• For fixed bisection bandwidth, cost
proportional to hop count
– High-radix => lower hop count
Cost
Power
• Power dissipated decreases with increasing radix
• Power proportional to number of router nodes
• As radix increases, hop count decreases and
router nodes decrease as well
– Independent of individual router node
• Router power due to I/O circuits, switch
bandwidth.
• arbitration logic more complex with higher radix
but negligible fraction of total power
Latency
H = # hops
tr = delay in router
L = length of packets
b = channel bandwidth
B = total Bandwidth
k = radix
• Bandwidth (B) is divided
among 2k input and output
channels so b = B/2k
Aspect Ratio
• Differentiate by dT/dk and set equal to
zero
• Expression on right side determines
router radix that minimizes network
latency
Optimal Latency
Optimal Latency
Optimal Latency
Router Microarchitecture (VC)
• Route computation (RC) – based on info
stored in header, select output port
• Virtual-channel allocation (VA)- packet must
gain exclusive access to virtual channel of
output port
• Switch allocation (SA)- if there is a free buffer
in channel, flit can vie for access to crossbar
• Switch traversal (ST) – transfers flit from input
to output buffers
Router Microarchitecture (VC)
Microarchitecture for High-radix
• Routing computation – linear function of
bandwidth
• VC Allocator – quadratic function of input/
output ports because take bids from all ports
• Switch Allocator- quadratic function of ports
Baseline Performance
• Due to head of line blocking
• Before, overprovision switch because
low cost
Fully buffered crossbar
• Separate the queuing up
• Had to compete for input and output of switch
• With crosspoint, decouples two allocations,
always make forward progress
Fully buffered crossbar
• Trade performance for cost
• Crosspoint buffering dominates chip area
(quadratic)
Hierarchical crossbar
V = inputs
K = radix
P = number of subswitches
• Using subswitches, area grows O(vk2 / p)
• Decouples allocation, reduces HoL
blocking
Hierarchical crossbar
Uniform random traffic
Worst case performance
Okay! Back to Topologies
• Butterfly
• Clos
• Flattened Butterfly
Butterfly Network
• K-ary n-fly: kn
network nodes
• Example: 2-ary 3-fly
• Routing from 000 to
010
– Dest address used to
directly route packet
– Bit n used to select
output port at stage n
0
1
0
1
0
00
10
20
2
3
6
7
1
2
01
11
21
4
5
0
3
4
02
12
22
03
13
23
5
6
7
Butterfly Network
• Pros
– Low hop count: H = log k N
• Cons
– Deterministic routing/ no path diversity
– Doesn’t exploit traffic locality
Clos Network
nxm
input
switch
nxm
input
switch
nxm
input
switch
nxm
input
switch
rxr
input
switch
rxr
input
switch
rxr
input
switch
rxr
input
switch
rxr
input
switch
mxn
output
switch
mxn
output
switch
mxn
output
switch
mxn
output
switch
Clos Network
• Butterfly folded back on itself
• Pros
– Path diversity (good performance on both
benign and adversarial)
• Cons
– Double cost of butterfly
– H = 2 log k (N)
Folded Clos Network (Fat Tree)
• Similar to Clos
• Exploits locality
Flattened Butterfly Network
4-ary 2-fly
2-ary 4 -fly
• Routers in each row are combined
Flattened Butterfly Network
• Routers in each row are combined
Flattened Butterfly Network
• On benign traffic
– Approaches performance/cost of Butterfly
– ½ cost of Clos network
• Eliminates redundant hops when no need for load
balancing
• On adversarial traffic
– Matches cost/performance of folded Clos
– Order of magnitude better performance than
Butterfly
• Use non-minimal global-adaptive routing
Routing
• In Figure, there are two minimal routes
between node 0 (00002) and node 10
(10102).
• In general, if two nodes a and b have
addresses that differ in j digits, then there
are j! minimal routes between a and b.
• This path diversity derives from the fact
that a packet routing in a flattened
butterfly is able to traverse the
dimensions in any order.
Routing Algorithms
uniform random traffic
worst case traffic pattern
VAL = Valiant’s non-minimal oblivious algorithm
MIN = minimal adaptive , UGAL = non-minimal adaptive algorithm
UGAL-S = UGAL using sequential allocation
CLOS AD = non-minimal adaptive routing in a flattened Clos
Routing Algorithms
• Valiant – picks random middle node b, and
routes minimally from s to b and ten b to
d. achieves only ½ network capacity,
regardless of traffic
• Minimal Adaptive- chooses minimal route
• Adaptive Clos – minimum routing in
benign traffic, folded- Clos routing in
adversarial
Performance Comparison
• To compare the performance, a network of
node size 1024 is taken and is constructed
using the following topology by maintaining a
constant bisection bandwidth.
Performance Comparison
Uniform random traffic
Worst-case traffic
Cost Comparison
Conclusion
• Use high-radix routers to take advantage
of increased router bandwidth
• Flattened Butterfly exploits high-radix
routers and global adaptive routing to give
cost-effective network
• Flattened butterfly has lower hop count
than folded Clos and better path diversity
than conventional Butterfly
• On adversarial traffic, exploits global
adaptive routing to match performance of
folded Clos with ½ the cost
Download