FAT-TREE: A SCALABLE, COMMODITY DATA CENTER NETWORK ARCHITECTURE
Mohammad Al-Fares, Alexander Loukissas, Amin Vahdat
Presented by: Bassma Aldahlan
Outline
• Cloud Computing
• Data Center Networks
• Problem Space
• Paper Goal
• Background
• Architecture
• Implementation
• Evaluation
• Results and Ongoing Work
Cloud Computing
• "Cloud computing is a
model for enabling
convenient, on-demand
network access to a
shared pool of.
configurable computing
resources (for example,
networks, servers,
storage, applications,
and services) that can
be rapidly provisioned
and released with
minimal management
effort or service provider
interaction."
Data Center Networks
• Data centers have recently received significant attention as a cost-effective infrastructure for storing large volumes of data and hosting large-scale service applications.
Data Center Networks
• A data center (DC) is a facility consisting of servers (physical machines), storage and network devices (e.g., switches, routers, and cables), power distribution systems, and cooling systems.
• A data center network is the communication infrastructure used in a data center, described by the network topology, the routing/switching equipment, and the protocols used (e.g., Ethernet and IP).
Cloud vs. Datacenter
Cloud
• an off-premise form of computing that stores data on the Internet
• services are outsourced to third-party cloud providers
• less secure
• does not require time or capital to get up and running (less expensive)
Datacenter
• on-premise hardware that stores data within an organization's local network
• run by an in-house IT department
• more secure
• more expensive
Problem Space
• Inter-node communication bandwidth is the principal bottleneck in DCs (and other large-scale clusters).
• Two existing solutions:
• Specialized hardware and communication protocols (e.g., InfiniBand or Myrinet)
• can scale to clusters of thousands of nodes with high bandwidth, but:
• no commodity parts (more expensive)
• not natively compatible with TCP/IP applications
• Commodity Ethernet switches and routers to interconnect cluster machines
• scale poorly
• non-linear cost increase with cluster size
Design Goal
The goal of this paper is to design a data center communication architecture that meets the following goals:
• Scalable interconnection bandwidth
• arbitrary hosts can communicate at the full bandwidth of their network interfaces
• Economies of scale
• built from commodity switches
• Backward compatibility
• compatible with hosts running Ethernet and IP
Background
Common data center topology
[Figure: a conventional tree-based data center topology, with the Internet at the top, a core layer of layer-3 routers, an aggregation layer of layer-2/3 switches, an access layer of layer-2 switches, and the servers at the bottom of the data center.]
Background (Cont.)
 Oversubscription:
 the ratio of the worst-case achievable aggregate bandwidth among the end hosts to the total bisection bandwidth of a particular communication topology
 used to lower the total cost of the design
 typical designs are oversubscribed by a factor of 2.5:1 (400 Mbps per host) to 8:1 (125 Mbps per host); e.g., with 1 Gbps host NICs, 8:1 oversubscription leaves each host only 1000/8 = 125 Mbps of worst-case bandwidth
 an oversubscription of 1:1 indicates that all hosts may potentially communicate with arbitrary other hosts at the full bandwidth of their network interface
Multi-rooted DC network
• Multiple core switches at the top level
• Equal-Cost Multi-Path routing (ECMP)
Background (Cont.)
• Cost
[Figures: cost comparison charts]
Architecture
Clos Networks/Fat-Trees
• Adopt a special instance of a Clos topology
• Similar trends in telephone switches led to
designing a topology with high bandwidth by
interconnecting smaller commodity switches.
• Why fat-tree?
• A fat-tree has identical bandwidth at any bisection
• Each layer has the same aggregate bandwidth
Architecture (Cont.)
Fat-Trees
Interconnect racks (of servers) using a fat-tree topology
k-ary fat tree: three-layer topology (edge, aggregation and core)
each pod consists of (k/2)² servers and two layers of k/2 k-port switches; each k-port switch in the lower layer is directly connected to k/2 hosts
each edge switch connects to k/2 servers and k/2 aggregation switches
each aggregation switch connects to k/2 edge and k/2 core switches
(k/2)² core switches: each connects to all k pods (one port per pod)
(see the sketch of the resulting component counts below)
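As a quick sanity check of these counts, here is a minimal sketch (my own illustration, not from the paper or the slides; the function name and dictionary layout are arbitrary) that computes the component totals of a k-ary fat-tree:

# Sketch (illustrative only): component counts of a k-ary fat-tree.
def fat_tree_sizes(k: int) -> dict:
    assert k % 2 == 0, "k must be even"
    return {
        "pods": k,
        "edge_switches": k * (k // 2),         # k/2 edge switches per pod
        "aggregation_switches": k * (k // 2),  # k/2 aggregation switches per pod
        "core_switches": (k // 2) ** 2,
        "hosts": k * (k // 2) ** 2,            # (k/2)^2 hosts per pod
    }

print(fat_tree_sizes(4))   # 16 hosts, 20 switches in total (the k = 4 example)
print(fat_tree_sizes(48))  # 27,648 hosts from 48-port commodity switches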
Architecture (Cont.)
• Addressing
• IP addresses are allocated within the private 10.0.0.0/8 block.
• Pod switches:
10.pod.switch.1, where pod denotes the pod number (in [0, k − 1]) and switch denotes the position of that switch in the pod (in [0, k − 1], starting from left to right, bottom to top)
• Core switches:
10.k.j.i, where j and i denote that switch's coordinates in the (k/2)² core switch grid (each in [1, k/2], starting from top-left)
• Hosts:
10.pod.switch.ID, where ID is the host's position in that subnet (in [2, k/2 + 1], starting from left to right)
(a small sketch enumerating these addresses follows below)
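To make the convention concrete, a small sketch (again my own; the helper name and return structure are hypothetical) that enumerates the addresses implied by the rules above:

# Sketch: enumerate fat-tree addresses per the 10.pod.switch.x convention above.
def fat_tree_addresses(k: int):
    half = k // 2
    addrs = {"edge": [], "aggregation": [], "core": [], "hosts": []}
    for pod in range(k):
        for s in range(half):                    # lower (edge) switch positions
            addrs["edge"].append(f"10.{pod}.{s}.1")
            for h in range(2, half + 2):         # host IDs in [2, k/2 + 1]
                addrs["hosts"].append(f"10.{pod}.{s}.{h}")
        for s in range(half, k):                 # upper (aggregation) switch positions
            addrs["aggregation"].append(f"10.{pod}.{s}.1")
    for j in range(1, half + 1):                 # core switch grid coordinates
        for i in range(1, half + 1):
            addrs["core"].append(f"10.{k}.{j}.{i}")
    return addrs

# Example: fat_tree_addresses(4)["core"] -> ['10.4.1.1', '10.4.1.2', '10.4.2.1', '10.4.2.2']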
Architecture (Cont.)
• Two-level routing structure:
use two-level lookups to distribute traffic and maintain packet ordering
the first level is a prefix lookup
used to route down the topology to servers
the second level is a suffix lookup
used to route up towards the core
maintains packet ordering by using the same ports for the same server
(a minimal software sketch of this lookup follows below)
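A minimal software sketch of the idea (mine, not the authors' implementation): the primary table does longest-prefix matching on the destination, and a default /0 entry falls through to a secondary table keyed on the destination's last octet (the host ID), which is what spreads upward traffic across ports. The helper name, table encoding, and the specific suffix-to-port mapping in the usage lines are assumptions for illustration.

# Sketch of a two-level (prefix, then suffix) lookup.
import ipaddress

def two_level_lookup(dst, prefix_table, suffix_table):
    """prefix_table: list of (cidr, port); a port of None falls through to suffixes.
       suffix_table: dict mapping host-ID byte -> port."""
    addr = ipaddress.ip_address(dst)
    matches = [(ipaddress.ip_network(cidr).prefixlen, port)
               for cidr, port in prefix_table
               if addr in ipaddress.ip_network(cidr)]
    if matches:
        _, port = max(matches)              # longest-prefix match in the primary table
        if port is not None:
            return port                     # terminating prefix: route down
    return suffix_table[int(dst.split(".")[-1])]   # suffix on host ID: route up

# Usage, loosely following the pod-switch tables described later in the slides:
prefixes = [("10.2.0.0/24", 0), ("10.2.1.0/24", 1), ("0.0.0.0/0", None)]
suffixes = {2: 2, 3: 3}
print(two_level_lookup("10.2.0.3", prefixes, suffixes))   # -> 0 (route down, same pod)
print(two_level_lookup("10.3.0.2", prefixes, suffixes))   # -> 2 (route up via host-ID suffix)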
Architecture (Cont.)
• To implement the two-level routing tables:
they use Content Addressable Memory (CAM):
• fast at finding a match against a bit pattern
• performs parallel searches among all its entries in a single clock cycle
• lookup engines use a special kind of CAM, called Ternary CAM (TCAM), which can also store don't-care bits for prefix matching
Architecture (Cont.)
• Routing Algorithm:
Architecture (Cont.)
[Figure: worked example of two-level routing tables in a k = 4 fat-tree, showing the tables installed in the switches along a path between hosts in different pods. Pod switches hold terminating /24 prefixes (e.g., 10.2.0.0/24 → port 0, 10.2.1.0/24 → port 1) to route traffic down toward their own subnets, while the default 0.0.0.0/0 entry falls through to /8 suffix entries keyed on the host ID (e.g., 0.0.0.2/8, 0.0.0.3/8) that spread upward traffic across different core switches; core switches hold /16 prefixes (e.g., at 10.4.1.1: 10.0.0.0/16 → port 0, 10.1.0.0/16 → port 1, 10.2.0.0/16 → port 2) that route traffic down to the destination pod.]
Architecture (Cont.)
• Two-level routing tables
• Traffic diffusion occurs in the first half of a packet's journey (on the way up)
• Subsequent packets to the same host follow the same path, avoiding packet reordering
• A centralized protocol initializes the routing tables
Architecture (Cont.)
• Flow Classification
• used with dynamic port-reassignment in pod switches to avoid local congestion
• a flow: a sequence of packets with the same source and destination IP addresses
• In particular, pod switches:
• 1. recognize subsequent packets of the same flow and forward them on the same outgoing port (to avoid packet reordering)
• 2. periodically reassign a minimal number of flow output ports to minimize any disparity between the aggregate flow capacity of different ports (to ensure fair flow balancing)
Architecture (Cont.)
• Flow scheduling
• routing large flows plays the most important role in determining the achievable bisection bandwidth of a network
• schedule large flows to minimize their overlap with one another
Architecture (Cont.)
• Edge switches:
• initially assign a new flow to the least-loaded port
• detect any outgoing flow with a size above a threshold
• notify the central scheduler about the source and destination of all active large flows
• Goals:
• prevent large, long-lived flows from sharing the same link
• assign long-lived flows to different links
• avoid congestion by assigning each flow to an uncontended path
Architecture (Cont.)
• Central Scheduler (replicated)
• tracks all active large flows and assigns them non-conflicting paths
• maintains boolean state for all links in the network signifying their availability to carry large flows
• For inter-pod traffic:
• there are (k/2)² possible paths, each corresponding to a core switch
• the scheduler searches the core switches to find one whose corresponding path components do not include a reserved link, and marks those links as reserved
• it then notifies the relevant lower- and upper-layer switches in the source pod of the correct outgoing port that corresponds to that flow's chosen path
(a simplified sketch of this reservation logic follows below)
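A rough sketch of the reservation idea (a simplified illustration under my own assumptions, not the authors' code): each inter-pod flow has (k/2)² candidate paths, one per core switch, and the scheduler picks the first core switch whose path links are all unreserved. The link model here (one pod-to-core hop on each side) is deliberately simplified, and all names are hypothetical.

# Sketch of central-scheduler path reservation for inter-pod large flows.
class CentralScheduler:
    def __init__(self, k: int):
        self.k = k
        self.reserved = set()          # links currently reserved for large flows
        self.flow_paths = {}           # (src, dst) -> chosen core index

    def place_inter_pod_flow(self, src: str, dst: str):
        src_pod, dst_pod = int(src.split(".")[1]), int(dst.split(".")[1])
        for core in range((self.k // 2) ** 2):   # one candidate path per core switch
            links = {(src_pod, core), (core, dst_pod)}
            if not (links & self.reserved):      # all path components unreserved?
                self.reserved |= links
                self.flow_paths[(src, dst)] = core
                return core                      # pod switches would then be notified
        return None                              # no uncontended path available

    def expire_flow(self, src: str, dst: str):
        """Garbage-collect a flow, clearing its link reservations."""
        core = self.flow_paths.pop((src, dst), None)
        if core is not None:
            src_pod, dst_pod = int(src.split(".")[1]), int(dst.split(".")[1])
            self.reserved -= {(src_pod, core), (core, dst_pod)}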
Architecture (Cont.)
• For intra-pod large flows: the scheduler similarly assigns an uncontended path through one of the pod's upper-layer switches.
• The scheduler garbage collects flows whose last update is older than a given time, clearing their reservations.
• (The flow classifier's port reassignment, in contrast, runs locally every few seconds.)
Implementation
• Two-level table
• Click is a modular software router architecture that supports implementation of experimental router designs.
• A Click router is a graph of packet-processing modules called elements that perform tasks such as routing table lookup.
• A new Click element implementing the two-level lookup is built.
• Flow Scheduler
• the element FlowReporter is implemented; it resides in all edge switches and detects outgoing flows whose size is larger than a given threshold
• it sends regular notifications to the central scheduler about these active large flows
• the flow classifier heuristic attempts to switch, if needed, the output port of at most three flows, to minimize the difference between the aggregate flow capacity of its output ports (pseudocode below)
// Flow classifier: called on every incoming packet
hash the source and destination IP fields of the packet
if seen(hash) then
    look up the previously assigned port x
else
    record the new flow f
    assign f to the least-loaded upward port x
end
send the packet on port x

// Flow classifier: called every t seconds, for every aggregate switch
repeat up to three times:
    find the ports p_max and p_min with the largest and smallest aggregate outgoing traffic
    D = load(p_max) − load(p_min)
    find the largest flow f assigned to port p_max whose size is smaller than D
    if such a flow exists then
        switch the output port of flow f to p_min
    end
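The same heuristic as a small runnable Python sketch (my own rendering of the pseudocode above, not the Click element; the class, method, and data-structure names are illustrative):

# Sketch of the pod-switch flow classifier: hash-based port assignment plus
# periodic rebalancing of at most three flows.
import collections

class FlowClassifier:
    def __init__(self, upward_ports):
        self.ports = list(upward_ports)
        self.flow_port = {}                                   # (src, dst) -> assigned port
        self.flow_size = collections.Counter()                # (src, dst) -> bytes seen
        self.port_load = {p: 0 for p in self.ports}

    def incoming_packet(self, src, dst, nbytes):
        key = (src, dst)                                      # stands in for hash(src, dst)
        if key in self.flow_port:
            port = self.flow_port[key]                        # reuse the port: no reordering
        else:
            port = min(self.ports, key=self.port_load.get)    # least-loaded upward port
            self.flow_port[key] = port
        self.flow_size[key] += nbytes
        self.port_load[port] += nbytes
        return port                                           # send the packet on this port

    def rearrange_flows(self):
        """Called every t seconds: move at most three flows from the most-loaded
        to the least-loaded upward port to shrink the imbalance."""
        for _ in range(3):
            p_max = max(self.ports, key=self.port_load.get)
            p_min = min(self.ports, key=self.port_load.get)
            gap = self.port_load[p_max] - self.port_load[p_min]
            movable = [(sz, f) for f, sz in self.flow_size.items()
                       if self.flow_port[f] == p_max and sz < gap]
            if not movable:
                break
            sz, f = max(movable)                              # largest flow smaller than the gap
            self.flow_port[f] = p_min
            self.port_load[p_max] -= sz
            self.port_load[p_min] += sz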
Fault-Tolerance
• In this scheme, each switch in the network maintains a
BFD (Bidirectional Forwarding Detection) session with
each of its neighbors to determine when a link or
neighboring switch fails
 Failure between an upper-layer switch and a core switch
 Outgoing inter-pod traffic: the local routing table marks the affected link as unavailable and chooses another core switch
 Incoming inter-pod traffic: the core switch broadcasts a tag to the upper-layer switches it is directly connected to, signifying its inability to carry traffic to that entire pod; those upper-layer switches then avoid that core switch when assigning flows destined to that pod
Fault-Tolerance
• Failure between a lower- and an upper-layer switch
– Outgoing inter- and intra-pod traffic from the lower-layer switch:
the local flow classifier sets the cost of that link to infinity, does not assign it any new flows, and chooses another upper-layer switch
– Intra-pod traffic using the upper-layer switch as an intermediary:
the switch broadcasts a tag notifying all other lower-layer switches in the pod, which check for this tag and avoid the affected switch when assigning new flows
– Inter-pod traffic coming into the upper-layer switch:
the switch sends a tag to all its core switches signifying its inability to carry traffic; the core switches mirror this tag to the upper-layer switches they connect to in other pods, which then avoid the affected core switch when assigning new flows (see the toy sketch below)
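A toy sketch of how an upper-layer switch could mask out failed links when assigning new flows (purely illustrative; BFD sessions and the actual tag broadcasts are not modeled, and all names are hypothetical):

# Sketch: an upper-layer pod switch avoiding failed or tagged uplinks when
# assigning new outgoing inter-pod flows.
class UpperLayerSwitch:
    def __init__(self, core_ports):
        self.core_ports = set(core_ports)   # uplink ports toward core switches
        self.failed = set()                 # links marked unavailable (e.g., by BFD)
        self.broken_cores_by_pod = {}       # pod -> uplinks whose core cannot reach that pod

    def link_down(self, port):
        self.failed.add(port)               # local table marks the link unavailable

    def core_tag(self, core_port, unreachable_pod):
        # A broadcast tag from a core switch: it cannot carry traffic to this pod.
        self.broken_cores_by_pod.setdefault(unreachable_pod, set()).add(core_port)

    def pick_uplink(self, dst_pod):
        usable = (self.core_ports - self.failed
                  - self.broken_cores_by_pod.get(dst_pod, set()))
        return min(usable) if usable else None   # any policy works; "min" keeps it deterministic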
Power and Heat Issues
Results: Heat & Power Consumption
Evaluation
• Experiment Description
In a 4-port (k = 4) fat-tree there are 16 hosts, four pods (each with four switches), and four core switches: a total of 20 switches and 16 end hosts.
These 36 elements are multiplexed onto ten physical machines, interconnected by a 48-port ProCurve 2900 switch with 1 Gigabit Ethernet links.
Each pod of switches is hosted on one machine; each pod's hosts are hosted on one machine; and the two remaining machines run two core switches each.
Evaluation (Cont.)
• For the comparison case (a traditional hierarchical tree): four machines running four hosts each, and four machines each running four pod switches with one additional uplink.
• The four pod switches are connected to a 4-port core switch running on a dedicated machine.
• This yields a 3.6:1 oversubscription on the uplinks from the pod switches to the core switch (four hosts × 96 Mbit/s of demand share a 106.67 Mbit/s uplink: 384 / 106.67 ≈ 3.6).
• These uplinks are bandwidth-limited to 106.67 Mbit/s, and all other links are limited to 96 Mbit/s.
• Each host generates a constant 96 Mbit/s of outgoing traffic.
Results: Network Utilization
Packaging
 Increased wiring overhead is inherent to the fat-tree topology
 Each pod consists of 12 racks with 48 machines each, and 48 individual 48-port GigE switches
 Place the 48 switches in a centralized rack
 Cables move in sets of 12 from pod to pod and in sets of 48 from racks to pod switches, which opens additional opportunities for packaging to reduce wiring complexity
 Minimize total cable length by placing racks around the pod switch in two dimensions
Conclusion
 Bandwidth is the scalability bottleneck in large scale
clusters
 Existing solutions are expensive and limit cluster size
 Fat-tree topology with scalable routing and backward
compatibility with TCP/IP and Ethernet
 A large number of commodity switches has the potential to displace high-end switches in the DC, in the same way that clusters of commodity PCs have displaced supercomputers for high-end computing environments
THANK YOU
Q&A