CloudDCNet

Computer Networks
Datacenter Networks
Lin Gu
lingu@ieee.org
Rack-Mounted Servers
Sun Fire x4150 1U server
Scale Up vs. Scale Out
(Diagram: scaling up – Personal System, Departmental Server, SMP, MPP, Super Server – versus scaling out – Cluster of PCs)
Data center networks

10’s to 100’s of thousands of hosts, often closely coupled, in close proximity:
 e-business (e.g., Amazon)
 content-servers (e.g., YouTube, Akamai, Apple, Microsoft)
 search engines, data mining (e.g., Google)

challenges:
 multiple applications, each serving massive numbers of clients
 managing/balancing load, avoiding processing, networking, data bottlenecks
(Photo: inside a 40-ft Microsoft container, Chicago data center)
Data center networks
load balancer: application-layer routing
 receives external client requests
 directs workload within data center
 returns results to external client (hiding data center internals from client)
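A minimal sketch (my own illustration, not the data center's actual software) of what such an application-layer load balancer does: accept an external request, pick an internal server, and relay the reply so the client never sees internal addresses. The backend addresses and the round-robin policy are assumptions.

import itertools

# Made-up internal server addresses; a real deployment would discover these.
BACKENDS = ["10.0.1.11:8080", "10.0.1.12:8080", "10.0.1.13:8080"]
_next_backend = itertools.cycle(BACKENDS)

def forward_to_backend(addr: str, request: bytes) -> bytes:
    # Stand-in for a real proxy hop (e.g., an internal HTTP call).
    return b"response from " + addr.encode()

def handle_client_request(request: bytes) -> bytes:
    """Receive an external request, direct it to an internal server,
    and relay the result back, so the client never sees internal IPs."""
    backend = next(_next_backend)                     # simple round-robin policy
    return forward_to_backend(backend, request)

if __name__ == "__main__":
    for _ in range(4):
        print(handle_client_request(b"GET /"))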
(Diagram: traffic from the Internet passes a border router, an access router, and load balancers, then Tier-1 switches, Tier-2 switches, and TOR switches, down to server racks 1-8; points A, B, C are marked in the topology)
Data center networks
 rich interconnection among switches, racks:
 increased throughput between racks (multiple routing paths possible)
 increased reliability via redundancy
(Diagram: Tier-1, Tier-2, and TOR switches richly interconnected above server racks 1-8)
Low Earth Orbit networks
Wireless communication is convenient, and can be high-bandwidth. Satellites can be an effective solution.
 Some unsuccessful earlier attempts, e.g., Iridium, …
 Such systems may come back in the future
 It does not have to be a satellite – Google Loon, …
Deep space communication
How to communicate in the Solar System, in the Galaxy, or in deeper space?
 Extremely long latency
 What protocols work?
 How to build the transceivers?
 Befriend physicists
“The network is the computer”
- Sun Microsystems
Appendix
Motivations of using Clusters over Specialized
Parallel Computers
• Individual PCs are becoming increasingly powerful
• Communication bandwidth between PCs is increasing and latency is decreasing (Gigabit Ethernet, Myrinet)
• PC clusters are easier to integrate into existing networks
• User utilization of PCs is typically low (<10%)
• Development tools for workstations and PCs are mature
• PC clusters are cheap and readily available
• Clusters can be easily grown
Cluster Architecture
(Diagram: sequential and parallel applications run on a parallel programming environment and cluster middleware that provide a single system image and availability infrastructure; each PC/workstation node runs communications software over its network interface hardware, and all nodes are joined by a cluster interconnection network/switch)
Datacenter Networking
Major Components of a Datacenter
• Computing hardware (equipment racks)
• Power supply and distribution hardware
• Cooling hardware and cooling fluid distribution hardware
• Network infrastructure
• IT Personnel and office equipment
Datacenter Networking
Growth Trends in Datacenters
• Load on network & servers continues to grow rapidly
– Rapid growth: a rough estimate of annual growth rate: enterprise datacenters ~35%, Internet datacenters 50%-100%
– Information access anywhere, anytime, from many devices
• Desktops, laptops, PDAs & smart phones, sensor networks, proliferation of broadband
• Mainstream servers moving towards higher-speed links
– 1-GbE to 10-GbE in 2008-2009
– 10-GbE to 40-GbE in 2010-2012
• High-speed datacenter-MAN/WAN connectivity
– High-speed datacenter syncing for disaster recovery
Datacenter Networking
• Networking is a large part of the total cost of the DC hardware
– Large routers and high-bandwidth switches are very expensive
• Relatively unreliable – many components may fail
• Many major operators and companies design their own datacenter networking to save money and improve reliability/scalability/performance
– The topology is often known
– The number of nodes is limited
– The protocols used in the DC are known
• Security is simpler inside the data center, but challenging at the border
• We can distribute applications to servers to distribute load and minimize hot spots
Datacenter Networking
Networking components (examples)
(Example switch: 64 10-GE ports upstream, 768 1-GE ports downstream)
• High-performance & high-density switches & routers
– Scaling to 512 10GbE ports per chassis
– No need for proprietary protocols to scale
• Highly scalable DC border routers
– 3.2 Tbps capacity in a single chassis
– 10 million routes, 1 million in hardware
– 2,000 BGP peers
– 2K L3 VPNs, 16K L2 VPNs
– High port density for GE and 10GE application connectivity
– Security
Datacenter Networking
Common data center topology
(Diagram: the Internet connects to the datacenter core layer of Layer-3 routers, then an aggregation layer of Layer-2/3 switches, then an access layer of Layer-2 switches attached to the servers)
Datacenter Networking
Data center network design goals
• High network bandwidth, low latency
• Reduce the need for large switches in the core
• Simplify the software, push complexity to the
edge of the network
• Improve reliability
• Reduce capital and operating cost
Datacenter Networking
Avoid this…
and simplify this…
Interconnect
Can we avoid using high-end switches?
• Expensive high-end switches to scale up
• Single point of failure and bandwidth bottleneck
– Experiences from real systems
• One answer: DCell
Interconnect
DCell Ideas
• #1: Use mini-switches to scale out
• #2: Leverage servers to be part of the routing infrastructure
– Servers have multiple ports and need to forward packets
• #3: Use recursion to scale and build complete graph to increase capacity
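The scaling behind idea #3 can be made concrete. The recurrence below follows my reading of the DCell construction (a level-k DCell joins t_{k-1}+1 copies of a level-(k-1) DCell into a complete graph), so it is a sketch rather than code from the slides:

def dcell_servers(n: int, k: int) -> int:
    """Number of servers in a level-k DCell built from n-port mini-switches."""
    t = n                      # DCell_0: one mini-switch connecting n servers
    for _ in range(k):
        t = t * (t + 1)        # t_k = t_{k-1} * (t_{k-1} + 1)
    return t

# Small switches scale out very fast: with n = 4,
print([dcell_servers(4, k) for k in range(4)])   # [4, 20, 420, 176820]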
Data Center Networking
One approach: switched network with
a hypercube interconnect
• Leaf switch: 40 1Gbps ports + 2 10Gbps ports
– One switch per rack
– Not replicated (if a switch fails, lose one rack of capacity)
• Core switch: 10 10Gbps ports
– Form a hypercube
• Hypercube – the high-dimensional analogue of a cube
Interconnect
Hypercube properties
• Minimum hop count
• Even load distribution for all-to-all communication
• Can route around switch/link failures
• Simple routing:
– Outport = f(Dest xor NodeNum)
– No routing tables
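A small sketch (assumed, not from the slides) of the table-free routing rule: the output port is the position of a bit in which Dest xor NodeNum is 1, here the lowest such bit (dimension-order routing):

def hypercube_outport(node: int, dest: int) -> int | None:
    """Return the port (dimension index) to forward on, or None if arrived."""
    diff = node ^ dest                        # Dest xor NodeNum, as on the slide
    if diff == 0:
        return None                           # packet has reached its destination
    return (diff & -diff).bit_length() - 1    # index of the lowest set bit

def route(src: int, dest: int) -> list[int]:
    """Trace the hop-by-hop path; no routing tables are needed."""
    path, node = [src], src
    while (port := hypercube_outport(node, dest)) is not None:
        node ^= 1 << port                     # crossing port i flips address bit i
        path.append(node)
    return path

# Example on the 16-node (dimension-4) hypercube of the next figure:
print(route(5, 12))    # -> [5, 4, 12]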
Interconnect
A 16-node (dimension 4) hypercube
(Figure: nodes 0-15, each with four ports labeled 0-3; two nodes are connected when their 4-bit addresses differ in exactly one bit)
Interconnect
64-switch Hypercube
(Figure: four 4×4 sub-cubes joined by 16 links each; core switch: 10 × 10Gbps ports; leaf switch: 40 × 1Gbps ports + 2 × 10Gbps ports; 63 × 4 links to other containers; one container: 81920 servers with 1Gbps bandwidth)
Level 2: 2 10-port 10 Gb/sec switches – 16 10 Gb/sec links
Level 1: 8 10-port 10 Gb/sec switches – 64 10 Gb/sec links
Level 0: 32 40-port 1 Gb/sec switches – 1280 1 Gb/sec links
How many servers can be connected in this system?
Data Center Networking
The Black Box
Interconnect
Typical layer 2 & Layer 3 in existing systems
• Layer 2
– One spanning tree for entire network
• Prevents looping
• Ignores alternate paths
• Layer 3
– Shortest path routing between source and
destination
– Best-effort delivery
Interconnect
Problem with common DC topology
• Single point of failure
• Oversubscription of links higher up in the topology
– Trade-off between cost and provisioning
• Layer 3 will only use one of the existing equal-cost paths
• Packet re-ordering occurs if layer 3 blindly takes advantage of path diversity
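One common remedy, sketched below under my own assumptions, is to hash the flow's 5-tuple when choosing among equal-cost paths: every packet of a flow follows one path, so path diversity is used without reordering packets within a flow.

import hashlib

def pick_path(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
              proto: str, num_paths: int) -> int:
    """Return the index of the equal-cost path for this flow."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

# All packets of one TCP flow map to the same path; different flows spread out.
print(pick_path("10.0.0.1", "10.4.1.2", 51000, 80, "tcp", num_paths=4))
print(pick_path("10.0.0.1", "10.4.1.2", 51001, 80, "tcp", num_paths=4))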
Interconnect
Fat-tree based Solution
Connect hosts together using a fat-tree topology
– Infrastructure consists of cheap devices
• Each port supports the same speed as the end host
– All devices can transmit at line speed if packets are distributed along existing paths
– A k-ary fat-tree is composed of switches with k ports
• How many switches? … 5k²/4
• How many connected hosts? … k³/4
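A short sketch (my arithmetic, not the paper's code) that derives these counts for a k-ary fat-tree built only from k-port switches:

def fat_tree_counts(k: int) -> dict[str, int]:
    assert k % 2 == 0, "k must be even"
    edge = k * (k // 2)          # k pods, k/2 edge switches per pod
    agg = k * (k // 2)           # k pods, k/2 aggregation switches per pod
    core = (k // 2) ** 2         # k^2/4 core switches
    hosts = k * (k // 2) ** 2    # k/2 hosts per edge switch -> k^3/4
    return {"switches": edge + agg + core,   # = 5k^2/4
            "hosts": hosts,
            "core": core, "aggregation": agg, "edge": edge}

# For the k=4 example on the next slide: 20 switches, 16 hosts.
print(fat_tree_counts(4))
print(fat_tree_counts(48))   # commodity 48-port switches: 27,648 hosts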
Interconnect
k-ary Fat-Tree (k=4)
(Diagram: k²/4 core switches; k pods, each with k/2 aggregation switches and k/2 edge switches; k/2 hosts per edge switch)
Use the same type of switch in the core, aggregation, and edge layers, with each switch having k ports
Fat-tree Modified
Interconnect
• Enforce a special addressing scheme in the DC
– Allows hosts attached to the same switch to route only through that switch
– Allows intra-pod traffic to stay within the pod
– unused.PodNumber.switchnumber.Endhost
• Use two-level look-ups to distribute traffic and maintain packet ordering.
Interconnect
2-Level look-ups
• First level is a prefix lookup
– Used to route down the topology to the end host
• Second level is a suffix lookup
– Used to route up towards the core
– Diffuses and spreads out traffic
– Maintains packet ordering by using the same ports for the same end host
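A hedged sketch of the two-level table on one pod switch, with made-up table contents and an assumed "10.pod.switch.host" address layout: a prefix hit routes down toward a known subtree, otherwise the host suffix picks one of the k/2 uplinks, so packets for the same end host always take the same port.

K = 4                                    # 4-ary fat-tree example

# First level: prefix -> downlink port (one entry per edge switch below us).
PREFIX_TABLE = {"10.0.0.": 0, "10.0.1.": 1}

def lookup(dst_ip: str) -> int:
    prefix, host = dst_ip.rsplit(".", 1)
    prefix += "."
    if prefix in PREFIX_TABLE:           # route down: prefix hit
        return PREFIX_TABLE[prefix]
    # Second level: suffix (host byte) -> uplink port. The same host byte
    # always picks the same uplink, which keeps packets for that host in order.
    uplinks = [2, 3]                     # the k/2 ports facing the core
    return uplinks[int(host) % (K // 2)]

print(lookup("10.0.1.2"))   # down toward a local edge switch -> port 1
print(lookup("10.2.0.3"))   # up toward the core, chosen by host suffix -> port 3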
Interconnect
Comparison of several schemes
– Hypercube: high-degree interconnect for a large net, difficult to scale incrementally
– Butterfly and fat-tree: cannot scale as fast as DCell
– De Bruijn: cannot incrementally expand
– DCell: low bandwidth between two clusters (sub-DCells)
Distributed Systems
Sun Fire x4150 1U server
Cloud and Globalization of Computation
Users are geographically distributed, and computation is globally optimized.
(Diagram: DNS and a global load-balancing (LB) system steer users to one of several datacenters)
Load Balancing
• The load balancing systems regulate global data center traffic
• Incorporate site health, load, user proximity, and service response for user site selection
• Provide transparent site failover in case of disaster or service outage
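A small illustrative sketch (sites, weights, and numbers are invented) of such site selection: unhealthy sites are skipped for transparent failover, and the remaining sites are scored by proximity/response time and load.

from dataclasses import dataclass

@dataclass
class Site:
    name: str
    healthy: bool
    load: float          # 0.0 (idle) .. 1.0 (saturated)
    rtt_ms: float        # measured proximity / service response for this user

def pick_site(sites: list[Site]) -> Site:
    candidates = [s for s in sites if s.healthy]     # transparent failover:
    if not candidates:                               # unhealthy sites are skipped
        raise RuntimeError("no healthy site available")
    # Lower is better: weight latency and load (weights are assumptions).
    return min(candidates, key=lambda s: s.rtt_ms + 200.0 * s.load)

sites = [Site("us-west", True, 0.30, 25.0),
         Site("eu-west", True, 0.80, 90.0),
         Site("asia-east", False, 0.10, 60.0)]       # down -> fail over away
print(pick_site(sites).name)                         # -> "us-west" for this user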
Global Data Center Deployment
• Providing site selection for users
• Harnessing the benefits and intricacies of geo-distribution
• Leveraging both DNS and non-DNS methods for multi-site redundancy
Google’s Search System
Computing in an LSDS
The browser issues a query:
 DNS lookup
 HTTP handling
 GWS
 Backend
 HTTP response
(Diagram: Google.com is served by GWS front ends in San Jose, London, and Hong Kong, connected to a Backend)
Google’s Cluster Architecture
Goals
 A high-performance distributed system for search
 Thousands of machines collaborate to handle the workload
 Price-performance ratio
 Scalability
 Energy efficiency and cooling
 High availability
Luiz Andre Barroso, Jeffrey Dean, Urs Holzle. Web Search for a Planet: The Google
Cluster Architecture. IEEE Micro, vol. 23, no. 2, pp. 22-28, Mar./Apr. 2003
How to compute in a network?
 Multiple computers on a perfect network may work like one larger traditional computer.
 However, computing becomes more complex when
 messages can be lost/tampered/duplicated.
 bandwidth is limited.
 operations incur long latencies, non-uniform latencies, or both.
 events are asynchronous.
 Hence, computation in an LSDS over imperfect networks may have to be organized in a different way from traditional computer systems.
 How to correctly compute on an imperfect network?
Two Generals’ Problem
Two generals want to agree on a time to attack an enemy in between them. If the attack is synchronized, the generals can defeat the enemy. Otherwise, they will be defeated by the enemy one by one. The generals can send messengers to each other, but a messenger may be caught by the enemy. Can the two generals reach an agreement?
 Send a messenger, then expect the messenger to come back with one acknowledgment?
 Send 100 messengers?
 How to prove it is possible or impossible to reach an agreement?
Attack at 5am.
Three Generals’ Problem in Paxos
Paxos: reach global consensus in a distributed system with packet loss
 Who decides the attack time?
 When is the decision made and agreed on?
 What if one general betrayed? Byzantine Generals Problem.
Further reading: [Lamport98] Leslie Lamport. The part-time parliament. ACM
Trans. Comput. Syst. 16, 2 (May. 1998), 133-169.
How to maintain state in a network? Part-Time
Parliament
Priests (legislators) in a parliament communicate with each other using messengers. Both the priests and the messengers can leave the parliament Chamber at any time, and may never come back. Can the parliament pass laws (decrees) and ensure consistency (no ambiguity/discrepancy on the content of a decree)?
 Priests can leave the Chamber (server crash or isolated from the system) and may never come back (server failure).
 Messengers can leave the Chamber (delayed or out-of-order packets) and may never come back (packet loss).
Paxos
Protocol 1: Suppose we know there are n priests. A priest constructs a decree and sends it to the other n-1 priests, and collects their votes that support the decree. A vote against the decree is equivalent to not voting. If there are n-1 votes for the decree, the decree is passed.
 Each priest keeps records of the passed decrees (and some additional information) on his/her ledger (nonvolatile storage).
 Messengers deliver the candidate decree and votes.
(Figure: the proposer sends “Tax=0”; the others answer “OK”.)
Problem?
Paxos
Protocol 2: … A decree is passed by a majority voting for it.
 Resilient to server failures and packet loss.
 State (passing or not passing) of a decree is defined unambiguously.
 What is “majority”?
 The proposing priest may contact a “quorum” consisting of a majority of the priests.
(Figure: the proposer sends “Tax=0” to a quorum; quorum members answer “OK”.)
Problem?
Paxos
Protocol 3: … Inform all priests about the passing of a decree.
 Clients can query any priest, and the priest may know the decree.
 What if the particular priest does not know the decree?
(Figure: “Tax=0”, “OK”, then a “Tax=0 done” message is sent to all priests.)
Problem?
Paxos
Protocol 4: … Read from a majority set.
 If all replies agree, the decree is (perhaps) unambiguous.
 What if a priest in the majority set does not know?
(Figure: a client asks “Tax = ?” of a majority of the priests.)
Problem?
Paxos
Protocol 5: … Read following a majority.
 Will there be a majority?
 Can the majority be wrong?
(Figure: a client asks “Tax = ?”.)
Answers to a query should be consistent (identical, or, at least, compatible).
Paxos
Protocol 6: Consider a single decree (e.g., tax = 0) –
1. One priest serves as the president, and proposes a decree with a unique ballot no. b. The president sends the ballot with the proposal to a set of priests.
2. A priest responds to the receipt of the proposal message by replying with its latest vote (LastVote) and a promise that it would not respond to any ballots whose ballot nos. are between the LastVote’s ballot no. and b. LastVote can be null.
Paxos
Protocol 6: (continued)
3. After receiving promises from a majority set, the president selects the value for the decree based on the LastVotes of this set (the quorum), and sends the acceptance of the decree to the quorum.
4. The members of the quorum reply with a confirmation (vote) to the president; receiving confirmations (votes) from all the quorum members means the decree is passed.
Paxos
Protocol 6: (continued)
5. After receiving votes from the whole quorum, the president records the decree in its ledger, and sends a success message to all the priests.
6. Upon receiving the success message, a priest records the decree d in its ledger.
How to know the protocol works correctly?
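To make the six steps concrete, here is a minimal in-memory sketch of one ballot. All names (Priest, run_ballot) are my own; the promise is simplified to "highest ballot promised so far", messages are never lost, and there is a single president, so this only illustrates the message pattern, not a full Paxos implementation.

class Priest:
    def __init__(self, name):
        self.name = name
        self.promised_ballot = -1   # highest ballot this priest has promised to
        self.last_vote = None       # (ballot, value) of the latest vote, or None
        self.ledger = None          # decree recorded once a success message arrives

    # Step 2: reply with LastVote and a promise not to vote in smaller ballots.
    def receive_proposal(self, ballot):
        if ballot > self.promised_ballot:
            self.promised_ballot = ballot
            return self.last_vote   # may be None
        return "reject"             # already promised a higher ballot

    # Step 4: vote (confirm) if the promise still allows it.
    def receive_accept(self, ballot, value):
        if ballot >= self.promised_ballot:
            self.last_vote = (ballot, value)
            return "vote"
        return "reject"

    # Step 6: record the passed decree.
    def receive_success(self, value):
        self.ledger = value

def run_ballot(president_value, ballot, priests):
    # Step 1: send the ballot with the proposal to a set of priests.
    promises = {p: p.receive_proposal(ballot) for p in priests}
    quorum = [p for p, r in promises.items() if r != "reject"]
    if len(quorum) <= len(priests) // 2:
        return None                            # no majority promised
    # Step 3: choose the value from the highest-ballot LastVote in the quorum,
    # or the president's own value if every LastVote is null.
    prior = [promises[p] for p in quorum if promises[p] is not None]
    value = max(prior)[1] if prior else president_value
    votes = [p for p in quorum if p.receive_accept(ballot, value) == "vote"]
    if len(votes) != len(quorum):
        return None                            # step 4 needs the whole quorum
    # Steps 5-6: announce success to all priests.
    for p in priests:
        p.receive_success(value)
    return value

priests = [Priest(f"priest{i}") for i in range(5)]
print(run_ballot("tax = 0", ballot=1, priests=priests))   # -> 'tax = 0'
print([p.ledger for p in priests])                        # every ledger agrees

With five priests and no prior votes, the run passes "tax = 0" and every ledger records it; a later ballot with a different proposal would find that LastVote in its quorum and keep the same value, which is the property the next slides state.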
Paxos
This leads to a system where every passed decree is the same as the first passed one.
Paxos
All passed decrees are identical.
Three Generals’ Problem in Paxos
Can we use Paxos to solve the Three Generals’ problem?
 Who decides the attack time?
 How to agree?
 What if one general betrayed?
Byzantine Generals Problem.
Attack at 5am.
Beyond Single-Decree Paxos
Can we use Paxos to pass more than one decree?
 Multiple Paxos instances
 Sequence of instances
 Further reading: [Lamport98] Leslie Lamport. The part-time
parliament. ACM Trans. Comput. Syst. 16, 2 (May. 1998), 133-169.
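A tiny sketch of the "sequence of instances" idea (all names invented): each log slot is decided by its own single-decree instance, and decrees are applied in slot order; choose_value is a stand-in for running one full synod instance such as the ballot sketch above.

def choose_value(slot: int, proposal: str) -> str:
    return proposal            # placeholder for one full single-decree instance

class ReplicatedLedger:
    def __init__(self):
        self.log = {}          # slot number -> chosen decree
        self.next_slot = 0

    def propose(self, decree: str) -> int:
        slot = self.next_slot
        self.log[slot] = choose_value(slot, decree)   # one Paxos instance per slot
        self.next_slot += 1
        return slot

    def replay(self):
        """Apply decrees in slot order: every replica sees the same sequence."""
        return [self.log[s] for s in sorted(self.log)]

ledger = ReplicatedLedger()
ledger.propose("tax = 0")
ledger.propose("olive tax = 3")
print(ledger.replay())         # -> ['tax = 0', 'olive tax = 3']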