Grapevine: An Exercise in Distributed Computing
Landon Cox
February 16, 2016
Naming other computers
• Low-level interface
• Provide the destination MAC address
• 00:13:20:2E:1B:ED
• Middle-level interface
• Provide the destination IP address
• 152.3.140.183
• High-level interface
• Provide the destination hostname
• www.cs.duke.edu
Translating hostname to IP addr
• Hostname → IP address
• Performed by Domain Name Service (DNS)
• Used to be a central server
• /etc/hosts at SRI
• What’s wrong with this approach?
• Doesn’t scale to the global Internet
DNS
• Centralized naming doesn’t scale
• Server has to learn about all changes
• Server has to answer all lookups
• Instead, split up data
• Use a hierarchical database
• Hierarchy allows local management of changes
• Hierarchy spreads lookup work across many computers
Where is
www.wikipedia.org?
Example: linux.cs.duke.edu
• nslookup in interactive mode
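As a quick illustration of the hostname → IP translation (not from the original slides), Python's standard socket module performs the same lookup that nslookup shows, using whatever DNS resolver the local host is configured with:

    import socket

    # Resolve a hostname to an IPv4 address via the host's configured DNS resolver.
    # This is the same hostname -> IP translation described above.
    print(socket.gethostbyname("linux.cs.duke.edu"))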
Translating IP to MAC addrs
• IP address → MAC address
• Performed by ARP protocol within a LAN
• How does a router know the MAC address of 152.3.140.183?
• ARP (Address Resolution Protocol)
• If it doesn’t know the mapping, broadcast through switch
• “Whoever has this IP address, please tell me your MAC address”
• Cache the mapping
• “/sbin/arp”
• Why is broadcasting over a LAN ok?
• Number of computers connected to a switch is relatively small
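A minimal sketch of the ARP behavior just described, assuming a hypothetical broadcast_who_has() helper for the LAN broadcast; this is illustrative Python, not how a real network stack is implemented:

    # ARP-style resolution: consult a cache, broadcast on a miss, cache the reply.
    arp_cache = {}  # IP address -> MAC address

    def broadcast_who_has(ip):
        # Hypothetical helper: "Whoever has this IP address, please tell me your MAC address."
        raise NotImplementedError("would send an Ethernet broadcast frame")

    def resolve_mac(ip):
        if ip in arp_cache:              # fast path: mapping already cached
            return arp_cache[ip]
        mac = broadcast_who_has(ip)      # slow path: broadcast through the switch
        arp_cache[ip] = mac              # cache the mapping (cf. /sbin/arp output)
        return mac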
Broadcast on local networks
• On wired ethernet switch
• ARP requests/replies are broadcast
• For the most part, IP communication is not broadcast (w/ caveats)
• What about on a wireless network?
• Everything is broadcast
• Means hosts can see all unencrypted traffic
• Why might this be dangerous?
• Means any unencrypted traffic is visible to others
• Open WiFi access points + non-SSL web requests and pages
• Many sites send cookie credentials in the clear …
• Use secure APs and SSL!
High-level network overview
[Figure: several Ethernet LANs of workstations and servers connected by gateways]
Client-server
• Classic and convenient structure for distributed
systems
• How do clients and servers differ?
• Servers have more physical resources (disk, RAM, etc.)
• Servers are trusted by all clients
• Why are servers more trustworthy?
• Usually have better, more reliable hardware
• Servers are better administered (paid staff watch over them)
• Servers are kind of like the kernel of a distributed
system
• Centralized concentration of trust
• Support coordinated activity of mutually distrusting clients
Client-server
• Why not put everything on one server?
• Scalability problems (server becomes overloaded)
• Availability problems (server becomes single point of failure)
• Want to retain organizational control of some data (some distrust)
• How do we address these issues?
• Replicate servers
• Place multiple copies of server in network
• Allow clients to talk to any server with appropriate functionality
• What are some drawbacks to replication?
• Data consistency (need sensible answers from servers)
• Resource discovery (which server should I talk to?)
Client-server
• Kernels are centralized too
• Subject to availability, scalability problems
• Does it make sense to replicate kernels?
• Perhaps for multi-core machines
• Assign a kernel to each core
• Separate address spaces of each kernel
• Coordinate actions via message passing
• Multi-core starts to look a lot like a distributed system
Grapevine services
• Message delivery
• Send data to specified users
• Access control
• Only allow specified users to access name
• Resource discovery
• Where can I find a printer?
• Authentication
• How do I know who I am talking to?
Registration servers
• What logical data structure is replicated?
• The registry
• RName → Group entry | Individual entry
• What does an RName look like?
• Character string F.R
• F is a name (individual or group)
• R is a registry corresponding to a data partition
• At what grain is registration data replicated?
• Servers contain copies of whole registries
• Individual server unlikely to have copy of all registries
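A rough sketch of the registry's logical structure as described above, in Python; the field names, registry names, and example entries are illustrative, not Grapevine's actual representation:

    from dataclasses import dataclass, field

    @dataclass
    class Individual:
        authenticator: str                                 # password
        inbox_sites: list = field(default_factory=list)   # servers holding this user's inboxes
        connect_site: str = ""                             # network address; meaningful for servers

    @dataclass
    class Group:
        members: set = field(default_factory=set)          # member RNames

    # Data is partitioned by registry (the "R" in an RName "F.R"); a registration
    # server stores whole registries, but usually not every registry.
    registries = {
        "gv": {                                            # *.gv is replicated at every registration server
            "ms.gv": Group(members={"server1.ms", "server2.ms"}),
        },
        "pa": {
            "alice.pa": Individual(authenticator="secret", inbox_sites=["server1.ms"]),
        },
    }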
RNames
[Diagram: an RName “name.registry” maps to either a Group entry {RName1, …, RNameN} or an Individual entry (authenticator/password, inbox sites, connect site)]
What two entities are represented by an individual entry?
Users and servers
How does an individual entry allow communication with a user?
Inbox sites for users
How does an individual entry allow communication with a server?
Connect site for servers
Namespace
• RNames provide a symbolic namespace
• Similar to file-system hierarchy or DNS
• Autonomous control of names within a registry
• What is the most important part of the namespace?
• *.gv (for Grapevine)
• *.gv is replicated at every registration server
• Who gets to define the other registries?
• All other registries must have group entry under *.gv
• Owners of *.gv have complete control over other registries
• In what way do file systems and DNS operate similarly?
• ICANN’s root DNS servers decide top-level domains
• Root user controls root directory “/”
Resource discovery
• How do clients locate server replicas?
• Get list of all registries via “gv.gv”
• Find registry name for service (e.g., “ms”)
• Lookup group ms.gv at registration server
• ms.gv returns a list of available servers (e.g., *.ms)
• At this point control is transferred to service
• Service has autonomous control of its namespace
• Service can define its own namespace conventions
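The discovery chain above can be sketched as follows; lookup() stands in for a query to any reachable registration server and is a hypothetical helper, not Grapevine's actual interface:

    def lookup(rname):
        # Hypothetical RPC: ask a registration server for the entry named by rname.
        ...

    def discover_service(service_registry):          # e.g., "ms" for the message service
        all_registries = lookup("gv.gv")              # gv.gv lists every registry
        assert service_registry in all_registries.members
        group = lookup(service_registry + ".gv")      # e.g., the group entry ms.gv
        return group.members                          # the service's servers (e.g., *.ms)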
Implementing services
• Mail servers are replicated
• Any message server accepts any delivery request
• All message servers can forward to others
• An individual may have inboxes on many servers
• How does a client identify a server to send a message?
• Find well-known name “MailDrop.ms” in *.ms
• MailDrop.ms maps to mail servers
• Any mail server can accept a message
• Mail servers forward message to servers hosting users’ inboxes
• Note that the mail service makes “MailDrop.ms” special
• Grapevine only defines semantics of *.gv
• Grapevine delegates control of semantics of *.ms to mail service
• Similar to imap.cs.duke.edu or www.google.com
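A sketch of that delivery path, reusing the hypothetical lookup() helper from the discovery sketch plus assumed send()/forward() transport primitives; it only illustrates the routing decisions described above:

    import random

    def send(server, message, recipients): ...         # assumed transport primitive
    def forward(site, message): ...                     # assumed transport primitive

    def deliver(message, recipients):
        mail_servers = lookup("MailDrop.ms").members    # MailDrop.ms maps to the mail servers
        server = random.choice(sorted(mail_servers))    # any mail server can accept the message
        send(server, message, recipients)

    def on_accept(message, recipients):                 # run by the accepting mail server
        for r in recipients:
            entry = lookup(r)                           # individual entry for the recipient
            forward(entry.inbox_sites[0], message)      # forward to a server hosting an inbox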
Resource discovery
• Bootstrapping resource discovery
• Rely on lower-level methods
• Broadcast to name lookup server on Ethernet
• Broadcast to registration server on Ethernet
• What data does the name lookup server store?
• Simple string to Internet address mappings
• Infrequently updated (minimal consistency issues)
• Well-known GrapevineRServer → addrs of registration servers
• What does this remind you of on today’s networks?
• Dynamic host configuration protocol (DHCP)
• Clients broadcast DHCP request on Ethernet
• DHCP server (usually on gateway) responds with IP addr, DNS info
Updating replicated servers
• At some point need to update registration database
• Want to add new machines
• Want to reconfigure server locations
• Why not require updates to be atomic at all servers?
• Requires that most servers be accessible to even start
• All kinds of reasons why this might not be true
• Trans-Atlantic phone line might be down
• Servers might be offline for maintenance
• Servers might be offline due to failure
• Instead embrace the chaos of eventual consistency
• Might have transient differences between server state
• Eventually everything will look the same (probably!)
Updating the database
[Diagram: a registration entry holds two lists: Active items {str1|t1, …, strn|tn} and Deleted items {str1|t1, …, strm|tm}]
• Information included in timestamps
• Time + server address
• Timestamps are guaranteed to be unique
• Provides a total order on updates from a server
• Does the entry itself need a timestamp (a version)?
• Not really, can just compute as the max of item timestamps
• Entry version is a convenient optimization
Updating the database
• Operations on an entry
• Can add/delete items from lists
• Can merge lists
• Operations update item timestamps and modify list contents (sketched below)
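A minimal sketch of this entry structure and its merge rule, assuming timestamps are plain comparable positive numbers (Grapevine makes them unique by pairing a clock value with the originating server's address); the class and method names are illustrative:

    class Entry:
        """A registration entry: items tagged with the timestamp of their last add/delete."""

        def __init__(self):
            self.active = {}     # item -> timestamp of the add that made it active
            self.deleted = {}    # item -> timestamp of the delete that removed it

        def add(self, item, ts):
            self.active[item] = ts
            self.deleted.pop(item, None)

        def delete(self, item, ts):
            self.deleted[item] = ts
            self.active.pop(item, None)

        def merge(self, other):
            # For each item, the operation with the larger (unique) timestamp wins,
            # so replicas converge regardless of the order in which updates arrive.
            for item, ts in other.active.items():
                if ts > max(self.active.get(item, 0), self.deleted.get(item, 0)):
                    self.add(item, ts)
            for item, ts in other.deleted.items():
                if ts > max(self.active.get(item, 0), self.deleted.get(item, 0)):
                    self.delete(item, ts)

Because every item remembers the timestamp of its latest add or delete, two replicas that exchange entries in either order end up with the same active and deleted lists.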
Updating the database
• How are updates propagated?
• Asynchronously via the messaging service (i.e., *.ms)
• Does not require all servers to be online
• Updates can be buffered and ordered
Updating the database
• How fast is convergence?
• Registration servers check their inbox every 30 seconds
• If all are online, state will converge in ~30 seconds
• If server is offline, may take longer
Updating the database
• What happens if two admins update concurrently?
• “it is hard to predict which one of them will prevail.”
• “acceptable” because admins aren’t talking to each other
• Anyone make sense of this?
Updating the database
• Why not just use a distributed lock?
• What if a replica is offline during acquire, but reappears?
• What if lock owner crashes?
• What if lock maintainer crashes?
Updating the database
• What if clients get different answers from servers?
• Clients just have to deal with it (•_•) ( •_•)>⌐■-■ (⌐■_■)
• Inconsistencies are guaranteed to be transient
• May not be good enough for some applications
Updating the database
• What happens if a change message is lost during propagation?
• Could lead to permanent inconsistency
• Periodic replica comparisons and mergers if needed
• Not perfect since partitions can prevent propagation
Updating the database
• What happens if namespace is modified concurrently?
• Use timestamps to pick a winner (last writer wins)
• Why is this potentially dangerous?
• Later update could be trapped in an offline machine
• Meanwhile, updates based on the first version accumulate
• When the offline machine comes back online, its later timestamp wins and all that accumulated work is thrown out
Updating the database
• What was the solution?
• “Shouldn’t happen in practice.”
• Humans should coordinate out-of-band
• Probably true, but a little unsatisfying
Why read Grapevine?
• Describes many fundamental problems
• Performance and availability
• Caching and replication
• Consistency problems
• We still deal with many of these issues
Keeping replicas consistent
• Requirement: members of write set agree
• Write request only returns if WS members agree
• Problem: things fall apart
• What do we do if something fails in the middle?
• This is why we had multiple replicas in the first place
• Need agreement protocols that are robust to failures
Two-phase commit
• Two phases
• Voting phase
• Completion phase
• During the voting phase
• Coordinator proposes value to rest of group
• Other replicas tentatively apply update, reply “yes” to coordinator
• During the completion phase
• Coordinator tallies votes
• Success (entire group votes “yes”): coordinator sends “commit” message
• Failure (some “no” votes or no reply): coordinator sends “abort” message
• On success, group member commits update, sends “ack” to coordinator
• On failure, group member aborts update, sends “ack” to coordinator
• Coordinator aborts/applies update when all “acks” have been received
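A coordinator-side sketch of the protocol just described, assuming hypothetical send(), recv_vote(), and recv_ack() message helpers; real implementations also need timeouts, durable logging, and recovery, which are omitted here:

    def send(replica, msg): ...          # assumed transport primitive
    def recv_vote(replica): ...          # returns "yes" or "no" (treat a timeout as "no")
    def recv_ack(replica): ...           # blocks until the replica acknowledges

    def two_phase_commit(replicas, update):
        # Phase 1 (voting): propose the update; replicas tentatively apply it and vote.
        votes = []
        for r in replicas:
            send(r, ("propose", update))
            votes.append(recv_vote(r))

        # Phase 2 (completion): commit only if the entire group voted yes, else abort.
        decision = "commit" if all(v == "yes" for v in votes) else "abort"
        for r in replicas:
            send(r, (decision, update))
            recv_ack(r)                  # the update is final once every ack arrives
        return decision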
Two-phase commit
Phase 1
[Figure: coordinator sends “Propose: X → 1” to three replicas; each tentatively applies X → 1 and replies “Yes”]
Two-phase commit
Phase 2
[Figure: with 3 Yes votes, the coordinator sends “Commit: X → 1”; the replicas commit and reply “ACK”]
Two-phase commit
Phase 1
[Figure: only 2 Yes votes reach the coordinator]
• What if fewer than 3 Yes votes?
• Replicas will time out and assume update is aborted
Two-phase commit
Phase 1
[Figure: coordinator sends “Abort: X → 1”]
• What if fewer than 3 Yes votes?
• Replicas do not commit
Two-phase commit
Phase 1
[Figure: coordinator receives only 2 Yes votes]
• Why might a replica vote No?
• Might not be able to acquire the local write lock
• Might be committing with another coordinator
Two-phase commit
Phase 2
[Figure: coordinator fails after collecting 3 Yes votes]
• What if coord. fails after vote msg, before decision msg?
• Replicas will time out and assume update is aborted
Two-phase commit
Phase 2
[Figure: coordinator fails after sending “Commit: X → 1”]
• What if coord. fails after decision messages are sent?
• Replicas commit the update
Two-phase commit
Phase 2
[Figure: coordinator fails partway through sending decision messages]
• What if coord. fails while decision messages are sent?
• If one replica receives a commit, all must commit
• If replicas time out, they check with the other members
• If any member received a commit, all commit
Two-phase commit
Phase 1 or 2
[Figure: one replica crashes during the protocol]
• What if replica crashes during 2PC?
• Coordinator removes it from the replica group
• If replica recovers it can rejoin the group later
Two-phase commit
• Anyone detect circular dependencies here?
• How do we agree on the coordinator?
• How do we agree on the group membership?
• Need more powerful consensus protocols
• Can become very complex
• Protocols vary depending on what a “failure” is
• Will cover in-depth very soon
• Two classes of failures
• Fail-stop: failed nodes do not respond
• Byzantine: failed nodes generate arbitrary outputs
Two-phase commit
• What’s another problem with this protocol?
• It’s really slow
• And it’s slow even when there are no failures (the common case)
• Consistency often requires taking a performance hit
• As we saw, it can also undermine availability
• Can think of an unavailable service as a really slow service
Course administration
• Project 2 questions?
• Animesh is working on a test suite
• Mid-term exam
• Friday, March 11
• Responsible for everything up to that point