Grapevine: An Exercise in Distributed Computing
Landon Cox
February 16, 2016

Naming other computers
• Low-level interface
  • Provide the destination MAC address
  • 00:13:20:2E:1B:ED
• Middle-level interface
  • Provide the destination IP address
  • 152.3.140.183
• High-level interface
  • Provide the destination hostname
  • www.cs.duke.edu

Translating hostname to IP addr
• Hostname → IP address
• Performed by the Domain Name Service (DNS)
• Used to be a central server
  • /etc/hosts at SRI
• What's wrong with this approach?
  • Doesn't scale to the global Internet

DNS
• Centralized naming doesn't scale
  • Server has to learn about all changes
  • Server has to answer all lookups
• Instead, split up the data
  • Use a hierarchical database
  • Hierarchy allows local management of changes
  • Hierarchy spreads lookup work across many computers

Where is www.wikipedia.org?

Example: linux.cs.duke.edu
• nslookup in interactive mode
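Outside of nslookup, the same hostname-to-IP translation is available through the OS resolver. A minimal sketch using Python's standard socket library (the hostname is the one from the slide; the helper name is ours):

```python
# Minimal sketch: resolve a hostname to IP addresses via the system
# resolver, which consults /etc/hosts and then DNS.
import socket

def resolve(hostname):
    # getaddrinfo returns one entry per (family, socktype) combination;
    # entry[4][0] is the address itself.
    infos = socket.getaddrinfo(hostname, None)
    return sorted({info[4][0] for info in infos})

if __name__ == "__main__":
    print(resolve("www.cs.duke.edu"))
```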
Translating IP to MAC addrs
• IP address → MAC address
• Performed by the ARP protocol within a LAN
• How does a router know the MAC address of 152.3.140.183?
  • ARP (Address Resolution Protocol)
  • If it doesn't know the mapping, broadcast through the switch
  • "Whoever has this IP address, please tell me your MAC address"
  • Cache the mapping
  • "/sbin/arp"
• Why is broadcasting over a LAN ok?
  • Number of computers connected to a switch is relatively small

Broadcast on local networks
• On a wired ethernet switch
  • ARP requests/replies are broadcast
  • For the most part, IP communication is not broadcast (w/ caveats)
• What about on a wireless network?
  • Everything is broadcast
  • Means hosts can see all unencrypted traffic
• Why might this be dangerous?
  • Means any unencrypted traffic is visible to others
  • Open WiFi access points + non-SSL web requests and pages
  • Many sites send cookie credentials in the clear …
  • Use secure APs and SSL!

High-level network overview
[Diagram: workstations and servers on several Ethernets, joined by gateways]

Client-server
• Classic and convenient structure for distributed systems
• How do clients and servers differ?
  • Servers have more physical resources (disk, RAM, etc.)
  • Servers are trusted by all clients
• Why are servers more trustworthy?
  • Usually have better, more reliable hardware
  • Servers are better administered (paid staff watch over them)
• Servers are kind of like the kernel of a distributed system
  • Centralized concentration of trust
  • Support coordinated activity of mutually distrusting clients

Client-server
• Why not put everything on one server?
  • Scalability problems (server becomes overloaded)
  • Availability problems (server becomes a single point of failure)
  • Want to retain organizational control of some data (some distrust)
• How do we address these issues?
  • Replicate servers
  • Place multiple copies of the server in the network
  • Allow clients to talk to any server with the appropriate functionality
• What are some drawbacks to replication?
  • Data consistency (need sensible answers from servers)
  • Resource discovery (which server should I talk to?)

Client-server
• Kernels are centralized too
  • Subject to availability, scalability problems
• Does it make sense to replicate kernels?
  • Perhaps for multi-core machines
  • Assign a kernel to each core
  • Separate the address spaces of each kernel
  • Coordinate actions via message passing
  • Multi-core starts to look a lot like a distributed system

Grapevine services
• Message delivery
  • Send data to specified users
• Access control
  • Only allow specified users to access a name
• Resource discovery
  • Where can I find a printer?
• Authentication
  • How do I know who I am talking to?

Registration servers
• What logical data structure is replicated?
  • The registry
  • RName → Group entry | Individual entry
• What does an RName look like?
  • Character string F.R
  • F is a name (individual or group)
  • R is a registry corresponding to a data partition
• At what grain is registration data replicated?
  • Servers contain copies of whole registries
  • An individual server is unlikely to have a copy of all registries

RNames
• RName = name.registry
  • Group entry: {RName1, …, RNameN}
  • Individual entry: Authenticator (password), Inbox sites, Connect site
• What two entities are represented by an individual entry?
  • Users and servers
• How does an individual entry allow communication with a user?
  • Inbox sites for users
• How does an individual entry allow communication with a server?
  • Connect site for servers

Namespace
• RNames provide a symbolic namespace
  • Similar to a file-system hierarchy or DNS
  • Autonomous control of names within a registry
• What is the most important part of the namespace?
  • *.gv (for Grapevine)
  • *.gv is replicated at every registration server
• Who gets to define the other registries?
  • All other registries must have a group entry under *.gv
  • Owners of *.gv have complete control over other registries
• In what way do file systems and DNS operate similarly?
  • ICANN's root DNS servers decide top-level domains
  • Root user controls the root directory "/"

Resource discovery
• How do clients locate server replicas?
  • Get the list of all registries via "gv.gv"
  • Find the registry name for the service (e.g., "ms")
  • Look up the group ms.gv at a registration server
  • ms.gv returns a list of available servers (e.g., *.ms)
• At this point control is transferred to the service
  • Service has autonomous control of its namespace
  • Service can define its own namespace conventions

Implementing services
• Mail servers are replicated
  • Any message server accepts any delivery request
  • All message servers can forward to others
  • An individual may have inboxes on many servers
• How does a client identify a server to send a message? (see the sketch below)
  • Find the well-known name "MailDrop.ms" in *.ms
  • MailDrop.ms maps to mail servers
  • Any mail server can accept a message
  • Mail servers forward the message to servers hosting the users' inboxes
• Note that the mail service makes "MailDrop.ms" special
  • Grapevine only defines the semantics of *.gv
  • Grapevine delegates control of the semantics of *.ms to the mail service
  • Similar to imap.cs.duke.edu or www.google.com
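A toy model of that lookup chain, assuming invented registry contents and addresses (the group/individual layout follows the RNames slide; helper names and data are illustrative, not from the paper):

```python
# Toy model of the "Resource discovery" / "Implementing services" chain.
# Registry contents and connect-site strings below are made up.
REGISTRY = {
    "gv.gv":       {"kind": "group", "members": ["gv", "ms"]},            # all registries
    "ms.gv":       {"kind": "group", "members": ["server1.ms", "server2.ms"]},
    "MailDrop.ms": {"kind": "group", "members": ["server1.ms", "server2.ms"]},
    "server1.ms":  {"kind": "individual", "connect_site": "addr-of-server1"},
    "server2.ms":  {"kind": "individual", "connect_site": "addr-of-server2"},
}

def lookup(rname):
    """Ask any registration server that holds the registry named by rname."""
    return REGISTRY[rname]

def pick_mail_server():
    # The mail service defines the well-known group MailDrop.ms; its members
    # are the servers willing to accept a message for delivery.
    candidates = lookup("MailDrop.ms")["members"]
    # Take the first one; a real client would fall back to the others on
    # failure, and the chosen server forwards to whichever servers hold the
    # recipients' inboxes.
    return lookup(candidates[0])["connect_site"]

print(pick_mail_server())   # -> "addr-of-server1"
```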
Resource discovery
• Bootstrapping resource discovery
  • Rely on lower-level methods
  • Broadcast to a name lookup server on the Ethernet
  • Broadcast to a registration server on the Ethernet
• What data does the name lookup server store?
  • Simple string → Internet address mappings
  • Infrequently updated (minimal consistency issues)
  • Well-known "GrapevineRServer" → addresses of registration servers
• What does this remind you of on today's networks?
  • Dynamic Host Configuration Protocol (DHCP)
  • Clients broadcast a DHCP request on the Ethernet
  • DHCP server (usually on the gateway) responds with IP addr, DNS info

Updating replicated servers
• At some point we need to update the registration database
  • Want to add new machines
  • Want to reconfigure server locations
• Why not require updates to be atomic at all servers?
  • Requires that most servers be accessible to even start
  • All kinds of reasons why this might not be true
    • Trans-Atlantic phone line might be down
    • Servers might be offline for maintenance
    • Servers might be offline due to failure
• Instead embrace the chaos of eventual consistency
  • Might have transient differences between server state
  • Eventually everything will look the same (probably!)

Updating the database
• A registration entry holds two timestamped lists
  • Active items: {str1|t1, …, strn|tn}
  • Deleted items: {str1|t1, …, strm|tm}
• Information included in timestamps
  • Time + server address
  • Timestamps are guaranteed to be unique
  • Provides a total order on updates from a server
• Does the entry itself need a timestamp (a version)?
  • Not really, can just compute it as the max of the item timestamps
  • Entry version is a convenient optimization

Updating the database
• Operations on entries
  • Can add/delete items from the lists
  • Can merge lists
  • Operations update item timestamps and modify list contents
• A small sketch of these lists and the merge rule appears after this run of slides

Updating the database
• How are updates propagated?
  • Asynchronously via the messaging service (i.e., *.ms)
  • Does not require all servers to be online
  • Updates can be buffered and ordered

Updating the database
• How fast is convergence?
  • Registration servers check their inbox every 30 seconds
  • If all are online, state will converge in ~30 seconds
  • If a server is offline, it may take longer

Updating the database
• What happens if two admins update concurrently?
  • "it is hard to predict which one of them will prevail."
  • "acceptable" because the admins aren't talking to each other
  • Anyone make sense of this?

Updating the database
• Why not just use a distributed lock?
  • What if a replica is offline during acquire, but reappears?
  • What if the lock owner crashes?
  • What if the lock maintainer crashes?

Updating the database
• What if clients get different answers from servers?
  • Clients just have to deal with it (•_•) ( •_•)>⌐■-■ (⌐■_■)
  • Inconsistencies are guaranteed to be transient
  • May not be good enough for some applications

Updating the database
• What happens if a change message is lost during propagation?
  • Could lead to permanent inconsistency
  • Periodic replica comparisons, and merges if needed
  • Not perfect, since partitions can prevent propagation
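A sketch of the registration-entry structure above: items carry (time, server address) timestamps, deletions leave tombstones, and a merge keeps whichever record has the later timestamp, so replicas converge once all change messages arrive. Names and representation here are illustrative, not from the paper:

```python
# Illustrative registration entry: active/deleted items with unique
# (time, server) timestamps, merged with a per-item last-writer-wins rule.
import time

class Entry:
    def __init__(self, server_addr):
        self.server = server_addr
        self.items = {}                     # item string -> (timestamp, is_active)

    def _stamp(self):
        return (time.time(), self.server)   # unique and totally ordered per server

    def add(self, item):
        self.items[item] = (self._stamp(), True)

    def delete(self, item):
        self.items[item] = (self._stamp(), False)   # keep a tombstone

    def merge(self, other):
        # Keep whichever record carries the later timestamp; every replica
        # resolves concurrent updates the same way, so state converges.
        for item, (ts, active) in other.items.items():
            if item not in self.items or ts > self.items[item][0]:
                self.items[item] = (ts, active)

    def version(self):
        # The entry "version" is just the max item timestamp (the optimization
        # mentioned on the slide).
        return max((ts for ts, _ in self.items.values()), default=None)

    def active(self):
        return {i for i, (_, a) in self.items.items() if a}
```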
Updating the database
• What happens if the namespace is modified concurrently?
  • Use timestamps to pick a winner (last writer wins)
• Why is this potentially dangerous?
  • The later (winning) update could be trapped on an offline machine
  • Meanwhile, updates to the first name accumulate
  • When the offline machine comes back online, all of that work on the first name is thrown out

Updating the database
• What was the solution?
  • "Shouldn't happen in practice."
  • Humans should coordinate out-of-band
  • Probably true, but a little unsatisfying

Why read Grapevine?
• Describes many fundamental problems
  • Performance and availability
  • Caching and replication
  • Consistency problems
• We still deal with many of these issues

Keeping replicas consistent
• Requirement: members of the write set agree
  • Write request only returns if WS members agree
• Problem: things fall apart
  • What do we do if something fails in the middle?
  • This is why we had multiple replicas in the first place
• Need agreement protocols that are robust to failures

Two-phase commit
• Two phases
  • Voting phase
  • Completion phase
• During the voting phase
  • Coordinator proposes a value to the rest of the group
  • Other replicas tentatively apply the update, reply "yes" to the coordinator
• During the completion phase
  • Coordinator tallies the votes
  • Success (entire group votes "yes"): coordinator sends a "commit" message
  • Failure (some "no" votes or no reply): coordinator sends an "abort" message
  • On success, group members commit the update and send an "ack" to the coordinator
  • On failure, group members abort the update and send an "ack" to the coordinator
  • Coordinator aborts/applies the update when all "acks" have been received
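A minimal in-process sketch of that message flow, assuming an invented Replica class and no networking, timeouts, or crashes (those failure cases are exactly what the next slides walk through):

```python
# Minimal sketch of two-phase commit: voting phase, tally, completion phase.
class Replica:
    def __init__(self):
        self.committed = None
        self.tentative = None

    def prepare(self, value):
        # Phase 1: tentatively apply the update and vote. A real replica
        # might vote "no" if it cannot take its local write lock.
        self.tentative = value
        return "yes"

    def decide(self, decision):
        # Phase 2: commit or abort the tentative update, then ack.
        if decision == "commit":
            self.committed = self.tentative
        self.tentative = None
        return "ack"

def two_phase_commit(value, replicas):
    votes = [r.prepare(value) for r in replicas]                        # voting phase
    decision = "commit" if all(v == "yes" for v in votes) else "abort"  # tally
    acks = [r.decide(decision) for r in replicas]                       # completion phase
    return decision if len(acks) == len(replicas) else "unknown"

group = [Replica(), Replica(), Replica()]
print(two_phase_commit("X1", group))    # -> "commit"
print([r.committed for r in group])     # -> ['X1', 'X1', 'X1']
```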
Two-phase commit (example run)
• Phase 1: coordinator sends "Propose: X1" to the three replicas
  • Each replica tentatively applies X1 and votes "yes"
• Phase 2: with 3 "yes" votes, the coordinator sends "Commit: X1"
  • Each replica commits X1 and replies "ack"

Two-phase commit (failure cases)
• What if fewer than 3 "yes" votes?
  • Coordinator sends "Abort: X1" (or silent replicas time out and assume abort)
  • Replicas do not commit
• Why might a replica vote "no"?
  • Might not be able to acquire its local write lock
  • Might be committing with another coordinator
• What if the coordinator fails after the vote messages, before the decision message?
  • Replicas will time out and assume the update is aborted
• What if the coordinator fails after the decision messages are sent?
  • Replicas commit the update
• What if the coordinator fails while the decision messages are being sent?
  • If one replica receives a commit, all must commit
  • If a replica times out, it checks with the other members
  • If any member received a commit, all commit
• What if a replica crashes during 2PC?
  • Coordinator removes it from the replica group
  • If the replica recovers, it can rejoin the group later

Two-phase commit
• Anyone detect circular dependencies here?
  • How do we agree on the coordinator?
  • How do we agree on the group membership?
• Need more powerful consensus protocols
  • Can become very complex
  • Protocols vary depending on what a "failure" is
  • Will cover in depth very soon
• Two classes of failures
  • Fail-stop: failed nodes do not respond
  • Byzantine: failed nodes generate arbitrary outputs

Two-phase commit
• What's another problem with this protocol?
  • It's really slow
  • And it's slow even when there are no failures (the common case)
• Consistency often requires taking a performance hit
  • As we saw, it can also undermine availability
  • Can think of an unavailable service as a really slow service

Course administration
• Project 2 questions?
  • Animesh is working on a test suite
• Mid-term exam
  • Friday, March 11
  • Responsible for everything up to that point