Scalable, Distributed Data Structures for Internet Service Construction
Landon Cox
March 2, 2016

In the year 2000 …
• Portals were thought to be a good idea
  • Yahoo!, Lycos, AltaVista, etc.
  • Original content up front + a searchable directory
• The dot-com bubble was about to burst
  • Started to break around 1999
  • Lots of companies washed out by 2001
• Google was really taking off
  • Founded in 1998
  • PageRank was extremely accurate
  • Proved: great search is enough (and portals are dumb)
• Off in the distance: Web 2.0, Facebook, AWS, “the cloud”

Questions of the day
1. How do we build highly-available web services?
  • Support millions of users
  • Want high throughput
2. How do we build highly-available peer-to-peer services?
  • Napster had just about been shut down (centralized)
  • BitTorrent was around the corner
  • Want to scale to thousands of nodes
  • No centralized, trusted administration or authority
  • Problem: everything can fall apart (and does)
• Some of the solutions to #2 can help with #1

Storage interfaces
[Figure: two storage stacks. A file system exposes a file hierarchy (directories and files) over physical storage, accessed with calls like mkdir, create, open, read, and write. A DBMS exposes a logical schema (attributes and values) over physical storage, accessed with SQL queries.]
• What is the interface to a file system?
• What is the interface to a DBMS?

Data independence
• Data independence
  • Idea that storage issues should be hidden from programs
  • Programs should operate on data independently of underlying details
• In what way do FSes and DBs provide data independence?
  • Both hide the physical layout of data
  • Can change the layout without altering how programs operate on the data
• In what way do DBs provide stronger data independence?
  • File systems leave the format of data within files up to programs
  • One program can alter/corrupt a file format that other programs depend on
  • Database clients cannot corrupt the schema definition

ACID properties
• Databases also ensure ACID
• What is meant by Atomicity?
  • Sequences of operations are submitted via transactions
  • All operations in a transaction succeed or fail together
  • No partial success (or failure)
• What is meant by Consistency?
  • After a transaction commits, the DB is in a “consistent” state
  • Consistency is defined by data invariants
  • i.e., after a transaction completes, all invariants are true
• What is the downside of ensuring Consistency?
  • In tension with concurrency and scalability
  • Particularly in distributed settings

ACID properties
• What is meant by Isolation?
  • Other processes cannot view the modifications of in-flight transactions
  • Similar to atomicity: the effects of a transaction cannot be partially viewed
• What is meant by Durability?
  • After a transaction commits, data will not be lost
  • Committed transactions survive hardware and software failures

ACID properties
• Do file systems ensure ACID properties?
  • Not really
  • Atomicity: operations can be buffered, re-ordered, flushed asynchronously
  • Consistency: many different consistency models
  • Isolation: hard to ensure isolation without a notion of transactions
  • Durability: the need to cache data undermines guarantees (can use sync)
• What do file systems offer instead of ACID?
  • Faster performance
  • Greater flexibility for programs
  • A byte-array abstraction rather than a table abstraction
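To make the atomicity point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, the 100-unit balance rule, and the transfer amounts are invented for illustration; the point is only that the two updates inside the transaction commit together or not at all, which a pair of plain file writes would not guarantee.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
    conn.commit()

    def transfer(amount):
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if an exception escapes it: all or nothing.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = 'alice'",
                (amount,))
            if amount > 100:
                # Invariant violated (a made-up rule for this example):
                # abort so the partial debit above is undone.
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = 'bob'",
                (amount,))

    try:
        transfer(500)   # fails: the debit of alice is rolled back
    except ValueError:
        pass
    transfer(40)        # succeeds: both updates commit together

    print(conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
    # [('alice', 60), ('bob', 40)]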
Needs of cluster-based storage
• Want three things
  • Scalability (incremental addition of machines)
  • Availability (tolerate the failure/loss of machines)
  • Consistency (sensible answers to requests)
• Traditional DBs fail to provide these features
  • Focus on strong consistency hinders scalability and availability
  • Requires a lot of coordination and complexity
• For file systems, it depends
  • Some offer strong consistency guarantees (poor scalability)
  • Some offer good scalability (poor consistency)

Distributed data structures (DDS)
• Paper from OSDI ’00
  • Steve Gribble, Eric Brewer, Joseph Hellerstein, and David Culler
• Pointed out the inadequacies of traditional storage for large-scale services
• Proposed a new storage interface
  • More structured than file systems (structure is provided by the DDS)
  • Not as fussy as databases (no SQL)
  • A few operations on data structure elements
[Figure: processes issue get and put operations to the DDS, which stores key-value pairs across storage “bricks.”]

Distributed Hash Tables (DHTs)
• DHT: same idea as DDS, but decentralized
• Same interface as a traditional hash table
  • put(key, value): stores value under key
  • get(key): returns all the values stored under key
• Built over a distributed overlay network
  • Partition the key space over the available nodes
  • Route each put/get request to the appropriate node

(The DHT slides that follow are adapted from Sean C. Rhea’s talks “OpenDHT: A Public DHT Service” and “Fixing the Embarrassing Slowness of OpenDHT on PlanetLab.”)

How DHTs Work
[Figure: several nodes each hold a slice of the key-value table; a put(k1,v1) enters at one node and a later get(k1) enters at another.]
• How do we ensure the put and the get find the same machine?
• How does this work in DNS?

Nodes form a logical ring
[Figure: nodes with identifiers 000, 010, 100, and 110 arranged on a ring.]
• First question: how do new nodes figure out where they should go on the ring?

Step 1: Partition Key Space
• Each node in the DHT will store some (k,v) pairs
• Given a key space K, e.g. [0, 2^160):
  • Choose an identifier for each node, id_i ∈ K, uniformly at random
  • A pair (k,v) is stored at the node whose identifier is closest to k
• Key technique: cryptographic hashing
  • Node id = SHA1(MAC address)
  • P(SHA-1 collision) <<< P(hardware failure)
  • Nodes can independently compute their own ids
• (Contrast this with DDS, in which an admin manually assigned nodes to partitions.)
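A minimal Python sketch of the Step 1 partitioning rule, assuming made-up node names stand in for MAC addresses: every node hashes its own name into [0, 2^160) with SHA-1, and a key is stored at the node whose identifier is closest to SHA1(key) on the ring.

    import hashlib

    SPACE = 2 ** 160  # SHA-1 output space, [0, 2^160)

    def sha1_id(data: str) -> int:
        """Map a string (a node name or a key) into [0, 2^160)."""
        return int.from_bytes(hashlib.sha1(data.encode()).digest(), "big")

    def ring_distance(a: int, b: int) -> int:
        """Distance between two identifiers on the circular key space."""
        d = abs(a - b)
        return min(d, SPACE - d)

    def responsible_node(node_ids, key: str) -> int:
        """Return the id of the node whose identifier is closest to SHA1(key)."""
        k = sha1_id(key)
        return min(node_ids, key=lambda nid: ring_distance(nid, k))

    # Each node computes its own id independently; we hash invented node
    # names here instead of MAC addresses.
    node_ids = [sha1_id(f"node-{i}") for i in range(8)]

    print(hex(responsible_node(node_ids, "some key")))
    print(hex(responsible_node(node_ids, "another key")))

Because every node applies the same hash function, any node can compute which node is responsible for a key without consulting a central administrator, which is exactly the contrast with DDS noted above.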
Step 2: Build Overlay Network
• Each node has two sets of neighbors
• Immediate neighbors in the key space
  • Important for correctness
• Long-hop neighbors
  • Allow puts/gets in O(log n) hops

Step 3: Route Puts/Gets Through the Overlay
• Route greedily, always making progress toward the key
[Figure: a get(k) request hops across the key space [0, 2^160), each hop landing closer to k.]

How Does Lookup Work?
• Assign IDs to nodes
• Map hash values to the node with the closest ID
• Leaf set is the node’s successors and predecessors
  • Correctness
• Routing table matches successively longer prefixes
  • Efficiency
[Figure: a lookup travels from the source toward the node responsible for the lookup ID; short (green) hops use the leaf set, while (red) hops via the routing table match successively longer prefixes of IDs such as 00…, 10…, 110…, and 111….]

Iterative vs. recursive
• Previous example: recursive lookup
• Could also perform the lookup iteratively
[Figure: recursive routing forwards the request hop by hop through the overlay; iterative routing has the requester contact each hop directly.]
• Which one is faster?
• Why might I want to do this iteratively?
• What does DNS do, and why?

Example routing state (LPC: from the Pastry paper)
[Figure: an example node’s routing state from the Pastry paper.]

OpenDHT Partitioning
• Assign each node an identifier from the key space
• Store a key-value pair (k,v) on the several nodes with IDs closest to k
  • Call them the replicas for (k,v)
[Figure: the node with id = 0xC9A1… is responsible for the keys closest to its identifier.]

OpenDHT Graph Structure
• Overlay neighbors match prefixes of the local identifier
• Choose among nodes with the same matching prefix length by network latency
[Figure: a node and its neighbors 0x41, 0x84, 0xC0, and 0xED.]

Performing Gets in OpenDHT
• Client sends a get request to a gateway
• Gateway routes it along neighbor links to the first replica encountered
• Replica sends the response back directly over IP
[Figure: the client sends get(0x6b) to a gateway, which routes it via node 0x41 to replica 0x6c; the get response returns directly to the client.]

DHTs: The Hype
• High availability
  • Each key-value pair is replicated on multiple nodes
• Incremental scalability
  • Need more storage/throughput? Just add more nodes.
• Low latency
  • Recursive routing, proximity neighbor selection, server selection, etc.

Robustness Against Failure
• If a neighbor dies, a node routes through its next-best one
• If a replica dies, the remaining replicas create a new one to replace it
[Figure: routing around a failed node; the path involves the client and nodes 0xC0, 0x41, and 0x6c.]

Routing Around Failures
• Under churn, neighbors may have failed
• How do we detect failures?
  • Acknowledge each hop
• What if we don’t receive an ACK?
  • Resend through a different neighbor

Computing Good Timeouts
• What if the timeout is too long?
  • Increases put/get latency
• What if the timeout is too short?
  • Get a message explosion from spurious retransmissions
• Three basic approaches to timeouts
  • Safe and static (~5 s)
  • Rely on a history of observed RTTs (TCP style); see the sketch after the timeout results below
  • Rely on a model of RTT based on location
• Chord errs on the side of caution
  • Very stable, but gives long lookup latencies

Timeout results
[Results figure omitted.]
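The TCP-style option above (keep a history of observed RTTs) can be sketched as follows. The smoothing constants are the usual TCP values and the 5-second cap is the “safe and static” fallback; both are assumptions for the example, not Bamboo/OpenDHT’s actual parameters.

    class RttEstimator:
        """Adaptive per-neighbor timeout from a history of observed RTTs,
        in the style of TCP's SRTT/RTTVAR estimator."""

        def __init__(self, min_timeout=0.1, max_timeout=5.0):
            self.srtt = None        # smoothed round-trip time (seconds)
            self.rttvar = None      # smoothed RTT deviation
            self.min_timeout = min_timeout
            self.max_timeout = max_timeout   # the "safe and static" fallback

        def observe(self, rtt: float) -> None:
            """Fold one measured round-trip time into the estimate."""
            if self.srtt is None:
                self.srtt, self.rttvar = rtt, rtt / 2
            else:
                self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - rtt)
                self.srtt = 0.875 * self.srtt + 0.125 * rtt

        def timeout(self) -> float:
            """Timeout for the next hop: long enough to avoid spurious
            retransmissions, short enough to keep lookup latency down."""
            if self.srtt is None:
                return self.max_timeout      # no history yet: be conservative
            rto = self.srtt + 4 * self.rttvar
            return max(self.min_timeout, min(self.max_timeout, rto))

    # Example: a neighbor whose RTT hovers around 80 ms gets a sub-second
    # timeout instead of the static 5 s worst case.
    est = RttEstimator()
    for sample in (0.080, 0.075, 0.090, 0.082):
        est.observe(sample)
    print(round(est.timeout(), 3))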
Recovering From Failures
• Can’t route around failures forever
  • Will eventually run out of neighbors
• Must also find new nodes as they join
  • Especially important if they’re our immediate predecessors or successors
[Figure: a new node joins the ring and takes over part of an existing node’s old responsibility in the key space.]

Recovering From Failures
• Obvious algorithm: reactive recovery
  • When a node stops sending acknowledgements, notify other neighbors of potential replacements
  • Similar techniques for the arrival of new nodes
[Figure: nodes A, B, C, and D on the key space; when B fails, its neighbors tell each other “B failed, use A” and “B failed, use D.”]

The Problem with Reactive Recovery
• What if B is alive, but the network is congested?
  • C still perceives a failure due to dropped ACKs
  • C starts recovery, further congesting the network
  • More ACKs are likely to be dropped
  • Creates a positive feedback cycle (= BAD)
• This was the problem with Pastry
  • Combined with poor congestion control, it causes the network to partition under heavy churn

Periodic Recovery
• Every period, each node sends its neighbor list to each of its neighbors
[Figure: nodes A, B, C, D, and E on the key space; C announces “my neighbors are A, B, D, and E.”]
• How does this break the feedback loop?
  • The volume of recovery messages is independent of the number of failures
• Do we need to send the entire list?
  • No, can send a delta from the last message
• What if we contact only a random neighbor (instead of all neighbors)?
  • Still converges in log(k) rounds (k = number of neighbors)
• (A sketch of one periodic-recovery round appears at the end of these notes.)

Recovery results
[Results figure omitted.]

More key-value stores
• Two settings in which you can use DHTs
  • DDS in a cluster
  • Bamboo on the open Internet
• How is “the cloud” (e.g., EC2) different/similar?
  • The cloud is a combination of fast/slow networks
  • The cloud is under a single administrative domain
  • Cloud machines should fail less frequently
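Here is the sketch of periodic recovery referenced above: a toy, single-round version in Python in which each node pushes its neighbor list to its neighbors and recipients merge what they hear. The membership table, message format, and merge rule are assumptions for illustration, not Bamboo’s actual protocol; a real implementation would exchange leaf sets and keep only the closest entries.

    import random

    def periodic_recovery_round(neighbors, contact_all=True):
        """One round of periodic recovery over a toy membership table.
        `neighbors` maps node id -> set of neighbor ids."""
        outgoing = []
        for node, nbrs in neighbors.items():
            if not nbrs:
                continue
            targets = nbrs if contact_all else {random.choice(sorted(nbrs))}
            for target in targets:
                # Traffic per period depends only on the size of the neighbor
                # lists, not on how many failures or joins happened -- this is
                # what breaks the reactive-recovery feedback loop.
                outgoing.append((target, set(nbrs) | {node}))
        for target, advertised in outgoing:
            if target in neighbors:
                neighbors[target] |= advertised - {target}

    # Tiny example: E has just joined and only D knows about it; after one
    # round, D's other neighbors learn about E too.
    table = {
        "A": {"B", "C"},
        "B": {"A", "C", "D"},
        "C": {"B", "D"},
        "D": {"C", "E"},
        "E": {"D"},
    }
    periodic_recovery_round(table)
    print(sorted(table["C"]))   # now includes 'E', learned from D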