“Mega-services”: Scale and Consistency
A Lightning Tour — Now with Elastic Scaling!
Jeff Chase, Duke University

A service
[Figure: a client sends a request to a web server; the web server calls an app server, which calls a DB server backed by a store; the reply flows back to the client.]

The Steve Yegge rant, part 1: Products vs. Platforms
Selectively quoted/clarified from http://steverant.pen.io/, emphasis added. This is an internal Google memorandum that “escaped”. Yegge had moved to Google from Amazon. His goal was to promote service-oriented software structures within Google.

So one day Jeff Bezos [CEO of Amazon] issued a mandate... [to the developers in his company]. His Big Mandate went something along these lines:
1) All teams will henceforth expose their data and functionality through service interfaces.
2) Teams must communicate with each other through these interfaces.
3) There will be no other form of interprocess communication allowed: no direct linking, no direct reads of another team's data store, no shared-memory model, no back-doors whatsoever. The only communication allowed is via service interface calls over the network.

The Steve Yegge rant, part 2: Products vs. Platforms
4) It doesn't matter what technology they use. HTTP, Corba, PubSub, custom protocols -- doesn't matter. Bezos doesn't care.
5) All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions.
6) Anyone who doesn't do this will be fired.
7) Thank you; have a nice day!

Managing overload
• What should we do when a service is in overload?
– Overload: the service is close to saturation, leading to unacceptable response time. λ > λmax
– Work queues grow without bound, increasing memory consumption.
[Figure: delivered throughput X vs. offered load λ, saturating at λmax.]

Options for overload
1. Thrashing
– Keep trying and hope things get better. Accept each request and inject it into the system.
Then drop requests at random if some queue overflows its memory bound. Note: this drops requests after work has been invested in them, wasting work and reducing throughput (e.g., congestion collapse).
2. Admission control or load conditioning
– Reject requests as needed to keep the system healthy. Reject them early, before they incur processing costs. Choose your victims carefully: e.g., prefer “gold” customers, or reject the most expensive requests.
3. Dynamic provisioning or elastic scaling
– E.g., acquire new capacity on the fly from a cloud provider, and shift load over to the new capacity.

Elastic scaling: “pay as you grow”
EC2: Elastic Compute Cloud, the canonical public cloud.
[Figure: a client uses a service hosted as a guest virtual appliance image on a cloud provider's host.]

IaaS: Infrastructure as a Service
[Figure: client → service → hosting platform; hosting performance and isolation are determined by the virtualization layer. Virtual machines (VMs): VMware, KVM, etc.; the guest OS runs on a VMM over physical hardware.]
EC2 is a public IaaS cloud (fee-for-service). Deployment of private clouds is growing rapidly with open IaaS cloud software.

Native virtual machines (VMs)
• Slide a hypervisor underneath the kernel.
– New OS/TCB layer: virtual machine monitor (VMM).
• Kernel and processes run in a virtual machine (VM).
– The VM “looks the same” to the OS as a physical machine.
– The VM is a sandboxed/isolated context for an entire OS.
• Can run multiple VM instances on a shared computer.
Thank you, VMware.
Also in the news: the Snowden NSA leak for Hallowe’en 2013. Scary?

Scaling a service
[Figure: a dispatcher spreads work across a server cluster/farm/cloud/grid in a data center, built on a support substrate.]
Add servers or “bricks” for scale and robustness.
Issues: state storage, server selection, request routing, etc.

Service scaling and bottlenecks
Scale up by adding capacity incrementally?
• “Just add bricks/blades/units/elements/cores”... but that presumes we can parallelize the workload.
• “Service workloads parallelize easily.”
– Many independent requests: spread requests across multiple units.
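The “spread requests across multiple units” idea can be sketched as a stateless dispatcher. This is an illustrative sketch, not from the slides: the server pool and the choice of SHA-256 are assumptions.

```python
import hashlib

SERVERS = ["app1", "app2", "app3", "app4"]  # hypothetical server pool

def pick_server(request_key: str) -> str:
    """Route a request to a server by hashing a request attribute.

    Many independent requests spread roughly evenly across the
    pool; requests that share a key (e.g., the same user or
    session) always land on the same server.
    """
    digest = hashlib.sha256(request_key.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(SERVERS)
    return SERVERS[index]

# Deterministic: the same key always maps to the same server.
assert pick_server("user:42") == pick_server("user:42")
```

Hashing on a request attribute also gives affinity: requests with the same key reach the same unit, which matters as soon as requests touch shared data.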
– Problem: some requests use shared data. Partition the data into chunks and spread them across the units; be sure to read/write a common copy.
• Load must be evenly distributed, or else some unit saturates before the others (a bottleneck). A bottleneck limits throughput and/or may increase response time for some class of requests.

Coordination for service clusters
• How to assign data and functions among servers?
– To spread the work across an elastic server cluster?
• How to know which server is “in charge” of a given function or data object?
– E.g., to serialize reads/writes on each object, or otherwise ensure consistent behavior: each read sees the value stored by the most recent write (or at least some reasonable value).
• Goals: safe, robust, even, balanced, dynamic, etc.
• Two key techniques:
– Leases (leased locks)
– Hash-based adaptive data distributions

What about failures?
• Systems fail. Here's a reasonable set of assumptions about failure properties:
• Nodes/servers/replicas/bricks
– Fail-stop or fail-fast fault model: nodes either function correctly or remain silent.
– A failed node may restart, or not.
– A restarted node loses its memory state, and recovers its secondary (disk) state.
– Note: nodes can also fail by behaving in unexpected ways, like sending false messages. These are called Byzantine failures.
• Network messages
– “Delivered quickly most of the time”, but may be dropped.
– Message source and content are safely known (e.g., crypto).

Challenge: data management
• Data volumes are growing enormously.
• Mega-services are “grounded” in data.
• How to scale the data tier?
– Scaling requires dynamic placement of data items across data servers, so we can grow the number of servers.
– Caching helps to reduce load on the data tier.
– Replication helps to survive failures and balance read/write load.
– E.g., alleviate hot-spots by spreading read load across multiple data servers.
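The slides name “hash-based adaptive data distributions” as a key technique for placing data items across servers; consistent hashing is one common instance. A minimal sketch, with hypothetical node and key names:

```python
import bisect
import hashlib

def _h(s: str) -> int:
    """64-bit hash point on the ring."""
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

class ConsistentHashRing:
    """Map keys to nodes so that adding or removing a node moves
    only a small fraction of keys (helps under growth and churn)."""
    def __init__(self, nodes, vnodes=64):
        # Each node gets several virtual points to even out load.
        self._ring = sorted((_h(f"{n}#{i}"), n)
                            for n in nodes for i in range(vnodes))
        self._points = [p for p, _ in self._ring]

    def node_for(self, key: str) -> str:
        # A key is owned by the first ring point at or after its hash.
        i = bisect.bisect(self._points, _h(key)) % len(self._ring)
        return self._ring[i][1]

ring = ConsistentHashRing(["brick1", "brick2", "brick3"])
owner = ring.node_for("object:1234")   # same key -> same brick, every time
```

The useful property: when a node joins, each key either keeps its old owner or moves to the new node, so only a bounded share of objects is redistributed.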
– Caching and replication require careful update protocols to ensure that servers see a consistent view of the data.

Service-oriented architecture of Amazon's platform
Dynamo is a scalable, replicated key-value store.

Key-value stores
• Many mega-services are built on key-value stores.
– Store variable-length content objects: think “tiny files” (the value).
– Each object is named by a “key”, usually fixed-size.
– The key is also called a token: not to be confused with a crypto key! Although it may be a content hash (SHAx or MD5).
– Simple put/get interface with no offsets or transactions (yet).
– Goes back to the literature on Distributed Data Structures [Gribble 1998] and Distributed Hash Tables (DHTs).
[image from Sean Rhea, opendht.org]

Key-value stores
• Data objects are named in a “flat” key space (e.g., “serial numbers”).
• K-V is a simple and clean abstraction that admits a scalable, reliable implementation: a major focus of R&D.
• Is put/get sufficient to implement non-trivial apps?
[Figure: a distributed application calls put(key, data) and get(key) on a distributed hash table; a lookup service maps lookup(key) to a node IP address. Image from Morris, Stoica, Shenker, etc.]

Scalable key-value stores
• Can we build massively scalable key/value stores?
– Balance the load: distribute the keys across the nodes.
– Find the “right” server(s) for a given key.
– Adapt to change (growth and “churn”) efficiently and reliably.
– Bound the spread of each object.
• Warning: it's a consensus problem!
• What is the consistency model for massive stores?
– Can we relax consistency for better scaling? Do we have to?
Memcached is a scalable in-memory key-value cache.

Scaling database access
• Many services are data-driven.
– Multi-tier services: the “lowest” layer is a data tier with the authoritative copy of service data.
• Data is stored in various stores or databases, some with an advanced query API, e.g., SQL.
• Databases are hard to scale.
– Complex data: atomic, consistent, recoverable, durable (“ACID”).
SQL: Structured Query Language.
[Figure: web servers issue SQL queries to database servers.]
Caches can help if much of the workload is simple reads.

Memcached
• “Memory caching daemon”
• It's just a key/value store.
• Scalable cluster service with a get/put API:
– an array of server nodes;
– distribute requests among nodes — how? Distribute the key space;
– scalable: just add nodes.
• Memory-based.
• LRU object replacement.
• Many technical issues: multi-core server scaling, MxN communication, replacement, consistency.
[Figure: web servers consult the memcached servers before falling back to the database servers behind the SQL query API. From “From Spark Plug to Drive Train: The Life of an App Engine Request”, Alon Levi, 5/27/09.]

Consistency and coordination
Very loosely:
• Consistency: each read of a {key, file, object} sees the value stored by the most recent write.
– Or at least writes propagate to readers “eventually”.
– More generally: the service behaves as expected: changes to shared state are properly recorded and observed.
• Coordination: the roles and functions of each element are understood and properly adjusted to respond to changes (e.g., failures, growth, or rebalancing: “churn”).
– E.g., distribution of functions or data among the elements.

Example: mutual exclusion
• It is often necessary to grant some node/process the “right” to “own” some given data or function.
• Ownership rights often must be mutually exclusive.
– At most one owner at any given time.
• How to coordinate ownership?
• Warning: it's a consensus problem!

One solution: lock service
[Figure: A and B each acquire the lock from the lock service, are granted it in turn, update x = x + 1, and release; the lock serializes the updates.]

A lock service in the real world
[Figure: A acquires and is granted the lock, begins x = x + 1, then crashes (X) while holding the lock; B's acquire request is never granted, and B waits forever.]

Solution: leases (leased locks)
• A lease is a grant of ownership or control for a limited time.
• The owner/holder can renew or extend the lease.
• If the owner fails, the lease expires and is free again.
• The lease might end early.
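The lease mechanism just described can be pictured as a lock with an expiration time. The sketch below is a single-process toy with illustrative names and numbers, not a real distributed protocol; its slack parameter anticipates the clock-drift safety margin ε that the slides discuss for leases.

```python
import time

class LeaseService:
    """Grant exclusive leases that expire on their own, so a
    crashed holder cannot hold the lock forever."""
    def __init__(self, term=10.0, slack=1.0):
        self.term = term      # lease duration (seconds), illustrative
        self.slack = slack    # safety margin > clock drift bound
        self.holder = None
        self.expires = 0.0

    def acquire(self, client: str, now=None) -> bool:
        now = time.monotonic() if now is None else now
        # Free if never granted, released, or expired -- padded by
        # slack so service and holder agree the lease is over even
        # if their clocks drift apart (within the known bound).
        if self.holder is None or now > self.expires + self.slack:
            self.holder, self.expires = client, now + self.term
            return True
        return False

    def release(self, client: str):
        if self.holder == client:
            self.holder = None

svc = LeaseService()
assert svc.acquire("A", now=0.0)        # A gets the lease
assert not svc.acquire("B", now=5.0)    # B must wait: A still holds it
assert svc.acquire("B", now=12.0)       # A "failed"; lease expired at 10 + slack
```

In a real system the holder and the service run on different clocks, which is exactly why the slack term must exceed the drift bound.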
– The lock service may recall or evict.
– The holder may release or relinquish.

A lease service in the real world
[Figure: A acquires and is granted the lease, does x = x + 1, then crashes (X). B's acquire waits until A's lease expires; B is then granted the lease, does x = x + 1, and releases.]

Leases and time
• The lease holder and lease service must agree when a lease has expired.
– I.e., that its expiration time is in the past.
– Even if they can't communicate!
• We all have our clocks, but do they agree?
– Synchronized clocks?
• For leases, it is sufficient for the clocks to have a known bound on clock drift.
– |T(Ci) – T(Cj)| < ε
– Build slack time > ε into the lease protocols as a safety margin.

OK, fine, but…
• What if A does not fail, but is instead isolated by a network partition?

A network partition
[Figure: a crashed router splits the network into disconnected halves.]
A network partition is any event that blocks all message traffic between subsets of nodes.

Never two kings at once
[Figure: A holds the lease and is cut off by a partition (???); A must stop updating x once its lease expires, so that B, granted the lease after the expiration, can safely do x = x + 1 and release.]

Lease example: network file cache consistency.

Google File System (GFS)
Similar: Hadoop HDFS, pNFS, and many other parallel file systems.
A master server stores metadata (names, file maps) and acts as a lock server. Clients call the master to open files, acquire locks, and obtain metadata. Then they read/write directly to a scalable array of data servers for the actual data. File data may be spread across many data servers: the maps say where it is.

OK, fine, but…
• What if the manager/master itself fails?
We can replace it, but the nodes must agree on who the new master is: that requires consensus.

The Answer
• Replicate the functions of the manager/master.
– Or other coordination service…
• Designate one of the replicas as a primary.
– Or master.
• The other replicas are backup servers.
– Or standby or secondary.
• If the primary fails, use a high-powered consensus algorithm to designate and initialize a new primary.

Consensus: abstraction
[Figure, after Coulouris and Dollimore:
Step 1, Propose: each Pi proposes a value vi to the others over unreliable multicast.
Step 2, Decide: the consensus algorithm runs; all non-faulty Pi agree on a value di in a bounded time.]

Fischer-Lynch-Paterson (1985)
• No consensus can be guaranteed in an asynchronous system in the presence of failures.
• Intuition: a “failed” process may just be slow, and can rise from the dead at exactly the wrong time.
• Consensus may occur recognizably, rarely or often.
Network partition → “split brain”.

“CAP theorem” (Dr. Eric Brewer)
C-A-P: Consistency, Availability, Partition-resilience — choose two.
• CA: available and consistent, unless there is a partition.
• CP: always consistent, even in a partition, but a reachable replica may deny service if it is unable to agree with the others (e.g., quorum).
• AP: a reachable replica provides service even in a partition, but may be inconsistent.

Properties for Correct Consensus
• Termination: all correct processes eventually decide.
• Agreement: all correct processes select the same di.
– Or… (stronger) all processes that do decide select the same di, even if they later fail.
• Consensus “must be” both safe and live.
• FLP and CAP say that a consensus algorithm can be safe or live, but not both.

Now what?
• We have to build practical, scalable, efficient distributed systems that really work in the real world.
• But the theory says it is impossible to build reliable computer systems from unreliable components.
• So what are we to do?

Butler W. Lampson (http://research.microsoft.com/en-us/um/people/blampson)
Butler Lampson is a Technical Fellow at Microsoft Corporation and an Adjunct Professor at MIT. He was one of the designers of the SDS 940 time-sharing system, the Alto personal distributed computing system, the Xerox 9700 laser printer, two-phase commit protocols, the Autonet LAN, the SPKI system for network security, the Microsoft Tablet PC software, the Microsoft Palladium high-assurance stack, and several programming languages.
He received the ACM Software Systems Award in 1984 for his work on the Alto, the IEEE Computer Pioneer award in 1996, the von Neumann Medal in 2001, the Turing Award in 1992, and the NAE's Draper Prize in 2004. [Lampson 1995]

Summary/preview
• The master coordinates, dictates consensus.
– E.g., a lock service.
– Also called “primary”.
• Remaining consensus problem: who is the master?
– The master itself might fail or be isolated by a network partition.
– Requires a high-powered distributed consensus algorithm (Paxos) to elect a new master.
– Paxos is safe but not live: in the worst case (multiple repeated failures) it might not terminate.

A Classic Paper
• ACM TOCS: Transactions on Computer Systems.
• Submitted: 1990. Accepted: 1998.
• Introduced: ???

A Paxos Round
[Figure: a self-appointed leader L exchanges messages with the nodes N:
1a Propose: L asks “Can I lead b?”;
1b Promise: each node logs and answers “OK, but…”; L waits for a majority;
2a Accept: L sends the value: “v?”;
2b Ack: each node logs and answers “OK”; once L hears a majority, the value is safe;
3 Commit: L announces “v!”.]
Nodes may compete to serve as leader, and may interrupt one another's rounds. It can take many rounds to reach consensus.

Consensus in Practice
• Lampson: “Since general consensus is expensive, practical systems reserve it for emergencies.”
– E.g., to select a primary/master, e.g., a lock server.
• Centrifuge, GFS master, Frangipani, etc.
• Google Chubby service (“Paxos Made Live”).
• Pick a primary with Paxos. Do it rarely; do it right.
– The primary holds a “master lease” with a timeout.
• Renew by consensus with the primary as leader.
– The primary is “czar” as long as it holds the lease.
– Master lease expires? Fall back to Paxos.
– (Or BFT.)

Coordination services
• Build your cloud apps around a coordination service with consensus (Paxos) at its core.
• This service is a fundamental building block for consistent scalable services.
– Chubby (Google)
– Zookeeper (Yahoo!)
– Centrifuge (Microsoft)

Coordination and Consensus
• If the key to availability and scalability is to decentralize and replicate functions and data, how do we coordinate the nodes?
– Data consistency
– Update propagation
– Mutual exclusion
– Consistent global states
– Failure notification
– Group membership (views)
– Group communication
– Event delivery and ordering

We stopped here
• But I'll leave these slides in.
• Hidden slides (e.g., on VMs and network file cache consistency) are not to be tested.
• Obviously we only peeked at some of these topics, but you should understand the problem of failures (lock service).

Caching in the Web
• Web “proxy” caches are servers that cache Web content.
• Reduce traffic to the origin server.
• Deployed by enterprises to reduce the external network traffic needed to serve the Web requests of their members.
• Also deployed by third-party companies that sell caching service to Web providers.
– Content Delivery/Distribution Network (CDN)
– Help Web providers serve their clients better.
– Help absorb unexpected load from “flash crowds”.
– Reduce Web server infrastructure costs.

Content Delivery Network (CDN)

DNS Caching
Cache freshness by TTL (“freshness date”), also used by early NFS and HTTP/Web.
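The TTL scheme in the last slide can be sketched as a cache that serves an entry only while its freshness date lies in the future. This is an illustrative toy: the URL, TTL value, and stubbed origin fetch are assumptions, not part of any real proxy.

```python
import time

class TTLCache:
    """Web-proxy-style cache: each entry carries a freshness date
    (insert time + TTL); stale entries are refetched from the origin."""
    def __init__(self, ttl=60.0, fetch=None):
        self.ttl = ttl
        # Stub origin fetch; a real proxy would issue an HTTP request.
        self.fetch = fetch or (lambda url: f"<content of {url}>")
        self.store = {}   # url -> (content, fresh_until)

    def get(self, url: str, now=None) -> str:
        now = time.time() if now is None else now
        hit = self.store.get(url)
        if hit and now < hit[1]:
            return hit[0]             # still fresh: serve from cache
        content = self.fetch(url)     # miss or stale: go to the origin
        self.store[url] = (content, now + self.ttl)
        return content

cache = TTLCache(ttl=60.0)
cache.get("http://origin.example/page", now=0.0)    # miss: fetches origin
cache.get("http://origin.example/page", now=30.0)   # fresh: served from cache
cache.get("http://origin.example/page", now=90.0)   # stale: refetches origin
```

The same expiration idea appears in DNS caching and in early NFS: freshness is a pure timeout, with no invalidation message from the origin.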