Scalable, Consistent, and Elastic Database Systems for Cloud Platforms
Sudipto Das
Computer Science, UC Santa Barbara
sudipto@cs.ucsb.edu

Web replacing Desktop

Paradigm shift in Infrastructure

Cloud computing
Computing infrastructure and solutions delivered as a service
◦ Industry worth USD 150 billion by 2014*
Contributors to success
◦ Economies of scale
◦ Elasticity and pay-per-use pricing
Popular paradigms
◦ Infrastructure as a Service (IaaS)
◦ Platform as a Service (PaaS)
◦ Software as a Service (SaaS)
*http://www.crn.com/news/channel-programs/225700984/cloud-computing-services-market-to-near-150-billion-in-2014.htm

Databases for cloud platforms
Data is central to applications
DBMSs are a mission-critical component of the cloud software stack
◦ Manage petabytes of data, drive revenue
◦ Serve a variety of applications (multitenancy)
Data needs for cloud applications
◦ OLTP systems: store and serve data
◦ Data analysis systems: decision support, intelligence

Application landscape
◦ Social gaming
◦ Rich content and mash-ups
◦ Managed applications
◦ Cloud application platforms

Challenges for OLTP systems
Scalability
◦ While ensuring efficient transaction execution!
Lightweight elasticity
◦ Scale on demand!
Self-manageability
◦ Intelligence without a human controller!

Two approaches to scalability
Scale-up
◦ Preferred in the classical enterprise setting (RDBMS)
◦ Flexible ACID transactions
◦ Transactions access a single node
Scale-out
◦ Cloud friendly (key-value stores)
◦ Execution at a single server
◦ Limited functionality and guarantees: no multi-row or multi-step transactions

Why care about transactions?

confirm_friend_request(user1, user2) {
  begin_transaction();
  update_friend_list(user1, user2, status.confirmed);
  update_friend_list(user2, user1, status.confirmed);
  end_transaction();
}

Simplicity in application design with ACID transactions

Without transactions, the application must handle partial failures itself:

confirm_friend_request_A(user1, user2) {
  try {
    update_friend_list(user1, user2, status.confirmed);
  } catch (exception e) {
    report_error(e);
    return;
  }
  try {
    update_friend_list(user2, user1, status.confirmed);
  } catch (exception e) {
    revert_friend_list(user1, user2);
    report_error(e);
    return;
  }
}

confirm_friend_request_B(user1, user2) {
  try {
    update_friend_list(user1, user2, status.confirmed);
  } catch (exception e) {
    report_error(e);
    add_to_retry_queue(operation.updatefriendlist, user1, user2, current_time());
  }
  try {
    update_friend_list(user2, user1, status.confirmed);
  } catch (exception e) {
    report_error(e);
    add_to_retry_queue(operation.updatefriendlist, user2, user1, current_time());
  }
}

Challenge: Transactions at Scale
◦ Scale-out key-value stores and ACID RDBMSs each give up one side of the trade-off: today's systems offer scale-out or ACID transactions, not both.
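To make the gap concrete, the following sketch (hypothetical interfaces, not the API of any system mentioned in this talk) contrasts the single-key guarantees of a typical scale-out key-value store with the multi-key transactional access that applications like confirm_friend_request need:

// Hypothetical interfaces; a sketch of the trade-off, not an actual API.

// What a typical scale-out key-value store guarantees: atomicity per key only.
interface KeyValueStore {
    byte[] get(String key);
    void put(String key, byte[] value);   // single-row atomicity, no multi-key transactions
}

// What OLTP applications such as confirm_friend_request need: atomicity across keys.
interface MultiKeyTransactionalStore {
    Transaction begin();                  // groups reads and writes on several keys
}

interface Transaction {
    byte[] read(String key);
    void write(String key, byte[] value);
    void commit() throws TransactionAbortedException;  // all-or-nothing across every key touched
    void abort();
}

class TransactionAbortedException extends Exception {}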
Challenge: Lightweight Elasticity
Provisioning on demand, not for peak
◦ Traditional infrastructures: provision for peak capacity, leaving unused resources when demand is low
◦ Deployment in the cloud: capacity tracks demand
◦ Optimize operating cost!
(Figure: capacity vs. demand over time for traditional infrastructures and cloud deployments. Slide credits: Berkeley RAD Lab.)

Challenge: Self-Manageability
Managing a large distributed system
◦ Detecting failures and recovering
◦ Coordination and synchronization
◦ Provisioning
◦ Capacity planning
◦ …
"A large distributed system is a Zoo"
Cloud platforms are inherently multitenant
◦ Balance conflicting goals: minimize operating cost while ensuring good performance

Contributions for OLTP systems
Transactions at Scale
◦ ElasTraS [HotCloud 2009, UCSB TR 2010]
◦ G-Store [SoCC 2010]
Lightweight Elasticity
◦ Albatross [VLDB 2011]
◦ Zephyr [SIGMOD 2011]
Self-Manageability
◦ Pythia [in progress]

Contributions (data management)
Transaction processing (this talk)
◦ Dynamic partitioning: G-Store [SoCC '10], MD-HBase [MDM '11, Best Paper Runner-up]
◦ Static partitioning: ElasTraS [HotCloud '09, TR '10], Albatross [VLDB '11], Zephyr [SIGMOD '11], Pythia [in progress]
Analytics
◦ Ricardo [SIGMOD '10], Anonimos [ICDE '10, TKDE]
Novel architectures
◦ Hyder [CIDR '11, Best Paper], CoTS [ICDE '09, VLDB '09], TCAM [DaMoN '08]

Transactions at Scale
◦ Bridging scale-out key-value stores and ACID transactions

Scale-out with static partitioning
Table-level partitioning (range, hash)
◦ Distributed transactions
Partitioning the database schema
◦ Co-locate data items accessed together
◦ Goal: minimize distributed transactions
Systems that scale out with static partitioning
◦ ElasTraS [HotCloud 2009, TR 2010]
◦ Cloud SQL Server [ICDE 2011]
◦ Megastore [CIDR 2011]
◦ Relational Cloud [CIDR 2011]

Dynamically formed partitions
Access patterns change, often rapidly
◦ Online multi-player gaming applications
◦ Collaboration-based applications
◦ Scientific computing applications
Not amenable to static partitioning
How to get the benefit of partitioning when accesses do not statically partition?
◦ Ours is the first solution to allow that

Online multi-player games
◦ Player profiles (ID, Name, $$$, Score)
◦ Execute transactions on player profiles while the game is in progress
◦ Partitions/groups are dynamic
◦ Hundreds of thousands of concurrent groups

Data Fusion for dynamic partitions [G-Store, SoCC 2010]
Transactional access to a group of data items formed on demand
Challenge: avoid distributed transactions!
Key Group abstraction
◦ Groups are small
◦ Groups execute a non-trivial number of transactions
◦ Groups are dynamic and on-demand
Groups are dynamically formed tenant databases

Transactions on Groups
Without distributed transactions
◦ Key Group: ownership of all keys in the group at a single node
◦ One key is selected as the leader; the followers transfer ownership of their keys to the leader (grouping protocol)
A sketch of this abstraction appears below.
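As a rough illustration of the Key Group abstraction, the sketch below (hypothetical names; not G-Store's actual interface) shows the client-facing operations: a group is formed on demand over a set of keys, transactions then execute at the single node that owns the group, and the group is deleted when no longer needed.

// Hypothetical client-facing API for dynamically formed key groups; a sketch, not G-Store's interface.

import java.util.Set;

interface KeyGroupStore {
    // Form a group on demand: one key acts as the leader; ownership of the
    // remaining keys is transferred to the leader's node by the grouping protocol.
    KeyGroup createGroup(String leaderKey, Set<String> memberKeys);

    // Deleting the group returns ownership of the member keys to their followers.
    void deleteGroup(KeyGroup group);
}

interface KeyGroup {
    // Transactions on a group run at the single node owning all of its keys,
    // so no distributed transaction is needed.
    GroupTransaction beginTransaction();
}

interface GroupTransaction {
    byte[] read(String key);                  // key must belong to the group
    void write(String key, byte[] value);
    void commit();                            // ACID within the group, logged at the leader
    void abort();
}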
Why is group formation hard?
Guarantee the contract between the leader and the followers in the presence of:
◦ Leader and follower failures
◦ Lost, duplicated, or re-ordered messages
◦ Dynamics of the underlying system
How to ensure efficient and ACID execution of transactions?

Grouping protocol
(Figure: message exchange between the leader and the followers — join messages J, JA, JAA during group creation and delete messages D, DA during deletion, with log entries L(Creating), L(Joining), L(Joined), L(Free), L(Deleting), L(Deleted) recording the group's state transitions at each node over time.)
Conceptually akin to "locking"
◦ Locks held by groups

Efficient transaction processing
How does the leader execute transactions?
◦ Caches data for group members; the underlying data store is treated like a disk
◦ Transaction logging for durability
◦ Cache asynchronously flushed to propagate updates
◦ Guaranteed update propagation
(Figure: the leader runs a transaction manager, cache manager, and log; updates propagate asynchronously to the followers.)

Prototype: G-Store [SoCC 2010]
An implementation over key-value stores
◦ Grouping middleware layer resident on top of a key-value store
◦ Application clients get transactional multi-key access
(Figure: on each node, a grouping layer and transaction manager sit above the key-value store logic and distributed storage; together they form G-Store.)

G-Store evaluation
Implemented using HBase
◦ Added the middleware layer
◦ ~10,000 LOC
Experiments in Amazon EC2
Benchmark: an online multi-player game
Cluster size: 10 nodes
Data size: ~1 billion rows (>1 TB)
For groups with 100 keys
◦ Group creation latency: ~10 – 100 ms
◦ More than 10,000 groups created concurrently
(Figures: group creation latency and group creation throughput.)

Lightweight Elasticity (revisited)
Provisioning on demand, not for peak: optimize operating cost
(Figure repeated from the earlier elasticity slide; credits: Berkeley RAD Lab.)

Elasticity in the database tier
(Figure: a load balancer in front of the application/web/caching tier and the database tier.)

Live database migration
Migrate a database partition (or tenant) in a live system
◦ Optimize operating cost
◦ Resource orchestration in multitenant systems
Different from
◦ Migration between software versions
◦ Migration in case of schema evolution

VM migration for DB elasticity
One DB partition per VM
◦ Pros: allows fine-grained load balancing
◦ Cons: performance overhead; poor consolidation ratio [Curino et al., CIDR 2011]
Multiple DB partitions in a VM
◦ Pros: good performance
◦ Cons: must migrate all partitions together; coarse-grained load balancing

Live database migration
Multiple partitions share the same database process
◦ Shared-process multitenancy
Migrate individual partitions on demand in a live system
◦ Virtualization in the database tier
Straightforward solution (sketched below)
◦ Stop serving the partition at the source
◦ Copy it to the destination
◦ Start serving at the destination
◦ Expensive!
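For reference, a minimal sketch of that stop-and-copy baseline (hypothetical types and helpers, not code from the talk), which makes clear why the partition is unavailable for the entire copy:

// A sketch of the naive stop-and-copy baseline (hypothetical types and helpers).
interface Node {
    void stopServing(String partition);             // partition becomes unavailable on this node
    byte[] copyPartition(String partition);         // read the entire persistent state (can be GBs)
    void installPartition(String partition, byte[] snapshot);
    void startServing(String partition);
}

class StopAndCopyMigration {
    // Unavailability lasts for the whole copy: every update issued in between fails.
    static void migrate(String partition, Node source, Node destination) {
        source.stopServing(partition);
        byte[] snapshot = source.copyPartition(partition);
        destination.installPartition(partition, snapshot);
        destination.startServing(partition);
    }
}

Zephyr, described next, avoids this unavailability window: either the source or the destination is always able to serve the partition.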
Migration cost measures
Service unavailability
◦ Time the partition is unavailable
Number of failed requests
◦ Number of operations failing / transactions aborting
Performance overhead
◦ Impact on response times
Additional data transferred

Two common DBMS architectures
Decoupled storage architectures
◦ ElasTraS, G-Store, Deuteronomy, Megastore
◦ Persistent data is not migrated
◦ Albatross [VLDB 2011]
Shared-nothing architectures
◦ SQL Azure, Relational Cloud, MySQL Cluster
◦ Persistent data must be migrated
◦ Zephyr [SIGMOD 2011]

Why is live DB migration hard?
Persistent data must be migrated (GBs)
◦ How to ensure no downtime?
Nodes can fail during migration
◦ How to guarantee correctness during failures?
◦ Transaction atomicity and durability; recover migration state after failure
Transactions execute during migration
◦ How to guarantee serializability?
◦ Transaction correctness equivalent to normal operation

Our approach: Zephyr [SIGMOD 2011]
Migration executed in phases
◦ Starts with the transfer of minimal information to the destination (the "wireframe")
Database pages used as the granule of migration
◦ Unique page ownership
Source and destination concurrently execute transactions in one migration phase
◦ Minimal transaction synchronization
◦ Guaranteed serializability
◦ Logging and handshaking protocols

Simplifying assumptions
For this talk
◦ Transactions access a single partition
◦ No replication
◦ No structural changes to indices
Extensions in the paper [SIGMOD 2011] relax these assumptions

Design overview
(Figure: the source owns all pages P1, …, Pn and runs the active transactions TS1, …, TSk; the destination owns no pages yet. Legend: page owned by node vs. page not owned by node.)

Init mode
Freeze indices and migrate the wireframe
(Figure: the source still owns pages P1, …, Pn and serves transactions TS1, …, TSk; the destination now holds the index wireframe with all pages un-owned. A sketch of per-page ownership state follows below.)
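As a rough illustration (hypothetical structures; not Zephyr's actual code, which lives inside the H2 engine), per-page ownership during migration can be pictured as the index wireframe carrying ownership metadata but no page contents:

// A rough illustration of per-page ownership metadata during migration (hypothetical structures).
enum MigrationMode { NORMAL, INIT, DUAL, FINISH }

class PageEntry {
    final int pageId;
    boolean ownedLocally;     // exactly one of source/destination owns a page at any time
    byte[] contents;          // null in the wireframe: the structure is copied, the data is not

    PageEntry(int pageId, boolean ownedLocally) {
        this.pageId = pageId;
        this.ownedLocally = ownedLocally;
    }
}

class IndexWireframe {
    MigrationMode mode = MigrationMode.INIT;
    final java.util.Map<Integer, PageEntry> pages = new java.util.HashMap<>();

    // At the destination, the wireframe starts with every page un-owned;
    // pages are later pulled (dual mode) or pushed (finish mode) exactly once.
    void addUnownedPage(int pageId) {
        pages.put(pageId, new PageEntry(pageId, false));
    }
}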
What is an index wireframe?
(Figure: the index structure at the source and its copy at the destination — the internal index nodes without the page contents.)

Dual mode
Requests for un-owned pages can block
◦ Old, still active transactions TSk+1, …, TSl continue at the source
◦ New transactions TD1, …, TDm execute at the destination
◦ When a destination transaction TDi accesses a page it does not yet own (e.g., P3), the page is pulled from the source
◦ Index wireframes remain frozen

Finish mode
Pages can still be pulled by the destination, if needed
◦ Remaining pages P1, P2, … are pushed from the source
◦ Transactions at the source have completed; new transactions TDm+1, …, TDn execute at the destination

Normal operation
Index wireframe un-frozen
◦ Transactions TDn+1, …, TDp execute at the destination

Artifacts of this design
Once migrated, pages are never pulled back by the source
◦ Abort transactions at the source that access migrated pages
No structural changes to indices during migration
◦ Abort transactions (at both nodes) that make structural changes to indices
Destination "pulls" pages on demand
◦ Transactions at the destination experience higher latency compared to normal operation

Serializability
The only concern is dual mode
◦ Init and finish: only one node is executing transactions
Local predicate locking on the internal index and exclusive page ownership → no phantoms
Strict 2PL → transactions are locally serializable
Pages transferred only once
◦ No Tdest → Tsource conflict dependency
Guaranteed serializability

Recovery
Transaction recovery
◦ For every database page, Tsrc → Tdst
◦ Recovery: transactions replayed in conflict order
Migration recovery
◦ Atomic transitions between migration modes: logging and handshake protocols
◦ Every page has exactly one owner: bookkeeping at the index level

Correctness
In the presence of arbitrary repeated failures, Zephyr ensures:
◦ Updates made to database pages are consistent
◦ A failure does not leave a page without an owner
◦ Both source and destination are in the same migration mode
Guaranteed termination and starvation freedom

Implementation
Prototyped using H2, an open-source OLTP database
◦ Supports the standard SQL/JDBC API
◦ Serializable isolation level
◦ Tree indices
◦ Relational data model
Modified the database engine
◦ Added support for freezing indices
◦ Page migration status maintained in the index
◦ ~6,000 LOC
Tungsten SQL Router migrates JDBC connections during migration
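Putting these pieces together, here is a rough sketch of on-demand page access at the destination during dual mode (hypothetical helpers; not the modified H2 code): a page that is not yet owned locally is pulled from the source exactly once, after which all accesses are local.

// Hypothetical sketch of on-demand page pulls at the destination during dual mode.
import java.util.HashMap;
import java.util.Map;

class DualModePageManager {
    // pageId -> contents for pages already owned locally; absent means still owned by the source
    private final Map<Integer, byte[]> ownedPages = new HashMap<>();
    private final SourceConnection source;   // hypothetical RPC handle to the source node

    DualModePageManager(SourceConnection source) {
        this.source = source;
    }

    // Ownership bookkeeping lives at the index level; a page is pulled from the source at most once.
    synchronized byte[] readPage(int pageId) {
        byte[] contents = ownedPages.get(pageId);
        if (contents == null) {
            contents = source.pullPage(pageId);  // blocks until the source transfers ownership
            ownedPages.put(pageId, contents);    // the source subsequently aborts its own accesses to this page
        }
        return contents;
    }
}

interface SourceConnection {
    byte[] pullPage(int pageId);  // transfers exclusive ownership of the page to the caller
}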
Results overview
Downtime (partition unavailability)
◦ Stop & copy (S&C): 3 – 8 seconds (time needed to migrate; unavailable for updates)
◦ Zephyr: no downtime; either the source or the destination is always available
Service interruption (failed operations)
◦ S&C: ~100s – 1,000s of failed operations; all transactions with updates are aborted
◦ Zephyr: ~10s – 100s of failed operations
An order of magnitude less interruption, with minimal operational and data transfer overhead

Failed operations
(Figure: Zephyr causes an order of magnitude fewer failed operations than stop & copy.)

Concluding remarks
Major enabling technologies
◦ Scalable distributed database infrastructure: ElasTraS
◦ Dynamically formed data partitions: G-Store
◦ Live database migration: Albatross, Zephyr

Future directions
Self-managing controller for large multitenant database infrastructures
Novel data management architectures
◦ Leveraging advances in novel hardware
◦ Convergence of transactional and analytics systems for real-time intelligence
Putting the human in the loop: leveraging crowd-sourcing

Thank you!
Collaborators
◦ UCSB: Divy Agrawal, Amr El Abbadi, Ömer Eğecioğlu, Shashank Agarwal, Shyam Antony, Aaron Elmore, Shoji Nishimura (NEC Japan)
◦ Microsoft Research Redmond: Phil Bernstein, Colin Reid
◦ IBM Almaden: Yannis Sismanis, Kevin Beyer, Rainer Gemulla, Peter Haas, John McPherson