The Art of Database Sharding
Maxym Kharchenko
Amazon.com
April 22-26, 2012
Mandalay Bay Convention Center
Las Vegas, Nevada, USA
www.collaborate12.org
www.collaborate12.ioug.org

When your data grows …

Problem
[Diagram: Old System, New System]

The Big Data problem

One machine is not enough

Vertical Scaling
Scaling Up … Scaling Up … Scaled!

What you get when you scale up
2+2=5

What you get when you scale up
2+2=3

Scale out, not up

Running on >1 machines
[Chart: Difficulty (1 to 10,000,000) against Number of machines (0 to 5). Courtesy: John Rauser @amazon.com]

Distributed computing is hard
[Diagram: Distributed System, Sharded System]

Sharding is (relatively) easy
Split your data into small independent chunks
And run each chunk on cheap commodity hardware

How to split your data
[Diagram: Data split into several Data chunks]

Step 1: Split off different things
Vertical Partitioning

Step 2: Choose sharding key and function
Sharding

Bad Sharding
Can we partition Collaborate participants by last name?

    CREATE TABLE Collaborate_Participants (
      last_name varchar2(30) PRIMARY KEY,
      signup_date date
    )

[Chart: Last Names Distribution. Shard Size (0 to 9) by first letter of last name (A to Z); shards 1 to 4]

Avalanche Effect, e.g. MD5
[Diagram: Bad Distribution becomes Good Distribution]

Step 3: Make enough shards

Hashes and Buckets
[Diagram: Good Distribution, then MOD, MOD, MOD]

Resharding
3 shards

    Hashed_id   Shard: mod(hashed_id, 3)
    1           1
    2           2
    3           0
    4           1
    5           2
    6           0
    7           1
    8           2
    9           0
    10          1
    11          2
    12          0

Adding 4th shard
75% bad (9 of the 12 ids change shard)

    Hashed_id   Old Shard: mod(hashed_id, 3)   New Shard: mod(hashed_id, 4)
    1           1                              1
    2           2                              2
    3           0                              3
    4           1                              0
    5           2                              1
    6           0                              2
    7           1                              3
    8           2                              0
    9           0                              1
    10          1                              2
    11          2                              3
    12          0                              0

Logical Shards
[Diagram: Good Distribution, then MOD, MOD, MOD, MOD]

Implementing Shards: Standbys
[Diagram: Apps; Unsharded; Shard 1; Standby (Read Only); Shard 2]

Implementing Shards: Tables
[Diagram: Apps; Shard 1 holds Tab A; Shard 2 holds MV (Read Only), which becomes Tab A]

    Create materialized view … as select … from a@shard1
    Drop materialized view … preserve table

Why shards are awesome
• Small data, small load
  – Better caching, faster queries
  – Smaller load, fewer surprises
  – Faster maintenance, e.g. restores
• Eggs not in one basket:
  – Availability redefined
  – Safer maintenance
• Multiple points of view:
  – SQL performance
  – System load

Why shards are NOT so great
• More systems
  – Power, rack space, etc.
  – Needs automation … bad
  – More likely to fail overall
• Some operations become impractical:
  – Joins across shards
  – Foreign keys across shards
• More work:
  – Applications, developers, DBAs
  – High skill, DIY everything

Thank you

Implementing Shards: Moving "data head"
[Diagram: Apps; Shard 1, Shard 2, Shard 3, Shard 4]

    Logical Shard   Physical Shard
    (1,2,3,4)       1
    (5,6,7,8)       2

    Time   Logical Shard   Physical Shard
    2011   (1,2,3,4)       1
    2011   (5,6,7,8)       2

    Time   Logical Shard   Physical Shard
    2011   (1,2,3,4)       1
    2011   (5,6,7,8)       2
    2012   (1,2)           1
    2012   (3,4)           3
    2012   (5,6)           2
    2012   (7,8)           4

Bad Sharding. Example 2
Can we shard customers by meaningless sequence?

    CREATE TABLE Orders (
      order_id number PRIMARY KEY,
      customer_fname varchar2(30),
      customer_lname varchar2(30),
      order_date date
    )

    order_id: 10000 - 20000
    order_id: 20001 - 30000
    order_id: 30001 - 40000
    order_id: 40001 - 50000
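
The "Resharding" slides claim that going from mod(hashed_id, 3) to mod(hashed_id, 4) is "75 % bad". A minimal sketch in Python can check that arithmetic on the same twelve ids; the function names `shard_for` and `moved_fraction` are illustrative and not from the deck.

```python
def shard_for(hashed_id: int, shard_count: int) -> int:
    """Plain modulo placement, as in mod(hashed_id, N) on the slides."""
    return hashed_id % shard_count


def moved_fraction(hashed_ids, old_count: int, new_count: int) -> float:
    """Fraction of ids whose shard number changes when the shard count changes."""
    ids = list(hashed_ids)
    moved = [i for i in ids
             if shard_for(i, old_count) != shard_for(i, new_count)]
    return len(moved) / len(ids)


if __name__ == "__main__":
    # The slide's example: hashed ids 1..12, growing from 3 to 4 shards.
    print(moved_fraction(range(1, 13), 3, 4))  # 0.75
```

Only ids 1, 2 and 12 keep their shard number, so nine of twelve rows would have to be copied to another machine. That cost is what the "Logical Shards" slide is meant to avoid.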
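
The "Avalanche Effect" and "Logical Shards" slides can be combined into one small sketch: hash the sharding key with MD5, MOD it into a fixed number of logical shards, and keep a separate table from logical to physical shards. The count of 8 logical shards and the two mappings are borrowed from the "data head" slide but applied here to all data regardless of time, which is an assumption; the sample last names are made up for illustration.

```python
import hashlib

# Assumption: the number of logical shards is fixed up front and never changes.
LOGICAL_SHARDS = 8


def logical_shard(key: str) -> int:
    """Hash the sharding key with MD5, then MOD it into a logical shard 1..8."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % LOGICAL_SHARDS + 1


# Logical -> physical placement with two machines, then with four.
TWO_MACHINES = {1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 2, 8: 2}
FOUR_MACHINES = {1: 1, 2: 1, 3: 3, 4: 3, 5: 2, 6: 2, 7: 4, 8: 4}


def physical_shard(key: str, mapping: dict) -> int:
    """Look up the machine that currently hosts the key's logical shard."""
    return mapping[logical_shard(key)]


def relocated_logical_shards(old: dict, new: dict) -> list:
    """Logical shards that must be copied to another machine."""
    return sorted(ls for ls in old if old[ls] != new[ls])


if __name__ == "__main__":
    for name in ("Adams", "Baker", "Clark"):  # hypothetical last names
        print(name, logical_shard(name),
              physical_shard(name, TWO_MACHINES),
              physical_shard(name, FOUR_MACHINES))
    print(relocated_logical_shards(TWO_MACHINES, FOUR_MACHINES))  # [3, 4, 7, 8]
```

A key's logical shard never changes, so adding machines only moves whole logical shards (here 4 of 8) instead of rehashing every row.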
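
The routing table on the 'Moving "data head"' slide could be looked up roughly as follows. This sketch assumes the Time column is the year in which a row was created, so that 2011 data stays where it is and only 2012 data uses the new machines; that reading, and the names `ROUTING` and `route`, are assumptions rather than something the slide states.

```python
from datetime import date

# (year, logical shards, physical shard), copied from the final table on the slide.
ROUTING = [
    (2011, (1, 2, 3, 4), 1),
    (2011, (5, 6, 7, 8), 2),
    (2012, (1, 2), 1),
    (2012, (3, 4), 3),
    (2012, (5, 6), 2),
    (2012, (7, 8), 4),
]


def route(created: date, logical_shard: int) -> int:
    """Return the physical shard for a row created on `created`."""
    for year, logical_shards, physical in ROUTING:
        if created.year == year and logical_shard in logical_shards:
            return physical
    raise LookupError(f"no route for {created.year}, logical shard {logical_shard}")


if __name__ == "__main__":
    print(route(date(2011, 6, 1), 3))  # 1: old data is not moved
    print(route(date(2012, 6, 1), 3))  # 3: new data lands on the new machine
```

Under that assumption no existing rows are copied when shards 3 and 4 are added; the price is that every lookup needs the row's time as well as its key.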