The Art of Database Sharding – presentation

The Art of Database Sharding
Maxym Kharchenko
Amazon.com
April 22-26, 2012
Mandalay Bay Convention Center
Las Vegas, Nevada, USA
www.collaborate12.org
www.collaborate12.ioug.org
When your data grows …
Problem
Old System
New System
The Big Data problem
One machine is not enough
Vertical Scaling
Scaling Up …
Scaled!
What you get
when you scale up
2+2=5
What you get
when you scale up
2+2=3
Scale out, not up
Running on >1 machines
[Chart: difficulty (y-axis: 0, 1, 10,000,000) against number of machines (x-axis: 0 to 5)]
Courtesy: John Rauser @amazon.com
Distributed computing is hard
Distributed System
Sharded System
Sharding is (relatively) easy
Split your data
into small independent chunks
And run each chunk
on cheap commodity hardware
How to split your data
Step 1: Split off different things
Vertical Partitioning
Step 2: Choose a sharding key and function
Sharding
Bad Sharding
Can we partition Collaborate participants by last name?
CREATE TABLE Collaborate_Participants (
last_name varchar2(30) PRIMARY KEY,
signup_date date
)
[Chart: Last Names Distribution. Shard size (y-axis, 0 to 9) by first letter of last name (x-axis, A to Z); shard labels 1, 2, 3, 4]
Avalanche Effect
e.g. MD5
Bad Distribution
Good Distribution
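The avalanche effect named on this slide can be seen by hashing two nearly identical last names and counting how many digest bits differ. The following is a minimal sketch, not the presenter's code: the helper names `md5_int` and `differing_bits` and the sample names are assumptions made for illustration.

```python
import hashlib


def md5_int(text: str) -> int:
    """Return the 128-bit MD5 digest of text as an integer."""
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)


def differing_bits(a: str, b: str) -> int:
    """Count the digest bits that differ between two inputs (0 to 128)."""
    return bin(md5_int(a) ^ md5_int(b)).count("1")


# Hypothetical sample names: one character apart, same first letter.
print(differing_bits("Smith", "Smyth"))
print(differing_bits("Smith", "Smith"))  # identical input, identical digest
```

Names that share a first letter would pile into the same shard under alphabetical partitioning, while their digests are unrelated, which is what makes a hash a better input for the sharding function.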
Step 3: Make enough shards
Hashes and Buckets
Good Distribution
MOD
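Hashing followed by MOD, as the slide labels suggest, might look like the sketch below. The function name `bucket_for`, the bucket count of 4 and the synthetic participant keys are all assumptions; MD5 is used only because the deck mentions it.

```python
import hashlib
from collections import Counter


def bucket_for(key: str, bucket_count: int) -> int:
    """Hash the key, then take the remainder to pick a bucket (0-based)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % bucket_count


# Synthetic keys, invented for this example only.
keys = [f"participant{i}" for i in range(10000)]
counts = Counter(bucket_for(k, 4) for k in keys)

for bucket in sorted(counts):
    print(bucket, counts[bucket])
```

With a well-mixing hash the per-bucket counts should come out close to each other, in contrast to the skew seen when splitting by first letter.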
Resharding
3 shards
Hashed_id   Shard: mod(hashed_id, 3)
1           1
2           2
3           0
4           1
5           2
6           0
7           1
8           2
9           0
10          1
11          2
12          0
Adding 4th shard
Hashed_id   Old Shard: mod(hashed_id, 3)   New Shard: mod(hashed_id, 4)
1           1                              1
2           2                              2
3           0                              3
4           1                              0
5           2                              1
6           0                              2
7           1                              3
8           2                              0
9           0                              1
10          1                              2
11          2                              3
12          0                              0
75% bad: 9 of the 12 keys (all except 1, 2 and 12) map to a different shard after the change.
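The 75% figure on the slide can be checked directly by comparing each key's shard before and after the change. This is a small sketch using the slide's hashed ids 1 to 12; the function names are made up for the example.

```python
def shard_for(hashed_id: int, shard_count: int) -> int:
    """Plain modulo placement, as in the slide's mod(hashed_id, N)."""
    return hashed_id % shard_count


def moved_fraction(hashed_ids, old_count: int, new_count: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    ids = list(hashed_ids)
    moved = sum(
        1 for i in ids if shard_for(i, old_count) != shard_for(i, new_count)
    )
    return moved / len(ids)


print(moved_fraction(range(1, 13), 3, 4))  # 0.75
```

Only ids 1, 2 and 12 keep their shard, so growing from 3 to 4 shards with plain modulo means relocating most of the data.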
Logical Shards
Good Distribution
MOD
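One common reading of "logical shards" is sketched below: keys are hashed into a fixed number of logical shards, and a separate lookup table places each logical shard on a physical machine. The count of 8 logical shards and both placement tables are illustrative assumptions, and the names `logical_shard` and `physical_shard` are hypothetical.

```python
import hashlib

LOGICAL_SHARDS = 8  # assumed; chosen once and never changed


def logical_shard(key: str) -> int:
    """Map a key to a logical shard numbered 1..LOGICAL_SHARDS."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % LOGICAL_SHARDS + 1


def physical_shard(key: str, placement: dict) -> int:
    """Look up which physical shard currently hosts the key's logical shard."""
    return placement[logical_shard(key)]


# Hypothetical placement on 2 physical shards ...
two_machines = {1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 2, 8: 2}
# ... and after growing to 4: only the lookup table is edited.
four_machines = {1: 1, 2: 1, 3: 3, 4: 3, 5: 2, 6: 2, 7: 4, 8: 4}

moved = [s for s in two_machines if two_machines[s] != four_machines[s]]
print(moved)  # [3, 4, 7, 8]
```

Because the MOD divisor never changes, a key's logical shard is stable; growing the system moves whole logical shards (here half of them) instead of rehashing every key.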
Implementing Shards: Standbys
Apps
Read Only
Unsharded
Shard 1
Standby
Shard 2
Implementing Shards: Tables
[Diagram: Apps; Shard1 with Tab A (Read Only); Shard 2 with MV A / Tab A]
Create materialized view … as select … from a@shard1
Drop materialized view … preserve table
Why shards are awesome
• Small data, small load
– Better caching, faster queries
– Smaller load, fewer surprises
– Faster maintenance, e.g. restores
• Eggs not in one basket:
– Availability redefined
– Safer maintenance
• Multiple points of view:
– SQL performance
– System load
Why shards are NOT so great
• More systems
– Power, rack space etc
– Needs automation … bad
– More likely to fail overall
• Some operations become impractical:
– Joins across shards
– Foreign keys across shards
• More work:
– Applications, developers, DBAs
– High skill, DIY everything
Thank you
Implementing Shards: Moving “data head”
Logical Shard   Physical Shard
(1,2,3,4)       1
(5,6,7,8)       2

Time   Logical Shard   Physical Shard
2011   (1,2,3,4)       1
2011   (5,6,7,8)       2

[Diagram: Apps, Shard 1, Shard 2, Shard 3, Shard 4]

Time   Logical Shard   Physical Shard
2011   (1,2,3,4)       1
2011   (5,6,7,8)       2
2012   (1,2)           1
2012   (3,4)           3
2012   (5,6)           2
2012   (7,8)           4
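The time-aware mapping on this slide could be expressed as a routing table keyed by year and logical shard. The sketch below uses the slide's values; the reading that 2011 data stays where it is while 2012 data goes to the new layout is an interpretation, and `ROUTES` and `physical_shard` are hypothetical names.

```python
# (year, logical shards, physical shard), copied from the slide's table.
ROUTES = [
    (2011, (1, 2, 3, 4), 1),
    (2011, (5, 6, 7, 8), 2),
    (2012, (1, 2), 1),
    (2012, (3, 4), 3),
    (2012, (5, 6), 2),
    (2012, (7, 8), 4),
]


def physical_shard(year: int, logical: int) -> int:
    """Find the physical shard for a logical shard in a given year."""
    for route_year, logicals, physical in ROUTES:
        if route_year == year and logical in logicals:
            return physical
    raise LookupError(f"no route for year={year}, logical shard={logical}")


print(physical_shard(2011, 3))  # 1
print(physical_shard(2012, 3))  # 3
```

Under this reading, logical shard 3 is served from physical shard 1 for 2011 rows and from physical shard 3 for 2012 rows, so existing rows need not be copied when shards are added.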
Bad Sharding. Example 2
Can we shard customers by a meaningless sequence?
CREATE TABLE Orders (
order_id number PRIMARY KEY,
customer_fname varchar2(30),
customer_lname varchar2(30),
order_date date
)
order_id: 10000 - 20000
order_id: 20001 - 30000
order_id: 30001 - 40000
order_id: 40001 - 50000
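One likely reason this counts as bad sharding: ids drawn from a sequence only grow, so with range-based shards every new order lands in the highest range. The sketch below uses the slide's ranges; the shard numbering 1 to 4, the function name and the batch of new ids are assumptions.

```python
import bisect
from collections import Counter

# Upper bound of each shard's order_id range, taken from the slide.
UPPER_BOUNDS = [20000, 30000, 40000, 50000]


def shard_for_order(order_id: int) -> int:
    """Return the shard (numbered 1..4, an assumption) owning an order_id."""
    if not 10000 <= order_id <= 50000:
        raise ValueError(f"order_id {order_id} is outside the sharded ranges")
    return bisect.bisect_left(UPPER_BOUNDS, order_id) + 1


# Hypothetical batch of new orders taken from the sequence.
new_orders = range(49001, 49101)
print(Counter(shard_for_order(o) for o in new_orders))  # Counter({4: 100})
```

All 100 new rows go to shard 4 while shards 1 to 3 receive no inserts, so the write load is not spread at all.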