Windows Azure Storage

Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency
Brad Calder, Ju Wang, Aaron Ogus, Niranjan Nilakantan, Arild Skjolsvold, Sam
McKelvie, Yikang Xu, Shashwat Srivastav, Jiesheng Wu, Huseyin Simitci, Jaidev
Haridas, Chakravarthy Uddaraju, Hemal Khatri, Andrew Edwards, Vaman
Bedekar, Shane Mainali, Rafay Abbasi, Arpit Agarwal, Mian Fahim ul Haq,
Muhammad Ikram ul Haq, Deepali Bhardwaj, Sowmya Dayanand, Anitha
Adusumilli, Marvin McNett, Sriram Sankaran, Kavitha Manivannan, Leonidas
Rigas
Microsoft Corporation
• Blobs
• Tables
• Queues
• Drives
Windows Azure Storage High Level Architecture
Design Goals
Access blob storage via the URL: http://<account>.blob.core.windows.net/
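The URL pattern above can be sketched in a few lines of Python. This is only an illustration: the helper names `blob_endpoint` and `account_from_url` are hypothetical, and the `/pictures/sunrise` path in the usage line is an assumed example, not something the slide specifies.

```python
from urllib.parse import urlsplit

# Host suffix taken from the URL pattern on the slide.
BLOB_ENDPOINT_SUFFIX = "blob.core.windows.net"


def blob_endpoint(account: str) -> str:
    """Return the blob service endpoint for a storage account."""
    return f"http://{account}.{BLOB_ENDPOINT_SUFFIX}/"


def account_from_url(url: str) -> str:
    """Extract the account name from a blob endpoint URL."""
    host = urlsplit(url).hostname or ""
    suffix = "." + BLOB_ENDPOINT_SUFFIX
    if not host.endswith(suffix):
        raise ValueError(f"not a blob endpoint: {url!r}")
    return host[: -len(suffix)]


print(blob_endpoint("harry"))
# The path below is an assumed example, used only to show that it is ignored.
print(account_from_url("http://harry.blob.core.windows.net/pictures/sunrise"))
```

Because the account name is part of the host name, a request can be attributed to an account before any path is parsed.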
[Figure: two Storage Stamps, each with a load balancer (LB), Front-Ends, a Partition Layer, and a Stream Layer with intra-stamp replication; inter-stamp (geo) replication runs between the stamps. Also labeled: Storage Location Service, Data access.]
• Append-only distributed file system
• All data from the Partition Layer is stored into files (extents) in the Stream layer
• An extent is replicated 3 times across different fault and upgrade domains
  • With random selection for where to place replicas for fast MTTR
• Checksum all stored data
  • Verified on every client read
  • Scrubbed every few days
• Re-replicate on disk/node/rack failure or checksum mismatch
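The placement and checksum bullets above can be sketched as follows. This is a minimal illustration under stated assumptions, not the service's implementation: the `ExtentNode` type, the greedy selection, and the use of CRC32 are all assumptions (the slides do not name a checksum algorithm).

```python
import random
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtentNode:
    name: str
    fault_domain: int
    upgrade_domain: int


def place_replicas(nodes, copies=3, rng=random):
    """Randomly pick `copies` nodes with pairwise-distinct fault and upgrade domains.

    Greedy over a shuffled list: simple, but it can fail on small clusters
    even when a valid set exists.
    """
    candidates = list(nodes)
    rng.shuffle(candidates)
    chosen = []
    for node in candidates:
        if all(node.fault_domain != c.fault_domain
               and node.upgrade_domain != c.upgrade_domain for c in chosen):
            chosen.append(node)
            if len(chosen) == copies:
                return chosen
    raise RuntimeError("not enough distinct domains for the requested copies")


def store_block(data: bytes):
    """Return the data together with its checksum (CRC32 is an assumption)."""
    return data, zlib.crc32(data)


def read_block(data: bytes, checksum: int) -> bytes:
    """Verify the checksum on every read; a mismatch should trigger re-replication."""
    if zlib.crc32(data) != checksum:
        raise IOError("checksum mismatch: re-replicate this extent")
    return data


nodes = [ExtentNode(f"EN{i}", i % 3, i % 3) for i in range(9)]
print([n.name for n in place_replicas(nodes)])
```

Random placement spreads the replicas of different extents over many nodes, so after a failure many nodes can take part in re-replication at once, which is what keeps MTTR short.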
[Figure: Stream Layer (Distributed File System): three masters (M) coordinated by Paxos, and the Extent Nodes (EN).]
• Provides transaction semantics and strong consistency for Blobs, Tables and Queues
• Stores and reads the objects to/from extents in the Stream layer
• Provides inter-stamp (geo) replication by shipping logs to other stamps
• Scalable object index via partitioning
[Figure: Partition Layer (Partition Master, Lock Service, four Partition Servers) on top of the Stream Layer (three masters (M) coordinated by Paxos, Extent Nodes (EN)).]
Front End Layer
• Stateless Servers
• Authentication + authorization
• Request routing
[Figure: Front End Layer (FE servers) on top of the Partition Layer (Partition Master, Lock Service, four Partition Servers) and the Stream Layer (three masters (M) coordinated by Paxos, Extent Nodes (EN)).]
[Figure: an Incoming Write Request and its Ack, shown against the Front End Layer (FE servers), the Partition Layer (Partition Master, Lock Service, four Partition Servers), and the Stream Layer (three masters (M) coordinated by Paxos, Extent Nodes (EN)).]
Partition Layer
• Need a scalable index for the objects that can
  • Spread the index across 100s of servers
  • Dynamically load balance
  • Dynamically change what servers are serving each part of the index based on load
Blob Index

Account Name   Container Name   Blob Name
aaaa           aaaa             aaaaa
……..           ……..             ……..
harry          pictures         sunrise
harry          pictures         sunset
……..           ……..             ……..
richard        videos           soccer
richard        videos           tennis
……..           ……..             ……..
zzzz           zzzz             zzzzz

Front-End Server Partition Map: A-H: PS1, H’-R: PS2, R’-Z: PS3
[Figure: Storage Stamp. The Partition Master holds the Partition Map (A-H: PS1, H’-R: PS2, R’-Z: PS3); Partition Servers PS 1, PS 2, and PS 3 serve the ranges A-H, H’-R, and R’-Z.]
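The range-to-server mapping above can be sketched as a sorted lookup. This is a simplified illustration: it assumes the ranges are read by the first letter of the account name (A to H, then up to R, then up to Z), and the names `PARTITION_MAP` and `lookup_partition_server` are hypothetical. The real index is keyed on account, container, and blob name.

```python
import bisect

# Inclusive upper bound of each range (by first letter) and the server that owns it.
# The three ranges mirror the map on the slide; reading them by first letter is an assumption.
PARTITION_MAP = [("H", "PS1"), ("R", "PS2"), ("Z", "PS3")]
_UPPER_BOUNDS = [upper for upper, _ in PARTITION_MAP]


def lookup_partition_server(account_name: str) -> str:
    """Return the partition server whose range covers the account name."""
    first = account_name[:1].upper()
    if not ("A" <= first <= "Z"):
        raise KeyError(f"no range covers {account_name!r}")
    idx = bisect.bisect_left(_UPPER_BOUNDS, first)
    return PARTITION_MAP[idx][1]


for account in ("harry", "richard", "zzzz"):
    print(account, "->", lookup_partition_server(account))
```

Rebalancing then amounts to editing the map: splitting a range or pointing it at a different server changes where requests go without moving the stateless front ends.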
[Figure: Partition Server data flow. Labels: Writes, Commit Log Stream, Metadata log Stream, Read/Query, Checkpoint File Table, Blob Data.]
Stream Layer
[Figure: Stream //foo/myfile.data is an ordered list of pointers (Ptr E1, Ptr E2, Ptr E3, Ptr E4) to Extents E1 to E4; each extent is a sequence of Blocks.]
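The stream, extent, and block relationship in the figure can be modeled roughly as below. This is a sketch under assumptions: the `Extent` and `Stream` classes are hypothetical, and the rule that only the last, unsealed extent accepts appends is inferred from the append-only design rather than stated on this slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Extent:
    name: str
    blocks: List[bytes] = field(default_factory=list)
    sealed: bool = False


@dataclass
class Stream:
    name: str
    extents: List[Extent] = field(default_factory=list)  # ordered extent pointers

    def append_block(self, block: bytes) -> None:
        """Append to the last extent (assumed to be the only unsealed one)."""
        if not self.extents or self.extents[-1].sealed:
            raise RuntimeError("no unsealed extent to append to")
        self.extents[-1].blocks.append(block)

    def read_all(self) -> bytes:
        """A stream reads as the concatenation of its extents' blocks, in order."""
        return b"".join(b for e in self.extents for b in e.blocks)


stream = Stream("//foo/myfile.data", [
    Extent("E1", [b"a", b"b"], sealed=True),
    Extent("E2", [b"c"]),
])
stream.append_block(b"d")
print(stream.read_all())
```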
[Figure: Create Stream/Extent. The Partition Layer asks the Stream Master (SM, coordinated by Paxos) to create a stream/extent; the SM allocates the extent replica set: EN 1 (Primary), EN 2 (Secondary A), EN 3 (Secondary B).]
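[Figure: Append. The Partition Layer, knowing the replica set (EN1 Primary; EN2, EN3 Secondary), appends through EN 1 (Primary) to EN 2 (Secondary A) and EN 3 (Secondary B) and receives an Ack. The SMs (Paxos) are shown alongside.]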
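The append path in the figure can be sketched as follows. This is a simplified, single-process illustration: the `Replica` class and `append` function are hypothetical, and the sequential forwarding is an assumption made for readability. The property it tries to show is that an append is acknowledged with an offset only when every replica has written it.

```python
class Replica:
    """One copy of an extent on an extent node (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.data = bytearray()
        self.reachable = True

    def write_at(self, offset, payload):
        if not self.reachable:
            return False
        if offset != len(self.data):
            return False  # append-only: writes must land at the current end
        self.data += payload
        return True


def append(primary, secondaries, payload):
    """Primary picks the offset; return it only if every replica wrote the payload."""
    offset = len(primary.data)
    ok = primary.write_at(offset, payload)
    for s in secondaries:
        ok = s.write_at(offset, payload) and ok
    return offset if ok else None  # None: no ack, the caller should seal the extent


en1, en2, en3 = Replica("EN1"), Replica("EN2"), Replica("EN3")
print(append(en1, [en2, en3], b"hello"))
print(append(en1, [en2, en3], b"world"))
```

A failed append can leave replicas with different lengths past the last acknowledged offset, which is why the next step is to seal the extent at an agreed length.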
[Figure: Stream //foo/myfile.dat with pointers Ptr E1 to Ptr E5 to Extents E1 to E5.]
[Figure: Seal Extent, case 1. After an Append from the Partition Layer, the Stream Master (SM, coordinated by Paxos) asks the extent nodes for their current length; the lengths shown are 120 and 120, and the extent is sealed at 120. A second frame repeats "Sealed at 120" with the label "Sync with SM". Nodes: EN 1 (Primary), EN 2 (Secondary A), EN 3 (Secondary B), EN 4.]
[Figure: Seal Extent, case 2. After an Append from the Partition Layer, the SM asks the extent nodes for their current length; the lengths shown are 120 and 100, and the extent is sealed at 100. A second frame repeats "Sealed at 100" with the label "Sync with SM". Nodes: EN 1 (Primary), EN 2 (Secondary A), EN 3 (Secondary B), EN 4.]
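The two sealing cases above can be sketched as one rule. This is an illustration under an assumption: that the Stream Master seals at the smallest length reported by the replicas it can reach, which is consistent with the 120 and 100 outcomes in the figures. The function name `seal_extent` and the assignment of lengths to particular extent nodes are hypothetical.

```python
def seal_extent(reported_lengths):
    """Pick the sealed length from replica reports.

    reported_lengths maps an extent node name to its current length,
    or to None if the Stream Master could not reach that node.
    """
    reachable = {en: n for en, n in reported_lengths.items() if n is not None}
    if not reachable:
        raise RuntimeError("no replica reachable; cannot seal")
    # Assumed rule: the smallest reported length is present on every reachable replica.
    return min(reachable.values())


# Case 1: both reachable replicas report 120.
print(seal_extent({"EN1": None, "EN2": 120, "EN3": 120}))
# Case 2: reports differ; which node reported which value is a made-up assignment.
print(seal_extent({"EN1": 120, "EN2": 100, "EN3": None}))
```

Sealing at the smaller length never drops acknowledged data in this model, because an append is only acknowledged after all replicas have written it.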
• For Data Streams, Partition Layer only reads from offsets returned from successful appends
  • Committed on all replicas
  • Row and Blob Data Streams
• Offset valid on any replica
• Network partition
  • PS can talk to EN3
  • SM cannot talk to EN3
• Safe to read from EN3
[Figure: three SMs, the Partition Server, and EN 1 (Primary), EN 2 (Secondary A), EN 3 (Secondary B).]
• Logs are used on partition load
  • Commit and Metadata log streams
• Check commit length first
• Only read from
  • Unsealed replica if all replicas have the same commit length
  • A sealed replica
• Network partition
  • PS can talk to EN3
  • SM cannot talk to EN3
[Figure: the Partition Server checks the commit length on EN 1 (Primary), EN 2 (Secondary A), and EN 3 (Secondary B); the extent is sealed by the SMs; EN1 and EN2 are used for loading.]
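The log-loading rule above can be sketched as a small decision function. This is an illustration only: `ReplicaState` and `replicas_safe_for_log_load` are hypothetical names, and returning an empty list to mean "seal first, then retry" is an assumed convention.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReplicaState:
    name: str
    commit_length: Optional[int]  # None if the replica did not answer
    sealed: bool = False


def replicas_safe_for_log_load(replicas: List[ReplicaState]) -> List[str]:
    """Return the replicas a partition server may read log streams from."""
    sealed = [r.name for r in replicas if r.sealed and r.commit_length is not None]
    if sealed:
        return sealed  # a sealed replica is always safe
    lengths = {r.commit_length for r in replicas}
    if None not in lengths and len(lengths) == 1:
        return [r.name for r in replicas]  # unsealed, but all commit lengths agree
    return []  # lengths disagree: ask for the extent to be sealed, then retry


# After sealing, EN3 is cut off from the SM and stays unsealed, so only EN1 and EN2 are used.
state = [
    ReplicaState("EN1", 100, sealed=True),
    ReplicaState("EN2", 100, sealed=True),
    ReplicaState("EN3", 120, sealed=False),
]
print(replicas_safe_for_log_load(state))
```

The check matters for logs because, unlike data streams, a partition load reads the log to its end rather than at offsets returned by earlier successful appends.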
Design Choices and Lessons Learned
• Multi-Data Architecture
  • Use extra resources to serve mixed workload for incremental costs
    • Blob -> storage capacity
    • Table -> IOps
    • Queue -> memory
    • Drives -> storage capacity and IOps
  • Multiple data abstractions from a single stack
    • Improvements at lower layers help all data abstractions
    • Simplifies hardware management
    • Tradeoff: single stack is not optimized for specific workload pattern
• Greatly simplifies replication protocol and failure handling
  • Consistent and identical replicas up to the extent’s commit length
  • Keep snapshots at no extra cost
  • Benefit for diagnosis and repair
  • Erasure Coding
  • Tradeoff: GC overhead
• Allows each to be scaled separately
  • Important for multitenant environment
  • Moving toward full bisection bandwidth between compute and storage
  • Tradeoff: Latency/BW to/from storage
Windows Azure Storage Summary
http://blogs.msdn.com/windowsazurestorage/