ScaleIO Storage Architecture

SCALEIO: ARCHITECTURE DEEP DIVE
ScaleIO Introduction
ScaleIO is software-defined storage (SDS): software that uses standard servers to create an elastic, scalable, and resilient virtual SAN while reducing the complexity of traditional SANs.
• Installs on industry-standard x86 servers
• Aggregates application servers' local disks
• Adds storage and/or compute on the fly
• Runs as a lightweight agent with a minimal footprint
© Copyright 2017 Dell Inc.
Core, fundamental features of ScaleIO
• Configuration flexibility
– Hyper-converged and/or 2-layers
• Highly scalable
• High performance / low footprint
– Performance scales linearly
– High I/O parallelism
– Gets the maximum from flash media
– Various caching options (RAM, flash)
• Elastic/Flexible
– Add, move, remove nodes or disks "on the fly"
– Auto-rebalance
• Resilient
– Distributed mirroring
– Fast auto-rebuild
– Extensive failure handling / HA
– In-flight I/O checksum
– Background disk scanner
• Platform agnostic
– Bare-metal: Linux / Windows
– Virtual: ESX, XEN, KVM, Hyper-V
• Flash and magnetic
– SSD, NVMe, PCI or HDD
– Manual and automatic multi-tiering
Core, fundamental features of ScaleIO
• Partitioning / tiering / multi-tenancy
– Protection domains
– Storage pools
– Fault sets
– QoS: bandwidth/IOPS limiter
• Secure
– AD/LDAP, RBAC integration
– Secure cluster formation and component authentication
– Secure connectivity with components; secure external client communication
– D@RE (software-based, followed by SED)
• Any network
– Slow, fast, shared, dedicated, IPv6…
• Ease of management & operation
– GUI, CLI, REST, OpenStack, ViPR, ESRS, and more
– Instant maintenance mode
– Non-disruptive upgrade (NDU)
• All storage services: writeable snapshots, thin provisioning, etc.
ScaleIO Enables Multiple Consumption Choices
• Buy: a hyper-converged, rack-scale engineered system (lowest risk, highest value, lowest TCO)
• Buy & Build: ScaleIO software on optimized Dell PowerEdge servers
• Build: ScaleIO software only (maximum flexibility and choice, at the cost of more time and resources)
All three options consume the same software and maintain flexibility in configuration.
What is an application in a ScaleIO configuration?
terms / definitions
app: any application that directly accesses block devices (could be an application, a local file system, a distributed file system, a hypervisor, etc.)
Local Storage in ScaleIO
terms / definitions
local storage: either dedicated disks or partitions within disks; any disk type, e.g. SSD, HDD, flash card, NVMe
ScaleIO in Hyper-Converged Configuration
• App and storage in the same node
• ScaleIO is yet another application running alongside other applications
[diagram: a row of servers, each running applications plus ScaleIO, connected over Ethernet/InfiniBand]
ScaleIO in Two-Layer Configuration
• The "traditional" two-layer configuration: application servers on one tier, ScaleIO storage nodes on another
[diagram: application servers connected over Ethernet/InfiniBand to a separate tier of ScaleIO storage nodes]
Combining App-Only Servers With Converged Servers
• App-only servers can access ScaleIO volumes served by hyper-converged servers
[diagram: app-only servers plus hyper-converged servers on the same Ethernet/InfiniBand network]
ScaleIO Components
Life without ScaleIO (bare metal)
[diagram: host stack, application(s) → file system (file-system semantics) → block device drivers (block semantics) → DAS via a local HBA, plus an external storage subsystem reached through an HBA and a switched fabric; a NIC/IB adapter handles networking]
• The local DAS is mostly unutilized, typically containing only OS files
ScaleIO Data Client (SDC)
• The SDC is a block device driver sitting below the file system in the host's I/O stack
• Exposes ScaleIO shared block volumes to the application
• Speaks the ScaleIO protocol over the NIC/IB network, alongside any existing DAS or external storage
ScaleIO Data Server (SDS)
• The SDS is a daemon/service
• Owns the local storage that contributes to the ScaleIO storage pool
• Local storage can be dedicated disks or partitions within a disk, with part of the space allocated to ScaleIO
• Speaks the ScaleIO protocol over the NIC/IB network
SDS & SDC in the Same Host
• An SDC and an SDS can live together on the same host
• The SDC serves the I/O requests of the resident host applications
• The SDS serves the I/O requests of various SDCs
Fully Converged Configuration
[diagram: every node runs an application together with both an SDC (C) and an SDS (S), all connected over Ethernet/InfiniBand]
Two-Layer Configuration
[diagram: application servers run only SDCs (C); a separate tier of storage nodes runs the SDSs (S), connected over Ethernet/InfiniBand]
Similar to a traditional storage subsystem box, but:
• Software based
• Highly scalable
• … and massively parallel, as the SDCs contact the relevant SDSs directly
SDS CPU utilization is linear with IOPS
[charts: "4KB Write IOPs vs. SDS CPU%" and "4KB Read IOPs vs. SDS CPU%"; CPU utilization rises roughly linearly from ~1% to ~12% across the measured IOPS ranges (up to roughly 300,000 and 120,000 IOPS)]
• Utilization measured on a node with 2 x E5-2698 v4 CPUs (20 cores each)
• Cores are only used when a workload is generated
SDC CPU utilization is linear with IOPS
[charts: "4KB Read IOPs vs. SDC CPU%" and "4KB Write IOPs vs. SDC CPU%"; CPU utilization rises roughly linearly from ~1% to ~8% across the measured range (up to roughly 600,000 IOPS)]
• Utilization measured on a node with 2 x E5-2698 v4 CPUs (20 cores each)
• Cores are only used when a workload is generated
Volume Layout, Redundancy and Elasticity
Volumes
• A volume appears as a single object to the application
• The SDC always accesses data from multiple devices on multiple nodes
[diagram: one SDC accessing a ScaleIO volume spread across SDS1–SDS4]
Volumes
• A volume is a logical collection of mirrored, distributed chunks in a storage pool
• The SDC only accesses primary chunks; mirror copies live on other SDSs
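The mirrored, distributed chunk layout can be sketched in a few lines of Python. This is an illustrative model only (the chunk size, hash-based placement, and function names are assumptions, not ScaleIO's real allocator); it simply demonstrates the two invariants stated above: every chunk has a primary and a mirror copy, and the two copies always land on different SDS nodes.

```python
import hashlib

def place_chunk(volume_id: str, chunk_idx: int, sds_nodes: list) -> tuple:
    """Pick (primary, mirror) SDS nodes for one chunk.

    Hash-based placement is a stand-in for ScaleIO's real allocator,
    which balances capacity and load across the storage pool.
    """
    h = int(hashlib.sha256(f"{volume_id}:{chunk_idx}".encode()).hexdigest(), 16)
    n = len(sds_nodes)
    primary = h % n
    mirror = (primary + 1 + h % (n - 1)) % n  # offset in 1..n-1, never the primary
    return sds_nodes[primary], sds_nodes[mirror]

def volume_layout(volume_id: str, num_chunks: int, sds_nodes: list) -> dict:
    """Map every chunk of a volume to its (primary, mirror) pair."""
    return {i: place_chunk(volume_id, i, sds_nodes) for i in range(num_chunks)}
```

With four SDS nodes and a few hundred chunks, both primaries and mirrors spread across the whole pool, which is what lets a single SDC read from many devices on many nodes in parallel.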
Virtual Spares and Free Space
• Data layout is elastic
• Free space is used as a distributed spare
• Losing a storage device = contraction of the storage pool
• Unprotected data is mirrored into free space and rebuilt across the whole storage pool
• Not enough free space? ScaleIO warns
• More SDSs = smaller spare-space requirement
• Minimum spare space: 4 nodes → 25% free; 10 nodes → 10% free
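The spare-space figures above (4 nodes → 25%, 10 nodes → 10%) amount to "keep one node's worth of capacity free", i.e. 1/N of the pool for N equally sized nodes. A minimal sketch of that sizing rule (the function name and the 3-node minimum are illustrative assumptions):

```python
def min_spare_fraction(num_equal_nodes: int) -> float:
    """Fraction of pool capacity to keep free so that one failed node's
    data can be re-mirrored into the remaining free space.

    With N equally sized nodes, that is one node's share: 1/N.
    """
    if num_equal_nodes < 3:
        # below 3 nodes there is nowhere left to rebuild a lost mirror (assumption)
        raise ValueError("need at least 3 nodes")
    return 1.0 / num_equal_nodes
```

This reproduces the slide's numbers and shows why "more SDSs = smaller spare-space requirement": the largest single failure unit shrinks relative to the pool as the cluster grows.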
Fast, balanced and smart rebuild
• Forward rebuild
– Once a disk/node fails, the rebuild load is balanced across all of the cluster's disks/nodes → a faster and smoother rebuild
• Backward rebuild
– Smart, selective transition to "backward" rebuild (re-silvering) once a failed node comes back alive
– Short outage = small penalty
ScaleIO rebuild: 80K IOPS workload; 400 GB rebuild size, using default QoS rebuild settings
[chart: IOPS over time, with "Rebuild started" and "Rebuild completed" markers]
• Rebuild time: 390 seconds
• Rebuild rate: 1.05 GB/sec
When running at 80K system IOPS, the additional background rebuild workload causes an impact. This impact can be controlled with a rebuild QoS value, as shown on the next slide.
NOTE: the increased I/O seen after the rebuild completes is a vdbench test artifact.
ScaleIO rebuild: 80K IOPS workload; 400 GB rebuild size, with a rebuild bandwidth limit per device
[chart: IOPS over time, with "Rebuild started" and "Rebuild completed" markers]
• Rebuild time: 2510 seconds
• Rebuild rate: 163 MB/sec
When running at 80K system IOPS, there is almost no performance impact when limiting the rebuild bandwidth.
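The quoted rebuild rates follow directly from size over time, assuming the "400GB" figure is 400 GiB and the rates are reported in MiB/s (labelled loosely as MB/GB on the slides):

```python
def rebuild_rate_mib_s(rebuild_gib: float, seconds: float) -> float:
    """Average rebuild throughput in MiB/s for a rebuild of `rebuild_gib` GiB."""
    return rebuild_gib * 1024 / seconds

# 400 GiB in 390 s  -> ~1050 MiB/s, quoted as "1.05 GB/sec"
# 400 GiB in 2510 s -> ~163 MiB/s,  quoted as "163 MB/sec"
```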
Elasticity/Flexibility: auto-rebalance
• Add: nodes or disks may be added dynamically → the system automatically rebalances the storage
• Remove: nodes or disks may be removed dynamically → the system automatically rebalances the storage
• In both cases, minimal data is transferred, in a many-to-many fashion
IO Flow
A Single Read I/O
• The SDC interacts directly with the relevant SDS
• A single read I/O generally involves an interaction with a single node
[diagram: converged nodes with SDCs (C) and SDSs (S) over Ethernet/InfiniBand]
A Single Write I/O
• The SDC interacts directly with the relevant SDS
• A single write I/O generally involves interactions with only 2 nodes (the primary and the mirror)
A Single Write I/O
• 4KB write = 2 x 4KB propagated over the network + 2 x 4KB written to media (on 2 different nodes)
• 4KB read = 1 x 4KB (network and media)
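The per-I/O cost stated above can be written down directly. This trivial model (the function name is illustrative) counts the network and media bytes for one I/O: a write is propagated to both the primary and the mirror SDS, while a read touches only the primary:

```python
def io_cost_bytes(size_bytes: int, is_write: bool) -> dict:
    """Bytes moved over the network and read/written on media for one I/O."""
    copies = 2 if is_write else 1  # writes hit primary + mirror; reads only primary
    return {"network": copies * size_bytes, "media": copies * size_bytes}
```

So a 4 KB write costs 8 KB of network traffic and 8 KB of media writes (spread over two nodes), while a 4 KB read costs 4 KB of each.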
A Single Write I/O
• Scalability: data doesn't flow via a central point
• Performance: high I/O parallelism
• Shared-everything volumes
Client-side Mapping Information and the Metadata Manager
MDM – Three Viewpoints
• Self: lightweight; clustered; redundant; highly available; does not require dedicated nodes
• Storage: maintains the authoritative inventory and mappings; initiates rebalances and rebuilds; keeps storage protected and optimized
• Admin: accepts GUI, CLI, and API commands; provides user-facing storage monitoring and alerting; control plane only, never sees user data
Tightly Coupled, Loosely Coupled
• The connection type between components is suited to their purpose
• Master MDM ↔ Slave/Tie-Breaker (TB) MDMs: tightly coupled; the master replicates changes to system status synchronously
• Master MDM ↔ SDSs: tightly coupled; the master monitors SDS status continuously and informs the SDSs of changes to system and MDM status (nodes/devices/data layout)
• MDM ↔ SDCs: loosely coupled, lazy update; the MDM updates SDCs "lazily", and SDCs recognize data-layout changes and contact the MDMs for a layout update after changes and failures
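The "lazy update" relationship between the MDM and the SDCs can be sketched as follows. All class and method names here are invented for illustration (this is not the ScaleIO wire protocol): the SDC caches a data layout tagged with a generation number, an SDS rejects I/O from a client whose generation is stale, and only then does the SDC go back to the MDM for a fresh map.

```python
class Mdm:
    """Toy metadata manager: the authoritative layout plus a generation counter."""
    def __init__(self):
        self.generation = 0
        self.layout = {}           # chunk_id -> owning Sds

    def current_layout(self):
        return dict(self.layout), self.generation

    def move_chunk(self, chunk_id, new_sds):
        self.layout[chunk_id] = new_sds
        self.generation += 1       # any layout change bumps the generation

class Sds:
    """Toy data server: rejects I/O from clients holding a stale layout."""
    def __init__(self, name, mdm):
        self.name, self.mdm, self.data = name, mdm, {}

    def serve(self, chunk_id, client_generation):
        if client_generation < self.mdm.generation:
            return False, None     # NACK: tells the SDC its map is out of date
        return True, self.data.get(chunk_id)

class Sdc:
    """Toy data client: works from a cached layout, refreshes only on a NACK."""
    def __init__(self, mdm):
        self.mdm = mdm
        self.layout, self.generation = mdm.current_layout()

    def read(self, chunk_id):
        ok, data = self.layout[chunk_id].serve(chunk_id, self.generation)
        if not ok:                 # lazy update: fetch a fresh map, then retry
            self.layout, self.generation = self.mdm.current_layout()
            ok, data = self.layout[chunk_id].serve(chunk_id, self.generation)
        return data
```

The design point this illustrates: the MDM never sits in the data path; SDCs keep serving I/O from cached maps and only pay a metadata round-trip after an actual change or failure.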
Protection Domains, Storage Pools, Multi-tenancy and IO Limiter
Protection Domains
• A protection domain is a set of SDSs
• A volume is defined in a protection domain
• SDCs from domain X can access data in domain Y
• An SDS resides in exactly one protection domain
• Why? Toleration of simultaneous failures in large clusters; performance isolation when needed; data-location control (e.g., multi-tenancy)
Storage Pools
• A storage pool is a set of disks in a protection domain; a volume is defined from a single storage pool
• Multi-tiering: fast pools (flash/SSD) vs. slower pools (magnetic/HDD)
• Performance isolation: multiple storage pools of the same media speed
[diagram: one protection domain of SDSs, with HDD and SSD devices grouped into pool1, pool2 and pool3]
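The containment rules on these slides (an SDS lives in exactly one protection domain; a storage pool groups disks within one domain; a volume comes from one pool) can be captured as a small data model. Everything here, from the class names to the registry, is an illustrative sketch and not a ScaleIO API:

```python
class ProtectionDomain:
    """Enforces two invariants from the slides: an SDS belongs to exactly
    one protection domain, and a pool's disks all sit on SDSs of that domain."""
    _sds_home = {}  # global registry: sds name -> owning domain name

    def __init__(self, name):
        self.name, self.sds_names, self.pools = name, set(), {}

    def add_sds(self, sds_name):
        if sds_name in ProtectionDomain._sds_home:
            raise ValueError(f"{sds_name} already belongs to a protection domain")
        ProtectionDomain._sds_home[sds_name] = self.name
        self.sds_names.add(sds_name)

    def add_pool(self, pool_name, disks):
        """disks: list of (sds_name, disk_name) pairs, all from this domain."""
        for sds_name, _disk in disks:
            if sds_name not in self.sds_names:
                raise ValueError(f"{sds_name} is not an SDS of domain {self.name}")
        self.pools[pool_name] = list(disks)
```

For example, one domain might hold an HDD pool and an SSD pool built from the same SDSs, giving the multi-tiering described above while keeping both pools inside one failure and tenancy boundary.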
Elasticity/Flexibility: moving resources
• Move: one can easily move a node's storage from one protection domain to another, completely non-disruptively, by simply sending a command(!)
• Could you move spindles from one storage box to another by sending a command?
Fault Set
• A fault set is a group of SDSs that are likely to fail together (e.g., servers in the same rack); ScaleIO places the two copies of a chunk in different fault sets, so an entire fault set can fail without data loss
Bandwidth / IOPS Limiter
• The ability to limit a specific client from exceeding X IOPS and/or Y bandwidth on volume V
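A limiter like the one described above is commonly implemented as a pair of token buckets, one for IOPS and one for bandwidth. The sketch below is a generic token-bucket illustration (ScaleIO documents the limits themselves, not this particular mechanism); time is passed in explicitly to keep the sketch deterministic:

```python
class VolumeLimiter:
    """Per-client, per-volume QoS: admit an I/O only if both the IOPS
    bucket and the bandwidth bucket have enough tokens."""
    def __init__(self, max_iops, max_bw_bytes):
        self.max_iops, self.max_bw = max_iops, max_bw_bytes
        self.iops_tokens, self.bw_tokens = float(max_iops), float(max_bw_bytes)
        self.last = 0.0

    def allow(self, now, size_bytes):
        # refill both buckets for the elapsed time, capped at one second's quota
        dt = now - self.last
        self.last = now
        self.iops_tokens = min(self.max_iops, self.iops_tokens + dt * self.max_iops)
        self.bw_tokens = min(self.max_bw, self.bw_tokens + dt * self.max_bw)
        if self.iops_tokens >= 1 and self.bw_tokens >= size_bytes:
            self.iops_tokens -= 1
            self.bw_tokens -= size_bytes
            return True
        return False
```

Checking both buckets means a client is throttled by whichever limit it hits first: many small I/Os exhaust the IOPS bucket, a few large ones exhaust the bandwidth bucket.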
Partitioning / Tiering / Multi-Tenancy
• The combination of protection domains, storage pools and the limiter allows the user to control multi-tenancy performance, capacity and availability
Tools: ScaleIO Sizer
https://scaleio-sizer.emc.com/
Virtualization Environments
• Almost identical to bare metal
• The SDC sits inside the hypervisor's kernel
• The SDS sits in the hypervisor's user mode, or in a VM in an ESX environment
GUI
[screenshot: ScaleIO GUI dashboard showing a read workload of 31M IOPS]
Want More ScaleIO?
Sessions:
• EMC ScaleIO In The Enterprise: The Citi Experience (storage.28)
• ScaleIO: Architecture Deep Dive (storage.29)
• ScaleIO: Architecting For Availability, Performance & Networking With ScaleIO (storage.30)
• ScaleIO: Simplifying OpenStack With ScaleIO Software Defined Storage (storage.32)
• ScaleIO & vSAN: Software-Defined Storage - The Revolution is Here! (storage.33)
• ScaleIO: Customer Panel: How Is Software Defined Storage Helping My Data Center? (storage.34)
• ScaleIO: Software-Defined Storage Lifecycle Management Viewed Through Demos (storage.36)
Birds of a Feather:
• ScaleIO: Redefining Software-Defined Storage & Hyper-Convergence
Try our Hands-on-Labs:
• Build A 100-Node ScaleIO SDS Cluster In Minutes!
• Operations & Lifecycle Management Of Dell EMC ScaleIO Software Defined Storage
• Use REX-Ray & ScaleIO With Docker, Mesos & Kubernetes
Visit us in Booth #757
Want to win a Levitating Death Star Speaker?
• Follow @DellEMCStorage while at Dell EMC World
• 2 winners will be chosen daily from Monday May 8 through Thursday May 11
• All winners will be notified through Twitter Direct Message
NO PURCHASE NECESSARY. Ends 05/11/2017. To enter and for Official Rules, visit http://thecoreblog.emc.com/dell-emc-world-follow-win-sweepstakes-2017/