SCALEIO: ARCHITECTURE DEEP DIVE

ScaleIO Introduction
ScaleIO is Software-Defined Storage (SDS)
ScaleIO is software that uses standard servers to create an elastic/flexible, scalable, and resilient virtual SAN that reduces the complexity of traditional SANs
• Installs on industry-standard x86 servers
• Aggregates application servers’ local disks
• Adds storage and/or compute on the fly
[Figure: ScaleIO agent (minimal footprint)]
© Copyright 2017 Dell Inc.

Core, fundamental features of ScaleIO
• Configuration flexibility
  – Hyper-converged and/or 2-layer
• Highly scalable
• High performance / low footprint
  – Performance scales linearly
  – High I/O parallelism
  – Gets the maximum from flash media
  – Various caching options (RAM, flash)
• Elastic/Flexible
  – Add, move, remove nodes or disks “on the fly”
  – Auto-rebalance
• Resilient
  – Distributed mirroring
  – Fast auto-rebuild
  – Extensive failure handling / HA
  – In-flight I/O checksum
  – Background disk scanner
• Platform agnostic
  – Bare-metal: Linux / Windows
  – Virtual: ESX, XEN, KVM, Hyper-V
• Flash and magnetic
  – SSD, NVMe, PCI or HDD
  – Manual and automatic multi-tiering
• Partitioning / tiering / multi-tenancy
  – Protection-domains
  – Storage-pools
  – Fault-sets
  – QoS: bandwidth/IOPs limiter
• Secure
  – AD/LDAP, RBAC integration
  – Secure cluster formation and component authentication
  – Secure connectivity with components, secure external client communication
  – D@RE (SW, followed by SED)
• Any network
  – Slow, fast, shared, dedicated, IPv6…
• Ease of management & operation
  – GUI, CLI, REST, OpenStack, ViPR, ESRS, and more
  – Instant maintenance-mode
  – NDU
• All storage services: writeable snapshots, thin provisioning, etc.

ScaleIO Enables Multiple Consumption Choices
[Figure: three ways to consume ScaleIO: Buy, Buy & Build, Build]
[Figure labels: Maximum Flexibility & Choice / More Time & Resources; Lowest Risk, Highest Value, Lowest TCO]
• Hyper-converged rack-scale engineered system
• ScaleIO software and optimized Dell PowerEdge servers
• ScaleIO software only

Maintain Flexibility In Configuration

What is an application in ScaleIO configuration?
terms / definitions
app: any application that directly accesses block devices (could be an application, a local file-system, a distributed file-system, a hypervisor, etc.)

Local Storage in ScaleIO SDS
terms / definitions
local storage: either dedicated disks or partitions within disks. Can be any disk type: SSD, HDD, flash card, NVMe, etc.

ScaleIO in Hyper-converged configuration
• Hyper-converged
  – App and storage in the same node
  – ScaleIO is yet another application running alongside other applications
[Figure: nodes running apps, connected over ETH/IB]

ScaleIO in Two-Layer Configuration
The “traditional” two-layer configuration
[Figure: app nodes and storage connected over ETH/IB]

Combining App-Only Servers With Converged Servers
App-only servers can access ScaleIO volumes
[Figure: app-only servers + hyper-converged servers, connected over ETH/IB]

ScaleIO Components

Life without ScaleIO (bare metal)
[Figure: host stack: application(s), file-system semantics, file-system, block semantics, block dev. drivers; DAS (mostly unutilized, contains OS files), HBA, NIC/IB; switches, Fabric, External Storage Subsystem]

ScaleIO Data Client (SDC)
Exposes ScaleIO shared block volumes to the application
Notes: ScaleIO data client (SDC) is a block device driver
[Figure: host stack with the SDC at the block dev. drivers layer; ScaleIO protocol; DAS, HBA, NIC/IB; switches, Fabric, External Storage Subsystem]

ScaleIO Data Server (SDS)
Owns local storage that contributes to the ScaleIO storage pool
Notes:
• ScaleIO data server (SDS) is a daemon/service
• Local storage could be dedicated disks or partitions within a disk
[Figure: host stack with the SDS; space allocated to ScaleIO; ScaleIO protocol; DAS, HBA, NIC/IB; switches, Fabric, External Storage Subsystem]

SDS & SDC in the same host
• An SDC and an SDS can live together.
• SDC serves the I/O requests of the resident host applications.
• SDS serves the I/O requests of various SDCs.
[Figure: host stack with both SDS and SDC; space allocated to ScaleIO; ScaleIO protocol]

Fully Converged Configuration
[Figure: nodes each running app, C (SDC) and S (SDS), connected over ETH/IB]

Two Layer Configuration
[Figure: app nodes with C (SDC) on top, S (SDS) nodes below, connected over ETH/IB]
Similar to a traditional storage subsystem box, but:
• Software based
• Highly scalable
• and … massive parallelism, as the SDCs contact the relevant SDSs directly

SDS CPU utilization linear with IOPs
[Figure: two charts, 4KB Write IOPs vs. SDS CPU% and 4KB Read IOPs vs. SDS CPU%]
[Chart axes: SDS CPU % against IOPs; CPU % values shown reach 12% in both charts, with x-axes running to 300,000 and 120,000 IOPs]
• Utilization on a node with 2x2698V4 CPU (20 cores each)
• Cores are only used when workload is generated

SDC CPU utilization linear with IOPs
[Figure: two charts, 4KB Read IOPs vs. SDC CPU% and 4KB Write IOPs vs. SDC CPU%; CPU % values shown reach 8%, with both x-axes running to 600,000 IOPs]
• Utilization on a node with 2x2698V4 CPU (20 cores each)
• Cores are only used when workload is generated

Volume Layout, Redundancy and Elasticity

Volumes
[Figure: an SDC accessing a ScaleIO volume spread across SDS1, SDS2, SDS3 and SDS4]
• A volume appears as a single object to the application
• The SDC is always accessing data from multiple devices on multiple nodes
• Logical collection of mirrored, distributed chunks in a Storage Pool
• SDC only accesses primary chunks

Virtual Spares and Free Space
• Data layout is elastic
• Free space is used as a distributed spare
• Losing a storage device = contraction of storage pool
• Unprotected data mirrored into free space
• Not enough free space? ScaleIO warning
• More SDSs = smaller spare space requirement
• Unprotected data is rebuilt across storage pool
• Min spare space:
  – 4 nodes: 25% free space
  – 10 nodes: 10% free space
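The two minimum-spare figures above (25% for 4 nodes, 10% for 10 nodes) are both consistent with reserving one node's share of the raw capacity, 1/N. The Python sketch below works through that arithmetic. The 1/N rule is an inference from those two data points, not a formula stated in the slides. The function names, the equal-sized-node assumption and the two-copy mirroring factor are illustrative assumptions as well.

```python
def min_spare_fraction(node_count: int) -> float:
    """Fraction of raw capacity kept free so one failed node can be re-mirrored.

    Assumption: all nodes contribute equal capacity, so one node holds 1/N
    of the raw capacity and that much free space must exist elsewhere.
    """
    if node_count < 2:
        # Mirroring across nodes needs at least two of them.
        raise ValueError("node_count must be at least 2")
    return 1.0 / node_count


def usable_capacity(raw_capacity: float, node_count: int, copies: int = 2) -> float:
    """Capacity left for volumes after reserving spare space and storing mirrors.

    Assumption: every chunk is stored `copies` times (two-copy mirroring).
    """
    spare = min_spare_fraction(node_count)
    return raw_capacity * (1.0 - spare) / copies


if __name__ == "__main__":
    for nodes in (4, 10):
        print(f"{nodes} nodes: spare {min_spare_fraction(nodes):.0%}")
```

Under these assumptions, a 4-node pool with 40 TB of raw capacity would keep 10 TB free as spare and offer 15 TB for volumes. The spare share shrinks as nodes are added, which matches the "More SDSs = smaller spare space requirement" bullet.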
Fast, balanced and smart rebuild
• Forwards Rebuild
  – Once a disk/node fails, the rebuild load is balanced across all the cluster partition's disks/nodes, giving a faster and smoother rebuild
• Backwards Rebuild
  – Smart & selective transition to “backwards” rebuild (re-silvering), once a failed node is back alive
  – Short outage = small penalty

SIO: rebuild with 80K IOPS; 400GB rebuild size, using default QoS rebuild settings
[Figure: performance chart with "Rebuild started" and "Rebuild completed" markers]
• Rebuild time: 390 seconds
• Rebuild rate: 1.05 GB/sec
When running at 80K system IOPs, the additional background workload causes an impact. This impact can be controlled with a rebuild QoS value as shown in the next slide.
NOTE: the increased I/O seen after the rebuild completes is a vdbench test artifact.

SIO: rebuild with 80K IOPS; 400GB rebuild size, with rebuild b/w limit per device
[Figure: performance chart with "Rebuild started" and "Rebuild completed" markers]
• Rebuild time: 2510 seconds
• Rebuild rate: 163 MB/sec
When running at 80K system IOPs, there is almost no performance impact when limiting the rebuild bandwidth.

Elasticity/Flexibility, auto rebalance
• Add: One may add nodes or disks dynamically; the system automatically rebalances the storage
• Remove: One may remove nodes / disks dynamically; the system automatically rebalances the storage
Minimal data transferred in a many-to-many fashion

IO Flow

A Single Read I/O
[Figure: nodes each running app, C (SDC) and S (SDS), connected over ETH/IB]
The SDC interacts directly with the relevant SDS
A single Read I/O generally involves an interaction with a single node

A Single Write I/O
[Figure: nodes each running app, C (SDC) and S (SDS), connected over ETH/IB]
The SDC interacts directly with the relevant SDS
A single Write I/O generally involves interactions with only 2 nodes
• 4KB write = 2 x 4KB propagated over the network + 2 x 4KB written to media (in 2 different nodes)
• 4KB read = 1 x 4KB (network, media)
• Scalability: Data doesn’t flow via a central point
• Performance: High I/O parallelism
• Shared-everything volumes

Client-side Mapping Information and The Metadata Manager

MDM – Three Viewpoints
self
• Lightweight
• Clustered
• Redundant
• Highly-available
• Does not require dedicated nodes
storage
• Maintains authoritative inventory and mappings
• Initiates rebalances and rebuilds, keeps storage protected and optimized
admin
• Accepts GUI, CLI, API commands
• User-facing storage monitoring and alerting
• Control plane only, never sees user data

Tightly Coupled, Loosely Coupled
• Connection type between components suited to their purpose
[Figure: MDM cluster (Master, Slave, TB), tightly coupled to one another and to the SDSs]
tightly coupled
• Master MDM replicates changes to system status synchronously
• Master MDM monitors SDS status continuously
  – Informs SDSs of changes to system and MDM status, nodes/devices/data layout
loosely coupled, lazy update
• MDM updates SDCs “lazily”
  – SDCs can recognize data layout changes and contact MDMs
  – SDCs contact MDMs for data layout update after changes and failures
[Figure: SDCs, loosely coupled to the MDM]

Protection Domains, Storage Pools, Multi-tenancy and IO Limiter

Protection Domains
A protection domain is a set of SDSs
A volume is defined in a protection domain
• SDCs from domain X can access data in domain Y
• An SDS resides in exactly one protection domain
• Toleration of simultaneous failures in large clusters
• Performance isolation when needed
• Data location control (e.g., multi-tenancy)

Storage-pools
[Figure: SDSs in a protection domain with Magnetic (HDD) and Flash (SSD) devices, grouped into pool1, pool2 and pool3]
• Storage-pool: A set of disks in a protection domain. A volume is defined from a storage-pool.
• Multi-tiering: Fast vs. slower storage-pools
• Performance-isolation: Multiple storage-pools of the same media speed

Elasticity/Flexibility – moving resources
• Move: One could easily move a node's storage from one protection-domain to another, totally non-disruptively
By simply sending a command (!)
Could you move spindles from one storage box to another by sending a command?

Fault Set
Bandwidth / IOPs Limiter
• The ability to limit a specific client from exceeding X IOPs and/or Y bandwidth at volume V

Partitioning / Tiering / Multi-Tenancy
The combination of protection-domains, storage-pools and limiter allows the user to control multi-tenancy performance, capacity and availability!

Tools: ScaleIO Sizer
https://scaleio-sizer.emc.com/

Virtualization Environments
• Almost identical to bare-metal
• SDC sits inside the hypervisor’s kernel
• SDS sits in the hypervisor’s user-mode
  – Or in a VM in ESX environment

GUI
Read Workload: 31M IOPS

Want More ScaleIO?
• EMC ScaleIO In The Enterprise: The Citi Experience (storage.28)
• ScaleIO: Architecture Deep Dive (storage.29)
• ScaleIO: Software-Defined Storage Lifecycle Management Viewed Through Demos (storage.36)
• ScaleIO: Customer Panel: How Is Software Defined Storage Helping My Data Center? (storage.34)
• ScaleIO: Redefining Software-Defined Storage & Hyper-Convergence
• ScaleIO & vSAN: Software-Defined Storage - The Revolution is Here! (storage.33)
Try our Hands-on-Labs
• Build A 100-Node ScaleIO SDS Cluster In Minutes!
• Operations & Lifecycle Management Of Dell EMC ScaleIO Software Defined Storage
• Use REX-Ray & ScaleIO With Docker, Mesos & Kubernetes (Hands-on Lab)
• ScaleIO: Simplifying OpenStack With ScaleIO Software Defined Storage (storage.32)
• ScaleIO: Architecting For Availability, Performance & Networking With ScaleIO (storage.30)
• Birds of a Feather
Dell EMC World, Monday May 8 through Thursday May 11, 2017