School of EECS, Peking University
Microsoft Research Asia

UStore: A Low Cost Cold and Archival Data Storage System for Data Centers
Quanlu Zhang†, Yafei Dai†, Fengqian Li#, Lintao Zhang∗
† Peking University   # Shanghai Jiao Tong University   ∗ Microsoft Research Asia

A BRIEF INTRODUCTION TO CLOUD STORAGE

“Cold Storage Is Hot Again”
-- IDC Technology Assessment

• Hotmail: 5~22 GB per account, OneDrive: 7~25 GB per account
• User generated data: video feeds, sensor inputs, operational logs
• Long term archiving for financial and medical data
• System backups
• ……

[Figure: storage challenges reported by survey respondents (Percentage of Respondents, 2013-14 vs. 2012-13). "Managing storage growth" ranks first, at 79% and 77%. Other challenges shown: designing, deploying, and managing backup, recovery, and archive solutions; making informed strategic/big-picture decisions; designing, deploying, and managing disaster recovery solutions; designing, deploying, and managing storage in a virtualized server environment; lack of skilled storage professionals; designing, deploying, and managing storage in a cloud computing environment; lack of skilled cloud technology professionals; convincing higher management to adopt cloud; infrastructure for Big Data analytics; managing external cloud service providers.]
Source: Managing Storage: Trends, Challenges, and Options (2013-2014).

Much of the Data Is Cold or Archival
[Figure: Facebook Photo Access Patterns. Source: Facebook, 2013]
• Hot data: very low latency, high bandwidth
• Cold data: low bandwidth, (relatively) low latency
• Archival data: predictable workload, can tolerate long latency

What Characteristics Does an Ideal Cold and Archival Storage System Possess?
• Cheap
  – Low capital expense
  – Low operational expense
• Incrementally deployable
  – No need to over-provision too much
• Good Performance
  – Reasonable throughput
  – Relatively low access latency
• Reliable and Available

Which Storage Media?
• Tape
• Optical Disk
• Magnetic Disk

Magnetic Disk is Promising for Cold and Archival Storage
• The average cost per gigabyte fell from $437,500 in 1980 to $0.05 in 2013
• Shingled Magnetic Recording
  – High capacity
• Helium-filled hard drives
  – Low power
  – High capacity

How to Connect and Manage Large Numbers of Disks to Provide Storage Service?

Interconnection Technologies
• SATA
  – 6.0 Gb/s transfer speed
  – SATA multiplier
    • supports only 15 devices and cannot be cascaded
• SAS
  – 6 Gb/s transfer speed
  – SAS expander
• Fibre Channel
  – High bandwidth, but also high expense
• Ethernet
  – An ARM processor attaches and exposes each disk
  – Needs dedicated ARMs and network infrastructure

USB based Storage for Data Center
[Figure: Disk Array Box, USB Hub, Existing Server]
• USB 3.0
  – 5.0 Gb/s transfer speed (up to 10 Gb/s for USB 3.1), 300~400 MB/s realistic throughput
  – Tree structured hubs to address up to 127 devices
  – Supported by most new chipsets, very (very) cheap

The Problems of the Naïve Design
• Limited performance
  – An enclosure of ~100 disks with only 400 MB/s throughput
• Single point of failure
  – Failure of the root hub or the server causes total data loss
[Figure: Desired Design]
Traditional wisdom: multi-path attached storage is expensive

Two Primitives
[Figure: the two primitives, Hub and Switch; label: Control]

The Data Plane (Simple Tree)
[Figure]

The Data Plane (2-Way Redundancy)
[Figure: Server 1, Server 2]

The Data Plane (4 Output Ports)
[Figure]

The Control Plane
• What can be controlled?
  – Switches and Power to each disk
[Figure: Control Plane]

SOFTWARE DESIGN

Software Design
• Serve storage allocation and access
• Detect failures and implement quick failover
• Provide an appropriate interface for upper layer services and applications

Software Architecture
[Figure: UStore ClientLib with an iSCSI Initiator; several hosts, each running a UStore EndPoint, an iSCSI Target, and a USB Monitor; all hosts attached to the Interconnect Fabric]

Software Architecture
[Figure: the same architecture with the UStore Master (Paxos) added. The Master sends Control Commands and receives Heartbeat Messages; a Primary Controller and a Backup Controller control the hubs, switches, and disks of the Interconnect Fabric]

Configuring Interconnect Fabric
[Figure: servers S1 to S4 above disks D1 to D9]
UStore Master
  S1 : D1, D4
  S2 : D7, D8
  S3 : D2, D3
  S4 : D5, D6, D7

Configuring Interconnect Fabric
[Figure: servers S1 to S4 above disks D1 to D9, with S1 crashed]
UStore Master
  S1 : Crash
  S2 : D7, D8, D4
  S3 : D2, D3, D1
  S4 : D5, D6, D7
Control command: Connect D1 to S3 and D4 to S2
Reply: Reconfiguration Completion

UStore Prototype
[Figure]

COST COMPARISON

Cost Comparison
• Capital Expense

  System                    Media           Capital Expense   Without Disks
  DELL PowerVault MD3260i   Near-line SAS   $3,340,000        $1,525,000
  Sun StorageTek SL150      LTO6 Tape       $1,748,000        -
  Pergamum                  SATA HD         $756,000          $415,000
  BACKBLAZE                 SATA HD         $598,000          $257,000
  UStore                    SATA HD         $456,000          $115,000

• Operational Expense
  – Low power consumption
  – Low cooling cost
  – Low space occupation
  – Low operational cost

PERFORMANCE EVALUATION

Throughput
• SATA to USB bridge, USB hub, and USB switch have little impact on disk performance
[Figure, two panels: 4MB Sequence (MB/s) and 4KB Sequence (IO/s), each plotted against Read Percentage (100%, 50%, 0%); series: SATA, USB Hub&Switch]

Total Throughput
• Total throughput increases with the number of disks
[Figure, two panels: 4KB Sequence and 4MB Sequence, Total Bandwidth (MB/s) for Read and Write; series: 1 disk, 2 disks, 4 disks, 8 disks, 12 disks]
• Duplex throughput of one root: 540 MB/s
• Total throughput of our prototype: 2160 MB/s

Switching Time
[Figure: Latency (s) against Number of Switched Disks (1, 2, 3, 4, 8, 16), broken down into Mount delay, Expose delay, and Recognize delay]

Whole System’s Power Consumption
[Figure: Power Consumption (W) of DD860/ES30, Pergamum, and UStore in the Spinning and Powered off states; bar labels: 222.5, 193.5, 166.8, 83.5, 28.9, 22.1]

CONCLUSION AND FUTURE WORK

Conclusion
• Cheap
  – Low capital expense
  – Low operational expense
• Incrementally deployable
  – No need to over-provision too much
• Good Performance
  – Reasonable throughput
  – Relatively low access latency
• Reliable and Available

Future Work
• Provide data redundancy in UStore, leveraging the low coupling of disks and servers

Thank You!
Questions?

Failure Rate
• MTTF of servers is 3.4 months
• MTTF of disks is 10-50 years

Prototype’s interconnect topology
[Figure]

Power Management
• Many mechanisms have been proposed for power saving in storage systems
  – managing data redundancy and placement
• Provide a disk control interface that allows upper layer services to control the state of the disks that belong to them (spin-down/spin-up)
• Spin down disks after a configured interval

Power Consumption
[Figure, two panels. Disk: Power Consumption (W) in the Spin Down, Idle, and Read/Write states; series: SATA Disk, USB Bridge&Disk; annotated "1W". Hub: Power Consumption (W) against Number of Connected Disks (0 to 4)]
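
The idle-timeout policy from the Power Management slide ("Spin down disks after a configured interval") can be sketched roughly as follows. This is only an illustration under stated assumptions, not UStore's implementation: the class name `IdleSpinDownPolicy`, its methods, the disk names, and the 300-second interval are all hypothetical, and the sketch only tracks state instead of issuing real spin-down commands.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class IdleSpinDownPolicy:
    """Hypothetical idle-timeout policy; tracks state only, sends no disk commands."""
    idle_interval_s: float  # configured interval (assumed to be in seconds)
    last_access: Dict[str, float] = field(default_factory=dict)
    spun_down: Dict[str, bool] = field(default_factory=dict)

    def record_access(self, disk: str, now: float) -> None:
        # An access means the disk is spinning (or has just been spun up).
        self.last_access[disk] = now
        self.spun_down[disk] = False

    def tick(self, now: float) -> List[str]:
        """Mark and return the disks whose idle time reached the interval."""
        newly_down = []
        for disk, accessed_at in self.last_access.items():
            idle_for = now - accessed_at
            if not self.spun_down[disk] and idle_for >= self.idle_interval_s:
                self.spun_down[disk] = True
                newly_down.append(disk)
        return sorted(newly_down)


policy = IdleSpinDownPolicy(idle_interval_s=300.0)
policy.record_access("D1", now=0.0)
policy.record_access("D2", now=200.0)
print(policy.tick(now=300.0))  # ['D1']: D1 idle for 300 s, D2 for only 100 s
print(policy.tick(now=500.0))  # ['D2']: D2 reaches the interval, D1 is already down
```

Passing the current time in explicitly keeps the policy deterministic and easy to test; a real service would presumably call `tick` from a timer and translate the returned names into spin-down requests through the disk control interface.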