HPC Storage: Current Status and Futures
Torben Kling Petersen, PhD
Principal Architect, HPC

Agenda

• Where are we today?
• File systems
• Interconnects
• Disk technologies
• Solid state devices
• Solutions
• Final thoughts …

Pan Galactic Gargle Blaster: "Like having your brains smashed out by a slice of lemon wrapped around a large gold brick."

Current Top10 …

(Rmax and Rpeak in GFlop/s; power in kW; file system type, size and sustained throughput as reported)

1. Tianhe-2 (National Super Computer Center in Guangzhou) – TH-IVB-FEP Cluster, Xeon E5-2692 12C 2.2 GHz, TH Express-2, Intel Xeon Phi – 3,120,000 cores, Rmax 33,862,700, Rpeak 54,902,400, 17,808 kW – Lustre/H2, 12.4 PB, ~750 GB/s
2. Titan (DOE/SC/Oak Ridge National Laboratory) – Cray XK7, Opteron 6274 16C 2.2 GHz, Cray Gemini interconnect, NVIDIA K20x – 560,640 cores, Rmax 17,590,000, Rpeak 27,112,550, 8,209 kW – Lustre, 10.5 PB, 240 GB/s
3. Sequoia (DOE/NNSA/LLNL) – BlueGene/Q, Power BQC 16C 1.60 GHz, custom interconnect – 1,572,864 cores, Rmax 17,173,224, Rpeak 20,132,659, 7,890 kW – Lustre, 55 PB, 850 GB/s
4. K computer (RIKEN AICS) – Fujitsu SPARC64 VIIIfx 2.0 GHz, Tofu interconnect – 705,024 cores, Rmax 10,510,000, Rpeak 11,280,384, 12,659 kW – Lustre, 40 PB, 965 GB/s
5. Mira (DOE/SC/Argonne National Laboratory) – BlueGene/Q, Power BQC 16C 1.60 GHz, custom interconnect – 786,432 cores, Rmax 8,586,612, Rpeak 10,066,330, 3,945 kW – GPFS, 7.6 PB, 88 GB/s
6. Piz Daint (Swiss National Supercomputing Centre, CSCS) – Cray XC30, Xeon E5-2670 8C 2.600 GHz, Aries interconnect, NVIDIA K20x – 115,984 cores, Rmax 6,271,000, Rpeak 7,788,853, 2,325 kW – Lustre, 2.5 PB, 138 GB/s
7. Stampede (TACC / Univ. of Texas) – PowerEdge C8220, Xeon E5-2680 8C 2.7 GHz, IB FDR, Intel Xeon Phi – 462,462 cores, Rmax 5,168,110, Rpeak 8,520,112, 4,510 kW – Lustre, 14 PB, 150 GB/s
8. JUQUEEN (Forschungszentrum Juelich, FZJ) – BlueGene/Q, Power BQC 16C 1.600 GHz, custom interconnect – 458,752 cores, Rmax 5,008,857, Rpeak 5,872,025, 2,301 kW – GPFS, 5.6 PB, 33 GB/s
9. Vulcan (DOE/NNSA/LLNL) – BlueGene/Q, Power BQC 16C 1.600 GHz, custom interconnect – 393,216 cores, Rmax 4,293,306, Rpeak 5,033,165, 1,972 kW – Lustre, 55 PB, 850 GB/s
10. SuperMUC (Leibniz Rechenzentrum) – iDataPlex DX360M4, Xeon E5-2680 8C 2.70 GHz, InfiniBand FDR – 147,456 cores, Rmax 2,897,000, Rpeak 3,185,050, 3,423 kW – GPFS, 10 PB, 200 GB/s

n.b. NCSA Blue Waters: 24 PB, 1,100 GB/s (Lustre 2.1.3)

Lustre Roadmap

Other parallel file systems

• GPFS
  – Running out of steam?? Let me qualify!! (and he then rambles on …)
• Fraunhofer (FhGFS)
  – Excellent metadata performance
  – Many modern features
  – No real HA
• Ceph
  – New, interesting and with a LOT of good features
  – Immature and with a limited track record
• Panasas
  – Still RAID 5 and running out of steam …

Object based storage

• A traditional file system includes a hierarchy of files and directories
  – Accessed via a file system driver in the OS
• Object storage is "flat"; objects are located by direct reference
  – Accessed via custom APIs (Swift, S3, librados, etc.)
• The difference boils down to two questions:
  – How do you find files?
  – Where do you store metadata?
• Object store + metadata + driver is a file system (a toy sketch of this lookup difference follows these slides)

Object Storage Backend: Why?

• It's more flexible. Interfaces can be presented in other ways, without the FS overhead: a generalized storage architecture vs. a file system.
• It's more scalable. POSIX was never intended for clusters, concurrent access, multi-level caching, ILM, usage hints, striping control, etc.
• It's simpler. With the file-system-isms removed, an elegant (= scalable, flexible, reliable) foundation can be laid.

Elephants all the way down …

• Most clustered file systems and object stores are built on local file systems, and inherit their problems
• Native FS
  – XFS, ext4, ZFS, btrfs
• Object store on FS
  – Ceph on btrfs
  – Swift on XFS
• FS on object store on FS
  – CephFS on Ceph on btrfs
  – Lustre on OSS on ext4
  – Lustre on OSS on ZFS
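To make the "how do you find files / where does metadata live" distinction concrete, here is a minimal Python sketch contrasting a POSIX-style lookup with a flat, ID-addressed object store. The ToyObjectStore class and its methods are invented for illustration only and do not correspond to any real product API.

```python
import os
import tempfile
import uuid

# Hypothetical flat object store (illustration only): data is addressed by an
# opaque object ID, and all naming plus metadata live in a separate index
# rather than in the data path itself.
class ToyObjectStore:
    def __init__(self):
        self._objects = {}   # object_id -> bytes  (the "flat" data plane)
        self._index = {}     # name -> (object_id, metadata dict)

    def put(self, name, data, **metadata):
        oid = uuid.uuid4().hex            # direct reference, no directory walk
        self._objects[oid] = data
        self._index[name] = (oid, metadata)
        return oid

    def get(self, name):
        oid, metadata = self._index[name]        # one index lookup ...
        return self._objects[oid], metadata      # ... then a direct fetch by ID


def posix_read(path):
    # POSIX-style access: the OS resolves each path component in turn
    # (a directory walk), and metadata (size, owner, mtime) lives in the inode.
    with open(path, "rb") as fh:
        return fh.read(), os.fstat(fh.fileno())


if __name__ == "__main__":
    store = ToyObjectStore()
    store.put("results/run42.dat", b"checkpoint bytes", owner="tkp", stripe_count=4)
    data, meta = store.get("results/run42.dat")
    print("object store:", len(data), "bytes,", meta)

    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"checkpoint bytes")
    data, stat = posix_read(tmp.name)
    print("posix:", len(data), "bytes,", stat.st_size)
    os.unlink(tmp.name)
```

The point of the sketch is only that an object store plus an external namespace/metadata service plus a client driver adds up to a file system, which is the argument the slides make.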
The way forward …

• Object-store-based solutions offer a lot of flexibility:
  – Next-generation design, for exascale-level size, performance, and robustness
  – Implemented from scratch
• "If we could design the perfect exascale storage system …"
  – Not limited to POSIX
  – Non-blocking availability of data
  – Multi-core aware
  – Non-blocking execution model with thread-per-core
  – Support for non-uniform hardware (flash, non-volatile memory, NUMA)
  – Using abstractions, guided interfaces can be implemented, e.g. for burst buffer management (pre-staging and de-staging)

Interconnects (Disk and Fabrics)

• S-ATA 6 Gbit
• FC-AL 8 Gbit
• SAS 6 Gbit
• SAS 12 Gbit
• PCI-E direct attach
• Ethernet
• Infiniband
• Next gen interconnect …

12 Gbit SAS

• Doubles bandwidth compared to SAS 6 Gbit
• Triples the IOPS !!
• Same connectors and cables
• 4.8 GB/s in each direction, with 9.6 GB/s to follow
  – 2 streams moving to 4 streams
• 24 Gb/s SAS is on the drawing board …

PCI-E direct attach storage

• M.2 solid state storage modules
• Lowest latency / highest bandwidth
• Limitations in the number of PCI-E lanes available
  – Ivy Bridge has up to 40 lanes per chip

PCIe generations (bandwidth per lane per direction; x16 total counts both directions):
• PCIe 1.x: 2.5 GT/s raw, 8b/10b encoding, 2 Gb/s per lane, ~250 MB/s per lane per direction, ~8 GB/s for an x16 link
• PCIe 2.0: 5.0 GT/s raw, 8b/10b encoding, 4 Gb/s per lane, ~500 MB/s per lane per direction, ~16 GB/s for an x16 link
• PCIe 3.0: 8.0 GT/s raw, 128b/130b encoding, 8 Gb/s per lane, ~1 GB/s per lane per direction, ~32 GB/s for an x16 link
• PCIe 4.0: ??

Ethernet – still going strong??

• Ethernet has now been around for 40 years !!!
• Currently around 41% of Top500 systems: 28% 1 GbE, 13% 10 GbE
• 40 GbE shipping in volume
• 100 GbE being demonstrated
  – Volume shipments expected in 2015
• 400 GbE and 1 TbE are on the drawing board
  – 400 GbE planned for 2017

Infiniband …

Next gen interconnects …

• Intel acquired the QLogic InfiniBand team …
• Intel acquired Cray's interconnect technologies …
• Intel has published on, and shown, silicon photonics …
• And this means WHAT ????

Disk drive technologies

Areal density futures

Disk write technology …

Hard drive futures …

• Sealed helium drives (Hitachi)
  – Higher density: 6 platters / 12 heads
  – Less power (~1.6 W idle)
  – Less heat (~4 °C lower temperature)
• SMR drives (Seagate)
  – Denser packaging on current technology
  – Aimed at read-intensive application areas
• Hybrid drives (SSHD)
  – Enterprise edition
  – Transparent SSD/HDD combination (aka Fusion drives)
  – eMLC + SAS

SMR drive deep-dive

Hard drive futures …

• HAMR drives (Seagate)
  – Using a laser to heat the magnetic substrate (iron/platinum alloy)
  – Projected capacity: 30–60 TB per 3.5-inch drive …
  – 2016 timeframe …
• BPM (bit-patterned media recording)
  – Stores one bit per cell, as opposed to regular hard-drive technology, where each bit is stored across a few hundred magnetic grains
  – Projected capacity: 100+ TB per 3.5-inch drive …

What about RAID?

• RAID 5
  – No longer viable
• RAID 6
  – Still OK, but rebuild times are becoming prohibitive (a rough estimate follows this slide)
• RAID 10
  – OK for SSDs and small arrays
• RAID Z/Z2 etc.
  – A choice, but limited functionality on Linux
• Parity-declustered RAID
  – Gaining a foothold everywhere
  – But … PD-RAID ≠ PD-RAID ≠ PD-RAID …
• No RAID???
  – Using multiple distributed copies works, but …
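As a back-of-the-envelope illustration of why RAID 6 rebuild times become prohibitive, here is a small sketch. The drive capacities and sustained rebuild rates are assumed round numbers for illustration, not measured values for any specific array.

```python
# Rough RAID rebuild-time estimate: a rebuild must re-read the surviving
# drives and re-write the whole replacement drive, so at best it is bounded
# by capacity / sustained rebuild rate (real arrays are slower under load).

def rebuild_hours(capacity_tb: float, rebuild_mb_per_s: float) -> float:
    capacity_mb = capacity_tb * 1_000_000      # 1 TB = 10^6 MB (decimal, as drives are sold)
    return capacity_mb / rebuild_mb_per_s / 3600

if __name__ == "__main__":
    for cap in (2, 4, 6, 8):                   # assumed drive sizes in TB
        for rate in (50, 100):                 # assumed sustained rebuild rates in MB/s
            print(f"{cap} TB drive @ {rate} MB/s -> {rebuild_hours(cap, rate):5.1f} h")
```

Even at an optimistic 100 MB/s, an 8 TB drive takes roughly a day to rebuild, during which the array runs degraded and exposed to further failures; this is the arithmetic behind the move toward parity-declustered schemes that rebuild across many drives in parallel.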
Flash (NAND)

• Supposed to "Take over the World" [cf. Pinky and the Brain]
• But for high-performance storage there are issues …
  – Price and density not following the predicted evolution
  – Reliability (even on SLC) not as expected
• Latency issues
  – SLC access ~25 µs, MLC ~50 µs …
  – Larger chips increase contention: once a flash die is accessed, other dies on the same bus must wait (up to 8 flash dies share a bus)
• Address translation, garbage collection and especially wear leveling add significant latency

Flash (NAND)

• MLC: 3–4 bits per cell @ 10K P/E cycles
• SLC: 1 bit per cell @ 100K P/E cycles
• eMLC: 2 bits per cell @ 30K P/E cycles
  (a rough lifetime estimate based on these figures appears at the end of this deck)
• Disk drive formats (S-ATA / SAS bandwidth limitations)
• PCI-E accelerators
• PCI-E direct attach

NV-RAM

• Flash is essentially NV-RAM, but …
• Phase Change Memory (PCM)
  – Significantly faster and more dense than NAND
  – Based on chalcogenide glass: a thermal rather than an electronic process
  – More resistant to external factors
• Currently the expected solution for burst buffers etc. …
  – but there's always Hybrid Memory Cubes ……

Solutions …

• Size does matter …
  – 2014–2016: more than 20 proposals for 40+ PB file systems running at 1–4 TB/s !!!!
• Heterogeneity is the new buzzword
  – Burst buffers, data capacitors, cache off-loaders …
• Mixed workloads are now taken seriously …
• Data integrity is paramount
  – T10-DIF/X is a decent start, but …
• Storage system resiliency is equally important
  – PD-RAID needs to evolve and become system-wide
• Multi-tier storage as standard configs …
• Geographically distributed solutions commonplace

Final thoughts ???

• Object based storage
  – Not an IF but a WHEN …
  – Flavor(s) still TBD: DAOS, Exascale10, XEOS, …
• Data management core to any solution
  – Self-aware data, real-time analytics, resource management ranging from the job scheduler to the disk block …
  – HPC storage = Big Data
• Live data
  – Cache ↔ Near line ↔ Tier 2 ↔ Tape? ↔ Cloud ↔ Ice
• Compute with storage ➜ Storage with compute
  – Storage is no longer a second-class citizen

Thank You
torben_kling_petersen@xyratex.com
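Appendix: a back-of-the-envelope note on the NAND endurance figures quoted on the Flash (NAND) slide (SLC ~100K, eMLC ~30K, MLC ~10K P/E cycles). Total write endurance is roughly capacity × cycles ÷ write amplification; the sketch below is hypothetical, and the capacity, write-amplification and daily-write figures are assumptions for illustration only.

```python
# Back-of-the-envelope NAND lifetime estimate from the endurance figures on
# the Flash (NAND) slide. All inputs other than the P/E cycle counts are
# assumed example values, not vendor specifications.

PE_CYCLES = {"SLC": 100_000, "eMLC": 30_000, "MLC": 10_000}

def lifetime_years(capacity_tb, cell_type, writes_tb_per_day, write_amplification=3.0):
    # Total data the device can absorb before wear-out, assuming even wear leveling
    total_writable_tb = capacity_tb * PE_CYCLES[cell_type] / write_amplification
    return total_writable_tb / writes_tb_per_day / 365

if __name__ == "__main__":
    # Example: a 1.6 TB device absorbing 5 TB of host writes per day
    for cell in ("SLC", "eMLC", "MLC"):
        print(cell, round(lifetime_years(1.6, cell, writes_tb_per_day=5), 1), "years")
```

The ratios, not the absolute numbers, are the point: dropping from SLC to MLC cuts the write budget by an order of magnitude, which is why wear leveling and eMLC parts feature in the enterprise devices mentioned earlier.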