NCAR’s Data-Centric Supercomputing Environment: Yellowstone
December 21, 2011
Anke Kamrath, OSD Director/CISL
anke@ucar.edu

Overview
• Strategy
  – Moving from process-centric to data-centric computing
• HPC/data architecture
  – What we have today at the Mesa Lab (ML)
  – What’s planned for NWSC
• Storage/data/networking
  – Data in flight
  – WAN and high-performance LAN
  – Central filesystem plans
  – Archival plans

Evolving the Scientific Workflow
• Common data-movement issues
  – Moving data between systems is time consuming
  – Bandwidth to the archive system is insufficient
  – Disk space is insufficient
• Need to evolve data-management techniques
  – Workflow management systems
  – Standardized metadata
  – Reduced/eliminated duplication of datasets (e.g., CMIP5)
• User education
  – Effective methods for understanding workflow
  – Effective methods for streamlining workflow

Traditional Workflow
[Diagram: process-centric data model.]

Evolving Scientific Workflow
[Diagram: information-centric data model.]

Current Resources @ Mesa Lab
[Diagram: GLADE, Bluefire, and Lynx.]

NWSC: Yellowstone Environment
[Diagram of the NWSC environment:]
• Yellowstone: HPC resource, 1.55 PFLOPS peak
• GLADE: central disk resource, 11 PB (2012), 16.4 PB (2014)
• Geyser & Caldera: DAV clusters
• High-bandwidth, low-latency HPC and I/O networks: FDR InfiniBand and 10Gb Ethernet
• NCAR HPSS Archive: 100 PB capacity, ~15 PB/yr growth
• 1Gb/10Gb Ethernet (40Gb+ future) to science gateways (RDA, ESG), data transfer services, remote visualization, partner sites, and XSEDE sites

NWSC-1 Resources in a Nutshell
• Centralized filesystems and data storage resource (GLADE)
  – >90 GB/s aggregate I/O bandwidth, GPFS filesystems
  – 10.9 PB initially, growing to 16.4 PB in 1Q2014
• High-performance computing resource (Yellowstone)
  – IBM iDataPlex cluster with Intel Sandy Bridge EP† processors with Advanced Vector Extensions (AVX)
  – 1.552 PFLOPS peak; 29.8 Bluefire equivalents
  – 4,662 nodes; 74,592 cores; 149.2 TB total memory
  – Mellanox FDR InfiniBand full fat-tree interconnect
• Data analysis and visualization resource (Geyser & Caldera)
  – Large-memory system with Intel Westmere EX processors: 16 nodes, 640 cores, 16 TB memory, 16 NVIDIA Kepler GPUs
  – GPU-computation/visualization system with Intel Sandy Bridge EP (AVX) processors: 16 nodes, 128 SB cores, 1 TB memory, 32 NVIDIA Kepler GPUs
  – Knights Corner system with Intel Sandy Bridge EP (AVX) processors: 16 nodes, 128 SB cores, 992 KC cores, 1 TB memory (November 2012 delivery)
† “Sandy Bridge EP” is the Intel® Xeon® E5-2670.

GLADE
• 10.94 PB usable capacity, growing to 16.42 PB usable (1Q2014)
• Estimated initial filesystem sizes
  – collections ≈ 2 PB: RDA, CMIP5 data
  – scratch ≈ 5 PB: shared, temporary space
  – projects ≈ 3 PB: long-term, allocated space
  – users ≈ 1 PB: medium-term work space
• Disk storage subsystem
  – 76 IBM DCS3700 controllers & expansion drawers
  – 90 2-TB NL-SAS drives per controller; 30 3-TB NL-SAS drives added per controller in 1Q2014
• GPFS NSD servers
  – 91.8 GB/s aggregate I/O bandwidth; 19 IBM x3650 M4 nodes
• I/O aggregator servers (GPFS, GLADE-HPSS connectivity)
  – 10-GbE & FDR interfaces; 4 IBM x3650 M4 nodes
• High-performance I/O interconnect to HPC & DAV
  – Mellanox FDR InfiniBand full fat-tree
  – 13.6 GB/s bidirectional bandwidth per node

NCAR Disk Storage Capacity Profile
[Chart: total usable centralized filesystem storage (PB), Jan-2010 through Jan-2016, for Bluefire-attached storage, GLADE at the Mesa Lab, and GLADE at NWSC.]

Yellowstone: NWSC High-Performance Computing Resource
• Batch computation
  – 4,662 IBM dx360 M4 nodes; 16 cores and 32 GB memory per node
  – Intel Sandy Bridge EP processors with AVX; 2.6 GHz clock
  – 74,592 cores total; 1.552 PFLOPS peak (a quick check of this figure follows below)
  – 149.2 TB total DDR3-1600 memory
  – 29.8 Bluefire equivalents
• High-performance interconnect
  – Mellanox FDR InfiniBand full fat-tree
  – 13.6 GB/s bidirectional bandwidth per node
  – <2.5 µs latency (worst case)
  – 31.7 TB/s bisection bandwidth
• Login/interactive
  – 6 IBM x3650 M4 nodes; Intel Sandy Bridge EP processors with AVX
  – 16 cores & 128 GB memory per node
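The peak-performance figure flagged in the batch-computation bullets can be checked directly from the node count, clock rate, and vector width. A minimal sketch of the arithmetic, assuming 8 double-precision floating-point operations per core per cycle (the usual figure for Sandy Bridge EP with AVX):

```python
# Back-of-the-envelope check of Yellowstone's quoted peak performance.
nodes = 4_662
cores_per_node = 16
clock_hz = 2.6e9
flops_per_cycle = 8          # Sandy Bridge EP with AVX, double precision (assumed)
gb_per_node = 32

cores = nodes * cores_per_node                            # 74,592 cores
peak_pflops = cores * clock_hz * flops_per_cycle / 1e15   # ~1.552 PFLOPS
total_memory_tb = nodes * gb_per_node / 1000              # ~149.2 TB (decimal units)

print(f"{cores:,} cores, {peak_pflops:.3f} PFLOPS peak, {total_memory_tb:.1f} TB memory")
# -> 74,592 cores, 1.552 PFLOPS peak, 149.2 TB memory
```

Running it reproduces the core count, peak rate, and memory total quoted in the bullets above.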
NCAR HPC Profile
[Chart: peak PFLOPS at NCAR, Jan-2004 through Jan-2016: bluesky (IBM POWER4/Colony), bluevista (IBM p575/8 POWER5/HPS), frost (IBM BlueGene/L, later upgraded), blueice (IBM p575/16 POWER5+/HPS), bluefire (IBM Power 575/32 POWER6/DDR-IB), lynx (Cray XT5m), and yellowstone (IBM iDataPlex/FDR-IB) at roughly 30x Bluefire performance.]

Geyser and Caldera: NWSC Data Analysis & Visualization Resource
• Geyser: large-memory system
  – 16 IBM x3850 nodes with Intel Westmere-EX processors
  – 40 cores, 1 TB memory, and 1 NVIDIA Kepler Q13H-3 GPU per node
  – Mellanox FDR full fat-tree interconnect
• Caldera: GPU computation/visualization system
  – 16 IBM dx360 M4 nodes with Intel Sandy Bridge EP (AVX) processors
  – 16 cores, 64 GB memory, and 2 NVIDIA Kepler Q13H-3 GPUs per node
  – Mellanox FDR full fat-tree interconnect
• Knights Corner system (November 2012 delivery)
  – Intel Many Integrated Core (MIC) architecture
  – 16 IBM Knights Corner nodes
  – 16 Sandy Bridge EP (AVX) cores, 64 GB memory, and 1 Knights Corner adapter per node
  – Mellanox FDR full fat-tree interconnect

Erebus: Antarctic Mesoscale Prediction System (AMPS)
• IBM iDataPlex compute cluster
  – 84 IBM dx360 M4 nodes; 16 cores and 32 GB memory per node
  – Intel Sandy Bridge EP processors; 2.6 GHz clock
  – 1,344 cores total; 28 TFLOPS peak
  – Mellanox FDR InfiniBand full fat-tree
  – 0.54 Bluefire equivalents
• Login nodes
  – 2 IBM x3650 M4 nodes
  – 16 cores & 128 GB memory per node
• Dedicated GPFS filesystem
  – 57.6 TB usable disk storage
  – 9.6 GB/s aggregate I/O bandwidth
[Map of Antarctica.] Erebus, on Ross Island, is Antarctica’s most famous volcanic peak and one of the largest volcanoes in the world: within the top 20 in total size and reaching a height of 12,450 feet.

Yellowstone Software
• Compilers, libraries, debuggers & performance tools
  – Intel Cluster Studio (Fortran, C++, performance & MPI libraries, trace collector & analyzer): 50 concurrent users
  – Intel VTune Amplifier XE performance optimizer: 2 concurrent users
  – PGI CDK (Fortran, C, C++, pgdbg debugger, pgprof): 50 concurrent users
  – PGI CDK GPU version (Fortran, C, C++, pgdbg debugger, pgprof): DAV systems only, 2 concurrent users
  – PathScale EKOPath (Fortran, C, C++, PathDB debugger): 20 concurrent users
  – Rogue Wave TotalView debugger: 8,192 floating tokens
  – IBM Parallel Environment (POE), including the IBM HPC Toolkit
• System software
  – LSF-HPC batch subsystem / resource manager (IBM has purchased Platform Computing, Inc., developers of LSF-HPC; see the submission sketch below)
  – Red Hat Enterprise Linux (RHEL) Version 6
  – IBM General Parallel File System (GPFS)
  – Mellanox Unified Fabric Manager
  – IBM xCAT cluster administration toolkit
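Since LSF-HPC is named above as the batch subsystem and IBM PE (POE) as the parallel environment, here is a minimal sketch of what submitting an MPI job under that stack might look like. The queue name, project code, executable, and the mpirun.lsf launcher are illustrative assumptions rather than the final Yellowstone configuration; the #BSUB directives themselves are standard LSF options.

```python
# Minimal sketch: compose an LSF job script and hand it to bsub on stdin,
# equivalent to "bsub < job.lsf" on the command line. Queue, project code,
# executable, and the launcher are illustrative assumptions.
import subprocess

job_script = """#!/bin/bash
#BSUB -J wrf_test
#BSUB -P P12345678
#BSUB -q regular
#BSUB -n 64
#BSUB -R "span[ptile=16]"
#BSUB -W 01:00
#BSUB -o wrf_test.%J.out
#BSUB -e wrf_test.%J.err

# 64 MPI tasks, 16 per node to match the 16-core dx360 M4 nodes (4 nodes).
# mpirun.lsf is the LSF-integrated launcher assumed to ship with IBM PE.
mpirun.lsf ./wrf.exe
"""

# bsub reads the job script from stdin when no command is given.
result = subprocess.run(["bsub"], input=job_script, text=True,
                        capture_output=True, check=True)
print(result.stdout.strip())   # e.g. "Job <12345> is submitted to queue <regular>."
```

Here -n requests the total MPI task count, span[ptile=16] packs 16 tasks per node, -W sets the wall-clock limit, and -P charges the (hypothetical) project code.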
NCAR HPSS Archive Resource
• NWSC
  – Two SL8500 robotic libraries (20,000-cartridge capacity)
  – 26 T10000C tape drives (240 MB/s I/O rate each) and T10000C media (5 TB/cartridge, uncompressed) initially; 20 more T10000C drives around November 2012
  – >100 PB capacity
  – Current growth rate: ~3.8 PB/year
  – Anticipated NWSC growth rate: ~15 PB/year (see the sizing sketch below)
• Mesa Lab
  – Two SL8500 robotic libraries (15,000-cartridge capacity)
  – Existing data (14.5 PB): 1st & 2nd copies will be “oozed” to new media at NWSC, beginning in 2012
  – New data at the Mesa Lab: disaster-recovery data only
  – T10000B drives & media to be retired
  – No plans to move the Mesa Lab SL8500 libraries (more costly to move than to buy new under the AMSTAR subcontract)
• Plan to release an “AMSTAR-2” RFP in 1Q2013, with first equipment delivery targeted for 1Q2014, to further augment the NCAR HPSS Archive
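The library and growth figures above imply the capacity headroom directly. A minimal sketch of the arithmetic, assuming uncompressed T10000C media and counting only a single copy of each file (disaster-recovery copies at the Mesa Lab are ignored):

```python
# Rough sizing of the NWSC tape archive from the figures on this slide.
slots = 20_000              # two SL8500 libraries, 20k cartridge capacity
tb_per_cartridge = 5        # T10000C, uncompressed
growth_pb_per_year = 15     # anticipated NWSC growth rate

capacity_pb = slots * tb_per_cartridge / 1000                        # ~100 PB
cartridges_per_year = growth_pb_per_year * 1000 / tb_per_cartridge   # ~3,000/yr
years_to_fill = capacity_pb / growth_pb_per_year                     # ~6.7 years

print(f"~{capacity_pb:.0f} PB of slots across the two libraries")
print(f"~{cartridges_per_year:.0f} new cartridges per year at ~15 PB/yr")
print(f"~{years_to_fill:.1f} years of growth before the slots are full")
```

At the anticipated ~15 PB/year, the 100 PB of slots represents roughly six to seven years of growth for a single copy of the data.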
Yellowstone Physical Infrastructure
• Yellowstone: ~1.9 MW
  – 65 iDataPlex racks (72 nodes per rack)
  – 9 19” racks (9 Mellanox FDR core switches)
  – 1 19” rack (login, service, and management nodes)
• GLADE: 0.134 MW
  – 20 NSD server, controller, and storage racks
  – 1 19” rack (I/O aggregator nodes, management, IB & Ethernet switches)
• Geyser & Caldera: 0.056 MW
  – 1 iDataPlex rack (GPU-computation & Knights Corner)
  – 2 19” racks (large memory, management, IB switch)
• Erebus (AMPS): 0.087 MW
  – 1 iDataPlex rack
  – 1 19” rack (login, IB, NSD, disk & management nodes)
• Total power required: ~2.13 MW

Yellowstone Allocations (% of resource)
NCAR’s 29% represents 170 million core-hours per year on Yellowstone alone (compared with fewer than 10 million per year on Bluefire), plus a similar fraction of the DAV and GLADE resources.

Yellowstone Schedule
[Gantt chart, late October 2011 through mid-October 2012: Infrastructure Preparation; Storage & InfiniBand Delivery & Installation; Test Systems Delivery & Installation; Production Systems Delivery; Integration & Checkout; Acceptance Testing; ASD & early users; Production Science.]

Data in Flight to NWSC
• NWSC networking
• Central filesystem
  – Migrating from GLADE-ML to GLADE-NWSC
• Archive
  – Migrating HPSS data from ML to NWSC

BiSON to NWSC
• Initially three 10G circuits active
  – Two 10G connections back to the Mesa Lab for internal traffic
  – One 10G direct to the FRGP for general Internet2 / NLR / Research and Education traffic
• Options for dedicated 10G connections for high-performance computing to other BiSON members
• System is engineered for 40 individual lambdas
  – Each lambda can be a 10G, 40G, or 100G connection
  – Independent lambdas can be sent in each direction around the ring (two ADVA shelves at NWSC, one for each direction)
• With a major upgrade the system could support 80 lambdas: 100 Gbps × 80 channels × 2 paths = 16 Tbps

High-Performance LAN
• Data Center Networking (DCN)
  – High-speed data center networking with 1G and 10G client-facing ports for the supercomputer, mass storage, and other data center components
  – Redundant design (e.g., multiple chassis and separate module connections)
  – Future option for 100G interfaces
• Juniper awarded the RFP after NSF approval
  – Access switches: QFX3500 series
  – Network core and WAN edge: EX8200 series switch/routers
  – Includes spares for lab/testing purposes
• Juniper training for NETS staff in early January 2012
• Deploy Juniper equipment in late January 2012

Moving Data… Ugh

Migrating GLADE Data
• Temporary work spaces (/glade/scratch, /glade/user)
  – No data will automatically be moved to NWSC
• Allocated project spaces (/glade/projxx)
  – New allocations will be made for NWSC
  – No data will automatically be moved to NWSC
  – A data-transfer option will let users move the data they need
• Data collections (/glade/data01, /glade/data02)
  – CISL will move the data from ML to NWSC
  – Full production capability must be maintained during the transition
• Network impact
  – Current storage maximum performance is 5 GB/s
  – Can sustain ~2 GB/s for reads while under a production load
  – 400 TB can be moved in a couple of days, but the transfer will saturate a 20 Gb/s network link (see the back-of-the-envelope sketch at the end of this deck)

Migrating Archive
• What migrates?
  – Data: 15 PB and counting…
  – Format: MSS to HPSS
  – Tape: T10000B (1 TB cartridges) to T10000C (5 TB cartridges)
  – Location: ML to NWSC
• HPSS at NWSC to become the primary site in spring 2012
  – One-day outage when the metadata servers are moved
  – ML HPSS will remain as the disaster-recovery site
  – Data migration will take until early 2014
  – Migration will be throttled to avoid overloading the network

WHEW… A LOT OF WORK AHEAD!
QUESTIONS?
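Referenced from the migration slides above, here is a minimal back-of-the-envelope sketch of the two data moves. The GLADE figures come straight from the Network Impact notes; the ~22-month window for the archive migration (NWSC becomes primary in spring 2012, migration complete in early 2014) is an assumption read off the slide, and actual rates will vary with throttling and production load.

```python
# Back-of-the-envelope estimates for the GLADE and HPSS data moves.

# GLADE: moving ~400 TB of collections at the ~2 GB/s sustainable read rate.
glade_tb = 400
read_gb_s = 2.0                                      # sustainable under production load
glade_days = glade_tb * 1000 / read_gb_s / 86_400    # ~2.3 days
link_utilization = read_gb_s * 8 / 20                # 16 Gb/s of a 20 Gb/s link

# Archive: ~15 PB migrated over roughly 22 months (spring 2012 to early 2014).
archive_pb = 15
months = 22                                          # assumed window
avg_mb_s = archive_pb * 1e9 / (months * 30 * 86_400) # ~260 MB/s average

print(f"GLADE collections: ~{glade_days:.1f} days at {read_gb_s} GB/s "
      f"({link_utilization:.0%} of a 20 Gb/s link)")
print(f"HPSS migration: ~{avg_mb_s:.0f} MB/s average sustained rate")
```

The output matches the slides' qualitative claims: roughly two and a half days for the 400 TB collections move while occupying most of a 20 Gb/s link, and an archive migration that must average a couple of gigabits per second for about two years.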