SSD – Applications, Usage Examples
Gordon Summer Institute, August 8-11, 2011
Mahidhar Tatineni
San Diego Supercomputer Center

Overview
• Introduction to flash hardware and its benefits
• Flash usage scenarios
• Examples of applications tested on Dash and Trestles compute nodes and on Dash/Gordon I/O nodes
• Flash access/remote mounts on Dash, Trestles, and Gordon

Gordon Architecture Bridges the Latency Gap
[Figure: application latency (seconds, lower is better) versus data capacity (GB, higher is better) across the storage hierarchy: L1/L2 caches (KB), L3 cache (MB), DDR3 memory (10's of GB) over the QuickPath Interconnect, the flash node filesystem (100's of GB) over the QDR InfiniBand interconnect, and I/O to the traditional HPC filesystem (Data Oasis Lustre 4 PB PFS) backed by 64 I/O nodes with 300 TB of Intel SSD. Flash fills the latency gap between memory and the parallel filesystem.]

Flash Drives are a Good Fit for Data Intensive Computing

                          Flash Drive     Typical HDD    Good for data-intensive apps?
Latency                   < 0.1 ms        10 ms          ✔
Bandwidth (r/w)           250/170 MB/s    100 MB/s       ✔
IOPS (r/w)                35,000/2,000    100            ✔
Power consumption         2-5 W           6-10 W         ✔
MTBF                      1M hours        1M hours       -
Price/GB                  $2/GB           $0.50/GB       -
Endurance                 2-10 PB         N/A            ✔
Total cost of ownership   *** The jury is still out ***

Apart from the differences between HDD and SSD, it is not common to find local storage "close" to the compute. We have found this to be attractive in our Trestles cluster, which has local flash on the compute nodes but is used for traditional HPC applications (not high IOPS).

• Dash uses Intel X25-E SLC drives and Trestles has X25-M MLC drives.
• The performance specs of the Intel flash drives to be deployed in Gordon are similar to those of the X25-M, except that they will have higher endurance.

Flash Usage Scenarios
• Node-local scratch for I/O during a run
  • Very little or no code changes required
  • Ideal if there are several threads doing I/O simultaneously and often
  • Examples: Gaussian, Abaqus, QCHEM
• Caching of a partial or complete dataset in analysis, search, and visualization tasks
• Loading an entire database into flash
  • Use flash via a filesystem
  • Use the raw device [DB2]

Flash as Local Scratch
• Applications that do a lot of local scratch I/O during computations. Examples: Gaussian, Abaqus, QCHEM
• Using flash is very straightforward. For example, on Trestles where local SSDs are available (see the job script sketch below):
  • Gaussian: GAUSS_SCRDIR=/scratch/$USER/$PBS_JOBID
  • Abaqus: scratch=/scratch/$USER/$PBS_JOBID
• When many cores (up to 32 on Trestles) are doing I/O and reading/writing constantly, the SSDs can make a significant difference.
• Parallel filesystems are not ideal for this kind of I/O.
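As an illustration of the node-local scratch scenario, here is a minimal sketch of a PBS job that points Gaussian's scratch I/O at the node-local SSD, using the GAUSS_SCRDIR setting shown above. The job name, queue, module name, and input/output file names are placeholders and will vary by system; this is not the official Trestles script.

#!/bin/bash
#PBS -N gaussian_ssd
#PBS -l nodes=1:ppn=32
#PBS -l walltime=01:00:00
#PBS -q normal
#PBS -V

# Site-specific: load the Gaussian environment (module name is a placeholder)
module load gaussian

# Point Gaussian's scratch directory at the node-local SSD
export GAUSS_SCRDIR=/scratch/$USER/$PBS_JOBID
mkdir -p $GAUSS_SCRDIR    # in case the batch prologue has not already created it

cd $PBS_O_WORKDIR
g09 < input.com > output.log    # hypothetical input/output file names

# Remove scratch files from the SSD when the run completes
rm -rf $GAUSS_SCRDIR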
Flash as local scratch space provides a 1.5x-1.8x speedup over local disk for Abaqus
• Standard Abaqus test cases (S2A1, S4B) were run on Dash with 8 cores to compare performance between local hard disk and SSDs.
• Benchmark performance was as follows:

Benchmark   Local disk   SSDs
S4B         2536 s       1748 s
S2A1        811 s        450 s

Reverse-Time-Migration Application
• Acoustic imaging application
  • Used to create images of sub-surface structures
  • Oil and gas companies use RTM to plan drilling investments
  • This is computational research sponsored by a commercial user
• Correlation between source data and recorded data
  • Forward-propagated seismic waves
  • Backward-propagated seismic waves
  • Correlation between the seismic waves illuminates reflection/diffraction points
• Temporary storage requirements
  • Snapshots stored for correlation
  • Example: 400^3 maximum grid points, 20,000 msec, ~60 GB of temporary storage used
[Figure: computation-I/O profile of the 400x20000 case on HDD, dividing the run time among computation, reads, and writes; the three slices are 54%, 26%, and 20%.]

Reverse-Time-Migration on Flash*
• Storage comparison on batch nodes
  • Spinning disk (HDD), flash drive (flash), and parallel file system (GPFS)
  • The local flash drive outperforms the other storage options
  • Average 7.2x I/O speedup vs. HDD
  • Average 3.9x I/O speedup vs. GPFS
• I/O-node RAID'd flash
  • Comparison of RAID'd configurations: 16 Intel drives and 4 Fusion-io cards
  • RAID'd flash achieves a 2.2x speedup compared to a single drive
[Figure: I/O time in seconds for HDD, GPFS, and flash on the 400x20000, 800x2480, 1200x720, and 1600x304 test cases.]
* Done by Pietro Cicotti, SDSC

Local SSD to Cache Partial/Full Dataset
• Load a partial or full dataset into flash.
• Typically needs application modification to write data into flash and do all subsequent reads from flash.
• Example: the Munagala-Ranade Breadth First Search (MR-BFS) code (see the sketch below):
  • The generation phase puts the data in flash.
  • Multiple MR-BFS runs read and process the data.
  • Multiple threads reading benefit from the low latency of SSDs.
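A minimal sketch of this stage-to-flash pattern as it might appear in a batch script: the dataset is generated (or copied) onto the flash scratch space once, and the subsequent analysis runs all read from that cached copy. The executable names (gen_graph, mr_bfs) and their options are hypothetical stand-ins for the MR-BFS workflow, not the actual MR-BFS command line.

#!/bin/bash
# Flash scratch location, following the /scratch/$USER/$PBS_JOBID convention used above
FLASH=/scratch/$USER/$PBS_JOBID

# Generation phase: write the dataset to flash once
./gen_graph --nodes 134217726 --out $FLASH/graph.dat     # hypothetical generator

# Analysis phase: several runs read and process the cached copy from flash
for src in 0 1 2 3; do
    ./mr_bfs --input $FLASH/graph.dat --source $src > bfs_src${src}.log &   # hypothetical BFS driver
done
wait    # wait for all concurrent runs to finish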
Flash case study – Breadth First Search*
• Implementation of the breadth-first search (BFS) graph algorithm developed by Munagala and Ranade.
• Benchmark problem: BFS on a graph containing 134 million (134,217,726) nodes.
• Use of flash drives reduced I/O time by a factor of 6.5x. As expected, there was no measurable impact on non-I/O operations.
• The problem is converted from I/O bound to compute bound.
[Figure: MR-BFS serial performance; I/O time and non-I/O time in seconds on SSDs vs. HDDs.]
* Done by Sandeep Gupta, SDSC

Flash for caching: Case study – Parallel Streamline Visualization
Camp et al., accepted to the IEEE Symposium on Large-Scale Data Analysis and Visualization (LDAV 2011).

Databases on Flash
• Database performance benefits from the low-latency I/O of flash.
• Two options for setting up a database:
  • Load the database on a flash-based filesystem; already tested on Dash I/O nodes.
  • DB2 with direct native access to flash memory (coming soon!).

LIDAR Data Lifecycle
[Figure: LIDAR data lifecycle, from waveform data (D. Harding, NASA) to a point cloud dataset, a bare-earth DEM, and a full-featured DEM portal. OpenTopography is a "cloud" for topography data and tools.]

LIDAR benchmarking* and experiments on a Dash I/O node
• Experiments with LIDAR point cloud data, with data sizes ranging from 1 GB to 1 TB, using DB2.
• Experiments to be performed include:
  • Load times: time to load each dataset (single user).
  • Selection times: for selecting 6%, 12%, 50% of the data (single user).
  • Processing times: for DEM generation on the selected data.
  • Multiuser: for a fixed dataset size (either 100 GB or 1 TB), run selections and processing for multiple concurrent users, e.g. 2, 4, 8, 16 concurrent users.
  • Logical nodes testing: for a fixed dataset size (100 GB or 1 TB), DB2 has the option of creating multiple "logical nodes" on a given system ("physical node"). Test the optimal number of logical nodes on an SSD node.
* Chaitan Baru's group at SDSC.

Flash case study – LIDAR
• Remote sensing technology used to map geographic features with high resolution.
• Benchmark problem: load 100 GB of data into a single table, then count the rows, on a DB2 database instance.
• Flash drives were 1.5x (load) to 2.4x (count) faster than hard disks.
[Figure: run times in seconds on SSDs vs. HDDs for the 100 GB load, 100 GB load with FastParse, 100 GB count(*) cold, and 100 GB count(*) warm cases.]

Flash case study – LIDAR (concurrent queries)
• Comparison of run times for concurrent LIDAR queries obtained with flash drives (SSD) and hard drives (HDD), using the Alaska Denali-Totschunda data collection.
• The impact of SSDs was modest, but significant when executing multiple simultaneous queries.
[Figure: run times in seconds on SSDs vs. HDDs for 1, 4, and 8 concurrent queries.]

PDB – protein interaction query
• The first step in the analysis involves reducing a 150-million-row database table to one million rows. Use of flash drives reduced the query time to 3 minutes, a 10x speedup over hard disk.
• Dash I/O node configuration:
  • Four 320 GB Fusion-io drives configured as a 1.2 TB RAID 0 device running an XFS file system.
  • Two quad-core Intel Xeon E5530 2.40 GHz processors and 48 GB of DDR3-1066 memory.

Accessing Flash/SSDs on Dash, Trestles
• Dash – batch*: 16 nodes; 2 quad-core Intel Nehalem processors (8 cores/node); 48 GB/node. SSD: 64 GB, node local. HDD: yes. PFS: GPFS/Data Oasis. Network: IB-DDR.
• Dash – vSMP*: 16 nodes; 2 quad-core Intel Nehalem processors (8 cores/node); 48 GB/node; memory aggregated to 768 GB via vSMP. SSD: 1 TB (64 GB x 16, aggregated). HDD: N/A. PFS: GPFS. Network: IB-DDR.
• Dash I/O node: 4 nodes; 2 quad-core Intel Nehalem processors (8 cores/node); 48 GB/node; large SSD: 1 TB (64 GB x 16) per node. HDD: N/A. PFS: N/A. Network: N/A.
• Trestles*: 324 nodes; 4 eight-core AMD Magny-Cours processors per node (32 cores/node); 64 GB/node. SSD: 120 GB, node-local drives. HDD: N/A. PFS: Data Oasis. Network: IB-QDR.

Sample Script on Dash

#!/bin/bash
#PBS -N PBStest
#PBS -l nodes=1:ppn=8
#PBS -l walltime=01:00:00
#PBS -o test-normal.out
#PBS -e test-normal.err
#PBS -m e
#PBS -M mahidhar@sdsc.edu
#PBS -V
#PBS -q batch
cd /scratch/mahidhar/$PBS_JOBID
cp -r /home/mahidhar/COPYBK/input /scratch/mahidhar/$PBS_JOBID
mpirun_rsh -hostfile $PBS_NODEFILE -np 8 test.exe
cp out.txt /home/mahidhar/COPYBK/

Dash Prototype vs. Gordon

                              Dash                        Gordon
Number of compute nodes       64                          1,024
Number of I/O nodes           4                           64
Compute node processors       Intel Nehalem               Intel Sandy Bridge
Compute node memory           48 GB                       64 GB
I/O node flash                Intel X25-E SLC             Intel eMLC
Flash capacity per I/O node   1 TB                        4.8 TB
vSMP supernode size           16 nodes / 768 GB           32 nodes / 2 TB
InfiniBand network            Single rail, fat tree, DDR  Dual rail, 3D torus, QDR
Resource management           Torque                      SLURM

When considering benchmark results and scalability, keep in mind that nearly every major feature of Gordon will be an improvement over Dash.

Accessing Flash on Gordon
• The majority of the flash disk will be in the 64 Gordon I/O nodes. Each I/O node will have ~4.8 TB of flash.
• Flash from the I/O nodes will be made available to non-vSMP compute nodes via the InfiniBand network and iSER implementations. Two options will be available:
  • An XFS filesystem mounted locally on each node (see the sketch below).
  • The Oracle Cluster File System (OCFS).
• vSMP software will aggregate the flash from the I/O node(s) included in a vSMP node. The aggregated flash filesystem will be available as local scratch on that node.
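For illustration only, here is a rough sketch of what the XFS-over-iSER option looks like from the compute-node side, using the standard open-iscsi tools: discover the I/O node's target, log in over the fabric, then format and mount the resulting block device as XFS. The target name, portal address, device path, and mount point are placeholders, the iSER transport is assumed to be configured already, and on Gordon these steps will be handled by system scripts rather than by users.

# Discover the iSCSI/iSER targets exported by an I/O node (portal address is a placeholder)
iscsiadm -m discovery -t sendtargets -p 192.168.1.10

# Log in to the flash target (target name is a placeholder)
iscsiadm -m node -T iqn.2011-08.edu.sdsc:ion1.flash -p 192.168.1.10 --login

# The LUN now appears as a local block device (e.g. /dev/sdb); format it and mount it as XFS
mkfs.xfs /dev/sdb
mkdir -p /flash_scratch
mount -t xfs /dev/sdb /flash_scratch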
Flash performance needs to be freed from the I/O nodes
[Figure: the application runs on the compute nodes while the flash sits in the I/O nodes, so flash performance has to be exported across the interconnect.]

Alphabet Soup of networking protocols and file systems
• SRP – SCSI over RDMA
• iSER – iSCSI over RDMA
• NFS over RDMA
• NFS/IP over IB
• XFS – via iSER devices
• Lustre
• OCFS – via iSER devices
• PVFS
• OrangeFS
• Others…
In our effort to maximize flash performance we have tested most of these. BTW: very few people are doing this!

Exporting Flash Performance using iSER: Sequential
[Figure: sequential bandwidth in MB/s for the TGTD and TGTD+ iSER implementations on multi-threaded (mt) and embarrassingly parallel (ep) sequential read and write tests.]

Exporting Flash Performance using iSER: Random
[Figure: random-access IOPS for the TGTD and TGTD+ iSER implementations on multi-threaded (mt) and embarrassingly parallel (ep) random read and write tests.]

Flash performance – parallel file system
• Performance of Intel Postville Refresh SSDs (16 drives in RAID 0) with OCFS (Oracle Cluster File System).
• I/O done simultaneously from 1, 2, or 4 compute nodes.
• MT = multi-threaded, EP = embarrassingly parallel.
[Figure: OCFS sequential bandwidth in MB/s and random-access IOPS for MT and EP reads and writes from 1, 2, and 4 nodes.]

Flash performance – serial file system
• Performance of Intel Postville Refresh SSDs (16 drives in RAID 0) with XFS.
• I/O done simultaneously from 1, 2, or 4 compute nodes.
• MT = multi-threaded, EP = embarrassingly parallel.
[Figure: XFS sequential bandwidth in MB/s and random-access IOPS for MT and EP reads and writes from 1, 2, and 4 nodes.]

Summary
• The early hardware has allowed us to test applications, protocols, and file systems.
• I/O profiling tools and runs covering the different application flash usage scenarios have helped optimize application I/O performance.
• Performance test results point to iSER, OCFS, and XFS as the right solutions for exporting flash.
• Further work is required to integrate this into user documentation, system scripts, and the SLURM resource manager.

Discussion
• Attendee I/O access patterns/methods.

Thank you!
For more information:
http://gordon.sdsc.edu
gordoninfo@sdsc.edu
Mahidhar Tatineni, mahidhar@sdsc.edu